Introduction
Deploying machine learning models to production is a complex challenge that goes far beyond training a model. When working with large models from Hugging Face, whether for image generation, text-to-image synthesis, or other AI tasks, you need robust infrastructure that handles:
- Scalability: Auto-scaling to handle variable loads from 0 to thousands of concurrent requests
- Cost Efficiency: Paying only for what you use while maintaining performance
- Reliability: 99.9%+ uptime with proper error handling and monitoring
- Security: Protecting models, data, and API endpoints
- Observability: Comprehensive logging, metrics, and tracing
This comprehensive guide demonstrates how to deploy a Hugging Face model to AWS using infrastructure as code (CDK with TypeScript), combining SageMaker for model hosting and Lambda for API orchestration.
Core Philosophy: “Production ML deployment isn’t about running inference; it’s about building a reliable, scalable, cost-effective system that serves predictions while handling failures gracefully.”
What We’ll Build
We’ll deploy a complete ML inference system with:
- Hugging Face Model on SageMaker for scalable inference
- Lambda Functions for API endpoints and orchestration
- API Gateway for RESTful API access
- S3 for model artifacts and output storage
- CloudWatch for monitoring and logging
- IAM for fine-grained security controls
- VPC configuration for network isolation
System Architecture
High-Level Architecture
graph TB
Client[Client Application] --> APIG[API Gateway]
APIG --> Lambda1[Lambda: API Handler]
Lambda1 --> SQS[SQS Queue - Optional]
Lambda1 --> SageMaker[SageMaker Endpoint]
SageMaker --> Model[Hugging Face Model]
Lambda1 --> S3[S3: Results Storage]
Lambda1 --> DynamoDB[DynamoDB: Metadata]
CloudWatch[CloudWatch Logs & Metrics]
SageMaker -.-> CloudWatch
Lambda1 -.-> CloudWatch
CDK[CDK Stack TypeScript] -.->|Deploys| APIG
CDK -.->|Deploys| Lambda1
CDK -.->|Deploys| SageMaker
CDK -.->|Deploys| S3
style Client fill:#ff6b6b
style SageMaker fill:#4ecdc4
style Lambda1 fill:#feca57
style CDK fill:#95e1d3
style Model fill:#a29bfe
Request Flow
sequenceDiagram
participant Client
participant API Gateway
participant Lambda
participant SageMaker
participant S3
participant DynamoDB
Client->>API Gateway: POST /predict
API Gateway->>Lambda: Invoke with payload
Lambda->>DynamoDB: Store request metadata
Lambda->>SageMaker: InvokeEndpoint
SageMaker->>SageMaker: Run inference
SageMaker-->>Lambda: Return prediction
Lambda->>S3: Store result (if large)
Lambda->>DynamoDB: Update status
Lambda-->>API Gateway: Return response
API Gateway-->>Client: JSON response
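The sequence above can be sketched without any AWS dependency. This is only an illustration of the control flow: `InMemoryInferenceFlow`, its maps, and the `runInference` callback are hypothetical stand-ins for DynamoDB, S3, and the SageMaker endpoint, and the size limit is a plain character count.

```typescript
type JobStatus = 'processing' | 'completed' | 'failed';

interface JobRecord {
  jobId: string;
  status: JobStatus;
  prompt: string;
  resultKey?: string; // set only when the result was written to object storage
}

interface FlowResult {
  job: JobRecord;
  inlineResult?: string; // returned directly when the result is small
}

// Hypothetical in-memory stand-ins for DynamoDB (jobs) and S3 (objects).
class InMemoryInferenceFlow {
  readonly jobs = new Map<string, JobRecord>();
  readonly objects = new Map<string, string>();
  private readonly runInference: (prompt: string) => string;
  private readonly inlineLimit: number;

  constructor(runInference: (prompt: string) => string, inlineLimit: number = 100000) {
    this.runInference = runInference;
    this.inlineLimit = inlineLimit;
  }

  predict(jobId: string, prompt: string): FlowResult {
    // 1. Store request metadata before calling the model
    const job: JobRecord = { jobId, status: 'processing', prompt };
    this.jobs.set(jobId, job);

    try {
      // 2. Invoke the model
      const output = this.runInference(prompt);

      // 3. Large results go to object storage, small ones are returned inline
      if (output.length > this.inlineLimit) {
        const key = `results/${jobId}.png`;
        this.objects.set(key, output);
        job.resultKey = key;
        job.status = 'completed';
        return { job };
      }

      job.status = 'completed';
      return { job, inlineResult: output };
    } catch {
      // 4. Record the failure so the job never stays in "processing"
      job.status = 'failed';
      return { job };
    }
  }
}
```

Recording the `failed` state in step 4 is a design choice of this sketch: without it, a status lookup for a crashed job would keep reporting `processing`.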
Architecture Decisions
| Decision | Choice | Reasoning |
|---|---|---|
| Model Hosting | SageMaker | Auto-scaling, managed infrastructure, optimized for ML |
| API Layer | Lambda + API Gateway | Serverless, cost-effective, scales automatically |
| Storage | S3 + DynamoDB | Durable storage for results, fast metadata access |
| IaC Tool | AWS CDK (TypeScript) | Type-safe, familiar language, great AWS integration |
| Async Processing | SQS (Optional) | Handles long-running inference, decouples components |
Prerequisites and Setup
Required Tools
# Node.js and npm
node --version   # v18+ recommended
npm --version

# AWS CDK
npm install -g aws-cdk
cdk --version

# AWS CLI
aws --version
aws configure   # Set up credentials

# Python (for model code)
python3 --version   # 3.9+ recommended
pip3 --version

# Docker (for building container images)
docker --version
AWS Credentials Setup
# Configure AWS credentials
aws configure

# Verify credentials
aws sts get-caller-identity

# Bootstrap CDK (first time only)
cdk bootstrap aws://ACCOUNT-ID/REGION
CDK Project Structure
ml-inference-cdk/
├── bin/
│   └── ml-inference.ts              # CDK app entry point
├── lib/
│   ├── stacks/
│   │   ├── vpc-stack.ts             # VPC configuration
│   │   ├── sagemaker-stack.ts       # SageMaker endpoint
│   │   ├── lambda-stack.ts          # Lambda functions
│   │   └── api-stack.ts             # API Gateway
│   ├── constructs/
│   │   ├── sagemaker-model.ts       # Reusable SageMaker construct
│   │   └── lambda-api.ts            # Lambda + API construct
│   └── config/
│       ├── model-config.ts          # Model configuration
│       └── app-config.ts            # Application config
├── lambda/
│   ├── predict/
│   │   ├── index.ts                 # Prediction Lambda
│   │   └── package.json
│   └── async-predict/
│       ├── index.ts                 # Async prediction Lambda
│       └── package.json
├── model/
│   ├── inference.py                 # SageMaker inference script
│   ├── requirements.txt             # Python dependencies
│   └── Dockerfile                   # Container image
├── test/
│   └── ml-inference.test.ts         # CDK tests
├── cdk.json                         # CDK configuration
├── tsconfig.json                    # TypeScript config
└── package.json                     # Node.js dependencies
Step 1: Initialize CDK Project
Create New CDK Project
# Create project directory
mkdir ml-inference-cdk
cd ml-inference-cdk

# Initialize CDK project (this also installs aws-cdk-lib and constructs)
cdk init app --language=typescript

# Install dependencies
# Note: the stacks in this guide import only from aws-cdk-lib,
# so the alpha packages below are optional.
npm install @aws-cdk/aws-sagemaker-alpha
npm install @aws-cdk/aws-apigatewayv2-alpha @aws-cdk/aws-apigatewayv2-integrations-alpha
Configuration Files
// lib/config/model-config.ts
export interface ModelConfig {
  modelId: string;
  modelVersion: string;
  instanceType: string;
  instanceCount: number;
  containerImage: string;
  environment: Record<string, string>;
}

export const modelConfigs = {
  development: {
    modelId: 'stabilityai/stable-diffusion-xl-base-1.0',
    modelVersion: '1.0',
    instanceType: 'ml.g4dn.xlarge',
    instanceCount: 1,
    containerImage: '', // Will be set after build
    environment: {
      MODEL_CACHE_DIR: '/opt/ml/model',
      TRANSFORMERS_CACHE: '/opt/ml/model',
      HF_HOME: '/opt/ml/model'
    }
  },
  production: {
    modelId: 'stabilityai/stable-diffusion-xl-base-1.0',
    modelVersion: '1.0',
    instanceType: 'ml.g4dn.2xlarge',
    instanceCount: 2,
    containerImage: '',
    environment: {
      MODEL_CACHE_DIR: '/opt/ml/model',
      TRANSFORMERS_CACHE: '/opt/ml/model',
      HF_HOME: '/opt/ml/model'
    }
  }
} as const;

export type Environment = keyof typeof modelConfigs;

// lib/config/app-config.ts
import * as cdk from 'aws-cdk-lib';

export interface AppConfig {
  environment: string;
  region: string;
  account: string;
  vpcCidr: string;
  enableVpc: boolean;
  tags: Record<string, string>;
}

export function getAppConfig(app: cdk.App): AppConfig {
  const environment = app.node.tryGetContext('environment') || 'development';

  return {
    environment,
    region: process.env.CDK_DEFAULT_REGION || 'us-east-1',
    account: process.env.CDK_DEFAULT_ACCOUNT || '',
    vpcCidr: '10.0.0.0/16',
    enableVpc: environment === 'production',
    tags: {
      Environment: environment,
      Project: 'MLInference',
      ManagedBy: 'CDK'
    }
  };
}
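Because `environment` comes from CDK context as a free-form value, a typo such as `-c environment=prod` would later index `modelConfigs` with a key that does not exist. One way to guard against that is to validate the value up front. The sketch below is a suggestion rather than part of the original project; `resolveEnvironment` and `KNOWN_ENVIRONMENTS` are hypothetical names.

```typescript
// The environment names that have an entry in modelConfigs
const KNOWN_ENVIRONMENTS = ['development', 'production'] as const;
type EnvironmentName = (typeof KNOWN_ENVIRONMENTS)[number];

// Turn an untyped context value into a known environment name.
// Missing values fall back to 'development'; anything else unknown is an error.
function resolveEnvironment(raw: unknown): EnvironmentName {
  if (raw === undefined || raw === null || raw === '') {
    return 'development';
  }
  for (const name of KNOWN_ENVIRONMENTS) {
    if (raw === name) {
      return name;
    }
  }
  throw new Error(
    `Unknown environment "${String(raw)}"; expected one of: ${KNOWN_ENVIRONMENTS.join(', ')}`
  );
}
```

In `getAppConfig`, the context value could be passed through such a function so that a misspelled environment fails at synth time instead of producing an undefined model configuration.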
Step 2: Create SageMaker Inference Container
Inference Script
# model/inference.py
import json
import os
import torch
from diffusers import DiffusionPipeline
import base64
from io import BytesIO
from PIL import Image

class ModelHandler:
    def __init__(self):
        self.model = None
        self.device = "cuda" if torch.cuda.is_available() else "cpu"
        print(f"Using device: {self.device}")

    def load_model(self):
        """Load the Hugging Face model"""
        model_id = os.environ.get('MODEL_ID', 'stabilityai/stable-diffusion-xl-base-1.0')

        print(f"Loading model: {model_id}")

        self.model = DiffusionPipeline.from_pretrained(
            model_id,
            torch_dtype=torch.float16 if self.device == "cuda" else torch.float32,
            use_safetensors=True,
            variant="fp16" if self.device == "cuda" else None
        )

        self.model = self.model.to(self.device)

        # Enable memory efficient attention if available
        if hasattr(self.model, 'enable_xformers_memory_efficient_attention'):
            try:
                self.model.enable_xformers_memory_efficient_attention()
            except Exception as e:
                print(f"Could not enable xformers: {e}")

        print("Model loaded successfully")

    def preprocess(self, request_body):
        """Preprocess the input request"""
        try:
            if isinstance(request_body, bytes):
                request_body = request_body.decode('utf-8')

            data = json.loads(request_body)

            prompt = data.get('prompt', '')
            negative_prompt = data.get('negative_prompt', '')
            num_inference_steps = data.get('num_inference_steps', 50)
            guidance_scale = data.get('guidance_scale', 7.5)
            width = data.get('width', 1024)
            height = data.get('height', 1024)
            seed = data.get('seed', None)

            return {
                'prompt': prompt,
                'negative_prompt': negative_prompt,
                'num_inference_steps': num_inference_steps,
                'guidance_scale': guidance_scale,
                'width': width,
                'height': height,
                'seed': seed
            }
        except Exception as e:
            raise ValueError(f"Error preprocessing request: {str(e)}")

    def predict(self, data):
        """Run inference"""
        if self.model is None:
            self.load_model()

        # Set seed for reproducibility
        if data['seed'] is not None:
            generator = torch.Generator(device=self.device).manual_seed(data['seed'])
        else:
            generator = None

        # Generate image
        with torch.no_grad():
            image = self.model(
                prompt=data['prompt'],
                negative_prompt=data['negative_prompt'],
                num_inference_steps=data['num_inference_steps'],
                guidance_scale=data['guidance_scale'],
                width=data['width'],
                height=data['height'],
                generator=generator
            ).images[0]

        return image

    def postprocess(self, image):
        """Convert image to base64"""
        buffered = BytesIO()
        image.save(buffered, format="PNG")
        img_str = base64.b64encode(buffered.getvalue()).decode()

        return {
            'image': img_str,
            'format': 'png'
        }

# Global model handler
model_handler = ModelHandler()

def model_fn(model_dir):
    """Load model - called once when container starts"""
    model_handler.load_model()
    return model_handler

def input_fn(request_body, request_content_type):
    """Parse input data"""
    if request_content_type == 'application/json':
        return model_handler.preprocess(request_body)
    else:
        raise ValueError(f"Unsupported content type: {request_content_type}")

def predict_fn(data, model):
    """Run prediction"""
    return model.predict(data)

def output_fn(prediction, response_content_type):
    """Format output"""
    if response_content_type == 'application/json':
        return json.dumps(model_handler.postprocess(prediction))
    else:
        raise ValueError(f"Unsupported content type: {response_content_type}")
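Since `postprocess` returns the PNG as base64 inside a JSON body, the payload grows by roughly a third before it travels back through Lambda and API Gateway. The sketch below shows that arithmetic. The function names are hypothetical, and the 6 MB figure in the usage note is the synchronous Lambda response limit as I understand it, so treat it as an assumption to verify against current quotas.

```typescript
// Length in characters of the padded base64 encoding of `byteCount` bytes:
// every 3 input bytes become 4 output characters, rounded up to a full group.
function base64Length(byteCount: number): number {
  if (!Number.isInteger(byteCount) || byteCount < 0) {
    throw new RangeError('byteCount must be a non-negative integer');
  }
  return 4 * Math.ceil(byteCount / 3);
}

// Whether an image of `imageBytes` still fits a response limit once encoded.
// `overheadBytes` is a rough allowance for the surrounding JSON fields.
function fitsInResponse(
  imageBytes: number,
  limitBytes: number,
  overheadBytes: number = 1024
): boolean {
  return base64Length(imageBytes) + overheadBytes <= limitBytes;
}
```

For example, `fitsInResponse(1500000, 6 * 1024 * 1024)` holds for a 1.5 MB PNG, while a 5 MB PNG would exceed the same limit after encoding, which is one reason to hand large results off to S3.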
Requirements and Dockerfile
# model/requirements.txt
torch==2.1.0
diffusers==0.24.0
transformers==4.36.0
accelerate==0.25.0
safetensors==0.4.1
pillow==10.1.0
xformers==0.0.23  # Optional, for memory efficiency

# model/Dockerfile
FROM pytorch/pytorch:2.1.0-cuda11.8-cudnn8-runtime

# Set working directory
WORKDIR /opt/ml/code

# Install system dependencies
RUN apt-get update && apt-get install -y \
    git \
    wget \
    && rm -rf /var/lib/apt/lists/*

# Copy requirements and install Python dependencies
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Copy inference script
COPY inference.py .

# Set environment variables
ENV PYTHONUNBUFFERED=1
ENV MODEL_CACHE_DIR=/opt/ml/model
ENV TRANSFORMERS_CACHE=/opt/ml/model
ENV HF_HOME=/opt/ml/model

# SageMaker uses port 8080
ENV SAGEMAKER_BIND_TO_PORT=8080
ENV SAGEMAKER_PROGRAM=inference.py

# Health check
HEALTHCHECK --interval=30s --timeout=10s --start-period=5m --retries=3 \
    CMD wget --quiet --tries=1 --spider http://localhost:8080/ping || exit 1

# NOTE: SageMaker starts the container with the `serve` argument and expects an
# HTTP server answering GET /ping and POST /invocations on port 8080.
# inference.py as listed only defines the model_fn/input_fn/predict_fn/output_fn
# handlers, so a serving layer (for example the SageMaker Inference Toolkit)
# has to wrap it before this entry point can pass health checks.
ENTRYPOINT ["python", "inference.py"]
Step 3: CDK Stacks Implementation
VPC Stack (Optional but Recommended)
// lib/stacks/vpc-stack.ts
import * as cdk from 'aws-cdk-lib';
import * as ec2 from 'aws-cdk-lib/aws-ec2';
import { Construct } from 'constructs';

export interface VpcStackProps extends cdk.StackProps {
  vpcCidr: string;
}

export class VpcStack extends cdk.Stack {
  public readonly vpc: ec2.Vpc;

  constructor(scope: Construct, id: string, props: VpcStackProps) {
    super(scope, id, props);

    // Create VPC with public and private subnets
    this.vpc = new ec2.Vpc(this, 'MLInferenceVpc', {
      ipAddresses: ec2.IpAddresses.cidr(props.vpcCidr),
      maxAzs: 2,
      natGateways: 1, // Cost optimization
      subnetConfiguration: [
        {
          name: 'Public',
          subnetType: ec2.SubnetType.PUBLIC,
          cidrMask: 24,
        },
        {
          name: 'Private',
          subnetType: ec2.SubnetType.PRIVATE_WITH_EGRESS,
          cidrMask: 24,
        },
      ],
      enableDnsHostnames: true,
      enableDnsSupport: true,
    });

    // VPC Endpoints for cost optimization (avoid NAT charges)
    this.vpc.addInterfaceEndpoint('SageMakerRuntimeEndpoint', {
      service: ec2.InterfaceVpcEndpointAwsService.SAGEMAKER_RUNTIME,
    });

    this.vpc.addGatewayEndpoint('S3Endpoint', {
      service: ec2.GatewayVpcEndpointAwsService.S3,
    });

    // Output VPC ID
    new cdk.CfnOutput(this, 'VpcId', {
      value: this.vpc.vpcId,
      description: 'VPC ID',
    });
  }
}
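To see how much room the `10.0.0.0/16` VPC with `/24` subnets leaves, the address arithmetic can be worked out directly. This is a small sketch with hypothetical helper names; it relies on the fact that AWS reserves five addresses in every subnet.

```typescript
// AWS reserves 5 addresses in every subnet (network, router, DNS, future use, broadcast)
const AWS_RESERVED_PER_SUBNET = 5;

function assertPrefix(prefixLength: number): void {
  if (!Number.isInteger(prefixLength) || prefixLength < 0 || prefixLength > 32) {
    throw new RangeError('prefix length must be an integer between 0 and 32');
  }
}

// Total number of IPv4 addresses in a CIDR block with the given prefix length
function addressCount(prefixLength: number): number {
  assertPrefix(prefixLength);
  return Math.pow(2, 32 - prefixLength);
}

// Addresses actually assignable to resources inside one subnet
function usableAddresses(subnetPrefix: number): number {
  return Math.max(addressCount(subnetPrefix) - AWS_RESERVED_PER_SUBNET, 0);
}

// How many subnets of `subnetPrefix` fit into a VPC of `vpcPrefix`
function subnetCapacity(vpcPrefix: number, subnetPrefix: number): number {
  assertPrefix(vpcPrefix);
  assertPrefix(subnetPrefix);
  if (subnetPrefix < vpcPrefix) {
    throw new RangeError('subnet prefix must not be shorter than the VPC prefix');
  }
  return Math.pow(2, subnetPrefix - vpcPrefix);
}
```

With two availability zones and two subnet groups, the stack creates four `/24` subnets with 251 usable addresses each, out of the 256 such subnets a `/16` can hold.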
SageMaker Stack
// lib/stacks/sagemaker-stack.ts
import * as cdk from 'aws-cdk-lib';
import * as sagemaker from 'aws-cdk-lib/aws-sagemaker';
import * as iam from 'aws-cdk-lib/aws-iam';
import * as ec2 from 'aws-cdk-lib/aws-ec2';
import * as ecr from 'aws-cdk-lib/aws-ecr';
import { Construct } from 'constructs';
import { ModelConfig } from '../config/model-config';

export interface SageMakerStackProps extends cdk.StackProps {
  modelConfig: ModelConfig;
  vpc?: ec2.Vpc;
  ecrRepository: ecr.Repository;
}

export class SageMakerStack extends cdk.Stack {
  public readonly endpointName: string;
  public readonly endpoint: sagemaker.CfnEndpoint;

  constructor(scope: Construct, id: string, props: SageMakerStackProps) {
    super(scope, id, props);

    const { modelConfig, vpc, ecrRepository } = props;

    // IAM Role for SageMaker
    const sagemakerRole = new iam.Role(this, 'SageMakerExecutionRole', {
      assumedBy: new iam.ServicePrincipal('sagemaker.amazonaws.com'),
      managedPolicies: [
        iam.ManagedPolicy.fromAwsManagedPolicyName('AmazonSageMakerFullAccess'),
      ],
    });

    // Grant ECR access
    ecrRepository.grantPull(sagemakerRole);

    // Model
    const model = new sagemaker.CfnModel(this, 'HuggingFaceModel', {
      executionRoleArn: sagemakerRole.roleArn,
      primaryContainer: {
        image: `${ecrRepository.repositoryUri}:latest`,
        mode: 'SingleModel',
        environment: {
          ...modelConfig.environment,
          MODEL_ID: modelConfig.modelId,
        },
      },
      vpcConfig: vpc
        ? {
            subnets: vpc.privateSubnets.map((subnet) => subnet.subnetId),
            securityGroupIds: [
              new ec2.SecurityGroup(this, 'SageMakerSecurityGroup', {
                vpc,
                description: 'Security group for SageMaker endpoint',
                allowAllOutbound: true,
              }).securityGroupId,
            ],
          }
        : undefined,
    });

    // Endpoint Configuration
    const endpointConfig = new sagemaker.CfnEndpointConfig(
      this,
      'EndpointConfig',
      {
        productionVariants: [
          {
            modelName: model.attrModelName,
            variantName: 'AllTraffic',
            initialInstanceCount: modelConfig.instanceCount,
            instanceType: modelConfig.instanceType,
            initialVariantWeight: 1.0,
          },
        ],
        // No asyncInferenceConfig here: setting it would create an asynchronous
        // endpoint, which is invoked through InvokeEndpointAsync, while the
        // prediction Lambda calls InvokeEndpoint synchronously.
      }
    );

    endpointConfig.addDependency(model);

    // Endpoint
    // Use this.stackName rather than cdk.Aws.STACK_NAME: the pseudo parameter
    // resolves to the name of whichever stack consumes it, so the Lambda stack
    // would otherwise see a different endpoint name.
    this.endpointName = `ml-inference-endpoint-${this.stackName}`;
    this.endpoint = new sagemaker.CfnEndpoint(this, 'Endpoint', {
      endpointName: this.endpointName,
      endpointConfigName: endpointConfig.attrEndpointConfigName,
    });

    this.endpoint.addDependency(endpointConfig);

    // Auto-scaling
    const scalableTarget = new cdk.aws_applicationautoscaling.ScalableTarget(
      this,
      'ScalableTarget',
      {
        serviceNamespace: cdk.aws_applicationautoscaling.ServiceNamespace.SAGEMAKER,
        resourceId: `endpoint/${this.endpointName}/variant/AllTraffic`,
        scalableDimension: 'sagemaker:variant:DesiredInstanceCount',
        minCapacity: 1,
        maxCapacity: 5,
      }
    );

    scalableTarget.node.addDependency(this.endpoint);

    // Target tracking scaling policy
    scalableTarget.scaleToTrackMetric('TargetTracking', {
      targetValue: 70,
      predefinedMetric: cdk.aws_applicationautoscaling.PredefinedMetric.SAGEMAKER_VARIANT_INVOCATIONS_PER_INSTANCE,
      scaleInCooldown: cdk.Duration.seconds(300),
      scaleOutCooldown: cdk.Duration.seconds(60),
    });

    // Outputs
    new cdk.CfnOutput(this, 'EndpointName', {
      value: this.endpointName,
      description: 'SageMaker Endpoint Name',
    });

    new cdk.CfnOutput(this, 'EndpointArn', {
      value: this.endpoint.ref,
      description: 'SageMaker Endpoint ARN',
    });
  }
}
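The target-tracking policy keeps the per-instance invocation rate near 70 per minute, within the 1 to 5 instance bounds of the scalable target. As a rough mental model, and not the actual Application Auto Scaling algorithm (which also accounts for cooldowns and metric smoothing), the steady-state instance count can be estimated as below. `estimateDesiredInstances` is a hypothetical helper.

```typescript
interface ScalingBounds {
  minCapacity: number;
  maxCapacity: number;
}

// Estimate how many instances keep the per-instance rate at or below the target.
// This is a simplification: it ignores cooldowns and how the metric is averaged.
function estimateDesiredInstances(
  invocationsPerMinute: number,
  targetPerInstance: number,
  bounds: ScalingBounds
): number {
  if (targetPerInstance <= 0) {
    throw new RangeError('targetPerInstance must be positive');
  }
  if (invocationsPerMinute < 0) {
    throw new RangeError('invocationsPerMinute must not be negative');
  }
  const needed = Math.ceil(invocationsPerMinute / targetPerInstance);
  // Clamp to the scalable target's min/max capacity
  return Math.min(Math.max(needed, bounds.minCapacity), bounds.maxCapacity);
}
```

With the values from the stack, 140 invocations per minute would settle at two instances, and anything above 350 per minute is capped at five, at which point requests start to queue or throttle.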
Lambda Stack
// lib/stacks/lambda-stack.ts
import * as cdk from 'aws-cdk-lib';
import * as lambda from 'aws-cdk-lib/aws-lambda';
import * as iam from 'aws-cdk-lib/aws-iam';
import * as logs from 'aws-cdk-lib/aws-logs';
import * as s3 from 'aws-cdk-lib/aws-s3';
import * as dynamodb from 'aws-cdk-lib/aws-dynamodb';
import { Construct } from 'constructs';
import { NodejsFunction } from 'aws-cdk-lib/aws-lambda-nodejs';
import * as path from 'path';

export interface LambdaStackProps extends cdk.StackProps {
  endpointName: string;
  resultsBucket: s3.Bucket;
  metadataTable: dynamodb.Table;
}

export class LambdaStack extends cdk.Stack {
  public readonly predictFunction: lambda.Function;
  public readonly statusFunction: lambda.Function;

  constructor(scope: Construct, id: string, props: LambdaStackProps) {
    super(scope, id, props);

    const { endpointName, resultsBucket, metadataTable } = props;

    // Lambda execution role
    const lambdaRole = new iam.Role(this, 'LambdaExecutionRole', {
      assumedBy: new iam.ServicePrincipal('lambda.amazonaws.com'),
      managedPolicies: [
        iam.ManagedPolicy.fromAwsManagedPolicyName(
          'service-role/AWSLambdaBasicExecutionRole'
        ),
      ],
    });

    // Grant SageMaker invoke permissions
    lambdaRole.addToPolicy(
      new iam.PolicyStatement({
        actions: ['sagemaker:InvokeEndpoint'],
        resources: [
          `arn:aws:sagemaker:${cdk.Aws.REGION}:${cdk.Aws.ACCOUNT_ID}:endpoint/${endpointName}`,
        ],
      })
    );

    // Grant S3 permissions
    resultsBucket.grantReadWrite(lambdaRole);

    // Grant DynamoDB permissions
    metadataTable.grantReadWriteData(lambdaRole);

    // Prediction Lambda Function
    this.predictFunction = new NodejsFunction(this, 'PredictFunction', {
      runtime: lambda.Runtime.NODEJS_20_X,
      handler: 'handler',
      entry: path.join(__dirname, '../../lambda/predict/index.ts'),
      timeout: cdk.Duration.minutes(5),
      memorySize: 512,
      role: lambdaRole,
      environment: {
        ENDPOINT_NAME: endpointName,
        RESULTS_BUCKET: resultsBucket.bucketName,
        METADATA_TABLE: metadataTable.tableName,
        REGION: cdk.Aws.REGION,
      },
      logRetention: logs.RetentionDays.ONE_WEEK,
      bundling: {
        minify: true,
        sourceMap: true,
        target: 'es2020',
      },
    });

    // Status Check Lambda Function
    this.statusFunction = new NodejsFunction(this, 'StatusFunction', {
      runtime: lambda.Runtime.NODEJS_20_X,
      handler: 'handler',
      entry: path.join(__dirname, '../../lambda/status/index.ts'),
      timeout: cdk.Duration.seconds(30),
      memorySize: 256,
      role: lambdaRole,
      environment: {
        RESULTS_BUCKET: resultsBucket.bucketName,
        METADATA_TABLE: metadataTable.tableName,
        REGION: cdk.Aws.REGION,
      },
      logRetention: logs.RetentionDays.ONE_WEEK,
    });

    // Outputs
    new cdk.CfnOutput(this, 'PredictFunctionArn', {
      value: this.predictFunction.functionArn,
      description: 'Predict Lambda Function ARN',
    });
  }
}
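The prediction function is given a five-minute timeout, but a synchronous request also passes through API Gateway and the SageMaker endpoint, each with its own limit, and the shortest one decides when the client sees a timeout. The sketch below finds that tightest hop. The 29-second and 60-second figures in the usage note are my understanding of the default API Gateway integration timeout and the SageMaker real-time invocation timeout; treat them as assumptions and check the current quotas.

```typescript
interface Hop {
  name: string;
  timeoutSeconds: number;
}

// Return the hop with the smallest timeout: the effective budget of the whole chain.
function tightestHop(hops: Hop[]): Hop {
  if (hops.length === 0) {
    throw new Error('at least one hop is required');
  }
  return hops.reduce((tightest, hop) =>
    hop.timeoutSeconds < tightest.timeoutSeconds ? hop : tightest
  );
}

// Assumed limits for the synchronous path (verify against current AWS quotas)
const syncPath: Hop[] = [
  { name: 'API Gateway integration', timeoutSeconds: 29 },
  { name: 'Lambda (from the stack above)', timeoutSeconds: 300 },
  { name: 'SageMaker real-time invocation', timeoutSeconds: 60 },
];

const bottleneck = tightestHop(syncPath);
console.log(`${bottleneck.name} limits the request to ${bottleneck.timeoutSeconds}s`);
```

If a 50-step generation regularly takes longer than the tightest hop allows, the optional SQS path from the architecture table, combined with the status endpoint, is the more suitable design.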
API Gateway Stack
// lib/stacks/api-stack.ts
import * as cdk from 'aws-cdk-lib';
import * as apigateway from 'aws-cdk-lib/aws-apigateway';
import * as lambda from 'aws-cdk-lib/aws-lambda';
import * as logs from 'aws-cdk-lib/aws-logs';
import { Construct } from 'constructs';

export interface ApiStackProps extends cdk.StackProps {
  predictFunction: lambda.Function;
  statusFunction: lambda.Function;
}

export class ApiStack extends cdk.Stack {
  public readonly api: apigateway.RestApi;

  constructor(scope: Construct, id: string, props: ApiStackProps) {
    super(scope, id, props);

    const { predictFunction, statusFunction } = props;

    // CloudWatch Logs for API Gateway
    const logGroup = new logs.LogGroup(this, 'ApiGatewayLogs', {
      retention: logs.RetentionDays.ONE_WEEK,
      removalPolicy: cdk.RemovalPolicy.DESTROY,
    });

    // REST API
    this.api = new apigateway.RestApi(this, 'MLInferenceApi', {
      restApiName: 'ML Inference API',
      description: 'API for ML model inference',
      deployOptions: {
        stageName: 'prod',
        loggingLevel: apigateway.MethodLoggingLevel.INFO,
        dataTraceEnabled: true,
        accessLogDestination: new apigateway.LogGroupLogDestination(logGroup),
        accessLogFormat: apigateway.AccessLogFormat.jsonWithStandardFields(),
        throttlingRateLimit: 100,
        throttlingBurstLimit: 200,
      },
      defaultCorsPreflightOptions: {
        allowOrigins: apigateway.Cors.ALL_ORIGINS,
        allowMethods: apigateway.Cors.ALL_METHODS,
        allowHeaders: ['Content-Type', 'Authorization'],
      },
    });

    // API Key for authentication
    const apiKey = this.api.addApiKey('ApiKey', {
      apiKeyName: 'MLInferenceApiKey',
    });

    const usagePlan = this.api.addUsagePlan('UsagePlan', {
      name: 'Standard',
      throttle: {
        rateLimit: 100,
        burstLimit: 200,
      },
      quota: {
        limit: 10000,
        period: apigateway.Period.DAY,
      },
    });

    usagePlan.addApiKey(apiKey);
    usagePlan.addApiStage({
      stage: this.api.deploymentStage,
    });

    // Request validator
    const requestValidator = new apigateway.RequestValidator(
      this,
      'RequestValidator',
      {
        restApi: this.api,
        validateRequestBody: true,
        validateRequestParameters: true,
      }
    );

    // Request model
    const requestModel = this.api.addModel('PredictRequestModel', {
      contentType: 'application/json',
      modelName: 'PredictRequest',
      schema: {
        type: apigateway.JsonSchemaType.OBJECT,
        required: ['prompt'],
        properties: {
          prompt: { type: apigateway.JsonSchemaType.STRING },
          negative_prompt: { type: apigateway.JsonSchemaType.STRING },
          num_inference_steps: { type: apigateway.JsonSchemaType.INTEGER },
          guidance_scale: { type: apigateway.JsonSchemaType.NUMBER },
          width: { type: apigateway.JsonSchemaType.INTEGER },
          height: { type: apigateway.JsonSchemaType.INTEGER },
          seed: { type: apigateway.JsonSchemaType.INTEGER },
        },
      },
    });

    // /predict endpoint
    const predictResource = this.api.root.addResource('predict');
    predictResource.addMethod(
      'POST',
      new apigateway.LambdaIntegration(predictFunction, {
        proxy: true,
      }),
      {
        apiKeyRequired: true,
        requestValidator,
        requestModels: {
          'application/json': requestModel,
        },
      }
    );

    // /status/{jobId} endpoint
    const statusResource = this.api.root.addResource('status');
    const statusJobResource = statusResource.addResource('{jobId}');
    statusJobResource.addMethod(
      'GET',
      new apigateway.LambdaIntegration(statusFunction, {
        proxy: true,
      }),
      {
        apiKeyRequired: true,
      }
    );

    // Outputs
    new cdk.CfnOutput(this, 'ApiUrl', {
      value: this.api.url,
      description: 'API Gateway URL',
    });

    new cdk.CfnOutput(this, 'ApiKeyId', {
      value: apiKey.keyId,
      description: 'API Key ID',
    });
  }
}
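The request model lets API Gateway reject malformed bodies before they reach Lambda. To make the rules concrete, here is a hand-written check that mirrors the same schema (a required string `prompt`, optional string, number, and integer fields). It is an illustration of what the model enforces, not how API Gateway implements validation, and `validatePredictRequest` is a hypothetical name.

```typescript
type ValidationResult =
  | { valid: true }
  | { valid: false; errors: string[] };

// Fields declared as INTEGER in the request model
const INTEGER_FIELDS = ['num_inference_steps', 'width', 'height', 'seed'] as const;

function validatePredictRequest(body: unknown): ValidationResult {
  if (typeof body !== 'object' || body === null || Array.isArray(body)) {
    return { valid: false, errors: ['body must be a JSON object'] };
  }

  const record = body as Record<string, unknown>;
  const errors: string[] = [];

  // `prompt` is the only required property
  if (typeof record['prompt'] !== 'string') {
    errors.push('prompt is required and must be a string');
  }

  const negativePrompt = record['negative_prompt'];
  if (negativePrompt !== undefined && typeof negativePrompt !== 'string') {
    errors.push('negative_prompt must be a string');
  }

  const guidanceScale = record['guidance_scale'];
  if (guidanceScale !== undefined && typeof guidanceScale !== 'number') {
    errors.push('guidance_scale must be a number');
  }

  for (const field of INTEGER_FIELDS) {
    const value = record[field];
    if (value !== undefined && !(typeof value === 'number' && Number.isInteger(value))) {
      errors.push(`${field} must be an integer`);
    }
  }

  return errors.length === 0 ? { valid: true } : { valid: false, errors };
}
```

Note that the schema only checks types: it accepts an empty prompt or a width of 100000, so range checks still belong in the Lambda handler or in additional schema keywords.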
Step 4: Lambda Function Implementation
Prediction Lambda
// lambda/predict/index.ts
import {
  SageMakerRuntimeClient,
  InvokeEndpointCommand,
} from '@aws-sdk/client-sagemaker-runtime';
import { S3Client, PutObjectCommand } from '@aws-sdk/client-s3';
import { DynamoDBClient } from '@aws-sdk/client-dynamodb';
import { DynamoDBDocumentClient, PutCommand } from '@aws-sdk/lib-dynamodb';
import { APIGatewayProxyEvent, APIGatewayProxyResult } from 'aws-lambda';
import { v4 as uuidv4 } from 'uuid';

const sagemakerClient = new SageMakerRuntimeClient({
  region: process.env.REGION,
});

const s3Client = new S3Client({ region: process.env.REGION });

const dynamoClient = DynamoDBDocumentClient.from(
  new DynamoDBClient({ region: process.env.REGION })
);

interface PredictRequest {
  prompt: string;
  negative_prompt?: string;
  num_inference_steps?: number;
  guidance_scale?: number;
  width?: number;
  height?: number;
  seed?: number;
}

interface PredictResponse {
  jobId: string;
  status: 'processing' | 'completed' | 'failed';
  message: string;
  result?: {
    image: string;
    s3Url?: string;
  };
}

// Build an API Gateway proxy response with a JSON body and CORS header
const jsonResponse = (
  statusCode: number,
  body: unknown
): APIGatewayProxyResult => ({
  statusCode,
  headers: {
    'Content-Type': 'application/json',
    'Access-Control-Allow-Origin': '*',
  },
  body: JSON.stringify(body),
});

// Parse the request body; returns undefined when it is not valid JSON
const parseBody = (body: string): PredictRequest | undefined => {
  try {
    return JSON.parse(body) as PredictRequest;
  } catch {
    return undefined;
  }
};

// Write (or overwrite) the job record; records expire after 24 hours via TTL
const putJob = (item: Record<string, unknown>) =>
  dynamoClient.send(
    new PutCommand({
      TableName: process.env.METADATA_TABLE,
      Item: { ...item, ttl: Math.floor(Date.now() / 1000) + 86400 },
    })
  );

export const handler = async (
  event: APIGatewayProxyEvent
): Promise<APIGatewayProxyResult> => {
  // Log only the request ID: the full event includes the x-api-key header
  console.log('Request ID:', event.requestContext.requestId);

  // Parse request body
  if (!event.body) {
    return jsonResponse(400, { error: 'Request body is required' });
  }

  const request = parseBody(event.body);
  if (!request) {
    return jsonResponse(400, { error: 'Request body must be valid JSON' });
  }

  // Validate request
  if (typeof request.prompt !== 'string' || request.prompt.trim() === '') {
    return jsonResponse(400, { error: 'Prompt is required' });
  }

  // Generate job ID
  const jobId = uuidv4();
  const timestamp = new Date().toISOString();
  const baseItem = { jobId, prompt: request.prompt, timestamp };

  try {
    // Store job metadata in DynamoDB
    await putJob({ ...baseItem, status: 'processing' });

    // Prepare SageMaker request. `??` keeps explicit zeros (for example
    // seed: 0), which `||` would silently replace with the default.
    const sagemakerPayload = {
      prompt: request.prompt,
      negative_prompt: request.negative_prompt ?? '',
      num_inference_steps: request.num_inference_steps ?? 50,
      guidance_scale: request.guidance_scale ?? 7.5,
      width: request.width ?? 1024,
      height: request.height ?? 1024,
      seed: request.seed ?? null,
    };

    console.log('Invoking SageMaker endpoint:', process.env.ENDPOINT_NAME);

    // Invoke SageMaker endpoint
    const response = await sagemakerClient.send(
      new InvokeEndpointCommand({
        EndpointName: process.env.ENDPOINT_NAME,
        ContentType: 'application/json',
        Accept: 'application/json',
        Body: JSON.stringify(sagemakerPayload),
      })
    );

    // Parse response
    const result = JSON.parse(new TextDecoder().decode(response.Body));

    // Store the image in S3 if its base64 form is longer than 100,000 characters
    let s3Url: string | undefined;

    if (result.image && result.image.length > 100000) {
      const s3Key = `results/${jobId}.png`;

      await s3Client.send(
        new PutObjectCommand({
          Bucket: process.env.RESULTS_BUCKET,
          Key: s3Key,
          Body: Buffer.from(result.image, 'base64'),
          ContentType: 'image/png',
        })
      );

      s3Url = `s3://${process.env.RESULTS_BUCKET}/${s3Key}`;
    }

    // Update DynamoDB with the result
    await putJob({
      ...baseItem,
      status: 'completed',
      ...(s3Url ? { s3Url } : {}),
      completedAt: new Date().toISOString(),
    });

    // Large results return a truncated preview plus the S3 URL;
    // small results are returned inline
    const responseBody: PredictResponse = {
      jobId,
      status: 'completed',
      message: 'Inference completed successfully',
      result: s3Url
        ? { image: result.image.substring(0, 100) + '...', s3Url }
        : { image: result.image },
    };

    return jsonResponse(200, responseBody);
  } catch (error) {
    console.error('Error:', error);

    // Best effort: mark the job as failed so that /status does not keep
    // reporting it as processing
    await putJob({ ...baseItem, status: 'failed' }).catch((updateError) =>
      console.error('Could not record failure:', updateError)
    );

    return jsonResponse(500, {
      error: 'Internal server error',
      message: error instanceof Error ? error.message : 'Unknown error',
    });
  }
};
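One detail worth isolating is how defaults are filled in for optional numeric fields. With `||`, any falsy value counts as missing, so a caller who asks for `seed: 0` or `guidance_scale: 0` would get the default instead. The nullish coalescing operator `??` only replaces `null` and `undefined`. The sketch below shows the difference on a standalone function; `withDefaults` and the two interfaces are hypothetical names, with defaults taken from the handler.

```typescript
interface GenerationRequest {
  prompt: string;
  negative_prompt?: string;
  num_inference_steps?: number;
  guidance_scale?: number;
  width?: number;
  height?: number;
  seed?: number;
}

interface GenerationPayload {
  prompt: string;
  negative_prompt: string;
  num_inference_steps: number;
  guidance_scale: number;
  width: number;
  height: number;
  seed: number | null;
}

// Fill in defaults while preserving explicit zeros and empty strings
function withDefaults(request: GenerationRequest): GenerationPayload {
  return {
    prompt: request.prompt,
    negative_prompt: request.negative_prompt ?? '',
    num_inference_steps: request.num_inference_steps ?? 50,
    guidance_scale: request.guidance_scale ?? 7.5,
    width: request.width ?? 1024,
    height: request.height ?? 1024,
    seed: request.seed ?? null,
  };
}

// seed: 0 is a legitimate, reproducible seed and must survive
const payload = withDefaults({ prompt: 'a lighthouse at dusk', seed: 0 });
console.log(payload.seed, payload.width);
```

This matters for reproducibility: the inference script only seeds the generator when `seed` is not `None`, so a seed of 0 that is rewritten to `null` would yield a different image on every call.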
Status Lambda
// lambda/status/index.ts
import { DynamoDBClient } from '@aws-sdk/client-dynamodb';
import { DynamoDBDocumentClient, GetCommand } from '@aws-sdk/lib-dynamodb';
import { S3Client, GetObjectCommand } from '@aws-sdk/client-s3';
import { getSignedUrl } from '@aws-sdk/s3-request-presigner';
import { APIGatewayProxyEvent, APIGatewayProxyResult } from 'aws-lambda';

const dynamoClient = DynamoDBDocumentClient.from(
  new DynamoDBClient({ region: process.env.REGION })
);

const s3Client = new S3Client({ region: process.env.REGION });

export const handler = async (
  event: APIGatewayProxyEvent
): Promise<APIGatewayProxyResult> => {
  try {
    const jobId = event.pathParameters?.jobId;

    if (!jobId) {
      return {
        statusCode: 400,
        body: JSON.stringify({ error: 'Job ID is required' }),
      };
    }

    // Get job metadata from DynamoDB
    const result = await dynamoClient.send(
      new GetCommand({
        TableName: process.env.METADATA_TABLE,
        Key: { jobId },
      })
    );

    if (!result.Item) {
      return {
        statusCode: 404,
        body: JSON.stringify({ error: 'Job not found' }),
      };
    }

    // Generate presigned URL if result is in S3
    let presignedUrl: string | undefined;

    if (result.Item.s3Url) {
      const s3Key = result.Item.s3Url.replace(
        `s3://${process.env.RESULTS_BUCKET}/`,
        ''
      );

      presignedUrl = await getSignedUrl(
        s3Client,
        new GetObjectCommand({
          Bucket: process.env.RESULTS_BUCKET,
          Key: s3Key,
        }),
        { expiresIn: 3600 } // 1 hour
      );
    }

    return {
      statusCode: 200,
      headers: {
        'Content-Type': 'application/json',
        'Access-Control-Allow-Origin': '*',
      },
      body: JSON.stringify({
        jobId: result.Item.jobId,
        status: result.Item.status,
        prompt: result.Item.prompt,
        timestamp: result.Item.timestamp,
        completedAt: result.Item.completedAt,
        ...(presignedUrl && { downloadUrl: presignedUrl }),
      }),
    };
  } catch (error) {
    console.error('Error:', error);

    return {
      statusCode: 500,
      headers: {
        'Content-Type': 'application/json',
        'Access-Control-Allow-Origin': '*',
      },
      body: JSON.stringify({
        error: 'Internal server error',
        message: error instanceof Error ? error.message : 'Unknown error',
      }),
    };
  }
};
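The handler recovers the object key by stripping a known bucket prefix from the stored `s3://` URL. A slightly more defensive alternative is to parse the URL into bucket and key, which also catches records that point at an unexpected bucket. This is a sketch with hypothetical names (`parseS3Url`, `S3Location`), not a replacement the original code requires.

```typescript
interface S3Location {
  bucket: string;
  key: string;
}

// Split an s3://bucket/key URL into its parts, rejecting anything malformed
function parseS3Url(url: string): S3Location {
  const prefix = 's3://';
  if (!url.startsWith(prefix)) {
    throw new Error(`Not an s3:// URL: ${url}`);
  }

  const rest = url.slice(prefix.length);
  const slash = rest.indexOf('/');

  // Require a non-empty bucket and a non-empty key
  if (slash <= 0 || slash === rest.length - 1) {
    throw new Error(`s3:// URL must contain a bucket and a key: ${url}`);
  }

  return { bucket: rest.slice(0, slash), key: rest.slice(slash + 1) };
}

const location = parseS3Url('s3://my-results-bucket/results/abc-123.png');
console.log(location.bucket, location.key);
```

Comparing `location.bucket` with the configured results bucket before presigning would keep a corrupted record from generating a URL for the wrong object.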
Step 5: Main CDK App
#!/usr/bin/env node
// bin/ml-inference.ts
import 'source-map-support/register';
import * as cdk from 'aws-cdk-lib';
import * as s3 from 'aws-cdk-lib/aws-s3';
import * as dynamodb from 'aws-cdk-lib/aws-dynamodb';
import * as ecr from 'aws-cdk-lib/aws-ecr';
import { VpcStack } from '../lib/stacks/vpc-stack';
import { SageMakerStack } from '../lib/stacks/sagemaker-stack';
import { LambdaStack } from '../lib/stacks/lambda-stack';
import { ApiStack } from '../lib/stacks/api-stack';
import { getAppConfig } from '../lib/config/app-config';
import { modelConfigs } from '../lib/config/model-config';

const app = new cdk.App();
const config = getAppConfig(app);
const env = { account: config.account, region: config.region };

// Shared resources. CDK resources must be created inside a Stack rather than
// directly under the App, so they get a small stack of their own.
const sharedStack = new cdk.Stack(app, 'SharedResourcesStack', { env });

const resultsBucket = new s3.Bucket(sharedStack, 'ResultsBucket', {
  bucketName: `${config.account}-ml-inference-results`,
  removalPolicy: cdk.RemovalPolicy.DESTROY,
  autoDeleteObjects: true,
  encryption: s3.BucketEncryption.S3_MANAGED,
  lifecycleRules: [
    {
      expiration: cdk.Duration.days(7),
    },
  ],
});

const metadataTable = new dynamodb.Table(sharedStack, 'MetadataTable', {
  tableName: 'ml-inference-jobs',
  partitionKey: { name: 'jobId', type: dynamodb.AttributeType.STRING },
  billingMode: dynamodb.BillingMode.PAY_PER_REQUEST,
  removalPolicy: cdk.RemovalPolicy.DESTROY,
  timeToLiveAttribute: 'ttl',
  pointInTimeRecovery: true,
});

const ecrRepository = new ecr.Repository(sharedStack, 'ModelRepository', {
  repositoryName: 'ml-inference-model',
  removalPolicy: cdk.RemovalPolicy.DESTROY,
  autoDeleteImages: true,
});

// VPC Stack (optional)
let vpcStack: VpcStack | undefined;
if (config.enableVpc) {
  vpcStack = new VpcStack(app, 'VpcStack', {
    vpcCidr: config.vpcCidr,
    env,
  });
}

// SageMaker Stack
const sagemakerStack = new SageMakerStack(app, 'SageMakerStack', {
  modelConfig: modelConfigs[config.environment as keyof typeof modelConfigs],
  vpc: vpcStack?.vpc,
  ecrRepository,
  env,
});

// Lambda Stack
const lambdaStack = new LambdaStack(app, 'LambdaStack', {
  endpointName: sagemakerStack.endpointName,
  resultsBucket,
  metadataTable,
  env,
});

lambdaStack.addDependency(sagemakerStack);

// API Stack
const apiStack = new ApiStack(app, 'ApiStack', {
  predictFunction: lambdaStack.predictFunction,
  statusFunction: lambdaStack.statusFunction,
  env,
});

apiStack.addDependency(lambdaStack);

// Add tags to all resources
95Object.entries(config.tags).forEach(([key, value]) => {
96 cdk.Tags.of(app).add(key, value);
97});
98
99app.synth();
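The `as keyof typeof modelConfigs` cast in the SageMaker stack props compiles even when `config.environment` holds a value with no matching entry, in which case `modelConfig` is `undefined` at synth time. A small lookup that fails loudly is one way to catch that earlier. The sketch below assumes a simplified `ModelConfig` shape and two environment names; the real `lib/config/model-config.ts` may look different.

```typescript
// Assumed, simplified shape; the actual model config likely has more fields.
interface ModelConfig {
  instanceType: string;
  initialInstanceCount: number;
}

// Hypothetical stand-in for the modelConfigs export.
const exampleModelConfigs: Record<string, ModelConfig> = {
  dev: { instanceType: 'ml.g4dn.xlarge', initialInstanceCount: 1 },
  prod: { instanceType: 'ml.g5.2xlarge', initialInstanceCount: 2 },
};

// Look up the config for an environment and throw a descriptive error
// instead of silently returning undefined.
function resolveModelConfig(
  configs: Record<string, ModelConfig>,
  environment: string
): ModelConfig {
  const resolved = Object.prototype.hasOwnProperty.call(configs, environment)
    ? configs[environment]
    : undefined;
  if (!resolved) {
    const known = Object.keys(configs).join(', ');
    throw new Error(
      `No model config for environment "${environment}" (known: ${known})`
    );
  }
  return resolved;
}

console.log(resolveModelConfig(exampleModelConfigs, 'dev').instanceType); // ml.g4dn.xlarge
```

Failing during `cdk synth` with a message that lists the known environments is usually easier to diagnose than a CloudFormation error about a missing instance type.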
Step 6: Deployment
Build and Push Docker Image
# Build Docker image
cd model
docker build -t ml-inference-model .

# Get ECR login
aws ecr get-login-password --region us-east-1 | \
  docker login --username AWS --password-stdin \
  ACCOUNT_ID.dkr.ecr.us-east-1.amazonaws.com

# Tag image
docker tag ml-inference-model:latest \
  ACCOUNT_ID.dkr.ecr.us-east-1.amazonaws.com/ml-inference-model:latest

# Push to ECR
docker push ACCOUNT_ID.dkr.ecr.us-east-1.amazonaws.com/ml-inference-model:latest
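The same registry hostname appears three times in the commands above, with `ACCOUNT_ID` substituted by hand each time. If the URI is also needed in TypeScript (for example in a deploy script), it can be assembled in one place. This is a minimal sketch; `ecrImageUri` is a hypothetical helper, and the hostname pattern shown is the one for standard commercial regions.

```typescript
interface EcrImageRef {
  accountId: string;
  region: string;
  repository: string;
  tag: string;
}

// Build "<account>.dkr.ecr.<region>.amazonaws.com/<repository>:<tag>".
function ecrImageUri(ref: EcrImageRef): string {
  if (!/^\d{12}$/.test(ref.accountId)) {
    throw new Error(`AWS account IDs are 12 digits, got "${ref.accountId}"`);
  }
  const registry = `${ref.accountId}.dkr.ecr.${ref.region}.amazonaws.com`;
  return `${registry}/${ref.repository}:${ref.tag}`;
}

console.log(
  ecrImageUri({
    accountId: '123456789012', // placeholder account ID
    region: 'us-east-1',
    repository: 'ml-inference-model',
    tag: 'latest',
  })
);
// 123456789012.dkr.ecr.us-east-1.amazonaws.com/ml-inference-model:latest
```

The account ID check catches the common mistake of leaving the literal `ACCOUNT_ID` placeholder in place.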
Deploy CDK Stacks
# Install dependencies
npm install

# Synthesize CloudFormation templates
cdk synth

# Deploy all stacks
cdk deploy --all --require-approval never

# Or deploy individually
cdk deploy VpcStack
cdk deploy SageMakerStack
cdk deploy LambdaStack
cdk deploy ApiStack
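The order of the individual `cdk deploy` commands follows the stack dependencies declared in the app: the API stack depends on the Lambda stack, which depends on the SageMaker stack, which uses the VPC when it is enabled. As an illustration of how such an order is derived, here is a small depth-first topological sort. It is a sketch of the idea, not CDK's own implementation, and `deployOrder` is a hypothetical name.

```typescript
// Maps each stack to the stacks it depends on.
type DependencyMap = Record<string, string[]>;

// Return the stacks in an order where every dependency comes before its dependents.
function deployOrder(dependencies: DependencyMap): string[] {
  const ordered: string[] = [];
  const visiting = new Set<string>();
  const done = new Set<string>();

  const visit = (stack: string): void => {
    if (done.has(stack)) {
      return;
    }
    if (visiting.has(stack)) {
      throw new Error(`Dependency cycle at ${stack}`);
    }
    visiting.add(stack);
    for (const dep of dependencies[stack] ?? []) {
      visit(dep);
    }
    visiting.delete(stack);
    done.add(stack);
    ordered.push(stack);
  };

  Object.keys(dependencies).forEach((stack) => visit(stack));
  return ordered;
}

const stackDependencies: DependencyMap = {
  ApiStack: ['LambdaStack'],
  LambdaStack: ['SageMakerStack'],
  SageMakerStack: ['VpcStack'],
  VpcStack: [],
};

console.log(deployOrder(stackDependencies).join(' -> '));
// VpcStack -> SageMakerStack -> LambdaStack -> ApiStack
```

In practice `cdk deploy ApiStack` also deploys the stacks it depends on unless `--exclusively` is passed, so the manual ordering mostly matters when deploying stacks one at a time for debugging.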
Test the API
# Get API Key
aws apigateway get-api-keys --include-values

# Make prediction request
curl -X POST https://YOUR_API_ID.execute-api.us-east-1.amazonaws.com/prod/predict \
  -H "Content-Type: application/json" \
  -H "x-api-key: YOUR_API_KEY" \
  -d '{
    "prompt": "A serene landscape with mountains and a lake at sunset",
    "num_inference_steps": 50,
    "guidance_scale": 7.5,
    "width": 1024,
    "height": 1024
  }'

# Check job status
curl https://YOUR_API_ID.execute-api.us-east-1.amazonaws.com/prod/status/JOB_ID \
  -H "x-api-key: YOUR_API_KEY"
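Because the prediction is submitted first and the result is fetched from the status endpoint afterwards, a client typically polls until the job finishes. The sketch below shows one way to do that with capped exponential backoff. The status values `'completed'` and `'failed'` are assumptions about what the status handler stores, and `getStatus` is injected so the loop can wrap `fetch`, an SDK call, or a test stub.

```typescript
// Delay before each retry: baseMs, 2*baseMs, 4*baseMs, ... capped at maxMs.
function backoffDelays(attempts: number, baseMs: number, maxMs: number): number[] {
  const delays: number[] = [];
  for (let i = 0; i < attempts; i++) {
    delays.push(Math.min(baseMs * 2 ** i, maxMs));
  }
  return delays;
}

interface JobStatus {
  status: string; // assumed values: 'pending', 'completed', 'failed'
  downloadUrl?: string;
}

// Poll until the job reaches a terminal state or the delay schedule runs out.
async function pollJob(
  getStatus: () => Promise<JobStatus>,
  delays: number[],
  sleep: (ms: number) => Promise<void> = (ms) =>
    new Promise<void>((resolve) => setTimeout(resolve, ms))
): Promise<JobStatus> {
  for (const delay of delays) {
    const job = await getStatus();
    if (job.status === 'completed' || job.status === 'failed') {
      return job;
    }
    await sleep(delay);
  }
  throw new Error('Job did not finish within the polling budget');
}

console.log(backoffDelays(5, 1000, 5000)); // [ 1000, 2000, 4000, 5000, 5000 ]
```

A caller would pass a `getStatus` that requests `/prod/status/JOB_ID` with the `x-api-key` header and parses the JSON body; the cap keeps long-running image generations from being polled too rarely.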
Monitoring and Observability
CloudWatch Dashboard
// Add to lib/stacks/monitoring-stack.ts
import * as cloudwatch from 'aws-cdk-lib/aws-cloudwatch';

const dashboard = new cloudwatch.Dashboard(this, 'MLInferenceDashboard', {
  dashboardName: 'ML-Inference-Metrics',
});

// SageMaker metrics
dashboard.addWidgets(
  new cloudwatch.GraphWidget({
    title: 'SageMaker Invocations',
    left: [
      new cloudwatch.Metric({
        namespace: 'AWS/SageMaker',
        metricName: 'Invocations',
        dimensionsMap: {
          EndpointName: endpointName,
          VariantName: 'AllTraffic',
        },
        statistic: 'Sum',
      }),
    ],
  })
);

// Lambda metrics
dashboard.addWidgets(
  new cloudwatch.GraphWidget({
    title: 'Lambda Duration',
    left: [
      predictFunction.metricDuration(),
    ],
  })
);
Alarms
// SageMaker endpoint alarm
const endpointAlarm = new cloudwatch.Alarm(this, 'EndpointFailureAlarm', {
  metric: new cloudwatch.Metric({
    namespace: 'AWS/SageMaker',
    // The endpoint metric is named Invocation4XXErrors (use Invocation5XXErrors for server-side failures)
    metricName: 'Invocation4XXErrors',
    dimensionsMap: {
      EndpointName: endpointName,
      VariantName: 'AllTraffic',
    },
    statistic: 'Sum',
  }),
  threshold: 10,
  evaluationPeriods: 2,
  alarmDescription: 'Alert when SageMaker endpoint has too many 4XX errors',
});

// Lambda error alarm
const lambdaAlarm = new cloudwatch.Alarm(this, 'LambdaErrorAlarm', {
  metric: predictFunction.metricErrors(),
  threshold: 5,
  evaluationPeriods: 2,
  alarmDescription: 'Alert when Lambda function has too many errors',
});
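It can help to see how `threshold` and `evaluationPeriods` interact. The alarms above set no `comparisonOperator` or `datapointsToAlarm`, so, per the CDK documentation defaults as I understand them, a datapoint breaches when it is greater than or equal to the threshold, and every one of the last `evaluationPeriods` datapoints must breach. The following is a simplified simulation of that rule, not CloudWatch's actual evaluation logic (it ignores missing-data handling, for example).

```typescript
// Simplified model: the alarm fires when each of the most recent
// `evaluationPeriods` datapoints is >= threshold.
function isInAlarm(
  datapoints: number[],
  threshold: number,
  evaluationPeriods: number
): boolean {
  if (datapoints.length < evaluationPeriods) {
    return false; // not enough data yet (a simplification)
  }
  const recent = datapoints.slice(-evaluationPeriods);
  return recent.every((value) => value >= threshold);
}

// One value per metric period, e.g. summed 4XX errors.
console.log(isInAlarm([2, 12, 3], 10, 2));  // false: only one of the last two periods breaches
console.log(isInAlarm([2, 12, 15], 10, 2)); // true: both of the last two periods breach
```

With the default five-minute metric period, `evaluationPeriods: 2` therefore means roughly ten minutes of sustained errors before the endpoint alarm fires, which filters out single spikes.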
Cost Optimization
Cost Breakdown
| Service | Cost Factor | Optimization Strategy |
|---|---|---|
| SageMaker | Instance hours | Use auto-scaling, smaller instances for dev |
| Lambda | Invocations + Duration | Optimize code, use appropriate memory |
| API Gateway | Requests | Cache responses when possible |
| S3 | Storage + Requests | Lifecycle policies, intelligent tiering |
| DynamoDB | Read/Write units | Use on-demand pricing, TTL for cleanup |
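For the SageMaker row, instance hours dominate, so a quick estimate is just rate times instances times hours. The sketch below makes that arithmetic explicit. The hourly rate is a caller-supplied input, and the 1.00 USD/hour used in the example is a placeholder chosen to keep the numbers easy to follow, not a real price.

```typescript
interface EndpointUsage {
  hourlyRateUsd: number; // look up the current price for your instance type and region
  instanceCount: number;
  hoursPerDay: number;
  daysPerMonth: number;
}

// Monthly cost = hourly rate x instances x hours per day x days per month.
function monthlyEndpointCostUsd(usage: EndpointUsage): number {
  return (
    usage.hourlyRateUsd *
    usage.instanceCount *
    usage.hoursPerDay *
    usage.daysPerMonth
  );
}

// Placeholder rate of 1.00 USD/hour.
const alwaysOnCost = monthlyEndpointCostUsd({
  hourlyRateUsd: 1.0,
  instanceCount: 1,
  hoursPerDay: 24,
  daysPerMonth: 30,
});
const businessHoursCost = monthlyEndpointCostUsd({
  hourlyRateUsd: 1.0,
  instanceCount: 1,
  hoursPerDay: 10,
  daysPerMonth: 22,
});

console.log(alwaysOnCost, businessHoursCost); // 720 220
```

The gap between the two figures is the motivation for scaling development endpoints down outside working hours.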
Optimization Tips
// 1. Use a smaller instance type for non-production endpoints
// Note: Spot capacity applies to SageMaker training jobs (managed spot training),
// not to real-time inference endpoints
productionVariants: [{
  // ... other config
  instanceType: 'ml.g4dn.xlarge',
  initialInstanceCount: 1,
}]

// 2. Implement caching in Lambda
// The Map lives only as long as a warm execution environment and is not shared
// between concurrent instances; it is also unbounded, so keep cached values small
const cache = new Map<string, any>();

export const handler = async (event: any) => {
  const cacheKey = JSON.stringify(event.body);

  if (cache.has(cacheKey)) {
    return cache.get(cacheKey);
  }

  // invokeModel is the SageMaker call defined elsewhere in the handler module
  const result = await invokeModel(event);
  cache.set(cacheKey, result);

  return result;
};

// 3. Use Reserved Capacity for predictable workloads
// Purchase SageMaker Savings Plans for production workloads
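The in-memory `Map` in tip 2 grows without limit and never expires entries. One possible refinement is a small cache with a size cap and a time-to-live, sketched below. `TtlCache` is a hypothetical class, not part of the original code, and the clock is injectable so expiry can be tested without waiting. Caching only makes sense when the same request is expected to produce the same output.

```typescript
class TtlCache<V> {
  private readonly entries = new Map<string, { value: V; expiresAt: number }>();
  private readonly maxEntries: number;
  private readonly ttlMs: number;
  private readonly now: () => number;

  constructor(maxEntries: number, ttlMs: number, now: () => number = () => Date.now()) {
    if (maxEntries < 1) {
      throw new Error('maxEntries must be at least 1');
    }
    this.maxEntries = maxEntries;
    this.ttlMs = ttlMs;
    this.now = now;
  }

  get(key: string): V | undefined {
    const entry = this.entries.get(key);
    if (!entry) {
      return undefined;
    }
    if (entry.expiresAt <= this.now()) {
      this.entries.delete(key); // expired: drop it and report a miss
      return undefined;
    }
    return entry.value;
  }

  set(key: string, value: V): void {
    // Deleting first moves a re-inserted key to the end of the insertion order.
    this.entries.delete(key);
    if (this.entries.size >= this.maxEntries) {
      // Maps iterate in insertion order, so the first key is the oldest.
      const oldest = this.entries.keys().next();
      if (!oldest.done) {
        this.entries.delete(oldest.value);
      }
    }
    this.entries.set(key, { value, expiresAt: this.now() + this.ttlMs });
  }

  get size(): number {
    return this.entries.size;
  }
}

// Example: at most 100 cached responses, each valid for five minutes.
const responseCache = new TtlCache<string>(100, 5 * 60 * 1000);
responseCache.set('{"prompt":"a lake at sunset"}', 's3://example-bucket/result.png');
console.log(responseCache.size); // 1
```

Like the plain `Map`, this cache is still local to one warm Lambda execution environment; sharing results across instances would need an external store such as the DynamoDB table already in the stack.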
Security Best Practices
Security Checklist
import * as iam from 'aws-cdk-lib/aws-iam';
import * as s3 from 'aws-cdk-lib/aws-s3';
import * as secretsmanager from 'aws-cdk-lib/aws-secretsmanager';

// 1. Enable encryption at rest
const resultsBucket = new s3.Bucket(this, 'ResultsBucket', {
  encryption: s3.BucketEncryption.S3_MANAGED,
  enforceSSL: true,
});

// 2. Restrict S3 bucket access
// enforceSSL above already adds an equivalent deny statement; it is spelled out
// here to show what the policy looks like
resultsBucket.addToResourcePolicy(
  new iam.PolicyStatement({
    effect: iam.Effect.DENY,
    principals: [new iam.AnyPrincipal()],
    actions: ['s3:*'],
    resources: [resultsBucket.bucketArn, resultsBucket.arnForObjects('*')],
    conditions: {
      Bool: {
        'aws:SecureTransport': 'false',
      },
    },
  })
);

// 3. Enable API Gateway authentication
// Use Cognito, API Keys, or Lambda Authorizers

// 4. Implement rate limiting
const throttleSettings = {
  rateLimit: 100,
  burstLimit: 200,
};

// 5. Enable VPC for SageMaker (production)
// Isolate SageMaker endpoints in private subnets

// 6. Use Secrets Manager for sensitive data
const apiSecret = new secretsmanager.Secret(this, 'ApiSecret', {
  secretName: 'ml-inference-api-key',
});

// 7. Enable CloudTrail logging
// Monitor API calls and access patterns
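The `rateLimit` and `burstLimit` values in item 4 follow the token bucket model that API Gateway uses for throttling: the bucket holds up to `burstLimit` tokens and refills at `rateLimit` tokens per second, and each request spends one token. The class below is a simulation of that model for intuition only. It is not API Gateway's implementation, and `TokenBucket` is a hypothetical name.

```typescript
class TokenBucket {
  private tokens: number;
  private lastRefillMs: number;
  private readonly ratePerSecond: number;
  private readonly burst: number;

  constructor(ratePerSecond: number, burst: number, startMs: number) {
    this.ratePerSecond = ratePerSecond;
    this.burst = burst;
    this.tokens = burst; // the bucket starts full
    this.lastRefillMs = startMs;
  }

  // Returns true when the request is admitted, false when it would be throttled.
  tryAcquire(nowMs: number): boolean {
    const elapsedSeconds = (nowMs - this.lastRefillMs) / 1000;
    this.tokens = Math.min(
      this.burst,
      this.tokens + elapsedSeconds * this.ratePerSecond
    );
    this.lastRefillMs = nowMs;
    if (this.tokens < 1) {
      return false;
    }
    this.tokens -= 1;
    return true;
  }
}

// 250 requests arriving at the same instant against rate 100, burst 200.
const demoBucket = new TokenBucket(100, 200, 0);
let admittedCount = 0;
for (let i = 0; i < 250; i++) {
  if (demoBucket.tryAcquire(0)) {
    admittedCount++;
  }
}
console.log(admittedCount); // 200

// 20 ms later the bucket has refilled by about two tokens.
console.log(demoBucket.tryAcquire(20)); // true
```

In this model the burst limit absorbs short spikes while the rate limit bounds sustained throughput; throttled requests would receive an HTTP 429 response from API Gateway.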
Summary and Best Practices
Key Takeaways
- Infrastructure as Code: Use CDK for reproducible, version-controlled infrastructure
- Separation of Concerns: Keep model code (Python) separate from infrastructure (TypeScript)
- Auto-scaling: Configure SageMaker and Lambda to scale based on demand
- Monitoring: Implement comprehensive logging and alerting
- Cost Management: Use auto-scaling, lifecycle policies, and appropriate instance types
- Security: Enable encryption, use IAM roles, implement API authentication
- Testing: Test locally with Docker before deploying to AWS
Essential Commands
# Development
npm run build            # Build TypeScript
cdk synth                # Generate CloudFormation
cdk diff                 # See changes before deploy

# Deployment
cdk deploy --all         # Deploy all stacks
cdk deploy VpcStack      # Deploy specific stack

# Docker
docker build -t model .
docker push ECR_URI

# Testing
# fileb:// sends the file as raw bytes (AWS CLI v2 treats file:// blobs as base64)
aws sagemaker-runtime invoke-endpoint \
  --endpoint-name ENDPOINT_NAME \
  --content-type application/json \
  --body fileb://request.json \
  output.json

# Cleanup
cdk destroy --all        # Delete all resources
Project Checklist
- Set up AWS credentials and CDK bootstrap
- Create Hugging Face inference script
- Build and test Docker image locally
- Push image to ECR
- Deploy VPC stack (if needed)
- Deploy SageMaker stack
- Test SageMaker endpoint directly
- Deploy Lambda stack
- Test Lambda functions
- Deploy API Gateway stack
- Test end-to-end API
- Set up monitoring and alarms
- Configure auto-scaling
- Implement security best practices
- Document API endpoints
- Set up CI/CD pipeline
Further Learning
- AWS CDK: AWS CDK Documentation
- SageMaker: Amazon SageMaker Developer Guide
- Hugging Face: Hugging Face Documentation
- MLOps: ML Engineering Best Practices
Conclusion
Deploying machine learning models to production requires careful consideration of architecture, scalability, cost, and security. By using AWS CDK with TypeScript, you can create infrastructure as code that's maintainable, testable, and reproducible.
This guide provided a complete solution for deploying Hugging Face models using SageMaker and Lambda, with:
- Type-safe infrastructure code
- Scalable architecture
- Production-ready security
- Comprehensive monitoring
- Cost optimization strategies
Next Steps:
- Implement CI/CD pipeline with GitHub Actions or AWS CodePipeline
- Add A/B testing for model versions
- Implement caching for frequently requested predictions
- Set up multi-region deployment for global availability
- Add custom domain and SSL certificates
Tags: #AWS #CDK #SageMaker #Lambda #HuggingFace #MachineLearning #MLOps #TypeScript #Python #InfrastructureAsCode #Serverless #AI