Introduction
In distributed systems running on AWS, observability is critical for maintaining reliability, debugging issues, and ensuring optimal performance. A centralized monitoring system provides:
- Unified Visibility: Single pane of glass for all services, applications, and infrastructure
- Proactive Alerting: Detect and respond to issues before they impact users
- Performance Optimization: Identify bottlenecks and optimization opportunities
- Cost Management: Track resource utilization and spending patterns
- Compliance: Meet audit and regulatory requirements for logging
- Troubleshooting: Quickly diagnose and resolve production issues
This guide shows how to build a production-ready centralized monitoring system with Amazon CloudWatch and Grafana, deployed with the AWS CDK (TypeScript). It focuses on cross-service log aggregation, metric collection, appropriate IAM permissions, and actionable dashboards.
Core Philosophy: “You can’t improve what you don’t measure. Good monitoring isn’t about collecting data; it’s about extracting insights that drive action.”
What We’ll Build
A complete centralized monitoring and observability platform featuring:
- CloudWatch Logs Aggregation from Lambda, ECS, API Gateway, and more
- CloudWatch Metrics collection and custom metrics
- Grafana on ECS Fargate for advanced visualization
- CloudWatch as Data Source for Grafana dashboards
- Cross-Account Monitoring capabilities
- IAM Roles and Policies for secure cross-service access
- Automated Alerting via SNS and PagerDuty
- Custom Dashboards for different teams and services
- CloudWatch Logs Insights Queries for log analysis
- Metric Filters for extracting metrics from logs
- Container Insights for ECS/EKS monitoring
- Lambda Insights for serverless monitoring
Architecture Overview
High-Level Architecture
graph TB
subgraph "AWS Services Being Monitored"
Lambda[Lambda Functions]
ECS[ECS/Fargate]
RDS[RDS Databases]
APIGW[API Gateway]
ALB[Application Load Balancer]
S3[S3 Buckets]
DynamoDB[DynamoDB]
SQS[SQS Queues]
end
subgraph "Log Collection Layer"
CWLogs[CloudWatch Logs]
LogGroups[Log Groups]
MetricFilters[Metric Filters]
end
subgraph "Metrics Collection Layer"
CWMetrics[CloudWatch Metrics]
CustomMetrics[Custom Metrics]
ContainerInsights[Container Insights]
LambdaInsights[Lambda Insights]
end
subgraph "Monitoring Platform"
Grafana[Grafana on ECS]
CWDashboards[CloudWatch Dashboards]
LogInsights[CloudWatch Log Insights]
end
subgraph "Alerting & Notifications"
CWAlarms[CloudWatch Alarms]
SNS[SNS Topics]
Lambda2[Alert Lambda]
PagerDuty[PagerDuty]
Slack[Slack]
Email[Email]
end
subgraph "IAM & Security"
MonitoringRole[Monitoring IAM Role]
ServiceRoles[Service Roles]
CrossAccount[Cross-Account Access]
end
Lambda --> CWLogs
ECS --> CWLogs
APIGW --> CWLogs
ALB --> CWLogs
Lambda --> CWMetrics
ECS --> CWMetrics
RDS --> CWMetrics
DynamoDB --> CWMetrics
SQS --> CWMetrics
CWLogs --> LogGroups
LogGroups --> MetricFilters
MetricFilters --> CWMetrics
ECS --> ContainerInsights
Lambda --> LambdaInsights
CWMetrics --> Grafana
CWLogs --> Grafana
ContainerInsights --> Grafana
LambdaInsights --> Grafana
CWMetrics --> CWDashboards
CWLogs --> LogInsights
CWMetrics --> CWAlarms
CWAlarms --> SNS
SNS --> Lambda2
Lambda2 --> PagerDuty
Lambda2 --> Slack
SNS --> Email
MonitoringRole --> CWLogs
MonitoringRole --> CWMetrics
MonitoringRole --> Grafana
ServiceRoles --> CWLogs
ServiceRoles --> CWMetrics
style Grafana fill:#ff6b6b
style CWMetrics fill:#4ecdc4
style CWAlarms fill:#feca57
style MonitoringRole fill:#95e1d3
Data Flow
sequenceDiagram
participant Service as AWS Service
participant CWLogs as CloudWatch Logs
participant MetricFilter as Metric Filter
participant CWMetrics as CloudWatch Metrics
participant Alarm as CloudWatch Alarm
participant SNS as SNS Topic
participant Grafana as Grafana
Service->>CWLogs: Send Logs
CWLogs->>MetricFilter: Process Logs
MetricFilter->>CWMetrics: Extract Metrics
Service->>CWMetrics: Send Metrics
CWMetrics->>Alarm: Evaluate Threshold
Alarm->>SNS: Trigger Alert
SNS->>SNS: Fan out notifications
Grafana->>CWMetrics: Query Metrics
Grafana->>CWLogs: Query Logs
Grafana->>Grafana: Render Dashboard
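The path from log line to alert in the diagram above can be sketched in plain TypeScript. This is only an illustrative model, not how CloudWatch is implemented: `countPerPeriod` stands in for a metric filter with a metric value of 1 (using a simple case-sensitive substring match rather than real filter-pattern semantics), and `breachingPeriods` stands in for the threshold check an alarm performs. Both function names and the `LogEvent` shape are hypothetical.

```typescript
// Hypothetical log event shape used only for this sketch.
interface LogEvent {
  timestampMs: number; // epoch milliseconds
  message: string;
}

// Stand-in for a metric filter: count events containing `term` per period.
function countPerPeriod(
  events: LogEvent[],
  term: string,
  periodSeconds: number
): Map<number, number> {
  const periodMs = periodSeconds * 1000;
  const counts = new Map<number, number>();
  for (const logEvent of events) {
    if (!logEvent.message.includes(term)) continue;
    // Align the event to the start of its period bucket.
    const bucketStart = Math.floor(logEvent.timestampMs / periodMs) * periodMs;
    counts.set(bucketStart, (counts.get(bucketStart) ?? 0) + 1);
  }
  return counts;
}

// Stand-in for an alarm threshold check: which periods exceed the threshold?
function breachingPeriods(counts: Map<number, number>, threshold: number): number[] {
  return Array.from(counts.entries())
    .filter(([, count]) => count > threshold)
    .map(([bucketStart]) => bucketStart)
    .sort((a, b) => a - b);
}
```

Feeding a handful of events through `countPerPeriod` with a 300-second period and then through `breachingPeriods` gives the period start times that would be candidates for an alert.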
Monitoring Strategy: CloudWatch vs Grafana
Feature Comparison
| Feature | CloudWatch | Grafana |
|---|---|---|
| Native AWS Integration | Excellent | Requires setup |
| Custom Dashboards | Good | Excellent |
| Visualization Options | Limited | Extensive |
| Multi-Cloud Support | No (AWS only) | Yes |
| Cost | Pay per metric/log | Infrastructure cost |
| Setup Complexity | Minimal | Moderate |
| Alerting | Native | Advanced |
| Log Analysis | Logs Insights | Via plugins |
| Query Language | Logs Insights | Depends on data source (e.g. PromQL, LogQL) |
| User Management | AWS IAM | Built-in |
| Customization | Limited | Extensive |
Recommended: Hybrid Approach
Use both CloudWatch and Grafana for maximum flexibility:
- CloudWatch: Primary data store and native AWS service monitoring
- Grafana: Advanced visualization and unified dashboard for all data sources
// Hybrid monitoring strategy
CloudWatch (Data Layer)
├── Collect all logs and metrics
├── Native AWS service integration
├── CloudWatch Alarms for critical alerts
└── Logs Insights for ad-hoc queries

Grafana (Visualization Layer)
├── Use CloudWatch as data source
├── Advanced dashboards
├── Custom visualizations
└── Unified view across services
CDK Project Structure
monitoring-system-cdk/
├── bin/
│   └── monitoring-system.ts              # CDK app entry point
├── lib/
│   ├── stacks/
│   │   ├── cloudwatch-stack.ts           # CloudWatch setup
│   │   ├── grafana-stack.ts              # Grafana on ECS
│   │   ├── alerting-stack.ts             # Alarms and SNS
│   │   ├── iam-stack.ts                  # IAM roles and policies
│   │   └── dashboards-stack.ts           # Dashboard definitions
│   ├── constructs/
│   │   ├── log-aggregation.ts            # Log collection construct
│   │   ├── metric-collection.ts          # Metrics construct
│   │   ├── grafana-cluster.ts            # Grafana ECS construct
│   │   └── alert-manager.ts              # Alerting construct
│   ├── config/
│   │   ├── monitoring-config.ts          # Monitoring configuration
│   │   ├── services-config.ts            # Services to monitor
│   │   └── dashboard-config.ts           # Dashboard definitions
│   └── utils/
│       ├── metric-utils.ts               # Metric helpers
│       └── log-utils.ts                  # Log helpers
├── lambda/
│   ├── alert-processor/
│   │   ├── index.ts                      # Alert processing
│   │   └── formatters.ts                 # Alert formatters
│   └── metric-collector/
│       └── index.ts                      # Custom metrics collector
├── dashboards/
│   ├── cloudwatch/
│   │   └── main-dashboard.json
│   └── grafana/
│       ├── service-dashboard.json
│       └── infrastructure-dashboard.json
├── test/
├── cdk.json
├── tsconfig.json
└── package.json
Configuration Design
Monitoring Configuration
// lib/config/monitoring-config.ts
export interface MonitoringConfig {
  logRetention: number; // days
  metricNamespace: string;
  enableDetailedMonitoring: boolean;
  enableContainerInsights: boolean;
  enableLambdaInsights: boolean;
  alerting: AlertingConfig;
  grafana?: GrafanaConfig;
}

export interface AlertingConfig {
  enabled: boolean;
  emailEndpoints: string[];
  slackWebhookUrl?: string;
  pagerDutyIntegrationKey?: string;
  criticalAlarmActions: string[]; // SNS topic ARNs
  warningAlarmActions: string[]; // SNS topic ARNs
}

export interface GrafanaConfig {
  enabled: boolean;
  instanceType: string;
  desiredCount: number;
  domain?: string;
  adminPassword: string;
  oauth?: {
    enabled: boolean;
    provider: 'google' | 'github' | 'cognito';
    clientId: string;
    clientSecret: string;
  };
}

// Typed as a record so each environment is checked against MonitoringConfig
// and optional fields can be read uniformly across environments.
export const monitoringConfigs: Record<'development' | 'production', MonitoringConfig> = {
  development: {
    logRetention: 7, // days
    metricNamespace: 'Development',
    enableDetailedMonitoring: false,
    enableContainerInsights: false,
    enableLambdaInsights: false,
    alerting: {
      enabled: true,
      emailEndpoints: ['dev-team@company.com'],
      criticalAlarmActions: [],
      warningAlarmActions: [],
    },
    grafana: {
      enabled: true,
      instanceType: 't3.small',
      desiredCount: 1,
      adminPassword: 'change-me-dev', // placeholder only; do not commit real passwords
    },
  },
  production: {
    logRetention: 90, // days
    metricNamespace: 'Production',
    enableDetailedMonitoring: true,
    enableContainerInsights: true,
    enableLambdaInsights: true,
    alerting: {
      enabled: true,
      emailEndpoints: ['oncall@company.com', 'platform-team@company.com'],
      slackWebhookUrl: process.env.SLACK_WEBHOOK_URL,
      pagerDutyIntegrationKey: process.env.PAGERDUTY_KEY,
      criticalAlarmActions: ['arn:aws:sns:us-east-1:xxx:critical-alerts'],
      warningAlarmActions: ['arn:aws:sns:us-east-1:xxx:warning-alerts'],
    },
    grafana: {
      enabled: true,
      instanceType: 't3.medium',
      desiredCount: 2,
      domain: 'monitoring.yourdomain.com',
      // Falls back to an empty string when the variable is unset; validate before deploying
      adminPassword: process.env.GRAFANA_ADMIN_PASSWORD || '',
      oauth: {
        enabled: true,
        provider: 'cognito',
        clientId: process.env.COGNITO_CLIENT_ID || '',
        clientSecret: process.env.COGNITO_CLIENT_SECRET || '',
      },
    },
  },
};
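Because the production settings above fall back to an empty string when an environment variable is unset, a deployment could silently ship with blank Grafana credentials. A small pre-deployment check, sketched below, can catch that. The `findBlankGrafanaSecrets` function and the `GrafanaSecretsToCheck` type are hypothetical helpers for illustration and are not part of the project structure described earlier.

```typescript
// Hypothetical subset of the Grafana settings, enough for this check.
interface GrafanaSecretsToCheck {
  enabled: boolean;
  adminPassword: string;
  oauth?: { enabled: boolean; clientId: string; clientSecret: string };
}

// Returns the dotted paths of secrets that are blank while their feature is enabled.
function findBlankGrafanaSecrets(grafana: GrafanaSecretsToCheck | undefined): string[] {
  const blank: string[] = [];
  if (!grafana || !grafana.enabled) return blank;

  const isBlank = (value: string): boolean => value.trim() === '';

  if (isBlank(grafana.adminPassword)) blank.push('grafana.adminPassword');
  if (grafana.oauth && grafana.oauth.enabled) {
    if (isBlank(grafana.oauth.clientId)) blank.push('grafana.oauth.clientId');
    if (isBlank(grafana.oauth.clientSecret)) blank.push('grafana.oauth.clientSecret');
  }
  return blank;
}
```

One possible use is to call this at the top of the CDK app entry point and throw if the returned list is non-empty, so that a missing variable fails synthesis instead of producing a broken deployment.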
Services to Monitor Configuration
// lib/config/services-config.ts
export interface ServiceMonitoringConfig {
  serviceName: string;
  type: 'lambda' | 'ecs' | 'rds' | 'dynamodb' | 'api-gateway' | 'alb' | 'sqs';
  logGroups: string[];
  metrics: MetricConfig[];
  alarms: AlarmConfig[];
}

export interface MetricConfig {
  name: string;
  namespace: string;
  // Most AWS namespaces need dimensions to select a resource (for example
  // ClusterName and ServiceName for AWS/ECS); without them a query may return no data.
  dimensions?: Record<string, string>;
  statistic: 'Average' | 'Sum' | 'Minimum' | 'Maximum' | 'SampleCount';
  period: number; // seconds
  unit?: string;
}

export interface AlarmConfig {
  name: string;
  description: string;
  metric: MetricConfig;
  threshold: number;
  comparisonOperator: 'GreaterThanThreshold' | 'LessThanThreshold' | 'GreaterThanOrEqualToThreshold' | 'LessThanOrEqualToThreshold';
  evaluationPeriods: number;
  datapointsToAlarm?: number;
  treatMissingData?: 'notBreaching' | 'breaching' | 'ignore' | 'missing';
  severity: 'critical' | 'warning' | 'info';
}

export const servicesConfig: ServiceMonitoringConfig[] = [
  {
    serviceName: 'api-service',
    type: 'lambda',
    logGroups: ['/aws/lambda/api-handler', '/aws/lambda/api-authorizer'],
    metrics: [
      {
        name: 'Invocations',
        namespace: 'AWS/Lambda',
        statistic: 'Sum',
        period: 300,
      },
      {
        name: 'Duration',
        namespace: 'AWS/Lambda',
        statistic: 'Average',
        period: 300,
        unit: 'Milliseconds',
      },
      {
        name: 'Errors',
        namespace: 'AWS/Lambda',
        statistic: 'Sum',
        period: 300,
      },
      {
        name: 'Throttles',
        namespace: 'AWS/Lambda',
        statistic: 'Sum',
        period: 300,
      },
    ],
    alarms: [
      {
        name: 'api-service-high-error-rate',
        description: 'API service error rate is too high',
        metric: {
          name: 'Errors',
          namespace: 'AWS/Lambda',
          statistic: 'Sum',
          period: 300,
        },
        threshold: 10, // errors per 5-minute period
        comparisonOperator: 'GreaterThanThreshold',
        evaluationPeriods: 2,
        datapointsToAlarm: 2,
        treatMissingData: 'notBreaching',
        severity: 'critical',
      },
      {
        name: 'api-service-high-duration',
        description: 'API service response time is too slow',
        metric: {
          name: 'Duration',
          namespace: 'AWS/Lambda',
          statistic: 'Average',
          period: 300,
        },
        threshold: 3000, // 3 seconds
        comparisonOperator: 'GreaterThanThreshold',
        evaluationPeriods: 3,
        severity: 'warning',
      },
    ],
  },
  {
    serviceName: 'web-application',
    type: 'ecs',
    logGroups: ['/ecs/web-app'],
    metrics: [
      {
        name: 'CPUUtilization',
        namespace: 'AWS/ECS',
        statistic: 'Average',
        period: 300,
      },
      {
        name: 'MemoryUtilization',
        namespace: 'AWS/ECS',
        statistic: 'Average',
        period: 300,
      },
    ],
    alarms: [
      {
        name: 'ecs-high-cpu',
        description: 'ECS service CPU usage is too high',
        metric: {
          name: 'CPUUtilization',
          namespace: 'AWS/ECS',
          statistic: 'Average',
          period: 300,
        },
        threshold: 80, // percent
        comparisonOperator: 'GreaterThanThreshold',
        evaluationPeriods: 2,
        severity: 'warning',
      },
    ],
  },
  {
    serviceName: 'api-gateway',
    type: 'api-gateway',
    logGroups: ['/aws/apigateway/api-logs'],
    metrics: [
      {
        name: 'Count',
        namespace: 'AWS/ApiGateway',
        statistic: 'Sum',
        period: 300,
      },
      {
        name: '4XXError',
        namespace: 'AWS/ApiGateway',
        statistic: 'Sum',
        period: 300,
      },
      {
        name: '5XXError',
        namespace: 'AWS/ApiGateway',
        statistic: 'Sum',
        period: 300,
      },
      {
        name: 'Latency',
        namespace: 'AWS/ApiGateway',
        statistic: 'Average',
        period: 300,
        unit: 'Milliseconds',
      },
    ],
    alarms: [
      {
        name: 'api-gateway-5xx-errors',
        description: 'API Gateway is returning too many 5XX errors',
        metric: {
          name: '5XXError',
          namespace: 'AWS/ApiGateway',
          statistic: 'Sum',
          period: 300,
        },
        threshold: 5,
        comparisonOperator: 'GreaterThanThreshold',
        evaluationPeriods: 1,
        severity: 'critical',
      },
    ],
  },
  {
    serviceName: 'database',
    type: 'rds',
    logGroups: ['/aws/rds/instance/postgres/postgresql'],
    metrics: [
      {
        name: 'CPUUtilization',
        namespace: 'AWS/RDS',
        statistic: 'Average',
        period: 300,
      },
      {
        name: 'DatabaseConnections',
        namespace: 'AWS/RDS',
        statistic: 'Average',
        period: 300,
      },
      {
        name: 'FreeableMemory',
        namespace: 'AWS/RDS',
        statistic: 'Average',
        period: 300,
      },
      {
        name: 'ReadLatency',
        namespace: 'AWS/RDS',
        statistic: 'Average',
        period: 300,
      },
      {
        name: 'WriteLatency',
        namespace: 'AWS/RDS',
        statistic: 'Average',
        period: 300,
      },
    ],
    alarms: [
      {
        name: 'rds-high-cpu',
        description: 'RDS CPU utilization is too high',
        metric: {
          name: 'CPUUtilization',
          namespace: 'AWS/RDS',
          statistic: 'Average',
          period: 300,
        },
        threshold: 80, // percent
        comparisonOperator: 'GreaterThanThreshold',
        evaluationPeriods: 3,
        severity: 'critical',
      },
      {
        name: 'rds-low-memory',
        description: 'RDS freeable memory is too low',
        metric: {
          name: 'FreeableMemory',
          namespace: 'AWS/RDS',
          statistic: 'Average',
          period: 300,
        },
        threshold: 1000000000, // 1 GB, in bytes
        comparisonOperator: 'LessThanThreshold',
        evaluationPeriods: 2,
        severity: 'warning',
      },
    ],
  },
];
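The alarm settings above combine `evaluationPeriods`, `datapointsToAlarm`, and `treatMissingData`, and their interaction is easier to see in code. The sketch below is a deliberately simplified model of "M out of N" evaluation, not CloudWatch's actual algorithm: the real evaluator can look further back than the last N periods when datapoints are missing, which this model does not attempt. All names (`evaluateAlarmSketch`, `AlarmSketchInput`, and so on) are hypothetical.

```typescript
type ComparisonOperatorName =
  | 'GreaterThanThreshold'
  | 'LessThanThreshold'
  | 'GreaterThanOrEqualToThreshold'
  | 'LessThanOrEqualToThreshold';
type MissingDataPolicy = 'notBreaching' | 'breaching' | 'ignore' | 'missing';
type SketchAlarmState = 'ALARM' | 'OK' | 'INSUFFICIENT_DATA';

interface AlarmSketchInput {
  datapoints: Array<number | null>; // oldest first; null marks a missing datapoint
  threshold: number;
  comparisonOperator: ComparisonOperatorName;
  evaluationPeriods: number; // N
  datapointsToAlarm?: number; // M, defaults to N
  treatMissingData: MissingDataPolicy;
  previousState?: SketchAlarmState; // only consulted for 'ignore'
}

function isBreaching(value: number, threshold: number, op: ComparisonOperatorName): boolean {
  switch (op) {
    case 'GreaterThanThreshold':
      return value > threshold;
    case 'LessThanThreshold':
      return value < threshold;
    case 'GreaterThanOrEqualToThreshold':
      return value >= threshold;
    case 'LessThanOrEqualToThreshold':
      return value <= threshold;
  }
}

// Simplified model: only the most recent N datapoints are considered.
function evaluateAlarmSketch(input: AlarmSketchInput): SketchAlarmState {
  const recentPoints = input.datapoints.slice(-input.evaluationPeriods);
  const required = input.datapointsToAlarm ?? input.evaluationPeriods;
  const present = recentPoints.filter((p): p is number => p !== null);
  const missingCount = recentPoints.length - present.length;

  if (present.length === 0) {
    if (input.treatMissingData === 'missing') return 'INSUFFICIENT_DATA';
    if (input.treatMissingData === 'ignore') return input.previousState ?? 'INSUFFICIENT_DATA';
  }

  let breachingCount = present.filter((p) =>
    isBreaching(p, input.threshold, input.comparisonOperator)
  ).length;
  // With 'breaching', every missing datapoint counts against the alarm.
  if (input.treatMissingData === 'breaching') breachingCount += missingCount;

  return breachingCount >= required ? 'ALARM' : 'OK';
}
```

In this model, an alarm with 2 of 3 datapoints required stays in `OK` when one point breaches and one is missing under `notBreaching`, but moves to `ALARM` under `breaching`, which is why the missing-data policy deserves as much attention as the threshold itself.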
CDK Stack Implementation
IAM Stack - Cross-Service Permissions
// lib/stacks/iam-stack.ts
import * as cdk from 'aws-cdk-lib';
import * as iam from 'aws-cdk-lib/aws-iam';
import { Construct } from 'constructs';

export class IAMStack extends cdk.Stack {
  public readonly monitoringRole: iam.Role;
  public readonly grafanaRole: iam.Role;
  public readonly alertProcessorRole: iam.Role;

  constructor(scope: Construct, id: string, props?: cdk.StackProps) {
    super(scope, id, props);

    // Monitoring Role - for reading logs and metrics from all services
    this.monitoringRole = new iam.Role(this, 'MonitoringRole', {
      roleName: 'centralized-monitoring-role',
      assumedBy: new iam.ServicePrincipal('lambda.amazonaws.com'),
      description: 'Role for centralized monitoring to access logs and metrics',
    });

    // Grant CloudWatch Logs read permissions
    // (resources: ['*'] is broad; scope to specific log groups where possible)
    this.monitoringRole.addToPolicy(
      new iam.PolicyStatement({
        effect: iam.Effect.ALLOW,
        actions: [
          'logs:DescribeLogGroups',
          'logs:DescribeLogStreams',
          'logs:GetLogEvents',
          'logs:FilterLogEvents',
          'logs:StartQuery',
          'logs:StopQuery',
          'logs:GetQueryResults',
          'logs:GetLogRecord',
        ],
        resources: ['*'],
      })
    );

    // Grant CloudWatch Metrics read permissions
    this.monitoringRole.addToPolicy(
      new iam.PolicyStatement({
        effect: iam.Effect.ALLOW,
        actions: [
          'cloudwatch:DescribeAlarms',
          'cloudwatch:GetMetricData',
          'cloudwatch:GetMetricStatistics',
          'cloudwatch:ListMetrics',
          'cloudwatch:DescribeAlarmsForMetric',
        ],
        resources: ['*'],
      })
    );

    // Grant read-only (describe/list) permissions on the monitored AWS services
    this.monitoringRole.addToPolicy(
      new iam.PolicyStatement({
        effect: iam.Effect.ALLOW,
        actions: [
          // Lambda
          'lambda:GetFunction',
          'lambda:ListFunctions',
          // ECS
          'ecs:DescribeClusters',
          'ecs:DescribeServices',
          'ecs:DescribeTasks',
          'ecs:ListClusters',
          'ecs:ListServices',
          'ecs:ListTasks',
          // RDS
          'rds:DescribeDBInstances',
          'rds:DescribeDBClusters',
          // DynamoDB
          'dynamodb:DescribeTable',
          'dynamodb:ListTables',
          // API Gateway
          'apigateway:GET',
          // ALB
          'elasticloadbalancing:DescribeLoadBalancers',
          'elasticloadbalancing:DescribeTargetGroups',
          'elasticloadbalancing:DescribeTargetHealth',
          // SQS
          'sqs:GetQueueAttributes',
          'sqs:ListQueues',
          // S3
          's3:GetBucketLocation',
          's3:ListBucket',
          's3:GetMetricsConfiguration', // IAM action behind the GetBucketMetricsConfiguration API
        ],
        resources: ['*'],
      })
    );

    // Grafana Role - for ECS task
    this.grafanaRole = new iam.Role(this, 'GrafanaRole', {
      roleName: 'grafana-ecs-task-role',
      assumedBy: new iam.ServicePrincipal('ecs-tasks.amazonaws.com'),
      description: 'Role for Grafana to access CloudWatch',
    });

    // Grant Grafana permissions to read CloudWatch metrics and alarms
    this.grafanaRole.addToPolicy(
      new iam.PolicyStatement({
        effect: iam.Effect.ALLOW,
        actions: [
          'cloudwatch:DescribeAlarmsForMetric',
          'cloudwatch:DescribeAlarmHistory',
          'cloudwatch:DescribeAlarms',
          'cloudwatch:ListMetrics',
          'cloudwatch:GetMetricData',
          'cloudwatch:GetMetricStatistics',
        ],
        resources: ['*'],
      })
    );

    // Grant Grafana permissions to query CloudWatch Logs
    this.grafanaRole.addToPolicy(
      new iam.PolicyStatement({
        effect: iam.Effect.ALLOW,
        actions: [
          'logs:DescribeLogGroups',
          'logs:GetLogGroupFields',
          'logs:StartQuery',
          'logs:StopQuery',
          'logs:GetQueryResults',
          'logs:GetLogEvents',
        ],
        resources: ['*'],
      })
    );

    // Resource discovery used by the Grafana CloudWatch data source
    this.grafanaRole.addToPolicy(
      new iam.PolicyStatement({
        effect: iam.Effect.ALLOW,
        actions: [
          'ec2:DescribeRegions',
          'ec2:DescribeInstances',
          'tag:GetResources',
        ],
        resources: ['*'],
      })
    );

    // Alert Processor Role
    this.alertProcessorRole = new iam.Role(this, 'AlertProcessorRole', {
      roleName: 'alert-processor-role',
      assumedBy: new iam.ServicePrincipal('lambda.amazonaws.com'),
      managedPolicies: [
        iam.ManagedPolicy.fromAwsManagedPolicyName(
          'service-role/AWSLambdaBasicExecutionRole'
        ),
      ],
    });

    // Grant SNS publish permissions
    // (consider restricting resources to the alert topic ARNs)
    this.alertProcessorRole.addToPolicy(
      new iam.PolicyStatement({
        effect: iam.Effect.ALLOW,
        actions: ['sns:Publish'],
        resources: ['*'],
      })
    );

    // Outputs
    new cdk.CfnOutput(this, 'MonitoringRoleArn', {
      value: this.monitoringRole.roleArn,
      exportName: 'MonitoringRoleArn',
    });

    new cdk.CfnOutput(this, 'GrafanaRoleArn', {
      value: this.grafanaRole.roleArn,
      exportName: 'GrafanaRoleArn',
    });
  }
}
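The policies above use `resources: ['*']` throughout, which is convenient but broader than most teams want in production. One way to narrow the log-reading statements is to derive log group ARNs from the configured log group names. The helper below is a minimal sketch under that assumption; `logGroupArnsSketch` is a hypothetical name, and the account ID and region in any call are placeholders.

```typescript
// Builds log group ARNs of the form
//   arn:<partition>:logs:<region>:<account>:log-group:<name>:*
// The trailing ":*" covers the log streams inside each group.
function logGroupArnsSketch(
  region: string,
  accountId: string,
  logGroupNames: string[],
  partition: string = 'aws'
): string[] {
  return logGroupNames.map(
    (name) => `arn:${partition}:logs:${region}:${accountId}:log-group:${name}:*`
  );
}
```

The resulting list could replace `'*'` in a statement that grants actions such as `logs:GetLogEvents` or `logs:StartQuery`. Some CloudWatch Logs actions may not support resource-level scoping, so it is worth checking the service authorization reference for each action before removing the wildcard entirely.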
CloudWatch Stack - Log Aggregation and Metrics
// lib/stacks/cloudwatch-stack.ts
import * as cdk from 'aws-cdk-lib';
import * as logs from 'aws-cdk-lib/aws-logs';
import * as cloudwatch from 'aws-cdk-lib/aws-cloudwatch';
import { Construct } from 'constructs';
import { servicesConfig, ServiceMonitoringConfig } from '../config/services-config';
import { monitoringConfigs } from '../config/monitoring-config';

export interface CloudWatchStackProps extends cdk.StackProps {
  environment: 'development' | 'production';
}

export class CloudWatchStack extends cdk.Stack {
  public readonly logGroups: Map<string, logs.LogGroup>;
  public readonly metricFilters: Map<string, logs.MetricFilter>;
  public readonly dashboard: cloudwatch.Dashboard;

  constructor(scope: Construct, id: string, props: CloudWatchStackProps) {
    super(scope, id, props);

    const config = monitoringConfigs[props.environment];

    this.logGroups = new Map();
    this.metricFilters = new Map();

    // Create centralized log groups and metric filters
    servicesConfig.forEach((serviceConfig) => {
      this.setupServiceMonitoring(serviceConfig, config.logRetention);
    });

    // Create CloudWatch Dashboard
    this.dashboard = this.createDashboard(config.metricNamespace);

    // Create custom metrics namespace
    this.createCustomMetricsNamespace(config.metricNamespace);
  }

  private setupServiceMonitoring(
    serviceConfig: ServiceMonitoringConfig,
    retentionDays: number
  ): void {
    // Create the log groups. Deployment fails if a group with the same name
    // already exists (for example, one auto-created by Lambda).
    serviceConfig.logGroups.forEach((logGroupName) => {
      // Unique per log group, so services with several log groups do not
      // produce duplicate construct IDs
      const idSuffix = `${serviceConfig.serviceName}-${logGroupName.replace(/\//g, '-')}`;

      const logGroup = new logs.LogGroup(this, `LogGroup-${idSuffix}`, {
        logGroupName,
        retention: this.getRetention(retentionDays),
        removalPolicy: cdk.RemovalPolicy.DESTROY,
      });

      this.logGroups.set(logGroupName, logGroup);

      // Create metric filters for common patterns
      this.createMetricFiltersForLogGroup(
        logGroup,
        serviceConfig.serviceName,
        serviceConfig.type,
        idSuffix
      );
    });
  }

  private createMetricFiltersForLogGroup(
    logGroup: logs.LogGroup,
    serviceName: string,
    serviceType: string,
    idSuffix: string
  ): void {
    // Error count metric filter. Filters from every log group of a service
    // publish to the same per-service metric name.
    const errorFilter = new logs.MetricFilter(this, `ErrorFilter-${idSuffix}`, {
      logGroup,
      filterPattern: logs.FilterPattern.anyTerm('ERROR', 'Error', 'error', 'Exception'),
      metricNamespace: 'CustomMetrics',
      metricName: `${serviceName}-Errors`,
      metricValue: '1',
      defaultValue: 0,
    });

    this.metricFilters.set(`${idSuffix}-errors`, errorFilter);

    // Warning count metric filter
    const warningFilter = new logs.MetricFilter(this, `WarningFilter-${idSuffix}`, {
      logGroup,
      filterPattern: logs.FilterPattern.anyTerm('WARNING', 'Warning', 'warn'),
      metricNamespace: 'CustomMetrics',
      metricName: `${serviceName}-Warnings`,
      metricValue: '1',
      defaultValue: 0,
    });

    this.metricFilters.set(`${idSuffix}-warnings`, warningFilter);

    // Response time metric filter (for APIs); expects JSON logs with a "duration" field
    if (serviceType === 'lambda' || serviceType === 'api-gateway') {
      const responseTimeFilter = new logs.MetricFilter(
        this,
        `ResponseTimeFilter-${idSuffix}`,
        {
          logGroup,
          filterPattern: logs.FilterPattern.exists('$.duration'),
          metricNamespace: 'CustomMetrics',
          metricName: `${serviceName}-ResponseTime`,
          metricValue: '$.duration',
        }
      );

      this.metricFilters.set(`${idSuffix}-response-time`, responseTimeFilter);
    }

    // Business metrics (e.g., successful transactions)
    const successFilter = new logs.MetricFilter(this, `SuccessFilter-${idSuffix}`, {
      logGroup,
      filterPattern: logs.FilterPattern.anyTerm('SUCCESS', 'Success', 'success'),
      metricNamespace: 'CustomMetrics',
      metricName: `${serviceName}-Success`,
      metricValue: '1',
      defaultValue: 0,
    });

    this.metricFilters.set(`${idSuffix}-success`, successFilter);
  }

  private createDashboard(namespace: string): cloudwatch.Dashboard {
    const dashboard = new cloudwatch.Dashboard(this, 'CentralMonitoringDashboard', {
      dashboardName: `${namespace}-Central-Monitoring`,
    });

    // Add widgets for each service (each addWidgets call starts a new row)
    servicesConfig.forEach((serviceConfig) => {
      const widgets = this.createWidgetsForService(serviceConfig);
      widgets.forEach((widget) => dashboard.addWidgets(widget));
    });

    return dashboard;
  }

  private createWidgetsForService(
    serviceConfig: ServiceMonitoringConfig
  ): cloudwatch.IWidget[] {
    const widgets: cloudwatch.IWidget[] = [];

    // Create a graph widget for each metric
    serviceConfig.metrics.forEach((metricConfig) => {
      const widget = new cloudwatch.GraphWidget({
        title: `${serviceConfig.serviceName} - ${metricConfig.name}`,
        width: 12,
        left: [
          new cloudwatch.Metric({
            namespace: metricConfig.namespace,
            metricName: metricConfig.name,
            dimensionsMap: metricConfig.dimensions,
            statistic: metricConfig.statistic,
            period: cdk.Duration.seconds(metricConfig.period),
            unit: metricConfig.unit as cloudwatch.Unit | undefined,
          }),
        ],
      });

      widgets.push(widget);
    });

    // Add custom metrics from log filters
    const errorMetric = new cloudwatch.Metric({
      namespace: 'CustomMetrics',
      metricName: `${serviceConfig.serviceName}-Errors`,
      statistic: 'Sum',
      period: cdk.Duration.minutes(5),
    });

    const warningMetric = new cloudwatch.Metric({
      namespace: 'CustomMetrics',
      metricName: `${serviceConfig.serviceName}-Warnings`,
      statistic: 'Sum',
      period: cdk.Duration.minutes(5),
    });

    widgets.push(
      new cloudwatch.GraphWidget({
        title: `${serviceConfig.serviceName} - Errors & Warnings`,
        width: 12,
        left: [errorMetric],
        right: [warningMetric],
      })
    );

    return widgets;
  }

  private createCustomMetricsNamespace(namespace: string): void {
    // Create a Lambda function to publish custom metrics
    // This is a placeholder - implement as needed
  }

  private getRetention(days: number): logs.RetentionDays {
    const retentionMap: Record<number, logs.RetentionDays> = {
      1: logs.RetentionDays.ONE_DAY,
      3: logs.RetentionDays.THREE_DAYS,
      5: logs.RetentionDays.FIVE_DAYS,
      7: logs.RetentionDays.ONE_WEEK,
      14: logs.RetentionDays.TWO_WEEKS,
      30: logs.RetentionDays.ONE_MONTH,
      60: logs.RetentionDays.TWO_MONTHS,
      90: logs.RetentionDays.THREE_MONTHS,
      120: logs.RetentionDays.FOUR_MONTHS,
      180: logs.RetentionDays.SIX_MONTHS,
      365: logs.RetentionDays.ONE_YEAR,
    };

    // Values not in the map silently fall back to one week
    return retentionMap[days] || logs.RetentionDays.ONE_WEEK;
  }
}
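The `getRetention` helper above falls back to one week for any value it does not recognize, which could quietly shorten retention if someone configures, say, 45 days. An alternative, sketched below, rounds up to the next supported value and fails loudly on invalid input. The list of supported values reflects the retention periods CloudWatch Logs accepts as far as this sketch assumes; it should be checked against the current documentation. `resolveRetentionDays` is a hypothetical helper name.

```typescript
// Retention periods (in days) assumed to be accepted by CloudWatch Logs.
const SUPPORTED_RETENTION_DAYS: number[] = [
  1, 3, 5, 7, 14, 30, 60, 90, 120, 150, 180, 365, 400, 545, 731,
  1096, 1827, 2192, 2557, 2922, 3288, 3653,
];

// Rounds the requested retention up to the nearest supported value,
// so logs are never kept for less time than was asked for.
function resolveRetentionDays(requestedDays: number): number {
  if (!Number.isInteger(requestedDays) || requestedDays < 1) {
    throw new RangeError(`Retention must be a positive whole number of days, got ${requestedDays}`);
  }
  const resolved = SUPPORTED_RETENTION_DAYS.find((days) => days >= requestedDays);
  if (resolved === undefined) {
    throw new RangeError(`Retention of ${requestedDays} days exceeds the largest supported value`);
  }
  return resolved;
}
```

Rounding up rather than down is a design choice that favors compliance requirements over storage cost; a team with the opposite priority could round down instead.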
Alerting Stack - CloudWatch Alarms and SNS
// lib/stacks/alerting-stack.ts
import * as cdk from 'aws-cdk-lib';
import * as cloudwatch from 'aws-cdk-lib/aws-cloudwatch';
import * as cloudwatch_actions from 'aws-cdk-lib/aws-cloudwatch-actions';
import * as iam from 'aws-cdk-lib/aws-iam';
import * as sns from 'aws-cdk-lib/aws-sns';
import * as sns_subscriptions from 'aws-cdk-lib/aws-sns-subscriptions';
import * as lambda from 'aws-cdk-lib/aws-lambda';
import { NodejsFunction } from 'aws-cdk-lib/aws-lambda-nodejs';
import { Construct } from 'constructs';
import * as path from 'path';
import { servicesConfig, ServiceMonitoringConfig } from '../config/services-config';
import { monitoringConfigs } from '../config/monitoring-config';

export interface AlertingStackProps extends cdk.StackProps {
  environment: 'development' | 'production';
  alertProcessorRole: iam.Role;
}

export class AlertingStack extends cdk.Stack {
  public readonly criticalTopic: sns.Topic;
  public readonly warningTopic: sns.Topic;
  public readonly alarms: Map<string, cloudwatch.Alarm>;

  constructor(scope: Construct, id: string, props: AlertingStackProps) {
    super(scope, id, props);

    const config = monitoringConfigs[props.environment];
    this.alarms = new Map();

    // Create SNS topics for different severity levels
    this.criticalTopic = new sns.Topic(this, 'CriticalAlertsTopic', {
      topicName: 'monitoring-critical-alerts',
      displayName: 'Critical Alerts',
    });

    this.warningTopic = new sns.Topic(this, 'WarningAlertsTopic', {
      topicName: 'monitoring-warning-alerts',
      displayName: 'Warning Alerts',
    });

    // Subscribe emails to topics
    if (config.alerting.enabled) {
      config.alerting.emailEndpoints.forEach((email) => {
        this.criticalTopic.addSubscription(
          new sns_subscriptions.EmailSubscription(email)
        );
        this.warningTopic.addSubscription(
          new sns_subscriptions.EmailSubscription(email)
        );
      });
    }

    // Create alert processor Lambda
    const alertProcessor = this.createAlertProcessor(props.alertProcessorRole, config);

    // Subscribe Lambda to SNS topics
    this.criticalTopic.addSubscription(
      new sns_subscriptions.LambdaSubscription(alertProcessor)
    );
    this.warningTopic.addSubscription(
      new sns_subscriptions.LambdaSubscription(alertProcessor)
    );

    // Create alarms for all services
    servicesConfig.forEach((serviceConfig) => {
      this.createAlarmsForService(serviceConfig);
    });

    // Create composite alarms
    this.createCompositeAlarms();
  }

  private createAlertProcessor(
    role: iam.Role,
    config: any
  ): lambda.Function {
    const alertProcessor = new NodejsFunction(this, 'AlertProcessor', {
      runtime: lambda.Runtime.NODEJS_20_X,
      handler: 'handler',
      entry: path.join(__dirname, '../../lambda/alert-processor/index.ts'),
      timeout: cdk.Duration.seconds(30),
      memorySize: 256,
      role,
      // Note: environment variables are stored in plain text in the function
      // configuration; consider Secrets Manager for these keys
      environment: {
        SLACK_WEBHOOK_URL: config.alerting.slackWebhookUrl || '',
        PAGERDUTY_KEY: config.alerting.pagerDutyIntegrationKey || '',
      },
    });

    return alertProcessor;
  }

  private createAlarmsForService(
    serviceConfig: ServiceMonitoringConfig
  ): void {
    serviceConfig.alarms.forEach((alarmConfig) => {
      const metric = new cloudwatch.Metric({
        namespace: alarmConfig.metric.namespace,
        metricName: alarmConfig.metric.name,
        dimensionsMap: alarmConfig.metric.dimensions,
        statistic: alarmConfig.metric.statistic,
        period: cdk.Duration.seconds(alarmConfig.metric.period),
      });

      const alarm = new cloudwatch.Alarm(
        this,
        `Alarm-${serviceConfig.serviceName}-${alarmConfig.name}`,
        {
          alarmName: alarmConfig.name,
          alarmDescription: alarmConfig.description,
          metric,
          threshold: alarmConfig.threshold,
          comparisonOperator: this.getComparisonOperator(
            alarmConfig.comparisonOperator
          ),
          evaluationPeriods: alarmConfig.evaluationPeriods,
          datapointsToAlarm: alarmConfig.datapointsToAlarm,
          treatMissingData: this.getTreatMissingData(alarmConfig.treatMissingData),
        }
      );

      // Add alarm actions based on severity ('info' alarms notify no one)
      if (alarmConfig.severity === 'critical') {
        alarm.addAlarmAction(new cloudwatch_actions.SnsAction(this.criticalTopic));
      } else if (alarmConfig.severity === 'warning') {
        alarm.addAlarmAction(new cloudwatch_actions.SnsAction(this.warningTopic));
      }

      this.alarms.set(`${serviceConfig.serviceName}-${alarmConfig.name}`, alarm);
    });
  }

  private createCompositeAlarms(): void {
    // Select the error-rate alarms by map key. alarm.alarmName is an unresolved
    // token at synthesis time, so string matching on it would never succeed.
    const serviceDownAlarms = Array.from(this.alarms.entries())
      .filter(([key]) => key.includes('high-error-rate'))
      .map(([, alarm]) => alarm);

    if (serviceDownAlarms.length > 0) {
      // anyOf fires when at least one child alarm is in ALARM;
      // use AlarmRule.allOf to require every child alarm instead.
      new cloudwatch.CompositeAlarm(this, 'MultipleServicesDown', {
        compositeAlarmName: 'multiple-services-down',
        alarmDescription: 'Multiple services are experiencing high error rates',
        alarmRule: cloudwatch.AlarmRule.anyOf(
          ...serviceDownAlarms.map((alarm) =>
            cloudwatch.AlarmRule.fromAlarm(alarm, cloudwatch.AlarmState.ALARM)
          )
        ),
      });
    }
  }

  private getComparisonOperator(
    operator: string
  ): cloudwatch.ComparisonOperator {
    const operatorMap: Record<string, cloudwatch.ComparisonOperator> = {
      GreaterThanThreshold: cloudwatch.ComparisonOperator.GREATER_THAN_THRESHOLD,
      LessThanThreshold: cloudwatch.ComparisonOperator.LESS_THAN_THRESHOLD,
      GreaterThanOrEqualToThreshold:
        cloudwatch.ComparisonOperator.GREATER_THAN_OR_EQUAL_TO_THRESHOLD,
      LessThanOrEqualToThreshold:
        cloudwatch.ComparisonOperator.LESS_THAN_OR_EQUAL_TO_THRESHOLD,
    };

    return operatorMap[operator];
  }

  private getTreatMissingData(
    treatment: string | undefined
  ): cloudwatch.TreatMissingData {
    // Default to notBreaching when the config leaves it unset
    if (!treatment) return cloudwatch.TreatMissingData.NOT_BREACHING;

    const treatmentMap: Record<string, cloudwatch.TreatMissingData> = {
      notBreaching: cloudwatch.TreatMissingData.NOT_BREACHING,
      breaching: cloudwatch.TreatMissingData.BREACHING,
      ignore: cloudwatch.TreatMissingData.IGNORE,
      missing: cloudwatch.TreatMissingData.MISSING,
    };

    return treatmentMap[treatment];
  }
}
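The composite alarm in this stack is named for multiple services being down, yet an "any of" rule fires as soon as a single child alarm is in the ALARM state. The difference is easiest to see in the rule expressions themselves, which CloudWatch composite alarms write as `ALARM("name")` terms joined with `AND` and `OR`. The sketch below renders two such expressions as plain strings; the function names are hypothetical, and the "at least two" rule is one possible reading of "multiple", not something the stack above implements.

```typescript
// Renders a single child-alarm term of a composite alarm rule.
function alarmTermSketch(alarmName: string): string {
  return `ALARM("${alarmName}")`;
}

// Fires when at least one of the named alarms is in ALARM.
function anyOfRuleSketch(alarmNames: string[]): string {
  return alarmNames.map(alarmTermSketch).join(' OR ');
}

// Fires only when at least two of the named alarms are in ALARM at once,
// built as an OR over every pair of alarms.
function atLeastTwoRuleSketch(alarmNames: string[]): string {
  if (alarmNames.length < 2) {
    throw new Error('At least two alarm names are required');
  }
  const pairs: string[] = [];
  alarmNames.forEach((first, index) => {
    alarmNames.slice(index + 1).forEach((second) => {
      pairs.push(`(${alarmTermSketch(first)} AND ${alarmTermSketch(second)})`);
    });
  });
  return pairs.join(' OR ');
}
```

The pairwise form grows quadratically with the number of alarms, so it only suits a small set of service-level alarms. In CDK the same shape can be built by nesting `AlarmRule.allOf` inside `AlarmRule.anyOf`.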
Grafana Stack - ECS Deployment
// lib/stacks/grafana-stack.ts
import * as cdk from 'aws-cdk-lib';
import * as ec2 from 'aws-cdk-lib/aws-ec2';
import * as ecs from 'aws-cdk-lib/aws-ecs';
import * as ecs_patterns from 'aws-cdk-lib/aws-ecs-patterns';
import * as efs from 'aws-cdk-lib/aws-efs';
import * as secretsmanager from 'aws-cdk-lib/aws-secretsmanager';
import * as iam from 'aws-cdk-lib/aws-iam';
import { Construct } from 'constructs';
import { monitoringConfigs } from '../config/monitoring-config';

export interface GrafanaStackProps extends cdk.StackProps {
  environment: 'development' | 'production';
  vpc: ec2.Vpc;
  grafanaRole: iam.Role;
}

export class GrafanaStack extends cdk.Stack {
  // Optional because the constructor returns early when Grafana is disabled.
  public readonly service?: ecs_patterns.ApplicationLoadBalancedFargateService;
  public readonly url?: string;

  constructor(scope: Construct, id: string, props: GrafanaStackProps) {
    super(scope, id, props);

    const config = monitoringConfigs[props.environment];

    if (!config.grafana?.enabled) {
      console.log('Grafana is disabled for this environment');
      return;
    }

    const { vpc, grafanaRole } = props;
33
    // Create ECS Cluster
    const cluster = new ecs.Cluster(this, 'GrafanaCluster', {
      vpc,
      clusterName: 'monitoring-grafana-cluster',
      containerInsights: config.enableContainerInsights,
    });

    // Create EFS for Grafana data persistence
    const fileSystem = new efs.FileSystem(this, 'GrafanaEFS', {
      vpc,
      encrypted: true,
      lifecyclePolicy: efs.LifecyclePolicy.AFTER_14_DAYS,
      performanceMode: efs.PerformanceMode.GENERAL_PURPOSE,
      throughputMode: efs.ThroughputMode.BURSTING,
      removalPolicy: cdk.RemovalPolicy.RETAIN,
    });

    // Create Secrets Manager secret for Grafana admin password
    const grafanaSecret = new secretsmanager.Secret(this, 'GrafanaAdminPassword', {
      secretName: 'grafana-admin-password',
      generateSecretString: {
        secretStringTemplate: JSON.stringify({ username: 'admin' }),
        generateStringKey: 'password',
        excludePunctuation: true,
        passwordLength: 16,
      },
    });
61
    // Create Fargate service with ALB
    this.service = new ecs_patterns.ApplicationLoadBalancedFargateService(
      this,
      'GrafanaService',
      {
        cluster,
        serviceName: 'grafana',
        desiredCount: config.grafana.desiredCount,
        cpu: this.getCpuFromInstanceType(config.grafana.instanceType),
        memoryLimitMiB: this.getMemoryFromInstanceType(config.grafana.instanceType),
        taskImageOptions: {
          // Consider pinning a specific image tag instead of `latest`
          // so that production deployments are reproducible.
          image: ecs.ContainerImage.fromRegistry('grafana/grafana:latest'),
          containerPort: 3000,
          taskRole: grafanaRole,
          environment: {
            GF_SERVER_ROOT_URL: config.grafana.domain
              ? `https://${config.grafana.domain}`
              : '',
            GF_AUTH_ANONYMOUS_ENABLED: 'false',
            GF_SECURITY_ADMIN_USER: 'admin',
            GF_INSTALL_PLUGINS: 'grafana-clock-panel,grafana-simple-json-datasource,grafana-piechart-panel',
            // CloudWatch data source configuration
            GF_AWS_DEFAULT_REGION: cdk.Aws.REGION,
            // Maps to assume_role_enabled in Grafana's [aws] config section
            GF_AWS_ASSUME_ROLE_ENABLED: 'true',
          },
          secrets: {
            GF_SECURITY_ADMIN_PASSWORD: ecs.Secret.fromSecretsManager(
              grafanaSecret,
              'password'
            ),
          },
        },
        publicLoadBalancer: true,
      }
    );
97
    // Configure volume mount for EFS
    const volumeName = 'grafana-storage';

    this.service.taskDefinition.addVolume({
      name: volumeName,
      efsVolumeConfiguration: {
        fileSystemId: fileSystem.fileSystemId,
        transitEncryption: 'ENABLED',
      },
    });

    this.service.taskDefinition.defaultContainer?.addMountPoints({
      sourceVolume: volumeName,
      containerPath: '/var/lib/grafana',
      readOnly: false,
    });

    // Allow the Fargate tasks (not the ALB) to reach EFS on the NFS port
    fileSystem.connections.allowDefaultPortFrom(this.service.service.connections);

    // Configure health check
    this.service.targetGroup.configureHealthCheck({
      path: '/api/health',
      interval: cdk.Duration.seconds(30),
      timeout: cdk.Duration.seconds(5),
      healthyThresholdCount: 2,
      unhealthyThresholdCount: 3,
    });

    // Auto scaling
    const scaling = this.service.service.autoScaleTaskCount({
      minCapacity: 1,
      maxCapacity: config.grafana.desiredCount * 2,
    });

    scaling.scaleOnCpuUtilization('CpuScaling', {
      targetUtilizationPercent: 70,
      scaleInCooldown: cdk.Duration.seconds(300),
      scaleOutCooldown: cdk.Duration.seconds(60),
    });

    scaling.scaleOnMemoryUtilization('MemoryScaling', {
      targetUtilizationPercent: 80,
      scaleInCooldown: cdk.Duration.seconds(300),
      scaleOutCooldown: cdk.Duration.seconds(60),
    });

    this.url = this.service.loadBalancer.loadBalancerDnsName;

    // Outputs
    new cdk.CfnOutput(this, 'GrafanaURL', {
      value: `http://${this.url}`,
      description: 'Grafana Dashboard URL',
    });

    new cdk.CfnOutput(this, 'GrafanaAdminSecretArn', {
      value: grafanaSecret.secretArn,
      description: 'Grafana Admin Password Secret ARN',
    });
  }
158
  // Maps an EC2-style size label from the config to Fargate CPU units.
  private getCpuFromInstanceType(instanceType: string): number {
    const cpuMap: Record<string, number> = {
      't3.small': 512,
      't3.medium': 1024,
      't3.large': 2048,
    };

    return cpuMap[instanceType] || 512;
  }

  // Maps the same size label to Fargate memory in MiB.
  private getMemoryFromInstanceType(instanceType: string): number {
    const memoryMap: Record<string, number> = {
      't3.small': 2048,
      't3.medium': 4096,
      't3.large': 8192,
    };

    return memoryMap[instanceType] || 2048;
  }
}
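Fargate only accepts certain CPU and memory pairings, so the two lookup tables above have to stay in step with each other. As a rough guard, a check like the one below could run before the values reach the service definition. This is a sketch: `isValidFargateSize` and `FARGATE_SIZES` are hypothetical names, and the table only covers task sizes up to 4 vCPU.

```typescript
// Builds an inclusive list of values from min to max in fixed steps.
function range(min: number, max: number, step: number): number[] {
  const out: number[] = [];
  for (let v = min; v <= max; v += step) out.push(v);
  return out;
}

// Memory values (MiB) accepted for each CPU setting (CPU units), up to 4 vCPU.
const FARGATE_SIZES: Record<number, number[]> = {
  256: [512, 1024, 2048],
  512: range(1024, 4096, 1024),
  1024: range(2048, 8192, 1024),
  2048: range(4096, 16384, 1024),
  4096: range(8192, 30720, 1024),
};

// Returns true when the CPU/memory pair is one Fargate will accept.
function isValidFargateSize(cpu: number, memoryMiB: number): boolean {
  return (FARGATE_SIZES[cpu] ?? []).indexOf(memoryMiB) !== -1;
}

// The three pairings used by the stack's lookup tables.
console.log(isValidFargateSize(512, 2048));
console.log(isValidFargateSize(1024, 4096));
console.log(isValidFargateSize(2048, 8192));
```

All three pairings from `getCpuFromInstanceType` and `getMemoryFromInstanceType` pass this check, while a mismatched pair such as 512 CPU units with 8192 MiB does not.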
Lambda Functions Implementation
Alert Processor Lambda
// lambda/alert-processor/index.ts
import { SNSEvent, SNSHandler } from 'aws-lambda';
import axios from 'axios';

// Shape of the JSON that CloudWatch publishes to SNS on an alarm state change.
interface CloudWatchAlarm {
  AlarmName: string;
  AlarmDescription: string;
  AWSAccountId: string;
  NewStateValue: string;
  NewStateReason: string;
  StateChangeTime: string;
  Region: string;
  OldStateValue: string;
  Trigger: {
    MetricName: string;
    Namespace: string;
    StatisticType: string;
    Statistic: string;
    Unit: string | null;
    Dimensions: Array<{ name: string; value: string }>;
    Period: number;
    EvaluationPeriods: number;
    ComparisonOperator: string;
    Threshold: number;
  };
}

export const handler: SNSHandler = async (event: SNSEvent) => {
  console.log('Alert Processor received event:', JSON.stringify(event, null, 2));

  for (const record of event.Records) {
    try {
      const message = JSON.parse(record.Sns.Message) as CloudWatchAlarm;

      // Process alarm
      await processAlarm(message);

      // Send to Slack
      if (process.env.SLACK_WEBHOOK_URL) {
        await sendSlackNotification(message);
      }

      // Send to PagerDuty
      if (process.env.PAGERDUTY_KEY && message.NewStateValue === 'ALARM') {
        await sendPagerDutyAlert(message);
      }
    } catch (error) {
      // Log and continue so one bad record does not block the rest of the batch.
      console.error('Error processing alarm:', error);
    }
  }
};
52
async function processAlarm(alarm: CloudWatchAlarm): Promise<void> {
  console.log('Processing alarm:', alarm.AlarmName);
  console.log('State:', alarm.OldStateValue, '->', alarm.NewStateValue);
  console.log('Reason:', alarm.NewStateReason);

  // Add custom processing logic here
  // e.g., Update dashboard, create ticket, trigger auto-remediation
}

async function sendSlackNotification(alarm: CloudWatchAlarm): Promise<void> {
  const webhookUrl = process.env.SLACK_WEBHOOK_URL;
  if (!webhookUrl) return;

  const color = alarm.NewStateValue === 'ALARM' ? '#ff0000' : '#36a64f';
  const emoji = alarm.NewStateValue === 'ALARM' ? ':rotating_light:' : ':white_check_mark:';

  const message = {
    text: `${emoji} CloudWatch Alarm ${alarm.NewStateValue}`,
    attachments: [
      {
        color,
        title: alarm.AlarmName,
        text: alarm.AlarmDescription,
        fields: [
          {
            title: 'State',
            value: `${alarm.OldStateValue} → ${alarm.NewStateValue}`,
            short: true,
          },
          {
            title: 'Region',
            value: alarm.Region,
            short: true,
          },
          {
            title: 'Metric',
            value: `${alarm.Trigger.Namespace}/${alarm.Trigger.MetricName}`,
            short: true,
          },
          {
            title: 'Threshold',
            value: `${alarm.Trigger.ComparisonOperator} ${alarm.Trigger.Threshold}`,
            short: true,
          },
          {
            title: 'Reason',
            value: alarm.NewStateReason,
            short: false,
          },
        ],
        footer: 'AWS CloudWatch',
        // Slack expects the timestamp in epoch seconds
        ts: Math.floor(new Date(alarm.StateChangeTime).getTime() / 1000),
      },
    ],
  };

  try {
    await axios.post(webhookUrl, message);
    console.log('Slack notification sent successfully');
  } catch (error) {
    console.error('Error sending Slack notification:', error);
  }
}
116
async function sendPagerDutyAlert(alarm: CloudWatchAlarm): Promise<void> {
  const integrationKey = process.env.PAGERDUTY_KEY;
  if (!integrationKey) return;

  const event = {
    routing_key: integrationKey,
    event_action: 'trigger',
    // Reusing the alarm name groups repeat notifications into one incident
    dedup_key: alarm.AlarmName,
    payload: {
      summary: `${alarm.AlarmName}: ${alarm.NewStateReason}`,
      severity: 'critical',
      source: alarm.Region,
      custom_details: {
        alarm_name: alarm.AlarmName,
        alarm_description: alarm.AlarmDescription,
        metric: `${alarm.Trigger.Namespace}/${alarm.Trigger.MetricName}`,
        threshold: alarm.Trigger.Threshold,
        current_state: alarm.NewStateValue,
        previous_state: alarm.OldStateValue,
      },
    },
  };

  try {
    await axios.post('https://events.pagerduty.com/v2/enqueue', event);
    console.log('PagerDuty alert sent successfully');
  } catch (error) {
    console.error('Error sending PagerDuty alert:', error);
  }
}
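The processor above only ever sends `trigger` events, so a PagerDuty incident stays open after the alarm returns to OK. One possible extension is to map the alarm state to an Events API v2 action and reuse the same `dedup_key` so that a later `resolve` closes the matching incident. The sketch below shows just that mapping; `toPagerDutyEvent` and `PagerDutyEventStub` are hypothetical names, and the `payload` block that a `trigger` event needs is left out for brevity.

```typescript
type AlarmState = 'ALARM' | 'OK' | 'INSUFFICIENT_DATA';
type PagerDutyAction = 'trigger' | 'resolve';

// Minimal subset of an Events API v2 request body (no payload).
interface PagerDutyEventStub {
  routing_key: string;
  event_action: PagerDutyAction;
  dedup_key: string;
}

// Returns the event to send for a state change, or null when nothing should be sent.
function toPagerDutyEvent(
  routingKey: string,
  alarmName: string,
  state: AlarmState
): PagerDutyEventStub | null {
  // INSUFFICIENT_DATA is treated as "no change" in this sketch.
  if (state === 'INSUFFICIENT_DATA') return null;
  return {
    routing_key: routingKey,
    event_action: state === 'ALARM' ? 'trigger' : 'resolve',
    dedup_key: alarmName,
  };
}

// Placeholder routing key and alarm name.
console.log(toPagerDutyEvent('example-routing-key', 'orders-high-error-rate', 'OK'));
```

To use it, the handler would also need to call the PagerDuty path for OK transitions rather than only when `NewStateValue === 'ALARM'`.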
Grafana Configuration
CloudWatch Data Source Configuration
// dashboards/grafana/datasource-config.json
{
  "name": "CloudWatch",
  "type": "cloudwatch",
  "access": "proxy",
  "jsonData": {
    "authType": "default",
    "defaultRegion": "us-east-1"
  }
}
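If the data source definition is generated rather than written by hand, for example one per region or per monitored account, a small builder keeps the entries consistent. The sketch below assumes the same fields as the JSON above plus an optional `assumeRoleArn` in `jsonData` for cross-account access; `cloudWatchDataSource` is a hypothetical helper and the ARN shown is a placeholder.

```typescript
interface CloudWatchDataSource {
  name: string;
  type: 'cloudwatch';
  access: 'proxy';
  jsonData: {
    authType: 'default';
    defaultRegion: string;
    assumeRoleArn?: string;
  };
}

// Builds a data source definition; assumeRoleArn is only included when provided.
function cloudWatchDataSource(
  name: string,
  region: string,
  assumeRoleArn?: string
): CloudWatchDataSource {
  return {
    name,
    type: 'cloudwatch',
    access: 'proxy',
    jsonData: {
      authType: 'default',
      defaultRegion: region,
      ...(assumeRoleArn ? { assumeRoleArn } : {}),
    },
  };
}

// Same-account source, matching the JSON above.
console.log(JSON.stringify(cloudWatchDataSource('CloudWatch', 'us-east-1'), null, 2));
```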
Sample Grafana Dashboard
// dashboards/grafana/service-dashboard.json
{
  "dashboard": {
    "title": "Service Monitoring Dashboard",
    "tags": ["monitoring", "services"],
    "timezone": "browser",
    "panels": [
      {
        "id": 1,
        "title": "Lambda Invocations",
        "type": "graph",
        "datasource": "CloudWatch",
        "targets": [
          {
            "namespace": "AWS/Lambda",
            "metricName": "Invocations",
            "dimensions": {
              "FunctionName": "*"
            },
            "statistics": ["Sum"],
            "period": "300"
          }
        ],
        "gridPos": { "x": 0, "y": 0, "w": 12, "h": 8 }
      },
      {
        "id": 2,
        "title": "Lambda Errors",
        "type": "graph",
        "datasource": "CloudWatch",
        "targets": [
          {
            "namespace": "AWS/Lambda",
            "metricName": "Errors",
            "dimensions": {
              "FunctionName": "*"
            },
            "statistics": ["Sum"],
            "period": "300"
          }
        ],
        "gridPos": { "x": 12, "y": 0, "w": 12, "h": 8 }
      },
      {
        "id": 3,
        "title": "API Gateway Requests",
        "type": "graph",
        "datasource": "CloudWatch",
        "targets": [
          {
            "namespace": "AWS/ApiGateway",
            "metricName": "Count",
            "statistics": ["Sum"],
            "period": "300"
          }
        ],
        "gridPos": { "x": 0, "y": 8, "w": 12, "h": 8 }
      },
      {
        "id": 4,
        "title": "RDS CPU Utilization",
        "type": "graph",
        "datasource": "CloudWatch",
        "targets": [
          {
            "namespace": "AWS/RDS",
            "metricName": "CPUUtilization",
            "statistics": ["Average"],
            "period": "300"
          }
        ],
        "gridPos": { "x": 12, "y": 8, "w": 12, "h": 8 }
      }
    ]
  }
}
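The `gridPos` values in the sample follow a regular pattern: Grafana's grid is 24 units wide, and each panel occupies half of it at a height of 8. When dashboards are generated from code, the positions can be computed from the panel index instead of being typed by hand. A minimal sketch, with `gridPosFor` as a hypothetical helper:

```typescript
interface GridPos {
  x: number;
  y: number;
  w: number;
  h: number;
}

// Computes the position of the panel at `index` in a left-to-right, top-to-bottom grid.
function gridPosFor(index: number, columns = 2, height = 8): GridPos {
  const width = 24 / columns; // Grafana's dashboard grid is 24 units wide
  return {
    x: (index % columns) * width,
    y: Math.floor(index / columns) * height,
    w: width,
    h: height,
  };
}

// Positions for the four panels in the sample dashboard.
for (let i = 0; i < 4; i++) {
  console.log(i, JSON.stringify(gridPosFor(i)));
}
```

With the defaults, indexes 0 to 3 reproduce the four `gridPos` objects used in the sample dashboard.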
Main CDK App
#!/usr/bin/env node
// bin/monitoring-system.ts
import 'source-map-support/register';
import * as cdk from 'aws-cdk-lib';
import * as ec2 from 'aws-cdk-lib/aws-ec2';
import { IAMStack } from '../lib/stacks/iam-stack';
import { CloudWatchStack } from '../lib/stacks/cloudwatch-stack';
import { AlertingStack } from '../lib/stacks/alerting-stack';
import { GrafanaStack } from '../lib/stacks/grafana-stack';

const app = new cdk.App();
const environment = app.node.tryGetContext('environment') || 'development';

const env = {
  account: process.env.CDK_DEFAULT_ACCOUNT,
  region: process.env.CDK_DEFAULT_REGION || 'us-east-1',
};

// IAM Stack - Create roles first
const iamStack = new IAMStack(app, 'MonitoringIAMStack', { env });

// CloudWatch Stack - Log aggregation and metrics
const cloudWatchStack = new CloudWatchStack(app, 'MonitoringCloudWatchStack', {
  environment,
  env,
});

// Alerting Stack - Alarms and notifications
const alertingStack = new AlertingStack(app, 'MonitoringAlertingStack', {
  environment,
  alertProcessorRole: iamStack.alertProcessorRole,
  env,
});

alertingStack.addDependency(cloudWatchStack);

// VPC for Grafana (optional; an existing VPC can be used instead).
// A Vpc must live inside a Stack, so it gets a small stack of its own
// rather than being attached directly to the App.
const networkStack = new cdk.Stack(app, 'MonitoringNetworkStack', { env });
const vpc = new ec2.Vpc(networkStack, 'MonitoringVPC', {
  maxAzs: 2,
  natGateways: 1,
});

// Grafana Stack - Advanced visualization
const grafanaStack = new GrafanaStack(app, 'MonitoringGrafanaStack', {
  environment,
  vpc,
  grafanaRole: iamStack.grafanaRole,
  env,
});

grafanaStack.addDependency(iamStack);

// Tags
cdk.Tags.of(app).add('Project', 'CentralizedMonitoring');
cdk.Tags.of(app).add('Environment', environment);
cdk.Tags.of(app).add('ManagedBy', 'CDK');

app.synth();
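`tryGetContext` returns an untyped value, so a typo such as `--context environment=prod` would flow straight into stacks that expect `'development' | 'production'`. A small validation step before constructing the stacks can catch this early. The sketch below is one way to do it; `parseEnvironment` is a hypothetical helper, not something the CDK provides.

```typescript
type Environment = 'development' | 'production';

const ENVIRONMENTS: readonly Environment[] = ['development', 'production'];

// Narrows an untyped context value to a known environment, or throws.
function parseEnvironment(
  value: unknown,
  fallback: Environment = 'development'
): Environment {
  // Treat a missing or empty context value as "use the default".
  if (value === undefined || value === null || value === '') return fallback;
  if (
    typeof value === 'string' &&
    (ENVIRONMENTS as readonly string[]).indexOf(value) !== -1
  ) {
    return value as Environment;
  }
  throw new Error(
    `Unknown environment "${String(value)}"; expected one of: ${ENVIRONMENTS.join(', ')}`
  );
}

console.log(parseEnvironment(undefined));
console.log(parseEnvironment('production'));
```

In the app entry point this would wrap the context lookup, for example `parseEnvironment(app.node.tryGetContext('environment'))`.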
Log Insights Queries
Useful Query Examples
// Common CloudWatch Logs Insights queries

// 1. Error analysis
const errorQuery = `
fields @timestamp, @message
| filter @message like /ERROR/
| stats count() by bin(5m)
`;

// 2. Slow invocations (@duration is reported in milliseconds)
const slowQueryLog = `
fields @timestamp, @message, @duration
| filter @duration > 1000
| sort @duration desc
| limit 100
`;

// 3. API request analysis
const apiAnalysisQuery = `
fields @timestamp, method, path, statusCode, duration
| filter statusCode >= 400
| stats count() by statusCode, bin(5m)
`;

// 4. Lambda cold starts
const coldStartQuery = `
fields @timestamp, @message, @initDuration
| filter @type = "REPORT"
| filter @initDuration > 0
| stats count(), avg(@initDuration), max(@initDuration) by bin(1h)
`;

// 5. Top error messages
const topErrorsQuery = `
fields @message
| filter @message like /ERROR/
| stats count() as error_count by @message
| sort error_count desc
| limit 10
`;
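When these queries are run programmatically, the Logs Insights API returns each result row as a list of `{ field, value }` pairs rather than as a plain object. A small converter makes the rows easier to work with. The sketch below assumes that row shape and drops the `@ptr` pointer field; `rowsToObjects` and `ResultField` are hypothetical names.

```typescript
// One cell of a Logs Insights result row.
interface ResultField {
  field?: string;
  value?: string;
}

// Converts rows of field/value pairs into plain objects keyed by field name.
function rowsToObjects(results: ResultField[][]): Array<Record<string, string>> {
  return results.map((row) => {
    const obj: Record<string, string> = {};
    for (const { field, value } of row) {
      // Skip incomplete cells and the internal record pointer.
      if (field !== undefined && field !== '@ptr' && value !== undefined) {
        obj[field] = value;
      }
    }
    return obj;
  });
}

// Example with made-up rows in the shape described above.
const sampleRows: ResultField[][] = [
  [
    { field: 'statusCode', value: '500' },
    { field: 'count()', value: '12' },
    { field: '@ptr', value: 'opaque-pointer' },
  ],
];
console.log(rowsToObjects(sampleRows));
```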
Cost Optimization
Cost Breakdown and Strategies
| Component | Cost Factor | Optimization Strategy |
|---|---|---|
| CloudWatch Logs | Ingestion + Storage | Adjust retention, use metric filters |
| CloudWatch Metrics | Number of metrics | Use custom metrics wisely, aggregation |
| CloudWatch Alarms | Number of alarms | Composite alarms, reduce evaluation frequency |
| Grafana (ECS) | Fargate task hours | Right-size tasks, use Fargate Spot for dev |
| Data Transfer | Cross-region/AZ | Keep monitoring in same region |
| API Calls | CloudWatch API calls | Cache dashboard data, batch queries |
Cost Optimization Strategies
// 1. Log retention policies
const logGroup = new logs.LogGroup(this, 'LogGroup', {
  retention: logs.RetentionDays.ONE_WEEK, // Shorter for dev
});

// 2. Metric filters instead of custom metrics
// Extract metrics from logs instead of publishing custom metrics
const metricFilter = new logs.MetricFilter(this, 'ErrorMetric', {
  logGroup,
  // The term is double-quoted so the brackets match literally; an unquoted
  // [ERROR] would be parsed as a space-delimited field pattern.
  filterPattern: logs.FilterPattern.literal('"[ERROR]"'),
  metricNamespace: 'CustomMetrics',
  metricName: 'Errors',
  metricValue: '1',
});

// 3. Use sampling for high-volume logs
// Implement sampling in application code

// 4. Composite alarms
// Reduce alarm count by combining multiple conditions
// (alarm1, alarm2, alarm3 are existing cloudwatch.Alarm instances)
const compositeAlarm = new cloudwatch.CompositeAlarm(this, 'CompositeAlarm', {
  alarmRule: cloudwatch.AlarmRule.anyOf(alarm1, alarm2, alarm3),
});

// 5. Use Fargate Spot for non-production Grafana
// Add capacity provider with Spot instances
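Strategy 3 leaves sampling to the application. One common approach is to decide per request rather than per log line, so that a sampled request keeps all of its log lines together, while warnings and errors are always kept. The following is a sketch under those assumptions; `shouldLog` and `fnv1a` are hypothetical helpers, and the hash is an FNV-1a style function over UTF-16 code units chosen only because it is short and deterministic.

```typescript
type LogLevel = 'DEBUG' | 'INFO' | 'WARN' | 'ERROR';

// FNV-1a style 32-bit hash over the string's UTF-16 code units.
function fnv1a(input: string): number {
  let hash = 0x811c9dc5;
  for (let i = 0; i < input.length; i++) {
    hash ^= input.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193) >>> 0;
  }
  return hash;
}

// Decides whether a log line should be emitted.
// sampleRate is the fraction of requests (0..1) whose DEBUG/INFO lines are kept.
function shouldLog(level: LogLevel, requestId: string, sampleRate: number): boolean {
  if (level === 'WARN' || level === 'ERROR') return true; // never drop problems
  if (sampleRate >= 1) return true;
  if (sampleRate <= 0) return false;
  // Map the hash to [0, 1) so the same request ID always gets the same answer.
  return fnv1a(requestId) / 0x100000000 < sampleRate;
}

// Keep roughly 10% of requests' INFO logs; the request ID is a placeholder.
if (shouldLog('INFO', 'req-1234', 0.1)) {
  console.log('request handled');
}
```

Because the decision is derived from the request ID, every service that sees the same ID and uses the same rate makes the same choice, which keeps sampled traces complete across services.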
Integration Examples
Enabling CloudWatch Logs for Services
// These snippets assume `vpc` and `logGroup` are defined elsewhere in the stack.

// Lambda with CloudWatch Logs
const lambdaFunction = new lambda.Function(this, 'Function', {
  runtime: lambda.Runtime.NODEJS_20_X,
  handler: 'index.handler',
  code: lambda.Code.fromAsset('lambda'),
  logRetention: logs.RetentionDays.ONE_WEEK,
});

// ECS with CloudWatch Logs
const taskDefinition = new ecs.FargateTaskDefinition(this, 'TaskDef');
taskDefinition.addContainer('app', {
  image: ecs.ContainerImage.fromRegistry('my-app'),
  logging: ecs.LogDrivers.awsLogs({
    streamPrefix: 'ecs',
    logRetention: logs.RetentionDays.ONE_WEEK,
  }),
});

// API Gateway with CloudWatch Logs
const api = new apigateway.RestApi(this, 'API', {
  deployOptions: {
    loggingLevel: apigateway.MethodLoggingLevel.INFO,
    dataTraceEnabled: true,
    accessLogDestination: new apigateway.LogGroupLogDestination(logGroup),
  },
});

// RDS with CloudWatch Logs
const database = new rds.DatabaseInstance(this, 'Database', {
  engine: rds.DatabaseInstanceEngine.postgres({
    version: rds.PostgresEngineVersion.VER_14,
  }),
  vpc, // required by DatabaseInstance
  cloudwatchLogsExports: ['postgresql'],
});

// Enable Container Insights for ECS
const cluster = new ecs.Cluster(this, 'Cluster', {
  containerInsights: true,
});

// Enable Lambda Insights
const lambdaWithInsights = new lambda.Function(this, 'FunctionWithInsights', {
  runtime: lambda.Runtime.NODEJS_20_X,
  handler: 'index.handler',
  code: lambda.Code.fromAsset('lambda'),
  insightsVersion: lambda.LambdaInsightsVersion.VERSION_1_0_229_0,
});
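Enabling log delivery is only half of the picture: queries such as the API request analysis one (`fields @timestamp, method, path, statusCode, duration`) rely on the application writing those fields as JSON, which Logs Insights can then discover automatically. A minimal sketch of such a log line follows; `formatRequestLog` and the `RequestLog` shape are hypothetical and simply mirror the field names used in that query.

```typescript
interface RequestLog {
  method: string;
  path: string;
  statusCode: number;
  duration: number; // milliseconds
}

// Serializes one request as a single-line JSON log entry.
function formatRequestLog(entry: RequestLog, now: Date = new Date()): string {
  return JSON.stringify({
    timestamp: now.toISOString(),
    // Server errors are tagged ERROR so text filters on /ERROR/ still match.
    level: entry.statusCode >= 500 ? 'ERROR' : 'INFO',
    ...entry,
  });
}

// One line per request; stdout ends up in CloudWatch Logs for Lambda and ECS.
console.log(formatRequestLog({ method: 'GET', path: '/orders', statusCode: 200, duration: 42 }));
```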
Deployment
# Install dependencies
npm install

# Bootstrap CDK (first time only)
cdk bootstrap

# Synthesize CloudFormation templates
cdk synth

# Deploy IAM stack first
cdk deploy MonitoringIAMStack --context environment=development

# Deploy CloudWatch stack
cdk deploy MonitoringCloudWatchStack --context environment=development

# Deploy Alerting stack
cdk deploy MonitoringAlertingStack --context environment=development

# Deploy Grafana stack
cdk deploy MonitoringGrafanaStack --context environment=development

# Or deploy all stacks at once
# (--require-approval never skips the security-change prompt; intended for CI)
cdk deploy --all --context environment=production --require-approval never

# View Grafana URL
aws cloudformation describe-stacks \
  --stack-name MonitoringGrafanaStack \
  --query 'Stacks[0].Outputs[?OutputKey==`GrafanaURL`].OutputValue' \
  --output text
Summary and Best Practices
Key Takeaways
- Hybrid Approach: Use CloudWatch for data collection, Grafana for visualization
- IAM Permissions: Properly configure cross-service access with least privilege
- Log Aggregation: Centralize logs from all services in CloudWatch
- Metric Filters: Extract metrics from logs to reduce custom metric costs
- Alerting: Multi-tier alerting with SNS topics and Lambda processors
- Retention: Configure appropriate log retention based on environment
- Cost Management: Monitor costs, use sampling, optimize retention
- Dashboard Design: Create role-specific dashboards for different teams
Monitoring Checklist
- Define monitoring requirements for all services
- Set up IAM roles with cross-service permissions
- Enable CloudWatch Logs for all services
- Create metric filters for log-based metrics
- Configure CloudWatch Alarms with appropriate thresholds
- Set up SNS topics for different alert severities
- Deploy Grafana on ECS with proper security
- Create dashboards for different teams/services
- Enable Container Insights (ECS/EKS)
- Enable Lambda Insights for Lambda functions
- Configure log retention policies
- Set up alert notifications (Slack, PagerDuty, email)
- Test alerting workflows
- Document dashboard usage and queries
- Implement cost monitoring for the monitoring stack itself
Dashboard Organization
Dashboards by Role:
├── Executive Dashboard
│   ├── High-level KPIs
│   ├── Cost metrics
│   └── Availability metrics
├── Operations Dashboard
│   ├── Infrastructure health
│   ├── Service availability
│   └── Active incidents
├── Development Dashboard
│   ├── Application metrics
│   ├── Error rates
│   └── Performance metrics
└── SRE Dashboard
    ├── SLIs/SLOs
    ├── Detailed service metrics
    └── Capacity planning
Further Learning
- CloudWatch: Amazon CloudWatch Documentation
- Grafana: Grafana Documentation
- Observability: AWS Observability Best Practices
- Log Insights: CloudWatch Logs Insights Query Syntax
Conclusion
Building a centralized monitoring system is essential for maintaining visibility, reliability, and performance in distributed AWS environments. By combining CloudWatch’s native AWS integration with Grafana’s powerful visualization capabilities, you create a robust observability platform.
The keys to successful monitoring are:
- Comprehensive Coverage: Monitor all services and infrastructure
- Proper Permissions: Use IAM to enable cross-service monitoring
- Actionable Alerts: Alert on symptoms, not just metrics
- Cost Awareness: Balance observability needs with cost
- Team-Specific Dashboards: Provide relevant views for different roles
Key Benefits:
- Single pane of glass for all AWS services
- Proactive issue detection and alerting
- Reduced MTTR (Mean Time To Resolution)
- Better understanding of system behavior
- Data-driven optimization decisions
- Compliance and audit readiness