Building a Centralized Monitoring System with AWS CloudWatch and Grafana using CDK

🎯 Introduction

In distributed systems running on AWS, observability is critical for maintaining reliability, debugging issues, and ensuring optimal performance. A centralized monitoring system provides:

  • Unified Visibility: Single pane of glass for all services, applications, and infrastructure
  • Proactive Alerting: Detect and respond to issues before they impact users
  • Performance Optimization: Identify bottlenecks and optimization opportunities
  • Cost Management: Track resource utilization and spending patterns
  • Compliance: Meet audit and regulatory requirements for logging
  • Troubleshooting: Quickly diagnose and resolve production issues

This comprehensive guide demonstrates how to build a production-ready centralized monitoring system using AWS CloudWatch and Grafana, deployed with CDK (TypeScript). We’ll focus on cross-service log aggregation, metric collection, proper IAM permissions, and creating actionable dashboards.

💡 Core Philosophy: “You can’t improve what you don’t measure. Good monitoring isn’t about collecting data; it’s about extracting insights that drive action.”

🎬 What We’ll Build

A complete centralized monitoring and observability platform featuring:

  • CloudWatch Logs Aggregation from Lambda, ECS, API Gateway, and more
  • CloudWatch Metrics collection and custom metrics
  • Grafana on ECS Fargate for advanced visualization
  • CloudWatch as Data Source for Grafana dashboards
  • Cross-Account Monitoring capabilities
  • IAM Roles and Policies for secure cross-service access
  • Automated Alerting via SNS and PagerDuty
  • Custom Dashboards for different teams and services
  • Logs Insights Queries for log analysis
  • Metric Filters for extracting metrics from logs
  • Container Insights for ECS/EKS monitoring
  • Lambda Insights for serverless monitoring

🏗️ Architecture Overview

📊 High-Level Architecture

graph TB
    subgraph "AWS Services Being Monitored"
        Lambda[Lambda Functions]
        ECS[ECS/Fargate]
        RDS[RDS Databases]
        APIGW[API Gateway]
        ALB[Application Load Balancer]
        S3[S3 Buckets]
        DynamoDB[DynamoDB]
        SQS[SQS Queues]
    end

    subgraph "Log Collection Layer"
        CWLogs[CloudWatch Logs]
        LogGroups[Log Groups]
        MetricFilters[Metric Filters]
    end

    subgraph "Metrics Collection Layer"
        CWMetrics[CloudWatch Metrics]
        CustomMetrics[Custom Metrics]
        ContainerInsights[Container Insights]
        LambdaInsights[Lambda Insights]
    end

    subgraph "Monitoring Platform"
        Grafana[Grafana on ECS]
        CWDashboards[CloudWatch Dashboards]
        LogInsights[CloudWatch Logs Insights]
    end

    subgraph "Alerting & Notifications"
        CWAlarms[CloudWatch Alarms]
        SNS[SNS Topics]
        Lambda2[Alert Lambda]
        PagerDuty[PagerDuty]
        Slack[Slack]
        Email[Email]
    end

    subgraph "IAM & Security"
        MonitoringRole[Monitoring IAM Role]
        ServiceRoles[Service Roles]
        CrossAccount[Cross-Account Access]
    end

    Lambda --> CWLogs
    ECS --> CWLogs
    APIGW --> CWLogs
    ALB -->|access logs| S3

    Lambda --> CWMetrics
    ECS --> CWMetrics
    RDS --> CWMetrics
    DynamoDB --> CWMetrics
    SQS --> CWMetrics

    CWLogs --> LogGroups
    LogGroups --> MetricFilters
    MetricFilters --> CWMetrics

    ECS --> ContainerInsights
    Lambda --> LambdaInsights

    CWMetrics --> Grafana
    CWLogs --> Grafana
    ContainerInsights --> Grafana
    LambdaInsights --> Grafana

    CWMetrics --> CWDashboards
    CWLogs --> LogInsights

    CWMetrics --> CWAlarms
    CWAlarms --> SNS
    SNS --> Lambda2
    Lambda2 --> PagerDuty
    Lambda2 --> Slack
    SNS --> Email

    MonitoringRole --> CWLogs
    MonitoringRole --> CWMetrics
    MonitoringRole --> Grafana
    ServiceRoles --> CWLogs
    ServiceRoles --> CWMetrics

    style Grafana fill:#ff6b6b
    style CWMetrics fill:#4ecdc4
    style CWAlarms fill:#feca57
    style MonitoringRole fill:#95e1d3

🔄 Data Flow

sequenceDiagram
    participant Service as AWS Service
    participant CWLogs as CloudWatch Logs
    participant MetricFilter as Metric Filter
    participant CWMetrics as CloudWatch Metrics
    participant Alarm as CloudWatch Alarm
    participant SNS as SNS Topic
    participant Grafana as Grafana

    Service->>CWLogs: Send Logs
    CWLogs->>MetricFilter: Process Logs
    MetricFilter->>CWMetrics: Extract Metrics

    Service->>CWMetrics: Send Metrics

    CWMetrics->>Alarm: Evaluate Threshold
    Alarm->>SNS: Trigger Alert
    SNS->>SNS: Fan out notifications

    Grafana->>CWMetrics: Query Metrics
    Grafana->>CWLogs: Query Logs
    Grafana->>Grafana: Render Dashboard

🎨 Monitoring Strategy: CloudWatch vs Grafana

📋 Feature Comparison

| Feature | CloudWatch | Grafana |
| --- | --- | --- |
| Native AWS Integration | ✅ Excellent | ⚠️ Requires setup |
| Custom Dashboards | ✅ Good | ✅ Excellent |
| Visualization Options | ⚠️ Limited | ✅ Extensive |
| Multi-Cloud Support | ❌ AWS only | ✅ Yes |
| Cost | Pay per metric/log | Infrastructure cost |
| Setup Complexity | ✅ Minimal | ⚠️ Moderate |
| Alerting | ✅ Native | ✅ Advanced |
| Log Analysis | ✅ Logs Insights | ⚠️ Via plugins |
| Query Language | Logs Insights | Depends on data source (e.g., PromQL, LogQL) |
| User Management | AWS IAM | Built-in |
| Customization | ⚠️ Limited | ✅ Extensive |

Use both CloudWatch and Grafana for maximum flexibility:

  • CloudWatch: Primary data store and native AWS service monitoring
  • Grafana: Advanced visualization and unified dashboard for all data sources

// Hybrid monitoring strategy
CloudWatch (Data Layer)
├── Collect all logs and metrics
├── Native AWS service integration
├── CloudWatch Alarms for critical alerts
└── Logs Insights for ad-hoc queries

Grafana (Visualization Layer)
├── Use CloudWatch as data source
├── Advanced dashboards
├── Custom visualizations
└── Unified view across services

📦 CDK Project Structure

monitoring-system-cdk/
├── bin/
│   └── monitoring-system.ts        # CDK app entry point
├── lib/
│   ├── stacks/
│   │   ├── cloudwatch-stack.ts     # CloudWatch setup
│   │   ├── grafana-stack.ts        # Grafana on ECS
│   │   ├── alerting-stack.ts       # Alarms and SNS
│   │   ├── iam-stack.ts            # IAM roles and policies
│   │   └── dashboards-stack.ts     # Dashboard definitions
│   ├── constructs/
│   │   ├── log-aggregation.ts      # Log collection construct
│   │   ├── metric-collection.ts    # Metrics construct
│   │   ├── grafana-cluster.ts      # Grafana ECS construct
│   │   └── alert-manager.ts        # Alerting construct
│   ├── config/
│   │   ├── monitoring-config.ts    # Monitoring configuration
│   │   ├── services-config.ts      # Services to monitor
│   │   └── dashboard-config.ts     # Dashboard definitions
│   └── utils/
│       ├── metric-utils.ts         # Metric helpers
│       └── log-utils.ts            # Log helpers
├── lambda/
│   ├── alert-processor/
│   │   ├── index.ts                # Alert processing
│   │   └── formatters.ts           # Alert formatters
│   └── metric-collector/
│       └── index.ts                # Custom metrics collector
├── dashboards/
│   ├── cloudwatch/
│   │   └── main-dashboard.json
│   └── grafana/
│       ├── service-dashboard.json
│       └── infrastructure-dashboard.json
├── test/
├── cdk.json
├── tsconfig.json
└── package.json

⚙️ Configuration Design

🎯 Monitoring Configuration

// lib/config/monitoring-config.ts
export interface MonitoringConfig {
  logRetention: number; // days
  metricNamespace: string;
  enableDetailedMonitoring: boolean;
  enableContainerInsights: boolean;
  enableLambdaInsights: boolean;
  alerting: AlertingConfig;
  grafana?: GrafanaConfig;
}

export interface AlertingConfig {
  enabled: boolean;
  emailEndpoints: string[];
  slackWebhookUrl?: string;
  pagerDutyIntegrationKey?: string;
  criticalAlarmActions: string[]; // SNS topic ARNs
  warningAlarmActions: string[];  // SNS topic ARNs
}

export interface GrafanaConfig {
  enabled: boolean;
  instanceType: string;
  desiredCount: number;
  domain?: string;
  adminPassword: string;
  oauth?: {
    enabled: boolean;
    provider: 'google' | 'github' | 'cognito';
    clientId: string;
    clientSecret: string;
  };
}

export const monitoringConfigs = {
  development: {
    logRetention: 7, // days
    metricNamespace: 'Development',
    enableDetailedMonitoring: false,
    enableContainerInsights: false,
    enableLambdaInsights: false,
    alerting: {
      enabled: true,
      emailEndpoints: ['dev-team@company.com'],
      criticalAlarmActions: [],
      warningAlarmActions: [],
    },
    grafana: {
      enabled: true,
      instanceType: 't3.small',
      desiredCount: 1,
      adminPassword: 'change-me-dev', // placeholder; do not commit real credentials
    },
  },
  production: {
    logRetention: 90, // days
    metricNamespace: 'Production',
    enableDetailedMonitoring: true,
    enableContainerInsights: true,
    enableLambdaInsights: true,
    alerting: {
      enabled: true,
      emailEndpoints: ['oncall@company.com', 'platform-team@company.com'],
      // Both values are undefined when the environment variable is not set
      slackWebhookUrl: process.env.SLACK_WEBHOOK_URL,
      pagerDutyIntegrationKey: process.env.PAGERDUTY_KEY,
      criticalAlarmActions: ['arn:aws:sns:us-east-1:xxx:critical-alerts'],
      warningAlarmActions: ['arn:aws:sns:us-east-1:xxx:warning-alerts'],
    },
    grafana: {
      enabled: true,
      instanceType: 't3.medium',
      desiredCount: 2,
      domain: 'monitoring.yourdomain.com',
      // These fall back to an empty string when the variable is not set
      adminPassword: process.env.GRAFANA_ADMIN_PASSWORD || '',
      oauth: {
        enabled: true,
        provider: 'cognito',
        clientId: process.env.COGNITO_CLIENT_ID || '',
        clientSecret: process.env.COGNITO_CLIENT_SECRET || '',
      },
    },
  },
} as const;
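
Because the production entries read secrets from environment variables and fall back to empty strings, a misconfigured deployment would still synthesize without complaint. One possible guard is sketched below. It assumes a hypothetical `findMissingSecrets` helper that is called before the stacks are instantiated; neither the helper nor the `SecretFields` type is part of the original project.

```typescript
// Hypothetical helper (an assumption, not part of the original project):
// lists the secret-bearing Grafana settings that are empty, so the caller
// can fail fast before deploying.
interface SecretFields {
  adminPassword: string;
  oauth?: { enabled: boolean; clientId: string; clientSecret: string };
}

function findMissingSecrets(grafana: SecretFields): string[] {
  const missing: string[] = [];
  if (grafana.adminPassword.trim() === '') {
    missing.push('adminPassword');
  }
  // OAuth credentials only matter when OAuth is switched on.
  if (grafana.oauth?.enabled) {
    if (grafana.oauth.clientId.trim() === '') missing.push('oauth.clientId');
    if (grafana.oauth.clientSecret.trim() === '') missing.push('oauth.clientSecret');
  }
  return missing;
}

// Example: settings resolved while some environment variables were unset.
const unresolved = findMissingSecrets({
  adminPassword: '',
  oauth: { enabled: true, clientId: '', clientSecret: 'abc' },
});
console.log(unresolved); // the caller can throw if this list is non-empty
```

Returning a list rather than throwing on the first problem lets a single run report every missing value.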

📊 Services to Monitor Configuration

// lib/config/services-config.ts
export interface ServiceMonitoringConfig {
  serviceName: string;
  type: 'lambda' | 'ecs' | 'rds' | 'dynamodb' | 'api-gateway' | 'alb' | 'sqs';
  logGroups: string[];
  metrics: MetricConfig[];
  alarms: AlarmConfig[];
}

export interface MetricConfig {
  name: string;
  namespace: string;
  dimensions?: Record<string, string>;
  statistic: 'Average' | 'Sum' | 'Minimum' | 'Maximum' | 'SampleCount';
  period: number; // seconds
  unit?: string;
}

export interface AlarmConfig {
  name: string;
  description: string;
  metric: MetricConfig;
  threshold: number;
  comparisonOperator: 'GreaterThanThreshold' | 'LessThanThreshold' | 'GreaterThanOrEqualToThreshold' | 'LessThanOrEqualToThreshold';
  evaluationPeriods: number;
  datapointsToAlarm?: number;
  treatMissingData?: 'notBreaching' | 'breaching' | 'ignore' | 'missing';
  severity: 'critical' | 'warning' | 'info';
}

export const servicesConfig: ServiceMonitoringConfig[] = [
  {
    serviceName: 'api-service',
    type: 'lambda',
    logGroups: ['/aws/lambda/api-handler', '/aws/lambda/api-authorizer'],
    // Without dimensions, AWS/Lambda metrics aggregate across all functions
    // in the account and region; add FunctionName to scope them.
    metrics: [
      {
        name: 'Invocations',
        namespace: 'AWS/Lambda',
        statistic: 'Sum',
        period: 300,
      },
      {
        name: 'Duration',
        namespace: 'AWS/Lambda',
        statistic: 'Average',
        period: 300,
        unit: 'Milliseconds',
      },
      {
        name: 'Errors',
        namespace: 'AWS/Lambda',
        statistic: 'Sum',
        period: 300,
      },
      {
        name: 'Throttles',
        namespace: 'AWS/Lambda',
        statistic: 'Sum',
        period: 300,
      },
    ],
    alarms: [
      {
        name: 'api-service-high-error-rate',
        description: 'API service error rate is too high',
        metric: {
          name: 'Errors',
          namespace: 'AWS/Lambda',
          statistic: 'Sum',
          period: 300,
        },
        threshold: 10,
        comparisonOperator: 'GreaterThanThreshold',
        evaluationPeriods: 2,
        datapointsToAlarm: 2,
        treatMissingData: 'notBreaching',
        severity: 'critical',
      },
      {
        name: 'api-service-high-duration',
        description: 'API service response time is too slow',
        metric: {
          name: 'Duration',
          namespace: 'AWS/Lambda',
          statistic: 'Average',
          period: 300,
        },
        threshold: 3000, // 3 seconds
        comparisonOperator: 'GreaterThanThreshold',
        evaluationPeriods: 3,
        severity: 'warning',
      },
    ],
  },
  {
    serviceName: 'web-application',
    type: 'ecs',
    logGroups: ['/ecs/web-app'],
    // AWS/ECS metrics are published per cluster and service; set
    // `dimensions` (ClusterName, ServiceName) for these to return data.
    metrics: [
      {
        name: 'CPUUtilization',
        namespace: 'AWS/ECS',
        statistic: 'Average',
        period: 300,
      },
      {
        name: 'MemoryUtilization',
        namespace: 'AWS/ECS',
        statistic: 'Average',
        period: 300,
      },
    ],
    alarms: [
      {
        name: 'ecs-high-cpu',
        description: 'ECS service CPU usage is too high',
        metric: {
          name: 'CPUUtilization',
          namespace: 'AWS/ECS',
          statistic: 'Average',
          period: 300,
        },
        threshold: 80,
        comparisonOperator: 'GreaterThanThreshold',
        evaluationPeriods: 2,
        severity: 'warning',
      },
    ],
  },
  {
    serviceName: 'api-gateway',
    type: 'api-gateway',
    logGroups: ['/aws/apigateway/api-logs'],
    // AWS/ApiGateway metrics are published per API; set `dimensions`
    // (for example ApiName) for these to return data.
    metrics: [
      {
        name: 'Count',
        namespace: 'AWS/ApiGateway',
        statistic: 'Sum',
        period: 300,
      },
      {
        name: '4XXError',
        namespace: 'AWS/ApiGateway',
        statistic: 'Sum',
        period: 300,
      },
      {
        name: '5XXError',
        namespace: 'AWS/ApiGateway',
        statistic: 'Sum',
        period: 300,
      },
      {
        name: 'Latency',
        namespace: 'AWS/ApiGateway',
        statistic: 'Average',
        period: 300,
        unit: 'Milliseconds',
      },
    ],
    alarms: [
      {
        name: 'api-gateway-5xx-errors',
        description: 'API Gateway is returning too many 5XX errors',
        metric: {
          name: '5XXError',
          namespace: 'AWS/ApiGateway',
          statistic: 'Sum',
          period: 300,
        },
        threshold: 5,
        comparisonOperator: 'GreaterThanThreshold',
        evaluationPeriods: 1,
        severity: 'critical',
      },
    ],
  },
  {
    serviceName: 'database',
    type: 'rds',
    logGroups: ['/aws/rds/instance/postgres/postgresql'],
    // Without dimensions, AWS/RDS metrics aggregate across all databases;
    // add DBInstanceIdentifier to scope them to one instance.
    metrics: [
      {
        name: 'CPUUtilization',
        namespace: 'AWS/RDS',
        statistic: 'Average',
        period: 300,
      },
      {
        name: 'DatabaseConnections',
        namespace: 'AWS/RDS',
        statistic: 'Average',
        period: 300,
      },
      {
        name: 'FreeableMemory',
        namespace: 'AWS/RDS',
        statistic: 'Average',
        period: 300,
      },
      {
        name: 'ReadLatency',
        namespace: 'AWS/RDS',
        statistic: 'Average',
        period: 300,
      },
      {
        name: 'WriteLatency',
        namespace: 'AWS/RDS',
        statistic: 'Average',
        period: 300,
      },
    ],
    alarms: [
      {
        name: 'rds-high-cpu',
        description: 'RDS CPU utilization is too high',
        metric: {
          name: 'CPUUtilization',
          namespace: 'AWS/RDS',
          statistic: 'Average',
          period: 300,
        },
        threshold: 80,
        comparisonOperator: 'GreaterThanThreshold',
        evaluationPeriods: 3,
        severity: 'critical',
      },
      {
        name: 'rds-low-memory',
        description: 'RDS freeable memory is too low',
        metric: {
          name: 'FreeableMemory',
          namespace: 'AWS/RDS',
          statistic: 'Average',
          period: 300,
        },
        threshold: 1000000000, // 1 GB, expressed in bytes
        comparisonOperator: 'LessThanThreshold',
        evaluationPeriods: 2,
        severity: 'warning',
      },
    ],
  },
];
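
The `evaluationPeriods` and `datapointsToAlarm` fields together describe an "M out of N" rule: the alarm fires when at least M of the most recent N datapoints breach the threshold, and M defaults to N when `datapointsToAlarm` is omitted. The sketch below models that rule for complete data only. It is a simplified illustration rather than CloudWatch's actual evaluator, it ignores `treatMissingData`, and the `AlarmRule` type and `isAlarming` function are hypothetical names.

```typescript
type Comparison =
  | 'GreaterThanThreshold'
  | 'LessThanThreshold'
  | 'GreaterThanOrEqualToThreshold'
  | 'LessThanOrEqualToThreshold';

// Hypothetical subset of AlarmConfig, limited to the fields the rule needs.
interface AlarmRule {
  threshold: number;
  comparisonOperator: Comparison;
  evaluationPeriods: number;  // N
  datapointsToAlarm?: number; // M, defaults to N when omitted
}

function breaches(value: number, rule: AlarmRule): boolean {
  switch (rule.comparisonOperator) {
    case 'GreaterThanThreshold':
      return value > rule.threshold;
    case 'LessThanThreshold':
      return value < rule.threshold;
    case 'GreaterThanOrEqualToThreshold':
      return value >= rule.threshold;
    case 'LessThanOrEqualToThreshold':
      return value <= rule.threshold;
  }
}

// datapoints: one aggregated value per period, oldest first.
function isAlarming(datapoints: number[], rule: AlarmRule): boolean {
  const required = rule.datapointsToAlarm ?? rule.evaluationPeriods;
  const recent = datapoints.slice(-rule.evaluationPeriods);
  const breaching = recent.filter((v) => breaches(v, rule)).length;
  return breaching >= required;
}

// Mirrors the api-service-high-error-rate alarm: 2 of 2 periods above 10.
const errorRateRule: AlarmRule = {
  threshold: 10,
  comparisonOperator: 'GreaterThanThreshold',
  evaluationPeriods: 2,
  datapointsToAlarm: 2,
};
console.log(isAlarming([3, 12, 15], errorRateRule)); // true
console.log(isAlarming([12, 15, 4], errorRateRule)); // false
```

Setting M below N (for example 2 of 3) makes an alarm tolerant of a single healthy period between breaches, which can reduce flapping.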

🏗️ CDK Stack Implementation

📊 IAM Stack - Cross-Service Permissions

// lib/stacks/iam-stack.ts
import * as cdk from 'aws-cdk-lib';
import * as iam from 'aws-cdk-lib/aws-iam';
import { Construct } from 'constructs';

export class IAMStack extends cdk.Stack {
  public readonly monitoringRole: iam.Role;
  public readonly grafanaRole: iam.Role;
  public readonly alertProcessorRole: iam.Role;

  constructor(scope: Construct, id: string, props?: cdk.StackProps) {
    super(scope, id, props);

    // Monitoring Role - for reading logs and metrics from all services
    this.monitoringRole = new iam.Role(this, 'MonitoringRole', {
      roleName: 'centralized-monitoring-role',
      assumedBy: new iam.ServicePrincipal('lambda.amazonaws.com'),
      description: 'Role for centralized monitoring to access logs and metrics',
    });

    // Grant CloudWatch Logs read permissions
    this.monitoringRole.addToPolicy(
      new iam.PolicyStatement({
        effect: iam.Effect.ALLOW,
        actions: [
          'logs:DescribeLogGroups',
          'logs:DescribeLogStreams',
          'logs:GetLogEvents',
          'logs:FilterLogEvents',
          'logs:StartQuery',
          'logs:StopQuery',
          'logs:GetQueryResults',
          'logs:GetLogRecord',
        ],
        resources: ['*'],
      })
    );

    // Grant CloudWatch Metrics read permissions
    this.monitoringRole.addToPolicy(
      new iam.PolicyStatement({
        effect: iam.Effect.ALLOW,
        actions: [
          'cloudwatch:DescribeAlarms',
          'cloudwatch:GetMetricData',
          'cloudwatch:GetMetricStatistics',
          'cloudwatch:ListMetrics',
          'cloudwatch:DescribeAlarmsForMetric',
        ],
        resources: ['*'],
      })
    );

    // Grant read-only (describe/list) permissions on the monitored AWS services
    this.monitoringRole.addToPolicy(
      new iam.PolicyStatement({
        effect: iam.Effect.ALLOW,
        actions: [
          // Lambda
          'lambda:GetFunction',
          'lambda:ListFunctions',
          // ECS
          'ecs:DescribeClusters',
          'ecs:DescribeServices',
          'ecs:DescribeTasks',
          'ecs:ListClusters',
          'ecs:ListServices',
          'ecs:ListTasks',
          // RDS
          'rds:DescribeDBInstances',
          'rds:DescribeDBClusters',
          // DynamoDB
          'dynamodb:DescribeTable',
          'dynamodb:ListTables',
          // API Gateway
          'apigateway:GET',
          // ALB
          'elasticloadbalancing:DescribeLoadBalancers',
          'elasticloadbalancing:DescribeTargetGroups',
          'elasticloadbalancing:DescribeTargetHealth',
          // SQS
          'sqs:GetQueueAttributes',
          'sqs:ListQueues',
          // S3 (the IAM action behind the GetBucketMetricsConfiguration API
          // is named s3:GetMetricsConfiguration)
          's3:GetBucketLocation',
          's3:ListBucket',
          's3:GetMetricsConfiguration',
        ],
        resources: ['*'],
      })
    );

    // Grafana Role - for ECS task
    this.grafanaRole = new iam.Role(this, 'GrafanaRole', {
      roleName: 'grafana-ecs-task-role',
      assumedBy: new iam.ServicePrincipal('ecs-tasks.amazonaws.com'),
      description: 'Role for Grafana to access CloudWatch',
    });

    // Grant Grafana permissions to read CloudWatch metrics and alarms
    this.grafanaRole.addToPolicy(
      new iam.PolicyStatement({
        effect: iam.Effect.ALLOW,
        actions: [
          'cloudwatch:DescribeAlarmsForMetric',
          'cloudwatch:DescribeAlarmHistory',
          'cloudwatch:DescribeAlarms',
          'cloudwatch:ListMetrics',
          'cloudwatch:GetMetricData',
          'cloudwatch:GetMetricStatistics',
        ],
        resources: ['*'],
      })
    );

    // Grant Grafana permissions to query CloudWatch Logs
    this.grafanaRole.addToPolicy(
      new iam.PolicyStatement({
        effect: iam.Effect.ALLOW,
        actions: [
          'logs:DescribeLogGroups',
          'logs:GetLogGroupFields',
          'logs:StartQuery',
          'logs:StopQuery',
          'logs:GetQueryResults',
          'logs:GetLogEvents',
        ],
        resources: ['*'],
      })
    );

    // Grant Grafana permissions to discover regions, instances, and tagged resources
    this.grafanaRole.addToPolicy(
      new iam.PolicyStatement({
        effect: iam.Effect.ALLOW,
        actions: [
          'ec2:DescribeRegions',
          'ec2:DescribeInstances',
          'tag:GetResources',
        ],
        resources: ['*'],
      })
    );

    // Alert Processor Role
    this.alertProcessorRole = new iam.Role(this, 'AlertProcessorRole', {
      roleName: 'alert-processor-role',
      assumedBy: new iam.ServicePrincipal('lambda.amazonaws.com'),
      managedPolicies: [
        iam.ManagedPolicy.fromAwsManagedPolicyName(
          'service-role/AWSLambdaBasicExecutionRole'
        ),
      ],
    });

    // Grant SNS publish permissions
    this.alertProcessorRole.addToPolicy(
      new iam.PolicyStatement({
        effect: iam.Effect.ALLOW,
        actions: ['sns:Publish'],
        resources: ['*'],
      })
    );

    // Outputs
    new cdk.CfnOutput(this, 'MonitoringRoleArn', {
      value: this.monitoringRole.roleArn,
      exportName: 'MonitoringRoleArn',
    });

    new cdk.CfnOutput(this, 'GrafanaRoleArn', {
      value: this.grafanaRole.roleArn,
      exportName: 'GrafanaRoleArn',
    });
  }
}
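
The policies above grant read access on `*`. Where the set of monitored log groups is known in advance, the log-reading statements could be narrowed to those groups. The sketch below builds the resource ARNs from log group names. `logGroupArns` is a hypothetical helper, and the region and account ID are placeholder values.

```typescript
// Hypothetical helper (an assumption, not part of the original project):
// builds CloudWatch Logs ARNs covering each log group and its log streams,
// suitable for the `resources` list of a policy statement.
function logGroupArns(
  region: string,
  accountId: string,
  logGroupNames: string[],
  partition = 'aws'
): string[] {
  return logGroupNames.map(
    (name) => `arn:${partition}:logs:${region}:${accountId}:log-group:${name}:*`
  );
}

// Placeholder region and account ID, for illustration only.
const scopedResources = logGroupArns('us-east-1', '123456789012', [
  '/aws/lambda/api-handler',
  '/ecs/web-app',
]);
console.log(scopedResources);
```

Some actions, such as `logs:GetQueryResults`, are not tied to a specific log group and may still need `*`, so it is worth checking the IAM reference for each action before tightening a statement.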

📈 CloudWatch Stack - Log Aggregation and Metrics

// lib/stacks/cloudwatch-stack.ts
import * as cdk from 'aws-cdk-lib';
import * as logs from 'aws-cdk-lib/aws-logs';
import * as cloudwatch from 'aws-cdk-lib/aws-cloudwatch';
import { Construct } from 'constructs';
import { servicesConfig, ServiceMonitoringConfig } from '../config/services-config';
import { monitoringConfigs } from '../config/monitoring-config';

export interface CloudWatchStackProps extends cdk.StackProps {
  environment: 'development' | 'production';
}

export class CloudWatchStack extends cdk.Stack {
  public readonly logGroups: Map<string, logs.LogGroup>;
  public readonly metricFilters: Map<string, logs.MetricFilter>;
  public readonly dashboard: cloudwatch.Dashboard;

  constructor(scope: Construct, id: string, props: CloudWatchStackProps) {
    super(scope, id, props);

    const config = monitoringConfigs[props.environment];

    this.logGroups = new Map();
    this.metricFilters = new Map();

    // Create centralized log groups and metric filters
    servicesConfig.forEach((serviceConfig) => {
      this.setupServiceMonitoring(serviceConfig, config.logRetention);
    });

    // Create CloudWatch Dashboard
    this.dashboard = this.createDashboard(config.metricNamespace);

    // Create custom metrics namespace
    this.createCustomMetricsNamespace(config.metricNamespace);
  }

  private setupServiceMonitoring(
    serviceConfig: ServiceMonitoringConfig,
    retentionDays: number
  ): void {
    // Create one log group per configured name. Deployment fails if a group
    // with the same name already exists (for example, one created by Lambda).
    serviceConfig.logGroups.forEach((logGroupName) => {
      // Derived from the log group name so that construct IDs stay unique
      // when a service has more than one log group.
      const idSuffix = `${serviceConfig.serviceName}-${logGroupName.replace(/\//g, '-')}`;

      const logGroup = new logs.LogGroup(this, `LogGroup-${idSuffix}`, {
        logGroupName,
        retention: this.getRetention(retentionDays),
        removalPolicy: cdk.RemovalPolicy.DESTROY,
      });

      this.logGroups.set(logGroupName, logGroup);

      // Create metric filters for common patterns
      this.createMetricFiltersForLogGroup(
        logGroup,
        serviceConfig.serviceName,
        serviceConfig.type,
        idSuffix
      );
    });
  }

  private createMetricFiltersForLogGroup(
    logGroup: logs.LogGroup,
    serviceName: string,
    serviceType: string,
    idSuffix: string
  ): void {
    // Filters from every log group of a service publish to the same
    // per-service metric; only the construct IDs and map keys differ.

    // Error count metric filter
    const errorFilter = new logs.MetricFilter(this, `ErrorFilter-${idSuffix}`, {
      logGroup,
      filterPattern: logs.FilterPattern.anyTerm('ERROR', 'Error', 'error', 'Exception'),
      metricNamespace: 'CustomMetrics',
      metricName: `${serviceName}-Errors`,
      metricValue: '1',
      defaultValue: 0,
    });

    this.metricFilters.set(`${idSuffix}-errors`, errorFilter);

    // Warning count metric filter
    const warningFilter = new logs.MetricFilter(this, `WarningFilter-${idSuffix}`, {
      logGroup,
      filterPattern: logs.FilterPattern.anyTerm('WARNING', 'Warning', 'warn'),
      metricNamespace: 'CustomMetrics',
      metricName: `${serviceName}-Warnings`,
      metricValue: '1',
      defaultValue: 0,
    });

    this.metricFilters.set(`${idSuffix}-warnings`, warningFilter);

    // Response time metric filter (for APIs); expects JSON log lines
    // that carry a numeric `duration` field
    if (serviceType === 'lambda' || serviceType === 'api-gateway') {
      const responseTimeFilter = new logs.MetricFilter(
        this,
        `ResponseTimeFilter-${idSuffix}`,
        {
          logGroup,
          filterPattern: logs.FilterPattern.exists('$.duration'),
          metricNamespace: 'CustomMetrics',
          metricName: `${serviceName}-ResponseTime`,
          metricValue: '$.duration',
        }
      );

      this.metricFilters.set(`${idSuffix}-response-time`, responseTimeFilter);
    }

    // Business metrics (e.g., successful transactions)
    const successFilter = new logs.MetricFilter(this, `SuccessFilter-${idSuffix}`, {
      logGroup,
      filterPattern: logs.FilterPattern.anyTerm('SUCCESS', 'Success', 'success'),
      metricNamespace: 'CustomMetrics',
      metricName: `${serviceName}-Success`,
      metricValue: '1',
      defaultValue: 0,
    });

    this.metricFilters.set(`${idSuffix}-success`, successFilter);
  }

  private createDashboard(namespace: string): cloudwatch.Dashboard {
    const dashboard = new cloudwatch.Dashboard(this, 'CentralMonitoringDashboard', {
      dashboardName: `${namespace}-Central-Monitoring`,
    });

    // Add widgets for each service (each addWidgets call starts a new row)
    servicesConfig.forEach((serviceConfig) => {
      const widgets = this.createWidgetsForService(serviceConfig);
      widgets.forEach((widget) => dashboard.addWidgets(widget));
    });

    return dashboard;
  }

  private createWidgetsForService(
    serviceConfig: ServiceMonitoringConfig
  ): cloudwatch.IWidget[] {
    const widgets: cloudwatch.IWidget[] = [];

    // Create a graph widget for each metric
    serviceConfig.metrics.forEach((metricConfig) => {
      const widget = new cloudwatch.GraphWidget({
        title: `${serviceConfig.serviceName} - ${metricConfig.name}`,
        width: 12,
        left: [
          new cloudwatch.Metric({
            namespace: metricConfig.namespace,
            metricName: metricConfig.name,
            dimensionsMap: metricConfig.dimensions,
            statistic: metricConfig.statistic,
            period: cdk.Duration.seconds(metricConfig.period),
            unit: metricConfig.unit as cloudwatch.Unit | undefined,
          }),
        ],
      });

      widgets.push(widget);
    });

    // Add custom metrics from log filters
    const errorMetric = new cloudwatch.Metric({
      namespace: 'CustomMetrics',
      metricName: `${serviceConfig.serviceName}-Errors`,
      statistic: 'Sum',
      period: cdk.Duration.minutes(5),
    });

    const warningMetric = new cloudwatch.Metric({
      namespace: 'CustomMetrics',
      metricName: `${serviceConfig.serviceName}-Warnings`,
      statistic: 'Sum',
      period: cdk.Duration.minutes(5),
    });

    widgets.push(
      new cloudwatch.GraphWidget({
        title: `${serviceConfig.serviceName} - Errors & Warnings`,
        width: 12,
        left: [errorMetric],
        right: [warningMetric],
      })
    );

    return widgets;
  }

  private createCustomMetricsNamespace(namespace: string): void {
    // Create a Lambda function to publish custom metrics
    // This is a placeholder - implement as needed
  }

  private getRetention(days: number): logs.RetentionDays {
    const retentionMap: Record<number, logs.RetentionDays> = {
      1: logs.RetentionDays.ONE_DAY,
      3: logs.RetentionDays.THREE_DAYS,
      5: logs.RetentionDays.FIVE_DAYS,
      7: logs.RetentionDays.ONE_WEEK,
      14: logs.RetentionDays.TWO_WEEKS,
      30: logs.RetentionDays.ONE_MONTH,
      60: logs.RetentionDays.TWO_MONTHS,
      90: logs.RetentionDays.THREE_MONTHS,
      120: logs.RetentionDays.FOUR_MONTHS,
      180: logs.RetentionDays.SIX_MONTHS,
      365: logs.RetentionDays.ONE_YEAR,
    };

    // Unsupported values fall back to one week
    return retentionMap[days] ?? logs.RetentionDays.ONE_WEEK;
  }
}
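
The error metric filter counts log lines containing terms such as ERROR. For ad-hoc investigation, a comparable count can be produced with a CloudWatch Logs Insights query. The following sketch assembles such a query string. `buildErrorCountQuery` is a hypothetical helper, and the 5-minute bin simply mirrors the 300-second period used in the service configuration.

```typescript
// Hypothetical helper (an assumption, not part of the original project):
// builds a Logs Insights query that counts matching log lines per time bin.
function buildErrorCountQuery(terms: string[], binMinutes = 5): string {
  // Escape regex metacharacters so each term is matched literally.
  const escaped = terms.map((t) => t.replace(/[.*+?^${}()|[\]\\\/]/g, '\\$&'));
  const pattern = escaped.join('|');
  return [
    'fields @timestamp, @message',
    `| filter @message like /${pattern}/`,
    `| stats count(*) as errorCount by bin(${binMinutes}m)`,
  ].join('\n');
}

// Same terms as the error metric filter in the stack.
const errorQuery = buildErrorCountQuery(['ERROR', 'Error', 'error', 'Exception']);
console.log(errorQuery);
```

The resulting string can be pasted into the Logs Insights console or used as the query text in a Grafana CloudWatch Logs panel.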

🔔 Alerting Stack - CloudWatch Alarms and SNS

// lib/stacks/alerting-stack.ts
import * as cdk from 'aws-cdk-lib';
import * as cloudwatch from 'aws-cdk-lib/aws-cloudwatch';
import * as cloudwatch_actions from 'aws-cdk-lib/aws-cloudwatch-actions';
import * as sns from 'aws-cdk-lib/aws-sns';
import * as sns_subscriptions from 'aws-cdk-lib/aws-sns-subscriptions';
import * as lambda from 'aws-cdk-lib/aws-lambda';
import { NodejsFunction } from 'aws-cdk-lib/aws-lambda-nodejs';
import { Construct } from 'constructs';
import * as path from 'path';
import { servicesConfig } from '../config/services-config';
import { monitoringConfigs } from '../config/monitoring-config';

export interface AlertingStackProps extends cdk.StackProps {
  environment: 'development' | 'production';
  alertProcessorRole: cdk.aws_iam.Role;
}

export class AlertingStack extends cdk.Stack {
  public readonly criticalTopic: sns.Topic;
  public readonly warningTopic: sns.Topic;
  public readonly alarms: Map<string, cloudwatch.Alarm>;

  constructor(scope: Construct, id: string, props: AlertingStackProps) {
    super(scope, id, props);

    const config = monitoringConfigs[props.environment];
    this.alarms = new Map();

    // Create SNS topics for different severity levels
    this.criticalTopic = new sns.Topic(this, 'CriticalAlertsTopic', {
      topicName: 'monitoring-critical-alerts',
      displayName: 'Critical Alerts',
    });

    this.warningTopic = new sns.Topic(this, 'WarningAlertsTopic', {
      topicName: 'monitoring-warning-alerts',
      displayName: 'Warning Alerts',
    });

    // Subscribe emails to topics
    if (config.alerting.enabled) {
      config.alerting.emailEndpoints.forEach((email) => {
        this.criticalTopic.addSubscription(
          new sns_subscriptions.EmailSubscription(email)
        );
        this.warningTopic.addSubscription(
          new sns_subscriptions.EmailSubscription(email)
        );
      });
    }

    // Create alert processor Lambda
    const alertProcessor = this.createAlertProcessor(props.alertProcessorRole, config);

    // Subscribe Lambda to SNS topics
    this.criticalTopic.addSubscription(
      new sns_subscriptions.LambdaSubscription(alertProcessor)
    );
    this.warningTopic.addSubscription(
      new sns_subscriptions.LambdaSubscription(alertProcessor)
    );

    // Create alarms for all services
    servicesConfig.forEach((serviceConfig) => {
      this.createAlarmsForService(serviceConfig);
    });

    // Create composite alarms
    this.createCompositeAlarms();
  }

  private createAlertProcessor(
    role: cdk.aws_iam.Role,
    config: any
  ): lambda.Function {
    const alertProcessor = new NodejsFunction(this, 'AlertProcessor', {
      runtime: lambda.Runtime.NODEJS_20_X,
      handler: 'handler',
      entry: path.join(__dirname, '../../lambda/alert-processor/index.ts'),
      timeout: cdk.Duration.seconds(30),
      memorySize: 256,
      role,
      // These values are visible in the Lambda configuration as plain text.
      environment: {
        SLACK_WEBHOOK_URL: config.alerting.slackWebhookUrl || '',
        PAGERDUTY_KEY: config.alerting.pagerDutyIntegrationKey || '',
      },
    });

    return alertProcessor;
  }

  private createAlarmsForService(
    serviceConfig: any
  ): void {
    serviceConfig.alarms.forEach((alarmConfig: any) => {
      const metric = new cloudwatch.Metric({
        namespace: alarmConfig.metric.namespace,
        metricName: alarmConfig.metric.name,
        dimensionsMap: alarmConfig.metric.dimensions,
        statistic: alarmConfig.metric.statistic,
        period: cdk.Duration.seconds(alarmConfig.metric.period),
      });

      const alarm = new cloudwatch.Alarm(
        this,
        `Alarm-${serviceConfig.serviceName}-${alarmConfig.name}`,
        {
          alarmName: alarmConfig.name,
          alarmDescription: alarmConfig.description,
          metric,
          threshold: alarmConfig.threshold,
          comparisonOperator: this.getComparisonOperator(
            alarmConfig.comparisonOperator
          ),
          evaluationPeriods: alarmConfig.evaluationPeriods,
          datapointsToAlarm: alarmConfig.datapointsToAlarm,
          treatMissingData: this.getTreatMissingData(alarmConfig.treatMissingData),
        }
      );

      // Add alarm actions based on severity
      if (alarmConfig.severity === 'critical') {
        alarm.addAlarmAction(new cloudwatch_actions.SnsAction(this.criticalTopic));
      } else if (alarmConfig.severity === 'warning') {
        alarm.addAlarmAction(new cloudwatch_actions.SnsAction(this.warningTopic));
      }

      this.alarms.set(`${serviceConfig.serviceName}-${alarmConfig.name}`, alarm);
    });
  }

  private createCompositeAlarms(): void {
    // Select the error-rate alarms by their map key. `alarm.alarmName` is an
    // unresolved CDK token at synth time, so string matching on it never succeeds.
    const serviceDownAlarms = Array.from(this.alarms.entries())
      .filter(([key]) => key.includes('high-error-rate'))
      .map(([, alarm]) => alarm);

    if (serviceDownAlarms.length > 0) {
      // Note: anyOf fires as soon as a single underlying alarm is in ALARM.
      new cloudwatch.CompositeAlarm(this, 'MultipleServicesDown', {
        compositeAlarmName: 'multiple-services-down',
        alarmDescription: 'Multiple services are experiencing high error rates',
        alarmRule: cloudwatch.AlarmRule.anyOf(
          ...serviceDownAlarms.map((alarm) => cloudwatch.AlarmRule.fromAlarm(alarm, cloudwatch.AlarmState.ALARM))
        ),
      });
    }
  }

  private getComparisonOperator(
    operator: string
  ): cloudwatch.ComparisonOperator {
    const operatorMap: Record<string, cloudwatch.ComparisonOperator> = {
      GreaterThanThreshold: cloudwatch.ComparisonOperator.GREATER_THAN_THRESHOLD,
      LessThanThreshold: cloudwatch.ComparisonOperator.LESS_THAN_THRESHOLD,
      GreaterThanOrEqualToThreshold:
        cloudwatch.ComparisonOperator.GREATER_THAN_OR_EQUAL_TO_THRESHOLD,
      LessThanOrEqualToThreshold:
        cloudwatch.ComparisonOperator.LESS_THAN_OR_EQUAL_TO_THRESHOLD,
    };

    // Fail at synth time instead of passing `undefined` to the alarm
    const mapped = operatorMap[operator];
    if (!mapped) {
      throw new Error(`Unsupported comparison operator: ${operator}`);
    }
    return mapped;
  }

  private getTreatMissingData(
    treatment: string | undefined
  ): cloudwatch.TreatMissingData {
    if (!treatment) return cloudwatch.TreatMissingData.NOT_BREACHING;

    const treatmentMap: Record<string, cloudwatch.TreatMissingData> = {
      notBreaching: cloudwatch.TreatMissingData.NOT_BREACHING,
      breaching: cloudwatch.TreatMissingData.BREACHING,
      ignore: cloudwatch.TreatMissingData.IGNORE,
      missing: cloudwatch.TreatMissingData.MISSING,
    };

    // Fail at synth time on an unknown value
    const mapped = treatmentMap[treatment];
    if (!mapped) {
      throw new Error(`Unsupported treatMissingData value: ${treatment}`);
    }
    return mapped;
  }
}
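
The composite alarm above is built with anyOf, so it fires when any one error-rate alarm is in ALARM. If the intent is to alert only when at least two services are failing at once, one option is to build the rule expression by hand and pass the result to cloudwatch.AlarmRule.fromString. The helper below is a minimal sketch of that idea. The name atLeastTwoInAlarm is hypothetical, and it assumes the alarm names are plain strings known at synth time.

```typescript
// Hypothetical helper: builds a composite alarm rule expression that is true
// when at least two of the named alarms are in ALARM state at the same time.
function atLeastTwoInAlarm(alarmNames: string[]): string {
  if (alarmNames.length < 2) {
    throw new Error('Need at least two alarm names to build a pairwise rule');
  }
  const pairs: string[] = [];
  // Enumerate every unordered pair (i, j) with i < j
  for (let i = 0; i < alarmNames.length; i++) {
    for (let j = i + 1; j < alarmNames.length; j++) {
      pairs.push(`(ALARM("${alarmNames[i]}") AND ALARM("${alarmNames[j]}"))`);
    }
  }
  return pairs.join(' OR ');
}
```

The number of pairs grows quadratically with the number of alarms, and rule expressions have a length limit, so this approach suits a handful of services rather than dozens.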

📊 Grafana Stack - ECS Deployment

// lib/stacks/grafana-stack.ts
import * as cdk from 'aws-cdk-lib';
import * as ec2 from 'aws-cdk-lib/aws-ec2';
import * as ecs from 'aws-cdk-lib/aws-ecs';
import * as ecs_patterns from 'aws-cdk-lib/aws-ecs-patterns';
import * as efs from 'aws-cdk-lib/aws-efs';
import * as secretsmanager from 'aws-cdk-lib/aws-secretsmanager';
import * as iam from 'aws-cdk-lib/aws-iam';
import { Construct } from 'constructs';
import { monitoringConfigs } from '../config/monitoring-config';

export interface GrafanaStackProps extends cdk.StackProps {
  environment: 'development' | 'production';
  vpc: ec2.Vpc;
  grafanaRole: iam.Role;
}

export class GrafanaStack extends cdk.Stack {
  // Optional because the constructor returns early when Grafana is disabled
  public readonly service?: ecs_patterns.ApplicationLoadBalancedFargateService;
  public readonly url?: string;

  constructor(scope: Construct, id: string, props: GrafanaStackProps) {
    super(scope, id, props);

    const config = monitoringConfigs[props.environment];

    if (!config.grafana?.enabled) {
      console.log('Grafana is disabled for this environment');
      return;
    }

    const { vpc, grafanaRole } = props;

    // Create ECS Cluster
    const cluster = new ecs.Cluster(this, 'GrafanaCluster', {
      vpc,
      clusterName: 'monitoring-grafana-cluster',
      containerInsights: config.enableContainerInsights,
    });

    // Create EFS for Grafana data persistence
    const fileSystem = new efs.FileSystem(this, 'GrafanaEFS', {
      vpc,
      encrypted: true,
      lifecyclePolicy: efs.LifecyclePolicy.AFTER_14_DAYS,
      performanceMode: efs.PerformanceMode.GENERAL_PURPOSE,
      throughputMode: efs.ThroughputMode.BURSTING,
      removalPolicy: cdk.RemovalPolicy.RETAIN,
    });

    // Create Secrets Manager secret for Grafana admin password
    const grafanaSecret = new secretsmanager.Secret(this, 'GrafanaAdminPassword', {
      secretName: 'grafana-admin-password',
      generateSecretString: {
        secretStringTemplate: JSON.stringify({ username: 'admin' }),
        generateStringKey: 'password',
        excludePunctuation: true,
        passwordLength: 16,
      },
    });

    // Create Fargate service with ALB
    const service = new ecs_patterns.ApplicationLoadBalancedFargateService(
      this,
      'GrafanaService',
      {
        cluster,
        serviceName: 'grafana',
        desiredCount: config.grafana.desiredCount,
        cpu: this.getCpuFromInstanceType(config.grafana.instanceType),
        memoryLimitMiB: this.getMemoryFromInstanceType(config.grafana.instanceType),
        taskImageOptions: {
          // Consider pinning a specific image tag for reproducible deployments
          image: ecs.ContainerImage.fromRegistry('grafana/grafana:latest'),
          containerPort: 3000,
          taskRole: grafanaRole,
          environment: {
            GF_SERVER_ROOT_URL: config.grafana.domain
              ? `https://${config.grafana.domain}`
              : '',
            GF_AUTH_ANONYMOUS_ENABLED: 'false',
            GF_SECURITY_ADMIN_USER: 'admin',
            GF_INSTALL_PLUGINS: 'grafana-clock-panel,grafana-simple-json-datasource,grafana-piechart-panel',
            // CloudWatch data source configuration
            GF_AWS_DEFAULT_REGION: cdk.Aws.REGION,
            // Grafana maps GF_<SECTION>_<KEY> to config, here [aws] assume_role_enabled
            GF_AWS_ASSUME_ROLE_ENABLED: 'true',
          },
          secrets: {
            GF_SECURITY_ADMIN_PASSWORD: ecs.Secret.fromSecretsManager(
              grafanaSecret,
              'password'
            ),
          },
        },
        publicLoadBalancer: true,
      }
    );
    this.service = service;

    // Configure volume mount for EFS
    const volumeName = 'grafana-storage';

    service.taskDefinition.addVolume({
      name: volumeName,
      efsVolumeConfiguration: {
        fileSystemId: fileSystem.fileSystemId,
        transitEncryption: 'ENABLED',
      },
    });

    service.taskDefinition.defaultContainer?.addMountPoints({
      sourceVolume: volumeName,
      containerPath: '/var/lib/grafana',
      readOnly: false,
    });

    // Allow the Fargate tasks (not the ALB) to reach EFS on the NFS port
    fileSystem.connections.allowDefaultPortFrom(service.service.connections);

    // Configure health check
    service.targetGroup.configureHealthCheck({
      path: '/api/health',
      interval: cdk.Duration.seconds(30),
      timeout: cdk.Duration.seconds(5),
      healthyThresholdCount: 2,
      unhealthyThresholdCount: 3,
    });

    // Auto scaling
    const scaling = service.service.autoScaleTaskCount({
      minCapacity: 1,
      maxCapacity: config.grafana.desiredCount * 2,
    });

    scaling.scaleOnCpuUtilization('CpuScaling', {
      targetUtilizationPercent: 70,
      scaleInCooldown: cdk.Duration.seconds(300),
      scaleOutCooldown: cdk.Duration.seconds(60),
    });

    scaling.scaleOnMemoryUtilization('MemoryScaling', {
      targetUtilizationPercent: 80,
      scaleInCooldown: cdk.Duration.seconds(300),
      scaleOutCooldown: cdk.Duration.seconds(60),
    });

    this.url = service.loadBalancer.loadBalancerDnsName;

    // Outputs
    new cdk.CfnOutput(this, 'GrafanaURL', {
      value: `http://${this.url}`,
      description: 'Grafana Dashboard URL',
    });

    new cdk.CfnOutput(this, 'GrafanaAdminSecretArn', {
      value: grafanaSecret.secretArn,
      description: 'Grafana Admin Password Secret ARN',
    });
  }

  // Maps the EC2-style size names used in the config to Fargate CPU units
  private getCpuFromInstanceType(instanceType: string): number {
    const cpuMap: Record<string, number> = {
      't3.small': 512,
      't3.medium': 1024,
      't3.large': 2048,
    };

    return cpuMap[instanceType] || 512;
  }

  // Maps the EC2-style size names used in the config to Fargate memory (MiB)
  private getMemoryFromInstanceType(instanceType: string): number {
    const memoryMap: Record<string, number> = {
      't3.small': 2048,
      't3.medium': 4096,
      't3.large': 8192,
    };

    return memoryMap[instanceType] || 2048;
  }
}
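
The two mapping helpers above translate EC2-style names into Fargate CPU and memory values, and Fargate only accepts specific CPU and memory combinations. A small guard can catch an invalid pair before deployment. The sketch below is an illustration, not part of the stack: the names FARGATE_MEMORY_RANGES and isValidFargateSize are hypothetical, and it covers only the 0.25 to 4 vCPU task sizes.

```typescript
// Memory ranges (MiB) accepted by Fargate for each CPU value.
// Assumption: only the 0.25 to 4 vCPU sizes are covered in this sketch.
const FARGATE_MEMORY_RANGES: Record<number, { min: number; max: number }> = {
  256: { min: 512, max: 2048 },
  512: { min: 1024, max: 4096 },
  1024: { min: 2048, max: 8192 },
  2048: { min: 4096, max: 16384 },
  4096: { min: 8192, max: 30720 },
};

function isValidFargateSize(cpu: number, memoryMiB: number): boolean {
  const range = FARGATE_MEMORY_RANGES[cpu];
  if (!range) return false;
  if (memoryMiB < range.min || memoryMiB > range.max) return false;
  // 256 CPU units allow exactly 512, 1024, or 2048 MiB;
  // the larger sizes step in 1024 MiB increments.
  if (cpu === 256) {
    return memoryMiB === 512 || memoryMiB === 1024 || memoryMiB === 2048;
  }
  return memoryMiB % 1024 === 0;
}
```

All three pairs used by the stack (512/2048, 1024/4096, 2048/8192) pass this check, so the guard mainly protects against future edits to the maps.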

⚡ Lambda Functions Implementation

🔔 Alert Processor Lambda

// lambda/alert-processor/index.ts
import { SNSEvent, SNSHandler } from 'aws-lambda';
import axios from 'axios';

interface CloudWatchAlarm {
  AlarmName: string;
  AlarmDescription: string;
  AWSAccountId: string;
  NewStateValue: string;
  NewStateReason: string;
  StateChangeTime: string;
  Region: string;
  OldStateValue: string;
  Trigger: {
    MetricName: string;
    Namespace: string;
    StatisticType: string;
    Statistic: string;
    Unit: string | null;
    Dimensions: Array<{ name: string; value: string }>;
    Period: number;
    EvaluationPeriods: number;
    ComparisonOperator: string;
    Threshold: number;
  };
}

export const handler: SNSHandler = async (event: SNSEvent) => {
  console.log('Alert Processor received event:', JSON.stringify(event, null, 2));

  for (const record of event.Records) {
    try {
      const message = JSON.parse(record.Sns.Message) as CloudWatchAlarm;

      // Process alarm
      await processAlarm(message);

      // Send to Slack
      if (process.env.SLACK_WEBHOOK_URL) {
        await sendSlackNotification(message);
      }

      // Send to PagerDuty
      if (process.env.PAGERDUTY_KEY && message.NewStateValue === 'ALARM') {
        await sendPagerDutyAlert(message);
      }
    } catch (error) {
      console.error('Error processing alarm:', error);
    }
  }
};

async function processAlarm(alarm: CloudWatchAlarm): Promise<void> {
  console.log('Processing alarm:', alarm.AlarmName);
  console.log('State:', alarm.OldStateValue, '->', alarm.NewStateValue);
  console.log('Reason:', alarm.NewStateReason);

  // Add custom processing logic here
  // e.g., Update dashboard, create ticket, trigger auto-remediation
}

async function sendSlackNotification(alarm: CloudWatchAlarm): Promise<void> {
  const webhookUrl = process.env.SLACK_WEBHOOK_URL;
  if (!webhookUrl) return;

  const color = alarm.NewStateValue === 'ALARM' ? '#ff0000' : '#36a64f';
  const emoji = alarm.NewStateValue === 'ALARM' ? ':rotating_light:' : ':white_check_mark:';

  const message = {
    text: `${emoji} CloudWatch Alarm ${alarm.NewStateValue}`,
    attachments: [
      {
        color,
        title: alarm.AlarmName,
        text: alarm.AlarmDescription,
        fields: [
          {
            title: 'State',
            value: `${alarm.OldStateValue} → ${alarm.NewStateValue}`,
            short: true,
          },
          {
            title: 'Region',
            value: alarm.Region,
            short: true,
          },
          {
            title: 'Metric',
            value: `${alarm.Trigger.Namespace}/${alarm.Trigger.MetricName}`,
            short: true,
          },
          {
            title: 'Threshold',
            value: `${alarm.Trigger.ComparisonOperator} ${alarm.Trigger.Threshold}`,
            short: true,
          },
          {
            title: 'Reason',
            value: alarm.NewStateReason,
            short: false,
          },
        ],
        footer: 'AWS CloudWatch',
        ts: Math.floor(new Date(alarm.StateChangeTime).getTime() / 1000),
      },
    ],
  };

  try {
    await axios.post(webhookUrl, message);
    console.log('Slack notification sent successfully');
  } catch (error) {
    console.error('Error sending Slack notification:', error);
  }
}

async function sendPagerDutyAlert(alarm: CloudWatchAlarm): Promise<void> {
  const integrationKey = process.env.PAGERDUTY_KEY;
  if (!integrationKey) return;

  const event = {
    routing_key: integrationKey,
    event_action: 'trigger',
    dedup_key: alarm.AlarmName,
    payload: {
      summary: `${alarm.AlarmName}: ${alarm.NewStateReason}`,
      severity: 'critical',
      source: alarm.Region,
      custom_details: {
        alarm_name: alarm.AlarmName,
        alarm_description: alarm.AlarmDescription,
        metric: `${alarm.Trigger.Namespace}/${alarm.Trigger.MetricName}`,
        threshold: alarm.Trigger.Threshold,
        current_state: alarm.NewStateValue,
        previous_state: alarm.OldStateValue,
      },
    },
  };

  try {
    await axios.post('https://events.pagerduty.com/v2/enqueue', event);
    console.log('PagerDuty alert sent successfully');
  } catch (error) {
    console.error('Error sending PagerDuty alert:', error);
  }
}
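
The handler above casts the result of JSON.parse straight to CloudWatchAlarm, so a malformed or non-alarm SNS message is only caught when a missing field is dereferenced. One way to tighten this is a small runtime check before processing. The sketch below is an assumption about how that could look, not part of the original handler; MinimalAlarmMessage and parseAlarmMessage are hypothetical names, and only three fields are validated to keep it short.

```typescript
// Hypothetical subset of the alarm message that this sketch validates
interface MinimalAlarmMessage {
  AlarmName: string;
  NewStateValue: 'OK' | 'ALARM' | 'INSUFFICIENT_DATA';
  NewStateReason: string;
}

// Returns the validated fields, or undefined if the payload is not
// valid JSON or does not look like a CloudWatch alarm notification.
function parseAlarmMessage(raw: string): MinimalAlarmMessage | undefined {
  let parsed: unknown;
  try {
    parsed = JSON.parse(raw);
  } catch {
    return undefined;
  }
  if (typeof parsed !== 'object' || parsed === null) return undefined;

  const candidate = parsed as Record<string, unknown>;
  const name = candidate['AlarmName'];
  const state = candidate['NewStateValue'];
  const reason = candidate['NewStateReason'];

  if (typeof name !== 'string' || typeof reason !== 'string') return undefined;
  if (state !== 'OK' && state !== 'ALARM' && state !== 'INSUFFICIENT_DATA') {
    return undefined;
  }
  return { AlarmName: name, NewStateValue: state, NewStateReason: reason };
}
```

In the handler, a record whose message fails this check could be logged and skipped before any Slack or PagerDuty call is attempted.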

📊 Grafana Configuration

🔧 CloudWatch Data Source Configuration

// dashboards/grafana/datasource-config.json
{
  "name": "CloudWatch",
  "type": "cloudwatch",
  "access": "proxy",
  "jsonData": {
    "authType": "default",
    "defaultRegion": "us-east-1"
  }
}

📈 Sample Grafana Dashboard

// dashboards/grafana/service-dashboard.json
{
  "dashboard": {
    "title": "Service Monitoring Dashboard",
    "tags": ["monitoring", "services"],
    "timezone": "browser",
    "panels": [
      {
        "id": 1,
        "title": "Lambda Invocations",
        "type": "graph",
        "datasource": "CloudWatch",
        "targets": [
          {
            "namespace": "AWS/Lambda",
            "metricName": "Invocations",
            "dimensions": {
              "FunctionName": "*"
            },
            "statistics": ["Sum"],
            "period": "300"
          }
        ],
        "gridPos": { "x": 0, "y": 0, "w": 12, "h": 8 }
      },
      {
        "id": 2,
        "title": "Lambda Errors",
        "type": "graph",
        "datasource": "CloudWatch",
        "targets": [
          {
            "namespace": "AWS/Lambda",
            "metricName": "Errors",
            "dimensions": {
              "FunctionName": "*"
            },
            "statistics": ["Sum"],
            "period": "300"
          }
        ],
        "gridPos": { "x": 12, "y": 0, "w": 12, "h": 8 }
      },
      {
        "id": 3,
        "title": "API Gateway Requests",
        "type": "graph",
        "datasource": "CloudWatch",
        "targets": [
          {
            "namespace": "AWS/ApiGateway",
            "metricName": "Count",
            "statistics": ["Sum"],
            "period": "300"
          }
        ],
        "gridPos": { "x": 0, "y": 8, "w": 12, "h": 8 }
      },
      {
        "id": 4,
        "title": "RDS CPU Utilization",
        "type": "graph",
        "datasource": "CloudWatch",
        "targets": [
          {
            "namespace": "AWS/RDS",
            "metricName": "CPUUtilization",
            "statistics": ["Average"],
            "period": "300"
          }
        ],
        "gridPos": { "x": 12, "y": 8, "w": 12, "h": 8 }
      }
    ]
  }
}
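
The gridPos values in the dashboard above follow a regular pattern: panels are 12 units wide on Grafana's 24-unit grid, two per row, each 8 units high. When a dashboard grows, computing those positions is less error-prone than writing them by hand. The following sketch shows one way to do that; PanelSpec and layoutTwoColumns are hypothetical names introduced for illustration.

```typescript
interface PanelSpec {
  title: string;
  namespace: string;
  metricName: string;
  statistic: string;
}

interface GridPos {
  x: number;
  y: number;
  w: number;
  h: number;
}

// Assigns sequential ids and a two-column gridPos to each panel spec.
function layoutTwoColumns(
  specs: PanelSpec[],
  height = 8
): Array<PanelSpec & { id: number; gridPos: GridPos }> {
  const width = 12; // half of the 24-unit dashboard grid
  return specs.map((spec, index) => ({
    ...spec,
    id: index + 1,
    gridPos: {
      x: (index % 2) * width, // left column for even index, right for odd
      y: Math.floor(index / 2) * height, // advance one row every two panels
      w: width,
      h: height,
    },
  }));
}
```

Feeding it the four panels from the sample dashboard, in order, yields the same positions as the JSON above.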

🚀 Main CDK App

#!/usr/bin/env node
// bin/monitoring-system.ts
import 'source-map-support/register';
import * as cdk from 'aws-cdk-lib';
import * as ec2 from 'aws-cdk-lib/aws-ec2';
import { IAMStack } from '../lib/stacks/iam-stack';
import { CloudWatchStack } from '../lib/stacks/cloudwatch-stack';
import { AlertingStack } from '../lib/stacks/alerting-stack';
import { GrafanaStack } from '../lib/stacks/grafana-stack';

const app = new cdk.App();
const environment = app.node.tryGetContext('environment') || 'development';

const env = {
  account: process.env.CDK_DEFAULT_ACCOUNT,
  region: process.env.CDK_DEFAULT_REGION || 'us-east-1',
};

// IAM Stack - Create roles first
const iamStack = new IAMStack(app, 'MonitoringIAMStack', { env });

// CloudWatch Stack - Log aggregation and metrics
const cloudWatchStack = new CloudWatchStack(app, 'MonitoringCloudWatchStack', {
  environment,
  env,
});

// Alerting Stack - Alarms and notifications
const alertingStack = new AlertingStack(app, 'MonitoringAlertingStack', {
  environment,
  alertProcessorRole: iamStack.alertProcessorRole,
  env,
});

alertingStack.addDependency(cloudWatchStack);

// VPC for Grafana. A Vpc must be created inside a Stack, not directly under
// the App, so it gets a small stack of its own here
// ('MonitoringNetworkStack' is an illustrative name).
const networkStack = new cdk.Stack(app, 'MonitoringNetworkStack', { env });
const vpc = new ec2.Vpc(networkStack, 'MonitoringVPC', {
  maxAzs: 2,
  natGateways: 1,
});

// Grafana Stack - Advanced visualization
const grafanaStack = new GrafanaStack(app, 'MonitoringGrafanaStack', {
  environment,
  vpc,
  grafanaRole: iamStack.grafanaRole,
  env,
});

grafanaStack.addDependency(iamStack);

// Tags
cdk.Tags.of(app).add('Project', 'CentralizedMonitoring');
cdk.Tags.of(app).add('Environment', environment);
cdk.Tags.of(app).add('ManagedBy', 'CDK');

app.synth();

🔍 Log Insights Queries

📊 Useful Query Examples

// Common CloudWatch Logs Insights queries

// 1. Error analysis
const errorQuery = `
fields @timestamp, @message
| filter @message like /ERROR/
| stats count() by bin(5m)
`;

// 2. Slow invocations (Lambda @duration is reported in milliseconds)
const slowQueryLog = `
fields @timestamp, @message, @duration
| filter @duration > 1000
| sort @duration desc
| limit 100
`;

// 3. API request analysis
const apiAnalysisQuery = `
fields @timestamp, method, path, statusCode, duration
| filter statusCode >= 400
| stats count() by statusCode, bin(5m)
`;

// 4. Lambda cold starts
const coldStartQuery = `
fields @timestamp, @message, @initDuration
| filter @type = "REPORT"
| filter @initDuration > 0
| stats count(), avg(@initDuration), max(@initDuration) by bin(1h)
`;

// 5. Top error messages
const topErrorsQuery = `
fields @message
| filter @message like /ERROR/
| stats count() as error_count by @message
| sort error_count desc
| limit 10
`;
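
The error-analysis query above hard-codes both the search term and the bin size. If the same query is reused across services, a small builder keeps the variants consistent. This is a minimal sketch; buildErrorCountQuery is a hypothetical helper, and it deliberately accepts only word characters as the search term so that the value cannot break out of the /.../ pattern.

```typescript
// Hypothetical helper that parameterizes the error-count query shown above.
function buildErrorCountQuery(term: string, binMinutes: number): string {
  // Restrict the term to letters, digits, and underscores
  if (!/^\w+$/.test(term)) {
    throw new Error(`Unsupported search term: ${term}`);
  }
  if (!Number.isInteger(binMinutes) || binMinutes <= 0) {
    throw new Error(`binMinutes must be a positive integer, got ${binMinutes}`);
  }
  return [
    'fields @timestamp, @message',
    `| filter @message like /${term}/`,
    `| stats count() by bin(${binMinutes}m)`,
  ].join('\n');
}
```

Calling buildErrorCountQuery('ERROR', 5) reproduces the first query in the list, minus the surrounding blank lines.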

💰 Cost Optimization

💡 Cost Breakdown and Strategies

| Component | Cost Factor | Optimization Strategy |
| CloudWatch Logs | Ingestion + Storage | Adjust retention, use metric filters |
| CloudWatch Metrics | Number of metrics | Use custom metrics wisely, aggregation |
| CloudWatch Alarms | Number of alarms | Composite alarms, reduce evaluation frequency |
| Grafana (ECS) | EC2/Fargate hours | Right-size instances, use Spot for dev |
| Data Transfer | Cross-region/AZ | Keep monitoring in same region |
| API Calls | CloudWatch API calls | Cache dashboard data, batch queries |
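
To make the CloudWatch Logs row concrete, a back-of-the-envelope estimate can show how ingestion volume and retention interact. The sketch below assumes a steady daily ingest volume, a 30-day month, and no compression. Both prices are parameters supplied by the caller from the current AWS price list; LogCostInputs and estimateMonthlyLogCost are hypothetical names.

```typescript
interface LogCostInputs {
  ingestGbPerDay: number;
  retentionDays: number;
  ingestPricePerGb: number; // assumption: taken from the current price list
  storagePricePerGbMonth: number; // assumption: taken from the current price list
}

function estimateMonthlyLogCost(
  inputs: LogCostInputs
): { ingestion: number; storage: number; total: number } {
  // Ingestion is charged on everything written during the month
  const ingestion = inputs.ingestGbPerDay * 30 * inputs.ingestPricePerGb;
  // At steady state, roughly `retentionDays` worth of logs is stored at any time
  const storage =
    inputs.ingestGbPerDay * inputs.retentionDays * inputs.storagePricePerGbMonth;
  return { ingestion, storage, total: ingestion + storage };
}
```

Because the storage term scales with retentionDays while the ingestion term does not, shortening retention only helps when storage is a meaningful share of the total; otherwise reducing log volume has the larger effect.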

🎯 Cost Optimization Strategies

// 1. Log retention policies
const logGroup = new logs.LogGroup(this, 'LogGroup', {
  retention: logs.RetentionDays.ONE_WEEK, // Shorter for dev
});

// 2. Metric filters instead of custom metrics
// Extract metrics from logs instead of publishing custom metrics
const metricFilter = new logs.MetricFilter(this, 'ErrorMetric', {
  logGroup,
  // Match any log event containing the term ERROR. A bare '[ERROR]' would be
  // parsed as a space-delimited field pattern rather than as literal text.
  filterPattern: logs.FilterPattern.anyTerm('ERROR'),
  metricNamespace: 'CustomMetrics',
  metricName: 'Errors',
  metricValue: '1',
});

// 3. Use sampling for high-volume logs
// Implement sampling in application code

// 4. Composite alarms
// Reduce alert noise by combining multiple conditions into one notification
// (the underlying alarms still exist and are still billed)
const compositeAlarm = new cloudwatch.CompositeAlarm(this, 'CompositeAlarm', {
  alarmRule: cloudwatch.AlarmRule.anyOf(alarm1, alarm2, alarm3),
});

// 5. Use Fargate Spot for non-production Grafana
// Add capacity provider with Spot instances
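
Strategy 3 above leaves sampling to the application. A minimal sketch of what that might look like follows: warnings and errors are always kept, and lower-severity lines are kept with a configurable probability. The names Level and shouldLog are hypothetical, and the random source is injectable only so the behavior can be tested deterministically.

```typescript
type Level = 'debug' | 'info' | 'warn' | 'error';

// Decides whether a log line should be emitted.
// `rate` is the fraction of debug/info lines to keep, between 0 and 1.
function shouldLog(
  level: Level,
  rate: number,
  random: () => number = Math.random
): boolean {
  // Never drop the lines that alarms and metric filters depend on
  if (level === 'warn' || level === 'error') return true;
  return random() < rate;
}
```

A wrapper around console.log would call shouldLog before writing. Note that sampled logs make count-based metric filters on info-level lines undercount by roughly a factor of 1/rate.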

📚 Integration Examples

🔗 Enabling CloudWatch Logs for Services

// The snippets below run inside a Stack and assume that `logGroup` (a
// logs.LogGroup) and `vpc` (an ec2.IVpc) are already defined in scope.
import * as apigateway from 'aws-cdk-lib/aws-apigateway';
import * as ecs from 'aws-cdk-lib/aws-ecs';
import * as lambda from 'aws-cdk-lib/aws-lambda';
import * as logs from 'aws-cdk-lib/aws-logs';
import * as rds from 'aws-cdk-lib/aws-rds';

// Lambda with CloudWatch Logs
const lambdaFunction = new lambda.Function(this, 'Function', {
  runtime: lambda.Runtime.NODEJS_20_X,
  handler: 'index.handler',
  code: lambda.Code.fromAsset('lambda'),
  logRetention: logs.RetentionDays.ONE_WEEK,
});

// ECS with CloudWatch Logs
const taskDefinition = new ecs.FargateTaskDefinition(this, 'TaskDef');
taskDefinition.addContainer('app', {
  image: ecs.ContainerImage.fromRegistry('my-app'),
  logging: ecs.LogDrivers.awsLogs({
    streamPrefix: 'ecs',
    logRetention: logs.RetentionDays.ONE_WEEK,
  }),
});

// API Gateway with CloudWatch Logs
const api = new apigateway.RestApi(this, 'API', {
  deployOptions: {
    loggingLevel: apigateway.MethodLoggingLevel.INFO,
    dataTraceEnabled: true,
    accessLogDestination: new apigateway.LogGroupLogDestination(logGroup),
  },
});

// RDS with CloudWatch Logs
const database = new rds.DatabaseInstance(this, 'Database', {
  engine: rds.DatabaseInstanceEngine.postgres({
    version: rds.PostgresEngineVersion.VER_14,
  }),
  vpc, // required by DatabaseInstance
  cloudwatchLogsExports: ['postgresql'],
});

// Enable Container Insights for ECS
const cluster = new ecs.Cluster(this, 'Cluster', {
  containerInsights: true,
});

// Enable Lambda Insights
const lambdaWithInsights = new lambda.Function(this, 'FunctionWithInsights', {
  runtime: lambda.Runtime.NODEJS_20_X,
  handler: 'index.handler',
  code: lambda.Code.fromAsset('lambda'),
  insightsVersion: lambda.LambdaInsightsVersion.VERSION_1_0_229_0,
});

🚀 Deployment

# Install dependencies
npm install

# Bootstrap CDK (first time only)
cdk bootstrap

# Synthesize CloudFormation templates
cdk synth

# Deploy IAM stack first
cdk deploy MonitoringIAMStack --context environment=development

# Deploy CloudWatch stack
cdk deploy MonitoringCloudWatchStack --context environment=development

# Deploy Alerting stack
cdk deploy MonitoringAlertingStack --context environment=development

# Deploy Grafana stack
cdk deploy MonitoringGrafanaStack --context environment=development

# Deploy all stacks
cdk deploy --all --context environment=production --require-approval never

# View Grafana URL
aws cloudformation describe-stacks \
  --stack-name MonitoringGrafanaStack \
  --query 'Stacks[0].Outputs[?OutputKey==`GrafanaURL`].OutputValue' \
  --output text

📋 Summary and Best Practices

🎯 Key Takeaways

  1. Hybrid Approach: Use CloudWatch for data collection, Grafana for visualization
  2. IAM Permissions: Properly configure cross-service access with least privilege
  3. Log Aggregation: Centralize logs from all services in CloudWatch
  4. Metric Filters: Extract metrics from logs to reduce custom metric costs
  5. Alerting: Multi-tier alerting with SNS topics and Lambda processors
  6. Retention: Configure appropriate log retention based on environment
  7. Cost Management: Monitor costs, use sampling, optimize retention
  8. Dashboard Design: Create role-specific dashboards for different teams

✅ Monitoring Checklist

  • Define monitoring requirements for all services
  • Set up IAM roles with cross-service permissions
  • Enable CloudWatch Logs for all services
  • Create metric filters for log-based metrics
  • Configure CloudWatch Alarms with appropriate thresholds
  • Set up SNS topics for different alert severities
  • Deploy Grafana on ECS with proper security
  • Create dashboards for different teams/services
  • Enable Container Insights (ECS/EKS)
  • Enable Lambda Insights for Lambda functions
  • Configure log retention policies
  • Set up alert notifications (Slack, PagerDuty, email)
  • Test alerting workflows
  • Document dashboard usage and queries
  • Implement cost monitoring for the monitoring stack itself

🎨 Dashboard Organization

Dashboards by Role:
├── Executive Dashboard
│   ├── High-level KPIs
│   ├── Cost metrics
│   └── Availability metrics
├── Operations Dashboard
│   ├── Infrastructure health
│   ├── Service availability
│   └── Active incidents
├── Development Dashboard
│   ├── Application metrics
│   ├── Error rates
│   └── Performance metrics
└── SRE Dashboard
    ├── SLIs/SLOs
    ├── Detailed service metrics
    └── Capacity planning

🎯 Conclusion

Building a centralized monitoring system is essential for maintaining visibility, reliability, and performance in distributed AWS environments. By combining CloudWatch’s native AWS integration with Grafana’s powerful visualization capabilities, you create a robust observability platform.

Successful monitoring rests on five practices:

  • Comprehensive Coverage: Monitor all services and infrastructure
  • Proper Permissions: Use IAM to enable cross-service monitoring
  • Actionable Alerts: Alert on symptoms, not just metrics
  • Cost Awareness: Balance observability needs with cost
  • Team-Specific Dashboards: Provide relevant views for different roles

Key Benefits:

  • Single pane of glass for all AWS services
  • Proactive issue detection and alerting
  • Reduced MTTR (Mean Time To Resolution)
  • Better understanding of system behavior
  • Data-driven optimization decisions
  • Compliance and audit readiness

Tags: #AWS #CloudWatch #Grafana #Monitoring #Observability #CDK #TypeScript #Logging #Metrics #Alerting #DevOps #SRE #Infrastructure