Building Centralized Grafana + Prometheus Monitoring with AWS CDK: Multi-Service Observability Platform

In modern distributed systems running on AWS, monitoring individual services in isolation creates operational blind spots. A centralized Prometheus + Grafana platform provides unified visibility across all infrastructure, enabling correlation analysis, efficient troubleshooting, and proactive alerting. This post explores building a production-grade monitoring hub using AWS CDK that aggregates metrics from EKS clusters, ECS services, Lambda functions, and custom applications.

The Challenge: Fragmented Observability

Organizations running microservices across AWS face critical monitoring challenges:

  • Isolated Metrics: Each service/cluster has its own Prometheus instance, preventing holistic analysis
  • Data Silos: No centralized view of system-wide performance and health
  • Scaling Complexity: Managing dozens of Prometheus instances becomes operationally expensive
  • Correlation Difficulty: Cross-service debugging requires manual metric aggregation
  • Storage Management: Each Prometheus instance handles its own long-term storage
  • Alerting Chaos: Duplicate alerts from multiple Prometheus instances
  • Query Performance: Complex cross-cluster queries are slow or impossible

Why Centralized Prometheus + Grafana Architecture?

Before diving into implementation, let’s understand the architectural benefits:

Centralized vs Distributed Monitoring

Traditional Distributed Approach:
┌──────────────┐    ┌──────────────┐    ┌──────────────┐
│ EKS Cluster  │    │ ECS Service  │    │  Lambda      │
│              │    │              │    │  Functions   │
│ Prometheus   │    │ Prometheus   │    │  CloudWatch  │
│ Grafana      │    │ Grafana      │    │  Metrics     │
└──────────────┘    └──────────────┘    └──────────────┘
      ↓                    ↓                    ↓
   Isolated          No Correlation        Manual Export

Centralized Approach:
┌──────────────┐    ┌──────────────┐    ┌──────────────┐
│ EKS Cluster  │    │ ECS Service  │    │  Lambda      │
│              │    │              │    │  Functions   │
│ Prometheus   │    │ Prometheus   │    │ CloudWatch   │
│ (Scraper)    │    │ (Scraper)    │    │ Exporter     │
└──────────────┘    └──────────────┘    └──────────────┘
      │                    │                    │
      └────────────────────┴────────────────────┘
                           │
                           ▼
              ┌─────────────────────────┐
              │  Central Prometheus     │
              │  (Federation/Remote)    │
              └────────────┬────────────┘
                           │
                           ▼
              ┌─────────────────────────┐
              │   Centralized Grafana   │
              │   (Unified Dashboards)  │
              └─────────────────────────┘

Key Benefits

Capability              Distributed                Centralized
----------------------  -------------------------  ----------------------------
Cross-Service Queries   Manual aggregation         Native support
Unified Dashboards      Multiple logins            Single pane of glass
Alert Deduplication     Complex rules needed       Built-in
Long-term Storage       Per-instance management    Centralized (Cortex/Thanos)
Operational Overhead    High (N instances)         Low (1 instance)
Cost Efficiency         N × infrastructure         Optimized shared resources

Architecture Overview: Multi-Tier Monitoring Platform

Our centralized monitoring architecture uses Prometheus federation and remote write to aggregate metrics from distributed sources:

┌─────────────────────────────────────────────────────────────────────┐
│                        AWS CLOUD                                    │
│                                                                     │
│  ┌────────────────────────────────────────────────────────────┐   │
│  │              METRIC SOURCES (Multi-Account/Region)         │   │
│  │                                                             │   │
│  │  ┌───────────────┐  ┌───────────────┐  ┌──────────────┐   │   │
│  │  │  EKS Cluster  │  │ ECS Services  │  │   Lambda     │   │   │
│  │  │  us-east-1    │  │  us-west-2    │  │  Functions   │   │   │
│  │  │               │  │               │  │              │   │   │
│  │  │ Prometheus    │  │ Prometheus    │  │ CloudWatch   │   │   │
│  │  │ Server        │  │ Server        │  │ → Exporter   │   │   │
│  │  │ (Scraper)     │  │ (Scraper)     │  │              │   │   │
│  │  └───────┬───────┘  └───────┬───────┘  └──────┬───────┘   │   │
│  │          │                   │                 │           │   │
│  │          │ /federate         │ remote_write    │ /metrics  │   │
│  └──────────┼───────────────────┼─────────────────┼───────────┘   │
│             │                   │                 │               │
│             └───────────────────┴─────────────────┘               │
│                                 │                                 │
│                                 ▼                                 │
│  ┌─────────────────────────────────────────────────────────────┐ │
│  │         CENTRALIZED PROMETHEUS (ECS Fargate)                │ │
│  │                                                              │ │
│  │  ┌──────────────────────────────────────────────────────┐  │ │
│  │  │  Prometheus Server (High Availability)               │  │ │
│  │  │                                                       │  │ │
│  │  │  • Federation Endpoint                               │  │ │
│  │  │  • Remote Write Receiver                             │  │ │
│  │  │  • Time Series Database                              │  │ │
│  │  │  • Recording Rules                                   │  │ │
│  │  │  • Alert Manager Integration                         │  │ │
│  │  └──────────────────────────────────────────────────────┘  │ │
│  │                           │                                 │ │
│  │  ┌────────────────────────┴────────────────────┐           │ │
│  │  │                                              │           │ │
│  │  ▼                                              ▼           │ │
│  │  ┌──────────────────┐           ┌──────────────────────┐  │ │
│  │  │ EFS Volume       │           │  Alert Manager       │  │ │
│  │  │ (TSDB Storage)   │           │  (ECS Task)          │  │ │
│  │  │                  │           │                      │  │ │
│  │  │ • Long-term data │           │ • Route alerts       │  │ │
│  │  │ • High IOPS      │           │ • Deduplication      │  │ │
│  │  │ • Automatic      │           │ • Grouping           │  │ │
│  │  │   backup         │           │ • Silencing          │  │ │
│  │  └──────────────────┘           └──────────┬───────────┘  │ │
│  └──────────────────────────────────────────────┼────────────┘ │
│                                                 │               │
│                                                 ▼               │
│  ┌─────────────────────────────────────────────────────────────┐ │
│  │           CENTRALIZED GRAFANA (ECS Fargate)                 │ │
│  │                                                              │ │
│  │  ┌──────────────────────────────────────────────────────┐  │ │
│  │  │  Grafana Server (Multi-AZ)                           │  │ │
│  │  │                                                       │  │ │
│  │  │  • Unified Dashboards                                │  │ │
│  │  │  • Multi-Prometheus Data Sources                     │  │ │
│  │  │  • Cross-Cluster Queries                             │  │ │
│  │  │  • Team Dashboards                                   │  │ │
│  │  │  • Alert Visualization                               │  │ │
│  │  └──────────────────────────────────────────────────────┘  │ │
│  │                           │                                 │ │
│  │                           ▼                                 │ │
│  │  ┌──────────────────────────────────────────────────────┐  │ │
│  │  │  EFS Volume (Grafana Data)                           │  │ │
│  │  │  • Dashboards persistence                            │  │ │
│  │  │  • User settings                                     │  │ │
│  │  │  • Plugins                                           │  │ │
│  │  └──────────────────────────────────────────────────────┘  │ │
│  └─────────────────────────────────────────────────────────────┘ │
│                                                                   │
│  ┌─────────────────────────────────────────────────────────────┐ │
│  │              NOTIFICATION CHANNELS                          │ │
│  │                                                              │ │
│  │  ┌──────────┐  ┌──────────┐  ┌──────────┐  ┌──────────┐   │ │
│  │  │  Slack   │  │ PagerDuty│  │   Email  │  │  Webhook │   │ │
│  │  └──────────┘  └──────────┘  └──────────┘  └──────────┘   │ │
│  └─────────────────────────────────────────────────────────────┘ │
│                                                                   │
│  ┌─────────────────────────────────────────────────────────────┐ │
│  │              SUPPORTING SERVICES                            │ │
│  │                                                              │ │
│  │  ┌──────────────┐  ┌──────────────┐  ┌──────────────────┐  │ │
│  │  │     VPC      │  │     ALB      │  │   Route53 DNS    │  │ │
│  │  │  (Multi-AZ)  │  │  (HTTPS)     │  │   (monitoring.)  │  │ │
│  │  └──────────────┘  └──────────────┘  └──────────────────┘  │ │
│  │                                                              │ │
│  │  ┌──────────────┐  ┌──────────────┐  ┌──────────────────┐  │ │
│  │  │   Secrets    │  │   IAM Roles  │  │   CloudWatch     │  │ │
│  │  │   Manager    │  │              │  │   Metrics        │  │ │
│  │  └──────────────┘  └──────────────┘  └──────────────────┘  │ │
│  └─────────────────────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────────────┘
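
To make the federation path in the diagram more concrete, here is a minimal sketch of the request a central Prometheus issues against a cluster's /federate endpoint. The helper name buildFederateUrl and the hostname are hypothetical and only serve as illustration; the repeated match[] query parameter is what Prometheus federation uses to select series.

```typescript
// Hypothetical helper (not part of the stacks in this post): builds the URL a
// central Prometheus scrapes when federating from a per-cluster Prometheus.
// Each series selector becomes one repeated `match[]` query parameter.
function buildFederateUrl(baseUrl: string, selectors: string[]): string {
  if (selectors.length === 0) {
    // Federation needs at least one selector to know which series to expose.
    throw new Error('federation needs at least one match[] selector');
  }
  const url = new URL('/federate', baseUrl);
  for (const selector of selectors) {
    url.searchParams.append('match[]', selector);
  }
  return url.toString();
}

// Example with an assumed internal hostname.
const federateUrl = buildFederateUrl(
  'http://prometheus.eks-cluster-1.internal:9090',
  ['{job="kubernetes-nodes"}', '{job="kubernetes-pods"}'],
);
console.log(federateUrl);
```

Remote write works in the opposite direction: the local Prometheus pushes samples to the central receiver, so no selector URL is involved on the central side.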

Data Flow Architecture

┌──────────────────────────────────────────────────────────────┐
│ 1. METRIC COLLECTION (Distributed Sources)                  │
│    • Kubernetes pods expose /metrics endpoints               │
│    • Node exporters expose system metrics                    │
│    • Custom applications export business metrics             │
└──────────────────────┬───────────────────────────────────────┘
                       │
                       ▼
┌──────────────────────────────────────────────────────────────┐
│ 2. LOCAL PROMETHEUS (Per Cluster/Service)                   │
│    • Scrape metrics every 15-30 seconds                      │
│    • Apply initial relabeling rules                          │
│    • Store metrics locally (1-7 days)                        │
│    • Evaluate local alerts                                   │
└──────────────────────┬───────────────────────────────────────┘
                       │
                       ▼
┌──────────────────────────────────────────────────────────────┐
│ 3. METRIC AGGREGATION (Federation/Remote Write)             │
│    • Federation: Central pulls from /federate endpoints      │
│    • Remote Write: Local pushes to central receiver          │
│    • Compression and batching for efficiency                 │
│    • Cross-region replication                                │
└──────────────────────┬───────────────────────────────────────┘
                       │
                       ▼
┌──────────────────────────────────────────────────────────────┐
│ 4. CENTRAL PROMETHEUS (ECS Fargate)                         │
│    • Receive and deduplicate metrics                         │
│    • Apply global recording rules                            │
│    • Store in EFS-backed TSDB (30-90 days)                  │
│    • Evaluate global alerts                                  │
│    • Expose unified query API                                │
└──────────────────────┬───────────────────────────────────────┘
                       │
                       ▼
┌──────────────────────────────────────────────────────────────┐
│ 5. VISUALIZATION (Grafana)                                  │
│    • Query central Prometheus                                │
│    • Cross-cluster analysis                                  │
│    • Render unified dashboards                               │
│    • Display alerts and annotations                          │
└──────────────────────┬───────────────────────────────────────┘
                       │
                       ▼
┌──────────────────────────────────────────────────────────────┐
│ 6. ALERTING (Alert Manager)                                 │
│    • Receive alerts from Prometheus                          │
│    • Deduplicate and group alerts                            │
│    • Route to notification channels                          │
│    • Track alert states and silences                         │
└──────────────────────────────────────────────────────────────┘
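
The scrape interval in step 2 and the retention window in step 4 together determine how much TSDB storage the central server needs. The sketch below applies the rule of thumb from the Prometheus storage documentation (retention time × ingested samples per second × bytes per sample). The function name, the input shape, and the series count are assumptions for illustration, not measurements from a real deployment.

```typescript
// Rough capacity estimate:
//   disk ≈ retention_seconds × samples_per_second × bytes_per_sample
// where samples_per_second ≈ active_series / scrape_interval_seconds.
interface SizingInput {
  activeSeries: number;          // assumed number of active time series
  scrapeIntervalSeconds: number; // e.g. 15
  retentionDays: number;         // e.g. 90
  bytesPerSample?: number;       // Prometheus docs cite roughly 1-2 bytes
}

function estimateTsdbGiB(input: SizingInput): number {
  const bytesPerSample = input.bytesPerSample === undefined ? 2 : input.bytesPerSample;
  const samplesPerSecond = input.activeSeries / input.scrapeIntervalSeconds;
  const retentionSeconds = input.retentionDays * 24 * 60 * 60;
  const bytes = retentionSeconds * samplesPerSecond * bytesPerSample;
  return bytes / (1024 * 1024 * 1024);
}

// Illustrative numbers only: 500k series, 15s interval, 90 days of retention.
const estimateGiB = estimateTsdbGiB({
  activeSeries: 500000,
  scrapeIntervalSeconds: 15,
  retentionDays: 90,
});
console.log('Estimated TSDB size: ~' + estimateGiB.toFixed(0) + ' GiB');
```

An estimate like this is only a starting point; actual usage depends on label churn and compression, so it is worth comparing against the observed size of the data directory.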

CDK Implementation: Infrastructure as Code

Let’s build the complete monitoring platform using AWS CDK with TypeScript.

Project Structure

centralized-monitoring-cdk/
├── bin/
│   └── monitoring-platform.ts          # CDK app entry
├── lib/
│   ├── stacks/
│   │   ├── vpc-stack.ts                # Network infrastructure
│   │   ├── prometheus-stack.ts         # Central Prometheus
│   │   ├── grafana-stack.ts            # Grafana server
│   │   ├── alertmanager-stack.ts       # Alert Manager
│   │   └── exporters-stack.ts          # CloudWatch exporters
│   ├── constructs/
│   │   ├── efs-storage.ts              # EFS for persistence
│   │   ├── alb-setup.ts                # Load balancer
│   │   └── iam-roles.ts                # IAM permissions
│   └── config/
│       ├── monitoring-config.ts        # Typed per-environment settings
│       ├── prometheus-config.yaml      # Prometheus configuration
│       ├── alertmanager-config.yaml    # Alert Manager config
│       └── recording-rules.yaml        # Prometheus rules
├── grafana/
│   ├── dashboards/
│   │   ├── cluster-overview.json       # Kubernetes dashboard
│   │   ├── ecs-services.json           # ECS metrics
│   │   └── cross-cluster.json          # Multi-cluster view
│   └── provisioning/
│       ├── datasources/                # Prometheus data sources
│       └── dashboards/                 # Dashboard provisioning
├── prometheus/
│   ├── rules/
│   │   ├── recording-rules.yml         # Pre-aggregation rules
│   │   └── alerting-rules.yml          # Alert definitions
│   └── config/
│       └── prometheus.yml              # Main configuration
├── package.json
├── tsconfig.json
└── cdk.json

Configuration Management

// lib/config/monitoring-config.ts
export interface MonitoringConfig {
  environment: 'dev' | 'staging' | 'production';
  prometheus: PrometheusConfig;
  grafana: GrafanaConfig;
  alertManager: AlertManagerConfig;
  storage: StorageConfig;
  networking: NetworkingConfig;
}

export interface PrometheusConfig {
  retention: string;              // e.g., "30d"
  scrapeInterval: string;         // e.g., "15s"
  evaluationInterval: string;     // e.g., "15s"
  remoteWriteEnabled: boolean;
  federationEnabled: boolean;
  cpu: number;                    // ECS CPU units
  memory: number;                 // ECS memory (MB)
  desiredCount: number;           // Number of tasks
  externalLabels: Record<string, string>;
}

export interface GrafanaConfig {
  adminUser: string;
  domain?: string;
  enableAuth: boolean;
  cpu: number;
  memory: number;
  desiredCount: number;
  plugins: string[];
}

export interface AlertManagerConfig {
  enabled: boolean;
  cpu: number;
  memory: number;
  slackWebhook?: string;
  pagerDutyKey?: string;
  emailFrom?: string;
  emailTo: string[];
}

export interface StorageConfig {
  efsPerformanceMode: 'generalPurpose' | 'maxIO';
  efsThroughputMode: 'bursting' | 'provisioned';
  provisionedThroughputMibps?: number;
  prometheusVolumeSize: number;  // GB
  grafanaVolumeSize: number;     // GB
}

export interface NetworkingConfig {
  vpcCidr: string;
  maxAzs: number;
  enableNatGateway: boolean;
  enableVpcFlowLogs: boolean;
}

export const monitoringConfigs: Record<string, MonitoringConfig> = {
  dev: {
    environment: 'dev',
    prometheus: {
      retention: '7d',
      scrapeInterval: '30s',
      evaluationInterval: '30s',
      remoteWriteEnabled: true,
      federationEnabled: true,
      cpu: 1024,        // 1 vCPU
      memory: 2048,     // 2 GB
      desiredCount: 1,
      externalLabels: {
        cluster: 'central',
        environment: 'dev',
      },
    },
    grafana: {
      adminUser: 'admin',
      enableAuth: true,
      cpu: 512,
      memory: 1024,
      desiredCount: 1,
      plugins: [
        'grafana-piechart-panel',
        'grafana-worldmap-panel',
      ],
    },
    alertManager: {
      enabled: true,
      cpu: 256,
      memory: 512,
      emailTo: ['dev-team@company.com'],
    },
    storage: {
      efsPerformanceMode: 'generalPurpose',
      efsThroughputMode: 'bursting',
      prometheusVolumeSize: 100,
      grafanaVolumeSize: 20,
    },
    networking: {
      vpcCidr: '10.0.0.0/16',
      maxAzs: 2,
      enableNatGateway: true,
      enableVpcFlowLogs: false,
    },
  },
  production: {
    environment: 'production',
    prometheus: {
      retention: '90d',
      scrapeInterval: '15s',
      evaluationInterval: '15s',
      remoteWriteEnabled: true,
      federationEnabled: true,
      cpu: 4096,        // 4 vCPU
      memory: 16384,    // 16 GB
      desiredCount: 3,  // High availability
      externalLabels: {
        cluster: 'central',
        environment: 'production',
      },
    },
    grafana: {
      adminUser: 'admin',
      domain: 'monitoring.company.com',
      enableAuth: true,
      cpu: 2048,
      memory: 4096,
      desiredCount: 2,
      plugins: [
        'grafana-piechart-panel',
        'grafana-worldmap-panel',
        'grafana-clock-panel',
      ],
    },
    alertManager: {
      enabled: true,
      cpu: 1024,
      memory: 2048,
      slackWebhook: process.env.SLACK_WEBHOOK_URL,
      pagerDutyKey: process.env.PAGERDUTY_INTEGRATION_KEY,
      emailFrom: 'alerts@company.com',
      emailTo: [
        'oncall@company.com',
        'sre-team@company.com',
      ],
    },
    storage: {
      efsPerformanceMode: 'maxIO',
      efsThroughputMode: 'provisioned',
      provisionedThroughputMibps: 100,
      prometheusVolumeSize: 1000,
      grafanaVolumeSize: 100,
    },
    networking: {
      vpcCidr: '10.0.0.0/16',
      maxAzs: 3,
      enableNatGateway: true,
      enableVpcFlowLogs: true,
    },
  },
};
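
The cpu and memory values above end up in Fargate task definitions, and Fargate only accepts specific CPU/memory pairings. A small pre-deployment check can catch an invalid pairing before CloudFormation rejects it. The sketch below is one way to do that; the function names are hypothetical, and the table covers only task sizes up to 4 vCPU (larger sizes exist but are left out here).

```typescript
// Memory values (MiB) that Fargate accepts for a given CPU size, limited to
// the sizes used in this post. Unknown CPU values yield an empty list.
function validFargateMemory(cpu: number): number[] {
  const range = (min: number, max: number, step: number): number[] => {
    const values: number[] = [];
    for (let m = min; m <= max; m += step) {
      values.push(m);
    }
    return values;
  };
  switch (cpu) {
    case 256:  return [512, 1024, 2048];
    case 512:  return range(1024, 4096, 1024);
    case 1024: return range(2048, 8192, 1024);
    case 2048: return range(4096, 16384, 1024);
    case 4096: return range(8192, 30720, 1024);
    default:   return []; // 8 and 16 vCPU sizes are omitted from this sketch
  }
}

function isValidFargateSize(cpu: number, memoryMiB: number): boolean {
  return validFargateMemory(cpu).indexOf(memoryMiB) !== -1;
}

// The production Prometheus sizing from the config above.
console.log(isValidFargateSize(4096, 16384));
```

Running such a check over every cpu/memory pair in monitoringConfigs during synthesis keeps sizing mistakes out of the deployment step.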

VPC Stack - Network Foundation

// lib/stacks/vpc-stack.ts
import * as cdk from 'aws-cdk-lib';
import * as ec2 from 'aws-cdk-lib/aws-ec2';
import { Construct } from 'constructs';
import { NetworkingConfig } from '../config/monitoring-config';

export interface VpcStackProps extends cdk.StackProps {
  config: NetworkingConfig;
}

export class VpcStack extends cdk.Stack {
  public readonly vpc: ec2.Vpc;
  public readonly prometheusSecurityGroup: ec2.SecurityGroup;
  public readonly grafanaSecurityGroup: ec2.SecurityGroup;
  public readonly albSecurityGroup: ec2.SecurityGroup;

  constructor(scope: Construct, id: string, props: VpcStackProps) {
    super(scope, id, props);

    const { config } = props;

    // Create VPC with public and private subnets
    this.vpc = new ec2.Vpc(this, 'MonitoringVpc', {
      ipAddresses: ec2.IpAddresses.cidr(config.vpcCidr),
      maxAzs: config.maxAzs,
      natGateways: config.enableNatGateway ? config.maxAzs : 0,
      subnetConfiguration: [
        {
          name: 'Public',
          subnetType: ec2.SubnetType.PUBLIC,
          cidrMask: 24,
        },
        {
          name: 'Private',
          subnetType: ec2.SubnetType.PRIVATE_WITH_EGRESS,
          cidrMask: 24,
        },
        {
          name: 'Isolated',
          subnetType: ec2.SubnetType.PRIVATE_ISOLATED,
          cidrMask: 24,
        },
      ],
    });

    // Enable VPC Flow Logs if configured
    if (config.enableVpcFlowLogs) {
      this.vpc.addFlowLog('VpcFlowLogs', {
        destination: ec2.FlowLogDestination.toCloudWatchLogs(),
      });
    }

    // Security group for ALB
    this.albSecurityGroup = new ec2.SecurityGroup(this, 'AlbSecurityGroup', {
      vpc: this.vpc,
      description: 'Security group for monitoring ALB',
      allowAllOutbound: true,
    });

    // Allow HTTPS from anywhere
    this.albSecurityGroup.addIngressRule(
      ec2.Peer.anyIpv4(),
      ec2.Port.tcp(443),
      'Allow HTTPS from internet'
    );

    // Security group for Prometheus
    this.prometheusSecurityGroup = new ec2.SecurityGroup(
      this,
      'PrometheusSecurityGroup',
      {
        vpc: this.vpc,
        description: 'Security group for Prometheus',
        allowAllOutbound: true,
      }
    );

    // Allow Prometheus port from ALB and internal VPC
    this.prometheusSecurityGroup.addIngressRule(
      this.albSecurityGroup,
      ec2.Port.tcp(9090),
      'Allow Prometheus access from ALB'
    );

    this.prometheusSecurityGroup.addIngressRule(
      ec2.Peer.ipv4(config.vpcCidr),
      ec2.Port.tcp(9090),
      'Allow Prometheus federation from VPC'
    );

    // Security group for Grafana
    this.grafanaSecurityGroup = new ec2.SecurityGroup(
      this,
      'GrafanaSecurityGroup',
      {
        vpc: this.vpc,
        description: 'Security group for Grafana',
        allowAllOutbound: true,
      }
    );

    // Allow Grafana port from ALB
    this.grafanaSecurityGroup.addIngressRule(
      this.albSecurityGroup,
      ec2.Port.tcp(3000),
      'Allow Grafana access from ALB'
    );

    // Allow Grafana to access Prometheus
    this.prometheusSecurityGroup.connections.allowFrom(
      this.grafanaSecurityGroup,
      ec2.Port.tcp(9090),
      'Allow Grafana to query Prometheus'
    );

    // Outputs
    new cdk.CfnOutput(this, 'VpcId', {
      value: this.vpc.vpcId,
      exportName: 'MonitoringVpcId',
    });

    new cdk.CfnOutput(this, 'VpcCidr', {
      value: this.vpc.vpcCidrBlock,
      exportName: 'MonitoringVpcCidr',
    });
  }
}
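
The VPC above carves /24 subnets for three subnet groups in every availability zone out of a single /16. The arithmetic behind that layout can be sketched as follows; the function name and return shape are hypothetical, and the only AWS-specific fact used is that five addresses in each subnet are reserved.

```typescript
// Checks whether the requested subnets fit into the VPC CIDR and how many
// addresses each subnet leaves for tasks and load balancer nodes.
function subnetPlan(vpcPrefix: number, subnetMask: number, groups: number, azs: number) {
  if (subnetMask < vpcPrefix) {
    throw new Error('a subnet cannot be larger than the VPC');
  }
  const available = Math.pow(2, subnetMask - vpcPrefix); // subnets the VPC can hold
  const needed = groups * azs;                           // one subnet per group per AZ
  const usableIpsPerSubnet = Math.pow(2, 32 - subnetMask) - 5; // AWS reserves 5 per subnet
  return { needed, available, fits: needed <= available, usableIpsPerSubnet };
}

// Production settings from the config: 10.0.0.0/16, /24 subnets, 3 groups, 3 AZs.
const productionPlan = subnetPlan(16, 24, 3, 3);
console.log(productionPlan);
```

With the production settings this needs 9 of the 256 possible /24 subnets, so the layout leaves ample room for additional subnet groups.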

Prometheus Stack - Central Metrics Server

// lib/stacks/prometheus-stack.ts
import * as cdk from 'aws-cdk-lib';
import * as ec2 from 'aws-cdk-lib/aws-ec2';
import * as ecs from 'aws-cdk-lib/aws-ecs';
import * as ecs_patterns from 'aws-cdk-lib/aws-ecs-patterns';
import * as efs from 'aws-cdk-lib/aws-efs';
import * as logs from 'aws-cdk-lib/aws-logs';
import * as elbv2 from 'aws-cdk-lib/aws-elasticloadbalancingv2';
import { Construct } from 'constructs';
import { PrometheusConfig, StorageConfig } from '../config/monitoring-config';

export interface PrometheusStackProps extends cdk.StackProps {
  vpc: ec2.Vpc;
  securityGroup: ec2.SecurityGroup;
  prometheusConfig: PrometheusConfig;
  storageConfig: StorageConfig;
  alb: elbv2.ApplicationLoadBalancer;
}

export class PrometheusStack extends cdk.Stack {
  public readonly service: ecs_patterns.ApplicationLoadBalancedFargateService;
  public readonly fileSystem: efs.FileSystem;
  public readonly prometheusUrl: string;

  constructor(scope: Construct, id: string, props: PrometheusStackProps) {
    super(scope, id, props);

    const { vpc, securityGroup, prometheusConfig, storageConfig, alb } = props;

    // Create ECS Cluster
    const cluster = new ecs.Cluster(this, 'PrometheusCluster', {
      vpc,
      clusterName: 'monitoring-prometheus-cluster',
      containerInsights: true,
    });

    // Create EFS for Prometheus data persistence
    // NOTE: the Prometheus storage documentation lists NFS (including EFS) as
    // unsupported for local storage, so validate this setup under load.
    this.fileSystem = new efs.FileSystem(this, 'PrometheusEfs', {
      vpc,
      encrypted: true,
      lifecyclePolicy: efs.LifecyclePolicy.AFTER_14_DAYS,
      performanceMode: storageConfig.efsPerformanceMode === 'maxIO'
        ? efs.PerformanceMode.MAX_IO
        : efs.PerformanceMode.GENERAL_PURPOSE,
      throughputMode: storageConfig.efsThroughputMode === 'provisioned'
        ? efs.ThroughputMode.PROVISIONED
        : efs.ThroughputMode.BURSTING,
      provisionedThroughputPerSecond: storageConfig.provisionedThroughputMibps
        ? cdk.Size.mebibytes(storageConfig.provisionedThroughputMibps)
        : undefined,
      removalPolicy: cdk.RemovalPolicy.RETAIN,
    });

    // Create access point for Prometheus (UID/GID 65534 is the "nobody" user
    // that the prom/prometheus image runs as)
    const accessPoint = this.fileSystem.addAccessPoint('PrometheusAccessPoint', {
      path: '/prometheus',
      createAcl: {
        ownerGid: '65534',
        ownerUid: '65534',
        permissions: '755',
      },
      posixUser: {
        gid: '65534',
        uid: '65534',
      },
    });

    // Create task definition
    const taskDefinition = new ecs.FargateTaskDefinition(this, 'PrometheusTask', {
      cpu: prometheusConfig.cpu,
      memoryLimitMiB: prometheusConfig.memory,
      family: 'prometheus-server',
    });

    // Add EFS volume
    const volumeName = 'prometheus-storage';
    taskDefinition.addVolume({
      name: volumeName,
      efsVolumeConfiguration: {
        fileSystemId: this.fileSystem.fileSystemId,
        transitEncryption: 'ENABLED',
        authorizationConfig: {
          accessPointId: accessPoint.accessPointId,
          iam: 'ENABLED',
        },
      },
    });

    // Generate the Prometheus configuration (serialized as JSON).
    // NOTE: this string is not yet delivered to the container. Until it is
    // placed at /etc/prometheus/prometheus.yml (for example via a custom
    // image), the container starts with the image's default configuration.
    const prometheusConfigYaml = this.loadPrometheusConfig(prometheusConfig);

    // Add Prometheus container
    const prometheusContainer = taskDefinition.addContainer('prometheus', {
      // NOTE: pin a specific image tag for production; 'latest' can change
      // between deployments.
      image: ecs.ContainerImage.fromRegistry('prom/prometheus:latest'),
      logging: ecs.LogDrivers.awsLogs({
        streamPrefix: 'prometheus',
        logRetention: logs.RetentionDays.ONE_WEEK,
      }),
      environment: {
        PROMETHEUS_RETENTION: prometheusConfig.retention,
      },
      command: [
        '--config.file=/etc/prometheus/prometheus.yml',
        '--storage.tsdb.path=/prometheus',
        `--storage.tsdb.retention.time=${prometheusConfig.retention}`,
        '--web.console.libraries=/usr/share/prometheus/console_libraries',
        '--web.console.templates=/usr/share/prometheus/consoles',
        '--web.enable-lifecycle',
        // The admin API can delete series; restrict who can reach it.
        '--web.enable-admin-api',
      ],
      portMappings: [{
        containerPort: 9090,
        protocol: ecs.Protocol.TCP,
      }],
      healthCheck: {
        command: ['CMD-SHELL', 'wget --no-verbose --tries=1 --spider http://localhost:9090/-/healthy || exit 1'],
        interval: cdk.Duration.seconds(30),
        timeout: cdk.Duration.seconds(5),
        retries: 3,
        startPeriod: cdk.Duration.seconds(60),
      },
    });

    // Mount EFS volume
    prometheusContainer.addMountPoints({
      sourceVolume: volumeName,
      containerPath: '/prometheus',
      readOnly: false,
    });

    // Grant EFS permissions
    this.fileSystem.grantReadWrite(taskDefinition.taskRole);

    // Create Fargate service
    this.service = new ecs_patterns.ApplicationLoadBalancedFargateService(
      this,
      'PrometheusService',
      {
        cluster,
        serviceName: 'prometheus',
        taskDefinition,
        // NOTE: Prometheus locks its data directory, so tasks that share the
        // single access point above cannot all write the same TSDB path.
        // Give each replica its own directory before running more than one.
        desiredCount: prometheusConfig.desiredCount,
        loadBalancer: alb,
        publicLoadBalancer: false,
        securityGroups: [securityGroup],
        taskSubnets: {
          subnetType: ec2.SubnetType.PRIVATE_WITH_EGRESS,
        },
      }
    );

    // Configure health check
    this.service.targetGroup.configureHealthCheck({
      path: '/-/healthy',
      interval: cdk.Duration.seconds(30),
      timeout: cdk.Duration.seconds(5),
      healthyThresholdCount: 2,
      unhealthyThresholdCount: 3,
    });

    // Auto scaling (the replica caveat above applies to scaled-out tasks too)
    const scaling = this.service.service.autoScaleTaskCount({
      minCapacity: prometheusConfig.desiredCount,
      maxCapacity: prometheusConfig.desiredCount * 2,
    });

    scaling.scaleOnCpuUtilization('CpuScaling', {
      targetUtilizationPercent: 70,
      scaleInCooldown: cdk.Duration.seconds(300),
      scaleOutCooldown: cdk.Duration.seconds(60),
    });

    scaling.scaleOnMemoryUtilization('MemoryScaling', {
      targetUtilizationPercent: 80,
      scaleInCooldown: cdk.Duration.seconds(300),
      scaleOutCooldown: cdk.Duration.seconds(60),
    });

    // Allow EFS connections
    this.fileSystem.connections.allowDefaultPortFrom(this.service.service.connections);

    this.prometheusUrl = `http://${this.service.loadBalancer.loadBalancerDnsName}`;

    // Outputs
    new cdk.CfnOutput(this, 'PrometheusUrl', {
      value: this.prometheusUrl,
      description: 'Prometheus Server URL',
      exportName: 'PrometheusUrl',
    });

    new cdk.CfnOutput(this, 'PrometheusEfsId', {
      value: this.fileSystem.fileSystemId,
      description: 'Prometheus EFS File System ID',
      exportName: 'PrometheusEfsId',
    });
  }

  private loadPrometheusConfig(config: PrometheusConfig): string {
    // Generate Prometheus configuration dynamically
    const prometheusConfig = {
      global: {
        scrape_interval: config.scrapeInterval,
        evaluation_interval: config.evaluationInterval,
        external_labels: config.externalLabels,
      },
      scrape_configs: [
        {
          job_name: 'prometheus',
          static_configs: [{
            targets: ['localhost:9090'],
          }],
        },
        // Federation from remote Prometheus instances
        ...(config.federationEnabled ? [{
          job_name: 'federate-eks-clusters',
          scrape_interval: '30s',
          honor_labels: true,
          metrics_path: '/federate',
          params: {
            'match[]': [
              '{job="kubernetes-apiservers"}',
              '{job="kubernetes-nodes"}',
              '{job="kubernetes-pods"}',
              '{job="kubernetes-cadvisor"}',
              '{job="kubernetes-service-endpoints"}',
            ],
          },
          static_configs: [
            // Add your EKS Prometheus endpoints here
            // { targets: ['prometheus.eks-cluster-1.internal:9090'] },
            // { targets: ['prometheus.eks-cluster-2.internal:9090'] },
          ],
        }] : []),
      ],
      remote_write: config.remoteWriteEnabled ? [
        // Configure remote write endpoints if needed
        // { url: 'http://cortex:9009/api/prom/push' }
      ] : [],
      rule_files: [
        '/etc/prometheus/recording-rules.yml',
        '/etc/prometheus/alerting-rules.yml',
      ],
    };

    return JSON.stringify(prometheusConfig, null, 2);
  }
}
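
The federation job in loadPrometheusConfig ships with an empty static_configs list that has to be filled in by hand. One possible variation, sketched below, takes the cluster endpoints as input and leaves the job out entirely when no targets are given. The names FederationOptions, federationTargets, and renderPrometheusConfig are hypothetical and not part of the stack above; the output is JSON, which relies on the YAML parser accepting JSON-style syntax.

```typescript
interface FederationOptions {
  scrapeInterval: string;
  externalLabels: Record<string, string>;
  federationTargets: string[]; // assumed host:port of each cluster Prometheus
  matchers: string[];          // series selectors passed as match[] params
}

function renderPrometheusConfig(opts: FederationOptions): string {
  // The central server always scrapes itself.
  const scrapeConfigs: object[] = [
    { job_name: 'prometheus', static_configs: [{ targets: ['localhost:9090'] }] },
  ];
  // Only emit the federation job when there is something to federate from.
  if (opts.federationTargets.length > 0) {
    scrapeConfigs.push({
      job_name: 'federate-eks-clusters',
      honor_labels: true,
      metrics_path: '/federate',
      params: { 'match[]': opts.matchers },
      static_configs: [{ targets: opts.federationTargets }],
    });
  }
  const config = {
    global: {
      scrape_interval: opts.scrapeInterval,
      external_labels: opts.externalLabels,
    },
    scrape_configs: scrapeConfigs,
  };
  return JSON.stringify(config, null, 2);
}

// Example with an assumed endpoint.
const renderedWithTargets = renderPrometheusConfig({
  scrapeInterval: '15s',
  externalLabels: { cluster: 'central', environment: 'production' },
  federationTargets: ['prometheus.eks-cluster-1.internal:9090'],
  matchers: ['{job="kubernetes-nodes"}'],
});
console.log(renderedWithTargets);
```

Keeping the rendering in a pure function like this also makes the generated configuration easy to unit test without synthesizing a stack.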

Grafana Stack - Visualization Platform

// lib/stacks/grafana-stack.ts
import * as cdk from 'aws-cdk-lib';
import * as ec2 from 'aws-cdk-lib/aws-ec2';
import * as ecs from 'aws-cdk-lib/aws-ecs';
import * as ecs_patterns from 'aws-cdk-lib/aws-ecs-patterns';
import * as efs from 'aws-cdk-lib/aws-efs';
import * as secretsmanager from 'aws-cdk-lib/aws-secretsmanager';
import * as logs from 'aws-cdk-lib/aws-logs';
import * as elbv2 from 'aws-cdk-lib/aws-elasticloadbalancingv2';
import { Construct } from 'constructs';
import { GrafanaConfig, StorageConfig } from '../config/monitoring-config';

export interface GrafanaStackProps extends cdk.StackProps {
  vpc: ec2.Vpc;
  securityGroup: ec2.SecurityGroup;
  grafanaConfig: GrafanaConfig;
  storageConfig: StorageConfig;
  prometheusUrl: string;
  alb: elbv2.ApplicationLoadBalancer;
}

export class GrafanaStack extends cdk.Stack {
  public readonly service: ecs_patterns.ApplicationLoadBalancedFargateService;
  public readonly grafanaUrl: string;

  constructor(scope: Construct, id: string, props: GrafanaStackProps) {
    super(scope, id, props);

    const { vpc, securityGroup, grafanaConfig, storageConfig, prometheusUrl, alb } = props;

    // Create ECS Cluster
    const cluster = new ecs.Cluster(this, 'GrafanaCluster', {
      vpc,
      clusterName: 'monitoring-grafana-cluster',
      containerInsights: true,
    });

    // Create EFS for Grafana data
    const fileSystem = new efs.FileSystem(this, 'GrafanaEfs', {
      vpc,
      encrypted: true,
      performanceMode: efs.PerformanceMode.GENERAL_PURPOSE,
      throughputMode: efs.ThroughputMode.BURSTING,
      removalPolicy: cdk.RemovalPolicy.RETAIN,
    });

    // Access point owned by UID/GID 472, the user the Grafana image runs as
    const accessPoint = fileSystem.addAccessPoint('GrafanaAccessPoint', {
      path: '/grafana',
      createAcl: {
        ownerGid: '472',
        ownerUid: '472',
        permissions: '755',
      },
      posixUser: {
        gid: '472',
        uid: '472',
      },
    });

    // Create admin password secret
    const adminSecret = new secretsmanager.Secret(this, 'GrafanaAdminPassword', {
      secretName: 'grafana-admin-credentials',
      generateSecretString: {
        secretStringTemplate: JSON.stringify({ username: grafanaConfig.adminUser }),
        generateStringKey: 'password',
        excludePunctuation: true,
        passwordLength: 16,
      },
    });

    // Task definition
    const taskDefinition = new ecs.FargateTaskDefinition(this, 'GrafanaTask', {
      cpu: grafanaConfig.cpu,
      memoryLimitMiB: grafanaConfig.memory,
      family: 'grafana-server',
    });

    // Add volume
    const volumeName = 'grafana-storage';
    taskDefinition.addVolume({
      name: volumeName,
      efsVolumeConfiguration: {
        fileSystemId: fileSystem.fileSystemId,
        transitEncryption: 'ENABLED',
        authorizationConfig: {
          accessPointId: accessPoint.accessPointId,
          iam: 'ENABLED',
        },
      },
    });

    // Grafana container
    const grafanaContainer = taskDefinition.addContainer('grafana', {
      // 'latest' keeps the example short; pin a specific tag for production deployments
      image: ecs.ContainerImage.fromRegistry('grafana/grafana:latest'),
      logging: ecs.LogDrivers.awsLogs({
        streamPrefix: 'grafana',
        logRetention: logs.RetentionDays.ONE_WEEK,
      }),
      environment: {
        GF_SERVER_ROOT_URL: grafanaConfig.domain
          ? `https://${grafanaConfig.domain}`
          : '',
        GF_SECURITY_ADMIN_USER: grafanaConfig.adminUser,
        GF_INSTALL_PLUGINS: grafanaConfig.plugins.join(','),
        GF_AUTH_ANONYMOUS_ENABLED: (!grafanaConfig.enableAuth).toString(),
        GF_PATHS_DATA: '/var/lib/grafana',
        GF_PATHS_PROVISIONING: '/etc/grafana/provisioning',
      },
      secrets: {
        GF_SECURITY_ADMIN_PASSWORD: ecs.Secret.fromSecretsManager(adminSecret, 'password'),
      },
      portMappings: [{
        containerPort: 3000,
        protocol: ecs.Protocol.TCP,
      }],
      healthCheck: {
        command: ['CMD-SHELL', 'wget --no-verbose --tries=1 --spider http://localhost:3000/api/health || exit 1'],
        interval: cdk.Duration.seconds(30),
        timeout: cdk.Duration.seconds(5),
        retries: 3,
        startPeriod: cdk.Duration.seconds(60),
      },
    });

    // Mount EFS
    grafanaContainer.addMountPoints({
      sourceVolume: volumeName,
      containerPath: '/var/lib/grafana',
      readOnly: false,
    });

    // Grant EFS permissions
    fileSystem.grantReadWrite(taskDefinition.taskRole);

    // Create service
    this.service = new ecs_patterns.ApplicationLoadBalancedFargateService(
      this,
      'GrafanaService',
      {
        cluster,
        serviceName: 'grafana',
        taskDefinition,
        desiredCount: grafanaConfig.desiredCount,
        loadBalancer: alb,
        publicLoadBalancer: true,
        securityGroups: [securityGroup],
        taskSubnets: {
          subnetType: ec2.SubnetType.PRIVATE_WITH_EGRESS,
        },
      }
    );

    // Health check
    this.service.targetGroup.configureHealthCheck({
      path: '/api/health',
      interval: cdk.Duration.seconds(30),
      timeout: cdk.Duration.seconds(5),
      healthyThresholdCount: 2,
      unhealthyThresholdCount: 3,
    });

    // Auto scaling
    const scaling = this.service.service.autoScaleTaskCount({
      minCapacity: grafanaConfig.desiredCount,
      maxCapacity: grafanaConfig.desiredCount * 2,
    });

    scaling.scaleOnCpuUtilization('CpuScaling', {
      targetUtilizationPercent: 70,
    });

    // Allow EFS connections
    fileSystem.connections.allowDefaultPortFrom(this.service.service.connections);

    this.grafanaUrl = `http://${this.service.loadBalancer.loadBalancerDnsName}`;

    // Outputs
    new cdk.CfnOutput(this, 'GrafanaUrl', {
      value: this.grafanaUrl,
      description: 'Grafana Dashboard URL',
      exportName: 'GrafanaUrl',
    });

    new cdk.CfnOutput(this, 'GrafanaAdminSecretArn', {
      value: adminSecret.secretArn,
      description: 'Grafana Admin Password Secret ARN',
      exportName: 'GrafanaAdminSecretArn',
    });
  }
}
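
The container's environment block maps GrafanaConfig fields to GF_* variables inline. One possible refinement, sketched below, is to move that mapping into a pure function so it can be unit-tested without synthesizing the stack. GrafanaEnvInput and buildGrafanaEnvironment are hypothetical names, and the input type is an assumed subset of GrafanaConfig based only on the fields the stack reads.

```typescript
// Assumed subset of GrafanaConfig; only the fields used by the stack above.
interface GrafanaEnvInput {
  domain?: string;
  adminUser: string;
  plugins: string[];
  enableAuth: boolean;
}

// Maps configuration to Grafana's GF_* environment variables.
// Optional variables are omitted instead of being set to an empty string.
function buildGrafanaEnvironment(cfg: GrafanaEnvInput): Record<string, string> {
  const env: Record<string, string> = {
    GF_SECURITY_ADMIN_USER: cfg.adminUser,
    GF_AUTH_ANONYMOUS_ENABLED: String(!cfg.enableAuth),
    GF_PATHS_DATA: '/var/lib/grafana',
    GF_PATHS_PROVISIONING: '/etc/grafana/provisioning',
  };
  if (cfg.domain) {
    env['GF_SERVER_ROOT_URL'] = `https://${cfg.domain}`;
  }
  if (cfg.plugins.length > 0) {
    env['GF_INSTALL_PLUGINS'] = cfg.plugins.join(',');
  }
  return env;
}

const grafanaEnv = buildGrafanaEnvironment({
  adminUser: 'admin',
  plugins: ['grafana-piechart-panel'],
  enableAuth: true,
});
```

The returned record could be passed directly as the environment property of addContainer.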

Prometheus Configuration Files

Main Prometheus Configuration

# prometheus/config/prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s
  external_labels:
    cluster: 'central'
    environment: 'production'

# Alertmanager configuration
alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - alertmanager:9093

# Load rules
rule_files:
  - '/etc/prometheus/recording-rules.yml'
  - '/etc/prometheus/alerting-rules.yml'

# Scrape configurations
scrape_configs:
  # Prometheus itself
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']

  # Federation from EKS clusters
  - job_name: 'federate-eks-us-east-1'
    scrape_interval: 30s
    honor_labels: true
    metrics_path: '/federate'
    params:
      'match[]':
        - '{job=~"kubernetes-.*"}'
        - '{__name__=~"container_.*"}'
        - '{__name__=~"node_.*"}'
    static_configs:
      - targets:
          - 'prometheus.eks-us-east-1.internal:9090'
        labels:
          cluster: 'eks-us-east-1'
          region: 'us-east-1'

  - job_name: 'federate-eks-us-west-2'
    scrape_interval: 30s
    honor_labels: true
    metrics_path: '/federate'
    params:
      'match[]':
        - '{job=~"kubernetes-.*"}'
        - '{__name__=~"container_.*"}'
        - '{__name__=~"node_.*"}'
    static_configs:
      - targets:
          - 'prometheus.eks-us-west-2.internal:9090'
        labels:
          cluster: 'eks-us-west-2'
          region: 'us-west-2'

  # ECS service discovery via EC2 instance tags. ec2_sd_configs finds EC2
  # instances (for example ECS container instances) tagged monitoring=prometheus;
  # Fargate tasks are not EC2 instances and are not discovered this way.
  - job_name: 'ecs-services'
    ec2_sd_configs:
      - region: us-east-1
        port: 9090
        filters:
          - name: tag:monitoring
            values: ['prometheus']
    relabel_configs:
      - source_labels: [__meta_ec2_tag_Name]
        target_label: instance
      - source_labels: [__meta_ec2_tag_Service]
        target_label: service

  # CloudWatch Exporter
  - job_name: 'cloudwatch'
    static_configs:
      - targets:
          - 'cloudwatch-exporter:9106'

# Remote write (optional - for long-term storage)
remote_write:
  - url: http://cortex:9009/api/prom/push
    queue_config:
      capacity: 10000
      max_shards: 200
      max_samples_per_send: 1000
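
The queue_config values bound how many samples the remote-write queue can hold. The sketch below works through that arithmetic, assuming (as the Prometheus remote-write documentation describes) that capacity is a per-shard buffer and max_shards is the upper limit on parallel senders. The QueueConfig type and helper names are illustrative only.

```typescript
// Mirrors the queue_config keys from the remote_write section above.
interface QueueConfig {
  capacity: number;          // samples buffered per shard
  maxShards: number;         // upper limit on parallel shards
  maxSamplesPerSend: number; // samples per outgoing request
}

// Worst-case number of samples held in memory across all shards.
function maxBufferedSamples(q: QueueConfig): number {
  return q.capacity * q.maxShards;
}

// Samples in flight if every shard sends one full request at the same time.
function maxSamplesInFlight(q: QueueConfig): number {
  return q.maxShards * q.maxSamplesPerSend;
}

const centralQueue: QueueConfig = { capacity: 10000, maxShards: 200, maxSamplesPerSend: 1000 };
const buffered = maxBufferedSamples(centralQueue); // 2000000
const inFlight = maxSamplesInFlight(centralQueue); // 200000
```

With the values shown, a fully scaled-out queue can buffer up to two million samples, which is worth keeping in mind when sizing task memory.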

Recording Rules

# prometheus/rules/recording-rules.yml
groups:
  - name: aggregation_rules
    interval: 30s
    rules:
      # Aggregate CPU usage by cluster (in CPU cores)
      - record: cluster:cpu_usage:rate5m
        expr: sum(rate(container_cpu_usage_seconds_total[5m])) by (cluster)

      # Aggregate memory usage by cluster
      - record: cluster:memory_usage_bytes:sum
        expr: sum(container_memory_usage_bytes) by (cluster)

      # Aggregate request rate by service
      - record: service:http_requests:rate5m
        expr: sum(rate(http_requests_total[5m])) by (service, cluster)

      # Aggregate error rate by service
      - record: service:http_errors:rate5m
        expr: sum(rate(http_requests_total{status=~"5.."}[5m])) by (service, cluster)

      # P95 latency by service
      - record: service:http_request_duration_p95:5m
        expr: histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (service, le))

  - name: kubernetes_aggregations
    interval: 30s
    rules:
      # Pod count by namespace and cluster
      - record: namespace:pod_count:sum
        expr: sum(kube_pod_info) by (namespace, cluster)

      # Node capacity by cluster
      - record: cluster:node_capacity_cpu_cores:sum
        expr: sum(kube_node_status_capacity{resource="cpu"}) by (cluster)

      # Node available memory by cluster. Uses node_exporter's MemAvailable;
      # kube_node_status_allocatable is a static scheduling capacity, not free memory.
      - record: cluster:node_available_memory_bytes:sum
        expr: sum(node_memory_MemAvailable_bytes) by (cluster)
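
The service:http_request_duration_p95:5m rule relies on histogram_quantile, which estimates a quantile by interpolating linearly inside the bucket that contains the target rank. The following is a simplified sketch of that interpolation, not Prometheus's actual implementation; the Bucket type and the sample bucket counts are made up for illustration.

```typescript
// One cumulative histogram bucket: `count` observations were <= `le` seconds.
interface Bucket {
  le: number;
  count: number;
}

// Simplified quantile estimate. Assumes buckets are cumulative, sorted by
// `le`, and end with an le = Infinity bucket holding the total count.
function histogramQuantile(q: number, buckets: Bucket[]): number {
  if (buckets.length < 2 || q < 0 || q > 1) return NaN;
  const total = buckets[buckets.length - 1].count;
  if (total === 0) return NaN;

  const rank = q * total;
  // First bucket whose cumulative count reaches the rank.
  const i = buckets.findIndex((b) => b.count >= rank);
  // Rank falls into the +Inf bucket: report the highest finite bound.
  if (i === buckets.length - 1) return buckets[buckets.length - 2].le;

  const lower = i === 0 ? 0 : buckets[i - 1].le;
  const below = i === 0 ? 0 : buckets[i - 1].count;
  const inBucket = buckets[i].count - below;
  if (inBucket === 0) return lower;
  // Linear interpolation between the bucket's lower and upper bound.
  return lower + (buckets[i].le - lower) * ((rank - below) / inBucket);
}

const sampleBuckets: Bucket[] = [
  { le: 0.1, count: 50 },
  { le: 0.5, count: 90 },
  { le: 1, count: 100 },
  { le: Infinity, count: 100 },
];
const p95 = histogramQuantile(0.95, sampleBuckets); // about 0.75 seconds
```

Because the estimate is interpolated, its accuracy depends on how closely the bucket boundaries bracket the latencies of interest.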

Alerting Rules

# prometheus/rules/alerting-rules.yml
groups:
  - name: infrastructure_alerts
    interval: 30s
    rules:
      # High CPU usage across cluster (cores used as a fraction of node capacity)
      - alert: HighClusterCPUUsage
        expr: cluster:cpu_usage:rate5m / cluster:node_capacity_cpu_cores:sum > 0.8
        for: 5m
        labels:
          severity: warning
          team: platform
        annotations:
          summary: "High CPU usage in cluster {{ $labels.cluster }}"
          description: "CPU usage is {{ $value | humanizePercentage }} in cluster {{ $labels.cluster }}"

      # Low available memory
      - alert: LowClusterMemory
        expr: cluster:node_available_memory_bytes:sum < 1073741824
        for: 5m
        labels:
          severity: critical
          team: platform
        annotations:
          summary: "Low available memory in cluster {{ $labels.cluster }}"
          description: "Only {{ $value | humanize1024 }} available in cluster {{ $labels.cluster }}"

      # Pod crash looping (increase() yields a restart count; rate() would be per second)
      - alert: PodCrashLooping
        expr: increase(kube_pod_container_status_restarts_total[15m]) > 0
        for: 5m
        labels:
          severity: warning
          team: platform
        annotations:
          summary: "Pod {{ $labels.namespace }}/{{ $labels.pod }} is crash looping"
          description: "Pod has restarted {{ printf \"%.0f\" $value }} times in the last 15 minutes"

  - name: application_alerts
    interval: 30s
    rules:
      # High error rate
      - alert: HighErrorRate
        expr: service:http_errors:rate5m / service:http_requests:rate5m > 0.05
        for: 5m
        labels:
          severity: critical
          team: backend
        annotations:
          summary: "High error rate in {{ $labels.service }}"
          description: "Error rate is {{ $value | humanizePercentage }} in {{ $labels.service }}"

      # High latency
      - alert: HighLatency
        expr: service:http_request_duration_p95:5m > 1
        for: 10m
        labels:
          severity: warning
          team: backend
        annotations:
          summary: "High P95 latency in {{ $labels.service }}"
          description: "P95 latency is {{ $value }}s in {{ $labels.service }}"

      # Service down (any scrape target reporting up == 0)
      - alert: ServiceDown
        expr: up == 0
        for: 2m
        labels:
          severity: critical
          team: platform
        annotations:
          summary: "Service {{ $labels.job }} is down"
          description: "Service {{ $labels.job }} in cluster {{ $labels.cluster }} is unreachable"
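
Each rule above uses a for: clause, which keeps an alert in a pending state until its expression has been true continuously for the given duration. The sketch below is a simplified model of that state machine, assuming evenly spaced evaluations. alertStates and its parameters are illustrative names, not part of Prometheus.

```typescript
type AlertState = 'inactive' | 'pending' | 'firing';

// results[i] says whether the alert expression returned a sample at
// evaluation i; evaluations are intervalSeconds apart.
function alertStates(
  results: boolean[],
  forSeconds: number,
  intervalSeconds: number,
): AlertState[] {
  const states: AlertState[] = [];
  let activeSince: number | null = null;

  for (let i = 0; i < results.length; i++) {
    const now = i * intervalSeconds;
    if (!results[i]) {
      // A single false evaluation resets the timer.
      activeSince = null;
      states.push('inactive');
      continue;
    }
    if (activeSince === null) activeSince = now;
    states.push(now - activeSince >= forSeconds ? 'firing' : 'pending');
  }
  return states;
}

// ServiceDown uses for: 2m; the groups above evaluate every 30s.
const serviceDownStates = alertStates(
  [true, true, true, true, true, false, true],
  120,
  30,
);
// pending, pending, pending, pending, firing, inactive, pending
```

The reset on a single healthy evaluation is why flapping targets may never reach the firing state with a long for: duration.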

Grafana Dashboard Provisioning

Prometheus Data Source Configuration

# grafana/provisioning/datasources/prometheus.yaml
apiVersion: 1

datasources:
  - name: Central Prometheus
    type: prometheus
    access: proxy
    url: http://prometheus:9090
    isDefault: true
    editable: false
    jsonData:
      timeInterval: 15s
      queryTimeout: 60s
      httpMethod: POST

  - name: EKS US-East-1
    type: prometheus
    access: proxy
    url: http://prometheus.eks-us-east-1.internal:9090
    editable: false
    jsonData:
      timeInterval: 15s

  - name: EKS US-West-2
    type: prometheus
    access: proxy
    url: http://prometheus.eks-us-west-2.internal:9090
    editable: false
    jsonData:
      timeInterval: 15s

Dashboard Provisioning

# grafana/provisioning/dashboards/default.yaml
apiVersion: 1

providers:
  - name: 'default'
    orgId: 1
    folder: ''
    type: file
    disableDeletion: false
    updateIntervalSeconds: 10
    allowUiUpdates: true
    options:
      path: /etc/grafana/provisioning/dashboards
      foldersFromFilesStructure: true

Cross-Cluster Dashboard Example

File: grafana/dashboards/cross-cluster.json (a file-provisioned dashboard is the bare dashboard model, without the "dashboard" wrapper object that the HTTP API uses)

{
  "title": "Cross-Cluster Overview",
  "tags": ["kubernetes", "cross-cluster"],
  "timezone": "browser",
  "panels": [
    {
      "id": 1,
      "title": "CPU Usage by Cluster",
      "type": "graph",
      "datasource": "Central Prometheus",
      "targets": [
        {
          "expr": "cluster:cpu_usage:rate5m",
          "legendFormat": "{{ cluster }}",
          "refId": "A"
        }
      ],
      "gridPos": { "x": 0, "y": 0, "w": 12, "h": 8 }
    },
    {
      "id": 2,
      "title": "Memory Usage by Cluster",
      "type": "graph",
      "datasource": "Central Prometheus",
      "targets": [
        {
          "expr": "cluster:memory_usage_bytes:sum",
          "legendFormat": "{{ cluster }}",
          "refId": "A"
        }
      ],
      "gridPos": { "x": 12, "y": 0, "w": 12, "h": 8 }
    },
    {
      "id": 3,
      "title": "Request Rate by Service (All Clusters)",
      "type": "graph",
      "datasource": "Central Prometheus",
      "targets": [
        {
          "expr": "sum(service:http_requests:rate5m) by (service)",
          "legendFormat": "{{ service }}",
          "refId": "A"
        }
      ],
      "gridPos": { "x": 0, "y": 8, "w": 24, "h": 8 }
    },
    {
      "id": 4,
      "title": "Error Rate Comparison",
      "type": "heatmap",
      "datasource": "Central Prometheus",
      "targets": [
        {
          "expr": "service:http_errors:rate5m / service:http_requests:rate5m",
          "legendFormat": "{{ service }} @ {{ cluster }}",
          "refId": "A"
        }
      ],
      "gridPos": { "x": 0, "y": 16, "w": 24, "h": 8 }
    }
  ]
}
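
The gridPos values above follow Grafana's 24-column grid: two 12-wide panels share a row, and y advances by the panel height. For dashboards with many similar panels, the layout could be computed instead of hand-written. The following is a rough sketch; PanelSpec and layoutPanels are hypothetical helpers, and the generated objects only cover the fields used in the JSON above.

```typescript
interface PanelSpec {
  title: string;
  expr: string;
  legendFormat: string;
}

// Places panels left to right on Grafana's 24-column grid.
// `columns` is assumed to divide 24 evenly (1, 2, 3, 4, 6, ...).
function layoutPanels(specs: PanelSpec[], columns = 2, height = 8) {
  const width = 24 / columns;
  return specs.map((spec, i) => ({
    id: i + 1,
    title: spec.title,
    type: 'graph',
    datasource: 'Central Prometheus',
    targets: [{ expr: spec.expr, legendFormat: spec.legendFormat, refId: 'A' }],
    gridPos: {
      x: (i % columns) * width,
      y: Math.floor(i / columns) * height,
      w: width,
      h: height,
    },
  }));
}

const generatedPanels = layoutPanels([
  { title: 'CPU Usage by Cluster', expr: 'cluster:cpu_usage:rate5m', legendFormat: '{{ cluster }}' },
  { title: 'Memory Usage by Cluster', expr: 'cluster:memory_usage_bytes:sum', legendFormat: '{{ cluster }}' },
  { title: 'Pods by Namespace', expr: 'namespace:pod_count:sum', legendFormat: '{{ namespace }}' },
]);
```

Serializing an object with these panels via JSON.stringify would produce a file that the dashboard provider can load.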

Deployment and Operations

Deploy the Complete Stack

# Set environment
export AWS_REGION=us-east-1
export CDK_DEFAULT_ACCOUNT=$(aws sts get-caller-identity --query Account --output text)

# Install dependencies
npm install

# Bootstrap CDK (first time only)
cdk bootstrap aws://$CDK_DEFAULT_ACCOUNT/$AWS_REGION

# Deploy VPC stack first
cdk deploy MonitoringVpcStack --context env=production

# Deploy Prometheus
cdk deploy PrometheusStack --context env=production

# Deploy Grafana
cdk deploy GrafanaStack --context env=production

# Alternatively, deploy all stacks in one step (skips the approval prompt)
cdk deploy --all --context env=production --require-approval never

# Get deployment outputs
aws cloudformation describe-stacks \
  --stack-name GrafanaStack \
  --query 'Stacks[0].Outputs' \
  --output table

Configure Remote Prometheus Instances

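As a push-based alternative to federation, each remote Prometheus (EKS, ECS) can forward samples to the central instance with remote write. The central Prometheus must be started with --web.enable-remote-write-receiver to accept these requests: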

# Remote Prometheus configuration
remote_write:
  - url: http://<central-prometheus-alb-dns>:9090/api/v1/write
    queue_config:
      capacity: 10000
      max_shards: 100
      max_samples_per_send: 500
    write_relabel_configs:
      - source_labels: [__name__]
        regex: 'up|container_.*|node_.*|kube_.*'
        action: keep
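
The write_relabel_configs entry keeps only series whose metric name matches the regex. Prometheus anchors relabel regexes at both ends, so a name must match in full. The sketch below imitates that check with a JavaScript RegExp. keptByRelabel is a hypothetical helper, and JavaScript's regex dialect differs from the RE2 syntax Prometheus uses, although this particular pattern behaves the same in both.

```typescript
// Returns true when the metric name would survive an `action: keep` rule.
// The ^(?:...)$ wrapper reproduces Prometheus's full-string anchoring.
function keptByRelabel(metricName: string, regex: string): boolean {
  return new RegExp(`^(?:${regex})$`).test(metricName);
}

const keepRegex = 'up|container_.*|node_.*|kube_.*';

const keptMetrics = [
  'up',
  'uptime_seconds',      // dropped: `up` must match the whole name
  'node_load1',
  'http_requests_total', // dropped: not covered by the pattern
].filter((name) => keptByRelabel(name, keepRegex));
// keptMetrics is ['up', 'node_load1']
```

Note that application metrics such as http_requests_total are dropped by this filter, so recording rules based on them would need those names added to the regex.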

Verify Federation

# Test federation endpoint (-G with --data-urlencode stops curl from
# treating the [] and {} characters as URL glob patterns)
curl -G 'http://<central-prometheus>/federate' \
  --data-urlencode 'match[]={job="kubernetes-nodes"}'

# Check ingested metrics
curl 'http://<central-prometheus>/api/v1/query?query=up'

# Verify remote write
curl 'http://<central-prometheus>/api/v1/query?query=prometheus_remote_storage_samples_total'
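
When the federation check is scripted rather than run by hand, the match[] selectors need percent-encoding because they contain brackets, braces, and quotes. The following is a small sketch of building such a URL; federateUrl is a hypothetical helper and the host name is a placeholder.

```typescript
// Builds a /federate URL with one percent-encoded match[] parameter
// per selector.
function federateUrl(baseUrl: string, selectors: string[]): string {
  const query = selectors
    .map((s) => `${encodeURIComponent('match[]')}=${encodeURIComponent(s)}`)
    .join('&');
  return `${baseUrl}/federate?${query}`;
}

// Placeholder host; substitute the address of the central Prometheus.
const checkUrl = federateUrl('http://central-prometheus.example.internal:9090', [
  '{job="kubernetes-nodes"}',
]);
// http://central-prometheus.example.internal:9090/federate?match%5B%5D=%7Bjob%3D%22kubernetes-nodes%22%7D
```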

Production Best Practices

1. High Availability Configuration

// Deploy Prometheus with multiple replicas
prometheusConfig: {
  desiredCount: 3,  // 3 instances for HA
  // Use consistent hashing for federation
  // Each replica needs its own TSDB directory: Prometheus locks its data
  // directory, so running instances cannot share one
}

// Use EFS for shared storage
storageConfig: {
  efsPerformanceMode: 'maxIO',
  efsThroughputMode: 'provisioned',
  provisionedThroughputMibps: 100,
}
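
The comment about consistent hashing is not expanded on in the configuration. One way to read it is that each replica federates only a subset of targets, chosen so that removing a replica moves as few targets as possible. The sketch below uses rendezvous (highest-random-weight) hashing, a simple form of consistent hashing, with FNV-1a as a stand-in hash. Both are assumptions for illustration and not something Prometheus does on its own; the replica and target names are hypothetical.

```typescript
// 32-bit FNV-1a hash; any stable string hash would do here.
function fnv1a(input: string): number {
  let hash = 0x811c9dc5;
  for (let i = 0; i < input.length; i++) {
    hash ^= input.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193);
  }
  return hash >>> 0;
}

// Rendezvous hashing: each target goes to the replica with the highest
// hash score for the (replica, target) pair.
function assignTargets(targets: string[], replicas: string[]): Record<string, string> {
  if (replicas.length === 0) throw new Error('at least one replica is required');
  const assignment: Record<string, string> = {};
  for (const target of targets) {
    let best = replicas[0];
    let bestScore = -1;
    for (const replica of replicas) {
      const score = fnv1a(`${replica}|${target}`);
      if (score > bestScore) {
        best = replica;
        bestScore = score;
      }
    }
    assignment[target] = best;
  }
  return assignment;
}

const federationTargets = ['eks-us-east-1', 'eks-us-west-2', 'ecs-services', 'cloudwatch'];
const threeReplicas = assignTargets(federationTargets, ['prometheus-0', 'prometheus-1', 'prometheus-2']);
const twoReplicas = assignTargets(federationTargets, ['prometheus-0', 'prometheus-1']);
```

Comparing threeReplicas with twoReplicas shows the property of interest: only targets that were assigned to prometheus-2 change owner when it is removed.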

2. Cost Optimization

Strategy          Implementation                    Savings
Metric Filtering  Only scrape essential metrics     40-60% storage
Down-sampling     Reduce resolution for old data    30-50% storage
Recording Rules   Pre-aggregate common queries      20-40% query cost
Fargate Spot      Use Spot instances for non-prod   70% compute cost

3. Security Hardening

// Restrict Prometheus ingress to traffic originating inside the VPC
prometheusSecurityGroup.addIngressRule(
  ec2.Peer.ipv4(vpc.vpcCidrBlock),
  ec2.Port.tcp(9090),
  'Allow only VPC traffic'
);

// Further hardening to consider:
// - Use IAM roles for service accounts
// - Implement network policies
// - Enable audit logging

4. Monitoring the Monitoring

# Alert on Prometheus issues
- alert: PrometheusDown
  expr: up{job="prometheus"} == 0
  for: 5m
  labels:
    severity: critical
  annotations:
    summary: "Prometheus is down"

# Federation jobs are ordinary scrape jobs, so a failing federation
# scrape shows up as up == 0 on the federate-* jobs
- alert: PrometheusFederationFailing
  expr: up{job=~"federate-.*"} == 0
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "Prometheus federation failing"

Conclusion

This centralized Prometheus + Grafana architecture provides enterprise-grade observability for distributed AWS environments. By federating metrics from multiple sources into a unified platform, teams gain:

  • Unified Visibility: Single dashboard for all infrastructure and applications
  • Efficient Operations: Centralized management reduces operational overhead
  • Better Correlation: Cross-service analysis for faster troubleshooting
  • Cost Optimization: Shared infrastructure reduces total monitoring costs
  • Scalability: Architecture scales horizontally with workload growth

Key Takeaways:

  1. Federation enables centralized metrics without changing application code
  2. ECS Fargate provides serverless, scalable infrastructure for Prometheus/Grafana
  3. EFS storage ensures data persistence and high availability
  4. Recording rules optimize query performance and reduce storage costs
  5. Multi-tier alerting prevents alert fatigue and ensures timely responses

The complete implementation is available in the CDK Playground repository, including full configuration examples, dashboards, and deployment scripts.


Tags: #prometheus #grafana #aws #cdk #monitoring #observability #kubernetes #eks #ecs #fargate #metrics #alerting #federation

Yen
