In modern distributed systems running on AWS, monitoring individual services in isolation creates operational blind spots. A centralized Prometheus + Grafana platform provides unified visibility across all infrastructure, enabling correlation analysis, efficient troubleshooting, and proactive alerting. This post explores building a production-grade monitoring hub using AWS CDK that aggregates metrics from EKS clusters, ECS services, Lambda functions, and custom applications.
The Challenge: Fragmented Observability
Organizations running microservices across AWS face critical monitoring challenges:
- Isolated Metrics: Each service/cluster has its own Prometheus instance, preventing holistic analysis
- Data Silos: No centralized view of system-wide performance and health
- Scaling Complexity: Managing dozens of Prometheus instances becomes operationally expensive
- Correlation Difficulty: Cross-service debugging requires manual metric aggregation
- Storage Management: Each Prometheus instance handles its own long-term storage
- Alerting Chaos: Duplicate alerts from multiple Prometheus instances
- Query Performance: Complex cross-cluster queries are slow or impossible
Why Centralized Prometheus + Grafana Architecture?
Before diving into implementation, let’s understand the architectural benefits:
Centralized vs Distributed Monitoring
Traditional Distributed Approach:
┌──────────────┐    ┌──────────────┐    ┌──────────────┐
│ EKS Cluster  │    │ ECS Service  │    │   Lambda     │
│              │    │              │    │  Functions   │
│ Prometheus   │    │ Prometheus   │    │ CloudWatch   │
│ Grafana      │    │ Grafana      │    │ Metrics      │
└──────────────┘    └──────────────┘    └──────────────┘
       ↓                   ↓                   ↓
    Isolated         No Correlation      Manual Export
Centralized Approach:
┌──────────────┐    ┌──────────────┐    ┌──────────────┐
│ EKS Cluster  │    │ ECS Service  │    │   Lambda     │
│              │    │              │    │  Functions   │
│ Prometheus   │───►│ Prometheus   │───►│ CloudWatch   │
│ (Scraper)    │    │ (Scraper)    │    │ Exporter     │
└──────────────┘    └──────────────┘    └──────────────┘
       │                    │                    │
       └────────────────────┴────────────────────┘
                            │
                            ▼
               ┌─────────────────────────┐
               │   Central Prometheus    │
               │   (Federation/Remote)   │
               └────────────┬────────────┘
                            │
                            ▼
               ┌─────────────────────────┐
               │   Centralized Grafana   │
               │   (Unified Dashboards)  │
               └─────────────────────────┘
Key Benefits
| Capability | Distributed | Centralized |
|---|---|---|
| Cross-Service Queries | Manual aggregation | Native support |
| Unified Dashboards | Multiple logins | Single pane of glass |
| Alert Deduplication | Complex rules needed | Built-in |
| Long-term Storage | Per-instance management | Centralized (Cortex/Thanos) |
| Operational Overhead | High (N full monitoring stacks) | Lower (one central stack plus lightweight local scrapers) |
| Cost Efficiency | N × infrastructure | Optimized shared resources |
Architecture Overview: Multi-Tier Monitoring Platform
Our centralized monitoring architecture uses Prometheus federation and remote write to aggregate metrics from distributed sources:
┌─────────────────────────────────────────────────────────────────────┐
│                              AWS CLOUD                              │
│                                                                     │
│   ┌─────────────────────────────────────────────────────────────┐   │
│   │            METRIC SOURCES (Multi-Account/Region)            │   │
│   │                                                             │   │
│   │   ┌───────────────┐  ┌───────────────┐  ┌──────────────┐    │   │
│   │   │ EKS Cluster   │  │ ECS Services  │  │ Lambda       │    │   │
│   │   │ us-east-1     │  │ us-west-2     │  │ Functions    │    │   │
│   │   │               │  │               │  │              │    │   │
│   │   │ Prometheus    │  │ Prometheus    │  │ CloudWatch   │    │   │
│   │   │ Server        │  │ Server        │  │ → Exporter   │    │   │
│   │   │ (Scraper)     │  │ (Scraper)     │  │              │    │   │
│   │   └───────┬───────┘  └───────┬───────┘  └──────┬───────┘    │   │
│   │           │                  │                 │            │   │
│   │           │ /federate        │ remote_write    │ /metrics   │   │
│   └───────────┼──────────────────┼─────────────────┼────────────┘   │
│               │                  │                 │                │
│               └──────────────────┴─────────────────┘                │
│                                  │                                  │
│                                  ▼                                  │
│   ┌─────────────────────────────────────────────────────────────┐   │
│   │            CENTRALIZED PROMETHEUS (ECS Fargate)             │   │
│   │                                                             │   │
│   │  ┌──────────────────────────────────────────────────────┐   │   │
│   │  │        Prometheus Server (High Availability)         │   │   │
│   │  │                                                      │   │   │
│   │  │  • Federation Endpoint                               │   │   │
│   │  │  • Remote Write Receiver                             │   │   │
│   │  │  • Time Series Database                              │   │   │
│   │  │  • Recording Rules                                   │   │   │
│   │  │  • Alert Manager Integration                         │   │   │
│   │  └──────────────────────────────────────────────────────┘   │   │
│   │                             │                               │   │
│   │            ┌────────────────┴─────────────┐                 │   │
│   │            │                              │                 │   │
│   │            ▼                              ▼                 │   │
│   │   ┌──────────────────┐        ┌──────────────────────┐      │   │
│   │   │ EFS Volume       │        │ Alert Manager        │      │   │
│   │   │ (TSDB Storage)   │        │ (ECS Task)           │      │   │
│   │   │                  │        │                      │      │   │
│   │   │ • Long-term data │        │ • Route alerts       │      │   │
│   │   │ • High IOPS      │        │ • Deduplication      │      │   │
│   │   │ • Automatic      │        │ • Grouping           │      │   │
│   │   │   backup         │        │ • Silencing          │      │   │
│   │   └──────────────────┘        └──────────┬───────────┘      │   │
│   └──────────────────────────────────────────┼──────────────────┘   │
│                                              │                      │
│                                              ▼                      │
│   ┌─────────────────────────────────────────────────────────────┐   │
│   │              CENTRALIZED GRAFANA (ECS Fargate)              │   │
│   │                                                             │   │
│   │  ┌──────────────────────────────────────────────────────┐   │   │
│   │  │              Grafana Server (Multi-AZ)               │   │   │
│   │  │                                                      │   │   │
│   │  │  • Unified Dashboards                                │   │   │
│   │  │  • Multi-Prometheus Data Sources                     │   │   │
│   │  │  • Cross-Cluster Queries                             │   │   │
│   │  │  • Team Dashboards                                   │   │   │
│   │  │  • Alert Visualization                               │   │   │
│   │  └──────────────────────────────────────────────────────┘   │   │
│   │                             │                               │   │
│   │                             ▼                               │   │
│   │  ┌──────────────────────────────────────────────────────┐   │   │
│   │  │              EFS Volume (Grafana Data)               │   │   │
│   │  │  • Dashboards persistence                            │   │   │
│   │  │  • User settings                                     │   │   │
│   │  │  • Plugins                                           │   │   │
│   │  └──────────────────────────────────────────────────────┘   │   │
│   └─────────────────────────────────────────────────────────────┘   │
│                                                                     │
│   ┌─────────────────────────────────────────────────────────────┐   │
│   │                    NOTIFICATION CHANNELS                    │   │
│   │                                                             │   │
│   │   ┌──────────┐  ┌──────────┐  ┌──────────┐  ┌──────────┐    │   │
│   │   │  Slack   │  │ PagerDuty│  │  Email   │  │ Webhook  │    │   │
│   │   └──────────┘  └──────────┘  └──────────┘  └──────────┘    │   │
│   └─────────────────────────────────────────────────────────────┘   │
│                                                                     │
│   ┌─────────────────────────────────────────────────────────────┐   │
│   │                     SUPPORTING SERVICES                     │   │
│   │                                                             │   │
│   │  ┌──────────────┐  ┌──────────────┐  ┌──────────────────┐   │   │
│   │  │     VPC      │  │     ALB      │  │   Route53 DNS    │   │   │
│   │  │  (Multi-AZ)  │  │   (HTTPS)    │  │  (monitoring.)   │   │   │
│   │  └──────────────┘  └──────────────┘  └──────────────────┘   │   │
│   │                                                             │   │
│   │  ┌──────────────┐  ┌──────────────┐  ┌──────────────────┐   │   │
│   │  │   Secrets    │  │  IAM Roles   │  │    CloudWatch    │   │   │
│   │  │   Manager    │  │              │  │     Metrics      │   │   │
│   │  └──────────────┘  └──────────────┘  └──────────────────┘   │   │
│   └─────────────────────────────────────────────────────────────┘   │
└─────────────────────────────────────────────────────────────────────┘
Data Flow Architecture
┌──────────────────────────────────────────────────────────────┐
│ 1. METRIC COLLECTION (Distributed Sources)                   │
│    • Kubernetes pods expose /metrics endpoints               │
│    • Node exporters expose system metrics                    │
│    • Custom applications export business metrics             │
└──────────────────────┬───────────────────────────────────────┘
                       │
                       ▼
┌──────────────────────────────────────────────────────────────┐
│ 2. LOCAL PROMETHEUS (Per Cluster/Service)                    │
│    • Scrape metrics every 15-30 seconds                      │
│    • Apply initial relabeling rules                          │
│    • Store metrics locally (1-7 days)                        │
│    • Evaluate local alerts                                   │
└──────────────────────┬───────────────────────────────────────┘
                       │
                       ▼
┌──────────────────────────────────────────────────────────────┐
│ 3. METRIC AGGREGATION (Federation/Remote Write)              │
│    • Federation: Central pulls from /federate endpoints      │
│    • Remote Write: Local pushes to central receiver          │
│    • Compression and batching for efficiency                 │
│    • Cross-region replication                                │
└──────────────────────┬───────────────────────────────────────┘
                       │
                       ▼
┌──────────────────────────────────────────────────────────────┐
│ 4. CENTRAL PROMETHEUS (ECS Fargate)                          │
│    • Receive and deduplicate metrics                         │
│    • Apply global recording rules                            │
│    • Store in EFS-backed TSDB (30-90 days)                   │
│    • Evaluate global alerts                                  │
│    • Expose unified query API                                │
└──────────────────────┬───────────────────────────────────────┘
                       │
                       ▼
┌──────────────────────────────────────────────────────────────┐
│ 5. VISUALIZATION (Grafana)                                   │
│    • Query central Prometheus                                │
│    • Cross-cluster analysis                                  │
│    • Render unified dashboards                               │
│    • Display alerts and annotations                          │
└──────────────────────┬───────────────────────────────────────┘
                       │
                       ▼
┌──────────────────────────────────────────────────────────────┐
│ 6. ALERTING (Alert Manager)                                  │
│    • Receive alerts from Prometheus                          │
│    • Deduplicate and group alerts                            │
│    • Route to notification channels                          │
│    • Track alert states and silences                         │
└──────────────────────────────────────────────────────────────┘
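Step 3 is easier to reason about with a concrete request in hand. The sketch below assembles the URL the central server would request from one local instance when federating. `buildFederateUrl` is a hypothetical helper written for this post, not part of Prometheus, and the host name is a placeholder; the only behavior taken from Prometheus itself is that `/federate` expects one or more `match[]` selectors.

```typescript
// Hypothetical helper: builds the URL a central Prometheus requests when it
// federates from a local instance. Each selector becomes one `match[]`
// query parameter.
function buildFederateUrl(baseUrl: string, selectors: string[]): string {
  if (selectors.length === 0) {
    throw new Error('at least one match[] selector is required');
  }
  const query = selectors
    .map((selector) => `match[]=${encodeURIComponent(selector)}`)
    .join('&');
  // Drop trailing slashes so the path does not end up as "//federate".
  return `${baseUrl.replace(/\/+$/, '')}/federate?${query}`;
}

// Placeholder host name; substitute the address of a local Prometheus.
const federateUrl = buildFederateUrl(
  'http://prometheus.eks-us-east-1.internal:9090',
  ['{job="kubernetes-nodes"}', '{__name__=~"node_.*"}'],
);
console.log(federateUrl);
```

Narrow selectors keep the federated payload small; pulling every series from every cluster tends to defeat the purpose of keeping local instances.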
CDK Implementation: Infrastructure as Code
Let’s build the complete monitoring platform using AWS CDK with TypeScript.
Project Structure
centralized-monitoring-cdk/
├── bin/
│ └── monitoring-platform.ts # CDK app entry
├── lib/
│ ├── stacks/
│ │ ├── vpc-stack.ts # Network infrastructure
│ │ ├── prometheus-stack.ts # Central Prometheus
│ │ ├── grafana-stack.ts # Grafana server
│ │ ├── alertmanager-stack.ts # Alert Manager
│ │ └── exporters-stack.ts # CloudWatch exporters
│ ├── constructs/
│ │ ├── efs-storage.ts # EFS for persistence
│ │ ├── alb-setup.ts # Load balancer
│ │ └── iam-roles.ts # IAM permissions
│ └── config/
│ ├── prometheus-config.yaml # Prometheus configuration
│ ├── alertmanager-config.yaml # Alert Manager config
│ └── recording-rules.yaml # Prometheus rules
├── grafana/
│ ├── dashboards/
│ │ ├── cluster-overview.json # Kubernetes dashboard
│ │ ├── ecs-services.json # ECS metrics
│ │ └── cross-cluster.json # Multi-cluster view
│ └── provisioning/
│ ├── datasources/ # Prometheus data sources
│ └── dashboards/ # Dashboard provisioning
├── prometheus/
│ ├── rules/
│ │ ├── recording-rules.yml # Pre-aggregation rules
│ │ └── alerting-rules.yml # Alert definitions
│ └── config/
│ └── prometheus.yml # Main configuration
├── package.json
├── tsconfig.json
└── cdk.json
Configuration Management
// lib/config/monitoring-config.ts
export interface MonitoringConfig {
  environment: 'dev' | 'staging' | 'production';
  prometheus: PrometheusConfig;
  grafana: GrafanaConfig;
  alertManager: AlertManagerConfig;
  storage: StorageConfig;
  networking: NetworkingConfig;
}

export interface PrometheusConfig {
  retention: string; // e.g., "30d"
  scrapeInterval: string; // e.g., "15s"
  evaluationInterval: string; // e.g., "15s"
  remoteWriteEnabled: boolean;
  federationEnabled: boolean;
  cpu: number; // ECS CPU units
  memory: number; // ECS memory (MB)
  desiredCount: number; // Number of tasks
  externalLabels: Record<string, string>;
}

export interface GrafanaConfig {
  adminUser: string;
  domain?: string;
  enableAuth: boolean;
  cpu: number;
  memory: number;
  desiredCount: number;
  plugins: string[];
}

export interface AlertManagerConfig {
  enabled: boolean;
  cpu: number;
  memory: number;
  slackWebhook?: string;
  pagerDutyKey?: string;
  emailFrom?: string;
  emailTo: string[];
}

export interface StorageConfig {
  efsPerformanceMode: 'generalPurpose' | 'maxIO';
  efsThroughputMode: 'bursting' | 'provisioned';
  provisionedThroughputMibps?: number;
  prometheusVolumeSize: number; // GB
  grafanaVolumeSize: number; // GB
}

export interface NetworkingConfig {
  vpcCidr: string;
  maxAzs: number;
  enableNatGateway: boolean;
  enableVpcFlowLogs: boolean;
}

export const monitoringConfigs: Record<string, MonitoringConfig> = {
  dev: {
    environment: 'dev',
    prometheus: {
      retention: '7d',
      scrapeInterval: '30s',
      evaluationInterval: '30s',
      remoteWriteEnabled: true,
      federationEnabled: true,
      cpu: 1024, // 1 vCPU
      memory: 2048, // 2 GB
      desiredCount: 1,
      externalLabels: {
        cluster: 'central',
        environment: 'dev',
      },
    },
    grafana: {
      adminUser: 'admin',
      enableAuth: true,
      cpu: 512,
      memory: 1024,
      desiredCount: 1,
      plugins: [
        'grafana-piechart-panel',
        'grafana-worldmap-panel',
      ],
    },
    alertManager: {
      enabled: true,
      cpu: 256,
      memory: 512,
      emailTo: ['dev-team@company.com'],
    },
    storage: {
      efsPerformanceMode: 'generalPurpose',
      efsThroughputMode: 'bursting',
      prometheusVolumeSize: 100,
      grafanaVolumeSize: 20,
    },
    networking: {
      vpcCidr: '10.0.0.0/16',
      maxAzs: 2,
      enableNatGateway: true,
      enableVpcFlowLogs: false,
    },
  },
  production: {
    environment: 'production',
    prometheus: {
      retention: '90d',
      scrapeInterval: '15s',
      evaluationInterval: '15s',
      remoteWriteEnabled: true,
      federationEnabled: true,
      cpu: 4096, // 4 vCPU
      memory: 16384, // 16 GB
      // High availability. Note: replicas cannot share one TSDB directory,
      // so each task needs its own storage path.
      desiredCount: 3,
      externalLabels: {
        cluster: 'central',
        environment: 'production',
      },
    },
    grafana: {
      adminUser: 'admin',
      domain: 'monitoring.company.com',
      enableAuth: true,
      cpu: 2048,
      memory: 4096,
      desiredCount: 2,
      plugins: [
        'grafana-piechart-panel',
        'grafana-worldmap-panel',
        'grafana-clock-panel',
      ],
    },
    alertManager: {
      enabled: true,
      cpu: 1024,
      memory: 2048,
      slackWebhook: process.env.SLACK_WEBHOOK_URL,
      pagerDutyKey: process.env.PAGERDUTY_INTEGRATION_KEY,
      emailFrom: 'alerts@company.com',
      emailTo: [
        'oncall@company.com',
        'sre-team@company.com',
      ],
    },
    storage: {
      efsPerformanceMode: 'maxIO',
      efsThroughputMode: 'provisioned',
      provisionedThroughputMibps: 100,
      prometheusVolumeSize: 1000,
      grafanaVolumeSize: 100,
    },
    networking: {
      vpcCidr: '10.0.0.0/16',
      maxAzs: 3,
      enableNatGateway: true,
      enableVpcFlowLogs: true,
    },
  },
};
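The `cpu` and `memory` values in this config are handed straight to Fargate, which only accepts certain pairings. A small guard like the following could catch a bad pairing before deployment. `isValidFargateSize` is an illustrative helper, not a CDK API, and the table reflects the Fargate task sizes as documented at the time of writing, so check the current ECS documentation before relying on it.

```typescript
// Memory sizes (MiB) Fargate accepts for each CPU size (CPU units).
// Assumption: this mirrors the published Fargate task-size table; verify it
// against the current ECS documentation.
function allowedFargateMemory(cpu: number): number[] {
  const range = (min: number, max: number, step: number): number[] => {
    const sizes: number[] = [];
    for (let size = min; size <= max; size += step) sizes.push(size);
    return sizes;
  };
  switch (cpu) {
    case 256: return [512, 1024, 2048];
    case 512: return range(1024, 4096, 1024);
    case 1024: return range(2048, 8192, 1024);
    case 2048: return range(4096, 16384, 1024);
    case 4096: return range(8192, 30720, 1024);
    default: return []; // larger sizes exist but are left out of this sketch
  }
}

function isValidFargateSize(cpu: number, memory: number): boolean {
  return allowedFargateMemory(cpu).indexOf(memory) !== -1;
}

// The cpu/memory pairs used in the dev and production configs.
const taskSizes: Array<[string, number, number]> = [
  ['dev prometheus', 1024, 2048],
  ['dev grafana', 512, 1024],
  ['dev alertManager', 256, 512],
  ['production prometheus', 4096, 16384],
  ['production grafana', 2048, 4096],
  ['production alertManager', 1024, 2048],
];
const invalidSizes = taskSizes.filter(
  ([, cpu, memory]) => !isValidFargateSize(cpu, memory),
);
console.log(invalidSizes.length === 0 ? 'all task sizes valid' : invalidSizes);
```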
VPC Stack - Network Foundation
// lib/stacks/vpc-stack.ts
import * as cdk from 'aws-cdk-lib';
import * as ec2 from 'aws-cdk-lib/aws-ec2';
import { Construct } from 'constructs';
import { NetworkingConfig } from '../config/monitoring-config';

export interface VpcStackProps extends cdk.StackProps {
  config: NetworkingConfig;
}

export class VpcStack extends cdk.Stack {
  public readonly vpc: ec2.Vpc;
  public readonly prometheusSecurityGroup: ec2.SecurityGroup;
  public readonly grafanaSecurityGroup: ec2.SecurityGroup;
  public readonly albSecurityGroup: ec2.SecurityGroup;

  constructor(scope: Construct, id: string, props: VpcStackProps) {
    super(scope, id, props);

    const { config } = props;

    // Create VPC with public and private subnets
    this.vpc = new ec2.Vpc(this, 'MonitoringVpc', {
      ipAddresses: ec2.IpAddresses.cidr(config.vpcCidr),
      maxAzs: config.maxAzs,
      natGateways: config.enableNatGateway ? config.maxAzs : 0,
      subnetConfiguration: [
        {
          name: 'Public',
          subnetType: ec2.SubnetType.PUBLIC,
          cidrMask: 24,
        },
        {
          name: 'Private',
          subnetType: ec2.SubnetType.PRIVATE_WITH_EGRESS,
          cidrMask: 24,
        },
        {
          name: 'Isolated',
          subnetType: ec2.SubnetType.PRIVATE_ISOLATED,
          cidrMask: 24,
        },
      ],
    });

    // Enable VPC Flow Logs if configured
    if (config.enableVpcFlowLogs) {
      this.vpc.addFlowLog('VpcFlowLogs', {
        destination: ec2.FlowLogDestination.toCloudWatchLogs(),
      });
    }

    // Security group for ALB
    this.albSecurityGroup = new ec2.SecurityGroup(this, 'AlbSecurityGroup', {
      vpc: this.vpc,
      description: 'Security group for monitoring ALB',
      allowAllOutbound: true,
    });

    // Allow HTTPS from anywhere
    this.albSecurityGroup.addIngressRule(
      ec2.Peer.anyIpv4(),
      ec2.Port.tcp(443),
      'Allow HTTPS from internet'
    );

    // Security group for Prometheus
    this.prometheusSecurityGroup = new ec2.SecurityGroup(
      this,
      'PrometheusSecurityGroup',
      {
        vpc: this.vpc,
        description: 'Security group for Prometheus',
        allowAllOutbound: true,
      }
    );

    // Allow Prometheus port from ALB and internal VPC
    this.prometheusSecurityGroup.addIngressRule(
      this.albSecurityGroup,
      ec2.Port.tcp(9090),
      'Allow Prometheus access from ALB'
    );

    this.prometheusSecurityGroup.addIngressRule(
      ec2.Peer.ipv4(config.vpcCidr),
      ec2.Port.tcp(9090),
      'Allow Prometheus federation from VPC'
    );

    // Security group for Grafana
    this.grafanaSecurityGroup = new ec2.SecurityGroup(
      this,
      'GrafanaSecurityGroup',
      {
        vpc: this.vpc,
        description: 'Security group for Grafana',
        allowAllOutbound: true,
      }
    );

    // Allow Grafana port from ALB
    this.grafanaSecurityGroup.addIngressRule(
      this.albSecurityGroup,
      ec2.Port.tcp(3000),
      'Allow Grafana access from ALB'
    );

    // Allow Grafana to access Prometheus
    this.prometheusSecurityGroup.connections.allowFrom(
      this.grafanaSecurityGroup,
      ec2.Port.tcp(9090),
      'Allow Grafana to query Prometheus'
    );

    // Outputs
    new cdk.CfnOutput(this, 'VpcId', {
      value: this.vpc.vpcId,
      exportName: 'MonitoringVpcId',
    });

    new cdk.CfnOutput(this, 'VpcCidr', {
      value: this.vpc.vpcCidrBlock,
      exportName: 'MonitoringVpcCidr',
    });
  }
}
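The VPC stack requests three subnet groups with a `/24` mask in every availability zone, so it is worth confirming that the configured CIDR range has room for them. The arithmetic can be sketched as follows. `planSubnets` is a hypothetical helper, and it assumes every requested availability zone is actually used (CDK may settle on fewer).

```typescript
interface SubnetPlan {
  needed: number;    // subnets the stack asks for
  available: number; // cidrMask-sized blocks the VPC range can hold
  fits: boolean;
}

// Hypothetical helper: compares the requested subnet count against the
// number of blocks of size /cidrMask inside the VPC range.
function planSubnets(
  vpcCidr: string,
  maxAzs: number,
  subnetGroups: number,
  cidrMask: number,
): SubnetPlan {
  const vpcPrefix = parseInt(vpcCidr.split('/')[1], 10);
  if (isNaN(vpcPrefix) || vpcPrefix < 0 || cidrMask > 32 || cidrMask < vpcPrefix) {
    throw new Error(`cannot carve /${cidrMask} subnets out of ${vpcCidr}`);
  }
  const needed = maxAzs * subnetGroups;
  const available = Math.pow(2, cidrMask - vpcPrefix);
  return { needed, available, fits: needed <= available };
}

// Production settings: 10.0.0.0/16, 3 AZs, Public + Private + Isolated.
const productionPlan = planSubnets('10.0.0.0/16', 3, 3, 24);
console.log(productionPlan);
```

With the production settings this comes to 9 subnets out of 256 possible `/24` blocks, so the address plan leaves plenty of headroom.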
Prometheus Stack - Central Metrics Server
// lib/stacks/prometheus-stack.ts
import * as cdk from 'aws-cdk-lib';
import * as ec2 from 'aws-cdk-lib/aws-ec2';
import * as ecs from 'aws-cdk-lib/aws-ecs';
import * as ecs_patterns from 'aws-cdk-lib/aws-ecs-patterns';
import * as efs from 'aws-cdk-lib/aws-efs';
import * as logs from 'aws-cdk-lib/aws-logs';
import * as elbv2 from 'aws-cdk-lib/aws-elasticloadbalancingv2';
import { Construct } from 'constructs';
import { PrometheusConfig, StorageConfig } from '../config/monitoring-config';

export interface PrometheusStackProps extends cdk.StackProps {
  vpc: ec2.Vpc;
  securityGroup: ec2.SecurityGroup;
  prometheusConfig: PrometheusConfig;
  storageConfig: StorageConfig;
  alb: elbv2.ApplicationLoadBalancer;
}

export class PrometheusStack extends cdk.Stack {
  public readonly service: ecs_patterns.ApplicationLoadBalancedFargateService;
  public readonly fileSystem: efs.FileSystem;
  public readonly prometheusUrl: string;

  constructor(scope: Construct, id: string, props: PrometheusStackProps) {
    super(scope, id, props);

    const { vpc, securityGroup, prometheusConfig, storageConfig, alb } = props;

    // Create ECS Cluster
    const cluster = new ecs.Cluster(this, 'PrometheusCluster', {
      vpc,
      clusterName: 'monitoring-prometheus-cluster',
      containerInsights: true,
    });

    // Create EFS for Prometheus data persistence
    this.fileSystem = new efs.FileSystem(this, 'PrometheusEfs', {
      vpc,
      encrypted: true,
      lifecyclePolicy: efs.LifecyclePolicy.AFTER_14_DAYS,
      performanceMode: storageConfig.efsPerformanceMode === 'maxIO'
        ? efs.PerformanceMode.MAX_IO
        : efs.PerformanceMode.GENERAL_PURPOSE,
      throughputMode: storageConfig.efsThroughputMode === 'provisioned'
        ? efs.ThroughputMode.PROVISIONED
        : efs.ThroughputMode.BURSTING,
      provisionedThroughputPerSecond: storageConfig.provisionedThroughputMibps
        ? cdk.Size.mebibytes(storageConfig.provisionedThroughputMibps)
        : undefined,
      removalPolicy: cdk.RemovalPolicy.RETAIN,
    });

    // Create access point for Prometheus
    const accessPoint = this.fileSystem.addAccessPoint('PrometheusAccessPoint', {
      path: '/prometheus',
      createAcl: {
        ownerGid: '65534',
        ownerUid: '65534',
        permissions: '755',
      },
      posixUser: {
        gid: '65534',
        uid: '65534',
      },
    });

    // Create task definition
    const taskDefinition = new ecs.FargateTaskDefinition(this, 'PrometheusTask', {
      cpu: prometheusConfig.cpu,
      memoryLimitMiB: prometheusConfig.memory,
      family: 'prometheus-server',
    });

    // Add EFS volume
    const volumeName = 'prometheus-storage';
    taskDefinition.addVolume({
      name: volumeName,
      efsVolumeConfiguration: {
        fileSystemId: this.fileSystem.fileSystemId,
        transitEncryption: 'ENABLED',
        authorizationConfig: {
          accessPointId: accessPoint.accessPointId,
          iam: 'ENABLED',
        },
      },
    });

    // Generate the Prometheus configuration from the typed config.
    // NOTE: nothing below delivers this string to the container yet. It still
    // has to be placed at /etc/prometheus/prometheus.yml, for example by
    // baking it into a custom image; until then the image's default config
    // is what Prometheus loads.
    const prometheusConfigYaml = this.loadPrometheusConfig(prometheusConfig);

    // Add Prometheus container
    const prometheusContainer = taskDefinition.addContainer('prometheus', {
      image: ecs.ContainerImage.fromRegistry('prom/prometheus:latest'),
      logging: ecs.LogDrivers.awsLogs({
        streamPrefix: 'prometheus',
        logRetention: logs.RetentionDays.ONE_WEEK,
      }),
      environment: {
        PROMETHEUS_RETENTION: prometheusConfig.retention,
      },
      command: [
        '--config.file=/etc/prometheus/prometheus.yml',
        '--storage.tsdb.path=/prometheus',
        `--storage.tsdb.retention.time=${prometheusConfig.retention}`,
        '--web.console.libraries=/usr/share/prometheus/console_libraries',
        '--web.console.templates=/usr/share/prometheus/consoles',
        '--web.enable-lifecycle',
        '--web.enable-admin-api',
      ],
      portMappings: [{
        containerPort: 9090,
        protocol: ecs.Protocol.TCP,
      }],
      healthCheck: {
        command: ['CMD-SHELL', 'wget --no-verbose --tries=1 --spider http://localhost:9090/-/healthy || exit 1'],
        interval: cdk.Duration.seconds(30),
        timeout: cdk.Duration.seconds(5),
        retries: 3,
        startPeriod: cdk.Duration.seconds(60),
      },
    });

    // Mount EFS volume
    prometheusContainer.addMountPoints({
      sourceVolume: volumeName,
      containerPath: '/prometheus',
      readOnly: false,
    });

    // Grant EFS permissions
    this.fileSystem.grantReadWrite(taskDefinition.taskRole);

    // Create Fargate service
    this.service = new ecs_patterns.ApplicationLoadBalancedFargateService(
      this,
      'PrometheusService',
      {
        cluster,
        serviceName: 'prometheus',
        taskDefinition,
        desiredCount: prometheusConfig.desiredCount,
        loadBalancer: alb,
        publicLoadBalancer: false,
        securityGroups: [securityGroup],
        taskSubnets: {
          subnetType: ec2.SubnetType.PRIVATE_WITH_EGRESS,
        },
      }
    );

    // Configure health check
    this.service.targetGroup.configureHealthCheck({
      path: '/-/healthy',
      interval: cdk.Duration.seconds(30),
      timeout: cdk.Duration.seconds(5),
      healthyThresholdCount: 2,
      unhealthyThresholdCount: 3,
    });

    // Auto scaling
    const scaling = this.service.service.autoScaleTaskCount({
      minCapacity: prometheusConfig.desiredCount,
      maxCapacity: prometheusConfig.desiredCount * 2,
    });

    scaling.scaleOnCpuUtilization('CpuScaling', {
      targetUtilizationPercent: 70,
      scaleInCooldown: cdk.Duration.seconds(300),
      scaleOutCooldown: cdk.Duration.seconds(60),
    });

    scaling.scaleOnMemoryUtilization('MemoryScaling', {
      targetUtilizationPercent: 80,
      scaleInCooldown: cdk.Duration.seconds(300),
      scaleOutCooldown: cdk.Duration.seconds(60),
    });

    // Allow EFS connections
    this.fileSystem.connections.allowDefaultPortFrom(this.service.service.connections);

    this.prometheusUrl = `http://${this.service.loadBalancer.loadBalancerDnsName}`;

    // Outputs
    new cdk.CfnOutput(this, 'PrometheusUrl', {
      value: this.prometheusUrl,
      description: 'Prometheus Server URL',
      exportName: 'PrometheusUrl',
    });

    new cdk.CfnOutput(this, 'PrometheusEfsId', {
      value: this.fileSystem.fileSystemId,
      description: 'Prometheus EFS File System ID',
      exportName: 'PrometheusEfsId',
    });
  }

  private loadPrometheusConfig(config: PrometheusConfig): string {
    // Generate Prometheus configuration dynamically
    const prometheusConfig = {
      global: {
        scrape_interval: config.scrapeInterval,
        evaluation_interval: config.evaluationInterval,
        external_labels: config.externalLabels,
      },
      scrape_configs: [
        {
          job_name: 'prometheus',
          static_configs: [{
            targets: ['localhost:9090'],
          }],
        },
        // Federation from remote Prometheus instances
        ...(config.federationEnabled ? [{
          job_name: 'federate-eks-clusters',
          scrape_interval: '30s',
          honor_labels: true,
          metrics_path: '/federate',
          params: {
            'match[]': [
              '{job="kubernetes-apiservers"}',
              '{job="kubernetes-nodes"}',
              '{job="kubernetes-pods"}',
              '{job="kubernetes-cadvisor"}',
              '{job="kubernetes-service-endpoints"}',
            ],
          },
          static_configs: [
            // Add your EKS Prometheus endpoints here
            // { targets: ['prometheus.eks-cluster-1.internal:9090'] },
            // { targets: ['prometheus.eks-cluster-2.internal:9090'] },
          ],
        }] : []),
      ],
      remote_write: config.remoteWriteEnabled ? [
        // Configure remote write endpoints if needed
        // { url: 'http://cortex:9009/api/prom/push' }
      ] : [],
      rule_files: [
        '/etc/prometheus/recording-rules.yml',
        '/etc/prometheus/alerting-rules.yml',
      ],
    };

    // Serialized as JSON, which YAML parsers generally accept as-is.
    return JSON.stringify(prometheusConfig, null, 2);
  }
}
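The generated federation job leaves `static_configs` empty, so nothing is federated until endpoints are filled in. One way to populate it is to derive a job per cluster from a small list, as in the sketch below. The `FederatedCluster` type and `buildFederationJob` function are hypothetical names introduced here, and the endpoints are the placeholder host names used elsewhere in this post.

```typescript
// Hypothetical description of one upstream Prometheus to federate from.
interface FederatedCluster {
  name: string;     // used for the job name and the `cluster` label
  region: string;   // attached as the `region` label
  endpoint: string; // host:port of the local Prometheus
}

// Builds one scrape job in the same shape as the hand-written
// federation jobs in prometheus.yml.
function buildFederationJob(cluster: FederatedCluster, selectors: string[]) {
  return {
    job_name: `federate-${cluster.name}`,
    scrape_interval: '30s',
    honor_labels: true,
    metrics_path: '/federate',
    params: { 'match[]': selectors },
    static_configs: [{
      targets: [cluster.endpoint],
      labels: { cluster: cluster.name, region: cluster.region },
    }],
  };
}

const clusters: FederatedCluster[] = [
  { name: 'eks-us-east-1', region: 'us-east-1', endpoint: 'prometheus.eks-us-east-1.internal:9090' },
  { name: 'eks-us-west-2', region: 'us-west-2', endpoint: 'prometheus.eks-us-west-2.internal:9090' },
];

const federationJobs = clusters.map((cluster) =>
  buildFederationJob(cluster, ['{job=~"kubernetes-.*"}', '{__name__=~"node_.*"}']),
);
console.log(JSON.stringify(federationJobs, null, 2));
```

The resulting array could be spread into `scrape_configs` in place of the single placeholder job, which keeps the cluster list in one place as clusters are added.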
Grafana Stack - Visualization Platform
// lib/stacks/grafana-stack.ts
import * as cdk from 'aws-cdk-lib';
import * as ec2 from 'aws-cdk-lib/aws-ec2';
import * as ecs from 'aws-cdk-lib/aws-ecs';
import * as ecs_patterns from 'aws-cdk-lib/aws-ecs-patterns';
import * as efs from 'aws-cdk-lib/aws-efs';
import * as secretsmanager from 'aws-cdk-lib/aws-secretsmanager';
import * as logs from 'aws-cdk-lib/aws-logs';
import * as elbv2 from 'aws-cdk-lib/aws-elasticloadbalancingv2';
import { Construct } from 'constructs';
import { GrafanaConfig, StorageConfig } from '../config/monitoring-config';

export interface GrafanaStackProps extends cdk.StackProps {
  vpc: ec2.Vpc;
  securityGroup: ec2.SecurityGroup;
  grafanaConfig: GrafanaConfig;
  storageConfig: StorageConfig;
  prometheusUrl: string;
  alb: elbv2.ApplicationLoadBalancer;
}

export class GrafanaStack extends cdk.Stack {
  public readonly service: ecs_patterns.ApplicationLoadBalancedFargateService;
  public readonly grafanaUrl: string;

  constructor(scope: Construct, id: string, props: GrafanaStackProps) {
    super(scope, id, props);

    const { vpc, securityGroup, grafanaConfig, storageConfig, prometheusUrl, alb } = props;

    // Create ECS Cluster
    const cluster = new ecs.Cluster(this, 'GrafanaCluster', {
      vpc,
      clusterName: 'monitoring-grafana-cluster',
      containerInsights: true,
    });

    // Create EFS for Grafana data
    const fileSystem = new efs.FileSystem(this, 'GrafanaEfs', {
      vpc,
      encrypted: true,
      performanceMode: efs.PerformanceMode.GENERAL_PURPOSE,
      throughputMode: efs.ThroughputMode.BURSTING,
      removalPolicy: cdk.RemovalPolicy.RETAIN,
    });

    const accessPoint = fileSystem.addAccessPoint('GrafanaAccessPoint', {
      path: '/grafana',
      createAcl: {
        ownerGid: '472',
        ownerUid: '472',
        permissions: '755',
      },
      posixUser: {
        gid: '472',
        uid: '472',
      },
    });

    // Create admin password secret
    const adminSecret = new secretsmanager.Secret(this, 'GrafanaAdminPassword', {
      secretName: 'grafana-admin-credentials',
      generateSecretString: {
        secretStringTemplate: JSON.stringify({ username: grafanaConfig.adminUser }),
        generateStringKey: 'password',
        excludePunctuation: true,
        passwordLength: 16,
      },
    });

    // Task definition
    const taskDefinition = new ecs.FargateTaskDefinition(this, 'GrafanaTask', {
      cpu: grafanaConfig.cpu,
      memoryLimitMiB: grafanaConfig.memory,
      family: 'grafana-server',
    });

    // Add volume
    const volumeName = 'grafana-storage';
    taskDefinition.addVolume({
      name: volumeName,
      efsVolumeConfiguration: {
        fileSystemId: fileSystem.fileSystemId,
        transitEncryption: 'ENABLED',
        authorizationConfig: {
          accessPointId: accessPoint.accessPointId,
          iam: 'ENABLED',
        },
      },
    });

    // Grafana container
    const grafanaContainer = taskDefinition.addContainer('grafana', {
      image: ecs.ContainerImage.fromRegistry('grafana/grafana:latest'),
      logging: ecs.LogDrivers.awsLogs({
        streamPrefix: 'grafana',
        logRetention: logs.RetentionDays.ONE_WEEK,
      }),
      environment: {
        GF_SERVER_ROOT_URL: grafanaConfig.domain
          ? `https://${grafanaConfig.domain}`
          : '',
        GF_SECURITY_ADMIN_USER: grafanaConfig.adminUser,
        GF_INSTALL_PLUGINS: grafanaConfig.plugins.join(','),
        GF_AUTH_ANONYMOUS_ENABLED: (!grafanaConfig.enableAuth).toString(),
        GF_PATHS_DATA: '/var/lib/grafana',
        GF_PATHS_PROVISIONING: '/etc/grafana/provisioning',
      },
      secrets: {
        GF_SECURITY_ADMIN_PASSWORD: ecs.Secret.fromSecretsManager(adminSecret, 'password'),
      },
      portMappings: [{
        containerPort: 3000,
        protocol: ecs.Protocol.TCP,
      }],
      healthCheck: {
        command: ['CMD-SHELL', 'wget --no-verbose --tries=1 --spider http://localhost:3000/api/health || exit 1'],
        interval: cdk.Duration.seconds(30),
        timeout: cdk.Duration.seconds(5),
        retries: 3,
        startPeriod: cdk.Duration.seconds(60),
      },
    });

    // Mount EFS
    grafanaContainer.addMountPoints({
      sourceVolume: volumeName,
      containerPath: '/var/lib/grafana',
      readOnly: false,
    });

    // Grant EFS permissions
    fileSystem.grantReadWrite(taskDefinition.taskRole);

    // Create service
    this.service = new ecs_patterns.ApplicationLoadBalancedFargateService(
      this,
      'GrafanaService',
      {
        cluster,
        serviceName: 'grafana',
        taskDefinition,
        desiredCount: grafanaConfig.desiredCount,
        loadBalancer: alb,
        publicLoadBalancer: true,
        securityGroups: [securityGroup],
        taskSubnets: {
          subnetType: ec2.SubnetType.PRIVATE_WITH_EGRESS,
        },
      }
    );

    // Health check
    this.service.targetGroup.configureHealthCheck({
      path: '/api/health',
      interval: cdk.Duration.seconds(30),
      timeout: cdk.Duration.seconds(5),
      healthyThresholdCount: 2,
      unhealthyThresholdCount: 3,
    });

    // Auto scaling
    const scaling = this.service.service.autoScaleTaskCount({
      minCapacity: grafanaConfig.desiredCount,
      maxCapacity: grafanaConfig.desiredCount * 2,
    });

    scaling.scaleOnCpuUtilization('CpuScaling', {
      targetUtilizationPercent: 70,
    });

    // Allow EFS connections
    fileSystem.connections.allowDefaultPortFrom(this.service.service.connections);

    this.grafanaUrl = `http://${this.service.loadBalancer.loadBalancerDnsName}`;

    // Outputs
    new cdk.CfnOutput(this, 'GrafanaUrl', {
      value: this.grafanaUrl,
      description: 'Grafana Dashboard URL',
      exportName: 'GrafanaUrl',
    });

    new cdk.CfnOutput(this, 'GrafanaAdminSecretArn', {
      value: adminSecret.secretArn,
      description: 'Grafana Admin Password Secret ARN',
      exportName: 'GrafanaAdminSecretArn',
    });
  }
}
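The Grafana stack receives `prometheusUrl` but does not yet register it as a data source. Grafana can pick data sources up from files under its provisioning directory, and the content of such a file could be generated along these lines. This is a sketch only: `buildDatasourceProvisioning` and the display name are our own inventions, the URL is a placeholder, and the output is JSON on the assumption that Grafana's YAML loader accepts it.

```typescript
// Shape of one entry in a Grafana data source provisioning file
// (only the fields this sketch sets).
interface GrafanaDatasource {
  name: string;
  type: string;
  access: 'proxy' | 'direct';
  url: string;
  isDefault: boolean;
}

// Hypothetical helper: returns the file content for
// grafana/provisioning/datasources/.
function buildDatasourceProvisioning(prometheusUrl: string): string {
  const datasource: GrafanaDatasource = {
    name: 'Central Prometheus', // display name; an arbitrary choice
    type: 'prometheus',
    access: 'proxy', // Grafana's backend issues the queries
    url: prometheusUrl,
    isDefault: true,
  };
  return JSON.stringify({ apiVersion: 1, datasources: [datasource] }, null, 2);
}

// Placeholder URL; in the stack this would be props.prometheusUrl.
const provisioning = buildDatasourceProvisioning(
  'http://prometheus.monitoring.internal:9090',
);
console.log(provisioning);
```

As with the Prometheus configuration, the generated file still has to reach the container, for instance through a custom image or the EFS volume.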
Prometheus Configuration Files
Main Prometheus Configuration
# prometheus/config/prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s
  external_labels:
    cluster: 'central'
    environment: 'production'

# Alertmanager configuration
alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - alertmanager:9093

# Load rules
rule_files:
  - '/etc/prometheus/recording-rules.yml'
  - '/etc/prometheus/alerting-rules.yml'

# Scrape configurations
scrape_configs:
  # Prometheus itself
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']

  # Federation from EKS clusters
  - job_name: 'federate-eks-us-east-1'
    scrape_interval: 30s
    honor_labels: true
    metrics_path: '/federate'
    params:
      'match[]':
        - '{job=~"kubernetes-.*"}'
        - '{__name__=~"container_.*"}'
        - '{__name__=~"node_.*"}'
    static_configs:
      - targets:
          - 'prometheus.eks-us-east-1.internal:9090'
        labels:
          cluster: 'eks-us-east-1'
          region: 'us-east-1'

  - job_name: 'federate-eks-us-west-2'
    scrape_interval: 30s
    honor_labels: true
    metrics_path: '/federate'
    params:
      'match[]':
        - '{job=~"kubernetes-.*"}'
        - '{__name__=~"container_.*"}'
        - '{__name__=~"node_.*"}'
    static_configs:
      - targets:
          - 'prometheus.eks-us-west-2.internal:9090'
        labels:
          cluster: 'eks-us-west-2'
          region: 'us-west-2'

  # ECS service discovery.
  # Note: ec2_sd_configs discovers EC2 instances by tag, so this covers
  # ECS tasks running on EC2 container instances, not Fargate tasks.
  - job_name: 'ecs-services'
    ec2_sd_configs:
      - region: us-east-1
        port: 9090
        filters:
          - name: tag:monitoring
            values: ['prometheus']
    relabel_configs:
      - source_labels: [__meta_ec2_tag_Name]
        target_label: instance
      - source_labels: [__meta_ec2_tag_Service]
        target_label: service

  # CloudWatch Exporter
  - job_name: 'cloudwatch'
    static_configs:
      - targets:
          - 'cloudwatch-exporter:9106'

# Remote write (optional - for long-term storage)
remote_write:
  - url: http://cortex:9009/api/prom/push
    queue_config:
      capacity: 10000
      max_shards: 200
      max_samples_per_send: 1000
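The two federation jobs above differ only in cluster name and region, so they can be generated rather than copied by hand. The following is a minimal sketch under that assumption; `FederationTarget` and `buildFederationJob` are hypothetical names that are not part of the original stack, and the target hostname pattern is simply the one used in the configuration above.

```typescript
// Hypothetical helper types and names, for illustration only.
interface FederationTarget {
  cluster: string; // e.g. "eks-us-east-1"
  region: string;  // e.g. "us-east-1"
}

// Selectors shared by every federation job, copied from prometheus.yml.
const MATCH_SELECTORS = [
  '{job=~"kubernetes-.*"}',
  '{__name__=~"container_.*"}',
  '{__name__=~"node_.*"}',
];

// Build one scrape_configs entry for a remote cluster.
function buildFederationJob(target: FederationTarget) {
  return {
    job_name: `federate-${target.cluster}`,
    scrape_interval: '30s',
    honor_labels: true,
    metrics_path: '/federate',
    params: { 'match[]': MATCH_SELECTORS },
    static_configs: [
      {
        targets: [`prometheus.${target.cluster}.internal:9090`],
        labels: { cluster: target.cluster, region: target.region },
      },
    ],
  };
}

const federationJobs = [
  { cluster: 'eks-us-east-1', region: 'us-east-1' },
  { cluster: 'eks-us-west-2', region: 'us-west-2' },
].map(buildFederationJob);

// Print the generated jobs as JSON for inspection.
console.log(JSON.stringify(federationJobs, null, 2));
```

Adding a third cluster then becomes a one-line change to the input list, which keeps the selectors and scrape interval consistent across jobs.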
Recording Rules
# prometheus/rules/recording-rules.yml
groups:
  - name: aggregation_rules
    interval: 30s
    rules:
      # Aggregate CPU usage (in cores) by cluster
      - record: cluster:cpu_usage:rate5m
        expr: sum(rate(container_cpu_usage_seconds_total[5m])) by (cluster)

      # Aggregate memory usage by cluster
      - record: cluster:memory_usage_bytes:sum
        expr: sum(container_memory_usage_bytes) by (cluster)

      # Aggregate request rate by service
      - record: service:http_requests:rate5m
        expr: sum(rate(http_requests_total[5m])) by (service, cluster)

      # Aggregate error rate by service
      - record: service:http_errors:rate5m
        expr: sum(rate(http_requests_total{status=~"5.."}[5m])) by (service, cluster)

      # P95 latency by service
      - record: service:http_request_duration_p95:5m
        expr: histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (service, le))

  - name: kubernetes_aggregations
    interval: 30s
    rules:
      # Pod count by namespace and cluster
      - record: namespace:pod_count:sum
        expr: sum(kube_pod_info) by (namespace, cluster)

      # Node capacity by cluster
      - record: cluster:node_capacity_cpu_cores:sum
        expr: sum(kube_node_status_capacity{resource="cpu"}) by (cluster)

      # Node available memory by cluster
      # (node_exporter's MemAvailable; kube_node_status_allocatable reports
      # schedulable capacity, which does not drop as memory is consumed)
      - record: cluster:node_available_memory_bytes:sum
        expr: sum(node_memory_MemAvailable_bytes) by (cluster)
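To make the `service:http_request_duration_p95:5m` rule less of a black box, here is a simplified sketch of how `histogram_quantile` estimates a quantile from cumulative `le` buckets using linear interpolation. It is an illustrative re-implementation, not Prometheus source code; the function name `histogramQuantile` and the bucket counts are made up for the example, and it assumes classic (non-native) histograms with `0 < q <= 1`.

```typescript
interface Bucket {
  le: number;    // upper bound of the bucket (Infinity for the +Inf bucket)
  count: number; // cumulative number of observations <= le
}

// Simplified quantile estimate over cumulative buckets (assumes 0 < q <= 1).
function histogramQuantile(q: number, buckets: Bucket[]): number {
  const sorted = [...buckets].sort((a, b) => a.le - b.le);
  if (sorted.length < 2) return NaN; // need at least one finite bucket plus +Inf

  const total = sorted[sorted.length - 1].count;
  if (total === 0) return NaN;

  // The observation rank we are looking for, e.g. the 95th of 100.
  const rank = q * total;
  const i = sorted.findIndex((b) => b.count >= rank);

  // If the rank lands in the +Inf bucket, report the highest finite bound.
  if (sorted[i].le === Infinity) return sorted[i - 1].le;

  // Interpolate linearly between the bucket's lower and upper bound.
  const lower = i === 0 ? 0 : sorted[i - 1].le;
  const countBelow = i === 0 ? 0 : sorted[i - 1].count;
  const fraction = (rank - countBelow) / (sorted[i].count - countBelow);
  return lower + (sorted[i].le - lower) * fraction;
}

// Made-up example: 100 requests spread over four buckets.
const exampleBuckets: Bucket[] = [
  { le: 0.1, count: 50 },
  { le: 0.5, count: 90 },
  { le: 1, count: 100 },
  { le: Infinity, count: 100 },
];

// The 95th observation falls halfway through the (0.5, 1] bucket.
console.log(histogramQuantile(0.95, exampleBuckets)); // about 0.75
```

Because the result is interpolated inside a bucket, its accuracy depends on how closely the bucket boundaries bracket the latencies you care about, such as the 1s threshold used by the latency alert.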
Alerting Rules
# prometheus/rules/alerting-rules.yml
groups:
  - name: infrastructure_alerts
    interval: 30s
    rules:
      # High CPU usage across cluster
      # cluster:cpu_usage:rate5m is measured in cores, so divide by node
      # capacity to get a utilization ratio before comparing against 80%
      - alert: HighClusterCPUUsage
        expr: cluster:cpu_usage:rate5m / cluster:node_capacity_cpu_cores:sum > 0.8
        for: 5m
        labels:
          severity: warning
          team: platform
        annotations:
          summary: "High CPU usage in cluster {{ $labels.cluster }}"
          description: "CPU usage is {{ $value | humanizePercentage }} in cluster {{ $labels.cluster }}"

      # Low available memory (threshold is 1 GiB)
      - alert: LowClusterMemory
        expr: cluster:node_available_memory_bytes:sum < 1073741824
        for: 5m
        labels:
          severity: critical
          team: platform
        annotations:
          summary: "Low available memory in cluster {{ $labels.cluster }}"
          description: "Only {{ $value | humanize1024 }} available in cluster {{ $labels.cluster }}"

      # Pod crash looping
      # increase() yields a restart count over the window; rate() would
      # yield restarts per second, which does not match the description
      - alert: PodCrashLooping
        expr: increase(kube_pod_container_status_restarts_total[15m]) > 0
        for: 5m
        labels:
          severity: warning
          team: platform
        annotations:
          summary: "Pod {{ $labels.namespace }}/{{ $labels.pod }} is crash looping"
          description: "Pod has restarted {{ $value }} times in the last 15 minutes"

  - name: application_alerts
    interval: 30s
    rules:
      # High error rate
      - alert: HighErrorRate
        expr: service:http_errors:rate5m / service:http_requests:rate5m > 0.05
        for: 5m
        labels:
          severity: critical
          team: backend
        annotations:
          summary: "High error rate in {{ $labels.service }}"
          description: "Error rate is {{ $value | humanizePercentage }} in {{ $labels.service }}"

      # High latency
      - alert: HighLatency
        expr: service:http_request_duration_p95:5m > 1
        for: 10m
        labels:
          severity: warning
          team: backend
        annotations:
          summary: "High P95 latency in {{ $labels.service }}"
          description: "P95 latency is {{ $value }}s in {{ $labels.service }}"

      # Service down (any scrape target of any job)
      - alert: ServiceDown
        expr: up == 0
        for: 2m
        labels:
          severity: critical
          team: platform
        annotations:
          summary: "Service {{ $labels.job }} is down"
          description: "Service {{ $labels.job }} in cluster {{ $labels.cluster }} is unreachable"
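Every rule above carries a `for:` clause, which delays firing until the expression has been true on every evaluation for that long. The sketch below models that behaviour as a small state machine. It is a simplified illustration rather than Prometheus code, and `createAlertTracker` is a hypothetical name.

```typescript
type AlertState = 'inactive' | 'pending' | 'firing';

// Returns an evaluate() function that tracks one alert's state over time.
function createAlertTracker(forSeconds: number) {
  let activeSince: number | null = null;

  return function evaluate(conditionTrue: boolean, nowSeconds: number): AlertState {
    if (!conditionTrue) {
      // A single false evaluation resets the timer.
      activeSince = null;
      return 'inactive';
    }
    if (activeSince === null) {
      // First evaluation where the condition holds.
      activeSince = nowSeconds;
    }
    // Fire once the condition has held for at least the `for:` duration.
    return nowSeconds - activeSince >= forSeconds ? 'firing' : 'pending';
  };
}

// ServiceDown uses `for: 2m`; the rule group is evaluated every 30s.
const serviceDown = createAlertTracker(120);
const timeline = [0, 30, 60, 90, 120].map((t) => `${t}s: ${serviceDown(true, t)}`);
console.log(timeline.join(', '));
```

With a 30s evaluation interval, a target has to fail five consecutive evaluations before `ServiceDown` notifies anyone, which filters out a single failed scrape.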
Grafana Dashboard Provisioning
Prometheus Data Source Configuration
# grafana/provisioning/datasources/prometheus.yaml
apiVersion: 1

datasources:
  - name: Central Prometheus
    type: prometheus
    access: proxy
    url: http://prometheus:9090
    isDefault: true
    editable: false
    jsonData:
      timeInterval: 15s
      queryTimeout: 60s
      httpMethod: POST

  - name: EKS US-East-1
    type: prometheus
    access: proxy
    url: http://prometheus.eks-us-east-1.internal:9090
    editable: false
    jsonData:
      timeInterval: 15s

  - name: EKS US-West-2
    type: prometheus
    access: proxy
    url: http://prometheus.eks-us-west-2.internal:9090
    editable: false
    jsonData:
      timeInterval: 15s
Dashboard Provisioning
# grafana/provisioning/dashboards/default.yaml
apiVersion: 1

providers:
  - name: 'default'
    orgId: 1
    folder: ''
    type: file
    disableDeletion: false
    updateIntervalSeconds: 10
    allowUiUpdates: true
    options:
      path: /etc/grafana/provisioning/dashboards
      foldersFromFilesStructure: true
Cross-Cluster Dashboard Example
Save the following as grafana/dashboards/cross-cluster.json. JSON does not allow comments, so the path is given here rather than inside the file. A provisioned dashboard file contains the dashboard model directly, without the "dashboard" wrapper that the HTTP API expects.
{
  "title": "Cross-Cluster Overview",
  "tags": ["kubernetes", "cross-cluster"],
  "timezone": "browser",
  "panels": [
    {
      "id": 1,
      "title": "CPU Usage by Cluster",
      "type": "graph",
      "datasource": "Central Prometheus",
      "targets": [
        {
          "expr": "cluster:cpu_usage:rate5m",
          "legendFormat": "{{ cluster }}",
          "refId": "A"
        }
      ],
      "gridPos": { "x": 0, "y": 0, "w": 12, "h": 8 }
    },
    {
      "id": 2,
      "title": "Memory Usage by Cluster",
      "type": "graph",
      "datasource": "Central Prometheus",
      "targets": [
        {
          "expr": "cluster:memory_usage_bytes:sum",
          "legendFormat": "{{ cluster }}",
          "refId": "A"
        }
      ],
      "gridPos": { "x": 12, "y": 0, "w": 12, "h": 8 }
    },
    {
      "id": 3,
      "title": "Request Rate by Service (All Clusters)",
      "type": "graph",
      "datasource": "Central Prometheus",
      "targets": [
        {
          "expr": "sum(service:http_requests:rate5m) by (service)",
          "legendFormat": "{{ service }}",
          "refId": "A"
        }
      ],
      "gridPos": { "x": 0, "y": 8, "w": 24, "h": 8 }
    },
    {
      "id": 4,
      "title": "Error Rate Comparison",
      "type": "heatmap",
      "datasource": "Central Prometheus",
      "targets": [
        {
          "expr": "service:http_errors:rate5m / service:http_requests:rate5m",
          "legendFormat": "{{ service }} @ {{ cluster }}",
          "refId": "A"
        }
      ],
      "gridPos": { "x": 0, "y": 16, "w": 24, "h": 8 }
    }
  ]
}
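The `gridPos` values in the dashboard above follow Grafana's 24-column grid: two half-width panels share the first row and the full-width panels each take a row of their own. When dashboards are generated from code, those positions can be computed instead of hand-written. This is a minimal sketch; `layoutPanels` is a hypothetical helper, not a Grafana API.

```typescript
interface GridPos {
  x: number; // column offset (0-23)
  y: number; // row offset in grid units
  w: number; // width in columns
  h: number; // height in grid units
}

const GRID_COLUMNS = 24;

// Place panels left to right, wrapping to a new row when the next
// panel would overflow the 24-column grid.
function layoutPanels(widths: number[], height: number = 8): GridPos[] {
  const positions: GridPos[] = [];
  let x = 0;
  let y = 0;
  for (const w of widths) {
    if (x + w > GRID_COLUMNS) {
      x = 0;
      y += height;
    }
    positions.push({ x, y, w, h: height });
    x += w;
  }
  return positions;
}

// Same panel widths as the cross-cluster dashboard: 12, 12, 24, 24.
const crossClusterLayout = layoutPanels([12, 12, 24, 24]);
console.log(JSON.stringify(crossClusterLayout));
```

The computed positions match the four `gridPos` entries in the JSON, so inserting a panel in the middle only requires changing the list of widths.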
Deployment and Operations
Deploy the Complete Stack
# Set environment
export AWS_REGION=us-east-1
export CDK_DEFAULT_ACCOUNT=$(aws sts get-caller-identity --query Account --output text)

# Install dependencies
npm install

# Bootstrap CDK (first time only)
cdk bootstrap "aws://$CDK_DEFAULT_ACCOUNT/$AWS_REGION"

# Option A: deploy the stacks one at a time, VPC stack first
cdk deploy MonitoringVpcStack --context env=production

# Deploy Prometheus
cdk deploy PrometheusStack --context env=production

# Deploy Grafana
cdk deploy GrafanaStack --context env=production

# Option B: deploy all stacks in one step (skips the approval prompt)
cdk deploy --all --context env=production --require-approval never

# Get deployment outputs
aws cloudformation describe-stacks \
  --stack-name GrafanaStack \
  --query 'Stacks[0].Outputs' \
  --output table
Configure Remote Prometheus Instances
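As a push-based alternative to federation, each remote Prometheus (EKS, ECS) can forward samples to the hub with remote write. The central Prometheus only accepts these writes when it is started with the --web.enable-remote-write-receiver flag:
# Remote Prometheus configuration
remote_write:
  - url: http://<central-prometheus-alb-dns>:9090/api/v1/write
    queue_config:
      capacity: 10000
      max_shards: 100
      max_samples_per_send: 500
    # Forward only the metric families the central hub needs
    write_relabel_configs:
      - source_labels: [__name__]
        regex: 'up|container_.*|node_.*|kube_.*'
        action: keep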
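It is easy to misjudge which series survive a `keep` rule like the one above, because Prometheus anchors relabel regexes at both ends. The sketch below reproduces that matching with a JavaScript `RegExp` so a list of metric names can be checked before rollout. Prometheus uses RE2 syntax, so this is only an approximation that happens to be equivalent for this simple pattern; `isKept` is a hypothetical helper name.

```typescript
// The regex from write_relabel_configs above.
const KEEP_REGEX = 'up|container_.*|node_.*|kube_.*';

// Relabel regexes must match the whole value, so wrap in ^(?:...)$.
function isKept(metricName: string, regex: string = KEEP_REGEX): boolean {
  return new RegExp(`^(?:${regex})$`).test(metricName);
}

const candidates = [
  'up',
  'container_cpu_usage_seconds_total',
  'node_memory_MemAvailable_bytes',
  'kube_pod_info',
  'uptime_seconds',
  'http_requests_total',
];

for (const name of candidates) {
  console.log(`${name}: ${isKept(name) ? 'kept' : 'dropped'}`);
}
```

Note that `http_requests_total` is dropped by this filter. If the service-level recording rules are meant to run on remotely written data, the regex would likely need an extra alternative such as `http_.*`.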
Verify Federation
# Test federation endpoint
# (-G with --data-urlencode avoids curl's URL globbing on [] and {})
curl -G 'http://<central-prometheus>/federate' \
  --data-urlencode 'match[]={job="kubernetes-nodes"}'

# Check ingested metrics
curl 'http://<central-prometheus>/api/v1/query?query=up'

# Verify remote write
curl 'http://<central-prometheus>/api/v1/query?query=prometheus_remote_storage_samples_total'
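The same federation check can be scripted. Selectors contain characters such as `{`, `"` and `[` that must be percent-encoded in a query string, and the standard `URL` and `URLSearchParams` classes handle that. A minimal sketch follows; `federateUrl` is a hypothetical helper and the hostname is a placeholder, not a real endpoint.

```typescript
// Build a /federate URL with one match[] parameter per selector.
function federateUrl(baseUrl: string, selectors: string[]): string {
  const url = new URL('/federate', baseUrl);
  for (const selector of selectors) {
    // append() percent-encodes both the parameter name and its value.
    url.searchParams.append('match[]', selector);
  }
  return url.toString();
}

// Placeholder hostname for illustration only.
const federationCheckUrl = federateUrl('http://prometheus.example.internal:9090', [
  '{job="kubernetes-nodes"}',
  '{__name__=~"node_.*"}',
]);

console.log(federationCheckUrl);
```

The resulting string can be passed to any HTTP client without further escaping.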
Production Best Practices
1. High Availability Configuration
// Deploy Prometheus with multiple replicas
prometheusConfig: {
  desiredCount: 3, // 3 instances for HA
  // Use consistent hashing for federation
  // Give each replica its own data directory: running Prometheus
  // instances cannot share a single TSDB directory
}

// Use EFS for persistent storage
storageConfig: {
  efsPerformanceMode: 'maxIO',
  efsThroughputMode: 'provisioned',
  provisionedThroughputMibps: 100,
}
2. Cost Optimization
| Strategy | Implementation | Estimated savings |
|---|---|---|
| Metric Filtering | Only scrape essential metrics | 40-60% storage |
| Down-sampling | Reduce resolution for old data | 30-50% storage |
| Recording Rules | Pre-aggregate common queries | 20-40% query cost |
| Fargate Spot | Run non-production tasks on Fargate Spot capacity | Up to 70% compute cost |
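To see where the storage savings from metric filtering come from, a rough capacity estimate helps. The Prometheus documentation suggests sizing disk as retention time × ingested samples per second × bytes per sample, with roughly 1 to 2 bytes per sample after compression. The sketch below applies that rule of thumb; the series count and retention are made-up inputs, and `estimateStorageBytes` is a hypothetical helper.

```typescript
const SECONDS_PER_DAY = 86_400;

// Rule-of-thumb disk estimate: retention x samples/second x bytes/sample.
function estimateStorageBytes(
  activeSeries: number,
  scrapeIntervalSeconds: number,
  retentionDays: number,
  bytesPerSample: number = 2, // conservative end of the 1-2 byte range
): number {
  const samplesPerSecond = activeSeries / scrapeIntervalSeconds;
  return samplesPerSecond * retentionDays * SECONDS_PER_DAY * bytesPerSample;
}

// Made-up workload: 1M active series, 15s scrape interval, 30 days retention.
const unfiltered = estimateStorageBytes(1_000_000, 15, 30);

// Same workload after dropping half of the series with metric filtering.
const filtered = estimateStorageBytes(500_000, 15, 30);

const toGB = (bytes: number) => (bytes / 1e9).toFixed(1);
console.log(`unfiltered: ${toGB(unfiltered)} GB, filtered: ${toGB(filtered)} GB`);
```

Storage scales linearly with the number of active series, so dropping half of them halves the estimate. That is why filtering at scrape or remote-write time tends to be the most effective lever in the table.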
3. Security Hardening
// Restrict Prometheus (port 9090) to traffic originating inside the VPC
prometheusSecurityGroup.addIngressRule(
  ec2.Peer.ipv4(vpc.vpcCidrBlock),
  ec2.Port.tcp(9090),
  'Allow only VPC traffic'
);

// Further hardening to consider:
// - Use IAM roles for service accounts
// - Implement network policies
// - Enable audit logging
4. Monitoring the Monitoring
# Alert on Prometheus issues
- alert: PrometheusDown
  expr: up{job="prometheus"} == 0
  for: 5m
  labels:
    severity: critical
  annotations:
    summary: "Prometheus is down"

# prometheus_remote_storage_samples_failed_total is a counter that never
# resets to zero after a failure, so alert on its rate instead of its value
- alert: PrometheusRemoteWriteFailing
  expr: rate(prometheus_remote_storage_samples_failed_total[5m]) > 0
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "Prometheus remote write is failing"
Conclusion
This centralized Prometheus + Grafana architecture provides enterprise-grade observability for distributed AWS environments. By federating metrics from multiple sources into a unified platform, teams gain:
- Unified Visibility: Single dashboard for all infrastructure and applications
- Efficient Operations: Centralized management reduces operational overhead
- Better Correlation: Cross-service analysis for faster troubleshooting
- Cost Optimization: Shared infrastructure reduces total monitoring costs
- Scalability: Architecture scales horizontally with workload growth
Key Takeaways:
- Federation enables centralized metrics without changing application code
- ECS Fargate provides serverless, scalable infrastructure for Prometheus/Grafana
- EFS storage ensures data persistence and high availability
- Recording rules optimize query performance and reduce storage costs
- Multi-tier alerting prevents alert fatigue and ensures timely responses
The complete implementation is available in the CDK Playground repository, including full configuration examples, dashboards, and deployment scripts.
