Series Overview
This is Part 6 of the Kubernetes Autoscaling Complete Guide series:
- Part 1: Horizontal Pod Autoscaler - Application-level autoscaling theory
- Part 2: Cluster Autoscaling & Cloud Providers - Infrastructure-level autoscaling
- Part 3: Hands-On HPA Demo - Practical implementation
- Part 4: Monitoring, Alerting & Threshold Tuning - Production observability
- Part 5: VPA & Resource Optimization - Right-sizing strategies
- Part 6 (This Post): Advanced Autoscaling Patterns - Stateful apps, multi-cluster, cost optimization
Beyond basic HPA and cluster autoscaling, production Kubernetes deployments require sophisticated patterns for stateful workloads, multi-cluster architectures, aggressive cost optimization, and specialized workload types. This guide explores advanced autoscaling strategies used by leading organizations.
Pattern 1: Stateful Application Autoscaling
The StatefulSet Challenge
Traditional HPA assumptions break down with StatefulSets in four ways:

1. Ordered pod creation/deletion
   - pod-0 must exist before pod-1
   - Slow scale-up during traffic spikes

2. Persistent volumes
   - Each pod has a unique PVC
   - Storage costs accumulate
   - PVCs remain after scale-down

3. State synchronization
   - New pods must sync state (databases, caches)
   - Sync time adds to scale-up latency
   - Potential data consistency issues

4. Service discovery
   - Clients must discover new pods
   - DNS updates take time
   - Connection draining is needed on scale-down
Pattern 1A: Database Scaling with StatefulSet
Scenario: PostgreSQL cluster with read replicas that scale based on read query load.
# PostgreSQL StatefulSet
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: postgres-replicas
  namespace: databases
spec:
  serviceName: postgres-replicas
  replicas: 2  # Initial: 2 read replicas (the primary runs in its own StatefulSet)

  selector:
    matchLabels:
      app: postgres
      role: replica

  template:
    metadata:
      labels:
        app: postgres
        role: replica
      annotations:
        prometheus.io/scrape: "true"
        prometheus.io/port: "9187"  # postgres_exporter
    spec:
      initContainers:
        # Initialize replica from primary
        - name: init-replica
          image: postgres:15
          command:
            - bash
            - -c
            - |
              if [ ! -f /var/lib/postgresql/data/PG_VERSION ]; then
                # Clone from primary
                pg_basebackup -h postgres-primary -D /var/lib/postgresql/data -U replication -v -P
                # Create recovery signal
                touch /var/lib/postgresql/data/standby.signal
              fi
          volumeMounts:
            - name: data
              mountPath: /var/lib/postgresql/data

      containers:
        # PostgreSQL replica
        - name: postgres
          image: postgres:15
          env:
            - name: POSTGRES_USER
              value: postgres
            - name: POSTGRES_PASSWORD
              valueFrom:
                secretKeyRef:
                  name: postgres-secret
                  key: password
            - name: PGDATA
              value: /var/lib/postgresql/data/pgdata

          ports:
            - containerPort: 5432
              name: postgres

          resources:
            requests:
              cpu: 1
              memory: 2Gi
            limits:
              cpu: 4
              memory: 8Gi

          volumeMounts:
            - name: data
              mountPath: /var/lib/postgresql/data
            - name: config
              mountPath: /etc/postgresql/postgresql.conf
              subPath: postgresql.conf

        # Postgres Exporter for metrics
        - name: postgres-exporter
          image: prometheuscommunity/postgres-exporter:latest
          env:
            # The exporter needs its own copy of the password so that
            # $(POSTGRES_PASSWORD) below resolves inside this container
            - name: POSTGRES_PASSWORD
              valueFrom:
                secretKeyRef:
                  name: postgres-secret
                  key: password
            - name: DATA_SOURCE_NAME
              value: "postgresql://postgres:$(POSTGRES_PASSWORD)@localhost:5432/postgres?sslmode=disable"
          ports:
            - containerPort: 9187
              name: metrics
          resources:
            requests:
              cpu: 100m
              memory: 128Mi

      volumes:
        - name: config
          configMap:
            name: postgres-config

  volumeClaimTemplates:
    - metadata:
        name: data
      spec:
        accessModes: ["ReadWriteOnce"]
        storageClassName: gp3-encrypted
        resources:
          requests:
            storage: 100Gi

---
# Headless service for StatefulSet
apiVersion: v1
kind: Service
metadata:
  name: postgres-replicas
  namespace: databases
spec:
  clusterIP: None
  selector:
    app: postgres
    role: replica
  ports:
    - port: 5432
      name: postgres

---
# Regular service for read traffic (load balanced)
apiVersion: v1
kind: Service
metadata:
  name: postgres-read
  namespace: databases
  annotations:
    prometheus.io/scrape: "true"
    prometheus.io/port: "9187"
spec:
  type: ClusterIP
  selector:
    app: postgres
    role: replica
  ports:
    - port: 5432
      name: postgres

---
# HPA for read replicas based on custom metrics
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: postgres-replicas-hpa
  namespace: databases
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: StatefulSet
    name: postgres-replicas

  minReplicas: 2   # Always keep at least 2 read replicas
  maxReplicas: 10  # Max read replicas

  metrics:
    # Scale based on active connections
    - type: Pods
      pods:
        metric:
          name: pg_stat_database_numbackends
        target:
          type: AverageValue
          averageValue: "50"  # 50 connections per replica

    # Scale based on replication lag
    - type: Pods
      pods:
        metric:
          name: pg_replication_lag_seconds
        target:
          type: AverageValue
          averageValue: "5"  # Keep lag under 5 seconds

  behavior:
    scaleUp:
      stabilizationWindowSeconds: 60  # Wait 1 min before scale-up
      policies:
        - type: Pods
          value: 1  # Add 1 replica at a time
          periodSeconds: 60
      selectPolicy: Min

    scaleDown:
      stabilizationWindowSeconds: 600  # Wait 10 min before scale-down
      policies:
        - type: Pods
          value: 1  # Remove 1 replica at a time
          periodSeconds: 300  # Every 5 minutes
      selectPolicy: Min

---
# PrometheusRule for PostgreSQL monitoring
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: postgres-autoscaling-rules
  namespace: monitoring
spec:
  groups:
    - name: postgres-custom-metrics
      interval: 15s
      rules:
        # Active connections per pod
        - record: pg_stat_database_numbackends
          expr: |
            sum(pg_stat_database_numbackends{datname="postgres"}) by (pod, namespace)

        # Replication lag in seconds
        - record: pg_replication_lag_seconds
          expr: |
            pg_replication_lag

    - name: postgres-alerts
      rules:
        # Alert when the HPA is pinned near its replica ceiling
        - alert: PostgresReplicasMaxedOut
          expr: |
            (
              kube_horizontalpodautoscaler_status_current_replicas{horizontalpodautoscaler="postgres-replicas-hpa"}
              /
              kube_horizontalpodautoscaler_spec_max_replicas{horizontalpodautoscaler="postgres-replicas-hpa"}
            ) >= 0.9
          for: 10m
          labels:
            severity: warning
          annotations:
            summary: "PostgreSQL replicas near maximum capacity"
            description: "Consider increasing maxReplicas or optimizing queries"

        # Alert on high replication lag
        - alert: PostgresHighReplicationLag
          expr: pg_replication_lag_seconds > 30
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: "PostgreSQL replication lag is high"
            description: "Replication lag is {{ $value }}s, may impact read consistency"
Pattern 1B: Redis Cache Cluster Autoscaling
# Redis Cluster with dynamic scaling
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: redis-cluster
  namespace: caching
spec:
  serviceName: redis-cluster
  replicas: 6  # 3 masters + 3 replicas

  selector:
    matchLabels:
      app: redis-cluster

  template:
    metadata:
      labels:
        app: redis-cluster
    spec:
      containers:
        - name: redis
          image: redis:7-alpine
          command:
            - redis-server
          args:
            - /conf/redis.conf
            - --cluster-enabled
            - "yes"
            - --cluster-config-file
            - /data/nodes.conf
            - --cluster-node-timeout
            - "5000"
            - --maxmemory
            - "2gb"
            - --maxmemory-policy
            - "allkeys-lru"

          ports:
            - containerPort: 6379
              name: client
            - containerPort: 16379
              name: gossip

          resources:
            requests:
              cpu: 500m
              memory: 2Gi
            limits:
              cpu: 2
              memory: 4Gi

          volumeMounts:
            - name: data
              mountPath: /data
            - name: conf
              mountPath: /conf

        # Redis Exporter sidecar
        - name: redis-exporter
          image: oliver006/redis_exporter:latest
          ports:
            - containerPort: 9121
              name: metrics
          resources:
            requests:
              cpu: 100m
              memory: 128Mi

      volumes:
        # Backing ConfigMap for /conf/redis.conf (must exist alongside this StatefulSet)
        - name: conf
          configMap:
            name: redis-cluster-config

  volumeClaimTemplates:
    - metadata:
        name: data
      spec:
        accessModes: ["ReadWriteOnce"]
        resources:
          requests:
            storage: 50Gi

---
# prometheus-adapter rules exposing Redis metrics as custom metrics
apiVersion: v1
kind: ConfigMap
metadata:
  name: prometheus-adapter-redis
  namespace: monitoring
data:
  config.yaml: |
    rules:
      # Redis memory usage percentage
      - seriesQuery: 'redis_memory_used_bytes'
        resources:
          overrides:
            namespace: {resource: "namespace"}
            pod: {resource: "pod"}
        name:
          as: "redis_memory_usage_percentage"
        metricsQuery: |
          (sum(redis_memory_used_bytes{<<.LabelMatchers>>}) by (<<.GroupBy>>)
            / sum(redis_memory_max_bytes{<<.LabelMatchers>>}) by (<<.GroupBy>>)) * 100

      # Redis connected clients
      - seriesQuery: 'redis_connected_clients'
        resources:
          overrides:
            namespace: {resource: "namespace"}
            pod: {resource: "pod"}
        name:
          as: "redis_clients_per_pod"
        metricsQuery: |
          sum(redis_connected_clients{<<.LabelMatchers>>}) by (<<.GroupBy>>)

      # Redis operations per second
      - seriesQuery: 'redis_instantaneous_ops_per_sec'
        resources:
          overrides:
            namespace: {resource: "namespace"}
            pod: {resource: "pod"}
        name:
          as: "redis_ops_per_second"
        metricsQuery: |
          sum(redis_instantaneous_ops_per_sec{<<.LabelMatchers>>}) by (<<.GroupBy>>)

---
# HPA for Redis based on memory and ops
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: redis-cluster-hpa
  namespace: caching
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: StatefulSet
    name: redis-cluster

  minReplicas: 6   # Minimum cluster size (3 masters + 3 replicas)
  maxReplicas: 18  # Max 9 masters + 9 replicas

  metrics:
    # Memory usage
    - type: Pods
      pods:
        metric:
          name: redis_memory_usage_percentage
        target:
          type: AverageValue
          averageValue: "75"  # Scale when memory > 75%

    # Operations per second
    - type: Pods
      pods:
        metric:
          name: redis_ops_per_second
        target:
          type: AverageValue
          averageValue: "10000"  # Scale at 10k ops/sec per pod

  behavior:
    scaleUp:
      stabilizationWindowSeconds: 120
      policies:
        - type: Pods
          value: 2  # Add 2 pods at a time (1 master + 1 replica)
          periodSeconds: 120

    scaleDown:
      stabilizationWindowSeconds: 600
      policies:
        - type: Pods
          value: 2  # Remove 2 pods at a time
          periodSeconds: 300
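Note that the HPA above only changes the pod count: in a Redis Cluster, hash slots do not move automatically, so a freshly added master owns no keys until the cluster is rebalanced. A minimal sketch of a rebalance Job to run after scale-up; the Job name and the seed-node address are assumptions derived from the StatefulSet above:

# Hypothetical post-scale-up rebalance Job for the redis-cluster StatefulSet
apiVersion: batch/v1
kind: Job
metadata:
  name: redis-cluster-rebalance
  namespace: caching
spec:
  template:
    spec:
      containers:
        - name: rebalance
          image: redis:7-alpine
          command:
            - sh
            - -c
            - |
              # Spread hash slots across all masters;
              # --cluster-use-empty-masters includes freshly added ones
              redis-cli --cluster rebalance \
                redis-cluster-0.redis-cluster.caching.svc.cluster.local:6379 \
                --cluster-use-empty-masters
      restartPolicy: OnFailure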
Key Considerations for Stateful Autoscaling
- Data Synchronization Time: Account for data replication delays
- Ordered Scaling: StatefulSets scale sequentially, slower than Deployments
- Storage Management: Implement PVC cleanup policies (see the retention-policy sketch after this list)
- State Warmup: Consider warm-up time for caches/databases
- Split Read/Write: Scale read replicas independently from write nodes
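On recent Kubernetes versions, the PVC cleanup mentioned above can be declared directly on the StatefulSet via persistentVolumeClaimRetentionPolicy (the StatefulSetAutoDeletePVC feature, beta since Kubernetes 1.27). A sketch applied to the Pattern 1A StatefulSet:

# Sketch: automatic PVC cleanup on scale-down
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: postgres-replicas
  namespace: databases
spec:
  persistentVolumeClaimRetentionPolicy:
    whenDeleted: Retain  # Keep data if the StatefulSet itself is deleted
    whenScaled: Delete   # Reclaim replica PVCs when the HPA scales down
  # ...remainder as in Pattern 1A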
Pattern 2: Multi-Cluster & Multi-Region Autoscaling
Architecture Overview
The multi-cluster autoscaling architecture stacks three layers:

1. Regional clusters: EKS clusters in US-EAST, EU-WEST, and AP-SOUTH, each running its own HPA, Karpenter, and a local load balancer
2. Global load balancer: Route 53 / CloudFlare / Global Accelerator, providing geographic routing, latency-based routing, and weighted routing (for gradual shifts)
3. Centralized autoscaling controller: aggregates metrics from all clusters, distributes workloads intelligently, makes cost-aware cluster selections, and predicts capacity
Pattern 2A: Federated HPA with Cluster API
# Install Cluster API
---
apiVersion: v1
kind: Namespace
metadata:
  name: cluster-api-system

---
# Management cluster setup
apiVersion: cluster.x-k8s.io/v1beta1
kind: Cluster
metadata:
  name: workload-cluster-us-east
  namespace: default
spec:
  clusterNetwork:
    pods:
      cidrBlocks: ["192.168.0.0/16"]
  infrastructureRef:
    apiVersion: infrastructure.cluster.x-k8s.io/v1beta1
    kind: AWSCluster
    name: workload-cluster-us-east
  controlPlaneRef:
    kind: KubeadmControlPlane
    apiVersion: controlplane.cluster.x-k8s.io/v1beta1
    name: workload-cluster-us-east-control-plane

---
# Multi-cluster autoscaling with KubeFed
apiVersion: types.kubefed.io/v1beta1
kind: FederatedHorizontalPodAutoscaler
metadata:
  name: federated-app-hpa
  namespace: default
spec:
  # Target deployment across clusters
  placement:
    clusters:
      - name: us-east-1-cluster
        weight: 40
      - name: eu-west-1-cluster
        weight: 30
      - name: ap-south-1-cluster
        weight: 30

  template:
    spec:
      scaleTargetRef:
        apiVersion: apps/v1
        kind: Deployment
        name: my-app

      minReplicas: 3   # Per-cluster minimum
      maxReplicas: 20  # Per-cluster maximum

      metrics:
        - type: Resource
          resource:
            name: cpu
            target:
              type: Utilization
              averageUtilization: 70

  # Override for specific clusters
  overrides:
    - clusterName: us-east-1-cluster
      clusterOverrides:
        - path: "/spec/minReplicas"
          value: 5  # Higher baseline in primary region
        - path: "/spec/maxReplicas"
          value: 50

---
# Federated deployment
apiVersion: types.kubefed.io/v1beta1
kind: FederatedDeployment
metadata:
  name: my-app
  namespace: default
spec:
  placement:
    clusters:
      - name: us-east-1-cluster
      - name: eu-west-1-cluster
      - name: ap-south-1-cluster

  template:
    spec:
      replicas: 5
      selector:
        matchLabels:
          app: my-app
      template:
        metadata:
          labels:
            app: my-app
        spec:
          containers:
            - name: app
              image: myapp:v1.0
              resources:
                requests:
                  cpu: 500m
                  memory: 512Mi
Pattern 2B: Custom Multi-Cluster Autoscaler
// Custom multi-cluster autoscaling controller
package main

import (
	"context"
	"fmt"
	"sort"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

type ClusterConfig struct {
	Name           string
	KubeConfig     string
	Region         string
	CostPerCPUHour float64
	Latency        time.Duration
}

type MultiClusterAutoscaler struct {
	clusters map[string]*kubernetes.Clientset
	configs  []ClusterConfig
}

func NewMultiClusterAutoscaler(configs []ClusterConfig) (*MultiClusterAutoscaler, error) {
	mca := &MultiClusterAutoscaler{
		clusters: make(map[string]*kubernetes.Clientset),
		configs:  configs,
	}

	// Initialize clients for each cluster
	for _, config := range configs {
		clientConfig, err := clientcmd.BuildConfigFromFlags("", config.KubeConfig)
		if err != nil {
			return nil, err
		}

		clientset, err := kubernetes.NewForConfig(clientConfig)
		if err != nil {
			return nil, err
		}

		mca.clusters[config.Name] = clientset
	}

	return mca, nil
}

// Decision algorithm: cost-aware + latency-aware scaling
func (mca *MultiClusterAutoscaler) ScaleDecision(
	ctx context.Context,
	totalReplicas int,
	userRegion string,
) (map[string]int, error) {

	allocation := make(map[string]int)

	// Step 1: Get current capacity in each cluster
	capacities := make(map[string]int)
	for name, client := range mca.clusters {
		nodes, err := client.CoreV1().Nodes().List(ctx, metav1.ListOptions{})
		if err != nil {
			return nil, err
		}

		// Calculate available capacity
		var availableCPU int64
		for _, node := range nodes.Items {
			availableCPU += node.Status.Allocatable.Cpu().MilliValue()
		}
		capacities[name] = int(availableCPU / 500) // Assume 500m per pod
	}

	// Step 2: Cost-aware allocation
	// Prioritize cheapest region first
	sortedConfigs := sortByCost(mca.configs)

	remaining := totalReplicas
	for _, config := range sortedConfigs {
		available := capacities[config.Name]

		// Allocate up to available capacity
		allocated := min(remaining, available)
		allocation[config.Name] = allocated
		remaining -= allocated

		if remaining == 0 {
			break
		}
	}

	// Step 3: Latency-aware adjustment
	// If user is in specific region, ensure minimum local replicas
	if userRegion != "" {
		minLocal := max(3, totalReplicas/10) // At least 10% or 3 replicas
		if allocation[userRegion] < minLocal {
			allocation[userRegion] = minLocal
		}
	}

	return allocation, nil
}

// Apply scaling decisions to clusters
func (mca *MultiClusterAutoscaler) ApplyScaling(
	ctx context.Context,
	allocation map[string]int,
	deployment string,
	namespace string,
) error {

	for clusterName, replicas := range allocation {
		client := mca.clusters[clusterName]

		// Update deployment replica count via the scale subresource
		scale, err := client.AppsV1().Deployments(namespace).
			GetScale(ctx, deployment, metav1.GetOptions{})
		if err != nil {
			return fmt.Errorf("failed to get scale for %s in %s: %v",
				deployment, clusterName, err)
		}

		scale.Spec.Replicas = int32(replicas)

		_, err = client.AppsV1().Deployments(namespace).
			UpdateScale(ctx, deployment, scale, metav1.UpdateOptions{})
		if err != nil {
			return fmt.Errorf("failed to update scale for %s in %s: %v",
				deployment, clusterName, err)
		}

		fmt.Printf("Scaled %s in %s to %d replicas\n",
			deployment, clusterName, replicas)
	}

	return nil
}

func main() {
	configs := []ClusterConfig{
		{
			Name:           "us-east-1",
			KubeConfig:     "/home/user/.kube/us-east-1",
			Region:         "us-east-1",
			CostPerCPUHour: 0.04,
			Latency:        50 * time.Millisecond,
		},
		{
			Name:           "eu-west-1",
			KubeConfig:     "/home/user/.kube/eu-west-1",
			Region:         "eu-west-1",
			CostPerCPUHour: 0.045,
			Latency:        100 * time.Millisecond,
		},
		{
			Name:           "ap-south-1",
			KubeConfig:     "/home/user/.kube/ap-south-1",
			Region:         "ap-south-1",
			CostPerCPUHour: 0.038, // Cheapest
			Latency:        150 * time.Millisecond,
		},
	}

	autoscaler, err := NewMultiClusterAutoscaler(configs)
	if err != nil {
		panic(err)
	}

	ctx := context.Background()

	// Main reconciliation loop
	ticker := time.NewTicker(30 * time.Second)
	defer ticker.Stop()

	for range ticker.C {
		// Get total desired replicas from global metrics
		totalReplicas := calculateGlobalReplicas()

		// Determine optimal allocation
		allocation, err := autoscaler.ScaleDecision(
			ctx,
			totalReplicas,
			"us-east-1", // Primary user region
		)
		if err != nil {
			fmt.Printf("Error in scale decision: %v\n", err)
			continue
		}

		// Apply scaling
		err = autoscaler.ApplyScaling(
			ctx,
			allocation,
			"my-app",
			"production",
		)
		if err != nil {
			fmt.Printf("Error applying scaling: %v\n", err)
		}
	}
}

func calculateGlobalReplicas() int {
	// Aggregate metrics from all clusters
	// Calculate desired total replicas
	// This would query Prometheus/Thanos for global metrics
	return 50 // Placeholder
}

func sortByCost(configs []ClusterConfig) []ClusterConfig {
	// Sort by cost (cheapest first)
	sorted := make([]ClusterConfig, len(configs))
	copy(sorted, configs)
	sort.Slice(sorted, func(i, j int) bool {
		return sorted[i].CostPerCPUHour < sorted[j].CostPerCPUHour
	})
	return sorted
}

func min(a, b int) int {
	if a < b {
		return a
	}
	return b
}

func max(a, b int) int {
	if a > b {
		return a
	}
	return b
}
Pattern 2C: Global Metrics Aggregation with Thanos
# Thanos setup for multi-cluster metrics
---
# Thanos Sidecar on each cluster's Prometheus
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: prometheus
  namespace: monitoring
spec:
  template:
    spec:
      containers:
        # Prometheus
        - name: prometheus
          image: prom/prometheus:latest
          args:
            - --storage.tsdb.path=/prometheus
            - --storage.tsdb.min-block-duration=2h
            - --storage.tsdb.max-block-duration=2h
          volumeMounts:
            - name: storage
              mountPath: /prometheus

        # Thanos Sidecar
        - name: thanos-sidecar
          image: thanosio/thanos:latest
          args:
            - sidecar
            - --prometheus.url=http://localhost:9090
            - --tsdb.path=/prometheus
            - --objstore.config-file=/etc/thanos/objstore.yaml
            - --grpc-address=0.0.0.0:10901
          volumeMounts:
            - name: storage
              mountPath: /prometheus
            - name: objstore-config
              mountPath: /etc/thanos
          ports:
            - containerPort: 10901
              name: grpc

---
# Thanos Query (global query layer)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: thanos-query
  namespace: monitoring
spec:
  replicas: 2
  template:
    spec:
      containers:
        - name: thanos-query
          image: thanosio/thanos:latest
          args:
            - query
            - --http-address=0.0.0.0:9090
            - --grpc-address=0.0.0.0:10901
            # Connect to all cluster Prometheus instances
            - --store=prometheus-us-east-1.monitoring.svc.cluster.local:10901
            - --store=prometheus-eu-west-1.monitoring.svc.cluster.local:10901
            - --store=prometheus-ap-south-1.monitoring.svc.cluster.local:10901
            - --query.replica-label=replica
          ports:
            - containerPort: 9090
              name: http
            - containerPort: 10901
              name: grpc

---
# Global HPA using Thanos metrics
apiVersion: v1
kind: ConfigMap
metadata:
  name: prometheus-adapter-thanos
  namespace: monitoring
data:
  config.yaml: |
    rules:
      # Global request rate across all clusters
      - seriesQuery: 'http_requests_total{job="my-app"}'
        resources:
          template: "<<.Resource>>"
        name:
          as: "global_requests_per_second"
        metricsQuery: |
          sum(rate(http_requests_total{job="my-app"}[2m]))

      # Global CPU usage
      - seriesQuery: 'container_cpu_usage_seconds_total{pod=~"my-app.*"}'
        resources:
          overrides:
            namespace: {resource: "namespace"}
        name:
          as: "global_cpu_usage"
        metricsQuery: |
          sum(rate(container_cpu_usage_seconds_total{pod=~"my-app.*"}[5m]))
Pattern 3: Aggressive Cost Optimization
Spot Instance Strategy with Multiple Fallbacks
# Karpenter NodePool with spot + on-demand mix
---
apiVersion: karpenter.sh/v1beta1
kind: NodePool
metadata:
  name: cost-optimized-spot
spec:
  template:
    metadata:
      labels:
        workload-type: spot-eligible
        cost-optimized: "true"
    spec:
      requirements:
        # Maximize spot instance types for availability
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["spot"]

        # Allow a wide range of instance types
        - key: karpenter.k8s.aws/instance-category
          operator: In
          values: ["c", "m", "r", "t"]  # Compute, general, memory, burstable

        - key: karpenter.k8s.aws/instance-generation
          operator: Gt
          values: ["4"]  # Generation 5+

        # Size flexibility
        - key: karpenter.k8s.aws/instance-size
          operator: In
          values: ["large", "xlarge", "2xlarge", "4xlarge"]

      nodeClassRef:
        name: cost-optimized

  # Aggressive consolidation
  disruption:
    consolidationPolicy: WhenUnderutilized
    expireAfter: 12h  # Refresh nodes every 12 hours

  # Preferred pool: the higher NodePool weight wins at provisioning time
  weight: 100

  limits:
    cpu: "500"
    memory: 1000Gi

---
# On-demand fallback NodePool
apiVersion: karpenter.sh/v1beta1
kind: NodePool
metadata:
  name: on-demand-fallback
spec:
  template:
    metadata:
      labels:
        workload-type: on-demand-fallback
    spec:
      requirements:
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["on-demand"]

        - key: karpenter.k8s.aws/instance-category
          operator: In
          values: ["m", "c"]

      nodeClassRef:
        name: cost-optimized

  weight: 10  # Lower priority; used when spot capacity is unavailable

  limits:
    cpu: "200"

---
# Application deployment with spot tolerance
apiVersion: apps/v1
kind: Deployment
metadata:
  name: cost-sensitive-app
  namespace: production
spec:
  replicas: 10
  selector:
    matchLabels:
      app: cost-sensitive-app
  template:
    metadata:
      labels:
        app: cost-sensitive-app
    spec:
      # Prefer spot nodes
      affinity:
        nodeAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
            - weight: 100
              preference:
                matchExpressions:
                  - key: karpenter.sh/capacity-type
                    operator: In
                    values: ["spot"]

            # Fallback to on-demand if needed
            - weight: 50
              preference:
                matchExpressions:
                  - key: workload-type
                    operator: In
                    values: ["on-demand-fallback"]

      # Tolerate spot interruptions
      tolerations:
        - key: karpenter.sh/disruption
          operator: Exists
          effect: NoSchedule

      # Topology spread for availability
      topologySpreadConstraints:
        - maxSkew: 1
          topologyKey: topology.kubernetes.io/zone
          whenUnsatisfiable: DoNotSchedule
          labelSelector:
            matchLabels:
              app: cost-sensitive-app

      containers:
        - name: app
          image: myapp:v1.0
          resources:
            requests:
              cpu: 500m
              memory: 512Mi

---
# PDB to handle spot interruptions gracefully
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: cost-sensitive-app-pdb
  namespace: production
spec:
  minAvailable: 70%  # Keep 70% of pods running during spot interruptions
  selector:
    matchLabels:
      app: cost-sensitive-app
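Both NodePools above reference a nodeClassRef named cost-optimized that is not shown. A minimal EC2NodeClass sketch to complete the picture; the AMI family, IAM role name, and discovery tags are illustrative assumptions:

# Sketch: the EC2NodeClass both NodePools reference (values are assumptions)
apiVersion: karpenter.k8s.aws/v1beta1
kind: EC2NodeClass
metadata:
  name: cost-optimized
spec:
  amiFamily: AL2
  role: KarpenterNodeRole-my-cluster  # Hypothetical node IAM role
  subnetSelectorTerms:
    - tags:
        karpenter.sh/discovery: my-cluster  # Hypothetical discovery tag
  securityGroupSelectorTerms:
    - tags:
        karpenter.sh/discovery: my-cluster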
Cost-Aware Scheduling with Custom Scheduler
// Custom scheduler plugin for cost-aware pod placement
package main

import (
	"context"
	"fmt"

	v1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/runtime"
	"k8s.io/kubernetes/pkg/scheduler/framework"
)

type CostAwarePlugin struct {
	handle framework.Handle
}

var _ framework.ScorePlugin = &CostAwarePlugin{}

// Pricing data (could be fetched from an external API)
var instancePricing = map[string]float64{
	"t3.large":      0.0832,
	"m5.large":      0.096,
	"c5.large":      0.085,
	"m5.xlarge":     0.192,
	"c5.xlarge":     0.17,
	"r5.large":      0.126,
	"spot-t3.large": 0.0250, // ~70% savings
	"spot-m5.large": 0.0288,
	"spot-c5.large": 0.0255,
}

func (c *CostAwarePlugin) Name() string {
	return "CostAwarePlugin"
}

// Score nodes based on cost
func (c *CostAwarePlugin) Score(
	ctx context.Context,
	state *framework.CycleState,
	pod *v1.Pod,
	nodeName string,
) (int64, *framework.Status) {

	nodeInfo, err := c.handle.SnapshotSharedLister().NodeInfos().Get(nodeName)
	if err != nil {
		return 0, framework.NewStatus(framework.Error, fmt.Sprintf("getting node %q: %v", nodeName, err))
	}

	node := nodeInfo.Node()

	// Get instance type from node labels
	instanceType := node.Labels["node.kubernetes.io/instance-type"]
	capacityType := node.Labels["karpenter.sh/capacity-type"]

	// Determine pricing key
	pricingKey := instanceType
	if capacityType == "spot" {
		pricingKey = "spot-" + instanceType
	}

	// Get cost
	cost, exists := instancePricing[pricingKey]
	if !exists {
		cost = 0.1 // Default cost if unknown
	}

	// Convert to score (lower cost = higher score)
	// Normalize: max price 0.2, min price 0.02
	// Score range: 0-100
	normalizedCost := (cost - 0.02) / (0.2 - 0.02)
	score := int64((1 - normalizedCost) * 100)

	// Bonus for spot instances, clamped to the framework's maximum score
	if capacityType == "spot" {
		score += 20
	}
	if score > framework.MaxNodeScore {
		score = framework.MaxNodeScore
	}

	return score, framework.NewStatus(framework.Success)
}

// ScoreExtensions of the Score plugin
func (c *CostAwarePlugin) ScoreExtensions() framework.ScoreExtensions {
	return c
}

// NormalizeScore is called after scoring all nodes
func (c *CostAwarePlugin) NormalizeScore(
	ctx context.Context,
	state *framework.CycleState,
	pod *v1.Pod,
	scores framework.NodeScoreList,
) *framework.Status {
	// Scores are already normalized (and clamped) in Score()
	return framework.NewStatus(framework.Success)
}

func New(_ runtime.Object, h framework.Handle) (framework.Plugin, error) {
	return &CostAwarePlugin{handle: h}, nil
}
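The plugin only takes effect once it is compiled into a scheduler binary (e.g. registered with app.WithPlugin in a custom kube-scheduler build) and enabled through a scheduler configuration. A sketch of that configuration; the scheduler profile name and weight are assumptions:

# Sketch: enabling CostAwarePlugin in a custom scheduler build
apiVersion: kubescheduler.config.k8s.io/v1
kind: KubeSchedulerConfiguration
profiles:
  - schedulerName: cost-aware-scheduler  # Pods opt in via spec.schedulerName
    plugins:
      score:
        enabled:
          - name: CostAwarePlugin
            weight: 5  # Relative influence vs. the default score plugins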
FinOps Dashboard and Automation
# CronJob for daily cost optimization report
---
apiVersion: batch/v1
kind: CronJob
metadata:
  name: cost-optimization-report
  namespace: finops
spec:
  schedule: "0 9 * * *"  # Daily at 9 AM
  jobTemplate:
    spec:
      template:
        spec:
          serviceAccountName: finops-reporter
          containers:
            - name: reporter
              image: finops-reporter:latest
              env:
                - name: PROMETHEUS_URL
                  value: "http://prometheus.monitoring:9090"
                - name: SLACK_WEBHOOK
                  valueFrom:
                    secretKeyRef:
                      name: slack-webhook
                      key: url
              command:
                - /bin/bash
                - -c
                - |
                  echo "=== Daily Cost Optimization Report ==="
                  echo ""

                  # Calculate total cluster cost (queries are URL-encoded via -G)
                  TOTAL_CPU=$(curl -s -G "$PROMETHEUS_URL/api/v1/query" \
                    --data-urlencode "query=sum(kube_pod_container_resource_requests{resource='cpu'})" \
                    | jq -r '.data.result[0].value[1]')
                  TOTAL_MEM=$(curl -s -G "$PROMETHEUS_URL/api/v1/query" \
                    --data-urlencode "query=sum(kube_pod_container_resource_requests{resource='memory'})" \
                    | jq -r '.data.result[0].value[1]')

                  # bc -l keeps fractional precision (plain bc truncates division)
                  CPU_COST=$(echo "$TOTAL_CPU * 0.04 * 24" | bc -l)
                  MEM_COST=$(echo "$TOTAL_MEM / 1073741824 * 0.005 * 24" | bc -l)
                  DAILY_COST=$(echo "$CPU_COST + $MEM_COST" | bc -l)

                  echo "Daily Cost: \$${DAILY_COST}"
                  echo ""

                  # Identify optimization opportunities
                  echo "=== Optimization Opportunities ==="

                  # Over-provisioned workloads (VPA recommendations)
                  curl -s -G "$PROMETHEUS_URL/api/v1/query" \
                    --data-urlencode "query=(kube_pod_container_resource_requests{resource='cpu'} - on(pod) kube_verticalpodautoscaler_status_recommendation_containerrecommendations_target{resource='cpu'}) / kube_pod_container_resource_requests{resource='cpu'} > 0.5" \
                    | jq -r '.data.result[] | "\(.metric.namespace)/\(.metric.pod): \(.value[1] * 100)% over-provisioned"'

                  # Spot instance opportunities
                  ONDEMAND_COUNT=$(curl -s -G "$PROMETHEUS_URL/api/v1/query" \
                    --data-urlencode "query=count(kube_node_labels{label_karpenter_sh_capacity_type='on-demand'})" \
                    | jq -r '.data.result[0].value[1]')
                  echo ""
                  echo "On-demand nodes: $ONDEMAND_COUNT (Consider spot instances for ~70% savings)"

                  # Send to Slack
                  curl -X POST "$SLACK_WEBHOOK" \
                    -H 'Content-Type: application/json' \
                    -d "{\"text\": \"Daily Cost Report: \\\$${DAILY_COST}\"}"

          restartPolicy: OnFailure

---
# PrometheusRule for cost alerts
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: cost-alerts
  namespace: monitoring
spec:
  groups:
    - name: cost-optimization
      interval: 1h
      rules:
        # Alert when daily cost exceeds budget
        - alert: DailyCostExceedsBudget
          expr: |
            (
              sum(kube_pod_container_resource_requests{resource="cpu"}) * 0.04 +
              sum(kube_pod_container_resource_requests{resource="memory"}) / 1073741824 * 0.005
            ) * 24 > 1000
          labels:
            severity: warning
            team: finops
          annotations:
            summary: "Daily infrastructure cost exceeds $1000"
            description: "Current daily cost: ${{ $value }}"

        # Alert on underutilized nodes
        - alert: UnderutilizedNodes
          expr: |
            (
              sum(kube_node_status_allocatable{resource="cpu"}) -
              sum(kube_pod_container_resource_requests{resource="cpu"})
            ) / sum(kube_node_status_allocatable{resource="cpu"}) > 0.5
          for: 2h
          labels:
            severity: info
            team: platform
          annotations:
            summary: "Cluster has >50% unused CPU capacity"
            description: "Consider scaling down or consolidating workloads"

        # Spot savings opportunity
        - alert: SpotSavingsOpportunity
          expr: |
            count(kube_node_labels{label_karpenter_sh_capacity_type="on-demand"})
            /
            count(kube_node_labels)
            > 0.3
          for: 4h
          labels:
            severity: info
            team: finops
          annotations:
            summary: ">30% on-demand nodes detected"
            description: "Evaluate workloads for spot instance eligibility (70% potential savings)"
Pattern 4: Batch Job & Queue-Based Autoscaling
Pattern 4A: Kubernetes Job Autoscaling with KEDA
# KEDA ScaledJob for queue-driven batch processing
---
apiVersion: v1
kind: Secret
metadata:
  name: aws-sqs-credentials
  namespace: batch-processing
type: Opaque
stringData:
  AWS_ACCESS_KEY_ID: "AKIAIOSFODNN7EXAMPLE"
  AWS_SECRET_ACCESS_KEY: "wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY"

---
apiVersion: keda.sh/v1alpha1
kind: TriggerAuthentication
metadata:
  name: aws-sqs-auth
  namespace: batch-processing
spec:
  secretTargetRef:
    - parameter: awsAccessKeyID
      name: aws-sqs-credentials
      key: AWS_ACCESS_KEY_ID
    - parameter: awsSecretAccessKey
      name: aws-sqs-credentials
      key: AWS_SECRET_ACCESS_KEY

---
# ScaledJob (not Deployment) for batch processing
apiVersion: keda.sh/v1alpha1
kind: ScaledJob
metadata:
  name: image-processing-job
  namespace: batch-processing
spec:
  # Job template
  jobTargetRef:
    template:
      spec:
        containers:
          - name: processor
            image: image-processor:v1.0
            env:
              - name: SQS_QUEUE_URL
                value: "https://sqs.us-west-2.amazonaws.com/123456789/image-queue"
              - name: AWS_REGION
                value: "us-west-2"
            resources:
              requests:
                cpu: 2
                memory: 4Gi
              limits:
                cpu: 4
                memory: 8Gi
        restartPolicy: OnFailure

  # Polling interval
  pollingInterval: 10  # Check queue every 10 seconds

  # Max replicas
  maxReplicaCount: 100

  # Job history retention
  successfulJobsHistoryLimit: 5
  failedJobsHistoryLimit: 5

  # Scaling strategy
  scalingStrategy:
    strategy: "accurate"  # Create jobs based on queue length
    # "default"  = one job per queued event
    # "custom"   = custom logic
    # "accurate" = jobs = queue length / messages per job

  triggers:
    - type: aws-sqs-queue
      authenticationRef:
        name: aws-sqs-auth
      metadata:
        queueURL: "https://sqs.us-west-2.amazonaws.com/123456789/image-queue"
        queueLength: "5"  # Process 5 messages per job
        awsRegion: "us-west-2"
        identityOwner: "operator"

---
# Alternative: Kafka-based job scaling
apiVersion: keda.sh/v1alpha1
kind: ScaledJob
metadata:
  name: kafka-consumer-job
  namespace: batch-processing
spec:
  jobTargetRef:
    template:
      spec:
        containers:
          - name: consumer
            image: kafka-consumer:v1.0
            env:
              - name: KAFKA_BROKERS
                value: "kafka:9092"
              - name: KAFKA_TOPIC
                value: "events"
              - name: KAFKA_CONSUMER_GROUP
                value: "batch-processors"
        restartPolicy: OnFailure

  pollingInterval: 15
  maxReplicaCount: 50

  triggers:
    - type: kafka
      metadata:
        bootstrapServers: "kafka:9092"
        consumerGroup: "batch-processors"
        topic: "events"
        lagThreshold: "100"  # Create jobs when lag > 100 messages
        offsetResetPolicy: "latest"
Pattern 4B: ML Training Job Autoscaling with Volcano
# Install Volcano scheduler
---
apiVersion: v1
kind: Namespace
metadata:
  name: volcano-system

---
# Volcano scheduler deployment
# (Use the official Volcano installation)

---
# ML training job with gang scheduling
apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata:
  name: distributed-training
  namespace: ml-training
spec:
  # Minimum pods required to start the job
  minAvailable: 4  # 1 master + 3 workers minimum

  schedulerName: volcano

  # Queue for resource management
  queue: ml-training-queue

  # Plugins
  plugins:
    ssh: []  # Enable SSH between pods
    svc: []  # Create service for pod communication
    env: []  # Environment variable injection

  # Policies
  policies:
    - event: PodEvicted
      action: RestartJob
    - event: PodFailed
      action: RestartJob

  # Task groups
  tasks:
    # Master task
    - name: master
      replicas: 1
      template:
        spec:
          containers:
            - name: tensorflow
              image: tensorflow/tensorflow:latest-gpu
              command:
                - python
                - train.py
                - --role=master
              resources:
                requests:
                  cpu: 4
                  memory: 16Gi
                  nvidia.com/gpu: 1
                limits:
                  cpu: 8
                  memory: 32Gi
                  nvidia.com/gpu: 1

    # Worker tasks (auto-scalable)
    - name: worker
      replicas: 3
      minAvailable: 1  # At least 1 worker
      template:
        spec:
          containers:
            - name: tensorflow
              image: tensorflow/tensorflow:latest-gpu
              command:
                - python
                - train.py
                - --role=worker
              resources:
                requests:
                  cpu: 8
                  memory: 32Gi
                  nvidia.com/gpu: 2
                limits:
                  cpu: 16
                  memory: 64Gi
                  nvidia.com/gpu: 2

    # Parameter server tasks
    - name: ps
      replicas: 2
      template:
        spec:
          containers:
            - name: tensorflow
              image: tensorflow/tensorflow:latest
              command:
                - python
                - train.py
                - --role=ps
              resources:
                requests:
                  cpu: 2
                  memory: 8Gi
                limits:
                  cpu: 4
                  memory: 16Gi

---
# Queue with capacity limits
apiVersion: scheduling.volcano.sh/v1beta1
kind: Queue
metadata:
  name: ml-training-queue
spec:
  weight: 1
  capability:
    cpu: "100"
    memory: "500Gi"
    nvidia.com/gpu: "20"

---
# HPA for worker pods (scale workers based on GPU utilization)
# Note: the scale target must expose the scale subresource for this to work
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: training-workers-hpa
  namespace: ml-training
spec:
  scaleTargetRef:
    apiVersion: batch.volcano.sh/v1alpha1
    kind: Job
    name: distributed-training

  minReplicas: 3
  maxReplicas: 20

  metrics:
    # GPU utilization
    - type: Pods
      pods:
        metric:
          name: DCGM_FI_DEV_GPU_UTIL
        target:
          type: AverageValue
          averageValue: "80"  # Target 80% GPU utilization

    # Training throughput
    - type: Pods
      pods:
        metric:
          name: training_samples_per_second
        target:
          type: AverageValue
          averageValue: "1000"
Pattern 4C: Scheduled Autoscaling (Predictive)
# CronHPA for scheduled scaling
---
apiVersion: autoscaling.alibabacloud.com/v1beta1
kind: CronHorizontalPodAutoscaler
metadata:
  name: business-hours-scaling
  namespace: production
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api-server

  # Business-hours scaling schedule
  # (the CronHPA controller uses 6-field cron expressions, seconds first)
  jobs:
    # Scale up for morning traffic (8 AM)
    - name: morning-scale-up
      schedule: "0 0 8 * * 1-5"  # Weekdays at 8 AM
      targetSize: 20

    # Scale up for lunch traffic (12 PM)
    - name: lunch-scale-up
      schedule: "0 0 12 * * 1-5"
      targetSize: 30

    # Scale down for evening (6 PM)
    - name: evening-scale-down
      schedule: "0 0 18 * * 1-5"
      targetSize: 15

    # Scale down for night (10 PM)
    - name: night-scale-down
      schedule: "0 0 22 * * *"
      targetSize: 5

    # Weekend minimal scaling
    - name: weekend-minimal
      schedule: "0 0 0 * * 0,6"  # Midnight on Sat/Sun
      targetSize: 3

---
# Alternative: native CronJob + kubectl scale (standard 5-field cron)
apiVersion: batch/v1
kind: CronJob
metadata:
  name: morning-scale-up
  namespace: production
spec:
  schedule: "0 8 * * 1-5"
  jobTemplate:
    spec:
      template:
        spec:
          serviceAccountName: autoscaler
          containers:
            - name: kubectl
              image: bitnami/kubectl:latest
              command:
                - kubectl
                - scale
                - deployment/api-server
                - --replicas=20
                - -n
                - production
          restartPolicy: OnFailure
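For the CronJob variant to work, the autoscaler ServiceAccount needs permission to patch the deployment's scale subresource. A minimal RBAC sketch matching the names above:

# RBAC so the autoscaler ServiceAccount can run `kubectl scale`
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: deployment-scaler
  namespace: production
rules:
  - apiGroups: ["apps"]
    resources: ["deployments", "deployments/scale"]
    verbs: ["get", "patch", "update"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: autoscaler-deployment-scaler
  namespace: production
subjects:
  - kind: ServiceAccount
    name: autoscaler
    namespace: production
roleRef:
  kind: Role
  name: deployment-scaler
  apiGroup: rbac.authorization.k8s.io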
Pattern 5: Emerging Technologies & Future Patterns
Pattern 5A: Predictive Autoscaling with Machine Learning
# ML-based predictive autoscaling model
import datetime
import time

import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from kubernetes import client, config

class PredictiveAutoscaler:
    def __init__(self):
        config.load_kube_config()
        self.apps_v1 = client.AppsV1Api()
        self.model = RandomForestRegressor(n_estimators=100)
        self.is_trained = False

    def collect_training_data(self, days=30):
        """Collect historical data for training"""
        # Query Prometheus for historical metrics
        # Features: hour, day_of_week, month, previous_load, etc.
        # Target: actual_replicas_needed

        data = {
            'hour': [],
            'day_of_week': [],
            'month': [],
            'previous_load': [],
            'previous_replicas': [],
            'actual_replicas': []
        }

        # Fetch from Prometheus
        # ... (implementation details)

        return pd.DataFrame(data)

    def get_current_load(self, deployment, namespace):
        """Fetch the deployment's current load (e.g. RPS) from Prometheus"""
        # ... (query Prometheus; see the sketch after this listing)
        return 0.0  # Placeholder

    def train(self):
        """Train the prediction model"""
        df = self.collect_training_data()

        X = df[['hour', 'day_of_week', 'month', 'previous_load', 'previous_replicas']]
        y = df['actual_replicas']

        self.model.fit(X, y)
        self.is_trained = True

        print(f"Model trained with {len(df)} samples")
        print(f"Feature importances: {self.model.feature_importances_}")

    def predict_replicas(self, deployment, namespace):
        """Predict required replicas for the next hour"""
        if not self.is_trained:
            raise Exception("Model not trained")

        now = datetime.datetime.now()

        # Current state
        deployment_obj = self.apps_v1.read_namespaced_deployment(
            deployment, namespace
        )
        current_replicas = deployment_obj.spec.replicas

        # Get current load from Prometheus
        current_load = self.get_current_load(deployment, namespace)

        # Prepare features
        features = np.array([[
            now.hour,
            now.weekday(),
            now.month,
            current_load,
            current_replicas
        ]])

        # Predict
        predicted_replicas = int(self.model.predict(features)[0])

        # Apply safety bounds
        min_replicas = 2
        max_replicas = 100
        predicted_replicas = max(min_replicas, min(predicted_replicas, max_replicas))

        return predicted_replicas

    def apply_scaling(self, deployment, namespace, replicas):
        """Apply predicted scaling"""
        body = {
            'spec': {
                'replicas': replicas
            }
        }

        self.apps_v1.patch_namespaced_deployment_scale(
            deployment,
            namespace,
            body
        )

        print(f"Scaled {deployment} to {replicas} replicas")

    def run(self, deployment, namespace, interval=300):
        """Main loop"""
        while True:
            try:
                predicted = self.predict_replicas(deployment, namespace)
                self.apply_scaling(deployment, namespace, predicted)

                print(f"[{datetime.datetime.now()}] Scaled to {predicted} replicas")

            except Exception as e:
                print(f"Error: {e}")

            time.sleep(interval)  # Every 5 minutes by default

# Usage
if __name__ == "__main__":
    autoscaler = PredictiveAutoscaler()
    autoscaler.train()
    autoscaler.run("api-server", "production")
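The collector and load methods above leave the Prometheus calls as stubs. A minimal sketch of what such a query could look like against Prometheus' HTTP API; the endpoint URL and the example metric name are assumptions:

import datetime
import requests

def query_prometheus_range(query, start, end, step="1h",
                           url="http://prometheus.monitoring:9090"):
    """Fetch a metric range from the Prometheus HTTP API (v1)."""
    resp = requests.get(f"{url}/api/v1/query_range", params={
        "query": query,
        "start": start.timestamp(),
        "end": end.timestamp(),
        "step": step,
    })
    resp.raise_for_status()
    return resp.json()["data"]["result"]

# Example: hourly request rate for the last 30 days (metric name is illustrative)
end = datetime.datetime.now()
start = end - datetime.timedelta(days=30)
rows = query_prometheus_range(
    'sum(rate(http_requests_total{deployment="api-server"}[5m]))', start, end)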
Pattern 5B: Serverless Kubernetes with Knative
# Knative Service with autoscaling
---
apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  name: knative-app
  namespace: serverless
spec:
  template:
    metadata:
      annotations:
        # Autoscaling configuration
        autoscaling.knative.dev/class: "kpa.autoscaling.knative.dev"
        autoscaling.knative.dev/metric: "concurrency"
        autoscaling.knative.dev/target: "10"        # Target 10 concurrent requests
        autoscaling.knative.dev/min-scale: "0"      # Scale to zero
        autoscaling.knative.dev/max-scale: "100"
        autoscaling.knative.dev/scale-down-delay: "30s"
        autoscaling.knative.dev/window: "60s"       # Evaluation window

    spec:
      containers:
        - image: myapp:v1.0
          ports:
            - containerPort: 8080
          resources:
            requests:
              cpu: 100m
              memory: 128Mi
            limits:
              cpu: 1000m
              memory: 512Mi

---
# Advanced: RPS-based autoscaling
apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  name: rps-based-app
  namespace: serverless
spec:
  template:
    metadata:
      annotations:
        autoscaling.knative.dev/metric: "rps"  # Requests per second
        autoscaling.knative.dev/target: "100"  # Target 100 RPS per pod
        autoscaling.knative.dev/target-utilization-percentage: "70"
    spec:
      containers:
        - image: myapp:v1.0
          resources:
            requests:
              cpu: 200m
              memory: 256Mi
Pattern 5C: Service Mesh Integration (Istio)
# Istio VirtualService with traffic-based autoscaling
---
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: my-app
  namespace: production
spec:
  hosts:
    - my-app.example.com
  http:
    # Force canary for requests carrying the x-version: canary header
    - match:
        - headers:
            x-version:
              exact: canary
      route:
        - destination:
            host: my-app
            subset: canary
    # Default: weighted 90/10 split between stable and canary
    - route:
        - destination:
            host: my-app
            subset: stable
          weight: 90
        - destination:
            host: my-app
            subset: canary
          weight: 10  # 10% of traffic to canary

---
# DestinationRule
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: my-app
  namespace: production
spec:
  host: my-app
  subsets:
    - name: stable
      labels:
        version: stable
    - name: canary
      labels:
        version: canary

---
# HPA using Istio metrics
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: my-app-istio-hpa
  namespace: production
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-app

  minReplicas: 2
  maxReplicas: 50

  metrics:
    # Istio request rate
    - type: Pods
      pods:
        metric:
          name: istio_requests_per_second
        target:
          type: AverageValue
          averageValue: "100"

    # Istio P99 latency
    - type: Pods
      pods:
        metric:
          name: istio_request_duration_p99
        target:
          type: AverageValue
          averageValue: "200"  # 200ms (the metric is recorded in milliseconds)

---
# Prometheus rules for Istio metrics
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: istio-custom-metrics
  namespace: monitoring
spec:
  groups:
    - name: istio-autoscaling
      interval: 15s
      rules:
        - record: istio_requests_per_second
          expr: |
            sum(rate(istio_requests_total{destination_workload="my-app"}[2m])) by (pod)

        - record: istio_request_duration_p99
          expr: |
            histogram_quantile(0.99,
              sum(rate(istio_request_duration_milliseconds_bucket{destination_workload="my-app"}[2m])) by (pod, le)
            )
Best Practices Summary
Stateful Applications
- ✅ Use conservative scaling policies (slower scale-up/down)
- ✅ Implement proper health checks and readiness probes (see the probe sketch below)
- ✅ Plan for data synchronization time
- ✅ Use PVCs with appropriate storage classes
- ✅ Consider split architectures (read/write separation)
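For example, the Pattern 1A replicas could gate readiness and liveness on pg_isready; the probe timings here are assumptions:

# Sketch: probes for the Pattern 1A postgres replica container
readinessProbe:
  exec:
    command: ["pg_isready", "-U", "postgres", "-h", "127.0.0.1"]
  initialDelaySeconds: 10
  periodSeconds: 5
livenessProbe:
  exec:
    command: ["pg_isready", "-U", "postgres", "-h", "127.0.0.1"]
  initialDelaySeconds: 30
  periodSeconds: 10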
Multi-Cluster
- ✅ Centralize metrics with Thanos or Prometheus federation (see the federation sketch below)
- ✅ Implement intelligent routing with global load balancers
- ✅ Use cost-aware scheduling
- ✅ Plan for cross-cluster failover
- ✅ Monitor inter-cluster latency
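As a lighter-weight alternative to the Thanos stack in Pattern 2C, classic Prometheus federation can centralize metrics: a central Prometheus scrapes each cluster's /federate endpoint. A sketch; the target hostnames and match selectors are placeholders:

# Central Prometheus scrape config (prometheus.yml fragment)
scrape_configs:
  - job_name: "federate"
    honor_labels: true
    metrics_path: "/federate"
    params:
      "match[]":
        - '{job="my-app"}'
        - '{__name__=~"kube_.*"}'
    static_configs:
      - targets:
          - prometheus-us-east-1.example.com:9090
          - prometheus-eu-west-1.example.com:9090
          - prometheus-ap-south-1.example.com:9090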
Cost Optimization
- ✅ Maximize spot instance usage (70-90% savings)
- ✅ Implement aggressive consolidation
- ✅ Use FinOps dashboards for visibility
- ✅ Set up cost alerts and budgets
- ✅ Schedule regular right-sizing reviews
Batch Jobs
- ✅ Use KEDA ScaledJobs for queue-driven processing
- ✅ Implement proper job cleanup policies (e.g. TTLs, as sketched below)
- ✅ Set resource limits to prevent runaway costs
- ✅ Use gang scheduling for distributed jobs
- ✅ Monitor job success rates
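For the cleanup point, plain Jobs can be garbage-collected by the built-in TTL controller (KEDA ScaledJobs additionally have the history limits shown in Pattern 4A). A sketch; the TTL value is an assumption:

# Sketch: automatic cleanup of finished Jobs via ttlSecondsAfterFinished
apiVersion: batch/v1
kind: Job
metadata:
  name: one-off-processor
  namespace: batch-processing
spec:
  ttlSecondsAfterFinished: 3600  # Delete the Job and its pods 1h after it finishes
  template:
    spec:
      containers:
        - name: processor
          image: image-processor:v1.0  # Image reused from Pattern 4A
      restartPolicy: OnFailure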
Key Takeaways
- Stateful Scaling: Requires careful planning, slower policies, and split read/write architectures
- Multi-Cluster: Centralized metrics and intelligent distribution critical for success
- Cost Optimization: Spot instances + right-sizing + consolidation = 60-80% savings
- Batch Processing: Queue-based autoscaling with KEDA scales jobs efficiently
- Future: ML-based prediction, serverless Kubernetes, and service mesh integration are emerging
Related Topics
Autoscaling Series
- Part 1: HPA Fundamentals
- Part 2: Cluster Autoscaling
- Part 3: Hands-On Demo
- Part 4: Monitoring & Alerting
- Part 5: VPA & Resource Optimization
Conclusion
Advanced autoscaling patterns unlock significant value:
- Stateful applications can scale safely with proper planning
- Multi-cluster deployments enable global scale and resilience
- Cost optimization delivers 60-80% infrastructure savings
- Batch processing scales efficiently with queue-based triggers
- Emerging technologies push boundaries of what’s possible
These patterns, combined with foundational HPA and VPA, create comprehensive autoscaling architectures that balance performance, cost, and reliability at scale.
Next up: Part 7 - Production Troubleshooting & War Stories 🔧
Happy scaling! 🚀