Series Overview
This is Part 7 of the Kubernetes Autoscaling Complete Guide series:
- Part 1: Horizontal Pod Autoscaler - Application-level autoscaling theory
- Part 2: Cluster Autoscaling & Cloud Providers - Infrastructure-level autoscaling
- Part 3: Hands-On HPA Demo - Practical implementation
- Part 4: Monitoring, Alerting & Threshold Tuning - Production observability
- Part 5: VPA & Resource Optimization - Right-sizing strategies
- Part 6: Advanced Autoscaling Patterns - Complex architectures
- Part 7 (This Post): Production Troubleshooting & War Stories - Real-world incidents
Theory is important, but production teaches the hardest lessons. This guide documents real-world autoscaling failures, debugging methodologies, and hard-won insights from managing Kubernetes autoscaling at scale. These are the stories rarely told in documentation—the 2 AM incidents, cascading failures, and subtle bugs that cost millions.
War Story #1: The Black Friday Meltdown
The Incident
Date: November 25, 2022
Duration: 2 hours 37 minutes
Impact: $3.2M revenue loss, 89% service degradation
Root Cause: HPA thrashing during traffic spike
Timeline
08:45 UTC - Black Friday sale begins
08:46 UTC - Traffic increases from 10k to 150k req/min
08:47 UTC - HPA scales from 50 to 100 pods
08:48 UTC - New pods start, but not ready (app startup: 2 min)
08:49 UTC - Existing pods overloaded, CPU hits 95%
08:50 UTC - HPA sees high CPU, scales to 200 pods
08:51 UTC - Kubernetes scheduler cannot place pods (insufficient nodes)
08:52 UTC - Cluster Autoscaler adds nodes (provision time: 3 min)
08:53 UTC - API server overwhelmed by HPA queries (1000+ req/s)
08:54 UTC - HPA controller starts timing out
08:55 UTC - Pods begin OOMKilling due to memory pressure
08:56 UTC - Service enters cascading failure mode
08:57 UTC - Manual intervention begins
09:15 UTC - Emergency scale-up of node pool
09:30 UTC - Services stabilize
11:22 UTC - Full recovery
What Went Wrong
FAILURE CASCADE

Traffic Spike
↓
Slow App Startup (2 min)
↓
Existing Pods Overloaded
↓
HPA Aggressive Scale-Up
↓
No Node Capacity
↓
Cluster Autoscaler Delay (3 min)
↓
Pods Pending
↓
API Server Overload (HPA queries)
↓
HPA Timeouts
↓
OOMKills Begin
↓
TOTAL SYSTEM FAILURE
Configuration Issues
Original HPA Configuration:
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: api-server-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api-server

  minReplicas: 50
  maxReplicas: 500                    # ❌ Too aggressive

  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70        # ❌ Too sensitive

  behavior:                           # ❌ No behavior control
    scaleUp:
      stabilizationWindowSeconds: 0   # ❌ Immediate
      policies:
      - type: Percent
        value: 100                    # ❌ Doubles every 15s
        periodSeconds: 15
Deployment Issues:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api-server
spec:
  template:
    spec:
      containers:
      - name: api
        image: api-server:v1.0

        # ❌ No readiness probe - pods receive traffic before ready
        # ❌ Slow startup time not accounted for

        resources:
          requests:
            cpu: 500m
            memory: 512Mi
          limits:
            cpu: 2          # ❌ 4x request - causes throttling
            memory: 1Gi     # ❌ 2x request - causes OOM
Cluster Autoscaler Configuration:
# ❌ No node over-provisioning
# ❌ Single node pool type (no hot spare capacity)
# ❌ 3-minute node startup time not accounted for
The Fix
1. Improved HPA Configuration:
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: api-server-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api-server

  minReplicas: 100   # ✅ Higher baseline for Black Friday
  maxReplicas: 500

  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 60   # ✅ More headroom

  # ✅ Custom metric: actual request rate (more predictive)
  - type: Pods
    pods:
      metric:
        name: http_requests_per_second
      target:
        type: AverageValue
        averageValue: "100"

  behavior:
    scaleUp:
      stabilizationWindowSeconds: 30   # ✅ 30s buffer
      policies:
      - type: Pods
        value: 20                      # ✅ Max 20 pods per 30s
        periodSeconds: 30
      - type: Percent
        value: 50                      # ✅ Max 50% increase
        periodSeconds: 30
      selectPolicy: Min                # ✅ Conservative

    scaleDown:
      stabilizationWindowSeconds: 300  # ✅ 5 min cooldown
      policies:
      - type: Pods
        value: 5
        periodSeconds: 60
2. Application Optimization:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api-server
spec:
  template:
    spec:
      containers:
      - name: api
        image: api-server:v2.0

        # ✅ Readiness probe
        readinessProbe:
          httpGet:
            path: /health/ready
            port: 8080
          initialDelaySeconds: 30
          periodSeconds: 5
          failureThreshold: 3

        # ✅ Startup probe for slow startup
        startupProbe:
          httpGet:
            path: /health/startup
            port: 8080
          initialDelaySeconds: 0
          periodSeconds: 10
          failureThreshold: 18   # Allow 3 minutes

        # ✅ Liveness probe
        livenessProbe:
          httpGet:
            path: /health/live
            port: 8080
          periodSeconds: 10
          failureThreshold: 3

        resources:
          requests:
            cpu: 500m
            memory: 512Mi
          limits:
            cpu: 1000m    # ✅ 2x request (reasonable burst)
            memory: 1Gi   # ✅ 2x request

        # ✅ Graceful shutdown
        lifecycle:
          preStop:
            exec:
              command: ["/bin/sh", "-c", "sleep 15"]
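To confirm the probes actually gate traffic, watch readiness transitions during a rollout. A minimal check, assuming the pods carry an app=api-server label and run in the production namespace:

# Trigger a rollout and watch pods move from Running to Ready
kubectl rollout restart deployment/api-server -n production
kubectl get pods -n production -l app=api-server -w

# Spot-check one pod: container start time vs. when the Ready condition flipped
kubectl get pod <pod-name> -n production \
  -o jsonpath='{.status.startTime}{"  "}{.status.conditions[?(@.type=="Ready")].lastTransitionTime}{"\n"}'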
3. Cluster Pre-warming:
# ✅ Node over-provisioning deployment
apiVersion: apps/v1
kind: Deployment
metadata:
  name: cluster-overprovisioner
  namespace: kube-system
spec:
  replicas: 10   # Reserve capacity for 10 pods
  template:
    spec:
      priorityClassName: overprovisioning   # Low priority
      containers:
      - name: pause
        image: k8s.gcr.io/pause
        resources:
          requests:
            cpu: 500m
            memory: 512Mi

---
# ✅ Priority class for overprovisioning
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: overprovisioning
value: -1   # Negative priority - first to evict
globalDefault: false
description: "Pods that reserve cluster capacity"

---
# ✅ Priority class for production workloads
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: production-high
value: 1000
globalDefault: false
description: "High priority production workloads"
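A quick sanity check that the reservation behaves as intended: the pause pods should sit in Running, and they should be the first thing displaced when a real scale-up needs room. A rough verification sketch:

# Placeholder pods should be Running and holding their requested capacity
kubectl get pods -n kube-system | grep cluster-overprovisioner

# During a burst, watch them get evicted and rescheduled (often into Pending,
# which is exactly what nudges the Cluster Autoscaler to add a node)
kubectl get pods -n kube-system -w | grep cluster-overprovisioner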
4. Scheduled Pre-scaling:
# ✅ CronJob to pre-scale before Black Friday
apiVersion: batch/v1
kind: CronJob
metadata:
  name: blackfriday-prescale
  namespace: production
spec:
  # 30 minutes before sale
  schedule: "15 8 25 11 *"   # Nov 25, 08:15 UTC
  jobTemplate:
    spec:
      template:
        spec:
          serviceAccountName: autoscaler
          containers:
          - name: prescale
            image: bitnami/kubectl:latest
            command:
            - /bin/bash
            - -c
            - |
              echo "Pre-scaling for Black Friday"

              # Scale up deployment
              kubectl scale deployment api-server --replicas=150 -n production

              # Update HPA minReplicas
              kubectl patch hpa api-server-hpa -n production -p '{"spec":{"minReplicas":150}}'

              # Add extra nodes
              aws autoscaling set-desired-capacity \
                --auto-scaling-group-name eks-node-group \
                --desired-capacity 50

              echo "Pre-scaling complete"
          restartPolicy: OnFailure
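The pre-scale needs a matching wind-down once the event ends, otherwise the inflated minReplicas and node count quietly burn money for weeks. A sketch of the reverse step, with illustrative post-event values:

# Restore the everyday HPA floor (150 was the event floor; 100 is the baseline used above)
kubectl patch hpa api-server-hpa -n production -p '{"spec":{"minReplicas":100}}'

# Bring the node group back to its normal size (illustrative value)
aws autoscaling set-desired-capacity \
  --auto-scaling-group-name eks-node-group \
  --desired-capacity 20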
Lessons Learned
- Slow startup kills autoscaling - 2-minute app startup + 3-minute node provisioning = 5 minutes total lag
- Traffic spikes need pre-warming - Reactive scaling is too slow for flash events
- HPA + CA delays compound - Each layer adds latency; total delay can be fatal (see the arithmetic after this list)
- API server is a bottleneck - HPA can overwhelm API server with queries
- Readiness probes are critical - Without them, traffic hits non-ready pods
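To make the compounding concrete, here is the arithmetic behind the worst case in this incident:

# Back-of-envelope worst case for the reactive path in this incident
NODE_PROVISION=180   # Cluster Autoscaler node startup (3 min)
POD_STARTUP=120      # application boot until Ready (2 min)
echo "$((NODE_PROVISION + POD_STARTUP))s"   # 300s: five minutes before new capacity takes traffic
# Metric scrape and HPA evaluation lag (roughly 30-60s) sit on top of this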
Preventive Measures
# ✅ Comprehensive monitoring
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: autoscaling-sla-alerts
  namespace: monitoring
spec:
  groups:
  - name: autoscaling-sla
    rules:
    # Alert when scaling is too slow
    - alert: SlowAutoscaling
      expr: |
        (
          kube_horizontalpodautoscaler_status_desired_replicas
          - kube_horizontalpodautoscaler_status_current_replicas
        ) > 5
      for: 2m
      labels:
        severity: warning
      annotations:
        summary: "HPA scaling lag detected"
        description: "Desired replicas not reached for 2 minutes"

    # Alert on pod startup time
    - alert: SlowPodStartup
      expr: |
        (time() - kube_pod_start_time) > 120
        and kube_pod_status_phase{phase="Running"} == 1
        and kube_pod_status_ready{condition="true"} == 0
      for: 1m
      labels:
        severity: warning
      annotations:
        summary: "Pod {{ $labels.pod }} taking >2 min to start"

    # Alert on pending pods
    - alert: PodsPendingTooLong
      expr: |
        kube_pod_status_phase{phase="Pending"} == 1
      for: 3m
      labels:
        severity: critical
      annotations:
        summary: "Pod {{ $labels.pod }} pending for >3 minutes"
        description: "Likely node capacity issue"
War Story #2: The VPA OOMKill Loop
The Incident
Date: March 15, 2023
Duration: 6 hours 12 minutes
Impact: 45% service availability, database corruption
Root Cause: VPA recommendations too aggressive, causing OOM loop
The Problem
VPA Recommendation: 4GB memory
Actual Pod Usage: 3.8GB memory (95% utilization)
Pod starts with 4GB limit
↓
App loads data into memory
↓
Memory usage: 3.9GB
↓
Java GC overhead increases
↓
Memory peaks at 4.1GB
↓
OOMKilled by kernel
↓
Pod restarts
↓
REPEAT INFINITELY
Root Cause Analysis
VPA Configuration:
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: cache-service-vpa
spec:
  targetRef:
    apiVersion: apps/v1
    kind: StatefulSet
    name: cache-service

  updatePolicy:
    updateMode: "Auto"   # ❌ Aggressive mode

  resourcePolicy:
    containerPolicies:
    - containerName: redis
      minAllowed:
        memory: 1Gi
      maxAllowed:
        memory: 8Gi
      # ❌ No safety margin configured
      # ❌ controlledValues defaults to RequestsAndLimits, so limits get rewritten too
What VPA Did:
Time 00:00 - VPA observes: avg 3.5GB, P95 3.8GB
Time 00:15 - VPA sets: request=3.8GB, limit=3.8GB
Time 00:30 - Pod restarted with new limits
Time 00:35 - Pod reaches 3.9GB
Time 00:36 - OOMKilled (limit: 3.8GB)
Time 00:37 - Pod restart #1
Time 00:42 - OOMKilled again
Time 00:43 - Pod restart #2
... (crash loop continues)
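Before touching the VPA, confirm you are actually looking at an OOM loop rather than an application crash. A quick check, assuming the cache-service pods carry an app=cache-service label:

# Restart count and last termination reason per pod ("OOMKilled" confirms the loop)
kubectl get pods -l app=cache-service -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.containerStatuses[0].restartCount}{"\t"}{.status.containerStatuses[0].lastState.terminated.reason}{"\n"}{end}'

# Cluster-wide view of recent OOM events
kubectl get events -A --field-selector reason=OOMKilling --sort-by='.lastTimestamp' | tail -10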
The Fix
1. VPA with Safety Margin:
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: cache-service-vpa
spec:
  targetRef:
    apiVersion: apps/v1
    kind: StatefulSet
    name: cache-service

  updatePolicy:
    updateMode: "Initial"   # ✅ Less aggressive

  resourcePolicy:
    containerPolicies:
    - containerName: redis
      minAllowed:
        memory: 2Gi    # ✅ Higher minimum
      maxAllowed:
        memory: 16Gi   # ✅ Higher maximum

      # ✅ Only control requests, not limits
      controlledValues: RequestsOnly

      mode: Auto
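With updateMode: Initial and controlledValues: RequestsOnly, the VPA keeps publishing recommendations without evicting pods, so you can review what it would apply before the next rollout picks it up:

# Full recommendation (target, lower/upper bounds) from the VPA status
kubectl describe vpa cache-service-vpa

# Just the numbers
kubectl get vpa cache-service-vpa -o jsonpath='{.status.recommendation.containerRecommendations[0]}'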
2. Manual Limit with Buffer:
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: cache-service
spec:
  template:
    spec:
      containers:
      - name: redis
        image: redis:7
        resources:
          requests:
            memory: 4Gi   # VPA will adjust this
          limits:
            memory: 8Gi   # ✅ Manual limit with 2x buffer
3. Application-Level Memory Management:
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: cache-service
spec:
  template:
    spec:
      containers:
      - name: redis
        image: redis:7
        command:
        - redis-server
        args:
        - --maxmemory
        - "3gb"            # ✅ App-level limit (75% of request)
        - --maxmemory-policy
        - "allkeys-lru"    # ✅ Evict keys when limit reached

        resources:
          requests:
            memory: 4Gi
          limits:
            memory: 8Gi
4. OOMKill Detection and Auto-remediation:
# CronJob to detect and fix OOM loops
apiVersion: batch/v1
kind: CronJob
metadata:
  name: oomkill-detector
  namespace: kube-system
spec:
  schedule: "*/5 * * * *"   # Every 5 minutes
  jobTemplate:
    spec:
      template:
        spec:
          serviceAccountName: oomkill-detector
          containers:
          - name: detector
            image: bitnami/kubectl:latest
            command:
            - /bin/bash
            - -c
            - |
              #!/bin/bash

              echo "Checking for OOMKill loops..."

              # Find pods with multiple OOMKills in last 10 minutes
              OOMKILLED_PODS=$(kubectl get events -A \
                --field-selector reason=OOMKilling \
                -o json | jq -r '
                  .items[] |
                  select(.lastTimestamp > (now - 600 | strftime("%Y-%m-%dT%H:%M:%SZ"))) |
                  "\(.involvedObject.namespace)/\(.involvedObject.name)"
                ' | sort | uniq -c | awk '$1 > 2 {print $2}')

              if [ -z "$OOMKILLED_PODS" ]; then
                echo "No OOMKill loops detected"
                exit 0
              fi

              echo "OOMKill loops detected:"
              echo "$OOMKILLED_PODS"

              # Increase memory limits
              for POD in $OOMKILLED_PODS; do
                NAMESPACE=$(echo $POD | cut -d/ -f1)
                POD_NAME=$(echo $POD | cut -d/ -f2)

                # Get deployment name
                DEPLOYMENT=$(kubectl get pod $POD_NAME -n $NAMESPACE \
                  -o jsonpath='{.metadata.labels.app}')

                echo "Increasing memory for $DEPLOYMENT in $NAMESPACE"

                # Patch to a higher memory limit (hardcoded to 12Gi here)
                kubectl patch deployment $DEPLOYMENT -n $NAMESPACE --type=json -p='[
                  {
                    "op": "replace",
                    "path": "/spec/template/spec/containers/0/resources/limits/memory",
                    "value": "12Gi"
                  }
                ]'

                # Disable VPA temporarily
                kubectl patch vpa ${DEPLOYMENT}-vpa -n $NAMESPACE -p '
                  {"spec":{"updatePolicy":{"updateMode":"Off"}}}'

                # Alert on Slack
                curl -X POST $SLACK_WEBHOOK \
                  -H 'Content-Type: application/json' \
                  -d "{\"text\": \"⚠️ OOMKill loop detected for $DEPLOYMENT. Auto-increased memory to 12Gi and disabled VPA.\"}"
              done
          restartPolicy: OnFailure
Lessons Learned
- VPA needs safety margins - Set limits higher than recommendations
- Requests ≠ Limits - Use controlledValues: RequestsOnly
- Application-level limits - Don’t rely solely on Kubernetes limits
- Monitor OOMKills - Set up automated detection and remediation
- Test VPA changes - Don’t enable Auto mode without thorough testing
War Story #3: The Spot Instance Cascade
The Incident
Date: August 8, 2023
Duration: 1 hour 23 minutes
Impact: 70% pod evictions, service disruption
Root Cause: AWS spot instance interruptions not handled gracefully
The Timeline
14:00 UTC - AWS spot price spike in us-east-1a
14:01 UTC - 30% of spot instances interrupted (2-minute warning)
14:03 UTC - 50 pods evicted
14:04 UTC - Karpenter provisions new spot instances
14:05 UTC - New spot instances also interrupted (different AZ)
14:07 UTC - 100 more pods evicted
14:08 UTC - Service degradation begins
14:10 UTC - Karpenter tries on-demand fallback
14:12 UTC - On-demand capacity exhausted
14:15 UTC - Cascading failure across all AZs
14:30 UTC - Manual intervention: forced on-demand scaling
15:23 UTC - Full recovery
Root Cause
Insufficient Instance Type Diversity:
# ❌ Original Karpenter NodePool
apiVersion: karpenter.sh/v1beta1
kind: NodePool
metadata:
  name: general-spot
spec:
  template:
    spec:
      requirements:
      - key: karpenter.sh/capacity-type
        operator: In
        values: ["spot"]

      # ❌ Limited to 2 instance families
      - key: karpenter.k8s.aws/instance-family
        operator: In
        values: ["m5", "c5"]

      # ❌ Single generation
      - key: karpenter.k8s.aws/instance-generation
        operator: In
        values: ["5"]
No PodDisruptionBudgets:
# ❌ No PDB configured
# All pods can be evicted simultaneously
Inadequate Fallback Strategy:
# ❌ Single NodePool
# No prioritization between spot and on-demand
The Fix
1. Maximum Instance Diversity:
apiVersion: karpenter.sh/v1beta1
kind: NodePool
metadata:
  name: diversified-spot
spec:
  template:
    spec:
      requirements:
      - key: karpenter.sh/capacity-type
        operator: In
        values: ["spot"]

      # ✅ Multiple instance categories
      - key: karpenter.k8s.aws/instance-category
        operator: In
        values: ["c", "m", "r", "t", "i", "d"]

      # ✅ Multiple generations
      - key: karpenter.k8s.aws/instance-generation
        operator: Gt
        values: ["4"]   # Anything 5+

      # ✅ Multiple sizes
      - key: karpenter.k8s.aws/instance-size
        operator: In
        values: ["large", "xlarge", "2xlarge", "4xlarge", "8xlarge"]

      # ✅ Spread across all AZs
      - key: topology.kubernetes.io/zone
        operator: In
        values: ["us-east-1a", "us-east-1b", "us-east-1c", "us-east-1d"]

      nodeClassRef:
        name: diversified

  # ✅ Short expiration to refresh instances frequently
  disruption:
    consolidationPolicy: WhenUnderutilized
    expireAfter: 12h

  # ✅ Preferred over the on-demand fallback (Karpenter tries higher-weight NodePools first)
  weight: 100

  limits:
    cpu: "500"
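Once the diversified pool is live, verify Karpenter is actually spreading across types, capacity types, and zones instead of converging on one cheap shape. The well-known node labels make this easy to eyeball:

# Instance type, capacity type, and zone for every node
kubectl get nodes -L node.kubernetes.io/instance-type,karpenter.sh/capacity-type,topology.kubernetes.io/zone

# Rough count of nodes per instance type (column 6 is the first -L column)
kubectl get nodes -L node.kubernetes.io/instance-type --no-headers | awk '{print $6}' | sort | uniq -c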
2. On-Demand Fallback Pool:
apiVersion: karpenter.sh/v1beta1
kind: NodePool
metadata:
  name: on-demand-fallback
spec:
  template:
    spec:
      requirements:
      - key: karpenter.sh/capacity-type
        operator: In
        values: ["on-demand"]

      - key: karpenter.k8s.aws/instance-category
        operator: In
        values: ["m", "c"]

      nodeClassRef:
        name: on-demand-fallback

  # ✅ Lower weight than the spot pool, so it is only used when spot capacity runs out
  weight: 10

  limits:
    cpu: "200"   # Reserve capacity
3. Critical Workload Isolation:
# ✅ On-demand pool for critical services
apiVersion: karpenter.sh/v1beta1
kind: NodePool
metadata:
  name: critical-on-demand
spec:
  template:
    metadata:
      labels:
        workload-type: critical
    spec:
      requirements:
      - key: karpenter.sh/capacity-type
        operator: In
        values: ["on-demand"]

      taints:
      - key: workload-type
        value: critical
        effect: NoSchedule

      nodeClassRef:
        name: critical

  limits:
    cpu: "100"

---
# Critical deployment on on-demand nodes
apiVersion: apps/v1
kind: Deployment
metadata:
  name: payment-service
spec:
  template:
    spec:
      # ✅ Force on-demand nodes
      nodeSelector:
        workload-type: critical

      tolerations:
      - key: workload-type
        value: critical
        effect: NoSchedule
4. Comprehensive PodDisruptionBudgets:
# ✅ PDB for all production services
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: api-server-pdb
  namespace: production
spec:
  minAvailable: "75%"   # Keep 75% of pods running
  selector:
    matchLabels:
      app: api-server

---
# ✅ PDB for critical services
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: payment-service-pdb
  namespace: production
spec:
  maxUnavailable: 1   # Only 1 pod can be down
  selector:
    matchLabels:
      app: payment-service
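After applying the PDBs, check the disruption headroom they actually leave. A PDB whose ALLOWED DISRUPTIONS column is stuck at 0 blocks node drains entirely, which is its own incident waiting to happen:

kubectl get pdb -n production
# NAME                  MIN AVAILABLE   MAX UNAVAILABLE   ALLOWED DISRUPTIONS   AGE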
5. Spot Interruption Handler:
# AWS Node Termination Handler
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: aws-node-termination-handler
  namespace: kube-system
spec:
  selector:
    matchLabels:
      app: aws-node-termination-handler
  template:
    metadata:
      labels:
        app: aws-node-termination-handler
    spec:
      serviceAccountName: aws-node-termination-handler
      hostNetwork: true
      containers:
      - name: handler
        image: amazon/aws-node-termination-handler:latest
        env:
        - name: NODE_NAME
          valueFrom:
            fieldRef:
              fieldPath: spec.nodeName
        - name: POD_NAME
          valueFrom:
            fieldRef:
              fieldPath: metadata.name
        - name: NAMESPACE
          valueFrom:
            fieldRef:
              fieldPath: metadata.namespace
        - name: ENABLE_SPOT_INTERRUPTION_DRAINING
          value: "true"
        - name: ENABLE_SCHEDULED_EVENT_DRAINING
          value: "true"
        - name: ENABLE_REBALANCE_MONITORING
          value: "true"
        - name: WEBHOOK_URL
          value: "http://slack-webhook/v1/webhook"
        securityContext:
          privileged: true

---
# Monitor spot interruptions
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: spot-interruption-alerts
  namespace: monitoring
spec:
  groups:
  - name: spot-interruptions
    rules:
    - alert: HighSpotInterruptionRate
      expr: |
        rate(aws_node_termination_handler_actions_node_total[10m]) > 0.1
      labels:
        severity: warning
      annotations:
        summary: "High spot interruption rate"
        description: "Spot interruptions running at {{ $value }} node actions per second"

    - alert: SpotCapacityShortage
      expr: |
        rate(karpenter_pods_state{state="pending"}[5m]) > 10
        and on() karpenter_nodes_created{capacity_type="spot"} == 0
      for: 5m
      labels:
        severity: critical
      annotations:
        summary: "Unable to provision spot instances"
        description: "Spot capacity exhausted, fallback to on-demand"
Lessons Learned
- Diversity is survival - More instance types = better spot availability
- PDBs are mandatory - Without them, all pods can evict simultaneously
- Layered fallback - Spot → Different spot family → On-demand
- Critical services need on-demand - Don’t run payment systems on spot
- Monitor interruption patterns - AWS publishes spot interruption frequency data
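On the last point, the Spot Instance Advisor covers interruption frequency by instance type; for a quick CLI signal of contention in a specific zone, recent price history also helps. An illustrative query (instance type, zone, and GNU date syntax are assumptions):

aws ec2 describe-spot-price-history \
  --instance-types m5.xlarge \
  --product-descriptions "Linux/UNIX" \
  --availability-zone us-east-1a \
  --start-time "$(date -u -d '1 hour ago' +%Y-%m-%dT%H:%M:%S)" \
  --query 'SpotPriceHistory[].[Timestamp,SpotPrice]' \
  --output table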
Debugging Workflow: The Systematic Approach
Step 1: Quick Health Check
#!/bin/bash
# autoscaling-health-check.sh

echo "=== Kubernetes Autoscaling Health Check ==="
echo ""

# 1. HPA Status
echo "1. HPA Status:"
kubectl get hpa -A
echo ""

# 2. Check for unknown metrics
echo "2. HPAs with unknown metrics:"
kubectl get hpa -A -o json | jq -r '
  .items[] |
  select(.status.conditions[] | select(.type == "ScalingActive" and .status == "False")) |
  "\(.metadata.namespace)/\(.metadata.name): \(.status.conditions[] | select(.type == "ScalingActive").message)"
'
echo ""

# 3. Metrics Server
echo "3. Metrics Server:"
kubectl get pods -n kube-system -l k8s-app=metrics-server
kubectl top nodes | head -5
echo ""

# 4. VPA Status
echo "4. VPA Status:"
kubectl get vpa -A
echo ""

# 5. Cluster Autoscaler
echo "5. Cluster Autoscaler/Karpenter:"
kubectl get pods -n kube-system -l app.kubernetes.io/name=cluster-autoscaler
kubectl get pods -n karpenter
echo ""

# 6. Pending Pods
echo "6. Pending Pods:"
PENDING=$(kubectl get pods -A --field-selector=status.phase=Pending --no-headers | wc -l)
echo "Total pending pods: $PENDING"
if [ $PENDING -gt 0 ]; then
  kubectl get pods -A --field-selector=status.phase=Pending
fi
echo ""

# 7. Recent Events
echo "7. Recent Autoscaling Events (last 10):"
kubectl get events -A --sort-by='.lastTimestamp' | grep -E 'Scale|HPA|VPA|Evict' | tail -10
echo ""

# 8. Node Pressure
echo "8. Node Resource Pressure:"
kubectl describe nodes | grep -A 5 "Allocated resources"
echo ""

# 9. Failed to Schedule
echo "9. Pods Failed to Schedule:"
kubectl get events -A --field-selector reason=FailedScheduling | tail -10
Step 2: Deep Dive - HPA Not Scaling
#!/bin/bash
# debug-hpa.sh

HPA_NAME=$1
NAMESPACE=${2:-default}

echo "=== Debugging HPA: $HPA_NAME in $NAMESPACE ==="
echo ""

# 1. HPA Configuration
echo "1. HPA Configuration:"
kubectl get hpa $HPA_NAME -n $NAMESPACE -o yaml
echo ""

# 2. HPA Status
echo "2. HPA Status:"
kubectl describe hpa $HPA_NAME -n $NAMESPACE
echo ""

# 3. Current Metrics
echo "3. Current Metrics:"
kubectl get --raw "/apis/metrics.k8s.io/v1beta1/namespaces/$NAMESPACE/pods" | \
  jq -r ".items[] | select(.metadata.labels.app == \"$HPA_NAME\") | {name: .metadata.name, cpu: .containers[].usage.cpu, memory: .containers[].usage.memory}"
echo ""

# 4. Target Deployment
echo "4. Target Deployment:"
TARGET=$(kubectl get hpa $HPA_NAME -n $NAMESPACE -o jsonpath='{.spec.scaleTargetRef.name}')
kubectl get deployment $TARGET -n $NAMESPACE
echo ""

# 5. Pod Resource Requests
echo "5. Pod Resource Requests:"
kubectl get deployment $TARGET -n $NAMESPACE -o jsonpath='{.spec.template.spec.containers[].resources}'
echo ""

# 6. Scaling Events
echo "6. Recent Scaling Events:"
kubectl get events -n $NAMESPACE --field-selector involvedObject.name=$HPA_NAME --sort-by='.lastTimestamp' | tail -20
echo ""

# 7. HPA Controller Logs
echo "7. HPA Controller Logs (last 50 lines):"
kubectl logs -n kube-system -l component=kube-controller-manager --tail=50 | grep -i hpa
echo ""

# 8. Check if metrics are available
echo "8. Metrics Availability:"
if kubectl get --raw /apis/custom.metrics.k8s.io/v1beta1 &>/dev/null; then
  echo "✅ Custom metrics API available"
  kubectl get --raw /apis/custom.metrics.k8s.io/v1beta1 | jq -r '.resources[].name' | head -10
else
  echo "❌ Custom metrics API not available"
fi
echo ""

# 9. Prometheus Metrics (if using custom metrics)
echo "9. Prometheus Metrics:"
POD=$(kubectl get pods -n $NAMESPACE -l app=$TARGET -o jsonpath='{.items[0].metadata.name}')
kubectl exec -n $NAMESPACE $POD -- curl -s localhost:9090/metrics 2>/dev/null | grep -E "cpu|memory|requests" | head -10
Step 3: Deep Dive - Cluster Autoscaler Issues
#!/bin/bash
# debug-cluster-autoscaler.sh

echo "=== Debugging Cluster Autoscaler ==="
echo ""

# 1. Cluster Autoscaler Status
echo "1. Cluster Autoscaler Status:"
kubectl get pods -n kube-system -l app=cluster-autoscaler
echo ""

# 2. CA Logs (errors only)
echo "2. Recent Errors:"
CA_POD=$(kubectl get pods -n kube-system -l app=cluster-autoscaler -o jsonpath='{.items[0].metadata.name}')
kubectl logs -n kube-system $CA_POD --tail=100 | grep -i error
echo ""

# 3. Node Groups
echo "3. Node Groups Status:"
kubectl logs -n kube-system $CA_POD --tail=50 | grep -i "node group"
echo ""

# 4. Scale Up Events
echo "4. Recent Scale Up Attempts:"
kubectl logs -n kube-system $CA_POD --tail=100 | grep -i "scale up"
echo ""

# 5. Scale Down Events
echo "5. Recent Scale Down Attempts:"
kubectl logs -n kube-system $CA_POD --tail=100 | grep -i "scale down"
echo ""

# 6. Unschedulable Pods
echo "6. Unschedulable Pods:"
kubectl logs -n kube-system $CA_POD --tail=50 | grep -i "unschedulable"
echo ""

# 7. Node Group Sizes (AWS)
echo "7. AWS ASG Sizes:"
aws autoscaling describe-auto-scaling-groups \
  --query 'AutoScalingGroups[?contains(Tags[?Key==`k8s.io/cluster-autoscaler/enabled`].Value, `true`)].{Name:AutoScalingGroupName,Desired:DesiredCapacity,Min:MinSize,Max:MaxSize,Current:Instances|length(@)}' \
  --output table
echo ""

# 8. Node Capacity
echo "8. Current Node Capacity:"
kubectl get nodes -o custom-columns=NAME:.metadata.name,CPU:.status.capacity.cpu,MEMORY:.status.capacity.memory,PODS:.status.capacity.pods
echo ""

# 9. ConfigMap
echo "9. Cluster Autoscaler ConfigMap:"
kubectl get configmap cluster-autoscaler-status -n kube-system -o yaml
Common Failure Patterns
Pattern 1: The Thundering Herd
Symptom: All pods restart simultaneously
Cause:
- VPA in Recreate mode with no PDB
- All pods get new resource recommendations at once
- VPA evicts all pods simultaneously
Detection:
# Check for mass pod restarts
kubectl get events -A | grep -E "Killing|Evicted" | wc -l
Fix:
# Add PDB
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: my-app-pdb
spec:
  minAvailable: "70%"
Pattern 2: The Resource Starvation
Symptom: HPA scales up but pods remain pending
Cause:
- No cluster autoscaler
- Node resources exhausted
- Pod resource requests too large
Detection:
# Check pending pods
kubectl get pods -A --field-selector=status.phase=Pending

# Check node allocatable resources
kubectl describe nodes | grep -A 10 "Allocated resources"
Fix:
# Enable Cluster Autoscaler or add nodes manually
# Or reduce resource requests
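Concretely, the short-term remediation is usually one of these two moves (names and values are placeholders):

# Option A: add node capacity by hand (AWS ASG example)
aws autoscaling set-desired-capacity \
  --auto-scaling-group-name <node-group> \
  --desired-capacity <N>

# Option B: shrink requests so pods fit on the nodes you already have
kubectl set resources deployment/<name> -n <namespace> \
  --requests=cpu=250m,memory=256Mi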
Pattern 3: The Metric Lag
Symptom: HPA scales late, after traffic spike already passed
Cause:
- Long metric scrape interval (15s)
- Long HPA evaluation interval (15s)
- Total lag: 30-60 seconds
Detection:
# Check Metrics Server update frequency
kubectl get --raw /apis/metrics.k8s.io/v1beta1/nodes | jq -r '.items[0].timestamp'
# Wait 10 seconds
kubectl get --raw /apis/metrics.k8s.io/v1beta1/nodes | jq -r '.items[0].timestamp'
# Compare timestamps
Fix:
# Use custom metrics with lower scrape interval
# Or implement predictive scaling
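A first step is measuring how stale the pipeline really is. For resource metrics, metrics-server's --metric-resolution flag sets the refresh rate; a sketch for the upstream deployment (names may differ on managed clusters):

# Look for --metric-resolution in the metrics-server args (commonly 15s)
kubectl get deployment metrics-server -n kube-system \
  -o jsonpath='{.spec.template.spec.containers[0].args}'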
Emergency Runbook
Scenario: HPA Not Scaling
# 1. Check HPA status
kubectl describe hpa <name> -n <namespace>

# 2. Verify Metrics Server
kubectl top pods -n <namespace>

# 3. Check resource requests are set
kubectl get deployment <name> -n <namespace> -o yaml | grep -A 10 resources

# 4. Manual scale if urgent
kubectl scale deployment <name> --replicas=<N> -n <namespace>

# 5. Check HPA controller logs
kubectl logs -n kube-system -l component=kube-controller-manager | grep HPA
Scenario: Mass OOMKills
# 1. Identify OOMKilled pods
kubectl get events -A --field-selector reason=OOMKilling

# 2. Check memory usage patterns
kubectl top pods -A --sort-by=memory

# 3. Emergency memory increase
kubectl patch deployment <name> -n <namespace> -p '
{
  "spec": {
    "template": {
      "spec": {
        "containers": [{
          "name": "<container>",
          "resources": {
            "limits": {"memory": "4Gi"}
          }
        }]
      }
    }
  }
}'

# 4. Disable VPA temporarily
kubectl patch vpa <name> -n <namespace> -p '
{"spec":{"updatePolicy":{"updateMode":"Off"}}}'
Scenario: Spot Instance Cascade
# 1. Check node status
kubectl get nodes -o wide

# 2. Check Karpenter/CA logs
kubectl logs -n karpenter -l app.kubernetes.io/name=karpenter --tail=100

# 3. Force on-demand scaling
# AWS: Update ASG to use on-demand
aws autoscaling update-auto-scaling-group \
  --auto-scaling-group-name <name> \
  --desired-capacity <N>

# 4. Emergency pod rescheduling
kubectl drain <spot-node> --ignore-daemonsets --delete-emptydir-data
Lessons from the Trenches
Top 10 Production Lessons
- Always set readiness probes - Traffic to non-ready pods kills performance
- HPA + CA delays compound - Plan for 5-minute worst-case scaling time
- VPA needs safety margins - Set limits 2x higher than recommendations
- PDBs are not optional - Without them, chaos ensues
- Spot needs diversity - Single instance type = guaranteed interruption
- Monitor metric lag - 60-second lag can cause total failure during spikes
- Pre-warm for known events - Black Friday, etc. need manual pre-scaling
- Test failover paths - Spot → On-demand fallback must be tested
- API server capacity matters - HPA can overwhelm API server
- OOMKills propagate - One OOM can cascade to entire service
Recommended SLOs
# SLO Targets
HPA Scaling Latency: P95 < 60 seconds
Cluster Autoscaler Provisioning: P95 < 5 minutes
Pod Startup Time: P95 < 90 seconds
OOMKill Rate: < 0.1% of pod starts
Spot Interruption Handling: 100% graceful (no dropped requests)
Autoscaling Accuracy: ±10% of optimal replica count
Key Takeaways
- Production is different - Theory works until 2 AM on Black Friday
- Layered defenses - Multiple fallback strategies save the day
- Monitor everything - You can’t fix what you can’t see
- Test failure modes - Chaos engineering finds issues before customers do
- Document incidents - Today’s postmortem is tomorrow’s runbook
Related Topics
Autoscaling Series
- Part 1: HPA Fundamentals
- Part 2: Cluster Autoscaling
- Part 3: Hands-On Demo
- Part 4: Monitoring & Alerting
- Part 5: VPA & Resource Optimization
- Part 6: Advanced Patterns
Conclusion
Production Kubernetes autoscaling teaches lessons that can’t be learned from documentation:
- Black Friday taught us: Pre-warming and readiness probes are critical
- The OOMKill loop taught us: VPA needs safety margins
- The spot cascade taught us: Instance diversity saves the day
The best SRE teams learn from failures, document thoroughly, and build systems that fail gracefully. Every 2 AM page makes the system more resilient.
Remember: The goal isn’t to eliminate failures—it’s to learn from them and ensure they don’t happen twice.
Next up: Part 8 - Security, Compliance & Governance 🔒
Stay resilient! 🛡️