Kubernetes Autoscaling Complete Guide (Part 7): Production Troubleshooting & War Stories

Series Overview

This is Part 7 of the Kubernetes Autoscaling Complete Guide series.


Theory is important, but production teaches the hardest lessons. This guide documents real-world autoscaling failures, debugging methodologies, and hard-won insights from managing Kubernetes autoscaling at scale. These are the stories rarely told in documentation—the 2 AM incidents, cascading failures, and subtle bugs that cost millions.

War Story #1: The Black Friday Meltdown

The Incident

Date: November 25, 2022
Duration: 2 hours 37 minutes
Impact: $3.2M revenue loss, 89% service degradation
Root Cause: HPA thrashing during traffic spike

Timeline

08:45 UTC - Black Friday sale begins
08:46 UTC - Traffic increases from 10k to 150k req/min
08:47 UTC - HPA scales from 50 to 100 pods
08:48 UTC - New pods start, but not ready (app startup: 2 min)
08:49 UTC - Existing pods overloaded, CPU hits 95%
08:50 UTC - HPA sees high CPU, scales to 200 pods
08:51 UTC - Kubernetes scheduler cannot place pods (insufficient nodes)
08:52 UTC - Cluster Autoscaler adds nodes (provision time: 3 min)
08:53 UTC - API server overwhelmed by HPA queries (1000+ req/s)
08:54 UTC - HPA controller starts timing out
08:55 UTC - Pods begin OOMKilling due to memory pressure
08:56 UTC - Service enters cascading failure mode
08:57 UTC - Manual intervention begins
09:15 UTC - Emergency scale-up of node pool
09:30 UTC - Services stabilize
11:22 UTC - Full recovery

What Went Wrong

┌────────────────────────────────────────────────────────────────┐
│                    FAILURE CASCADE                             │
│                                                                 │
│  Traffic Spike                                                 │
│       ↓                                                         │
│  Slow App Startup (2 min)                                      │
│       ↓                                                         │
│  Existing Pods Overloaded                                      │
│       ↓                                                         │
│  HPA Aggressive Scale-Up                                       │
│       ↓                                                         │
│  No Node Capacity                                              │
│       ↓                                                         │
│  Cluster Autoscaler Delay (3 min)                             │
│       ↓                                                         │
│  Pods Pending                                                   │
│       ↓                                                         │
│  API Server Overload (HPA queries)                            │
│       ↓                                                         │
│  HPA Timeouts                                                   │
│       ↓                                                         │
│  OOMKills Begin                                                 │
│       ↓                                                         │
│  TOTAL SYSTEM FAILURE                                          │
└────────────────────────────────────────────────────────────────┘

Configuration Issues

Original HPA Configuration:

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: api-server-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api-server

  minReplicas: 50
  maxReplicas: 500   # ❌ Too aggressive

  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70  # ❌ Too sensitive

  behavior:  # ❌ Behavior block adds no damping
    scaleUp:
      stabilizationWindowSeconds: 0  # ❌ Immediate
      policies:
      - type: Percent
        value: 100                    # ❌ Doubles every 15s
        periodSeconds: 15

Deployment Issues:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: api-server
spec:
  template:
    spec:
      containers:
      - name: api
        image: api-server:v1.0

        # ❌ No readiness probe - pods receive traffic before ready
        # ❌ Slow startup time not accounted for

        resources:
          requests:
            cpu: 500m
            memory: 512Mi
          limits:
            cpu: 2          # ❌ 4x request - causes throttling
            memory: 1Gi     # ❌ 2x request - causes OOM

Cluster Autoscaler Configuration:

# ❌ No node over-provisioning
# ❌ Single node pool type (no hot spare capacity)
# ❌ 3-minute node startup time not accounted for

The Fix

1. Improved HPA Configuration:

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: api-server-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api-server

  minReplicas: 100  # ✅ Higher baseline for Black Friday
  maxReplicas: 500

  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 60  # ✅ More headroom

  # ✅ Custom metric: actual request rate (more predictive)
  - type: Pods
    pods:
      metric:
        name: http_requests_per_second
      target:
        type: AverageValue
        averageValue: "100"

  behavior:
    scaleUp:
      stabilizationWindowSeconds: 30  # ✅ 30s buffer
      policies:
      - type: Pods
        value: 20                      # ✅ Max 20 pods per 30s
        periodSeconds: 30
      - type: Percent
        value: 50                      # ✅ Max 50% increase
        periodSeconds: 30
      selectPolicy: Min                # ✅ Conservative

    scaleDown:
      stabilizationWindowSeconds: 300  # ✅ 5 min cooldown
      policies:
      - type: Pods
        value: 5
        periodSeconds: 60

2. Application Optimization:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: api-server
spec:
  template:
    spec:
      containers:
      - name: api
        image: api-server:v2.0

        # ✅ Readiness probe
        readinessProbe:
          httpGet:
            path: /health/ready
            port: 8080
          initialDelaySeconds: 30
          periodSeconds: 5
          failureThreshold: 3

        # ✅ Startup probe for slow startup
        startupProbe:
          httpGet:
            path: /health/startup
            port: 8080
          initialDelaySeconds: 0
          periodSeconds: 10
          failureThreshold: 18  # Allow 3 minutes

        # ✅ Liveness probe
        livenessProbe:
          httpGet:
            path: /health/live
            port: 8080
          periodSeconds: 10
          failureThreshold: 3

        resources:
          requests:
            cpu: 500m
            memory: 512Mi
          limits:
            cpu: 1000m    # ✅ 2x request (reasonable burst)
            memory: 1Gi   # ✅ 2x request

        # ✅ Graceful shutdown
        lifecycle:
          preStop:
            exec:
              command: ["/bin/sh", "-c", "sleep 15"]

3. Cluster Pre-warming:

# ✅ Node over-provisioning deployment
apiVersion: apps/v1
kind: Deployment
metadata:
  name: cluster-overprovisioner
  namespace: kube-system
spec:
  replicas: 10  # Reserve capacity for 10 pods
  template:
    spec:
      priorityClassName: overprovisioning  # Low priority
      containers:
      - name: pause
        image: k8s.gcr.io/pause
        resources:
          requests:
            cpu: 500m
            memory: 512Mi

---
# ✅ Priority class for overprovisioning
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: overprovisioning
value: -1  # Negative priority - first to evict
globalDefault: false
description: "Pods that reserve cluster capacity"

---
# ✅ Priority class for production workloads
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: production-high
value: 1000
globalDefault: false
description: "High priority production workloads"

4. Scheduled Pre-scaling:

# ✅ CronJob to pre-scale before Black Friday
apiVersion: batch/v1
kind: CronJob
metadata:
  name: blackfriday-prescale
  namespace: production
spec:
  # 30 minutes before sale
  schedule: "15 8 25 11 *"  # Nov 25, 08:15 UTC
  jobTemplate:
    spec:
      template:
        spec:
          serviceAccountName: autoscaler
          containers:
          - name: prescale
            image: bitnami/kubectl:latest
            command:
            - /bin/bash
            - -c
            - |
              echo "Pre-scaling for Black Friday"

              # Scale up deployment
              kubectl scale deployment api-server --replicas=150 -n production

              # Update HPA minReplicas
              kubectl patch hpa api-server-hpa -n production -p '{"spec":{"minReplicas":150}}'

              # Add extra nodes
              aws autoscaling set-desired-capacity \
                --auto-scaling-group-name eks-node-group \
                --desired-capacity 50

              echo "Pre-scaling complete"
          restartPolicy: OnFailure

Lessons Learned

  1. Slow startup kills autoscaling - 2-minute app startup + 3-minute node provisioning = 5 minutes total lag
  2. Traffic spikes need pre-warming - Reactive scaling is too slow for flash events
  3. HPA + CA delays compound - Each layer adds latency; total delay can be fatal
  4. API server is a bottleneck - HPA can overwhelm API server with queries (see the sketch after this list)
  5. Readiness probes are critical - Without them, traffic hits non-ready pods
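
Lesson 4 has a concrete knob. The HPA control loop's query volume and reaction speed are governed by kube-controller-manager flags; on clusters where you operate the control plane yourself (managed offerings such as EKS or GKE do not expose these), the relevant flags look roughly like this, shown with their upstream defaults. This is an illustrative sketch of a kubeadm-style static pod manifest, not a drop-in file:

# /etc/kubernetes/manifests/kube-controller-manager.yaml (excerpt, kubeadm-style control plane)
spec:
  containers:
  - name: kube-controller-manager
    command:
    - kube-controller-manager
    - --horizontal-pod-autoscaler-sync-period=15s               # how often every HPA is re-evaluated
    - --horizontal-pod-autoscaler-downscale-stabilization=5m0s  # global scale-down cooldown
    - --horizontal-pod-autoscaler-tolerance=0.1                 # ignore <10% deviation from target

Lengthening the sync period trades reaction speed for less API-server load; in this incident the bigger win was per-HPA behavior tuning, which autoscaling/v2 exposes without touching the control plane.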

Preventive Measures

# ✅ Comprehensive monitoring
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: autoscaling-sla-alerts
  namespace: monitoring
spec:
  groups:
  - name: autoscaling-sla
    rules:
    # Alert when scaling is too slow
    - alert: SlowAutoscaling
      expr: |
        (
          kube_horizontalpodautoscaler_status_desired_replicas
          - kube_horizontalpodautoscaler_status_current_replicas
        ) > 5
      for: 2m
      labels:
        severity: warning
      annotations:
        summary: "HPA scaling lag detected"
        description: "Desired replicas not reached for 2 minutes"

    # Alert on pod startup time
    - alert: SlowPodStartup
      expr: |
        (time() - kube_pod_start_time) > 120
        and on (namespace, pod) kube_pod_status_phase{phase="Running"} == 1
        and on (namespace, pod) kube_pod_status_ready{condition="true"} == 0
      for: 1m
      labels:
        severity: warning
      annotations:
        summary: "Pod {{ $labels.pod }} taking >2 min to start"

    # Alert on pending pods
    - alert: PodsPendingTooLong
      expr: |
        kube_pod_status_phase{phase="Pending"} == 1
      for: 3m
      labels:
        severity: critical
      annotations:
        summary: "Pod {{ $labels.pod }} pending for >3 minutes"
        description: "Likely node capacity issue"

War Story #2: The VPA OOMKill Loop

The Incident

Date: March 15, 2023
Duration: 6 hours 12 minutes
Impact: 45% service availability, database corruption
Root Cause: VPA recommendations too aggressive, causing OOM loop

The Problem

VPA Recommendation: 4GB memory
Actual Pod Usage: 3.8GB memory (95% utilization)

Pod starts with 4GB limit
↓
App loads data into memory
↓
Memory usage: 3.9GB
↓
Java GC overhead increases
↓
Memory peaks at 4.1GB
↓
OOMKilled by kernel
↓
Pod restarts
↓
REPEAT INFINITELY
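
Running at 95% of the limit is exactly the zone where an alert should fire before the kernel does. A minimal PromQL sketch for that, assuming cAdvisor and kube-state-metrics series are available in Prometheus; the 0.9 threshold is a judgment call:

# Containers whose working set exceeds 90% of their memory limit
max by (namespace, pod, container) (container_memory_working_set_bytes{container!="", container!="POD"})
  /
max by (namespace, pod, container) (kube_pod_container_resource_limits{resource="memory"})
  > 0.9

In this incident the ratio sat around 0.95 long before the VPA change pushed it over the edge.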

Root Cause Analysis

VPA Configuration:

apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: cache-service-vpa
spec:
  targetRef:
    apiVersion: apps/v1
    kind: StatefulSet
    name: cache-service

  updatePolicy:
    updateMode: "Auto"  # ❌ Aggressive mode

  resourcePolicy:
    containerPolicies:
    - containerName: redis
      minAllowed:
        memory: 1Gi
      maxAllowed:
        memory: 8Gi
      # ❌ No safety margin configured
      # ❌ controlledValues defaults to RequestsAndLimits, so VPA rewrites limits too

What VPA Did:

Time 00:00 - VPA observes: avg 3.5GB, P95 3.8GB
Time 00:15 - VPA sets: request=3.8GB, limit=3.8GB
Time 00:30 - Pod restarted with new limits
Time 00:35 - Pod reaches 3.9GB
Time 00:36 - OOMKilled (limit: 3.8GB)
Time 00:37 - Pod restart #1
Time 00:42 - OOMKilled again
Time 00:43 - Pod restart #2
... (crash loop continues)
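
Before trusting updateMode: Auto, it is worth eyeballing what VPA wants to set against what the workload actually uses. A small sketch using the VPA status fields; the namespace and the app=cache-service label are assumptions, adjust for your workload:

#!/bin/bash
# Compare VPA targets with live usage before letting Auto mode act on them.
NAMESPACE=production        # assumed namespace for this workload
VPA=cache-service-vpa

echo "VPA recommendations:"
kubectl get vpa "$VPA" -n "$NAMESPACE" \
  -o jsonpath='{range .status.recommendation.containerRecommendations[*]}{.containerName}{"\t"}target={.target.memory}{"\t"}upperBound={.upperBound.memory}{"\n"}{end}'

echo ""
echo "Current usage:"
kubectl top pods -n "$NAMESPACE" -l app=cache-service

A target that sits within a few percent of current usage, as it did here, is a sign the recommendation leaves no room for GC overhead or load spikes.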

The Fix

1. VPA with Safety Margin:

apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: cache-service-vpa
spec:
  targetRef:
    apiVersion: apps/v1
    kind: StatefulSet
    name: cache-service

  updatePolicy:
    updateMode: "Initial"  # ✅ Less aggressive

  resourcePolicy:
    containerPolicies:
    - containerName: redis
      minAllowed:
        memory: 2Gi    # ✅ Higher minimum
      maxAllowed:
        memory: 16Gi   # ✅ Higher maximum

      # ✅ Only control requests, not limits
      controlledValues: RequestsOnly

      mode: Auto

2. Manual Limit with Buffer:

apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: cache-service
spec:
  template:
    spec:
      containers:
      - name: redis
        image: redis:7
        resources:
          requests:
            memory: 4Gi    # VPA will adjust this
          limits:
            memory: 8Gi    # ✅ Manual limit with 2x buffer

3. Application-Level Memory Management:

apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: cache-service
spec:
  template:
    spec:
      containers:
      - name: redis
        image: redis:7
        command:
        - redis-server
        args:
        - --maxmemory
        - "3gb"              # ✅ App-level limit (75% of request)
        - --maxmemory-policy
        - "allkeys-lru"      # ✅ Evict keys when limit reached

        resources:
          requests:
            memory: 4Gi
          limits:
            memory: 8Gi

4. OOMKill Detection and Auto-remediation:

# CronJob to detect and fix OOM loops
apiVersion: batch/v1
kind: CronJob
metadata:
  name: oomkill-detector
  namespace: kube-system
spec:
  schedule: "*/5 * * * *"  # Every 5 minutes
  jobTemplate:
    spec:
      template:
        spec:
          serviceAccountName: oomkill-detector
          containers:
          - name: detector
            image: bitnami/kubectl:latest
            command:
            - /bin/bash
            - -c
            - |
              #!/bin/bash

              echo "Checking for OOMKill loops..."

              # Find pods with multiple OOMKills in last 10 minutes
              OOMKILLED_PODS=$(kubectl get events -A \
                --field-selector reason=OOMKilling \
                -o json | jq -r '
                  .items[] |
                  select(.lastTimestamp > ((now - 600) | todate)) |
                  "\(.involvedObject.namespace)/\(.involvedObject.name)"
                ' | sort | uniq -c | awk '$1 > 2 {print $2}')

              if [ -z "$OOMKILLED_PODS" ]; then
                echo "No OOMKill loops detected"
                exit 0
              fi

              echo "OOMKill loops detected:"
              echo "$OOMKILLED_PODS"

              # Increase memory limits
              for POD in $OOMKILLED_PODS; do
                NAMESPACE=$(echo $POD | cut -d/ -f1)
                POD_NAME=$(echo $POD | cut -d/ -f2)

                # Get deployment name
                DEPLOYMENT=$(kubectl get pod $POD_NAME -n $NAMESPACE \
                  -o jsonpath='{.metadata.labels.app}')

                echo "Increasing memory for $DEPLOYMENT in $NAMESPACE"

                # Patch to increase memory by 50%
                kubectl patch deployment $DEPLOYMENT -n $NAMESPACE --type=json -p='[
                  {
                    "op": "replace",
                    "path": "/spec/template/spec/containers/0/resources/limits/memory",
                    "value": "12Gi"
                  }
                ]'

                # Disable VPA temporarily
                kubectl patch vpa ${DEPLOYMENT}-vpa -n $NAMESPACE -p '
                  {"spec":{"updatePolicy":{"updateMode":"Off"}}}'

                # Alert on Slack
                curl -X POST $SLACK_WEBHOOK \
                  -H 'Content-Type: application/json' \
                  -d "{\"text\": \"⚠️ OOMKill loop detected for $DEPLOYMENT. Auto-increased memory to 12Gi and disabled VPA.\"}"
              done
          restartPolicy: OnFailure

Lessons Learned

  1. VPA needs safety margins - Set limits higher than recommendations
  2. Requests ≠ Limits - Use controlledValues: RequestsOnly
  3. Application-level limits - Don’t rely solely on Kubernetes limits
  4. Monitor OOMKills - Set up automated detection and remediation (a metrics-based alert is sketched below)
  5. Test VPA changes - Don’t enable Auto mode without thorough testing
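
The event-driven CronJob above works, but a metrics-based alert catches the loop faster and survives event garbage collection. A sketch built on kube-state-metrics series; the threshold and window are judgment calls, not gospel:

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: oomkill-loop-alerts
  namespace: monitoring
spec:
  groups:
  - name: oomkills
    rules:
    - alert: OOMKillLoop
      expr: |
        increase(kube_pod_container_status_restarts_total[15m]) > 3
        and on (namespace, pod, container)
        kube_pod_container_status_last_terminated_reason{reason="OOMKilled"} == 1
      labels:
        severity: critical
      annotations:
        summary: "{{ $labels.namespace }}/{{ $labels.pod }} is OOMKill looping"
        description: "More than 3 restarts in 15m and last termination reason was OOMKilled"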

War Story #3: The Spot Instance Cascade

The Incident

Date: August 8, 2023
Duration: 1 hour 23 minutes
Impact: 70% pod evictions, service disruption
Root Cause: AWS spot instance interruptions not handled gracefully

The Timeline

14:00 UTC - AWS spot price spike in us-east-1a
14:01 UTC - 30% of spot instances interrupted (2-minute warning)
14:03 UTC - 50 pods evicted
14:04 UTC - Karpenter provisions new spot instances
14:05 UTC - New spot instances also interrupted (different AZ)
14:07 UTC - 100 more pods evicted
14:08 UTC - Service degradation begins
14:10 UTC - Karpenter tries on-demand fallback
14:12 UTC - On-demand capacity exhausted
14:15 UTC - Cascading failure across all AZs
14:30 UTC - Manual intervention: forced on-demand scaling
15:23 UTC - Full recovery

Root Cause

Insufficient Instance Type Diversity:

# ❌ Original Karpenter NodePool
apiVersion: karpenter.sh/v1beta1
kind: NodePool
metadata:
  name: general-spot
spec:
  template:
    spec:
      requirements:
      - key: karpenter.sh/capacity-type
        operator: In
        values: ["spot"]

      # ❌ Limited to 2 instance families
      - key: karpenter.k8s.aws/instance-family
        operator: In
        values: ["m5", "c5"]

      # ❌ Single generation
      - key: karpenter.k8s.aws/instance-generation
        operator: In
        values: ["5"]

No PodDisruptionBudgets:

# ❌ No PDB configured
# All pods can be evicted simultaneously

Inadequate Fallback Strategy:

# ❌ Single NodePool
# No prioritization between spot and on-demand

The Fix

1. Maximum Instance Diversity:

apiVersion: karpenter.sh/v1beta1
kind: NodePool
metadata:
  name: diversified-spot
spec:
  template:
    spec:
      requirements:
      - key: karpenter.sh/capacity-type
        operator: In
        values: ["spot"]

      # ✅ Multiple instance families
      - key: karpenter.k8s.aws/instance-category
        operator: In
        values: ["c", "m", "r", "t", "i", "d"]

      # ✅ Multiple generations
      - key: karpenter.k8s.aws/instance-generation
        operator: Gt
        values: ["4"]  # Anything 5+

      # ✅ Multiple sizes
      - key: karpenter.k8s.aws/instance-size
        operator: In
        values: ["large", "xlarge", "2xlarge", "4xlarge", "8xlarge"]

      # ✅ Spread across all AZs
      - key: topology.kubernetes.io/zone
        operator: In
        values: ["us-east-1a", "us-east-1b", "us-east-1c", "us-east-1d"]

      nodeClassRef:
        name: diversified

  # ✅ Preferred pool: Karpenter tries higher-weight NodePools first
  weight: 100

  # ✅ Short expiration to refresh instances frequently
  disruption:
    consolidationPolicy: WhenUnderutilized
    expireAfter: 12h

  limits:
    cpu: "500"

2. On-Demand Fallback Pool:

apiVersion: karpenter.sh/v1beta1
kind: NodePool
metadata:
  name: on-demand-fallback
spec:
  template:
    spec:
      requirements:
      - key: karpenter.sh/capacity-type
        operator: In
        values: ["on-demand"]

      - key: karpenter.k8s.aws/instance-category
        operator: In
        values: ["m", "c"]

      nodeClassRef:
        name: on-demand-fallback

  # ✅ Lower weight than the spot pool: Karpenter prefers higher weights,
  #    so this pool is only used when spot cannot be provisioned
  weight: 10

  limits:
    cpu: "200"  # Reserve capacity
3. Critical Workload Isolation:

# ✅ On-demand pool for critical services
apiVersion: karpenter.sh/v1beta1
kind: NodePool
metadata:
  name: critical-on-demand
spec:
  template:
    metadata:
      labels:
        workload-type: critical   # node labels live under template.metadata
    spec:
      requirements:
      - key: karpenter.sh/capacity-type
        operator: In
        values: ["on-demand"]

      taints:
      - key: workload-type
        value: critical
        effect: NoSchedule

      nodeClassRef:
        name: critical

  limits:
    cpu: "100"

---
# Critical deployment on on-demand nodes
apiVersion: apps/v1
kind: Deployment
metadata:
  name: payment-service
spec:
  template:
    spec:
      # ✅ Force on-demand nodes
      nodeSelector:
        workload-type: critical

      tolerations:
      - key: workload-type
        value: critical
        effect: NoSchedule

4. Comprehensive PodDisruptionBudgets:

# ✅ PDB for all production services
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: api-server-pdb
  namespace: production
spec:
  minAvailable: 75%  # Keep 75% pods running
  selector:
    matchLabels:
      app: api-server

---
# ✅ PDB for critical services
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: payment-service-pdb
  namespace: production
spec:
  maxUnavailable: 1  # Only 1 pod can be down
  selector:
    matchLabels:
      app: payment-service

5. Spot Interruption Handler:

# AWS Node Termination Handler
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: aws-node-termination-handler
  namespace: kube-system
spec:
  selector:
    matchLabels:
      app: aws-node-termination-handler
  template:
    metadata:
      labels:
        app: aws-node-termination-handler
    spec:
      serviceAccountName: aws-node-termination-handler
      hostNetwork: true
      containers:
      - name: handler
        image: amazon/aws-node-termination-handler:latest
        env:
        - name: NODE_NAME
          valueFrom:
            fieldRef:
              fieldPath: spec.nodeName
        - name: POD_NAME
          valueFrom:
            fieldRef:
              fieldPath: metadata.name
        - name: NAMESPACE
          valueFrom:
            fieldRef:
              fieldPath: metadata.namespace
        - name: ENABLE_SPOT_INTERRUPTION_DRAINING
          value: "true"
        - name: ENABLE_SCHEDULED_EVENT_DRAINING
          value: "true"
        - name: ENABLE_REBALANCE_MONITORING
          value: "true"
        - name: WEBHOOK_URL
          value: "http://slack-webhook/v1/webhook"
        securityContext:
          privileged: true

---
# Monitor spot interruptions
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: spot-interruption-alerts
  namespace: monitoring
spec:
  groups:
  - name: spot-interruptions
    rules:
    - alert: HighSpotInterruptionRate
      expr: |
        rate(aws_node_termination_handler_actions_node_total[10m]) > 0.1
      labels:
        severity: warning
      annotations:
        summary: "High spot interruption rate"
        description: "Interruption actions running at {{ $value }} per second (10m average)"

    - alert: SpotCapacityShortage
      expr: |
        rate(karpenter_pods_state{state="pending"}[5m]) > 10
        and on() karpenter_nodes_created{capacity_type="spot"} == 0
      for: 5m
      labels:
        severity: critical
      annotations:
        summary: "Unable to provision spot instances"
        description: "Spot capacity exhausted, fallback to on-demand"

Lessons Learned

  1. Diversity is survival - More instance types = better spot availability
  2. PDBs are mandatory - Without them, all pods can evict simultaneously
  3. Layered fallback - Spot → Different spot family → On-demand
  4. Critical services need on-demand - Don’t run payment systems on spot
  5. Monitor interruption patterns - AWS publishes spot interruption frequency data (see the sketch below)
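
A quick way to gauge how volatile your chosen pools have been, using only the AWS CLI; the instance types and the 24-hour window are placeholders, and the EC2 Spot Instance Advisor adds interruption-frequency bands on top of this price history:

# Spot price history for the families this cluster leans on, last 24 hours (GNU date syntax)
aws ec2 describe-spot-price-history \
  --instance-types m5.xlarge c5.xlarge r5.xlarge \
  --product-descriptions "Linux/UNIX" \
  --start-time "$(date -u -d '24 hours ago' +%Y-%m-%dT%H:%M:%S)" \
  --query 'SpotPriceHistory[].{az:AvailabilityZone,type:InstanceType,price:SpotPrice,time:Timestamp}' \
  --output table

Pools whose price jumps around are usually the ones that interrupted us here; feeding that signal back into NodePool requirements is cheap insurance.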

Debugging Workflow: The Systematic Approach

Step 1: Quick Health Check

#!/bin/bash
# autoscaling-health-check.sh

echo "=== Kubernetes Autoscaling Health Check ==="
echo ""

# 1. HPA Status
echo "1. HPA Status:"
kubectl get hpa -A
echo ""

# 2. Check for unknown metrics
echo "2. HPAs with unknown metrics:"
kubectl get hpa -A -o json | jq -r '
  .items[] |
  select(.status.conditions[] | select(.type == "ScalingActive" and .status == "False")) |
  "\(.metadata.namespace)/\(.metadata.name): \(.status.conditions[] | select(.type == "ScalingActive").message)"
'
echo ""

# 3. Metrics Server
echo "3. Metrics Server:"
kubectl get pods -n kube-system -l k8s-app=metrics-server
kubectl top nodes | head -5
echo ""

# 4. VPA Status
echo "4. VPA Status:"
kubectl get vpa -A
echo ""

# 5. Cluster Autoscaler
echo "5. Cluster Autoscaler/Karpenter:"
kubectl get pods -n kube-system -l app.kubernetes.io/name=cluster-autoscaler
kubectl get pods -n karpenter
echo ""

# 6. Pending Pods
echo "6. Pending Pods:"
PENDING=$(kubectl get pods -A --field-selector=status.phase=Pending --no-headers | wc -l)
echo "Total pending pods: $PENDING"
if [ $PENDING -gt 0 ]; then
  kubectl get pods -A --field-selector=status.phase=Pending
fi
echo ""

# 7. Recent Events
echo "7. Recent Autoscaling Events (last 10):"
kubectl get events -A --sort-by='.lastTimestamp' | grep -E 'Scale|HPA|VPA|Evict' | tail -10
echo ""

# 8. Node Pressure
echo "8. Node Resource Pressure:"
kubectl describe nodes | grep -A 5 "Allocated resources"
echo ""

# 9. Failed to Schedule
echo "9. Pods Failed to Schedule:"
kubectl get events -A --field-selector reason=FailedScheduling | tail -10

Step 2: Deep Dive - HPA Not Scaling

#!/bin/bash
# debug-hpa.sh

HPA_NAME=$1
NAMESPACE=${2:-default}

echo "=== Debugging HPA: $HPA_NAME in $NAMESPACE ==="
echo ""

# 1. HPA Configuration
echo "1. HPA Configuration:"
kubectl get hpa $HPA_NAME -n $NAMESPACE -o yaml
echo ""

# 2. HPA Status
echo "2. HPA Status:"
kubectl describe hpa $HPA_NAME -n $NAMESPACE
echo ""

# 3. Current Metrics
echo "3. Current Metrics:"
kubectl get --raw "/apis/metrics.k8s.io/v1beta1/namespaces/$NAMESPACE/pods" | \
  jq -r ".items[] | select(.metadata.labels.app == \"$HPA_NAME\") | {name: .metadata.name, cpu: .containers[].usage.cpu, memory: .containers[].usage.memory}"
echo ""

# 4. Target Deployment
echo "4. Target Deployment:"
TARGET=$(kubectl get hpa $HPA_NAME -n $NAMESPACE -o jsonpath='{.spec.scaleTargetRef.name}')
kubectl get deployment $TARGET -n $NAMESPACE
echo ""

# 5. Pod Resource Requests
echo "5. Pod Resource Requests:"
kubectl get deployment $TARGET -n $NAMESPACE -o jsonpath='{.spec.template.spec.containers[].resources}'
echo ""

# 6. Scaling Events
echo "6. Recent Scaling Events:"
kubectl get events -n $NAMESPACE --field-selector involvedObject.name=$HPA_NAME --sort-by='.lastTimestamp' | tail -20
echo ""

# 7. HPA Controller Logs
echo "7. HPA Controller Logs (last 50 lines):"
kubectl logs -n kube-system deployment/kube-controller-manager --tail=50 | grep -i hpa
echo ""

# 8. Check if metrics are available
echo "8. Metrics Availability:"
if kubectl get --raw /apis/custom.metrics.k8s.io/v1beta1 &>/dev/null; then
  echo "✅ Custom metrics API available"
  kubectl get --raw /apis/custom.metrics.k8s.io/v1beta1 | jq -r '.resources[].name' | head -10
else
  echo "❌ Custom metrics API not available"
fi
echo ""

# 9. Prometheus Metrics (if using custom metrics)
echo "9. Prometheus Metrics:"
POD=$(kubectl get pods -n $NAMESPACE -l app=$TARGET -o jsonpath='{.items[0].metadata.name}')
kubectl exec -n $NAMESPACE $POD -- curl -s localhost:9090/metrics 2>/dev/null | grep -E "cpu|memory|requests" | head -10

Step 3: Deep Dive - Cluster Autoscaler Issues

#!/bin/bash
# debug-cluster-autoscaler.sh

echo "=== Debugging Cluster Autoscaler ==="
echo ""

# 1. Cluster Autoscaler Status
echo "1. Cluster Autoscaler Status:"
kubectl get pods -n kube-system -l app=cluster-autoscaler
echo ""

# 2. CA Logs (errors only)
echo "2. Recent Errors:"
CA_POD=$(kubectl get pods -n kube-system -l app=cluster-autoscaler -o jsonpath='{.items[0].metadata.name}')
kubectl logs -n kube-system $CA_POD --tail=100 | grep -i error
echo ""

# 3. Node Groups
echo "3. Node Groups Status:"
kubectl logs -n kube-system $CA_POD --tail=50 | grep -i "node group"
echo ""

# 4. Scale Up Events
echo "4. Recent Scale Up Attempts:"
kubectl logs -n kube-system $CA_POD --tail=100 | grep -i "scale up"
echo ""

# 5. Scale Down Events
echo "5. Recent Scale Down Attempts:"
kubectl logs -n kube-system $CA_POD --tail=100 | grep -i "scale down"
echo ""

# 6. Unschedulable Pods
echo "6. Unschedulable Pods:"
kubectl logs -n kube-system $CA_POD --tail=50 | grep -i "unschedulable"
echo ""

# 7. Node Group Sizes (AWS)
echo "7. AWS ASG Sizes:"
aws autoscaling describe-auto-scaling-groups \
  --query 'AutoScalingGroups[?contains(Tags[?Key==`k8s.io/cluster-autoscaler/enabled`].Value, `true`)].{Name:AutoScalingGroupName,Desired:DesiredCapacity,Min:MinSize,Max:MaxSize,Current:Instances|length(@)}' \
  --output table
echo ""

# 8. Node Capacity
echo "8. Current Node Capacity:"
kubectl get nodes -o custom-columns=NAME:.metadata.name,CPU:.status.capacity.cpu,MEMORY:.status.capacity.memory,PODS:.status.capacity.pods
echo ""

# 9. ConfigMap
echo "9. Cluster Autoscaler ConfigMap:"
kubectl get configmap cluster-autoscaler-status -n kube-system -o yaml

Common Failure Patterns

Pattern 1: The Thundering Herd

Symptom: All pods restart simultaneously

Cause:

  • VPA in Recreate mode with no PDB
  • All pods get new resource recommendations at once
  • VPA evicts all pods simultaneously

Detection:

# Check for mass pod restarts
kubectl get events -A | grep -E "Killing|Evicted" | wc -l

Fix:

# Add PDB
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: my-app-pdb
spec:
  minAvailable: 70%
  selector:
    matchLabels:
      app: my-app

Pattern 2: The Resource Starvation

Symptom: HPA scales up but pods remain pending

Cause:

  • No cluster autoscaler
  • Node resources exhausted
  • Pod resource requests too large

Detection:

# Check pending pods
kubectl get pods -A --field-selector=status.phase=Pending

# Check node allocatable resources
kubectl describe nodes | grep -A 10 "Allocated resources"

Fix:

# Enable Cluster Autoscaler or add nodes manually
# Or reduce resource requests (see the sketch below)
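
If the quickest unblock is shrinking requests, kubectl can do it without hand-editing manifests. A sketch; the deployment, namespace, container name, and values are placeholders, and new requests should come from observed usage plus headroom rather than guesses:

# Check actual usage first, then right-size the requests
kubectl top pods -n production -l app=my-app --containers

kubectl set resources deployment/my-app -n production \
  --containers=api \
  --requests=cpu=250m,memory=256Mi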

Pattern 3: The Metric Lag

Symptom: HPA scales late, after traffic spike already passed

Cause:

  • Long metric scrape interval (15s)
  • Long HPA evaluation interval (15s)
  • Total lag: 30-60 seconds

Detection:

# Check Metrics Server update frequency
kubectl get --raw /apis/metrics.k8s.io/v1beta1/nodes | jq -r '.items[0].timestamp'
# Wait 10 seconds
kubectl get --raw /apis/metrics.k8s.io/v1beta1/nodes | jq -r '.items[0].timestamp'
# Compare timestamps

Fix:

# Use custom metrics with a lower scrape interval (see the sketch below)
# Or implement predictive scaling
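
If the HPA is driven by a Prometheus-backed custom metric (via prometheus-adapter or KEDA), the scrape interval is the first place the lag hides. A sketch of a tighter scrape using the Prometheus Operator; the app label and port name are placeholders:

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: api-server-fast-scrape
  namespace: production
spec:
  selector:
    matchLabels:
      app: api-server
  endpoints:
  - port: metrics          # named port on the Service exposing /metrics
    interval: 5s           # default is often 30s; 5s shaves most of the metric lag
    scrapeTimeout: 4s

Remember the total lag is roughly the scrape interval plus the HPA sync period (15 seconds by default), so shaving one leg only helps so much.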

Emergency Runbook

Scenario: HPA Not Scaling

# 1. Check HPA status
kubectl describe hpa <name> -n <namespace>

# 2. Verify Metrics Server
kubectl top pods -n <namespace>

# 3. Check resource requests are set
kubectl get deployment <name> -n <namespace> -o yaml | grep -A 10 resources

# 4. Manual scale if urgent
kubectl scale deployment <name> --replicas=<N> -n <namespace>

# 5. Check HPA controller logs
kubectl logs -n kube-system -l component=kube-controller-manager | grep HPA

Scenario: Mass OOMKills

# 1. Identify OOMKilled pods
kubectl get events -A --field-selector reason=OOMKilling

# 2. Check memory usage patterns
kubectl top pods -A --sort-by=memory

# 3. Emergency memory increase
kubectl patch deployment <name> -n <namespace> -p '
{
  "spec": {
    "template": {
      "spec": {
        "containers": [{
          "name": "<container>",
          "resources": {
            "limits": {"memory": "4Gi"}
          }
        }]
      }
    }
  }
}'

# 4. Disable VPA temporarily
kubectl patch vpa <name> -n <namespace> -p '
{"spec":{"updatePolicy":{"updateMode":"Off"}}}'

Scenario: Spot Instance Cascade

# 1. Check node status
kubectl get nodes -o wide

# 2. Check Karpenter/CA logs
kubectl logs -n karpenter -l app.kubernetes.io/name=karpenter --tail=100

# 3. Force on-demand scaling
# AWS: Update ASG to use on-demand
aws autoscaling update-auto-scaling-group \
  --auto-scaling-group-name <name> \
  --desired-capacity <N>

# 4. Emergency pod rescheduling
kubectl drain <spot-node> --ignore-daemonsets --delete-emptydir-data

Lessons from the Trenches

Top 10 Production Lessons

  1. Always set readiness probes - Traffic to non-ready pods kills performance
  2. HPA + CA delays compound - Plan for 5-minute worst-case scaling time
  3. VPA needs safety margins - Set limits 2x higher than recommendations
  4. PDBs are not optional - Without them, chaos ensues
  5. Spot needs diversity - Single instance type = guaranteed interruption
  6. Monitor metric lag - 60-second lag can cause total failure during spikes
  7. Pre-warm for known events - Black Friday, etc. need manual pre-scaling
  8. Test failover paths - Spot → On-demand fallback must be tested
  9. API server capacity matters - HPA can overwhelm API server
  10. OOMKills propagate - One OOM can cascade to entire service

These incidents also shaped the SLO targets we now hold autoscaling to:

HPA Scaling Latency: P95 < 60 seconds
Cluster Autoscaler Provisioning: P95 < 5 minutes
Pod Startup Time: P95 < 90 seconds
OOMKill Rate: < 0.1% of pod starts
Spot Interruption Handling: 100% graceful (no dropped requests)
Autoscaling Accuracy: ±10% of optimal replica count
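
Two of these SLOs can be read straight from Prometheus, assuming the kubelet and kube-state-metrics are scraped; these are sketches, not canonical dashboards:

# P95 pod startup time across the cluster (kubelet histogram)
histogram_quantile(0.95, sum by (le) (rate(kubelet_pod_start_duration_seconds_bucket[10m])))

# Replicas the HPA still owes each workload (scaling-lag indicator)
kube_horizontalpodautoscaler_status_desired_replicas
  - kube_horizontalpodautoscaler_status_current_replicas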

Key Takeaways

  1. Production is different - Theory works until 2 AM on Black Friday
  2. Layered defenses - Multiple fallback strategies save the day
  3. Monitor everything - You can’t fix what you can’t see
  4. Test failure modes - Chaos engineering finds issues before customers do
  5. Document incidents - Today’s postmortem is tomorrow’s runbook

Conclusion

Production Kubernetes autoscaling teaches lessons that can’t be learned from documentation:

  • Black Friday taught us: Pre-warming and readiness probes are critical
  • The OOMKill loop taught us: VPA needs safety margins
  • The spot cascade taught us: Instance diversity saves the day

The best SRE teams learn from failures, document thoroughly, and build systems that fail gracefully. Every 2 AM page makes the system more resilient.

Remember: The goal isn’t to eliminate failures—it’s to learn from them and ensure they don’t happen twice.

Next up: Part 8 - Security, Compliance & Governance 🔒

Stay resilient! 🛡️