Kubernetes Troubleshooting & Production Best Practices

Troubleshooting Workflow

Pod not running?
├── Pending → Check scheduling (resources, node selector, taints)
├── CrashLoopBackOff → Check logs, command, health probes
├── ImagePullBackOff → Check image name, registry auth
├── OOMKilled → Check memory usage; raise the memory limit or fix the leak
└── Evicted → Node under resource pressure
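
The decision tree above can also be written as a small lookup, which may be convenient in a runbook script. This is only a sketch: `first_check` is a hypothetical helper, not a kubectl command, and it simply maps a status string (as shown in the STATUS column of `kubectl get pods`) to the first thing to inspect.

```shell
# first_check: hypothetical helper mirroring the decision tree above.
# ErrImagePull is included because it is the status reported just
# before a pod backs off into ImagePullBackOff.
first_check() {
  case "$1" in
    Pending)                       echo "scheduling: resources, node selector, taints" ;;
    CrashLoopBackOff)              echo "logs, command, health probes" ;;
    ImagePullBackOff|ErrImagePull) echo "image name, registry auth" ;;
    OOMKilled)                     echo "memory usage and limits" ;;
    Evicted)                       echo "node resource pressure" ;;
    *)                             echo "kubectl describe pod for events" ;;
  esac
}
```

One possible use is to feed it the STATUS column of a single pod, for example `first_check "$(kubectl get pod <pod-name> --no-headers | awk '{print $3}')"`.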

Diagnostic Commands

# Pod status and events
kubectl get pods -o wide
kubectl describe pod <pod-name>
 
# Container logs
kubectl logs <pod-name>
kubectl logs <pod-name> -c <container>  # Multi-container
kubectl logs <pod-name> --previous       # Previous crash
 
# Execute into a running pod
kubectl exec -it <pod-name> -- /bin/sh
 
# Debug with ephemeral container
kubectl debug <pod-name> -it --image=busybox
 
# Resource usage
kubectl top pods
kubectl top nodes
 
# Events (all namespaces; drop -A for the current namespace only)
kubectl get events -A --sort-by='.lastTimestamp'
kubectl get events -A --field-selector type=Warning

Common Issues & Fixes

CrashLoopBackOff

# Check logs for the error
kubectl logs <pod-name> --previous
 
# Common causes:
# 1. Application error on startup
# 2. Missing environment variables
# 3. Wrong command/entrypoint
# 4. Liveness probe failing before the app has finished starting

# Fix for cause 4: add a startupProbe, or raise initialDelaySeconds on the liveness probe
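
A startupProbe holds back the liveness probe until it first succeeds, and the container is restarted only after `failureThreshold` consecutive failures, so the startup budget is roughly `failureThreshold × periodSeconds`. The sketch below sizes a probe from a worst-case startup time. The helper name, the `/healthz` path, and port 8080 are placeholders chosen for illustration, not values from this guide.

```shell
# startup_probe_yaml: hypothetical helper that prints a startupProbe
# fragment sized for a given worst-case startup time.
#   $1 = worst-case startup time in seconds
#   $2 = probe period in seconds (defaults to 10)
startup_probe_yaml() {
  max_startup_seconds=$1
  period=${2:-10}
  # Round up so that failureThreshold * periodSeconds >= max_startup_seconds
  threshold=$(( (max_startup_seconds + period - 1) / period ))
  cat <<EOF
startupProbe:
  httpGet:
    path: /healthz        # placeholder path
    port: 8080            # placeholder port
  periodSeconds: ${period}
  failureThreshold: ${threshold}
EOF
}
```

Calling `startup_probe_yaml 95 10` yields a fragment with `failureThreshold: 10`, which could be pasted under the container definition in a Deployment manifest.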

ImagePullBackOff

# Check the pull error reported in the pod events
kubectl describe pod <pod-name> | grep -A5 "Events"
 
# Common causes:
# 1. Typo in image name
# 2. Private registry without imagePullSecrets
# 3. Image tag doesn't exist
 
# Fix for cause 2: create an image pull secret in the pod's namespace,
# then reference it under imagePullSecrets in the pod spec
kubectl create secret docker-registry regcred \
  --docker-server=registry.example.com \
  --docker-username=user \
  --docker-password=pass
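
Creating the secret does not make pods use it. It has to be referenced, either under `imagePullSecrets` in the pod spec or by attaching it to the service account the pods run as. The sketch below builds the patch for the second option. `pull_secret_patch` is a hypothetical helper, and using the `default` service account is an assumption.

```shell
# pull_secret_patch: hypothetical helper that prints the JSON patch
# attaching an image pull secret to a service account.
#   $1 = name of the docker-registry secret
pull_secret_patch() {
  printf '{"imagePullSecrets": [{"name": "%s"}]}\n' "$1"
}

# Usage (needs cluster access, so it is left commented out here):
# kubectl patch serviceaccount default -p "$(pull_secret_patch regcred)"
```

Only pods created after the patch pick up the secret, so existing pods stuck in ImagePullBackOff would need to be recreated.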

Pending Pods

# Check why pod isn't scheduled
kubectl describe pod <pod-name> | grep -A10 "Events"
 
# Common causes:
# 1. Insufficient resources (CPU/memory)
# 2. No nodes match nodeSelector
# 3. PVC not bound
# 4. Taints without tolerations
 
# Check node resources
kubectl describe nodes | grep -A5 "Allocated resources"

Networking Issues

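# Test DNS resolution (the FQDN assumes the default cluster domain, cluster.local)
kubectl run dns-test --image=busybox:1.36 --rm -it --restart=Never -- nslookup kubernetes.default.svc.cluster.local
 
# Test service connectivity
kubectl run curl-test --image=curlimages/curl --rm -it --restart=Never -- curl http://<service-name>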
 
# Check endpoints
kubectl get endpoints <service-name>
 
# Check network policies
kubectl get networkpolicies -A

Resource Management

Requests vs Limits

Setting	Purpose	Effect
Requests	Amount reserved for the container	Used by the scheduler to place the pod
Limits	Maximum the container may use	Memory: container is OOM-killed if exceeded; CPU: throttled

resources:
  requests:
    memory: "128Mi"    # Guaranteed
    cpu: "100m"        # 0.1 CPU cores
  limits:
    memory: "256Mi"    # Max before OOMKill
    cpu: "500m"        # Throttled if exceeded
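
CPU quantities can be written either in cores (`0.1`, `2`) or in millicores (`100m`), which makes manifests harder to compare at a glance. A minimal sketch of normalising both forms to millicores follows. `cpu_to_millicores` is a hypothetical helper and handles only these two notations.

```shell
# cpu_to_millicores: hypothetical helper that converts a CPU quantity
# ("100m", "0.5", "2") to an integer number of millicores.
cpu_to_millicores() {
  case "$1" in
    *m) echo "${1%m}" ;;                                              # already millicores
    *)  awk -v cores="$1" 'BEGIN { printf "%.0f\n", cores * 1000 }' ;; # cores -> millicores
  esac
}
```

With this, the request of `100m` and a value written as `0.1` both come out as 100.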

Resource Quotas

apiVersion: v1
kind: ResourceQuota
metadata:
  name: team-quota
  namespace: production
spec:
  hard:
    requests.cpu: "10"
    requests.memory: "20Gi"
    limits.cpu: "20"
    limits.memory: "40Gi"
    pods: "50"
    services: "10"
    persistentvolumeclaims: "20"

LimitRange (Defaults)

apiVersion: v1
kind: LimitRange
metadata:
  name: default-limits
  namespace: production
spec:
  limits:
  - default:
      memory: "256Mi"
      cpu: "200m"
    defaultRequest:
      memory: "128Mi"
      cpu: "100m"
    type: Container
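
Taken together, the quota and the default requests above bound how many default-sized containers the `production` namespace can hold. The following back-of-the-envelope check is a sketch that assumes every container relies on the LimitRange defaults and every pod has exactly one container. `fits_in_quota` is a hypothetical helper.

```shell
# fits_in_quota: hypothetical helper; integer-divides a quota by a
# per-container request. Both arguments must use the same unit.
fits_in_quota() {
  echo $(( $1 / $2 ))
}

cpu_fit=$(fits_in_quota 10000 100)   # requests.cpu 10 cores = 10000m, default request 100m
mem_fit=$(fits_in_quota 20480 128)   # requests.memory 20Gi = 20480Mi, default request 128Mi
echo "cpu allows $cpu_fit, memory allows $mem_fit, pod quota allows 50"
```

Under those assumptions the `pods: "50"` entry is the limit reached first, well before the CPU or memory request quotas.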

Production Best Practices

Deployment Checklist

Resource requests and limits set
Liveness and readiness probes configured
Pod Disruption Budget defined
Anti-affinity for high availability
Security context set (non-root user, read-only root filesystem)
Image tags pinned (no :latest)
Namespace isolation
Network policies applied
Secrets encrypted at rest

Pod Disruption Budget

apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: app-pdb
spec:
  minAvailable: 2    # Or maxUnavailable: 1
  selector:
    matchLabels:
      app: nextgen-app
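
With `minAvailable: 2`, a voluntary eviction (for example during `kubectl drain`) is allowed only while more than two matching pods are healthy. The arithmetic is simple enough to sketch. `allowed_disruptions` is a hypothetical helper that mirrors the idea rather than the controller's exact implementation.

```shell
# allowed_disruptions: hypothetical helper.
#   $1 = number of healthy pods matching the selector
#   $2 = minAvailable from the PodDisruptionBudget
allowed_disruptions() {
  healthy=$1
  min_available=$2
  allowed=$(( healthy - min_available ))
  # A budget can never permit a negative number of evictions
  if [ "$allowed" -lt 0 ]; then
    allowed=0
  fi
  echo "$allowed"
}
```

With three healthy replicas one pod may be evicted at a time. With only two, the result is 0 and a node drain would wait indefinitely, so a budget like this is usually paired with at least three replicas. The live value appears in the ALLOWED DISRUPTIONS column of `kubectl get pdb app-pdb`.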

Anti-Affinity (Spread Across Nodes)

spec:
  affinity:
    podAntiAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 100
        podAffinityTerm:
          labelSelector:
            matchExpressions:
            - key: app
              operator: In
              values: ["nextgen-app"]
          topologyKey: kubernetes.io/hostname

Priority Classes

apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: high-priority
value: 1000000
globalDefault: false
description: "Critical production workloads"
---
# Reference the class from a pod spec (fragment, not a standalone manifest)
spec:
  priorityClassName: high-priority

Monitoring Kubernetes

# Cluster health
kubectl cluster-info
kubectl get --raw='/readyz?verbose'   # replaces componentstatuses, deprecated since v1.19
 
# Node conditions (READY should be True)
kubectl get nodes -o custom-columns='NAME:.metadata.name,READY:.status.conditions[?(@.type=="Ready")].status'
 
# Resource pressure
kubectl top nodes
kubectl describe node <node> | grep -A8 Conditions

Summary

You've learned:

Systematic troubleshooting for common pod failures
Diagnostic commands and debugging techniques
Resource management with requests, limits, and quotas
Production best practices (PDB, anti-affinity, security)
Monitoring cluster health and resource pressure

Next Steps

You now have a complete Kubernetes foundation. Apply these skills with Helm charts, CI/CD pipelines, and monitoring for production-grade container orchestration.
