Troubleshooting Workflow
Pod not running?
├── Pending → Check scheduling (resources, node selector, taints)
├── CrashLoopBackOff → Check logs, command, health probes
├── ImagePullBackOff → Check image name, registry auth
├── OOMKilled → Check memory usage; raise memory limits or fix the leak
└── Evicted → Node under resource pressure
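The decision tree above can be sketched as a small shell helper. This is only an illustration: the function name `triage_hint` is hypothetical, and the `ErrImagePull` case is an added assumption, since that status usually precedes `ImagePullBackOff`.

```shell
# Hypothetical helper (not part of kubectl): map a pod STATUS string
# to the first thing to check, following the workflow above.
triage_hint() {
  case "$1" in
    Pending)                       echo "check scheduling: resources, node selector, taints" ;;
    CrashLoopBackOff)              echo "check logs, command, health probes" ;;
    ImagePullBackOff|ErrImagePull) echo "check image name, registry auth" ;;
    OOMKilled)                     echo "check memory usage and limits" ;;
    Evicted)                       echo "check node resource pressure" ;;
    *)                             echo "run kubectl describe pod for details" ;;
  esac
}

# Usage (assumes kubectl access to a cluster; STATUS is the third column):
#   triage_hint "$(kubectl get pod <pod-name> --no-headers | awk '{print $3}')"
```

Keeping the mapping in one `case` statement makes it easy to extend as you meet new failure states.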
Diagnostic Commands
# Pod status and events
kubectl get pods -o wide
kubectl describe pod <pod-name>
# Container logs
kubectl logs <pod-name>
kubectl logs <pod-name> -c <container> # Multi-container
kubectl logs <pod-name> --previous # Previous crash
# Execute into a running pod
kubectl exec -it <pod-name> -- /bin/sh
# Debug with ephemeral container
kubectl debug <pod-name> -it --image=busybox
# Resource usage
kubectl top pods
kubectl top nodes
# Events (all namespaces)
kubectl get events -A --sort-by='.lastTimestamp'
kubectl get events -A --field-selector type=Warning
Common Issues & Fixes
CrashLoopBackOff
# Check logs for the error
kubectl logs <pod-name> --previous
# Common causes:
# 1. Application error on startup
# 2. Missing environment variables
# 3. Wrong command/entrypoint
# 4. Liveness probe failing before the application has finished starting
# Fix: Add a startupProbe or increase initialDelaySeconds
ImagePullBackOff
# Check the image name
kubectl describe pod <pod-name> | grep -A5 "Events"
# Common causes:
# 1. Typo in image name
# 2. Private registry without imagePullSecrets
# 3. Image tag doesn't exist
# Fix: Create image pull secret
kubectl create secret docker-registry regcred \
  --docker-server=registry.example.com \
  --docker-username=user \
  --docker-password=pass
# Then reference the secret in the pod spec under imagePullSecrets
Pending Pods
# Check why pod isn't scheduled
kubectl describe pod <pod-name> | grep -A10 "Events"
# Common causes:
# 1. Insufficient resources (CPU/memory)
# 2. No nodes match nodeSelector
# 3. PVC not bound
# 4. Taints without tolerations
# Check node resources
kubectl describe nodes | grep -A5 "Allocated resources"
Networking Issues
# Test DNS resolution
kubectl run dns-test --image=busybox:1.36 --rm -it -- nslookup kubernetes.default
# Test service connectivity
kubectl run curl-test --image=curlimages/curl --rm -it -- curl http://service-name
# Check endpoints
kubectl get endpoints <service-name>
# Check network policies
kubectl get networkpolicies -A
Resource Management
Requests vs Limits
| Setting | Purpose | Effect |
|---|---|---|
| Requests | Guaranteed minimum | Used for scheduling |
| Limits | Maximum allowed | CPU is throttled; container is OOM-killed if memory is exceeded |
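Requests and limits are written as Kubernetes quantities, such as `100m` for CPU and `128Mi` for memory, and a request must not exceed its limit. As a rough sketch, the helpers below convert such quantities to plain integers so the two can be compared. The function names are hypothetical, and only the `m` CPU suffix and the `Ki`/`Mi`/`Gi` memory suffixes are handled.

```shell
# Hypothetical helpers (names are illustrative, not part of kubectl).

# Convert a CPU quantity ("100m" or whole cores like "2") to millicores.
# Fractional cores such as "0.5" are not handled in this sketch.
cpu_to_millicores() {
  case "$1" in
    *m) echo "${1%m}" ;;
    *)  echo $(( $1 * 1000 )) ;;
  esac
}

# Convert a memory quantity with a binary suffix (Ki, Mi, Gi) to bytes.
mem_to_bytes() {
  case "$1" in
    *Ki) echo $(( ${1%Ki} * 1024 )) ;;
    *Mi) echo $(( ${1%Mi} * 1024 * 1024 )) ;;
    *Gi) echo $(( ${1%Gi} * 1024 * 1024 * 1024 )) ;;
    *)   echo "$1" ;;   # assume plain bytes
  esac
}

# Succeeds when the memory request ($1) does not exceed the memory limit ($2).
mem_request_within_limit() {
  [ "$(mem_to_bytes "$1")" -le "$(mem_to_bytes "$2")" ]
}
```

For example, `mem_request_within_limit 128Mi 256Mi` succeeds, while `mem_request_within_limit 1Gi 256Mi` fails.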
resources:
  requests:
    memory: "128Mi"   # Guaranteed
    cpu: "100m"       # 0.1 CPU cores
  limits:
    memory: "256Mi"   # Max before OOMKill
    cpu: "500m"       # Throttled if exceeded
Resource Quotas
apiVersion: v1
kind: ResourceQuota
metadata:
  name: team-quota
  namespace: production
spec:
  hard:
    requests.cpu: "10"
    requests.memory: "20Gi"
    limits.cpu: "20"
    limits.memory: "40Gi"
    pods: "50"
    services: "10"
    persistentvolumeclaims: "20"
LimitRange (Defaults)
apiVersion: v1
kind: LimitRange
metadata:
  name: default-limits
  namespace: production
spec:
  limits:
  - default:
      memory: "256Mi"
      cpu: "200m"
    defaultRequest:
      memory: "128Mi"
      cpu: "100m"
    type: Container
Production Best Practices
Deployment Checklist
- Resource requests and limits set
- Liveness and readiness probes configured
- Pod Disruption Budget defined
- Anti-affinity for high availability
- Security context (non-root, read-only FS)
- Image tags pinned (no :latest)
- Namespace isolation
- Network policies applied
- Secrets encrypted at rest
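Some checklist items can be checked mechanically. As one example, the sketch below tests whether an image reference is pinned, meaning it carries a digest or an explicit tag other than `latest`. The function name `is_pinned_image` is hypothetical, and the parsing is deliberately simple.

```shell
# Hypothetical helper: succeeds if an image reference is pinned,
# i.e. it has a digest or an explicit tag other than "latest".
is_pinned_image() {
  ref="$1"
  case "$ref" in
    *@sha256:*) return 0 ;;   # pinned by digest
  esac
  # Drop the registry/repository path so a registry port is not mistaken for a tag.
  last="${ref##*/}"
  case "$last" in
    *:latest) return 1 ;;     # explicitly floating
    *:*)      return 0 ;;     # explicit tag
    *)        return 1 ;;     # no tag means "latest" by default
  esac
}

# Usage (assumes kubectl access to a cluster): flag unpinned images.
#   kubectl get pods -A -o jsonpath='{.items[*].spec.containers[*].image}' \
#     | tr -s '[[:space:]]' '\n' \
#     | while read -r img; do is_pinned_image "$img" || echo "unpinned: $img"; done
```

A check like this fits naturally into a CI step that runs before manifests are applied.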
Pod Disruption Budget
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: app-pdb
spec:
  minAvailable: 2   # Or maxUnavailable: 1
  selector:
    matchLabels:
      app: nextgen-app
Anti-Affinity (Spread Across Nodes)
spec:
  affinity:
    podAntiAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 100
        podAffinityTerm:
          labelSelector:
            matchExpressions:
            - key: app
              operator: In
              values: ["nextgen-app"]
          topologyKey: kubernetes.io/hostname
Priority Classes
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: high-priority
value: 1000000
globalDefault: false
description: "Critical production workloads"
---
# Use in pod spec
spec:
  priorityClassName: high-priority
Monitoring Kubernetes
# Cluster health
kubectl cluster-info
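kubectl get componentstatuses          # Deprecated since Kubernetes v1.19
kubectl get --raw='/readyz?verbose'    # API server readiness checks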
# Node conditions
kubectl get nodes -o custom-columns='NAME:.metadata.name,READY:.status.conditions[?(@.type=="Ready")].status'
# Resource pressure
kubectl top nodes
kubectl describe node <node> | grep -A5 Conditions
Summary
You've learned:
- Systematic troubleshooting for common pod failures
- Diagnostic commands and debugging techniques
- Resource management with requests, limits, and quotas
- Production best practices (PDB, anti-affinity, security)
- Monitoring cluster health and resource pressure
Next Steps
You now have a complete Kubernetes foundation. Apply these skills with Helm charts, CI/CD pipelines, and monitoring for production-grade container orchestration.