Alerting & Incident Response

20 min | Lesson 3 of 7

Learning Objectives

  • Design meaningful alerts that reduce noise
  • Configure multi-channel notification routing
  • Build incident response runbooks
  • Implement SLOs and error budgets

Alerting Philosophy

Good alerts are actionable, relevant, and timely. Bad alerts cause alert fatigue and get ignored.

Alert Quality Checklist

Good Alert             | Bad Alert
-----------------------|-----------------------
Requires human action  | Informational only
Indicates user impact  | Internal metric spike
Has clear remediation  | "Something is wrong"
Fires rarely           | Fires constantly
Grouped logically      | One per symptom

Severity Levels

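Level    | Meaning                 | Response Time     | Example
---------|-------------------------|-------------------|----------------------
Critical | Service down, data loss | Immediate (page)  | Database unreachable
Warning  | Degraded performance    | Within hours      | CPU > 80% for 10 min
Info     | Notable event           | Next business day | Deployment completed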

Alertmanager Configuration

# alertmanager.yml
global:
  resolve_timeout: 5m
  slack_api_url: 'https://hooks.slack.com/services/...'
 
route:
  receiver: 'default'
  group_by: ['alertname', 'cluster']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
 
  routes:
    # `matchers` replaces the deprecated `match` field (Alertmanager 0.22+)
    - matchers:
        - severity="critical"
      receiver: 'pagerduty-critical'
      repeat_interval: 15m

    - matchers:
        - severity="warning"
      receiver: 'slack-warnings'
      repeat_interval: 2h
 
receivers:
  - name: 'default'
    slack_configs:
      - channel: '#monitoring'
        send_resolved: true
 
  - name: 'pagerduty-critical'
    pagerduty_configs:
      - service_key: '<pagerduty-key>'
 
  - name: 'slack-warnings'
    slack_configs:
      - channel: '#alerts-warning'
        send_resolved: true
 
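inhibit_rules:
  # A firing critical alert mutes the warning alert that shares its
  # alertname and instance, so only the more severe one notifies.
  - source_matchers:
      - severity="critical"
    target_matchers:
      - severity="warning"
    equal: ['alertname', 'instance']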
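
To make the routing tree and the inhibit rule above concrete, here is a simplified Python model of how an alert finds its receiver and when a warning is muted. It is a sketch, not Alertmanager's actual implementation: it assumes exact label matching, no `continue: true`, and no nested routes, and the names `pick_receiver` and `is_inhibited` are hypothetical.

```python
# Simplified model of the routing tree and inhibit rule in alertmanager.yml.
# Assumptions: exact label matching only, no `continue: true`, no nested routes.
# Function names are hypothetical; this is not Alertmanager's real code.

ROUTES = [
    ({"severity": "critical"}, "pagerduty-critical"),
    ({"severity": "warning"}, "slack-warnings"),
]
DEFAULT_RECEIVER = "default"


def pick_receiver(labels: dict) -> str:
    """Return the receiver of the first child route whose labels all match."""
    for match, receiver in ROUTES:
        if all(labels.get(k) == v for k, v in match.items()):
            return receiver
    return DEFAULT_RECEIVER  # no child matched: fall back to the top-level route


def is_inhibited(target: dict, firing: list) -> bool:
    """Mirror the inhibit rule: a firing critical alert mutes a warning
    alert that shares the same alertname and instance."""
    if target.get("severity") != "warning":
        return False
    return any(
        src.get("severity") == "critical"
        and all(src.get(k) == target.get(k) for k in ("alertname", "instance"))
        for src in firing
    )


critical = {"alertname": "HighLatency", "instance": "api-1", "severity": "critical"}
warning = {"alertname": "HighLatency", "instance": "api-1", "severity": "warning"}
info = {"alertname": "DeployDone", "severity": "info"}

print(pick_receiver(critical))            # pagerduty-critical
print(pick_receiver(info))                # default
print(is_inhibited(warning, [critical]))  # True
```

The info alert lands in `#monitoring` because neither child route matches it, which is why the top-level `receiver` should always point somewhere a human will eventually look.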

SLOs and Error Budgets

Service Level Objectives

Term         | Definition              | Example
-------------|-------------------------|------------------------
SLI          | Service Level Indicator | Request success rate
SLO          | Service Level Objective | 99.9% success rate
SLA          | Service Level Agreement | Contractual guarantee
Error Budget | Allowed failures        | 0.1% = 43 min/month
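
The "43 min/month" figure in the table comes from simple arithmetic, sketched below. The function name `error_budget_minutes` is hypothetical, and the calculation assumes a 30-day month and treats the budget as minutes of total failure.

```python
def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Minutes of total failure allowed in the window for a given SLO.

    Assumes a 30-day window by default; the budget is simply (1 - SLO)
    multiplied by the number of minutes in the window.
    """
    return (1.0 - slo) * window_days * 24 * 60


for slo in (0.99, 0.999, 0.9999):
    print(f"SLO {slo:.2%}: {error_budget_minutes(slo):.1f} min per 30 days")
```

Each extra "nine" cuts the budget by a factor of ten, which is why the SLO target should be chosen deliberately rather than set as high as possible.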

SLO-Based Alerts

groups:
  - name: slo_alerts
    rules:
      # Burn rate alert — fast burn
      - alert: HighBurnRate_Fast
        expr: |
          (
            sum(rate(http_requests_total{status_code=~"5.."}[5m]))
            /
            sum(rate(http_requests_total[5m]))
          ) > 14.4 * 0.001
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Error budget burning 14.4x faster than allowed"
 
      # Burn rate alert — slow burn
      - alert: HighBurnRate_Slow
        expr: |
          (
            sum(rate(http_requests_total{status_code=~"5.."}[1h]))
            /
            sum(rate(http_requests_total[1h]))
          ) > 3 * 0.001
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "Error budget burning 3x faster than allowed"
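
The multipliers 14.4 and 3 in the rules above are burn rates: how many times faster than allowed the error budget is being spent. The sketch below shows where the alert thresholds come from and how long the budget would last at each rate. It assumes a 99.9% SLO (matching the `0.001` in the expressions) and a 30-day SLO window; the function names are hypothetical.

```python
SLO = 0.999                 # matches the 0.001 budget used in the alert rules
WINDOW_HOURS = 30 * 24      # assumed 30-day SLO window


def burn_rate(error_ratio: float, slo: float = SLO) -> float:
    """How many times faster than allowed the budget is being spent."""
    return error_ratio / (1.0 - slo)


def hours_to_exhaustion(rate: float, window_hours: float = WINDOW_HOURS) -> float:
    """Hours until the whole budget is gone if the burn rate stays constant."""
    return window_hours / rate


for name, multiplier in (("fast", 14.4), ("slow", 3.0)):
    threshold = multiplier * (1.0 - SLO)  # right-hand side of the alert expr
    print(f"{name}: fires above {threshold:.2%} errors, "
          f"budget lasts {hours_to_exhaustion(multiplier):.0f} h at that rate")
```

Under these assumptions a sustained 14.4x burn empties a 30-day budget in about 50 hours, while a 3x burn takes about 10 days, which is why the first rule pages and the second only warns.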

Incident Response Runbook

Template

# Incident: [Alert Name]
 
## Symptoms
- What the user experiences
- What metrics show
 
## Diagnosis
1. Check service health: `kubectl get pods -l app=<service>`
2. Check logs: `kubectl logs -l app=<service> --tail=100`
3. Check dependencies: `curl http://<dependency>/health`
4. Check resources: `kubectl top pods`
 
## Remediation
1. **Quick fix:** Restart pods — `kubectl rollout restart deployment/<name>`
2. **Scale up:** `kubectl scale deployment/<name> --replicas=5`
3. **Rollback:** `kubectl rollout undo deployment/<name>`
 
## Escalation
- L1: On-call engineer (first 15 min)
- L2: Team lead (after 30 min)
- L3: Engineering manager (after 1 hour)
 
## Post-Incident
- [ ] Timeline documented
- [ ] Root cause identified
- [ ] Fix deployed
- [ ] Post-mortem scheduled
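
The escalation section of the template can also be encoded so that tooling (a chat bot or paging script, for example) can report who should be engaged at any point in an incident. This is a minimal sketch: the thresholds are taken from the template above, "after 30 min" is assumed to mean "at 30 minutes or later", and the function name `escalation_tier` is hypothetical.

```python
# Escalation tiers from the runbook template:
# (minutes since the alert fired, who to engage).
ESCALATION = [
    (0, "L1: On-call engineer"),
    (30, "L2: Team lead"),
    (60, "L3: Engineering manager"),
]


def escalation_tier(elapsed_minutes: float) -> str:
    """Return the highest tier whose time threshold has been reached."""
    current = ESCALATION[0][1]
    for threshold, tier in ESCALATION:
        if elapsed_minutes >= threshold:
            current = tier
    return current


for minutes in (10, 45, 90):
    print(f"{minutes:>3} min -> {escalation_tier(minutes)}")
```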

Monitoring Best Practices

The Four Golden Signals (Google SRE)

Signal     | What to Measure  | PromQL Example
-----------|------------------|----------------------------------------------------------------------
Latency    | Request duration | histogram_quantile(0.95, rate(http_duration_seconds_bucket[5m]))
Traffic    | Request rate     | sum(rate(http_requests_total[5m]))
Errors     | Error rate       | sum(rate(http_requests_total{status_code=~"5.."}[5m]))
Saturation | Resource usage   | container_memory_usage_bytes / container_spec_memory_limit_bytes
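
To show what each signal measures independently of PromQL, the sketch below computes all four from a small, made-up sample of requests. The data, the 60-second window, and the memory figures are assumptions for illustration only. The percentile uses the nearest-rank method, so it will not match `histogram_quantile` exactly, since that function interpolates within histogram buckets.

```python
import math

# Hypothetical sample: (duration in seconds, HTTP status) for requests
# observed during a 60-second window.
WINDOW_SECONDS = 60
requests = [(0.12, 200), (0.08, 200), (0.30, 200), (0.95, 500), (0.10, 200),
            (0.15, 200), (0.22, 404), (0.11, 200), (1.40, 503), (0.09, 200)]


def percentile(values, q):
    """Nearest-rank percentile: the smallest value with at least a
    fraction q of the data at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(q * len(ordered)))
    return ordered[rank - 1]


latency_p95 = percentile([d for d, _ in requests], 0.95)          # seconds
traffic = len(requests) / WINDOW_SECONDS                          # requests/second
error_ratio = sum(s >= 500 for _, s in requests) / len(requests)  # 5xx share
saturation = 768 / 1024   # assumed memory used / memory limit, in MiB

print(f"latency p95: {latency_p95:.2f} s")
print(f"traffic:     {traffic:.2f} req/s")
print(f"errors:      {error_ratio:.0%}")
print(f"saturation:  {saturation:.0%}")
```

Note that the 404 is not counted as an error here, mirroring the `5..` pattern in the table: client errors usually do not indicate a fault in the service itself.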

USE Method (for infrastructure)

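Component | Utilization  | Saturation       | Errors
----------|--------------|------------------|-------------------
CPU       | Usage %      | Run queue length | System errors
Memory    | Used / Total | Swap usage       | OOM kills
Disk      | Space used % | I/O wait         | Read/write errors
Network   | Bandwidth %  | Dropped packets  | Interface errors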
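
As a quick local check of two cells in the table, the sketch below reads disk utilization and a CPU saturation proxy using only the Python standard library. It is an illustration, not a replacement for a proper exporter: the function names are hypothetical, the load average is only an approximation of run queue length, and `os.getloadavg()` is available on Unix-like systems only.

```python
import os
import shutil


def disk_utilization_pct(path: str = "/") -> float:
    """Utilization: percentage of the filesystem at `path` that is in use."""
    usage = shutil.disk_usage(path)
    if usage.total == 0:  # guard against pseudo-filesystems reporting no size
        return 0.0
    return 100.0 * usage.used / usage.total


def cpu_saturation():
    """Saturation proxy: 1-minute load average per CPU core.

    Returns None on platforms where the load average is unavailable.
    """
    try:
        load_1m, _, _ = os.getloadavg()
    except (AttributeError, OSError):
        return None
    return load_1m / (os.cpu_count() or 1)


print(f"disk used:     {disk_utilization_pct():.1f}%")
print(f"load per core: {cpu_saturation()}")
```

A load per core that stays above 1.0 suggests that work is queuing for CPU time, which is the saturation condition the table describes.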

Summary

This lesson covered:

  • Designing meaningful alerts that reduce fatigue
  • Configuring Alertmanager with routing and receivers
  • SLOs, error budgets, and burn-rate alerting
  • Building incident response runbooks
  • The Four Golden Signals and USE method

Next Steps

You now have a complete monitoring foundation. Apply these practices to your Kubernetes deployments and CI/CD pipelines for full observability.