Alerting & Incident Response

Alerting Philosophy

Good alerts are actionable, relevant, and timely. Bad alerts cause alert fatigue and get ignored.

Alert Quality Checklist

Good Alert	Bad Alert
Requires human action	Informational only
Indicates user impact	Internal metric spike
Has clear remediation	"Something is wrong"
Fires rarely	Fires constantly
Grouped by incident	Separate notification for every symptom of one failure

Severity Levels

Level	Meaning	Response Time	Example
Critical	Service down, data loss	Immediate (page)	Database unreachable
Warning	Degraded performance	Within hours	CPU > 80% for 10 min
Info	Notable event	Next business day	Deployment completed

Alertmanager Configuration

# alertmanager.yml
global:
  resolve_timeout: 5m
  slack_api_url: 'https://hooks.slack.com/services/...'
 
route:
  receiver: 'default'
  group_by: ['alertname', 'cluster']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
 
  routes:
    # `matchers` replaces the deprecated `match` field (Alertmanager 0.22+)
    - matchers:
        - severity = "critical"
      receiver: 'pagerduty-critical'
      repeat_interval: 15m
 
    - matchers:
        - severity = "warning"
      receiver: 'slack-warnings'
      repeat_interval: 2h
 
receivers:
  - name: 'default'
    slack_configs:
      - channel: '#monitoring'
        send_resolved: true
 
  - name: 'pagerduty-critical'
    pagerduty_configs:
      - service_key: '<pagerduty-key>'
 
  - name: 'slack-warnings'
    slack_configs:
      - channel: '#alerts-warning'
        send_resolved: true
 
inhibit_rules:
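  # Mute a warning while a critical alert with the same alertname and instance is firing.
  # `source_matchers` / `target_matchers` replace the deprecated `source_match` / `target_match`.
  - source_matchers:
      - severity = "critical"
    target_matchers:
      - severity = "warning"
    equal: ['alertname', 'instance']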
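
To make the routing and inhibition behaviour of the configuration above easier to reason about, the following Python sketch mimics it in a simplified form. It is not Alertmanager's implementation: the names `ROUTES`, `pick_receiver`, and `is_inhibited` are hypothetical, and the sketch assumes exact label matching, a flat one-level route tree, and first-match-wins routing (no `continue: true`).

```python
# Simplified model of the routing tree and inhibit rule shown above.
# Assumptions (not Alertmanager code): exact label matching, one level of
# child routes, and the first matching child route wins.

DEFAULT_RECEIVER = "default"

# (labels that must match, receiver), in the order they appear under `routes`.
ROUTES = [
    ({"severity": "critical"}, "pagerduty-critical"),
    ({"severity": "warning"}, "slack-warnings"),
]


def pick_receiver(labels: dict) -> str:
    """Return the receiver of the first child route whose matchers all match."""
    for matchers, receiver in ROUTES:
        if all(labels.get(key) == value for key, value in matchers.items()):
            return receiver
    # No child route matched: the alert stays with the root route's receiver.
    return DEFAULT_RECEIVER


def is_inhibited(target: dict, firing: list) -> bool:
    """Apply the inhibit rule: a warning is muted while a critical alert
    with the same alertname and instance is firing."""
    if target.get("severity") != "warning":
        return False
    return any(
        source.get("severity") == "critical"
        and all(source.get(k) == target.get(k) for k in ("alertname", "instance"))
        for source in firing
    )


if __name__ == "__main__":
    crit = {"alertname": "HighLatency", "instance": "api-1", "severity": "critical"}
    warn = {"alertname": "HighLatency", "instance": "api-1", "severity": "warning"}
    info = {"alertname": "DeployDone", "severity": "info"}

    print(pick_receiver(crit))         # pagerduty-critical
    print(pick_receiver(warn))         # slack-warnings
    print(pick_receiver(info))         # default
    print(is_inhibited(warn, [crit]))  # True
```

The inhibit rule only fires when both alerts agree on every label listed under `equal`, so a critical alert on one instance would not silence a warning on another.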

SLOs and Error Budgets

Service Level Objectives

Term	Definition	Example
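SLI	Service Level Indicator: a measured signal of service health	Request success rate
SLO	Service Level Objective: the target value for an SLI	99.9% success rate
SLA	Service Level Agreement: a contractual commitment to customers	Contractual guarantee
Error Budget	Allowed failure fraction (1 - SLO)	0.1% ≈ 43 min of full outage per 30-day month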
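
The "43 min/month" figure in the table can be checked with a few lines of arithmetic. This is a minimal sketch that assumes a 30-day window and treats the budget as minutes of complete outage; the function name `error_budget_minutes` is illustrative.

```python
def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Minutes of full outage allowed in the window for a given SLO.

    `slo` is a fraction, e.g. 0.999 for 99.9%. The 30-day default is an
    assumption; use whatever window your SLO is defined over.
    """
    if not 0.0 < slo < 1.0:
        raise ValueError("slo must be a fraction between 0 and 1, e.g. 0.999")
    window_minutes = window_days * 24 * 60
    return (1.0 - slo) * window_minutes


if __name__ == "__main__":
    for slo in (0.99, 0.999, 0.9999):
        print(f"SLO {slo:.2%}: {error_budget_minutes(slo):.2f} min per 30 days")
```

Each extra "nine" cuts the budget by a factor of ten, which is why the SLO target should be chosen before alert thresholds are tuned.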

SLO-Based Alerts

groups:
  - name: slo_alerts
    rules:
      # Fast burn: error ratio above 14.4x the budget (0.001 = 1 - SLO of 99.9%)
      - alert: HighBurnRate_Fast
        expr: |
          (
            sum(rate(http_requests_total{status_code=~"5.."}[5m]))
            /
            sum(rate(http_requests_total[5m]))
          ) > 14.4 * 0.001
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Error budget burning 14.4x faster than allowed"
 
      # Slow burn: error ratio above 3x the budget, measured over 1h
      - alert: HighBurnRate_Slow
        expr: |
          (
            sum(rate(http_requests_total{status_code=~"5.."}[1h]))
            /
            sum(rate(http_requests_total[1h]))
          ) > 3 * 0.001
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "Error budget burning 3x faster than allowed"
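
The multipliers in the rules above are burn rates: the observed error ratio divided by the error budget (1 - SLO). The sketch below, which assumes a 99.9% SLO and a 30-day budget window, shows what each threshold means in terms of time until the budget runs out if the rate stays constant. The names `burn_rate` and `hours_to_exhaustion` are illustrative.

```python
SLO = 0.999               # assumed target, matching the 0.001 in the rules
WINDOW_HOURS = 30 * 24    # assumed 30-day budget window


def burn_rate(error_ratio: float, slo: float = SLO) -> float:
    """How many times faster than 'exactly on budget' errors are occurring."""
    return error_ratio / (1.0 - slo)


def hours_to_exhaustion(rate: float, window_hours: float = WINDOW_HOURS) -> float:
    """Hours until the whole budget is spent at a constant burn rate."""
    if rate <= 0:
        raise ValueError("burn rate must be positive")
    return window_hours / rate


if __name__ == "__main__":
    for name, multiplier in (("fast", 14.4), ("slow", 3.0)):
        threshold = multiplier * (1.0 - SLO)  # same product as in the alert expr
        print(
            f"{name}: fires above {threshold:.2%} errors, "
            f"budget gone in about {hours_to_exhaustion(multiplier):.0f} h"
        )
```

Under these assumptions a 14.4x burn spends the monthly budget in roughly 50 hours and a 3x burn in roughly 10 days, which is why the first pages and the second only warns.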

Incident Response Runbook

Template

# Incident: [Alert Name]
 
## Symptoms
- What the user experiences
- What metrics show
 
## Diagnosis
1. Check service health: `kubectl get pods -l app=<service>`
2. Check logs: `kubectl logs -l app=<service> --tail=100`
3. Check dependencies: `curl http://<dependency>/health`
4. Check resources: `kubectl top pods`
 
## Remediation
1. **Quick fix:** Restart pods — `kubectl rollout restart deployment/<name>`
2. **Scale up:** `kubectl scale deployment/<name> --replicas=5`
3. **Rollback:** `kubectl rollout undo deployment/<name>`
 
## Escalation
- L1: On-call engineer (first 15 min)
- L2: Team lead (after 30 min)
- L3: Engineering manager (after 1 hour)
 
## Post-Incident
- [ ] Timeline documented
- [ ] Root cause identified
- [ ] Fix deployed
- [ ] Post-mortem scheduled
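
The escalation ladder in the template can be encoded so that tooling (a chat bot or paging script, for example) can answer "who should be involved by now?". This is a hypothetical sketch: the thresholds are read from the template, L1 is assumed to be engaged from the moment the page fires, and "after N min" is interpreted as "at N minutes or later".

```python
# Escalation thresholds taken from the runbook template above.
# Assumption: later levels join the incident; they do not replace earlier ones.
ESCALATION = [
    (0, "L1: On-call engineer"),
    (30, "L2: Team lead"),
    (60, "L3: Engineering manager"),
]


def engaged_levels(minutes_since_page: float) -> list:
    """Return every escalation level that should be involved by now."""
    if minutes_since_page < 0:
        raise ValueError("minutes_since_page must be non-negative")
    return [
        name for threshold, name in ESCALATION if minutes_since_page >= threshold
    ]


if __name__ == "__main__":
    for minutes in (10, 45, 90):
        print(minutes, engaged_levels(minutes))
```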

Monitoring Best Practices

The Four Golden Signals (Google SRE)

Signal	What to Measure	PromQL Example
Latency	Request duration (p95)	`histogram_quantile(0.95, sum by (le) (rate(http_duration_seconds_bucket[5m])))`
Traffic	Request rate	`sum(rate(http_requests_total[5m]))`
Errors	Error rate	`sum(rate(http_requests_total{status_code=~"5.."}[5m]))`
Saturation	Resource usage	`container_memory_usage_bytes / container_spec_memory_limit_bytes`
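
The latency row relies on `histogram_quantile`, which estimates a percentile from cumulative bucket counts by linear interpolation. The sketch below reproduces that idea on plain numbers so the result is easier to interpret. It is a simplified approximation, not Prometheus source code: it assumes at least one finite bucket, a lower bound of 0 for the first bucket, and a final `+Inf` bucket holding the total count. The bucket values in the example are made up for illustration.

```python
import math


def histogram_quantile(q: float, buckets: list) -> float:
    """Estimate the q-quantile from cumulative histogram buckets.

    `buckets` is a list of (upper_bound, cumulative_count) pairs sorted by
    upper_bound, ending with (math.inf, total_count).
    """
    if not 0.0 < q <= 1.0:
        raise ValueError("q must be in (0, 1]")
    total = buckets[-1][1]
    if total == 0:
        return math.nan  # no observations, nothing to estimate

    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                # Quantile falls in the +Inf bucket: the best available
                # answer is the highest finite bucket boundary.
                return buckets[-2][0]
            in_bucket = count - prev_count
            # Interpolate linearly inside the bucket that contains the rank.
            return prev_bound + (bound - prev_bound) * (rank - prev_count) / in_bucket
        prev_bound, prev_count = bound, count
    return math.nan  # unreachable for well-formed input


if __name__ == "__main__":
    # Hypothetical request-duration buckets in seconds.
    sample = [(0.1, 50), (0.5, 90), (1.0, 98), (math.inf, 100)]
    print(histogram_quantile(0.95, sample))
```

Because the estimate is interpolated within a bucket, its accuracy depends on bucket boundaries; placing a boundary near the latency target keeps the p95 meaningful.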

USE Method (for infrastructure)

Component	Utilization	Saturation	Errors
CPU	Usage %	Run queue length	System errors
Memory	Used / Total	Swap usage	OOM kills
Disk	Space used %	I/O wait	Read/write errors
Network	Bandwidth %	Dropped packets	Interface errors

Summary

You've learned:

Designing meaningful alerts that reduce fatigue
Configuring Alertmanager with routing and receivers
SLOs, error budgets, and burn-rate alerting
Building incident response runbooks
The Four Golden Signals and the USE method

Next Steps

You now have a foundation for monitoring and alerting. Apply these practices to your Kubernetes deployments and CI/CD pipelines to extend observability across your stack.

Learning Objectives