Alerting Philosophy
Good alerts are actionable, relevant, and timely. Bad alerts cause alert fatigue and get ignored.
Alert Quality Checklist
| Good Alert | Bad Alert |
|---|---|
| Requires human action | Informational only |
| Indicates user impact | Internal metric spike |
| Has clear remediation | "Something is wrong" |
| Fires rarely | Fires constantly |
| Grouped logically | One notification per symptom of the same failure |
Severity Levels
| Level | Meaning | Response Time | Example |
|---|---|---|---|
| Critical | Service down, data loss | Immediate (page) | Database unreachable |
| Warning | Degraded performance | Within hours | CPU > 80% for 10 min |
| Info | Notable event | Next business day | Deployment completed |
Alertmanager Configuration
# alertmanager.yml
global:
  resolve_timeout: 5m
  slack_api_url: 'https://hooks.slack.com/services/...'
route:
  receiver: 'default'
  group_by: ['alertname', 'cluster']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  routes:
    - match:
        severity: critical
      receiver: 'pagerduty-critical'
      repeat_interval: 15m
    - match:
        severity: warning
      receiver: 'slack-warnings'
      repeat_interval: 2h
receivers:
  - name: 'default'
    slack_configs:
      - channel: '#monitoring'
        send_resolved: true
  - name: 'pagerduty-critical'
    pagerduty_configs:
      - service_key: '<pagerduty-key>'
  - name: 'slack-warnings'
    slack_configs:
      - channel: '#alerts-warning'
        send_resolved: true
inhibit_rules:
  - source_match:
      severity: 'critical'
    target_match:
      severity: 'warning'
    equal: ['alertname', 'instance']
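The routing tree above reads as "the first child route whose labels match wins, otherwise the root receiver is used", and the inhibit rule mutes a warning while a critical alert with the same `alertname` and `instance` is firing. The following Python sketch models just that behavior for this particular config. It is a simplified illustration, not Alertmanager's implementation: nested routes, `continue`, regex matchers, grouping, and timing are all ignored, and the alert labels in the usage lines are made up.

```python
# Simplified model of the routing tree and inhibit rule in alertmanager.yml.
# Assumption: flat routes with exact label matching only.

ROOT_RECEIVER = "default"
ROUTES = [
    {"match": {"severity": "critical"}, "receiver": "pagerduty-critical"},
    {"match": {"severity": "warning"}, "receiver": "slack-warnings"},
]
INHIBIT_EQUAL = ["alertname", "instance"]


def pick_receiver(labels):
    """Return the receiver of the first child route whose labels all match."""
    for route in ROUTES:
        if all(labels.get(key) == value for key, value in route["match"].items()):
            return route["receiver"]
    return ROOT_RECEIVER  # nothing matched: fall back to the root route


def is_inhibited(target, firing):
    """True if a firing critical alert mutes this warning alert."""
    if target.get("severity") != "warning":
        return False
    for source in firing:
        if source.get("severity") != "critical":
            continue
        if all(source.get(key) == target.get(key) for key in INHIBIT_EQUAL):
            return True
    return False


# Hypothetical alerts for illustration.
critical = {"alertname": "HighLatency", "instance": "api-1", "severity": "critical"}
warning = {"alertname": "HighLatency", "instance": "api-1", "severity": "warning"}
info = {"alertname": "DeployDone", "severity": "info"}

print(pick_receiver(critical))
print(pick_receiver(info))
print(is_inhibited(warning, [critical]))
```

Because `info` alerts match neither child route, they land in `#monitoring` through the `default` receiver.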
SLOs and Error Budgets
Service Level Objectives
| Term | Definition | Example |
|---|---|---|
| SLI | Service Level Indicator | Request success rate |
| SLO | Service Level Objective | 99.9% success rate |
| SLA | Service Level Agreement | Contractual guarantee |
| Error Budget | Allowed failures | 0.1% = 43 min/month |
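The "43 min/month" figure in the table follows directly from the SLO. A small Python sketch of the arithmetic, assuming a 30-day month and treating the budget as minutes of complete unavailability (the function name is illustrative):

```python
def error_budget_minutes(slo, window_days=30):
    """Minutes of full downtime an availability SLO allows over the window."""
    window_minutes = window_days * 24 * 60
    return (1 - slo) * window_minutes


for slo in (0.99, 0.999, 0.9999):
    print(f"SLO {slo:.2%}: {error_budget_minutes(slo):.1f} min per 30 days")
```

For a 99.9% SLO this gives 43.2 minutes, which the table rounds to 43.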
SLO-Based Alerts
groups:
  - name: slo_alerts
    rules:
      # Burn rate alert: fast burn
      - alert: HighBurnRate_Fast
        expr: |
          (
            sum(rate(http_requests_total{status_code=~"5.."}[5m]))
            /
            sum(rate(http_requests_total[5m]))
          ) > 14.4 * 0.001
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Error budget burning 14.4x faster than allowed"
      # Burn rate alert: slow burn
      - alert: HighBurnRate_Slow
        expr: |
          (
            sum(rate(http_requests_total{status_code=~"5.."}[1h]))
            /
            sum(rate(http_requests_total[1h]))
          ) > 3 * 0.001
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "Error budget burning 3x faster than allowed"
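The thresholds in these rules are a burn-rate multiplier times the error budget (`0.001` for a 99.9% SLO). The Python sketch below works through what each multiplier means in practice. It assumes a 30-day SLO window, and the helper names are illustrative.

```python
# Worked arithmetic behind the burn-rate thresholds.
ERROR_BUDGET = 0.001      # 1 - 0.999, the factor used in both alert expressions
WINDOW_HOURS = 30 * 24    # assumption: 30-day SLO window


def burn_rate(error_ratio, budget=ERROR_BUDGET):
    """How many times faster than allowed the budget is being spent."""
    return error_ratio / budget


def hours_until_exhausted(rate, window_hours=WINDOW_HOURS):
    """Time to spend the whole budget if this burn rate is sustained."""
    return window_hours / rate


for name, rate in (("fast", 14.4), ("slow", 3.0)):
    threshold = rate * ERROR_BUDGET
    print(f"{name}: fires above {threshold:.2%} errors, "
          f"budget gone in {hours_until_exhausted(rate):.0f} h")
```

At a sustained 14.4x burn the 30-day budget lasts about 50 hours, which is why the fast-burn rule pages. At 3x it lasts about 10 days, which suits a warning.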
Incident Response Runbook
Template
# Incident: [Alert Name]
## Symptoms
- What the user experiences
- What metrics show
## Diagnosis
1. Check service health: `kubectl get pods -l app=<service>`
2. Check logs: `kubectl logs -l app=<service> --tail=100`
3. Check dependencies: `curl http://<dependency>/health`
4. Check resources: `kubectl top pods`
## Remediation
1. **Quick fix:** Restart pods — `kubectl rollout restart deployment/<name>`
2. **Scale up:** `kubectl scale deployment/<name> --replicas=5`
3. **Rollback:** `kubectl rollout undo deployment/<name>`
## Escalation
- L1: On-call engineer (first 15 min)
- L2: Team lead (after 30 min)
- L3: Engineering manager (after 1 hour)
## Post-Incident
- [ ] Timeline documented
- [ ] Root cause identified
- [ ] Fix deployed
- [ ] Post-mortem scheduled
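When a runbook is reused across services, the diagnosis steps can be rendered with the placeholders filled in so the on-call engineer can copy them directly. A minimal Python sketch, which only prints the commands and does not run them. The service name `checkout` and the dependency URL are hypothetical.

```python
import shlex


def diagnosis_commands(service, dependency_url):
    """Return the runbook's diagnosis steps as argv lists, in order."""
    selector = f"app={service}"
    return [
        ["kubectl", "get", "pods", "-l", selector],
        ["kubectl", "logs", "-l", selector, "--tail=100"],
        ["curl", dependency_url],
        ["kubectl", "top", "pods"],
    ]


# Hypothetical service and dependency, for illustration only.
commands = diagnosis_commands("checkout", "http://payments/health")
for step, argv in enumerate(commands, start=1):
    print(f"{step}. {shlex.join(argv)}")
```

Keeping each command as an argv list rather than a single string avoids shell-quoting problems if the steps are later passed to `subprocess.run`.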
Monitoring Best Practices
The Four Golden Signals (Google SRE)
| Signal | What to Measure | PromQL Example |
|---|---|---|
| Latency | Request duration | histogram_quantile(0.95, sum by (le) (rate(http_duration_seconds_bucket[5m]))) |
| Traffic | Request rate | sum(rate(http_requests_total[5m])) |
| Errors | Error rate | sum(rate(http_requests_total{status_code=~"5.."}[5m])) |
| Saturation | Resource usage | container_memory_usage_bytes / container_spec_memory_limit_bytes |
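The latency query relies on `histogram_quantile`, which estimates a percentile from cumulative bucket counts by interpolating linearly inside the bucket that contains the target rank. The Python sketch below is a simplified model of that idea, not Prometheus's implementation. It assumes positive bucket bounds with a lower edge of 0 for the first bucket, and the bucket counts are made up.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate the q-quantile from cumulative (upper_bound, count) pairs.

    `buckets` must be sorted by upper bound and end with (math.inf, total).
    """
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    lower_bound, lower_count = 0.0, 0.0
    for upper_bound, count in buckets:
        if count >= rank:
            if math.isinf(upper_bound):
                # Rank falls in the +Inf bucket: report the highest finite bound.
                return lower_bound
            in_bucket = count - lower_count
            fraction = (rank - lower_count) / in_bucket
            return lower_bound + (upper_bound - lower_bound) * fraction
        lower_bound, lower_count = upper_bound, count
    return lower_bound


# Made-up cumulative counts for request durations in seconds.
example = [(0.1, 50), (0.5, 90), (1.0, 98), (math.inf, 100)]
print(f"p95 is roughly {histogram_quantile(0.95, example):.4f} s")
```

Here the 95th request falls in the 0.5 to 1.0 s bucket, so the estimate lands between those bounds at about 0.81 s. The result is only as precise as the bucket boundaries allow.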
USE Method (for infrastructure)
| Component | Utilization | Saturation | Errors |
|---|---|---|---|
| CPU | Usage % | Run queue length | System errors |
| Memory | Used / Total | Swap usage | OOM kills |
| Disk | Space used % | I/O wait | Read/write errors |
| Network | Bandwidth % | Dropped packets | Interface errors |
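Two cells of the USE table can be sampled with the Python standard library alone, which helps when sanity-checking what a dashboard reports. This is a rough sketch under stated assumptions: the load average per CPU is used as a proxy for run queue length, `os.getloadavg` is only available on Unix-like systems, and the function names are illustrative.

```python
import os
import shutil


def disk_utilization_percent(path="/"):
    """Disk utilization: share of the filesystem at `path` that is used."""
    usage = shutil.disk_usage(path)
    return 100.0 * usage.used / usage.total


def cpu_saturation():
    """1-minute load average per CPU; sustained values above 1.0 suggest queuing.

    Returns None where the load average is unavailable (for example on Windows).
    """
    if not hasattr(os, "getloadavg"):
        return None
    try:
        load_1m, _, _ = os.getloadavg()
    except OSError:
        return None
    return load_1m / (os.cpu_count() or 1)


print(f"disk used: {disk_utilization_percent():.1f}%")
print(f"cpu saturation (load per CPU): {cpu_saturation()}")
```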
Summary
You've learned:
- Designing meaningful alerts that reduce fatigue
- Configuring Alertmanager with routing and receivers
- SLOs, error budgets, and burn-rate alerting
- Building incident response runbooks
- The Four Golden Signals and USE method
Next Steps
You now have a complete monitoring foundation. Apply these practices to your Kubernetes deployments and CI/CD pipelines for full observability.