Complete Observability Architecture
┌─────────────────────────────────────────────────────────┐
│                      Applications                       │
│          (Instrumented with OpenTelemetry SDK)          │
└───────┬──────────────────┬──────────────────┬───────────┘
        │ Metrics          │ Logs             │ Traces
┌───────▼────────┐   ┌─────▼──────┐   ┌───────▼────────┐
│   Prometheus   │   │    Loki    │   │     Jaeger     │
│ (Time Series)  │   │(Log Store) │   │ (Trace Store)  │
└───────┬────────┘   └─────┬──────┘   └───────┬────────┘
        │                  │                  │
┌───────▼──────────────────▼──────────────────▼───────────┐
│                         Grafana                         │
│                 (Unified Visualization)                 │
└───────────────────────┬─────────────────────────────────┘
                        │
┌───────────────────────▼─────────────────────────────────┐
│                      Alertmanager                       │
│     ┌─────────┐     ┌──────────┐     ┌────────────┐     │
│     │  Slack  │     │PagerDuty │     │   Email    │     │
│     └─────────┘     └──────────┘     └────────────┘     │
└─────────────────────────────────────────────────────────┘
Correlation Between Signals
Linking Metrics → Traces
When a metric spikes, find the traces that caused it:
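# p99 request latency above 2 seconds
histogram_quantile(0.99, rate(http_request_duration_seconds_bucket[5m])) > 2
In Grafana, configure "Exemplars" on the Prometheus data source to link metric data points to trace IDs. This only works if the application attaches trace IDs to its metrics as exemplars and Prometheus runs with exemplar storage enabled (--enable-feature=exemplar-storage).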
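To make the p99 query above more concrete, here is a minimal Python sketch of the kind of estimate `histogram_quantile` produces: it finds the bucket that contains the requested rank and interpolates linearly inside it. The bucket counts are hypothetical, and the sketch leaves out edge cases that Prometheus itself handles, so treat it as an illustration rather than a reimplementation.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate the q-quantile (0 <= q <= 1) from cumulative buckets.

    buckets is a list of (upper_bound_seconds, cumulative_count) pairs and
    must include the +Inf bucket, mirroring the `le` label in Prometheus.
    """
    if not 0 <= q <= 1:
        raise ValueError("q must be between 0 and 1")
    buckets = sorted(buckets)
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    # The lowest bucket is assumed to start at 0.
    prev_bound, prev_count = 0.0, 0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound) or count == prev_count:
                # Rank falls in the +Inf bucket (or an empty one):
                # report the highest known finite bound.
                return prev_bound
            # Linear interpolation inside the bucket that holds the rank.
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return prev_bound


# Hypothetical bucket counts, for illustration only.
example_buckets = [(0.5, 900), (1.0, 950), (2.5, 985), (5.0, 998), (math.inf, 1000)]
p99 = histogram_quantile(0.99, example_buckets)
print(f"p99 = {p99:.2f}s, alert fires: {p99 > 2}")
```

Because the result is interpolated, its accuracy depends on how closely the bucket boundaries bracket the threshold you alert on (2 seconds here).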
Linking Logs → Traces
Include trace IDs in every log entry:
{
"timestamp": "2024-01-15T10:30:00Z",
"level": "error",
"message": "Payment processing failed",
"trace_id": "abc123def456",
"span_id": "span789",
"service": "payment-service"
}
LogQL query to find logs for a trace:
{service="payment-service"} |= "abc123def456"
Grafana Data Source Linking
Configure in Grafana:
- Prometheus → Loki: Derive log queries from metric labels
- Loki → Jaeger: Extract trace IDs from logs
- Jaeger → Prometheus: Show metrics for traced services
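The Loki → Jaeger link depends on every log line carrying a trace ID that a regular expression can pull out. The following Python sketch shows both halves under some assumptions: `current_trace_id` and `current_span_id` are hypothetical context variables (a real service would read these values from its tracing SDK), `TraceJsonFormatter` is an illustrative name, and the regex only approximates what a Grafana derived-field matcher would do.

```python
import io
import json
import logging
import re
import time
from contextvars import ContextVar

# Hypothetical holders for the active trace context; in a real service
# these would be populated by the tracing SDK, not set by hand.
current_trace_id = ContextVar("current_trace_id", default="")
current_span_id = ContextVar("current_span_id", default="")


class TraceJsonFormatter(logging.Formatter):
    """Render each record as one JSON object that includes trace context."""

    converter = time.gmtime  # format timestamps in UTC

    def __init__(self, service):
        super().__init__()
        self.service = service

    def format(self, record):
        return json.dumps({
            "timestamp": self.formatTime(record, "%Y-%m-%dT%H:%M:%SZ"),
            "level": record.levelname.lower(),
            "message": record.getMessage(),
            "trace_id": current_trace_id.get(),
            "span_id": current_span_id.get(),
            "service": self.service,
        })


# One capture group around the ID, as a derived-field style matcher would use.
TRACE_ID_PATTERN = re.compile(r'"trace_id":\s*"([0-9a-f]+)"')


def extract_trace_id(log_line):
    """Return the trace ID found in a JSON log line, or None."""
    match = TRACE_ID_PATTERN.search(log_line)
    return match.group(1) if match else None


# Usage: log to an in-memory stream, then recover the trace ID from the line.
stream = io.StringIO()
handler = logging.StreamHandler(stream)
handler.setFormatter(TraceJsonFormatter("payment-service"))
logger = logging.getLogger("payment-service")
logger.addHandler(handler)
logger.propagate = False

current_trace_id.set("abc123def456")
current_span_id.set("span789")
logger.error("Payment processing failed")

log_line = stream.getvalue().strip()
print(log_line)
print(extract_trace_id(log_line))
```

Keeping the trace ID in a fixed, machine-readable field is what makes the extraction reliable; free-text messages that merely mention an ID are much harder to link.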
On-Call Best Practices
Rotation Structure
| Tier | Responsibility | Response Time |
|---|---|---|
| Primary | First responder | 5 minutes |
| Secondary | Backup if primary unavailable | 15 minutes |
| Escalation | Team lead / manager | 30 minutes |
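One way to read the rotation table is as an escalation timeline. The Python sketch below encodes that reading; it assumes that response times are measured from the moment the alert fires, that the primary is paged immediately, and that each later tier is paged once the previous tier's response time has passed without an acknowledgement. The names `Tier`, `ROTATION`, and `tiers_paged` are illustrative only, and your paging tool may define escalation differently.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tier:
    name: str
    response_minutes: int


# Values taken from the rotation table above.
ROTATION = [
    Tier("Primary", 5),
    Tier("Secondary", 15),
    Tier("Escalation", 30),
]


def tiers_paged(minutes_unacknowledged, rotation=ROTATION):
    """Return the tiers that should have been paged by now.

    Assumption: the first tier is paged at minute 0, and every later tier
    is paged when the previous tier's response time has elapsed.
    """
    paged = [rotation[0].name]
    for previous, tier in zip(rotation, rotation[1:]):
        if minutes_unacknowledged >= previous.response_minutes:
            paged.append(tier.name)
    return paged


for minutes in (0, 5, 20):
    print(minutes, tiers_paged(minutes))
```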
On-Call Checklist
## Starting On-Call Shift
- [ ] Verify pager/phone notifications working
- [ ] Review recent deployments and changes
- [ ] Check current alert status (any active?)
- [ ] Review handoff notes from previous on-call
- [ ] Ensure VPN/access to all systems
## During an Incident
1. Acknowledge the alert
2. Assess severity and user impact
3. Start incident channel (Slack/Teams)
4. Diagnose using dashboards and logs
5. Apply fix or escalate
6. Communicate status updates every 15 min
7. Resolve and document
## Ending On-Call Shift
- [ ] Document any ongoing issues
- [ ] Write handoff notes
- [ ] File tickets for non-urgent improvements
- [ ] Update runbooks if gaps found
Reducing Alert Fatigue
| Strategy | Implementation |
|---|---|
| Alert on symptoms, not causes | Alert on error rate, not CPU |
| Group related alerts | Alertmanager group_by |
| Set appropriate thresholds | Avoid alerting on normal variance |
| Use inhibition rules | Suppress child alerts when parent fires |
| Regular alert review | Monthly review of alert frequency |
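Grouping and inhibition are the two strategies in the table that are easiest to show in code. The Python sketch below is a simplified model of the idea, not of Alertmanager's actual implementation: alerts are plain label dictionaries, and the alert names and labels are hypothetical.

```python
from collections import defaultdict


def matches(labels, matcher):
    """True if every label in matcher has the same value in labels."""
    return all(labels.get(key) == value for key, value in matcher.items())


def apply_inhibition(alerts, source_match, target_match, equal):
    """Drop target alerts while a matching source alert is firing.

    A target is suppressed only if some source alert shares the same
    values for every label listed in `equal`.
    """
    sources = [a for a in alerts if matches(a, source_match)]
    kept = []
    for alert in alerts:
        inhibited = matches(alert, target_match) and any(
            all(alert.get(k) == source.get(k) for k in equal)
            for source in sources
            if source is not alert
        )
        if not inhibited:
            kept.append(alert)
    return kept


def group_alerts(alerts, group_by):
    """Bucket alerts by the values of the group_by labels."""
    groups = defaultdict(list)
    for alert in alerts:
        key = tuple(alert.get(name, "") for name in group_by)
        groups[key].append(alert)
    return dict(groups)


# Hypothetical alerts, for illustration only.
alerts = [
    {"alertname": "NodeDown", "severity": "critical", "cluster": "prod", "instance": "node-1"},
    {"alertname": "HighErrorRate", "severity": "warning", "cluster": "prod", "instance": "node-1"},
    {"alertname": "HighErrorRate", "severity": "warning", "cluster": "staging", "instance": "node-9"},
]

remaining = apply_inhibition(
    alerts,
    source_match={"alertname": "NodeDown"},
    target_match={"severity": "warning"},
    equal=["cluster", "instance"],
)
notifications = group_alerts(remaining, group_by=["cluster"])
for key, members in notifications.items():
    print(key, [a["alertname"] for a in members])
```

In this example the warning on `node-1` is suppressed because the node itself is down, and the remaining alerts become one notification per cluster instead of one per alert.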
Capacity Planning
Resource Forecasting
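The forecasting queries in this section rest on simple linear extrapolation. As a rough illustration of what `predict_linear` computes, the Python sketch below fits a least-squares line through (timestamp, value) samples and extends it into the future. It is a simplification: it extrapolates from the last sample rather than from the query evaluation time, and the sample data is hypothetical.

```python
def fit_line(samples):
    """Least-squares fit; returns (slope, intercept) for value = slope*t + intercept."""
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_t = sum(t for t, _ in samples) / n
    mean_v = sum(v for _, v in samples) / n
    covariance = sum((t - mean_t) * (v - mean_v) for t, v in samples)
    variance = sum((t - mean_t) ** 2 for t, _ in samples)
    if variance == 0:
        raise ValueError("samples must span more than one timestamp")
    slope = covariance / variance
    return slope, mean_v - slope * mean_t


def predict_linear(samples, horizon_seconds):
    """Predicted value horizon_seconds after the last sample."""
    slope, intercept = fit_line(samples)
    return slope * (samples[-1][0] + horizon_seconds) + intercept


def seconds_until_zero(samples):
    """Seconds after the last sample until the fitted line reaches zero.

    Returns None if the value is not decreasing.
    """
    slope, intercept = fit_line(samples)
    if slope >= 0:
        return None
    return -intercept / slope - samples[-1][0]


# Hypothetical data: free space in GB, shrinking by 10 GB per day for a week.
DAY = 24 * 3600
free_gb = [(day * DAY, 100.0 - 10.0 * day) for day in range(8)]

print(predict_linear(free_gb, 30 * DAY) < 0)  # would the 30-day alert fire?
print(seconds_until_zero(free_gb) / DAY)      # estimated days until full
```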
# Fires when the filesystem is predicted to run out of space within 30 days
predict_linear(node_filesystem_avail_bytes[7d], 30*24*3600) < 0
# Growth rate of storage usage (bytes per second)
deriv(prometheus_tsdb_storage_blocks_bytes[7d])
# Request growth trend: 30-day average of the request rate, sampled once per day
avg_over_time(sum(rate(http_requests_total[1h]))[30d:1d])
Cost Optimization
| Component | Optimization |
|---|---|
| Prometheus | Reduce cardinality, shorter retention |
| Loki | Use retention policies, compress old logs |
| Jaeger | Sample traces (not 100%), TTL on storage |
| Grafana | Limit dashboard refresh rates |
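For Prometheus, the two levers in the table (cardinality and retention) multiply together. The Python sketch below applies the rule of thumb from the Prometheus storage documentation, disk space ≈ retention time × ingested samples per second × bytes per sample, to show how they interact. The series counts and scrape interval are hypothetical, and the default of 2 bytes per sample is a rough upper estimate, so the result is an order-of-magnitude guide only.

```python
def estimate_tsdb_bytes(active_series, scrape_interval_seconds,
                        retention_days, bytes_per_sample=2.0):
    """Rough disk estimate: retention * samples per second * bytes per sample."""
    samples_per_second = active_series / scrape_interval_seconds
    retention_seconds = retention_days * 24 * 3600
    return retention_seconds * samples_per_second * bytes_per_sample


# Hypothetical workload: 300,000 active series scraped every 15 seconds.
baseline = estimate_tsdb_bytes(300_000, 15, retention_days=30)
fewer_series = estimate_tsdb_bytes(150_000, 15, retention_days=30)
shorter_retention = estimate_tsdb_bytes(300_000, 15, retention_days=15)

for label, size in [("baseline", baseline),
                    ("half the series", fewer_series),
                    ("half the retention", shorter_retention)]:
    print(f"{label}: {size / 1e9:.1f} GB")
```

Halving either the number of active series or the retention period halves the estimate, which is why cardinality reviews tend to pay off as much as shorter retention.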
Retention Policies
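# Prometheus retention (data is removed as soon as either limit is reached)
--storage.tsdb.retention.time=30d
--storage.tsdb.retention.size=50GB
# Loki retention (enforced by the compactor, which needs retention_enabled: true)
limits_config:
  retention_period: 14d
# Jaeger TTL (illustrative keys; the actual settings depend on the storage backend)
span_store:
  max_traces: 1000000
  max_span_age: 7d
Infrastructure as Code for Monitoring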
Terraform + Monitoring
# Deploy monitoring stack with Terraform
resource "helm_release" "monitoring" {
  name       = "monitoring"
  repository = "https://prometheus-community.github.io/helm-charts"
  chart      = "kube-prometheus-stack"
  namespace  = "monitoring"

  set {
    name  = "grafana.adminPassword"
    value = var.grafana_password
  }

  set {
    name  = "prometheus.prometheusSpec.retention"
    value = "30d"
  }
}
Grafana Dashboards as Code
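One way to keep dashboard definitions like the one in this section maintainable is to generate the JSON from a small script and commit the output. The Python sketch below is a minimal example of that approach. The helper names `timeseries_panel` and `build_dashboard` are illustrative, the "p99 Latency" panel is an added example that reuses the latency query from earlier in this guide, and only the handful of fields shown in this guide are produced; a real dashboard model has many more.

```python
import json


def timeseries_panel(title, expr):
    """Build a minimal time series panel with a single PromQL target."""
    return {
        "title": title,
        "type": "timeseries",
        "targets": [{"expr": expr}],
    }


def build_dashboard(title, panels):
    """Wrap panels in the same top-level structure used in this guide."""
    return {"dashboard": {"title": title, "panels": panels}}


dashboard = build_dashboard(
    "NextGen Platform Overview",
    [
        timeseries_panel("Request Rate", "sum(rate(http_requests_total[5m]))"),
        timeseries_panel(
            "p99 Latency",
            "histogram_quantile(0.99, rate(http_request_duration_seconds_bucket[5m]))",
        ),
    ],
)

# Write this output to a file and keep it under version control.
print(json.dumps(dashboard, indent=2))
```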
{
  "dashboard": {
    "title": "NextGen Platform Overview",
    "panels": [
      {
        "title": "Request Rate",
        "type": "timeseries",
        "targets": [{
          "expr": "sum(rate(http_requests_total[5m]))"
        }]
      }
    ]
  }
}
Summary
This guide covered:
- Designing a complete observability platform
- Correlating metrics, logs, and traces
- On-call best practices and incident management
- Capacity planning and cost optimization
- Infrastructure as Code for monitoring
Next Steps
You now have a complete monitoring and observability foundation. Apply these practices across your entire infrastructure for production-grade visibility.