Building a Complete Observability Platform

Complete Observability Architecture

┌─────────────────────────────────────────────────────────┐
│                    Applications                         │
│  (Instrumented with OpenTelemetry SDK)                  │
└───────┬──────────────────┬──────────────────┬───────────┘
        │ Metrics          │ Logs             │ Traces
┌───────▼────────┐  ┌──────▼───────┐  ┌───────▼────────┐
│   Prometheus   │  │     Loki     │  │     Jaeger     │
│  (Time Series) │  │ (Log Store)  │  │ (Trace Store)  │
└───────┬────────┘  └──────┬───────┘  └───────┬────────┘
        │                  │                  │
┌───────▼──────────────────▼──────────────────▼───────────┐
│                      Grafana                            │
│            (Unified Visualization)                      │
└───────────────────────┬─────────────────────────────────┘
                        │
┌───────────────────────▼─────────────────────────────────┐
│                   Alertmanager                          │
│  ┌─────────┐  ┌──────────┐  ┌────────────┐              │
│  │  Slack  │  │PagerDuty │  │   Email    │              │
│  └─────────┘  └──────────┘  └────────────┘              │
└─────────────────────────────────────────────────────────┘

Correlation Between Signals

Linking Metrics → Traces

When a metric spikes, find the traces that caused it:

# p99 request latency over the last 5 minutes, where it exceeds 2 seconds
histogram_quantile(0.99, rate(http_request_duration_seconds_bucket[5m])) > 2

In Grafana, enable exemplars on the Prometheus data source so that individual metric data points link to the trace IDs recorded alongside them.
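For exemplars to exist, the application has to attach a trace ID to the samples it exposes. The sketch below shows what such a sample line looks like in the OpenMetrics text format. The helper name `format_exemplar_sample` and the sample values are hypothetical, and a real service would normally let its metrics client library emit this line rather than build it by hand.

```python
def format_exemplar_sample(name, labels, value, trace_id, exemplar_value):
    """Render one metric sample followed by an OpenMetrics-style exemplar.

    The part after '#' carries the exemplar: its labels (here just the
    trace ID) and the observed value that produced the exemplar.
    """
    label_text = ",".join(f'{key}="{val}"' for key, val in sorted(labels.items()))
    return f'{name}{{{label_text}}} {value} # {{trace_id="{trace_id}"}} {exemplar_value}'


# Hypothetical sample: the 2.5s bucket has seen 17 requests, and one of
# them (trace abc123def456) took 2.31 seconds.
line = format_exemplar_sample(
    "http_request_duration_seconds_bucket",
    {"le": "2.5"},
    17,
    "abc123def456",
    2.31,
)
print(line)
```

Grafana can then turn the `trace_id` label on each exemplar into a link to the matching trace in Jaeger.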

Linking Logs → Traces

Include trace IDs in every log entry:

{
  "timestamp": "2024-01-15T10:30:00Z",
  "level": "error",
  "message": "Payment processing failed",
  "trace_id": "abc123def456",
  "span_id": "span789",
  "service": "payment-service"
}

LogQL query to find logs for a trace:

{service="payment-service"} |= "abc123def456"
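As a minimal sketch of producing log entries shaped like the JSON above, the following uses only Python's standard `logging` module. The class name `JsonTraceFormatter` is hypothetical, and the trace and span IDs are passed in by hand through `extra`; in an instrumented service they would usually be read from the active OpenTelemetry span context instead.

```python
import json
import logging
from datetime import datetime, timezone


class JsonTraceFormatter(logging.Formatter):
    """Format each record as one JSON object that carries trace context."""

    def __init__(self, service):
        super().__init__()
        self.service = service

    def format(self, record):
        created = datetime.fromtimestamp(record.created, tz=timezone.utc)
        entry = {
            "timestamp": created.strftime("%Y-%m-%dT%H:%M:%SZ"),
            "level": record.levelname.lower(),
            "message": record.getMessage(),
            # Fields supplied via `extra=` become attributes on the record.
            "trace_id": getattr(record, "trace_id", None),
            "span_id": getattr(record, "span_id", None),
            "service": self.service,
        }
        return json.dumps(entry)


logger = logging.getLogger("payment-service")
handler = logging.StreamHandler()
handler.setFormatter(JsonTraceFormatter("payment-service"))
logger.addHandler(handler)

logger.error(
    "Payment processing failed",
    extra={"trace_id": "abc123def456", "span_id": "span789"},
)
```

Because every line is a single JSON object containing the trace ID, the LogQL line filter shown above matches it without any extra parsing.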

Grafana Data Source Linking

Configure in Grafana:

Prometheus → Loki: Derive log queries from metric labels
Loki → Jaeger: Extract trace IDs from logs
Jaeger → Prometheus: Show metrics for traced services
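The Loki → Jaeger link is typically set up as a derived field on the Loki data source: a regular expression pulls the trace ID out of each log line and Grafana builds a link from it. The sketch below imitates that idea outside Grafana. The base URL is a hypothetical placeholder, and the pattern assumes JSON log lines with a `trace_id` field.

```python
import json
import re

# Matches the trace_id field of a JSON log line.
TRACE_ID_PATTERN = re.compile(r'"trace_id":\s*"(\w+)"')

# Hypothetical Jaeger UI address; replace with your own.
JAEGER_URL = "http://jaeger.example.internal:16686"


def trace_link(log_line):
    """Return a Jaeger UI link for the trace ID in log_line, or None."""
    match = TRACE_ID_PATTERN.search(log_line)
    if match is None:
        return None
    return f"{JAEGER_URL}/trace/{match.group(1)}"


example_line = json.dumps({
    "level": "error",
    "message": "Payment processing failed",
    "trace_id": "abc123def456",
})
print(trace_link(example_line))
```

Lines without a trace ID simply produce no link, which is also how a derived field behaves when its pattern does not match.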

On-Call Best Practices

Rotation Structure

Tier        Responsibility                 Response Time
Primary     First responder                5 minutes
Secondary   Backup if primary unavailable  15 minutes
Escalation  Team lead / manager            30 minutes
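One way to read the rotation table is as an escalation policy: the primary is paged immediately, and each further tier is paged once the previous tier's response window has passed without an acknowledgement. The sketch below encodes that reading. Measuring every window from the moment the alert fired is an assumption, and the names `Tier` and `tiers_to_page` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tier:
    name: str
    respond_within_min: int  # response window, measured from alert start


TIERS = [
    Tier("Primary", 5),
    Tier("Secondary", 15),
    Tier("Escalation", 30),
]


def tiers_to_page(minutes_unacknowledged):
    """Return the tiers that should have been paged by now.

    The first tier is always paged. Each later tier is added once the
    tier before it has exceeded its response window without an ack.
    """
    paged = [TIERS[0].name]
    for previous, current in zip(TIERS, TIERS[1:]):
        if minutes_unacknowledged >= previous.respond_within_min:
            paged.append(current.name)
    return paged


for minutes in (0, 5, 15):
    print(minutes, tiers_to_page(minutes))
```

Paging tools such as PagerDuty express the same idea as an escalation policy, so this logic normally lives in their configuration rather than in your own code.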

On-Call Checklist

## Starting On-Call Shift
- [ ] Verify pager/phone notifications working
- [ ] Review recent deployments and changes
- [ ] Check current alert status (any active?)
- [ ] Review handoff notes from previous on-call
- [ ] Ensure VPN/access to all systems
 
## During an Incident
1. Acknowledge the alert
2. Assess severity and user impact
3. Start incident channel (Slack/Teams)
4. Diagnose using dashboards and logs
5. Apply fix or escalate
6. Communicate status updates every 15 min
7. Resolve and document
 
## Ending On-Call Shift
- [ ] Document any ongoing issues
- [ ] Write handoff notes
- [ ] File tickets for non-urgent improvements
- [ ] Update runbooks if gaps found

Reducing Alert Fatigue

Strategy                       Implementation
Alert on symptoms, not causes  Alert on error rate, not CPU
Group related alerts           Alertmanager group_by
Set appropriate thresholds     Avoid alerting on normal variance
Use inhibition rules           Suppress child alerts when parent fires
Regular alert review           Monthly review of alert frequency
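To make the grouping strategy concrete, here is a small sketch of what `group_by` does conceptually: alerts that share the same values for the chosen labels are bundled into one notification. This only illustrates the idea and is not Alertmanager's implementation; the alert names and labels are made up.

```python
from collections import defaultdict


def group_alerts(alerts, group_by):
    """Bundle alert names by the values of the group_by labels."""
    groups = defaultdict(list)
    for alert in alerts:
        labels = alert["labels"]
        # Alerts with identical values for every group_by label share a key.
        key = tuple(labels.get(label, "") for label in group_by)
        groups[key].append(labels["alertname"])
    return dict(groups)


# Hypothetical firing alerts.
EXAMPLE_ALERTS = [
    {"labels": {"alertname": "HighErrorRate", "service": "payment-service"}},
    {"labels": {"alertname": "HighLatency", "service": "payment-service"}},
    {"labels": {"alertname": "HighErrorRate", "service": "checkout"}},
]

for key, names in group_alerts(EXAMPLE_ALERTS, ["service"]).items():
    print(key, names)
```

Grouping by `service` turns three pages into two notifications; grouping by a coarser label would reduce that further at the cost of less specific messages.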

Capacity Planning

Resource Forecasting

# Filesystems predicted to run out of space within 30 days (linear fit over the last 7 days)
predict_linear(node_filesystem_avail_bytes[7d], 30*24*3600) < 0
 
# Growth rate of storage usage
deriv(prometheus_tsdb_storage_blocks_bytes[7d])
 
# Request growth trend: 30-day average of the total request rate, sampled once per day
avg_over_time(sum(rate(http_requests_total[1h]))[30d:1d])
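The forecasting queries above rest on a straight-line fit. The sketch below reproduces that idea with an ordinary least-squares regression in plain Python. It is a simplified illustration: the prediction is made relative to the last sample rather than the query evaluation time, and the disk figures are invented.

```python
def predict_linear(samples, seconds_ahead):
    """Extrapolate a least-squares line through (timestamp, value) samples.

    Returns the fitted value `seconds_ahead` seconds after the last sample.
    """
    if len(samples) < 2:
        raise ValueError("need at least two samples")
    n = len(samples)
    mean_t = sum(t for t, _ in samples) / n
    mean_v = sum(v for _, v in samples) / n
    variance = sum((t - mean_t) ** 2 for t, _ in samples)
    if variance == 0:
        raise ValueError("samples must have distinct timestamps")
    covariance = sum((t - mean_t) * (v - mean_v) for t, v in samples)
    slope = covariance / variance
    intercept = mean_v - slope * mean_t
    last_t = samples[-1][0]
    return slope * (last_t + seconds_ahead) + intercept


DAY = 24 * 3600

# Hypothetical free space in GB, shrinking by 10 GB per day for a week.
free_gb = [(day * DAY, 70.0 - 10.0 * day) for day in range(7)]

forecast = predict_linear(free_gb, 30 * DAY)
print(f"free space in 30 days: {forecast:.1f} GB")
print("disk will fill within 30 days:", forecast < 0)
```

A negative forecast is exactly the condition the disk query tests with `< 0`.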

Cost Optimization

Component   Optimization
Prometheus  Reduce cardinality, shorter retention
Loki        Use retention policies, compress old logs
Jaeger      Sample traces (not 100%), TTL on storage
Grafana     Limit dashboard refresh rates
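Sampling traces rather than keeping all of them is usually done with a probabilistic sampler. The sketch below shows one common approach, hashing the trace ID so that every service makes the same keep-or-drop decision for a given trace. The function name `should_sample` and the 10% rate are assumptions for illustration; in practice you would configure the sampler that your tracing SDK or collector already provides.

```python
import hashlib


def should_sample(trace_id, sample_rate):
    """Decide deterministically whether to keep a trace.

    The trace ID is hashed to a 64-bit integer and compared with a
    threshold proportional to sample_rate (0.0 keeps nothing, 1.0 keeps
    everything). The same trace ID always gives the same answer.
    """
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big")
    threshold = int(sample_rate * 2**64)
    return bucket < threshold


# Hypothetical trace IDs, sampled at 10%.
trace_ids = [f"trace-{i}" for i in range(10_000)]
kept = sum(should_sample(trace_id, 0.1) for trace_id in trace_ids)
print(f"kept {kept} of {len(trace_ids)} traces")
```

Deciding from the trace ID instead of a random number keeps traces complete: either all spans of a trace are stored or none are.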

Retention Policies

# Prometheus retention
--storage.tsdb.retention.time=30d
--storage.tsdb.retention.size=50GB
 
# Loki retention (applied by the compactor, so retention must be enabled there)
compactor:
  retention_enabled: true
limits_config:
  retention_period: 14d
 
# Jaeger TTL (illustrative keys; the actual setting depends on the storage backend)
span_store:
  max_traces: 1000000
  max_span_age: 7d
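Before settling on a retention time and a size limit, it helps to check that the two are consistent. The sketch below applies the rule of thumb from the Prometheus storage documentation: needed disk space is roughly retention time in seconds times ingested samples per second times bytes per sample, with about 1 to 2 bytes per sample. The ingestion rates used here are invented.

```python
def prometheus_disk_bytes(retention_days, samples_per_second, bytes_per_sample=2.0):
    """Estimate the disk space needed to hold `retention_days` of samples."""
    retention_seconds = retention_days * 24 * 3600
    return retention_seconds * samples_per_second * bytes_per_sample


def fits_size_limit(retention_days, samples_per_second, limit_bytes):
    """True if the time-based retention fits inside the size-based limit."""
    return prometheus_disk_bytes(retention_days, samples_per_second) <= limit_bytes


SIZE_LIMIT = 50 * 10**9  # 50 GB, treated as decimal gigabytes here

# Hypothetical ingestion rates in samples per second.
for rate in (5_000, 10_000):
    needed_gb = prometheus_disk_bytes(30, rate) / 10**9
    print(f"{rate} samples/s: {needed_gb:.2f} GB, fits: {fits_size_limit(30, rate, SIZE_LIMIT)}")
```

If the estimate exceeds the size limit, Prometheus starts dropping the oldest data before the 30 days are up, so the effective retention is shorter than the time flag suggests.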

Infrastructure as Code for Monitoring

Terraform + Monitoring

# Deploy monitoring stack with Terraform
variable "grafana_password" {
  type      = string
  sensitive = true
}

resource "helm_release" "monitoring" {
  name             = "monitoring"
  repository       = "https://prometheus-community.github.io/helm-charts"
  chart            = "kube-prometheus-stack"
  namespace        = "monitoring"
  create_namespace = true

  # Keep the admin password out of plan output
  set_sensitive {
    name  = "grafana.adminPassword"
    value = var.grafana_password
  }

  set {
    name  = "prometheus.prometheusSpec.retention"
    value = "30d"
  }
}

Grafana Dashboards as Code

{
  "dashboard": {
    "title": "NextGen Platform Overview",
    "panels": [
      {
        "title": "Request Rate",
        "type": "timeseries",
        "targets": [{
          "expr": "sum(rate(http_requests_total[5m]))"
        }]
      }
    ]
  }
}
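Hand-written dashboard JSON gets repetitive once there are many panels, so one option is to generate it. The sketch below builds the same structure as the JSON above from two small helper functions. The helper names are hypothetical, and the second panel assumes the request counter carries a `status` label, which the original example does not state.

```python
import json


def timeseries_panel(title, expr):
    """Build one time series panel with a single PromQL target."""
    return {"title": title, "type": "timeseries", "targets": [{"expr": expr}]}


def build_dashboard(title, panels):
    """Wrap panels in the dashboard envelope used in the JSON above."""
    return {"dashboard": {"title": title, "panels": panels}}


dashboard = build_dashboard(
    "NextGen Platform Overview",
    [
        timeseries_panel("Request Rate", "sum(rate(http_requests_total[5m]))"),
        # Assumes a `status` label exists on http_requests_total.
        timeseries_panel(
            "Error Rate",
            'sum(rate(http_requests_total{status=~"5.."}[5m]))',
        ),
    ],
)

print(json.dumps(dashboard, indent=2))
```

The generated JSON can be committed to version control and reviewed like any other code change.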

Summary

You've learned:

Designing a complete observability platform
Correlating metrics, logs, and traces
On-call best practices and incident management
Capacity planning and cost optimization
Infrastructure as Code for monitoring

Next Steps

You now have a complete monitoring and observability foundation. Apply these practices across your entire infrastructure for production-grade visibility.
