The Three Pillars of Observability
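Observability is the ability to understand what is happening inside your systems from the signals they emit. Those signals are usually grouped into three pillars: metrics, logs, and traces.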
┌─────────────────────────────────────────────┐
│                Observability                │
├───────────────┬──────────────┬──────────────┤
│    Metrics    │     Logs     │    Traces    │
│   (Numbers)   │   (Events)   │  (Requests)  │
├───────────────┼──────────────┼──────────────┤
│  Prometheus   │  Loki/ELK    │  Jaeger      │
│  Datadog      │  Fluentd     │  Zipkin      │
│  Grafana      │  CloudWatch  │  Tempo       │
└───────────────┴──────────────┴──────────────┘
| Pillar | What It Tells You | Example |
|---|---|---|
| Metrics | How much / how fast | CPU at 85%, 200 req/s |
| Logs | What happened | "Error: connection refused" |
| Traces | Where time is spent | Request took 2s (DB: 1.5s) |
What is Prometheus?
Prometheus is an open-source monitoring system that collects metrics with a pull model (it scrapes HTTP endpoints on a schedule), stores them as time series, and lets you query them with its own query language, PromQL.
Architecture
┌──────────────────────────────────────────────┐
│              Prometheus Server               │
│  ┌──────────┐  ┌─────────┐  ┌────────────┐   │
│  │ Retrieval│  │  TSDB   │  │ HTTP Server│   │
│  │ (Scrape) │  │(Storage)│  │  (PromQL)  │   │
│  └────┬─────┘  └─────────┘  └────────────┘   │
└───────┼──────────────────────────────────────┘
        │ scrape
    ┌───┴──────┬──────────────┐
    │          │              │
┌───▼────┐ ┌───▼────┐ ┌───────▼───────┐
│ App 1  │ │ App 2  │ │ Node Exporter │
│/metrics│ │/metrics│ │   /metrics    │
└────────┘ └────────┘ └───────────────┘
Key Concepts
| Concept | Description |
|---|---|
| Target | Endpoint Prometheus scrapes |
| Metric | Named time series data |
| Label | Key-value metadata on metrics |
| Scrape | Pulling metrics from a target |
| TSDB | Time Series Database (storage) |
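To make the terms in the table concrete, here is a minimal sketch of what a target serves and what a scrape reads. It uses only the Python standard library and writes the Prometheus text exposition format by hand; a real application would normally use a client library instead. The metric name `app_requests_total`, its labels and values, and the helper names are assumptions made up for illustration.

```python
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.request import urlopen

# Hypothetical samples for one metric: label pairs -> current value.
SAMPLES = {
    (("method", "GET"), ("status_code", "200")): 42,
    (("method", "GET"), ("status_code", "500")): 3,
}


def render_metrics(name, metric_type, samples):
    """Render one metric family in the Prometheus text exposition format."""
    lines = [f"# TYPE {name} {metric_type}"]
    for labels, value in sorted(samples.items()):
        label_text = ",".join(f'{key}="{val}"' for key, val in labels)
        lines.append(f"{name}{{{label_text}}} {value}")
    return "\n".join(lines) + "\n"


class MetricsHandler(BaseHTTPRequestHandler):
    """A target: serves the current samples at /metrics."""

    def do_GET(self):
        if self.path != "/metrics":
            self.send_error(404)
            return
        body = render_metrics("app_requests_total", "counter", SAMPLES).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # silence per-request logging in this example


def scrape_once():
    """Start the target on a free local port and pull /metrics from it once."""
    server = HTTPServer(("127.0.0.1", 0), MetricsHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    try:
        url = f"http://127.0.0.1:{server.server_port}/metrics"
        with urlopen(url, timeout=5) as response:
            return response.read().decode("utf-8")
    finally:
        server.shutdown()
        server.server_close()


# Print what a scraper would receive, without starting the server.
print(render_metrics("app_requests_total", "counter", SAMPLES), end="")
```

Each output line is one time series: the metric name, its labels in braces, and the current value. Calling `scrape_once()` fetches the same text over HTTP, which is roughly what Prometheus does to every target on each scrape interval.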
Installing Prometheus
With Docker
# Create the config directory (may require sudo)
mkdir -p /etc/prometheus
# Run Prometheus; /etc/prometheus must already contain the prometheus.yml from the next section
docker run -d \
  --name prometheus \
  -p 9090:9090 \
  -v /etc/prometheus:/etc/prometheus \
  -v prometheus_data:/prometheus \
  prom/prometheus:latest
prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s
scrape_configs:
  - job_name: "prometheus"
    static_configs:
      - targets: ["localhost:9090"]
  # The node-exporter and app hostnames resolve only if the containers share a user-defined Docker network
  - job_name: "node-exporter"
    static_configs:
      - targets: ["node-exporter:9100"]
  - job_name: "app"
    static_configs:
      - targets: ["app:3000"]
    metrics_path: /metrics
Node Exporter (System Metrics)
docker run -d \
  --name node-exporter \
  -p 9100:9100 \
  --pid="host" \
  -v "/:/host:ro,rslave" \
  quay.io/prometheus/node-exporter:latest \
  --path.rootfs=/host
Metric Types
| Type | Description | Example |
|---|---|---|
| Counter | Only goes up (resets to zero when the process restarts) | Total requests, errors |
| Gauge | Goes up and down | Temperature, memory usage |
| Histogram | Distribution of values | Request duration buckets |
| Summary | Like a histogram, but calculates quantiles in the client | Quantiles (p50, p95, p99) |
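The difference between the types is easiest to see in code. The classes below are a hypothetical, simplified model written for this tutorial, not the API of any Prometheus client library. They only illustrate which operations each type allows and what a histogram stores; the Summary type is left out.

```python
import bisect


class Counter:
    """Monotonic value: it can only be incremented."""

    def __init__(self):
        self.value = 0.0

    def inc(self, amount=1.0):
        if amount < 0:
            raise ValueError("counters cannot decrease")
        self.value += amount


class Gauge:
    """Point-in-time value that may move in either direction."""

    def __init__(self):
        self.value = 0.0

    def set(self, value):
        self.value = value


class Histogram:
    """Counts observations in cumulative buckets, plus a running sum and count."""

    def __init__(self, upper_bounds):
        # The final +Inf bucket catches every observation.
        self.upper_bounds = sorted(upper_bounds) + [float("inf")]
        self.bucket_counts = [0] * len(self.upper_bounds)
        self.sum = 0.0
        self.count = 0

    def observe(self, value):
        # First bucket whose upper bound is >= value ("less than or equal" buckets).
        first = bisect.bisect_left(self.upper_bounds, value)
        # Buckets are cumulative, so every larger bucket counts the value too.
        for index in range(first, len(self.bucket_counts)):
            self.bucket_counts[index] += 1
        self.sum += value
        self.count += 1


requests_total = Counter()
requests_total.inc()
requests_total.inc(2)

memory_bytes = Gauge()
memory_bytes.set(512.0)
memory_bytes.set(256.0)

latency_seconds = Histogram([0.1, 0.5, 1.0])
for seconds in (0.05, 0.3, 0.3, 2.0):
    latency_seconds.observe(seconds)

print(requests_total.value)           # 3.0
print(memory_bytes.value)             # 256.0
print(latency_seconds.bucket_counts)  # [1, 3, 3, 4]
```

Because the buckets are cumulative, the last count always equals the total number of observations, which is what lets a query estimate quantiles from bucket counts later on.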
PromQL Basics
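Several queries in this section wrap a counter in `rate()`. As a rough mental model only, the sketch below computes a per-second increase from raw (timestamp, value) samples and treats any drop in value as a counter reset. Prometheus's own `rate()` also extrapolates towards the edges of the time window, so its results will differ slightly; the function name `simple_rate` and the sample data are assumptions for illustration.

```python
def simple_rate(samples):
    """Per-second increase of a counter over (timestamp_seconds, value) samples.

    A value lower than its predecessor is treated as a counter reset,
    meaning the counter is assumed to have restarted from zero.
    """
    if len(samples) < 2:
        return None  # at least two samples are needed
    increase = 0.0
    for (_, previous), (_, current) in zip(samples, samples[1:]):
        if current >= previous:
            increase += current - previous
        else:
            increase += current  # reset: count everything since the restart
    elapsed = samples[-1][0] - samples[0][0]
    return increase / elapsed if elapsed > 0 else None


# One sample every 15 seconds; the counter restarts between t=30 and t=45.
samples = [(0, 100), (15, 130), (30, 160), (45, 10), (60, 40)]
print(simple_rate(samples))  # total increase of 100 over 60 seconds
```

This is why `rate()` is applied to counters rather than gauges: the reset handling only makes sense for a value that otherwise never decreases.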
# Instant vector — current value
node_cpu_seconds_total
# Filter by label
node_cpu_seconds_total{mode="idle"}
# Rate — per-second increase over time
rate(http_requests_total[5m])
# Aggregation
sum(rate(http_requests_total[5m])) by (status_code)
# CPU usage percentage
100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
# Memory usage
node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes
# Disk usage percentage
(node_filesystem_size_bytes - node_filesystem_avail_bytes) / node_filesystem_size_bytes * 100
Alerting Rules
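Each alerting rule pairs a PromQL expression with an optional `for` duration. The alert is pending while the expression keeps returning a result, and it fires only once that has lasted for the whole duration. The sketch below models this for a single series; the function `alert_states` and its inputs are hypothetical and leave out details such as tracking each label set separately.

```python
def alert_states(evaluations, for_seconds):
    """Return the alert state after each rule evaluation.

    evaluations: list of (timestamp_seconds, condition_is_true) pairs,
    one per evaluation interval.
    """
    states = []
    active_since = None  # time the condition first became true
    for timestamp, condition in evaluations:
        if not condition:
            active_since = None  # any false evaluation restarts the clock
            states.append("inactive")
            continue
        if active_since is None:
            active_since = timestamp
        if timestamp - active_since >= for_seconds:
            states.append("firing")
        else:
            states.append("pending")
    return states


evaluations = [(0, True), (60, True), (120, False), (180, True), (240, True), (300, True)]
print(alert_states(evaluations, for_seconds=120))
# ['pending', 'pending', 'inactive', 'pending', 'pending', 'firing']
```

The `for` clause is what keeps a short spike from paging anyone: the condition has to hold across consecutive evaluations before the alert reaches Alertmanager.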
# alert_rules.yml (list this file under rule_files in prometheus.yml so it is loaded)
groups:
  - name: system_alerts
    rules:
      - alert: HighCPUUsage
        expr: 100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High CPU usage on {{ $labels.instance }}"
          description: "CPU usage has been above 80% for 5 minutes."
      - alert: HighMemoryUsage
        expr: (node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes) / node_memory_MemTotal_bytes * 100 > 85
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "High memory usage on {{ $labels.instance }}"
      - alert: DiskSpaceLow
        expr: (node_filesystem_avail_bytes / node_filesystem_size_bytes) * 100 < 15
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Disk space low on {{ $labels.instance }}"
Alertmanager
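Alertmanager receives firing alerts from Prometheus, groups them by the labels listed in `group_by`, and sends one notification per group to the configured receiver. The following is a minimal sketch of that grouping step only, using made-up alerts and a hypothetical `group_alerts` helper; timing options such as `group_wait` and `repeat_interval` are not modelled.

```python
from collections import defaultdict


def group_alerts(alerts, group_by):
    """Bucket alerts by the values of the labels named in group_by."""
    groups = defaultdict(list)
    for alert in alerts:
        # Alerts that lack a grouping label share the empty-string value.
        key = tuple(alert["labels"].get(label, "") for label in group_by)
        groups[key].append(alert)
    return dict(groups)


# Made-up alerts; the instance names are placeholders.
alerts = [
    {"labels": {"alertname": "HighCPUUsage", "instance": "web-1:9100"}},
    {"labels": {"alertname": "HighCPUUsage", "instance": "web-2:9100"}},
    {"labels": {"alertname": "DiskSpaceLow", "instance": "db-1:9100"}},
]

for key, members in group_alerts(alerts, ["alertname"]).items():
    instances = [alert["labels"]["instance"] for alert in members]
    print(key, instances)
```

With `group_by: ['alertname']`, the two CPU alerts end up in one notification instead of two. Adding `instance` to the list would split them again.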
# alertmanager.yml
global:
  resolve_timeout: 5m
route:
  group_by: ['alertname']
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 1h
  receiver: 'team-notifications'
receivers:
  - name: 'team-notifications'
    slack_configs:
      # Slack also needs a webhook URL: set api_url here or slack_api_url under global
      - channel: '#alerts'
        send_resolved: true
Summary
You've learned:
- The three pillars of observability (metrics, logs, traces)
- The Prometheus architecture and its pull-based model
- Installing Prometheus and Node Exporter
- Metric types and PromQL queries
- Configuring alerting rules and Alertmanager
Next Steps
Next, we'll visualize metrics with Grafana dashboards.