Introduction to Monitoring & Prometheus

25 min | Lesson 1 of 7

Learning Objectives

  • Understand the three pillars of observability
  • Install and configure Prometheus
  • Write PromQL queries for metrics analysis
  • Configure alerting rules

The Three Pillars of Observability

Observability is the ability to understand what is happening inside a system from the signals it emits: metrics, logs, and traces.

┌─────────────────────────────────────────────┐
│                Observability                │
├───────────────┬──────────────┬──────────────┤
│    Metrics    │     Logs     │    Traces    │
│   (Numbers)   │   (Events)   │  (Requests)  │
├───────────────┼──────────────┼──────────────┤
│  Prometheus   │   Loki/ELK   │    Jaeger    │
│  Datadog      │   Fluentd    │    Zipkin    │
│  Grafana      │   CloudWatch │    Tempo     │
└───────────────┴──────────────┴──────────────┘

Pillar    What It Tells You     Example
-------   -------------------   ---------------------------
Metrics   How much / how fast   CPU at 85%, 200 req/s
Logs      What happened         "Error: connection refused"
Traces    Where time is spent   Request took 2s (DB: 1.5s)

What is Prometheus?

Prometheus is an open-source monitoring system that collects metrics via a pull model, stores them as time series, and provides a powerful query language (PromQL).

Architecture

┌──────────────────────────────────────────────┐
│               Prometheus Server              │
│  ┌──────────┐  ┌─────────┐  ┌────────────┐   │
│  │ Retrieval│  │  TSDB   │  │ HTTP Server│   │
│  │ (Scrape) │  │(Storage)│  │  (PromQL)  │   │
│  └────┬─────┘  └─────────┘  └────────────┘   │
└───────┼──────────────────────────────────────┘
        │ scrape
   ┌────┴─────┬──────────────┐
   │          │              │
┌──▼─────┐ ┌──▼─────┐ ┌──────▼────────┐
│ App 1  │ │ App 2  │ │ Node Exporter │
│/metrics│ │/metrics│ │   /metrics    │
└────────┘ └────────┘ └───────────────┘

Key Concepts

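Concept   Description
-------   ------------------------------------------------------------
Target    An endpoint that Prometheus scrapes
Metric    A named measurement; each label combination is a time series
Label     Key-value metadata attached to a metric
Scrape    One pull of metrics from a target
TSDB      Time series database (Prometheus's local storage)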
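
To make these terms concrete, here is a minimal sketch of a scrape target written with only the Python standard library. It is an illustration, not the recommended way to instrument an application (the official client libraries normally handle this). The metric name `demo_requests_total`, its label values, and port 8000 are assumptions chosen for the example.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical in-memory counters keyed by label values (method, status_code).
REQUEST_COUNTS = {("GET", "200"): 42, ("POST", "500"): 3}


def render_metrics(counts):
    """Render the counters in the Prometheus text exposition format."""
    lines = [
        "# HELP demo_requests_total Total HTTP requests handled.",
        "# TYPE demo_requests_total counter",
    ]
    # One line per label combination: each one is a separate time series.
    for (method, status), value in sorted(counts.items()):
        lines.append(
            f'demo_requests_total{{method="{method}",status_code="{status}"}} {value}'
        )
    return "\n".join(lines) + "\n"


class MetricsHandler(BaseHTTPRequestHandler):
    """Serves the current counter values whenever Prometheus scrapes /metrics."""

    def do_GET(self):
        if self.path != "/metrics":
            self.send_error(404)
            return
        body = render_metrics(REQUEST_COUNTS).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port=8000):
    """Start the target; blocks until interrupted."""
    HTTPServer(("", port), MetricsHandler).serve_forever()
```

Calling `serve()` and opening `http://localhost:8000/metrics` should show the same text that Prometheus would read on each scrape.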

Installing Prometheus

With Docker

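# Create the config directory and save the prometheus.yml shown below into it
# before starting the container (the mount hides the image's default config)
mkdir -p /etc/prometheus
 
# Create a user-defined network so containers can reach each other by name
# (the network name "monitoring" is only an example)
docker network create monitoring
 
# Run Prometheus
docker run -d \
  --name prometheus \
  --network monitoring \
  -p 9090:9090 \
  -v /etc/prometheus:/etc/prometheus \
  -v prometheus_data:/prometheus \
  prom/prometheus:latest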

prometheus.yml

global:
  scrape_interval: 15s
  evaluation_interval: 15s
 
scrape_configs:
  - job_name: "prometheus"
    static_configs:
      - targets: ["localhost:9090"]
 
  - job_name: "node-exporter"
    static_configs:
      - targets: ["node-exporter:9100"]
 
  - job_name: "app"
    static_configs:
      - targets: ["app:3000"]
    metrics_path: /metrics

Node Exporter (System Metrics)

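# Assumes the "monitoring" network exists (docker network create monitoring)
# so that Prometheus can resolve the hostname node-exporter
docker run -d \
  --name node-exporter \
  --network monitoring \
  -p 9100:9100 \
  --pid="host" \
  -v "/:/host:ro,rslave" \
  quay.io/prometheus/node-exporter:latest \
  --path.rootfs=/host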

Metric Types

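Type        Description                                    Example
---------   --------------------------------------------   -------------------------
Counter     Only increases (resets to zero on restart)     Total requests, errors
Gauge       Can go up and down                             Temperature, memory usage
Histogram   Counts observations in configurable buckets    Request duration buckets
Summary     Like a histogram, with client-side quantiles   Quantiles (p50, p95, p99)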
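
The difference between these types is easier to see in code. The following is a simplified model in plain Python, not the API of any Prometheus client library; the class and method names are chosen for illustration only. Summaries are omitted because their quantile estimation is more involved.

```python
import bisect


class Counter:
    """Monotonic value: it can only be incremented."""

    def __init__(self):
        self.value = 0.0

    def inc(self, amount=1.0):
        if amount < 0:
            raise ValueError("counters cannot decrease")
        self.value += amount


class Gauge:
    """Point-in-time value that may rise or fall."""

    def __init__(self):
        self.value = 0.0

    def set(self, value):
        self.value = value


class Histogram:
    """Counts observations per bucket and tracks a running sum and count."""

    def __init__(self, bounds):
        self.bounds = sorted(bounds)
        self.bucket_counts = [0] * (len(self.bounds) + 1)  # last slot is +Inf
        self.sum = 0.0
        self.count = 0

    def observe(self, value):
        # Pick the first bucket whose upper bound ("le") is >= value.
        index = bisect.bisect_left(self.bounds, value)
        self.bucket_counts[index] += 1
        self.sum += value
        self.count += 1

    def cumulative(self):
        """Return [(le, count)] with cumulative counts, as buckets are exposed."""
        total, out = 0, []
        for le, n in zip(self.bounds + [float("inf")], self.bucket_counts):
            total += n
            out.append((le, total))
        return out
```

For example, a `Histogram([0.1, 0.5, 1.0])` that observes request durations of 0.05, 0.3, 0.5, and 2.0 seconds ends up with cumulative bucket counts of 1, 3, 3, and 4.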

PromQL Basics

# Instant vector: the most recent sample of every matching series
node_cpu_seconds_total
 
# Filter by label
node_cpu_seconds_total{mode="idle"}
 
# Rate: per-second average increase of a counter over the last 5 minutes
rate(http_requests_total[5m])
 
# Aggregation
sum(rate(http_requests_total[5m])) by (status_code)
 
# CPU usage percentage
100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
 
# Memory in use (bytes)
node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes
 
# Disk usage percentage
(node_filesystem_size_bytes - node_filesystem_avail_bytes) / node_filesystem_size_bytes * 100
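
To build intuition for what `rate()` computes, here is a simplified sketch in Python. It assumes samples arrive as `(timestamp_seconds, value)` pairs with increasing timestamps and treats any drop in value as a counter reset. Unlike the real function, it does not extrapolate to the edges of the time window, so its results can differ slightly from what Prometheus returns. The function names are hypothetical.

```python
def simple_rate(samples):
    """Approximate per-second increase of a counter across a window of samples."""
    if len(samples) < 2:
        return None  # at least two samples are needed to measure an increase
    increase = 0.0
    for (_, prev), (_, curr) in zip(samples, samples[1:]):
        # A lower value than before means the counter restarted from zero.
        increase += curr - prev if curr >= prev else curr
    elapsed = samples[-1][0] - samples[0][0]
    return increase / elapsed


def cpu_usage_percent(idle_rate):
    """Mirror the CPU query: idle_rate is the fraction of each second spent idle."""
    return 100 - idle_rate * 100
```

With one sample every 15 seconds and a counter that grows by 30 each time, `simple_rate` returns 2.0 requests per second. An idle rate of 0.25 corresponds to 75% CPU usage.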

Alerting Rules

# alert_rules.yml (Prometheus loads it only if the file is listed under rule_files in prometheus.yml)
groups:
  - name: system_alerts
    rules:
      - alert: HighCPUUsage
        expr: 100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High CPU usage on {{ $labels.instance }}"
          description: "CPU usage has been above 80% for more than 5 minutes."
 
      - alert: HighMemoryUsage
        expr: (node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes) / node_memory_MemTotal_bytes * 100 > 85
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "High memory usage on {{ $labels.instance }}"
 
      - alert: DiskSpaceLow
        expr: (node_filesystem_avail_bytes / node_filesystem_size_bytes) * 100 < 15
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Disk space low on {{ $labels.instance }}"
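
The `for` clause is what keeps a brief spike from paging anyone: an alert stays pending until its expression has been true for the whole duration. The sketch below models that behavior for a single alert, assuming one boolean per rule evaluation. The function and state names are illustrative and the model ignores details such as multiple label sets.

```python
def alert_states(condition_by_eval, eval_interval_s, for_s):
    """Return the alert state after each rule evaluation.

    condition_by_eval: list of booleans, True when the expression matched.
    eval_interval_s:   seconds between evaluations (evaluation_interval).
    for_s:             the rule's `for` duration in seconds.
    """
    states = []
    active_since = None  # time at which the condition most recently became true
    for i, condition in enumerate(condition_by_eval):
        now = i * eval_interval_s
        if not condition:
            active_since = None  # any false evaluation resets the timer
            states.append("inactive")
            continue
        if active_since is None:
            active_since = now
        states.append("firing" if now - active_since >= for_s else "pending")
    return states
```

With a 15-second evaluation interval and `for_s=30`, a condition that is true twice, false once, and then true three times yields pending, pending, inactive, pending, pending, firing.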

Alertmanager

# alertmanager.yml
global:
  resolve_timeout: 5m
 
route:
  group_by: ['alertname']
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 1h
  receiver: 'team-notifications'
 
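receivers:
  - name: 'team-notifications'
    slack_configs:
      - channel: '#alerts'
        send_resolved: true
        # Required unless slack_api_url is set under global (placeholder value)
        api_url: '<your Slack incoming webhook URL>'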
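
The `group_by` setting decides which alerts are bundled into one notification. As a rough sketch, assuming each alert is a plain dictionary of labels, the grouping step can be pictured as below. Timing settings such as `group_wait` and `repeat_interval` are not modeled, and the function name is hypothetical.

```python
from collections import defaultdict


def group_alerts(alerts, group_by):
    """Bucket alerts by the values of the labels listed in group_by."""
    groups = defaultdict(list)
    for labels in alerts:
        # Alerts sharing the same values for every group_by label land together.
        key = tuple(labels.get(name, "") for name in group_by)
        groups[key].append(labels)
    return dict(groups)


alerts = [
    {"alertname": "HighCPUUsage", "instance": "web-1:9100"},
    {"alertname": "HighCPUUsage", "instance": "web-2:9100"},
    {"alertname": "DiskSpaceLow", "instance": "db-1:9100"},
]
grouped = group_alerts(alerts, ["alertname"])
```

Here the two `HighCPUUsage` alerts fall into one group and `DiskSpaceLow` into another, so a team would receive two notifications rather than three. The instance names are made up for the example.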

Summary

You've learned:

  • The three pillars of observability (metrics, logs, traces)
  • Prometheus architecture and pull-based model
  • Installing Prometheus and Node Exporter
  • Metric types and PromQL queries
  • Configuring alerting rules and Alertmanager

Next Steps

Next, we'll visualize metrics with Grafana dashboards.