What is Distributed Tracing?
Tracing follows a single request as it flows through multiple services, showing where time is spent.
```
Request: GET /api/courses
├── API Gateway (5ms)
├── Auth Service (15ms)
├── Course Service (120ms)
│   ├── Database Query (80ms)   ← Bottleneck!
│   └── Cache Lookup (5ms)
└── Response (total: 145ms)
```
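The tree above is just a set of spans linked by parent IDs. The following is a minimal sketch of how the bottleneck can be found by computing each span's *self time*, meaning its duration minus the time spent in its direct children. It assumes a simplified span shape (`spanId`, `parentId`, `name`, `durationMs`) that is not the OpenTelemetry data model, and it assumes that children run one after another without overlapping.

```javascript
// Simplified, illustrative span records (not the OpenTelemetry data model).
const spans = [
  { spanId: 'a1', parentId: null, name: 'GET /api/courses', durationMs: 145 },
  { spanId: 'b2', parentId: 'a1', name: 'API Gateway', durationMs: 5 },
  { spanId: 'c3', parentId: 'a1', name: 'Auth Service', durationMs: 15 },
  { spanId: 'd4', parentId: 'a1', name: 'Course Service', durationMs: 120 },
  { spanId: 'e5', parentId: 'd4', name: 'Database Query', durationMs: 80 },
  { spanId: 'f6', parentId: 'd4', name: 'Cache Lookup', durationMs: 5 },
];

// Self time = own duration minus the summed durations of direct children.
// Assumes children are sequential; overlapping children would be double-counted.
function selfTimes(spans) {
  const childTotals = new Map();
  for (const s of spans) {
    if (s.parentId !== null) {
      childTotals.set(s.parentId, (childTotals.get(s.parentId) ?? 0) + s.durationMs);
    }
  }
  return spans.map((s) => ({
    name: s.name,
    selfMs: s.durationMs - (childTotals.get(s.spanId) ?? 0),
  }));
}

// The bottleneck is the span with the largest self time.
function bottleneck(spans) {
  return selfTimes(spans).reduce((max, s) => (s.selfMs > max.selfMs ? s : max));
}

// bottleneck(spans) → { name: 'Database Query', selfMs: 80 }
```

Self time matters more than raw duration here. Course Service has the longest duration (120ms), but only 35ms of that is its own work. The rest is spent waiting on the database query.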
Key Concepts
| Term | Definition |
|---|---|
| Trace | End-to-end journey of a request |
| Span | A single operation within a trace |
| Context | Trace ID, parent span ID, and sampling flags propagated between services (typically in HTTP headers) |
| Trace ID | Unique identifier for the entire request |
| Span ID | Unique identifier for each operation |
| Parent Span | The calling operation |
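By default, OpenTelemetry propagates this context in the W3C Trace Context `traceparent` HTTP header. That header packs the trace ID, the caller's span ID, and a sampled flag into a single string. The sketch below is hand-rolled to show what happens at each hop. Real services should let the SDK's propagator do this, and this parser accepts only version `00`.

```javascript
const crypto = require('node:crypto');

// traceparent (version 00): 00-<32 hex trace-id>-<16 hex parent span-id>-<2 hex flags>
const TRACEPARENT_RE = /^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$/;

function parseTraceparent(header) {
  const m = TRACEPARENT_RE.exec(header);
  if (!m) return null;
  const [, traceId, parentSpanId, flags] = m;
  // All-zero IDs are invalid.
  if (/^0+$/.test(traceId) || /^0+$/.test(parentSpanId)) return null;
  // Bit 0 of the flags byte is the "sampled" flag.
  return { traceId, parentSpanId, sampled: (parseInt(flags, 16) & 1) === 1 };
}

// A service receiving a request keeps the trace ID, creates a new span ID
// for its own operation, and sends that span ID as the parent to the next hop.
function childTraceparent(incoming) {
  const ctx = parseTraceparent(incoming);
  if (!ctx) throw new Error('invalid traceparent header');
  const spanId = crypto.randomBytes(8).toString('hex');
  const flags = ctx.sampled ? '01' : '00';
  return {
    spanId,
    parentSpanId: ctx.parentSpanId,
    header: `00-${ctx.traceId}-${spanId}-${flags}`,
  };
}
```

The trace ID never changes along the request path, and that is what lets Jaeger assemble spans from different services into one trace. The span ID changes at every hop, and the parent span ID records who called whom.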
Jaeger Architecture
```
┌──────────────┐    ┌──────────────┐    ┌──────────────┐
│     App      │───▶│    Jaeger    │───▶│   Storage    │
│  (OTel SDK)  │    │  Collector   │    │(Elasticsearch│
│              │    │              │    │ or Cassandra)│
└──────────────┘    └──────────────┘    └──────┬───────┘
                                               │
                                        ┌──────▼───────┐
                                        │ Jaeger Query │
                                        │    + UI      │
                                        └──────────────┘
```
Deploy Jaeger (All-in-One)
```bash
docker run -d \
  --name jaeger \
  -p 16686:16686 \
  -p 4317:4317 \
  -p 4318:4318 \
  jaegertracing/all-in-one:latest
```

| Port | Protocol | Purpose |
|---|---|---|
| 16686 | HTTP | Jaeger UI |
| 4317 | gRPC | OTLP trace ingestion (OpenTelemetry Protocol over gRPC) |
| 4318 | HTTP | OTLP trace ingestion (OpenTelemetry Protocol over HTTP) |
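Before you instrument a real service, it can help to confirm that the OTLP/HTTP port (4318) accepts data. The sketch below hand-builds a single span and posts it. It assumes Node.js 18+ for the built-in `fetch`, and it assumes that Jaeger's OTLP receiver accepts the JSON encoding of OTLP. The names `smoke-test` and `test-span` are placeholders.

```javascript
const crypto = require('node:crypto');

// Build a minimal OTLP/JSON payload with one span. IDs are hex strings.
// Timestamps are nanoseconds since the epoch, sent as strings because they
// exceed Number.MAX_SAFE_INTEGER.
function buildTestPayload(serviceName, spanName) {
  const endNs = BigInt(Date.now()) * 1000000n;
  const startNs = endNs - 50000000n; // pretend the span lasted 50 ms
  return {
    resourceSpans: [{
      resource: {
        attributes: [{ key: 'service.name', value: { stringValue: serviceName } }],
      },
      scopeSpans: [{
        scope: { name: 'smoke-test' },
        spans: [{
          traceId: crypto.randomBytes(16).toString('hex'),
          spanId: crypto.randomBytes(8).toString('hex'),
          name: spanName,
          kind: 2, // SPAN_KIND_SERVER
          startTimeUnixNano: startNs.toString(),
          endTimeUnixNano: endNs.toString(),
        }],
      }],
    }],
  };
}

// POST the payload to an OTLP/HTTP traces endpoint.
async function sendTestSpan(endpoint, serviceName = 'smoke-test') {
  const res = await fetch(endpoint, {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify(buildTestPayload(serviceName, 'test-span')),
  });
  if (!res.ok) throw new Error(`OTLP export failed: HTTP ${res.status}`);
}
```

Call `sendTestSpan('http://localhost:4318/v1/traces')`, then open the UI on port 16686. The `smoke-test` service should appear with a single 50 ms span.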
OpenTelemetry Instrumentation
Node.js Application
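```javascript
// tracing.js: load this before any application code
const { NodeSDK } = require('@opentelemetry/sdk-node');
const { OTLPTraceExporter } = require('@opentelemetry/exporter-trace-otlp-http');
const { getNodeAutoInstrumentations } = require('@opentelemetry/auto-instrumentations-node');

const sdk = new NodeSDK({
  serviceName: 'nextgen-api',
  traceExporter: new OTLPTraceExporter({
    // "jaeger" resolves only on a shared Docker network; use
    // http://localhost:4318/v1/traces when the app runs on the host.
    url: 'http://jaeger:4318/v1/traces',
  }),
  instrumentations: [getNodeAutoInstrumentations()],
});

sdk.start();

// Flush buffered spans before the process exits.
process.on('SIGTERM', () => {
  sdk.shutdown()
    .catch((err) => console.error('Error shutting down tracing', err))
    .finally(() => process.exit(0));
});
```

```bash
# Install dependencies (@opentelemetry/api is needed for custom spans)
npm install @opentelemetry/api \
  @opentelemetry/sdk-node \
  @opentelemetry/exporter-trace-otlp-http \
  @opentelemetry/auto-instrumentations-node

# Run with tracing
node --require ./tracing.js app.js
```
Custom Spans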
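```javascript
const { trace, SpanStatusCode } = require('@opentelemetry/api');

const tracer = trace.getTracer('nextgen-api');

async function getCourses() {
  // startActiveSpan makes this span the parent of any spans created inside
  // the callback (e.g. an auto-instrumented database call).
  return tracer.startActiveSpan('getCourses', async (span) => {
    try {
      span.setAttribute('query.type', 'courses');
      // `db` stands for the app's database client; query() is assumed
      // to resolve to an array of rows.
      const courses = await db.query('SELECT * FROM courses');
      span.setAttribute('result.count', courses.length);
      return courses;
    } catch (error) {
      span.recordException(error);
      span.setStatus({ code: SpanStatusCode.ERROR, message: error.message });
      throw error;
    } finally {
      // Always end the span; spans that are never ended are never exported.
      span.end();
    }
  });
}
```
Analyzing Traces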
In the Jaeger UI
- Select the service from the dropdown
- Choose an operation (endpoint)
- Set the time range
- Click "Find Traces"
What to Look For
| Pattern | Indicates |
|---|---|
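| Long spans | Slow operations worth optimizing |
| Many repeated child spans (e.g. one query per item) | Likely N+1 query problem |
| Gaps between spans | Network latency, queuing, or uninstrumented work |
| Error spans | Failed operations |
| Fan-out pattern | Parallel calls (usually good; total time follows the slowest branch) |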
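The child-span and gap patterns in the table can also be checked in code once spans are exported. The sketch below assumes a hypothetical flattened span shape (`spanId`, `parentId`, `name`, `startMs`, `endMs`). The thresholds are arbitrary starting points, not recommendations.

```javascript
// Hypothetical flattened span shape: { spanId, parentId, name, startMs, endMs }.

// Many same-named children under one parent often means one query per item (N+1).
function findRepeatedChildren(spans, threshold = 5) {
  const groups = new Map();
  for (const s of spans) {
    if (s.parentId === null) continue;
    const key = JSON.stringify([s.parentId, s.name]);
    const g = groups.get(key) ?? { parentId: s.parentId, name: s.name, count: 0 };
    g.count += 1;
    groups.set(key, g);
  }
  return [...groups.values()].filter((g) => g.count >= threshold);
}

// Idle time between sibling spans: the parent was waiting on something
// that is not instrumented (network, queuing, or untraced code).
function findGaps(spans, minGapMs = 10) {
  const byParent = new Map();
  for (const s of spans) {
    if (s.parentId === null) continue;
    if (!byParent.has(s.parentId)) byParent.set(s.parentId, []);
    byParent.get(s.parentId).push(s);
  }
  const gaps = [];
  for (const [parentId, children] of byParent) {
    children.sort((a, b) => a.startMs - b.startMs);
    let coveredUntil = children[0].endMs; // latest end time seen so far
    for (const c of children.slice(1)) {
      const gapMs = c.startMs - coveredUntil;
      if (gapMs >= minGapMs) gaps.push({ parentId, before: c.name, gapMs });
      coveredUntil = Math.max(coveredUntil, c.endMs);
    }
  }
  return gaps;
}
```

Grouping by parent and name keeps the check local to one operation. Ten identical queries spread across different requests are normal, but ten under the same parent span usually are not.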
Trace-Based Alerts
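Trace latency alerts usually run against a histogram of span durations. PromQL's `histogram_quantile` turns that histogram into a percentile by interpolating inside cumulative (`le`) buckets, and it never sees individual requests. The sketch below simplifies that estimate, for intuition only. Prometheus handles more edge cases than this, and the bucket counts in the example are made up.

```javascript
// buckets: [{ le: upperBound, count: cumulativeCount }], sorted by le,
// with the last bucket's le equal to Infinity (the "+Inf" bucket).
function histogramQuantile(q, buckets) {
  if (!(q > 0 && q < 1)) throw new RangeError('q must be in (0, 1)');
  if (buckets.length < 2) return NaN;
  const total = buckets[buckets.length - 1].count;
  if (total === 0) return NaN;
  const rank = q * total;
  // First finite bucket whose cumulative count reaches the rank.
  const i = buckets.findIndex((b, idx) => idx < buckets.length - 1 && b.count >= rank);
  // Rank falls in the +Inf bucket: return the largest finite upper bound.
  if (i === -1) return buckets[buckets.length - 2].le;
  const lower = i === 0 ? 0 : buckets[i - 1].le;
  const countBelow = i === 0 ? 0 : buckets[i - 1].count;
  const inBucket = buckets[i].count - countBelow;
  // Assume observations are spread evenly inside the bucket.
  return lower + (buckets[i].le - lower) * ((rank - countBelow) / inBucket);
}

// Made-up latency buckets in milliseconds (100 requests in total).
const exampleBuckets = [
  { le: 100, count: 50 },
  { le: 500, count: 90 },
  { le: 2000, count: 98 },
  { le: Infinity, count: 100 },
];
// histogramQuantile(0.95, exampleBuckets) → 1437.5
```

Because the result is interpolated, a bucket boundary at or near the alert threshold makes the alert more precise than wide buckets that straddle it.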
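```yaml
# Prometheus alerting rule on span metrics (place under groups[].rules).
# Assumes span metrics are generated from traces, e.g. by the OpenTelemetry
# Collector's spanmetrics component; names, labels, and units vary by version.
- alert: HighLatencyTrace
  # Span-metrics latency buckets are in milliseconds, so 2s = 2000.
  expr: |
    histogram_quantile(0.95,
      sum by (le) (rate(traces_spanmetrics_latency_bucket{service_name="nextgen-api"}[5m]))
    ) > 2000
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "P95 latency above 2s for nextgen-api"
```
Connecting the Three Pillars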
```
┌─────────────────────────────────────────┐
│         Unified Observability           │
├─────────────┬─────────────┬─────────────┤
│   Metrics   │    Logs     │   Traces    │
│(Prometheus) │   (Loki)    │  (Jaeger)   │
├─────────────┴─────────────┴─────────────┤
│                Grafana                  │
│         (Single pane of glass)          │
└─────────────────────────────────────────┘
```
Link them together:
- Trace → Logs: Use the trace ID to find related log entries
- Metrics → Traces: With exemplars enabled, click a high-latency data point to open an example trace
- Logs → Traces: Extract the trace ID from a log line to view the full request path
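The Trace → Logs and Logs → Traces links both depend on every log line carrying the trace ID. The sketch below uses structured JSON logs. The field names `trace_id` and `span_id` and the Jaeger base URL are assumptions. The span context is passed in as a function so the sketch runs without OpenTelemetry installed; with `@opentelemetry/api` it could be `() => trace.getActiveSpan()?.spanContext()`.

```javascript
// Build a structured JSON log line that carries the current trace and span IDs.
// getSpanContext returns { traceId, spanId } or undefined when no span is active.
function logWithTrace(getSpanContext, level, message, fields = {}) {
  const ctx = getSpanContext();
  const entry = { level, message, ...fields };
  if (ctx) {
    entry.trace_id = ctx.traceId;
    entry.span_id = ctx.spanId;
  }
  return JSON.stringify(entry); // write this to stdout for the log shipper
}

// Logs -> Traces: read the trace ID back out of a log line and build a
// link to the Jaeger UI (assumed to be reachable at jaegerBaseUrl).
function traceLinkFromLog(line, jaegerBaseUrl = 'http://localhost:16686') {
  const { trace_id: traceId } = JSON.parse(line);
  return traceId ? `${jaegerBaseUrl}/trace/${traceId}` : null;
}
```

In Grafana, a Loki derived field that matches `trace_id` can turn the same value into a clickable link to the trace.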
Summary
You've learned:
- Distributed tracing concepts (traces, spans, context)
- Deploying Jaeger for trace collection
- Instrumenting applications with OpenTelemetry
- Analyzing traces to find bottlenecks
- Connecting metrics, logs, and traces
Next Steps
Next, we'll explore monitoring Kubernetes clusters and containerized applications.