Kubernetes Monitoring Architecture
┌─────────────────────────────────────────────────┐
│               Kubernetes Cluster                │
│                                                 │
│  ┌──────────┐  ┌──────────────┐  ┌───────────┐  │
│  │kube-state│  │node-exporter │  │ cAdvisor  │  │
│  │-metrics  │  │(per node)    │  │(per node) │  │
│  └────┬─────┘  └──────┬───────┘  └─────┬─────┘  │
│       │               │                │        │
│  ┌────▼───────────────▼────────────────▼─────┐  │
│  │                Prometheus                 │  │
│  └─────────────────────┬─────────────────────┘  │
│                        │                        │
│  ┌─────────────────────▼─────────────────────┐  │
│  │                  Grafana                  │  │
│  └───────────────────────────────────────────┘  │
└─────────────────────────────────────────────────┘
Metric Sources
| Source | Metrics Provided |
|---|---|
| kube-state-metrics | Deployment status, pod phases, replica counts |
| node-exporter | CPU, memory, disk, network per node |
| cAdvisor | Container CPU, memory, I/O per pod |
| API server | Request latency, etcd health |
| kubelet | Pod lifecycle, volume stats |
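Each of these sources exposes its metrics as plain text lines that Prometheus scrapes. As a rough illustration of what such a line carries, here is a minimal Python sketch that splits a sample line into a metric name, labels, and a value. The sample lines use made-up values, and the parser is a simplification (it ignores timestamps, escaped quotes, and `# HELP`/`# TYPE` lines), not a full implementation of the exposition format.

```python
import re

# Matches lines such as: metric_name{label="value",...} 1.0
# Simplified: no timestamps, no escaped quotes, no HELP/TYPE comment lines.
SAMPLE_RE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(?P<labels>.*)\})?\s+(?P<value>\S+)$'
)
LABEL_RE = re.compile(r'(\w+)="([^"]*)"')


def parse_sample(line):
    """Parse one exposition-format line into (name, labels, value)."""
    match = SAMPLE_RE.match(line.strip())
    if match is None:
        raise ValueError(f"not a sample line: {line!r}")
    labels = dict(LABEL_RE.findall(match.group("labels") or ""))
    return match.group("name"), labels, float(match.group("value"))


# Illustrative sample lines with made-up values.
EXAMPLE_LINES = [
    'kube_pod_status_phase{namespace="default",pod="web-1",phase="Running"} 1',
    'node_memory_MemAvailable_bytes 8.0e+09',
    'container_cpu_usage_seconds_total{namespace="default",pod="web-1",container="app"} 42.5',
]

for line in EXAMPLE_LINES:
    name, labels, value = parse_sample(line)
    print(name, labels, value)
```

The labels are what make the later `sum by (namespace)` and `sum by (pod)` queries possible: every aggregation groups samples by one or more of these key/value pairs.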
Deploy kube-prometheus-stack
# Add Helm repo
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update
# Install the full stack (admin123 is a placeholder; set a strong Grafana password for real use)
helm install monitoring prometheus-community/kube-prometheus-stack \
--namespace monitoring \
--create-namespace \
--set grafana.adminPassword=admin123
# Verify
kubectl get pods -n monitoring
This installs:
- Prometheus (metrics collection)
- Grafana (visualization)
- Alertmanager (alerting)
- Node Exporter (node metrics)
- kube-state-metrics (K8s object metrics)
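To check that all of these components actually came up, the verify step can be scripted. The following is a minimal sketch, assuming `kubectl` is on the PATH and pointed at the right cluster. The helper names `not_ready_pods` and `fetch_pods` are hypothetical and not part of any library.

```python
import json
import subprocess


def not_ready_pods(pod_list):
    """Return names of pods whose Ready condition is not "True".

    `pod_list` is the parsed JSON from `kubectl get pods -o json`.
    Pods that completed successfully (phase Succeeded) are skipped.
    """
    pending = []
    for pod in pod_list.get("items", []):
        status = pod.get("status", {})
        if status.get("phase") == "Succeeded":
            continue
        conditions = status.get("conditions", [])
        ready = any(
            c.get("type") == "Ready" and c.get("status") == "True"
            for c in conditions
        )
        if not ready:
            pending.append(pod["metadata"]["name"])
    return pending


def fetch_pods(namespace="monitoring"):
    """Run kubectl and parse its JSON output (requires cluster access)."""
    result = subprocess.run(
        ["kubectl", "get", "pods", "-n", namespace, "-o", "json"],
        check=True, capture_output=True, text=True,
    )
    return json.loads(result.stdout)
```

Calling `not_ready_pods(fetch_pods())` shortly after the install will probably list a few pods that are still starting; an empty list means every pod in the namespace reports Ready.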
Key Kubernetes Metrics
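The queries in this section can be pasted into the Prometheus expression browser, or sent to the Prometheus HTTP API at `/api/v1/query`. Here is a minimal Python sketch of the second option. The base URL is an assumption (for example, a local `kubectl port-forward` to the Prometheus service), and the helper names are hypothetical.

```python
import json
import urllib.parse
import urllib.request

# Assumption: Prometheus is reachable here, e.g. through a local port-forward.
PROMETHEUS_URL = "http://localhost:9090"


def build_query_url(expr, base_url=PROMETHEUS_URL):
    """Build the URL for an instant query against /api/v1/query."""
    return f"{base_url}/api/v1/query?" + urllib.parse.urlencode({"query": expr})


def extract_vector(response):
    """Turn a successful instant-vector response into (labels, value) pairs."""
    if response.get("status") != "success":
        raise RuntimeError(f"query failed: {response.get('error')}")
    data = response["data"]
    if data["resultType"] != "vector":
        raise ValueError(f"expected a vector, got {data['resultType']}")
    # Each sample is {"metric": {...}, "value": [timestamp, "value-as-string"]}.
    return [(s["metric"], float(s["value"][1])) for s in data["result"]]


def instant_query(expr, base_url=PROMETHEUS_URL):
    """Run an instant query over HTTP (requires a reachable Prometheus)."""
    url = build_query_url(expr, base_url)
    with urllib.request.urlopen(url, timeout=10) as resp:
        return extract_vector(json.load(resp))
```

With Prometheus reachable, `instant_query("count(kube_node_info)")` would return a single pair whose value is the node count.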
Cluster Level
# Total cluster CPU usage
sum(rate(container_cpu_usage_seconds_total{container!=""}[5m]))
# Total cluster memory usage
sum(container_memory_working_set_bytes{container!=""})
# Node count
count(kube_node_info)
# Pod count by namespace
sum by (namespace) (kube_pod_info)
Node Level
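The node-level CPU and memory expressions turn raw counters and gauges into percentages. As a worked example of that arithmetic, here is a small Python sketch with made-up readings. The function names are hypothetical and only mirror the PromQL expressions.

```python
def cpu_usage_percent(idle_rates):
    """Mirror `100 - avg(rate(idle)) * 100` for one node.

    `idle_rates` holds, per core, the rate of idle CPU seconds per second:
    0.0 means fully busy, 1.0 means fully idle.
    """
    return 100 - (sum(idle_rates) / len(idle_rates)) * 100


def memory_usage_percent(total_bytes, available_bytes):
    """Mirror `(MemTotal - MemAvailable) / MemTotal * 100`."""
    return (total_bytes - available_bytes) / total_bytes * 100


# Made-up readings for a 4-core node with 16 GiB of memory.
print(cpu_usage_percent([0.75, 0.5, 0.25, 0.5]))        # 50.0
print(memory_usage_percent(16 * 1024**3, 4 * 1024**3))  # 75.0
```

Averaging the idle rate across cores before subtracting from 100 is what makes the result a per-node figure rather than a per-core one.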
# CPU usage per node
100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
# Memory usage per node
(node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes) / node_memory_MemTotal_bytes * 100
# Free disk space per filesystem (%); low values indicate disk pressure
node_filesystem_avail_bytes / node_filesystem_size_bytes * 100
Pod Level
# CPU usage per pod
sum by (pod) (rate(container_cpu_usage_seconds_total{container!=""}[5m]))
# Memory usage per pod
sum by (pod) (container_memory_working_set_bytes{container!=""})
# Pod restart count
sum by (pod) (kube_pod_container_status_restarts_total)
# Pods not ready
sum by (namespace) (kube_pod_status_ready{condition="false"})
Deployment Level
# Deployment replicas vs desired
kube_deployment_status_replicas / kube_deployment_spec_replicas
# Unavailable replicas
kube_deployment_status_replicas_unavailable > 0
# Deployment rollout stuck
kube_deployment_status_observed_generation != kube_deployment_metadata_generation
Kubernetes Alerts
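Each rule in this section pairs an `expr` with a `for:` duration: the alert stays pending while the expression is true, and fires only once it has been true continuously for that long. Here is a minimal Python sketch of that behaviour, under simplifying assumptions: a single series, a fixed evaluation interval, and no `keep_firing_for`. The function name is hypothetical.

```python
def alert_states(evaluations, for_seconds):
    """Return the alert state after each rule evaluation.

    `evaluations` is a list of (timestamp_seconds, condition_is_true) pairs.
    The alert is "pending" once the condition holds and becomes "firing"
    only after it has held continuously for `for_seconds`.
    """
    states = []
    active_since = None
    for timestamp, condition in evaluations:
        if not condition:
            # A single false evaluation resets the timer.
            active_since = None
            states.append("inactive")
            continue
        if active_since is None:
            active_since = timestamp
        held_for = timestamp - active_since
        states.append("firing" if held_for >= for_seconds else "pending")
    return states


# Evaluations every 60 s; the condition clears once at t=120 and resets the timer.
samples = [(0, True), (60, True), (120, False), (180, True), (240, True),
           (300, True), (360, True), (420, True), (480, True)]
print(alert_states(samples, for_seconds=300))
```

In this run the alert only fires at t=480, five minutes after the condition became true again at t=180. This is why a short `for:` catches problems sooner but is more sensitive to brief spikes.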
groups:
  - name: kubernetes
    rules:
      - alert: PodCrashLooping
        expr: rate(kube_pod_container_status_restarts_total[15m]) > 0
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Pod {{ $labels.pod }} is crash looping"
      - alert: NodeNotReady
        expr: kube_node_status_condition{condition="Ready",status="true"} == 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Node {{ $labels.node }} is not ready"
      - alert: HighPodMemory
        expr: |
          sum by (pod, namespace) (container_memory_working_set_bytes{container!=""})
          /
          sum by (pod, namespace) (kube_pod_container_resource_limits{resource="memory"})
          > 0.9
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Pod {{ $labels.pod }} using >90% memory limit"
      - alert: PVCAlmostFull
        expr: kubelet_volume_stats_used_bytes / kubelet_volume_stats_capacity_bytes > 0.85
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "PVC {{ $labels.persistentvolumeclaim }} is >85% full"
Grafana Dashboards for K8s
Recommended Dashboards
| Dashboard | ID | Shows |
|---|---|---|
| K8s Cluster Overview | 7249 | Cluster health summary |
| Node Exporter | 1860 | Per-node system metrics |
| Pod Resources | 6879 | Pod CPU/memory/network |
| Namespace Overview | 15758 | Per-namespace resource usage |
Custom Dashboard Panels
Namespace Resource Usage:
# CPU by namespace
sum by (namespace) (rate(container_cpu_usage_seconds_total{container!=""}[5m]))
# Memory by namespace (GiB)
sum by (namespace) (container_memory_working_set_bytes{container!=""}) / 1024 / 1024 / 1024
Summary
You've learned:
- Kubernetes monitoring architecture and metric sources
- Deploying the kube-prometheus-stack with Helm
- Key metrics at cluster, node, pod, and deployment levels
- Kubernetes-specific alerting rules
- Building dashboards for cluster observability
Next Steps
Next, we'll cover monitoring best practices and building a complete observability platform.