Prometheus Fundamentals (20%)¶
Overview¶
This domain covers Prometheus architecture, configuration, scraping, data model, labels, and understanding Prometheus limitations.
Prometheus Architecture¶
Core Components¶
┌─────────────────────────────────────────────────────────────┐
│ Prometheus Server │
│ ┌─────────────┐ ┌─────────────┐ ┌─────────────────────┐ │
│ │ Retrieval │ │ TSDB │ │ HTTP Server │ │
│ │ (Scraper) │──│ (Storage) │──│ (PromQL, API) │ │
│ └─────────────┘ └─────────────┘ └─────────────────────┘ │
└─────────────────────────────────────────────────────────────┘
│ │
▼ ▼
┌─────────────────┐ ┌─────────────────────┐
│ Targets │ │ Visualization │
│ (Exporters, │ │ (Grafana, etc.) │
│ Applications) │ └─────────────────────┘
└─────────────────┘ │
│ ▼
▼ ┌─────────────────────┐
┌─────────────────┐ │ Alertmanager │
│ Pushgateway │ │ (Notifications) │
│ (Short-lived │ └─────────────────────┘
│ jobs) │
└─────────────────┘
Component Descriptions¶
| Component | Purpose |
|---|---|
| Prometheus Server | Core component that scrapes and stores metrics |
| TSDB | Time Series Database for efficient storage |
| Retrieval | Scrapes metrics from configured targets |
| HTTP Server | Serves PromQL queries and API requests |
| Alertmanager | Handles alerts, routing, and notifications |
| Pushgateway | Accepts pushed metrics from short-lived jobs |
| Exporters | Expose metrics from third-party systems |
Configuration¶
Basic Configuration Structure¶
# prometheus.yml
global:
scrape_interval: 15s # How often to scrape targets
evaluation_interval: 15s # How often to evaluate rules
scrape_timeout: 10s # Timeout for scrape requests
# Alertmanager configuration
alerting:
alertmanagers:
- static_configs:
- targets:
- alertmanager:9093
# Rule files
rule_files:
- "rules/*.yml"
- "alerts/*.yml"
# Scrape configurations
scrape_configs:
- job_name: 'prometheus'
static_configs:
- targets: ['localhost:9090']
Scrape Configuration Options¶
scrape_configs:
- job_name: 'example'
# Override global settings
scrape_interval: 30s
scrape_timeout: 10s
# Metrics path (default: /metrics)
metrics_path: /metrics
# Scheme (http or https)
scheme: https
# Basic authentication
basic_auth:
username: admin
password: secret
# TLS configuration
tls_config:
ca_file: /path/to/ca.crt
cert_file: /path/to/client.crt
key_file: /path/to/client.key
# Static targets
static_configs:
- targets: ['host1:9090', 'host2:9090']
labels:
env: production
team: backend
Relabeling Configuration¶
Relabeling allows modifying labels before scraping or storing.
scrape_configs:
- job_name: 'kubernetes-pods'
kubernetes_sd_configs:
- role: pod
relabel_configs:
# Keep only pods with annotation
- source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
action: keep
regex: true
# Replace metrics path
- source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
action: replace
target_label: __metrics_path__
regex: (.+)
# Add namespace label
- source_labels: [__meta_kubernetes_namespace]
action: replace
target_label: namespace
# Drop specific labels
- action: labeldrop
regex: __meta_kubernetes_pod_label_(.+)
Data Model¶
Time Series Structure¶
Every time series is uniquely identified by: - Metric name: Describes what is being measured - Labels: Key-value pairs for dimensions
<metric_name>{<label_name>=<label_value>, ...}
# Examples
http_requests_total{method="GET", status="200", path="/api"}
node_cpu_seconds_total{cpu="0", mode="idle"}
Metric Naming Conventions¶
# Format: <namespace>_<name>_<unit>_<suffix>
# Good examples
http_requests_total # Counter
http_request_duration_seconds # Histogram
node_memory_bytes_total # Gauge
process_cpu_seconds_total # Counter
# Suffixes
_total - Counter
_count - Number of observations (histogram/summary)
_sum - Sum of observations (histogram/summary)
_bucket - Histogram bucket
_info - Info metric (gauge with value 1)
Label Best Practices¶
# Good labels - low cardinality
http_requests_total{method="GET", status="200"}
http_requests_total{method="POST", status="201"}
# Bad labels - high cardinality (avoid!)
http_requests_total{user_id="12345", request_id="abc-123"}
Cardinality Warning: Each unique label combination creates a new time series. High cardinality can cause performance issues.
Metric Types¶
Counter¶
A cumulative metric that only increases (or resets to zero on restart).
# Raw counter value
http_requests_total
# Rate of increase per second
rate(http_requests_total[5m])
# Total increase over time window
increase(http_requests_total[1h])
Gauge¶
A metric that can go up or down.
# Current value
node_memory_MemAvailable_bytes
# Changes over time
delta(node_memory_MemAvailable_bytes[1h])
# Derivative (rate of change)
deriv(node_memory_MemAvailable_bytes[1h])
Histogram¶
Samples observations and counts them in configurable buckets.
# Bucket counts
http_request_duration_seconds_bucket{le="0.1"}
http_request_duration_seconds_bucket{le="0.5"}
http_request_duration_seconds_bucket{le="+Inf"}
# Calculate percentiles
histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))
# Average duration
rate(http_request_duration_seconds_sum[5m]) / rate(http_request_duration_seconds_count[5m])
Summary¶
Similar to histogram but calculates quantiles on the client side.
# Pre-calculated quantiles
go_gc_duration_seconds{quantile="0.5"}
go_gc_duration_seconds{quantile="0.9"}
go_gc_duration_seconds{quantile="0.99"}
Histogram vs Summary: | Feature | Histogram | Summary | |---------|-----------|---------| | Quantile calculation | Server-side (PromQL) | Client-side | | Aggregatable | Yes | No | | Bucket configuration | Required | Not needed | | Accuracy | Depends on buckets | Configurable |
Exposition Format¶
Text Format¶
# HELP http_requests_total Total number of HTTP requests
# TYPE http_requests_total counter
http_requests_total{method="GET",status="200"} 1234 1609459200000
http_requests_total{method="POST",status="201"} 567
# HELP node_cpu_seconds_total CPU time spent in each mode
# TYPE node_cpu_seconds_total counter
node_cpu_seconds_total{cpu="0",mode="idle"} 123456.78
node_cpu_seconds_total{cpu="0",mode="user"} 45678.90
Format Components¶
# HELP <metric_name> <description>
# TYPE <metric_name> <type>
<metric_name>{<labels>} <value> [<timestamp>]
Prometheus Limitations¶
Storage Limitations¶
- Local storage only: No built-in clustering
- Retention: Limited by disk space
- No long-term storage: Use remote write for long-term
# Storage configuration
storage:
tsdb:
path: /prometheus/data
retention.time: 15d
retention.size: 50GB
Scalability Limitations¶
- Single server: Prometheus is designed for single-server operation
- High availability: Requires running multiple instances
- Federation: For hierarchical scaling
# Federation configuration
scrape_configs:
- job_name: 'federate'
honor_labels: true
metrics_path: '/federate'
params:
'match[]':
- '{job="prometheus"}'
- '{__name__=~"job:.*"}'
static_configs:
- targets:
- 'prometheus-1:9090'
- 'prometheus-2:9090'
Other Limitations¶
| Limitation | Workaround |
|---|---|
| No event logging | Use logging systems (Loki, ELK) |
| Pull-only model | Pushgateway for batch jobs |
| No built-in auth | Use reverse proxy |
| Single-node | Federation, Thanos, Cortex |
Remote Write/Read¶
Remote Write Configuration¶
remote_write:
- url: "http://remote-storage:9201/write"
queue_config:
max_samples_per_send: 1000
batch_send_deadline: 5s
write_relabel_configs:
- source_labels: [__name__]
regex: 'expensive_metric_.*'
action: drop
Remote Read Configuration¶
Practice Questions¶
- What are the main components of Prometheus architecture?
- What is the difference between
scrape_intervalandevaluation_interval? - How do you configure basic authentication for a scrape target?
- What is the purpose of relabeling in Prometheus?
- What are the four metric types in Prometheus?
- Why should you avoid high-cardinality labels?
- What is the difference between histogram and summary metrics?
- How can you scale Prometheus for high availability?