Prometheus Fundamentals (20%)¶

Overview¶

This domain covers Prometheus architecture, configuration, scraping, data model, labels, and understanding Prometheus limitations.

Prometheus Architecture¶

Core Components¶

┌─────────────────────────────────────────────────────────────┐
│                     Prometheus Server                        │
│  ┌─────────────┐  ┌─────────────┐  ┌─────────────────────┐ │
│  │  Retrieval  │  │    TSDB     │  │    HTTP Server      │ │
│  │  (Scraper)  │──│  (Storage)  │──│  (PromQL, API)      │ │
│  └─────────────┘  └─────────────┘  └─────────────────────┘ │
└─────────────────────────────────────────────────────────────┘
         │                                      │
         ▼                                      ▼
┌─────────────────┐                  ┌─────────────────────┐
│     Targets     │                  │   Visualization     │
│  (Exporters,    │                  │   (Grafana, etc.)   │
│   Applications) │                  └─────────────────────┘
└─────────────────┘                           │
         │                                    ▼
         ▼                          ┌─────────────────────┐
┌─────────────────┐                 │   Alertmanager      │
│   Pushgateway   │                 │   (Notifications)   │
│  (Short-lived   │                 └─────────────────────┘
│   jobs)         │
└─────────────────┘

Component Descriptions¶

Component	Purpose
Prometheus Server	Core component that scrapes and stores metrics
TSDB	Time Series Database for efficient storage
Retrieval	Scrapes metrics from configured targets
HTTP Server	Serves PromQL queries and API requests
Alertmanager	Handles alerts, routing, and notifications
Pushgateway	Accepts pushed metrics from short-lived jobs
Exporters	Expose metrics from third-party systems

Configuration¶

Basic Configuration Structure¶

# prometheus.yml
global:
  scrape_interval: 15s      # How often to scrape targets
  evaluation_interval: 15s  # How often to evaluate rules
  scrape_timeout: 10s       # Timeout for scrape requests

# Alertmanager configuration
alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - alertmanager:9093

# Rule files
rule_files:
  - "rules/*.yml"
  - "alerts/*.yml"

# Scrape configurations
scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']

Scrape Configuration Options¶

scrape_configs:
  - job_name: 'example'
    # Override global settings
    scrape_interval: 30s
    scrape_timeout: 10s

    # Metrics path (default: /metrics)
    metrics_path: /metrics

    # Scheme (http or https)
    scheme: https

    # Basic authentication
    basic_auth:
      username: admin
      password: secret

    # TLS configuration
    tls_config:
      ca_file: /path/to/ca.crt
      cert_file: /path/to/client.crt
      key_file: /path/to/client.key

    # Static targets
    static_configs:
      - targets: ['host1:9090', 'host2:9090']
        labels:
          env: production
          team: backend

Relabeling Configuration¶

Relabeling allows modifying labels before scraping or storing.

scrape_configs:
  - job_name: 'kubernetes-pods'
    kubernetes_sd_configs:
      - role: pod

    relabel_configs:
      # Keep only pods with annotation
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: true

      # Replace metrics path
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
        action: replace
        target_label: __metrics_path__
        regex: (.+)

      # Add namespace label
      - source_labels: [__meta_kubernetes_namespace]
        action: replace
        target_label: namespace

      # Drop specific labels
      - action: labeldrop
        regex: __meta_kubernetes_pod_label_(.+)

Data Model¶

Time Series Structure¶

Every time series is uniquely identified by: - Metric name: Describes what is being measured - Labels: Key-value pairs for dimensions

<metric_name>{<label_name>=<label_value>, ...}

# Examples
http_requests_total{method="GET", status="200", path="/api"}
node_cpu_seconds_total{cpu="0", mode="idle"}

Metric Naming Conventions¶

# Format: <namespace>_<name>_<unit>_<suffix>

# Good examples
http_requests_total              # Counter
http_request_duration_seconds    # Histogram
node_memory_bytes_total          # Gauge
process_cpu_seconds_total        # Counter

# Suffixes
_total    - Counter
_count    - Number of observations (histogram/summary)
_sum      - Sum of observations (histogram/summary)
_bucket   - Histogram bucket
_info     - Info metric (gauge with value 1)

Label Best Practices¶

# Good labels - low cardinality
http_requests_total{method="GET", status="200"}
http_requests_total{method="POST", status="201"}

# Bad labels - high cardinality (avoid!)
http_requests_total{user_id="12345", request_id="abc-123"}

Cardinality Warning: Each unique label combination creates a new time series. High cardinality can cause performance issues.

Metric Types¶

Counter¶

A cumulative metric that only increases (or resets to zero on restart).

# Raw counter value
http_requests_total

# Rate of increase per second
rate(http_requests_total[5m])

# Total increase over time window
increase(http_requests_total[1h])

Gauge¶

A metric that can go up or down.

# Current value
node_memory_MemAvailable_bytes

# Changes over time
delta(node_memory_MemAvailable_bytes[1h])

# Derivative (rate of change)
deriv(node_memory_MemAvailable_bytes[1h])

Histogram¶

Samples observations and counts them in configurable buckets.

# Bucket counts
http_request_duration_seconds_bucket{le="0.1"}
http_request_duration_seconds_bucket{le="0.5"}
http_request_duration_seconds_bucket{le="+Inf"}

# Calculate percentiles
histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))

# Average duration
rate(http_request_duration_seconds_sum[5m]) / rate(http_request_duration_seconds_count[5m])

Summary¶

Similar to histogram but calculates quantiles on the client side.

# Pre-calculated quantiles
go_gc_duration_seconds{quantile="0.5"}
go_gc_duration_seconds{quantile="0.9"}
go_gc_duration_seconds{quantile="0.99"}

Histogram vs Summary: | Feature | Histogram | Summary | |---------|-----------|---------| | Quantile calculation | Server-side (PromQL) | Client-side | | Aggregatable | Yes | No | | Bucket configuration | Required | Not needed | | Accuracy | Depends on buckets | Configurable |

Exposition Format¶

Text Format¶

# HELP http_requests_total Total number of HTTP requests
# TYPE http_requests_total counter
http_requests_total{method="GET",status="200"} 1234 1609459200000
http_requests_total{method="POST",status="201"} 567

# HELP node_cpu_seconds_total CPU time spent in each mode
# TYPE node_cpu_seconds_total counter
node_cpu_seconds_total{cpu="0",mode="idle"} 123456.78
node_cpu_seconds_total{cpu="0",mode="user"} 45678.90

Format Components¶

# HELP <metric_name> <description>
# TYPE <metric_name> <type>
<metric_name>{<labels>} <value> [<timestamp>]

Prometheus Limitations¶

Storage Limitations¶

Local storage only: No built-in clustering
Retention: Limited by disk space
No long-term storage: Use remote write for long-term

# Storage configuration
storage:
  tsdb:
    path: /prometheus/data
    retention.time: 15d
    retention.size: 50GB

Scalability Limitations¶

Single server: Prometheus is designed for single-server operation
High availability: Requires running multiple instances
Federation: For hierarchical scaling

# Federation configuration
scrape_configs:
  - job_name: 'federate'
    honor_labels: true
    metrics_path: '/federate'
    params:
      'match[]':
        - '{job="prometheus"}'
        - '{__name__=~"job:.*"}'
    static_configs:
      - targets:
          - 'prometheus-1:9090'
          - 'prometheus-2:9090'

Other Limitations¶

Limitation	Workaround
No event logging	Use logging systems (Loki, ELK)
Pull-only model	Pushgateway for batch jobs
No built-in auth	Use reverse proxy
Single-node	Federation, Thanos, Cortex

Remote Write/Read¶

Remote Write Configuration¶

remote_write:
  - url: "http://remote-storage:9201/write"
    queue_config:
      max_samples_per_send: 1000
      batch_send_deadline: 5s
    write_relabel_configs:
      - source_labels: [__name__]
        regex: 'expensive_metric_.*'
        action: drop

Remote Read Configuration¶

remote_read:
  - url: "http://remote-storage:9201/read"
    read_recent: true

Practice Questions¶

What are the main components of Prometheus architecture?
What is the difference between scrape_interval and evaluation_interval?
How do you configure basic authentication for a scrape target?
What is the purpose of relabeling in Prometheus?
What are the four metric types in Prometheus?
Why should you avoid high-cardinality labels?
What is the difference between histogram and summary metrics?
How can you scale Prometheus for high availability?