Observability Concepts (18%)

Overview

This domain covers the foundational concepts of observability: the three pillars (metrics, logs, traces), the push and pull collection models, service discovery, and SLAs, SLOs, and SLIs.

The Three Pillars of Observability

1. Metrics

Metrics are numerical values that measure some aspect of a system over intervals of time.

Characteristics:
- Aggregatable and compressible
- Low storage overhead
- Good for alerting and trending
- Examples: CPU usage, request count, error rate

Prometheus Metric Types:

# Counter - only increases (resets on restart)
http_requests_total{method="GET", status="200"} 1234

# Gauge - can go up or down
temperature_celsius{location="server_room"} 23.5

# Histogram - counts observations in cumulative buckets (each le bucket includes all smaller ones)
http_request_duration_seconds_bucket{le="0.1"} 24054
http_request_duration_seconds_bucket{le="0.5"} 33444
http_request_duration_seconds_bucket{le="+Inf"} 144320
http_request_duration_seconds_sum 53423
http_request_duration_seconds_count 144320

# Summary - like a histogram, but quantiles are precomputed on the client and cannot be aggregated across instances
go_gc_duration_seconds{quantile="0.5"} 0.000107458
go_gc_duration_seconds{quantile="0.9"} 0.000262326
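
The counter and histogram semantics above can be sketched in a few lines of Python. This is an illustrative approximation, not Prometheus's implementation: counter_increase treats any drop in value as a reset to zero, and histogram_quantile interpolates linearly inside a cumulative bucket, which is the general idea behind the PromQL function of the same name. The function names, the input shapes, and the +Inf bucket (set equal to the _count value) are assumptions made for the example.

```python
import math


def counter_increase(samples):
    """Total increase over a series of counter samples.

    A counter never decreases, so a drop between two samples is taken
    to mean the process restarted and the counter began again at zero.
    """
    total = 0.0
    for prev, curr in zip(samples, samples[1:]):
        total += (curr - prev) if curr >= prev else curr
    return total


def histogram_quantile(q, buckets):
    """Estimate the q-quantile from cumulative (upper_bound, count) buckets.

    Assumes buckets are sorted by upper bound, end with a +Inf bucket,
    and contain at least one finite bucket.
    """
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    total = buckets[-1][1]
    if total <= 0:
        raise ValueError("histogram has no observations")
    rank = q * total
    lower_bound, lower_count = 0.0, 0
    for upper, count in buckets:
        if count >= rank:
            if math.isinf(upper):
                # No upper limit to interpolate towards, so fall back
                # to the largest finite bucket boundary.
                return buckets[-2][0]
            fraction = (rank - lower_count) / (count - lower_count)
            return lower_bound + (upper - lower_bound) * fraction
        lower_bound, lower_count = upper, count


buckets = [(0.1, 24054), (0.5, 33444), (math.inf, 144320)]
print(counter_increase([100, 150, 20, 50]))
print(histogram_quantile(0.2, buckets))
```

The estimate is only as precise as the bucket layout: any quantile that falls into the +Inf bucket is reported as the largest finite boundary, which is one reason bucket boundaries are usually chosen around the latencies that matter.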

2. Logs

Logs are immutable records that describe discrete events that have happened over time.

Characteristics:
- High cardinality data
- Detailed context for debugging
- Higher storage requirements
- Examples: Application errors, access logs, audit trails

Log Levels:
- DEBUG: Detailed information for debugging
- INFO: General operational information
- WARN: Warning conditions
- ERROR: Error conditions
- FATAL/CRITICAL: Severe errors causing shutdown
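
As a rough illustration of how levels act as a filter, the sketch below uses Python's standard logging module, which names the levels WARNING and CRITICAL rather than WARN and FATAL. The logger name "checkout" and the messages are made up for the example.

```python
import io
import logging


def build_logger(name, level, stream):
    """Return a logger that writes records at or above `level` to `stream`."""
    logger = logging.getLogger(name)
    logger.setLevel(level)
    logger.handlers.clear()      # avoid duplicate handlers on repeated calls
    logger.propagate = False     # keep output out of the root logger
    handler = logging.StreamHandler(stream)
    handler.setFormatter(logging.Formatter("%(levelname)s %(name)s %(message)s"))
    logger.addHandler(handler)
    return logger


buffer = io.StringIO()
log = build_logger("checkout", logging.WARNING, buffer)

log.debug("cart contents: %s", ["sku-1", "sku-2"])   # dropped: below WARNING
log.info("order received")                           # dropped: below WARNING
log.warning("payment retry %d of 3", 2)
log.error("payment failed")
log.critical("cannot reach database, shutting down")

output = buffer.getvalue()
print(output, end="")
```

Raising the threshold in production keeps storage costs down, at the price of losing the DEBUG and INFO context when something goes wrong.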

3. Traces

Traces are records of the full paths or sequences of events that occur as requests flow through a system.

Key Concepts:
- Trace: Complete journey of a request through the system
- Span: A single operation within a trace
- Context Propagation: Passing trace context between services

Trace ID: abc123
├── Span 1: API Gateway (60ms)
│   ├── Span 2: Auth Service (5ms)
│   └── Span 3: Backend Service (50ms)
│       ├── Span 4: Database Query (30ms)
│       └── Span 5: Cache Lookup (2ms)
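
Context propagation can be sketched without a tracing library. The example below is a hand-rolled illustration, not the OpenTelemetry API: it passes the trace ID and the caller's span ID in a header shaped like the W3C Trace Context traceparent header (version, 32 hex character trace ID, 16 hex character parent span ID, flags). The SpanContext class and the helper names are assumptions for this example.

```python
import secrets
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class SpanContext:
    trace_id: str                    # 32 hex chars, shared by every span in the trace
    span_id: str                     # 16 hex chars, unique per span
    parent_id: Optional[str] = None  # span_id of the caller, None for the root span


def start_span(parent=None):
    """Start a root span, or a child span that inherits the parent's trace ID."""
    trace_id = parent.trace_id if parent else secrets.token_hex(16)
    parent_id = parent.span_id if parent else None
    return SpanContext(trace_id, secrets.token_hex(8), parent_id)


def inject(ctx, headers):
    """Write the context into outgoing request headers."""
    headers["traceparent"] = f"00-{ctx.trace_id}-{ctx.span_id}-01"


def extract(headers):
    """Rebuild the caller's context from incoming request headers."""
    _version, trace_id, span_id, _flags = headers["traceparent"].split("-")
    return SpanContext(trace_id, span_id)


# The gateway starts the trace and calls the backend with the header set.
gateway = start_span()
headers = {}
inject(gateway, headers)

# The backend continues the same trace as a child of the gateway span.
backend = start_span(parent=extract(headers))
print(headers["traceparent"])
print(backend.trace_id == gateway.trace_id, backend.parent_id == gateway.span_id)
```

Because every service copies the trace ID forward and records its caller's span ID, a tracing backend can reassemble the spans into a tree like the one above.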

Push vs Pull Model

Pull Model (Prometheus Default)

Prometheus actively scrapes metrics from targets at regular intervals.

Advantages:
- Prometheus controls scrape timing
- Easy to detect if a target is down
- No need for targets to know about Prometheus
- Simpler target configuration

Configuration Example:

scrape_configs:
  - job_name: 'node'
    static_configs:
      - targets: ['localhost:9100']
    scrape_interval: 15s
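
On the target side, the pull model only requires an HTTP endpoint that returns metrics as text. The following is a minimal sketch using only the Python standard library; a real service would normally use an official client library instead. The metric values, the port 8000, and the function names are assumptions for illustration.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical in-memory counters, keyed by (method, status).
REQUEST_COUNTS = {("GET", "200"): 1234, ("POST", "500"): 7}


def render_metrics(counts):
    """Render counters in the Prometheus text exposition format."""
    lines = [
        "# HELP http_requests_total Total HTTP requests handled.",
        "# TYPE http_requests_total counter",
    ]
    for (method, status), value in sorted(counts.items()):
        lines.append(
            f'http_requests_total{{method="{method}",status="{status}"}} {value}'
        )
    return "\n".join(lines) + "\n"


class MetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path != "/metrics":
            self.send_error(404)
            return
        body = render_metrics(REQUEST_COUNTS).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port=8000):
    """Block forever, answering scrapes on /metrics."""
    HTTPServer(("0.0.0.0", port), MetricsHandler).serve_forever()


print(render_metrics(REQUEST_COUNTS), end="")
```

Calling serve() and adding 'localhost:8000' to a targets list would let Prometheus scrape it. The target never needs to know where Prometheus runs; it only answers requests.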

Push Model (Pushgateway)

Applications push metrics to an intermediary (Pushgateway).

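Use Cases:
- Short-lived batch jobs (the main case the Prometheus documentation recommends)
- Jobs behind firewalls (the Prometheus documentation suggests alternatives such as PushProx for this case)
- Legacy systems that can't expose endpoints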

Example:

# Push metrics to Pushgateway
echo "job_completion_time $(date +%s)" | curl --data-binary @- http://pushgateway:9091/metrics/job/batch_job

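Important: Pushgateway should NOT be used as a general metrics aggregator. It keeps pushed series until they are deleted explicitly, it becomes a single point of failure, and pushing through it bypasses the per-target health check that the up metric provides for scraped targets.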
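
The same push can be sketched from a batch job written in Python. The example below only builds the HTTP request so that it can be inspected without a running Pushgateway; the host name pushgateway:9091 is taken from the curl example above, while the function name and the sample value are assumptions.

```python
import urllib.parse
import urllib.request


def build_push_request(gateway_url, job, samples):
    """Build a POST request that pushes `samples` under the given job name.

    `samples` maps metric names to values. Each line of the body must end
    with a newline, as in the text exposition format.
    """
    body = "".join(f"{name} {value}\n" for name, value in samples.items())
    job_path = urllib.parse.quote(job, safe="")
    return urllib.request.Request(
        url=f"{gateway_url.rstrip('/')}/metrics/job/{job_path}",
        data=body.encode("utf-8"),
        method="POST",
    )


request = build_push_request(
    "http://pushgateway:9091",
    "batch_job",
    {"job_completion_time": 1700000000},
)
print(request.get_method(), request.full_url)
print(request.data.decode("utf-8"), end="")
```

Sending it is a single call, urllib.request.urlopen(request, timeout=5), typically as the last step of the job so that the pushed timestamp reflects a completed run.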

Service Discovery

Service discovery lets Prometheus find scrape targets automatically, so the target list does not have to be maintained by hand as instances come and go.

Static Configuration

scrape_configs:
  - job_name: 'static_targets'
    static_configs:
      - targets: ['server1:9090', 'server2:9090']

Kubernetes Service Discovery

scrape_configs:
  - job_name: 'kubernetes-pods'
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: true
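
The keep rule above can be read as a filter over discovered targets. The sketch below imitates that behaviour in Python under stated assumptions: the values of the source labels are joined with ";" (the default separator), the regex must match the whole joined value, and targets that do not match are dropped. Prometheus uses RE2 syntax rather than Python's re module, and the pod label sets are invented for the example.

```python
import re


def keep(targets, source_labels, regex, separator=";"):
    """Return only the targets whose joined source label values match `regex`."""
    pattern = re.compile(regex)
    kept = []
    for labels in targets:
        # A missing label is treated as an empty string.
        value = separator.join(labels.get(name, "") for name in source_labels)
        if pattern.fullmatch(value):   # the match is anchored at both ends
            kept.append(labels)
    return kept


ANNOTATION = "__meta_kubernetes_pod_annotation_prometheus_io_scrape"
discovered = [
    {"__address__": "10.0.0.1:8080", ANNOTATION: "true"},
    {"__address__": "10.0.0.2:8080", ANNOTATION: "false"},
    {"__address__": "10.0.0.3:8080"},  # pod without the annotation
]

scraped = keep(discovered, [ANNOTATION], "true")
print([t["__address__"] for t in scraped])
```

Only the pod annotated with prometheus.io/scrape set to "true" survives, which is how this rule turns scraping into an opt-in setting per pod.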

File-based Service Discovery

scrape_configs:
  - job_name: 'file_sd'
    file_sd_configs:
      - files:
          - '/etc/prometheus/targets/*.json'
        refresh_interval: 5m
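
Files read by file_sd_configs contain a list of target groups, each with a targets list and optional labels. The sketch below shows one way an external tool might generate such a file; the function name, the example targets, and the labels are assumptions. It writes to a temporary file and renames it into place so that Prometheus does not read a half-written file.

```python
import json
import os
import tempfile


def write_targets(path, groups):
    """Atomically write target groups as JSON for file-based service discovery."""
    directory = os.path.dirname(os.path.abspath(path))
    # The .tmp suffix keeps the temporary file out of a '*.json' glob.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w") as handle:
        json.dump(groups, handle, indent=2)
    os.chmod(tmp_path, 0o644)      # mkstemp creates the file readable by owner only
    os.replace(tmp_path, path)     # rename is atomic on the same filesystem


groups = [
    {
        "targets": ["server1:9100", "server2:9100"],
        "labels": {"env": "production", "team": "platform"},
    }
]

with tempfile.TemporaryDirectory() as demo_dir:
    target_file = os.path.join(demo_dir, "node.json")
    write_targets(target_file, groups)
    with open(target_file) as handle:
        print(handle.read())
```

In a real setup the path would sit under the directory matched by the files glob, for example /etc/prometheus/targets/node.json. Prometheus watches these files for changes, and refresh_interval acts as a periodic re-read in addition to that.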

DNS Service Discovery

scrape_configs:
  - job_name: 'dns_sd'
    dns_sd_configs:
      - names:
          - 'myservice.example.com'
        type: 'A'
        port: 9090

SLAs, SLOs, and SLIs

Service Level Agreement (SLA)

A formal agreement between a service provider and its customers that defines expected service levels and the consequences, such as service credits, when they are not met.

Example:

"The service will be available 99.9% of the time, measured monthly. If availability falls below this threshold, customers will receive a 10% credit."

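Service Level Objective (SLO)

Internal targets that teams aim to achieve in order to meet SLAs. An SLO is usually set stricter than the SLA so that there is a margin before the contractual commitment is breached.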
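
To make the numbers concrete, the sketch below converts an availability target into allowed downtime, using the 99.9% monthly SLA and 10% credit from the example above. The 30-day month, the stricter internal SLO of 99.95%, and the function names are assumptions for illustration.

```python
MINUTES_PER_DAY = 24 * 60


def allowed_downtime_minutes(target, days=30):
    """Downtime budget, in minutes, for an availability target such as 0.999."""
    return (1 - target) * days * MINUTES_PER_DAY


def availability(downtime_minutes, days=30):
    """Fraction of the period during which the service was available."""
    total = days * MINUTES_PER_DAY
    return (total - downtime_minutes) / total


def credit_percent(downtime_minutes, sla_target=0.999, credit=10, days=30):
    """Credit owed under the example SLA: 10% if availability falls below target."""
    return credit if availability(downtime_minutes, days) < sla_target else 0


print(round(allowed_downtime_minutes(0.999), 1))    # SLA budget in minutes
print(round(allowed_downtime_minutes(0.9995), 1))   # hypothetical internal SLO budget
print(credit_percent(30), credit_percent(60))
```

With these assumptions the SLA allows about 43.2 minutes of downtime per month and the internal SLO about 21.6 minutes, so a team that misses the SLO still has roughly 21 minutes of margin before a credit is owed.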