Alerting & Dashboarding (18%)¶

Overview¶

This domain covers configuring alerting rules, understanding Alertmanager, and dashboarding basics with Grafana.

Alerting Rules¶

Rule Configuration¶

Alert rules are defined in YAML files that Prometheus loads through `rule_files` and evaluates at every `evaluation_interval`. Each rule belongs to a named group.

# alerts.yml
groups:
  - name: example-alerts
    rules:
      - alert: HighErrorRate
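        # Aggregate both sides so their label sets match; without this, the
        # division only pairs each 5xx series with itself and always yields 1.
        expr: sum by (instance) (rate(http_requests_total{status=~"5.."}[5m])) / sum by (instance) (rate(http_requests_total[5m])) > 0.05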
        for: 5m
        labels:
          severity: critical
          team: backend
        annotations:
          summary: "High error rate detected"
          description: "Error rate is {{ $value | humanizePercentage }} for {{ $labels.instance }}"

      - alert: InstanceDown
        expr: up == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Instance {{ $labels.instance }} is down"
          description: "{{ $labels.instance }} of job {{ $labels.job }} has been down for more than 1 minute."

Rule Components¶

Component	Description
`alert`	Name of the alert
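`expr`	PromQL expression; every series it returns becomes an alert instance
`for`	Duration the condition must hold continuously before the alert fires
`labels`	Additional labels attached to the alert; Alertmanager uses them for routing, grouping, and inhibition
`annotations`	Informational fields (summary, description, runbook) that are not part of the alert's identity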

Alert States¶

Inactive: Condition is not met
Pending: Condition is met but the `for` duration hasn't elapsed yet
Firing: Condition has been met for the full `for` duration

Inactive → Pending → Firing
              ↓
           Inactive (if the condition becomes false; a Firing alert resolves the same way)
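
One way to observe these transitions is a `promtool test rules` unit test. The sketch below exercises the `InstanceDown` rule shown earlier. It assumes the rules are saved as `alerts.yml` next to the test file; the test file name, the `node` job, and the `server1:9100` instance are made up for illustration.

```yaml
# alerts_test.yml (hypothetical name); run with: promtool test rules alerts_test.yml
rule_files:
  - alerts.yml

evaluation_interval: 1m

tests:
  - interval: 1m
    input_series:
      # Samples at 0m, 1m, 2m, ...: the target is up twice, then down.
      - series: 'up{job="node", instance="server1:9100"}'
        values: '1 1 0 0 0 0'
    alert_rule_test:
      # 1m: up == 1, so the alert is inactive.
      - eval_time: 1m
        alertname: InstanceDown
        exp_alerts: []
      # 2m: the condition has just become true; the alert is pending, not firing.
      - eval_time: 2m
        alertname: InstanceDown
        exp_alerts: []
      # 4m: the condition has held for longer than `for: 1m`, so the alert fires.
      - eval_time: 4m
        alertname: InstanceDown
        exp_alerts:
          - exp_labels:
              severity: critical
              job: node
              instance: server1:9100
            exp_annotations:
              summary: "Instance server1:9100 is down"
              description: "server1:9100 of job node has been down for more than 1 minute."
```

`exp_alerts` lists only firing alerts, so both the inactive and the pending state appear as an empty list.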

Recording Rules¶

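Recording rules pre-compute frequently used or expensive expressions and store the result as a new time series, so dashboards and alerts can query the cheaper series instead. Names conventionally follow the `level:metric:operations` pattern.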

groups:
  - name: recording-rules
    rules:
      - record: job:http_requests_total:rate5m
        expr: sum by (job) (rate(http_requests_total[5m]))

      - record: instance:node_cpu_utilization:ratio
        expr: 1 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m]))

      - record: job:http_request_duration_seconds:p95
        expr: histogram_quantile(0.95, sum by (job, le) (rate(http_request_duration_seconds_bucket[5m])))
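
As a sketch of how a recorded series is consumed, the group below pairs a recording rule with an alert that reads its output. The group name, the `LowRequestRate` alert, and the threshold of 1 request per second are assumptions chosen for illustration, not recommendations.

```yaml
groups:
  - name: recording-rules-usage   # hypothetical group name
    rules:
      # Rules in a group are evaluated in order, so the alert below
      # reads the value that was just recorded.
      - record: job:http_requests_total:rate5m
        expr: sum by (job) (rate(http_requests_total[5m]))

      - alert: LowRequestRate     # hypothetical alert
        expr: job:http_requests_total:rate5m < 1   # illustrative threshold
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Job {{ $labels.job }} is receiving very little traffic"
```

Querying the recorded series keeps the alert expression short and avoids recomputing the `rate()` aggregation in every dashboard panel that needs it.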

Prometheus Configuration¶

# prometheus.yml
rule_files:
  - "rules/*.yml"
  - "alerts/*.yml"

alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - alertmanager:9093
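
The fragment above is usually accompanied by a `global` block. The sketch below shows the two settings most relevant to alerting; the interval values and the `cluster` value are assumptions for illustration.

```yaml
# prometheus.yml (fragment)
global:
  scrape_interval: 15s
  evaluation_interval: 15s   # how often rule groups are evaluated
  external_labels:
    cluster: prod-eu-1       # hypothetical value; added to alerts sent to Alertmanager
```

External labels are one common way for a label such as `cluster` to reach Alertmanager, where it can then be used for grouping and inhibition.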

Alertmanager¶

Architecture¶

┌─────────────┐     ┌────────────────────────────────────────┐
│  Prometheus │────▶│              Alertmanager              │
│   (alerts)  │     │  ┌─────────┐  ┌──────────┐  ┌────────┐ │
└─────────────┘     │  │ Routing │─▶│ Grouping │─▶│ Notify │ │
                    │  └─────────┘  └──────────┘  └────────┘ │
                    └────────────────────────────────────────┘
                                                      │
                    ┌─────────────────────────────────┼────────────┐
                    │                                 │            │
                    ▼                                 ▼            ▼
              ┌──────────┐                      ┌──────────┐ ┌───────────┐
              │  Email   │                      │  Slack   │ │ PagerDuty │
              └──────────┘                      └──────────┘ └───────────┘

Configuration Structure¶

# alertmanager.yml
global:
  resolve_timeout: 5m
  smtp_smarthost: 'smtp.example.com:587'
  smtp_from: 'alertmanager@example.com'

route:
  receiver: 'default-receiver'
  group_by: ['alertname', 'cluster']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  routes:
    - match:
        severity: critical
      receiver: 'pagerduty-critical'
    - match:
        severity: warning
      receiver: 'slack-warnings'

receivers:
  - name: 'default-receiver'
    email_configs:
      - to: 'team@example.com'

  - name: 'pagerduty-critical'
    pagerduty_configs:
      - service_key: '<pagerduty-service-key>'

  - name: 'slack-warnings'
    slack_configs:
      - api_url: 'https://hooks.slack.com/services/xxx/yyy/zzz'
        channel: '#alerts'

inhibit_rules:
  - source_match:
      severity: 'critical'
    target_match:
      severity: 'warning'
    equal: ['alertname', 'cluster']

Routing¶

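Routes form a tree that determines which receiver handles an alert. An alert enters at the top-level route, walks the child routes in order, and is handled by the first child that matches; if no child matches, the parent's receiver is used. The `match` and `match_re` fields still work but are deprecated in favor of `matchers` in current Alertmanager releases.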

route:
  receiver: 'default'
  routes:
    # Critical alerts go to PagerDuty
    - match:
        severity: critical
      receiver: 'pagerduty'
      continue: false   # default: stop at the first matching sibling route

    # Database alerts go to DBA team
    - match_re:
        alertname: ^(MySQL|Postgres).*
      receiver: 'dba-team'

    # Multiple matchers (AND logic)
    - match:
        team: backend
        severity: warning
      receiver: 'backend-slack'
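
For comparison, here is a sketch of a similar route tree written with the newer `matchers` syntax, with `continue: true` on the first route so that a critical database alert reaches both PagerDuty and the DBA team. Receiver names are reused from the example above; the choice to set `continue: true` is an assumption made to illustrate the flag.

```yaml
route:
  receiver: 'default'
  routes:
    # Critical alerts page on-call, then evaluation continues with later siblings.
    - matchers:
        - 'severity = "critical"'
      receiver: 'pagerduty'
      continue: true

    # Database alerts go to the DBA team; regex matchers are fully anchored.
    - matchers:
        - 'alertname =~ "(MySQL|Postgres).*"'
      receiver: 'dba-team'

    # Several matchers in one route are combined with AND.
    - matchers:
        - 'team = "backend"'
        - 'severity = "warning"'
      receiver: 'backend-slack'
```

A tree like this can be checked without sending notifications, for example with `amtool config routes test --config.file=alertmanager.yml severity=critical alertname=MySQLDown`, where `MySQLDown` is a hypothetical alert name.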

Grouping¶

Grouping batches alerts that share the same values for the `group_by` labels into a single notification, which reduces notification noise.

route:
  group_by: ['alertname', 'cluster', 'service']
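  group_wait: 30s      # How long to buffer alerts for a new group before the first notification
  group_interval: 5m   # Minimum wait before notifying about changes to an already-notified group
  repeat_interval: 4h  # Minimum wait before re-sending a notification for a group that is still firing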
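
The three timers can also be overridden per route. The sketch below shortens them for critical alerts and adds a rough timeline as comments; the override values and the receiver names are assumptions, and the timeline is approximate rather than exact.

```yaml
route:
  receiver: 'default'
  group_by: ['alertname', 'cluster', 'service']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  routes:
    - matchers:
        - 'severity = "critical"'
      receiver: 'pagerduty'
      group_wait: 10s       # page sooner for critical alerts
      repeat_interval: 1h   # and remind more often
      # group_by and group_interval are inherited from the parent route

# Rough timeline for one group on the top-level route:
#   00:00  first alert arrives and a new group is created
#   00:30  group_wait elapses; the first notification is sent
#   01:00  a second alert joins the same group
#   05:30  group_interval after the last notification; an update is sent
#   later  if nothing changes, the notification repeats after repeat_interval
```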

Silences¶

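Silences temporarily mute notifications for alerts that match a set of label matchers. The alerts keep firing in Prometheus; only their notifications are suppressed until the silence expires.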

# Create silence via API
curl -X POST http://alertmanager:9093/api/v2/silences \
  -H "Content-Type: application/json" \
  -d '{
    "matchers": [
      {"name": "alertname", "value": "HighMemoryUsage", "isRegex": false},
      {"name": "instance", "value": "server1", "isRegex": false}
    ],
    "startsAt": "2024-01-01T00:00:00Z",
    "endsAt": "2024-01-01T06:00:00Z",
    "createdBy": "admin",
    "comment": "Maintenance window"
  }'

Inhibition¶

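Inhibition suppresses notifications for target alerts while a matching source alert is firing, provided both alerts have identical values for every label listed in `equal`.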

inhibit_rules:
  # If critical alert fires, suppress warning for same alertname
  - source_match:
      severity: 'critical'
    target_match:
      severity: 'warning'
    equal: ['alertname', 'instance']

  # If cluster is down, suppress all other cluster alerts
  - source_match:
      alertname: 'ClusterDown'
    target_match_re:
      alertname: '.+'
    equal: ['cluster']
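
As a further sketch, the rule below uses the newer `source_matchers` and `target_matchers` fields and ties inhibition to the `InstanceDown` alert defined earlier. Which alerts should be muted is an assumption here; adjust the matchers to your own alert set.

```yaml
inhibit_rules:
  # While InstanceDown fires for an instance, mute warning-level alerts
  # that carry the same `instance` label value.
  - source_matchers:
      - 'alertname = "InstanceDown"'
    target_matchers:
      - 'severity = "warning"'
    equal: ['instance']
```

Because `equal` lists `instance`, an outage on one host does not mute warnings from other hosts.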

Receivers¶

Email¶

receivers:
  - name: 'email-team'
    email_configs:
      - to: 'team@example.com'
        from: 'alertmanager@example.com'
        smarthost: 'smtp.example.com:587'
        auth_username: 'alertmanager'
        auth_password: 'password'
        require_tls: true

Slack¶

receivers:
  - name: 'slack-notifications'
    slack_configs:
      - api_url: 'https://hooks.slack.com/services/xxx/yyy/zzz'
        channel: '#alerts'
        username: 'Alertmanager'
        icon_emoji: ':warning:'
        title: '{{ .Status | toUpper }}: {{ .CommonAnnotations.summary }}'
        text: '{{ range .Alerts }}{{ .Annotations.description }}{{ end }}'

PagerDuty¶

receivers:
  - name: 'pagerduty'
    pagerduty_configs:
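      # routing_key is the Events API v2 integration key; service_key is
      # for the older "Prometheus" integration type.
      - routing_key: '<integration-key>'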
        severity: '{{ if eq .Status "firing" }}critical{{ else }}info{{ end }}'
        description: '{{ .CommonAnnotations.summary }}'

Webhook¶

receivers:
  - name: 'webhook'
    webhook_configs:
      - url: 'http://webhook-handler:8080/alerts'
        send_resolved: true

Dashboarding with Grafana¶

Data Source Configuration¶

# Grafana datasource provisioning
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://prometheus:9090
    isDefault: true
    editable: false
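
Dashboards can be provisioned from files in the same way as data sources. The provider below is a minimal sketch; the folder name, the directory path, and the rescan interval are assumptions for illustration.

```yaml
# Grafana dashboard provisioning (for example provisioning/dashboards/default.yml)
apiVersion: 1
providers:
  - name: 'default'
    folder: 'Prometheus'                  # hypothetical folder shown in the Grafana UI
    type: file
    disableDeletion: false
    updateIntervalSeconds: 30             # how often Grafana rescans the directory
    options:
      path: /var/lib/grafana/dashboards   # directory holding dashboard JSON files
```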

Panel Types¶

Panel Type	Use Case
Time Series	Metrics over time
Stat	Single value display
Gauge	Value with thresholds
Bar Gauge	Horizontal/vertical bars
Table	Tabular data
Heatmap	Distribution over time
Logs	Log data display

Common Dashboard Patterns¶

Request Rate Panel¶

sum(rate(http_requests_total[5m])) by (service)

Error Rate Panel¶

100 * (
  sum by (service) (rate(http_requests_total{status=~"5.."}[5m]))
  /
  sum by (service) (rate(http_requests_total[5m]))
)

Latency Percentiles Panel¶

histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))
histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))
histogram_quantile(0.50, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))
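
The queries above aggregate away every label except `le`, which gives one line per percentile for the whole system. To break latency down per service and keep panels fast, the percentiles could be pre-computed with recording rules. This sketch assumes the histogram carries a `service` label, as the request rate panel does; the group and record names are hypothetical.

```yaml
groups:
  - name: dashboard-latency   # hypothetical group name
    rules:
      # Keep `le` in the aggregation: histogram_quantile needs the bucket boundaries.
      - record: service:http_request_duration_seconds:p99
        expr: histogram_quantile(0.99, sum by (service, le) (rate(http_request_duration_seconds_bucket[5m])))

      - record: service:http_request_duration_seconds:p50
        expr: histogram_quantile(0.50, sum by (service, le) (rate(http_request_duration_seconds_bucket[5m])))
```

A panel can then plot `service:http_request_duration_seconds:p99` directly instead of evaluating the histogram on every refresh.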

Resource Utilization Panel¶

# CPU Usage
100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)

# Memory Usage
(1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100

# Disk Usage
(1 - node_filesystem_avail_bytes / node_filesystem_size_bytes) * 100

Variables (Template Variables)¶

# Query variable for instances
Name: instance
Type: Query
Query: label_values(up, instance)

# Custom variable for time ranges
Name: interval
Type: Custom
Values: 1m,5m,15m,1h

# Using variables in queries
# (if the variable allows multiple values, use the regex matcher: instance=~"$instance")
rate(http_requests_total{instance="$instance"}[$interval])

Dashboard Best Practices¶

Use consistent naming: Apply one naming convention to dashboards, panels, and variables
Add descriptions: Document what each panel shows
Set appropriate time ranges: Match the default range to the window your SLOs are measured over
Use variables: Make dashboards reusable
Group related panels: Use rows to organize
Set thresholds: Visual indicators for good/bad states
Include links: Link to runbooks and related dashboards

Alerting Best Practices¶

When to Alert¶

Alert on symptoms, not causes:

# Good: Alert on user-facing impact
- alert: HighErrorRate
  expr: rate(http_errors_total[5m]) / rate(http_requests_total[5m]) > 0.01

# Avoid: Alerting on every possible cause
- alert: HighCPU
  expr: cpu_usage > 80  # May not indicate a problem

Alert Fatigue Prevention¶

Set appropriate thresholds: Not too sensitive
Use a `for` duration: Avoid flapping alerts
Group related alerts: Reduce notification volume
Use inhibition: Suppress redundant alerts
Regular review: Remove or tune noisy alerts
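
Several of these points can be combined in one rule group. The sketch below defines a warning tier and a critical tier for the same symptom, each with a `for` duration; an inhibit rule that matches on `alertname` can then mute the warning while the critical alert fires. The group name, the aggregation by `job`, and the durations are assumptions; the thresholds reuse the 1% and 5% values from earlier examples.

```yaml
groups:
  - name: error-rate-tiers   # hypothetical group name
    rules:
      - alert: HighErrorRate
        expr: |
          sum by (job) (rate(http_requests_total{status=~"5.."}[5m]))
            /
          sum by (job) (rate(http_requests_total[5m])) > 0.01
        for: 10m             # longer hold time keeps the warning from flapping
        labels:
          severity: warning
        annotations:
          summary: "Elevated error rate for {{ $labels.job }}"

      - alert: HighErrorRate
        expr: |
          sum by (job) (rate(http_requests_total{status=~"5.."}[5m]))
            /
          sum by (job) (rate(http_requests_total[5m])) > 0.05
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "High error rate for {{ $labels.job }}"
```

Both rules share the alert name and differ only in the `severity` label, which lets routing and inhibition treat the tiers differently.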

Alert Annotations¶

annotations:
  summary: "Brief description of the alert"
  description: "Detailed information with {{ $labels.instance }} and {{ $value }}"
  runbook_url: "https://wiki.example.com/runbooks/high-error-rate"
  dashboard_url: "https://grafana.example.com/d/abc123"

Practice Questions¶

What are the three states of an alert in Prometheus?
What is the purpose of the `for` clause in an alert rule?
How does Alertmanager group alerts?
What is the difference between silences and inhibition?
Name three notification channels supported by Alertmanager.
What is a recording rule and when should you use one?
How do you configure routing in Alertmanager?
What are template variables in Grafana used for?