Alerting & Dashboarding (18%)¶
Overview¶
This domain covers configuring alerting rules, understanding Alertmanager, and dashboarding basics with Grafana.
Alerting Rules¶
Rule Configuration¶
Alert rules are defined in YAML files and loaded by Prometheus.
# alerts.yml
groups:
- name: example-alerts
rules:
- alert: HighErrorRate
expr: rate(http_requests_total{status=~"5.."}[5m]) / rate(http_requests_total[5m]) > 0.05
for: 5m
labels:
severity: critical
team: backend
annotations:
summary: "High error rate detected"
description: "Error rate is {{ $value | humanizePercentage }} for {{ $labels.instance }}"
- alert: InstanceDown
expr: up == 0
for: 1m
labels:
severity: critical
annotations:
summary: "Instance {{ $labels.instance }} is down"
description: "{{ $labels.instance }} of job {{ $labels.job }} has been down for more than 1 minute."
Rule Components¶
| Component | Description |
|---|---|
alert | Name of the alert |
expr | PromQL expression that triggers the alert |
for | Duration the condition must be true before firing |
labels | Additional labels to attach to the alert |
annotations | Informational labels (summary, description, runbook) |
Alert States¶
- Inactive: Condition is not met
- Pending: Condition is met but
forduration hasn't elapsed - Firing: Condition met for the
forduration
Recording Rules¶
Pre-compute frequently used or expensive expressions.
groups:
- name: recording-rules
rules:
- record: job:http_requests_total:rate5m
expr: sum by (job) (rate(http_requests_total[5m]))
- record: instance:node_cpu_utilization:ratio
expr: 1 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m]))
- record: job:http_request_duration_seconds:p95
expr: histogram_quantile(0.95, sum by (job, le) (rate(http_request_duration_seconds_bucket[5m])))
Prometheus Configuration¶
# prometheus.yml
rule_files:
- "rules/*.yml"
- "alerts/*.yml"
alerting:
alertmanagers:
- static_configs:
- targets:
- alertmanager:9093
Alertmanager¶
Architecture¶
┌─────────────┐ ┌─────────────────────────────────────────┐
│ Prometheus │────▶│ Alertmanager │
│ (alerts) │ │ ┌─────────┐ ┌──────────┐ ┌────────┐ │
└─────────────┘ │ │ Routing │─▶│ Grouping │─▶│ Notify │ │
│ └─────────┘ └──────────┘ └────────┘ │
└─────────────────────────────────────────┘
│
┌─────────────────────────────────┼─────────┐
│ │ │
▼ ▼ ▼
┌──────────┐ ┌──────────┐ ┌────────┐
│ Email │ │ Slack │ │PagerDuty│
└──────────┘ └──────────┘ └────────┘
Configuration Structure¶
# alertmanager.yml
global:
resolve_timeout: 5m
smtp_smarthost: 'smtp.example.com:587'
smtp_from: 'alertmanager@example.com'
route:
receiver: 'default-receiver'
group_by: ['alertname', 'cluster']
group_wait: 30s
group_interval: 5m
repeat_interval: 4h
routes:
- match:
severity: critical
receiver: 'pagerduty-critical'
- match:
severity: warning
receiver: 'slack-warnings'
receivers:
- name: 'default-receiver'
email_configs:
- to: 'team@example.com'
- name: 'pagerduty-critical'
pagerduty_configs:
- service_key: '<pagerduty-service-key>'
- name: 'slack-warnings'
slack_configs:
- api_url: 'https://hooks.slack.com/services/xxx/yyy/zzz'
channel: '#alerts'
inhibit_rules:
- source_match:
severity: 'critical'
target_match:
severity: 'warning'
equal: ['alertname', 'cluster']
Routing¶
Routes determine which receiver handles an alert.
route:
receiver: 'default'
routes:
# Critical alerts go to PagerDuty
- match:
severity: critical
receiver: 'pagerduty'
continue: false
# Database alerts go to DBA team
- match_re:
alertname: ^(MySQL|Postgres).*
receiver: 'dba-team'
# Multiple matchers (AND logic)
- match:
team: backend
severity: warning
receiver: 'backend-slack'
Grouping¶
Groups related alerts together to reduce notification noise.
route:
group_by: ['alertname', 'cluster', 'service']
group_wait: 30s # Wait before sending first notification
group_interval: 5m # Wait before sending updates to group
repeat_interval: 4h # Wait before re-sending same alert
Silences¶
Temporarily mute alerts.
# Create silence via API
curl -X POST http://alertmanager:9093/api/v2/silences \
-H "Content-Type: application/json" \
-d '{
"matchers": [
{"name": "alertname", "value": "HighMemoryUsage", "isRegex": false},
{"name": "instance", "value": "server1", "isRegex": false}
],
"startsAt": "2024-01-01T00:00:00Z",
"endsAt": "2024-01-01T06:00:00Z",
"createdBy": "admin",
"comment": "Maintenance window"
}'
Inhibition¶
Suppress alerts when related alerts are firing.
inhibit_rules:
# If critical alert fires, suppress warning for same alertname
- source_match:
severity: 'critical'
target_match:
severity: 'warning'
equal: ['alertname', 'instance']
# If cluster is down, suppress all other cluster alerts
- source_match:
alertname: 'ClusterDown'
target_match_re:
alertname: '.+'
equal: ['cluster']
Receivers¶
Email¶
receivers:
- name: 'email-team'
email_configs:
- to: 'team@example.com'
from: 'alertmanager@example.com'
smarthost: 'smtp.example.com:587'
auth_username: 'alertmanager'
auth_password: 'password'
require_tls: true
Slack¶
receivers:
- name: 'slack-notifications'
slack_configs:
- api_url: 'https://hooks.slack.com/services/xxx/yyy/zzz'
channel: '#alerts'
username: 'Alertmanager'
icon_emoji: ':warning:'
title: '{{ .Status | toUpper }}: {{ .CommonAnnotations.summary }}'
text: '{{ range .Alerts }}{{ .Annotations.description }}{{ end }}'
PagerDuty¶
receivers:
- name: 'pagerduty'
pagerduty_configs:
- service_key: '<integration-key>'
severity: '{{ if eq .Status "firing" }}critical{{ else }}info{{ end }}'
description: '{{ .CommonAnnotations.summary }}'
Webhook¶
receivers:
- name: 'webhook'
webhook_configs:
- url: 'http://webhook-handler:8080/alerts'
send_resolved: true
Dashboarding with Grafana¶
Data Source Configuration¶
# Grafana datasource provisioning
apiVersion: 1
datasources:
- name: Prometheus
type: prometheus
access: proxy
url: http://prometheus:9090
isDefault: true
editable: false
Panel Types¶
| Panel Type | Use Case |
|---|---|
| Time Series | Metrics over time |
| Stat | Single value display |
| Gauge | Value with thresholds |
| Bar Gauge | Horizontal/vertical bars |
| Table | Tabular data |
| Heatmap | Distribution over time |
| Logs | Log data display |
Common Dashboard Patterns¶
Request Rate Panel¶
Error Rate Panel¶
sum(rate(http_requests_total{status=~"5.."}[5m])) by (service)
/
sum(rate(http_requests_total[5m])) by (service) * 100
Latency Percentiles Panel¶
histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))
histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))
histogram_quantile(0.50, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))
Resource Utilization Panel¶
# CPU Usage
100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
# Memory Usage
(1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100
# Disk Usage
(1 - node_filesystem_avail_bytes / node_filesystem_size_bytes) * 100
Variables (Template Variables)¶
# Query variable for instances
Name: instance
Type: Query
Query: label_values(up, instance)
# Custom variable for time ranges
Name: interval
Type: Custom
Values: 1m,5m,15m,1h
# Using variables in queries
rate(http_requests_total{instance="$instance"}[$interval])
Dashboard Best Practices¶
- Use consistent naming: Follow a naming convention
- Add descriptions: Document what each panel shows
- Set appropriate time ranges: Match to your SLOs
- Use variables: Make dashboards reusable
- Group related panels: Use rows to organize
- Set thresholds: Visual indicators for good/bad states
- Include links: Link to runbooks and related dashboards
Alerting Best Practices¶
When to Alert¶
Alert on symptoms, not causes:
# Good: Alert on user-facing impact
- alert: HighErrorRate
expr: rate(http_errors_total[5m]) / rate(http_requests_total[5m]) > 0.01
# Avoid: Alerting on every possible cause
- alert: HighCPU
expr: cpu_usage > 80 # May not indicate a problem
Alert Fatigue Prevention¶
- Set appropriate thresholds: Not too sensitive
- Use
forduration: Avoid flapping alerts - Group related alerts: Reduce notification volume
- Use inhibition: Suppress redundant alerts
- Regular review: Remove or tune noisy alerts
Alert Annotations¶
annotations:
summary: "Brief description of the alert"
description: "Detailed information with {{ $labels.instance }} and {{ $value }}"
runbook_url: "https://wiki.example.com/runbooks/high-error-rate"
dashboard_url: "https://grafana.example.com/d/abc123"
Practice Questions¶
- What are the three states of an alert in Prometheus?
- What is the purpose of the
forclause in an alert rule? - How does Alertmanager group alerts?
- What is the difference between silences and inhibition?
- Name three notification channels supported by Alertmanager.
- What is a recording rule and when should you use one?
- How do you configure routing in Alertmanager?
- What are template variables in Grafana used for?