
PCA Sample Practice Questions



Domain 1: Observability Concepts (18%)

Question 1

What are the three pillars of observability?

**Answer:** Metrics, Logs, and Traces

- **Metrics**: Numerical values that measure aspects of a system over time
- **Logs**: Immutable records of discrete events
- **Traces**: Records of request paths through distributed systems

Question 2

What is the difference between an SLA, SLO, and SLI?

- **SLA (Service Level Agreement)**: A formal agreement with customers defining expected service levels
- **SLO (Service Level Objective)**: Internal targets that teams aim to achieve
- **SLI (Service Level Indicator)**: The actual metrics used to measure service performance

Example: the SLA promises 99.9% uptime, the SLO targets 99.95%, and the SLI measures actual availability.
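
The relationship between the three terms comes down to simple arithmetic. The sketch below is illustrative only: the 30-day window, the request counts, and the function names are assumptions, not part of any Prometheus API.

```python
# Illustrative sketch: SLI, SLO, and SLA as plain arithmetic.
# The window length, request counts, and function names are assumptions.

def availability_sli(good_requests: int, total_requests: int) -> float:
    """SLI: the measured fraction of successful requests."""
    return good_requests / total_requests

def allowed_downtime_minutes(target: float, window_days: int = 30) -> float:
    """Downtime budget implied by an availability target over a window."""
    window_minutes = window_days * 24 * 60
    return (1.0 - target) * window_minutes

sli = availability_sli(good_requests=999_600, total_requests=1_000_000)
slo_target = 0.9995  # internal objective, stricter than the SLA
sla_target = 0.999   # contractual commitment

print(f"SLI (measured): {sli:.4%}")
print(f"SLO met: {sli >= slo_target}, SLA met: {sli >= sla_target}")
print(f"SLA downtime budget: {allowed_downtime_minutes(sla_target):.1f} min per 30 days")
print(f"SLO downtime budget: {allowed_downtime_minutes(slo_target):.1f} min per 30 days")
```

Setting the SLO tighter than the SLA leaves a safety margin: the team notices a breach of the internal target before the contractual one is at risk.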

Question 3

When should you use the Push model (Pushgateway) instead of the Pull model?

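Use Pushgateway for:

- Short-lived batch jobs
- Cron jobs that complete before they can be scraped

Jobs behind firewalls and legacy systems that can't expose endpoints are sometimes handled this way as well, but the Prometheus documentation recommends Pushgateway mainly for service-level batch jobs and points to alternatives (such as PushProx, or running Prometheus behind the network barrier) for firewall and NAT cases.

**Important**: Pushgateway should NOT be used as a general metrics aggregator.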

Question 4

What is a span in the context of distributed tracing?

A **span** represents a single operation within a trace. It provides:

- Start and end timestamps
- Operation name
- Tags/labels
- Logs/events
- Parent span reference

Multiple spans together form a complete trace showing the request flow through a system.

Domain 2: Prometheus Fundamentals (20%)

Question 5

What are the four metric types in Prometheus?

1. **Counter**: Cumulative metric that only increases (resets on restart)
2. **Gauge**: Metric that can go up or down
3. **Histogram**: Samples observations into configurable buckets
4. **Summary**: Similar to histogram but calculates quantiles client-side

Question 6

What is the purpose of relabeling in Prometheus?

Relabeling allows you to:

- Modify target labels before scraping (`relabel_configs`)
- Filter which targets to scrape
- Modify metric labels after scraping but before storing (`metric_relabel_configs`)
- Drop unwanted metrics
- Rename labels
- Extract values from labels using regex
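
The mechanics of a relabel rule (join the source labels, match a fully anchored regex, then act) can be sketched in a few lines. This is a simplified model, not Prometheus code: Python's `re` module stands in for the RE2 engine, only the `replace` and `keep` actions are modeled, and the label values are made up.

```python
# Simplified sketch of relabeling's `replace` and `keep` actions.
# Python's `re` stands in for Prometheus's RE2 engine; label values are made up.
import re

def joined(labels: dict[str, str], source_labels: list[str], separator: str = ";") -> str:
    """Concatenate the source label values, as relabeling does before matching."""
    return separator.join(labels.get(name, "") for name in source_labels)

def replace(labels, source_labels, regex, target_label, replacement="$1"):
    """Write the expanded replacement into target_label if the regex matches."""
    match = re.fullmatch(regex, joined(labels, source_labels))  # fully anchored
    if match is None:
        return dict(labels)  # no match: labels are left unchanged
    template = re.sub(r"\$(\d+)", r"\\g<\1>", replacement)  # $1 -> \g<1>
    return {**labels, target_label: match.expand(template)}

def keep(labels, source_labels, regex) -> bool:
    """Return True if the target (or sample) survives a `keep` rule."""
    return re.fullmatch(regex, joined(labels, source_labels)) is not None

target = {"__address__": "10.0.0.5:9100", "env": "prod"}
print(replace(target, ["__address__"], r"(.+):\d+", "host"))
print(keep(target, ["env"], r"prod|staging"))
```

The same logic applies in both places; what differs is the input. `relabel_configs` sees target labels before the scrape, while `metric_relabel_configs` sees the labels of each scraped sample.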

Question 7

Why should you avoid high-cardinality labels?

High-cardinality labels (like user IDs or request IDs) create problems because:

- Each unique label combination creates a new time series
- Memory usage increases significantly
- Queries slow down
- Prometheus can run out of memory

**Best practice**: Use labels with bounded, low-cardinality values.
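
A quick worst-case estimate shows why a single unbounded label is so costly. The label names and value counts below are made-up assumptions for illustration.

```python
# Illustrative sketch: worst-case series count for one metric name.
# Label names and value counts below are made-up assumptions.
from math import prod

def max_series(label_value_counts: dict[str, int]) -> int:
    """Upper bound on time series: product of distinct values per label."""
    return prod(label_value_counts.values())

bounded = {"method": 5, "status": 6, "instance": 20}
with_user_id = {**bounded, "user_id": 100_000}

print(max_series(bounded))       # 600
print(max_series(with_user_id))  # 60000000
```

The bound is a product, so adding one label with 100,000 values multiplies the potential series count by 100,000.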

Question 8

What is the difference between scrape_interval and evaluation_interval?

- **scrape_interval**: How often Prometheus scrapes targets for metrics (default: 1m)
- **evaluation_interval**: How often Prometheus evaluates recording and alerting rules (default: 1m)

Both are set in the `global` section. `scrape_interval` can be overridden per scrape job, and the evaluation interval can be overridden per rule group with `interval`.

Domain 3: PromQL (28%)

Question 9

What is the difference between rate() and irate()?

- **rate()**: Calculates the per-second average rate over the entire range
  - More stable, better for alerting
  - Takes the whole range into account
- **irate()**: Calculates the instant rate using only the last two data points in the range
  - More responsive to changes
  - Better for graphing volatile, fast-moving counters
  - Can miss spikes between scrapes
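
The difference is easiest to see on concrete samples. The sketch below is a simplified model: real Prometheus also extrapolates to the window boundaries, which is intentionally omitted here, and the sample data and function names are assumptions.

```python
# Simplified sketch of rate() vs irate() over one range of counter samples.
# Real Prometheus also extrapolates to the window boundaries; that step is
# intentionally omitted here. Sample data is made up.

Sample = tuple[float, float]  # (timestamp in seconds, counter value)

def simple_rate(samples: list[Sample]) -> float:
    """Average per-second increase across the range, handling counter resets."""
    increase = 0.0
    for (_, prev), (_, curr) in zip(samples, samples[1:]):
        # A drop in value means the counter was reset; count from zero.
        increase += curr - prev if curr >= prev else curr
    return increase / (samples[-1][0] - samples[0][0])

def simple_irate(samples: list[Sample]) -> float:
    """Per-second increase between the last two samples only."""
    (t_prev, prev), (t_curr, curr) = samples[-2], samples[-1]
    delta = curr - prev if curr >= prev else curr
    return delta / (t_curr - t_prev)

# Steady traffic, then a burst in the final 15 s scrape interval.
samples = [(0, 100), (15, 115), (30, 130), (45, 145), (60, 460)]
print(simple_rate(samples))   # 6.0  (360 / 60)
print(simple_irate(samples))  # 21.0 (315 / 15)
```

The burst dominates `irate()` but is averaged out by `rate()`, which is why `rate()` is the steadier choice for alert expressions.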

Question 10

Write a PromQL query to calculate the 95th percentile latency from a histogram.

histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))
Or with aggregation by service:
histogram_quantile(0.95, sum by (service, le) (rate(http_request_duration_seconds_bucket[5m])))
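
To see what `histogram_quantile()` does with the bucket counters, here is a simplified sketch of the linear interpolation it relies on. The bucket bounds and counts are made up, and the function is a plain-Python model rather than the Prometheus implementation.

```python
# Simplified sketch of the linear interpolation behind histogram_quantile().
# Buckets are cumulative (upper bound `le`, count), sorted, ending with +Inf.
# Bucket bounds and counts are made-up assumptions.
import math

def histogram_quantile(q: float, buckets: list[tuple[float, float]]) -> float:
    total = buckets[-1][1]  # the +Inf bucket holds the total count
    rank = q * total
    lower_bound, lower_count = 0.0, 0.0
    for upper_bound, count in buckets:
        if count >= rank and count > lower_count:
            if math.isinf(upper_bound):
                # Quantile falls in the +Inf bucket: report the highest finite bound.
                return lower_bound
            # Assume observations are spread evenly inside the bucket.
            fraction = (rank - lower_count) / (count - lower_count)
            return lower_bound + (upper_bound - lower_bound) * fraction
        lower_bound, lower_count = upper_bound, count
    return math.nan  # empty histogram

buckets = [(0.1, 50.0), (0.5, 90.0), (1.0, 98.0), (math.inf, 100.0)]
print(histogram_quantile(0.95, buckets))  # about 0.8125
```

Because the result is interpolated inside a bucket, its accuracy depends on how well the bucket boundaries bracket the latencies of interest.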

Question 11

How do you calculate error rate as a percentage?

sum(rate(http_requests_total{status=~"5.."}[5m])) 
/ 
sum(rate(http_requests_total[5m])) 
* 100
This divides error requests by total requests and multiplies by 100 for percentage.

Question 12

What does the absent() function do?

`absent()` returns a one-element vector with the value 1 if the vector passed to it has no elements; otherwise it returns an empty result.

Use cases:

- Alert when a metric is missing
- Detect when a service stops reporting
# Alert if no data from job
absent(up{job="myservice"})

Question 13

How do you compare current values to values from 1 hour ago?

Use the `offset` modifier:
# Difference from 1 hour ago
http_requests_total - http_requests_total offset 1h

# Percentage change
(http_requests_total - http_requests_total offset 1h) / http_requests_total offset 1h * 100

Question 14

What is the difference between sum by and sum without?

- **sum by (label)**: Aggregates and keeps only the specified labels
- **sum without (label)**: Aggregates and removes the specified labels, keeping all others
# Keep only 'method' label
sum by (method) (rate(http_requests_total[5m]))

# Remove 'instance' label, keep everything else
sum without (instance) (rate(http_requests_total[5m]))
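
The two modifiers differ only in how the grouping key is built, which a small model makes explicit. The series, label values, and function names below are assumptions for illustration.

```python
# Illustrative sketch of how `by` and `without` choose the grouping labels.
# The series and label values are made-up assumptions.
from collections import defaultdict

Series = tuple[dict[str, str], float]  # (labels, value)

def sum_by(series: list[Series], keep: set[str]) -> dict[tuple, float]:
    """Group on the listed labels only, dropping all others."""
    out: dict[tuple, float] = defaultdict(float)
    for labels, value in series:
        key = tuple(sorted((k, v) for k, v in labels.items() if k in keep))
        out[key] += value
    return dict(out)

def sum_without(series: list[Series], drop: set[str]) -> dict[tuple, float]:
    """Group on every label except the listed ones."""
    out: dict[tuple, float] = defaultdict(float)
    for labels, value in series:
        key = tuple(sorted((k, v) for k, v in labels.items() if k not in drop))
        out[key] += value
    return dict(out)

series = [
    ({"method": "GET", "instance": "a", "job": "api"}, 3.0),
    ({"method": "GET", "instance": "b", "job": "api"}, 5.0),
    ({"method": "POST", "instance": "a", "job": "api"}, 2.0),
]
print(sum_by(series, {"method"}))
print(sum_without(series, {"instance"}))
```

Both calls produce the same sums here, but `sum_without` keeps the `job` label in the result. That makes `without` the safer choice when labels you have not listed should survive the aggregation.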

Domain 4: Instrumentation and Exporters (16%)

Question 15

What are the Four Golden Signals of monitoring?

From Google SRE:

1. **Latency**: Time to service a request
2. **Traffic**: Demand on your system (requests/second)
3. **Errors**: Rate of failed requests
4. **Saturation**: How "full" your service is

Question 16

What metrics does the Node Exporter provide?

Node Exporter provides hardware and OS metrics:

- CPU usage (`node_cpu_seconds_total`)
- Memory (`node_memory_*`)
- Disk (`node_filesystem_*`, `node_disk_*`)
- Network (`node_network_*`)
- Load average (`node_load1`, `node_load5`, `node_load15`)
- System info

Question 17

What is the correct naming convention for Prometheus metrics?

Format: `<namespace>_<name>_<unit>_<suffix>`

Rules:

- Use snake_case
- Include the unit in the name (seconds, bytes)
- Use base units (seconds, not milliseconds)
- Use the `_total` suffix for counters
- Use the `_info` suffix for info metrics

Examples:

- `http_requests_total`
- `http_request_duration_seconds`
- `node_filesystem_avail_bytes`
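
The rules above are mechanical enough to check with a small linter. The sketch below is hypothetical: the function name, the list of non-base units, and the messages are assumptions, and the regular expression is the traditional metric-name character set.

```python
# Hypothetical naming linter for the conventions listed above.
# The unit list and messages are assumptions, not an official tool.
import re

LEGACY_NAME = re.compile(r"[a-zA-Z_:][a-zA-Z0-9_:]*")  # traditional character set
NON_BASE_UNITS = {"milliseconds", "ms", "microseconds", "minutes", "kilobytes", "megabytes"}

def naming_problems(name: str, metric_type: str) -> list[str]:
    """Return a list of convention violations (empty if the name looks fine)."""
    problems = []
    if not LEGACY_NAME.fullmatch(name):
        problems.append("invalid characters")
    if name != name.lower():
        problems.append("not snake_case")
    if NON_BASE_UNITS & set(name.lower().split("_")):
        problems.append("non-base unit")
    if metric_type == "counter" and not name.endswith("_total"):
        problems.append("counter without _total suffix")
    return problems

print(naming_problems("http_requests_total", "counter"))
print(naming_problems("http_request_duration_milliseconds", "histogram"))
print(naming_problems("httpRequests", "counter"))
```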

Question 18

When should you use the Blackbox Exporter?

Use Blackbox Exporter for:

- HTTP/HTTPS endpoint probing
- TCP port checks
- DNS lookups
- ICMP ping checks
- SSL certificate expiry monitoring

It's useful for monitoring external services or endpoints where you can't install an exporter.

Domain 5: Alerting & Dashboarding (18%)

Question 19

What are the three states of an alert in Prometheus?

1. **Inactive**: The alert condition is not met
2. **Pending**: The condition is met but the `for` duration hasn't elapsed
3. **Firing**: The condition has been true for the `for` duration

Question 20

What is the purpose of the for clause in an alert rule?

The `for` clause specifies how long the condition must be true before the alert fires.

Benefits:

- Prevents flapping alerts
- Reduces false positives from brief spikes
- Ensures the issue is persistent
- alert: HighErrorRate
  expr: error_rate > 0.05
  for: 5m  # Must be true for 5 minutes
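
How `for` interacts with the alert states can be modeled as a small state machine. This is a sketch under stated assumptions: evaluations happen at a fixed interval, and the function names and sample inputs are hypothetical.

```python
# Illustrative sketch of how `for` moves an alert between states.
# The evaluation interval, inputs, and function names are assumptions.
from typing import Optional

def alert_state(true_since: Optional[float], now: float, for_seconds: float) -> str:
    """Return the alert state at evaluation time `now`."""
    if true_since is None:
        return "inactive"
    if now - true_since < for_seconds:
        return "pending"
    return "firing"

def evaluate(results: list[bool], interval: float, for_seconds: float) -> list[str]:
    """Run consecutive rule evaluations and collect the state after each."""
    states, since = [], None
    for i, is_true in enumerate(results):
        now = i * interval
        if not is_true:
            since = None   # condition cleared: the timer resets
        elif since is None:
            since = now    # condition first became true
        states.append(alert_state(since, now, for_seconds))
    return states

# One-minute evaluations with `for: 5m`.
spike = [False, True, True, False, False]
sustained = [False] + [True] * 6
print(evaluate(spike, 60, 300))      # never leaves pending
print(evaluate(sustained, 60, 300))  # fires on the last evaluation
```

The brief spike only ever reaches pending, while the sustained condition fires once it has held for the full five minutes.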

Question 21

What is the difference between silences and inhibition in Alertmanager?

**Silences**:

- Manually created to mute specific alerts
- Time-bounded (start and end time)
- Used for maintenance windows
- Created via UI or API

**Inhibition**:

- Automatic suppression based on rules
- Suppresses alerts when related alerts are firing
- Configured in alertmanager.yml
- Example: Suppress warnings when critical is firing
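
The inhibition example (suppress warnings while a critical alert fires) can be sketched as a matching function. This models the idea of source matchers, target matchers, and `equal` labels; the function names and alert data are assumptions, and only exact-match matchers are modeled.

```python
# Illustrative sketch of an Alertmanager-style inhibition check.
# Models source/target matchers plus `equal` labels; names and data are assumptions.

Labels = dict[str, str]

def matches(labels: Labels, matchers: Labels) -> bool:
    """True if every matcher label has the required value."""
    return all(labels.get(name) == value for name, value in matchers.items())

def is_inhibited(target: Labels, firing: list[Labels], source_match: Labels,
                 target_match: Labels, equal: list[str]) -> bool:
    """True if some firing source alert suppresses the target alert."""
    if not matches(target, target_match):
        return False
    return any(
        source is not target
        and matches(source, source_match)
        and all(source.get(name) == target.get(name) for name in equal)
        for source in firing
    )

critical = {"alertname": "ServiceDown", "severity": "critical", "cluster": "eu"}
warning_eu = {"alertname": "HighLatency", "severity": "warning", "cluster": "eu"}
warning_us = {"alertname": "HighLatency", "severity": "warning", "cluster": "us"}
firing = [critical, warning_eu, warning_us]
rule = dict(source_match={"severity": "critical"},
            target_match={"severity": "warning"},
            equal=["cluster"])

print(is_inhibited(warning_eu, firing, **rule))  # True: critical alert in the same cluster
print(is_inhibited(warning_us, firing, **rule))  # False: no critical alert in "us"
```

The `equal` list scopes the suppression: a critical alert in one cluster does not mute warnings from another.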

Question 22

What is a recording rule and when should you use one?

Recording rules pre-compute frequently used or expensive PromQL expressions.

Use when:

- Query is computationally expensive
- Query is used in multiple dashboards/alerts
- You need to aggregate across federation
- Query performance is critical
- record: job:http_requests:rate5m
  expr: sum by (job) (rate(http_requests_total[5m]))

Question 23

How does Alertmanager group alerts?

Alertmanager groups alerts based on:

- `group_by` labels in the route configuration
- Alerts with matching group labels are batched together

Configuration:
route:
  group_by: ['alertname', 'cluster']
  group_wait: 30s      # Wait before first notification
  group_interval: 5m   # Wait between group updates
  repeat_interval: 4h  # Wait before re-sending
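
The batching step itself is a group-by on the configured labels. The sketch below models only that step, not the timing controlled by `group_wait`, `group_interval`, and `repeat_interval`. The alert names and label values are made up.

```python
# Illustrative sketch of batching alerts by `group_by` labels.
# Alert names and label values are made-up assumptions; timing is not modeled.
from collections import defaultdict

def group_alerts(alerts: list[dict[str, str]], group_by: list[str]) -> dict[tuple, list[dict[str, str]]]:
    """Bucket alerts by the values of their group_by labels."""
    groups = defaultdict(list)
    for labels in alerts:
        key = tuple(labels.get(name, "") for name in group_by)
        groups[key].append(labels)
    return dict(groups)

alerts = [
    {"alertname": "HighErrorRate", "cluster": "eu", "instance": "a"},
    {"alertname": "HighErrorRate", "cluster": "eu", "instance": "b"},
    {"alertname": "HighErrorRate", "cluster": "us", "instance": "c"},
    {"alertname": "DiskFull", "cluster": "eu", "instance": "a"},
]
for key, members in group_alerts(alerts, ["alertname", "cluster"]).items():
    print(key, len(members))
```

With `group_by: ['alertname', 'cluster']`, the two `HighErrorRate` alerts from the `eu` cluster arrive as a single notification instead of two.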

Question 24

What notification channels does Alertmanager support?

Built-in receivers:

- Email (SMTP)
- Slack
- PagerDuty
- OpsGenie
- VictorOps
- Webhook (for custom integrations)
- Pushover
- WeChat
- Telegram

Custom integrations can be built using the webhook receiver.

Bonus Questions

Question 25

What is meta-monitoring?

Meta-monitoring is monitoring the monitoring system itself (Prometheus monitoring Prometheus).

Important metrics to monitor:

- `prometheus_tsdb_head_series`: Number of time series
- `prometheus_engine_query_duration_seconds`: Query performance
- `prometheus_target_scrape_pool_sync_total`: Scrape health
- `up{job="prometheus"}`: Prometheus availability

Question 26

How can you scale Prometheus for high availability?

Options for scaling:

1. **Multiple instances**: Run identical Prometheus servers
2. **Federation**: Hierarchical Prometheus setup
3. **Remote storage**: Thanos, Cortex, Mimir for long-term storage
4. **Sharding**: Split targets across multiple Prometheus instances

Note: Prometheus itself doesn't support clustering natively.