Cloud Native Observability (8%)¶
This domain covers monitoring, logging, and tracing in cloud native environments.
Three Pillars of Observability¶
1. Metrics¶
Numeric measurements collected over time.
Examples:
- CPU utilization
- Memory usage
- Request count
- Error rate
- Response latency
2. Logs¶
Timestamped records of discrete events.
Examples:
- Application errors
- Access logs
- Audit logs
- System events
3. Traces¶
Records of requests as they flow through distributed systems.
Examples:
- Request path through microservices
- Latency at each service
- Error propagation
Prometheus¶
What is Prometheus?¶
Prometheus is an open-source monitoring and alerting toolkit and a CNCF graduated project.
Key Features¶
- Multi-dimensional data model with time series
- PromQL query language
- Pull-based metrics collection
- Service discovery
- Alerting via Alertmanager
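Pull-based collection means each target exposes its current metric values over HTTP and Prometheus fetches them on a schedule. The following is a minimal sketch of the kind of payload a target might serve, assuming the Prometheus text exposition format. The `render_metrics` helper, the metric names, and the label values are hypothetical; real applications would normally rely on an official client library rather than hand-rolled rendering.

```python
# Illustrative only: renders samples in the Prometheus text exposition format.
# render_metrics and the sample data are hypothetical, and label-value
# escaping is deliberately left out to keep the sketch short.

def render_metrics(samples):
    """Render (name, labels, value) tuples as text exposition lines."""
    lines = []
    for name, labels, value in samples:
        if labels:
            label_str = ",".join(f'{k}="{v}"' for k, v in sorted(labels.items()))
            lines.append(f"{name}{{{label_str}}} {value}")
        else:
            lines.append(f"{name} {value}")
    return "\n".join(lines) + "\n"


samples = [
    ("http_requests_total", {"method": "GET", "status": "200"}, 1027),
    ("http_requests_total", {"method": "GET", "status": "500"}, 3),
    ("process_open_fds", {}, 12),
]
body = render_metrics(samples)
print(body)
```

Each line pairs a metric name and its labels with the latest value, which is what the multi-dimensional data model refers to: the same metric name yields one time series per distinct label combination.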
Architecture¶
┌─────────────┐     ┌─────────────┐     ┌─────────────┐
│   Target    │     │   Target    │     │   Target    │
│  (metrics)  │     │  (metrics)  │     │  (metrics)  │
└──────┬──────┘     └──────┬──────┘     └──────┬──────┘
       │                   │                   │
       └───────────────────┼───────────────────┘
                           │ scrape
                    ┌──────▼──────┐
                    │ Prometheus  │
                    │   Server    │
                    └──────┬──────┘
                           │
              ┌────────────┼────────────┐
              │            │            │
       ┌──────▼──────┐ ┌───▼───┐ ┌──────▼──────┐
       │ Alertmanager│ │ PromQL│ │   Grafana   │
       └─────────────┘ └───────┘ └─────────────┘
Metric Types¶
| Type | Description | Example |
|---|---|---|
| Counter | Cumulative, only increases | Total requests |
| Gauge | Can go up or down | Current temperature |
| Histogram | Observations counted in configurable buckets | Request duration |
| Summary | Like a histogram, but quantiles are calculated on the client | Request duration |
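To make the difference between the types concrete, here is a small sketch in plain Python. It is not the Prometheus client library: the class names mirror the metric types, but the methods and the bucket bounds are assumptions chosen for illustration, and Summary is left out for brevity.

```python
import bisect


class Counter:
    """Cumulative value that only increases while the process runs."""

    def __init__(self):
        self.value = 0.0

    def inc(self, amount=1.0):
        if amount < 0:
            raise ValueError("counters cannot decrease")
        self.value += amount


class Gauge:
    """Point-in-time value that can go up or down."""

    def __init__(self):
        self.value = 0.0

    def set(self, value):
        self.value = value


class Histogram:
    """Counts observations per bucket, keyed by upper bound (le)."""

    def __init__(self, bounds):
        self.bounds = sorted(bounds)
        self.bucket_counts = [0] * (len(self.bounds) + 1)  # last slot is +Inf
        self.sum = 0.0
        self.count = 0

    def observe(self, value):
        # bisect_left returns the first bucket whose upper bound is >= value.
        idx = bisect.bisect_left(self.bounds, value)
        self.bucket_counts[idx] += 1
        self.sum += value
        self.count += 1

    def cumulative(self):
        """Return cumulative counts, the form in which buckets are exposed."""
        out, running = {}, 0
        for bound, n in zip(self.bounds + [float("inf")], self.bucket_counts):
            running += n
            out[bound] = running
        return out


requests = Counter()
requests.inc()
requests.inc(2)

temperature = Gauge()
temperature.set(21.5)
temperature.set(19.0)  # a gauge may drop

latency = Histogram([0.1, 0.5, 1.0])  # hypothetical bucket bounds in seconds
for seconds in (0.05, 0.1, 0.3, 2.0):
    latency.observe(seconds)
```

A Counter answers "how many so far", a Gauge answers "what is it right now", and a Histogram keeps enough information to estimate percentiles later on the server side.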
PromQL Examples¶
# CPU usage
rate(container_cpu_usage_seconds_total[5m])
# Memory usage
container_memory_usage_bytes
# HTTP request rate
rate(http_requests_total[5m])
# Error ratio (share of requests that returned a 5xx status)
sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m]))
# 95th percentile latency, aggregated across all series
histogram_quantile(0.95, sum by (le) (rate(http_request_duration_seconds_bucket[5m])))
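The `rate()` function used in these queries turns a counter into a per-second increase over a time window. The sketch below shows the core idea in plain Python, under the assumption that a drop in value means the counter was reset. It is a simplification: the real PromQL function also extrapolates to the window boundaries, and `simple_rate` plus the sample data are hypothetical.

```python
def simple_rate(samples):
    """Per-second increase of a counter over a window.

    samples: list of (timestamp_seconds, value) pairs, oldest first.
    Returns None when there are too few samples to compute a rate.
    """
    if len(samples) < 2:
        return None
    increase = 0.0
    prev = samples[0][1]
    for _, value in samples[1:]:
        if value < prev:
            # The counter went backwards, so treat it as a restart from zero.
            increase += value
        else:
            increase += value - prev
        prev = value
    elapsed = samples[-1][0] - samples[0][0]
    return increase / elapsed if elapsed > 0 else None


# Hypothetical 5-minute windows for total requests and 5xx responses.
total = [(0, 1000), (300, 1300)]
errors = [(0, 5), (300, 20)]
error_ratio = simple_rate(errors) / simple_rate(total)

# A window containing a counter reset between the second and third sample.
with_reset = [(0, 100), (60, 160), (120, 10), (180, 70)]
reset_rate = simple_rate(with_reset)
```

Dividing the error rate by the total rate gives a ratio between 0 and 1, which is usually easier to alert on than a raw count.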
Grafana¶
What is Grafana?¶
Grafana is an open-source visualization and analytics platform.
Key Features¶
- Dashboard creation and sharing
- Multiple data source support
- Alerting capabilities
- Annotations
- Templating
Common Panels¶
- Time series (formerly Graph): Time series visualization
- Stat: Single value display
- Gauge: Visual gauge
- Table: Tabular data
- Heatmap: Distribution over time
Logging¶
Logging Architecture in Kubernetes¶
┌─────────────────────────────────────────────────────┐
│                   Kubernetes Node                   │
│  ┌─────────┐  ┌─────────┐  ┌─────────┐              │
│  │  Pod A  │  │  Pod B  │  │  Pod C  │              │
│  │ stdout  │  │ stdout  │  │ stdout  │              │
│  └────┬────┘  └────┬────┘  └────┬────┘              │
│       │            │            │                   │
│       └────────────┼────────────┘                   │
│                    ▼                                │
│          /var/log/containers/                       │
│                    │                                │
│            ┌───────▼───────┐                        │
│            │  Log Shipper  │ (Fluentd/Fluent Bit)   │
│            └───────┬───────┘                        │
└────────────────────┼────────────────────────────────┘
                     │
              ┌──────▼──────┐
              │  Log Store  │ (Elasticsearch/Loki)
              └──────┬──────┘
                     │
              ┌──────▼──────┐
              │  Grafana/   │
              │   Kibana    │
              └─────────────┘
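In this architecture the application only writes to stdout, and the node-level log shipper takes care of collection. A minimal sketch of an application emitting one JSON object per line, which many shippers can parse without extra rules, might look like the following. The `JsonFormatter` class, the field names, and the `orders` logger name are assumptions for illustration, not a required schema.

```python
import json
import logging
import sys


class JsonFormatter(logging.Formatter):
    """Format each log record as a single-line JSON object."""

    def format(self, record):
        payload = {
            "ts": self.formatTime(record, "%Y-%m-%dT%H:%M:%S"),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        return json.dumps(payload)


def build_logger(name="orders"):
    """Return a logger that writes JSON lines to stdout only."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    logger.handlers.clear()
    handler = logging.StreamHandler(sys.stdout)
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    logger.propagate = False  # avoid duplicate lines via the root logger
    return logger


logger = build_logger()
logger.info("order %s created", "A-1001")
```

Writing to stdout instead of a file inside the container keeps the application unaware of where logs end up, so the log store can be swapped without code changes.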
Logging Tools¶
| Tool | Description |
|---|---|
| Fluentd | Open-source data collector (CNCF Graduated) |
| Fluent Bit | Lightweight log processor |
| Elasticsearch | Search and analytics engine |
| Loki | Log aggregation system by Grafana |
| Kibana | Visualization for Elasticsearch |
Kubernetes Logging Commands¶
# View pod logs
kubectl logs <pod-name>
# Follow logs
kubectl logs -f <pod-name>
# Logs from specific container
kubectl logs <pod-name> -c <container-name>
# Previous container logs
kubectl logs <pod-name> --previous
# Logs with timestamps
kubectl logs <pod-name> --timestamps
Distributed Tracing¶
What is Distributed Tracing?¶
Distributed tracing tracks requests as they flow through multiple services, helping identify:
- Performance bottlenecks
- Error sources
- Service dependencies
Tracing Concepts¶
| Concept | Description |
|---|---|
| Trace | End-to-end journey of a request |
| Span | Single operation within a trace |
| Context | Metadata (such as the trace ID and parent span ID) propagated between services so that spans can be linked |
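Context propagation is what lets spans created in different services be stitched into one trace. The sketch below assumes the W3C Trace Context `traceparent` header layout (`version-traceid-parentid-flags`) and shows a service reusing the incoming trace ID while minting its own span ID. The helper names are hypothetical, and in practice a tracing SDK would normally handle this automatically.

```python
import secrets


def new_traceparent():
    """Start a new trace: 16-byte trace ID, 8-byte span ID, sampled flag set."""
    trace_id = secrets.token_hex(16)  # 32 hex characters
    span_id = secrets.token_hex(8)    # 16 hex characters
    return f"00-{trace_id}-{span_id}-01"


def parse_traceparent(header):
    """Split a traceparent header into its four fields."""
    version, trace_id, parent_span_id, flags = header.split("-")
    if len(trace_id) != 32 or len(parent_span_id) != 16:
        raise ValueError("malformed traceparent header")
    return {
        "version": version,
        "trace_id": trace_id,
        "parent_span_id": parent_span_id,
        "sampled": bool(int(flags, 16) & 1),  # lowest bit is the sampled flag
    }


def child_traceparent(incoming):
    """Build the header a service sends downstream for its own span."""
    ctx = parse_traceparent(incoming)
    flags = "01" if ctx["sampled"] else "00"
    return f"00-{ctx['trace_id']}-{secrets.token_hex(8)}-{flags}"


incoming = new_traceparent()             # set by the first service
outgoing = child_traceparent(incoming)   # sent on to the next service
```

Because the trace ID stays constant across hops while each span ID is fresh, a tracing backend can rebuild the parent and child relationships from the collected spans.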
Trace Example¶
Trace ID: abc123
├── Span: API Gateway (10ms)
│ └── Span: Auth Service (5ms)
├── Span: Order Service (50ms)
│ ├── Span: Database Query (20ms)
│ └── Span: Payment Service (25ms)
└── Span: Notification Service (15ms)
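The trace above can be modelled as a tree of spans, which makes it straightforward to ask where the time actually went. This is a rough sketch rather than any tracing library's data model: the `Span` class is hypothetical, and "self time" assumes child spans run one after another without overlapping.

```python
from dataclasses import dataclass, field


@dataclass
class Span:
    name: str
    duration_ms: int
    children: list = field(default_factory=list)

    def self_time_ms(self):
        """Time spent in this span itself, excluding its direct children."""
        return self.duration_ms - sum(c.duration_ms for c in self.children)


def walk(span):
    """Yield a span and all of its descendants, depth first."""
    yield span
    for child in span.children:
        yield from walk(child)


# The spans from the example trace abc123.
roots = [
    Span("API Gateway", 10, [Span("Auth Service", 5)]),
    Span("Order Service", 50, [
        Span("Database Query", 20),
        Span("Payment Service", 25),
    ]),
    Span("Notification Service", 15),
]

all_spans = [s for root in roots for s in walk(root)]
slowest = max(all_spans, key=lambda s: s.self_time_ms())
print(f"Largest self time: {slowest.name} ({slowest.self_time_ms()}ms)")
```

Although Order Service has the longest total duration, almost all of it is spent waiting on its children, so the span worth optimizing first is the one with the largest self time.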
Tracing Tools¶
| Tool | Description |
|---|---|
| Jaeger | Distributed tracing platform (CNCF Graduated) |
| Zipkin | Distributed tracing system |
| OpenTelemetry | Observability framework (CNCF) |
OpenTelemetry¶
What is OpenTelemetry?¶
OpenTelemetry is a collection of tools, APIs, and SDKs for instrumenting, generating, collecting, and exporting telemetry data (metrics, logs, traces).
Components¶
- API: Defines how to generate telemetry
- SDK: Implements the API
- Collector: Receives, processes, and exports data
- Exporters: Send data to backends
OpenTelemetry Collector¶
┌─────────────────────────────────────────────────────┐
│               OpenTelemetry Collector               │
│                                                     │
│  ┌──────────┐   ┌──────────┐   ┌──────────────┐     │
│  │Receivers │ → │Processors│ → │  Exporters   │     │
│  │(OTLP,    │   │(batch,   │   │(Jaeger,      │     │
│  │ Jaeger)  │   │ filter)  │   │ Prometheus)  │     │
│  └──────────┘   └──────────┘   └──────────────┘     │
└─────────────────────────────────────────────────────┘
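The receiver, processor, and exporter stages form a simple pipeline. The following toy model in plain Python illustrates the data flow only; it does not use the real Collector or its configuration format, and every function and class name here is hypothetical.

```python
def otlp_receiver(raw_items):
    """Receiver stage: accept incoming telemetry (here, ready-made dicts)."""
    yield from raw_items


def filter_processor(items, drop_if):
    """Processor stage: discard items that match the drop predicate."""
    for item in items:
        if not drop_if(item):
            yield item


def batch_processor(items, batch_size):
    """Processor stage: group items so the exporter makes fewer calls."""
    batch = []
    for item in items:
        batch.append(item)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch  # flush the final partial batch


class ListExporter:
    """Exporter stage: stands in for a backend by recording what was sent."""

    def __init__(self):
        self.sent = []

    def export(self, batch):
        self.sent.append(list(batch))


def run_pipeline(raw_items, exporter, batch_size=2):
    stream = otlp_receiver(raw_items)
    stream = filter_processor(stream, drop_if=lambda s: s["name"] == "healthcheck")
    for batch in batch_processor(stream, batch_size):
        exporter.export(batch)


spans = [
    {"name": "GET /orders"},
    {"name": "healthcheck"},
    {"name": "POST /orders"},
    {"name": "GET /users"},
]
exporter = ListExporter()
run_pipeline(spans, exporter)
```

Placing the filter before the batcher means dropped items never take up space in a batch, which mirrors the usual advice to reduce data as early in the pipeline as possible.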
Cost Management¶
Observability Costs¶
- Storage: Metrics, logs, and traces consume storage
- Compute: Processing and querying data
- Network: Data transfer between components
Cost Optimization¶
- Set appropriate retention periods
- Use sampling for high-volume traces
- Aggregate metrics where possible
- Filter unnecessary logs
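Sampling is often the most effective of these levers for traces. One common approach is to decide from the trace ID itself, so that every service reaches the same keep-or-drop decision without coordination. The sketch below illustrates that idea under simplifying assumptions: `should_sample` is a hypothetical helper, and the sequential trace IDs exist only to make the demonstration reproducible, since real trace IDs are random.

```python
def should_sample(trace_id_hex, ratio, buckets=10_000):
    """Keep a trace when its ID falls in the lowest `ratio` share of buckets.

    The decision depends only on the trace ID, so it is deterministic
    and identical in every service that sees the same trace.
    """
    if not 0.0 <= ratio <= 1.0:
        raise ValueError("ratio must be between 0 and 1")
    threshold = round(ratio * buckets)
    return int(trace_id_hex, 16) % buckets < threshold


# Demo data: 10,000 sequential IDs rendered as 32 hex characters.
trace_ids = [f"{i:032x}" for i in range(10_000)]
kept = [t for t in trace_ids if should_sample(t, ratio=0.1)]
print(f"kept {len(kept)} of {len(trace_ids)} traces")
```

Keeping 10% of traces cuts trace storage and transfer by roughly the same factor, at the cost of possibly missing rare failures, so error traces are often retained through a separate rule.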
Key Concepts to Remember¶
- Three pillars: Metrics, Logs, Traces
- Prometheus uses pull-based metrics collection
- PromQL is the query language for Prometheus
- OpenTelemetry unifies observability instrumentation
- Jaeger and Zipkin are popular tracing tools
Practice Questions¶
- What are the three pillars of observability?
- What is the difference between a Counter and a Gauge in Prometheus?
- What is the purpose of distributed tracing?
- Name two CNCF graduated observability projects.
- What does OpenTelemetry provide?