Cloud Native Observability (8%)

This domain covers monitoring, logging, and tracing in cloud native environments.

Three Pillars of Observability

1. Metrics

Numeric measurements collected over time.

Examples:

- CPU utilization
- Memory usage
- Request count
- Error rate
- Response latency

2. Logs

Timestamped records of discrete events.

Examples:

- Application errors
- Access logs
- Audit logs
- System events
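To make "timestamped record of a discrete event" concrete, here is a minimal sketch of a structured log line built with the Python standard library. The function name `make_log_record` and the field names (`timestamp`, `level`, `message`) are illustrative assumptions, not a standard schema.

```python
import json
from datetime import datetime, timezone


def make_log_record(level, message, now=None, **fields):
    """Build one structured log line: a timestamp plus a discrete event."""
    # `now` can be injected so the output is reproducible; defaults to current UTC time.
    ts = (now or datetime.now(timezone.utc)).isoformat()
    record = {"timestamp": ts, "level": level, "message": message}
    record.update(fields)  # extra context, e.g. an order ID or status code
    return json.dumps(record)


if __name__ == "__main__":
    # Containers typically write lines like this to stdout.
    print(make_log_record("error", "payment failed", order_id="o-42", status=502))
```

Emitting one JSON object per line is a common choice because log shippers can parse it without custom patterns.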

3. Traces

Records of requests as they flow through distributed systems.

Examples:

- Request path through microservices
- Latency at each service
- Error propagation

Prometheus

What is Prometheus?

Prometheus is an open-source monitoring and alerting toolkit and a CNCF Graduated project.

Key Features

Multi-dimensional data model with time series
PromQL query language
Pull-based metrics collection
Service discovery
Alerting via Alertmanager
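Pull-based collection means each target exposes its current values over HTTP and Prometheus fetches them on a schedule. The sketch below, using only the Python standard library, renders labeled samples in the Prometheus text exposition format and serves them at `/metrics`. The `SAMPLES` data, the port, and the function names are hypothetical; a real application would normally use an official client library instead.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical in-memory samples: (metric name, labels) -> current value.
SAMPLES = {
    ("http_requests_total", (("method", "get"), ("status", "200"))): 1027,
    ("http_requests_total", (("method", "get"), ("status", "500"))): 3,
}


def render_metrics(samples):
    """Render samples in the Prometheus text exposition format."""
    # This sketch only knows one metric, so its TYPE line is hard-coded.
    lines = ["# TYPE http_requests_total counter"]
    for (name, labels), value in sorted(samples.items()):
        label_str = ",".join(f'{key}="{val}"' for key, val in labels)
        lines.append(f"{name}{{{label_str}}} {value}")
    return "\n".join(lines) + "\n"


class MetricsHandler(BaseHTTPRequestHandler):
    """Answers scrapes: Prometheus sends GET /metrics, the target replies."""

    def do_GET(self):
        if self.path != "/metrics":
            self.send_error(404)
            return
        body = render_metrics(SAMPLES).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port=8000):
    """Block forever, serving /metrics on the given (assumed) port."""
    HTTPServer(("", port), MetricsHandler).serve_forever()
```

Calling `serve()` starts the endpoint; each label combination becomes its own time series, which is what the multi-dimensional data model refers to.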

Architecture

┌─────────────┐     ┌─────────────┐     ┌─────────────┐
│   Target    │     │   Target    │     │   Target    │
│  (metrics)  │     │  (metrics)  │     │  (metrics)  │
└──────┬──────┘     └──────┬──────┘     └──────┬──────┘
       │                   │                   │
       └───────────────────┼───────────────────┘
                           │ scrape
                    ┌──────▼──────┐
                    │  Prometheus │
                    │   Server    │
                    └──────┬──────┘
                           │
              ┌────────────┼────────────┐
              │            │            │
       ┌──────▼──────┐ ┌───▼───┐ ┌──────▼──────┐
       │ Alertmanager│ │ PromQL│ │   Grafana   │
       └─────────────┘ └───────┘ └─────────────┘

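Metric Types

Type	Description	Example
Counter	Cumulative value that only increases (or resets to zero on restart)	Total requests
Gauge	Value that can go up or down	Current temperature
Histogram	Counts observations in configurable buckets; quantiles are calculated at query time	Request duration
Summary	Like a histogram, but quantiles are calculated by the client	Request duration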
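The differences between the metric types are easiest to see in code. The following is a simplified sketch of a counter, a gauge, and a histogram with cumulative buckets. The class and method names are assumptions chosen for illustration and do not reproduce any client library's API.

```python
import bisect


class Counter:
    """Cumulative value that only increases."""

    def __init__(self):
        self.value = 0.0

    def inc(self, amount=1.0):
        if amount < 0:
            raise ValueError("counters can only increase")
        self.value += amount


class Gauge:
    """Point-in-time value that can go up or down."""

    def __init__(self):
        self.value = 0.0

    def set(self, value):
        self.value = value


class Histogram:
    """Counts observations into buckets keyed by upper bound ("le")."""

    def __init__(self, bounds):
        self.bounds = sorted(bounds)
        self.counts = [0] * (len(self.bounds) + 1)  # last slot is the +Inf bucket
        self.total = 0.0  # sum of all observed values

    def observe(self, value):
        # bisect_left returns the index of the first bound >= value.
        self.counts[bisect.bisect_left(self.bounds, value)] += 1
        self.total += value

    def cumulative(self):
        """Return (le, count) pairs; each bucket includes all smaller ones."""
        out, running = [], 0
        for bound, count in zip(self.bounds + [float("inf")], self.counts):
            running += count
            out.append((bound, running))
        return out
```

For example, after observing 0.05, 0.1, 0.3, 0.7, and 2.0 with bounds 0.1, 0.5, and 1.0, the cumulative counts are 2, 3, 4, and 5 (the last being the +Inf bucket).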

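PromQL Examples

# CPU usage: cores consumed per container, averaged over 5 minutes
rate(container_cpu_usage_seconds_total[5m])

# Memory usage in bytes (a gauge, so no rate() is needed)
container_memory_usage_bytes

# HTTP request rate in requests per second
rate(http_requests_total[5m])

# Error ratio: share of requests that returned a 5xx status.
# Both sides are summed; otherwise only series with identical labels are divided and each result is 1.
sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m]))

# 95th percentile latency, aggregated across all series by bucket boundary (le)
histogram_quantile(0.95, sum by (le) (rate(http_request_duration_seconds_bucket[5m])))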
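To show what `rate()` and `histogram_quantile()` compute, here is a simplified Python sketch of both. It is an approximation written for study purposes: the real `rate()` also extrapolates to the edges of the time window, and the function name `simple_rate` and the input shapes are assumptions.

```python
def simple_rate(samples):
    """Per-second increase of a counter over (timestamp_seconds, value) samples."""
    if len(samples) < 2:
        return None  # at least two samples are needed to measure an increase
    increase = 0.0
    for (_, prev), (_, cur) in zip(samples, samples[1:]):
        # A drop means the counter reset (e.g. a restart), so count up from zero.
        increase += (cur - prev) if cur >= prev else cur
    elapsed = samples[-1][0] - samples[0][0]
    return increase / elapsed


def histogram_quantile(q, buckets):
    """Estimate the q-quantile from cumulative (le, count) buckets.

    Interpolates linearly inside the bucket that contains the target rank.
    """
    buckets = sorted(buckets)
    total = buckets[-1][1]  # the +Inf bucket holds the total count
    if total == 0:
        return float("nan")
    rank = q * total
    lower_bound, lower_count = 0.0, 0
    for upper_bound, count in buckets:
        if count >= rank and count > lower_count:
            if upper_bound == float("inf"):
                return lower_bound  # no finite upper bound to interpolate to
            fraction = (rank - lower_count) / (count - lower_count)
            return lower_bound + (upper_bound - lower_bound) * fraction
        lower_bound, lower_count = upper_bound, count
    return lower_bound
```

Because the result is interpolated, its accuracy depends on how well the bucket boundaries match the real latency distribution.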

Grafana

What is Grafana?

Grafana is an open-source visualization and analytics platform.

Key Features

Dashboard creation and sharing
Multiple data source support
Alerting capabilities
Annotations
Templating

Common Panels

Time series (formerly Graph): Metrics plotted over time
Stat: Single value display
Gauge: Single value shown against a minimum and maximum range
Table: Tabular data
Heatmap: Distribution over time

Logging

Logging Architecture in Kubernetes

┌─────────────────────────────────────────────────────┐
│                    Kubernetes Node                   │
│  ┌─────────┐  ┌─────────┐  ┌─────────┐             │
│  │  Pod A  │  │  Pod B  │  │  Pod C  │             │
│  │ stdout  │  │ stdout  │  │ stdout  │             │
│  └────┬────┘  └────┬────┘  └────┬────┘             │
│       │            │            │                   │
│       └────────────┼────────────┘                   │
│                    ▼                                │
│           /var/log/containers/                      │
│                    │                                │
│            ┌───────▼───────┐                        │
│            │  Log Shipper  │ (Fluentd/Fluent Bit)  │
│            └───────┬───────┘                        │
└────────────────────┼────────────────────────────────┘
                     │
              ┌──────▼──────┐
              │  Log Store  │ (Elasticsearch/Loki)
              └──────┬──────┘
                     │
              ┌──────▼──────┐
              │  Grafana/   │
              │  Kibana     │
              └─────────────┘
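The log shipper step in the diagram can be sketched as follows: read a line from a file under `/var/log/containers/`, split it into its parts, and attach pod metadata taken from the file name. This sketch assumes the CRI log line format (`<timestamp> <stream> <P|F> <message>`) used by containerd and CRI-O, and the `<pod>_<namespace>_<container>-<id>.log` naming scheme; the function names are hypothetical, and real shippers such as Fluent Bit handle many more cases.

```python
import json
import os


def parse_container_log_path(path):
    """Extract pod metadata from a /var/log/containers/ file name."""
    stem = os.path.basename(path)[: -len(".log")]
    pod, namespace, rest = stem.split("_", 2)
    # Container names may contain dashes, so split on the last one only.
    container, container_id = rest.rsplit("-", 1)
    return {
        "pod": pod,
        "namespace": namespace,
        "container": container,
        "container_id": container_id,
    }


def parse_cri_line(line):
    """Split one CRI-format line; assumes a non-empty message."""
    timestamp, stream, tag, message = line.rstrip("\n").split(" ", 3)
    # Tag "P" marks a partial line that continues in the next record; "F" is a full line.
    return {
        "timestamp": timestamp,
        "stream": stream,
        "partial": tag == "P",
        "message": message,
    }


def enrich(path, line):
    """Merge file metadata and the parsed line into one JSON record for the log store."""
    return json.dumps({**parse_container_log_path(path), **parse_cri_line(line)})
```

The enriched record is what lets a backend such as Elasticsearch or Loki filter logs by namespace, pod, or container.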

Logging Tools

Tool	Description
Fluentd	Open-source data collector (CNCF Graduated)
Fluent Bit	Lightweight log processor and forwarder (part of the Fluentd project)
Elasticsearch	Search and analytics engine
Loki	Log aggregation system by Grafana Labs
Kibana	Visualization UI for Elasticsearch data

Kubernetes Logging Commands

# View pod logs
kubectl logs <pod-name>

# Follow logs
kubectl logs -f <pod-name>

# Logs from specific container
kubectl logs <pod-name> -c <container-name>

# Previous container logs
kubectl logs <pod-name> --previous

# Logs with timestamps
kubectl logs <pod-name> --timestamps

Distributed Tracing

What is Distributed Tracing?

Distributed tracing tracks requests as they flow through multiple services, helping identify:

Performance bottlenecks
Error sources
Service dependencies

Tracing Concepts

Concept	Description
Trace	End-to-end journey of a request
Span	Single operation within a trace
Context	Metadata propagated between services
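Context propagation is what ties spans from different services into one trace. A minimal sketch, assuming the W3C Trace Context `traceparent` header format (`version-traceid-spanid-flags`): the downstream service keeps the trace ID and creates a new span ID. The function names are illustrative; instrumentation libraries normally do this automatically.

```python
import secrets


def new_traceparent():
    """Start a trace: random 16-byte trace ID, 8-byte span ID, sampled flag set."""
    return f"00-{secrets.token_hex(16)}-{secrets.token_hex(8)}-01"


def child_traceparent(parent_header):
    """Continue a trace downstream: same trace ID, fresh span ID, same flags."""
    version, trace_id, _parent_span_id, flags = parent_header.split("-")
    return f"{version}-{trace_id}-{secrets.token_hex(8)}-{flags}"


if __name__ == "__main__":
    incoming = new_traceparent()            # created by the first service
    outgoing = child_traceparent(incoming)  # sent on the next outbound request
    print(incoming)
    print(outgoing)
```

Because every span carries the same trace ID, a tracing backend can reassemble the end-to-end request from spans reported by separate services.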

Trace Example

Trace ID: abc123
├── Span: API Gateway (10ms)
│   └── Span: Auth Service (5ms)
├── Span: Order Service (50ms)
│   ├── Span: Database Query (20ms)
│   └── Span: Payment Service (25ms)
└── Span: Notification Service (15ms)
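The trace above can be modeled as a tree of spans. The sketch below encodes it with a hypothetical `Span` dataclass and finds the span with the largest self time, meaning its duration minus the time spent in its children. It assumes child spans run sequentially, which real traces do not guarantee.

```python
from dataclasses import dataclass, field


@dataclass
class Span:
    name: str
    duration_ms: int
    children: list = field(default_factory=list)

    def self_time_ms(self):
        """Time spent in this span itself, assuming children run one after another."""
        return self.duration_ms - sum(child.duration_ms for child in self.children)


def slowest_self_time(spans):
    """Walk every span in the trace and return the one with the largest self time."""
    stack, best = list(spans), None
    while stack:
        span = stack.pop()
        if best is None or span.self_time_ms() > best.self_time_ms():
            best = span
        stack.extend(span.children)
    return best


# The example trace from the text, with the same names and durations.
TRACE = [
    Span("API Gateway", 10, [Span("Auth Service", 5)]),
    Span("Order Service", 50, [Span("Database Query", 20), Span("Payment Service", 25)]),
    Span("Notification Service", 15),
]
```

Here `slowest_self_time(TRACE)` points at Payment Service: Order Service has the longest total duration, but 45 of its 50 ms are spent waiting on its children.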

Tracing Tools

Tool	Description
Jaeger	Distributed tracing platform (CNCF Graduated)
Zipkin	Distributed tracing system
OpenTelemetry	Observability framework (CNCF)

OpenTelemetry

What is OpenTelemetry?

OpenTelemetry is a collection of tools, APIs, and SDKs for instrumenting, generating, collecting, and exporting telemetry data (metrics, logs, and traces).

Components

API: Defines how to generate telemetry
SDK: Implements the API
Collector: Receives, processes, and exports data
Exporters: Send data to backends

OpenTelemetry Collector

┌─────────────────────────────────────────────────────┐
│              OpenTelemetry Collector                │
│                                                     │
│  ┌──────────┐   ┌──────────┐   ┌──────────────┐   │
│  │Receivers │ → │Processors│ → │  Exporters   │   │
│  │(OTLP,    │   │(batch,   │   │(Jaeger,      │   │
│  │ Jaeger)  │   │ filter)  │   │ Prometheus)  │   │
│  └──────────┘   └──────────┘   └──────────────┘   │
└─────────────────────────────────────────────────────┘
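The receivers, processors, and exporters flow can be illustrated with a small conceptual sketch. The real Collector is a separate binary driven by a configuration file, so everything here (function names, record shape) is an assumption that only mirrors the data flow in the diagram.

```python
def filter_processor(predicate):
    """Processor that drops records for which predicate(record) is false."""
    def process(records):
        return [record for record in records if predicate(record)]
    return process


def batch_processor(size):
    """Processor that groups records into lists of at most `size` items."""
    def process(records):
        return [records[i:i + size] for i in range(0, len(records), size)]
    return process


def run_pipeline(received, processors, exporters):
    """Pass received data through each processor in order, then to every exporter."""
    data = received
    for process in processors:
        data = process(data)
    for export in exporters:
        export(data)
    return data


if __name__ == "__main__":
    received = [{"level": "debug"}, {"level": "error"}, {"level": "info"}]
    processors = [filter_processor(lambda r: r["level"] != "debug"), batch_processor(2)]
    run_pipeline(received, processors, exporters=[print])
```

Processor order matters: filtering before batching keeps unwanted records out of every batch that is exported.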

Cost Management

Observability Costs

Storage: Metrics, logs, and traces consume storage
Compute: Processing and querying data
Network: Data transfer between components

Cost Optimization

Set appropriate retention periods
Use sampling for high-volume traces
Aggregate metrics where possible
Filter unnecessary logs
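Trace sampling is the most mechanical of these techniques, so here is a minimal sketch of ratio-based sampling. It decides from the trace ID alone, so every service reaches the same decision for the same trace. The function name `keep_trace` and the choice to use the last 8 bytes of the ID are assumptions made for illustration.

```python
def keep_trace(trace_id_hex, ratio):
    """Return True if the trace should be kept, for a ratio between 0.0 and 1.0."""
    # Interpret the last 8 bytes of the trace ID as an integer in [0, 2**64).
    bucket = int(trace_id_hex[-16:], 16)
    # Keep the trace if it falls in the lowest `ratio` share of that range.
    return bucket < ratio * 2**64


if __name__ == "__main__":
    trace_ids = [f"{n:032x}" for n in range(0, 2**64, 2**60)]  # 16 evenly spaced IDs
    kept = [t for t in trace_ids if keep_trace(t, 0.25)]
    print(f"kept {len(kept)} of {len(trace_ids)} traces")
```

With randomly generated trace IDs, a ratio of 0.25 keeps roughly a quarter of traces and reduces trace storage by about the same share, at the cost of possibly missing rare errors.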

Key Concepts to Remember

Three pillars: Metrics, Logs, Traces
Prometheus uses pull-based metrics collection
PromQL is the query language for Prometheus
OpenTelemetry unifies observability instrumentation
Jaeger and Zipkin are popular tracing tools

Practice Questions

What are the three pillars of observability?
What is the difference between a Counter and a Gauge in Prometheus?
What is the purpose of distributed tracing?
Name two CNCF graduated observability projects.
What does OpenTelemetry provide?
