Cluster Troubleshooting Scenarios
Real-world troubleshooting scenarios for CKA/CKS exam preparation.
Scenario 1: Node NotReady
Problem
$ kubectl get nodes
NAME      STATUS     ROLES           AGE   VERSION
master    Ready      control-plane   10d   v1.28.0
worker1   NotReady   <none>          10d   v1.28.0
Diagnosis Steps
# 1. Check node conditions (-A10 so the Ready condition at the end of the list is not cut off)
kubectl describe node worker1 | grep -A10 Conditions
# 2. SSH to node and check kubelet
ssh worker1
sudo systemctl status kubelet
sudo journalctl -u kubelet -f
# 3. Check container runtime
sudo systemctl status containerd
sudo crictl ps
Common Causes & Solutions
Network plugin issue
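A network plugin problem usually shows up as missing or crashing CNI pods on the node, or as an empty CNI config directory. The sketch below checks both. The function names are illustrative helpers, and /etc/cni/net.d is assumed to be the CNI config directory (the usual default, not something this guide confirms).

```shell
# Sketch only: these function names are hypothetical helpers, not kubectl features.

# List every pod scheduled on the node; the CNI pods (for example calico-node
# or kube-flannel) should be Running. Run where kubectl has cluster access.
cni_pods_on_node() {
  local node="${1:-worker1}"
  kubectl get pods -A -o wide --field-selector "spec.nodeName=${node}"
}

# Run on the node itself: an empty or missing config directory means the
# kubelet has no network plugin configuration to load.
# Assumption: /etc/cni/net.d is the CNI config directory (common default).
cni_config_present() {
  local dir="${1:-/etc/cni/net.d}"
  if [ -z "$(ls -A "$dir" 2>/dev/null)" ]; then
    echo "no CNI config found in $dir"
    return 1
  fi
  ls -l "$dir"
}
```

For example, run `cni_pods_on_node worker1` from the control plane and `cni_config_present` on worker1. Once the plugin is fixed, `sudo systemctl restart kubelet` on the node and `kubectl get nodes` should show whether the node returns to Ready.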
Scenario 2: Pod Stuck in Pending
Problem
The nginx pod stays in Pending status.
Diagnosis Steps
# 1. Check pod events
kubectl describe pod nginx
# 2. Check node resources
kubectl describe nodes | grep -A5 "Allocated resources"
# 3. Check scheduler
kubectl get pods -n kube-system | grep scheduler
Common Causes & Solutions
Insufficient resources
No matching node (nodeSelector/affinity)
Taints preventing scheduling
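The three causes above can be told apart with a handful of read-only queries. A minimal sketch follows; `pending_checks` is a hypothetical helper name, and the default pod name nginx is taken from the diagnosis steps.

```shell
# Sketch only: pending_checks is a hypothetical helper, not a kubectl command.
pending_checks() {
  local pod="${1:-nginx}"
  # Scheduler messages (for example "Insufficient cpu" or an untolerated taint)
  # are recorded as events on the pod
  kubectl get events --field-selector "involvedObject.name=${pod}" --sort-by='.lastTimestamp'
  # Insufficient resources: what the first container requests
  kubectl get pod "$pod" -o jsonpath='{.spec.containers[0].resources.requests}{"\n"}'
  # No matching node: the pod's nodeSelector versus the labels on each node
  kubectl get pod "$pod" -o jsonpath='{.spec.nodeSelector}{"\n"}'
  kubectl get nodes --show-labels
  # Taints: keys of the taints set on each node
  kubectl get nodes -o custom-columns='NAME:.metadata.name,TAINTS:.spec.taints[*].key'
}
```

If a taint turns out to be the blocker, it can be removed with `kubectl taint nodes <node> <key>:NoSchedule-` (note the trailing dash), or the pod can be given a matching toleration instead.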
Scenario 3: Pod CrashLoopBackOff
Problem
The app pod restarts repeatedly and shows CrashLoopBackOff status.
Diagnosis Steps
# 1. Check pod logs
kubectl logs app
kubectl logs app --previous
# 2. Check pod events
kubectl describe pod app
# 3. Check container exit code
kubectl get pod app -o jsonpath='{.status.containerStatuses[0].lastState.terminated.exitCode}'
Common Causes & Solutions
Application error
Missing ConfigMap/Secret
Liveness probe failing
Resource limits too low
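The exit code from step 3 often points at one of these causes. The sketch below maps common codes to a likely cause; `explain_exit_code` is a hypothetical helper, and the mapping follows the usual shell convention (128 + signal number) rather than anything Kubernetes guarantees.

```shell
# Sketch only: a rough guide from exit code to likely cause.
explain_exit_code() {
  case "$1" in
    0)   echo "clean exit: the main process finished, so the container is restarted" ;;
    1)   echo "general error: usually the application, check the logs" ;;
    126) echo "command not executable: check the image and command" ;;
    127) echo "command not found: check the image and command" ;;
    137) echo "SIGKILL (128+9): commonly OOMKilled, check memory limits" ;;
    143) echo "SIGTERM (128+15): container was asked to stop, check liveness probe events" ;;
    *)   echo "exit code $1: check the logs and pod events" ;;
  esac
}
```

It can be fed directly from the jsonpath query, for example `explain_exit_code "$(kubectl get pod app -o jsonpath='{.status.containerStatuses[0].lastState.terminated.exitCode}')"`. Replacing `exitCode` with `reason` in the same query prints values such as OOMKilled, which confirms a memory limit that is too low.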
Scenario 4: Service Not Accessible
Problem
The my-service Service cannot be reached.
Diagnosis Steps
# 1. Check service exists
kubectl get svc my-service
# 2. Check endpoints
kubectl get endpoints my-service
# 3. Check pod labels match service selector
kubectl get pods --show-labels
kubectl get svc my-service -o yaml | grep -A5 selector
Common Causes & Solutions
No endpoints (selector mismatch)
Pod not ready
NetworkPolicy blocking traffic
Wrong port configuration
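Each of these causes can be checked from the Service's own selector and ports. A minimal sketch, assuming the Service and its pods live in the current namespace; `service_checks` is a hypothetical helper name.

```shell
# Sketch only: service_checks is a hypothetical helper, not a kubectl command.
service_checks() {
  local svc="${1:-my-service}"
  local selector
  # The last column of "-o wide" output is the selector in key=value form
  selector="$(kubectl get svc "$svc" -o wide --no-headers | awk '{print $NF}')"
  if [ -z "$selector" ] || [ "$selector" = "<none>" ]; then
    echo "service $svc has no selector"
    return 1
  fi
  echo "selector: $selector"
  # Selector mismatch / pod not ready: pods must match and show READY n/n
  kubectl get pods -l "$selector" -o wide
  # Wrong port: targetPort must be the port the container listens on
  kubectl get svc "$svc" -o jsonpath='{range .spec.ports[*]}{.port}{" -> "}{.targetPort}{"\n"}{end}'
  # NetworkPolicy: policies in this namespace that may restrict ingress
  kubectl get networkpolicy
}
```

To test reachability from inside the cluster, a throwaway pod works: `kubectl run tmp --image=busybox --rm -it --restart=Never -- wget -qO- -T 2 http://my-service:<port>`.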
Scenario 5: API Server Not Responding
Problem
The API server does not respond to requests.
Diagnosis Steps
# 1. Check API server pod (on control plane)
sudo crictl ps | grep kube-apiserver
# 2. Check API server logs
sudo crictl logs <apiserver-container-id>
# 3. Check manifest
sudo cat /etc/kubernetes/manifests/kube-apiserver.yaml
Common Causes & Solutions
API server not running
Certificate issues
etcd not accessible
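Certificate problems are quick to rule out with openssl. The sketch below reports whether a certificate has expired; `cert_status` is a hypothetical helper, and the default path assumes a kubeadm-style layout under /etc/kubernetes/pki, matching the paths used elsewhere in this guide.

```shell
# Sketch only: cert_status is a hypothetical helper built on openssl.
cert_status() {
  local cert="${1:-/etc/kubernetes/pki/apiserver.crt}"
  if [ ! -r "$cert" ]; then
    echo "cannot read $cert"
    return 2
  fi
  # Print the expiry (notAfter) date
  openssl x509 -in "$cert" -noout -enddate
  # -checkend 0 exits non-zero if the certificate has already expired
  if openssl x509 -in "$cert" -noout -checkend 0 >/dev/null; then
    echo "valid"
  else
    echo "EXPIRED"
    return 1
  fi
}
```

On kubeadm clusters, `sudo kubeadm certs check-expiration` lists all control plane certificates at once. For the other two causes, `sudo crictl ps -a | grep -E 'kube-apiserver|etcd'` also shows exited containers, and `sudo journalctl -u kubelet` shows why the kubelet could not start a static pod from its manifest.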
Scenario 6: etcd Issues
Problem
The etcd cluster is unhealthy or its data is corrupted.
Diagnosis Steps
# 1. Check etcd pod
kubectl get pods -n kube-system | grep etcd
# 2. Check etcd health
ETCDCTL_API=3 etcdctl endpoint health \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key
# 3. Check etcd logs
sudo crictl logs <etcd-container-id>
Solutions
Restore from backup
# Stop API server
sudo mv /etc/kubernetes/manifests/kube-apiserver.yaml /tmp/
# Restore etcd
ETCDCTL_API=3 etcdctl snapshot restore /backup/etcd-snapshot.db \
  --data-dir=/var/lib/etcd-restored
# Update etcd manifest to use new data-dir
sudo sed -i 's|/var/lib/etcd|/var/lib/etcd-restored|g' \
  /etc/kubernetes/manifests/etcd.yaml
# Restore API server
sudo mv /tmp/kube-apiserver.yaml /etc/kubernetes/manifests/
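The restore above needs a snapshot file to exist beforehand. A minimal sketch of taking and verifying one, reusing the endpoint and certificate paths from the health check; `etcd_backup` is a hypothetical helper name, and the default output path simply mirrors the one used in the restore.

```shell
# Sketch only: etcd_backup is a hypothetical helper wrapping etcdctl.
etcd_backup() {
  local out="${1:-/backup/etcd-snapshot.db}"
  # Save a snapshot of the running etcd member
  ETCDCTL_API=3 etcdctl snapshot save "$out" \
    --endpoints=https://127.0.0.1:2379 \
    --cacert=/etc/kubernetes/pki/etcd/ca.crt \
    --cert=/etc/kubernetes/pki/etcd/server.crt \
    --key=/etc/kubernetes/pki/etcd/server.key || return 1
  # Print hash, revision, key count, and size of the saved file.
  # Note: from etcd 3.5 this subcommand is also provided by etcdutl.
  ETCDCTL_API=3 etcdctl snapshot status "$out" --write-out=table
}
```

After a restore, `sudo crictl ps | grep etcd` and `kubectl get nodes` can be used to confirm that etcd and the API server came back.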
Quick Troubleshooting Commands
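# Overall cluster health
# (componentstatuses is deprecated since v1.19; /readyz reports API server health)
kubectl get componentstatuses
kubectl get --raw='/readyz?verbose'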
kubectl get nodes
kubectl get pods -A
# Events sorted by time
kubectl get events --sort-by='.lastTimestamp' -A
# Resource usage
kubectl top nodes
kubectl top pods -A
# Logs
kubectl logs <pod> -f
kubectl logs <pod> --previous
kubectl logs <pod> -c <container>
# Debug pod
kubectl run debug --image=busybox --rm -it --restart=Never -- sh
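# Check DNS (busybox:1.28 is pinned because nslookup in newer busybox images is known to be unreliable)
kubectl run dns-test --image=busybox:1.28 --rm -it --restart=Never -- nslookup kubernetes.default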