Troubleshooting (30%)
This domain covers troubleshooting Kubernetes clusters, nodes, applications, and networking. At 30%, it is the most heavily weighted domain in the CKA exam.
Cluster Troubleshooting
Check Cluster Health
# Cluster info
kubectl cluster-info
kubectl cluster-info dump
# Component status (deprecated since v1.19, still handy as a quick check)
kubectl get componentstatuses
# Check nodes
kubectl get nodes
kubectl describe node <node-name>
# Check system pods
kubectl get pods -n kube-system
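On a cluster with many nodes, it helps to filter the node list down to the ones that need attention. The following is a minimal sketch, not a standard tool: the function name `not_ready_nodes` is hypothetical, and it assumes the default column layout of `kubectl get nodes` (NAME first, STATUS second).

```shell
# Print the names of nodes whose STATUS column is not exactly "Ready".
# Reads `kubectl get nodes --no-headers` output on stdin.
# Note: a cordoned node shows "Ready,SchedulingDisabled" and is listed too.
not_ready_nodes() {
  awk '$2 != "Ready" { print $1 }'
}

# Usage (requires cluster access):
#   kubectl get nodes --no-headers | not_ready_nodes
```

Each name it prints is a candidate for kubectl describe node <node-name>.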
Control Plane Components
# Check control plane pods (if using kubeadm)
kubectl get pods -n kube-system
# Check static pod manifests
ls /etc/kubernetes/manifests/
cat /etc/kubernetes/manifests/kube-apiserver.yaml
# Check component logs
kubectl logs -n kube-system kube-apiserver-<node>
kubectl logs -n kube-system kube-controller-manager-<node>
kubectl logs -n kube-system kube-scheduler-<node>
kubectl logs -n kube-system etcd-<node>
# If running as systemd services
sudo journalctl -u kubelet
sudo journalctl -u kube-apiserver
kubelet Troubleshooting
# Check kubelet status
sudo systemctl status kubelet
sudo systemctl restart kubelet
# Check kubelet logs
sudo journalctl -u kubelet -f
sudo journalctl -u kubelet --since "10 minutes ago"
# Check kubelet config
cat /var/lib/kubelet/config.yaml
# Check the kubeconfig the kubelet uses to reach the API server
cat /etc/kubernetes/kubelet.conf
etcd Troubleshooting
# Check etcd health
ETCDCTL_API=3 etcdctl endpoint health \
--endpoints=https://127.0.0.1:2379 \
--cacert=/etc/kubernetes/pki/etcd/ca.crt \
--cert=/etc/kubernetes/pki/etcd/server.crt \
--key=/etc/kubernetes/pki/etcd/server.key
# Check etcd members
ETCDCTL_API=3 etcdctl member list \
--endpoints=https://127.0.0.1:2379 \
--cacert=/etc/kubernetes/pki/etcd/ca.crt \
--cert=/etc/kubernetes/pki/etcd/server.crt \
--key=/etc/kubernetes/pki/etcd/server.key
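Both etcdctl commands above repeat the same endpoint and certificate flags. A small wrapper saves typing under exam time pressure. This is a sketch: the name `etcdctl_local` is hypothetical, and the paths are the kubeadm defaults shown above, which may differ on other installations.

```shell
# Hypothetical helper: wraps etcdctl with the endpoint and certificate
# paths used above, so only the subcommand has to be typed.
etcdctl_local() {
  ETCDCTL_API=3 etcdctl \
    --endpoints=https://127.0.0.1:2379 \
    --cacert=/etc/kubernetes/pki/etcd/ca.crt \
    --cert=/etc/kubernetes/pki/etcd/server.crt \
    --key=/etc/kubernetes/pki/etcd/server.key \
    "$@"
}

# Usage (on a control plane node):
#   etcdctl_local endpoint health
#   etcdctl_local member list
```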
Node Troubleshooting
Node Status
# Get node details
kubectl get nodes -o wide
kubectl describe node <node-name>
# Check node conditions
kubectl get nodes -o jsonpath='{.items[*].status.conditions}'
Node Conditions
| Condition | Description |
| Ready | Node is healthy and ready |
| MemoryPressure | Node memory is low |
| DiskPressure | Node disk space is low |
| PIDPressure | Too many processes |
| NetworkUnavailable | Network not configured |
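A healthy node reports Ready=True and every other condition as False. The sketch below prints only the conditions that deviate from that. The function name `unhealthy_conditions` is hypothetical, and the input format (node name followed by space-separated Type=Status pairs) is an assumption produced by the jsonpath expression in the usage comment.

```shell
# Flag unhealthy node conditions. Input lines look like:
#   <node> MemoryPressure=False DiskPressure=False Ready=True
# Ready should be True; every other condition should be False.
# Anything else (including Unknown) is printed as "<node>: <Type>=<Status>".
unhealthy_conditions() {
  awk '{
    for (i = 2; i <= NF; i++) {
      split($i, kv, "=")
      if ((kv[1] == "Ready" && kv[2] != "True") ||
          (kv[1] != "Ready" && kv[2] != "False"))
        print $1 ": " $i
    }
  }'
}

# Usage (requires cluster access):
#   kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{range .status.conditions[*]}{" "}{.type}={.status}{end}{"\n"}{end}' | unhealthy_conditions
```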
Node Maintenance
# Cordon node (prevent scheduling)
kubectl cordon <node-name>
# Drain node (evict pods)
kubectl drain <node-name> --ignore-daemonsets --delete-emptydir-data
# Uncordon node
kubectl uncordon <node-name>
Application Troubleshooting
Pod Debugging
# Get pod status
kubectl get pods
kubectl get pods -o wide
kubectl get pods --all-namespaces
# Describe pod (events, conditions)
kubectl describe pod <pod-name>
# Get pod YAML
kubectl get pod <pod-name> -o yaml
# Check pod logs
kubectl logs <pod-name>
kubectl logs <pod-name> -c <container-name>
kubectl logs <pod-name> --previous
kubectl logs <pod-name> -f
kubectl logs <pod-name> --tail=100
# Execute command in pod
kubectl exec -it <pod-name> -- /bin/sh
kubectl exec <pod-name> -- cat /etc/config/app.conf
Common Pod Issues
| Status | Cause | Solution |
| Pending | No node available, resource constraints | Check events, node resources |
| ImagePullBackOff | Image not found, auth issues | Check image name, pull secrets |
| CrashLoopBackOff | Container crashes repeatedly | Check logs, probe config |
| CreateContainerConfigError | ConfigMap/Secret missing | Check references |
| OOMKilled | Out of memory | Increase memory limits |
| Evicted | Node resource pressure | Check node conditions |
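To find pods in any of the states above across the whole cluster, the pod list can be filtered by its STATUS column. A minimal sketch: the function name `problem_pods` is hypothetical, and it assumes the column order of `kubectl get pods --all-namespaces` (NAMESPACE, NAME, READY, STATUS).

```shell
# Print "<namespace>/<pod>: <status>" for every pod that is neither
# Running nor Completed. Reads `kubectl get pods --all-namespaces
# --no-headers` output on stdin.
# Limitation: a Running pod whose containers are not ready (e.g. 0/1)
# is not flagged, because only the STATUS column is inspected.
problem_pods() {
  awk '$4 != "Running" && $4 != "Completed" { print $1 "/" $2 ": " $4 }'
}

# Usage (requires cluster access):
#   kubectl get pods --all-namespaces --no-headers | problem_pods
```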
Debug with Ephemeral Containers
# Add debug container to running pod
kubectl debug <pod-name> -it --image=busybox --target=<container-name>
# Debug node
kubectl debug node/<node-name> -it --image=ubuntu
Pod Resource Issues
# Check resource usage
kubectl top pods
kubectl top pods --containers
kubectl top nodes
# Check resource requests/limits
kubectl describe pod <pod-name> | grep -A 5 "Requests\|Limits"
Service Troubleshooting
Service Debugging
# Check service
kubectl get svc
kubectl describe svc <service-name>
# Check endpoints
kubectl get endpoints <service-name>
# Test service from within cluster
kubectl run test --image=busybox:1.36 --rm -it -- wget -qO- http://<service-name>
# Check service DNS
kubectl run test --image=busybox:1.36 --rm -it -- nslookup <service-name>
Common Service Issues
| Issue | Cause | Solution |
| No endpoints | Selector mismatch | Check pod labels match service selector |
| Connection refused | Wrong port | Check targetPort matches container port |
| DNS not resolving | CoreDNS issues | Check CoreDNS pods |
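A selector mismatch is easiest to confirm by asking which pods the service's selector actually matches. The sketch below reads the selector from the service and reuses it as a label query. The function name `svc_pods` is hypothetical. If it lists no pods, the labels and the selector disagree.

```shell
# Hypothetical helper: list the pods that a Service's selector matches.
# Usage: svc_pods <service-name> [namespace]
svc_pods() {
  ns="${2:-default}"
  # Render the selector map as "k1=v1,k2=v2," via kubectl's go-template output.
  sel=$(kubectl get svc "$1" -n "$ns" \
    -o go-template='{{range $k, $v := .spec.selector}}{{$k}}={{$v}},{{end}}') || return 1
  sel="${sel%,}"   # drop the trailing comma
  if [ -z "$sel" ]; then
    echo "service $1 has no selector" >&2
    return 1
  fi
  kubectl get pods -n "$ns" -l "$sel" -o wide
}

# Usage (requires cluster access):
#   svc_pods my-service my-namespace
```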
Networking Troubleshooting
DNS Debugging
# Check CoreDNS
kubectl get pods -n kube-system -l k8s-app=kube-dns
kubectl logs -n kube-system -l k8s-app=kube-dns
# Test DNS resolution
kubectl run test --image=busybox:1.36 --rm -it -- nslookup kubernetes
kubectl run test --image=busybox:1.36 --rm -it -- nslookup <service>.<namespace>.svc.cluster.local
# Check resolv.conf in pod
kubectl exec <pod-name> -- cat /etc/resolv.conf
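When lookups fail inside a pod, one quick check is whether the pod's resolv.conf points at the cluster DNS service. The sketch below compares the two. The function name `check_pod_dns` is hypothetical, it looks for the pod in the current namespace only, and it assumes the DNS service is named kube-dns in kube-system (matching the k8s-app=kube-dns label used above). A mismatch is not always an error: pods with a non-default dnsPolicy, or clusters running a node-local DNS cache, legitimately use a different nameserver.

```shell
# Hypothetical helper: compare the first nameserver in a pod's resolv.conf
# with the ClusterIP of the kube-dns Service.
# Usage: check_pod_dns <pod-name>
check_pod_dns() {
  want=$(kubectl get svc kube-dns -n kube-system \
    -o jsonpath='{.spec.clusterIP}') || return 1
  have=$(kubectl exec "$1" -- cat /etc/resolv.conf |
    awk '$1 == "nameserver" { print $2; exit }')
  if [ "$want" = "$have" ]; then
    echo "ok: pod uses $have"
  else
    echo "mismatch: pod uses ${have:-nothing}, kube-dns is $want"
    return 1
  fi
}
```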
Network Policy Debugging
# List network policies
kubectl get networkpolicies
kubectl describe networkpolicy <policy-name>
# Test connectivity
kubectl exec <pod-name> -- nc -zv <target-ip> <port>
kubectl exec <pod-name> -- wget -qO- --timeout=2 http://<service>
CNI Troubleshooting
# Check CNI config
ls /etc/cni/net.d/
cat /etc/cni/net.d/*.conf*   # many plugins ship a .conflist file rather than .conf
# Check CNI binaries
ls /opt/cni/bin/
# Check pod networking
kubectl exec <pod-name> -- ip addr
kubectl exec <pod-name> -- ip route
Certificate Troubleshooting
# Check certificate expiration
kubeadm certs check-expiration
# View certificate details
openssl x509 -in /etc/kubernetes/pki/apiserver.crt -text -noout
# Check certificate dates
openssl x509 -in /etc/kubernetes/pki/apiserver.crt -noout -dates
# Renew certificates (restart the control plane pods afterwards so they load the new certificates)
kubeadm certs renew all
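For certificates that kubeadm does not manage, or on a node without kubeadm, openssl can answer "does this expire soon?" directly through its -checkend option. A minimal sketch; the function name `cert_expires_within` is hypothetical.

```shell
# Hypothetical helper: succeed (exit 0) if the certificate expires within N days.
# Usage: cert_expires_within <cert-file> <days>
cert_expires_within() {
  # -checkend exits 0 when the certificate is still valid after the given
  # number of seconds, and non-zero when it will have expired by then.
  # Caveat: an unreadable file also makes openssl fail, so it is reported
  # as "expiring" too.
  if openssl x509 -in "$1" -noout -checkend "$(( $2 * 86400 ))" >/dev/null; then
    return 1
  fi
  return 0
}

# Usage:
#   for c in /etc/kubernetes/pki/*.crt; do
#     cert_expires_within "$c" 30 && echo "renew soon: $c"
#   done
```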
Logging
Container Logs
# View logs
kubectl logs <pod-name>
kubectl logs <pod-name> -c <container>
kubectl logs <pod-name> --all-containers
# Follow logs
kubectl logs -f <pod-name>
# Previous container logs
kubectl logs <pod-name> --previous
# Logs since time
kubectl logs <pod-name> --since=1h
kubectl logs <pod-name> --since-time=2024-01-01T00:00:00Z
# Logs with timestamps
kubectl logs <pod-name> --timestamps
System Logs
# kubelet logs
sudo journalctl -u kubelet
# Container runtime logs (use the unit that matches the node's runtime)
sudo journalctl -u containerd
sudo journalctl -u docker
# System messages
sudo tail -f /var/log/syslog     # Debian/Ubuntu
sudo tail -f /var/log/messages   # RHEL/CentOS
Events
# Get events
kubectl get events
kubectl get events --sort-by='.lastTimestamp'
kubectl get events -n <namespace>
# Watch events
kubectl get events -w
# Filter events
kubectl get events --field-selector type=Warning
kubectl get events --field-selector involvedObject.name=<pod-name>
Troubleshooting Checklist
Pod Not Starting
- Check pod status: kubectl get pod <pod>
- Check events: kubectl describe pod <pod>
- Check logs: kubectl logs <pod>
- Check node resources: kubectl describe node <node>
- Check image: kubectl get pod <pod> -o yaml | grep image
Service Not Working
- Check service: kubectl get svc <service>
- Check endpoints: kubectl get endpoints <service>
- Check pod labels match selector
- Test from within cluster
- Check network policies
Node Not Ready
- Check node status: kubectl describe node <node>
- Check kubelet: systemctl status kubelet
- Check kubelet logs: journalctl -u kubelet
- Check container runtime
- Check disk/memory pressure
Key Concepts to Remember
- kubectl describe - First step for troubleshooting
- kubectl logs - Check container output
- kubectl exec - Debug inside container
- Events - Show what happened
- journalctl - System service logs
Practice Questions
- A pod is in CrashLoopBackOff status. How do you troubleshoot?
- How do you check why a node is NotReady?
- A service has no endpoints. What could be wrong?
- How do you view kubelet logs?
- How do you test DNS resolution from within a pod?