What I check before blaming Kubernetes

Most on-call pages are not Kubernetes being mysterious. More often the cause is a config change, a secret that rotated, a node under disk pressure, or a service talking to the wrong dependency. My triage order is logs first, rollout history second, then events and resource limits. If I jump straight to restarting pods, I usually just hide the first useful clue. The best runbooks in our team are…