What I check before blaming Kubernetes

Most on-call pages are not Kubernetes being mysterious. A lot of the time it is a config change, a secret that rotated, a node under disk pressure, or a service talking to the wrong dependency. My quick order is logs first, rollout history second, then events and resource limits. If I jump straight into restarting pods, I usually just hide the first useful clue. The best runbooks in our team are …
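
To make that order concrete, here is a rough sketch that builds the read-only kubectl commands in the sequence above: logs, rollout history, events, then a describe to see resource limits. It only prints the commands and runs nothing. The function name, the namespace "payments", and the deployment "checkout-api" are made up for illustration, and the exact flags (like the 200-line tail) are just one reasonable choice, not our team's runbook.

```python
import shlex


def triage_commands(namespace, deployment):
    """Return read-only kubectl commands in the check order from the post.

    The name and arguments are hypothetical; nothing here restarts or
    modifies anything in the cluster.
    """
    target = f"deployment/{deployment}"
    ns = ["-n", namespace]
    return [
        # 1. Logs first: recent output from a pod behind the deployment.
        ["kubectl", "logs", target, "--tail=200", *ns],
        # 2. Rollout history: did a deploy or config change land recently?
        ["kubectl", "rollout", "history", target, *ns],
        # 3. Events: evictions, failed pulls, disk pressure, and similar.
        ["kubectl", "get", "events", "--sort-by=.lastTimestamp", *ns],
        # 4. Resource limits: describe shows requests and limits per container.
        ["kubectl", "describe", target, *ns],
    ]


if __name__ == "__main__":
    # Print the commands so they can be reviewed before anyone runs them.
    for argv in triage_commands("payments", "checkout-api"):
        print(shlex.join(argv))
```

Printing instead of executing is deliberate: during a page I would rather read the commands and paste them one at a time than have a script hide a step.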
