Late-night alerts: don't just stare at CPU

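After enough time in ops support, the alerts I dread most are the ones that say only "service abnormal". If you get woken up at 2 a.m. and all you have is a red alert, with no instance, version, recent deploy, error code, or impact scope, troubleshooting is basically groping in the dark.

When I add monitoring to a service now, I start by breaking the request path down: ingress traffic, error rate, P95 latency, database connection pool, disk I/O, queue backlog, DNS resolution, downstream dependencies. High CPU is not necessarily the root cause, and normal CPU does not mean the service is fine. On K8s, also check restart counts, probe failures, node pressure, and whether there was a recent rolling deploy.

Don't let the on-call runbook turn into shelfware either. Every alert should come with two or three check commands you can actually run, ideally with a rollback entry point. What a new on-call engineer needs most isn't pages of theory. It's to size up the impact first, then decide whether to scale out, roll back, shift traffic, or call in the owning team.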
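
One way to keep bare "service abnormal" alerts from reaching whoever is on call is to lint alert definitions before they ship. Below is a minimal sketch in Python, not a real alerting API: the `Alert` class, its field names, and `missing_context` are all hypothetical names made up for illustration. It also treats the rollback entry point as mandatory, which is stricter than the "ideally" above.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Alert:
    """Hypothetical alert payload; field names are assumptions, not a real schema."""
    summary: str
    instance: Optional[str] = None
    version: Optional[str] = None
    last_deploy: Optional[str] = None
    error_code: Optional[str] = None
    impact: Optional[str] = None
    runbook_steps: List[str] = field(default_factory=list)
    rollback_link: Optional[str] = None


# Context the on-call person needs before they can do anything useful.
REQUIRED_CONTEXT = ("instance", "version", "last_deploy", "error_code", "impact")


def missing_context(alert: Alert) -> List[str]:
    """Return the names of the pieces of context this alert lacks."""
    missing = [name for name in REQUIRED_CONTEXT if not getattr(alert, name)]
    # At least two runnable check commands per alert.
    if len(alert.runbook_steps) < 2:
        missing.append("runbook_steps")
    # Assumption: rollback entry is required here, stricter than "ideally".
    if not alert.rollback_link:
        missing.append("rollback_link")
    return missing


if __name__ == "__main__":
    bare = Alert(summary="service abnormal")
    print(missing_context(bare))
```

Running a check like this in CI over alert definitions would flag the "red alert with nothing attached" case long before anyone gets paged for it.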
