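Operations and Support Q&A, Pay and Benefits, and Experience · 智问盟

Public Q&A, pay and benefits, career development, and experience sharing for the operations and support industry.

智问盟 · Operations and Support: latest public discussions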

  1. Production DNS cutover checklist for small teams

    tech-ops-support

    DNS cutovers always sound simpler in the planning meeting than they feel during the actual window. Change the record, wait for propagation, watch traffic move. In practice, one cached resolver, one forgotten subdomain, …

    2026-06-05T13:28:56.616Z

  2. How to reduce alert fatigue without missing real incidents

    tech-ops-support

    Alert fatigue does not start when there are too many alerts. It starts when the team stops trusting them. I have seen a pretty dashboard still fail the on-call person because every warning had the same urgency. Disk at …

    2026-06-04T17:51:11.596Z

  3. How to triage a production incident quickly while on IT ops on-call

    tech-ops-support

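    The worst part of on-call is being chased by the group chat from the first minute while CPU, disk, network, and application logs all show a bit of red. My habit is to look at the blast radius first, then at recent changes, and not to rush into restarting services. Many incidents are actually dragged out by small things like certificates, DNS, or config releases. When you troubleshoot, what do you look at first: monitoring, logs, or the release history?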

    2026-06-04T13:56:59.540Z

  4. How we handled a database migration without downtime

    tech-ops-support

    The migration looked small on paper: add a few columns, backfill old records, then switch the application to read the new shape. The risk was that the table was hot all day and the old worker code would still be running…

    2026-06-04T21:47:29.712Z

  5. How to run a canary release for a production service so it is easy to roll back

    tech-ops-support

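    I once took part in a release where the canary only checked whether the API was reachable, not whether data from the old and new versions was compatible. The first 10% of traffic showed no major problems. Only after going to full traffic did we find that the new version could not read fields written by the old version, and rolling back did not help because the data had already changed. Since then I confirm three things before a canary: whether the config can be toggled independently, whether the database and cache are compatible in both directions, and whether the old version can still handle data left behind by the new version after a rollback. Canary metrics also cannot stop at CPU and error rate. Watch the key business actions, such as whether the login, order placement, payment callback, and message sending paths are healthy. Before a release I first use production-like data in the staging environment to run a…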

    2026-06-05T20:53:23.943Z
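
    The three pre-canary checks and the business-action metrics described in this post could be sketched as a small release gate. This is only an illustration, not the poster's tooling: the `CanaryReport` fields, the `blockers` function, and the 1% `max_drop` threshold are all hypothetical names and values chosen for the example.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class CanaryReport:
    # The three checks from the post (field names are hypothetical).
    config_toggle_independent: bool
    data_backward_compatible: bool
    old_version_reads_new_data: bool
    # Success rate (0.0 to 1.0) per key business action, e.g. "login".
    baseline: Dict[str, float] = field(default_factory=dict)
    canary: Dict[str, float] = field(default_factory=dict)


def blockers(report: CanaryReport, max_drop: float = 0.01) -> List[str]:
    """Return the reasons not to widen the canary; empty means no blocker found."""
    problems = []
    if not report.config_toggle_independent:
        problems.append("config cannot be toggled independently")
    if not report.data_backward_compatible:
        problems.append("database or cache is not compatible across versions")
    if not report.old_version_reads_new_data:
        problems.append("old version cannot handle data written by new version")
    # Compare business actions, not just CPU and error rate.
    for action, base in report.baseline.items():
        rate = report.canary.get(action)
        if rate is None:
            problems.append(f"no canary data for {action}")
        elif base - rate > max_drop:
            problems.append(
                f"{action} success rate dropped from {base:.2f} to {rate:.2f}"
            )
    return problems
```

    An empty list only means none of these particular checks failed. It does not prove that a rollback is safe.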

  6. How to investigate a full disk on a Linux server, and why not to start by deleting logs

    tech-ops-support

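    When a server's disk fills up, the worst move is to start deleting logs on impulse. The space comes back, but when you later need to investigate the incident, the evidence is gone too. I usually confirm which partition is full first, and do not act off a single df -h. For the order of investigation I start with the big directories: /var/log, Docker overlay, database backups, temporary upload directories, and package manager caches. When using du, limit the depth where you can, otherwise the scan takes ages on a production machine and also hurts IO. In Docker setups, also check whether container logs are missing rotation. For many services the business logs are not large, but before it gets collected, stdout has already made the host…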

    2026-06-05T03:53:25.429Z
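
    The first two steps in this post (find the full partition, then find the large directories) could be sketched in Python roughly as follows. The function names, the 90% threshold, and the `usage_fn` parameter are assumptions made for this example. Note that `top_children` reports only one level but still walks everything underneath, much as `du --max-depth=1` does, so it should be pointed at a specific suspect directory rather than at `/`.

```python
import os
import shutil


def full_partitions(mounts, threshold=0.9, usage_fn=shutil.disk_usage):
    """Return (mount, used_ratio) for mounts at or above the threshold, fullest first."""
    result = []
    for mount in mounts:
        usage = usage_fn(mount)  # has .total, .used, .free in bytes
        ratio = usage.used / usage.total if usage.total else 0.0
        if ratio >= threshold:
            result.append((mount, round(ratio, 3)))
    return sorted(result, key=lambda item: item[1], reverse=True)


def dir_size(path):
    """Sum apparent file sizes under path without following symlinks."""
    total = 0
    for root, _dirs, files in os.walk(path, onerror=lambda err: None):
        for name in files:
            try:
                total += os.lstat(os.path.join(root, name)).st_size
            except OSError:
                pass  # file vanished or is unreadable; skip it
    return total


def top_children(path, n=5):
    """Return the n largest immediate children of path as (name, bytes)."""
    sizes = []
    with os.scandir(path) as entries:
        for entry in entries:
            if entry.is_dir(follow_symlinks=False):
                sizes.append((entry.name, dir_size(entry.path)))
            elif entry.is_file(follow_symlinks=False):
                sizes.append((entry.name, entry.stat(follow_symlinks=False).st_size))
    return sorted(sizes, key=lambda item: item[1], reverse=True)[:n]
```

    Both functions are read-only, which fits the advice to look before deleting anything.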

  7. For alerts in the small hours, do not stare only at CPU

    tech-ops-support

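    After years in ops support, what I dread most is an alert that says only "service abnormal". If you are woken at 2 a.m. with a single red alert and no instance, version, recent release, error code, or impact scope, troubleshooting is basically fumbling in the dark. When I later added monitoring to my own services, I first broke the request path down clearly: ingress traffic, error rate, P95 latency, database connection pool, disk IO, queue backlog, DNS resolution, and downstream dependencies. High CPU is not necessarily the root cause, and normal CPU does not mean the service is fine. In K8s you also need to check restart counts, probe failures, node pressure, and whether there was a recent rolling update. And do not write the on-call runbook as a document that just sits…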

    2026-06-04T01:06:26.362Z
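
    The context this post wants on every page (instance, version, recent release, error code, impact scope) could be enforced with a small check before an alert rule goes live. This is a minimal sketch: the field names in `REQUIRED_CONTEXT` and the dict-shaped alert payload are hypothetical, not the schema of any particular alerting system.

```python
# Hypothetical field names for the context the post asks for.
REQUIRED_CONTEXT = ("instance", "version", "last_release", "error_code", "impact")


def missing_context(alert: dict) -> list:
    """Return the required context fields that are absent or empty in an alert."""
    return [key for key in REQUIRED_CONTEXT if not alert.get(key)]


bare_alert = {"summary": "service abnormal"}
print(missing_context(bare_alert))
```

    A bare "service abnormal" alert fails every check, which is the situation the post describes.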

  8. What I check before blaming Kubernetes

    tech-ops-support

    Most on-call pages are not Kubernetes being mysterious. A lot of the time it is a config change, a secret that rotated, a node under disk pressure, or a service talking to the wrong dependency. My quick order is logs fi…

    2026-06-03T15:57:01.191Z