How to design API timeout retries without dragging the system down

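I once worked on an order service where the thing that actually took the system down was not the first request but the wave of retries after timeouts. The client resent whenever a request did not return, the gateway was retrying too, and as soon as the downstream payment API slowed down a little, the same business operation was hit several times within a few seconds.

Since then, before adding any retry, I first sort out which requests can be retried and which can only be followed up with a status query. Read endpoints can get a short retry. Write endpoints must have an idempotency key and a business state machine; you cannot rely on button debouncing in the frontend as the safety net. Timeouts should not be the same at every layer either. The outermost layer needs to be a bit longer than the layers below it, otherwise the upstream caller has just given up while the downstream is still processing.

These days I attach a trace id and a client_request_id to every request, and log the retry count, the original request, and the final result. More retries does not mean more stable: backoff, rate limiting, circuit breaking, and an order-status query endpoint all have to be in place, or the system amplifies its own traffic during an incident. One more detail is the retry budget: once failures on an endpoint reach a certain ratio, I would rather send users to check the status than keep saturating the downstream.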
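To make the pieces above concrete, here is a minimal Python sketch of how they might fit together. It is an illustration under assumptions, not the code from the order service: `RetryBudget`, `backoff_delay`, `call_with_retry`, the `send` callback, and every threshold are hypothetical names and values. It shows one client_request_id reused across attempts (so the server can deduplicate writes), exponential backoff with jitter, and a retry budget that stops retrying once the recent failure ratio gets too high.

```python
import random
import time
import uuid
from collections import deque


class RetryBudget:
    """Sliding window of recent outcomes; blocks retries past a failure ratio."""

    def __init__(self, window=100, max_failure_ratio=0.5, min_samples=10):
        self.outcomes = deque(maxlen=window)  # True = success, False = failure
        self.max_failure_ratio = max_failure_ratio
        self.min_samples = min_samples

    def record(self, ok):
        self.outcomes.append(ok)

    def allows_retry(self):
        if len(self.outcomes) < self.min_samples:
            return True  # too little data to judge yet
        failures = self.outcomes.count(False)
        return failures / len(self.outcomes) < self.max_failure_ratio


def backoff_delay(attempt, base=0.1, cap=2.0, rng=random.random):
    """Jittered exponential backoff: rng() * min(cap, base * 2**attempt) seconds."""
    return rng() * min(cap, base * (2 ** attempt))


def call_with_retry(send, *, is_write, budget, max_attempts=3,
                    sleep=time.sleep, client_request_id=None):
    """Call send(client_request_id, attempt), retrying only on TimeoutError.

    The same client_request_id is sent on every attempt so the server can
    treat it as an idempotency key. When attempts or the budget run out,
    a write returns "check_status" (the caller should query the business
    state instead of resending) and a read returns "failed".
    """
    request_id = client_request_id or str(uuid.uuid4())
    attempts = 0
    for attempt in range(max_attempts):
        attempts = attempt + 1
        try:
            result = send(request_id, attempt)
        except TimeoutError:
            budget.record(False)
            is_last = attempt == max_attempts - 1
            if is_last or not budget.allows_retry():
                break
            sleep(backoff_delay(attempt))
        else:
            budget.record(True)
            return {"status": "ok", "result": result,
                    "attempts": attempts, "client_request_id": request_id}
    return {"status": "check_status" if is_write else "failed",
            "attempts": attempts, "client_request_id": request_id}
```

In a real service the budget would be shared per endpoint across all callers rather than created per call, and the returned dict is the natural place to hang the log line with the retry count and final result. The `sleep` and `rng` parameters are there only so the behavior can be tested without waiting.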
