How do you design API timeout retries without crashing the system?
I once worked on an order service that was brought down not by the original requests but by the burst of retries that followed timeouts. Clients resent a request whenever they got no response, the gateway retried as well, and when the downstream payment API was slow, the same transaction could be hit several times within seconds. Later, when I implemented retries, I first…