How to reduce alert fatigue without missing real incidents
Alert fatigue does not start when there are too many alerts. It starts when the team stops trusting them. I have seen a pretty dashboard still fail the on-call person because every warning carried the same urgency. Disk at 82%, one pod restart, a delayed batch job, and real customer errors all landed in the same channel. After a while, people mute the channel, and the only thing left is luck. The useful …