Alerting That Doesn't Cause Alert Fatigue
Last year I consulted for a startup where the on-call engineer was getting 47 alerts per day. Forty-seven. They'd started ignoring most of them, which meant when a real outage happened, nobody noticed for 40 minutes because the alert was buried in noise.
This is alert fatigue, and it's an engineering leadership failure, not an individual one. If your team is drowning in alerts, the system is broken, not the people.
The Only Rule That Matters
Every alert should require human action. If an alert fires and the correct response is "wait and see if it resolves itself," that shouldn't be an alert. Make it a log entry, a metric, a dashboard widget, anything except something that pages a human.
Before creating any alert, ask: "If this fires at 3 AM, what will the on-call person do?" If you can't articulate a specific action, don't create the alert.
Severity Levels That Make Sense
Most teams have too many severity levels. Here's what actually works:
Page (Wake Someone Up)
Users are affected right now. Revenue is being lost. Data might be corrupted. This is the fire alarm. It should go off rarely, and when it does, everyone drops what they're doing.
Examples: Site is down. Payment processing is failing. Database is unreachable.
Urgent (Handle Today)
Something is wrong but users might not notice yet. Left unchecked, this will become a Page. These go to Slack or email, not PagerDuty.
Examples: Disk is 80% full. SSL certificate expires in 7 days. Error rate is elevated but still below the Page threshold.
Warning (Handle This Week)
Something looks off but isn't urgent. These feed into your weekly review, not your immediate attention.
Examples: Memory usage trending upward. Slow queries appearing. Deprecated API calls detected.
That's it. Three levels. If you have five or seven severity levels, I guarantee nobody on your team knows the difference between level 3 and level 4.
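One way to keep three levels honest is to make routing mechanical: each severity maps to exactly one destination, so there is never a debate about where an alert goes. A minimal sketch (the channel names and `route` helper are hypothetical, not from any particular tool):

```python
from enum import Enum

class Severity(Enum):
    PAGE = "page"        # wake someone up: users are affected right now
    URGENT = "urgent"    # handle today: will become a Page if ignored
    WARNING = "warning"  # handle this week: feeds the weekly review

# One severity, one channel. No overlaps, no judgment calls at 3 AM.
ROUTES = {
    Severity.PAGE: "pagerduty",
    Severity.URGENT: "slack",
    Severity.WARNING: "weekly-review-queue",
}

def route(severity: Severity) -> str:
    """Return the single destination for an alert of this severity."""
    return ROUTES[severity]
```

If an alert seems to need a fourth destination, that's usually a sign it should be a dashboard widget instead.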
Threshold Setting: The Art Nobody Teaches
Most alerting documentation tells you how to set thresholds but not what thresholds to set. Here's my approach:
Start With Historical Data
Look at your metrics for the past 30 days. What's the normal range? What does a bad day look like versus a catastrophe? Set your Page threshold at "clearly broken" and your Urgent threshold at "concerning."
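This can be rough arithmetic rather than a science. As a sketch, suppose you have ~30 days of daily error rates: put Urgent around "worse than almost every normal day" and Page well past the worst day you've seen. The percentile choices and multipliers below are illustrative assumptions, not recommendations:

```python
def suggest_thresholds(daily_error_rates: list[float]) -> dict[str, float]:
    """Suggest Urgent/Page thresholds from ~30 days of daily error rates.

    Heuristic only: Urgent near the 95th percentile (a bad-but-seen day),
    Page comfortably beyond the worst observed day.
    """
    rates = sorted(daily_error_rates)
    p95 = rates[int(0.95 * (len(rates) - 1))]  # crude percentile, no interpolation
    worst = rates[-1]
    return {
        "urgent": round(p95, 4),                       # "concerning"
        "page": round(max(worst * 1.5, p95 * 2), 4),   # "clearly broken"
    }
```

Whatever numbers come out, sanity-check them against a known bad day before trusting them.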
Use Rates, Not Absolutes
Don't alert when error count exceeds 100. Alert when error rate exceeds 5%. The absolute number is meaningless without context - 100 errors during a traffic spike might be fine, while 100 errors at 4 AM when you have 50 total requests is a disaster.
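The contrast is easy to see in code. A sketch of the rate-based check (the function names and the 5% default are illustrative):

```python
def error_rate(errors: int, total_requests: int) -> float:
    """Error rate as a fraction; 0 when there is no traffic to judge."""
    return errors / total_requests if total_requests else 0.0

def should_alert(errors: int, total_requests: int, threshold: float = 0.05) -> bool:
    """Alert on the rate, not the count."""
    return error_rate(errors, total_requests) > threshold

# 100 errors during a traffic spike: 100 / 50,000 = 0.2%, fine.
# 100 errors in the middle of the night with 150 requests: ~67%, a disaster.
```

An absolute-count check would treat those two situations identically, which is exactly the problem.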
Add Duration Requirements
A single spike shouldn't page anyone. Require the condition to persist for at least 5 minutes before firing. This eliminates most false positives from momentary blips.
Example: "Error rate above 5% for more than 5 minutes" is much better than "Error rate above 5%."
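The "for more than 5 minutes" part is a small state machine: track when the condition first started holding, reset on any healthy sample, and fire only once the breach has been continuous long enough. A minimal sketch (class and parameter names are illustrative):

```python
import time

class SustainedCondition:
    """Fire only when a condition holds continuously for `hold_seconds`.

    Sketch of 'error rate above 5% for more than 5 minutes': a single
    spike starts the clock, a single healthy sample resets it.
    """
    def __init__(self, hold_seconds: float = 300.0):
        self.hold_seconds = hold_seconds
        self.breach_started = None  # timestamp when the breach began

    def update(self, condition_met: bool, now: float = None) -> bool:
        now = time.monotonic() if now is None else now
        if not condition_met:
            self.breach_started = None  # healthy sample: reset the clock
            return False
        if self.breach_started is None:
            self.breach_started = now   # breach just started
        return (now - self.breach_started) >= self.hold_seconds
```

Most monitoring systems offer this natively (e.g. a "for" or "pending" duration on the rule), so in practice you configure it rather than write it.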
Runbooks: The Missing Piece
Every alert should link to a runbook. Not a wiki page that might be outdated - an actual, maintained document that tells you exactly what to do when this alert fires.
A good runbook includes:
- What this alert means in plain English
- Impact assessment - who's affected and how badly
- Diagnostic steps - how to figure out what's actually wrong
- Remediation options - the 3-4 most common fixes
- Escalation path - who to call if you can't fix it
Here's the key: update the runbook every time you handle the alert. If you discovered a new diagnostic step or a new fix, add it. If something in the runbook was wrong or unhelpful, fix it. Runbooks should evolve with your system.
On-Call Hygiene
Good alerting is necessary but not sufficient. You also need good on-call practices.
Regular Alert Review
Once a week, look at every alert that fired. For each one, ask:
- Did this require action?
- Was the threshold right?
- Did the runbook help?
- Should this alert exist at all?
Be aggressive about deleting or modifying alerts that aren't earning their keep.
Blameless Postmortems
When an alert fires and it takes too long to resolve, or when an outage happens without an alert, do a postmortem. Not to assign blame, but to improve the system. What could have detected this sooner? What made it hard to diagnose? What would have helped the on-call person?
On-Call Compensation
This isn't a technical issue, but it affects everything. If your on-call rotation is unpaid and unrewarded, people will have less patience for fixing alerting problems. Compensate people for being on-call, and give them time after rough nights to recover.
The Alert Audit Process
Here's a practical exercise you can do this week: audit every alert in your system.
Export your alert definitions into a spreadsheet. For each one, fill in these columns:
- Alert name
- Times fired in the last 30 days
- Actions taken when it fired
- False positive rate (guess if needed)
- Recommendation: Keep, Modify, or Delete
I guarantee you'll find alerts that fire constantly and get ignored, alerts that have never fired and might not even work, and alerts that made sense two years ago but don't match your current architecture.
Delete ruthlessly. You can always add alerts back, but you can never get back the attention you've wasted on noise.
Building a Culture of Signal
The goal isn't zero alerts - it's signal over noise. You want every alert to be meaningful, every page to be urgent, and every on-call shift to feel manageable.
This requires ongoing work. Systems change, and alerting needs to change with them. Build alert review into your regular processes - sprint retros, quarterly planning, wherever it fits your workflow.
And listen to your on-call people. They know which alerts are useless. They know which runbooks are wrong. Create channels for them to report these issues and actually fix them quickly.
Good alerting is a competitive advantage. When your team isn't burned out from constant noise, they ship better code, catch problems faster, and stick around longer. It's worth investing in.