Alerting That Doesn't Cause Alert Fatigue
Last year I consulted for a startup where the on-call engineer was getting 47 alerts per day. Forty-seven. They'd started ignoring most of them, which meant when a real outage happened, nobody noticed for 40 minutes because the alert was buried in noise.
This is alert fatigue, and it's an engineering leadership failure, not an individual one. If your team is drowning in alerts, the system is broken, not the people.
The Only Rule That Matters
Every alert should require human action. If an alert fires and the correct response is "wait and see if it resolves itself," that shouldn't be an alert. Make it a log entry, a metric, a dashboard widget, anything except something that pages a human.
Before creating any alert, ask: "If this fires at 3 AM, what will the on-call person do?" If you can't articulate a specific action, don't create the alert.
Severity Levels That Make Sense
Most teams have too many severity levels. Here's what actually works:
Page (Wake Someone Up)
Users are affected right now. Revenue is being lost. Data might be corrupted. This is the fire alarm. It should go off rarely, and when it does, everyone drops what they're doing.
Examples: Site is down. Payment processing is failing. Database is unreachable.
Urgent (Handle Today)
Something is wrong but users might not notice yet. Left unchecked, this will become a Page. These go to Slack or email, not PagerDuty.
Examples: Disk is 80% full. SSL certificate expires in 7 days. Error rate is elevated but still below the Page threshold.
Warning (Handle This Week)
Something looks off but isn't urgent. These feed into your weekly review, not your immediate attention.
Examples: Memory usage trending upward. Slow queries appearing. Deprecated API calls detected.
That's it. Three levels. If you have five or seven severity levels, I guarantee nobody on your team knows the difference between level 3 and level 4.
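One way to keep three levels honest is to make routing mechanical: each severity maps to exactly one destination, so there is never a debate about where an alert goes. A minimal sketch (the channel names and `route` helper are hypothetical, not from any particular tool):

```python
from enum import Enum

class Severity(Enum):
    PAGE = "page"        # wake someone up: users are affected right now
    URGENT = "urgent"    # handle today: will become a Page if ignored
    WARNING = "warning"  # handle this week: feeds the weekly review

# One severity, one channel. No overlaps, no judgment calls at 3 AM.
ROUTES = {
    Severity.PAGE: "pagerduty",
    Severity.URGENT: "slack",
    Severity.WARNING: "weekly-review-queue",
}

def route(severity: Severity) -> str:
    """Return the single destination for an alert of this severity."""
    return ROUTES[severity]
```

If an alert seems to need a fourth destination, that's usually a sign it should be a dashboard widget instead.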
Threshold Setting: The Art Nobody Teaches
Most alerting documentation tells you how to set thresholds but not what thresholds to set. Here's my approach:
Start With Historical Data
Look at your metrics for the past 30 days. What's the normal range? What does a bad day look like versus a catastrophe? Set your Page threshold at "clearly broken" and your Urgent threshold at "concerning."
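This can be rough arithmetic rather than a science. As a sketch, suppose you have ~30 days of daily error rates: put Urgent around "worse than almost every normal day" and Page well past the worst day you've seen. The percentile choices and multipliers below are illustrative assumptions, not recommendations:

```python
def suggest_thresholds(daily_error_rates: list[float]) -> dict[str, float]:
    """Suggest Urgent/Page thresholds from ~30 days of daily error rates.

    Heuristic only: Urgent near the 95th percentile (a bad-but-seen day),
    Page comfortably beyond the worst observed day.
    """
    rates = sorted(daily_error_rates)
    p95 = rates[int(0.95 * (len(rates) - 1))]  # crude percentile, no interpolation
    worst = rates[-1]
    return {
        "urgent": round(p95, 4),                       # "concerning"
        "page": round(max(worst * 1.5, p95 * 2), 4),   # "clearly broken"
    }
```

Whatever numbers come out, sanity-check them against a known bad day before trusting them.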
Use Rates, Not Absolutes
Don't alert when error count exceeds 100. Alert when error rate exceeds 5%. The absolute number is meaningless without context - 100 errors during a traffic spike might be fine, while 100 errors at 4 AM when you have 50 total requests is a disaster.
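The contrast is easy to see in code. A sketch of the rate-based check (the function names and the 5% default are illustrative):

```python
def error_rate(errors: int, total_requests: int) -> float:
    """Error rate as a fraction; 0 when there is no traffic to judge."""
    return errors / total_requests if total_requests else 0.0

def should_alert(errors: int, total_requests: int, threshold: float = 0.05) -> bool:
    """Alert on the rate, not the count."""
    return error_rate(errors, total_requests) > threshold

# 100 errors during a traffic spike: 100 / 50,000 = 0.2%, fine.
# 100 errors in the middle of the night with 150 requests: ~67%, a disaster.
```

An absolute-count check would treat those two situations identically, which is exactly the problem.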
Add Duration Requirements
A single spike shouldn't page anyone. Require the condition to persist for at least 5 minutes before firing. This eliminates most false positives from momentary blips.
Example: "Error rate above 5% for more than 5 minutes" is much better than "Error rate above 5%."
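The "for more than 5 minutes" part is a small state machine: track when the condition first started holding, reset on any healthy sample, and fire only once the breach has been continuous long enough. A minimal sketch (class and parameter names are illustrative):

```python
import time

class SustainedCondition:
    """Fire only when a condition holds continuously for `hold_seconds`.

    Sketch of 'error rate above 5% for more than 5 minutes': a single
    spike starts the clock, a single healthy sample resets it.
    """
    def __init__(self, hold_seconds: float = 300.0):
        self.hold_seconds = hold_seconds
        self.breach_started = None  # timestamp when the breach began

    def update(self, condition_met: bool, now: float = None) -> bool:
        now = time.monotonic() if now is None else now
        if not condition_met:
            self.breach_started = None  # healthy sample: reset the clock
            return False
        if self.breach_started is None:
            self.breach_started = now   # breach just started
        return (now - self.breach_started) >= self.hold_seconds
```

Most monitoring systems offer this natively (e.g. a "for" or "pending" duration on the rule), so in practice you configure it rather than write it.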
Runbooks: The Missing Piece
Every alert should link to a runbook. Not a wiki page that might be outdated - an actual, maintained document that tells you exactly what to do when this alert fires.
A good runbook includes:
- What this alert means in plain English
- Impact assessment - who's affected and how badly
- Diagnostic steps - how to figure out what's actually wrong
- Remediation options - the 3-4 most common fixes
- Escalation path - who to call if you can't fix it
Here's the key: update the runbook every time you handle the alert. If you discovered a new diagnostic step or a new fix, add it. If something in the runbook was wrong or unhelpful, fix it. Runbooks should evolve with your system.
On-Call Hygiene
Good alerting is necessary but not sufficient. You also need good on-call practices.
Regular Alert Review
Once a week, look at every alert that fired. For each one, ask:
- Did this require action?
- Was the threshold right?
- Did the runbook help?
- Should this alert exist at all?
Be aggressive about deleting or modifying alerts that aren't earning their keep.
Blameless Postmortems
When an alert fires and it takes too long to resolve, or when an outage happens without an alert, do a postmortem. Not to assign blame, but to improve the system. What could have detected this sooner? What made it hard to diagnose? What would have helped the on-call person?
On-Call Compensation
This isn't a technical issue, but it affects everything. If your on-call rotation is unpaid and unrewarded, people will have less patience for fixing alerting problems. Compensate people for being on-call, and give them time after rough nights to recover.
The Alert Audit Process
Here's a practical exercise you can do this week: audit every alert in your system.
Export your alert definitions into a spreadsheet. For each one, fill in these columns:
- Alert name
- Times fired in the last 30 days
- Actions taken when it fired
- False positive rate (guess if needed)
- Recommendation: Keep, Modify, or Delete
I guarantee you'll find alerts that fire constantly and get ignored, alerts that have never fired and might not even work, and alerts that made sense two years ago but don't match your current architecture.
Delete ruthlessly. You can always add alerts back, but you can never get back the attention you've wasted on noise.
Building a Culture of Signal
The goal isn't zero alerts - it's signal over noise. You want every alert to be meaningful, every page to be urgent, and every on-call shift to feel manageable.
This requires ongoing work. Systems change, and alerting needs to change with them. Build alert review into your regular processes - sprint retros, quarterly planning, wherever it fits your workflow.
And listen to your on-call people. They know which alerts are useless. They know which runbooks are wrong. Create channels for them to report these issues and actually fix them quickly.
Good alerting is a competitive advantage. When your team isn't burned out from constant noise, they ship better code, catch problems faster, and stick around longer. It's worth investing in.