Incident Response for Startups
At 2 AM, your monitoring goes red. Users are complaining on Twitter. Your co-founder is texting you. This is not the time to figure out who does what.
Incident response is about preparing for chaos so that when chaos arrives, you have a playbook. Big companies have dedicated incident commanders, war rooms, and 50-page runbooks. You probably have 3-5 engineers and a Slack channel. That's fine. You just need a different approach.
Before the Incident: The Prep Work
Most incident response failures happen before the incident. Teams scramble because they never decided in advance how to scramble.
Define What Counts as an Incident
Not every bug is an incident. Not every alert needs to wake someone up. You need clear criteria:
Incident (all hands): Production is down. Data is being lost. Users can't complete core workflows. Revenue is actively being lost.
Issue (one person): Something's degraded but users can mostly work. A non-critical feature is broken. Performance is slow but functional.
Bug (normal process): Someone found something wrong but it's not affecting users right now. Goes into the backlog.
Write these definitions down. When stress is high, you don't want debates about whether this "really" qualifies as an incident.
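Written down, the criteria above are simple enough to express as code, so the on-call person (or an alerting rule) applies them the same way every time. This is a minimal sketch; the question names are assumptions, not a standard.

```python
# Sketch of the three-tier severity criteria above. The boolean
# questions mirror the written definitions; adapt them to your product.

def classify(production_down: bool, data_loss: bool,
             core_workflow_blocked: bool, users_affected_now: bool) -> str:
    """Map triage answers to a severity level."""
    if production_down or data_loss or core_workflow_blocked:
        return "incident"  # all hands
    if users_affected_now:
        return "issue"     # one person investigates
    return "bug"           # normal backlog process
```

Even if you never run this, writing the rules this precisely forces the debates to happen now, not at 2 AM.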
Set Up Communication Channels
Decide now where incident communication happens. A dedicated Slack channel works for most teams. Name it something obvious like #incidents or #fires. Keep it separate from #engineering so important messages don't get buried in casual chat.
For larger incidents, consider a video call as the "war room." Being able to see and hear each other makes coordination faster than typing.
Create On-Call Coverage
Someone needs to be the first responder. For tiny teams, this might be "whoever sees it first." But as you grow, you need a rotation. Tools like PagerDuty or Opsgenie handle this, but even a shared calendar showing "Sarah is on-call this week" is better than nothing.
Key principle: one person is clearly responsible at any given time. Diffused responsibility means nobody responds.
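The shared-calendar version of a rotation can be a few lines of code: rotate through the team weekly from a fixed start date. The names and start date below are hypothetical.

```python
# Minimal weekly on-call rotation sketch. ROTATION order and the
# start date (a Monday) are illustrative assumptions.
from datetime import date

ROTATION = ["Sarah", "Alex", "Priya"]   # rotation order
ROTATION_START = date(2024, 1, 1)       # week 0 begins here

def on_call(today: date) -> str:
    """Return who is on call for the week containing `today`."""
    weeks_elapsed = (today - ROTATION_START).days // 7
    return ROTATION[weeks_elapsed % len(ROTATION)]
```

Post the output somewhere visible; the point is that everyone can answer "who is on call right now?" without asking.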
During the Incident: The Playbook
When something breaks, follow this flow:
Step 1: Acknowledge and Triage (First 5 Minutes)
The first responder acknowledges the incident. This tells everyone "I'm on it," so three people don't start debugging the same thing.
Quick triage questions:
- What's actually broken? (Be specific)
- Who's affected? (All users, some users, internal only?)
- When did it start?
- What changed recently? (Deploys, config changes, new traffic)
Post answers in the incident channel so everyone has context.
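A small helper can turn those four answers into a consistent first post, so the channel always opens with the same context in the same shape. The field names and wording here are a sketch, not a standard format.

```python
# Sketch: format the four triage answers into one incident-channel post.
def triage_post(what: str, who: str, started: str, changed: str) -> str:
    return (
        "INCIDENT TRIAGE\n"
        f"What's broken: {what}\n"
        f"Who's affected: {who}\n"
        f"Started: {started}\n"
        f"Recent changes: {changed}"
    )
```

Example use (hypothetical values): `triage_post("Checkout returning 500s", "All users", "02:04 UTC", "Deploy at 01:55")`.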
Step 2: Communicate Externally (Within 15 Minutes)
If users are affected, tell them. A status page update or a tweet that says "We're aware of issues and investigating" buys you goodwill. Silence makes people assume you don't know or don't care.
Don't wait until you have a fix. Acknowledge the problem, give an ETA for updates (even if it's "we'll update in 30 minutes"), and stick to that schedule.
Step 3: Stabilize (Next Hour)
The goal isn't to fix the root cause - it's to stop the bleeding. Roll back the bad deploy. Restart the crashed service. Scale up to handle the traffic spike. Flip the feature flag off.
You can figure out what went wrong later. Right now, get users working again.
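"Flip the feature flag off" only works as a stabilization move if the flag can be flipped without a deploy. A minimal sketch, assuming flags live in a small in-memory store (in practice a flag service or config store); the flag names are hypothetical:

```python
# Kill-switch sketch: disable a feature immediately, no deploy needed.
# In real systems FLAGS would be a flag service or shared config store.
FLAGS = {"new_checkout_flow": True, "beta_search": True}

def kill_switch(flag: str) -> None:
    """Turn a feature off right now."""
    FLAGS[flag] = False

def is_enabled(flag: str) -> bool:
    return FLAGS.get(flag, False)
```

The design choice that matters: every risky feature ships behind a flag, so "turn it off" is always an option at 2 AM.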
Step 4: Fix Forward If Needed
Sometimes rollback isn't possible. Maybe the database migration can't be reversed. Maybe the old code has a security vulnerability. Now you fix forward, but you keep it minimal. The smallest change that solves the immediate problem.
This is not the time for refactoring or adding features. Fix the thing. Ship it. Improve later.
Step 5: Confirm Resolution
How do you know it's actually fixed? Define success criteria before declaring victory. Monitor for at least 15-30 minutes after the fix. Watch for the problem recurring or new problems appearing.
Update your status page. Let users know it's resolved.
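"Define success criteria before declaring victory" can be made concrete: resolved means the key metric stays healthy for the whole watch window, not just at the moment the fix lands. The threshold and window length below are illustrative assumptions.

```python
# Sketch of a resolution check: the per-minute error rate must stay
# under the threshold for at least 15 consecutive minutes of data.
def is_resolved(error_rates: list[float], threshold: float = 0.01) -> bool:
    """error_rates: per-minute error rates collected since the fix."""
    return len(error_rates) >= 15 and all(r < threshold for r in error_rates)
```

One brief spike during the window means the clock restarts, which is exactly the discipline the 15-30 minute watch is meant to enforce.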
After the Incident: The Postmortem
Every significant incident gets a postmortem. Not to assign blame - to learn and improve. Skip this step and you'll keep having the same incidents.
When to Write One
Any incident that affected users for more than 15 minutes. Any incident that required multiple people. Anything that woke someone up. Anything that cost significant money.
What to Include
- Timeline: What happened when, in detail. Include timestamps.
- Impact: How many users were affected, for how long, and what they couldn't do.
- Root cause: Why did this happen? Keep asking "why" until you hit something structural.
- What went well: What helped you detect and respond quickly?
- What went poorly: Where did you waste time? What information was missing?
- Action items: Concrete steps to prevent recurrence or improve response.
The Blameless Part
Postmortems fail when they become blame sessions. If Alice made a typo that caused the outage, the root cause isn't "Alice made a typo" - it's "our system allowed a typo to cause an outage." The question is always "how do we make the system safer?" not "who do we punish?"
People make mistakes. Systems should catch those mistakes before they become outages.
Building Your Incident Response Kit
Prepare these in advance so you're not creating them during an incident:
Status page: Use something like Statuspage, Instatus, or even a simple static page. Update it during incidents.
Runbooks: For common failure modes, write step-by-step guides. "Database is slow" should link to a runbook that covers common causes and fixes.
Contact list: Who knows what? If the payment system breaks, who understands it best? Keep a list of people and their domains.
Access: Can your on-call person access everything they need? Production logs, database, hosting dashboard? Don't find out during an incident that they can't see the metrics.
Customer communication templates: Pre-written messages for common scenarios. "We're investigating" / "We've identified the issue" / "This is resolved." Saves precious minutes during a crisis.
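The templates kit above can live as a tiny lookup with a slot for the affected service, so the update is a fill-and-send rather than a compose-from-scratch. The wording and stage names below are illustrative, not a standard.

```python
# Sketch: pre-written status updates keyed by incident stage, with a
# placeholder for the affected service. Wording is an assumption.
TEMPLATES = {
    "investigating": "We're aware of issues with {service} and are "
                     "investigating. Next update in 30 minutes.",
    "identified": "We've identified the issue affecting {service} and "
                  "are working on a fix.",
    "resolved": "The issue affecting {service} is resolved. A postmortem "
                "will follow.",
}

def status_update(stage: str, service: str) -> str:
    return TEMPLATES[stage].format(service=service)
```

Note the "next update in 30 minutes" promise baked into the first template - it commits you to the update schedule from Step 2.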
Scaling Up: When You Outgrow This
This lightweight process works well for teams up to about 20 people. Beyond that, you might need:
- Formal incident commander roles
- Separate communications lead
- Tiered response (not everyone joins every incident)
- More detailed severity levels
- Incident tracking systems beyond Slack
But don't add complexity until you need it. A 5-person team with a 50-page incident response plan will just ignore the plan. Start simple, refine as you grow.
The Culture Part
Process only works if culture supports it. Build a team where:
- Reporting problems is rewarded, not punished
- Asking for help is normal
- Postmortems lead to actual changes
- On-call load is distributed fairly
- Recovery is celebrated, not just crisis prevention
Incidents are inevitable. How you handle them shapes your team's trust, your users' loyalty, and your product's reputation. Invest the time now, before you need it.