Backup Strategy That Actually Works
I talked to a founder last month who lost three months of customer data. They had backups - automated daily snapshots, running for two years. But when they tried to restore after a database corruption, every single backup was corrupted too. The backup process had been silently failing, and nobody had ever tested a restore.
This happens more than you'd think. Backups feel like insurance. You set them up, pay the storage costs, and assume they'll be there when you need them. But unlike insurance, there's no regulator making sure your backup provider can actually pay out.
Let's fix that.
The 3-2-1 Rule Still Works
The classic backup rule remains solid:
- 3 copies of your data
- 2 different storage types
- 1 offsite location
For a modern web application, this might look like:
- Primary database (copy 1, type 1)
- Automated snapshots in the same cloud region (copy 2, type 1)
- Daily exports to a different cloud provider or S3 bucket in another region (copy 3, type 2, offsite)
The point isn't to follow this formula exactly - it's to avoid having all your eggs in one basket. If your backups live in the same datacenter as your production database, a regional outage takes out both.
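The rule is easy to check mechanically. Here's a small sketch of a 3-2-1 sanity check — the `BackupCopy` structure and the storage-type and region names are illustrative, not tied to any particular cloud:

```python
from dataclasses import dataclass

@dataclass
class BackupCopy:
    name: str
    storage_type: str   # e.g. "cloud-disk", "object-store"
    location: str       # e.g. "us-east-1", "eu-west-1"

def satisfies_3_2_1(copies, primary_location):
    """Check the 3-2-1 rule: 3 copies, 2 storage types, 1 offsite."""
    enough_copies = len(copies) >= 3
    enough_types = len({c.storage_type for c in copies}) >= 2
    has_offsite = any(c.location != primary_location for c in copies)
    return enough_copies and enough_types and has_offsite

# The example layout from above: primary + same-region snapshots + offsite export
plan = [
    BackupCopy("primary database", "cloud-disk", "us-east-1"),
    BackupCopy("automated snapshots", "cloud-disk", "us-east-1"),
    BackupCopy("daily S3 export", "object-store", "eu-west-1"),
]
print(satisfies_3_2_1(plan, "us-east-1"))  # True
```

Drop the offsite export and the check fails: two copies, one storage type, nothing outside the primary region.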
What Actually Needs Backing Up
Before building a backup strategy, inventory what you have:
Databases
The obvious one. But think about whether you need point-in-time recovery (PITR) or whether daily snapshots are enough. PITR lets you restore to any moment within its retention window, which is crucial if you need to recover from a bad migration or an accidental delete at a specific time.
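If you run PostgreSQL, PITR is built on a base backup plus continuous WAL archiving. A minimal configuration might look like this sketch (the archive path is illustrative; the `archive_command` pattern follows the PostgreSQL documentation):

```
# postgresql.conf -- enable continuous WAL archiving for PITR
wal_level = replica
archive_mode = on
# Refuse to overwrite an already-archived segment, then copy it out
archive_command = 'test ! -f /backups/wal/%f && cp %p /backups/wal/%f'
```

Restoring to a specific moment then means replaying the archived WAL on top of a base backup up to a `recovery_target_time`.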
User-Uploaded Files
If you're storing files in S3 or similar, those need backing up too. Enable versioning at minimum. It won't protect you from a deleted bucket, but it catches accidental overwrites and deletions.
Configuration and Infrastructure
Can you rebuild your infrastructure from scratch? If your Terraform state file disappears, or your CI/CD configuration gets corrupted, how long would it take to get back to production? These should be in git, but make sure that repo is backed up too.
Third-Party Data
What data do you depend on from external services? If Stripe's webhook history disappeared, could you reconstruct your payment records? Think about what you'd need if a vendor had an outage or lost data.
Retention: The Forgotten Dimension
How long do you keep backups? Most teams default to "forever" until storage costs get painful, then "7 days" after someone panics about the bill. Neither is right.
Think about what kinds of problems you're protecting against:
Operational errors (deleted data, bad deploys): Usually caught within hours or days. Keep frequent backups (hourly or better) for the past week.
Data corruption: Might not be noticed for weeks. Keep daily backups for at least 30 days.
Legal or compliance requirements: May require years of history. Keep monthly backups for as long as legally required.
A reasonable default schedule:
- Hourly backups, retained for 24 hours
- Daily backups, retained for 30 days
- Weekly backups, retained for 3 months
- Monthly backups, retained for 1 year
Adjust based on your data volume and recovery time requirements.
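The schedule above translates directly into a pruning policy: keep everything recent, then keep the first backup of each day, week, and month as the tiers age out. A sketch of that logic (the policy windows mirror the defaults above; adjust to taste):

```python
from datetime import datetime, timedelta

def backups_to_keep(timestamps, now):
    """Tiered retention: everything for 24h, one per day for 30 days,
    one per week for 90 days, one per month for a year."""
    keep = set()
    seen_days, seen_weeks, seen_months = set(), set(), set()
    for ts in sorted(timestamps):  # oldest first, so each bucket keeps its earliest backup
        age = now - ts
        if age <= timedelta(hours=24):
            keep.add(ts)
        day = ts.date()
        week = ts.isocalendar()[:2]   # (year, week number)
        month = (ts.year, ts.month)
        if age <= timedelta(days=30) and day not in seen_days:
            keep.add(ts)
        if age <= timedelta(days=90) and week not in seen_weeks:
            keep.add(ts)
        if age <= timedelta(days=365) and month not in seen_months:
            keep.add(ts)
        seen_days.add(day)
        seen_weeks.add(week)
        seen_months.add(month)
    return keep

now = datetime(2024, 6, 1, 12, 0)
kept = backups_to_keep(
    [now - timedelta(hours=5), now - timedelta(days=10), now - timedelta(days=400)],
    now,
)
print(len(kept))  # 2 -- the 400-day-old backup falls outside every tier
```

Anything the function doesn't return is safe to delete under the policy.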
Testing: The Part Everyone Skips
Here's the uncomfortable truth: your backup system is exactly as reliable as your last successful test restore. If you've never tested restoring from backup, you don't know if your backups work.
Schedule regular restore tests. I recommend monthly for critical systems. The process should be:
- Spin up a fresh environment (don't pollute production)
- Restore from your most recent backup
- Verify the data is complete and correct
- Test that the application actually works with restored data
- Document how long the whole process took
That last point matters. If your recovery time is 4 hours, your team needs to know that. It affects your incident response planning and SLA commitments.
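A restore drill is worth scripting so the timing gets captured automatically. A minimal harness might look like this — the step names and the lambdas are stand-ins; in a real drill each callable would shell out to your infrastructure tooling (restore command, row-count checks, smoke tests):

```python
import time

def run_restore_drill(steps):
    """Run each restore-drill step in order, timing the whole drill.
    `steps` maps step names to callables returning True on success."""
    start = time.monotonic()
    results = {}
    for name, step in steps.items():
        ok = step()
        results[name] = ok
        if not ok:
            break  # a failed step invalidates everything after it
    return results, time.monotonic() - start

# Placeholder steps, mirroring the checklist above
drill = {
    "provision fresh environment": lambda: True,
    "restore latest backup": lambda: True,
    "verify data completeness": lambda: True,
    "run application smoke test": lambda: True,
}
results, elapsed = run_restore_drill(drill)
print(f"{sum(results.values())}/{len(drill)} steps passed in {elapsed:.1f}s")
```

The elapsed time from each drill is the number that feeds your recovery-time estimate.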
Chaos Engineering for Backups
Once you're confident in your restore process, take it further. Can you restore a single table without restoring the whole database? Can you restore to a point in time 3 hours ago? Can a junior engineer follow your runbook successfully?
Find the edge cases before production finds them for you.
Monitoring Your Backups
Backup jobs fail silently. Storage fills up. Credentials expire. If you're not monitoring your backup system, you won't know it's broken until you need it.
At minimum, alert on:
- Backup job failures
- Backup jobs that didn't run when expected
- Backup size anomalies (sudden drops might indicate incomplete backups)
- Storage capacity (will you run out of space before the next budget review?)
Also track your backup duration over time. If backups are taking longer and longer, you'll eventually hit the window where they can't complete before the next one starts.
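The size-anomaly check in particular is cheap to automate. One simple approach, sketched here, is to compare the latest backup against the median of recent history — a sudden drop often means a partial or empty dump (the threshold and sample values are illustrative):

```python
from statistics import median

def backup_size_alert(history_bytes, latest_bytes, drop_threshold=0.5):
    """Alert when the latest backup is far smaller than the recent median."""
    if len(history_bytes) < 3:
        return False  # not enough history to judge
    baseline = median(history_bytes)
    return latest_bytes < baseline * drop_threshold

history = [510, 520, 515, 530, 525]  # recent backup sizes (units don't matter)
print(backup_size_alert(history, 120))  # True -- well under half the median
print(backup_size_alert(history, 500))  # False -- within normal range
```

A median is a deliberate choice over a mean here: one unusually large backup in the history won't skew the baseline.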
The Human Element
Technical backup systems fail for human reasons:
- The engineer who set it up left, and nobody else knows how it works
- The project got busy, and backup testing dropped off the sprint
- Someone "temporarily" disabled backups during a migration and forgot to re-enable them
Combat this with documentation and process:
Write it down: How do backups work? How do you restore? Where are credentials stored? This should be accessible to anyone who might handle an incident.
Assign ownership: Someone should be responsible for backup health. Put restore testing on a recurring calendar.
Include in onboarding: New engineers should know where backups are and how to restore before they need to do it under pressure.
Cost Optimization Without Cutting Corners
Backup storage costs add up. Here's how to manage them without compromising safety:
Tiered storage: Recent backups need fast access. Old backups can go to cold storage (S3 Glacier, Azure Archive) at a fraction of the cost.
Incremental backups: A full database dump every hour is wasteful. Use incremental backups where possible - only storing what changed since the last backup.
Compression: Obvious but often overlooked. Database dumps compress well, often 5-10x.
Prune aggressively: If you don't need hourly backups from 6 months ago, delete them. Follow your retention policy.
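The compression point is easy to demonstrate. Database dumps are highly repetitive - column names, SQL keywords, similar rows - so they compress far better than arbitrary data. A quick sketch with a synthetic dump (the table and row contents are made up):

```python
import gzip

# Build a fake SQL dump: thousands of structurally similar INSERT lines
row = b"INSERT INTO users (id, email, plan) VALUES (%d, 'user%d@example.com', 'pro');\n"
dump = b"".join(row % (i, i) for i in range(10_000))

compressed = gzip.compress(dump)
ratio = len(dump) / len(compressed)
print(f"{len(dump)} -> {len(compressed)} bytes ({ratio:.1f}x smaller)")
```

Real dumps won't compress quite this uniformly, but ratios in the 5-10x range are common, which directly cuts storage and transfer costs.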
Starting Point Checklist
If you're starting from scratch or auditing an existing system:
- Inventory everything that needs backing up
- Verify backups are running for all critical systems
- Check that backups are stored in a different location than production
- Test a restore from your most recent backup
- Document the restore procedure
- Set up alerts for backup failures
- Schedule recurring restore tests
Backups are boring until they're not. The time you invest now is time you won't spend panicking during an incident. Make it count.