Monitoring Your App: A Practical Setup Guide
I've seen teams with $50,000/year monitoring bills who still get paged at 3 AM because nobody noticed the database was dying. I've also seen solo founders running million-dollar apps with nothing but uptime checks. Both approaches are wrong.
Good monitoring sits in the middle. It tells you what you need to know, when you need to know it, without drowning you in noise. Here's how to set it up from scratch.
The Three Pillars (But Actually Useful)
You've probably heard about "the three pillars of observability" - metrics, logs, and traces. That's technically accurate but practically useless for most teams. Here's what actually matters:
- Is it up? - Basic availability monitoring
- Is it fast? - Response time and latency
- Is it broken? - Error rates and exceptions
Everything else is optimization. Get these three right first.
Start With Uptime Checks
Before you install any agents or configure any dashboards, set up external uptime monitoring. This is your canary in the coal mine. If your entire infrastructure dies, you'll still get alerted.
I recommend having at least two independent services checking your endpoints. Why two? Because monitoring services go down too. I've had UptimeRobot miss a 20-minute outage because their checking servers had issues.
For each critical endpoint, check:
- HTTP status code (should be 200)
- Response time (set a threshold, like 2 seconds)
- Response body contains expected content
That last one catches the sneaky failures where your app returns 200 OK but the page is actually an error message.
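The pass/fail logic for a check like this can be sketched in a few lines. The thresholds and expected text below are placeholders, not recommendations - tune them per endpoint:

```javascript
// Evaluate one uptime-check result against the three criteria above.
// `result` is what your checker observed; `rules` is what you expect.
function evaluateCheck(result, rules) {
  const failures = [];
  if (result.status !== 200) failures.push(`status ${result.status}`);
  if (result.elapsedMs > rules.maxMs) failures.push(`slow: ${result.elapsedMs}ms`);
  if (!result.body.includes(rules.mustContain)) failures.push("unexpected body");
  return { healthy: failures.length === 0, failures };
}

// A 200 OK that is actually an error page still fails the check:
const verdict = evaluateCheck(
  { status: 200, elapsedMs: 340, body: "<h1>Something went wrong</h1>" },
  { maxMs: 2000, mustContain: "Welcome back" }
);
// verdict.healthy === false, failures: ["unexpected body"]
```

Hosted services like UptimeRobot implement roughly this logic for you; the sketch is just to show why the body check matters.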
Application Metrics That Matter
Once uptime is covered, instrument your application. But don't go crazy - start with these four metrics:
Request Rate
How many requests per second are you handling? This is your traffic indicator. A sudden drop might mean users can't reach you. A sudden spike might mean you're getting attacked or something went viral.
Error Rate
What percentage of requests are failing? Track both 4xx and 5xx errors, but alert differently. A spike in 4xx might just be bots probing your site. A spike in 5xx means your code is broken.
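The "alert differently" part can be as simple as two thresholds with two severities. The numbers here are illustrative, not recommendations:

```javascript
// Decide how to alert on one window of request counts.
// 5xx gets a low threshold and a page; 4xx gets a high
// threshold and a quieter notification.
function shouldAlert(counts) {
  const total = counts.ok + counts.clientErr + counts.serverErr;
  if (total === 0) return null;
  if (counts.serverErr / total > 0.01) return "page";   // code is broken: wake someone
  if (counts.clientErr / total > 0.20) return "notify"; // possibly bots: a ticket, not a page
  return null;
}

console.log(shouldAlert({ ok: 98, clientErr: 0, serverErr: 2 }));  // "page"
console.log(shouldAlert({ ok: 70, clientErr: 30, serverErr: 0 })); // "notify"
console.log(shouldAlert({ ok: 100, clientErr: 1, serverErr: 0 })); // null
```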
Response Time (P50, P95, P99)
Averages lie. Your average response time might be 100ms, but if 1% of users are waiting 10 seconds, they're leaving. Track percentiles instead. P50 tells you the typical experience. P95 and P99 tell you about the slow tail.
Saturation
How close are you to running out of something? CPU, memory, disk, database connections - pick the resources that matter for your app and track how full they're getting.
The Logging Setup Nobody Regrets
Structured logging. That's it. That's the setup.
Instead of console.log("User signed up"), write logger.info({ event: "user_signup", userId: user.id, plan: "pro" }). When something breaks at 2 AM, you'll be able to search for all events related to that user instead of grepping through walls of text.
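A structured logger along these lines is small enough to sketch - this is not a replacement for a real library like pino or winston, just the shape of the idea. It emits one JSON object per line so any log store can index the fields:

```javascript
// Tiny structured logger: every entry is one JSON line with a level,
// a timestamp, and whatever context fields you attach.
function createLogger(base = {}) {
  const emit = (level, fields) =>
    console.log(JSON.stringify({ level, ts: new Date().toISOString(), ...base, ...fields }));
  return {
    info: (fields) => emit("info", fields),
    error: (fields) => emit("error", fields),
    // child() bakes in shared context (service name, request id) once,
    // so every later line carries it automatically.
    child: (extra) => createLogger({ ...base, ...extra }),
  };
}

const logger = createLogger({ service: "api" });
const reqLog = logger.child({ requestId: "req_123" }); // hypothetical request id
reqLog.info({ event: "user_signup", userId: 42, plan: "pro" });
// emits one JSON line containing level, ts, service, requestId, event, userId, plan
```

The child-logger pattern is the part that pays off at 2 AM: attach the request id once, and every log line from that request is searchable by it.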
Keep logs for at least 30 days. You'd be surprised how often bugs get reported a week after they started happening.
Choosing Your Tools
The specific tools matter less than you think. What matters is that you actually use them. That said, here's what I recommend for different team sizes:
Solo to 5 People
- Uptime: UptimeRobot (free tier is fine) + BetterUptime
- Errors: Sentry (their free tier is generous)
- Metrics: Your hosting provider's built-in tools (Vercel Analytics, Railway metrics, etc.)
- Logs: LogTail or Axiom (both have good free tiers)
5 to 20 People
- Keep the uptime setup
- APM: Datadog or New Relic (pick one, they're similar)
- Logs: Same as your APM vendor to avoid context switching
20+ People
At this point you probably need someone whose job is monitoring. Seriously. Managing observability at this scale is a full-time role, not a side task.
Setting Up Dashboards That Get Used
Most dashboards are graveyards of good intentions. Someone spends three days building a beautiful 47-panel dashboard, and then nobody looks at it ever again.
Build exactly two dashboards:
The On-Call Dashboard: Put this on a TV in the office (or have it open during incidents). It should show only the things that tell you "is everything okay right now?" - request rate, error rate, latency, and the status of critical dependencies. No historical trends, no breakdowns by endpoint. Just the vitals.
The Investigation Dashboard: This is where you go when something's wrong. It should let you drill down by time range, endpoint, user segment, whatever makes sense for your app. This one can be complex because you only use it when you're actively debugging.
The First Week Checklist
Here's what to set up in order. Don't move to the next step until the previous one is working and alerting properly:
- Day 1-2: External uptime monitoring on your main endpoints
- Day 3: Error tracking (Sentry or similar) with source maps
- Day 4-5: Basic metrics from your hosting provider
- Day 6-7: Structured logging to a searchable destination
That's it for week one. Resist the urge to add more. Live with this setup for at least a month before expanding. You'll learn what you're actually missing versus what sounds cool in blog posts.
Common Mistakes to Avoid
Alerting on everything: We'll cover this more in tomorrow's post, but for now: if you're getting more than 2-3 alerts per day, your alerting is broken.
Not testing your monitoring: How do you know your alerts work? Break something on purpose. Pull the plug on a non-critical service and see if you get notified. Do this monthly.
Monitoring only the happy path: Your checkout page might be up, but is the payment processor responding? Monitor your dependencies, not just your own code.
Monitoring isn't a project you finish. It's an ongoing practice. Start simple, add complexity only when you have evidence you need it, and always ask yourself: "Will this tell me something I'll actually act on?"