Cost Optimization for Cloud Infrastructure
A founder showed me their AWS bill last month. They were paying $4,200/month for an app with 500 daily active users. When we dug in, we found $3,000 worth of orphaned resources, over-provisioned instances, and forgotten experiments. In two hours, we cut their bill to $800.
This isn't unusual. Cloud spending grows silently. Developers spin up resources for testing and forget to tear them down. Auto-scaling policies get set once and never revisited. Load balancers sit idle. Logs accumulate forever.
Cutting cloud costs isn't about being cheap - it's about not wasting money that could fund features, hires, or runway.
Finding the Waste: Where to Look First
1. Idle and Orphaned Resources
Start here because it's the easiest win. Look for:
- Unattached EBS volumes: Volumes that persist after instances are terminated. You're paying for storage nobody's using.
- Old snapshots: EBS and RDS snapshots from three years ago. Do you really need them?
- Unused elastic IPs: AWS charges for allocated IPs that aren't attached to running instances.
- Forgotten load balancers: ALBs with no healthy targets. They still cost ~$20/month minimum.
- Test environments that never got deleted: That staging cluster from last year's project.
AWS Cost Explorer can help identify unused resources, but honestly, sometimes the fastest way is to enumerate everything running in each region and ask "do we use this?" about each item.
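The unattached-volume check is easy to script. Here's a minimal sketch of the filtering logic, using hand-written sample records shaped like the output of `aws ec2 describe-volumes` — in practice you'd feed this from boto3:

```python
# Sketch: flag unattached EBS volumes. The records below are invented
# samples in the shape returned by describe-volumes; in practice you'd
# pull them via boto3's ec2.describe_volumes().

def find_unattached_volumes(volumes):
    """Return volumes with no attachments -- storage nobody is using."""
    return [v for v in volumes if not v.get("Attachments")]

volumes = [
    {"VolumeId": "vol-aaa", "Size": 100, "Attachments": [{"InstanceId": "i-1"}]},
    {"VolumeId": "vol-bbb", "Size": 500, "Attachments": []},  # orphaned
    {"VolumeId": "vol-ccc", "Size": 50, "Attachments": []},   # orphaned
]

orphans = find_unattached_volumes(volumes)
wasted_gb = sum(v["Size"] for v in orphans)
print(f"{len(orphans)} unattached volumes, {wasted_gb} GB of paid-for storage")
```

The same shape of check works for old snapshots and unattached elastic IPs: fetch the list, filter for "nothing references this," and review the survivors by hand before deleting.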
2. Over-Provisioned Compute
Right-sizing is the art of matching instance size to actual usage. Most teams over-provision because it's safer - better to have too much capacity than too little.
Pull CPU and memory metrics for your instances. If they're consistently below 30% utilization, you're probably paying for twice the capacity you need. AWS Compute Optimizer will give you specific recommendations.
Caveat: don't right-size by average. Look at peaks. If you average 20% CPU but spike to 80% during deployments, you need that headroom.
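The average-versus-peak caveat can be made concrete. A sketch of the decision logic, with invented metric samples (in practice these would come from CloudWatch) and threshold values that are assumptions you'd tune:

```python
# Sketch: a right-sizing check that looks at peak utilization, not just
# the average. Thresholds are illustrative assumptions.

def rightsizing_verdict(cpu_samples, avg_limit=30.0, peak_limit=60.0):
    avg = sum(cpu_samples) / len(cpu_samples)
    peak = max(cpu_samples)
    if avg < avg_limit and peak < peak_limit:
        return f"downsize candidate (avg {avg:.0f}%, peak {peak:.0f}%)"
    return f"leave alone (avg {avg:.0f}%, peak {peak:.0f}%)"

spiky = [20] * 95 + [80] * 5   # averages ~23% but spikes to 80% on deploys
flat = [20] * 100              # genuinely idle

print(rightsizing_verdict(spiky))  # the peak rules it out
print(rightsizing_verdict(flat))
```

The spiky instance averages well under 30% but still keeps its size, because downsizing it would turn every deployment spike into an outage.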
3. Storage Tiers
Not all data needs the fastest storage:
- S3 Intelligent-Tiering: Automatically moves objects to cheaper tiers based on access patterns. Set it and forget it.
- S3 Lifecycle policies: Move old logs to Glacier after 30 days. Delete them after a year. Saves enormous amounts on long-lived buckets.
- GP3 vs GP2: If you're still on GP2 EBS volumes, GP3 is cheaper and faster. Just switch.
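The lifecycle-policy savings are easy to estimate up front. A back-of-the-envelope sketch — the per-GB prices here are illustrative assumptions, not quotes, so check current S3 pricing for your region:

```python
# Rough sketch of the Standard-vs-Glacier math for old logs.
# Prices are assumed illustrative figures ($/GB-month).
S3_STANDARD = 0.023
GLACIER = 0.004

def monthly_cost(gb, price_per_gb):
    return gb * price_per_gb

log_gb = 5000  # 5 TB of logs older than 30 days
before = monthly_cost(log_gb, S3_STANDARD)
after = monthly_cost(log_gb, GLACIER)
print(f"${before:.0f}/mo on Standard vs ${after:.0f}/mo on Glacier "
      f"(saves ${before - after:.0f}/mo)")
```

Retrieval from Glacier costs extra and is slow, which is exactly why it fits logs you'll almost never read again.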
4. Data Transfer Costs
The cloud's hidden tax. Transferring data between regions, between availability zones, or out to the internet adds up fast.
Quick wins:
- Use CloudFront or another CDN for static assets (cheaper than direct S3 egress)
- Keep chatty services in the same availability zone when possible
- Compress data before transfer
- Cache aggressively to reduce repeated fetches
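The compression win is worth seeing in numbers. A small sketch using the standard library on invented JSON data — repetitive API responses like this compress dramatically, which translates directly into lower egress charges:

```python
import gzip
import json

# Sketch: how much smaller a typical repetitive JSON payload gets
# after gzip, before it crosses a billed network boundary.
payload = json.dumps(
    [{"id": i, "status": "ok", "region": "us-east-1"} for i in range(5000)]
).encode()

compressed = gzip.compress(payload)
ratio = len(compressed) / len(payload)
print(f"{len(payload)} bytes -> {len(compressed)} bytes "
      f"({ratio:.0%} of original)")
```

Your real ratio depends on the data — already-compressed formats like images and video won't shrink — but structured text usually does.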
Commitment Discounts: Reserved Instances and Savings Plans
If you know you'll need certain resources for the next year, you can save 30-60% by committing upfront. AWS offers Reserved Instances and Savings Plans; GCP has Committed Use Discounts; Azure has Reserved VM Instances.
The tradeoff: if your usage patterns change, you might be paying for capacity you don't need. Start conservative - reserve only what you're confident about, typically your baseline production load.
Before committing:
- Look at 3-6 months of usage history
- Identify your steady-state baseline (what's always running)
- Reserve that baseline, leave headroom for variable workloads
- Review quarterly and adjust
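The baseline-then-reserve step is simple arithmetic. A sketch with invented usage numbers; the on-demand rate and the 40% reservation discount are assumed illustrative figures (actual discounts depend on term, payment option, and instance class):

```python
# Sketch of "reserve the baseline": commit only to what was always running.
ON_DEMAND = 0.10   # $/hour per instance, assumed
RI_DISCOUNT = 0.40 # assumed reservation discount
HOURS_PER_MONTH = 730

# Minimum concurrent instances observed in each of the last six months.
monthly_min_running = [10, 10, 9, 11, 10, 10]
baseline = min(monthly_min_running)  # 9 instances were always on

od_cost = baseline * HOURS_PER_MONTH * ON_DEMAND
ri_cost = od_cost * (1 - RI_DISCOUNT)
print(f"Reserve {baseline} instances: ${od_cost:.0f}/mo on-demand "
      f"-> ${ri_cost:.0f}/mo reserved (saves ${od_cost - ri_cost:.0f}/mo)")
```

Everything above the baseline stays on-demand (or spot), so a slow quarter doesn't leave you paying for committed capacity you no longer use.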
Spot Instances and Preemptibles
Spot instances (AWS) and preemptible VMs (GCP) offer 60-90% discounts in exchange for the cloud provider being able to terminate them with little notice (two minutes, in AWS's case).
Good use cases:
- CI/CD pipelines
- Batch processing jobs
- Development and staging environments
- Stateless workers that can be interrupted
Bad use cases:
- Your only production web server
- Databases
- Anything that can't tolerate sudden interruption
The sweet spot is using spot instances as part of a mixed fleet. Run your minimum required capacity on on-demand or reserved, then burst with spot when you need more.
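The mixed-fleet blending is worth pricing out. A sketch with assumed illustrative rates (spot at roughly a 70% discount here):

```python
# Sketch: baseline capacity on on-demand/reserved, burst on spot.
# Both hourly rates are assumed illustrative figures.
ON_DEMAND = 0.10  # $/hour, assumed
SPOT = 0.03       # $/hour, assumed

def hourly_fleet_cost(needed, baseline):
    """Run `baseline` instances on-demand; anything above it on spot."""
    od = min(needed, baseline)
    spot = max(0, needed - baseline)
    return od * ON_DEMAND + spot * SPOT

blended = hourly_fleet_cost(10, 4)
all_od = 10 * ON_DEMAND
print(f"10 needed, 4 baseline: ${blended:.2f}/hr vs ${all_od:.2f}/hr all on-demand")
```

If spot capacity disappears, you fall back to your baseline — which is why the baseline should be your minimum required capacity, not zero.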
Monitoring and Budgets
Cost optimization isn't a one-time project. Waste accumulates continuously. Build ongoing visibility:
Set Budget Alerts
AWS Budgets, GCP Billing alerts, Azure Cost Management. All let you set thresholds and get notified when spending exceeds them. Set alerts at 50%, 80%, and 100% of your expected monthly spend.
Tag Everything
Without tags, you can't answer "how much does this project cost?" Enforce tagging policies with tools like AWS Config or terraform-compliance. At minimum, tag by:
- Environment (production, staging, development)
- Team or project
- Owner (who's responsible)
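Enforcing that minimum tag set is a one-function check. A sketch on invented resource records — in practice the tags would come from your provider's API or a tool like AWS Config:

```python
# Sketch: flag resources missing any of the required tags.
# The resource records below are invented samples.
REQUIRED_TAGS = {"environment", "team", "owner"}

def missing_tags(resource):
    return REQUIRED_TAGS - set(resource.get("tags", {}))

resources = [
    {"id": "i-web-1", "tags": {"environment": "production",
                               "team": "platform", "owner": "alice"}},
    {"id": "vol-data", "tags": {"environment": "staging"}},
]

for r in resources:
    gap = missing_tags(r)
    if gap:
        print(f"{r['id']} is missing tags: {sorted(gap)}")
```

Run a check like this in CI or on a schedule, and untagged spend stops accumulating silently.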
Regular Cost Reviews
Put it on the calendar. Monthly cost review where you look at the bill, identify the biggest spenders, and ask "is this expected?" Catching anomalies early prevents bill shock.
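The "is this expected?" question can be partially automated with a simple month-over-month check. A sketch with invented bill figures and an assumed 25% threshold:

```python
# Sketch: flag a bill that exceeds the trailing average by more than a
# set threshold. Bills and threshold are illustrative assumptions.

def is_anomalous(history, current, threshold=0.25):
    """True if `current` exceeds the average of `history` by > threshold."""
    avg = sum(history) / len(history)
    return current > avg * (1 + threshold)

past_bills = [4100, 4250, 4180]  # last three months, $
print(is_anomalous(past_bills, 4300))  # normal drift
print(is_anomalous(past_bills, 5600))  # ~34% over average: investigate
```

This won't tell you *why* the bill jumped — that's what the monthly review is for — but it makes sure a jump never goes unnoticed for a full billing cycle.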
Architecture Changes for Cost
Sometimes the cheapest fix is architectural:
Serverless for sporadic workloads: If your API handles 10 requests per hour most of the time but spikes during events, a Lambda might be cheaper than a server running 24/7.
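The Lambda-versus-always-on comparison is back-of-the-envelope math. All prices below are assumed illustrative figures — check current Lambda and EC2 pricing before deciding:

```python
# Sketch: sporadic workload (~10 requests/hour) on Lambda vs a 24/7 server.
# All rates are assumed illustrative figures.
LAMBDA_PER_GB_SEC = 0.0000166667  # $, assumed
LAMBDA_PER_REQUEST = 0.0000002    # $, assumed
SERVER_PER_HOUR = 0.0416          # $/hr, assumed small instance

requests_per_month = 10 * 24 * 30            # ~10 requests/hour
gb_seconds = requests_per_month * 0.2 * 0.5  # 200 ms at 512 MB each

lambda_cost = (gb_seconds * LAMBDA_PER_GB_SEC
               + requests_per_month * LAMBDA_PER_REQUEST)
server_cost = SERVER_PER_HOUR * 730

print(f"Lambda: ${lambda_cost:.2f}/mo vs always-on server: ${server_cost:.2f}/mo")
```

The gap narrows as traffic grows and steadies; at sustained high request rates, the always-on server usually wins, which is why this is a decision to revisit as usage changes.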
Managed services vs self-hosted: Running your own Kafka cluster costs engineering time. Sometimes a managed service is cheaper when you factor in maintenance. Sometimes it's not. Do the math.
Multi-region sanity check: Do you actually need three regions? Multi-region adds significant cost. For most startups, a single region with availability zone redundancy is plenty.
Database right-sizing: RDS is often the biggest line item. Are you using the right instance class? Could you use Aurora Serverless for variable workloads? Would read replicas let you use a smaller primary?
The Cost Optimization Process
Here's a repeatable process for ongoing cost management:
Week 1: Audit. Identify orphaned resources, over-provisioned instances, and quick wins. Fix the obvious waste.
Week 2-3: Implement tagging. Get visibility into what's costing what.
Week 4: Set up budgets and alerts. Make sure you'll notice unexpected increases.
Ongoing: Monthly reviews. Check the bill, investigate anomalies, right-size based on actual usage.
Quarterly: Evaluate reserved instances and savings plans. Adjust commitments based on actual patterns.
What Not to Cut
Cost optimization has limits. Don't compromise:
- Monitoring and logging: The visibility that helps you catch problems and optimize further
- Backups: Disaster recovery isn't where you save money
- Security: That WAF might seem expensive until you get attacked
- Redundancy you actually need: Single points of failure cost more in outages than in infrastructure
The goal is eliminating waste, not cutting muscle. Save money on things that don't matter so you can spend on things that do.