Cost Optimization (FinOps)
Cloud bills get expensive. This lesson covers the practices and tools that keep spend predictable and right-sized.
Why FinOps Is a Discipline
Cloud bills surprise everyone. Common patterns:
• "We thought we'd spend $5k. We're at $50k."
• "We have $30k of unused reserved instances."
• "Why is networking 40% of our bill?"
• "Nobody knows what this $10k/month service does."
On-prem, infrastructure was budgeted upfront. In the cloud, every line of code can spend money. A junior engineer can accidentally provision a $20k/month database with two clicks.
FinOps is the practice of bringing financial accountability to cloud spend. The big ideas:
• Visibility — know what's spent on what
• Optimization — eliminate waste, right-size, use commitments
• Accountability — every cost has an owner
• Forecasting — predict spend, no surprises
This isn't penny-pinching. It's making sure spend goes to value, not waste.
Visibility — Tagging Discipline
The single highest-leverage FinOps practice: tag every resource.
provider "aws" {
  default_tags {
    tags = {
      Environment = "production"
      Team        = "platform"
      Project     = "core-api"
      CostCenter  = "engineering"
      ManagedBy   = "terraform"
    }
  }
}
Now AWS Cost Explorer can break down spend by team, project, environment.
Required tags for every resource (enforce in policy):
• Environment (prod, staging, dev)
• Team (or cost center)
• Project (or service name)
• Owner (email or team alias)
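One way to enforce the required tags is a small audit over a resource inventory. The sketch below is only an illustration: the inventory format (a list of dicts with `id` and `tags` keys) and the function names are assumptions, not the output of any particular cloud API.

```python
# Required tag keys, taken from the list above.
REQUIRED_TAGS = ("Environment", "Team", "Project", "Owner")


def missing_tags(tags: dict[str, str]) -> list[str]:
    """Return the required tag keys that are absent or blank."""
    return [key for key in REQUIRED_TAGS if not tags.get(key, "").strip()]


def untagged_resources(resources: list[dict]) -> dict[str, list[str]]:
    """Map resource ID to its missing tag keys, for non-compliant resources only."""
    report = {}
    for resource in resources:
        missing = missing_tags(resource.get("tags", {}))
        if missing:
            report[resource["id"]] = missing
    return report


# Hypothetical inventory; in practice this would come from an export or API call.
inventory = [
    {"id": "i-0aaa", "tags": {"Environment": "prod", "Team": "platform",
                              "Project": "core-api", "Owner": "platform@example.com"}},
    {"id": "vol-0bbb", "tags": {"Environment": "dev", "Team": "data"}},
]

print(untagged_resources(inventory))  # {'vol-0bbb': ['Project', 'Owner']}
```

A check like this could run in CI or on a schedule, with the report sent to the owning team rather than failing silently.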
Cloud-native cost tools:
• AWS Cost Explorer — breakdown by service, tag, time
• AWS Cost and Usage Report (CUR) — raw data, queryable in Athena
• GCP Billing reports
• Azure Cost Management
Third-party FinOps platforms (for big spenders):
• Vantage, Cloudability, Datadog Cloud Cost Management
• Open source: Komiser, Infracost, OpenCost
Most teams under $50k/month can get by with cloud-native tools and good tagging. Above that, third-party platforms tend to pay for themselves.
Where Cloud Money Goes
Typical cloud spend breakdown for a SaaS company:
40-60% Compute (EC2, K8s, Lambda)
15-25% Networking (data transfer, NAT, CDN)
10-20% Storage & databases (RDS, S3, EBS)
5-15% Monitoring & observability
5-10% Other services (queues, secrets, etc.)
The biggest wins by category:
Compute:
• Right-sizing (most instances are over-provisioned by 50%+)
• Reserved Instances / Savings Plans for steady baseline
• Spot for batch / fault-tolerant
• Autoscaling — pay only for what you need
Networking:
• Move chatty inter-AZ traffic to same AZ when possible
• VPC endpoints for AWS services (cuts NAT data charges)
• CDN in front of S3 / origins (cuts egress, improves latency)
• Compress responses (gzip / brotli)
Storage:
• S3 lifecycle policies (move to Standard-IA, then Glacier)
• Delete old snapshots
• Consolidate underutilized EBS volumes
• Right-size RDS storage
Observability:
• Sample logs (don't store every request log)
• Tier log retention (hot 7d, warm 30d, cold 1y, then delete)
• Reduce metric cardinality (high-cardinality labels explode cost)
• Negotiate enterprise contracts at scale
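The log retention tiers above (hot 7d, warm 30d, cold 1y, then delete) can be written down as a simple rule. This is a minimal sketch: the function name is hypothetical, and treating each boundary as inclusive is an assumption.

```python
def retention_tier(age_days: int) -> str:
    """Return the storage tier for a log of the given age, per the tiers above."""
    if age_days < 0:
        raise ValueError("age_days must be non-negative")
    if age_days <= 7:
        return "hot"      # fast, expensive storage for recent logs
    if age_days <= 30:
        return "warm"     # cheaper storage, slower queries
    if age_days <= 365:
        return "cold"     # archive storage
    return "delete"       # past one year: remove


for age in (3, 20, 200, 400):
    print(age, retention_tier(age))
```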
Right-Sizing & Commitments
The biggest single win in most cloud bills: instances are over-provisioned.
Tools:
• AWS Compute Optimizer — recommends right sizes based on actual utilization
• GCP Recommender
• Azure Advisor
A typical right-sizing finding:
Instance: i-1234abcd
Type: m5.2xlarge ($0.384/hour)
Avg CPU last 30 days: 4%
Max CPU last 30 days: 18%
Recommendation: m5.large ($0.096/hour)
Savings: $210/month per instance
Multiply that across 50 similar instances and the savings come to roughly $10.5k/month.
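The arithmetic behind that finding can be checked in a few lines. The sketch assumes 730 billable hours per month (8,760 hours per year divided by 12) and uses the hourly prices from the example above; the function name is illustrative.

```python
HOURS_PER_MONTH = 730  # assumption: 8,760 hours per year / 12 months


def monthly_savings(current_hourly: float, recommended_hourly: float,
                    instances: int = 1) -> float:
    """Monthly savings from moving instances to a cheaper hourly rate."""
    return (current_hourly - recommended_hourly) * HOURS_PER_MONTH * instances


# m5.2xlarge -> m5.large, using the prices from the finding above.
per_instance = monthly_savings(0.384, 0.096)
fleet = monthly_savings(0.384, 0.096, instances=50)

print(f"Per instance: ${per_instance:,.2f}/month")  # Per instance: $210.24/month
print(f"50 instances: ${fleet:,.2f}/month")         # 50 instances: $10,512.00/month
```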
Commitments — for workloads you'll DEFINITELY run for the next year or three:
AWS options:
• Reserved Instances — commit to specific instance type, region. 30-72% discount.
• Savings Plans — commit to $/hour, applies to any matching compute. More flexible. 30-66% discount.
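• Spot Instances — run on spare capacity at the current Spot price (no bidding required). Up to 90% discount. Not a commitment: AWS can reclaim instances with a two-minute warning.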
GCP options:
• Committed Use Discounts — commit to vCPU + RAM amounts. 25-55% discount.
• Sustained Use Discounts — automatic for steady usage. 0-30% discount, no commitment.
Approach:
1. Look at last 6 months of usage
2. Identify the steady baseline (e.g., always at least 20 instances of m5.large)
3. Cover that baseline with 1- or 3-year reserved capacity
4. On-demand for variable load above baseline
5. Spot for batch / fault-tolerant work
Reserved coverage of 60-80% is typical for mature workloads. Don't over-commit — flexibility has value.
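Steps 1 to 3 of the approach can be sketched as follows. This is an illustration under stated assumptions: usage is given as a list of instance counts per sample, the "steady baseline" is taken as a low percentile of those counts (5% by default), and coverage is measured as committed instance-hours over total instance-hours. The function names are hypothetical.

```python
import math


def steady_baseline(counts: list[int], percentile: float = 0.05) -> int:
    """Instance count that usage stays at or above for most samples."""
    if not counts:
        raise ValueError("counts must not be empty")
    ordered = sorted(counts)
    index = min(len(ordered) - 1, math.floor(percentile * len(ordered)))
    return ordered[index]


def coverage(counts: list[int], committed: int) -> float:
    """Fraction of total instance-hours covered by the commitment."""
    total = sum(counts)
    covered = sum(min(count, committed) for count in counts)
    return covered / total if total else 0.0


# Hypothetical usage samples (instances running at each sample point).
usage = [20, 22, 25, 30, 40, 28, 21, 20]

baseline = steady_baseline(usage)
print(f"Commit to {baseline} instances")              # Commit to 20 instances
print(f"Coverage: {coverage(usage, baseline):.0%}")   # Coverage: 78%
```

Everything above the baseline stays on-demand or Spot, which keeps the commitment from outliving the workload.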
Cost Anomaly Detection
The day a deploy starts costing 10x more, you want to know fast.
Tools:
• AWS Cost Anomaly Detection — built-in, free
• GCP Billing budgets and alerts
• Vantage, Yotascale — third-party, more granular
Set up alerts:
• Forecast-vs-actual: alert when daily spend > 120% of forecast
• Service-level: alert when any service's spend doubles
• New service alerts: anyone enabling a new high-cost service notifies finance/engineering leadership
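The three alert rules above reduce to simple comparisons. The sketch below assumes spend is available as plain numbers and per-service dicts for two comparable periods; the function names and the sample figures are hypothetical.

```python
def forecast_alert(actual: float, forecast: float, threshold: float = 1.20) -> bool:
    """True when actual daily spend exceeds 120% of forecast (by default)."""
    return forecast > 0 and actual > forecast * threshold


def doubled_services(previous: dict[str, float], current: dict[str, float],
                     factor: float = 2.0) -> list[str]:
    """Services whose spend grew by at least `factor` between periods."""
    return sorted(
        name for name, cost in current.items()
        if previous.get(name, 0) > 0 and cost >= previous[name] * factor
    )


def new_services(previous: dict[str, float], current: dict[str, float]) -> list[str]:
    """Services that show spend now but did not appear in the previous period."""
    return sorted(name for name in current if name not in previous)


# Hypothetical weekly spend per service, in dollars.
last_week = {"EC2": 900.0, "S3": 120.0}
this_week = {"EC2": 950.0, "S3": 260.0, "DataSync": 130.0}

print(forecast_alert(actual=1300.0, forecast=1000.0))  # True
print(doubled_services(last_week, this_week))          # ['S3']
print(new_services(last_week, this_week))              # ['DataSync']
```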
A cautionary example: a team turned on AWS DataSync for a one-time migration and forgot to disable it. Three months later, a cost alert finally caught the recurring $4k/month charge, after roughly $12k of waste. With anomaly detection in place from the start, it could have been caught on the first day.
Other low-hanging fruit:
• Old EBS snapshots: still cost money. Delete after 30 days or move to archive.
• Unattached EBS volumes: you pay for storage that nothing uses.
• Idle Elastic IPs: $3-4/month each.
• Stopped EC2 instances: compute charges stop, but their attached EBS volumes are still billed.
• Forgotten staging environments running 24/7.
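A periodic cleanup report for the first two items might look like the sketch below. The record shapes (dicts with `id`, `created`, and `attached_to` keys) are assumptions made for illustration and do not match any specific API response; the 30-day cutoff comes from the list above.

```python
from datetime import date, timedelta


def stale_snapshots(snapshots: list[dict], today: date,
                    max_age_days: int = 30) -> list[str]:
    """IDs of snapshots created more than `max_age_days` before `today`."""
    cutoff = today - timedelta(days=max_age_days)
    return [snap["id"] for snap in snapshots if snap["created"] < cutoff]


def unattached_volumes(volumes: list[dict]) -> list[str]:
    """IDs of volumes that are not attached to any instance."""
    return [vol["id"] for vol in volumes if not vol.get("attached_to")]


# Hypothetical inventory.
today = date(2024, 6, 30)
snapshots = [
    {"id": "snap-new", "created": date(2024, 6, 20)},
    {"id": "snap-old", "created": date(2024, 3, 1)},
]
volumes = [
    {"id": "vol-used", "attached_to": "i-0aaa"},
    {"id": "vol-orphan", "attached_to": None},
]

print(stale_snapshots(snapshots, today))  # ['snap-old']
print(unattached_volumes(volumes))        # ['vol-orphan']
```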
FinOps as a Practice
A real FinOps practice has these components:
1. Monthly cost reviews
• Engineering leads review their team's spend
• Identify trends, anomalies, action items
• Compare against forecasts
2. Per-team budgets
• Each team has a budget tied to their projects
• Going over budget triggers conversation, not punishment
3. Showback/chargeback
• Showback: each team SEES what they spend (no money moves)
• Chargeback: each team's budget is debited (real money)
• Showback drives behavior change without political conflict
4. Cost-aware engineering
• Engineers consider cost in design (not as primary, but as a factor)
• PRs that significantly increase costs get extra review
• Architecture decisions include cost analysis
5. Continuous optimization
• Quarterly right-sizing
• Annual reserved capacity review
• Ongoing cleanup of unused resources
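Per-team budgets and showback both depend on rolling tagged spend up by team. A minimal sketch, assuming billing line items are available as dicts with a `cost` and a `tags` mapping (a hypothetical shape, not a real billing export format):

```python
from collections import defaultdict


def showback(line_items: list[dict], budgets: dict[str, float]) -> dict[str, dict]:
    """Total spend per Team tag, compared against each team's budget."""
    spend = defaultdict(float)
    for item in line_items:
        # Spend without a Team tag is grouped so that it stays visible.
        team = item.get("tags", {}).get("Team", "untagged")
        spend[team] += item["cost"]

    report = {}
    for team, total in spend.items():
        budget = budgets.get(team)
        report[team] = {
            "spend": round(total, 2),
            "budget": budget,
            "over_budget": budget is not None and total > budget,
        }
    return report


# Hypothetical monthly line items and budgets, in dollars.
items = [
    {"cost": 1200.0, "tags": {"Team": "platform"}},
    {"cost": 800.0, "tags": {"Team": "platform"}},
    {"cost": 450.0, "tags": {"Team": "data"}},
    {"cost": 75.5, "tags": {}},
]
budgets = {"platform": 1500.0, "data": 500.0}

for team, row in showback(items, budgets).items():
    print(team, row)
```

The `untagged` bucket is a deliberate choice: spend that nobody owns is usually the first thing a monthly review should chase down.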
The goal isn't lowest possible spend. It's that every dollar spent is a dollar of value. A team that's doubling revenue and 30% over budget is doing well. A team that's stagnating but cutting costs is shrinking.
The next lesson covers GitOps: running operations work through the same review flow as code.