DevOps & Cloud Engineering / Lesson 33 — Cost Optimization (FinOps)

Cost Optimization (FinOps)

Cloud bills get expensive. This lesson covers the practices and tools that keep spend predictable and right-sized.


Why FinOps Is a Discipline

Cloud bills surprise everyone. Common patterns:
• "We thought we'd spend $5k. We're at $50k."
• "We have $30k of unused reserved instances."
• "Why is networking 40% of our bill?"
• "Nobody knows what this $10k/month service does."

On-premises, infrastructure was budgeted up front. In the cloud, every line of code can spend money. A junior engineer can accidentally provision a $20k/month database with two clicks.

FinOps is the practice of bringing financial accountability to cloud spend. The big ideas:
• Visibility — know what's spent on what
• Optimization — eliminate waste, right-size, use commitments
• Accountability — every cost has an owner
• Forecasting — predict spend, no surprises

This isn't penny-pinching. It's making sure spend goes to value, not waste.


Visibility — Tagging Discipline

The single highest-leverage FinOps practice: tag every resource.

Hcl
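# Tags declared here are applied to every taggable resource this provider creates.
provider "aws" {
  default_tags {
    tags = {
      Environment = "production"
      Team        = "platform"
      Project     = "core-api"
      CostCenter  = "engineering"
      ManagedBy   = "terraform"
      Owner       = "platform@example.com" # placeholder alias, replace with your own
    }
  }
}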

Once these keys are activated as cost allocation tags in the Billing console, AWS Cost Explorer can break down spend by team, project, and environment.

Required tags for every resource (enforce in policy):
• Environment (prod, staging, dev)
• Team (or cost center)
• Project (or service name)
• Owner (email or team alias)
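
A minimal sketch of what enforcing the required tags could look like as a check in CI or a scheduled audit. The inventory shape (a list of dicts with `id` and `tags`) and the function names are assumptions for illustration, not the output of any particular AWS API.

```python
# Required tag keys, taken from the list above.
REQUIRED_TAGS = ("Environment", "Team", "Project", "Owner")


def missing_tags(tags, required=REQUIRED_TAGS):
    """Return the required keys that are absent or blank, in declared order."""
    missing = []
    for key in required:
        value = tags.get(key)
        if value is None or not str(value).strip():
            missing.append(key)
    return missing


def untagged_resources(resources):
    """Map resource id -> missing tag keys, for resources that fail the policy."""
    report = {}
    for resource in resources:
        # Hypothetical inventory shape: {"id": str, "tags": dict or missing}.
        missing = missing_tags(resource.get("tags") or {})
        if missing:
            report[resource["id"]] = missing
    return report


if __name__ == "__main__":
    inventory = [
        {"id": "i-aaa", "tags": {"Environment": "prod", "Team": "platform",
                                 "Project": "core-api", "Owner": "platform@example.com"}},
        {"id": "vol-bbb", "tags": {"Environment": "dev", "Team": " "}},
    ]
    for resource_id, missing in untagged_resources(inventory).items():
        print(f"{resource_id}: missing {', '.join(missing)}")
```

A check like this is most useful when it fails a pipeline or opens a ticket for the resource owner, rather than only printing a report.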

Cloud-native cost tools:
• AWS Cost Explorer — breakdown by service, tag, time
• AWS Cost and Usage Report (CUR) — raw data, queryable in Athena
• GCP Billing reports
• Azure Cost Management

Third-party FinOps platforms (for big spenders):
• Vantage, Cloudability, Datadog Cloud Cost Management
• Open source: Komiser, Infracost, OpenCost

Most teams spending under $50k/month can get by with cloud-native tools plus good tagging. Above that, third-party platforms usually pay for themselves.


Where Cloud Money Goes

Typical cloud spend breakdown for a SaaS company:

Text
40-60%   Compute (EC2, K8s, Lambda)
15-25%   Networking (data transfer, NAT, CDN)
10-20%   Storage & databases (RDS, S3, EBS)
 5-15%   Monitoring & observability
 5-10%   Other services (queues, secrets, etc.)

The biggest wins by category:

Compute:
• Right-sizing (many instances are over-provisioned, often by 50% or more)
• Reserved Instances / Savings Plans for steady baseline
• Spot for batch / fault-tolerant
• Autoscaling — pay only for what you need

Networking:
• Move chatty inter-AZ traffic to same AZ when possible
• VPC endpoints for AWS services (cuts NAT data charges)
• CDN in front of S3 / origins (cuts egress, improves latency)
• Compress responses (gzip / brotli)

Storage:
• S3 lifecycle policies (move to Standard-IA, then Glacier)
• Delete old snapshots
• Consolidate underutilized EBS volumes
• Right-size RDS storage

Observability:
• Sample logs (don't store every request log)
• Tier log retention (hot 7d, warm 30d, cold 1y, then delete)
• Reduce metric cardinality (high-cardinality labels explode cost)
• Negotiate enterprise contracts at scale
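
The log sampling and retention tiers above can be sketched as two small functions. The tier boundaries follow the 7d / 30d / 1y example; treating them as inclusive upper bounds, always keeping server errors, and the 10% default sample rate are assumptions made for this sketch.

```python
import zlib


def retention_tier(age_days):
    """Classify a log record by age: hot (<=7d), warm (<=30d), cold (<=365d), else delete."""
    if age_days < 0:
        raise ValueError("age_days must be non-negative")
    if age_days <= 7:
        return "hot"
    if age_days <= 30:
        return "warm"
    if age_days <= 365:
        return "cold"
    return "delete"


def keep_log(status_code, request_id, sample_percent=10):
    """Decide whether to store a request log.

    Server errors (5xx) are always kept (an assumption of this sketch).
    Other requests are sampled deterministically by hashing the request id,
    so the same request always gets the same decision.
    """
    if status_code >= 500:
        return True
    bucket = zlib.crc32(request_id.encode("utf-8")) % 100
    return bucket < sample_percent


if __name__ == "__main__":
    for age in (1, 14, 200, 400):
        print(age, retention_tier(age))
```

Hash-based sampling keeps or drops every log line of a given request together, which makes the retained logs easier to follow than random per-line sampling.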


Right-Sizing & Commitments

The biggest single win in most cloud bills: instances are over-provisioned.

Tools:
• AWS Compute Optimizer — recommends right sizes based on actual utilization
• GCP Recommender
• Azure Advisor

A typical right-sizing finding:

Text
Instance: i-1234abcd
Type: m5.2xlarge ($0.384/hour)
Avg CPU last 30 days: 4%
Max CPU last 30 days: 18%
Recommendation: m5.large ($0.096/hour)
Savings: $210/month per instance

Multiply across 50 instances and you save roughly $10.5k/month.
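
The arithmetic behind that finding, as a small sketch. It assumes 730 billable hours per month (8,760 hours per year divided by 12) and uses the hourly rates from the example; actual rates vary by region and over time.

```python
HOURS_PER_MONTH = 730  # assumption: 8760 hours per year / 12 months


def monthly_savings(current_hourly, recommended_hourly, instances=1,
                    hours=HOURS_PER_MONTH):
    """Monthly savings from moving `instances` machines to a cheaper hourly rate."""
    return (current_hourly - recommended_hourly) * hours * instances


if __name__ == "__main__":
    per_instance = monthly_savings(0.384, 0.096)
    fleet = monthly_savings(0.384, 0.096, instances=50)
    print(f"per instance: ${per_instance:.2f}/month")   # $210.24/month
    print(f"50 instances: ${fleet:.2f}/month")          # $10512.00/month
```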

Commitments are for workloads you will definitely run for the next one to three years. Spot is listed here for comparison, although it involves no commitment.

AWS options:
• Reserved Instances: commit to a specific instance type and region. 30-72% discount.
• Savings Plans: commit to a $/hour spend that applies to any matching compute. More flexible. 30-66% discount for Compute Savings Plans; EC2 Instance Savings Plans reach up to 72%.
• Spot Instances: run on spare capacity at the current Spot price, with no commitment. 60-90% discount. AWS can reclaim the capacity with a two-minute warning.

GCP options:
• Committed Use Discounts: commit to vCPU and RAM amounts. 25-55% discount.
• Sustained Use Discounts: applied automatically for steady usage. 0-30% discount, no commitment.

Approach:
1. Look at last 6 months of usage
2. Identify the steady baseline (e.g., always at least 20 instances of m5.large)
3. Cover that baseline with 1- or 3-year reserved capacity
4. On-demand for variable load above baseline
5. Spot for batch / fault-tolerant work

Reserved coverage of 60-80% is typical for mature workloads. Don't over-commit — flexibility has value.
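
A rough sketch of steps 2 to 4, assuming you already have hourly instance counts for one instance family. It takes the baseline as the minimum observed count ("always at least N instances") and reports what share of total instance-hours a commitment at that level would cover. The function name and the sample numbers are illustrative only.

```python
def commitment_plan(hourly_counts):
    """Derive a steady baseline and the coverage a commitment at that level gives.

    hourly_counts: running-instance counts, one entry per hour of history.
    """
    if not hourly_counts:
        raise ValueError("need at least one usage sample")
    baseline = min(hourly_counts)                   # always-on floor to commit to
    total_hours = sum(hourly_counts)                # total instance-hours consumed
    committed_hours = baseline * len(hourly_counts)
    coverage = committed_hours / total_hours if total_hours else 0.0
    return {
        "baseline": baseline,
        "on_demand_peak": max(hourly_counts) - baseline,  # most on-demand ever needed
        "coverage": coverage,
    }


if __name__ == "__main__":
    usage = [20, 22, 30, 25, 20, 28]  # made-up sample, not real data
    plan = commitment_plan(usage)
    print(f"commit to {plan['baseline']} instances, "
          f"coverage {plan['coverage']:.0%}, "
          f"up to {plan['on_demand_peak']} on-demand")
```

Using the strict minimum is conservative. A low percentile of the history commits to more capacity at the risk of paying for some idle hours, which is the over-commitment trade-off described above.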


Cost Anomaly Detection

The day a deploy starts costing 10x more, you want to know fast.

Tools:
• AWS Cost Anomaly Detection — built-in, free
• GCP Billing budgets and alerts
• Vantage, Anomalo, Yotascale — third-party, more granular

Set up alerts:
• Forecast-vs-actual: alert when daily spend > 120% of forecast
• Service-level: alert when any service's spend doubles
• New service alerts: anyone enabling a new high-cost service notifies finance/engineering leadership
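
The three alert rules above can be sketched as a single check over daily spend figures. The input shapes (per-service dicts of daily cost) and the message wording are assumptions; in practice the numbers would come from your billing export, and this sketch flags every new service rather than only high-cost ones.

```python
def spend_alerts(actual, forecast, today, baseline,
                 forecast_ratio=1.2, service_ratio=2.0):
    """Return alert messages for one day of spend.

    actual / forecast: total daily spend and its forecast.
    today / baseline: dicts mapping service name -> daily spend.
    """
    alerts = []
    # Rule 1: total spend above 120% of forecast.
    if actual > forecast * forecast_ratio:
        alerts.append(f"daily spend {actual:.2f} exceeds "
                      f"{forecast_ratio:.0%} of forecast {forecast:.2f}")
    for service, cost in sorted(today.items()):
        previous = baseline.get(service)
        if previous is None:
            # Rule 3: a service that was not in the baseline at all.
            alerts.append(f"new service: {service} ({cost:.2f}/day)")
        elif previous > 0 and cost >= previous * service_ratio:
            # Rule 2: an existing service whose spend has doubled.
            alerts.append(f"{service} spend rose from {previous:.2f} to {cost:.2f}")
    return alerts


if __name__ == "__main__":
    for message in spend_alerts(
        actual=1300, forecast=1000,
        today={"EC2": 800, "S3": 370, "DataSync": 130},
        baseline={"EC2": 790, "S3": 180},
    ):
        print(message)
```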

A cautionary story: a team turned on AWS DataSync for a one-time migration and forgot to disable it. Three months later, an automated cost alert finally caught the recurring $4k/month charge. With anomaly detection in place from the start, it could have been caught on the first day.

Other low-hanging fruit:
• Old EBS snapshots: they still cost money. Delete after 30 days or move to archive.
• Unattached EBS volumes: you pay for storage that nothing is using.
• Idle Elastic IPs: $3-4/month each.
• Stopped EC2 instances: their attached EBS volumes are still billed.
• Forgotten staging environments: often left running 24/7.
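
A minimal sketch of a cleanup report for the first two items. It works on plain dicts so it stays runnable without cloud credentials; the field names (`attached_to`, `created`) are hypothetical and would need mapping from whatever inventory source you use.

```python
from datetime import date, timedelta


def cleanup_candidates(volumes, snapshots, today, max_snapshot_age_days=30):
    """List unattached volumes and snapshots older than the retention window."""
    unattached = [v["id"] for v in volumes if not v.get("attached_to")]
    cutoff = today - timedelta(days=max_snapshot_age_days)
    old = [s["id"] for s in snapshots if s["created"] < cutoff]
    return {"unattached_volumes": unattached, "old_snapshots": old}


if __name__ == "__main__":
    report = cleanup_candidates(
        volumes=[{"id": "vol-1", "attached_to": "i-1234abcd"},
                 {"id": "vol-2", "attached_to": None}],
        snapshots=[{"id": "snap-1", "created": date(2024, 5, 1)},
                   {"id": "snap-2", "created": date(2024, 6, 20)}],
        today=date(2024, 6, 30),
    )
    print(report)
```

Treat the output as a list of candidates for review, since some old snapshots are kept deliberately for compliance or recovery.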


FinOps as a Practice

A real FinOps practice has these components:

1. Monthly cost reviews
   • Engineering leads review their team's spend
   • Identify trends, anomalies, action items
   • Compare against forecasts
2. Per-team budgets
   • Each team has a budget tied to their projects
   • Going over budget triggers conversation, not punishment
3. Showback/chargeback
   • Showback: each team SEES what they spend (no money moves)
   • Chargeback: each team's budget is debited (real money)
   • Showback drives behavior change without political conflict
4. Cost-aware engineering
   • Engineers consider cost in design (not as primary, but as a factor)
   • PRs that significantly increase costs get extra review
   • Architecture decisions include cost analysis
5. Continuous optimization
   • Quarterly right-sizing
   • Annual reserved capacity review
   • Ongoing cleanup of unused resources
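
Per-team budgets and showback (items 2 and 3) can be sketched as a simple aggregation. The line-item shape (`team`, `cost`) and the budget figures are made up for illustration; real inputs would come from tagged billing data.

```python
from collections import defaultdict


def showback(line_items, budgets):
    """Aggregate spend per team and compare it with each team's budget.

    line_items: dicts with a "cost" and an optional "team" (from the Team tag).
    budgets: dict mapping team name -> monthly budget.
    """
    spend = defaultdict(float)
    for item in line_items:
        # Spend without a Team tag is surfaced explicitly instead of being dropped.
        spend[item.get("team") or "untagged"] += item["cost"]
    report = []
    for team in sorted(spend):
        budget = budgets.get(team)
        report.append({
            "team": team,
            "spend": round(spend[team], 2),
            "budget": budget,
            "over_budget": budget is not None and spend[team] > budget,
        })
    return report


if __name__ == "__main__":
    items = [
        {"team": "platform", "cost": 1200.0},
        {"team": "platform", "cost": 300.5},
        {"team": "data", "cost": 900.0},
        {"team": None, "cost": 50.0},
    ]
    for row in showback(items, {"platform": 1400, "data": 1000}):
        print(row)
```

The "untagged" row is worth watching in its own right: if it grows, tagging discipline is slipping and the per-team numbers become less trustworthy.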

The goal isn't lowest possible spend. It's that every dollar spent is a dollar of value. A team that's doubling revenue and 30% over budget is doing well. A team that's stagnating but cutting costs is shrinking.

The next lesson covers GitOps: how operational changes flow through the same review process as code.

