DevOps & Cloud Engineering / Lesson 30 — Deployment Strategies in Practice

Deployment Strategies in Practice

Blue-green, canary, feature flags — picking the right strategy for the risk and the team.


Why Strategy Matters

Module 7 introduced deployment strategies conceptually. This lesson is about choosing and implementing the right one for your situation.

The honest truth: most teams should start simple and add complexity only as they hit the limits of simpler approaches.

Progression for a typical team:
1. Manual deploys, occasional downtime — fine for early-stage startups
2. Rolling deploys, no downtime — fine for most established apps
3. Canary deploys with metric-based gates — when stakes are high
4. Blue-green for stateful services — when migrations are risky
5. Feature flags for fine-grained control — when you ship multiple times a day

Each step has its own complexity. Don't skip ahead unless you've earned the need.


Blue-Green Done Right

Blue-green keeps two identical environments. Blue (live) serves traffic; Green (idle) runs the new version. Switch traffic from Blue to Green; if it works, you're done; if not, switch back instantly.

Text
Before:
  Internet ──► LB ──► Blue (v1, live)     Green (v2, deploying)

Cutover:
  Internet ──► LB ──► Blue (v1, idle)     Green (v2, live)

Rollback (instant):
  Internet ──► LB ──► Blue (v1, live)     Green (v2, kept briefly)

How to do it on AWS:
• Two target groups behind one ALB
• Listener rules direct traffic to one or the other
• Use AWS CodeDeploy to automate the switch with health validation
• Or use weighted routing in Route 53 for DNS-level cutover
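The ALB cutover can be sketched with boto3. This is a minimal sketch, not a full deploy pipeline: the listener and target-group ARNs are placeholders, and health validation (which CodeDeploy would handle for you) is omitted.

```python
# Hedged sketch: flip an ALB listener's default action from Blue to Green.
# ARNs below are placeholders, not real resources.

def forward_action(target_group_arn: str) -> dict:
    """Build the ALB 'forward' default action for a target group."""
    return {"Type": "forward", "TargetGroupArn": target_group_arn}

def cut_over(listener_arn: str, green_tg_arn: str) -> None:
    import boto3  # deferred import: only needed when actually cutting over
    elbv2 = boto3.client("elbv2")
    elbv2.modify_listener(
        ListenerArn=listener_arn,
        DefaultActions=[forward_action(green_tg_arn)],
    )
    # Rollback is the identical call with the Blue target group's ARN.
```

The rollback being "the same call with the other ARN" is exactly why blue-green gives you an instant flip.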

Pros:
• Instant rollback (just flip)
• Validate Green fully before traffic flips
• No partial-state issues

Cons:
• Doubles infrastructure cost during deploys
• Database migrations are tricky (Blue and Green share one database, so the schema must work with both versions)
• Stateful sessions may not survive cutover

Best for: critical workloads where instant rollback is essential.


Canary Deploys with Metrics

Canary sends a small percentage of traffic to the new version, gradually increasing if metrics stay healthy.

The flow:
1. Deploy v2 alongside v1
2. Send 1% of traffic to v2
3. Monitor: error rate, latency, business metrics
4. Healthy after 5 minutes → 10% to v2
5. Healthy after 5 minutes → 50% to v2
6. Healthy after 5 minutes → 100% to v2
7. Unhealthy at any step → rollback to 100% v1
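The seven steps above boil down to a small promotion loop. A minimal Python sketch, where `error_rate()` and `set_traffic_weight()` are placeholders for a real metrics query and a real routing update:

```python
import time

STEPS = [1, 10, 50, 100]  # traffic percentages for v2

def error_rate() -> float:
    # Placeholder: in practice, query your metrics backend (e.g. Prometheus)
    return 0.001

def set_traffic_weight(percent: int) -> None:
    # Placeholder: in practice, update load-balancer weights or mesh routing
    print(f"v2 now receives {percent}% of traffic")

def run_canary(threshold: float = 0.01, bake_seconds: int = 300) -> bool:
    """Promote step by step; roll back to 100% v1 on any unhealthy reading."""
    for percent in STEPS:
        set_traffic_weight(percent)
        time.sleep(bake_seconds)  # let metrics accumulate at this weight
        if error_rate() > threshold:
            set_traffic_weight(0)  # rollback: all traffic back to v1
            return False
    return True
```

Tools like Argo Rollouts and Flagger implement exactly this loop, declaratively, so you don't maintain it by hand.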

Tools that automate this:
• Argo Rollouts (Kubernetes) — declarative canary with metric analysis
• Flagger (Kubernetes) — same idea, integrates with service meshes
• AWS CodeDeploy — traffic shifting for ECS and Lambda

Argo Rollouts example with Prometheus analysis:

YAML
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: myapp
spec:
  strategy:
    canary:
      steps:
        - setWeight: 10
        - pause: { duration: 5m }
        - setWeight: 50
        - pause: { duration: 10m }
        - setWeight: 100
      analysis:
        templates:
          - templateName: success-rate
        startingStep: 1

If success rate drops below threshold, the rollout halts and rolls back automatically. Engineers wake up to "canary failed at 10%, rolled back" — not "production is down."
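The `success-rate` template referenced above might look like the following. This is a sketch: the Prometheus address, service label, and 95% threshold are assumptions you'd adapt to your own metrics.

```yaml
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: success-rate
spec:
  metrics:
    - name: success-rate
      interval: 1m
      # Fail the rollout if the ratio of non-5xx responses drops below 95%
      successCondition: result[0] >= 0.95
      failureLimit: 3
      provider:
        prometheus:
          address: http://prometheus.monitoring:9090  # assumed in-cluster address
          query: |
            sum(rate(http_requests_total{service="myapp",status!~"5.."}[2m]))
            /
            sum(rate(http_requests_total{service="myapp"}[2m]))
```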

Best for: high-traffic services, anything user-facing where bugs have real cost.


Feature Flags — Decoupling Deploy from Release

Feature flags ship the code but hide the feature behind a runtime toggle.

Python
if feature_flag("new_checkout_flow", user_id=user.id):
    return new_checkout(cart)
else:
    return legacy_checkout(cart)

The toggle can be:
• ON for everyone
• OFF for everyone
• ON for specific users (early access, beta testers)
• ON for X% of users (percentage rollout)
• ON in specific environments
• ON for users matching criteria (paid plan, country)
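Percentage rollouts need to be sticky: a user who saw the feature at 10% shouldn't lose it at 50%. A minimal sketch of the standard trick, hashing the (flag, user) pair into a stable bucket — commercial SDKs use the same idea:

```python
import hashlib

def in_rollout(flag: str, user_id: str, percent: int) -> bool:
    """Deterministic percentage bucketing: the same user always lands in the
    same bucket for a given flag, so rollout membership is sticky."""
    digest = hashlib.sha256(f"{flag}:{user_id}".encode()).hexdigest()
    bucket = int(digest, 16) % 100  # stable bucket in [0, 100)
    return bucket < percent
```

Because a bucket below 10 is also below 50, raising the percentage only ever adds users — it never flips anyone off.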

Powers this gives you:
• Deploy code Friday afternoon, enable feature Monday morning
• Roll out to 1% → 10% → 100% gradually
• Kill a bad feature instantly without redeploy
• A/B test variants
• Per-customer rollouts

Tools:
• LaunchDarkly — most popular commercial
• Flagsmith, Unleash, Split — alternatives
• OpenFeature — open standard
• Self-built — for simple cases

Watch out for:
• Flag debt — flags accumulate. Each flag = code complexity. Remove old flags ruthlessly.
• Performance — every check is a function call (or worse, a network call). Use an SDK that caches flag state locally.
• Testing — every combination of flags multiplies test scenarios.

Mature pattern:
1. Add a flag for any user-visible change
2. Deploy with flag OFF
3. Test in production with flag ON for internal users
4. Roll out gradually (1% → 10% → 50% → 100%)
5. Once stable for a week, remove the flag and the old code path


Database Migrations Without Downtime

The hardest deployments involve database changes, because old and new app versions must coexist with the same schema during the rollout.

The expand-contract pattern (mentioned in Module 7, expanded here):

To rename email to email_address:

Phase 1: Expand

SQL
ALTER TABLE users ADD COLUMN email_address VARCHAR(255);
-- Backfill (batch this UPDATE on large tables to avoid long locks)
UPDATE users SET email_address = email;

Both columns exist. App still uses email.

Phase 2: Dual-write
Deploy code that WRITES to both columns, READS from email. Old app instances still work.

Phase 3: Read from new
Deploy code that READS from email_address. WRITES still go to both.

Phase 4: Stop writing old
Deploy code that only writes to email_address. Old column ignored.

Phase 5: Contract

SQL
ALTER TABLE users DROP COLUMN email;

This is multi-week work for one column rename. But it's zero-downtime and rollback-safe at every phase.
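The dual-write phase is the subtle one. A minimal sketch using sqlite3 for illustration (table and column names match the example above; a real app would do this inside its data-access layer):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT, email_address TEXT)"
)
conn.execute("INSERT INTO users (id, email) VALUES (1, 'old@example.com')")

def save_email(conn, user_id: int, email: str) -> None:
    # Phase 2: write BOTH columns so old and new app versions read correct data
    conn.execute(
        "UPDATE users SET email = ?, email_address = ? WHERE id = ?",
        (email, email, user_id),
    )

def load_email(conn, user_id: int) -> str:
    # Phase 2 still reads the old column; Phase 3 flips this SELECT
    # to email_address without touching the write path
    return conn.execute(
        "SELECT email FROM users WHERE id = ?", (user_id,)
    ).fetchone()[0]

save_email(conn, 1, "new@example.com")
```

Each phase changes only one side (reads or writes), which is what makes every step individually safe to roll back.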

Other tricky migrations:
• Adding a NOT NULL column → first add nullable, backfill, then add the constraint
• Dropping a column → first stop reading, then stop writing, then drop
• Splitting a table → dual-write, gradual migration
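The NOT NULL case, sketched in SQL — the `phone` column is a hypothetical example, and the exact `ALTER` syntax for adding the constraint varies by engine (this is the PostgreSQL form):

```sql
-- Step 1: add the column as nullable (no table rewrite on most engines)
ALTER TABLE users ADD COLUMN phone VARCHAR(20);

-- Step 2: backfill in batches to avoid long locks
UPDATE users SET phone = '' WHERE phone IS NULL AND id BETWEEN 1 AND 10000;
-- ...repeat per batch until no NULLs remain...

-- Step 3: only now add the constraint
ALTER TABLE users ALTER COLUMN phone SET NOT NULL;
```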

Tools:
• Migration runners (Flyway, Liquibase, framework-native ones)
• Online schema change tools (gh-ost, pt-online-schema-change for MySQL)

The discipline: NEVER write a migration that requires a specific app version to be running. Apps and schemas evolve independently.


Environment Strategy

How many environments? What does each do?

Common setups:

Three environments (most teams):
• development — engineers' personal sandboxes, possibly local + cloud
• staging / pre-prod — production-like, integration testing, demos
• production — the real thing

Sometimes useful:
• QA — manual testing
• UAT — user acceptance testing
• Performance — load testing
• DR — disaster recovery (warm standby)

Per-PR ephemeral environments — deploy each PR to its own environment for review:
• Vercel and Netlify do this automatically for frontends
• Render and Railway have native support
• Custom setups: Terraform + GitHub Actions

For staging to be useful, it must MIRROR production:
• Same architecture
• Similar (anonymized) data
• Same monitoring
• Same deploy process

Anti-pattern: "staging works fine, but production has different behavior." Either staging isn't production-like enough OR production has manual changes that aren't reflected anywhere. Either way, fix it.

Avoid letting staging rot. Treat it as production for a smaller user base. If staging is broken for a week and nobody notices, it's not actually testing anything.

