Progressive Delivery + Alembic
Rolling updates, blue-green, and canary deployments shift application traffic gradually. They do not split your production database into safe isolated copies. In most services, the database remains one shared singleton, which means old and new application versions must coexist against the same schema for some period of time. In that situation, Alembic is not merely a DDL runner. It is part of the compatibility contract between rollout phases.
Quick takeaway: in shared-database systems, the main deployment question is not only "how do we shift traffic?" but "how long can old and new app versions tolerate the same schema?". The safest default is `expand migration -> compatibility app deploy -> resumable backfill -> progressive traffic shift -> feature flag cutover -> later contract migration`.
1) First truth to accept: the database is usually shared
Blue-green or canary may give you multiple application versions, but most production systems still have one primary database schema.
That makes these questions central:
- is old app plus expanded schema safe?
- is new app plus expanded schema safe?
- at what point does old app plus contracted schema become impossible?
Traffic strategy and schema compatibility are different concerns.
2) How to use Alembic in CI
A common mistake is to stop after verifying that a revision file exists. In practice CI should go further.
Baseline CI checks
- review `alembic revision --autogenerate` output manually
- verify `alembic upgrade head` on an ephemeral database
- ideally replay the upgrade from the current production head to the candidate head
- run application tests against the upgraded schema
- confirm whether the change is destructive, requires backfill, or needs rollout splitting
Questions CI should answer
- is this really a rename, or did it become drop plus add?
- did autogenerate miss index or constraint intent?
- is a contract step hidden inside the same release?
- is the data migration small enough for an Alembic revision, or does it need a separate job?
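The destructiveness question can be partially automated. Below is a minimal sketch, not an Alembic API: a hypothetical `scan_revision_source` helper that scans a revision file's upgrade body for operations that usually signal a drop-plus-add rename or a hidden contract step. The pattern list is an illustrative assumption, not an exhaustive rule set.

```python
import re

# Operations that usually indicate a destructive or contract-phase change.
# This pattern list is an assumption for illustration, not an exhaustive policy.
DESTRUCTIVE_PATTERNS = [
    r"\bop\.drop_table\b",
    r"\bop\.drop_column\b",
    r"\bop\.drop_constraint\b",
    r"\bop\.alter_column\b.*nullable\s*=\s*False",
]


def scan_revision_source(source: str) -> list[str]:
    """Return the destructive-operation patterns found in a revision file's text."""
    return [p for p in DESTRUCTIVE_PATTERNS if re.search(p, source)]


# Example: a "rename" that autogenerate emitted as add + drop should fail the gate.
revision_body = """
def upgrade():
    op.add_column("users", sa.Column("display_name", sa.String(), nullable=True))
    op.drop_column("users", "full_name")
"""
assert scan_revision_source(revision_body)  # flagged as destructive
```

A CI job could run this over every new revision file and require a human approval label whenever the scan returns hits.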
3) Default CD shape: separate migration jobs from deploy jobs
The safest baseline looks like this.
Step 1. expand migration job
- add nullable columns
- add additive indexes, tables, or constraints
- apply only changes that do not break the old application
Step 2. compatibility app deploy
- deploy a version that understands both old and new schema shapes
- prepare dual read, dual write, or flags with the feature still off
- do not contract yet
Step 3. backfill job
- run larger data migrations as a separate job or worker
- include chunked transactions, checkpoints, retry, and metrics
Step 4. progressive traffic shift
- use rolling update, blue-green, canary, or Lambda alias traffic shifting
- schema still needs to be tolerated by both old and new versions
Step 5. cutover
- switch reads or writes to the new column via feature flag or config
- watch metrics, error rate, and lag
Treat cutover as an explicit gate, not just a config flip.
- confirm rows where `new_column IS NULL` are effectively zero
- confirm old/new representation mismatch queries are within threshold
- confirm app error rate, p95/p99 latency, and downstream consumer lag are stable
- confirm you can still roll back immediately by flipping flags or routing, without touching schema
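The cutover checks above can be encoded as one pass/fail evaluation. The sketch below assumes hypothetical hooks (`count_remaining_nulls`, `count_mismatches`, `error_rate`) that you would wire to real queries and metrics; the class and thresholds are illustrative, not part of any framework.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class CutoverGate:
    """Pass/fail evaluation of the cutover checklist.

    The callables are hypothetical hooks; wire them to real queries and metrics.
    """

    count_remaining_nulls: Callable[[], int]
    count_mismatches: Callable[[], int]
    error_rate: Callable[[], float]

    def ready(
        self, mismatch_threshold: int = 0, max_error_rate: float = 0.01
    ) -> tuple[bool, list[str]]:
        failures = []
        if self.count_remaining_nulls() > 0:
            failures.append("backfill incomplete: new_column IS NULL rows remain")
        if self.count_mismatches() > mismatch_threshold:
            failures.append("old/new representation mismatch above threshold")
        if self.error_rate() > max_error_rate:
            failures.append("app error rate above budget")
        return (not failures, failures)


# Example: the gate passes only when every check is green.
gate = CutoverGate(lambda: 0, lambda: 0, lambda: 0.001)
ok, reasons = gate.ready()
```

Returning the failure reasons, not just a boolean, keeps the gate auditable in CI logs and approval comments.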
Step 6. later contract migration
- remove old columns or constraints only after the old app is fully gone
- usually in a later release after stabilization
Rollback also changes meaning across these stages.
- before or right after cutover, rollback usually means app rollback, traffic-promotion pause, or feature-flag off
- after contract, a literal schema downgrade is often less realistic than a forward fix
- that is why the observation window before contract is the last easy rollback boundary
4) A good GitHub Actions gate layout
For shared-database systems, it is often a mistake to run database changes and app deployment as one unprotected job.
Splitting GitHub Actions environments lets you separate approval and protection for DB and app changes.
```yaml
name: deploy
on:
  push:
    branches: [main]
jobs:
  test-and-build:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: uv sync --dev
      - run: uv run pytest
      - run: uv run ruff check .
      - run: uv run ty check
  migrate-expand:
    needs: test-and-build
    runs-on: ubuntu-latest
    environment: production-db
    steps:
      - uses: actions/checkout@v4
      - run: uv sync --dev
      - run: uv run alembic upgrade head
  deploy-compatible:
    needs: migrate-expand
    runs-on: ubuntu-latest
    environment: production-app
    steps:
      - uses: actions/checkout@v4
      - run: ./scripts/deploy-compatible.sh
  backfill:
    needs: deploy-compatible
    runs-on: ubuntu-latest
    environment: production-db
    steps:
      - uses: actions/checkout@v4
      - run: uv sync --dev
      - run: uv run python -m app.jobs.backfill_display_name --batch-size 1000
  promote-traffic:
    needs: backfill
    runs-on: ubuntu-latest
    environment: production-app
    steps:
      - uses: actions/checkout@v4
      - run: ./scripts/promote-traffic.sh
```

The important idea is separating `production-db` from `production-app`. Database approvals, app approvals, and backfill timing are often different concerns.
5) When backfill belongs inside Alembic and when it does not
| Case | Inside Alembic revision | Separate backfill job |
|---|---|---|
| small data fix that completes in seconds | yes | optional |
| long-running update with batching and resume needs | no | yes |
| lock-sensitive change that needs throttling | no | yes |
| production migration that needs live metrics watching | no | yes |
The distinction is between a small schema-adjacent data fix and a real operational migration job.
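One way to make the table's decision explicit is a small helper in the release checklist. The function name and the 10-second threshold below are illustrative assumptions, not a rule Alembic enforces:

```python
def backfill_placement(
    estimated_duration_s: float,
    needs_resume: bool,
    lock_sensitive: bool,
    needs_live_metrics: bool,
) -> str:
    """Rough rule of thumb mirroring the table above; the threshold is an assumption."""
    # Any operational property forces a separate job, regardless of size.
    if needs_resume or lock_sensitive or needs_live_metrics:
        return "separate job"
    # Only tiny schema-adjacent fixes belong inside a revision transaction.
    if estimated_duration_s <= 10:
        return "alembic revision"
    return "separate job"
```

Encoding the decision this way forces authors to state duration and lock assumptions up front, where reviewers can challenge them.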
6) Properties of a good backfill job
1. Idempotent
Reprocessing already-filled rows should be safe.
```sql
UPDATE users
SET display_name = full_name
WHERE display_name IS NULL
  AND id BETWEEN :start_id AND :end_id
```

2. Resumable
- keep a checkpoint table or external cursor
- store fields such as `last_processed_id`, `updated_rows`, and `updated_at`
3. Bounded transactions
- do not update millions of rows in one transaction
- commit every small batch
4. Observable
- rows per second
- lag
- remaining null count
- error count
- last cursor
5. Throttled
- batch size or pause intervals should be adjustable from operational signals such as DB CPU, lock pressure, or replica lag
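The resumable and throttled properties can be sketched together. The checkpoint store below uses SQLite for brevity (in production it would live in the primary database or an operational store); the table, function names, and lag threshold are assumptions that mirror the fields listed above.

```python
import sqlite3
import time


def ensure_checkpoint_table(conn: sqlite3.Connection) -> None:
    # Mirrors the fields listed above: last_processed_id, updated_rows, updated_at.
    conn.execute(
        """CREATE TABLE IF NOT EXISTS backfill_checkpoint (
               job_name TEXT PRIMARY KEY,
               last_processed_id INTEGER NOT NULL,
               updated_rows INTEGER NOT NULL,
               updated_at REAL NOT NULL
           )"""
    )


def load_checkpoint(conn: sqlite3.Connection, job_name: str) -> int:
    row = conn.execute(
        "SELECT last_processed_id FROM backfill_checkpoint WHERE job_name = ?",
        (job_name,),
    ).fetchone()
    return row[0] if row else 0


def save_checkpoint(
    conn: sqlite3.Connection, job_name: str, cursor: int, batch_rows: int
) -> None:
    # Upsert: advance the cursor and accumulate the processed-row counter.
    conn.execute(
        """INSERT INTO backfill_checkpoint
               (job_name, last_processed_id, updated_rows, updated_at)
           VALUES (?, ?, ?, ?)
           ON CONFLICT(job_name) DO UPDATE SET
               last_processed_id = excluded.last_processed_id,
               updated_rows = updated_rows + excluded.updated_rows,
               updated_at = excluded.updated_at""",
        (job_name, cursor, batch_rows, time.time()),
    )


def next_batch_size(
    current: int, replica_lag_s: float, *, floor: int = 100, ceil: int = 5000
) -> int:
    """Throttle sketch: halve the batch when lag is high, grow slowly when healthy."""
    if replica_lag_s > 5.0:
        return max(floor, current // 2)
    return min(ceil, current + current // 10)
```

Committing the checkpoint in the same transaction as the batch (then calling `conn.commit()`) is what makes a crash resumable rather than lossy.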
Cutover gates and contract gates are not the same
A common mistake is to bundle traffic promotion and contract migration into one approval step.
- at the cutover gate, check backfill completion, mismatch-query results, error-budget burn, and whether reads and writes are actually using the new path
- at the contract gate, check that old pods, workers, cron jobs, and old-client traffic are truly gone
- in shared-database systems, "most traffic is on the new version" and "the old contract can be deleted" are not the same statement
7) A Python backfill worker baseline
```python
from sqlalchemy import select, update


def run_backfill(session_factory: SessionFactory, batch_size: int = 1000) -> None:
    cursor = load_checkpoint("users_display_name")
    while True:
        with session_factory() as session:
            # Keyset pagination: only rows past the cursor that still need filling.
            rows = session.execute(
                select(User.id, User.full_name)
                .where(User.id > cursor, User.display_name.is_(None))
                .order_by(User.id)
                .limit(batch_size)
            ).all()
            if not rows:
                return
            for user_id, full_name in rows:
                # Re-check IS NULL so concurrent writes are never overwritten.
                session.execute(
                    update(User)
                    .where(User.id == user_id, User.display_name.is_(None))
                    .values(display_name=full_name)
                )
            cursor = rows[-1][0]
            # Checkpoint commits atomically with the batch, so a crash is resumable.
            save_checkpoint(session, "users_display_name", cursor)
            session.commit()
```

This pattern shows keyset cursoring, small transactions, and checkpoint persistence together.
8) Rules for rolling updates
Kubernetes Deployment rolling updates keep old and new ReplicaSets alive together for a while.
That leads to a simple rule set:
- run expand migrations first
- the new app must support dual read or dual write if needed
- contract only after old pods are fully gone
Rolling updates are operationally simple, but they make N-1 and N compatibility requirements the most obvious.
9) Rules for blue-green
Blue-green gives you a preview stack, which is excellent for validation, but it does not give you automatic database isolation.
Safe sequence
- expand migration
- deploy green
- run preview smoke or analysis
- run backfill if needed
- switch traffic
- keep blue alive while post-promotion checks run
- remove blue after stabilization
- contract later
Core misunderstanding to avoid
Blue-green does not automatically create a blue DB and green DB. If the DB is shared, schema still must stay backward compatible.
10) Rules for canary
Canary reduces blast radius for application behavior, but it does not make destructive schema changes safe.
Why
- even 1% canary traffic still reads and writes the shared DB
- stable 99% and canary 1% still touch the same tables and rows
- destructive schema work remains dangerous
What canary is good for
- application behavior analysis
- query cost, latency, and error-rate analysis
- validating feature flags before full promotion
What canary is not good for
- contract migrations that old code cannot tolerate
- lock-heavy rewrites
- "it's only 1%, so destructive DDL is fine" reasoning
11) Lambda weighted alias and CodeDeploy canary follow the same DB rule
AWS Lambda weighted aliases and CodeDeploy canary or linear strategies shift application traffic, not schema compatibility requirements.
Additional Lambda-specific points:
- an alias can point to at most two versions
- at low traffic, actual traffic split can vary meaningfully from the configured percentage
- weighted alias is useful for canarying app code, not for avoiding proper expand/backfill/contract sequencing
So the order is still the same: expand -> compatible code -> backfill -> traffic shift -> contract later.
12) Strategy comparison table
| Strategy | Strength | Key schema rule | Common mistake |
|---|---|---|---|
| rolling update | simplest operationally | old and new pods must coexist safely | putting contract in the same release |
| blue-green | strong preview validation | shared DB still requires backward compatibility | contracting immediately after promotion |
| canary | controlled blast radius, metrics-driven promotion | schema does not become safe just because traffic is small | assuming 1% makes destructive migration safe |
| Lambda alias / CodeDeploy | serverless gradual traffic shift | same DB compatibility rule still applies | mixing weighted routing with destructive DB change |
13) Common mistakes to avoid
- auto-running Alembic during app startup
- putting a huge backfill inside one Alembic revision transaction
- assuming blue-green reduces schema compatibility work
- assuming low canary percentage makes destructive migration acceptable
- running contract immediately after traffic promotion
Good companion chapters in this repository
- Alembic and Zero-Downtime Migrations
- Lambda vs Kubernetes
- Contract Evolution and Sustainable CD
- Deployment and Engine Settings
For runnable intuition, pair this with examples/progressive_delivery_backfill_lab.py.