CI/CD: Automating Build and Canary Deploy With Rollback

Kai

Across sixteen articles, every change was deployed by hand: type sam deploy, wait, check. That works while building, but production needs an automated, safe process: every change is built and validated first, then deployed in a way that won't break the service if something is wrong. This article builds that loop and hits two real traps along the way.

Goal

Set up CI with GitHub Actions so every push is built and validated automatically, and set up safe CD with a CodeDeploy canary: shift traffic to the new version gradually, and roll back automatically if an error alarm fires. The CI part runs for real on GitHub, and the canary part runs for real on AWS. Cost is negligible.

CI: build and validate on every push

CI handles integration: on every push, a clean machine rebuilds and validates, so errors surface immediately. The GitHub Actions workflow for this needs no AWS credentials, because sam validate --lint runs offline using cfn-lint:

name: CI
on:
  push: { branches: [main] }
  pull_request:
env:
  AWS_DEFAULT_REGION: ap-southeast-1
jobs:
  build-and-validate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with: { node-version: 22 }
      - run: npm ci
      - run: npm install -g esbuild
      - uses: aws-actions/setup-sam@v2
        with: { use-installer: true }
      - run: sam build
      - run: sam validate --lint

Trap one: esbuild not on PATH

The first CI run failed right at sam build:

Build Failed
Error: ... Esbuild Failed: Cannot find esbuild. esbuild must be installed on the
host machine to use this feature.

Locally the build runs fine, but in CI it fails. The reason: the functions' CodeUri points at the src directory, which has no package.json of its own, so SAM doesn't find the esbuild installed in the root node_modules. Locally it happens to find it; on a clean runner it doesn't. The fix is to install esbuild globally in CI, which is what the npm install -g esbuild line in the workflow does. After that CI goes green:

✓ build-and-validate in 23s
  ✓ npm ci
  ✓ npm install -g esbuild
  ✓ SAM build
  ✓ SAM validate (lint, offline)
✓ main CI · success

This is the kind of environment difference only CI surfaces, and it's also why you want CI: it runs on a clean machine that resembles production more than your own does.
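As a possible alternative to the global install, here is a minimal sketch that puts the esbuild binary already installed by npm ci on the runner's PATH. It assumes esbuild is listed in the root package.json (so npm ci places it in node_modules/.bin) and that SAM picks esbuild up from PATH the same way it does after a global install; it has not been run against this project.

```yaml
      - run: npm ci
      # Assumption: esbuild is a dependency in the root package.json, so
      # npm ci has already placed the binary in node_modules/.bin.
      # Appending to $GITHUB_PATH adds that directory to PATH for later steps.
      - run: echo "$GITHUB_WORKSPACE/node_modules/.bin" >> "$GITHUB_PATH"
      - uses: aws-actions/setup-sam@v2
        with: { use-installer: true }
      - run: sam build
```

The upside of this variant would be that CI builds with the esbuild version pinned in the lockfile, rather than whatever version a global install resolves to that day.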

CD: deploy canary, not all at once

So far in the series, each deploy has replaced the whole function at once. For production, it's safer to go canary: put the new version in front of a small slice of traffic first, watch, then shift the rest. SAM does this through CodeDeploy via DeploymentPreference, tied to an auto-published alias and an alarm for rollback:

ResolveLinkFunction:
  Type: AWS::Serverless::Function
  Properties:
    # other properties (handler, runtime, CodeUri, events) omitted
    FunctionName: !Sub "${AWS::StackName}-resolve"
    AutoPublishAlias: live
    DeploymentPreference:
      Type: Canary10Percent5Minutes
      Alarms:
        - !Ref ResolveErrorAlarm
AutoPublishAlias: live makes each deploy publish a new version and point the live alias at it. DeploymentPreference tells CodeDeploy to shift 10% of traffic to the new version, hold for 5 minutes, then shift the remaining 90% if nothing goes wrong. Alarms is the rollback condition: if ResolveErrorAlarm (set up in Article 14) goes into ALARM during that window, CodeDeploy automatically returns traffic to the old version.
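For comparison, SAM accepts other predefined values for Type. The sketch below shows the canary setting used here next to two alternatives, left as comments; which one fits depends on how long you want the observation window to be, and the choice of alternatives shown is only illustrative.

```yaml
ResolveLinkFunction:
  Type: AWS::Serverless::Function
  Properties:
    AutoPublishAlias: live
    DeploymentPreference:
      # Canary: one 10% step, a 5-minute wait, then all remaining traffic.
      Type: Canary10Percent5Minutes
      # Linear alternative: 10% more every minute until 100%.
      # Type: Linear10PercentEvery1Minute
      # No gradual shift at all (still goes through CodeDeploy):
      # Type: AllAtOnce
      Alarms:
        - !Ref ResolveErrorAlarm
```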

Trap two: circular dependency

The declaration above creates a cycle: the function references the alarm (via Alarms), and the alarm references the function (via a FunctionName dimension set to !Ref ResolveLinkFunction). CloudFormation rejects the template because of the circular dependency. The way to cut it is to set an explicit FunctionName on the function, then have the alarm's dimension use that name as a fixed string instead of !Ref:

# on the alarm:
Dimensions:
  - Name: FunctionName
    Value: !Sub "${AWS::StackName}-resolve"

Now the alarm no longer depends on the function resource, the cycle is broken, and the stack deploys.
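To show where that dimension sits, here is a sketch of what the whole alarm resource might look like. The real alarm was defined in Article 14 and is not reproduced here, so the period, threshold, and missing-data settings below are assumptions chosen for illustration only.

```yaml
ResolveErrorAlarm:
  Type: AWS::CloudWatch::Alarm
  Properties:
    Namespace: AWS/Lambda
    MetricName: Errors
    Dimensions:
      - Name: FunctionName
        # Built from the stack name only: no !Ref to the function,
        # so CloudFormation sees no dependency from alarm to function.
        Value: !Sub "${AWS::StackName}-resolve"
    Statistic: Sum
    Period: 60                      # assumed: one-minute buckets
    EvaluationPeriods: 1            # assumed
    Threshold: 1                    # assumed: any error in a bucket trips the alarm
    ComparisonOperator: GreaterThanOrEqualToThreshold
    TreatMissingData: notBreaching  # assumed: no traffic is not treated as an error
```

The string only matches the function because the function's FunctionName uses the same !Sub expression, so the two must be kept in sync by hand.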

Canary running for real

The first deploy that adds AutoPublishAlias only creates the alias, no canary yet. The next deploy (with a code change) is what triggers the canary. At that point CodeDeploy creates a deployment that shifts traffic:

$ aws deploy get-deployment --deployment-id d-K0GQEAB5J \
    --query 'deploymentInfo.{status:status,config:deploymentConfigName}'
{
    "status": "InProgress",
    "config": "CodeDeployDefault.LambdaCanary10Percent5Minutes"
}

While the canary runs, the service keeps serving normally. A request to open a link still returns 301, because both the old and the new version are alive and traffic is split between them by ratio:

$ curl -s -o /dev/null -w '%{http_code}' "$API/$CODE"
301

   push code ──▶ GitHub Actions CI (build + validate)  ──▶ (green) ──▶ deploy
                                                                        │
                                              CodeDeploy canary         ▼
                                   ┌────────────────────────────────────────┐
                                   │ 10% traffic ──5 min──▶ 100% traffic      │  (ok)
                                   │      │                                   │
                                   │      └── ResolveErrorAlarm ALARM ──▶ rollback to old version
                                   └────────────────────────────────────────┘

The deploy here is benign (the code only changes one log line), so the canary runs the full 5 minutes, shifts to 100%, and no rollback fires. We're proving the normal path, not deliberately breaking things to trigger a rollback. The rollback mechanism is wired in either way: if the error rate climbs during those 5 minutes and ResolveErrorAlarm fires, CodeDeploy doesn't shift the rest of the traffic but pulls it all back to the old version, so only about 10% of requests hit the broken version for a short time instead of all of them. The difference is having an observation window and an automatic way back, instead of switching straight to 100% and only then finding out it's broken.

What about deploying from CI

The workflow above stops at build and validate, with no deploy, to keep the CI part credential-free. The deploy step (full CD) would add a job that uses OIDC to obtain temporary AWS credentials instead of storing long-lived access keys in GitHub: GitHub Actions presents an OIDC token to sts:AssumeRoleWithWebIdentity, receives short-lived credentials for an IAM role, then runs sam deploy. This way no long-lived secret key lives in the repo or in the CI secrets. Here we still deploy from a local machine to keep the loop tight for acceptance, but the upgrade path to automatic deploy via OIDC is one added job, not a rebuild.
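A sketch of what that added job could look like, placed under jobs: in the same workflow. It is not part of this article's setup: the role ARN and account ID are hypothetical placeholders, the IAM role is assumed to already trust GitHub's OIDC provider for this repository, and sam deploy is assumed to read the stack name and other settings from a samconfig.toml in the repo.

```yaml
  deploy:
    needs: build-and-validate          # only deploy after CI is green
    if: github.event_name == 'push' && github.ref == 'refs/heads/main'
    runs-on: ubuntu-latest
    permissions:
      id-token: write                  # lets the job request an OIDC token
      contents: read
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with: { node-version: 22 }
      - run: npm ci
      - run: npm install -g esbuild
      - uses: aws-actions/setup-sam@v2
        with: { use-installer: true }
      - uses: aws-actions/configure-aws-credentials@v4
        with:
          # Hypothetical role; it must trust token.actions.githubusercontent.com
          role-to-assume: arn:aws:iam::123456789012:role/github-deploy
          aws-region: ap-southeast-1
      - run: sam build
      # Assumes samconfig.toml supplies the stack name and capabilities
      - run: sam deploy --no-confirm-changeset --no-fail-on-empty-changeset
```

The permissions block is the part that is easy to miss: without id-token: write, the job cannot request the OIDC token, and the credentials step fails before sam deploy runs.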

🧹 Cleanup

The demo links and user were deleted after checking; CI, the alias, and the canary configuration are part of the stack, so they stay.

Wrap-up

The process is now automated and safer. CI builds and validates every push on a clean machine, and that clean machine is exactly what surfaced the esbuild trap a personal machine doesn't hit. CD uses canary via CodeDeploy to shift traffic gradually and roll back automatically on an alarm, with the circular dependency trap cut by an explicit function name. The principle here: no change goes straight to 100% without build, validation, and an observation window with a way back.

There's one question everything serverless eventually has to answer: how much does it cost? The next article breaks down this product's bill service by service, showing what falls inside the free tier, what actually costs money, and where several design choices across the series saved money.