Blue/Green Deploy With ALB and Automatic Rollback

Kai · 6 min read

Every deploy so far has been in-place: we update the app directly on the running machine, so there's a window where that machine stops serving. With real traffic, that window is downtime. Blue/green eliminates it: instead of modifying the old machines, we stand up a new fleet (green) in parallel with the running fleet (blue), validate green thoroughly, then shift traffic over with a load balancer. If green is broken, traffic never leaves blue, so backing out is instant. This article (closing Part IV) builds an ALB, switches the deployment group to blue/green, runs it for real, and enables automatic rollback.

💰 Cost

Blue/green temporarily runs double the instances (blue + green at the same time), plus an ALB (~$0.0225/hour). This is the most expensive article in Part IV, so terminate everything cleanly right after you finish (see the cleanup section).

Goal

Understand what blue/green solves, build an ALB and switch the deployment group to blue/green, run a real deploy and read the distinctive lifecycle events, then configure automatic rollback.

Load balancer in front of the fleet

Blue/green needs a load balancer to shift traffic between the two fleets. Build an ALB, target group, listener, then attach the ASG (Article 10) to the target group:

$ aws elbv2 create-load-balancer --name awscicd-alb --type application \
    --subnets subnet-aaa subnet-bbb --security-groups <sg-http>
$ aws elbv2 create-target-group --name awscicd-tg --protocol HTTP --port 80 \
    --vpc-id <vpc> --health-check-path / --target-type instance
$ aws elbv2 create-listener --load-balancer-arn <alb> --protocol HTTP --port 80 \
    --default-actions Type=forward,TargetGroupArn=<tg>
$ aws autoscaling attach-load-balancer-target-groups \
    --auto-scaling-group-name awscicd-asg --target-group-arns <tg>

Now users come in through the ALB's DNS; the ALB distributes to healthy instances in the target group.
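
Before moving on, it's worth confirming that the ASG instances actually show up as healthy in the target group. A minimal sketch, assuming the target group ARN is passed in as the first argument; the function names are made up for this example:

```shell
# Print the health state of every registered target, one per line.
# $1 = target group ARN
target_states() {
  aws elbv2 describe-target-health --target-group-arn "$1" \
    --query 'TargetHealthDescriptions[].TargetHealth.State' --output text |
    tr '\t' '\n'
}

# Succeed only if at least one target is registered and all are "healthy".
all_targets_healthy() {
  states=$(target_states "$1") || return 1
  [ -n "$states" ] || return 1
  # grep -v succeeds if any line is NOT "healthy"; invert that result.
  ! printf '%s\n' "$states" | grep -qv '^healthy$'
}

# Usage: all_targets_healthy "<tg>" && echo "fleet is healthy"
```

New instances usually sit in the initial state for a few health check intervals, so a failure right after attaching is not necessarily a problem.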

Switching the deployment group to blue/green

Update the deployment group: change deployment-style to BLUE_GREEN, point at the target group for the ALB, and declare how to build green (copy the current ASG):

$ aws deploy update-deployment-group --application-name awscicd-demo \
    --current-deployment-group-name awscicd-demo-asg-dg \
    --auto-scaling-groups awscicd-asg \
    --deployment-style deploymentType=BLUE_GREEN,deploymentOption=WITH_TRAFFIC_CONTROL \
    --load-balancer-info '{"targetGroupInfoList":[{"name":"awscicd-tg"}]}' \
    --blue-green-deployment-configuration '{"terminateBlueInstancesOnDeploymentSuccess":{"action":"TERMINATE","terminationWaitTimeInMinutes":1},"deploymentReadyOption":{"actionOnTimeout":"CONTINUE_DEPLOYMENT"},"greenFleetProvisioningOption":{"action":"COPY_AUTO_SCALING_GROUP"}}' \
    --auto-rollback-configuration 'enabled=true,events=DEPLOYMENT_FAILURE,DEPLOYMENT_STOP_ON_ALARM' \
    --alarm-configuration '{"enabled":true,"alarms":[{"name":"awscicd-unhealthy-hosts"}]}'

COPY_AUTO_SCALING_GROUP tells CodeDeploy to build green by copying the blue ASG; terminateBlueInstancesOnDeploymentSuccess lets it take blue down after green is good (waiting 1 minute before terminating, for an inspection window).
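
The inline JSON for --blue-green-deployment-configuration is hard to read on one line. One option is to keep it in a file and pass it with the CLI's file:// prefix. A sketch with the same values as the command above; the file name bg-config.json and the function name are just examples:

```shell
# Write the blue/green settings to the file given as $1.
write_bg_config() {
  cat > "$1" <<'EOF'
{
  "terminateBlueInstancesOnDeploymentSuccess": {
    "action": "TERMINATE",
    "terminationWaitTimeInMinutes": 1
  },
  "deploymentReadyOption": { "actionOnTimeout": "CONTINUE_DEPLOYMENT" },
  "greenFleetProvisioningOption": { "action": "COPY_AUTO_SCALING_GROUP" }
}
EOF
}

# Usage:
#   write_bg_config bg-config.json
#   then pass: --blue-green-deployment-configuration file://bg-config.json
```

Keeping the file in the repo also makes changes to the wait time or the timeout action show up in review.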

A very realistic IAM error

My first blue/green deploy failed immediately:

$ aws deploy get-deployment --deployment-id $DID --query 'deploymentInfo.errorInformation'
{
  "code": "IAM_ROLE_PERMISSIONS",
  "message": "The IAM role awscicd-codedeploy-role does not give you permission to
   perform operations in the following AWS service: AmazonAutoScaling."
}

The managed policy AWSCodeDeployRole isn't enough for blue/green when the ASG uses a launch template. The CodeDeploy docs note this explicitly: when the ASG is created from a launch template, the service role additionally needs ec2:RunInstances, ec2:CreateTags, and iam:PassRole (to pass the instance role to the new green machines). Add them as an inline policy, with PassRole scoped exactly to awscicd-ec2-role:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": ["ec2:RunInstances", "ec2:CreateTags"],
      "Resource": "*"
    },
    {
      "Effect": "Allow",
      "Action": ["iam:PassRole"],
      "Resource": "arn:aws:iam::111122223333:role/awscicd-ec2-role"
    }
  ]
}

This is the kind of error that only surfaces when you run it for real and stays invisible if you just skim the docs, which is why every article in this series tests for real. After adding the permissions, the deploy succeeded.
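
For completeness, here is one way the inline policy could be attached. A sketch, assuming the statements are saved as a complete policy document (wrapped in Version and Statement) in a local file; the policy name awscicd-bluegreen-lt and the function name are invented for this example:

```shell
# Attach an inline policy to the CodeDeploy service role.
# $1 = path to the policy JSON file
attach_inline_policy() {
  [ -r "$1" ] || { echo "policy file not found: $1" >&2; return 1; }
  aws iam put-role-policy \
    --role-name awscicd-codedeploy-role \
    --policy-name awscicd-bluegreen-lt \
    --policy-document "file://$1"
}

# Usage: attach_inline_policy bluegreen-policy.json
```

IAM changes can take a few seconds to propagate, so an immediate retry of the deploy may still fail once.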

Inside a blue/green deploy

The part most worth dissecting. A blue/green deploy runs two sets of lifecycle events, one on each fleet. Look at the events of one green instance (new) and one blue instance (old):

# Green (new instance): installs the app then receives traffic
ApplicationStop → DownloadBundle → BeforeInstall → Install → AfterInstall
  → ApplicationStart → ValidateService → BeforeAllowTraffic → AllowTraffic → AfterAllowTraffic

# Blue (old instance): traffic is cut
BeforeBlockTraffic → BlockTraffic → AfterBlockTraffic

Read closely: green goes through the entire install chain (like in-place) and only then runs the three traffic events (BeforeAllowTraffic → AllowTraffic → AfterAllowTraffic), which register green into the ALB target group. Meanwhile blue goes through BlockTraffic, which removes it from the target group. The crux: AllowTraffic (bringing green in) only runs after ValidateService on green has passed. Traffic therefore shifts only once green is confirmed healthy; if green fails validation, traffic never leaves blue.

   blue ASG (serving)               ALB
   ┌──────────────┐                 │
   │ instance v1  │◀────traffic─────┤   1. CodeDeploy COPYs ASG → green
   │ instance v1  │                 │   2. deploy v_new to green, ValidateService
   └──────────────┘                 │   3. AllowTraffic: register green into TG
   ┌──────────────┐                 │   4. BlockTraffic: remove blue from TG
   │ instance vN  │◀────traffic─────┤   5. wait 1 minute → terminate blue
   │ instance vN  │ (green)         │
   └──────────────┘                 ▼
   the ALB never points to an unvalidated fleet → zero-downtime
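
To see this ordering on a real deployment, the lifecycle events of a single instance can be pulled from CodeDeploy and checked. A sketch, assuming the API returns the events in execution order (which matched what I saw, but treat it as an assumption); both function names are made up:

```shell
# Print "EventName<TAB>Status" lines for one instance of a deployment.
# $1 = deployment id, $2 = instance id
lifecycle_events() {
  aws deploy get-deployment-instance --deployment-id "$1" --instance-id "$2" \
    --query 'instanceSummary.lifecycleEvents[].[lifecycleEventName,status]' \
    --output text
}

# Read such lines on stdin. Fail if AllowTraffic shows up before
# ValidateService has been seen with status Succeeded.
traffic_gated_by_validation() {
  awk '
    $1 == "ValidateService" && $2 == "Succeeded" { validated = 1 }
    $1 == "AllowTraffic" && !validated           { bad = 1 }
    END { exit (bad + 0) }
  '
}

# Usage: lifecycle_events "$DID" <green-instance-id> | traffic_gated_by_validation
```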

After the deploy, CodeDeploy has created a new green ASG (CodeDeploy_awscicd-demo-asg-dg_d-...) running the new version, and set the blue ASG to desired 0. The ALB served continuously throughout:

$ curl http://awscicd-alb-...elb.amazonaws.com/
... awscicd demo app — v2 ...
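
A single curl after the fact doesn't prove the ALB served continuously. A simple way to check is to keep polling the ALB from a second terminal while the deploy runs and count the requests that don't return 200. A rough sketch; the function name, the 5-second timeout and the default interval are my choices:

```shell
# Send N requests to URL and print how many did not return HTTP 200.
# $1 = URL, $2 = number of requests, $3 = seconds between requests (default 1)
count_failed_requests() {
  url=$1 n=$2 interval=${3:-1}
  failed=0 i=0
  while [ "$i" -lt "$n" ]; do
    # -w prints only the status code; curl reports 000 if it cannot connect.
    code=$(curl -s -o /dev/null -m 5 -w '%{http_code}' "$url" || true)
    [ "$code" = "200" ] || failed=$((failed + 1))
    i=$((i + 1))
    sleep "$interval"
  done
  echo "$failed"
}

# Usage: count_failed_requests "http://<alb-dns-name>/" 300 1
```

A result of 0 over the whole deploy window is what zero-downtime should look like from the client side.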

Automatic rollback

This is the part that makes blue/green truly safe. The deployment group configuration above has two rollback pieces:

--auto-rollback-configuration events=DEPLOYMENT_FAILURE,DEPLOYMENT_STOP_ON_ALARM enables rollback when a deploy fails or when a monitored alarm transitions to the ALARM state.

--alarm-configuration attaches the CloudWatch alarm awscicd-unhealthy-hosts (monitoring the target group's UnHealthyHostCount).

The mechanism: during the deploy, if this alarm fires (e.g. green produces unhealthy hosts), CodeDeploy stops the deployment and rolls back. Because blue is still intact (it is terminated only after the deploy succeeds and the wait window has passed), rollback is just pointing the ALB back at blue: nearly instant, no downtime. This is quite unlike in-place, where rollback means re-deploying the old build (which takes time); in blue/green the old build is still running, so you just shift traffic.

Combine ValidateService (the gate on the machine) + the alarm (system-level monitoring) + blue still alive = a broken deploy is blocked before it reaches users, and reverts within seconds.
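
The alarm has to exist before CodeDeploy can watch it. Here is one plausible way to define it. The statistic, period, evaluation periods and threshold below are assumptions for illustration, not necessarily the values of the original alarm, and the helper names are made up. The CloudWatch dimension values are derived from the tail of the target group and load balancer ARNs:

```shell
# Dimension values are the last part of each ARN:
#   target group:  targetgroup/<name>/<id>
#   load balancer: app/<name>/<id>
tg_dimension() { printf '%s\n' "${1##*:}"; }
lb_dimension() { printf '%s\n' "${1##*:loadbalancer/}"; }

# $1 = target group ARN, $2 = load balancer ARN
create_unhealthy_alarm() {
  aws cloudwatch put-metric-alarm \
    --alarm-name awscicd-unhealthy-hosts \
    --namespace AWS/ApplicationELB \
    --metric-name UnHealthyHostCount \
    --dimensions "Name=TargetGroup,Value=$(tg_dimension "$1")" \
                 "Name=LoadBalancer,Value=$(lb_dimension "$2")" \
    --statistic Maximum --period 60 --evaluation-periods 1 \
    --threshold 0 --comparison-operator GreaterThanThreshold
}

# Usage: create_unhealthy_alarm "<tg>" "<alb>"
```

With these settings a single unhealthy host for one minute trips the alarm. That is deliberately sensitive for a demo; a production alarm would probably tolerate more noise.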

🧹 Cleanup

Part IV is done. Clean up all the costly resources (both ASGs, the ALB, target group, launch template, alarm):

$ aws autoscaling delete-auto-scaling-group --auto-scaling-group-name awscicd-asg --force-delete
$ aws autoscaling delete-auto-scaling-group --auto-scaling-group-name CodeDeploy_awscicd-demo-asg-dg_d-... --force-delete
$ aws elbv2 delete-load-balancer --load-balancer-arn <alb>
$ aws elbv2 delete-target-group --target-group-arn <tg>
$ aws ec2 delete-launch-template --launch-template-name awscicd-lt
$ aws cloudwatch delete-alarms --alarm-names awscicd-unhealthy-hosts

Blue/green leaves two ASGs behind (the original blue + the green one CodeDeploy created), so remember to delete both. The ALB bills by the hour, so don't forget it.
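
Since the green ASG gets a generated name, it's easy to miss one. A small check that lists whatever is still around, assuming the two name prefixes used in this article; the function names are made up:

```shell
# List Auto Scaling groups whose name starts with the given prefix.
asgs_with_prefix() {
  aws autoscaling describe-auto-scaling-groups \
    --query "AutoScalingGroups[?starts_with(AutoScalingGroupName, '$1')].AutoScalingGroupName" \
    --output text | tr '\t' '\n'
}

# Succeed only if neither the blue ASG nor any CodeDeploy copy is left.
no_leftover_asgs() {
  left=$( { asgs_with_prefix awscicd-asg
            asgs_with_prefix CodeDeploy_awscicd-demo-asg-dg; } | grep -v '^$' || true )
  [ -z "$left" ] || { echo "still present:" >&2; echo "$left" >&2; return 1; }
}

# Usage: no_leftover_asgs && echo "all ASGs gone"
```

Deleting an ASG with --force-delete is asynchronous, so a group can still be listed for a short while after the delete call.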

Wrap-up

Blue/green stands up a green fleet in parallel, deploys and validates it, then shifts traffic via the ALB, while blue stays put as the fallback. Switch the deployment group with deployment-style=BLUE_GREEN + loadBalancerInfo + COPY_AUTO_SCALING_GROUP. A real gotcha: an ASG using a launch template needs the service role to have extra ec2:RunInstances/ec2:CreateTags/iam:PassRole (you get the IAM_ROLE_PERMISSIONS error if it's missing). Green runs the full lifecycle and reaches AllowTraffic only after ValidateService passes, while blue goes through BlockTraffic, so traffic only shifts once green is healthy. Automatic rollback (DEPLOYMENT_FAILURE + alarm) only needs to point the ALB back at the live blue, so it's nearly instant and zero-downtime.

Part IV wraps up: we've deployed in-place, dissected hooks, scaled to an ASG, and done blue/green with rollback. But it's all still typing create-deployment by hand. Part V ties every piece together: CodePipeline automatically runs Source → Build → Deploy on every commit. The next article builds the first pipeline joining CodeCommit, CodeBuild and CodeDeploy.