If you've heard me speak lately or browsed through some of my slide decks, chances are high you've seen this slide and heard me talk about "immutable deployments":

The basic idea is to separate the "disposable" parts of your infrastructure from the "non-disposable" parts (e.g. databases, message queues,...).

And since the load balancer doesn't really carry any data worth keeping, this is how we used to classify the various elements:

This results in environments being groups of stacks, where the actual build is rolled out by creating a new stack and deleting the previous one:

(If you want to learn more about this check out the full slide deck here.)

Problem

So yes, we've been deploying a new ELB with every deployment and then using a Lambda function as a custom CloudFormation resource to update the DNS record sets via Route 53 to point to the new load balancer. With a short TTL this was a "good enough" deployment strategy that we'd been using for years.
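
For reference, the DNS switch in that old setup boils down to something like the following minimal sketch using the AWS SDK for Node.js (the record name, TTL and zone ID are placeholder values, not the actual ones from our templates):

var AWS = require('aws-sdk');
var route53 = new AWS.Route53();

// Point the record at the DNS name of the freshly created ELB,
// keeping the TTL short so the switch propagates quickly.
function pointDnsToNewElb(hostedZoneId, recordName, elbDnsName, callback) {
  route53.changeResourceRecordSets({
    HostedZoneId: hostedZoneId,            // placeholder
    ChangeBatch: {
      Changes: [{
        Action: 'UPSERT',
        ResourceRecordSet: {
          Name: recordName,                // e.g. 'www.example.com.'
          Type: 'CNAME',
          TTL: 60,                         // short TTL
          ResourceRecords: [{ Value: elbDnsName }]
        }
      }]
    }
  }, callback);
}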

But of course there are a couple of things very wrong with this deployment strategy:

  • Some clients don't respect the DNS settings and will continue trying to access the old ELB long after it's gone.
  • ELBs are "elastic", but expanding does take some time. Switching a high traffic site will result in a lot of dropped connections until the new ELB is fully warmed up.
  • Creating a new ELB on every deployment makes it harder to monitor metrics. We've solved this with yet another Lambda function that looks up the current ELB by tag, fetches its metrics from CloudWatch and then pushes the "normalized" metrics (excluding the deployment-specific identifier) to ElasticSearch, but this doesn't feel right. (We also had to do this with the metrics we collected from the EC2 instances - which will continue to be truly immutable. This can also be solved by collecting your own metrics and pushing them to CloudWatch with an already normalized grouping dimension like "qa-magento-fe" instead of the deployment-specific AutoScalingGroupName that changes with every deployment - see the sketch after this list.)
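
Here's what that last option could look like - a minimal sketch assuming you collect the value yourself (on the instance or in a Lambda function) and push it under a stable dimension instead of a deployment-specific one; the namespace and names are made up:

var AWS = require('aws-sdk');
var cloudwatch = new AWS.CloudWatch();

// Push a metric with a stable grouping dimension (e.g. "qa-magento-fe")
// so dashboards and queries survive deployments unchanged.
function pushNormalizedMetric(group, metricName, value, callback) {
  cloudwatch.putMetricData({
    Namespace: 'Custom/Deployments',                    // hypothetical namespace
    MetricData: [{
      MetricName: metricName,                           // e.g. 'RequestCount'
      Dimensions: [{ Name: 'Group', Value: group }],    // e.g. 'qa-magento-fe'
      Value: value,
      Unit: 'Count'
    }]
  }, callback);
}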

Solution

So really the last point motivated me to revisit our deployment strategy. I already knew you could attach multiple load balancers to the same auto-scaling group, but could you also attach multiple auto-scaling groups to the same load balancer? Turns out you now can! Although it's not really obvious in the AWS console or in the API, attaching an auto-scaling group to a load balancer is not exclusive anymore and will not result in an error if that load balancer is already attached to another auto-scaling group. (The routing mechanism is still not entirely clear to me. Example: if ASG A has 3 instances and ASG B has 5 instances, will the ELB route evenly between instances or between ASGs? In practice this doesn't matter, since in this scenario both ASGs will only be attached for a couple of seconds.)
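
If you want to try this yourself, attaching a second auto-scaling group is a single call with the AWS SDK for Node.js (the group and load balancer names below are made up):

var AWS = require('aws-sdk');
var autoscaling = new AWS.AutoScaling();

// Attaching a second ASG to an ELB that already has one attached simply
// succeeds - there is no "exclusive" flag to worry about anymore.
autoscaling.attachLoadBalancers({
  AutoScalingGroupName: 'qa-magento-fe-blue',    // hypothetical names
  LoadBalancerNames: ['qa-magento-fe-elb']
}, function (err, data) {
  if (err) { return console.error(err); }
  console.log('ASG attached - the ELB now routes to both groups');
});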

Poking around to find out if others have already done this I found Peter Sankauskas's great blog post on exactly this topic "The DOs and DON'Ts of Blue/Green Deployment".

Although Peter clearly suggested "NOT to use CloudFormation to orchestrate this", I think this is a great opportunity for another Lambda backed custom resource for CloudFormation. (And I guess that's kind of "my thing" now to have the complete deployment process driven by Lambda and CloudFormation :)

Here's my Lambda Green/Blue switcher that can be used as a custom resource for CloudFormation: https://github.com/AOEpeople/cfn-lambdahelper/tree/master/greenblue-switcher

In your CloudFormation template this will look something like this:

"SwitchToBlue": {
  "DependsOn": [ "WaitConditionHandlBlue" ],
  "Type": "Custom::GreenBlueSwitcher",
  "Properties": {
    "ServiceToken": {"Ref": "GreenBlueSwitcherArn"},
    "LoadBalancerName": { "Ref": "LoadBalancer" },
    "AutoScalingGroupName": { "Ref": "AsgBlue" }
  }
}

Find the full example/demo here.
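
If you haven't written a Lambda-backed custom resource before: CloudFormation hands the Properties from the template to the function in event.ResourceProperties and expects a response to be PUT to a pre-signed URL. The following is only a sketch of that plumbing, not the actual switcher code from the repo:

var https = require('https');
var url = require('url');

exports.handler = function (event, context) {
  // Properties from the template arrive in event.ResourceProperties
  var elbName = event.ResourceProperties.LoadBalancerName;
  var asgName = event.ResourceProperties.AutoScalingGroupName;

  // ... attach asgName to elbName, wait, detach the old groups (see below) ...

  // Tell CloudFormation we're done by PUTting a response to the pre-signed URL
  var body = JSON.stringify({
    Status: 'SUCCESS',
    PhysicalResourceId: elbName + '-switch',
    StackId: event.StackId,
    RequestId: event.RequestId,
    LogicalResourceId: event.LogicalResourceId
  });
  var parsed = url.parse(event.ResponseURL);
  var request = https.request({
    hostname: parsed.hostname,
    path: parsed.path,
    method: 'PUT',
    headers: { 'content-type': '', 'content-length': body.length }
  }, function () { context.done(); });
  request.write(body);
  request.end();
};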

So this is what happens now - all driven by the Lambda function using the AWS SDK for Node.js:

The load balancer now becomes a "non-disposable" resource and will be moved from the "build" stack into the "static resources" stack. This way it will only be set up once and not recreated on every deployment:

How it works:

This is what happens in the Lambda function:

  1. Look up the number of instances in the current ASG. (This actually doesn't happen in the Green/Blue Switcher, but in a different Lambda function.)
  2. Launch new ASG with the desired capacity matching the number of instances in the current ASG. At this point this ASG is not attached to any ELB.
  3. Wait until all instances in the new ASG are "InService" (ASG LifeCycle) and done provisioning (using WaitConditions).
  4. Attach new ASG to ELB (while not touching the current ASG).
  5. Wait until all the instances from the new ASG are marked as "InService" in the ELB. (This is the time period where two ASGs are attached to the same ELB. The instances of the new one will initially be marked as unhealthy, and depending on your health check settings it will only take a couple of seconds until all of them are detected as healthy. During this time traffic is directed to the old ASG plus those instances of the new ASG that are already healthy. So this isn't a perfectly clean cut, but that's ok - it wouldn't be with DNS either...)
  6. Detach all ASGs except for the new one from the ELB (see the sketch below for steps 4 to 6).
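
Steps 4 to 6 roughly translate to the following AWS SDK for Node.js calls. This is a simplified sketch, not the actual switcher code - pagination and the CloudFormation response handling are left out, and it assumes a classic ELB:

var AWS = require('aws-sdk');
var autoscaling = new AWS.AutoScaling();
var elb = new AWS.ELB();

function switchTo(elbName, newAsgName, done) {
  // 4. Attach the new ASG (the current one stays attached for now)
  autoscaling.attachLoadBalancers({
    AutoScalingGroupName: newAsgName,
    LoadBalancerNames: [elbName]
  }, function (err) {
    if (err) { return done(err); }
    waitUntilInService();
  });

  // 5. Poll until every instance of the new ASG is "InService" in the ELB
  function waitUntilInService() {
    autoscaling.describeAutoScalingGroups({ AutoScalingGroupNames: [newAsgName] }, function (err, asgData) {
      if (err) { return done(err); }
      var instanceIds = asgData.AutoScalingGroups[0].Instances.map(function (i) { return i.InstanceId; });
      elb.describeInstanceHealth({ LoadBalancerName: elbName }, function (err, healthData) {
        if (err) { return done(err); }
        var inService = healthData.InstanceStates.filter(function (s) {
          return instanceIds.indexOf(s.InstanceId) !== -1 && s.State === 'InService';
        });
        if (inService.length < instanceIds.length) {
          return setTimeout(waitUntilInService, 10000); // check again in 10s
        }
        detachOldGroups();
      });
    });
  }

  // 6. Detach every other ASG that still points at this ELB
  function detachOldGroups() {
    autoscaling.describeAutoScalingGroups({}, function (err, data) {
      if (err) { return done(err); }
      var oldGroups = data.AutoScalingGroups.filter(function (g) {
        return g.AutoScalingGroupName !== newAsgName &&
               g.LoadBalancerNames.indexOf(elbName) !== -1;
      });
      var pending = oldGroups.length;
      if (pending === 0) { return done(null); }
      oldGroups.forEach(function (g) {
        autoscaling.detachLoadBalancers({
          AutoScalingGroupName: g.AutoScalingGroupName,
          LoadBalancerNames: [elbName]
        }, function (err) {
          if (err) { return done(err); }
          if (--pending === 0) { done(null); }
        });
      });
    });
  }
}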
