AWS CloudFormation Case Study

I was asked to write up how I implemented CloudFormation as it began to roll out. It let me replace RightScale wholesale, with as much flexibility as I cared to code in JSON. Below is a draft.


Our task was to roll out a set of ad products built around influence as a metric, using AWS for analytics, smart display ads, and contextual and behavioral targeting, to name a few.  Starting fresh and fast, with no physical infrastructure and oodles of new data, we had to remain nimble and scale quickly.  As the team grew from one to twenty in the span of months, however, it quickly became necessary to automate and organize not only machines but process.  Each product required QA/LT, staging and production environments.  DR, HA and scaling requirements demanded a plan.  Enter CloudFormation.

Each product has its own set of CloudFormation stacks, configured to procure the necessary AWS services: ELBs, autoscaling groups, queues, security groups, buckets, etc.  Great; then what?  The machines need to talk to each other.  With EC2’s UserData field, machines can be passed values for any of the resources brought up in the stack, e.g. RDS endpoints/creds, private IPs, queue ARNs, etc.  Use an AMI that can execute a command issued by UserData and you’re done.  Ubuntu and Amazon Linux do this off the shelf (both ship cloud-init); simply start the UserData with a shebang.  Stack machines can now come up with all the stack data and configure themselves for service via the deployment method of choice.  Furthermore, since they can update themselves, there’s no need to reburn AMIs every time the configuration changes or an update is needed.  Our machines even set the prompt to their role and stack name so we don’t get lost in a sea of terminals.  They can even configure their own DNS.
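To make the UserData idea concrete, here is a minimal sketch of what a rendered UserData script might look like once the stack values have been filled in.  It is an illustration, not our actual boot script: the variable names, the default values, the prompt_label helper and the CONFIG_DIR location are all hypothetical, and in a real template CloudFormation would substitute the values (for example with Fn::Join and Ref) rather than reading them from the environment.

```shell
#!/bin/bash
# Hypothetical stack values. In a real template these would be substituted by
# CloudFormation before the instance boots; the defaults are placeholders.
STACK_NAME="${STACK_NAME:-Targeting-QA1}"
ROLE="${ROLE:-web}"
DB_ENDPOINT="${DB_ENDPOINT:-db.example.internal}"
# Assumed output directory; a real box might use /etc/profile.d or similar.
CONFIG_DIR="${CONFIG_DIR:-/tmp/stack-config}"

# Build the "role@stack" label shown in the shell prompt.
prompt_label() {
  printf '%s@%s' "$1" "$2"
}

# Persist the stack values and a prompt snippet for later deployment steps.
write_stack_config() {
  mkdir -p "$CONFIG_DIR"
  {
    printf 'STACK_NAME=%s\n' "$STACK_NAME"
    printf 'ROLE=%s\n' "$ROLE"
    printf 'DB_ENDPOINT=%s\n' "$DB_ENDPOINT"
  } > "$CONFIG_DIR/stack.env"
  # The prompt snippet is meant to be sourced by login shells.
  printf "PS1='[%s] %s '\n" "$(prompt_label "$ROLE" "$STACK_NAME")" '\w \$' \
    > "$CONFIG_DIR/prompt.sh"
}

write_stack_config
```

From here the deployment method of choice can source stack.env and take over.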

Once the stack for QA is written, the QA lead can launch it with one button.  Change the autoscaling group sizes from one to one hundred and use that copy of the template to launch staging for LT.  With template mappings, you can even use the same template for both.  Add some alarms, change thresholds and launch again for production.  Next round?  Bring up a second production stack next to the old one and cut over.  Need to revert?  Leave the old one up and switch back.  Virginia slides into the ocean?  Launch the exact same stack in Singapore.  Shut down the staging and old production stacks when you’re done to save money.  We call our release method Deployment by Death because all we do is update the release code and kill the boxes.
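Deployment by Death could be sketched roughly as follows.  This is not the tooling we used at the time; it assumes the unified AWS CLI, and the terminate_commands helper and the instance IDs are hypothetical.  The function only prints the commands, so nothing dies until you decide to run them.

```shell
#!/bin/bash
# Print (not run) one terminate command per instance ID.
# --no-should-decrement-desired-capacity keeps the group's desired size, so the
# autoscaling group launches a replacement that boots with the new release.
terminate_commands() {
  local region="$1"
  shift
  local id
  for id in "$@"; do
    printf 'aws autoscaling terminate-instance-in-auto-scaling-group --instance-id %s --no-should-decrement-desired-capacity --region %s\n' "$id" "$region"
  done
}

# Example with hypothetical instance IDs:
terminate_commands us-east-1 i-0aaa1111 i-0bbb2222
```

Review the printed commands, then pipe them to sh to kill the boxes for real.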

Here’s how we bring up three Targeting stacks across the world in one quick line:

for i in us-east us-west eu-west; do cfn-create-stack Targeting-QA1 -f targeting.qa1.json --region $i-1; done

They all come up configured and serving, typically within minutes.
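
The one-liner above uses the original cfn-* command line tools.  With the unified AWS CLI, the same loop might look like the sketch below.  The create_stack_cmd helper is hypothetical and only prints the commands; it assumes the same stack name and template file as above and that the template sits in the current directory.

```shell
#!/bin/bash
# Build the create-stack command for one region prefix,
# e.g. "us-east" becomes --region us-east-1.
create_stack_cmd() {
  local prefix="$1" stack="$2" template="$3"
  printf 'aws cloudformation create-stack --stack-name %s --template-body file://%s --region %s-1\n' \
    "$stack" "$template" "$prefix"
}

# Print one command per region; nothing is launched by this script itself.
for prefix in us-east us-west eu-west; do
  create_stack_cmd "$prefix" Targeting-QA1 targeting.qa1.json
done
```

Pipe the output to sh to actually launch the three stacks.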