Nuvole Computing

Indispensable Software for Exploring The Cloud

netbas — Tue, 26 Mar 2013 03:17:07 +0000

We found a piece of software indispensable in exploring the cloud and thought we would share. You’re welcome.

Continuous Delivery/Release – a Basic Howto with Examples

netbas — Wed, 03 Oct 2012 22:59:57 +0000

When we first went to build a setup with Continuous Delivery (CD) as the goal, we found plenty of excellent theoretical fodder but little in the way of specifics. As with all the hubbub about the cloud, we wondered how to actually get there let alone how best to get there. Here is a rundown of the basics.

Let’s be clear about our goal: checking in code and having it automatically update your production service in a very short timeframe.

Continuous Integration
The first step is automation. While it’s always nice, here it’s all or nothing. If any part of it is prone to breaking or requires manual intervention, delivery will not be continuous. If you’re not already here, you should be. Your systems should be like an iPod when you’re through with them, with just a few buttons for everything you do more than twice. See our post on Chef and CloudFormation for more details. For the sake of this guide, let’s assume you have Opscode Chef up and running, with a Chef server directing your machines. At this point, machines should be able to come up on their own, grab everything they need and start serving within minutes. Chef 0.10+ supports “environments,” which is what we’ll use to tell machines what version of the codebase they are to use. Here’s a sample chef-repo/environments/qa.json:

{
    "chef_type": "environment",
    "json_class": "Chef::Environment",
    "name": "qa",
    "description": "",
    "default_attributes": {
      "myfrontendapp_revision" : "0fe30e04e8aa610c2e5a34a75b924c2462f87d4e"
    },
    "cookbook_versions": {
      "mycookbook": "0.1.9"
    }
}

When we upload this config to the Chef server, myfrontendapp_revision becomes an “attribute” on every machine in environment qa, meaning chef-client has access to it on each node. For simplicity, we will also assume for now that we’re releasing straight from GitHub. The simplicity derives from Chef’s built-in support for this by way of its “deploy_revision” resource. A recipe using it should contain something like this in the case of Amazon Linux:

cookbook_file "/home/ec2-user/.ssh/deploy-id_rsa.pub" do
  source "ssh/id_rsa.pub"
  mode 0600
  owner "ec2-user"
  group "ec2-user"
end

cookbook_file "/home/ec2-user/bin/wrap-ssh4git.sh" do
  source "bin/wrap-ssh4git.sh"
  owner "root"
  group "root"
  mode 0755
end

deploy_revision "myfrontendapp" do
  repo "git@github.com:mygitaccount/myfrontendapp.git"
  user "ec2-user"
  revision node['myfrontendapp_revision']
  deploy_to "/var/my/dir/for/release"
  ssh_wrapper "/home/ec2-user/bin/wrap-ssh4git.sh"
  action :deploy
end

Chef’s idempotency ensures that if the revision doesn’t change, nothing happens, and that if it does, that revision gets released. This way Chef can safely run continuously. Change the revision on the chef server to the name of a branch in git (e.g. master), and you’re done.

    "default_attributes": {
      "myfrontendapp_revision" : "master"
    },

Every time chef-client runs on the machine, it will check to see if the branch has been updated. If changes have been pushed, it releases the branch. Congratulations. You now have continuous integration. If you don’t mind your code being released every few minutes, broken or not, then change qa to prod and you have continuous delivery, straight to the user. Clearly, there are some steps in between continuous integration and continuous delivery.
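
The "release only when the revision changes" behavior described above can be pictured with a small plain-Ruby sketch. This is not Chef's implementation. The helper deploy_if_changed, the state hash and the release lambda are hypothetical names used only for illustration.

```ruby
# Sketch only: mimics "release only when the revision changes".
# deploy_if_changed and its arguments are hypothetical, not part of Chef.
def deploy_if_changed(state, desired_revision, release)
  # Nothing to do when the deployed revision already matches.
  return :up_to_date if state[:deployed_revision] == desired_revision

  release.call(desired_revision)
  state[:deployed_revision] = desired_revision
  :deployed
end

state = { :deployed_revision => nil }
release = lambda { |rev| puts "releasing #{rev}" }

deploy_if_changed(state, "0fe30e0", release)  # first run releases
deploy_if_changed(state, "0fe30e0", release)  # second run is a no-op
```

Because the second call does nothing, a check like this can safely run every few minutes, which is the property the continuous chef-client runs rely on.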

Continuous Delivery/Release
There’s no way to release production code in an automated fashion unless the testing is automated as well. There are any number of solutions out there. To keep this very basic in favor of focusing on the big picture, let’s say that there’s a script that runs every night that will test the qa environment. At the end of this script, it either passes or it fails. If it fails, a notification is triggered, and it tries again in a bit. If it passes, a second script is invoked which modifies attribute myfrontendapp_revision for environment prod, uploads it to the chef server, then kicks chef-client on all nodes running myfrontendapp. A very basic, POC script looks like this, assuming knife.rb is configured for the user and they have sudo access to all the machines (update_att.rb):

require 'rubygems'
require 'bundler/setup'
require 'chef'

require 'net/ssh'
require 'net/ssh/multi'
require 'readline'
require 'chef/search/query'
require 'chef/mixin/shell_out'
require 'chef/knife/ssh'

environment = ARGV[0]
default_attribute_key = ARGV[1]
default_attribute_value = ARGV[2]

# get env config in json from the chef server
Chef::Config.from_file('/home/me/.chef/knife.rb')
rest = Chef::REST.new(Chef::Config[:chef_server_url])
env_chef = rest.get_rest("/environments/" + environment)
env_json = env_chef.to_json
env = JSON.parse(env_json, :create_additions => false)

# change the revision
env['default_attributes'][default_attribute_key] = default_attribute_value
env_json_new = env.to_json

# write a new json file for source control
File.open(Chef::Config[:cookbook_path].to_s + '/../environments/' + environment + '.json', 'w') {|f| f.write(env_json_new) }

# upload the new environment to the chef server
env_chef = rest.put_rest("/environments/" + environment, env)

# kick chef-client for all machines in environment
Chef::Config.from_file('/home/me/.chef/knife.rb')
ssh =  Chef::Knife::Ssh.new
ssh.config[:attribute] = 'ec2.public_hostname'
#ssh.config[:ssh_user] = username
ssh.name_args << 'chef_environment:' + environment
ssh.name_args << 'sudo su - -c chef-client'
ssh.run

% ruby ./update_att.rb prod myfrontendapp_revision SOME_NEW_REVISION

Congratulations for real. This is the most basic form of Continuous Delivery. All you do is check in code. Your QA script will test and release it however often you choose with no intervention.
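
The nightly gate between qa and prod described above might be sketched as follows. The names nightly_gate, run_tests, promote and notify are hypothetical placeholders. In a real setup promote would shell out to something like update_att.rb and run_tests would call your own QA suite.

```ruby
# Sketch only: the nightly gate between qa and prod.
# run_tests, promote and notify are hypothetical callables supplied by you.
def nightly_gate(revision, run_tests, promote, notify, max_attempts = 3)
  max_attempts.times do |attempt|
    if run_tests.call(revision)
      promote.call(revision)   # e.g. invoke update_att.rb for prod
      return :promoted
    end
    notify.call("QA failed for #{revision} (attempt #{attempt + 1})")
    # A real script would wait a while here before trying again.
  end
  :gave_up
end

outcome = nightly_gate("SOME_NEW_REVISION",
                       lambda { |rev| true },                       # pretend QA passed
                       lambda { |rev| puts "promoting #{rev}" },
                       lambda { |msg| warn msg })
puts outcome
```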

Beyond the Basics
This example is far too basic for most production environments. If you're starting from scratch, it's a great way to start, but parts of your code that aren't front-end scripts will likely require compilation (also, relying on github over, say, s3, isn’t the most robust solution). The structure remains the same. In the case of Java, for example, you can easily have Jenkins build binaries, upload them to s3 and tag them with, say, their md5 hashes. These hashes can replace github's revision in the example, and you can write your own recipes to handle s3 retrieval of binaries built by Jenkins. For example, using the s3_file resource for chef:

s3_file "#{node['root_myfrontendapp']}/releases/myfrontendapp.war-" + node[:myfrontendapp_revision] do
  remote_path "/myfrontendapp/myfrontendapp.war-" + node[:myfrontendapp_revision]
  bucket "#{node['deploy_bucket']}"
  aws_access_key_id "#{node['deploy_user']}"
  aws_secret_access_key "#{node['deploy_pass']}"
  action :create
end

directory "root-webapp" do
  recursive true
  path "#{node['tomcat_webapp_base']}/ROOT"
  action :nothing
end

# because s3_file doesn't support notifications, among other reasons
link "#{node['tomcat_webapp_base']}/ROOT.war" do
  to "#{node['root_myfrontendapp']}/releases/myfrontendapp.war-" + node[:myfrontendapp_revision]
  # just to be sure that the webapp dir is wiped
  notifies :stop, 'service[tomcat6]', :immediately
  notifies :delete, 'directory[root-webapp]', :immediately
  notifies :restart, 'service[tomcat6]', :delayed
end

If you have Jenkins build an RPM instead, just make myfrontendapp_revision the rpm version number and the chef resources reduce to one: package.
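
For the md5-tagging approach mentioned above, a build step could derive the revision and the S3 key along these lines. This is a sketch: artifact_revision and s3_key_for are hypothetical helpers, and the key layout simply mirrors the remote_path used in the s3_file example.

```ruby
require 'digest/md5'

# Sketch only: derive an md5-tagged S3 key for a build artifact.
# artifact_revision and s3_key_for are hypothetical helper names.
def artifact_revision(bytes)
  Digest::MD5.hexdigest(bytes)
end

def s3_key_for(app, revision)
  "/#{app}/#{app}.war-#{revision}"
end

war_bytes = "pretend this is the contents of a .war file"
revision = artifact_revision(war_bytes)
puts s3_key_for("myfrontendapp", revision)
```

The resulting hash is what would go into myfrontendapp_revision in the environment file.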

Your QA apparatus should be robust and cover as fully as possible anything you yourself might test on a push to production. Our rule of automation is: if you've done it twice and expect to do it again, automate it. The most comprehensive way to address this is to implement Test-Driven Development (TDD) from the beginning. Like most awesome things these days, it requires no small amount of overhead, but you will get results that no amount of after-the-fact QA guesswork can achieve. It's a judgement call whether or not you start with TDD.

Forecast: Cloudy
The qa environment does not properly emulate your real production environment, you say? That's what staging is for, you say? This is where the cloud shines. Not only can your nightly release happen automatically, but you can leave the build machines off until it's time to build, saving money. You can also autoscale them out, saving time. You can even have them, on successful QA test of the qa environment, launch a staging stack and hammer away at it before releasing to production. If your automation was done right, one API call to something like AWS's CloudFormation should do the trick. Once the test passes in stage, prod can be changed and updated while you sleep.

To avoid downtime during release, front and middle-tiers should have at least two machines each. Have your script run each sequentially, reverting on failure. Your back-end jobs should be thoroughly decoupled using some measure such as queues. If they are, they should be able to stop at any time and pick up where they left off when they’re ready, autoscaling to accommodate backed-up queues.
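
Releasing to each machine in turn and reverting on failure can be sketched like this. The function rolling_release and the deploy and healthy callables are hypothetical. A real version would kick chef-client on each node and poll a health check.

```ruby
# Sketch only: release to one node at a time, reverting on failure.
# deploy and healthy are hypothetical callables; node names are made up.
def rolling_release(nodes, new_rev, old_rev, deploy, healthy)
  updated = []
  nodes.each do |node|
    deploy.call(node, new_rev)
    updated << node
    next if healthy.call(node)

    # Roll back every node touched so far, then stop.
    updated.each { |n| deploy.call(n, old_rev) }
    return :reverted
  end
  :released
end

deploy  = lambda { |node, rev| puts "#{node} -> #{rev}" }
healthy = lambda { |node| true }
puts rolling_release(%w[frontend-1 frontend-2], "v2", "v1", deploy, healthy)
```

With at least two machines per tier, one is always serving while the other is being updated.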

What about dastardly things like updates to a SQL schema? You had better be damn sure such a change won’t cause problems. Begin by automating in anticipation of one day having it fully automated, and reduce the manual portion to a single button. Iron out your process. When things stop going wrong, have a program push the button.

Good luck!

Chef Node Deregistration For Autoscaling Groups

netbas — Mon, 02 Jul 2012 07:47:23 +0000

Cloudreach in the UK has a great article on Chef node de-registration. With autoscaling groups bringing machines up and down all over the place, Chef needs a way to know that they’ve been terminated and remove them from the configs lest things become a cluttered mess. Luckily, AWS provides some great shortcuts. We implemented the solution, then bulletproofed it. Here’s what it takes.

It starts in CloudFormation for both source control and automation. See our summary of CloudFormation and Chef. We set up two queues, one for nodes to deregister and one for any errors.

"DeregQueue" : {
    "Type" : "AWS::SQS::Queue"
},
"DeregErrorQueue" : {
    "Type" : "AWS::SQS::Queue"
}

Simple enough. Then we need to configure all autoscaling groups (ASGs) to put themselves into DeregQueue upon termination. We do this by way of an SNS topic with DeregQueue as its endpoint.

"DeregTopic" : {
    "Type" : "AWS::SNS::Topic",
    "Properties" : {
        "Subscription" : [
            {
                "Endpoint" : { "Fn::GetAtt" : ["DeregQueue", "Arn"]},
                "Protocol" : "sqs"
            }
        ]
    }
}

Just tell all your ASGs to send notification to this topic on termination, and the circle is complete.

"NotificationConfiguration" : {
    "TopicARN" : { "Ref" : "DeregTopic" },
    "NotificationTypes" : ["autoscaling:EC2_INSTANCE_TERMINATE"]
},

The topic needs permission of course, so you’ll need to add an SQS policy.

"DeregQueuePolicy" : {
    "Type" : "AWS::SQS::QueuePolicy",
    "Properties" : {
        "PolicyDocument":  {
            "Id":"DeregQueuePolicy",
            "Statement" : [
                {
                    "Sid":"Allow-SendMessage-To-Queue-From-SNS-Topic",
                    "Effect":"Allow",
                    "Principal" : {"AWS" : "*"},
                    "Action":["sqs:SendMessage"],
                    "Resource": "*",
                    "Condition": {
                        "ArnEquals": {
                            "aws:SourceArn": { "Ref" : "DeregTopic" }
                        }
                    }
                }
            ]
        },
        "Queues" : [ {"Ref" : "DeregQueue"} ]
    }
},

For the processor, we chose the aws-sdk for Ruby, easily accessible via gem install aws-sdk. IAM permissions should be set up ahead of time.

require 'rubygems'
require 'aws-sdk'
require 'json'
require 'time'
require 'chef'

asg_queue_url = ENV['ASG_DEREG_QUEUE_URL']
asg_error_queue_url = ENV['ASG_DEREG_ERROR_QUEUE_URL']
topic_arn= ENV['WARNING_TOPIC']
sqs = AWS::SQS.new(:access_key_id => ENV['ADMINPROC_USER'],
                   :secret_access_key => ENV['ADMINPROC_PASS'],
                   :sqs_endpoint => asg_queue_url)
sns = AWS::SNS.new(:access_key_id => ENV['RELEASE_USER'],
                   :secret_access_key => ENV['RELEASE_PASS'])
Chef::Config.from_file(ENV['HOME'] + '/.chef/knife.rb')
rest = Chef::REST.new(Chef::Config[:chef_server_url])

sqs.queues[asg_queue_url].poll do |m|
  sns_msg = m.as_sns_message
  body = JSON.parse(sns_msg.to_h[:body])
  event = body["Event"]
  begin
    if event.include? "autoscaling:EC2_INSTANCE_TERMINATE"
      time = Time.now.utc.iso8601
      # here we assume that the ec2 instance id = node name
      iid = body["EC2InstanceId"]

      puts "deleting node " + iid + "\n"
      del_node = rest.delete_rest("/nodes/" + iid)

      puts "deleting client " + iid + "\n"
      del_client = rest.delete_rest("/clients/" + iid)

    elsif event.include? "autoscaling:TEST_NOTIFICATION"
      m.delete
    end

  # on failure:
  rescue
    msg = "There was a problem deregistering instance #{iid}.\n" + $!.to_s + "\n" + body.to_a.sort.join("\n")
    puts msg    # should go to STDERR

    # send alert to staff
    topic = sns.topics[topic_arn]
    topic.publish(msg)

    # put the message in DeregErrorQueue
    error_q = sqs.queues[asg_error_queue_url]
    error_q.send_message(m.to_s)

    # keep moving
    next
  end
  puts ""
end

Easy peasy.
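
For readers not using the SDK's as_sns_message helper, the same information can be pulled from the raw queue message by hand. This is a sketch assuming SNS delivers its usual JSON envelope, with the autoscaling notification as a JSON string in the "Message" field. The helper parse_asg_notification is a hypothetical name, and the sample payload is trimmed to the fields used here.

```ruby
require 'json'

# Sketch only: pull the event and instance id out of a raw SQS message body
# that SNS delivered. parse_asg_notification is a hypothetical helper.
def parse_asg_notification(raw_body)
  envelope = JSON.parse(raw_body)            # SNS envelope
  inner = JSON.parse(envelope["Message"])    # ASG notification, itself JSON
  { :event => inner["Event"], :instance_id => inner["EC2InstanceId"] }
end

sample = {
  "Type" => "Notification",
  "Message" => {
    "Event" => "autoscaling:EC2_INSTANCE_TERMINATE",
    "EC2InstanceId" => "i-1288786a"
  }.to_json
}.to_json

info = parse_asg_notification(sample)
puts "#{info[:event]} #{info[:instance_id]}"
```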

% tail -f process_asg_queue.log

2012-07-01T16:12:50Z: nodename =   bas-testo-test-i-1288786a, iid = i-1288786a
knife node delete   bas-testo-test-i-1288186a -y...
   result = true
knife client delete   bas-testo-test-i-1218786a -y...
   result = true

2012-07-01T16:13:22Z: nodename =   bas-database-database-i-1c867664, iid = i-1c867664
knife node delete   bas-database-database-i-1c167664 -y...
   result = true
knife client delete   bas-database-database-i-1c167664 -y...
   result = true

All that remains is to properly daemonize. We chose to wrap this guy up in supervisord.

[program:chef_dereg]
directory=<%= node['script_dir'] %>
command=ruby <%= node['script_dir'] %>/process_asg_queue.rb
autostart=true
autorestart=true
startretries=5
stdout_logfile=<%= node['log_dir'] %>/process_asg_queue.log
redirect_stderr=true
stopsignal=INT
stdout_logfile_maxbytes=15000000
stdout_logfile_backups=45

Finally, the error handling proved very tolerant in testing, but you never know, right? So as a catch-all, set an alarm on the queue and never look at it again. Until you have to.

        "DeregQueueAlarm": {
            "Type": "AWS::CloudWatch::Alarm",
            "Properties": {
                "AlarmDescription": "Alarm if queue is not being processed.  The processor might have died.",
                "Namespace": "AWS/SQS",
                "MetricName": "ApproximateNumberOfMessagesVisible",
                "Dimensions": [{
                    "Name": "QueueName",
                    "Value" : { "Fn::GetAtt" : ["DeregQueue", "QueueName"] }
                }],
                "Statistic": "Average",
                "Period": "300",
                "EvaluationPeriods": "3",
                "Threshold": "1",
                "ComparisonOperator": "GreaterThanOrEqualToThreshold",
                "AlarmActions": [{
                    "Ref": "CriticalTopic"
                }]
            }
        },

Note: INSUFFICIENT_DATA is not feasible for sqs.
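
To make the alarm settings concrete: with a 300-second period, three evaluation periods and a threshold of 1, the alarm fires once the average number of visible messages has been at least 1 for three consecutive five-minute periods. The following sketch approximates that rule. The helper alarm_firing? is hypothetical and is not a CloudWatch API.

```ruby
# Sketch only: approximates how the alarm above decides to fire.
# alarm_firing? is a hypothetical helper, not a CloudWatch API.
def alarm_firing?(period_averages, threshold = 1, evaluation_periods = 3)
  recent = period_averages.last(evaluation_periods)
  return false if recent.length < evaluation_periods

  # GreaterThanOrEqualToThreshold must hold for every evaluated period.
  recent.all? { |avg| avg >= threshold }
end

puts alarm_firing?([0, 2, 3, 1])  # queue has been backed up for 15 minutes
puts alarm_firing?([2, 0, 3])     # processor drained the queue in between
```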

SeeDub: Universal Custom Metrics for AWS CloudWatch

netbas — Sun, 25 Mar 2012 19:21:44 +0000

When AWS supported an API call to push whatever metric we want, we got very excited. Once a metric is in CloudWatch, we can key a scaling policy to it, have it trigger an action, an alarm, basically perform any action we very well please. Finally, ultimate flexibility in what factors determine what actions, and we get pretty graphs to boot! For example, we could scale a processing group up and down like an accordion based on the number of entries in a DB table, or maybe the amount of free system memory. The hurdles proved disheartening, but we created a solution.

Take for instance API throttling, the shadowy world where AWS can never give a straight answer. Actually, they can, but you have to escalate. All the API endpoints have limits on them, per-customer. You can raise them, but your code has to be able to throttle itself or you risk crazy race conditions everywhere, especially when you’re talking about thousands of machines all pushing their metrics via, say, cron every five minutes.
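
Self-throttling usually means retrying with a growing, randomized delay. Here is a minimal sketch of that idea. ThrottledError and with_backoff are hypothetical names. A real client would rescue whatever throttling error its AWS library raises.

```ruby
# Sketch only: self-throttling retry with exponential backoff and jitter.
# ThrottledError and with_backoff are hypothetical names.
class ThrottledError < StandardError; end

def with_backoff(max_attempts = 5, base_delay = 0.5, sleeper = Kernel.method(:sleep))
  attempt = 0
  begin
    attempt += 1
    yield attempt
  rescue ThrottledError
    raise if attempt >= max_attempts
    # Double the delay each attempt and add randomness so clients spread out.
    delay = base_delay * (2**(attempt - 1)) * (0.5 + rand)
    sleeper.call(delay)
    retry
  end
end

result = with_backoff(5, 0.01) do |attempt|
  raise ThrottledError if attempt < 3   # pretend the first two calls are throttled
  "pushed on attempt #{attempt}"
end
puts result
```

The random factor matters as much as the doubling: without it, thousands of machines that were throttled together would all retry together.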

The CloudWatch team, under NDA, provided me with a development library in perl that would handle API retries, but we wanted one solution we could deploy everywhere and not worry. So we wrote SeeDub, an intermediary that takes simple files and queues them up for batch processing by the Amazon::CloudWatchClient lib, which will handle all the retries. Built-in is some randomness for an offset to make sure thousands of machines aren’t firing all at once. Write a file into /var/nuvole/seedub.d/NAMESPACE/unique_name_for_metric_file, and let bin/putmetricdata.pl take care of CloudWatch, e.g.

$ cat > /var/nuvole/seedub.d/Nuvole/SpecialNamespace/whatever.random1384y28237 <<EOF
name Crazy
value 15
unit Count
time 1314639962
dimensions Partition=/
EOF
$ bin/putmetricdata.pl us-east-1

Any app, any system tool, any piddly script that can write four simple lines has direct, robust and resilient access to the CloudWatch API regardless of your limit.
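
Writing such a file from an application takes only a few lines. The sketch below assumes one "key value" pair per line, using the field names from the example above. The helper write_metric_file is hypothetical, and a temp directory stands in for /var/nuvole/seedub.d so the snippet can run anywhere.

```ruby
require 'tmpdir'
require 'fileutils'

# Sketch only: write a metric file with one "key value" pair per line
# (an assumed layout). write_metric_file is a hypothetical helper.
def write_metric_file(spool_dir, namespace, name, value, unit, dimensions, time = Time.now.to_i)
  dir = File.join(spool_dir, namespace)
  FileUtils.mkdir_p(dir)
  # A random suffix keeps concurrent writers from clobbering each other.
  path = File.join(dir, "#{name}.#{rand(1_000_000)}")
  File.open(path, 'w') do |f|
    f.puts "name #{name}"
    f.puts "value #{value}"
    f.puts "unit #{unit}"
    f.puts "time #{time}"
    f.puts "dimensions #{dimensions}"
  end
  path
end

spool = Dir.mktmpdir   # stand-in for /var/nuvole/seedub.d
path = write_metric_file(spool, "Nuvole/SpecialNamespace", "Crazy", 15, "Count", "Partition=/", 1314639962)
puts File.read(path)
```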

And recently, the CloudWatch team has released an updated version of the code, which seems perfectly compatible. And so we're releasing our part. https://github.com/netbas/SeeDub/

For those who want to see it in action on their own machines immediately, create a stack using the SeeDub sample CloudFormation template, which launches a fully operational stack with SeeDub pushing metrics for an autoscaling group of two t1.micros. Feed it your KeyName as a param, and with one button you'll have actionable metrics with pretty graphs within fifteen minutes. Via the command-line:

cfn-create-stack SeeDubSample2 -f seedub.iam.json --capabilities CAPABILITY_IAM --parameters "KeyName=bingo"

NOTE: If you launch this CloudFormation stack via the AWS Management Console, you must check the little box that says it's ok to create an IAM user, which is dangerous even if we claim to follow the principle of least privilege.

NOTE: it also uses EeSeeToo, another of our packages which so far acts solely to provide an instance with the name of its autoscaling group (ASG).

Chef and CloudFormation

netbas — Sun, 15 Jan 2012 20:08:13 +0000

The ephemeral nature that comes along with cloudy virtual machines means that we need to be able to go down and come up at any time without flinching. What good is hardware scriptability if the app itself doesn’t lend itself to automated management? For years, we’ve been using custom sets of homebrew scripts, checked-in alongside both the app code and any virtual hardware scripting. It was cumbersome, but it worked beautifully. Chef promised to make the cumbersome elegant. Did it?

Yes, it did. The learning curve was steep at first. It seemed to take a lot of knowledge just to get set up. We started with CloudFormation (CFN), for which AWS provides sample templates of both a Chef server and a sample app as a Chef client.

After getting them up and running, we replaced the Chef server stack with Hosted Chef. For PCI compliance, we later built a private chef server. The results were spectacular. One button launches an AWS stack in any region, as before, but now the only thing that CFN has its machines do is bootstrap chef. Have the standard AMIs include it already, and all that has to happen is a passing of Chef Server location and initial creds. Chef handles the rest! What’s more, you can configure chef-client to register a nodename corresponding to IP address, stack name, region, whatever you like (see below).

In case it helps, it took about two full-time weeks to go from homebrew stack to Chef stack, with no previous knowledge of Chef. This makes the dev/ops approach much easier to sell to all concerned parties. For example, the notorious problem of where to keep application configuration has a new potential answer. One client wisely wrote an app which first looked to a sharding table to figure out where its DB was. The problem: the sharding table was in a DB, and self-referencing. Moving such a typically static, simple thing to Chef attributes makes perfect sense, as it does for a whole range of app configurations as well as ops configurations.

Note: If you ever wonder where ohai’s ec2 attributes are, you may have come across a known ohai bug within the VPC. The solution is simply “With Ohai 6.4.0, create /etc/chef/ohai/hints/ec2.json to enable EC2 attribute collection,” as done below using cfn-init.

Note: These examples use CloudFormation helper scripts, but there is nothing you can’t do here with simple scripting. Below lies a sample LaunchConfig. All that remains is to put the node config into one var.

Update: We recently had to retrofit our old Chef stacks to work in particular subnets within a VPC and AWS’s Chef Server template, which works right out of the box. But now, when we need to tweak stack JSON, we modify the LaunchConfig, then run cfn-update-stack and as-terminate-instance-in-auto-scaling-group, quickly iterating and saving oodles of time. Once the Chef handoff is made, we can get to recipe-writing and role-assigning.

        "FrontEndLC" : {
            "Type" : "AWS::AutoScaling::LaunchConfiguration",
            "Metadata" : {
                "AWS::CloudFormation::Init" : {
                    "config" : {
                        "packages" : {
                            "rubygems" : {
                                "chef" : [],
                                "ruby-shadow" : [],
                                "ohai" : [],
                                "json" : []
                            },
                            "yum" : {
                                "ruby19"            : [],
                                "ruby19-devel"        : [],
                                "ruby19-irb" : [],
                                "ruby19-libs"            : [],
                                "rubygem19-io-console"              : [],
                                "rubygem19-json"             : [],
                                "rubygem19-rake" : [],
                                "wget"            : [],
                                "rubygem19-rdoc"        : [],
                                "rubygems19"        : [],
                                "rubygems19-devel"        : [],
                                "gcc"        : [],
                                "gcc-c++"        : [],
                                "automake"        : [],
                                "autoconf"        : [],
                                "make"        : [],
                                "curl"        : [],
                                "dmidecode"        : []
                            }
                        },
                        "files" : {
                            "/etc/chef/client.rb" : {
                                "content" : { "Fn::Join" : ["", [
                                    "log_level        :info\n",
                                    "log_location     STDOUT\n",
                                    "ssl_verify_mode  :verify_none\n",
                                    "chef_server_url  '", { "Ref" : "ChefServerURL" }, "'\n",
                                    "environment      '", { "Ref" : "Environment" }, "'\n",
                                    "validation_client_name 'chef-validator'\n"
                                ]]},
                                "mode"  : "000644",
                                "owner" : "root",
                                "group" : "root"
                            },
                            "/etc/chef/roles.json" : {
                                "content" : {
                                    "run_list": [ "role[frontend]" ],
                                    "chef_role": "frontend",
                                    "stack_name": { "Ref" : "AWS::StackName" },
                                    "aws_region": { "Ref" : "AWS::Region" },
                                    "deploy_user": { "Ref" : "DeployUser" },
                                    "deploy_pass": { "Ref" : "DeployPass" },
                                    "deploy_bucket": { "Ref" : "DeployBucket" },
                                    "warning_sns_arn": { "Ref" : "WarningTopic" },
                                    "critical_sns_arn": { "Ref" : "CriticalTopic" },
                                    "iam_access_key": { "Ref" : "IAMAccessKey" },
                                    "iam_secret_key": { "Fn::GetAtt" : ["IAMAccessKey", "SecretAccessKey"] },
                                    "frontend_endpoint": { "Fn::GetAtt" : [ "FrontEndELB", "DNSName" ] },
                                    "s3_bucket": { "Ref" : "S3Bucket" }
                                },
                                "mode"  : "000644",
                                "owner" : "root",
                                "group" : "root"
                            },
                            "/etc/chef/ohai/hints/ec2.json" : {
                                "content" : "{}",
                                "mode"   : "000644",
                                "owner"  : "root",
                                "group"  : "root"
                            }
                        }
                    }
                }
            },
            "Properties" : {
                "KeyName" : { "Ref" : "KeyName" },
                "SecurityGroups" : [ { "Ref" : "FrontEndSG" } ],
                "InstanceType" : { "Ref" : "FrontEndInstanceType" },
                "ImageId": { "Fn::FindInMap": [ "AWSRegionArch2AMIEBS", { "Ref": "AWS::Region" }, { "Fn::FindInMap": [ "AWSInstanceType2Arch", { "Ref": "FrontEndInstanceType" }, "Arch" ] } ] },
                "UserData" : { "Fn::Base64" :
                               { "Fn::Join" : [ "", [
                                   "#!/bin/bash\n\n",

                                   "/opt/aws/bin/cfn-init -v --region ", { "Ref" : "AWS::Region" },
                                   " -s ", { "Ref" : "AWS::StackName" }, " -r FrontEndLC ",
                                   " --access-key ", { "Ref" : "DeployUser" },
                                   " --secret-key ", { "Ref" : "DeployPass" }, "\n",

                                   "LOCAL_IP=`curl -s http://169.254.169.254/latest/meta-data/local-ipv4`\n",
                                   "IID=`curl -s http://169.254.169.254/latest/meta-data/instance-id`\n",
                                   "echo \"node_name        \\\"", { "Ref" : "AWS::StackName" }, "-frontend-$LOCAL_IP-$IID\\\"\" >> /etc/chef/client.rb\n",
                                   "/usr/local/bin/chef-client -N ", { "Ref" : "AWS::StackName" }, "-frontend-$LOCAL_IP-$IID -j /etc/chef/roles.json", "\n"
                               ]]}
                             }
            }
        },
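
The node naming done in the UserData above (stack name, role, local IP, instance id) can be mirrored in Ruby, along with the reverse step of recovering the instance id from a node name. Both helpers are hypothetical and assume the instance id is always the last component of the name.

```ruby
# Sketch only: build and take apart node names shaped like the ones the
# UserData above registers. Both helpers are hypothetical.
def node_name(stack, role, local_ip, instance_id)
  "#{stack}-#{role}-#{local_ip}-#{instance_id}"
end

def instance_id_from(name)
  match = name.match(/(i-[0-9a-f]+)\z/)
  match && match[1]
end

name = node_name("mystack", "frontend", "10.0.0.5", "i-1288786a")
puts name
puts instance_id_from(name)
```

A lookup like instance_id_from is handy when a termination notice only carries the instance id but the Chef node was registered under a longer name.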

AWS Architecture for Startup Looking to Scale Quickly

netbas — Thu, 01 Dec 2011 19:51:58 +0000

We were asked to prepare a generic document for a LAMP-based startup, running on one machine in the corner of an office, looking to cope with massive scale, disaster recovery, and self-healing resilience, all within three months. Here’s what we came up with given only the above in the context of Amazon Web Services offerings. Most of their services are just shortcuts for doing the same thing yourself.

Introduction
This document outlines the ways in which AWS can address the concerns of rapid growth from a proof-of-concept to a hard-hit Internet website. It will address options comprehensively, with an eye toward the future. However, as is the case with rapid growth of early stage startups, the question will often come down to priorities.

Scalability
The first step toward scalability is to remove state from components, starting with the PHP. Once this is done, non-sticky load balancing via Elastic Load Balancers (ELBs) can spread the load over an arbitrary number of machines. State can be maintained in a noSQL database such as SimpleDB, but code will have to be altered. Alternatively, to get started quickly, there exists a single PHP configuration directive to maintain sessions in memcached. Such a cache is necessary in any case to widen the most often-encountered bottleneck, the relational database. ElastiCache may be leveraged quickly here, for which modest developer time will be needed to wrap all database calls with simple caching calls. In general, it is a good idea to abstract out all database calls one step further so that when the time comes to scale the DB via sharding or moving some data to a noSQL store, the developer time required won’t overwhelm. Also, it makes it easier to split reads and writes so that a number of slaves, read replicas in the case of RDS, can offload read traffic from the master. The final piece of the puzzle, necessary for AutoScaling, is automation. Each machine needs to be able to come up with everything it needs to start serving immediately. Here, CloudFormation, coupled with server role management software like Chef, can be used to fully automate the process.
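
The database abstraction described above, with cache wrapping and a read/write split, can be sketched in a few lines. The document's stack is PHP, so this Ruby version only illustrates the shape. CachedStore is a hypothetical class, and plain Hashes stand in for memcached, the master and a read replica.

```ruby
# Sketch only: a thin data-access layer with cache-aside reads and a
# read/write split. CachedStore is hypothetical; plain Hashes stand in for
# memcached, the master and a read replica.
class CachedStore
  def initialize(master, replica, cache)
    @master, @replica, @cache = master, replica, cache
  end

  def read(key)
    return @cache[key] if @cache.key?(key)   # cache hit
    value = @replica[key]                     # miss: go to a read replica
    @cache[key] = value unless value.nil?
    value
  end

  def write(key, value)
    @master[key] = value                      # writes always hit the master
    @cache.delete(key)                        # invalidate the stale entry
    value
  end
end

store = CachedStore.new({}, { "user:1" => "alice" }, {})
puts store.read("user:1")   # first read comes from the replica
puts store.read("user:1")   # second read comes from the cache
```

Because application code only ever calls read and write, sharding or moving a table to a noSQL store later means changing this one class.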

Self-healing Fault Tolerance
Once state is removed from machines, AutoScaling can maintain a minimum number of healthy instances to keep serving in the event of failure. However, if the app itself is having trouble, only a decoupled architecture can mitigate the effect of such failure. If the app functions can be split into two classes, one that presents data and one that processes it, one can fail while the other keeps going. Typically, presentation is much less likely to fail, and if there are queues in place for processing, dev/ops can get to work on the failure while the queue grows with backed-up tasks. If the processing layer is stateless as well, and if its AutoScaling is keyed to the queue size, it can catch back up very quickly after the issue is resolved. The presentation of data, albeit possibly a bit stale, continues the whole time. If appropriate, GSLB can distribute load across multiple regions. We recently worked with UltraDNS to make their service available for use with ELBs, and Dyn is reported to offer it as well.
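
Keying AutoScaling to queue size amounts to turning a backlog into a worker count, within limits. This is a minimal sketch of that calculation. The helper desired_workers and the numbers used are hypothetical, and in practice the result would feed a scaling policy rather than be computed by hand.

```ruby
# Sketch only: pick a worker count from queue depth.
# desired_workers and its parameters are hypothetical.
def desired_workers(queue_depth, per_worker, min, max)
  needed = (queue_depth.to_f / per_worker).ceil
  [[needed, min].max, max].min
end

puts desired_workers(0, 100, 2, 20)       # idle: stay at the minimum
puts desired_workers(950, 100, 2, 20)     # backlog: scale out
puts desired_workers(100_000, 100, 2, 20) # huge backlog: capped at the maximum
```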

Disaster Recovery
DR can be split up into two aspects: continuity of service and data loss. With the above, the stack can be configured quickly to spread itself out across multiple physical data centers located in each region’s Availability Zones (AZs), including the MySQL DB with the multi-AZ feature of RDS. However, in the event that Virginia slides into the ocean, continuity of service can be managed, but data loss proves much more difficult. CloudFormation stacks, for example, can be configured to fully launch the stack in any AWS region with one button as long as the software is region-agnostic. MySQL data loss, however, is more tricky. Periodic off-site backups help, but MySQL replication across the Internet is not reliable. Data loss can be minimized by spreading the data across multiple regions and perhaps using GSLB to distribute traffic among them. NoSQL DB replication like MongoDB’s is often much more forgiving. It is for this reason, along with the bottleneck issues, that using a NoSQL store whenever possible is recommended.

Filesystem constraints
Removing state for scalability requires that local filesystems be used only for ephemeral processing. This may take the form of storing things like images on S3 and pointing all clients there.

Latency
Latency can be reduced in several ways. First and most effective is serving static content via a Content Delivery Network (CDN) such as CloudFront. The origin can be a host or S3, and if the content is suitable for caching, CloudFront minimizes latency by providing local copies around the world. The aforementioned GSLB can help minimize latency of traffic to the EC2 instances as well.

Security of data at rest and in transit
First, encryption can be used everywhere. ELBs can be configured to use HTTPS, and the machines to use SSH for administrative access. In addition, traffic to and from the various AWS services can all be encrypted as well. Security Groups should tightly restrict access to data at rest. For example, a database should only be available to the application layer. Virtual Private Cloud (VPC) can be used to allow full access to dev/ops while further restricting access, even if it is already encrypted. Add to this multi-factor authentication and user-based access control with IAM for a very locked-down environment. Assuming the application is in a three-tier architecture, the public would have access only to the front-end machines and S3/CloudFront, only the middle tier would have access to the DB, and only rigorously screened users would have access to everything. Furthermore, IAM is granular enough to restrict role accounts for write access to services like S3. If inter-region tunnels are required for, say, international replication, VPC does not yet offer cross-region tunnels, but its gateways can be used in conjunction with tunneling software.
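As an illustration of "a database should only be available to the application layer," the sketch below builds a CloudFormation security group resource in Python and could be dumped to JSON. It is a fragment under assumptions: the logical ID passed in for the application tier's group and the MySQL port default are hypothetical, and a group inside a VPC would also need a VpcId property.

```python
import json


def db_security_group(app_sg_logical_id, port=3306):
    """Return a CloudFormation resource allowing DB traffic only from the app tier.

    app_sg_logical_id is the (hypothetical) logical ID of the application
    tier's security group declared elsewhere in the same template.
    """
    return {
        "Type": "AWS::EC2::SecurityGroup",
        "Properties": {
            "GroupDescription": "MySQL reachable only from the application tier",
            "SecurityGroupIngress": [
                {
                    "IpProtocol": "tcp",
                    "FromPort": port,
                    "ToPort": port,
                    # No CIDR range here: the only permitted source is the
                    # application tier's own security group.
                    "SourceSecurityGroupId": {
                        "Fn::GetAtt": [app_sg_logical_id, "GroupId"]
                    },
                }
            ],
        },
    }


if __name__ == "__main__":
    print(json.dumps(db_security_group("AppTierSecurityGroup"), indent=2))
```

The point of the design is that no address range appears at all; membership in the application tier's group is the only way in.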

Summary
I’ve offered an outline of an architecture in a perfect world. To scale within one to three months, not all of this can be reasonably implemented by a small startup. We recommend starting with abstracting out all database calls and introducing caching using ElastiCache. During the abstraction, keep a keen eye on at least being able to separate DB reads from writes quickly. See if a move to RDS is easy. Also, remove state from the application layer and stick it behind an ELB. This should be achievable within one month. Automation should then be put in place, AutoScaling enabled, and everything wrapped up in a region-independent CloudFormation stack. At this point, read replicas can be used with the read/write split in the database abstraction layer to alleviate any MySQL bottlenecks. The move to a non-filesystem data store like S3 should be implemented if it wasn’t already as part of state removal. Enable CloudFront for the static content and the business should be able to handle foreseeable growth in the next three months.
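The read/write split in the database abstraction layer can be sketched roughly as follows. This is a minimal Python illustration, not a production router: the class name and endpoint strings are hypothetical, it only looks at the first SQL keyword, and it ignores replica lag and transactions, both of which a real layer would have to handle.

```python
import itertools


class ReadWriteRouter:
    """Pick a DB endpoint per statement: reads to replicas, writes to primary."""

    def __init__(self, primary, replicas=()):
        self.primary = primary
        # Round-robin over read replicas; None means "no replicas yet".
        self._replicas = itertools.cycle(replicas) if replicas else None

    def endpoint_for(self, sql):
        words = sql.split(None, 1)
        verb = words[0].upper() if words else ""
        if verb == "SELECT" and self._replicas is not None:
            return next(self._replicas)
        # Anything that is not a plain SELECT goes to the primary.
        return self.primary


if __name__ == "__main__":
    router = ReadWriteRouter("primary.example.internal",
                             ["replica-1.example.internal",
                              "replica-2.example.internal"])
    print(router.endpoint_for("SELECT * FROM ads"))
    print(router.endpoint_for("UPDATE ads SET clicks = clicks + 1"))
```

Because every query already passes through one place, read replicas can be added later by changing the constructor arguments rather than the application code.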

GSLB Failover for AWS ELBs

netbas — Sat, 27 Aug 2011 18:23:11 +0000

We just launched the first phase of a low-latency ad app in three AWS regions, us-east-1, us-west-1, and eu-west-1. One DNS record distributes load across all three via GSLB. In case you don’t know, this just means that when a client asks the authoritative nameserver where it is, the nameserver asks where the client is. If they’re in California, they go to us-west-1. If they’re in Europe, they go to eu-west-1. Asia, it depends. You can configure what goes where, even if it doesn’t make sense. Now that we’re load balancing everywhere, what about failover?

If the EU stack goes down, can we automatically reroute traffic to, say, us-east-1? Obviously, step one is a probe of some type, a health check. Here’s where things get tricky if you’re using an AWS ELB; they use CNAMEs for elasticity. AWS can add any number of machines and do all sorts of magic as long as it’s masked by the ELB’s CNAME, which gives every probe I’ve found on the market grief.

UltraDNS (Neustar) acknowledged the difficulty and promised to fix it. Amazingly, they worked through weekends to push out new code which correctly handles a CNAME endpoint for probes. I tip my hat to their engineers and support team. With each regional stack autoscaling to handle load (keyed on latency between the ELB and machines), you can kill an entire region and watch as the failover region receives the requests and scales up accordingly. Self-healing, redundant, ultra-low latency; simply beautiful.
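The decision a GSLB nameserver makes, geography first and then failover when a probe reports a region down, can be modeled in a few lines. This is only a sketch in Python of the logic described above: the client area names and the preference orderings are made up for illustration and are not UltraDNS configuration.

```python
# Hypothetical preference tables: nearest region first, fallbacks after it.
PREFERRED = {
    "us-west": ["us-west-1", "us-east-1", "eu-west-1"],
    "us-east": ["us-east-1", "us-west-1", "eu-west-1"],
    "europe": ["eu-west-1", "us-east-1", "us-west-1"],
}
DEFAULT_ORDER = ["us-east-1", "us-west-1", "eu-west-1"]


def resolve(client_area, healthy_regions):
    """Return the region a client should be sent to, or None if all are down.

    healthy_regions is the set of regions whose health probe currently passes.
    """
    for region in PREFERRED.get(client_area, DEFAULT_ORDER):
        if region in healthy_regions:
            return region
    return None


if __name__ == "__main__":
    everything_up = {"us-east-1", "us-west-1", "eu-west-1"}
    print(resolve("europe", everything_up))
    # Simulate the EU stack going down: traffic falls through to the next choice.
    print(resolve("europe", everything_up - {"eu-west-1"}))
```

The failover region only has to absorb the extra requests, which is where the per-region autoscaling comes in.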

DNS providers like UltraDNS offer enhanced services to route not only by geographic location, but also by actual performance. Maybe it makes more sense for someone in Kansas to hit us-west-1 because of some Midwest Internet indigestion. In that case, such a service would account for that. Given the first we were already handling, we thought it prudent to start with GSLB alone, but I look forward to testing the smarter services.

AWS CloudFormation Case Study

netbas — Mon, 25 Jul 2011 18:28:17 +0000

I was asked to write up how I implemented CloudFormation as it began to roll out. It helped me replace RightScale wholesale, in as flexible a manner as I cared to code in JSON. Below is a draft.

—-

Our task was to roll out a set of ad products based around influence as a metric using AWS for things like analytics, smart display ads, contextual and behavioral targeting to name a few. Starting fresh and fast with no physical infrastructure and oodles of new data, we had to remain nimble and scale quickly. Building from a team of one to twenty in the span of months, however, it quickly became necessary to automate and organize not only machines but process. Each product required QA/LT, staging and production environments. DR, HA and scaling requirements demanded a plan. Enter CloudFormation.

Each product has its own set of CloudFormation stacks, configured to procure the necessary AWS services: ELBs, autoscaling groups, queues, security groups, buckets, etc. Great; then what? The machines need to talk to each other. With EC2’s UserData field, machines can be passed values for any of the resources brought up in the stack, e.g. RDS endpoints/creds, private IPs, queue ARNs, etc. Use an AMI which can execute a command issued by UserData and you’re done. Ubuntu and Amazon Linux do this off-the-shelf; simply start the UserData with a shebang. Stack machines can now come up with all the stack data and configure themselves for service via the deployment method of choice. Furthermore, since they can update themselves, there’s no need for reburning AMIs every time configuration changes or updates are needed. Our machines even set the prompt to their role and stack name so we don’t get lost in a sea of terminals. They can even configure their own DNS.
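Here is a rough idea of what passing stack values through UserData looks like. The Python below just assembles the CloudFormation JSON for a UserData property using the Fn::Base64, Fn::Join, Fn::GetAtt and Ref intrinsic functions. The resource logical IDs and the /etc/stack.env path are hypothetical, chosen only for this sketch; our real templates differ.

```python
import json


def user_data(queue_resource, db_resource):
    """Build a UserData value that hands stack details to a booting machine.

    queue_resource and db_resource are (hypothetical) logical IDs of an SQS
    queue and an RDS instance declared elsewhere in the same template.
    """
    script = [
        # The shebang is what makes Ubuntu / Amazon Linux run this at boot.
        "#!/bin/bash\n",
        "echo QUEUE_ARN=", {"Fn::GetAtt": [queue_resource, "Arn"]},
        " >> /etc/stack.env\n",
        "echo DB_HOST=", {"Fn::GetAtt": [db_resource, "Endpoint.Address"]},
        " >> /etc/stack.env\n",
        "echo STACK_NAME=", {"Ref": "AWS::StackName"},
        " >> /etc/stack.env\n",
    ]
    # CloudFormation joins the pieces, resolves the references at launch,
    # and base64-encodes the result for EC2.
    return {"Fn::Base64": {"Fn::Join": ["", script]}}


if __name__ == "__main__":
    print(json.dumps({"UserData": user_data("WorkQueue", "AppDatabase")},
                     indent=2))
```

From there the machine's own bootstrap (Chef or whatever deployment method you prefer) can read the file and configure itself, so nothing stack-specific is baked into the AMI.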

Once the stack for QA is written, the QA lead can launch it with one button. Change autoscaling groups of one to one hundred and use this copy of the template to launch staging for LT. With stack mappings, you can use the same template for both. Add some alarms, change thresholds and launch again for production. Next round? Bring up a second production stack next to the old one and cut over. Need to revert? Leave the old one up and switch back. Virginia slides into the ocean? Launch the exact same stack in Singapore. Shut down the staging and old production stacks when you’re done to save money. We call our release method Deployment by Death because all we do is update the release code and kill the boxes.
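The "same template for both" trick with stack mappings might look something like the fragment below, again built in Python and dumped to JSON. The parameter name Stage, the mapping name StageSizes and the resource name are assumptions for illustration, and the autoscaling group is deliberately incomplete (a real one also needs its launch configuration and Availability Zones).

```python
import json


def sizing_fragment():
    """Template fragment: one parameter selects group sizes from a mapping."""
    return {
        "Parameters": {
            "Stage": {"Type": "String", "AllowedValues": ["qa", "staging"]}
        },
        "Mappings": {
            # QA runs groups of one; staging for load testing runs one hundred.
            "StageSizes": {
                "qa": {"MinSize": "1", "MaxSize": "1"},
                "staging": {"MinSize": "100", "MaxSize": "100"},
            }
        },
        "Resources": {
            "WorkerGroup": {
                "Type": "AWS::AutoScaling::AutoScalingGroup",
                "Properties": {
                    # Fn::FindInMap looks up the size for the chosen Stage.
                    "MinSize": {"Fn::FindInMap": [
                        "StageSizes", {"Ref": "Stage"}, "MinSize"]},
                    "MaxSize": {"Fn::FindInMap": [
                        "StageSizes", {"Ref": "Stage"}, "MaxSize"]},
                },
            }
        },
    }


if __name__ == "__main__":
    print(json.dumps(sizing_fragment(), indent=2))
```

Launching QA or staging then comes down to the value passed for the parameter; the template file itself never changes.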

Here’s how we bring up three Targeting stacks across the world in one quick line:

> for i in us-east us-west eu-west; do cfn-create-stack Targeting-QA1 -f targeting.qa1.json --region $i-1; done
arn:aws:cloudformation:us-east-1:281541528619:stack/Targeting-QA1/28af2dsa-b4a7-110e-a938-6861c490a786
arn:aws:cloudformation:us-west-1:281541528619:stack/Targeting-QA1/234e5cd0-b4a7-110e-c8ac-2727c0db5486
arn:aws:cloudformation:eu-west-1:281541528619:stack/Targeting-QA1/154a5d00-b4a7-110e-a26e-275921498aea

They all come up configured and serving, typically within minutes.