AWS Architecture for Startup Looking to Scale Quickly

netbas — Thu, 01 Dec 2011 19:51:58 +0000

We were asked to prepare a generic document for a LAMP-based startup, running on one machine in the corner of an office, looking to cope with massive scale, disaster recovery, and self-healing resilience, all within three months. Here’s what we came up with, given only the above, in the context of Amazon Web Services offerings. Most of their services are just shortcuts for doing the same thing yourself.

Introduction
This document outlines the ways in which AWS can address the concerns of rapid growth from a proof-of-concept to a hard-hit Internet website. It will address options comprehensively, with an eye toward the future. However, as is the case with rapid growth of early stage startups, the question will often come down to priorities.

Scalability
The first step toward scalability is to remove state from components, starting with the PHP. Once this is done, non-sticky load balancing via Elastic Load Balancers (ELBs) can spread the load over an arbitrary number of machines. State can be maintained in a NoSQL database such as SimpleDB, but code will have to be altered. Alternatively, to get started quickly, there exists a single PHP configuration directive to maintain sessions in memcached. Such a cache is necessary in any case to widen the most often-encountered bottleneck, the relational database. ElastiCache can be put to use quickly here; modest developer time will be needed to wrap all database calls with simple caching calls. In general, it is a good idea to abstract out all database calls one step further so that when the time comes to scale the DB via sharding or moving some data to a NoSQL store, the developer time required won’t be overwhelming. This abstraction also makes it easier to split reads and writes so that a number of slaves, read replicas in the case of RDS, can offload read traffic from the master. The final piece of the puzzle, necessary for AutoScaling, is automation. Each machine needs to be able to come up with everything it needs to start serving immediately. Here, CloudFormation, coupled with server role management software like Chef, can be used to fully automate the process.
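
As a rough illustration of the database abstraction described above, here is a minimal Python sketch (the startup's own code would be PHP). Everything in it is an assumption for illustration: the `DataAccess` class, its method names, and the plain dictionary standing in for a memcached or ElastiCache client are all hypothetical. It shows reads going through a cache and then to a replica, and writes going to the master and invalidating the cached entry.

```python
import itertools
from typing import Any, Callable, Dict, List


class DataAccess:
    """Hypothetical data-access layer: cache-aside reads, read/write split."""

    def __init__(self, master: Callable[[str], Any],
                 replicas: List[Callable[[str], Any]],
                 cache: Dict[str, Any]) -> None:
        self._master = master
        # Round-robin over read replicas; fall back to the master if none exist.
        self._replicas = itertools.cycle(replicas or [master])
        # A plain dict stands in for a memcached/ElastiCache client here.
        self._cache = cache

    def read(self, key: str, query: str) -> Any:
        """Return the cached value, or run the query on a replica and cache it."""
        if key in self._cache:
            return self._cache[key]
        value = next(self._replicas)(query)
        self._cache[key] = value
        return value

    def write(self, key: str, statement: str) -> Any:
        """Send the write to the master and drop the now-stale cache entry."""
        result = self._master(statement)
        self._cache.pop(key, None)
        return result
```

Because callers only ever see `read` and `write`, sharding or moving some keys to a NoSQL store later would change this one class rather than every call site.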

Self-healing Fault Tolerance
Once state is removed from machines, AutoScaling can maintain a minimum number of healthy instances to keep serving in the event of failure. However, if the app itself is having trouble, only a decoupled architecture can mitigate the effect of such failure. If the app functions can be split into two classes, one that presents data and one that processes it, one can fail while the other keeps going. Typically, presentation is much less likely to fail, and if there are queues in place for processing, dev/ops can get to work on the failure while the queue grows with backed-up tasks. If the processing layer is stateless as well, and if its AutoScaling is keyed to the queue size, it can catch back up very quickly after the issue is resolved. The presentation of data, albeit possibly a bit stale, continues the whole time. If appropriate, GSLB can distribute load across multiple regions. We recently worked with UltraDNS to make their service available for use with ELBs, and Dyn is reported to offer it as well.
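
To make "AutoScaling keyed to the queue size" concrete, the following Python sketch computes how many processing workers a given backlog would call for. It is not an AWS API: the function name, the `per_worker` throughput figure, and the bounds are assumptions for illustration. In practice the same rule would more likely be expressed as alarms on a queue-depth metric driving scaling policies.

```python
import math


def desired_workers(queue_depth: int, per_worker: int,
                    min_workers: int, max_workers: int) -> int:
    """Workers needed to drain queue_depth, clamped to the group's bounds.

    per_worker is an assumed number of queued tasks one worker can
    absorb per scaling interval.
    """
    if per_worker <= 0:
        raise ValueError("per_worker must be positive")
    needed = math.ceil(queue_depth / per_worker)
    # Never drop below the healthy minimum or exceed the configured maximum.
    return max(min_workers, min(max_workers, needed))
```

With an empty queue the group sits at its minimum; after an outage the backlog pushes the group toward its maximum until the queue drains.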

Disaster Recovery
DR can be split up into two aspects: continuity of service and data loss. With the above, the stack can be configured quickly to spread itself out across multiple physical data centers located in each region’s Availability Zones (AZs), including the MySQL DB with the multi-AZ feature of RDS. However, in the event that Virginia slides into the ocean, continuity of service can be managed, but preventing data loss proves much more difficult. CloudFormation stacks, for example, can be configured to fully launch the stack in any AWS region with one button as long as the software is region-agnostic. MySQL data loss, however, is trickier. Periodic off-site backups help, but MySQL replication across the Internet is not reliable. Data loss can be minimized by spreading the data across multiple regions and perhaps using GSLB to distribute traffic among them. NoSQL DB replication like MongoDB’s is often much more forgiving. It is for this reason, along with the bottleneck issues, that using a NoSQL store whenever possible is recommended.

Filesystem constraints
Removing state for scalability requires that local filesystems be used only for ephemeral processing. This may take the form of storing things like images on S3 and pointing all clients there.

Latency
Latency can be reduced in several ways. First and most effective is serving static content via a Content Delivery Network (CDN) such as CloudFront. The origin can be a host or S3, and if the content is suitable for caching, CloudFront minimizes latency by providing local copies around the world. The aforementioned GSLB can help minimize latency of traffic to the EC2 instances as well.

Security of data at rest and in transit
First, encryption can be used everywhere. ELBs can be configured to use HTTPS, and the machines, of course, to use SSH for administrative access. In addition, traffic to and from the various AWS services can all be encrypted as well. Security Groups should tightly restrict access to data at rest. For example, a database should only be available to the application layer. Virtual Private Cloud (VPC) can be used to allow full access to dev/ops while further restricting access for everyone else, even where traffic is already encrypted. Add to this multi-factor authentication and user-based access control with IAM for a very locked-down environment. Assuming the application is in a three-tier architecture, the public would only have access to front-end machines and S3/CloudFront, the middle tier would have access only to the DB, and everything would be open only to rigorously screened users. Furthermore, IAM is granular enough to restrict role accounts to write access for services like S3. If inter-region tunnels are required for, say, international replication, VPC does not yet offer cross-region tunnels, but its gateways can be used in conjunction with tunneling software.
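
As one example of that IAM granularity, the Python sketch below builds a policy document that lets a role account upload objects under a single S3 prefix and nothing else. The function name, bucket, and prefix are hypothetical; the `Version` string is the current IAM policy language version, which postdates this post, and `s3:PutObject` is the standard S3 upload action.

```python
import json


def write_only_policy(bucket: str, prefix: str) -> str:
    """Return an IAM policy (as JSON) allowing uploads under one S3 prefix."""
    policy = {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                # Upload only: no read, list, or delete permissions.
                "Action": ["s3:PutObject"],
                "Resource": f"arn:aws:s3:::{bucket}/{prefix}/*",
            }
        ],
    }
    return json.dumps(policy, indent=2)


if __name__ == "__main__":
    # Hypothetical bucket and prefix, for illustration only.
    print(write_only_policy("example-uploads", "images"))
```

A role account holding only this policy can push files to S3 but cannot read back or remove what is already there.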

Summary
I’ve offered an outline of an architecture in a perfect world. To scale within one to three months, not all of this can be reasonably implemented by a small startup. We recommend starting with abstracting out all database calls and introducing caching using ElastiCache. During the abstraction, keep a keen eye on at least being able to separate DB reads from writes quickly. See if a move to RDS is easy. Also, remove state from the application layer and stick it behind an ELB. This should be achievable within one month. Automation should then be put in place, AutoScaling enabled, and everything wrapped up in a region-independent CloudFormation stack. At this point, read replicas can be used with the read/write split in the database abstraction layer to alleviate any MySQL bottlenecks. The move to a non-filesystem data store like S3 should be implemented if it wasn’t already as part of state removal. Enable CloudFront for the static content and the business should be able to handle foreseeable growth in the next three months.

AWS CloudFormation Case Study

netbas — Mon, 25 Jul 2011 18:28:17 +0000

I was asked to write up how I implemented CloudFormation as it began to roll out. It helped me replace RightScale wholesale, in as flexible a manner as I cared to code in JSON. Below is a draft.

----

Our task was to roll out a set of ad products based around influence as a metric using AWS for things like analytics, smart display ads, contextual and behavioral targeting to name a few. Starting fresh and fast with no physical infrastructure and oodles of new data, we had to remain nimble and scale quickly. Building from a team of one to twenty in the span of months, however, it quickly became necessary to automate and organize not only machines but process. Each product required QA/LT, staging and production environments. DR, HA and scaling requirements demanded a plan. Enter CloudFormation.

Each product has its own set of CloudFormation stacks, configured to procure the necessary AWS services: ELBs, autoscaling groups, queues, security groups, buckets, etc. Great; then what? The machines need to talk to each other. With EC2’s UserData field, machines can be passed values for any of the resources brought up in the stack, e.g. RDS endpoints/creds, private IPs, queue ARNs, etc. Use an AMI which can execute a command issued by UserData and you’re done. Ubuntu and Amazon Linux do this off-the-shelf; simply start the UserData with a shebang. Stack machines can now come up with all the stack data and configure themselves for service via the deployment method of choice. Furthermore, since they can update themselves, there’s no need for reburning AMIs every time configuration changes or updates are needed. Our machines even set the prompt to their role and stack name so we don’t get lost in a sea of terminals. They can even configure their own DNS.
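
To illustrate the shape of such a UserData payload, here is a small Python sketch that renders a shebang-led script from a set of stack values. It is not how our templates were written (in a CloudFormation template the same text would typically be assembled with `Fn::Join` and `Ref` and wrapped in `Fn::Base64`); the function name, variable names, and bootstrap command are all hypothetical.

```python
import shlex
from typing import Mapping


def render_user_data(values: Mapping[str, str], bootstrap_cmd: str) -> str:
    """Render a UserData script that exports stack values, then bootstraps.

    The leading shebang is what makes images such as Ubuntu and
    Amazon Linux execute the payload at first boot.
    """
    lines = ["#!/bin/bash", "set -e"]
    for name, value in sorted(values.items()):
        # shlex.quote keeps values with spaces or shell metacharacters intact.
        lines.append(f"export {name}={shlex.quote(value)}")
    lines.append(bootstrap_cmd)
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Hypothetical stack values and deployment command.
    print(render_user_data(
        {"DB_ENDPOINT": "db.example.internal", "STACK_NAME": "Targeting-QA1"},
        "/usr/local/bin/bootstrap-role web",
    ))
```

Whatever runs as the bootstrap command can then read those values and hand off to the deployment method of choice.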

Once the stack for QA is written, the QA lead can launch it with one button. Change the autoscaling group size from one to one hundred and use this copy of the template to launch staging for LT. With stack mappings, you can use the same template for both. Add some alarms, change thresholds and launch again for production. Next round? Bring up a second production stack next to the old one and cut over. Need to revert? Leave the old one up and switch back. Virginia slides into the ocean? Launch the exact same stack in Singapore. Shut down the staging and old production stacks when you’re done to save money. We call our release method Deployment by Death because all we do is update the release code and kill the boxes.
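
A sketch of how stack mappings let one template serve both QA and staging: the Python below builds a template fragment whose autoscaling group sizes are looked up by environment with `Fn::FindInMap`. The parameter name `Env`, the mapping name `GroupSize`, and the resource names are made up for illustration, and the launch configuration that `WebLaunchConfig` refers to is omitted, so this is a fragment rather than a launchable template.

```python
import json
from typing import Any, Dict


def build_template() -> Dict[str, Any]:
    """Build a CloudFormation fragment sized per environment via a mapping."""

    def size(field: str) -> Dict[str, Any]:
        # Look up Min/Max for whichever environment is passed at launch time.
        return {"Fn::FindInMap": ["GroupSize", {"Ref": "Env"}, field]}

    return {
        "AWSTemplateFormatVersion": "2010-09-09",
        "Parameters": {
            "Env": {"Type": "String", "AllowedValues": ["qa", "staging"]},
        },
        "Mappings": {
            "GroupSize": {
                "qa": {"Min": "1", "Max": "1"},
                "staging": {"Min": "1", "Max": "100"},
            },
        },
        "Resources": {
            "WebGroup": {
                "Type": "AWS::AutoScaling::AutoScalingGroup",
                "Properties": {
                    "AvailabilityZones": {"Fn::GetAZs": ""},
                    # WebLaunchConfig is assumed to be defined elsewhere.
                    "LaunchConfigurationName": {"Ref": "WebLaunchConfig"},
                    "MinSize": size("Min"),
                    "MaxSize": size("Max"),
                },
            },
        },
    }


if __name__ == "__main__":
    print(json.dumps(build_template(), indent=2))
```

Launching with `Env` set to `qa` or `staging` then picks the group size without keeping two copies of the template.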

Here’s how we bring up three Targeting stacks across the world in one quick line:

> for i in us-east us-west eu-west; do cfn-create-stack Targeting-QA1 -f targeting.qa1.json --region $i-1; done
arn:aws:cloudformation:us-east-1:281541528619:stack/Targeting-QA1/28af2dsa-b4a7-110e-a938-6861c490a786
arn:aws:cloudformation:us-west-1:281541528619:stack/Targeting-QA1/234e5cd0-b4a7-110e-c8ac-2727c0db5486
arn:aws:cloudformation:eu-west-1:281541528619:stack/Targeting-QA1/154a5d00-b4a7-110e-a26e-275921498aea

They all come up configured and serving, typically within minutes.
