GSLB Failover for AWS ELBs

We just launched the first phase of a low-latency ad app in three AWS regions: us-east-1, us-west-1, and eu-west-1. One DNS record distributes load across all three via GSLB (global server load balancing). In case you don’t know, this just means that when a client asks the authoritative nameserver where the app is, the nameserver first works out where the client is. If the client is in California, it goes to us-west-1. If it’s in Europe, it goes to eu-west-1. Asia, it depends. You can configure what goes where, even if it doesn’t make sense. Now that we’re load balancing everywhere, what about failover?
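To make the routing idea concrete, here is a minimal sketch of a geographic lookup. The location codes, the Asia mapping, and the default region are all assumptions made up for illustration; this is not UltraDNS configuration syntax, just the shape of the decision.

```python
# Hypothetical routing table: coarse client location -> AWS region.
# The location codes and the Asia choice are illustrative assumptions.
GEO_ROUTES = {
    "NA-WEST": "us-west-1",
    "NA-EAST": "us-east-1",
    "EU": "eu-west-1",
    "ASIA": "us-west-1",  # "it depends"; picked arbitrarily for this sketch
}

# Assumed catch-all for locations the table does not cover.
DEFAULT_REGION = "us-east-1"


def region_for(client_geo: str) -> str:
    """Return the region that should answer a client from client_geo."""
    return GEO_ROUTES.get(client_geo, DEFAULT_REGION)
```

For example, `region_for("EU")` returns `"eu-west-1"`, and an unknown location falls through to the default.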

If the EU stack goes down, can we automatically reroute traffic to, say, us-east-1? Obviously, step one is a probe of some type, a health check. Here’s where things get tricky if you’re using an AWS ELB: an ELB is reached through a DNS name (the target of your CNAME) rather than a fixed IP address, and that is what makes it elastic. AWS can add any number of machines and do all sorts of magic as long as it’s masked by that name, but a hostname endpoint gave every probe I found on the market grief.
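A probe that copes with this has to re-resolve the ELB hostname on every check instead of pinning an IP. The sketch below shows one way that might look using only the Python standard library. The "healthy if any address accepts a TCP connection" policy, the port, and the timeout are my assumptions, not how UltraDNS implemented it.

```python
import socket
from typing import Callable, List


def resolve_ips(hostname: str) -> List[str]:
    """Resolve hostname to its current IPv4 addresses (CNAMEs are followed)."""
    infos = socket.getaddrinfo(
        hostname, 80, family=socket.AF_INET, type=socket.SOCK_STREAM
    )
    return sorted({info[4][0] for info in infos})


def tcp_check(ip: str, port: int = 80, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to ip:port succeeds within timeout."""
    try:
        with socket.create_connection((ip, port), timeout=timeout):
            return True
    except OSError:
        return False


def elb_is_healthy(
    hostname: str,
    resolve: Callable[[str], List[str]] = resolve_ips,
    check: Callable[[str], bool] = tcp_check,
) -> bool:
    """Re-resolve the hostname on every call, then probe each address.

    Assumed policy: the endpoint is healthy if at least one of its
    current addresses passes the check.
    """
    try:
        ips = resolve(hostname)
    except OSError:  # socket.gaierror is a subclass of OSError
        return False
    return any(check(ip) for ip in ips)
```

The `resolve` and `check` parameters exist only so the logic can be exercised without touching the network; a real probe would more likely issue an HTTP request against an application health URL.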

UltraDNS (Neustar) acknowledged the difficulty and promised to fix it. Amazingly, they worked through weekends to push out new code that correctly handles a CNAME endpoint for probes. I tip my hat to their engineers and support team. With each regional stack autoscaling to handle load (keyed on latency between the ELB and the machines behind it), you can kill an entire region and watch as the failover region receives the requests and scales up accordingly. Self-healing, redundant, ultra-low latency; simply beautiful.
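The failover decision itself is simple once the probes report health. Here is a rough sketch, assuming a hand-written preference order per region. The orderings below are hypothetical, apart from EU failing over to us-east-1, which is the example used earlier.

```python
from typing import Dict, List

# Hypothetical failover order: each region lists itself first, then its
# fallbacks. Only "eu-west-1 -> us-east-1" comes from the text above.
FAILOVER_ORDER: Dict[str, List[str]] = {
    "eu-west-1": ["eu-west-1", "us-east-1", "us-west-1"],
    "us-east-1": ["us-east-1", "us-west-1", "eu-west-1"],
    "us-west-1": ["us-west-1", "us-east-1", "eu-west-1"],
}


def pick_region(preferred: str, healthy: Dict[str, bool]) -> str:
    """Return the first healthy region in the preferred region's order.

    If no region is healthy, fall back to the preferred region so the
    DNS answer is never empty (an assumed policy for this sketch).
    """
    for region in FAILOVER_ORDER[preferred]:
        if healthy.get(region, False):
            return region
    return preferred
```

With eu-west-1 marked unhealthy, `pick_region("eu-west-1", ...)` hands European clients to us-east-1, and autoscaling in that region absorbs the extra load.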

DNS providers like UltraDNS offer enhanced services that route not only by geographic location but also by actual performance. Maybe it makes more sense for someone in Kansas to hit us-west-1 because of some Midwest Internet indigestion. A performance-aware service would account for that. Given the firsts we were already handling, we thought it prudent to start with GSLB alone, but I look forward to testing the smarter services.
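As a sketch of what performance-based routing boils down to, the function below picks the region with the lowest measured latency for a given client network. Where the measurements come from and how they are expressed are assumptions here; real services have their own measurement machinery, which I have not tested yet.

```python
from typing import Dict, Optional


def fastest_region(latency_ms: Dict[str, Optional[float]]) -> Optional[str]:
    """Return the region with the lowest measured latency.

    latency_ms maps region -> round-trip time in milliseconds, with None
    meaning "no measurement available" (an assumed convention).
    Returns None when there is nothing to compare.
    """
    measured = {region: ms for region, ms in latency_ms.items() if ms is not None}
    if not measured:
        return None
    return min(measured, key=measured.get)
```

In the Kansas case, a measurement set where us-west-1 beats us-east-1 would send the client west even though geography alone might say otherwise.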