The recent Amazon Web Services outage, which took out a significant portion of the internet, games, and even smart home devices for days, was extensively covered in the news. Cloud services’ distributed architecture should protect customers from failures like this one, so what went wrong? Amazon published a detailed technical post-mortem of the failure, and as the famous haiku goes: “It’s not DNS. / There’s no way it’s DNS. / It was DNS.”
As a rough analogy, consider what happens when there’s a car crash: a traffic jam stretches for miles, in an accordion-like effect that lasts well after the accident scene has been cleared. The very first problem was fixed relatively quickly, with a roughly three-hour outage from October 19 at 11:48 PM until October 20 at 2:40 AM (PDT). However, as with the traffic jam, dependent services started failing, and didn’t fully come back online until much later.
The root cause, per Amazon’s post-mortem, was that the DNS configuration for DynamoDB (the database service) was broken and then published to Route53 (the DNS service). In turn, parts of EC2 (the virtual machine service) also went down, as its automated management systems rely on DynamoDB. Amazon’s Network Load Balancer naturally depends on DNS as well, so it too encountered issues.
It’s worth noting that DynamoDB failing across the entire US-East-1 region is, by itself, enough to bring down what are probably millions of websites and services. However, not being able to bring up EC2 instances was extra bad, and load balancing being affected was diamond-badge bad.
The specific technical issue behind the DNS failure was a programmer’s “favorite” bug: a race condition, in which the outcome of concurrent operations depends on their unpredictable timing. Here, two repeating processes kept re-doing and undoing each other’s effects; the famous GIF of Bugs Bunny and Daffy Duck flipping the poster back and forth is illustrative.
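To make that concrete, here’s a toy Python sketch (ours, with no relation to AWS’s actual code): two threads fight over a shared value, and because the check and the update are not atomic, the final state depends purely on thread scheduling.

```python
import threading

# Toy race condition: the "poster" is shared state, and each thread
# checks it before acting, but the check and the act are not atomic.
poster = "blank"

def hang_poster(season: str) -> None:
    global poster
    for _ in range(100_000):
        if poster != season:  # check the shared state...
            poster = season   # ...then act on a possibly stale view

rabbit = threading.Thread(target=hang_poster, args=("rabbit season",))
duck = threading.Thread(target=hang_poster, args=("duck season",))
rabbit.start(); duck.start()
rabbit.join(); duck.join()

print(poster)  # either value can win; re-runs may disagree
```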
The DynamoDB DNS resolution relies on two components. A DNS Planner, as the name implies, periodically issues a new Plan that takes system load and availability into account. DNS Enactors, whenever they see a new Plan, apply it to Route53 as a transaction, meaning a plan either applies fully or not at all. So far, so good.
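AWS hasn’t published this code, so the following is only a minimal sketch of the design as described; every name and data structure here is an assumption.

```python
import itertools
from dataclasses import dataclass, field

@dataclass
class Plan:
    generation: int  # monotonically increasing version (assumed)
    records: dict[str, list[str]] = field(default_factory=dict)  # hostname -> IPs

class DNSPlanner:
    """Periodically emits a fresh Plan based on load and availability."""
    def __init__(self) -> None:
        self._gen = itertools.count(1)

    def make_plan(self, healthy_ips: list[str]) -> Plan:
        # Hypothetical endpoint name, for illustration only.
        return Plan(next(self._gen), {"dynamodb.us-east-1": healthy_ips})

class DNSEnactor:
    """Applies a Plan to the DNS store as an all-or-nothing transaction."""
    def apply(self, plan: Plan, dns_store: dict) -> None:
        staged = dict(plan.records)   # stage the complete change set first...
        dns_store.update(staged)      # ...then commit it in a single step
        dns_store["active_generation"] = plan.generation

planner = DNSPlanner()
route53: dict = {}
DNSEnactor().apply(planner.make_plan(["10.0.0.1", "10.0.0.2"]), route53)
print(route53)
```

Note what the transaction does and doesn’t guarantee: a plan always lands whole, but nothing here ensures the plan being applied is still the newest one. That gap is exactly where the outage happened.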
What happened was that the first DNS Enactor was taking its sweet time applying what we’ll call the Old Plan. As New Plans came in, another Enactor took one and applied it. There’s now good, updated data in Route53, and a clean-up of outdated plans (Old Plan included) is issued, just as the first Enactor finishes applying the Old Plan. The stale apply overwrites the fresh records, and the clean-up then deletes the very plan that just became active, leaving the DynamoDB endpoint with an empty DNS record.
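Here’s a simplified replay of that interleaving, again as a model of the post-mortem’s description rather than AWS’s actual code (all names are made up):

```python
# Simplified model of the interleaving; not AWS's implementation.
endpoint = "dynamodb.us-east-1"
route53 = {}                                # the live DNS store
old_plan = {"gen": 1, "ips": ["10.0.0.1"]}
new_plan = {"gen": 2, "ips": ["10.0.0.2"]}
plan_store = {1: old_plan, 2: new_plan}     # plans known to the system

def enactor_apply(plan: dict) -> None:
    # The staleness check ran before the slow Enactor's long delay,
    # so by the time it commits, "its" plan may already be outdated.
    route53[endpoint] = plan["ips"]
    route53["active_gen"] = plan["gen"]

def cleanup(newest_gen: int) -> None:
    # Garbage-collects plans older than the newest one, including,
    # fatally, the Old Plan the delayed Enactor just made active.
    for gen in [g for g in plan_store if g < newest_gen]:
        del plan_store[gen]
    if route53.get("active_gen") not in plan_store:
        route53[endpoint] = []              # active plan gone: empty record

enactor_apply(new_plan)   # second Enactor commits the New Plan
enactor_apply(old_plan)   # delayed first Enactor overwrites it with the Old Plan
cleanup(newest_gen=2)     # clean-up deletes the Old Plan, which is now active
print(route53)            # the endpoint resolves to nothing
```

Each step behaves correctly in isolation; it’s only this particular ordering that empties the record.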