You may have heard that there was a pretty substantial outage in Amazon’s AWS cloud services on November 25, 2020. Amazon’s summary of what happened is at https://aws.amazon.com/message/11201/ and is emblematic of how a simple maintenance action can lead to layer upon layer of unanticipated effects. Amazon is an amazing company, the AWS service is really impressive, and what happened in this case could easily have occurred in any cloud service. I don’t know whether this particular outage had any direct impact on an NG9-1-1 service, and I’d be very interested in hearing if it did. The lessons to be learned here apply much more broadly than to cloud services.
A major lesson here is that, while the adage of “eat your own dogfood” is really very good advice, and generally should be followed, it has consequences that must be taken into account if you are trying to build very reliable systems. I should also point out that, as far as I know, this outage did NOT violate any regular AWS Service Level Agreements. Amazon doesn’t claim five nines, and it isn’t five nines. The outage was limited to a single region, and everyone who needs high availability from services like this knows to make their system multi-region. Still, looking at what happened is helpful for avoiding similar things happening to your system.
The outage started with a service called “Kinesis”, which enables real-time processing of streaming data (that includes media like video and audio, but also streams of web clicks and, most importantly to this event, logging data). Like nearly all AWS services, Kinesis has a large set of (virtual) servers that handle the distributed load. The trigger for the event was adding new servers to a part of Kinesis. This was a planned maintenance action, and it had been done several times before without incident. This time, however, the total number of “threads” used by the servers exceeded the limit the operating system was configured for. As is very common with a problem like this, the system didn’t produce a meaningful error report pointing to the cause. Instead, what was observed was a very large set of effects of the problem. It was confusing enough that it took a couple of hours to recognize that the capacity add had triggered the problem. And then it turned out the only way to fix it was to restart the servers, one by one. That took many more hours to complete: the problem started at 5:15 AM, and the service was back to a normal state by 10:23 PM the same day.
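To make that failure mode a little more concrete, here is a minimal sketch, in Python, of the kind of pre-flight check that catches this class of problem. This is absolutely not AWS’s code: the fleet sizes, the base thread count, and the one-thread-per-peer assumption are all made up for illustration. The only real API used is the Linux per-user process/thread limit exposed through Python’s resource module.

```python
# A minimal sketch (not AWS's actual code) of a pre-flight check for this class
# of failure. The idea: if each front-end server keeps one thread per peer in
# the fleet, the thread count per server grows with fleet size, and a capacity
# add can silently push servers past the OS thread limit. All numbers here are
# made up; the only real API is Linux's per-user limit via the resource module.
import resource

BASE_THREADS_PER_SERVER = 200   # hypothetical: threads a server needs regardless of fleet size
THREADS_PER_PEER = 1            # hypothetical: one communication thread per peer server


def projected_threads(fleet_size: int) -> int:
    """Threads one server would need if the fleet grows to fleet_size hosts."""
    return BASE_THREADS_PER_SERVER + THREADS_PER_PEER * (fleet_size - 1)


def check_capacity_add(current_fleet: int, servers_to_add: int) -> None:
    # RLIMIT_NPROC is the per-user process/thread cap on Linux; the soft limit
    # is the one actually enforced for this user.
    soft_limit, _hard_limit = resource.getrlimit(resource.RLIMIT_NPROC)
    needed = projected_threads(current_fleet + servers_to_add)
    if soft_limit != resource.RLIM_INFINITY and needed > soft_limit:
        raise RuntimeError(
            f"Capacity add would need ~{needed} threads per server, "
            f"but the OS limit is {soft_limit}; raise the limit first."
        )
    print(f"OK: ~{needed} threads per server, within the limit of {soft_limit}")


if __name__ == "__main__":
    check_capacity_add(current_fleet=5000, servers_to_add=300)
```

The point isn’t the arithmetic; it’s that an operating-system limit nobody was tracking only showed up in the capacity-planning picture after it had been exceeded.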
This was bad enough. But there is much, much more to the story. Kinesis is used by Cognito, which provides authentication and single sign-on for applications built on AWS. A latent bug in Cognito caused the problem with Kinesis to snowball and left Cognito unable to handle logins for some services. In addition, Kinesis is used by the CloudWatch service, which is the monitoring system AWS uses. That had a lot of knock-on effects, because CloudWatch is used for all kinds of things in AWS. Specifically, it’s the source of metrics for the “autoscale” functions that automatically add servers as demand increases and remove them as demand decreases. So any customer using autoscale saw that mechanism not work for most of the day. It also affected the Lambda service, which is a very low-overhead way to implement a simple microservice: a simple reactive web endpoint, a seldom-occurring task that runs off some triggering event, and so on. Lambdas are used all over the place in AWS and in customer systems.
But wait, there is more!!
Cognito (the user authentication service) was used to authenticate AWS technicians needing access to the customer notification portal. They couldn’t log in to make notifications go out!!! There was a backup system, but the technicians on duty weren’t aware of it or trained on it.
Oh my.
So, adding capacity to Kinesis caused all sorts of failures in other services.
This is the “eat your own dogfood” issue. AWS encourages its engineers to use other AWS services to implement their own AWS services. “Eat your own dogfood.” It’s a good idea in general: if the service isn’t good enough for your own people, why would it be good enough for customers? But you see the downside: a failure in one created a failure in another. As you can read in Amazon’s report, their response to this cascade of failures is to make each service capable of dealing with a dependency’s failure itself, mostly as a backup for when the other service fails. That’s a pretty good strategy, but you have to anticipate the possible failures and create workarounds in advance. And where the backup is manual, which might be very reasonable for some issues, you have to make sure the people who have to activate the backup know about it and are trained on it.
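To show what “dealing with a dependency’s failure yourself” can look like in practice, here is a small sketch, again in Python and again not Amazon’s implementation: a client that keeps the last good answer from an upstream service and serves that (while alarming) when the upstream call fails, rather than failing its own callers outright. The class and parameter names are invented for illustration.

```python
# A sketch of the "handle your dependency's failure yourself" pattern, not
# Amazon's implementation: cache the last good response from the upstream
# service and fall back to it (and alarm) when the upstream call fails.
import time
from typing import Callable, Optional


class DegradedDependency(Exception):
    """Raised when the upstream is down and no usable cached data exists."""


class FallbackClient:
    def __init__(self, fetch: Callable[[], dict], max_stale_seconds: float = 300.0):
        self._fetch = fetch                  # the call that talks to the upstream service
        self._cached: Optional[dict] = None  # last known-good response
        self._cached_at = 0.0
        self._max_stale = max_stale_seconds

    def get(self) -> dict:
        try:
            self._cached = self._fetch()
            self._cached_at = time.monotonic()
            return self._cached
        except Exception as err:
            age = time.monotonic() - self._cached_at
            if self._cached is not None and age <= self._max_stale:
                # Keep serving (stale) data and page a human, instead of
                # passing the upstream's failure straight through to our callers.
                print(f"WARNING: upstream failed ({err}); serving {age:.0f}s-old data")
                return self._cached
            raise DegradedDependency("upstream down and no usable cached data") from err
```

The hard part, as the outage showed, isn’t writing code like this; it’s knowing which dependencies need it, deciding how stale is tolerable, and making sure whatever manual steps remain are ones the on-duty staff actually know how to perform.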
It’s almost always the case that these kinds of cross-system dependencies arise incrementally. As new services come up and old ones are improved, the intertwining accumulates. That means you can’t just test for this when you first deploy a system; you have to look out for it at every update.
Which brings me to “Chaos Monkeys”. While we would like to believe we can anticipate these kinds of failures if only we put some time and effort into analyzing the situation, in my experience, that doesn’t work, and also doesn’t happen. It’s very, very tough to think through all the possibilities and recognize what would happen. Enter the Chaos Monkey.
When we build five-nines systems, we use redundancy. We have backups and workarounds and alternate paths and backstops. And we test them, but usually only under controlled conditions. The Chaos Monkey is an idea the Netflix engineers came up with that provides a whole other way to make sure your redundancy actually works: kill things on the active system.
The Chaos Monkey randomly terminates instances on the production system to ensure that the redundancy really, really works. This induces terror in most engineers and their managers. What? Deliberately break the running system? We can’t do THAT! Yes, you can. You MUST. Your system is designed to work through all of this chaos. It should work. It HAS to work. So do it. Kill random instances and make sure the system keeps working.
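For the flavor of it, here is a toy chaos monkey, not Netflix’s actual tool: pick one running EC2 instance at random from a tagged group and terminate it, using boto3. The “chaos-group” tag name and the dry-run default are my own choices for the illustration.

```python
# A toy chaos monkey (not Netflix's tool): terminate one random running EC2
# instance from a tagged group. Defaults to a dry run so nothing actually dies
# until you decide it should.
import random

import boto3
from botocore.exceptions import ClientError


def kill_one_random_instance(group: str, region: str = "us-east-1", dry_run: bool = True) -> None:
    ec2 = boto3.client("ec2", region_name=region)
    resp = ec2.describe_instances(
        Filters=[
            {"Name": "tag:chaos-group", "Values": [group]},   # "chaos-group" is an illustrative tag name
            {"Name": "instance-state-name", "Values": ["running"]},
        ]
    )
    instance_ids = [
        inst["InstanceId"]
        for reservation in resp["Reservations"]
        for inst in reservation["Instances"]
    ]
    if not instance_ids:
        print(f"No running instances tagged chaos-group={group}; nothing to kill.")
        return
    victim = random.choice(instance_ids)
    print(f"Chaos monkey selected {victim}")
    try:
        ec2.terminate_instances(InstanceIds=[victim], DryRun=dry_run)
        print(f"Terminated {victim}")
    except ClientError as err:
        # With DryRun=True, a successful permission check comes back as DryRunOperation.
        if err.response["Error"]["Code"] == "DryRunOperation":
            print("Dry run only; call with dry_run=False to actually terminate.")
        else:
            raise
```

The original Chaos Monkey ran during business hours, when engineers were around to notice and fix whatever broke; that part of the idea is worth copying too.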
Some of the failures seen in the AWS incident require work beyond random process killing. You have to kill entire systems and see what the effect is on other systems. This can only be done on non-production systems (or at least at times when everyone is prepared to deal with a system being down). You often have to do that kind of testing at scale, or at least close to scale, to see the knock-on effects. Do it during maintenance periods. Take half of the system offline and run an at-scale test of each subsystem going down, along the lines of the skeleton below. Be prepared to stop the test and restore service if you have a failure on the still-running production half, but try it, probably at least once a year. You will be surprised, as the AWS engineers were, at how failures in one system affect another.
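Here is a skeleton of what such a game-day exercise might look like. Every function in it is a hypothetical stub, and the subsystem names are illustrative; the real work is wiring take_down, restore, and health_report to your own infrastructure.

```python
# Skeleton for an at-scale "game day": on the half of the system taken out of
# production, take each subsystem down in turn, watch what happens to the
# others, and abort immediately if the still-live production half degrades.
# All of these functions are hypothetical stubs to be wired to real tooling.
import time

SUBSYSTEMS = ["streaming", "auth", "monitoring", "notifications"]  # illustrative names


def take_down(name: str) -> None:
    print(f"(stub) stopping {name} on the test half")


def restore(name: str) -> None:
    print(f"(stub) restarting {name} on the test half")


def health_report(scope: str) -> dict:
    # Stub: a real version would query each subsystem's health checks on the
    # "test" half or the "production" half.
    return {s: True for s in SUBSYSTEMS}


def game_day(observe_seconds: int = 600, poll_seconds: int = 30) -> None:
    for victim in SUBSYSTEMS:
        print(f"--- taking down {victim} ---")
        take_down(victim)
        try:
            deadline = time.monotonic() + observe_seconds
            while time.monotonic() < deadline:
                if not all(health_report("production").values()):
                    raise RuntimeError("production half degraded; abort and restore service")
                # The interesting output: which *other* test subsystems degrade
                # while this one is down. Those are the hidden dependencies.
                others = {k: v for k, v in health_report("test").items() if k != victim}
                print(f"while {victim} is down: {others}")
                time.sleep(poll_seconds)
        finally:
            restore(victim)


if __name__ == "__main__":
    game_day(observe_seconds=60, poll_seconds=15)
```

The stubs are the easy part; the value of the exercise is in the surprises that show up in that “while X is down” output.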