The FCC has released its report on the December 27, 2018 CenturyLink outage that affected 9-1-1 service. The problem was a packet storm in a management network that controlled a major part of CL's optical network. The packets kept multiplying and congesting the system, and because the storm was inside the management network itself, it was very difficult for CL and its vendor Infinera to diagnose and stop.
The FCC report goes into a lot of detail about why the packet storm occurred, how widespread its effects were, what steps were taken, and how it was eventually brought under control. No one knows exactly how it started, but the packets managed to pass all the filters that were supposed to stop this kind of thing from happening. The report makes the point that the feature that failed wasn't actually used by CL, but the vendor had it enabled by default.
I don't know how to prevent network failures like this from happening. They arise from a combination of coding errors and other human errors. Still: packet storms should never be able to sustain themselves, especially in a management network. There should have been loop detection. Packets should never have infinite lifetimes. There should be mechanisms that let technicians into misbehaving networks no matter what is going on. No feature should be on by default if it hasn't been configured. Lots and lots of examples. But in the end, humans make mistakes. These things happen.
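The finite-lifetime point can be made concrete. Here is a minimal sketch (purely illustrative; the names and fields are my assumptions, not anything from the report or Infinera's design): if every management-plane packet carries a hop limit that is decremented at each forwarding step, a loop can circulate a packet only a bounded number of times before it is discarded, so a storm cannot sustain itself indefinitely.

```python
# Illustrative sketch only: a hop limit ("TTL") on management-plane
# packets guarantees a forwarding loop cannot sustain a storm forever.
from dataclasses import dataclass
from typing import Optional

@dataclass
class Packet:
    payload: str
    ttl: int  # remaining hops before the packet must be discarded

def forward(packet: Packet) -> Optional[Packet]:
    """Forward one hop: decrement the TTL, drop the packet at zero."""
    if packet.ttl <= 1:
        return None  # hop limit exhausted; the packet leaves the network
    return Packet(packet.payload, packet.ttl - 1)

def hops_survived_in_loop(packet: Packet, max_iterations: int = 10_000) -> int:
    """Count how many hops a packet survives inside a forwarding loop."""
    hops = 0
    current: Optional[Packet] = packet
    while current is not None and hops < max_iterations:
        current = forward(current)
        hops += 1
    return hops

# With a finite TTL, the packet dies after exactly ttl hops.
print(hops_survived_in_loop(Packet("mgmt-broadcast", ttl=64)))  # → 64
```

With an infinite (or never-decremented) lifetime, the same loop would run until the `max_iterations` guard, which is exactly the unbounded multiplication that congested CL's management network.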
What was avoidable was the scale of the consequences. Lots and lots of systems went down. The failure happened in the optical network, and many transports rode on that common optical network: SS7 networks were affected, packet networks were affected, virtual private networks were affected.
In all of the cases where the consequences were significant, and specifically the 9-1-1 effects, the reason emergency calls were affected was that ALL the paths the services used were on the same optical network. If all your paths are on one network, and that network fails, you are down hard. That’s what happened here, and that is preventable.
This is the real lesson to be learned here. CL is a fine network operator, Infinera is generally well thought of. There is no reason to think that any other network operator, or any other vendor would be better at this kind of thing. They all are dependent on humans, and humans make mistakes. What is wrong is relying entirely on CL, or any other single network operator, for all of your paths.
To get diversity, you need:
- Physical Diversity (where the actual fiber/cable runs)
- Operator Diversity (who runs the network)
- Supplier Diversity (who supplies the systems that underlie the networks)
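The three dimensions above lend themselves to a simple audit. A hypothetical sketch (the inventory format and field names are my own assumptions, not a real provisioning record): describe each path into a PSAP or ESInet by its physical route, its operator, and its supplier, then flag any dimension on which every path shares a single value.

```python
# Hypothetical sketch: check whether a set of paths achieves physical,
# operator, and supplier diversity. Field names are illustrative.
from dataclasses import dataclass

@dataclass(frozen=True)
class Path:
    physical_route: str   # which conduit/fiber bundle the path rides
    operator: str         # who runs the network
    supplier: str         # whose equipment and code underlie it

def missing_diversity(paths: list[Path]) -> set[str]:
    """Return the dimensions on which all paths share a single value."""
    missing = set()
    for dim in ("physical_route", "operator", "supplier"):
        if len({getattr(p, dim) for p in paths}) < 2:
            missing.add(dim)
    return missing

paths = [
    Path("conduit-north", "OperatorA", "VendorX"),
    Path("conduit-south", "OperatorB", "VendorX"),  # same supplier!
]
print(missing_diversity(paths))  # → {'supplier'}
```

The example pair looks diverse on a map and on the invoice, yet a single code bug at VendorX takes down both paths, which is precisely the failure mode this outage demonstrated.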
There have been failures that affect all the switches supplied by one vendor, so if you have multiple paths that are physically diverse, and operator diverse, but all rely on a common set of code, a bug in that code can affect all the paths.
This specific failure shows the effect of both operator diversity and supplier diversity. It used to be that network operators like CL would not allow a single vendor to supply all the equipment in a network: they qualified at least two vendors and deployed a reasonable mix of both in the network, precisely so that a bug like this one could not bring down the whole network. Those days are long gone (anyone remember the AT&T Frame Relay network failure? That was 1998!). Customers of those operators should assume that the ENTIRE network can fail.

Here, because the failure was in the optical network, customers were aghast to discover that multiple networks they thought were diverse by virtue of being entirely different technologies (SS7 and packet networks, for example) were all affected, because of the common optical network. And it's not just the optical network that is common in large operators like CL. Management networks are common, fiber bundles are common; there are lots and lots of ways a single human failure or a single code bug can cause a widespread outage.
So, to me, the root cause of this particular NETWORK failure was coding bugs in the switch, compounded by the configuration issue. But the root cause of the CALL failures, which is what we and the FCC really care about, was lack of diversity. That was foreseeable, that was preventable, and that is almost universally a critical design fault of 9-1-1 networks, including NG9-1-1 networks today. I don't know of any ESInets that have sufficient physical, operator, and supplier diversity to prevent a catastrophic failure such as this one.
While this problem tends to be most severe when the ILEC is the ESInet operator, it happens pretty much uniformly even when the ESInet operator isn’t a network operator. The ESInet operator tends to partner up with someone who has lots of fiber/IP network capability, and nearly all the paths are from that partner. And since vendor diversity in operator networks is practically non-existent, 9-1-1 Authorities have to assume their ESInet suffers from lack of both operator and supplier diversity, and this can happen to them.
But operator and supplier diversity aren't all that hard to get. Physical diversity IS hard, both because PSAPs tend not to be in places with sufficiently diverse cable entrances and because suppliers have woefully inadequate documentation on which physical paths the service they are selling actually takes. Operator and supplier diversity, by contrast, are almost always simple to obtain, and the expense tends to be small.
How about your ESInet?
And one more thing. It bothers me that it's way too easy for me to criticize an FCC report on a failure. And it's not just the FCC: I've seen reports from other sources about 9-1-1 failures that don't get to the root cause of why the 9-1-1 calls failed. They stop at the proximate cause, not the root cause. They're not asking the Five Whys. I don't think I'm particularly skilled at asking the Five Whys, but it seems obvious to me that they aren't even asking three.
I invite your comments.