See my post here to learn more about me and my point of view.
As of this writing, CenturyLink (CL) has recently restored service from a massive outage of their IP network, which affected wireless #E911 service in many areas. We don’t know what happened yet. The FCC is investigating, and we will get a report that details what happened. But on August 1 this year, CL had another outage that affected 9-1-1 calls in several states, including Minnesota. Minnesota has an #NG911 system, so this is one of the first NG9-1-1 outages. The report on this outage is available from the State of Minnesota here. Note that while the document has a CL proprietary notice, it’s available on a public website from the State. If you haven’t heard about this incident, I suggest you read the CL document, and then come back to this post. I have no information on this incident beyond the report and some discussions with some of the Minnesota 9-1-1 people.
The basics are simple: West Corp (“the vendor”) fat-fingered a network provisioning operation that broke routing to their Miami node and prevented fallback to their Englewood, CO node. (Some calls were normally routed through Englewood and were not affected.) The outage lasted an hour and five minutes, and 693 calls failed in Minnesota; it’s not clear from the report how many calls from other states were affected. West eventually noticed that the failures started when the provisioning change was made, and rolled back that change. Service was restored quickly, in about 5 minutes, after that.
The report notes that this kind of provisioning activity was not at that time considered “maintenance” and thus was not done during the normal maintenance window, which apparently is every night from 10 pm to 6 am.
The corrective actions described are basically process changes: more validation of provisioning changes, breaking changes up into smaller chunks, and improved test procedures when doing that kind of provisioning.
I think West is a good vendor, probably better than most. But in my opinion it has a long way to go to address the process and system design faults I think this outage exposes. I’d guess most vendors are no better than West in these areas.
So, what can we learn? First of all, I don’t know how to prevent people from fat-fingering manual configurations. It happens. So I don’t fault West or their tech for making a mistake like that.
They were also able to revert the change and restore service fairly quickly. Now, to be sure, five nines is about 5 minutes a year of downtime, so every single minute counts, but they had backout procedures and they worked. That’s also good.
This wasn’t a software bug. It wasn’t a hardware failure. This is a system design and operating process problem.
There are several glaring problems not addressed here. This problem will probably happen again, not exactly the same way this one did, but in some similar way, because they did not address the root problems. They did not ask the “Five Whys” that are key to a good Root Cause Analysis (RCA). Or, at least it doesn’t show in this document.
If you don’t know, following an outage, there is a meeting held to determine what happened and how to prevent it from happening again. Running a good Root Cause Analysis meeting is hard. No one likes to admit their shortcomings, and when it’s a very visible outage, management tends to focus on blame and not on actually finding and fixing the real root cause. Usually, the outcome of the meeting is a report, like this one, and a set of follow-up corrective actions that should have committed dates and resource commitments.
As an aside, I expect this is not the actual RCA report. It’s a sanitized, finance, sales and management-approved version of the actual RCA. There are legal and financial reasons why this is true, and I don’t think any customer should expect anything differently. Unfortunately, while there are legitimate reasons why customers get a sanitized version, it’s sometimes the case that things they should know are kept from them because it’s embarrassing, or there is a very low probability that effective corrective action will be taken because of cost considerations. I don’t know how to fix that, and I don’t know if that happened here. Additionally, CL is the customer of West, and CL probably got a sanitized version of the West RCA. Then CL may have sanitized what they gave to the State.
Why was this manual? In other words, why is the process subject to manual error? One of my frustrations is that network ops people insist on manual configuration via the command line. It’s just the way they do things. My response to them is: if fat-fingering can take the system down, automate. They don’t like that. It’s a tool thing; they don’t like the available automated provisioning tools. I’d say tough: manual provisioning is too error-prone.
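To make the point concrete, here is a minimal sketch of what “automate it” can mean: express the change as data and validate it before anything touches the network. Everything here is hypothetical and illustrative; `KNOWN_NODES`, `validate_change`, and `apply_change` are invented names, not anything from West’s or CL’s actual systems.

```python
# Hypothetical sketch: a provisioning change expressed as data, validated
# before it is applied. All names here are illustrative, not real systems.

KNOWN_NODES = {"miami", "englewood"}

def validate_change(change: dict) -> list:
    """Return a list of validation errors; an empty list means safe to apply."""
    errors = []
    if change.get("node") not in KNOWN_NODES:
        errors.append("unknown node: %r" % change.get("node"))
    routes = change.get("routes", [])
    if not routes:
        errors.append("change would remove all routes to the node")
    for r in routes:
        if not r.get("destination"):
            errors.append("route missing destination")
    return errors

def apply_change(change: dict) -> None:
    errors = validate_change(change)
    if errors:
        # A typo never reaches the network; it is rejected here instead.
        raise ValueError("refusing to apply change: " + "; ".join(errors))
    # ... push to the provisioning system here ...
```

The point is not the specific checks; it’s that a fat-fingered node name or an accidentally emptied route list gets rejected by software instead of silently breaking routing.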
In fact, my rule of thumb is that if a change procedure takes more than 3 or 4 steps, then much more stringent processes must be in place. The changes, no matter how (in)consequential, have to be approved by a change board in advance (like a week in advance). The document that describes the steps has to be very explicit: exactly what is typed/selected/clicked, what the response looks like exactly. There has to be a set of tests that show the change had the desired effect, and, most importantly, the service(s) that run on the network that changed have to have a comprehensive system check completed following the change. That appears to be far from the process used in this instance, although some of the identified corrective actions cover improvements to their process.
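The discipline described above (explicit steps, explicit expected responses, tests that prove the change worked) can itself be mechanized. Here is a hedged sketch, under the assumption that each step of a change document is written as an apply action, a verification, and a rollback; the function names are my invention.

```python
# Illustrative sketch: a change runbook as executable steps, where every step
# carries its own verification and rollback. Any failed check backs out
# everything applied so far, most recent first.

def run_change(steps):
    """steps: list of (apply_fn, verify_fn, rollback_fn) tuples.

    Returns True if every step applied and verified; False if any
    verification failed (after rolling back all applied steps).
    """
    applied_rollbacks = []
    for apply_fn, verify_fn, rollback_fn in steps:
        apply_fn()
        applied_rollbacks.append(rollback_fn)
        if not verify_fn():
            # Back out in reverse order, like the backout procedure
            # West eventually ran by hand.
            for rollback in reversed(applied_rollbacks):
                rollback()
            return False
    return True
```

Written this way, “did the change have the desired effect?” is answered per step, immediately, instead of an hour later by an influx of failed calls.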
Why was this failure not caught immediately by monitoring? The report says “Vendor became aware of the broader issue when they identified an influx of calls”. So their monitoring did not detect a failure. Actually, it explicitly says that because “trunks were available to the Miami ECMC”, they didn’t know calls were failing to route. They were using an interior monitor (trunk usage) instead of an exterior monitor (calls are failing). And the failure here is not only West’s: CenturyLink was not actively monitoring the system either. Apparently, they expected the origination service providers to notice that calls were not completing, but that didn’t happen. This kind of problem should be detected in less than 5 minutes. Test calls should be automatically generated and automatically traced to both sites. This monitoring failure turned what should have been a 10-15 minute outage into a one-hour outage. This system, which should deliver five nines, isn’t even going to make four nines this year (the more recent CL failure apparently affected Minnesota too). I’m told that the NOC that serves 9-1-1 was overloaded with service calls, and PSAPs resorted to calling individuals at CL on their cell phones to make them aware of the extent of the problem.
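The interior-versus-exterior distinction is worth a sketch. An exterior monitor exercises the service the way a caller would: place a synthetic test call through each site and alert on consecutive failures, regardless of what the trunks claim. This is a hypothetical illustration; `place_test_call` and `alert` stand in for whatever real test-call and alerting hooks a NOC would have.

```python
import time

# Exterior-monitoring sketch: place synthetic test calls through each site
# and alert on consecutive failures, instead of trusting interior signals
# like "trunks are available". Hooks are hypothetical stand-ins.

FAILURE_THRESHOLD = 3  # consecutive failed test calls before alerting

def monitor(sites, place_test_call, alert, interval_s=60, rounds=None):
    """Loop forever (or for `rounds` iterations, useful for testing)."""
    failures = {site: 0 for site in sites}
    done = 0
    while rounds is None or done < rounds:
        for site in sites:
            if place_test_call(site):  # returns True if the call completed
                failures[site] = 0     # any success resets the counter
            else:
                failures[site] += 1
                if failures[site] == FAILURE_THRESHOLD:
                    alert("%s: %d consecutive test calls failed"
                          % (site, FAILURE_THRESHOLD))
        done += 1
        time.sleep(interval_s)
```

With a one-minute interval and a threshold of three, a site that stops completing calls pages someone within a few minutes of the bad provisioning change, no influx of customer calls required.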
And then we get to the biggie: why did the failure affect the ability to fall back to the alternate site (Englewood)? That should not be possible. There is discussion in the document about remapping cause codes, and it may be that because the wrong cause code was returned, the clients didn’t attempt to try the other site. That seems like a failure in design: it should not be possible for a situation like that to occur. It also seems like a test failure: a good system test would have simulated routing failures and discovered failover didn’t work as planned.

What other systems have problems like that which need to be investigated and addressed? What other ways can the system report some failure that causes the alternate site to not be tried? The RCA should have been very explicit about this: a failure of one site didn’t result in the other site getting the calls.

In fact, I would say that this is the path to the root cause: anything should be able to go wrong in the Miami center and the Englewood center should have gotten all the traffic. It’s not the fat finger that is the issue, it’s not the change process that’s the issue. It’s the fact that the change in Miami prevented calls from failing over to Englewood. Without knowing more about how this happened, we’re unable to continue asking “Why?” down this path, but that’s what I think is needed more than anything else.
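The design point here can be stated in about a dozen lines of code. If failover is conditioned on recognizing a specific cause code, an unrecognized or remapped code can defeat it; failing over on *any* non-success from the primary cannot be defeated that way. This is a hypothetical sketch of that policy, not the actual call-routing logic in West’s system.

```python
# Design sketch: fail over to the next site on ANY failure from the
# current one -- an error response, an unexpected cause code, or an
# exception -- rather than only on specific recognized cause codes.
# `send` is a hypothetical stand-in for the real call-delivery attempt.

def route_with_failover(call, sites, send):
    """Try each site in order; return the first site that accepts the call."""
    last_error = None
    for site in sites:
        try:
            result = send(site, call)
            if result == "ok":
                return site
            # An unexpected cause code is still a failure: move on.
            last_error = "%s returned %r" % (site, result)
        except Exception as exc:
            last_error = "%s raised %r" % (site, exc)
    raise RuntimeError("all sites failed; last error: %s" % last_error)
```

Under this policy, the Miami node answering with a wrong or remapped cause code would simply have sent the call to Englewood, and the provisioning mistake would have cost capacity, not 693 calls.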
If you look at the report it says the root cause was human error by a vendor. I don’t agree. The root cause appears to me to be related to the inability to fail over. Fat fingers happen, sites go down. Redundancy is supposed to prevent failures in one site from affecting another. The human error was the proximate cause, but the root failure was a system design fault on failover.
That’s how I see it anyway. I invite your comments.