The Great Outage of 2019

liclac · February 24, 2019, 1:30am

As you may or may not be aware, Kaza and the forums have both been down for the past week, and were only brought back up yesterday.

This was caused by a catastrophic domino effect from a combination of technical and human errors, which were, ultimately, my fault. This post is part cautionary tale, part apology, and part promise to do better.

What happened?

For some background, our infrastructure consists of five servers:

Kyousuke, the coordinating server.
Masato and Kengo, the muscles that handle most of the page rendering.
Riki, who handles a little bit of page rendering, and makes a group of three with Masato and Kengo for storage quorum.
Mio, the database server.

All of these except Mio run a piece of software called Kubernetes - it lets you cluster a bunch of machines together, then give it resources (CPU and RAM, storage buckets, load balancers, etc.) and a description of the desired state for it all - “I want two Discourse instances running, with glusterfs mounted this way to store uploads, I want it monitored for health like this, and exposed on forum.kazamatsuri.org”.

The coordinating node/“master” node, Kyousuke, looks at the description of your desired state, compares it to what currently exists, and issues commands to the other nodes to reconcile the differences. Want two Discourse instances, but only one is running? Let’s pick a node and instruct it to launch another one. One instance is erroring? Let’s update the load balancer to route around it for a bit. Something that’s required isn’t available yet? Hold off on launching it until it is.
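To make the reconciliation idea concrete, here is a toy sketch in Python. It is not how Kubernetes is actually implemented; the `Cluster` class, the `reconcile` function and the command strings are all made up for illustration, and "pick a node" is assumed to mean "pick the least loaded one".

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Cluster:
    """Toy stand-in for observed state: service name -> nodes running an instance."""
    running: dict[str, list[str]] = field(default_factory=dict)


def reconcile(desired: dict[str, int], cluster: Cluster, nodes: list[str]) -> list[str]:
    """Compare desired instance counts with what is running, update the toy
    state, and return the commands a coordinator would issue to close the gap."""
    commands = []
    for service, want in desired.items():
        have = cluster.running.setdefault(service, [])
        while len(have) < want:
            # Assumption: schedule onto the node with the fewest instances overall.
            node = min(
                nodes,
                key=lambda n: sum(v.count(n) for v in cluster.running.values()),
            )
            have.append(node)
            commands.append(f"start {service} on {node}")
        while len(have) > want:
            node = have.pop()
            commands.append(f"stop {service} on {node}")
    return commands
```

Calling `reconcile({"discourse": 2}, cluster, nodes)` when only one instance is running yields a single "start" command, and calling it again yields nothing, because the observed state now matches the desired one.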

All of this works wonders, as long as two things are true:

All servers can reach the master node, and vice versa.
All servers have cryptographic proof of their identity, in the form of certificates signed by the master.

The root cause of all this was, ultimately, the simplest possible failure mode this system has: A certificate expired.
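For illustration, this is roughly what a check for that failure mode could look like. It is a minimal sketch, not what we run: the 30-day threshold, the function names and the example dates are assumptions. It takes an expiry string in the format Python's `ssl` module reports in a certificate's `notAfter` field and says how long is left.

```python
import ssl
from datetime import datetime, timezone

WARN_DAYS = 30  # assumed threshold; pick whatever gives you enough lead time


def days_until_expiry(not_after: str, now: datetime) -> float:
    """Days left before a certificate expires.

    `not_after` uses the 'notAfter' format from ssl.SSLSocket.getpeercert(),
    e.g. 'Feb 24 00:00:00 2019 GMT'. `now` must be timezone-aware.
    """
    expires = datetime.fromtimestamp(
        ssl.cert_time_to_seconds(not_after), tz=timezone.utc
    )
    return (expires - now).total_seconds() / 86400


def expiry_status(not_after: str, now: datetime) -> str:
    """Classify a certificate as 'expired', 'expiring soon' or 'ok'."""
    days = days_until_expiry(not_after, now)
    if days < 0:
        return "expired"
    if days <= WARN_DAYS:
        return "expiring soon"
    return "ok"
```

Anything other than "ok" should page a human; the point is to hear about it while the answer is still "expiring soon".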

Alarms started blaring, and the servers started accusing each other of being impostors and refused to talk to each other. Kyousuke couldn’t see what was going on with the rest of the cluster, and flagged the nodes as unhealthy. Meanwhile, in isolation, the nodes kept doing the last thing they were told to do: run the last known set of services.

And it… kept working - so well that, apart from the occasional error likely caused by lost health checks, nobody noticed for two whole months.

Fast forward a bit. Something causes one of the nodes to drop off the network. This kind of thing is fairly normal, and just the way server hosting works; datacentres have maintenance, switches break, now and then someone trips over a network cable and has to fumble to plug it back in. Modern software is designed to cope with this, and we have redundant servers to avoid a single point of failure. And just like always, it came back up.

But it didn’t come back up quite right. Something in the system (likely the managed firewall rules) wasn’t set right anymore.

This kind of thing is supposed to be handled automatically - firewall rules are supposed to be monitored and fixed, and load balancers reconfigured to route around problems.

But with internal communication broken by authentication errors, none of this happened. The system that was supposed to manage the firewall didn’t know what the correct state was anymore - assuming that this was a temporary network glitch, soon to be resolved, it just kept pinging the master and doing nothing. The load balancer watches the master to keep up with healthy and unhealthy instances, but the master was rejecting it, so it kept going with the same set it had before - inadvertently routing some percentage of traffic down a faulty route, at which point the problem started becoming very visible.
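The "keep going with the last known set" behaviour is sensible for short glitches, but it needs a limit. Here is a hedged sketch of the idea, not the actual load balancer code: `EndpointCache`, its `fetch` callback and the `max_stale_seconds` parameter are hypothetical names. It keeps serving the old endpoints when a refresh fails, but also reports when that data has gone stale for too long, which is the signal that should have raised an alert.

```python
from typing import Callable, List, Optional


class EndpointCache:
    """Remembers the last endpoint list the coordinator handed out."""

    def __init__(self, fetch: Callable[[], List[str]], max_stale_seconds: float):
        self.fetch = fetch  # hypothetical: asks the master for healthy endpoints
        self.max_stale_seconds = max_stale_seconds
        self.endpoints: List[str] = []
        self.last_success: Optional[float] = None

    def refresh(self, now: float) -> bool:
        """Try to update the endpoint list; on failure, keep the old one.

        Returns True while the cached data is recent enough to trust.
        """
        try:
            self.endpoints = list(self.fetch())
            self.last_success = now
        except ConnectionError:
            pass  # master unreachable or rejecting us: keep the last known set
        return self.is_fresh(now)

    def is_fresh(self, now: float) -> bool:
        """False if we have never succeeded, or not within max_stale_seconds."""
        if self.last_success is None:
            return False
        return now - self.last_success <= self.max_stale_seconds
```

A caller would treat a `False` from `refresh` as "stop trusting this routing table and tell someone", rather than carrying on silently.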

And here’s the first problem. All of these alarms went off, and not a single one of them alerted me.

This is because some time ago, we outgrew the monitoring platform we were using to prevent this exact kind of thing from happening, and we couldn’t afford to pay for the higher price tier they were asking of us. We had to stop using them, but in the middle of replacing them with someone else, something got in the way… and it ended up getting pushed further and further back in favour of the temporary (and woefully insufficient) workaround of manually keeping an eye on dashboards.

Until I ended up in the middle of a messy house move, and no longer had a monitor to keep an eye on it with.

I was alerted to the problem on Thursday, February 14th, when the problem cascaded further, and only a single node remained functional. The site was erroring to the point of being unusable.

I logged into the nodes and found the problem: an expired certificate. A common problem, with a quick fix. I ran the script to rotate the certificates, but a mistake in a command on my end resulted in instead rotating out the entire certificate authority, meaning that everything now had to be reissued and replaced with it, not just the expired certificates. This turned out to be much trickier, and sent me on a wild goose chase down Kubernetes’ and kubeadm’s less well documented internals - changing the certificate authority of a running cluster is very much not a normal thing to have to do, and copies, derivations and signed tokens are stored in numerous places.

Now, this could have been resolved in a few hours on a normal day, but this was very much not one. I was in my second week off work ill and bedridden, and not remotely in a state to handle this. For times like these, @Pepe is supposed to be the second in command to step in, but much of the required knowledge existed only in my head and had to be haphazardly transmitted through text messages… across several time zones.

Ultimately, we managed to get everything back to normal, but it took several days of back-and-forth, with me managing to muster only a few minutes to hours of focus per day.

Now, there’s a useful term that applies to scenarios like these: bus factor. That is to say, how many people would have to get hit by a bus (or fall ill, or otherwise be taken out of commission) before something can no longer function properly?

The answer, it turns out, is one: me. This has to change.

What are we doing about this?

There are some strikingly obvious things that went wrong here, and here is my attempt to address them all. Please feel free to suggest other things that may be worth trying - I’m all ears.

Insufficient monitoring, no alerting

First of all, I’m setting up proper alerting again ASAP. There’s really no excuse for this. This should have woken me up in the middle of the night several months ago, long before it became a visible problem.

Bungled certificate rotation

Certificate rotation can and should be done automatically - this just wasn’t very easy to set up back when our cluster was first built, and was put off in favour of bigger fish in need of frying.

“We have time,” I foolishly thought.

I’m going to be updating Kubernetes and migrating it over to use the new APIs for automatic rotation. And as a fallback, I’m adding a calendar event to check on it a month before it next expires.

Bus factor

We need better documentation, and we potentially need more people who understand it.

There’s a plan in the works that should solve both of these things, but the details are still being worked out. Please wait warmly.

In the short term, I’m going to be writing proper documentation, then working from there.

tl;dr: Human error is a very scary thing.

Celeskastel · February 24, 2019, 10:50pm

I just wanted to say I respect your professionalism and hard work so much! Thank you so much for all your hard work despite being ill and during your move. Thank you to Pepe too for helping!

MagusVerborum · February 25, 2019, 7:45am

It sounds like you made the best of a bad situation. I’d just like to say that I could never keep this whole thing running, and I hope you’ll keep this shebang running long into the future~

Phlebas · February 25, 2019, 10:20pm

That sounds like a sick hassle to have had to go through… in extremely bad circumstances. Keep up the good work!