GlusterFS hardcore recovery on a broken Kubernetes cluster

So yeah, shit happens.

Like the day your client tells you « my websites are slow », then you get a monitoring alert (if you don’t have any monitoring, you’re playing with fire), then your client’s whole Kubernetes cluster is down.

The production one. It’s past 10pm and you know you’re definitely going to have a long night.

Of course 99% of the time, you can recover from a broken Kubernetes cluster.

Let’s get a diagnosis

It could be that the master is down – check our post on HA Kubernetes without a private network or a load balancer if you have only one master. TL;DR: you shouldn’t.

It could be that one or more nodes are down, and maybe you didn’t scale your apps enough for high availability, or some apps are set up with a nodeAffinity for some reason – geographic proximity, an app that requires local SSD storage to run fast, or dozens of other situations you may unfortunately find yourself in.

Or it can be an evening like this one. An evening when you log in to your master (we had set up a brand new bare-metal HA cluster and were planning to migrate all our client’s apps this weekend – too late) and Kubelet cannot start anymore, because /etc/kubernetes/bootstrap-kubelet.conf has simply… disappeared.

2020/05/04 22:27:05 http: proxy error: dial tcp 195.154.81.152:6443: connect: connection refused

The connection to the server farniente.unicstay.eu:6443 was refused - did you specify the right host or port?

Like when you try a basic kubectl get nodes and the API server isn’t responding. Like when you try to renew the certs, and now you’re even « out of the cluster »:

(code to add here)

Then you spend an hour or two trying to fix it, without success. You even make things worse.

What are your options?

Now, you have 2 choices:

  • Wait for the morning; maybe some support guy will fix your cluster. Spoiler: they probably won’t, especially not with a bare-metal cluster.
  • Migrate the apps to a brand new cluster right now, before it’s too late and your client starts losing clients – which means you’ll probably lose them, too.

Tonight, I went with the latter. It was a bold decision: our HA cluster had been tested in development, but not in production. I also knew that by doing so, not only would I have to redeploy 6 Django apps with their whole stack (volumes, services, ingress, uWSGI & Nginx configs…), change the DNS records quickly, and make sure backup CronJobs would be running and the databases backed up – which I had luckily migrated to DigitalOcean’s Managed Databases just 2 days before – but I’d also lose access to Heketi, making it really difficult to recover my files.

And these files were not backed up. I had to recover them.

So, now, here we are. We’ve got a broken Kubernetes cluster – so broken I won’t mention it further, as it was now unusable – a broken Heketi instance, and a broken GlusterFS cluster.

A broken GlusterFS cluster because, like anyone in despair, I rebooted the master, then some nodes, to see if a miracle would happen. GlusterFS doesn’t like this kind of all-in operation. So now, my bricks are down.

But I never – really, never – give up. Follow me along this journey, which hopefully will give you much to learn.

Time to recover it yourself (RIY)

First: make this GlusterFS cluster work again

⇒ GlusterFS hardcore recovery article, part 1 (upcoming)

Second: mount the volumes on your master

This is the part where, hopefully, the feeling of panic starts to fade and leaves room for another feeling: suspense. Will your files still be there? They should.

Once gluster volume status shows a nice Y in front of each brick, you want to mount the volumes to make sure all the data is still there before copying it somewhere safe.
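If you want to double-check every volume in one go before mounting, a quick loop over gluster volume list should do – just a sketch, nothing fancy:

# Print the brick rows for every volume; the Online column should show a Y on each one
for v in $(gluster volume list); do
  echo "== $v =="
  gluster volume status "$v" | grep -i brick
done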

That’s the easy part.

$ gluster volume list
heketidbstorage
vol_15c18fafbc8ee87e92a158356caa1370
vol_20a07f5456d25926e14e1f17b39ec0c6
vol_7b21e6dd5263fbfc537a0cb656833275
vol_a451296fc82e517148698e28890ba57d
vol_b037e8147572c17038816cd2a3378a46
vol_b13f80f79a96820ab068066925682edd
vol_c8a0306fdd642c7a75352a9744f53c07
vol_c99d9e7ada34d7d7441c6994e1951966
vol_e9a04fb86d81c7c42413b4154611d9ba
vol_f4b1baa49b7943daf7428bf4125da9d3
vol_fcb37680bf52c44cbdc580f739c9a350

OK, let’s copy these lines and open a text editor – preferably one with multi-line selection (Sublime, VS Code or Atom should do just fine; I use Sublime, as it handles tens of thousands of lines without a glitch):

[Screenshot: the volume list pasted into the editor, ready for multi-line selection]

Cool. Let’s make some directories on the server to mount these volumes.

$ cd /mnt
$ mkdir gluster && cd gluster

$ mkdir heketidbstorage
$ mkdir vol_15c18fafbc8ee87e92a158356caa1370
$ mkdir vol_20a07f5456d25926e14e1f17b39ec0c6
$ mkdir vol_7b21e6dd5263fbfc537a0cb656833275
$ mkdir vol_a451296fc82e517148698e28890ba57d
$ mkdir vol_b037e8147572c17038816cd2a3378a46
$ mkdir vol_b13f80f79a96820ab068066925682edd
$ mkdir vol_c8a0306fdd642c7a75352a9744f53c07
$ mkdir vol_c99d9e7ada34d7d7441c6994e1951966
$ mkdir vol_e9a04fb86d81c7c42413b4154611d9ba
$ mkdir vol_f4b1baa49b7943daf7428bf4125da9d3
$ mkdir vol_fcb37680bf52c44cbdc580f739c9a350
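By the way, if you’d rather stay in the shell for this step, a small loop can create all the mount points straight from the volume list (same result, just a sketch):

# Create one mount point per Gluster volume
cd /mnt/gluster
for v in $(gluster volume list); do
  mkdir -p "$v"
done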

Done. Now we update /etc/fstab to mount them. Don’t type everything out by hand – use multi-line selection:

# /etc/fstab
# ...other mounts..

195.154.81.152:/heketidbstorage /mnt/gluster/heketidbstorage glusterfs defaults,_netdev 0 0
195.154.81.152:/vol_15c18fafbc8ee87e92a158356caa1370 /mnt/gluster/vol_15c18fafbc8ee87e92a158356caa1370 glusterfs defaults,_netdev 0 0
195.154.81.152:/vol_20a07f5456d25926e14e1f17b39ec0c6 /mnt/gluster/vol_20a07f5456d25926e14e1f17b39ec0c6 glusterfs defaults,_netdev 0 0
195.154.81.152:/vol_7b21e6dd5263fbfc537a0cb656833275 /mnt/gluster/vol_7b21e6dd5263fbfc537a0cb656833275 glusterfs defaults,_netdev 0 0
195.154.81.152:/vol_a451296fc82e517148698e28890ba57d /mnt/gluster/vol_a451296fc82e517148698e28890ba57d glusterfs defaults,_netdev 0 0
195.154.81.152:/vol_b037e8147572c17038816cd2a3378a46 /mnt/gluster/vol_b037e8147572c17038816cd2a3378a46 glusterfs defaults,_netdev 0 0
195.154.81.152:/vol_b13f80f79a96820ab068066925682edd /mnt/gluster/vol_b13f80f79a96820ab068066925682edd glusterfs defaults,_netdev 0 0
195.154.81.152:/vol_c8a0306fdd642c7a75352a9744f53c07 /mnt/gluster/vol_c8a0306fdd642c7a75352a9744f53c07 glusterfs defaults,_netdev 0 0
195.154.81.152:/vol_c99d9e7ada34d7d7441c6994e1951966 /mnt/gluster/vol_c99d9e7ada34d7d7441c6994e1951966 glusterfs defaults,_netdev 0 0
195.154.81.152:/vol_e9a04fb86d81c7c42413b4154611d9ba /mnt/gluster/vol_e9a04fb86d81c7c42413b4154611d9ba glusterfs defaults,_netdev 0 0
195.154.81.152:/vol_f4b1baa49b7943daf7428bf4125da9d3 /mnt/gluster/vol_f4b1baa49b7943daf7428bf4125da9d3 glusterfs defaults,_netdev 0 0
195.154.81.152:/vol_fcb37680bf52c44cbdc580f739c9a350 /mnt/gluster/vol_fcb37680bf52c44cbdc580f739c9a350 glusterfs defaults,_netdev 0 0
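If multi-line selection really isn’t your thing, you could also generate these entries with a loop – again a sketch, and replace the IP with your own Gluster node before trusting it:

# Append one fstab entry per volume (back up /etc/fstab first)
GLUSTER_HOST=195.154.81.152
for v in $(gluster volume list); do
  echo "${GLUSTER_HOST}:/${v} /mnt/gluster/${v} glusterfs defaults,_netdev 0 0" >> /etc/fstab
done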

Now save the file and mount everything:

$ mount -a

Let’s see….

$ ls /mnt/gluster/heketidbstorage

container.log  heketi.db

You’re good. Take a deep breath, you’ve saved the day. Almost.

Third: whose PVC is this?

You’re not completely done though. These volumes have silly names; at the very least, you can’t easily tell which volume corresponds to which PV/PVC. Depending on how many volumes you have, guessing is like flipping a coin once per volume. You can’t get the original names back, but you can get some metadata – size, mostly – that will help you deduce which volume is which.

Export Heketi’s DB to JSON

Open a terminal on your local machine and scp the heketi.db file somewhere safe.
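Something along these lines – host and paths are placeholders, adjust them to your own setup:

# Pull heketi.db from the master to your local machine (placeholder host/paths)
mkdir -p ~/recovery
scp root@your-master:/mnt/gluster/heketidbstorage/heketi.db ~/recovery/heketi.db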

Head to https://github.com/heketi/heketi/releases. Don’t worry if you have hard feelings against Heketi – we’re just gonna use it to make this DB readable.

Download the latest release for your architecture – don’t download the CLI but the full package – extract it, and finally use it to export the database to a JSON file:

$ /home/michel/Downloads/heketi/heketi db export --dbfile=heketi.db --jsonfile=heketidb.json
DB exported to heketidb.json

Oh my, oh my. Open heketidb.json. A bunch of data that’s useless right now – but the Info.size field can be pretty helpful. Look for the size field under each entry in volumeentries: that’s your volume’s size in GB. That should help.
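If you don’t feel like scrolling through the JSON by hand, a jq one-liner can list each volume’s name and size – assuming the export keeps the volumeentries → Info layout described above (double-check yours):

# List volume name + size (GB) from the exported Heketi DB (layout assumed)
jq -r '.volumeentries[] | "\(.Info.name)  \(.Info.size) GB"' heketidb.json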

Now you’re heading towards minutes/hours/days of copying. Try doing something useful with this time (like writing a file backup script).
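If you want a head start on that script, here’s a minimal sketch that rsyncs every mounted volume to a (hypothetical) backup host:

# Push each mounted Gluster volume to a safe remote location (backup-host is a placeholder)
for v in /mnt/gluster/*/; do
  name=$(basename "$v")
  rsync -a --info=progress2 "$v" "backup-host:/backups/${name}/"
done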

When you’re done, you can clean up your fstab, tear down the GlusterFS cluster, wipe your server, and either make it join your new cluster – if, like me, you’re stupid|poor enough to keep working on bare-metal clusters – or power it off for good and save some bucks.
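And for the fstab cleanup itself, something like this (again, a sketch – check the pattern against your own file) unmounts the volumes and strips the temporary entries:

# Unmount the recovery mounts and remove the matching fstab lines (a .bak copy is kept)
umount /mnt/gluster/*
sed -i.bak '\#/mnt/gluster/#d' /etc/fstab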