Today karrot went offline from about 10:30 to 22:30 (10 hours), which I think is our longest ever downtime. I’m very happy we’re back online, with no data loss or anything bad like that.
I hope it didn’t cause too much disorder in all the lovely projects using Karrot.
The ultimate cause was that one of the hard drives in the server failed, and prevented the server from starting up after a reboot during some routine maintenance.
It may actually have failed a while back, without us noticing. There are two hard drives in a mirrored setup (so data is written to both at the same time), so that when one disk fails, it can be replaced whilst the system is still online. This is what is happening now: we have a new disk in, and the data from the healthy disk is being copied to the new disk.
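To illustrate how a silently failed disk in a mirror could be noticed earlier, here is a minimal sketch. It assumes the mirror is Linux software RAID (md), which this post does not actually confirm, and that the kernel reports its status in the usual `/proc/mdstat` format. The function name `degraded_arrays` and the `SAMPLE` text are made up for illustration and are not output from the Karrot server.

```python
import re

# Hypothetical sample in the usual /proc/mdstat layout (not real server output).
# "[2/1] [U_]" means the array expects 2 disks but only 1 is active.
SAMPLE = """\
Personalities : [raid1]
md0 : active raid1 sda1[0] sdb1[1]
      523712 blocks super 1.2 [2/2] [UU]

md1 : active raid1 sda2[0]
      1953382464 blocks super 1.2 [2/1] [U_]

unused devices: <none>
"""

ARRAY_LINE = re.compile(r"^(md\d+)\s*:")
STATUS = re.compile(r"\[(\d+)/(\d+)\]\s+\[([U_]+)\]")


def degraded_arrays(mdstat_text):
    """Return the names of md arrays that are missing at least one disk."""
    degraded = []
    current = None
    for line in mdstat_text.splitlines():
        header = ARRAY_LINE.match(line)
        if header:
            # Remember which array the following status line belongs to.
            current = header.group(1)
            continue
        status = STATUS.search(line)
        if status and current is not None:
            expected = int(status.group(1))
            active = int(status.group(2))
            flags = status.group(3)
            if active < expected or "_" in flags:
                degraded.append(current)
            current = None
    return degraded


print(degraded_arrays(SAMPLE))
```

On a real machine the text would come from reading `/proc/mdstat` rather than from `SAMPLE`, and a scheduled job could send a notification whenever the returned list is not empty. If the server uses hardware RAID or a different tool, the check would need to look somewhere else entirely.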
It took 10 hours because it was quite some time before we were able to diagnose the problem. We could boot into the recovery mode fine, and access the data, but were confused about why it would not come up after a reboot, thinking perhaps there was a configuration problem with the boot system.
The real problem was being concealed though, and a bit of help led us to see one of the disks had failed. It’s been running for 7 years now, since the very beginning of the yunity experiment, from which the foodsaving tool emerged, which then became karrot.
The other thing that caused it to take longer was that we were not familiar with the hosting control panel, or the hosting support system, having not had to use them before in all this time. I was a bit daunted as it’s all in German.
We had a super lovely and helpful support person who wrote in English, who confirmed the problem with the disk, and replaced it with a new one they already had on standby. The support costs 180 EUR/hour as it’s out of hours at the weekend, but it seemed worth spending the money on, otherwise it would have meant waiting until Monday morning, with no easy way to communicate with all the Karrot users.
Thanks a lot to @tiltec who was doing much of the investigating and fixing here … a lot of learning for me!
It’s not quite what I wanted to be doing on a lovely sunny Saturday, but hey ho, that’s how it goes! (I had been wanting to sit coding on Karrot…).
I would like to explore a bit more of what we can do in the future to improve the situation, so we might notice issues like this before they become critical and involve downtime, but now, to bed!