this post was submitted on 10 Apr 2025

Hi All,

As some of you may have realised, the planned upgrade sort of crashed everything, and we had our longest period of downtime since the site began.

This is partly because I had to go to sleep (thanks to a newborn and a job).

The good news is that the backup process worked! We've restored to seconds before the upgrade took the site offline.

The bad news is that federation is likely to be.. wonky.. for a little while. The site may also go up and down while I undo some of the fixes I tried.

Ultimately, the issue came down to the upgrade failing (I'm not sure why yet; I'll be digging into this now that the priority is no longer getting the site back up), after which the containers stopped talking to each other: the UI wouldn't talk to Lemmy, and Lemmy wouldn't talk to the database.

I rebuilt the containers, restored the backup, restarted everything, and it's all come back up (admittedly not perfect right now).

Importantly, I want to issue an apology. This isn't what I want for Lemmy.zip, and it should've been handled way better by me. I'm always learning, but this took way longer than it should've. While I take some solace in the fact that the backup process worked and has now been proven in production, the delay in getting the site back up is entirely my fault and frankly unacceptable.

I'll be working to document this outage, the steps it took to get it back up, and some form of repeatable plan so a repair can be replicated in the future if I'm not available.

In terms of upgrading to 0.19.11 - I will have to try again soon as it's got some security fixes we desperately need to implement.

Thanks

Demigodrick

[–] 1Fuji2Taka3Nasubi@lemmy.zip 6 points 4 months ago* (last edited 4 months ago) (2 children)

Thanks for the hard work! Glad the server is back online.

A suggestion: post a message on status.lemmy.zip when there is maintenance. That was where I thought to check when I found that the main site was not working, although this time it was reporting the site as fine while it was unavailable.

Oh, and congratulations on the newborn!

[–] possiblylinux127@lemmy.zip 1 points 4 months ago (1 children)

To add to this:

I would look into doing a redirect when the server is down. You could create a static page that is triggered when the server is unreachable.
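A minimal sketch of that idea in Python, assuming a small front-end that sits before the real server (which is not how Lemmy.zip is known to be set up). `MAINTENANCE_PAGE`, `fetch_upstream`, and `respond` are hypothetical names used only for illustration: try the real site first, and hand back a static page if it cannot be reached.

```python
import urllib.request

# Hypothetical placeholder page; the real wording and markup are up to the admin.
MAINTENANCE_PAGE = "<html><body><h1>Lemmy.zip is down for maintenance</h1></body></html>"


def fetch_upstream(url, timeout=5.0):
    """Fetch the real page and return (status, body)."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.status, resp.read().decode("utf-8", errors="replace")


def respond(url, fetch=fetch_upstream):
    """Return the upstream response, or a static page if upstream cannot be reached."""
    try:
        return fetch(url)
    except OSError:
        # urllib's URLError and HTTPError subclass OSError, so this covers refused
        # connections, timeouts, DNS failures and HTTP error statuses
        # (4xx and 5xx alike, which is a simplification).
        return 503, MAINTENANCE_PAGE
```

In practice the same fallback is usually configured in the reverse proxy or CDN rather than written by hand; the sketch only shows the decision being made.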

[–] Demigodrick@lemmy.zip 2 points 4 months ago

Definitely on the to-do list. Shame it's a paid Cloudflare feature for 502 errors.

[–] Demigodrick@lemmy.zip 1 points 4 months ago

Thank you! Newborns are both miraculous and absolute poop machines.

The status page definitely failed this time around, so I'm going to look at alternative options. I'll also start linking to other places.
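One possible shape for an alternative check, as a rough Python sketch rather than what the status page actually does (why it reported the site as fine is not known here). The function name `check_site` and its labels are made up for illustration: it requests the public URL the way a visitor would and treats anything short of a normal response as a problem.

```python
import urllib.error
import urllib.request


def check_site(url, opener=urllib.request.urlopen, timeout=5.0):
    """Classify a site as "up", "degraded" or "down" from one HTTP request."""
    try:
        with opener(url, timeout=timeout) as resp:
            return "up" if 200 <= resp.status < 400 else "degraded"
    except urllib.error.HTTPError as err:
        # The server answered, but with an error status.
        return "down" if err.code >= 500 else "degraded"
    except OSError:
        # No usable answer at all: refused connection, timeout, DNS failure.
        return "down"
```

For example, `check_site("https://lemmy.zip")` would be run on a schedule from a machine outside the hosting setup, so that the check fails whenever visitors cannot load the site.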