this post was submitted on 10 Apr 2025

Hi All,

As some of you may have realised, the planned upgrade sort of crashed everything, and we had our longest period of downtime since the site began.

This is partly because I had to go to sleep (thanks to a newborn and a job).

The good news is that the backup process worked! We've restored to seconds before the upgrade took the site offline.

The bad news is that federation is likely to be.. wonky.. for a little while. The site may also go up and down while I undo some of the fixes I tried.

Ultimately the issue came down to the upgrade failing (I am not sure why yet; I'll be digging into this now that the priority is no longer getting the site up) and then the containers not talking to each other, so the UI wouldn't talk to lemmy, and lemmy wouldn't talk to the database.
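
For anyone who wants to picture that kind of failure, here is a minimal sketch of checking the chain link by link. It is not the tooling used on Lemmy.zip: the hostnames and ports are assumptions based on a typical Docker Compose setup (`postgres:5432`, `lemmy:8536`, `lemmy-ui:1234`), and `first_unreachable` is a hypothetical helper that only tests whether a TCP connection can be opened, not whether the service behind it is healthy.

```python
import socket


def first_unreachable(services, timeout=2.0):
    """Return the name of the first service that refuses a TCP connection.

    `services` is a list of (name, host, port) tuples, checked in order.
    Returns None if every service accepted a connection.
    """
    for name, host, port in services:
        try:
            # Opening and immediately closing a connection is enough to show
            # that the port is reachable from where this script runs.
            with socket.create_connection((host, port), timeout=timeout):
                pass
        except OSError:
            # Covers refused connections, timeouts, and DNS lookup failures.
            return name
    return None


# Assumed container names and ports; adjust to match your own compose file.
CHAIN = [
    ("database", "postgres", 5432),
    ("lemmy", "lemmy", 8536),
    ("ui", "lemmy-ui", 1234),
]

if __name__ == "__main__":
    broken = first_unreachable(CHAIN)
    print("all reachable" if broken is None else f"cannot reach: {broken}")
```

Checking from the database upwards means the first name reported is the lowest broken link, which is usually the one to fix first. The script has to run somewhere that can resolve those hostnames, for example inside a container on the same Docker network.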

I rebuilt the containers, restored the backup, restarted everything, and it's all come back up (admittedly not perfect right now).

Importantly, I want to issue an apology. This isn't what I want for Lemmy.zip, and I should've handled it way better. I'm always learning but this took way longer than it should've, and while I take some solace in the fact the backup process worked and has been proven to work in production, the delay in being able to get this back up is entirely my fault and frankly unacceptable.

I'll be working to document this outage, the steps it took to get it back up, and some form of repeatable plan so a repair can be replicated in the future if I'm not available.

In terms of upgrading to 0.19.11 - I will have to try again soon as it's got some security fixes we desperately need to implement.

Thanks

Demigodrick

[–] FrostyTrichs@crazypeople.online 5 points 4 months ago

Unfortunately these things can and do happen. I'm glad you were able to get things functional with a restoration. Best of luck troubleshooting and repairing the leftover gremlins.

Thanks for all you do to support Lemmy.

[–] Eyck_of_denesle@lemmy.zip 5 points 4 months ago

Congratulations on the baby. We should thank you for making us go touch grass.

[–] squirrel@discuss.tchncs.de 4 points 4 months ago (1 children)
[–] Demigodrick@lemmy.zip 2 points 4 months ago

Thanks squirrel! 😊

[–] Rose@lemmy.zip 4 points 4 months ago (1 children)

I definitely felt that! Trying to check the feed and getting a 502 was not nice. Good thing I had an account with another instance for the interim. Anyway, no service is bulletproof and things absolutely go wrong when running a server, no matter how good or prepared we may be. Having working backups is instrumental and I'm glad you have that going.

[–] Demigodrick@lemmy.zip 2 points 4 months ago

Thank you Rose! I think when I saw that first 502 I took a year or two off my life with stress 🤣

[–] Cheesepuffs@lemmy.zip 4 points 4 months ago

Hey man, it's all good. I understand living the working parent life. It ain't easy.

[–] mynamesnotrick@lemmy.zip 4 points 4 months ago* (last edited 4 months ago)

Dude, I'm a devops engineer and I totally get it when an app in my cluster goes down and customers start to freak out. However, I get paid to deal with it, whereas I imagine you are doing this as a fun side thing. You have run this entire thing super professionally since the beginning. You are doing great. If it was an Istio thing that caused your containers to not talk... Believe me when I tell you that is my hell (one I'm currently experiencing with one particular app). You are doing amazing work here. Thank you so much. Glad I'm here.

[–] altima_neo@lemmy.zip 3 points 4 months ago (1 children)

No worries man. It's just social media. We survived!

One thing I was curious about: I didn't know where to go to look up info on what was going on. But you mentioned you were posting links, so I'll bookmark some of that info! Anyway, thanks for all your hard work.

[–] Demigodrick@lemmy.zip 2 points 4 months ago

Yeah absolutely, I'll be putting more links up. I wasn't prepared for things to go quite so wrong and for so long, so other than the Matrix chat there wasn't any other info.

I'll be using Mastodon and Matrix, and I've updated the status page to be one that should actually work now :)

[–] MrSoup@lemmy.zip 3 points 4 months ago

Chill, you do a great job managing this. There is no need to blame yourself that way.

Get some rest, enjoy the first stages of parenting and take your time updating.

Thanks a lot for lemmy.zip and take care of yourself.

[–] meldrik@lemmy.wtf 2 points 4 months ago

Great work and welcome back 🤗

[–] shortwavesurfer@lemmy.zip 2 points 4 months ago (1 children)

The graph shows the federation sync finished after four and a half hours.

[–] Elevator7009@lemmy.zip 2 points 4 months ago* (last edited 4 months ago)

This is FOSS, not a job. You are doing it for free + maybe some donations people give. This is social media, not some critical health thing that needs to be working 24/7. Thanks for caring, now you know for next time, but don't beat yourself up. We appreciate your efforts and the transparency you have been giving us by making these posts! Keep on going with the transparency, but take it easy on yourself.
