141

Post-Mortem: The massive lemmy.world -> lemmy.dbzer0.com federation delays. (dbzer0.com)

submitted 1 year ago by db0@lemmy.dbzer0.com to c/div0@lemmy.dbzer0.com


[–] nutomic@lemmy.ml 14 points 1 year ago* (last edited 1 year ago) (14 children)

> As someone hosting a service like this, especially when it has 12K people in it, this is very scary! While 2 lemmy core developers were in the chat, the help they provided was very limited overall and this session mostly relied on my own skills to troubleshoot.
>
> This reinforced in my mind that as much as I like the idea of lemmy (or any of the other threadiverse SW), this is only something experts should try hosting. Sadly, this will lead to more centralization of the lemmy community to few big servers instead of many small ones, but given the nature of problems one can encounter and the lack of support to fix them if they’re not experts, I don’t see an option.

I disagree with this conclusion. If you had installed Lemmy according to the official instructions, you would have the database, backend, and everything else on the same server, and you would never have run into this particular issue. Any problems you did have would likely be noticed (and debugged) by many other instances too. Your setup is heavily customized, so it is only natural that few people can help with it.

Anyway, it's an interesting journey. Thanks for writing down your experience and for improving the documentation!

[–] kbotc@lemmy.world 9 points 1 year ago (2 children)

Tossing everything on the same server is not great: I don’t want to pay for fast storage for my image store, but I do want fast storage for my DB. My web server should have extra CPU and network but is otherwise ephemeral. This is the same setup people have been running for years and is microservices 101.

The correct thing to do here is to build in tracing and profiling hooks. For example, OpenTracing spans that something like Jaeger can consume would have lit this problem up like a Christmas tree. Pyroscope can show how CPU usage shifts over time, and logs can be shipped off to Graylog or some other centralized service for correlation.
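To make the tracing idea concrete, here is a minimal sketch in Python using only the standard library. It is not the OpenTracing or Jaeger API; it is a hand-rolled stand-in that shows the principle of wrapping each step in a timed span so the slow one stands out. The names `span`, `SPANS`, `slowest`, and `handle_activity` are hypothetical, and the `time.sleep` call is an assumed stand-in for a slow round trip to a database on another host.

```python
import time
from contextlib import contextmanager

# Hypothetical in-memory span store. A real deployment would export
# spans to a collector (Jaeger, for instance) instead of a list.
SPANS = []


@contextmanager
def span(name, slow_ms=100.0):
    """Time the wrapped block and record it as a finished span."""
    start = time.perf_counter()
    try:
        yield
    finally:
        elapsed_ms = (time.perf_counter() - start) * 1000.0
        SPANS.append(
            {"name": name, "ms": elapsed_ms, "slow": elapsed_ms >= slow_ms}
        )


def slowest(spans, n=3):
    """Return the n longest spans, longest first."""
    return sorted(spans, key=lambda s: s["ms"], reverse=True)[:n]


def handle_activity():
    """Pretend to process one incoming federated activity."""
    with span("parse_activity"):
        pass  # stand-in for cheap local work
    with span("db_insert", slow_ms=20.0):
        time.sleep(0.05)  # assumed stand-in for a remote DB round trip


if __name__ == "__main__":
    handle_activity()
    for s in slowest(SPANS):
        flag = "SLOW" if s["slow"] else "ok"
        print(f'{s["name"]:<16} {s["ms"]:8.2f} ms  {flag}')
```

With spans like these around every database call, a per-request latency problem shows up as one consistently long span rather than as a vague "federation is slow" symptom.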

[–] nutomic@lemmy.ml 1 points 1 year ago (1 children)

Images can be stored in S3, so that's not an issue. Lemmy also has some tracing logs as well as Prometheus stats; I'm not sure whether db0 tried looking into those.
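For anyone who has not worked with Prometheus stats before, the sketch below shows roughly what consuming them looks like. It parses the Prometheus text exposition format with the Python standard library only. The metric names in `SAMPLE_TEXT` are made up for illustration and are not Lemmy's actual metric names, and the parser is deliberately simplified (it skips `# HELP` and `# TYPE` lines and ignores timestamps).

```python
import re

# One sample line: metric name, optional {labels}, value, optional timestamp.
_SAMPLE = re.compile(
    r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)(?:\s+-?\d+)?$'
)
# One label pair inside the braces: key="value" (escaped quotes allowed).
_LABEL = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="((?:[^"\\]|\\.)*)"')

# Hypothetical scrape output; real metric names will differ.
SAMPLE_TEXT = """\
# HELP example_db_pool_connections Connections in the pool by state.
# TYPE example_db_pool_connections gauge
example_db_pool_connections{state="idle"} 4
example_db_pool_connections{state="busy"} 16
# TYPE example_http_requests_total counter
example_http_requests_total 1027
"""


def parse_metrics(text):
    """Return a list of (name, labels, value) tuples from exposition text."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blanks and HELP/TYPE comments
        match = _SAMPLE.match(line)
        if match is None:
            continue  # ignore lines this simplified parser cannot read
        name, raw_labels, value = match.groups()
        labels = dict(_LABEL.findall(raw_labels or ""))
        samples.append((name, labels, float(value)))
    return samples


if __name__ == "__main__":
    for name, labels, value in parse_metrics(SAMPLE_TEXT):
        print(name, labels, value)
```

In practice the text would come from whatever metrics endpoint the instance exposes, and a Prometheus server would normally do the scraping and storage; how Lemmy enables that endpoint is something to confirm in its own documentation.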

[–] db0@lemmy.dbzer0.com 6 points 1 year ago

I don't think I've seen any mention of these anywhere, or of how to use them.
