There's a horror community here that we're trying to revive, come check it out: https://lemmy.ml/c/horror
Yes, I understand all of that. I know that it helps all the various instance owners. But that’s a problem that has already been solved. Building for scale is not specific or special to Lemmy. There are already entire automation toolsets—things like K8s or Docker Swarm, Terraform and Ansible, and endless documentation and examples on how to use and implement all of this. You’re talking about the greater whole, and what I’m trying to talk about is Lemmy.ml.
I do agree we’re probably talking past each other, though, and that’s alright, that’s how it goes on the Internet sometimes.
I’m referring specifically to Lemmy.ml, which is what the admins (of that instance) have been discussing and posting links to GitHub issues for. You can’t just take ‘everyone’s’ instance and spread it out into one giant working install of Lemmy. Every single instance that wants to handle scale is going to have to be built, managed, and maintained for it. If Lemmy.ml isn’t built to handle scale, then it’s going to go down when traffic spikes. They’re already having problems with their SQL database and traffic levels are basically nothing. You’ll end up with a bunch of users attempting to access any of the communities on Lemmy.ml and being unable to. They will need to go to a different Lemmy instance, which will have all of the same issues that Lemmy.ml will have regarding traffic load, and interact with threads there. The good thing about federation is that they’ll be able to keep using Lemmy on other instances, even if they don’t have access to Lemmy.ml specifically.
I promise I understand what I’m talking about; building for scale on a global level is what I do for a living. I also know something about open source projects, having co-founded Rocky Linux and the Rocky Enterprise Software Foundation, where I serve as Director of Operations.
That’s not how this works. Lemmy itself may be open source, but the instance it runs on is not. All the work in the world on the Lemmy codebase won’t mean anything if its actual deployment is not built for scale, and that’s not something anyone but the admins can do anything about.
I want Lemmy to succeed, but I'm highly skeptical of the instance operators' ability to make that happen. There's a great deal of technical sophistication required to support a large number of users, and from what I've seen, they don't have it. This isn't a slight against them in any way, but they freely admit that they lack SQL expertise, and I think I've seen some significant gaps in their knowledge of how to scale horizontally. This instance, for example, is all hosted on a single virtual server. There are no load balancers, no database sharding, no fanning out of services onto different servers...and security is likely in a shoddy state as well.
Again, no hate from me, nothing but praise so far. But there are some significant technological gaps here, and I worry their team isn't large or technically deep enough to fill them. What's in place at the moment is just waiting to tip over when any amount of traffic starts coming over. For what it's worth, I have offered my expertise to the admins around networking, security, scale, and automation.
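To make the "load balancers" and "fanning out" part concrete, here is a rough sketch in Python of health-aware round-robin, which is the basic idea behind putting several app servers behind a balancer. The class name and the backend names (`app-1` and so on) are made up for illustration. This is not how Lemmy or any particular load balancer does it, and a real deployment would use an off-the-shelf balancer rather than hand-rolled code.

```python
from itertools import cycle
from typing import Dict, Iterable, Iterator


class RoundRobinBalancer:
    """Hand out backends in rotation, skipping any marked unhealthy.

    Hypothetical illustration only; the names here are assumptions.
    """

    def __init__(self, backends: Iterable[str]) -> None:
        self._backends = list(backends)
        if not self._backends:
            raise ValueError("at least one backend is required")
        # Every backend starts out healthy until a check says otherwise.
        self._healthy: Dict[str, bool] = {b: True for b in self._backends}
        self._rotation: Iterator[str] = cycle(self._backends)

    def mark(self, backend: str, healthy: bool) -> None:
        """Record the result of a health check for one backend."""
        if backend not in self._healthy:
            raise KeyError(backend)
        self._healthy[backend] = healthy

    def next_backend(self) -> str:
        """Return the next healthy backend in rotation."""
        # Look at each backend at most once per call, so this cannot loop forever.
        for _ in range(len(self._backends)):
            candidate = next(self._rotation)
            if self._healthy[candidate]:
                return candidate
        raise RuntimeError("no healthy backends available")
```

With a single server there is nothing to rotate to: once that one backend is marked unhealthy, `next_backend()` can only raise, which is the "waiting to tip over" situation described above.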
That's Kojima for you. But agreed, the game felt like someone took far too much inspiration from QWOP. I've heard it called a 'walking simulator,' which feels apt, lol.
spy on all the traffic
That's...not how things work. Everyone has their philosophical opinions so I won't attempt to argue the point, but if you want to handle scale and distribution, you're going to have to start thinking differently, otherwise you're going to fail when load starts to really increase.
You should use this relatively quiet time to migrate to a larger server, because if you wait until you're forced to do it, you're going to be in for a world of hurt. This is the calm before the storm--take advantage of it.
Ultimately, you need to scale horizontally. You need to shard your database and separate out your different functions (database, front end, whatever back end applications you use, etc) onto different servers, all fronted by load balancers. That's going to be the only way to even begin to handle increasing load. If you don't have a small team of experienced engineers with a deep understanding of how to build for scale, and you get a sudden mass exodus of users from Reddit, you're fucked. So if I were you, here's what I'd do:
- Scale up to the largest instance type you can. If possible, switch (at least temporarily) to AWS and use something in the c6i instance family, such as the c6id.32xlarge. Billing for AWS instances is done by the hour, so you wouldn't need to pay for an entire month up front if you only need that extra horsepower for a few days (such as when the blackouts are planned from the 12th through 14th).
- Because the above will do nothing but buy you time until you crash--and if you get a huge spike of users, without horizontal scaling, you WILL crash--migrate your DNS to something like Cloudflare. From there, configure workers to respond when health checks to your site fail, so that users attempting to access the site can be shown a static page directing them to something like http://join-lemmy.org or someplace, instead of simply getting 5xx errors.
- Once the hug of death is over, evaluate where you stand. Reduce your instance size, if you can, and start investigating what it's going to take to scale horizontally.
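For the second step, the logic a worker would run is roughly the following. This is a sketch in Python of the decision only, not actual Cloudflare Worker code (the Workers platform has its own API, which is not reproduced here). `serve_with_fallback`, `fetch_origin`, and the page text are hypothetical names chosen for illustration; the only thing taken from the steps above is the join-lemmy.org link.

```python
from typing import Callable, Dict, Tuple

# A response is modelled as (status code, headers, body) for this sketch.
Response = Tuple[int, Dict[str, str], str]

FALLBACK_URL = "http://join-lemmy.org"  # the link suggested in the steps above

# Hypothetical wording for the static page.
FALLBACK_PAGE = f"""<!doctype html>
<title>Temporarily unavailable</title>
<p>This instance is overloaded right now.</p>
<p>You can find other Lemmy instances at <a href="{FALLBACK_URL}">{FALLBACK_URL}</a>.</p>
"""


def fallback_response() -> Response:
    # Assumption: keep a 503 status but attach a readable page, so visitors
    # see directions instead of a bare error. Pick another status if you prefer.
    return 503, {"Content-Type": "text/html; charset=utf-8"}, FALLBACK_PAGE


def serve_with_fallback(fetch_origin: Callable[[], Response]) -> Response:
    """Pass the origin's response through unless the origin looks down."""
    try:
        status, headers, body = fetch_origin()
    except OSError:
        # Covers ConnectionError and TimeoutError: the origin is unreachable.
        return fallback_response()
    if 500 <= status <= 599:
        # The origin answered, but with a server error.
        return fallback_response()
    return status, headers, body
```

Ordinary responses, including 4xx errors, pass through untouched; only an unreachable origin or a 5xx from it triggers the static page.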
I'm not a SQL expert, but I am a principal network architect, and my day job for the last 15 years has been working on scale and automation for the world's largest companies, including 7 years spent at AWS. In my world, websites like Reddit, as large as they are, are still considered to be of 'average' size. I can't help you with the database, but I'm happy to provide guidance around networking, DNS, scale, automation, security, etc.
You could configure something like a Cloudflare worker to throw up a page directing users elsewhere whenever health checks fail.
That's helpful to think of it that way, thank you. Perhaps I will reconsider :)
I'm interested in getting into this, but I think I'd probably end up abandoning it and having it feel like a chore, then feel guilty about not getting it 'done.'
It's not just tech companies like Reddit and Twitter; it seems like it's most companies. Ever since the COVID lockdowns, prices have been going through the roof, you get less for what you pay, and they're laying off workers, all while raking in record profits and crying about how no one wants to work and how they can't afford anything because of the economy. I've never been more cynical about companies than I have been over the last year.