RoundSparrow

joined 2 years ago
MODERATOR OF
[–] RoundSparrow@lemmy.ml 1 points 2 years ago
[–] RoundSparrow@lemmy.ml 1 points 2 years ago* (last edited 2 years ago) (2 children)

Good results with this approach. I hadn't considered RANK() OVER (PARTITION BY criteria_a), and it works like a champ. It moves the ORDER BY into the column of focus (criteria_a), performance seems decent enough... and the short statement isn't difficult to read.

SELECT COUNT(ranked_recency.*) AS post_row_count
FROM
  (
     SELECT id, post_id, community_id, published,
        rank() OVER (
           PARTITION BY community_id
           ORDER BY published DESC, id DESC
           )
     FROM post_aggregates) ranked_recency
WHERE rank <= 1000
;

Gives me the expected results over the 5+ million test rows I ran it against.

If you could elaborate on your idea of TOP, please do. I'm hoping there might be a way to push the LIMIT 1000 into the inner query so the outer query doesn't have to WHERE-filter rank over so many results?
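Not sure if that's what TOP would map to, but one possible way to keep the limit inside the inner query might be a LATERAL join. A rough sketch, assuming a community table whose id matches post_aggregates.community_id:

SELECT COUNT(*) AS post_row_count
FROM community c
CROSS JOIN LATERAL (
   SELECT pa.id
   FROM post_aggregates pa
   WHERE pa.community_id = c.id        -- per-community limit applied here
   ORDER BY pa.published DESC, pa.id DESC
   LIMIT 1000
) recent;

Either way, an index on (community_id, published DESC, id DESC) would probably be needed to keep the inner scan cheap.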

[–] RoundSparrow@lemmy.ml 1 points 2 years ago* (last edited 2 years ago) (11 children)

> What problem are you trying to solve?

Reproducible, regular server crashes from queries that take tens of seconds because the whole logic is based on WHERE clauses with no real selectivity. The server overloads in the field have been happening every single day since I started testing the big servers in May 2023.

> If I want to look through 1000 posts in a community, I probably want to look at more than 1000.

I'm well aware of the pushback. Everyone chimes in saying they want counting to be real time, the developers seem to avoid caching at all costs, and out of desperation I'm trying to build some kind of basic sanity logic into the system so it doesn't plow through 5 million rows to do a LIMIT 10 query.

Right now Lemmy works perfectly fine with no personalization. For anonymous users it works great. If you want to read a million posts, it works great. Start blocking specific users, adding NSFW filters, cherry-picking a blend of communities, etc., and the problems show up. The ORM logic is difficult to follow, it is based on a massive JOIN of every field in many tables, and at certain data thresholds with per-account preferences engaged it goes off the rails into the pile of over 1 million posts (taking 40 seconds to list page 1 of a LIMIT 20 query, even for a single community).

The programmers who have built the code over the past 4 years don't seem to think it is an urgent problem, so I'm chipping in. I personally have never worked with this ORM, and I find it painful compared to the hand-crafted SQL I've done on major projects. I'm doing this because I feel like nobody else has for months.

[–] RoundSparrow@lemmy.ml 2 points 2 years ago (16 children)

ok, experimenting on a massive test data set of over 5 million posts... this PostgreSQL query works pretty well:

SELECT COUNT(ranked_recency.*) AS post_row_count
FROM
  (
     SELECT id, community_id, published,
        rank() OVER (
           PARTITION BY community_id
           ORDER BY published DESC, id DESC
           )
     FROM post_aggregates) ranked_recency
WHERE rank <= 1000
;

This limits any one community to 1000 posts, picking the most recently created ones. It gives a way to age out older data in very active communities without removing any posts at all from small communities.
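For reference, a sketch of how that ranked subquery could gate an actual listing query rather than just a count (a hypothetical page-1 listing, using the same post_aggregates columns as above):

SELECT id, community_id, published
FROM
  (
     SELECT id, community_id, published,
        rank() OVER (
           PARTITION BY community_id
           ORDER BY published DESC, id DESC
           ) AS recency_rank
     FROM post_aggregates) ranked_recency
WHERE recency_rank <= 1000
ORDER BY published DESC
LIMIT 20 OFFSET 0;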

[–] RoundSparrow@lemmy.ml 3 points 2 years ago

> That is, give a “next page” token

There's already a pull request on changing paging.

My focus is a very hard wall on performance and scale. As things stand, there is way too much potential for queries to run into the full post table.

[–] RoundSparrow@lemmy.ml 5 points 2 years ago* (last edited 2 years ago)

I'll say this: a lot of discussion seems to take place in Matrix chat that never makes it into GitHub code comments explaining why specific changes were made.

It used to be that you could see the actual content of deleted comments, and it was at the discretion of the client to show or hide them. The newcomers from Reddit (June) seemed to not like that people could read the content of deleted comments, so I think changes were made for that reason.

With federation, it really isn't reasonable to expect content copies to all be deleted. So it's a complex issue.

[–] RoundSparrow@lemmy.ml 3 points 2 years ago

It can't be as simple as a date range, because we want to be inclusive of smaller communities.

  1. Paging is a consideration: 1000 posts per community would allow 50 pages of 20 posts.
  2. Small communities are defined as those with 1000 or fewer posts, regardless of age.
  3. Large communities would focus on recency: the 1000 posts would be the most recently created or edited.
  4. Edited is trickier; either skip it for now, or work out how to keep some kind of mass edit from crowding out newly published posts.

 

Also a good time to be reminded that the published date isn't reliable, for a couple of reasons:

  1. Problems have been seen in the field with incoming federation data carrying future published dates on content. kbin is an easy example, but it isn't limited to kbin.
  2. Federation can lag due to server overload, problems on the paths between specific servers, ISP issues, etc. It is rather common to receive a post hours after its published date. Lemmy currently does not track the 'received' date of content (a rough sketch of what that could look like follows below).
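A hypothetical sketch of tracking a received timestamp (not part of the current schema; the column name is made up, and it could arguably live on post rather than post_aggregates):

-- record when this server actually received the content, independent of the
-- remote server's published timestamp
ALTER TABLE post_aggregates
   ADD COLUMN received_at timestamptz NOT NULL DEFAULT now();
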
[–] RoundSparrow@lemmy.ml 6 points 2 years ago (2 children)

With 0.18.2 and 0.18.3 there were changes to comment sorting and to delete/remove behavior. It is entirely possible that the behavior changed, intentionally or otherwise.

There is a !test@lemmy.ml community where you could create comments and take some screenshots before and after deleting them.

[–] RoundSparrow@lemmy.ml 4 points 2 years ago

An instance is just another word for 'server' in Lemmy terminology. HDTV is a classic form of media that doesn't involve TCP/IP for watching films and other video content.

[–] RoundSparrow@lemmy.ml 6 points 2 years ago (2 children)

> i'm curious about alternate front-end / API clients....

 

Federation likes (votes) are wildly different from server to server as it stands now. And unless something is way off on my particular server, ~~0.4 seconds is what PostgreSQL is reporting as the mean (average) time per single comment vote INSERT~~, and post vote INSERT is similar. (NOTE: my server is on classic hard drives, benchmarked at 100MB/sec, not an SSD.)

Discussion of the SQL statement for a single comment vote insert: https://lemmy.ml/post/1446775

Every single VOTE is both an HTTP transaction from the remote server and a SQL transaction. I am looking into PostgreSQL support for batching inserts so that all the index constraints aren't checked at every single insert: https://www.postgresql.org/docs/current/sql-set-constraints.html

Could the Rust code for inserts from federation reasonably be modified to open a transaction, accumulate say 10 comment_like INSERTs, and then COMMIT them all at once? Possibly with a timer so that if, say, 15 seconds pass with no new like entries from remote servers, a COMMIT flushes whatever is pending.
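A minimal SQL sketch of the batching idea, with hypothetical ids. The SET CONSTRAINTS line only affects constraints declared DEFERRABLE, which the comment_like foreign keys are not by default, so that part would itself require a schema change:

BEGIN;
SET CONSTRAINTS ALL DEFERRED;  -- no-op unless the constraints are DEFERRABLE

-- one multi-row statement instead of ten separate round trips (hypothetical ids)
INSERT INTO comment_like (person_id, comment_id, post_id, score)
VALUES (101, 5001, 900, 1),
       (102, 5001, 900, 1),
       (103, 5002, 900, -1)
ON CONFLICT (comment_id, person_id)
DO UPDATE SET score = EXCLUDED.score;

COMMIT;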

Storage I/O writing for votes alone is pretty large...

 

INSERT INTO "comment_like" ("person_id", "comment_id", "post_id", "score") VALUES ($1, $2, $3, $4) ON CONFLICT ("comment_id", "person_id") DO UPDATE SET "person_id" = $5, "comment_id" = $6, "post_id" = $7, "score" = $8 RETURNING "comment_like"."id", "comment_like"."person_id", "comment_like"."comment_id", "comment_like"."post_id", "comment_like"."score", "comment_like"."published"

~~The server is showing relatively high execution time for this INSERT statement, like 0.4 seconds mean time. Is this form of blended INSERT with UPDATE and RETURNING slower than doing a direct insert?~~ (was misreading data, these are milliseconds, not seconds)

Every time a remote federation Upvote on a comment comes in to Lemmy, it executes this statement.
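For readability, a hand-written equivalent of that upsert (hypothetical ids; it uses EXCLUDED instead of repeating the values as extra bind parameters, which is functionally the same when $5 through $8 duplicate $1 through $4):

INSERT INTO comment_like (person_id, comment_id, post_id, score)
VALUES (101, 5001, 900, 1)
ON CONFLICT (comment_id, person_id)
DO UPDATE SET score = EXCLUDED.score
RETURNING id, person_id, comment_id, post_id, score, published;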

 

A posting on the Instance-specific issues/observations about the upgrade: https://lemmy.ml/post/1444409

KNOWN BUGS

  1. Searching site-wide for "0.18" generates an error. This was working fine in 0.17.4 before Lemmy.ml upgraded: https://lemmy.ml/search?q=0.18&type=All&listingType=All&page=1&sort=TopAll
 

 
 

A single Like is measuring as taking significant time on the PostgreSQL database as an INSERT, and comments and posts are even higher. As in half a second of real time for each and every one.

The lemmy_server code that does outbound HTTPS pushes is also not thoughtful about how 20 likes against the same server hosting the community trigger 20 simultaneous network and database connections... from how I read the current 0.17.4 code, concurrent activity isn't considered in the queuing, and it blindly goes into timing loops without considering that the same target server already has too much activity.

 

Lemmy.ml has been giving generic nginx 500 errors on site visits for over a week. Typically it runs into a period of errors for about 1 to 5 seconds, then clears up, only to return within 5 or 10 minutes.

I've seen them on Beehaw, I've seen them on Lemmy.world, and other reports:

https://lemmy.ml/post/1206363
https://lemmy.ml/post/1213640
https://lemmy.ml/post/1128019

Two Concerns Moving Forward

  1. How can we configure nginx to serve a custom error page saying that server admins are aware of and tracking the problem, instead of just the generic 500 error page?

  2. Why is lemmy_server not responding to nginx? Can we start sharing tips on how to parse the logfiles and what to look for, so we can compile some kind of count of how many of these errors we are getting?

I think lemmy_server should start to keep internal error logs about in-code failures. My experience with Lemmy 0.17.4 is that it tends to conceal errors when a database transaction or a federation network transaction fails. I think we need to get these errors to bubble up into an organized, purpose-named log and back to the end user, so they know what code path they are hitting failures on and can communicate more precisely with developers and server operators.

 

I have been working with the pg_stat_statements extension to PostgreSQL, and it gives us a way to see the actual SQL statements being executed by lemmy_server and the number of times they are being called.

This has less overhead than cranking up logging, and several cloud computing services enable it by default (example), so I don't believe it will significantly slow down the server.

A DATABASE RESTART WILL BE REQUIRED

It does require that PostgreSQL be restarted, which typically takes 10 or 15 seconds.

Debian / Ubuntu install steps

https://pganalyze.com/docs/install/self_managed/02_enable_pg_stat_statements_deb

Following the conventions of "Lemmy from Scratch" server install commands:

sudo -iu postgres psql -c "ALTER SYSTEM SET shared_preload_libraries = 'pg_stat_statements';"

Followed by a restart of the PostgreSQL service.
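
After the restart, the extension still has to be created in the database before its view can be queried; then the heaviest statements can be listed. For example (column names as of PostgreSQL 13; older versions use total_time / mean_time instead):

CREATE EXTENSION IF NOT EXISTS pg_stat_statements;

-- top 20 statements by total execution time
SELECT calls,
       round(mean_exec_time::numeric, 2)  AS mean_ms,
       round(total_exec_time::numeric, 2) AS total_ms,
       left(query, 80)                    AS query_start
FROM pg_stat_statements
ORDER BY total_exec_time DESC
LIMIT 20;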

 

The activity table has a lot of data in each row.
