This post was submitted on 28 May 2025.

It's A Digital Disease!

The original post: /r/datahoarder by /u/p186 on 2025-05-27 21:32:06.

I've been a Pocket user for many years. I'd been meaning to move off it for a while, and finally have now that it's being sunset. I looked at Wallabag a while back, but went with Karakeep so I can leverage my local LLMs for autotagging, especially since the Pocket export doesn't seem to have included the tags I had.
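For anyone wiring up the same thing, the local-LLM tagging is just a few environment variables on the Karakeep web service. A minimal sketch, assuming a reachable Ollama instance; the Ollama URL and the model names here are placeholders for whatever you actually run:

```yaml
# Inference settings on the Karakeep web service (sketch, not my exact file)
environment:
  OLLAMA_BASE_URL: http://ollama:11434   # local Ollama instance that does the tagging
  INFERENCE_TEXT_MODEL: llama3.1:8b      # placeholder; any locally hosted text model
  INFERENCE_IMAGE_MODEL: llava:7b        # optional, used for image bookmarks
```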

I've accumulated years' worth of saves, so indexing and crawling them is taking a while. Processing my old data has been running for almost a week, and it looks like it'll take another week, maybe two, to complete. Is there a way to configure the crawler to make multiple concurrent requests? I run Karakeep via a multi-service Docker Compose stack. I've configured it to do a full-page archive by default, since I like to use the reader view and want to guard against link rot. As a result, crawling each URL takes about 4-5 seconds.
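For context, my setup is close to the stock multi-service compose file. A trimmed sketch; the image tags and service names are approximate, so check the Karakeep docs for the current ones:

```yaml
# docker-compose.yml (abridged sketch of my setup)
services:
  web:
    image: ghcr.io/karakeep-app/karakeep:release
    environment:
      MEILI_ADDR: http://meilisearch:7700
      BROWSER_WEB_URL: http://chrome:9222   # the Chrome service the crawler drives
      CRAWLER_FULL_PAGE_ARCHIVE: "true"     # full local copies for reader view / link rot
    volumes:
      - data:/data
  chrome:
    image: gcr.io/zenika-hub/alpine-chrome:123
    command:
      - --no-sandbox
      - --disable-gpu
      - --remote-debugging-address=0.0.0.0
      - --remote-debugging-port=9222
  meilisearch:
    image: getmeili/meilisearch:v1.13
    volumes:
      - meilisearch:/meili_data
volumes:
  data:
  meilisearch:
```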

Does anyone have recommendations that could speed up processing of my imported data? Is it possible to run multiple HTTP/HTTPS request threads, or to run multiple instances of the Chrome service/container? I'd rather not lower the crawler timeout, as that would just trade speed for failed crawls.

SOLVED: I increased the crawler workers from 1 to 15 (https://www.reddit.com/r/selfhosted/comments/1kwzhdu/comment/mulypk8/) and switched to a smaller LLM for text inference (gemma3:4b). It should now finish sometime tomorrow.
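Concretely, the two changes were on the web service's environment. CRAWLER_NUM_WORKERS is the setting from the linked comment; 15 worked on my hardware, so tune it to yours:

```yaml
# Changes on the Karakeep web service that solved it for me
environment:
  CRAWLER_NUM_WORKERS: 15          # was the default of 1; allows concurrent crawl jobs
  INFERENCE_TEXT_MODEL: gemma3:4b  # smaller local model so tagging keeps pace with the crawler
```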
