The original post: /r/datahoarder by /u/p186 on 2025-05-27 21:32:06.
I've been a Pocket user for many years. I've been meaning to migrate away for a while, but finally have now that Pocket is being sunset. I was looking at Wallabag a while back, but went with Karakeep instead so I can leverage my local LLMs for autotagging, especially since the Pocket export doesn't seem to have included the tags I had.
I've accumulated years' worth of saves, so indexing and crawling them is taking a while. The processing of my old data has been running for almost a week and looks to need another week, maybe two, until it completes. Is there a way to configure the crawler to make multiple concurrent requests? I run Karakeep via a multi-service Docker Compose stack (sketched below). I have it configured to take a full-page archive by default, as I like to use the reader view and want to guard against link rot. As a result, crawling each URL takes about 4-5 seconds.
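For reference, my stack is close to the example Compose file from the Karakeep docs, with full-page archiving enabled and inference pointed at Ollama. This is a minimal sketch from memory (secrets, the Meilisearch master key, and exact image tags omitted or approximated), so check variable names against the current docs before copying:

```yaml
services:
  web:
    image: ghcr.io/karakeep-app/karakeep:release
    restart: unless-stopped
    ports:
      - "3000:3000"
    volumes:
      - data:/data
    environment:
      DATA_DIR: /data
      MEILI_ADDR: http://meilisearch:7700
      BROWSER_WEB_URL: http://chrome:9222      # headless Chrome used by the crawler
      CRAWLER_FULL_PAGE_ARCHIVE: "true"        # keep a full offline copy of each page
      OLLAMA_BASE_URL: http://host.docker.internal:11434  # assumes Ollama runs on the host

  chrome:
    image: gcr.io/zenika-hub/alpine-chrome:123
    restart: unless-stopped
    command:
      - --no-sandbox
      - --disable-gpu
      - --disable-dev-shm-usage
      - --remote-debugging-address=0.0.0.0
      - --remote-debugging-port=9222
      - --hide-scrollbars

  meilisearch:
    image: getmeili/meilisearch:v1.13
    restart: unless-stopped
    volumes:
      - meilisearch:/meili_data

volumes:
  data:
  meilisearch:
```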
Does anyone have recommendations that could speed up the processing of my imported data? Is it possible to run multiple HTTP/HTTPS request threads, or to run multiple instances of the Chrome service/container? I'd rather avoid lowering the crawler timeout, since that would just turn slow pages into failed crawls.
SOLVED: Increased the crawler workers from 1 to 15 (https://www.reddit.com/r/selfhosted/comments/1kwzhdu/comment/mulypk8/) and switched to a smaller LLM for text inference (gemma3:4b). It should now finish sometime tomorrow.
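For anyone finding this later, the change was two environment variables on the web service (variable names as they appear in the Karakeep docs at the time of writing; verify before copying):

```yaml
  web:
    environment:
      # Concurrent crawl jobs; the default is 1, which is why the backlog
      # was draining at roughly one URL per 4-5 seconds.
      CRAWLER_NUM_WORKERS: "15"
      # Smaller local model = much faster text inference for autotagging,
      # at the cost of slightly coarser tags.
      INFERENCE_TEXT_MODEL: gemma3:4b
```

As far as I can tell, all 15 workers shared the single Chrome container's remote-debugging endpoint, so I didn't need to scale the chrome service at all.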