The original post: /r/datahoarder by /u/Sinath_973 on 2025-04-16 16:06:31.

My goal is to scrape websites and extract their textual content for later use in an AI context.

Currently I am working with n8n. You can scrape single URLs, download the pages, and extract their content easily enough, but it feels very clunky to me and doesn't work for deeper nested pages: I would have to recursively go through the links, filter for the same domain, and repeat the process for every sub-page.
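
For illustration, the manual approach I mean would look roughly like this outside of n8n. This is a minimal Node.js/TypeScript sketch using cheerio and the built-in fetch (Node 18+); the start URL and page limit are placeholders, not my actual setup:

```ts
// Rough sketch: breadth-first crawl of a single domain, extracting the
// visible text of every page. Uses Node 18+ global fetch and cheerio.
import * as cheerio from 'cheerio';

async function crawlDomain(startUrl: string, maxPages = 100): Promise<Map<string, string>> {
  const origin = new URL(startUrl).origin;
  const queue: string[] = [startUrl];
  const visited = new Set<string>();
  const texts = new Map<string, string>(); // url -> extracted text

  while (queue.length > 0 && visited.size < maxPages) {
    const url = queue.shift()!;
    if (visited.has(url)) continue;
    visited.add(url);

    let html: string;
    try {
      const res = await fetch(url);
      const type = res.headers.get('content-type') ?? '';
      if (!res.ok || !type.includes('text/html')) continue;
      html = await res.text();
    } catch {
      continue; // skip unreachable pages
    }

    const $ = cheerio.load(html);
    $('script, style, noscript').remove();
    texts.set(url, $('body').text().replace(/\s+/g, ' ').trim());

    // Collect same-domain links and queue them for later visits.
    $('a[href]').each((_, el) => {
      const href = $(el).attr('href');
      if (!href) return;
      try {
        const next = new URL(href, url); // resolves relative links
        next.hash = '';
        if (next.origin === origin && !visited.has(next.href)) queue.push(next.href);
      } catch {
        /* ignore malformed hrefs */
      }
    });
  }

  return texts;
}

// Example usage (hypothetical start URL):
// const pages = await crawlDomain('https://example.com');
// console.log(pages.size, 'pages scraped');
```

It works, but it reimplements queueing, deduplication, and link filtering by hand, which is exactly the clunkiness I want to avoid.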

Do you have any better ideas?

I have looked at Node.js libraries to include in my n8n nodes, but wasn't really convinced by any of them.
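
For comparison, a library-based version of the same idea, e.g. with Crawlee (a Node.js crawling library that handles the link-following and same-domain filtering itself and runs fine in a plain Node/Docker container), could look roughly like this. This is an untested sketch; the exact API should be checked against the Crawlee docs:

```ts
// Crawlee sketch: crawl one domain, store each page's text in a dataset.
import { CheerioCrawler, Dataset } from 'crawlee';

const crawler = new CheerioCrawler({
  maxRequestsPerCrawl: 500, // safety limit for the sketch
  async requestHandler({ request, $, enqueueLinks }) {
    // Store the page text for later use in the AI pipeline.
    $('script, style, noscript').remove();
    await Dataset.pushData({
      url: request.loadedUrl,
      text: $('body').text().replace(/\s+/g, ' ').trim(),
    });
    // Follow only links that stay on the same domain.
    await enqueueLinks({ strategy: 'same-domain' });
  },
});

// Hypothetical start URL:
await crawler.run(['https://example.com']);
```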

If someone knows a self-hostable scraper (Docker preferred) with a clean API, I would be super happy.

Cheers
