this post was submitted on 27 Feb 2025

The original post: /r/datahoarder by /u/Internal-Ad-2771 on 2025-02-26 15:25:02.

Hello! I want to download the End of Term Web Archive 2024 to run text analysis and track changes in textual content. The Internet Archive hosts a collection of the WARC files at https://archive.org/details/EndOfTerm2024WebCrawls, but it amounts to hundreds of terabytes, so I can't download everything. Since I'm only interested in HTML files, and perhaps only the most visited domains rather than all of them, I wonder if there is a more efficient approach. I thought of two possible solutions:

  • WET files, which contain only the text extracted from the EOT crawls and are much smaller, are available at https://eotarchive.org/data/ for previous years, but not for 2024. Does anyone know of links for 2024?
  • I tried downloading each HTML file individually through the Wayback Machine API, but I think there is a rate limit of 20 requests per second. For a website like state.gov, there are more than 500,000 captures between 2024 and 2025 to download, so this would take a very long time.
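
For the second option, a rough sketch of what I have in mind, in case it helps anyone suggest improvements. It asks the Wayback Machine CDX server for HTML captures of one domain, uses `collapse=digest` to skip consecutive captures whose content digest is identical, and then fetches the raw page for each remaining capture at a throttled rate. The function names and the default of 5 requests per second are my own assumptions, not documented limits, and I have not confirmed the 20 requests per second figure above.

```python
import json
import time
import urllib.parse
import urllib.request

CDX_ENDPOINT = "https://web.archive.org/cdx/search/cdx"


def build_cdx_query(domain, start="2024", end="2025", limit=None):
    """Build a CDX query URL listing HTML captures for a whole domain."""
    params = [
        ("url", domain),
        ("matchType", "domain"),             # include subdomains
        ("from", start),
        ("to", end),
        ("output", "json"),
        ("fl", "timestamp,original,digest"),
        ("filter", "mimetype:text/html"),    # HTML only
        ("filter", "statuscode:200"),        # skip redirects and errors
        ("collapse", "digest"),              # drop adjacent identical captures
    ]
    if limit is not None:
        params.append(("limit", str(limit)))
    return CDX_ENDPOINT + "?" + urllib.parse.urlencode(params)


def parse_cdx_json(text):
    """Turn the CDX JSON output (header row + data rows) into dicts."""
    rows = json.loads(text)
    if not rows:
        return []
    header, *data = rows
    return [dict(zip(header, row)) for row in data]


def raw_capture_url(timestamp, original):
    """URL of the archived page; the id_ suffix requests the original bytes."""
    return f"https://web.archive.org/web/{timestamp}id_/{original}"


def http_get(url):
    """Plain blocking GET; swap in a client with retries for real use."""
    with urllib.request.urlopen(url, timeout=60) as response:
        return response.read()


def fetch_captures(captures, max_per_second=5, fetch=http_get, sleep=time.sleep):
    """Yield (capture, body) pairs, pausing between requests.

    max_per_second is an assumed polite rate, not an official limit.
    """
    delay = 1.0 / max_per_second
    for capture in captures:
        url = raw_capture_url(capture["timestamp"], capture["original"])
        yield capture, fetch(url)
        sleep(delay)


def list_captures(domain, limit=None):
    """Query the CDX server (network call) and return the parsed captures."""
    body = http_get(build_cdx_query(domain, limit=limit))
    return parse_cdx_json(body.decode("utf-8"))
```

Something like `for capture, html in fetch_captures(list_captures("state.gov", limit=100)): ...` would then pull a small sample. Collapsing on the digest should cut the number of downloads wherever pages were captured repeatedly without changing, though I don't know by how much for state.gov.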

Any other ideas?
