The original post: /r/datahoarder by /u/Tom_Sacold on 2025-01-26 02:20:25.

I wanted to do this for a particular project, not just the hoarding, but let's just say we want to do this.

Let's also say to make it simple we're going to download only .txt versions of the books.

Gutenberg has a page telling you you're allowed to do this with wget, with a 2-second wait between requests, and it gives the command as

wget -w 2 -m -H "http://www.gutenberg.org/robot/harvest?filetypes%5B%5D=txt&langs%5B%5D=en"

Now, I believe this is supposed to fetch a series of HTML pages (following a "next page" link each time), which contain links to zip files, and to download not just the pages but the linked zip files as well. Does that seem right?

This did not work for me. I tried various options with the -A flag, but it never downloaded the zips.
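
For the record, the kind of invocation I was attempting looked roughly like this (the -A pattern and the -D restriction are my guesses at the right incantation, not something I've actually gotten to work):

wget -w 2 -m -H -D gutenberg.org -A 'harvest*,*.zip' "http://www.gutenberg.org/robot/harvest?filetypes%5B%5D=txt&langs%5B%5D=en"

The idea being that -A keeps only the harvest pages and the zips, and -D stops -H from wandering off to hosts outside gutenberg.org.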

So, OK, moving on, what I do have is 724 files (with annoying names because wget can't custom-name them for me), each containing 200-odd links to zip files like this:

<a href="http://aleph.gutenberg.org/1/0/0/3/10036/10036-8.zip">http://aleph.gutenberg.org/1/0/0/3/10036/10036-8.zip</a>

So we can easily grep those out of the files and get a list of the zipfile URLs, right?

egrep -oh 'http://aleph.gutenberg.org/[^"<]+' * | uniq > zipurls.txt

Using uniq there because every URL appears twice: once in the link text and once in the href attribute.
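
(Note that uniq only collapses adjacent duplicates, which happens to be enough here because the two copies of each URL sit right next to each other; sort -u would be the more paranoid version:)

egrep -oh 'http://aleph.gutenberg.org/[^"<]+' * | sort -u > zipurls.txt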

So now we have a huge list of the zip file URLs, and we can fetch them with wget using the --input-file option:

wget -w 2 --input-file=zipurls.txt

This works, except … some of the files aren't there.
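
(Side note: adding -x and -nc to that run keeps the aleph.gutenberg.org directory layout and lets you resume if it gets interrupted. That's my own tweak, not something the Gutenberg page asks for:)

wget -w 2 -x -nc --input-file=zipurls.txt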

If you go to this URL in a browser:

http://aleph.gutenberg.org/1/0/0/3/10036/

you'll see that 10036-8.zip isn't there. But there is an old folder, and the file is in there. What does the -8 mean? I think it means UTF-8 encoding, so I might be double-downloading: getting the same files twice in different encodings. And what does old mean? Just … old?

So now I'm working through the list, not with wget but with a script which is essentially this:

try to get the file
if the response is a 404
    add 'old' into the URL and try again
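
Concretely, a minimal shell version of that loop looks something like this (untested, and it retries on any wget failure, not just a 404); it assumes the fallback copy lives in an old/ subfolder next to the original, e.g. .../10036/old/10036-8.zip:

while read -r url; do
    # try the URL as-is; on failure, insert "old/" before the filename and retry
    wget -q "$url" || wget -q "${url%/*}/old/${url##*/}"
    sleep 2
done < zipurls.txt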

How am I doing? What have I missed? Are we having fun yet?
