The original post: /r/datahoarder by /u/gulisav on 2024-06-17 00:09:01.
I'm not extremely tech-savvy, so I have some possibly silly questions.
Two days ago it was announced that the Great Russian Encyclopedia has received no funding at all this year and will be discontinued. (The encyclopedia is an heir to the Great Soviet Encyclopedia, and is fairly decent as far as general encyclopedias go.) Apparently Russia has bigger priorities than funding an encyclopedia... So I think I might try my hand at saving the encyclopedia's online edition before it 404s.

There are two domains, bigenc and old.bigenc (both .ru domains), and I'll focus on the latter. It seems fairly simple to rip, because each encyclopedic article has a corresponding PDF file, and the URLs differ only in their final number (6 or 7 digits). I could produce a list of all the possible URLs in Excel. However, if I were to feed that list to a download manager, I'm wondering whether that would cause any serious issues for the server. There are probably close to a hundred thousand articles on the site, and the downloader would also have to check possibly millions of URLs that contain no PDFs. Would this amount to a sort of borderline DDoS attack? Could my requests get blocked?
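Instead of Excel plus a download manager, a small Python script could do both the enumeration and the polite pacing. This is only a sketch: the exact URL pattern here is a guess based on the description (verify it against a real article link first), and the one-request-per-second delay is the part that keeps this from looking like an attack.

```python
import time
import urllib.error
import urllib.request

# Hypothetical URL pattern -- check a real article link before running.
BASE = "https://old.bigenc.ru/text/{n}.pdf"

def candidate_urls(lo=100000, hi=9999999):
    """Yield (id, url) pairs for every 6- or 7-digit article ID."""
    for n in range(lo, hi + 1):
        yield n, BASE.format(n=n)

def download_all(out_dir=".", delay=1.0):
    """Fetch each candidate URL, one request per `delay` seconds."""
    for n, url in candidate_urls():
        try:
            with urllib.request.urlopen(url, timeout=30) as resp:
                data = resp.read()
            with open(f"{out_dir}/{n}.pdf", "wb") as f:
                f.write(data)
        except urllib.error.HTTPError:
            pass  # 404: no article with this ID, just move on
        time.sleep(delay)  # fixed delay = trivial load on the server
```

At one request per second the full 6-7-digit sweep would take months, so in practice it would be worth narrowing the ID range first (e.g. by sampling which ranges actually return PDFs) rather than blindly probing all ten million candidates.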
Furthermore, even if I rip all that stuff, I'd end up with thousands of files with nothing in particular to identify them, since the filenames are just numbers. Is there a way to derive the article titles from the text within the PDFs (which of course includes the title of the article) and rename the files accordingly?
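This should be doable if the title is the first line of text on each PDF's first page — an assumption to verify on a few sample files. A library like pypdf can extract the text (`PdfReader(path).pages[0].extract_text()`); the sketch below is just the renaming helper, which keeps the numeric ID in the name so nothing collides:

```python
import re

def title_to_filename(first_page_text, article_id):
    """Build a filesystem-safe filename from extracted PDF text.

    Assumes the article title is the first non-empty line of the
    first page -- check this against a few real PDFs first.
    """
    title = next(
        (ln.strip() for ln in first_page_text.splitlines() if ln.strip()),
        "",
    )
    # Replace characters that are illegal in Windows filenames,
    # cap the length, and fall back to the bare ID if extraction failed.
    safe = re.sub(r'[\\/:*?"<>|]', "_", title)[:120] or str(article_id)
    return f"{safe} ({article_id}).pdf"

# Extracting the text would look something like:
#   from pypdf import PdfReader
#   text = PdfReader("1234567.pdf").pages[0].extract_text()
#   new_name = title_to_filename(text, 1234567)
```

One caveat: text extraction from Cyrillic PDFs depends on how the fonts are embedded, so it's worth spot-checking that the extracted titles aren't mojibake before renaming everything.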
(The PDFs themselves are small in size, so I'm not worrying about space constraints.)