The original post: /r/datahoarder by /u/km14 on 2025-01-17 19:18:45.
I'm an artist/amateur researcher who has 100+ collections of important research material (stupidly) saved in the TikTok app's collections feature. I cobbled together a working solution to get them out, WITH METADATA (the one or two semi-working guides online so far don't seem to include this).
The gist of the process: I download the HTML content of the collections on desktop, parse it into a list of links plus lots of other metadata using BeautifulSoup, and then feed that data into a script that combines yt-dlp and a custom fork of gallery-dl made by GitHub user CasualYT31 to download all the posts. I also rename the files to their post IDs so it's easy to cross-reference metadata, and generally make all the data fairly neat and tidy.
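Roughly, the parsing step has this shape (a simplified sketch, not the actual repo code; it assumes the saved collection HTML contains anchors pointing at TikTok's usual `/@user/video/<id>` URLs, and the filename is a placeholder):

```python
# Sketch of the parsing step. Assumes post links in the saved HTML follow
# the https://www.tiktok.com/@user/video/<post_id> pattern; "collection.html"
# is a placeholder filename.
import re
from bs4 import BeautifulSoup

with open("collection.html", encoding="utf-8") as f:
    soup = BeautifulSoup(f, "html.parser")

posts = []
for a in soup.find_all("a", href=True):
    m = re.search(r"tiktok\.com/@([^/]+)/(?:video|photo)/(\d+)", a["href"])
    if m:
        posts.append({"author": m.group(1), "post_id": m.group(2), "url": a["href"]})

# The same link can appear more than once, so deduplicate by post ID.
posts = list({p["post_id"]: p for p in posts}.values())
print(f"Found {len(posts)} posts")
```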
It produces a JSON and CSV of all the relevant metadata I could access via yt-dlp/the HTML of the page.
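The download/metadata step looks roughly like this (again a sketch using yt-dlp's Python API rather than whatever the repo actually does; the example URL is fake and the CSV columns are just my guess at useful fields):

```python
# Sketch of the download + metadata step. "outtmpl" names each file after
# its post ID so files cross-reference the metadata; "writeinfojson" drops
# a per-post .info.json next to the video. The URL below is a placeholder.
import csv
import json
import yt_dlp

urls = ["https://www.tiktok.com/@someuser/video/7123456789012345678"]

ydl_opts = {
    "outtmpl": "%(id)s.%(ext)s",  # filename == post ID
    "writeinfojson": True,
}

rows = []
with yt_dlp.YoutubeDL(ydl_opts) as ydl:
    for url in urls:
        info = ydl.extract_info(url, download=True)
        rows.append({
            "id": info.get("id"),
            "uploader": info.get("uploader"),
            "title": info.get("title"),
            "upload_date": info.get("upload_date"),
            "url": url,
        })

with open("metadata.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=rows[0].keys())
    writer.writeheader()
    writer.writerows(rows)

with open("metadata.json", "w", encoding="utf-8") as f:
    json.dump(rows, f, indent=2)
```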
It also (currently) downloads all the videos without watermarks at full HD.
This has worked on 10,000+ posts so far.
Check out the full process/code on Github:
https://github.com/kevin-mead/Collections-Scraper/
Things I wish I'd been able to get working:
- Photo slideshows don't have metadata that can be accessed by yt-dlp or gallery-dl. Most regrettably, I can't figure out how to scrape the names of the sounds used on them.
- There aren't any meaningful safeguards here to prevent getting IP-banned by TikTok for scraping, beyond whatever yt-dlp does itself. I made it possible to delay each download by a random 1-5 sec, but that occasionally broke the metadata file at the end of the run for some reason, so I removed it and called it a day (see the sketch after this list).
- I want .srt caption files of each post so badly. This seems to be one of those features only closed-source downloaders have (like this one).
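For what it's worth, here's the shape a sturdier delay could take (a sketch under my assumption that the metadata broke because the file was only written once at the end of the run; appending one JSON line per post means a crash or ban mid-run can't corrupt what's already on disk):

```python
# Sketch: jittered 1-5 s delay between downloads, with metadata appended
# as JSON lines per post instead of written once at the end. "ydl" is any
# configured yt_dlp.YoutubeDL instance; "metadata.jsonl" is a placeholder.
import json
import random
import time

def download_all(urls, ydl, out_path="metadata.jsonl"):
    with open(out_path, "a", encoding="utf-8") as f:
        for url in urls:
            info = ydl.extract_info(url, download=True)
            f.write(json.dumps({"id": info.get("id"), "url": url}) + "\n")
            f.flush()  # each post's metadata hits disk before the next request
            time.sleep(random.uniform(1, 5))  # random 1-5 s between requests
```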
I am not a talented programmer, and this code has been edited to hell by every LLM out there. This is low-stakes, non-production code. Proceed at your own risk.