The original post: /r/datahoarder by /u/busymom0 on 2025-02-07 17:07:00.

I have thousands of links to various articles (imagine thousands of bookmarked links), and I need to get the published timestamp for each of them.

I have been able to use the newspaper4k library to get the publish_date value:

python3.10 -m newspaper --url="https://phys.org/news/2025-02-antarctic-hoff-crab-males-bigger.html" --output-format=json | jq -r '.[].publish_date'
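
For reference, here is roughly the same thing via newspaper4k's Python API instead of shelling out per link (a minimal sketch, assuming the classic Article interface newspaper4k inherits from newspaper3k; it still downloads each page, it just avoids spawning a process per URL):

from newspaper import Article

def get_publish_date(url):
    article = Article(url)
    article.download()            # still fetches the page, same as the CLI
    article.parse()               # parsing populates publish_date
    return article.publish_date   # a datetime, or None if not found

urls = [
    "https://phys.org/news/2025-02-antarctic-hoff-crab-males-bigger.html",
]
for url in urls:
    try:
        print(url, get_publish_date(url))
    except Exception as exc:      # network errors, bot blocks, etc.
        print(url, "FAILED:", exc)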

However, this has to fetch every single article one by one and parse the whole thing just to extract publish_date. I am trying to avoid fetching every article, especially because it risks running into bot detection.

For example, the phys.org site above blocks requests from DigitalOcean IP addresses.

Plus, I don't really care about the content of the links; I just need the date.
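
One lighter variant of the current approach, since I only need the date: stream just the first chunk of each page (the publish date usually sits in a meta tag inside <head>) instead of downloading and parsing the full article. It still touches every URL, so it doesn't solve the bot-detection problem. A minimal sketch; the regex assumes the common article:published_time meta tag with property before content, which not every site uses:

import re
import requests

# article:published_time is one common (Open Graph-style) convention;
# the attribute order (property before content) is an assumption here.
META_RE = re.compile(
    r'<meta[^>]+property=["\']article:published_time["\'][^>]+'
    r'content=["\']([^"\']+)',
    re.I,
)

def date_from_head(url, max_bytes=20000):
    # Stream the response and read only the first chunk; the meta tags
    # live in <head>, so the article body is never downloaded.
    with requests.get(url, stream=True, timeout=10) as resp:
        resp.raise_for_status()
        chunk = next(resp.iter_content(max_bytes)).decode("utf-8", "replace")
    match = META_RE.search(chunk)
    return match.group(1) if match else None

print(date_from_head("https://phys.org/news/2025-02-antarctic-hoff-crab-males-bigger.html"))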

Another option I thought of is doing a Google search for each URL and grabbing the date from the top result. While not perfect, it could work.

Another option I considered was pulling each site's RSS feed, searching for my link in it, and grabbing the date from the matching entry. But this won't work in general: RSS feeds usually only cover recent content, so older links won't be listed.
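
For the cases where a link is still in a feed, a minimal sketch with the feedparser library (the feed URL below is a placeholder; each site's real feed URL would have to be discovered first):

import feedparser

def date_from_feed(feed_url, article_url):
    feed = feedparser.parse(feed_url)
    for entry in feed.entries:
        if entry.get("link") == article_url:
            # 'published' is the raw string; 'published_parsed' is a
            # time.struct_time when feedparser can parse the date
            return entry.get("published")
    return None

# Assumed feed URL -- substitute the site's actual feed.
print(date_from_feed(
    "https://phys.org/rss-feed/",
    "https://phys.org/news/2025-02-antarctic-hoff-crab-males-bigger.html",
))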

Is there any other creative way to do this?

EDIT: I am currently looking into whether I can use the Bing or Google search API to grab the date.
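
A minimal sketch of that idea using Google's Custom Search JSON API (the endpoint and key/cx parameters are real, and you need an API key plus a Programmable Search Engine ID; the pagemap/metatags layout is how CSE usually surfaces page metadata, but which meta key actually carries the date varies by site, so treat the key list as an assumption):

import requests

API_KEY = "YOUR_API_KEY"          # from the Google Cloud console
CSE_ID = "YOUR_SEARCH_ENGINE_ID"  # a Programmable Search Engine ID

def search_publish_date(url):
    resp = requests.get(
        "https://www.googleapis.com/customsearch/v1",
        params={"key": API_KEY, "cx": CSE_ID, "q": url, "num": 1},
        timeout=10,
    )
    resp.raise_for_status()
    for item in resp.json().get("items", []):
        metatags = item.get("pagemap", {}).get("metatags", [{}])[0]
        # Common meta keys that may hold a date (assumed, site-dependent):
        for key in ("article:published_time", "og:updated_time", "date"):
            if key in metatags:
                return metatags[key]
    return None

print(search_publish_date(
    "https://phys.org/news/2025-02-antarctic-hoff-crab-males-bigger.html"))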

no comments (yet)