I have thousands of bookmarked links to various articles, and I need to get the published timestamp for each of them.
I have been able to use the newspaper4k library to get the publish_date value:
python3.10 -m newspaper --url="https://phys.org/news/2025-02-antarctic-hoff-crab-males-bigger.html" --output-format=json | jq -r '.[].publish_date'
However, this fetches every single article one by one and parses the whole page just to get the publish_date. I am trying to avoid fetching every article, especially because it risks running into bot detection.
For example, the phys.org site above blocks IP addresses from DigitalOcean servers.
Besides, I really don't care about the content of the pages; I just need the date.
Another option I thought of was doing a Google search for each URL, then grabbing the date from the top result. While not perfect, it could work.
Another option I considered was fetching each site's RSS feed, searching for my link in it, and grabbing the date. But this won't work in general, because RSS feeds usually contain only recent content, so older links won't be listed.
Is there any other creative way to do this?
EDIT: I am currently looking into whether I can use the Bing or Google search APIs to grab the date.