Get published date of news articles using Python web crawler

Question

I need to extract several fields from news articles, and I have been able to automate most of them except the published date. Currently, I manually go to each website, check the HTML tag surrounding the published date, write a jQuery selector to extract it, and implement the same in pyquery. However, I want to remove this manual step as well and write a generic web scraper for news websites like NY Times etc. The closest I can think of is writing a lot of regexes that match datetime formats in the DOM of the article, but I can't figure out how to differentiate between the actual published date and any other date that may appear in the article itself. I researched and realised that both Google and DuckDuckGo show the timestamp of the article in their search results, so it must be possible to implement this.

Edit: I believe the wording of my question was not very clear, so to restate: is there a way to scrape the published date from any news article automatically, i.e. a generic crawler that can extract the published date from blog posts or news articles?

@ankits did you find any solution? – Adarsh Patel, May 23 '19 at 20:16

score 1 · Answer 1 · answered Dec 28 '14 at 15:02

There's no generic way to get the date a news article was written (although you can design a rule for parsing each news website), but you can get the last modified date of the webpage using document.lastModified in JavaScript or by parsing the Last-Modified field from the HTTP response header.
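The header-based option could be sketched in Python roughly as follows. This is a minimal sketch using only the standard library; the function names parse_last_modified and fetch_last_modified are made up for illustration. Keep in mind that Last-Modified is the server's modification time, not necessarily the publication date, and many servers omit it or report the time the page was generated.

```python
from datetime import datetime
from email.utils import parsedate_to_datetime
from typing import Mapping, Optional
from urllib.request import Request, urlopen


def parse_last_modified(headers: Mapping[str, str]) -> Optional[datetime]:
    """Return the Last-Modified header as a datetime, or None if absent or malformed.

    Hypothetical helper: with a plain dict the key lookup is case-sensitive,
    while the headers object returned by urlopen is case-insensitive.
    """
    value = headers.get("Last-Modified")
    if value is None:
        return None
    try:
        # HTTP dates use the same format as email dates (RFC 5322 style).
        return parsedate_to_datetime(value)
    except (TypeError, ValueError):
        # Older Python versions raise TypeError, newer ones ValueError.
        return None


def fetch_last_modified(url: str, timeout: float = 10.0) -> Optional[datetime]:
    """Send a HEAD request to url and parse the Last-Modified response header."""
    request = Request(url, method="HEAD")
    with urlopen(request, timeout=timeout) as response:
        return parse_last_modified(response.headers)
```

Calling fetch_last_modified with an article URL returns a timezone-aware datetime when the server sends the header, and None otherwise. Some sites reject HEAD requests, in which case a normal GET would be needed instead.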
