-5

I need to extract several fields from news articles, and I have automated most of them except the published date. Currently, I manually visit each website, inspect the HTML tag surrounding the published date, write a jQuery selector to extract the date, and implement the same in pyquery. I want to remove this manual step as well and write a generic web scraper for news websites such as the NY Times. The closest I can think of is writing a lot of regexes that match datetime formats in the article's DOM, but I can't figure out how they could differentiate between the actual published date and any other date that appears in the article text itself. I researched this and found that both Google and DuckDuckGo show the article's timestamp in their search results, so it must be possible to implement this.

Edit: My question was not very clear, so to restate it: is there a way to scrape the published date from any news article automatically, i.e. a generic crawler that can extract the published date from blog posts or news articles?

ankits

1 Answer

1

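There's no generic way to get the date a news article was written (although you can design a parsing rule for each news website), but you can get the last-modified date of the web page using document.lastModified in JavaScript, or by parsing the Last-Modified field from the HTTP response headers. Keep in mind that this is when the page last changed, which is not necessarily when the article was published, and not every server sends the header.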
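Since the question uses Python, here is a minimal sketch of the second suggestion, reading the Last-Modified header, using only the standard library. The function names `parse_last_modified` and `fetch_last_modified` are made up for illustration, and the sketch assumes the server answers HEAD requests and actually sends the header, which many sites do not.

```python
from datetime import datetime
from email.utils import parsedate_to_datetime
from typing import Optional
import urllib.request


def parse_last_modified(headers) -> Optional[datetime]:
    """Return the Last-Modified header as a datetime, or None if absent or malformed.

    `headers` is anything with a dict-like .get(), e.g. the headers of an
    HTTP response. (Hypothetical helper, not part of any library.)
    """
    value = headers.get("Last-Modified")
    if not value:
        return None
    try:
        # HTTP dates use the same format as email dates (RFC 5322 style).
        return parsedate_to_datetime(value)
    except (TypeError, ValueError):
        # Older Python versions raise TypeError, newer ones ValueError.
        return None


def fetch_last_modified(url: str, timeout: float = 10.0) -> Optional[datetime]:
    """Send a HEAD request and parse the Last-Modified header, if any."""
    request = urllib.request.Request(url, method="HEAD")
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return parse_last_modified(response.headers)
```

Calling `fetch_last_modified` with an article URL returns a timezone-aware datetime when the header is present and `None` otherwise, so the caller still needs a fallback such as a per-site rule.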

1''