I am using the newspaper3k library in Python to scrape articles. Articles from other sites return a proper publish date, but for Times of India articles the publish date in the result is null (None).

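from newspaper import Article

# example TOI article URL (the one given in the comments)
url = 'https://timesofindia.indiatimes.com/business/india-business/logistics-it-media-professionals-most-anxious-about-returning-to-work-survey/articleshow/77479303.cms'
article = Article(url)
article.download()
article.parse()
result = vars(article)
print(result['publish_date'])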
  • Can you show the code you've tried, the error messages, and what you expect to happen? – PaulProgrammer Aug 27 '20 at 05:57
  • All other articles give a proper date, but articles from the Times of India (TOI) domain give a null publish date. Can TOI block part of the response? – rohan sawant Aug 27 '20 at 06:02
  • Sure, the publisher of an API has full control over what is returned, and may choose to implement only part of the spec. – PaulProgrammer Aug 27 '20 at 06:28
  • Can you please share the article URL and the response? – Shakeel Aug 28 '20 at 18:15
  • @Shakeel for example, take this article URL - https://timesofindia.indiatimes.com/business/india-business/logistics-it-media-professionals-most-anxious-about-returning-to-work-survey/articleshow/77479303.cms or any other TOI article; it will give a null publish date in the response object. – rohan sawant Aug 31 '20 at 10:56
  • Yes, they are null values for me as well. Two options: either search the TOI forums, or use another field such as a polling date as a workaround. – Shakeel Sep 01 '20 at 19:57

1 Answer

The current version of Newspaper cannot extract the publish date from the Times of India HTML, because the date sits inside a script tag. You can extract it with requests and BeautifulSoup; the latter is installed along with Newspaper. I also noted that the keywords are in a meta tag, and Newspaper does not extract these either, so I added some code to pull the keywords as well. Hopefully the code below helps you query Times of India articles. Please let me know if you have any questions.

import requests
import re as regex
from newspaper import Article
from newspaper.utils import BeautifulSoup

base_url = 'https://timesofindia.indiatimes.com/business/india-business/govt-working-to-reduce-e-vehicle-tax-niti-aayog-ceo/articleshow/78210495.cms'

raw_html = requests.get(base_url)
soup = BeautifulSoup(raw_html.text, 'html.parser')

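# parse date published
# note: the script index is hard-coded, so this breaks if the page layout changes
data = soup.find_all('script')[1]
find_date = regex.search(r'datePublished.{3}\d{4}-\d{2}-\d{2}', data.string)
# the match is expected to look like: datePublished":"2020-09-20
date_published = find_date.group().split('"')[2]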

# parse other elements using Newspaper,
# reusing the HTML already fetched above instead of downloading it again
article = Article('')
article.download(raw_html.content)
article.parse()
article_tags = article.tags
article_content = article.text
article_title = article.title

# parse keywords
article_meta_data = article.meta_data
article_keywords = sorted({value for (key, value) in article_meta_data.items() if key == 'keywords'})
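
The hard-coded script index and the unguarded regex match above can fail if the page layout changes. A slightly more defensive variant might look like the sketch below. It uses only the standard library and rests on two assumptions that are not confirmed by the page itself: that the date appears somewhere in the page source as a "datePublished": "YYYY-MM-DD..." field, and that the keywords meta value is a single comma-separated string. The helper names extract_date_published and split_keywords, and the sample markup, are hypothetical.

```python
import re
from datetime import date
from typing import List, Optional

# Assumed field shape: "datePublished": "2020-09-20T10:12:00+05:30"
DATE_PUBLISHED_RE = re.compile(r'"datePublished"\s*:\s*"(\d{4})-(\d{2})-(\d{2})')


def extract_date_published(html: str) -> Optional[date]:
    """Return the first datePublished date found anywhere in the page source."""
    match = DATE_PUBLISHED_RE.search(html)
    if match is None:
        return None
    year, month, day = (int(part) for part in match.groups())
    try:
        return date(year, month, day)
    except ValueError:
        # The digits matched but do not form a real calendar date.
        return None


def split_keywords(raw_keywords: str) -> List[str]:
    """Split a comma-separated keywords string into a sorted, de-duplicated list."""
    return sorted({part.strip() for part in raw_keywords.split(',') if part.strip()})


# Hypothetical sample markup, not copied from a real Times of India page.
sample_html = """
<html><head>
<script>var unrelated = 1;</script>
<script type="application/ld+json">
{"@type": "NewsArticle", "datePublished": "2020-09-20T10:12:00+05:30"}
</script>
</head><body></body></html>
"""

print(extract_date_published(sample_html))                # 2020-09-20
print(extract_date_published('<html></html>'))            # None
print(split_keywords('tax, e-vehicle, niti aayog, tax'))  # ['e-vehicle', 'niti aayog', 'tax']
```

With the code above, these helpers could be called as extract_date_published(raw_html.text) and split_keywords(article.meta_data.get('keywords', '')), assuming the keywords value is a string. Searching the whole source avoids depending on the position of the script tag, and returning None keeps a missing date from raising an exception.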
Life is complex