Can't find publish_date with newspaper3k

Question

I want to scrape an article from a website with the newspaper library (newspaper3k). However, it doesn't find the article's publish date, which is in div.source-date in the page source, or the authors (or rather the source), which is in div.delfi-source-name. How can I scrape the date and the author/source?

Website/URL example: https://www.delfi.lt/en/politics/foreign-ministry-tsikhanouskayas-consultation-needed-for-treating-belarusians-in-lithuania.d?id=91531501

My code:

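import pandas as pd
from newspaper import Article

# URL of the example article given above
url = "https://www.delfi.lt/en/politics/foreign-ministry-tsikhanouskayas-consultation-needed-for-treating-belarusians-in-lithuania.d?id=91531501"

article = Article(url)
article.download()
article.parse()
article.nlp()  # not required for the fields collected below

# article.publish_date and article.authors come back empty for this page
df = pd.DataFrame([{'Title': article.title, 'Author': article.authors, 'Text': article.text,
                    'published_date': article.publish_date, 'Source': article.source_url}])

df.to_excel('Delfi-1.xlsx')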

Any suggestions?

score 1 · Accepted Answer · answered Oct 22 '22 at 12:26

The date appears in two places in the page source. The visible one, Wednesday, October 19, 2022, is in a div tag that newspaper3k cannot parse without using BeautifulSoup.
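For that first location, a minimal BeautifulSoup sketch might look like the following. The class names are the ones given in the question. The function name extract_date_and_source and the sample HTML are hypothetical stand-ins; with a real article you would pass article.html (filled in by article.download()) instead.

```python
from bs4 import BeautifulSoup


def extract_date_and_source(html):
    """Return (date_text, source_text) from the two div classes named in the question.

    Either value is None when the matching div is not present.
    """
    soup = BeautifulSoup(html, "html.parser")
    date_tag = soup.find("div", class_="source-date")
    source_tag = soup.find("div", class_="delfi-source-name")
    date_text = date_tag.get_text(strip=True) if date_tag else None
    source_text = source_tag.get_text(strip=True) if source_tag else None
    return date_text, source_text


# Hypothetical fragment standing in for article.html
sample_html = """
<div class="source-date">Wednesday, October 19, 2022</div>
<div class="delfi-source-name">Example Source</div>
"""

print(extract_date_and_source(sample_html))
```

The date comes back as display text, so it still needs parsing if you want a datetime value. The site's markup may also change, which is why the function returns None rather than raising when a div is missing.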

The second date is in the meta tags, which newspaper3k can read with some additional code:

from newspaper import Config
from newspaper import Article
from newspaper.article import ArticleException

USER_AGENT = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:78.0) Gecko/20100101 Firefox/78.0'

config = Config()
config.browser_user_agent = USER_AGENT
config.request_timeout = 10

base_url = 'https://www.delfi.lt/en/politics/foreign-ministry-tsikhanouskayas-consultation-needed-for-treating-belarusians-in-lithuania.d?id=91531501'
try:
    article = Article(base_url, config=config)
    article.download()
    article.parse()
    # meta_data is a nested dict built from the page's <meta> tags
    article_meta_data = article.meta_data

    # Open Graph title (og:title)
    article_title = [value['title'] for (key, value) in article_meta_data.items() if key == 'og']
    print(article_title)

    # Publish time from the cXenseParse:recs:publishtime meta tag
    article_published_date = [value['recs']['publishtime'] for key, value in article_meta_data.items()
                              if key == 'cXenseParse']
    print(article_published_date)

    # Open Graph description (og:description)
    article_description = [value['description'] for (key, value) in article_meta_data.items() if key == 'og']
    print(article_description)

except ArticleException as error:
    # Raised by newspaper3k when the download or parse step fails
    print(error)

Output

["Foreign Ministry: Tsikhanouskaya's consultation needed for treating Belarusians in Lithuania"]
['2022-10-19T11:38:07+0300']
["As Belorus, a Belarus-owned sanatorium in Lithuania's southern resort of Druskininkai, complaints over the fact that Lithuania fails to issue visas to Belarusian citizens, forcing the sanatorium to fire a quarter of its staff, Lithuania's Foreign Ministry suggests coordinating the list of arrivals with Belarusian opposition leaders Sviatlana Tsikhanouskaya's office in Vilnius."]
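The publish time in the output is a string with a UTC offset, not a datetime. If you want a real datetime in the DataFrame, a sketch along these lines should work with the standard library alone. The function name parse_publishtime is hypothetical, and the format string assumes the site always emits the shape shown in the output. As far as I know, pandas refuses to write timezone-aware datetimes to Excel, so the last step drops the offset before to_excel.

```python
from datetime import datetime


def parse_publishtime(raw):
    """Parse a string such as '2022-10-19T11:38:07+0300' into an aware datetime."""
    return datetime.strptime(raw, "%Y-%m-%dT%H:%M:%S%z")


published = parse_publishtime("2022-10-19T11:38:07+0300")
print(published.isoformat())  # 2022-10-19T11:38:07+03:00

# Keep the local wall-clock time but drop the offset so the value can go into Excel
published_naive = published.replace(tzinfo=None)
print(published_naive.isoformat())  # 2022-10-19T11:38:07
```

If you would rather keep the offset, store published.isoformat() as text in the spreadsheet instead.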

P.S. Newspaper3k has multiple ways to extract the publish dates from articles. Take a look at this document that I wrote on how to use Newspaper3k.
