Newspaper python cache issue, every call same output

Question

I use the newspaper module (https://github.com/codelucas/newspaper) to download bitcoin articles from https://news.bitcoin.com/. But when I try to get the articles from the next page, 'https://news.bitcoin.com/page/2', I get the same output. The same happens for any other page.

I have tried different sites and different starting pages. The articles from the first link I used were displayed for all the other links.

import newspaper

url = 'https://news.bitcoin.com/page/2'
btc_articles = newspaper.build(url, memoize_articles=False)

for article in btc_articles.articles:
    print(article.url)

From the documentation, try using `import newspaper3k`. Also, if I'm correct, it scrapes or parses a single URL, which in your case is the page you are seeing. You will need to add code of your own to get the next articles. — Xion, Jan 23 '19 at 17:48

score 1 · Answer 1 · answered Jan 24 '19 at 05:46

The newspaper library tries to scrape the whole website, not just the link you pass in. This means that you shouldn't have to loop through all the pages to get the articles. However, as you might have noticed, the library doesn't find all the articles anyway.

The reason seems to be that it doesn't identify all pages as categories (and doesn't find the feed). See below (the output was the same regardless of which page was used):

import newspaper

url = 'https://news.bitcoin.com/'
btc_paper = newspaper.build(url, memoize_articles=False)

print('Categories:', [category.url for category in btc_paper.categories])
print('Feeds:', [feed.url for feed in btc_paper.feeds])

Output:

Categories: ['https://news.bitcoin.com/page/2', 'https://news.bitcoin.com']
Feeds: []

This seems to be a bug in the code (or bad website design on news.bitcoin.com's part, depending on how you look at it), as you noted in your issue report: https://github.com/codelucas/newspaper/issues/670.
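Until that is resolved, one possible workaround is to collect the links from each listing page yourself and hand them to newspaper one at a time. The following is only a rough sketch, not something the library documents for this case. It assumes the listing pages follow the pattern `<base>/page/<n>` (as in the question) and that every same-host link is worth trying. In practice you would still need to filter out navigation and tag links. The names `LinkCollector`, `extract_links`, `listing_page_urls`, `fetch_html` and `parse_articles` are made up for this example; only `newspaper.Article`, `download()` and `parse()` come from the library.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse
from urllib.request import Request, urlopen


class LinkCollector(HTMLParser):
    """Collect absolute, same-host link targets from one HTML page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.host = urlparse(base_url).netloc
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if not href:
            return
        # Resolve relative links and drop any #fragment.
        absolute = urljoin(self.base_url, href).split("#")[0]
        # Keep same-host links only, without duplicates, in page order.
        if urlparse(absolute).netloc == self.host and absolute not in self.links:
            self.links.append(absolute)


def extract_links(html, base_url):
    """Return the de-duplicated same-host links found in `html`."""
    collector = LinkCollector(base_url)
    collector.feed(html)
    return collector.links


def listing_page_urls(base_url, last_page):
    """Assumed pattern: page 1 is the base URL, later pages are <base>/page/<n>."""
    base = base_url.rstrip("/")
    return [base + "/"] + [f"{base}/page/{n}" for n in range(2, last_page + 1)]


def fetch_html(url):
    """Download one page with the standard library (some sites reject the default user agent)."""
    request = Request(url, headers={"User-Agent": "Mozilla/5.0"})
    with urlopen(request, timeout=10) as response:
        return response.read().decode("utf-8", errors="replace")


def parse_articles(urls):
    """Download and parse each URL with newspaper, yielding (url, title)."""
    import newspaper  # third-party; installed with `pip install newspaper3k`

    for url in urls:
        article = newspaper.Article(url)
        article.download()
        article.parse()
        yield article.url, article.title


if __name__ == "__main__":
    seen = []
    for page in listing_page_urls("https://news.bitcoin.com/", 3):
        for link in extract_links(fetch_html(page), page):
            if link not in seen:
                seen.append(link)
    for url, title in parse_articles(seen):
        print(url, "|", title)
```

This sidesteps newspaper's category detection entirely, so it trades the library's site-wide crawl for explicit control over which pages are visited.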
