What articles does the newspaper package of Python return?

Question

My basic question is: how does the newspaper package in Python determine which URLs/articles it returns? One would think it simply returns all of the article links found at the URL you provide, but it doesn't seem to work that way. For example, building from "http://www.cnn.com" and from "https://www.cnn.com/politics" returns exactly the same articles. I would expect the latter to return only the articles on the politics page, but that does not seem to be the case.

So what is it actually doing? Is it just getting all of the articles from the homepage?

Here's the example I used to test this (Python 3.6.2):

import newspaper

# Build a newspaper source from the CNN homepage
url = "http://www.cnn.com"
paper = newspaper.build(url, memoize_articles=False)
article_list = [article.url for article in paper.articles]

# Build a newspaper source from the CNN politics page
url = "https://www.cnn.com/politics"
paper = newspaper.build(url, memoize_articles=False)
article_list_2 = [article.url for article in paper.articles]

# Print the total number of URLs returned for each source
print(len(article_list))
print(len(article_list_2))

I can't reproduce your results. `http://www.cnn.com` returns 846 URLs, `http://www.cnn.com/politics` returns 21 (and `https://www.cnn.com/politics` returns 0, as does `https://www.cnn.com`). — Jongware, Feb 11 '18 at 00:10
May I ask what version of Python you're using? It's interesting that you're getting different results with the same code. — r1234, Feb 11 '18 at 00:24
Python 3.6, with a mint fresh install of `newspaper3k-0.2.6`. — Jongware, Feb 11 '18 at 00:52
Is this possibly an environment difference then? What other reason would there be for different output from the same code? For me it does not matter whether I use http or https either. As long as the root website (cnn, fox, whatever) is the same, the number of URLs returned is the same for me. — r1234, Feb 11 '18 at 02:48

score 2 · Answer 1 · answered Mar 22 '18 at 07:08

The Python newspaper package, which is meant for article scraping and curation, returns only the articles from the home page.

import newspaper
news_paper = newspaper.build('http://nypost.com', memoize_articles=False)
print(news_paper.size())
for article in news_paper.articles:
    print(article.url)

This prints all of the article URLs from the home page. I also tested it with CNN ('https://edition.cnn.com').
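If you only want the articles for one section, such as politics in the question, one option is to filter the returned URLs yourself. The following is a minimal sketch, assuming the section name appears as a segment of the article URL path. `filter_by_section` is a hypothetical helper, not part of newspaper, and the sample URLs are made up for illustration.

```python
from urllib.parse import urlparse


def filter_by_section(urls, section):
    """Keep only URLs whose path contains `section` as a whole path segment.

    Hypothetical helper for illustration; it is not part of newspaper.
    """
    wanted = section.strip("/").lower()
    kept = []
    for url in urls:
        # Split the path into non-empty, lowercased segments.
        segments = [s.lower() for s in urlparse(url).path.split("/") if s]
        if wanted in segments:
            kept.append(url)
    return kept


# Illustrative URLs, made up for this sketch (not real newspaper output).
sample_urls = [
    "https://www.cnn.com/2018/02/10/politics/example-story/index.html",
    "https://www.cnn.com/2018/02/10/sport/example-match/index.html",
    "https://www.cnn.com/politics/live-news/example/index.html",
]

politics_urls = filter_by_section(sample_urls, "politics")
print(len(politics_urls))
```

With a real source, you would pass `[article.url for article in news_paper.articles]` in place of `sample_urls`. Matching on whole path segments rather than a plain substring avoids keeping URLs that merely mention the word somewhere in a headline slug.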
