My basic question is: how does the newspaper package in Python determine which URLs/articles it returns? One would think it simply returns all of the article links contained on the URL you provide, but it doesn't seem to work that way. For example, if you use "http://www.cnn.com" and "https://www.cnn.com/politics", you get exactly the same articles back. I would expect the latter to return only articles from the politics page, but that does not seem to be the case.

So what is it actually doing? Is it just getting all of the articles from the homepage?

Here's the example I used to test this (Python 3.6.2):

import newspaper

# Build a newspaper source from the CNN homepage
url = "http://www.cnn.com"
paper = newspaper.build(url, memoize_articles=False)
article_list = [article.url for article in paper.articles]

# Build a newspaper source from the CNN politics page
url = "https://www.cnn.com/politics"
paper = newspaper.build(url, memoize_articles=False)
article_list_2 = [article.url for article in paper.articles]

# Print the total number of URLs returned for each page
print(len(article_list))
print(len(article_list_2))
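Equal counts alone don't prove that the two lists hold the same URLs, so a stricter check is to compare them as sets. The following is only a minimal sketch: `summarize_overlap` and `collect_urls` are hypothetical helper names of my own, not part of newspaper, and `collect_urls` simply wraps the same `newspaper.build` call used above.

```python
def summarize_overlap(urls_a, urls_b):
    """Count URLs shared by both lists and URLs unique to each one."""
    set_a, set_b = set(urls_a), set(urls_b)
    return {
        "shared": len(set_a & set_b),
        "only_a": len(set_a - set_b),
        "only_b": len(set_b - set_a),
    }


def collect_urls(site_url):
    """Hypothetical wrapper around the build call used in the question."""
    import newspaper  # imported lazily so the comparison helper works without it

    paper = newspaper.build(site_url, memoize_articles=False)
    return [article.url for article in paper.articles]


if __name__ == "__main__":
    homepage_urls = collect_urls("http://www.cnn.com")
    politics_urls = collect_urls("https://www.cnn.com/politics")
    print(summarize_overlap(homepage_urls, politics_urls))
```

If "only_a" and "only_b" both come out as 0, the two builds really did return the same set of articles, not just the same number of them.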
r1234
  • I can't reproduce your results. `http://www.cnn.com` returns 846 URLs, `http://www.cnn.com/politics` returns 21 (and `https://www.cnn.com/politics` returns 0, as does `http://www.cnn.com`). – Jongware Feb 11 '18 at 00:10
  • May I ask what version of python you're using? That is interesting you're getting different results with the same code. – r1234 Feb 11 '18 at 00:24
  • Python 3.6, with a mint fresh install of `newspaper3k-0.2.6`. – Jongware Feb 11 '18 at 00:52
  • Is this possibly an environment difference then? What other reason would we get different output from the same code? For me it does not matter if I use http/https either. As long as the root website (cnn, fox, whatever) is the same, the number of urls returned is the same for me. – r1234 Feb 11 '18 at 02:48
  • @usr2564301, I get same results (i.e. 851) for both cases – Istiaque Ahmed Feb 15 '18 at 20:06

1 Answer

The Python newspaper package, which is meant for article scraping and curation, returns only the home page's articles.

import newspaper
news_paper = newspaper.build('http://nypost.com', memoize_articles=False)
print(news_paper.size())
for article in news_paper.articles:
    print(article.url)

This prints all of the article URLs from the home page. I also tested it with CNN ('https://edition.cnn.com').
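If only one section's articles are wanted, one possible workaround is to filter the returned URLs afterwards. This is a rough sketch and not a feature of newspaper: `urls_in_section` is a hypothetical helper, the sample URLs are made up for illustration, and it assumes the section name appears as a segment of the URL path, which may not hold for every site.

```python
from urllib.parse import urlparse


def urls_in_section(urls, section):
    """Keep URLs whose path contains `section` as a whole path segment."""
    wanted = section.strip("/").lower()
    kept = []
    for url in urls:
        # Split the path into non-empty, lowercased segments
        segments = [s.lower() for s in urlparse(url).path.split("/") if s]
        if wanted in segments:
            kept.append(url)
    return kept


# Made-up URLs standing in for the article.url values from a build
sample = [
    "https://www.cnn.com/2018/02/10/politics/example-story/index.html",
    "https://www.cnn.com/2018/02/10/sport/example-match/index.html",
    "https://www.cnn.com/politics",
]
print(urls_in_section(sample, "politics"))
```

In practice the list passed in would be the URLs collected from `news_paper.articles` in the loop above.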

Sudhanshu Dev