When I do

import newspaper
paper = newspaper.build('http://cnn.com', memoize_articles=False)
print(len(paper.articles))

I see that newspaper found 902 articles on http://cnn.com, which seems quite low to me, considering that CNN publishes many articles per day and has published articles online for many years. Are these really all the articles there are on http://cnn.com? If not, is there any way I can find the URLs of the rest of the articles too?

HelloGoodbye

1 Answer

Newspaper only queries the items on the main page of CNN; it does not crawl the category pages (e.g. business, health, etc.) on the domain. Based on the code below, Newspaper discovers only 698 unique article URLs as of today. Even some of those may be duplicates, because several URLs differ only by a hash fragment yet appear to point to the same article (the sketch after the code shows one way to normalize those).

P.S. You can query all the categories, but that requires coupling Selenium with Newspaper.

from newspaper import build

articles = []
urls_set = set()
cnn_articles = build('http://cnn.com', memoize_articles=False)
for article in cnn_articles.articles:
    # Only keep article URLs we have not seen before.
    if article.url not in urls_set:
        urls_set.add(article.url)
        articles.append(article.url)

print(len(articles))
# 698
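
To normalize the hash-fragment duplicates mentioned above, one option (a sketch, assuming the fragments carry no meaning of their own) is to strip the fragment with urllib.parse.urldefrag before deduplicating:

from urllib.parse import urldefrag
from newspaper import build

cnn_articles = build('http://cnn.com', memoize_articles=False)

unique_urls = set()
for article in cnn_articles.articles:
    # urldefrag splits 'http://cnn.com/story#anchor' into the bare URL and
    # its fragment; keeping only the bare URL collapses hash-only variants.
    url, _fragment = urldefrag(article.url)
    unique_urls.add(url)

print(len(unique_urls))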
Life is complex
  • How do I use Selenium to get all categories? And why do I need Selenium; isn't a tool like Beautiful Soup enough, or do I need to interact with the web page somehow? – HelloGoodbye Oct 02 '20 at 23:10
  • @HelloGoodbye I just looked at the category pages on CNN, and you should be able to use Beautiful Soup, because there are no buttons to click. I don't know your use case, but Beautiful Soup is embedded in Newspaper. I have a recent answer on this. – Life is complex Oct 02 '20 at 23:28
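
Following up on that last comment, here is a minimal sketch of the Selenium-free approach: use requests and BeautifulSoup (both assumed to be installed; the answer above does not name them) to collect candidate section links from the front page, then run Newspaper's build() on each section and pool the results. The one-level-path heuristic for spotting section URLs is my assumption, not something CNN's markup guarantees.

import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin
from newspaper import build

BASE = 'http://cnn.com'

# Fetch the front page and parse its links.
resp = requests.get(BASE, timeout=10)
soup = BeautifulSoup(resp.text, 'html.parser')

section_urls = set()
for link in soup.find_all('a', href=True):
    href = urljoin(BASE, link['href'])
    path = href.replace(BASE, '').strip('/')
    # Heuristic (my assumption): one-level paths such as /business or
    # /health look like section pages.
    if href.startswith(BASE) and path and '/' not in path and '#' not in path:
        section_urls.add(href)

# Build a Newspaper source per section and pool the unique article URLs.
article_urls = set()
for url in section_urls:
    source = build(url, memoize_articles=False)
    article_urls.update(article.url for article in source.articles)

print(len(article_urls))

Expect this to be slow, since build() downloads and parses each section page, and how many extra articles it finds depends entirely on what the front-page navigation links to.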