Python Newspaper function not reading Article URL during loop?

Question

Apologies if this is a stupid question - I'm new to Python and am more familiar with Excel VBA.

I am trying to have Python loop through multiple article URLs stored in an Excel workbook and create a summary of each one. The goal is to export the article titles, summaries, and URLs to a new Excel workbook (or a different tab). (The ultimate goal is to scrape for relevant news and summarize it, but I'm working towards that!)

I'm having issues, however, getting the Newspaper Article function to read the URL passed in from the list I create. When I print the URL, it looks exactly as it would if I had copied and pasted it and set url = 'the copy-pasted value'. When I run 'Article' functions on that URL, though, it does not appear to read the URL correctly. The URLs are stored in a list as strings. I'm not sure what I might be doing wrong. Any help would be appreciated!!

# Import the libraries
import nltk
from newspaper import Article
import openpyxl

# import the URLs from the Excel
from openpyxl import load_workbook
wb = load_workbook(r'C:\Users\Python\RunPythonScript.xlsm')  # Work Book
ws = wb.get_sheet_by_name('URLs')  # Work Sheet
column = ws['A']  # Column
column_list = [column[x].value for x in range(len(column))] # create a list
url_list = list(filter(None, column_list)) # remove blanks
url_list.pop(0) # remove title

# start loop
x = 0
while x < len(url_list):


   url = str("'" + url_list[x] + "'") # set url  
   article = Article(url) # Get the article ### seems to be where error is ###
   print(article)

   x = x + 1 # move to next url

I get the following output from python:

<newspaper.article.Article object at 0x07DADB38>
<newspaper.article.Article object at 0x0A698670>
<newspaper.article.Article object at 0x07DADB38>
<newspaper.article.Article object at 0x0A698670>
<newspaper.article.Article object at 0x07DADB38>
<newspaper.article.Article object at 0x0A698670>
<newspaper.article.Article object at 0x07DADB38>
<newspaper.article.Article object at 0x0A698670>
<newspaper.article.Article object at 0x07DADB38>
<newspaper.article.Article object at 0x0A698670>

Instead of printing the article, it seems to be erroring out on the URL.

Any insights? Thanks in advance!!

What's the error? The output is the string representation of the Article objects. - luis.parravicini, Apr 13 '20 at 22:00
Never used that library before, but this seems to be the documentation for it: https://newspaper.readthedocs.io/en/latest/ Check it out and just print the data you need from each article? - luis.parravicini, Apr 13 '20 at 22:02
The command runs (guess it is not erroring out specifically), but the output should be the URL's article text? - Mondy77, Apr 13 '20 at 22:03
"it seems to be erroring out on the URL." What makes you say that? As an aside, that while loop should almost certainly be a for loop, either over the list itself or using range. - AMC, Apr 14 '20 at 00:20

score 0 · Answer 1 · answered Apr 13 '20 at 22:11

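When you call print() on an object, Python builds a string representation of it by calling the object's __str__ method (falling back to __repr__). The <newspaper.article.Article object at 0x...> lines in your output are that default representation, not an error.

If you need to print some data from the Article, for example its url, do: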

print(article.url)

More information on Article here: https://newspaper.readthedocs.io/en/latest/
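
To see why the output looks the way it does, here is a minimal sketch that does not use newspaper at all. PlainArticle is a hypothetical stand-in class (not part of any library) that, like the objects in the question's output, defines no string representation of its own:

```python
class PlainArticle:
    """Hypothetical stand-in: defines neither __str__ nor __repr__."""

    def __init__(self, url):
        self.url = url


article = PlainArticle("https://example.com/story")

# str() falls back to the default object representation,
# which is what print(article) would display.
default_text = str(article)
print(default_text)

# Printing an attribute shows the actual data instead.
print(article.url)
```

The first print shows a "<... object at 0x...>" line much like the ones in the question; the second shows the URL itself.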

score 0 · Answer 2 · answered Apr 13 '20 at 22:49

The documentation https://newspaper.readthedocs.io/en/latest/ is pretty clear.

It seems you have to modify your code to something like this:

...
while x < len(url_list):


   url = url_list[x] # set url (use it as-is; adding literal quote characters would corrupt it)
   article = Article(url)
   article.download()
   article.parse()
   print(article.authors)
   print(article.publish_date)
   print(article.text)
   print(article.top_image)  
   # And so on...

   x = x + 1 # move to next url
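
As the comments suggest, the loop could also be written as a for loop over the list. The following is only a sketch under a few assumptions: collect_articles and its article_factory parameter are hypothetical names introduced here (the parameter exists so the loop can be tried with a fake object and no network access), and stripping stray quote characters is a guess at what the original str("'" + ... + "'") line would otherwise leave in the URL. By default it uses newspaper's Article with the same download() and parse() calls shown above.

```python
def collect_articles(urls, article_factory=None):
    """Download and parse each URL; return a list of (title, url, text) tuples.

    article_factory is a hypothetical hook for testing; when omitted,
    newspaper's Article class is used.
    """
    if article_factory is None:
        # Imported lazily so the function can be defined without newspaper installed.
        from newspaper import Article
        article_factory = Article

    rows = []
    for raw_url in urls:
        if not raw_url:
            continue  # skip blank cells
        # Remove surrounding whitespace and any stray quote characters.
        url = str(raw_url).strip().strip("'\"")
        article = article_factory(url)
        article.download()  # fetch the page
        article.parse()     # extract title, text, and so on
        rows.append((article.title, url, article.text))
    return rows
```

Calling collect_articles(url_list) with the list built from the worksheet would return rows that can then be written to a new sheet, one tuple per row.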
