Apologies if this is a stupid question - I'm new to Python and am more familiar with Excel VBA.
I am trying to have Python loop through multiple article URLs stored in an Excel workbook and create a summary of each article. The goal is to export the article titles, summaries, and URLs to a new Excel workbook (or a different tab). (The ultimate goal is to scrape for relevant news and summarize it, but I'm working towards that!)
However, I'm having trouble getting newspaper's Article class to read the URL that is passed in from the list I create. When I print the URL, it looks exactly as it would if I had copy-pasted it and set url = 'the copy pasted value'. When I call Article on that URL, though, it does not appear to read the URL correctly. The URLs are stored in a list as strings. I'm not sure what I might be doing wrong. Any help would be appreciated!!
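To illustrate one thing I suspect might matter, here is a minimal standard-library sketch (it does not use newspaper at all) of what happens to a URL when literal quote characters are added around it, as my loop does with "'" + url + "'". The example URL and the variable names are made up for illustration.

```python
from urllib.parse import urlparse

# Hypothetical URL, standing in for a value read from the spreadsheet.
plain_url = "https://example.com/news/story"

# Same value with literal quote characters added around it.
quoted_url = "'" + plain_url + "'"

# print() shows the quote characters because they are now part of the string.
print(plain_url)
print(quoted_url)

# The plain string parses as an https URL; the quoted one no longer
# has a recognisable scheme or host.
plain_parts = urlparse(plain_url)
quoted_parts = urlparse(quoted_url)
print(plain_parts.scheme, plain_parts.netloc)
print(repr(quoted_parts.scheme), repr(quoted_parts.netloc))
```

If the quotes are the culprit, passing the list value straight through (without adding them) would presumably avoid this, but I have not confirmed how newspaper handles such input.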
# Import the libraries
import nltk
from newspaper import Article
import openpyxl
# import the URLs from the Excel
from openpyxl import load_workbook
wb = load_workbook(r'C:\Users\Python\RunPythonScript.xlsm') # Work Book
ws = wb.get_sheet_by_name('URLs') # Work Sheet
column = ws['A'] # Column
column_list = [column[x].value for x in range(len(column))] # create a list
url_list = list(filter(None, column_list)) # remove blanks
url_list.pop(0) # remove title
# start loop
x = 0
while x < len(url_list):
    url = str("'" + url_list[x] + "'") # set url
    article = Article(url) # Get the article ### seems to be where error is ###
    print(article)
    x = x + 1 # move to next url
I get the following output from Python:
<newspaper.article.Article object at 0x07DADB38>
<newspaper.article.Article object at 0x0A698670>
<newspaper.article.Article object at 0x07DADB38>
<newspaper.article.Article object at 0x0A698670>
<newspaper.article.Article object at 0x07DADB38>
<newspaper.article.Article object at 0x0A698670>
<newspaper.article.Article object at 0x07DADB38>
<newspaper.article.Article object at 0x0A698670>
<newspaper.article.Article object at 0x07DADB38>
<newspaper.article.Article object at 0x0A698670>
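As far as I can tell, these lines have the shape of Python's default object representation rather than an error message. A minimal sketch with a made-up class (`Plain` is hypothetical and has nothing to do with newspaper) produces the same kind of text:

```python
class Plain:
    """Hypothetical class with no __str__ or __repr__ of its own."""

    def __init__(self, url):
        self.url = url


item = Plain("https://example.com/news/story")

# With no custom __str__/__repr__, print() falls back to the default
# "<module.ClassName object at 0x...>" form.
text = str(item)
print(text)

# The data itself has to be read from the object's attributes.
print(item.url)
```

So printing the object may simply not be the way to see the article's contents.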
Instead of printing the article's content, each iteration prints only an object reference, so it seems to be failing on the URL.
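For comparison, here is a rough sketch of what I think the loop may need to do, assuming newspaper's `download()`, `parse()` and `nlp()` steps have to run before `title` and `summary` are filled in. The helper names `clean_url`, `summarize` and `summarize_all` are my own, the example URL is made up, and `nlp()` also assumes the NLTK 'punkt' data is installed. I have not verified this against my workbook.

```python
def clean_url(value):
    """Return a cell value as a bare URL string, without stray quotes or spaces."""
    return str(value).strip().strip("'\"").strip()


def summarize(url):
    """Download one article and return (title, summary, url).

    Assumes the newspaper package is installed and the URL is reachable.
    """
    from newspaper import Article  # imported here so clean_url works without it

    bare_url = clean_url(url)
    article = Article(bare_url)
    article.download()  # fetch the HTML
    article.parse()     # extract the title and body text
    article.nlp()       # build the summary (needs NLTK 'punkt' data)
    return article.title, article.summary, bare_url


def summarize_all(url_list):
    """Return one (title, summary, url) tuple per URL in the list."""
    return [summarize(url) for url in url_list]


# The cleaning step can be checked without any network access.
print(clean_url("'https://example.com/news/story'"))
```

Calling `summarize_all(url_list)` with the list read from column A would then give rows that could be written to a new sheet.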
Any insights? Thanks in advance!!