
I have a list of URLs saved in a .txt file and I would like to feed them, one at a time, to a variable named url to which I apply methods from the newspaper3k python library. The program extracts the URL content, authors of the article, a summary of the text, etc, then prints the info to a new .txt file. The script works fine when you give it one URL as user input, but what should I do in order to read from a .txt with thousands of URLs?

I am only beginning with Python (as a matter of fact, this is my first script), so I first tried simply writing url = (myfile.txt), but I realized this wouldn't work because I have to read the file one line at a time. I then tried to apply read() and readlines() to it, but that failed because 'str' object has no attribute 'read' or 'readlines'. What should I use to read those URLs saved in a .txt file, each on its own line, as the input of my simple script? Should I convert the string to something else?

Extract from the code, lines 1-18:

from newspaper import Article
from newspaper import fulltext
import requests


url = input("Article URL: ")
a = Article(url, language='pt')
html = requests.get(url).text
text = fulltext(html)
download = a.download()
parse = a.parse()
nlp = a.nlp()
title = a.title
publish_date = a.publish_date
authors = a.authors
keywords = a.keywords
summary = a.summary

Later I built some functions to display the info in a desired format and save it to a new .txt. I know this is a very basic question, but I am honestly stuck... I have read other similar questions here but I couldn't properly understand or apply the suggestions. So, what is the best way to read URLs from a .txt file in order to feed them, one at a time, to the url variable, to which other methods are then applied to extract its content?

This is my first question here and I understand the forum is aimed at more experienced programmers, but I would really appreciate some help. If I need to edit or clarify something in this post, please let me know and I will correct it immediately.

2 Answers


Here is one way you could do it:

from newspaper import Article
from newspaper import fulltext
import requests

with open('myfile.txt', 'r') as f:
    for line in f:
        # do not forget to strip the trailing newline
        url = line.rstrip("\n")
        a = Article(url, language='pt')
        html = requests.get(url).text
        text = fulltext(html)
        download = a.download()
        parse = a.parse()
        nlp = a.nlp()
        title = a.title
        publish_date = a.publish_date
        authors = a.authors
        keywords = a.keywords
        summary = a.summary
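
The file-reading step can also be tried on its own, without any scraping. The following is only a minimal sketch, assuming one URL per line; `read_urls` and the example.com addresses are made-up names for illustration and are not part of newspaper3k. It also shows why calling read() on the filename fails: the filename is a plain string, and only the object returned by open() can be read.

```python
import os
import tempfile


def read_urls(path):
    """Return the non-empty lines of a text file, stripped of whitespace."""
    urls = []
    # open() returns a file object; the filename string itself has no read().
    with open(path, "r", encoding="utf-8") as f:
        for line in f:
            url = line.strip()  # drops the trailing "\n" and stray spaces
            if url:             # skips blank lines
                urls.append(url)
    return urls


# Small self-check using a temporary file with made-up example URLs.
with tempfile.NamedTemporaryFile(
    "w", suffix=".txt", delete=False, encoding="utf-8"
) as tmp:
    tmp.write("https://example.com/a\n\n  https://example.com/b  \n")
    tmp_path = tmp.name

urls = read_urls(tmp_path)
os.remove(tmp_path)
print(urls)
```

Each element of the returned list could then be passed to the loop body in place of `url`.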
matt__chv
  • Thanks, I am going to test it out. Appreciate your time answering this. – AlmiranteAlcasetzer Jan 06 '19 at 18:45
  • `a = Article(url, language='pt')` and `html = requests.get(url).text` will show an error in my text editor, `undefined name 'url'`. Executing the program as you have described throws an error, `NameError: name 'url' is not defined` – AlmiranteAlcasetzer Jan 06 '19 at 18:50
  • indeed, had not seen the 2nd url in requests, edited my answer accordingly – matt__chv Jan 06 '19 at 18:53
  • fixed the typo on line 6 – matt__chv Jan 06 '19 at 19:09
  • Unfortunately, it throws the same error as the other answer here... `AttributeError: 'NoneType' object has no attribute 'xpath'`. I don't quite get why the script works perfectly when I input a URL as in the original snippet I've posted, but it is so hard to read each line in the .txt and feed it to the `url` variable. – AlmiranteAlcasetzer Jan 06 '19 at 19:19
  • This is why I have been stuck with this simple problem for hours... I've read answers to similar questions here in the forum and I've tried methods similar to the one you provided, but I don't quite get why they won't work. I don't think it is a problem with the module itself, because it works well when I provide only one URL (as in the original snippet). I have another script that scrapes all the URLs in a given website and saves them to a .txt. But when I try to feed the .txt to the script that scrapes the content of those URLs, those problems appear. – AlmiranteAlcasetzer Jan 06 '19 at 19:23

This could help you:

url_file = open('myfile.txt','r')
for url in url_file.readlines():
   print(url)
url_file.close()

You can apply it to your code as follows:

from newspaper import Article
from newspaper import fulltext
import requests

url_file = open('myfile.txt','r')
for line in url_file.readlines():
  url = line.strip()  # drop the trailing newline that readlines() keeps
  a = Article(url, language='pt')
  html = requests.get(url).text
  text = fulltext(html)
  download = a.download()
  parse = a.parse()
  nlp = a.nlp()
  title = a.title
  publish_date = a.publish_date
  authors = a.authors
  keywords = a.keywords
  summary = a.summary
url_file.close()
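
One difference between a URL typed at the input() prompt and a URL read from a file is that lines coming from a file (via readlines() or by iterating over the file) still end with a newline character. The short sketch below only illustrates that difference; `io.StringIO` stands in for an open text file and the example.com addresses are made up. Whether this is what causes the error discussed in the comments is not established in this thread.

```python
import io

# io.StringIO stands in for an open text file holding two made-up URLs.
fake_file = io.StringIO("https://example.com/a\nhttps://example.com/b\n")

raw_lines = fake_file.readlines()
clean_lines = [line.strip() for line in raw_lines]

print(repr(raw_lines[0]))    # the newline is still attached
print(repr(clean_lines[0]))  # the bare string, as input() would return it
```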
Walid Da.
  • Great, going to test it out. Thanks for the time – AlmiranteAlcasetzer Jan 06 '19 at 18:27
  • Good, let me know if you face any problem – Walid Da. Jan 06 '19 at 18:29
  • Since I don't really want to print the links, but to feed them, one at a time, to the `url` variable, how should I adapt the code you provided? When I execute the one you provided (substituting print url for print(url)) I can print all the links in the command line, but the program returns the following error `AttributeError: 'NoneType' object has no attribute 'xpath' `. What am I doing wrong? – AlmiranteAlcasetzer Jan 06 '19 at 18:41
  • Yes, that was what I tried before with your first answer. Appreciate the editing though. But it gives me an `AttributeError: 'NoneType' object has no attribute 'xpath'` error. I don't really get why it works with the user input of one URL but it won't work when I try to read URLs from a .txt file. – AlmiranteAlcasetzer Jan 06 '19 at 18:56
  • The error is too big to post it all here, but the only error that refers to my script is `File "newscrapy.py", line 12, in text = fulltext(html)` The other errors refer to scripts that make up the `newspaper3k` module – AlmiranteAlcasetzer Jan 06 '19 at 19:10
  • Reading the URL from the file is done. That error is a separate problem. – Walid Da. Jan 06 '19 at 19:22
  • Hm, I understand. But why does it work when I input one URL manually (as in the original snippet, where I can scrape content from one given URL) and not when I try to read the URL from the .txt? Where should I look? Anyway, thanks for your time, really appreciate it. When I learn more about Python I will also answer desperate newbies here :) – AlmiranteAlcasetzer Jan 06 '19 at 19:28
  • I guess I am going to open an issue in newspaper3k github page, maybe it's a bug on their part. At least I feel better for being stuck in such a trivial problem... Tried many things I've read here and none would work. Anyway, thanks for your help and time. – AlmiranteAlcasetzer Jan 06 '19 at 19:34
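
With thousands of URLs, a single page that fails to parse would otherwise stop the whole run. The sketch below is one possible way to isolate such failures, not a diagnosis of the `xpath` error above, whose cause is not settled in this thread. `process_all` and `fake_handler` are hypothetical names introduced here; `fake_handler` merely simulates a failure so the example runs without network access.

```python
def process_all(urls, handler):
    """Call handler(url) for each URL; collect results and failures separately."""
    results = {}
    failures = {}
    for url in urls:
        try:
            results[url] = handler(url)
        except Exception as exc:  # broad on purpose: record the failure, keep going
            failures[url] = repr(exc)
    return results, failures


def fake_handler(url):
    """Hypothetical stand-in for the real per-article scraping code."""
    if "bad" in url:
        raise AttributeError("'NoneType' object has no attribute 'xpath'")
    return url.upper()


results, failures = process_all(
    ["https://example.com/ok", "https://example.com/bad"], fake_handler
)
print(results)
print(failures)
```

In the real script, the handler would wrap the `Article(...)`, `download()`, `parse()` and `nlp()` calls from the question, and `failures` would show which lines of the .txt are worth inspecting by hand.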