I've been trying to pass a list of URLs to newspaper to extract articles from the pages. Extraction (with newspaper) works just fine if I build the list of URLs by hand (e.g. lista = ['http://www.zeit.de', ...]). Loading the list from another file does not work, however, even though printing the loaded list works. The following is the code:
import io
import newspaper
from newspaper import Article
import pickle
lista = ['http://www.zeit.de',
'http://www.guardian.co.uk',
'http://www.zeit.de',
'http://www.spiegel.de']
apple = 0
while apple < len(lista):
    # Look up the URL and the output file name for the current index
    banana = lista[apple]
    orange = "file_" + str(apple) + ".txt"
    first_article = Article(url=banana, language='de')
    first_article.download()
    first_article.parse()
    # Python 2 print statement; characters the cp850 console cannot show are replaced
    print(first_article.text).encode('cp850', errors='replace')
    with io.open(orange, 'w', encoding='utf-8') as f:
        f.write(first_article.text)
    apple += 1
The above MCVE works fine. When I unpickle my list, printing it to the console works as I expect, for example with this script:
import pickle
import io
# Open the pickle file in a with block so it is closed after loading
with open("save.p", "rb") as f:
    lista = pickle.load(f)
print lista
A sample of the list output looks like this:
['www.zeit.de/1998/51/Psychokrieg_mit_Todesfolge', 'www.zeit.de/1998/51/Raffgierig',
'www.zeit.de/1998/51/Runter_geht_es_schnell', 'www.zeit.de/1998/51/Runter_mit_den_Zinsen_',
'www.zeit.de/1998/51/SACHBUCH', 'www.zeit.de/1998/51/Schwerer_Irrtum',
'www.zeit.de/1998/51/Silvester_mit_Geist', 'www.zeit.de/1998/51/Tannen_ohne_Nachwuchs',
'www.zeit.de/1998/51/This_is_Mu_hen', 'www.zeit.de/1998/51/Tuechtig', 'www.zeit.de/1998/51/Ungelehrig']
The full list contains thousands of URLs.
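One difference from the hand-built list is visible in this sample: the unpickled entries have no 'http://' prefix, while every URL in the working list does. Whether that is what triggers the failure is an assumption, not something the traceback confirms. A minimal sketch of normalizing the entries before passing them on (ensure_scheme is a hypothetical helper name, and 'http' is an assumed default scheme):

```python
def ensure_scheme(url, scheme="http"):
    """Return url unchanged if it already names a scheme, else prepend one.

    Hypothetical helper; 'http' as the default scheme is an assumption.
    """
    if "://" in url:
        return url
    return scheme + "://" + url.lstrip("/")


# Two entries: one as found in the unpickled sample, one from the working list
sample = ['www.zeit.de/1998/51/Raffgierig', 'http://www.spiegel.de']
fixed = [ensure_scheme(u) for u in sample]
```

The same list comprehension could be applied to lista right after unpickling, so the rest of the loop stays unchanged.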
Running the extraction on this list fails, and the error message doesn't tell me much (full traceback below):
Traceback (most recent call last):
  File "C:\Python27\lib\site-packages\newspaper\parsers.py", line 53, in fromstring
    cls.doc = lxml.html.fromstring(html)
  File "C:\Python27\lib\site-packages\lxml\html\__init__.py", line 706, in fromstring
    doc = document_fromstring(html, parser=parser, base_url=base_url, **kw)
  File "C:\Python27\lib\site-packages\lxml\html\__init__.py", line 600, in document_fromstring
    value = etree.fromstring(html, parser, **kw)
  File "lxml.etree.pyx", line 3032, in lxml.etree.fromstring (src\lxml\lxml.etree.c:68121)
  File "parser.pxi", line 1786, in lxml.etree._parseMemoryDocument (src\lxml\lxml.etree.c:102470)
  File "parser.pxi", line 1667, in lxml.etree._parseDoc (src\lxml\lxml.etree.c:101229)
  File "parser.pxi", line 1035, in lxml.etree._BaseParser._parseUnicodeDoc (src\lxml\lxml.etree.c:96139)
  File "parser.pxi", line 582, in lxml.etree._ParserContext._handleParseResultDoc (src\lxml\lxml.etree.c:91290)
  File "parser.pxi", line 683, in lxml.etree._handleParseResult (src\lxml\lxml.etree.c:92476)
  File "parser.pxi", line 633, in lxml.etree._raiseParseError (src\lxml\lxml.etree.c:91939)
XMLSyntaxError: None
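Since the traceback does not say which URL was being processed, one way to narrow things down with thousands of entries is to record every URL that raises instead of stopping at the first one. The following is only a sketch: collect_failures and the fetch parameter are hypothetical names, and fetch stands in for whatever function downloads and parses a single URL.

```python
def collect_failures(urls, fetch):
    """Call fetch(url) for each URL and return (results, failures).

    results maps each successful URL to whatever fetch returned;
    failures is a list of (url, error description) pairs.
    """
    results, failures = {}, []
    for url in urls:
        try:
            results[url] = fetch(url)
        except Exception as exc:  # broad on purpose: the goal is only to locate bad URLs
            failures.append((url, repr(exc)))
    return results, failures


def fake_fetch(url):
    # Stand-in for a real download, used only to demonstrate the helper
    if not url.startswith("http://"):
        raise ValueError("no scheme in " + url)
    return "text of " + url


results, failures = collect_failures(
    ['http://www.zeit.de', 'www.zeit.de/1998/51/Tuechtig'], fake_fetch)
```

In the real script, fetch could be a small function that builds Article(url=url, language='de'), calls download() and parse() as in the loop above, and returns the article text. Printing failures afterwards would then show which entries trigger the XMLSyntaxError.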
I've been trying to fix this for hours but I just haven't found a way. Any help would be greatly appreciated.