
I've been trying to pass a list of URLs to extract articles from the pages. Extraction (with newspaper) works just fine if I build an actual list of URLs by hand (e.g. `lista = ['http://www.zeit.de', ...]`). Taking the list from another file does not work, however, even though printing the list works. The following is the code:

import io
import newspaper
from newspaper import Article
import pickle

lista = ['http://www.zeit.de',
         'http://www.guardian.co.uk',
         'http://www.zeit.de',
         'http://www.spiegel.de']

apple = 0

while apple < 4:

    banana = lista[apple]
    orange = "file_" + str(apple) + ".txt"

    first_article = Article(url=banana, language='de')
    first_article.download()
    first_article.parse()

    print(first_article.text).encode('cp850', errors='replace')

    with io.open(orange, 'w', encoding='utf-8') as f:
        f.write(first_article.text)

    apple += 1

The above MCVE works fine. When I unpickle my list, printing it to console works as I expect, for example with this script:

import pickle
import io

lista = pickle.load( open( "save.p", "rb" ) )    
print lista
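
The dump side isn't shown here; the list was saved earlier with the standard pickle idiom, something like this hypothetical sketch (same `save.p` filename assumed):

import pickle

# Hypothetical dump side of the round trip: however lista was built,
# writing it out to save.p would look roughly like this
pickle.dump(lista, open("save.p", "wb"))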

A sample of the list output looks like this:

['www.zeit.de/1998/51/Psychokrieg_mit_Todesfolge', 'www.zeit.de/1998/51/Raffgierig',
 'www.zeit.de/1998/51/Runter_geht_es_schnell', 'www.zeit.de/1998/51/Runter_mit_den_Zinsen_',
 'www.zeit.de/1998/51/SACHBUCH', 'www.zeit.de/1998/51/Schwerer_Irrtum',
 'www.zeit.de/1998/51/Silvester_mit_Geist', 'www.zeit.de/1998/51/Tannen_ohne_Nachwuchs',
 'www.zeit.de/1998/51/This_is_Mu_hen', 'www.zeit.de/1998/51/Tuechtig', 'www.zeit.de/1998/51/Ungelehrig']

but there are thousands of URLs in the list.

The error message shown doesn't tell me much (full traceback below):

Traceback (most recent call last):
  File "C:\Python27\lib\site-packages\newspaper\parsers.py", line 53, in fromstring
    cls.doc = lxml.html.fromstring(html)
  File "C:\Python27\lib\site-packages\lxml\html\__init__.py", line 706, in fromstring
    doc = document_fromstring(html, parser=parser, base_url=base_url, **kw)
  File "C:\Python27\lib\site-packages\lxml\html\__init__.py", line 600, in document_fromstring
    value = etree.fromstring(html, parser, **kw)
  File "lxml.etree.pyx", line 3032, in lxml.etree.fromstring (src\lxml\lxml.etree.c:68121)
  File "parser.pxi", line 1786, in lxml.etree._parseMemoryDocument (src\lxml\lxml.etree.c:102470)
  File "parser.pxi", line 1667, in lxml.etree._parseDoc (src\lxml\lxml.etree.c:101229)
  File "parser.pxi", line 1035, in lxml.etree._BaseParser._parseUnicodeDoc (src\lxml\lxml.etree.c:96139)
  File "parser.pxi", line 582, in lxml.etree._ParserContext._handleParseResultDoc (src\lxml\lxml.etree.c:91290)
  File "parser.pxi", line 683, in lxml.etree._handleParseResult (src\lxml\lxml.etree.c:92476)
  File "parser.pxi", line 633, in lxml.etree._raiseParseError (src\lxml\lxml.etree.c:91939)
XMLSyntaxError: None

I've been trying to fix this for hours but I just haven't found a way. Any help would be greatly appreciated.

  • Please show the whole traceback. – Kevin Jan 16 '15 at 15:27
  • You are aware that the line `print(first_article.text).encode('cp850', errors='replace')` is wrong, no? You are encoding the return value of the function `print` - why would you want to do that? – fnl Jan 16 '15 at 15:37
  • It seems to me that your description does not match your question title, and the trouble is not with importing the pickled list. If `lista = pickle.load( open( "save.p", "rb" ) ); print lista` gives the output you expect, then the pickle load has worked. It seems more likely there's a problem with what's actually in the list. Can you give a bit more detail (e.g. two or three lines of output from the `print lista` after loading the pickled list, and the full traceback)? Also consider changing the title of your question. – J Richard Snape Jan 16 '15 at 15:51
  • I've added the full traceback and a sample from the list. The encoding part worked for printing German text in the console, so I'm reluctant to change that. Thanks for the quick answers, I'm really desperate. – blub123 Jan 16 '15 at 16:17
  • It's the URLs in the list, not the pickling, that are the problem - see answer below. At that point it's not strictly a programming issue (we probably can't debug a library here). There is probably non-standards-compliant HTML in the page at the link you're pointing to (common in 1998 and still a problem today). – J Richard Snape Jan 16 '15 at 16:22
  • Actually - as you say you're desperate - I've just looked, and for the 1998 article there is no HTML in the article object after download. OK - got it - you need to prepend http:// to all your URLs. Hang on a minute and I'll edit my answer. – J Richard Snape Jan 16 '15 at 16:29
  • @fnl Interestingly, `print(first_article.html).encode('cp850', errors='replace')` works. I guess the print function's return value is the string it prints... – J Richard Snape Jan 16 '15 at 16:37
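
(On that last point: the line behaves that way under Python 2 because `print` is a statement there, so the parentheses simply group the expression that gets printed. It parses as shown below.)

# Under Python 2, 'print' is a statement, so this line parses as
#   print ((first_article.text).encode('cp850', errors='replace'))
# i.e. the text is encoded first, then the encoded bytes are printed
print(first_article.text).encode('cp850', errors='replace')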

1 Answer


If I understand correctly, the object you're trying to pickle / unpickle is the list of URLs in the lista variable. I'll make that assumption. Therefore, if

lista = pickle.load( open( "save.p", "rb" ) ) 
print lista 

gives the output you expect, then the pickle load has worked. It seems more likely there's a problem with what's actually in the list.
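
A couple of extra checks make that more convincing than eyeballing thousands of entries (a minimal sketch, assuming the same `save.p`):

import pickle

with open("save.p", "rb") as f:
    lista = pickle.load(f)

# If the unpickle worked, these should all look sane
print(type(lista))   # <type 'list'>
print(len(lista))    # thousands, per the question
print(lista[:3])     # first few entries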

Having a quick look at the newspaper.py code on GitHub, we see that the error is thrown by this line:

 cls.doc = lxml.html.fromstring(html)

Looking at `Article.parse()` - that calls the `Parser.fromstring()` method - so it is a good bet that the problem occurs when you call `first_article.parse()`.
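
You can reproduce an error of this shape without newspaper at all, by handing lxml an empty document to parse (a sketch; the exact exception class and message vary with the lxml version):

import lxml.html

# If the download produced no HTML, newspaper ends up parsing an
# empty string, and lxml refuses with a parse error much like the
# 'XMLSyntaxError: None' in the traceback above
lxml.html.fromstring(u'')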

That bit of code in the newspaper library carries the comment

 # Enclosed in a `try` to prevent bringing the entire library
 # down due to one article (out of potentially many in a `Source`)

So I think it's highly likely that it's a problem with one of the (first?) articles in your list imported from the pickle.
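
If you want to confirm that, one option is to loop over the unpickled list and collect the failures rather than dying on the first one. This is a hypothetical helper, not part of newspaper, built only from the `Article` calls already shown above:

from newspaper import Article

def find_bad_urls(urls):
    # Collect URLs that newspaper either raises on or returns no text
    # for (depending on the version, parse failures may be caught and
    # logged inside the library rather than raised)
    bad = []
    for u in urls:
        try:
            article = Article(u, language='de')
            article.download()
            article.parse()
            if not article.text:
                bad.append(u)
        except Exception:
            bad.append(u)
    return bad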

EDIT:

Following the OP's question edit: the problem is with the first URL in your list, as suspected. Try it manually in a console, e.g.

url = 'www.zeit.de/1998/51/Psychokrieg_mit_Todesfolge'
first_article = Article(url, language='de')
first_article.download()
first_article.parse()

it gives the same (or similar) error. However, if I try a different article - e.g. http://www.zeit.de/community/2015-01/suizid-tochter-entscheidung-akzeptieren, the above code works fine and I see the text if I type

print(first_article.text.encode('cp850', errors='replace'))

So, it's the contents that are at fault, not the pickle. For some reason, the lxml and newspaper libraries cannot parse that article. I can't say why; it looks OK when I go to it in a browser.
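
One way to see where it goes wrong is to inspect the article between `download()` and `parse()` - the `html` attribute holds whatever the download returned (a quick diagnostic sketch):

from newspaper import Article

url = 'www.zeit.de/1998/51/Psychokrieg_mit_Todesfolge'
first_article = Article(url, language='de')
first_article.download()

# With the scheme-less URL the download comes back empty, which is
# what makes the subsequent parse() blow up inside lxml
print(len(first_article.html))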

EDIT2:

Following the comment discussion with the OP: the problem is that the URLs in the unpickled list start with `www.` rather than `http://www.`. To solve that:

import pickle

lista = pickle.load(open("save.p", "rb"))

for u in lista:
    # Prepend the scheme when it's missing, otherwise use as-is
    if u.startswith('www'):
        url = 'http://' + u
    else:
        url = u

    # Carry on with your url processing here
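
Equivalently, if you'd rather normalise the whole list up front (same assumption: every entry either starts with `www` or already carries a scheme):

lista = ['http://' + u if u.startswith('www') else u for u in lista]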