Using scraperwiki for pdf-file on disk

Question

I am trying to get some data out of a pdf document using scraperwiki for Python. It works beautifully if I download the file using urllib2 like so:

import urllib2
import lxml.html
import scraperwiki

pdfdata = urllib2.urlopen(url).read()
xmldata = scraperwiki.pdftoxml(pdfdata)
root = lxml.html.fromstring(xmldata)
pages = list(root)

But here comes the tricky part. As I would like to do this for a large number of pdf-files that I have on my disk, I would like to do away with the first line and pass the pdf file directly as an argument. However, if I try

pdfdata = open("filename.pdf","wb")
xmldata = scraperwiki.pdftoxml(pdfdata)
root = lxml.html.fromstring(xmldata)

I get the following error:

xmldata = scraperwiki.pdftoxml(pdfdata)
File "/usr/local/lib/python2.7/dist-packages/scraperwiki/utils.py", line 44, in pdftoxml
pdffout.write(pdfdata)
TypeError: must be string or buffer, not file

I am guessing that this occurs because I do not open the pdf correctly?

If so, is there a way to open a pdf from disk just like urllib2.urlopen() does?

score 0 · Accepted Answer · answered May 26 '15 at 16:57

urllib2.urlopen(...).read() does just that: it reads the contents of the stream returned for the URL you passed as a parameter.

open(), by contrast, returns a file object, not the file's contents. Just as urlopen() has to be followed by a read() call, a file object needs a read() call before you have the actual data.

Change your program to use the following lines:

with open("filename.pdf", "rb") as pdffile:
    pdfdata = pdffile.read()

xmldata = scraperwiki.pdftoxml(pdfdata)
root = lxml.html.fromstring(xmldata)

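This will open your pdf for reading in binary mode and read its contents into a string named pdfdata. Note the "rb" mode: opening the file with "wb", as in your attempt, opens it for writing and truncates it. From there your call to scraperwiki.pdftoxml() will work as expected.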
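Since the question mentions a large number of PDF files on disk, the same read step can be wrapped in a loop. The following is only a minimal sketch: the helper names read_pdf_bytes and iter_pdf_bytes are made up for illustration, and it assumes all the files sit directly inside one directory and end in ".pdf".

```python
import glob
import os


def read_pdf_bytes(path):
    """Return the raw contents of the file at `path` as a byte string."""
    # "rb" opens for reading in binary mode; "wb" would truncate the file.
    with open(path, "rb") as pdffile:
        return pdffile.read()


def iter_pdf_bytes(directory):
    """Yield (path, contents) for each *.pdf file directly inside `directory`."""
    # Hypothetical helper: sorted() only makes the processing order predictable.
    for path in sorted(glob.glob(os.path.join(directory, "*.pdf"))):
        yield path, read_pdf_bytes(path)
```

Each pdfdata value yielded this way could then be passed to scraperwiki.pdftoxml(pdfdata) exactly as in the lines above, for example inside "for path, pdfdata in iter_pdf_bytes("pdfs"):", where "pdfs" is a placeholder directory name.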

Dwight Spencer

Thanks. That solved the original problem. However, now I get a different error: `root = lxml.html.fromstring(xmldata) ... lxml.etree.XMLSyntaxError: None` (Full error text is too long for a comment). I am guessing this is unrelated to my original question, still any insight is much appreciated! – w_a_s May 26 '15 at 17:22
Code: http://pastebin.com/3LUpqQ84 Full error message: http://pastebin.com/WK74TS8B – w_a_s May 26 '15 at 17:39
Solved. This solution [http://stackoverflow.com/questions/24005988/lxml-not-working-with-django-scraperwiki] did the trick for some reason. – w_a_s May 27 '15 at 11:56