I am trying to get some data out of a PDF document using scraperwiki for Python. It works beautifully if I download the file using urllib2, like so:
pdfdata = urllib2.urlopen(url).read()
xmldata = scraperwiki.pdftoxml(pdfdata)
root = lxml.html.fromstring(xmldata)
pages = list(root)
But here comes the tricky part. I would like to do this for a large number of PDF files that I already have on disk, so I want to drop the download step in the first line and pass the local PDF file directly as an argument. However, if I try
pdfdata = open("filename.pdf","wb")
xmldata = scraperwiki.pdftoxml(pdfdata)
root = lxml.html.fromstring(xmldata)
I get the following error:
xmldata = scraperwiki.pdftoxml(pdfdata)
File "/usr/local/lib/python2.7/dist-packages/scraperwiki/utils.py", line 44, in pdftoxml
pdffout.write(pdfdata)
TypeError: must be string or buffer, not file
I am guessing this occurs because I am not opening the PDF correctly?
If so, is there a way to open a PDF from disk so that I get the same kind of data that urllib2.urlopen(url).read() returns?
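For reference, here is a minimal sketch of one possible direction, under stated assumptions. The traceback shows pdftoxml calling pdffout.write(pdfdata), which suggests it expects the file's contents as a string rather than a file object, just as urllib2.urlopen(url).read() returns the contents rather than the response object. Note also that mode "wb" opens the file for writing and truncates it, so reading needs "rb". The helper names read_pdf_bytes and iter_pdf_bytes are hypothetical and not part of scraperwiki.

```python
import glob
import os


def read_pdf_bytes(path):
    """Return the raw contents of the file at `path` as a byte string."""
    # "rb" opens for reading in binary mode; "wb" would truncate the file.
    with open(path, "rb") as f:
        return f.read()


def iter_pdf_bytes(directory):
    """Yield (path, contents) for each .pdf file in `directory` (hypothetical helper)."""
    for path in sorted(glob.glob(os.path.join(directory, "*.pdf"))):
        yield path, read_pdf_bytes(path)


# Intended use, mirroring the urllib2 version above (assumes pdftoxml
# accepts the byte string, as the traceback suggests):
# pdfdata = read_pdf_bytes("filename.pdf")
# xmldata = scraperwiki.pdftoxml(pdfdata)
# root = lxml.html.fromstring(xmldata)
```

The point of the sketch is only that the contents are read into memory first and the resulting string is what gets passed on, not the open file handle.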