I'm using the Python pdfminer library to extract both text and images from a PDF. Since the TextConverter class writes to sys.stdout by default, I used StringIO to capture the text in a variable, as follows:

# Python 2, as used by the original pdfminer
from StringIO import StringIO

from pdfminer.converter import TextConverter
from pdfminer.image import ImageWriter
from pdfminer.layout import LAParams
from pdfminer.pdfinterp import PDFPageInterpreter, PDFResourceManager
from pdfminer.pdfpage import PDFPage

def extractTextAndImagesFromPDF(rawFile):
    laparams = LAParams()
    imagewriter = ImageWriter('extractedImageFolder/')
    resourceManager = PDFResourceManager(caching=True)

    outfp = StringIO()  # Use StringIO to catch the output later.
    device = TextConverter(resourceManager, outfp, codec='utf-8', laparams=laparams, imagewriter=imagewriter)
    interpreter = PDFPageInterpreter(resourceManager, device)
    for page in PDFPage.get_pages(rawFile, set(), maxpages=0, caching=True, check_extractable=True):
        interpreter.process_page(page)
    device.close()
    extractedText = outfp.getvalue()  # Get the text from the StringIO
    outfp.close()
    return extractedText

This works fine for the extracted text. The function also extracts the images in the PDF and writes them to 'extractedImageFolder/'. That works fine too, but I now want the images to be written to file objects instead of to the file system, so that I can do some post-processing on them.

The ImageWriter class opens a file (fp = file(path, 'wb')) and then writes to it. What I would like is for my extractTextAndImagesFromPDF() function to also return a list of file objects, instead of writing the images directly to disk. I guess I need to use StringIO for that as well, but I don't know how, partly because the writing happens inside pdfminer.

Does anybody know how I can return a list of file objects instead of writing the images to the file system? All tips are welcome!

kramer65

1 Answer

Here is a hack: patch pdfminer's export_image (the ImageWriter method that opens the file) so that you can supply a file pointer of your own to write to:

    # add an optional argument to supply your own file pointer
    def export_image(self, image, fp=None):
        ...
        # change this line:
        # fp = file(path, 'wb')
        # to:
        caller_supplied = fp is not None  # remember who owns fp
        if not caller_supplied:
            fp = file(path, 'wb')
        ...
        # and this line:
        # return name
        # to (only callers that passed fp get the tuple):
        return (fp, name) if caller_supplied else name

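Now you would need to use:

import StringIO  # Python 2

# create a file-like object backed by a string buffer
fp = StringIO.StringIO()
# export_image is a method, so call it on your patched ImageWriter instance
image_fp, name = imagewriter.export_image(image, fp)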

and your image should be stored in fp.
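One caveat when passing an in-memory buffer: if the elided part of export_image closes fp after writing (an assumption, so check the pdfminer source you are patching), the contents of a plain buffer can no longer be read afterwards. A minimal sketch of a workaround follows. KeepOpenBytesIO and write_and_close are hypothetical names used only for illustration, and write_and_close merely stands in for code that writes and then closes.

```python
import io


class KeepOpenBytesIO(io.BytesIO):
    """Hypothetical helper: an in-memory buffer that survives close()."""

    def close(self):
        # Ignore close() so the caller can still read the data afterwards;
        # call really_close() to release the buffer.
        pass

    def really_close(self):
        io.BytesIO.close(self)


def write_and_close(fp, data):
    """Stand-in for code that writes to a file object and then closes it."""
    fp.write(data)
    fp.close()


buf = KeepOpenBytesIO()
write_and_close(buf, b"fake image bytes")
image_data = buf.getvalue()  # still readable, because close() was ignored
buf.really_close()
```

With a plain io.BytesIO the same getvalue() call would raise a ValueError, because the buffer is discarded once it is closed.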

Note that the behaviour of export_image, wherever it is called elsewhere without fp, remains the same.
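If patching pdfminer is not an option, a different approach (not the hack above) might be to pass your own object in place of ImageWriter. The sketch below assumes that the converter only ever calls imagewriter.export_image(image), and that each image exposes name and stream.get_rawdata(). Both are assumptions about pdfminer internals that should be verified against the installed version. InMemoryImageWriter, FakeStream and FakeImage are hypothetical names used only for illustration.

```python
import io


class InMemoryImageWriter(object):
    """Hypothetical stand-in for ImageWriter that keeps images in memory."""

    def __init__(self):
        self.images = []  # list of (name, file-like object) pairs

    def export_image(self, image):
        # Assumption: image.stream.get_rawdata() returns the stream's raw,
        # still-encoded bytes; decoding is left to the post-processing step.
        buf = io.BytesIO(image.stream.get_rawdata())
        self.images.append((image.name, buf))
        return image.name


# Fakes so that the sketch runs without pdfminer installed.
class FakeStream(object):
    def __init__(self, data):
        self._data = data

    def get_rawdata(self):
        return self._data


class FakeImage(object):
    def __init__(self, name, data):
        self.name = name
        self.stream = FakeStream(data)


writer = InMemoryImageWriter()
writer.export_image(FakeImage("Im0", b"\xff\xd8 fake jpeg bytes"))
name, fp = writer.images[0]
```

In extractTextAndImagesFromPDF() you would then pass imagewriter=InMemoryImageWriter() to TextConverter and return its images list alongside extractedText. Whether the raw bytes are directly usable depends on how each image is encoded in the PDF.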

Reut Sharabani
  • Thank you! There are two problems, though. First, I would need to fork the pdfminer lib, and I would prefer to do this without forking. Second, if I do decide to fork pdfminer, `export_image()` could return the fp, but `export_image()` is only called in `receive_layout()` (https://github.com/euske/pdfminer/blob/master/pdfminer/converter.py#L180), which in turn is called in `end_page()` (https://github.com/euske/pdfminer/blob/master/pdfminer/converter.py#L50), which in turn is called in `process_page()` (https://github.com/euske/pdfminer/blob/master/pdfminer/pdfinterp.py#L840). – kramer65 Dec 15 '14 at 15:38
  • This makes it pretty complicated. So my follow-up questions: would you have any idea how I could return the files without forking pdfminer? And if not, would you have any pointers on how I could easily get the list of images without having to edit a zillion lines of code? – kramer65 Dec 15 '14 at 15:41
  • Your problem is that pdfminer creates the file on its side. If you don't have access to the data the way you want it, you either save to a file and load that data later on, or change the API (fork). I'll think of a **hack** (which will always be just a hack) and if I come up with something I'll let you know... – Reut Sharabani Dec 15 '14 at 15:44
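The "save to a file and load that data later on" route from the last comment can be sketched with only the standard library: let the images be written to a temporary folder, read them back into in-memory file objects, then delete the folder. load_files_into_memory is a hypothetical helper name, and the dummy file below merely stands in for whatever pdfminer would write.

```python
import io
import os
import shutil
import tempfile


def load_files_into_memory(folder):
    """Return a list of (filename, BytesIO) pairs for every file in folder."""
    loaded = []
    for filename in sorted(os.listdir(folder)):
        path = os.path.join(folder, filename)
        if os.path.isfile(path):
            with open(path, "rb") as fp:
                loaded.append((filename, io.BytesIO(fp.read())))
    return loaded


# Demonstration with a dummy file standing in for an extracted image.
tmp_dir = tempfile.mkdtemp()
try:
    with open(os.path.join(tmp_dir, "img0.bmp"), "wb") as fp:
        fp.write(b"dummy image bytes")
    images = load_files_into_memory(tmp_dir)
finally:
    shutil.rmtree(tmp_dir)  # the in-memory copies outlive the folder
```

In the question's function, tmp_dir would take the place of 'extractedImageFolder/' when constructing ImageWriter, and load_files_into_memory(tmp_dir) would be called after device.close().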