I want to parse some PDF files that contain text and may or may not contain images. I want to extract the text as a string for further processing and save each image as JPEG, PNG, or any other image format. Which module would be best to work with?

Kamrul Khan

1 Answer

pdfminer will get your text. pdfrw (disclaimer: I am the author of pdfrw) has examples that will find images and dump them to separate pages, and also examples that will split PDFs into separate pages, so you could easily extract all the images to separate PDFs. If you then run Inkscape in headless mode (e.g. from the subprocess module), it can read in each of those PDFs and write it out in a different format.
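As a rough sketch of how the text and conversion steps might fit together: the code below assumes the pdfminer.six fork (for `pdfminer.high_level.extract_text`) and Inkscape 1.x command-line options (`--export-type`, `--export-filename`); older Inkscape releases use different flags. The function names and file names are made up for illustration, and the pdfrw step that isolates each image into its own PDF is not shown.

```python
import shutil
import subprocess
from pathlib import Path


def extract_text_from_pdf(pdf_path):
    """Return the text of a PDF as one string (assumes pdfminer.six is installed)."""
    # Imported lazily so the rest of this module works without pdfminer.six.
    from pdfminer.high_level import extract_text
    return extract_text(str(pdf_path))


def build_inkscape_command(pdf_path, out_path, export_type="png"):
    """Build the argument list for a headless conversion (Inkscape 1.x flags assumed)."""
    return [
        "inkscape",
        str(pdf_path),
        f"--export-type={export_type}",
        f"--export-filename={out_path}",
    ]


def convert_with_inkscape(pdf_path, out_path, export_type="png"):
    """Run Inkscape without a GUI to convert one PDF into another format."""
    if shutil.which("inkscape") is None:
        raise RuntimeError("inkscape was not found on PATH")
    # check=True raises CalledProcessError if Inkscape exits with an error.
    subprocess.run(build_inkscape_command(pdf_path, out_path, export_type), check=True)
    return Path(out_path)
```

With something like this in place, `extract_text_from_pdf("report.pdf")` would give you the string to process, and `convert_with_inkscape("image_page.pdf", "image_page.png")` would convert one of the single-image PDFs. Both file names are placeholders.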

Patrick Maupin