I want to parse some PDF files that contain text and may or may not contain images. I want to extract the text as a string for further processing and save the images as JPEG, PNG, or any other image format. Which module is best suited for this?
1 Answer
2
pdfminer will get your text. pdfrw (disclaimer: I am the author of pdfrw) has examples that find images and dump them to separate pages, as well as examples that split PDFs into separate pages, so you could easily extract all the images to separate PDFs. If you then run Inkscape in headless mode (e.g. from the subprocess module), it can read in each PDF and write it out in a different format.
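The Inkscape step could be sketched roughly like this. It is only a sketch under assumptions: it assumes Inkscape 1.x, whose command line accepts `--export-type` and `--export-filename` (older 0.9x releases used different flags), and the helper names `build_inkscape_command` and `convert_with_inkscape` are made up for illustration. Inkscape can export PNG but not JPEG, so a JPEG would need a further conversion step.

```python
import shutil
import subprocess
from pathlib import Path


def build_inkscape_command(pdf_path, out_path, executable="inkscape"):
    """Build an Inkscape 1.x command line that converts a PDF to another format.

    Hypothetical helper: the export type is taken from the output extension.
    """
    out_path = Path(out_path)
    export_type = out_path.suffix.lstrip(".").lower()  # e.g. "png" or "svg"
    if not export_type:
        raise ValueError("output path needs a file extension such as .png")
    return [
        executable,
        str(pdf_path),
        f"--export-type={export_type}",      # assumes Inkscape 1.x flags
        f"--export-filename={out_path}",
    ]


def convert_with_inkscape(pdf_path, out_path):
    """Run Inkscape from the command line; raises CalledProcessError on failure."""
    if shutil.which("inkscape") is None:
        raise FileNotFoundError("inkscape executable not found on PATH")
    subprocess.run(build_inkscape_command(pdf_path, out_path), check=True)
```

For example, `convert_with_inkscape("image_page.pdf", "image_page.png")` would be called once per single-page PDF produced by the splitting step.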

Patrick Maupin
- The pdfminer documentation says Python 3 is not supported. Is that so? http://www.unixuser.org/~euske/python/pdfminer/ – Kamrul Khan Sep 20 '15 at 21:11
- I think there is a separate pdfminer3k version. Also, PyPDF2 has some extraction features. – Patrick Maupin Sep 20 '15 at 21:12
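
As a rough illustration of the PyPDF2 route on Python 3, text extraction might look like the sketch below. It assumes PyPDF2 2.0 or later, where `PdfReader`, `reader.pages`, and `page.extract_text()` are available; the helper names `join_page_text` and `extract_pdf_text` are hypothetical, and extraction quality depends on how the PDF was produced.

```python
def join_page_text(pages):
    """Concatenate the text of page-like objects that expose extract_text()."""
    # extract_text() may return None or "" for pages without extractable text.
    return "\n".join((page.extract_text() or "") for page in pages)


def extract_pdf_text(pdf_path):
    """Return the text of every page in the PDF as one string."""
    # Imported lazily so join_page_text works without PyPDF2 installed.
    from PyPDF2 import PdfReader  # assumes PyPDF2 >= 2.0

    reader = PdfReader(pdf_path)
    return join_page_text(reader.pages)
```

Calling `extract_pdf_text("report.pdf")` (a placeholder filename) would return a single string ready for further processing.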