Can't convert PDF to text in Python, even after trying pdfminer, pdf2txt, and textract

Asked Jun 21 '16 at 18:09

Active Jun 21 '16 at 18:34


I'm having trouble extracting text from PDF files that were originally converted from InDesign and Illustrator. I'm working on a project that needs data from these files. I have tried the pdfminer and pdf2txt libraries in Python, but neither works in this case. For regular PDFs they work perfectly. For these particular files, however, the output is just blank spaces. Could anyone help me out with this? Thanks.

edited Jun 21 '16 at 18:34

asked Jun 21 '16 at 18:09

Nhi Tran

Sounds like you've tried quite a few things. That's great. Please post them here and it's possible someone could point out where you've gone astray. – Matt Cremeens Jun 21 '16 at 18:23
Thanks Matt! I suspect my PDF files contain images rather than real text. If so, I need to extract the images from the PDF and then extract text from those images. I'm not sure that's the right way to do it, though. – Nhi Tran Jun 21 '16 at 18:47
That's usually done with OCR. Something like pypdfocr or ocrmypdf should do it. – Nicolai Kant Jun 22 '16 at 07:27
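
To make the OCR suggestion above concrete, here is a minimal sketch rather than a tested solution for these particular files. It assumes that pdfminer.six and the ocrmypdf command-line tool (with Tesseract) are installed, and that the pages hold rasterized or outlined text instead of a usable text layer. The function names, the `min_chars` threshold, and the output file names are all hypothetical choices made for illustration.

```python
import shutil
import subprocess


def looks_blank(text, min_chars=20):
    """Return True when the text has fewer than min_chars non-whitespace characters."""
    visible = "".join(text.split())
    return len(visible) < min_chars


def extract_text_layer(pdf_path):
    """Read the embedded text layer with pdfminer.six (imported lazily)."""
    from pdfminer.high_level import extract_text
    return extract_text(pdf_path)


def build_ocr_command(pdf_path, out_pdf, sidecar_txt):
    """Build an ocrmypdf call that rasterizes every page, runs OCR on it,
    and writes the recognized text to a sidecar .txt file."""
    return ["ocrmypdf", "--force-ocr", "--sidecar", sidecar_txt, pdf_path, out_pdf]


def pdf_to_text(pdf_path, out_pdf="ocr_output.pdf", sidecar_txt="ocr_output.txt"):
    """Try the text layer first; fall back to OCR if it is effectively empty."""
    text = extract_text_layer(pdf_path)
    if not looks_blank(text):
        return text
    if shutil.which("ocrmypdf") is None:
        raise RuntimeError("ocrmypdf is not installed or not on PATH")
    subprocess.run(build_ocr_command(pdf_path, out_pdf, sidecar_txt), check=True)
    with open(sidecar_txt, encoding="utf-8") as handle:
        return handle.read()
```

Calling `pdf_to_text("brochure.pdf")` (a placeholder file name) would return the normal pdfminer output for regular PDFs and only invoke OCR when that output is essentially whitespace, which matches the symptom described in the question. OCR quality depends on the fonts and resolution, so the result may still need manual checking.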

0 Answers