
I am trying to use pdftotext to convert .pdf files to text for further processing in Python, but I have run into the following problem:

It works for some .pdf files, but for others the output is wrong and looks like this:

(0)

(0)

(0)
(0)
(0)
(0)

000 0000000 0000000000 0000000 00000 000 00
000000000 0000 0000 0000000 00000000000 00000000
000000 000 0000000 000000.
000 000000 0000000 00000000 0000000 0 00000
00000 00 0000000 000000.

Looking at it, each 0 seems to stand for exactly one character of the original text.

So my question is: what could be going wrong, and how can I fix the output of pdftotext?

ziky90
  • I'd need to see the file to be sure, but it looks like you have a file where the text has been re-encoded, and there is no ToUnicode CMap. Text from such a file cannot be extracted. – KenS May 13 '15 at 12:58
  • Yes, you're right, I have a .pdf without a ToUnicode CMap. So there is no way to go other than OCR? Is that right? – ziky90 May 13 '15 at 13:08
  • You can check whether OCR is the only option by opening the PDF in Adobe Reader, selecting the text, and copying it to the clipboard. If even Adobe Reader cannot decode the text, then the only way to recover it is OCR. – Eugene May 13 '15 at 17:48
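
The manual check suggested in the comments can be approximated in a batch pipeline. The following is only a rough sketch, not a definitive fix: it assumes the `pdftotext` command-line tool is on the PATH and emits UTF-8, and the names `looks_garbled` and `extract_text`, as well as the 0.8 threshold and the 20-character minimum, are hypothetical choices made for illustration. It flags output in which a single character (such as `0`) makes up nearly all letters and digits, so those files can be routed to OCR instead.

```python
import subprocess
from collections import Counter


def looks_garbled(text, threshold=0.8, min_chars=20):
    """Return True if one character dominates the letters and digits in text.

    threshold and min_chars are illustrative assumptions, not tuned values.
    """
    chars = [c for c in text if c.isalnum()]
    if len(chars) < min_chars:
        return False  # too little text to judge either way
    _, count = Counter(chars).most_common(1)[0]
    return count / len(chars) >= threshold


def extract_text(pdf_path):
    """Run pdftotext on pdf_path; return the text, or None if it looks garbled."""
    result = subprocess.run(
        ["pdftotext", pdf_path, "-"],  # "-" writes the text to stdout
        capture_output=True,
        check=True,
    )
    # Assumes UTF-8 output; undecodable bytes are replaced rather than raising.
    text = result.stdout.decode("utf-8", errors="replace")
    return None if looks_garbled(text) else text
```

A caller could treat a `None` result from `extract_text` as the signal to fall back to OCR for that file. The heuristic only detects the symptom shown in the question; it cannot recover the missing character mapping.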

0 Answers