
I am working on a project which requires converting PDF to text. The PDF contains Hindi fonts (Mangal, to be specific) along with English.

100% of the English is getting converted to text. The conversion of the Hindi part is around 95%. The remaining 5% of the Hindi text either comes out blank or as something like " ा". I could figure out that the accented (combining) characters are not getting converted to text properly.

I am using the following command:

pdftotext -enc UTF-8 pdfname.pdf textname.txt
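
To get a rough count of how often the " ा" symptom occurs, one option is to run the same command from a short script and look for Devanagari combining marks that have no base character in front of them. This is only a minimal sketch, assuming pdftotext is on the PATH and using the file names from the question; the whitespace-before-a-combining-mark check is just a heuristic for the dropped characters:

import subprocess
import unicodedata

# Run the same conversion as above (file names from the question).
subprocess.run(["pdftotext", "-enc", "UTF-8", "pdfname.pdf", "textname.txt"], check=True)

with open("textname.txt", encoding="utf-8") as f:
    text = f.read()

# A dependent vowel sign such as U+093E ( ा) is a combining mark (category "Mn" or "Mc").
# It should always follow a consonant, so one preceded by whitespace (or at the very
# start of the text) suggests the extractor dropped the base character.
orphans = [
    (i, ch)
    for i, ch in enumerate(text)
    if unicodedata.category(ch) in ("Mn", "Mc")
    and (i == 0 or text[i - 1].isspace())
]
print(f"{len(orphans)} combining marks with no base character")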

The PDF uses the following fonts:

name                 type          emb  sub  uni
ZDPKEY+Mangal        CID TrueType  yes  yes  yes
Mangal               TrueType      no   no   no
Helvetica-Bold       Type 1        no   no   no
CODUBM+Mangal-Bold   CID TrueType  yes  yes  yes
Mangal-Bold          TrueType      no   no   no
Times-Roman          Type 1        no   no   no
Helvetica            Type 1        no   no   no

The following is the result of the conversion. The left side is the original PDF; the right side is the text opened in Notepad:

http://preview.tinyurl.com/qbxud9o

My question is whether the 5% missing/junk characters can be correctly captured in text with open-source packages. Would appreciate your inputs!

Dian
  • Is it a scanned PDF? Are you sure the missing characters are present in the PDF file as text? Maybe the OCR didn't detect those characters in the first place. – Samik Sep 08 '15 at 19:33
  • Hi Samik: It is not a scanned PDF. It is a "generated" PDF. All characters are present in the PDF; I can copy them and paste them into Notepad. – Dian Sep 09 '15 at 13:37

1 Answer


Change your command to:

pdftotext -enc "UTF-8" pdfname.pdf textname.txt

It has worked for me; it should work for you as well.
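
If you want to check whether the quoted form actually changes anything for the problem characters, one option is to diff the new output against your earlier extraction. A minimal sketch; textname_old.txt is assumed to be a copy of the output from your original command:

import difflib

with open("textname_old.txt", encoding="utf-8") as f:
    old = f.readlines()
with open("textname.txt", encoding="utf-8") as f:
    new = f.readlines()

# Show only the lines that differ between the two extractions.
for line in difflib.unified_diff(old, new, fromfile="textname_old.txt", tofile="textname.txt"):
    print(line, end="")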

Pavan Pyati