
I am working on a project which requires converting PDF to text. The PDF contains Hindi fonts (Mangal, to be specific) along with English.

100% of the English is getting converted to text. The conversion of the Hindi part is around 95%. The remaining 5% of the Hindi text either comes out blank or as something like " ा". I could figure out that the accented (combining) characters are not getting converted to text properly.

I am using the following command:

pdftotext -enc UTF-8 pdfname.pdf textname.txt
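
To get a rough count of how often the " ा" symptom occurs, one option is to run the same command from a short script and look for Devanagari combining marks that have no base character in front of them. This is only a minimal sketch, assuming pdftotext is on the PATH and using the file names from the question; the whitespace-before-a-combining-mark check is just a heuristic for the dropped characters:

import subprocess
import unicodedata

# Run the same conversion as above (file names from the question).
subprocess.run(["pdftotext", "-enc", "UTF-8", "pdfname.pdf", "textname.txt"], check=True)

with open("textname.txt", encoding="utf-8") as f:
    text = f.read()

# A dependent vowel sign such as U+093E ( ा) is a combining mark (category "Mn" or "Mc").
# It should always follow a consonant, so one preceded by whitespace (or at the very
# start of the text) suggests the extractor dropped the base character.
orphans = [
    (i, ch)
    for i, ch in enumerate(text)
    if unicodedata.category(ch) in ("Mn", "Mc")
    and (i == 0 or text[i - 1].isspace())
]
print(f"{len(orphans)} combining marks with no base character")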

The PDF uses the following fonts:

name                 type          emb  sub  uni
ZDPKEY+Mangal        CID TrueType  yes  yes  yes
Mangal               TrueType      no   no   no
Helvetica-Bold       Type 1        no   no   no
CODUBM+Mangal-Bold   CID TrueType  yes  yes  yes
Mangal-Bold          TrueType      no   no   no
Times-Roman          Type 1        no   no   no
Helvetica            Type 1        no   no   no

The following is the result of the conversion. The left side is the original PDF; the right side is the text opened in Notepad:

http://preview.tinyurl.com/qbxud9o

My question is whether the 5% missing/junk characters can be correctly captured in text with open-source packages. Would appreciate your inputs!

Dian
  • Is it a scanned PDF? Are you sure the missing characters are present in the PDF file as text? Maybe the OCR didn't detect those characters in the first place. – Samik Sep 08 '15 at 19:33
  • Hi Samik: It is not a scanned PDF. It is a "generated" PDF. All characters are present in the PDF; I can copy them and paste them into Notepad. – Dian Sep 09 '15 at 13:37

1 Answer


Change your command to:

pdftotext -enc "UTF-8" pdfname.pdf textname.txt

It has worked for me; it should work for you as well.
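
If you want to check whether the quoted form actually changes anything for the problem characters, one option is to diff the new output against your earlier extraction. A minimal sketch; textname_old.txt is assumed to be a copy of the output from your original command:

import difflib

with open("textname_old.txt", encoding="utf-8") as f:
    old = f.readlines()
with open("textname.txt", encoding="utf-8") as f:
    new = f.readlines()

# Show only the lines that differ between the two extractions.
for line in difflib.unified_diff(old, new, fromfile="textname_old.txt", tofile="textname.txt"):
    print(line, end="")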

Pavan Pyati