How can I extract text from a Hindi PDF file in Android?

Question

I am trying to read the content of a Hindi PDF. I am using the iText 7 library to read the PDF file.

It works fine for English-language PDFs and returns the exact characters. But when I try any Hindi (local-language) PDF, the extracted values are unreadable.

This is the unreadable text I am getting:

d d d daaaah h eeh h ee aaaa

Here is the sample code for reading the PDF page by page:

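import com.itextpdf.kernel.pdf.PdfDocument
import com.itextpdf.kernel.pdf.PdfReader
import com.itextpdf.kernel.pdf.canvas.parser.PdfTextExtractor

// pdfPath holds the path to the PDF file on the device
val pdfReader = PdfReader(pdfPath)
PdfDocument(pdfReader).use { doc ->
    // Closing the document also closes the reader, so no separate close() is needed
    pdfContent = PdfTextExtractor.getTextFromPage(doc.getPage(1))
}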

Do I need to pass any language parameter to the iText 7 library to get the exact contents?

What exactly do you mean by *unreadable format*? If the extracted characters are completely wrong, probably not even Hindi, chances are that the PDF itself contains incomplete or incorrect information for text extraction. If the text is merely slightly off, it might be an iText problem. – mkl Jan 17 '21 at 07:29
@AmedeeVanGasse this is the link to the PDF: https://www.hindutemplealbany.org/wp-content/uploads/2016/08/Sri_Hanuman_Chalisa_Hindi.pdf – Anukool srivastav Jan 18 '21 at 10:24
@mkl, I have updated the question with the output I am getting. – Anukool srivastav Jan 18 '21 at 10:26

score 1 · Answer 1 · answered Jan 18 '21 at 12:31

The font object for the Hindi glyphs in your example PDF explicitly claims that those glyphs correspond to Latin Unicode characters for text extraction.

Thus, it is completely correct that a text extractor extracts Latin Unicode characters for those Hindi glyphs.

Even looking into the embedded font program (which goes beyond regular text extraction) does not improve the situation: the embedded font program also maps the glyphs to Latin Unicode characters, merely different ones.

Thus, for PDFs like that you should attempt OCR instead of text extraction.
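
If you need to spot PDFs like this in code, one rough heuristic is to check how much of the extracted text actually falls in the Devanagari Unicode block (U+0900 to U+097F). The sketch below is only an illustration: `devanagariRatio`, `needsOcr` and the 0.5 threshold are hypothetical names and values, not part of iText, and it assumes the page is expected to be mostly Hindi.

```kotlin
// Hypothetical helper, not part of iText: fraction of the non-whitespace
// characters that lie in the Devanagari Unicode block (U+0900..U+097F).
fun devanagariRatio(text: String): Double {
    // Ignore whitespace so that spacing does not skew the ratio
    val significant = text.filterNot { it.isWhitespace() }
    if (significant.isEmpty()) return 0.0
    val devanagari = significant.count { it in '\u0900'..'\u097F' }
    return devanagari.toDouble() / significant.length
}

// Assumed threshold: if a page expected to be Hindi yields less than 50%
// Devanagari characters, the font probably maps its glyphs to other
// characters and OCR is worth trying instead.
fun needsOcr(extractedText: String, threshold: Double = 0.5): Boolean =
    devanagariRatio(extractedText) < threshold
```

You could pass the result of `PdfTextExtractor.getTextFromPage(...)` to `needsOcr` and only fall back to OCR for pages where it returns true. Pages that legitimately mix Hindi with a lot of digits, punctuation or English would need a lower threshold.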

answered Jan 18 '21 at 12:31

mkl

Thanks for the details, @mkl. I also have the text in a Doc file. Can you recommend a tool for converting this Doc file to a PDF whose text can be extracted with the iText 7 library? – Anukool srivastav Jan 19 '21 at 07:27
I don't have hands-on experience with OCR software. But as you're already using iText 7, you might be interested in [pdfOCR](https://itextpdf.com/en/products/itext-7/pdf-ocr-text-recognition), an iText 7 add-on. Other than that, Tesseract is an OCR tool that is often mentioned. – mkl Jan 19 '21 at 09:14
