I am trying to read the content of a Hindi PDF using the iText 7 library.

It works fine for English-language PDFs and returns the exact characters, but when I try any Hindi (local-language) PDF, the extracted text is unreadable.

This is the unreadable output I get:

d d d daaaah h eeh h ee aaaa  

Here is my sample code for reading the PDF page by page (shown for page 1):

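import com.itextpdf.kernel.pdf.PdfDocument
import com.itextpdf.kernel.pdf.PdfReader
import com.itextpdf.kernel.pdf.canvas.parser.PdfTextExtractor

// "pdfPath" stands in for the actual path to the PDF file
val pdfReader = PdfReader("pdfPath")
// use {} closes the PdfDocument, which by default also closes its reader
PdfDocument(pdfReader).use { doc ->
    pdfContent = PdfTextExtractor.getTextFromPage(doc.getPage(1))
}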

Do I need to pass a language parameter to the iText 7 library to get the exact content?

Anukool srivastav
  • What exactly do you mean by *unreadable format*? If the extracted characters are completely wrong, probably not even Hindi, chances are that the PDF itself contains incomplete or incorrect information for text extraction. If the output is merely slightly off, it might be a problem in iText. – mkl Jan 17 '21 at 07:29
  • Please share a PDF with Hindi content. – Amedee Van Gasse Jan 17 '21 at 21:14
  • @AmedeeVanGasse Here is a link to the PDF: https://www.hindutemplealbany.org/wp-content/uploads/2016/08/Sri_Hanuman_Chalisa_Hindi.pdf – Anukool srivastav Jan 18 '21 at 10:24
  • @mkl, I have updated the question with the output I am getting. – Anukool srivastav Jan 18 '21 at 10:26

1 Answer

The font object for Hindi glyphs in your example PDF explicitly claims those glyphs correspond to Latin Unicode characters for text extraction:

[PDFDebugger screenshot]

Thus, it is completely correct that a text extractor extracts Latin Unicode characters for those Hindi glyphs.

Even looking into the embedded font program (which goes beyond regular text extraction) does not improve the situation: the embedded font program also maps to Latin Unicode characters, merely different ones:

[FontForge screenshot]

Thus, for PDFs like that you should attempt OCR instead of text extraction.
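
To decide programmatically when to fall back to OCR, one option is to check whether the extracted text contains Devanagari characters at all. The following is a minimal sketch, not part of iText: the function names `devanagariRatio` and `looksLikeDevanagari` and the 0.5 threshold are assumptions chosen for illustration, and the check uses only the JDK's `Character.UnicodeScript`.

```kotlin
// Heuristic sketch (hypothetical helpers, not an iText API).
// Returns the share of letters in `text` that belong to the Devanagari script.
fun devanagariRatio(text: String): Double {
    // Only letters are counted; combining vowel signs, digits and spaces are ignored.
    val letters = text.filter { it.isLetter() }
    if (letters.isEmpty()) return 0.0
    val devanagari = letters.count {
        Character.UnicodeScript.of(it.code) == Character.UnicodeScript.DEVANAGARI
    }
    return devanagari.toDouble() / letters.length
}

// The 0.5 threshold is an arbitrary assumption; tune it for your documents.
fun looksLikeDevanagari(text: String, threshold: Double = 0.5): Boolean =
    devanagariRatio(text) >= threshold
```

If `looksLikeDevanagari(pdfContent)` is false for a page you expect to be Hindi, the font mapping is probably of the kind described above, and that page is a candidate for OCR.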

mkl
  • Thanks for the detail, @mkl. I also have the text in a Doc file. Can you recommend a tool to convert this Doc to a PDF whose text can be extracted with the iText 7 library? – Anukool srivastav Jan 19 '21 at 07:27
  • I don't have hands-on experience with OCR software. But as you're already using iText 7, you might be interested in [pdfOCR](https://itextpdf.com/en/products/itext-7/pdf-ocr-text-recognition), an iText 7 add-on. Other than that, Tesseract is an OCR engine that is often mentioned. – mkl Jan 19 '21 at 09:14