
I am trying to extract text from a PDF that contains Hindi (Unicode) text. I am using Apache PDFBox (http://pdfbox.apache.org/) for the extraction. The extractor returns text, but it is unreadable. I tried many different encodings and fonts, but the output never matches the expected text. For example, say the text in the PDF is: पवार
What I get after extraction is: ̄Ö3⁄4ÖÖ ̧ü

Are there any suggestions?

1 Answer


PDF is, at its heart, a print format and thus records text as a series of visual glyphs, not as actual text. It was never originally intended as a digital archive format, and that still shows in many documents. With complex scripts, such as Arabic or Indic scripts, which require glyph substitution, ligation and reordering, you often get a mess. What you usually get are the glyph IDs used in the embedded fonts, which bear no resemblance to Unicode or to any actual text encoding. Fonts represent glyphs; some of these may be mapped to Unicode code points, but others exist only for font-internal use, such as contextual glyph variants or ligatures. You can see the same with PDFs produced by LaTeX, especially with non-ASCII characters and math.

PDF also has facilities to embed the text as text alongside the visual representation, but whether that happens is solely at the discretion of the generating application. I have heard Word tries very hard to retain that information when producing PDFs, but many PDF generators do not (it usually works somewhat for Latin, which is probably why hardly anyone bothers).

I think your best bet, if the PDF doesn't have the plain text available, is to run OCR on the PDF as an image.
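
To decide automatically when to fall back to OCR, one rough option is to check whether the extracted string contains the script you expect at all. The sketch below is only an illustration using the standard Java library, not anything PDFBox provides: `ScriptCheck`, `scriptRatio` and `looksLikeDevanagari` are made-up names, and the 0.5 threshold is an arbitrary assumption. It uses the two strings from the question, written as Unicode escapes so the source file encoding does not matter.

```java
import java.lang.Character.UnicodeScript;

public class ScriptCheck {

    // Hypothetical helper: fraction of letters in `text` that belong to `script`.
    // Digits, punctuation, spaces and combining marks are ignored.
    static double scriptRatio(String text, UnicodeScript script) {
        int letters = 0;
        int matching = 0;
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            i += Character.charCount(cp);
            if (!Character.isLetter(cp)) {
                continue;
            }
            letters++;
            if (UnicodeScript.of(cp) == script) {
                matching++;
            }
        }
        return letters == 0 ? 0.0 : (double) matching / letters;
    }

    // Assumed threshold of 0.5; tune it for your own documents.
    static boolean looksLikeDevanagari(String extracted) {
        return scriptRatio(extracted, UnicodeScript.DEVANAGARI) >= 0.5;
    }

    public static void main(String[] args) {
        // The Hindi word from the question.
        String expected = "\u092A\u0935\u093E\u0930";
        // The garbled extraction result from the question.
        String extracted = "\u0304\u00D63\u20444\u00D6\u00D6 \u0327\u00FC";
        System.out.println(looksLikeDevanagari(expected));
        System.out.println(looksLikeDevanagari(extracted));
    }
}
```

In practice the string would come from your extractor. If the check fails for a document you know is Hindi, the PDF most likely carries only glyph IDs and OCR is the way to go. Note that this only detects the wrong script; it cannot tell whether text in the right script is actually correct.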

Joey
  • Thanks Joey. I will try OCR, but is there any online app I can use to identify the encoding of the PDF? – Prakash Pimpale Sep 22 '11 at 09:44
  • As I tried to say in my answer, there is no encoding. PDF specifies the look of the document by placing glyphs on a page. There is no machine-readable text remaining at that stage, except in a few cases such as ASCII or Latin-1 text which *usually* go through unmangled. It's as if you'd write text on a punch card and expect a computer to read your written text, even though you've punched no holes in the card. – Joey Sep 22 '11 at 10:06