
Is it possible to change the ToUnicode mapping in a PDF file so that iText can extract text correctly?

Tanu
  • Can't you use `PdfTextExtractor` to extract the text? – Arun Xavier Feb 26 '16 at 12:59
  • @JAVY: The PDF contains certain special characters such as alpha, beta, etc. When I use `PdfTextExtractor`, these characters are not extracted properly. For example, β comes out as P2. Is there a way to resolve this? – Tanu Feb 26 '16 at 13:12
  • Have a look at [this](http://stackoverflow.com/a/21604305/4290096) (that question is about Chinese characters, but it might help you). – Arun Xavier Feb 26 '16 at 13:18
  • First of all, in general there is no single "ToUnicode mapping" in a PDF; each font on each page may have its own. That being said, yes, you can change these mappings. But "β changes to P2" sounds like there is something else weird in your PDF. – mkl Feb 26 '16 at 13:37
  • I figured out what the problem was: the PDFs were produced by OCR, so some characters were not recognized correctly. – Tanu Jun 29 '16 at 06:25
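
To illustrate the point about per-font ToUnicode mappings, here is a minimal Java sketch of how such a mapping turns character codes into text, and how overriding one entry changes the extracted result. It deliberately does not use iText: the class `ToUnicodeSketch`, its methods, and the sample CMap fragment are all hypothetical, and the "P2" entry is an invented stand-in for a wrong mapping, not a reconstruction of the file discussed here. It only reads the first `beginbfchar` section; real ToUnicode CMaps can also contain several sections and `bfrange` entries.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ToUnicodeSketch {

    // One bfchar entry, e.g. "<01> <03B2>": character code, then UTF-16BE hex.
    private static final Pattern BFCHAR =
            Pattern.compile("<([0-9A-Fa-f]+)>\\s*<([0-9A-Fa-f]+)>");

    /** Parses the first bfchar section of a ToUnicode CMap into code -> text. */
    static Map<Integer, String> parseBfChars(String cmap) {
        Map<Integer, String> map = new HashMap<>();
        int start = cmap.indexOf("beginbfchar");
        int end = cmap.indexOf("endbfchar");
        if (start < 0 || end < start) {
            return map; // no bfchar section found
        }
        String body = cmap.substring(start + "beginbfchar".length(), end);
        Matcher m = BFCHAR.matcher(body);
        while (m.find()) {
            int code = Integer.parseInt(m.group(1), 16);
            map.put(code, hexToString(m.group(2)));
        }
        return map;
    }

    /** Converts UTF-16BE hex (four hex digits per code unit) to a String. */
    static String hexToString(String hex) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i + 4 <= hex.length(); i += 4) {
            sb.append((char) Integer.parseInt(hex.substring(i, i + 4), 16));
        }
        return sb.toString();
    }

    /** Decodes character codes with the mapping; unmapped codes become U+FFFD. */
    static String decode(int[] codes, Map<Integer, String> toUnicode) {
        StringBuilder sb = new StringBuilder();
        for (int code : codes) {
            sb.append(toUnicode.getOrDefault(code, "\uFFFD"));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Hypothetical CMap fragment: code 0x01 is (wrongly) mapped to "P2".
        String cmap = "1 beginbfchar\n<01> <00500032>\nendbfchar";
        Map<Integer, String> toUnicode = parseBfChars(cmap);
        int[] codes = {0x01};

        System.out.println(decode(codes, toUnicode));

        // Overriding the entry with U+03B2 (Greek small letter beta).
        toUnicode.put(0x01, "\u03B2");
        System.out.println(decode(codes, toUnicode));
    }
}
```

A corrected mapping only helps when the glyphs in the content stream are right and just the mapping is wrong. If the text layer itself came from faulty OCR, as turned out to be the case here, the character codes are already wrong and no mapping change will fix them.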

0 Answers