PdfBox - encoding in extracted text

Question

Some PDFs contain text with a "strange" encoding, e.g. http://www.iwb.ch/media/Unternehmen/Dokumente/inserat_leiter_pm.pdf. If I copy the text in Acrobat Reader and paste it somewhere else, I don't get the same characters that I see on screen. PDFBox also has problems extracting this text.

My question is: how can I detect in PDFBox that some fonts use this "strange" encoding? I don't need to decode the text, only to detect the problem.

[This answer](http://stackoverflow.com/questions/20402741/how-to-get-text-extraction-from-pdf-to-work/20410126#20410126) (focusing on some PDF for which some characters are incorrectly extracted) references and quotes the PDF specification section explaining which data are necessary for text extraction as e.g. implemented in PDFBox. You have to check the PDFBox objects accordingly. BTW, [this question](http://stackoverflow.com/questions/20068096/pdfbox-text-extraction-not-working-properly) focuses on your PDF, too. — mkl, Jun 16 '14 at 08:44
Thank you, nice answer. So as I understand it, PDF files with this problem are not created according to the specification, and there is no way to determine which code encodes which character. But is there any way to find out that a font doesn't use Differences or a CMap when it should, or that the PDF is not created according to the spec? Yes, I used the PDF from there as a sample. — Mayo, Jun 18 '14 at 15:08
Well, those documents may still be created according to spec; the spec does not enforce text extraction capabilities, it only says if you want those capabilities, how to do that. — mkl, Jun 18 '14 at 20:36
Okay, and could you tell me how I can distinguish those PDF documents (the ones that don't offer text extraction capabilities)? Can I assume that the **/Encoding** attribute or the **/FontDescriptor** is missing? — Mayo, Jun 19 '14 at 07:56
Look at the answer I referenced in my first comment here. The specification sections it refers to tell you which items text extraction relies on. So, if you want to know what indicates unextractability, look for the same items. If they are missing, your text is hard to extract. — mkl, Jun 19 '14 at 11:32
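
To make the last comment concrete, here is a minimal sketch of such a check. It is not PDFBox code: the font dictionary is modelled as a plain `Map` whose keys are PDF names without the leading slash, and the class name, the `Verdict` enum and the `classify` method are hypothetical names introduced for illustration. The rules follow the order described in the PDF specification section on mapping character codes to Unicode values, simplified so that every case needing a deeper look (composite fonts without `ToUnicode`, encoding dictionaries with `Differences`) is only reported as "needs inspection".

```java
import java.util.Map;
import java.util.Set;

// Hypothetical helper, not part of PDFBox. A font dictionary is modelled
// as a plain Map; keys are PDF names without the leading slash.
final class FontExtractabilityCheck {

    enum Verdict { EXTRACTABLE, NEEDS_INSPECTION, SUSPICIOUS }

    // Predefined simple-font encodings the specification names for
    // mapping codes to glyph names and from there to Unicode.
    private static final Set<String> PREDEFINED_SIMPLE_ENCODINGS =
            Set.of("MacRomanEncoding", "MacExpertEncoding", "WinAnsiEncoding");

    static Verdict classify(Map<String, ?> fontDict) {
        // 1. A ToUnicode CMap maps character codes to Unicode directly.
        if (fontDict.containsKey("ToUnicode")) {
            return Verdict.EXTRACTABLE;
        }
        // 2. Composite fonts without ToUnicode depend on their CMap and on
        //    the descendant CIDFont's character collection; this sketch
        //    does not model that and leaves the decision to the caller.
        if ("Type0".equals(fontDict.get("Subtype"))) {
            return Verdict.NEEDS_INSPECTION;
        }
        Object encoding = fontDict.get("Encoding");
        // 3. Simple font with one of the predefined encodings.
        if (encoding instanceof String && PREDEFINED_SIMPLE_ENCODINGS.contains(encoding)) {
            return Verdict.EXTRACTABLE;
        }
        // 4. Encoding dictionary: the glyph names in Differences would have
        //    to be checked against the standard glyph names.
        if (encoding instanceof Map) {
            return Verdict.NEEDS_INSPECTION;
        }
        // 5. Nothing the specification lists as a basis for extraction.
        return Verdict.SUSPICIOUS;
    }

    public static void main(String[] args) {
        // A font with a descriptor but neither ToUnicode nor Encoding.
        Map<String, Object> font = Map.of("Subtype", "TrueType", "FontDescriptor", "...");
        System.out.println(classify(font));
    }
}
```

With PDFBox, the same keys could presumably be looked up in each font's underlying COS dictionary (for example via `COSDictionary.containsKey` with the corresponding `COSName`); treat that mapping as an assumption to verify against the PDFBox version in use. Note that a missing **/FontDescriptor** alone is not what this sketch tests: the deciding entries are `ToUnicode` and `Encoding`.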

