Correctly extracting text from a PDF (UTF-8)

Question

I want to extract text from some PDF files (programmatically, with a utility, or even by copy/paste), but some characters come out wrong. Although I specify UTF-8 encoding when extracting the text, characters such as "ș", "ț" and "ă" come out as "„ ˛" instead of the displayed characters (or at least plain "s", "t", "a"). The text is displayed correctly, but when I copy it, for example, those characters are wrong.
Is there a way to extract the text correctly (a Java/C/Python library, or a Windows/Linux utility), or are these PDF files corrupted in some way?

score 0 · Accepted Answer · answered May 18 '12 at 10:08

Can you extract the text correctly in Acrobat from the PDF?

mark stephens

I used "Save as..." with different settings and in different formats, but I couldn't get the text out correctly. Is there something more complex I should try? I don't really understand why the text is displayed perfectly but cannot be extracted as shown (or whether there is a way to do so at all). – Andrei F May 18 '12 at 12:26
Text is displayed using the glyphs built into the PDF. Text is extracted using other information, so being able to see the text is no guarantee that you can extract it. If you cannot cut and paste it from Acrobat, chances are the file is not set up for text extraction. – mark stephens May 18 '12 at 15:23
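
To illustrate the comment above: in many PDFs the "other information" used for extraction is a ToUnicode CMap attached to the font, which maps each glyph code to a Unicode character. The sketch below is only a toy model of that idea, not a PDF parser. The CMap fragment, the glyph codes and the helper names (parse_bfchar, extract) are all made up for the example, and the Latin-1 fallback is just one guess an extractor might make. It shows how the same glyph codes can give correct text when the mapping is present and garbage when it is missing.

```python
import re

# Hypothetical ToUnicode CMap fragment (codes and targets invented for this sketch).
# Each line maps a glyph code to a UTF-16BE code point: ș, ț, ă.
TOUNICODE_CMAP = """
3 beginbfchar
<01> <0219>
<02> <021B>
<03> <0103>
endbfchar
"""


def parse_bfchar(cmap_text):
    """Return {glyph code: Unicode string} from the bfchar sections of a CMap."""
    mapping = {}
    for section in re.findall(r"beginbfchar(.*?)endbfchar", cmap_text, re.S):
        pairs = re.findall(r"<([0-9A-Fa-f]+)>\s*<([0-9A-Fa-f]+)>", section)
        for code_hex, uni_hex in pairs:
            mapping[int(code_hex, 16)] = bytes.fromhex(uni_hex).decode("utf-16-be")
    return mapping


def extract(codes, to_unicode):
    """Turn single-byte glyph codes into text.

    Codes found in the mapping use it; anything else falls back to Latin-1,
    which is an assumption standing in for whatever a real extractor guesses.
    """
    out = []
    for code in codes:
        if code in to_unicode:
            out.append(to_unicode[code])
        else:
            out.append(bytes([code]).decode("latin-1"))
    return "".join(out)


# Glyph codes a page might use to draw the word "țară" with this made-up font.
codes = [0x02, 0x61, 0x72, 0x03]

with_map = extract(codes, parse_bfchar(TOUNICODE_CMAP))  # mapping present
without_map = extract(codes, {})                         # mapping missing

print(repr(with_map))
print(repr(without_map))
```

The viewer draws the right shapes in both cases because it only needs the glyph outlines. If the mapping in the file is missing or wrong, specifying UTF-8 for the output cannot repair it, since the wrong characters are chosen before any output encoding is applied.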
