How to cut-paste from PDF with non-ASCII encoding?

Question

I have some PDFs and I am trying to copy and paste text from them in Acrobat Reader into an HTML form. Some of these files seem to use (I suspect) Unicode for text encoding, so when I paste into the HTML form (in Firefox) I get little boxes with hex characters in them rather than readable text. The problem is not that the PDF has not been OCRed -- when I try to do that in Acrobat Pro, it says it can't because the file already contains renderable text. Is there any way to deal with this? For example, could I add some sort of JavaScript to the form that would do the conversion?

score 9 · Accepted Answer · answered Feb 04 '12 at 19:37

Are you able to paste text copied from the file into other programs, such as Notepad or Word?

Some PDF files, even ones produced by Adobe tools, lack information that is crucial for extracting text from them. Basically, such files do not contain glyph-to-character mapping information.

Such files will be displayed and printed just fine, but text from them can't be properly copied or extracted.

For example, Distiller produces such files when the "Smallest File Size" preset is used.
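One rough way to check whether pasted text suffers from this is to look at which code points actually arrive. The sketch below is only a heuristic and rests on an assumption: that a viewer without mapping information hands over private-use, unassigned, control, or replacement characters. A file whose mapping is wrong rather than missing can yield valid but incorrect letters, which this check will not catch. The names `suspicious_ratio` and `looks_unmapped` and the 0.3 threshold are made up for illustration.

```python
import unicodedata

# Unicode general categories that rarely appear in real text:
# Co = private use, Cn = unassigned, Cc = control, Cs = surrogate.
SUSPICIOUS_CATEGORIES = {"Co", "Cn", "Cc", "Cs"}


def suspicious_ratio(text):
    """Return the fraction of non-whitespace characters that look unmapped."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    bad = sum(
        1
        for c in chars
        if unicodedata.category(c) in SUSPICIOUS_CATEGORIES or c == "\ufffd"
    )
    return bad / len(chars)


def looks_unmapped(text, threshold=0.3):
    """Heuristic only: True when a large share of the text is suspicious."""
    return suspicious_ratio(text) >= threshold


if __name__ == "__main__":
    pasted = "\ue001\ue002\ue003"  # stand-in for text pasted from the PDF
    print(looks_unmapped(pasted))
```

If this reports a high ratio for text copied out of several different viewers, the file itself is the likely culprit and no conversion on the receiving side will recover the characters.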

answered Feb 04 '12 at 19:37 by Bobrovsky

Same result no matter where I paste it -- Notepad, Word, etc. I think maybe you're right about the PDF file. If I open this file in Acrobat Pro, copy some of its text, then open a sticky note and try to paste the text, I get boxes instead of characters. So even Acrobat can't deal with this text. – Steve Feb 04 '12 at 20:15

score 5 · Answer 2 · answered Nov 29 '13 at 18:02

I have the same problem. It is explained here: http://forums.adobe.com/thread/915012

My solution was to convert the PDF to Word using Acrobat's export tool and then extract the information I needed from the Word file.

It's frustrating, but it works.

Another solution I found is to convert the PDF to images (JPEG, PNG, etc.) and then run OCR on them.
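The image-then-OCR route can also be scripted. The following is a minimal sketch, assuming the `pdftoppm` (from Poppler) and `tesseract` command-line tools are installed and on the PATH; neither tool is mentioned in this thread, so treat them as one possible choice. The function names, the `page` file prefix, and the default of 300 dpi are illustrative.

```python
import subprocess
from pathlib import Path


def pdftoppm_command(pdf_path, out_prefix, dpi=300):
    """Build the command that renders each PDF page to a PNG image."""
    return ["pdftoppm", "-r", str(dpi), "-png", str(pdf_path), str(out_prefix)]


def tesseract_command(image_path, lang="eng"):
    """Build the OCR command; the output base 'stdout' prints the text."""
    return ["tesseract", str(image_path), "stdout", "-l", lang]


def ocr_pdf(pdf_path, work_dir, lang="eng", dpi=300):
    """Render the PDF to images, OCR each image, and return the joined text."""
    work_dir = Path(work_dir)
    work_dir.mkdir(parents=True, exist_ok=True)
    subprocess.run(pdftoppm_command(pdf_path, work_dir / "page", dpi), check=True)

    pages = []
    # Process the rendered images in file-name order.
    for image in sorted(work_dir.glob("page-*.png")):
        result = subprocess.run(
            tesseract_command(image, lang),
            check=True,
            capture_output=True,
            text=True,
            encoding="utf-8",
        )
        pages.append(result.stdout)
    return "\n".join(pages)


if __name__ == "__main__":
    # "input.pdf" and "ocr_work" are placeholder paths.
    print(ocr_pdf("input.pdf", "ocr_work", lang="eng"))
```

For non-English documents, pass the matching Tesseract language code (for example `rus`), provided that language data is installed. OCR quality depends heavily on resolution, so a higher `dpi` may help with small print.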

score 3 · Answer 3 · edited Jan 15 '16 at 22:53

Select the text in Acrobat.
Right-click and select "Copy with formatting" from the context menu.
Wait for the progress bar to process the text.
Paste in the Word document.

edited Jan 15 '16 at 22:53 by Ferrybig · answered Jan 15 '16 at 22:27 by David

score 2 · Answer 4 · answered Feb 04 '12 at 19:22

It is quite possible that the text contains characters that get copied correctly but your browser is unable to display them, due to lack of suitable font. A PDF document may contain embedded fonts, so Adobe Reader displays the characters OK, but a browser lacks access to those fonts.

You can check whether this is the reason by trying to copy and paste the characters here (it might be useful info about the problem anyway). You could also download and install the Code200x fonts, which contain pretty much any character you can normally expect to encounter. (It is not guaranteed, but probable, that Firefox will be able to use those fonts automatically when needed.)

Tried the fonts, no help. Also, when I pasted the characters into an IDE (Komodo), it said the default encoding cp-1252 was not suitable, and when I changed the encoding to Unicode it became happy. – Steve Feb 04 '12 at 20:08
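The IDE's complaint can be reproduced outside the editor. This is a small sketch, not a fix: it checks whether a string can be encoded in a given encoding and lists the code points it contains, which helps tell apart "real characters that cp1252 cannot represent" from "private-use or unnamed code points that no font will render". The helper names `encodable` and `describe` are hypothetical.

```python
import unicodedata


def encodable(text, encoding):
    """Return True if every character in text can be encoded in `encoding`."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


def describe(text):
    """List each character as (code point, Unicode name) for inspection."""
    return [(f"U+{ord(c):04X}", unicodedata.name(c, "<unnamed>")) for c in text]


if __name__ == "__main__":
    pasted = "Привет"  # stand-in for text pasted from the PDF
    print(encodable(pasted, "cp1252"), encodable(pasted, "utf-8"))
    for code_point, name in describe(pasted):
        print(code_point, name)
```

If the names that come back are sensible (Cyrillic letters, for instance), the text was copied correctly and the remaining problem is fonts or the target encoding. If they come back as `<unnamed>`, the PDF most likely lacks usable character mapping.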

score 1 · Answer 5 · answered Jul 28 '17 at 06:47

We had a similar problem trying to copy and paste Cyrillic text from a PDF file into Excel.

The easiest solution we found was to open the .pdf in a browser (Chrome, Mozilla or Opera) and copy and paste the text from there into Word or Excel.

It didn't work with IE, as expected.

score 1 · Answer 6 · edited Nov 11 '21 at 10:18

If none of the above works for you, as it didn't for me, you can take a screenshot of the PDF and open it with Google Lens (on an Android phone). In the Text section, it detects the text automatically, and you can then copy it.

edited Nov 11 '21 at 10:18 by Tomerikoo · answered Nov 11 '21 at 10:07 by Luka Kavteli

score 0 · Answer 7 · edited Jan 08 '16 at 11:39

I had the same problem, but I solved it by opening the PDF file in a web browser (Chrome in my case). Copying and pasting non-ASCII text works fine in Chrome.

edited Jan 08 '16 at 11:39 by ruddra · answered Jan 08 '16 at 11:09 by user5762406

score 0 · Answer 8 · edited Feb 06 '19 at 20:36

You can export from Acrobat as JPEG, then open the JPEG in Acrobat (not Reader) and run the OCR tool. From there you should be able to copy and paste.

edited Feb 06 '19 at 20:36 by Eric Aya · answered Feb 06 '19 at 20:32 by Kermit Russell

score 0 · Answer 9 · answered Mar 26 '21 at 17:38

I am using Nitro PDF. First I created images at 600 dpi from the PDF. Then I opened the images in a new PDF file and used the OCR option on the Review tab. That produced another PDF file with standard text encoding, from which I can copy and paste text.

answered Mar 26 '21 at 17:38 by Sami Asrar
