I have some PDFs and I am trying to copy and paste text from them in Acrobat Reader into an HTML form. Some of these files seem to use (I suspect) Unicode for text encoding, so when I paste into the HTML form (in Firefox) I get the little boxes with hex digits in them rather than readable text. The problem is not that the PDF has not been OCRed -- when I try to do that in Acrobat Pro, it says it can't because the file already contains renderable text. Is there any way to deal with this? For example, could I add some sort of JavaScript to the form that would do the conversion?
9 Answers
Are you able to paste text copied from the file into other programs like Notepad or Word or any other?
Some PDF files are produced without information that is crucial for extracting text from them, even when they are created by Adobe tools. Basically, such files do not contain glyph-to-character mapping information.
Such files will be displayed and printed just fine, but text from them can't be properly copied or extracted.
For example, Distiller produces such files when the "Smallest File Size" preset is used.
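If you want to check pasted text programmatically, here is a rough sketch in Python. It rests on an assumption that is not stated above: that text copied from a PDF without a usable glyph-to-character mapping tends to arrive as private-use, unassigned, control, or replacement code points. The function names and the 0.2 threshold are made up for illustration, and the check is only a heuristic; it cannot recover the original text.

```python
import unicodedata


def suspicious_chars(text):
    """Return (index, char, reason) for code points that rarely occur in real text."""
    findings = []
    for i, ch in enumerate(text):
        cat = unicodedata.category(ch)
        if ch == "\ufffd":
            findings.append((i, ch, "replacement character"))
        elif cat == "Co":
            findings.append((i, ch, "private use"))
        elif cat == "Cn":
            findings.append((i, ch, "unassigned"))
        elif cat == "Cc" and ch not in "\t\n\r":
            findings.append((i, ch, "control character"))
    return findings


def looks_unmapped(text, threshold=0.2):
    """Heuristic: True if a large share of the non-space characters is suspicious.

    The threshold is an arbitrary illustrative value, not a tested one.
    """
    visible = [ch for ch in text if not ch.isspace()]
    if not visible:
        return False
    bad = sum(1 for _, ch, _ in suspicious_chars(text) if not ch.isspace())
    return bad / len(visible) >= threshold


if __name__ == "__main__":
    pasted = "\ue001\ue002\ue003 \x01\x02"  # made-up sample of "box" characters
    for index, ch, reason in suspicious_chars(pasted):
        print(f"{index}: U+{ord(ch):04X} ({reason})")
    print("probably unmapped:", looks_unmapped(pasted))
```

If the check fires on most of the pasted text, the file is probably one of those described above, and OCR on a rendered image is the more realistic route than any conversion in the form.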

Same result no matter where I paste it--Notepad, Word, etc. I think maybe you're right about the PDF file. If I open this file in Acrobat Pro, copy some of its text, then open a sticky note and try to paste the text, I get boxes instead of chars. So even Acrobat can't deal with this text. – Steve Feb 04 '12 at 20:15
I have the same problem. It is explained here: http://forums.adobe.com/thread/915012
My solution was to convert the PDF to Word using Acrobat's export tool and then extract the information I needed from the Word file.
It's frustrating, but it works.
Another solution I found is to convert the PDF to images (JPEG, PNG, etc.) and then run OCR on them.

It is quite possible that the text contains characters that get copied correctly but that your browser is unable to display, due to the lack of a suitable font. A PDF document may contain embedded fonts, so Adobe Reader displays the characters OK, but a browser lacks access to those fonts.
You can check whether this is the reason by trying to copy and paste the characters here (it might be useful info about the problem anyway). You could also download and install the Code200x fonts, which contain pretty much any character you can normally expect to encounter. (It is not guaranteed, but probable, that Firefox will be able to use those fonts automatically when needed.)

Tried the fonts, no help. Also, when I pasted the chars into an IDE (Komodo), it said the default encoding cp-1252 was not suitable, and when I changed the encoding to Unicode it became happy. – Steve Feb 04 '12 at 20:08
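For what it's worth, the encoding complaint can be reproduced outside an IDE. Below is a small Python sketch that lists which characters of a string cannot be represented in a given encoding. The helper name `unencodable` and the sample string are made up for illustration; `cp1252` is the name of Python's codec for Windows-1252.

```python
def unencodable(text, encoding="cp1252"):
    """Return the distinct characters of `text` that `encoding` cannot represent,
    in the order they first appear."""
    missing = []
    for ch in text:
        try:
            ch.encode(encoding)
        except UnicodeEncodeError:
            if ch not in missing:
                missing.append(ch)
    return missing


if __name__ == "__main__":
    # Made-up sample: Latin text with an accent, followed by a Cyrillic word.
    sample = "caf\u00e9 \u0416\u0443\u043a"
    print(unencodable(sample))           # characters outside cp1252
    print(unencodable(sample, "utf-8"))  # UTF-8 can encode all of them
```

An empty result for `cp1252` means the text fits in that code page; anything else explains why an editor asks for a Unicode encoding instead.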
We had a similar problem trying to copy/paste Cyrillic text from a PDF file into Excel.
The easiest solution we found was to open the .pdf in a browser (Chrome, Mozilla or Opera) and copy/paste the text into Word or Excel.
It didn't work with IE, as expected.

If none of the above works for you, as it didn't for me, you can take a screenshot of the PDF and open it with Google Lens (on an Android phone). Go to the text section, where the text is detected automatically, and copy it from there.

I had the same problem, but I solved it by opening the PDF file in a web browser (Chrome in my case). Copying and pasting non-ASCII text works fine in Chrome.

You can export from Acrobat as JPEG, then open the JPEG in Acrobat (not Reader) and run the OCR tool. From there you should be able to copy/paste.

I am using Nitro PDF. First, I created images at 600 dpi from the PDF. Then I opened the images in a new PDF file and used the OCR option on the Review tab. That produced another PDF file with standard text encoding, from which I can copy and paste text.