
I have access to a scanner at my library which can create "searchable PDFs." These are PDFs that show the exact image of a scanned document, but also contain a layer of hidden text that gets selected when you select a portion of the image that contains text. This way you can copy and paste text, or search for text, in the scanned document. This is VERY useful, and an awesome improvement over raw scanned images. I also have several apps on my Mac that can create this kind of searchable PDF from a scanned document or a raw image.
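
As a rough illustration of what this hidden text is at the file level: OCR tools commonly write the recognized words into the page's content stream using text rendering mode 3 (the `3 Tr` operator), which the PDF specification defines as neither filled nor stroked, so the text is invisible but still selectable. The sketch below is Python using only the standard library; `SAMPLE_STREAM`, `hidden_strings`, and `correct_hidden_text` are hypothetical names made up for this example, and the stream is hand-written rather than taken from a real scanner. It assumes an uncompressed stream with simple literal strings shown by `Tj`; real files usually compress their streams and may use hex strings, `TJ` arrays, or custom font encodings, so treat this as a picture of the idea, not a working repair tool.

```python
import re

# Hand-written sample of a page content stream (an assumption for illustration).
# The first line paints the scanned image; the BT...ET block is the OCR layer.
SAMPLE_STREAM = (
    "q 612 0 0 792 0 0 cm /Im0 Do Q\n"
    "BT\n"
    "3 Tr\n"              # text rendering mode 3: invisible text
    "/F1 12 Tf\n"
    "72 700 Td\n"
    "(He1lo world) Tj\n"  # OCR misread "Hello" as "He1lo"
    "ET\n"
)

# Matches either a rendering-mode change ("<n> Tr") or a literal string
# shown with Tj. Simplification: it does not track q/Q graphics-state
# save/restore, and it ignores hex strings and TJ arrays.
TOKEN = re.compile(r"(?P<mode>\d+)\s+Tr\b|\((?P<text>(?:\\.|[^\\()])*)\)\s*Tj\b")


def hidden_strings(stream):
    """Return the literal strings shown while rendering mode 3 is active."""
    mode = 0  # the default text rendering mode is 0 (filled, visible)
    found = []
    for m in TOKEN.finditer(stream):
        if m.group("mode") is not None:
            mode = int(m.group("mode"))
        elif mode == 3:
            found.append(m.group("text"))
    return found


def correct_hidden_text(stream, wrong, right):
    """Replace `wrong` with `right` inside invisible Tj strings only."""
    mode = 0
    pieces = []
    pos = 0
    for m in TOKEN.finditer(stream):
        if m.group("mode") is not None:
            mode = int(m.group("mode"))
        elif mode == 3 and wrong in m.group("text"):
            # Copy everything up to the string body, then the fixed body.
            pieces.append(stream[pos:m.start("text")])
            pieces.append(m.group("text").replace(wrong, right))
            pos = m.end("text")
    pieces.append(stream[pos:])
    return "".join(pieces)
```

Here `hidden_strings(SAMPLE_STREAM)` picks out the misrecognized string, and `correct_hidden_text(SAMPLE_STREAM, "He1lo", "Hello")` returns a stream whose image-drawing line is untouched. Changing a string can change its rendered width, so selection highlights may no longer line up exactly with the image.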

Now, it's obvious to anyone who has ever used OCR that the process of converting images to text is not 100% accurate, so the text that you search or copy will be wrong in some places.

So I searched for quite some time for an application that would load a searchable PDF and allow me to repair the hidden searchable text without reformatting or modifying the original scanned image.

Does anyone know of a tool (or library API) that would allow this?

It's worth saying here that I tried the latest version of Adobe Acrobat DC for Mac, and it doesn't seem to even allow me to view the hidden searchable text, much less edit it. It does allow me to replace the scanned image with the results of its own OCR process so that I could edit and save the document. But this would produce horrible results for any of the scanned documents that I am using. It seems designed for editing a "native PDF," not for editing a scanned document.

I have also tried ABBYY FineReader with no luck.

Chris Quenelle
  • I expect to answer my own question soon with a statement that you must re-scan the original document, and correct the text at the time you create the searchable PDF. I believe there are tools that will do that, but I haven't looked for that feature yet. – Chris Quenelle Oct 02 '15 at 19:41
  • `pdfedit` is ages old, but does that job for me. – arkascha Oct 02 '15 at 19:43

2 Answers

I'm using ABBYY FineReader 12 Professional (not open source). Just open a scanned image or scanned PDF and press Verify Text (or Ctrl+F7), then go through all the spelling errors and low-confidence characters and fix them.

The program is very good: it shows you the exact place in the image/PDF to correct, with the OCR guess side by side for convenience, and it steps through all of them.

[By the way, I use keyboard shortcuts to speed things up: Alt+Enter adds the unrecognized word to the dictionary; Ctrl+Delete skips a word, or confirms it if you have fixed it.]

Then save the document as a PDF file (menu: File > Save Document As > PDF File), and you can search it in any PDF reader. The saved file looks the same as the scanned one, but there is text 'behind' it.

It's odd that you tried ABBYY with no luck; it's working great for me. Maybe you didn't try the Professional version.

Hope it helps you.

Ariel Nahom

The poster is not asking how to create a searchable PDF from images; he wants to start with an already searchable PDF and modify its text (e.g., because a searchable PDF was made initially, but an overlooked recognition error was found later and needs correcting). I see no way to do this and know of no tool that assists with it.