OCR engine to capture characters from images

Question

I'm using the C# tessnet2 wrapper for the Tesseract OCR engine to capture characters from image files. I have been searching everywhere to find out whether tessnet2 has any built-in functions to overwrite certain characters and save them back into the same image file it is reading, but I have not found anything on that. So what I'm thinking of doing is creating a new image file based on what I receive from tessnet2, but the new image needs to be exactly the same as the original with only a few things changed. I'm not sure whether I'm using the correct methodology, or whether there are other C# assemblies out there that let you read characters from an image file and also manipulate them as needed.

score 1 · Accepted Answer · answered Aug 09 '12 at 03:07

Good luck, but Tesseract has no way of replacing text in the proper font. Raster graphics don't generally store glyph information. Even if they did, you would potentially be in violation of licenses and/or copyrights surrounding the fonts you'd be writing in. I'm not an expert in OCR, but I will confidently say that this is not something readily available out there in the wild.

pstrjds · Answer 2 · 2012-08-09T12:38:31.810

0

To expand on Brian's answer: You will need to do this yourself. I have not worked with Tesseract, but I have used the Nuance OCR engine. It returns font information as well as coordinates for each character it has recognized (note that you will most likely have to compute the actual image coordinates yourself, as the OCR engine will have deskewed the image before performing the recognition). Once you have the coordinates and the deskew angle, and can therefore compute the actual coordinates, you can use any image manipulation library (Leadtools, Accusoft, etc.) or just plain GDI+ functions to clear the character, then use the font and size info to create a new character and merge it into the image. This is not trivial but certainly doable.

Edit:
It was late when I wrote the initial answer, so I wanted to clarify what is meant by font information. The OCR engine will give you information regarding the point size, whether it's bold/italicized, and the font family (serif, etc.). I do not know of one that will tell you the exact font that the document is in. If you have a sample of the documents that you will process, then you can make a good guess based on the info the OCR engine gives you.
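
To make the coordinate correction mentioned above a bit more concrete, here is a minimal sketch of the geometry only, written in Python for brevity (the arithmetic ports directly to C#). It assumes, hypothetically, that the engine deskewed the page by a pure rotation of `skew_degrees` about a known `center` without resizing the canvas. The function names and parameters are made up for illustration and are not part of tessnet2, Tesseract, or the Nuance API. Check the sign convention against your own engine before relying on it.

```python
import math


def deskewed_to_original(x, y, skew_degrees, center):
    """Map a point from the deskewed image back to the original scan.

    Assumption: the deskew was a rotation by `skew_degrees` about `center`,
    so undoing it is a rotation by -skew_degrees about the same point.
    """
    cx, cy = center
    theta = math.radians(-skew_degrees)
    dx, dy = x - cx, y - cy
    ox = cx + dx * math.cos(theta) - dy * math.sin(theta)
    oy = cy + dx * math.sin(theta) + dy * math.cos(theta)
    return ox, oy


def original_bounding_box(box, skew_degrees, center):
    """Return an axis-aligned (left, top, right, bottom) box in the original
    image that covers a character box reported in deskewed coordinates."""
    left, top, right, bottom = box
    corners = [(left, top), (right, top), (right, bottom), (left, bottom)]
    mapped = [deskewed_to_original(px, py, skew_degrees, center)
              for px, py in corners]
    xs = [p[0] for p in mapped]
    ys = [p[1] for p in mapped]
    # Round outward so the cleared region fully covers the rotated character.
    return (math.floor(min(xs)), math.floor(min(ys)),
            math.ceil(max(xs)), math.ceil(max(ys)))
```

The box returned by `original_bounding_box` is the region you would clear (for example, fill with the background color) before drawing the replacement character with whatever imaging library you use.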

edited Aug 09 '12 at 12:38

answered Aug 09 '12 at 03:12

pstrjds

But still a potential legal issue. If you'll be writing in a commercial typeface, you'll need to license it--that might be unmanageable if you'll be performing these edits on documents from a variety of sources. – Brian Warshaw Aug 09 '12 at 03:15
@BrianWarshaw - Sure, that can be an issue, but I was not addressing it from that standpoint. I was addressing whether it can be done. I worked for a redaction software company for several years and we did things like this, but it is not trivial (especially with dirty documents) – pstrjds Aug 09 '12 at 03:18
I understand how this works now. I was just learning when I wrote my first post and had no idea what was involved. Do you guys know of an OCR engine that can read PDFs? Tesseract does not read PDFs – David Perez Aug 30 '12 at 20:45
Well, instead of looking for an OCR engine to read PDFs, why not look for an API that reads the contents of a PDF? If your PDF is basically just big raster images on PDF pages, you can use a combination of a PDF reading API (to get to the image content) and then an OCR engine to process the raster graphic. – Brian Warshaw Sep 06 '12 at 01:17
@DavidPerez - To follow up on Brian's comment, you could try http://www.pdfsharp.net/MainPage.ashx, though I think it only supports reading PDFs up to version 1.4. – pstrjds Sep 06 '12 at 12:54
