Character recognition (OCR algorithm)

Question

I am working on a project in which I have to develop an OCR algorithm (I have to read the text from an image and then convert it to a different language). So my first task is to get the text from the image.

Steps to complete the first task:

1. Load any image format (BMP, JPG, PNG) from the given source, convert the image to grayscale, and binarize it using a threshold value (Otsu's algorithm). (Completed. How do I remove noise from the output image?)

   Results: [Input Image] [Output Image]

2. Detect image features such as resolution and inversion, so that the image can finally be straightened for further processing. (I have completed the rotation code, but I cannot yet detect the angle by which the image has to be rotated, so I am still working on the angle detection part.)
3. Detect and remove lines. This step is required to improve page layout analysis, to achieve better recognition quality for underlined text, to detect tables, etc. (I have decided to complete this part at the end.)
4. Page layout analysis. In this step I am trying to identify the text zones present in the image, so that only those regions are used for recognition and the rest is left out.
5. Detection of text lines and words. Here we also need to take care of different font sizes and small spaces between words.
6. Recognition of characters. This is the main algorithm of OCR: an image of every character must be converted to the appropriate character code. Sometimes this algorithm produces several character codes for uncertain images. For instance, recognition of the image of the "I" character can produce the codes "I", "|", "1" and "l", and the final character code is selected later.
7. Saving results to the selected output format, for instance searchable PDF, DOC, RTF or TXT. It is important to preserve the original page layout: columns, fonts, colors, pictures, background and so on.

So I need help with part 6. I have completed the line detection part (getting n images from a paragraph containing n lines), but I am stuck on the next part: extracting words and recognizing characters. If you know good links related to OCR and character recognition, please post them here.

For character recognition I am thinking of using Asprise (a Java library): http://asprise.com/product/ocr/index.php?lang=java

For the DOC part, you could use the Apache POI library (http://poi.apache.org/). For TXT you can write your own stream writer, which shouldn't be hard. For PDF you can use PDF Clown (http://www.stefanochizzolini.it/en/projects/clown/). – Tearsdontfalls Mar 03 '13 at 17:32
OCR is a well-established and well-researched topic. I always found this a nice read on the topic: http://www.handwritten.net/mv/papers/mori92historical_review_of_ocr_research_and_development.pdf For the problem of OCR zoning in particular, this one is quite interesting: http://www.music.mcgill.ca/~ich/classes/mumt611_08/Evaluation/KanaiPAMI95.pdf – Bjoern Rennhak May 11 '13 at 23:37
For straightening the image, here's a trick I used when I started writing something for OCR on music notation: http://verens.com/2012/07/26/straightening-an-image-of-horizontal-lines/ – Kae Verens May 20 '13 at 23:45

score 18 · Answer 1 · answered Jun 14 '13 at 02:58

To detect the rotation angle, use the Hough transformation.
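As a rough illustration of the Hough idea, here is a minimal pure-Python sketch. It assumes the image is already binarized into a list of rows with 1 for ink and 0 for background, and that the skew is small (within `max_skew` degrees). The function name, the parameters and the 0.5 degree step are my own choices for the example, not part of any library.

```python
import math
from collections import Counter


def estimate_skew_degrees(binary, max_skew=15.0, step=0.5):
    """Estimate the skew of near-horizontal lines with a Hough transform.

    binary: list of rows, 1 = ink, 0 = background (an assumed input format).
    A line is parameterised as rho = x*cos(theta) + y*sin(theta), where theta
    is the angle of the line normal. A horizontal line has theta = 90 degrees,
    so only theta in 90 +/- max_skew is searched.
    Returns theta_peak - 90 in degrees (0.0 for an empty image).
    """
    n_steps = int(round(2 * max_skew / step)) + 1
    angles = [90.0 - max_skew + i * step for i in range(n_steps)]
    trig = [(math.cos(math.radians(a)), math.sin(math.radians(a))) for a in angles]

    votes = Counter()  # accumulator keyed by (angle index, quantised rho)
    for y, row in enumerate(binary):
        for x, value in enumerate(row):
            if not value:
                continue
            for i, (c, s) in enumerate(trig):
                rho = int(math.floor(x * c + y * s + 0.5))
                votes[(i, rho)] += 1

    if not votes:
        return 0.0
    # The strongest accumulator cell corresponds to the most dominant line.
    (best_i, _), _ = max(votes.items(), key=lambda item: item[1])
    return angles[best_i] - 90.0
```

With image coordinates (y growing downwards), a positive result means the lines run downwards to the right, so you would rotate by the negative of that angle to straighten the page. On real scans it usually helps to vote only with a subset of pixels (for example the bottom edge of each character) to keep the accumulator sharp.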

For noise reduction, replace any pixel that has no neighbour (north, east, south or west) of the same color (or a similar color, using a tolerance threshold) with the average of its neighbours.
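A minimal sketch of that filter, assuming a grayscale image stored as a list of rows of integers in the range 0 to 255. The function name and the default tolerance of 32 are placeholders chosen for the example.

```python
def remove_isolated_pixels(gray, tolerance=32):
    """Replace pixels that have no similar 4-neighbour by the neighbour average.

    gray: list of rows of ints (assumed 0..255). A neighbour counts as similar
    when its value differs by at most `tolerance`. Returns a new image; the
    input is not modified, so every decision is based on the original pixels.
    """
    height = len(gray)
    width = len(gray[0]) if height else 0
    result = [row[:] for row in gray]
    for y in range(height):
        for x in range(width):
            neighbours = [
                gray[ny][nx]
                for ny, nx in ((y - 1, x), (y, x + 1), (y + 1, x), (y, x - 1))
                if 0 <= ny < height and 0 <= nx < width
            ]
            if not neighbours:
                continue  # a 1x1 image has nothing to compare against
            if all(abs(v - gray[y][x]) > tolerance for v in neighbours):
                result[y][x] = round(sum(neighbours) / len(neighbours))
    return result
```

This only removes single-pixel specks. Strokes that are at least two pixels wide survive, but so do noise blobs of two or more pixels.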

Search for vertical white gaps to detect the layout, and slice along each vertical gap. Within each slice, search for horizontal gaps and slice again. If the resulting slices have the same (or a similar) height, you are at line level. Otherwise repeat the vertical/horizontal slicing until you only have lines left. The last step is then another vertical slicing, which gives you the single characters (or ligatures in some cases). Long and narrow or short and wide slices are lines.
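The gap search boils down to projection profiles: count the ink pixels per row or per column and cut wherever the count drops to zero. The sketch below is a simplified version that assumes a single text column, so it goes straight to one horizontal pass (text lines) followed by one vertical pass (characters) instead of recursing over columns first. The input format (list of rows, 1 = ink) and all function names are assumptions made for the example.

```python
def ink_runs(profile):
    """Return (start, end) pairs, end exclusive, of consecutive non-zero entries."""
    runs, start = [], None
    for i, count in enumerate(profile):
        if count and start is None:
            start = i
        elif not count and start is not None:
            runs.append((start, i))
            start = None
    if start is not None:
        runs.append((start, len(profile)))
    return runs


def slice_rows(binary, left, right):
    """Split the column range [left, right) into bands separated by empty rows."""
    profile = [sum(row[left:right]) for row in binary]
    return ink_runs(profile)


def slice_columns(binary, top, bottom, left, right):
    """Split the row band [top, bottom) into pieces separated by empty columns."""
    profile = [
        sum(binary[y][x] for y in range(top, bottom)) for x in range(left, right)
    ]
    return [(left + a, left + b) for a, b in ink_runs(profile)]


def segment_characters(binary):
    """Return character boxes as (left, top, right, bottom), in reading order."""
    width = len(binary[0]) if binary else 0
    boxes = []
    for top, bottom in slice_rows(binary, 0, width):
        for left, right in slice_columns(binary, top, bottom, 0, width):
            boxes.append((left, top, right, bottom))
    return boxes
```

This relies on clean gaps, so it works best after deskewing and noise removal. Characters that touch each other end up in one box, and characters made of separate parts side by side (or noise specks) get split, so real input needs extra handling.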

Compare the character slices with a character library. If performance is not the main concern, try to find the characters within different font libraries, until you can identify the font used. Then stick with that font for character recognition.
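One simple way to do that comparison is pixel-by-pixel matching against a template per character. The sketch below assumes each character slice has already been scaled to the same size as the templates, and that the library is a plain dict from character to bitmap. Both the scoring rule (fraction of equal pixels) and the names are illustrative assumptions, not a recommendation over better features.

```python
def similarity(glyph, template):
    """Fraction of pixels that agree between two equally sized binary bitmaps."""
    if len(glyph) != len(template) or any(
        len(a) != len(b) for a, b in zip(glyph, template)
    ):
        raise ValueError("glyph and template must have the same dimensions")
    total = matches = 0
    for row_g, row_t in zip(glyph, template):
        for a, b in zip(row_g, row_t):
            total += 1
            matches += a == b
    return matches / total if total else 0.0


def rank_candidates(glyph, library):
    """Return (score, character) pairs, best match first.

    library: dict mapping a character to its template bitmap (hypothetical
    format). Keeping the whole ranking instead of only the winner lets a later
    stage decide between look-alike characters.
    """
    scores = [(similarity(glyph, template), ch) for ch, template in library.items()]
    return sorted(scores, key=lambda item: (-item[0], item[1]))
```

For example, with 3x3 templates for "T", "I" and "L", a "T" with one flipped pixel still ranks "T" first. The remaining candidates and their scores can be kept for a later disambiguation step that uses context.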

In the original image, replace each character with the background color, which is determined for each pixel of the character by interpolating the pixels that are not part of the character. This gives you the background image, if any.

answered Jun 14 '13 at 02:58

nibra

I want a good method of noise removal. "Replace any pixel that does not have a neighbour (north, east, south or west) with the same color" does not work well enough. – TLE Jun 14 '13 at 08:16
If you have information about the stroke width of the characters, you can look for bigger clusters. You can also use Hough to detect the gaps, so that the noise does not disturb the result as much. – nibra Jun 14 '13 at 15:07
I am getting only 60% accuracy in the character matching part. How can I improve that? For character matching I am using my own method to match character images. – TLE Jun 20 '13 at 06:28
How do I detect the spacing between characters? After getting the characters from the image we have to build sentences, and for that we have to insert spaces. – TLE Jun 30 '13 at 11:35
You'll have to calculate that from the positions of the characters. – nibra Jun 30 '13 at 17:09
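A possible way to derive spaces from the character positions, as a sketch only: compare the horizontal gap between neighbouring character boxes with a typical character width and insert a space when the gap is large. The input format (character, left edge, right edge) and the `space_factor` of 0.5 are assumptions for the example and will need tuning per font.

```python
from statistics import median


def join_with_spaces(chars, space_factor=0.5):
    """Build the text of one line from recognised characters and their x-extent.

    chars: list of (character, left, right) tuples for a single text line
    (an assumed format, right edge exclusive). A gap wider than
    space_factor * median character width is treated as a word boundary.
    """
    if not chars:
        return ""
    ordered = sorted(chars, key=lambda c: c[1])
    typical_width = median(right - left for _, left, right in ordered)
    text = ordered[0][0]
    for prev, cur in zip(ordered, ordered[1:]):
        gap = cur[1] - prev[2]
        if gap > space_factor * typical_width:
            text += " "
        text += cur[0]
    return text
```

An alternative that adapts better to mixed font sizes is to look at the distribution of all gaps on the line and split it into two clusters (letter gaps and word gaps).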

score 5 · Answer 2 · answered Jun 15 '14 at 15:35

You should use adaptive thresholding instead of Otsu's method. I think this paper will be helpful: http://www.csse.uwa.edu.au/~shafait/papers/Shafait-efficient-binarization-SPIE08.pdf This method will automatically remove the noise.
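To give an idea of what adaptive thresholding looks like, here is a minimal pure-Python sketch of a Sauvola-style local threshold computed with integral images, which is my understanding of the kind of technique the linked paper describes. The threshold formula is t = m * (1 + k * (s / r - 1)), with m and s the mean and standard deviation in a window around the pixel. The defaults `window=15`, `k=0.2` and `r=128` are assumed starting values for the example, not values taken from this thread.

```python
import math


def sauvola_binarize(gray, window=15, k=0.2, r=128.0):
    """Binarize a grayscale image with a per-pixel (local) threshold.

    gray: list of rows of ints, dark text on a lighter background (assumed).
    Returns a list of rows with 1 = ink and 0 = background.
    """
    height = len(gray)
    width = len(gray[0]) if height else 0

    # Integral images of the values and of the squared values, with a zero
    # border, so any window sum costs four lookups regardless of window size.
    s1 = [[0] * (width + 1) for _ in range(height + 1)]
    s2 = [[0] * (width + 1) for _ in range(height + 1)]
    for y in range(height):
        row_sum = row_sq = 0
        for x in range(width):
            v = gray[y][x]
            row_sum += v
            row_sq += v * v
            s1[y + 1][x + 1] = s1[y][x + 1] + row_sum
            s2[y + 1][x + 1] = s2[y][x + 1] + row_sq

    half = window // 2
    out = [[0] * width for _ in range(height)]
    for y in range(height):
        y0, y1 = max(0, y - half), min(height, y + half + 1)
        for x in range(width):
            x0, x1 = max(0, x - half), min(width, x + half + 1)
            n = (y1 - y0) * (x1 - x0)  # windows are clipped at the borders
            total = s1[y1][x1] - s1[y0][x1] - s1[y1][x0] + s1[y0][x0]
            total_sq = s2[y1][x1] - s2[y0][x1] - s2[y1][x0] + s2[y0][x0]
            mean = total / n
            variance = max(0.0, total_sq / n - mean * mean)
            threshold = mean * (1 + k * (math.sqrt(variance) / r - 1))
            out[y][x] = 1 if gray[y][x] < threshold else 0
    return out
```

Because the threshold follows the local mean, uneven lighting does not turn the darker side of the page into ink, which is where a single global threshold such as Otsu's tends to struggle.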

answered Jun 15 '14 at 15:35

Stupi

score 3 · Answer 3 · edited Mar 08 '17 at 01:09

You may want to look into Tesseract for the character recognition part.

edited Mar 08 '17 at 01:09

Andrew Myers

answered Jun 14 '13 at 03:05

Engineero

The Google Vision API is worth looking into too. It performs OCR, although I haven't tried it. – absin Jun 18 '18 at 04:30

score 1 · Answer 4 · answered Nov 18 '14 at 12:50

You can use potrace to reduce the noise. It vectorises the given image (BMP) and converts it to SVG, PDF and some other formats.

http://potrace.sourceforge.net/potrace.html

answered Nov 18 '14 at 12:50

Magesh Vs
