I am writing a program that should recognize a single character from an image of it.

I think it should be fairly easy given how powerful OCR software has become, but I have no real idea how to do it.

Here are the specifics:

  • The language is Persian

  • The character is not handwritten.

  • There are no words or sentences; the image is of a single character generated from a PDF file. It will look like this:

[Image: Persian letter "seen"]

Ideally, I would run OCR on this image and determine the character.

So far, however, I have been using another approach. The fonts used in the PDF files come from a finite set (roughly 100), and of those only 2-3 are usually used. So I can "cheat": compare this character against all the characters of these 100 fonts and determine what it is.

As an example, these are some of the characters in the font "Roya". I intended to compare my character image with each of these to determine the letter, repeating for every other font until a match is found.

[Image: Roya font characters]
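The lookup I have in mind could be sketched roughly like this. It is only a minimal sketch under assumptions: the reference glyphs are taken to be already loaded as same-sized binary bitmaps (`true` = ink), grouped by font name and letter name, and `TemplateMatcher`, `Match`, `mismatch`, `bestMatch` and the `maxDistance` cutoff are hypothetical names I made up for illustration, not part of any library.

```java
import java.util.Map;

class TemplateMatcher {
    /** Best candidate found: font name, letter name, and how far off it was. */
    static final class Match {
        final String font;
        final String letter;
        final double distance;

        Match(String font, String letter, double distance) {
            this.font = font;
            this.letter = letter;
            this.distance = distance;
        }
    }

    /** Fraction of differing pixels between two binary bitmaps (0.0 = identical). */
    static double mismatch(boolean[][] a, boolean[][] b) {
        // Bitmaps of different (or zero) size are treated as a complete mismatch.
        if (a.length == 0 || a.length != b.length || a[0].length != b[0].length) {
            return 1.0;
        }
        int diff = 0;
        for (int y = 0; y < a.length; y++) {
            for (int x = 0; x < a[0].length; x++) {
                if (a[y][x] != b[y][x]) {
                    diff++;
                }
            }
        }
        return (double) diff / (a.length * a[0].length);
    }

    /**
     * Compares the glyph against every letter of every font and returns the
     * closest one, or null when even the closest is farther than maxDistance.
     * The fonts map is font name -> (letter name -> reference bitmap).
     */
    static Match bestMatch(boolean[][] glyph,
                           Map<String, Map<String, boolean[][]>> fonts,
                           double maxDistance) {
        Match best = null;
        for (Map.Entry<String, Map<String, boolean[][]>> font : fonts.entrySet()) {
            for (Map.Entry<String, boolean[][]> letter : font.getValue().entrySet()) {
                double d = mismatch(glyph, letter.getValue());
                if (best == null || d < best.distance) {
                    best = new Match(font.getKey(), letter.getKey(), d);
                }
            }
        }
        return (best != null && best.distance <= maxDistance) ? best : null;
    }
}
```

Picking the closest template instead of stopping at the first exact hit means a slightly imperfect rendering can still resolve to a letter; a sensible value for `maxDistance` would have to be found by experiment.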

I was doing a bitmap compare with ImageMagick, but I realized that even when the font is the same, there are still small differences between character images generated from it.

As an example, these two are both the character "beh" from the font "Zar". But as you can see, a bitmap compare between them will not produce an exact match:

[Image: Persian letter "beh"] [Image: Persian letter "beh"]
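One way I could imagine making the comparison tolerant of such differences is to normalize both images first: binarize, crop to the bounding box of the ink, resample to a fixed grid, and then measure the fraction of differing cells instead of demanding an exact match. The following is only a sketch under assumptions: opaque images with a dark glyph on a light background, and hypothetical names (`GlyphNormalizer`, `binarize`, `normalize`, `distance`) plus a threshold and grid size chosen by the caller.

```java
import java.awt.image.BufferedImage;

class GlyphNormalizer {
    /** Marks a pixel as ink when its average brightness (0-255) is below the threshold. */
    static boolean[][] binarize(BufferedImage img, int threshold) {
        boolean[][] ink = new boolean[img.getHeight()][img.getWidth()];
        for (int y = 0; y < img.getHeight(); y++) {
            for (int x = 0; x < img.getWidth(); x++) {
                int rgb = img.getRGB(x, y);
                int r = (rgb >> 16) & 0xFF;
                int g = (rgb >> 8) & 0xFF;
                int b = rgb & 0xFF;
                ink[y][x] = (r + g + b) / 3 < threshold;
            }
        }
        return ink;
    }

    /** Crops to the bounding box of the ink and resamples to size x size (nearest neighbour). */
    static boolean[][] normalize(boolean[][] ink, int size) {
        int top = Integer.MAX_VALUE, left = Integer.MAX_VALUE, bottom = -1, right = -1;
        for (int y = 0; y < ink.length; y++) {
            for (int x = 0; x < ink[y].length; x++) {
                if (ink[y][x]) {
                    top = Math.min(top, y);
                    bottom = Math.max(bottom, y);
                    left = Math.min(left, x);
                    right = Math.max(right, x);
                }
            }
        }
        boolean[][] out = new boolean[size][size];
        if (bottom < 0) {
            return out; // blank image: no ink found
        }
        int h = bottom - top + 1;
        int w = right - left + 1;
        for (int y = 0; y < size; y++) {
            for (int x = 0; x < size; x++) {
                out[y][x] = ink[top + y * h / size][left + x * w / size];
            }
        }
        return out;
    }

    /** Fraction of differing cells between two normalized grids of equal size. */
    static double distance(boolean[][] a, boolean[][] b) {
        int diff = 0;
        for (int y = 0; y < a.length; y++) {
            for (int x = 0; x < a[y].length; x++) {
                if (a[y][x] != b[y][x]) {
                    diff++;
                }
            }
        }
        return (double) diff / (a.length * a[0].length);
    }
}
```

This removes differences in position and scale, but stretching every glyph to a square discards its aspect ratio, so letters that differ mainly in width could become harder to tell apart. Anti-aliasing differences would still show up as a small non-zero distance, so the result has to be compared against a tolerance rather than zero.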

So, given all this, how should I go about doing the OCR?

Other notes:

  • The program is written in Java, but a standalone application or a C/C++ library is also acceptable.
  • I tried using Tesseract, but I couldn't get it to detect characters. Persian support was very poorly documented, and it looked like it would need a lot of calibration and training. It also seemed to be optimized for detecting words and gave very bad results on single characters.
Pouria P
  • If possible, you can use machine learning for this (a simple artificial neural network can be sufficient; a CNN is possibly overkill). If you can gather enough character images, you can train a network and use it for identification. – mcy Jan 20 '22 at 08:27
  • @mcy It seems to me that would be overkill. Like I said, I only have 100 fonts. Even if I try to train a neural network, will 100 samples be enough for it? Most of these things are optimized for training on much more data and then being used for much more complex problems. I would still appreciate and happily try a more specific solution, though, like "use library 'X' and then train it 'n' times with 'Y' configuration". – Pouria P Jan 20 '22 at 08:34
  • I see. You can take a look at https://github.com/Erfaniaa/Persian-OCR ; it may have some tools for you. There are a lot of scientific papers on Persian font recognition; these may be of some guidance as well. But aside from those, sadly, I did not see any ready tools or image database (such as MNIST). – mcy Jan 20 '22 at 08:45
