Note that I'm really looking for an answer to my question. I am not looking for a link to some source code or to some academic paper: I've already used the source and I've already read papers and still haven't figured out the last part of this issue...
I'm working on some fast screen font OCRing and I'm making very good progress.
I'm already finding the baselines, separating the characters, converting each character to black and white, and then tracing each character's contour in order to apply a Freeman chain code to it.
Basically it's an 8-connected chain code looking like this:
3  2  1
 \ | /
4-- --0
 / | \
5  6  7
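To make the direction numbering concrete, here is a minimal Python sketch of the diagram above as a table of pixel steps. It assumes image coordinates where x grows to the right and y grows downward, so code 2 ("up") has a negative dy; the names `FREEMAN_STEPS`, `trace` and `encode` are made up for this illustration.

```python
# Freeman 8-direction codes as (dx, dy) steps.
# Assumption: x grows to the right, y grows downward (usual image layout).
FREEMAN_STEPS = {
    0: (1, 0),    # right
    1: (1, -1),   # up-right
    2: (0, -1),   # up
    3: (-1, -1),  # up-left
    4: (-1, 0),   # left
    5: (-1, 1),   # down-left
    6: (0, 1),    # down
    7: (1, 1),    # down-right
}


def trace(chain, start=(0, 0)):
    """Return every point visited when following a chain code from `start`."""
    x, y = start
    points = [(x, y)]
    for digit in chain:
        dx, dy = FREEMAN_STEPS[int(digit)]
        x, y = x + dx, y + dy
        points.append((x, y))
    return points


def encode(points):
    """Inverse of trace: turn consecutive 8-connected points into a chain code."""
    codes = {step: code for code, step in FREEMAN_STEPS.items()}
    return "".join(
        str(codes[(x1 - x0, y1 - y0)])
        for (x0, y0), (x1, y1) in zip(points, points[1:])
    )
```

For example, `trace("0246")` walks right, up, left and down around a one-pixel square and ends where it started.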
So if I have an 'a', after all my transformations (including transforming to black and white), I end up with something like this:
11110
00001
01111
10001
10001
01110
Then its external contour may look like this (I may be making a mistake here: this is ASCII-art contouring and my 'algorithm' may get the contour wrong, but that's not the point of my question):
XXXX
X1111X
XXXX1X
X01111X
X10001X
X10001X
X111X
XXX
Following the Xs, I get the chain code, which would be:
0011222334445656677
Note that this is the normalized chain code, but you can always normalize a chain code like this: of all the cyclic rotations of the code, you just keep the one that reads as the smallest integer.
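As a rough sketch of that normalization, assuming the chain code is held as a string of digits and that "smallest integer" means the smallest of its cyclic rotations (the function name `normalize` is made up for this illustration):

```python
def normalize(chain):
    """Return the cyclic rotation of `chain` that reads as the smallest integer.

    All rotations have the same length, so comparing the digit strings
    lexicographically gives the same order as comparing them as integers.
    """
    if not chain:
        return chain
    rotations = (chain[i:] + chain[:i] for i in range(len(chain)))
    return min(rotations)
```

This tries every rotation, which is quadratic in the length of the code; that should be fine for contours of small screen-font glyphs, and it makes the result independent of which contour pixel the trace started from.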
(By the way, there's a super-efficient implementation to find the chain code where you simply take the 8 adjacent pixels of an 'X' and then look up in a 256-entry lookup table whether you have 0, 1, 2, 3, 4, 5, 6 or 7.)
My question now, however, is: from that 0011222334445656677 chain code, how do I find that I have an 'a'?
Because, for example, if my 'a' looks like this:
11110
00001
01111
10001
10001
01111 <-- This pixel is now full
Then my chain code is now: 0002222334445656677
And yet this is also an 'a'.
I know that the whole point of these chain codes is to be resilient to such tiny changes, but I can't figure out how I'm supposed to find which character corresponds to a given chain code.
I've gotten this far and now I'm stuck...
(By the way, I don't need 100% accuracy, and things like differentiating '0' from 'O' or from 'o' aren't really an issue.)