
I want to match a string against another string obtained from OCR (Optical Character Recognition).

Usually, OCR-read text is imperfect. In my case, 5's are misrecognized as S, and so on.

So I am wondering if there's a way to calculate an edit distance with custom per-character costs.

For example, when calculating the distance from 5S00AS to SSOODS,

I would want the substitution cost for A to D to be large and the cost for 5 to S to be small, so that distance('5S00AS', 'SSOOAS') is much smaller than distance('5S00AS', 'PDDDDA').

I think Soundex is in a similar vein, except that there similar-sounding characters get a smaller distance. Here, similar-looking characters should get a smaller distance.

I wonder if there is already a function or package for doing this type of distance calculation.

KH Kim

1 Answer


You are looking for Levenshtein distance with custom scores. See the more general https://en.wikipedia.org/wiki/Needleman–Wunsch_algorithm

The NLTK Levenshtein implementation doesn't support custom scores, but Needleman-Wunsch implementations usually let you specify them (it's easy to modify the algorithm to use a matrix of custom distances).
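To illustrate how small that modification is, here is a minimal sketch in plain Python of the standard dynamic-programming edit distance with a per-pair substitution cost table. The function name `weighted_edit_distance`, the `OCR_COSTS` table, and the specific cost values are all assumptions made up for this example, not part of any library; you would tune the costs to your own OCR errors.

```python
def weighted_edit_distance(a, b, sub_cost=None, indel_cost=1.0, default_sub=1.0):
    """Edit distance where each substitution can have its own cost.

    sub_cost maps (char, char) pairs to a cost; pairs are treated as
    symmetric. Pairs not in the table cost default_sub.
    """
    sub_cost = sub_cost or {}

    def cost(x, y):
        if x == y:
            return 0.0
        return sub_cost.get((x, y), sub_cost.get((y, x), default_sub))

    # prev[j] holds the distance between the first i-1 chars of a
    # and the first j chars of b; only two rows are kept in memory.
    prev = [j * indel_cost for j in range(len(b) + 1)]
    for i, ca in enumerate(a, 1):
        curr = [i * indel_cost]
        for j, cb in enumerate(b, 1):
            curr.append(min(
                prev[j] + indel_cost,         # delete ca
                curr[j - 1] + indel_cost,     # insert cb
                prev[j - 1] + cost(ca, cb),   # substitute (or match)
            ))
        prev = curr
    return prev[-1]


# Hypothetical cost table: look-alike characters are cheap to swap.
OCR_COSTS = {
    ("5", "S"): 0.1,
    ("0", "O"): 0.1,
    ("A", "D"): 2.0,
}

near = weighted_edit_distance("5S00AS", "SSOOAS", OCR_COSTS)
far = weighted_edit_distance("5S00AS", "PDDDDA", OCR_COSTS)
print(near, far)
```

With these assumed costs, `near` comes out much smaller than `far`, which is the behaviour asked for in the question. Note that a substitution cost above `2 * indel_cost` has no extra effect, because the algorithm can always delete one character and insert the other instead.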

HGM