I have two arrays, S and T, containing words (lowercased, trimmed, without diacritics). The number of words in each can differ. (Most of the data are proper names of some kind, and rather short (<5).)
I need to find a good metric (and an implementation of it, or maybe even a research paper) that lets me calculate how similar those arrays are.
Some ideas I have so far:
- scoring all words which are present in both arrays
- scoring all words which are present at the same position in both arrays
- scoring longest common sequences
- all of the above + taking into account the relative position of the index (matches near the beginning are more important)
- some type of Levenshtein distance (insert / delete count) using words instead of characters
Any other ideas?
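
To make the ideas above concrete, here is a rough Python sketch of what I have in mind. All function names and the `decay` parameter are my own placeholders, not from any library, and the normalisations are just one possible choice. The insert/delete-only distance is computed from the longest common subsequence, since with only those two operations the distance equals len(S) + len(T) - 2 * LCS.

```python
from typing import Sequence


def jaccard(s: Sequence[str], t: Sequence[str]) -> float:
    """Share of distinct words present in both arrays (order ignored)."""
    a, b = set(s), set(t)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def positional_match(s: Sequence[str], t: Sequence[str], decay: float = 1.0) -> float:
    """Score words found at the same index in both arrays.

    decay < 1 makes early positions count more (weight decay**i at index i);
    decay == 1 weights every position equally. The decay scheme is an assumption.
    """
    longest = max(len(s), len(t))
    if longest == 0:
        return 1.0
    weights = [decay ** i for i in range(longest)]
    hit = sum(w for w, x, y in zip(weights, s, t) if x == y)
    return hit / sum(weights)


def lcs_length(s: Sequence[str], t: Sequence[str]) -> int:
    """Length of the longest common subsequence, comparing whole words."""
    prev = [0] * (len(t) + 1)
    for x in s:
        cur = [0]
        for j, y in enumerate(t, 1):
            cur.append(prev[j - 1] + 1 if x == y else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def indel_distance(s: Sequence[str], t: Sequence[str]) -> int:
    """Word-level edit distance allowing only insertions and deletions."""
    return len(s) + len(t) - 2 * lcs_length(s, t)


def indel_similarity(s: Sequence[str], t: Sequence[str]) -> float:
    """Insert/delete distance rescaled to [0, 1], where 1 means identical."""
    total = len(s) + len(t)
    return 1.0 if total == 0 else 1.0 - indel_distance(s, t) / total
```

For example, with S = ["john", "ronald", "tolkien"] and T = ["john", "tolkien"], the Jaccard score is 2/3, the unweighted positional score is 1/3, the LCS length is 2, and the insert/delete distance is 1.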