good metrics for array of strings distance

Question

I have two arrays, S and T, containing words (lowercased, trimmed, without diacritics). The number of words can differ. Most of the data is some kind of proper name, rather short (<5).

I need to find a good metric (and an implementation of it, or maybe even a research paper) that lets me calculate how similar those arrays are.

Some ideas I have so far:

scoring all words that are present in both arrays
scoring all words that are present at the same position in both arrays
scoring longest common sequences
all of the above, plus taking into account the relative position of the index (matches near the beginning are more important)
some type of Levenshtein distance (insert / delete count) using words instead of characters
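
To make the last two ideas concrete, here is a minimal sketch of a word-level edit distance and a position-weighted match score. It is only an illustration: the function names and the `decay` parameter are made up for this example, and the edit distance also allows substituting one word for another, which goes slightly beyond a pure insert/delete count.

```python
def word_edit_distance(s, t):
    """Levenshtein distance where the unit is a whole word, not a character."""
    prev = list(range(len(t) + 1))
    for i, sw in enumerate(s, start=1):
        curr = [i]
        for j, tw in enumerate(t, start=1):
            cost = 0 if sw == tw else 1
            curr.append(min(prev[j] + 1,          # delete sw
                            curr[j - 1] + 1,      # insert tw
                            prev[j - 1] + cost))  # keep or substitute
        prev = curr
    return prev[-1]


def word_similarity(s, t):
    """Edit distance rescaled to [0, 1]; 1.0 means identical arrays."""
    longest = max(len(s), len(t))
    return 1.0 if longest == 0 else 1.0 - word_edit_distance(s, t) / longest


def positional_score(s, t, decay=0.5):
    """Weighted share of positions where s[i] == t[i]; early positions count more.

    `decay` is a hypothetical tuning knob: position i gets weight decay ** i.
    """
    n = max(len(s), len(t))
    if n == 0:
        return 1.0
    weights = [decay ** i for i in range(n)]
    matched = sum(w for i, w in enumerate(weights)
                  if i < len(s) and i < len(t) and s[i] == t[i])
    return matched / sum(weights)
```

For example, `word_edit_distance(["john", "smith"], ["john", "paul", "smith"])` is 1, because a single word has to be inserted.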

any other ideas?

This looks more like an invitation for a long discussion rather than a [*"practical, answerable question based on actual problems that you face"*](http://stackoverflow.com/faq). I love the theme, but unfortunately I think this is not the right place for the discussion. — Bruno Reis, Oct 19 '12 at 04:36
I am searching for specific answers: libraries, research papers, algorithms — ts., Oct 19 '12 at 06:27

score 1 · Answer 1 · answered Oct 19 '12 at 07:33

To me, this looks like modeling documents with a bag-of-words model: http://en.wikipedia.org/wiki/Bag-of-words_model

Depending on your application, you can use different criteria for comparing two bag-of-words feature vectors, such as the ones you listed in your question. In addition, there are models based on learning statistical relationships between different words/sentences, such as topic models: http://en.wikipedia.org/wiki/Topic_model
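
As a rough illustration of comparing two bag-of-words vectors, the sketch below computes Jaccard overlap and cosine similarity with the standard library only. Both measures ignore word order entirely, and the function names are just chosen for this example.

```python
from collections import Counter
from math import sqrt


def jaccard(s, t):
    """Shared distinct words divided by all distinct words."""
    a, b = set(s), set(t)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def cosine(s, t):
    """Cosine of the angle between the two word-count vectors."""
    a, b = Counter(s), Counter(t)
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = sqrt(sum(c * c for c in a.values())) * sqrt(sum(c * c for c in b.values()))
    return dot / norm if norm else 0.0
```

With `["john", "smith"]` and `["smith", "john", "paul"]`, `jaccard` gives 2/3: two shared words out of three distinct ones.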

score 0 · Answer 2 · answered Oct 19 '12 at 04:32

If the arrays are rather short, then you can find the optimal pairing of the words given some rubric of word similarity. Then layer some scoring on top for how far the array has to be rotated/contorted for the optimally paired words to line up. This could be some kind of multiplier or maybe some other system.
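
A brute-force sketch of that idea, workable only because the arrays are short (the search is factorial in the array length). Everything here is an assumption made for illustration: `difflib.SequenceMatcher` stands in for the word-similarity rubric, and `shift_penalty` is a hypothetical multiplier that discounts a pair by how far apart its two positions are.

```python
from difflib import SequenceMatcher
from itertools import permutations


def word_sim(a, b):
    """Stand-in rubric: difflib's similarity ratio in [0, 1]."""
    return SequenceMatcher(None, a, b).ratio()


def best_pairing_score(s, t, shift_penalty=0.25):
    """Try every one-to-one pairing of the shorter array into the longer one.

    Each pair contributes its word similarity, divided by
    1 + shift_penalty * |i - j| so that displaced pairs count less.
    Unpaired words in the longer array contribute nothing.
    """
    short, long_ = (s, t) if len(s) <= len(t) else (t, s)
    if not long_:
        return 1.0
    best = 0.0
    for targets in permutations(range(len(long_)), len(short)):
        total = sum(word_sim(w, long_[j]) / (1.0 + shift_penalty * abs(i - j))
                    for i, (w, j) in enumerate(zip(short, targets)))
        best = max(best, total)
    return best / len(long_)
```

Identical arrays score 1.0, and the same words in swapped order score lower because of the displacement discount. For longer arrays, an assignment solver such as `scipy.optimize.linear_sum_assignment` would be the usual replacement for the permutation loop.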

One metric of word similarity that we recently learned about in Natural Language Processing is Levenshtein distance. There are other, more complex variants such as the Smith-Waterman algorithm (it's linked on the wiki page). These algorithms are meant to measure orthographic similarity, so they are used in morphological analysis to give an idea of how similar words are based on appearance. Because Smith-Waterman scores only the best local alignment, it says that if one word is contained within the other word then they are extremely similar, no matter how long the suffix/prefix is.
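
For reference, a minimal sketch of both measures at the character level. The Smith-Waterman scoring values (`match=2`, `mismatch=-1`, `gap=-1`) are arbitrary choices for this example, and only the best local alignment score is returned, not the alignment itself.

```python
def levenshtein(a, b):
    """Minimum number of single-character inserts, deletes and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                # delete ca
                            curr[j - 1] + 1,            # insert cb
                            prev[j - 1] + (ca != cb)))  # keep or substitute
        prev = curr
    return prev[-1]


def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-1):
    """Best local alignment score; cells are floored at 0 so an alignment can start anywhere."""
    best = 0
    prev = [0] * (len(b) + 1)
    for ca in a:
        curr = [0]
        for j, cb in enumerate(b, start=1):
            diag = prev[j - 1] + (match if ca == cb else mismatch)
            curr.append(max(0, diag, prev[j] + gap, curr[j - 1] + gap))
            best = max(best, curr[j])
        prev = curr
    return best
```

The containment behaviour shows up directly: `smith_waterman_score("ann", "joanna")` equals `smith_waterman_score("ann", "ann")`, while `levenshtein("ann", "joanna")` is 3.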

I just realized you mentioned Levenshtein distance explicitly, but we're talking about it in slightly different contexts so I won't edit it out of mine. — emschorsch, Oct 19 '12 at 04:33

score 0 · Answer 3 · answered Oct 19 '12 at 04:42

If the strings are Western names, Soundex might be a starting point.
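
A sketch of the common American Soundex rules, assuming plain ASCII input (which matches the question's "without diacritics"). Words that sound alike get the same four-character code, so the codes can be compared in place of the raw words in any of the array-level measures.

```python
# Consonant groups of American Soundex; vowels, y, h and w have no code.
_CODES = {c: d
          for d, letters in {"1": "bfpv", "2": "cgjkqsxz", "3": "dt",
                             "4": "l", "5": "mn", "6": "r"}.items()
          for c in letters}


def soundex(word):
    """Return the 4-character Soundex code, or "" if the word has no letters."""
    letters = [c for c in word.lower() if c.isalpha()]
    if not letters:
        return ""
    code = letters[0].upper()
    last = _CODES.get(letters[0], "")
    for c in letters[1:]:
        digit = _CODES.get(c, "")
        if digit:
            if digit != last:   # adjacent letters with the same code count once
                code += digit
            last = digit
        elif c not in "hw":     # a vowel separates repeats; h and w do not
            last = ""
        if len(code) == 4:
            break
    return code.ljust(4, "0")
```

For instance, `soundex("robert")` and `soundex("rupert")` both give `R163`.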

WaywiserTundish
