Levenshtein distance is a gauge of the distance between two strings. Is there a similar metric for assessing how similar a group of strings are to each other, a kind of mean or group distance?

EDIT: I realize now that my question above could be better worded. Let's say I have a list of strings:

['aaa','aab','aaaa','aacd']

Is there a way to compute the degree to which the strings in the above list are similar to one another? Some sort of measurement?

Chris T
  • 1
    There are many metrics for strings. Off the top of my head, you could use `Hamming distance`. You can also look [here](https://en.wikipedia.org/wiki/String_metric), which has a list of string metrics. Just pick the one you deem suitable. – Arnab Roy Jun 18 '19 at 12:58
  • Hi Arnab, thank you for this list. These metrics seem to be for two strings; I'm looking for an algorithm that computes the similarity of a group of more than 2 strings to each other, or an algorithm that applies something like Levenshtein/Hamming to n > 2 strings. – Chris T Jun 18 '19 at 13:02
  • 2
    Are you, perhaps, looking for a technique for clustering groups of strings such that the intra-group distances are minimised (approximately speaking) and the inter-group distances are maximised (approximately speaking) ? – High Performance Mark Jun 18 '19 at 13:02
  • @HighPerformanceMark That's closer to what I want, and a good way of putting it – Chris T Jun 18 '19 at 13:03
  • 1
    What exactly are you asking? Ways to "average" the distance between individual strings in the group, or a faster way to determine the "group" distance (i.e. faster than computing all the n² pairwise distances in the first place)? – tobias_k Jun 18 '19 at 13:06
  • 1
    Well, if you want to avoid measuring the distances between all pairs of strings I think you'll have to come up with some suitable approximation for 'string position', make a rough clustering based on that, then refine it. Rather like using lat/long and a square grid as a first step in establishing nearness of clustering of points in 2d space. Of course, your initial approximation doesn't have to be 2d, though you want to keep it to small -d. But how you make such an approximation I know not, though a good choice may depend on the nature of your strings. – High Performance Mark Jun 18 '19 at 13:08
  • @tobias_k The first of the two options you listed is what I'm looking for. – Chris T Jun 18 '19 at 13:11
  • 1
    Questions like this come up often with the graph database Neo4j. Not saying an answer exists there, but your question rings a bell in that area. – Guy Coder Jun 18 '19 at 13:25
  • 2
    Anything wrong with average of pairwise distance or maximum distance between any two objects in the group? – SaiBot Jun 18 '19 at 13:51
  • @SaiBot I'm sure those would be adequate; I just wasn't sure if there was a superior technique I should be aware of. – Chris T Jun 18 '19 at 13:53
  • Would be good to know the name for this problem, maybe there are more efficient approximations of the average pairwise similarity. – Radio Controlled Sep 05 '22 at 11:04
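
For reference, the two aggregates suggested in the comments (the average pairwise distance and the maximum pairwise distance) can be sketched as below. This is only an illustration, not an established "group distance" metric: the function names `levenshtein` and `group_distances` are made up for this example, and it still computes all n(n-1)/2 pairs, so it does nothing to avoid the quadratic cost raised above.

```python
from itertools import combinations


def levenshtein(a: str, b: str) -> int:
    """Edit distance (insert, delete, substitute) via row-by-row dynamic programming."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    prev = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def group_distances(strings):
    """Return (mean pairwise distance, max pairwise distance) for a list of strings.

    Hypothetical helper for illustration; returns (0.0, 0) for fewer than two strings.
    """
    dists = [levenshtein(a, b) for a, b in combinations(strings, 2)]
    if not dists:
        return 0.0, 0
    return sum(dists) / len(dists), max(dists)


mean_dist, max_dist = group_distances(['aaa', 'aab', 'aaaa', 'aacd'])
print(mean_dist, max_dist)
```

A lower mean suggests a tighter group overall, while the maximum (the group's "diameter") flags the single worst-matching pair. Either number could be divided by the longest string length if groups with different string lengths need to be compared; that normalization is an assumption here, not something from the thread.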

0 Answers