Is there a way to measure the similarity of a group of strings? A kind of 'mass' Levenshtein distance?

Question

Levenshtein distance is a gauge of the distance between two strings. Is there a similar metric for assessing how similar a group of strings are to each other, a kind of mean or group distance?

EDIT: I realize now that my question above could be better worded. Let's say I have a list of strings:

['aaa','aab','aaaa','aacd']

Is there a way to compute the degree to which every string in the above list is similar to each other? Some sort of measurement?

There are many string metrics. Off the top of my head, you could use `Hamming distance`. You can also look [here](https://en.wikipedia.org/wiki/String_metric), which has a list of string metrics; just pick the one that suits you. — Arnab Roy, Jun 18 '19 at 12:58
Hi Arnab, thank you for this list. These metrics seem to be for two strings; I'm looking for an algorithm that computes the similarity of a group of more than 2 strings to each other, or an algorithm that applies something like Levenshtein/Hamming to n > 2 strings. — Chris T, Jun 18 '19 at 13:02
Are you, perhaps, looking for a technique for clustering groups of strings such that the intra-group distances are minimised (approximately speaking) and the inter-group distances are maximised (approximately speaking)? — High Performance Mark, Jun 18 '19 at 13:02
@HighPerformanceMark That's closer to what I want, and a good way of putting it — Chris T, Jun 18 '19 at 13:03
What exactly are you asking? Ways to "average" the distance between individual strings in the group, or a faster way to determine the "group" distance (i.e. faster than computing all the n² pairwise distances in the first place)? — tobias_k, Jun 18 '19 at 13:06
Well, if you want to avoid measuring the distances between all pairs of strings, I think you'll have to come up with some suitable approximation for 'string position', make a rough clustering based on that, then refine it. Rather like using lat/long and a square grid as a first step in establishing nearness of clustering of points in 2d space. Of course, your initial approximation doesn't have to be 2d, though you want to keep the number of dimensions small. But how you make such an approximation I know not, though a good choice may depend on the nature of your strings. — High Performance Mark, Jun 18 '19 at 13:08
@tobias_k The first of the two options you listed is what I'm looking for. — Chris T, Jun 18 '19 at 13:11
Questions like this come up often with the graph database Neo4j. I'm not saying an answer exists there, only that your question rings a bell in that area. — Guy Coder, Jun 18 '19 at 13:25
Anything wrong with average of pairwise distance or maximum distance between any two objects in the group? — SaiBot, Jun 18 '19 at 13:51
@SaiBot I'm sure those would be adequate; I just wasn't sure if there was a superior technique I should be aware of. — Chris T, Jun 18 '19 at 13:53
Would be good to know the name for this problem, maybe there are more efficient approximations of the average pairwise similarity. — Radio Controlled, Sep 05 '22 at 11:04
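
For reference, the two simple options raised in the comments (the average pairwise distance and the maximum distance between any two strings in the group) can be sketched as follows. This is only a minimal illustration, not an established "group Levenshtein" metric: the function names `levenshtein` and `group_distances` are made up for this example, and the edit distance is a plain dynamic-programming implementation with unit costs. It still computes all n(n-1)/2 pairs, so it does not address the efficiency concern.

```python
from itertools import combinations


def levenshtein(a: str, b: str) -> int:
    """Edit distance with unit-cost insertions, deletions and substitutions."""
    # Keep the shorter string as `b` so each DP row is as small as possible.
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))  # distances from "" to prefixes of b
    for i, ca in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete ca
                current[j - 1] + 1,      # insert cb
                previous[j - 1] + cost,  # substitute (or match)
            ))
        previous = current
    return previous[-1]


def group_distances(strings):
    """Return (mean, max) Levenshtein distance over all unordered pairs.

    Hypothetical helper: a group with fewer than two strings gives (0.0, 0).
    """
    distances = [levenshtein(a, b) for a, b in combinations(strings, 2)]
    if not distances:
        return 0.0, 0
    return sum(distances) / len(distances), max(distances)


mean_d, max_d = group_distances(['aaa', 'aab', 'aaaa', 'aacd'])
print(mean_d, max_d)
```

A lower mean suggests a tighter group on average, while the maximum (sometimes called the diameter of the group) shows how far apart the two most dissimilar strings are. Which of the two is more useful probably depends on whether outliers matter for your use case.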
