what is a good metric for deciding if 2 Strings are "similar enough"

Question

I'm working on a very rough, first-draft algorithm to determine how similar 2 Strings are. I'm also using Levenshtein Distance to calculate the edit distance between the Strings.

What I'm doing currently is basically taking the total number of edits and dividing it by the size of the larger String. If that value is below some threshold, currently randomly set to 25%, then they are "similar enough".

However, this is totally arbitrary and I don't think is a very good way to calculate similarity. Is there some kind of math equation or probability/statistics approach to taking the Levenshtein Distance data and using it to say "yes, these strings are similar enough based on the number of edits made and the size of the strings"?

Also, the key thing here is that I'm using an arbitrary threshold and I would prefer not to do that. How can I compute this threshold instead of assign it so that I can safely say that 2 Strings are "similar enough"?

UPDATE

I'm comparing strings that represent a Java stack trace. The reason I want to do this is to group a bunch of given stack traces by similarity and use it as a filter to sort "stuff" :) This grouping is important for a higher level reason which I can't exactly share publicly.

So far, my algorithm (pseudo code) is roughly along the lines of:

/*
 * The input lists represent the Strings I want to test for similarity. The
 * Strings are split apart based on new lines / carriage returns because Java
 * stack traces are not a giant one-line String, rather a multi-line String.
 * So each element in the input lists is a "line" from its stack trace.
 */
calculate similarity (List<String> list1, List<String> list2) {

    length1 = 0;
    length2 = 0;
    levenshteinDistance = 0;

    iterator1 = list1.iterator();
    iterator2 = list2.iterator();

    while ( iterator1.hasNext() && iterator2.hasNext() ) {

        // skip blank/empty lines because they are not interesting
        str1 = iterator1.next();    length1 += str1.length();
        str2 = iterator2.next();    length2 += str2.length();

        levensteinDistance += getLevenshteinDistance(str1, str2);
    }

    // handle the rest of the lines from the iterator that has not terminated

    difference = levenshteinDistance / Math.max(length1, length2);

    return (difference < 0.25) ? true : false; // <- arbitrary threshold, yuck!
}

Are you aware of [mutual information](http://en.wikipedia.org/wiki/Mutual_information)? — Flexo, Dec 09 '11 at 21:02
The best metric will depend on what you mean by 'similarity.' What are you comparing and why? — Garrett Hall, Dec 09 '11 at 21:05

score 20 · Accepted Answer · edited Dec 09 '11 at 23:52

20

How about using cosine similarity? This is a general technique to assess similarity between two texts. It works as follows:

Take all the letters from both Strings an build a table like this:

Letter | String1 | String2

This can be a simple hash table or whatever.

In the letter column put each letter and in the string columns put their frequency inside that string (if a letter does not appear in a string the value is 0).

It is called cosine similarity because you interpret each of the two string columns as vectors, where each component is the number associated to a letter. Next, compute the cosine of the "angle" between the vectors as:

C = (V1 * V2) / (|V1| * |V2|)

The numerator is the dot product, that is the sum of the products of the corresponding components, and the denominator is the product of the sizes of the vectors.

How close C is to 1 gives you how similar the Strings are.

It may seem complicated but it's just a few lines of code once you understand the idea.

Let's see an example: consider the strings

s1 = aabccdd
s2 = ababcd

The table looks like:

Letter a b c d
s1     2 1 2 2
s2     2 2 1 1

And thus:

C = (V1 * V2) / (|V1| * |V2|) = 
(2 * 2 + 1 * 2 + 2 * 1 + 2 * 1) / (sqrt(13) * sqrt(10)) = 0.877

So they are "pretty" similar.

edited Dec 09 '11 at 23:52

Hristo

45,559
65
163
230

answered Dec 09 '11 at 21:11

Tudor

61,523
12
102
142

@Tudor... I like your example, but it is missing a critical step, which is what I'm most interested in. 0.877 seems like a good value to say "pretty similar", but how do we output `true` or `false` from 0.877 in a computable way? – Hristo Dec 09 '11 at 21:25
+1 @Tudor for this cool approach. I wonder how reliable this is as a metric of similarity of semantics? I mean two strings could have the exact same letters, but completely different meanings (e.g. "army" - "mary"). – G. Bach Dec 09 '11 at 21:27
@Hristo well it depends on your application design. You need to set a threshold between 0.0 and 1.0 which indicates what you consider to be "acceptably similar" and what not. – Tudor Dec 09 '11 at 21:27
@Tudor... yea, I don't have the data yet to know what is "acceptably similar" :) I was hoping to be able to compute that :D – Hristo Dec 09 '11 at 21:30
@G.Bach... it can be, but I don't want that threshold to be arbitrary. I was hoping to find a computable way to say `true` or `false` without having a threshold. – Hristo Dec 09 '11 at 21:31
@Hristo, of course you realize that true/false without a threshold can only be said if you want to test if they are identical or not. If they may differ by some characters, you need thresholds to identify how much mismatch you are willing to accept. – Tudor Dec 09 '11 at 21:34
@Hristo: now I noticed that you are comparing stack traces. In this case you can take an even more coarse-grain approach and compare at the word level, and not at the character level. – Tudor Dec 09 '11 at 21:36
let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/5725/discussion-between-hristo-and-tudor) – Hristo Dec 09 '11 at 21:41

score 4 · Answer 2 · answered Dec 09 '11 at 21:27

4

Stack traces are in a format amenable to parsing. I would just parse the stack traces using a parsing library and then you can extract whatever semantic content you want to compare.

Similarity algorithms are going to be slower and difficult to debug with when strings aren't comparing as you expect.

answered Dec 09 '11 at 21:27

Garrett Hall

29,524
10
61
76

+1 Thanks for your answer. That's definitely one approach. I'll consider some parsing libraries after I've finished my current approach. Then I'll be able to determine which approach is more accurate and produces the more desirable and expected results. – Hristo Dec 09 '11 at 21:35
Or hand-roll a parser with regexes if it's not complicated. – Garrett Hall Dec 09 '11 at 21:37
I have one regex that I use to parse out lines like `at java.util.blah.blah.method(File.java:100)`, but that isn't the only thing I'm considering. – Hristo Dec 09 '11 at 21:40

score 2 · Answer 3 · answered Dec 09 '11 at 21:31

Here's my take on this - just a long story to consider and not necessarily an answer to your problem:

I've done something similar in the past where I would try to determine if someone was plagiarizing by simply rearranging sentences while maintaining the same sort of message.

1 "children should play while we eat dinner"
2 "while we eat dinner, the children should play"
3 "we should eat children while we play"

So levenshtein wouldn't be of much use here because it is linear and each one would be considerably different. The standard difference would pass the test and the student would get away with the crime.

So I broke each word in the sentences up and recomposed the sentences as arrays, then compared each other to first determine if the word existed in each array, and where it was in relation to the last. Then each word would check the next in the array to determine if there were sequential words, like in my example sentences above line 1 and 2. So if there were sequential words, I would compose a string of each sequence common to each array and then attempt to find differences in the remaining words. The fewer remaining words, the more likely they are just filler to make it seem less plagiarized.

"while we eat dinner, I think the children should play"

Then "I think" is evaluated and considered filler based on a keyword lexicon - this part is hard to describe here.

This was a complex project that did a lot more than just what I described and not a simple chunk of code I can easily share, but the idea above is not too hard to replicate.

Good luck. I'm interested in what other SO members have to say about your question.

that is awesome and kudos on doing this :) For the purposes of my question, I'm dealing with stack traces, so I don't need to really worry about semantics. — Hristo, Dec 09 '11 at 21:39

NPE · Answer 4 · 2011-12-09T21:48:08.770

1

Since the Levenshtein distance is never greater than the length of the longer string, I'd certainly change the denominator from (length1 + length2) to Math.max(length1, length2). This would normalize the metric to be between zero and one.

Now, it's impossible to answer what's "similar enough" for your needs based on the information provided. I personally try to avoid step functions like you have with the 0.25 cutoff, preferring continuous values from a known interval. Perhaps it would be better to feed the continuous "similarity" (or "distance") values into higher-level algorithms instead of transforming those values into binary ones?

edited Dec 09 '11 at 21:48

answered Dec 09 '11 at 21:42

NPE

486,780
108
951
1,012

I also don't want to do the 0.25 cutoff. But I don't understand what you mean by "feed the continuous "similarity" (or "distance") values into higher-level algorithms instead of transforming those values into binary ones". Can you please clarify? – Hristo Dec 09 '11 at 21:50

what is a good metric for deciding if 2 Strings are "similar enough"

UPDATE

4 Answers4