I'm working on a very rough, first-draft algorithm to determine how similar 2 Strings are. I'm also using Levenshtein Distance to calculate the edit distance between the Strings.
What I'm currently doing is taking the total number of edits and dividing it by the length of the longer String. If that value is below some threshold, currently set arbitrarily to 25%, then the Strings are "similar enough".
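To make the current check concrete, here is a minimal Java sketch of it for a single pair of Strings. The class and method names are hypothetical, and the Levenshtein implementation is just the textbook dynamic-programming version, included so the example is self-contained; it is not necessarily the one I actually use.

```java
// Sketch only: class and method names are hypothetical, not from any library.
final class NormalizedEditDistance {

    // Textbook two-row dynamic-programming Levenshtein distance.
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j; // turning "" into the first j chars of b takes j insertions
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i; // turning the first i chars of a into "" takes i deletions
            for (int j = 1; j <= b.length(); j++) {
                int cost = (a.charAt(i - 1) == b.charAt(j - 1)) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1),
                                   prev[j - 1] + cost);
            }
            int[] tmp = prev; prev = curr; curr = tmp; // last computed row is now in prev
        }
        return prev[b.length()];
    }

    // Edits divided by the length of the longer String: 0.0 means identical,
    // 1.0 means maximally different.
    static double normalizedDistance(String a, String b) {
        int longer = Math.max(a.length(), b.length());
        if (longer == 0) {
            return 0.0; // two empty Strings: treat as identical
        }
        return (double) levenshtein(a, b) / longer;
    }

    public static void main(String[] args) {
        double d = normalizedDistance("kitten", "sitting"); // 3 edits over 7 characters
        System.out.println(d);
        System.out.println(d < 0.25); // false: not "similar enough" under the 25% rule
    }
}
```

With the 25% rule, "kitten" and "sitting" (3 edits over 7 characters, roughly 43%) would not count as similar.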
However, this is totally arbitrary and I don't think it is a very good way to calculate similarity. Is there some kind of mathematical formula or probability/statistics approach that takes the Levenshtein Distance data and uses it to say "yes, these strings are similar enough based on the number of edits made and the size of the strings"?
Also, the key thing here is that I'm using an arbitrary threshold and I would prefer not to do that. How can I compute this threshold instead of assigning it, so that I can safely say that 2 Strings are "similar enough"?
UPDATE
I'm comparing strings that represent a Java stack trace. The reason I want to do this is to group a bunch of given stack traces by similarity and use it as a filter to sort "stuff" :) This grouping is important for a higher level reason which I can't exactly share publicly.
So far, my algorithm (pseudo code) is roughly along the lines of:
/*
 * The input lists represent the Strings I want to test for similarity. The
 * Strings are split apart based on new lines / carriage returns because Java
 * stack traces are not a giant one-line String, rather a multi-line String.
 * So each element in the input lists is a "line" from its stack trace.
 *
 * getLevenshteinDistance(String, String) is an existing helper that returns
 * the edit distance between two Strings as an int.
 */
boolean calculateSimilarity(List<String> list1, List<String> list2) {
    int length1 = 0;
    int length2 = 0;
    int levenshteinDistance = 0;

    Iterator<String> iterator1 = list1.iterator();
    Iterator<String> iterator2 = list2.iterator();

    while (iterator1.hasNext() && iterator2.hasNext()) {
        // skip blank/empty lines because they are not interesting (not shown)
        String str1 = iterator1.next(); length1 += str1.length();
        String str2 = iterator2.next(); length2 += str2.length();
        levenshteinDistance += getLevenshteinDistance(str1, str2);
    }

    // handle the rest of the lines from the iterator that has not terminated (not shown)

    // cast to double, otherwise the integer division truncates toward 0
    double difference = (double) levenshteinDistance / Math.max(length1, length2);
    return difference < 0.25; // <- arbitrary threshold, yuck!
}
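The two steps left as comments in the pseudo code (skipping blank lines and handling the leftover lines) could be filled in as sketched below. This is only one possible reading, under stated assumptions: blank lines are filtered out before lines are paired up, a line with no counterpart in the other trace is charged its full length in edits, and the threshold is passed in as a parameter rather than hard-coded. All names are hypothetical, and the Levenshtein method is the textbook version, included only to keep the example self-contained.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Sketch only: names and the treatment of leftover lines are assumptions.
final class StackTraceSimilarity {

    // Textbook two-row dynamic-programming Levenshtein distance.
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) prev[j] = j;
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = (a.charAt(i - 1) == b.charAt(j - 1)) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1),
                                   prev[j - 1] + cost);
            }
            int[] tmp = prev; prev = curr; curr = tmp;
        }
        return prev[b.length()];
    }

    // Drop blank lines up front so only lines with content are paired up.
    static List<String> nonBlank(List<String> lines) {
        List<String> kept = new ArrayList<>();
        for (String line : lines) {
            if (!line.trim().isEmpty()) kept.add(line);
        }
        return kept;
    }

    // Total edits divided by the larger total length. Because distances are
    // summed per line, this ratio is not guaranteed to stay below 1.0.
    static double difference(List<String> list1, List<String> list2) {
        List<String> a = nonBlank(list1);
        List<String> b = nonBlank(list2);
        int length1 = 0, length2 = 0, distance = 0;
        int shared = Math.min(a.size(), b.size());
        for (int i = 0; i < shared; i++) {
            length1 += a.get(i).length();
            length2 += b.get(i).length();
            distance += levenshtein(a.get(i), b.get(i));
        }
        // Assumption: a leftover line costs as many edits as it has characters.
        for (int i = shared; i < a.size(); i++) {
            length1 += a.get(i).length();
            distance += a.get(i).length();
        }
        for (int i = shared; i < b.size(); i++) {
            length2 += b.get(i).length();
            distance += b.get(i).length();
        }
        int longer = Math.max(length1, length2);
        return longer == 0 ? 0.0 : (double) distance / longer;
    }

    static boolean isSimilar(List<String> list1, List<String> list2, double threshold) {
        return difference(list1, list2) < threshold;
    }

    public static void main(String[] args) {
        List<String> t1 = Arrays.asList("java.lang.NullPointerException",
                                        "\tat com.example.Foo.bar(Foo.java:10)");
        List<String> t2 = Arrays.asList("java.lang.NullPointerException",
                                        "",
                                        "\tat com.example.Foo.bar(Foo.java:12)");
        // The traces differ by a single character (the line number).
        System.out.println(difference(t1, t2));
        System.out.println(isSimilar(t1, t2, 0.25)); // true
    }
}
```

Making the threshold a parameter does not make it any less arbitrary, but it at least keeps the comparison logic separate from whatever method ends up choosing the cutoff.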