
What would be the best way to compare a pattern with a set of strings, one by one, while rating how closely the pattern matches each string? In my limited experience with regex, matching strings against patterns seems to be a pretty binary operation: no matter how complicated the pattern is, in the end it either matches or it doesn't. I am looking for something beyond a plain match/no-match result. Is there a good technique or algorithm for this?

Here's an example:

Let's say I have a pattern foo bar and I want to find the string that most closely matches it out of the following strings:

foo for
foo bax
foo buo
fxx bar

Now, none of these actually match the pattern, but which non-match is the closest to being a match? In this case, foo bax would be the best choice, since it matches 6 out of the 7 characters.
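To make the "6 out of the 7 characters" count concrete, here is a minimal sketch that compares the pattern and each candidate position by position. The class and method names (`PositionalMatch`, `matchingPositions`) are made up for illustration, and it assumes the pattern is a plain string with no wildcards or quantifiers:

```java
class PositionalMatch {
    // Counts the positions at which both strings hold the same character.
    // Positions past the end of the shorter string are never counted as matches.
    static int matchingPositions(String pattern, String candidate) {
        int limit = Math.min(pattern.length(), candidate.length());
        int matches = 0;
        for (int i = 0; i < limit; i++) {
            if (pattern.charAt(i) == candidate.charAt(i)) {
                matches++;
            }
        }
        return matches;
    }

    public static void main(String[] args) {
        String pattern = "foo bar";
        String[] candidates = {"foo for", "foo bax", "foo buo", "fxx bar"};
        for (String candidate : candidates) {
            System.out.println(candidate + ": "
                    + matchingPositions(pattern, candidate) + " of " + pattern.length());
        }
    }
}
```

With this count, foo bax scores 6 and the other three score 5. A purely positional comparison like this breaks down as soon as a character is inserted or deleted, since everything after that point shifts out of alignment.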

Apologies if this is a duplicate question; I didn't really know what to search for when I checked whether it had already been asked.

  • I'm not sure I understand your question, as you said it either fits the pattern or doesn't, what do you mean by amount, like how many characters match? – user472875 Nov 05 '10 at 15:13
  • Good question; I'm curious about that as well. – Paul Sonier Nov 05 '10 at 15:14
  • yea, I guess I am looking for a different technique than regex matching. apologies for the misunderstanding, changing the question... –  Nov 05 '10 at 15:15
  • @W_P, do you mean fuzzy string algorithms like [soundex](http://en.wikipedia.org/wiki/Soundex) and/or [Levenshtein distance](http://en.wikipedia.org/wiki/Levenshtein_distance) but then instead of two strings, you have a pattern and a string? Or am I waaaay off? :) – Bart Kiers Nov 05 '10 at 15:18
  • hmmm still looking at it but my first impression is that the Levenshtein distance is what I'm looking for...I have edited the question with an example of what I am talking about. –  Nov 05 '10 at 15:23
  • provided the patterns are simple strings and don't have character list or quantifiers etc. then Levenshtein distance is spot on (but a little expensive to compute for big patterns). If that's true then the general expression for what you're looking for is string similarity metrics. – Flexo Nov 05 '10 at 15:24
  • @Bart Kiers if you provide Levenshtein distance as an answer I will mark it as accepted –  Nov 05 '10 at 16:43
  • @W_P, I see someone else already posted something regarding the Levenshtein distance: feel free to accept that answer instead. – Bart Kiers Nov 05 '10 at 17:13

2 Answers


This one works. I checked it against the Wikipedia example: the distance between "kitten" and "sitting" is 3.

    import java.util.ArrayList;
    import java.util.List;

    public class LevenshteinDistance {

    public static final String TEST_STRING = "foo bar";

    public static void main(String ...args){
        LevenshteinDistance test = new LevenshteinDistance();
        List<String> testList = new ArrayList<String>();
        testList.add("foo for");
        testList.add("foo bax");
        testList.add("foo buo");
        testList.add("fxx bar");
        for (String string : testList) {
          System.out.println("Levenshtein Distance for " + string + " is " + test.getLevenshteinDistance(TEST_STRING, string)); 
        }
    }

    public int getLevenshteinDistance (String s, String t) {
          if (s == null || t == null) {
            throw new IllegalArgumentException("Strings must not be null");
          }

          int n = s.length(); // length of s
          int m = t.length(); // length of t

          if (n == 0) {
            return m;
          } else if (m == 0) {
            return n;
          }

          int[] p = new int[n+1]; // 'previous' cost array, horizontally
          int[] d = new int[n+1]; // cost array, horizontally
          int[] _d; // placeholder to assist in swapping p and d

          // indexes into strings s and t
          int i; // iterates through s
          int j; // iterates through t

          char t_j; // jth character of t

          int cost; // cost

          for (i = 0; i<=n; i++) {
             p[i] = i;
          }

          for (j = 1; j<=m; j++) {
             t_j = t.charAt(j-1);
             d[0] = j;

             for (i=1; i<=n; i++) {
                cost = s.charAt(i-1)==t_j ? 0 : 1;
                // minimum of cell to the left+1, to the top+1, diagonally left and up +cost                
                d[i] = Math.min(Math.min(d[i-1]+1, p[i]+1),  p[i-1]+cost);  
             }

             // copy current distance counts to 'previous row' distance counts
             _d = p;
             p = d;
             d = _d;
          } 

          // our last action in the above loop was to switch d and p, so p now 
          // actually has the most recent cost counts
          return p[n];
        }

}
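To go from distances to an actual "best match", one option is to keep the candidate with the smallest distance. The following is only a sketch: `ClosestMatch`, `distance`, and `closest` are illustrative names, the distance method is a compact restatement of the same two-row recurrence used above, and ties are resolved in favor of the first candidate seen.

```java
class ClosestMatch {
    // Two-row Levenshtein distance: prev holds the previous row of the
    // cost table, curr holds the row currently being filled in.
    static int distance(String s, String t) {
        int[] prev = new int[t.length() + 1];
        int[] curr = new int[t.length() + 1];
        for (int j = 0; j <= t.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= s.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= t.length(); j++) {
                int cost = s.charAt(i - 1) == t.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1),
                                   prev[j - 1] + cost);
            }
            int[] tmp = prev; // swap rows so prev is the most recent one
            prev = curr;
            curr = tmp;
        }
        return prev[t.length()];
    }

    // Returns the candidate with the smallest distance to the pattern,
    // or null if there are no candidates. The first minimum wins ties.
    static String closest(String pattern, String[] candidates) {
        String best = null;
        int bestDistance = Integer.MAX_VALUE;
        for (String candidate : candidates) {
            int d = distance(pattern, candidate);
            if (d < bestDistance) {
                bestDistance = d;
                best = candidate;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        String[] candidates = {"foo for", "foo bax", "foo buo", "fxx bar"};
        System.out.println(closest("foo bar", candidates));
    }
}
```

For the strings in the question, foo bax has distance 1 and the other three have distance 2, so foo bax is selected.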
ant
  • And in fact, there are [lots of different edit distance algorithms](http://en.wikipedia.org/wiki/Edit_distance), depending on what precisely you want to compare. – Antal Spector-Zabusky Nov 05 '10 at 17:11

That's an interesting question! The first thing that came to mind is that one way regular expressions are matched is by building a DFA. If you had direct access to the DFA built for a given regex (or just built it yourself!), you could run the input and then measure the distance from the last state you reached to an accept state, using the shortest path as a measure of how close the input was to being accepted. I'm not aware of any libraries that would let you do that easily, though, and even this measure probably wouldn't map exactly onto your intuition in a number of cases.
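As a rough illustration of the idea, here is a hand-built sketch. It does not use any regex library; `DfaDistance` is a hypothetical class that builds the trivial DFA for a literal string (state i means "i characters matched"), runs an input until it gets stuck, and then does a breadth-first search for the shortest path to the accept state. A DFA for a real regex would have a richer transition table, but the search would look the same.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Deque;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class DfaDistance {
    // transitions.get(state) maps an input character to the next state.
    private final List<Map<Character, Integer>> transitions = new ArrayList<>();
    private final int acceptState;

    // Builds the DFA for a literal string; the only accept state is the last one.
    DfaDistance(String literal) {
        for (int i = 0; i <= literal.length(); i++) {
            transitions.add(new HashMap<>());
        }
        for (int i = 0; i < literal.length(); i++) {
            transitions.get(i).put(literal.charAt(i), i + 1);
        }
        acceptState = literal.length();
    }

    // Runs the input from state 0 and returns the last state reached
    // before a character had no outgoing transition (or input ran out).
    int lastState(String input) {
        int state = 0;
        for (char c : input.toCharArray()) {
            Integer next = transitions.get(state).get(c);
            if (next == null) {
                break;
            }
            state = next;
        }
        return state;
    }

    // Breadth-first search over the transition graph: number of transitions
    // on the shortest path from start to the accept state, or -1 if none.
    int stepsToAccept(int start) {
        int[] dist = new int[transitions.size()];
        Arrays.fill(dist, -1);
        dist[start] = 0;
        Deque<Integer> queue = new ArrayDeque<>();
        queue.add(start);
        while (!queue.isEmpty()) {
            int state = queue.remove();
            if (state == acceptState) {
                return dist[state];
            }
            for (int next : transitions.get(state).values()) {
                if (dist[next] == -1) {
                    dist[next] = dist[state] + 1;
                    queue.add(next);
                }
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        DfaDistance dfa = new DfaDistance("foo bar");
        for (String input : new String[] {"foo bax", "fxx bar"}) {
            System.out.println(input + ": " + dfa.stepsToAccept(dfa.lastState(input)));
        }
    }
}
```

With the pattern foo bar, the input foo bax gets stuck one step from acceptance, while fxx bar gets stuck six steps away, even though each differs from the pattern in a similar way. That is the kind of mismatch with intuition mentioned above. The sketch also ignores any input left over after the DFA gets stuck.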

Flexo