6

I have the problem that I want to match all strings in the database having a certain edit distance to a given string.

My idea was to generate a regular expression that would match all strings with edit distance d to string s.

So for example I want to generate a regex r for d = 1 and s = 'abc' in the form of: r = 'abc|.abc|.bc|a.c|ab.|abc.' and so on. But I'm not sure if this is very efficient or are there already some good algorithms to that problem? I want to consider even character swaps in the edit distance. so 'acb' should also be part of r. I want to realise it in PHP and then make an SQL query: SELECT * FROM table WHERE name RLIKE TheRegularExpression.

Is it a good way to make it like that? Or what would you recommend?

Brian Webster
  • 30,033
  • 48
  • 152
  • 225
Martin Cup
  • 2,399
  • 1
  • 21
  • 32
  • If you want efficiency, first and foremost you should avoid applying a WHERE condition that can't be resolved using an index to all records in a table, unless that table is fairly small. – millimoose Nov 02 '12 at 18:00
  • Also, consider that the length of the resulting pattern will be `O(nCd)`, where `n` is the length of the string, and `d` is your distance. This can potentially lead to very large patterns. For instance, for an `80`-character string, with a desired distance of `5`, you'd be sending a RE of about two-gigabytes to the database. (This is only considering character replacements, not transpositions.) However, if you're certain the strings will be short and/or the `d` either very small or very close to `n`, it might be feasible. – millimoose Nov 02 '12 at 20:05
  • Another implication of this is that if the strings are entered by users, you need to make sure if the length is within a certain limit, otherwise you'd create a DoS hole. (As with any very, very inefficient algorithm with user-entered parameters.) – millimoose Nov 02 '12 at 20:11

3 Answers3

5

You can store a Levenshtein function in Mysql. After that you can simply do the search like this:

mysql_qery("SELECT `term` FROM `words` WHERE levenshtein('$word', `term`) BETWEEN 0 AND '$d'");
enrico.bacis
  • 30,497
  • 10
  • 86
  • 115
1

You need an implementation of Levenshtein Distance (or something very similar). Here is a function definition for use with MySQL.

diolemo
  • 2,621
  • 2
  • 21
  • 28
  • It'd probably be more efficient to modify that algorithm to bail once the determined edit distance is over the required treshold, instead of needlessly computing the exact result. – millimoose Nov 02 '12 at 17:56
  • thank you. the problem is, that on the server where I want to use it I don't have the rights to use stored functions and procedures...so I have to implement it with php though... – Martin Cup Nov 02 '12 at 19:19
1

Probably the best thing to do is build up an iterative process for all the possibilities. In other words, something like this:

function findall($startString) {
    // create an array of all strings that are distance one away
    // each element would be $returnArray["abc"] = "abc";
}

$d = 2; // distance
$myArray[$startString] = $startString;

for($i = 0; $i < $d; $i++) {
    $newCombos = array_merge(array(), $myArray);
    foreach($myArray as $element) {
        $newCombos = array_merge($newCombos, findall($element));
    }
    $myArray = array_merge(array(), $newCombos);
}

$myRegex = implode("|", $myArray);
durron597
  • 31,968
  • 17
  • 99
  • 158
  • the only thing I noticed about the solution is, that the sql query is very long and slow for longer words and an edit distance higher than 2. – Martin Cup Nov 03 '12 at 00:11
  • I actually think the Levenshtein function solution is probably better than mine (by enrico.bacis), you should check it out – durron597 Nov 03 '12 at 01:11