I have a table of strings that are identifiers, but each identifier can be either fixed or a variable identifier with some static pieces. For example, an identifier could be `ABC12345`, or it could be `DEF**45`, where `*` represents any alphanumeric character. There are quite literally hundreds of thousands of identifiers in this table, and I want to match an inputted string to the closest string(s) based on what the user enters.

If a user enters `ABC12345`, we are all set there because it is a direct match. If a user enters `DEF1245`, that will take some computation. I figured Levenshtein distance is a good candidate here because it will tell me (in this case) that two letters need to be changed to replicate this string, which would be closest. If there is also an identifier `DEF12**`, that is not a problem: both would be the same distance, but that is OK since at least I know both options are valid matches.

The problem is that I could be running this comparison against potentially thousands of strings in a database, which could have significant efficiency issues. Some things in my favor: length is an immediate discriminator, and multiple identical distances are not a problem. Is there a more efficient way to process these strings, or maybe a better way to store them for processing? It seems almost like a reverse regex, but I don't quite understand how I could use regexes without converting each identifier into a regex, testing it, and then going forward. That seems even less efficient than a distance calculation.

- I guess you could put all those strings in a [Trie](http://en.wikipedia.org/wiki/Trie) and use that to find one that matches. As you go down the trie, `*` matches everything, but does incur some "cost". – tobias_k Feb 03 '15 at 21:51
- `but each identifier can be either fixed or a variable identifier with some static pieces` — and exactly how do you indicate within this string that it is a variable identifier? – Feb 03 '15 at 22:03
- Any part of an identifier that can be variable is denoted with a `*`. All static pieces are their literal characters. – user3170736 Feb 03 '15 at 22:17
- Most engines get their speed from creating a _TRIE_ (from alternations) when they compile the regex. At runtime, I would gather all the IDs into a list, sort the list, and replace each `*` with a dot or an alphanumeric class `[a-z\d]`. Then join on the alternation symbol `|`, and create the final regex string by concatenating `^(?:` + `joined ids` + `)$`. When the engine compiles this regex, it makes a trie out of all the alternations. This immediately takes the choices from thousands down to 26. Quite a shave in time. – Feb 03 '15 at 23:20
- Also, the more factoring you can do, the better the trie built. Example: `(?: id25 | id26 | id27a ) | (?: ie3 | ie4 | ie5 )` – Feb 03 '15 at 23:26
1 Answer
Levenshtein distance is computed from the letters of both strings. AFAIK there is no pre-computing step that can speed up a Levenshtein distance calculation. Since your ids have known lengths, it could be possible to store all the substrings of each id and check all the substrings of what the user typed against the precomputed ones:

- the user types `2345`, with substrings `2`, `23`, `234`, `2345`, `3`, `34`, `345`, etc.
- find all ids that share a substring with what the user typed, starting with the longest substrings: this would match `23456` and `12345`
- sort by longest common substring

First, I don't know how efficient that would be. Second (and most importantly), I think you should check whether Levenshtein distance really is a problem. In my experience, thousands of distance computations should not be enough to significantly slow your search (unless your ids are much longer). Try it first; you can tackle optimization afterwards.
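The substring-index idea above can be sketched like this. The function names and index layout are illustrative assumptions, and note the memory cost: each id of length *k* contributes O(k²) substrings, which matters with hundreds of thousands of ids:

```python
from collections import defaultdict

def all_substrings(s):
    # Every contiguous substring of s, e.g. '2345' -> {'2', '23', ..., '2345'}
    return {s[i:j] for i in range(len(s)) for j in range(i + 1, len(s) + 1)}

def build_index(ids):
    # Precompute: map each substring to the set of ids containing it.
    index = defaultdict(set)
    for ident in ids:
        for sub in all_substrings(ident):
            index[sub].add(ident)
    return index

def candidates(index, typed):
    # Try the user's substrings from longest to shortest; return the ids
    # sharing the longest common substring with the input.
    for sub in sorted(all_substrings(typed), key=len, reverse=True):
        if sub in index:
            return sub, sorted(index[sub])
    return '', []
```

For the example in the answer, `candidates(build_index(['23456', '12345']), '2345')` finds both ids via their shared substring `2345`.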

samy