This is not homework; I am trying to simplify and enhance an existing clunky GUI written in C# / WinForms / SQL Server 2008. It would be cool if you could suggest something specific to these technologies, but if you can point me to something else, such as a Java/MySQL solution, I will be happy with that as well.

A similar question has been asked, but the question/answer was not as advanced as what I am after: Given a list of words - what would be a good algorithm for word completion in java? Tradeoffs: Speed/efficiency/memory footprint

Say I have a table containing book information: title, author name, description. I know all three do not necessarily belong in the same table, but let's assume it made sense to do it this way. So, when the user types something (say "Hari po") into a textbox / combobox, or some custom control, what they should get as the first suggestion is probably "Harry Potter", with the corresponding description and author. To keep the question simple, let's restrict the search to the title only.

Note that I do not care that "Hari" sounds like "Harry" - the app is not targeted at non-native speakers - but I do care about the fact that "Hari po" is only a few keystrokes away from "Harry Po". So, http://en.wikipedia.org/wiki/Levenshtein_distance comes to mind, but it is not exactly what I need, because I would like to have meaningful results as soon as I start typing (think Google suggestions with a different purpose). I need some sort of modified Levenshtein distance algorithm that works well with partial matching and does not assume that what I am typing is supposed to be at the beginning of the text I am trying to match.

For instance, the book may be called "How the boy named Harry Potter affects our society.", and I do want this title to come up in the search; however, I would like to see something like "Harry Potter and the Order of the Phoenix" come up at the top, because that title starts with what I typed.

I could try the Levenshtein distance multiple times against all possible sub-strings of query length +/- 2, then weight them somehow by where in the string the sub-string "sort of" appears, and then pick the maximum match coefficient. My first concern with doing this is that it is inefficient. Secondly, there must be a way to get better results, even if speed were not an issue. Thirdly, someone has surely done something similar before, so why reinvent the wheel?
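One way to avoid enumerating sub-strings is the approximate substring variant of the edit-distance table, where a match may start anywhere in the title at no cost. The Python below is only a sketch of that idea, not a tested solution for this app: the names `best_substring_match` and `rank_titles`, the `max_distance=2` cut-off, and the use of the match's end position as a stand-in for "how early it appears" are all assumptions made for illustration.

```python
def best_substring_match(query, text):
    """Return (distance, end) for the substring of text closest to query.

    Same recurrence as Levenshtein, except row 0 is all zeros, so a match
    may begin at any position in text for free.
    """
    query, text = query.lower(), text.lower()
    prev = [0] * (len(text) + 1)
    for i, qc in enumerate(query, start=1):
        curr = [i] + [0] * len(text)  # i query chars vs. an empty substring
        for j, tc in enumerate(text, start=1):
            curr[j] = min(prev[j] + 1,               # skip a query char
                          curr[j - 1] + 1,           # skip a text char
                          prev[j - 1] + (qc != tc))  # match or substitute
        prev = curr
    distance = min(prev)
    return distance, prev.index(distance)  # earliest end with that distance


def rank_titles(query, titles, max_distance=2):
    """Rank titles by (distance, end position); drop anything too far off."""
    scored = []
    for title in titles:
        distance, end = best_substring_match(query, title)
        if distance <= max_distance:
            scored.append((distance, end, title))
    return [title for _, _, title in sorted(scored)]
```

Each title costs O(len(query) x len(title)) time, so 20,000 short titles should be workable per keystroke, though that is a guess rather than a measurement.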

The number of unique rows in the database will be up to 20,000. What I am after is sort of like Google search suggestions, or Visual Studio 2010 IntelliSense (code auto-completion), except that it should not try to remember what the user has typed in the past and adjust the suggestions based on that. There is no need to do query expansion; just work with the actual content. From the user's perspective it should work similarly to Google search and IntelliSense: it should come up with a number of ranked choices, and also with an intelligent way to cut that list off at the right point (e.g. if nothing really matches the query, then suggest nothing rather than show the best of the worst fits). Also, if the first few results have a strong rank but subsequent ones have a much weaker one relative to the top results, then perhaps hide the weak ones.
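The cut-off described above could be sketched like this in Python. It assumes each candidate already has a score between 0 and 1 (higher is better) from whatever ranking is used; the function name `cut_off` and both thresholds are invented for illustration and would need tuning.

```python
def cut_off(ranked, min_top=0.5, keep_ratio=0.6):
    """Trim a best-first list of (score, item) pairs.

    - If even the best score is below min_top, suggest nothing.
    - Otherwise keep only items scoring at least keep_ratio of the best.
    Both thresholds are arbitrary assumptions.
    """
    if not ranked or ranked[0][0] < min_top:
        return []
    top = ranked[0][0]
    return [item for score, item in ranked if score >= top * keep_ratio]
```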

Perhaps you know of a reasonably sized open-source tool/library with exposed and readable source code that I can get ideas from?

My next question would be about how to best handle the situation where the search term could apply to either title, and/or author, and/or description, but I suspect that my current question is already loaded.

Please ask clarifying questions if something is not clear about what I am after.

Hamish Grubijan
  • You may want to take a look at Solr/Lucene. It supports autocomplete that works pretty well. Runtime performance is also good. – Pankrat Nov 06 '11 at 00:22
  • @Hamish Grubijan: type *"hari po"* in Google and the 2nd and 3rd suggestions are *"harry potter"* for me ; ) Google does it using a *"damn cool algorithm"*. You aren't far off with the Levenshtein edit distance: Google is using BK-trees, IIRC. As far as I understand, it's basically a tree constructed from the Levenshtein edit distance. You can read about it here: http://blog.notdot.net/2007/4/Damn-Cool-Algorithms-Part-1-BK-Trees Btw, as much as the *Levenshtein edit distance is trivial*, the BK-tree seems like quite a beast... – TacticalCoder Nov 06 '11 at 01:18
  • @user988052, thanks, please post this comment as an answer. – Hamish Grubijan Nov 08 '11 at 17:58

4 Answers

I'd suggest taking a good look at Lucene. It supports a wide range of query types, including (I think) incremental, approximate search. Plus it's open source and free. :)

Ted Hopp

If you type "hari po" in Google, the suggestions near the top will correctly be "harry potter". Google does it using a "damn cool algorithm". You aren't far off with the Levenshtein edit distance: Google is using BK-trees, IIRC.

As far as I understand, it's basically a tree built on the Levenshtein edit distance: each child edge is labelled with the distance between the parent word and the child word, which lets a lookup skip most of the dictionary.

There are probably several papers available on the subject by now. The first time I read about it was a few years ago, on a blog called "Damn cool algorithms":

http://blog.notdot.net/2007/4/Damn-Cool-Algorithms-Part-1-BK-Trees

But you should know that, as trivial as the Levenshtein edit distance is (it can be implemented in about 20 lines of code), the BK-tree seems like quite another beast to develop...
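For a rough idea of the moving parts, here is a small Python sketch of a BK-tree over whole words. It is written from the general description of the data structure, not taken from the linked post or from anything Google has published, and the class and method names are made up for illustration.

```python
def levenshtein(a, b):
    """Classic edit distance, keeping only two rows of the table."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1,
                            prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]


class BKTree:
    """Each node is (word, children), children keyed by distance to word."""

    def __init__(self, words=()):
        self.root = None
        for word in words:
            self.add(word)

    def add(self, word):
        if self.root is None:
            self.root = (word, {})
            return
        node = self.root
        while True:
            d = levenshtein(word, node[0])
            if d == 0:
                return  # word is already in the tree
            if d not in node[1]:
                node[1][d] = (word, {})
                return
            node = node[1][d]

    def search(self, query, max_distance):
        """Return sorted (distance, word) pairs within max_distance."""
        results = []
        stack = [self.root] if self.root else []
        while stack:
            word, children = stack.pop()
            d = levenshtein(query, word)
            if d <= max_distance:
                results.append((d, word))
            # Triangle inequality: only these edges can hold a match.
            for edge, child in children.items():
                if d - max_distance <= edge <= d + max_distance:
                    stack.append(child)
        return sorted(results)
```

Note that this compares whole strings, so on its own it handles misspelled words rather than partially typed titles; it would probably have to be applied per word or combined with some prefix handling.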

TacticalCoder

Maybe you want to look at a trigram search? A trigram search breaks your input into every sequence of 3 consecutive letters and looks for strings that share those sequences. http://en.wikipedia.org/wiki/Trigram
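A bare-bones Python illustration of the idea follows. The scoring rule (the fraction of the query's trigrams that also occur in the title) is an assumption chosen because the query is usually shorter than the title; real trigram indexes often pad the strings and use other similarity measures. The function names are hypothetical.

```python
def trigrams(s):
    """Set of all runs of 3 consecutive characters, lowercased."""
    s = s.lower()
    return {s[i:i + 3] for i in range(len(s) - 2)}


def trigram_score(query, title):
    """Fraction of the query's trigrams that also occur in the title."""
    q = trigrams(query)
    if not q:
        return 0.0  # query shorter than 3 characters
    return len(q & trigrams(title)) / len(q)
```

For "Hari po" the query trigrams are "har", "ari", "ri ", "i p" and " po"; a title containing "Harry Potter" shares "har" and " po", so the typo lowers the score without zeroing it.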

Micromega
  • Thanks, the Wikipedia page for the trigram search is nearly empty. Could you elaborate on this? Do you know of any good tools/libraries that I can use as an example? – Hamish Grubijan Nov 06 '11 at 00:37
  • A trigram search divides each word into overlapping three-letter sequences, so a word of n letters yields n-2 trigrams. – Micromega Nov 06 '11 at 00:49

For a simple completion algorithm you can combine a KWIC index with a radix tree.

Basically, you take each indexed string, identify the "significant" potential start points, and generate N rotated copies of the string based on those potential start points.

Then build a radix tree over the strings, such that when you type in "Harry" you'll find all of the possible next words after "Harry".

While this may sound like it would really explode the size of your DB, it actually only about doubles it, depending on how you choose "significant" start points. (The radix tree is somewhat more compact than storing each line individually, in addition to making for efficient searches.)
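Here is a rough Python sketch of the rotation step. To keep it short it makes two simplifying assumptions: every word counts as a "significant" start point, and a sorted list searched with `bisect` stands in for the radix tree (it gives the same prefix lookups, without the space savings). The function names are made up for illustration.

```python
import bisect


def kwic_entries(titles):
    """Build sorted (rotated key, original title) pairs, one per word start."""
    entries = []
    for title in titles:
        words = title.lower().split()
        for i in range(len(words)):
            rotated = " ".join(words[i:] + words[:i])
            entries.append((rotated, title))
    entries.sort()
    return entries


def complete(entries, prefix):
    """Return titles having a rotation that starts with prefix."""
    prefix = prefix.lower()
    # (prefix,) sorts just before any (key, title) whose key starts with prefix
    start = bisect.bisect_left(entries, (prefix,))
    results = []
    for key, title in entries[start:]:
        if not key.startswith(prefix):
            break
        if title not in results:
            results.append(title)
    return results
```

This only does exact prefix matching on the rotated strings, so "Harry Po" finds both example titles but "Hari po" finds nothing; typo tolerance would have to come from one of the other techniques.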

Hot Licks