This is not homework; I am trying to simplify and enhance an existing clunky GUI written in C# / WinForms / SQL Server 2008. It would be great if you could suggest something specific to these technologies, but if you can point me to something else, such as a Java/MySQL solution, I will be happy with that as well.
A similar question has been asked, but neither the question nor its answers were as advanced as what I am after: "Given a list of words - what would be a good algorithm for word completion in java? Tradeoffs: Speed/efficiency/memory footprint".
Say I have a table containing book information: title, author name, description. I know that all three do not necessarily belong in the same table, but let's assume it made sense to do it this way. When the user types something (say "Hari po") into a textbox, combobox, or some custom control, the first suggestion should probably be "Harry Potter", along with the corresponding description and author. To keep the question simple, let's restrict the search to the title only.
Note that I do not care that "Hari" sounds like "Harry" (the app is not targeted at non-native speakers), but I do care that "Hari po" is only a few keystrokes away from "Harry Po". So http://en.wikipedia.org/wiki/Levenshtein_distance comes to mind, but it is not exactly what I need, because I would like to get meaningful results as soon as I start typing (think Google suggestions, with a different purpose).
I need some sort of modified Levenshtein distance algorithm that works well with partial matching and does not assume that what I am typing must be at the beginning of the text I am trying to match. For instance, the book may be called "How the boy named Harry Potter affects our society.", and I do want this title to come up in the search. However, I would like something like "Harry Potter and the Order of the Phoenix" to come up at the top, because that title starts with what I typed.
I could compute the Levenshtein distance against every sub-string of the title whose length is within +/- 2 of the query length, weight each result by where in the title the sub-string "sort of" appears, and then pick the maximum match coefficient. My first concern is that this is inefficient. Secondly, there must be a way to get better results, even if speed were not an issue. Thirdly, someone has surely done something similar before, so why reinvent the wheel?
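To make the idea concrete, here is a rough sketch in Java of a cheaper variant than looping over sub-strings: the usual Levenshtein table, but with the first row set to zero so a match may start anywhere in the title, and the minimum taken over the last row so it may end anywhere (I believe this is usually attributed to Sellers' approximate string matching). The class name `FuzzySuggest`, the `anchored` flag, and the 0.5 penalty for matches that are not at the start are my own assumptions, not something from an existing library.

```java
import java.util.Locale;

class FuzzySuggest {

    // Minimum edit distance between the query and any prefix of the text
    // (anchored == true) or any sub-string of the text (anchored == false).
    static int distance(String query, String text, boolean anchored) {
        String q = query.toLowerCase(Locale.ROOT);
        String t = text.toLowerCase(Locale.ROOT);
        int m = q.length();
        int n = t.length();
        int[] prev = new int[n + 1];
        int[] curr = new int[n + 1];

        // Row 0: empty query against the first j characters of the text.
        // Unanchored matching lets the match start anywhere at no cost.
        for (int j = 0; j <= n; j++) {
            prev[j] = anchored ? j : 0;
        }
        for (int i = 1; i <= m; i++) {
            curr[0] = i; // i query characters matched against nothing
            for (int j = 1; j <= n; j++) {
                int cost = q.charAt(i - 1) == t.charAt(j - 1) ? 0 : 1;
                int substitute = prev[j - 1] + cost;
                int skipQueryChar = prev[j] + 1;
                int skipTextChar = curr[j - 1] + 1;
                curr[j] = Math.min(substitute, Math.min(skipQueryChar, skipTextChar));
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        // The match may end anywhere, so take the minimum of the last row.
        int best = prev[0];
        for (int j = 1; j <= n; j++) {
            best = Math.min(best, prev[j]);
        }
        return best;
    }

    // Hypothetical score, lower is better: the best match anywhere in the
    // title, plus an arbitrary 0.5 penalty if the start of the title does
    // not match equally well.
    static double score(String query, String title) {
        int anywhere = distance(query, title, false);
        int atStart = distance(query, title, true);
        return atStart == anywhere ? anywhere : anywhere + 0.5;
    }
}
```

With this scoring, "Hari po" is two edits away from the start of "Harry Potter and the Order of the Phoenix" and also two edits away from a sub-string of "How the boy named Harry Potter affects our society.", but only the second title gets the penalty, so the first one ranks higher. The cost is proportional to query length times title length for each title; I have not measured whether that is fast enough per keystroke for 20,000 rows.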
The number of unique rows in the database will be up to 20,000. What I am after is something like Google search suggestions or Visual Studio 2010 IntelliSense (code auto-completion), except that it should not try to remember what the user has typed in the past and adjust the suggestions based on that. There is no need for query expansion; it should work only with the actual content. From the user's perspective it should behave like Google search and IntelliSense: it should come up with a number of ranked choices, and it should cut that list off intelligently. If nothing really matches the query, it should suggest nothing rather than show the best of the worst fits. If the first few results have a strong rank but subsequent ones are much weaker relative to the top results, then perhaps it should hide the weak ones.
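To illustrate the kind of cut-off I have in mind, here is a minimal sketch in Java. It assumes each candidate already has a numeric score where lower is better (for example an edit distance) and that the scores are sorted best first. The class name `SuggestionCutoff` and both thresholds are made-up values for illustration, not tuned numbers.

```java
import java.util.List;

class SuggestionCutoff {

    // Assumed threshold: reject anything needing more than 40% of the
    // query length in edits.
    static final double MAX_ERROR_RATIO = 0.4;

    // Assumed threshold: hide results that score more than 1.0 worse
    // than the best result.
    static final double MAX_GAP_FROM_BEST = 1.0;

    // Returns how many of the sorted scores (best first, lower is better)
    // should be shown to the user.
    static int visibleCount(List<Double> sortedScores, int queryLength) {
        if (sortedScores.isEmpty()) {
            return 0;
        }
        double best = sortedScores.get(0);
        double absoluteLimit = MAX_ERROR_RATIO * queryLength;
        if (best > absoluteLimit) {
            return 0; // even the best fit is poor: suggest nothing
        }
        int count = 0;
        for (double score : sortedScores) {
            if (score > absoluteLimit || score - best > MAX_GAP_FROM_BEST) {
                break; // this result and everything after it is too weak
            }
            count++;
        }
        return count;
    }
}
```

For a 7-character query, scores of 2.0, 2.5, 4.0 and 5.0 would show only the first two results, and scores of 5.0 and 6.0 would show nothing. What I do not know is how to pick such thresholds in a principled way, which is part of what I am asking.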
Perhaps you know of a reasonably sized open-source tool or library with readable source code that I can get ideas from?
My next question would be about how best to handle the situation where the search term could apply to the title, the author, the description, or any combination of them, but I suspect that my current question is loaded enough already.
Please ask clarifying questions if something is not clear about what I am after.