
I am a Java beginner, trying to write a program that will match an input to a list of predefined strings. I have looked at Levenshtein distance, but I have run into problems such as this:

If I have an input such as "fillet of beef" I want it to be matched to "beef fillet". The problem is that "fillet of beef" is closer, according to Levenshtein distance, to something like "fillet of tuna", which of course is wrong.

Should I be using something like Lucene for this? Does one use Lucene methods within a Java class?

Thanks!

abroekhof
  • Lucene is probably the wrong approach (it's meant to find matches across a set of documents, not a single document), but the way that it builds and searches an index might be helpful to you (especially the "relevance" algorithm). **Questions that would help people to give you a good answer**: What is your input? How long is your list of words? Do you need to deal with misspellings? – Anon Apr 07 '11 at 12:44
  • Thanks for the feedback: My input would be strings parsed from XML documents. There shouldn't be too many misspellings, but it would be nice to cover them if they do occur. My list of strings numbers around 1000. – abroekhof Apr 07 '11 at 13:05

3 Answers


You need to compute the relevance of your search terms to the input strings. Lucene does have relevance calculations built in, and this article might be a good start to understanding them (I just scanned it, but it seems reasonably authoritative).

The basic process is this:

  • Initialization: tokenize your search terms and store them in a series of HashSets, one per term. Or, if you want to give different weights to each word, use a HashMap where the word is the key and the weight is the value.
  • Processing: tokenize each input string and probe each of the sets of search terms to determine how closely they apply to the input. See the article above for a description of the scoring algorithms; a minimal sketch follows below.
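
To make the two steps concrete, here is a minimal sketch of the token-overlap idea in plain Java (the class name, stop-word list, and scoring are illustrative choices, not a fixed recipe): index each candidate string as a set of tokens, then pick the candidate that shares the most tokens with the input.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Minimal token-overlap matcher: index each predefined string as a set of
// lower-cased tokens, then pick the candidate sharing the most tokens with
// the input. Stop words like "of" are dropped so they don't inflate scores.
public class TokenOverlapMatcher {

    private static final Set<String> STOP_WORDS =
            new HashSet<String>(Arrays.asList("a", "an", "and", "of", "the"));

    private final Map<String, Set<String>> index = new HashMap<String, Set<String>>();

    private static Set<String> tokenize(String s) {
        Set<String> tokens = new HashSet<String>();
        for (String t : s.toLowerCase().split("\\W+")) {
            if (t.length() > 0 && !STOP_WORDS.contains(t)) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    public void add(String candidate) {
        index.put(candidate, tokenize(candidate));      // initialization step
    }

    public String bestMatch(String input) {
        Set<String> inputTokens = tokenize(input);      // processing step
        String best = null;
        int bestScore = -1;
        for (Map.Entry<String, Set<String>> e : index.entrySet()) {
            Set<String> shared = new HashSet<String>(inputTokens);
            shared.retainAll(e.getValue());             // tokens in common
            if (shared.size() > bestScore) {
                bestScore = shared.size();
                best = e.getKey();
            }
        }
        return best;
    }

    public static void main(String[] args) {
        TokenOverlapMatcher matcher = new TokenOverlapMatcher();
        matcher.add("beef fillet");
        matcher.add("fillet of tuna");
        System.out.println(matcher.bestMatch("fillet of beef")); // beef fillet
    }
}
```

On the question's example, "of" no longer counts as a match, so "beef fillet" (two shared tokens) beats "fillet of tuna" (one shared token).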

There's an easy trick to handle misspellings: during initialization, you create sets containing potential misspellings of the search terms. Peter Norvig's post on "How to Write a Spelling Corrector" describes this process (it uses Python code, but a Java implementation is certainly possible).
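
The heart of that trick, translated from Norvig's Python into Java, looks roughly like this (a sketch; during initialization you would add these variants to the per-term sets):

```java
import java.util.HashSet;
import java.util.Set;

// Sketch of the "edits at distance 1" idea from Norvig's spelling-corrector
// post: generate every string reachable from a word by one deletion,
// transposition, replacement, or insertion.
public class Edits1 {

    private static final String ALPHABET = "abcdefghijklmnopqrstuvwxyz";

    public static Set<String> edits1(String word) {
        Set<String> edits = new HashSet<String>();
        for (int i = 0; i <= word.length(); i++) {
            String head = word.substring(0, i);
            String tail = word.substring(i);
            if (!tail.isEmpty()) {
                // deletion
                edits.add(head + tail.substring(1));
            }
            if (tail.length() > 1) {
                // transposition of two adjacent characters
                edits.add(head + tail.charAt(1) + tail.charAt(0) + tail.substring(2));
            }
            for (char c : ALPHABET.toCharArray()) {
                if (!tail.isEmpty()) {
                    // replacement
                    edits.add(head + c + tail.substring(1));
                }
                // insertion
                edits.add(head + c + tail);
            }
        }
        return edits;
    }

    public static void main(String[] args) {
        System.out.println(edits1("fillet").contains("filet")); // true (one deletion)
        System.out.println(edits1("fillet").size());            // a few hundred variants
    }
}
```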

Anon

Lucene does support fuzzy search based on Levenshtein distance.

https://lucene.apache.org/java/2_4_0/queryparsersyntax.html#Fuzzy%20Searches

But Lucene is meant to search over a set of documents rather than to compare individual strings, so it might be overkill for you. There are other Java implementations available. Take a look at http://www.merriampark.com/ldjava.htm
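
For reference, the distance itself is only a few lines of standard dynamic programming; a plain-Java sketch (no Lucene), which also reproduces the behaviour described in the question:

```java
// Standard dynamic-programming Levenshtein distance, using two rolling rows.
public class Levenshtein {

    public static int distance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;                                        // insert j characters
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;                                        // delete i characters
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1,    // insertion
                                            prev[j] + 1),       // deletion
                                   prev[j - 1] + cost);         // substitution
            }
            int[] tmp = prev; prev = curr; curr = tmp;
        }
        return prev[b.length()];
    }

    public static void main(String[] args) {
        // Illustrates the problem from the question: character-level distance
        // prefers "fillet of tuna" over "beef fillet".
        System.out.println(distance("fillet of beef", "fillet of tuna")); // 4
        System.out.println(distance("fillet of beef", "beef fillet"));    // 11
    }
}
```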

Nishan
  • Thanks for your response, Nishan. I tried the Levenshtein distance Java implementation you linked above, but I ran into the problem stated in my question. – abroekhof Apr 07 '11 at 13:08

It should be possible to apply the Levenshtein distance to words, not characters. Then, to match words, you could again apply Levenshtein on the character level, so that "filet" in "filet of beef" should match "fillet" in "beef fillet".
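
One way to read this two-level idea, as a sketch (the ~25% edit threshold and the names are illustrative choices): run the usual Levenshtein algorithm over arrays of words, and let two words count as equal when their character-level distance is small.

```java
// Levenshtein over words, where two words count as "equal" if their
// character-level Levenshtein distance is small (so "filet" matches "fillet").
public class WordLevenshtein {

    // Character-level Levenshtein distance (standard full-matrix DP).
    static int charDistance(String a, String b) {
        int[][] d = new int[a.length() + 1][b.length() + 1];
        for (int i = 0; i <= a.length(); i++) d[i][0] = i;
        for (int j = 0; j <= b.length(); j++) d[0][j] = j;
        for (int i = 1; i <= a.length(); i++) {
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                d[i][j] = Math.min(Math.min(d[i - 1][j] + 1, d[i][j - 1] + 1),
                                   d[i - 1][j - 1] + cost);
            }
        }
        return d[a.length()][b.length()];
    }

    // Two words count as the same token if they are within roughly 25% edits.
    static boolean fuzzyEquals(String a, String b) {
        int limit = Math.max(a.length(), b.length()) / 4;
        return charDistance(a, b) <= limit;
    }

    // Levenshtein distance over words instead of characters.
    static int wordDistance(String s, String t) {
        String[] a = s.toLowerCase().split("\\s+");
        String[] b = t.toLowerCase().split("\\s+");
        int[][] d = new int[a.length + 1][b.length + 1];
        for (int i = 0; i <= a.length; i++) d[i][0] = i;
        for (int j = 0; j <= b.length; j++) d[0][j] = j;
        for (int i = 1; i <= a.length; i++) {
            for (int j = 1; j <= b.length; j++) {
                int cost = fuzzyEquals(a[i - 1], b[j - 1]) ? 0 : 1;
                d[i][j] = Math.min(Math.min(d[i - 1][j] + 1, d[i][j - 1] + 1),
                                   d[i - 1][j - 1] + cost);
            }
        }
        return d[a.length][b.length];
    }

    public static void main(String[] args) {
        // "filet" and "fillet" count as the same word:
        System.out.println(fuzzyEquals("filet", "fillet"));                   // true
        // One word-level edit (beef -> tuna) instead of four character edits:
        System.out.println(wordDistance("fillet of beef", "fillet of tuna")); // 1
        // Reordered words still count as edits at the word level:
        System.out.println(wordDistance("fillet of beef", "beef fillet"));    // 3
    }
}
```

Note that reordered words still count as word-level edits, so for the "beef fillet" case this would likely be combined with an order-insensitive token-set comparison like the one in the other answer.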

Ingo