
With the Levenshtein implementation of Lucene 4 claiming to be 100 times faster than before (http://blog.mikemccandless.com/2011/03/lucenes-fuzzyquery-is-100-times-faster.html), I would like to do fuzzy matching of all terms in a query. The idea is that a search for 'gren hose' should be able to find the document 'green house' (I don't really care about phrases at this point; the quotes are just here to make this more readable).

I am using Lucene 4 + Solr 4. As I'm doing some pre- and post-processing, there is a small wrapper servlet around Solr; the servlet uses SolrJ to eventually access Solr.

I'm currently a little lost on what would be the right way to achieve this. My basic approach is to break the search query down into terms and append the tilde / fuzzy operator to each term. Thus 'gren hose' would become 'gren~ hose~'. Now the question is how to properly do this. I can see several ways:

  1. Brute force: Assume that the terms are delimited by whitespace, so just parse the query and append a tilde before each whitespace (i.e. after each term)
  2. Two steps: Send the query to Solr with query debugging turned on. This will give me a list of query terms as parsed by Solr. I can then extract the terms from the debug output, append the tilde operator and re-run the query with the added tilde operators
  3. Internally: Hook into the search request handler and append the tilde operator after the query has been parsed into terms

Method 1 stinks a lot, as it circumvents Solr's query parsing entirely, so I would rather not do that. Method 2 sounds quite doable if the cost of parsing the query twice is not too high. Method 3 sounds just right, but I have yet to figure out where I have to hook into the processing chain.
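To make the concern about method 1 concrete, here is a minimal sketch of the brute-force approach. The class and method names (`NaiveFuzzy`, `fuzzify`) are made up for illustration and are not part of Solr or SolrJ; the only assumption is that terms are separated by whitespace.

```java
// Hypothetical helper, not a Solr/SolrJ API: appends '~' to every
// whitespace-delimited token without looking at query syntax at all.
final class NaiveFuzzy {
    private NaiveFuzzy() {}

    static String fuzzify(String query) {
        String trimmed = query.trim();
        if (trimmed.isEmpty()) {
            return trimmed;
        }
        StringBuilder out = new StringBuilder();
        for (String token : trimmed.split("\\s+")) {
            if (out.length() > 0) {
                out.append(' ');
            }
            // Every token is treated as a term, including operators.
            out.append(token).append('~');
        }
        return out.toString();
    }
}
```

With this, `fuzzify("gren hose")` gives `gren~ hose~` as intended, but `fuzzify("green AND house")` gives `green~ AND~ house~`, which shows how ignoring the query syntax goes wrong as soon as the input is more than a bag of words.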

Maybe there is a completely different way to achieve what I want to do, or maybe it's just a stupid idea on my part. Anyway, I would really appreciate a few pointers, maybe someone else has already done something like this. Thanks!

pnuts
Georg M. Sorst

1 Answer


I would propose the following methods:

  1. Implement a query handler module in your application that builds the Solr query from the user's input query. This way nothing changes on the Solr side and your application has full control over what goes into Solr.

  2. Implement your own query parser. You can start from the standard Solr query parser (org.apache.solr.search.QParser) and make your changes. Your application then just needs to select your custom query parser, and your implementation takes care of the rest.

I would prefer method 1, as it makes the system completely agnostic to Solr upgrades: a new release of Solr will not require you to update, rebuild, and set up a custom QParser for the new version.
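As a rough illustration of method 1, the application-side module could look something like the sketch below. `FuzzyQueryBuilder` and `build` are hypothetical names, not Solr or SolrJ classes, and the rule it applies is only an assumption about what is safe: a token gets a tilde only if it is not a boolean operator and contains none of the characters that carry meaning in the Lucene query syntax. Everything else is passed through untouched.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Hypothetical application-side helper; not part of Solr or SolrJ.
final class FuzzyQueryBuilder {
    // Boolean operators that must never be turned into fuzzy terms.
    private static final Set<String> OPERATORS =
            new HashSet<>(Arrays.asList("AND", "OR", "NOT", "&&", "||", "!"));

    // Characters with special meaning in the Lucene query syntax.
    private static final String SPECIAL = "+-&|!(){}[]^\"~*?:\\/";

    private FuzzyQueryBuilder() {}

    static String build(String userQuery) {
        StringBuilder out = new StringBuilder();
        for (String token : userQuery.trim().split("\\s+")) {
            if (token.isEmpty()) {
                continue; // blank input yields a single empty token
            }
            if (out.length() > 0) {
                out.append(' ');
            }
            out.append(token);
            if (isPlainTerm(token)) {
                out.append('~');
            }
        }
        return out.toString();
    }

    // A plain term is neither an operator nor contains query syntax.
    private static boolean isPlainTerm(String token) {
        if (OPERATORS.contains(token)) {
            return false;
        }
        for (int i = 0; i < token.length(); i++) {
            if (SPECIAL.indexOf(token.charAt(i)) >= 0) {
                return false;
            }
        }
        return true;
    }
}
```

The resulting string can then be handed to Solr as the query in the usual way. Note that this still tokenizes on whitespace only and knows nothing about the field's analyzer, so it is a conservative pre-processing step rather than a replacement for Solr's own parsing.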

If you don't have any control over the app and don't want to go the QParser route, you can implement a servlet filter that transforms the Solr query before it is dispatched to Solr's request filter.

Umar