I'm using Apache Lucene and am trying to combine a fuzzy query with a prefix (or wildcard) query to implement a kind of suggestion mechanism.

For example, if the query is levy, a document containing Levinshtein should also be returned.

Since Lucene does not seem to have a built-in query of this kind, I searched for solutions and used the approach suggested in "Lucene query: bla~* (match words that start with something fuzzy), how?" (the second reply), which builds the query as a combination of two automata.

That works well, but there is no scoring: every result gets a score of 1.0. I really want "Levy" to be ranked higher than "Levinshtein" in the previous example.
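The matching and ranking described above can be sketched in plain Java, without Lucene. This is only an illustration of the desired semantics, not of how Lucene's automaton queries work internally. The class and method names are hypothetical, and ranking by edit distance first and term length second is an assumed ordering chosen for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

public class FuzzyPrefixSketch {

    // Smallest Levenshtein distance between `query` and any prefix of `term`
    // (case-insensitive). Hypothetical helper, for illustration only.
    public static int prefixEditDistance(String query, String term) {
        String q = query.toLowerCase();
        String t = term.toLowerCase();
        // prev[j] = distance between the processed part of q and the first j chars of t
        int[] prev = new int[t.length() + 1];
        for (int j = 0; j <= t.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= q.length(); i++) {
            int[] cur = new int[t.length() + 1];
            cur[0] = i;
            for (int j = 1; j <= t.length(); j++) {
                int cost = q.charAt(i - 1) == t.charAt(j - 1) ? 0 : 1;
                cur[j] = Math.min(Math.min(cur[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            prev = cur;
        }
        // The last row holds the distance from the whole query to every prefix of the term.
        int best = prev[0];
        for (int d : prev) {
            best = Math.min(best, d);
        }
        return best;
    }

    // Keeps terms whose best prefix is within maxEdits of the query,
    // ordered by distance, then by length (assumed ranking).
    public static List<String> suggest(String query, List<String> terms, int maxEdits) {
        List<String> hits = new ArrayList<>();
        for (String term : terms) {
            if (prefixEditDistance(query, term) <= maxEdits) {
                hits.add(term);
            }
        }
        hits.sort(Comparator
                .comparingInt((String term) -> prefixEditDistance(query, term))
                .thenComparingInt(String::length));
        return hits;
    }

    public static void main(String[] args) {
        List<String> terms = Arrays.asList("Levinshtein", "Lucene", "Levy");
        System.out.println(suggest("levy", terms, 1)); // prints [Levy, Levinshtein]
    }
}
```

Here "Levy" matches with zero edits and "Levinshtein" with one (the prefix "levi"), which is the ordering the scores should reflect.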

By the way, I tried Lucene's auto-suggestion in the form of FuzzySuggester, but it is not feasible with large inputs: it holds all suggestions in RAM and bloats memory usage.

Is there another way of doing this? Or should I implement my own Scorer or Similarity?

Yossi Vainshtein

1 Answer

This can be solved easily with the built-in rewrite methods of MultiTermQuery.

As the javadoc says:

The recommended rewrite method is CONSTANT_SCORE_AUTO_REWRITE_DEFAULT: it doesn't spend CPU computing unhelpful scores, and it tries to pick the most performant rewrite method given the query. If you need scoring (like FuzzyQuery), use TopTermsScoringBooleanQueryRewrite which uses a priority queue to only collect competitive terms and not hit this limitation.

This means that constant scoring is used by default as a performance optimization. So the only thing you need to do is set a scoring rewrite method on the query:

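// AutomatonQuery extends MultiTermQuery, so its rewrite method can be changed.
MultiTermQuery query = new AutomatonQuery(new Term("text"), Automata.makeAnyString());
// Replace the default constant-score rewrite with one that computes per-term scores.
query.setRewriteMethod(MultiTermQuery.SCORING_BOOLEAN_QUERY_REWRITE);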

For more details, see my test: https://github.com/MysterionRise/information-retrieval-adventure/blob/master/src/main/java/org/mystic/lucene/AutomatonScoringTest.java

Mysterion