I came up with a way to programmatically build a query that searches for a phrase with wildcards, using this code:

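import org.apache.lucene.index.Term;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.WildcardQuery;
import org.apache.lucene.search.spans.SpanMultiTermQueryWrapper;
import org.apache.lucene.search.spans.SpanNearQuery;
import org.apache.lucene.search.spans.SpanQuery;

public static Query createPhraseQuery(String[] phraseWords, String field) {
    SpanQuery[] queryParts = new SpanQuery[phraseWords.length];
    for (int i = 0; i < phraseWords.length; i++) {
        // Wrap each wildcard word so it can take part in a span query
        WildcardQuery wildQuery = new WildcardQuery(new Term(field, phraseWords[i]));
        queryParts[i] = new SpanMultiTermQueryWrapper<WildcardQuery>(wildQuery);
    }
    return new SpanNearQuery(queryParts,       // one clause per word
                             0,                // max distance (slop)
                             true              // exact order
    );
}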

Creating an example query and printing it with toString():

String[] phraseWords = new String[]{"foo*", "b*r"};
Query phraseQuery = createPhraseQuery(phraseWords, "text");
System.out.println(phraseQuery.toString());

outputs:

spanNear([SpanMultiTermQueryWrapper(text:foo*), SpanMultiTermQueryWrapper(text:b*r)], 0, true)

This works well and is fast enough for most cases. For instance, searching with this query returns the desired results, such as:

Sentence with foo bar.
Foolies beer drinkers.
...

And not something like:

Bar fooes.
Foo has bar.
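
To make the matching rule concrete, here is a minimal plain-Java sketch (no Lucene involved) of what "wildcard words, adjacent, in order" means for the four sentences above. The class `WildcardPhraseSketch`, its helper methods, and the crude lower-casing tokenizer are assumptions made for illustration; a real analyzer and Lucene's span matching work differently internally.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.regex.Pattern;

// Hypothetical illustration class, not part of Lucene.
final class WildcardPhraseSketch {

    // Translate a wildcard (* = any run of characters, ? = exactly one character) into a regex.
    static Pattern toPattern(String wildcard) {
        StringBuilder regex = new StringBuilder();
        for (char c : wildcard.toCharArray()) {
            if (c == '*') {
                regex.append(".*");
            } else if (c == '?') {
                regex.append('.');
            } else {
                regex.append(Pattern.quote(String.valueOf(c)));
            }
        }
        return Pattern.compile(regex.toString());
    }

    // Assumed stand-in for an analyzer: lower-case, then split on anything that is not a letter or digit.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    // True if some run of consecutive tokens matches the wildcard words, in order, with no gaps.
    static boolean matches(String text, String... phraseWords) {
        List<String> tokens = tokenize(text);
        for (int start = 0; start + phraseWords.length <= tokens.size(); start++) {
            boolean all = true;
            for (int i = 0; i < phraseWords.length && all; i++) {
                all = toPattern(phraseWords[i]).matcher(tokens.get(start + i)).matches();
            }
            if (all) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        String[] sentences = {
            "Sentence with foo bar.", "Foolies beer drinkers.", "Bar fooes.", "Foo has bar."
        };
        for (String s : sentences) {
            System.out.println(s + " -> " + matches(s, "foo*", "b*r"));
        }
    }
}
```

The first two sentences match because a "foo*" token is immediately followed by a "b*r" token; the last two fail on order and on adjacency respectively.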

As mentioned, the query is fast enough in most cases. My index is currently about 200 GB, and the average search time is between 0.1 and 3 seconds, depending on factors such as caching and the size of the document subsets matching each word of the phrase, since Lucene performs set intersections between the matched terms.

Example: suppose I want to query the phrase "an* karenjin*" (which I split into ["an*", "karenjin*"] and then pass to the createPhraseQuery method), and I want it to match sentences containing "ana karenjina", "ani karenjinoj", "ane karenjine", ... (different cases due to Croatian grammar).

This query is so slow that I never waited long enough to get results (over 1 hour), and it sometimes fails with a "GC overhead limit exceeded" exception. This behaviour is somewhat expected, since "an*" by itself matches a huge number of documents. I am aware that I could query "an? karenjin*" instead, which gives results in 30-40 seconds (faster, but still slow).
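
As a rough illustration of why a short prefix is so much more expensive than a long one, the following plain-Java sketch models a term dictionary as a sorted map and counts how many terms and documents each prefix expands to before any intersection can happen. The class `PrefixExpansionSketch`, its methods, and the five sample documents are all made up for this example; it is a toy model, not a description of how Lucene executes the query.

```java
import java.util.SortedMap;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical toy inverted index, not part of Lucene.
final class PrefixExpansionSketch {

    // term -> ids of the documents that contain it
    private final TreeMap<String, TreeSet<Integer>> index = new TreeMap<>();

    void add(int docId, String... terms) {
        for (String term : terms) {
            index.computeIfAbsent(term, k -> new TreeSet<>()).add(docId);
        }
    }

    // All indexed terms starting with the prefix: a contiguous slice of the sorted dictionary.
    SortedMap<String, TreeSet<Integer>> expand(String prefix) {
        return index.subMap(prefix, prefix + Character.MAX_VALUE);
    }

    // Union of the postings of every term the prefix expands to.
    TreeSet<Integer> docsFor(String prefix) {
        TreeSet<Integer> docs = new TreeSet<>();
        for (TreeSet<Integer> postings : expand(prefix).values()) {
            docs.addAll(postings);
        }
        return docs;
    }

    // Invented sample data for the demonstration.
    static PrefixExpansionSketch sample() {
        PrefixExpansionSketch idx = new PrefixExpansionSketch();
        idx.add(1, "ana", "karenjina");
        idx.add(2, "ani", "karenjinoj");
        idx.add(3, "andrija", "antun");
        idx.add(4, "analiza", "anketa");
        idx.add(5, "ana", "antun");
        return idx;
    }

    public static void main(String[] args) {
        PrefixExpansionSketch idx = sample();
        for (String prefix : new String[]{"an", "karenjin"}) {
            System.out.println(prefix + "* expands to " + idx.expand(prefix).size()
                    + " terms covering " + idx.docsFor(prefix).size() + " documents");
        }
        // Only after both sides are expanded can the two document sets be intersected.
        TreeSet<Integer> both = idx.docsFor("an");
        both.retainAll(idx.docsFor("karenjin"));
        System.out.println("documents containing both: " + both);
    }
}
```

In this model the work done for "an*" depends on how many terms and postings it expands to, regardless of how selective "karenjin*" is.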

This is where I am confused. If I query just "karenjin*", I get results in 1 second. So I tried querying "an* karenjin*" with a Filter for "karenjin*", built from a WildcardQuery and a QueryWrapperFilter. It is still unacceptably slow (I killed the process before it returned anything).

The documentation says that a Filter reduces the search space of a Query. So I tried this filter:

Filter filter = new QueryWrapperFilter(new WildcardQuery(new Term("text", "karenjin*")));

And query:

Query query = createPhraseQuery(new String[]{"an*", "karenjin*"}, "text");

Then I search (after several warm-up queries):

Sort sort = new Sort(new SortField("insertTime", SortField.Type.STRING, true));
TopDocs docs = searcher.search(query, filter, 100, sort);

OK, what is my question?

Why is querying with:

Query query = new WildcardQuery(new Term("text", "karenjin*"));

fast, while the phrase query with the Filter described above is still slow?

Antonio Tomac

1 Answer

Yes, wildcards can be performance hogs, especially if they match a lot of terms, but what you describe does seem surprisingly so. It is hard to say for sure why that is occurring, but here is an attempt.

I'll assume:

Query query = new WildcardQuery(new Term("text", "an*"));

On its own, this is performing very badly, as described. Since the wildcards you are looking for are both prefix-style queries, it's a better idea to use a PrefixQuery instead.

Query query = new PrefixQuery(new Term("text", "an"));

Though I don't think that will make much of a difference, if any at all. What might make a difference is changing your rewrite method. You could try limiting the number of Terms the query is rewritten into:

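PrefixQuery query = new PrefixQuery(new Term("text", "an"));
// or
// WildcardQuery query = new WildcardQuery(new Term("text", "an*"));
// setRewriteMethod is declared on MultiTermQuery, so the variable cannot be typed as plain Query
query.setRewriteMethod(new MultiTermQuery.TopTermsScoringBooleanQueryRewrite(10));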
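
To show what capping the rewrite trades away, here is a minimal plain-Java sketch (no Lucene involved) that expands a prefix into at most a fixed number of terms. The class `CappedExpansionSketch`, its sample index, and the rule of keeping the first terms in dictionary order are all assumptions for illustration; Lucene's own top-terms rewrites do not necessarily choose terms this way.

```java
import java.util.Map;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical illustration class, not part of Lucene.
final class CappedExpansionSketch {

    // Union the postings of at most maxTerms terms that start with the prefix.
    // Assumption: terms are taken in dictionary order, purely to keep the sketch simple.
    static TreeSet<Integer> docsFor(TreeMap<String, int[]> index, String prefix, int maxTerms) {
        TreeSet<Integer> docs = new TreeSet<>();
        int used = 0;
        for (Map.Entry<String, int[]> e
                : index.subMap(prefix, prefix + Character.MAX_VALUE).entrySet()) {
            if (used++ == maxTerms) {
                break; // cap reached: remaining matching terms are ignored
            }
            for (int doc : e.getValue()) {
                docs.add(doc);
            }
        }
        return docs;
    }

    // Invented sample data: term -> ids of the documents that contain it.
    static TreeMap<String, int[]> sample() {
        TreeMap<String, int[]> index = new TreeMap<>();
        index.put("ana", new int[]{1});
        index.put("analiza", new int[]{3});
        index.put("ani", new int[]{2});
        index.put("antun", new int[]{3});
        index.put("karenjina", new int[]{1});
        index.put("karenjinoj", new int[]{2});
        return index;
    }

    public static void main(String[] args) {
        TreeMap<String, int[]> index = sample();
        for (int cap : new int[]{Integer.MAX_VALUE, 2}) {
            TreeSet<Integer> hits = docsFor(index, "an", cap);
            hits.retainAll(docsFor(index, "karenjin", Integer.MAX_VALUE));
            System.out.println("cap=" + cap + " -> documents matching both prefixes: " + hits);
        }
    }
}
```

With the cap in place, fewer postings are read, but any document reachable only through a dropped term disappears from the results.
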
femtoRgon
  • Thanks for the advice. I will try limiting the number of terms and see how it performs. I expect it to be a lot faster, but the results may be incomplete. It is a tradeoff between time and results. – Antonio Tomac Sep 30 '14 at 00:30
  • I will try it. And according to the book Lucene in Action, a WildcardQuery is internally recognized and optimized to a PrefixQuery if it ends with *, or even to a TermQuery if there are no wildcards. – Antonio Tomac Sep 30 '14 at 00:37
  • I believe that is correct, but I rather expected that logic to live in the parser, and I didn't see it there. It could be part of the rewrite itself, though. – femtoRgon Sep 30 '14 at 01:27
  • In current versions of Lucene, PrefixQuery and WildcardQuery both extend AutomatonQuery, and the automaton they generate comes out the same, so there is no obvious benefit to choosing one over the other. – Hakanai Aug 10 '17 at 02:22