I have a RAMDirectory with 1.5 million documents and I'm searching a single field with a PrefixQuery. When the search text is 3 or more characters long, the search is extremely fast, under 20 milliseconds. But when the search text is shorter than 3 characters, the search can take up to a full second.

Since it's an autocomplete feature and the user starts with one character (and there are indeed results that are only one character long), I cannot restrict the length of the search text.

The code is pretty much:

var symbolCodeTopDocs = searcher.Search(new PrefixQuery(new Term("SymbolCode", searchText)), 10);

The SymbolCode is a NOT_ANALYZED field. The Lucene.NET version is 3.0.3.

The example is simplified, and I might have to use a BooleanQuery to apply additional constraints in a real world scenario.

How can I improve performance on this specific case? These single-char or two-char queries are bringing the server down.

Diego Frata
  • First, you must understand that a prefix query is in fact rewritten as a BooleanQuery, which may be very slow depending on how many distinct terms are being searched and whether scoring is enabled. If scoring isn't important in your case, you can try to use a Filter instead. Also try checking how many distinct terms are returned by your query. Use reader.terms(term). – Juan Lopes Dec 27 '12 at 20:22
  • @JuanLopes I'd like to have scoring or sorting enabled, as I have to return the top 10 relevant documents. The number of distinct terms is somewhere above 100K for the term 'a'. I'm trying to search using a Filter instead of a Query, but can't find a suitable method. Any further ideas? – Diego Frata Dec 27 '12 at 20:51
  • @JuanLopes I computed the terms before the IndexReader was reopened. The number of distinct terms is over 1.2 million. – Diego Frata Dec 27 '12 at 21:15
  • If you want to provide autocomplete, don't use a PrefixQuery to do it; use an n-gram approach. – Jf Beaulac Dec 28 '12 at 02:11
  • How would a n-gram analyzer improve performance in this case? – Diego Frata Dec 28 '12 at 12:41
  • With n-grams you don't need PrefixQueries; it becomes a simple TermQuery. – Jf Beaulac Dec 28 '12 at 20:21

1 Answer

Consider removing stop words from your index if you haven't already.

To see how stop words slow down a PrefixQuery, consider how it works: it is rewritten as a BooleanQuery that includes every term in the index beginning with the PrefixQuery's term. For example, a* becomes a OR and OR aardvark OR anchor OR ... So far this isn't bad, and it performs surprisingly well even with thousands of terms. The real drain comes when stop words like a and and are included, because they are likely to appear multiple times in every single document in your index. This creates much more work for the gathering/collecting/scoring portion of the search and thus slows things down.
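The expansion described above can be pictured with a small sketch. It does not call Lucene at all; it only assumes the term dictionary behaves like an ordinally sorted list of strings, and the class and method names (PrefixExpansion, Expand) are made up for illustration. The point is that a shorter prefix matches more terms, and each matched term becomes one more clause to collect and score.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class PrefixExpansion
{
    // Hypothetical helper: returns every term in an ordinally sorted term list
    // that starts with the prefix, roughly what a prefix query enumerates
    // when it is rewritten into one clause per matching term.
    public static List<string> Expand(IEnumerable<string> sortedTerms, string prefix)
    {
        return sortedTerms
            // Skip terms that sort before the prefix.
            .SkipWhile(t => string.CompareOrdinal(t, prefix) < 0)
            // Matching terms are contiguous in a sorted list, so stop at the first miss.
            .TakeWhile(t => t.StartsWith(prefix, StringComparison.Ordinal))
            .ToList();
    }
}
```

With the terms a, aardvark, anchor, and, b, the prefix a expands to four terms while an expands to two. On a real index, running the same comparison for a one-character prefix and a three-character prefix should show where the slowdown comes from.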

On a side note, I highly recommend not running the autocomplete search when the user has entered fewer than 2 or 3 characters, purely from a usability perspective. I can't imagine the results would be at all relevant. Imagine running a search for a* -- there's no way to tell which results are more relevant. If you must display something to the user, then consider an n-gram approach like Jf Beaulac suggested in the comments.
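As a rough sketch of the n-gram idea (edge n-grams specifically, which is an assumption about what was meant), each symbol can be expanded into its leading substrings at indexing time. Each substring would then be stored as a separate NOT_ANALYZED value in an extra field, so that whatever the user has typed is looked up as a single exact term rather than expanded into thousands. The helper below is plain C# with hypothetical names (EdgeNGrams, Of) and leaves the Lucene indexing calls out.

```csharp
using System;
using System.Collections.Generic;

public static class EdgeNGrams
{
    // Hypothetical helper: yields the leading substrings of a symbol,
    // e.g. "CITI" -> "C", "CI", "CIT", "CITI" (capped at maxLength characters).
    public static IEnumerable<string> Of(string symbol, int maxLength)
    {
        int limit = Math.Min(symbol.Length, maxLength);
        for (int n = 1; n <= limit; n++)
        {
            yield return symbol.Substring(0, n);
        }
    }
}
```

The trade-off is a larger index, since every symbol contributes up to maxLength extra terms. Capping maxLength at the longest prefix you expect users to type keeps that growth bounded.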

Keith
  • I've already removed all the stop words. Incredibly enough, there are results that are just one character wide and are VERY relevant. I'm building an autocomplete for stock market symbols, and C is the symbol code for Citibank. If the user types C and has sufficient privileges to access Citibank's data, then C must show up as a valid alternative in the autocomplete along with the company's name (which is stored in Lucene in a different field). – Diego Frata Dec 28 '12 at 16:35
  • I've calculated some boost factors for each document during the indexing stage. I'd like the results to be returned based on these factors: the most relevant stock symbols starting with C should be the ones that have a bigger boost value and match the prefix criteria. – Diego Frata Dec 28 '12 at 16:42
  • Ahh, now I see the importance of searching even on 1 character. If you haven't already, can you create an index that contains only company names and stock market symbols in order to minimize the number of terms in the index? – Keith Dec 28 '12 at 18:11
  • I did, and it didn't show huge improvements. I've decided to go another route: I'm going to split the documents between two directories. The first one will contain all documents that have a SymbolCode with fewer than 3 chars and documents that are among the 10K most traded symbols. The second one will contain everything else. This way I'll hardly hit the second directory and won't have problems with the number of terms matching my PrefixQuery. – Diego Frata Dec 28 '12 at 20:24
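
The routing rule in that last comment could look roughly like the sketch below. It assumes each document comes with a trading rank where 1 means most traded; that rank, along with the names TierRouter and BelongsInHotTier, is hypothetical and not from the thread. Only the index-time decision is shown, since how the two directories are queried is not described.

```csharp
using System;

public static class TierRouter
{
    // Symbols shorter than this always go to the first (small) directory.
    public const int ShortSymbolLength = 3;

    // The 10K most traded symbols also go to the first directory.
    public const int HotRankCutoff = 10000;

    // tradedRank is an assumed input: 1 = most traded symbol.
    public static bool BelongsInHotTier(string symbolCode, int tradedRank)
    {
        return symbolCode.Length < ShortSymbolLength || tradedRank <= HotRankCutoff;
    }
}
```

Under these assumptions, C with any rank lands in the first directory, while a long, rarely traded symbol lands in the second.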