Keeping query statistics using Lucene

Question

I am developing a search component of a web application using Lucene. I would like to save the user queries to an index and use them to suggest alternate queries to users, and to keep query statistics (most often used queries, top scoring queries, ...).

To use this data for alternate query suggestions, I would analyze the queries to see which terms are most often used with one another and use that to create a suggestion to the user.

But I can't figure out in which form to index the data. I was thinking of simply adding the queries into the index, but in that way there could be a lot of redundant data since many documents in the index would have the same content. Does anyone have any ideas about the way this can be accomplished?

Thanks for the help.

score 1 · Answer 1 · answered Nov 25 '10 at 15:19

"I was thinking of simply adding the queries into the index, but in that way there could be a lot of redundant data since many documents in the index would have the same content"

You can tell Lucene not to store document content, which means that the principal overhead will be the unique Terms and the index itself. So it might not be a large overhead to store each query as a unique Document. This way you will not be throwing away any information.

Joel


I thought about doing that, but I also need to keep some statistics on the queries (number of times they were used, number of hits) and the only way I can think of to achieve this is to store the number of times used in the index and increment it before updating the document, but that seems like an expensive operation. – jbradaric Nov 25 '10 at 16:19
Can you use the term frequencies from Lucene itself to do this? http://stackoverflow.com/questions/667389/get-term-frequencies-in-lucene. If you want to do the query recommendations in real time, you'll want to pre-compute the term frequencies ahead of time and store them. – Joel Nov 25 '10 at 16:22
I can use the TermFrequencies if I don't store the queries as a unique field, but I was hoping to avoid that. But it seems that I'll have to store the queries as non-unique until I figure out a better solution, if the solution even exists. – jbradaric Nov 25 '10 at 16:38
I'm probably misreading you, but don't you _want_ to store the queries as one per document, and one field per document, but use an analyser that tokenises the query into constituent words? That way you'll only ever have one instance (token) for each unique word (not query)... – Joel Nov 25 '10 at 16:51
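
The per-word counting discussed in these comments can be sketched outside Lucene. The following is a minimal illustration in plain Python, not Lucene code: `tokenize` is a hypothetical stand-in for an analyser, and `term_statistics` and `suggest_terms` are names invented for this example. It counts how many queries contain each term and how often two terms occur in the same query, which is the raw material for the "terms most often used with one another" suggestions described in the question.

```python
import re
from collections import Counter
from itertools import combinations


def tokenize(query):
    """Lowercase a query and split it into word tokens (stand-in for an analyser)."""
    return re.findall(r"\w+", query.lower())


def term_statistics(queries):
    """Return (term_counts, pair_counts) over a list of raw query strings."""
    term_counts = Counter()
    pair_counts = Counter()
    for query in queries:
        # Distinct terms only, so a repeated word counts once per query.
        terms = sorted(set(tokenize(query)))
        term_counts.update(terms)
        # Sorting gives every unordered pair a single canonical key.
        pair_counts.update(combinations(terms, 2))
    return term_counts, pair_counts


def suggest_terms(term, pair_counts, limit=3):
    """Return the terms that most often share a query with `term`."""
    related = Counter()
    for (first, second), count in pair_counts.items():
        if first == term:
            related[second] += count
        elif second == term:
            related[first] += count
    return [other for other, _ in related.most_common(limit)]
```

Because only term and pair counts are kept, repeated queries add to the counters rather than to the amount of stored text. A real-time suggester would likely precompute these counters periodically, as suggested in the comments.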

score 1 · Answer 2 · answered Nov 25 '10 at 21:21

First, I believe that you should store the queries separately from the existing index. The problem is not redundant data but rather "watering down" your index - storing the queries in the same index may harm the relevance of your searches. Some options for this are:

Use a separate Lucene index.
Use Solr, with two separate cores, one for the documents and the other for the queries.
Use a query log. Store scores with the queries, then build query statistics by post-processing the log. As this is a web application, you can probably use the logs of a servlet container such as Tomcat for this.
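
The query-log option could be post-processed along these lines. This is only a sketch under stated assumptions: the tab-separated line format (timestamp, query, hit count, top score) is hypothetical and would have to match whatever the application actually writes, and the function names are invented for this example.

```python
from collections import defaultdict


def summarize_query_log(lines):
    """Aggregate usage count, latest hit count, and best top score per query.

    Assumed (hypothetical) line format: timestamp<TAB>query<TAB>hits<TAB>top_score
    """
    stats = defaultdict(lambda: {"count": 0, "hits": 0, "best_score": 0.0})
    for line in lines:
        parts = line.rstrip("\n").split("\t")
        if len(parts) != 4:
            continue  # skip lines that do not match the assumed format
        _timestamp, query, hits, score = parts
        # Normalize case and whitespace so trivially different queries merge.
        key = " ".join(query.lower().split())
        entry = stats[key]
        entry["count"] += 1
        entry["hits"] = int(hits)  # keep the most recent hit count
        entry["best_score"] = max(entry["best_score"], float(score))
    return dict(stats)


def most_used(stats, n=10):
    """Queries ordered by how often they were issued."""
    return sorted(stats, key=lambda q: stats[q]["count"], reverse=True)[:n]


def top_scoring(stats, n=10):
    """Queries ordered by the best top score they ever produced."""
    return sorted(stats, key=lambda q: stats[q]["best_score"], reverse=True)[:n]
```

Appending a line to a log on each search avoids the read-increment-update cycle on an index document that was raised as a cost concern in the comments on the first answer; the aggregation then runs offline.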

Second, "Auto-Suggest From Popular Queries Using EdgeNGrams" describes an alternative implementation of query suggestion using Solr.
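
The general idea behind edge n-grams can be illustrated without Solr. The sketch below is not Solr's implementation or configuration; it is plain Python with invented names (`edge_ngrams`, `build_suggest_index`, `suggest`) and assumes a dictionary of query popularity counts is already available. Each stored query is indexed under all of its prefixes, so a partially typed query becomes an exact-match lookup.

```python
from collections import defaultdict


def edge_ngrams(text, min_len=1, max_len=20):
    """Return the prefixes of `text` between min_len and max_len characters."""
    upper = min(len(text), max_len)
    return [text[:n] for n in range(min_len, upper + 1)]


def build_suggest_index(query_counts):
    """Map every prefix to the stored queries that start with it."""
    index = defaultdict(list)
    for query in query_counts:
        for gram in edge_ngrams(query.lower()):
            index[gram].append(query)
    return index


def suggest(prefix, index, query_counts, limit=5):
    """Return stored queries starting with `prefix`, most popular first."""
    candidates = index.get(prefix.lower(), [])
    return sorted(candidates, key=lambda q: query_counts[q], reverse=True)[:limit]
```

For example, with counts `{"lucene index": 12, "lucene query": 30}`, the prefix `"luc"` would return `"lucene query"` before `"lucene index"`. The trade-off is a larger index in exchange for cheap lookups at typing time.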

Or they could just be stored as a distinct Document type in the same index, but yeah, still probably wise to separate real data from auxiliary data. – Joel Nov 26 '10 at 07:51
