I am working on a store search API using Lucene.

I need to show store search results for each City,State combination, with its frequency in parentheses. For example:

Los Angeles, CA (450)
Atlanta, GA (212)
Boston, MA (78)
...

As of now, my search results return around 7000 Lucene documents, on average, if the user says "Show me all the stores". In this use case, I end up showing around 800 unique City,State records as shown above.

I am overriding the HitCollector class's Collect method and retrieving vectors as follows:

var vectors = _reader.GetTermFreqVectors(doc);

Then I iterate through this collection and calculate the frequency for each unique City,State combination.

But this is turning out to be very slow. Is there a better way of grouping search results and calculating frequency in Lucene? A code snippet would be very helpful.

Also, please suggest any other techniques or tips I could use to optimize my Lucene search code.

Thanks for reading!

Eddie
Steve Chapman

3 Answers

I don't believe you can do this out of the box in Lucene currently. Searching for this functionality turns up this open issue:

Jira Lucene Feature Request

The functionality is available out of the box in Solr, however, which provides a faceting feature. A query such as the following:

http://localhost:8983/solr/select?q=ipod&rows=0&facet=true&facet.limit=-1&facet.field=cat&facet.field=inStock

would return the following result:

<response>
<responseHeader><status>0</status><QTime>2</QTime></responseHeader>
<result numFound="4" start="0"/>
<lst name="facet_counts">
 <lst name="facet_queries"/>
 <lst name="facet_fields">
  <lst name="cat">
        <int name="search">0</int>
        <int name="memory">0</int>
        <int name="graphics">0</int>
        <int name="card">0</int>
        <int name="music">1</int>
        <int name="software">0</int>
        <int name="electronics">3</int>
        <int name="copier">0</int>
        <int name="multifunction">0</int>
        <int name="camera">0</int>
        <int name="connector">2</int>
        <int name="hard">0</int>
        <int name="scanner">0</int>
        <int name="monitor">0</int>
        <int name="drive">0</int>
        <int name="printer">0</int>
  </lst>
  <lst name="inStock">
        <int name="false">3</int>
        <int name="true">1</int>
  </lst>
 </lst>
</lst>
</response>

More information on faceting can be found on the Solr website:

http://wiki.apache.org/solr/SimpleFacetParameters
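
Applied to the store search in the question, the same kind of request could be built as in the sketch below. This is only an illustration: the `FacetUrl` class and the `city_state` field name are hypothetical, and the field is assumed to be indexed as a single untokenized string so that a value like "Los Angeles,CA" is counted as one facet value instead of being split into words (as happens with the `cat` field above). `facet.mincount=1` is added to suppress zero-count entries.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of Solr: builds a facet-only select URL.
class FacetUrl {
    // baseUrl, query and field are supplied by the caller;
    // "city_state" below is an assumed field name.
    static String build(String baseUrl, String query, String field) {
        return baseUrl + "/select"
            + "?q=" + URLEncoder.encode(query, StandardCharsets.UTF_8)
            + "&rows=0"              // no documents, only the counts
            + "&facet=true"
            + "&facet.limit=-1"      // return every facet value
            + "&facet.mincount=1"    // skip values with zero matches
            + "&facet.field=" + URLEncoder.encode(field, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // "*:*" matches all documents ("show me all the stores").
        System.out.println(build("http://localhost:8983/solr", "*:*", "city_state"));
    }
}
```

Each `<int name="...">` entry under the `city_state` list in the response would then carry one City,State value and its count.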

EDIT: If you definitely don't want to go down the Solr approach to faceting, you may be able to leverage the functionality described in this post for Lucene:

http://sujitpal.blogspot.com/2007/01/faceted-searching-with-lucene.html

which provides an implementation of the faceting feature on top of Lucene 2.0 via a patch.
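
If you stay with plain Lucene, the counting itself can be kept cheap by avoiding a term vector lookup per hit. The sketch below is not the approach from the linked post; it is a minimal illustration that assumes the City,State value of every document has been loaded once into an array indexed by document id (in the Lucene 2.x/3.x line, `FieldCache` can provide such an array for a single-valued, untokenized field). The `FacetCounter` class is hypothetical, and its `collect` method is meant to be called from your collector's collect method.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper: counts hits per "City,State" value.
class FacetCounter {
    // Assumed to hold one value per document id, loaded once per index reader.
    private final String[] valueByDoc;
    private final Map<String, Integer> counts = new HashMap<>();

    FacetCounter(String[] valueByDoc) {
        this.valueByDoc = valueByDoc;
    }

    // Call once per matching document; a single array lookup per hit.
    void collect(int doc) {
        String value = valueByDoc[doc];
        if (value != null) {              // documents without the field are skipped
            counts.merge(value, 1, Integer::sum);
        }
    }

    Map<String, Integer> counts() {
        return counts;
    }

    public static void main(String[] args) {
        String[] values = {"Atlanta,GA", "Boston,MA", "Atlanta,GA"};
        FacetCounter counter = new FacetCounter(values);
        for (int doc = 0; doc < values.length; doc++) {
            counter.collect(doc);         // pretend every document matched
        }
        counter.counts().forEach((k, v) -> System.out.println(k + " (" + v + ")"));
    }
}
```

With roughly 7000 hits and 800 distinct values, this comes down to 7000 array reads and map updates per search, at the cost of keeping one string per document in memory.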

Jonathan Holloway
  • Can you please answer this one? http://stackoverflow.com/questions/899542/problem-using-same-instance-of-indexsearcher-for-multiple-requests – Steve Chapman Jun 12 '09 at 02:44

I'm not sure that I understood what you mean by "grouping", but if you just want to count the number of docs for each category, you should take a look at this question.

My answer there still stands, though nobody seemed to like it enough to upvote me...

itsadok

Steve, I believe you want faceted search. It does not come out of the box with Lucene. I suggest you try Solr, which has faceting as a major and convenient feature.

Yuval F
  • Can you please answer this one? http://stackoverflow.com/questions/899542/problem-using-same-instance-of-indexsearcher-for-multiple-requests – Steve Chapman Jun 12 '09 at 02:43