I built a custom collector for Lucene.Net, but I can't figure out how to order (or page) the results. Every time Collect gets called, I can add the result to an internal PriorityQueue, which I understand is the correct way to do this.

I extended the PriorityQueue, but it requires a size parameter on creation. You have to call Initialize in the constructor and pass in the max size.

However, in a collector, the searcher just calls Collect when it gets a new result, so I don't know how many results I have when I create the PriorityQueue. Based on this, I can't figure out how to make the PriorityQueue work.

I realize I'm probably missing something simple here...

Deane

2 Answers

PriorityQueue is not a SortedList or SortedDictionary. It is a sorting structure that returns the top M results (M being your PriorityQueue's size) out of N elements. You can add as many items as you want with InsertWithOverflow, but it will hold only the top M elements.

Suppose your search resulted in 1000000 hits. Would you return all of the results to the user? A better way would be to return the top 10 elements to the user (using PriorityQueue(10)), and if the user requests the next 10 results, you can run a new search with PriorityQueue(20) and return the last 10 of those 20 elements, and so on. This is the trick most search engines, like Google, use.
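To make the "search again with a bigger queue" step concrete, here is a minimal sketch of cutting one page out of a drained queue. It assumes the queue was sized to pageNumber * pageSize and that its items were popped weakest first (Lucene's PriorityQueue pops the least element, as defined by LessThan, first). The names Paging and ExtractPage are hypothetical helpers, not part of Lucene.Net.

```csharp
using System;
using System.Collections.Generic;

public static class Paging
{
    // weakestFirst: items drained from a queue of size pageNumber * pageSize,
    // in pop order (least element first). Hypothetical helper, not a Lucene API.
    public static List<T> ExtractPage<T>(IEnumerable<T> weakestFirst, int pageNumber, int pageSize)
    {
        if (pageNumber < 1) throw new ArgumentOutOfRangeException("pageNumber");
        if (pageSize < 1) throw new ArgumentOutOfRangeException("pageSize");

        var bestFirst = new List<T>(weakestFirst);
        bestFirst.Reverse(); // pop order is ascending, so flip it to best first

        int skip = (pageNumber - 1) * pageSize; // results already shown on earlier pages
        var page = new List<T>();
        for (int i = skip; i < bestFirst.Count && page.Count < pageSize; i++)
        {
            page.Add(bestFirst[i]);
        }
        return page; // may be shorter than pageSize, or empty, near the end of the hits
    }
}
```

For page 2 with 10 results per page, you would search with a queue of size 20, drain it, and pass the drained items with pageNumber = 2 and pageSize = 10.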

Everytime Commit gets called, I can add the result to an internal PriorityQueue.

I cannot understand the relationship between Commit and search, so here is a sample usage of PriorityQueue:

public class CustomQueue : Lucene.Net.Util.PriorityQueue<Document>
{
    public CustomQueue(int maxSize): base()
    {
        Initialize(maxSize);
    }

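    public override bool LessThan(Document a, Document b)
    {
        // "field1" is a placeholder; compare on whichever stored field defines your sort order.
        // Return true when a should rank below b.
        return string.CompareOrdinal(a.Get("field1"), b.Get("field1")) < 0;
    }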
}

public class MyCollector : Lucene.Net.Search.Collector
{
    CustomQueue _queue = null;
    IndexReader _currentReader;

    public MyCollector(int maxSize)
    {
        _queue = new CustomQueue(maxSize);
    }

    public override bool AcceptsDocsOutOfOrder()
    {
        return true;
    }

    public override void Collect(int doc)
    {
        _queue.InsertWithOverflow(_currentReader.Document(doc));
    }

    public override void SetNextReader(IndexReader reader, int docBase)
    {
        _currentReader = reader;
    }

    public override void SetScorer(Scorer scorer)
    {
    }
}

searcher.Search(query, new MyCollector(10)); // First page.
searcher.Search(query, new MyCollector(20)); // 2nd page.
searcher.Search(query, new MyCollector(30)); // 3rd page.

EDIT for @nokturnal

public class MyPriorityQueue<TObj, TComp> : Lucene.Net.Util.PriorityQueue<TObj>
                                where TComp : IComparable<TComp>
{
    Func<TObj, TComp> _KeySelector;

    public MyPriorityQueue(int size, Func<TObj, TComp> keySelector) : base()
    {
        _KeySelector = keySelector;
        Initialize(size);
    }

    public override bool LessThan(TObj a, TObj b)
    {
        return _KeySelector(a).CompareTo(_KeySelector(b)) < 0;
    }

    // Drains the queue: enumerating is destructive and yields the least element first.
    public IEnumerable<TObj> Items
    {
        get
        {
            int size = Size();
            for (int i = 0; i < size; i++)
                yield return Pop();
        }
    }
}

var pq = new MyPriorityQueue<Document, string>(3, doc => doc.GetField("SomeField").StringValue);
foreach (var item in pq.Items)
{
}
L.B
  • But here's the thing -- to sort, I obviously have to complete the entire result set first, so I have everything to sort. So, if I have a search that returns 100,000 results, and I want to give the user the first X results when sorted by, say, date, then I have to add all 100,000 results to the PriorityQueue, correct? – Deane Oct 29 '11 at 12:09
  • Also, I'm sorry -- meant "Collect," not "Commit." I have edited the question. – Deane Oct 29 '11 at 12:09
  • `have to complete the entire result set first` + `I have to add all 100,000 results to the PriorityQueue` Yes, but there will be *at most M elements* in the queue at any time during collect process. – L.B Oct 29 '11 at 13:07
  • Okay, I think I see my disconnect -- when I say Max Size is X, that doesn't mean I can't keep adding and adding all day long, it just means that it will "throw away" anything that doesn't "fit" based on the sort. I was thinking I added them all, THEN the queue did a big batch sort. You seem to be saying that the sort is incremental, so every time I add something, it's eval'd at that time. True? – Deane Oct 29 '11 at 13:14
  • Yes, exactly, which is why it's fast. It doesn't do a `batch` sort on all results. – L.B Oct 29 '11 at 13:18
  • Awesome answer and explanation... one question though (probably a moronic one) but how do I iterate over the queue after it is populated? – nokturnal Jun 11 '13 at 21:07
  • ... I exposed the CustomQueue (via a getter) and am looping with: while(collector.Queue.Size() > 0) { Document doc = collector.Queue.Pop(); } ... just feels wrong and/or dirty to me for some reason... What do you guys think... – nokturnal Jun 11 '13 at 21:17
  • 1
    @nokturnal you can add an `Items` method to `CustomQueue` as **`public IEnumerable Items() { int size = Size(); for (int i = 0; i < size; i++) { yield return Pop(); } }`** which would allow you to use *`foreach`* – L.B Jun 12 '13 at 18:12
Lucene's PriorityQueue is size-limited because it uses a fixed-size implementation, which is very fast.
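As a rough illustration of why a fixed size helps, here is a minimal sketch of a fixed-capacity binary min-heap over ints. It shows the general technique only and is not Lucene.Net's actual source; the class name FixedMinHeap and its members are assumptions made for this example. The array is allocated once, and each insert costs at most O(log M) for a queue of size M, no matter how many hits pass through.

```csharp
using System;

// Hypothetical fixed-capacity min-heap; a sketch of the technique, not Lucene's code.
public class FixedMinHeap
{
    private readonly int[] _heap; // allocated once, never resized
    private int _count;

    public FixedMinHeap(int capacity)
    {
        if (capacity < 1) throw new ArgumentOutOfRangeException("capacity");
        _heap = new int[capacity];
    }

    public int Count { get { return _count; } }

    // Adds a value; when full, replaces the smallest retained value if the new one is larger.
    public void InsertWithOverflow(int value)
    {
        if (_count < _heap.Length)
        {
            _heap[_count] = value;
            SiftUp(_count);
            _count++;
        }
        else if (value > _heap[0])
        {
            _heap[0] = value; // overwrite the current smallest in place
            SiftDown(0);
        }
    }

    // Removes and returns the smallest retained value.
    public int Pop()
    {
        if (_count == 0) throw new InvalidOperationException("heap is empty");
        int smallest = _heap[0];
        _count--;
        _heap[0] = _heap[_count];
        SiftDown(0);
        return smallest;
    }

    private void SiftUp(int i)
    {
        while (i > 0)
        {
            int parent = (i - 1) / 2;
            if (_heap[parent] <= _heap[i]) break;
            Swap(parent, i);
            i = parent;
        }
    }

    private void SiftDown(int i)
    {
        while (true)
        {
            int left = 2 * i + 1, right = left + 1, smallest = i;
            if (left < _count && _heap[left] < _heap[smallest]) smallest = left;
            if (right < _count && _heap[right] < _heap[smallest]) smallest = right;
            if (smallest == i) break;
            Swap(smallest, i);
            i = smallest;
        }
    }

    private void Swap(int a, int b)
    {
        int tmp = _heap[a];
        _heap[a] = _heap[b];
        _heap[b] = tmp;
    }
}
```

Feeding the values 5, 1, 9, 3, 7, 8 into a heap of capacity 3 leaves only 7, 8 and 9 in it; the other three are discarded as they arrive rather than stored and sorted later.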

Think about the reasonable maximum number of results to return at a time and use that number. The "waste" when there are only a few results is a small price for the speed it buys.

On the other hand, if you have so many results that you cannot hold them, how are you going to serve or display them? Keep in mind that this is for "top" hits, so as you iterate through the results you will be hitting less and less relevant ones anyway.

Desmond Zhou