
There are several overloads of the IndexSearcher.Search method in Lucene. Some of them require a "top n hits" argument and some don't (the latter are obsolete and will be removed in Lucene.NET 3.0).

The overloads that require the "top n" argument preallocate memory for that entire possible range of results. So when you can't even approximately estimate how many results a query will return, the only option is to pass an arbitrarily large number to ensure that all query results come back. This causes severe memory pressure and leaks due to LOH fragmentation.

Is there an official, non-obsolete way to search without passing a "top n" argument?

Thanks in advance, guys.

Rezgar Cadro

1 Answer


I'm using Lucene.NET 2.9.2 as the reference point for this answer.

You could build a custom collector and pass it to one of the Search overloads, e.g. Search(Query, Collector).

using System;
using System.Collections.Generic;
using Lucene.Net.Index;
using Lucene.Net.Search;

public class AwesomeCollector : Collector {
    private readonly List<Int32> _docIds = new List<Int32>();
    private readonly Single _lowerInclusiveScore;
    private Scorer _scorer;
    private Int32 _docBase;

    // Collect every hit scoring at least lowerInclusiveScore;
    // pass Single.MinValue to accept all hits.
    public AwesomeCollector(Single lowerInclusiveScore) {
        _lowerInclusiveScore = lowerInclusiveScore;
    }

    public IEnumerable<Int32> DocumentIds {
        get { return _docIds; }
    }

    public override void SetScorer(Scorer scorer) {
        _scorer = scorer;
    }

    public override void Collect(Int32 doc) {
        // doc is relative to the current segment reader; add the
        // segment's docBase to get the index-wide document id.
        var score = _scorer.Score();
        if (_lowerInclusiveScore <= score)
            _docIds.Add(_docBase + doc);
    }

    public override void SetNextReader(IndexReader reader, Int32 docBase) {
        _docBase = docBase;
    }

    public override bool AcceptsDocsOutOfOrder() {
        return true;
    }
}
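
A minimal usage sketch (assuming an existing IndexSearcher named searcher and a Query named query; the 0f threshold simply accepts every hit):

var collector = new AwesomeCollector(0f);
searcher.Search(query, collector); // this Search(Query, Collector) overload takes no "top n" argument

foreach (var docId in collector.DocumentIds) {
    var document = searcher.Doc(docId); // resolve the collected id to its stored document
    // ... process document
}
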
sisve
  • Thank you for your suggestion. We have actually been using a Collector in pretty much the same way, with the only difference being a LinkedList instead of a List to prevent memory reallocation on growth. This approach works great when there's no need to do sorting, but there's no Search() overload that receives both a Collector and a Sort object. When using Sort, we force Lucene to use the default TopFieldCollector, which preallocates memory in the way described. Maybe it would be a good idea to use a custom collector which does its own sorting on each Collect call. What do you think? – Rezgar Cadro Jan 21 '11 at 13:34
  • I would change it to store both the document id and the sort value in the list, and do the sorting once all results have been collected. You can use the FieldCache if you have a single keyword field as the sort field; it will load (and cache) field values per segment. You _must_ use the inner reader (the one passed to you in SetNextReader) for the cache to work properly; see the sketch after these comments. – sisve Jan 21 '11 at 14:04
  • Yes, I guess this would be the best way to do it, with the exception of using the FieldCache. That thing eats up a lot of memory on a large index, and since we have to re-open the index quite often, I would prefer not to load all field data from the index into memory just for the sake of sorting a few hundred rows. So it seems that your advice is basically an answer to my question. Thank you Simon :) – Rezgar Cadro Jan 21 '11 at 15:29
  • Calling .Reopen on an IndexReader will reuse already opened segments and, as such, already opened SegmentReaders. This means that the FieldCache will already have the items in memory, and no disk access is needed to retrieve the sort values. But it really consumes a lot of memory, yes. – sisve Jan 21 '11 at 15:30
  • I can't really use Reopen since it reopens the index at a specific commit point. Our production index is frequently updated and can be randomly rebuilt. Actually, it was memory pressure that pushed me to dig into the Sort and FieldCache mechanics. I'm not sure if it's true yet, but most likely the allocation of large memory chunks for the FieldCache on every index re-opening caused LOH fragmentation. So for now I think I'll try to avoid it and see if that helps. – Rezgar Cadro Jan 21 '11 at 15:44
  • Uhm. Reopen() will return a fresh IndexReader from the source used to create the original reader (including an IndexWriter for near-realtime updates). It will return the same instance if no changes have occurred, and reuse any segments that haven't changed. Remember that the IndexReader you get from IndexReader.Open is really a DirectoryReader composed of several SegmentReaders. SegmentReaders are reused as long as the segment still exists (segments are only created and deleted, never updated). – sisve Jan 21 '11 at 18:26
  • And a quick note again: DO NOT use the IndexReader retrieved by IndexReader.Open if you call the FieldCache stuff directly. It uses the reader you pass it as a cache key, and every index change will cause it to reread everything, since it considers the new reader a new cache key. You SHOULD use the SegmentReaders the DirectoryReader consists of, which can be retrieved with IndexReader.GetSequentialSubReaders(). These restart their document numbering at 0, so you need to calculate the real document ids by adding up IndexReader.MaxDoc() from the previous readers. Use the ReaderUtil class as a reference. – sisve Jan 21 '11 at 19:40
  • And another note: you could build your own FieldCache which doesn't store everything in giant arrays, to avoid hitting the LOH. Or do you want to avoid loading the stuff into memory at all? – sisve Jan 21 '11 at 19:45
  • That's quite a valuable piece of information on how Lucene works internally. Thank you for your time and effort, Simon. PS: As for the FieldCache, I guess we will follow your advice on building a custom FieldCache; it seems like the most appropriate solution if we have to perform sorting (using native Lucene mechanics) on a large index. – Rezgar Cadro Jan 25 '11 at 10:22
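
Following up on the comment thread: below is a rough sketch of the collect-then-sort approach described above, which stores each document id together with its sort value and sorts once after collection, pulling sort values from the FieldCache via the per-segment reader passed to SetNextReader. The field name "title", the string-typed sort key, and the exact FieldCache_Fields.DEFAULT.GetStrings call are my assumptions against the Lucene.NET 2.9.2 port, so treat this as a starting point rather than a drop-in implementation.

using System;
using System.Collections.Generic;
using Lucene.Net.Index;
using Lucene.Net.Search;

public class SortingCollector : Collector {
    // pairs of (index-wide doc id, sort value), sorted once after collection
    private readonly List<KeyValuePair<Int32, String>> _hits =
        new List<KeyValuePair<Int32, String>>();
    private String[] _sortValues; // per-segment values from the FieldCache
    private Int32 _docBase;

    public IEnumerable<Int32> SortedDocumentIds {
        get {
            // sort by the cached field value, then project out the doc ids
            _hits.Sort((x, y) => String.CompareOrdinal(x.Value, y.Value));
            return _hits.ConvertAll(hit => hit.Key);
        }
    }

    public override void SetScorer(Scorer scorer) {
        // scores are not needed; we only collect ids and sort values
    }

    public override void Collect(Int32 doc) {
        // doc is segment-relative; _sortValues is indexed the same way
        _hits.Add(new KeyValuePair<Int32, String>(_docBase + doc, _sortValues[doc]));
    }

    public override void SetNextReader(IndexReader reader, Int32 docBase) {
        _docBase = docBase;
        // use the inner (segment) reader here, as noted in the comments,
        // so the cache entries survive Reopen; "title" is a hypothetical
        // single-keyword sort field, and GetStrings is assumed to mirror
        // Java's FieldCache.DEFAULT.getStrings in the 2.9.2 port
        _sortValues = FieldCache_Fields.DEFAULT.GetStrings(reader, "title");
    }

    public override bool AcceptsDocsOutOfOrder() {
        // out-of-order collection is fine since we sort afterwards
        return true;
    }
}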