I have a fairly large Lucene.Net index (created with the latest version, 2.9). It contains ~1 billion documents and takes ~70 GB of disk space. Each document is very small, just two fields: a string and an integer.
I want to search by the string field and sort by the integer field. The problem is that I get an OutOfMemoryException whenever I run a query with a sort. The code looks something like this:
var sort = new Sort(new SortField("frequency", SortField.INT, false));
var topDocs = searcher.Search(query, null, 1, sort);
It doesn't matter which query I use; as soon as the sort is applied, it crashes. Here is the stack trace:
System.OutOfMemoryException: Exception of type 'System.OutOfMemoryException' was thrown.
at Lucene.Net.Search.FieldCacheImpl.IntCache.CreateValue(IndexReader reader, Entry entryKey)
at Lucene.Net.Search.FieldCacheImpl.Cache.Get(IndexReader reader, Entry key)
at Lucene.Net.Search.FieldCacheImpl.GetInts(IndexReader reader, String field, IntParser parser)
at Lucene.Net.Search.FieldCacheImpl.IntCache.CreateValue(IndexReader reader, Entry entryKey)
at Lucene.Net.Search.FieldCacheImpl.Cache.Get(IndexReader reader, Entry key)
at Lucene.Net.Search.FieldCacheImpl.GetInts(IndexReader reader, String field, IntParser parser)
at Lucene.Net.Search.FieldComparator.IntComparator.SetNextReader(IndexReader reader, Int32 docBase)
at Lucene.Net.Search.IndexSearcher.Search(Weight weight, Filter filter, Collector collector)
at Lucene.Net.Search.IndexSearcher.Search(Weight weight, Filter filter, Int32 nDocs, Sort sort, Boolean fillFields)
at Lucene.Net.Search.IndexSearcher.Search(Weight weight, Filter filter, Int32 nDocs, Sort sort)
at Lucene.Net.Search.Searcher.Search(Query query, Filter filter, Int32 n, Sort sort)
I'm fairly new to Lucene, but it looks like it is trying to cache a huge amount of data and running out of memory.
Update: Indeed, it looks like Lucene attempts to create an int[maxDoc] array, which is huge in my case.
From the Sort class documentation: "Sorting uses caches of term values maintained by the internal HitQueue(s). The cache is static and contains an integer or float array of length IndexReader.maxDoc() for each field name for which a sort is performed. In other words, the size of the cache in bytes is: 4 * IndexReader.maxDoc() * (# of different fields actually used to sort)"
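Plugging my numbers into that formula (a rough sketch, assuming maxDoc is close to the document count of ~1 billion and only the one "frequency" field is sorted on):

```csharp
// Approximate size of the FieldCache int array for one sort field:
long maxDoc = 1000000000L;        // ~1 billion documents
long bytes  = 4L * maxDoc * 1;    // 4 bytes per int * maxDoc * 1 sort field
// bytes = 4,000,000,000 ≈ 3.7 GB in a single contiguous array
```

A single ~3.7 GB array cannot be allocated in a 32-bit process at all, and even on 64-bit the CLR limits any single object to 2 GB by default, so the int[maxDoc] allocation would fail either way, which matches the OutOfMemoryException above.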
Can I change this behavior somehow?