Why is my Lucene index so large?

Question

I am storing documents in a Lucene instance the following way:

Document doc = new Document();
doc.add(new StringField("title", processor.title, Field.Store.YES));
doc.add(new StringField("annotation", processor.annotation, Field.Store.YES));
doc.add(new TextField("text", processor.text, Field.Store.NO));
w.addDocument(doc);

I don't need the full text to be stored in the index; the only thing I need is to be able to search the documents.

The problem is that the resulting index is almost the same size as the original set of documents. It seems quite strange to me as it should only store word frequencies. Why is this happening?

Could you add some sample documents so we can see how many fields there are in the original documents, etc.? Some numbers would also be nice: how many docs, how big the fields are, and the size of the docs and of the index. — Dominik Sandjaja, Feb 17 '17 at 22:49
@DominikSandjaja Documents have the three fields that you can see in the question. The text that is not stored in the index is ~100K of plain English text. — Denis Kulagin, Feb 17 '17 at 22:54
Can you please show how the IndexWriter and IndexWriterConfig are created? — René Scheibe, Feb 17 '17 at 23:18
What do you mean by "so large"? How big is it? How many entries are you inserting into the index, and how long are the titles, annotations and texts? — thiagoh, Feb 17 '17 at 23:24

score 2 · Answer 1 · answered Feb 18 '17 at 00:39

It seems quite strange to me as it should only store word frequencies.

I think you are misapprehending what is stored and how it is stored. The Lucene documentation for the index file formats explains this in detail. Quoting from the Overview section:

Each segment index maintains the following:

Field names. This contains the set of field names used in the index.

Stored Field values. This contains, for each document, a list of attribute-value pairs, where the attributes are field names. These are used to store auxiliary information about the document, such as its title, url, or an identifier to access a database. The set of stored fields are what is returned for each hit when searching. This is keyed by document number.

Term dictionary. A dictionary containing all of the terms used in all of the indexed fields of all of the documents. The dictionary also contains the number of documents which contain the term, and pointers to the term's frequency and proximity data.

Term Frequency data. For each term in the dictionary, the numbers of all the documents that contain that term, and the frequency of the term in that document if omitTf is false.

Term Proximity data. For each term in the dictionary, the positions that the term occurs in each document. Note that this will not exist if all fields in all documents set omitTf to true.

Normalization factors. For each field in each document, a value is stored that is multiplied into the score for hits on that field.

Term Vectors. For each field in each document, the term vector (sometimes called document vector) may be stored. A term vector consists of term text and term frequency. To add Term Vectors to your index see the Field constructors

Deleted documents. An optional file indicating which documents are deleted.

Some of the above are optional and probably won't be present in your indexes. However, a minimal index will have the "field names", "stored field values", "term dictionary", and "term frequency data".

Some of these data structures scale according to the number of distinct words in your corpus. Others scale according to the number of documents, or the number of unique words per document.
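To make those scaling factors concrete, here is a minimal sketch of a positional inverted index in plain Java. It is a toy model only, not Lucene's actual data structures or file format, and the class name `ToyInvertedIndex` and its methods are invented for this illustration. It shows that the term dictionary grows with the number of distinct words, the frequency data grows with the number of unique words per document, and the proximity data grows with the total number of words.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

/** Toy positional inverted index; illustrative only, not Lucene's real layout. */
class ToyInvertedIndex {
    // term -> (document id -> positions at which the term occurs in that document)
    private final Map<String, Map<Integer, List<Integer>>> postings = new TreeMap<>();

    /** Lowercases and splits on non-letters; a crude stand-in for a real analyzer. */
    void addDocument(int docId, String text) {
        int position = 0;
        for (String token : text.toLowerCase(Locale.ROOT).split("[^a-z]+")) {
            if (token.isEmpty()) {
                continue; // split() can yield a leading empty string
            }
            postings.computeIfAbsent(token, t -> new TreeMap<>())
                    .computeIfAbsent(docId, d -> new ArrayList<>())
                    .add(position++);
        }
    }

    /** Term dictionary size: one entry per distinct term in the corpus. */
    int termCount() {
        return postings.size();
    }

    /** Frequency data size: one entry per (term, document) pair. */
    int postingCount() {
        int count = 0;
        for (Map<Integer, List<Integer>> docs : postings.values()) {
            count += docs.size();
        }
        return count;
    }

    /** Proximity data size: one entry per word occurrence in the corpus. */
    int positionCount() {
        int count = 0;
        for (Map<Integer, List<Integer>> docs : postings.values()) {
            for (List<Integer> positions : docs.values()) {
                count += positions.size();
            }
        }
        return count;
    }

    public static void main(String[] args) {
        ToyInvertedIndex index = new ToyInvertedIndex();
        index.addDocument(0, "the quick brown fox jumps over the lazy dog");
        index.addDocument(1, "the dog barks");
        System.out.println(index.termCount());     // 9 distinct terms
        System.out.println(index.postingCount());  // 11 (term, document) pairs
        System.out.println(index.positionCount()); // 12 word occurrences
    }
}
```

In this model a field indexed with positions keeps one position entry per word of the original text, so the "only word frequencies" assumption from the question underestimates what is written.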

If you populate an index with a single (relatively) small document, then some of the scaling factors will be working against you: the overheads that depend on the number of distinct terms are not spread over many documents.

Finally, the physical representations of the index segments are designed and optimized primarily for fast searching rather than for minimal storage space. That affects the "information density" ... and the storage space used in practice.
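As a rough illustration of the storage cost, here is a sketch of one classic way to pack a position list: store the gaps between successive positions and write each gap as a variable-length integer, 7 bits per byte. This is an assumption-laden example, not a description of what any particular Lucene codec writes, and the names `VarIntSketch`, `writeVInt` and `encodePositions` are made up here.

```java
import java.io.ByteArrayOutputStream;

/** Sketch of gap + variable-length encoding for an ascending position list. */
class VarIntSketch {
    /** Writes 7 bits per byte, low-order bits first; the high bit means "more bytes follow". */
    static void writeVInt(ByteArrayOutputStream out, int value) {
        while ((value & ~0x7F) != 0) {
            out.write((value & 0x7F) | 0x80);
            value >>>= 7;
        }
        out.write(value);
    }

    /** Encodes ascending positions as the differences between neighbours. */
    static byte[] encodePositions(int[] positions) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        int previous = 0;
        for (int position : positions) {
            writeVInt(out, position - previous);
            previous = position;
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        // Gaps are 3, 7 and 490; the first two fit in one byte, the last needs two.
        byte[] encoded = encodePositions(new int[] {3, 10, 500});
        System.out.println(encoded.length); // 4
    }
}
```

Even with this kind of packing, every word occurrence costs at least one byte of position data, which helps explain why a positional index over plain text is not dramatically smaller than the text itself.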

I wonder about the term dictionary: English has a limited number of words, but wouldn't the dictionary be duplicated across segments this way? — TomSawyer, Apr 19 '20 at 21:07

score 1 · Answer 2 · answered Feb 17 '17 at 23:30

The analyzer (tokenizer and filter) should match your text. For English the StandardAnalyzer should be a good start.

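// Lucene 4.x-style API: from 5.0 on, the Version arguments are dropped
// and FSDirectory.open takes a java.nio.file.Path instead of a File.
Analyzer analyzer = new StandardAnalyzer(Version.LATEST);
Directory index = FSDirectory.open(new File("index"));
IndexWriterConfig config = new IndexWriterConfig(Version.LATEST, analyzer);
IndexWriter writer = new IndexWriter(index, config);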
