Impact of repeat value across multiple fields in Lucene

Question

What would be the impact of re-indexing the same value across multiple fields in a Lucene index?

The idea is that someone's first name is part of both their name and their general details, so I would want to index that value into multiple fields. For example, I might index Ted Bloggs as follows:

Field        |    Value
-------------|---------
firstName    | Ted
lastName     | Bloggs
name         | Ted
name         | Bloggs
general      | Ted
general      | Bloggs
all          | Ted
all          | Bloggs

By doing this I can easily form categories of fields; however, I'm worried it may have an adverse impact on performance and/or disk usage.

Could anyone advise, please?

score 4 · Accepted Answer · answered Dec 02 '11 at 21:00

@aishwarya is right, but to expand on it a little bit more:

From the docs:

This file is sorted by Term. Terms are ordered first lexicographically (by UTF16 character code) by the term's field name, and within that lexicographically (by UTF16 character code) by the term's text.

The term will be stored once per field, so if you repeat each term across five fields, the term dictionary will be about five times bigger. However, the size of the term dictionary is logarithmic with respect to the size of the raw data, so I doubt you will have a problem.
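To make the "once per field" point concrete, here is a minimal sketch in plain Java. It is a toy model, not Lucene code: the class `ToyTermDictionary` and its helper `indexPerson` are hypothetical names, and the sorted set only mimics the (field, term) ordering described in the docs quote above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;

// Toy model (not Lucene): one sorted entry per distinct (field, term) pair.
class ToyTermDictionary {
    // Separator that sorts before every other character, so keys order
    // by field name first and by term text second.
    private static final char SEP = '\0';

    // String ordering in Java compares UTF-16 code units.
    private final TreeSet<String> entries = new TreeSet<String>();

    // Record that `term` occurs in `field`; repeats collapse to one entry.
    void add(String field, String term) {
        entries.add(field + SEP + term);
    }

    int size() {
        return entries.size();
    }

    // Entries in dictionary order, rendered as "field:term".
    List<String> sortedEntries() {
        List<String> out = new ArrayList<String>();
        for (String e : entries) {
            out.add(e.replace(SEP, ':'));
        }
        return out;
    }

    // Hypothetical helper mirroring the field layout from the question.
    static void indexPerson(ToyTermDictionary dict, String first, String last) {
        dict.add("firstName", first);
        dict.add("lastName", last);
        for (String field : new String[] {"name", "general", "all"}) {
            dict.add(field, first);
            dict.add(field, last);
        }
    }

    public static void main(String[] args) {
        ToyTermDictionary dict = new ToyTermDictionary();
        indexPerson(dict, "ted", "bloggs");
        System.out.println(dict.size());          // 8
        System.out.println(dict.sortedEntries());
        indexPerson(dict, "ted", "smith");
        System.out.println(dict.size());          // 12: "ted" adds no new entries
    }
}
```

Two distinct terms spread over five fields give eight entries, and a second person who shares the first name only adds entries for the new surname. That is the sense in which the dictionary grows much more slowly than the raw data.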

The performance penalty will be non-existent (Lucene caches where each field starts) except insofar as having more data will force stuff out of memory. For most search infrastructures, you'll probably have an index of under a few GB, which will easily fit in memory anyway.

score 4 · Answer 2 · answered Dec 03 '11 at 16:36

Xodarap's explanation is good, just to add some more:

The best way to think about fields in Lucene is that each is its own miniature inverted index, but the document ids are aligned/parallel so you can do disjunctions/conjunctions across different fields.
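As a rough illustration of that mental model, the sketch below keeps a separate term-to-postings map per field and combines postings from different fields by document id. It is plain Java, not Lucene's actual data structures; `ToyFieldedIndex`, `sample`, `and`, and `or` are hypothetical names.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.TreeSet;

// Toy model (not Lucene): each field has its own term -> doc ids map,
// and all fields share the same doc id space.
class ToyFieldedIndex {
    private final Map<String, Map<String, TreeSet<Integer>>> fields =
            new HashMap<String, Map<String, TreeSet<Integer>>>();

    void add(int docId, String field, String term) {
        fields.computeIfAbsent(field, f -> new HashMap<String, TreeSet<Integer>>())
              .computeIfAbsent(term, t -> new TreeSet<Integer>())
              .add(docId);
    }

    // Copy of the doc ids for field:term (empty if the term is absent).
    TreeSet<Integer> postings(String field, String term) {
        Map<String, TreeSet<Integer>> terms = fields.get(field);
        TreeSet<Integer> docs = (terms == null) ? null : terms.get(term);
        return (docs == null) ? new TreeSet<Integer>() : new TreeSet<Integer>(docs);
    }

    // Conjunction: docs present in both postings lists.
    static TreeSet<Integer> and(TreeSet<Integer> a, TreeSet<Integer> b) {
        TreeSet<Integer> result = new TreeSet<Integer>(a);
        result.retainAll(b);
        return result;
    }

    // Disjunction: docs present in either postings list.
    static TreeSet<Integer> or(TreeSet<Integer> a, TreeSet<Integer> b) {
        TreeSet<Integer> result = new TreeSet<Integer>(a);
        result.addAll(b);
        return result;
    }

    // Hypothetical sample data: three people, two fields each.
    static ToyFieldedIndex sample() {
        ToyFieldedIndex idx = new ToyFieldedIndex();
        String[][] people = {{"ted", "bloggs"}, {"ted", "smith"}, {"joe", "bloggs"}};
        for (int docId = 0; docId < people.length; docId++) {
            idx.add(docId, "firstName", people[docId][0]);
            idx.add(docId, "lastName", people[docId][1]);
        }
        return idx;
    }

    public static void main(String[] args) {
        ToyFieldedIndex idx = sample();
        TreeSet<Integer> ted = idx.postings("firstName", "ted");
        TreeSet<Integer> bloggs = idx.postings("lastName", "bloggs");
        System.out.println(and(ted, bloggs)); // [0]
        System.out.println(or(ted, bloggs));  // [0, 1, 2]
    }
}
```

Because doc id 0 means the same document in every field, a query like firstName:ted AND lastName:bloggs is just an intersection of two independent postings lists.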

One thing to be careful of when adding lots of fields: by default each field has a byte[maxDoc] loaded up in RAM, used for length normalization. So with many documents and lots of fields, all with length normalization enabled, this could chew up some space.

Looking at your use case, length normalization is probably not very useful for fields like firstName/lastName anyway, so you may want to omit norms on these short fields (setOmitNorms(true) on the field in Lucene 3.x).
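For a back-of-the-envelope feel for the norms cost, here is a small sketch of the arithmetic: one byte per document for every field that keeps norms. The 10 million document count is an assumed figure for illustration, and `NormsMemoryEstimate` is a hypothetical helper, not part of Lucene.

```java
// Rough estimate only: one norm byte per document per field with norms enabled.
class NormsMemoryEstimate {
    static long normsBytes(long maxDoc, int fieldsWithNorms) {
        return maxDoc * fieldsWithNorms;
    }

    public static void main(String[] args) {
        long maxDoc = 10_000_000L; // assumed index size, for illustration

        // All five fields from the question keep norms.
        long allFields = normsBytes(maxDoc, 5);

        // Norms omitted on firstName and lastName, kept on name/general/all.
        long shortFieldsOmitted = normsBytes(maxDoc, 3);

        System.out.println(allFields);          // 50000000
        System.out.println(shortFieldsOmitted); // 30000000
    }
}
```

At this scale the difference is a few tens of megabytes, so it only starts to matter with very large document counts or many more fields.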

score 1 · Answer 3 · answered Dec 02 '11 at 14:44

Lucene's indexing is pretty well optimised, so I would not worry too much about performance or disk usage here. Having said that, your use case would definitely perform worse than a straightforward single-field exact search, although not by enough to cause worry.

aishwarya

I would say his usage is acceptable, and a best practice for handling multiple values for a single field. – goalie7960 Dec 02 '11 at 14:57
Oh yes, I did not mean to question his usage at all! All I am saying is that, given the use case, there would certainly be some performance overhead, but not enough to be worried about. – aishwarya Dec 02 '11 at 15:02
