Indexing architecture for a frequently updated Solr index?

Question

I have roughly 50M documents with 90 fields in schema.xml (20 stored, 70 non-stored), indexed in a single core. The queries are quite complex and use faceting and highlighting. Of these 90 fields, 3-4 (all stored) are updated very frequently. Updating them the normal way requires repopulating every field, which is a heavy task. If I use atomic/partial updates, we still have to supply the non-stored fields again.
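For reference, a minimal sketch of what an atomic update payload looks like, assuming hypothetical field names `price` and `stock`. The `{"set": value}` modifier is Solr's atomic-update syntax; the helper name is only for illustration, and the resulting JSON would be posted to the collection's update handler.

```python
import json

def atomic_update(doc_id, changes):
    """Build one Solr atomic-update document that sets the given fields.

    `changes` maps field names (hypothetical here) to their new values.
    """
    doc = {"id": doc_id}
    for field, value in changes.items():
        # "set" replaces the current value of the field.
        doc[field] = {"set": value}
    return doc

# Solr accepts a JSON array of such documents in one update request.
payload = json.dumps([atomic_update("doc-1", {"price": 19.99, "stock": 7})])
```

Only the changed fields travel over the wire, but Solr still rebuilds the whole document internally, which is why the non-stored fields are the problem here.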

Our solution: To overcome these problems, we decided to use SolrCloud and join queries. We split the index into two separate collections, one for the stored fields and one for the non-stored fields, related to each other by the document id. We kept the frequently updated fields in the stored index, which lets us use atomic updates. To work around the limitation of join queries in SolrCloud, we sharded the stored-fields collection and replicated it across all nodes, while the non-stored collection is not sharded but is replicated across all nodes. We have a 5-node cluster with 3 additional ZooKeeper instances. Given the number of documents, my only concern is whether join queries will eventually degrade search performance. If so, what other options can I consider?
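A rough sketch of the kind of join query this setup implies, assuming the unsharded collection is named `search_fields` and that `body` is one of its non-stored fields (both names are hypothetical). The `{!join from=... to=... fromIndex=...}` local-params syntax is Solr's join query parser; the request itself would go to the sharded stored-fields collection.

```python
from urllib.parse import urlencode

def join_query(from_collection, from_field, to_field, inner_query):
    """Return a Solr join query that matches in another collection by key."""
    return "{!join from=%s to=%s fromIndex=%s}%s" % (
        from_field, to_field, from_collection, inner_query
    )

# Search the non-stored collection, then join back on the shared id.
params = {
    "q": join_query("search_fields", "id", "id", "body:solr"),
    "fl": "id,price",  # hypothetical stored fields to return
}
query_string = urlencode(params)
```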

Have you looked at in place updates which are available with docvalues and recent versions of Solr? — MatsLindh, Jan 11 '18 at 21:20
Cannot use in-place updates as it requires fields to be non-indexed (indexed="false"), non-stored (stored="false"), single valued (multiValued="false") . If we do so it will serve no purpose — ak1234, Jan 12 '18 at 06:31
If you have to search against the fields (i.e. you require them to be indexed), then no, you can't use in-place updates. The `stored` part will be retrieved from docValues if you need the actual value (since Lucene can use docValues as the stored value). The requirement is there since in-place updates can't update the stored data in the old structure. — MatsLindh, Jan 12 '18 at 09:46
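The constraints discussed in these comments can be written down as a small check. This is only a sketch: the field is represented as a plain dict mirroring schema.xml attributes (a hypothetical representation), and it covers only the attributes named in this thread, not every condition in the Solr reference guide.

```python
def supports_in_place_update(field):
    """Return True if the field attributes match the in-place update
    constraints mentioned above: not indexed, not stored, single valued,
    and backed by docValues. Missing attributes count as not eligible.
    """
    return (
        field.get("indexed") is False
        and field.get("stored") is False
        and field.get("multiValued") is False
        and field.get("docValues") is True
    )

# Hypothetical field definitions for illustration.
popularity = {"name": "popularity", "indexed": False, "stored": False,
              "multiValued": False, "docValues": True}
title = {"name": "title", "indexed": True, "stored": True,
         "multiValued": False, "docValues": False}

eligible = [f["name"] for f in (popularity, title) if supports_in_place_update(f)]
```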

score 1 · Answer 1 · answered Jan 11 '18 at 19:11

Relying on joins makes Solr behave more like a relational database. I found an article on this from the Lucidworks team, "Solr and Joins". Even they say that if your solution depends on joins, you should rethink it.

I think I have a solution for you. First of all, forget about two collections. Create one collection that holds two Solr documents for every logical document: one with the stored fields and one with the non-stored fields. When updating, you update the document that has the stored fields, and you run search-related operations against the other document.

Then, at query time, you merge the two documents into a single document, which can be done by writing a service layer on top of Solr.
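A minimal sketch of the merge step in such a service layer, assuming both documents of a pair carry a shared key field, here called `doc_key` (a hypothetical name, since two documents in one collection cannot share the same unique id). The inputs are plain lists of dicts, as a Solr client would typically return them.

```python
def merge_by_key(search_hits, stored_docs, key="doc_key"):
    """Combine each search hit with its stored-fields counterpart.

    The order of `search_hits` is preserved so ranking stays intact;
    hits without a stored counterpart are returned unchanged.
    """
    stored_by_key = {doc[key]: doc for doc in stored_docs}
    merged = []
    for hit in search_hits:
        combined = dict(hit)
        combined.update(stored_by_key.get(hit[key], {}))
        merged.append(combined)
    return merged

hits = [{"doc_key": "42", "score": 1.5}, {"doc_key": "7", "score": 0.9}]
stored = [{"doc_key": "7", "price": 3}, {"doc_key": "42", "price": 10}]
result = merge_by_key(hits, stored)
```

This costs a second lookup per result page (fetching the stored documents by key), so it works best when the page size is small.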

Jan Rasehorn · Answer 2 · 2022-03-17T12:08:12.973

I have an issue with partial/atomic updates triggering index operations in the background on fields I did not modify. This is different from the question, but the use of nested documents may be worth thinking about.

I was looking into nested documents to separate document header data from the text content to be indexed, since processing the text content consumes a lot of resources. According to the docs, parents and children are indexed as blocks and always have to be indexed together.

This is stated in https://solr.apache.org/guide/8_0/indexing-nested-documents.html:

With the exception of in-place updates, the whole block must be updated or deleted together, not separately. For some applications this may result in tons of extra indexing and thus may be a deal-breaker.

So unless you are able to perform in-place updates (which have their own restrictions regarding indexed, stored and <copyField...> directives), nested documents do not seem to be a valid approach.
