Many settings in Solr, as well as the hardware specs, can affect indexing performance. Besides the obvious solution of throwing more machines at it, tuning Solr is more of an art than a science. Here is my experience, so take it with a grain of salt. Generally you should see 6K to 8K documents per second indexing throughput.
Hardware specs: 4 machines, each with 40 cores (hyperthreaded), 256GB of RAM, and SSD storage
I also use the updateCSV API for importing documents.
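For reference, an updateCSV call looks roughly like the sketch below. The host, collection name, and field names are placeholders, not taken from my setup:

```shell
# Sketch only: assumes a Solr node at localhost:8983 and a collection
# named "mycollection" -- substitute your own host and collection.
cat > docs.csv <<'EOF'
id,name
1,alpha
2,beta
EOF

# Post the CSV through the CSV update handler. Note commit=false, so the
# request returns without forcing a commit (autoCommit picks it up later).
curl 'http://localhost:8983/solr/mycollection/update/csv?commit=false' \
     -H 'Content-Type: text/csv' \
     --data-binary @docs.csv
```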
My baseline metrics are measured with 1 of those machines (1 shard).
My SolrCloud metrics are measured with all 4 of them (4 shards with 1 replica per collection).
For a large collection (82GB), I saw 3.68x throughput.
For a medium collection (7GB), 2.17x.
For a small collection (1.29GB), 1.17x.
So to answer your question:
Q1: Generally, the more Solr nodes you have per collection, the faster indexing gets. It might plateau at some point, but indexing performance certainly should not degrade. Maybe your collection is too small to justify the SolrCloud horizontal scaling overhead?
Q2: No, SolrCloud should not degrade indexing speed.
Q3: It really depends on how you set it up. I saw a performance gain with just the default settings, but here are the things I came across that boosted performance even more:
- Don't set `commit=true` in your updateCSV API call.
- You can use more shards per collection than the number of live Solr nodes if system utilization is low.
- `solr.hdfs.blockcache.slab.count` should be between 10 and 20% of available system memory.
- `autoCommit` generally should be 15 seconds.
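The 15-second `autoCommit` above corresponds to a `solrconfig.xml` fragment along these lines (a sketch; setting `openSearcher` to false keeps hard commits from opening new searchers during bulk indexing):

```xml
<!-- solrconfig.xml: hard commit every 15 seconds without opening a searcher -->
<autoCommit>
  <maxTime>15000</maxTime>
  <openSearcher>false</openSearcher>
</autoCommit>
```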
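To illustrate the oversharding point above, here is a sketch of a Collections API CREATE call that places more shards than nodes. The host and collection name are placeholders, and `maxShardsPerNode` applies to the Solr versions of this era; check your version's Collections API docs:

```shell
# Sketch: create a collection with 8 shards on a 4-node cluster by
# allowing up to 2 shards per node. Host and names are placeholders.
curl 'http://localhost:8983/solr/admin/collections?action=CREATE&name=mycollection&numShards=8&replicationFactor=1&maxShardsPerNode=2'
```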