3

I've done 2 performances tests to measures the indexing speed with a collection of 235280 documents:

1st test : 1 solr instance without SolrCloud: indexing speed = 6191 doc/s

2nd test : 4 solr instance (4 shards) linked with SolrCloud : indexing speed = 4506 doc/s

I use 8 CPUs.

So, I've some questions about these results :

Q1 : Usually, Does the number of solr instances improve or degrade indexing speed ?

Q2 : Does SolrCloud degrade indexing speed ?

Q3 : Why do I get a decrease of performances when I use SolrCloud ? Do I missed something (setting ?) ?

Edit :

I use a CSV update handler to index my collection.

Corentin
  • 325
  • 5
  • 21
  • how many shards do you have? how do you index (Tika,Data Import, Custom)? – lexk Aug 01 '13 at 15:31
  • In the first test, I don't use SolrCloud (1 shard). And in the second test, I have 4 shards (1 shard by instance). I index my collection with a CSV update handler. – Corentin Aug 02 '13 at 07:42
  • how many servers do you send your queries to in parallel? If it's a single one - there is an overhead to forward the message to a right shard leader. Anyway I don't think it may account for such degradation. Are you doing it on the very same machine with 4 different instances?4 cores? Or different machines? – lexk Aug 04 '13 at 16:06

2 Answers2

0

Based on the performance test that I carried out, sharing across multiple nodes in a Solr cloud infrastructure improved my indexing performance. Replication of shards in multiple nodes to handle fail overs did slow down the indexing performance for obvious reason. Also consider Bulk indexing over doing single updates.

You can read http://wiki.apache.org/lucene-java/ImproveIndexingSpeed for further information.

0

There are many settings in Solr as well as the hardware specs that can affect indexing performance. Besides the obvious solution to throw more machines at it tuning Solr is more of an art than science. Here is my experience so take it with a grain of salt. Generally you should see 6K to 8K per second indexing performance.

Hardware specs: 4 x 40 cores (hyperthreaded) with 256GB of RAM with SSD

I also use updateCSV API for importing documents.

My baseline matrix is measured with 1 of those machines (1 shard). My SolrCloud matrix is measured with all 4 of them (4 shards with 1 replica per collection).


For large collection (82GB), I saw 3.68x throughput.

For medium collection (7GB), 2.17x.

For small collection (1.29GB), 1.17x.


So to answer your question:

Q1: Generally the more Solr nodes you have per collection increase indexing speed. It might plateau at some point but certainly indexing performance should not degrade. Maybe your collection is too small to justify the SolrCloud horizontal scaling overhead?

Q2: No, SolrCloud should not degrade indexing speed.

Q3: It really depends on how you set it up. I see performance gain with just default settings. But here are the things I came across that gained performance boost even more:

  • Don't set commit=true in your updateCSV API call.
  • You can use more shards per collection than the number of live Solr nodes if system utilization is low.
  • solr.hdfs.blockcache.slab.count should be between 10 to 20% of available system memory.
  • autoCommit generally should be 15 seconds.
adam
  • 238
  • 4
  • 14