Ensuring evenly distributed documents per shard in Solr

Question

I've found myself needing to support result grouping with an accurate ngroups count. This required colocation of documents by a secondaryId field.

I'm currently indexing documents using the compositeId router in Solr. The uniqueKey is documentId, and I'm adding a shard key at the front like this:

doc.addField("documentId", secondaryId + "!" + actualDocId);

The problem I'm seeing is that the document count across my 3 shards is now uneven:

shard1: ~30k
shard2: ~60k
shard3: ~30k

(This is expected to grow a lot.)

Apparently the hashes of secondaryId are not very evenly distributed, but I don't know enough about possible values.

Any thoughts on getting a better distribution of these documents?

How many different values of `secondaryId` do you have? – MatsLindh, Feb 25 '17 at 20:36
@MatsLindh, there are about 21,000 values of `secondaryId` – Andrei, Feb 27 '17 at 21:29

Piyush Bansal · Answer 1 · 2018-11-29T22:12:34.733

Your data is not evenly spread across your secondaryIds. Some secondaryIds have a lot more data than others. There is no perfect or simple solution.

Assuming you cannot change your routing ID, one approach is to create a larger number of shards, say 16, on the same number of hosts. Your shards will now be smaller and may still be uneven, but because there are more of them, you can move them around across your nodes to more or less balance the nodes in size.

The caveat is that this works well only if your queries are routed, so that each query hits a single shard. If you have unrouted queries, a large number of shards can degrade performance significantly, because each query has to run against every shard.
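
To make the balancing step concrete, here is a minimal sketch of one way to decide which shard should live on which node: take the largest shard first and always place it on the node holding the fewest documents so far. This is a generic greedy heuristic, not something Solr does for you. The class name `ShardPlacement`, the shard names, and the document counts are all made up for illustration, and the actual replica moves would still have to be done with Solr's own tooling.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of Solr or SolrJ.
class ShardPlacement {

    /**
     * Greedy placement: largest shard first, each one onto the currently
     * lightest node. Returns one list of shard names per node.
     */
    static List<List<String>> assign(Map<String, Long> shardSizes, int nodeCount) {
        List<Map.Entry<String, Long>> shards = new ArrayList<>(shardSizes.entrySet());
        // Sort by document count, descending (stable for equal sizes).
        shards.sort(Map.Entry.<String, Long>comparingByValue().reversed());

        long[] load = new long[nodeCount];
        List<List<String>> placement = new ArrayList<>();
        for (int i = 0; i < nodeCount; i++) {
            placement.add(new ArrayList<>());
        }

        for (Map.Entry<String, Long> shard : shards) {
            // Find the node with the smallest load (lowest index wins ties).
            int lightest = 0;
            for (int n = 1; n < nodeCount; n++) {
                if (load[n] < load[lightest]) {
                    lightest = n;
                }
            }
            placement.get(lightest).add(shard.getKey());
            load[lightest] += shard.getValue();
        }
        return placement;
    }

    public static void main(String[] args) {
        // Made-up document counts for a collection split into 7 shards.
        Map<String, Long> sizes = new LinkedHashMap<>();
        sizes.put("shard1", 60_000L);
        sizes.put("shard2", 30_000L);
        sizes.put("shard3", 30_000L);
        sizes.put("shard4", 20_000L);
        sizes.put("shard5", 20_000L);
        sizes.put("shard6", 10_000L);
        sizes.put("shard7", 10_000L);

        List<List<String>> placement = assign(sizes, 3);
        for (int n = 0; n < placement.size(); n++) {
            System.out.println("node" + n + " -> " + placement.get(n));
        }
    }
}
```

The greedy rule does not guarantee a perfect split, but with many small shards it usually gets the nodes close to each other in size, which is the point of over-sharding here.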

score 0 · Answer 2 · answered Oct 13 '22 at 17:39

What I did was read the Solr routing code to see how it hashes, then replicate some of the logic manually to figure out the hash ranges to split on.

I found these online tools to convert the IDs to hashes and then to hex, which is the format the shard split command wants.

Murmur hash app: http://murmurhash.shorelabs.com/
- Use “MurmurHash3” form.
Hex converter app: https://www.rapidtables.com/convert/number/decimal-to-hex.html
- I think you want the “Hex signed 2's complement” value when the two outputs differ, but not when it has a 00000000 prefix...

You'll also have to pay attention to masking. It's something like this:

Imagine you have a document hashed to a hex value of 12345678. This is a composite of:
primaryRouteId:   12xxxxxx
secondaryRouteId: xx34xxxx
documentId:       xxxx5678

(Note: if you only have primaryRouteId!docId, then primaryRouteId takes the first 4 spots.)
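
The masking above can be sketched in a few lines of Java. This is only an illustration of the bit layout described in this answer (8 + 8 + 16 bits for a three-part ID, 16 + 16 bits for a two-part one), not a copy of Solr's code. It assumes you already have the 32-bit MurmurHash3 value of each part, for example from the tool linked above. The class and method names are made up.

```java
// Hypothetical helper illustrating the masking; not part of Solr or SolrJ.
class CompositeHashSketch {

    /** Two-part ID "shardKey!docId": top 16 bits from the shard key, low 16 from the doc ID. */
    static int compose(int shardKeyHash, int docIdHash) {
        return (shardKeyHash & 0xFFFF0000) | (docIdHash & 0x0000FFFF);
    }

    /** Three-part ID "primary!secondary!docId": 8 bits, 8 bits, then 16 bits. */
    static int compose(int primaryHash, int secondaryHash, int docIdHash) {
        return (primaryHash & 0xFF000000)
                | (secondaryHash & 0x00FF0000)
                | (docIdHash & 0x0000FFFF);
    }

    /**
     * Hex range "start-end" covering every document that shares a two-part
     * shard key. Integer.toHexString prints negative ints as unsigned
     * two's complement hex and drops leading zeros.
     */
    static String rangeForShardKey(int shardKeyHash) {
        int start = shardKeyHash & 0xFFFF0000;
        int end = start | 0x0000FFFF;
        return Integer.toHexString(start) + "-" + Integer.toHexString(end);
    }

    public static void main(String[] args) {
        // Made-up part hashes chosen so the composite is the 12345678 example.
        int composite = compose(0x12AAAAAA, 0xBB34BBBB, 0xCCCC5678);
        System.out.println(Integer.toHexString(composite));
        System.out.println(rangeForShardKey(0x1234AAAA));
    }
}
```

All documents with the same shard key land inside the range returned by `rangeForShardKey`, so a split point that falls on one of those range boundaries keeps each key's documents together.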

score 0 · Answer 3 · answered Oct 22 '22 at 17:50

You can use Solr's rebalancing feature, called UTILIZENODE.

Check these links:

https://solr.apache.org/guide/8_4/cluster-node-management.html#utilizenode
https://solr.pl/en/2018/01/02/solr-7-2-rebalancing-replicas-using-utilizenode/

It automatically moves shard replicas between nodes to balance them across the servers.

Note: It is a new feature and only works with Solr 8.2 or later.
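
For reference, UTILIZENODE is a Collections API action that takes a `node` parameter, so calling it is a single HTTP request. Below is a minimal sketch using the JDK's built-in HTTP client. The base URL and node name are assumptions for a local test setup, the class name is made up, and the request is only sent when you pass `--send`. Keep in mind that this moves replicas between nodes; it does not change how many documents end up in each shard.

```java
import java.net.URI;
import java.net.URLEncoder;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;

// Hypothetical wrapper around the Collections API call; not part of SolrJ.
class UtilizeNodeSketch {

    /** Builds the Collections API URL for UTILIZENODE on the given node. */
    static URI utilizeNodeUri(String solrBaseUrl, String nodeName) {
        // Node names look like "host:port_solr", so the colon must be encoded.
        String node = URLEncoder.encode(nodeName, StandardCharsets.UTF_8);
        return URI.create(solrBaseUrl + "/admin/collections?action=UTILIZENODE&node=" + node);
    }

    public static void main(String[] args) throws Exception {
        // Assumed local defaults; replace with your own host and node name.
        URI uri = utilizeNodeUri("http://localhost:8983/solr", "localhost:8983_solr");
        System.out.println(uri);

        // Only contact the cluster when explicitly asked to.
        if (args.length > 0 && args[0].equals("--send")) {
            HttpRequest request = HttpRequest.newBuilder(uri).GET().build();
            HttpResponse<String> response = HttpClient.newHttpClient()
                    .send(request, HttpResponse.BodyHandlers.ofString());
            System.out.println(response.statusCode() + " " + response.body());
        }
    }
}
```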
