Environment (a configuration sketch equivalent to this list follows below):

  1. Apache Ignite 2.4 running on Amazon Linux. The VM has 16 CPUs and 122 GB of RAM, so there is plenty of headroom.
  2. 5 nodes, 12 GB each
  3. cacheMode = PARTITIONED
  4. backups = 0
  5. onheapCacheEnabled = true
  6. atomicityMode = ATOMIC
  7. rebalanceMode = SYNC
  8. rebalanceBatchSize = 1MB
  9. copyOnRead = false
  10. rebalanceThrottle = 0
  11. rebalanceThreadPoolSize = 4
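
For reference, here is a minimal sketch of this setup in code (the cache name "myCache" is just a placeholder; the values mirror the list above):

```java
import org.apache.ignite.Ignite;
import org.apache.ignite.Ignition;
import org.apache.ignite.cache.CacheAtomicityMode;
import org.apache.ignite.cache.CacheMode;
import org.apache.ignite.cache.CacheRebalanceMode;
import org.apache.ignite.configuration.CacheConfiguration;
import org.apache.ignite.configuration.IgniteConfiguration;

public class CacheSetup {
    public static void main(String[] args) {
        // Cache-level settings from the list above; "myCache" is a placeholder name.
        CacheConfiguration<Object, Object> cacheCfg = new CacheConfiguration<>("myCache")
            .setCacheMode(CacheMode.PARTITIONED)
            .setBackups(0)
            .setOnheapCacheEnabled(true)
            .setAtomicityMode(CacheAtomicityMode.ATOMIC)
            .setRebalanceMode(CacheRebalanceMode.SYNC)
            .setRebalanceBatchSize(1024 * 1024) // 1 MB
            .setCopyOnRead(false)
            .setRebalanceThrottle(0);

        // rebalanceThreadPoolSize is a node-level setting.
        IgniteConfiguration igniteCfg = new IgniteConfiguration()
            .setRebalanceThreadPoolSize(4)
            .setCacheConfiguration(cacheCfg);

        Ignite ignite = Ignition.start(igniteCfg);
    }
}
```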

Basically we have a process that populates the cache on startup and then receives periodic updates from Kafka, propagating them to the cache.

The number of elements in the cache is more or less stable over time (there is just a little fluctuation, since we have a mixture of create, update and delete events). What we have noticed, though, is that the distribution of data across the nodes is very uneven, with one node having at least double the number of keys (and memory utilization) of the others. Over time, that node either runs out of memory or starts doing very long GCs and loses contact with the rest of the cluster.

My expectation was that Ignite would balance the data across the different nodes, but reality shows something completely different. Am I missing something here? Why do we see this imbalance and how do we fix it?

Thanks in advance.

Al A
  • How "very uneven"? Ignite uses Rendezvous hashing, you will expect ~5% of difference under default settings. If you see much more, your affinity keys are probably skewed (if you have them). – alamar May 09 '18 at 08:48
  • The node with the largest utilization had twice as many entries as the one with the lowest utilization. I don't have any affinity keys defined, just using the defaults. – Al A May 09 '18 at 15:51
  • 1
    @AlA It may be the case that for some unknown reason you have keys mapped to partitions unevenly. I would look to either changing the hashcode or customizing the affinity function: https://apacheignite.readme.io/docs/affinity-collocation#section-affinity-function – Dmitriy May 12 '18 at 16:05
  • Just noticed the exact same problem in our cluster. Up to 40 nodes it's still wildly uneven - a factor of 2.2 difference between the most-populated and least-populated nodes. – Brian Reischl May 21 '18 at 21:31
  • @Dmitriy, the "standard" Java hash function should provide a good distribution. Although the cached objects vary in size, the number of cached objects is large (millions), and the hash function on the key should provide a random distribution of the data. Shouldn't the nodes re-partition the data automatically if one of the nodes has significantly more data than the others? – Al A May 23 '18 at 22:48
  • @AlA Ignite maps keys to partitions and partitions to nodes. If you check, the partitions should be mapped to nodes about evenly, and that assignment does not change unless the number of nodes changes. I would look at the number of keys in each partition; you may have one of the partitions overloaded. – Dmitriy May 24 '18 at 05:53
  • @Dmitriy, this is what we will do - we will run our production data using our hash function and see what the distribution looks like. Either it's not uniform (in which case we need to replace it with something else), or it is uniform (in which case I have no idea what to do next...) – Al A May 25 '18 at 22:11
  • So, this is what's happening - we ran all the production keys through our hash function, took the result modulo 1024, and counted how many entries land in each [0..1023] bucket; the distribution looks pretty even. Conclusion: the mapping from keys to partitions is evenly distributed; the mapping from partitions to nodes is not. Although I'm reluctant to do this, we will have to mess with the affinity function - unless @Dmitriy has another suggestion... I'll see how that goes and post as a response to my own question if it works. – Al A Jun 07 '18 at 22:54
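
For anyone who wants to reproduce that check against a live cluster, here is a rough sketch that counts keys per partition and per node through Ignite's Affinity API (the cache name "myCache" and the config file path are placeholders, and the full cache scan is only practical as a one-off check; we actually ran the keys through the hash function offline):

```java
import java.util.HashMap;
import java.util.Map;
import java.util.UUID;

import javax.cache.Cache;

import org.apache.ignite.Ignite;
import org.apache.ignite.IgniteCache;
import org.apache.ignite.Ignition;
import org.apache.ignite.cache.affinity.Affinity;

public class DistributionCheck {
    public static void main(String[] args) {
        // Start a node with the same configuration the cluster uses (the path is a placeholder).
        try (Ignite ignite = Ignition.start("ignite-config.xml")) {
            IgniteCache<Object, Object> cache = ignite.cache("myCache");
            Affinity<Object> aff = ignite.affinity("myCache");

            Map<Integer, Long> keysPerPartition = new HashMap<>();
            Map<UUID, Long> keysPerNode = new HashMap<>();

            // Full scan; on millions of entries, sample the keys instead.
            for (Cache.Entry<Object, Object> e : cache) {
                int part = aff.partition(e.getKey());

                keysPerPartition.merge(part, 1L, Long::sum);
                keysPerNode.merge(aff.mapPartitionToNode(part).id(), 1L, Long::sum);
            }

            System.out.println("non-empty partitions: " + keysPerPartition.size());
            keysPerNode.forEach((nodeId, cnt) ->
                System.out.println("node " + nodeId + " -> " + cnt + " keys"));
        }
    }
}
```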

1 Answer

Bottom line, although our hash function had good distribution, the default affinity function was not yielding a good distribution of keys (and, consequently, memory) across the nodes in the cluster. We replaced it with a very naive one (partition # % # of nodes), and that improved the distribution quite a bit (less than 2% variance).

This is not a generic solution; it works for us because our entire cluster runs in a single VM and we don't use replication (backups = 0). For large clusters that span VM boundaries and use replication, keeping the replicated copies on separate servers is mandatory, and the naive approach won't cut it.
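
For reference, here is a rough sketch of the kind of naive affinity function described above (the class name, the 1024 partition count and the sort by node order are illustrative assumptions, not our exact code):

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;
import java.util.UUID;

import org.apache.ignite.cache.affinity.AffinityFunction;
import org.apache.ignite.cache.affinity.AffinityFunctionContext;
import org.apache.ignite.cluster.ClusterNode;

/**
 * Naive affinity function: keys hash to partitions as usual, but partitions are
 * assigned to nodes round-robin (partition % number of nodes). No backup copies
 * are assigned, which only makes sense with backups = 0.
 */
public class ModuloAffinityFunction implements AffinityFunction {
    private static final int PARTS = 1024; // same default partition count as Ignite's

    @Override public void reset() {
        // Stateless: nothing to reset between topology changes.
    }

    @Override public int partitions() {
        return PARTS;
    }

    @Override public int partition(Object key) {
        int p = key.hashCode() % PARTS;
        return p < 0 ? p + PARTS : p; // avoid negative partitions for negative hash codes
    }

    @Override public List<List<ClusterNode>> assignPartitions(AffinityFunctionContext ctx) {
        // Sort by node order so every node computes the same assignment.
        List<ClusterNode> nodes = new ArrayList<>(ctx.currentTopologySnapshot());
        nodes.sort(Comparator.comparingLong(ClusterNode::order));

        List<List<ClusterNode>> assignment = new ArrayList<>(PARTS);

        for (int part = 0; part < PARTS; part++)
            assignment.add(Collections.singletonList(nodes.get(part % nodes.size())));

        return assignment;
    }

    @Override public void removeNode(UUID nodeId) {
        // Nothing cached per node.
    }
}
```

It is plugged in with CacheConfiguration.setAffinity(new ModuloAffinityFunction()). Keep in mind that a plain modulo assignment remaps most partitions whenever the node count changes, which is another reason this is not a general-purpose solution.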

Al A