
According to most articles on the internet, Random Partitioning (RP) is better than Ordered Partitioning (OP) because of how the data is distributed.

In fact, I think that because of data replication, the data will be well distributed even if we use OP. So is the first assumption still true?

What about read performance? Is OP better than RP when reading data between two values in the same range?

Thanks a lot.

Mehdi TAZI

1 Answer


I can't really answer confidently for HBase (which, to my knowledge, only supports ordered partitioning), but for Cassandra I would strongly discourage the use of OrderPreservingPartitioner and ByteOrderedPartitioner unless you have a very specific use case that requires it (for example, range scans across keys). Ordered partitioners are not commonly used.

In fact, I think that because of data replication, the data will be well distributed even if we use OP. So is the first assumption still true?

Not particularly. Hotspots are much more likely with an Ordered Partitioner than with a Random Partitioner. As described on the Partitioners page of the Cassandra Wiki:

globally ordering all your partitions generates hot spots: some partitions close together will get more activity than others, and the node hosting those will be overloaded relative to others. You can try to mitigate with active load balancing but this works poorly in practice; by the time you can adjust token assignments so that less hot partitions are on the overloaded node, your workload often changes enough that the hot spot is now elsewhere. Remember that preserving global order means you can't just pick and choose hot partitions to relocate, you have to relocate contiguous ranges.
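To make the hot spot effect concrete, here is a minimal sketch, not how Cassandra actually assigns tokens. It assumes a hypothetical 4-node cluster, time-prefixed row keys, and made-up split points chosen to balance a full year of data. The hashed placement uses MD5 modulo the node count as a simplification of a random partitioner.

```python
import hashlib
from bisect import bisect_right
from collections import Counter

NODES = 4
# Hypothetical split points: each node owns roughly one quarter of 2016.
SPLIT_POINTS = ["2016-04-01", "2016-07-01", "2016-10-01"]


def ordered_node(key: str) -> int:
    """Ordered partitioning: the node is decided by where the key sorts."""
    return bisect_right(SPLIT_POINTS, key)


def hashed_node(key: str) -> int:
    """Random partitioning (simplified): the node is decided by a hash of the key."""
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % NODES


# One day of writes, one row key per minute.
todays_keys = [f"2016-01-11T{h:02d}:{m:02d}" for h in range(24) for m in range(60)]

ordered_load = Counter(ordered_node(k) for k in todays_keys)
hashed_load = Counter(hashed_node(k) for k in todays_keys)

print("ordered:", dict(ordered_load))  # every write lands on a single node
print("hashed: ", dict(hashed_load))   # writes are spread across all nodes
```

With ordered placement, all of today's keys are neighbours in sort order, so they share a node. Replication copies that range to a few more nodes, but those replicas then carry the same concentrated load while the rest of the cluster sits idle.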

There are other problems with Ordered Partitioning that are described well here:

Difficult load balancing:

More administrative overhead is required to load balance the cluster. An ordered partitioner requires administrators to manually calculate partition ranges based on their estimates of the partition key distribution. In practice, this requires actively moving node tokens around to accommodate the actual distribution of data once it is loaded.

Uneven load balancing for multiple tables:

If your application has multiple tables, chances are that those tables have different row keys and different distributions of data. An ordered partitioner that is balanced for one table may cause hot spots and uneven distribution for another table in the same cluster.
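The two points above can be sketched as follows. This is an illustration under assumptions, not an administrative procedure: the table names, key formats, and the quantile-based `split_points_for` helper are all hypothetical. It picks split points that balance one table perfectly and then shows where a second table's keys end up.

```python
from bisect import bisect_right
from collections import Counter

NODES = 4


def split_points_for(sample_keys, nodes=NODES):
    """Pick nodes-1 boundaries at even quantiles of a sorted key sample."""
    ordered = sorted(sample_keys)
    return [ordered[len(ordered) * i // nodes] for i in range(1, nodes)]


def placement(keys, split_points):
    """Count how many keys each node would own under ordered partitioning."""
    return Counter(bisect_right(split_points, k) for k in keys)


# Two hypothetical tables with different row key formats.
users = [f"user:{i:04d}" for i in range(1000)]
orders = [f"order:{i:04d}" for i in range(1000)]

splits = split_points_for(users)          # ranges tuned for the users table
users_load = placement(users, splits)     # balanced: 250 keys per node
orders_load = placement(orders, splits)   # every key sorts before the first split

print("splits:", splits)
print("users: ", dict(users_load))
print("orders:", dict(orders_load))
```

Because every `order:` key sorts before the first `user:` split point, the whole orders table lands on node 0. Rebalancing for orders would in turn unbalance users, which is the manual token juggling the quoted text describes.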

With regard to:

What about read performance? Is OP better than RP when reading data between two values in the same range?

You will definitely achieve better performance for range scans (i.e. get all data between this key and that key) with an ordered partitioner, because keys that are adjacent in sort order are stored together.
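A rough sketch of why, assuming a hypothetical 4-node cluster, made-up `user:NNNN` keys, and MD5 modulo the node count as a stand-in for a random partitioner:

```python
import hashlib
from bisect import bisect_right

NODES = 4
# Hypothetical split points for an ordered partitioner.
SPLIT_POINTS = ["user:0250", "user:0500", "user:0750"]


def ordered_node(key: str) -> int:
    """Ordered partitioning: the node is decided by where the key sorts."""
    return bisect_right(SPLIT_POINTS, key)


def hashed_node(key: str) -> int:
    """Random partitioning (simplified): the node is decided by a hash of the key."""
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % NODES


def nodes_for_range_ordered(lo: str, hi: str) -> set:
    """Only nodes whose key ranges overlap [lo, hi] have to be contacted."""
    return set(range(ordered_node(lo), ordered_node(hi) + 1))


all_keys = [f"user:{i:04d}" for i in range(1000)]
lo, hi = "user:0100", "user:0199"
matching = [k for k in all_keys if lo <= k <= hi]

ordered_nodes = nodes_for_range_ordered(lo, hi)
# With hashing, rows in the range are scattered, and the coordinator cannot
# know which keys exist without asking every node.
hashed_nodes = {hashed_node(k) for k in matching}

print("ordered scan touches nodes:", sorted(ordered_nodes))
print("hashed rows live on nodes: ", sorted(hashed_nodes))
```

The ordered scan is served by one contiguous range on one node, while the same logical range under hashing is scattered across the cluster.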

So it really comes down to the kind of queries you are making. Are range scan queries between keys vital to you? In that case HBase may be a more appropriate solution for you. If it is not as important, there are reasons to consider C* instead. I won't add much more to that as I don't want my answer to devolve into comparing the two solutions :).

Andy Tolbert
  • Good answer Andy. I have a go-to answer about the BOP as well: http://stackoverflow.com/questions/27939234/cassandra-byteorderedpartitioner/27944273#27944273 – Aaron Jan 11 '16 at 01:36
  • I wish they would just deprecate it, so people stop trying to use it. – Aaron Jan 11 '16 at 01:37
  • Excellent answer Aaron, I especially like your point about using multi-column primary keys for range queries. I agree that should cover a good portion of users' needs; I'll keep that in my back pocket :). Also +1 on deprecating BOP in C*! – Andy Tolbert Jan 11 '16 at 01:48
  • Thanks Aaron! Of course an ordered partitioner will cause hotspots for write operations, but don't you think that replication will avoid hotspots for read operations? – Mehdi TAZI Jan 11 '16 at 09:11
  • A hot spot is less about reads and writes and more about the positioning of data in storage. With Ordered Partitioning, hot spots are more likely, and they put more burden on the nodes that hold them, which affects both read and write performance. – Andy Tolbert Jan 11 '16 at 14:53
  • Why does order partitioner cause hotspots (for write and perhaps also read operations)? Could you provide an example? Thank you! – tonix Sep 18 '21 at 06:34