
How does Cassandra calculate the size of the partition key and clustering key? We have tables with relatively large partition keys (a UUID or a combination of UUIDs) along with large clustering keys, for example:

mydb/parent/6E219A7E21044B48B8816B931925CCDB/child1/29E6E709854D49CFAC72ECD5E1AEBFA3/
mydb/parent/6E219A7E21044B48B8816B931925CCDB/child2/29E6E709854D49CFAC72ECD5E1AEBFA4/
mydb/parent/6E219A7E21044B48B8816B931925CCDB/child3/29E6E709854D49CFAC72ECD5E1AEBFA5/

Here the partition key is 6E219A7E21044B48B8816B931925CCDB and the clustering column is /child1/29E6E709854D49CFAC72ECD5E1AEBFA3/.

The child hierarchy can go down to the nth level (right now we go up to 100 levels).

Does having large keys affect performance when we have a huge amount of data (~300 million)? Also, what is the impact on disk usage?

invincible

1 Answer


Having a large partition key or clustering key is not an issue in itself. It has no impact on performance.

The only thing you should avoid is large partitions. In your case, for example, you have 100 rows in a single partition. If the combined size of all rows is within 10MB, you are doing fine (the ideal size of a Cassandra partition is 10MB or less, with a maximum of 100MB). You can refer to this link for calculating your partition size.
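As a rough sanity check against the 10MB and 100MB figures, a back-of-the-envelope estimate could look like the sketch below. It is not Cassandra's actual on-disk formula: it ignores per-cell metadata, indexes, and compression, and assumes the keys are stored as text. The helper names and the 1 KiB average row payload are hypothetical values chosen for illustration.

```python
from typing import Sequence

TARGET_BYTES = 10 * 1024 * 1024   # ~10MB ideal partition size quoted above
MAX_BYTES = 100 * 1024 * 1024     # ~100MB upper bound quoted above


def estimate_partition_bytes(partition_key: str,
                             clustering_keys: Sequence[str],
                             avg_value_bytes: int) -> int:
    """Crude lower bound: key bytes plus payload bytes, no storage overhead."""
    key_bytes = len(partition_key.encode("utf-8"))
    row_bytes = sum(len(ck.encode("utf-8")) + avg_value_bytes
                    for ck in clustering_keys)
    return key_bytes + row_bytes


def classify(size_bytes: int) -> str:
    """Compare an estimate against the two thresholds."""
    if size_bytes <= TARGET_BYTES:
        return "ok"
    if size_bytes <= MAX_BYTES:
        return "large"
    return "too large"


# One partition with 100 rows, shaped like the keys in the question.
pk = "6E219A7E21044B48B8816B931925CCDB"
cks = [f"/child{i}/29E6E709854D49CFAC72ECD5E1AEBFA3/" for i in range(1, 101)]

# avg_value_bytes=1024 is an assumed payload size, not taken from the question.
size = estimate_partition_bytes(pk, cks, avg_value_bytes=1024)
print(size, classify(size))
```

With these assumed numbers the keys contribute only a few kilobytes, so the row payload dominates the estimate.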

If your partition size is large, you have to refine your data model to reduce it. The following techniques are generally applied to reduce partition size:

  1. Bucketing: introduce a number alongside your partition key. This is generally applied to time series data. (More can be read here.)
  2. Introducing another column from your table as part of the partition key.
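The bucketing idea can be sketched as follows. This is one possible variant, assuming a hash-derived bucket instead of the time-based bucket typically used for time series. `NUM_BUCKETS`, `bucket_for`, and `bucketed_partition_key` are hypothetical names, and the bucket count is an arbitrary choice.

```python
import zlib

NUM_BUCKETS = 8  # hypothetical; choose it so each bucket stays within the size target


def bucket_for(clustering_key: str, num_buckets: int = NUM_BUCKETS) -> int:
    """Map a clustering key to a stable bucket number.

    zlib.crc32 is used because Python's built-in hash() is randomized
    between processes, so it would not give the same bucket on every run.
    """
    return zlib.crc32(clustering_key.encode("utf-8")) % num_buckets


def bucketed_partition_key(parent_id: str, clustering_key: str) -> tuple:
    """Composite partition key: the original key plus the bucket number."""
    return (parent_id, bucket_for(clustering_key))


parent = "6E219A7E21044B48B8816B931925CCDB"
child = "/child1/29E6E709854D49CFAC72ECD5E1AEBFA3/"
print(bucketed_partition_key(parent, child))
```

The trade-off is that reading every row of one parent now means querying all `NUM_BUCKETS` partitions instead of one.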
Manish Khandelwal
  • OK, thanks. Won't the size of the partition key play a role when we have millions of such keys? After all, we need disk space to store them. For instance, how is a reference like t1/123 vs t1/UUID stored in terms of indexes and so on? The names of the keyspace and column family don't play a role, as they are stored as a directory, but shouldn't the partition key and clustering column matter during replication? – invincible Jul 15 '21 at 11:03
  • The partition key is used by Cassandra to identify the nodes where the data resides. In terms of size you are right: a bigger partition key will require more space. But you were concerned about performance, and the size of the partition key has no impact on performance. – Manish Khandelwal Jul 15 '21 at 14:46
  • I asked about both the performance impact and the impact on disk usage. We have a huge subscriber base, and the customer has asked us to reduce the disk footprint without compromising functionality, as the Cassandra cluster is scaling up to hundreds of nodes and we need to support cross-site replication. Thanks for your response :) – invincible Jul 16 '21 at 09:51
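On the disk-footprint point raised in the comments, the raw byte cost of a key can be compared directly. The sketch below contrasts the 32-character hex form from the question, stored as text, with the same value in 16-byte binary form, which is the width of Cassandra's `uuid` type. Treating the ~300 million figure as a count of stored keys is an assumption, and the totals ignore replication, compression, and storage overhead.

```python
import uuid

hex_key = "6E219A7E21044B48B8816B931925CCDB"

as_text = hex_key.encode("utf-8")          # key stored as a text column
as_binary = uuid.UUID(hex=hex_key).bytes   # same value as a 16-byte UUID

KEY_COUNT = 300_000_000  # assumed number of stored keys, for illustration only

print("text bytes per key:  ", len(as_text))
print("binary bytes per key:", len(as_binary))
print("raw saving in bytes: ", (len(as_text) - len(as_binary)) * KEY_COUNT)
```

Under these assumptions the binary form halves the raw key bytes. The actual saving on disk depends on compression and the replication factor.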