In my case I have a table structure like this:
CREATE TABLE table_1 (
    entity_uuid text,
    fk1_uuid text,
    fk2_uuid text,
    int_timestamp bigint,
    cnt counter,
    PRIMARY KEY (entity_uuid, fk1_uuid, fk2_uuid, int_timestamp)
);
The text columns are made up of random strings. However, only entity_uuid is truly random and evenly distributed. fk1_uuid and fk2_uuid have much lower cardinality and may be sparse (sometimes fk1_uuid = null or fk2_uuid = null).
In this case, I can either define only entity_uuid as the partition key, or make the (entity_uuid, fk1_uuid, fk2_uuid) combination a composite partition key.
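To make the two options concrete, here is a minimal sketch of each variant (the table names table_1_a and table_1_b are placeholders; the extra parentheses in the second PRIMARY KEY are what turn the three columns into one composite partition key):

-- Variant A: entity_uuid alone is the partition key;
-- fk1_uuid, fk2_uuid and int_timestamp are clustering columns.
CREATE TABLE table_1_a (
    entity_uuid text,
    fk1_uuid text,
    fk2_uuid text,
    int_timestamp bigint,
    cnt counter,
    PRIMARY KEY (entity_uuid, fk1_uuid, fk2_uuid, int_timestamp)
);

-- Variant B: the three columns together form the composite partition key;
-- only int_timestamp is a clustering column.
CREATE TABLE table_1_b (
    entity_uuid text,
    fk1_uuid text,
    fk2_uuid text,
    int_timestamp bigint,
    cnt counter,
    PRIMARY KEY ((entity_uuid, fk1_uuid, fk2_uuid), int_timestamp)
);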
Also, this is a LOOKUP-type table, meaning we don't plan to do any aggregation or slice-and-dice queries against it. Rows will be rotated out over time, since every row is inserted with a TTL.
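For illustration, the intended access pattern looks roughly like the sketch below. The values are hypothetical, and the TTL insert assumes a non-counter variant of the table (called table_1_plain here, with cnt as a plain bigint), since to my knowledge CQL counter columns are written with UPDATE rather than INSERT and do not accept a TTL:

-- Hypothetical write: every row carries a 7-day TTL (604800 seconds).
-- table_1_plain is an assumed non-counter variant, used only for this sketch.
INSERT INTO table_1_plain (entity_uuid, fk1_uuid, fk2_uuid, int_timestamp, cnt)
VALUES ('e-0001', 'f1-aa', 'f2-bb', 1427920000000, 1)
USING TTL 604800;

-- Hypothetical lookup: fetch a single row by its full key, no aggregation.
SELECT cnt FROM table_1_plain
WHERE entity_uuid = 'e-0001'
  AND fk1_uuid = 'f1-aa'
  AND fk2_uuid = 'f2-bb'
  AND int_timestamp = 1427920000000;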
Can someone enlighten me:
- What is the downside of having very many partitions with only a few rows in each? Is there a cost at the storage-engine level?
- My understanding is that clustering keys are ALWAYS kept sorted. Does that mean having text columns as clustering keys will always incur a tree-balancing cost?
Well, you can probably tell where my heart lies by now. However, when all the rows in a partition have TTL-ed out, does that partition still live on, or is there a way for the DB engine to remove it as well?
Thanks,
Bing