3

I have this table

create table comment_by_post
(
    postId uuid,
    userId uuid,
    cmntId timeuuid,
    cmntTxt text,   
    cmntBy text,
    time bigint, 
    primary key ((postId, userId),cmntId)
)

here is internal data in this table

RowKey: 4978f728-0f96-11e5-a6c0-1697f925ec7b:4978f728-0f96-12e5-a6c0-1697f92e537a
=> (name=d3f02a30-126f-11e5-879b-e700f669bcfc:, value=, timestamp=1434270721107000)
=> (name=d3f02a30-126f-11e5-879b-e700f669bcfc:cmnttxt, value=636d6e743434, timestamp=1434270721107000)
-------------------
RowKey: 4978f728-0f96-11e5-a6c0-1697f925ec7b:4978f728-0f96-12e5-a6c0-1697f92eec7a
=> (name=465fee30-126f-11e5-879b-e700f669bcfc:, value=, timestamp=1434270483603000)
=> (name=465fee30-126f-11e5-879b-e700f669bcfc:cmnttxt, value=636d6e7432, timestamp=1434270483603000)
=> (name=4ba89f40-126f-11e5-879b-e700f669bcfc:, value=, timestamp=1434270492468000)
=> (name=4ba89f40-126f-11e5-879b-e700f669bcfc:cmnttxt, value=636d6e7431, timestamp=1434270492468000)
=> (name=504a61f0-126f-11e5-879b-e700f669bcfc:, value=, timestamp=1434270500239000)
=> (name=504a61f0-126f-11e5-879b-e700f669bcfc:cmnttxt, value=636d6e7433, timestamp=1434270500239000)
-------------------
RowKey: 4978f728-0f96-11e5-a6c0-1697f925ec7b:4978f728-0f96-12e5-a6c0-1697f92e237a
=> (name=cd1e8f30-126f-11e5-879b-e700f669bcfc:, value=, timestamp=1434270709667000)
=> (name=cd1e8f30-126f-11e5-879b-e700f669bcfc:cmnttxt, value=636d6e7433, timestamp=1434270709667000)

If i do primary key (postId, userId,cmntId) then its like:

RowKey: 4978f728-0f96-11e5-a6c0-1697f925ec7b
=> (name=4978f728-0f96-12e5-a6c0-1697f92eec7a:971da150-1260-11e5-879b-e700f669bcfc:, value=, timestamp=1434264176613000)

=> (name=4978f728-0f96-12e5-a6c0-1697f92eec7a:971da150-1260-11e5-879b-e700f669bcfc:cmnttxt, value=636d6e7431, timestamp=1434264176613000)

=> (name=4978f728-0f96-12e5-a6c0-1697f92eec7a:a0d4a900-1260-11e5-879b-e700f669bcfc:, value=, timestamp=1434264192912000)

=> (name=4978f728-0f96-12e5-a6c0-1697f92eec7a:a0d4a900-1260-11e5-879b-e700f669bcfc:cmnttxt, value=636d6e7432, timestamp=1434264192912000)

=> (name=4978f728-0f96-12e5-a6c0-1697f92eec7a:a5d94c30-1260-11e5-879b-e700f669bcfc:, value=, timestamp=1434264201331000)

Why it is like that and what is the benefit of both ?

Manish Kumar
  • 10,214
  • 25
  • 77
  • 147
  • 2
    Is there a particular question you're looking to have answered or are you curious about how the primary key's construction interacts with the underlying storage? For more information check out this blog post which shows how CQL3 maps to the Cassandra internal data structure. http://opensourceconnections.com/blog/2013/07/24/understanding-how-cql3-maps-to-cassandras-internal-data-structure/ – Christopher Bradford Jun 14 '15 at 16:05
  • @ChristopherBradford That's a great article. John Berryman does a very good job of explaining all of this; especially how clustering keys work "under the hood." – Aaron Jun 14 '15 at 17:05
  • 1
    I answered a similar question here: http://stackoverflow.com/questions/30114854/cassandra-storage-internal/30126188#30126188 – Aaron Jun 14 '15 at 17:06

2 Answers2

4

Christopher already explained how the partitioning keys are concatenated together to generate the rowkey for storage, so I won't re-hash (no pun intended) that. But I will explain the advantages and disadvantages of these two approaches.

PRIMARY KEY (postId, userId,cmntId)

With this PRIMARY KEY, your data is partitioned by postId, and clustered by userId and cmntId. What this means, is that all comments made on a post will be stored together on-disk by postId, and then sorted by userId and cmntId (respectively).

The advantage here, is that you have query flexibility. You can query all comments for a post, or all comments for a post by a specific user.

The disadvantage, is that you have a higher chance of unbounded row growth than your other solution. If your total columns per postId ever were to exceed 2 billion, you would max out how much data you could store per postId. But the odds of you storing that much comment data per post are low, so you should be ok.

PRIMARY KEY ((postId, userId),cmntId)

This solution helps negate the possibility of unbounded row growth, by storing comment data together by a concatenated rowkey of postId and userId (sorted by cmntId. That's the advantage over your other solution.

The disadvantage is that of losing query flexibility, as now you need to provide postId and userId with every query. This PRIMARY KEY definition simply will not support queries for comments with only postId, as Cassandra CQL requires you to provide the entire partition key for a query.

Aaron
  • 55,518
  • 11
  • 116
  • 132
  • 1
    It may be worth noting that with the `PRIMARY KEY ((postId, userId),cmntId)` form the comments for a post will be placed across multiple nodes in the cluster and not just on one as mentioned with the less specific partition key `PRIMARY KEY(postId, usedId,cmndId)`. – Christopher Bradford Jun 14 '15 at 17:37
  • 1
    what i understood that in case of `PRIMARY KEY (postId, userId,cmntId)` column count may grow rapidly but query will be easy. In case of `PRIMARY KEY ((postId, userId),cmntId)` column count will be under control but from query point of view it wont be that easy . right? – Manish Kumar Jun 15 '15 at 04:33
2

The first primary key uses postId and userId as partition keys with cmntId as a clustering column. Note the value used for the RowKey contains both the values from postId and userId separated by a :. Next the value for the clustering column is used in the name of each cell within the row.

In your second example the primary key is missing parenthesis around the partition key. They may be omitted, but are often preferred to be present as we can explicitly determine which parts of the primary key are for the partition and clustering. When the extra parenthesis are not included only the first column is used as the partition key (visible in the RowKey value from cassandra-cli). All subsequent columns are assumed to be clustering columns which we can verify by looking at the cell names.

Christopher Bradford
  • 2,220
  • 2
  • 15
  • 12