3

I have a table user with multiple columns, every user has a unique userid. Because it is unique, I dont have to specify a clustering key unless I want to use the column in queries. Is this bad, because every partition consists of a single row? If it is bad for whatever reason, what is the best practise to do in this case? Thank you for your help!

Edit: If I have a query that needs to return all usernames, how can I do that with a good performance? Doing it from this table seems not very efficient for me, should I make another table where I simply duplicate all usernames in a Collection? Then they are all in one place and the read doesn't have to jump over multiple nodes.

drei vier
  • 31
  • 3

2 Answers2

2

I just answered the similar question. Short story - it really depends on the access patterns, and table settings. You may need to tune the table parameters to get best performance, but the settings may depend on the amount of data, and other requirements.

Alex Ott
  • 80,552
  • 8
  • 87
  • 132
2

There are always two (main) considerations when defining your primary keys in Cassandra:

  • Data distribution
  • Query pattern match

From a data distribution standpoint, you can't get much better than using a unique key as the partition key. The more of them, the more evenly they should hash-out and thus be evenly distributed.

However, a key which distributes well but doesn't fit the desired query pattern, is pretty useless.

tl;dr;

If that unique key is all you'll ever query the table by, then it makes a great choice for a partition key.

Aaron
  • 55,518
  • 11
  • 116
  • 132
  • Thank you very much! If i had to query for another column like Email-address, where i want to get the values of other 2 columns, would it be the best way to just create a materialized view of the user table with email-address as partitioning key and the other needed columns for the query? I have 4 columns that i have to query the table by, the user data does not change so there would not be a huge write-overhead if I make 3 materialized views and the read should be pretty fast. Is this approach fine? Or do I miss a big flaw/are there better alternatives? Thank you for your help! – drei vier Sep 23 '20 at 22:57
  • @dreivier I would stay away from MVs, as they are very much broken. Additional queries should be satisfied by duplicating the data into differently-keyed "query tables." – Aaron Sep 24 '20 at 15:25
  • What do you mean with they are broken? Aren't they basically the same as your recommended "query tables", but they update automatically? Thanks for your help! – drei vier Sep 24 '20 at 16:03
  • @dreivier MVs are a feature that was never finished. Its "experimental" status is testament to that. https://stackoverflow.com/questions/62086421/how-to-delete-a-row-in-cassandra-materialiezed-view/62087819#62087819 – Aaron Sep 24 '20 at 17:32
  • Thank you very much. – drei vier Sep 24 '20 at 17:42