2

I have a cassandra table

CREATE TABLE users_by_id (
id bigint PRIMARY KEY,
name text,
email text,
password text,
);

and

CREATE TABLE members_by_org_id_user_id (
organisation bigint,
user bigint,
roles set<bigint>,
PRIMARY KEY (organisation, user)
) WITH CLUSTERING ORDER BY (user DESC);

If I want to retrieve the name of each member of an organisation I could:

  1. SELECT user FROM members_by_org_id_user_id WHERE organisation = ? and then perform SELECT name FROM users_by_id WHERE id in ? (coordinator must contact many nodes = bad Is the IN relation in Cassandra bad for queries?)
  2. SELECT user from MEMBERS_by_org_id_user_id WHERE organisation = ? and then perform a query per user SELECT name FROM users_by_id WHERE id = ? (application must contact many nodes, because the primary keys of the users are different = less bad? Not perfect)
  3. Change members_by_org_id_user_id to
CREATE TABLE members_by_org_id_user_id (
organisation bigint,
user bigint,
name text,
email text,
PRIMARY KEY (organisation, user)
) WITH CLUSTERING ORDER BY (user DESC);

The problem with the third approach is, that if a user is updated, all the member rows that relate to that user will also need to be updated, which, while allowing the application to only contact one node, means many writes may need to be done for each update

How could I model my data to decrease each of these problems or get rid of them entirely?

An organisation could, theoretically, consist of up to 2000 members and a user could be in up to 20 organisations.

Jacques Amsel
  • 1,061
  • 2
  • 14
  • 30

1 Answers1

1

Putting the name column in your members_by_org_id_user_id table will solve your current problem but it may not be the right approach if you need to fetch your user's email in the future or any other column that you may add.

application must contact many nodes

since you used user id as primary key(id column) in your users_by_id table, then Cassandra will not traverse each node one by one - it knows where to find your user. Since you are using a single primary key then it is also partition key. That's one of the most efficient way to query a table in Cassandra.

In my opinion, option 2 is the best approach for your data modeling but as @Alex Ott's question pointed out, the sizes of those table may be a key factor for going with "where in" clause solution.

Edit:

Datastax's How are read requests accomplished? article is a great resource to understand read strategy of Cassandra.

Ersoy
  • 8,816
  • 6
  • 34
  • 48
  • "since you used user id as primary key(id column) in your users_by_id table, then Cassandra will not traverse each node one by one" the users are not guaranteed to be on one node, so the app may need to check multiple nodes to get all the users. – Jacques Amsel Apr 21 '20 at 13:10
  • it depends on your consistency level. Aaron has a great answer - it may help you https://stackoverflow.com/a/36505965/2188922 – Ersoy Apr 21 '20 at 14:09
  • The issue isn't consistency, it's that, because I'll be looking up users with different ids, they will likely be on separate nodes, because cassandra will distribute them across nodes based on the hash. So if an organisation has two members, id 1 and id 2, id 1 may be on node 10, and id 2 on node 5. If i make a query for each of those, cassandra will need to lookup the users on multiple nodes. – Jacques Amsel Apr 21 '20 at 16:14
  • @Dextication i was saying "consistency level" - depending on your consistency level, Cassandra may not need a secondary lookup on a different node. I've edited the answer and added Datastax's blogpost about read requests 2 hours ago, sorry didn't inform you. – Ersoy Apr 21 '20 at 16:17
  • I understand that it won't traverse every node one by one to find a single user, but it may need to search for *many* users. For example, if an organisation has 1k users, it will need to contact many nodes, since the all users' IDs will not map to one node. That is bad, so I'd like to fix that. If I was using an RDBMS, I'd use a JOIN, but that's not possible with cassandra – Jacques Amsel Apr 25 '20 at 14:42
  • how does the "where in" manage to fetch all those 1K users in a single node ? – Ersoy Apr 25 '20 at 14:48
  • It is all related to read consistency you use to get all those user ids. if you choose `LOCAL_ONE` then closest replica in the local datacenter will return response. if you choose `LOCAL_QUORUM` then it will return response after a quorum of replicas in the current datacenter as the coordinator has reported. – Ersoy Apr 25 '20 at 14:53
  • But even with LOCAL_ONE, all users aren't stored on one node, right? So It will *still* need to contact many nodes – Jacques Amsel Apr 25 '20 at 15:16
  • then it is related to write consistency, if you are writing with consistency level `ALL` then all replica nodes will have it. if it is written with `ONE` then your read node may not have all the data you have. – Ersoy Apr 25 '20 at 15:19