Cassandra (CQL) schema/tables look the same as RDBMS for my scenario

Question

I have looked at the Twissandra examples. I asked a similar question regarding this a few days back and received some tips I implemented here. However, by looking at the tables (column families) I see barely any difference between this and a relational database.

My scenario: A simple address book where a user can create his own contacts and group them (one contact can be placed in many groups, one group can contain many contacts). A contact may have multiple addresses for example.

I want to retrieve all the contacts who live in address x and are placed in group y. Therefore, I did the following:

CREATE TABLE if not exists User (user_id uuid, contact_id uuid, type varchar, email varchar, PRIMARY KEY(id));
CREATE TABLE if not exists Contact (contact_id uuid, firstname varchar,lastname varchar, photo blob, imagelength int, note varchar, PRIMARY KEY (id));
CREATE TABLE if not exists Address (address_id uuid, contact_id uuid, street varchar, number int, zipcode varchar, country varchar, PRIMARY KEY(address_id));
CREATE TABLE if not exists Group (group_id uuid, user_id, groupname varchar, PRIMARY KEY(group_id));
CREATE TABLE if not exists Group_Contact (group_id uuid, contact_id, PRIMARY KEY(id, contact_id));

However, based on this, this is literally exactly the same as a relational database, well, except that I believe Cassandra is putting this data in a different way than a RDBMS on disk. I don't see how this can be made better in Cassandra and whether I even modeled this the right way. It just feels as a plain relational database. I feel that I did something wrong since I have to use application level joins to get the address of the contacts. I really don't know how I can de-normalize this to allow multiple addresses (and maybe even phones, emails).

Any suggestions to improve this scenario would be greatly appreciated!

So, with you model, which queries are you going to use given that you can query only primary key columns and there are no joins? — jny, Mar 24 '15 at 13:40
Thanks for your reply. I'm planning to join the queries in the application layer. Basically, want I want to get is all the contacts that live in country x and are set in group x. — Moody, Mar 24 '15 at 13:50
how are you planning to get all the contacts who live in country x? What will be your queries? — jny, Mar 24 '15 at 13:57
I was planning to do this: SELECT contact_id FROM Address WHERE address.country = "NL";. Followed by for every contact_id x in that resultset (application layer): SELECT * FROM contact WHERE contact_id = x. — Moody, Mar 24 '15 at 14:04
The first query is not going to work: you can't use columns which are not in primary key in WHERE clause unless they are indexed. But indexes are not efficient. Also your queries would run on all contacts for all users... — jny, Mar 24 '15 at 14:09
Oh shoot, I didn't know that.. Thank you for the advice. Hmm I hope to figure out a better way to model this database in this case. But can't I fix it for the first query if I make the contact_id also a primary key along with the address_id — Moody, Mar 24 '15 at 14:11

score 3 · Accepted Answer · answered Mar 24 '15 at 18:55

As jny indicated, data duplication, denormalization and query-based modeling are keys to building good Cassandra data models. If I wanted to take your tables above, and build a table to support address/contact queries based-on country, I could do it like this:

First, I'll create a user defined type for the contact's address.

aploetz@cqlsh:stackoverflow> CREATE TYPE contactAddress (
             ...   street varchar, 
             ...   city varchar,
             ...   zip_code varchar,
             ...   country varchar);

Next, I'll create a table called UserContactsByCountry to store user contact info, as well as any user contact addresses:

aploetz@cqlsh:stackoverflow> CREATE TABLE UserContactsByCountry (
             ...   country varchar,
             ...   user_id uuid,
             ...   type varchar,
             ...   email varchar,
             ...   firstname varchar,
             ...   lastname varchar,
             ...   photo blob,
             ...   imagelength int,
             ...   note varchar,
             ...   addresses map<text, frozen <contactAddress>>,
             ...   PRIMARY KEY ((country),user_id));

A couple of things to note here:

I am using country as a partitioning key for querying, and addding user_id as a clustering key for uniqueness.
Technically, country is being stored multiple in each row. Once as the partiiton key, and again with each address. Note that the country partition key is the one which allows us to run our query.
I assume that user contacts can have multiple addresses, so I'll store them in a map of type text (varchar), contactAddress (type created above).

Next, I'll insert three user contacts, each with two addresses, two from the USA and one from Great Britain.

aploetz@cqlsh:stackoverflow> INSERT INTO usercontactsbycountry (country, user_id, type, email, firstname, lastname, note, addresses)
VALUES ('USA',uuid(),'Tech','brycelynch@network23.com','Bryce','Lynch','Head of R&D at Network 23',{'work':{street:'101 Big Network Drive',city:'New York',zip_code:'10023',country:'USA'},'home':{street:'8192 N. 42nd St.',city:'New York',zip_code:'10025',country:'USA'}});
aploetz@cqlsh:stackoverflow> INSERT INTO usercontactsbycountry (country, user_id, type, email, firstname, lastname, note, addresses)
VALUES ('USA',uuid(),'Reporter','edisoncarter@network23.com','Edison','Carter','Reporter at Network 23',{'work':{street:'101 Big Network Drive',city:'New York',zip_code:'10023',country:'USA'},'home':{street:'76534 N. 62nd St.',city:'New York',zip_code:'10024',country:'USA'}});
aploetz@cqlsh:stackoverflow> INSERT INTO usercontactsbycountry (country, user_id, type, email, firstname, lastname, note, addresses)
VALUES ('GBR',uuid(),'Reporter','theorajones@network23.com','Theora','Jones','Controller at Network 23',{'work':{street:'101 Big Network Drive',city:'New York',zip_code:'10023',country:'USA'},'home':{street:'821 Wembley St.',city:'London',zip_code:'W11 2BQ',country:'GBR'}});

Now I can query that table for all user contacts in the USA:

aploetz@cqlsh:stackoverflow> SELECT * FROM usercontactsbycountry WHERE country ='USA';
 country | user_id                              | addresses                                                                                                                                                                                    | email                      | firstname | imagelength | lastname | note                      | photo | type
---------+--------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+----------------------------+-----------+-------------+----------+---------------------------+-------+----------
     USA | 2dee94e2-4887-4988-8cf5-9aee5fd0ea1e |  {'home': {street: '8192 N. 42nd St.', city: 'New York', zip_code: '10025', country: 'USA'}, 'work': {street: '101 Big Network Drive', city: 'New York', zip_code: '10023', country: 'USA'}} |   brycelynch@network23.com |     Bryce |        null |    Lynch | Head of R&D at Network 23 |  null |     Tech
     USA | b92612dd-dbaa-42f2-8ff2-d36b6c601aeb | {'home': {street: '76534 N. 62nd St.', city: 'New York', zip_code: '10024', country: 'USA'}, 'work': {street: '101 Big Network Drive', city: 'New York', zip_code: '10023', country: 'USA'}} | edisoncarter@network23.com |    Edison |        null |   Carter |    Reporter at Network 23 |  null | Reporter

(2 rows)

There are probably other ways in which this could be modeled, but this is one that I hoped to use to help you understand some of the techniques available.

jny · Answer 2 · 2015-03-24T14:45:38.533

It is difficult to switch from modeling for relational databases to modeling for Cassandra, because they seem so similar: the query language looks almost the same. But the first rule of Cassandra is model to your queries while in Relational Databases we model to data. That means:

Consider what your query the most on
Learn about partition keys and cluster keys
Don't be afraid of data duplication

There is a good example on data modeling in Cassandra: https://www.datastax.com/documentation/cql/3.1/cql/ddl/ddl_music_service_c.html

Cassandra (CQL) schema/tables look the same as RDBMS for my scenario

2 Answers2