Cassandra schema design optimization

Question

My problem statement is :-

   I have a college , every college have some students , 
every student has some data(for eg student id , student name , class etc) related to him.

The kind of questions i would like to answer are :-

a) get all students of a particular college.(student id and name only not the other data)
b) get data of a student of a college given his ID and college name .
c) create a student to a college.(ADD a new student in college).
d) delete a student of a particular college given his id and college name .
e) UPdate a students data of a particular college given his ID and college name .

I have thought of following schema :-

One column Family which uses row_key as college$student_id(concatenation of college and student id) and value as data of the student in JSON. Another column Family which has college as row_key and one column with list of studentid and student names(student name is part of student data in first CF) concatenated as comma separated strings.(Specifically for the first query)

I can answer all queries with it.

Second i can think of :- A column family with college as row_key and column name as student_id and value as student data(which contains student name too). with this CF itself i can answer all queries.

But how will the query behave in terms of efficiency for get all students for a particular college.

I have read it will take the whole row into memory even when i need only student id and student name for which i have kept the second column family in my first approach. Even when i want a student for a particular id and college it will take the entire row in memory for giving me that for which i have the first CF in first approach

Secondly in this approach , problem of hotspots can be there.

Even when i want a student for a particular id and college it will take the entire row in memory for giving me that for which i have the first CF in first approach.

The only Reason i was thinking of the second approach was for the following new query which will have to do multi-gets from multiple nodes in the cluster.

Get all students from a particular college given list of ids of the students.

In my first approach as the college data will be spread across nodes i will have to do a get from many nodes. whereas in second approach as college is the key i will get it from one node itself.

BUt second approach had many other disadvantages i discussed above.

What can be a better approach or is the first approach right and efficient ?

abhi · Answer 1 · 2013-05-17T05:46:44.893

You can have a schema like this

CREATE COLUMNFAMILY cf(
    college  text,
    student_id bigint,
    student_name text,
    class text,
    ...
    ...,
    PRIMARY KEY(college, student_id)

);

get all students of a particular college.(student id and name only not the other data)

 cqlsh:keyspace> SELECT * FROM cf;


 college | student_id | student_class | student_name
---------+------------+---------------+--------------
    IITK |        102 |            B! |          sam
    IITK |        103 |            B! |         bugy
     SMU |        101 |            B* |         abhi
     SMU |        103 |            B! |        sandy

get data of a student of a college given his ID and college name .

cqlsh:keyspace> SELECT college,student_id,student_class,student_name FROM cf WHERE college='SMU' AND student_id=101;

 college | student_id | student_class | student_name
---------+------------+---------------+--------------
     SMU |        101 |            B* |         abhi

create a student to a college.(ADD a new student in college).

INSERT INTO cf (college, student_id ,student_class,student_name) VALUES ( 'SMU',104,'B!','mat');
cqlsh:keyspace> select * from cf ;

 college | student_id | student_class | student_name
---------+------------+---------------+--------------
    IITK |        102 |            B! |          sam
    IITK |        103 |            B! |         bugy
     SMU |        101 |            B* |         abhi
     SMU |        103 |            B! |        sandy
     SMU |        104 |            B! |          mat

delete a student of a particular college given his id and college name .

cqlsh:keyspace> DELETE FROM cf where college='SMU' AND student_id=104;
cqlsh:keyspace> select * from cf ;

 college | student_id | student_class | student_name
---------+------------+---------------+--------------
    IITK |        102 |            B! |          sam
    IITK |        103 |            B! |         bugy
     SMU |        101 |            B* |         abhi
     SMU |        103 |            B! |        sandy

Update a students data of a particular college given his ID and college name .

cqlsh:keyspace> UPDATE cf SET student_class='B!' where college='SMU' AND student_id=101;
cqlsh:keyspace> select * from cf ;

 college | student_id | student_class | student_name
---------+------------+---------------+--------------
    IITK |        102 |            B! |          sam
    IITK |        103 |            B! |         bugy
     SMU |        101 |            B! |         abhi
     SMU |        103 |            B! |        sandy

Get all students from a particular college given list of ids of the students

cqlsh:keyspace> SELECT * FROM cf WHERE college='SMU' AND student_id IN ( 101,103);

 college | student_id | student_class | student_name
---------+------------+---------------+--------------
     SMU |        101 |            B! |         abhi
     SMU |        103 |            B! |        sandy

The primary key is college,student id I will get a problem of multigets from different nodes as mentioned later in the question in case of another query — Peter, May 17 '13 at 04:40
@Peter Are you following cassandra 1.2 docs and also see the update section regarding your multiget query? — abhi, May 17 '13 at 06:01
Sorry for my naive level in cassandra . I just read that the shard key will be the first column of the primary key. This will create Hotspots right ? Also i wanted to know even when i will be querying for a student by id , will the entire row will be fetched into memory ? — Peter, May 17 '13 at 06:55
whenever you set a composite key, first entry is your primary partition key, rather say it the shrad key. And yes this will create the hot spot — abhi, May 17 '13 at 07:16
will the entire row be fetched in the memory even when i am querying for say one particular column. I think i read it somewhere ? — Peter, May 17 '13 at 08:03
Not exactly the way the you are thinking. Will suggest you to go through the [blog](http://thelastpickle.com/2011/07/04/Cassandra-Query-Plans/) — abhi, May 17 '13 at 16:35

Cassandra schema design optimization

1 Answers1