My problem statement is :-
I have a college , every college have some students ,
every student has some data(for eg student id , student name , class etc) related to him.
The kind of questions i would like to answer are :-
a) get all students of a particular college.(student id and name only not the other data)
b) get data of a student of a college given his ID and college name .
c) create a student to a college.(ADD a new student in college).
d) delete a student of a particular college given his id and college name .
e) UPdate a students data of a particular college given his ID and college name .
I have thought of following schema :-
One column Family which uses row_key as college$student_id(concatenation of college and student id) and value as data of the student in JSON. Another column Family which has college as row_key and one column with list of studentid and student names(student name is part of student data in first CF) concatenated as comma separated strings.(Specifically for the first query)
I can answer all queries with it.
Second i can think of :- A column family with college as row_key and column name as student_id and value as student data(which contains student name too). with this CF itself i can answer all queries.
But how will the query behave in terms of efficiency for get all students for a particular college.
I have read it will take the whole row into memory even when i need only student id and student name for which i have kept the second column family in my first approach. Even when i want a student for a particular id and college it will take the entire row in memory for giving me that for which i have the first CF in first approach
Secondly in this approach , problem of hotspots can be there.
Even when i want a student for a particular id and college it will take the entire row in memory for giving me that for which i have the first CF in first approach.
The only Reason i was thinking of the second approach was for the following new query which will have to do multi-gets from multiple nodes in the cluster.
Get all students from a particular college given list of ids of the students.
In my first approach as the college data will be spread across nodes i will have to do a get from many nodes. whereas in second approach as college is the key i will get it from one node itself.
BUt second approach had many other disadvantages i discussed above.
What can be a better approach or is the first approach right and efficient ?