Nested data data modelling in Cassandra?

Question

I am trying to create tables which can drills down to multiple level for eg

each project has multiple milestones and each milestones has mutiple tasks and each task has few subtasks along with task thier properties(like description,author,assigned to)..

i will be performing CRUD operations on it and user will be navigating through the project.

Cassandra 2.0.7 | CQL spec 3.1.1

How can i use composite keys or any other better approach with this.

Thanks in Advance

score 0 · Answer 1 · answered Jun 25 '14 at 13:30

The optimal design depends very much on all the ways you plan to query this set of data, and also on the estimated number of items on every level.

If all you need is drilling down, then the following design could work:

CREATE TABLE SubTask
(
    ProjectId int,
    <Project fields>,
    MilestoneId int,
    <Milestone fields>,
    TaskId int,
    <Task fields>,
    SubTaskId int,
    <SubTask fields>,
    PRIMARY KEY (ProjectId, MilestoneId, TaskId, SubTaskId)
);

Your project id would be the partitioning key, so you will need to verify if you have enough projects for an even partitioning among your cluster nodes. You could also use composite partition keys (PRIMARY KEY ((ProjectId, MilestoneId), TaskId, SubTaskId), but that would limit your possibilities to retrieve just the projects.

You could go with the convention that real ids are > 0, and preserve the rows with id 0 for the description fields. In other words, you would complete the milestone fields for a specific product only on the row that has TaskId = 0, thus avoid data duplication.

You could query all tasks and subtasks for a milestone as follows:

SELECT <task fields>, <subtask fields> 
FROM SubTask 
WHERE ProjectId=xxx AND MilestoneId=yyy;

If you would only need the task fields, you would need an index on the SubTaskId field:

CREATE INDEX ON SubTask(SubTaskId);

and then select as follows:

SELECT <task fields>
FROM SubTask
WHERE ProjectId=xxx AND MilestoneId=yyy AND SubTaskId=0;

An alternative solution would be to create a separate table for every entity (Project, Milestone, Task, SubTask), but then you would need two queries to retrieve tasks and subtasks).

Also, if the number of subtasks in a task cannot be too high, you could experiment with Cassandra 2.1's user defined types (currently in beta).

thanks!! which would be better performance wise ...separate table for each entity or one very wide table like you suggested??? — prankyrules, Jun 27 '14 at 15:04
There is no generic answer to this question. The right approach to Cassandra data modelling is to start with enumerating **ALL** queries you will need, design your tables so that these queries can be executed efficiently. Some data duplication and denormalization is fine with Cassandra. — medvekoma, Jun 28 '14 at 08:07

Nested data data modelling in Cassandra?

1 Answers1