The optimal design depends very much on all the ways you plan to query this set of data, and also on the estimated number of items on every level.
If all you need is drilling down, then the following design could work:
CREATE TABLE SubTask
(
ProjectId int,
<Project fields>,
MilestoneId int,
<Milestone fields>,
TaskId int,
<Task fields>,
SubTaskId int,
<SubTask fields>,
PRIMARY KEY (ProjectId, MilestoneId, TaskId, SubTaskId)
);
Your project id would be the partitioning key, so you will need to verify if you have enough projects for an even partitioning among your cluster nodes. You could also use composite partition keys (PRIMARY KEY ((ProjectId, MilestoneId), TaskId, SubTaskId)
, but that would limit your possibilities to retrieve just the projects.
You could go with the convention that real ids are > 0, and preserve the rows with id 0 for the description fields. In other words, you would complete the milestone fields for a specific product only on the row that has TaskId = 0, thus avoid data duplication.
You could query all tasks and subtasks for a milestone as follows:
SELECT <task fields>, <subtask fields>
FROM SubTask
WHERE ProjectId=xxx AND MilestoneId=yyy;
If you would only need the task fields, you would need an index on the SubTaskId field:
CREATE INDEX ON SubTask(SubTaskId);
and then select as follows:
SELECT <task fields>
FROM SubTask
WHERE ProjectId=xxx AND MilestoneId=yyy AND SubTaskId=0;
An alternative solution would be to create a separate table for every entity (Project, Milestone, Task, SubTask), but then you would need two queries to retrieve tasks and subtasks).
Also, if the number of subtasks in a task cannot be too high, you could experiment with Cassandra 2.1's user defined types (currently in beta).