This isn't going to work well as-is; there are more than a few changes needed to get what you want here. In Cassandra, there are two things that usually help.
- If you're having trouble with a data model, ask yourself what it might look like as a time series. With Cassandra's distributed, append-only storage engine, use cases like time series and event tracking fit easily, and sometimes a data model makes more sense (from a Cassandra point of view) when viewed from that perspective.
- Build your tables to match your query patterns.
I see what is probably a PRIMARY KEY on ID. But what I don't see (above, at least) are any queries that filter on ID. I can tell that things like employees and laptops are important, and probably unique. But unique keys don't always make the best information filters.
The main question to ask is, what are you trying to get here?
To me, it looks like you want to see users who are experiencing high network utilization. And high network utilization is a (hopefully) temporary condition, so why not add a time component (checkpoint_time)? IMO, it makes sense to track computing resource utilization over time. With those points in mind, I came up with a data model like this:
cassdba@cqlsh:stackoverflow> CREATE TABLE employee_laptop__by_network_utilization (
timebucket text,
checkpoint_time timestamp,
employee_id bigint,
name text,
type text,
laptop_id bigint,
cpu bigint,
memory bigint,
network_utilization bigint,
disk_utilization bigint,
PRIMARY KEY ((timebucket),network_utilization,
checkpoint_time,employee_id,laptop_id)
) WITH CLUSTERING ORDER BY
(network_utilization ASC, checkpoint_time DESC,
employee_id ASC, laptop_id ASC);
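For completeness, rows go in with ordinary INSERTs. A minimal sketch, using the values from the first result row below (the names and figures are just sample test data):

```sql
INSERT INTO employee_laptop__by_network_utilization
  (timebucket, checkpoint_time, employee_id, name, type,
   laptop_id, cpu, memory, network_utilization, disk_utilization)
VALUES ('20171012', '2017-10-12 12:30:00', 1, 'Jebediah', 'Pilot',
        1, 4, 19, 55, 62);
```

Note that the application has to compute and supply the timebucket value itself (here, the checkpoint date formatted as yyyymmdd); Cassandra won't derive it for you.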
After inserting some rows, I can now query employee/laptop combinations who were experiencing network utilization > 50 on October 12, 2017.
cassdba@cqlsh:stackoverflow> SELECT * FROM employee_laptop__by_network_utilization
WHERE timebucket='20171012' AND network_utilization > 50;
timebucket | network_utilization | checkpoint_time | employee_id | laptop_id | cpu | disk_utilization | memory | name | type
------------+---------------------+---------------------------------+-------------+-----------+-----+------------------+--------+----------+-----------
20171012 | 55 | 2017-10-12 12:30:00.000000+0000 | 1 | 1 | 4 | 62 | 19 | Jebediah | Pilot
20171012 | 55 | 2017-10-12 12:15:00.000000+0000 | 1 | 1 | 19 | 62 | 18 | Jebediah | Pilot
20171012 | 72 | 2017-10-12 12:00:00.000000+0000 | 3 | 3 | 47 | 54 | 13 | Bob | Scientist
(3 rows)
First of all, I needed a good partition key that would both make sense for the queries and keep my partitions from unbounded growth. Therefore, I picked timebucket, which is really just a "date bucket." This way, I can isolate my queries to a single day and ensure that each query is served by a single node.
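If you ever need more than one day, you can hit multiple buckets in a single statement with IN on the partition key; a sketch, with made-up bucket values (though issuing one query per bucket in parallel from the application is often kinder to the coordinator):

```sql
SELECT * FROM employee_laptop__by_network_utilization
WHERE timebucket IN ('20171011', '20171012')
  AND network_utilization > 50;
```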
Next, I clustered on network_utilization, as that's the column this model is primarily concerned with. It is the first clustering column so that we don't have to provide too many other filtering columns in our query.
checkpoint_time is the next column in the PRIMARY KEY, mainly because rows with the same timebucket and network_utilization will probably make more sense sorted by time (DESCending).
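That ordering also lets you narrow by time once you've pinned the utilization value, since CQL allows a range on a clustering column when all columns before it are restricted by equality. A sketch (the cutoff timestamp is just illustrative):

```sql
SELECT * FROM employee_laptop__by_network_utilization
WHERE timebucket = '20171012'
  AND network_utilization = 55
  AND checkpoint_time > '2017-10-12 12:00:00';
```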
Finally, I added employee_id for uniqueness, and then laptop_id, because an employee could have more than one laptop.
Now, I'm sure you're going to find aspects of my solution that don't exactly fit your use case. And that's because Cassandra data modeling is very use-case-centric: often one good solution isn't a cookie-cutter fit for another. But it's one way to get the data that you're after.