Cassandra 2.1: how to model for the N most recently active users?

Question

I need to get a list of N users with the most recent actions using cassandra.

I have tried to use the following schema:

CREATE TABLE user_actions(
  partition_key int, 
  username int,
  action_time timestamp,
PRIMARY KEY(partition_key, action_time)
WITH CLUSTERING ORDER BY(action_time DESC);

When a user first logs in it creates a new entry. However, I'm unable to update the start_time property as it's used as a clustering key. I could delete the old data and create a new entry when the user takes another action - but user actions occur frequently and we will quickly hit the tombstone limit.

Does anyone know how best I could model the data in order to achieve this?

Chris Lohfink · Accepted Answer · 2016-12-16T02:47:28.133

4

You dont need to delete individual cells. Just keep appending. Add another component to the partition key to keep from getting too wide:

PRIMARY KEY((partition_key, day), action_time)

Then query using something like (pseudo code):

y = floor(time() / (60 * 60 * 24))
oldest_possible = y - 7
r = []
while len(r) < N and y >= oldest_possible:
    R.append(query('SELECT * FROM user_actions where partition_key = {x} AND day = {y} LIMIT {N};', x, y, N)
    y -= 1

to get the last N users. When theres not N in the partition, decrement day by 1 and query that too. Set TTL to 1 to 5 days or something and old records will be compacted away. This assumes you have at least N actions in 24 hours, or else you may get a not full list. If your records are super active can use hour instead of day.

The day/hour component can just be a simple floor(time() / (60*60*24))

edited Dec 16 '16 at 02:47

answered Dec 16 '16 at 02:41

Chris Lohfink

16,150
1
29
38

Thanks Chris - I've gone ahead and implemented this. There's one problem though, the query returns dupes. While I can filter out those duplicates in my application I was wondering if there was a way which didn't create these dupes. – cscan Dec 16 '16 at 18:38
Filtering is probably easiest option. In which case you probably want to remove the LIMIT from the query. The drivers fetch limit will handle the paging through the partition for you. It could introduce some bad worse case scenarios though. You could potentially do a: `((partition_key, day), username)` that just stores the last action_time for each user, and keep top N of iterating though that. I would expect that to be worse but it can really depend on how many users are active. If you only have a few active users in day that do a lot, original model would have to walk through more than other – Chris Lohfink Dec 16 '16 at 20:28

Cassandra 2.1: how to model for the N most recently active users?

1 Answers1