I have sensors that write data to a log file at a high rate. I want to store these logs in Cassandra and process them with Spark.
I have thought about using a TimeUUID column to store my timestamp, which would preserve ordering automatically. My queries will rely heavily on range queries, so this seemed ideal. However, my logs can contain duplicate timestamps because of the logging frequency. The logs are not streamed to Cassandra; I am working with historical data only. The timestamp will be part of my compound primary key, and I cannot think of a viable column I could pull into the primary key to make rows with duplicate timestamps unique.
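For reference, this is roughly the table shape I have in mind (the keyspace, table, and column names are placeholders of mine):

```java
import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.Session;

public class CreateSchema {
    public static void main(String[] args) {
        try (Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
             Session session = cluster.connect()) {
            session.execute(
                "CREATE KEYSPACE IF NOT EXISTS sensors WITH replication = "
              + "{'class': 'SimpleStrategy', 'replication_factor': 1}");
            // sensor_id is the partition key; the timeuuid clustering column
            // keeps rows time-ordered within each partition.
            session.execute(
                "CREATE TABLE IF NOT EXISTS sensors.logs ("
              + "  sensor_id text,"
              + "  ts        timeuuid,"
              + "  payload   text,"
              + "  PRIMARY KEY (sensor_id, ts)"
              + ") WITH CLUSTERING ORDER BY (ts ASC)");
        }
    }
}
```

Partitioning by sensor keeps each time-range query on a single partition.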
The documentation says: "The values returned by minTimeuuid and maxTimeuuid functions are not true UUIDs in that the values do not conform to the Time-Based UUID generation process specified by the RFC 4122. The results of these functions are deterministic, unlike the now function."
If I force the date of a TimeUUID instead of using now(), the resulting values are deterministic, so two rows with the same timestamp would collide and overwrite previously written data.
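My understanding is that minTimeuuid/maxTimeuuid are only meant for the read side, where the deterministic "fake" UUIDs are compared against but never stored; something like this sketch (assuming the placeholder table above):

```java
import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.ResultSet;
import com.datastax.driver.core.Row;
import com.datastax.driver.core.Session;

public class RangeQuery {
    public static void main(String[] args) {
        try (Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
             Session session = cluster.connect("sensors")) {
            // minTimeuuid/maxTimeuuid bracket a time window; their deterministic
            // values are fine here because they are only used for comparison.
            ResultSet rs = session.execute(
                "SELECT sensor_id, dateOf(ts), payload FROM logs "
              + "WHERE sensor_id = 'sensor-1' "
              + "AND ts >= minTimeuuid('2016-01-01 00:00+0000') "
              + "AND ts <  maxTimeuuid('2016-01-02 00:00+0000')");
            for (Row row : rs) {
                System.out.println(row);
            }
        }
    }
}
```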
I will use Java/Scala to bulk-insert my historical data from .json files into Cassandra (Cassandra 3.0.8 | CQL spec 3.4.0 | Native protocol v4).
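The insert path would look roughly like this sketch; the records are hard-coded stand-ins for the parsed .json data, and HistoricalTimeUuid.fromMillis is a hypothetical helper I sketch at the end of this question:

```java
import com.datastax.driver.core.BoundStatement;
import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.PreparedStatement;
import com.datastax.driver.core.Session;

public class BulkLoad {
    public static void main(String[] args) {
        try (Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
             Session session = cluster.connect("sensors")) {
            PreparedStatement insert = session.prepare(
                "INSERT INTO logs (sensor_id, ts, payload) VALUES (?, ?, ?)");
            // Stand-ins for the parsed .json records; note the duplicate timestamp.
            Object[][] records = {
                {"sensor-1", 1471228928000L, "{\"value\": 1}"},
                {"sensor-1", 1471228928000L, "{\"value\": 2}"},
            };
            for (Object[] r : records) {
                // fromMillis (hypothetical, sketched below) must make the two
                // rows sharing the same millisecond distinct.
                BoundStatement bound = insert.bind(
                    (String) r[0],
                    HistoricalTimeUuid.fromMillis((Long) r[1]),
                    (String) r[2]);
                session.execute(bound);
            }
        }
    }
}
```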
How can I store rows with duplicate timestamps in my data?
- Do I use a now()-based TimeUUID for my primary key and store the actual date/time in a separate column? This would make me lose the benefit of having the rows already ordered by their actual date/time.
- Do I have to make sure that my Java/Scala application generates valid, unique TimeUUIDs itself? If so, are there any common libraries I can use? (A sketch of what I mean follows below.)
Or are there other (better) options?
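To make the second bullet concrete, here is a minimal sketch of what I have in mind if I generate the values myself: build an RFC 4122 version 1 UUID whose time component comes from the historical timestamp, and randomize the clock-sequence and node bits as a tiebreaker so that rows sharing the same millisecond stay unique. HistoricalTimeUuid and fromMillis are placeholder names of mine; as far as I can tell, the DataStax driver's com.datastax.driver.core.utils.UUIDs utility only offers now()-based generation (timeBased()) and the deterministic startOf/endOf, so I would have to hand-roll something like this:

```java
import java.util.Random;
import java.util.UUID;

public final class HistoricalTimeUuid {
    // Offset between the UUID epoch (1582-10-15) and the Unix epoch, in ms.
    private static final long UUID_EPOCH_OFFSET_MS = 12219292800000L;
    private static final Random RANDOM = new Random();

    /**
     * Builds a version 1 (time-based) UUID whose timestamp component comes
     * from the given historical Unix-millisecond value, with random
     * clock-sequence and node bits as the uniqueness tiebreaker.
     */
    public static UUID fromMillis(long unixMillis) {
        // UUID v1 timestamps count 100-ns intervals since the UUID epoch.
        long timestamp = (unixMillis + UUID_EPOCH_OFFSET_MS) * 10000L;

        long msb = 0L;
        msb |= (timestamp & 0x00000000FFFFFFFFL) << 32;  // time_low
        msb |= (timestamp & 0x0000FFFF00000000L) >>> 16; // time_mid
        msb |= (timestamp & 0x0FFF000000000000L) >>> 48; // time_hi
        msb |= 0x0000000000001000L;                      // version 1

        long lsb = RANDOM.nextLong();
        lsb &= 0x3FFFFFFFFFFFFFFFL; // clear the two variant bits...
        lsb |= 0x8000000000000000L; // ...and set the RFC 4122 variant

        return new UUID(msb, lsb);
    }

    public static void main(String[] args) {
        long t = 1471228928000L;           // same millisecond twice
        System.out.println(fromMillis(t)); // distinct UUIDs,
        System.out.println(fromMillis(t)); // same time component
    }
}
```

Since Cassandra compares timeuuid values by their time component first, ordering by the actual date/time would be preserved, and the 62 random tiebreaker bits make a collision between two rows with the same millisecond practically impossible. Would this be a sound approach?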
Thanks