I have sensors that write data to a log file at a high rate. I want to store these logs in Cassandra and process them with Spark.
I have thought about using a TimeUUID column to store my timestamp, which would preserve ordering automatically. My queries will rely heavily on range queries, so this seemed ideal. However, my logs can contain duplicate timestamps because of the logging frequency. The logs are not streamed to Cassandra; I am working with historical data only. The timestamp will be part of my compound primary key, and I cannot think of a viable column I could pull into the primary key to make rows with duplicate timestamps unique.
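For reference, this is roughly the table shape I have in mind (the keyspace, table, and column names are placeholders of mine):

```java
import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.Session;

public class CreateSchema {
    public static void main(String[] args) {
        try (Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
             Session session = cluster.connect()) {
            session.execute(
                "CREATE KEYSPACE IF NOT EXISTS sensors WITH replication = "
              + "{'class': 'SimpleStrategy', 'replication_factor': 1}");
            // sensor_id is the partition key; the timeuuid clustering column
            // keeps rows time-ordered within each partition.
            session.execute(
                "CREATE TABLE IF NOT EXISTS sensors.logs ("
              + "  sensor_id text,"
              + "  ts        timeuuid,"
              + "  payload   text,"
              + "  PRIMARY KEY (sensor_id, ts)"
              + ") WITH CLUSTERING ORDER BY (ts ASC)");
        }
    }
}
```

Partitioning by sensor keeps each time-range query on a single partition.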
The documentation says: "The values returned by minTimeuuid and maxTimeuuid functions are not true UUIDs in that the values do not conform to the Time-Based UUID generation process specified by the RFC 4122. The results of these functions are deterministic, unlike the now function."
If I force the date of a TimeUUID instead of using now(), the resulting values are deterministic, so two rows with the same timestamp would collide and overwrite previously written data.
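My understanding is that minTimeuuid/maxTimeuuid are only meant for the read side, where the deterministic "fake" UUIDs are compared against but never stored; something like this sketch (assuming the placeholder table above):

```java
import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.ResultSet;
import com.datastax.driver.core.Row;
import com.datastax.driver.core.Session;

public class RangeQuery {
    public static void main(String[] args) {
        try (Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
             Session session = cluster.connect("sensors")) {
            // minTimeuuid/maxTimeuuid bracket a time window; their deterministic
            // values are fine here because they are only used for comparison.
            ResultSet rs = session.execute(
                "SELECT sensor_id, dateOf(ts), payload FROM logs "
              + "WHERE sensor_id = 'sensor-1' "
              + "AND ts >= minTimeuuid('2016-01-01 00:00+0000') "
              + "AND ts <  maxTimeuuid('2016-01-02 00:00+0000')");
            for (Row row : rs) {
                System.out.println(row);
            }
        }
    }
}
```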
I will use Java/Scala to bulk-insert my historical data from .json files into Cassandra (Cassandra 3.0.8 | CQL spec 3.4.0 | Native protocol v4).
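The insert path would look roughly like this sketch; the records are hard-coded stand-ins for the parsed .json data, and HistoricalTimeUuid.fromMillis is a hypothetical helper I sketch at the end of this question:

```java
import com.datastax.driver.core.BoundStatement;
import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.PreparedStatement;
import com.datastax.driver.core.Session;

public class BulkLoad {
    public static void main(String[] args) {
        try (Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
             Session session = cluster.connect("sensors")) {
            PreparedStatement insert = session.prepare(
                "INSERT INTO logs (sensor_id, ts, payload) VALUES (?, ?, ?)");
            // Stand-ins for the parsed .json records; note the duplicate timestamp.
            Object[][] records = {
                {"sensor-1", 1471228928000L, "{\"value\": 1}"},
                {"sensor-1", 1471228928000L, "{\"value\": 2}"},
            };
            for (Object[] r : records) {
                // fromMillis (hypothetical, sketched below) must make the two
                // rows sharing the same millisecond distinct.
                BoundStatement bound = insert.bind(
                    (String) r[0],
                    HistoricalTimeUuid.fromMillis((Long) r[1]),
                    (String) r[2]);
                session.execute(bound);
            }
        }
    }
}
```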
How can I store rows with duplicate timestamps in my data?
- Do I use a now()-based TimeUUID for my primary key and store the actual date/time in a separate column? This would make me lose the benefit of having the rows already ordered by their actual date/time.
- Do I have to make sure that my Java/Scala application generates valid, unique TimeUUIDs itself? If so, are there any common libraries I can use? (A sketch of what I mean follows below.)
Or are there other (better) options?
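To make the second bullet concrete, here is a minimal sketch of what I have in mind if I generate the values myself: build an RFC 4122 version 1 UUID whose time component comes from the historical timestamp, and randomize the clock-sequence and node bits as a tiebreaker so that rows sharing the same millisecond stay unique. HistoricalTimeUuid and fromMillis are placeholder names of mine; as far as I can tell, the DataStax driver's com.datastax.driver.core.utils.UUIDs utility only offers now()-based generation (timeBased()) and the deterministic startOf/endOf, so I would have to hand-roll something like this:

```java
import java.util.Random;
import java.util.UUID;

public final class HistoricalTimeUuid {
    // Offset between the UUID epoch (1582-10-15) and the Unix epoch, in ms.
    private static final long UUID_EPOCH_OFFSET_MS = 12219292800000L;
    private static final Random RANDOM = new Random();

    /**
     * Builds a version 1 (time-based) UUID whose timestamp component comes
     * from the given historical Unix-millisecond value, with random
     * clock-sequence and node bits as the uniqueness tiebreaker.
     */
    public static UUID fromMillis(long unixMillis) {
        // UUID v1 timestamps count 100-ns intervals since the UUID epoch.
        long timestamp = (unixMillis + UUID_EPOCH_OFFSET_MS) * 10000L;

        long msb = 0L;
        msb |= (timestamp & 0x00000000FFFFFFFFL) << 32;  // time_low
        msb |= (timestamp & 0x0000FFFF00000000L) >>> 16; // time_mid
        msb |= (timestamp & 0x0FFF000000000000L) >>> 48; // time_hi
        msb |= 0x0000000000001000L;                      // version 1

        long lsb = RANDOM.nextLong();
        lsb &= 0x3FFFFFFFFFFFFFFFL; // clear the two variant bits...
        lsb |= 0x8000000000000000L; // ...and set the RFC 4122 variant

        return new UUID(msb, lsb);
    }

    public static void main(String[] args) {
        long t = 1471228928000L;           // same millisecond twice
        System.out.println(fromMillis(t)); // distinct UUIDs,
        System.out.println(fromMillis(t)); // same time component
    }
}
```

Since Cassandra compares timeuuid values by their time component first, ordering by the actual date/time would be preserved, and the 62 random tiebreaker bits make a collision between two rows with the same millisecond practically impossible. Would this be a sound approach?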
Thanks