Spark: Continuously reading data from Cassandra

Question

I have gone through Reading from Cassandra using Spark Streaming and through tutorial-1 and tutorial-2 links.

Is it fair to say that Cassandra-Spark integration currently does not provide anything out of the box to continuously get the updates from Cassandra and stream them to other systems like HDFS?

By continuously, I mean getting only those rows in a table which have changed (inserted or updated) since the last fetch by Spark. If there are too many such rows, there should be an option to limit the number of rows and the subsequent spark fetch should begin from where it left off. At-least once guarantee is ok but exactly-once would be a huge welcome.

If its not supported, one way to support it could be to have an auxiliary column updated_time in each cassandra-table that needs to be queried by storm and then use that column for queries. Or an auxiliary table per table that contains ID, timestamp of the rows being changed. Has anyone tried this before?

score 0 · Answer 1 · answered Feb 28 '17 at 21:37

I don't think Apache Cassandra has this functionality out of the box. Internally [for some period of time] it stores all operations on data in sequential manner, but it's per node and it gets compacted eventually (to save space). Frankly, Cassandra's (as most other DB's) promise is to provide latest view of data (which by itself can be quite tricky in distributed environment), but not full history of how data was changing.

So if you still want to have such info in Cassandra (and process it in Spark), you'll have to do some additional work yourself: design dedicated table(s) (or add synthetic columns), take care of partitioning, save offset to keep track of progress, etc.

Cassandra is ok for time series data, but in your case I would consider just using streaming solution (like Kafka) instead of inventing it.

score 0 · Answer 2 · answered Mar 01 '17 at 03:13

I agree with what Ralkie stated but wanted to propose one more solution if you're tied to C* with this use case. This solution assumes you have full control over the schema and ingest as well. This is not a streaming solution though it could awkwardly be shoehorned into one.

Have you considered using composite key composed of the timebucket along with a murmur_hash_of_one_or_more_clustering_columns % some_int_designed_limit_row_width? In this way, you could set your timebuckets to 1 minute, 5 minutes, 1 hour, etc depending on how "real-time" you need to analyze/archive your data. The murmur hash based off of one or more of the clustering columns is needed to help located data in the C* cluster (and is a terrible solution if you're often looking up specific clustering columns).

For example, take an IoT use case where sensors report in every minute and have some sensor reading that can be represented as an integer.

create table if not exists iottable {
  timebucket bigint,
  sensorbucket int,
  sensorid varchar,
  sensorvalue int,
  primary key ((timebucket, sensorbucket), sensorid)
} with caching = 'none'
   and compaction = { 'class': 'com.jeffjirsa.cassandra.db.compaction.TimeWindowedCompaction' };

Note the use of TimeWindowedCompaction. I'm not sure what version of C* you're using; but with the 2.x series, I'd stay away from DateTieredCompaction. I cannot speak to how well it performs in 3.x. Any any rate, you should test and benchmark extensively before settling on your schema and compaction strategy.

Also note that this schema could result in hotspotting as it is vulnerable to sensors that report more often than others. Again, not knowing the use case it's hard to provide a perfect solution -- it's just an example. If you don't care about ever reading C* for a specific sensor (or column), you don't have to use a clustering column at all and you can simply use a timeUUID or something random for the murmur hash bucketing.

Regardless of how you decide to partition the data, a schema like this would then allow you to use repartitionByCassandraReplica and joinWithCassandraTable to extract the data written during a given timebucket.

Spark: Continuously reading data from Cassandra

2 Answers2