It's quite a broad topic, and an efficient solution will depend on the amount of data in your tables, the table structure, how data is inserted/updated, etc. The specific solution may also depend on the version of Spark available. One downside of a Spark-only method is that you can't easily detect deletes of the data unless you keep a complete copy of the previous state, so you can generate a diff between the two states.
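For illustration, if you do keep such a copy (for example, a snapshot written to Parquet during the previous run), deleted rows can be found with an anti-join on the primary key. A minimal sketch using the test.tbl table defined below - the snapshot locations are hypothetical:

import org.apache.spark.sql.cassandra._

// current state of the table
val current = spark.read.cassandraFormat("tbl", "test").load()
// state saved during the previous run (hypothetical location)
val previous = spark.read.parquet("/snapshots/test_tbl/previous")

// primary keys that existed before, but are gone now = deleted rows
val deleted = previous.join(current, Seq("pk"), "left_anti")

// save the current state for the next comparison
current.write.mode("overwrite").parquet("/snapshots/test_tbl/current")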
In all cases you'll need to perform a full table scan to find changed entries, but if your table is organized specifically for this task, you can avoid reading all of the data. For example, if you have a table with the following structure:
create table test.tbl (
  pk int,
  ts timestamp,
  v1 ...,
  v2 ...,
  primary key(pk, ts));
then if you execute the following query:
import org.apache.spark.sql.cassandra._
val data = spark.read.cassandraFormat("tbl", "test").load()
val filtered = data.filter("""ts >= cast('2019-03-10T14:41:34.373+0000' as timestamp)
AND ts <= cast('2019-03-10T19:01:56.316+0000' as timestamp)""")
then the Spark Cassandra Connector will push this query down to Cassandra and will read only the data where ts is in the given time range. You can check this by executing filtered.explain and verifying that both time filters are marked with the * symbol.
Another way to detect changes is to retrieve the write time from Cassandra and filter out the changes based on that information. Fetching the write time is supported in the RDD API for all recent versions of SCC, and in the Dataframe API since the release of SCC 2.5.0 (which requires at least Spark 2.4, although it may work with 2.3 as well). After fetching this information, you can apply filters on the data and extract the changes - see the sketch after the list below. But you need to keep several things in mind:
- there is no way to detect deletes using this method
- write time information exists only for regular & static columns, but not for primary key columns
- each column may have its own write time value if there was a partial update of the row after insertion
- in most versions of Cassandra, calling the writetime function for a collection column (list/map/set) will generate an error, and it may return null for a column with a user-defined type
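Here is a minimal sketch of this approach using the RDD API (assuming the "column".writeTime selector; the alias name is just illustrative, and the write time is expressed in microseconds since epoch):

import com.datastax.spark.connector._

// read a regular column together with its write time
val rdd = sc.cassandraTable("test", "tbl")
  .select("pk", "ts", "v1", "v1".writeTime as "v1_writetime")

// cutoff expressed in microseconds (same timestamp as in the example above)
val sinceMicros = java.time.Instant.parse("2019-03-10T14:41:34.373Z").toEpochMilli * 1000L

// keep only rows where v1 was written after the cutoff; Option handles null write times
val changed = rdd.filter(row => row.get[Option[Long]]("v1_writetime").exists(_ >= sinceMicros))

Because the write time is per column, if your rows can be updated partially you may need to fetch the write time of several columns and take the maximum.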
P.S. Even if you had CDC enabled, it's not a trivial task to use it correctly:
- you need to de-duplicate changes - you get RF copies of every change
- some changes could be lost, for example when a node was down, and then propagated later via hints or repairs
- TTL isn't easy to handle
- ...
For CDC you may look for presentations from the 2019 DataStax Accelerate conference - there were several talks on that topic.