I am trying to build the Spark Structured Streaming job below, which reads from Kafka, performs an aggregation (a count over every one-minute window), and stores the result in Cassandra. I am getting an error in update mode:
java.lang.IllegalArgumentException: requirement failed: final_count does not support Update mode.
at scala.Predef$.require(Predef.scala:281)
at org.apache.spark.sql.execution.datasources.v2.V2Writes$.org$apache$spark$sql$execution$datasources$v2$V2Writes$$buildWriteForMicroBatch(V2Writes.scala:121)
at org.apache.spark.sql.execution.datasources.v2.V2Writes$$anonfun$apply$1.applyOrElse(V2Writes.scala:90)
at org.apache.spark.sql.execution.datasources.v2.V2Writes$$anonfun$apply$1.applyOrElse(V2Writes.scala:43)
at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDownWithPruning$1(TreeNode.scala:584)
at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:176)
at org.apache.spark.sql.catalyst.trees.TreeNode.transformDownWithPruning(TreeNode.scala:584)
...
My Spark source is:
import os
os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages org.apache.spark:spark-sql-kafka-0-10_2.12:3.3.0,com.datastax.spark:spark-cassandra-connector_2.12:3.2.0 pyspark-shell'
from pyspark.sql.functions import col, from_json, unix_timestamp, window
from pyspark.sql.types import TimestampType

# `spark` (the SparkSession) and `schema` (the StructType for the JSON payload) are defined earlier (not shown)
df = spark.readStream \
    .format("kafka") \
    .option("kafka.bootstrap.servers", "xxxx:9092") \
    .option("subscribe", "yyyy") \
    .option("startingOffsets", "earliest") \
    .load() \
    .select(from_json(col("value").cast("string"), schema).alias("parsed_value")) \
    .select(col("parsed_value.country"), col("parsed_value.city"),
            col("parsed_value.Location").alias("location"), col("parsed_value.TimeStamp")) \
    .withColumn("currenttimestamp", unix_timestamp(col("TimeStamp"), "yyyy-MM-dd HH:mm:ss").cast(TimestampType())) \
    .withWatermark("currenttimestamp", "1 minute")

df.printSchema()

df = df.groupBy(window(df.currenttimestamp, "1 minute"), df.location).count()
df = df.select(col("location"), col("window.start").alias("starttime"), col("count"))

df.writeStream \
    .outputMode("update") \
    .format("org.apache.spark.sql.cassandra") \
    .option("checkpointLocation", "/tmp/check_point/") \
    .option("keyspace", "cccc") \
    .option("table", "bbbb") \
    .option("spark.cassandra.connection.host", "aaaa") \
    .option("spark.cassandra.auth.username", "ffff") \
    .option("spark.cassandra.auth.password", "eee") \
    .start() \
    .awaitTermination()
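To make the expected `starttime` values concrete: grouping by `window(currenttimestamp, "1 minute")` effectively floors each event time to the start of its minute, and `window.start` is that floored value. A plain-Python sketch of this bucketing (the function name is illustrative, not a Spark API):

```python
from datetime import datetime

def minute_window_start(ts: str) -> str:
    """Floor an event timestamp to the start of its one-minute window,
    mirroring what window(col, "1 minute") exposes as window.start."""
    t = datetime.strptime(ts, "%Y-%m-%d %H:%M:%S")
    return t.replace(second=0, microsecond=0).strftime("%Y-%m-%d %H:%M:%S")

print(minute_window_start("2023-01-05 10:17:42"))  # -> 2023-01-05 10:17:00
```

So each row written to Cassandra is keyed by a minute boundary plus a location, and update mode re-emits a row whenever its count changes within the watermark.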
The schema for the Cassandra table is:
CREATE TABLE final_count (
    starttime TIMESTAMP,
    location TEXT,
    count INT,
    PRIMARY KEY (starttime, location)
);
The query works in update mode when printing to the console, but fails with the error above when writing to Cassandra.
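For reference, the console variant that runs cleanly is roughly the following (same aggregated `df` as above, with only the sink swapped; the checkpoint path here is illustrative, since a streaming query cannot reuse another sink's checkpoint directory):

```python
# Same aggregated df as above; only the sink differs. The checkpoint path
# is illustrative -- each streaming query needs its own checkpoint location.
df.writeStream \
    .outputMode("update") \
    .format("console") \
    .option("truncate", "false") \
    .option("checkpointLocation", "/tmp/check_point_console/") \
    .start() \
    .awaitTermination()
```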
Any suggestions?