
I am trying to build the Spark Structured Streaming job below, which reads from Kafka, performs an aggregation (a count over every 1-minute window), and stores the result in Cassandra. I am getting an error in update mode.

java.lang.IllegalArgumentException: requirement failed: final_count does not support Update mode.
    at scala.Predef$.require(Predef.scala:281)
    at org.apache.spark.sql.execution.datasources.v2.V2Writes$.org$apache$spark$sql$execution$datasources$v2$V2Writes$$buildWriteForMicroBatch(V2Writes.scala:121)
    at org.apache.spark.sql.execution.datasources.v2.V2Writes$$anonfun$apply$1.applyOrElse(V2Writes.scala:90)
    at org.apache.spark.sql.execution.datasources.v2.V2Writes$$anonfun$apply$1.applyOrElse(V2Writes.scala:43)
    at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDownWithPruning$1(TreeNode.scala:584)
    at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:176)
    at org.apache.spark.sql.catalyst.trees.TreeNode.transformDownWithPruning(TreeNode.scala:584)
    ...

My Spark source is:

import os
os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages org.apache.spark:spark-sql-kafka-0-10_2.12:3.3.0,com.datastax.spark:spark-cassandra-connector_2.12:3.2.0 pyspark-shell'

from pyspark.sql.functions import from_json, col, unix_timestamp, window
from pyspark.sql.types import TimestampType

# spark (the SparkSession) and schema (the JSON payload schema) are defined earlier

df = spark \
  .readStream \
  .format("kafka") \
  .option("kafka.bootstrap.servers", "xxxx:9092") \
  .option("subscribe", "yyyy") \
  .option("startingOffsets", "earliest") \
  .load() \
  .select(from_json(col("value").cast("string"), schema).alias("parsed_value")) \
  .select(col("parsed_value.country"), col("parsed_value.city"), col("parsed_value.Location").alias("location"), col("parsed_value.TimeStamp")) \
  .withColumn('currenttimestamp', unix_timestamp(col('TimeStamp'), "yyyy-MM-dd HH:mm:ss").cast(TimestampType())) \
  .withWatermark("currenttimestamp", "1 minutes");
df.printSchema();
df=df.groupBy(window(df.currenttimestamp, "1 minutes"), df.location) \
     .count();
df = df.select(col("location"), col("window.start").alias("starttime"), col("count"));

df.writeStream \
  .outputMode("update") \
  .format("org.apache.spark.sql.cassandra") \
  .option("checkpointLocation", "/tmp/check_point/") \
  .option("keyspace", "cccc") \
  .option("table", "bbbb") \
  .option("spark.cassandra.connection.host", "aaaa") \
  .option("spark.cassandra.auth.username", "ffff") \
  .option("spark.cassandra.auth.password", "eee") \
  .start() \
  .awaitTermination()

The schema for the table in Cassandra is as follows:

CREATE TABLE final_count (
    starttime TIMESTAMP,
    location TEXT,
    count INT,
    PRIMARY KEY (starttime, location)
);

The query works in update mode when printing to the console, but fails with the error above when writing to Cassandra.
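
For reference, the console version that works in update mode would be something along these lines (assumed, since the console code isn't shown above; same df as before):

df.writeStream \
  .outputMode("update") \
  .format("console") \
  .start() \
  .awaitTermination()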

Any suggestions?

  • Welcome to Stack Overflow! Your question is missing a few details. The general guidance is that you (a) provide a good summary of the problem that includes software/component versions, the full error message + full stack trace; (b) describe what you've tried to fix the problem, details of investigation you've done; and (c) minimal sample code that replicates the problem. Cheers! – Erick Ramirez Sep 12 '22 at 03:32

1 Answer


You need foreachBatch, as Cassandra is still not a standard streaming sink.

See https://docs.databricks.com/structured-streaming/examples.html#write-to-cassandra-using-foreachbatch-in-scala
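
A minimal PySpark sketch of that approach, reusing the keyspace/table/connection placeholders from the question (the function name write_to_cassandra is illustrative):

def write_to_cassandra(batch_df, batch_id):
    # Each micro-batch is written with the ordinary batch writer, so the
    # streaming sink's output-mode restriction no longer applies.
    batch_df.write \
        .format("org.apache.spark.sql.cassandra") \
        .option("keyspace", "cccc") \
        .option("table", "bbbb") \
        .option("spark.cassandra.connection.host", "aaaa") \
        .option("spark.cassandra.auth.username", "ffff") \
        .option("spark.cassandra.auth.password", "eee") \
        .mode("append") \
        .save()

df.writeStream \
    .outputMode("update") \
    .foreachBatch(write_to_cassandra) \
    .option("checkpointLocation", "/tmp/check_point/") \
    .start() \
    .awaitTermination()

Because Cassandra upserts rows by primary key, appending each micro-batch gives you update semantics on (starttime, location).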

    Databricks examples are outdated, unfortunately. Update mode worked in Spark 2.x: https://github.com/alexott/cassandra-dse-playground/blob/master/spark-dse/src/main/scala/com/datastax/alexott/streaming/StructuredStreamingDSE.scala#L53 – Alex Ott Sep 23 '22 at 20:37
  • Interesting. @AlexOtt – thebluephantom Sep 24 '22 at 08:53