I'm trying to read Kafka messages with Spark Streaming, do some computations, and send the results to another process.
import kafka.serializer.StringDecoder
import org.apache.spark.streaming.kafka.KafkaUtils
import org.apache.spark.sql.functions.col

val jsonObject = new JSONObject

val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
  ssc, kafkaParams, topicsSet)

stream.foreachRDD { rdd =>
  // the message value is the second element of each (key, value) tuple
  val jsonDF = spark.read.json(rdd.map(_._2))
  val res = jsonDF.groupBy("artist").count.sort(col("count").desc).take(10)
  /* Some transformations => create jsonArray */
  jsonObject.put("Key", jsonArray)
}
ssc.start()
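The elided transformation step inside foreachRDD is roughly along these lines (this is only illustrative; it assumes an org.json-style JSONArray/JSONObject API, and the column types match my data):

// illustrative only: turn the top-10 rows returned by take(10) into a JSONArray
val jsonArray = new JSONArray
res.foreach { row =>
  val entry = new JSONObject
  entry.put("artist", row.getString(0))   // grouped column
  entry.put("count", row.getLong(1))      // aggregated count
  jsonArray.put(entry)
}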
For my requirement I need to accumulate results into the JSONObject (a global variable), but the put operation throws a NotSerializableException:
java.io.NotSerializableException: Object of org.apache.spark.streaming.kafka.DirectKafkaInputDStream$MappedDStream is being serialized possibly as a part of closure of an RDD operation. This is because the DStream object is being referred to from within the closure. Please rewrite the RDD operation inside this DStream to avoid this. This has been enforced to avoid bloating of Spark tasks with unnecessary objects.
Is it possible to send this jsonArray out of the foreachRDD block? I do not want to write to files or a database.
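To make the intent concrete, something like the sketch below is what I'm after (resultQueue and buildJsonArray are placeholders I made up, not something I already have):

import java.util.concurrent.ConcurrentLinkedQueue

// hypothetical driver-side buffer that a separate thread could drain and
// forward to the other process, instead of mutating a global JSONObject
val resultQueue = new ConcurrentLinkedQueue[JSONArray]()

stream.foreachRDD { rdd =>
  val jsonDF = spark.read.json(rdd.map(_._2))
  val res = jsonDF.groupBy("artist").count.sort(col("count").desc).take(10)
  val jsonArray = buildJsonArray(res)   // hypothetical helper wrapping the transformations above
  resultQueue.add(jsonArray)
}
// after ssc.start(): poll resultQueue and push each batch's jsonArray to the other process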