
I'm trying to read Kafka messages with Spark Streaming, run some computations on them, and send the results to another process.

import org.apache.spark.streaming.kafka.KafkaUtils
import kafka.serializer.StringDecoder
import org.apache.spark.sql.functions.col

val jsonObject = new JSONObject   // driver-side "global" I want to accumulate into

val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
  ssc, kafkaParams, topicsSet)

stream.foreachRDD { rdd =>
  val jsonDF = spark.read.json(rdd.map(_._2))   // the message value is the second element of the pair
  val res = jsonDF.groupBy("artist").count.sort(col("count").desc).take(10)
  /* Some transformations => create jsonArray */
  jsonObject.put("Key", jsonArray)              // this put is where the exception is thrown
}

ssc.start()

For my requirement I need to accumulate results into the JSONObject (a global, driver-side variable), but the put operation throws a NotSerializableException:

java.io.NotSerializableException: Object of org.apache.spark.streaming.kafka.DirectKafkaInputDStream$MappedDStream is being serialized possibly as a part of closure of an RDD operation. This is because the DStream object is being referred to from within the closure. Please rewrite the RDD operation inside this DStream to avoid this. This has been enforced to avoid bloating of Spark tasks with unnecessary objects.

Is it possible to send this jsonArray out of the foreachRDD block? I do not want to write it to files or a database.
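
To make the intent concrete, here is a rough sketch of the kind of hand-off I have in mind (only an illustration: I'm assuming JSONObject/JSONArray come from org.json, and resultQueue plus the consuming thread are names I made up, not from any library):

import java.util.concurrent.LinkedBlockingQueue
import org.json.{JSONArray, JSONObject}

// Driver-side queue; a separate thread would drain it and forward the results.
val resultQueue = new LinkedBlockingQueue[JSONObject]()

stream.foreachRDD { rdd =>
  val jsonDF = spark.read.json(rdd.map(_._2))
  val top10  = jsonDF.groupBy("artist").count.sort(col("count").desc).take(10)

  // Stand-in for "Some transformations => create jsonArray"
  val jsonArray = new JSONArray()
  top10.foreach(row => jsonArray.put(row.getString(0)))

  // take(10) has already collected the rows to the driver, and only local
  // values are used here, so nothing extra should be dragged into RDD closures.
  val result = new JSONObject()
  result.put("Key", jsonArray)
  resultQueue.put(result)
}

// Somewhere else on the driver, started before ssc.start(), e.g.:
// while (true) { val next = resultQueue.take(); /* send `next` to the other process */ }

The idea is that the JSON building and the queue hand-off happen on the driver rather than inside an RDD closure, but I'm not sure whether this is the right way to get the data out.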

  • What do you want to do with this json array object? – Yuval Itzchakov Jul 12 '17 at 05:22
  • As explained in the referenced question, this is a closure-pull-context issue. Except that here, it is unclear what you plan to do with that jsonObject that will be growing forever. – maasg Jul 12 '17 at 06:30

0 Answers