
I have a Spark Streaming context reading a stream from RabbitMQ with a batch interval of 30 seconds. I want to modify the values of a few columns of the corresponding rows already stored in Cassandra and then write the data back to Cassandra. For that I need to check whether the row for the particular primary key exists in Cassandra; if yes, fetch it and do the necessary operation. The problem is that I create the StreamingContext on the driver while the actions get performed on the workers, so the workers cannot get the StreamingContext object: it wasn't serialized and sent to them, and I get this error: java.io.NotSerializableException: org.apache.spark.streaming.StreamingContext. I also know that we cannot access the StreamingContext inside foreachRDD. How do I achieve the same functionality here without getting the serialization error?

I have looked at a few examples here, but they didn't help.

Here is the snippet of the code :

    val ssc = new StreamingContext(sparkConf, Seconds(30))
    val receiverStream = RabbitMQUtils.createStream(ssc, rabbitParams)
    receiverStream.start()
    val lines = receiverStream.map(EventData.fromString(_))

    lines.foreachRDD { x =>
      if (x.toLocalIterator.nonEmpty) {
        x.foreachPartition { it =>
          for (tuple <- it) {
            val cookieid  = tuple.cookieid
            val sessionid = tuple.sessionid
            val logdate   = tuple.logdate
            val eventRows = ssc.cassandraTable("SparkTest", CassandraTable)
              .select("*")
              .where("cookieid = '" + cookieid + "' and logdate = '" + logdate + "' and sessionid = '" + sessionid + "'")

            // some logic to check whether a row exists for this cookieid
          }
        }
      }
    }
Naresh
    Possible duplicate of [how to create a new RDD in anonymous function in spark streaming?](http://stackoverflow.com/questions/38522618/how-to-create-a-new-rdd-in-anonymous-function-in-spark-streaming) – eliasah Sep 08 '16 at 19:19
  • @eliasah I cannot access `StreamingContext` inside `lines.foreachRDD`, whether the `StreamingContext` is declared outside or inside main. Why did you vote to close it? It didn't solve my problem and, FYI, it is not a duplicate of the link you provided. – Naresh Sep 09 '16 at 05:46
  • Take the Spark context from `x.sparkContext()`, do not call it from the foreach method – Grigoriev Nick Sep 06 '18 at 12:26

2 Answers


The SparkContext cannot be serialized and passed across multiple workers on possibly different nodes. If you need to do something like this, you could use foreachPartition or mapPartitions. Otherwise, do this within the function that gets passed around:

    CassandraConnector(SparkWriter.conf).withSessionDo { session =>
      ...
      session.executeAsync(<CQL Statement>)
    }

In the SparkConf you need to give the Cassandra details:

  val conf = new SparkConf()
    .setAppName("test")
    .set("spark.ui.enabled", "true")
    .set("spark.executor.memory", "8g")
    //  .set("spark.executor.core", "4")
    .set("spark.eventLog.enabled", "true")
    .set("spark.eventLog.dir", "/ephemeral/spark-events")
    //to avoid disk space issues - default is /tmp
    .set("spark.local.dir", "/ephemeral/spark-scratch")
    .set("spark.cleaner.ttl", "10000")
    .set("spark.cassandra.connection.host", cassandraip)
    .setMaster("spark://10.255.49.238:7077")
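
Putting the two together, here is a rough, untested sketch of how the existence check and write-back from the question could look inside foreachPartition, using the conf above and the lines DStream from the question. CassandraConnector is serializable, so it can be created on the driver and captured by the closure instead of the StreamingContext. The table name event_data, the column somecolumn and the field tuple.somevalue are placeholders; the keyspace and key columns are the ones from the question.

    import com.datastax.spark.connector.cql.CassandraConnector

    // Serializable connector built on the driver; no StreamingContext ends up in the closure.
    val connector = CassandraConnector(conf)

    lines.foreachRDD { rdd =>
      rdd.foreachPartition { partition =>
        connector.withSessionDo { session =>
          partition.foreach { tuple =>
            // Does a row already exist for this primary key?
            val existing = session.execute(
              "SELECT * FROM SparkTest.event_data WHERE cookieid = ? AND logdate = ? AND sessionid = ?",
              tuple.cookieid, tuple.logdate, tuple.sessionid)
            if (existing.one() != null) {
              // Row exists: update the columns that need to change (somecolumn/somevalue are placeholders).
              session.executeAsync(
                "UPDATE SparkTest.event_data SET somecolumn = ? WHERE cookieid = ? AND logdate = ? AND sessionid = ?",
                tuple.somevalue, tuple.cookieid, tuple.logdate, tuple.sessionid)
            }
          }
        }
      }
    }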

The Java CSVParser is a library class that is not serializable, so Spark cannot send it to possibly different nodes if you call map or foreach on the RDD. One workaround is to use mapPartitions, in which case one full partition is processed on one Spark node, so the object need not be serialized for each call. Example:

    val rdd_inital_parse = rdd.mapPartitions(pLines)

    def pLines(lines: Iterator[String]) = {
      val parser = new CSVParser()  // cannot be serialized; would fail if used via rdd.map(pLines)
      lines.map(x => parseCSVLine(x, parser.parseLine))
    }
Alex Punnen
  • If I use `lines.foreachRDD{ x => if (x.toLocalIterator.nonEmpty) { x.foreachPartition { CassandraConnector(ssc.sparkContext.getConf).withSessionDo { session => .... session.executeAsync() `, then `ssc` won't be available on the worker. So again it's the same issue I asked about. – Naresh Sep 16 '16 at 07:17

Try with `x.sparkContext.cassandraTable()` instead of `ssc.cassandraTable()` and see if it helps.
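
For example, a rough sketch (the table name event_data is a placeholder; the import brings the connector's implicits into scope):

    import com.datastax.spark.connector._  // adds cassandraTable to SparkContext

    lines.foreachRDD { rdd =>
      // The body of foreachRDD runs on the driver, where rdd.sparkContext is available,
      // so no StreamingContext has to be captured by any closure.
      val eventRows = rdd.sparkContext
        .cassandraTable("SparkTest", "event_data")   // table name is a placeholder
        .select("cookieid", "sessionid", "logdate")

      // ... filter/join eventRows against rdd and write the result back to Cassandra
    }

If the lookup needs to be keyed on the primary key, the connector's joinWithCassandraTable on the RDD is another option.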

Anupam Jain