
I have a Spark Streaming context reading a stream from RabbitMQ with a batch interval of 30 seconds. I want to modify the values of a few columns of the corresponding rows already stored in Cassandra and then write the data back to Cassandra. For that I need to check whether the row for the particular primary key exists in Cassandra; if yes, fetch it and do the necessary operation. The problem is that I create the StreamingContext on the driver while the actions get performed on the workers, so the workers cannot get the StreamingContext object: it wasn't serialized and sent to them, and I get this error: java.io.NotSerializableException: org.apache.spark.streaming.StreamingContext. I also know that we cannot access the StreamingContext inside foreachRDD. How do I achieve the same functionality here without getting the serialization error?

I have looked at a few examples here, but they didn't help.

Here is the snippet of the code :

    val ssc = new StreamingContext(sparkConf, Seconds(30))
    val receiverStream = RabbitMQUtils.createStream(ssc, rabbitParams)
    receiverStream.start()
    val lines = receiverStream.map(EventData.fromString(_))

    lines.foreachRDD { x =>
      if (x.toLocalIterator.nonEmpty) {
        x.foreachPartition { it =>
          for (tuple <- it) {
            val cookieid  = tuple.cookieid
            val sessionid = tuple.sessionid
            val logdate   = tuple.logdate
            val eventRows = ssc.cassandraTable("SparkTest", CassandraTable)
              .select("*")
              .where("cookieid = '" + cookieid + "' and logdate = '" + logdate + "' and sessionid = '" + sessionid + "'")

            // some logic to check whether a row exists for this cookieid
          }
        }
      }
    }
Naresh
    Possible duplicate of [how to create a new RDD in anonymous function in spark streaming?](http://stackoverflow.com/questions/38522618/how-to-create-a-new-rdd-in-anonymous-function-in-spark-streaming) – eliasah Sep 08 '16 at 19:19
  • @eliasah I cannot access `StreamingContext` inside `lines.foreachRDD`, whether the `StreamingContext` is declared outside or inside main. Why did you vote to close it? It didn't solve my problem and, FYI, it is not a duplicate of the link you provided. – Naresh Sep 09 '16 at 05:46
  • Take the Spark context from `x.sparkContext()`, do not call it from the foreach method – Grigoriev Nick Sep 06 '18 at 12:26

2 Answers


The SparkContext cannot be serialized and passed across multiple workers on possibly different nodes. If you need to do something like this, you could use foreachPartition or mapPartitions. Otherwise, do this within the function that gets passed around:

    CassandraConnector(SparkWriter.conf).withSessionDo { session =>
      ...
      session.executeAsync(<CQL Statement>)
    }

In the SparkConf you need to give the Cassandra details:

  val conf = new SparkConf()
    .setAppName("test")
    .set("spark.ui.enabled", "true")
    .set("spark.executor.memory", "8g")
    //  .set("spark.executor.core", "4")
    .set("spark.eventLog.enabled", "true")
    .set("spark.eventLog.dir", "/ephemeral/spark-events")
    //to avoid disk space issues - default is /tmp
    .set("spark.local.dir", "/ephemeral/spark-scratch")
    .set("spark.cleaner.ttl", "10000")
    .set("spark.cassandra.connection.host", cassandraip)
    .setMaster("spark://10.255.49.238:7077")
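
Putting the two together, here is a rough, untested sketch of how the existence check and write-back from the question could look inside foreachPartition, using the conf above and the lines DStream from the question. CassandraConnector is serializable, so it can be created on the driver and captured by the closure instead of the StreamingContext. The table name event_data, the column somecolumn and the field tuple.somevalue are placeholders; the keyspace and key columns are the ones from the question.

    import com.datastax.spark.connector.cql.CassandraConnector

    // Serializable connector built on the driver; no StreamingContext ends up in the closure.
    val connector = CassandraConnector(conf)

    lines.foreachRDD { rdd =>
      rdd.foreachPartition { partition =>
        connector.withSessionDo { session =>
          partition.foreach { tuple =>
            // Does a row already exist for this primary key?
            val existing = session.execute(
              "SELECT * FROM SparkTest.event_data WHERE cookieid = ? AND logdate = ? AND sessionid = ?",
              tuple.cookieid, tuple.logdate, tuple.sessionid)
            if (existing.one() != null) {
              // Row exists: update the columns that need to change (somecolumn/somevalue are placeholders).
              session.executeAsync(
                "UPDATE SparkTest.event_data SET somecolumn = ? WHERE cookieid = ? AND logdate = ? AND sessionid = ?",
                tuple.somevalue, tuple.cookieid, tuple.logdate, tuple.sessionid)
            }
          }
        }
      }
    }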

The Java CSVParser is a library class that is not serializable, so Spark cannot send it to possibly different nodes if you call map or foreach on the RDD. One workaround is to use mapPartitions, in which case one full partition is processed on one Spark node, so the object need not be serialized for each call. Example:

    val rdd_inital_parse = rdd.mapPartitions(pLines)

    def pLines(lines: Iterator[String]) = {
      val parser = new CSVParser()  // cannot be serialized; would fail if used via rdd.map(pLines)
      lines.map(x => parseCSVLine(x, parser.parseLine))
    }
Alex Punnen
  • If I use `lines.foreachRDD{ x => if (x.toLocalIterator.nonEmpty) { x.foreachPartition { CassandraConnector(ssc.sparkContext.getConf).withSessionDo { session => .... session.executeAsync() `, then `ssc` won't be available on the worker. So again it's the same issue I asked about. – Naresh Sep 16 '16 at 07:17

Try with `x.sparkContext.cassandraTable()` instead of `ssc.cassandraTable()` and see if it helps.
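
For example, a rough sketch (the table name event_data is a placeholder; the import brings the connector's implicits into scope):

    import com.datastax.spark.connector._  // adds cassandraTable to SparkContext

    lines.foreachRDD { rdd =>
      // The body of foreachRDD runs on the driver, where rdd.sparkContext is available,
      // so no StreamingContext has to be captured by any closure.
      val eventRows = rdd.sparkContext
        .cassandraTable("SparkTest", "event_data")   // table name is a placeholder
        .select("cookieid", "sessionid", "logdate")

      // ... filter/join eventRows against rdd and write the result back to Cassandra
    }

If the lookup needs to be keyed on the primary key, the connector's joinWithCassandraTable on the RDD is another option.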

Anupam Jain