
I am using the Datastax Cassandra Java driver to write to Cassandra from Spark workers. Code snippet:

    rdd.foreachPartition { record =>
      // one Cluster and Session per partition, closed after the partition is processed
      val cluster = SimpleApp.connect_cluster(Spark.cassandraip)
      val session = cluster.connect()
      record.foreach { case (bin_key: (Int, Int), kpi_map_seq: Iterable[Map[String, String]]) =>
        kpi_map_seq.foreach { kpi_map: Map[String, String] =>
          update_tables(session, bin_key, kpi_map)
        }
      }
      session.close()
      cluster.close()
    }
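
For context, connect_cluster is just a thin wrapper over the driver's Cluster builder, roughly along the lines of this sketch (the builder options shown are assumptions, not the exact code):

    import com.datastax.driver.core.Cluster

    object SimpleApp {
      // Sketch only: assumes the 3.x Datastax Java driver, where a Cluster is
      // built from a contact point and connect() is then called per partition.
      def connect_cluster(cassandraIp: String): Cluster =
        Cluster.builder()
          .addContactPoint(cassandraIp)
          .build()
    }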

For reading, I am using the Spark Cassandra connector (which I assume uses the same driver internally):

    val bin_table = javaFunctions(Spark.sc).cassandraTable("keyspace", "bin_1")
      .select("bin").where("cell = ?", cellname) // assuming this will run on worker nodes
    println(s"get_bins_for_cell: count of bins for cell $cellname is ${bin_table.count()}")
    return bin_table
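
For reference, the same read through the connector's Scala API would look roughly like this (a sketch assuming the Scala wrapper of the same connector version; I am using the javaFunctions form above):

    import com.datastax.spark.connector._

    // Same read via the connector's Scala API; keyspace, table and column names
    // are taken from the snippet above.
    val bin_table = Spark.sc.cassandraTable("keyspace", "bin_1")
      .select("bin")
      .where("cell = ?", cellname)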

Doing each of these on its own does not cause any problem. Doing them together throws the stack trace below.

My main goal is to avoid doing the write or read directly from the Spark driver program. Still, it seems to have something to do with the context; are two contexts getting used?

16/07/06 06:21:29 WARN TaskSetManager: Lost task 0.0 in stage 4.0 (TID 22, euca-10-254-179-202.eucalyptus.internal): java.io.IOException: org.apache.spark.SparkException: Failed to get broadcast_5_piece0 of broadcast_5
        at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1222)
        at org.apache.spark.broadcast.TorrentBroadcast.readBroadcastBlock(TorrentBroadcast.scala:165)
        at org.apache.spark.broadcast.TorrentBroadcast._value$lzycompute(TorrentBroadcast.scala:64)
        at org.apache.spark.broadcast.TorrentBroadcast._value(TorrentBroadcast.scala:64)
        at org.apache.spark.broadcast.TorrentBroadcast.getValue(TorrentBroadcast.scala:88)
        at org.apache.spark.broadcast.Broadcast.value(Broadcast.scala:70)
        at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62)
        at org.apache.spark.scheduler.Task.run(Task.scala:89)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)

1 Answer


It turned out that the Spark context was getting closed after using the session with Cassandra, like below.

Example

    import com.datastax.driver.core.Statement
    import com.datastax.driver.core.querybuilder.QueryBuilder
    import com.datastax.spark.connector.cql.CassandraConnector

    def update_table_using_cassandra_driver() = {
      CassandraConnector(SparkWriter.conf).withSessionDo { session =>
        val statement_4: Statement = QueryBuilder.insertInto("keyspace", "table")
          .value("bin", my_tuple_value)
          .value("cell", my_val("CName"))
        session.executeAsync(statement_4)
        ...
      }
    }
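
For completeness, the write from the question can also go through the connector-managed session instead of building a Cluster by hand inside foreachPartition. This is only a sketch: update_tables and the RDD shape are taken from the question, and SparkWriter.conf is assumed to be the SparkConf backing the single SparkContext.

    import com.datastax.spark.connector.cql.CassandraConnector

    // Sketch: let the connector own the Cluster/Session on each executor instead
    // of creating and closing them manually inside foreachPartition.
    val connector = CassandraConnector(SparkWriter.conf)

    rdd.foreachPartition { partition =>
      connector.withSessionDo { session =>
        partition.foreach { case (bin_key, kpi_map_seq) =>
          kpi_map_seq.foreach(kpi_map => update_tables(session, bin_key, kpi_map))
        }
      }
    }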

So the next time I called this in the loop I was getting an exception. It looks like a bug in the Cassandra driver; I will have to check this. For the time being I did the following to work around it:

    for (a <- 1 to 1000) {
      val sc = new SparkContext(SparkWriter.conf)
      update_table_using_cassandra_driver()
      sc.stop()
      ...sleep(xxx)
    }