I have a standalone Spark 1.4.1 job, submitted via spark-submit on a Red Hat box, that sometimes hangs while inserting data from an RDD into a database over JDBC. Auto-commit is turned off on the connection, and I commit the transactions in batches of insertions. Here is what the logs show right before it hangs:

16/03/25 14:00:05 INFO Executor: Finished task 3.0 in stage 138.0 (TID 915). 1847 bytes result sent to driver
16/03/25 14:00:05 DEBUG AkkaRpcEnv$$anonfun$actorRef$lzycompute$1$1$$anon$1: [actor] received message AkkaMessage(StatusUpdate(915,FINISHED,java.nio.HeapByteBuffer[pos=0 lim=1847 cap=1
16/03/25 14:00:05 DEBUG AkkaRpcEnv$$anonfun$actorRef$lzycompute$1$1$$anon$1: Received RPC message: AkkaMessage(StatusUpdate(915,FINISHED,java.nio.HeapByteBuffer[pos=0 lim=1847 cap=1847
16/03/25 14:00:05 DEBUG TaskSchedulerImpl: parentName: , name: TaskSet_138, runningTasks: 1
16/03/25 14:00:05 DEBUG AkkaRpcEnv$$anonfun$actorRef$lzycompute$1$1$$anon$1: [actor] handled message (0.118 ms) AkkaMessage(StatusUpdate(915,FINISHED,java.nio.HeapByteBuffer[pos=621 li
16/03/25 14:00:05 INFO TaskSetManager: Finished task 3.0 in stage 138.0 (TID 915) in 7407 ms on localhost (23/24)
16/03/25 14:00:05 TRACE DAGScheduler: Checking for newly runnable parent stages
16/03/25 14:00:05 TRACE DAGScheduler: running: Set(ResultStage 138)
16/03/25 14:00:05 TRACE DAGScheduler: waiting: Set()
16/03/25 14:00:05 TRACE DAGScheduler: failed: Set()
16/03/25 14:00:10 DEBUG AkkaRpcEnv$$anonfun$actorRef$lzycompute$1$1$$anon$1: [actor] received message AkkaMessage(Heartbeat(driver,[Lscala.Tuple2;@7ed52306,BlockManagerId(driver, local
16/03/25 14:00:10 DEBUG AkkaRpcEnv$$anonfun$actorRef$lzycompute$1$1$$anon$1: Received RPC message: AkkaMessage(Heartbeat(driver,[Lscala.Tuple2;@7ed52306,BlockManagerId(driver, localhos
16/03/25 14:00:10 DEBUG AkkaRpcEnv$$anonfun$actorRef$lzycompute$1$1$$anon$1: [actor] handled message (0.099 ms) AkkaMessage(Heartbeat(driver,[Lscala.Tuple2;@7ed52306,BlockManagerId(dri

It then just repeats the last three lines, with this showing up intermittently:

16/03/25 14:01:04 TRACE HeartbeatReceiver: Checking for hosts with no recent heartbeats in HeartbeatReceiver. 

The kicker is that I can't look at the web UI because of firewall issues on these machines. What I've noticed is that the issue was more prevalent when I was inserting in batches of 1000 than in batches of 100. This is the Scala code that appears to be the culprit.

//records should have up to INSERT_BATCH_SIZE entries
private def insertStuff(records: Seq[(String, (String, Stuff1, Stuff2, Stuff3))]) {
  if (!records.isEmpty) {
    //get statement used for insertion (instantiated in an array of statements)
    val stmt = stuffInsertArray(//stuff)
    logger.info("Starting insertions on stuff" + table + " for " + time + " with " + records.length + " records")
    try {
      records.foreach(record => {
        //get vals from record
        ...
        //perform sanity checks
        if (//validate stuff)
        {
          //log stuff because it didn't validate
        }
        else
        {
          stmt.setInt(1, //stuff)
          stmt.setLong(2, //stuff)
          ...
          stmt.addBatch()
        }
      })

      //check if connection is still valid
      if (!connInsert.isValid(VALIDATE_CONNECTION_TIMEOUT)) {
        logger.error("Insertion connection is not valid while inserting stuff.")
        throw new RuntimeException(s"Insertion connection not valid while inserting stuff.")
      }

      logger.debug("Stuff insertion executing batch...")
      stmt.executeBatch()
      logger.debug("Stuff insertion execution complete. Committing...")
      //commit insert batch. Either INSERT_BATCH_SIZE insertions planned or the last batch to be done
      insertCommit() //this does the commit and resets some counters
      logger.debug("stuff insertion commit complete.")
    } catch {
      case e: Exception => throw new RuntimeException(s"insertStuff exception  ${e.getMessage}")
    }
  }
}
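
For reference, insertCommit() is essentially just this (a simplified sketch; the counter name below is a placeholder for some app-specific bookkeeping):

//simplified sketch of insertCommit(): commits the open transaction on the
//insertion connection (auto-commit is off) and resets the per-batch counters
private def insertCommit() {
  connInsert.commit()
  insertedSinceLastCommit = 0 //placeholder counter name
}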

And here's how insertStuff gets called:

    //stuffData is an RDD
    stuffData.foreachPartition(recordIt => {
      //new instance of the class whose member function (insertStuff) we're calling
      val obj = new Obj(clusterInfo)
      recordIt.grouped(INSERT_BATCH_SIZE).foreach(records => obj.insertStuff(records))
    })
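
For context, the insertion connection inside Obj is set up roughly like this (a simplified sketch; the JDBC URL, credentials, and SQL are placeholders):

import java.sql.{Connection, DriverManager, PreparedStatement}

//simplified sketch of the connection setup in Obj; URL and credentials are placeholders
private val connInsert: Connection = {
  val conn = DriverManager.getConnection("jdbc:...", "user", "password")
  conn.setAutoCommit(false) //commits are done explicitly, per batch, via insertCommit()
  conn
}

//insert statements are prepared once per connection and reused for every batch
private val stuffInsertArray: Array[PreparedStatement] = Array(
  connInsert.prepareStatement("INSERT INTO ... VALUES (?, ?, ...)") //actual SQL elided
)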

I added all the extra logging and connection checking just to isolate the issue, but since I log for every batch of insertions, the logs get convoluted. Even if I serialize the insertions, the issue still persists. Any idea why the last task (out of 24) never finishes? Thanks.
