I have a standalone spark 1.4.1 job that runs on a Red Hat box I submit via spark-submit that sometimes hangs during insertion of data from an RDD. I have auto-commit on the connection turned off and commit the transactions in batches of insertions. What the logs show me before it hangs:
16/03/25 14:00:05 INFO Executor: Finished task 3.0 in stage 138.0 (TID 915). 1847 bytes result sent to driver
16/03/25 14:00:05 DEBUG AkkaRpcEnv$$anonfun$actorRef$lzycompute$1$1$$anon$1: [actor] received message AkkaMessage(StatusUpdate(915,FINISHED,java.nio.HeapByteBuffer[pos=0 lim=1847 cap=1
16/03/25 14:00:05 DEBUG AkkaRpcEnv$$anonfun$actorRef$lzycompute$1$1$$anon$1: Received RPC message: AkkaMessage(StatusUpdate(915,FINISHED,java.nio.HeapByteBuffer[pos=0 lim=1847 cap=1847
16/03/25 14:00:05 DEBUG TaskSchedulerImpl: parentName: , name: TaskSet_138, runningTasks: 1
16/03/25 14:00:05 DEBUG AkkaRpcEnv$$anonfun$actorRef$lzycompute$1$1$$anon$1: [actor] handled message (0.118 ms) AkkaMessage(StatusUpdate(915,FINISHED,java.nio.HeapByteBuffer[pos=621 li
16/03/25 14:00:05 INFO TaskSetManager: Finished task 3.0 in stage 138.0 (TID 915) in 7407 ms on localhost (23/24)
16/03/25 14:00:05 TRACE DAGScheduler: Checking for newly runnable parent stages
16/03/25 14:00:05 TRACE DAGScheduler: running: Set(ResultStage 138)
16/03/25 14:00:05 TRACE DAGScheduler: waiting: Set()
16/03/25 14:00:05 TRACE DAGScheduler: failed: Set()
16/03/25 14:00:10 DEBUG AkkaRpcEnv$$anonfun$actorRef$lzycompute$1$1$$anon$1: [actor] received message AkkaMessage(Heartbeat(driver,[Lscala.Tuple2;@7ed52306,BlockManagerId(driver, local
16/03/25 14:00:10 DEBUG AkkaRpcEnv$$anonfun$actorRef$lzycompute$1$1$$anon$1: Received RPC message: AkkaMessage(Heartbeat(driver,[Lscala.Tuple2;@7ed52306,BlockManagerId(driver, localhos
16/03/25 14:00:10 DEBUG AkkaRpcEnv$$anonfun$actorRef$lzycompute$1$1$$anon$1: [actor] handled message (0.099 ms) AkkaMessage(Heartbeat(driver,[Lscala.Tuple2;@7ed52306,BlockManagerId(dri
And then it just repeats the last 3 lines with this intermittently:
16/03/25 14:01:04 TRACE HeartbeatReceiver: Checking for hosts with no recent heartbeats in HeartbeatReceiver.
The kicker is that I can't take a look at the web UI due to some firewall issues on these machines. What I noticed is that this issue was more prevalent when I was inserting with batches of 1000 than with 100. This is the scala code that looks to be the culprit.
//records should have up to INSERT_BATCH_SIZE entries
private def insertStuff(records: Seq[(String, (String, Stuff1, Stuff2, Stuff3))]) {
if (!records.isEmpty) {
//get statement used for insertion (instantiated in an array of statements)
val stmt = stuffInsertArray(//stuff)
logger.info("Starting insertions on stuff" + table + " for " + time + " with " + records.length + " records")
try {
records.foreach(record => {
//get vals from record
...
//perform sanity checks
if (//validate stuff)
{
//log stuff because it didn't validate
}
else
{
stmt.setInt(1, //stuff)
stmt.setLong(2, //stuff)
...
stmt.addBatch()
}
})
//check if connection is still valid
if (!connInsert.isValid(VALIDATE_CONNECTION_TIMEOUT))
{
logger.error("Insertion connection is not valid while inserting stuff.")
throw new RuntimeException(s"Insertion connection not valid while inserting stuff.")
}
logger.debug("Stuff insertion executing batch...")
stmt.executeBatch()
logger.debug("Stuff insertion execution complete. Committing...")
//commit insert batch. Either INSERT_BATCH_SIZE insertions planned or the last batch to be done
insertCommit() //this does the commit and resets some counters
logger.debug("stuff insertion commit complete.")
} catch {
case e: Exception => throw new RuntimeException(s"insertStuff exception ${e.getMessage}")
}
}
}
And here's how it gets called:
//stuffData is an RDD
stuffData.foreachPartition(recordIt => {
//new instance of the object of whose member function we're currently in
val obj = new Obj(clusterInfo)
recordIt.grouped(INSERT_BATCH_SIZE).foreach(records => obj.insertStuff(records))
})
All the extra logging and connection checking I put in just to isolate the issue but since I write for every batch of insertions, the logs get convoluted. If I serialize the insertions, the issue still persists. Any idea why the last task (out of 24) doesn't finish? Thanks.