I am using Apache Spark 2.0.1 in standalone mode together with spark-jobserver 0.7.0.
I have a small job that tests whether the context is operational, because sometimes the context gets killed while the Java process on my server stays alive. So I double-check that the context is up, both as a system process and by calling a job which returns some Spark configuration values and Java status information as a JSON-formatted string.
import java.lang.management.ManagementFactory;
import java.util.Iterator;
import java.util.List;

import org.apache.spark.SparkContext;

import com.sun.management.HotSpotDiagnosticMXBean;
import com.sun.management.VMOption;
import com.typesafe.config.Config;

public class TestJob extends VIQ_SparkJob {

    @Override
    public Object runJob(SparkContext jsc, Config jobConfig) {
        // Initializes the sparkSession field inherited from VIQ_SparkJob.
        getSparkSession(jsc);

        // Basic context information.
        String result = "{";
        result += "\"AppName\":\"" + jsc.appName() + "\",";
        result += "\"ApplicationID\":\"" + jsc.applicationId() + "\",";
        result += "\"DeployMode\":\"" + jsc.deployMode() + "\",";
        result += "\"ExecutorID\":\"" + jsc.env().executorId() + "\",";

        // Append every Spark configuration entry of the session.
        scala.collection.immutable.Map<String, String> all = sparkSession.conf().getAll();
        scala.collection.immutable.Set<String> keys = all.keySet();
        for (scala.collection.Iterator<String> iterator = keys.iterator(); iterator.hasNext();) {
            String next = iterator.next();
            result += "\"" + next + "\":\"" + all.get(next).get() + "\",";
        }

        // Append JVM resource figures of the driver process.
        result += "\"JavaAvailableProcessors\":\"" + Runtime.getRuntime().availableProcessors() + "\",";
        result += "\"JavaMaxMemory\":\"" + Runtime.getRuntime().maxMemory() + "\",";
        result += "\"JavaTotalMemory\":\"" + Runtime.getRuntime().totalMemory() + "\",";
        result += "\"JavaFreeMemory\":\"" + Runtime.getRuntime().freeMemory() + "\"";

        // Append the HotSpot diagnostic VM options, if the bean is available.
        final HotSpotDiagnosticMXBean hsdiag = ManagementFactory
                .getPlatformMXBean(HotSpotDiagnosticMXBean.class);
        if (hsdiag != null) {
            List<VMOption> vmOptions = hsdiag.getDiagnosticOptions();
            for (Iterator<VMOption> iterator = vmOptions.iterator(); iterator.hasNext();) {
                VMOption next = iterator.next();
                result += ",\"Java" + next.getName() + "\":\"" + next.getValue() + "\"";
            }
        }
        result += "}";
        return result;
    }
}
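For reference, a check like this can be scheduled from a client every 60 seconds via the jobserver REST API (spark-jobserver exposes POST /jobs with appName, classPath, context and sync query parameters). A minimal sketch; the host/port, app name and context name below are illustrative only:

import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URL;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public class ContextHealthCheck {

    // Hypothetical values; adjust to your jobserver host, uploaded app and context names.
    private static final String JOB_URL = "http://localhost:8090/jobs"
            + "?appName=test-job"
            + "&classPath=TestJob"
            + "&context=application_analytics"
            + "&sync=true";

    public static void main(String[] args) {
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
        // Run the check every 60 seconds, matching the interval described below.
        scheduler.scheduleAtFixedRate(ContextHealthCheck::runCheck, 0, 60, TimeUnit.SECONDS);
    }

    private static void runCheck() {
        try {
            HttpURLConnection conn = (HttpURLConnection) new URL(JOB_URL).openConnection();
            conn.setRequestMethod("POST");
            conn.setDoOutput(true);
            conn.getOutputStream().close(); // empty POST body; TestJob ignores its job config
            int status = conn.getResponseCode();
            System.out.println("Jobserver responded with HTTP " + status);
            conn.disconnect();
        } catch (IOException e) {
            // A connection failure here means the jobserver (or the context) is down.
            System.err.println("Health check failed: " + e.getMessage());
        }
    }
}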
I execute this check every 60 seconds. It works fine until the context is killed, at which point I get the following error in my spark-job-server.log:
[2017-02-19 06:37:33,639] ERROR ka.actor.OneForOneStrategy [] [akka://JobServer/user/context-supervisor/application_analytics] - Futures timed out after [3 seconds]
java.util.concurrent.TimeoutException: Futures timed out after [3 seconds]
at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219)
at scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223)
at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:190)
at akka.dispatch.MonitorableThreadFactory$AkkaForkJoinWorkerThread$$anon$3.block(ThreadPoolBuilder.scala:169)
at scala.concurrent.forkjoin.ForkJoinPool.managedBlock(ForkJoinPool.java:3640)
at akka.dispatch.MonitorableThreadFactory$AkkaForkJoinWorkerThread.blockOn(ThreadPoolBuilder.scala:167)
at scala.concurrent.Await$.result(package.scala:190)
at spark.jobserver.JobManagerActor.startJobInternal(JobManagerActor.scala:219)
at spark.jobserver.JobManagerActor$$anonfun$wrappedReceive$1.applyOrElse(JobManagerActor.scala:157)
at scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:36)
at spark.jobserver.common.akka.ActorStack$$anonfun$receive$1.applyOrElse(ActorStack.scala:33)
at scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:36)
at spark.jobserver.common.akka.Slf4jLogging$$anonfun$receive$1$$anonfun$applyOrElse$1.apply$mcV$sp(Slf4jLogging.scala:25)
at spark.jobserver.common.akka.Slf4jLogging$class.spark$jobserver$common$akka$Slf4jLogging$$withAkkaSourceLogging(Slf4jLogging.scala:34)
at spark.jobserver.common.akka.Slf4jLogging$$anonfun$receive$1.applyOrElse(Slf4jLogging.scala:24)
at scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:36)
at spark.jobserver.common.akka.ActorMetrics$$anonfun$receive$1.applyOrElse(ActorMetrics.scala:23)
at akka.actor.Actor$class.aroundReceive(Actor.scala:467)
at spark.jobserver.common.akka.InstrumentedActor.aroundReceive(InstrumentedActor.scala:8)
at akka.actor.ActorCell.receiveMessage(ActorCell.scala:516)
at akka.actor.ActorCell.invoke(ActorCell.scala:487)
at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:238)
at akka.dispatch.Mailbox.run(Mailbox.scala:220)
at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:397)
at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
[2017-02-19 06:37:33,639] ERROR .jobserver.JobManagerActor [] [] - About to restart actor due to exception:
java.util.concurrent.TimeoutException: Futures timed out after [3 seconds]
... (same stack trace as above)
And in the Spark worker log I can see that the worker killed the executor:
17/02/20 00:09:17 INFO Worker: Asked to kill executor app-20170218095729-0000/0
17/02/20 00:09:17 INFO ExecutorRunner: Runner thread for executor app-20170218095729-0000/0 interrupted
17/02/20 00:09:17 INFO ExecutorRunner: Killing process!
17/02/20 00:09:18 INFO Worker: Executor app-20170218095729-0000/0 finished with state KILLED exitStatus 0
17/02/20 00:09:18 INFO Worker: Cleaning up local directories for application app-20170218095729-0000
17/02/20 00:09:18 INFO ExternalShuffleBlockResolver: Application app-20170218095729-0000 removed, cleanupLocalDirs = true
17/02/20 00:09:18 INFO ExternalShuffleBlockResolver: Cleaning up executor AppExecId{appId=app-20170218095729-0000, execId=0}'s 1 local dirs
I have other applications running on the same server (and I'm not experiencing memory issues), but at times the processors can be heavily used by those other applications. Usually this is not a problem, because the jobserver is generally used during the day and the other applications run during the night, so the load is balanced.
My first thought was that the problem was memory-related, so I have allocated enough memory to each process. But I assumed that if the processors are being used by other applications, that would only slow down the job execution, not crash it. Or am I wrong?
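Note that the stack trace points at a blocking wait with a hard deadline inside spark.jobserver's JobManagerActor.startJobInternal (a Scala Await.result that gives up after 3 seconds), not at the job itself. A rough Java analogy of that pattern, just to illustrate why heavy CPU contention can turn "slow" into "crashed" (the timings are made up):

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class HardDeadlineDemo {
    public static void main(String[] args) throws Exception {
        ExecutorService pool = Executors.newSingleThreadExecutor();

        // Simulates work that normally finishes quickly but is delayed,
        // as if the CPUs were busy with other applications.
        Future<String> future = pool.submit(() -> {
            Thread.sleep(5_000); // "slow" under load
            return "done";
        });

        try {
            // Analogous to Await.result(..., 3 seconds) in JobManagerActor:
            // the caller does not wait any longer, it throws instead.
            String result = future.get(3, TimeUnit.SECONDS);
            System.out.println(result);
        } catch (TimeoutException e) {
            // This is the "Futures timed out after [3 seconds]" situation:
            // the work was merely slow, yet the waiting side fails hard.
            System.out.println("Timed out, caller gives up: " + e);
        } finally {
            pool.shutdownNow();
        }
    }
}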
And what is the meaning of "Executor app-20170218095729-0000/0 finished with state KILLED exitStatus 0"?