I have a Spark Standalone setup in my Docker Swarm cluster (1 manager node, 2 worker nodes). I also have a Livy container that is co-located with the Spark master container on the manager node.
When a Livy session is first initialized, dynamic allocation works as intended. But if the session sits idle for a few minutes and I then execute code again, it can no longer acquire additional executors and stays stuck at the minimum number of executors.
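For context, dynamic allocation is enabled through spark-defaults.conf along these lines (the exact values below are illustrative, not necessarily the ones in my setup):

spark.dynamicAllocation.enabled              true
spark.shuffle.service.enabled                true
spark.dynamicAllocation.minExecutors         1
spark.dynamicAllocation.maxExecutors         4
spark.dynamicAllocation.executorIdleTimeout  60s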
Inspecting the session logs in the Livy UI, I found this:
Caused by: java.io.IOException: Failed to send RPC RPC 8079653042324188410 to master/10.0.2.97:7077: java.nio.channels.ClosedChannelException
at org.apache.spark.network.client.TransportClient$RpcChannelListener.handleFailure(TransportClient.java:362)
at org.apache.spark.network.client.TransportClient$StdChannelListener.operationComplete(TransportClient.java:339)
at io.netty.util.concurrent.DefaultPromise.notifyListener0(DefaultPromise.java:507)
at io.netty.util.concurrent.DefaultPromise.notifyListenersNow(DefaultPromise.java:481)
at io.netty.util.concurrent.DefaultPromise.notifyListeners(DefaultPromise.java:420)
at io.netty.util.concurrent.DefaultPromise.tryFailure(DefaultPromise.java:122)
at io.netty.channel.AbstractChannel$AbstractUnsafe.safeSetFailure(AbstractChannel.java:987)
at io.netty.channel.AbstractChannel$AbstractUnsafe.write(AbstractChannel.java:869)
at io.netty.channel.DefaultChannelPipeline$HeadContext.write(DefaultChannelPipeline.java:1316)
at io.netty.channel.AbstractChannelHandlerContext.invokeWrite0(AbstractChannelHandlerContext.java:738)
at io.netty.channel.AbstractChannelHandlerContext.invokeWrite(AbstractChannelHandlerContext.java:730)
at io.netty.channel.AbstractChannelHandlerContext.access$1900(AbstractChannelHandlerContext.java:38)
at io.netty.channel.AbstractChannelHandlerContext$AbstractWriteTask.write(AbstractChannelHandlerContext.java:1081)
at io.netty.channel.AbstractChannelHandlerContext$WriteAndFlushTask.write(AbstractChannelHandlerContext.java:1128)
at io.netty.channel.AbstractChannelHandlerContext$AbstractWriteTask.run(AbstractChannelHandlerContext.java:1070)
at io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:163)
at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:403)
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:463)
at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:858)
at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:138)
... 1 more
Caused by: java.nio.channels.ClosedChannelException
at io.netty.channel.AbstractChannel$AbstractUnsafe.write(...)(Unknown Source)
20/05/27 06:41:41 WARN spark.ExecutorAllocationManager: Unable to reach the cluster manager to request 2 total executors!
What's weird is that when I ping master from inside the Livy container, the master hostname resolves just fine.
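The check was roughly the following (the container name is a placeholder):

docker exec -it <livy-container> ping -c 3 master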
I am lost on how to resolve this. I have already tried tweaking the networking parameters in spark-defaults.conf, but it seems I have not found the right setting yet.
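For reference, these are the kinds of settings I have been experimenting with (values are examples, not necessarily what I currently have set):

spark.network.timeout            600s
spark.executor.heartbeatInterval 60s
spark.rpc.askTimeout             600s
spark.rpc.lookupTimeout          600s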