
I have a Spark Standalone setup in my Docker Swarm cluster (1 manager node, 2 worker nodes). I also have a Livy container colocated with the Spark master container on the manager node.

When a Livy session is first initialized, dynamic allocation works as intended. But if the session sits idle for a few minutes and I then execute code again, it can no longer acquire more executors and stays stuck at the minimum number of executors.
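
For reference, this is roughly how the sessions are created (a minimal sketch against Livy's POST /sessions REST endpoint; the hostname, port, and executor counts are placeholders rather than my exact values):

    # Minimal sketch of session creation via Livy's REST API.
    # Hostname, port, and executor counts below are illustrative only.
    import requests

    LIVY_URL = "http://livy:8998"  # assumed service name/port inside the swarm network

    payload = {
        "kind": "pyspark",
        "conf": {
            # external shuffle service is needed for dynamic allocation on Standalone
            "spark.shuffle.service.enabled": "true",
            "spark.dynamicAllocation.enabled": "true",
            "spark.dynamicAllocation.minExecutors": "1",
            "spark.dynamicAllocation.maxExecutors": "4",
        },
    }

    resp = requests.post(f"{LIVY_URL}/sessions", json=payload)
    resp.raise_for_status()
    print(resp.json())  # returns the new session's id and state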

Inspecting the session logs in the Livy UI, I found this:

Caused by: java.io.IOException: Failed to send RPC RPC 8079653042324188410 to master/10.0.2.97:7077: java.nio.channels.ClosedChannelException
    at org.apache.spark.network.client.TransportClient$RpcChannelListener.handleFailure(TransportClient.java:362)
    at org.apache.spark.network.client.TransportClient$StdChannelListener.operationComplete(TransportClient.java:339)
    at io.netty.util.concurrent.DefaultPromise.notifyListener0(DefaultPromise.java:507)
    at io.netty.util.concurrent.DefaultPromise.notifyListenersNow(DefaultPromise.java:481)
    at io.netty.util.concurrent.DefaultPromise.notifyListeners(DefaultPromise.java:420)
    at io.netty.util.concurrent.DefaultPromise.tryFailure(DefaultPromise.java:122)
    at io.netty.channel.AbstractChannel$AbstractUnsafe.safeSetFailure(AbstractChannel.java:987)
    at io.netty.channel.AbstractChannel$AbstractUnsafe.write(AbstractChannel.java:869)
    at io.netty.channel.DefaultChannelPipeline$HeadContext.write(DefaultChannelPipeline.java:1316)
    at io.netty.channel.AbstractChannelHandlerContext.invokeWrite0(AbstractChannelHandlerContext.java:738)
    at io.netty.channel.AbstractChannelHandlerContext.invokeWrite(AbstractChannelHandlerContext.java:730)
    at io.netty.channel.AbstractChannelHandlerContext.access$1900(AbstractChannelHandlerContext.java:38)
    at io.netty.channel.AbstractChannelHandlerContext$AbstractWriteTask.write(AbstractChannelHandlerContext.java:1081)
    at io.netty.channel.AbstractChannelHandlerContext$WriteAndFlushTask.write(AbstractChannelHandlerContext.java:1128)
    at io.netty.channel.AbstractChannelHandlerContext$AbstractWriteTask.run(AbstractChannelHandlerContext.java:1070)
    at io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:163)
    at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:403)
    at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:463)
    at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:858)
    at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:138)
    ... 1 more
Caused by: java.nio.channels.ClosedChannelException
    at io.netty.channel.AbstractChannel$AbstractUnsafe.write(...)(Unknown Source)
20/05/27 06:41:41 WARN spark.ExecutorAllocationManager: Unable to reach the cluster manager to request 2 total executors!

What's weird is that when I ping master from inside the Livy container, the master hostname resolves just fine.

I am honestly lost on how to resolve this. I have already tried tweaking the networking parameters in spark-defaults.conf, but it seems I have not hit the right setting to fix it.
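
The networking-related options I have been experimenting with in spark-defaults.conf are along these lines (the keys are real Spark options, but the values are just examples of what I tried, not my exact config):

    spark.network.timeout              600s
    spark.rpc.askTimeout               600s
    spark.executor.heartbeatInterval   60s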
