
I am running a Kubernetes cluster in which I have enabled the Istio and Linkerd service meshes on separate occasions.

When I try to deploy a Spark Standalone cluster in which the Spark Master and each worker run in separate pods, the workers cannot connect to the Spark Master.

From a worker, it is possible to curl the Spark Master UI over a service (the request passes through the sidecars). However, launching a Spark worker that connects to the Master fails.

Here is the service manifest:

apiVersion: v1
kind: Service
metadata:
  labels:
    app: sparkmaster
  name: spark-submit2
  namespace: spark
spec:
  ports:
  - port: 7077
    protocol: TCP
    targetPort: 7077
  selector:
    app: sparkmaster
  type: ClusterIP

And when I run, for instance:

sparkuser@sparkslave-6897c9cdd7-frcgs:~$ /opt/spark-2.4.4-bin-hadoop2.7/sbin/start-slave.sh spark://spark-submit2:7077

I get the error shown below.

What is the proper solution to this problem?

Note: if I follow the exact same procedure in a namespace without a service mesh enabled, it works.

20/02/13 15:05:55 INFO Worker: Connecting to master spark-submit2:7077...
20/02/13 15:05:55 INFO TransportClientFactory: Successfully created connection to spark-submit2/10.96.121.191:7077 after 259 ms (0 ms spent in bootstraps)
20/02/13 15:05:55 ERROR TransportResponseHandler: Still have 1 requests outstanding when connection from spark-submit2/10.96.121.191:7077 is closed
20/02/13 15:05:55 WARN Worker: Failed to connect to master spark-submit2:7077
org.apache.spark.SparkException: Exception thrown in awaitResult: 
    at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:226)
    at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:75)
    at org.apache.spark.rpc.RpcEnv.setupEndpointRefByURI(RpcEnv.scala:101)
    at org.apache.spark.rpc.RpcEnv.setupEndpointRef(RpcEnv.scala:109)
    at org.apache.spark.deploy.worker.Worker$$anonfun$org$apache$spark$deploy$worker$Worker$$tryRegisterAllMasters$1$$anon$1.run(Worker.scala:253)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
Caused by: java.io.IOException: Connection from spark-submit2/10.96.121.191:7077 closed
    at org.apache.spark.network.client.TransportResponseHandler.channelInactive(TransportResponseHandler.java:146)
    at org.apache.spark.network.server.TransportChannelHandler.channelInactive(TransportChannelHandler.java:108)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:245)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:231)
    at io.netty.channel.AbstractChannelHandlerContext.fireChannelInactive(AbstractChannelHandlerContext.java:224)
    at io.netty.channel.ChannelInboundHandlerAdapter.channelInactive(ChannelInboundHandlerAdapter.java:75)
    at io.netty.handler.timeout.IdleStateHandler.channelInactive(IdleStateHandler.java:277)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:245)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:231)
    at io.netty.channel.AbstractChannelHandlerContext.fireChannelInactive(AbstractChannelHandlerContext.java:224)
    at io.netty.channel.ChannelInboundHandlerAdapter.channelInactive(ChannelInboundHandlerAdapter.java:75)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:245)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:231)
    at io.netty.channel.AbstractChannelHandlerContext.fireChannelInactive(AbstractChannelHandlerContext.java:224)
    at io.netty.channel.ChannelInboundHandlerAdapter.channelInactive(ChannelInboundHandlerAdapter.java:75)
    at org.apache.spark.network.util.TransportFrameDecoder.channelInactive(TransportFrameDecoder.java:182)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:245)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:231)
    at io.netty.channel.AbstractChannelHandlerContext.fireChannelInactive(AbstractChannelHandlerContext.java:224)
    at io.netty.channel.DefaultChannelPipeline$HeadContext.channelInactive(DefaultChannelPipeline.java:1354)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:245)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:231)
    at io.netty.channel.DefaultChannelPipeline.fireChannelInactive(DefaultChannelPipeline.java:917)
    at io.netty.channel.AbstractChannel$AbstractUnsafe$8.run(AbstractChannel.java:822)
    at io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:163)
    at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:403)
    at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:463)
    at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:858)
    at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:138)
    ... 1 more
toerq
  • Do you know what protocol Spark uses? I suspect that it's a custom protocol that neither service mesh recognizes. For Linkerd you can use the `--skip-outbound-ports` and `--skip-inbound-ports` configurations to make sure that the proxy doesn't handle the request. https://linkerd.io/2/features/protocol-detection/#configuring-protocol-detection – cpretzer Dec 09 '20 at 05:48
  • Any solution to this? I have tried everything but am still facing the same issue. – sp_user123 Mar 24 '23 at 12:26
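
Following the Linkerd suggestion in the comment above, one way to keep the proxy from intercepting the Spark RPC traffic is to skip the port at injection time. A minimal sketch, assuming the master/worker Deployment manifests live in spark-deployments.yaml (the file name is a placeholder) and that the Standalone RPC port is 7077:

# Tell the Linkerd proxy to ignore the Spark RPC port in both directions,
# then re-apply the injected manifests.
linkerd inject \
  --skip-inbound-ports 7077 \
  --skip-outbound-ports 7077 \
  spark-deployments.yaml | kubectl apply -f -

The same effect can also be achieved by setting the config.linkerd.io/skip-inbound-ports and config.linkerd.io/skip-outbound-ports annotations on the pod templates.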

1 Answer


A service mesh like Istio requires the application to bind to 0.0.0.0, so it is not possible to run a Spark app in cluster mode unless you exclude the Spark inbound/outbound ports from the sidecar in the configuration.

spark.kubernetes.executor.annotation.traffic.sidecar.istio.io/excludeOutboundPorts=7078,7079
spark.kubernetes.driver.annotation.traffic.sidecar.istio.io/excludeInboundPorts=7078,7079
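
These two settings apply to Spark on Kubernetes (spark-submit against the Kubernetes API). For the Standalone Deployments in the question, the equivalent is to put the Istio traffic annotations directly on the pod templates. A minimal sketch, assuming the master runs in a Deployment named sparkmaster (name, labels and image are illustrative) and uses the Standalone RPC port 7077:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: sparkmaster            # illustrative name
  namespace: spark
spec:
  selector:
    matchLabels:
      app: sparkmaster
  template:
    metadata:
      labels:
        app: sparkmaster
      annotations:
        # Keep the Spark RPC port out of the Envoy sidecar in both directions.
        traffic.sidecar.istio.io/excludeInboundPorts: "7077"
        traffic.sidecar.istio.io/excludeOutboundPorts: "7077"
    spec:
      containers:
      - name: master
        image: my-spark:2.4.4   # illustrative image
        ports:
        - containerPort: 7077

The worker Deployment would carry the same annotations so that its outbound connection to port 7077 also bypasses the proxy.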

Another option is to use Spark client mode and set spark.driver.bindAddress=0.0.0.0. Otherwise, wait for the service mesh to support binding to the pod IP.
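
A minimal sketch of the client-mode alternative, assuming the job is submitted from a pod inside the mesh and that the application jar path is a placeholder:

# Client mode: the driver runs in the submitting pod and binds to all interfaces,
# which satisfies the sidecar's expectation of a 0.0.0.0 bind address.
/opt/spark-2.4.4-bin-hadoop2.7/bin/spark-submit \
  --master spark://spark-submit2:7077 \
  --deploy-mode client \
  --conf spark.driver.bindAddress=0.0.0.0 \
  --conf spark.driver.host=$(hostname -i) \
  /path/to/app.jar

Here spark.driver.host is set to the pod IP so that executors can reach the driver; whether it is needed depends on how the driver pod is addressed in your cluster.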