Connecting to remote Dataproc master in SparkSession

Question

I created a 3 node (1 master, 2 workers) Apache Spark cluster in on Google Cloud Dataproc. I'm able to submit jobs to the cluster when connecting through ssh with the master, however I can't get it work remotely. I can't find any documentation about how to do this except for a similar issue on AWS but that isn't working for me.

Here is what I am trying

import pyspark
conf = pyspark.SparkConf().setAppName('Test').setMaster('spark://<master-node-ip>:7077')
sc = pyspark.SparkContext(conf=conf)

I get the error

19/11/13 13:33:49 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
19/11/13 13:33:53 WARN StandaloneAppClient$ClientEndpoint: Failed to connect to master <master-node-ip>:7077
org.apache.spark.SparkException: Exception thrown in awaitResult: 
        at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:226)
        at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:75)
        at org.apache.spark.rpc.RpcEnv.setupEndpointRefByURI(RpcEnv.scala:101)
        at org.apache.spark.rpc.RpcEnv.setupEndpointRef(RpcEnv.scala:109)
        at org.apache.spark.deploy.client.StandaloneAppClient$ClientEndpoint$$anonfun$tryRegisterAllMasters$1$$anon$1.run(StandaloneAppClient.scala:106)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)
Caused by: java.io.IOException: Failed to connect to /<master-node-ip>:7077
        at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:245)
        at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:187)
        at org.apache.spark.rpc.netty.NettyRpcEnv.createClient(NettyRpcEnv.scala:198)
        at org.apache.spark.rpc.netty.Outbox$$anon$1.call(Outbox.scala:194)
        at org.apache.spark.rpc.netty.Outbox$$anon$1.call(Outbox.scala:190)
        ... 4 more
Caused by: io.netty.channel.AbstractChannel$AnnotatedConnectException: Connection refused: /<master-node-ip>:7077
        at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
        at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717)
        at io.netty.channel.socket.nio.NioSocketChannel.doFinishConnect(NioSocketChannel.java:323)
        at io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:340)
        at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:633)
        at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:580)
        at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:497)
        at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:459)
        at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:858)
        at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:138)
        ... 1 more
Caused by: java.net.ConnectException: Connection refused

I added a firewall rule to allow ingress traffic on tcp:7077. But that doesn't solve it.

Utimately I would like to setup a VM on compute engine that can run this code while connecting over internal ip adresses (in a VPC I created) to run jobs on dataproc without using gcloud dataproc jobs submit. I tried it both over internal and external IP but neither are working.

Does anyone know how I can get it working?

score 6 · Accepted Answer · answered Nov 13 '19 at 21:55

So there is a few things to unpack here.

The first thing I want to make sure you understand is when exposing your distributed computing framework to ingress traffic you should be very careful. If Dataproc exposed a Spark-Standalone cluster on port 7077, you would want to make sure that you lock down that ingress traffic. Sounds like you know that by wanting a VM on a shared VPC, but this is pretty important even when testing if you open up firewalls.

The main problem it looks like you're having though is that you appear to be trying to connect as if it was a Spark-Standalone cluster. Dataproc actually uses Spark on YARN. To connect, you will need to set the Spark Cluster Manager type to "yarn" and correctly configure your local machine to talk to a remote YARN cluster, either by setting up a yarn-site.xml and having the HADOOP_CONF_DIR point to it or by directly setting YARN properties like yarn.resourcemanager.address via spark-submit --conf.

Also note this is similar to this question once you know that Dataproc uses YARN: Scala Spark connect to remote cluster

Connecting to remote Dataproc master in SparkSession

1 Answers1