There are two issues with connecting to Spark on Dataproc from outside a cluster: configuration and network access. It is generally difficult and not fully supported, so I would recommend running sparklyr inside the cluster.
Configuration
Google Cloud Dataproc runs Spark on Hadoop YARN, so for an interactive session you need to connect in yarn-client mode:
sc <- spark_connect(master = 'yarn-client')
However, you also need a yarn-site.xml in your $SPARK_HOME directory to point Spark at the right hostname.
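For example, a minimal yarn-site.xml could look like the sketch below. The hostname is an assumption (Dataproc names the master node <cluster-name>-m, so a cluster called my-cluster has a master my-cluster-m), and the simplest option is usually to copy /etc/hadoop/conf/yarn-site.xml from the master node rather than writing one by hand:
$ cat > $SPARK_HOME/conf/yarn-site.xml <<'EOF'
<configuration>
  <property>
    <!-- Assumption: cluster named "my-cluster", so the master is "my-cluster-m" -->
    <name>yarn.resourcemanager.hostname</name>
    <value>my-cluster-m</value>
  </property>
</configuration>
EOF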
Network Access
While you can open ports to your IP address using firewall rules on your Google Compute Engine network, it's not considered a good security practice. You would also need to configure YARN to use the instance's external IP address or have a way to resolve hostnames on your machine.
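For reference, such a rule would look roughly like this (not recommended, per the above; the rule name, network, and source IP are placeholders, and 8032 is YARN's default ResourceManager port):
$ gcloud compute firewall-rules create allow-yarn-rm \
    --network=default \
    --allow=tcp:8032 \
    --source-ranges=203.0.113.7/32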
Using sparklyr on Dataproc
sparklyr can be installed and run from the R REPL by SSHing into the master node and running:
$ # System library needed to build the 'curl' R package (a sparklyr dependency)
$ sudo apt-get install -y libcurl4-openssl-dev
$ R
> install.packages('sparklyr')
> library(sparklyr)
> sc <- spark_connect(master = 'yarn-client')
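Once connected, the usual sparklyr/dplyr workflow applies; a quick smoke test using the built-in mtcars data (just a sketch):
> library(dplyr)
> # Copy a local data frame into Spark, then aggregate it on the cluster
> mtcars_tbl <- copy_to(sc, mtcars, 'mtcars')
> mtcars_tbl %>% group_by(cyl) %>% summarise(avg_mpg = mean(mpg))
> spark_disconnect(sc)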
I believe RStudio Server supports SOCKS proxies, which can be set up as described here, but I am not very familiar with RStudio.
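If you go that route, the proxy itself can be opened with SSH dynamic port forwarding to the master node; a sketch, where the cluster name, zone, and local port are assumptions:
$ # Open a SOCKS proxy on localhost:1080 through the Dataproc master
$ gcloud compute ssh my-cluster-m --zone=us-central1-a -- -D 1080 -N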
I use Apache Zeppelin on Dataproc for R notebooks, but it autoloads SparkR, which I don't think plays well with sparklyr at this time.