There are two issues with connecting to Spark on Dataproc from outside a cluster: configuration and network access. It is generally difficult and not fully supported, so I would recommend running sparklyr inside the cluster.
Configuration
Google Cloud Dataproc runs Spark on Hadoop YARN, so for an interactive session you need to connect in yarn-client mode:
sc <- spark_connect(master = 'yarn-client')
However, you also need a yarn-site.xml in your $SPARK_HOME directory to point Spark at the right hostname.
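For example, a minimal yarn-site.xml could look like the sketch below. The hostname is an assumption (Dataproc names the master node <cluster-name>-m, so a cluster called my-cluster has a master my-cluster-m), and the simplest option is usually to copy /etc/hadoop/conf/yarn-site.xml from the master node rather than writing one by hand:
$ cat > $SPARK_HOME/conf/yarn-site.xml <<'EOF'
<configuration>
  <property>
    <!-- Assumption: cluster named "my-cluster", so the master is "my-cluster-m" -->
    <name>yarn.resourcemanager.hostname</name>
    <value>my-cluster-m</value>
  </property>
</configuration>
EOF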
Network Access
While you can open ports to your IP address using firewall rules on your Google Compute Engine network, it's not considered a good security practice. You would also need to configure YARN to use the instance's external IP address or have a way to resolve hostnames on your machine.
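For reference, such a rule would look roughly like this (not recommended, per the above; the rule name, network, and source IP are placeholders, and 8032 is YARN's default ResourceManager port):
$ gcloud compute firewall-rules create allow-yarn-rm \
    --network=default \
    --allow=tcp:8032 \
    --source-ranges=203.0.113.7/32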
Using sparklyr on Dataproc
sparklyr can be installed and run from the R REPL by SSHing into the master node and running:
$ # System library needed to build the 'curl' R package (a sparklyr dependency)
$ sudo apt-get install -y libcurl4-openssl-dev
$ R
> install.packages('sparklyr')
> library(sparklyr)
> sc <- spark_connect(master = 'yarn-client')
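Once connected, the usual sparklyr/dplyr workflow applies; a quick smoke test using the built-in mtcars data (just a sketch):
> library(dplyr)
> # Copy a local data frame into Spark, then aggregate it on the cluster
> mtcars_tbl <- copy_to(sc, mtcars, 'mtcars')
> mtcars_tbl %>% group_by(cyl) %>% summarise(avg_mpg = mean(mpg))
> spark_disconnect(sc)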
I believe RStudio Server supports SOCKS proxies, which can be set up as described here, but I am not very familiar with RStudio.
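If you go that route, the proxy itself can be opened with SSH dynamic port forwarding to the master node; a sketch, where the cluster name, zone, and local port are assumptions:
$ # Open a SOCKS proxy on localhost:1080 through the Dataproc master
$ gcloud compute ssh my-cluster-m --zone=us-central1-a -- -D 1080 -N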
I use Apache Zeppelin on Dataproc for R notebooks, but it autoloads SparkR, which I don't think plays well with sparklyr at this time.