I have set up a standalone Ubuntu server with Spark 2.2.0 up and running.
My aim is to allow several users (clients) to connect to this server and, from RStudio on their own computers, develop code locally that gets executed on Spark.
So I installed Livy on my server (it is up and running as well), which allows me to connect to it from RStudio:
config <- livy_config(username = "me", password = "***")
sc <- spark_connect(master = "http://myserver:8998", method = "livy", config = config)
RStudio then reports that I am connected.
From this, I have a few questions:
Can I develop locally in RStudio and push all the processing to Spark (e.g. manipulate a data frame and run some machine learning)? If yes, how? Do I have to use sparklyr functions directly? Do I need a local Spark installation to test my code before sending it to the Spark cluster on my remote server?
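For instance, here is the kind of workflow I have in mind (a minimal sketch, assuming the Livy connection sc from above; the table name "iris_spark" and the model call are only illustrative):

library(sparklyr)
library(dplyr)

# copy a small local data frame to Spark over the Livy connection
# (copy_to replaces the dots in iris's column names with underscores)
iris_tbl <- copy_to(sc, iris, name = "iris_spark", overwrite = TRUE)

# dplyr verbs on the Spark table are translated to Spark SQL and run remotely
setosa_summary <- iris_tbl %>%
  filter(Species == "setosa") %>%
  summarise(avg_petal = mean(Petal_Length)) %>%
  collect()

# fit a model on the cluster with one of sparklyr's ml_* functions
fit <- ml_linear_regression(iris_tbl, Petal_Length ~ Sepal_Length + Sepal_Width)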
When I use the copy_to() function with the iris data frame, it takes approximately one minute. Can I conclude that my connection is too slow to consider developing locally and sending all the processing to my server?
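The timing is roughly what I see with something like this (again just a sketch, reusing the same copy_to() call as above):

# time the transfer of the (tiny) iris data set to the remote cluster
system.time(
  copy_to(sc, iris, name = "iris_spark", overwrite = TRUE)
)
# elapsed time is about 60 seconds over the Livy connection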
It is not possible to use RStudio directly on my server (we only access it through the command line), and several people will be developing at the same time. What would be the best solution to develop easily?
Finally, I'm facing a simple issue: if the best solution is to develop our apps locally, send them to my server via ssh, and execute them directly on the server, how can I run them? I already tried to package a simple R script into a .jar file and run spark_submit, but I got a class-not-found error (no main program found). How can I do this?