I'm trying to follow a tutorial for using Spark from RStudio on DSX, but I'm running into the following error:

> library(sparklyr)
> sc <- spark_connect(master = "CS-DSX")
Error in spark_version_from_home(spark_home, default = spark_version) : 
  Failed to detect version from SPARK_HOME or SPARK_HOME_VERSION. Try passing the spark version explicitly.

I took the above code snippet from the Connect to Spark dialog in RStudio:

[Screenshot: the RStudio Connect to Spark dialog]
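Presumably the error wants a version argument, something like the following untested sketch (I'm guessing at the service's Spark version here):

# untested sketch: pass the Spark version explicitly, as the error suggests
# ("2.0.2" is a guess and would need to match the remote service's version)
sc <- spark_connect(master = "CS-DSX", version = "2.0.2")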

So I took a look at SPARK_HOME:

> Sys.getenv("SPARK_HOME")
[1] "/opt/spark"

OK, let's check that the directory exists:

> dir("/opt")
[1] "ibm"

The /opt/spark directory doesn't exist, so I'm guessing this is the cause of the problem?


NOTE: there are a few similar questions on Stack Overflow, but none of them are about IBM's Data Science Experience (DSX).

Update 1:

I tried the following:

> sc <- spark_connect(config = "CS-DSX")
Error in config$spark.master : $ operator is invalid for atomic vectors
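For what it's worth, this looks like the error the CRAN build of sparklyr raises when config is given a character string rather than a list from spark_config(). A sketch of the CRAN-style call, assuming sparklyr reads config.yml through the config package and its R_CONFIG_ACTIVE convention:

# assumption: CRAN sparklyr expects a spark_config() list, with the named
# section selected through the R_CONFIG_ACTIVE environment variable
Sys.setenv(R_CONFIG_ACTIVE = "CS-DSX")
sc <- spark_connect(master = "spark.bluemix.net",
                    config = spark_config(file = "config.yml"))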

Update 2:

An extract from my config.yml. Note that I have many more Spark services in my config; I've pasted only the first one:

default:
    method: "shell"

CS-DSX:
    method: "bluemix"
    spark.master: "spark.bluemix.net"
    spark.instance.id: "7a4089bf-3594-4fdf-8dd1-7e9fd7607be5"
    tenant.id: "sdd1-7e9fd7607be53e-39ca506ba762"
    tenant.secret: "xxxxxx"
    hsui.url: "https://cdsx.ng.bluemix.net"

Note that my config.yml was generated for me.
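One way to sanity-check that a named section resolves correctly is to read it directly with the config package (a diagnostic sketch, assuming config.yml sits in the working directory):

# diagnostic sketch: inspect what the "CS-DSX" section resolves to;
# named sections inherit from "default" in the config package
library(config)
str(config::get(config = "CS-DSX", file = "config.yml"))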

Update 3:

My .Rprofile looks like this:

# load sparklyr library
library(sparklyr)

# setup SPARK_HOME
if (nchar(Sys.getenv("SPARK_HOME")) < 1) {
  Sys.setenv(SPARK_HOME = "/opt/spark")
}

# setup SparkaaS instances
options(rstudio.spark.connections = c("CS-DSX","newspark","cleantest","4jan2017","Apache Spark-4l","Apache Spark-3a","ML SPAAS","Apache Spark-y9","Apache Spark-a8"))

Note that my .Rprofile was generated for me.

Update 4:

I uninstalled sparklyr and restarted the session twice. Next I tried to run:

library(sparklyr)
library(dplyr)
sc <- spark_connect(config = "CS-DSX")

However, the above command hung. I stopped the command and checked the version of sparklyr, which seems to be OK:

> ip <- installed.packages()
> ip[ rownames(ip) == "sparklyr", c(0,1,3) ]
   Package    Version 
"sparklyr"   "0.4.36" 
  • There are two issues: 1. The DSX RStudio-generated code is supposed to use config = when connecting to a Bluemix Spark service; I have raised a defect for this. 2. For the failure to connect using config, can you share the contents of your config.yml file? Please mask the tenant secret. – charles gomes Feb 16 '17 at 21:35
  • You're trying to connect to a remote Spark service, so the local value of SPARK_HOME is meaningless. Apparently, some library still tries to guess the Spark version from SPARK_HOME. Possible courses of action: 1. Find out how to specify the Spark version in the config. 2. Locally set SPARK_HOME to a value that indicates the remote Spark version. The directory doesn't have to exist locally. – Roland Weber Feb 17 '17 at 10:04
  • Strangely, the .Rprofile that was generated for me is setting SPARK_HOME. – Chris Snow Feb 17 '17 at 12:23
  • I can reproduce your issue when I install the sparklyr package using install.packages("sparklyr"), which installs 0.5.2 from CRAN and overrides the default installation of sparklyr (0.4.36): > packageVersion("sparklyr") [1] ‘0.4.36’. Please remove your sparklyr version with remove.packages("sparklyr") and restart RStudio twice so it reinitializes the package back to the default sparklyr. – charles gomes Feb 19 '17 at 08:57
  • Ok, that got me slightly further. I've updated the question (update 4). – Chris Snow Feb 20 '17 at 10:38

2 Answers

You cannot use the master parameter to connect to the Bluemix Spark service, if that is the intent. Since your kernels are defined in the config.yml file, you should use the config parameter instead.

config.yml is loaded with your available kernel information (Spark instances):

Apache Spark-ic:
    method: "bluemix"
    spark.master: "spark.bluemix.net"
    spark.instance.id: "41a2e5e9xxxxxx47ef-97b4-b98406426c07"
    tenant.id: "s7b4-b9xxxxxxxx7e8-2c631c8ff999"
    tenant.secret: "XXXXXXXXXX"
    hsui.url: "https://cdsx.ng.bluemix.net"

Please connect using the config parameter: sc <- spark_connect(config = "Apache Spark-ic")

as suggested in the tutorial: http://datascience.ibm.com/blog/access-ibm-analytics-for-apache-spark-from-rstudio/

FYI, by default you are connected to Spark 2.0.2; I am working on finding out how to change the version with the config parameter.

> version <- invoke(spark_context(sc), "version")
> print(version)
[1] "2.0.2"

Thanks, Charles.

  • Unfortunately that didn't work for me. I have updated the question with the output from spark_connect. I have also added a screenshot of the RStudio Spark connection dialog, which is where I took my code from. Shouldn't we fix this integration if it is leading users in the wrong direction? – Chris Snow Feb 16 '17 at 19:33

I had the same issue and fixed it as follows:

  1. Go to C:\Users\USER_NAME\AppData\Local\spark\ and delete everything in the directory (see the sketch after the code for a portable way to locate it).
  2. Then, in the R console, run:
# reinstall the packages if missing, then reinstall Spark locally
if (!require(shiny)) install.packages("shiny")
library(shiny)
if (!require(sparklyr)) install.packages("sparklyr")
library(sparklyr)
spark_install()
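A portable variant of the same cleanup, sketched under the assumption that pinning the Spark version is desirable here (2.0.2 matches the service version reported in the other answer):

library(sparklyr)
spark_install_dir()               # reports the local install directory deleted in step 1
spark_install(version = "2.0.2")  # pinning the version is an assumption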