
I'm trying to connect my RStudio Server to my DSE Analytics cluster.

The setup:

  • CentOS 7
  • openjdk-1.8
  • RStudio Server v1.0.136 (with the latest sparklyr, installed via `devtools::install_github("rstudio/sparklyr")`)
  • DSE 5.0 (spark 1.6.2)
  • a 5-node DSE Analytics DC (the cluster also has a second DC used for OLTP)
  • RStudio Server on a standalone VM with DSE Analytics installed

Since, unlike the sparklyr tutorial, I'm bringing my own (DSE's) Spark, neither SPARK_HOME nor JAVA_HOME was set. So:

> Sys.setenv(JAVA_HOME = '/usr/lib/jvm/java-1.8.0-openjdk-1.8.0.121-0.b13.el7_3.x86_64')  
> Sys.setenv(SPARK_HOME = '/usr/share/dse/spark/')
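As a sanity check on that SPARK_HOME (a sketch; the DSE path is just the one from my setup above), the stock spark-submit on Spark 1.x looks for a spark-assembly*.jar under $SPARK_HOME/lib:

```shell
# On Spark 1.x, the plain spark-submit script expects a spark-assembly*.jar
# under $SPARK_HOME/lib; if it's missing you get the
# "Failed to find Spark assembly" error shown further down.
check_spark_home() {
  dir="$1"
  if ls "$dir"/lib/spark-assembly*.jar >/dev/null 2>&1; then
    echo "assembly found in $dir/lib"
  else
    echo "no assembly jar in $dir/lib"
  fi
}
check_spark_home "${SPARK_HOME:-/usr/share/dse/spark}"
```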

My config.yml (I found the example here):

spark.cassandra.connection.host: <IP of one node>
spark.cassandra.auth.username: cassandra
spark.cassandra.auth.password: <PW>

sparklyr.defaultPackages:
- com.databricks:spark-csv_2.11:1.3.0
- com.datastax.spark:spark-cassandra-connector_2.11:2.0.0-M1
- com.datastax.cassandra:cassandra-driver-core:3.0.2
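For completeness, this is roughly how I load that file into the connection (a sketch; note I'm not certain whether the master URL needs the port, 7077 being the standalone default):

```r
library(sparklyr)

# merge config.yml over sparklyr's built-in defaults
config <- spark_config(file = "config.yml")

# standalone master URLs normally include the port (7077 by default)
sc <- spark_connect(master  = "spark://<IP of one node>:7077",
                    config  = config,
                    version = "1.6.2")
```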

My session info:

> devtools::session_info()
Session info --------------------------
 setting  value                       
 version  R version 3.3.2 (2016-10-31)
 system   x86_64, linux-gnu           
 ui       RStudio (1.0.136)           
 language (EN)                        
 collate  en_US.UTF-8                 
 tz       America/Mexico_City         
 date     2017-02-02                  

Packages ----------------------------------------
 package    * version    date       source                           
 assertthat   0.1        2013-12-06 CRAN (R 3.3.2)                   
 backports    1.0.5      2017-01-18 CRAN (R 3.3.2)                   
 base64enc    0.1-3      2015-07-28 CRAN (R 3.3.2)                   
 config       0.2        2016-08-02 CRAN (R 3.3.2)                   
 curl         2.3        2016-11-24 CRAN (R 3.3.2)                   
 DBI          0.5-1      2016-09-10 CRAN (R 3.3.2)                   
 devtools     1.12.0     2016-12-05 CRAN (R 3.3.2)                   
 digest       0.6.12     2017-01-27 CRAN (R 3.3.2)                   
 dplyr        0.5.0      2016-06-24 CRAN (R 3.3.2)                   
 git2r        0.18.0     2017-01-01 CRAN (R 3.3.2)                   
 htmltools    0.3.5      2016-03-21 cran (@0.3.5)                    
 httpuv       1.3.3      2015-08-04 cran (@1.3.3)                    
 httr         1.2.1      2016-07-03 CRAN (R 3.3.2)                   
 jsonlite     1.2        2016-12-31 CRAN (R 3.3.2)                   
 magrittr     1.5        2014-11-22 CRAN (R 3.3.2)                   
 memoise      1.0.0      2016-01-29 CRAN (R 3.3.2)                   
 mime         0.5        2016-07-07 CRAN (R 3.3.2)                   
 packrat      0.4.8-1    2016-09-07 CRAN (R 3.3.2)                   
 R6           2.2.0      2016-10-05 CRAN (R 3.3.2)                   
 Rcpp         0.12.9     2017-01-14 CRAN (R 3.3.2)                   
 rprojroot    1.2        2017-01-16 CRAN (R 3.3.2)                   
 rstudioapi   0.6        2016-06-27 CRAN (R 3.3.2)                   
 shiny        1.0.0      2017-01-12 cran (@1.0.0)                    
 sparklyr   * 0.5.3-9000 2017-02-02 Github (rstudio/sparklyr@bd4aee0)
 tibble       1.2        2016-08-26 CRAN (R 3.3.2)                   
 withr        1.0.2      2016-06-20 CRAN (R 3.3.2)                   
 xtable       1.8-2      2016-02-05 cran (@1.8-2)                    
 yaml         2.1.14     2016-11-12 CRAN (R 3.3.2)  

Now, when I try to generate the spark context, this is what I get:

> sc <- spark_connect(master = "spark://<IP of one node>", config = spark_config(file = "config.yml"), version = "1.6.2")  
Error in force(code) : 
  Failed while connecting to sparklyr to port (8880) for sessionid (646): Gateway in port (8880) did not respond.
    Path: /usr/share/dse/spark/bin/spark-submit
    Parameters: --class, sparklyr.Backend, --jars, '/home/emiliano/rprojects/sparklyr_test/packrat/lib/x86_64-redhat-linux-gnu/3.3.2/sparklyr/java/spark-csv_2.11-1.3.0.jar','/home/emiliano/rprojects/sparklyr_test/packrat/lib/x86_64-redhat-linux-gnu/3.3.2/sparklyr/java/commons-csv-1.1.jar','/home/emiliano/rprojects/sparklyr_test/packrat/lib/x86_64-redhat-linux-gnu/3.3.2/sparklyr/java/univocity-parsers-1.5.1.jar', '/home/emiliano/rprojects/sparklyr_test/packrat/lib/x86_64-redhat-linux-gnu/3.3.2/sparklyr/java/sparklyr-1.6-2.10.jar', 8880, 646


---- Output Log ----
Failed to find Spark assembly in /usr/share/dse/spark/lib.
You need to build Spark before running this program.

---- Error Log ----

From this output, my guess is that sparklyr is not recognizing DSE Analytics' Spark. As I understand it, DSE's Spark is deeply integrated with Cassandra through its connector; it even has its own `dse spark-submit`. I'm sure I'm passing the wrong configs to sparklyr, I'm just lost as to what to pass. Any help is welcome. Thank you.
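One workaround I'm considering (purely a sketch; the wrapper path below is hypothetical, and you'd back up the original $SPARK_HOME/bin/spark-submit before replacing it) is to make sparklyr's spark-submit call go through `dse spark-submit`, since that launcher resolves the assembly and the Cassandra-integrated classpath itself:

```shell
# Sketch of a wrapper: sparklyr invokes $SPARK_HOME/bin/spark-submit directly,
# while DSE expects jobs to be launched via `dse spark-submit`. The path here
# is illustrative only -- it would have to end up replacing
# $SPARK_HOME/bin/spark-submit for sparklyr to pick it up.
cat > /tmp/spark-submit-wrapper <<'EOF'
#!/bin/sh
# delegate everything to DSE's own launcher
exec dse spark-submit "$@"
EOF
chmod +x /tmp/spark-submit-wrapper
```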

Edit: I obviously hit the same error with `sc <- spark_connect(master = "local")`

  • I don't believe remote connections are supported by RStudio at this time. https://github.com/rstudio/sparklyr/issues/299. I know you said you tried local, but was RStudio on the same instance running DSE? – peytoncas Feb 07 '17 at 23:19
  • Thank you @peytoncas RStudio Server lives in a server with DSE installed. I can even connect to the spark shell (`$ dse spark`) on that local server. So, I should at least be able to connect locally. – Mematematica Feb 07 '17 at 23:37
