
I'm trying to connect R to Spark using sparklyr.

I followed the tutorial from the RStudio blog.

I tried installing sparklyr using

  • install.packages("sparklyr"), which went fine, but in another post I saw that there was a bug in the sparklyr 0.4 version, so I followed the instructions to install the dev version using

  • devtools::install_github("rstudio/sparklyr"), which also went fine; my sparklyr version is now sparklyr_0.4.16.
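For reference, here is the full sequence with a version check using base R's packageVersion() (the commented output is what I see; yours may differ):

    # Install the CRAN release, then the development version (requires devtools):
    install.packages("sparklyr")
    devtools::install_github("rstudio/sparklyr")

    # Confirm which version is installed:
    packageVersion("sparklyr")
    #> [1] '0.4.16'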

I followed the RStudio tutorial to download and install Spark using

spark_install(version = "1.6.2")
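As a sanity check, sparklyr's spark_installed_versions() lists the locally installed Spark distributions, so the download can be verified before connecting (a minimal sketch):

    library(sparklyr)

    # List locally installed Spark versions to confirm the download succeeded:
    spark_installed_versions()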

When I first tried to connect to Spark using

sc <- spark_connect(master = "local")

I got the following error:

Created default hadoop bin directory under: C:\Users\rkaku\AppData\Local\rstudio\spark\Cache\spark-1.6.2-bin-hadoop2.6\tmp\hadoop
Error: 
To run Spark on Windows you need a copy of Hadoop winutils.exe:
1. Download Hadoop winutils.exe from:
   https://github.com/steveloughran/winutils/raw/master/hadoop-2.6.0/bin/
2. Copy winutils.exe to C:\Users\rkaku\AppData\Local\rstudio\spark\Cache\spark-1.6.2-bin-hadoop2.6\tmp\hadoop\bin
Alternatively, if you are using RStudio you can install the RStudio Preview Release,
which includes an embedded copy of Hadoop winutils.exe:
  https://www.rstudio.com/products/rstudio/download/preview/

I then downloaded winutils.exe and placed it in C:\Users\rkaku\AppData\Local\rstudio\spark\Cache\spark-1.6.2-bin-hadoop2.6\tmp\hadoop\bin, as given in the instructions.
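To double-check that the copy landed where the message asked, the path can be rebuilt and tested from R (a small sketch; the path is the one from the error message above):

    # Verify winutils.exe is in the location the error message requested:
    winutils <- file.path(Sys.getenv("LOCALAPPDATA"),
                          "rstudio", "spark", "Cache",
                          "spark-1.6.2-bin-hadoop2.6", "tmp", "hadoop", "bin",
                          "winutils.exe")
    file.exists(winutils)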

I then tried connecting to Spark again:

sc <- spark_connect(master = "local", version = "1.6.2")

but I got the following error:

Error in force(code) : 
Failed while connecting to sparklyr to port (8880) for sessionid (8982): Gateway in port (8880) did not respond.
Path: C:\Users\rkaku\AppData\Local\rstudio\spark\Cache\spark-1.6.2-bin-hadoop2.6\bin\spark-submit2.cmd
Parameters: --class, sparklyr.Backend, --packages, "com.databricks:spark-csv_2.11:1.3.0", "C:\Users\rkaku\Documents\R\R-3.2.3\library\sparklyr\java\sparklyr-1.6-2.10.jar", 8880, 8982
Traceback:
  shell_connection(master = master, spark_home = spark_home, app_name = app_name, version = version, hadoop_version = hadoop_version, shell_args = shell_args, config = config, service = FALSE, extensions = extensions)
  start_shell(master = master, spark_home = spark_home, spark_version = version, app_name = app_name, config = config, jars = spark_config_value(config, "spark.jars.default", list()), packages = spark_config_value(config, "sparklyr.defaultPackages"), extensions = extensions, environment = environment, shell_args = shell_args, service = service)
  tryCatch({
gatewayInfo <- spark_connect_gateway(gatewayAddress, gatewayPort, sessionId, config = config, isStarting = TRUE)
}, error = function(e) {
abort_shell(paste("Failed while connecting to sparklyr to port (", gatewayPort, ") for sessionid (", sessionId, "): ", e$message, sep = ""), spark_submit_path, shell_args, output_file, error_file)
})
  tryCatchList(expr, classes, parentenv, handlers)
  tryCatchOne(expr, names, parentenv, handlers[[1]])
  value[[3]](cond)
  abort_shell(paste("Failed while connecting to sparklyr to port (", gatewayPort, ") for sessionid (", sessionId, "): ", e$message, sep = ""), spark_submit_path, shell_args, output_file, error_file)

---- Output Log ----
The system cannot find the path specified.
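That last line suggests something spark-submit needs is missing from disk. A quick way to rule out the obvious causes (a sketch; the Spark path is taken from the error above, and the Java checks are an assumption about a common cause of this message):

    # Does the spark-submit script named in the error actually exist?
    spark_home <- file.path(Sys.getenv("LOCALAPPDATA"), "rstudio", "spark",
                            "Cache", "spark-1.6.2-bin-hadoop2.6")
    file.exists(file.path(spark_home, "bin", "spark-submit2.cmd"))

    # Is Java visible? spark-submit needs a working Java installation.
    Sys.which("java")
    Sys.getenv("JAVA_HOME")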

Can somebody please help me solve this issue? I've been sitting on it for the past two weeks without much help, and I'd really appreciate anyone who could help me resolve it.

  • I'd omit the first two paragraphs -- you don't need to apologize for asking a question. It seems like you solved the first problem on your own -- you at least got past the error about winutils being required, so I'm not sure that that's really relevant at this point. Focus on the thing that you're trying to solve, i.e. the second error. – Caleb Oct 17 '16 at 02:40
  • @Caleb: Thanks for reviewing my question. I will remove my initial comments. – Rakesh Kumar Oct 17 '16 at 02:44

2 Answers


I finally figured out the issue, and I'm really happy that I could do it all by myself, obviously with a lot of googling.

The issue was with winutils.exe.

RStudio does not give the correct location to place winutils.exe. Copying from my question, the suggested location was C:\Users\rkaku\AppData\Local\rstudio\spark\Cache\spark-1.6.2-bin-hadoop2.6\tmp\hadoop\bin.

But while googling I figured out that a log file is created in the temp folder that can be checked for the issue, and it showed the following:

java.io.IOException: Could not locate executable C:\Users\rkaku\AppData\Local\rstudio\spark\Cache\spark-1.6.2-bin-hadoop2.6\bin\bin\winutils.exe in the Hadoop binaries

The location given in the log file was not the same as the location suggested by RStudio :) After finally placing winutils.exe in the location referenced by the Spark log file, I was able to successfully connect through sparklyr ...... wohooooooo!!!! I have to say three weeks went into just connecting to Spark, but it was all worth it :)
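In R terms, the fix amounts to copying the file to the path from the log (a sketch; both paths are the ones quoted above):

    # Where RStudio's error message said to put winutils.exe:
    cache <- file.path(Sys.getenv("LOCALAPPDATA"), "rstudio", "spark", "Cache",
                       "spark-1.6.2-bin-hadoop2.6")
    src <- file.path(cache, "tmp", "hadoop", "bin", "winutils.exe")

    # Where the Spark log actually looked (note the doubled "bin"):
    dest <- file.path(cache, "bin", "bin")
    dir.create(dest, recursive = TRUE, showWarnings = FALSE)
    file.copy(src, file.path(dest, "winutils.exe"))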

Rakesh Kumar

Please mind any proxy settings. Clearing the proxy with

    Sys.getenv("http_proxy")
    Sys.setenv(http_proxy='')

did the trick for me.
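If an https proxy is set as well, it may be worth clearing too (an assumption; only http_proxy was involved above):

    # Inspect both proxy variables, then clear them for this session:
    Sys.getenv(c("http_proxy", "https_proxy"))
    Sys.setenv(http_proxy = "", https_proxy = "")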