
I want to run a Java tool on data stored in a Hadoop cluster. I am trying to do it using the spark_apply function from sparklyr, but I am a bit confused by the syntax.

Before running the Spark code, I've set up a conda environment following the instructions here: http://blog.cloudera.com/blog/2017/09/how-to-distribute-your-r-code-with-sparklyr-and-cdsw/ . I don't have access to parcels, so I need to use the second option described in the article. The conda environment also contains the Java tool I want to use.

Let's take for example the iris data:

library(sparklyr)
library(tidyverse)
library(datasets)
data(iris)
config <- spark_config()
config[["spark.r.command"]] <- "./r_env.zip/r_env/bin/Rscript"
config[["spark.yarn.dist.archives"]] <- "r_env.zip"
config$sparklyr.apply.env.R_HOME <- "./r_env.zip/r_env/lib/R"
config$sparklyr.apply.env.RHOME <- "./r_env.zip/r_env"
config$sparklyr.apply.env.R_SHARE_DIR <- "./r_env.zip/r_env/lib/R/share"
config$sparklyr.apply.env.R_INCLUDE_DIR <- "./r_env.zip/r_env/lib/R/include"
sc <- spark_connect(master = "yarn-client", config = config)

# Write iris table to HDFS, partitioning by Species
iris_tbl_tmp = copy_to(sc, iris, overwrite=T)
spark_write_table(iris_tbl_tmp, "iris_byspecies", partition_by="Species")
iris_tbl = sc %>% tbl("iris_byspecies")
iris_tbl

Since the Java tool cannot read data from HDFS, I actually have to save each dataset to a file, run the Java tool, then read the data again:

myfunction = function(x) { 
    # Write the group's data to a local file, run the Java tool on it,
    # then read the tool's output back in
    write.table(x, "tempfile.txt")
    system2("{PATH}/myjavatool.java")
    res = read.table("output_of_java_command.txt")
    res
}
myoutput = spark_apply(iris_tbl, myfunction, group_by = "Species")

My question is about the PATH to the Java tool. How can I see where sparklyr stores the conda environment?

Moreover, is there a simpler way to do this?


2 Answers


According to the Running Spark on YARN guide (https://spark.apache.org/docs/latest/running-on-yarn.html), spark.yarn.dist.archives is a:

Comma separated list of archives to be extracted into the working directory of each executor.

So the files should just be in the working directory of your app.
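
If you want to verify where the archive ends up, a quick probe from inside spark_apply can list the executor's working directory. This is only a sketch, not part of the question's workflow: it reuses iris_tbl from the question, and the column name files is arbitrary.

# Probe: return the contents of the executor's working directory,
# where spark.yarn.dist.archives should have been extracted
probe <- function(x) {
    data.frame(files = list.files(getwd()), stringsAsFactors = FALSE)
}
spark_apply(iris_tbl, probe)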

  • Thanks. When using sparklyr, I couldn't find them in that location. However, I could figure out the location by looking at .libPaths(). – dalloliogm Oct 24 '18 at 13:32

You need to call sparklyr::spark_apply with packages = FALSE, which means spark_apply will use your archive package (r_env.zip) instead of your .libPaths().
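
For example, with iris_tbl and myfunction from the question (a sketch; packages = FALSE only controls whether spark_apply ships the driver's .libPaths() packages, everything else stays as in the question):

myoutput = spark_apply(
    iris_tbl,
    myfunction,
    group_by = "Species",
    packages = FALSE   # use the distributed r_env.zip instead of .libPaths()
)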

  • Know-how: 1. The Spark driver uploads your r_env.zip to an HDFS path. 2. Each Spark worker fetches that HDFS file to its local container path. 3. The worker extracts the zip file to ./r_env.zip/r_env/. 4. The worker then calls Rscript using R_HOME and the other environment settings. – Harry Zhu Oct 27 '18 at 14:56
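
Putting those steps together, the worker function can address the Java tool relative to the extracted archive in the executor's working directory. The following is only a sketch: the location and name of the tool inside r_env.zip, the presence of a java binary in the conda environment, and the tool's command-line arguments are all assumptions.

myfunction = function(x) {
    write.table(x, "tempfile.txt")
    # java and the tool are assumed to live inside the extracted conda env;
    # adjust these paths to match the actual layout of r_env.zip
    system2("./r_env.zip/r_env/bin/java",
            args = c("-jar", "./r_env.zip/r_env/share/myjavatool.jar",
                     "tempfile.txt", "output_of_java_command.txt"))
    read.table("output_of_java_command.txt")
}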