Running a system command in Hadoop using spark_apply from sparklyr

Question

I want to run a Java tool on data stored in a Hadoop cluster. I am trying to do it using the spark_apply function from sparklyr, but I am a bit confused by the syntax.

Before running the spark code, I've set up a conda environment following the instructions here: http://blog.cloudera.com/blog/2017/09/how-to-distribute-your-r-code-with-sparklyr-and-cdsw/ . I don't have access to parcels, so I need to use the second option described in the article. The conda environment also contains the Java tool I want to use.

Let's take for example the iris data:

library(sparklyr)
library(tidyverse)
library(datasets)
data(iris)
config <- spark_config()
config[["spark.r.command"]] <- "./r_env.zip/r_env/bin/Rscript"
config[["spark.yarn.dist.archives"]] <- "r_env.zip"
config$sparklyr.apply.env.R_HOME <- "./r_env.zip/r_env/lib/R"
config$sparklyr.apply.env.RHOME <- "./r_env.zip/r_env"
config$sparklyr.apply.env.R_SHARE_DIR <- "./r_env.zip/r_env/lib/R/share"
config$sparklyr.apply.env.R_INCLUDE_DIR <- "./r_env.zip/r_env/lib/R/include"
sc <- spark_connect(master = "yarn-client", config = config)

# Write iris table to HDFS, partitioning by Species
iris_tbl_tmp = copy_to(sc, iris, overwrite=T)
spark_write_table(iris_tbl_tmp, "iris_byspecies", partition_by="Species")
iris_tbl = sc %>% tbl("iris_byspecies")
iris_tbl

Since the Java tool cannot read data from HDFS, I actually have to save each dataset to a file, run the Java tool, then read the data again:

myfunction = function(x) { 
    write.table(x, "tempfile.txt")
    system2("{PATH}/myjavatool.java")
    res = read.table("output_of_java_command.txt")
    res
}
myoutput = spark_apply(iris_tbl, myfunction, group_by=Species)

My question is about the PATH to the Java tool. How can I see where sparklyr stores the conda environment?

Moreover, is there a simpler way to do this?

score 1 · Answer 1 · answered Oct 22 '18 at 23:46

1

According to the [Running Spark on YARN]https://spark.apache.org/docs/latest/running-on-yarn.html() guide, spark.yarn.dist.archives:

Comma separated list of archives to be extracted into the working directory of each executor.

So the files should be just in the working directory of your app.

answered Oct 22 '18 at 23:46

Thanks. When using sparklyr, I couldn't find them in that location. However I could figure out the location by looking at .libPaths(). – dalloliogm Oct 24 '18 at 13:32

score 1 · Answer 2 · answered Oct 26 '18 at 12:01

1

you need to call sparklyr::spark_apply with packages = FALSE, which means sparklyr::spark_apply will use your archive package(r_env.zip) instead of your .libPaths()

answered Oct 26 '18 at 12:01

Harry Zhu

39
2

Know how: * 1. spark driver will upload your r_env.zip to a HDFS path * 2. spark worker collect previous HDFS file to local container path * 3. spark worker extract your zip file : r_env.zip to ./r_env.zip/r_env/ * 4. spark worker will call Rscript by R_HOME and other environment settings. – Harry Zhu Oct 27 '18 at 14:56

Running a system command in Hadoop using spark_apply from sparklyr

2 Answers2