
I'm new to Spark and newer to R, and am trying to figure out how to 'include' other R-scripts when running spark-submit.

Say I have the following R script which "sources" another R script:

main.R

source("sub/fun.R")
mult(4, 2)

The second R script, which lives in the sub-directory "sub", looks like this:

sub/fun.R

mult <- function(x, y) {
  x * y
}

I can invoke this with Rscript and it works as expected:

Rscript main.R
[1] 8

However, I want to run this with Spark, using spark-submit. When I do, I need a way to set the current working directory on the Spark workers to the directory containing main.R, so that the Spark/R worker process can find the "sourced" file in the "sub" subdirectory. (Note: I plan to have a shared filesystem between the Spark workers, so all workers will have access to the files.)
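
To illustrate the problem: as far as I understand, source() resolves a relative path against the worker process's current working directory, so whether the call succeeds depends entirely on where that process happens to be running (the paths below are made up purely for illustration):

# source() resolves relative paths against getwd(), so this only works if
# the worker's working directory actually contains the "sub" folder.
getwd()                              # e.g. some Spark work directory, not where main.R lives
source("sub/fun.R")                  # fails unless getwd() contains "sub/fun.R"

# An absolute path on the shared filesystem would always resolve,
# but I'd rather not hard-code it ("/shared/r-jobs" is a made-up path):
source("/shared/r-jobs/sub/fun.R")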

How can I set the current working directory in which SparkR executes, so that it can discover any included (sourced) scripts?

Or, is there a flag or Spark config for spark-submit that sets the current working directory of the worker process, which I could point at the directory containing the R scripts?

Or, does R have an environment variable that I can set to add an entry to the "R-PATH" (forgive me if no such thing exists in R)?
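
If nothing like that exists, I suppose I could roll my own with an ordinary environment variable, along these lines (R_SCRIPTS_HOME is just a name I invented, and spark.executorEnv.* is my understanding of how to pass an environment variable to the executors; I haven't verified this):

# Pass the variable to the executors when submitting, e.g.:
#   spark-submit --conf spark.executorEnv.R_SCRIPTS_HOME=/shared/r-jobs main.R
scripts_home <- Sys.getenv("R_SCRIPTS_HOME", unset = ".")
source(file.path(scripts_home, "sub", "fun.R"))
mult(4, 2)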

Or, am I able to use the --files flag to spark-submit to include these additional R-files, and if so, how?
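
If --files is the way to go, I imagine the script would end up looking roughly like this (spark.getSparkFiles is my best guess at the SparkR helper for locating files shipped with --files, and it looks like --files stages files under their bare name; please treat this as an unverified sketch):

# Ship the dependency alongside the job, e.g.:
#   spark-submit --files sub/fun.R main.R
library(SparkR)
sparkR.session()

# Locate the staged copy of fun.R (flat name, no "sub/" prefix) and source it.
source(spark.getSparkFiles("fun.R"))
mult(4, 2)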

Or is there generally a better way to include R scripts when run with spark-submit?

In summary, I'm looking for a way to include files with spark-submit and R.

Thanks for reading. Any thoughts are much appreciated.

Joe J
  • did you ever solve this? Running into the exact same thing presently. I don't want to use `setwd` in the script if I can avoid it - that seems sloppy – AgentBawls Jan 24 '22 at 17:38

0 Answers