
Introduction:

I've installed some packages on a Databricks cluster using install.packages on Databricks Runtime 9.1 LTS, and I want to run a UDF with R and Spark (SparkR or sparklyr). My use case is scoring some data in batch with Spark, and I've currently chosen SparkR::dapply. The main issue is that the installed packages don't appear to be available on the workers when SparkR::dapply runs.

Code (info reduced and some revised for privacy):

install.packages("lda", repos = "https://cran.microsoft.com/snapshot/2021-12-01/")
my_data <- read.csv('/dbfs/mnt/container/my_data.csv')
my_data_sdf <- as.DataFrame(my_data)

schema <- structType(structField("Var1", "integer"),
                     structField("Var2", "integer"),
                     structField("Var3", "integer"))

df1 <- SparkR::dapply(my_data_sdf, function(my_data) {
  # lda #
  #install.packages("lda", repos = "https://cran.microsoft.com/snapshot/2021-12-01/")
  library(lda)
  return(my_data)  # return the local data.frame partition, not the Spark DataFrame
}, schema)

display(df1)

Error message (some info redacted with 'X'):

org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 3.0 failed 4 times, most recent failure: Lost task 0.3 in stage 3.0 (TID 9) (X0.XXX.X.X executor 0): org.apache.spark.SparkException: R unexpectedly exited.
R worker produced errors: Error in library(lda) : there is no package called ‘lda’
Calls: compute -> computeFunc -> library
Execution halted

System / Hardware:

  • Azure Databricks
  • Databricks Runtime 9.1 LTS (min 2 workers max 10)
  • Worker hardware = Standard_DS5_v2
  • Driver hardware = Standard_D32s_v2

Notes:

  • If I use 'require' instead, no error message is returned, but 'require' is designed not to throw an error, so that doesn't confirm the package actually loaded on the workers.
  • I'm able to run SparkR::dapply and perform operations, but once I add library(lda) I get an error message, even though I've installed 'lda' and I'm using Databricks Runtime 9.1 LTS.
  • I'm using recommended CRAN snapshot to install - https://learn.microsoft.com/en-us/azure/databricks/kb/r/pin-r-packages
  • I'm using DR 9.1 LTS which (to my understanding) makes installed packages available to workers - "Starting with Databricks Runtime 9.0, R packages are accessible to worker nodes as well as the driver node." - https://learn.microsoft.com/en-us/azure/databricks/libraries/notebooks-r-libraries
  • If I include install.packages("lda", repos = "https://cran.microsoft.com/snapshot/2021-12-01/") inside dapply, then it works without error, but this doesn't seem like best practice per the documentation (a rough sketch of that workaround is shown right after these notes).
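
For completeness, this is roughly what the in-dapply install looks like; the requireNamespace guard is just something I added so workers that already have the package skip the reinstall:

df1 <- SparkR::dapply(my_data_sdf, function(my_data) {
  # Install on the worker only if 'lda' isn't already available there
  if (!requireNamespace("lda", quietly = TRUE)) {
    install.packages("lda", repos = "https://cran.microsoft.com/snapshot/2021-12-01/")
  }
  library(lda)
  return(my_data)
}, schema)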

Questions:

  • How do I install R packages on Databricks clusters so they're available on all the nodes? What is the proper approach?
  • How do I make sure that my packages are available to SparkR::dapply?
  • Thoughts on including install.packages in the dapply function itself?
  • Should I try something other than SparkR::dapply?

Thanks everyone :)

yeamusic21
  • Have you tried to attach that library to the cluster itself? https://learn.microsoft.com/en-us/azure/databricks/libraries/cluster-libraries. But it's really strange that scoped libraries don't work – Alex Ott Dec 01 '21 at 19:05
  • The issue with the scoped libraries approach is that Data Factory doesn't like it. Data Factory will try to kick off the notebook before the libraries are installed, so we've avoided this type of install since it doesn't appear to behave in a live setting with Data Factory. – yeamusic21 Dec 01 '21 at 19:48
  • If notebook scoped libraries don’t work, it’s better to raise support ticket – Alex Ott Dec 01 '21 at 19:56
  • Scoped libraries work as long as I'm running the notebook manually. Scoped won't work with Data Factory because the notebook gets kicked off before all libraries are installed. - https://github.com/MicrosoftDocs/azure-docs/issues/30253 – yeamusic21 Dec 01 '21 at 23:34
  • I created an Azure Databricks support ticket – yeamusic21 Dec 02 '21 at 16:49

2 Answers


After working with the Azure support team, the workaround / alternative option we landed on is to use an init script. The init script approach works well overall and plays nicely with Data Factory.

Example

From Notebook:

dbutils.fs.mkdirs("dbfs:/databricks/scripts/")

dbutils.fs.put("/databricks/scripts/r-installs.sh","""R -e 'install.packages("caesar", repos="https://cran.microsoft.com/snapshot/2021-08-02/")'""", True)

display(dbutils.fs.ls("dbfs:/databricks/scripts/r-installs.sh"))
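
Equivalently (just a sketch, not part of the support conversation), the same script can be written from an R cell through the /dbfs FUSE mount; 'lda' from the question is added here only to show installing several packages from one script:

# Write the same init script from R; /dbfs/... is the FUSE view of dbfs:/...
dir.create("/dbfs/databricks/scripts", recursive = TRUE, showWarnings = FALSE)
writeLines(
  "R -e 'install.packages(c(\"caesar\", \"lda\"), repos=\"https://cran.microsoft.com/snapshot/2021-08-02/\")'",
  "/dbfs/databricks/scripts/r-installs.sh"
)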

From Cluster UI:

Add the init script from the cluster's 'Init Scripts' tab, pointing it at the DBFS destination dbfs:/databricks/scripts/r-installs.sh created above, then restart the cluster so the script runs.


yeamusic21
  • I would also recommend installing R packages from Ubuntu binaries in the init script, if available (https://packages.ubuntu.com/search?keywords=r-cran-&searchon=names), as the installation time would be in seconds compared to minutes via the `install.packages` route. – Vivek Atal Jan 26 '23 at 16:25

In addition to the init script approach (which, by the way, works best), you can persist the installed binaries of the R packages in DBFS, where they can be accessed by the worker nodes as well. This approach is easier for interactive workloads, and also useful if you don't have the rights to modify the cluster config to add init scripts.

Please refer to this page for more details: https://github.com/marygracemoesta/R-User-Guide/blob/master/Developing_on_Databricks/package_management.md

The code below can be run inside a Databricks notebook; this step only needs to be done once. Afterwards, you won't have to install the packages again, even if you restart your cluster.

%python
# Create a location in DBFS where we will finally store the installed packages
# (dbfs:/persist-loc is what later appears as /dbfs/persist-loc through the FUSE mount)
dbutils.fs.mkdirs("dbfs:/persist-loc")

%sh
mkdir /usr/lib/R/persist-libs

%r
install.packages(c("caesar", "dplyr", "rlang"), 
                 repos="https://cran.microsoft.com/snapshot/2021-08-02", lib="/usr/lib/R/persist-libs")
# Can even persist custom packages
# install.packages("/dbfs/path/to/package", repos=NULL, type="source", lib="/usr/lib/R/persist-libs")

%r
system("cp -R /usr/lib/R/persist-libs /dbfs/persist-loc", intern=TRUE)

Now just add the final persist location to .libPaths() in the R script where you use dapply; this can be done in the very first cell, and it will work just fine even with the worker nodes. You won't have to install the packages again, which also saves time.

%r
.libPaths(c("/dbfs/persist-loc/persist-libs", .libPaths()))
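
As a quick check (a sketch, assuming the 'caesar' package was persisted as above), the package should now load inside dapply on the workers without any install call; repeating the .libPaths() line inside the function makes sure the worker R processes look in the persisted location too:

%r
library(SparkR)
sdf <- as.DataFrame(data.frame(x = 1:3))
checked <- dapply(sdf, function(part) {
  .libPaths(c("/dbfs/persist-loc/persist-libs", .libPaths()))  # point the worker at the persisted libs
  library(caesar)                                              # loads from DBFS, no install needed
  part
}, schema(sdf))
collect(checked)
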
Vivek Atal