Introduction:
I've installed some packages on a Databricks cluster with install.packages() on DBR 9.1 LTS, and I want to use them inside a UDF run through Spark with R (SparkR or sparklyr). My use case is batch scoring of data, and I've currently chosen SparkR::dapply. The problem is that the installed packages don't appear to be available on the workers when the dapply function runs.
Code (info reduced and some revised for privacy):
install.packages("lda", repos = "https://cran.microsoft.com/snapshot/2021-12-01/")
my_data<- read.csv('/dbfs/mnt/container/my_data.csv')
my_data_sdf <- as.DataFrame(my_data)
schema <- structType(structField("Var1", "integer"),structField("Var2", "integer"),structField("Var3", "integer"))
df1 <- SparkR::dapply(my_data_sdf , function(my_data) {
# lda #
#install.packages("lda", repos = "https://cran.microsoft.com/snapshot/2021-12-01/")
library( lda )
return(my_data_sdf)
}, schema)
display(df1)
Error message (some info redacted with 'X'):
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 3.0 failed 4 times, most recent failure: Lost task 0.3 in stage 3.0 (TID 9) (X0.XXX.X.X executor 0): org.apache.spark.SparkException: R unexpectedly exited.
R worker produced errors: Error in library(lda) : there is no package called ‘lda’
Calls: compute -> computeFunc -> library
Execution halted
System/Hardware:
- Azure Databricks
- Databricks Runtime 9.1 LTS (min 2 workers, max 10)
- Worker hardware = Standard_DS5_v2
- Driver hardware = Standard_D32s_v2
Notes:
- If I use require('lda') instead of library(lda), no error is returned, but require() is designed to return FALSE with a warning rather than throw an error, so the silence doesn't tell me the package actually loaded on the workers.
- I'm able to run SparkR::dapply and perform operations without issue, but as soon as I add library(lda) I get the error above, even though I've installed 'lda' and I'm on DBR 9.1 LTS.
- I'm using the recommended CRAN snapshot to install - https://learn.microsoft.com/en-us/azure/databricks/kb/r/pin-r-packages
- I'm using DBR 9.1 LTS, which (to my understanding) makes installed packages available to workers - "Starting with Databricks Runtime 9.0, R packages are accessible to worker nodes as well as the driver node." - https://learn.microsoft.com/en-us/azure/databricks/libraries/notebooks-r-libraries
- If I include install.packages("lda", repos = "https://cran.microsoft.com/snapshot/2021-12-01/") inside the dapply function, it works without error (sketch after this list), but that doesn't seem like best practice based on the documentation.
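For reference, this is the variant that does run for me, with the install done inside the UDF. The requireNamespace() guard is my own addition so a worker only installs when the package is genuinely missing; everything else is as in the code above:

df1 <- SparkR::dapply(my_data_sdf, function(my_data) {
  # Workaround: install 'lda' on the worker itself if it isn't already there
  if (!requireNamespace("lda", quietly = TRUE)) {
    install.packages("lda", repos = "https://cran.microsoft.com/snapshot/2021-12-01/")
  }
  library(lda)
  my_data
}, schema)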
Questions:
- How do I install R packages on Databricks clusters so they're available on all the nodes? What is the proper approach?
- How do I make sure that my packages are available to SparkR::dapply?
- Thoughts on including install.packages in the dapply function itself?
- Should I try something other than SparkR::dapply (e.g., sparklyr::spark_apply - rough sketch below)?
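For the last question, this is roughly what I imagine the sparklyr route would look like - untested on my side, and I'm assuming spark_apply()'s packages argument will copy the locally installed 'lda' package out to the workers:

library(sparklyr)
library(dplyr)

sc <- spark_connect(method = "databricks")             # connect from a Databricks notebook
my_data_tbl <- copy_to(sc, my_data, overwrite = TRUE)  # same my_data as above

scored <- spark_apply(
  my_data_tbl,
  function(partition) {
    library(lda)       # should load if the package really was shipped to the worker
    partition
  },
  packages = c("lda")  # assumption: distributes the locally installed package
)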
Thanks everyone :)