
Spatial packages in R often depend on system C libraries for their numerical computation. This presents a problem when installing such R packages if the R engine is unable to install those libraries using default permissions, and Databricks clusters appear to present exactly this obstacle for R. I see two ways around it: 1) create a Docker container with the relevant scripts to install the libraries, or 2) install them by way of an init script. I figured the latter approach would be easier, but I'm having problems: the cluster fails to start because my init script fails to execute. See below (I've also tried with sudo):

set -euxo pipefail

apt install libgeos-dev
apt install libudunits2-dev
apt install libgdal-dev

Relatedly, should these only be installed on the driver node? I don't see a reason why they need to be on the worker nodes. I believe the code above installs them on both the workers and the driver. To install on just the driver, I suppose it would be:

if [[ $DB_IS_DRIVER = "TRUE" ]]; then
  apt install libgeos-dev
  apt install libudunits2-dev
  apt install libgdal-dev
fi
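
A likely cause of the hang described here is that `apt install` without `-y` waits for a confirmation prompt, which never returns inside a non-interactive init script. A minimal sketch of a non-interactive variant (package names taken from the question; `DEBIAN_FRONTEND=noninteractive` is a standard Debian/apt convention, not anything Databricks-specific):

```shell
#!/bin/bash
# Fail fast and log each command, as in the original script.
set -euxo pipefail

# Prevent apt from ever waiting on an interactive prompt.
export DEBIAN_FRONTEND=noninteractive

# Refresh package lists first; cluster images often ship with stale ones.
apt-get update

# -y auto-confirms, so the script cannot hang waiting for input.
apt-get install -y libgeos-dev libudunits2-dev libgdal-dev
```

This is a cluster init-script fragment, so it assumes it runs as root (or is prefixed with sudo) with network access.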
Cyrus Mohammadian
  • what error do you get for your init scripts? – Alex Ott Feb 11 '21 at 08:14
  • No error, the init script just never finishes processing, and the cluster, even after waiting for an hour, just keeps spinning... – Cyrus Mohammadian Feb 11 '21 at 17:29
  • I figured the long installation time was related to installing the libraries on each worker node, but even on a cluster of two worker nodes it never completed. Running the latter code (to install on the driver only) resulted in the failure of the init script, with no details on the failure: the event log just mentions that it failed. – Cyrus Mohammadian Feb 11 '21 at 17:31
  • you can enable cluster log delivery to DBFS, and then it will include the logs for the init script as well; you'll be able to pull them to your local machine via `databricks fs ...` – Alex Ott Feb 11 '21 at 17:58

1 Answer


I faced a similar situation: I needed to install some system libraries for an R package to work in a Unix environment. I ran a command on Databricks similar to the one below to create the init script in DBFS; hopefully it is helpful for your problem.
Also, the libraries should be installed on all nodes, not only on the driver node, so that the R package works on the worker nodes as well if you wish to use distributed computing.

dbutils.fs.mkdirs("dbfs:/databricks/initscripts/") 

dbutils.fs.put("/databricks/initscripts/installpackagehelpers.sh","""
#!/bin/bash
sudo apt-get -q -y update
echo "Installing libgmp"
sudo apt-get -q -y --fix-missing install libgmp-dev
echo "Installed libgmp"
echo "Installing libmpfr"
sudo apt-get -q -y --fix-missing install libmpfr-dev
echo "Installed libmpfr"
""", True)

Finally, the init script location in DBFS is provided while creating the cluster: dbfs:/databricks/initscripts/installpackagehelpers.sh as per the above example.
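
If the cluster is created through the Clusters REST API rather than the UI, the same script can be attached via the `init_scripts` field. A sketch of the relevant request (the `init_scripts` shape is from the Databricks Clusters API; the workspace URL, token, and other cluster settings here are placeholders, not values from the question):

```shell
# Sketch: create a cluster with the DBFS init script attached.
# <workspace-url> and $DATABRICKS_TOKEN must be filled in for your workspace.
curl -X POST "https://<workspace-url>/api/2.0/clusters/create" \
  -H "Authorization: Bearer $DATABRICKS_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "cluster_name": "r-spatial",
    "init_scripts": [
      {"dbfs": {"destination": "dbfs:/databricks/initscripts/installpackagehelpers.sh"}}
    ]
  }'
```

The script then runs on every node (driver and workers) during cluster startup, which matches the advice above about installing the libraries cluster-wide.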

Vivek Atal