
I need to install a JAR file as a library while setting up a Databricks cluster as part of my Azure Release pipeline. As of now, I have completed the following -

  • use an Azure CLI task to create the cluster definition
  • use a curl command to download the JAR file from the Maven repository into the pipeline agent folder
  • set up the Databricks CLI on the pipeline agent
  • use databricks fs cp to copy the JAR file from the local (pipeline agent) directory to the dbfs:/FileStore/jars folder (see the sketch after this list)
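For context, the download-and-copy steps look roughly like this in the pipeline (the Maven URL and JAR name below are only placeholders, not the real values from my pipeline):

# download the JAR from the Maven repository into the agent's working directory
curl -L -o my-library.jar "https://repo1.maven.org/maven2/<group-path>/<artifact>/<version>/<artifact>-<version>.jar"

# copy it from the pipeline agent onto DBFS so the cluster can use it later
databricks fs cp my-library.jar dbfs:/FileStore/jars/my-library.jar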

I am trying to create a cluster-scoped init script (bash) that will -

  • install pandas, azure-cosmos and python-magic packages
  • install the JAR file (already copied in the earlier steps to dbfs:/FileStore/jars location) as a cluster library file

My cluster init script looks like this -

#!/bin/bash
/databricks/python/bin/pip install pandas 2>/dev/null
/databricks/python/bin/pip install azure-cosmos 2>/dev/null
/databricks/python/bin/pip install python-magic 2>/dev/null

But I don't know -

  • if this would add the packages to the cluster
  • how to add an existing JAR file to a cluster as a library

I know there are other ways to edit cluster library metadata, but as far as I know, any change to the cluster libraries requires the cluster to be in the RUNNING state, which may not be the case for us. That is why I want to add an init script to my cluster definition, so that whenever the cluster is RESTARTED/RUNNING, the init script will be executed.
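For context, this is roughly how I plan to reference the init script in the cluster definition that gets submitted from the pipeline (the cluster settings, script name, and DBFS paths below are only placeholders, not my actual values):

# write the cluster spec to a file; the init_scripts block points at the DBFS
# location where the init script will be uploaded
cat > cluster.json <<'EOF'
{
  "cluster_name": "release-pipeline-cluster",
  "spark_version": "9.1.x-scala2.12",
  "node_type_id": "Standard_DS3_v2",
  "num_workers": 2,
  "init_scripts": [
    { "dbfs": { "destination": "dbfs:/FileStore/init/install-libs.sh" } }
  ]
}
EOF

# upload the init script and create the cluster from the spec
databricks fs cp install-libs.sh dbfs:/FileStore/init/install-libs.sh
databricks clusters create --json-file cluster.json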

Please help.

Thanks. Subhash


1 Answer


If you just want to copy JAR files onto the cluster nodes, just copy them into the /databricks/jars folder, like this (as part of your init script):

cp /dbfs/FileStore/jars/<file-name.jar> /databricks/jars/

or

cp /dbfs/FileStore/jars/*.jar /databricks/jars/

Regarding the rest of the init script - yes, it will install packages on all cluster nodes as required. Just two comments:

  • You can install multiple packages with one pip command - it should be slightly faster than installing them one by one:
#!/bin/bash
/databricks/python/bin/pip install pandas azure-cosmos python-magic
  • Use of 2>/dev/null could make debugging of the init script harder, for example, when you have a problem with network connectivity or build errors. Without it you would be able to pull the cluster logs (if they are enabled), and they will contain the init script output as well.
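Putting both pieces together, a complete init script could look roughly like this (the DBFS path matches the one used earlier; the wildcard copies whatever JARs were uploaded there):

#!/bin/bash
# install the Python packages on every cluster node
/databricks/python/bin/pip install pandas azure-cosmos python-magic

# copy the JAR(s) previously uploaded to DBFS into the cluster's classpath folder
cp /dbfs/FileStore/jars/*.jar /databricks/jars/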
Alex Ott
  • in the init script log output, it shows "Successfully installed azure-core-1.18.0 azure-cosmos-4.2.0 python-magic-0.4.24", however no library is showing up in the cluster configuration. The JAR file has also been copied, yet that is not showing up either. The latest cluster statuses call does not return any cluster details – Subhash Ghai Sep 17 '21 at 14:20
  • if you install libraries with an init script - they won't be shown in the UI. The cluster UI is running on the Databricks side, and is not aware of any installation done via init script – Alex Ott Sep 17 '21 at 14:22
  • thanks once again, this was informative. I could see the pandas module getting imported from the notebooks successfully, and I can use the python magic commands as well... appreciate it – Subhash Ghai Sep 19 '21 at 09:09