Original question
I'm trying to move from databricks runtime 10.4 LTS to 11.3 LTS. I'm able to connect fine with databricks-connect, but I am not able to import the correct versions of the modules that I have installed on the cluster.
- I start by creating a new cluster with databricks runtime 11.3 LTS
- Then I install `pandas==1.5.3` on the cluster with pypi from the Libraries tab in the cluster config.
- Wait for the cluster to be ready with the installed modules.
- Then I run the following snippet with databricks-connect:
```python
def test_map(s):
    import pandas as pd
    return pd.__version__

test_rdd = spark.sparkContext.parallelize(["test"])
test_rdd.map(test_map).collect()
```
It returns `['1.3.4']`. It should have returned `['1.5.3']`.
When I run the same snippet in a databricks notebook on the same cluster, it returns `['1.5.3']` as expected.
If I follow the steps above for a cluster running 10.4 LTS, the code snippet returns `['1.5.3']` both with databricks-connect and in a databricks notebook.
If I try to install a module on 11.3 LTS that is not part of the databricks runtime by default, e.g. `openpyxl`, and import it with databricks-connect as above, I get an exception: `ModuleNotFoundError: No module named 'openpyxl'`. With a standard databricks notebook, the module gets imported fine.
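For reference, the openpyxl check follows the same pattern as the pandas snippet above; a minimal version (the function name is just illustrative) looks like this:

```python
def test_import(s):
    # Works from a notebook on the same cluster, but raises
    # ModuleNotFoundError: No module named 'openpyxl' via databricks-connect on 11.3 LTS.
    import openpyxl
    return openpyxl.__version__

test_rdd = spark.sparkContext.parallelize(["test"])
test_rdd.map(test_import).collect()
```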
I run `databricks-connect==10.4.22` when connecting to 10.4 LTS and `databricks-connect==11.3.10` when connecting to 11.3 LTS.
How can I make the installed modules available through databricks-connect when running databricks runtime 11.3 LTS?
Further investigation:
To diagnose the problem further, I tried running the following snippet with both databricks-connect and a standard databricks notebook:
```python
def test_map(s):
    import sys
    return sys.executable

test_rdd = spark.sparkContext.parallelize(["test"])
test_rdd.map(test_map).collect()
```
The idea is to see whether the same python environment is used in both cases. The following table shows what the snippet returns:
|  | 10.4 LTS | 11.3 LTS |
|---|---|---|
| Databricks notebook | `['/databricks/python/bin/python']` | `['/local_disk0/.ephemeral_nfs/cluster_libraries/python/bin/python']` |
| Databricks-connect | `['/databricks/python/bin/python']` | `['/databricks/python/bin/python']` |
The 11.3 LTS column hints that the modules are missing because databricks-connect is running in a different python environment than plain databricks notebooks.
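A variant of the same check (sketched below, not something I have listed in the steps above) would also report whether the ephemeral cluster-libraries path from the table is visible on the workers' `sys.path`:

```python
def env_info(s):
    import sys
    # Report the worker interpreter and whether the ephemeral cluster-libraries
    # path (seen in the notebook case above) is on sys.path.
    return {
        "executable": sys.executable,
        "cluster_libraries_on_path": any("cluster_libraries" in p for p in sys.path),
    }

spark.sparkContext.parallelize(["test"]).map(env_info).collect()
```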
Trying out the proposed answer:
I tried following the suggestions in this answer without luck.
- I start by extracting the versions of all installed packages on the cluster by running `%sh pip freeze` in a databricks notebook.
- I then copy the list of packages and versions to a local `requirements.txt` and run `pip install -r requirements.txt` to install the correct versions of the packages locally. I was not able to install the following packages, but I don't think they are needed: `distro-info===0.23ubuntu1 python-apt==2.0.1 unattended-upgrades==0.1 PyGObject==3.36.0 dbus-python==1.2.16 Pygments==2.10.0d`
- Then I ran the suggested script locally with databricks-connect. I got no output, which verifies that the local and cluster packages are now the same version.
- Then I install `pandas==1.5.3` with pypi on the cluster from the Libraries tab in the cluster config.
- Then I install `pandas==1.5.3` locally by changing the pandas line in `requirements.txt` to `pandas==1.5.3` and then running `pip install -r requirements.txt`.
- Then I again run the suggested script locally with databricks-connect, and get the following output: `Version mismatch for packate pandas! Remote: 1.3.4, Local: 1.5.3`
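The script from the linked answer is not reproduced here; a rough sketch of the kind of comparison it performs, assuming it gathers remote versions with an RDD job (hence the `remote_pkgs` variable mentioned below) and compares them against the local environment, would be:

```python
import pkg_resources

def collect_remote_versions(s):
    # Runs on a worker: list the packages visible to the worker's python.
    import pkg_resources
    return {p.project_name.lower(): p.version for p in pkg_resources.working_set}

# Package versions on the cluster, gathered through a one-element RDD job.
remote_pkgs = spark.sparkContext.parallelize(["test"]).map(collect_remote_versions).collect()[0]

# Package versions in the local databricks-connect environment.
local_pkgs = {p.project_name.lower(): p.version for p in pkg_resources.working_set}

for name, local_version in local_pkgs.items():
    remote_version = remote_pkgs.get(name)
    if remote_version is not None and remote_version != local_version:
        print(f"Version mismatch for package {name}! Remote: {remote_version}, Local: {local_version}")
```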
I am still not able to use another version of pandas through databricks-connect. Also, when installing `openpyxl` on the cluster, it does not show up in the `remote_pkgs` variable in the script, even though the output from `%sh pip freeze` running in a databricks notebook reports it as installed.
Unfortunately, nothing seems to have changed from my original question.