Original question
I'm trying to move from databricks runtime 10.4 LTS to 11.3 LTS. I'm able to connect fine with databricks-connect, but I am not able to import the correct versions of the modules that I have installed on the cluster.
- I start by creating a new cluster with databricks runtime 11.3 LTS
- Then I install `pandas==1.5.3` on the cluster with pypi from the Libraries tab in the cluster config.
- Wait for the cluster to be ready with the installed modules.
- Then I run the following snippet with databricks-connect:
```python
def test_map(s):
    import pandas as pd
    return pd.__version__

test_rdd = spark.sparkContext.parallelize(["test"])
test_rdd.map(test_map).collect()
```
It returns `['1.3.4']`. It should have returned `['1.5.3']`.
When I run the same snippet in a databricks notebook on the same cluster, it returns `['1.5.3']` as expected.
If I follow the steps above for a cluster running 10.4 LTS, the code snippet returns `['1.5.3']` both with databricks-connect and in a databricks notebook.
If I try to install a module on 11.3 LTS that is not part of the databricks runtime by default, e.g. `openpyxl`, and import it with databricks-connect as above, I get an exception: `ModuleNotFoundError: No module named 'openpyxl'`. With a standard databricks notebook, the module gets imported fine.
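For reference, the openpyxl check follows the same pattern as the pandas snippet above; a minimal version (the function name is just illustrative) looks like this:

```python
def test_import(s):
    # Works from a notebook on the same cluster, but raises
    # ModuleNotFoundError: No module named 'openpyxl' via databricks-connect on 11.3 LTS.
    import openpyxl
    return openpyxl.__version__

test_rdd = spark.sparkContext.parallelize(["test"])
test_rdd.map(test_import).collect()
```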
I run `databricks-connect==10.4.22` when connecting to 10.4 LTS and `databricks-connect==11.3.10` when connecting to 11.3 LTS.
How can I make the installed modules available through databricks-connect when running databricks runtime 11.3 LTS?
Further investigation:
To diagnose the problem further, I tried running the following snippet with both databricks-connect and a standard databricks notebook:
```python
def test_map(s):
    import sys
    return sys.executable

test_rdd = spark.sparkContext.parallelize(["test"])
test_rdd.map(test_map).collect()
```
The idea is to see whether the same python environment is used in both cases. The following table shows what the snippet returns:
|  | 10.4 LTS | 11.3 LTS |
|---|---|---|
| Databricks notebook | `['/databricks/python/bin/python']` | `['/local_disk0/.ephemeral_nfs/cluster_libraries/python/bin/python']` |
| Databricks-connect | `['/databricks/python/bin/python']` | `['/databricks/python/bin/python']` |
The 11.3 LTS column hints that the modules are missing because databricks-connect is running in a different python environment than plain databricks notebooks.
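A variant of the same check (sketched below, not something I have listed in the steps above) would also report whether the ephemeral cluster-libraries path from the table is visible on the workers' `sys.path`:

```python
def env_info(s):
    import sys
    # Report the worker interpreter and whether the ephemeral cluster-libraries
    # path (seen in the notebook case above) is on sys.path.
    return {
        "executable": sys.executable,
        "cluster_libraries_on_path": any("cluster_libraries" in p for p in sys.path),
    }

spark.sparkContext.parallelize(["test"]).map(env_info).collect()
```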
Trying out the proposed answer:
I tried following the suggestions in this answer without luck.
- I start by extracting the versions of all installed packages on the cluster by running `%sh pip freeze` in a databricks notebook.
- I then copy the list of packages and versions to a local `requirements.txt` and run `pip install -r requirements.txt` to install the correct versions of the packages locally. I was not able to install the following packages, but I don't think they are needed: `distro-info===0.23ubuntu1 python-apt==2.0.1 unattended-upgrades==0.1 PyGObject==3.36.0 dbus-python==1.2.16 Pygments==2.10.0d`
- Then I ran the suggested script locally with databricks-connect. I got no output, which verifies that the local and cluster packages are now the same version.
- Then I install `pandas==1.5.3` with pypi on the cluster from the Libraries tab in the cluster config.
- Then I install `pandas==1.5.3` locally by changing the pandas line in `requirements.txt` to `pandas==1.5.3` and then running `pip install -r requirements.txt`.
- Then I again run the suggested script locally with databricks-connect, and get the following output: `Version mismatch for packate pandas! Remote: 1.3.4, Local: 1.5.3`
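The script from the linked answer is not reproduced here; a rough sketch of the kind of comparison it performs, assuming it gathers remote versions with an RDD job (hence the `remote_pkgs` variable mentioned below) and compares them against the local environment, would be:

```python
import pkg_resources

def collect_remote_versions(s):
    # Runs on a worker: list the packages visible to the worker's python.
    import pkg_resources
    return {p.project_name.lower(): p.version for p in pkg_resources.working_set}

# Package versions on the cluster, gathered through a one-element RDD job.
remote_pkgs = spark.sparkContext.parallelize(["test"]).map(collect_remote_versions).collect()[0]

# Package versions in the local databricks-connect environment.
local_pkgs = {p.project_name.lower(): p.version for p in pkg_resources.working_set}

for name, local_version in local_pkgs.items():
    remote_version = remote_pkgs.get(name)
    if remote_version is not None and remote_version != local_version:
        print(f"Version mismatch for package {name}! Remote: {remote_version}, Local: {local_version}")
```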
I am still not able to use another version of pandas through databricks-connect. Also, when installing `openpyxl` on the cluster, it does not show up in the `remote_pkgs` variable in the script, even though the output from `%sh pip freeze` running in a databricks notebook reports it as installed.
Unfortunately, nothing seems to have changed from my original question.