
Currently I'm connecting to Databricks from local VS Code via databricks-connect, but my submissions all fail with a module-not-found error, which means the code in my other Python files is not found. I tried:

  1. Moving the code into the folder with main.py

  2. Importing the file inside the function that uses it

  3. Adding the file via sparkContext.addPyFile

Does anyone have any experience with this? Or is there an even better way to interact with Databricks for Python projects?

It seems my Python code is executed in the local Python environment; only the code directly related to Spark runs on the cluster, but the cluster does not load all my Python files, which raises the error.
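A quick way to see this split (a minimal sketch; the top-level print reflects the local client interpreter, while sys.version printed inside a transformation reflects the cluster's Python workers):

    import sys
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    sc = spark.sparkContext

    # Top-level code runs in the local client interpreter under databricks-connect
    print("client python:", sys.version)

    # Code inside a transformation runs in the cluster's Python workers
    print("worker python:", sc.parallelize([0]).map(lambda _: sys.version).first())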

I have a file folder:

  • main.py
  • lib222.py
  • __init__.py

with class Foo in lib222.py.
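For a reproducible example, assume lib222.py contains something minimal like this (an illustrative stand-in; the real class does more):

    # lib222.py (illustrative stand-in)
    class Foo:
        def __init__(self, x):
            self.x = x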

The main code is:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    sc = spark.sparkContext
    # sc.setLogLevel("INFO")

    print("Testing addPyFile isolation")
    sc.addPyFile("lib222.py")
    from lib222 import Foo
    print(sc.parallelize(range(10)).map(lambda i: Foo(2)).collect())

But I get ModuleNotFoundError: No module named 'lib222'.

Also, when I print the Python version and some sys info, it seems the Python code is executed on my local machine instead of on the remote driver. My Databricks Runtime version is 6.6. Detailed error:

    Exception has occurred: Py4JJavaError
    An error occurred while calling z:org.apache.spark.api.python.PythonRDD.collectAndServe.
    : org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 4 times, most recent failure: Lost task 0.3 in stage 0.0 (TID 6, 10.139.64.8, executor 0): org.apache.spark.api.python.PythonException: Traceback (most recent call last):
      File "/databricks/spark/python/pyspark/serializers.py", line 182, in _read_with_length
        return self.loads(obj)
      File "/databricks/spark/python/pyspark/serializers.py", line 695, in loads
        return pickle.loads(obj, encoding=encoding)
    ModuleNotFoundError: No module named 'lib222'

    During handling of the above exception, another exception occurred:

    Traceback (most recent call last):
      File "/databricks/spark/python/pyspark/worker.py", line 462, in main
        func, profiler, deserializer, serializer = read_command(pickleSer, infile)
      File "/databricks/spark/python/pyspark/worker.py", line 71, in read_command
        command = serializer._read_with_length(file)
      File "/databricks/spark/python/pyspark/serializers.py", line 185, in _read_with_length
        raise SerializationError("Caused by " + traceback.format_exc())
    pyspark.serializers.SerializationError: Caused by Traceback (most recent call last):
      File "/databricks/spark/python/pyspark/serializers.py", line 182, in _read_with_length
        return self.loads(obj)
      File "/databricks/spark/python/pyspark/serializers.py", line 695, in loads
        return pickle.loads(obj, encoding=encoding)
    ModuleNotFoundError: No module named 'lib222'
Dong Yuan

1 Answer


I use Databricks on AWS, and the best practices I follow are as follows:

  • Uninstall PySpark from your local environment using pip or conda.
  • Create a virtual environment on your local system with a Python version compatible with your Databricks runtime. A virtual environment gives you more control over your setup and avoids version conflicts: conda create -n ENV_NAME python==PYTHON_VERSION

The minor version of your client Python installation must be the same as the minor Python version of your Databricks cluster (3.5, 3.6, or 3.7). Databricks Runtime 5.x has Python 3.5, Databricks Runtime 5.x ML has Python 3.6, and Databricks Runtime 6.1 and above and Databricks Runtime 6.1 ML and above have Python 3.7.
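A quick sanity check for this (a minimal sketch; it assumes a DBR 6.x cluster, which ships Python 3.7):

    # Verify the local client's Python minor version matches the cluster's.
    # Assumption: Databricks Runtime 6.x, which ships Python 3.7.
    import sys
    assert sys.version_info[:2] == (3, 7), "client Python is %s.%s" % sys.version_info[:2]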

Note: Always use pip to install PySpark, as it points to the official release. Avoid conda or conda-forge for the PySpark installation.

  • Follow the steps in the official databricks-connect documentation for configuring the workspace, then run the smoke test sketched after this list to confirm that jobs execute on the cluster.
  • On your Databricks cluster, check the existing versions of PySpark and its dependencies. If I am correct, the version details for the dependencies of the latest PySpark release are as follows:
      • pandas 0.23.2
      • NumPy 1.7
      • pyarrow 0.15.1
      • Py4J 0.10.9
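To confirm the setup end to end, a minimal smoke test (essentially the standard databricks-connect check; the count should execute on the remote cluster):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    # If databricks-connect is configured correctly, this job runs on the cluster
    print(spark.range(10).count())

And to compare your local dependency versions against the list above (a small sketch using pkg_resources, which is available on Python 3.7):

    import pkg_resources

    # Print the locally installed versions of PySpark's key dependencies
    for pkg in ("pandas", "numpy", "pyarrow", "py4j"):
        print(pkg, pkg_resources.get_distribution(pkg).version)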
Rohit Mishra