Currently I'm connecting to Databricks from local VS Code via databricks-connect, but every submission fails with a module-not-found error, meaning the code in my other Python files can't be found. I tried:
- moving the code into the folder with `main.py`
- importing the file inside the function that uses it
- adding the file via `sparkContext.addPyFile`
Does anyone have experience with this? Or is there a better way to interact with Databricks for Python projects?
It seems my Python code is executed in the local Python environment and only the code directly related to Spark runs on the cluster, but the cluster does not load all of my Python files, which then raises the error.
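To see where things actually run, a minimal check along these lines (a sketch of what I did; any Spark session created through databricks-connect should behave the same) prints the client's Python version locally and then asks an executor for its version:

```python
import sys

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext

# Runs in the local databricks-connect client process.
print("client python:", sys.version)

# The lambda is pickled and executed on a cluster executor,
# so this prints the executor's Python version instead.
print("executor python:", sc.parallelize([0], 1).map(lambda _: sys.version).collect()[0])
```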
My project folder looks like this:

```
main.py
lib222.py
__init__.py
```

with class `Foo` in `lib222.py`.
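For reference, a minimal `lib222.py` along these lines is enough to reproduce the error (the exact contents don't matter, only that it defines `Foo`):

```python
# lib222.py -- minimal stand-in; the real module just needs to define Foo
class Foo:
    def __init__(self, n):
        self.n = n
```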
The main code in `main.py` is:
```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext
# sc.setLogLevel("INFO")

print("Testing addPyFile isolation")

# Ship lib222.py to the cluster so the executors can import it.
sc.addPyFile("lib222.py")

from lib222 import Foo

# Foo is used inside the lambda, so it must be importable on the executors.
print(sc.parallelize(range(10)).map(lambda i: Foo(2)).collect())
```
But I get `ModuleNotFoundError: No module named 'lib222'`.
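Moving the import inside the function that uses it (the second thing I tried above) fails the same way; a sketch of that attempt (the function name is illustrative):

```python
def make_foo(i):
    # Deferred import, resolved on the executor at call time;
    # this still fails because lib222 is not on the executor's path.
    from lib222 import Foo
    return Foo(2)

print(sc.parallelize(range(10)).map(make_foo).collect())
```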
Also, when I print the Python version and some sys info, it seems the Python code is executed on my local machine instead of on the remote driver. My Databricks Runtime version is 6.6. Detailed error:
```
Exception has occurred: Py4JJavaError
An error occurred while calling z:org.apache.spark.api.python.PythonRDD.collectAndServe.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 4 times, most recent failure: Lost task 0.3 in stage 0.0 (TID 6, 10.139.64.8, executor 0): org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File "/databricks/spark/python/pyspark/serializers.py", line 182, in _read_with_length
    return self.loads(obj)
  File "/databricks/spark/python/pyspark/serializers.py", line 695, in loads
    return pickle.loads(obj, encoding=encoding)
ModuleNotFoundError: No module named 'lib222'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/databricks/spark/python/pyspark/worker.py", line 462, in main
    func, profiler, deserializer, serializer = read_command(pickleSer, infile)
  File "/databricks/spark/python/pyspark/worker.py", line 71, in read_command
    command = serializer._read_with_length(file)
  File "/databricks/spark/python/pyspark/serializers.py", line 185, in _read_with_length
    raise SerializationError("Caused by " + traceback.format_exc())
pyspark.serializers.SerializationError: Caused by Traceback (most recent call last):
  File "/databricks/spark/python/pyspark/serializers.py", line 182, in _read_with_length
    return self.loads(obj)
  File "/databricks/spark/python/pyspark/serializers.py", line 695, in loads
    return pickle.loads(obj, encoding=encoding)
ModuleNotFoundError: No module named 'lib222'
```