I have a Python package (created in PyCharm) that I want to run on Azure Databricks. The Python code runs against Databricks from the command line of my laptop in both Windows and Linux environments, so I'm confident there are no code issues.

I've also successfully created a Python wheel from the package, and am able to run the wheel from the command line locally.

Finally, I've uploaded the wheel as a library to my Spark cluster and created the Databricks Python activity in Data Factory, pointing to the wheel in dbfs.

When I try to run the Data Factory pipeline, it fails with an error saying it can't find the module named in the very first import statement of the main.py script. This module (GlobalVariables) is another script in my package and sits in the same folder as main.py, although I have other scripts in sub-folders as well. I've tried installing the package onto the cluster head node and still get the same error:

ModuleNotFoundError: No module named 'GlobalVariables'
Tue Apr 13 21:02:40 2021 py4j imported
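
For illustration, the layout is roughly as follows (only main.py and GlobalVariables.py are the real names; the package and sub-folder names below are placeholders), and main.py starts with a plain import of GlobalVariables:

```python
# Illustrative layout -- only main.py and GlobalVariables.py are the real names;
# "my_package" and "subfolder" are placeholders:
#
#   my_package/
#       main.py
#       GlobalVariables.py
#       subfolder/
#           helper.py

# main.py (simplified) -- the very first import is the one that fails on Databricks:
import GlobalVariables            # works locally, ModuleNotFoundError in Data Factory
from subfolder import helper      # stands in for the scripts in sub-folders

if __name__ == "__main__":
    print(GlobalVariables.__name__)
```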

Has anyone managed to run a wheel distribution through a Databricks Python activity successfully, and did you have to do any trickery so the package could find the rest of its contained files/modules?

Your help is greatly appreciated!

Configuration screen grabs:

  • Confirm the cluster is working in ADF (screenshot)
  • Config after appending the library (screenshot)

Simon Norton

1 Answer

We run pipelines using egg packages, but it should be similar for wheels. Here is a summary of the steps:

  1. Build the package with `python setup.py bdist_egg` (or `python setup.py bdist_wheel` for a wheel)
  2. Place the egg/whl file and the main.py script into Databricks FileStore (dbfs)
  3. In Azure Data Factory's Databricks activity, go to the Settings tab
  4. In Python file, set the dbfs path to the Python entrypoint file (the main.py script); a sketch of such an entrypoint follows this list
  5. In the Append libraries section, select type egg/wheel and set the dbfs path to the egg/whl file
  6. Select pypi and add all the dependencies of your package. It is recommended to pin the versions used in development.
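
A minimal sketch of that entrypoint, assuming the egg/whl installs a package called my_package exposing a run() function (both names are placeholders):

```python
# main.py -- standalone entrypoint placed in dbfs (step 2) and referenced as the
# Python file in step 4. It only imports code that the appended egg/whl library
# installs on the job cluster (step 5). "my_package" and "run" are placeholder
# names used purely for illustration.
from my_package import main as package_main

if __name__ == "__main__":
    package_main.run()
```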

Databricks Activity settings in Azure Data Factory (screenshot)

Ensure the GlobalVariables module code is actually inside the egg. As you are working with wheels, try using them in step 5 (I have never tested that myself).
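
A quick way to check (assuming the artifact was built into ./dist) is to list the archive contents, since both .egg and .whl files are zip archives:

```python
# List the contents of the built egg/whl and confirm GlobalVariables is inside.
# Assumption: the artifact was built into ./dist by setup.py bdist_egg/bdist_wheel.
import glob
import zipfile

for artifact in glob.glob("dist/*.egg") + glob.glob("dist/*.whl"):
    names = zipfile.ZipFile(artifact).namelist()
    print(artifact)
    print("  contains GlobalVariables:", any("GlobalVariables" in n for n in names))
```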

Emer
  • Will try step 5 as I have not been doing that! – Simon Norton Apr 26 '21 at 23:59
  • After appending the library, am I also supposed to "Install" it on the Libraries panel? – Simon Norton Apr 27 '21 at 00:09
  • No, Data Factory will spin up a new job cluster and install the dependencies listed in the Settings tab. – Emer Apr 27 '21 at 07:29
  • Thanks! Do I need to upload to PyPI (or some other repository), or is that optional if the wheel is already resident on dbfs? With the library in the settings, STDERR shows "ValueError: source code string cannot contain null bytes" when trying to run exec(f.read()). Does that sound like it is not finding the wheel file? – Simon Norton Apr 27 '21 at 16:07
  • No need to upload the wheel as long as it is available to all nodes (dbfs alone is fine). Did you do step 4? What path do you use? Do you have a separate py file as an entrypoint? – Emer Apr 28 '21 at 05:19
  • The path in Python File and in DBFS URI points to the wheel, which is now in a GUID-like folder under the jars folder. I've also tried it directly in /FileStore, but I get the same result. There's no other files involved, I'm just calling the wheel directly. – Simon Norton Apr 28 '21 at 16:30
  • That's the issue. Spark distributes the wheel to all nodes but it does not know how to start the application. You need a main function which loads the module and runs the code inside the wheel. Basically an entrypoint. If that main function is inside the wheel, you need to separate it as an individual py file. Then you deploy both the `whl` and the `py` files to the dbfs. `Python file` points to the file with the main function. More details: https://stackoverflow.com/questions/38120011/using-spark-submit-with-python-main – Emer Apr 28 '21 at 17:31
  • Sorry, just noticed you mentioned the `main.py` script in the Question. Put that script as a separate file in `dbfs` and add the path to Python file in Data Factory. I will edit the answer to mention this – Emer Apr 28 '21 at 17:39
  • In hindsight, that makes sense. I have to reconfigure my Wheel a little, but that's a different issue :) Thanks for the help! – Simon Norton Apr 28 '21 at 23:49
  • Is there a way of passing named parameters to the ADF? I have a question on this here: https://stackoverflow.com/questions/71322473/pass-parameters-to-python-code-from-azure-data-factory-for-it-to-run-on-databric –  Mar 02 '22 at 16:12