I am facing a small challenge while trying to implement a solution using the new Repos functionality of Databricks. I am working on an interdisciplinary project that needs to use both Python and PySpark code. The Python team has already built some libraries (e.g. for preprocessing etc.) that the PySpark team now also wants to use. We thought that the new Repos feature would be a good compromise for collaborating easily. Therefore, we have added the ## Databricks notebook source header to all library files so that they can easily be edited in Databricks (since the Python development isn't finished yet, the code will also be changed by the PySpark team). Unfortunately, we ran into trouble with "importing" the library modules in a Databricks workspace directly from the repo.
Let me explain our problem with a simple example:
Let this be module_a.py:
## Databricks notebook source
def function_a(test):
    pass
And this is module_b.py:
## Databricks notebook source
import module_a as a
def function_b(test):
    a.function_a(test)
    ...
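For context, what we would ultimately like to write in a PySpark notebook is simply this (a sketch; the argument is made up and function_b is a no-op in the example above):

import module_b

module_b.function_b("some test input")  # made-up argument, just to show the intended usage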
The issue is that the only way to import these modules directly in Databricks is to use
%run module_a
%run module_b
which fails, since module_b tries to import module_a, which is not on the Python path.
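To make the failure concrete, here is the exported .py source of such a driver notebook (Databricks stores %run cells as MAGIC comments in the .py source; the relative paths are an assumption about our repo layout):

# MAGIC %run ./module_a

# COMMAND ----------

# MAGIC %run ./module_b

# The second %run breaks: module_b executes `import module_a as a`, and since
# %run only runs module_a.py inline without registering it as an importable
# module or putting its folder on sys.path, Python raises
# ModuleNotFoundError: No module named 'module_a'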
My idea was to copy the module_a.py and module_b.py files to DBFS or the local FileStore and then add that path to the Python path using sys.path.append(). Unfortunately, I didn't find any way to access the files from the repo via some magic commands in Databricks in order to copy them to the FileStore. (I do not want to clone the repo, since then I would have to push my changes every time before re-executing the code.)
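For illustration, this is roughly what I had in mind, assuming the two files had already been copied to some DBFS folder (the /dbfs/FileStore/shared_libs path is only a placeholder, it does not exist yet):

import sys

# placeholder DBFS directory that module_a.py and module_b.py would be copied to
sys.path.append("/dbfs/FileStore/shared_libs")

import module_a
import module_b  # module_b's own `import module_a as a` now resolves as well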
Is there a way to access the repo directory somehow from a notebook itself, so that I can copy the files to DBFS / the FileStore?
Is there another way to import the functions correctly? (Installing the repo as a library on the cluster is not an option, since the library will be changed by the developers during the process.)
Thanks!