I am facing a small challenge while trying to implement a solution using the new Repos functionality of Databricks. I am working on an interdisciplinary project that needs to use both Python and PySpark code. The Python team has already built some libraries (e.g. for preprocessing etc.) that the PySpark team now also wants to use. We thought that the new Repos feature would be a good compromise for collaborating easily. Therefore, we have added the ## Databricks notebook source header to all library files so that they can easily be edited in Databricks (since the Python development isn't finished yet, the code will also be changed by the PySpark team). Unfortunately, we ran into trouble with "importing" the library modules in a Databricks workspace directly from the repo.
Let me explain our problem with a simple example:
Let this be module_a.py:
## Databricks notebook source
def function_a(test):
    pass
And this is module_b.py:
## Databricks notebook source
import module_a as a
def function_b(test):
    a.function_a(test)
    ...
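For context, what we would ultimately like to write in a PySpark notebook is simply this (a sketch; the argument is made up and function_b is a no-op in the example above):

import module_b

module_b.function_b("some test input")  # made-up argument, just to show the intended usage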
The issue is that the only way to import these modules directly in Databricks is to use
%run module_a
%run module_b
which fails, since module_b tries to import module_a, which is not on the Python path.
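To make the failure concrete, here is the exported .py source of such a driver notebook (Databricks stores %run cells as MAGIC comments in the .py source; the relative paths are an assumption about our repo layout):

# MAGIC %run ./module_a

# COMMAND ----------

# MAGIC %run ./module_b

# The second %run breaks: module_b executes `import module_a as a`, and since
# %run only runs module_a.py inline without registering it as an importable
# module or putting its folder on sys.path, Python raises
# ModuleNotFoundError: No module named 'module_a'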
My idea was to copy the module_a.py and module_b.py files to DBFS or the local FileStore and then add that path to the Python path using sys.path.append(). Unfortunately, I didn't find any way to access the files from the repo via some magic commands in Databricks in order to copy them to the FileStore. (I do not want to clone the repo, since then I would have to push my changes every time before re-executing the code.)
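For illustration, this is roughly what I had in mind, assuming the two files had already been copied to some DBFS folder (the /dbfs/FileStore/shared_libs path is only a placeholder, it does not exist yet):

import sys

# placeholder DBFS directory that module_a.py and module_b.py would be copied to
sys.path.append("/dbfs/FileStore/shared_libs")

import module_a
import module_b  # module_b's own `import module_a as a` now resolves as well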
Is there a way to access the repo directory somehow from a notebook itself, so that I can copy the files to DBFS / the FileStore?
Is there another way to import the functions correctly? (Installing the repo as a library on the cluster is not an option, since the library will be changed by the developers during the process.)
Thanks!