
At our company, to orchestrate runs of Databricks notebooks, we experimented and learned how to connect our notebooks (which live in a Git repository) to ADF pipelines. However, there is an issue.

As you can see in the screenshot attached to this question, the path to the notebook depends on the employee username, which is not a stable solution in production.

What are the possible solutions?

  • Update: The main issue is keeping the employee username out of production to avoid any future failure, whether in the ADF path itself or in a secondary storage location that a Lookup activity can read but that still sits on the production side. See the path sketch below.
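To make the issue concrete, these are the two path shapes in play (the repo, notebook, and folder names here are just placeholders):

    /Repos/first.last@company.com/my-project/my-notebook    (tied to one employee's username)
    /Repos/some-shared-folder/my-project/my-notebook        (not tied to any user)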

Path selection in ADF: [screenshot]

[additional screenshots of the notebook path selection in ADF]


2 Answers


If you want to avoid having the username in the path, you can simply create a folder inside Repos and check out the repository there (here is the full instruction):

  • In Repos, at the top level, click the down-caret next to the "Repos" header, select "Create", and then select "Folder". Give it a name, for example "Staging".


  • Create a repository inside that folder

Click the down-caret next to the "Staging" folder, click "Create", and select "Repo".


After that, you can navigate to that repository in the ADF UI.
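If you prefer to script this step instead of using the UI, the same checkout can be done with the Databricks Repos REST API. A minimal sketch, where the Azure DevOps organization, project, repo name, and the "Staging" folder are placeholders for your own values:

    POST /api/2.0/repos
    {
      "url": "https://dev.azure.com/myorg/myproject/_git/my-project",
      "provider": "azureDevOpsServices",
      "path": "/Repos/Staging/my-project"
    }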

It's also recommended to set permissions on the folder, so only specific people can update projects inside it.
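Folder permissions can also be applied programmatically through the Permissions API. A sketch, assuming a directory ID looked up beforehand (for example via the Workspace API) and a hypothetical group name:

    PATCH /api/2.0/permissions/directories/{directory_id}
    {
      "access_control_list": [
        { "group_name": "data-engineers", "permission_level": "CAN_MANAGE" }
      ]
    }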


You can use Azure DevOps source control to manage the development and production Databricks notebooks, along with other related code, scripts, and documents, in Git. Learn more here.

Keep your notebooks in logically organized repositories in GitHub and use the same path in the Notebook activity in your Azure Data Factory.

If you want to pass a dynamic path to the Notebook activity, you should keep a placeholder list of the notebook file paths, for example in a text/CSV file or a SQL table where all the notebook paths are available.
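For example, the lookup source could be as simple as a single-column SQL table (the table name and paths here are hypothetical):

    NotebookPath
    -----------------------------------
    /Repos/Staging/etl/ingest_orders
    /Repos/Staging/etl/transform_orders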

Then use a Lookup activity in ADF to fetch the list of those paths, pass the Lookup output to a ForEach activity, place a Notebook activity inside the ForEach, and pass the path (for each iteration) to its parameters. This way you avoid a hardcoded file path in the pipeline.
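A minimal sketch of such a pipeline in ADF's JSON representation; the dataset, linked service, and activity names are placeholders, and the Lookup source is assumed to be the SQL table sketched above:

    {
      "name": "RunNotebooksDynamically",
      "properties": {
        "activities": [
          {
            "name": "LookupNotebookPaths",
            "type": "Lookup",
            "typeProperties": {
              "source": {
                "type": "AzureSqlSource",
                "sqlReaderQuery": "SELECT NotebookPath FROM dbo.NotebookPaths"
              },
              "dataset": {
                "referenceName": "NotebookPathsDataset",
                "type": "DatasetReference"
              },
              "firstRowOnly": false
            }
          },
          {
            "name": "ForEachNotebookPath",
            "type": "ForEach",
            "dependsOn": [
              {
                "activity": "LookupNotebookPaths",
                "dependencyConditions": [ "Succeeded" ]
              }
            ],
            "typeProperties": {
              "items": {
                "value": "@activity('LookupNotebookPaths').output.value",
                "type": "Expression"
              },
              "activities": [
                {
                  "name": "RunNotebook",
                  "type": "DatabricksNotebook",
                  "linkedServiceName": {
                    "referenceName": "AzureDatabricksLinkedService",
                    "type": "LinkedServiceReference"
                  },
                  "typeProperties": {
                    "notebookPath": {
                      "value": "@item().NotebookPath",
                      "type": "Expression"
                    }
                  }
                }
              ]
            }
          }
        ]
      }
    }

Because the paths come from the lookup table, promoting from dev to prod only requires updating the table rows, not editing the pipeline.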

  • Thanks @UtkarshPal-MT, that's definitely very helpful. For the first part of the answer, we are indeed using Azure DevOps source control of type Git; on the other hand, Databricks and ADF share the same Git repo from Azure. Given these conditions, is there any other solution? – Ali Saberi Jan 21 '22 at 12:04
  • 1
    This is the best approach as per my knowledge. You just need to change the repository names and if required use Lookup and ForEach activity for dynamic paths. – Utkarsh Pal Jan 21 '22 at 12:11
  • The main issue is keeping the employee username out of production, either in the ADF path or in a secondary place that a Lookup can read but that still sits on the production side. – Ali Saberi Jan 21 '22 at 12:18
  • Why aren't you changing the repo name when moving from the dev to the prod environment? – Utkarsh Pal Jan 21 '22 at 12:25
  • Oh, wait a second, we are seeing this in different ways. When adding an ADF Notebook activity, the next step is the notebook path, and it starts at the root folder. The first option there is "Repos"; after clicking "Repos", I see two folders named after employee usernames. When I click one, I see the list of repos under that user, and I select the project repo and the path to the notebook. Thanks for answering back, really appreciated. I've updated the original question with more images of what I explained here. – Ali Saberi Jan 21 '22 at 12:41