
I have a notebook that processes a file and creates a DataFrame in a structured format. Now I need to import that DataFrame into another notebook, but the problem is that the first notebook should only run for some scenarios, so I need to validate a condition before running it.

Usually, to import all the data structures, we use %run. But in my case it needs to be a combination of an if clause and then the notebook run:

if "dataset" in path: %run ntbk_path

It's giving an error: "path not exist".

if "dataset" in path: dbutils.notebook.run(ntbk_path)

With this one I cannot get all the data structures.

Can someone help me resolve this error?


1 Answer


To implement it correctly, you need to understand how these two mechanisms work:

  • %run is a separate directive that must be put into its own notebook cell; you can't mix it with Python code. Also, it can't accept the notebook name as a variable. What %run does is evaluate the code from the specified notebook in the context of the current Spark session, so everything defined in that notebook (variables, functions, etc.) is available in the caller notebook.
  • dbutils.notebook.run is a function that takes a notebook path, plus parameters, and executes it as a separate job on the current cluster. Because it's executed as a separate job, it doesn't share the context with the current notebook, and nothing defined in it will be available in the caller notebook (you can return a simple string as the execution result, but it has a relatively small maximum length; see the sketch after this list). One of the problems with dbutils.notebook.run is that scheduling a job takes several seconds, even if the code itself is very simple.
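
To illustrate the "return a simple string" point from the list above, here is a minimal sketch; the notebook path and the returned value are placeholders, and 300 is the timeout in seconds:

Called notebook:

# compute something small and hand it back to the caller as a string
result = "42"
dbutils.notebook.exit(result)

Caller notebook:

# dbutils.notebook.run returns the string that the called notebook passed to dbutils.notebook.exit
returned = dbutils.notebook.run("/path/to/called_notebook", 300)
print(returned)  # prints "42"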

How can you implement what you need?

  • If you use dbutils.notebook.run, then in the called notebook you can register a temp view, and the caller notebook can read data from it (the examples are adapted from this demo).

Called notebook (let's call it Code1; it requires two parameters: name for the view name, and n for the number of entries to generate):

name = dbutils.widgets.get("name")
n = int(dbutils.widgets.get("n"))
df = spark.range(0, n)
df.createOrReplaceTempView(name)

Caller notebook (let's call it main):

if "dataset" in "path": 
  view_name = "some_name"
  dbutils.notebook.run(ntbk_path, 300, {'name': view_name, 'n': "1000"})  # 300 is the timeout in seconds
  df = spark.sql(f"select * from {view_name}")
  # ... work with the data
  • It's even possible to do something similar with %run, but it requires a kind of "magic". The foundation of it is the fact that you can pass arguments to the called notebook by using $arg_name="value", and you can even refer to values specified in widgets. But in any case, the check of the value will happen in the called notebook.

The called notebook could look like the following:

flag = dbutils.widgets.get("generate_data")
dataframe = None
if flag == "true":
  dataframe = ...  # create the dataframe here

and the caller notebook could look like the following:

------ cell in python
if "dataset" in "path": 
  gen_data = "true"
else:
  gen_data = "false"
dbutils.widgets.text("gen_data", gen_data)

------ cell for %run
%run ./notebook_name $generate_data=$gen_data

------ again in python
dbutils.widgets.remove("gen_data") # remove widget
if dataframe is not None: # dataframe was created in the called notebook
  ...  # do something with the dataframe
Alex Ott
  • Thanks for the detailed explanation, that helps! Is there any way to pass variables: var1 = "value1"; %run notebook $param1=var1? I am getting "var1" as the parameter value in the calling notebook. – Mahi Sep 02 '21 at 09:18
  • You need to put `value1` into a widget, and refer to the name of that widget as `$name`; it's in the example above. – Alex Ott Sep 02 '21 at 10:08
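
To spell out the suggestion from the comments with code, here is a minimal sketch following the same widget pattern as the example above (the notebook, widget, and parameter names come from the comment and are placeholders):

------ cell in python
dbutils.widgets.text("var1", "value1")  # put the value into a widget first

------ cell for %run (the %run line must be alone in its cell)
%run ./notebook_name $param1=$var1

The called notebook can then read the value with dbutils.widgets.get("param1").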