To implement it correctly you need to understand how things are working:
%run
is a separate directive that should be put into the separate notebook cell, you can't mix it with the Python code. Plus, it can't accept the notebook name as variable. What %run
is doing - it's evaluating the code from specified notebook in the context of the current Spark session, so everything that is defined in that notebook - variables, functions, etc. is available in the caller notebook.
dbutils.notebook.run
is a function that may take a notebook path, plus parameters and execute it as a separate job on the current cluster. Because it's executed as a separate job, then it doesn't share the context with current notebook, and everything that is defined in it won't be available in the caller notebook (you can return a simple string as execution result, but it has a relatively small max length). One of the problems with dbutils.notebook.run
is that scheduling of a job takes several seconds, even if the code is very simple.
How you can implement what you need?
- if you use
dbutils.notebook.run
, then in the called notebook you can register a temp view, and caller notebook can read data from it (examples are adopted from this demo)
Called notebook (Code1
- it requires two parameters - name
for view name & n
- for number of entries to generate):
name = dbutils.widgets.get("name")
n = int(dbutils.widgets.get("n"))
df = spark.range(0, n)
df.createOrReplaceTempView(name)
Caller notebook (let's call it main
):
if "dataset" in "path":
view_name = "some_name"
dbutils.notebook.run(ntbk_path, 300, {'name': view_name, 'n': "1000"})
df = spark.sql(f"select * from {view_name}")
... work with data
- it's even possible to do something like with
%run
, but it could require a kind of "magic". The foundation of it is the fact that you can pass arguments to the called notebook by using the $arg_name="value"
, and you can even refer to the values specified in the widgets. But in any case, the check for value will happen in the called notebook.
The called notebook could look as following:
flag = dbutils.widgets.get("generate_data")
dataframe = None
if flag == "true":
dataframe = ..... create datarame
and the caller notebook could look as following:
------ cell in python
if "dataset" in "path":
gen_data = "true"
else:
gen_data = "false"
dbutils.widgets.text("gen_data", gen_data)
------- cell for %run
%run ./notebook_name $generate_data=$gen_data
------ again in python
dbutils.widgets.remove("gen_data") # remove widget
if dataframe: # dataframe is defined
do something with dataframe