
I have a requirement wherein I need to pass a PySpark dataframe as a notebook parameter to a child notebook. Essentially, the child notebook has a few functions that take a dataframe as an argument to perform certain tasks. The problem is that I'm unable to pass a dataframe to this child notebook (without writing it to a temp directory) using

dbutils.notebook.run(<notebookpath>, timeout, <arguments>)

I tried referring to this URL: Return a dataframe from another notebook in databricks

However, I'm still a bit confused about how I can return a dataframe from the child notebook to the parent notebook, and from the parent to another child notebook.

I tried writing the code below:

tempview_list = ["tempView1", "tempView2", "tempView3"]
 
for tempview in tempview_list:
  dbutils.notebook.exit(spark.sql(f"Select * from {tempview}"))

But it is just returning the schema of the 1st tempView.

Please help. I'm a newbie in PySpark.

Thanks.

3 Answers


You can't directly pass a dataframe as a parameter or exit a dataframe; only strings can be passed this way. What you've ended up doing is exiting the schema of your views.

The way to do this is to register the DataFrames you want to pass between notebooks as global temp views.

Once you have done that, you can pass the name of the temp view as a parameter or exit it to the parent.

The documentation explains this best (see example 1): https://docs.databricks.com/notebooks/notebook-workflows.html#pass-structured-data
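As a minimal sketch of that single-view pattern (the notebook path, the view name, and df are placeholders, not names from your code):

In the child notebook:

df.createOrReplaceGlobalTempView("my_view")   # register the DataFrame as a global temp view
dbutils.notebook.exit("my_view")              # exit only the view's name, which is a string

In the parent notebook:

view_name = dbutils.notebook.run("/path/to/child_notebook", 600)
global_temp_db = spark.conf.get("spark.sql.globalTempDatabase")
df_from_child = spark.table(global_temp_db + "." + view_name)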

However, as this only gives guidance on exiting one dataframe/temp view, I'll elaborate on the example you provided.

The main changes are:

  • You exit the view names, not the data itself.
  • You can only exit one thing, so exit all of the names as one.
  • The for loop is in the parent, to use the names to read from the temp views.

In the parent, run the child notebook and assign its output/exit to a variable:

child_output = dbutils.notebook.run(<notebookpath>, timeout, <arguments>)

In the child:

tempview_list = ["tempView1", "tempView2", "tempView3"]
dbutils.notebook.exit(str(tempview_list))   # exit() only accepts a string, so convert the list first

The array will be exited into the child_output variable as a string:

"['tempView1', 'tempView2', 'tempView3']"

So in the parent you will need to turn the string back into an array using exec():

exec(f'tempview_list = {child_output}')
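If you'd rather not run exec() on the returned string, a safer sketch (assuming the exited string is a plain Python list literal, as shown above) is to parse it with ast.literal_eval:

import ast

tempview_list = ast.literal_eval(child_output)   # parses "['tempView1', ...]" back into a list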

Now that you've done this, you can do your for loop in the parent notebook:

global_temp_db = spark.conf.get("spark.sql.globalTempDatabase")

for tempview in tempview_list:
  exec(f'{tempview}_df = spark.table(global_temp_db + "." + tempview)')

This will then create three dataframes from your temp views (tempView1_df, tempView2_df, and tempView3_df) that you can do whatever you want with.
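As an alternative sketch (not part of the original answer), you can avoid creating dynamically named variables with exec() by collecting the dataframes in a dict keyed by view name:

global_temp_db = spark.conf.get("spark.sql.globalTempDatabase")

# One DataFrame per temp view, looked up by name instead of via a generated variable.
dfs = {tempview: spark.table(global_temp_db + "." + tempview) for tempview in tempview_list}
tempView1_df = dfs["tempView1"]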

I'm assuming you have already created the temp views from your initial dataframes; you'll need to update these to be global temp views instead.
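For example, in the child notebook that step might look like this (df1, df2, and df3 are placeholder names for your existing dataframes):

# Register each DataFrame as a *global* temp view before exiting the list of names.
df1.createOrReplaceGlobalTempView("tempView1")
df2.createOrReplaceGlobalTempView("tempView2")
df3.createOrReplaceGlobalTempView("tempView3")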


I would suggest that you create global temporary views that you can load from the next/parent notebook in your workflow.

It is not as elegant, but it works like a charm.

George Sotiropoulos

I had the same issue. An approach which worked pretty well was to create a global temp view in the first notebook and read it in the second one:

At the end of the first notebook, you can store your dataframe as a global temp view:

df_notebook1.createOrReplaceGlobalTempView('dataframe_temp_view')

and in the next notebook(s) read it like this:

df_temp = spark.table('global_temp.dataframe_temp_view')

Just make sure that both notebooks are attached to the same cluster, since global temp views are only visible within the same Spark application.
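Global temp views live for the lifetime of the cluster's Spark application, so if you want to clean up after the second notebook has read the data, you can optionally drop the view (a small optional sketch):

spark.catalog.dropGlobalTempView('dataframe_temp_view')   # optional cleanup once you're done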