I am building an Azure Data Factory v2 pipeline, which comprises:

- a Databricks step that queries large tables from Azure Blob storage and generates a tabular intermediate result, `intermediate_table`;
- a Python step (which does several things and would be cumbersome to put into a single notebook) that reads that intermediate result back and generates the final output.
The notebook generates a `pyspark.sql.dataframe.DataFrame` named `processed_table`, which I have tried to save in Parquet format with attempts like

```python
processed_table.write.format("parquet").saveAsTable("intermediate_table", mode='overwrite')
```

or

```python
processed_table.write.parquet("intermediate_table", mode='overwrite')
```
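
For illustration, an explicit-path variant of the same write would look roughly like this (a sketch only: `/mnt/intermediate` is a hypothetical mount point backed by the same Blob storage account, not something that exists yet in my workspace):

```python
# Sketch only: assumes a mount point /mnt/intermediate already exists
# (for example, created with dbutils.fs.mount against the Blob container).
output_path = "dbfs:/mnt/intermediate/intermediate_table"

# Write the intermediate result as Parquet files under the explicit path,
# overwriting the output of any previous pipeline run.
processed_table.write.mode("overwrite").parquet(output_path)
```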
Now I would like the Python step to re-read the intermediate result, ideally from a `postprocess.py` file with syntax like

```python
import pandas as pd
intermediate = pd.read_parquet("intermediate_table")
```

after having installed `fastparquet` on my Databricks cluster.
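
For illustration, a version of `postprocess.py` that reads from an explicit path would look roughly like this (again a sketch: the `/dbfs/mnt/intermediate/...` path assumes the hypothetical mount point above, plus Databricks' `/dbfs` FUSE mount through which non-Spark Python code on the driver can see DBFS):

```python
import pandas as pd

# Sketch: read the directory of Parquet part-files written by the notebook step.
# Plain (non-Spark) Python on a Databricks driver sees DBFS under /dbfs/..., so
# /dbfs/mnt/intermediate/... corresponds to the hypothetical mount used above.
intermediate = pd.read_parquet("/dbfs/mnt/intermediate/intermediate_table")

# ... further postprocessing that produces the final output ...
```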
This is (not surprisingly...) failing with errors like

```
FileNotFoundError: [Errno 2] No such file or directory: './my_processed_table'
```
I assume the file is not found because the Python script is not accessing the data in the right context/path.
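
A minimal way to check that assumption from inside the notebook would be something along these lines (`dbutils` and `display` are only available on Databricks, and the warehouse path below is just the default location for managed tables, which may differ in my workspace):

```python
# Where does saveAsTable put the managed table? By default, under the metastore
# warehouse directory (dbfs:/user/hive/warehouse/ on Databricks).
display(dbutils.fs.ls("dbfs:/user/hive/warehouse/"))

# Where does the relative write.parquet("intermediate_table") end up?
# Presumably somewhere on DBFS rather than on the local filesystem that pandas
# sees from postprocess.py, which would explain the FileNotFoundError.
display(dbutils.fs.ls("dbfs:/"))
```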
How should I amend the code above, and what would be the best/canonical way to pass data across such steps in a pipeline? (Any other advice on common best practices for doing this is welcome.)