
I am building an Azure Data Factory v2 pipeline, which comprises:

  • A Databricks notebook step that queries large tables from Azure Blob storage and produces a tabular intermediate result, intermediate_table;
  • A Python step (which does several things and would be cumbersome to put in a single notebook) that reads that intermediate table and generates the final output.

In the Data Factory canvas it looks like this:

[pipeline screenshot: a Databricks notebook activity chained to a Python activity]

The notebook generates a pyspark.sql.dataframe.DataFrame which I tried to save into parquet format with attempts like

processed_table.write.format("parquet").saveAsTable("intermediate_table", mode='overwrite')

or

processed_table.write.parquet("intermediate_table", mode='overwrite')
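
I could presumably also write to an explicit DBFS path instead of a relative name (dbfs:/tmp/intermediate_table below is just a placeholder path I made up, not one from my pipeline), e.g.

# write the Spark DataFrame to an explicit DBFS location instead of a relative name
# (dbfs:/tmp/intermediate_table is a placeholder path)
processed_table.write.mode("overwrite").parquet("dbfs:/tmp/intermediate_table")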

Now, I would like the Python step to re-read the intermediate result, ideally with a postprocess.py file with a syntax like

import pandas as pd
intermediate = pd.read_parquet("intermediate_table")

after having installed fastparquet inside my Databricks cluster.
This is (not surprisingly...) failing with errors like

FileNotFoundError: [Errno 2] No such file or directory: './my_processed_table'

I assume the file is not found because the Python script is not looking in the right place (its local working directory rather than the DBFS location the notebook wrote to).
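
To check where the data actually ends up, I suppose I can list DBFS from the notebook (the warehouse path below is a guess on my part; I believe saveAsTable writes under the Hive warehouse directory by default):

# list DBFS from the Databricks notebook to locate the parquet output
display(dbutils.fs.ls("dbfs:/"))
# saveAsTable presumably writes here by default (assumption)
display(dbutils.fs.ls("dbfs:/user/hive/warehouse/"))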

How should I amend the code above, and what would be the best/canonical way to pass data across such steps in a pipeline? (Any other advice on common/best practices for doing this is welcome.)

Davide Fiocco
  • Would saving the file to DBFS help with this issue? – Jon Jul 15 '19 at 14:59
  • Anything goes really, I am new to data factories and Databricks so I am not sure what best practices are here! If I save to dbfs (which I think is what `write.parquet` does) how should I load the file in the Python step though? `pd.read_parquet("dbfs:/my_processed_table")` also fails... – Davide Fiocco Jul 15 '19 at 15:08
  • The saveAsTable may not work with DBFS. I'd have to mess with it myself to see if that works, though. This is all mainly just a thought :) – Jon Jul 15 '19 at 15:28

1 Answer


One way to run the pipeline successfully is to have in the Databricks notebook a cell like

%python

# enable Arrow to speed up the Spark -> pandas conversion
spark.conf.set("spark.sql.execution.arrow.enabled", "true")

# convert the Spark DataFrame to pandas and write a single parquet file to DBFS
# through the /dbfs FUSE mount, so a plain Python script can read it back later
processed_table.toPandas().to_parquet("/dbfs/intermediate", engine="fastparquet", compression=None)

and then have in postprocess.py

import pandas as pd

# read the intermediate file back through the same /dbfs mount path
intermediate = pd.read_parquet("/dbfs/intermediate")

Not sure if that's good practice (it works, though).
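
An alternative I haven't tested here (dbfs:/tmp/intermediate is just an example path, and it assumes pyarrow is available where the Python step runs) would be to keep the write on the Spark side and read the resulting parquet directory back through the /dbfs mount:

# in the notebook: write parquet to an explicit DBFS path with Spark
processed_table.write.mode("overwrite").parquet("dbfs:/tmp/intermediate")

# in postprocess.py: read the parquet directory through the /dbfs FUSE mount
import pandas as pd
intermediate = pd.read_parquet("/dbfs/tmp/intermediate", engine="pyarrow")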

Davide Fiocco