We are using Kedro in our project. Normally, one can define datasets as follows:
client_table:
  type: spark.SparkDataSet
  filepath: ${base_path_spark}/${env}/client_table
  file_format: parquet
  save_args:
    mode: overwrite
Now we're running on Databricks, which offers many optimisations such as autoOptimizeShuffle. We are considering making use of these to handle our 15TB+ datasets.
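(For context, my understanding is that auto-optimized shuffle is turned on with a Spark configuration flag; I am assuming the key below is the right one and that it would be set either in the cluster's Spark config on Databricks or in a conf/base/spark.yml that the project loads when it builds its own SparkSession:)

  # conf/base/spark.yml (assumed location; on Databricks the cluster's
  # Spark config may be where this actually needs to go)
  spark.databricks.adaptive.autoOptimizeShuffle.enabled: true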
However, it's not clear to me how to use Kedro with the Databricks Delta Lake solution.
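For example, is it enough to just switch the file_format in the catalog entry to delta, or does this require a dedicated Delta dataset type? A minimal sketch of what I had in mind, assuming spark.SparkDataSet passes the format straight through to Spark's reader/writer:

  client_table:
    type: spark.SparkDataSet
    filepath: ${base_path_spark}/${env}/client_table
    file_format: delta   # assumption: 'delta' is handed to spark.read/.write as-is
    save_args:
      mode: overwrite

Would something like this be the right approach, or is there a recommended way to integrate Kedro with Delta Lake on Databricks?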