
We are using Kedro in our project. Normally, one can define datasets like this:

client_table:
  type: spark.SparkDataSet
  filepath: ${base_path_spark}/${env}/client_table
  file_format: parquet
  save_args:
    mode: overwrite

Now we're running on Databricks, which offers many optimisations such as autoOptimizeShuffle. We are considering making use of these to handle our 15TB+ datasets.

However, it's not clear to me how to use Kedro with the Databricks Delta Lake solution.
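
To make the question concrete, here is a rough sketch of the kind of session-level setup we imagine (the two spark.databricks.* flag names are our assumption and would need to be checked against the Databricks runtime docs):

    from pyspark.sql import SparkSession

    # Sketch only: the Spark session that Kedro's Spark hook would build,
    # with the Databricks-specific optimisations switched on. The flag
    # names below are assumed from the Databricks docs, not verified here.
    spark = (
        SparkSession.builder.appName("my-kedro-project")
        .config("spark.databricks.adaptive.autoOptimizeShuffle.enabled", "true")
        .config("spark.databricks.delta.optimizeWrite.enabled", "true")
        .getOrCreate()
    )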

pascalwhoop

2 Answers


This worked for us:

    client_table:
      type: kedro.contrib.io.pyspark.SparkDataSet
      filepath: ${base_path_spark}/${env}/client_table
      file_format: "delta"
      save_args:
        mode: overwrite
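
The node side stays plain PySpark; Kedro handles the Delta write when the node's output is mapped to this catalog entry. A minimal sketch, where the dataset name raw_clients, the column client_id and the pipeline wiring are just placeholders:

    from kedro.pipeline import Pipeline, node
    from pyspark.sql import DataFrame

    def build_client_table(raw_clients: DataFrame) -> DataFrame:
        # Ordinary Spark transformation; Kedro saves the returned DataFrame
        # to `client_table` using the catalog entry above (Delta, overwrite).
        return raw_clients.dropDuplicates(["client_id"])

    def create_pipeline(**kwargs) -> Pipeline:
        return Pipeline(
            [
                node(build_client_table, inputs="raw_clients", outputs="client_table"),
            ]
        )
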
jovib

Kedro now has a native DeltaTableDataSet; see the docs here: https://kedro.readthedocs.io/en/stable/tools_integration/pyspark.html#spark-and-delta-lake-interaction

temperature:
  type: spark.SparkDataSet
  filepath: data/01_raw/data.csv
  file_format: "csv"
  load_args:
    header: True
    inferSchema: True
  save_args:
    sep: '|'
    header: True

weather@spark:
  type: spark.SparkDataSet
  filepath: s3a://my_bucket/03_primary/weather
  file_format: "delta"
  save_args:
    mode: "overwrite"
    versionAsOf: 0

weather@delta:
  type: spark.DeltaTableDataSet
  filepath: s3a://my_bucket/03_primary/weather
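
The @spark entry covers plain reads and writes, while the @delta entry hands the node a DeltaTable object so that updates, deletes or merges run inside the node itself (DeltaTableDataSet does not save; the write happens through the Delta API). A rough sketch of such a node, with the join key station_id and the second input purely illustrative:

    from delta.tables import DeltaTable
    from pyspark.sql import DataFrame

    def upsert_weather(weather: DeltaTable, new_weather: DataFrame) -> None:
        # Merge the incoming batch into the Delta table in place;
        # the node therefore has no Kedro output.
        (
            weather.alias("current")
            .merge(new_weather.alias("new"), "current.station_id = new.station_id")
            .whenMatchedUpdateAll()
            .whenNotMatchedInsertAll()
            .execute()
        )

In the pipeline you would wire weather@delta in as the node's input and set outputs=None, since the table is mutated in place.
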
datajoely