
We are using Kedro in our project. Normally, one can define datasets like this:

client_table:
  type: spark.SparkDataSet
  filepath: ${base_path_spark}/${env}/client_table
  file_format: parquet
  save_args:
    mode: overwrite

Now we're running on Databricks, which offers many optimisations such as autoOptimizeShuffle. We are considering making use of these to handle our 15TB+ datasets.

However, it's not clear to me how to use Kedro with the Databricks Delta Lake solution.
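
To make the question concrete, here is a rough sketch of the kind of session-level setup we imagine (the two spark.databricks.* flag names are our assumption and would need to be checked against the Databricks runtime docs):

    from pyspark.sql import SparkSession

    # Sketch only: the Spark session that Kedro's Spark hook would build,
    # with the Databricks-specific optimisations switched on. The flag
    # names below are assumed from the Databricks docs, not verified here.
    spark = (
        SparkSession.builder.appName("my-kedro-project")
        .config("spark.databricks.adaptive.autoOptimizeShuffle.enabled", "true")
        .config("spark.databricks.delta.optimizeWrite.enabled", "true")
        .getOrCreate()
    )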

pascalwhoop

2 Answers


This worked for us:

    client_table:
      type: kedro.contrib.io.pyspark.SparkDataSet
      filepath: ${base_path_spark}/${env}/client_table
      file_format: "delta"
      save_args:
        mode: overwrite
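
The node side stays plain PySpark; Kedro handles the Delta write when the node's output is mapped to this catalog entry. A minimal sketch, where the dataset name raw_clients, the column client_id and the pipeline wiring are just placeholders:

    from kedro.pipeline import Pipeline, node
    from pyspark.sql import DataFrame

    def build_client_table(raw_clients: DataFrame) -> DataFrame:
        # Ordinary Spark transformation; Kedro saves the returned DataFrame
        # to `client_table` using the catalog entry above (Delta, overwrite).
        return raw_clients.dropDuplicates(["client_id"])

    def create_pipeline(**kwargs) -> Pipeline:
        return Pipeline(
            [
                node(build_client_table, inputs="raw_clients", outputs="client_table"),
            ]
        )
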
jovib

Kedro now has a native DeltaTableDataSet; see the docs here: https://kedro.readthedocs.io/en/stable/tools_integration/pyspark.html#spark-and-delta-lake-interaction

temperature:
  type: spark.SparkDataSet
  filepath: data/01_raw/data.csv
  file_format: "csv"
  load_args:
    header: True
    inferSchema: True
  save_args:
    sep: '|'
    header: True

weather@spark:
  type: spark.SparkDataSet
  filepath: s3a://my_bucket/03_primary/weather
  file_format: "delta"
  save_args:
    mode: "overwrite"
    versionAsOf: 0

weather@delta:
  type: spark.DeltaTableDataSet
  filepath: s3a://my_bucket/03_primary/weather
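
The @spark entry covers plain reads and writes, while the @delta entry hands the node a DeltaTable object so that updates, deletes or merges run inside the node itself (DeltaTableDataSet does not save; the write happens through the Delta API). A rough sketch of such a node, with the join key station_id and the second input purely illustrative:

    from delta.tables import DeltaTable
    from pyspark.sql import DataFrame

    def upsert_weather(weather: DeltaTable, new_weather: DataFrame) -> None:
        # Merge the incoming batch into the Delta table in place;
        # the node therefore has no Kedro output.
        (
            weather.alias("current")
            .merge(new_weather.alias("new"), "current.station_id = new.station_id")
            .whenMatchedUpdateAll()
            .whenNotMatchedInsertAll()
            .execute()
        )

In the pipeline you would wire weather@delta in as the node's input and set outputs=None, since the table is mutated in place.
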
datajoely