
I am trying to use DLT for incremental processing, where the inputs are Parquet files arriving daily on S3. I was told that dlt.read_stream can help. I was able to read the files incrementally, but when I perform aggregations, the aggregation runs over the whole table instead of only over the incremental rows. I would appreciate any suggestions.

Here is the example code

import dlt
from pyspark.sql import functions as F

# Incrementally ingest new Parquet files from S3 with Auto Loader
@dlt.table()
def tab1():
    return (
        spark.readStream.format("cloudFiles")
        .schema(schema)
        .option("cloudFiles.format", "parquet")
        .option("cloudFiles.includeExistingFiles", False)
        .option("cloudFiles.allowOverwrites", False)
        .option("cloudFiles.validateOptions", True)
        .load(f"{s3_prefix}/tab1/")
    )

@dlt.table(
    comment="Aggregate table1"
)
def tab1_agg():
    return (
        dlt.read_stream("tab1")
        .groupBy("col1")
        .agg(
            F.count(F.lit(1)).alias("cnt"),
            F.sum("col2").alias("sum_col2"),
        )
        .withColumn("kh_meta_canonical_timestamp", F.current_timestamp())
    )

  • but how do you know what rows are in the increment? You need to define some kind of window on which you do the aggregate – Alex Ott Aug 15 '22 at 18:14
  • Yep, that is where I am stuck. I was hoping that dlt.read_stream("tab1") would operate only on the incremental set provided by "tab1". When you say "window", what kind of window are you thinking of? – developer developer Aug 15 '22 at 19:55
  • It could be a time window, a session window, etc. By default, Spark Structured Streaming just represents a stream as an infinite table – Alex Ott Aug 15 '22 at 20:11
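
Following the comments above about defining a window for the aggregation, here is a minimal sketch of a windowed streaming aggregation. It assumes the source rows carry an event-time column, hypothetically named event_ts, which is not in the original schema; the window duration of one day is also only an illustration.

# Sketch: aggregate per col1 within non-overlapping 1-day event-time windows,
# rather than over the entire (unbounded) stream.
import dlt
from pyspark.sql import functions as F

@dlt.table(comment="Windowed aggregate of tab1")
def tab1_agg_windowed():
    return (
        dlt.read_stream("tab1")
        # Watermark bounds how long Spark keeps aggregation state for late data.
        # "event_ts" is a hypothetical event-time column in the source data.
        .withWatermark("event_ts", "1 day")
        .groupBy(F.window("event_ts", "1 day"), "col1")
        .agg(
            F.count(F.lit(1)).alias("cnt"),
            F.sum("col2").alias("sum_col2"),
        )
        .withColumn("kh_meta_canonical_timestamp", F.current_timestamp())
    )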
