
I am loading Delta tables into a Delta Lake on S3. The table schema is product_code, date, quantity, crt_dt.

I am getting 6 months of forecast data; for example, if this month is May 2022, I will get quantity data for May, June, July, Aug, Sept, and Oct. The issue I am facing is that the data gets duplicated every month. I want only a single row in the Delta table, based on the most recent crt_dt, as shown in the screenshot below. Can anyone help me with the solution I should implement?

[Screenshot: expected output with a single row per product_code and date, keeping the row with the latest crt_dt]

The data is partitioned by crt_dt.

Thanks!

Krishna

2 Answers


If you want to keep only the row with the most recent crt_dt for each product and forecast month, this code will normally do the trick:


from pyspark.sql import Window
from pyspark.sql.functions import col, row_number

# Rank rows within each product/forecast-month group, newest crt_dt first
w3 = Window.partitionBy("product_code", "date").orderBy(col("crt_dt").desc())

# Keep only the top-ranked (most recent) row in each group
df.withColumn("row", row_number().over(w3)) \
  .filter(col("row") == 1).drop("row") \
  .show()

For more details, check https://sparkbyexamples.com/pyspark/pyspark-select-first-row-of-each-group/
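
If the duplication comes from appending the same forecast months on every monthly load, another option is a Delta Lake MERGE, which upserts into the existing table instead of appending. Here's a rough sketch, assuming the delta-spark package is available; incoming_df (the new month's already-deduplicated data) and the table path are placeholders:

from delta.tables import DeltaTable

# Placeholder path to the existing Delta table
target = DeltaTable.forPath(spark, "path/to/delta_lake")

(target.alias("t")
 .merge(
     incoming_df.alias("s"),
     "t.product_code = s.product_code AND t.date = s.date")
 .whenMatchedUpdateAll()      # refresh quantity and crt_dt for months already in the table
 .whenNotMatchedInsertAll()   # insert forecast months not seen before
 .execute())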

MOK
  • I want to write that data into the Delta lake after every incremental load, which happens only on a monthly basis... – Krishna Oct 12 '22 at 01:03

You have a dataset that you'd like to filter and then write out to a Delta table.

Another poster told you how to filter the data to meet your requirements. Here's how to filter the data and then write it out.

# Note: don't chain .show() when assigning; it returns None, not a DataFrame
filtered_df = df.withColumn("row", row_number().over(w3)) \
  .filter(col("row") == 1).drop("row")

filtered_df.write.format("delta").mode("append").save("path/to/delta_lake")

You can also do this with SQL if you aren't using the Python API.
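
For example, here's a rough SQL sketch run through spark.sql; the forecasts temp view name is just for illustration:

# Register a temp view over the raw data (name is illustrative)
df.createOrReplaceTempView("forecasts")

spark.sql("""
  SELECT product_code, date, quantity, crt_dt
  FROM (
    SELECT *,
           ROW_NUMBER() OVER (
             PARTITION BY product_code, date
             ORDER BY crt_dt DESC) AS rn
    FROM forecasts)
  WHERE rn = 1
""").write.format("delta").mode("append").save("path/to/delta_lake")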

Powers