
I am loading Delta tables into a Delta Lake on S3. The table schema is product_code, date, quantity, crt_dt.

I am getting 6 months of forecast data; for example, if this month is May 2022, I will get quantity data for May, June, July, Aug, Sept, and Oct. The issue I am facing is that the data gets duplicated every month. I want only a single row in the Delta table, based on the most recent crt_dt, as shown in the screenshot below. Can anyone help me with the solution I should implement?

[Screenshot: expected output with a single row per product_code and date, keeping the row with the latest crt_dt]

The data is partitioned by crt_dt.

Thanks!

Krishna

2 Answers


If you want to keep only the row with the most recent crt_dt for each product and forecast month, this code will normally do the trick:


from pyspark.sql import Window
from pyspark.sql.functions import col, row_number

# Rank rows within each product/forecast-month group, newest crt_dt first
w3 = Window.partitionBy("product_code", "date").orderBy(col("crt_dt").desc())

# Keep only the top-ranked (most recent) row in each group
df.withColumn("row", row_number().over(w3)) \
  .filter(col("row") == 1).drop("row") \
  .show()

For more details, check https://sparkbyexamples.com/pyspark/pyspark-select-first-row-of-each-group/
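
If the duplication comes from appending the same forecast months on every monthly load, another option is a Delta Lake MERGE, which upserts into the existing table instead of appending. Here's a rough sketch, assuming the delta-spark package is available; incoming_df (the new month's already-deduplicated data) and the table path are placeholders:

from delta.tables import DeltaTable

# Placeholder path to the existing Delta table
target = DeltaTable.forPath(spark, "path/to/delta_lake")

(target.alias("t")
 .merge(
     incoming_df.alias("s"),
     "t.product_code = s.product_code AND t.date = s.date")
 .whenMatchedUpdateAll()      # refresh quantity and crt_dt for months already in the table
 .whenNotMatchedInsertAll()   # insert forecast months not seen before
 .execute())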

MOK
  • I want to write that data into the Delta lake after every incremental load, which happens only on a monthly basis... – Krishna Oct 12 '22 at 01:03

You have a dataset that you'd like to filter and then write out to a Delta table.

Another poster told you how to filter the data to meet your requirements. Here's how to filter the data and then write it out.

# Note: don't chain .show() when assigning; it returns None, not a DataFrame
filtered_df = df.withColumn("row", row_number().over(w3)) \
  .filter(col("row") == 1).drop("row")

filtered_df.write.format("delta").mode("append").save("path/to/delta_lake")

You can also do this with SQL if you aren't using the Python API.
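
For example, here's a rough SQL sketch run through spark.sql; the forecasts temp view name is just for illustration:

# Register a temp view over the raw data (name is illustrative)
df.createOrReplaceTempView("forecasts")

spark.sql("""
  SELECT product_code, date, quantity, crt_dt
  FROM (
    SELECT *,
           ROW_NUMBER() OVER (
             PARTITION BY product_code, date
             ORDER BY crt_dt DESC) AS rn
    FROM forecasts)
  WHERE rn = 1
""").write.format("delta").mode("append").save("path/to/delta_lake")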

Powers