
We have ~100M records that have been collected over 2 weeks. The same record can appear multiple times. For the duplicated records, I only need the latest one based on the "LastModified" date.

I have tried the following Spark script, but it seems to pick up a value randomly.

from pyspark.sql.functions import unix_timestamp

df.orderBy(unix_timestamp(df["LastModified"], "MM/dd/yyyy hh:mm:ss a").desc()).dropDuplicates(["LastModified"])

I have checked the data, the date format, and so on; it all looks good. Does anyone have any ideas?
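Update: from what I've read, dropDuplicates after an orderBy does not guarantee which duplicate survives, because the shuffle can reorder rows before the deduplication runs. A window function looks like a deterministic alternative, so here is a minimal sketch I am considering. The RecordId key column and the records.parquet source path are hypothetical placeholders; substitute your real key column(s) and source.

from pyspark.sql import SparkSession, Window
from pyspark.sql.functions import col, row_number, unix_timestamp

spark = SparkSession.builder.appName("dedup-latest").getOrCreate()

# Hypothetical source; replace with the actual input.
df = spark.read.parquet("records.parquet")

# Parse "LastModified" once so ordering is by actual time, not string order.
df = df.withColumn("ts", unix_timestamp(col("LastModified"), "MM/dd/yyyy hh:mm:ss a"))

# Number the rows within each duplicate group, newest first,
# then keep only the first row of each group.
w = Window.partitionBy("RecordId").orderBy(col("ts").desc())
latest = (df.withColumn("rn", row_number().over(w))
            .filter(col("rn") == 1)
            .drop("rn", "ts"))

The idea is that row_number is assigned per partition after an explicit ordering, so exactly one row (the newest) is kept per key. Would this be the right fix?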

Tuong Le