
We have ~100M records that have been collected over 2 weeks. The same record can appear multiple times. For the duplicated records, I only need the latest one based on the "LastModified" date.

I have tried the following Spark script, but it seems to pick up a value randomly.

from pyspark.sql.functions import unix_timestamp

df.orderBy(unix_timestamp(df["LastModified"], "MM/dd/yyyy hh:mm:ss a").desc()).dropDuplicates(["LastModified"])

I have checked the data, the date format, and so on; it all looks good. Does anyone have any ideas?
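Update: from what I've read, dropDuplicates after an orderBy does not guarantee which duplicate survives, because the shuffle can reorder rows before the deduplication runs. A window function looks like a deterministic alternative, so here is a minimal sketch I am considering. The RecordId key column and the records.parquet source path are hypothetical placeholders; substitute your real key column(s) and source.

from pyspark.sql import SparkSession, Window
from pyspark.sql.functions import col, row_number, unix_timestamp

spark = SparkSession.builder.appName("dedup-latest").getOrCreate()

# Hypothetical source; replace with the actual input.
df = spark.read.parquet("records.parquet")

# Parse "LastModified" once so ordering is by actual time, not string order.
df = df.withColumn("ts", unix_timestamp(col("LastModified"), "MM/dd/yyyy hh:mm:ss a"))

# Number the rows within each duplicate group, newest first,
# then keep only the first row of each group.
w = Window.partitionBy("RecordId").orderBy(col("ts").desc())
latest = (df.withColumn("rn", row_number().over(w))
            .filter(col("rn") == 1)
            .drop("rn", "ts"))

The idea is that row_number is assigned per partition after an explicit ordering, so exactly one row (the newest) is kept per key. Would this be the right fix?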

Tuong Le