We have ~100M records collected over 2 weeks. The same record can appear multiple times; for the duplicated records, I only need the latest one based on the "LastModified" date.
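To make the requirement concrete, here is a small hypothetical sample (RecordId and Payload are illustrative column names, not our real schema):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("dedup-sample").getOrCreate()

    # "A1" appears twice; only its latest (05:00 PM) row should be kept.
    # "B2" is a different record, so its row should be kept as well.
    data = [
        ("A1", "01/15/2024 09:30:00 AM", "stale"),
        ("A1", "01/15/2024 05:00:00 PM", "latest"),
        ("B2", "01/15/2024 05:00:00 PM", "unrelated"),
    ]
    df = spark.createDataFrame(data, ["RecordId", "LastModified", "Payload"])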
I tried the following Spark script, but it seems to pick the surviving row at random:
    df.orderBy(
        unix_timestamp(df["LastModified"], "MM/dd/yyyy hh:mm:ss a").desc()
    ).dropDuplicates(["LastModified"])
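For completeness, here is a minimal, self-contained version of what I am running, continuing from the sample DataFrame above (again, the column names are illustrative, not our real schema):

    from pyspark.sql.functions import unix_timestamp

    # Continues from the sample `df` above.
    # Sort newest-first, then keep one row per distinct LastModified value.
    deduped = df.orderBy(
        unix_timestamp(df["LastModified"], "MM/dd/yyyy hh:mm:ss a").desc()
    ).dropDuplicates(["LastModified"])

    deduped.show(truncate=False)

In the sample, "A1" and "B2" share the 05:00 PM timestamp, and which of those two rows survives is not guaranteed, which matches the random picks I see on the full dataset.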
I have checked the data, the date format, and so on; everything looks correct. Does anyone have any ideas?