
I'm new to Spark and MapReduce, and I'd like to ask whether there is an elegant way to do the following. I have a DataFrame A. I want to produce a DataFrame R whose records are A's records merged, on specific keys, with records of a new DataFrame B, under a condition such as A's record.createdTime < B's record.createdTime. Thanks in advance.

Trung Hiếu Trần

1 Answer


You can use a join between the two DataFrames to achieve the desired result.

In Python:

dfA.join(dfB, (dfA.key == dfB.key) & (dfA.createdTime < dfB.createdTime) ).show()

You can also refer to this older question.

Naga
  • How about de-duplication? Do you know a way to handle it if DF B has duplicated records? I only want to take the latest record, the one with the biggest createdTime. – Trung Hiếu Trần Nov 04 '19 at 05:30
  • You can use `dropDuplicates` on dfB: `dfB.dropDuplicates()`. Additionally, you can pass a list of columns to remove duplicates based on specific columns. You can read more at https://spark.apache.org/docs/1.6.1/api/java/org/apache/spark/sql/DataFrame.html#dropDuplicates(java.lang.String[]); that link is for the Java API, but you can pick whichever API and version you prefer. – Naga Nov 04 '19 at 14:44