I'm new to Spark and MapReduce, and I'd like to ask whether there is an elegant way to do the following. I have a DataFrame A, and I want to produce a DataFrame R whose records come from merging DataFrame A with a new DataFrame B on specific keys, subject to a condition such as A's record.createdTime < B's record.createdTime. Thanks in advance.
You can use `join` on a DataFrame to achieve the desired result. In Python:
dfA.join(dfB, (dfA.key == dfB.key) & (dfA.createdTime < dfB.createdTime) ).show()
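The join condition above can be illustrated with a plain-Python sketch of the same semantics (illustration only; the column names `key` and `createdTime` come from the question, and the sample rows are made up):

```python
def conditional_join(rows_a, rows_b):
    """Pair every A record with every B record that shares the same key
    and was created strictly later than the A record -- the same predicate
    as (dfA.key == dfB.key) & (dfA.createdTime < dfB.createdTime)."""
    return [
        (a, b)
        for a in rows_a
        for b in rows_b
        if a["key"] == b["key"] and a["createdTime"] < b["createdTime"]
    ]

# Hypothetical sample data standing in for dfA and dfB.
dfA_rows = [{"key": 1, "createdTime": 10}, {"key": 2, "createdTime": 30}]
dfB_rows = [{"key": 1, "createdTime": 20}, {"key": 2, "createdTime": 25}]

matched = conditional_join(dfA_rows, dfB_rows)
# key 1 matches (10 < 20); key 2 does not (30 is not < 25)
```

In Spark the same predicate is evaluated distributedly, but the matching rows are the same.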
You can also refer to this related older question.

Naga
- What about de-duplication? Do you know a way to handle it if DataFrame B has duplicated records? I only want to keep the latest record, the one with the biggest createdTime. – Trung Hiếu Trần Nov 04 '19 at 05:30
- You can use `dropDuplicates` on dfB: `dfB.dropDuplicates()`. Additionally, you can pass a list of columns to remove duplicates based on specific columns. You can read more at https://spark.apache.org/docs/1.6.1/api/java/org/apache/spark/sql/DataFrame.html#dropDuplicates(java.lang.String[]) (that link is for the Java API; pick whichever API and version you prefer). – Naga Nov 04 '19 at 14:44
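One caveat for the commenter's case: `dropDuplicates()` with no arguments only removes rows that are identical in every column, and `dropDuplicates(["key"])` keeps an arbitrary row per key, not necessarily the latest. To keep the record with the largest createdTime per key, the usual PySpark idiom is `row_number()` over `Window.partitionBy("key").orderBy(F.col("createdTime").desc())`, filtered to row number 1. A plain-Python sketch of those semantics (sample data is made up):

```python
def latest_per_key(rows):
    """Keep only the record with the largest createdTime for each key --
    the effect of a window partitioned by key, ordered by createdTime
    descending, filtered to the first row."""
    best = {}
    for r in rows:
        k = r["key"]
        if k not in best or r["createdTime"] > best[k]["createdTime"]:
            best[k] = r
    return list(best.values())

# Hypothetical dfB contents with a duplicated key.
dfB_rows = [
    {"key": 1, "createdTime": 5},
    {"key": 1, "createdTime": 9},   # later record for key 1 should win
    {"key": 2, "createdTime": 3},
]
deduped = latest_per_key(dfB_rows)
```

Running this dedup on dfB before the join ensures each key contributes only its most recent record.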