
I'm new to Spark and MapReduce, and I'd like to ask whether there is an elegant way to do the following. I have a DataFrame A. I want to produce a DataFrame R whose records are A's records merged, on specific keys, with records of a new DataFrame B, under a condition such as A's record.createdTime < B's record.createdTime. Thanks in advance.

Trung Hiếu Trần

1 Answer


You can use a join between the two DataFrames to achieve the desired result.

In Python:

dfA.join(dfB, (dfA.key == dfB.key) & (dfA.createdTime < dfB.createdTime) ).show()

You can also refer to this older question.

Naga
  • How about de-duplication? Do you know a way to handle it if DF B has duplicated records? I only want to take the latest record, the one with the biggest createdTime. – Trung Hiếu Trần Nov 04 '19 at 05:30
  • You can use `dropDuplicates` on dfB: `dfB.dropDuplicates()`. Additionally, you can pass a list of columns to remove duplicates based on specific columns. You can read more at https://spark.apache.org/docs/1.6.1/api/java/org/apache/spark/sql/DataFrame.html#dropDuplicates(java.lang.String[]); that link is for the Java API, but you can pick whichever API and version you prefer. – Naga Nov 04 '19 at 14:44