
From Removing duplicates from rows based on specific columns in an RDD/Spark DataFrame, we learned how to drop duplicate observations based on some specific variables. What if I want to keep those duplicate observations as an RDD instead, how should I do that? I guess rdd.subtract() may not be efficient if the RDD contains billions of observations. So besides rdd.subtract(), is there any other approach I can use?
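For context, here is a minimal sketch of the dropDuplicates-plus-subtract approach described above (the DataFrame, column names, and data are made up for illustration):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# toy data: duplicates are identified by (id, name); score may differ
df = spark.createDataFrame(
    [(1, 'a', 10), (1, 'a', 20), (2, 'b', 30)],
    ['id', 'name', 'score'])

deduped = df.dropDuplicates(['id', 'name'])  # keeps one row per (id, name)
dupes = df.subtract(deduped)                 # the rows that were dropped

# note: subtract compares whole rows, so this only recovers duplicates
# when the duplicated rows differ somewhere outside the key columns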

Samson
  • this link might help: https://stackoverflow.com/questions/49559994/keep-only-duplicates-from-a-dataframe-regarding-some-field – jxc Sep 18 '19 at 02:25
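The linked answer keeps every row whose key occurs more than once, via a count over a window. A rough sketch of that technique, reusing the hypothetical df and key columns (id, name) from above:

from pyspark.sql import Window
from pyspark.sql import functions as F

w = Window.partitionBy('id', 'name')
# keep rows whose (id, name) combination appears more than once,
# including the first occurrence of each duplicated key
dupes = (df.withColumn('cnt', F.count(F.lit(1)).over(w))
           .filter(F.col('cnt') > 1)
           .drop('cnt'))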

1 Answer


If you need both datasets, one with only the distinct values and the other with the duplicates, you should use subtract; that will give an accurate result. If you only need the duplicates, you can use SQL to get them:

df.createOrReplaceTempView('mydf')
# row_number() cannot be filtered in the same SELECT (HAVING does not see
# the window alias), so wrap it in a subquery and filter with WHERE
df2 = spark.sql("""
    select * from (
        select *,
               row_number() over (
                   partition by <<list of columns used to identify duplicates>>
                   order by <<any column/s not used to identify duplicates>>
               ) as row_num
        from mydf
    )
    where row_num > 1
""").drop('row_num')
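For what it's worth, the same row_number filter can be written with the DataFrame API instead of a temp view; a sketch with the <<...>> placeholders filled in by the hypothetical columns used above:

from pyspark.sql import Window
from pyspark.sql import functions as F

w = Window.partitionBy('id', 'name').orderBy('score')
df2 = (df.withColumn('row_num', F.row_number().over(w))
         .filter(F.col('row_num') > 1)
         .drop('row_num'))

Note that this keeps only the second and later occurrences of each duplicated key; the count-over-window variant sketched in the comment above keeps all occurrences, first one included.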
pawinder gupta