
Suppose I have a DataFrame like the one below:

[image: sample DataFrame with columns A, B, C, D, E and numbered transaction rows]

Here you can see that transactions 1, 2 and 3 have the same values for columns A, B and C but different values for columns D and E. Column E contains date entries.

  1. For the same A, B and C combination (A=1, B=1, C=1), we have 3 rows. I want to keep only one row, based on the most recent transaction date in column E. For that most recent date there are 2 transactions, but I want to keep just one of them whenever two or more rows share the same A, B, C combination and the same most recent date in column E. So my expected output for this combination is row number 3 or 4 (either one will do).
  2. For the same A, B and C combination (A=2, B=2, C=2), we have 2 rows. Based on column E, the most recent date is that of row number 5, so we take just this row for this combination of A, B and C. So my expected output for this combination is row number 5.

So the final output will be rows (3 and 5) or (4 and 5).
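For reference, a DataFrame of this shape can be built as below (the values are only illustrative, not the exact ones from the image above, and spark is the active SparkSession):

import spark.implicits._

// Illustrative sample only: columns A, B, C, D and the transaction date E
val dataframe = Seq(
  (1, 1, 1, 10, "2017-02-28"),
  (1, 1, 1, 20, "2017-03-01"),
  (1, 1, 1, 30, "2017-03-01"),
  (2, 2, 2, 40, "2017-02-28"),
  (2, 2, 2, 50, "2017-02-25")
).toDF("A", "B", "C", "D", "E")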

Now, how should I approach this?

  1. I read this:

Both reduceByKey and groupByKey can be used for the same purposes but reduceByKey works much better on a large dataset. That’s because Spark knows it can combine output with a common key on each partition before shuffling the data.

  2. I tried groupBy on columns A, B, C with max on column E. But it can't give me the rest of the row when multiple rows are present for the same most recent date (a rough sketch of this attempt is shown below).
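A rough sketch of that attempt (using the illustrative dataframe from above):

import org.apache.spark.sql.functions.max

// Keeps only A, B, C and the latest E; the other columns (e.g. D) are lost
val latestPerKey = dataframe
  .groupBy("A", "B", "C")
  .agg(max("E").as("E"))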

What is the most optimized approach to solve this? Thanks in advance.

EDIT: I also need to get back my filtered transactions (the full rows). How can I do that as well?

Setu Kumar Basak

3 Answers


I used Spark window functions to get my solution:

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.row_number

// Rank rows within each (A, B, C) group by descending date in E
val window = Window
  .partitionBy(dataframe("A"), dataframe("B"), dataframe("C"))
  .orderBy(dataframe("E").desc)

// Keep only the first (most recent) row per group
val dfWithRowNumber = dataframe.withColumn("row_number", row_number().over(window))
val filteredDf = dfWithRowNumber.filter(dfWithRowNumber("row_number") === 1)
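
The filter keeps whole rows, so the original transaction columns come back as well; if the helper column is not wanted in the output, it can be dropped:

// The original columns are preserved; remove only the helper column
val result = filteredDf.drop("row_number")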
Setu Kumar Basak
  • This is guaranteed to do a full shuffle. You may want to compare it with the pairRDD/reduceByKey option of achieving the same. – sourabh Mar 29 '17 at 15:45

This is possible in several steps. First, the aggregated DataFrame:

import org.apache.spark.sql.functions.{col, max}
val aggregatedDF = initialDF.select("A","B","C","E").groupBy("A","B","C").agg(max("E").as("E_max"))

Then join the initial and aggregated DataFrames, keeping only the rows whose date equals the maximum:

initialDF.join(aggregatedDF, List("A","B","C")).filter(col("E") === col("E_max"))

If the initial DataFrame comes from Hive, all of this can be simplified.
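
For instance, assuming the data is available as a table named transactions (the table name is assumed here for illustration) and spark is the active SparkSession, the aggregation and the join can be expressed in a single SQL statement:

// "transactions" is an assumed table name
val result = spark.sql("""
  SELECT t.*
  FROM transactions t
  JOIN (SELECT A, B, C, max(E) AS E_max
        FROM transactions
        GROUP BY A, B, C) m
    ON t.A = m.A AND t.B = m.B AND t.C = m.C AND t.E = m.E_max
""")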

pasha701
val initialDF = Seq((1,1,1,1,"2/28/2017 0:00"),(1,1,1,2,"3/1/2017 0:00"),
  (1,1,1,3,"3/1/2017 0:00"),(2,2,2,1,"2/28/2017 0:00"),(2,2,2,2,"2/25/2017 0:00"))

This will miss out on the corresponding column D:

initialDF
.toDS.groupBy("_1","_2","_3")
.agg(max(col("_5"))).show 

In case you want the corresponding column D for the max column E:

initialDF.toDS.map(x => (x._1, x._2, x._3, (x._5, x._4))).groupBy("_1","_2","_3")
  .agg(max(col("_4")).as("_4"))
  .select(col("_1"), col("_2"), col("_3"), col("_4._2"), col("_4._1")).show

For reduceByKey you can convert the Dataset to a pair RDD and then work off it. It should be faster in case Catalyst is not able to optimize the groupByKey in the first approach. Refer to Rolling your own reduceByKey in Spark Dataset.
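
A rough sketch of that pair-RDD variant, assuming column E is in a form that compares correctly (for example a parsed timestamp or a sortable date string):

// Key each row by (A, B, C) and keep the row with the larger E per key
val deduped = initialDF.toDS.rdd
  .map(row => ((row._1, row._2, row._3), row))
  .reduceByKey((a, b) => if (a._5 >= b._5) a else b)
  .values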

sourabh