
Suppose I have a DataFrame like the one below:

[image: sample DataFrame with columns A, B, C, D, E and numbered transaction rows]

Here you can see that transactions 1, 2 and 3 have the same values for columns A, B and C but different values for columns D and E. Column E contains date entries.

  1. For the same A, B and C combination (A=1, B=1, C=1), we have 3 rows. I want to keep only one row, based on the most recent transaction date in column E. For that most recent date there are 2 transactions, but I want to keep just one of them whenever two or more rows share the same A, B, C combination and the same most recent date in column E. So my expected output for this combination is row number 3 or 4 (either one will do).
  2. For the same A, B and C combination (A=2, B=2, C=2), we have 2 rows. Based on column E, the most recent date is that of row number 5, so we take just this row for this combination of A, B and C. So my expected output for this combination is row number 5.

So the final output will be rows (3 and 5) or (4 and 5).
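For reference, a DataFrame of this shape can be built as below (the values are only illustrative, not the exact ones from the image above, and spark is the active SparkSession):

import spark.implicits._

// Illustrative sample only: columns A, B, C, D and the transaction date E
val dataframe = Seq(
  (1, 1, 1, 10, "2017-02-28"),
  (1, 1, 1, 20, "2017-03-01"),
  (1, 1, 1, 30, "2017-03-01"),
  (2, 2, 2, 40, "2017-02-28"),
  (2, 2, 2, 50, "2017-02-25")
).toDF("A", "B", "C", "D", "E")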

Now, how should I approach this?

  1. I read this:

Both reduceByKey and groupByKey can be used for the same purposes but reduceByKey works much better on a large dataset. That’s because Spark knows it can combine output with a common key on each partition before shuffling the data.

  2. I tried groupBy on columns A, B, C with max on column E. But it can't give me the rest of the row when multiple rows are present for the same most recent date (a rough sketch of this attempt is shown below).
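A rough sketch of that attempt (using the illustrative dataframe from above):

import org.apache.spark.sql.functions.max

// Keeps only A, B, C and the latest E; the other columns (e.g. D) are lost
val latestPerKey = dataframe
  .groupBy("A", "B", "C")
  .agg(max("E").as("E"))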

What is the most optimized approach to solve this? Thanks in advance.

EDIT: I also need to get back my filtered transactions (the full rows). How can I do that as well?

Setu Kumar Basak

3 Answers


I used Spark window functions to get my solution:

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.row_number

// Rank rows within each (A, B, C) group by descending date in E
val window = Window
  .partitionBy(dataframe("A"), dataframe("B"), dataframe("C"))
  .orderBy(dataframe("E").desc)

// Keep only the first (most recent) row per group
val dfWithRowNumber = dataframe.withColumn("row_number", row_number().over(window))
val filteredDf = dfWithRowNumber.filter(dfWithRowNumber("row_number") === 1)
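
The filter keeps whole rows, so the original transaction columns come back as well; if the helper column is not wanted in the output, it can be dropped:

// The original columns are preserved; remove only the helper column
val result = filteredDf.drop("row_number")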
Setu Kumar Basak
  • This is guaranteed to do a full shuffle. You may want to compare it with the pairRDD/reduceByKey option of achieving the same. – sourabh Mar 29 '17 at 15:45

This is possible in several steps. First, the aggregated DataFrame:

import org.apache.spark.sql.functions.{col, max}
val aggregatedDF = initialDF.select("A","B","C","E").groupBy("A","B","C").agg(max("E").as("E_max"))

Then join the initial and aggregated DataFrames, keeping only the rows whose date equals the maximum:

initialDF.join(aggregatedDF, List("A","B","C")).filter(col("E") === col("E_max"))

If the initial DataFrame comes from Hive, all of this can be simplified.
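
For instance, assuming the data is available as a table named transactions (the table name is assumed here for illustration) and spark is the active SparkSession, the aggregation and the join can be expressed in a single SQL statement:

// "transactions" is an assumed table name
val result = spark.sql("""
  SELECT t.*
  FROM transactions t
  JOIN (SELECT A, B, C, max(E) AS E_max
        FROM transactions
        GROUP BY A, B, C) m
    ON t.A = m.A AND t.B = m.B AND t.C = m.C AND t.E = m.E_max
""")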

pasha701
val initialDF = Seq((1,1,1,1,"2/28/2017 0:00"),(1,1,1,2,"3/1/2017 0:00"),
  (1,1,1,3,"3/1/2017 0:00"),(2,2,2,1,"2/28/2017 0:00"),(2,2,2,2,"2/25/2017 0:00"))

This will miss out on the corresponding column D:

initialDF
.toDS.groupBy("_1","_2","_3")
.agg(max(col("_5"))).show 

In case you want the corresponding column D for the max column E:

initialDF.toDS.map(x => (x._1, x._2, x._3, (x._5, x._4))).groupBy("_1","_2","_3")
  .agg(max(col("_4")).as("_4"))
  .select(col("_1"), col("_2"), col("_3"), col("_4._2"), col("_4._1")).show

For reduceByKey you can convert the Dataset to a pair RDD and then work off it. It should be faster in case Catalyst is not able to optimize the groupByKey in the first approach. Refer to Rolling your own reduceByKey in Spark Dataset.
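
A rough sketch of that pair-RDD variant, assuming column E is in a form that compares correctly (for example a parsed timestamp or a sortable date string):

// Key each row by (A, B, C) and keep the row with the larger E per key
val deduped = initialDF.toDS.rdd
  .map(row => ((row._1, row._2, row._3), row))
  .reduceByKey((a, b) => if (a._5 >= b._5) a else b)
  .values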

sourabh