I have a dataframe grouped by 'id' and 'type':

+---+----+-----+
| id|type|count|
+---+----+-----+
|  0|   A|    2|
|  0|   B|    3|
|  0|   C|    1|
|  0|   D|    3|
|  0|   G|    1|
|  1|   A|    0|
|  1|   C|    1|
|  1|   D|    1|
|  1|   G|    2|
+---+----+-----+

I would now like to group by 'id' and get the sum of the 3 largest values:

+---+-----+
| id|count|
+---+-----+
|  0|    8|
|  1|    4|
+---+-----+

How can I do this in PySpark so that the computation is relatively efficient?

I found a solution here.

1 Answer

You can use a window function with lead to do this:

from pyspark.sql import functions as F
from pyspark.sql.window import Window

# Sample data; note that it contains exactly 4 rows per id
# (the question's row (0, "G", 1) is not included here).
df = spark.createDataFrame([
    (0, "A", 2),
    (0, "B", 3),
    (0, "C", 1),
    (0, "D", 3),
    (1, "A", 0),
    (1, "C", 1),
    (1, "D", 1),
    (1, "G", 2)
], ("id", "type", "count"))

# Order each id's rows by count; lead() pulls the next row's count,
# so each group's smallest count effectively falls away.
my_window = Window.partitionBy("id").orderBy("count")
df.withColumn("last_3", F.lead("count").over(my_window)) \
  .groupBy("id").agg(F.sum("last_3")).show()

Output:

+---+-----------+
| id|sum(last_3)|
+---+-----------+
|  0|          8|
|  1|          4|
+---+-----------+

Details: the window partitions your data by id and orders it by count. The new column uses lead over this window, which returns the next value within the group. So (0, C, 1), the lowest tuple in the group for id=0, receives the value 2, since that is the next highest count in the group (from the tuple (0, A, 2)), and so on. The highest tuple has no following value and is assigned null. Finally, group by id and sum the new column.
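
To make the shift concrete, here is a short sketch (reusing df, F and my_window from above) that materializes the intermediate last_3 column before aggregating; which of two tied counts gets which lead value is arbitrary, but the multiset of values is stable:

df.withColumn("last_3", F.lead("count").over(my_window)) \
  .orderBy("id", "count") \
  .show()

# Per group (tie order among equal counts may vary):
#   id=0: counts 1,2,3,3 -> last_3 = 2,3,3,null  (the smallest count is dropped)
#   id=1: counts 0,1,1,2 -> last_3 = 1,1,2,null
# Summing last_3 gives 2+3+3 = 8 for id=0 and 1+1+2 = 4 for id=1.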

Comment: The fact is that there can be any number of rows with the same id (not just 4). – Oct 24 '18 at 10:31
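
As the comment points out, summing the lead values equals each group's total minus its smallest count, which matches the sum of the 3 largest values only when every id has exactly 4 rows. For groups of arbitrary size, a sketch (not part of the original answer) that ranks rows per id with row_number and keeps the top 3 would look like this:

from pyspark.sql import functions as F
from pyspark.sql.window import Window

# Rank rows within each id from largest to smallest count.
w = Window.partitionBy("id").orderBy(F.col("count").desc())

(df.withColumn("rn", F.row_number().over(w))
   .where(F.col("rn") <= 3)  # keep at most the 3 largest counts per id
   .groupBy("id")
   .agg(F.sum("count").alias("count"))
   .show())

On the question's full data this gives 8 for id=0 (3 + 3 + 2) and 4 for id=1 (2 + 1 + 1), no matter how many rows each id has.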