I have a DataFrame of the form:

+---+---+----+
|  A|  B|dist|
+---+---+----+
| a1| b1| 1.0|
| a1| b2| 2.0|
| a2| b1|10.0|
| a2| b2|10.0|
| a2| b3| 2.0|
| a3| b1|10.0|
+---+---+----+

and, for a fixed max_rank=2, I want to obtain the following one:

+---+---+----+----+
|  A|  B|dist|rank|
+---+---+----+----+
| a3| b1|10.0|   1|
| a2| b3| 2.0|   1|
| a2| b1|10.0|   2|
| a2| b2|10.0|   2|
| a1| b1| 1.0|   1|
| a1| b2| 2.0|   2|
+---+---+----+----+

The classical way to do this is the following:

from pyspark.sql import Window
from pyspark.sql.functions import rank
from pyspark.sql.types import StructType, StructField, StringType, FloatType

df = sqlContext.createDataFrame(
    [("a1", "b1", 1.), ("a1", "b2", 2.), ("a2", "b1", 10.), ("a2", "b2", 10.), ("a2", "b3", 2.), ("a3", "b1", 10.)],
    schema=StructType([StructField("A", StringType(), True), StructField("B", StringType(), True), StructField("dist", FloatType(), True)]))
win = Window.partitionBy(df['A']).orderBy(df['dist'])
out = df.withColumn('rank', rank().over(win))
out = out.filter('rank <= 2')

However, this solution is inefficient because the window function relies on an orderBy.

Is there another solution in PySpark? For example, a method similar to .top(k, key=...) for RDDs?

I found a similar answer here, but it uses Scala instead of Python.
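
To make the question concrete, here is a minimal sketch of the kind of alternative I have in mind, working on the underlying RDD with aggregateByKey and a small per-key heap (heapq.nsmallest). It is only meant to illustrate the intent, not a claim that it is faster; note also that, unlike rank(), it keeps at most max_rank rows per key, so it would not reproduce the tied ranks shown for a2 above.

import heapq

max_rank = 2

# Keep the max_rank smallest dist values per key A by merging small
# per-partition lists instead of fully sorting each group.
pairs = df.rdd.map(lambda r: (r['A'], (r['dist'], r['B'])))
topk = pairs.aggregateByKey(
    [],                                                    # start with an empty list per key
    lambda acc, x: heapq.nsmallest(max_rank, acc + [x]),   # fold one record into the running top-k
    lambda a, b: heapq.nsmallest(max_rank, a + b))         # merge partial top-k lists across partitions
# topk is an RDD of (A, [(dist, B), ...]) with at most max_rank entries per key.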

Comment: I think the window approach is as efficient as the answers in the referenced page. Your sorting will be done within partitions (on the same node), so it doesn't look bad. The answers in the referenced page need shuffling, and they use grouping and sorting too, so they look equally efficient to me. – ZygD Sep 15 '22 at 13:08
