I have a DataFrame of the form:
+---+---+----+
| A| B|dist|
+---+---+----+
| a1| b1| 1.0|
| a1| b2| 2.0|
| a2| b1|10.0|
| a2| b2|10.0|
| a2| b3| 2.0|
| a3| b1|10.0|
+---+---+----+
and, for a fixed max_rank = 2, I want to obtain the following (rows tied on dist share a rank, as b1 and b2 do for a2):
+---+---+----+----+
| A| B|dist|rank|
+---+---+----+----+
| a3| b1|10.0| 1|
| a2| b3| 2.0| 1|
| a2| b1|10.0| 2|
| a2| b2|10.0| 2|
| a1| b1| 1.0| 1|
| a1| b2| 2.0| 2|
+---+---+----+----+
The classic way to do this is the following:
from pyspark.sql import Window
from pyspark.sql.functions import rank
from pyspark.sql.types import StructType, StructField, StringType, FloatType

df = sqlContext.createDataFrame(
    [("a1", "b1", 1.), ("a1", "b2", 2.), ("a2", "b1", 10.),
     ("a2", "b2", 10.), ("a2", "b3", 2.), ("a3", "b1", 10.)],
    schema=StructType([StructField("A", StringType(), True),
                       StructField("B", StringType(), True),
                       StructField("dist", FloatType(), True)]))
win = Window.partitionBy(df['A']).orderBy(df['dist'])
out = df.withColumn('rank', rank().over(win))
out = out.filter('rank <= 2')
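Calling out.show() then gives the table above (modulo row ordering).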
However, this solution is inefficient because the window function sorts every group by dist (the orderBy), even though only the top max_rank rows per group are needed.
Is there another solution in PySpark? For example, something similar to the RDD method .top(k, key=...)?
I found a similar answer here, but it uses Scala instead of Python.
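To illustrate the kind of per-group top-k I mean, here is a rough, untested sketch using aggregateByKey with a bounded buffer (heapq.nsmallest), so each key only ever tracks k candidates instead of sorting its whole group. Note that it keeps exactly k rows per key, so it does not reproduce rank()'s tie behaviour (one of a2's tied rows would be dropped), and I have not benchmarked it against the window version:

import heapq

k = 2  # max_rank

def add(acc, value):
    # acc holds at most k (dist, B) pairs, sorted ascending by dist
    return heapq.nsmallest(k, acc + [value])

def merge(acc1, acc2):
    # merge the per-partition buffers, again capped at k
    return heapq.nsmallest(k, acc1 + acc2)

topk = (df.rdd
        .map(lambda r: (r['A'], (r['dist'], r['B'])))
        .aggregateByKey([], add, merge)
        .flatMap(lambda kv: [(kv[0], b, d, i + 1)  # rank = position in the sorted buffer
                             for i, (d, b) in enumerate(kv[1])]))

out = sqlContext.createDataFrame(topk, ['A', 'B', 'dist', 'rank'])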