
This is similar to a pandas groupby where you take the max of one column and the min of another. The DataFrame df describes car races and has the columns [Race_Event_ID, Car_ID, Driver_ID, Speed]. A group-by such as `df.groupby(['Race_Event_ID'])['Speed'].agg(['max', 'mean', ...])` gives per-group statistics. But I need the Speed, Car_ID and Driver_ID of the topper, i.e. the row whose Speed equals the group's max speed, and likewise the tailer's Speed, Car_ID and Driver_ID, i.e. the row whose Speed equals the group's min speed.

Given that there could be a tie for the top speed, let's get both tied rows, or at least one.

One solution is:

df_max = df.groupby(['Race_Event_ID'])['Speed'].agg(['max']).reset_index()

df_merged = df_max.merge(df, on='Race_Event_ID')
df_top = df_merged[df_merged['Speed'] == df_merged['max']]

This solution may take time, and if we were to iterate over the data ourselves, both the topper and the tailer could be found within a single pass. Is there an efficient way to do this in pandas? I would also be curious to see a Spark DataFrame solution.
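For reference, a minimal sketch of the single-pass pandas approach I have in mind (assuming ties should be kept, hence transform rather than idxmax/idxmin; untested, column names as above):

grp = df.groupby('Race_Event_ID')['Speed']
topper = df[df['Speed'] == grp.transform('max')]  # every row tying for the max speed in its race
tailer = df[df['Speed'] == grp.transform('min')]  # every row tying for the min speed in its race

topper_info = topper[['Race_Event_ID', 'Car_ID', 'Driver_ID', 'Speed']]
tailer_info = tailer[['Race_Event_ID', 'Car_ID', 'Driver_ID', 'Speed']]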

user2458922
  • Do you just need the topper and tailer? Like topper: `df.loc[df.groupby('Race_Event_ID').Speed.idxmax()]`? If not, please add your sample data and expected output. – Emma Jun 23 '23 at 15:12

1 Answer


Instead of using merge, you can try window functions:

from pyspark.sql import SparkSession
from pyspark.sql.functions import max as spark_max, min as spark_min
from pyspark.sql.window import Window


spark = SparkSession.builder.getOrCreate()


# per-race window
window = Window.partitionBy('Race_Event_ID')

# attach the per-race max and min speeds to every row
df = df.withColumn('Max_Speed', spark_max('Speed').over(window))
df = df.withColumn('Min_Speed', spark_min('Speed').over(window))

# keep the rows that tie for the max / min speed in their race
topper = df.filter(df['Speed'] == df['Max_Speed'])
tailer = df.filter(df['Speed'] == df['Min_Speed'])

# project the columns of interest
topper_info = topper.select('Race_Event_ID', 'Car_ID', 'Driver_ID', 'Speed')
tailer_info = tailer.select('Race_Event_ID', 'Car_ID', 'Driver_ID', 'Speed')
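
As a quick sanity check, here is a small made-up DataFrame (values purely illustrative) showing that ties are kept:

data = [
    (1, 'C1', 'D1', 210.0),
    (1, 'C2', 'D2', 195.5),
    (1, 'C3', 'D3', 210.0),  # tie for top speed in race 1
    (2, 'C4', 'D4', 180.0),
    (2, 'C5', 'D5', 200.0),
]
df = spark.createDataFrame(data, ['Race_Event_ID', 'Car_ID', 'Driver_ID', 'Speed'])

# ... then run the window/filter/select steps above on this df ...

topper_info.show()  # race 1 lists both C1 and C3, race 2 lists C5
tailer_info.show()  # race 1 lists C2, race 2 lists C4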
Omar Tougui
  • Thanks Omar. However, I may still need to join topper_info and tailer_info to get the final output. The bigger question is: does Spark internally iterate over the same data multiple times to achieve this? Can that be avoided? – user2458922 Jun 26 '23 at 13:59