
I am trying to find a way to compute a cumulative placement (a running rank) that accounts for ties in Pandas.

Let's take hypothetical data from a track meet, where I have people, races, heats, and times.

Each person's placement is according to the following:

For a given race/heat combination:

  • The person with the lowest time placed first
  • The person with the second lowest time placed second

and so on...

This would be fairly simple to code, if not for one thing...

If two people have the same time, they both get the same place, and the next time greater than theirs gets that place + 1.

In the table below, for the 100 Yard Dash, heat 1, RUNNER1 finished first, RUNNER2/RUNNER3 finished second, and RUNNER4 finished third (the next time after RUNNER2/RUNNER3).

So basically, the logic is as follows:

If race <> race.shift() or heat <> heat.shift() then place = 1

If race = race.shift() and heat = heat.shift() and time > time.shift() then place = place.shift() + 1

If race = race.shift() and heat = heat.shift() and time = time.shift() then place = place.shift()

The part that confuses me is how to handle the ties. Otherwise I could do something like

df['Place'] = np.where(
    (df['Race'] == df['Race'].shift())
    & (df['Heat'] == df['Heat'].shift()),
    df['Place'].shift() + 1,
    1)

Thank you!

Sample data follows:

Person,Race,Heat,Time
RUNNER1,100 Yard Dash,1,9.87
RUNNER2,100 Yard Dash,1,9.92
RUNNER3,100 Yard Dash,1,9.92
RUNNER4,100 Yard Dash,1,9.96
RUNNER5,100 Yard Dash,1,9.97
RUNNER6,100 Yard Dash,1,10.01
RUNNER7,100 Yard Dash,2,9.88
RUNNER8,100 Yard Dash,2,9.93
RUNNER9,100 Yard Dash,2,9.93
RUNNER10,100 Yard Dash,2,10.03
RUNNER11,100 Yard Dash,2,10.26
RUNNER7,200 Yard Dash,1,19.63
RUNNER8,200 Yard Dash,1,19.67
RUNNER9,200 Yard Dash,1,19.72
RUNNER10,200 Yard Dash,1,19.72
RUNNER11,200 Yard Dash,1,19.86
RUNNER12,200 Yard Dash,1,19.92

What I want at the end is:

Person,Race,Heat,Time,Place
RUNNER1,100 Yard Dash,1,9.87,1
RUNNER2,100 Yard Dash,1,9.92,2
RUNNER3,100 Yard Dash,1,9.92,2
RUNNER4,100 Yard Dash,1,9.96,3
RUNNER5,100 Yard Dash,1,9.97,4
RUNNER6,100 Yard Dash,1,10.01,5
RUNNER7,100 Yard Dash,2,9.88,1
RUNNER8,100 Yard Dash,2,9.93,2
RUNNER9,100 Yard Dash,2,9.93,2
RUNNER10,100 Yard Dash,2,10.03,3
RUNNER11,100 Yard Dash,2,10.26,4
RUNNER7,200 Yard Dash,1,19.63,1
RUNNER8,200 Yard Dash,1,19.67,2
RUNNER9,200 Yard Dash,1,19.72,3
RUNNER10,200 Yard Dash,1,19.72,3
RUNNER11,200 Yard Dash,1,19.86,4
RUNNER12,200 Yard Dash,1,19.92,5

[edit] Now, one step further...

Let's assume that once I leave a set of unique values, the next time that set comes up, the placements reset to 1.

So, for example, note that the data below goes to heat 1, then heat 2, and back to heat 1. I don't want the rankings to continue from the prior heat 1; rather, I want them to reset.

Person,Race,Heat,Time,Place
RUNNER1,100 Yard Dash,1,9.87,1
RUNNER2,100 Yard Dash,1,9.92,2
RUNNER3,100 Yard Dash,1,9.92,2
RUNNER4,100 Yard Dash,2,9.96,1
RUNNER5,100 Yard Dash,2,9.97,2
RUNNER6,100 Yard Dash,2,10.01,3
RUNNER7,100 Yard Dash,1,9.88,1
RUNNER8,100 Yard Dash,1,9.93,2
RUNNER9,100 Yard Dash,1,9.93,2

1 Answer


You could use:

grouped = df.groupby(['Race', 'Heat'])
df['Place'] = grouped['Time'].transform(lambda x: pd.factorize(x, sort=True)[0] + 1)
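
Here, pd.factorize(x, sort=True) encodes each time within the group against the sorted unique times, so tied times share a code and each new, slower time increases the code by exactly 1; adding 1 makes the placements start at 1. A minimal illustration of that behavior (the short list of times is just an example):

import pandas as pd

# Tied values get the same code; codes increase by 1 per distinct sorted value.
codes, uniques = pd.factorize([9.92, 9.87, 9.92, 9.96], sort=True)
print(codes)      # [1 0 1 2]
print(uniques)    # [9.87 9.92 9.96]
print(codes + 1)  # placements: [2 1 2 3]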

import pandas as pd

df = pd.DataFrame({
    'Heat': [1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 1, 1, 1, 1, 1, 1],
    'Person': ['RUNNER1', 'RUNNER2', 'RUNNER3', 'RUNNER4', 'RUNNER5', 'RUNNER6',
               'RUNNER7', 'RUNNER8', 'RUNNER9', 'RUNNER10', 'RUNNER11',
               'RUNNER7', 'RUNNER8', 'RUNNER9', 'RUNNER10', 'RUNNER11', 'RUNNER12'],
    'Race': ['100 Yard Dash'] * 11 + ['200 Yard Dash'] * 6,
    'Time': [9.87, 9.92, 9.92, 9.96, 9.97, 10.01, 9.88, 9.93, 9.93, 10.03, 10.26,
             19.63, 19.67, 19.72, 19.72, 19.86, 19.92]})

grouped = df.groupby(['Race', 'Heat'])
df['Place'] = grouped['Time'].transform(lambda x: pd.factorize(x, sort=True)[0] + 1)
df['Rank'] = grouped['Time'].rank(method='min')
print(df)

yields

    Heat    Person           Race   Time  Place  Rank
0      1   RUNNER1  100 Yard Dash   9.87    1.0   1.0
1      1   RUNNER2  100 Yard Dash   9.92    2.0   2.0
2      1   RUNNER3  100 Yard Dash   9.92    2.0   2.0
3      1   RUNNER4  100 Yard Dash   9.96    3.0   4.0
4      1   RUNNER5  100 Yard Dash   9.97    4.0   5.0
5      1   RUNNER6  100 Yard Dash  10.01    5.0   6.0
6      2   RUNNER7  100 Yard Dash   9.88    1.0   1.0
7      2   RUNNER8  100 Yard Dash   9.93    2.0   2.0
8      2   RUNNER9  100 Yard Dash   9.93    2.0   2.0
9      2  RUNNER10  100 Yard Dash  10.03    3.0   4.0
10     2  RUNNER11  100 Yard Dash  10.26    4.0   5.0
11     1   RUNNER7  200 Yard Dash  19.63    1.0   1.0
12     1   RUNNER8  200 Yard Dash  19.67    2.0   2.0
13     1   RUNNER9  200 Yard Dash  19.72    3.0   3.0
14     1  RUNNER10  200 Yard Dash  19.72    3.0   3.0
15     1  RUNNER11  200 Yard Dash  19.86    4.0   5.0
16     1  RUNNER12  200 Yard Dash  19.92    5.0   6.0

Note that Pandas has a GroupBy.rank method which can compute many common forms of rank. The Rank column above uses method='min', which is not the placement you described: notice how, for example, on row 3 the Rank is 4 after the tie between the second and third runners, while the Place is 3.
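
(In more recent pandas versions, rank also accepts method='dense', which appears to compute exactly this kind of placement directly. A minimal sketch, assuming your installed pandas supports it; the PlaceDense column name is just illustrative:)

# Assumption: the installed pandas supports rank(method='dense').
df['PlaceDense'] = df.groupby(['Race', 'Heat'])['Time'].rank(method='dense').astype(int)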


Regarding the edit: Use

(df['Heat'] != df['Heat'].shift()).cumsum()

to disambiguate the heats:

import pandas as pd

df = pd.DataFrame({
    'Heat': [1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 1, 1, 1, 1, 1, 1],
    'Person': ['RUNNER1', 'RUNNER2', 'RUNNER3', 'RUNNER4', 'RUNNER5', 'RUNNER6',
               'RUNNER7', 'RUNNER8', 'RUNNER9', 'RUNNER10', 'RUNNER11',
               'RUNNER7', 'RUNNER8', 'RUNNER9', 'RUNNER10', 'RUNNER11', 'RUNNER12'],
    'Race': ['100 Yard Dash'] * 17,
    'Time': [9.87, 9.92, 9.92, 9.96, 9.97, 10.01, 9.88, 9.93, 9.93, 10.03, 10.26,
             19.63, 19.67, 19.72, 19.72, 19.86, 19.92]})

df['HeatGroup'] = (df['Heat'] != df['Heat'].shift()).cumsum()
grouped = df.groupby(['Race', 'HeatGroup'])
df['Place'] = grouped['Time'].transform(lambda x: pd.factorize(x, sort=True)[0] + 1)
df['Rank'] = grouped['Time'].rank(method='min')
print(df)

yields

    Heat    Person           Race   Time  HeatGroup  Place  Rank
0      1   RUNNER1  100 Yard Dash   9.87          1    1.0   1.0
1      1   RUNNER2  100 Yard Dash   9.92          1    2.0   2.0
2      1   RUNNER3  100 Yard Dash   9.92          1    2.0   2.0
3      1   RUNNER4  100 Yard Dash   9.96          1    3.0   4.0
4      1   RUNNER5  100 Yard Dash   9.97          1    4.0   5.0
5      1   RUNNER6  100 Yard Dash  10.01          1    5.0   6.0
6      2   RUNNER7  100 Yard Dash   9.88          2    1.0   1.0
7      2   RUNNER8  100 Yard Dash   9.93          2    2.0   2.0
8      2   RUNNER9  100 Yard Dash   9.93          2    2.0   2.0
9      2  RUNNER10  100 Yard Dash  10.03          2    3.0   4.0
10     2  RUNNER11  100 Yard Dash  10.26          2    4.0   5.0
11     1   RUNNER7  100 Yard Dash  19.63          3    1.0   1.0
12     1   RUNNER8  100 Yard Dash  19.67          3    2.0   2.0
13     1   RUNNER9  100 Yard Dash  19.72          3    3.0   3.0
14     1  RUNNER10  100 Yard Dash  19.72          3    3.0   3.0
15     1  RUNNER11  100 Yard Dash  19.86          3    4.0   5.0
16     1  RUNNER12  100 Yard Dash  19.92          3    5.0   6.0
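
The (df['Heat'] != df['Heat'].shift()).cumsum() expression works because the comparison is True exactly on the rows where Heat differs from the previous row, so the running sum of those booleans assigns a new group number to every consecutive run of identical Heat values. A quick illustration on a heat sequence like the one in your edit:

import pandas as pd

heats = pd.Series([1, 1, 1, 2, 2, 2, 1, 1, 1])
# True wherever the heat changes from the previous row; cumsum numbers each run.
print((heats != heats.shift()).cumsum().tolist())  # [1, 1, 1, 2, 2, 2, 3, 3, 3]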
unutbu
  • Great answer. I imagine that controlling the accuracy used by the comparisons would require something more explicit like modifying the numbers yourself through math.ceil() or whatever. – Ma0 Jul 07 '16 at 13:15
  • @Ev.Kounis: Yes, you could use something like `df['Time'] = df['Time'].round(2)` to round all the times to 2 decimal digits before using `groupby/transform` or `groupby/rank`. – unutbu Jul 07 '16 at 13:25
  • Thank you! I added a little twist at the end... it's a little confusing to explain why I need to do this, but any ideas on how I can accomplish that objective as well? – Stumbling Through Data Science Jul 07 '16 at 13:26
  • very elegant solution! – MaxU - stand with Ukraine Jul 07 '16 at 14:57