
I use pandas frequently and often run code like the following:

df['var_rank'] = df['var'].rank(pct=True)
print(df.var_rank.max())

I often get values greater than 1. It happens whether I keep or drop NA values. This is easy to fix (just divide by the largest rank), so I'm not asking for a work-around; I'm just curious why it happens and haven't found any clues online.
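For context, here is a minimal sketch with made-up data showing the behavior I expect: on a clean numeric column, the maximum pct rank is exactly 1.0.

import pandas as pd

# Minimal sketch with made-up values: on clean numeric data,
# pct ranks are bounded above by exactly 1.0.
df = pd.DataFrame({'var': [3.0, 1.0, 4.0, 1.5]})
df['var_rank'] = df['var'].rank(pct=True)
print(df.var_rank.max())  # 1.0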

Anyone know why this happens?

Some very simple example data here (dropbox link - pickled pandas series).

I get a value of 1.0156 from df.rank(pct=True).max(). I've had other data with values as high as 4 or 5. I'm usually using pretty messy data.

benten

1 Answer


You have bad data.

>>> s.rank(pct=True).max()
1.015625

>>> s.sort_values(inplace=True)
>>> s.tail(7)
8      202512882
6      253661077
102            -
101            -
99             -
58             -
116            -
Name: Total Assets, dtype: object

>>> s[s != u'-'].rank(pct=True).max()
1.0

In pandas 0.18.0 (released last week), you can specify numeric_only:

s.rank(pct=True, numeric_only=True)

I tried the above in 0.18.0 and couldn't get it to work, so alternatively you can rank all float and int values like this:

>>> s[s.apply(lambda x: isinstance(x, (int, float)))].rank(pct=True).max()
1.0

It creates a boolean mask making sure each value is an int or float, and then ranks the filtered result.
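A related approach (a sketch, not something from the original answer): instead of type-checking each value, coerce the series to numeric. pd.to_numeric with errors='coerce' turns anything non-numeric into NaN, and rank() skips NaN by default, both in the ranks and in the pct denominator.

import pandas as pd

# Sketch with made-up values echoing the tail output above;
# '-' stands in for the junk entries in the real data.
s = pd.Series([202512882, 253661077, '-', '-', 150000000], name='Total Assets')

# errors='coerce' converts non-numeric entries to NaN;
# rank() keeps NaN out of both the ranks and the pct denominator.
numeric = pd.to_numeric(s, errors='coerce')
print(numeric.rank(pct=True).max())  # 1.0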

Alexander
  • I assumed it was from these non-numeric observations, but my intuition is that those entries would bound percentiles below 1 (e.g. if half the data is bad then my highest ranked observation would be .5). Anyway, good to know about the update. – benten Mar 17 '16 at 22:49