I have a list of values, say:

import numpy as np
from pandas import DataFrame

df = DataFrame({'key1' : ['a', 'a', 'b', 'b', 'a'],
                'key2' : ['one', 'two', 'one', 'two', 'one'],
                'data1' : abs(np.random.randn(5)*100),
                'data2' : np.random.randn(5)})

That is my data.

I want to return only the rows with the top 3 values of data1, keeping all 4 columns.

What would be the best way to do this, other than the long chain of if statements I have in mind?

I was looking into nlargest, but I'm not sure how to apply it here.

======================== Update =========================

Running the code above produces a DataFrame (the original post showed it as a screenshot).

I would like to get back a DataFrame containing only the rows with index 1, 2, and 3, because they hold the three highest values of data1 (98, 94, 95).

JPC
  • I understand that you want to write a function that returns only the top 3 values, but I'm not quite sure which top 3 values. Can you give an example where you fully specify (all numbers/strings, no calls to numpy) the input to this function and the output? – Sam Mussmann Oct 13 '13 at 20:47

2 Answers

In [271]: df
Out[271]: 
      data1     data2 key1 key2
0 -1.318436  0.829593    a  one
1  0.172596 -0.541057    a  two
2 -2.071856 -0.181943    b  one
3  0.183276 -1.889666    b  two
4  0.558144 -1.016027    a  one

In [272]: df.iloc[df['data1'].values.argsort()[-3:]]
Out[272]: 
      data1     data2 key1 key2
1  0.172596 -0.541057    a  two
3  0.183276 -1.889666    b  two
4  0.558144 -1.016027    a  one
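To make the argsort idea concrete, here is a minimal sketch on fixed data. The values 98, 94, and 95 for rows 1, 2, and 3 come from the question; every other number is a hypothetical filler chosen only so the result is reproducible. Note that `.ix` was removed in pandas 1.0, so this sketch uses positional `iloc` instead.

```python
import numpy as np
import pandas as pd

# Rows 1-3 use the data1 values quoted in the question (98, 94, 95);
# all other numbers are hypothetical fillers, not from the original post.
df = pd.DataFrame({
    'key1': ['a', 'a', 'b', 'b', 'a'],
    'key2': ['one', 'two', 'one', 'two', 'one'],
    'data1': [12.0, 98.0, 94.0, 95.0, 30.0],
    'data2': [0.1, -0.5, 0.3, 1.2, -0.7],
})

# np.argsort returns the positions that would sort the column in ascending order.
order = np.argsort(df['data1'].to_numpy())

# The last three positions therefore point at the three largest values.
top3 = df.iloc[order[-3:]]
print(top3)
```

The selected rows come back in ascending order of data1 (index 2, 3, 1 here), and all four columns are kept.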

Although heapq.nlargest may be theoretically more efficient (O(n log k) for the top k of n items, versus O(n log n) for a full sort), in practice, even for fairly large DataFrames, argsort tends to be quicker:

import heapq
import numpy as np
import pandas as pd
df = pd.DataFrame({'key1' : ['a', 'a', 'b', 'b', 'a']*10000,
                 'key2' : ['one', 'two', 'one', 'two', 'one']*10000,
                 'data1' : np.random.randn(50000),
                 'data2' : np.random.randn(50000)})

In [274]: %timeit df.ix[df['data1'].argsort()[-3:]]
100 loops, best of 3: 5.62 ms per loop

In [275]: %timeit df.iloc[heapq.nlargest(3, df.index, key=lambda x: df['data1'].iloc[x])]
1 loops, best of 3: 1.03 s per loop
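Since the question mentions nlargest, it may be worth noting that current pandas releases also provide a `DataFrame.nlargest` method, which avoids the per-row Python callback that slows down the `heapq` version. The sketch below is not part of the timings above; the data is hypothetical apart from the 98, 94, and 95 quoted in the question.

```python
import pandas as pd

# Rows 1-3 use the data1 values quoted in the question (98, 94, 95);
# all other numbers are hypothetical fillers, not from the original post.
df = pd.DataFrame({
    'key1': ['a', 'a', 'b', 'b', 'a'],
    'key2': ['one', 'two', 'one', 'two', 'one'],
    'data1': [12.0, 98.0, 94.0, 95.0, 30.0],
    'data2': [0.1, -0.5, 0.3, 1.2, -0.7],
})

# Keep the 3 rows with the largest data1, returned in descending order of data1.
top3_desc = df.nlargest(3, 'data1')
print(top3_desc)

# For comparison: a full descending sort followed by taking the first 3 rows.
top3_sorted = df.sort_values('data1', ascending=False).head(3)
```

On this data both expressions select rows 1, 3, and 2, in that order. Which one is faster on a given DataFrame is best checked with `%timeit` rather than assumed.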
unutbu
  • Yes!! Thank you, it took me two hours to figure this out. – JPC Oct 13 '13 at 20:56
  • I'm not 100% sure those are fair comparisons, though – isn't iloc doing a linear search? (I know very little about pandas.) – kojiro Oct 13 '13 at 21:27
  • @kojiro: `iloc` is doing integer indexing of an array, so it should be O(1), not O(n). – unutbu Oct 13 '13 at 21:37

Sort in descending order by the value of the data1 column:

df.sort_values('data1', ascending=False)[:3]
user278064
  • A sort would be O(n lg(n)) in the average case. `heapq.nsmallest` would be a more efficient way to get the n smallest values. (There's a `heapq.nlargest`, too, of course.) – kojiro Oct 13 '13 at 21:06
  • @kojiro: I didn't know about it. Thanks a lot! :) – user278064 Oct 13 '13 at 21:09