
I have a DataFrame:

>>> df = pd.DataFrame({'row1' : [1,2,np.nan,4,5], 'row2' : [11,12,13,14,np.nan], 'row3':[22,22,23,24,25]}, index = 'a b c d e'.split()).T
>>> df
         a     b     c     d     e
row1   1.0   2.0   NaN   4.0   5.0
row2  11.0  12.0  13.0  14.0   NaN
row3  22.0  22.0  23.0  24.0  25.0

and a Series that specifies how many top values (N) I want from each row:

>>> n_max = pd.Series([2,3,4])

What is the pandas way of using df and n_max to find the N largest elements of each row (breaking ties with a random pick, just as .nlargest() would do)?

The desired output is

         a     b     c     d     e
row1   NaN   NaN   NaN   4.0   5.0
row2   NaN  12.0  13.0  14.0   NaN
row3  22.0   NaN  23.0  24.0  25.0

I know how to do this with a uniform/fixed N across all rows (say, N=4). Note the tie-breaking in row3:

>>> df.stack().groupby(level=0).nlargest(4).unstack().reset_index(level=1, drop=True).reindex(columns=df.columns)
         a     b     c     d     e
row1   1.0   2.0   NaN   4.0   5.0
row2  11.0  12.0  13.0  14.0   NaN
row3  22.0   NaN  23.0  24.0  25.0

But the goal, again, is to have a row-specific N. Looping through each row obviously doesn't count (for performance reasons), and I've tried using .rank() with a mask, but tie-breaking doesn't work there...

Zhang18
  • `row3` has its second element `NaN`-ed, whereas the first element is tied with it. So, any reason why the second one was picked? Does it matter which of those is picked? – Divakar May 19 '17 at 14:56
  • It doesn't matter. That's exactly why I mentioned the behavior of `.nlargest()`, which has that random tie-breaking built in. – Zhang18 May 19 '17 at 15:07
  • `.rank` has a `method` parameter that handles all sorts of "ties". – Scott Boston May 19 '17 at 15:17
  • @ScottBoston, thank you. That tip alone solved my problem. I'll use `method='first'`. – Zhang18 May 19 '17 at 15:28
  • Since we were looking for performance, did you check out this post: http://stackoverflow.com/a/44074289/3293881 ? – Divakar May 22 '17 at 07:38

2 Answers


Based on @ScottBoston's comment on the OP, it is possible to solve this problem with the following rank-based mask:

>>> n_max.index = df.index
>>> df_rank = df.stack(dropna=False).groupby(level=0).rank(ascending=False, method='first').unstack()
>>> selected = df_rank.le(n_max, axis=0)
>>> df[selected]
         a     b     c     d     e
row1   NaN   NaN   NaN   4.0   5.0
row2   NaN  12.0  13.0  14.0   NaN
row3  22.0   NaN  23.0  24.0  25.0
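
The tie-breaking comes from `method='first'`: equal values get distinct ranks in their order of appearance, so exactly one of the two tied 22.0s in row3 falls outside that row's top 4. A quick illustration on that row alone:

>>> pd.Series([22.0, 22.0, 23.0, 24.0, 25.0]).rank(ascending=False, method='first')
0    4.0
1    5.0
2    3.0
3    2.0
4    1.0
dtype: float64

The second 22.0 (column b) gets rank 5, fails the `le` comparison against that row's n_max of 4, and is the one masked out, matching the desired output.
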
Zhang18

For performance, I would suggest NumPy -

def mask_variable_largest_per_row(df, n_max):
    a = df.values  # for an all-float frame this is typically a view, so df is modified in place
    m,n = a.shape

    # Per-row count of entries to reset: everything beyond the top n_max,
    # excluding entries that are already NaN
    nan_row_count = np.isnan(a).sum(1)
    n_reset = n-n_max.values-nan_row_count
    n_reset.clip(min=0, max=n-1, out = n_reset)

    # argsort puts NaNs last, so the first n_reset indices of each row
    # point at its smallest non-NaN values
    sidx = a.argsort(1)
    mask = n_reset[:,None] > np.arange(n)

    # Row/column coordinates of the values to be NaN-ed out
    c = sidx[mask]
    r = np.repeat(np.arange(m), n_reset)
    a[r,c] = np.nan
    return df

Sample run -

In [182]: df
Out[182]: 
         a     b     c     d     e
row1   1.0   2.0   NaN   4.0   5.0
row2  11.0  12.0  13.0  14.0   NaN
row3  22.0  22.0   5.0  24.0  25.0

In [183]: n_max = pd.Series([2,3,2])

In [184]: mask_variable_largest_per_row(df, n_max)
Out[184]: 
         a     b     c     d     e
row1   NaN   NaN   NaN   4.0   5.0
row2   NaN  12.0  13.0  14.0   NaN
row3   NaN   NaN   NaN  24.0  25.0

Further boost: bringing in numpy.argpartition to replace numpy.argsort should help, as we don't care about the order of the indices that get reset to NaN. Thus, a numpy.argpartition-based version would be -

def mask_variable_largest_per_row_v2(df, n_max):
    a = df.values  # again typically a view, so df is modified in place
    m,n = a.shape

    nan_row_count = np.isnan(a).sum(1)
    n_reset = n-n_max.values-nan_row_count
    n_reset.clip(min=0, max=n-1, out = n_reset)

    # Partition point: the largest n - n_max over all rows, clipped to a valid column index
    N = (n-n_max.values).max()
    N = np.clip(N, a_min=0, a_max=n-1)

    # Partial sort instead of a full sort
    sidx = a.argpartition(N, axis=1) #sidx = a.argsort(1)
    mask = n_reset[:,None] > np.arange(n)

    c = sidx[mask]
    r = np.repeat(np.arange(m), n_reset)
    a[r,c] = np.nan
    return df
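
For intuition on the boost: numpy.argpartition(a, N, axis=1) only promises that the N smallest entries of each row land in the first N slots, in no particular order, rather than fully sorting every row. A minimal sketch of that guarantee on toy data (not the frame from the sample run):

import numpy as np

x = np.array([11., 14., 12., np.nan, 13.])
idx = np.argpartition(x, 2)    # kth=2: the 3 smallest values occupy idx[:3], in no guaranteed order
print(np.sort(x[idx[:3]]))     # [11. 12. 13.] -- NaN is treated as largest, same as with argsort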

Runtime test

Other approaches -

def pandas_rank_based(df, n_max):
    n_max.index = df.index
    df_rank = df.stack(dropna=False).groupby(level=0).rank\
               (ascending=False, method='first').unstack()
    selected = df_rank.le(n_max, axis=0)
    return df[selected]

Verification and timings -

In [387]: arr = np.random.rand(1000,1000)
     ...: arr.ravel()[np.random.choice(arr.size, 10000, replace=0)] = np.nan
     ...: df1 = pd.DataFrame(arr)
     ...: df2 = df1.copy()
     ...: df3 = df1.copy()
     ...: n_max = pd.Series(np.random.randint(0,1000,(1000)))
     ...: 
     ...: out1 = pandas_rank_based(df1, n_max)
     ...: out2 = mask_variable_largest_per_row(df2, n_max)
     ...: out3 = mask_variable_largest_per_row_v2(df3, n_max)
     ...: print np.nansum(out1-out2)==0 # Verify
     ...: print np.nansum(out1-out3)==0 # Verify
     ...: 
True
True

In [388]: arr = np.random.rand(1000,1000)
     ...: arr.ravel()[np.random.choice(arr.size, 10000, replace=0)] = np.nan
     ...: df1 = pd.DataFrame(arr)
     ...: df2 = df1.copy()
     ...: df3 = df1.copy()
     ...: n_max = pd.Series(np.random.randint(0,1000,(1000)))
     ...: 

In [389]: %timeit pandas_rank_based(df1, n_max)
1 loops, best of 3: 559 ms per loop

In [390]: %timeit mask_variable_largest_per_row(df2, n_max)
10 loops, best of 3: 34.1 ms per loop

In [391]: %timeit mask_variable_largest_per_row_v2(df3, n_max)
100 loops, best of 3: 5.92 ms per loop

Pretty good speedups there, around 16x for the argsort version and 90x+ for the argpartition one over the rank-based pandas approach!

Divakar