
How can I remove the iterrows() loop? Can this be done faster with numpy or pandas?

import pandas as pd
import numpy as np
df = pd.DataFrame({'A': 'foo bar foo bar foo bar foo foo'.split(),
                   'B': 'one one two three two two one three'.split(),
                   'C': np.arange(8)*0  })
print(df)
#      A      B  C
# 0  foo    one  0
# 1  bar    one  0
# 2  foo    two  0
# 3  bar  three  0
# 4  foo    two  0
# 5  bar    two  0
# 6  foo    one  0
# 7  foo  three  0

selDict = {"foo":2, "bar":3}

This works:

for i, r in df.iterrows():
    if selDict[r["A"]] > 0:
        selDict[r["A"]] -= 1
        df.at[i, 'C'] = 1  # df.set_value() is deprecated; .at is the replacement

print(df)
#      A      B  C
# 0  foo    one  1
# 1  bar    one  1
# 2  foo    two  1
# 3  bar  three  1
# 4  foo    two  0
# 5  bar    two  1
# 6  foo    one  0
# 7  foo  three  0
Merlin

3 Answers


If I understood correctly, you can use cumcount:

df['C'] = (df.groupby('A').cumcount() < df['A'].map(selDict)).astype('int')

df
Out: 
     A      B  C
0  foo    one  1
1  bar    one  1
2  foo    two  1
3  bar  three  1
4  foo    two  0
5  bar    two  1
6  foo    one  0
7  foo  three  0
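
Breaking the one-liner into its intermediate pieces makes the trick visible (an illustrative sketch on the frame above; variable names are mine):

counts = df.groupby('A').cumcount()  # 0-based running count of each value of A
# counts: [0, 0, 1, 1, 2, 2, 3, 4]
limits = df['A'].map(selDict)        # per-row quota looked up from selDict
# limits: [2, 3, 2, 3, 2, 3, 2, 2]
df['C'] = (counts < limits).astype('int')  # 1 while the quota is not yet exhausted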
ayhan
  • @Merlin well it is vectorized so it is faster than iterrows, but pandas handles a lot of different things like NaNs and dtypes, so it cannot be as fast as numpy. Not sure how, though. Let's wait for Divakar. :) – ayhan Jul 16 '17 at 19:50
  • Nice trickery on conjuring me up! ;) Well, it was a lot more effort than I thought it would be. – Divakar Jul 16 '17 at 22:25
  • @Divakar Yeah, I also wasn't expecting it to be complicated. It's better though - more for me to learn. :) – ayhan Jul 16 '17 at 22:29

Here's one approach -

1) Helper functions:

def argsort_unique(idx):
    # Original idea: http://stackoverflow.com/a/41242285/3293881 by @Andras
    # Computes the inverse of the permutation idx
    n = idx.size
    sidx = np.empty(n, dtype=int)
    sidx[idx] = np.arange(n)
    return sidx

def get_bin_arr(grplens, stop1_idx):
    # Clip each group's stop count to the group's length
    count_stops_corr = np.minimum(stop1_idx, grplens)

    limsc = np.maximum(grplens, count_stops_corr)
    L = limsc.sum()

    # Start offset of each group in the concatenated output
    starts = np.r_[0, limsc[:-1].cumsum()]

    # Mark run boundaries with +1/-1, then cumsum to fill each run with 1s
    shift_arr = np.zeros(L, dtype=int)
    stops = starts + count_stops_corr
    stops = stops[stops < L]

    shift_arr[starts] += 1
    shift_arr[stops] -= 1
    bin_arr = shift_arr.cumsum()
    return bin_arr

A possibly faster alternative with a loopy, slicing-based helper function:

def get_bin_arr(grplens, stop1_idx):
    stop1_idx_corr = np.minimum(stop1_idx, grplens)  # clip stop counts to group lengths
    clens = grplens.cumsum()
    out = np.zeros(clens[-1], dtype=int)
    out[:stop1_idx_corr[0]] = 1                      # first group starts at offset 0
    for i, j in zip(clens[:-1], clens[:-1] + stop1_idx_corr[1:]):
        out[i:j] = 1                                 # flag the first rows of each later group
    return out
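
A quick sanity check of either helper on hypothetical inputs (groups are assumed contiguous, in sorted order):

>>> get_bin_arr(np.array([3, 2]), np.array([2, 1]))
array([1, 1, 0, 1, 0])
>>> get_bin_arr(np.array([3, 2]), np.array([2, 5]))
array([1, 1, 0, 1, 1])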

2) Main function:

def out_C(A, selDict):
    k = np.array(list(selDict.keys()))    # list() so this works on Python 3 dict views
    v = np.array(list(selDict.values()))
    unq, C = np.unique(A, return_counts=True)
    sidx3 = np.searchsorted(unq, k)
    lims = np.zeros(len(unq), dtype=int)
    lims[sidx3] = v                       # per-group quota, aligned with unq
    bin_arr = get_bin_arr(C, lims)        # 0/1 flags in sorted order
    sidx2 = A.argsort()
    out = bin_arr[argsort_unique(sidx2)]  # scatter the flags back to the original order
    return out
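
To see how the pieces fit, here is a hedged trace on the question's sample frame (illustrative values, not part of the original answer):

# A sorted is [bar, bar, bar, foo, foo, foo, foo, foo]; with selDict = {"foo": 2, "bar": 3},
# get_bin_arr flags the first 3 bars and the first 2 foos:
#   bin_arr -> [1, 1, 1, 1, 1, 0, 0, 0]
# argsort_unique(A.argsort()) then scatters the flags back to the original row order:
#   out     -> [1, 1, 1, 1, 0, 1, 0, 0]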

Sample runs -

Original approach :

def org_app(df, selDict):
    df['C'] = 0
    d = selDict.copy()
    for i, r in df.iterrows():
        if d[r["A"]] > 0:
            d[r["A"]] -= 1
            df.at[i, 'C'] = 1  # .at instead of the removed set_value()
    return df

Case #1:

>>> df = pd.DataFrame({'A': 'foo bar foo bar res foo bar res foo foo res'.split()})
>>> selDict = {"foo":2, "bar":3, "res":1}
>>> org_app(df, selDict)
      A  C
0   foo  1
1   bar  1
2   foo  1
3   bar  1
4   res  1
5   foo  0
6   bar  1
7   res  0
8   foo  0
9   foo  0
10  res  0
>>> out_C(df.A.values, selDict)
array([1, 1, 1, 1, 1, 0, 1, 0, 0, 0, 0])

Case #2:

>>> selDict = {"foo":20, "bar":30, "res":10}
>>> org_app(df, selDict)
      A  C
0   foo  1
1   bar  1
2   foo  1
3   bar  1
4   res  1
5   foo  1
6   bar  1
7   res  1
8   foo  1
9   foo  1
10  res  1
>>> out_C(df.A.values, selDict)
array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1])
Divakar
  • Can this help? My thoughts were to use dups and limit; this uses rank (pseudocode: apply "1", return "0" above rank). https://stackoverflow.com/questions/14671013/ranking-of-numpy-array-with-possible-duplicates – Merlin Jul 17 '17 at 00:53
  • @Merlin Hmm, don't think `rankdata` can help you here. Also, any such apply method would be essentially looping. If you are dealing with a small number of `keys`, a loopy method/apply could be a better choice. I was assuming large data and decent no. of keys. – Divakar Jul 17 '17 at 17:09
  • @Merlin Good catch! Edited. – Divakar Jul 21 '17 at 04:00

scipy.stats.rankdata can help here. To derive the rank of each element within its bucket, we take the difference between the "ordinal" and "min" methods:

>>> from scipy.stats import rankdata as rd
>>> rd(df.A, 'ordinal') - rd(df.A, 'min')
array([0, 0, 1, 1, 2, 2, 3, 4])

Then we just compare to df.A.map(selDict):

df['C'] = (rd(df.A, 'ordinal') - rd(df.A, 'min') < df.A.map(selDict)).astype(int)

This may be a little inefficient (calling rankdata twice), but using the optimized routines in scipy should make up for that.
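
On the question's frame this reproduces the expected column (a quick check; the values below are the question's own expected output):

>>> df['C'].tolist()
[1, 1, 1, 1, 0, 1, 0, 0]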

If you can't use scipy, you can use repeated argsort() for the "ordinal" method and my solution using unique and bincount for the "min" method:

>>> _, v = np.unique(df.A, return_inverse=True)
>>> df.A.argsort().argsort() - (np.cumsum(np.concatenate(([0], np.bincount(v)))))[v]
0    0
1    0
2    1
3    1
4    2
5    2
6    3
7    4
Name: A, dtype: int64

Then compare to df.A.map(selDict) as above.
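
Put end to end, the scipy-free variant looks like this (a sketch; kind='stable' is my addition to make tie-breaking deterministic rather than relying on the default quicksort):

_, v = np.unique(df.A, return_inverse=True)
ordinal = df.A.values.argsort(kind='stable').argsort()            # 0-based "ordinal" rank
min_rank = np.cumsum(np.concatenate(([0], np.bincount(v))))[v]    # 0-based "min" rank
df['C'] = (ordinal - min_rank < df.A.map(selDict)).astype(int)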

ecatmur