
I have this dataframe called target:

target:

          group
170  64.22-1-00
72   64.22-1-00
121  35.12-3-00
99   64.22-1-00
19   35.12-3-00

I want to create a new column called group_incidence, which is the ratio of how frequently the group appears in the dataframe. It is calculated like this:

[number of times that 'group' appears in the group column] / len(target.index)

It would look like this:

          group   group_incidence 
170  64.22-1-00               0.6
72   64.22-1-00               0.6
121  35.12-3-00               0.4
99   64.22-1-00               0.6
19   35.12-3-00               0.4

I was able to do that with a for loop; however, since this is a large dataframe, it takes too long. I believe that if I could skip the for loop I would see considerable performance gains.

Is there a way to perform that same operation without going through the for loop?
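
For reference, the sample frame can be rebuilt like this, together with one way such a for loop might look (the original loop was not posted, so this baseline is only a guess):

import pandas as pd

# Reconstruction of the sample dataframe shown above
target = pd.DataFrame(
    {'group': ['64.22-1-00', '64.22-1-00', '35.12-3-00', '64.22-1-00', '35.12-3-00']},
    index=[170, 72, 121, 99, 19],
)

# Hypothetical row-by-row baseline: it recounts the whole column for every
# row, which is why it scales badly on a large dataframe
incidence = []
for g in target['group']:
    incidence.append((target['group'] == g).sum() / len(target.index))
target['group_incidence'] = incidence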

aabujamra

2 Answers

In [112]: df['group_incidence'] = df.groupby('group')['group'].transform('size') / len(df)    

In [113]: df
Out[113]:
          group group_incidence
170  64.22-1-00             0.6
72   64.22-1-00             0.6
121  35.12-3-00             0.4
99   64.22-1-00             0.6
19   35.12-3-00             0.4
MaxU - stand with Ukraine
    Could have `len(df)` outside `apply` perhaps? `df.groupby('group')['group'].transform('count')/len(df)` – Zero Jun 28 '17 at 18:21
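
As a sketch of the comment's variant (assuming the same df; the value_counts line is an extra equivalent, not from the answer itself):

# Comment's suggestion: per-group count divided by the total number of rows
df['group_incidence'] = df.groupby('group')['group'].transform('count') / len(df)

# Equivalent sketch: normalized value counts mapped back onto each row
df['group_incidence'] = df['group'].map(df['group'].value_counts(normalize=True))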

Approach #1

One NumPy approach would be -

# unique groups, per-row tags into the unique array, and per-group counts
_, tags, c = np.unique(df.group.values, return_counts=1, return_inverse=1)
# turn counts into ratios and broadcast back to every row via the tags
df['group_incidence'] = (c/c.sum())[tags]

Sample run -

In [584]: df
Out[584]: 
          group
170  64.22-1-00
72   64.22-1-00
121  35.12-3-00
99   64.22-1-00
19   35.12-3-00

In [585]: _,tags, c = np.unique(df.group.values, return_counts=1, return_inverse=1)

In [586]: df['group_incidence'] = (c/c.sum())[tags]

In [587]: df
Out[587]: 
          group  group_incidence
170  64.22-1-00              0.6
72   64.22-1-00              0.6
121  35.12-3-00              0.4
99   64.22-1-00              0.6
19   35.12-3-00              0.4

Approach #2

Another option with some lower-level NumPy tweaks, aimed at squeezing out a bit more performance -

def argsort_unique(idx):
    # Original idea : http://stackoverflow.com/a/41242285/3293881 by @Andras
    # Inverts a permutation: sidx[idx] recovers the original positions
    n = idx.size
    sidx = np.empty(n, dtype=int)
    sidx[idx] = np.arange(n)
    return sidx

def group_ratios_tagged(a):
    # Sort so that equal group values end up adjacent
    sidx = a.argsort()
    b = a[sidx]

    # Boolean mask marking where a new group starts in the sorted array
    m = np.concatenate(( [False], b[1:] != b[:-1] ))
    sep_idx = np.concatenate(([0], np.flatnonzero(m), [a.size]))
    # Group id (0, 1, 2, ...) for each sorted element
    idx = m.cumsum()

    # Group sizes -> ratios, broadcast back per element via the group ids
    c = sep_idx[1:] - sep_idx[:-1]
    h = (c/c.sum())[idx]
    # Undo the initial sort to align with the original order
    out = h[argsort_unique(sidx)]
    return out

Sample run (how to use) -

In [659]: df = pd.read_clipboard()

In [660]: df
Out[660]: 
          group
170  64.22-1-00
72   64.22-1-00
121  35.12-3-00
99   64.22-1-00
19   35.12-3-00

In [661]: df['group_incidence'] = group_ratios_tagged(df.group.values)

In [662]: df
Out[662]: 
          group  group_incidence
170  64.22-1-00              0.6
72   64.22-1-00              0.6
121  35.12-3-00              0.4
99   64.22-1-00              0.6
19   35.12-3-00              0.4
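
To compare the two NumPy approaches with the pandas one-liner on a bigger frame, a benchmarking sketch for an IPython session (the size and number of groups below are only assumptions; actual timings depend on the data):

import numpy as np
import pandas as pd

# Hypothetical larger sample for timing
df = pd.DataFrame({'group': np.random.choice(['g%d' % i for i in range(1000)], 1000000)})

%timeit df.groupby('group')['group'].transform('size') / len(df)   # pandas one-liner
%timeit group_ratios_tagged(df.group.values)                       # NumPy Approach #2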
Divakar