
I want to find a more efficient way (in terms of peak memory usage, and possibly time) to do the work of pandas' groupby.ngroup so that I don't run into memory issues when working with large datasets (I explain below why this column is useful to me). Take this example with a small dataset, where the task is easy with groupby.ngroup.

import pandas as pd
import numpy as np


df = pd.DataFrame(np.array(
        [[0, 1, 92],
        [0, 0, 39],
        [0, 0, 32],
        [1, 0, 44],
        [1, 1, 50],
        [0, 1, 11],
        [0, 0, 14]]), columns=['male', 'edu', 'wage'])

df['group_id'] = df.groupby(['male', 'edu']).ngroup()
df
   male  edu  wage  group_id
0     0    1    92         1
1     0    0    39         0
2     0    0    32         0
3     1    0    44         2
4     1    1    50         3
5     0    1    11         1
6     0    0    14         0

But with larger datasets, memory usage and computation time explode: the ratio of the groupby's memory usage to the dataframe's memory usage grows almost three-fold going from N=100,000 to N=100,000,000. See below.

from memory_profiler import memory_usage
import time

N_values = [10**k for k in range(4, 9)]

stats = pd.DataFrame(index=N_values, dtype=float, columns=['time', 'basemem', 'groupby_mem'])

for N in N_values:
    df = pd.DataFrame(
        np.hstack([np.random.randint(0, 2, (N, 2)), np.random.normal(5, 1, (N, 1))]),
        columns=['male', 'edu', 'wage']        
    )

    def groupby_ngroup():
        df.groupby(['male', 'edu']).ngroup()

    def foo():
        pass

    basemem = max(memory_usage(proc=foo))

    tic = time.time()
    mem = max(memory_usage(proc=groupby_ngroup))
    toc = time.time() - tic

    stats.loc[N, 'basemem'] = basemem
    stats.loc[N, 'groupby_mem'] = mem
    stats.loc[N, 'time'] = toc

stats['mem_ratio'] = stats.eval('groupby_mem/basemem')
stats
               time      basemem  groupby_mem  mem_ratio
10000      0.037834   104.781250   105.359375   1.005517
100000     0.051785   108.187500   113.125000   1.045638
1000000    0.143642   128.156250   182.437500   1.423555
10000000   0.644650   334.148438   820.183594   2.454549
100000000  6.074531  2422.585938  7095.437500   2.928869

Why am I interested in this group identifier? Because I want to create columns that use pandas' groupby functions such as groupby.mean via the .map method instead of groupby.transform, which takes a lot of memory and time. Furthermore, the .map approach can be used with dask dataframes, since dask currently doesn't support .transform. With a "group_id" column I can simply do means = df.groupby(['group_id'])['wage'].mean() and then df['mean_wage'] = df['group_id'].map(means) to do the work of transform.
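For concreteness, a minimal sketch of that pattern on the small dataframe above (mean_wage is just an illustrative column name):

# one mean per group, then broadcast back onto the rows
means = df.groupby(['group_id'])['wage'].mean()
df['mean_wage'] = df['group_id'].map(means)

# this should match the transform-based equivalent:
# df['mean_wage'] = df.groupby(['male', 'edu'])['wage'].transform('mean')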

jtorca

3 Answers


How about not using ngroup, and instead writing our own function to create the group_id column?

Here is a code snippet that seems to give slightly better performance:

from memory_profiler import memory_usage
import time
import pandas as pd
import numpy as np

N_values = [10**k for k in range(4, 9)]

stats = pd.DataFrame(index=N_values, dtype=float, columns=['time', 'basemem', 'groupby_mem'])

for N in N_values:
    df = pd.DataFrame(
        np.hstack([np.random.randint(0, 2, (N, 2)), np.random.normal(5, 1, (N, 1))]),
        columns=['male', 'edu', 'wage']        
    )

    def groupby_ngroup():
        #df.groupby(['male', 'edu']).ngroup()
        df['group_id'] = 2*df.male + df.edu

    def foo():
        pass

    basemem = max(memory_usage(proc=foo))

    tic = time.time()
    mem = max(memory_usage(proc=groupby_ngroup))
    toc = time.time() - tic

    stats.loc[N, 'basemem'] = basemem
    stats.loc[N, 'groupby_mem'] = mem
    stats.loc[N, 'time'] = toc

stats['mem_ratio'] = stats.eval('groupby_mem/basemem')
stats


            time        basemem     groupby_mem mem_ratio
10000       0.117921    2370.792969 79.761719   0.033643
100000      0.026921    84.265625   84.324219   1.000695
1000000     0.067960    130.101562  130.101562  1.000000
10000000    0.220024    308.378906  536.140625  1.738577
100000000   0.751135    2367.187500 3651.171875 1.542409

Essentially, we use the fact that the columns are numeric 0/1 indicators and treat each (male, edu) pair as a two-digit binary number; the group IDs are the decimal equivalents.
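As a tiny illustration of this encoding (the demo frame here just lists the four possible (male, edu) combinations; for 0/1 columns these labels happen to coincide with ngroup's sorted labels):

import pandas as pd

demo = pd.DataFrame([[0, 1], [0, 0], [1, 0], [1, 1]], columns=['male', 'edu'])

# male is the 2s digit, edu the 1s digit of a two-digit binary number
demo['group_id'] = 2 * demo['male'] + demo['edu']

print(demo)
#    male  edu  group_id
# 0     0    1         1
# 1     0    0         0
# 2     1    0         2
# 3     1    1         3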

Scaling it to three columns gives a similar result. For that, replace the dataframe initialization with the following:

df = pd.DataFrame(
    np.hstack([np.random.randint(0, 2, (N, 3)), np.random.normal(5, 1, (N, 1))]),
    columns=['male', 'edu', 'random1', 'wage']
)

and the group_id function with:

def groupby_ngroup():
    df['group_id'] = 4*df.male + 2*df.edu + df.random1

Following are the results of that test:

            time        basemem     groupby_mem mem_ratio
10000       0.050006    78.906250   78.980469   1.000941
100000      0.033699    85.007812   86.339844   1.015670
1000000     0.066184    147.378906  147.378906  1.000000
10000000    0.322198    422.039062  691.179688  1.637715
100000000   1.233054    3167.921875 5183.183594 1.636146
S.Au.Ra.B.H
  • I like this solution a lot. I had been thinking of a solution for a general set of columns with discrete support, but perhaps there is no better general solution and I should instead think of preprocessing the data to get it into the form of pure categoricals, taking values `0,1,2,...`, and then using your approach. So, in general, if `C` denotes my `N` group-by columns, and `K` the number of unique values in each column, I can use `C_0 + (K_0)*C_1 + (K_0*K_1)*C_2 + ... + (K_0*K_1*...*K_{N-1})*C_N`? – jtorca Dec 16 '19 at 16:41
  • @jtorca, apologies for the late reply. The multiplier depends on the maximum number of categories across all the columns. Say `C_1, ..., C_N` are the group-by columns. Then `K = max(num_categories(C_1, ..., C_N))`. This is because we are treating the values in every column as digits in a number system with base `K`. Let's say we have 3 columns: C_1 and C_2 take values [0,1] and C_3 takes values [0,1,2]. Here, `K = 3`, and the equivalent group_id will be calculated as `(3**2)*C_1 + (3**1)*C_2 + (3**0)*C_3`. Does this make sense? – S.Au.Ra.B.H Dec 17 '19 at 23:42
  • Oh, I see. That definitely works. I based my suggestion on your original proposal as inspiration, but I guess I didn't match it quite right. Though I think my first guess also happens to give unique categories and has the advantage of creating contiguous integer-valued ids for the groups, e.g., `0,1,2,3,...`. – jtorca Dec 18 '19 at 17:25
  • @jtorca yes, I think your solution works too, and perhaps more efficiently than mine. – S.Au.Ra.B.H Dec 19 '19 at 00:54

Let us try using hash. Hash only the group-by columns (hashing full records with df.to_records() would also include the index and wage, producing a distinct value for every row rather than one per group):

list(map(hash, df[['male', 'edu']].to_records(index=False).tolist()))

Rows with the same (male, edu) combination share a hash value, so the result can serve as a group identifier, although the hashes are large arbitrary integers rather than contiguous 0, 1, 2, ....
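A vectorized variant of the same idea, sketched with pandas' built-in row hashing (hash_pandas_object returns one uint64 per row; index=False excludes the index):

# rows with identical (male, edu) values get identical hashes
df['group_id'] = pd.util.hash_pandas_object(df[['male', 'edu']], index=False)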
BENY

For a groupby where the group-by variables follow no known pattern, it seems that groupby.ngroup may be as good as it gets. But if your groupby variables are all categorical, e.g., take values 0, 1, 2, 3, ..., then we can take inspiration from the solution given by @saurjog.

To generate the group ID, we can build a numerical expression that evaluates a weighted sum of the group-by variables. Consider the following functions:

def gen_groupby_numexpr(cols, numcats):
    '''
    Build an expression string such as "male + 2*mar + 4*edu" that maps each
    combination of the categorical columns to a unique integer (a mixed-radix
    encoding: each column's weight is the product of the category counts of
    the columns before it).
    '''
    txt = [cols[0]]
    k = numcats[0]

    for c, k_ in zip(cols[1:], numcats[1:]):
        txt.append('{}*{}'.format(k, c))
        k = k*k_

    return ' + '.join(txt)

def ngroup_cat(df, by, numcats):
    '''
    by : list
        the categorical (0,1,2,3,...) groupby column names
    numcats : list
        the number of unique values for each column in "by"
    '''
    expr = gen_groupby_numexpr(by, numcats)

    return df.eval(expr)

The function gen_groupby_numexpr builds the numerical expression and ngroup_cat evaluates it to produce the group id for the group-by variables in by, whose unique value counts are given in numcats. Now consider the following dataset that matches our use case. It contains 3 categorical variables we will use to form the groupby: two take values in {0,1} and one takes values in {0,1,2}.

df2 = pd.DataFrame(np.hstack([np.random.randint(0, 2, (100, 2)),
                              np.random.randint(0, 3, (100, 1)),
                              np.random.randint(0, 20, (100, 1))]),
                   columns=['male', 'mar', 'edu', 'wage'])

Generating the numerical expression with gen_groupby_numexpr(['male', 'mar', 'edu'], [2, 2, 3]) gives:

'male + 2*mar + 4*edu'

Putting this all together, we can generate the group id with:

df2['group_id'] = ngroup_cat(df2, ['male', 'mar', 'edu'], [2, 2, 3])

from which we get 2*2*3=12 unique group IDs:

df2[['male', 'mar', 'edu', 'group_id']].drop_duplicates().sort_values(['group_id'])
    male  mar  edu  group_id
1      0    0    0         0
13     1    0    0         1
8      0    1    0         2
10     1    1    0         3
4      0    0    1         4
12     1    0    1         5
2      0    1    1         6
6      1    1    1         7
7      0    0    2         8
5      1    0    2         9
44     0    1    2        10
0      1    1    2        11

When I benchmark the solution above against groupby.ngroup, it runs nearly 3 times faster on a dataset of N=10,000,000 and uses significantly less additional memory.

Now we can compute these groupby means and map them back onto the whole dataframe to do the work of transform. My benchmarks are mixed on whether transform or groupby-then-map is faster and less memory intensive, but if you are computing means for groups defined by many variables, I think the latter is more efficient. Further, the latter can also be done in dask, where transform is not yet supported.
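For completeness, a minimal sketch of the dask version of that groupby-then-map pattern (assuming dask is installed; npartitions=4 and the mean_wage name are just illustrative):

import dask.dataframe as dd

# assumes df2 already has the 'group_id' column created above
ddf = dd.from_pandas(df2, npartitions=4)

# one mean per group, collected into a small pandas Series
means = ddf.groupby('group_id')['wage'].mean().compute()

# broadcast the per-group means back onto every row, i.e. the work of transform('mean');
# dask may suggest passing meta= to .map, which you can add to silence the warning
ddf['mean_wage'] = ddf['group_id'].map(means)

result = ddf.compute()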

jtorca