I want to find a more efficient way (in terms of peak memory usage, and possibly time) to do the work of pandas' groupby.ngroup, so that I don't run into memory issues when working with large datasets (I give my reasons for why this column is useful to me below). Take this example with a small dataset. I can accomplish this task easily using groupby.ngroup:
import pandas as pd
import numpy as np
df = pd.DataFrame(np.array(
    [[0, 1, 92],
     [0, 0, 39],
     [0, 0, 32],
     [1, 0, 44],
     [1, 1, 50],
     [0, 1, 11],
     [0, 0, 14]]), columns=['male', 'edu', 'wage'])
df['group_id'] = df.groupby(['male', 'edu']).ngroup()
df
male edu wage group_id
0 0 1 92 1
1 0 0 39 0
2 0 0 32 0
3 1 0 44 2
4 1 1 50 3
5 0 1 11 1
6 0 0 14 0
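To be explicit about the semantics I need: group_id is just a dense integer label for each unique (male, edu) combination, numbered in sorted key order. As a point of reference (not something I have profiled for peak memory), a plain NumPy sketch that reproduces the same labels on this example, assuming the key columns are numeric, would be:

# same labels via np.unique: unique rows come back sorted (matching ngroup's
# default sort=True order) and return_inverse gives each row's group index
_, group_id = np.unique(df[['male', 'edu']].to_numpy(), axis=0, return_inverse=True)
df['group_id_np'] = group_id  # column name is just for illustration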
But when I start using larger datasets, the memory usage and computation time explode, and the memory used by the groupby, as a ratio over the memory used by the dataframe itself, increases almost three-fold going from N=100,000 to N=100,000,000. See below.
from memory_profiler import memory_usage
import time

N_values = [10**k for k in range(4, 9)]
stats = pd.DataFrame(index=N_values, dtype=float, columns=['time', 'basemem', 'groupby_mem'])
for N in N_values:
    df = pd.DataFrame(
        np.hstack([np.random.randint(0, 2, (N, 2)), np.random.normal(5, 1, (N, 1))]),
        columns=['male', 'edu', 'wage']
    )
    def groupby_ngroup():
        df.groupby(['male', 'edu']).ngroup()
    def foo():
        pass
    # baseline memory of the process vs. peak memory while running the groupby
    basemem = max(memory_usage(proc=foo))
    tic = time.time()
    mem = max(memory_usage(proc=groupby_ngroup))
    toc = time.time() - tic
    stats.loc[N, 'basemem'] = basemem
    stats.loc[N, 'groupby_mem'] = mem
    stats.loc[N, 'time'] = toc
stats['mem_ratio'] = stats.eval('groupby_mem/basemem')
stats
time basemem groupby_mem mem_ratio
10000 0.037834 104.781250 105.359375 1.005517
100000 0.051785 108.187500 113.125000 1.045638
1000000 0.143642 128.156250 182.437500 1.423555
10000000 0.644650 334.148438 820.183594 2.454549
100000000 6.074531 2422.585938 7095.437500 2.928869
Why am I interested in this group identifier? Because I want to create columns that use pandas' groupby functions, such as groupby.mean, via the .map method rather than groupby.transform, which takes a lot of memory and time. Furthermore, the .map approach can be used with dask dataframes, as dask currently doesn't support .transform. With a "group_id" column I can simply do means = df.groupby(['group_id'])['wage'].mean() and df['mean_wage'] = df['group_id'].map(means) to do the work of transform.
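For concreteness, here is a minimal sketch of that map-based pattern on the small example from the top of the post, checked against the transform call it is meant to replace (the mean_wage_transform column name is just for illustration):

# map-based replacement for groupby.transform('mean')
means = df.groupby('group_id')['wage'].mean()    # one value per group
df['mean_wage'] = df['group_id'].map(means)      # broadcast back to every row

# the transform call this replaces, for comparison
df['mean_wage_transform'] = df.groupby(['male', 'edu'])['wage'].transform('mean')
assert np.allclose(df['mean_wage'], df['mean_wage_transform'])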