3

I have a large .csv file with roughly 150M rows. I can still fit the entire data set into memory and use Pandas to groupby and combine. Example...

aggregated_df = df.groupby(["business_partner", "contract_account"]).sum()

In the above example the dataframe contains two integer columns, business_partner and contract_account, which are used as keys for the grouping operation. The remaining columns can be assumed all be floating point features which I would like to aggregate.

However this only uses 1 of the 48 cores on my workstation. I am trying to use vaex in order to take advantage of all of my cores but cannot figure out the API calls to perform groupby and combine. Perhaps it is not yet possible in Vaex?

Edit(s):

  1. I am aware that this operation can be done in dask, but for this question I want to focus on Vaex.
davidrpugh
  • 4,363
  • 5
  • 32
  • 46
  • What does your data look like ? – vlemaistre Jun 26 '19 at 07:33
  • 1
    @davidrpugh would you perfer to use dask for this operation. It has the same api of that of pandas (built on top of pandas in way). `from dask.distributed import Client; client = Client(n_workers=1, threads_per_worker=4, processes=False, memory_limit='2GB') ;df.goupby(["col1", "col2"]).sum().compute()` – Nithin Varghese Jun 26 '19 at 07:35

2 Answers2

8

You can find a working example in https://docs.vaex.io/en/latest/api.html#vaex.dataframe.DataFrameLocal.groupby

Going with your example of grouping by 2 columns and getting a sum aggregation:

import pandas as pd, numpy as np
import vaex

# Create input dataframe

n=10**6  # Change this to adjust df size

a_c1 = [1,2,3]*n
a_c2 = [1,1,2,2,3,3]*int(n/2)
a_x = np.arange(float(len(a_c1)))
df = pd.DataFrame({'c1':a_c1,'c2':a_c2,'x1':a_x, 'x2':a_x})

# Convert dataframe to vaex

# dfv = vaex.from_pandas(df) # This also works, but it's slower
dfv = vaex.from_arrays(c1=a_c1, c2=a_c2, x1=a_x, x2=a_x)

df_result1 = df.groupby(['c1','c2']).sum()
df_result2 = dfv.groupby(['c1','c2'],agg='sum')

The output structure will be slightly different:

> print(df_result1)

                 x1            x2
c1 c2                            
1  1   7.499985e+11  7.499985e+11
   2   7.500000e+11  7.500000e+11
2  1   7.499990e+11  7.499990e+11
   3   7.500005e+11  7.500005e+11
3  2   7.499995e+11  7.499995e+11
   3   7.500010e+11  7.500010e+11

> print(df_result2)

  #    c1    c2      x_1_sum      x_2_sum
  0     2     3  7.50000e+11  7.50000e+11
  1     2     1  7.49999e+11  7.49999e+11
  2     3     2  7.5e+11      7.5e+11
  3     3     3  7.50001e+11  7.50001e+11
  4     1     2  7.5e+11      7.5e+11
  5     1     1  7.49998e+11  7.49998e+11

0

As far as I know, you will have to adjust the limits and number of bins for the grouping manually - but the 'binby' argument should do the job in vaex:

df.sum([list of columns you want summed],binby=["business_partner", "contract_account"],limits=['minmax','minmax'],
       shape=[business_partner_bins,contract_account_bins])

if you want all columns summed you could replace [list of columns you want summed] by df.column_names

PS. I just noted there is a groupby function in vaex, but I have no experience using it. https://vaex.readthedocs.io/en/latest/api.html?highlight=groupby#vaex.dataframe.DataFrameLocal.groupby

Helmer
  • 286
  • 2
  • 4
  • I will give this a try (there is a similar example in the docs) but this approach seems VERY odd coming from SQL and Pandas. In particular I don't understand the need to provide a `shape`. The `binby` columns group the data into bins and for each bin I then want to compute the sum. From the docs it seems like this approach was intended for a different use case. I also found the `groupby` function you mentioned but was never able to get it to work. – davidrpugh Jul 06 '19 at 09:10
  • Yes, I think you are correct and this is not the intended use for this approach - so that's why you will have to tune the shape. The `binby` does group the columns into bins, the `shape` sets the number of bins. The default `shape` is 64 if I'm not mistaken. I assumed here that you want a bin for each "business_partner" and "contract_account", which is likely not 64 so that's why I recommended changing the default shape. – Helmer Jul 06 '19 at 12:24