I have a large .csv file with roughly 150M rows. The entire dataset still fits into memory, so I can use Pandas to groupby and combine. For example:
aggregated_df = df.groupby(["business_partner", "contract_account"]).sum()
In the above example the dataframe contains two integer columns, business_partner and contract_account, which are used as keys for the grouping operation. The remaining columns can all be assumed to be floating-point features that I would like to aggregate.
However, this uses only 1 of the 48 cores on my workstation. I am trying to use Vaex in order to take advantage of all of my cores, but I cannot figure out the API calls to perform the groupby and combine. Perhaps it is not yet possible in Vaex?
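For reference, here is roughly what I have been attempting (a sketch only; I am not certain these are the right Vaex calls, and data.csv and feature_1 are placeholders for my actual file and one of my float columns):

import vaex

# Convert the CSV once into a memory-mappable HDF5 file; later runs
# can open the converted file directly instead of re-parsing the CSV.
df = vaex.from_csv("data.csv", convert=True, chunk_size=5_000_000)

# Attempted groupby on the two integer key columns, summing one of
# the float feature columns.
result = df.groupby(
    by=["business_partner", "contract_account"],
    agg={"feature_1": "sum"},
)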
Edit(s):
- I am aware that this operation can be done in Dask, but for this question I want to focus on Vaex.