
I have more than 10 GB of transaction data. I used Dask to read the data, select the columns I'm interested in, and group by the columns I wanted. All of this was incredibly fast, but computing the result wasn't working well and debugging was hard.

I then decided to open my data with pandas using chunksize, reading it in chunks of one million rows, and then used Vaex to combine the files into one big HDF5 file. Up to this point everything went well, but when I try to group by my columns on more than 50k rows, my code crashes. I was wondering how to manage this: should I group each pandas chunk before combining them into a Vaex dataframe, or is it possible to convert my Vaex dataframe to a Dask dataframe, do the groupby, and then convert the grouped dataframe back to Vaex, which is more user-friendly for me as it's similar to pandas?
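To make the first option concrete, here is a minimal sketch of what I mean by grouping each chunk, assuming the aggregation is a sum of amount per client pair (the file name here is a placeholder; this only works because partial sums can be re-combined):

import pandas as pd

# Placeholder path; the aggregation you need may differ.
path = 'transactions.tsv'
cols = ['client_1_id', 'amount', 'client_2_id', 'transaction_direction']
keys = ['client_1_id', 'client_2_id', 'transaction_direction']

partials = []
for chunk in pd.read_csv(path, sep='\t', usecols=cols, chunksize=10**6):
    # Aggregate each chunk on its own so memory stays bounded.
    partials.append(chunk.groupby(keys, as_index=False)['amount'].sum())

# The same key can appear in several chunks, so concatenate the
# partial results and aggregate once more.
result = (pd.concat(partials, ignore_index=True)
          .groupby(keys, as_index=False)['amount'].sum())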

import pandas as pd

path = ....
cols = ['client_1_id', 'amount', 'client_2_id', 'transaction_direction']

chunksize = 10**6
df = pd.read_csv(path,
                 iterator=True,
                 sep='\t',
                 usecols=cols,
                 chunksize=chunksize,
                 error_bad_lines=False)


import vaex
# Step 1: export to hdf5 chunks
for i, chunk in enumerate(df):
    print(i)
    df_chunk = vaex.from_pandas(chunk, copy_index=False)
    df_chunk.export_hdf5(f'dfv_{i}.hdf5')
    
dfv = vaex.open('dfv_*.hdf5')

# Step 2: Combine back into one big hdf5 file
dfv.export_hdf5('dfv.hdf5')


dfv = vaex.open('dfv.hdf5')
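And here is a sketch of the second option as I imagine it, reusing path and cols from above: let Dask do the groupby on the original CSV, compute it down to a (hopefully small) pandas result, and hand that to Vaex ('dfv_grouped.hdf5' is a placeholder name). Is something like this reasonable?

import dask.dataframe as dd
import vaex

keys = ['client_1_id', 'client_2_id', 'transaction_direction']

# Dask runs the groupby out-of-core; .compute() returns a pandas
# DataFrame that only holds the aggregated result.
ddf = dd.read_csv(path, sep='\t', usecols=cols)
grouped = ddf.groupby(keys)['amount'].sum().reset_index().compute()

# Hand the (now small) pandas result to Vaex and export it.
dfv_grouped = vaex.from_pandas(grouped, copy_index=False)
dfv_grouped.export_hdf5('dfv_grouped.hdf5')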

This is my first post; sorry if there aren't enough details or if I'm unclear. Please feel free to ask me any questions.

amiraghrs
  • "computing wasn't working well" - **if** you are asking about dask, you must say what you did and what went wrong. If not, then remove that part as irrelevant. – mdurant Oct 06 '20 at 14:37
  • I didn't go into detail because I didn't understand why it wasn't working. In fact, every time I called df.compute() it ran for a very long time before giving me 'invalid argument', but I have no idea what was wrong. I kept Dask in the title as I was wondering if I should still use it to do the groupby, and if it is possible to go from a Dask dataframe to a Vaex dataframe. – amiraghrs Oct 07 '20 at 07:54
  • Dask should be able to do what you want by itself, but we cannot help without further specifics. Indeed, you end up loading in the big file and ... then what? Why do you need one big file at all? – mdurant Oct 07 '20 at 13:19
  • My aim is to use the transaction data to build a big social network graph; if I don't have all the data in the same file I might miss some links between people. I kept trying with Dask, and I know I need to be more specific about the error to get help, but I don't know why my kernel dies every time I try to compute my Dask dataframe. Everything works fine with Vaex except the groupby. So my question is: how do I turn a Dask groupby result into an HDF5 file that can be opened with Vaex? – amiraghrs Oct 09 '20 at 09:38
  • I still suspect you are asking the wrong question. "except the groupby" - that is the highest-memory part, which is probably also the problem for Dask. – mdurant Oct 09 '20 at 12:54

0 Answers