
I have a dataset of 1,800,000 rows and 45 columns. The operation I am trying to perform is a group-by on one column with the sum of the other columns.

The first step I tried, with `data_df` as my data frame (all columns are numeric):

columns = data_df.column_names
df_result = data_df.groupby(columns, agg='sum')

The result is that the kernel gets restarted, even though the system has 32 GB of RAM.
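For comparison, here is what a single-key group-by-sum looks like on a toy frame. This is a minimal pandas sketch (not my actual data); it assumes the grouping key is `MSISDN`, as in my second attempt below, and the tiny column names are invented for illustration:

```python
import pandas as pd

# Toy stand-in for data_df (the real one has 1.8M rows and 45 columns).
data_df = pd.DataFrame({
    "MSISDN": ["a", "a", "b"],
    "calls": [1, 2, 3],
    "bytes": [10, 20, 30],
})

# Group by the single key column only; every remaining numeric column is summed.
result = data_df.groupby("MSISDN", as_index=False).sum()
print(result)
```

Note that this groups by one column rather than passing the full list of column names as the group key.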

Another approach that I tried:

import vaex

df = None
for col in columns:
    print("the col is", col)

    if df is None:
        df = data_df.groupby(data_df.MSISDN, agg=[vaex.agg.sum(col)])
    else:
        dfTemp = data_df.groupby(data_df.MSISDN, agg=[vaex.agg.sum(col)])
        df = df.join(dfTemp, left_on="MSISDN", right_on="MSISDN",
                     how="inner", allow_duplication=True)
        del dfTemp

With this approach I can compute the sums for up to 11 columns before the kernel restarts again. Is there any other way to get the result using vaex or pandas?
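One alternative I have been considering is aggregating chunk by chunk, so only the partial per-key sums stay in memory instead of the full frame. A minimal pandas sketch, assuming the data can be delivered as an iterable of chunks (e.g. from `pd.read_csv(path, chunksize=200_000)`); the `chunked_group_sum` helper name is my own, not a library function:

```python
import pandas as pd

def chunked_group_sum(chunks, key):
    """Sum all numeric columns per key, one chunk at a time.

    Since a sum of sums equals the global sum, each chunk is reduced
    to a small partial aggregate, and the partials are then combined.
    """
    partials = [chunk.groupby(key, as_index=False).sum() for chunk in chunks]
    combined = pd.concat(partials, ignore_index=True)
    return combined.groupby(key, as_index=False).sum()

# Two toy chunks standing in for pieces of the 1.8M-row dataset.
chunks = [
    pd.DataFrame({"MSISDN": ["a", "b"], "calls": [1, 2]}),
    pd.DataFrame({"MSISDN": ["a"], "calls": [5]}),
]
print(chunked_group_sum(chunks, "MSISDN"))
```

This works because summation is associative, so the same trick would not apply directly to non-decomposable aggregates such as a median.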

  • It seems like you grouped by all the columns and not just by one column: `columns = data_df.column_names`. Which version of Python are you using? – Niv Dudovitch Sep 01 '21 at 14:05
  • Python version 3.7 – aziz shaw Sep 01 '21 at 14:55
  • I am not sure if this is what you want but.. you can simply try: `df.groupby('my_col').agg({col: 'sum' for col in df.column_names})` where `my_col` is the column on which you want to group. Should work for both vaex and pandas. – Joco Oct 16 '21 at 01:28

0 Answers