
I have a dataframe with 600 rows and 2999 float columns and 1 categorical column.

I group by cat1, apply the mean, max, min, std, and sum aggregations, merge the results back into df, and then convert df to a NumPy array.

The code takes around 35 ms on my computer. I want to improve its speed; even a 1 ms improvement would make me happy.

Code to create a dummy df so you can reproduce:

import pandas as pd
import numpy as np
import time

# DataFrame with 600 rows and 3000 columns
num_rows = 600
num_numeric_cols = 2999
# Generate random float32 values for the numeric columns
numeric_data = np.random.rand(num_rows, num_numeric_cols).astype(np.float32)
# categorical column with 50 unique categories
num_categories = 50
categorical_data = np.random.choice(range(num_categories), num_rows)
# column names for numeric columns
numeric_column_names = [f"col{i}" for i in range(num_numeric_cols)]
# Build the DataFrame from the columns
data = {numeric_column_names[i]: numeric_data[:, i] for i in range(num_numeric_cols)}
data["cat1"] = categorical_data

df = pd.DataFrame(data)
df["cat1"] = df["cat1"].astype("category")

Now, the code below is the part whose speed I want to increase.

start = time.time()    

aa = df.groupby(["cat1"])[["col1","col2","col3","col4","col5","col6","col7","col8","col9","col10"]]
df = pd.merge(df,aa.mean().reset_index(),how="left",on="cat1",suffixes=["","mean"])
df = pd.merge(df,aa.std().reset_index(),how="left",on="cat1",suffixes=["","std"])
df = pd.merge(df,aa.min().reset_index(),how="left",on="cat1",suffixes=["","min"])
df = pd.merge(df,aa.max().reset_index(),how="left",on="cat1",suffixes=["","max"])
df = pd.merge(df,aa.sum().reset_index(),how="left",on="cat1",suffixes=["","sum"])
df = df.to_numpy(np.float32)
print("Time Taken",time.time()-start)
print(df.shape)

Since I am converting the result to a NumPy array at the end, you can also do the groupby or merge in NumPy if that is faster.

1.) You can find a faster way to do the groupby.

2.) You can find a faster way to merge. I found a faster solution for this point; I will post the code in an answer to keep it simple.

  • Don't merge, use `groupby.transform` (a sketch of this follows the comments). Even if you had to merge, doing it many times successively is inefficient; `concat` the aggregations and `merge` once. – mozway Aug 11 '23 at 07:36
  • Ohh perfect. That is damn fast. Any way to calculate mean, max, min, std, and sum at once using only 1 `transform`, to make it even faster? – Rahul Sharma Aug 11 '23 at 09:04
  • Currently, I am using 5 lines of code, one `transform` each for mean, min, max, sum, and std, and then concatenating. @mozway Thanks a lot btw – Rahul Sharma Aug 11 '23 at 09:05
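
A minimal sketch of the transform-plus-concat approach suggested in the comments (not benchmarked here; it assumes the same ten columns and the same float32 cast as the question's code):

start = time.time()

cols = ["col1","col2","col3","col4","col5","col6","col7","col8","col9","col10"]
g = df.groupby("cat1", observed=True)[cols]

# One transform per statistic; each result is already row-aligned with df,
# so a single concat replaces the five successive merges.
parts = [df]
for func in ("mean", "std", "min", "max", "sum"):
    t = g.transform(func)
    t.columns = [f"{c}{func}" for c in cols]
    parts.append(t)

out = pd.concat(parts, axis=1).to_numpy(np.float32)
print("Time Taken", time.time() - start)
print(out.shape)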

1 Answer


I created this custom merge function to improve speed.

def merge_cols(maindf, series, on, suffix):
    # Columns produced by the aggregation, excluding the key column(s)
    extraColumns = sorted(list(set(series.columns.tolist()) - set(on)))
    # Align the group-level stats to every row with a single left merge on the key
    series = pd.merge(maindf[on], series, how='left', on=on)
    # Assign the aligned values directly instead of merging the full frame
    for extraColumn in extraColumns:
        maindf[str(extraColumn) + suffix] = series[extraColumn].values
    maindf.reset_index(drop=True, inplace=True)
    return maindf

start = time.time()    

aa = df.groupby(["cat1"])[["col1","col2","col3","col4","col5","col6","col7","col8","col9","col10"]]
df = merge_cols(df,aa.mean().reset_index(),["cat1"],"mean")
df = merge_cols(df,aa.std().reset_index(),["cat1"],"std")
df = merge_cols(df,aa.min().reset_index(),["cat1"],"min")
df = merge_cols(df,aa.max().reset_index(),["cat1"],"max")
df = merge_cols(df,aa.sum().reset_index(),["cat1"],"sum")
df = df.to_numpy(np.float32)
print("Time Taken",time.time()-start)
print(df.shape)
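
Since the question also allows doing the groupby in NumPy, here is a hedged, unbenchmarked sketch that computes all five statistics with unbuffered ufunc reductions and broadcasts them back to row level through the category codes, so no merge is needed. The std uses the shortcut formula with ddof=1 to mimic pandas' default, which can be less numerically stable than pandas' own implementation, and it assumes every category appears at least twice.

start = time.time()

cols = ["col1","col2","col3","col4","col5","col6","col7","col8","col9","col10"]
values = df[cols].to_numpy(np.float32)            # (600, 10)
codes = df["cat1"].cat.codes.to_numpy()           # group index for every row
n_groups = df["cat1"].cat.categories.size

# Per-group count and sum -> mean
counts = np.bincount(codes, minlength=n_groups).astype(np.float32)
sums = np.zeros((n_groups, len(cols)), dtype=np.float32)
np.add.at(sums, codes, values)
means = sums / counts[:, None]

# Per-group min / max via unbuffered ufunc reductions
mins = np.full((n_groups, len(cols)), np.inf, dtype=np.float32)
maxs = np.full((n_groups, len(cols)), -np.inf, dtype=np.float32)
np.minimum.at(mins, codes, values)
np.maximum.at(maxs, codes, values)

# Sample std (ddof=1) from the shortcut formula sum(x^2) - n*mean^2
sq_sums = np.zeros_like(sums)
np.add.at(sq_sums, codes, values ** 2)
stds = np.sqrt((sq_sums - counts[:, None] * means ** 2) / (counts[:, None] - 1))

# Broadcast the group statistics back to row level and stack with the frame
out = np.hstack([df.to_numpy(np.float32),
                 means[codes], stds[codes], mins[codes],
                 maxs[codes], sums[codes]])
print("Time Taken", time.time() - start)
print(out.shape)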