I have a DataFrame with 600 rows, 2999 float columns, and 1 categorical column. I group by cat1, apply the mean, std, min, max, and sum aggregations, merge the results back into df, and then convert df to a NumPy array.
The code takes around 35ms on my computer. I want to improve its speed; even a 1ms improvement would make me happy.
Code to create a dummy df so you can reproduce:
import time

import pandas as pd
import numpy as np
# DataFrame with 600 rows and 3000 columns (2999 numeric + 1 categorical)
num_rows = 600
num_numeric_cols = 2999
# Generate random float32 values for the numeric columns
numeric_data = np.random.rand(num_rows, num_numeric_cols).astype(np.float32)
# categorical column with 50 unique categories
num_categories = 50
categorical_data = np.random.choice(range(num_categories), num_rows)
# column names for numeric columns
numeric_column_names = [f"col{i}" for i in range(num_numeric_cols)]
# Assemble the DataFrame from the generated columns
data = {numeric_column_names[i]: numeric_data[:, i] for i in range(num_numeric_cols)}
data["cat1"] = categorical_data
df = pd.DataFrame(data)
df["cat1"] = df["cat1"].astype("category")
Now, the code below is the part whose speed I want to increase.
start = time.time()
# Group the first ten numeric columns by the categorical column
aa = df.groupby(["cat1"])[["col1", "col2", "col3", "col4", "col5", "col6", "col7", "col8", "col9", "col10"]]
# Merge each per-group statistic back onto the original rows
df = pd.merge(df, aa.mean().reset_index(), how="left", on="cat1", suffixes=["", "mean"])
df = pd.merge(df, aa.std().reset_index(), how="left", on="cat1", suffixes=["", "std"])
df = pd.merge(df, aa.min().reset_index(), how="left", on="cat1", suffixes=["", "min"])
df = pd.merge(df, aa.max().reset_index(), how="left", on="cat1", suffixes=["", "max"])
df = pd.merge(df, aa.sum().reset_index(), how="left", on="cat1", suffixes=["", "sum"])
df = df.to_numpy(np.float32)
print("Time Taken", time.time() - start)
print(df.shape)
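As an aside, for timings in the tens of milliseconds, time.perf_counter() is usually steadier than time.time(); a minimal variant of the same timing wrapper:

start = time.perf_counter()
# ... the groupby/merge/to_numpy block above ...
print("Time Taken", time.perf_counter() - start)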
Since I am converting df to a NumPy array at the end, you can also do the groupby or merge in NumPy if that is faster.
1.) You can find a faster way to do the groupby (a rough sketch of one idea is below).
2.) You can find a faster way to do the merge (a merge-free sketch is also below). I already found a faster solution for this point and will post the code in an answer to keep the question simple.
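For point 1, one possible direction (a rough, unbenchmarked sketch starting from the freshly built df above; the cols list and the observed=True flag are my additions, and the result columns come out grouped per source column rather than per statistic) is to compute all five statistics in a single groupby(...).agg(...) pass and attach them with one merge instead of five aggregations and five merges:

cols = [f"col{i}" for i in range(1, 11)]  # the ten aggregated columns

# One groupby pass that produces every (column, statistic) pair at once
stats = df.groupby("cat1", observed=True)[cols].agg(["mean", "std", "min", "max", "sum"])

# Flatten the MultiIndex columns, ("col1", "mean") -> "col1mean", to mimic the suffixes above
stats.columns = [f"{c}{s}" for c, s in stats.columns]

# A single merge against the group index instead of five merges with reset_index
out = df.merge(stats, how="left", left_on="cat1", right_index=True)
arr = out.to_numpy(np.float32)

For point 2, a merge-free alternative is groupby(...).transform(...), which broadcasts each group statistic straight back onto the original rows, so no join is needed at all; again only a sketch under the same assumptions:

g = df.groupby("cat1", observed=True)[cols]
blocks = [df.to_numpy(np.float32)]  # original columns; cat1's integer categories coerce to float
for fn in ("mean", "std", "min", "max", "sum"):
    # transform returns a frame aligned row for row with df, so no merge is needed
    blocks.append(g.transform(fn).to_numpy(np.float32))
arr = np.hstack(blocks)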