I want to apply the Kruskal-Wallis test to every numeric column of a Polars DataFrame and return a new DataFrame in which each column holds the KW result for that column.
My DataFrame, which has a large number of rows and columns, looks like this:
import polars as pl
df = pl.DataFrame({"Group": ["A", "B", "A", "A", "B"], "col1": [3, 3.2, None, 1, 2.3], "col2": [4, 5.4, 3.2, 1.5, 2.3]})
Group | col1 | col2 |
---|---|---|
A | 3.0 | 4.0 |
B | 3.2 | 5.4 |
A | null | 3.2 |
A | 1.0 | 1.5 |
B | 2.3 | 2.3 |
I want to apply KW such that I get this:
Group | col1 | col2 |
---|---|---|
[A,B] | {'Hstats': value1, 'p_val': value2} | {'Hstats': value3, 'p_val': value4} |
Here value1 through value4 are the results from the Kruskal-Wallis test.
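To make concrete what the values in that table stand for, here is a minimal plain-Python sketch of the usual Kruskal-Wallis computation (rank the pooled values, compare rank sums per group, apply the tie correction, take the p-value from a chi-square with k - 1 degrees of freedom). The helper names `kruskal_h` and `chi2_sf` are made up for this illustration and are not part of Polars or SciPy, and I am assuming the missing `col1` entry is simply dropped.

```python
import math
from collections import Counter


def chi2_sf(x, df):
    """Upper tail of the chi-square distribution for an integer df >= 1."""
    y = max(x, 0.0) / 2  # guard against tiny negative values from rounding
    # Closed-form starting points: Q(1/2, y) for odd df, Q(1, y) for even df.
    if df % 2:
        a, q = 0.5, math.erfc(math.sqrt(y))
    else:
        a, q = 1.0, math.exp(-y)
    # Recurrence Q(a + 1, y) = Q(a, y) + y**a * exp(-y) / Gamma(a + 1).
    while a < df / 2:
        q += y ** a * math.exp(-y) / math.gamma(a + 1)
        a += 1
    return q


def kruskal_h(groups):
    """Return (H, p) for a list of samples, one list of numbers per group."""
    pooled = sorted(v for g in groups for v in g)
    n = len(pooled)
    # Tied values share the mean of the rank positions they occupy.
    ranks = {}
    i = 0
    while i < n:
        j = i
        while j < n and pooled[j] == pooled[i]:
            j += 1
        ranks[pooled[i]] = (i + 1 + j) / 2
        i = j
    rank_term = sum(sum(ranks[v] for v in g) ** 2 / len(g) for g in groups)
    h = 12 / (n * (n + 1)) * rank_term - 3 * (n + 1)
    # Standard tie correction; the divisor equals 1 when there are no ties.
    ties = Counter(pooled)
    h /= 1 - sum(t ** 3 - t for t in ties.values()) / (n ** 3 - n)
    return h, chi2_sf(h, len(groups) - 1)


# Values from the example table, split by Group (missing col1 entry dropped).
col1 = {"A": [3.0, 1.0], "B": [3.2, 2.3]}
col2 = {"A": [4.0, 3.2, 1.5], "B": [5.4, 2.3]}

for name, col in (("col1", col1), ("col2", col2)):
    H, p = kruskal_h(list(col.values()))
    print(name, {"Hstats": H, "p_val": p})
```

For the five example rows this works out to H = 0.6 for `col1` and H = 1/3 for `col2`, with p-values of roughly 0.44 and 0.56. The real DataFrame is far too large to handle this way, which is why I am looking for a Polars-native approach.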
I have tried the following, which iterates over every column to build the output. But it's tedious, and the computation may take a long time when the DataFrame is large.
from scipy.stats import mstats

def kwa(arr):
    # arr is a Series holding one list of values per group
    try:
        H, p = mstats.kruskalwallis(arr.to_list())
        return {'Hstats': H, 'p_val': p}
    except Exception:
        return {'Hstats': 'NA', 'p_val': 'NA'}

def calckwa(df):
    # collect each column's values into one list per group
    dft = df.groupby('Group', maintain_order=True).agg(pl.all())
    trendcols = dft.get_columns()[1:]
    trendskwa = {trend.name: kwa(trend) for trend in trendcols}
    return trendskwa

kwa_dict = pl.from_dict(calckwa(df))
I also tried to follow this answer, which does it with Pandas. However, when I tried the same thing with Polars as below, I got a PanicException.
df_grp = df.groupby('Group')
newdf = df_grp.apply(lambda grp: grp.drop('Group').apply(kwa))
This throws: PanicException: BindingsError: "Could not determine output type"
I get the same error with any simple Polars DataFrame that has a few numeric columns and a string "Group" column.
So, how can I apply Kruskal-Wallis to all columns of a Polars DataFrame without iterating over them and without converting to a pandas DataFrame (if the DataFrame is already large, converting it to pandas will take even longer)?