
I want to run a Kruskal-Wallis (KW) test on every numeric column of a Polars dataframe and return a new dataframe in which each column holds the KW result for that column.

My dataframe, which has a large number of rows and columns, looks like this:

import polars as pl

df = pl.DataFrame({"Group": ["A", "B", "A", "A", "B"], "col1": [3.0, 3.2, None, 1.0, 2.3], "col2": [4.0, 5.4, 3.2, 1.5, 2.3]})

Group col1 col2
A 3.0 4.0
B 3.2 5.4
A null 3.2
A 1.0 1.5
B 2.3 2.3

I want to apply KW such that I get this:

Group col1 col2
[A,B] {'Hstats': value1, 'p_val': value2} {'Hstats': value3, 'p_val': value4}

value1 through value4 are the results of the Kruskal-Wallis test.

I have tried the following, where I iterate over every column to get the output. But it's tedious and may take a long time to compute when the dataframe is large.

from scipy.stats import mstats

def kwa(arr):
    # arr holds one list of values per group for a single column
    try:
        H, p = mstats.kruskalwallis(arr.to_list())
        return {'Hstats': H, 'p_val': p}
    except Exception:
        return {'Hstats': 'NA', 'p_val': 'NA'}

def calckwa(df):
    # Collect each column's values into one list per group
    dft = df.groupby('Group', maintain_order=True).agg(pl.all())
    trendcols = dft.get_columns()[1:]  # skip the 'Group' column
    trendskwa = {trend.name: kwa(trend) for trend in trendcols}
    return trendskwa

kwa_dict = pl.from_dict(calckwa(df))

I also tried to follow this answer, where it is done with pandas. However, when I tried the same approach with Polars as shown below, I got a PanicException.

df_grp = df.groupby('Group')
newdf = df_grp.apply(lambda grp: grp.drop('Group').apply(kwa))

This throws the error: PanicException: BindingsError: "Could not determine output type". I get the same error with any simple Polars dataframe that has a few numeric columns and a string "Group" column.

So, how can I apply Kruskal-Wallis to all the columns of a Polars dataframe without iterating over them and without converting to a pandas dataframe (if the dataframe is already large, the conversion would take even longer)?

Dean MacGregor
megha

1 Answer


The simplest thing is to just use pl.reduce like this:

from scipy.stats.mstats import kruskalwallis
df.select(pl.reduce(kruskalwallis, ('col1','col2')))

You can do the same in a groupby().agg context, as you'd expect:

df.groupby("Group").agg(pl.reduce(kruskalwallis, ('col1','col2')))

To get the results in their own named columns, you can use to_struct followed by unnest, like this:

(
    df
    .groupby("Group")
    .agg(krusk=pl.reduce(kruskalwallis, ('col1','col2')))
    .with_columns(pl.col('krusk').list.to_struct(fields=['Hstats','p_val']))
    .unnest('krusk')
)

Note: the .list namespace used above recently replaced .arr in Polars. If you're on an older version of Polars, use .arr in place of .list; everything else should stay the same. (This is unrelated to the parameter named arr in OP's question.)

Dean MacGregor
  • Thank you for your response. But I don't want to run the Kruskal-Wallis test between col1 and col2. Rather, I want to group every column by "Group" so that I can run the test per column between the two groups A and B. In my original dataframe, the col1 values belong to either group A or group B. If I do groupby("Group"), I get two rows for col1, each holding the list of values for A and B respectively. I want the Kruskal-Wallis result for those two rows. That's what I am doing in my function calckwa(). – megha Aug 16 '23 at 08:59
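The per-column comparison described in the comment above (for each numeric column, split its values by "Group" and test the groups against each other) can be sketched in plain Python to make the intended computation concrete. This is only an illustrative sketch, not the answerer's method and not Polars code: the helper names kruskal_h, chi2_sf and kruskal_per_column are hypothetical, missing values are assumed to be None and are simply dropped, and the p-value uses the usual chi-square approximation with (number of groups - 1) degrees of freedom. In practice one would more likely pass each column's per-group lists to scipy.stats.kruskal.

```python
import math
from collections import defaultdict


def chi2_sf(x, df):
    """Upper-tail probability of a chi-square variable with integer df >= 1."""
    if x <= 0:
        return 1.0
    half = x / 2.0
    # Closed forms for df = 1 and df = 2, then the recurrence
    # Q(a + 1, z) = Q(a, z) + z**a * exp(-z) / Gamma(a + 1), with z = x / 2.
    if df % 2 == 1:
        q, a = math.erfc(math.sqrt(half)), 0.5
    else:
        q, a = math.exp(-half), 1.0
    while a < df / 2.0:
        q += half ** a * math.exp(-half) / math.gamma(a + 1.0)
        a += 1.0
    return q


def kruskal_h(samples):
    """Tie-corrected Kruskal-Wallis H and approximate p-value (hypothetical helper)."""
    samples = [list(s) for s in samples if len(s) > 0]
    k = len(samples)
    if k < 2:
        raise ValueError("need at least two non-empty groups")
    pooled = sorted(v for s in samples for v in s)
    n = len(pooled)
    rank_of, tie_term = {}, 0
    i = 0
    while i < n:
        j = i
        while j < n and pooled[j] == pooled[i]:
            j += 1
        t = j - i                              # size of this run of tied values
        rank_of[pooled[i]] = (i + 1 + j) / 2   # average of ranks i+1 .. j
        tie_term += t ** 3 - t
        i = j
    rank_sums = sum(sum(rank_of[v] for v in s) ** 2 / len(s) for s in samples)
    h = 12.0 / (n * (n + 1)) * rank_sums - 3 * (n + 1)
    correction = 1.0 - tie_term / (n ** 3 - n)
    if correction == 0:
        raise ValueError("all values are identical")
    h /= correction
    return h, chi2_sf(h, k - 1)


def kruskal_per_column(groups, columns):
    """For each column, split its values by group label and test across groups."""
    out = {}
    for name, values in columns.items():
        by_group = defaultdict(list)
        for g, v in zip(groups, values):
            if v is not None:                  # assumption: None marks a missing value
                by_group[g].append(v)
        h, p = kruskal_h(list(by_group.values()))
        out[name] = {"Hstats": h, "p_val": p}
    return out


# The example data from the question
groups = ["A", "B", "A", "A", "B"]
columns = {
    "col1": [3.0, 3.2, None, 1.0, 2.3],
    "col2": [4.0, 5.4, 3.2, 1.5, 2.3],
}
result = kruskal_per_column(groups, columns)
print(result)
```

The result has one entry per column with the keys 'Hstats' and 'p_val', which matches the shape of the output requested in the question. The grouping step here is an explicit Python loop, so it only illustrates the computation and does not address the performance concern about iterating over columns.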