
I wrote the function below to run a statistical test between two categories for every column of a pandas DataFrame, in parallel.

I was able to split the values by category using dask, but I then have to call compute() and run the t-test (or another statistical test) in pandas. Does anyone have an idea how to use dask to run not only the categorization but also the test for each column in parallel? My code is below:

import numpy as np
import pandas as pd
import dask.dataframe  # a plain "import dask" does not load the dask.dataframe submodule
from scipy.stats import ttest_ind 
from scipy.stats import ttest_rel 
from scipy.stats import kstest 

df = pd.DataFrame({
    'var1': np.random.randint(0, 1000000, 1000000),
    'var2': np.random.randint(0, 1000000, 1000000),
    'var3': np.random.randint(0, 1000000, 1000000),
    'Category': np.random.randint(0, 2, 1000000),
})


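# Custom aggregation that gathers all values of each group into one Python list.
custom_list = dask.dataframe.Aggregation(
    'custom_test',
    # chunk: within each partition, turn every group's values into a list
    chunk=lambda s: s.apply(lambda x: list(x)),
    # agg: "summing" the per-partition lists concatenates them per group
    agg=lambda s0: s0.obj.groupby(level=list(range(s0.obj.index.nlevels))).sum(),
    # finalize: identity, the concatenated lists are returned unchanged
    finalize=lambda s1: s1.apply(lambda x: x))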

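def testCustom(x, test=kstest, **args):
    # x holds one list of values per category; compare the first two
    x = list(x)
    # forward extra keyword arguments to the test (they were silently dropped before)
    return test(x[0], x[1], **args)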

def diffDiffrentialCategory(df, catcol='Category', test=ttest_ind, pVal=0.05, chunksize=10000, **args):
    ddf = dask.dataframe.from_pandas(df, chunksize=chunksize)

    # one row per category, one list of values per column
    df1 = ddf.groupby(catcol).aggregate(custom_list).compute()
    # I'd like to work directly on df1=ddf.groupby(catcol).aggregate(custom_list) w/o compute()
    # run the test column by column (in pandas), passing any extra keyword arguments through
    df1 = pd.DataFrame.from_records(df1.apply(testCustom, test=test, **args)).set_index(df1.columns).rename(columns={0: 'statistic', 1: 'p-value'})
    # keep only the columns where the two categories differ significantly
    return df1[df1['p-value'] <= pVal]

Appreciate the help/advice

Jano

  • I assume that you're trying to detect if there's a statistically significant difference in means between the two categories? Are these samples dependent or independent? – Nick ODell Oct 03 '20 at 21:45
  • @NickODell, yes, exactly. 1) I want to see for which columns the two categories are significantly different. This code actually does that. 2) Because I want to use it in different situations, the function testCustom can take any test as input. 3) The problem is that after I separate the two categories into lists using dask to build df1, I need to compute df1 and then apply testCustom to a pandas DataFrame. What I want is to apply testCustom, or something similar, to df1 before computing, so that it is run in parallel by dask instead of pandas. Does this make sense? – Jano Oct 04 '20 at 18:53
