
I have this DataFrame:

print(TempvsDType)
              CurrentThermostatTemp
DwellingType                       
Bungalow                        0.0
Bungalow                       22.0
Bungalow                       22.0
Bungalow                       25.0
Bungalow                       18.0
Bungalow                       21.0
Bungalow                       22.0
Bungalow                       10.0
Bungalow                       18.0
Bungalow                       20.0
Bungalow                       20.0
Bungalow                       22.0
Bungalow                       20.0
Bungalow                       10.0
Bungalow                       30.0
Bungalow                       22.0
Bungalow                       20.0
Bungalow                       20.0
Bungalow                       19.0
Bungalow                       20.0
Bungalow                       22.0
Bungalow                       20.0
Bungalow                       21.0
Bungalow                       22.0
Bungalow                       15.0
Bungalow                       22.0
Bungalow                        0.0
Bungalow                       24.0
Bungalow                       30.0
Bungalow                       20.0
...                             ...
Park Home                      20.0
Park Home                      23.0
Park Home                      20.0
Park Home                      20.0
Park Home                      20.0
Park Home                      18.0
Park Home                      20.0
Park Home                      15.0
Park Home                      12.0
Park Home                      20.0
Park Home                      20.0
Park Home                      23.0
Park Home                      21.0
Park Home                      20.0
Park Home                      20.0
Park Home                      20.0
Park Home                      23.0
Park Home                      18.0
Park Home                      20.0
Park Home                      18.0
Park Home                      16.0
Park Home                      17.0
Park Home                      20.0
Park Home                      20.0
Park Home                      18.0
Park Home                      18.0
Park Home                      20.0
Park Home                      20.0
Park Home                      15.0
Park Home                      21.0

[6247 rows x 1 columns]

I have separated each variable with the .truncate() method:


Flat = TempvsDType.truncate(before="Flat",after="Flat")
House = TempvsDType.truncate(before="House",after="House")
Bungalow = TempvsDType.truncate(before="Bungalow",after="Bungalow")
Maisonette = TempvsDType.truncate(before="Maisonette",after="Maisonette")
ParkHome = TempvsDType.truncate(before="Park Home",after="Park Home")

My goal here is to perform a Student's t-test for every possible pair of variables, without duplicate or repeated pairs. However, I had to do this manually, which was very long and time-consuming, especially for other scripts where there are more than 5 variables and the number of combinations increases substantially. This was my manual method:

from scipy.stats import ttest_ind
#All possible combinations:
Flat_House = ttest_ind(Flat,House)
Flat_Bungalow = ttest_ind(Flat,Bungalow)
Flat_Maisonette = ttest_ind(Flat,Maisonette)
Flat_ParkHome = ttest_ind(Flat,ParkHome)
House_Bungalow = ttest_ind(House,Bungalow)
House_Maisonette = ttest_ind(House,Maisonette)
House_ParkHome = ttest_ind(House,ParkHome)
Bungalow_Maisonette = ttest_ind(Bungalow,Maisonette)
Bungalow_ParkHome = ttest_ind(Bungalow,ParkHome)
Maisonette_ParkHome = ttest_ind(Maisonette, ParkHome)
#t-test between each combination
print("t-test between {} and {} is {} and p-value:{}".format(u[0],u[1],Flat_House[0],Flat_House[1]))
print("t-test between {} and {} is {} and p-value:{}".format(u[0],u[2],Flat_Bungalow[0],Flat_Bungalow[1]))
print("t-test between {} and {} is {} and p-value:{}".format(u[0],u[3],Flat_Maisonette[0],Flat_Maisonette[1]))
print("t-test between {} and {} is {} and p-value:{}".format(u[0],u[4],Flat_ParkHome[0],Flat_ParkHome[1]))
print("t-test between {} and {} is {} and p-value:{}".format(u[1],u[2],House_Bungalow[0],House_Bungalow[1]))
print("t-test between {} and {} is {} and p-value:{}".format(u[1],u[3],House_Maisonette[0],House_Maisonette[1]))
print("t-test between {} and {} is {} and p-value:{}".format(u[1],u[4],House_ParkHome[0],House_ParkHome[1]))
print("t-test between {} and {} is {} and p-value:{}".format(u[2],u[3],Bungalow_Maisonette[0],Bungalow_Maisonette[1]))
print("t-test between {} and {} is {} and p-value:{}".format(u[2],u[4],Bungalow_ParkHome[0],Bungalow_ParkHome[1]))
print("t-test between {} and {} is {} and p-value:{}".format(u[3],u[4],Maisonette_ParkHome[0],Maisonette_ParkHome[1]))

Therefore, I would like to know how I can write a function that does this automatically, i.e. prints the Student's t-test for all possible combinations, except duplicates and repeated pairs, in the same format I printed manually. I have tried this many times but have not succeeded. I would be very pleased if someone could help me. Thank you.

1 Answer

from itertools import combinations
from scipy.stats import ttest_ind

dfs = dict(tuple(TempvsDType.drop_duplicates(inplace=False).groupby('DwellingType'))) #  drop rows with duplicate column values (the index is ignored), then create a dictionary of dataframes after grouping by DwellingType

def ttest(pair):
    results = ttest_ind(dfs[pair[0]]['CurrentThermostatTemp'], dfs[pair[1]]['CurrentThermostatTemp'])
    print(f"t-test between {pair[0]} and {pair[1]} is {results[0]} and p-value: {results[1]}")

all_combinations = list(combinations(dfs.keys(), 2)) # find all unordered pairs of keys in the dict of dataframes
for pair in all_combinations: # pass every pair through the function ttest
    ttest(pair)

Output: t-test between Bungalow and Park Home is 0.2594309721800956 and p-value: 0.7984182890048678
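
A possible variant, offered as a sketch rather than a definitive implementation: it wraps the same idea in a single function that prints each result in the requested format and also returns the results as a DataFrame. The function name pairwise_ttests, the result column names, and the small demo frame are all made up for illustration. It groups on the index without calling drop_duplicates, since the comment below reports that the code worked once that call was removed.

```python
from itertools import combinations

import pandas as pd
from scipy.stats import ttest_ind


def pairwise_ttests(df, value_col):
    """Run an independent two-sample t-test for every unordered pair of index labels.

    Hypothetical helper: the name and the returned column names are illustrative.
    """
    # Split the column by index label (DwellingType in the question), keeping all rows
    groups = {name: g[value_col].dropna() for name, g in df.groupby(level=0)}
    rows = []
    # combinations() yields each unordered pair once, so no repeats or self-pairs
    for a, b in combinations(groups, 2):
        stat, pvalue = ttest_ind(groups[a], groups[b])
        print(f"t-test between {a} and {b} is {stat} and p-value: {pvalue}")
        rows.append({"group_1": a, "group_2": b, "t_statistic": stat, "p_value": pvalue})
    return pd.DataFrame(rows)


# Made-up miniature frame standing in for TempvsDType
demo = pd.DataFrame(
    {"CurrentThermostatTemp": [20.0, 22.0, 21.0, 18.0, 20.0, 19.0, 23.0, 20.0, 22.0]},
    index=pd.Index(["Flat"] * 3 + ["House"] * 3 + ["Bungalow"] * 3, name="DwellingType"),
)
results = pairwise_ttests(demo, "CurrentThermostatTemp")
```

With the real data, the call would presumably be pairwise_ttests(TempvsDType, "CurrentThermostatTemp"). Returning a DataFrame makes it easier to sort or filter by p-value afterwards.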

RJ Adriaansen
  • Thank you very much, sir. By taking away drop_duplicates(inplace=False) it works perfectly as I wanted. Appreciate the help. – GUILLE DOMINGUEZ Mar 15 '21 at 17:28
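
A likely reason removing drop_duplicates helps: DataFrame.drop_duplicates compares column values only and ignores the index, so on a frame like this it keeps just the first row for each distinct temperature and shrinks the groups. The sketch below illustrates this on a made-up miniature frame; the values are not taken from the question's data.

```python
import pandas as pd

# Made-up miniature of TempvsDType: temperatures repeat within and across groups
demo = pd.DataFrame(
    {"CurrentThermostatTemp": [20.0, 20.0, 22.0, 20.0, 22.0, 18.0]},
    index=pd.Index(["Flat", "Flat", "Flat", "House", "House", "House"], name="DwellingType"),
)

# drop_duplicates looks at column values only, so the DwellingType index is ignored
deduped = demo.drop_duplicates()

# Count rows per dwelling type before and after dropping duplicates
sizes_before = demo.groupby(level=0).size()
sizes_after = deduped.groupby(level=0).size()
print(sizes_before.to_dict())
print(sizes_after.to_dict())
```

Here Flat goes from 3 rows to 2 and House from 3 rows to 1, even though every original row is a legitimate observation, so a t-test on the deduplicated groups would run on a different sample.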