I am trying to run a t-test to answer one of my questions. To do so, I created a subset of the data and then applied a chi-square test to check whether the data are suitable for a t-test. According to the results, the p-value appears to be approximately 3.5, which is impossible. I suspect this could be caused by the sample size of the subset I specified and of the dependent variable (a new column I calculated, with about 178 values).
In detail: the code I am sharing is for the project's first question (GitHub link attached).
Dependent variable: Delay
Independent variable: Gender
The code I tried:
Subset data
male = df.query('Gender == "0"')['Delay']
female = df.query('Gender == "1"')['Delay']
df.groupby('Gender').describe()
Create contingency table
GD = pd.crosstab(index=df['Gender'], columns=df['Delay'], margins=True)
GD
chi-square test
chiRes = stats.chi2_contingency(GD)
print(f'chi-square statistic: {chiRes[0]}')
print(f'p-value: {chiRes[1]}')
print(f'degree of freedom: {chiRes[2]}')
print('expected contingency table')
print(chiRes[3])
And these are the findings:
chi-square statistic: 519.651581316998
p-value: 3.590660196919681e-19 (?)
degree of freedom: 262 (?)
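A minimal sketch on made-up data (the `toy` DataFrame is hypothetical, not the project data) that may help in reading this output. It illustrates two things: the `e-19` suffix is scientific notation, so the printed value is about 3.59 × 10^-19 rather than 3.59, and `margins=True` adds an `All` row and column that `chi2_contingency` then counts as ordinary categories, which inflates the degrees of freedom.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Hypothetical toy data: 2 gender codes and 5 distinct delay values.
idx = np.arange(200)
toy = pd.DataFrame({"Gender": idx % 2, "Delay": idx % 5})

# The same crosstab with and without the "All" totals.
with_margins = pd.crosstab(toy["Gender"], toy["Delay"], margins=True)
without_margins = pd.crosstab(toy["Gender"], toy["Delay"])

# Degrees of freedom are (rows - 1) * (columns - 1) of the table passed in.
dof_with = stats.chi2_contingency(with_margins)[2]
dof_without = stats.chi2_contingency(without_margins)[2]
print(f"dof with margins: {dof_with}, without margins: {dof_without}")

# The value printed in the question, shown in shorter scientific notation.
p_value = 3.590660196919681e-19
print(f"{p_value:.3e}")
print("smaller than 0.05:", p_value < 0.05)
```

If the same reasoning applies to the real table, 262 = (3 - 1) × 131 would correspond to 131 distinct `Delay` values plus the `All` column, but that is an inference from the printed output, not something verified against the data.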
As a second approach, I tried the Shapiro-Wilk test for normality.
The code (stats.shapiro(male)) does not even run and raises this error:
ValueError: Data must be at least length 3.
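This error suggests that `male` holds fewer than three values, so it may be worth checking how `Gender` is stored. The sketch below uses a hypothetical `toy` DataFrame and assumes `Gender` is coded as the integers 0 and 1; under that assumption, comparing against the string `"0"` selects nothing.

```python
import pandas as pd

# Hypothetical toy data with Gender stored as integers (an assumption).
toy = pd.DataFrame({
    "Gender": [0, 1, 0, 1, 0, 1],
    "Delay": [3, 5, 2, 8, 4, 6],
})

# Inspect how the column is actually stored before filtering on it.
print(toy["Gender"].dtype)
print(toy["Gender"].unique())

# Comparing an integer column with a string matches no rows.
as_string = toy.loc[toy["Gender"] == "0", "Delay"]
# Comparing with the integer value selects the expected rows.
as_int = toy.loc[toy["Gender"] == 0, "Delay"]

print("rows matched with '0':", len(as_string))
print("rows matched with 0:", len(as_int))
```

Running the same `dtype` and `len` checks on the real `df` would show whether the subsets are empty before any test is applied.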
Lastly, I ran the t-test itself to see whether it would clarify anything, but it didn't.
rp.ttest(group1= df['Delay'][df['Gender'] == '0'], group1_name= "Male",
group2= df['Delay'][df['Gender'] == '1'], group2_name= "Female")
Output: Mean, SD, SE, and Conf. Interval all came back as NaN, although I know the data has no missing values.
How can I run a statistical test on this dataset? Are there any other points I should be aware of?
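NaN summaries are what an empty group would produce, even when the data has no missing values. The sketch below again uses a hypothetical `toy` DataFrame and assumes integer gender codes; it shows an empty selection yielding a NaN mean, then runs Welch's t-test with `scipy.stats.ttest_ind` on non-empty groups. It swaps in SciPy for the `rp.ttest` call purely for illustration.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Hypothetical toy data with Gender stored as integers (an assumption).
toy = pd.DataFrame({
    "Gender": [0, 1] * 10,
    "Delay": np.arange(20),
})

# Filtering on the string '0' matches nothing, so the mean is NaN.
empty_group = toy.loc[toy["Gender"] == "0", "Delay"]
print("empty group mean:", empty_group.mean())

# Filtering on the integer codes gives two usable groups.
male = toy.loc[toy["Gender"] == 0, "Delay"]
female = toy.loc[toy["Gender"] == 1, "Delay"]
print("group sizes:", len(male), len(female))

# Welch's t-test does not assume equal variances in the two groups.
result = stats.ttest_ind(male, female, equal_var=False)
print("t statistic:", result.statistic)
print("p-value:", result.pvalue)
```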