My pandas DataFrame has two categorical variables: the target, which has 2 unique values, and a feature, which has 300 unique values. Both columns have dtype object. I want to check the relationship between the two variables using a chi-square test. How can I perform the chi-square test, or otherwise check whether the two columns are related (correlated) or not?
Viewed 989 times
1 Answers
300 unique values is a lot for a single categorical variable, but you can still run the test with the code below:
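import pandas as pd
from scipy.stats import chi2_contingency

# Contingency table: one row per feature level, one column per target class
table = pd.crosstab(df['Feature_Var'], df['Target_Var'])
print(table)

# Chi-square test of independence; dof is an integer, so it is formatted with %d
stat, pvalue, dof, expected = chi2_contingency(table)
print('Chi-sq Test Statistics = %.3f \nP-Value = %.3f \nDegrees of Freedom = %d' % (stat, pvalue, dof))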
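With 300 levels spread over roughly 4000 rows, many cells of the contingency table may have small expected counts, which makes the chi-square approximation less reliable. The sketch below is one possible way to check that, not part of the original answer. It runs on synthetic data standing in for the asker's columns; the helper `chi2_summary`, the `min_count` threshold, and the "Other" bucket are all hypothetical choices made for illustration. It reports the share of cells with an expected count below 5 and Cramér's V as a rough effect size, before and after pooling rare levels.

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency

# Synthetic stand-in for the asker's data (assumption): 4000 rows,
# 300 device names, and a Yes/No target drawn independently.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "Feature_Var": rng.choice([f"device_{i}" for i in range(300)], size=4000),
    "Target_Var": rng.choice(["Yes", "No"], size=4000),
})

def chi2_summary(feature, target):
    """Hypothetical helper: chi-square test plus two diagnostics."""
    table = pd.crosstab(feature, target)
    stat, pvalue, dof, expected = chi2_contingency(table)
    n = table.to_numpy().sum()
    # Cramér's V: chi-square rescaled to the range [0, 1]
    cramers_v = np.sqrt(stat / (n * (min(table.shape) - 1)))
    # Share of cells whose expected count is below the usual rule of thumb of 5
    share_small = (expected < 5).mean()
    return {"stat": stat, "pvalue": pvalue, "dof": dof,
            "cramers_v": cramers_v, "share_small": share_small}

raw = chi2_summary(df["Feature_Var"], df["Target_Var"])

# Pool levels with fewer than min_count rows into one "Other" level.
min_count = 15  # arbitrary threshold chosen for this sketch
counts = df["Feature_Var"].value_counts()
rare = counts[counts < min_count].index
pooled_feature = df["Feature_Var"].where(~df["Feature_Var"].isin(rare), "Other")
pooled = chi2_summary(pooled_feature, df["Target_Var"])

print("raw:   ", raw)
print("pooled:", pooled)
```

Pooling trades detail for more stable expected counts, so whether it is appropriate depends on how much the individual rare devices matter for the analysis.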

ManojK
-
He's asking about the correlation between two columns, not between a column's levels and the target. – Sergey Bushmanov Mar 19 '20 at 09:52
-
Yup, I am asking about the correlation between two columns, not the correlation between the values of two columns. I want to see whether the two columns are related to one another or not. – geek Mar 19 '20 at 09:55
-
@geek - Can you post an example? – ManojK Mar 19 '20 at 09:55
-
Yup, 300 unique values is too many, but the dataset contains 4000 rows. – geek Mar 19 '20 at 09:55
-
Example: the target column contains only Yes and No, and the other column, the feature, contains the names of the devices. – geek Mar 19 '20 at 09:57
-
@manojk You may try `df = pd.DataFrame({"f":np.random.choice(["a","b","c"], 1000, True, [.2,.4,.4]), "t":np.random.randint(0,2,1000)})` – Sergey Bushmanov Mar 19 '20 at 10:02
-
@SergeyBushmanov - Sorry if I have misunderstood, but the OP mentioned that both columns are categorical. As far as I know, we can't directly compute a correlation between 2 categorical variables; I can only think of a chi-square test. – ManojK Mar 19 '20 at 10:06
-
@manojk Yes, we cannot do a correlation test on two categorical variables, even if we convert them to numbers. The reason -- the resulting numeric codes impose an arbitrary order on unordered categories. But your approach seems to work after looking at it more closely.... I think it's the closest to what can be done to answer the OP. – Sergey Bushmanov Mar 19 '20 at 10:09
-
@SergeyBushmanov - Thanks, this is how I understand a correlation between two categorical columns in pandas: in practice it comes down to a chi-square test for a relationship between the two variables. – ManojK Mar 19 '20 at 10:15
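The point made in the comments, that numeric codes for unordered categories are arbitrary, can be illustrated with a small sketch. It builds on the sample frame suggested above, with one added assumption: the target is made to depend on the feature so there is something to detect. The two codings `coding_1` and `coding_2` are hypothetical. The Pearson correlation changes with the coding, while the chi-square statistic does not, since relabelling only reorders the rows of the contingency table.

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency

rng = np.random.default_rng(42)

# Feature with three unordered levels, as in the sample from the comments.
f = rng.choice(["a", "b", "c"], size=1000, p=[0.2, 0.4, 0.4])

# Assumption for illustration: the probability of t == 1 depends on the level.
p_one = pd.Series(f).map({"a": 0.2, "b": 0.8, "c": 0.3}).to_numpy()
t = (rng.random(1000) < p_one).astype(int)
df = pd.DataFrame({"f": f, "t": t})

# Two equally arbitrary integer codings of the same categories.
coding_1 = {"a": 0, "b": 1, "c": 2}
coding_2 = {"a": 1, "b": 0, "c": 2}

# Pearson correlation depends on which coding was picked.
r1 = df["f"].map(coding_1).corr(df["t"])
r2 = df["f"].map(coding_2).corr(df["t"])

# The chi-square statistic is the same for the raw labels and the recoded ones.
stat_labels = chi2_contingency(pd.crosstab(df["f"], df["t"]))[0]
stat_recoded = chi2_contingency(pd.crosstab(df["f"].map(coding_2), df["t"]))[0]

print("Pearson r, coding 1:", r1)
print("Pearson r, coding 2:", r2)
print("chi-square, labels vs recoded:", stat_labels, stat_recoded)
```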