PS: I am a data science student, and I am wondering how to interpret correlation on categorical data.
Let's say I have two categorical features: Ticket Class with values 1, 2, 3 (class 3 is lower than class 1), and Seat Number with values A, B, C, D, E, F and N (where N represents missing data).
The data looks like this:
Tclass Seat
1 A
2 C
3 E
2 D
3 N
1 A
1 N
The steps I perform are:
- I one-hot encode the seat number.
- Then I check the correlation of the resulting data frame using df.corr().
The result of the correlation with Tclass is:
Tclass 1.000000
Seat_N 0.713857
Seat_F 0.013122
Seat_C -0.042750
Seat_A -0.202143
Seat_E -0.225649
Seat_D -0.265341
Seat_B -0.353414
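For reference, the two steps above can be sketched roughly as follows, using only the seven sample rows shown earlier. This is a minimal sketch assuming pandas; the names `encoded` and `corr_with_class` are my own, and on such a small sample the values will not match the ones listed above, which were presumably computed on the full dataset.

```python
import pandas as pd

# The seven sample rows from the question (not the full dataset).
df = pd.DataFrame({
    "Tclass": [1, 2, 3, 2, 3, 1, 1],
    "Seat": ["A", "C", "E", "D", "N", "A", "N"],
})

# Step 1: one-hot encode the seat letter.
# dtype=float keeps the dummy columns numeric so corr() includes them.
encoded = pd.get_dummies(df, columns=["Seat"], dtype=float)

# Step 2: Pearson correlation of every column with Tclass, sorted descending.
corr_with_class = encoded.corr()["Tclass"].sort_values(ascending=False)
print(corr_with_class)
```

Each Seat_X column here is a 0/1 indicator, so its entry in `corr_with_class` is the correlation between "the seat is X" and the numeric value of Tclass.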
My questions are:
1. In this case, the conclusion drawn is that missing data (N) is highly correlated with lower class. Why was this conclusion drawn from the correlation data?
   The conclusion was that Seat_B is related to higher-class tickets while Seat_N is related to lower-class tickets. Is this the answer: since Seat_N has a positive correlation, it goes with higher values of Tclass, i.e. the numeric value 3, in other words the lower class?
2. If we correlate categorical data, how can we get negative results? (Can someone share some reading material on this?)
3. How should I interpret the correlation of one categorical variable with another categorical variable? (This question follows from question 2.)
4. Would it be possible to compute the correlation if Tclass were non-numerical/label encoded?
Reference: https://www.kaggle.com/ccastleberry/titanic-cabin-features/comments