18

I have a dataset with binary categorical variables and continuous variables. I'm trying to fit a linear regression model to predict a continuous variable. How can I check the correlation between the categorical variables and the continuous target variable?

Current Code:

import pandas as pd

# Raw string so the backslashes in the Windows path are not read as escapes
df_hosp = pd.read_csv(r'C:\Users\LAPPY-2\Desktop\LengthOfStay.csv')

data = df_hosp[['lengthofstay', 'male', 'female', 'dialysisrenalendstage', 'asthma',
                'irondef', 'pneum', 'substancedependence',
                'psychologicaldisordermajor', 'depress', 'psychother',
                'fibrosisandother', 'malnutrition', 'hemo']]
# Pairwise Pearson correlation between all selected columns
print(data.corr())

All of the variables apart from lengthofstay are categorical. Should this work?

petezurich
funnyguy
  • What have you tried so far? Provide us with the code and clearly mention where you're having the issue. – Adeel Ahmad Jun 22 '17 at 08:36
  • Look for ANOVA in Python (in R it would be "aov"). This helps you identify whether the means (continuous values) of the different groups (categorical values) differ significantly. If you have only two groups, use a two-sided t-test (paired or unpaired). – Rockbar Jun 22 '17 at 08:38
  • Follow this tutorial. I think that is what you are looking for: http://www.marsja.se/four-ways-to-conduct-one-way-anovas-using-python/ – Rockbar Jun 22 '17 at 09:10
  • @AdeelAhmad I've added the code that I've got so far. The output that I got was a matrix, but I'm not sure if that is correct or not. For continuous variables this works well, as far as I know. – funnyguy Jun 22 '17 at 09:37
  • Thanks @Rockbar, but I have the data in a pandas dataframe and there are multiple columns with a huge number of observations. Would ANOVA be good here? – funnyguy Jun 22 '17 at 09:40
  • ok, then your method would be "logistic regression", identifying the descriptors of your continuous variable (whatever it is) ?! An example of such an analysis case is here from Faraway on diabetes: http://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_diabetes.html – Rockbar Jun 22 '17 at 09:56
  • Hi @Rockbar but eventually I need to predict a continuous variable, I don't think we can use logistic regression for this – funnyguy Jun 22 '17 at 09:58
  • ok, then I was mistaken. Then something like the suggestion below would be the right way – Rockbar Jun 22 '17 at 11:32

3 Answers

22

Convert your categorical variable into dummy variables and put the resulting columns in a NumPy array. For example:

data.csv:

age,size,color_head
4,50,black
9,100,blonde
12,120,brown
17,160,black
18,180,brown

Extract data:

import numpy as np
import pandas as pd

df = pd.read_csv('data.csv')

df:

   age  size color_head
0    4    50      black
1    9   100     blonde
2   12   120      brown
3   17   160      black
4   18   180      brown

Convert categorical variable color_head into dummy variables:

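# dtype=int keeps the dummy columns as 0/1 integers (newer pandas versions default to booleans)
df_dummies = pd.get_dummies(df['color_head'], dtype=int)
# Drop the last dummy column (brown): it is implied by the remaining ones
del df_dummies[df_dummies.columns[-1]]
df_new = pd.concat([df, df_dummies], axis=1)
del df_new['color_head']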

df_new:

   age  size  black  blonde
0    4    50      1       0
1    9   100      0       1
2   12   120      0       0
3   17   160      1       0
4   18   180      0       0

Put that in numpy array:

x = df_new.values

Compute the correlation:

correlation_matrix = np.corrcoef(x.T)
print(correlation_matrix)

Output:

array([[ 1.        ,  0.99574691, -0.23658011, -0.28975028],
       [ 0.99574691,  1.        , -0.30318496, -0.24026862],
       [-0.23658011, -0.30318496,  1.        , -0.40824829],
       [-0.28975028, -0.24026862, -0.40824829,  1.        ]])

See:

numpy.corrcoef
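
As a pandas-only variant of the steps above, the sketch below one-hot encodes color_head and reads off each column's correlation with a single target column, which is usually what matters before fitting a regression. It reuses the sample rows from data.csv. Treating `size` as the target and using `drop_first=True` (which drops the first level, `black`, rather than the last one) are assumptions made for illustration only.

```python
import io

import pandas as pd

# Same sample rows as data.csv above, inlined so the sketch is self-contained
csv_text = """age,size,color_head
4,50,black
9,100,blonde
12,120,brown
17,160,black
18,180,brown
"""
df = pd.read_csv(io.StringIO(csv_text))

# One-hot encode color_head; dtype=int keeps the dummies as 0/1,
# drop_first=True drops one level to avoid a redundant column
encoded = pd.get_dummies(df, columns=['color_head'], drop_first=True, dtype=int)

# Pearson correlation of every other column with the assumed target 'size'
corr_with_target = encoded.corr()['size'].drop('size')
print(corr_with_target)
```

For the data in the question, the same pattern would be `data.corr()['lengthofstay']`, since the binary columns there are already coded as numbers.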

glegoux
3

Correlation can be quite misleading in this scenario, because we are comparing a categorical variable with a continuous variable.

Harsh
  • Not necessarily. After converting to dummy variables, which is being done in the answer by @glegoux, the categorical variable is converted to multiple columns, each becoming a binary column. In such a scenario, the correlation becomes Point Biserial Correlation. – Sourajyoti Datta Aug 10 '22 at 08:40
0

There is one more way to compute the correlation between a continuous variable and a dichotomous variable (a categorical variable with only two classes): the point-biserial correlation. A tutorial for computing it in Python is at https://www.statology.org/point-biserial-correlation-python/
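
A minimal sketch of that calculation, assuming SciPy is installed. The arrays `flag` and `stay` are made-up illustrative values, not data from the question. `scipy.stats.pointbiserialr` returns the coefficient together with a p-value, and the coefficient matches the ordinary Pearson correlation computed on the 0/1 column.

```python
import numpy as np
from scipy import stats

# Hypothetical data: a binary indicator and a continuous outcome
flag = np.array([0, 0, 0, 1, 1, 1, 0, 1])
stay = np.array([2.0, 3.0, 2.5, 6.0, 5.5, 7.0, 3.5, 6.5])

# Point-biserial correlation and its two-sided p-value
r_pb, p_value = stats.pointbiserialr(flag, stay)

# Plain Pearson correlation on the same 0/1 column, for comparison
r_pearson = np.corrcoef(flag, stay)[0, 1]

print(f"point-biserial r = {r_pb:.4f}, p = {p_value:.4g}")
print(f"Pearson r        = {r_pearson:.4f}")
```

Because the two coefficients agree, a correlation matrix that includes 0/1 columns can be read as point-biserial correlations for those columns.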

Yunus
  • Please [add context to your link](https://meta.stackexchange.com/questions/8231/are-answers-that-just-contain-links-elsewhere-really-good-answers/8259#8259) so your fellow users will have some idea what it is and why it's there, and then quote the most relevant part of the page you're linking to in case the target page is unavailable. Answers that are little more than a link may be deleted. – Koedlt Dec 24 '22 at 08:07
  • This does not provide an answer to the question. Once you have sufficient [reputation](https://stackoverflow.com/help/whats-reputation) you will be able to [comment on any post](https://stackoverflow.com/help/privileges/comment); instead, [provide answers that don't require clarification from the asker](https://meta.stackexchange.com/questions/214173/why-do-i-need-50-reputation-to-comment-what-can-i-do-instead). - [From Review](/review/late-answers/33479593) – コリン Dec 25 '22 at 14:44