I am currently trying to clean my dataset, and would like to remove variables that are highly correlated with one another. I have seen previously shared code that does this, but it does not seem to apply any criterion for which variable of a correlated pair is dropped. I would like the variable with the lower correlation to the dependent variable to be the one that is removed.
My dataset is in the format:
Name | Dependent | x1 | x2 | x3 | xn |
Here is what I have tried so far, but it does not seem to work. Any suggestions on how to change my code would be greatly appreciated!
import pandas as pd
import numpy as np

dataset = pd.read_csv('tetrahymena_padel_withDep.csv')
dataf1 = dataset.drop(['Name'], axis=1)
dataf2 = dataset.drop(['Name', 'Dependent'], axis=1)
corrWithDep = dataf1.corr().iloc[0]
corrWithVar = dataf2.corr()
col_corr = set()
for i in range(len(corrWithVar.columns)):
    for j in range(i):
        if (corrWithVar.iloc[i, j] >= 0.9) and (corrWithVar.columns[j] not in col_corr):
            if (corrWithDep.iloc[i] >= corrWithDep.iloc[j]):
                colname = corrWithVar.columns[j]
                col_corr.add(colname)
            else:
                colname = corrWithVar.columns[i]
                col_corr.add(colname)
            if colname in dataset.columns:
                del dataset[colname]
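
One likely reason the attempt above misbehaves: corrWithDep is taken from a frame that still contains Dependent, so its position 0 is the correlation of Dependent with itself, and every positional lookup is shifted by one relative to corrWithVar. The comparisons also use signed values, so strong negative correlations are ignored. A minimal sketch of one way to apply the intended rule follows, looking correlations up by column label instead of position. The function name drop_correlated, its parameters, and the toy data are illustrative assumptions, not part of the original code, and using absolute correlations is a design choice that may or may not match what is wanted.

```python
import pandas as pd


def drop_correlated(df, target="Dependent", threshold=0.9):
    """Drop one feature from each highly correlated pair, keeping the
    feature that correlates more strongly with `target` (hypothetical helper)."""
    # Only numeric predictors take part; a text column such as 'Name' is left alone.
    features = df.drop(columns=[target]).select_dtypes(include="number")
    # Absolute values, so strong negative correlations count as well (assumption).
    corr_features = features.corr().abs()
    # Indexed by column label, so it cannot drift out of step with corr_features.
    corr_target = features.corrwith(df[target]).abs()

    to_drop = set()
    cols = list(corr_features.columns)
    for i, col_i in enumerate(cols):
        for col_j in cols[:i]:
            if col_j in to_drop:
                continue  # already removed, nothing to compare against
            if corr_features.loc[col_i, col_j] >= threshold:
                # Discard whichever feature is less related to the target.
                if corr_target[col_i] >= corr_target[col_j]:
                    to_drop.add(col_j)
                else:
                    to_drop.add(col_i)
                    break  # col_i itself is gone; stop comparing it
    return df.drop(columns=list(to_drop))


# Toy data in the same layout as the question (values are made up).
toy = pd.DataFrame({
    "Name": ["a", "b", "c", "d", "e", "f"],
    "Dependent": [5, 9, 13, 17, 21, 27],
    "x1": [1, 2, 3, 4, 5, 6],
    "x2": [2, 4, 6, 8, 10, 13],  # nearly collinear with x1
    "x3": [5, 1, 4, 2, 6, 3],
})
result = drop_correlated(toy)
print(list(result.columns))
```

In the toy data, x1 and x2 are correlated above the threshold, and x2 tracks Dependent more closely, so x1 is the column that gets removed. For the real file, the call would be along the lines of drop_correlated(dataset), assuming the dependent column is named 'Dependent' as in the format shown earlier.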