I am currently trying three methods to measure the strength of the relationship between the variables in X and the dependent variable y: 1. Correlation, 2. Mutual information, 3. Distance correlation. Correlation is the fastest and simplest (1 hour on a sample of 3 million records and 560 variables). The mutual information calculation takes approximately 16 hours. I am also looking at distance correlation because of its interesting property: the distance correlation between Xi and Y is zero if and only if they are independent. However, I am facing a problem while doing the calculation in Python.
Below is my data:
**X**
prop_tenure prop_12m prop_6m prop_3m
0.04 0.04 0.06 0.08
0 0 0 0
0 0 0 0
0.06 0.06 0.1 0
0.38 0.38 0.25 0
0.61 0.61 0.66 0.61
0.01 0.01 0.02 0.02
0.1 0.1 0.12 0.16
0.04 0.04 0.04 0.09
0.22 0.22 0.22 0.22
0.72 0.72 0.73 0.72
0.39 0.39 0.45 0.64
**y**
status
0
0
1
1
0
0
0
1
0
0
0
1
I want to compute the distance correlation of each variable in X with y and store it in a DataFrame, so I am doing the following:
from sklearn.metrics.pairwise import pairwise_distances
num_metrics_df['distance_correlation'] = pairwise_distances(X,y,metric = 'correlation',njobs = -1)
However, the documentation says the following:
If Y is given (default is None), then the returned matrix is the pairwise distance between the arrays from both X and Y.
Does this require an equal number of features in both X and Y?
How can I get the distance correlation between each Xi and y in Python? Can someone please help me with this?
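As a small check of what that documentation sentence implies, the sketch below uses the sample data from this post. It suggests that pairwise_distances compares rows with rows, so both inputs need the same number of columns. Transposing X so that each variable becomes a row is my own assumption about how to line the shapes up, not something the documentation prescribes. The check also indicates that metric='correlation' is the correlation distance (1 minus the Pearson correlation), which is a different quantity from the distance correlation described above.

```python
import numpy as np
from sklearn.metrics.pairwise import pairwise_distances

# Sample data from the post: 12 records, 4 variables.
X = np.array([
    [0.04, 0.04, 0.06, 0.08], [0.00, 0.00, 0.00, 0.00],
    [0.00, 0.00, 0.00, 0.00], [0.06, 0.06, 0.10, 0.00],
    [0.38, 0.38, 0.25, 0.00], [0.61, 0.61, 0.66, 0.61],
    [0.01, 0.01, 0.02, 0.02], [0.10, 0.10, 0.12, 0.16],
    [0.04, 0.04, 0.04, 0.09], [0.22, 0.22, 0.22, 0.22],
    [0.72, 0.72, 0.73, 0.72], [0.39, 0.39, 0.45, 0.64],
])
y = np.array([0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0, 1])

# Rows are the vectors being compared, so 4 columns vs 1 column is rejected.
try:
    pairwise_distances(X, y.reshape(-1, 1), metric="correlation")
    mismatch_error = None
except ValueError as exc:
    mismatch_error = exc

# With variables as rows, both inputs have 12 "features" and the call works.
d = pairwise_distances(X.T, y.reshape(1, -1), metric="correlation")

# The result matches 1 - Pearson r for each column, i.e. correlation distance.
pearson = np.array([np.corrcoef(X[:, j], y)[0, 1] for j in range(X.shape[1])])
print(d.shape, np.allclose(d.ravel(), 1 - pearson))
```

Note that the keyword for parallelism in pairwise_distances is n_jobs, with an underscore.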
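For reference, a minimal NumPy sketch of the sample distance correlation, following the usual double-centering definition (the biased version), might look like the following. The function names are hypothetical and not taken from any library. Each call builds two n x n matrices, so this is only practical on a subsample of a few thousand rows, not on the full 3 million records.

```python
import numpy as np

def _centered_distances(v):
    """Double-centered matrix of pairwise absolute differences of a 1-D sample."""
    v = np.asarray(v, dtype=float)
    d = np.abs(v[:, None] - v[None, :])  # n x n distance matrix
    return (d - d.mean(axis=0, keepdims=True)
              - d.mean(axis=1, keepdims=True) + d.mean())

def distance_correlation(x, y):
    """Sample distance correlation of two 1-D samples of equal length."""
    A = _centered_distances(x)
    B = _centered_distances(y)
    dcov2 = (A * B).mean()    # squared distance covariance
    dvar2_x = (A * A).mean()  # squared distance variance of x
    dvar2_y = (B * B).mean()  # squared distance variance of y
    denom = np.sqrt(dvar2_x * dvar2_y)
    if denom == 0:            # a constant sample: defined as 0 here
        return 0.0
    return float(np.sqrt(max(dcov2, 0.0) / denom))

# Sample data from the post.
names = ["prop_tenure", "prop_12m", "prop_6m", "prop_3m"]
X = np.array([
    [0.04, 0.04, 0.06, 0.08], [0.00, 0.00, 0.00, 0.00],
    [0.00, 0.00, 0.00, 0.00], [0.06, 0.06, 0.10, 0.00],
    [0.38, 0.38, 0.25, 0.00], [0.61, 0.61, 0.66, 0.61],
    [0.01, 0.01, 0.02, 0.02], [0.10, 0.10, 0.12, 0.16],
    [0.04, 0.04, 0.04, 0.09], [0.22, 0.22, 0.22, 0.22],
    [0.72, 0.72, 0.73, 0.72], [0.39, 0.39, 0.45, 0.64],
])
y = np.array([0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0, 1])

# One score per variable, computed column by column against y.
scores = {name: distance_correlation(X[:, j], y) for j, name in enumerate(names)}
print(scores)
```

The resulting dictionary has one entry per variable, so it could be assigned to a DataFrame column indexed by variable name.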
Update:
I tried repeating the columns of y X.shape[1] times and then doing the calculation, but it raises a MemoryError for a sample of 10k records:
import numpy as np
from sklearn import preprocessing
from sklearn.metrics.pairwise import pairwise_distances

X = data_col.values
lb = preprocessing.LabelBinarizer()
df_target['drform'] = lb.fit_transform(df_target['status'])
y = df_target.values
# repeat each column of y X.shape[1] times
n_rep = X.shape[1]
y = np.repeat(y,n_rep,axis = 1)
num_metrics_df['distance_correlation'] = pairwise_distances(X,y,metric = 'correlation',njobs = -1)
Traceback (most recent call last):
File "<ipython-input-30-0f28f4b76a7e>", line 20, in <module>
num_metrics_df['distance_correlation'] = pairwise_distances(X,y,metric = 'correlation',njobs = -1)
File "C:\Users\test\AppData\Local\Continuum\anaconda3.1\lib\site-packages\sklearn\metrics\pairwise.py", line 1247, in pairwise_distances
return _parallel_pairwise(X, Y, func, n_jobs, **kwds)
File "C:\Users\test\AppData\Local\Continuum\anaconda3.1\lib\site-packages\sklearn\metrics\pairwise.py", line 1090, in _parallel_pairwise
return func(X, Y, **kwds)
File "C:\Users\test\AppData\Local\Continuum\anaconda3.1\lib\site-packages\scipy\spatial\distance.py", line 2381, in cdist
dm = np.empty((mA, mB), dtype=np.double)
MemoryError
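The traceback ends in np.empty((mA, mB)), where mA and mB are the row counts of the two inputs, so the call is allocating one value per pair of records rather than one per variable. A rough estimate of that output array alone, assuming 8-byte doubles and ignoring any intermediate copies (the helper name is hypothetical):

```python
def cdist_output_bytes(n_rows_a, n_rows_b, itemsize=8):
    """Bytes needed for an (n_rows_a, n_rows_b) array of doubles."""
    return n_rows_a * n_rows_b * itemsize

# 10k-record sample: 10,000 x 10,000 doubles
print(cdist_output_bytes(10_000, 10_000) / 1e9)         # 0.8 (GB)

# full data: 3,000,000 x 3,000,000 doubles
print(cdist_output_bytes(3_000_000, 3_000_000) / 1e12)  # 72.0 (TB)
```

The size depends only on the number of records, not on the 560 variables, so repeating the columns of y does not reduce it.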