I have two continuous variables and would like to compute the mutual information between them as a measure of similarity.

I've read some posts suggesting the use of mutual_info_score from scikit-learn, but will this work for continuous variables? One SO answer suggested converting the data into probabilities with np.histogram2d() and passing the resulting contingency table to mutual_info_score:
import numpy as np
from sklearn.metrics import mutual_info_score

def calc_MI(x, y, bins):
    # Bin the data; the joint histogram acts as a contingency table
    c_xy = np.histogram2d(x, y, bins)[0]
    # mutual_info_score accepts a precomputed contingency table directly
    mi = mutual_info_score(None, None, contingency=c_xy)
    return mi
x = [1,0,1,1,2,2,2,2,3,6,5,6,8,7,8,9]
y = [3,0,4,4,4,5,4,6,7,7,8,6,8,7,9,9]
mi = calc_MI(x,y,4)
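A quick check (illustrative only, reusing the toy data above) suggests the estimate is quite sensitive to the bin count, which is part of what makes me unsure:

# MI estimate (in nats) as a function of bin count
for n_bins in (2, 4, 8, 16):
    print(n_bins, calc_MI(x, y, n_bins))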
Is this a valid approach? I'm asking because I've also read that when the variables are continuous, the sums in the formula for discrete mutual information become integrals. Is that continuous formulation implemented in scikit-learn or any other package?
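For what it's worth, I did come across mutual_info_regression in sklearn.feature_selection, which (as far as I understand) uses a nearest-neighbor estimator intended for continuous variables rather than binning. A minimal sketch of how I'd call it on the toy data above:

from sklearn.feature_selection import mutual_info_regression

# X must be 2D (n_samples, n_features); y stays 1D
X = np.array(x).reshape(-1, 1)
mi_knn = mutual_info_regression(X, np.array(y), n_neighbors=3, random_state=0)[0]
print(mi_knn)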
EDIT:
A more realistic dataset:
# Draw 300 samples of two standardized Gaussians with correlation 0.6
L = np.linalg.cholesky([[1.0, 0.60], [0.60, 1.0]])
uncorrelated = np.random.standard_normal((2, 300))
correlated = np.dot(L, uncorrelated)
A = correlated[0]
B = correlated[1]
x = (A - np.mean(A)) / np.std(A)
y = (B - np.mean(B)) / np.std(B)
Can I use calc_MI(x, y, bins=50) on these data?
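My own sanity check (a sketch, assuming the construction above really does give correlation 0.6): for a bivariate Gaussian with correlation rho, the true mutual information has the closed form -0.5 * ln(1 - rho**2), so the histogram estimate can at least be compared against that:

# True MI (in nats) for a bivariate Gaussian with correlation rho
rho = 0.60
mi_true = -0.5 * np.log(1 - rho**2)
# Histogram estimate; tends to be biased upward when the bins are
# numerous relative to the sample size
mi_est = calc_MI(x, y, 50)
print(mi_true, mi_est)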