
I have two continuous variables, and would like to compute mutual information between them as a measure of similarity.

I've read some posts suggesting the use of mutual_info_score from scikit-learn, but will this work for continuous variables? One SO answer suggested converting the data into probabilities with np.histogram2d() and passing the resulting contingency table to mutual_info_score.

import numpy as np
from sklearn.metrics import mutual_info_score

def calc_MI(x, y, bins):
    # Bin the data into a 2-D histogram and pass the counts as a contingency table
    c_xy = np.histogram2d(x, y, bins)[0]
    mi = mutual_info_score(None, None, contingency=c_xy)
    return mi

x = [1,0,1,1,2,2,2,2,3,6,5,6,8,7,8,9]
y = [3,0,4,4,4,5,4,6,7,7,8,6,8,7,9,9]

mi = calc_MI(x,y,4)

Is this a valid approach? I'm asking because I've also read that when the variables are continuous, the sums in the discrete formula become integrals. Is that continuous estimator implemented in scikit-learn or any other package?
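For reference, these are the two definitions I mean, the discrete sum and its continuous counterpart:

I(X;Y) = \sum_{x} \sum_{y} p(x,y) \log \frac{p(x,y)}{p(x)\,p(y)}

I(X;Y) = \iint f(x,y) \log \frac{f(x,y)}{f_X(x)\,f_Y(y)} \, dx \, dy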

EDIT:

A more realistic dataset

# Generate 300 samples from a bivariate Gaussian with correlation 0.6
L = np.linalg.cholesky([[1.0, 0.60], [0.60, 1.0]])

uncorrelated = np.random.standard_normal((2, 300))
correlated = np.dot(L, uncorrelated)

A = correlated[0]
B = correlated[1]

# Standardise each series to zero mean and unit variance
x = (A - np.mean(A)) / np.std(A)
y = (B - np.mean(B)) / np.std(B)

Can I use calc_MI(x,y,bins=50) on these data?
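As a sanity check (a minimal sketch reusing calc_MI and the x, y generated above): for a bivariate Gaussian with correlation rho, the true mutual information has the closed form -0.5 * ln(1 - rho^2), so the binned estimate can be compared against it.

rho = 0.60
true_mi = -0.5 * np.log(1 - rho**2)   # about 0.223 nats for rho = 0.6

est_mi = calc_MI(x, y, bins=50)       # binned estimate from 300 samples
print(true_mi, est_mi)

With 50 bins on only 300 samples the histogram estimator tends to overestimate MI, which is part of what I'm unsure about.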

HappyPy
  • *"I have two continuous variables..."* What do you *actually* have? Parameters for two different continuous probability distributions? A set of *measurements* (a.k.a. *observations* or *samples*) that are presumed to come from some continuous but unknown probability distributions? Something else? – Warren Weckesser Jan 27 '23 at 17:52
  • @WarrenWeckesser My two signals are normalized time series data from heart rate recordings. I guess it would be the second: `A set of measurements (a.k.a. observations or samples) that are presumed to come from some continuous but unknown probability distributions?` – HappyPy Jan 27 '23 at 17:59
  • @WarrenWeckesser, I edited my question with a more realistic example. Can I use `calc_MI` as is, or should I still try to transform my data somehow? – HappyPy Jan 30 '23 at 21:15

1 Answer


I think the function you might be looking for is mutual_info_regression from sklearn.feature_selection.

This function estimates the mutual information between a target vector of continuous values and each column of a feature matrix. It uses a nearest-neighbour based estimator, so it works directly on continuous data without binning.

You can find more information on the scikit-learn documentation page for the function.
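A minimal usage sketch with the x and y from the question (the function expects a 2-D feature matrix of shape (n_samples, n_features), hence the reshape):

import numpy as np
from sklearn.feature_selection import mutual_info_regression

x = np.array([1, 0, 1, 1, 2, 2, 2, 2, 3, 6, 5, 6, 8, 7, 8, 9], dtype=float)
y = np.array([3, 0, 4, 4, 4, 5, 4, 6, 7, 7, 8, 6, 8, 7, 9, 9], dtype=float)

# Returns one MI estimate (in nats) per feature column, computed with a
# k-nearest-neighbour estimator, so the continuous data need not be binned.
mi = mutual_info_regression(x.reshape(-1, 1), y, random_state=0)
print(mi[0])

The n_neighbors parameter (default 3) controls the nearest-neighbour estimator; larger values reduce the variance of the estimate at the cost of some bias.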

Steven Cromb