78

For an m x n matrix, what's the optimal (fastest) way to compute the mutual information for all pairs of columns (n x n)?

By mutual information, I mean:

I(X, Y) = H(X) + H(Y) - H(X,Y)

where H(X) refers to the Shannon entropy of X.
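Concretely, H is computed from the normalized bin counts p_i as

H(X) = -sum_i p_i * log2(p_i)

which is what shan_entropy does in the code below.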

Currently I'm using np.histogram2d and np.histogram to calculate the joint (X, Y) and marginal (X or Y) counts. For a given matrix A (e.g. a 250000 x 1000 matrix of floats), I am using a nested for loop:

n = A.shape[1]
for ix in np.arange(n):
    for jx in np.arange(ix+1, n):
        matMI[ix, jx] = calc_MI(A[:, ix], A[:, jx])

Surely there must be better/faster ways to do this?

As an aside, I've also looked for mapping functions on columns (column-wise or row-wise operations) on arrays, but haven't found a good general answer yet.
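
For concreteness, here is a small sketch of the kind of column-wise mapping I mean: np.apply_along_axis maps a 1-D function over the columns (or rows) of a single array, but there is no obvious pairwise-column analogue (arr and col_means are just placeholder names):

import numpy as np

arr = np.random.rand(100, 5)

# Maps a 1-D function over each column (axis=0 slices) of one array;
# there is no built-in that maps a function over pairs of columns.
col_means = np.apply_along_axis(np.mean, 0, arr)
print(col_means.shape)  # (5,)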

Here is my full implementation, following the conventions of the Wikipedia page on mutual information:

import numpy as np

def calc_MI(X, Y, bins):
    # Joint and marginal bin counts.
    c_XY = np.histogram2d(X, Y, bins)[0]
    c_X = np.histogram(X, bins)[0]
    c_Y = np.histogram(Y, bins)[0]

    H_X = shan_entropy(c_X)
    H_Y = shan_entropy(c_Y)
    H_XY = shan_entropy(c_XY)

    MI = H_X + H_Y - H_XY
    return MI

def shan_entropy(c):
    # Shannon entropy (in bits) of an array of counts, ignoring empty bins.
    c_normalized = c / float(np.sum(c))
    c_normalized = c_normalized[np.nonzero(c_normalized)]
    H = -np.sum(c_normalized * np.log2(c_normalized))
    return H
 
A = np.array([[ 2.0,  140.0,  128.23, -150.5, -5.4  ],
              [ 2.4,  153.11, 130.34, -130.1, -9.5  ],
              [ 1.2,  156.9,  120.11, -110.45,-1.12 ]])
 
bins = 5 # ?
n = A.shape[1]
matMI = np.zeros((n, n))
 
for ix in np.arange(n):
    for jx in np.arange(ix+1,n):
        matMI[ix,jx] = calc_MI(A[:,ix], A[:,jx], bins)

Although my working version with nested for loops runs at a reasonable speed, I'd like to know whether there is a faster way to apply calc_MI to all pairs of columns of A (to calculate their pairwise mutual information).

I'd also like to know:

  1. Whether there are efficient ways to map functions to operate on columns (or rows) of np.arrays (maybe like np.vectorize, which looks more like a decorator; see the short illustration after this list)?

  2. Whether there are other optimal implementations for this specific calculation (mutual information)?
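
Regarding point 1, my understanding is that np.vectorize wraps a scalar function so that it broadcasts element-wise rather than operating on whole columns; a tiny sketch (add_one is just a toy function):

import numpy as np

@np.vectorize
def add_one(x):
    # Called once per scalar element, not once per column.
    return x + 1

print(add_one(np.array([[1, 2], [3, 4]])))
# [[2 3]
#  [4 5]]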

nahsivar
  • Since I see `histogram2d` being used, are we talking about discretized entropy here, not differential entropy? If so, what binning strategy is being used, and how is the binning method identifiable from the code? – develarist Aug 19 '20 at 17:38

3 Answers

75

I can't suggest a faster calculation for the outer loop over the n*(n-1)/2 vectors, but your implementation of calc_MI(x, y, bins) can be simplified if you can use scipy version 0.13 or scikit-learn.

In scipy 0.13, the lambda_ argument was added to scipy.stats.chi2_contingency. This argument controls the statistic that is computed by the function. If you use lambda_="log-likelihood" (or lambda_=0), the log-likelihood ratio is returned. This is also often called the G or G2 statistic. Other than a factor of 2*n (where n is the total number of samples in the contingency table), this is the mutual information. So you could implement calc_MI as:

from scipy.stats import chi2_contingency

def calc_MI(x, y, bins):
    c_xy = np.histogram2d(x, y, bins)[0]
    g, p, dof, expected = chi2_contingency(c_xy, lambda_="log-likelihood")
    mi = 0.5 * g / c_xy.sum()
    return mi

The only difference between this and your implementation is that this implementation uses the natural logarithm instead of the base-2 logarithm (so it is expressing the information in "nats" instead of "bits"). If you really prefer bits, just divide mi by log(2).

If you have (or can install) sklearn (i.e. scikit-learn), you can use sklearn.metrics.mutual_info_score, and implement calc_MI as:

from sklearn.metrics import mutual_info_score

def calc_MI(x, y, bins):
    c_xy = np.histogram2d(x, y, bins)[0]
    mi = mutual_info_score(None, None, contingency=c_xy)
    return mi
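
As a quick sanity check that the two versions agree, here is a sketch with the two implementations above copied under the illustrative names calc_MI_chi2 and calc_MI_sklearn, using random data large enough that no expected frequency is zero:

import numpy as np
from scipy.stats import chi2_contingency
from sklearn.metrics import mutual_info_score

def calc_MI_chi2(x, y, bins):
    c_xy = np.histogram2d(x, y, bins)[0]
    g, p, dof, expected = chi2_contingency(c_xy, lambda_="log-likelihood")
    return 0.5 * g / c_xy.sum()

def calc_MI_sklearn(x, y, bins):
    c_xy = np.histogram2d(x, y, bins)[0]
    return mutual_info_score(None, None, contingency=c_xy)

rng = np.random.default_rng(0)
x = rng.normal(size=10000)
y = x + rng.normal(size=10000)

# Both values are in nats and agree to floating-point precision;
# divide by np.log(2) to convert to bits.
print(np.isclose(calc_MI_chi2(x, y, 10), calc_MI_sklearn(x, y, 10)))  # True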

Warren Weckesser
  • Nice code! What is a reasonable default value for the number of bins? – pir Nov 01 '15 at 14:40
  • @felbo That's a good question, and not one that can be easily answered. You might get some ideas if you ask it over at http://stats.stackexchange.com/ – Warren Weckesser Nov 01 '15 at 15:08
  • This method does not work if some counts equal zero. Why do you suggest this method over density estimation? Also, I upvoted your answer as it does provide an efficient way to calculate MI for some scenarios. – Jon Dec 10 '17 at 22:39
  • *"Why do you suggest this method over density estimation?"* I didn't. I only suggested a couple alternative implementations of the code given in the question. – Warren Weckesser Dec 10 '17 at 22:47
  • The two suggested methods differ by the continuity correction. Changing to `chi2_contingency(correction=False)` removes this inconsistency. – shouldsee Feb 22 '18 at 12:25
  • [Beware](https://kaushikghose.wordpress.com/2013/10/24/computing-mutual-information-and-other-scary-things/) of discretization (i.e. the bin sizes). This blog recommends the "Jack Knifed Estimate" to overcome this, or I would also expect density estimation techniques to help. – Josh.F Feb 06 '19 at 03:26
  • what's the difference between this implementation and https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.mutual_info_classif.html?highlight=mutual_info_classif#sklearn.feature_selection.mutual_info_classif ? – dada Sep 02 '22 at 14:21
0

To get rid of the outer loop (sort of), one way is to rewrite calc_MI so that the vectorized functions it uses operate on the entire stack of joint histograms (c_XYs) at once.

shan_entropy uses functions that work on arrays of any shape, and c_X and c_Y are simply the marginal totals of the np.histogram2d output across its second and first dimensions, respectively. The looped assignment to matMI can also be done in a vectorized fashion using np.triu_indices.

The only unavoidable loop here is the loop over pairs of columns of A to call np.histogram2d, but that part can be parallelized using joblib.Parallel (a sketch of this is included after the example below).

import numpy as np

def shan_entropy(c):
    # Row-wise Shannon entropy (in bits) of an array of probabilities,
    # treating 0 * log2(0) as 0.
    return -(c * np.log2(c, out=np.zeros(c.shape), where=(c != 0))).sum(axis=1)

def calc_pairwise_mutual_info(A, bins):
    m, n = A.shape
    matMI = np.zeros((n, n))

    # Stack of normalized joint histograms, one per pair of columns.
    c_XYs = np.array([np.histogram2d(A[:, ix], A[:, jx], bins)[0]
                      for ix in range(n-1) for jx in range(ix+1, n)]) / m
    c_Xs = c_XYs.sum(axis=2)   # marginal distribution of X for each pair
    c_Ys = c_XYs.sum(axis=1)   # marginal distribution of Y for each pair
    c_XYs = c_XYs.reshape(len(c_XYs), -1)

    H_X, H_Y, H_XY = map(shan_entropy, (c_Xs, c_Ys, c_XYs))

    MI = H_X + H_Y - H_XY
    matMI[np.triu_indices(n, k=1)] = MI
    return matMI


A = np.array([[ 2.0,  140.0,  128.23, -150.5, -5.4  ],
              [ 2.4,  153.11, 130.34, -130.1, -9.5  ],
              [ 1.2,  156.9,  120.11, -110.45,-1.12 ]])

bins = 5

MI_arr = calc_pairwise_mutual_info(A, bins)

You can verify that MI_arr matches the matMI computed in the OP (np.allclose returns True).
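
For completeness, here is a minimal sketch of the joblib route mentioned above (it assumes joblib is installed and reuses the vectorized shan_entropy defined earlier; _pair_hist and calc_pairwise_mutual_info_parallel are just illustrative names):

import numpy as np
from joblib import Parallel, delayed

def _pair_hist(A, ix, jx, bins):
    # Joint bin counts for one pair of columns.
    return np.histogram2d(A[:, ix], A[:, jx], bins)[0]

def calc_pairwise_mutual_info_parallel(A, bins, n_jobs=-1):
    m, n = A.shape
    pairs = [(ix, jx) for ix in range(n - 1) for jx in range(ix + 1, n)]

    # The histogram2d calls are independent, so run them in parallel workers.
    c_XYs = np.array(Parallel(n_jobs=n_jobs)(
        delayed(_pair_hist)(A, ix, jx, bins) for ix, jx in pairs)) / m

    # Same vectorized entropy bookkeeping as above.
    H_X = shan_entropy(c_XYs.sum(axis=2))
    H_Y = shan_entropy(c_XYs.sum(axis=1))
    H_XY = shan_entropy(c_XYs.reshape(len(c_XYs), -1))

    matMI = np.zeros((n, n))
    matMI[np.triu_indices(n, k=1)] = H_X + H_Y - H_XY
    return matMI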

cottontail
0

You can also use scipy.stats.entropy.
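
For example, a minimal sketch that mirrors the histogram-based calc_MI from the question (calc_MI_entropy is just an illustrative name; scipy.stats.entropy normalizes raw counts itself, and base=2 gives the result in bits):

import numpy as np
from scipy.stats import entropy

def calc_MI_entropy(x, y, bins):
    # Joint and marginal bin counts, as in the question.
    c_xy = np.histogram2d(x, y, bins)[0]
    c_x = np.histogram(x, bins)[0]
    c_y = np.histogram(y, bins)[0]

    # entropy() normalizes the counts and skips empty bins; base=2 gives bits.
    h_x = entropy(c_x, base=2)
    h_y = entropy(c_y, base=2)
    h_xy = entropy(c_xy.ravel(), base=2)

    return h_x + h_y - h_xy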

  • Your answer could be improved with additional supporting information. Please [edit] to add further details, such as citations or documentation, so that others can confirm that your answer is correct. You can find more information on how to write good answers [in the help center](/help/how-to-answer). – Community Jun 15 '23 at 21:03
  • While this link may answer the question, it is better to include the essential parts of the answer here and provide the link for reference. Link-only answers can become invalid if the linked page changes. - [From Review](/review/late-answers/34551445) – Chenmunka Jun 19 '23 at 06:54