For discrete distributions, you can use the aforementioned biopython or scikit-learn's sklearn.metrics.mutual_info_score. However, both compute the mutual information between "symbolic" data using the formula you cited (which is intended for symbolic data). In either case, you ignore the fact that the values of your data have an inherent order.
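For illustration, here is a minimal example of the scikit-learn route; mutual_info_score treats its two inputs as unordered label arrays and returns the mutual information in nats (the toy data below is only for demonstration):

```python
import numpy as np
from sklearn.metrics import mutual_info_score

# two discrete ("symbolic") variables; the integer values are treated
# as unordered labels, so any inherent ordering of the symbols is ignored
rng = np.random.default_rng(0)
x = rng.integers(0, 5, size=1000)
y = (x + rng.integers(0, 2, size=1000)) % 5   # noisy copy of x

print(mutual_info_score(x, y))                # mutual information in nats
```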
For continuous distributions, you are better off using the Kozachenko-Leonenko k-nearest neighbour estimator for entropy (Kozachenko & Leonenko, 1987) and the corresponding Kraskov et al. (2004) estimator for mutual information. Both circumvent the intermediate step of calculating the probability density function and estimate the entropy directly from the distances of the data points to their k-th nearest neighbours.
The basic idea of the Kozachenko-Leonenko estimator is to look at (some function of) the average distance between neighbouring data points. The intuition is that if that distance is large, the dispersion in your data is large and hence the entropy is large. In practice, one takes the k-th nearest neighbour distance rather than the nearest neighbour distance (with k typically a small integer in the range 5-20), which makes the estimate more robust.
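As a rough sketch of that idea (not the code from the repository linked below), the estimator can be written in a few lines with a k-d tree. This version assumes Euclidean distances, returns the entropy in nats, and will break on duplicate points because of the log of a zero distance:

```python
import numpy as np
from scipy.spatial import cKDTree
from scipy.special import digamma, gammaln

def kl_entropy(x, k=5):
    """Kozachenko-Leonenko k-nearest-neighbour entropy estimate (in nats)."""
    x = np.asarray(x, dtype=float)
    if x.ndim == 1:
        x = x[:, np.newaxis]              # shape (n_samples, n_dims)
    n, d = x.shape
    # distance of every point to its k-th nearest neighbour
    # (query with k + 1 because each point is its own nearest neighbour)
    r = cKDTree(x).query(x, k=k + 1)[0][:, -1]
    # log-volume of the d-dimensional unit ball (Euclidean norm)
    log_vd = 0.5 * d * np.log(np.pi) - gammaln(0.5 * d + 1)
    return digamma(n) - digamma(k) + log_vd + d * np.mean(np.log(r))

# sanity check: the entropy of a standard normal is 0.5 * log(2 * pi * e) ≈ 1.42 nats
print(kl_entropy(np.random.randn(10000), k=5))
```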
I have implementations of both on my GitHub: https://github.com/paulbrodersen/entropy_estimators
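For completeness, here is a similarly condensed sketch of algorithm 1 from Kraskov et al. (2004); the function name and the toy usage are my own, it follows the max-norm convention of the paper and returns the mutual information in nats. It is an illustration only, not the code from the repository above:

```python
import numpy as np
from scipy.spatial import cKDTree
from scipy.special import digamma

def ksg_mutual_information(x, y, k=5):
    """Kraskov-Stoegbauer-Grassberger (2004, algorithm 1) MI estimate (in nats)."""
    x = np.asarray(x, dtype=float).reshape(len(x), -1)
    y = np.asarray(y, dtype=float).reshape(len(y), -1)
    n = len(x)
    # distance to the k-th nearest neighbour in the joint space (max-norm)
    xy = np.hstack([x, y])
    eps = cKDTree(xy).query(xy, k=k + 1, p=np.inf)[0][:, -1]
    # for each point, count the neighbours lying strictly within eps in each
    # marginal space (subtracting 1 removes the point itself from the count)
    x_tree, y_tree = cKDTree(x), cKDTree(y)
    nx = np.array([len(x_tree.query_ball_point(xi, np.nextafter(ri, 0), p=np.inf)) - 1
                   for xi, ri in zip(x, eps)])
    ny = np.array([len(y_tree.query_ball_point(yi, np.nextafter(ri, 0), p=np.inf)) - 1
                   for yi, ri in zip(y, eps)])
    return digamma(k) + digamma(n) - np.mean(digamma(nx + 1) + digamma(ny + 1))

# toy example: correlated Gaussians with known mutual information
rho = 0.8
true_mi = -0.5 * np.log(1 - rho**2)   # ≈ 0.51 nats
x, y = np.random.multivariate_normal([0, 0], [[1, rho], [rho, 1]], size=5000).T
print(ksg_mutual_information(x, y, k=5), true_mi)
```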