PLoS One. 2014 Feb 19;9(2):e87357. doi: 10.1371/journal.pone.0087357. eCollection 2014.

Mutual information between discrete and continuous data sets

Brian C Ross

Abstract

Mutual information (MI) is a powerful method for detecting relationships between data sets. There are accurate methods for estimating MI that avoid problems with "binning" when both data sets are discrete or when both data sets are continuous. We present an accurate, non-binning MI estimator for the case of one discrete data set and one continuous data set. This case applies when measuring, for example, the relationship between base sequence and gene expression level, or the effect of a cancer drug on patient survival time. We also show how our method can be adapted to calculate the Jensen-Shannon divergence of two or more data sets.
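As a concrete illustration of the setting described above (one discrete data set paired with one continuous one), the sketch below estimates MI between a class label and a continuous measurement using scikit-learn's nearest-neighbor-based mutual_info_classif. The toy Gaussian data, sample size, and parameter choices are illustrative assumptions, not taken from the paper.

```python
# Minimal usage sketch (not from the paper): estimate MI between a discrete
# label and a continuous measurement with scikit-learn's nearest-neighbor
# estimator. The toy data and parameter choices are illustrative assumptions.
import numpy as np
from sklearn.feature_selection import mutual_info_classif

rng = np.random.default_rng(0)

# Discrete variable: three classes; continuous variable: class-dependent Gaussians.
labels = rng.integers(0, 3, size=5000)
values = rng.normal(loc=labels.astype(float), scale=0.5)

# mutual_info_classif expects a 2-D feature array; the neighbor count is tunable.
mi = mutual_info_classif(values.reshape(-1, 1), labels, n_neighbors=3, random_state=0)
print(f"Estimated MI: {mi[0]:.3f} nats")
```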

Conflict of interest statement

Competing Interests: The author has declared that no competing interests exist.

Figures

Figure 1. Procedures for estimating MI.
(A) An example joint probability density p(x, y), where y is a real-valued scalar and x can take one of three values, indicated in red, blue and green. For each value of x, the probability density in y is shown as a plot of that color, whose area is proportional to p(x). (B) A set of N data pairs sampled from this distribution, where x is represented by the color of each point and y by its position on the y-axis. (C) The computation of m_i in our nearest-neighbor method. Data point i is the red dot indicated by a vertical arrow. The full data set is on the upper line, and the subset of all red data points is on the lower line. With k = 3, we find that the data point which is the 3rd-closest neighbor to point i on the bottom line is the 6th-closest neighbor on the top line; dashed lines show the distance d from point i out to this 3rd neighbor, and for this point m_i = 6. (D) A binning of the data into equal bins, each containing the same number of data points. MI can be estimated from the numbers of points of each color in each bin.
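The counting procedure in panel (C) can be turned into a small estimator. The sketch below is a minimal, brute-force implementation assuming the digamma-based combination psi(N) - <psi(N_x)> + psi(k) - <psi(m_i)> used by Kraskov-style nearest-neighbor estimators; the function name and the quadratic-time search are illustrative, not the paper's reference code.

```python
# Sketch of the nearest-neighbor counting outlined in panel (C), combined with
# the digamma formula psi(N) - <psi(N_x)> + psi(k) - <psi(m_i)> used by
# Kraskov-style estimators. The brute-force O(N^2) search and the function name
# are illustrative, not the paper's reference implementation.
import numpy as np
from scipy.special import digamma

def mi_discrete_continuous(labels, values, k=3):
    labels = np.asarray(labels)
    values = np.asarray(values, dtype=float)
    n = len(values)
    psi_nx = np.empty(n)  # psi(N_x): number of points sharing point i's label
    psi_m = np.empty(n)   # psi(m_i): neighbor count within d in the full data set
    for i in range(n):
        same = values[labels == labels[i]]
        n_x = len(same)
        # Distance d from point i to its k-th nearest neighbor among points with
        # the same discrete value (index 0 of the sorted distances is point i itself).
        d = np.sort(np.abs(same - values[i]))[min(k, n_x - 1)]
        # m_i: how many other points (of any label) lie within that distance.
        m_i = np.count_nonzero(np.abs(values - values[i]) <= d) - 1
        psi_nx[i] = digamma(n_x)
        psi_m[i] = digamma(max(m_i, 1))
    return digamma(n) - psi_nx.mean() + digamma(k) - psi_m.mean()
```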
Figure 2. MI estimated by nearest-neighbors versus binning.
(A) Sampling distributions p(y|x) (thick lines), represented by a differently-colored graph in y for each of three possible values of the discrete variable x (red, blue and green). A histogram of a representative data set for each distribution is overlaid using a thinner line. (B) MI estimates as a function of the number of neighbors k using the nearest-neighbor estimator. 100 data sets were constructed for each distribution, and the MI of each data set was estimated separately for different values of k. The median MI estimate of the 100 data sets for each k-value is shown with a black line; the shaded region indicates the range (lowest 10% to highest 10%) of MI estimates. (C) MI estimates plotted as a function of bin size using the binning method (right panel), using the same 100 data sets for each distribution. The black line shows the median MI estimate of the 100 data sets for each bin size; the shaded region indicates the 10%–90% range.
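The summary procedure in panel (B) can be sketched as follows: repeat the estimate over many simulated data sets and report the median and the 10th–90th percentile band for each k. The sampling distribution, data-set size, and k values below are assumptions for illustration, and mi_discrete_continuous refers to the sketch given after Figure 1.

```python
# Sketch of the summary in panel (B): estimate MI on 100 simulated data sets and
# report the median and the 10th-90th percentile band for each neighbor count k.
# The sampling distribution, sizes, and k values are illustrative assumptions;
# mi_discrete_continuous is the sketch defined after Figure 1.
import numpy as np

rng = np.random.default_rng(1)
k_values = [1, 3, 5, 9]
estimates = {k: [] for k in k_values}

for _ in range(100):                              # 100 simulated data sets
    labels = rng.integers(0, 3, size=1000)
    values = rng.normal(loc=labels.astype(float), scale=0.5)
    for k in k_values:
        estimates[k].append(mi_discrete_continuous(labels, values, k=k))

for k in k_values:
    lo, med, hi = np.percentile(estimates[k], [10, 50, 90])
    print(f"k={k}: median={med:.3f}, 10-90% range=({lo:.3f}, {hi:.3f})")
```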
Figure 3. Binning error relative to nearest-neighbors error.
(A) Error from the binning method divided by error from the nearest-neighbor method. Errors in MI were calculated for each of the 100 data sets of the square-wave (light blue) and Gaussian (purple) 10,000-point data sets (see Figure 2). Each line shows the ratio of the median MI error from binning with a given number of bins n, plotted as a function of n, to the median (over all data sets and all values of k) of the MI errors using nearest neighbors. The binning method gives superior results for values of n for which this ratio is less than one. Evidently, there is no optimal value of n that works for all distributions: the bin number that works well for the square-wave distribution is not the one that works best for a Gaussian distribution. (B) MI error using the nearest-neighbor method versus the binning method for the 400-data-point sets.
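A sketch of the error-ratio computation described in panel (A), assuming per-data-set absolute errors |estimated MI - true MI| have already been collected into dictionaries keyed by bin count n (for binning) and by neighbor count k (for nearest neighbors); these container and function names are illustrative.

```python
# Sketch of the error ratio in panel (A): the median binning error for each bin
# count n, divided by the median nearest-neighbor error pooled over all k values
# and data sets. Assumes binning_errors[n] and nn_errors[k] already hold
# |estimated MI - true MI| per simulated data set; the names are illustrative.
import numpy as np

def binning_to_nn_error_ratio(binning_errors, nn_errors):
    nn_median = np.median(np.concatenate([np.asarray(e) for e in nn_errors.values()]))
    # Values below 1 indicate bin counts for which binning beats nearest neighbors.
    return {n: np.median(errs) / nn_median for n, errs in binning_errors.items()}
```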
