Maximizing the information learned from finite data selects a simple model

Edited by Larry Wasserman, Carnegie Mellon University, Pittsburgh, PA, and approved January 9, 2018 (received for review September 1, 2017)
February 6, 2018
115 (8), 1760–1765

Significance

Most physical theories are effective theories: descriptions at the scale visible to our experiments, which ignore microscopic details. Seeking general ways to motivate such theories, we find an information-theoretic perspective: If we select the model which can learn as much information as possible from the data, then we are naturally led to a simpler model, by a path independent of concerns about overfitting. This is encoded as a Bayesian prior which is nonzero only on a subspace of the original parameter space. We differ from earlier work on prior selection by not considering an infinite quantity of data. Finite data always limits the resolution of an experiment, and in our framework this limit selects how complicated a theory is appropriate.

Abstract

We use the language of uninformative Bayesian prior choice to study the selection of appropriately simple effective models. We advocate for the prior which maximizes the mutual information between parameters and predictions, learning as much as possible from limited data. When many parameters are poorly constrained by the available data, we find that this prior puts weight only on boundaries of the parameter space. Thus, it selects a lower-dimensional effective theory in a principled way, ignoring irrelevant parameter directions. In the limit where there are sufficient data to tightly constrain any number of parameters, this reduces to the Jeffreys prior. However, we argue that this limit is pathological when applied to the hyperribbon parameter manifolds generic in science, because it leads to dramatic dependence on effects invisible to experiment.
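
The MI-maximizing prior described above can be computed numerically with the Blahut–Arimoto algorithm (refs. 37 and 38 below). The following is a minimal sketch, not the authors' code: it discretizes a one-parameter Gaussian model x ~ N(θ, σ²) with θ ∈ [0, 1], our stand-in for the paper's Eq. 2, with grid sizes and σ chosen purely for illustration.

```python
import numpy as np

# Minimal Blahut-Arimoto sketch (cf. refs. 37 and 38) -- illustrative, not the
# authors' code.  P[i, j] = p(x_j | theta_i); returns (capacity in nats, prior).
def blahut_arimoto(P, tol=1e-9, max_iter=20000):
    p = np.full(P.shape[0], 1.0 / P.shape[0])   # start from a uniform prior
    for _ in range(max_iter):
        q = p @ P                               # expected data distribution p(x)
        with np.errstate(divide="ignore", invalid="ignore"):
            logratio = np.where(P > 0, np.log(P / q), 0.0)
        f_kl = (P * logratio).sum(axis=1)       # D_KL[p(x|theta) || q(x)], per theta
        c = np.exp(f_kl)
        p_new = p * c / (p @ c)                 # multiplicative reweighting
        if np.max(np.abs(p_new - p)) < tol:
            break
        p = p_new
    return float(p @ f_kl), p

sigma = 0.2                                     # noise comparable to the theta range
thetas = np.linspace(0.0, 1.0, 101)
xs = np.linspace(-1.0, 2.0, 401)
P = np.exp(-0.5 * ((xs - thetas[:, None]) / sigma) ** 2)
P /= P.sum(axis=1, keepdims=True)               # discretized likelihood rows

capacity, prior = blahut_arimoto(P)
print(f"capacity = {capacity:.4f} nats")
# clusters of grid points with appreciable weight mark the atoms of the prior
print("atoms near:", thetas[prior > 1e-3])
```

With σ comparable to the parameter range, the returned prior is discrete, with atoms at the endpoints θ = 0 and θ = 1 and (for smaller σ) a few interior points, the boundary-weighted behavior described above; as σ → 0 it approaches the Jeffreys prior, here uniform.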


Acknowledgments

We thank Vijay Balasubramanian, William Bialek, Robert de Mello Koch, Peter Grünwald, Jon Machta, James Sethna, Paul Wiggins, and Ned Wingreen for discussion and comments. We thank the International Centre for Theoretical Sciences, Bangalore, for hospitality. H.H.M. was supported by NIH Grant R01GM107103. M.K.T. was supported by National Science Foundation (NSF) Energy, Power, and Control Networks Grant 1710727. B.B.M. was supported by a Lewis-Sigler Fellowship and by NSF Division of Physics Grant 0957573. M.C.A. was supported by Narodowe Centrum Nauki Grant 2012/06/A/ST2/00396.

Supporting Information

Supporting Information (PDF)

References

1
LP Kadanoff, Scaling laws for Ising models near Tc. Physics 2, 263–272 (1966).
2
KG Wilson, Renormalization group and critical phenomena. 1. Renormalization group and the Kadanoff scaling picture. Phys Rev B 4, 3174–3183 (1971).
3
JL Cardy, Scaling and Renormalization in Statistical Physics (Cambridge Univ Press, Cambridge, UK, 1996).
4
JJ Waterfall, et al., Sloppy-model universality class and the Vandermonde matrix. Phys Rev Lett 97, 150601–150604 (2006).
5
RN Gutenkunst, et al., Universally sloppy parameter sensitivities in systems biology models. PLoS Comput Biol 3, 1871–1878 (2007).
6
MK Transtrum, BB Machta, JP Sethna, Why are nonlinear fits to data so challenging? Phys Rev Lett 104, 060201 (2010).
7
MK Transtrum, BB Machta, JP Sethna, Geometry of nonlinear least squares with applications to sloppy models and optimization. Phys Rev E 83, 036701 (2011).
8
BB Machta, R Chachra, MK Transtrum, JP Sethna, Parameter space compression underlies emergent theories and predictive models. Science 342, 604–607 (2013).
9
MK Transtrum, et al., Perspective: Sloppiness and emergent theories in physics, biology, and beyond. J Chem Phys 143, 010901 (2015).
10
T O’Leary, AC Sutton, E Marder, Computational models in the age of large datasets. Curr Opin Neurobiol 32, 87–94 (2015).
11
T Niksic, D Vretenar, Sloppy nuclear energy density functionals: Effective model reduction. Phys Rev C 94, 024333 (2016).
12
DV Raman, J Anderson, A Papachristodoulou, Delineating parameter unidentifiabilities in complex models. Phys Rev E 95, 032314 (2017).
13
G Bohner, G Venkataraman, Identifiability, reducibility, and adaptability in allosteric macromolecules. J Gen Physiol 149, 547–560 (2017).
14
A Raju, BB Machta, JP Sethna, Information geometry and the renormalization group. arXiv:1710.05787. (2017).
15
H Akaike, A new look at the statistical model identification. IEEE Trans Automat Contr 19, 716–723 (1974).
16
N Sugiura, Further analysts of the data by Akaike’s information criterion and the finite corrections. Commun Stat Theory Meth 7, 13–26 (1978).
17
V Balasubramanian, Statistical inference, Occam’s razor, and statistical mechanics on the space of probability distributions. Neural Comp 9, 349–368 (1997).
18
IJ Myung, V Balasubramanian, MA Pitt, Counting probability distributions: Differential geometry and model selection. Proc Natl Acad Sci USA 97, 11170–11175 (2000).
19
DJ Spiegelhalter, NG Best, BP Carlin, A van der Linde, Bayesian measures of model complexity and fit. J R Stat Soc B 64, 583–639 (2002).
20
S Watanabe, Asymptotic equivalence of Bayes cross validation and widely applicable information criterion in singular learning theory. JMLR 11, 3571–3594 (2010).
21
CH LaMont, PA Wiggins, Information-based inference for singular models and finite sample sizes. Preprint (2017).
22
MK Transtrum, P Qiu, Model reduction by manifold boundaries. Phys Rev Lett 113, 098701 (2014).
23
CE Shannon, A mathematical theory of communication. Bell Sys Tech J 27, 623–656 (1948).
24
DV Lindley, On a measure of the information provided by an experiment. Ann Math Stat 27, 986–1005 (1956).
25
A Rényi, On some basic problems of statistics from the point of view of information theory. Proc 5th Berkeley Symp Math Stat Prob, pp. 531–543 (1967).
26
G Färber, Die Kanalkapazität allgemeiner Übertragungskanäle bei begrenztem Signalwertbereich, beliebigen Signalübertragungszeiten sowie beliebiger Störung [The channel capacity of general transmission channels with limited signal range, arbitrary signal transmission times, and arbitrary noise]. Arch Elektr Übertr 21, 565–574 (1967).
27
JG Smith, The information capacity of amplitude- and variance-constrained scalar Gaussian channels. Inf Control 18, 203–219 (1971).
28
SL Fix, Rate distortion functions for squared error distortion measures. Proc 16th Annu Allerton Conf Commun Control Comput, pp. 704–711 (1978).
29
JO Berger, JM Bernardo, M Mendoza, On priors that maximize expected information. Recent Developments in Statistics and Their Applications, eds J Klein, J Lee (Freedom Academy, Seoul, Korea), pp. 1–20 (1988).
30
Z Zhang, Discrete noninformative priors. PhD thesis (Yale University, New Haven, CT). (1994).
31
JM Bernardo, Reference posterior distributions for Bayesian inference. J R Stat Soc B 41, 113–147 (1979).
32
BS Clarke, AR Barron, Jeffreys' prior is asymptotically least favorable under entropy risk. J Stat Plan Infer 41, 37–60 (1994).
33
HR Scholl, Shannon optimal priors on independent identically distributed statistical experiments converge weakly to Jeffreys’ prior. Test 7, 75–94 (1998).
34
H Jeffreys, An invariant form for the prior probability in estimation problems. Proc R Soc A 186, 453–461 (1946).
35
JE Kerrich, An Experimental Introduction to the Theory of Probability (E Munksgaard, Copenhagen, 1946).
36
C O’Luanaigh, CERN data centre passes 100 petabytes. home.cern. Available at https://home.cern/about/updates/2013/02/cern-data-centre-passes-100-petabytes. (2013).
37
S Arimoto, An algorithm for computing the capacity of arbitrary discrete memoryless channels. IEEE Trans Inf Theory 18, 14–20 (1972).
38
R Blahut, Computation of channel capacity and rate-distortion functions. IEEE Trans Inf Theory 18, 460–473 (1972).
39
K Rose, A mapping approach to rate-distortion computation and analysis. IEEE Trans Inf Theory 40, 1939–1952 (1994).
40
D Haussler, A general minimax result for relative entropy. IEEE Trans Inf Theory 43, 1276–1280 (1997).
41
MN Ghosh, Uniform approximation of minimax point estimates. Ann Math Stat 35, 1031–1047 (1964).
42
G Casella, WE Strawderman, Estimating a bounded normal mean. Ann Stat 9, 870–878 (1981).
43
I Feldman, Constrained minimax estimation of the mean of the normal distribution with known variance. Ann Stat 19, 2259–2265 (1991).
44
M Chen, D Dey, P Müller, D Sun, K Ye, Frontiers of Statistical Decision Making and Bayesian Analysis (Springer, New York, 2010).
45
CA Sims, Rational inattention: Beyond the linear-quadratic case. Am Econ Rev 96, 158–163 (2006).
46
J Jung, J Kim, F Matĕjka, CA Sims, Discrete actions in information-constrained decision problems. http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.696.267. (2015).
47
S Laughlin, A simple coding procedure enhances a neuron’s information capacity. Z Naturforsch C 36, 910–912 (1981).
48
G Tkačik, CG Callan, W Bialek, Information flow and optimization in transcriptional regulation. Proc Natl Acad Sci USA 105, 12265–12270 (2008).
49
MD Petkova, G Tkačik, W Bialek, EF Wieschaus, T Gregor, Optimal decoding of information from a genetic network. arXiv:1612.08084. (2016).
50
A Mayer, V Balasubramanian, T Mora, AM Walczak, How a well-adapted immune system is organized. Proc Natl Acad Sci USA 112, 5950–5955 (2015).
51
MC Abbott, BB Machta, An information scaling law: ζ = 3/4. arXiv:1710.09351. (2017).
52
S Schnell, C Mendoza, Closed form solution for time-dependent enzyme kinetics. J Theor Biol 187, 207–212 (1997).
53
MK Transtrum, G Hart, P Qiu, Information topology identifies emergent model classes. arXiv:1409.6203. (2014).
54
PE Paré, AT Wilson, MK Transtrum, SC Warnick, A unified view of balanced truncation and singular perturbation approximations. Proceedings of the 2015 American Control Conference (2015).
55
N Lewis, Combining independent Bayesian posteriors into a confidence distribution, with application to estimating climate sensitivity. J Stat Plan Inference (2017).
56
N Lewis, Modification of Bayesian updating where continuous parameters have differing relationships with new and existing data. arXiv:1308.2791. (2013).
57
D Poole, AE Raftery, Inference for deterministic simulation models: The Bayesian melding approach. J Am Stat Assoc 95, 1244–1255 (2000).
58
T Seidenfeld, Why I am not an objective Bayesian; some reflections prompted by Rosenkrantz. Theory Decis 11, 413–440 (1979).
59
RE Kass, L Wasserman, The selection of prior distributions by formal rules. J Am Stat Assoc 91, 1343–1370 (1996).
60
J Williamson, Objective Bayesianism, Bayesian conditionalisation and voluntarism. Synthese 178, 67–85 (2009).
61
G Schwarz, Estimating the dimension of a model. Ann Stat 6, 461–464 (1978).
62
J Rissanen, Modeling by shortest data description. Automatica 14, 465–471 (1978).
63
CS Wallace, DM Boulton, An information measure for classification. Comput J 11, 185–194 (1968).
64
S Watanabe, A widely applicable Bayesian information criterion. J Mach Learn Res 14, 867–897 (2013).
65
PD Grünwald, IJ Myung, MA Pitt, Advances in Minimum Description Length: Theory and Applications (MIT Press, Cambridge, MA, 2009).
66
S Arlot, A Celisse, A survey of cross-validation procedures for model selection. Stat Surv 4, 40–79 (2010).
67
PW Anderson, More is different. Science 177, 393–396 (1972).
68
RW Batterman, Philosophical implications of Kadanoff’s work on the renormalization group. J Stat Phys 167, 559–574 (2017).
69
J Bardeen, LN Cooper, JR Schrieffer, Theory of superconductivity. Phys Rev 108, 1175–1204 (1957).
70
L Michaelis, ML Menten, The kinetics of invertin action. FEBS Lett 587, 2712–2720 (2013).

Information & Authors

Information

Published in

Proceedings of the National Academy of Sciences
Vol. 115 | No. 8
February 20, 2018
PubMed: 29434042

Submission history

Published online: February 6, 2018
Published in issue: February 20, 2018

Keywords

  1. effective theory
  2. model selection
  3. renormalization group
  4. Bayesian prior choice
  5. information theory


Notes

This article is a PNAS Direct Submission.
*Interned for 5 y, John Kerrich flipped his coin only $10^4$ times (35). With computers we can do better, but even the Large Hadron Collider generated only about $10^{18}$ bits of data (36).
†For simplicity we consider only regular models; i.e., we assume all parameters are structurally identifiable.
‡See Fig. 5 for a demonstration of this point. For another example, consider a parameter manifold $\Theta$ which is a cone, with Fisher metric $ds^2 = (50\,d\vartheta)^2 + \vartheta^2\, d\Omega_n^2/4$: There is one relevant direction $\vartheta \in [0,1]$ of length $L = 50$, and there are $n$ irrelevant directions forming a sphere of diameter $\vartheta$. Then the prior on $\vartheta$ alone implied by $p_J(\theta)$ is $p(\vartheta) = (n+1)\vartheta^n$, putting most of the weight near $\vartheta = 1$, dramatically so if $n = D - d$ is large. But since only the relevant direction is visible to our experiment, the region $\vartheta \approx 0$ ought to be treated similarly to $\vartheta \approx 1$. The prior $p_\star(\theta)$ has this property.
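
For completeness, the quoted marginal follows directly from the Jeffreys volume form; a short derivation in the footnote's notation (assuming $d\Omega_n^2$ denotes the unit $n$-sphere metric):

```latex
% Jeffreys prior on the cone ds^2 = (50 d\vartheta)^2 + \vartheta^2 d\Omega_n^2 / 4:
\[
  \sqrt{\det g} \;=\; 50 \left(\tfrac{\vartheta}{2}\right)^{\!n} \sqrt{\det \tilde g_{S^n}}
  \quad\Longrightarrow\quad
  p_J(\theta) \;\propto\; \vartheta^{\,n},
\]
% and marginalizing over the sphere directions, normalized on 0 <= vartheta <= 1:
\[
  p(\vartheta) \;=\; \frac{\vartheta^{\,n}}{\int_0^1 (\vartheta')^{\,n}\, d\vartheta'}
  \;=\; (n+1)\,\vartheta^{\,n}.
\]
```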
§We offer both numerical and analytic arguments for discreteness below. The exception to discreteness is that if there is an exact continuous symmetry, $p_\star(\theta)$ will be constant along it. For example, if our Gaussian model Eq. 2 is placed on a circle (identifying both $\theta \simeq \theta + 1$ and $x \simeq x + 1$), then the optimum prior is a constant.
¶The function $f_{\mathrm{KL}}(\theta)$ is sometimes called the Bayes risk, as it quantifies how poorly the prior will perform if $\theta$ turns out to be correct. One of the problems equivalent to maximizing the MI (40) is the minimax problem for this (Fig. 1):
\[
\max_{p(\theta)} I(X;\Theta)
\;=\; \min_{p(\theta)} \max_{\theta} f_{\mathrm{KL}}(\theta)
\;=\; \min_{q(x)} \max_{p(\theta)} \int d\theta\, p(\theta)\, D_{\mathrm{KL}}\!\left[\,p(x|\theta) \,\big\|\, q(x)\,\right].
\]
The distributions we call expected data $p(x)$ are also known as Bayes strategies, i.e., distributions on $X$ which are the convolution of the likelihood $p(x|\theta)$ with some prior $p(\theta)$. The optimal $q(x)$ from this third formulation (with $\min_{q(x)}$) can be shown to be such a distribution (40).
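
The equalization at the optimum is easy to check numerically, reusing the blahut_arimoto sketch given after the abstract: at the capacity-achieving prior, $f_{\mathrm{KL}}(\theta)$ equals the capacity on the support of the prior and is no larger elsewhere. The toy three-point channel here is our own choice.

```python
import numpy as np

# p(x|theta) for three parameter values; the third row is a middle case and
# ends up with (numerically) vanishing prior weight at the optimum.
P = np.array([[0.80, 0.10, 0.10],
              [0.10, 0.80, 0.10],
              [0.25, 0.50, 0.25]])

capacity, prior = blahut_arimoto(P)        # from the sketch after the abstract
q = prior @ P                              # optimal expected data q(x)
f_kl = (P * np.log(P / q)).sum(axis=1)     # Bayes risk f_KL(theta)

print("capacity       :", round(capacity, 6))
print("max_theta f_KL :", round(f_kl.max(), 6))   # equal at the optimum
print("optimal prior  :", prior.round(4))
```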
#Using a normal distribution of fixed $\sigma$ here is what allows the metric in Eq. 6 to be so simple. However, the qualitative behavior from the Poisson distribution is very similar.
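
Since Eq. 6 itself is not reproduced on this page, the following generic statement of that simplification may be useful: for a hypothetical scalar prediction $y(\theta)$ observed with fixed Gaussian noise, the Fisher metric is just the Euclidean metric on predictions in units of $\sigma$.

```latex
% Fisher metric for x ~ N(y(\theta), \sigma^2), with \sigma fixed:
\[
  g_{\mu\nu}(\theta)
  \;=\; \mathbb{E}\!\left[\partial_\mu \log p(x|\theta)\,\partial_\nu \log p(x|\theta)\right]
  \;=\; \frac{1}{\sigma^{2}}\,\partial_\mu y(\theta)\,\partial_\nu y(\theta).
\]
% A Poisson likelihood with mean y(\theta) instead gives
% g_{\mu\nu} = \partial_\mu y \, \partial_\nu y / y, which behaves similarly.
```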
‖If we have more parameters than measurements, then the model must be singular. In fact the exponential model of Fig. 4 is already slightly singular, since $k_1 \leftrightarrow k_2$ does not change the data; we could cure this by restricting to $k_2 \geq k_1$, or by working with $y$, to obtain a regular model.
**Edges of the parameter manifold give simpler models not only in the sense of having fewer parameters, but also in an algorithmic sense. For example, the Michaelis–Menten model is analytically solvable (52) in a limit which corresponds to a manifold boundary (53). Stable linear dynamical systems of order n are model boundaries of order n+1 systems (54). Taking some parameter combinations to the extreme can lock spins into Kadanoff blocks (53).
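
A textbook instance of this point (our example; refs. 52–54 give the technical versions): Michaelis–Menten kinetics reduces to a model with one fewer parameter on a boundary of its parameter manifold.

```latex
% Michaelis-Menten rate law with parameters V_max and K_M:
\[
  v(S) \;=\; \frac{V_{\max}\, S}{K_M + S},
\]
% on the boundary K_M -> infinity with k = V_max / K_M held fixed,
% becomes the one-parameter linear law
\[
  v(S) \;\to\; k\,S .
\]
```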
††This view is natural in the objective Bayesian tradition, but see refs. 57–60 for alternatives.
‡‡Model selection usually starts from a list of models to be compared, in our language a list of submanifolds of $\Theta$. We can also consider maximizing mutual information in this setting, rather than with an unconstrained function $p(\theta)$, and unsurprisingly we observe a similar preference for highly flexible simpler models. This is also discussed around Eq. S3.

Authors

Affiliations

Henry H. Mattingly
Department of Chemical and Biological Engineering, Princeton University, Princeton, NJ 08544;
Lewis-Sigler Institute, Princeton University, Princeton, NJ 08544;
Present address: Department of Molecular Cellular and Developmental Biology, Yale University, New Haven, CT 06520.
Mark K. Transtrum
Department of Physics and Astronomy, Brigham Young University, Provo, UT 84602;
Michael C. Abbott
Marian Smoluchowski Institute of Physics, Jagiellonian University, 30-348 Kraków, Poland;
Benjamin B. Machta2
Lewis-Sigler Institute, Princeton University, Princeton, NJ 08544;
Department of Physics, Princeton University, Princeton, NJ 08544
Present address: Department of Physics, Yale University, New Haven, CT, 06520; and Systems Biology Institute, Yale University, West Haven, CT, 06516.

Notes

2
To whom correspondence may be addressed. Email: [email protected] or [email protected].
Author contributions: H.H.M., M.K.T., M.C.A., and B.B.M. designed research; H.H.M., M.K.T., M.C.A., and B.B.M. performed research; H.H.M., M.K.T., M.C.A., and B.B.M. analyzed data; M.C.A. and B.B.M. wrote the paper; H.H.M. and M.C.A. performed the numerical experiments; and H.H.M. and M.K.T. contributed to writing.

Competing Interests

The authors declare no conflict of interest.
