Maximizing the information learned from finite data selects a simple model
Edited by Larry Wasserman, Carnegie Mellon University, Pittsburgh, PA, and approved January 9, 2018 (received for review September 1, 2017)
Significance
Most physical theories are effective theories: descriptions at the scale visible to our experiments, which ignore microscopic details. Seeking general ways to motivate such theories, we find an information-theoretic perspective: If we select the model which can learn as much information as possible from the data, then we are naturally led to a simpler model, by a path independent of concerns about overfitting. This is encoded as a Bayesian prior which is nonzero only on a subspace of the original parameter space. We differ from earlier prior-selection work by not considering an infinite quantity of data. Having finite data is always a limit on the resolution of an experiment, and in our framework this selects how complicated a theory is appropriate.
Abstract
We use the language of uninformative Bayesian prior choice to study the selection of appropriately simple effective models. We advocate for the prior which maximizes the mutual information between parameters and predictions, learning as much as possible from limited data. When many parameters are poorly constrained by the available data, we find that this prior puts weight only on boundaries of the parameter space. Thus, it selects a lower-dimensional effective theory in a principled way, ignoring irrelevant parameter directions. In the limit where there are sufficient data to tightly constrain any number of parameters, this reduces to the Jeffreys prior. However, we argue that this limit is pathological when applied to the hyperribbon parameter manifolds generic in science, because it leads to dramatic dependence on effects invisible to experiment.
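The prior the abstract advocates solves a channel-capacity problem, and can be computed with the Blahut–Arimoto algorithm (refs. 37 and 38). Below is a minimal sketch, assuming a one-parameter model x ~ N(θ, 1) with θ confined to [0, L]; the grids, tolerances, and function name are our own illustrative choices, not taken from the paper.

```python
import numpy as np

def blahut_arimoto(L=4.0, n_theta=201, n_x=401, iters=2000):
    """Capacity-achieving prior for x ~ N(theta, 1), theta in [0, L] (sketch)."""
    theta = np.linspace(0.0, L, n_theta)           # discretized parameter grid
    x = np.linspace(-4.0, L + 4.0, n_x)            # data grid, wide enough to hold the noise
    dx = x[1] - x[0]
    # Likelihood matrix p(x | theta): rows indexed by theta, columns by x.
    lik = np.exp(-0.5 * (x[None, :] - theta[:, None]) ** 2) / np.sqrt(2 * np.pi)
    p = np.full(n_theta, 1.0 / n_theta)            # start from a uniform prior
    for _ in range(iters):
        q = p @ lik                                 # mixture density: the "expected data"
        # f(theta) = KL divergence of each likelihood from the mixture.
        f = np.sum(lik * np.log(lik / q[None, :] + 1e-300), axis=1) * dx
        p *= np.exp(f)                              # multiplicative Blahut-Arimoto update
        p /= p.sum()
    return theta, p, f

if __name__ == "__main__":
    theta, p, f = blahut_arimoto()
    # On a grid, each atom of the discrete optimum shows up as a small cluster of points.
    print("atoms of the optimal prior near:", theta[p > 1e-3])
    print("mutual information at optimum (nats):", np.sum(p * f))
```

For a box only a few noise widths long, the returned prior collapses onto a handful of atoms, including the two endpoints, illustrating the discreteness and boundary weight the abstract describes (compare ref. 27 for the classical amplitude-constrained Gaussian channel).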
Acknowledgments
We thank Vijay Balasubramanian, William Bialek, Robert de Mello Koch, Peter Grünwald, Jon Machta, James Sethna, Paul Wiggins, and Ned Wingreen for discussion and comments. We thank International Centre for Theoretical Sciences Bangalore for hospitality. H.H.M. was supported by NIH Grant R01GM107103. M.K.T. was supported by National Science Foundation (NSF)-Energy, Power, and Control Networks 1710727. B.B.M. was supported by a Lewis-Sigler Fellowship and by NSF Division of Physics 0957573. M.C.A. was supported by Narodowe Centrum Nauki Grant 2012/06/A/ST2/00396.
Supporting Information
Supporting Information (PDF)
References
1. LP Kadanoff, Scaling laws for Ising models near T_c. Physics 2, 263–272 (1966).
2. KG Wilson, Renormalization group and critical phenomena. I. Renormalization group and the Kadanoff scaling picture. Phys Rev B 4, 3174–3183 (1971).
3. JL Cardy, Scaling and Renormalization in Statistical Physics (Cambridge Univ Press, Cambridge, UK, 1996).
4. JJ Waterfall, et al., Sloppy-model universality class and the Vandermonde matrix. Phys Rev Lett 97, 150601–150604 (2006).
5. RN Gutenkunst, et al., Universally sloppy parameter sensitivities in systems biology models. PLoS Comput Biol 3, 1871–1878 (2007).
6. MK Transtrum, BB Machta, JP Sethna, Why are nonlinear fits to data so challenging? Phys Rev Lett 104, 060201 (2010).
7. MK Transtrum, BB Machta, JP Sethna, Geometry of nonlinear least squares with applications to sloppy models and optimization. Phys Rev E 83, 036701 (2011).
8. BB Machta, R Chachra, MK Transtrum, JP Sethna, Parameter space compression underlies emergent theories and predictive models. Science 342, 604–607 (2013).
9. MK Transtrum, et al., Perspective: Sloppiness and emergent theories in physics, biology, and beyond. J Chem Phys 143, 010901 (2015).
10. T O'Leary, AC Sutton, E Marder, Computational models in the age of large datasets. Curr Opin Neurobiol 32, 87–94 (2015).
11. T Niksic, D Vretenar, Sloppy nuclear energy density functionals: Effective model reduction. Phys Rev C 94, 024333 (2016).
12. DV Raman, J Anderson, A Papachristodoulou, Delineating parameter unidentifiabilities in complex models. Phys Rev E 95, 032314 (2017).
13. G Bohner, G Venkataraman, Identifiability, reducibility, and adaptability in allosteric macromolecules. J Gen Physiol 149, 547–560 (2017).
14. A Raju, BB Machta, JP Sethna, Information geometry and the renormalization group. arXiv:1710.05787 (2017).
15. H Akaike, A new look at the statistical model identification. IEEE Trans Automat Contr 19, 716–723 (1974).
16. N Sugiura, Further analysts of the data by Akaike's information criterion and the finite corrections. Commun Stat Theory Meth 7, 13–26 (1978).
17. V Balasubramanian, Statistical inference, Occam's razor, and statistical mechanics on the space of probability distributions. Neural Comput 9, 349–368 (1997).
18. IJ Myung, V Balasubramanian, MA Pitt, Counting probability distributions: Differential geometry and model selection. Proc Natl Acad Sci USA 97, 11170–11175 (2000).
19. DJ Spiegelhalter, NG Best, BP Carlin, A van der Linde, Bayesian measures of model complexity and fit. J R Stat Soc B 64, 583–639 (2002).
20. S Watanabe, Asymptotic equivalence of Bayes cross validation and widely applicable information criterion in singular learning theory. J Mach Learn Res 11, 3571–3594 (2010).
21. CH LaMont, PA Wiggins, Information-based inference for singular models and finite sample sizes (2017).
22. MK Transtrum, P Qiu, Model reduction by manifold boundaries. Phys Rev Lett 113, 098701 (2014).
23. CE Shannon, A mathematical theory of communication. Bell Syst Tech J 27, 623–656 (1948).
24. DV Lindley, On a measure of the information provided by an experiment. Ann Math Stat 27, 986–1005 (1956).
25. A Rényi, On some basic problems of statistics from the point of view of information theory. Proc 5th Berkeley Symp Math Stat Prob, pp 531–543 (1967).
26. G Färber, Die Kanalkapazität allgemeiner Übertragungskanäle bei begrenztem Signalwertbereich, beliebigen Signalübertragungszeiten sowie beliebiger Störung [The channel capacity of general transmission channels with bounded signal range, arbitrary signal transmission times, and arbitrary noise]. Arch Elektr Übertr 21, 565–574 (1967).
27. JG Smith, The information capacity of amplitude- and variance-constrained scalar Gaussian channels. Inf Control 18, 203–219 (1971).
28. SL Fix, Rate distortion functions for squared error distortion measures. Proc 16th Annu Allerton Conf Commun Control Comput, pp 704–711 (1978).
29. JO Berger, JM Bernardo, M Mendoza, On priors that maximize expected information. Recent Developments in Statistics and Their Applications, eds J Klein, J Lee (Freedom Academy, Seoul, Korea), pp 1–20 (1988).
30. Z Zhang, Discrete noninformative priors. PhD thesis (Yale University, New Haven, CT) (1994).
31. JM Bernardo, Reference posterior distributions for Bayesian inference. J R Stat Soc B 41, 113–147 (1979).
32. BS Clarke, AR Barron, Jeffreys' prior is asymptotically least favorable under entropy risk. J Stat Plan Infer 41, 37–60 (1994).
33. HR Scholl, Shannon optimal priors on independent identically distributed statistical experiments converge weakly to Jeffreys' prior. Test 7, 75–94 (1998).
34. H Jeffreys, An invariant form for the prior probability in estimation problems. Proc R Soc A 186, 453–461 (1946).
35. JE Kerrich, An Experimental Introduction to the Theory of Probability (E Munksgaard, Copenhagen, 1946).
36. C O'Luanaigh, CERN data centre passes 100 petabytes. Available at https://home.cern/about/updates/2013/02/cern-data-centre-passes-100-petabytes (2013).
37. S Arimoto, An algorithm for computing the capacity of arbitrary discrete memoryless channels. IEEE Trans Inf Theory 18, 14–20 (1972).
38. R Blahut, Computation of channel capacity and rate-distortion functions. IEEE Trans Inf Theory 18, 460–473 (1972).
39. K Rose, A mapping approach to rate-distortion computation and analysis. IEEE Trans Inf Theory 40, 1939–1952 (1994).
40. D Haussler, A general minimax result for relative entropy. IEEE Trans Inf Theory 43, 1276–1280 (1997).
41. MN Ghosh, Uniform approximation of minimax point estimates. Ann Math Stat 35, 1031–1047 (1964).
42. G Casella, WE Strawderman, Estimating a bounded normal mean. Ann Stat 9, 870–878 (1981).
43. I Feldman, Constrained minimax estimation of the mean of the normal distribution with known variance. Ann Stat 19, 2259–2265 (1991).
44. M Chen, D Dey, P Müller, D Sun, K Ye, Frontiers of Statistical Decision Making and Bayesian Analysis (Springer, New York, 2010).
45. CA Sims, Rational inattention: Beyond the linear-quadratic case. Am Econ Rev 96, 158–163 (2006).
46. J Jung, J Kim, F Matějka, CA Sims, Discrete actions in information-constrained decision problems. http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.696.267 (2015).
47. S Laughlin, A simple coding procedure enhances a neuron's information capacity. Z Naturforsch C 36, 910–912 (1981).
48. G Tkačik, CG Callan, W Bialek, Information flow and optimization in transcriptional regulation. Proc Natl Acad Sci USA 105, 12265–12270 (2008).
49. MD Petkova, G Tkačik, W Bialek, EF Wieschaus, T Gregor, Optimal decoding of information from a genetic network. arXiv:1612.08084 (2016).
50. A Mayer, V Balasubramanian, T Mora, AM Walczak, How a well-adapted immune system is organized. Proc Natl Acad Sci USA 112, 5950–5955 (2015).
51. MC Abbott, BB Machta, An information scaling law: ζ = 3/4. arXiv:1710.09351 (2017).
52. S Schnell, C Mendoza, Closed form solution for time-dependent enzyme kinetics. J Theor Biol 187, 207–212 (1997).
53. M Transtrum, G Hart, P Qiu, Information topology identifies emergent model classes. arXiv:1409.6203 (2014).
54. PE Paré, AT Wilson, MK Transtrum, SC Warnick, A unified view of balanced truncation and singular perturbation approximations. 2015 American Control Conference (2015).
55. N Lewis, Combining independent Bayesian posteriors into a confidence distribution, with application to estimating climate sensitivity. J Stat Plan Inference (2017).
56. N Lewis, Modification of Bayesian updating where continuous parameters have differing relationships with new and existing data. arXiv:1308.2791 (2013).
57. D Poole, AE Raftery, Inference for deterministic simulation models: The Bayesian melding approach. J Am Stat Assoc 95, 1244–1255 (2000).
58. T Seidenfeld, Why I am not an objective Bayesian; some reflections prompted by Rosenkrantz. Theory Decis 11, 413–440 (1979).
59. RE Kass, L Wasserman, The selection of prior distributions by formal rules. J Am Stat Assoc 91, 1343–1370 (1996).
60. J Williamson, Objective Bayesianism, Bayesian conditionalisation and voluntarism. Synthese 178, 67–85 (2009).
61. G Schwarz, Estimating the dimension of a model. Ann Stat 6, 461–464 (1978).
62. J Rissanen, Modeling by shortest data description. Automatica 14, 465–471 (1978).
63. CS Wallace, DM Boulton, An information measure for classification. Comput J 11, 185–194 (1968).
64. S Watanabe, A widely applicable Bayesian information criterion. J Mach Learn Res 14, 867–897 (2013).
65. PD Grünwald, IJ Myung, MA Pitt, Advances in Minimum Description Length: Theory and Applications (MIT Press, Cambridge, MA, 2009).
66. S Arlot, A Celisse, A survey of cross-validation procedures for model selection. Stat Surv 4, 40–79 (2010).
67. PW Anderson, More is different. Science 177, 393–396 (1972).
68. RW Batterman, Philosophical implications of Kadanoff's work on the renormalization group. J Stat Phys 167, 559–574 (2017).
69. J Bardeen, LN Cooper, JR Schrieffer, Theory of superconductivity. Phys Rev 108, 1175–1204 (1957).
70. L Michaelis, ML Menten, The kinetics of invertin action. FEBS Lett 587, 2712–2720 (2013).
Copyright
© 2018. Published under the PNAS license.
Submission history
Published online: February 6, 2018
Published in issue: February 20, 2018
Notes
This article is a PNAS Direct Submission.
†
For simplicity we consider only regular models; i.e., we assume all parameters are structurally identifiable.
‡
See Fig. 5 for a demonstration of this point. For another example, consider a parameter manifold which is a cone, with Fisher metric $ds^2 = dr^2 + \epsilon^2 r^2\, d\Omega_{D-1}^2$ for $r \in [0, L]$: There is one relevant direction of length $L$, and there are $D-1$ irrelevant directions forming a sphere of diameter $\sim \epsilon r$. Then the prior on $r$ alone implied by $p_J$ is $\propto r^{D-1}$, putting most of the weight near $r = L$, dramatically so if $D$ is large. But since only the relevant direction is visible to our experiment, the region $r \approx 0$ ought to be treated similarly to $r \approx L$. The prior $p_\star$ has this property.
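The concentration near $r = L$ follows from the volume element of the metric as reconstructed above (our notation, so this derivation is only a sketch):

```latex
% Jeffreys prior = normalized Riemannian volume of the cone metric:
\sqrt{\det g} \;\propto\; (\epsilon r)^{D-1}
\quad\Longrightarrow\quad
p_J(r) \;=\; \frac{D\, r^{D-1}}{L^{D}},
\qquad
\int_{L-\delta}^{L} p_J(r)\, dr \;=\; 1-\left(1-\frac{\delta}{L}\right)^{D}
\;\longrightarrow\; 1 \ \text{ as } D \to \infty .
```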
§
We offer both numerical and analytic arguments for discreteness below. The exception to discreteness is that if there is an exact continuous symmetry, $p_\star$ will be constant along it. For example, if our Gaussian model Eq. 2 is placed on a circle (identifying both $\theta \cong \theta + L$ and $x \cong x + L$), then the optimum prior is a constant.
¶
The function $f(\theta) = D_{\mathrm{KL}}\bigl[p(x|\theta)\,\|\,p(x)\bigr]$ is sometimes called the Bayes risk, as it quantifies how poorly the prior $p(\theta)$ will perform if $\theta$ turns out to be correct. One of the problems equivalent to maximizing the MI (40) is the minimax problem for this (Fig. 1): $\min_{q(x)} \max_\theta D_{\mathrm{KL}}\bigl[p(x|\theta)\,\|\,q(x)\bigr]$. The distributions we call expected data are also known as Bayes strategies, i.e., distributions on $x$ which are the convolution of the likelihood with some prior: $p(x) = \int d\theta\, p(x|\theta)\, p(\theta)$. The optimal $q(x)$ from this third formulation can be shown to be such a distribution (40).
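A quick numerical check of this equalizer picture, reusing our hypothetical blahut_arimoto sketch given after the Abstract: at the optimum, $f(\theta)$ should equal the capacity on the support of $p_\star$ and not exceed it elsewhere (40).

```python
# Reuses blahut_arimoto from the sketch after the Abstract (our own code, not the paper's).
theta, p, f = blahut_arimoto(L=4.0)
capacity = (p * f).sum()
support = p > 1e-3                                 # grid points carrying prior weight
print("capacity (nats)  :", capacity)
print("max f on support :", f[support].max())      # approximately the capacity
print("max f off support:", f[~support].max())     # <= capacity, up to grid error
```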
#
Using a normal distribution of fixed $\sigma$ here is what allows the metric in Eq. 6 to be so simple. However, the qualitative behavior from the Poisson distribution is very similar.
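Concretely, by the standard Fisher-information computations (not spelled out in the paper): the fixed-$\sigma$ normal metric is constant, while the Poisson metric only becomes flat after a reparameterization.

```latex
% Fisher information g = -E[\partial_\theta^2 \log p(x|\theta)] for each model:
g_{\text{Normal}}(\theta)\big|_{\sigma\ \text{fixed}} \;=\; \frac{1}{\sigma^{2}}
\quad\text{(constant, so the metric is Euclidean in } \theta\text{)};
\qquad
g_{\text{Poisson}}(\lambda) \;=\; \frac{1}{\lambda},
\qquad ds \;=\; \frac{d\lambda}{\sqrt{\lambda}} \;=\; d\bigl(2\sqrt{\lambda}\bigr).
```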
‖
If we have more parameters than measurements, then the model must be singular. In fact the exponential model of Fig. 4 is already slightly singular, since exchanging $k_1 \leftrightarrow k_2$ does not change the data; we could cure this by restricting to $k_1 \le k_2$, or by working with the symmetric combinations $k_1 + k_2$ and $k_1 k_2$, to obtain a regular model.
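A two-line check of this permutation symmetry, assuming the two-exponential form $y(t) = e^{-k_1 t} + e^{-k_2 t}$ (our reading of Fig. 4, which is not reproduced here):

```python
import numpy as np

# Hypothetical two-exponential model in the spirit of Fig. 4:
# swapping k1 <-> k2 leaves every prediction unchanged, so the
# model is singular along the line k1 = k2.
def y(k1, k2, t):
    return np.exp(-k1 * t) + np.exp(-k2 * t)

t = np.linspace(0.0, 5.0, 50)
print(np.allclose(y(0.3, 1.7, t), y(1.7, 0.3, t)))  # True: exchange symmetry
# Regular reparameterizations: the symmetric functions s = k1 + k2 and
# p = k1 * k2, or simply the restriction k1 <= k2.
```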
**
Edges of the parameter manifold give simpler models not only in the sense of having fewer parameters, but also in an algorithmic sense. For example, the Michaelis–Menten model is analytically solvable (52) in a limit which corresponds to a manifold boundary (53). Stable linear dynamical systems of order $n-1$ are model boundaries of order-$n$ systems (54). Taking some parameter combinations to the extreme can lock spins into Kadanoff blocks (53).
‡‡
Model selection usually starts from a list of models to be compared, in our language a list of submanifolds of the parameter space $\Theta$. We can also consider maximizing mutual information in this setting, rather than with an unconstrained prior $p(\theta)$, and unsurprisingly we observe a similar preference for highly flexible simpler models. This is also discussed in Supporting Information, around Eq. S3.
Competing Interests
The authors declare no conflict of interest.
Cite this article
PNAS 115 (8), 1760–1765.