On the sparsity of fitness functions and implications for learning

Edited by Günter Wagner, Department of Ecology and Evolutionary Biology, Yale University, New Haven, CT; received May 25, 2021; accepted November 11, 2021
December 22, 2021
119 (1) e2109649118

Significance

The properties of proteins and other biological molecules are encoded in large part in the sequence of amino acids or nucleotides that defines them. Researchers increasingly use machine learning and related statistical approaches to estimate functions that map sequences to a particular property. However, an important question remains unanswered: How many experimental measurements are needed to accurately learn these “fitness” functions? We leverage perspectives from the fields of biophysics, evolutionary biology, and signal processing to develop a theoretical framework that enables us to make progress on answering this question. We demonstrate that this framework can be used to make useful calculations on real-world data and suggest how these calculations may be used to guide experiments.

Abstract

Fitness functions map biological sequences to a scalar property of interest. Accurate estimation of these functions yields biological insight and sets the foundation for model-based sequence design. However, the fitness datasets available to learn these functions are typically small relative to the large combinatorial space of sequences; characterizing how much data are needed for accurate estimation remains an open problem. There is a growing body of evidence demonstrating that empirical fitness functions display substantial sparsity when represented in terms of epistatic interactions. Moreover, the theory of Compressed Sensing provides scaling laws for the number of samples required to exactly recover a sparse function. Motivated by these results, we develop a framework to study the sparsity of fitness functions sampled from a generalization of the NK model, a widely used random field model of fitness functions. In particular, we present results that allow us to test the effect of the Generalized NK (GNK) model’s interpretable parameters—sequence length, alphabet size, and assumed interactions between sequence positions—on the sparsity of fitness functions sampled from the model and, consequently, the number of measurements required to exactly recover these functions. We validate our framework by demonstrating that GNK models with parameters set according to structural considerations can be used to accurately approximate the number of samples required to recover two empirical protein fitness functions and an RNA fitness function. In addition, we show that these GNK models identify important higher-order epistatic interactions in the empirical fitness functions using only structural information.
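The two ingredients the abstract combines can be illustrated concretely. The following minimal sketch (ours, not the authors' code; it assumes a binary alphabet, a short sequence length L = 8, and NK-style neighborhoods of size K = 2) samples a random NK fitness function, shows that its Walsh-Hadamard (epistatic) representation is sparse, and then recovers the full function from a random subsample of measurements, using greedy orthogonal matching pursuit as a stand-in for the compressed-sensing recovery discussed in the paper:

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)

L, K = 8, 2   # sequence length and neighborhood size (binary alphabet assumed)
N = 2 ** L    # number of sequences = number of epistatic coefficients

# All binary sequences encoded as +/-1 spins, so that products of spins over
# subsets of positions form the orthogonal Walsh-Hadamard (epistatic) basis.
seqs = np.array(list(itertools.product([1, -1], repeat=L)))

# NK-style fitness: position i contributes a random lookup table over itself
# and K-1 randomly chosen neighbor positions.
f = np.zeros(N)
for i in range(L):
    nbrs = [i] + list(rng.choice([j for j in range(L) if j != i], K - 1, replace=False))
    table = rng.standard_normal(2 ** K)                    # random local subfunction
    idx = (seqs[:, nbrs] == -1).astype(int) @ (2 ** np.arange(K))
    f += table[idx]

# Exact epistatic coefficients, one per subset of positions. The columns of Phi
# satisfy Phi.T @ Phi = N * I, so the transform is a simple projection.
subsets = [s for r in range(L + 1) for s in itertools.combinations(range(L), r)]
Phi = np.array([[np.prod(seq[list(sub)]) for sub in subsets] for seq in seqs])
beta = Phi.T @ f / N
support = int(np.sum(np.abs(beta) > 1e-9))
print(f"{support} of {N} epistatic coefficients are nonzero")

# Compressed-sensing-style recovery from a random subsample of sequences, via
# greedy orthogonal matching pursuit (a stand-in for LASSO / basis pursuit).
def omp(A, y, n_nonzero):
    supp, resid = [], y.copy()
    for _ in range(n_nonzero):
        supp.append(int(np.argmax(np.abs(A.T @ resid))))   # most correlated column
        coef_s = np.linalg.lstsq(A[:, supp], y, rcond=None)[0]
        resid = y - A[:, supp] @ coef_s                    # refit, update residual
    x = np.zeros(A.shape[1])
    x[supp] = coef_s
    return x

n_meas = 6 * support                       # a small multiple of the sparsity
pick = rng.choice(N, n_meas, replace=False)
f_hat = Phi @ omp(Phi[pick], f[pick], support)
r2 = 1 - np.var(f - f_hat) / np.var(f)
print(f"recovered from {n_meas} of {N} measurements, R^2 = {r2:.3f}")
```

Because each local subfunction depends on only K positions, its epistatic expansion is supported on subsets of those positions, so the total number of nonzero coefficients scales like L·2^K rather than 2^L; this is the sparsity that makes recovery from few measurements possible.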

Data Availability

The code for our analyses is available on GitHub, https://github.com/dhbrookes/FitnessSparsity. The mTagBFP2 fitness data used in this work are available in Supplementary Data 3 of ref. 18 (https://doi.org/10.1038/s41467-019-12130-8). The His3p fitness data used in this work are described in ref. 46 and are available in the NCBI Gene Expression Omnibus repository, accession no. GSE99990. All other data generated or analyzed in this study are included in the article and/or supporting information.

Acknowledgments

We thank Clara Wong-Fannjiang and Nilesh Tripuraneni for enlightening discussions; and Akosua Busia and Chloe Hsu for helpful comments on the manuscript. D.H.B. and J.L. were supported by the Chan Zuckerberg Investigator Program. A.A. was supported by Army Research Office Grant W911NF2110117.

Supporting Information

Appendix 01 (PDF)

References

1
J. Otwinowski, J. B. Plotkin, Inferring fitness landscapes by regression produces biased estimates of epistasis. Proc. Natl. Acad. Sci. U.S.A. 111, E2301–E2309 (2014).
2
J. Otwinowski, Biophysical inference of epistasis and the effects of mutations on protein stability and function. Mol. Biol. Evol. 35, 2345–2354 (2018).
3
A. Ballal et al., Sparse epistatic patterns in the evolution of terpene synthases. Mol. Biol. Evol. 37, 1907–1924 (2020).
4
P. A. Romero, A. Krause, F. H. Arnold, Navigating the protein fitness landscape with Gaussian processes. Proc. Natl. Acad. Sci. U.S.A. 110, E193–E201 (2013).
5
J. Zhou, D. M. McCandlish, Minimum epistasis interpolation for sequence-function relationships. Nat. Commun. 11, 1782 (2020).
6
K. K. Yang, Z. Wu, F. H. Arnold, Machine-learning-guided directed evolution for protein engineering. Nat. Methods 16, 687–694 (2019).
7
R. Rao et al., “Evaluating protein transfer learning with tape” in Advances in Neural Information Processing Systems, H. Wallach et al., Eds. (Curran Associates, Inc., Red Hook, NY, 2019), vol. 32, pp. 9689–9701.
8
S. Biswas, G. Khimulya, E. C. Alley, K. M. Esvelt, G. M. Church, Low-N protein engineering with data-efficient deep learning. Nat. Methods 18, 389–396 (2021).
9
R. J. Fox et al., Improving catalytic function by ProSAR-driven enzyme evolution. Nat. Biotechnol. 25, 338–344 (2007).
10
C. N. Bedbrook, K. K. Yang, A. J. Rice, V. Gradinaru, F. H. Arnold, Machine learning to design integral membrane channelrhodopsins for efficient eukaryotic expression and plasma membrane localization. PLOS Comput. Biol. 13, e1005786 (2017).
11
Z. Wu, S. B. J. Kan, R. D. Lewis, B. J. Wittmann, F. H. Arnold, Machine learning-assisted directed protein evolution with combinatorial libraries. Proc. Natl. Acad. Sci. U.S.A. 116, 8852–8858 (2019).
12
A. Gupta, J. Zou, Feedback GAN for DNA optimizes protein functions. Nat. Mach. Intell. 1, 105–111 (2019).
13
D. H. Brookes, H. Park, J. Listgarten, “Conditioning by adaptive sampling for robust design” in Proceedings of the 36th International Conference on Machine Learning, K. Chaudhuri, R. Salakhutdinov, Eds. (Proceedings of Machine Learning Research, PMLR, Long Beach, CA, 2019), vol. 97, pp. 773–782.
14
C. Angermüller et al., “Model-based reinforcement learning for biological sequence design” in 8th International Conference on Learning Representations, ICLR 2020 (OpenReview.net, 2020). https://openreview.net/forum?id=HklxbgBKvr. Accessed 17 December 2021.
15
C. Fannjiang, J. Listgarten, “Autofocused oracles for model-based design” in Advances in Neural Information Processing Systems 33, H. Larochelle, M. Ranzato, R. Hadsell, M.-F. Balcan, H.-T. Lin, Eds. (NeurIPS, 2020).
16
Z. R. Sailer, M. J. Harms, Detecting high-order epistasis in nonlinear genotype-phenotype maps. Genetics 205, 1079–1088 (2017).
17
G. Yang et al., Higher-order epistasis shapes the fitness landscape of a xenobiotic-degrading enzyme. Nat. Chem. Biol. 15, 1120–1128 (2019).
18
F. J. Poelwijk, M. Socolich, R. Ranganathan, Learning the pattern of epistasis linking genotype and phenotype in a protein. Nat. Commun. 10, 4213 (2019).
19
A. Aghazadeh et al., Epistatic Net allows the sparse spectral regularization of deep neural networks for inferring fitness functions. Nat. Commun. 12, 5225 (2021).
20
A. Aghazadeh, O. Ocal, K. Ramchandran, CRISPRLand: Interpretable large-scale inference of DNA repair landscape based on a spectral approach. Bioinformatics 36 (suppl. 1), i560–i568 (2020).
21
E. J. Candes, J. Romberg, T. Tao, Robust uncertainty principles: Exact signal reconstruction from highly incomplete frequency information. IEEE Trans. Inf. Theory 52, 489–509 (2006).
22
D. L. Donoho, Compressed sensing. IEEE Trans. Inf. Theory 52, 1289–1306 (2006).
23
S. Kauffman, S. Levin, Towards a general theory of adaptive walks on rugged landscapes. J. Theor. Biol. 128, 11–45 (1987).
24
A. Agarwala, D. S. Fisher, Adaptive walks on high-dimensional fitness landscapes and seascapes with distance-dependent statistics. Theor. Popul. Biol. 130, 13–49 (2019).
25
S. A. Kauffman, E. D. Weinberger, The NK model of rugged fitness landscapes and its application to maturation of the immune response. J. Theor. Biol. 141, 211–245 (1989).
26
W. Rowe et al., Analysis of a complete DNA-protein affinity landscape. J. R. Soc. Interface 7, 397–408 (2010).
27
J. Neidhart, I. G. Szendro, J. Krug, Exact results for amplitude spectra of fitness landscapes. J. Theor. Biol. 332, 218–227 (2013).
28
Y. Hayashi et al., Experimental rugged fitness landscape in protein sequence space. PLoS One 1, e96 (2006).
29
T. Aita et al., Extracting characteristic properties of fitness landscape from in vitro molecular evolution: A case study on infectivity of fd phage to E. coli. J. Theor. Biol. 246, 538–550 (2007).
30
J. Buzas, J. Dinitz, An analysis of NK landscapes: Interaction structure, statistical properties, and expected number of local optima. IEEE Trans. Evol. Comput. 18, 807–818 (2014).
31
S. Nowak, J. Krug, Analysis of adaptive walks on NK fitness landscapes with different interaction schemes. J. Stat. Mech. Theory Exp. 2015, P06014 (2015).
32
K. S. Sarkisyan et al., Local fitness landscape of the green fluorescent protein. Nature 533, 397–401 (2016).
33
T. Hastie, R. Tibshirani, M. Wainwright, Statistical Learning with Sparsity: The Lasso and Generalizations (Chapman & Hall/CRC, Boca Raton, FL, 2015).
34
E. J. Candes, Y. Plan, A probabilistic and RIPless theory of compressed sensing. IEEE Trans. Inf. Theory 57, 7235–7254 (2011).
35
R. B. Heckendorn, D. Whitley, “A Walsh analysis of NK-landscapes” in Proceedings of the Seventh International Conference on Genetic Algorithms, T. Bäck, Ed. (Morgan Kaufmann, Burlington, MA, 1997), pp. 41–48.
36
S. Hwang, B. Schmiegelt, L. Ferretti, J. Krug, Universality classes of interaction structures for NK fitness landscapes. J. Stat. Phys. 172, 226–278 (2018).
37
D. M. Weinreich, Y. Lan, J. Jaffe, R. B. Heckendorn, The influence of higher-order epistasis on biological fitness landscape topography. J. Stat. Phys. 172, 208–225 (2018).
38
F. J. Poelwijk, V. Krishna, R. Ranganathan, The context-dependence of mutations: A linkage of formalisms. PLOS Comput. Biol. 12, e1004771 (2016).
39
D. M. Weinreich, Y. Lan, C. S. Wylie, R. B. Heckendorn, Should evolutionary geneticists worry about higher-order epistasis? Curr. Opin. Genet. Dev. 23, 700–707 (2013).
40
G. D. Stormo, Maximally efficient modeling of DNA sequence motifs at all levels of complexity. Genetics 187, 1219–1224 (2011).
41
P. F. Stadler, R. Seitz, G. P. Wagner, Population dependent Fourier decomposition of fitness landscapes over recombination spaces: Evolvability of complex characters. Bull. Math. Biol. 62, 399–428 (2000).
42
E. D. Weinberger, Local properties of Kauffman’s N-k model: A tunably rugged energy landscape. Phys. Rev. A 44, 6399–6413 (1991).
43
P. Ravikumar, M. J. Wainwright, J. D. Lafferty, High-dimensional Ising model selection using l1-regularized logistic regression. Ann. Stat. 38, 1287–1319 (2010).
44
C. A. Voigt, C. Martinez, Z. G. Wang, S. L. Mayo, F. H. Arnold, Protein building blocks preserved by recombination. Nat. Struct. Biol. 9, 553–558 (2002).
45
O. M. Subach et al., Structural characterization of acylimine-containing blue and red chromophores in mTagBFP and TagRFP fluorescent proteins. Chem. Biol. 17, 333–341 (2010).
46
V. O. Pokusaeva et al., An experimental assay of the interactions of amino acids from orthologous sequences shaping a complex fitness landscape. PLoS Genet. 15, e1008079 (2019).
47
J. Yang, Y. Zhang, I-TASSER server: New development for protein structure and function predictions. Nucleic Acids Res. 43, W174–W181 (2015).
48
R. Lorenz et al., ViennaRNA Package 2.0. Algorithms Mol. Biol. 6, 26 (2011).
49
L. du Plessis, G. E. Leventhal, S. Bonhoeffer, How good are statistical models at approximating complex fitness landscapes? Mol. Biol. Evol. 33, 2454–2468 (2016).
50
D. W. Anderson, A. N. McKeown, J. W. Thornton, Intermolecular epistasis shaped the function and evolution of an ancient transcription factor and its DNA binding sites. eLife 4, e07864 (2015).
51
D. M. Weinreich, N. F. Delaney, M. A. Depristo, D. L. Hartl, Darwinian evolution can follow only very few mutational paths to fitter proteins. Science 312, 111–114 (2006).
52
C. Bank, S. Matuszewski, R. T. Hietpas, J. D. Jensen, On the (un)predictability of a large intragenic fitness landscape. Proc. Natl. Acad. Sci. U.S.A. 113, 14085–14090 (2016).
53
N. C. Wu, L. Dai, C. A. Olson, J. O. Lloyd-Smith, R. Sun, Adaptation in protein fitness landscapes is facilitated by indirect paths. eLife 5, e16965 (2016).
54
D. H. Bryant et al., Deep diversification of an AAV capsid protein by machine learning. Nat. Biotechnol. 39, 691–696 (2021).
55
B. Ricaud, P. Borgnat, N. Tremblay, P. Gonçalves, P. Vandergheynst, Fourier could be a data scientist: From graph Fourier transform to signal processing on graphs. C. R. Phys. 20, 474–488 (2019).
56
P. F. Stadler, “Towards a theory of landscapes” in Complex Systems and Binary Networks, R. López-Peña, H. Waelbroeck, R. Capovilla, R. García-Pelayo, F. Zertuche, Eds. (Lecture Notes in Physics, Springer, Berlin, 1995), vol. 461, pp. 78–163.
57
R. Hammack, W. Imrich, S. Klavzar, Handbook of Product Graphs (CRC Press, Inc., Boca Raton, FL, ed. 2, 2011).
58
A. S. Perelson, C. A. Macken, Protein evolution on partially correlated landscapes. Proc. Natl. Acad. Sci. U.S.A. 92, 9657–9661 (1995).
59
H. A. Orr, The population genetics of adaptation on correlated fitness landscapes: The block model. Evolution 60, 1113–1124 (2006).

Information & Authors

Information

Published in

Proceedings of the National Academy of Sciences
Vol. 119 | No. 1
January 5, 2022
PubMed: 34937698

Submission history

Accepted: November 11, 2021
Published online: December 22, 2021
Published in issue: January 5, 2022

Keywords

  1. fitness functions
  2. compressed sensing
  3. epistasis
  4. protein structure

Notes

This article is a PNAS Direct Submission.

Authors

Affiliations

Biophysics Graduate Group, University of California, Berkeley, CA 94720;
Department of Electrical Engineering and Computer Sciences, University of California, Berkeley, CA 94720;
Center for Computational Biology, University of California, Berkeley, CA 94720

Notes

1
To whom correspondence may be addressed. Email: [email protected].
Author contributions: D.H.B. and A.A. designed research; D.H.B. performed research; and D.H.B., A.A., and J.L. wrote the paper.

Competing Interests

Competing interest statement: J.L. is on the Scientific Advisory Board for Foresite Labs and Patch Biosciences.
