Rubisco is evolving for improved catalytic efficiency and CO2 assimilation in plants

Significance Rubisco is the most abundant enzyme on Earth and is the source of almost all biological carbon. Here we uncover the trajectory of rubisco adaptive evolution in plants. We reveal that rubisco has experienced continuous directional selection toward higher carbon dioxide/oxygen specificity, carboxylase turnover, and carboxylation efficiency. Moreover, we find that this directional selection toward improved catalytic efficiency has resulted in a corresponding improvement in leaf-level CO2 assimilation. Together, these findings have significant implications for our understanding of the past, present, and future of rubisco evolution.

Rubisco is the primary entry point for carbon into the biosphere.However, rubisco is widely regarded as inefficient leading many to question whether the enzyme can adapt to become a better catalyst.Through a phylogenetic investigation of the molecular and kinetic evolution of Form I rubisco we uncover the evolutionary trajectory of rubisco kinetic evolution in angiosperms.We show that rbcL is among the 1% of slowest-evolving genes and enzymes on Earth, accumulating one nucleotide substitution every 0.9 My and one amino acid mutation every 7.2 My.Despite this, rubisco catalysis has been continually evolving toward improved CO 2 /O 2 specificity, carboxylase turnover, and carboxylation efficiency.Consistent with this kinetic adaptation, increased rubisco evolution has led to a concomitant improvement in leaf-level CO 2 assimilation.Thus, rubisco has been slowly but continually evolving toward improved catalytic efficiency and CO 2 assimilation in plants.
Although all extant rubisco are descended from a single ancestral rubisco-like protein (10)(11)(12), the enzyme is found in a variety of compositional forms across the tree of life (Fig. 1) (13,14).The simplest manifestations are the Form II and Form III variants found in protists, archaea, and some bacteria which are composed of a dimer, or dimers, of the ~50 kDa rubisco large subunit (RbcL) (14)(15)(16)(17).In contrast, Form I rubisco is a hexadecamer comprised of four RbcL dimers organized in an antiparallel core capped at either end by the ~15 kDa rubisco small subunit (RbcS) (14,18).Of these three Forms, only Form I and II have been recruited for oxygenic photosynthesis (17), with Form I being responsible for the vast majority of global CO 2 assimilation (17,19).
Although there is kinetic variability between rubisco orthologs, the enzyme is considered to be an inefficient catalyst.For example, the maximum substrate-saturated turnover rate of Form I rubisco (<12 s −1 ) (62) is slower than average (63).In addition, rubisco catalyzes a reaction with O 2 (64)(65)(66) that is competitive with CO 2 and results in the loss of fixed carbon via photorespiration (67)(68)(69).As a consequence, rubisco appears poorly suited to the current O 2rich/CO 2poor atmosphere (Fig. 1).Moreover, it appears that instead of improving enzyme function, multiple lineages have evolved alternative strategies to overcome rubisco's shortcomings.For example, higher rates of CO 2 assimilation are often achieved either by synthesizing large quantities of rubisco [~50% of soluble protein in plants (70) and some microbes (71,72)], or by operating CO 2concentrating mechanisms (73)(74)(75).As a result, many have questioned whether the enzyme is already perfectly adapted and whether further kinetic improvements are possible (16,65,69,(76)(77)(78)(79)(80).Obtaining answers to these questions would shed light on the "rubisco paradox"-helping to explain why this enzyme of such paramount importance appears poorly adapted for its role.
The initial hypothesis that attempted to explain the above rubisco paradox proposed that rubisco is constrained by catalytic trade-offs that limit the enzyme's adaptation.This theory was pioneered by two studies (81,82) which found antagonistic correlations between rubisco kinetic traits and proposed that these trade-offs were caused by constraints on its catalytic mechanism.However, recent evidence has questioned this hypothesis as the sole mechanism to explain the rubisco paradox.Specifically, analysis of larger species sets have revealed that kinetic trait correlations are not strong (83)(84)(85).In addition, phylogenetic signal in rubisco kinetics causes kinetic trait correlations to be overestimated unless phylogenetic comparative methods are employed (22,23).Thus, when larger datasets are analyzed with phylogenetic methods, the strength of catalytic trade-offs are substantially reduced (22,23).Instead, phylogenetic constraints have had a larger impact on limiting enzyme adaptation compared to catalytic trade-offs (22,23).These recent findings motivate a revaluation of the rubisco paradox, and an investigation of whether rubisco is evolving for improved catalysis and CO 2 assimilation in plants.
Here, we address these issues through a phylogenetic interrogation of the molecular and kinetic evolution of the Form I holoenzyme.We reveal that RbcL has evolved at a slower rate than >98% of all other gene/protein sequences across the tree of life.Through simultaneous evaluation of molecular and kinetic evolution of rubisco during the radiation of C 3 angiosperms, we reveal that the enzyme has been continually evolving toward improved CO 2 /O 2 specificity, carboxylase turnover rate, and carboxylation efficiency.Furthermore, we demonstrate that enhanced rubisco evolution is associated with increased rates of CO 2 assimilation and higher photosynthetic nitrogen-use efficiencies.Thus, rubisco is not perfectly adapted, but is slowly evolving toward improved catalytic efficiency and CO 2 assimilation.

RbcL Has Evolved Slower than RbcS and Has Experienced
Stronger Purifying Selection.Sequences encoding Form I rubisco were obtained from the National Center for Biotechnology Information (NCBI) (https://www.ncbi.nlm.nih.gov/).This dataset was filtered to retain sequences for a given species only if a full-length sequence for both rbcL and rbcS were present.Although rbcL exists as a single copy gene in all species, many species harbor multiple rbcS genes in their genomes.Thus, for each species, a single rbcL sequence and all available rbcS sequences were taken forward.In total, this resulted in a set of 488 rbcL/RbcL and 1,140 rbcS/RbcS sequences across 488 species (SI Appendix, Supplemental File 1, Fig. S1, and Table S1).
In order to compare the rate at which the two rubisco subunits have evolved, species were partitioned into distinct taxonomic groups comprising the red algae (Rhodophyta; n = 201), the SAR supergroup (Stramenopiles, Alveolates, and Rhizaria; n = 129), the bacteria (Bacteria; n = 78), the land plants (Streptophyta; n = 68) and the green algae (Chlorophyta; n = 12) (SI Appendix, Fig. S1, Supplemental File 1, and Table S1).Hereinafter, the total amount of molecular evolution of the nucleotide sequences (rbcL and rbcS) and the total amount of molecular evolution of the protein sequences (RbcL and RbcS) in a taxonomic group is referred to as "the extent of nucleotide evolution" and "the extent of protein evolution", respectively.The term "the extent of molecular evolution" jointly refers to both.
Comparison of the two rubisco subunits revealed that the extent of molecular evolution in rbcL/RbcL is lower than that experienced by rbcS/RbcS (Fig. 2A).Specifically, the nucleotide and protein sequences of rbcL/RbcL have evolved at a rate equivalent to ~50% and ~25% of rbcS/RbcS on average across all taxa, respectively (Fig. 2 A and B).This was not an artifact of the higher gene copy number of rbcS, as a 1,000 bootstrapped stratified sampling recovered the same result when only a single rbcS/RbcS sequence was randomly sampled per species (Materials and Methods; Fig. 2A).Therefore, rbcL/RbcL has explored less nucleotide and protein sequence space than rbcS/RbcS in the same sets of species over the same period of time (Fig. 2 A and B).Furthermore, rbcL also experienced fewer amino acid changes per change in nucleotide sequence compared to rbcS (Fig. 2 A and C), indicating a higher degree of purifying selection.Thus, rbcL/RbcL has evolved more slowly and has been subject to a higher degree of functional constraint on the encoded protein sequence than rbcS/RbcS.RbcL Is One of the Slowest-Evolving Genes in the Tree of Life.To evaluate the rate of molecular evolution in the context of all other genes in the species under consideration, the percentile rank of rbcL/RbcL and rbcs/Rbcs was evaluated for all genes in all species (Materials and Methods).This revealed that 99.3% of all gene nucleotide sequences and 98.1% of all gene protein sequences evolved faster than rbcL/RbcL in the same sets of species over the same period of time (Fig. 3A and SI Appendix, Supplemental File 1 and Table S2).This held true even if rbcL/RbcL was only compared to the subset of genes that encode enzymes, with 99.2% of enzyme nucleotide sequences and 98.3% of enzyme protein sequences evolving faster than rbcL/RbcL (Fig. 3B and SI Appendix, Supplemental File 1 and Table S3).Furthermore, in land plants, rbcL/RbcL was also the slowest-evolving component of the Calvin-Benson-Bassham cycle (Fig. 3C and SI Appendix, Supplemental File 1 and Tables S4 and S5).This slow pace of evolution is not simply an artifact of being encoded in the plastid genome, as rbcL/RbcL was also one of the slowest-evolving genes/proteins in bacteria which encode all of their genes in a single cytoplasmic genome.Thus, rbcL/RbcL is one of the slowest-evolving genes/ enzymes in all species in which it is found, irrespective of the taxonomic group or genome in which it is encoded.
In contrast to rbcL/RbcL, considerable variability in the extent of molecular evolution in the rubisco small subunit was observed both within and between taxonomic groups (Fig. 3A and SI Appendix, Supplemental File 1 and Table S2).Analogous results in each taxonomic group were recovered when this analysis was restricted to the subset of genes that encode enzymes (Fig. 3B and SI Appendix, Supplemental File 1 and Table S3).Moreover, in land plants, rbcS/RbcS was the fastest-evolving component of the Calvin-Benson-Bassham cycle (Fig. 3C and SI Appendix, Supplemental File 1 and Tables S4 and S5).Thus, while the pace of molecular evolution in rbcL/RbcL is ubiquitously slow, the extent of molecular evolution of rbcS/RbcS is highly variable explaining the disparity in the rate of both subunits across the tree of life (Fig. 2B and SI Appendix, Supplemental File 1 and Table S6).A similar variable rate was also observed for rubisco's ancillary chaperones (SI Appendix, Supplemental File 1).Thus, the rate of molecular evolution of rbcL/RbcL is ubiquitously low, and lower than rbcS/RbcS or any associated chaperone.

Rubisco Is Evolving for Improved Kinetic Efficiency in Plants.
Given that rbcL is among the slowest-evolving genes on Earth, the question arises as to whether its sequence evolution is adaptive and is improving the catalysis of the enzyme.We hypothesized that if rubisco was undergoing directional selection for improved catalysis, then orthologs that have experienced the largest extent of molecular evolution would be the most efficient catalysts.To test this hypothesis, a dataset of kinetic measurements from C 3 angiosperms (22,23,83) was evaluated in the context of the molecular evolution of RbcL (Fig. 4 A and B).This investigation focused on RbcL as it is the primary determinant of kinetics (22)(23)(24)(25)(26)(27)(28)(29)(30)(31)(32)(33), and because sufficient sequence data for RbcS are unavailable.This analysis revealed that the more RbcL has evolved from the most recent common ancestral sequence, the better its CO 2 / O 2 specificity (S C/O ; 10.1% variance explained, P < 0.01), CO 2 turnover rate (k catC ; 4.6% variance explained, P < 0.05), and carboxylation efficiency (k catC /K C ; 3.8% variance explained, P < 0.05) (Fig. 4B).This result is not an artifact caused by potential systematic methodological biases associated with species sampling or potential uncertainties or errors in the underlying phylogenetic tree (SI Appendix, Supplemental File 1).Thus, rubisco has been adaptively evolving for improved S C/O , k catC , and k catC /K C during the radiation of the angiosperms.
Given that the origin of the angiosperms is estimated to have occurred 160 Mya (9) (Fig. 4A), it is possible to put the above kinetic change in the context of both molecular sequence changes and evolutionary time since the last common ancestor at the base of this clade (Table 1).As the large subunit acquired one nucleotide substitution every 0.9 My and one amino acid substitution every 7.2 My (SI Appendix, Supplemental File 1 and Fig. S2), each amino acid substitution resulted in an average increase in S C/O by 2.7 × 10 −1 mol mol −1 , in k catC by 3.6 × 10 −2 s −1 , and in k catC /K C by 1.8 × 10 −3 s −1 µM −1 .This is equivalent to a relative improvement of 0.3% (S C/O ), 1.4% (k catC ), and 1.1% (k catC /K C ) per amino acid substitution, and a relative improvement of 0.05% (S C/O ), 0.2% (k catC ), and 0.2% (k catC /K C ) per million years.Thus, there has been improvement in rubisco kinetics during the radiation of the angiosperms at a rate that is dependent on the extent of its molecular sequence change.

Rubisco Is Evolving for Improved Leaf-Level CO 2 Assimilation.
Given that rubisco is evolving to become a better catalyst, we hypothesized that this adaptation would also drive adaptation in the rate of leaf-level CO 2 assimilation.To test this hypothesis, we analyzed a large dataset of photosynthetic measurements from C 3 angiosperms (86) in the context of the extent of their RbcL evolution (Fig. 5 A-C).This revealed that the rate of leaf-level CO 2 assimilation was also dependent on the extent of molecular sequence change in rubisco, such that C 3 angiosperms with more evolved rubisco presented higher rates of CO 2 assimilation (A mass ; 19.2% variance explained, P < 0.001) (Fig. 5B).This is not a consequence of increased nitrogen investment in the leaf, as the association between rubisco evolution and increased CO 2 assimilation is strengthened when measurements are controlled for leaf nitrogen content (PNUE mass , 22.1% variance explained, P < 0.001) (Fig. 5B).Analogous results were obtained when measurements of CO 2 assimilation were evaluated on a leaf area basis (Fig. 5C).Together, these results are most parsimoniously explained by directional selection toward enhanced leaf-level CO 2 assimilation driven by the kinetic adaptation described above.Thus, the adaptive evolution of rubisco during the radiation of the angiosperms has resulted in the improvement of leaf-level CO 2 assimilation.

Discussion
Rubisco is the primary entry point for carbon into the biosphere and is responsible for fixing 250 billion tons of CO 2 annually (19).Despite this immense throughput, the enzyme is a surprisingly  S2. (B) As in A but calculating the percentile (%) extent of rubisco molecular evolution (substitutions per sequence site) relative to only the subset of genes and proteins in each species which encode enzymes.See also SI Appendix, Supplemental File 1 and Table S3.S4 and S5.The raw data for this figure can be found in SI Appendix, Supplemental File 5.
inefficient catalyst with a modest carboxylase turnover rate of <12 s −1 (62) and a competing oxygenase activity that results in the loss of fixed carbon (65,66,87).This discord presents an evolutionary paradox that has attracted significant attention (16,22,23,65,69,(76)(77)(78)(79)(80).Here we demonstrate rbcL is one of the slowest-evolving genes on Earth.Despite this, we show that rubisco has been evolving for improved CO 2 /O 2 specificity (S C/O ), carboxylase turnover rates (k catC ), and carboxylation efficiencies (k catC /K C ) in angiosperms.Moreover, we find that plants with more evolved rubisco exhibit higher leaf-level CO 2 assimilation and enhanced photosynthetic nitrogen-use efficiencies.Thus, rubisco has been continually evolving toward improved

Table 1. Rubisco kinetics in extinct and extant angiosperms
Last common angiosperm ancestor Kinetic trait values for the last common ancestor of the angiosperms were computed based on the estimated y intercept (mean ± 1 SE) of the linear regression analysis performed between the extent of RbcL protein evolution and each rubisco kinetic trait in Fig. 4B.Mean values of rubisco kinetic traits and associated variation (±1 SE) in extant C 3 species are shown for comparison.The raw dataset used can be found in SI Appendix, Supplemental File 7.
catalytic efficiency and CO 2 assimilation during the radiation of the angiosperms.
A slow rate of molecular evolution in rbcL has long been assumed and has underpinned the use of this gene for systematics and phylogenetics (88)(89)(90).However, to our knowledge, there has been no contextualized measurement of the rate of rbcL evolution across the tree of life.The analysis presented here addresses this gap by revealing that rbcL/RbcL has experienced a lower extent of molecular evolution than 99% of all gene nucleotide sequences and 98% of all gene protein sequences across all taxa in which it is encoded.It is interesting to note that this result is not due to the presence of rbcL in the chloroplast genome, as rbcL is also one of the slowest-evolving sequences in bacteria which lack organellar genomes.Thus, RbcL is universally one of the slowest-evolving sequences on Earth, irrespective of the taxon or genome in which it resides.
Although dissecting the factors which constrain the rate of rbcL evolution is beyond the scope of the current study, the slow pace of rbcL molecular evolution is most likely a consequence of several synergistic factors (91) including constraints imposed by expression (92-96), selection to preserve protein function (97)(98)(99)(100)(101), and the requirements for protein-protein interactions in vivo (102)(103)(104)(105).These factors would be particularly pertinent for rubisco given that it is the most abundant enzyme in organisms in which it is found (70,72), it is subject to catalytic trade-offs (22,23,82,83) and molecular activity-stability trade-offs (106)(107)(108)(109), and given that it relies on multiple interacting partners and chaperones for folding, assembly and metabolic regulation (70,110).Thus, a perfect storm of features exist which could limit the molecular evolution of rbcL and thereby cause it to be one of the slowest-evolving genes on Earth.Further work to elucidate the exact contribution of each of these biological determinants on rubisco's rate of molecular evolution is warranted, building upon the work here and previous investigations (22,23,111).
Our integrated analysis of rubisco evolution revealed a continual improvement in S C/O , k catC , and k catC /K C during the radiation of C 3 angiosperms.Thus, although rubisco is slowly-evolving, sequence changes have enhanced the catalytic properties of the enzyme.In the context of the C 3 leaf, such directional selection toward improved S C/O is consistent with adaptation to maintain adequate carbon assimilation in response to declining atmospheric CO 2 and increasing atmospheric O 2 (Fig. 1).This evolutionary strategy has been proposed previously (82), and is suggested to apply broadly across photoautotrophs lacking a CO 2 concentrating mechanism (84,112).In addition to adaptation for higher S C/O , we also identify simultaneous improvement in k catC and k catC /K C without antagonism in any other kinetic trait.These results are also consistent with the inferior S C/O and k catC /K C reported for extinct rubisco resurrected at the dawn of the Form IB (6) and Form I (36) lineages.It is noteworthy that on first appearances, all of these studies seem to contradict an analysis within the Solanaceae in which resurrected ancestral rubisco variants exhibited superior k catC and k catC /K C values.However, in this instance the kinetic differences were proposed to be driven by sequence changes in RbcS (113), and therefore do not contradict the analysis of RbcL presented here or in other studies (6,36).Thus, sequence change in RbcL during the radiation of angiosperms has driven the continual improvement of the enzyme in the presence of a declining atmospheric CO 2 :O 2 concentration.
The "FvCB model" of photosynthesis (114), as well as a suite of other experimental (115)(116)(117)(118)(119)(120)(121)(122) and computational (123,124) studies all demonstrate that rubisco is a major rate-limiting factor for CO 2 assimilation under ambient steady-state conditions.The findings presented here link these mechanistic studies with evolutionary biology, and reveal that rubisco has experienced directional selection to improve kinetic efficiency and CO 2 assimilation.It is worth noting that the strength of the relationship observed between rubisco evolution and leaf-level CO 2 assimilation is greater in magnitude than that between evolution and enzyme kinetics.This phenomenon can be explained by the fact that rubisco is not only evolving for improved kinetic efficiency, but is also simultaneously undergoing optimization for multiple other properties such as stability, activatability, and chaperone affinity.This multi-objective optimization means that not all sequence changes are expected to alter kinetics.Instead, these sequence changes influence a much larger set of properties that together contribute to improvement in leaf-level CO 2 assimilation.Thus, our finding that rubisco evolution has a stronger correlation with leaf-level CO 2 assimilation than with any individual kinetic trait aligns with evolutionary expectations.
Ultimately, the global trends of rubisco evolution presented here change our understanding of the rubisco paradox.Rubisco is not locked in evolutionary stasis, but is instead slowly evolving toward improved CO 2 assimilation.These findings have significant implications for our understanding of the past, present, and future potential of rubisco in natural and engineered contexts.Crucially, they also serve as a foundation for future work to further unravel the intricacies of rubisco evolution.For instance, an important focus of subsequent research would be to elucidate the adaptative trajectory of rubisco across different temporal, spatial, and phylogenetic scales.The present findings also motivate efforts to resolve the contribution of the rubisco small subunit and accessory chaperones to rubisco adaptation, given that these genes evolve faster than the large subunit and thus may have played an important role in the evolutionary dynamics of the enzyme over geological time scales.

Materials and Methods
Rubisco Sequence Data.All publicly available coding sequences of the rbcL and rbcS subunit genes in the NCBI database (https://www.ncbi.nlm.nih.gov/) as of July 2020 were downloaded (RbcL n = 239,492; RbcS n = 2,061).Manual inspection of nucleotide and translated protein sequences was performed to remove any duplicate, partial, chimeric, or erroneously annotated sequences.In addition, this dataset was further restricted to include only those species which possess a Form I rubisco and for which both a full-length rbcL and rbcS gene sequence could be obtained.Given that rbcL exists as a single copy gene in all species, only one rbcL sequence per species was retained for downstream analysis.In contrast, all possible full-length rbcS sequences were taken forward to account for the fact that rbcS is multicopy in some genomes.
Translated RbcL and RbcS protein sequences were aligned using the MAFFT L-INS-i algorithm (125).The corresponding codon alignments of the nucleotide sequences were generated by threading the nucleotide sequences through the aligned protein sequences that they encode using PAL2NAL software (126).Multiple sequence alignments were trimmed to remove non-aligned codon or residue positions such that only ungapped columns remained.During this process, the putative transit peptide of rbcS/RbcS sequences in taxa in which this gene is encoded by the nuclear genome was computationally cleaved.Following these data processing steps, alignments were partitioned depending on species membership to either the bacteria (Bacteria; n = 78), land plants (Streptophyta; n = 68), green algae (Chlorophyta; n = 12), red algae (Rhodophyta; n = 201) or the SAR supergroup (Stramenopiles, Alveolates, and Rhizaria; n = 129) by use of the NCBI taxonomy browser (https://www.ncbi.nlm.nih.gov/Taxonomy/TaxIdentifier/tax_identifier.cgi).Any sequences belonging to species in either the Haptophyta, Cryptophyta, Glaucocystophyta or Excavata taxonomic groups were excluded from the dataset at this point due to insufficient data availability.In total, this resulted in a combined set of 488 rbcL/RbcL and 1,140 rbcS/ RbcS gene and protein sequences across 488 species spanning 5 taxonomic groups (SI Appendix, Fig. S1, Supplemental File 1, and Table S1).The complete set of raw rbcL/ RbcL and rbcS/RbcS sequences, as well as the complete set of aligned and trimmed rbcL/RbcL and rbcS/RbcS sequences can be found in SI Appendix, Supplemental File 2.
Rubisco Phylogenetic Tree inference.Maximum-likelihood rbcL/RbcL and rbcS/RbcS phylogenetic gene trees were inferred across all sequences within a taxonomic group by IQ-TREE (127) using the ultrafast bootstrapping method with 1,000 replicates and the Shimodaira-Hasegawa approximate-likelihood ratio branch test.The best fitting models of nucleotide (SYM+R8) and amino acid (LG+R5) sequence evolution were respectively determined as those which exhibit the lowest combined Bayesian information criterion rank score across the complete sets of both RbcL and RbcS sequences (127).Across all taxonomic groups, the models of nucleotide and amino acid sequence evolution were held constant between the gene trees for rbcL and rbcS, and RbcL and RbcS, respectively, such that branch lengths are comparable across both subunits.The complete set of these rbcL/RbcL and rbcS/RbcS phylogenetic gene trees used as the basis of the analysis herein can be found in SI Appendix, Supplemental File 3. To account for potential biases in our analysis caused by some species exhibiting multiple copies of rbcS, random stratified sampling of the non-gapped rbcS/RbcS sequence alignments was also conducted as described in SI Appendix, Supplemental File 1.

Quantification of the Toal Extent of Nucleotide and Protein Molecular
Evolution in rbcL/RbcL and rbcS/RbcS.The extent of molecular evolution in both rubisco subunits was assessed across all species in a given taxonomic group as the total length (sequence substitutions per aligned sequence site) of the phylogenetic tree describing the evolutionary history of each respective gene.For this purpose, tree length was calculated as the combined sum of branch lengths leading from the root at the last common ancestor of the tree to the set of sequences at the terminal nodes.In this way, using the trees inferred across the complete cohort of rbcL/RbcL and rbcS/RbcS sequences in each taxonomic group, it was possible to capture all nucleotide and amino acid evolution which has arisen in each subunit since the most recent common ancestor of all sampled species in the group.The total tree length measure used here accounts for differences in alignment length between genes because the branch lengths in the tree are estimates of the number of substitutions per sequence site.This measure is also unaffected by species sampling or variation in model of sequence evolution as each of these factors was held constant in the analyses that were conducted.An identical analysis was also performed for each rbcS/RbcS tree generated by stratified sampling, with mean and SD of estimates being calculated in this case across the 1,000 unique bootstrap replicate trees.
Orthogroup Classification and Phylogenetic Tree inference.Complete sets of representative gene models were acquired for a total of 32 of the bacteria species, 27 of the land plant species, 8 of the SAR species, 6 of the red algae species, and 4 of the green algae species analyzed in the present study, respectively (SI Appendix, Supplemental File 1 and Tables S1 and S7).A detailed description of the method used to download and prepare these gene models for downstream analysis is provided in SI Appendix, Supplemental File 1.
The complete set of translated proteomes for species in each respective taxonomic group were subject to orthogroup inference using OrthoFinder V2.5.2 (128,129) software run with default settings and with the DIAMOND ultra-sensitive mode (130,131).Protein sequences within each orthogroup were aligned using the MAFFT L-INS-I algorithm with 1,000 cycles of iterative refinement (125).The corresponding codon alignments of the nucleotide sequences were generated by threading the nucleotide sequences through the aligned protein sequences that they encode using PAL2NAL software (126).Alignments were trimmed to remove positions which contain gap characters.Sequences that were <50% of the median length of the cohort of all other sequences in the given orthogroup were excluded to avoid analysis of partial or truncated genes that could influence downstream analysis.All nucleotide and protein multiple sequence alignments which satisfied the above criterion and which possessed >50 ubiquitously aligned codon or amino acid positions were subject to bootstrapped maximum likelihood phylogenetic tree inference using IQ-TREE (127) following the exact method and evolutionary substitution models described above.In total, this resulted in a combined set of 16,631 orthogroup phylogenies comprising 5,126,017 ortholog pairwise comparisons across 351 species pairwise comparisons for the land plant clade, 6,953 orthogroup phylogenies comprising 153,288 ortholog pairwise comparisons across 28 species pairwise comparisons for the SAR clade, 5,422 orthogroup phylogenies comprising 642,057 ortholog pairwise comparisons across 496 species pairwise comparisons for the bacteria clade, 4,269 orthogroup phylogenies comprising 31,133 ortholog pairwise comparisons across 6 species pairwise comparisons for the green algae clade and 3,966 orthogroup phylogenies comprising 54,091 ortholog pairwise comparisons across 15 species pairwise comparisons for the red algae clade, from which to base the analyses herein.A further breakdown of these metrics for each species comparison can be found in SI Appendix, Supplemental File 4. The set of these respective genes within each species proteome that encode enzymes was determined following the method described in SI Appendix, Supplemental File 1.

Quantification of the Percentile Extent of Rubisco Molecular Evolution
within Each Taxonomic Group.To evaluate the extent of molecular evolution in rubisco in the context of all other genes, only species in the rubisco sequence dataset possessing a publicly available whole-genome gene assembly were considered.Across each pairwise combination of species in a given taxonomic group which satisfied this criterion, the extent of rbcL/RbcL and rbcS/RbcS molecular evolution since the time point of species divergence was measured by computing the sum of branch lengths (sequence substitutions per aligned sequence site) separating these respective sequences in the rubisco phylogenetic trees previously inferred.Following this, the extent of molecular evolution separating all other pairs of orthologous (but not paralogous) gene and protein sequences for that given species pair was measured across all inferred orthogroup phylogenies, and the percentile rank rate of rubisco nucleotide or protein evolution was computed relative to the cohort of these measurements.To assess the extent of rubisco molecular evolution in the context of all other enzymes, the exact same steps were followed but only the subset of genes and proteins predicted to encode enzymes were included.In both of the above analyses, a minimum threshold of 100 measurements for orthologous genes and protein sequences was ensured per species pair.In cases where multiple percentiles are calculated for a rubisco subunit in a given species pair (due to gene duplications in the rbcS of some species, or due to a single species gene assembly matching multiple sub-species in the rubisco sequence dataset) the mean percentile was taken.To provide further context on the relative pace of rubisco molecular evolution, the difference in evolutionary rate between rubisco and all other photosynthetic isoforms of the Calvin-Benson-Bassham cycle enzymes (SI Appendix, Supplemental File 1 and Table S8) was also determined across the land plants (the taxonomic group where this data is available), as described in SI Appendix, Supplemental File 1.In addition, quantification of the percentile extent of molecular evolution in rubisco chaperones (SI Appendix, Supplemental File 1 and Table S9) was performed following the same method and the same dataset for rubisco above, as described in SI Appendix, Supplemental File 1.
The full set of raw data generated from these analyses measuring the molecular evolution in rubisco and across all other genes and proteins, and all other enzyme-encoding genes and proteins in each taxa have been deposited in figshare (https://doi.org/10.6084/m9.figshare.24994625)(132).Individual processed datasets quantifying the relative extents of molecular evolution in rubisco and rubisco chaperones for each unique pairwise species comparison can be found in SI Appendix, Supplemental File 5.This processed data has also been included in a combined dataset format in SI Appendix, Supplemental File 6.

Integrated Analysis of Rubisco Molecular and Kinetic Evolution.
To interrogate the relationship between the molecular and kinetic evolution of extant Form I rubisco, a dataset of rubisco kinetic traits was downloaded from Bouvier and colleagues (22,23), as modified from that originally compiled by Flamholz and colleagues (83).For the purpose of this study, only species in this dataset with a complete set of experimentally determined measurements of rubisco specificity (S C/O ) for CO 2 relative to O 2 (i.e., the overall carboxylation/oxygenation ratio of rubisco under defined concentrations of CO 2 and O 2 gases), maximum carboxylase turnover rate per active site (k catC ), and the respective Michaelis constant (i.e., the substrate concentration at half-saturated catalyzed rate) for both CO 2 (K C ) and O 2 (K O ) substrates were selected.For each of the 137 species which satisfied this criterion (all of which were angiosperm land plants), an estimate of the Michaelis constant for CO 2 in 20.95% O 2 air (K Cair ) was also available (22,23).In addition, the ratio of the Michaelis constant for CO 2 relative to O 2 (K C /K O ), as well as carboxylation efficiency defined as the ratio of the maximum carboxylase turnover to the Michaelis constant for CO 2 (k catC /K C ), were inferred.Measurements of the Michaelis constant for RuBP (K RuBP ) were not considered owing to a limited sample size (n = 19).All Limonium species in the dataset were also ignored on the basis that trait values obtained across different studies have been deemed to be inconsistent (33,133).In total, this left a dataset of rubisco kinetic trait measurements for 123 angiosperms.Of these, only the subset of 93 species which perform C 3 photosynthesis were considered for the purpose of the integrated molecular and kinetic evolution analysis herein.This is because of both a limited sample size of C 3 -C 4 species (n = 6), C 4 -like species (n = 3) and C 4 species (n = 21) in the kinetic dataset, and given that transition toward C 4 photosynthesis is associated with a change in rubisco kinetic evolution (22,23) that would confound the directional selection analysis being conducted.
Coding sequences of the rbcL gene were obtained from Bouvier and colleagues (23) for each species in the kinetic dataset.In order to facilitate more accurate downstream phylogenetic tree inference across these sequences and to minimize the impact of long-branch effects (134), the complete set of publicly available rbcL coding sequences in land plants were also acquired in parallel from NCBI (https://www.ncbi.nlm.nih.gov/) using the query term "rbcL[Gene Name] AND "plants"[porgn:_txid3193]".These sequences thus obtained were subject to the exact same data processing steps to remove ambiguous, partial or chimeric sequences as performed previously for the rbcL sequences of species in the rubisco kinetic dataset (23).In total, this step resulted in an additional set of 29,218 full-length rbcL coding sequences to aid downstream phylogenetic inference.Protein sequences were inferred from each rbcL coding sequences via in silico translation.Next, the complete set of translated RbcL sequences (including the set of sequences from angiosperms in the rubisco kinetic dataset, as well as the set of all publicly available sequences for land plants) were respectively aligned using MAFFT L-INS-I (125), and a corresponding rbcL coding sequence alignment was generated using PAL2NAL software (126).The resulting multiple sequence alignments were trimmed to remove non-aligned residue positions and bootstrapped phylogenetic trees were inferred using IQ-TREE (127) following the exact method described above and using the best-fit models of sequence evolution previously inferred.To facilitate downstream analysis, the rbcL and RbcL gene trees were subsequently modified to keep only internal and terminal branches leading to the set of species in the rubisco kinetic dataset, with pruned trees manually rooted in Dendroscope (135).
To compute the relative extent of protein evolution which has occurred in each angiosperm in the kinetic dataset, the summed branch length (sequence substitutions per aligned sequence site) leading from the last common ancestor at the root of this clade to each respective terminal node in the RbcL phylogeny was measured.The kinetic trait values and extent of molecular evolution for all C 3 angiosperm rubisco can be found in SI Appendix, Supplemental File 7. The predicted kinetic trait values at the last common ancestor at the base of the angiosperm clade were inferred from the estimated y-intercept values from these regression models, as found in Table 1.The rbcL/RbcL phylogenetic gene trees used as the basis of this analysis, including the trees inferred across the full set of sequences, as well as the pruned versions of these trees containing only the subset of C 3 species in the kinetic dataset, can be found in SI Appendix, Supplemental File 8.
To confirm that the above analysis of rubisco molecular and kinetic evolution was not an artifact of potential methodological biases, this analysis was repeated using both a minimal subset of phylogenetically diverse species (SI Appendix, Supplemental File 1 and Table S10) so as to control for incomplete species sampling, as well as using alternate RbcL phylogenetic gene trees so as to control for any phylogenetic uncertainties.These supporting additional analyses confirmed that the results of the original analysis were valid and robust to potential methodological biases.A full description of the design, method, and results of these additional analyses is provided in detail in SI Appendix, Supplemental File 1.

Integrated Analysis of Rubisco Molecular Evolution and CO 2 Assimilation.
To investigate the relationship between rubisco molecular evolution and wholeplant photosynthetic performance, a comprehensive meta-dataset of photosynthetic measurements from species spanning the whole land plant phylogeny was provided by Gago and colleagues (86).This dataset contained measurements of light-saturated net photosynthetic rates expressed both per unit leaf mass (A mass ) and per unit leaf area (A area ), as well as measurements of total nitrogen content expressed both per unit leaf mass (N mass ) and per unit leaf area (N area ).In addition, for each unique species observation in this dataset with a corresponding measurement for both A mass and N mass or for both A area and N area , the mass-based and area-based photosynthetic nitrogen-use efficiencies were also derived using the calculations A mass /N mass (PNUE mass ) and A area /N area (PNUE area ), respectively.In cases where duplicate entries for a parameter were present across species, the mean value was taken so as to collapse the dataset to contain only a single row per species.Finally, although photosynthetic measurements were available from individuals belonging to all major land plant lineages (including the mosses, liverworts, fern allies, ferns, gymnosperms, and angiosperms), only the subset of angiosperms for which a publicly available rbcL sequence could be obtained were taken forward.This is because various diffusional and biochemical factors other than rubisco are known to cause reduced photosynthetic capacities in non-angiosperm plants (86) that would bias the results of the current study.For the same reasons, only the subset of C 3 angiosperms in this dataset were taken forward to avoid picking up photosynthetic effects which result from CO 2 concentrating mechanisms that act upstream of rubisco.In total, this left a photosynthetic dataset of 366 C 3 angiosperms from which to base the analyses herein.This dataset included 272 unique species measurements for N mass , 137 unique species measurements for A mass and 118 unique species measurements for PNUE mass , as well as 270 unique species measurements for N area , 151 unique species measurements for A area , and 120 unique species measurements for PNUE area , respectively.
To compute the relative extent of RbcL molecular evolution which has occurred in each angiosperm in the photosynthetic dataset, the exact same method was followed as described above.First, the full RbcL phylogenetic gene tree in SI Appendix, Supplemental File 8 that was previously inferred from the complete set of publicly available RbcL sequences in NCBI was pruned so as to contain only terminal and internal branches corresponding to angiosperms in the photosynthetic dataset.Here, in situations where duplicate sequences in the alignment resulted in multiple terminal nodes for a given species, only a single node was retained based on the sequence which is first in the alphabetical order of the gene accession numbers.As above, this reduced RbcL tree was then manually rooted in Dendroscope (135), and the relative extent of RbcL protein evolution in each angiosperm was computed as the summed branch length (sequence substitutions per aligned sequence site) leading from the last common ancestor at the root of this clade to each respective terminal node.Finally, linear regression models were employed to assess the pairwise relationships between the variation in rubisco molecular evolution and each respective photosynthetic parameter.The resulting full integrated dataset containing photosynthetic measurements and comparable extents of RbcL molecular evolution for all 366 C 3 angiosperms can be found in SI Appendix, Supplemental File 9.The RbcL phylogenetic gene tree which has been pruned from that in SI Appendix, Supplemental File 8 to contain the subset of C 3 angiosperms in the photosynthetic dataset used for the basis of this analysis can be found in SI Appendix, Supplemental File 10.
Data, Materials, and Software Availability.The raw data generated from the analyses measuring the molecular evolution in rubisco and across all other genes and proteins (including across all other enzyme-encoding genes and proteins) has been deposited in figshare and is available at: https://doi.org/10.6084/m9.figshare.24994625(132).All other data are included in the article and/or supporting information.

Fig. 1 .
Fig. 1.The evolutionary history of rubisco in the context of atmospheric CO 2 (%) and O 2 (%) following divergence from the ancestral rubisco-like protein (RLP).Important branch points in the phylogeny at which rubisco diverged into different evolutionary lineages are indicated by gray vertical bars.To provide additional context, the time-period at which the First and Second Great Oxidation events occurred along this evolutionary trajectory are also labeled and referenced as gray vertical bars.Graphics of atmospheric CO 2 and O 2 levels were adapted from the TimeTree resource [http://www.timetree.org;(9)].

Fig. 2 .
Fig. 2. The extent of molecular evolution in rubisco during the radiation of each taxonomic group.(A) Bar plot depicting the total amount of molecular evolution (substitutions per sequence site) in the nucleotide and protein sequences of Form I rubisco across taxonomic groups.RbcL: the extent of sequence evolution in the RbcL subunit.RbcS: the extent of sequence evolution in the RbcS subunit.RbcS*: the extent of sequence evolution in the RbcS subunit using 1,000 bootstrapped stratified sampling of rbcS/RbcS per species (Materials and Methods).The genome in which rbcL and rbcS genes reside within each group is indicated above the plot (bacterial, plastid, nuclear).Error bars represent ± 1 SD of the mean.(B) Bar plot depicting the percentage ratio (%) of nucleotide or amino acid evolution between each rubisco subunit (rbcL to rbcS and RbcL to RbcS, respectively) in each taxonomic group.The color of each bar is determined by the genome in which the rbcL and rbcS gene resides, following the color scale in A. (C) Bar plot depicting the percentage ratio (%) of nucleotide to amino acid evolution in each rubisco subunit (rbcL to RbcL and rbcS to RbcS, respectively) in each taxonomic group.The color of each bar is the same as described in B. Dashed lines indicate the expected ratio given an rbcL or rbcS sequence evolving in the absence of selection.

Fig. 3 .
Fig.3.The extent of molecular evolution in rubisco in the context of other genes.(A) Boxplot of the extent of molecular evolution (substitutions per sequence site) in the nucleotide and protein sequences of the rbcL/RbcL and rbcS/RbcS subunit expressed as a percentile (%) of that measured across all other genes and proteins, respectively.See also SI Appendix, Supplemental File 1 and TableS2.(B) As in A but calculating the percentile (%) extent of rubisco molecular evolution (substitutions per sequence site) relative to only the subset of genes and proteins in each species which encode enzymes.See also SI Appendix, Supplemental File 1 and TableS3.(C) Boxplot of the total amount of molecular evolution (substitutions per sequence site) in the nucleotide and protein sequences of each Calvin-Benson-Bassham cycle enzyme expressed as a percentage (%) of that measured in rbcL/RbcL (100%; red horizontal line) across land plants.Phosphoglycerate kinase: PGK.Glyceraldehyde-3-phosphate dehydrogenase A/B subunit: GAPDH-A/ GAPDH-B.Triose phosphate isomerase: TPI.Fructose-bisphosphate aldolase: FBA.Fructose-1,6-bisphosphatase: FBP.Transketolase: TKL.Sedoheptulosebisphosphatase: SBP.Ribose 5-phosphate isomerase: RPI.Ribulose-p-3epimerase: RPE.Phosphoribulokinase: PRK.See also SI Appendix, Supplemental File 1 and TablesS4 and S5.The raw data for this figure can be found in SI Appendix, Supplemental File 5.
Fig.3.The extent of molecular evolution in rubisco in the context of other genes.(A) Boxplot of the extent of molecular evolution (substitutions per sequence site) in the nucleotide and protein sequences of the rbcL/RbcL and rbcS/RbcS subunit expressed as a percentile (%) of that measured across all other genes and proteins, respectively.See also SI Appendix, Supplemental File 1 and TableS2.(B) As in A but calculating the percentile (%) extent of rubisco molecular evolution (substitutions per sequence site) relative to only the subset of genes and proteins in each species which encode enzymes.See also SI Appendix, Supplemental File 1 and TableS3.(C) Boxplot of the total amount of molecular evolution (substitutions per sequence site) in the nucleotide and protein sequences of each Calvin-Benson-Bassham cycle enzyme expressed as a percentage (%) of that measured in rbcL/RbcL (100%; red horizontal line) across land plants.Phosphoglycerate kinase: PGK.Glyceraldehyde-3-phosphate dehydrogenase A/B subunit: GAPDH-A/ GAPDH-B.Triose phosphate isomerase: TPI.Fructose-bisphosphate aldolase: FBA.Fructose-1,6-bisphosphatase: FBP.Transketolase: TKL.Sedoheptulosebisphosphatase: SBP.Ribose 5-phosphate isomerase: RPI.Ribulose-p-3epimerase: RPE.Phosphoribulokinase: PRK.See also SI Appendix, Supplemental File 1 and TablesS4 and S5.The raw data for this figure can be found in SI Appendix, Supplemental File 5.

Fig. 4 .
Fig. 4. The relationship between rubisco molecular and kinetic evolution in C 3 angiosperms.(A) The relationship between RbcL evolution and its corresponding kinetic trait values.AA Ev.: The extent of RbcL amino acid evolution that has occurred since the last common ancestor at the root of the angiosperm phylogeny.S C/O : specificity.k catC : carboxylase turnover per site.k catC /K C : carboxylation efficiency.K C : the Michaelis constant for CO 2 .K C air : the inferred Michaelis constant for CO 2 in 20.95% O 2 .K O : the Michaelis constant for O 2 .K C /K O : the ratio of the Michaelis constant for CO 2 compared to O 2 .(B) The relationship between the extent of RbcL protein evolution (substitutions per sequence site) and each rubisco kinetic trait in A as assessed using least squares regression models.The raw data can be found in SI Appendix, Supplemental File 7.

Fig. 5 .
Fig. 5.The relationship between rubisco molecular evolution and CO 2 assimilation in C 3 angiosperms.(A) The relationship between the extent of RbcL evolution and leaf-level CO 2 assimilation.AA Ev.: The extent of RbcL amino acid evolution that has occurred since the most recent common ancestor at the root of the angiosperm phylogeny.A mass : Photosynthetic rate per unit leaf mass.PNUE mass : Photosynthetic nitrogen use efficiency rate, calculated as photosynthetic rate per unit leaf mass expressed per unit leaf mass nitrogen content (N mass ; % N).A area : Photosynthetic rate per unit leaf area.PNUE area : Photosynthetic nitrogen use efficiency rate, calculated as photosynthetic rate per unit leaf area expressed per unit leaf area nitrogen content (N area ; g m −2 N). (B) The relationship between the extent of RbcL protein evolution (substitutions per sequence site) and each photosynthetic trait in A evaluated on a mass-basis (A mass , PNUE mass ) as assessed using least squares regression models.(C) As in B but for each photosynthetic trait evaluated on an area-basis (A area , PNUE area ).The raw data can be found in SI Appendix, Supplemental File 9.