Frameshifting preserves key physicochemical properties of proteins

Significance Genetic information stored in DNA is transcribed to messenger RNAs, which are then translated to produce proteins. A frameshift in the reading frame at any stage of this process typically results in a significantly different protein sequence being produced. Here, we show that, nevertheless, several essential properties of many protein sequences, such as their hydrophobicity profiles, remain largely unchanged upon frameshifting. This finding suggests that frameshifting could be an effective evolutionary strategy for generating novel protein sequences, which retain the functionally relevant physicochemical properties of the sequences from which they derive.

Sequence identity Sequence identity describes the fraction of strictly identical amino acids that are located at the same position in two protein sequences over the total number of residues. Specifically, in a wild-type protein sequence and its frameshifted counterpart, this refers to locations in the two sequences defined by the respective wild-type codons and their +1 or -1 frameshifted counterparts in the mRNA coding sequence.
Genetic code analysis For each codon in the genetic code table, four associated frameshift codons were constructed by adding one of the four possible bases to the last two bases of the original codon, thereby creating the +1 frameshift graph, yielding 64×4 pairs of native and respective frameshifted codons. Due to this graph being cyclic, there was no need to separately consider the -1 frameshifts. All pairs containing a stop codon were excluded from the set, resulting in 232 frameshift pairs. Utilizing the universal genetic code (UGC), all codons were translated to their corresponding amino acids. Applying one of the 604 amino acid property scales to both original and frameshifted amino acids produced a set of numerical pairs for which the sample Pearson's correlation coefficient R was calculated. To compare the resulting 604 correlation coefficients against an appropriate background, we carried out the same procedure with 10 6 scales each containing 20 values randomly chosen between -1 and 1. Based on the standard deviation of this random Gaussian background, Z-scores were calculated for each of the 604 scales, which in turn were used to calculate P values from an analytic Gaussian distribution. Quantitatively similar results were obtained by randomizing the genetic code while keeping its box-like nature intact (see genetic code randomization).
Genetic code randomization Four different methods of generating a random background were used to assess the significance of our observations. First, instead of randomizing the genetic code, the scales themselves were randomized ("rand. scales") as described in the description of "Genetic code analysis" above. Virtually identical significances for individual scales were obtained via a randomization of UGC, while keeping its block structure intact. Specifically, boxes of codons as found in UGC were assigned to new amino acids at random, while keeping the stop codons in their original positions ("boxes"). As a third randomization method, all the assignments of codons to amino acids were shuffled, while retaining the absolute number of codons for each amino acid and also retaining stop codon assignments ("shuffle"). Lastly, a complete randomization was performed whereby amino acids were randomly assigned to codons with the only limitation that each amino acid appears at least once and that stop codon assignments again remain unchanged ("random"). In each of the randomization methods, 10 6 random samples were generated. Only in the "rand. scales" method, an analytical Gaussian background distribution was used to calculate significance, while in the other three methods the number of background samples with better performance was directly used as P value.
Optimization of scales The optimization of scales for frameshift resistance was performed via a zero-temperature Monte-Carlo scheme in analogy to our previously published analysis in a different context (11). Briefly, simulated annealing was performed by scaling the step-size with simulation time. A total of 3000 steps was carried out for each Monte-Carlo run. Scales were initialized with random values. At every step, between 1 and 3 scale values, chosen at random, were perturbed by a small time-dependent offset. If the resulting scale resulted in a higher Pearson's R than the original, the move was accepted and otherwise rejected. This resulted in a set of computationally optimized scales with correlations of up to R=0.558 between the UGC and its frameshifted version.
Principal component analysis The PCA was performed using scikit-learn (12) in Python programming language. The PCA space was fitted on a set of 1000 optimal scales for frameshift robustness calculated via Monte-Carlo optimization (as described in "Optimization of scales" above) with an explained variance of 83.2% in PC1 and 16.8% in PC2. The lengths and the directions of arrows in Fig. 1C in the main manuscript correspond to the relative contributions of the respective amino acids to PC1 and PC2. Other local optimization algorithms implemented in scikit-learn (Nelder-Mead, BFGS, trust-constr) yielded theoretical scales resulting in equivalent components in a PCA. Although the ratio of explained variance between PC1 and PC2 varied according to the specific set of optimal scales derived, their total remained ~100% for all cases. In a next step, we transformed the 604 studied 'realistic' scales into this PCA space of computationally derived scales with optimal frameshift stability.

Sequence frameshift generation
The frameshifted variants of individual protein sequences were generated by removing the first four bases (+1 shift) or the first two bases (resulting in the -1 shift) in their wild-type mRNA coding sequences and translating them using the UGC. This approach was chosen so that negative shifts could be performed without the need to have additional genomic information beyond the original coding sequence. After frameshifting, the AUG start codon at the beginning of each sequence was removed from all original wild-type sequences in order to enable comparison between equally long protein sequences. The effect of this on the obtained results is negligible since the first three bases contain identical information in all sequences (AUG in mRNAs, Met in protein sequences).
Profile calculation and comparison Protein sequences were converted to numerical profiles by exchanging each amino acid with its respective scale value for each of the 604 different physicochemical property scales. Protein profiles were subsequently smoothed using a 21residue window, to reduce noise and highlight global features. Profiles stemming from wild-type and frameshifted protein sequences were compared via the sample Pearson's correlation coefficient R. For each property in question, the full distribution of Pearson's Rs between wildtype and frameshifted profiles of all sequences in a given proteome was evaluated and its median used to assess the associated statistical significance in comparison with a randomized background. Specifically, we have derived the distribution of Pearson's R between wild-type and frameshifted profiles for a given proteome for 10 5 scales each containing 20 values randomly chosen between -1 and 1. The standard deviation of the distribution of median values corresponding to this random Gaussian background was subsequently used to calculate the Zscores and P values for each of the 604 property scales studied. Finally, in one instance, protein profiles were compared against mRNA nucleobase-density profiles (Fig. 4B, 4C in the main manuscript). For this, codons were converted to a number between 0 and 3, representing the number of specific bases they contained and subsequently smoothed using a 21-codon window.
Stop codons In most cases, frameshifted mRNA sequences contain premature stop codons. For the calculation of protein sequence profiles corresponding to different physicochemical properties of frameshifted variants, such positions were excluded from the calculation of the average value in a local smoothing window, while the size of the window was reduced by the number of stop codons in it.

GO term enrichment
The analysis of the enrichment of gene ontology (GO) terms in a target set defined as the top quartile of the distribution of Pearson's Rs for Factor 1 frameshifts in H. sapiens was done using GOrilla (13) and REVIGO (14) tools. In GOrilla, we used the list of 17083 genes in the H. sapiens dataset as the background, for which 16281 GO terms were found. In order to reduce the redundancy of the GOrilla output, each ontology enrichment listcellular compartment (CC), molecular function (MF) and biological process (BP) -was fed into REVIGO together with the associated multiple-hypothesis corrected P values. We used the Lin's measure of similarity with a similarity parameter of 0.5, characterized as a small allowed similarity. In the main text, we present the top four significant hits in the CC ontology for both +1 and -1 frameshifts, while the total REVIGO output of the GO analysis can be found in Supporting Information (Table S3). The same procedure was applied to M. musculus. Since organisms from other domains of life are not supported by GOrilla, we used Panther (15) for their GO analysis, with the same parameters whenever possible.
Orthologs Lists of orthologous genes between M. jannaschii, T. kodakarensis, E. coli, P. aeruginosa, M. musculus and H. sapiens were downloaded from InParanoid8 (16) and used to compare Pearson's Rs of proteins' Factor 1 frameshift stability between different organisms.

Intrinsic disorder
The disorder propensity of a given protein sequence was calculated using IUPred (17) and the 'long' setting for the size of disorder detection. Disorder profiles obtained from IUPred were smoothed using the same window of 21 residues as in case of other profiles for consistency. Since IUPred cannot handle stop codons, they were accounted for by creating two sequences in silico, one with the previous residue replacing the stop codon and one using the following residue. Disorder was calculated for both versions and subsequently averaged.

Data access All relevant original data is provided as Supplementary Material to this publication.
In case of already published data, the original publications are cited, and data should be accessed at the original source. All data necessary to replicate our results is accessible.   Fig. 1B in the main article. B) If the UGC box structure is kept intact while shuffling amino acids in the genetic code, similar results as in A are obtained. C) By shuffling each amino acid in the genetic code, the box structure of UGC is effectively abolished, resulting in a uniform decrease of P values. D) Instead of shuffling, amino acids are randomly assigned to codons with the only criterion being that each amino acid is assigned at least once. In all cases, STOP codons were not reassigned during the procedure. It is important to notice that the hydrophobicity category is highly significant regardless of the specific randomization procedure. Scales in panels C and D for which none of the randomized codes performed better than the original are represented with P values smaller than 10 -6 . In all panels, Gaussian jitter is added along the x-dimension with a sole purpose to separate the data points. Figure S3. Comparison of different randomization procedures for the generation of the background. Randomizing the universal genetic code while keeping the box structure intact results in quantitatively similar P values as using scales with random values as a background (blue). Similarly, the other two methods, shuffling the codon assignments and randomizing the assignments, results in highly similar results (orange and green). Although these two groups differ quantitatively, they still show the same behavior qualitatively.