Distinguishing protein-coding and noncoding genes in the human genome

Clamp et al. 10.1073/pnas.0709013104.

Supporting Information

Files in this Data Supplement:

SI Figure 4
SI Figure 5
SI Figure 6
SI Figure 7
SI Figure 8
SI Figure 9
SI Figure 10
SI Appendix




SI Figure 4

Fig. 4. Distribution of random (black) and orphan (red) ORF lengths. ORF lengths obtained from random 2-kb regions of DNA are shown in black, and the lengths of the 1,177 orphan ORFs are shown in red.
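
As a rough illustration of how a random-control ORF length distribution like the black curve could be produced, here is a minimal Python sketch. It rests on several assumptions that are not stated in the caption: an ORF is taken to run from an ATG to the next in-frame stop codon, only the three forward-strand frames are scanned, the length excludes the stop codon, and uniformly random sequence stands in for the random 2-kb regions. The function names `longest_orf_length` and `random_orf_lengths` are hypothetical.

```python
import random

STOP_CODONS = {"TAA", "TAG", "TGA"}

def longest_orf_length(seq):
    """Length in bases of the longest ATG-to-stop ORF on the forward strand.

    Assumption: the stop codon is not counted, and an ORF without an
    in-frame stop is ignored.
    """
    best = 0
    for frame in range(3):
        start = None  # index of the currently open ATG, if any
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                best = max(best, i - start)
                start = None
    return best

def random_orf_lengths(n_regions, region_length=2000, seed=0):
    """Longest-ORF lengths for uniformly random sequences (a stand-in control)."""
    rng = random.Random(seed)
    lengths = []
    for _ in range(n_regions):
        seq = "".join(rng.choice("ACGT") for _ in range(region_length))
        lengths.append(longest_orf_length(seq))
    return lengths
```

A histogram of `random_orf_lengths(1000)` would give a control distribution of the kind plotted in black. A control drawn from actual genomic regions would differ in base composition from this uniform model.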





SI Figure 5

Fig. 5. Cumulative distributions of various conservation properties for 5,986 well-studied genes (blue) and matched random control sequences from the human genome (black). The median of each distribution is indicated by a labeled point. The RFC score provides the best discrimination between the well-studied genes and the matched random controls. The conservation properties are described in the main text and Methods. The well-studied genes (see Results in the main text) have a median ORF length of 1,335 bases and a median of 8 coding exons.





SI Figure 6

Fig. 6. RFC score and indel patterns. (a) Illustration showing how the RFC score is calculated for a pairwise alignment. Species 1 shows a human putative gene sequence in which translation starts in reading frame 0 (that is, codons are read from the first base). Each human base can be assigned as being in codon position 0, 1, or 2. Species 2 shows the orthologous DNA sequence in the mouse genome, aligned to the human sequence with gaps indicated by dashes. The RFC analysis considers the three possible reading frames in which the mouse sequence could be translated. For each reading frame, it assigns each nucleotide in the mouse sequence as being in codon position 0, 1, or 2, and it counts the number of aligned human and mouse bases assigned to the same codon position. The RFC score is calculated by dividing the largest value across the three reading frames by the length of the sequence. For long genes, the RFC score is calculated in 50-base windows and averaged across the gene. (b-e) Examples of indel patterns in human alignments to mouse and dog. Indels are indicated by triangles (vertex down for insertions; vertex up for deletions, followed by a white gap) and marked as frameshifting (red) or frame-preserving (gray). Regions of reading-frame mismatch relative to the human alignment are indicated by either a single red line (frame shifted by +1) or two red lines (frame shifted by +2). Start codons conserved across human, mouse, and dog are shown by blue rectangles, and intron positions by vertical black lines. (b) A well-behaved gene. All indels are frame-preserving, resulting in no regions of reading-frame mismatch. (c) A gene likely to have a misannotated start site in human. All frameshifting indels lie at the beginning of the gene, with none occurring after the conserved ATG. (d) A gene with an internal portion of the alignment that has a mismatched frame. (e) A gene with a low RFC score due to frameshifting indels across its entire length. The orthologous region contains no matching ORF in mouse or dog. The gene is likely to be a spurious human ORF.
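
The calculation described in panel (a) can be sketched in Python as follows. This is an illustrative sketch under stated assumptions, not the authors' implementation: the alignment is given as two equal-length strings with `-` for gaps, "the length of the sequence" is taken to be the number of human bases, and windows are taken as non-overlapping runs of 50 alignment columns, each scored with its own best mouse frame. The function names are hypothetical.

```python
def column_positions(human_aln, mouse_aln):
    """For each alignment column, the ungapped base index in each species (None for a gap)."""
    if len(human_aln) != len(mouse_aln):
        raise ValueError("aligned sequences must have equal length")
    cols = []
    h_idx = m_idx = 0
    for h, m in zip(human_aln, mouse_aln):
        cols.append((h_idx if h != "-" else None, m_idx if m != "-" else None))
        if h != "-":
            h_idx += 1
        if m != "-":
            m_idx += 1
    return cols

def rfc_from_columns(cols):
    """Best fraction of human bases whose codon position matches the aligned mouse base."""
    n_human = sum(1 for h, _ in cols if h is not None)
    if n_human == 0:
        return None  # nothing to score in this span
    # Human is read in frame 0; try the three possible mouse frames.
    best = max(
        sum(1 for h, m in cols
            if h is not None and m is not None and (m + frame) % 3 == h % 3)
        for frame in range(3)
    )
    return best / n_human

def rfc_score(human_aln, mouse_aln, window=None):
    """RFC score for a pairwise alignment, optionally averaged over windows."""
    cols = column_positions(human_aln, mouse_aln)
    if window is None or len(cols) <= window:
        return rfc_from_columns(cols) or 0.0
    scores = [rfc_from_columns(cols[i:i + window]) for i in range(0, len(cols), window)]
    scores = [s for s in scores if s is not None]  # skip windows with no human bases
    return sum(scores) / len(scores) if scores else 0.0
```

For example, `rfc_score("ATGAAACCCGGG", "ATGA-ACCCGGG")` gives 7/12: after the single-base deletion the mouse sequence stays in one shifted frame, so 7 of the 12 human bases agree in the best frame. Because each window picks its own best frame, windowing lets a later compensating indel restore a high score locally.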





SI Figure 7

Fig. 7. Joint mouse/dog RFC score distributions for fast-evolving genes. The red curve shows the distribution for all orthologs, and the black curve shows the random distribution. The remaining curves show how the RFC score varies for the fastest-evolving genes (as measured by Ka/Ks): the top 10% of fast-evolving genes (Ka/Ks > 0.5, green), the top 5% (Ka/Ks > 0.6, blue), the top 2% (Ka/Ks > 0.8, magenta), and the top 1% (Ka/Ks > 1.0, cyan).





SI Figure 8

Fig. 8. Chimp RFC score distribution for fast-evolving genes. The red curve shows the distribution for all orthologs, and the black curve shows the random distribution. The remaining curves show how the RFC score varies for the fastest-evolving genes (as measured by Ka/Ks): the top 10% of fast-evolving genes (Ka/Ks > 0.5, green), the top 5% (Ka/Ks > 0.6, blue), the top 2% (Ka/Ks > 0.8, magenta), and the top 1% (Ka/Ks > 1.0, cyan).





SI Figure 9

Fig. 9. Macaque RFC score distribution for fast-evolving genes. The red curve shows the distribution for all orthologs, and the black curve shows the random distribution. The remaining curves show how the RFC score varies for the fastest-evolving genes (as measured by Ka/Ks): the top 10% of fast-evolving genes (Ka/Ks > 0.5, green), the top 5% (Ka/Ks > 0.6, blue), the top 2% (Ka/Ks > 0.8, magenta), and the top 1% (Ka/Ks > 1.0, cyan).





SI Figure 10

Fig. 10. RFC distributions for novel Ensembl v38 genes (a and b) and novel Vega and RefSeq genes (c and d). Ortholog distributions are in blue, orphans are in red, and random controls are in black.

This Article

PNAS December 4, 2007 vol. 104 no. 49 19428-19433