Briggs et al. 10.1073/pnas.0704665104.
Fig. 5. Base composition of ancient and modern DNA sequences. Base composition near the ends of DNA sequences determined from a mammoth, Neandertal sequences cloned in a plasmid, a cave bear and a contemporary human. Mammoth, cave bear, Neandertal and human sequences were aligned to the elephant, dog and human genomes, respectively, and the base composition of the reference sequences were plotted as a function of position relative to the ancient sequences.
Fig. 6. Nucleotide substitution frequencies in ancient and modern DNA. The frequencies of all 12 possible substitutions between 454 sequences and a reference genome are plotted as a function of distance from the 5'-ends of the 454 sequence from a mammoth sequenced aligned to the elephant genome, Neandertal DNA sequences cloned in a plasmid vector to the human genome, cave bear sequences aligned to the dog genome, and contemporary human sequences aligned to the human genome.
Fig. 7. Prediction of deamination-derived substitutions generated by the ML parameter estimation model. Using our model of the 454 process and ML parameters estimated for the Neandertal sample, we simulated the expected frequencies of C to T and G to A substitutions deriving from deamination of cytosine as a function of position along a hypothetical Neandertal DNA sequence of length 75 nt.
Fig. 8. Misincorporation frequencies in ancient and modern DNA sequences. For each of the 12 nucleotide substitutions, the proportion in excess of what is observed on the lineage to the human genome sequence is shown for contemporary human and Neandertal sequences determined by 454. Each sequence was aligned to both chimpanzee and reference human genomes. Each mismatch position between the 454-determined sequence and the reference human sequence was polarized to the 454 branch or the reference branch by using the sequence aligned to the chimpanzee genome. It is assumed that the branch lengths leading to the 454 sequence and the reference human sequence should be equal in length and that therefore any excess on the 454-sequence branch is due to nucleotide misincorporations or sequencing errors. By using contemporary modern human DNA, sequenced by 454, it is possible to compare these proportions between ancient DNA and modern DNA samples. Note that apart from C to T and G to A substitutions, most error rates are not significantly different between contemporary human and Neandertal DNA.
Fig. 9. Hypervariable region I (HVRI) mtDNA sequences generated by 454 sequencing of the Vindija Neandertal sample. The Cambridge references sequence is given above the previously determined Vindija Neandertal sequence (ref. 8) and part of the Feldhofer I Neandertal sequence (ref. 9). Frequencies of the most informative sites in the contemporary human gene pool are shown as found in the HVR base (ref. 10).
SI Text
Data Acquisition
Direct 454 sequencing of Neandertal and PCR contamination assay. DNA was extracted from a 38,000-year-old Neandertal bone from Vindija cave, Croatia, as described in ref. 1. Mitochondrial PCR contamination assays of the Vindija Neandertal extract were carried out as described in ref. 1. The extract was sent to 454 Life Sciences, Branford, CT, where a 454 library was created and two full sequencing runs were performed on the GS20 platform, as described in ref. 1. These two runs were used in the downstream damage analysis presented in this paper. Five additional 454 runs were later carried out on the same Neandertal 454 library, which produced the hominid mitochondrial control region hits shown in SI Fig. 9.
Direct sequencing of mammoth. DNA was extracted from a 43,000-year-old mammoth bone from Russia as described in ref. 2 and sent to 454 Life Sciences, where a 454 library was produced, and one sequencing run carried out on the GS 20 platform as described in ref. 2.
Direct 454 sequencing of contemporary human. Nebulized DNA from a contemporary human sample was used to produce a 454 library, and one full run was carried out at 454 Life Sciences, as described in ref. 1.
Direct sequencing of cave bear. DNA was extracted from a 42,000-year-old cave bear bone from Ochsenhalt cave, Austria, and sent to 454 Life Sciences, where a 454 library was produced and one sequencing run carried out on the GS 20 platform according to the same protocols used for the directly sequenced Neandertal and mammoth samples.
Neandertal bacterial metagenomic library. Raw sequences generated on the GS20 platform (1.47 million) that had been generated by 454 sequencing of a Neandertal bacterial metagenomic library (described in ref. 3) were kindly provided by J. Noonan and E. Rubin (Joint Genome Institute, Walnut Creek, CA).
Removing nonindependent sequences. Although each raw 454 sequence should correspond to a unique emulsion PCR template molecule, the 454 process can produce multiple reads from a single PCR template molecule either if more than one bead is present in a single emulsion PCR reaction vesicle or if during pyrosequencing the optical signals from a sequenced bead "bleed" over to an empty neighboring well and get recognized there as an independent sequence (so-called "ghost wells"). To ensure that all our downstream analyses used only sequences corresponding to unique starting templates, we developed a clustering method to identify all raw sequences likely to stem from the same PCR template. Clustering was achieved by aligning each 454 sequence against all other 454 sequences from the same run, using a scoring scheme with linear gap penalties and further lowered penalties for gaps caused by homopolymer length differences, the major source of 454 sequence error. For sequences to be grouped into a cluster, two requirements had to be met. First, the sequences had to start with the same six "nucleotide flow identities" (nucleotide identities ignoring homopolymer lengths). Secondly, the sequences' distance measure, normalized for the length of the longer sequence, had to be less than a threshold which was determined for each run as follows: Randomly sampling the distance measures revealed a bimodal distribution of distance measures. This bimodality indicated that genuine cluster artifacts formed a distinct population from genuinely similar but independent sequences, allowing a clustering threshold to be chosen in between the two peaks. Typical thresholds amounted to »90% of the nucleotides of both sequences being identical, with a wide margin in which changing the threshold has negligible effect on the clustering results. Once clusters were identified, we removed "ghost wells" by filtering out any sequences with a low brightness signal in close spatial proximity on the sequencing plate to a brighter sequence from the same cluster. From the remaining sequences we identified the individual sequence within each cluster that gave the best alignment score to any primate genome, and used only that sequence in downstream analysis. By requiring the first six nucleotide flow identities to be the same for two sequences to be clustered, we avoided the possibility of falsely clustering two sequences that actually come from independent yet overlapping templates, since molecules in the extract are presumably randomly fragmented and thus with our low genomic coverage, independent overlapping fragments are extremely unlikely to start at the same position.
Before 454 sequencing, the Neandertal bacterial library had undergone two rounds of pooled amplification; namely, batch culture in bacteria followed by PCR amplification of all library vector inserts (ref. 3). Therefore there is even greater potential in this data set to contain multiple reads from a single original DNA extract molecule. However, our clustering method was still appropriate to use here, since it would still be expected that sequence reads corresponding to the same original extract DNA molecule would show identical start points and very high sequence similarity. Therefore, we used the same pipeline to cluster this dataset and identify all unique sequences.
Sequence processing. 454 runs were performed using a GS20 instrument. Signal processing and base-calling were done using 454 software version 1.0.53, with modifications to the 454 signal processing pipeline for handling fragmented ancient DNA molecules. The cut-off for short sequences was lowered to allow sequences ³50 nt, including the 44 nt 3' adapter. The 3' adapter sequence was left on the reported sequence. This change was necessary to allow detection of the 3' end of sequence reads. These changes were made using a modified 454 signal/sequence processing parameters file for each 454 run, available by request.
Plasmid library sequences. FASTA sequences from the Neandertal plasmid library had already been generated and processed in ref. 3. For our analysis of this data set, one extra step was needed before database searching and making semiglobal alignments: since inserts were PCR amplified from the pmcl200 plasmid with vector primers before 454 library preparation, all 454 sequences contained plasmid sequence, which was removed from all fragments before the database searching.
BLAST pipeline. To identify sequence reads from each ancient DNA sequencing run, we aligned each read to a target genome, all sequences in GenBank nt (15June2006), GenBank env (15June2006), the mouse genome (mm8), and the human genome (hg18). For the mammoth runs, the target genome was elephant, loxAfr1 (www.broad.mit.edu/ftp/pub/assemblies/mammals/elephant/loxAfr1). For the cave bear runs, the target genome was dog, canFam2 (ref. 4). For the Neandertal and contemporary human runs, the target genome was human, hg18 (ref. 5). Initial alignments were performed using megablast 2.2.12 with the following command-line arguments:
-b 10 -v 10 -U F -I T -e 0.001 -F F -a 1 -D 2 -M 15000 -W 16
All alignments for each sequence were compared by the number of aligning bases and the alignment that includes the most bases was considered the best. If this alignment was to the target genome, the sequence fragment was considered to be from the fossil species and the alignment was kept for further analysis. For the parameter estimation model and for 454 error rate estimation, we also aligned directly sequenced Neandertal and contemporary human sequences to the chimpanzee genome (panTro2) and made three-way alignments of 454 human/Neandertal, chimpanzee and the human reference sequence.
Semiglobal alignments. Because BLAST produces local alignments, which may not include the whole ancient sequence read, we applied a semiglobal alignment method to each BLAST-identified target sequence to align the entire ancient sequence against a reference genome. For each of the significant hits found by BLAST, the relevant region of the reference genome was cut out and the entire 454 sequence was semiglobally aligned against that region using a modified Smith-Waterman algorithm (refs. 6 and 7). Because the 454 B-adaptor can form part of a 454 sequence following the 3' end of the ancient sequence, all B-adaptor sequence needed to be identified and removed in the semi global alignment process. To identify and remove 454 B-adaptor sequence from the 454 sequence, any suffix of the sample sequence was also aligned against the 454 B-adapter and a suffix that was more similar to the B-adapter than to the reference genome was trimmed off.
Likelihood model for inferring damage rates
Data
We start with 454-sequenced Neandertal ancient DNA (aDNA) fragments that have been aligned to orthologous chimpanzee sequences.
Assumptions
Chimpanzees and Neandertals are considered to be separated by the same substitution matrix as chimpanzees and modern humans; in other words, we assume a molecular clock between modern humans and Neandertals. Nicks occur on either strand and with equal probability between any two adjacent bases in the double-stranded region of the fragment. Finally, we assume a geometric distribution of lengths for the single-strand overhang at each end of an aDNA fragment. At a particular end of the fragment, the overhang is equally likely to be extending either the 5' or the 3' strand.
Notation
Define
Likelihood calculation
We wish to calculate the probability of observing a particular aDNA sequence,
First we condition on the lengths of the left and right 5' overhangs (

We arbitrarily limit the maximum size of an overhang to be
Because we assume overhangs are equally likely to extend either the 5' or 3' strand, we must split the probability between having no 5' overhang (and thus any length 3' overhang) and having a 5' overhang (and thus no 3' overhang):

where

Moving on to the nick probability term in Eq. 1, note that the position of the first nick depends on the

The joint event in Eq. 3 requires no nick before the specified position (

After substituting Eq. 4 into Eq. 3, we arrive at the simplified version of Eq. 3:

Next we take the first term from Eq. 1, divide the sequence (

Finally, we take Eq. 6 and calculate the former term (damage) and the latter term (evolution):

We arrive at the total likelihood by assuming independence among fragments and multiplying the likelihood for a single fragment (Eq. 1, substituting in Eqs. 6, 7, 2, and 5) across all fragments.
Parameter estimation
Five parameters need to be estimated:
Confidence intervals
We calculate confidence intervals via a likelihood ratio test that compares profile likelihood values to the unconstrained maximum. Let
The profile likelihood ratio test statistic is:
Single-parameter asymptotic 95% confidence intervals correspond to points 1.92 log likelihood units below the maximum and thus could also be called support intervals. To find these points, our algorithm takes each parameter in turn, slowly adjusting that parameter while maximizing the likelihood with respect to all other parameters (i.e., calculating the profile likelihood,
Predictions from the model
Given the maximum likelihood parameter estimates produced by the model, we can calculate the probability of C to T and G to A misincorporations along a Neandertal sequence of a given length. These probabilities can then be used during likelihood inference of population genetic parameters from Neandertal data.
In SI Fig. 7, we plot these probabilities for a sequence of 75 nt, which is close to the average Neandertal fragment length. The model replicates the observed pattern quite well, indicating that our assumptions are broadly consistent with the data.
1. Green RE, Krause J, Ptak SE, Briggs AW, Ronan MT, Simons JF, Du L, Egholm M, Rothberg JM, Paunovic M, Pääbo S (2006) Nature 444:330-336.
2. Stiller M, Green RE, Ronan M, Simons JF, Du L, He W, Egholm M, Rothberg JM, Keats SG, Ovodov ND, et al. (2006) Proc Natl Acad Sci USA 103:13578-13584.
3. Noonan JP, Coop G, Kudaravalli S, Smith D, Krause J, Alessi J, Chen F, Platt D, Pääbo S, Pritchard JK, et al. (2006) Science 314:1113-1118.
4. Lindblad-Toh K, Wade CM, Mikkelsen TS, Karlsson EK, Jaffe DB, Kamal M, Clamp M, Chang JL, Kulbokas EJ, 3rd, Zody MC, et al. (2005) Nature 438:803-819.
5. Lander ES, Linton LM, Birren B, Nusbaum C, Zody MC, Baldwin J, Devon K, Dewar K, Doyle M, FitzHugh W, et al. (2001) Nature 409:860-921.
6. Needleman SB, Wunsch CD (1970) J Mol Biol 48:443-453.
7. Smith TF, Waterman MS (1981) J Mol Biol 147:195-197.
8. Serre D, Langaney A, Chech M, Teschler-Nicola M, Paunovic M, Mennecier P, Hofreiter M, Possnert G, Pääbo S (2004) PLoS Biology 2:313-317.
9. Krings M, Stone A, Schmitz RW, Krainitzki H, Stoneking M, Pääbo S (1997) Cell 90:19-30.
10. Handt O, Meyer S, von Haeseler A (1998) Nucleic Acids Res 26:126-129.