Intro

Array Technology

This section describes the array-based technology used by Perlegen to resequence the 15 inbred mouse strains and discusses its reliability. The technology is then compared with whole-genome shotgun sequencing technology and the trace files generated are compared with those generated by dideoxy sequencing. Finally, the origin of N's in the genotyping data are discussed.

Array-based resequencing technology

Resequencing strategies are based on surveying the DNA sequences of two or more genomes for a given species and then aligning the survey DNA sequences to the reference genome to identify the bases that are the same and the bases that are different between the two copies. For this project, the publicly available genome sequence assembly of strain C57BL/6J was used as the reference genome and served as a scaffold on which the DNA sequences of the 15 inbred strains being resequenced were aligned.

High-density oligonucleotide array-based technology was used to resequence the genomic DNA of 15 inbred mouse strains, at the equivalence of a 1.5X draft sequence produced by dideoxy sequencing. The resequencing high density arrays were designed such that each nucleotide of the reference sequence was represented by a set of eight features (25-mer oligonucleotides) (see figure 1). A set of four features contained oligonucleotides identical to the C57BL/6J reference sequence from positions -12 to +12 with respect to the base to be queried (position 0), with position 0 replaced by each of the four bases A, C, G, and T. The remaining four features were similarly designed for the complementary strand.

When a labeled DNA sample, or "target," is hybridized to these eight features on the array, the two features complementary to the reference sequence and its reverse complement will yield the highest signal. If, however, the target contains a variant base at position zero, the two features complementary to the variant base will yield the highest signal.

The arrays used were manufactured by Affymetrix, Inc. and Perlegen developed the methods for analyzing whole wafers, each of which consisted of 49 chips: the ability to assay whole wafers instead of individual chips increased the efficiency of the process considerably. Because each feature was 7 microns in size, each base was assayed by 8 features, and each wafer contained approximately 180 million features, about 22 Mb of unique genomic sequence could be tiled on each wafer. By using two distinct fluorescent labels around 22 Mb of unique sequence for two different mouse strains could be assayed on each wafer. We used 68 wafers to resequence the entire genome (excluding repetitive sequences) of each mouse strain.

Long-range PCR primer pair selection

To produce the labeled DNA target for hybridization to the arrays, a long-range PCR (LR-PCR) method was developed to amplify the mouse genome in lengths of approximately 10 kb. The LR-PCR primer pairs were designed using Oligo 6, a commercially available software package. The default high-stringency criteria were used for selecting candidate primer pairs, but in regions where no suitable primers could be identified, the conditions were relaxed by making the following three modifications:

  1. GC Clamp stability = -9 kcal/mol (instead of -10 kcal/mol)
  2. Maximum number of acceptable sequence repeats = 4 (instead of 3)
  3. 3’ terminal nucleotides checked for dimers = 9 (instead of 23)

A minimal set of primer pairs yielding maximal coverage was selected from among the many candidate primer pairs by using optimization software (an implementation of Dijkstra’s algorithm).

The primers were typically between 28 and 32 nucleotides in length and had a melting temperature of > 65°C. The amplicons ranged in size from 3 kb to 12 kb, with an average of 10,336 bp.

Experimental data was used to mark each primer pair as working or failing. Failing primer pairs were those with which no SNPs were discovered, and which failed internal quality control metrics, consisting of a combination of signal-to-background ratio, the reference call-rate, and the quality score (as reported in the trace files). Of the 240,814 primers that could be mapped to Build 36, 5.4% were marked as failing. Note that the worked/failed status of the primer pairs relates only to this project, and could potentially differ for other investigators.

The sequences, positions, and working/failing status of the LR-PCR primer pairs used are available for download on the Download page, and are shown graphically on the mouse genome browser. Working primer pairs are shown in black, while failing primer pairs are shown in gray.

Long-range PCR conditions

 LR-PCR reactions (1-plex or 2-plex) were performed as follows (per reaction): 30 ng of genomic mouse DNA (obtained directly from the Jackson Laboratories) was amplified using 1.7 µM forward LR-PCR primer, 1.7 µM reverse LR-PCR primer, 0.6 U MasterAmpTM  extra long Taq polymerase (Epicentre Technologies), 20 ng/ml TaqStart®  antibody (Clontech), 0.1 X TaqStart®  Antibody buffer (Clontech), 0.4 µM dNTPs, 23 mM Tricine, 3 % DMSO, 45 mM Trizma, 2.7 mM MgCl2, 12.6 mM (NH4)2SO4, and 0.4 X MasterAmpTM PCR Enhancer with Betaine (Epicentre Technologies), in a volume of 12 µl.

The thermocycling conditions were as follows:

Following LR-PCR, amplicons to be hybridized in a single hybridization reaction (approximately 13 Mb) were combined into one tube, purified, and normalized. They were then fragmented to a peak fragment size of 100 bp with DNAse 1, end-labeled with either biotin or fluorescein, and hybridized to the arrays overnight. After washing and staining for the detection of the biotin- and fluorescein-labeled hybridized targets, the arrays were scanned using custom-built confocal scanners, and the fluorescence intensity data captured for analysis (see below).


Figure 1 : Resequencing genomes using high-density oligonucleotide arrays. Each base of the reference sequence is represented by a set of eight 25-mer probes synthesized on glass. The four probes shown here are complementary to the hypothetical reference DNA sequence, and are identical except at the 13th position, which is one of A, C, G, or T. Not shown are the four reverse complement probes. The base calling algorithm uses a sliding window spanning many sets of probes corresponding to nucleotides upstream and downstream of the queried base. Long-range PCR is used to amplify genomic DNA in segments averaging 10 kb in length. PCR products are digested, labeled and hybridized to the arrays. The probes complementary to the target DNA yield the brightest signal, allowing the target sequence to be determined. In this case the hybridization patterns indicate the presence of a G in this sequence position for mouse strain 1, and an A in this position for mouse strain 2.


Base calling and SNP detection analysis methods

Fluorescence intensity data from an oligonucleotide array scan was processed to determine an average intensity for each feature on the array. This yielded 8 data points per sequence position: one each for A, C, G, and T, for forward and reverse strands. A direct sequence was determined for each strand, consisting of the brightest base at each position. At each position, the local "conformance" of the array data was also determined as the fraction of base calls that match the reference sequence within a sliding window. For a position where the direct call matches the reference base, this window consisted of bases at positions -10 to +10. In the immediate vicinity of an alternate base call, hybridization intensities were reduced due to the presence of a one-base mismatch base between the target and probe DNA. To avoid the reduced-intensity interval in these cases, the window was shifted outwards to include bases -20 to -10, and +10 to +20.

A strict base call was made for a sequence position when the ratio of the brightest to next-brightest feature was greater than a threshold, and the conformance around that position was at least 0.80. The intensity ratio threshold was selected based on the experimental protocol used for a given scan. Typically values in the range of 1.1 to 1.3 were used. For alternate base calls that did not match the reference sequence, it was also required that there were no brighter alternate calls meeting these criteria within positions -5 to +5.

From these strict called sequences, for polymorphism detection, a consensus sequence of calls that were confirmed on both strands was created. Putative polymorphic sites also had to pass a final "footprint test." In this test, normalized intensities for probes matching the reference sequence across positions -5 to +5 were separately averaged for scans that resulted in reference base calls and alternate base calls. A SNP was rejected if the ratio of mean intensity around reference calls to mean intensity around alternate calls was less than 1.5. The footprint test required a cumulative analysis of a complete set of arrays of the same design. We required at least one consensus reference call and one alternate call to define a polymorphism; positions with no reference calls were rejected. Once a site was determined to be polymorphic, we relaxed the base calling criteria and accepted strict calls on just one strand if the other strand was found to be ambiguous (i.e., did not pass either the intensity ratio or conformance requirements). Estimating accuracy for SNP detection is challenging because it depends on the frequency of true SNPs and their distribution in the particular samples being resequenced. In our experience, we have found that consensus base calls at polymorphic sites are on the order of 99.9% accurate, and relaxed calls where one strand is ambiguous are roughly 99% accurate. The error rates for consensus calls across all sequenced bases are far better than this, since most sequence positions are correctly scored as not polymorphic.

Reliability of base calls made using Perlegen's methodology

To evaluate the accuracy of base calls made using Perlegen's methodology, a pilot mouse genome resequencing study [1] was performed. This pilot study consisted of resequencing 4.6 Mb of DNA from 5 different genomic regions in 15 inbred mouse strains and identifying variation between the strains. 4,596,781 bp of C57BL/6J genomic sequence from the 5 intervals was masked for highly repetitive sequence using RepeatMasker [2] and the resulting 3,002,233 bp of unique sequence were resequenced in the 15 mouse strains, which included C57BL/6J.

Using the pilot data collected for the C57BL/6J strain, it was determined that the agreement of the strict base calls with the NCBI reference sequence for this strain was 99.9%, corresponding to a Phred quality score [3] of q » 30 if the NCBI sequence was free of errors. For relaxed calls where one strand was ambiguous, a consistency of 98.6% was observed, corresponding to a Phred quality score of q » 19.

MatchesTotalAgreement
Strict base calls87510187595699.9%
Relaxed base calls77614378700198.6%

For all fifteen inbred strains examined in the pilot mouse resequencing project, an average of 39% of tiled bases produced strict base calls, and 24% of tiled bases produced relaxed calls. Thus, base calls were determined for 63% of the tiled bases with a Phred quality score of q » 19 or greater for each of the 15 inbred strains.

Reliability of SNP calls made using Perlegen's methodology

In the same pilot study the DNA sequences of the 15 inbred strains were compared to identify 18,893 SNPs [1]. To assess the validity of the mouse SNPs discovered using high-density oligonucleotide arrays, 78 singleton SNPs (detected in only one of the 15 strains) and 22 common SNPs (both alleles seen in more than one strain) were dideoxy sequenced. The overall rate of validation was 98%, with 97% (76 out of 78) of the singletons and 100% of the non-singletons (22/22) confirmed. These data indicate that the mouse SNPs identified by Perlegen using high-density oligonucleotide arrays have a low false-positive rate.

Comparison of array-based technology with whole-genome shotgun sequencing technology

The level of coverage and quality produced by Perlegen's use of array-based resequencing technology is similar to that of 1.5X whole-genome shotgun sequencing. Here we compare the coverage and quality of sequence data provided by the two resequencing technologies: Array-based technology and 1.5x whole-genome shotgun sequencing.

Ability to resequence unique and repetitive regions

Resequencing strategies are based on surveying the DNA sequences of two or more genomes for a given species and then aligning the survey DNA sequences to the reference genome to identify the bases that are the same and the bases that are different between the two copies. The mouse genome is composed of both unique sequences that are present only once and repetitive sequences that are present in multiple copies in the genome. Because resequencing the genomes of the 15 inbred mouse strains relies on aligning the survey DNA sequences to the C57BL/6J reference genome, significantly better coverage and quality of base reads will be obtained for unique sequences than for repetitive sequences. This is due to the fact that one cannot determine to which one of the many multiple possible positions in the genome a repetitive sequence actually maps. Fortunately, biologically interesting sequences, such as genes and their transcriptional regulatory elements, are typically unique, although some repetitive sequences are also biologically interesting.

The base-calling algorithm used by Perlegen for array-based resequencing uses a sliding window spanning many sets of oligonucleotides, which correspond to nucleotides upstream and downstream of the queried base, and is highly accurate for unique sequences. However, probes for repetitive elements will hybridize to target sequences from multiple genomic locations. Frequently, copies of a repetitive element have some sequence divergence and will not be identical at every base pair. Differences in probe affinity between instances of a repeat at several genomic positions are likely to produce uninterpretable hybridization patterns. Thus, array-based resequencing technology has limitations in its ability to examine repetitive sequences.

Whole-genome shotgun sequencing of the inbred mouse strains would be dependent on aligning the sequence reads for each inbred line onto the C57BL/6J reference genome and then computationally determining if the bases are the same or different between the two genomic copies. It is straightforward to align unique sequences. However, repetitive elements can be difficult or impossible to correctly assemble based on pairwise overlaps, because individual reads will align to multiple genomic locations. For this reason, whole-genome shotgun sequencing is notoriously poor at rendering the correct sequences of large repetitive elements. For example, the C57BL/6J reference genome sequence was produced by whole-genome shotgun sequencing at a 7.7X coverage. Even at this coverage, of the 3,000 potentially active retrotransposons thought to be in the mouse genome only 400 were found in the assembled sequence, and of these only 12 have their entire coding region intact [4].

Thus, many of the repetitive elements of potential biological interest in the mouse will not be assayable using either resequencing technology because they are not present in the C57BL/6J reference genome sequence.

Quality

As described above, data from our pilot study showed that the agreement of our strict base calls with the NCBI reference sequence for this strain was 99.9% for unique sequences, corresponding to a Phred quality score of q » 30. When relaxed calls are also included we observed a consistency of 98.6%, corresponding to a Phred quality score of q » 19. For all 15 inbred strains examined in the pilot mouse resequencing project, an average of 39% of the unique sequences produced strict base calls with a Phred quality score of q » 30, and 24% of unique sequences produced relaxed calls with a Phred quality score of q » 19. Thus, on average the base calls for 63% of the unique sequences within a given contig will have a Phred quality score of q » 19 or greater for each of the 15 inbred lines.

To obtain a 1.5X coverage of the 2.4 Gb genome using whole-genome shotgun sequencing, approximately 3.43 million reads with a mean read length of 700 bases would need to be obtained for each inbred strain. A typical dideoxy sequence read of 700 bp in length will have approximately 86% of the base calls made with 99% accuracy or greater, corresponding to a Phred quality score of q » 20 or better. Based on theory [5] and the data obtained in the recently published whole-genome shotgun sequence of the dog genome [6], 1.5X coverage would provide sequence data for ~78% of the genome, and thus for each strain 22% of all the bases would have no information.

Genome coverage

The format of the data generated in this study by array-based technology is shown in Figure 2A. Note that the same set of contigs of unique sequences separated by the same gaps of highly repetitive sequences is generated for each of the inbred mouse lines. Data from the pilot mouse resequencing project shows that the length of the contigs obtained for the inbred mouse strains is ~650 bp and the gaps, consisting mainly of highly repetitive sequences, have a mean length of ~340 bp.

In contrast to high-density oligonucleotide array resequencing, 1.5X whole-genome shotgun sequencing would produce different sets of sequence contigs and gaps for each of the inbred mouse lines (see Figure 2B for the format that would be obtained). Based on theory [5] and whole-genome shotgun sequence data of the dog genome [6] the contigs would have a mean length of ~1400 bp separated by gaps with a mean length of ~400 bp. Furthermore, the gaps between the contigs would consist of both unique and repetitive sequences.


Figure 2. Data format. A. High-density oligonucleotide array resequencing will produce the same set of contigs of unique sequences separated by the same gaps of repetitive sequences in all 15 inbred mouse lines.
B. Whole-genome shotgun sequencing would produce different sets of sequence contigs and gaps for each of the inbred mouse lines.

In conclusion, although the format of the genomic sequences produced by Perlegen using array-based technology and 1.5X whole-genome shotgun sequencing would be different, the overall quality and coverage of the two different resequencing approaches would be similar.

Comparison of trace files generated by array-based resequencing with those generated by dideoxy sequencing

While dideoxy sequencing and oligonucleotide array hybridization are completely different technologies, the fluorescence intensity measurements generated for each base by oligonucleotide array technology are analogous to peak heights in a dideoxy sequence trace. A tool was developed by Perlegen for automatically converting the array data into SCF or ZTR trace file format [7] using the Staden trace file IO library. The result is a trace file for each contiguous fragment sequenced in this study, containing the intensity measurements for A, C, T, and G for each base, the called sequence, and a quality score for each base call. These trace files can be directly viewed using conventional tools for dideoxy sequence traces such as Trev, available at http://staden.sourceforge.net/.

Each trace file represents data for one contiguous fragment of tiled sequence in one orientation. The trace amplitude data consists of mean fluorescence intensity measurements for each feature on the array. Unlike a conventional sequencing trace, there is a one-to-one correspondence between the trace amplitudes and sequence positions. The called sequence consists of the brightest of the four nucleotide probes for each position in the reference sequence. Confidence values are computed using an algorithm similar to Phred [3] that considers the relative intensities of the brightest and next brightest probes, and the consistency of surrounding base calls with the expected reference sequence. Due to experimental variation between individual hybridization experiments, the quality scores for a scan may not be perfectly calibrated. Similar to dideoxy sequencing quality scores, the reported scores represent estimated base 10 log error rates, so 20 = an error rate of 0.01, 30 = 0.001, and so on. The scoring algorithm derives a decision tree for estimating error rates for individual calls based on the input metrics. Since these trees are made up of a limited number of nodes, only a limited set of discrete scores are actually possible. Groups of calls with the same reported quality score could not be distinguished on the basis of the input data. The quality scores are not directly used in our SNP discovery algorithm, though SNP discovery does use the same underlying features (intensity ratios and local conformance).

Origin of 'N's in the genotyping data

Our algorithms use strict criteria in order to ensure an accuracy of 99% for calls made at polymorphic sites. Bases that do not meet these criteria are labeled 'N'. While relaxing our criteria would allow us to make correct calls for most of these 'N's, we prefer to maintain an extremely high accuracy in favor of an increased call rate. In some regions the 'N's may be clustered. The presence of clustered 'N's may be due to either experimental or biological reasons. Experimental causes include poor PCR amplification or hybridization conditions, while biological reasons include the presence of deletions in the strain of interest compared with the reference strain C57BL/6L.

References

[1] Frazer K, et al. Genome Research. 2004 Aug;14(8):1493-500.
[2] Smit, AFA, Hubley, R & Green, P. RepeatMasker Open-3.0. 1996-2004 http://www.repeatmasker.org
[3] Ewing B, Green P. Genome Research. 1998 Mar;8(3):186-94.
[4] Waterston R et al. Nature. 2002 Dec 5;420(6915): 520-562
[5] Lander E and Waterman M. Genomics. 1988 Apr;2(8): 231-239
[6] Kirkness E. Science. 2003 Sep26;301(5641): 1898-1903
[7] Bonfield J and Staden R. Bioinformatics. 2002 Jan;18(1):3-10.