Data Submissions
The SNPs, genotypes, and trace files were submitted to NCBI, and are available on NCBI's website.
dbSNP Submissions
SNPs generated from the NIEHS and Perlegen collaborative project were deposited in dbSNP on a regular basis under the submitter handle "PERLEGEN." SNP details can be retrieved using Entrez SNP. The SNPs and genotypes identified through the project are also available for download by chromosome on the Download Data page of this website, where the SNPs are mapped to NCBI Mouse Build 37.
The DNA used for sequencing was derived from multiple individual mice from each strain. All mice were male and were obtained from The Jackson Laboratory. We report Y chromosome SNP genotypes as haploid (e.g., A) and X chromosome SNP genotypes as homozygous diploid (e.g., AA).
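Because the two sex chromosomes use different genotype notations, downstream code may want to reduce both to a single allele. The following is a minimal sketch of one way to do that; the function name `single_allele` is hypothetical and is not part of the downloadable data or any tool on this site.

```python
def single_allele(genotype):
    """Reduce a reported genotype string to a single allele.

    Accepts the haploid form used for Y chromosome SNPs (e.g. "A") and the
    homozygous diploid form used for X chromosome SNPs (e.g. "AA").
    Anything else is rejected, since this sketch assumes male samples only.
    """
    if len(genotype) == 1:
        return genotype
    if len(genotype) == 2 and genotype[0] == genotype[1]:
        return genotype[0]
    raise ValueError(f"unexpected genotype for a hemizygous locus: {genotype!r}")


# Both notations reduce to the same allele.
print(single_allele("A"))
print(single_allele("AA"))
```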
Note that each submission to dbSNP and the NCBI Trace Archive represents a
particular assay of a variation (in the case of dbSNP, a Submitted SNP (ss)
record). In a minority of cases, a single SNP was assayed more than once.
For example, in some instances resequencing fragments were intentionally overlapped to span junctions
between finished sequence assayed earlier in the project and finished sequence
assayed later in the project. In other cases, there was an assembly error in an early
build of the reference sequence involving duplication of a sequence segment that
was later identified as deriving from a single locus. As with all dbSNP
submissions, those from the project are clustered by dbSNP into RefSNP (rs) records
which are intended to provide unique identifiers for distinct variant positions
on the latest genome build. An example of this is seen at
http://www.ncbi.nlm.nih.gov/projects/SNP/snp_ref.cgi?rs=31879501.
The dbSNP ss_id values provided with the data on this site provide a persistent
link to the data deposited in dbSNP even if future reclustering at NCBI results
in assigning the ss_id to a different RefSNP cluster ID (rsID).
Trace Archive Submissions
Trace File Representation
The processed resequencing data have been deposited at the NCBI Trace Archive. To allow submission to the Trace Archive, a tool was developed that automatically converts the array data into SCF or ZTR trace file format [1] using the Staden trace file IO library. The trace files can be viewed using conventional tools for dideoxy sequence traces such as Trev, available at http://staden.sourceforge.net/.
The result is a trace file for each contiguous fragment sequenced in the study, containing the intensity measurements for A, C, T, and G for each base, the called sequence, and a quality score for each base call.
Each trace file represents data for one contiguous fragment of tiled sequence in one orientation. The trace amplitude data consist of mean fluorescence intensity measurements for each feature on the array, with values ranging from 0 to 4095. Unlike a conventional sequencing trace, there is a one-to-one correspondence between the trace amplitudes and sequence positions. The called sequence consists of the brightest of the four nucleotide probes for each position in the reference sequence. Data for the reverse tiling are reverse complemented before the trace files are generated, so that the forward and reverse reads are both reported for the "+" strand of the reference sequence. The trace files contain all the experimental information required to apply our SNP discovery algorithm. Due to round-off of the intensity data in the traces, results obtained from the trace files may differ slightly from those obtained from the original array data.
Confidence values are computed using an algorithm similar to Phred [2] that considers the relative intensities of the brightest and next brightest probes, and the consistency of surrounding base calls with the expected reference sequence. Due to experimental variation between individual hybridization experiments, the quality scores for a scan may not be perfectly calibrated. As with dideoxy sequencing quality scores, a reported score is -10 times the base-10 logarithm of the estimated error rate, so a score of 20 corresponds to an error rate of 0.01, a score of 30 to 0.001, and so on. The scoring algorithm derives a decision tree for estimating error rates for individual calls based on the input metrics. Since these trees have a limited number of nodes, only a limited set of discrete scores is actually possible. Groups of calls with the same reported quality score could not be distinguished on the basis of the input data. The quality scores are not directly used in Perlegen's SNP discovery algorithm, though SNP discovery does use the same underlying features (intensity ratios and local conformance).
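The relationships described above (brightest-probe base calling, reverse complementing of the reverse tiling, and the log-scaled quality score) can be sketched in a few lines of Python. This is an illustration only and not Perlegen's code: the function names and the example intensities are hypothetical, and tie-breaking between equally bright probes is an arbitrary choice made for the sketch.

```python
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def call_base(intensities):
    """Return the nucleotide whose probe has the highest mean intensity.

    `intensities` maps each of A, C, G, T to a value in the range 0 to 4095.
    Ties are resolved alphabetically (an assumption of this sketch).
    """
    return max(sorted(intensities), key=lambda base: intensities[base])


def error_rate(quality):
    """Convert a Phred-style quality score to an estimated error rate."""
    return 10 ** (-quality / 10)


def reverse_complement(sequence):
    """Report a reverse-tiling read on the "+" strand of the reference."""
    return "".join(COMPLEMENT[base] for base in reversed(sequence))


# Hypothetical intensities for the four probes at one reference position.
probe_intensities = {"A": 310, "C": 3890, "G": 150, "T": 420}
print(call_base(probe_intensities))
print(error_rate(20), error_rate(30))
print(reverse_complement("AACG"))
```

Reading the intensities from an actual SCF or ZTR file requires a trace file parser, which is outside the scope of this sketch.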
Trace File Metadata
In addition to the basic experimental data, called sequence, and quality scores, each trace also carries descriptive information, the structure of which is specified by the NCBI Trace Archive. The following table explains how to interpret some of these fields, and should supplement the Trace Archive documentation.
| Column Name | Description |
|---|---|
| TRACE_NAME | Unique identifier for this trace, composed by concatenating the TEMPLATE_ID and TRACE_END. |
| TEMPLATE_ID | Uniquely identifies a pair of traces for forward and reverse tilings of the same sequence interval from the same scan: composed from the RUN_GROUP_ID, the scan date, and a code identifying the interval of tiled sequence. |
| TRACE_END | The orientation of the tiled fragment for this trace ('F' for forward or 'R' for reverse). |
| SUBSPECIES_ID | The strain name for the DNA sample used in this experiment. |
| RUN_GROUP_ID | An identifier that groups together all traces from the same scanned image, corresponding to a single GeneChip DAT file, and a single analysis run. |
| PREP_GROUP_ID | Groups together all scans from a single hybridization experiment, i.e., a single physical array. For wafer-scale hybridizations, many scans are made to cover an entire wafer, and a wafer may be hybridized with several samples using different fluorophores. |
| CHIP_DESIGN_ID or FEATURE_ID_FILE_NAME | Identifies the chip design for the array covered by this RUN_GROUP_ID. |
| REFERENCE_ACCESSION | NCBI GenBank accession for the source sequence used to design the array for this tiled interval. |
| REFERENCE_OFFSET | Position in the GenBank sequence corresponding to the first tiled base in this trace file. |
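
As a rough illustration of how these fields fit together, the sketch below builds a TRACE_NAME from its parts and maps a position within a trace back to the GenBank source sequence. It rests on assumptions that the table does not settle: that the concatenation uses no separator, that trace positions are counted from 0, and that REFERENCE_OFFSET uses the same coordinate convention as the GenBank record. The function names and example values are hypothetical.

```python
def trace_name(template_id, trace_end):
    """Compose TRACE_NAME from TEMPLATE_ID and TRACE_END.

    Assumes plain concatenation with no separator character.
    """
    if trace_end not in ("F", "R"):
        raise ValueError("TRACE_END must be 'F' or 'R'")
    return f"{template_id}{trace_end}"


def reference_position(reference_offset, trace_index):
    """Map a 0-based position within a trace to the GenBank sequence.

    Relies on the one-to-one correspondence between trace amplitudes and
    sequence positions, and on both orientations being reported for the
    "+" strand, so the same arithmetic is assumed to apply to 'F' and 'R'.
    """
    if trace_index < 0:
        raise ValueError("trace_index must be non-negative")
    return reference_offset + trace_index


# Hypothetical values, not taken from a real submission.
print(trace_name("EXAMPLE_TEMPLATE", "F"))
print(reference_position(12000, 25))
```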
Trace File Access
A limitation of the NCBI Trace Archive is that only a subset of trace file features are searchable through the web interface, and these features do not include REFERENCE_ACCESSION and REFERENCE_OFFSET. As a result, there is no direct way to identify and retrieve traces corresponding to a given genomic interval on the NCBI Trace Archive.
However, the Mouse Genome Browser on this site maps submitted trace files to the NCBI Mouse Build 37 reference genomic sequence. When viewing a specific region of the genome, the trace fragments appear as annotations in the Resequencing Data Track on the Browser. Clicking on one of these annotations opens a page that lists the submitted trace files for that fragment for each strain.
Note that due to minor differences in the mapping of tiled sequences between Builds 34-37, a very small number of SNPs may not have mapped trace files available.
References
- [1] Bonfield J, Staden R. Bioinformatics. 2002 Jan;18(1):3-10.
- [2] Ewing B, Green P. Genome Research. 1998 Mar;8(3):186-94.

