Mouse SNP and Genotype Data Download
Overview
Perlegen's SNP, genotype (empirical and imputed), haplotype, trace, and PCR primer data has been compiled with NCBI Mouse Build information to produce data files for public use. This data, grouped by chromosome, is available here as flat files for download. SNP and genotype positions have been mapped from their original reference coordinates to NCBI Mouse Build 37 coordinates (see Data Release History).
Note that the C57BL6/J strain was not selected for re-sequencing, as this data would have been almost entirely redundant with the NCBI reference sequence. Because we did not actually determine genotypes for C57BL6/J, we did not submit genotypes for this strain to dbSNP. However, implicit genotypes for C57BL6/J can be obtained from the reference sequence at each SNP position (the reference allele is the first allele in the ALLELES column).
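As a minimal sketch of that rule, the Python function below derives an implicit C57BL6/J genotype from an alleles value such as G/A. The function name is hypothetical, and the sketch assumes that the inbred reference strain can be treated as homozygous and written in the same two-letter form used for the other strains.

```python
def implicit_reference_genotype(alleles: str) -> str:
    """Return the implicit diploid genotype of the reference strain.

    Assumes `alleles` looks like "G/A" with the reference allele first,
    and that the reference strain is homozygous (an assumption, not
    something the data files state explicitly).
    """
    reference_allele = alleles.split("/")[0].strip().upper()
    if reference_allele not in {"A", "C", "G", "T"}:
        raise ValueError(f"unexpected reference allele in {alleles!r}")
    return reference_allele * 2


print(implicit_reference_genotype("G/A"))  # GG
```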
The data is available for download in two compressed file formats: PC ".zip" files and gzip-compressed ".gz" files. Tools to uncompress both formats are available on most platforms (PC, Mac, Unix-like, etc.), but PC users typically use the .zip files and users of Unix-like operating systems typically use the .gz files.
Once uncompressed, the files are quite large: there are typically over 100,000 rows of data per chromosome. Files of this size are too large for desktop applications such as Excel or Notepad and are probably better loaded into a database application such as FileMaker, Oracle, MySQL, or Microsoft Access. For specific instructions on loading data into a database, please consult the application's documentation. The files are stored as plain text: each column is delimited by a tab character, each row is delimited by a newline character, and the first row contains the column names.
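As a rough sketch of loading one of these files into a database, the Python function below streams a .gz file into SQLite using only the standard library. SQLite is used here purely for convenience and is not one of the applications named above. The function name, table name, and paths are hypothetical, and every column is stored as text.

```python
import csv
import gzip
import sqlite3


def load_tsv_gz(path: str, db_path: str, table: str) -> int:
    """Load a gzip-compressed, tab-delimited file into a SQLite table.

    The first row is taken as the column names, as described above.
    Returns the number of data rows inserted.
    """
    with gzip.open(path, "rt", newline="") as handle:
        reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
        header = next(reader)
        # Quote identifiers: strain column names contain "/" and spaces.
        quoted = ['"{}"'.format(name.replace('"', '""')) for name in header]
        columns = ", ".join(f"{name} TEXT" for name in quoted)
        placeholders = ", ".join("?" for _ in header)
        connection = sqlite3.connect(db_path)
        try:
            with connection:  # commit on success, roll back on error
                connection.execute(
                    f'CREATE TABLE IF NOT EXISTS "{table}" ({columns})'
                )
                cursor = connection.executemany(
                    f'INSERT INTO "{table}" VALUES ({placeholders})',
                    (row for row in reader if row),  # skip blank lines
                )
                return cursor.rowcount
        finally:
            connection.close()
```

A call might look like `load_tsv_gz("b04_ChrXX_genotype.dat.gz", "mouse.db", "genotype")`, where the ".gz" suffix on the file name and the database path are assumptions. Rows are streamed rather than read into memory, which matters at this file size.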
Genotype & SNP Data
To download genotype, imputed genotype, or SNP data, right-click on each appropriate link and choose "Save As". Flanking sequence of 200 base pairs (100 base pairs on each of the 5' and 3' sides of each SNP) is available inside the SNP download file. Depending on your connection speed, some downloads may take minutes or hours to complete. For example, with a 56K modem it may take as much as 8 hours to download the entire data set. The "Genotype Data" files contain empirically derived genotypes for the 15 original strains used for SNP discovery (Phase 1 of the project), while the "Imputed Genotype Data" files contain imputed genotypes for an additional 40 strains (Phase 2 of the project).
Primer Data
To download primer data, right-click on the appropriate link and choose "Save As". Unmapped primers refer to the 3776 primers that do not map to NCBI genome Build 37.
| Type | Primer Data (Description) | File Size |
|---|---|---|
| Mapped | (Save as .gz) (Save as .zip) | 8MB |
| Unmapped | (Save as .gz) (Save as .zip) | 85KB |
Trace Data
To download sequence trace mappings, right-click on each appropriate link and choose "Save As".
Haplotype Block Data
To download the chromosomal locations of each haplotype block, right-click on the appropriate link and choose "Save As".
| Haplotype Block | Data (Description) | File Size |
|---|---|---|
| All Haplotype Blocks | (Save as .gz) (Save as .zip) | 316KB |
| Haplotype Blocks 5KB | (Save as .gz) (Save as .zip) | 22KB |
Data File Descriptions
Genotype Data File Description
The files b04_ChrXX_genotype.dat contain the diploid genotypes for SNPs in each of the 15 individual strains. Each file holds the genotypes of every strain for one chromosome. [ Get data ]
| Column Name | Description |
|---|---|
| local_identifier | Perlegen internal SNP identifier. Matches the submitter_ID in dbSNP. |
| SS_ID | NCBI Assay ID (ss#). The ID assigned to the SNP by dbSNP at submission time. |
| chromosome | Chromosome of the NCBI Build 37 contig on which the best alignment was found. The "ChrM" files contain data from the mitochondrion, and the "ChrUn" files contain data mapped to NCBI contigs that are not assigned to a chromosome. Data that could not be mapped to any NCBI Build 37 contig can be found in the files labeled "unmapped." |
| accession_num | The accession number from NCBI Build 37 of the contig to which the SNP aligns. |
| position | The nucleotide position in NCBI Build 37 contig of the reference base in the alignment. |
| strand | The orientation of the reported SNP flanking sequences, alleles, and genotypes against the NCBI Build 37 sequence. |
| alleles | The nucleotide code for the alleles of this SNP. The first allele is the reference allele of the C57BL6/J strain and the second allele is the alternate allele discovered. For example, G/A. |
| 129S1/SvImJ, CAST/EiJ, BTBR T+ tf/J, A/J, MOLF/EiJ, KK/HlJ, AKR/J, PWD/PhJ, NZW/LacJ, BALB/cByJ, WSB/EiJ, C3H/HeJ, DBA/2J, FVB/NJ, NOD/LtJ (one column per strain) | The nucleotide codes for the two alleles found at this position in each strain. Nucleotide codes can be A, G, T, C, N for unknown, and "-" for strains for which genotypes were not attempted. Expect to see "AA", "GG", "TT", "CC", "NN", or "--" in each column. |
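
To illustrate how the strain columns relate to the alleles column, here is a small Python sketch that labels each strain's genotype as reference, alternate, unknown, or not attempted. The function name, the label strings, and the sample row are invented for illustration. The exact header spelling in the files and the form of the strand value ("+" here) are assumptions.

```python
import csv
import io

# Non-strain columns, named as in the table above (exact casing is an assumption).
FIXED_COLUMNS = {
    "local_identifier", "SS_ID", "chromosome", "accession_num",
    "position", "strand", "alleles",
}


def classify_strains(row):
    """Label each strain's genotype relative to the row's two alleles."""
    reference, alternate = row["alleles"].split("/")
    labels = {}
    for column, genotype in row.items():
        if column in FIXED_COLUMNS:
            continue
        if genotype == reference * 2:
            labels[column] = "reference"
        elif genotype == alternate * 2:
            labels[column] = "alternate"
        elif genotype == "NN":
            labels[column] = "unknown"
        elif genotype == "--":
            labels[column] = "not_attempted"
        else:
            labels[column] = "other"  # anything outside the documented codes
    return labels


# Invented sample (not real data) with three of the strain columns.
SAMPLE = (
    "local_identifier\tSS_ID\tchromosome\taccession_num\tposition\tstrand\talleles"
    "\tA/J\tAKR/J\tDBA/2J\n"
    "example_snp\tss0\t1\tNT_EXAMPLE\t12345\t+\tG/A\tGG\tAA\tNN\n"
)

for sample_row in csv.DictReader(io.StringIO(SAMPLE), delimiter="\t"):
    print(classify_strains(sample_row))
```

Because alleles and genotypes are reported in the same orientation (see the strand column), the comparison needs no strand conversion.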
SNP Data File Description
The files b04_ChrXX_snp.dat have the following information for SNPs that were identified as being polymorphic in the 15 strains genotyped. Each file represents the SNPs discovered from all strains by chromosome. [ Get data ]
| Column Name | Description |
|---|---|
| local_identifier | Perlegen internal SNP identifier. Matches the submitter_ID in dbSNP. |
| SS_ID | NCBI Assay ID (ss#). The ID assigned to the SNP by dbSNP at submission time. |
| chromosome | Chromosome of the NCBI Build 37 contig on which the best alignment was found. The "ChrM" files contain data from the mitochondrion, and the "ChrUn" files contain data mapped to NCBI contigs that are not assigned to a chromosome. Data that could not be mapped to any NCBI Build 37 contig can be found in the files labeled "unmapped." |
| accession_num | The accession number from NCBI Build 37 of the contig to which the SNP aligns. |
| position | The nucleotide position in NCBI Build 37 contig of the reference base in the alignment. |
| strand | The orientation of the reported SNP flanking sequences, alleles, and genotypes against the NCBI Build 37 sequence. |
| alleles | The nucleotide code for the alleles of this SNP. The first allele is the reference allele of the C57BL6/J strain and the second allele is the alternate allele discovered. For example, G/A. |
| five_prime_flank | The 100 base pairs from the original reference sequence that flank the SNP on the 5' end. |
| three_prime_flank | The 100 base pairs from the original reference sequence that flank the SNP on the 3' end. |
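
The flank and allele columns can be combined to rebuild the sequence around a SNP. The following Python sketch shows one way to do so. Both helper names are hypothetical, and the example uses shortened, invented flanks rather than real 100-base sequences. It assumes the flanks and alleles share one orientation, as the strand column description suggests.

```python
def allele_contexts(five_prime_flank, alleles, three_prime_flank):
    """Return one full sequence per allele: 5' flank + allele + 3' flank."""
    return {
        allele: five_prime_flank + allele + three_prime_flank
        for allele in alleles.split("/")
    }


def bracketed_context(five_prime_flank, alleles, three_prime_flank):
    """Return the SNP in bracket notation, e.g. ACGT[G/A]TTCA."""
    return f"{five_prime_flank}[{alleles}]{three_prime_flank}"


# Shortened, invented flanks for readability; the real columns hold 100 bases each.
print(bracketed_context("ACGT", "G/A", "TTCA"))  # ACGT[G/A]TTCA
print(allele_contexts("ACGT", "G/A", "TTCA"))
```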
Primer Data File Description
The b04_primer_pair.dat file has the following information for primer pairs. [ Get data ]
| Column Name | Description |
|---|---|
| primer_pair_id | Perlegen internal primer identifier (PSMP = Perlegen Sciences Mouse Primer). |
| chromosome | Chromosome of the NCBI Build 37 contig on which the best alignment was found. Data that could not be mapped to any NCBI Build 37 contig can be found in the file labeled "b04_primer_pairs_unmapped." |
| accession_num | The accession number from NCBI Build 37 of the contig to which the primers align. |
| amplicon_start | The nucleotide position in NCBI Build 37 contig of the start of the primer pair in the alignment. |
| amplicon_end | The nucleotide position in NCBI Build 37 contig of the end of the primer pair in the alignment. |
| forward_sequence | The forward primer sequence. |
| reverse_sequence | The reverse primer sequence. |
| strand | The orientation of the primer pair against the NCBI Build 37 sequence. |
| working_status | One if the primer pair amplified successfully, zero if it failed. |
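
As a small, hedged example of working with these columns, the Python sketch below keeps only primer pairs that amplified successfully and computes an amplicon length. The helper names and sample rows are invented. Treating amplicon_start and amplicon_end as inclusive coordinates is an assumption, not something stated in the file description.

```python
def amplicon_length(row):
    """Amplicon length in bases, treating both coordinates as inclusive (an assumption)."""
    start, end = int(row["amplicon_start"]), int(row["amplicon_end"])
    return abs(end - start) + 1


def working_primer_pairs(rows):
    """Keep only primer pairs whose working_status is 1."""
    return [row for row in rows if str(row["working_status"]).strip() == "1"]


# Invented rows for illustration; identifiers and coordinates are not real data.
example_rows = [
    {"primer_pair_id": "PSMP_example_1", "amplicon_start": "1000",
     "amplicon_end": "1499", "working_status": "1"},
    {"primer_pair_id": "PSMP_example_2", "amplicon_start": "2000",
     "amplicon_end": "2249", "working_status": "0"},
]

for pair in working_primer_pairs(example_rows):
    print(pair["primer_pair_id"], amplicon_length(pair))  # PSMP_example_1 500
```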
Trace Mapping File Description
The b04_ChrXX_frag.dat file has the following information for each fragment. [ Get data ]
| Column Name | Description |
|---|---|
| frag_id | Perlegen internal fragment identifier. |
| chromosome | Chromosome of the NCBI Build 37 contig on which the best alignment was found. |
| accession_num | The accession number from NCBI Build 37 of the contig to which the trace aligns. |
| min_pos | The nucleotide position in NCBI Build 37 contig of the start of the trace in the alignment. |
| max_pos | The nucleotide position in NCBI Build 37 contig of the end of the trace in the alignment. |
| strand | The orientation of the trace against the NCBI Build 37 sequence. |
| sample_name | The mouse strain used to generate the trace. |
| a_strand | The name of the forward sequence trace. |
| z_strand | The name of the reverse sequence trace. |
Haplotype Block File Description
The block_summary.dat file contains the following columns. [ Get data ]
| Column Name | Description |
|---|---|
| CHROMOSOME | The chromosome that contains the haplotype block. |
| BLOCK_START_BP | The absolute position of the start of the haplotype block on the chromosome. |
| BLOCK_END_BP | The absolute position of the end of the haplotype block on the chromosome. |
| BLOCK_LENGTH_BP | The length of the haplotype block, in bases. |
| NUM_SNPS_IN_BLOCK | The number of SNPs contained within the haplotype block. |
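
One common use of these columns is to look up the haplotype block that contains a given position. The Python sketch below shows a simple linear search plus a SNP-density helper. The function names and the sample block are invented. The sketch assumes that block boundaries are inclusive and that chromosome values are compared as plain strings.

```python
def find_block(blocks, chromosome, position):
    """Return the first block on `chromosome` whose span contains `position`, else None."""
    for block in blocks:
        start = int(block["BLOCK_START_BP"])
        end = int(block["BLOCK_END_BP"])
        # Inclusive bounds are an assumption about the file's convention.
        if block["CHROMOSOME"] == chromosome and start <= position <= end:
            return block
    return None


def snps_per_kb(block):
    """SNP density of a block, using the reported length and SNP count."""
    return 1000 * int(block["NUM_SNPS_IN_BLOCK"]) / int(block["BLOCK_LENGTH_BP"])


# Invented block for illustration; the values are not real data.
example_blocks = [
    {"CHROMOSOME": "1", "BLOCK_START_BP": "1000", "BLOCK_END_BP": "5999",
     "BLOCK_LENGTH_BP": "5000", "NUM_SNPS_IN_BLOCK": "25"},
]

hit = find_block(example_blocks, "1", 2500)
print(hit is not None, snps_per_kb(example_blocks[0]))  # True 5.0
```

A linear scan is adequate for a file of a few hundred kilobytes. Sorting blocks by start position and using binary search would be a natural refinement for repeated lookups.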

