Phase 2 Overview
Background
The initial phase of this project involved resequencing 15 inbred mouse strains, which led to the discovery of approximately 8.27 million SNPs across the genome (using C57BL/6J as the reference strain). The SNP and genotype data, as well as the sequences of the long-range PCR primer pairs used for SNP discovery, are available for download on this website and can also be viewed using a genome browser, a haplotype browser, and an ancestry viewer.
The second phase of the project involves imputing the genotypes of the same 8.27 million SNPs in an additional 40 strains. These imputed data are now available for download, and can also be viewed in the genome and haplotype browsers. The imputation work was done in collaboration with Eleazar Eskin (UCLA) and Hyun Min Kang (UCSD).
Method for Imputation of 8.27 million SNPs
The 8.27 million genotypes were imputed in the 40 strains using experimentally determined genotypes for ~150,000 SNPs: 138,793 SNPs genotyped at the Broad Institute and 7,577 genotyped at Perlegen. The genotypes were inferred from the reference haplotypes of the 15 resequenced strains plus the reference strain C57BL/6J using a hidden Markov model (HMM). At each genotype position, the HMM is in one of 17 states, which represent the 16 reference strains and one unknown reference strain. The transition probabilities are defined by a standard recombination model, and the mutation probability is defined as the probability that the SNP allele differs from the allele on the reference haplotype. For the unknown-reference-strain state, the observation probability is always 0.5.
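The model described above can be illustrated with a small forward-backward sketch. This is not the project's actual code: the function name impute, the 0/1 allele encoding, the use of -1 for untyped positions, the uniform switch probability standing in for the "standard recombination model", and the default parameter values are all assumptions made for illustration. In the project the transition and mutation parameters were trained from the genotyped SNPs rather than fixed by hand.

```python
import numpy as np

def impute(ref_haps, observed, switch_prob=0.01, mutation_prob=0.01):
    """Illustrative HMM imputation (hypothetical helper, not the project's code).

    ref_haps: (K, L) array of 0/1 alleles for K reference strains.
    observed: length-L array with 0/1 at genotyped SNPs and -1 where untyped.
    Returns (calls, confidence); a call of -1 means 'missing'.
    """
    ref_haps = np.asarray(ref_haps)
    observed = np.asarray(observed)
    K, L = ref_haps.shape
    S = K + 1  # the last state is the unknown reference strain

    # Emission probabilities: untyped positions carry no information (1.0).
    emit = np.ones((L, S))
    typed = observed >= 0
    match = ref_haps[:, typed].T == observed[typed][:, None]
    emit[typed, :K] = np.where(match, 1.0 - mutation_prob, mutation_prob)
    emit[typed, K] = 0.5  # unknown strain: observation probability fixed at 0.5

    # Assumed transition model: stay with 1 - switch_prob, else switch uniformly.
    trans = np.full((S, S), switch_prob / (S - 1))
    np.fill_diagonal(trans, 1.0 - switch_prob)

    # Scaled forward pass.
    fwd = np.zeros((L, S))
    fwd[0] = emit[0] / S
    fwd[0] /= fwd[0].sum()
    for i in range(1, L):
        fwd[i] = (fwd[i - 1] @ trans) * emit[i]
        fwd[i] /= fwd[i].sum()

    # Scaled backward pass.
    bwd = np.ones((L, S))
    for i in range(L - 2, -1, -1):
        bwd[i] = trans @ (emit[i + 1] * bwd[i + 1])
        bwd[i] /= bwd[i].sum()

    # Posterior probability of each state at each position.
    post = fwd * bwd
    post /= post.sum(axis=1, keepdims=True)

    # Probability of allele 1: reference states vote with their own allele,
    # the unknown state contributes 0.5 to each allele.
    p1 = (post[:, :K] * ref_haps.T).sum(axis=1) + 0.5 * post[:, K]
    calls = (p1 >= 0.5).astype(int)
    confidence = np.maximum(p1, 1.0 - p1)
    calls[post[:, K] > 0.5] = -1  # unknown strain most likely: report missing
    return calls, confidence
```

For example, with two reference haplotypes np.zeros((2, 10), dtype=int) and np.ones((1, 10), dtype=int) stacked into ref_haps, and an observed vector that is 0 at every second position and -1 elsewhere, the untyped positions are filled in from the matching all-0 haplotype.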
The transition and mutation parameters were trained using the ~150,000 genotyped SNPs and then applied to impute the remaining genotypes among the 8.27 million SNPs for each of the 40 strains. When the posterior probability that the inferred state is the 'unknown reference strain' exceeds 0.5, the imputed genotype is reported as 'missing'. With this procedure, the average accuracy of genotype imputation for the 12 classical inbred strains, estimated by leave-one-out cross-validation, was 97.9%. Each imputed genotype has a confidence score, defined as the posterior probability of the imputed genotype. When the confidence threshold is set to 0.9, the average imputation accuracy for the 12 classical inbred strains increases to 98.6%. It should be noted that, for the wild-derived strains, the accuracy of the imputation algorithm is well below 90%.
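The effect of a confidence threshold on accuracy can be sketched as follows. The function name accuracy_at_threshold, the use of None for 'missing' calls, and the choice to exclude missing calls from the accuracy denominator are assumptions for illustration; the text above does not specify exactly how the reported accuracies were computed. The genotypes in the usage example are made-up toy values, not project data.

```python
def accuracy_at_threshold(calls, confidence, truth, threshold=0.0):
    """Return (accuracy, call_rate) over imputed genotypes that are retained.

    calls: imputed alleles, with None marking a 'missing' call.
    confidence: posterior probability of each imputed genotype.
    truth: experimentally determined alleles held out for evaluation.
    A genotype is retained if it is not missing and its confidence
    is at least `threshold` (assumed evaluation rule).
    """
    kept = [
        (call, true)
        for call, score, true in zip(calls, confidence, truth)
        if call is not None and score >= threshold
    ]
    if not kept:
        return float("nan"), 0.0
    correct = sum(call == true for call, true in kept)
    return correct / len(kept), len(kept) / len(calls)


# Toy example: raising the threshold drops the low-confidence error.
calls = ["A", "G", "A", None, "G"]
confidence = [0.99, 0.95, 0.60, 0.40, 0.92]
truth = ["A", "G", "G", "A", "G"]
print(accuracy_at_threshold(calls, confidence, truth))        # (0.75, 0.8)
print(accuracy_at_threshold(calls, confidence, truth, 0.9))   # (1.0, 0.6)
```

The trade-off shown here is the usual one: a stricter threshold raises accuracy among retained genotypes at the cost of a lower call rate.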
Please contact us with any questions.
References
[1] Patil, N. et al. Blocks of limited haplotype diversity revealed by high-resolution scanning of human chromosome 21. Science 294, 1719-23 (2001).
[2] Frazer, K.A. et al. A sequence-based variation map of 8.27 million SNPs in inbred mouse strains. Nature 448, 1050-3 (2007).

