A demo dataset was imputed with IMPUTE5 using the Han deep sequencing reference panel. The imputed data comprise 44 individuals and 28,996,189 variants. The results can be downloaded in bplink format.
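After downloading the demo results, the sample and variant counts can be checked against the figures above. The following is a minimal sketch, assuming that "bplink" refers to a PLINK binary fileset (.bed/.bim/.fam) in which the .fam file has one line per individual and the .bim file has one line per variant. The function names and the file prefix are hypothetical and are not part of the service.

```python
def count_records(path):
    """Count the non-empty lines of a text file."""
    with open(path, "r") as handle:
        return sum(1 for line in handle if line.strip())


def plink_counts(prefix):
    """Return (n_individuals, n_variants) for a PLINK binary fileset.

    Assumption: PREFIX.fam has one line per individual and
    PREFIX.bim has one line per variant.
    """
    prefix = str(prefix)
    return count_records(prefix + ".fam"), count_records(prefix + ".bim")
```

For example, `plink_counts("demo_imputed")` (a hypothetical prefix) would be expected to return 44 individuals and 28,996,189 variants if the download matches the description.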

Summary:

Genotype imputation, or simply imputation in the context of our database, estimates unobserved genotypes and fills in the missing genotypes of a given dataset. Our imputation service is designed to meet three kinds of imputation requests:
  • achieving the best imputation results for Han population data, using reference panels based on our NGS datasets of Han Chinese;
  • carrying out classical imputation tasks with public reference panels of global populations;
  • estimating and replacing the missing genotypes in users' own data.

Our imputation service is implemented with commonly used tools: SHAPEIT4, IMPUTE5, Minimac4, Beagle5, and PBWT (only Beagle5 and PBWT can impute genotypes without a reference panel). There are 12 reference panels available in our imputation service. In PGG Han 2.0, the Han deep sequencing panel and six Han regional substructure population panels are newly introduced. Currently, the imputation function is limited to biallelic SNV data.
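Because only biallelic SNVs are supported, it may help to pre-filter input data before submission. The sketch below is one possible way to do this for uncompressed VCF text, assuming the standard tab-separated VCF layout with REF in column 4 and ALT in column 5. The helper names are hypothetical, and this is not necessarily the filter applied by the service itself.

```python
def is_biallelic_snv(ref, alt):
    """Return True if REF/ALT describe a biallelic single-nucleotide variant."""
    bases = {"A", "C", "G", "T"}
    ref, alt = ref.upper(), alt.upper()
    # Multiallelic sites list several ALT alleles separated by commas,
    # and indels have alleles longer than one base; both fail this test.
    return ref in bases and alt in bases and ref != alt


def filter_biallelic_snvs(vcf_lines):
    """Yield header lines unchanged and only the biallelic SNV records."""
    for line in vcf_lines:
        if line.startswith("#"):
            yield line
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 5:
            continue  # skip malformed records
        if is_biallelic_snv(fields[3], fields[4]):
            yield line
```

The generator can be applied to an open file handle, for example `filter_biallelic_snvs(open("input.vcf"))`, where `input.vcf` is a placeholder file name.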

Software and references:

SHAPEIT4
Delaneau, O., Zagury, J.-F., Robinson, M.R., Marchini, J., and Dermitzakis, E. (2018). Integrative haplotype estimation with sub-linear complexity. BioRxiv 493403.

IMPUTE5
Rubinacci, S., Delaneau, O., and Marchini, J. (2020). Genotype imputation using the Positional Burrows Wheeler Transform. PLoS Genetics 16(11), e1009049. doi: 10.1371/journal.pgen.1009049.

Minimac4
Das, S., Forer, L., Schönherr, S., Sidore, C., Locke, A.E., Kwong, A., Vrieze, S.I., Chew, E.Y., Levy, S., McGue, M., et al. (2016). Next-generation genotype imputation service and methods. Nature Genetics 48, 1284–1287.

Beagle5.1
Browning, B.L., Zhou, Y., and Browning, S.R. (2018). A One-Penny Imputed Genome from Next-Generation Reference Panels. American Journal of Human Genetics 103, 338–348.

PBWT
Durbin, R. (2014). Efficient haplotype matching and storage using the positional Burrows-Wheeler transform (PBWT). Bioinformatics 30, 1266–1272.

Reference panels:

1KG
Reference panel from the SHAPEIT2 and IMPUTE2 websites, based on 1000 Genomes Phase 3 data (26 global populations, 2,504 individuals, 81,706,022 variants).

CONVERGE
Reference panel based on the CONVERGE dataset, retaining only the sites that passed the filter recommended by the authors of the paper “11,670 whole genome sequences representative of the Han Chinese population from the CONVERGE project” (10,640 Han females, 5,814,870 variants).

HRC
Reference panel of the Haplotype Reference Consortium (HRC) release 1.1 (22,691 individuals for chromosome 1, 27,165 individuals for the other chromosomes, 39,131,578 variants).

SGDP
Reference panel based on the FermiKit variant calls for the SGDP dataset (“The Simons Genome Diversity Project: 300 genomes from 142 diverse populations”) from https://github.com/lh3/sgdp-fermi (263 individuals, 29,543,030 variants).

Han3K
Reference panel of highly representative core Han Chinese genomes covering all six genetic substructure regions of Han Chinese, selected from the previous Han100K (3,057 individuals, 8,056,973 variants).

Han deep sequencing
Reference panel based on the Han Zhongyuan dataset (1,025 individuals, 28,996,189 variants).

Subgroup Central China Han
Reference panel of highly representative core Han Chinese genomes covering Subgroup Central China Han, selected from the previous Han100K (3,057 individuals, 8,056,973 variants).

Subgroup Northeast Han
Reference panel of highly representative core Han Chinese genomes covering Subgroup Northeast Han, selected from the previous Han100K (3,057 individuals, 8,056,973 variants).

Subgroup Northwest Han
Reference panel of highly representative core Han Chinese genomes covering Subgroup Northwest Han, selected from the previous Han100K (1,000 individuals, 8,056,973 variants).

Subgroup Southeast Han
Reference panel of highly representative core Han Chinese genomes covering Subgroup Southeast Han, selected from the previous Han100K (1,000 individuals, 8,056,973 variants).

Subgroup Southwest Han
Reference panel of highly representative core Han Chinese genomes covering Subgroup Southwest Han, selected from the previous Han100K (1,000 individuals, 8,056,973 variants).

Subgroup Southcoast Han
Reference panel of highly representative core Han Chinese genomes covering Subgroup Southcoast Han, selected from the previous Han100K (1,000 individuals, 8,056,973 variants).