Summary of the data sets included in the Han Chinese Genomes Database

Population samples

The Han Chinese Genomes Database version 2.0 (PGG.Han 2.0) deliberately includes only individuals of Han Chinese ancestry and is intended particularly for imputation and GWAS in the Han Chinese population. All participants are 18 years of age or older and contributed DNA to the project. Ancestry is determined from self-reported data together with each participant's genetic profile. The current release of PGG.Han 2.0 holds data for 137,012 individuals in total: high-coverage WGS data (n = 17,546), low-coverage WGS data (n = 11,878), high-coverage WES data (n = 5,002), and genotyping or partially imputed data (n = 102,586). From these more than 100K Han Chinese samples, which were whole-genome sequenced at high coverage (~30–80×), medium coverage (~13.7×), or low coverage (~1.7×), whole-exome sequenced (>100×), or genotyped for high-density genome-wide SNPs or imputed, we constructed multiple reference panels specific to the Han Chinese population, also known as Han100K. See the description below and Table 1 for more detailed information.

Available Data

In total, nearly 30,000 Han Chinese individuals with whole genomes sequenced by next-generation sequencing are archived in PGG.Han, including a deep-sequencing data set (~30–80×, n = 11,767) and a low-pass sequencing data set (~1.7×, n = 11,878). The deep-sequencing data set comprises Han Chinese samples collected from 30 administrative divisions representing 28 dialects across China. A further 2,780 samples, sequenced at medium depth (~13.7×), were obtained from a previous study (Wu et al., 2019). Moreover, whole-exome data for 5,002 samples were obtained from a previous study (Hao et al., 2021).

Whole-genome low-pass sequencing data

The genomes of 11,878 Han Chinese women were sequenced by the CONVERGE project as a control group for the investigation of major depressive disorder (Chiang et al., 2018). Although sequenced at much lower coverage (~1.7×), this data set provides a catalog of 25,057,223 variants for the Han Chinese population and is also included in PGG.Han.

Genome-wide SNP data

The high-density genome-wide SNP genotype data of 102,586 samples were contributed by the collaborators of PGG.Han (see https://www.biosino.org/pgghan2/about for the list of collaborators). The samples were collected from previous GWAS projects, and only non-patient (control) samples are retained. Using all the whole-genome sequencing data of Han Chinese samples as the reference, the 102,586 samples were carefully imputed, and the final data set retains 8,056,973 genome-wide SNPs.

Whole exome sequencing

Whole-exome data for 5,002 samples, sequenced at high depth over the target regions (>100×), were provided by the HuaBiao project (Hao et al., 2021). After quality control, 121,773,740 biallelic SNVs remained.

The fine-scale genetic structure of the Han Chinese population

The 100K Han Chinese individuals form a cluster distinct from surrounding groups, including minority groups in China and populations in neighboring countries, which suggests that the Han Chinese are identifiable as a group by their overall genetic make-up. At the same time, genetic differentiation within the Han Chinese population is discernible, and six subgroups are identified: Northwest Han (NWH), Northeast Han (NEH), Central China Han (CCH), Southwest Han (SWH), Southeast Han (SEH), and South Coast Han (SCH) (Gao et al., 2020). The main difference is between northern and southern subgroups, consistent with previous observations (Chen et al., 2009; Qin et al., 2014; Xu et al., 2009), and the southern subgroups show higher divergence than the northern subgroups. We further screened nested panels of ancestry informative markers (AIMs) for detecting population structure and controlling population stratification in association studies. We provide high-quality individual-level variant data for future genotype imputation, population genetic analysis, and association studies. The reference panel provided here is expected to greatly facilitate current and future GWAS and candidate-gene-based association studies in Han Chinese and related Asian populations.

Table 1. Summary of the data sets included in PGG.Han

| Data Source | Data Type | Sample Size | Sequencing Coverage | Geographical Coverage | Language Coverage | References |
|---|---|---|---|---|---|---|
| The Han Chinese Genomes Project | Genotyping | 102,586 | NA | 34 divisions | 28 dialects | Xu et al., 2009; Gao et al., 2020 |
| The Han Chinese Genomes Project | WGS | 11,767 | ~30× | 30 divisions | 28 dialects | PGG.Han |
| HuaBiao Project | WES | 5,002 | >100× | 31 divisions | 5 dialects | Hao et al., 2021 |
| CONVERGE | WGS | 11,878 | ~1.7× | 24 divisions | NA | Chiang et al., 2018 |
| SG10K | WGS | 2,780 | ~13.7× | NA | NA | Wu et al., 2019 |
| NyuWa | WGS | 2,999 | ~26.2× | 23 divisions | NA | Zhang et al., 2021 |

WGS: Whole Genome Sequencing; WES: Whole Exome Sequencing; NA: Not available.
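The per-source sample sizes in Table 1 can be reconciled with the category totals quoted in the Population samples section. The short Python sketch below is only an illustrative cross-check, not part of the PGG.Han pipeline. It assumes that the "low-coverage WGS" category corresponds to CONVERGE alone and that every other WGS row counts as "high-coverage WGS"; that grouping is inferred from the numbers, not stated explicitly in the text.

```python
# Sample sizes transcribed from Table 1: (data source, data type, sample size).
TABLE_1 = [
    ("The Han Chinese Genomes Project", "Genotyping", 102_586),
    ("The Han Chinese Genomes Project", "WGS", 11_767),
    ("HuaBiao Project", "WES", 5_002),
    ("CONVERGE", "WGS", 11_878),
    ("SG10K", "WGS", 2_780),
    ("NyuWa", "WGS", 2_999),
]

# Assumption (not stated in the text): CONVERGE is the only low-coverage
# WGS source; all remaining WGS rows are grouped as high-coverage WGS.
LOW_COVERAGE_SOURCES = {"CONVERGE"}


def tally_by_category(rows):
    """Sum sample sizes per reporting category."""
    totals = {}
    for source, data_type, n in rows:
        if data_type == "WGS":
            if source in LOW_COVERAGE_SOURCES:
                category = "low-coverage WGS"
            else:
                category = "high-coverage WGS"
        else:
            category = data_type
        totals[category] = totals.get(category, 0) + n
    return totals


totals = tally_by_category(TABLE_1)
grand_total = sum(totals.values())

for category, n in sorted(totals.items()):
    print(f"{category}: {n:,}")
print(f"Total: {grand_total:,}")
```

Under this grouping, the tallies match the figures given in the text: 17,546 high-coverage WGS, 11,878 low-coverage WGS, 5,002 WES, and 102,586 genotyped samples, for 137,012 individuals in total.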

References

  • Chen, J., et al., Genetic structure of the Han Chinese population revealed by genome-wide SNP variation. Am J Hum Genet, 2009. 85(6): p. 775-85.
  • Chiang, C.W.K., et al., A Comprehensive Map of Genetic Variation in the World's Largest Ethnic Group-Han Chinese. Mol Biol Evol, 2018. 35(11): p. 2736-2750.
  • Lan, T., et al., Deep whole-genome sequencing of 90 Han Chinese genomes. Gigascience, 2017. 6(9): p. 1-7.
  • Qin, P., et al., A panel of ancestry informative markers to estimate and correct potential effects of population stratification in Han Chinese. Eur J Hum Genet, 2014. 22(2): p. 248-53.
  • Sung, W.K., et al., Genome-wide survey of recurrent HBV integration in hepatocellular carcinoma. Nat Genet, 2012. 44(7): p. 765-9.
  • 1000 Genomes Project Consortium, A global reference for human genetic variation. Nature, 2015. 526(7571): p. 68-74.
  • Xu, S., et al., Genomic dissection of population substructure of Han Chinese and its implication in association studies. Am J Hum Genet, 2009. 85(6): p. 762-74.
  • Wu, D., et al., Large-Scale Whole-Genome Sequencing of Three Diverse Asian Populations in Singapore. Cell, 2019. 179(3): p. 736-749.e15.
  • Gao, Y., et al., PGG.Han: the Han Chinese genome database and analysis platform. Nucleic Acids Res, 2020. 48(D1): p. D971-D976.
  • Hao, M., et al., The HuaBiao project: whole-exome sequencing of 5000 Han Chinese individuals. J Genet Genomics, 2021.
  • Zhang, P., et al., NyuWa Genome resource: A deep whole-genome sequencing-based variation profile and reference panel for the Chinese population. Cell Rep, 2021. 37(7): p. 110017.

Research citing our database

  • Zhang, P. et al. NyuWa Genome resource: A deep whole-genome sequencing-based variation profile and reference panel for the Chinese population. Cell Rep 37, 110017, doi:10.1016/j.celrep.2021.110017 (2021).
  • Yin, Q. & Flegel, W. A. DEL in China: the D antigen among serologic RhD-negative individuals. J Transl Med 19, 439, doi:10.1186/s12967-021-03116-6 (2021).
  • She, X., Xiao, H., Lu, S. & Guo, L. Association of Interleukin-1alpha Functional Polymorphism with Risk of Chronic Periodontitis in Han Chinese Population. Genet Res (Camb) 2021, 6614835, doi:10.1155/2021/6614835 (2021).
  • Mboowa, G., Sserwadda, I. & Aruhomukama, D. Genomics and bioinformatics capacity in Africa: no continent is left behind. Genome 64, 503-513, doi:10.1139/gen-2020-0013 (2021).
  • Liu, Y. N., Li, N., Zhu, X. & Qi, Y. How wide is the application of genetic big data in biomedicine. Biomed Pharmacother 133, 111074, doi:10.1016/j.biopha.2020.111074 (2021).
  • Liu, Y. et al. Combined Low-/High-Density Modern and Ancient Genome-Wide Data Document Genomic Admixture History of High-Altitude East Asians. Front Genet 12, 582357, doi:10.3389/fgene.2021.582357 (2021).
  • Jiang, Q. et al. RET compound inheritance in Chinese patients with Hirschsprung disease: lack of penetrance from insufficient gene dysfunction. Hum Genet 140, 813-825, doi:10.1007/s00439-020-02247-y (2021).
  • Hao, M. et al. The HuaBiao project: whole-exome sequencing of 5000 Han Chinese individuals. J Genet Genomics, doi:10.1016/j.jgg.2021.07.013 (2021).
  • Dehghani, N., Bras, J. & Guerreiro, R. How understudied populations have contributed to our understanding of Alzheimer's disease genetics. Brain 144, 1067-1081, doi:10.1093/brain/awab028 (2021).
  • Barbitoff, Y. A. et al. Expanding the Russian allele frequency reference via cross-laboratory data integration: insights from 6,096 exome samples. medRxiv (2021).
  • Yunjia, W. et al. Rare coding variants in axonemal dynein heavy chain genes are associated with adolescent idiopathic scoliosis. Research Square, doi:10.21203/rs.3.rs-121210/v1 (2020).
  • Yin, C. et al. Genetic Reconstruction and Forensic Analysis of Chinese Shandong and Yunnan Han Populations by Co-Analyzing Y Chromosomal STRs and SNPs. Genes (Basel) 11, doi:10.3390/genes11070743 (2020).
  • Rigden, D. J. & Fernandez, X. M. The 27th annual Nucleic Acids Research database issue and molecular biology database collection. Nucleic Acids Res 48, D1-D8, doi:10.1093/nar/gkz1161 (2020).
  • Pan, Z. & Xu, S. Population genomics of East Asian ethnic groups. Hereditas 157, 49, doi:10.1186/s41065-020-00162-w (2020).
  • National Genomics Data Center, M. & Partners. Database Resources of the National Genomics Data Center in 2020. Nucleic Acids Res 48, D24-D33, doi:10.1093/nar/gkz913 (2020).
  • Mulindwa, J. et al. High Levels of Genetic Diversity within Nilo-Saharan Populations: Implications for Human Adaptation. Am J Hum Genet 107, 473-486, doi:10.1016/j.ajhg.2020.07.007 (2020).
  • Li, L. et al. Population genetic analysis of Shaanxi male Han Chinese population reveals genetic differentiation and homogenization of East Asians. Mol Genet Genomic Med 8, e1209, doi:10.1002/mgg3.1209 (2020).
  • Kuo, F. H. et al. Migraine as a Risk Factor for Peripheral Artery Occlusive Disease: A Population-Based Cohort Study. Int J Environ Res Public Health 17, doi:10.3390/ijerph17228549 (2020).
  • Khan, S. Y. et al. Whole genome sequencing data of multiple individuals of Pakistani descent. Sci Data 7, 350, doi:10.1038/s41597-020-00664-2 (2020).
  • He, G.-L. et al. Fine-scale north-to-south genetic admixture profile in Shaanxi Han Chinese revealed by genome-wide demographic history reconstruction. Journal of Systematics and Evolution, doi:10.1111/jse.12715 (2020).
  • Eldon, B. Evolutionary Genomics of High Fecundity. Annu Rev Genet 54, 213-236, doi:10.1146/annurev-genet-021920-095932 (2020).
  • Chong, A. S., Dominic, N. A., Arasoo, J. & Cheang, H. L. Impact of Ethnicity on the Presentation of Hyperandrogenism in Polycystic Ovarian Syndrome: A Review. Pan Asian J Obs Gyn 3, 125-140 (2020).
  • Chen, Q. et al. Rare deleterious BUB1B variants induce premature ovarian insufficiency and early menopause. Hum Mol Genet 29, 2698-2707, doi:10.1093/hmg/ddaa153 (2020).