Summary of the data sets included in the Han Chinese Genomes Database

Population samples

In the Han Chinese Genomes Database 2.0 (PGG.Han 2.0), we deliberately selected and included only individuals of Han Chinese ancestry, with the specific aim of supporting imputation and GWAS in the Han Chinese population. All participants who contributed DNA to the project were 18 years of age or older. Ancestry was determined from self-reported data together with each participant's genetic profile. The current release of PGG.Han 2.0 contains data from 126,590 individuals in total: whole-genome sequencing data of 19,002 samples, whole-exome sequencing data of 5,002 samples, and genome-wide single nucleotide polymorphism (SNP) data of 102,586 samples. With more than 100K Han Chinese samples whole-genome sequenced at high coverage (~30 – 80×) or low coverage (~1.7 – 4×), or genotyped for high-density genome-wide SNPs or imputed, we constructed the first and largest reference panel specific to the Han Chinese population, known as Han100K. See the description below and Table 1 for more detailed information.
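The headline total can be checked directly against the three per-type counts quoted above. The following is a minimal sketch; the dictionary keys are descriptive labels chosen here, not identifiers used by the database.

```python
# Sample counts quoted in the paragraph above (PGG.Han 2.0 release).
counts = {
    "whole-genome sequencing": 19_002,
    "whole-exome sequencing": 5_002,
    "genome-wide SNP genotyping": 102_586,
}

total = sum(counts.values())

# Print each data type with its share of the release, then the total.
for data_type, n in counts.items():
    print(f"{data_type:28s} {n:>8,d}  ({n / total:.1%})")
print(f"{'total':28s} {total:>8,d}")
```

The three counts add up to the 126,590 individuals stated for the release.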

Available Data

In total, 19,002 Han Chinese individuals with whole genomes sequenced using next-generation sequencing are archived in PGG.Han. These comprise four data sets sequenced at medium to high depth (~13.7 – 80×, n=7,124) and two low-pass data sets (~1.7×, n=11,670; and ~4×, n=208). Of the 7,124 samples in the first category, 4,156 samples from 30 administrative divisions representing 28 dialects across China were collected and sequenced to high coverage (~30 – 60×) by the Population Genomics Group (PGG); 90 samples (overlapping with CHB and CHS in the 1000 Genomes Project) were sequenced by BGI-Shenzhen at high depth (~80×) (Lan et al., 2017); 2,780 samples were sequenced at medium depth (~13.7×) and obtained from a previous study (Wu et al., 2019); and another 98 samples were sequenced to high coverage (~30×) and obtained from a previous study (Sung et al., 2012).

Whole genome low-pass sequencing data

The genomes of 11,670 Han Chinese women were sequenced by the CONVERGE project as a control group for the investigation of major depressive disorder (Chiang et al., 2018). Although sequenced at a much lower coverage (~1.7×), this data set provides a catalog of 25,057,223 variants for the Han Chinese population and is also included in PGG.Han.

Genome-wide SNP data

The high-density genome-wide SNP genotype data of 102,586 samples were contributed by the collaborators of PGG.Han (see https://www.biosino.org/pgghan2/about for the list of collaborators). The samples were collected from previous GWAS projects, and only non-patient (control) individuals were retained. Using all the whole-genome sequencing data of Han Chinese samples as the reference, the 102,586 samples were carefully imputed, so that the final data set contains 8,056,973 genome-wide SNPs.

The fine-scale genetic structure of the Han Chinese population

The 100K Han Chinese individuals formed a cluster distinct from the surrounding groups, including minority groups in China and populations in neighboring countries, suggesting that the Han Chinese share a coherent identity in terms of overall genetic make-up. On the other hand, genetic differentiation within the Han Chinese population is discernible, and six subgroups were identified: Northwest Han (NWH), Northeast Han (NEH), Central China Han (CCH), Southwest Han (SWH), Southeast Han (SEH), and South Coast Han (SCH). Overall, the main difference is between northern and southern subgroups, consistent with previous observations (Chen et al., 2009; Qin et al., 2014; Xu et al., 2009), while southern subgroups show higher divergence than northern subgroups. We further screened nested panels of ancestry informative markers (AIMs) for detecting population structure and controlling population stratification in association studies. We provide high-quality individual-level variant data for future genotype imputation, population genetics analysis, and association studies. The reference panel provided here is expected to greatly facilitate current and future GWAS or candidate-gene-based association studies in Han Chinese and related Asian populations.
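The text does not say how the AIM panels were screened. As an illustration only, one common first-pass criterion is the absolute allele-frequency difference (delta) between two groups. The sketch below ranks SNPs by delta and keeps the top few; the SNP names, frequencies, and the `rank_by_delta` helper are all hypothetical and are not taken from PGG.Han.

```python
# Hypothetical allele frequencies (made up for illustration, not PGG.Han data):
# SNP id -> (frequency in a northern group, frequency in a southern group).
freqs = {
    "snp_a": (0.10, 0.55),
    "snp_b": (0.48, 0.52),
    "snp_c": (0.80, 0.30),
    "snp_d": (0.25, 0.20),
}


def rank_by_delta(freqs, top_n):
    """Return the top_n SNP ids ordered by |p_north - p_south|, largest first."""
    deltas = {snp: abs(p1 - p2) for snp, (p1, p2) in freqs.items()}
    return sorted(deltas, key=deltas.get, reverse=True)[:top_n]


# A 2-SNP panel; a larger top_n yields a panel that contains this one.
panel = rank_by_delta(freqs, top_n=2)
print(panel)
```

Because the panel is always a prefix of the same ranking, panels of increasing size built this way are nested by construction, which is one simple way to read "nested" in the paragraph above.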

Table 1. Summary of the data sets included in PGG.Han

Data Source | Data Type | Sample Size | Sequencing Coverage | Geographical Coverage | Language Coverage | References
The Han Chinese Genomes Project | Genotyping | 102586 | NA | 34 divisions | 28 dialects | PGG.Han
The Han Chinese Genomes Project | WGS | 4156 | ~30 – 60× | 30 divisions | 28 dialects | PGG.Han
HuaBiao Project | WES | 5002 | >100× | 31 divisions | 5 dialects | (Hao et al., 2021)
CONVERGE | WGS | 11670 | ~1.7× | 24 divisions | NA | (Chiang et al., 2018)
The 1000 Genomes Project | WGS | 208 | ~4× | NA | NA | (The-1000-Genomes-Consortium et al., 2015)
HCC samples | WGS | 98 | ~30× | Southern China | NA | (Sung et al., 2012)
BGI-Shenzhen | WGS | 90 | ~80× | NA | NA | (Lan et al., 2017)
SG10K | WGS | 2780 | ~13.7× | NA | NA | (Wu et al., 2019)
NyuWa | WGS | 2999 | ~26.2× | 23 | NA | (Zhang et al., 2021)

WGS: Whole Genome Sequencing; WES: Whole Exome Sequencing; NA: Not available.
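As a cross-check between Table 1 and the counts quoted in the text, the sample sizes can be tallied by data type. This is a minimal sketch with the rows transcribed by hand from the table; the variable names are chosen here for illustration.

```python
from collections import defaultdict

# Rows transcribed from Table 1: (data source, data type, sample size).
rows = [
    ("The Han Chinese Genomes Project", "Genotyping", 102_586),
    ("The Han Chinese Genomes Project", "WGS", 4_156),
    ("HuaBiao Project", "WES", 5_002),
    ("CONVERGE", "WGS", 11_670),
    ("The 1000 Genomes Project", "WGS", 208),
    ("HCC samples", "WGS", 98),
    ("BGI-Shenzhen", "WGS", 90),
    ("SG10K", "WGS", 2_780),
    ("NyuWa", "WGS", 2_999),
]

# Sum sample sizes per data type.
by_type = defaultdict(int)
for _source, data_type, n in rows:
    by_type[data_type] += n

# WGS total with the NyuWa row left out, for comparison with the text.
wgs_without_nyuwa = by_type["WGS"] - 2_999

for data_type, n in by_type.items():
    print(f"{data_type:11s} {n:>8,d}")
print(f"WGS excluding NyuWa {wgs_without_nyuwa:,d}")
```

The WGS rows sum to 22,001 with NyuWa and to 19,002 without it; the latter equals the whole-genome sequencing count given in the text. Whether the NyuWa samples are deliberately excluded from that figure is not stated here, so this should be read as an arithmetic observation only.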

References

  • Chen, J., et al., Genetic structure of the Han Chinese population revealed by genome-wide SNP variation. Am J Hum Genet, 2009. 85(6): p. 775-85.
  • Chiang, C.W.K., et al., A Comprehensive Map of Genetic Variation in the World's Largest Ethnic Group-Han Chinese. Mol Biol Evol, 2018. 35(11): p. 2736-2750.
  • Lan, T., et al., Deep whole-genome sequencing of 90 Han Chinese genomes. Gigascience, 2017. 6(9): p. 1-7.
  • Qin, P., et al., A panel of ancestry informative markers to estimate and correct potential effects of population stratification in Han Chinese. Eur J Hum Genet, 2014. 22(2): p. 248-53.
  • Sung, W.K., et al., Genome-wide survey of recurrent HBV integration in hepatocellular carcinoma. Nat Genet, 2012. 44(7): p. 765-9.
  • The 1000 Genomes Project Consortium, A global reference for human genetic variation. Nature, 2015. 526(7571): p. 68-74.
  • Xu, S., et al., Genomic dissection of population substructure of Han Chinese and its implication in association studies. Am J Hum Genet, 2009. 85(6): p. 762-74.
  • Wu, D., et al., Large-Scale Whole-Genome Sequencing of Three Diverse Asian Populations in Singapore. Cell, 2019. 179(3): p. 736-749 e15.
  • Gao, Y., et al., PGG.Han: the Han Chinese genome database and analysis platform. Nucleic Acids Res, 2020. 48(D1): p. D971-D976.
  • Hao, M., et al., The HuaBiao project: whole-exome sequencing of 5000 Han Chinese individuals. J Genet Genomics, 2021.
  • Zhang, P., et al., NyuWa Genome resource: A deep whole-genome sequencing-based variation profile and reference panel for the Chinese population. Cell Rep, 2021. 37(7): p. 110017.

Research citing our database

  • Zhang, P. et al. NyuWa Genome resource: A deep whole-genome sequencing-based variation profile and reference panel for the Chinese population. Cell Rep 37, 110017, doi:10.1016/j.celrep.2021.110017 (2021).
  • Yin, Q. & Flegel, W. A. DEL in China: the D antigen among serologic RhD-negative individuals. J Transl Med 19, 439, doi:10.1186/s12967-021-03116-6 (2021).
  • She, X., Xiao, H., Lu, S. & Guo, L. Association of Interleukin-1alpha Functional Polymorphism with Risk of Chronic Periodontitis in Han Chinese Population. Genet Res (Camb) 2021, 6614835, doi:10.1155/2021/6614835 (2021).
  • Mboowa, G., Sserwadda, I. & Aruhomukama, D. Genomics and bioinformatics capacity in Africa: no continent is left behind. Genome 64, 503-513, doi:10.1139/gen-2020-0013 (2021).
  • Liu, Y. N., Li, N., Zhu, X. & Qi, Y. How wide is the application of genetic big data in biomedicine. Biomed Pharmacother 133, 111074, doi:10.1016/j.biopha.2020.111074 (2021).
  • Liu, Y. et al. Combined Low-/High-Density Modern and Ancient Genome-Wide Data Document Genomic Admixture History of High-Altitude East Asians. Front Genet 12, 582357, doi:10.3389/fgene.2021.582357 (2021).
  • Jiang, Q. et al. RET compound inheritance in Chinese patients with Hirschsprung disease: lack of penetrance from insufficient gene dysfunction. Hum Genet 140, 813-825, doi:10.1007/s00439-020-02247-y (2021).
  • Hao, M. et al. The HuaBiao project: whole-exome sequencing of 5000 Han Chinese individuals. J Genet Genomics, doi:10.1016/j.jgg.2021.07.013 (2021).
  • Dehghani, N., Bras, J. & Guerreiro, R. How understudied populations have contributed to our understanding of Alzheimer's disease genetics. Brain 144, 1067-1081, doi:10.1093/brain/awab028 (2021).
  • Barbitoff, Y. A. et al. Expanding the Russian allele frequency reference via cross-laboratory data integration: insights from 6,096 exome samples. medRxiv (2021).
  • Yunjia, W. et al. Rare coding variants in axonemal dynein heavy chain genes are associated with adolescent idiopathic scoliosis. Research Square, doi:10.21203/rs.3.rs-121210/v1 (2020).
  • Yin, C. et al. Genetic Reconstruction and Forensic Analysis of Chinese Shandong and Yunnan Han Populations by Co-Analyzing Y Chromosomal STRs and SNPs. Genes (Basel) 11, doi:10.3390/genes11070743 (2020).
  • Rigden, D. J. & Fernandez, X. M. The 27th annual Nucleic Acids Research database issue and molecular biology database collection. Nucleic Acids Res 48, D1-D8, doi:10.1093/nar/gkz1161 (2020).
  • Pan, Z. & Xu, S. Population genomics of East Asian ethnic groups. Hereditas 157, 49, doi:10.1186/s41065-020-00162-w (2020).
  • National Genomics Data Center, M. & Partners. Database Resources of the National Genomics Data Center in 2020. Nucleic Acids Res 48, D24-D33, doi:10.1093/nar/gkz913 (2020).
  • Mulindwa, J. et al. High Levels of Genetic Diversity within Nilo-Saharan Populations: Implications for Human Adaptation. Am J Hum Genet 107, 473-486, doi:10.1016/j.ajhg.2020.07.007 (2020).
  • Li, L. et al. Population genetic analysis of Shaanxi male Han Chinese population reveals genetic differentiation and homogenization of East Asians. Mol Genet Genomic Med 8, e1209, doi:10.1002/mgg3.1209 (2020).
  • Kuo, F. H. et al. Migraine as a Risk Factor for Peripheral Artery Occlusive Disease: A Population-Based Cohort Study. Int J Environ Res Public Health 17, doi:10.3390/ijerph17228549 (2020).
  • Khan, S. Y. et al. Whole genome sequencing data of multiple individuals of Pakistani descent. Sci Data 7, 350, doi:10.1038/s41597-020-00664-2 (2020).
  • He, G.-L. et al. Fine-scale north-to-south genetic admixture profile in Shaanxi Han Chinese revealed by genome-wide demographic history reconstruction. Journal of Systematics and Evolution, doi:10.1111/jse.12715 (2020).
  • Eldon, B. Evolutionary Genomics of High Fecundity. Annu Rev Genet 54, 213-236, doi:10.1146/annurev-genet-021920-095932 (2020).
  • Chong, A. S., Dominic, N. A., Arasoo, J. & Cheang, H. L. Impact of Ethnicity on the Presentation of Hyperandrogenism in Polycystic Ovarian Syndrome: A Review. Pan Asian J Obs Gyn 3, 125-140 (2020).
  • Chen, Q. et al. Rare deleterious BUB1B variants induce premature ovarian insufficiency and early menopause. Hum Mol Genet 29, 2698-2707, doi:10.1093/hmg/ddaa153 (2020).