Summary of the data sets included in the Han Chinese Genomes Database

Population samples

In the Han Chinese Genomes Database 2.0 (PGG.Han 2.0), we deliberately selected and included only individuals of Han Chinese ancestry, specifically to support imputation and GWAS in the Han Chinese population. All participants who contributed DNA to the project were 18 years of age or older. Ancestry was determined from self-reported data together with each participant's genetic profile. The current release of PGG.Han 2.0 contains data from 126,590 individuals in total: whole-genome sequencing data of 19,002 samples, whole-exome sequencing data of 5,002 samples, and genome-wide single nucleotide polymorphism (SNP) data of 102,586 samples. With more than 100K Han Chinese samples that were whole-genome sequenced at high coverage (~30 – 80×) or low coverage (~1.7 – 4×), or genotyped or imputed for high-density genome-wide SNPs, we constructed the first and largest reference panel specific to the Han Chinese population, known as Han100K. See the description below and Table 1 for more detailed information.

Available Data

In total, around 20,000 Han Chinese individuals whose whole genomes were sequenced with next-generation sequencing are archived in PGG.Han. They comprise four deep-sequenced data sets (~30 – 80×, with one at ~13.7×; n=7,124) and two low-pass data sets (~1.7×, n=11,670; and ~4×, n=208). Of the 7,124 deep-sequenced samples, 4,156 were collected from 30 administrative divisions representing 28 dialects across China and sequenced to high coverage (~30 – 60×) by the Population Genomics Group (PGG); 90 (overlapping with CHB and CHS in the 1000 Genomes Project) were sequenced by BGI-Shenzhen at high depth (~80×) (Lan et al., 2017); 2,780 were sequenced at a median depth of ~13.7× in a previous study (Wu et al., 2019); and another 98 were sequenced to high coverage (~30×) in a previous study (Sung et al., 2012).

Whole genome low-pass sequencing data

The genomes of 11,670 Han Chinese women were sequenced by the CONVERGE project as a control group for the investigation of major depressive disorders (Chiang et al., 2017). Although sequenced at much lower coverage (~1.7×), this data set provides a catalog of 25,057,223 variants for the Han Chinese population and is also included in PGG.Han.

Genome-wide SNP data

The high-density genome-wide SNP genotype data of 102,586 samples were contributed by the collaborators of PGG.Han (see the list of collaborators). The data were collected from previous GWAS projects, and only non-patient (control) samples were retained. Using all the whole-genome sequencing data of Han Chinese samples as reference data, the 102,586 samples were carefully imputed, so that the final data set contains 8,056,973 genome-wide SNPs.

The fine-scale genetic structure of the Han Chinese population

The 100K Han Chinese individuals form a cluster distinct from surrounding groups, including minority groups in China and populations in neighboring countries, suggesting that Han Chinese people share a common identity in terms of overall genetic make-up. On the other hand, genetic differentiation within the Han Chinese population is discernible, and six subgroups are identified: Northwest Han (NWH), Northeast Han (NEH), Central China Han (CCH), Southwest Han (SWH), Southeast Han (SEH), and South coast Han (SCH). Overall, the main difference is between northern and southern subgroups, consistent with previous observations (Chen et al., 2009; Qin et al., 2014; Xu et al., 2009), while southern subgroups show higher divergence than northern subgroups. We further screened nested panels of ancestry informative markers (AIMs) for detecting population structure and controlling population stratification in association studies. We provide high-quality individual-level variant data for future genotype imputation, population genetics analysis, and association studies. The reference panel provided here is expected to greatly facilitate current and future GWAS and candidate-gene-based association studies in Han Chinese and related Asian populations.
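As an illustration of how candidate AIMs can be ranked, the sketch below computes a simple two-population FST for a biallelic SNP from allele frequencies and sorts SNPs by it. This is one common textbook approach, not necessarily the screening procedure used for PGG.Han; the function names, SNP names, and allele frequencies are all hypothetical and chosen only to make the example run.

```python
def heterozygosity(p):
    """Expected heterozygosity of a biallelic SNP with allele frequency p."""
    return 2.0 * p * (1.0 - p)


def fst_two_pops(p1, p2):
    """Simple FST = (H_T - H_S) / H_T for two equally weighted populations."""
    p_bar = (p1 + p2) / 2
    h_t = heterozygosity(p_bar)  # heterozygosity of the pooled population
    if h_t == 0:
        return 0.0  # monomorphic SNP carries no ancestry information
    h_s = (heterozygosity(p1) + heterozygosity(p2)) / 2  # mean within-population value
    return (h_t - h_s) / h_t


# Hypothetical allele frequencies in a northern and a southern subgroup.
HYPOTHETICAL_SNPS = {
    "snp_a": (0.10, 0.90),
    "snp_b": (0.45, 0.55),
    "snp_c": (0.30, 0.30),
}

# SNPs with the largest FST differ most between the two subgroups.
ranked = sorted(
    HYPOTHETICAL_SNPS,
    key=lambda name: fst_two_pops(*HYPOTHETICAL_SNPS[name]),
    reverse=True,
)
print(ranked)
```

SNPs near the top of such a ranking differ most in frequency between the two groups, which is the property that makes a marker informative about ancestry.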

Table 1. Summary of the data sets included in PGG.Han

Data Source | Data Type | Sample Size | Sequencing Coverage | Geographical Coverage | Language Coverage | References
The Han Chinese Genomes Project | Genotyping | 102,586 | NA | 34 divisions | 28 dialects | PGG.Han
The Han Chinese Genomes Project | WGS | 4,156 | ~30 – 60× | 30 divisions | 28 dialects | PGG.Han
The Han Chinese Genomes Project | WES | 5,002 | ~80× | 34 divisions | 28 dialects | PGG.Han
CONVERGE | WGS | 11,670 | ~1.7× | 24 divisions | NA | (Chiang et al., 2018)
The 1000 Genomes Project | WGS | 208 | ~4× | NA | NA | (The-1000-Genomes-Consortium et al., 2015)
HCC samples | WGS | 98 | ~30× | Southern China | NA | (Sung et al., 2012)
BGI-Shenzhen | WGS | 90 | ~80× | NA | NA | (Lan et al., 2017)
SG10K | WGS | 2,780 | ~13.7× | NA | NA | (Wu et al., 2019)

WGS: Whole Genome Sequencing; WES: Whole Exome Sequencing; NA: Not available.
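The sample sizes in Table 1 can be cross-checked against the totals quoted in the text with a short tally. The sketch below transcribes the table rows by hand; the variable and function names are illustrative and are not part of any PGG.Han tool.

```python
# Sample sizes transcribed from Table 1: (data source, data type, sample size).
TABLE_1 = [
    ("The Han Chinese Genomes Project", "Genotyping", 102586),
    ("The Han Chinese Genomes Project", "WGS", 4156),
    ("The Han Chinese Genomes Project", "WES", 5002),
    ("CONVERGE", "WGS", 11670),
    ("The 1000 Genomes Project", "WGS", 208),
    ("HCC samples", "WGS", 98),
    ("BGI-Shenzhen", "WGS", 90),
    ("SG10K", "WGS", 2780),
]


def totals_by_type(rows):
    """Sum sample sizes per data type."""
    totals = {}
    for _source, data_type, n in rows:
        totals[data_type] = totals.get(data_type, 0) + n
    return totals


totals = totals_by_type(TABLE_1)
grand_total = sum(totals.values())

print(totals)       # {'Genotyping': 102586, 'WGS': 19002, 'WES': 5002}
print(grand_total)  # 126590
```

The per-type sums match the 19,002 whole-genome, 5,002 whole-exome, and 102,586 genotyped samples reported for the current release, and together they give the stated total of 126,590 individuals.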


  • Chen, J., Zheng, H., Bei, J.X., Sun, L., Jia, W.H., Li, T., Zhang, F., Seielstad, M., Zeng, Y.X., Zhang, X., et al. (2009). Genetic structure of the Han Chinese population revealed by genome-wide SNP variation. Am J Hum Genet 85, 775-785.
  • Chiang, C.W.K., Mangul, S., Robles, C., and Sankararaman, S. (2018). A Comprehensive Map of Genetic Variation in the World's Largest Ethnic Group-Han Chinese. Mol Biol Evol 35, 2736-2750.
  • Chiang, C.W.K., Mangul, S., Robles, C.R., Kretzschmar, W.W., Cai, N., Kendler, K.S., Sankararaman, S., and Flint, J. (2017). A comprehensive map of genetic variation in the world's largest ethnic group - Han Chinese. bioRxiv (preprint).
  • Lan, T., Lin, H., Zhu, W., Laurent, T., Yang, M., Liu, X., Wang, J., Wang, J., Yang, H., Xu, X., et al. (2017). Deep whole-genome sequencing of 90 Han Chinese genomes. Gigascience 6, 1-7.
  • Qin, P., Li, Z., Jin, W., Lu, D., Lou, H., Shen, J., Jin, L., Shi, Y., and Xu, S. (2014). A panel of ancestry informative markers to estimate and correct potential effects of population stratification in Han Chinese. Eur J Hum Genet 22, 248-253.
  • Sung, W.K., Zheng, H., Li, S., Chen, R., Liu, X., Li, Y., Lee, N.P., Lee, W.H., Ariyaratne, P.N., Tennakoon, C., et al. (2012). Genome-wide survey of recurrent HBV integration in hepatocellular carcinoma. Nat Genet 44, 765-769.
  • The-1000-Genomes-Consortium, Auton, A., Brooks, L.D., Durbin, R.M., Garrison, E.P., Kang, H.M., Korbel, J.O., Marchini, J.L., McCarthy, S., McVean, G.A., et al. (2015). A global reference for human genetic variation. Nature 526, 68-74.
  • Xu, S., Yin, X., Li, S., Jin, W., Lou, H., Yang, L., Gong, X., Wang, H., Shen, Y., Pan, X., et al. (2009). Genomic dissection of population substructure of Han Chinese and its implication in association studies. Am J Hum Genet 85, 762-774.
  • Wu, D., Dou, J., Chai, X., Bellis, C., Wilm, A., Shih, C.C., Soon, W.W.J., Bertin, N., Lin, C.B., Khor, C.C., et al. (2019). Large-scale whole-genome sequencing of three diverse Asian populations in Singapore. Cell 179, 736-749.e15.