A genome-wide association study (GWAS) identifies genetic variants associated with a particular trait or disease by scanning genome-wide genetic markers (typically SNPs) across many samples.

Here, we provide a platform for GWAS analysis, together with multiple control groups from the Han Chinese population. Users only need to provide genotype data in binary PLINK format, covariate files, and phenotype files.

The entire pipeline is conducted in three steps:
  • Quality control. The genotype missing rate and the Hardy-Weinberg equilibrium p-value are calculated for each site using PLINK 1.9, and population structure is analyzed using flashPCA. Rather than filtering sites and samples directly, the platform provides lists of flagged sites and samples, and the user decides whether to proceed with filtering.
  • Association analysis. The association between phenotype(s) and genetic variants is analyzed using PLINK 1.9, with covariates both provided by the user and taken from the top 5 principal components. If the user-submitted data contain only case samples, a control group is provided by sampling controls that have the same ancestry as, and a close genetic distance to, the case samples.
  • Visualization and report. A Manhattan plot, a QQ plot, and plots of variant annotation are provided in the report. Association statistics for each site are written to a plain-text file that the user can download.
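To make the two per-site statistics from the quality-control step concrete, here is a minimal Python sketch. It is not the platform's code: the function names are hypothetical, genotypes are assumed to be coded as 0/1/2 allele counts with `None` for a missing call, and the Hardy-Weinberg p-value uses a 1-degree-of-freedom chi-square approximation, which may differ from the test PLINK 1.9 applies.

```python
import math

def missing_rate(genotypes):
    """Fraction of samples with no genotype call (None) at one site."""
    return sum(g is None for g in genotypes) / len(genotypes)

def hwe_chi2_pvalue(genotypes):
    """Chi-square (1 d.f.) Hardy-Weinberg p-value for one biallelic site.

    Genotypes are 0/1/2 counts of one allele; None marks a missing call.
    """
    called = [g for g in genotypes if g is not None]
    n = len(called)
    if n == 0:
        return float("nan")  # nothing to test at this site
    observed = (called.count(0), called.count(1), called.count(2))
    # Frequency of the allele carried by genotype 0 homozygotes.
    p = (2 * observed[0] + observed[1]) / (2 * n)
    q = 1.0 - p
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    if min(expected) == 0:
        return 1.0  # monomorphic site: no deviation can be measured
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # Survival function of a chi-square variable with 1 degree of freedom.
    return math.erfc(math.sqrt(chi2 / 2))

# Hypothetical site: 100 samples, all heterozygous (a strong HWE violation).
site = [1] * 100
p_value = hwe_chi2_pvalue(site)
```

A site would then be placed on the flagged list when its missing rate is above, or its p-value below, a threshold chosen by the user; the thresholds themselves are not specified here.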
File format for covariate and phenotype files:
  • The first line contains “FID” and “IID”, followed by the covariate/phenotype names. “FID” and “IID” stand for “family ID” and “individual ID”, respectively.
  • The remaining lines contain the family ID, individual ID, and covariate/phenotype values for each sample. For case/control studies, code controls as 1 and cases as 2. Control samples should be the same as those in the “fam” file.
  • Columns should be tab-separated.
  • Gender will automatically be used as a covariate if gender information is provided in the “fam” file.
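The layout described above can be produced and checked with a short Python sketch. The function names, sample IDs, and the phenotype name "disease" are hypothetical, and the checker assumes a pure case/control file in which every value must be 1 or 2; how the platform treats missing phenotype values is not stated here.

```python
import csv
import io

CONTROL, CASE = "1", "2"  # case/control coding: control=1, case=2

def write_table(handle, names, rows):
    """Write a tab-separated covariate/phenotype table with an FID/IID header."""
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    writer.writerow(["FID", "IID", *names])
    for row in rows:
        if len(row) != len(names) + 2:
            raise ValueError(f"expected {len(names) + 2} fields, got {len(row)}")
        writer.writerow(row)

def check_case_control(handle):
    """Validate the header and the 1/2 coding; return the number of samples."""
    reader = csv.reader(handle, delimiter="\t")
    header = next(reader)
    if header[:2] != ["FID", "IID"] or len(header) < 3:
        raise ValueError("header must be FID, IID, then at least one name")
    count = 0
    for row in reader:
        if len(row) != len(header):
            raise ValueError(f"wrong number of fields for sample {row[:2]}")
        if any(value not in (CONTROL, CASE) for value in row[2:]):
            raise ValueError(f"phenotype must be 1 or 2 for sample {row[:2]}")
        count += 1
    return count

# Hypothetical samples: one case (2) and one control (1).
buffer = io.StringIO()
write_table(buffer, ["disease"], [("F1", "S1", 2), ("F2", "S2", 1)])
buffer.seek(0)
n_samples = check_case_control(buffer)
```

Writing to a real file instead of the in-memory buffer only requires passing a handle opened with `open(path, "w", newline="")`.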

If your data set is extremely large (i.e., > 5 million markers and > 200 individuals), please contact us for help(