ISSN: 2378315X BBIJ
Biometrics & Biostatistics International Journal
Mini Review
Volume 2 Issue 4  2015
A Mini Review for Aggregating Analyses for GenomeWide Association Studies (GWAS)
Caiyan Li*
Takeda Pharmaceuticals, USA
Received: April 08, 2015  Published: April 29, 2015
*Corresponding author: Caiyan Li, Takeda Pharmaceuticals, USA, Email:
Citation: Caiyan Li (2014) A Mini Review for Aggregating Analyses for GenomeWide Association Studies (GWAS). Biom Biostat Int J 2(4): 00038. DOI: 10.15406/bbij.2015.02.00038
Univariate Analysis
Genomewide Association Studies (GWAS) has emerged as a popular tool for identifying common genetic variants for complex disease. A typical casecontrol GWAS often involves genotyping hundreds of thousands of Singlenucleotide polymorphism (SNPs), in thousands of disease cases and healthy controls, with the goal of identifying individual SNPs that are associated with the disease (or outcome). GWAS usually consists of a discovery phase, in which SNPs are identified followed by a validation phase, in which the SNPs identified in discovery phase are replicated in a separate study cohort. The standard GWAS method is univariate analysis, in which the phenotype is regressed onto each individual SNP to generate pvalue for association measurement. The SNPs are then ranked based on their univariate pvalues, and a threshold is set such that any SNP with a pvalue below the threshold is identified as significant. Multiplicity adjustment methods, such as bonferroni, false discovery rate (FDR) can be used to set the threshold.
Though the use of univariate analysis has helped to identify many diseasesusceptibility variants, and it is easy to comprehend and conduct, there are limitations which pose difficulty in some settings. Due to the large number of tested individual SNPs (e.g., 4 million SNPs), the adjusted threshold for genomewide significance can be extremely low, e.g. $\alpha ={10}^{7}$
[1], which is very difficult to attain. Secondly, univariateSNP analysis often has poor reproducibility. Many highlyranked SNPs identified in the discovery phase are false positives due to the low power for detecting SNPs with small effects. Moreover, rare variants (defined as alleles with a frequency of 1%5%) require a significant large sample size in order to be detected via standard univariate analysis. Since multivariate analysis has the ability to take into account of the correlation between multiple SNPs, as well as help reduce the multiplicity burden (e.g., lower the significance level), it could be advantageous to consider the joint effect of multiple SNPs together. To this end, researchers have established various methodologies to aggregate individual SNPs together as a set and then conduct association analyses, which has aided with the discovery of diseaserelated variants, especially on rare variants. Among these methods, there are two major categories: burden tests and kernelbased variance component tests. Below, we will review some popular methods within these two categories.
Burden Tests
Burden tests assess the cumulative effects of multiple variants within a genomic region by collapsing or summarizing the variants within a region by a single value, which is then used to test for association with the outcome of interest. Some popular burden tests are briefed below:
Cohort Allelic Sum Test (CAST)[2]: collapse genotypes across all variants within a specific region into a single dichotomous value for each subject. A subject is coded as 1 if a rare variant is present at any of the variant sites within the region; otherwise as 0. One variation of CAST involves collapsing by counting the number of rare variants within a specific region, instead of dichotomizing.
Combined Multivariate and Collapsing (CMC)[3]: The CMC method is a unified approach that combines collapsing and multivariate tests. SNPs within the region are divided into subgroups based on predefined criteria, such as allele frequencies, and within each subgroup CART was used to collapse genotypes. A multivariate test (e.g., Hotellings’stest) is then applied to the subgroups for association analysis.
Weighted Sum Statistic (WSS)[4]: The WSS method specifically considers the casecontrol setting and collapse SNPs within a set into a single weighted average of the number of rare alleles for each individual, and then apply Wilcoxon rank sum test to compare between case and control. Specifically, the weight for SNP i is defined as ${w}_{i}=\frac{1}{\sqrt{{n}_{i}{q}_{i}(1{q}_{i})}}$
where ${q}_{i}=\frac{{m}_{i}^{U}+1}{2{n}_{i}^{U}+2},{m}_{i}^{U}$
is the number of rare alleles observed for SNP i in the control group, ${n}_{i}^{U}$
is the number of control subjects genotypes for SNP i, and ${n}_{i}$
is the total number of subjects genotypes for SNP i. The weight is the inverse of the estimated standard deviation of the total number of rare alleles in the sample (including both cases and controls), under the null hypothesis of no frequency differences. It is used to downweight rare alleles in constructing the weightedsum score test.
One limitation for all these burden tests is that they implicitly assume that all rare variants influence the phenotype in the same direction and with the same magnitude of effect (after incorporating weights). However, biological mechanism leads us to believe that most variants within a region have little or no effect on phenotype; in addition some variants are protective and some are deleterious, and the magnitude of effects may vary as well. In these situations, employ the above burden tests may introduce substantial noise and largely reduce the power. Thus, researchers proposed methods that can account for the different direction and magnitude of effects for SNPs within a region.
Sequence kernel Association Test (SKAT)[5]
Kernelbased test methods, such as SKAT, are nonburden tests. Instead of aggregating variants, SKAT aggregates the associations between variants and the phenotype through a kernel matrix and can allow for SNPSNP interactions, i.e., epistatic effects. SKAT is a flexible, computationally efficient, regression based approach that tests for association between variants in a region (both common and rare) and a phenotype (dichotomous or continuous) while adjusting for covariates. It borrows information from correlated SNPs that are grouped on the basis of prior biological knowledge (e.g., gene or pathway) and hence produce results with improved reproducibility and power. In this review, we assume a populationbased casecontrol GWAS with n independent genotyped subjects. For a given SNPset containing p SNPs, let ${Z}_{i1},{Z}_{i2},{Z}_{i3},\mathrm{......},{Z}_{ip}$
be the genotype values for subject $i\left(i=1,\mathrm{...}n\right)$
. Let denote the casecontrol status for subject i (1 for case and 0 for control); ${X}_{i1},{X}_{i2},\mathrm{......},{X}_{im}$
denote the covariates (e.g., demographic and clinical variables) that we would like to adjust for. The goal is to test the global null of whether any of the p SNPs is related to the outcome while adjusting for covariates. In evaluating the significance of the joint effect of the SNPset, a logistic kernelmachine regression model is employed as follows:
$\mathrm{log}it\text{}P\left({y}_{i}=1\right)={\alpha}_{0}+{\alpha}_{1}{x}_{i1}+\dots +{\alpha}_{m}{x}_{im}+h({z}_{i1,}{z}_{i2,}\dots ,{z}_{ip})$
The SNPs influence the outcome through the general function h(.), which is an arbitrary function that has a form defined only by a positive semidefinite kernel function K(. , .). For subject i, we can fully define using the kernel function K(. , .) as: $h\left({z}_{i1,}{z}_{i2,}\dots ,{z}_{ip}\right)={\sum}_{i=1}^{n}{\gamma}_{{i}^{\text{'}}}K({Z}_{i},{Z}_{{i}^{\text{'}}})$
, for some ${\gamma}_{1},\dots ,{\gamma}_{n}$
. By choosing different kernel functions, we can specify different bases and corresponding models to assess the effects of SNPs. For example, if we define K(. , .) as the linear kernel such that $K\left({Z}_{i},{Z}_{{i}^{\text{'}}}\right)={\sum}_{j=1}^{p}{z}_{ij}{z}_{{i}^{\text{'}}j}$
then we are assuming the simple linear and logistic model defined by $\mathrm{log}it\text{}P\left({y}_{i}=1\right)={\alpha}_{0}+{\alpha}_{1}{x}_{i1}+\dots +{\alpha}_{m}{x}_{im}+{\beta}_{1}{z}_{i1}+\dots +{\beta}_{p}{z}_{ip}$
The outcome depends on the SNPs solely through the function h(.), thus, in order to test whether there is a true SPNset effect, the following null hypothesis is considered: ${H}_{0}:h\left(z\right)=0$
against the general alternative. A variancecomponent score test is developed to test the above hypothesis. To learn more about the technical details, please refer to references.
The choice of kernel changes the underlying base for the nonparametric function governing the relationship between the outcome and SNPs in the set. Essentially, K(. , .) is a function that projects the genotype data from the original space to another space and then h(.) is modeled linearly in this new space. More intuitively, $K\left({Z}_{i},{Z}_{{i}^{\text{'}}}\right)$
can be viewed as a function that measures the similarity between two individuals with regard to the genotypes. Some specific kernels that SKAT considers are weighted linear kernel, $K\left({Z}_{i},{Z}_{{i}^{\text{'}}}\right)={\sum}_{j=1}^{p}{w}_{j}{z}_{ij}{z}_{{i}^{\text{'}}j}$
which implies that the outcome depends on SNPs in a linear fashion; weighted quadratic kernel, $K\left({Z}_{i},{Z}_{{i}^{\text{'}}}\right)={(1+{\sum}_{j=1}^{p}{w}_{j}{z}_{ij}{z}_{{i}^{\text{'}}j})}^{2}$
which assumes that the model depends on the main effects and the quadratic terms of SNPs and first order of SNPbySNP interactions; weighted IBS (identicalbystate) kernel $K\left({Z}_{i},{Z}_{{i}^{\text{'}}}\right)={\sum}_{j=1}^{p}{w}_{j}IBS({z}_{ij},{z}_{{i}^{\text{'}}j})$
which defines similarity between two individuals as the number of alleles that share IBS. The weighted IBS kernel allows for epistatic effects because it does not assume linearity or interactions of any particular order (e.g., second order). Experiences suggest using the IBS kernel when the number of interacting SNPs within the region is modest. In all these kernels, is an allele specific weight that controls the relative importance of the SNP. Without prior information, SKAT suggests to use $\sqrt{{w}_{j}}=Beta(MA{F}_{j};1,25)$
, where $MA{F}_{j}$
is the minor allele frequency of the ${j}^{th}$
SNP. If prior information is available, weight can be selected to increase (or decrease) the weight for likely functionality.
Although SKAT makes few assumptions about rarevariants effects and provides attractive power, it has some limitations in certain settings. For example, in the setting that a large proportion of the rare variants in a given region are truly causal and influence the phenotype in the same direction, SKAT can be less powerful than burden tests. Thus, SKATO [6] is proposed, which automatically behaves like the burden test when the burden test is more powerful than SKAT, and behaves like SKAT when SKAT is more powerful than the burden test.
The SKAT research team has developed a powerful R package which implements SKAT and SKATO and their SmallSample Adjustments. The team also developed functions to calculate power and sample size to help design studies to evaluate SNPset effects. http://www.hsph.harvard.edu/~xlin/software.html
Discussion
In this mini review, we have briefly summarized threetypes of methodologies for analyzing SNP data: univariate analysis, burden tests and SKAT (SKATO). While univariate analysis lacks of power, it is the classic method that has been used in GWAS to help identify many diseaserelated SNPs. SKAT has the advantage of taking into account of correlation of a set of SNPs, thus improve power. SKATO is a combination of SKAT and burden tests. Personal experiences suggest carrying out both univariate analysis and SKAT (SKATO) while dealing with genomewide association studies. Deepsequencing will soon generate enormous genetic information for large disease samples and the need for statistical methods to analyze these big data is increasing. More research in this field to search for unified optimal approach which incorporates known biological information is needed.
References
 Wu MC1, Kraft P, Epstein MP, Taylor DM, Chanock SJ, et al. (2010) Powerful SNPSet Analysis for CaseControl Genomewide Association Studies. Am J Hum Genet 86(6): 929942.
 Morgenthaler S, Thilly WG (2007) A strategy to discover genes that carry multiallelic or monoallelic risk for common diseases: a cohort allelic sums test (CAST). Mutat Res 615(12): 2856.
 Madsen BE, Browning SR (2009) A groupwise association test for rare mutations using a weighted sum statistic. PLoS Genet 5(2): e1000384.
 Li B, Leal SM (2008) Methods for detecting associations with rare variants for common diseases: application to analysis of sequence data. Am J Hum Genet 83(3): 311321.
 Wu MC, Lee S, Cai T, Li Y, Boehnke M, Lin X (2011) Rarevariant association testing for sequencing data with the sequence kernel association test. Am J Hum Genet 89(1) 8293.
 Lee S, Emond MJ, Bamshad MJ, Barnes KC, Rieder MJ, et al. (2012) Optimal Unified Approach for RareVariant Association Testing with Application to SmallSample CaseControl WholeExome Sequencing Studies. Am J Hum Genet. 91(2): 224237.

