Tools considering epistatic effects
Epistatic interactions between SNPs are believed to be very important in determining individual susceptibility to complex diseases. Multiple genetic variants may each show little effect individually but a strong effect jointly, a phenomenon known as epistasis or multilocus interaction (Cordell 2002). Detecting epistatic interactions may therefore help to reveal the mechanisms underlying complex diseases. This page describes methods that predict these interactions.
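As a toy illustration of "little effect individually but strong effect jointly", the sketch below uses a made-up penetrance table for two loci. The risk values and the carrier frequency are assumptions chosen for illustration and do not come from any cited study.

```python
# Hypothetical penetrance table: disease risk for each pair of
# carrier statuses (locus A, locus B). Values are illustrative only.
PENETRANCE = {(0, 0): 0.1, (0, 1): 0.9, (1, 0): 0.9, (1, 1): 0.1}
CARRIER_FREQ = 0.5  # assumed carrier frequency at both loci (loci independent)


def marginal_risk(locus, status):
    """Disease risk given the carrier status at one locus,
    averaged over the carrier status at the other locus."""
    risk = 0.0
    for other in (0, 1):
        p_other = CARRIER_FREQ if other == 1 else 1 - CARRIER_FREQ
        key = (status, other) if locus == 0 else (other, status)
        risk += p_other * PENETRANCE[key]
    return risk
```

Under these assumptions, `marginal_risk(0, 0)` and `marginal_risk(0, 1)` are the same (and likewise for locus 1), so a single-marker test sees nothing, while the joint table separates high-risk from low-risk genotype combinations clearly.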
- RandomPat - available on request
Combinatorial methods - tested only on small datasets
The problem with these methods is their reliance on exhaustive search, which is not feasible for datasets the size of typical GWA studies.
- MDR (exhaustive search method, nonparametric)
- Monte Carlo logic regression
- penalized regression
- others (Chatterjee et al)
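To give a sense of scale for why exhaustive search breaks down, the following sketch counts the SNP combinations an exhaustive k-way search would have to visit. The figure of 500,000 SNPs is an assumed, illustrative GWA panel size.

```python
from math import comb


def n_combinations(n_snps, order):
    """Number of distinct SNP sets of the given interaction order."""
    return comb(n_snps, order)


if __name__ == "__main__":
    n_snps = 500_000  # assumed panel size, for illustration only
    for order in (2, 3, 4):
        print(order, n_combinations(n_snps, order))
```

With this panel size there are already about 1.25 × 10^11 SNP pairs, and the count grows by several orders of magnitude with each additional interaction order.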
Methods testing statistical epistasis
Recommended review article
For an extensive list of methods published in 2007 and earlier, see the review article by Motsinger et al.
epiMODE is built on the definition of an "epistatic module" as the smallest genetic unit that independently influences disease risk. Based on this definition, epiMODE uses a Bayesian marker partition model to explain the observed case-control data and a Gibbs sampling strategy with a reversible jump Markov chain Monte Carlo (RJ-MCMC) procedure to facilitate the detection of epistatic modules.
Download: for Windows or for Linux
Reference: Tang et al (2009). Epistatic module detection for case-control studies: a Bayesian model with a Gibbs sampling strategy. PLoS Genet. 2009 May;5(5):e1000464. doi:10.1371/journal.pgen.1000464 or pubmedid=19412524
epiForest is a random forest approach for detecting epistatic interactions in case-control studies. First, a random forest analysis is run on all SNPs to obtain the Gini importance of each SNP. A sliding window sequential forward feature selection (SWSFS) algorithm then selects a subset of SNPs that minimizes the classification error between positive (case) and negative (control) samples when the SNPs are used as categorical features. All possible interactions are then enumerated for the subset selected by SWSFS. As input, the program needs the SNP genotypes and the case-control status of the samples. As output, the user gets the interactions among candidate SNPs, each with a statistic describing the significance of its association.
Reference: Jiang et al (2009). A random forest approach to the detection of epistatic interactions in case-control studies. BMC Bioinformatics, 2009, 10 Suppl 1, S65. doi:10.1186/1471-2105-10-S1-S65.
PBEAM is a parallel version of BEAM (Bayesian Epistasis Association Mapping). BEAM uses Markov chain Monte Carlo (MCMC) to search for both single-marker and interaction effects in case-control SNP data. The BEAM algorithm has two essential components: a Bayesian epistasis inference tool implemented via MCMC and a novel test statistic for evaluating statistical significance. Combining methods from two opposing schools of statistics has two benefits: on the one hand, the Bayesian component makes it possible to include prior knowledge about each marker (for example, whether it lies in a coding region); on the other hand, using P values to evaluate statistical significance makes the analysis more robust. The BEAM algorithm takes case-control genotype marker data as input. The genotyped markers should be in their natural genomic order when there is linkage disequilibrium (LD) among some of them (Zhang et al). The output is the posterior probability that each marker, or each epistatic set of interacting markers, is associated with the disease. BEAM classifies the SNPs into three types: SNPs unassociated with the disease, SNPs contributing to disease susceptibility independently, and SNPs influencing the disease risk jointly with each other (Tang et al).
Reference: Zhang et al. Bayesian inference of epistatic interactions in case-control studies. Nat.Genet., 2007, 39, 9, 1167-1173. doi:10.1038/ng2110
Reference: Miller et al. An algorithm for learning maximum entropy probability models of disease risk that efficiently searches and sparingly encodes multilocus genomic interactions. Bioinformatics, 2009, 25, 19, 2478-2485. doi:10.1093/bioinformatics/btp435
MegaSNPHunter takes case-control genotype data as input and produces a ranked list of multi-SNP interactions. The method works as follows. The whole genome is partitioned into multiple short subgenomes, each covering the genomic area of possible haplotype effects. For each subgenome, MegaSNPHunter builds a boosting tree classifier based on multi-SNP interactions and measures the importance of each SNP by its contribution to the classifier. The method keeps the relatively more important SNPs and lets them compete with each other in the same way at the next level. The competition terminates when the number of selected SNPs is less than the size of a subgenome. Finally, MegaSNPHunter extracts and reports the valuable multi-SNP interactions.
Reference: Wan et al (2009). MegaSNPHunter: a learning approach to detect disease predisposition SNPs and high level interactions in genome wide association study. BMC Bioinformatics 2009, 10:13. doi:10.1186/1471-2105-10-13
SNPHarvester is a method for detecting SNP-SNP interactions in GWA studies. It is a stochastic search method that can efficiently select a set of significant SNP groups from hundreds of thousands of SNPs. SNPHarvester is useful because most tools that look for epistatic interactions cannot handle the amount of data produced by GWA studies and therefore need a reduced dataset. By efficiently reducing the number of SNPs, SNPHarvester enables the direct application of existing statistical tools for interaction detection. It is an intermediate tool: it takes genotypes from a GWA study as input and outputs SNP groups, which should then be analyzed by programs such as MDR.
Reference: Yang et al. SNPHarvester: a filtering-based approach for detecting epistatic interactions in genome-wide association studies. Bioinformatics, 2009, 25, 4, 504-511. doi:10.1093/bioinformatics/btn652
SNPRuler uses a novel learning approach based on predictive rule learning to detect epistatic interactions. Rule learning is used to infer interactions because each epistatic interaction implicitly contains some predictive rules, and because finding and evaluating rules is much easier and faster than finding and evaluating interactions. The learning algorithm seeks to identify the rules and then uses them to infer possible epistatic interactions.
Reference: Wan et al. Predictive rule inference for epistatic interaction detection in genome-wide association studies. Bioinformatics 2010 26(1):30-37; doi:10.1093/bioinformatics/btp622
Reference: Motsinger et al. GPNN: power studies and applications of a neural network method for detecting gene-gene interactions in studies of human disease. BMC Bioinformatics, 2006, 7, 39. doi:10.1186/1471-2105-7-39
MDR identifies k-way interactions through an exhaustive search and evaluates the association between each interaction and the disease by cross-validation (Zhang et al). This type of exhaustive search works well on small problems, but in GWA studies its direct application is computationally prohibitive. Effective filtering is needed to reduce the number of SNPs substantially so that an exhaustive search becomes computationally feasible on the reduced SNP set (Yang et al. 2009). SNPHarvester can be used for this purpose. According to a comparison by Zhang et al, MDR performs better than logic regression for common diseases but has little power when disease allele frequencies are small.
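The core of the exhaustive search can be sketched as below for two-way interactions. This is a minimal illustration, not the MDR software: it scores each SNP pair by training accuracy only (the cross-validation step is omitted), and labelling a genotype cell high-risk when its case/control ratio exceeds the overall ratio is an assumed, commonly described convention. All function names are hypothetical.

```python
from collections import Counter
from itertools import combinations


def mdr_accuracy(genotypes, labels, snp_pair):
    """Training accuracy of a high/low-risk rule for one SNP pair.

    genotypes: list of per-sample genotype tuples
    labels:    list of 1 (case) or 0 (control), same length
    """
    i, j = snp_pair
    cases, controls = Counter(), Counter()
    for g, y in zip(genotypes, labels):
        cell = (g[i], g[j])
        (cases if y == 1 else controls)[cell] += 1
    n_cases = sum(labels)
    n_controls = len(labels) - n_cases
    correct = 0
    for cell in set(cases) | set(controls):
        # High risk if the cell's case/control ratio exceeds the overall
        # ratio (cross-multiplied to avoid division by zero).
        high_risk = cases[cell] * n_controls > controls[cell] * n_cases
        correct += cases[cell] if high_risk else controls[cell]
    return correct / len(labels)


def best_pair(genotypes, labels):
    """Exhaustively score every SNP pair and return the best one."""
    n_snps = len(genotypes[0])
    return max(combinations(range(n_snps), 2),
               key=lambda pair: mdr_accuracy(genotypes, labels, pair))
```

Because `best_pair` visits every pair, its cost grows quadratically with the number of SNPs, which is why a filtering step is needed before applying this kind of search to GWA data.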
Website: MDR at epistasis website
Download: MDR at sourceforge.net
Reference: Ritchie et al. Multifactor-dimensionality reduction reveals high-order interactions among estrogen-metabolism genes in sporadic breast cancer. Am.J.Hum.Genet., 2001, 69, 1, 138-147. doi:10.1086/321276
Monte Carlo logic regression, described by Kooperberg et al, combines logic regression and MCMC to search for SNP interactions.
Website: logic regression
Download: R package LogicReg from CRAN
Reference: Kooperberg C, Ruczinski I (2005). Identifying Interacting SNPs using Monte Carlo Logic Regression. Genetic Epidemiology, 28(2): 157-70.
Penalized regression uses a variant of logistic regression with quadratic penalization to detect epistatic interactions.
Reference: Park et al. Penalized logistic regression for detecting gene interactions. Biostatistics, 2008, 9, 1, 30-50. doi:10.1093/biostatistics/kxm010
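The idea of a quadratic penalty can be sketched with a small logistic regression that includes an explicit interaction term. This is an illustrative sketch under stated assumptions, not the procedure of the cited paper: the fitting method (plain gradient descent), the binary coding of the two loci, and the names `fit_penalized_logistic` and `with_interaction` are all hypothetical.

```python
import math


def with_interaction(a, b):
    """Feature vector for two loci plus their product (interaction) term."""
    return [a, b, a * b]


def fit_penalized_logistic(X, y, lam=0.1, lr=0.1, n_iter=2000):
    """Minimize mean negative log-likelihood + (lam / 2) * ||w||^2
    by gradient descent. The intercept b is not penalized."""
    n, p = len(X), len(X[0])
    w, b = [0.0] * p, 0.0
    for _ in range(n_iter):
        grad_w, grad_b = [0.0] * p, 0.0
        for xi, yi in zip(X, y):
            z = b + sum(wj * xj for wj, xj in zip(w, xi))
            err = 1.0 / (1.0 + math.exp(-z)) - yi  # predicted minus observed
            grad_b += err
            for j in range(p):
                grad_w[j] += err * xi[j]
        b -= lr * grad_b / n
        # The quadratic penalty adds lam * w_j to each weight's gradient.
        w = [wj - lr * (gj / n + lam * wj) for wj, gj in zip(w, grad_w)]
    return w, b
```

The penalty shrinks all coefficients toward zero, which keeps the fit stable when interaction terms are sparse or when some genotype combinations are observed only in cases or only in controls.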
CPM (combinatorial partitioning method) exhaustively searches for the partition of multilocus genotypes into groups that shows the most significant difference in the mean of the continuous phenotype (Tang et al). CPM uses a brute-force search, which is impractical for large datasets.
Reference: Nelson et al. (2001). A combinatorial partitioning method to identify multilocus genotypic partitions that predict quantitative trait variation. Genome Res., 2001, 11, 3, 458-470. doi:10.1101/gr.172901
RPM modifies the CPM method by ignoring partitions that combine individual genotypes with very different mean trait values (Tang et al).
Reference: Culverhouse et al. (2004). Detecting epistatic interactions contributing to quantitative traits. Genet.Epidemiol., 2004, 27, 2, 141-152. doi: 10.1002/gepi.20006
BGTA uses a bootstrap-type resampling screening procedure to select markers; markers whose return frequencies exceed the third quartile plus 1.8 times the interquartile range are considered to be disease-associated (Zhang et al).
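The cutoff rule can be sketched as follows. The quartile convention (Python's "inclusive" method) is an assumption, since different conventions give slightly different quartiles on small samples, and the function names are hypothetical.

```python
from statistics import quantiles


def bgta_cutoff(frequencies, k=1.8):
    """Third quartile plus k times the interquartile range."""
    q1, _, q3 = quantiles(frequencies, n=4, method="inclusive")
    return q3 + k * (q3 - q1)


def select_markers(return_frequencies, k=1.8):
    """Return the markers whose return frequency exceeds the cutoff.

    return_frequencies: dict mapping marker name -> return frequency
    """
    cutoff = bgta_cutoff(list(return_frequencies.values()), k)
    return [m for m, f in return_frequencies.items() if f > cutoff]
```

For example, with made-up return frequencies `{"rs1": 1, "rs2": 2, "rs3": 3, "rs4": 4, "rs5": 5, "rs6": 50}`, only `rs6` lies above the cutoff and would be flagged.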
Reference: Zheng et al (2006). Backward genotype-trait association (BGTA)-based dissection of complex traits in case-control designs. Hum.Hered., 2006, 62, 4, 196-212. doi:10.1159/000096995
Reference: Millstein et al. A testing framework for identifying susceptibility genes in the presence of epistasis. Am.J.Hum.Genet., 2006, 78, 1, 15-27. doi:10.1086/498850
HapForest uses a forest-based approach to identifying haplotype-haplotype interactions.
Reference: Chen et al. A forest-based approach to identifying gene and gene-gene interactions. Proc.Natl.Acad.Sci.U.S.A., 2007, 104, 49, 19199-19203. doi:10.1073/pnas.0709868104
“BOolean Operation-based Screening and Testing” (BOOST) is a method for the discovery of unknown gene-gene interactions that underlie complex diseases. It belongs to the group of methods testing statistical epistasis. BOOST allows examination of all pairwise interactions in genome-wide case-control studies.
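As the name suggests, genotype data can be stored as bit strings so that contingency tables for SNP pairs are collected with Boolean operations. The sketch below illustrates that general idea with Python integers as bitmasks. It is an assumption-laden illustration, not BOOST's actual implementation, and the function names are hypothetical.

```python
def genotype_bitmasks(genotypes):
    """One integer bitmask per genotype value (0, 1, 2).

    Bit s of masks[g] is set if sample s carries genotype g.
    """
    masks = [0, 0, 0]
    for sample, g in enumerate(genotypes):
        masks[g] |= 1 << sample
    return masks


def pair_table(masks_a, masks_b, case_mask):
    """Contingency counts table[g_a][g_b] = (n_controls, n_cases),
    obtained with bitwise AND followed by a bit count."""
    table = []
    for ma in masks_a:
        row = []
        for mb in masks_b:
            joint = ma & mb  # samples with genotype g_a AND genotype g_b
            n_cases = bin(joint & case_mask).count("1")
            row.append((bin(joint).count("1") - n_cases, n_cases))
        table.append(row)
    return table
```

Each cell of the pairwise table then costs one AND and one bit count over the packed samples rather than a pass over individual genotypes, which is what makes testing all pairs in a genome-wide study tractable.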
Download: BOOST executables or from sourceforge.net
Reference: Xiang Wan et al (2010). BOOST: A Fast Approach to Detecting Gene-Gene Interactions in Genome-wide Case-Control Studies. The American Journal of Human Genetics, Volume 87, Issue 3, 325-340, 02 September 2010. doi:10.1016/j.ajhg.2010.07.021
INTERSNP is software for genome-wide interaction analysis (GWIA) of case-control SNP data and quantitative traits. SNPs are selected for joint analysis using a priori information. The information used to define meaningful selection strategies can be statistical evidence (single-marker association at a moderate level, computed from the user's own data) or genetic/biological relevance (genomic location, function class or pathway information).
Reference: Herold et al. INTERSNP: genome-wide interaction analysis guided by a priori information, Bioinformatics 25 (2009), pp. 3275–3281. doi: 10.1093/bioinformatics/btp596
PLINK is a free, open-source whole genome association analysis toolset, designed to perform a range of basic, large-scale analyses in a computationally efficient manner. In addition to its other functions it can be used to test for statistical epistasis.
Reference: Purcell et al. PLINK: a tool set for whole-genome association and population-based linkage analyses, Am. J. Hum. Genet. 81 (2007), pp. 559–575. doi:10.1086/519795
- Liang, Y. and Kelemen, A. (2008). Statistical advances and challenges for analyzing correlated high dimensional SNP data in genomic study for complex diseases. Statist. Surv. Volume 2 (2008), 43-60. doi:10.1214/07-SS026
- Musani et al (2007). Detection of gene x gene interactions in genome-wide association studies of human population data. Hum.Hered., 2007, 63, 2, 67-84. doi:10.1159/000099179
- Motsinger et al (2007). Novel methods for detecting epistasis in pharmacogenomics studies. Pharmacogenomics, 2007, 8, 9, 1229-1241. doi:10.2217/14622416.8.9.1229
- Cordell (2002). Epistasis: what it means, what it doesn't mean, and statistical methods to detect it in humans. Hum.Mol.Genet., 2002, 11, 20, 2463-2468. http://hmg.oxfordjournals.org/cgi/content/abstract/11/20/2463