Tools and methods for mapping genomic structural variation
This page concentrates on tools and methods for mapping genomic structural variation. New sequencing techniques are characterized by unprecedented speed and short read lengths. The tools listed here are designed to handle the output of these techniques and to map the locations of genomic structural variants from that information. They are listed as disease prediction tools because any structural variant is a potential risk factor for pathogenicity.
- Method in article Lee et al
BreakDancer predicts a wide variety of structural variants, including insertions and deletions (indels), inversions and translocations. The BreakDancer software package consists of two complementary algorithms: BreakDancerMax and BreakDancerMini. BreakDancerMini is based on the Kolmogorov-Smirnov test. As input the programs require map files produced by MAQ. As output BreakDancerMax reports deletions, insertions, inversions, and intra- and interchromosomal translocations, and BreakDancerMini reports small indels.
Download: from Nature Methods web site or from Sourceforge site
Reference: Chen et al. BreakDancer: an algorithm for high-resolution mapping of genomic structural variation. Nat.Methods, 2009, 6, 9, 677-681. doi:10.1038/NMETH.1363
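The Kolmogorov-Smirnov test mentioned for BreakDancerMini compares two empirical distributions. As a rough illustration only (this is not BreakDancer's code), the two-sample statistic below compares the insert sizes seen in one genomic region with those of the whole library; a large value hints that the region's inserts have shifted. The function name and the insert sizes are hypothetical.

```python
from bisect import bisect_right


def ks_statistic(sample_a, sample_b):
    """Two-sample Kolmogorov-Smirnov statistic: the largest gap between two empirical CDFs."""
    a = sorted(sample_a)
    b = sorted(sample_b)
    if not a or not b:
        raise ValueError("both samples must be non-empty")
    largest_gap = 0.0
    # The gap between step functions can only change at observed values.
    for x in a + b:
        cdf_a = bisect_right(a, x) / len(a)
        cdf_b = bisect_right(b, x) / len(b)
        largest_gap = max(largest_gap, abs(cdf_a - cdf_b))
    return largest_gap


# Hypothetical insert sizes in base pairs: whole library versus one region.
library_inserts = [195, 198, 200, 201, 203, 205, 199, 202]
region_inserts = [228, 231, 229, 233, 230, 232]
print(ks_statistic(library_inserts, region_inserts))
```

Turning the statistic into a p-value, and choosing which regions to test, is left out of this sketch.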
Indelign is a probabilistic framework for annotation of insertions and deletions in a multiple alignment.
Reference: Kim et al. Indelign: a probabilistic framework for annotation of insertions and deletions in a multiple alignment. Bioinformatics. 2006 Nov 15. doi: 10.1093/bioinformatics/btl578
MAQ is a tool both for mapping short DNA sequencing reads and for identifying small indels (<10 base pairs). MAQ makes full use of mate-pair information and estimates the error probability of each read alignment. As input MAQ takes sequence reads with mate-pair information. As output it generates the mapping of the reads and, in addition, the detected short indels.
Download: from Sourceforge site
Reference: Li et al. Mapping short DNA sequencing reads and calling variants using mapping quality scores. Genome Res., 2008, 18, 11, 1851-1858. doi:10.1101/gr.078212.108
See also: MAQGene (Web-based user interface for MAQ)
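The error probability that MAQ estimates for each alignment is reported as a mapping quality score, as the title of Li et al. indicates. Assuming the usual Phred-style scaling, Q = -10 log10(P), the conversion can be sketched as follows. The helper names are illustrative and are not part of MAQ.

```python
import math


def error_prob_to_quality(p_error):
    """Phred-scaled quality for an alignment error probability in (0, 1]."""
    if not 0.0 < p_error <= 1.0:
        raise ValueError("p_error must be in (0, 1]")
    return -10.0 * math.log10(p_error)


def quality_to_error_prob(quality):
    """Inverse conversion: probability that an alignment with this quality is wrong."""
    return 10.0 ** (-quality / 10.0)


# A quality of 30 corresponds to roughly a 1 in 1000 chance of a wrong alignment.
print(quality_to_error_prob(30))
```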
VariationHunter is a package of programs for finding structural variations when the mappings of paired-end reads are known. VariationHunter uses MrFast as its mapping algorithm. As input it needs the mappings of paired-end reads plus some additional information related to them. The output is written to three files, one each for deletions, insertions and inversions. The method is used to identify indels larger than 50 bp (Lee et al).
Reference: Hormozdiari et al. Combinatorial algorithms for structural variation detection in high-throughput sequenced genomes. Genome Res., 2009, 19, 7, 1270-1278. doi:10.1101/gr.088633.108
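Paired-end methods such as VariationHunter start from read pairs whose mapping disagrees with what the sequenced library predicts. A minimal sketch of that general idea (not VariationHunter's combinatorial algorithm) labels a pair by its mapped span and strand orientation. The span thresholds, the argument names and the assumption that a normal pair maps inward-facing (+/-) are all illustrative.

```python
def classify_pair(left_strand, right_strand, mapped_span, min_span, max_span):
    """Label one read pair mapped to the same chromosome.

    mapped_span is the reference distance between the outer ends of the two reads;
    min_span and max_span bound what the library's insert sizes can explain.
    """
    if left_strand == right_strand:
        return "inversion"  # one end maps to the opposite strand from the expected one
    if (left_strand, right_strand) != ("+", "-"):
        return "other"  # outward-facing pairs are not handled by this sketch
    if mapped_span > max_span:
        return "deletion"  # the reference contains sequence the sample lacks
    if mapped_span < min_span:
        return "insertion"  # the sample contains sequence the reference lacks
    return "concordant"


# Hypothetical library whose inserts span 150-300 bp on the reference.
print(classify_pair("+", "-", 2000, 150, 300))
```

A single discordant pair is weak evidence; the tools listed here look for clusters of pairs that support the same event.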
MoDIL is presented as the first method to identify medium-size (20-50 bp) indels from high-throughput sequencing data, whereas several methods already exist for identifying small and large indels. As input MoDIL takes sequence reads. MoDIL uses the EM algorithm and the Kolmogorov-Smirnov test in its analysis. As output the program gives the identified indels.
Reference: Lee et al. MoDIL: detecting small indels from clone-end sequencing with mixtures of distributions. Nat.Methods, 2009, 6, 7, 473-474. doi:10.1038/NMETH.F.256
PEMer consists of an analysis pipeline, simulation-based error models and a back-end database. The tool is used to identify indels larger than 50 bp (Lee et al). The method should be relatively insensitive to base-calling errors. PEMer can process data from several next-generation DNA sequencing platforms, including 454 (Roche), Illumina and ABI. The back-end database, BreakDB, is a web-accessible database developed to store, annotate and display SV breakpoint events identified by PEMer and from other sources.
Reference: Korbel et al. PEMer: a computational framework with simulation-based error models for inferring genomic structural variants from massive paired-end sequencing data. Genome Biol., 2009, 10, 2, R23. doi: 10.1186/gb-2009-10-2-r23.
cnvHMM is a Washington University algorithm for Illumina and Solexa data. cnvHMM performs copy number analysis using a hidden Markov model.
GASV is software for the classification and comparison of structural variants measured via paired-end sequencing and/or array-CGH. GASV currently supports three features: clustering a set of ESPs and producing breakpoint regions, filtering paired-end sequences (ESPs) by a reference set, and taking a set of ESPs and producing unclustered breakpoint regions.
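The internals of cnvHMM are not described on this page, so the following is only a generic sketch of how a hidden Markov approach can segment windowed read depth into copy-number states. The three states, the Poisson emission model, the expected depth per copy and the transition probability are all assumptions chosen for illustration.

```python
import math


def viterbi_copy_number(depths, depth_per_copy=10.0, stay_prob=0.9, states=(1, 2, 3)):
    """Most likely copy-number state per window under a simple Poisson-emission HMM."""
    if not depths:
        return []

    def log_emission(count, copies):
        # Poisson log-probability of a read count given the state's expected depth.
        mean = depth_per_copy * copies
        return count * math.log(mean) - mean - math.lgamma(count + 1)

    switch_prob = (1.0 - stay_prob) / (len(states) - 1)

    def log_transition(prev, cur):
        return math.log(stay_prob if prev == cur else switch_prob)

    # Uniform start; scores[s] is the best log-probability of any path ending in state s.
    scores = {s: -math.log(len(states)) + log_emission(depths[0], s) for s in states}
    backpointers = []
    for count in depths[1:]:
        new_scores, pointers = {}, {}
        for cur in states:
            best_prev = max(states, key=lambda prev: scores[prev] + log_transition(prev, cur))
            pointers[cur] = best_prev
            new_scores[cur] = (scores[best_prev] + log_transition(best_prev, cur)
                               + log_emission(count, cur))
        scores = new_scores
        backpointers.append(pointers)

    # Trace back from the best final state.
    state = max(states, key=lambda s: scores[s])
    path = [state]
    for pointers in reversed(backpointers):
        state = pointers[state]
        path.append(state)
    return path[::-1]


# Hypothetical read counts per window; the dip in the middle suggests a one-copy loss.
window_depths = [20, 21, 19, 22, 10, 9, 11, 10, 20, 19]
print(viterbi_copy_number(window_depths))
```

The transition penalty is what keeps a single noisy window from being called as its own segment.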
Reference: Sindi et al. A geometric approach for classification and comparison of structural variants. Bioinformatics, 2009, 25, 12, i222-30. doi:10.1093/bioinformatics/btp208
Sequence Variant Analyzer (SVA) is a tool developed to analyze genetic variants from whole-genome sequencing studies. As input the tool takes single-site variants, small indels and large copy number changes. SVA uses a number of biological databases to perform functional annotation, and then runs several internal and external programs to perform the statistical and bioinformatic analyses needed to identify potential causal variants and genes responsible for the biological traits or medical outcomes of interest.
Reference: If you use the tool, please use the following citation:
Author: Dongliang Ge & David B. Goldstein
SWT is a WashU Sliding Window Tool for detecting copy number variants from Illumina/Solexa data.
Many tools can handle the output of only one technology, but VarScan can detect SNPs and indels from both Solexa and Roche platforms. Unlike other currently available variant detection tools, VarScan is compatible with several read aligners (BLAT, Newbler, cross_match, Bowtie and Novoalign) and calls variants in both individual and pooled samples. As input VarScan requires an alignment file. As output the user gets a report of SNPs, insertions and deletions with their chromosomal coordinates, alleles, flanking sequence and read counts. VarScan does not predict the effect of these variants, only their existence.
Download: (Download VarScan from here)
Reference: Koboldt et al. VarScan: variant detection in massively parallel sequencing of individual and pooled samples. Bioinformatics, 2009, 25, 17, 2283-2285. doi:10.1093/bioinformatics/btp373
Pindel is a pattern growth approach to detect break points of large deletions and medium sized insertions from paired-end short reads. As input Pindel requires a genomic reference in FASTA format and a read file that stores one-end-mapped paired-end reads. As a result the user gets the mapped indels and an alignment of the supporting reads with the reference sequence.
Reference: Ye et al. Pindel: a pattern growth approach to detect break points of large deletions and medium sized insertions from paired-end short reads. Bioinformatics, 2009. doi:10.1093/bioinformatics/btp394
The method developed by Lee et al uses a probabilistic framework for the identification of structural variants using clone-end sequencing.
Reference: Lee et al. A robust framework for detecting structural variations in a genome. Bioinformatics, 2008, 24, 13, i59-67. doi:10.1093/bioinformatics/btn176
CNV-seq is a method for detecting DNA copy number variation (CNV) using high-throughput sequencing. As input the program requires the output of a read aligner (for example BLAT). As output the user gets CNV predictions.
Reference: Xie et al. CNV-seq, a new method to detect copy number variation using high-throughput sequencing. BMC Bioinformatics, 2009, 10, 80. doi:10.1186/1471-2105-10-80
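Read-depth methods such as CNV-seq compare how many reads from a test sample and a reference sample fall into windows along the genome. A simplified sketch of that comparison, without the statistical model of Xie et al., computes a log2 ratio per window after scaling for the total number of reads. The non-overlapping windows, the pseudocount and the function name are assumptions made for this example.

```python
import math
from collections import Counter


def window_log2_ratios(test_positions, ref_positions, window_size, pseudocount=0.5):
    """log2(test/reference) read-count ratio per non-overlapping window on one chromosome."""
    if not test_positions or not ref_positions:
        raise ValueError("both samples need at least one mapped read")
    test_counts = Counter(pos // window_size for pos in test_positions)
    ref_counts = Counter(pos // window_size for pos in ref_positions)
    # Scale the test counts so samples sequenced to different depths are comparable.
    scale = len(ref_positions) / len(test_positions)
    ratios = {}
    for window in sorted(set(test_counts) | set(ref_counts)):
        test = test_counts[window] * scale + pseudocount
        ref = ref_counts[window] + pseudocount
        ratios[window * window_size] = math.log2(test / ref)
    return ratios


# Hypothetical read start positions on one chromosome.
test_reads = [10, 20, 30, 40, 110, 120]
ref_reads = [15, 25, 115, 125, 135, 145]
print(window_log2_ratios(test_reads, ref_reads, window_size=100))
```

A positive ratio suggests a gain in the test sample relative to the reference, and a negative ratio a loss; deciding which ratios are significant needs a statistical model, which this sketch leaves out.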
- High throughput sequencing tools - BioAssist wiki page
- Useful comparison of alignment tools
- NGS Alignment programs
Database containing clinical findings associated with submicroscopic chromosomal imbalance (including deletions, duplications, insertions, translocations, and inversions)
- DECIPHER (DatabasE of Chromosomal Imbalance and Phenotype in Humans using Ensembl Resources)
- CHOP (The Copy Number Variation project at the Children's Hospital of Philadelphia)
- DGV (Database of Genomic Variants)
- Samtools - SAM (Sequence Alignment/Map) format is a generic format for storing large nucleotide sequence alignments
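Each alignment line in a SAM file has eleven mandatory tab-separated fields. A minimal sketch of reading one such line into a record follows. The example read is made up, and real pipelines would normally rely on Samtools or a dedicated library rather than hand parsing.

```python
from typing import NamedTuple


class SamRecord(NamedTuple):
    qname: str  # read name
    flag: int   # bitwise flag (0x10 set means the read maps to the reverse strand)
    rname: str  # reference sequence name
    pos: int    # 1-based leftmost mapping position
    mapq: int   # mapping quality
    cigar: str
    rnext: str  # reference name of the mate ("=" means same as rname)
    pnext: int  # position of the mate
    tlen: int   # observed template length
    seq: str
    qual: str


def parse_sam_line(line):
    """Parse the eleven mandatory fields of one SAM alignment line; optional tags are ignored."""
    if line.startswith("@"):
        raise ValueError("header lines are not alignment records")
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 11:
        raise ValueError("expected at least 11 tab-separated fields")
    qname, flag, rname, pos, mapq, cigar, rnext, pnext, tlen, seq, qual = fields[:11]
    return SamRecord(qname, int(flag), rname, int(pos), int(mapq), cigar,
                     rnext, int(pnext), int(tlen), seq, qual)


# Made-up paired-end read, forward strand, mate 200 bp downstream.
example = "read1\t99\tchr1\t1000\t60\t5M\t=\t1200\t205\tACGTA\tIIIII"
record = parse_sam_line(example)
print(record.rname, record.pos, bool(record.flag & 0x10))
```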
Article reviewing published articles found with the terms "copy number variation" and "structural variation" between Jan 1, 2004 and Nov 3, 2008.
- Wain et al. Genomic copy number variation, human health, and disease. Lancet, 2009, 374, 9686, 340-350. doi:10.1016/S0140-6736(09)60249-X
- Tabone. Mutations, structural variations, and genome-wide resequencing: where to from here in our understanding of disease and evolution? Hum.Mutat., 2008, 29, 6, 886-890. doi:10.1002/humu.20781
- Harismendy et al. Evaluation of next generation sequencing platforms for population targeted sequencing studies. Genome Biology, 2009, 10, R32
- Bioinformatics issue on Next Generation Sequencing, 2009