Copyright (c) 2018, Gao Wang, Biao Li, Diana Cornejo Sánchez & Suzanne M. Leal
Variant Association Tools [VAT, Wang et al (2014)] [1] was developed to perform quality control and association analysis of sequence data. It can also be used to analyze genotype data, e.g. exome chip data and imputed data. The software incorporates many rare variant association methods which include but not limited to Combined Multivariate Collapsing (CMC) [2], Burden of Rare Variants (BRV) [3], Weighted Sum Statistic (WSS) [4], Kernel Based Adaptive Cluster (KBAC) [5], Variable Threshold (VT) [6] and Sequence Kernel Association Test (SKAT) [7]. VAT inherits the intuitive command-line interface of Variant Tools (VTools) [8] with re-design and implementation of its infrastructure to accommodate the scale of dataset generated from current sequencing efforts on large populations. Features of VAT are implemented into VTools subcommand system.
To reproduce this tutorial:
https://github.com/gaow/ismb-2018
Basic concepts to handle sequence data using vtools
can be found at:
http://varianttools.sourceforge.net/Main/Concepts
VAT Software documentation:
Exome genotype data was downloaded from the 1000 Genomes pilot data July 2010 release for both the CEU and YRI populations. Only the autosomes are contained in the datasets accompanying this exercise.
The data sets (CEU.exon.2010_03.genotypes.vcf.gz, YRI.exon.2010_03.genotypes.vcf.gz
) are available from:
ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/pilot_data/release/2010_07/exon/snps
Due to the nature of next-generation sequencing data, a reasonably powerful machine with high speed internet connection is needed to use this tool for real-world applications. For this reason, in this tutorial we will use a small demo dataset to demonstrate association analysis.
and check the available subcommands by typing:
Subcommand system is used for various data manipulation tasks (to check details of each subcommand use vtools <name of subcommand> -h
). This tutorial is mission oriented and focuses on a subset of the commands that are relevant to variant-phenotype association analysis, rather than introducing them systematically. For additional functionality, please refer to documentation and tutorials online.
Command vtools init
creates a new project in the current directory. A directory can only have one project. After a project is created, subsequent vtools
calls will automatically load the project in the current directory. Working from outside of a project directory is not allowed.
Import all vcf files under the current directory:
Command vtools import
imports variants, sample genotypes and related information fields. The imported variants are saved to the master variant table for the project, along with their information fields.
The command above imports two vcf files sequentially into an empty vtools
project. The second INFO
message in the screen output shows that 3,489 variant sites are imported from the first vcf file, where 3,489 new means that all of them are new because prior to importing the first vcf the project was empty so there was 0 site. The fourth INFO
message tells that 5,175 variant sites are imported from the second vcf file, but only 3,498 of them are new (which are not seen in the existing 3,489) because prior to importing the second vcf there were already 3,489 existing variant sites from first vcf.
Thus, 5,175 - 3,498 = 1,677 variant sites are overlapped sites between first and second vcfs. The last INFO
message summarizes that the sum of variant sites contained in both vcfs is 8,664 = 3,489 + 5,175, where there are 6,987 variant sites after merging variants from both vcfs.
More details about vtools import
command can be found at:
http://varianttools.sourceforge.net/Vtools/Import
Since the input VCF file uses hg18 as the reference genome while most modern annotation data sources are hg19-based, we need to "liftover" our project using hg19 in order to use various annotation sources in the analysis. Vtools provides a command which is based on the tool of USCS liftOver
to map the variants from existing reference genome to an alternative build. More details about vtools liftover
command can be found at:
The phenotype file includes information for every individual, the sample name, sequencing panel, sex and BMI. To import the phenotype data:
Unlike vtools import
, this command imports/adds properties to samples rather than to variants. More details about vtools phenotype
command can be found at:
Summary information for the project can be viewed anytime using the command vtools show
, which displays various project and system information. More details about vtools show
can be found at:
http://varianttools.sourceforge.net/Vtools/Show
Some useful data summary commands are:
There are 6987 variants in our toy data-set.
vtools select table condition action
selects from a variant table table
a subset of variants satisfying a specified condition
, and perform an action
of
--to_table
is specified.--count
is specified.--output
is specified.The condition
should be a SQL expression using one or more fields in a project (displayed in vtools show
fields). If the condition argument is unspecified, then all variants in the table will be selected. An optional condition --samples [condition]
can also be used to limit selected variants to specific samples. More details about vtools select
command can be found at:
The command vtools show genotypes
displays the number of genotypes for each sample and names of the available genotype information fields for each sample, e.g. GT - genotype; DP geno - genotype read depth. Such information is useful for the calculation of summary statistics of genotypes (e.g. depth of coverage).
In the test data, the maximum DP for variant sites is 25490, minimum DP 13, average DP about 6815, standard deviation of DP about 3434, lower quartile of DP 4301 and upper quartile of DP 9143.
The same syntax can be applied to other variant information or annotation information fields. The command vtools output <name of variant table>
outputs properties of variants in a specified variant table. The properties include fields from annotation databases and variant tables, basically fields outputted from command vtools show fields
, and SQL-supported functions and expressions. There are several freely available SQL resources on the web to learn more about SQL functions and expressions.
It is also possible to view variant level summary statistic for variants satisfying certain filtering criteria using vtools select <name of variant table>
command, for example to count only variants having passed all quality filters:
All 6987 variants have passed the quality filters. To combine variant filtering and summary statistics:
The output information of command above will be the same as the previous vtools output command, since all variants have passed quality filter.
The command below will calculate:
total
: Total number of genotypes (GT) for a variantnum
: Total number of alternative alleles across all sampleshet
: Total number of heterozygote genotypes 1/0hom
: Total number of homozygote genotypes 1/1other
: Total number of double-homozygotes 1/2min/max/meanDP
: Summaries for depth of coverage and genotype quality across samplesmaf
: Minor allele frequencyfields
, which can be shown by commands vtools show fields
and vtools show table variant
Command vtools update
updates variant info fields (and to a lesser extend genotype info fields) by adding more fields or updating values at existing fields. It does not add any new variants or genotypes, and does not change existing variants, samples, or genotypes. Using three parameters --from_file
, --from_stat
, and --set
, variant information fields could be updated from external file, sample genotypes, and existing fields. More details about vtools update command can be found at
The --genotypes CONDITION
option restricts calculation to genotypes satisfying a given condition. Later we will remove individual genotypes by DP_geno
filters. The command below will calculate summary statistics genotypes of all samples per variant site. It can assist us in determining filtering criteria for genotype call quality.
You will notice the change in genotype counts when applying the filter on genotype depth of coverage and only retaining those genotypes with a read depth greater than 10X. There are now 6976 variant sites after filtering on DP_geno>10
. Note that some variant sites will become monomorphic after removing genotypes due to low read depth.
In previous steps, we calculated MAFs for each variant site before and after filtering on genotype read depth. Below is a summary of the results:
Adding “> filename.txt
” at the end of the above command will write the output to a file.
Next, we examine population specific MAFs. Our data is imported from two files, a CEU dataset (90 samples) and an YRI dataset (112 samples). To calculate allele frequency for each population, let us first assign an additional RACE phenotype (0 for YRI samples and 1 for CEU samples):
Population specific MAF calculations will be performed using those genotypes that passed the read depth filter (DP_geno>10
).
You will observe zero values because some variant sites are monomorphic or they are population specific.
Similar operations could be performed on a sample level instead of on a variant level. More details about obtaining genotype level summary information using vtools phenotype --from_stat
can be found at
For rare variant aggregated association tests, we want to focus on analyzing aggregating variants having potential functional contribution to a phenotype. Thus, each variant site needs to be annotated for its functionality. Annotation is performed using variant annotation tools
[7] which implements an ANNOVAR
pipeline for variant function annotation [9]. More details about the ANNOVAR pipeline can be found at
The following command will output the annotated variant sites to the screen.
Many more annotation sources are available which are not covered in this tutorial. Please read
http://varianttools.sourceforge.net/Annotation
for annotation databases, and
http://varianttools.sourceforge.net/Pipeline for annotation pipelines.
Before performing any data QC we examine the transition/transversion (Ti/Tv) ratio for all variant sites. Note that here we are obtaining Ti/Tv ratios for the entire sample, Ti/Tv ratios can also be obtained for each sample.
The command above counts the number of transition and transversion variants and calculates its ratio. More details about vtools report trans_ratio
command can be found at
http://varianttools.sourceforge.net/VtoolsReport/TransRatio
If only genotype calls having depth of coverage greater than 10 are considered:
We can see that Ti/Tv ratio has increase slightly if low depth of coverage calls are removed. There is only a small change in the Ti/Tv ratio since only a few variant sites become monomorphic and are no longer included in the calculation. In practice Ti/Tv ratios can be used to evaluate which threshold should be used in data QC.
We should not need to remove any variant site based on read depth because all variants passed the quality filter. To demonstrate removal of variant sites, let us remove those with a total read depth DP<15
:
We can see that one variant site has been removed from master variant table. The vtools remove
command can remove various items from the current project. More details about vtools remove
command can be found at:
http://varianttools.sourceforge.net/Vtools/Remove
Using a combination of select/remove subcommands low quality variant sites can be easily filtered out. The vtools show fields
, vtools show tables
, and vtools show table variant
commands will allow you to see the new/updated fields and tables you have added/changed to the project.
We have calculated various summary statistics using the command --genotypes CONDITION
but we have not yet removed genotypes having genotype read depth of coverage lower than 10X. The command below removes these genotypes.
The command above selects variant sites that are either nonsynonymous (by condition mut_type like ’non%’
) or stop-gain/stop-loss (by condition mut_type like ’stop%’
) or alternative splicing (by condition region-type=’splicing’
)
3367 functional variant sites are selected.
We want to carry out the association analysis for CEU and YRI separately. For starters we demonstrate analysis of CEU samples; and the same commands will be applicable for YRI samples. After completing the analysis of CEU samples please use the same commands to analyze the YRI data set. You should not analyze the data from different populations together, once you have the p-values from each analysis, you may perform a meta-analysis.
To carry out association tests we need to treat common and rare variants separately. The dataset for our tutorial has very small sample size, but with large sample size it is reasonable to define rare variants as having observed MAF<0.01, and common variants as variants having observed MAF$\ge$0.05. First, we create variant tables based on calculated alternative allele frequencies for both populations
Notice that for selection of rare variants we only keep those that are annotated as functional (chosen from v_funct
table). There are 1450 and 604 variant sites selected for MAF0.05 and MAF<0.01, respectively.
For gene based rare variant analysis we need annotations that tell us the boundaries of genes. We use the refGene annotation database for this purpose.
The names of genes are contained in the refGene.name2
field. The vtools use
command, attaches an annotation database to the project, effectively incorporating one or more attributes available to variants in the project. More details about vtools use
command can be found at
Note that we use the quantitative trait BMI as the phenotype, and we will account for “SEX” as a covariate in the regression framework. More details about vtools associate
command can be found at
http://varianttools.sourceforge.net/Vtools/Associate
By default, the program will perform single variant tests using a simple linear model, and the Wald test statistic will be evaluated for p-values:
Option -j1
specifies that 1 CPU core be used for association testing. You may use larger number of jobs for real world data analysis, e.g., use -j16
if your computational resources has 16 CPU cores available. Linux command cat /proc/cpuinfo
shows the number of cores and other information related to the CPU on your computer.
The following command displays error messages about the failed tests. In each case, the sample size was too small to perform the regression analysis.
A summary from the association test is written to the file EA_CV.asso.res
. The first column indicates the variant chromosome and base pair position so that you may follow up on the top signals using various annotation sources that we will not introduce in this tutorial. The result will be automatically built into annotation database if --to_db
option is specified.
To sort the results by p-value and output the first 10 lines of the file use the command:
If you obtain significant p-values be sure to also observe the accompanying sample size. Significant p-values from too small of a sample size may not be results you can trust.
Also, depending on your phenotype you may have to add additional covariates to your analysis. VAT allows you to test many different models for the various phenotypes and covariates. P-values for covariates are also reported.
Similar to using an annotation database, you can use the results from the association test to annotate the project and follow up variants of interest, for example:
You see additional annotation fields starting with EA CV, the name of the annotation database you just created from association test (if you used the --to db option mentioned above). You can use them to easily select/output variants of interest. More details about outputting annotation fields for significant findings can be found at
http://varianttools.sourceforge.net/Vtools/Output
BRV method uses the count of rare variants in given genetic region for association analysis, regardless of the region length.
We use the -g
option and use the ‘refGene.name2’ field to define the boundaries of a gene. By default, the test is a linear regression using aggregated counts of variants in a gene region as the regressor.
Association tests on 404 groups have completed. 13 failed. To view failed tests:
The output file is EA_RV.asso.res
. The first column is the gene name, with corresponding p-values in the sixth column for the entire gene.
You can also sort these results by p-value using command:
The variable thresholds (VT) method will carry out multiple testing in the same gene region using groups of variants based on observed variant allele frequencies. This test will maximize over statistics thus obtain a final test statistic, and calculate the empirical p-value so that multiple comparisons are adjusted for correctly.
We will use adaptive permutation to obtain empirical p-values. Therefore, to avoid performing too large number of permutations we use a cutoff to limit the number of permutations when the p-value is greater than 0.0005, e.g. not all 100,000 permutations are performed. Generally, even more permutations are used but we limit it to 100,000 to save time for this exercise.
The command using variable thresholds method on our data is:
To view test that failed,
To view results,
Sort and output the lowest p-values using the command:
Notice that vtools associate
command will fail on some association test units. Instances of failure are printed to terminal in red and are recorded in the project log file. Most failures occur due to an association test unit having too few samples or number of variants (for gene based analysis). You should view these error messages after each association scan is complete, e.g., using the Linux command grep -i error *.log
and make sure you are informed of why failures occur.
In the variable thresholds analysis above, gene TMCC1 failed the association test. If we look at this gene more closely we can see which variants are being analyzed by our test:
After applying our QC filters we are left with one variant within the TMCC1 gene to analyze. Because the MAF for this variant is 0.0 there are no variants in the gene to analyze so that this gene is ignored. Note that all individuals are homozygous for the alternative allele for this variant site.
The vtools report plot association command generates QQ and Manhattan plots from output of vtools associate command. More details about vtools report plot association can be found at
http://varianttools.sourceforge.net/VtoolsReport/PlotAssociation
QQ plots aid in evaluating if there is systematic inflation of test statistics. A common cause of inflation is population structure or batch effects. If you observe significant inflation of test you may consider including MDS components in the association test model.
This pipeline needs PLINK 1.07 and KING.
You should not arbitrarily include MDS (or PCA) components in the analysis. Instead put in each MDS component and examine the lambda value, i.e. include MDS component 1 them MDS components 1 and 2, etc. Visualization of the QQ plot is also useful to determine if population substructure/admixture is controlled.
Procedures for YRI sample association analysis is the same as for CEU samples as previously has been described, thus is left as an extra exercise for you to work on your own. Commands to perform analysis for YRI are found below:
Here we demonstrate the application of meta-analysis to combine association results from the two populations via vtools report meta_analysis
. More details about vtools report meta_analysis
command can be found at
http://varianttools.sourceforge.net/VtoolsReport/MetaAnalysis
The input to this command are the association results files generated from previous steps, for example:
To view the results,
Note that for genes that only appears in one study but not the other, or only have a valid p-value in one study but not the other, will be ignored from meta-analysis.
Analyzing variants with VAT is much like any other analysis software with a general workflow of:
The data cleaning and filtering conditions within this exercise should be considered as general guidelines. Your data may allow you to be laxer with certain criteria or force you to be more stringent with others.
[1]Wang, G.T., Peng, B., and Leal, S.M. (2014). Variant Association Tools for Quality Control and Analysis of Large-Scale Sequence and Genotyping Array Data. Am. J. Hum. Genet. 94, 770783
[2]Li B, Leal SM. Methods for detecting associations with rare variants for common diseases: application to analysis of sequence data. Am J Hum Genet 2008 83:311-21
[3]Auer PL, Wang G, Leal SM. Testing for rare variant associations in the presence of missing data. Genet Epidemiol 2013 37:529-38
[4]Liu DJ, Leal SM. A novel adaptive method for the analysis of next-generation sequencing data to detect com-plex trait associations with rare variants due to gene main effects and interactions. PLoS Genet 2010 6:e1001156
[5]Madsen BE, Browning SR. A groupwise association test for rare mutations using a weighted sum statistic. PLoS Genet 2009 5:e1000384
[6]Price AL, Kryukov GV, de Bakker PI, Purcell SM, Staples J, Wei LJ, Sunyaev SR. Pooled association tests for rare variants in exon-resequencing studies. Am J Hum Genet 20010 86:832-8
[7]Wu MC, Lee S, Cai T, Li Y, Boehnke M, Lin X. Rare-variant association testing for sequencing data with the sequence kernel association test. Am J Hum Genet 2011 89:82-93
[8]Lucas FAS, Wang G, Scheet P, Peng B. Integrated annotation and analysis of genetic variants from next-generation sequencing studies with variant tools. Bioinformatics 2012 28:421-2
[9]Wang K, Li M, Hakonarson H. ANNOVAR: functional annotation of genetic variants from high-throughput sequencing data. Nucleic Acids Res 2010 38:e164
[10]Manichaikul A, Mychaleckyj JC, Rich SS, Daly K, Sale M, Chen WM. Robust relationship inference in genome-wide association studies. Bioinformatics 2010 26(22):2867-2873
[11]Purcell S, Neale B, Todd-Brown K, Thomas L, Ferreira MAR, Bender D, Maller J, Sklar P, de Bakker PIW, Daly MJ & Sham PC. PLINK: a toolset for whole-genome association and population-based linkage analysis. Am J Hum Genet, 2007 81:559-75