Introduction to TREVA
TREVA (Targeted REsequencing Virtual Appliance) is a user-friendly virtual appliance containing complex bioinformatics pipelines that can be installed and set up with minimal effort. TREVA pipelines support a series of analyses commonly required for targeted resequencing and whole-exome sequencing data, including: single-nucleotide and insertion/deletion variant calling, copy number analysis, and cohort-based analyses such as pathway analysis and identification of significantly mutated genes.
The current version of TREVA (TREVA-1) uses analysis packages and databases published in 2011/2012. The main packages include BWA, Picard, GATK, Genome MuSiC and GISTIC. The primary data source is Ensembl Release 64. The pipelines can be used to analyse standard Fastq or Bam files produced by next-generation sequencers, and have been used extensively in-house at Peter Mac on GAIIx and HiSeq data.
TREVA-1 was built with Oracle VirtualBox, running Ubuntu Lucid.
License
TREVA-1 is released under the Apache License 2.0. This applies to the image and all the custom scripts generated at Peter Mac. Below is a list of the specific licenses of all open source components included.
- BWA - GNU General Public License v3 - Reference
- Picard - Apache License V2.0, MIT License - Reference
- GATK - MIT License - Reference
- cutadapt - MIT License - Reference
- FastQC - GNU General Public License v2/3 - Reference
- Genome MuSiC - GNU Lesser General Public License (LGPL) v3 - Reference
- GISTIC - GISTIC 2.0 License Agreement - Reference
- MuTect - Broad Institute Software License Agreement - Reference
- CONTRA - GNU General Public License v3 - Reference
- ADTEx - GNU General Public License v3 - Reference
- Ensembl - modified Apache License - Reference
Download
Main images
Three versions of TREVA-1 have been made available for download:
- HUMAN - contains databases, files and pipelines for HUMAN data analysis
- MOUSE - contains databases, files and pipelines for MOUSE data analysis
- FULL - contains both HUMAN and MOUSE
OVF (container) files (depending on your browser, you may need to "Right Click > Save As > Change extension back to .ovf"):
VMDK (virtual disk) files - pick a mirror closest to you:
- TREVA-1-HUMAN (45GB) - Australia (NeCTAR Melb), Taiwan (Yourgene Bioscience)
- TREVA-1-MOUSE (35GB) - Australia (NeCTAR Melb), Taiwan (Yourgene Bioscience)
- TREVA-1-FULL (75GB) - Australia (NeCTAR Melb), Taiwan (Yourgene Bioscience)
Patches
Patches are small downloads that provide an update to the base image.
- Patch-1 - http://bioinformatics.petermac.org/treva/patch1.tgz - Updates to the Cohort Pipeline
wget http://bioinformatics.petermac.org/treva/patch1.tgz
tar xvfz patch1.tgz
sudo ./install-patch1.sh
If you are behind a firewall, remember to set http_proxy for wget to work.
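If you are unsure how to set the proxy, a minimal example follows (proxy.example.org:8080 is a placeholder; substitute your institution's proxy address):
export http_proxy=http://proxy.example.org:8080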
Installation & System Requirements
Minimum requirements:
Host machine: 64-bit machine, 8GB RAM, 2 CPUs, 250GB free hard drive space.
For whole-exome data, the virtual machine requires a minimum of 6GB RAM; higher is recommended if available (e.g. 16GB).
Installation:
- Download and install Oracle VirtualBox (Freeware; available for Mac, Windows and Linux)
- Import the TREVA-1 .ovf file you have downloaded from this page. (The corresponding .vmdk file must be in the same directory.)
- Change the RAM and CPU settings of TREVA-1 to the desired levels (a command-line alternative is sketched after this list).
- Test that the imported image boots. If you are prompted with a message about failure to boot due to missing VT-x/AMD-V support, restart your computer into the BIOS settings and enable those features. Most modern computers support Virtualization Technology (VT-x/AMD-V), but some require it to be activated manually in the BIOS.
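The RAM/CPU settings in step 3 can also be changed from the command line with VirtualBox's VBoxManage tool; in this sketch the VM name "TREVA-1" is assumed to match the name shown in your VirtualBox library:
VBoxManage modifyvm "TREVA-1" --memory 8192 --cpus 2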
User Guide
Foreword
Basic Linux skills are required (e.g. ls, cp, cd, nohup and the concept of environment variables). Any bioinformatician/computer scientist can run and maintain the pipelines. We also encourage interested biologists to give it a go.
Administrative details (for advanced users)
Username: treva
Password: treva
To gain root access: sudo su (then key in "treva" as password)
MySQL user: root
MySQL password: treva
Path where the main packages and pipelines are installed: /Software/ExomePipeline
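For example, to open a MySQL console inside the VM using the credentials above (standard mysql client syntax; note there is no space after -p):
mysql -u root -ptreva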
Setup
To make your data accessible by TREVA, two common methods are:
- Use VirtualBox's Shared Folders support by going to Settings > Shared Folders. Select the host folders containing your Fastq/Bam files and then enable auto-mount and write access. The shared folders will be visible under /media/sf_xxx.
- Use Network File System (NFS) techniques to mount a network folder onto the virtual machine. Contact your network administrator as to how to do this.
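For the NFS option, a typical mount looks like the sketch below (the server name and paths are placeholders; the actual export details come from your network administrator):
sudo mkdir -p /mnt/Data
sudo mount -t nfs fileserver.example.org:/export/data /mnt/Data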
Optionally, you can set up an email server so that you receive pipeline progress notifications via email. The pipelines send out an email upon success or error, but will still run even if email is not set up properly. The most common setup is as follows; some networks may require a different configuration.
- Open Terminal
- Run "sudo dpkg-reconfigure exim4-config"
- Select SMARTHOST to enable sending outbound email
- Type in your SMTP/Exchange server address as RELAY (ask your Institutional System/Network administrator)
- Test by sending yourself an email: "echo Test | mutt myemail@example.org"
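If the test email does not arrive, the exim log usually shows why; on Ubuntu it is written to /var/log/exim4/mainlog:
sudo tail /var/log/exim4/mainlog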
If you will be processing multiple or larger-than-average files, you will need more temporary storage than the VM provides. To gain more storage, point the tmp folder to a folder in your network/shared folder where more space is available:
- cd /Software/ExomePipeline/
- rmdir tmp
- ln -s /some/directory/on/your/network tmp
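A quick sanity check that the link is in place and that the target has enough space (standard commands; replace the path with your own storage location):
ls -ld /Software/ExomePipeline/tmp
df -h /some/directory/on/your/network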
Test Data
Example data and commands for both the primary and the secondary pipelines are included in /home/treva/TestData/. Try to understand the file structure and the single-line commands provided in the .sh files. In the VariantCallPipeline example, nohup is used to push the job into the background.
See this page regarding access to a test server (so you can test it before downloading).
Running the primary variant calling pipelines:
- Put fastq or bam files into the proper directory structure:
- For Fastq files: i) create a directory using the sample name (alphanumerics and dash only; underscore NOT allowed), ii) rename the files to <SAMPLENAME>_[XXX_]R1.fastq.gz and <SAMPLENAME>_[XXX_]R2.fastq.gz. XXX_ is an optional label or identifier and can contain multiple underscores (e.g. FlowcellID_LaneID_). The simplest form is Sample1_R1.fastq.gz and Sample1_R2.fastq.gz.
- For Bam files: i) the bam header should use SAMPLENAME as the Read Group and the file should be named <SAMPLENAME>_aligned.bam, ii) run "structuriseDirForBam.sh" to create the directory structure ("targetFolder" should be SAMPLENAME).
- EXAMPLE - For a tumour (SAMP001T) with matched normal (SAMP001N), there are typically four files arranged in the following structure:
/mnt/Data/MyProject/SAMP001T/SAMP001T_R1.fastq.gz
/mnt/Data/MyProject/SAMP001T/SAMP001T_R2.fastq.gz
/mnt/Data/MyProject/SAMP001N/SAMP001N_R1.fastq.gz
/mnt/Data/MyProject/SAMP001N/SAMP001N_R2.fastq.gz
- cd (change directory) into the root directory where the samples are located. E.g., for the above example:
cd /mnt/Data/MyProject/
- Execute "runSomatic.sh" if a matched control is available, or "runGermLine.sh" if not. Execution without any parameters will print USAGE. E.g., for the above example:
runSomatic.sh -t SAMP001T -n SAMP001N -e myemail@example.org -b $AgilentV2 -p 2
This example assumes the data was captured using Agilent SureSelect Human Exome V2 (specified by -b), with a maximum of 2 threads for parallel processing (-p). Please refer to the ENVIRONMENT VARIABLES section regarding capture assays and their corresponding environment variables.
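Putting the steps together, a complete run for the example above might look like the following (nohup keeps the pipeline running after you log out, as in the TestData examples; the log file name is arbitrary):
cd /mnt/Data/MyProject/
nohup runSomatic.sh -t SAMP001T -n SAMP001N -e myemail@example.org -b $AgilentV2 -p 2 > SAMP001.log 2>&1 &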
Running the cohort pipeline:
- Group together the samples that you want to co-analyse in the same directory. For example, you have five tumour samples with matched normals:
/mnt/Data/MyProject/SAMP001T/
/mnt/Data/MyProject/SAMP002T/
/mnt/Data/MyProject/SAMP003T/
/mnt/Data/MyProject/SAMP004T/
/mnt/Data/MyProject/SAMP005T/
/mnt/Data/MyProject/SAMP001N/
/mnt/Data/MyProject/SAMP002N/
/mnt/Data/MyProject/SAMP003N/
/mnt/Data/MyProject/SAMP004N/
/mnt/Data/MyProject/SAMP005N/
- Prepare a tab-delimited file defining the sample subgroups based on phenotypes or clinical information. For example:
##SPECIES=HUMAN
##BED=$AgilentV2
##FILTER_FLAGS=CBS
##MIN_DEPTH=20
##GENES_TO_PLOT=TP53,BRAF,NRAS,KIT,PREX2
#SAMPLE_ID	#CONTROL_ID	Cancer_Site	CancerSubtype
SAMP001T	SAMP001N	Head/Neck	Nodular Melanoma
SAMP002T	SAMP002N	Head/Neck	Nodular Melanoma
SAMP003T	SAMP003N	Upper Limbs	Nodular Melanoma
SAMP004T	SAMP004N	Upper Limbs	Superficial Spreading Melanoma
SAMP005T	SAMP005N	Head/Neck	Superficial Spreading Melanoma
Header lines are used for configuration. FILTER_FLAGS and MIN_DEPTH specify how the primary variant calls should be filtered. CBS represents "Canonical transcript", "Bidirectional reads" (requires reads on both strands to support the variants) and "Somatic" (ignore germline variants). Running "combine_variants.py" will list all available filters.
- Execute runCohort.py. Running it without parameters will print USAGE.
runCohort.py -c sampleDef.txt -o outputDir -r /mnt/Data/MyProject -p 4
In this example, "sampleDef.txt" is the sample definition file described in the previous step, "outputDir" is the directory that will hold all results, "/mnt/Data/MyProject" is where all the sample data is found, and "-p 4" specifies the number of threads for parallel processing.
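As with the primary pipelines, a long cohort run can be pushed into the background with nohup (the log file name here is arbitrary):
nohup runCohort.py -c sampleDef.txt -o outputDir -r /mnt/Data/MyProject -p 4 > cohort.log 2>&1 &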
Useful environment variables:
Capture assays - the environment variables hold the prefixes of the corresponding BED files:
$AgilentKinome (Agilent SureSelect Human Kinome Capture)
$AgilentV2 (Agilent SureSelect Human Whole Exome V2)
$AgilentV4 (Agilent SureSelect Human Whole Exome V4)
$AgilentV5 (Agilent SureSelect Human Whole Exome V5)
$AgilentMouse (Agilent SureSelect MOUSE exome)
$NimbleGenV1 (NimbleGen EZExome Human V1)
$NimbleGenV2 (NimbleGen EZExome Human V2)
$Illumina (Illumina TruSeq Human Exome)
(For capture assays not listed here, use prepareBed.sh to generate the required bed files.)
Fasta files:
$HumanREF
$MouseREF
VCF files:
$HumandbSNP (dbSNP for human)
$MousedbSNP (dbSNP for mouse)
$COSMIC (COSMIC variants)
$HAPMAP (HAPMAP variants)
$G1000_snps (1000 genome project SNPs)
$G1000_indels (1000 genome project INDELs)
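These are ordinary shell environment variables, so you can inspect them with standard commands, for example:
echo $AgilentV2    # prints the BED file prefix for Agilent SureSelect Human Whole Exome V2
ls -l $HumanREF*   # lists the human reference fasta files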
Other useful utilities:
- prepareBed.sh -- to convert target region bed files (of custom captures) into a format suitable for pipeline processing
- cmdqueue_new.py and cmdqueue_add.py -- a simple job submission/scheduling system, useful for managing a large number of concurrent processes (e.g. when you need to run the primary variant calling pipeline on a large number of samples with limited resources); see the sketch after this list.
- annotate_vcf.sh -- given a variant file in VCF format, runs the annotation section of our pipeline and appends annotation columns to the file.
- annotate_v2.pl -- given a bed file, appends columns with Gene Symbols, Nearest Genes and Strand.
- extract_seq.pl and extract_seq_byFile.pl -- given genome coordinates, extract DNA sequences from the reference genome.
- combine_performance.py -- combine the performance summaries of individual samples (generated by primary pipeline) into a single .csv file.
- rename_alignedSample.sh -- rename a BAM file AND set the Read Group to the sample name in the bam header.
- UpdateSequenceFile.pl -- convert old Illumina sequence.txt to new Sanger fastq format.
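As an illustration, below is a hypothetical sketch of queueing several somatic runs with cmdqueue; the cmdqueue_new.py/cmdqueue_add.py arguments shown are assumptions (run the scripts without parameters to see their actual USAGE), while the runSomatic.sh flags are those from the earlier example:
# Hypothetical: create a queue, then add one runSomatic.sh job per tumour/normal pair
cmdqueue_new.py myqueue
for s in SAMP001 SAMP002 SAMP003; do
  cmdqueue_add.py myqueue "runSomatic.sh -t ${s}T -n ${s}N -e myemail@example.org -b $AgilentV2 -p 2"
done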
Team and contact
The bioinformatics group at Peter MacCallum Cancer Centre is dedicated to building the best pipelines for cancer research.
Pipeline developers | Pipeline advisors | VM developers
---|---|---
Jason Li, Maria Doyle, Jason Ellul, David Goode, Franco Caramia, Ken Doig | Richard Tothill, Stephen Wong, Victoria Mar, Ella Thompson, Grant McArthur, Ian Campbell, Alex Dobrovic, Tony Papenfuss | Jason Li, Isaam Saeed, Franco Caramia
Please contact Jason.Li@petermac.org for all matters in relation to TREVA.