Introduction to TREVA
TREVA (Targeted REsequencing Virtual Appliance) is a user-friendly virtual appliance containing complex bioinformatics pipelines that can be installed and set up with minimal effort. TREVA pipelines support a series of analyses commonly required for targeted resequencing and whole-exome sequencing data, including: single-nucleotide and insertion/deletion variant calling, copy number analysis, and cohort-based analyses such as pathway analysis and identification of significantly mutated genes.
The current version of TREVA (TREVA-1) uses analysis packages and databases published in 2011/2012. The main packages include BWA, Picard, GATK, Genome MuSiC and GISTIC. The primary data source is Ensembl Release 64. The pipelines can be used to analyse standard Fastq or Bam files produced by next-generation sequencers, and have been used extensively in-house at Peter Mac on GAIIx and HiSeq data.
TREVA-1 was built with Oracle VirtualBox, running Ubuntu Lucid.
License
TREVA-1 is released under the Apache License 2.0. This applies to the image and all the custom scripts generated at Peter Mac. Below is a list of the specific licenses of all open source components included.
- BWA - GNU General Public License v3 - Reference
- Picard - Apache License V2.0, MIT License - Reference
- GATK - MIT License - Reference
- cutadapt - MIT License - Reference
- FastQC - GNU General Public License v2/3 - Reference
- Genome MuSiC - GNU Lesser General Public License (LGPL) v3 - Reference
- GISTIC - GISTIC 2.0 License Agreement - Reference
- MuTect - Broad Institute Software License Agreement - Reference
- CONTRA - GNU General Public License v3 - Reference
- ADTEx - GNU General Public License v3 - Reference
- Ensembl - modified Apache License - Reference
Download
Main images
Three versions of TREVA-1 have been made available for download:
- HUMAN - contains databases, files and pipelines for HUMAN data analysis
- MOUSE - contains databases, files and pipelines for MOUSE data analysis
- FULL - contains both HUMAN and MOUSE
OVF (container) files (depending on your browser, you may need to "Right Click > Save As > Change extension back to .ovf"):
VMDK (virtual disk) files - pick a mirror closest to you:
- TREVA-1-HUMAN (45GB) - Australia (NeCTAR Melb), Taiwan (Yourgene Bioscience)
- TREVA-1-MOUSE (35GB) - Australia (NeCTAR Melb), Taiwan (Yourgene Bioscience)
- TREVA-1-FULL (75GB) - Australia (NeCTAR Melb), Taiwan (Yourgene Bioscience)
Patches
Patches are small downloads that provide an update to the base image.
- Patch-1 - http://bioinformatics.petermac.org/treva/patch1.tgz - Updates to the Cohort Pipeline
wget http://bioinformatics.petermac.org/treva/patch1.tgz
tar xvfz patch1.tgz
sudo ./install-patch1.sh
If you are behind a firewall, remember to set http_proxy for wget to work.
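If you are unsure how to set the proxy, a minimal example follows (proxy.example.org:8080 is a placeholder; substitute your institution's proxy address):
export http_proxy=http://proxy.example.org:8080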
Installation & System Requirements
Minimum requirements:
Host machine: 64-bit machine, 8GB RAM, 2 CPUs, 250GB free hard drive space.
For whole-exome data, the virtual machine requires a minimum of 6GB RAM; higher is recommended if available (e.g. 16GB).
Installation:
- Download and install Oracle VirtualBox (Freeware; available for Mac, Windows and Linux)
- Import the TREVA-1 .ovf file you have downloaded from this page. (The corresponding .vmdk file must be in the same directory.)
- Change the RAM and CPU settings of TREVA-1 to the desired levels (a command-line alternative is sketched after this list).
- Test that the imported image boots. If you are prompted with a message about failure to boot due to missing VT-x/AMD-V support, restart your computer into the BIOS settings and enable those features. Most modern computers support Virtualization Technology (VT-x/AMD-V), but some require it to be activated manually in the BIOS.
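The RAM/CPU settings in step 3 can also be changed from the command line with VirtualBox's VBoxManage tool; in this sketch the VM name "TREVA-1" is assumed to match the name shown in your VirtualBox library:
VBoxManage modifyvm "TREVA-1" --memory 8192 --cpus 2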
User Guide
Foreword
Basic Linux skills are required (e.g. ls, cp, cd, nohup and the concept of environment variables). Any bioinformatician/computer scientist can run and maintain the pipelines. We also encourage interested biologists to give it a go.
Administrative details (for advanced users)
Username: treva
Password: treva
To gain root access: sudo su (then key in "treva" as password)
MySQL user: root
MySQL password: treva
Path where the main packages and pipelines are installed: /Software/ExomePipeline
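For example, to open a MySQL console inside the VM using the credentials above (standard mysql client syntax; note there is no space after -p):
mysql -u root -ptreva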
Setup
To make your data accessible by TREVA, two common methods are:
- Use VirtualBox's Shared Folders support by going to Settings > Shared Folders. Select the host folders containing your Fastq/Bam files and then enable auto-mount and write access. The shared folders will be visible under /media/sf_xxx.
- Use Network File System (NFS) techniques to mount a network folder onto the virtual machine. Contact your network administrator as to how to do this.
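For the NFS option, a typical mount looks like the sketch below (the server name and paths are placeholders; the actual export details come from your network administrator):
sudo mkdir -p /mnt/Data
sudo mount -t nfs fileserver.example.org:/export/data /mnt/Data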
Optionally, you can set up an email server so that you receive pipeline progress notifications via email. The pipelines send out an email upon success or error, but will still run even if email is not set up properly. The most common setup is as follows; some networks may require a different configuration.
- Open Terminal
- Run "sudo dpkg-reconfigure exim4-config"
- Select SMARTHOST to enable sending outbound email
- Type in your SMTP/Exchange server address as RELAY (ask your Institutional System/Network administrator)
- Test by sending yourself an email: "echo Test | mutt myemail@example.org"
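If the test email does not arrive, the exim log usually shows why; on Ubuntu it is written to /var/log/exim4/mainlog:
sudo tail /var/log/exim4/mainlog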
If you will be processing multiple or larger-than-average files, you will need more temporary storage than the VM provides. To gain more storage, point the tmp folder to a folder in your network/shared folder where more space is available:
- cd /Software/ExomePipeline/
- rmdir tmp
- ln -s /some/directory/on/your/network tmp
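A quick sanity check that the link is in place and that the target has enough space (standard commands; replace the path with your own storage location):
ls -ld /Software/ExomePipeline/tmp
df -h /some/directory/on/your/network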
Test Data
Example data and commands for both the primary and the secondary pipelines are included in /home/treva/TestData/. Try to understand the file structure and the single-line commands provided in the .sh files. In the VariantCallPipeline example, nohup is used to push the job into the background.
See this page regarding access to a test server (so you can test it before downloading).
Running the primary variant calling pipelines:
- Put fastq or bam files into the proper directory structure:
- For Fastq files: i) create a directory using the sample name (alphanumerics and dash only; underscore NOT allowed), ii) rename the files to <SAMPLENAME>_[XXX_]R1.fastq.gz and <SAMPLENAME>_[XXX_]R2.fastq.gz. XXX_ is an optional label or identifier and can contain multiple underscores (e.g. FlowcellID_LaneID_). The simplest form is Sample1_R1.fastq.gz and Sample1_R2.fastq.gz.
- For Bam files: i) the bam header should use SAMPLENAME as the Read Group and the file should be named <SAMPLENAME>_aligned.bam, ii) run "structuriseDirForBam.sh" to create the directory structure ("targetFolder" should be SAMPLENAME).
- EXAMPLE - For a tumour (SAMP001T) with matched normal (SAMP001N), there are typically four files arranged in the following structure:
/mnt/Data/MyProject/SAMP001T/SAMP001T_R1.fastq.gz
/mnt/Data/MyProject/SAMP001T/SAMP001T_R2.fastq.gz
/mnt/Data/MyProject/SAMP001N/SAMP001N_R1.fastq.gz
/mnt/Data/MyProject/SAMP001N/SAMP001N_R2.fastq.gz
- cd (change directory) into the root directory where the samples are located. E.g., for the above example:
cd /mnt/Data/MyProject/
- Execute "runSomatic.sh" if a matched control is available, or "runGermLine.sh" if not. Execution without any parameters will print USAGE. E.g., for the above example:
runSomatic.sh -t SAMP001T -n SAMP001N -e myemail@example.org -b $AgilentV2 -p 2
This example assumes the data was captured using Agilent SureSelect Human Exome V2 (specified by -b), with a maximum of 2 threads for parallel processing (-p). Please refer to the ENVIRONMENT VARIABLES section regarding capture assays and their corresponding environment variables.
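Putting the steps together, a complete run for the example above might look like the following (nohup keeps the pipeline running after you log out, as in the TestData examples; the log file name is arbitrary):
cd /mnt/Data/MyProject/
nohup runSomatic.sh -t SAMP001T -n SAMP001N -e myemail@example.org -b $AgilentV2 -p 2 > SAMP001.log 2>&1 &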
Running the cohort pipeline:
- Group together the samples that you want to co-analyse in the same directory. For example, you have five tumour samples with matched normals:
/mnt/Data/MyProject/SAMP001T/
/mnt/Data/MyProject/SAMP002T/
/mnt/Data/MyProject/SAMP003T/
/mnt/Data/MyProject/SAMP004T/
/mnt/Data/MyProject/SAMP005T/
/mnt/Data/MyProject/SAMP001N/
/mnt/Data/MyProject/SAMP002N/
/mnt/Data/MyProject/SAMP003N/
/mnt/Data/MyProject/SAMP004N/
/mnt/Data/MyProject/SAMP005N/
- Prepare a tab-delimited file defining the sample subgroups based on phenotypes or clinical information. For example:
##SPECIES=HUMAN
##BED=$AgilentV2
##FILTER_FLAGS=CBS
##MIN_DEPTH=20
##GENES_TO_PLOT=TP53,BRAF,NRAS,KIT,PREX2
#SAMPLE_ID	#CONTROL_ID	Cancer_Site	CancerSubtype
SAMP001T	SAMP001N	Head/Neck	Nodular Melanoma
SAMP002T	SAMP002N	Head/Neck	Nodular Melanoma
SAMP003T	SAMP003N	Upper Limbs	Nodular Melanoma
SAMP004T	SAMP004N	Upper Limbs	Superficial Spreading Melanoma
SAMP005T	SAMP005N	Head/Neck	Superficial Spreading Melanoma
Header lines are used for configuration. FILTER_FLAGS and MIN_DEPTH specify how the primary variant calls should be filtered. CBS represents "Canonical transcript", "Bidirectional reads" (requires reads on both strands to support the variants) and "Somatic" (ignore germline variants). Running "combine_variants.py" will list all available filters.
- Execute runCohort.py. Running it without parameters will print USAGE.
runCohort.py -c sampleDef.txt -o outputDir -r /mnt/Data/MyProject -p 4
In this example, "sampleDef.txt" is the sample definition file described in the previous step, "outputDir" is the directory that will hold all results, "/mnt/Data/MyProject" is where all the sample data is found, and "-p 4" specifies the number of threads for parallel processing.
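As with the primary pipelines, a long cohort run can be pushed into the background with nohup (the log file name here is arbitrary):
nohup runCohort.py -c sampleDef.txt -o outputDir -r /mnt/Data/MyProject -p 4 > cohort.log 2>&1 &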
Useful environment variables:
Capture assays - the environment variables hold the prefixes of the corresponding BED files:
$AgilentKinome (Agilent SureSelect Human Kinome Capture)
$AgilentV2 (Agilent SureSelect Human Whole Exome V2)
$AgilentV4 (Agilent SureSelect Human Whole Exome V4)
$AgilentV5 (Agilent SureSelect Human Whole Exome V5)
$AgilentMouse (Agilent SureSelect MOUSE exome)
$NimbleGenV1 (NimbleGen EZExome Human V1)
$NimbleGenV2 (NimbleGen EZExome Human V2)
$Illumina (Illumina TruSeq Human Exome)
(For capture assays not listed here, use prepareBed.sh to generate the required bed files.)
Fasta files:
$HumanREF
$MouseREF
VCF files:
$HumandbSNP (dbSNP for human)
$MousedbSNP (dbSNP for mouse)
$COSMIC (COSMIC variants)
$HAPMAP (HAPMAP variants)
$G1000_snps (1000 genome project SNPs)
$G1000_indels (1000 genome project INDELs)
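These are ordinary shell environment variables, so you can inspect them with standard commands, for example:
echo $AgilentV2    # prints the BED file prefix for Agilent SureSelect Human Whole Exome V2
ls -l $HumanREF*   # lists the human reference fasta files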
Other useful utilities:
- prepareBed.sh -- to convert target region bed files (of custom captures) into a format suitable for pipeline processing
- cmdqueue_new.py and cmdqueue_add.py -- a simple job submission/scheduling system, useful for managing a large number of concurrent processes (e.g. when you need to run the primary variant calling pipeline on a large number of samples with limited resources); see the sketch after this list.
- annotate_vcf.sh -- given a variant file in VCF format, runs the annotation section of our pipeline and appends annotation columns to the file.
- annotate_v2.pl -- given a bed file, appends columns with Gene Symbols, Nearest Genes and Strand.
- extract_seq.pl and extract_seq_byFile.pl -- given genome coordinates, extract DNA sequences from the reference genome.
- combine_performance.py -- combine the performance summaries of individual samples (generated by primary pipeline) into a single .csv file.
- rename_alignedSample.sh -- rename a BAM file AND set the Read Group to the sample name in the bam header.
- UpdateSequenceFile.pl -- convert old Illumina sequence.txt to new Sanger fastq format.
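As an illustration, below is a hypothetical sketch of queueing several somatic runs with cmdqueue; the cmdqueue_new.py/cmdqueue_add.py arguments shown are assumptions (run the scripts without parameters to see their actual USAGE), while the runSomatic.sh flags are those from the earlier example:
# Hypothetical: create a queue, then add one runSomatic.sh job per tumour/normal pair
cmdqueue_new.py myqueue
for s in SAMP001 SAMP002 SAMP003; do
  cmdqueue_add.py myqueue "runSomatic.sh -t ${s}T -n ${s}N -e myemail@example.org -b $AgilentV2 -p 2"
done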
Team and contact
The bioinformatics group at Peter MacCallum Cancer Centre is dedicated to building the best pipelines for cancer research.
Pipeline developers | Pipeline advisors | VM developers
---|---|---
Jason Li, Maria Doyle, Jason Ellul, David Goode, Franco Caramia, Ken Doig | Richard Tothill, Stephen Wong, Victoria Mar, Ella Thompson, Grant McArthur, Ian Campbell, Alex Dobrovic, Tony Papenfuss | Jason Li, Isaam Saeed, Franco Caramia
Please contact Jason.Li@petermac.org for all matters in relation to TREVA.