  1. Introduction to TREVA
  2. License
  3. Download
  4. Installation & System Requirements
  5. User Guide
  6. Team and contact
  7. Sponsors

Introduction to TREVA

TREVA (Targeted REsequencing Virtual Appliance) is a user-friendly virtual appliance containing complex bioinformatics pipelines that can be installed and set up with minimal effort. TREVA pipelines support a series of analyses commonly required for targeted resequencing and whole-exome sequencing data, including single-nucleotide and insertion/deletion variant calling, copy number analysis, and cohort-based analyses such as pathway analysis and detection of significantly mutated genes.

The current version of TREVA (TREVA-1) uses analysis packages and databases published in 2011/2012. The main packages include BWA, Picard, GATK, Genome MuSiC and GISTIC. The primary data source is Ensembl Release 64. The pipelines can be used to analyse standard Fastq or Bam files produced by next-generation sequencers, and have been used extensively in-house at Peter Mac on GAIIx and HiSeq data.

TREVA-1 was built with Oracle VirtualBox, running Ubuntu Lucid.

License

TREVA-1 is released under Apache License 2.0. This applies to the image and all the custom scripts generated at Peter Mac.

Below is a list of the specific licenses of all open source components included.

Download

Main images

Three versions of TREVA-1 are available for download. Download BOTH the .ovf and .vmdk files.

OVF (container) files (depending on your browser, you may need to right-click > "Save As", then change the extension back to .ovf):

VMDK (virtual disk) files - pick the mirror closest to you:

Alternatively, we can post a hard drive containing the images directly to you via FedEx (charges apply to cover the hard drive and postage & handling). Please contact us.

Patches

Patches are small downloads that update the base image. The following steps, run from within the VM, will download and install the patch:
    wget http://bioinformatics.petermac.org/treva/patch1.tgz
    tar xvfz patch1.tgz
    sudo ./install-patch1.sh
If you are behind a firewall, remember to set http_proxy for wget to work.
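
For example, a typical proxy configuration before running wget (the proxy host and port below are placeholders; use your institution's values):
    export http_proxy=http://proxy.example.org:8080
    export https_proxy=http://proxy.example.org:8080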

Installation & System Requirements

Minimum requirements:

Host machine: 64-bit machine, 8GB RAM, 2 CPUs, 250GB free hard drive space.

For whole-exome data, the virtual machine requires a minimum of 6GB RAM; higher is recommended if available (e.g. 16GB).

Installation:

  1. Download and install Oracle VirtualBox (Freeware; available for Mac, Windows and Linux)
  2. Import the TREVA-1 .ovf file you downloaded from this page (the .vmdk file must be in the same directory).
  3. Change the RAM and CPU settings of TREVA-1 to the desired levels.
  4. Test whether you can start the imported image. If you are prompted with a message about a failure to boot due to missing VT-x/AMD-V support, restart your computer, enter the BIOS settings and enable virtualization. Most modern computers support Virtualization Technology, but some require the VT-x/AMD-V (Virtualization Technology) features to be enabled manually in the BIOS.
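
If you prefer the command line, the import can also be scripted with VirtualBox's VBoxManage tool. The sketch below is illustrative only: it assumes the appliance is registered under the name "TREVA-1" and allocates the 6GB RAM / 2 CPUs minimum mentioned above; adjust to your hardware.
    VBoxManage import TREVA-1.ovf
    VBoxManage modifyvm "TREVA-1" --memory 6144 --cpus 2
    VBoxManage startvm "TREVA-1"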

User Guide

Foreword

Basic Linux skills are required (e.g. ls, cp, cd, nohup and the concept of environment variables). Any bioinformatician or computer scientist can run and maintain the pipelines. We also encourage interested biologists to give it a go.

Administrative details (for advanced users)

Username: treva
Password: treva
To gain root access: sudo su (then key in "treva" as password)
Mysql user: root
Mysql password: treva
Path where the main packages and pipelines are installed: /Software/ExomePipeline
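
For example, root and MySQL access can be checked from a terminal inside the VM using standard commands (nothing TREVA-specific is assumed here):
    sudo su                                  # enter "treva" when prompted to become root
    mysql -u root -p -e "show databases;"    # enter "treva" when prompted for the MySQL password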

Setup

To make your data accessible to TREVA, there are two common methods:

  1. Use VirtualBox's Shared Folders support by going to Settings > Shared Folders. Select the host folders containing your Fastq/Bam files and then enable auto-mount and write access. The shared folders will be visible under /media/sf_xxx.
  2. Use the Network File System (NFS) to mount a network folder onto the virtual machine. Contact your network administrator about how to do this.
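
As a rough sketch of option 2 (the server name and export path are placeholders; your network administrator will supply the real ones, and the NFS client tools may need to be installed first):
    sudo apt-get install nfs-common                                   # only if the NFS client is not already present
    sudo mkdir -p /mnt/Data
    sudo mount -t nfs fileserver.example.org:/export/sequencing /mnt/Data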

Optionally, you can set up an email server so that you receive pipeline progress notifications by email. The pipelines send an email on success or error, but they will still run even if email is not set up properly. The most common setup is as follows; some networks may require a different configuration.

  1. Open Terminal
  2. Run "sudo dpkg-reconfigure exim4-config"
  3. Select the SMARTHOST option to enable outbound email
  4. Enter your SMTP/Exchange server address as the relay host (ask your institutional system/network administrator)
  5. Test by sending yourself an email: "echo Test | mutt myemail@example.org"
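
If the test email does not arrive, the exim4 log is the usual place to look (standard Ubuntu location; not TREVA-specific):
    sudo tail -n 50 /var/log/exim4/mainlog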

If you will be processing multiple or larger-than-average files, you will need more temporary storage than the VM provides. To get more storage, point the pipeline's tmp folder to a location on your network/shared folder where more space is available:

  1. cd /Software/ExomePipeline/
  2. rmdir tmp
  3. ln -s /some/directory/on/your/network tmp
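
A quick check that the symlink points where you intended and that enough space is available:
    ls -ld /Software/ExomePipeline/tmp
    df -h /some/directory/on/your/network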

Test Data

Example data and commands for both the primary and secondary pipelines are included in /home/treva/TestData/. Try to understand the file structure and the single-line commands provided in the .sh files. In the VariantCallPipeline example, nohup is used to push the job into the background.
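
The general pattern used by those .sh files looks like the following sketch (the directory and script names here are illustrative only; use the actual files shipped in TestData):
    cd /home/treva/TestData/VariantCallPipeline      # assumed layout; adjust to the actual directory
    nohup ./run_example.sh > run_example.log 2>&1 &  # push the job into the background
    tail -f run_example.log                          # watch progress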

See this page regarding access to a test server (so you can test it before downloading).

Running the primary variant calling pipelines:

  1. Put the Fastq or Bam files into the proper directory structure:
    • For Fastq files: i) create a directory named after the sample (alphanumerics and dash only; underscores NOT allowed), ii) rename the files to <SAMPLENAME>_[XXX_]R1.fastq.gz and <SAMPLENAME>_[XXX_]R2.fastq.gz. XXX_ is an optional label or identifier and can contain multiple underscores (e.g. FlowcellID_LaneID_ ). The simplest form is Sample1_R1.fastq.gz and Sample1_R2.fastq.gz. A shell sketch of this preparation is given after this list.
    • For Bam files: i) the bam header should use SAMPLENAME as the Read Group and the bam file should be named <SAMPLENAME>_aligned.bam, ii) run "structuriseDirForBam.sh" to create the directory structure; "targetFolder" should be SAMPLENAME.
    • EXAMPLE - For a tumour (SAMP001T) with matched normal (SAMP001N), there are typically four files arranged in the following structure:
          /mnt/Data/MyProject/SAMP001T/SAMP001T_R1.fastq.gz
          /mnt/Data/MyProject/SAMP001T/SAMP001T_R2.fastq.gz
          /mnt/Data/MyProject/SAMP001N/SAMP001N_R1.fastq.gz
          /mnt/Data/MyProject/SAMP001N/SAMP001N_R2.fastq.gz
      
  2. cd (change directory) into the root directory where the samples are located. E.g., for the above example,
            cd /mnt/Data/MyProject/
    
  3. Execute "runSomatic.sh" if a matched control is available, or "runGermLine.sh" if not. Running either without any parameters prints the USAGE. E.g., for the above example,
            runSomatic.sh -t SAMP001T -n SAMP001N -e myemail@example.org -b $AgilentV2 -p 2
    
    This example assumes the data was captured using Agilent SureSelect Human Exome V2 (specified by -b) and uses a maximum of 2 threads for parallel processing (-p). Please refer to the Useful environment variables section regarding capture assays and their corresponding environment variables.
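
As a minimal sketch of the Fastq preparation in step 1 above (the original file names are placeholders; the sample and project names come from the example):
    mkdir -p /mnt/Data/MyProject/SAMP001T
    mv tumour_read1.fastq.gz /mnt/Data/MyProject/SAMP001T/SAMP001T_R1.fastq.gz
    mv tumour_read2.fastq.gz /mnt/Data/MyProject/SAMP001T/SAMP001T_R2.fastq.gz
    # ...and likewise for the matched normal SAMP001N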

Running the cohort pipeline:

  1. Group the samples that you want to co-analyse together in the same directory. For example, suppose you have five tumour samples with matched normals:
            /mnt/Data/MyProject/SAMP001T/
            /mnt/Data/MyProject/SAMP002T/
            /mnt/Data/MyProject/SAMP003T/
            /mnt/Data/MyProject/SAMP004T/
            /mnt/Data/MyProject/SAMP005T/
            /mnt/Data/MyProject/SAMP001N/
            /mnt/Data/MyProject/SAMP002N/
            /mnt/Data/MyProject/SAMP003N/
            /mnt/Data/MyProject/SAMP004N/
            /mnt/Data/MyProject/SAMP005N/
    			
  2. Prepare a tab-delimited file defining the sample subgroups based on phenotypes or clinical information. For example:
            ##SPECIES=HUMAN
            ##BED=$AgilentV2
            ##FILTER_FLAGS=CBS
            ##MIN_DEPTH=20
            ##GENES_TO_PLOT=TP53,BRAF,NRAS,KIT,PREX2
            #SAMPLE_ID    #CONTROL_ID    Cancer_Site    CancerSubtype
            SAMP001T      SAMP001N       Head/Neck      Nodular Melanoma
            SAMP002T      SAMP002N       Head/Neck      Nodular Melanoma
            SAMP003T      SAMP003N       Upper Limbs    Nodular Melanoma
            SAMP004T      SAMP004N       Upper Limbs    Superficial Spreading Melanoma
            SAMP005T      SAMP005N       Head/Neck      Superficial Spreading Melanoma
    			
    Header lines are used for configuration. FILTER_FLAGS and MIN_DEPTH specify how the primary variant calls should be filtered. CBS stands for "Canonical transcript", "Bidirectional reads" (requires reads on both strands to support the variant) and "Somatic" (ignore germline variants). Running "combine_variants.py" will list all available filters.
  3. Execute runCohort.py. Running it without parameters will print USAGE.
        runCohort.py -c sampleDef.txt -o outputDir -r /mnt/Data/MyProject -p 4
    		
    In this example, "sampleDef.txt" is the sample definition file described in the previous step, "outputDir" is the directory that will hold all results, "/mnt/Data/MyProject" is where all the sample data is located, and "-p 4" specifies the number of threads for parallel processing.
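
As with the test data example, a long cohort run can be pushed into the background with nohup so it keeps running after the terminal is closed (same command as above, simply wrapped):
    nohup runCohort.py -c sampleDef.txt -o outputDir -r /mnt/Data/MyProject -p 4 > cohort.log 2>&1 &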

Useful environment variables:

Capture assays - these environment variables hold the prefixes of the corresponding BED files:
	$AgilentKinome  (Agilent SureSelect Human Kinome Capture)
	$AgilentV2  (Agilent SureSelect Human Whole Exome V2)
	$AgilentV4  (Agilent SureSelect Human Whole Exome V4)
	$AgilentV5  (Agilent SureSelect Human Whole Exome V5)
	$AgilentMouse  (Agilent SureSelect MOUSE exome)

	$NimbleGenV1  (NimbleGen EZExome Human V1)
	$NimbleGenV2  (NimbleGen EZExome Human V2)

	$Illumina  (Illumina TruSeq Human Exome)

	(For capture assays not listed here, use prepareBed.sh to generate the required bed files.)

Fasta files:
	$HumanREF
	$MouseREF
VCF files:
	$HumandbSNP  (dbSNP for human)
	$MousedbSNP  (dbSNP for mouse)
	$COSMIC  (COSMIC variants)
	$HAPMAP  (HAPMAP variants)
	$G1000_snps  (1000 genome project SNPs)
	$G1000_indels  (1000 genome project INDELs)
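
To see what any of these variables point to, standard shell commands are enough (the grep pattern below is only an example):
	echo $AgilentV2
	echo $HumanREF
	env | grep -E 'Agilent|REF|dbSNP'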

Other useful utilities:

Team and contact

The bioinformatics group at Peter MacCallum Cancer Centre is dedicated to building the best pipelines for cancer research.

Pipeline developers:
	Jason Li
	Maria Doyle
	Jason Ellul
	David Goode
	Franco Caramia
	Ken Doig

Pipeline advisors:
	Richard Tothill
	Stephen Wong
	Victoria Mar
	Ella Thompson
	Grant McArthur
	Ian Campbell
	Alex Dobrovic
	Tony Papenfuss

VM developers:
	Jason Li
	Isaam Saeed
	Franco Caramia

Please contact Jason.Li@petermac.org for all matters in relation to TREVA.


Last modified 2nd September 2013