Next-Generation Sequencing Analysis Resources

The NYU Center for Genomics and Systems Biology in New York and Abu Dhabi has developed a new website with resources for mastering NGS analysis: https://learn.gencore.bio.nyu.edu/

Modules are designed to provide hands-on experience with analyzing next-generation sequencing data. Standard pipelines are presented that provide the user with a step-by-step guide to using state-of-the-art bioinformatics tools. Each module includes sample data sets and scripts that can be accessed on NYU's HPC facility.

Before you begin, it’s important to have some foundational skills in Linux and R. Check out the pre-requisites.

Included Modules:

  • File Formats: Learn about the variety of file formats in next generation sequencing data.
  • Alignment: The first step of most NGS analysis is to align the reads against the reference genome. This module describes how to map short DNA sequence reads, assess the quality of the alignment and prepare to visualize the mapping of the reads.
  • Visualization: Learn about the Integrative Genomics Viewer (IGV), a high-performance visualization tool for interactive exploration of large, integrated genomic datasets.
  • Variant Calling: Variant calling entails identifying single nucleotide polymorphisms (SNPs) and small insertions and deletions (indels) from next generation sequencing data. This module will cover SNP & Indel detection in germline cells.
  • RNA-Seq: Learn how to identify differential expression of genes by comparing different samples, or attempt to capture all RNA molecules in a given species.
  • HPC: Resources for analyzing data on NYU's High Performance Computing Cluster.
  • ChIP-seq Analysis: ChIP-seq combines chromatin immunoprecipitation with high-throughput sequencing and is used to analyze protein interactions with DNA and identify their binding sites.
  • De novo genome assembly: De novo (from new) genome assembly refers to the process of reconstructing an organism’s genome from smaller sequenced fragments.
  • Metagenomics: Metagenomics is the study of genetic material recovered directly from environmental samples. This is useful when attempting to understand what microbes are present and what they are doing in a microbiome, such as soil or the human gut.
  • Single cell RNA sequencing: scRNA-seq allows us to understand cellular differences in expression, and hence it is directly applicable to studies of cell heterogeneity, cell population and subpopulation identification, effects of low-copy mRNA distribution, and transcriptional regulation.

Note: New modules are added as they are developed, so this post may not reflect all modules currently available.

In addition to modules, material from the Bioinformatics And Data Analysis Seminars (BADAS) will be published on this site as well.

Building an Analysis Pipeline for HPC using Python

In this post we will build a pipeline for the HPC using Python 3. We will begin by building the foundation for a pipeline in Python in part 1, and then use that to build a simple NGS analysis pipeline in part 2.

At NYU, we submit jobs to the HPC using the Slurm Workload Manager. Therefore, we will implement this pipeline using Slurm; however, it could easily be adapted to PBS or another similar scheduler.

This tutorial assumes basic familiarity with Slurm and Python.

Part 1: Building a template for submitting jobs to Slurm in Python

The first thing we will do is build a function in Python that will be used to submit jobs to Slurm. The function will submit an sbatch job through the command line using the subprocess Python package.

Let's start by taking a look at an example of submitting a simple job to Slurm directly from the command line, one that just prints "Hello World".

sbatch -J my_simple_job -o my_simple_job.out -e my_simple_job.err --wrap='echo Hello World'

If you execute this command on the Prince or Dalma HPC, you should end up with two new files after the job completes (it should only take a second to finish):

my_simple_job.out
my_simple_job.err

If you look in my_simple_job.out you should see the text “Hello World”.

Let’s break down the command we just used. If you’re familiar with submitting Slurm jobs, this should be review:

sbatch: calling the scheduler
-J: job name
-o: standard out filename
-e: standard err filename
--wrap: this is the most important part, the command you wish to run. In this example, we simply asked to echo “Hello World”.

Let's put this same command in a Python function (we'll call it sbatch()), and see how we can use this as a template to submit all our jobs. Note that we make use of the subprocess package to execute a command on the command line from Python. Before we use the subprocess package, we have to import it.

import subprocess

def sbatch():
    sbatch_command = "sbatch -J my_simple_job -o my_simple_job.out -e my_simple_job.err --wrap='echo Hello World'"
    sbatch_response = subprocess.getoutput(sbatch_command)
    print(sbatch_response)

sbatch() # call the sbatch function

Save that as simple.py. If you run this Python script (python3 simple.py), you should see a response from sbatch (indicating the ID of the job submitted), and you should find the same file my_simple_job.out with the text "Hello World". Try changing the job name, output file names, or command, and re-run the script.

Let’s transform the sbatch function above into something more flexible.

def sbatch(job_name, command):
    sbatch_command = "sbatch -J {} -o {}.out -e {}.err --wrap='{}".format(job_name, job_name, job_name, command)
    sbatch_response = subprocess.getoutput(sbatch_command)
    print(sbatch_response)

We introduce two new parameters to the function: job_name and command. We also make use of the string format method in order to build strings easily. Let's take a quick look at a couple of simple examples of the string format method in Python.

name = "Rick"
print( "Hello {}".format(name) )
# prints "Hello Rick"

first_name = "Rick"
last_name = "Sanchez"
print( "Hello {} {}".format(first_name, last_name) )
# prints "Hello Rick Sanchez"

The nice thing about this method is that it automatically formats any field into a string, so you can fill values with integers, floating-point numbers, the results of mathematical expressions, etc. Let's take a look at one final example:

some_integer = 7
some_float = 2.99
print( "{} + {} = {} ".format(some_integer, some_float, some_integer + some_float) )
# prints "7 + 2.99 = 9.99"

Let's go back to our sbatch example. What we have now is a function to which we can send a job_name and a command; the function will build the sbatch command using job_name as the job name and stdout/stderr file names, and will put whatever command we give it in the 'wrap' argument. Let's try the same basic Hello World example using our new template.

import subprocess
def sbatch(job_name, command):
    sbatch_command = "sbatch -J {} -o {}.out -e {}.err --wrap='{}".format(job_name, job_name, job_name, command)
    sbatch_response = subprocess.getoutput(sbatch_command)
    print(sbatch_response)

sbatch("my_simple_job", "echo Hello World") # call the sbatch function, but provide job_name and command parameters.

Let's add some additional parameters that we might want to use when submitting a job to sbatch: time, memory, and tasks. We will give each of these parameters a default value so they don't need to be specified every time you call sbatch(), unless you want to override them. Let's also add a mail option so Slurm will email us if our job ever fails.

def sbatch(job_name, command, time=4, mem=60, tasks=20):
    sbatch_command = "sbatch -J {} -o {}.out -e {}.err --mail-user=$USER@nyu.edu --mail-type=FAIL -t {}:00:00 --mem={}000 --ntasks-per-node={} --wrap='{}'".format(job_name, job_name, job_name, time, mem, tasks, command)
    sbatch_response = subprocess.getoutput(sbatch_command)
    print(sbatch_response)

We now have a function that allows us to quite easily submit any command we want to the Slurm scheduler, control some common job parameters, and notify us by email in case of failure. The final piece required to make this function operable in a pipeline is support for dependencies. We need this function to do two more things:

1) return a job id that can be passed as a dependency to another job
2) allow the sbatch function to accept a dependency, optionally

Let’s tackle #1 first:

When you submit an sbatch job, you typically receive a response like this:
Submitted batch job 123456789

We're interested in just the job number. There are a few different ways to extract this number from the response; we'll use a simple string split and take the last element:

job_id = sbatch_response.split(' ')[-1].strip()

The command above takes the sbatch_response (which is the output of running the sbatch command with subprocess.getoutput()), splits it on the space (' ') delimiter, takes the last element (the job ID), and strips away any spaces or newline characters, leaving just the job ID.
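
To see exactly what that one-liner does, here it is applied to the example response above (the job number is made up):

sbatch_response = "Submitted batch job 123456789"  # example response with a made-up job number
job_id = sbatch_response.split(' ')[-1].strip()
print(job_id)  # prints 123456789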

Now let’s implement support for #2: enabling the function to accept an optional dependency parameter.

We’ll start by adding a parameter called dep to the sbatch function definition and assigning it a default value of nothing (empty string).

def sbatch(job_name, command, time=4, mem=60, tasks=20, dep=''):

Now we'll add a couple of new lines at the top of the function to check whether the dependency is empty, and add the required parameters to the sbatch command if it's not. We also add the lines to parse and return the job_id.

def sbatch(job_name, command, time=4, mem=60, tasks=20, dep=''):
    if dep != '':
        dep = '--dependency=afterok:{} --kill-on-invalid-dep=yes '.format(dep)
 
    sbatch_command = "sbatch -J {} -o {}.out -e {}.err --mail-user=$USER@nyu.edu --mail-type=FAIL -t {}:00:00 --mem={}000 --ntasks-per-node={} --wrap='{}' {}".format(job_name, job_name, job_name, time, mem, tasks, command, dep)
    sbatch_response = subprocess.getoutput(sbatch_command)
    print(sbatch_response) 
    job_id = sbatch_response.split(' ')[-1].strip()
    return job_id

Note that the dep parameter is always included in the sbatch command that we send to the command line. If there is no dependency specified, then dep will be an empty string.

We now have a function that can be used to build a fully functioning pipeline with dependencies!
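
To preview how this will be used in Part 2, the id returned by one call is simply passed as dep to the next (a minimal sketch with made-up job names and commands):

job_1 = sbatch('first_job', 'echo Step one')
job_2 = sbatch('second_job', 'echo Step two', dep=job_1)  # waits for first_job to finish successfully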

Part 2: Building a Pipeline

In part 1 we built a template function to submit jobs to the HPC. In part 2, we will use this function to build a pipeline. For the purpose of this tutorial, we will implement a simple NGS analysis workflow consisting of FastQC, alignment, sorting, indexing, and generating alignment statistics.

Using dependencies in a pipeline allows us to maximize the resources we can use while minimizing the wait time to run our jobs. This also allows individual jobs to run in parallel (where possible), reducing our overall analysis time even further.

The figure below depicts the simple pipeline we will build. Note the dependencies (indicated with arrows) and the steps which run in parallel (same row).

1. Input files

We’ll use a single-end fastq file and we’ll align to the Unicorn genome.

input_1 = 'sample_R1.fastq'

ref = 'Unicorn_v1.fa'

For the purpose of this exercise, we’ll assume the index files for the aligner are already built.
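
(If you did need to build them, a one-off job submitted through our sbatch() function would do it; bowtie2-build is bowtie2's standard indexing command, and the basename Unicorn_v1 is just our choice for this made-up genome.)

# Only needed if the Unicorn_v1.*.bt2 index files don't already exist
sbatch('build_index', 'bowtie2-build Unicorn_v1.fa Unicorn_v1')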

2. Build the individual steps

Here’s the command for running fastqc:

fastqc file.fastq

We’ll build a function for each step of the workflow to keep things organized. These functions will call the sbatch() function we created above.

def fastqc():
    # Build the command for fastqc
    command = "fastqc {}".format(input_1)

    # Pass job name and command to sbatch
    # sbatch() will return the job_id
    job_id = sbatch('fastqc', command)

    # Return the job id
    return job_id

Now let’s build the functions for the remaining steps

Alignment:

def alignment():
    command = "bowtie2 -x {} -U {} > output.sam".format(ref, input_1)
    job_id = sbatch('align', command)
    return job_id

Note: The next three steps require a dependency, i.e. they require a previous job to complete before they can be started. For this reason, we will add the dep parameter to the following functions.

Stats:

def stats(dep=''):
    command = "samtools flagstat output.sam > stats.txt"
    job_id = sbatch('stats', command, dep=dep)
    return job_id

Sort:

def sort(dep=''):
    command = "samtools sort output.sam -o output.sorted.bam"
    job_id = sbatch('sort', command, dep=dep)
    return job_id

Index:

def index(dep=''):
    command = "samtools index output.sorted.bam output.sorted.bai"
    job_id = sbatch('index', command, dep=dep)
    return job_id

3. Build the pipeline

Now that we have all our individual functions for each step built, we can stitch them together in a pipeline. Most of the heavy lifting is already done; this part is the easiest.

fastqc_jobid = fastqc()
alignment_jobid = alignment()

# Stats needs the alignment to complete before it can begin, so we pass 
# alignment_jobid to the stats() function
stats_jobid = stats(alignment_jobid)

# Sort needs the alignment to complete before it can begin, so we pass 
# alignment_jobid to the sort() function
sort_jobid = sort(alignment_jobid)

# Note: Both stats() and sort() will be submitted simultaneously when the alignment job is complete

# Index needs sorting to complete before it can begin, so we pass sort_jobid to index()
index_jobid = index(sort_jobid)

4. Putting it all together

Below you will find the full code for this pipeline. I’ve added three new components:

a) Parse the sample id (from the input file name) and use this to name the outputs of the pipeline
b) Override some of the default sbatch job parameters (ex: tasks, memory, and time) for the alignment job
c) Load the respective modules (required on NYU's HPC). I've added this directly into the command for each step.

 

import subprocess

# Inputs
input_1 = 'sample_R1.fastq'
ref = 'Unicorn_v1.fa'

# Parse the sample id so we can use it to name the outputs
# sample_id in this example will be 'sample_R1'
sample_id = input_1.split('.')[0]

def sbatch(job_name, command, time=4, mem=60, tasks=20, dep=''):
    if dep != '':
        dep = '--dependency=afterok:{} --kill-on-invalid-dep=yes '.format(dep)

    sbatch_command = "sbatch -J {} -o {}.out -e {}.err --mail-user=$USER@nyu.edu --mail-type=FAIL -t {}:00:00 --mem={}000 --ntasks-per-node={} --wrap='{}' {}".format(job_name, job_name, job_name, time, mem, tasks, command, dep)

    sbatch_response = subprocess.getoutput(sbatch_command)
    print(sbatch_response)

    job_id = sbatch_response.split(' ')[-1].strip()
    return job_id

def fastqc():
    # Build the command for fastqc
    command = "module load fastqc/0.11.5; fastqc {}".format(input_1)

    # Pass job name and command to sbatch
    # sbatch() will return the job_id
    job_id = sbatch('fastqc', command)

    # Return the job id
    return job_id

def alignment():
    command = "module load bowtie2/intel/2.3.2; bowtie2 -x {} -U {} > {}.sam".format(ref, input_1, sample_id)
    job_id = sbatch('align', command, time=8, mem=120)
    return job_id

def stats(dep=''):
    command = "module load samtools/intel/1.6; samtools flagstat {}.sam > {}_stats.txt".format(sample_id, sample_id)
    job_id = sbatch('stats', command, dep=dep)
    return job_id

def sort(dep=''):
    command = "module load samtools/intel/1.6; samtools sort {}.sam -o {}.sorted.bam".format(sample_id, sample_id)
    job_id = sbatch('sort', command, dep=dep)
    return job_id

def index(dep=''):
    command = "module load samtools/intel/1.6; samtools index {}.sorted.bam {}.sorted.bai".format(sample_id, sample_id)
    job_id = sbatch('index', command, dep=dep)
    return job_id
 
#Run the pipeline!
fastqc_jobid = fastqc() 
alignment_jobid = alignment() 
stats_jobid = stats(alignment_jobid)
sort_jobid = sort(alignment_jobid)
index_jobid = index(sort_jobid)
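
If you want to keep an eye on the submitted jobs without leaving Python, a quick status check like the one below works (squeue is Slurm's standard queue-listing command; $USER expands because subprocess.getoutput runs the command through a shell):

# Print the current state of our jobs in the Slurm queue
print(subprocess.getoutput("squeue -u $USER"))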

Congratulations! You’ve completed the tutorial. If you were following along, you should have the following output files once the pipeline has finished:

sample_R1_fastqc.zip
sample_R1_fastqc.html
sample_R1.sam
sample_R1_stats.txt
sample_R1.sorted.bam
sample_R1.sorted.bai

Post below if you have any questions and I’ll do my best to answer.

Explore the New Shared Genome Resource

Save time and resources with the local CGSB repository of commonly used genomic data sets. Data is obtained from Ensembl and NCBI. New versions/releases will be added periodically or upon request. Previous versions/releases will be preserved.

All files are readable from within the shared genome resource. There is no need to copy the file(s) to your local directory. The second table below shows all available data types.

All data are stored in a common location with the following naming convention:

/genomics/genomes/Public/<KINGDOM>/<SPECIES>/<SOURCE>/<VERSION>/
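
For example (a hypothetical entry, just to show how the convention reads; use the table below to get the exact path for your genome), a yeast assembly from NCBI would live at a path like:

/genomics/genomes/Public/Fungi/Saccharomyces_cerevisiae/NCBI/GCF_000146045.2_R64/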

Search for your genome of interest in the table below. Clicking on the row containing your genome will generate the specific path (shown below the table) to the data on the HPC.

Explore

The searchable species table lists each genome by Kingdom, Species, Source, and Version; selecting a row generates its path in the form /genomics/genomes/Public/<KINGDOM>/<SPECIES>/<SOURCE>/<VERSION>/.

Available Data Types:

  • Reference Sequence: .fa, .fna

Request Data

To request a new organism, version, or data type, please email mk5636@nyu.edu

Remote Desktop Connection to Prince

Connect to Prince using a remote desktop to analyze your data in RStudio, visualize in IGV, and interact with other GUI applications on the HPC.

Instructions for Mac users

Initial Setup:

  1. Download TigerVNC:
    https://bintray.com/tigervnc/stable/download_file?file_path=TigerVNC-1.8.0.dmg
  2. Set up a VNC password on Prince:
    Log in to Prince and type the command vncpasswd
    If you ever forget your VNC password, you can reset it at any time by typing vncpasswd again (it will ask you to enter a password, then confirm the password)
  3. Copy this script into your home folder:
    prince:/scratch/work/cgsb/scripts/launch_vnc/launch-vnc.sh

Once initial setup is complete, you can launch a VNC session and connect to it at any time by doing the following:

  1. Submit the sbatch script (which you obtained above) to the queue, like this:
    > sbatch launch-vnc.sh
  2.  Once the job is running, look for the output file called VNC.o1234567, where ‘1234567’ is your job ID. Open that file and copy the first ssh command, which looks like this:
    ssh -L 5901:c06-04:5901 netID@prince.hpc.nyu.edu
  3. Open a terminal from your Mac (local terminal) and paste in the command from above and hit enter, you will be asked for your password (your regular NYU password).
  4. Open TigerVNC on your computer and enter localhost:X in the server field, where X is the display number (the last digit of the 590X port from the ssh command above: if the port was 5901, enter 1; if it was 5902, enter 2). Using the command above as an example, the correct entry would be localhost:1
    Press the ‘Connect’ button. You will then be asked for a password; enter the password you set up with vncpasswd (not your NYU password!). Click OK and it should launch the desktop.
  5. The first time you launch your remote desktop, you should disable the screensaver lock. Go to Applications -> System Tools -> Settings -> Power and change the setting for Blank Screen to Never. This will prevent you from being locked out of your desktop. This only needs to be done once.
  6. Now you can run IGV and load any file which you have access to on the HPC. Open a terminal and enter:
    > module load igv
    > igv.sh
  7. Your VNC session will be active for 12 hours. After this period, you will need to re-launch and connect to your remote desktop by redoing steps 1-4.

Instructions for PC users

Initial Setup:

  1. Download Putty:
    32-bit: https://the.earth.li/~sgtatham/putty/latest/w32/putty.exe
    64-bit: https://the.earth.li/~sgtatham/putty/latest/w64/putty.exe
  2. Download and install TightVNC (accept the default settings when installing):
    http://www.tightvnc.com/download.php
  3. Set up a VNC password on Prince:
    Log in to Prince and type the command vncpasswd
    If you ever forget your VNC password, you can reset it at any time by typing vncpasswd again (it will ask you to enter a password, then confirm the password)
  4. Copy this script into your home folder:
    prince:/scratch/work/cgsb/scripts/launch_vnc/launch-vnc.sh

Once initial setup is complete, you can launch a vnc session and connect to it at any time by doing the following:

  1. Submit the sbatch script (which you obtained above) to the queue, like this:
    > sbatch launch-vnc.sh
  2. Once the job is running, look for the output file called VNC.o1234567, where ‘1234567’ is your job ID. Open that file and look at the first ssh command, which looks like this:
    ssh -L 5901:c06-04:5901 netID@prince.hpc.nyu.edu
  3. From the line above, we know the VNC port is 5901 and the compute node is c06-04. You will need these two values.
  4. Open Putty and enter prince.hpc.nyu.edu for Host Name and 22 for Port, as shown below.
  5. On the left, click to expand the options for ‘SSH’, then click ‘Tunnels’ (a). Enter the port number obtained in step 3 above into the ‘Source Port’ field. In the example shown above, the port number is 5901 (b). For ‘Destination’, enter compute-node:port. In the example above, this is c06-04:5901 (c). Click ‘Add’ (d). Then click ‘Open’ (e). When the terminal opens, authenticate with your NYU netID and password.
    Note: this configuration (Source port + Destination) will need to be updated whenever a new VNC session is launched (whenever you submit a new launch-vnc.sh job).
  6. Launch TightVNC Viewer. Enter the VNC Server as follows:
    If your port is 5900, VNC Server: localhost:0
    If your port is 5901, VNC Server: localhost:1
    If your port is 5902, VNC Server: localhost:2
    In the example above, the port was 5901, so we enter localhost:1
  7. Click Connect, then enter the VNC password (not your NYU NetID password!) which you created in the initial setup.
  8. The first time you launch your remote desktop, you should disable the screensaver lock. Go to Applications -> System Tools -> Settings -> Power and change the setting for Blank Screen to Never. This will prevent you from being locked out of your desktop. This only needs to be done once.
  9. Now you can run IGV and load any file which you have access to on the HPC. Open a terminal and enter:
    > module load igv
    > igv.sh
  10. Your VNC session will be active for 12 hours. After this period, you will need to re-launch and connect to your remote desktop by submitting the launch-vnc.sh script and configuring a new putty session.

 

Salmon & kallisto: Rapid Transcript Quantification for RNA-Seq Data

Salmon and kallisto might sound like a tasty entree from a hip Tribeca restaurant, but the duo are in fact a pair of next-generation applications for rapid transcript quantification. They represent a new approach to transcript quantification using NGS that has a number of advantages over existing alignment-based methods. I’ve tried them both out and provide my thoughts below.

We are used to quantification methods that rely on full base-to-base alignment of reads to a reference, but newer methods leverage the idea that when quantifying reads, the necessary information is not where in a transcript a read aligns, but only which transcripts could have generated the read. This idea renders base-level alignment, which has been the computational bottleneck until now, unnecessary for this analysis. Because this approach does not involve actual alignment to the genome, it is sometimes referred to as ‘pseudoalignment’.
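
To make the idea concrete, here is a toy Python sketch of the compatibility test (purely illustrative; it is not kallisto's or Salmon's actual algorithm, which relies on the transcriptome de Bruijn graph and many optimizations): a read is 'pseudoaligned' to the set of transcripts that contain all of its k-mers, without ever computing where it aligns.

k = 5

def kmers(seq, k):
    # All overlapping substrings of length k
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

# Toy transcriptome with made-up sequences
transcripts = {
    "tx1": "ATGGCGTACGTTAGC",
    "tx2": "ATGGCGTACGAATCG",
    "tx3": "TTTTTTTTTTTTTTT",
}
transcript_kmers = {name: kmers(seq, k) for name, seq in transcripts.items()}

read = "GCGTACG"

# The read is compatible with every transcript that contains all of its k-mers
compatible = [name for name, tk in transcript_kmers.items() if kmers(read, k) <= tk]
print(compatible)  # ['tx1', 'tx2'] -- positional alignment was never needed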

kallisto

Pros:

  • Extremely Fast & Lightweight – can quantify 20 million reads in under five minutes on a laptop computer
  • Easy to use
  • Sleuth – an interactive R-based companion for exploratory data analysis

Cons:

  • No support for stranded libraries
    Update: kallisto now offers support for strand specific libraries

kallisto, published in April 2016 by Lior Pachter and colleagues, is an innovative new tool for quantifying transcript abundance. kallisto uses the concept of ‘pseudoalignments’, which are essentially relationships between a read and a set of compatible transcripts. kallisto obtains highly accurate pseudoalignments efficiently using k-mers together with the transcriptome de Bruijn graph (T-DBG). Grouping of pseudoalignments belonging to the same set of transcripts into equivalence classes allows for a simpler model and more efficient use of underlying algorithms. The result is a tool which is able to quantify a paired-end library with 22 million reads in 3.5 minutes on a single-core with 8GB of memory.


Both kallisto and Salmon are orders of magnitude faster than TopHat + Cufflinks. Times computed on a 22M-read PE Illumina dataset with 1 CPU core and 8GB memory.

Speed is great, but accuracy is critical. The plot below displays a high correlation (r = 0.941) between results obtained by kallisto and Cufflinks. You can see that the majority of differences are for transcripts with low expression levels.

kallisto is self-described as being ‘near optimal’. In this case, ‘optimal’ refers to two key aspects of the analysis: speed and accuracy. While the benchmark for accuracy can be determined by comparing results to a known “gold standard” data set or to results from other existing tools, the benchmark for speed is typically based on how fast an input file can be read. A good way to measure optimal speed is to calculate how long it takes to read through a file; the Linux word count (wc) utility provides a measure of this. For example, if it takes your machine 45 seconds to perform a word count on a set of reads, 45 seconds could be considered the benchmark for speed in quantifying this set of reads (on your machine). You can see in the plot below that kallisto takes just over twice as long as it takes to read through the file. That's not bad for the complex problem of quantifying RNA-seq data!


The linux word count (wc) utility can serve as a benchmark for ‘optimal’ speed.
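
To get this baseline for your own data, you only need to time a read-through of the file, for example from Python (reads.fastq is a placeholder; substitute your own file):

import subprocess
import time

start = time.time()
subprocess.run(["wc", "-l", "reads.fastq"], stdout=subprocess.DEVNULL)  # just read through the file
print("wc took {:.1f} seconds".format(time.time() - start))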

A companion to kallisto, sleuth is an R-based program for exploratory data analysis powered by Shiny. The tools in sleuth allow you to investigate transcript abundance data or combine results from multiple samples for differential expression analysis. Interactive tools to explore scatterplots, volcano plots, MA plots, heat maps, PCA analysis, and more make analyzing your data faster and easier. For example, when examining a volcano plot, users can highlight a region of interest and immediately see a list of the transcripts selected in a table below.

A Volcano Plot in sleuth
Selecting points on a volcano plot brings up the transcripts corresponding to those points in the table below, in real time.

Using the IDs of the transcripts, users can drill down further and examine boxplots of transcript abundances to see technical variation in each sample and biological variation between conditions.

Transcript View in sleuth
Enter a transcript ID in transcript view, or a gene ID in gene view, to visualize technical and biological variation quickly.

sleuth requires technical replicates for its analysis. In lieu of actual technical replicates, sleuth makes use of bootstrapped values which serve as accurate proxies. Bootstrapping here is the process of repeating the quantification analysis after resampling with replacement from the original data, in order to simulate random sampling associated with sequencing.
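
Here is a minimal sketch of that resampling idea (illustrative only; kallisto and sleuth actually bootstrap over equivalence-class counts, not individual read labels like this):

import random

# Toy data: each "read" is labeled with the transcript it was assigned to
reads = ["tx1", "tx1", "tx2", "tx3", "tx3", "tx3"]

def bootstrap_counts(reads):
    # Resample the reads with replacement, then re-count per transcript
    resampled = [random.choice(reads) for _ in reads]
    counts = {}
    for tx in resampled:
        counts[tx] = counts.get(tx, 0) + 1
    return counts

for _ in range(3):  # three bootstrap replicates
    print(bootstrap_counts(reads))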

sleuth computes differential expression using a statistical model that combines variance information from biological replicates as well as bootstrapped technical replicates to estimate true biological variance. A test of sleuth on data simulated according to the DESeq2 model found that sleuth significantly outperforms other methods (ex: DESeq2, edgeR). A preprint is forthcoming.

The power of kallisto and sleuth lies in their speed and ease of use. A user can go from raw reads to analysis in minutes. As impressive as kallisto is, one major drawback is that its simplified model makes it unable to account for strandedness in reads. This seems like a major limitation given that most RNA-seq protocols generate stranded information. If support for strandedness is a requirement for your analysis, check out Salmon below. Update: kallisto now offers support for strand-specific libraries

Salmon

Pros:

  • Fast & Lightweight – can quantify 20 million reads in under eight minutes on a desktop computer
  • Support for strand-specific libraries
  • Accepts BAM and FASTQ files as input

Cons:

  • No automatic support for compressed input files

From the creators of Sailfish, Salmon (released but unpublished) is another rapid quantification tool which leverages k-mer based counting to produce, in Salmon terms, ‘quasi-alignments’. Although Salmon is not quite as fast as kallisto and lacks some features such as seamless support for compressed files, its strong point is its ability to learn and account for the effects of experiment-specific parameters and biases, including support for strand-specific RNA-seq data.

In the data set used, the correlation between results obtained by Salmon and Cufflinks (r = 0.939) is nearly identical to that between kallisto and Cufflinks (r = 0.941).

Another feature Salmon offers is the ability to quantify pre-existing alignments (from BAM files). This means that if you already have BAM alignments (or will need BAMs for other reasons), you can provide Salmon with these. If you don’t have BAM files, you can provide your reads in fastq format and Salmon will create the quasi-alignments ‘wicked-fast’, and then quantify them.

Finally, if you’re quantifying your data with Salmon and would like to use sleuth for downstream analysis, you’re in luck. The creators of Salmon have created an R library called wasabi which prepares Salmon (and Sailfish) output for use with sleuth.

The latest versions of Salmon, kallisto, sleuth, and wasabi are available for use on Mercer.

Getting Started (on Mercer)

kallisto

module load kallisto/intel/0.42.5
kallisto quant -i <kallisto_index> -o <output_dir> <read_1.fastq> <read_2.fastq>

To produce bootstrap values for downstream analysis with sleuth (in this example, 100 bootstraps):

kallisto quant -i <kallisto_index> -o <output_dir> -b 100 <read_1.fastq> <read_2.fastq>

 

sleuth

# If your data is on Mercer, run sleuth in a VNC session for web browser functionality
# On Mercer, it is necessary to load Firefox or a browser of your choice prior to launching R
module load r/intel/3.2.2
module load firefox
R
> library("sleuth")
> base_dir <- "/scratch/<netID>/kallisto-results/"

The next step requires describing the experimental design and the relationship between the kallisto directories and the samples, which is variable depending on your experimental design and kallisto output structure. The final steps involve creating the sleuth object and performing the differential expression calculations, which take only a couple minutes in total to complete. Once these steps are complete, an interactive browser session can be launched. Please see Introduction to Sleuth for more.

Salmon

module load salmon/gnu/0.6.0
salmon quant -i <salmon_index> -l <lib_type> -1 <read_1.fastq> -2 <read_2.fastq> -o <output_dir>

In this example, we quantify a paired-end, stranded library (where read 1 comes from the reverse strand) from Arabidopsis thaliana, with 100 bootstraps:

salmon quant -i A_thaliana_salmon_index -l ISR --numBootstraps 100 -1 read_1.fastq -2 read_2.fastq -o salmon_out

 

Resources

kallisto Paper:

http://www.nature.com/nbt/journal/vaop/ncurrent/full/nbt.3519.html

kallisto Manual:

http://pachterlab.github.io/kallisto/manual.html

Sleuth Blogpost:

https://liorpachter.wordpress.com/2015/08/17/a-sleuth-for-rna-seq/

Sleuth Tutorial:

https://rawgit.com/pachterlab/sleuth/master/inst/doc/intro.html

Salmon Preprint:

http://biorxiv.org/content/early/2015/06/27/021592

Salmon Manual:

http://salmon.readthedocs.io/en/latest/salmon.html

Node to Joy: Maximize Your Performance on the HPC

In this post we’ll discuss maximizing your performance on the HPC. This entry is aimed towards experienced HPC users; for new users, please see Getting Started on the HPC.

Recent advances in sequencing technology have made High Performance Computing (HPC) more critical than ever in data-driven biological research. NYU’s HPC resources are available to all NYU faculty, staff, and faculty-sponsored students.

Familiarizing yourself with NYU's HPC resources and optimizing your use of this technology will enable maximum speed and efficiency in your analyses.

Get to Know Your Nodes

Mercer, the primary NYU HPC cluster in New York, contains 400 nodes, 16TB of total memory, and over 3200 CPU cores. When a job is submitted to the cluster, one or more nodes are allocated to the job based on the resources requested when submitting. Typically, a single node is sufficient, and the node will be allocated based on amount of memory and CPU cores requested.

For example, of the 400 nodes available on Mercer, 112 are 64GB nodes with 20 CPU cores available per node. Note that 64GB is the total memory per node, but 62GB is the amount of memory available to jobs. Therefore, if 64GB of memory is requested, the job will run on the next-largest memory node. In this case, the node will be allocated based on how many processors per node (cores) are requested. There are twelve 96GB nodes with 12 cores on Mercer, so if 12 or fewer cores are requested, a 96GB node will be allocated. If 20 cores are requested, the next-highest memory nodes with 20 cores are the forty-eight 192GB nodes (189GB available for jobs), so one of these nodes will be allocated. See the list of Mercer nodes. There are also four new nodes with 500GB available memory and 20 CPU cores, and two new nodes with 1.5TB memory and 48 CPU cores, that are not included in this list.

Generally speaking, if 20 CPU cores on a single compute node is requested, it is ok to declare 62GB memory. If more than 62GB memory with 20 CPU cores on a single compute node is needed, it is ok to declare 189GB memory. In the above cases, since 20 cores are being requested, no one would be able to share the compute node (since all CPU cores are being used) and so it would be efficient to request all the available memory there.

Which Node is Right for your Analysis?

Choosing the right node for your analysis requires a little benchmarking and testing. Over-requesting resources when they are not needed might result in your analyses taking longer because of queues for the higher-resource nodes, since there are fewer of these. For example, a job which may have started almost immediately after submission on a lower-resource node and taken 30 minutes to run might have taken less time to complete on a 1TB node, but the queue time for this job might have been longer since there are only 3 nodes with 1TB memory available.

It can be difficult to know what resources are required for a particular job, and this knowledge only comes with experience. Enabling email notifications in your PBS scripts (with the directive #PBS -M NetID@nyu.edu) is very helpful as well because it allows you to get notifications about the compute resources used after a job is complete. These reports can guide resource requests for future jobs.

Finally, if you plan on running an application with multiple CPU cores, for example:

bowtie2 -p 20 -x <ref> -U <input.fq>

Ensure that you request this number of CPU cores when submitting your PBS job, as such:

#PBS -l nodes=1:ppn=20,mem=62GB,walltime=12:00:00

The bottom line: the time to complete a job on the HPC is a function of BOTH the queue time and the compute time. So, consider both when setting up your jobs.

To learn how to see which jobs are currently running on which nodes and cores visit: https://wikis.nyu.edu/display/NYUHPC/Interpreting+pbstop

For more information on running jobs on the NYU HPC cluster visit: https://wikis.nyu.edu/display/NYUHPC/Running+jobs+on+the+NYU+HPC+clusters

Variant Calling Pipeline: FastQ to Annotated SNPs in Hours

Identifying genomic variants, such as single nucleotide polymorphisms (SNPs) and DNA insertions and deletions (indels), can play an important role in scientific discovery. To this end, a pipeline has been developed to allow researchers at the CGSB to rapidly identify and annotate variants. The pipeline employs the Genome Analysis Toolkit (GATK) to perform variant calling and is based on the best practices for variant discovery analysis outlined by the Broad Institute. Once SNPs have been identified, SnpEff is utilized to annotate and predict the effects of the variants.

Full List of Tools Used in this Pipeline:

  • GATK
  • BWA
  • Picard
  • Samtools
  • Bedtools
  • SnpEff
  • R (dependency for some GATK and Picard steps)

Script Location:

prince:/scratch/work/cgsb/scripts/variant_calling/pipeline.sh

https://github.com/gencorefacility/variant-calling-pipeline

How to use this script:

  1. Create a new directory for this project in your home folder on scratch
    cd /scratch/<netID>/
    mkdir variant-calling
    cd variant-calling
  2. Copy pipeline.sh into your project directory
    cp /scratch/work/cgsb/scripts/variant_calling/pipeline.sh .
  3. The pipeline requires six input values. The values can be hard coded into the script, or provided as command line arguments. They are found at the very top of the script.
    1. DIR: The directory with your FastQ libraries to be processed. This should be located in /scratch/cgsb/gencore/out/<PI>/
    2. REF: Reference genome to use. This should be located in the Shared Genome Resource (/scratch/work/cgsb/reference_genomes/). Selecting a reference genome from within the Shared Genome Resource ensures that all index files and reference dictionaries required by BWA, Picard, GATK, etc. are available.
    3. SNPEFF_DB: The name of the SnpEff database to use. To determine the name of the SnpEff DB to use, issue the following command:
      java -jar /share/apps/snpeff/4.1g/snpEff.jar databases | grep -i SPECIES_NAME
      From the list of returned results, select the database you wish to use. The value of the first column is the value to input for SNPEFF_DB. Example:
      java -jar /share/apps/snpeff/4.1g/snpEff.jar databases | grep -i thaliana

      Output:

      athalianaTair10   Arabidopsis_Thaliana    http://downloads.sourceforge.net/...

      Then:

      SNPEFF_DB='athalianaTair10'

      Note: If your organism is not available in SnpEff, it is possible to create a custom SnpEff Database if a reference fasta and gff file are available.

    4. PL: Platform. Ex: illumina
    5. PM: Machine. Ex: nextseq
    6. EMAIL: Your email address. Used to report PBS job failures

    In the example below, I have hard coded the parameters in the script:

    DIR='/data/cgsb/gencore/out/Gresham/2015-10-23_HK5NHBGXX/lib1-26/'
    REF='/scratch/work/cgsb/reference_genomes/Public/Fungi/Saccharomyces_cerevisiae/GCF_000146045.2_R64/GCF_000146045.2_R64_genomic.fna'
    SNPEFF_DB='GCF_000146045.2_R64'
    PL='illumina'
    PM='nextseq'
    EMAIL='some1@nyu.edu' # Not a real email address

  4. Run the script. If you hard coded the parameters in the script as shown in the example above, no additional parameters need to be provided. Simply run:
    sh pipeline.sh
  5. Optional: Monitor the jobs
    watch -d qstat -u <netID>

Overview of the pipeline

Variant Calling Workflow

Important Notes:

  • The pipeline assumes no known variants are available for the Base Quality Score Recalibration step, and as such bootstraps a set of SNPs and Indels to provide as input at this step. Contact me if a list of high-quality known variants is available for your organism, as this data can improve the quality of the output
  • The current script is designed to work with Paired End data
  • Due to HPC queue limits on Mercer, the pipeline can run on a maximum of 25 libraries at a time
  • If you need to run the pipeline on more than 25 libraries or on Single End reads, or have any other questions, please contact me

Walk through

Step 1 Alignment – Map to Reference
Tool BWA MEM
Input .fastq files, reference genome
Output

aligned_reads.sam*

*Intermediary file, removed from final output

Notes

Need to provide the -M flag to BWA; this tells it to mark split reads as secondary, which is needed for GATK variant calling/Picard support. Alternate alignment tools: Bowtie2, Novoalign

Readgroup info is provided with the -R flag. This information is key for downstream GATK functionality. GATK will not work without a read group tag.

Command bwa mem -M -R '@RG\tID:sample_1\tLB:sample_1\tPL:ILLUMINA\tPM:HISEQ\tSM:sample_1' ref input_1 input_2 > aligned_reads.sam

 

Step 2 Sort SAM file by coordinate, convert to BAM
Tool Picard Tools
Input aligned_reads.sam
Output

sorted_reads.bam*

*Intermediary file, removed from final output

Command java -jar picard.jar SortSam INPUT=aligned_reads.sam OUTPUT=sorted_reads.bam SORT_ORDER=coordinate

 

Step 3 Collect Alignment & Insert Size Metrics
Tool Picard Tools, R, Samtools
Input sorted_reads.bam, reference genome
Output alignment_metrics.txt,
insert_metrics.txt,
insert_size_histogram.pdf,
depth_out.txt
Command

java -jar picard.jar CollectAlignmentSummaryMetrics R=ref I=sorted_reads.bam O=alignment_metrics.txt

java -jar picard.jar CollectInsertSizeMetrics INPUT=sorted_reads.bam OUTPUT=insert_metrics.txt HISTOGRAM_FILE=insert_size_histogram.pdf

samtools depth -a sorted_reads.bam > depth_out.txt

 

Step 4 Mark Duplicates
Tool Picard Tools
Input sorted_reads.bam
Output

dedup_reads.bam*

metrics.txt

*Intermediary file, removed from final output

Command java -jar picard.jar MarkDuplicates INPUT=sorted_reads.bam OUTPUT=dedup_reads.bam METRICS_FILE=metrics.txt

 

Step 5 Build BAM Index
Tool Picard Tools
Input dedup_reads.bam
Output

dedup_reads.bai*

*Intermediary file, removed from final output

Command java -jar picard.jar BuildBamIndex INPUT=dedup_reads.bam

 

Step 6 Create Realignment Targets
Tool GATK
Input dedup_reads.bam,
reference genome
Output realignment_targets.list
Notes This is the first step in a two-step process of realigning around indels
Command java -jar GenomeAnalysisTK.jar -T RealignerTargetCreator -R ref -I dedup_reads.bam -o realignment_targets.list

 

Step 7 Realign Indels
Tool GATK
Input dedup_reads.bam,
realignment_targets.list,
reference genome
Output realigned_reads.bam
Notes This step performs the realignment around the indels which were identified in the previous step (the ‘realignment targets’)
Command java -jar GenomeAnalysisTK.jar -T IndelRealigner -R ref -I dedup_reads.bam -targetIntervals realignment_targets.list -o realigned_reads.bam

 

Step 8 Call Variants
Tool GATK
Input realigned_reads.bam,
reference genome
Output raw_variants.vcf
Notes First round of variant calling. The variants identified in this step will be filtered and provided as input for Base Quality Score Recalibration
Command java -jar GenomeAnalysisTK.jar -T HaplotypeCaller -R ref -I realigned_reads.bam -o raw_variants.vcf

 

Step 9 Extract SNPs & Indels
Tool GATK
Input raw_variants.vcf,
reference genome
Output raw_indels.vcf, raw_snps.vcf
Notes This step separates SNPs and Indels so they can be processed and used independently
Command java -jar GenomeAnalysisTK.jar -T SelectVariants -R ref -V raw_variants.vcf -selectType SNP -o raw_snps.vcf
java -jar GenomeAnalysisTK.jar -T SelectVariants -R ref -V raw_variants.vcf -selectType INDEL -o raw_indels.vcf

 

Step 10 Filter SNPs
Tool GATK
Input raw_snps.vcf,
reference genome
Output filtered_snps.vcf
Notes

The filtering criteria for SNPs are as follows:

QD < 2.0
FS > 60.0
MQ < 40.0
MQRankSum < -12.5
ReadPosRankSum < -8.0
SOR > 4.0

Note: SNPs which are ‘filtered out’ at this step will remain in the filtered_snps.vcf file, however they will be marked as ‘basic_snp_filter’, while SNPs which passed the filter will be marked as ‘PASS’

Command java -jar GenomeAnalysisTK.jar -T VariantFiltration -R ref -V raw_snps.vcf --filterExpression 'QD < 2.0 || FS > 60.0 || MQ < 40.0 || MQRankSum < -12.5 || ReadPosRankSum < -8.0 || SOR > 4.0' --filterName "basic_snp_filter" -o filtered_snps.vcf

 

Step 11 Filter Indels
Tool GATK
Input raw_indels.vcf,
reference genome
Output filtered_indels.vcf
Notes

The filtering criteria for Indels are as follows:

QD < 2.0
FS > 200.0
ReadPosRankSum < -20.0
SOR > 10.0

Note: Indels which are ‘filtered out’ at this step will remain in the filtered_indels.vcf file, however they will be marked as ‘basic_indel_filter’, while Indels which passed the filter will be marked as ‘PASS’

Command java -jar GenomeAnalysisTK.jar -T VariantFiltration -R ref -V raw_indels.vcf --filterExpression 'QD < 2.0 || FS > 200.0 || ReadPosRankSum < -20.0 || SOR > 10.0' --filterName "basic_indel_filter" -o filtered_indels.vcf

 

Step 12 Base Quality Score Recalibration (BQSR) #1
Tool GATK
Input realigned_reads.bam,
filtered_snps.vcf,
filtered_indels.vcf,
reference genome
Output

recal_data.table*

*Intermediary file, removed from final output

Notes BQSR is performed twice. The second pass is optional, but is required to produce a recalibration report.
Command java -jar GenomeAnalysisTK.jar -T BaseRecalibrator -R ref -I realigned_reads.bam -knownSites filtered_snps.vcf -knownSites filtered_indels.vcf -o recal_data.table

 

Step 13 Base Quality Score Recalibration (BQSR) #2
Tool GATK
Input recal_data.table,
realigned_reads.bam,
filtered_snps.vcf,
filtered_indels.vcf,
reference genome
Output

post_recal_data.table*

*Intermediary file, removed from final output

Notes The second time BQSR is run, it takes the output from the first run (recal_data.table) as input
Command java -jar GenomeAnalysisTK.jar -T BaseRecalibrator -R ref -I realigned_reads.bam -knownSites filtered_snps.vcf -knownSites filtered_indels.vcf -BQSR recal_data.table -o post_recal_data.table

 

Step 14 Analyze Covariates
Tool GATK
Input recal_data.table,
post_recal_data.table,
reference genome
Output recalibration_plots.pdf
Notes This step produces a recalibration report based on the output from the two BQSR runs
Command java -jar GenomeAnalysisTK.jar -T AnalyzeCovariates -R ref -before recal_data.table -after post_recal_data.table -plots recalibration_plots.pdf

 

Step 15 Apply BQSR
Tool GATK
Input recal_data.table,
realigned_reads.bam,
reference genome
Output recal_reads.bam
Notes This step applies the recalibration computed in the first BQSR step to the bam file.
Command java -jar GenomeAnalysisTK.jar -T PrintReads -R ref -I realigned_reads.bam -BQSR recal_data.table -o recal_reads.bam

 

Step 16 Call Variants
Tool GATK
Input recal_reads.bam,
reference genome
Output raw_variants_recal.vcf
Notes Second round of variant calling performed on recalibrated bam
Command java -jar GenomeAnalysisTK.jar -T HaplotypeCaller -R ref -I recal_reads.bam -o raw_variants_recal.vcf

 

Step 17 Extract SNPs & Indels
Tool GATK
Input raw_variants_recal.vcf,
reference genome
Output raw_indels_recal.vcf, raw_snps_recal.vcf
Notes This step separates SNPs and Indels so they can be processed and analyzed independently
Command java -jar GenomeAnalysisTK.jar -T SelectVariants -R ref -V raw_variants_recal.vcf -selectType SNP -o raw_snps_recal.vcf
java -jar GenomeAnalysisTK.jar -T SelectVariants -R ref -V raw_variants_recal.vcf -selectType INDEL -o raw_indels_recal.vcf

 

Step 18 Filter SNPs
Tool GATK
Input raw_snps_recal.vcf,
reference genome
Output filtered_snps_final.vcf
Notes

The filtering criteria for SNPs are as follows:

QD < 2.0
FS > 60.0
MQ < 40.0
MQRankSum < -12.5
ReadPosRankSum < -8.0
SOR > 4.0

Note: SNPs which are ‘filtered out’ at this step will remain in the filtered_snps_final.vcf file, however they will be marked as ‘basic_snp_filter’, while SNPs which passed the filter will be marked as ‘PASS’

Command java -jar GenomeAnalysisTK.jar -T VariantFiltration -R ref -V raw_snps_recal.vcf --filterExpression 'QD < 2.0 || FS > 60.0 || MQ < 40.0 || MQRankSum < -12.5 || ReadPosRankSum < -8.0 || SOR > 4.0' --filterName "basic_snp_filter" -o filtered_snps_final.vcf

 

Step 19 Filter Indels
Tool GATK
Input raw_indels_recal.vcf,
reference genome
Output filtered_indels_final.vcf
Notes

The filtering criteria for Indels are as follows:

QD < 2.0
FS > 200.0
ReadPosRankSum < -20.0
SOR > 10.0

Note: Indels which are ‘filtered out’ at this step will remain in the filtered_indels_recal.vcf file, however they will be marked as ‘basic_indel_filter’, while Indels which passed the filter will be marked as ‘PASS’

Command java -jar GenomeAnalysisTK.jar -T VariantFiltration -R ref -V raw_indels_recal.vcf --filterExpression 'QD < 2.0 || FS > 200.0 || ReadPosRankSum < -20.0 || SOR > 10.0' --filterName "basic_indel_filter" -o filtered_indels_recal.vcf

 

Step 20 Annotate SNPs and Predict Effects
Tool SnpEff
Input filtered_snps_final.vcf
Output filtered_snps_final.ann.vcf,
snpeff_summary.html,
snpEff_genes.txt
Command java -jar snpEff.jar -v snpeff_db filtered_snps_final.vcf > filtered_snps_final.ann.vcf

 

Step 21 Compute Coverage Statistics
Tool Bedtools
Input recal_reads.bam
Output genomecov.bedgraph
Notes Load the genomecov.bedgraph file into IGV to view a coverage map at the entire genome or chromosome level
Command bedtools genomecov -bga -ibam recal_reads.bam > genomecov.bedgraph

 

Step 22 Compile Statistics
Tool parse_metrics.sh (in house)
Input alignment_metrics.txt,
insert_metrics.txt,
raw_snps.vcf,
filtered_snps.vcf,
raw_snps_recal.vcf,
filtered_snps_final.vcf,
depth_out.txt
Output report.csv
Notes A single report file is generated with summary statistics for all libraries processed containing the following pieces of information:

  • # of Reads
  • # of Aligned Reads
  • % Aligned
  • # Aligned Bases
  • Read Length
  • % Paired
  • Mean Insert Size
  • # SNPs, # Filtered SNPs
  • # SNPs after BQSR, # Filtered SNPs after BQSR
  • Average Coverage

This pipeline was presented at the NYU-HiTS seminar on March 03, 2016. Slides available here.