Analyze your Data Faster with NASQAR: Nucleic Acid SeQuence Analysis Resource

The bioinformatics team at the NYU Center for Genomics and Systems Biology in Abu Dhabi and New York has recently developed NASQAR (Nucleic Acid SeQuence Analysis Resource), a web-based platform providing an intuitive interface to popular R-based bioinformatics data analysis and visualization tools including Seurat, DESeq2, Shaman, clusterProfiler, and more.

These tools, although powerful, typically require significant computational experience and lack a graphical user interface (GUI), making them inaccessible to many researchers. NASQAR addresses this problem by wrapping these tools in a beautifully simple UI, allowing users to explore their data in seconds using various tools and visualization methods. Users also have the ability to download analysis results and a wide array of figures.

The SeuratV3 Wizard in NASQAR

Behind the scenes data pre-processing allows for a seamless transition to downstream analysis, and sample data included with each app allows users to take a test drive without having any data on hand.

The web app lowers the entry barrier and provides greater independence for researchers with little or no computational experience to carry out standard bioinformatics analysis, visualize their data, and produce publication-ready data files and figures.

The platform is publicly accessible at

NASQAR is open-source and the code is available through GitHub at

NASQAR is also available as a Docker image at

Let us know what you think in the comments below!

GPU-Accelerated MinION Basecalling On the HPC

I recently helped the Rockman lab basecall their MinION sequencing data on the HPC, leveraging the power of the GPUs available there. This allowed us to bring the total time required for basecalling down to around five hours, from the two weeks(!) it was going to take on the desktop.

Since more people are beginning to perform MinION sequencing here at the Center for Genomics and Systems Biology, I thought it would be helpful to share the procedure for basecalling with GPUs on the HPC.

First, you’ll need to transfer your data to the HPC. I recommend rsync for this if you’re on a Mac. If you’re using Windows, I suggest WinSCP. You’ll also need to know the flowcell and kit that were used for sequencing (see this table for the full list of options), and lastly the output path where you want the basecalled fastq files to go.

Copy the script below to your local directory, modify the first four parameters shown in red (leave --device "auto" intact), then submit it to Slurm like this: sbatch script-name.s
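For reference, here is a rough sketch of what such a Slurm script can look like, assuming ONT’s guppy_basecaller is the basecaller in use. The resource requests, paths, flowcell, and kit values are placeholders and assumptions, not the exact script — edit the four parameters at the top for your run, and keep --device "auto" intact:

```shell
#!/bin/bash
#SBATCH --job-name=basecall
#SBATCH --output=basecall_%j.out
#SBATCH --error=basecall_%j.err
#SBATCH --gres=gpu:1
#SBATCH --time=12:00:00
#SBATCH --mem=8GB
#SBATCH --mail-type=BEGIN,END,FAIL

# The four parameters to edit (all placeholders):
INPUT_DIR=/scratch/<netid>/minion_run/fast5
OUTPUT_DIR=/scratch/<netid>/minion_run/basecalled
FLOWCELL=FLO-MIN106
KIT=SQK-LSK109

guppy_basecaller \
    --input_path "$INPUT_DIR" \
    --save_path "$OUTPUT_DIR" \
    --flowcell "$FLOWCELL" \
    --kit "$KIT" \
    --recursive \
    --device "auto"
```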

If you have multiple fast5 directories (for example: fast5_pass and fast5_skip), you can combine the fast5 files into one directory, or you can run the script twice, providing a different input path each time.
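Combining them is a one-liner (shown here with stand-in files; the directory names follow the MinKNOW defaults mentioned above):

```shell
# Create demo directories with stand-in fast5 files for illustration.
mkdir -p fast5_pass fast5_skip fast5_all
touch fast5_pass/batch_0.fast5 fast5_skip/batch_1.fast5

# Copy everything into a single input directory for basecalling.
cp fast5_pass/*.fast5 fast5_skip/*.fast5 fast5_all/
ls fast5_all
```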

If you’re doing RNA sequencing, you need to provide the --reverse_sequence argument as well.

The script above should notify you via email when it begins, ends, or if there are any problems, but you can also track its status using:

watch squeue -u <your_netID>

Questions? E-mail me (mk5636) or post in the comments below.

Available Flowcell + Kit Combinations

Beginners Guide: What is OpenStack?

OpenStack, a project originally started by NASA and Rackspace, is an open source cloud computing platform that enables users to access and control pools of compute, storage, and networking resources.

TechCrunch calls it “one of the most important and complex open-source projects you’ve never heard of”.

Like competitor Amazon Web Services (AWS) and other cloud platforms, OpenStack consists of several interrelated components: compute, storage, image management, networking, etc. Together these components deliver a massively scalable cloud operating system.

The principal difference between OpenStack and AWS is that OpenStack is open source and AWS is proprietary. With OpenStack, everything is transparent. Anyone can access the source code, build upon it, and freely share the changes with the community at large. OpenStack has the benefit of thousands of developers all over the world working in tandem to develop the strongest, most robust, and most secure product that they can. It’s backed by some of the biggest companies in software development and hosting including Dell, Cisco, IBM, Intel, and Citrix. The non-profit OpenStack Foundation oversees both development and community-building around the project.

OpenStack is used by numerous universities and research institutions around the world, and companies such as AT&T, Verizon, Comcast, and PayPal.

What can you do with OpenStack?

OpenStack allows users to deploy virtual machines on the fly. It makes horizontal scaling easy, which allows for dynamic allocation of resources that scale up or down depending on usage. Consider the simple example of a JupyterHub instance that sees spikes in usage during classes. If the resources required during classes were dedicated to JupyterHub permanently, these resources would be idle (and thus wasted) for much of the time when there were no classes. OpenStack allows these resources to be provisioned dynamically to meet the challenges of differing and dynamic workloads. Here are some ways in which OpenStack is being used today:


The European Bioinformatics Institute (EBI), part of the European Molecular Biology Laboratory (EMBL), uses OpenStack to power its collaborative research project known as the Embassy Cloud.

The Cloud Infrastructure for Microbial Bioinformatics (CLIMB) project aims to create the world’s largest single system dedicated to microbial bioinformatics research.

OpenStack at The European Organization for Nuclear Research (CERN): Exploring the origins of the universe with high-throughput computing.

NASA’s Jet Propulsion Laboratory is hoping to put humans on Mars by 2030, with the help of OpenStack.

Web Applications

OpenStack is a great option for running web apps such as Jupyter Notebook and Shiny. It’s also great for building and hosting your own web apps that require a web server, full control of your environment (i.e. sudo rights), and the ability to scale quickly if it becomes necessary. At NYU’s Center for Genomics and Systems Biology, the Huang Lab is using OpenStack to develop a next-generation, web-based genome browser.

Web Hosting

Industry-scale managed web hosting providers use OpenStack to power their services. After all, OpenStack was originally started by a web hosting company (Rackspace). For individual users, OpenStack provides a great resource for building and hosting your development site. For production websites, however, unless you’re willing to commit rigorously to keeping your system updated and secure, ensuring 99.9% uptime and data redundancy, implementing SSL and CDNs, etc., stick with a professional web hosting company.

Big Data

A primary use case of OpenStack in many areas is being able to provide scalable, elastic infrastructure for big data collection and analytics. For example, automakers (by some accounts the second largest generator of data) collect, store, and analyze petabytes of data globally from millions of vehicle sensors, customers, dealers, and social media using OpenStack.

Database as a Service (DBaaS)

OpenStack DBaaS runs in the data center, as well as public cloud. Deployment as a private cloud in a data center allows full control over practices and policies for data retention, encryption, backups, etc. OpenStack provides a single framework to operate multiple database technologies in a consistent way. This includes both SQL and NoSQL databases, ones optimized for both operational and analytical workloads, and both commercial and open source products. (In contrast, Amazon AWS only offers a limited number of database options.)

Video Processing and Content Delivery

A popular use case for OpenStack is video processing and content delivery across regions. Here’s an example of how DigitalFilm Tree funnels thousands of hours of video footage to make a 1 hour TV show using a mix of private and public OpenStack clouds.

How to Launch an OpenStack Instance and Setup Jupyter Notebook

    1. Login to your OpenStack dashboard. At NYU it’s

      Domain: NYU
      User Name: Your NYU NetID
      Password: Your NYU Password

      Note: You must be on the NYU network (on campus or on VPN) to access the NYU OpenStack portal.
      NYU OpenStack Login Screen

    2. Once you’re logged in, click on “Instances” under the “Compute” tab.
      OpenStack Dashboard
    3. Click “Launch Instance”.
      Launch Instance
    4. Provide your new instance with a name (ex: my_instance). Keep “Availability Zone” as nova, and “Count” as 1.
      Instance Details
    5. Click “Source”, select “Image” from “Select Boot Source”, and give it 10GB or more for “Volume Size”. Scroll to the bottom and click the “+” button beside “centos7” to select this image for your instance.
      Instance Source
    6. Next click on “Flavor”, then select the type of system you wish to deploy. For this tutorial, we’ll use the p100-1-tiny setup. Click the “+” button to allocate these resources to your instance.
      Select Instance Flavor
    7. Click “Networks”, then add the tenant network.
      Instance Networks
    8. Click “Key Pair”, then click + Create Key Pair
      Create Key Pair
    9. Give your Key Pair a name, then click Create Keypair. This will create the new key pair and automatically download it onto your computer. Keep this file in a safe and secure place; you will need it to log in to your instance.
      Create Key Pair 2
    10. Click “Launch Instance”
      Launch Instance after Setup
    11. It might take a minute before your instance is ready. Once it’s ready, click on it to bring up the instance overview.
      Instance Ready
    12. Click “Access & Security”.
      Instance Overview
    13. Click “Manage Rules”.
      Manage Rules
    14. Click “+ Add Rule”.
      Add Rule
    15. Select SSH for “Rule”, then click “Add”. This rule allows us to connect to our instance via SSH.
      Add SSH Rule
    16. Click “+ Add Rule” again, and add the HTTPS rule. This rule allows us to connect to our Jupyter notebook through a web browser.
      Add HTTPS Rule
    17. Click “Instances”, then select “Associate Floating IP” from the Actions drop-down menu.
      Associate Floating IP Address
    18. Select an IP address from the “IP Address” drop-down, select your newly created instance from the “Port to be associated” drop-down, then click “Associate”.
      Associate Floating IP Window
    19. Connect to your new instance via SSH using the floating IP address you associated in the previous step and the keypair.pem file you downloaded in step #9. You can always find your floating IP address on the instance overview screen.
      SSH Command:
      ssh -i <keypair.pem> centos@<floating_ip>


      ssh -i my-keypair.pem centos@

      Instance IP Address
      Logging In via SSH

    20. We now have a fresh system on which we will set up Jupyter. We’ll use Anaconda for this, which will take care of installing Python and all required dependencies, making setup a breeze. Since we have a fresh system, we need to install a few basic tools first to be able to download and install Anaconda:
      sudo yum update
      sudo yum install wget bzip2

      If it asks you the following:

      Is this ok [y/d/N]:

      Enter y and press return.

    21. Next, download and install Anaconda. Be sure to get the latest version from:

    22. Restart the terminal and enter the following command:
       jupyter notebook

      NYU users need to do a couple of workarounds due to stringent security protocols:

      1. Switch to root
        sudo su -
      2. Source the centos user .bashrc file (to load the necessary $PATH variables)
        source /home/centos/.bashrc
      3. Enter the following command to launch Jupyter
        jupyter notebook --ip= --port=443 --allow-root --no-browser
    23. If everything was successful, you should see something like the following. Pay attention to the token in the last line; you’ll need it in a moment. In this example, it’s 26340a16487173a8a3c58342ccb4c7248b4bc817cdf84de0
      [centos@my-instance ~]$ jupyter notebook
      [I 23:50:21.114 NotebookApp] JupyterLab extension loaded from /home/centos/anaconda3/lib/python3.7/site-packages/jupyterlab
      [I 23:50:21.114 NotebookApp] JupyterLab application directory is /home/centos/anaconda3/share/jupyter/lab
      [I 23:50:21.118 NotebookApp] Serving notebooks from local directory: /home/centos
      [I 23:50:21.118 NotebookApp] The Jupyter Notebook is running at:
      [I 23:50:21.119 NotebookApp] http://(my-instance.novalocal or
      [I 23:50:21.119 NotebookApp] Use Control-C to stop this server and shut down all kernels (twice to skip confirmation).
      [W 23:50:21.119 NotebookApp] No web browser found: could not locate runnable browser.
      [C 23:50:21.119 NotebookApp]
          Copy/paste this URL into your browser when you connect for the first time,
          to login with a token:
              http://(my-instance.novalocal or
    24. Now, connect to your jupyter notebook through your web browser by going to the following address:

      Where <floating_ip> is your Floating IP Address from Step #18 above.



    25. Login with your token which you receive in Step #23 above. Alternatively, use your token to create a password which you can use for authentication going forward. Note that you will need to restart jupyter notebook before your new password will take effect.
      Jupyter Notebook Login Screen
    26. Congratulations, you’ve launched your very own jupyter notebook in the cloud using OpenStack! When you’re done working with your instance, be sure to shut down your instance from the Actions drop-down menu on your Instances page. The next time you want to work with your jupyter notebook, simply restart your instance, enter the jupyter notebook command above, and connect through your web browser.
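One snag worth knowing about for the SSH step above: ssh will refuse a key file whose permissions are too open, printing an “UNPROTECTED PRIVATE KEY” warning. Restricting the key to owner read-only fixes it (the filename here is a stand-in for the key pair file you downloaded in step 9):

```shell
touch my-keypair.pem        # stand-in for the downloaded key pair file
chmod 400 my-keypair.pem    # owner read-only, as ssh requires
stat -c '%a' my-keypair.pem # prints 400
```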
reform: Modify Reference Sequence and Annotation Files Quickly and Easily

reform is a Python-based command line tool that allows for fast, easy, and robust editing of reference genome sequence and annotation files. With the increase in use of genome editing tools such as CRISPR/Cas9, and the use of reference genome based analyses, the ability to edit existing reference genome sequences and annotations to include novel sequences and features (e.g. transgenes, markers) is increasingly necessary. reform provides a fast, easy, reliable, and reproducible solution for creating modified reference genome data for use with standard analysis pipelines and the 10x Cell Ranger pipeline.

The Problem

While it is possible to edit reference sequences and annotations manually, the process is time consuming and prone to human error. Data that appears correct to the eye after manual editing may break strict file formatting rules rendering the data unusable during analysis.

The Solution

Execution of reform requires a reference sequence (fasta), reference annotation (GFF or GTF), the novel sequences to be added (fasta), and corresponding novel annotations (GFF or GTF). A user provides as arguments the name of the modified chromosome and either the position at which the novel sequence is inserted, or the upstream and downstream sequences flanking the novel sequences. This results in the addition and/or deletion of sequence from the reference in the modified fasta file. In addition to the novel annotations, any changes to the reference annotations that result from deleted or interrupted sequence are incorporated into the modified gff. Importantly, modified gff and fasta files include a record of the modifications.

The position at which the novel sequence is to be inserted can be provided in two different ways:

1) Provide position at which to insert novel sequence.

2) Provide upstream and downstream flanking sequences, resulting in deletion of reference sequence between these two regions and addition of the novel sequence.



reform requires Python3 and Biopython.

On the Prince HPC at NYU, simply load the required module:

module load biopython/intel/python3.6/1.72

Invoke the python script like this:

  --chrom=<chrom> \
  --position=<pos> \ 
  --in_fasta=<in_fasta> \
  --in_gff=<in_gff> \
  --ref_fasta=<ref_fasta> \


chrom ID of the chromosome to modify.

position Position in chromosome at which to insert <in_fasta>. Can use -1 to add to end of chromosome. Note: Either position, or upstream AND downstream sequence must be provided.

upstream_fasta Path to fasta file with upstream sequence. Note: Either position, or upstream AND downstream sequence must be provided.

downstream_fasta Path to fasta file with downstream sequence. Note: Either position, or upstream AND downstream sequence must be provided.

in_fasta Path to new sequence to be inserted into reference genome in fasta format.

in_gff Path to GFF file describing new fasta sequence to be inserted.

ref_fasta Path to reference fasta file.

ref_gff Path to reference gff file.


python3 \
  --chrom="I" \
  --upstream_fasta="data/up.fa" \
  --downstream_fasta="data/down.fa" \
  --in_fasta="data/new.fa" \
  --in_gff="data/new.gff" \
  --ref_fasta="data/Saccharomyces_cerevisiae.R64-1-1.dna.toplevel.fa" \


reformed.fa Modified fasta file.

reformed.gff3 Modified GFF file.

An Example Using 10x Cell Ranger

If you’re using the Cell Ranger pipeline, you’ll need to modify your GTF file with reform and then run cellranger mkref to create the new genome data needed for cellranger count. Here’s an example:

1) Prepare reference data using reform

We’ll need the original reference in fasta format, and a fasta file containing the novel sequence to be incorporated into the reference sequence. We’ll also need the original reference GTF file, and a GTF file with the annotations corresponding to the novel sequence.

Let’s get a reference sequence:


And the reference GTF:


Here’s the novel sequence we’re adding to the reference (new.fa):


Here’s the GTF corresponding to the novel sequence in new.fa (new.gtf):

I    reform    exon    1    240    .    +    0    gene_id "new_gene"; transcript_id "new_gene.1";
I    reform    CDS    1    238    .    +    0    gene_id "new_gene"; transcript_id "new_gene.1";
I    reform    start_codon    1    3    .    +    0    gene_id "new_gene"; transcript_id "new_gene.1";
I    reform    stop_codon    238    240    .    +    0    gene_id "new_gene"; transcript_id "new_gene.1";

In this example, we’ll add this new gene at position 2650 on chromosome “I”:

python3 \
 --chrom="I" \
 --position=2650 \
 --in_fasta="new.fa" \
 --in_gff="new.gtf" \
 --ref_fasta="Saccharomyces_cerevisiae.R64-1-1.dna.toplevel.fa" \

We should have two new files reformed.fa and reformed.gtf which we’ll need in the next step.

2) Run Cell Ranger mkref

cellranger mkref \
 --genome=output_genome \
 --fasta=reformed.fa \

This will produce a directory called output_genome, which we’ll need to provide in the next step.

3) Run Cell Ranger count

cellranger count \
 --id=run_1 \
 --transcriptome=output_genome \
 --fastqs=/path/to/fastq_path/ \

4) Putting it together

Here’s an sbatch script that runs everything together and makes use of resources on the HPC:

#!/bin/bash
#SBATCH --verbose
#SBATCH --job-name=reform_ranger
#SBATCH --output=reform_ranger_%j.out
#SBATCH --error=reform_ranger_%j.err
#SBATCH --time=12:00:00
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=20
#SBATCH --mem=110000
#SBATCH --mail-user=${USER}

module load cellranger/2.1.0
module load biopython/intel/python3.6/1.72

python3 \
 --chrom="I" \
 --position=2650 \
 --in_fasta="new.fa" \
 --in_gff="new.gtf" \
 --ref_fasta="Saccharomyces_cerevisiae.R64-1-1.dna.toplevel.fa" \

cellranger mkref \
 --genome=output_genome \
 --fasta=reformed.fa \
 --genes=reformed.gtf \
 --nthreads=20 \

cellranger count \
 --id=run_1 \
 --transcriptome=output_genome \
 --fastqs=/path/to/fastq_path/ \


Next-Generation Sequencing Analysis Resources

The NYU Center For Genomics and Systems Biology in New York and Abu Dhabi has developed a new website with resources for mastering NGS analysis:

Modules are designed to provide hands-on experience with analyzing next generation sequencing data. Standard pipelines are presented that provide the user with a step-by-step guide to using state-of-the-art bioinformatics tools. Each module includes sample data sets and scripts that can be accessed on NYU’s HPC facility.

Before you begin, it’s important to have some foundational skills in Linux and R. Check out the pre-requisites.

Included Modules:

  • File Formats: Learn about the variety of file formats in next generation sequencing data.
  • Alignment: The first step of most NGS analysis is to align the reads against the reference genome. This module describes how to map short DNA sequence reads, assess the quality of the alignment and prepare to visualize the mapping of the reads.
  • Visualization: Learn about the Integrative Genomics Viewer (IGV), a high-performance visualization tool for interactive exploration of large, integrated genomic datasets.
  • Variant Calling: Variant calling entails identifying single nucleotide polymorphisms (SNPs) and small insertions and deletions (indels) from next generation sequencing data. This module will cover SNP & Indel detection in germline cells.
  • RNA-Seq: Learn how to identify differential expression of genes by comparing different samples, or attempt to capture all RNA molecules in a given species.
  • HPC: Resources for analysing data on NYU’s High Performance Computing Cluster.
  • ChIP-seq Analysis: ChIP-seq combines chromatin immunoprecipitation with high throughput sequencing and is used to analyze protein interactions with DNA and identify their binding sites.
  • De novo genome assembly: De novo (from new) genome assembly refers to the process of reconstructing an organism’s genome from smaller sequenced fragments.
  • Metagenomics: Metagenomics is the study of genetic material recovered directly from environmental samples. This is useful when attempting to understand what microbes are present and what they are doing in a microbiome, such as soil or the human gut.
  • Single cell RNA sequencing: scRNA-seq allows us to understand cellular differences in expression, and hence is directly applicable to studies of cell heterogeneity, cell population and subpopulation identification, effects of low-copy mRNA distribution, and transcriptional regulation.

Note: New modules are added as they are developed, so this post may not reflect all modules currently available.

In addition to modules, material from the Bioinformatics And Data Analysis Seminars (BADAS) will be published on this site as well.

Building an Analysis Pipeline for HPC using Python

In this post we will build a pipeline for the HPC using Python 3. We will begin by building the foundation for a pipeline in Python in part 1, and then use that to build a simple NGS analysis pipeline in part 2.

At NYU, we submit jobs to the HPC using the Slurm Workload Manager. Therefore, we will implement this pipeline using Slurm; however, this could easily be substituted with PBS or another similar scheduler.

This tutorial assumes basic familiarity with Slurm and Python.

Part 1: Building a template for submitting jobs to Slurm in Python

The first thing we will do is build a function in Python that will be used to submit jobs to Slurm. The function will submit an sbatch job through the command line using the subprocess Python package.

Let’s start by taking a look at an example of submitting a simple job to Slurm directly from the command line, one that just prints “Hello World”.

sbatch -J my_simple_job -o my_simple_job.out -e my_simple_job.err --wrap='echo Hello World'

If you execute this command on the Prince or Dalma HPC, you should end up with two new files after the job completes (it should only take a second to finish):


If you look in my_simple_job.out you should see the text “Hello World”.

Let’s break down the command we just used. If you’re familiar with submitting Slurm jobs, this should be review:

sbatch: calling the scheduler
-J: job name
-o: standard out filename
-e: standard err filename
--wrap: this is the most important part, the command you wish to run. In this example, we simply asked to echo “Hello World”.

Let’s put this same command in a python function (we’ll call it sbatch()), and see how we can use this as a template to submit all our jobs. Note that we make use of the subprocess package to execute something on the command line from python. Before we use the subprocess package, we have to import it.

import subprocess

def sbatch():
     sbatch_command = "sbatch -J my_simple_job -o my_simple_job.out -e my_simple_job.err --wrap='echo Hello World'"
     sbatch_response = subprocess.getoutput(sbatch_command)

sbatch() # call the sbatch function

Save that as If you run this python script (python3, you should see a response from sbatch (indicating the ID of the job submitted), and you should find the same file my_simple_job.out with the text “Hello World”. Try changing the job name, output file names, or command, and re-run the script.

Let’s transform the sbatch function above into something more flexible.

def sbatch(job_name, command):
    sbatch_command = "sbatch -J {} -o {}.out -e {}.err --wrap='{}'".format(job_name, job_name, job_name, command)
    sbatch_response = subprocess.getoutput(sbatch_command)

We introduce two new parameters to the function: job_name and command. We also make use of the string format method in order to build strings easily. Let’s take a quick look at a couple simple examples of the string format method in Python.

name = "Rick"
print( "Hello {}".format(name) )
# prints "Hello Rick"

first_name = "Rick"
last_name = "Sanchez"
print( "Hello {} {}".format(first_name, last_name) )
# prints "Hello Rick Sanchez"

The nice thing about this method is that it automatically formats any field into a string, so you can fill values with integers, floating point numbers, mathematical functions, etc. Let’s take a look at one final example:

some_integer = 7
some_float = 2.99
print( "{} + {} = {} ".format(some_integer, some_float, some_integer + some_float) )
# prints "7 + 2.99 = 9.99"

Let’s go back to our sbatch example. What we have now is a function to which we can send a job_name and a command; the function will build the sbatch command using job_name as the job name and stdout/stderr file names, and will put whatever command we give it in the ‘wrap’ argument. Let’s try the same basic Hello World example using our new template.

import subprocess
def sbatch(job_name, command):
    sbatch_command = "sbatch -J {} -o {}.out -e {}.err --wrap='{}'".format(job_name, job_name, job_name, command)
    sbatch_response = subprocess.getoutput(sbatch_command)

sbatch("my_simple_job", "echo Hello World") # call the sbatch function, but provide job_name and command parameters.

Let’s add some additional parameters that we might want to use when submitting a job to sbatch: time, memory, and tasks. We will give each of these parameters a default value so they don’t need to be specified every time you call sbatch(), unless you want to override them. Let’s also add a mail option so Slurm will email us if our job ever fails.

def sbatch(job_name, command, time=4, mem=60, tasks=20):
    sbatch_command = "sbatch -J {} -o {}.out -e {}.err --mail-user=$USER --mail-type=FAIL -t {}:00:00 --mem={}000 --ntasks-per-node={} --wrap='{}'".format(job_name, job_name, job_name, time, mem, tasks, command)
    sbatch_response = subprocess.getoutput(sbatch_command)

We now have a function that allows us to quite easily submit any command we want to the Slurm scheduler, control some common job parameters, and notify us by email in case of failure. The final piece required to make this function operable in a pipeline is support for dependencies. We need this function to do two more things:

1) return a job id that can be passed as a dependency to another job
2) allow the sbatch function to accept a dependency, optionally

Let’s tackle #1 first:

When you submit an sbatch job, you typically receive a response like this:
Submitted batch job 123456789

We’re interested in just the job number. There are a few different ways to extract this number from the response; we’ll use a simple string split and take the last element:

job_id = sbatch_response.split(' ')[-1].strip()

The command above takes the sbatch_response (which is the output of running the sbatch command with subprocess.getoutput()), splits it based on the space (‘ ‘) delimiter, takes the last element (the job ID), and strips away any spaces or newline characters leaving just the job ID.

Now let’s implement support for #2: enabling the function to accept an optional dependency parameter.

We’ll start by adding a parameter called dep to the sbatch function definition and assigning it a default value of nothing (empty string).

def sbatch(job_name, command, time=4, mem=60, tasks=20, dep=''):

Now we’ll add a couple of new lines at the top of the function to check whether the dependency is empty, and add the required parameters to the sbatch command if it’s not. We also add the lines to parse and return the job_id.

def sbatch(job_name, command, time=4, mem=60, tasks=20, dep=''):
    if dep != '':
        dep = '--dependency=afterok:{} --kill-on-invalid-dep=yes '.format(dep)
    sbatch_command = "sbatch -J {} -o {}.out -e {}.err --mail-user=$USER --mail-type=FAIL -t {}:00:00 --mem={}000 --ntasks-per-node={} --wrap='{}' {}".format(job_name, job_name, job_name, time, mem, tasks, command, dep)
    sbatch_response = subprocess.getoutput(sbatch_command)
    job_id = sbatch_response.split(' ')[-1].strip()
    return job_id

Note that the dep parameter is always included in the sbatch command that we send to the command line. If there is no dependency specified, then dep will be an empty string.

We now have a function that can be used to build a fully functioning pipeline with dependencies!

Part 2: Building a Pipeline

In part 1 we built a template function to submit jobs to the HPC. In part 2, we will use this function to build a pipeline. For the purpose of this tutorial, we will implement a simple NGS analysis workflow consisting of FastQC, alignment, sorting, indexing, and generating alignment statistics.

Using dependencies in a pipeline allows us to maximize the resources we can use while minimizing the wait time to run our jobs. This also allows individual jobs to run in parallel (where possible), reducing our overall analysis time even further.

The figure below depicts the simple pipeline we will build. Note the dependencies (indicated with arrows) and the steps which run in parallel (same row).

1. Input files

We’ll use a single-end fastq file and we’ll align to the Unicorn genome.

input_1 = "sample_R1.fastq"

ref = "Unicorn_v1.fa"

For the purpose of this exercise, we’ll assume the index files for the aligner are already built.

2. Build the individual steps

Here’s the command for running fastqc:

fastqc file.fastq

We’ll build a function for each step of the workflow to keep things organized. These functions will call the sbatch() function we created above.

def fastqc():
    # Build the command for fastqc
    command = "fastqc {}".format(input_1)
    # Pass job name and command to sbatch
    # sbatch() will return the job_id
    job_id = sbatch('fastqc', command)

    # Return the job id
    return job_id

Now let’s build the functions for the remaining steps


def alignment():
    command = "bowtie2 -x {} -U {} > output.sam".format(ref, input_1)
    job_id = sbatch('align', command)
    return job_id

Note: The next three steps require a dependency, i.e. they require a previous job to complete before they can be started. For this reason, we will add the dep parameter to the following functions.


def stats(dep=''):
    command = "samtools flagstat output.sam > stats.txt"
    job_id = sbatch('stats', command, dep=dep)
    return job_id


def sort(dep=''):
    command = "samtools sort output.sam -o output.sorted.bam"
    job_id = sbatch('sort', command, dep=dep)
    return job_id


def index(dep=''):
    command = "samtools index output.sorted.bam output.sorted.bai"
    job_id = sbatch('index', command, dep=dep)
    return job_id

3. Build the pipeline

Now that we have built the individual functions for each step, we can stitch them together into a pipeline. Most of the heavy lifting is already done; this part is the easiest.

fastqc_jobid = fastqc()
alignment_jobid = alignment()

# Stats needs the alignment to complete before it can begin, so we pass 
# alignment_jobid to the stats() function
stats_jobid = stats(alignment_jobid)

# Sort needs the alignment to complete before it can begin, so we pass 
# alignment_jobid to the sort() function
sort_jobid = sort(alignment_jobid)

# Note: Both stats() and sort() will be submitted simultaneously when the alignment job is complete

# Index needs sorting to complete before it can begin, so we pass sort_jobid to index()
index_jobid = index(sort_jobid)
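One useful extension: Slurm's afterok condition accepts several colon-separated job IDs, so a step can wait on more than one upstream job. A small helper (hypothetical, not part of the pipeline above) makes this easy:

```python
# Join any number of job IDs into the colon-separated form that
# --dependency=afterok:<id1>:<id2>:... expects.
def combine_deps(*job_ids):
    return ':'.join(str(j) for j in job_ids)

print(combine_deps('111', '222'))  # 111:222

# Example: a hypothetical cleanup step that waits for both stats and index:
# cleanup_jobid = sbatch('cleanup', 'rm output.sam',
#                        dep=combine_deps(stats_jobid, index_jobid))
```

Because sbatch() simply formats the dep value into `afterok:{}`, the colon-separated list passes through unchanged.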

4. Putting it all together

Below you will find the full code for this pipeline. I’ve added three new components:

a) Parse the sample ID (from the input file name) and use it to name the outputs of the pipeline
b) Override some of the default sbatch job parameters (e.g. tasks, memory, and time) for the alignment job
c) Load the required modules (necessary on NYU’s HPC); this is added directly to the command for each step


import subprocess

# Inputs
input_1 = 'sample_R1.fastq'
ref = 'Unicorn_v1.fa'

# Parse the sample id so we can use it to name the outputs
# sample_id in this example will be 'sample_R1'
sample_id = input_1.split('.')[0]

def sbatch(job_name, command, time=4, mem=60, tasks=20, dep=''):
    if dep != '': dep = '--dependency=afterok:{} --kill-on-invalid-dep=yes '.format(dep)
    sbatch_command = "sbatch -J {} -o {}.out -e {}.err --mail-user=$ --mail-type=FAIL -t {}:00:00 --mem={}000 --ntasks-per-node={} --wrap='{}' {}".format(job_name, job_name, job_name, time, mem, tasks, command, dep)
    sbatch_response = subprocess.getoutput(sbatch_command)
    job_id = sbatch_response.split(' ')[-1].strip()
    return job_id

def fastqc():
    # Build the command for fastqc
    command = "module load fastqc/0.11.5; fastqc {}".format(input_1)
    # Pass job name and command to sbatch
    # sbatch() will return the job_id
    job_id = sbatch('fastqc', command)
    # Return the job id
    return job_id

def alignment():
    command = "module load bowtie2/intel/2.3.2; bowtie2 -x {} -U {} > {}.sam".format(ref, input_1, sample_id)
    job_id = sbatch('align', command, time=8, mem=120)
    return job_id

def stats(dep=''):
    command = "module load samtools/intel/1.6; samtools flagstat {}.sam > {}_stats.txt".format(sample_id, sample_id)
    job_id = sbatch('stats', command, dep=dep)
    return job_id

def sort(dep=''):
    command = "module load samtools/intel/1.6; samtools sort {}.sam -o {}.sorted.bam".format(sample_id, sample_id)
    job_id = sbatch('sort', command, dep=dep)
    return job_id

def index(dep=''):
    command = "module load samtools/intel/1.6; samtools index {}.sorted.bam {}.sorted.bai".format(sample_id, sample_id)
    job_id = sbatch('index', command, dep=dep)
    return job_id

# Run the pipeline!
fastqc_jobid = fastqc()
alignment_jobid = alignment()
stats_jobid = stats(alignment_jobid)
sort_jobid = sort(alignment_jobid)
index_jobid = index(sort_jobid)
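Once everything is submitted, you can check on the pipeline from Python too. The sketch below only builds the squeue command string so it can run anywhere; executing it for real of course requires being on the cluster:

```python
# Build a squeue command that reports only the pipeline's jobs.
# -h suppresses the header line; -j filters to the listed job IDs,
# so empty output means none of the jobs are still pending or running.
def squeue_command(job_ids):
    return "squeue -h -j {}".format(",".join(job_ids))

print(squeue_command(['101', '102', '103']))  # squeue -h -j 101,102,103

# On the cluster:
# still_running = bool(subprocess.getoutput(squeue_command([...])).strip())
```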

Congratulations! You’ve completed the tutorial. If you were following along, you should have the following output files once the pipeline has finished:

  • FastQC report files for sample_R1.fastq
  • sample_R1.sam (the alignment)
  • sample_R1_stats.txt (alignment statistics)
  • sample_R1.sorted.bam (sorted alignment)
  • sample_R1.sorted.bai (the BAM index)

Post below if you have any questions and I’ll do my best to answer.

Explore the New Shared Genome Resource

Explore the New Shared Genome Resource

Save time and resources with the local CGSB repository of commonly used genomic data sets. Data is obtained from Ensembl and NCBI. New versions/releases will be added periodically or upon request. Previous versions/releases will be preserved.

All files are readable from within the shared genome resource. There is no need to copy the file(s) to your local directory. The second table below shows all available data types.

All data are stored in a common location with the following naming convention:


Search and find your genome of interest in the table below. Clicking on the row containing your genome will generate the specific path to the data on the HPC (shown below the table).


Kingdom    Species    Source    Version

Available Data Types:

Name                  Extensions     Built With
Reference Sequence    .fa, .fna

Request Data

To request a new organism, version, or data type, please email

Remote Desktop Connection to Prince

Remote Desktop Connection to Prince

Connect to Prince using a remote desktop to analyze your data in RStudio, visualize in IGV, and interact with other GUI applications on the HPC.

Instructions for Mac users

Instructions for PC users

Instructions for Mac users

Initial Setup:

  1. Download TigerVNC:
  2. Set up a VNC password on Prince:
    Login to Prince and type the command vncpasswd
    If you ever forget your VNC password, you can reset it at any time by typing vncpasswd again (it will ask you to enter a password, then confirm the password)
  3. Copy this script into your home folder:

Once initial setup is complete, you can launch a vnc session and connect to it at any time by doing the following:

  1. Submit the sbatch script (which you obtained above) to the queue, like this:
    > sbatch
  2.  Once the job is running, look for the output file called VNC.o1234567, where ‘1234567’ is your job ID. Open that file and copy the first ssh command, which looks like this:
    ssh -L 5901:c06-04:5901
  3. Open a terminal on your Mac (a local terminal), paste in the command from above, and hit enter. You will be asked for your password (your regular NYU password).
  4. Open TigerVNC on your computer and enter localhost:X in the server field, where X is the display number (the VNC port minus 5900: if the port from the ssh command above was 5901, enter 1; if it was 5902, enter 2). Using the command above as an example, the correct entry would be localhost:1
    Press the ‘Connect’ button. You will then be asked for a password; enter the password you set up with vncpasswd (not your NYU password!). Click OK and it should launch the desktop.
  5. The first time you launch your remote desktop, you should disable the screensaver lock. Go to Applications -> System Tools -> Settings -> Power and change the setting for Blank Screen to Never. This will prevent you from being locked out of your desktop. This only needs to be done once.
  6. Now you can run IGV and load any file which you have access to on the HPC. Open a terminal and enter:
    > module load igv
  7. Your VNC session will be active for 12 hours. After this period, you will need to re-launch and connect to your remote desktop by redoing steps 1-4.
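The display-number arithmetic in step 4 trips people up, so here it is spelled out: VNC display N listens on TCP port 5900 + N, which is why port 5901 means display 1. A one-liner (illustrative only):

```python
def display_number(vnc_port):
    # VNC display N maps to TCP port 5900 + N, so invert that mapping.
    return vnc_port - 5900

print(display_number(5901))  # 1 -> connect to localhost:1
```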

Instructions for PC users

Initial Setup:

  1. Download Putty:
  2. Download and install TightVNC (accept the default settings when installing):
  3. Set up a VNC password on Prince:
    Login to Prince and type the command vncpasswd
    If you ever forget your VNC password, you can reset it at any time by typing vncpasswd again (it will ask you to enter a password, then confirm the password)
  4. Copy this script into your home folder:

Once initial setup is complete, you can launch a vnc session and connect to it at any time by doing the following:

  1. Submit the sbatch script (which you obtained above) to the queue, like this:
    > sbatch
  2. Once the job is running, look for the output file called VNC.o1234567, where ‘1234567’ is your job ID. Open that file and look at the first ssh command, which looks like this:
    ssh -L 5901:c06-04:5901
  3. From the line above, we know the VNC port is 5901 and the compute node is c06-04. You will need these two values.
  4. Open PuTTY and enter the Prince login hostname for Host Name and 22 for Port, as shown below.
  5. On the left, click to expand the options for ‘SSH’, then click ‘Tunnels’ (a). Enter the port number obtained in step 3 above into the ‘Source Port’ field. In the example shown above, the port number is 5901 (b). For ‘Destination’, enter compute-node:port. In the example above, this is c06-04:5901 (c). Click ‘Add’ (d). Then click ‘Open’ (e). When the terminal opens, authenticate with your NYU NetID and password.
    Note: this configuration (Source port + Destination) will need to be updated whenever a new VNC session is launched (whenever you submit a new job).
  6. Launch TightVNC Viewer. Enter the VNC Server as follows:
    If your port is 5900, VNC Server: localhost:0
    If your port is 5901, VNC Server: localhost:1
    If your port is 5902, VNC Server: localhost:2
    In the example above, the port was 5901, so we enter localhost:1
  7. Click Connect and enter your VNC password (not your NYU NetID password!), which you created in the initial setup.
  8. The first time you launch your remote desktop, you should disable the screensaver lock. Go to Applications -> System Tools -> Settings -> Power and change the setting for Blank Screen to Never. This will prevent you from being locked out of your desktop. This only needs to be done once.
  9. Now you can run IGV and load any file which you have access to on the HPC. Open a terminal and enter:
    > module load igv
  10. Your VNC session will be active for 12 hours. After this period, you will need to re-launch and connect to your remote desktop by submitting the script and configuring a new putty session.


Salmon & kallisto: Rapid Transcript Quantification for RNA-Seq Data

Salmon & kallisto: Rapid Transcript Quantification for RNA-Seq Data

Salmon and kallisto might sound like a tasty entree from a hip Tribeca restaurant, but the duo are in fact a pair of next-generation applications for rapid transcript quantification. They represent a new approach to transcript quantification using NGS that has a number of advantages over existing alignment-based methods. I’ve tried them both out and provide my thoughts below.

We are used to quantification methods that rely on full base-to-base alignment of reads to a reference, but newer methods leverage the idea that when quantifying reads, the necessary information is not where in a transcript a read aligns, but only which transcripts could have generated the read. This idea renders base-level alignment, which has been the computational bottleneck until now, unnecessary for this analysis. Because this approach does not involve actual alignment to the genome, it is sometimes referred to as ‘pseudoalignment’.



  • Extremely Fast & Lightweight – can quantify 20 million reads in under five minutes on a laptop computer
  • Easy to use
  • Sleuth – an interactive R-based companion for exploratory data analysis


  • No support for stranded libraries
    Update: kallisto now offers support for strand-specific libraries

kallisto, published in April 2016 by Lior Pachter and colleagues, is an innovative tool for quantifying transcript abundance. kallisto uses the concept of ‘pseudoalignments’, which are essentially relationships between a read and a set of compatible transcripts. kallisto obtains highly accurate pseudoalignments efficiently using k-mers together with the transcriptome de Bruijn graph (T-DBG). Grouping pseudoalignments belonging to the same set of transcripts into equivalence classes allows for a simpler model and more efficient use of the underlying algorithms. The result is a tool which is able to quantify a paired-end library of 22 million reads in 3.5 minutes on a single core with 8GB of memory.


Both kallisto and Salmon are orders of magnitude faster than Tophat + Cufflinks. Times computed on a 22M-read PE Illumina dataset with 1 CPU core and 8GB memory.

Speed is great, but accuracy is critical. The plot below displays a high correlation (r = 0.941) between results obtained by kallisto and Cufflinks. You can see that the majority of differences are for transcripts with low expression levels.

kallisto is self-described as being ‘near optimal’. In this case, ‘optimal’ refers to two key aspects of the analysis: speed and accuracy. While the benchmark for accuracy can be determined by comparing results to a known “gold standard” data set or to results from other existing tools, the benchmark for speed is typically based on how fast an input file can be read. A good way to measure optimal speed is to calculate how long it takes to read through a file; the Linux word count (wc) utility provides a measure of this. For example, if it takes your machine 45 seconds to perform a word count on a set of reads, 45 seconds could be considered the benchmark for speed in quantifying this set of reads (on your machine). You can see in the plot below that kallisto takes just over twice as long as it takes to read through the file. That’s not bad for the complex problem of quantifying RNA-seq data!


The Linux word count (wc) utility can serve as a benchmark for ‘optimal’ speed.
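If you want to establish this baseline on your own machine, timing wc is easy. The sketch below writes a small dummy FASTQ so it can run anywhere; in practice you would point wc at your real reads file:

```python
import os
import subprocess
import tempfile
import time

# Write a tiny dummy FASTQ (1,000 reads, 4 lines each) so the example
# is self-contained; substitute your real reads file in practice.
with tempfile.NamedTemporaryFile("w", suffix=".fastq", delete=False) as f:
    f.write("@read1\nACGT\n+\nIIII\n" * 1000)
    path = f.name

t0 = time.time()
out = subprocess.run(["wc", "-l", path], capture_output=True, text=True, check=True)
elapsed = time.time() - t0

print(out.stdout.split()[0], "lines")  # 4000 lines
print("wc baseline: {:.3f}s".format(elapsed))
os.remove(path)
```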

A companion to kallisto, sleuth is an R-based program for exploratory data analysis powered by Shiny. The tools in sleuth allow you to investigate transcript abundance data or combine results from multiple samples for differential expression analysis. Interactive tools to explore scatterplots, volcano plots, MA plots, heat maps, PCA analysis, and more make analyzing your data faster and easier. For example, when examining a volcano plot, users can highlight a region of interest and immediately see a list of the transcripts selected in a table below.

A Volcano Plot in sleuth
Selecting points on a volcano plot brings up the transcripts corresponding to those points in the table below, in real time.

Using the IDs of the transcripts, users can drill down further and examine boxplots of transcript abundances to see technical variation within each sample and biological variation between conditions.

Transcript View in sleuth
Enter a transcript ID in transcript view, or a gene ID in gene view, to visualize technical and biological variation quickly.

sleuth requires technical replicates for its analysis. In lieu of actual technical replicates, sleuth makes use of bootstrapped values which serve as accurate proxies. Bootstrapping here is the process of repeating the quantification analysis after resampling with replacement from the original data, in order to simulate random sampling associated with sequencing.
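To make the resampling idea concrete, here is a toy sketch of bootstrapping. Real bootstrapping in kallisto operates on equivalence-class counts rather than individual read labels as here; this only illustrates "resample with replacement, then re-tally":

```python
import random

def bootstrap_counts(read_assignments, n_boot=100, seed=0):
    """Toy bootstrap: resample the reads (here, transcript labels)
    with replacement and re-count each transcript per round."""
    rng = random.Random(seed)
    boots = []
    for _ in range(n_boot):
        resampled = rng.choices(read_assignments, k=len(read_assignments))
        counts = {}
        for tx in resampled:
            counts[tx] = counts.get(tx, 0) + 1
        boots.append(counts)
    return boots

# Each round re-tallies the same number of 'reads', so the counts vary
# from round to round -- a proxy for sequencing's technical variance.
boots = bootstrap_counts(["tx1", "tx1", "tx2", "tx3"], n_boot=3)
```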

sleuth computes differential expression using a statistical model that combines variance information from biological replicates as well as bootstrapped technical replicates to estimate true biological variance. A test of sleuth on data simulated according to the DESeq2 model found that sleuth significantly outperforms other methods (e.g. DESeq2, edgeR). A preprint is forthcoming.

The power of kallisto and sleuth lies in their speed and ease of use. A user can go from raw reads to analysis in minutes. As impressive as kallisto is, one major drawback is that its simplified model makes it unable to account for strandedness in reads. This seems like a major limitation given that most RNA-seq protocols generate stranded libraries. If support for strandedness is a requirement for your analysis, check out Salmon below. Update: kallisto now offers support for strand-specific libraries



  • Fast & Lightweight – can quantify 20 million reads in under eight minutes on a desktop computer
  • Support for strand-specific libraries
  • Accepts BAM and FASTQ files as input


  • No automatic support for compressed input files

From the creators of Sailfish, Salmon (released but unpublished) is another rapid quantification tool which leverages k-mer based counting to produce, in Salmon terms, ‘quasi-alignments’. Although Salmon is not quite as fast as kallisto and lacks some features such as seamless support for compressed files, its strong point is its ability to learn and account for the effects of experiment-specific parameters and biases, including support for strand-specific RNA-seq data.

In the data set used, the correlation between results obtained by Salmon and Cufflinks (r = 0.939) is nearly identical to that between kallisto and Cufflinks (r = 0.941).

Another feature Salmon offers is the ability to quantify pre-existing alignments (from BAM files). This means that if you already have BAM alignments (or will need BAMs for other reasons), you can provide Salmon with these. If you don’t have BAM files, you can provide your reads in fastq format and Salmon will create the quasi-alignments ‘wicked-fast’, and then quantify them.

Finally, if you’re quantifying your data with Salmon and would like to use sleuth for downstream analysis, you’re in luck. The creators of Salmon have created an R library called wasabi which prepares Salmon (and Sailfish) output for use with sleuth.

The latest versions of Salmon, kallisto, sleuth, and wasabi are available for use on Mercer.

Getting Started (on Mercer)


module load kallisto/intel/0.42.5
kallisto quant -i <kallisto_index> -o <output_dir> <read_1.fastq> <read_2.fastq>

To produce bootstrap values for downstream analysis with sleuth (in this example, 100 bootstraps):

kallisto quant -i <kallisto_index> -o <output_dir> -b 100 <read_1.fastq> <read_2.fastq>



# If your data is on Mercer, run sleuth in a VNC session for web browser functionality
# On Mercer, it is necessary to load Firefox (or a browser of your choice) prior to launching R
module load r/intel/3.2.2
module load firefox

The next step requires describing the experimental design and the relationship between the kallisto directories and the samples, which is variable depending on your experimental design and kallisto output structure. The final steps involve creating the sleuth object and performing the differential expression calculations, which take only a couple minutes in total to complete. Once these steps are complete, an interactive browser session can be launched. Please see Introduction to Sleuth for more.


module load salmon/gnu/0.6.0
salmon quant -i <salmon_index> -l <lib_type> -1 <read_1.fastq> -2 <read_2.fastq> -o <output_dir>

In this example, we quantify a paired end, stranded library (where read 1 comes from the reverse strand), from Arabidopsis thaliana, with 100 bootstraps:

salmon quant -i A_thaliana_salmon_index -l ISR --numBootstraps 100 -1 read_1.fastq -2 read_2.fastq -o salmon_out



kallisto Paper:

kallisto Manual:

Sleuth Blogpost:

Sleuth Tutorial:

Salmon Preprint:

Salmon Manual:

Node to Joy: Maximize Your Performance on the HPC

Node to Joy: Maximize Your Performance on the HPC

In this post we’ll discuss maximizing your performance on the HPC. This entry is aimed at experienced HPC users; for new users, please see Getting Started on the HPC.

Recent advances in sequencing technology have made High Performance Computing (HPC) more critical than ever in data-driven biological research. NYU’s HPC resources are available to all NYU faculty, staff, and faculty-sponsored students.

Familiarizing yourself with NYU’s HPC resources, and optimizing your use of this technology will enable maximum speed and efficiency in your analyses.

Get to Know Your Nodes

Mercer, the primary NYU HPC cluster in New York, contains 400 nodes, 16TB of total memory, and over 3200 CPU cores. When a job is submitted to the cluster, one or more nodes are allocated to the job based on the resources requested when submitting. Typically, a single node is sufficient, and the node will be allocated based on amount of memory and CPU cores requested.

For example, of the 400 nodes available on Mercer, 112 are 64GB nodes with 20 CPU cores each. Note that 64GB is the total memory per node, but only 62GB is available to jobs. Therefore, if 64GB of memory is requested, the job will run on the next-largest memory node. In this case, the node will be allocated based on how many processors per node (cores) are requested. There are twelve 96GB nodes with 12 cores on Mercer, so if 12 or fewer cores are requested, a 96GB node will be allocated. If 20 cores are requested, the next-highest-memory nodes with 20 cores are the forty-eight 192GB nodes (189GB available for jobs), so one of these will be allocated. See the list of Mercer nodes. There are also four new nodes with 500GB of available memory and 20 CPU cores, and two new nodes with 1.5TB of memory and 48 CPU cores, which are not included in this list.

Generally speaking, if 20 CPU cores on a single compute node are requested, it is fine to declare 62GB of memory. If more than 62GB of memory with 20 CPU cores on a single compute node is needed, declare 189GB. In both cases, since all 20 cores are being requested, no one else can share the compute node, so it is efficient to request all of the available memory there.
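The allocation logic described above can be sketched as a small lookup. The node classes come from the text; the 93GB usable figure for the 96GB nodes is an assumed value, not stated above:

```python
# (total_GB, usable_GB, cores) for the Mercer node classes discussed above.
# The 93GB usable figure for the 96GB nodes is an assumption.
NODE_CLASSES = [
    (64, 62, 20),
    (96, 93, 12),
    (192, 189, 20),
]

def pick_node(mem_gb, cores):
    # Walk the classes smallest-first and take the first that fits,
    # mirroring how the scheduler picks the next-largest node.
    for total, usable, ncores in NODE_CLASSES:
        if mem_gb <= usable and cores <= ncores:
            return "{}GB/{}-core node".format(total, ncores)
    return "large-memory node"

print(pick_node(64, 12))  # 96GB/12-core node (64GB exceeds the 62GB usable)
```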

Which Node is Right for your Analysis?

Choosing the right node for your analysis requires a little benchmarking and testing. Over-requesting resources when they are not needed might make your analyses take longer because of queues for the higher-resource nodes, since there are fewer of these. For example, a job which may have started almost immediately after submission on a lower-resource node and taken 30 minutes to run might have taken less time to complete on a 1TB node, but the queue time for this job might have been longer since there are only 3 nodes with 1TB of memory available.

It can be difficult to know what resources a particular job requires; this knowledge comes with experience. Enabling email notifications in your PBS scripts (with the directive #PBS -M) is very helpful as well, because you receive a report of the compute resources used after a job completes. These reports can guide resource requests for future jobs.

Finally, if you plan on running an application with multiple CPU cores, for example:

bowtie2 -p 20 -x <ref> -U <input.fq>

Ensure that you request this number of CPU cores when submitting your PBS job, for example:

#PBS -l nodes=1:ppn=20,mem=62GB,walltime=12:00:00

The bottom line: the time to complete a job on the HPC is a function of BOTH the queue time and the compute time. So, consider both when setting up your jobs.

To learn how to see which jobs are currently running on which nodes and cores visit:

For more information on running jobs on the NYU HPC cluster visit: