reform: Modify Reference Sequence and Annotation Files Quickly and Easily

reform is a Python-based command line tool that allows for fast, easy, and robust editing of reference genome sequence and annotation files. With the increasing use of genome editing tools such as CRISPR/Cas9 and of reference genome-based analyses, the ability to edit existing reference genome sequences and annotations to include novel sequences and features (e.g. transgenes, markers) is increasingly necessary. reform provides a fast, easy, reliable, and reproducible solution for creating modified reference genome data for use with standard analysis pipelines and the 10x Cell Ranger pipeline.

The Problem

While it is possible to edit reference sequences and annotations manually, the process is time-consuming and prone to human error. Data that appears correct to the eye after manual editing may break strict file formatting rules, rendering the data unusable during analysis.

The Solution

Execution of reform requires a reference sequence (fasta), a reference annotation (GFF or GTF), the novel sequence to be added (fasta), and corresponding novel annotations (GFF or GTF). A user provides as arguments the name of the chromosome to modify and either the position at which the novel sequence is inserted or the upstream and downstream sequences flanking the novel sequence. This results in the addition and/or deletion of sequence from the reference in the modified fasta file. In addition to the novel annotations, any changes to the reference annotations that result from deleted or interrupted sequence are incorporated into the modified GFF. Importantly, the modified GFF and fasta files include a record of the modifications.

The position at which the novel sequence is to be inserted can be provided in two different ways:

1) Provide the position at which to insert the novel sequence.

2) Provide upstream and downstream flanking sequences, resulting in deletion of the reference sequence between these two regions and addition of the novel sequence.

Download

https://github.com/gencorefacility/reform

Usage

reform requires Python3 and Biopython.
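
If you're working outside the HPC, a quick way to check whether Biopython is already available in your Python 3 environment is to print its version:

import Bio
print(Bio.__version__)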

On the Prince HPC at NYU, simply load the required module:

module load biopython/intel/python3.6/1.72

Invoke the python script like this:

python3 reform.py \
  --chrom=<chrom> \
  --position=<pos> \
  --in_fasta=<in_fasta> \
  --in_gff=<in_gff> \
  --ref_fasta=<ref_fasta> \
  --ref_gff=<ref_gff>

Parameters

chrom ID of the chromosome to modify

position Position in chromosome at which to insert <in_fasta>. Can use -1 to add to end of chromosome. Note: Either position, or upstream AND downstream sequence must be provided.

upstream_fasta Path to Fasta file with upstream sequence. Note: Either position, or upstream AND downstream sequence must be provided.

downstream_fasta Path to Fasta file with downstream sequence. Note: Either position, or upstream AND downstream sequence must be provided.

in_fasta Path to new sequence to be inserted into reference genome in fasta format.

in_gff Path to GFF file describing new fasta sequence to be inserted.

ref_fasta Path to reference fasta file.

ref_gff Path to reference gff file.

Example

python3 reform.py \
  --chrom="I" \
  --upstream_fasta="data/up.fa" \
  --downstream_fasta="data/down.fa" \
  --in_fasta="data/new.fa" \
  --in_gff="data/new.gff" \
  --ref_fasta="data/Saccharomyces_cerevisiae.R64-1-1.dna.toplevel.fa" \
  --ref_gff="data/Saccharomyces_cerevisiae.R64-1-1.34.gff3"

Output

reformed.fa Modified fasta file.

reformed.gff3 Modified GFF file.

An Example Using 10x Cell Ranger

If you're using the Cell Ranger pipeline, you'll need to modify your GTF file with reform and then run cellranger mkref to create the new genome data needed for cellranger count. Here's an example:

1) Prepare reference data using reform

We’ll need the original reference in fasta format, and a fasta file containing the novel sequence to be incorporated into the reference sequence. We’ll also need the original reference GTF file, and a GTF file with the annotations corresponding to the novel sequence.

Let’s get a reference sequence:

prince:/genomics/genomes/Public/Fungi/Saccharomyces_cerevisiae/Ensembl/R64-1-1/Saccharomyces_cerevisiae.R64-1-1.dna.toplevel.fa

And the reference GTF:

prince:/genomics/genomes/Public/Fungi/Saccharomyces_cerevisiae/Ensembl/R64-1-1/Saccharomyces_cerevisiae.R64-1-1.34.gtf

Here’s the novel sequence we’re adding to the reference (new.fa)

>new_gene
TGAAATTCATAGCTGTAGAAAATAAAGAAGCACTCATTTGGTCCATATCCAAACTCATTT
AGTGGTTATGCTGGGTATCGTAAACAAAAATGGAGCCAAATGCAGTTATTAATCGTTTAT
TGGAGGATCGCAAATTACTACGCTCATCTTTTGTTGGAGAACTACCAATTGCTGCAGTGA
CAACGCATATAGCAGTGATTTGTTTCCAAATGAAAGCTGCTGTCTTTGCGTCGTCGATAA

Here’s the GTF corresponding to the novel sequence in new.fa (new.gtf):

I    reform    exon    1    240    .    +    0    gene_id "new_gene"; transcript_id "new_gene.1";
I    reform    CDS    1    238    .    +    0    gene_id "new_gene"; transcript_id "new_gene.1";
I    reform    start_codon    1    3    .    +    0    gene_id "new_gene"; transcript_id "new_gene.1";
I    reform    stop_codon    238    240    .    +    0    gene_id "new_gene"; transcript_id "new_gene.1";

In this example, we’ll add this new gene at position 2650 on chromosome “I”:

python3 reform.py \
 --chrom="I" \
 --position=2650 \
 --in_fasta="new.fa" \
 --in_gff="new.gtf" \
 --ref_fasta="Saccharomyces_cerevisiae.R64-1-1.dna.toplevel.fa" \
 --ref_gff="Saccharomyces_cerevisiae.R64-1-1.34.gtf"

We should now have two new files, reformed.fa and reformed.gtf, which we'll need in the next step.
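
As an optional sanity check, you can use Biopython (which reform already requires) to confirm that the novel sequence from new.fa now appears in chromosome I of reformed.fa:

from Bio import SeqIO

# Read the inserted sequence and the reformed chromosome I record
new_seq = str(next(SeqIO.parse("new.fa", "fasta")).seq)
chrom_I = next(rec for rec in SeqIO.parse("reformed.fa", "fasta") if rec.id == "I")

# Prints True if the novel sequence is now part of chromosome I
print(new_seq in str(chrom_I.seq))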

2) Run cellranger mkref

cellranger mkref \
 --genome=output_genome \
 --fasta=reformed.fa \
 --genes=reformed.gtf

This will produce a directory called output_genome, which we'll need to provide in the next step.

3) Run Cell Ranger count

cellranger count \
 --id=run_1 \
 --transcriptome=output_genome \
 --fastqs=/path/to/fastq_path/ \
 --sample=sample_1

4) Putting it together

Here's an sbatch script that runs everything together and makes use of resources on the HPC:

#!/bin/sh
#
#SBATCH --verbose
#SBATCH --job-name=reform_ranger
#SBATCH --output=reform_ranger_%j.out
#SBATCH --error=reform_ranger_%j.err
#SBATCH --time=12:00:00
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=20
#SBATCH --mem=110000
#SBATCH --mail-user=${USER}@nyu.edu

module load cellranger/2.1.0
module load biopython/intel/python3.6/1.72

python3 reform.py \
 --chrom="I" \
 --position=2650 \
 --in_fasta="new.fa" \
 --in_gff="new.gtf" \
 --ref_fasta="Saccharomyces_cerevisiae.R64-1-1.dna.toplevel.fa" \
 --ref_gff="Saccharomyces_cerevisiae.R64-1-1.34.gtf"

cellranger mkref \
 --genome=output_genome \
 --fasta=reformed.fa \
 --genes=reformed.gtf \
 --nthreads=20 \
 --memgb=100

cellranger count \
 --id=run_1 \
 --transcriptome=output_genome \
 --fastqs=/path/to/fastq_path/ \
 --sample=sample_1

Building an Analysis Pipeline for HPC using Python

In this post we will build a pipeline for the HPC using Python 3. We will begin by building the foundation for a pipeline in Python in part 1, and then use that to build a simple NGS analysis pipeline in part 2.

At NYU, we submit jobs to the HPC using the Slurm Workload Manager. Therefore, we will implement this pipeline using Slurm; however, it could easily be adapted for PBS or another similar scheduler.

This tutorial assumes basic level familiarity with Slurm and Python.

Part 1: Building a template for submitting jobs to Slurm in python

The first thing we will do is build a function in python that will be used to submit jobs to Slurm. The function will submit an sbatch job through the command line using the subprocess python package.

Let's start by taking a look at an example of submitting a simple job to Slurm directly from the command line, one that just prints "Hello World".

sbatch -J my_simple_job -o my_simple_job.out -e my_simple_job.err --wrap='echo Hello World'

If you execute this command on the Prince or Dalma HPC, you should end up with two new files after the job completes (it should only take a second to finish):

my_simple_job.out
my_simple_job.err

If you look in my_simple_job.out you should see the text “Hello World”.

Let’s break down the command we just used. If you’re familiar with submitting Slurm jobs, this should be review:

sbatch: calling the scheduler
-J: job name
-o: standard out filename
-e: standard err filename
--wrap: this is the most important part, the command you wish to run. In this example, we simply asked to echo “Hello World”.

Let’s put this same command in a python function (we’ll call it sbatch()), and see how we can use this as a template to submit all our jobs. Note that we make use of the subprocess package to execute something on the command line from python. Before we use the subprocess package, we have to import it.

import subprocess

def sbatch():
    sbatch_command = "sbatch -J my_simple_job -o my_simple_job.out -e my_simple_job.err --wrap='echo Hello World'"
    sbatch_response = subprocess.getoutput(sbatch_command)
    print(sbatch_response)

sbatch() # call the sbatch function

Save that as simple.py. If you run this Python script (python3 simple.py), you should see a response from sbatch (indicating the ID of the submitted job), and you should find the same file my_simple_job.out with the text "Hello World". Try changing the job name, output file names, or command, and re-run the script.
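
For example, here is the same template with a different (arbitrary) job name and a different command, one that reports the hostname of the compute node that ran it:

import subprocess

def sbatch():
    # Same pattern as above, just a different job name and command
    sbatch_command = "sbatch -J hostname_job -o hostname_job.out -e hostname_job.err --wrap='hostname'"
    sbatch_response = subprocess.getoutput(sbatch_command)
    print(sbatch_response)

sbatch() # call the sbatch function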

Let’s transform the sbatch function above into something more flexible.

def sbatch(job_name, command):
    sbatch_command = "sbatch -J {} -o {}.out -e {}.err --wrap='{}'".format(job_name, job_name, job_name, command)
    sbatch_response = subprocess.getoutput(sbatch_command)
    print(sbatch_response)

We introduce two new parameters to the function: job_name and command. We also make use of the string format method in order to build strings easily. Let's take a quick look at a couple of simple examples of the string format method in Python.

name = "Rick"
print( "Hello {}".format(name) )
# prints "Hello Rick"

first_name = "Rick"
last_name = "Sanchez"
print( "Hello {} {}".format(first_name, last_name) )
# prints "Hello Rick Sanchez"

The nice thing about this method is that it automatically formats any field into a string, so you can fill values with integers, floating point numbers, mathematical expressions, etc. Let's take a look at one final example:

some_integer = 7
some_float = 2.99
print( "{} + {} = {} ".format(some_integer, some_float, some_integer + some_float) )
# prints "7 + 2.99 = 9.99"

Let's go back to our sbatch example. What we have now is a function to which we can pass a job_name and a command; the function will build the sbatch command using job_name as the job name and as the base of the stdout/stderr file names, and will put whatever command we give it in the 'wrap' argument. Let's try the same basic Hello World example using our new template.

import subprocess
def sbatch(job_name, command):
    sbatch_command = "sbatch -J {} -o {}.out -e {}.err --wrap='{}'".format(job_name, job_name, job_name, command)
    sbatch_response = subprocess.getoutput(sbatch_command)
    print(sbatch_response)

sbatch("my_simple_job", "echo Hello World") # call the sbatch function, but provide job_name and command parameters.

Let's add some additional parameters that we might want to use when submitting a job to sbatch: time, memory, and tasks. We will give each of these parameters a default value so they don't need to be specified every time you call sbatch(), unless you want to override them. Let's also add a mail option so Slurm will email us if our job ever fails.

def sbatch(job_name, command, time=4, mem=60, tasks=20):
    sbatch_command = "sbatch -J {} -o {}.out -e {}.err --mail-user=$USER@nyu.edu --mail-type=FAIL -t {}:00:00 --mem={}000 --ntasks-per-node={} --wrap='{}'".format(job_name, job_name, job_name, time, mem, tasks, command)
    sbatch_response = subprocess.getoutput(sbatch_command)
    print(sbatch_response)

We now have a function that allows us to quite easily submit any command we want to the Slurm scheduler, control some common job parameters, and get notified by email in case of failure. The final piece required to make this function usable in a pipeline is support for dependencies. We need this function to do two more things:

1) return a job id that can be passed as a dependency to another job
2) allow the sbatch function to accept a dependency, optionally

Let’s tackle #1 first:

When you submit an sbatch job, you typically receive a response like this:
Submitted batch job 123456789

We’re interested in just the job number. There are a few different ways to extract this number from the response, we’ll use a simple string split and take the last element:

job_id = sbatch_response.split(' ')[-1].strip()

The command above takes the sbatch_response (which is the output of running the sbatch command with subprocess.getoutput()), splits it based on the space (‘ ‘) delimiter, takes the last element (the job ID), and strips away any spaces or newline characters leaving just the job ID.
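
For example, applied to a sample response string:

sbatch_response = "Submitted batch job 123456789"
job_id = sbatch_response.split(' ')[-1].strip()
print(job_id) # prints "123456789"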

Now let’s implement support for #2: enabling the function to accept an optional dependency parameter.

We’ll start by adding a parameter called dep to the sbatch function definition and assigning it a default value of nothing (empty string).

def sbatch(job_name, command, time=4, mem=60, tasks=20, dep=''):

Now we'll add a couple of new lines at the top of the function to check whether the dependency is empty, and to add the required parameters to the sbatch command if it isn't. We also add the lines to parse and return the job_id.

def sbatch(job_name, command, time=4, mem=60, tasks=20, dep=''):
    if dep != '':
        dep = '--dependency=afterok:{} --kill-on-invalid-dep=yes '.format(dep)
 
    sbatch_command = "sbatch -J {} -o {}.out -e {}.err --mail-user=$USER@nyu.edu --mail-type=FAIL -t {}:00:00 --mem={}000 --ntasks-per-node={} --wrap='{}' {}".format(job_name, job_name, job_name, time, mem, tasks, command, dep)
    sbatch_response = subprocess.getoutput(sbatch_command)
    print(sbatch_response) 
    job_id = sbatch_response.split(' ')[-1].strip()
    return job_id

Note that the dep parameter is always included in the sbatch command that we send to the command line. If there is no dependency specified, then dep will be an empty string.

We now have a function that can be used to build a fully functioning pipeline with dependencies!
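
For example, a hypothetical two-step chain where the second job only starts after the first completes successfully might look like this:

# Submit a first job, then a second job that depends on it
first_id = sbatch('step_one', 'echo step one')
second_id = sbatch('step_two', 'echo step two', dep=first_id)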

Part 2: Building a Pipeline

In part 1 we built a template function to submit jobs to the HPC. In part 2, we will use this function to build a pipeline. For the purpose of this tutorial, we will implement a simple NGS analysis workflow consisting of FastQC, alignment, sorting, indexing, and generating alignment statistics.

Using dependencies in a pipeline allows us to maximize the resources we can use while minimizing the wait time to run our jobs. This also allows individual jobs to run in parallel (where possible), reducing our overall analysis time even further.

The figure below depicts the simple pipeline we will build. Note the dependencies (indicated with arrows) and the steps which run in parallel (same row).

1. Input files

We’ll use a single-end fastq file and we’ll align to the Unicorn genome.

input_1 = "sample_R1.fastq"

ref = "Unicorn_v1.fa"

For the purpose of this exercise, we’ll assume the index files for the aligner are already built.
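
If you did need to build them, that step could be submitted through the same sbatch() helper from Part 1; here's a minimal sketch, assuming the bowtie2 module name used later in this post:

# Build the bowtie2 index using the same base name that we later pass to -x
command = "module load bowtie2/intel/2.3.2; bowtie2-build {} {}".format(ref, ref)
build_index_jobid = sbatch('build_index', command)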

2. Build the individual steps

Here’s the command for running fastqc:

fastqc file.fastq

We’ll build a function for each step of the workflow to keep things organized. These functions will call the sbatch() function we created above.

def fastqc():
    # Build the command for fastqc
    command = "fastqc {}".format(input_1)

    # Pass job name and command to sbatch
    # sbatch() will return the job_id
    job_id = sbatch('fastqc', command)

    # Return the job id
    return job_id

Now let's build the functions for the remaining steps:

Alignment:

def alignment():
    command = "bowtie2 -x {} -U {} > output.sam".format(ref, input_1)
    job_id = sbatch('align', command)
    return job_id

Note: The next three steps require a dependency, i.e. they require a previous job to complete before they can be started. For this reason, we will add the dep parameter to the following functions.

Stats:

def stats(dep=''):
    command = "samtools flagstat output.sam > stats.txt"
    job_id = sbatch('stats', command, dep=dep)
    return job_id

Sort:

def sort(dep=''):
    command = "samtools sort output.sam -o output.sorted.bam"
    job_id = sbatch('sort', command, dep=dep)
    return job_id

Index:

def index(dep=''):
    command = "samtools index output.sorted.bam output.sorted.bai"
    job_id = sbatch('index', command, dep=dep)
    return job_id

3. Build the pipeline

Now that we have all our individual functions for each step built, we can stitch them together in a pipeline. Most of the heavy lifting is already done, this part is the easiest.

fastqc_jobid = fastqc()
alignment_jobid = alignment()

# Stats needs the alignment to complete before it can begin, so we pass 
# alignment_jobid to the stats() function
stats_jobid = stats(alignment_jobid)

# Sort needs the alignment to complete before it can begin, so we pass 
# alignment_jobid to the sort() function
sort_jobid = sort(alignment_jobid)

# Note: Both stats() and sort() will be submitted simultaneously when the alignment job is complete

# Index needs sorting to complete before it can begin, so we pass sort_jobid to index()
index_jobid = index(sort_jobid)

4. Putting it all together

Below you will find the full code for this pipeline. I’ve added three new components:

a) Parse the sample ID (from the input file name) and use it to name the outputs of the pipeline
b) Override some of the default sbatch job parameters (e.g. tasks, memory, and time) for the alignment job
c) Load the required modules (needed on NYU's HPC). I add this directly into the command for each step.

import subprocess

# Inputs
input_1 = 'sample_R1.fastq'
ref = 'Unicorn_v1.fa'

# Parse the sample id so we can use it to name the outputs
# sample_id in this example will be 'sample_R1'
sample_id = input_1.split('.')[0]

def sbatch(job_name, command, time=4, mem=60, tasks=20, dep=''):
    if dep != '':
        dep = '--dependency=afterok:{} --kill-on-invalid-dep=yes '.format(dep)

    sbatch_command = "sbatch -J {} -o {}.out -e {}.err --mail-user=$USER@nyu.edu --mail-type=FAIL -t {}:00:00 --mem={}000 --ntasks-per-node={} --wrap='{}' {}".format(job_name, job_name, job_name, time, mem, tasks, command, dep)

    sbatch_response = subprocess.getoutput(sbatch_command)
    print(sbatch_response)

    job_id = sbatch_response.split(' ')[-1].strip()
    return job_id

def fastqc():
    # Build the command for fastqc
    command = "module load fastqc/0.11.5; fastqc {}".format(input_1)

    # Pass job name and command to sbatch
    # sbatch() will return the job_id
    job_id = sbatch('fastqc', command)

    # Return the job id
    return job_id

def alignment():
    command = "module load bowtie2/intel/2.3.2; bowtie2 -x {} -U {} > {}.sam".format(ref, input_1, sample_id)
    job_id = sbatch('align', command, time=8, mem=120)
    return job_id

def stats(dep=''):
    command = "module load samtools/intel/1.6; samtools flagstat {}.sam > {}_stats.txt".format(sample_id, sample_id)
    job_id = sbatch('stats', command, dep=dep)
    return job_id

def sort(dep=''):
    command = "module load samtools/intel/1.6; samtools sort {}.sam -o {}.sorted.bam".format(sample_id, sample_id)
    job_id = sbatch('sort', command, dep=dep)
    return job_id

def index(dep=''):
    command = "module load samtools/intel/1.6; samtools index {}.sorted.bam {}.sorted.bai".format(sample_id, sample_id)
    job_id = sbatch('index', command, dep=dep)
    return job_id

# Run the pipeline!
fastqc_jobid = fastqc()
alignment_jobid = alignment()
stats_jobid = stats(alignment_jobid)
sort_jobid = sort(alignment_jobid)
index_jobid = index(sort_jobid)

Congratulations! You’ve completed the tutorial. If you were following along, you should have the following output files once the pipeline has finished:

sample_R1_fastqc.zip
sample_R1_fastqc.html
sample_R1.sam
sample_R1_stats.txt
sample_R1.sorted.bam
sample_R1.sorted.bai

Post below if you have any questions and I’ll do my best to answer.