Explore the New Shared Genome Resource

Save time and resources with the local CGSB repository of commonly used genomic data sets. Data is obtained from Ensembl and NCBI. New versions/releases will be added periodically or upon request. Previous versions/releases will be preserved.

All files are readable from within the shared genome resource. There is no need to copy the file(s) to your local directory. The second table below shows all available data types.

All data are stored in a common location with the following naming convention:

/genomics/genomes/Public/<KINGDOM>/<SPECIES>/<SOURCE>/<VERSION>/

Search for your genome of interest in the table below. Clicking on the row containing your genome will generate the specific path to the data on the HPC (displayed below the table).
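For example, a concrete path might look like the following (the kingdom, species, source, and version values here are hypothetical placeholders; use the values shown in the table for your genome):

# Illustrative only: substitute the values from the table for your genome of interest
ls /genomics/genomes/Public/Fungi/Saccharomyces_cerevisiae/Ensembl/R64-1-1/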

Explore

The interactive species table lists the available genomes with the columns Kingdom, Species, Source, and Version. Selecting a row displays the corresponding path:

/genomics/genomes/Public/<KINGDOM>/<SPECIES>/<SOURCE>/<VERSION>/
Data Types

Name                 Extensions    Built With
Reference Sequence   .fa, .fna

Request Data

To request a new organism, version, or data type, please email mk5636@nyu.edu

Remote Desktop Connection to Prince

Connect to Prince using a remote desktop to analyze your data in RStudio, visualize in IGV, and interact with other GUI applications on the HPC.

Instructions for Mac users

Instructions for PC users

Instructions for Mac users

Initial Setup:

  1. Download TigerVNC:
    https://bintray.com/tigervnc/stable/download_file?file_path=TigerVNC-1.8.0.dmg
  2. Set up a VNC password on Prince:
    Log in to Prince and run the command vncpasswd
    If you ever forget your VNC password, you can reset it at any time by running vncpasswd again (it will prompt you to enter and confirm a new password)
  3. Copy this script into your home folder on Prince (see the example command just after this list):
    prince:/scratch/work/cgsb/scripts/launch_vnc/launch-vnc.sh
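For example, one way to do this while logged in to Prince:

# Copy the launch script into your home directory on Prince
cp /scratch/work/cgsb/scripts/launch_vnc/launch-vnc.sh ~/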

Once initial setup is complete, you can launch a VNC session and connect to it at any time by doing the following:

  1. Submit the sbatch script (which you obtained above) to the queue, like this:
    > sbatch launch-vnc.sh
  2.  Once the job is running, look for the output file called VNC.o1234567, where ‘1234567’ is your job ID. Open that file and copy the first ssh command, which looks like this:
    ssh -L 5901:c06-04:5901 netID@prince.hpc.nyu.edu
  3. Open a terminal on your Mac (a local terminal), paste in the command from above, and press Enter. You will be asked for your password (your regular NYU password).
  4. Open TigerVNC on your computer and enter localhost:X in the server field, where X is the display number, i.e. the last digit of the VNC port (if the port in the ssh command above was 5901, enter 1; if it was 5902, enter 2). Using the command above as an example, the correct entry would be localhost:1
    Press the ‘Connect’ button. You will then be asked for a password; enter the password you set up with vncpasswd (not your NYU password!). Click OK and the desktop should launch.
  5. The first time you launch your remote desktop, you should disable the screensaver lock. Go to Applications -> System Tools -> Settings -> Power and change the setting for Blank Screen to Never. This will prevent you from being locked out of your desktop. This only needs to be done once.
  6. Now you can run IGV and load any file which you have access to on the HPC. Open a terminal and enter:
    > module load igv
    > igv.sh
  7. Your VNC session will be active for 12 hours. After this period, you will need to re-launch and connect to your remote desktop by redoing steps 1-4.

Instructions for PC users

Initial Setup:

  1. Download Putty:
    32-bit: https://the.earth.li/~sgtatham/putty/latest/w32/putty.exe
    64-bit: https://the.earth.li/~sgtatham/putty/latest/w64/putty.exe
  2. Download and install TightVNC (accept the default settings when installing):
    http://www.tightvnc.com/download.php
  3. Set up a VNC password on Prince:
    Log in to Prince and run the command vncpasswd
    If you ever forget your VNC password, you can reset it at any time by running vncpasswd again (it will prompt you to enter and confirm a new password)
  4. Copy this script into your home folder:
    prince:/scratch/work/cgsb/scripts/launch_vnc/launch-vnc.sh

Once initial setup is complete, you can launch a VNC session and connect to it at any time by doing the following:

  1. Submit the sbatch script (which you obtained above) to the queue, like this:
    > sbatch launch-vnc.sh
  2. Once the job is running, look for the output file called VNC.o1234567, where ‘1234567’ is your job ID. Open that file and look at the first  ssh command, which looks like this:
    ssh -L 5901:c06-04:5901 netID@prince.hpc.nyu.edu
  3. From the line above, we know the VNC port is 5901 and the compute node is c06-04. You will need these two values.
  4. Open PuTTY and enter prince.hpc.nyu.edu for Host Name and 22 for Port.
  5. On the left, click to expand the options for ‘SSH’, then click ‘Tunnels’. Enter the port number obtained in step 3 into the ‘Source Port’ field (5901 in this example). For ‘Destination’, enter compute-node:port, which in this example is c06-04:5901. Click ‘Add’, then click ‘Open’. When the terminal opens, authenticate with your NYU NetID and password.
    Note: this configuration (Source port + Destination) will need to be updated whenever a new VNC session is launched (whenever you submit a new launch-vnc.sh job).
  6. Launch TightVNC Viewer. Enter the VNC Server as follows:
    If your port is 5900, VNC Server: localhost:0
    If your port is 5901, VNC Server: localhost:1
    If your port is 5902, VNC Server: localhost:2
    In the example above, the port was 5901, so we enter localhost:1
  7. Click Connect and enter your VNC password (not your NYU NetID password!), which you created during the initial setup.
  8. The first time you launch your remote desktop, you should disable the screensaver lock. Go to Applications -> System Tools -> Settings -> Power and change the setting for Blank Screen to Never. This will prevent you from being locked out of your desktop. This only needs to be done once.
  9. Now you can run IGV and load any file which you have access to on the HPC. Open a terminal and enter:
    > module load igv
    > igv.sh
  10. Your VNC session will be active for 12 hours. After this period, you will need to re-launch and connect to your remote desktop by submitting the launch-vnc.sh script and configuring a new PuTTY session.

 

Salmon & kallisto: Rapid Transcript Quantification for RNA-Seq Data

Salmon and kallisto might sound like a tasty entree from a hip Tribeca restaurant, but the duo are in fact a pair of next-generation applications for rapid transcript quantification. They represent a new approach to transcript quantification using NGS that has a number of advantages over existing alignment-based methods. I’ve tried them both out and provide my thoughts below.

We are used to quantification methods that rely on full base-to-base alignment of reads to a reference, but newer methods leverage the idea that, when quantifying reads, the necessary information is not where in a transcript a read aligns, but only which transcripts could have generated the read. This idea renders base-level alignment, which has been the computational bottleneck until now, unnecessary for this analysis. Because this approach does not involve actual alignment to the genome, it is sometimes referred to as ‘pseudoalignment’.

kallisto

Pros:

  • Extremely Fast & Lightweight – can quantify 20 million reads in under five minutes on a laptop computer
  • Easy to use
  • Sleuth – an interactive R-based companion for exploratory data analysis

Cons:

  • No support for stranded libraries
    Update: kallisto now offers support for strand-specific libraries

kallisto, published in April 2016 by Lior Pachter and colleagues, is an innovative new tool for quantifying transcript abundance. kallisto uses the concept of ‘pseudoalignments’, which are essentially relationships between a read and a set of compatible transcripts. kallisto obtains highly accurate pseudoalignments efficiently using k-mers together with the transcriptome de Bruijn graph (T-DBG). Grouping pseudoalignments belonging to the same set of transcripts into equivalence classes allows for a simpler model and more efficient use of the underlying algorithms. The result is a tool which is able to quantify a paired-end library with 22 million reads in 3.5 minutes on a single core with 8GB of memory.


Both kallisto and Salmon are orders of magnitude faster than TopHat + Cufflinks. Times computed on a 22M-read paired-end Illumina dataset with 1 CPU core and 8GB of memory.

Speed is great, but accuracy is critical. The plot below displays a high correlation (r = 0.941) between results obtained by kallisto and Cufflinks. You can see that the majority of differences are for transcripts with low expression levels.

kallisto is self-described as being ‘near optimal’. In this case, ‘optimal’ refers to two key aspects of the analysis: speed and accuracy. While the benchmark for accuracy can be determined by comparing results to a known “gold standard” data set or to results from other existing tools, the benchmark for speed is typically based on how fast an input file can be read. A good way to measure optimal speed is to calculate how long it takes to read through a file; the Linux word count (wc) utility provides a measure of this. For example, if it takes your machine 45 seconds to perform a word count on a set of reads, 45 seconds could be considered the benchmark for speed in quantifying this set of reads (on your machine). You can see in the plot below that kallisto takes just over twice as long as it takes to read through the file. That’s not bad for the complex problem of quantifying RNA-seq data!


The linux word count (wc) utility can serve as a benchmark for ‘optimal’ speed.
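If you want to establish this baseline for your own data, a rough sketch looks like this (file names are placeholders):

# Time how long it takes simply to read through the reads;
# this run time is a rough lower bound ('optimal' speed) for any quantifier
time wc -l read_1.fastq read_2.fastq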

A companion to kallisto, sleuth is an R-based program for exploratory data analysis powered by Shiny. The tools in sleuth allow you to investigate transcript abundance data or combine results from multiple samples for differential expression analysis. Interactive tools to explore scatterplots, volcano plots, MA plots, heat maps, PCA analysis, and more make analyzing your data faster and easier. For example, when examining a volcano plot, users can highlight a region of interest and immediately see a list of the transcripts selected in a table below.

A Volcano Plot in sleuth
Selecting points on a volcano plot brings up the transcripts corresponding to those points in the table below, in real time.

Using the IDs of the transcripts, users can drill down further and examine boxplots of transcript abundances to see technical variation in each sample and biological variation between conditions.

Transcript View in sleuth
Enter a transcript ID in transcript view, or a gene ID in gene view, to visualize technical and biological variation quickly.

sleuth requires technical replicates for its analysis. In lieu of actual technical replicates, sleuth makes use of bootstrapped values which serve as accurate proxies. Bootstrapping here is the process of repeating the quantification analysis after resampling with replacement from the original data, in order to simulate random sampling associated with sequencing.

sleuth computes differential expression using a statistical model that combines variance information from biological replicates as well as bootstrapped technical replicates to estimate true biological variance. A test of sleuth on data simulated according to the DESeq2 model found that sleuth significantly outperforms other methods (ex: DESeq2, edgeR). A preprint is forthcoming.

The power of kallisto and sleuth lies in their speed and ease of use. A user can go from raw reads to analysis in minutes. As impressive as kallisto is, one major drawback is that its simplified model is unable to account for strandedness in reads. This seems like a major limitation given that most RNA-seq protocols generate stranded libraries. If support for strandedness is a requirement for your analysis, check out Salmon below. Update: kallisto now offers support for strand-specific libraries

Salmon

Pros:

  • Fast & Lightweight – can quantify 20 million reads in under eight minutes on a desktop computer
  • Support for strand-specific libraries
  • Accepts BAM and FASTQ files as input

Cons:

  • No automatic support for compressed input files

From the creators of Sailfish, Salmon (released but unpublished) is another rapid quantification tool which leverages k-mer based counting to produce, in Salmon terms, ‘quasi-alignments’. Although Salmon is not quite as fast as kallisto and lacks some features such as seamless support for compressed files, its strong point is its ability to learn and account for the effects of experiment-specific parameters and biases, including support for strand-specific RNA-seq data.

In the data set used, the correlation between results obtained by Salmon and Cufflinks (r = 0.939) is nearly identical to that between kallisto and Cufflinks (r = 0.941).

Another feature Salmon offers is the ability to quantify pre-existing alignments (from BAM files). This means that if you already have BAM alignments (or will need BAMs for other reasons), you can provide Salmon with these. If you don’t have BAM files, you can provide your reads in fastq format and Salmon will create the quasi-alignments ‘wicked-fast’, and then quantify them.
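A minimal sketch of the alignment-based mode, assuming a BAM file aligned to the transcriptome (file names are placeholders; -t points to the transcript FASTA used for the alignments and -a to the BAM, per the Salmon documentation):

# Quantify pre-existing transcriptome alignments with Salmon
salmon quant -t transcripts.fa -l ISR -a aligned_reads.bam -o salmon_out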

Finally, if you’re quantifying your data with Salmon and would like to use sleuth for downstream analysis, you’re in luck. The creators of Salmon have created an R library called wasabi which prepares Salmon (and Sailfish) output for use with sleuth.

The latest versions of Salmon, kallisto, sleuth, and wasabi are available for use on Mercer.

Getting Started (on Mercer)

kallisto

module load kallisto/intel/0.42.5
kallisto quant -i <kallisto_index> -o <output_dir> <read_1.fastq> <read_2.fastq>

To produce bootstrap values for downstream analysis with sleuth (in this example, 100 bootstraps):

kallisto quant -i <kallisto_index> -o <output_dir> -b 100 <read_1.fastq> <read_2.fastq>

 

sleuth

Note: If your data is on Mercer, run sleuth in a VNC session for web browser functionality. On Mercer, it is necessary to load Firefox (or a browser of your choice) prior to launching R.

module load r/intel/3.2.2
module load firefox
R
> library("sleuth")
> base_dir <- "/scratch/<netID>/kallisto-results/"

The next step requires describing the experimental design and the relationship between the kallisto directories and the samples, which is variable depending on your experimental design and kallisto output structure. The final steps involve creating the sleuth object and performing the differential expression calculations, which take only a couple minutes in total to complete. Once these steps are complete, an interactive browser session can be launched. Please see Introduction to Sleuth for more.

Salmon

module load salmon/gnu/0.6.0
salmon quant -i <salmon_index> -l <lib_type> -1 <read_1.fastq> -2 <read_2.fastq> -o <output_dir>

In this example, we quantify a paired-end, stranded library from Arabidopsis thaliana (where read 1 comes from the reverse strand), with 100 bootstraps:

salmon quant -i A_thaliana_salmon_index -l ISR --numBootstraps 100 -1 read_1.fastq -2 read_2.fastq -o salmon_out

 

Resources

kallisto Paper:

http://www.nature.com/nbt/journal/vaop/ncurrent/full/nbt.3519.html

kallisto Manual:

http://pachterlab.github.io/kallisto/manual.html

Sleuth Blogpost:

https://liorpachter.wordpress.com/2015/08/17/a-sleuth-for-rna-seq/

Sleuth Tutorial:

https://rawgit.com/pachterlab/sleuth/master/inst/doc/intro.html

Salmon Preprint:

http://biorxiv.org/content/early/2015/06/27/021592

Salmon Manual:

http://salmon.readthedocs.io/en/latest/salmon.html

Node to Joy: Maximize Your Performance on the HPC

In this post we’ll discuss maximizing your performance on the HPC. This entry is aimed towards experienced HPC users; for new users, please see Getting Started on the HPC.

Recent advances in sequencing technology have made High Performance Computing (HPC) more critical than ever in data-driven biological research. NYU’s HPC resources are available to all NYU faculty, staff, and faculty-sponsored students.

Familiarizing yourself with NYU’s HPC resources, and optimizing your use of this technology will enable maximum speed and efficiency in your analyses.

Get to Know Your Nodes

Mercer, the primary NYU HPC cluster in New York, contains 400 nodes, 16TB of total memory, and over 3200 CPU cores. When a job is submitted to the cluster, one or more nodes are allocated to the job based on the resources requested when submitting. Typically, a single node is sufficient, and the node will be allocated based on amount of memory and CPU cores requested.

For example, of the 400 nodes available on Mercer, 112 are 64GB nodes with 20 CPU cores per node. Note that while 64GB is the total memory per node, only 62GB is available to jobs. Therefore, if 64GB of memory is requested, the job will run on the next-largest memory node. In this case, the node will be allocated based on how many processors per node (cores) are requested. There are twelve 96GB nodes with 12 cores on Mercer, so if 12 or fewer cores are requested, a 96GB node will be allocated. If 20 cores are requested, the next-highest memory nodes with 20 cores are the forty-eight 192GB nodes (189GB available for jobs), so one of these will be allocated. See the list of Mercer nodes. There are also four new nodes with 500GB of available memory and 20 CPU cores, and two new nodes with 1.5TB of memory and 48 CPU cores, that are not included in this list.

Generally speaking, if 20 CPU cores on a single compute node are requested, it is fine to request 62GB of memory. If more than 62GB of memory with 20 CPU cores on a single compute node is needed, it is fine to request 189GB. In both cases, since all 20 cores are being requested, no other job can share the compute node, so it is efficient to request all of the available memory on it.

Which Node is Right for your Analysis?

Choosing the right node for your analysis requires a little benchmarking and testing. Over-requesting resources when they are not needed might result in your analyses taking longer because of queues for the higher-resource nodes, since there are fewer of them. For example, a job that might have started almost immediately after submission on a lower-resource node and taken 30 minutes to run might have completed faster on a 1TB node, but its queue time might have been longer since there are only 3 nodes with 1TB of memory available.

It can be difficult to know what resources are required for a particular job, and this knowledge only comes with experience. Enabling email notifications in your PBS scripts (with the directive #PBS -M NetID@nyu.edu) is very helpful because it allows you to receive a report of the compute resources used after a job completes. These reports can guide resource requests for future jobs.

Finally, if you plan on running an application with multiple CPU cores, for example:

bowtie2 -p 20 -x <ref> -U <input.fq>

Ensure that you request this number of CPU cores when submitting your PBS job, as such:

#PBS -l nodes=1:ppn=20,mem=62GB,walltime=12:00:00
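Putting these together, a minimal PBS script for the bowtie2 example above might look like the following sketch (the module name/version, job name, and output file name are illustrative assumptions, not taken from a real script):

#!/bin/bash
#PBS -l nodes=1:ppn=20,mem=62GB,walltime=12:00:00
#PBS -M NetID@nyu.edu
#PBS -N bowtie2_align

# Module name/version is an assumption; check 'module avail bowtie2' on the cluster
module load bowtie2/intel/2.2.6

# Run from the directory the job was submitted from
cd $PBS_O_WORKDIR

# Use the same number of threads (-p 20) as CPU cores requested above (ppn=20)
bowtie2 -p 20 -x <ref> -U <input.fq> -S aligned_reads.sam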

The bottom line: the time to complete a job on the HPC is a function of BOTH the queue time and the compute time. So, consider both when setting up your jobs.

To learn how to see which jobs are currently running on which nodes and cores visit: https://wikis.nyu.edu/display/NYUHPC/Interpreting+pbstop

For more information on running jobs on the NYU HPC cluster visit: https://wikis.nyu.edu/display/NYUHPC/Running+jobs+on+the+NYU+HPC+clusters

Variant Calling Pipeline: FastQ to Annotated SNPs in Hours

Identifying genomic variants, such as single nucleotide polymorphisms (SNPs) and DNA insertions and deletions (indels), can play an important role in scientific discovery. To this end, a pipeline has been developed to allow researchers at the CGSB to rapidly identify and annotate variants. The pipeline employs the Genome Analysis Toolkit (GATK) to perform variant calling and is based on the best practices for variant discovery analysis outlined by the Broad Institute. Once SNPs have been identified, SnpEff is utilized to annotate and predict the effects of the variants.

Full List of Tools Used in this Pipeline:

  • GATK
  • BWA
  • Picard
  • Samtools
  • Bedtools
  • SnpEff
  • R (dependency for some GATK and Picard steps)

Script Location:

mercer:/scratch/work/cgsb/scripts/variant-calling/pipeline.sh

How to use this script:

  1. Create a new directory for this project in your scratch directory
    cd /scratch/<netID>/
    mkdir variant-calling
    cd variant-calling
  2. Copy pipeline.sh into your project directory
    cp /scratch/work/cgsb/scripts/variant-calling/pipeline.sh .
  3. The pipeline requires six input values, which are defined at the very top of the script. The values can be hard coded into the script or provided as command line arguments (see the example after this list).
    1. DIR: The directory with your FastQ libraries to be processed. This should be located in /scratch/cgsb/gencore/out/<PI>/
    2. REF: Reference genome to use. This should be located in the Shared Genome Resource (/scratch/work/cgsb/reference_genomes/). Selecting a reference genome from within the Shared Genome Resource ensures that all index files and reference dictionaries required by BWA, Picard, GATK, etc. are available.
    3. SNPEFF_DB: The name of the SnpEff database to use. To determine the name of the SnpEff DB to use, issue the following command:
      java -jar /share/apps/snpeff/4.1g/snpEff.jar databases | grep -i SPECIES_NAME
      From the list of returned results, select the database you wish to use. The value of the first column is the value to input for SNPEFF_DB.

      Example:
      java -jar /share/apps/snpeff/4.1g/snpEff.jar databases | grep -i thaliana

      Output:

      athalianaTair10   Arabidopsis_Thaliana    http://downloads.sourceforge.net/...

      Then:

      SNPEFF_DB='athalianaTair10'

      Note: If your organism is not available in SnpEff, it is possible to create a custom SnpEff Database if a reference fasta and gff file are available.

    4. PL: Platform. Ex: illumina
    5. PM: Machine. Ex: nextseq
    6. EMAIL: Your email address. Used to report PBS job failures

    In the example below, I have hard coded the parameters in the script:

    DIR='/data/cgsb/gencore/out/Gresham/2015-10-23_HK5NHBGXX/lib1-26/'
    REF='/scratch/work/cgsb/reference_genomes/Public/Fungi/Saccharomyces_cerevisiae/GCF_000146045.2_R64/GCF_000146045.2_R64_genomic.fna'
    SNPEFF_DB='GCF_000146045.2_R64'
    PL='illumina'
    PM='nextseq'
    EMAIL='some1@nyu.edu' # Not a real email address

  4. Run the script. If you hard coded the parameters in the script as shown in the example above, no additional parameters need to be provided. Simply run:
    sh pipeline.sh
  5. Optional: Monitor the jobs
    watch -d qstat -u <netID>
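If you prefer not to hard code the parameters, the six values can instead be passed on the command line. The argument order shown here is an assumption for illustration; confirm the expected order at the top of pipeline.sh:

# Hypothetical invocation with the six values as positional arguments
# (check the top of pipeline.sh for the order it actually expects)
sh pipeline.sh <DIR> <REF> <SNPEFF_DB> <PL> <PM> <EMAIL>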

Overview of the pipeline

Variant Calling Workflow

Important Notes:

  • The pipeline assumes no known variants are available for the Base Quality Score Recalibration step and as such bootstraps a set of SNPs and indels to provide as input at this step. Contact me if a list of high-quality known variants is available for your organism, as this data can improve the quality of the output
  • The current script is designed to work with Paired End data
  • Due to HPC queue limits on mercer, the pipeline can run on a maximum of 25 libraries at a time
  • If you need to run the pipeline on more than 25 libraries or on Single End reads, or have any other questions, please contact me

Walkthrough

Step 1 Alignment – Map to Reference
Tool BWA MEM
Input .fastq files, reference genome
Output

aligned_reads.sam*

*Intermediary file, removed from final output

Notes

Need to provide the -M flag to BWA; this tells it to mark split reads as secondary, which is needed for GATK variant calling/Picard support. Alternate alignment tools: Bowtie2, Novoalign

Readgroup info is provided with the -R flag. This information is key for downstream GATK functionality. GATK will not work without a read group tag.

Command bwa mem -M -R '@RG\tID:sample_1\tLB:sample_1\tPL:ILLUMINA\tPM:HISEQ\tSM:sample_1' ref input_1 input_2 > aligned_reads.sam

 

Step 2 Sort SAM file by coordinate, convert to BAM
Tool Picard Tools
Input aligned_reads.sam
Output

sorted_reads.bam*

*Intermediary file, removed from final output

Command java -jar picard.jar SortSam INPUT=aligned_reads.sam OUTPUT=sorted_reads.bam SORT_ORDER=coordinate

 

Step 3 Collect Alignment & Insert Size Metrics
Tool Picard Tools, R, Samtools
Input sorted_reads.bam, reference genome
Output alignment_metrics.txt,
insert_metrics.txt,
insert_size_histogram.pdf,
depth_out.txt
Command

java -jar picard.jar CollectAlignmentSummaryMetrics R=ref I=sorted_reads.bam O=alignment_metrics.txt

java -jar picard.jar CollectInsertSizeMetrics INPUT=sorted_reads.bam OUTPUT=insert_metrics.txt HISTOGRAM_FILE=insert_size_histogram.pdf

samtools depth -a sorted_reads.bam > depth_out.txt

 

Step 4 Mark Duplicates
Tool Picard Tools
Input sorted_reads.bam
Output

dedup_reads.bam*

metrics.txt

*Intermediary file, removed from final output

Command java -jar picard.jar MarkDuplicates INPUT=sorted_reads.bam OUTPUT=dedup_reads.bam METRICS_FILE=metrics.txt

 

Step 5 Build BAM Index
Tool Picard Tools
Input dedup_reads.bam
Output

dedup_reads.bai*

*Intermediary file, removed from final output

Command java -jar picard.jar BuildBamIndex INPUT=dedup_reads.bam

 

Step 6 Create Realignment Targets
Tool GATK
Input dedup_reads.bam,
reference genome
Output realignment_targets.list
Notes This is the first step in a two-step process of realigning around indels
Command java -jar GenomeAnalysisTK.jar -T RealignerTargetCreator -R ref -I dedup_reads.bam -o realignment_targets.list

 

Step 7 Realign Indels
Tool GATK
Input dedup_reads.bam,
realignment_targets.list,
reference genome
Output realigned_reads.bam
Notes This step performs the realignment around the indels which were identified in the previous step (the ‘realignment targets’)
Command java -jar GenomeAnalysisTK.jar -T IndelRealigner -R ref -I dedup_reads.bam -targetIntervals realignment_targets.list -o realigned_reads.bam

 

Step 8 Call Variants
Tool GATK
Input realigned_reads.bam,
reference genome
Output raw_variants.vcf
Notes First round of variant calling. The variants identified in this step will be filtered and provided as input for Base Quality Score Recalibration
Command java -jar GenomeAnalysisTK.jar -T HaplotypeCaller -R ref -I realigned_reads.bam -o raw_variants.vcf

 

Step 9 Extract SNPs & Indels
Tool GATK
Input raw_variants.vcf,
reference genome
Output raw_indels.vcf, raw_snps.vcf
Notes This step separates SNPs and Indels so they can be processed and used independently
Command java -jar GenomeAnalysisTK.jar -T SelectVariants -R ref -V raw_variants.vcf -selectType SNP -o raw_snps.vcf
java -jar GenomeAnalysisTK.jar -T SelectVariants -R ref -V raw_variants.vcf -selectType INDEL -o raw_indels.vcf

 

Step 10 Filter SNPs
Tool GATK
Input raw_snps.vcf,
reference genome
Output filtered_snps.vcf
Notes

The filtering criteria for SNPs are as follows:

QD < 2.0
FS > 60.0
MQ < 40.0
MQRankSum < -12.5
ReadPosRankSum < -8.0
SOR > 4.0

Note: SNPs which are ‘filtered out’ at this step will remain in the filtered_snps.vcf file; however, they will be marked as ‘basic_snp_filter’, while SNPs which passed the filter will be marked as ‘PASS’

Command java -jar GenomeAnalysisTK.jar -T VariantFiltration -R ref -V raw_snps.vcf --filterExpression 'QD < 2.0 || FS > 60.0 || MQ < 40.0 || MQRankSum < -12.5 || ReadPosRankSum < -8.0 || SOR > 4.0' --filterName "basic_snp_filter" -o filtered_snps.vcf

 

Step 11 Filter Indels
Tool GATK
Input raw_indels.vcf,
reference genome
Output filtered_indels.vcf
Notes

The filtering criteria for indels are as follows:

QD < 2.0
FS > 200.0
ReadPosRankSum < -20.0
SOR > 10.0

Note: Indels which are ‘filtered out’ at this step will remain in the filtered_indels.vcf file; however, they will be marked as ‘basic_indel_filter’, while indels which passed the filter will be marked as ‘PASS’

Command java -jar GenomeAnalysisTK.jar -T VariantFiltration -R ref -V raw_indels.vcf --filterExpression 'QD < 2.0 || FS > 200.0 || ReadPosRankSum < -20.0 || SOR > 10.0' --filterName "basic_indel_filter" -o filtered_indels.vcf

 

Step 12 Base Quality Score Recalibration (BQSR) #1
Tool GATK
Input realigned_reads.bam,
filtered_snps.vcf,
filtered_indels.vcf,
reference genome
Output

recal_data.table*

*Intermediary file, removed from final output

Notes BQSR is performed twice. The second pass is optional, but it is required to produce the recalibration report in the Analyze Covariates step.
Command java -jar GenomeAnalysisTK.jar -T BaseRecalibrator -R ref -I realigned_reads.bam -knownSites filtered_snps.vcf -knownSites filtered_indels.vcf -o recal_data.table

 

Step 13 Base Quality Score Recalibration (BQSR) #2
Tool GATK
Input recal_data.table,
realigned_reads.bam,
filtered_snps.vcf,
filtered_indels.vcf,
reference genome
Output

post_recal_data.table*

*Intermediary file, removed from final output

Notes The second time BQSR is run, it takes the output from the first run (recal_data.table) as input
Command java -jar GenomeAnalysisTK.jar -T BaseRecalibrator -R ref -I realigned_reads.bam -knownSites filtered_snps.vcf -knownSites filtered_indels.vcf -BQSR recal_data.table -o post_recal_data.table

 

Step 14 Analyze Covariates
Tool GATK
Input recal_data.table,
post_recal_data.table,
reference genome
Output recalibration_plots.pdf
Notes This step produces a recalibration report based on the output from the two BQSR runs
Command java -jar GenomeAnalysisTK.jar -T AnalyzeCovariates -R ref -before recal_data.table -after post_recal_data.table -plots recalibration_plots.pdf

 

Step 15 Apply BQSR
Tool GATK
Input recal_data.table,
realigned_reads.bam,
reference genome
Output recal_reads.bam
Notes This step applies the recalibration computed in the first BQSR step to the bam file.
Command java -jar GenomeAnalysisTK.jar -T PrintReads -R ref -I realigned_reads.bam -BQSR recal_data.table -o recal_reads.bam

 

Step 16 Call Variants
Tool GATK
Input recal_reads.bam,
reference genome
Output raw_variants_recal.vcf
Notes Second round of variant calling performed on recalibrated bam
Command java -jar GenomeAnalysisTK.jar -T HaplotypeCaller -R ref -I recal_reads.bam -o raw_variants_recal.vcf

 

Step 17 Extract SNPs & Indels
Tool GATK
Input raw_variants_recal.vcf,
reference genome
Output raw_indels_recal.vcf, raw_snps_recal.vcf
Notes This step separates SNPs and Indels so they can be processed and analyzed independently
Command java -jar GenomeAnalysisTK.jar -T SelectVariants -R ref -V raw_variants_recal.vcf -selectType SNP -o raw_snps_recal.vcf
java -jar GenomeAnalysisTK.jar -T SelectVariants -R ref -V raw_variants_recal.vcf -selectType INDEL -o raw_indels_recal.vcf

 

Step 18 Filter SNPs
Tool GATK
Input raw_snps_recal.vcf,
reference genome
Output filtered_snps_final.vcf
Notes

The filtering criteria for SNPs are as follows:

QD < 2.0
FS > 60.0
MQ < 40.0
MQRankSum < -12.5
ReadPosRankSum < -8.0
SOR > 4.0

Note: SNPs which are ‘filtered out’ at this step will remain in the filtered_snps_final.vcf file; however, they will be marked as ‘basic_snp_filter’, while SNPs which passed the filter will be marked as ‘PASS’

Command java -jar GenomeAnalysisTK.jar -T VariantFiltration -R ref -V raw_snps_recal.vcf --filterExpression 'QD < 2.0 || FS > 60.0 || MQ < 40.0 || MQRankSum < -12.5 || ReadPosRankSum < -8.0 || SOR > 4.0' --filterName "basic_snp_filter" -o filtered_snps_final.vcf

 

Step 19 Filter Indels
Tool GATK
Input raw_indels_recal.vcf,
reference genome
Output filtered_indels_recal.vcf
Notes

The filtering criteria for indels are as follows:

QD < 2.0
FS > 200.0
ReadPosRankSum < -20.0
SOR > 10.0

Note: Indels which are ‘filtered out’ at this step will remain in the filtered_indels_recal.vcf file; however, they will be marked as ‘basic_indel_filter’, while indels which passed the filter will be marked as ‘PASS’

Command java -jar GenomeAnalysisTK.jar -T VariantFiltration -R ref -V raw_indels_recal.vcf --filterExpression 'QD < 2.0 || FS > 200.0 || ReadPosRankSum < -20.0 || SOR > 10.0' --filterName "basic_indel_filter" -o filtered_indels_recal.vcf

 

Step 20 Annotate SNPs and Predict Effects
Tool SnpEff
Input filtered_snps_final.vcf
Output filtered_snps_final.ann.vcf,
snpeff_summary.html,
snpEff_genes.txt
Command java -jar snpEff.jar -v snpeff_db filtered_snps_final.vcf > filtered_snps_final.ann.vcf

 

Step 21 Compute Coverage Statistics
Tool Bedtools
Input recal_reads.bam
Output genomecov.bedgraph
Notes Load the genomecov.bedgraph file into IGV to view a coverage map at the entire genome or chromosome level
Command bedtools genomecov -bga -ibam recal_reads.bam > genomecov.bedgraph

 

Step 22 Compile Statistics
Tool parse_metrics.sh (in house)
Input alignment_metrics.txt,
insert_metrics.txt,
raw_snps.vcf,
filtered_snps.vcf,
raw_snps_recal.vcf,
filtered_snps_final.vcf,
depth_out.txt
Output report.csv
Notes A single report file is generated containing summary statistics for all libraries processed, including the following information:

  • # of Reads
  • # of Aligned Reads
  • % Aligned
  • # Aligned Bases
  • Read Length
  • % Paired
  • Mean Insert Size
  • # SNPs, # Filtered SNPs
  • # SNPs after BQSR, # Filtered SNPs after BQSR
  • Average Coverage

This pipeline was presented at the NYU-HiTS seminar on March 03, 2016. Slides available here.