This is an updated version of the variant calling pipeline post published in 2016 (link). This updated version employs GATK4 and is available as a containerized Nextflow script on GitHub.

Identifying genomic variants, including single nucleotide polymorphisms (SNPs) and DNA insertions and deletions (indels), from next generation sequencing data is an important part of scientific discovery. At the NYU Center for Genomics and Systems Biology (CGSB) this task is central to many research programs. For example, the Carlton Lab analyzes SNP data to study population genetics of the malaria parasites Plasmodium falciparum and Plasmodium vivax. The Gresham Lab uses SNP and indel data to study adaptive evolution in yeast, and the Lupoli Lab in the Department of Chemistry uses variant analysis to study antibiotic resistance in E. coli

To facilitate this research, a bioinformatics pipeline has been developed to enable researchers to accurately and rapidly identify, and annotate, sequence variants. The pipeline employs the Genome Analysis Toolkit 4 (GATK4) to perform variant calling and is based on the best practices for variant discovery analysis outlined by the Broad Institute. Once SNPs have been identified, SnpEff is used to annotate, and predict, variant effects. 

This pipeline is intended for calling variants in samples that are clonal – i.e. a single individual.  The frequencies of variants in these samples are expected to be 1 (for haploids or homozygous diploids) or 0.5 (for heterozygous diploids).  To call variants in samples that are heterogeneous, such as human tumors and mixed microbial populations, in which allele frequencies vary continuously between 0 and 1 researcher should use GATK4 Mutect2 which is designed to identify subclonal events (workflow coming soon). 

Base Quality Score Recalibration (BQSR) is an important step for accurate variant detection that aims to minimize the effect of technical variation on base quality scores (measured as Phred scores). As with the original pipeline (link), this pipeline assumes that a ‘gold standard’ set of SNPS and indels are not available for BQSR.  In the absence of a gold standard the pipeline performs an initial step detecting variants without performing BQSR, and then uses the identified SNPs as input for BQSR before calling variants again. This process is referred to as bootstrapping and is the procedure recommended by the Broad Institute’s best practices for variant discovery analysis when a gold standard is not available.

This pipeline uses nextflow, a framework that enables reproducible and portable workflows (https://www.nextflow.io/).  The full pipeline and instructions on how to use the Nextflow script are described below.

Script Location

prince

/scratch/work/cgsb/scripts/variant_calling/gatk4/main.nf
  1. You don’t need to copy the script, but you should copy the file nextflow.config from the above directory into a new project directory.
    (/scratch/work/cgsb/scripts/variant_calling/gatk4/nextflow.config)
  2. If you want to copy and run the main.nf script locally, you must copy the /bin directory as well. This needs to be in the same directory as main.nf.
  3. You will set parameters for your analysis in the nextflow.config file
  4. See How to Use and Examples below to learn how to configure and run the script. 
  5. This Nextflow script uses pre-installed software modules on the Prince HPC, and is tailored to the unique file naming scheme used by the Genomics Core at the NYU CGSB.

github

https://github.com/gencorefacility/variant-calling-pipeline-gatk4
  1. This version of the script is configured for standard Illumina filenames
  2. This version of the script does not use software modules local to the NYU Prince HPC, it is instead packaged with a Docker container which has all tools used in the pipeline. Integration with Docker allows for completely self-contained and truly reproducible analysis. This example shows how to run the workflow using the container. The container is hosted on Docker at: https://hub.docker.com/r/gencorefacility/variant-calling-pipeline-gatk4

Full List of Tools Used in this Pipeline

  • GATK4
  • BWA
  • Picard Tools
  • Samtools
  • SnpEff
  • R (dependency for some GATK steps)

How to Use This Script

Setting Up Nextflow

This pipeline is written in Nextflow, a computational pipeline framework. The easiest way to get setup with Nextflow is with conda. On the NYU Prince HPC, conda is already installed and available as a module.

module load anaconda3/2019.10

Note: Use module avail anaconda to check for the latest conda version. 

Update: Prince users can now load Nextflow using the module load command. There is no need to install via conda. Simply load the module and skip to “Setting Up The Script” below.

For users outside NYU, see the conda installation instructions, then come back and follow the rest of the instructions below.

Once you have conda loaded, make sure you add the bioconda channel (required for Nextflow) as well as the other channels bioconda depends on. It is important to add them in this order so that the priority is set correctly (if you get warnings that a channel exists and is being moved to the top, that’s OK keep going):

conda config --add channels defaults
conda config --add channels bioconda
conda config --add channels conda-forge

Create a conda environment for Nextflow. I call it ‘nf-env’ but you can call it anything you like.

conda create --name nf-env

(If you get a message asking you to update conda, press enter to proceed and allow your conda environment to be created.)

We’ll install Nextflow inside this newly created environment. Anytime we want to use Nextflow, we’ll need to load the environment first using the following command:

conda activate nf-env

When a conda environment has been loaded, you’ll see its name in parenthesis at the command prompt, for example:

(nf-env) [user@log-1 ~]$

Once you’ve activated your conda environment, install nextflow using:

conda install nextflow

(When you’re finished with nextflow, deactivate the environment using conda deactivate)

Setting Up The Script

Once you have Nextflow setup, we can run the pipeline. The pipeline requires six mandatory input parameters to run. The values can be hard coded into a config file, or provided as command line arguments (examples below).

1) params.reads The directory with your FastQ libraries to be processed, including a glob path matcher. Must include a paired end group pattern matcher (i.e. {1,2}).
Example:
params.reads = "/path/to/data/*_n0{1,2}_*.fastq.gz"
or
params.reads = "/path/to/data/*_R{1,2}_*.fastq.gz"

2) params.ref Reference genome to use, in fasta format. If you’re at the NYU CGSB, there’s a good chance your genome of interest is located in the Shared Genome Resource (/scratch/work/cgsb/genomes/Public/). Selecting a reference genome from within the Shared Genome Resource ensures that all index files and reference dictionaries required by BWA, Picard, GATK, etc. are available. If you’re using your own reference data, you will need to build index files for BWA (using BWA), a fasta index (using samtools), and a reference dictionary (using Picard Tools). These files need to be located in the same directory as the reference fasta file.

3) params.snpeff_dbThe name of the SnpEff database to use. To determine the name of the SnpEff DB to use, ensure you have snpeff installed/loaded and then issue the databases command.

# On the Prince HPC you can load SnpEff using module load
module load snpeff/4.3i

# $SNPEFF_JAR is the path to the snpEff.jar file.
# On Prince this is automatically set to the
# snpEff.jar path when the module is loaded
java -jar $SNPEFF_JAR databases | grep -i SPECIES_NAME


From the list of returned results, select the database you wish to use. The value of the first column is the value to input for params.snpeff_db.

For example if you wanted to use the Arabidopsis Thaliana SnpEff database:

java -jar $SNPEFF_JAR databases | grep -i thaliana

Output:

athalianaTair10   Arabidopsis_Thaliana http://downloads.sourceforge.net/...

Then:

params.snpeff_db='athalianaTair10'

Note: If your organism is not available in SnpEff, it is possible to create a custom SnpEff Database if a reference fasta and gff file are available.

4) params.outdir By default, output files will go into params.outdir/out, temporary GATK files will go into params.outdir/gatk_temp, SnpEff database files will go into params.outdir/snpeff_db, and Nextflow will run all analysis in params.outdir/nextflow_work_dir.These individual paths can be overridden by specifying the params.out, params.tmpdir, params.snpeff_data, and the workDir parameters respectively. Note: There is no  params. prefix for workDir.

Note: on the Prince HPC, we recommend you set params.outdir to somewhere in your directory on /beegfs, e.g. /beegfs/netID/squid_analysis

5) params.pl Platform. E.g. Illumina (required for read groups)

6) params.pm Machine. E.g. nextseq (required for read groups)

Examples

Note: If you are working on the Prince HPC, you should run all nextflow jobs in an interactive slurm session. This is because Nextflow uses some resources in managing your workflow. 

Once you have set up your nextflow environment you can perform your analysis in multiple ways.  Below I describe four different ways this can be done.

Scenario 1: Hard code all parameters in config file

Copy the nextflow.config file into a new directory (see Script Location). Edit the first six parameters in this file, e.g.:

params.reads = "/path/to/reads/*_n0{1,2}_*.fastq.gz"
params.ref = "/path/to/ref.fa"
params.outdir = "/scratch/user/gatk4/"
params.snpeff_db = "athalianaTair10"
params.pl = "illumina"
params.pm = "nextseq"

After activating your conda environment with Nextflow installed, run:

nextflow run main.nf

By default, Nextflow will look for a config file called nextflow.config inside the directory from which Nextflow was launched. You can specify a different config file on the command line using -c:

nextflow run main.nf -c your.config

Note: Parameters hard coded in the config can be overridden by arguments passed through the command line. See following scenario.

Scenario 2: Call the script providing all required parameters through the command line (after activating your Nextflow conda environment)

nextflow run main.nf \
    --reads "/path/to/reads/*_n0{1,2}_*.fastq.gz" \
    --ref "/path/to/ref.fa" \
    --outdir "/scratch/user/gatk4/" \
    --snpeff_db "athalianaTair10" \
    --pl "illumina" \
    --pm "illumina"

Note: You are able to set/override any and all parameters in this way.

Scenario 3: Call directly from GitHub
Nextflow allows you to run scripts hosted on GitHub directly. There is no need to download the script (main.nf) to your local computer. Rather, you can call the variant calling pipeline script directly from github by entering the following command:

nextflow run gencorefacility/variant-calling-pipeline-gatk4

In this scenario you still need the config file (which you can download from the git repo) to set your parameters (see Scenario 1), or pass parameters through the command line (see Scenario 2). 

Scenario 4: Use with a Docker container

All software tools used in the pipeline have been packaged in a Docker container. You can instruct Nextflow to use a containerized Docker image for the analysis by passing the -with-docker or -with-singularity flags and the container. Calling the script directly from GitHub and telling it to use this docker image looks like this: 

nextflow run gencorefacility/variant-calling-pipeline-gatk4 -with-docker gencorefacility/variant-calling-pipeline-gatk4

Launching Nextflow

Once Nextflow begins running, it will actively manage and monitor your jobs. If you’re working on the Prince HPC at NYU, the corresponding nextflow.config file contains the line process.executor = 'slurm' which instructs Nextflow to submit jobs to SLURM. You can monitor your jobs in real time using watch and squeue:

watch squeue -u <netID>

Note: Since Nextflow actively manages your jobs, the session in which Nextflow is running must remain open. For example, if Nextflow is running and you lose power or network, Nextflow will be terminated. To avoid this, I recommend running Nextflow inside a tmux session, using nohup, or even submitting the nextflow script as a SLURM job. My preference is to use tmux (terminal multiplexer). 

Enter the command tmux to enter a tmux session. Launch an interactive slurm session if you are on the Prince HPC. Activate your conda environment. Now, you can launch your Nextflow script.

“Detach” from the tmux session at any time by pressing “ctrl” + “B” together, then pressing “D” (for detach). If you ever lose power, connection, or detach, you can reattach your tmux session by typing tmux a. You will find your session running just as when you last left it. 

If your Nextflow script does get interrupted for some reason, you can always resume the pipeline from the last successfully completed step using the -resume flag:

nextflow run main.nf -resume

Workflow Overview

GATK4 Variant Calling Pipeline Workflow
Step 1Alignment – Map to Reference
ToolBWA MEM
Input.fastq files
reference genome
Outputaligned_reads.sam
Notes-Y tells BWA to use soft clipping for supplementary alignments
-K tells BWA to process INT input bases in each batch regardless of nThreads (for reproducibility)
Readgroup info is provided with the -R flag. This information is key for downstream GATK functionality. GATK will not work without a read group tag.
Commandbwa mem \
-K 100000000 \
-Y \
-R '@RG\tID:sample_1\tLB:sample_1\tPL:ILLUMINA\tPM:HISEQ\tSM:sample_1' \
ref.fa \
sample_1_reads_1.fq \
sample_1_reads_2.fq \
> aligned_reads.sam
Step 2Mark Duplicates + Sort
ToolGATK4 MarkDuplicatesSpark
Inputaligned_reads.sam
Outputsorted_dedup_reads.bam
sorted_dedup_reads.bam.bai
dedup_metrics.txt
NotesIn GATK4, the Mark Duplicates and Sort Sam steps have been combined into one step using the MarkDuplicatesSpark tool. In addition, the BAM index file (.bai) is created as well by default. The tool is optimized to run on queryname-grouped alignments (that is, all reads with the same queryname are together in the input file). The output of BWA is query-grouped, however if provided coordinate-sorted alignments, the tool will spend additional time first queryname sorting the reads internally. Due to MarkDuplicatesSpark queryname-sorting coordinate-sorted inputs internally at the start, the tool produces identical results regardless of the input sort-order. That is, it will flag duplicates sets that include secondary, and supplementary and unmapped mate records no matter the sort-order of the input. This differs from how Picard MarkDuplicates behaves given the differently sorted inputs. (i.e. coordinate sorted vs queryname sorted). 

You don’t need a Spark cluster to run Spark-enabled GATK tools. If you’re working on a “normal” machine (even just a laptop) with multiple CPU cores, the GATK engine can still use Spark to create a virtual standalone cluster in place, and set it to take advantage of however many cores are available on the machine — or however many you choose to allocate.
(https://gatk.broadinstitute.org/hc/en-us/articles/360035890591?id=11245)
Commandgatk \        
MarkDuplicatesSpark \
        -I aligned_reads.sam \
        -M dedup_metrics.txt \
        -O sorted_dedup_reads.bam
Step 3Collect Alignment & Insert Size Metrics
ToolPicard Tools, R, Samtools
Inputsorted_dedup_reads.bam
reference genome
Outputalignment_metrics.txt,
insert_metrics.txt,
insert_size_histogram.pdf,
depth_out.txt
Commandjava -jar picard.jar \
        CollectAlignmentSummaryMetrics \
        R=ref.fa \
        I=sorted_dedup_reads.bam \
        O=alignment_metrics.txt


java -jar picard.jar \
CollectInsertSizeMetrics \
        INPUT=sorted_dedup_reads.bam \
        OUTPUT=insert_metrics.txt \
        HISTOGRAM_FILE=insert_size_histogram.pdf


samtools depth -a sorted_dedup_reads.bam > depth_out.txt
Step 4Call Variants
ToolGATK4
Inputsorted_dedup_reads.bam
reference genome
Outputraw_variants.vcf
NotesFirst round of variant calling. The variants identified in this step will be filtered and provided as input for Base Quality Score Recalibration (BQSR)
Commandgatk HaplotypeCaller \
        -R ref.fa \
        -I sorted_dedup_reads.bam \
        -o raw_variants.vcf
Step 5Extract SNPs & Indels
ToolGATK4
Inputraw_variants.vcf
reference genome
Outputraw_indels.vcf
raw_snps.vcf
NotesThis step separates SNPs and Indels so they can be processed and used independently
Commandgatk SelectVariants \
        -R ref.fa \
        -V raw_variants.vcf \
        -selectType SNP \
        -o raw_snps.vcf
gatk SelectVariants \
        -R ref.fa \
        -V raw_variants.vcf \
        -selectType INDEL \
        -o raw_indels.vcf
Step 6Filter SNPs
ToolGATK4
Inputraw_snps.vcf
reference genome
Outputfiltered_snps.vcf
filtered_snps.vcf.idx
NotesThe filters below are a good starting point provided by the Broad. You will need to experiment a little to find the values that are appropriate for your data, and to produce the tradeoff between sensitivity and specificity that is right for your analysis.

QD < 2.0: This is the variant confidence (from the QUAL field) divided by the unfiltered depth of non-hom-ref samples. This annotation is intended to normalize the variant quality in order to avoid inflation caused when there is deep coverage. For filtering purposes it is better to use QD than either QUAL or DP directly.

FS > 60.0: This is the Phred-scaled probability that there is strand bias at the site. Strand Bias tells us whether the alternate allele was seen more or less often on the forward or reverse strand than the reference allele. When there is little to no strand bias at the site, the FS value will be close to 0.

MQ < 40.0: This is the root mean square mapping quality over all the reads at the site. Instead of the average mapping quality of the site, this annotation gives the square root of the average of the squares of the mapping qualities at the site. It is meant to include the standard deviation of the mapping qualities. Including the standard deviation allows us to include the variation in the dataset. A low standard deviation means the values are all close to the mean, whereas a high standard deviation means the values are all far from the mean.When the mapping qualities are good at a site, the MQ will be around 60.

SOR > 4.0: This is another way to estimate strand bias using a test similar to the symmetric odds ratio test. SOR was created because FS tends to penalize variants that occur at the ends of exons. Reads at the ends of exons tend to only be covered by reads in one direction and FS gives those variants a bad score. SOR will take into account the ratios of reads that cover both alleles.

MQRankSum < -8.0: Compares the mapping qualities of the reads supporting the reference allele and the alternate allele.

ReadPosRankSum < -8.0: Compares whether the positions of the reference and alternate alleles are different within the reads. Seeing an allele only near the ends of reads is indicative of error, because that is where sequencers tend to make the most errors.

Learn more about hard filtering: https://gatk.broadinstitute.org/hc/en-us/articles/360035890471-Hard-filtering-germline-short-variants

Note: SNPs which are ‘filtered out’ at this step will remain in the filtered_snps.vcf file, however they will be marked as ‘_filter’, while SNPs which passed the filter will be marked as ‘PASS’. We need to extract and provide only the passing SNPs to the BQSR tool, we do this in the next step (step 9). 
Commandgatk VariantFiltration \
        -R ref.fa \
        -V raw_snps.vcf \
        -O filtered_snps.vcf \
        -filter-name "QD_filter" -filter "QD < 2.0" \
        -filter-name "FS_filter" -filter "FS > 60.0" \
        -filter-name "MQ_filter" -filter "MQ < 40.0" \
        -filter-name "SOR_filter" -filter "SOR > 4.0" \
        -filter-name "MQRankSum_filter" -filter "MQRankSum < -12.5" \
        -filter-name "ReadPosRankSum_filter" -filter "ReadPosRankSum < -8.0"
Step 7Filter Indels
ToolGATK4
Inputraw_indels.vcf
reference genome
Outputfiltered_indels.vcf
filtered_indels.vcf.idx
NotesThe filters below are a good starting point provided by the Broad. You will need to experiment a little to find the values that are appropriate for your data, and to produce the tradeoff between sensitivity and specificity that is right for your analysis.

QD < 2.0: This is the variant confidence (from the QUAL field) divided by the unfiltered depth of non-hom-ref samples. This annotation is intended to normalize the variant quality in order to avoid inflation caused when there is deep coverage. For filtering purposes it is better to use QD than either QUAL or DP directly.

FS > 200.0: This is the Phred-scaled probability that there is strand bias at the site. Strand Bias tells us whether the alternate allele was seen more or less often on the forward or reverse strand than the reference allele. When there is little to no strand bias at the site, the FS value will be close to 0.

SOR > 10.0: This is another way to estimate strand bias using a test similar to the symmetric odds ratio test. SOR was created because FS tends to penalize variants that occur at the ends of exons. Reads at the ends of exons tend to only be covered by reads in one direction and FS gives those variants a bad score. SOR will take into account the ratios of reads that cover both alleles.

Learn more about hard filtering: https://gatk.broadinstitute.org/hc/en-us/articles/360035890471-Hard-filtering-germline-short-variants

Note: Indels which are ‘filtered out’ at this step will remain in the filtered_snps.vcf file, however they will be marked as ‘_filter’, while SNPs which passed the filter will be marked as ‘PASS’. We need to extract and provide only the passing indels to the BQSR tool, we do this next. 
Commandgatk VariantFiltration \
        -R ref.fa \

        -V raw_indels.vcf \
        -O filtered_indels.vcf \
        -filter-name "QD_filter" -filter "QD < 2.0" \
        -filter-name "FS_filter" -filter "FS > 200.0" \
        -filter-name "SOR_filter" -filter "SOR > 10.0"
Step 8Exclude Filtered Variants
ToolGATK4
Inputfiltered_snps.vcf
filtered_indels.vcf
Outputbqsr_snps.vcf
bqsr_indels.vcf
NotesWe need to extract only the passing variants and provide this as input to BQSR (next step). 
Commandgatk SelectVariants \
        --exclude-filtered \
        -V filtered_snps.vcf \
        -O bqsr_snps.vcf
gatk SelectVariants \
        --exclude-filtered \
        -V filtered_indels.vcf \

        -O bqsr_indels.vcf
Step 9Base Quality Score Recalibration (BQSR) #1
ToolGATK4
Inputsorted_dedup_reads.bam (from step 2)
bqsr_snps.vcf
bqsr_indels.vcf
reference genome
Outputrecal_data.table
NotesBQSR is performed twice. The second pass is optional, only required to produce a recalibration report.
Command gatk BaseRecalibrator \
        -R ref.fa \
        -I sorted_dedup_reads.bam \
        --known-sites bqsr_snps.vcf \
        --known-sites bqsr_indels.vcf \
        -O recal_data.table
Step 10Apply BQSR
ToolGATK4
Inputrecal_data.table
sorted_dedup_reads.bam
reference genome
Outputrecal_reads.bam
NotesThis step applies the recalibration computed in the first BQSR step to the bam file. This recalibrated bam file is now analysis-ready. 
Commandgatk ApplyBQSR \
        -R ref.fa \
        -I sorted_dedup_reads.bam \
        -bqsr recal_data.table \
        -O recal_reads.bam \
Step 11Base Quality Score Recalibration (BQSR) #2
ToolGATK4
Inputrecal_reads.bam
bqsr_snps.vcf
bqsr_indels.vcf
reference genome
Outputpost_recal_data.table
NotesThis round of BQSR is optional. It is required if you want to produce a recalibration report with the Analyze Covariates step (next). For this round of BQSR, you provide the recalibrated reads obtained from the Apply BQSR step above as input. 
Commandgatk BaseRecalibrator \
        -R ref.fa \
        -I recal_reads.bam \
        --known-sites bqsr_snps.vcf \
        --known-sites bqsr_indels.vcf \
        -O post_recal_data.table
Step 12Analyze Covariates
ToolGATK4
Inputrecal_data.table
post_recal_data.table
Outputrecalibration_plots.pdf
NotesThis step produces a recalibration report based on the output from the two BQSR runs
Commandgatk AnalyzeCovariates \        
       -before recal_data.table \
        -after post_recal_data.table \
        -plots recalibration_plots.pdf
Step 13Call Variants
ToolGATK4
Inputrecal_reads.bam
reference genome
Outputraw_variants_recal.vcf
NotesSecond round of variant calling performed using recalibrated (analysis-ready) bam
Commandgatk HaplotypeCaller \
        -R ref.fa \
        -I recal_reads.bam \
        -o raw_variants_recal.vcf
Step 14Extract SNPs & Indels
ToolGATK4
Inputraw_variants_recal.vcf
reference genome
Outputraw_indels_recal.vcfraw_snps_recal.vcf
NotesThis step separates SNPs and Indels so they can be processed and analyzed independently
Commandgatk SelectVariants \
        -R ref.fa \
        -V raw_variants_recal.vcf \
        -selectType SNP \
        -o raw_snps_recal.vcf
gatk SelectVariants \

        -R ref.fa \
        -V raw_variants.vcf \
        -selectType INDEL \
        -o raw_indels_recal.vcf
Step 15Filter SNPs
ToolGATK4
Inputraw_snps_recal.vcf
reference genome
Outputfiltered_snps_final.vcf
filtered_snps_final.vcf.idx
NotesThe filters below are a good starting point provided by the Broad. You will need to experiment a little to find the values that are appropriate for your data, and to produce the tradeoff between sensitivity and specificity that is right for your analysis.

QD < 2.0: This is the variant confidence (from the QUAL field) divided by the unfiltered depth of non-hom-ref samples. This annotation is intended to normalize the variant quality in order to avoid inflation caused when there is deep coverage. For filtering purposes it is better to use QD than either QUAL or DP directly.

FS > 60.0: This is the Phred-scaled probability that there is strand bias at the site. Strand Bias tells us whether the alternate allele was seen more or less often on the forward or reverse strand than the reference allele. When there is little to no strand bias at the site, the FS value will be close to 0.

MQ < 40.0: This is the root mean square mapping quality over all the reads at the site. Instead of the average mapping quality of the site, this annotation gives the square root of the average of the squares of the mapping qualities at the site. It is meant to include the standard deviation of the mapping qualities. Including the standard deviation allows us to include the variation in the dataset. A low standard deviation means the values are all close to the mean, whereas a high standard deviation means the values are all far from the mean.When the mapping qualities are good at a site, the MQ will be around 60.

SOR > 4.0: This is another way to estimate strand bias using a test similar to the symmetric odds ratio test. SOR was created because FS tends to penalize variants that occur at the ends of exons. Reads at the ends of exons tend to only be covered by reads in one direction and FS gives those variants a bad score. SOR will take into account the ratios of reads that cover both alleles.

MQRankSum < -8.0: Compares the mapping qualities of the reads supporting the reference allele and the alternate allele.

ReadPosRankSum < -8.0: Compares whether the positions of the reference and alternate alleles are different within the reads. Seeing an allele only near the ends of reads is indicative of error, because that is where sequencers tend to make the most errors.

Learn more about hard filtering: https://gatk.broadinstitute.org/hc/en-us/articles/360035890471-Hard-filtering-germline-short-variants

Note: SNPs which are ‘filtered out’ at this step will remain in the filtered_snps.vcf file, however they will be marked as ‘_filter’, while SNPs which passed the filter will be marked as ‘PASS’.
Command gatk VariantFiltration \       
-R ref.fa \
        -V raw_snps_recal.vcf \
        -O filtered_snps_final.vcf \
        -filter-name "QD_filter" -filter "QD < 2.0" \
        -filter-name "FS_filter" -filter "FS > 60.0" \
        -filter-name "MQ_filter" -filter "MQ < 40.0" \
        -filter-name "SOR_filter" -filter "SOR > 4.0" \
        -filter-name "MQRankSum_filter" -filter "MQRankSum < -12.5" \
        -filter-name "ReadPosRankSum_filter" -filter "ReadPosRankSum < -8.0"
Step 16Filter Indels
ToolGATK4
Inputraw_indels_recal.vcf
reference genome
Outputfiltered_indels_final.vcf
filtered_indels_final.vcf.idx
NotesThe filters below are a good starting point provided by the Broad. You will need to experiment a little to find the values that are appropriate for your data, and to produce the tradeoff between sensitivity and specificity that is right for your analysis.

QD < 2.0: This is the variant confidence (from the QUAL field) divided by the unfiltered depth of non-hom-ref samples. This annotation is intended to normalize the variant quality in order to avoid inflation caused when there is deep coverage. For filtering purposes it is better to use QD than either QUAL or DP directly.

FS > 200.0: This is the Phred-scaled probability that there is strand bias at the site. Strand Bias tells us whether the alternate allele was seen more or less often on the forward or reverse strand than the reference allele. When there is little to no strand bias at the site, the FS value will be close to 0.

SOR > 10.0: This is another way to estimate strand bias using a test similar to the symmetric odds ratio test. SOR was created because FS tends to penalize variants that occur at the ends of exons. Reads at the ends of exons tend to only be covered by reads in one direction and FS gives those variants a bad score. SOR will take into account the ratios of reads that cover both alleles.

Learn more about hard filtering: https://gatk.broadinstitute.org/hc/en-us/articles/360035890471-Hard-filtering-germline-short-variants

Note: Indels which are ‘filtered out’ at this step will remain in the filtered_snps.vcf file, however they will be marked as ‘_filter’, while SNPs which passed the filter will be marked as ‘PASS’.
Command gatk VariantFiltration \        
-R ref.fa \
        -V raw_indels_recal.fa \
        -O filtered_indels_final.vcf \
        -filter-name "QD_filter" -filter "QD < 2.0" \
        -filter-name "FS_filter" -filter "FS > 200.0" \
        -filter-name "SOR_filter" -filter "SOR > 10.0"
Step 17Annotate SNPs and Predict Effects
ToolSnpEff
Inputfiltered_snps_final.vcf
Outputfiltered_snps_final.ann.vcf
snpeff_summary.html
snpEff_genes.txt
Commandjava -jar snpEff.jar -v \        
<snpeff_db> \
        filtered_snps_final.vcf > $filtered_snps_final.ann.vcf
Step 18Compile Statistics
Toolparse_metrics.sh (in /bin)
Inputsample_id_alignment_metrics.txt
sample_id_insert_metrics.txt
sample_id_dedup_metrics.txt
sample_id_raw_snps.vcf
sample_id_filtered_snps.vcf
sample_id_raw_snps_recal.vcf
sample_id_filtered_snps_final.vcf
sample_id_depth_out.txt
Outputreport.csv
Commandparse_metrics.sh sample_id > sample_id_report.csv
NotesA single report file is generated with summary statistics for each library processed containing the following pieces of information:

# of Reads
# of Aligned Reads
% Aligned
# Aligned Bases
Read Length
% Paired
% Duplicate
Mean Insert Size
# SNPs, # Filtered SNPs
# SNPs after BQSR, # Filtered SNPs after BQSR
Average Coverage

Give the script a sample id and it will look for the corresponding sample_id_* files (listed in the input section above) and print the statistics and header to stdout. 

9 Comments

Eric Borenstein · 2020-04-03 at 1:34 pm

This workflow can also be ran as an SBATCH rather than interactively. The SBATCH options to change would be job-name, output, and possibly time. The resources set in SBATCH are only for the job controller nextflow and not the actual compute, so no need to increase. The resources for your compute would be set in the config file given. Job process can be monitored in the slurm output file you set.

#!/bin/bash

#SBATCH –nodes=1
#SBATCH –ntasks-per-node=1
#SBATCH –cpus-per-task=2
#SBATCH –time=5:00:00
#SBATCH –mem=2GB
#SBATCH –job-name=myJob
#SBATCH –output=slurm_%j.out
module purge
module load jdk/1.8.0_111
module load nextflow/20.01.0

nextflow run main.nf -c your.config

MARION · 2020-05-01 at 10:18 am

Thank you so much for this elaborate workflow Mohammed. However I am continuously getting this error nextflow.exception.AbortOperationException: Not a valid project name: main.nf when I run nextflow run main.nf. Thank you in advance!

    Mohammed Khalfan · 2020-05-04 at 6:03 pm

    This error means there is no main.nf file in your working directory. You should change into the directory where the main.nf script is located and then run nextflow run main.nf.

Darach · 2020-05-03 at 1:52 pm

Glad to see you could use that filter/tap trick, and the adoption of nextflow generally. Cleanly written.

Does NYU HPC let you use Docker?

    Mohammed Khalfan · 2020-05-04 at 6:05 pm

    Thank you Darach, I’m really liking nextflow. There are a couple other workflows on our git too. Have you had the opportunity to test out the modules feature? I’m interested to try that next.

    We use Singularity on the HPC which supports the use of Docker containers.

Darach · 2020-05-04 at 7:30 pm

No, I haven’t seen modules yet. It looks like a really smart idea. I look forward to what y’all write in that; I know that’ll be really cleanly done.

Ah, so it makes sense to write it as a Docker recipe, because then that’s more flexible for users in both platforms. Presumably folks using cloud-computing will appreciate that.

I’m going to copy the way you’re writing channel operators in the actual input field. I hadn’t thought to do that, and I think that style is a much cleaner way of organizing nextflow. Thanks for sharing your code with everyone, for direct use and inspiration.

Mohammed Khalfan · 2020-05-06 at 10:22 pm

Thank you for the kind words Darach, I value your feedback.

Marion · 2020-05-07 at 1:27 pm

Hi Mohammed, I got an error at the markDuplicatesSpark step caused by: java.lang.IllegalStateException: Duplicate key A00431:16:H5JTVDSXX:1:1101:27308:32972 (attempted merging values -1 and -1). I wonder what it is all about, I have tried to figure out how to go about this and failed. Thank you in advance.

Marion · 2020-05-08 at 9:49 am

Was able to resolve this by upgrading my GATK version from V4.0.4.0 to V4.0.5.0. Thank you!

Leave a Reply

Your email address will not be published. Required fields are marked *