nf-core is a community effort to collect a curated set of analysis pipelines built using Nextflow. This post will walk you through running the nf-core RNA-Seq workflow.
The pipeline uses the STAR aligner by default and quantifies data using Salmon, providing gene/transcript counts and extensive quality control. Prior to alignment, the pipeline uses Trim Galore to trim low-quality bases from the 3′ end of reads and to perform adapter trimming, attempting to auto-detect which adapter was used (from the standard Illumina, small RNA, and Nextera adapters). The pipeline runs a host of other QC tools, including DESeq2 to produce a PCA plot for sample-level QC (note that this requires a minimum of two replicates per library). Results are automatically compiled into a MultiQC report and can be emailed to you upon pipeline completion.
While you have the option to provide reference genome index files to the pipeline, I recommend providing only the FASTA and GTF files and letting the pipeline generate the indices the first time you run the workflow (for reproducibility). Provide the --save_reference parameter so they are saved for subsequent use (the index-building step can be very time-consuming).
Instead of a path to a file, a URL can be supplied to download reference FASTA and GTF files at the start of the pipeline with the --downloadFasta and --downloadGTF parameters.
Alternatively, reference data can be obtained from AWS-iGenomes automatically by providing the --genome parameter (e.g. --genome GRCh37).
As per the documentation, it’s a good idea to specify a pipeline version when running the pipeline on your data. This ensures that a specific version of the pipeline code and software are used when you run your pipeline. If you keep using the same tag, you’ll be running the same version of the pipeline, even if there have been changes to the code since.
To see what versions are available, go to the nf-core/rnaseq releases page and find the latest version number – numeric only (e.g. 3.0). Then specify this when running the pipeline with -r (one hyphen) – e.g. -r 3.0.
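For example, pinning the pipeline to version 3.0 looks like this (the parameter values are illustrative placeholders; --input, --fasta, --gtf, and --save_reference are documented nf-core/rnaseq parameters):

```bash
nextflow run nf-core/rnaseq -r 3.0 \
    --input samplesheet.csv \
    --fasta genome.fa \
    --gtf genes.gtf \
    --save_reference
```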
The version number will be logged in reports when you run the pipeline, so that you’ll know what you used when you look back in the future.
For additional options for trimming, ribosomal RNA removal, UMI-based read de-duplication, other alignment and quantification tools, and more see: https://nf-co.re/rnaseq/parameters
NOTE: NYU CGSB users will require an additional step (step #1), and have the ability to publish their results to the department JBrowse server (step #6).
1) Rename Fastq Files
NOTE: This step is for NYU CGSB users only
nf-core workflows expect standard Illumina filenames by default. At NYU CGSB, FASTQ files are not named using the standard Illumina file-naming scheme. Therefore, CGSB users will need to run the following script prior to running any nf-core workflow. The script creates a folder within the target directory called ‘inames’ containing symlinks to the original data that use standard Illumina filenames. CGSB users must then provide the path to the files in this ‘inames’ directory when creating the samplesheet (next step). Run the script as follows, providing the path to the reads and the target directory as parameters:
sh /scratch/work/cgsb/scripts/rename_fastq_files/rename_fastq_files.sh \
    <path_to_reads> \
    <target_dir>
Example:
sh /scratch/work/cgsb/scripts/rename_fastq_files/rename_fastq_files.sh \
    /scratch/cgsb/gencore/out/Gresham/2020-01-10_HV2J2BGXC/merged/ \
    /scratch/$USER/project_dir
2) Prepare Samplesheet
The pipeline requires a samplesheet in CSV format as input. The samplesheet must contain the following five headers: group, replicate, fastq_1, fastq_2, strandedness.
It is possible to include multiple runs of the same library in a samplesheet. The group and replicate identifiers are the same when you have re-sequenced the same sample more than once (e.g. to increase sequencing depth). The pipeline will concatenate the raw reads before alignment.
strandedness can be forward, reverse, unstranded, or auto in which case the pipeline will attempt to automatically determine the strandedness of the reads.
It is also possible to mix paired-end and single-end reads in a samplesheet.
Below is an example of a samplesheet consisting of both single- and paired-end data. This is for two experimental groups in triplicate, where the last replicate of the treatment group has been sequenced twice.
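A samplesheet matching that description might look like the following (filenames and group names are illustrative; here the control group is paired-end, the treatment group is single-end, and treatment replicate 3 appears twice because it was sequenced on two runs):

```
group,replicate,fastq_1,fastq_2,strandedness
control,1,ctrl_rep1_R1.fastq.gz,ctrl_rep1_R2.fastq.gz,forward
control,2,ctrl_rep2_R1.fastq.gz,ctrl_rep2_R2.fastq.gz,forward
control,3,ctrl_rep3_R1.fastq.gz,ctrl_rep3_R2.fastq.gz,forward
treatment,1,treat_rep1_R1.fastq.gz,,forward
treatment,2,treat_rep2_R1.fastq.gz,,forward
treatment,3,treat_rep3_run1_R1.fastq.gz,,forward
treatment,3,treat_rep3_run2_R1.fastq.gz,,forward
```

Note that fastq_2 is left empty for single-end libraries, and that the two rows sharing group/replicate identifiers will be concatenated before alignment.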
3) Prepare Config File
Copy this gist into your project directory and provide the path to your samplesheet, FASTA, GTF, and out_root, as well as your email address if you’d like to be notified when the pipeline completes or if there are any errors.
4) Running the Workflow
Note: v3.0 of the RNA-Seq pipeline requires Nextflow 20.11.0-edge or higher.
On the NYU HPC, you must submit Nextflow pipelines as an SBATCH job.
Copy and paste the code below into a file called launch-nextflow.s. Edit the path to your config file (-c) and version number (-r) as required, and add any additional parameters if necessary.
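A minimal sketch of what launch-nextflow.s might contain (the module name, resource requests, and job name are placeholders; adjust them for your cluster and analysis):

```bash
#!/bin/bash
#SBATCH --job-name=nf-rnaseq
#SBATCH --time=72:00:00
#SBATCH --mem=8GB
#SBATCH --cpus-per-task=1

module purge
# Load Nextflow (or activate your Nextflow conda environment instead)
module load nextflow/20.11.0

# -r pins the pipeline version; -c points to your edited config file
nextflow run nf-core/rnaseq -r 3.0 -c nextflow.config
```

Note that this job only runs the Nextflow manager process; Nextflow itself submits the individual pipeline tasks to SLURM as separate jobs.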
Once you have created this file, submit it to SLURM using the following command:
sbatch launch-nextflow.s
5) Output
Once the pipeline has completed, you will find your output files in the results directory within the directory you set as out_root in the config.
If you provided your email address, you will receive notification via email when your pipeline completes with information pertaining to your analysis as well as a comprehensive MultiQC report attached. If you did not provide your email address, you can find the MultiQC report in the results directory.
6) JBrowse
CGSB users have the option to push their results to JBrowse for visualization. To push data to JBrowse you will need to request access at https://forms.bio.nyu.edu if you have not already done so (requires PI approval). Then, simply execute the following command:
If you use nf-core/rnaseq for your analysis, please cite it using the following doi: 10.5281/zenodo.1400710
In addition, cite the nf-core publication as follows:
The nf-core framework for community-curated bioinformatics pipelines. Philip Ewels, Alexander Peltzer, Sven Fillinger, Harshil Patel, Johannes Alneberg, Andreas Wilm, Maxime Ulysse Garcia, Paolo Di Tommaso & Sven Nahnsen. Nat Biotechnol. 2020 Feb 13. doi: 10.1038/s41587-020-0439-x.
Additional Comments
Recently, I observed this warning message immediately upon launching the pipeline (for documentation purposes, this was version 3.14.0):
WARN: ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Multiple config files detected!
Please provide pipeline parameters via the CLI or Nextflow '-params-file' option.
Custom config files including those provided by the '-c' Nextflow option can be
used to provide any configuration except for parameters.
Docs: https://nf-co.re/usage/configuration#custom-configuration-files
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
It appears that pipeline parameters should be moved into a separate JSON file and passed to the pipeline using the -params-file option. According to the documentation, this applies to nf-core pipelines using DSL2, which this pipeline adopted in version 2.0. The documentation also clarifies that “For Nextflow DSL2 nf-core pipelines – parameters defined in the parameter block in custom.config files WILL NOT override defaults in nextflow.config”. In my tests, however, the pipeline still runs correctly with all parameters in the nextflow.config file; you will simply receive the warning message. Curiously, the warning appears even when using a JSON file with the -params-file option. If you choose, or need, to use a JSON file for your parameters as indicated in the warning message, here is how to proceed:
Create the JSON file with your parameters as shown below. Save this file as nf-params.json. (See documentation for JSON format here)
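A sketch of what nf-params.json might look like (the keys correspond to documented nf-core/rnaseq parameters; the paths are placeholders for your own files):

```json
{
    "input": "/path/to/samplesheet.csv",
    "fasta": "/path/to/genome.fa",
    "gtf": "/path/to/genes.gtf",
    "outdir": "/path/to/results",
    "email": "you@example.edu"
}
```

You would then launch the pipeline with -params-file nf-params.json in place of the parameter lines in your config.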
This is an updated version of the variant calling pipeline post published in 2016 (link). This updated version employs GATK4 and is available as a containerized Nextflow script on GitHub.
Identifying genomic variants, including single nucleotide polymorphisms (SNPs) and DNA insertions and deletions (indels), from next-generation sequencing data is an important part of scientific discovery. At the NYU Center for Genomics and Systems Biology (CGSB) this task is central to many research programs. For example, the Carlton Lab analyzes SNP data to study population genetics of the malaria parasites Plasmodium falciparum and Plasmodium vivax. The Gresham Lab uses SNP and indel data to study adaptive evolution in yeast, and the Lupoli Lab in the Department of Chemistry uses variant analysis to study antibiotic resistance in E. coli.
To facilitate this research, a bioinformatics pipeline has been developed to enable researchers to accurately and rapidly identify, and annotate, sequence variants. The pipeline employs the Genome Analysis Toolkit 4 (GATK4) to perform variant calling and is based on the best practices for variant discovery analysis outlined by the Broad Institute. Once SNPs have been identified, SnpEff is used to annotate, and predict, variant effects.
This pipeline is intended for calling variants in samples that are clonal – i.e. derived from a single individual. The frequencies of variants in these samples are expected to be 1 (for haploids or homozygous diploids) or 0.5 (for heterozygous diploids). To call variants in heterogeneous samples, such as human tumors and mixed microbial populations, in which allele frequencies vary continuously between 0 and 1, researchers should use GATK4 Mutect2, which is designed to identify subclonal events (workflow coming soon).
Base Quality Score Recalibration (BQSR) is an important step for accurate variant detection that aims to minimize the effect of technical variation on base quality scores (measured as Phred scores). As with the original pipeline (link), this pipeline assumes that a ‘gold standard’ set of SNPs and indels is not available for BQSR. In the absence of a gold standard, the pipeline first detects variants without performing BQSR, then uses the identified SNPs as input for BQSR before calling variants again. This process is referred to as bootstrapping and is the procedure recommended by the Broad Institute’s best practices for variant discovery analysis when a gold standard is not available.
This pipeline uses Nextflow, a framework that enables reproducible and portable workflows (https://www.nextflow.io/). The full pipeline and instructions on how to use the Nextflow script are described below.
You don’t need to copy the script itself, but you should copy the file nextflow.config (/scratch/work/cgsb/scripts/variant_calling/gatk4/nextflow.config) into a new project directory.
If you want to copy and run the main.nf script locally, you must copy the /bin directory as well. This needs to be in the same directory as main.nf.
You will set parameters for your analysis in the nextflow.config file
See How to Use and Examples below to learn how to configure and run the script.
This Nextflow script uses pre-installed software modules on the NYU HPC, and is tailored to the unique file naming scheme used by the Genomics Core at the NYU CGSB.
This version of the script is configured for standard Illumina filenames.
This version of the script does not use software modules local to the NYU HPC; it is instead packaged with a Docker container which has all tools used in the pipeline. Integration with Docker allows for completely self-contained and truly reproducible analysis. Scenario 4 under Examples below shows how to run the workflow using the container. The container is hosted on Docker Hub at: https://hub.docker.com/r/gencorefacility/variant-calling-pipeline-gatk4
Full List of Tools Used in this Pipeline
GATK4
BWA
Picard Tools
Samtools
SnpEff
R (dependency for some GATK steps)
How to Use This Script
Setting Up Nextflow
This pipeline is written in Nextflow, a computational pipeline framework. The easiest way to get setup with Nextflow is with conda. On the NYU HPC, conda is already installed and available as a module.
module load anaconda3/2019.10
Note: Use module avail anaconda to check for the latest conda version.
Update: NYU HPC users can load Nextflow using the module load command. There is no need to install via conda. Simply load the module and skip to “Setting Up The Script” below.
For users outside NYU, see the conda installation instructions, then come back and follow the rest of the instructions below.
Once you have conda loaded, make sure you add the bioconda channel (required for Nextflow) as well as the other channels bioconda depends on. It is important to add them in this order so that the priority is set correctly (if you get warnings that a channel exists and is being moved to the top, that’s OK, keep going):
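The channel additions look like this (this is the ordering documented by Bioconda; each --add puts the channel at the top of the priority list, so conda-forge ends up highest):

```bash
conda config --add channels defaults
conda config --add channels bioconda
conda config --add channels conda-forge
```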
Create a conda environment for Nextflow. I call it ‘nf-env’ but you can call it anything you like.
conda create --name nf-env
(If you get a message asking you to update conda, press enter to proceed and allow your conda environment to be created.)
We’ll install Nextflow inside this newly created environment. Anytime we want to use Nextflow, we’ll need to load the environment first using the following command:
conda activate nf-env
When a conda environment has been loaded, you’ll see its name in parentheses at the command prompt, for example:
(nf-env) [user@log-1 ~]$
Once you’ve activated your conda environment, install Nextflow using:
conda install nextflow
(When you’re finished with nextflow, deactivate the environment using conda deactivate)
Setting Up The Script
Once you have Nextflow set up, we can run the pipeline. The pipeline requires six mandatory input parameters. The values can be hard-coded into a config file or provided as command-line arguments (examples below).
1) params.reads The directory with your FASTQ libraries to be processed, including a glob path matcher. Must include a paired-end group pattern matcher (i.e. {1,2}). Example: params.reads = "/path/to/data/*_n0{1,2}_*.fastq.gz" or params.reads = "/path/to/data/*_R{1,2}_*.fastq.gz"
2) params.ref Reference genome to use, in fasta format. If you’re at the NYU CGSB, there’s a good chance your genome of interest is located in the Shared Genome Resource (/scratch/work/cgsb/genomes/Public/). Selecting a reference genome from within the Shared Genome Resource ensures that all index files and reference dictionaries required by BWA, Picard, GATK, etc. are available. If you’re using your own reference data, you will need to build index files for BWA (using BWA), a fasta index (using samtools), and a reference dictionary (using Picard Tools). These files need to be located in the same directory as the reference fasta file.
3) params.snpeff_db The name of the SnpEff database to use. To determine the name of the SnpEff DB to use, ensure you have SnpEff installed/loaded and then issue the databases command.
# On the NYU HPC you can load SnpEff using:
module load snpeff/4.3i

# $SNPEFF_JAR is the path to the snpEff.jar file.
# On the NYU HPC this is automatically set to the
# snpEff.jar path when the module is loaded
java -jar $SNPEFF_JAR databases | grep -i SPECIES_NAME
From the list of returned results, select the database you wish to use. The value of the first column is the value to input for params.snpeff_db.
For example, if you wanted to use the Arabidopsis thaliana SnpEff database:
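The config entry would look something like this (the exact database name may differ between SnpEff versions, so take the value from the first column of the databases listing above):

```
params.snpeff_db = "Arabidopsis_thaliana"
```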
Note: If your organism is not available in SnpEff, it is possible to create a custom SnpEff Database if a reference fasta and gff file are available.
4) params.outdir By default, output files will go into params.outdir/out, temporary GATK files into params.outdir/gatk_temp, SnpEff database files into params.outdir/snpeff_db, and Nextflow will run all analysis in params.outdir/nextflow_work_dir. These individual paths can be overridden by specifying the params.out, params.tmpdir, params.snpeff_data, and workDir parameters, respectively. Note: There is no params. prefix for workDir.
5) params.pl Platform. E.g. Illumina (required for read groups)
6) params.pm Machine. E.g. nextseq (required for read groups)
Examples
Note: If you are working on the NYU HPC, you should run all Nextflow jobs in an interactive SLURM session. This is because Nextflow uses some resources to manage your workflow.
Once you have set up your nextflow environment you can perform your analysis in multiple ways. Below I describe four different ways this can be done.
Scenario 1: Hard code all parameters in config file
Copy the nextflow.config file into a new directory (see Script Location). Edit the first six parameters in this file, e.g.:
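A sketch of those six parameter lines (all paths and values here are placeholders; substitute your own data, reference, and SnpEff database name):

```
params.reads     = "/path/to/data/*_R{1,2}_*.fastq.gz"
params.ref       = "/path/to/reference/genome.fa"
params.snpeff_db = "Arabidopsis_thaliana"
params.outdir    = "/path/to/outdir"
params.pl        = "illumina"
params.pm        = "nextseq"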
After activating your conda environment with Nextflow installed, run:
nextflow run main.nf
By default, Nextflow will look for a config file called nextflow.config inside the directory from which Nextflow was launched. You can specify a different config file on the command line using -c:
nextflow run main.nf -c your.config
Note: Parameters hard coded in the config can be overridden by arguments passed through the command line. See following scenario.
Scenario 2: Call the script providing all required parameters through the command line (after activating your Nextflow conda environment)
Note: You are able to set/override any and all parameters in this way.
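An illustrative invocation (values are placeholders; note that pipeline parameters use two hyphens, while Nextflow's own options such as -c use one):

```bash
nextflow run main.nf \
    --reads "/path/to/data/*_R{1,2}_*.fastq.gz" \
    --ref /path/to/reference/genome.fa \
    --snpeff_db Arabidopsis_thaliana \
    --outdir /path/to/outdir \
    --pl illumina \
    --pm nextseq
```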
Scenario 3: Call directly from GitHub
Nextflow allows you to run scripts hosted on GitHub directly. There is no need to download the script (main.nf) to your local computer. Rather, you can call the variant calling pipeline script directly from GitHub by entering the following command:
nextflow run gencorefacility/variant-calling-pipeline-gatk4
In this scenario you still need the config file (which you can download from the git repo) to set your parameters (see Scenario 1), or pass parameters through the command line (see Scenario 2).
Scenario 4: Use with a Docker container
All software tools used in the pipeline have been packaged in a Docker container. You can instruct Nextflow to use a containerized Docker image for the analysis by passing the -with-docker or -with-singularity flag along with the container name. Calling the script directly from GitHub and telling it to use this Docker image looks like this:
nextflow run gencorefacility/variant-calling-pipeline-gatk4 -with-docker gencorefacility/variant-calling-pipeline-gatk4
Launching Nextflow
Once Nextflow begins running, it will actively manage and monitor your jobs. If you’re working on the HPC at NYU, the corresponding nextflow.config file contains the line process.executor = 'slurm' which instructs Nextflow to submit jobs to SLURM. You can monitor your jobs in real time using watch and squeue:
watch squeue -u <netID>
Note: Since Nextflow actively manages your jobs, the session in which Nextflow is running must remain open. For example, if Nextflow is running and you lose power or network, Nextflow will be terminated. To avoid this, I recommend running Nextflow inside a tmux session, using nohup, or even submitting the nextflow script as a SLURM job. My preference is to use tmux (terminal multiplexer).
Enter the command tmux to enter a tmux session. Launch an interactive slurm session if you are on the NYU HPC. Activate your conda environment. Now, you can launch your Nextflow script.
“Detach” from the tmux session at any time by pressing “ctrl” + “B” together, then pressing “D” (for detach). If you ever lose power, connection, or detach, you can reattach your tmux session by typing tmux a. You will find your session running just as when you last left it.
If your Nextflow script does get interrupted for some reason, you can always resume the pipeline from the last successfully completed step using the -resume flag:
nextflow run main.nf -resume
Workflow Overview
Step 1
Alignment – Map to Reference
Tool
BWA MEM
Input
.fastq files, reference genome
Output
aligned_reads.sam
Notes
-Y tells BWA to use soft clipping for supplementary alignments.
-K tells BWA to process INT input bases in each batch regardless of nThreads (for reproducibility).
Read group info is provided with the -R flag. This information is key for downstream GATK functionality; GATK will not work without a read group tag.
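Put together, the alignment command might look like this (sample name, read group fields, and the -K batch size are illustrative; adjust them for your data):

```bash
bwa mem -Y -K 100000000 \
    -R "@RG\tID:sample1\tSM:sample1\tPL:illumina\tPM:nextseq" \
    ref.fa reads_1.fastq.gz reads_2.fastq.gz > aligned_reads.sam
```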
In GATK4, the Mark Duplicates and Sort Sam steps have been combined into one step using the MarkDuplicatesSpark tool, and the BAM index file (.bai) is created by default. The tool is optimized to run on queryname-grouped alignments (that is, all reads with the same queryname are together in the input file). The output of BWA is query-grouped; if given coordinate-sorted alignments, the tool will spend additional time queryname-sorting the reads internally. Because MarkDuplicatesSpark queryname-sorts coordinate-sorted inputs internally at the start, it produces identical results regardless of the input sort order. That is, it will flag duplicate sets that include secondary, supplementary, and unmapped mate records no matter the sort order of the input. This differs from how Picard MarkDuplicates behaves given differently sorted inputs (i.e. coordinate-sorted vs. queryname-sorted).
You don’t need a Spark cluster to run Spark-enabled GATK tools. If you’re working on a “normal” machine (even just a laptop) with multiple CPU cores, the GATK engine can still use Spark to create a virtual standalone cluster in place, and set it to take advantage of however many cores are available on the machine — or however many you choose to allocate. (https://gatk.broadinstitute.org/hc/en-us/articles/360035890591?id=11245)
samtools depth -a sorted_dedup_reads.bam > depth_out.txt
Step 4
Call Variants
Tool
GATK4
Input
sorted_dedup_reads.bam, reference genome
Output
raw_variants.vcf
Notes
First round of variant calling. The variants identified in this step will be filtered and provided as input for Base Quality Score Recalibration (BQSR)
The filters below are a good starting point provided by the Broad. You will need to experiment a little to find the values that are appropriate for your data, and to produce the tradeoff between sensitivity and specificity that is right for your analysis.
QD < 2.0: This is the variant confidence (from the QUAL field) divided by the unfiltered depth of non-hom-ref samples. This annotation is intended to normalize the variant quality in order to avoid inflation caused when there is deep coverage. For filtering purposes it is better to use QD than either QUAL or DP directly.
FS > 60.0: This is the Phred-scaled probability that there is strand bias at the site. Strand Bias tells us whether the alternate allele was seen more or less often on the forward or reverse strand than the reference allele. When there is little to no strand bias at the site, the FS value will be close to 0.
MQ < 40.0: This is the root mean square mapping quality over all the reads at the site. Instead of the average mapping quality of the site, this annotation gives the square root of the average of the squares of the mapping qualities at the site. It is meant to include the standard deviation of the mapping qualities. Including the standard deviation allows us to include the variation in the dataset. A low standard deviation means the values are all close to the mean, whereas a high standard deviation means the values are all far from the mean. When the mapping qualities are good at a site, the MQ will be around 60.
SOR > 4.0: This is another way to estimate strand bias using a test similar to the symmetric odds ratio test. SOR was created because FS tends to penalize variants that occur at the ends of exons. Reads at the ends of exons tend to only be covered by reads in one direction and FS gives those variants a bad score. SOR will take into account the ratios of reads that cover both alleles.
MQRankSum < -8.0: Compares the mapping qualities of the reads supporting the reference allele and the alternate allele.
ReadPosRankSum < -8.0: Compares whether the positions of the reference and alternate alleles are different within the reads. Seeing an allele only near the ends of reads is indicative of error, because that is where sequencers tend to make the most errors.
Note: SNPs which are ‘filtered out’ at this step will remain in the filtered_snps.vcf file; however, they will be marked with the name of the filter they failed, while SNPs which passed the filter will be marked as ‘PASS’. We need to extract and provide only the passing SNPs to the BQSR tool; we do this in the next step (step 9).
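The extraction can be done with GATK SelectVariants and its --exclude-filtered option, which drops any record whose FILTER field is not PASS (the file names here are illustrative):

```bash
gatk SelectVariants \
    --exclude-filtered \
    -V filtered_snps.vcf \
    -O bqsr_snps.vcf
```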
The filters below are a good starting point provided by the Broad. You will need to experiment a little to find the values that are appropriate for your data, and to produce the tradeoff between sensitivity and specificity that is right for your analysis.
QD < 2.0: This is the variant confidence (from the QUAL field) divided by the unfiltered depth of non-hom-ref samples. This annotation is intended to normalize the variant quality in order to avoid inflation caused when there is deep coverage. For filtering purposes it is better to use QD than either QUAL or DP directly.
FS > 200.0: This is the Phred-scaled probability that there is strand bias at the site. Strand Bias tells us whether the alternate allele was seen more or less often on the forward or reverse strand than the reference allele. When there is little to no strand bias at the site, the FS value will be close to 0.
SOR > 10.0: This is another way to estimate strand bias using a test similar to the symmetric odds ratio test. SOR was created because FS tends to penalize variants that occur at the ends of exons. Reads at the ends of exons tend to only be covered by reads in one direction and FS gives those variants a bad score. SOR will take into account the ratios of reads that cover both alleles.
Note: Indels which are ‘filtered out’ at this step will remain in the filtered_indels.vcf file; however, they will be marked with the name of the filter they failed, while indels which passed the filter will be marked as ‘PASS’. We need to extract and provide only the passing indels to the BQSR tool; we do this next.
This round of BQSR is optional. It is required if you want to produce a recalibration report with the Analyze Covariates step (next). For this round of BQSR, you provide the recalibrated reads obtained from the Apply BQSR step above as input.
The filters below are a good starting point provided by the Broad. You will need to experiment a little to find the values that are appropriate for your data, and to produce the tradeoff between sensitivity and specificity that is right for your analysis.
QD < 2.0: This is the variant confidence (from the QUAL field) divided by the unfiltered depth of non-hom-ref samples. This annotation is intended to normalize the variant quality in order to avoid inflation caused when there is deep coverage. For filtering purposes it is better to use QD than either QUAL or DP directly.
FS > 60.0: This is the Phred-scaled probability that there is strand bias at the site. Strand Bias tells us whether the alternate allele was seen more or less often on the forward or reverse strand than the reference allele. When there is little to no strand bias at the site, the FS value will be close to 0.
MQ < 40.0: This is the root mean square mapping quality over all the reads at the site. Instead of the average mapping quality of the site, this annotation gives the square root of the average of the squares of the mapping qualities at the site. It is meant to include the standard deviation of the mapping qualities. Including the standard deviation allows us to include the variation in the dataset. A low standard deviation means the values are all close to the mean, whereas a high standard deviation means the values are all far from the mean. When the mapping qualities are good at a site, the MQ will be around 60.
SOR > 4.0: This is another way to estimate strand bias using a test similar to the symmetric odds ratio test. SOR was created because FS tends to penalize variants that occur at the ends of exons. Reads at the ends of exons tend to only be covered by reads in one direction and FS gives those variants a bad score. SOR will take into account the ratios of reads that cover both alleles.
MQRankSum < -8.0: Compares the mapping qualities of the reads supporting the reference allele and the alternate allele.
ReadPosRankSum < -8.0: Compares whether the positions of the reference and alternate alleles are different within the reads. Seeing an allele only near the ends of reads is indicative of error, because that is where sequencers tend to make the most errors.
Note: SNPs which are ‘filtered out’ at this step will remain in the filtered_snps.vcf file; however, they will be marked with the name of the filter they failed, while SNPs which passed the filter will be marked as ‘PASS’.
The filters below are a good starting point provided by the Broad. You will need to experiment a little to find the values that are appropriate for your data, and to produce the tradeoff between sensitivity and specificity that is right for your analysis.
QD < 2.0: This is the variant confidence (from the QUAL field) divided by the unfiltered depth of non-hom-ref samples. This annotation is intended to normalize the variant quality in order to avoid inflation caused when there is deep coverage. For filtering purposes it is better to use QD than either QUAL or DP directly.
FS > 200.0: This is the Phred-scaled probability that there is strand bias at the site. Strand Bias tells us whether the alternate allele was seen more or less often on the forward or reverse strand than the reference allele. When there is little to no strand bias at the site, the FS value will be close to 0.
SOR > 10.0: This is another way to estimate strand bias using a test similar to the symmetric odds ratio test. SOR was created because FS tends to penalize variants that occur at the ends of exons. Reads at the ends of exons tend to only be covered by reads in one direction and FS gives those variants a bad score. SOR will take into account the ratios of reads that cover both alleles.
Note: Indels which are ‘filtered out’ at this step will remain in the filtered_indels.vcf file; however, they will be marked with the name of the filter they failed, while indels which passed the filter will be marked as ‘PASS’.
A single report file is generated with summary statistics for each library processed containing the following pieces of information:
# of Reads
# of Aligned Reads
% Aligned
# Aligned Bases
Read Length
% Paired
% Duplicate
Mean Insert Size
# SNPs, # Filtered SNPs
# SNPs after BQSR, # Filtered SNPs after BQSR
Average Coverage
Give the script a sample id and it will look for the corresponding sample_id_* files (listed in the input section above) and print the statistics and header to stdout.
In this article I’ll go over three Nextflow patterns I frequently use
to make development of Nextflow data processing pipelines easier and
faster. I use each of these in most of my workflows, so they really come
in handy.
I am assuming here that you know what processes, channels, strings,
closures, directives and operators are and are somewhat comfortable
writing Groovy and Nextflow code. If you want further details on any of
the topics I am touching on in this article, check out the Nextflow documentation for explanation of all these concepts and more.
Let’s start with the first one, that probably gives you the biggest speedup for developing your pipelines.
1. Reducing the number of input files during development
Commonly, when developing a pipeline, we do not want to run it on all
our data at once – on a small subset it runs (and fails!) quickly, so we can
develop more quickly! Then, when we are happy with our pipeline, we can run
all of our data through it.
To achieve this, I like using the take operator on my main channel of inputs, only letting a set number of items through the channel. Because I do not want this to always be the case, I make it conditional on a params variable params.dev. Any variable specified as params.somevariable can be set to somevalue on the command line by invoking nextflow run pipeline.nf --somevariable somevalue.
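As a sketch, conditional_take.nf could look like this (the default of 2 for params.number_of_inputs and the take(-1) behaviour follow the explanation below):

```nextflow
// conditional_take.nf – sketch reconstructed from the surrounding explanation
params.dev = false              // --dev on the command line sets this to true
params.number_of_inputs = 2     // how many items to let through in dev mode

Channel
    .from(1..300)                                          // fill a channel with the numbers 1 to 300
    .take( params.dev ? params.number_of_inputs : -1 )     // -1 means "take everything"
    .println()
```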
Explanation: this example snippet first fills a channel with the numbers from 1 to 300. The take operator then lets through either the first params.number_of_inputs of those numbers (here by default 2) or, when params.dev is false and take receives -1 as input, all of them; the numbers that make it through are printed.
In this example params.dev is by default false, so when developing a pipeline, we simply add the --dev flag to set it to true, like so:
nextflow run conditional_take.nf --dev
Now our pipeline only lets through the first two numbers:
N E X T F L O W ~ version 19.09.0-edge
Launching `conditional_take.nf` [scruffy_shannon] - revision: 1e3507df3d
1
2
If we were to run the pipeline without the --dev flag, it prints all the numbers from 1 to 300:
N E X T F L O W ~ version 19.09.0-edge
Launching `conditional_take.nf` [scruffy_shannon] - revision: 1e3507df3d
1
2
3
[4-298 omitted here]
299
300
2. Collecting output files of a process with publishDir
Nextflow frees us from thinking about where files produced by a
process end up and making sure they are available for the next process
that uses them.
However, often we want to see files output by a process without having
to dig into the work directory.
For this we use the publishDir directive in our process to tell Nextflow in which directory we want to publish our files that are tracked in the output: channel(s) of the process.
publishDir is the NUMBER ONE THING new users ask for on the Nextflow gitter channel because we are so used to having to track where files are manually.
The publishDir directive takes a directory path relative to where we are running the pipeline from.
We can also give it a glob pattern to specify which of the output files should be published.
Although it’s not explicitly mentioned in the documentation, we can specify publishDir
multiple times for the same process, which can be useful if a process
produces multiple types of files and we want to publish several
different groupings of the same files.
Also useful: publishDir lets us provide a closure via saveAs to specify the path/filename under which a file should be published (relative to the publishDir), given the name of the file.
This last one is a bit tricky because the closure is not passed a file object but just a string.
Hence, in order to use file methods like getSimpleName, we first have to turn the passed string into a file object with file(it).
I have coded up an example below, where we pass a fruit and a baked good to someprocess and publish the resulting fruit_file and pastry_file in different directories:
Channel
    .from(['apples', 'tarte'], ['peach', 'pie'])
    .set{input_ch}

process someprocess {
    publishDir "fruits", pattern: '*.fruit', saveAs: { "${file(it).getSimpleName()}.${file(it).getExtension()}" }
    publishDir "pastries", pattern: '*.pastry'

    input:
    set val(fruit), val(pastry) from input_ch

    output:
    set file(fruit_file), file(pastry_file) into output_ch

    script:
    fruit_file = "${fruit}.pastry.fruit"
    pastry_file = "${pastry}.fruit.pastry"
    """
    touch $fruit_file
    touch $pastry_file
    """
}
We want to publish the fruit_files in the directory fruits and want
to only keep the last file extension and the file name (i.e. remove the
.pastry from the file).
The pastry_files we will publish in the pastries directory.
Let’s run it:
nextflow run do_you_have_a_minute_to_talk_about_publishdir.nf
N E X T F L O W ~ version 19.09.0-edge
Launching `do_you_have_a_minute_to_talk_about_publishdir.nf` [mad_mestorf] - revision: d436b24a1c
executor > local (2)
[92/97d75e] process > someprocess (2) [100%] 2 of 2 ✔
Let’s see where our files ended up:
$ ls fruits
apples.fruit
peach.fruit
$ ls pastries
pie.fruit.pastry
tarte.fruit.pastry
Perfect, this is what we specified the two separate publishDir directives for – note that the fruits only have one extension, whereas the pastries retained all of theirs.
Note that you could also publish the same file multiple times, which can be useful in some instances.
Finally, by default publishDir makes a symbolic link (see mode in the publishDir section of the docs) but you can also have the files copied, hard-linked, or even moved.
Moving is not recommended because it breaks reruns with -resume.
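As a sketch, switching the mode is a single extra argument to publishDir (the process and file names here are made up):

```nextflow
process copy_instead_of_link {
    // publish a real copy instead of the default symlink
    publishDir "results", mode: 'copy'

    output:
    file('hello.txt') into copied_ch

    script:
    """
    echo hello > hello.txt
    """
}
```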
3. Making a custom config file within a process
For this section, I am assuming you know how string interpolation in Groovy works; if not, the Groovy documentation has a little refresher.
Some software requires a custom config file.
Now while of course we could require our end-user to supply a config
file to our pipeline, we'd like to automate that if possible – requiring
one usually just adds unnecessary complexity that we want to
hide from the user.
Let’s break the code for this example up into bits.
First the input:
A common reason why tools require you to specify a config file is that they take multiple input files.
Here our input_ch consists of some genome names (strain_1
and strain_2) and their associated fasta files, which are required for
our example config file format.
Having a separate name for a genome can be necessary because filenames may include characters that are forbidden in some output file formats. An example would be parentheses in genome names, which are forbidden in Newick files because they have syntactic meaning there.
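The channel feeding this process could be set up along these lines (a sketch; the file names match the example output further down):

```nextflow
// One channel item pairing a list of genome names with a list of fasta files
Channel
    .value([['strain_1', 'strain_2'],
            [file('lmao_this_file_name_is_a_mess_strain1.fasta'),
             file('strain2_some_totally_random_filename.fasta')]])
    .set{input_ch}
```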
process software_that_requires_a_custom_config {
    publishDir "custom_config"

    input:
    set val(genome_names), file(genome_fastas) from input_ch

    output:
    file(custom_config) into output_ch
The first portion of this process definition is just the usual –
input channels and output channels and for convenience of the example, a
publishDir.
In the script part we first transpose the genome_names and genome_fastas to get a list of lists in which each name is paired up with its corresponding fasta file.
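What transpose does here can be illustrated in plain Groovy (with made-up names):

```groovy
def genome_names  = ['strain_1', 'strain_2']
def genome_fastas = ['strain1.fasta', 'strain2.fasta']
// transpose pairs the i-th name with the i-th fasta file
assert [genome_names, genome_fastas].transpose() ==
       [['strain_1', 'strain1.fasta'], ['strain_2', 'strain2.fasta']]
```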
Building a multi-line Groovy string -> one-line bash string
Here is what we want our final config file to look like:
$ cat custom_config/custom.config
#list of all our genome_names and genome_fastas:
strain_1 = lmao_this_file_name_is_a_mess_strain1.fasta
strain_2 = strain2_some_totally_random_filename.fasta
We build the text of our config file into a variable config_file.
The intent here is to make a one-line string that we can printf into our config file.
We indent it to make it more readable and strip those indents again with stripIndent.
Note that this will only work if the indents are the same for every line
– it's crucial that the first line of this multiline string only contains the """ and nothing else.
Otherwise stripIndent would count this multiline string as not indented.
An additional gotcha is that this multiline Groovy string should eventually turn into one line in bash.
As a result, to get a \n (newline character for printf) in our bash script, we have to write \\n in our Groovy script.
Then we can use collect to add a custom string for each name & fasta pairing to our config file.
This produces a list and we don’t want the [] to end up in our string, so we additionally use join to concatenate all the list items into a string.
Finally the whole string needs to be split by newline (these are the Groovy newlines, not our \\n) and joined again.
The actual bash portion of this script just printfs the contents of our one-line string into our config file.
    script:
    custom_config = 'custom.config'
    name_fasta_pairing = [genome_names, genome_fastas.collect()].transpose() // these are now organized like this: [[name1, fasta1], [name2, fasta2]]
    config_file = """
    #list of all our genome_names and genome_fastas:\\n
    ${name_fasta_pairing.collect{"${it[0]} = ${it[1]}\\n"}.join()}
    """.stripIndent().split('\n').join('')
    """
    printf '$config_file' >> $custom_config
    """
}
Now you may wonder why we needed to go through all the nonsense of
escaping bash syntax in our Groovy string – why not make a Groovy string
and write directly to a file using Groovy?
The problem here lies with where this Groovy code gets executed.
The following would create the config file in the directory from where
we run the workflow, rather than inside the work directory:
config_file = """
#list of all our genome_names and genome_fastas:
${name_fasta_pairing.collect{"${it[0]} = ${it[1]}\n"}.join()}
""".stripIndent()
file(custom_config) << config_file
This leads to problems because the file lives outside the work directory and would be appended to on every run, so for now our clunky solution has to do.
Let’s run it:
nextflow run write_custom_config_files_within_process.nf
N E X T F L O W ~ version 19.09.0-edge
Launching `write_custom_config_files_within_process.nf` [special_cuvier] - revision: 08a6a5850d
executor > local (1)
[1b/a693d8] process > software_that_requi... [100%] 1 of 1 ✔
And it looks just how we wanted it:
$ cat custom_config/custom.config
#list of all our genome_names and genome_fastas:
strain_1 = lmao_this_file_name_is_a_mess_strain1.fasta
strain_2 = strain2_some_totally_random_filename.fasta
I hope this article was informative and you learned at least one new thing.