Tag: rnaseq

  • Streamlined RNA-Seq Analysis Using Nextflow

    Streamlined RNA-Seq Analysis Using Nextflow

    UPDATED: April 16, 2024

    nf-core is a community effort to collect a curated set of analysis pipelines built using Nextflow. This post will walk you through running the nf-core RNA-Seq workflow.

    The pipeline uses the STAR aligner by default, and quantifies data using Salmon, providing gene/transcript counts and extensive quality control. Prior to alignment, the pipeline uses Trim Galore to automatically trim low quality bases from the 3′ end of reads, and perform adapter trimming, attempting to auto-detect which adapter has been used (from the standard illumina, small rna, and nextera adapters). The pipeline runs a host of other QC tools, including DESeq2 to produce a PCA plot for sample-level QC (note that this requires a minimum of two replicates per library). Results are automatically compiled into a MultiQC report and can be emailed to you upon pipeline completion.

    While you have the option to provide reference genome index files to the pipeline, I recommend you provide only the FASTA and GTF files and let the pipeline generate these files the first time you run the workflow (for reproducibility), providing the --save_reference parameter so they can be saved for subsequent use (the index building step can be very time-consuming).

    Instead of a path to a file, a URL can be supplied to download reference FASTA and GTF files at the start of the pipeline with the --downloadFasta and --downloadGTF parameters.

    Alternatively, reference data can be obtained from AWS-iGenomes automatically by providing the --genome parameter (ex: --genome GRCh37).

    As per the documentation, it’s a good idea to specify a pipeline version when running the pipeline on your data. This ensures that a specific version of the pipeline code and software are used when you run your pipeline. If you keep using the same tag, you’ll be running the same version of the pipeline, even if there have been changes to the code since.

    To see what versions are available, go to the nf-core/rnaseq releases page and find the latest version number – numeric only (eg. 3.0). Then specify this when running the pipeline with -r (one hyphen) – eg. -r 3.0.

    The version number will be logged in reports when you run the pipeline, so that you’ll know what you used when you look back in the future.

    For additional options for trimming, ribosomal RNA removal, UMI-based read de-duplication, other alignment and quantification tools, and more see: https://nf-co.re/rnaseq/parameters

    NOTE: NYU CGSB users will require an additional step (step #1), and have the ability to publish their results to the department JBrowse server (step #6).

    1) Rename Fastq Files

    NOTE: This step is for NYU CGSB users only

    nf-core workflows expect standard illumina filenames by default. At NYU CGSB, fastq files are not named using the standard illumina file naming scheme. Therefore, CGSB users will need to run the following script prior to running any nf-core workflow. This script will create a folder within the target directory called ‘inames’ containing symlinks to the original data that use standard illumina filenames. CGSB users must then provide the path to the files in this ‘inames’ directory when creating the samplesheet (next step). Run this script as follows, providing the path to the reads and target directory as parameters:

    sh /scratch/work/cgsb/scripts/rename_fastq_files/rename_fastq_files.sh \
        <path_to_reads> \
        <target_dir>

    Example:

    sh /scratch/work/cgsb/scripts/rename_fastq_files/rename_fastq_files.sh \
        /scratch/cgsb/gencore/out/Gresham/2020-01-10_HV2J2BGXC/merged/ \
        /scratch/$USER/project_dir

    2) Prepare Samplesheet

    The pipeline requires a samplesheet in csv format as input. The samplesheet must contain the following five headers: group, replicate, fastq_1, fastq_2, strandedness.

    It is possible to include multiple runs of the same library in a samplesheet. The group and replicate identifiers are the same when you have re-sequenced the same sample more than once (e.g. to increase sequencing depth). The pipeline will concatenate the raw reads before alignment.

    strandedness can be forward, reverse, unstranded, or auto in which case the pipeline will attempt to automatically determine the strandedness of the reads.

    It is also possible to mix paired-end and single-end reads in a samplesheet.

    Below is an example of a samplesheet consisting of both single- and paired-end data. This is for two experimental groups in triplicate, where the last replicate of the treatment group has been sequenced twice.

    More information about samplesheets at the official docs: https://nf-co.re/rnaseq/usage#introduction

    3) Prepare Config File

    Copy this gist into your project directory and provide the path to your samplesheet, FASTA, GTF, and out_root, and your email address if you’d like to be notified when the pipeline completes or if there are any errors.

    4) Running the workflow

    Note: v3.0 of the RNA-Seq pipeline requires Nextflow 20.11.0-edge or higher.

    On the NYU HPC, you must submit nextflow pipelines as an SBATCH job.

    Copy and paste the code below into a file called launch-nextflow.s. Edit the path to your config file (-c) and version number (-r) as required, and add any additional parameters if necessary.

    Once you have created this file, submit it to SLURM using the following command:

    sbatch launch-nextflow.s

    5) Output

    Once the pipeline has completed, you will find your output files in the results directory within the directory you set as out_root in the config.

    If you provided your email address, you will receive notification via email when your pipeline completes with information pertaining to your analysis as well as a comprehensive MultiQC report attached. If you did not provide your email address, you can find the MultiQC report in the results directory.

    6) JBrowse

    CGSB users have the option to push their results to JBrowse for visualization. To push data to JBrowse you will need to request access at https://forms.bio.nyu.edu if you have not already done so (requires PI approval). Then, simply execute the following command:

    cgsb_upload2jbrowse \
        -p PI \
        -d DATASET \
        $ref \
        $gff3 \
        --profile nf-rnaseq \
        --root ROOTPATH

    Example:

    cgsb_upload2jbrowse \
        -p Gresham \
        -d project-name \
        /path/to/ref.fa.gz \
        /path/to/ref.gff3 \
        --profile nf-rnaseq \
        --root /scratch/netID/rnaseq_project/results/

    7) Citing

    If you use nf-core/rnaseq for your analysis, please cite it using the following doi: 10.5281/zenodo.1400710

    In addition, cite the nf-core publication as follows:

    The nf-core framework for community-curated bioinformatics pipelines.
    Philip Ewels, Alexander Peltzer, Sven Fillinger, Harshil Patel, Johannes Alneberg, Andreas Wilm, Maxime Ulysse Garcia, Paolo Di Tommaso & Sven Nahnsen.
    Nat Biotechnol. 2020 Feb 13. doi: 10.1038/s41587-020-0439-x.

    Additional Comments

    Recently, I observed this warning message immediately upon launching the pipeline (for documentation purposes, this was version 3.14.0):

    WARN: ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
      Multiple config files detected!
      Please provide pipeline parameters via the CLI or Nextflow '-params-file' option.
      Custom config files including those provided by the '-c' Nextflow option can be
      used to provide any configuration except for parameters.
    
      Docs: https://nf-co.re/usage/configuration#custom-configuration-files
    ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

    It appears that we need to move our parameters into a separate JSON file, and provide this file to the pipeline using the -params-file option. According to the documentation, this is necessary for nf-core pipelines using DSL2, which was implemented in version 2.0 of this pipeline. The warning message above, displayed upon running the pipeline, instructs us to provide pipeline parameters via the -params-file option. However, the documentation clarifies that “For Nextflow DSL2 nf-core pipelines – parameters defined in the parameter block in custom.config files WILL NOT override defaults in nextflow.config“, which might mean that you can keep your parameters in your nextflow.config file. This suggests that even though it’s possible to keep parameters in your nextflow.config file, as evidenced by my tests where the pipeline runs with all parameters in the nextflow.config file, you will still receive the warning message. Similarly, the warning will appear even when using a JSON file with the -params-file option. If you choose or need to use a JSON file for your parameters as indicated in the warning message, here is how you would proceed:

    Create the JSON file with your parameters as shown below. Save this file as nf-params.json. (See documentation for JSON format here)

    {
      "input": "/path/to/samplesheet.csv",
      "fasta": "/path/to/ref.fa",
      "gtf": "/path/to/ann.gtf",
      "email": "netID@nyu.edu",
      "outdir": "/scratch/netID/rnaseq_project"
    }

    Update your nextflow command to include -params-file and the path to your json file.

    nextflow run nf-core/rnaseq \
            -profile singularity \
            -c nextflow.config \
            -params-file nf-params.json \
            -r 3.14.0 \
            --save_reference
  • Gene Set Enrichment Analysis in Minutes with the NASQAR Web App

    Gene Set Enrichment Analysis in Minutes with the NASQAR Web App

    Gene Set Enrichment Analysis (GSEA) is a common method to analyze RNA-Seq data that determines whether a predefined defined set of genes (for example those in a GO term or KEGG pathway) show statistically significant and concordant differences between two biological phenotypes. There are a myriad of tools for GSEA analysis, and one of them which I particularly like is clusterProfiler. Developed as an R-based tool, clusterProfiler has until now been inaccessible to users unfamiliar with the R programming language. NASQAR, recently developed by the core bioinformatics team at the NYU Center for Genomics and Systems Biology, now makes clusterProfiler available as a web app with a simple UI, enabling anyone to use clusterProfiler and perform GSEA with GO Term and KEGG Pathway gene set data in minutes.

    Getting Started

    1. Navigate to the GSEA app
    2. Click “Input Data”
    3. Upload a CSV file containing a list of gene names and log2 fold change values. This data is typically produced by differential expression analysis tool such as DESeq2*

    * If you don’t have this data, you can run the DESeq2 app built into NASQAR to perform DE analysis (you’ll need to provide gene count data). Alternatively, you can use the sample data available in the app to follow along.

    Parameters

    Once your data is uploaded, you’ll need to select the appropriate columns in your CSV and set the following parameters before your data can be analyzed:

    Organism
    clusterProfiler supports all Org.Db annotation packages. See the list of available organisms here: http://bioconductor.org/packages/release/BiocViews.html#___OrgDb
    Key Type
    this is the source of the annotation (gene ids). This might be ENSEMBL, NCBI, ENTREZID, etc. Available options for your organism are available in a dropdown list to select from.
    Ontology
    GO Term ontology, one of “BP” (Biological Process), “MF’ (Molecular Function), “CC” (Cellular Component), or “ALL”
    Number of Permutations
    the higher the number of permutations you set, the more accurate your result will, but the longer the analysis will take (increased compute time).
    minGSsizeminimum
    number of genes in set (gene sets with lower than this many genes in your dataset will be ignored).
    maxGSSizemaximum
    number of genes in set (gene sets with greater than this many genes in your dataset will be ignored).
    pAdjustMethod
    one of “holm”, “hochberg”, “hommel”, “bonferroni”, “BH”, “BY”, “fdr”, “none”

    Output

    Once your results are ready you can visualize your enriched gene set data in various ways.

    Dotplot

    Visualize the top activated and suppressed enriched GO Terms or KEGG Pathways.

    Enrichment Map

    Organizes enriched terms into a network with edges connecting overlapping gene sets. In this way, mutually overlapping gene sets tend to cluster together, making it easy to identify functional modules.

    GSEA Enrichment Map Plot

    GSEA Plot

    Plot of the Running Enrichment Score (green line) for a gene set as the analysis walks down the ranked gene list, including the location of the maximum enrichment score (the red line). The black lines in the Running Enrichment Score show where the members of the gene set appear in the ranked list of genes, indicating the leading edge subset.

    The Ranked list metric shows the value of the ranking metric (log2 fold change) as you move down the list of ranked genes. The ranking metric measures a gene’s correlation with a phenotype.

    GSEA Plot

    Ridgeplot

    Grouped by gene set, density plots are generated by using the frequency of fold change values per gene within each set. Helpful to interpret up/down-regulated pathways.

    GSEA Ridgeplot

    PubMed Trend of Enriched Terms

    The number of publications per year based on the query result from PubMed Central.

    Pubmed Trends Plot

    Enriched KEGG Pathway Plot

    Enriched KEGG Pathway Plot

    Summary

    GSEA is a powerful and common analysis method for RNA-Seq experiments and is normally employed after differential expression analysis, for example when studying genes that are differently expressed under different conditions. NASQAR provides an intuitive interface that allows users to perform this analysis easily and efficiently.

    Additional Material

    If you’d like to learn how to use clusterProfiler in R, checkout this tutorial:

    There’s also an app in NASQAR for over-representation analysis:

    https://nasqar.abudhabi.nyu.edu/ClusterProfShinyORA/

    Like the app? Let us know in the comments below.