Author: Eric Borenstein

  • Nextflow & nf-core on NYU HPC

    All Nextflow and nf-core pipelines have been configured for use on the HPC cluster at New York University. The configuration applies the required and recommended options for efficient and reliable Nextflow runs.

    The NYU HPC configuration is shown below; the latest version can always be found on the nf-core GitHub.

    params {
        config_profile_description = 'New York University HPC profile provided by nf-core/configs.'
        config_profile_contact = 'HPC@nyu.edu'
        config_profile_url = 'https://hpc.nyu.edu'
        max_memory = 3000.GB
        max_cpus = 96
        max_time = 7.d
    }
    
    singularity.enabled = true
    
    process {
        executor = 'slurm'
        clusterOptions = '--export=NONE'
        maxRetries = 3
        errorStrategy = { task.attempt <=3 ? 'retry' : 'finish' }
        cache = 'lenient'
    }
    
    executor {
        queueSize = 1900
        submitRateLimit = '20 sec'
    }

    The parameters max_memory, max_cpus, max_time, queueSize, and submitRateLimit do not hinder your Nextflow workflows; they reflect the resource maximums set by HPC. For example, no compute node has more than 96 CPUs, so a workflow that requests 97 or more CPUs will fail. The same logic applies to the other settings.
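
    As a rough illustration of these ceilings, the shell sketch below compares a planned per-task request with the values in the profile above (96 CPUs, 3000 GB, 7 days). The function name within_nyu_limits is hypothetical; it is not part of Nextflow or of any NYU tooling and only mirrors the comparison described in the previous paragraph.

```shell
#!/bin/bash
# Ceilings copied from the nyu_hpc profile shown above.
MAX_CPUS=96
MAX_MEM_GB=3000
MAX_TIME_DAYS=7

# Hypothetical helper: succeeds only if a per-task request fits under every ceiling.
# Usage: within_nyu_limits <cpus> <mem_gb> <days>
within_nyu_limits() {
    local cpus=$1 mem_gb=$2 days=$3
    [ "$cpus" -le "$MAX_CPUS" ] &&
    [ "$mem_gb" -le "$MAX_MEM_GB" ] &&
    [ "$days" -le "$MAX_TIME_DAYS" ]
}

# A 97-CPU request is one CPU over the largest node.
if within_nyu_limits 97 64 1; then
    echo "request fits"
else
    echo "request exceeds an NYU HPC limit"
fi
```

    A check like this is only a convenience before submission; the limits that actually apply are the ones in the profile and on the cluster.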

    The process block tells Nextflow to run tasks through the Slurm cluster scheduler and to handle errors by retrying a failed task up to 3 times.
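
    To make the retry rule concrete, the sketch below re-expresses the errorStrategy closure from the profile in plain shell. It is only an illustration of the decision logic, not how Nextflow evaluates it internally, and the function name error_strategy is hypothetical.

```shell
#!/bin/bash
# Mirrors the closure in the profile: { task.attempt <= 3 ? 'retry' : 'finish' }
# error_strategy is a hypothetical name used only for this illustration.
error_strategy() {
    local attempt=$1
    if [ "$attempt" -le 3 ]; then
        echo "retry"
    else
        echo "finish"
    fi
}

# Walk through four consecutive failures of the same task.
for attempt in 1 2 3 4; do
    echo "failed attempt $attempt -> $(error_strategy "$attempt")"
done
```

    The first three failures map to retry and the fourth to finish. In Nextflow, the finish strategy lets already submitted tasks complete before the run shuts down.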

    Using the Nextflow Config

    For nf-core pipelines, run the pipeline with -profile nyu_hpc. This will automatically apply the latest nyu_hpc.config.

    Example nextflow sbatch script using nf-core pipeline scrnaseq (https://nf-co.re/scrnaseq/2.6.0)

    #!/bin/bash -e
    #SBATCH --nodes=1
    #SBATCH --ntasks-per-node=1
    #SBATCH --cpus-per-task=2
    #SBATCH --mem=8GB
    #SBATCH --time=24:00:00
    #SBATCH --job-name=nextflow
    #SBATCH --output=nf_%j.out
    # The nextflow job manager does not require a lot 
    # of resources, 2 CPU and 8GB mem is more than enough
    
    module purge
    module load nextflow/23.04.1
    
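    # https://nf-co.re/scrnaseq/2.6.0
    # The -profile nyu_hpc option below applies the NYU HPC profile.
    # (A comment after a trailing backslash would break the line continuation.)
    nextflow run nf-core/scrnaseq \
       -profile nyu_hpc \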
       --input samplesheet.csv \
       --genome_fasta GRCm38.p6.genome.chr19.fa \
       --gtf gencode.vM19.annotation.chr19.gtf \
       --protocol 10XV2 \
       --aligner star \
       --outdir $SCRATCH/nf_scrnaseq_out
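
    Before submitting a script like the one above, it can help to confirm that the input files it names exist, since a missing file otherwise only surfaces after the job has waited in the queue. The sketch below is one way to do that; the helper name inputs_present is hypothetical, and the file names are simply the ones used in the example.

```shell
#!/bin/bash
# Hypothetical pre-flight helper: reports each missing file on stderr
# and returns non-zero if at least one file is absent.
inputs_present() {
    local status=0 f
    for f in "$@"; do
        if [ ! -e "$f" ]; then
            echo "missing: $f" >&2
            status=1
        fi
    done
    return "$status"
}

# Check the inputs named in the example script before calling sbatch.
if inputs_present samplesheet.csv \
        GRCm38.p6.genome.chr19.fa \
        gencode.vM19.annotation.chr19.gtf; then
    echo "all inputs present"
else
    echo "fix the paths listed above before submitting"
fi
```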

    For other nextflow pipelines, download the NYU nf-core config into your nextflow working directory and include it in your nextflow run command as shown below. Note the capital -C for the nyu_hpc.config, which is provided before the run command, and the lower case -c for your.config, which is provided after the run command.

    # Download the config
    wget https://raw.githubusercontent.com/nf-core/configs/master/conf/nyu_hpc.config
    
    # Execute the nextflow
    nextflow -C nyu_hpc.config run -c your.config main.nf

    Please reach out to hpc@nyu.edu if there are any questions.

  • JBrowse Genome Browser

    During the summer of 2020, the Ghedin and Gresham labs at New York University sequenced several SARS-CoV-2 isolates from clinical samples acquired in New York City. To visualize and share the data among researchers and collaborators, we built a JBrowse web server. JBrowse is web-based genome visualization software that lets you view genomic data files such as FASTA, VCF, BAM, CRAM, and GFF3.

    To benefit all researchers at NYU engaged in genomics research, we have implemented a centralized JBrowse service at NYU’s CGSB at http://jbrowse.bio.nyu.edu/ for PIs and their lab members.

    Features and Benefits

    • Visually analyze your data in custom tracks
    • Specific URLs for desired views to share with external collaborators
    • Integration with the GATK pipeline on NYU HPC

    For the most up to date documentation, click here.

    We have integrated automated JBrowse visualization into existing gencore tools. For example, the results of the GATK pipeline, which performs alignment and variant calling, can now be automatically uploaded to the JBrowse site for immediate visual analysis. To enable this, add the following lines to your nextflow.config file to specify the data set name and the PI.

    // JBrowse params
    params.do_jbrowse = true
    params.gff = "/scratch/work/cgsb/genomes/Public/Fungi/Saccharomyces_cerevisiae/Ensembl/R64-1-1/Saccharomyces_cerevisiae.R64-1-1.34.gff3"
    params.jbrowse_pi = "Smith"
    params.dataset_name = "project1"

    Each data file generated by this workflow will result in a track that you can view and customize.

    Search for features of interest with the search bar at the top.

    The URL will dynamically change to meet your current selection of tracks, view, and highlights. You can then use this unique URL to share with colleagues or post in publications.

    Getting Started

    The first step is to establish a lab-specific account and request access to your PI’s lab, here. This account is separate from your Prince account and requires approval by your PI. As with the PI shared directories on the HPC cluster, your fellow lab members can modify or delete your data.

    Once you have access, you can upload data into a new or existing data set. On the Prince HPC cluster, a single command, cgsb_upload2jbrowse, transfers and formats the data. Outside the cluster, you can rsync the data manually.

    USAGE: cgsb_upload2jbrowse -p PI -d DATASET [-f FOLDER] [-s SAMPLELIST] [FILES] 
    ----------------------------------------------------------------------------------------- 
    -p | --PI                specify PI 
    -d | --dataset           specify data set 
    -f | --folder            specify folder containing files 
    -s | --samplelist        specify sample list for categorization 
    ----------------------------------------------------------------------------------------- 
    File formats supported: 
    - fa 
    - fasta 
    - fna 
    - vcf.gz* 
    - bam* 
    - bam.bw 
    - cram*  
    - gff3.gz* 
    - gff 
    
    *Requires index file (tbi, bai, crai) of the same base name 
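
    Because the starred formats need an index, a quick check of the upload folder can catch a missing index before the transfer. The sketch below is a minimal version of such a check. It assumes the index is named after the full data file plus the index extension (for example sample.bam.bai); the usage text only says "same base name", so adjust the pattern if your indexes are named differently. The function names expected_index and check_indexes are hypothetical and not part of cgsb_upload2jbrowse.

```shell
#!/bin/bash
# Hypothetical helper: print the index file expected next to a data file,
# assuming the index is named <data file>.<index extension>.
# Prints nothing for formats that the usage text does not mark with '*'.
expected_index() {
    case "$1" in
        *.vcf.gz|*.gff3.gz) echo "$1.tbi" ;;
        *.bam)              echo "$1.bai" ;;
        *.cram)             echo "$1.crai" ;;
    esac
}

# Report files in a folder whose expected index is absent.
# Returns non-zero if at least one index is missing.
check_indexes() {
    local folder=$1 f idx status=0
    for f in "$folder"/*; do
        idx=$(expected_index "$f")
        if [ -n "$idx" ] && [ ! -e "$idx" ]; then
            echo "no index for $f (expected $idx)" >&2
            status=1
        fi
    done
    return "$status"
}
```

    For example, running check_indexes /scratch/user/project1/data before the upload command lists any data file that still needs an index.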

    Example 1:
    To transfer the data in your scratch folder (/scratch/user/project1/data), which includes the reference data, to Smith’s project1 data set, run the following.

    cgsb_upload2jbrowse -p Smith -d project1 \
     -f /scratch/user/project1/data

    Example 2:
    To transfer the data in your scratch folder (/scratch/user/project1/data), along with reference data from the Prince shared genome repository folders, to Smith’s project1 data set, run the following.

    cgsb_upload2jbrowse -p Smith -d project1 \
     -f /scratch/user/project1/data  \
     /scratch/work/cgsb/genomes/Public/Fungi/Saccharomyces_cerevisiae/Ensembl/R64-1-1/Saccharomyces_cerevisiae.R64-1-1.dna.toplevel.fa \
     /scratch/work/cgsb/genomes/Public/Fungi/Saccharomyces_cerevisiae/Ensembl/R64-1-1/Saccharomyces_cerevisiae.R64-1-1.34.gff3

    Example 3:
    To transfer from outside the Prince cluster, run the following.

    # Transfer the files
    rsync --progress -ruv /path/to/dataset/ <NYUnetID>@jbrowse.bio.nyu.edu:/jbrowse/<PI>/<DATASET>
    # Build and publish the tracks based on the files uploaded
    ssh <NYUnetID>@jbrowse.bio.nyu.edu addTracks --PI <PI> --dataset <DATASET>
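
    The placeholders in Example 3 are easy to mistype, so one option is to generate the two commands and review them before running anything. The sketch below only prints the commands and does not contact the server. The function name print_jbrowse_upload and the NetID abc123 are hypothetical; the host, remote path, and addTracks options are taken from Example 3.

```shell
#!/bin/bash
# Hypothetical helper: print the two commands from Example 3 with the
# placeholders filled in, so they can be reviewed before being run.
# Usage: print_jbrowse_upload <NetID> <PI> <DATASET> <local folder>
print_jbrowse_upload() {
    local netid=$1 pi=$2 dataset=$3 src=$4
    local host="${netid}@jbrowse.bio.nyu.edu"
    # ${src%/}/ guarantees exactly one trailing slash, so rsync copies
    # the folder's contents rather than the folder itself.
    echo "rsync --progress -ruv ${src%/}/ ${host}:/jbrowse/${pi}/${dataset}"
    echo "ssh ${host} addTracks --PI ${pi} --dataset ${dataset}"
}

# abc123 is a made-up NetID used only for illustration.
print_jbrowse_upload abc123 Smith project1 /path/to/dataset
```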

    The data will be accessible immediately on the JBrowse server. Choose your PI on the JBrowse homepage’s dropdown menu then the data set name that was specified in the previous step. Once accessed you will be able to display visualizations or tracks for each file. These tracks by default will be named after the file itself. You can find more information on customizing track names and appearance in the documentation online.

    The available tracks are selectable on the left, allowing you to display only the items of interest and to control the order in which they are displayed. The `Track` menu at the top of the page offers two options: a combination track, which combines two tracks, or a sequence search track, which shows regions of the reference sequence or its translations that match a DNA or amino acid sequence.