Gaining Access to Sequencing Data on the HPC

  1. Create an HPC account by following the directions on the NYU High Performance Computing Wiki.
  2. Submit a request via the Biology Computation Support Form to be added to the CGSB Linux working group on the HPC, and to be granted permission to your lab’s sequencing results directory.

Data Policy and Retention

  1. Demultiplexed fastqs or raw lane fastqs are copied to lab directories on /scratch on the HPC.
  2. Owners of the data will have read access to the fastqs on /scratch.
  3. Sequencing data in lab directories on /scratch are backed up and not subject to flushing.
  4. Raw sequencing run directories are archived and backed up locally for a minimum of five years.
  5. Raw sequencing run directories can be made available to users on request.

[Figure: Gencore Data Plan Visualization]

HPC Best Practices

  1. Your job should run in, and write its output to, your personal directory on scratch, e.g. /scratch//my-project/job-xyz/
  2. Your Slurm scripts should live in your job or project directory.
  3. Keeping a copy of the job script in its run directory is good practice: it lets you check later which parameters were used and facilitates reproducibility.
  4. All other scripts (e.g. Python scripts or other executables) should live in your home directory (i.e. /home/netID/).
  5. If you need to run a script that you created (a Python script, for example), call it from your home directory (accessible via the $HOME variable) in your Slurm script.
  6. If you need a software package which is not available on the HPC, please email the HPC team at hpc@nyu.edu with your request. You can check if a module already exists by typing module avail tool_name on the command line.
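
The layout described in this list can be sketched in a few lines of Python. This is only an illustration, not an official GenCore or HPC tool: the function name write_job, the .sbatch file extension, the one-hour default time limit, and the example script path $HOME/scripts/analyze.py are all assumptions made for the example. It creates a job directory under a scratch root and writes the Slurm script into that same directory, so a copy of the script stays with the run.

```python
from pathlib import Path


def write_job(scratch_root, project, job, script_path, walltime="01:00:00"):
    """Create <scratch_root>/<project>/<job>/ and write a Slurm script into it.

    Returns the path of the script that was written.
    """
    job_dir = Path(scratch_root) / project / job
    job_dir.mkdir(parents=True, exist_ok=True)

    lines = [
        "#!/bin/bash",
        f"#SBATCH --job-name={job}",
        # %j is replaced by Slurm with the job ID
        f"#SBATCH --output={job_dir}/slurm-%j.out",
        f"#SBATCH --time={walltime}",
        "",
        # Run in the job directory so outputs land on scratch
        f'cd "{job_dir}"',
        # The custom script lives in the home directory (hypothetical path)
        f"python {script_path}",
        "",
    ]
    sbatch_file = job_dir / f"{job}.sbatch"
    sbatch_file.write_text("\n".join(lines))
    return sbatch_file
```

For example, write_job("/scratch/netID", "my-project", "job-xyz", "$HOME/scripts/analyze.py") would write /scratch/netID/my-project/job-xyz/job-xyz.sbatch, which could then be submitted with sbatch. Because $HOME is written into the script literally, it is expanded by the shell when the job runs.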

HPC Important Locations

Fastq Delivery from GenCore

/scratch/cgsb/gencore/out/
  • Files here are not subject to flushing
  • Files here are backed up

Lab Share Directory

/scratch/cgsb/
  • 5TB Quota
  • Files here are not subject to flushing
  • Files here are backed up to tape
  • Use this directory to share your analysis and results with other members of your lab
  • Reference and input data (e.g. GenCore-delivered fastq files, reference genomes, indexes, etc.) should not live here!
  • To establish a lab share directory on /scratch/cgsb, or to add members to your lab share directory, submit a request using this form.

Personal Directory on scratch

/scratch/netID/
  • 5TB Quota
  • Files here which are not used for a period of 60 days are subject to flushing.
  • Use this directory to run your analysis and store analysis results as you are working on them
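
To see which files might be at risk of flushing, a small script can list files that have not been accessed recently. The sketch below assumes that "not used" corresponds to a file's last access time (atime); the exact criterion applied by the HPC purge policy is not stated here, so treat the result as an estimate. The function name stale_files is hypothetical.

```python
import os
import time


def stale_files(root, days=60, now=None):
    """Yield paths under root whose last access time is more than `days` days ago."""
    now = time.time() if now is None else now
    cutoff = now - days * 86400  # 86400 seconds per day
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                atime = os.stat(path).st_atime
            except OSError:
                continue  # file vanished or cannot be read; skip it
            if atime < cutoff:
                yield path
```

Calling list(stale_files("/scratch/netID", days=50)) with a threshold below 60 gives some advance warning before files reach the limit.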

Personal Directory on home

/home/netID/
  • 20GB Quota
  • Not subject to flushing
  • Backed up
  • Your custom scripts (Python scripts or other executables) should live here

Personal Directory on archive

/archive/netID/
  • 2TB Quota
  • Not subject to flushing
  • Data will be retained for 5 years
  • Data will be backed up to tape
  • Completed analyses and results that you want to store should be archived (tar) and then stored here
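
The archiving step can be sketched with Python's standard tarfile module. This is a minimal example under stated assumptions: gzip compression, the .tar.gz naming, and the function name archive_analysis are choices made for illustration, and the text above only asks that results be tarred before being stored.

```python
import tarfile
from pathlib import Path


def archive_analysis(src_dir, archive_root):
    """Pack src_dir into <archive_root>/<name>.tar.gz.

    Returns the archive path and the list of member names it contains.
    """
    src = Path(src_dir).resolve()
    dest = Path(archive_root) / f"{src.name}.tar.gz"
    dest.parent.mkdir(parents=True, exist_ok=True)

    with tarfile.open(dest, "w:gz") as tar:
        # arcname stores paths relative to the directory name, not the full path
        tar.add(src, arcname=src.name)

    # Re-open the archive and list its members as a basic sanity check
    with tarfile.open(dest, "r:gz") as tar:
        members = tar.getnames()
    return dest, members
```

For example, archive_analysis("/scratch/netID/my-project/job-xyz", "/archive/netID") would produce /archive/netID/job-xyz.tar.gz. Checking the returned member list before deleting anything from scratch is a sensible precaution.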

Shared Genome Resource

/genomics/genomes/
  • Local CGSB repository of commonly used genomic data sets
  • New organisms/versions/releases will be made available periodically or upon request (Mohammed Khalfan – mkhalfan@nyu.edu)
  • Previous versions/releases will be preserved