Sometimes, after demultiplexing, a large number of reads remain undetermined, i.e. they were not assigned to any library based on the barcodes provided. This is most often the result of incorrect metadata or barcode contamination. Identifying which barcodes are present in the undetermined reads can help you troubleshoot your run.

Figure 1. High undetermined read count displayed in MultiQC report

NOTE: If you’re sequencing at the NYU Genomics Core, we automatically provide undetermined read data in your MultiQC report.

The following script lets you find out which barcodes are present in your undetermined reads and at what frequency. It takes a .fastq.gz file as input and prints every barcode found in the file together with its read count, sorted in ascending order of frequency.
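As a rough illustration of how such a count can be done, here is a minimal sketch. It is not necessarily the script referred to on this page. It assumes Illumina-style FASTQ headers in which the barcode is the last colon-separated field of each header line (for example "... 1:N:0:GACGAC"), and the function name count_barcodes is made up for this example.

```python
import gzip
import sys
from collections import Counter


def count_barcodes(fastq_gz_path):
    """Count barcodes found in the header lines of a gzipped FASTQ file.

    Assumption: the barcode is the last ':'-separated field of the header,
    as in Illumina headers such as '@... 1:N:0:GACGAC'.
    """
    counts = Counter()
    with gzip.open(fastq_gz_path, "rt") as handle:
        for line_number, line in enumerate(handle):
            # A FASTQ record spans four lines; the header is the first one.
            if line_number % 4 == 0:
                barcode = line.rstrip().rsplit(":", 1)[-1]
                counts[barcode] += 1
    return counts


def main(path):
    counts = count_barcodes(path)
    # Ascending order of frequency, so the most common barcodes come last.
    for barcode, count in sorted(counts.items(), key=lambda item: item[1]):
        print(barcode, count)


if __name__ == "__main__":
    if len(sys.argv) == 2:
        main(sys.argv[1])
    else:
        print("usage: python3 count_barcode_frequency.py input.fastq.gz",
              file=sys.stderr)
```

Selecting header lines by their position in the file, rather than by a leading "@", avoids miscounting quality lines that happen to start with "@".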

Usage:

  1. You need Python 3 to use this script. On the Prince HPC, load the Python 3 module like this:
    module load python3/intel/3.6.3
  2. Save the script above as count_barcode_frequency.py.
  3. Run the script like this:
    python3 count_barcode_frequency.py input.fastq.gz
  4. The script writes the list of barcodes to stdout. Redirect the output to a file to save it for later:
    python3 count_barcode_frequency.py input.fastq.gz > input_barcodes.txt

Output:

The output lists every barcode present in the input fastq file with its read count, sorted in ascending order of frequency. Because the most frequent barcodes come last, running tail -20 on input_barcodes.txt displays the 20 most frequent barcodes in the input fastq.

[mk5636@log-0 temp]$ tail -20 input_barcodes.txt
NNNNNN 42475
GGGGGG 3262198
TAATCG 4550383
CATGGC 5257887
TACAGC 5377243
CACTCA 5530110
ATGAGC 5802017
GAGTGG 5828838
CGTACG 5970294
CACGAT 6319180
ACTGAT 6493155
GTTTCG 6543201
GGTAGC 6715409
CAACTA 6718555
ATTCCT 6747165
CAAAAG 6857987
CAGGCG 6888980
CCAACA 7036683
CATTTT 9409941
GACGAC 12103222

Comparing this output with your library metadata can provide useful insight into the reason behind the high undetermined read count.
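One way to make that comparison less manual is sketched below. It assumes single-index barcodes and flags only two kinds of near-miss: an observed barcode that is the reverse complement of an expected one (a common metadata mistake), or one that differs from an expected barcode by a single base. The function names and the example barcodes in the usage lines are hypothetical.

```python
def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence (N stays N)."""
    return seq.translate(str.maketrans("ACGTN", "TGCAN"))[::-1]


def hamming(a, b):
    """Number of positions at which two equal-length sequences differ."""
    return sum(x != y for x, y in zip(a, b))


def classify(observed, expected_barcodes):
    """Describe how an observed barcode relates to the expected barcodes."""
    for expected in expected_barcodes:
        if observed == expected:
            return "exact match to " + expected
    for expected in expected_barcodes:
        if observed == reverse_complement(expected):
            return "reverse complement of " + expected
    for expected in expected_barcodes:
        if len(observed) == len(expected) and hamming(observed, expected) == 1:
            return "one mismatch from " + expected
    return "no match"


def read_counts(path):
    """Parse lines of the form 'BARCODE COUNT' into (barcode, count) pairs."""
    pairs = []
    with open(path) as handle:
        for line in handle:
            fields = line.split()
            if len(fields) == 2:
                pairs.append((fields[0], int(fields[1])))
    return pairs
```

A possible usage, with a made-up list of expected barcodes taken from your metadata:

```python
expected = ["GTCGTC", "CATTTA"]  # hypothetical barcodes from the sample sheet
for barcode, count in read_counts("input_barcodes.txt")[-20:]:
    print(barcode, count, classify(barcode, expected))
```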


1 Comment

soundsgood · 2024-10-16 at 11:38 pm

This python code might be helpful to find out the mixed indexes from the Undetermined fastq files. But is the input FASTQ file for this code meant for single reads, or can it be applied to a combined FASTQ file of paired reads?
