Sometimes after demultiplexing there exists a high number of undetermined reads, i.e. reads which were not assigned to any library based on the barcodes provided. This is most often the result of incorrect metadata or barcode contamination. Determining what barcodes are present in the undetermined reads can be useful in troubleshooting your run.
NOTE: If you’re sequencing at the NYU Genomics Core, we automatically provide undetermined read data for you in your MultiQC report
The following script allows you to find out what barcodes are present in your undetermined reads and in what frequency. It takes a .fastq.gz file as input and returns all barcodes present in the fastq file sorted in ascending order of frequency.
- You must have Python 3 in order to use this script. On the Prince HPC load the Python 3 module like this:
module load python3/intel/3.6.3
- Save the script above as
- Run the script like this:
python3 count_barcode_frequency.py input.fastq.gz
- The script will return list of barcodes to stdout. Redirect the output to a file to save it for later.
python3 count_barcode_frequency.py input.fastq.gz > input_barcodes.txt
The output consists of all barcodes present in the input fastq file sorted in ascending order of frequency. Executing
tail -20 on
input_barcodes.txt displays the top 20 barcodes found in the input fastq.
[mk5636@log-0 temp]$ tail -20 input_barcodes.txt NNNNNN 42475 GGGGGG 3262198 TAATCG 4550383 CATGGC 5257887 TACAGC 5377243 CACTCA 5530110 ATGAGC 5802017 GAGTGG 5828838 CGTACG 5970294 CACGAT 6319180 ACTGAT 6493155 GTTTCG 6543201 GGTAGC 6715409 CAACTA 6718555 ATTCCT 6747165 CAAAAG 6857987 CAGGCG 6888980 CCAACA 7036683 CATTTT 9409941 GACGAC 12103222
Comparing this output with your library metadata can provide useful insight into the reason behind the high undetermined read count.