Preprocessing DropSeq FASTQ files

Description

This tool converts FASTQ files to an unaligned BAM file, tags the sequence reads with the cell and molecular barcodes, removes reads that contain low-quality bases in the barcodes, trims adapters and polyA tails, converts the preprocessed BAM file back into FASTQ format for alignment, and filters out reads that are too short from the FASTQ file. As output it produces an unaligned, tagged and trimmed BAM file, a trimmed FASTQ file, a summary file, and plots of the process.

Parameters

Base range for cell barcode [1-12]
Base range for molecule barcode [13-20]
Base quality [10]
Adapter sequence [AAGCAGTGGTATCAACGCAGAGTGAATGGG]
Mismatches in adapter [0]
Number of bases to check in adapter [5]
Mismatches in polyA [0]
Number of bases to check in polyA [6]
Minimum length of reads to keep [50]

Details

This tool combines several tools:

DropSeq (TagBamWithReadSequenceExtended, FilterBam)
Picard (FASTQ to BAM, BAM to FASTQ)
Trimmomatic (MINLEN)

Overview of the steps:

Convert the FASTQ files into an unaligned BAM file
Tag the sequence reads in the BAM file with cell and molecular barcodes
Filter and trim the reads in the BAM file
Convert the tagged and trimmed BAM file back to FASTQ format for aligning the reads to the genome
Filter out too short reads from the FASTQ file

Since the FASTQ format cannot hold the cell and molecular barcode information of the sequence reads, we need to convert the FASTQ files into a BAM file. The BAM format has a tag field which can store the barcode information. We also trim and filter the sequence reads in the BAM format. However, aligners accept only FASTQ format as input, so we need to convert the trimmed and filtered BAM file back to FASTQ format. After this preprocessing step, we have one unaligned, tagged and trimmed BAM file that holds the cell and molecular barcodes in its tags, and a FASTQ file ready for alignment. After the alignment, these two files are merged using the tool Merge aligned and unaligned BAM.

Detailed description of the steps:

Convert the FASTQ files into an unaligned BAM file
Tag the sequence reads with cell and molecular barcodes. This step extracts the cell and molecular barcodes from the barcode read and stores the barcode bases in the BAM tags XC and XM, respectively.
The program is run once per barcode to be extracted, adding one tag on each run. On the first iteration, the cell barcode is extracted from the bases given by the Base range for cell barcode parameter (default: 1-12). On the second iteration, the molecular barcode is extracted from the bases of the barcode read given by the Base range for molecule barcode parameter (default: 13-20).
The tool also tags reads where the base quality in the barcode drops below a threshold. The number of bases that fall below the threshold is marked in the XQ tag. This information is used in the subsequent filtering step.
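The tagging logic described above can be illustrated with a minimal Python sketch. It is not the Drop-seq implementation: the function names, the example barcode read and its quality scores are hypothetical, and summing the low-quality counts of both barcodes into a single XQ value is an assumption made for illustration.

```python
def parse_base_range(base_range):
    """Turn a 1-based inclusive range such as "1-12" into a Python slice."""
    start, end = (int(part) for part in base_range.split("-"))
    return slice(start - 1, end)


def tag_barcode(barcode_seq, barcode_quals, base_range, min_quality=10):
    """Return the barcode bases and the number of bases below min_quality."""
    window = parse_base_range(base_range)
    bases = barcode_seq[window]
    low_quality = sum(1 for q in barcode_quals[window] if q < min_quality)
    return bases, low_quality


# Hypothetical 20-base barcode read; bases 3 and 15 have low Phred scores.
barcode_seq = "ACGTACGTACGTTTGGCCAA"
barcode_quals = [30] * 20
barcode_quals[2] = 5
barcode_quals[14] = 8

# First iteration: cell barcode. Second iteration: molecular barcode.
xc, xc_low = tag_barcode(barcode_seq, barcode_quals, "1-12")
xm, xm_low = tag_barcode(barcode_seq, barcode_quals, "13-20")

# Assumption: the low-quality counts of both iterations accumulate in XQ.
tags = {"XC": xc, "XM": xm, "XQ": xc_low + xm_low}
print(tags)
```

With the default ranges, the first 12 bases become the XC tag and the following 8 bases become the XM tag.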
Filter and trim the reads in the BAM file. This step performs several operations:
- XQ tags are used to filter out reads in which more than one barcode base has a quality below the threshold (default: 10).
- Any user-defined sequences are trimmed away. The SMART adapter sequence is offered as the default. You can set how many mismatches are allowed (default: 0) and the minimum length of the adapter stretch that has to match (default: 5 bases).
- Trailing polyA tails are hard-clipped from the reads. The tool searches for contiguous A's from the end of the read. You can set the number of mismatches allowed (default: 0) and the minimum number of A's required for the clipping to happen (default: 6).
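As a rough illustration of the XQ filter and the polyA clipping, the following Python sketch applies both rules to plain strings. It is a simplification under stated assumptions: the function names are hypothetical, only the zero-mismatch case is handled, and adapter trimming is left out.

```python
def passes_barcode_quality(xq, max_low_quality_bases=1):
    """Keep a read unless more than one barcode base fell below the threshold."""
    return xq <= max_low_quality_bases


def trim_trailing_poly_a(seq, min_a=6):
    """Clip a trailing run of A's if it is at least min_a bases long.

    Simplification: no mismatches are tolerated inside the run.
    """
    stripped = seq.rstrip("A")
    run_length = len(seq) - len(stripped)
    return stripped if run_length >= min_a else seq


# Hypothetical reads: (sequence, XQ value from the tagging step).
reads = [
    ("ACGTACGTAAAAAAAA", 0),  # 8 trailing A's: clipped
    ("ACGTACGTAAA", 1),       # only 3 trailing A's: left as is
    ("ACGTACGTACGT", 2),      # two low-quality barcode bases: filtered out
]

processed = [
    trim_trailing_poly_a(seq) for seq, xq in reads if passes_barcode_quality(xq)
]
print(processed)
```

Reads shortened by this kind of clipping are the reason for the length filter in the last step.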
Convert the tagged and trimmed BAM file back to FASTQ format for aligning the reads to the genome
Filter out too short reads from the FASTQ file. After trimming and filtering, you might end up with some rather short reads. It is advisable to remove them, as this makes the alignment step faster. This last step uses the MINLEN option of the Trimmomatic tool; the default value of the Minimum length of reads to keep parameter is 50.
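The effect of the minimum length filter can be sketched in a few lines of Python. This is only an illustration of the rule, not how Trimmomatic is implemented: the function name and the two example reads are hypothetical, and the input is assumed to be uncompressed FASTQ with exactly four lines per record.

```python
import io


def filter_short_reads(fastq_lines, min_length=50):
    """Yield 4-line FASTQ records whose sequence has at least min_length bases."""
    record = []
    for line in fastq_lines:
        record.append(line.rstrip("\n"))
        if len(record) == 4:
            # record[1] is the sequence line of the FASTQ record.
            if len(record[1]) >= min_length:
                yield tuple(record)
            record = []


# Hypothetical input: one 60-base read and one 20-base read.
example = io.StringIO(
    "@read1\n" + "A" * 60 + "\n+\n" + "I" * 60 + "\n"
    "@read2\n" + "C" * 20 + "\n+\n" + "I" * 20 + "\n"
)

kept = [record[0] for record in filter_short_reads(example)]
print(kept)
```

With the default of 50, only the 60-base read is kept.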

For more details, please check the Drop-seq manual and the home page of Picard tools.

Output

[input_name].bam: Tagged, trimmed & filtered unaligned BAM
[input_name].fq.gz: Trimmed and filtered FASTQ file
tagging_and_trimming_summary.txt: Summary of the tagging and trimming steps
tagging_and_trimming_histograms.pdf: Plots showing the failed bases in the tagging steps, and the trimmed adapters and polyA tails