Preprocessing single cell DropSeq FASTQ files


This tool converts FASTQ files into an unaligned BAM file, tags the reads with the cellular and molecular barcodes, removes reads where the cell or molecular barcode has low-quality bases, trims adapters and polyA tails, converts the preprocessed BAM file back into FASTQ format for alignment, and removes reads that are too short from the FASTQ file. The outputs are an unaligned, tagged and trimmed BAM file, a trimmed FASTQ file, and a summary and plots of the process.



This tool combines several tools: DropSeq (TagBamWithReadSequenceExtended, FilterBam), Picard (FASTQ to BAM, BAM to FASTQ) and Trimmomatic (MINLEN).
The steps are:

  1. Convert the FASTQ files into an unaligned BAM file
  2. Tag the reads in the BAM file with the cellular and molecular barcodes
  3. Filter and trim the reads in the BAM file
  4. Convert the tagged and trimmed BAM file back into a FASTQ file for the alignment
  5. Remove reads that are too short from the FASTQ file

Since the FASTQ format cannot hold the information about the cellular and molecular barcodes (or "tags"), we need to convert the FASTQ files into a BAM file, whose tags can hold this information. Some trimming and filtering of the reads can also be done in this format. However, the aligners accept only FASTQ input, which is why the trimmed and filtered BAM file has to be converted back to FASTQ format.

After this preprocessing step, we will have one unaligned, tagged and trimmed BAM file (which holds the molecular and cellular barcode information) and a FASTQ file ready for alignment. After the alignment, the aligned reads and the unaligned, tagged BAM file are merged using the Merge BAM alignment tool.

In the second step, Tag the reads with cellular and molecular barcodes, the tool extracts bases from the read that encodes the cell and molecular barcodes and creates a new BAM tag with those bases on the genome read. The BAM tag XM is used for molecular barcodes and XC for cell barcodes.

This program is run once per barcode extraction to add a tag. On the first iteration, the cell barcode is extracted from the bases set by the first parameter (default: 1-12). On the second iteration, the molecular barcode is extracted from the bases of the barcode read set by the base range for molecular barcode parameter (default: 13-20).
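
The two extraction rounds can be illustrated with a minimal Python sketch. It is not the Drop-seq implementation: the function name extract_range and the example barcode read are hypothetical, and the base ranges are assumed to be 1-based and inclusive.

```python
def extract_range(barcode_read, base_range):
    """Return the bases in a 1-based, inclusive range such as "1-12"."""
    start, end = (int(x) for x in base_range.split("-"))
    return barcode_read[start - 1:end]


# Hypothetical 20-base barcode read:
# 12 cell barcode bases followed by 8 molecular barcode bases.
barcode_read = "ACGTACGTACGTTTGGCCAA"

# One extraction per tag, mirroring the two iterations described above.
tags = {
    "XC": extract_range(barcode_read, "1-12"),   # cell barcode
    "XM": extract_range(barcode_read, "13-20"),  # molecular barcode
}
```

With the default ranges, the first 12 bases end up in the XC tag and the following 8 bases in the XM tag.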

The tool also tags the reads in which the barcode base quality drops below the base quality threshold. The number of bases that fall below the threshold is recorded in the XQ tag. This information is used later in the filtering step.
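
As a rough sketch of how such a count could be computed, assuming Phred+33 encoded quality strings (the encoding, the function name count_low_quality_bases and the example quality strings are assumptions for illustration, not taken from the tool):

```python
def count_low_quality_bases(quality_string, threshold=10, phred_offset=33):
    """Count the bases whose Phred quality is below the threshold."""
    return sum(1 for ch in quality_string if ord(ch) - phred_offset < threshold)


# 'I' encodes quality 40, '+' encodes 10 and '#' encodes 2 (Phred+33).
# Only '#' is below the default threshold of 10.
xq = count_low_quality_bases("IIII#I+IIIII")
```

A value like this is what would be stored in the XQ tag of the read.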

In the third step, Filter and trim the reads in the BAM file, several operations are performed:
First, the information added to the XQ tag by the Tag BAM tool is used to filter out reads where more than one (1) base has a quality below the threshold used in the Tag BAM tool (default: 10).
Next, any user-defined sequences are trimmed away. The user can set how many mismatches are allowed in these sequences (default: 0) and the minimum length of the sequence stretch that has to be present in the read (default: 5 bases). The SMART Adapter sequence is offered as a default.
Lastly, trailing polyA tails are hard clipped from the reads. The tool searches for contiguous A's from the end of the read. The user can again set the number of mismatches allowed (default: 0) and the minimum number of A's required for the clipping to happen (default: 6).
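
The quality filter and the polyA clipping can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not the Drop-seq code: both function names are hypothetical, and the polyA sketch covers only the default case of zero mismatches.

```python
def passes_quality_filter(xq, max_low_quality_bases=1):
    """Keep a read unless more than one barcode base fell below the threshold."""
    return xq <= max_low_quality_bases


def clip_trailing_poly_a(sequence, min_a=6):
    """Clip a run of A's at the end of the read if it has at least min_a bases.

    Simplification: no mismatches are tolerated inside the run.
    """
    stripped = sequence.rstrip("A")
    run_length = len(sequence) - len(stripped)
    return stripped if run_length >= min_a else sequence


# Hypothetical reads: seven trailing A's are clipped, three are left alone.
clipped = clip_trailing_poly_a("ACGTCAAAAAAA")
untouched = clip_trailing_poly_a("ACGTCAAA")
```

Reads with an XQ value of 0 or 1 pass the filter, while a value of 2 or more removes the read.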

After trimming and filtering, some of the reads may end up rather short. It is advisable to remove them, as this makes the alignment step faster. For this purpose, the MINLEN option of Trimmomatic is used in the last step: the default for the Minimum length of reads to keep parameter is 50.
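
The effect of the minimum length filter can be shown with a small Python sketch. It only mimics the behaviour on in-memory records and is not how Trimmomatic is invoked; the function name filter_min_length and the example records are hypothetical.

```python
def filter_min_length(records, min_length=50):
    """Keep FASTQ records (header, sequence, quality) with at least min_length bases."""
    return [rec for rec in records if len(rec[1]) >= min_length]


# Two hypothetical records: a 60-base read and a 30-base read.
records = [
    ("@read1", "A" * 60, "I" * 60),
    ("@read2", "C" * 30, "I" * 30),
]
kept = filter_min_length(records)
```

With the default of 50, only the 60-base read is kept for alignment.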

For more details, please check the Drop-seq manual and the home page of Picard tools.