Preprocessing single cell DropSeq FASTQ files

Description

This tool converts FASTQ files to an unaligned BAM file, tags the reads with the cellular and molecular barcodes, removes reads that contain low-quality bases in the cell or molecular barcode, trims adapters and polyA tails, converts the preprocessed BAM file back into FASTQ format for alignment, and filters out reads that are too short from the FASTQ file. As output it produces an unaligned, tagged and trimmed BAM file, a trimmed FASTQ file, a summary file, and plots of the process.

Parameters

Details

This tool combines several tools. Overview of the steps:
  1. Convert the FASTQ files into an unaligned BAM file
  2. Tag the reads in the BAM file with the cellular and molecular barcodes
  3. Filter and trim the reads in the BAM file
  4. Convert the tagged and trimmed BAM file back to FASTQ format for aligning the reads to the genome
  5. Filter out reads that are too short from the FASTQ file

Since the FASTQ format cannot hold the information about the cellular and molecular barcodes (or "tags"), we need to convert the FASTQ files into a BAM file. The BAM format has a tag field which can store the barcode information. We also trim and filter the reads in the BAM format. However, the aligners accept only FASTQ format as input, so we need to convert the trimmed and filtered BAM back to FASTQ format. After this preprocessing step, we have one unaligned, tagged and trimmed BAM file that holds the molecular and cellular barcodes in its tags, and a FASTQ file ready for alignment. After the alignment, these two files are merged using the tool Merge aligned and unaligned BAM.
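To illustrate how the tag field carries the barcodes, here is a minimal sketch that builds one unaligned record in SAM, the text form of BAM, with the XC and XM tags described on this page. The function name and the example read are hypothetical and only for illustration; the actual conversion and tagging are done by the tools this page describes, not by this code.

```python
def unaligned_sam_record(name, seq, qual, cell_barcode, molecular_barcode):
    """Format one unaligned SAM line with barcode tags (illustrative helper)."""
    fields = [
        name,   # QNAME: read name
        "4",    # FLAG: 4 means the read is unmapped
        "*",    # RNAME: no reference sequence
        "0",    # POS: no position
        "0",    # MAPQ
        "*",    # CIGAR: not available for unaligned reads
        "*",    # RNEXT
        "0",    # PNEXT
        "0",    # TLEN
        seq,    # SEQ: bases of the genome read
        qual,   # QUAL: base qualities of the genome read
        f"XC:Z:{cell_barcode}",       # cell barcode as a string tag
        f"XM:Z:{molecular_barcode}",  # molecular barcode as a string tag
    ]
    return "\t".join(fields)


# Hypothetical read, used only to show the layout of the record.
record = unaligned_sam_record("read1", "ACGTTGCA", "IIIIIIII",
                              "ACGTACGTACGT", "TTGGCCAA")
print(record)
```

A FASTQ record has only a name, a sequence and a quality string, so it has no place for the last two fields; this is why the barcodes are kept in the BAM file.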

Detailed description of the steps:
  1. Convert the FASTQ files into an unaligned BAM file
  2. Tag the reads with cellular and molecular barcodes. This step extracts the bases from the read that contains the cell and molecular barcodes, and creates a new BAM tag with those bases on the genome read. We use the BAM tag XM for molecular barcodes, and XC for cell barcodes.
    This program is run once per barcode extraction to add a tag. On the first iteration, the cell barcode is extracted from the bases determined in the first parameter (default: 1-12). On the second iteration, the molecular barcode is extracted from the bases of the barcode read determined by the base range for molecular barcode parameter (default: 13-20).
    The tool also tags reads where the base quality in the barcode drops below a threshold. The number of bases that fall below the threshold is marked in the XQ tag. This information is used in the subsequent filtering step.
  3. Filter and trim the reads in the BAM file. This step does several things:
    First, the information added to the XQ tag in the Tag BAM tool is used to filter out reads where more than one base in the barcode has a quality below the threshold used in the Tag BAM tool (default: 10).
    Next, any user-defined sequences are trimmed away. The user can set how many mismatches are allowed in these sequences (default: 0) and the minimum length of the sequence that must be present in the read (default: 5 bases). The SMART Adapter sequence is offered as a default.
    Lastly, trailing polyA tails are hard-clipped from the reads. The tool searches for contiguous A's from the end of the read. The user can again set the number of mismatches allowed (default: 0) and the minimum number of A's required for the clipping to happen (default: 6).
  4. Convert the tagged and trimmed BAM file back to FASTQ format for aligning the reads to the genome
  5. Filter out reads that are too short from the FASTQ file. After trimming and filtering, you might end up with some rather short reads in your BAM file. It is advisable to remove those, as this makes the alignment step faster. For this purpose, the Trimmomatic tool with the MINLEN option is used in the last step: the default value of the Minimum length of reads to keep parameter is 50.
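The logic of steps 2, 3 and 5 can be sketched in plain Python using the default values given above. This is a simplified illustration, not the implementation used by the underlying tools: all function names are hypothetical, the qualities are assumed to be Phred+33 encoded, and the polyA sketch handles only the zero-mismatch case. Adapter trimming is left out.

```python
def extract_bases(seq, first, last):
    """Return bases first..last of the barcode read (1-based, inclusive)."""
    return seq[first - 1:last]


def count_low_quality(qual, first, last, threshold=10, offset=33):
    """Count bases in the range whose quality is below the threshold.

    Assumes Phred+33 encoded quality characters.
    """
    return sum(1 for ch in qual[first - 1:last] if ord(ch) - offset < threshold)


def trim_poly_a(seq, min_length=6):
    """Clip a trailing run of A's if it has at least min_length bases.

    Only the zero-mismatch case is handled in this sketch.
    """
    stripped = seq.rstrip("A")
    if len(seq) - len(stripped) >= min_length:
        return stripped
    return seq


def long_enough(seq, min_length=50):
    """Keep only reads with at least min_length bases after trimming."""
    return len(seq) >= min_length


# Hypothetical 20-base barcode read: '#' is quality 2, 'I' is quality 40.
barcode_seq = "ACGTACGTACGTTTGGCCAA"
barcode_qual = "IIIIIIIIIIII" + "II#IIII#"

cell_barcode = extract_bases(barcode_seq, 1, 12)         # would go to tag XC
molecular_barcode = extract_bases(barcode_seq, 13, 20)   # would go to tag XM
low_quality_bases = count_low_quality(barcode_qual, 1, 20)  # would go to tag XQ

# As described in step 3: reject reads with more than one low-quality barcode base.
keep_read = low_quality_bases <= 1

# Hypothetical genome read ending in eight A's.
genome_read = "ACGTTGC" + "A" * 8
trimmed = trim_poly_a(genome_read)

print(cell_barcode, molecular_barcode, low_quality_bases, keep_read)
print(trimmed, long_enough(trimmed))
```

In this example the barcode read has two bases below the quality threshold, so the read would be filtered out; the trimmed genome read is also far shorter than 50 bases and would be removed in the last step.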

For more details, please check the Drop-seq manual and the home page of Picard tools.

Output