Preprocessing DropSeq FASTQ files

Description

This tool converts FASTQ files to an unaligned BAM file, tags the sequence reads with the cell and molecular barcodes, removes reads that contain low-quality bases in the barcodes, trims adapters and polyA tails, converts the preprocessed BAM file back into FASTQ format for alignment, and filters out reads that are too short from the FASTQ file. As output it produces an unaligned, tagged and trimmed BAM file, a trimmed FASTQ file, a summary file, and plots of the process.

Parameters

Details

This tool combines several tools. Overview of the steps:
  1. Convert the FASTQ files into an unaligned BAM file
  2. Tag the sequence reads in the BAM file with cell and molecular barcodes
  3. Filter and trim the reads in the BAM file
  4. Convert the tagged and trimmed BAM file back to FASTQ format for aligning the reads to the genome
  5. Filter out reads that are too short from the FASTQ file

Since the FASTQ format cannot hold the cell and molecular barcode information of the sequence reads, we need to convert the FASTQ files into a BAM file. The BAM format has a tag field which can store the barcode information. We also trim and filter the sequence reads in the BAM format. However, aligners accept only FASTQ format as input, so we need to convert the trimmed and filtered BAM file back to FASTQ format. After this preprocessing step, we have one unaligned, tagged and trimmed BAM file that holds the cell and molecular barcodes in its tags, and a FASTQ file ready for alignment. After the alignment, these two files are merged using the tool Merge aligned and unaligned BAM.
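
The reason for keeping both files can be illustrated with a small Python sketch. It is not the actual merge tool: the record layout, the read names and the helper merge_tags are hypothetical, and it assumes that the barcode tags are re-attached to the aligned reads by matching read names.

```python
# Hypothetical, simplified records; real BAM and FASTQ parsing is not shown.
# Tags stored in the unaligned BAM file, keyed by read name.
unaligned_tags = {
    "read1": {"XC": "ACGTACGTACGT", "XM": "TTGGCCAA"},
    "read2": {"XC": "GGGGAAAACCCC", "XM": "ACACACAC"},
}

# Alignment results for the FASTQ reads; the aligner knows nothing about barcodes.
aligned = [
    {"name": "read1", "chrom": "chr1", "pos": 1000},
    {"name": "read2", "chrom": "chr2", "pos": 2500},
]


def merge_tags(aligned_records, tags_by_name):
    """Attach the barcode tags to each aligned record by read name (assumed key)."""
    merged = []
    for record in aligned_records:
        tags = tags_by_name.get(record["name"], {})
        merged.append({**record, **tags})
    return merged


merged = merge_tags(aligned, unaligned_tags)
```

Each merged record then carries both the alignment position and the XC and XM tags, which is what the later steps of the analysis need.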

Detailed description of the steps:
  1. Convert the FASTQ files into an unaligned BAM file
  2. Tag the sequence reads with cell and molecular barcodes. This step extracts the cell and molecular barcodes from the barcode read and stores the barcode bases in the BAM tags XC and XM, respectively.
    The program is run once per barcode extraction to add a tag. On the first iteration, the cell barcode is extracted from the bases given by the first parameter (default: 1-12). On the second iteration, the molecular barcode is extracted from the bases given by the base range for molecular barcode parameter (default: 13-20) of the barcode read.
    The tool also tags reads where the base quality in the barcode drops below a threshold. The number of bases that fall below the threshold is marked in the XQ tag. This information is used in the subsequent filtering step.
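  3. Filter and trim the reads in the BAM file. This step performs several things: it removes reads that contain low-quality bases in the barcodes, and trims adapters and polyA tails.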
  4. Convert the tagged and trimmed BAM file back to FASTQ format for aligning the reads to the genome
  5. Filter out reads that are too short from the FASTQ file. After trimming and filtering, you might end up with some rather short reads. It is advisable to remove those, as this makes the alignment step faster. The Trimmomatic tool with the MINLEN option is used in this last step: the default value of the Minimum length of reads to keep parameter is 50.
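
The logic of steps 2 and 5 can be sketched in Python as follows. This is only an illustration, not the code the tool runs: the function names are hypothetical, the quality threshold of 10 is an assumed value, and the base ranges and minimum length are the defaults mentioned above.

```python
def parse_range(base_range):
    """Turn a 1-based inclusive range such as "1-12" into a Python slice."""
    start, end = (int(x) for x in base_range.split("-"))
    return slice(start - 1, end)


def tag_barcode_read(seq, quals, cell_range="1-12", mol_range="13-20",
                     min_quality=10):
    """Return the XC, XM and (if needed) XQ tags for one barcode read.

    seq is the barcode read sequence, quals its per-base quality scores.
    min_quality is an assumed threshold, used here only for illustration.
    """
    cell = parse_range(cell_range)
    mol = parse_range(mol_range)
    tags = {"XC": seq[cell], "XM": seq[mol]}
    # Count the barcode bases whose quality falls below the threshold.
    low = sum(1 for q in quals[cell] + quals[mol] if q < min_quality)
    if low > 0:
        tags["XQ"] = low
    return tags


def keep_read(seq, min_length=50):
    """Mimic a minimum length filter: keep reads of at least min_length bases."""
    return len(seq) >= min_length


barcode_seq = "ACGTACGTACGTTTGGCCAA"   # 12 cell barcode bases + 8 molecular bases
barcode_quals = [30] * 20
barcode_quals[2] = 5                   # one low-quality base in the cell barcode
barcode_quals[15] = 8                  # one low-quality base in the molecular barcode

tags = tag_barcode_read(barcode_seq, barcode_quals)
```

In this example the read gets XC set to the first 12 bases, XM set to the next 8 bases, and an XQ value of 2, so the filtering step could discard it depending on how many low-quality barcode bases are tolerated.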

For more details, please check the Drop-seq manual and the home page of Picard tools.

Output