Create digital gene expression matrix

Description

This Drop-seq tool does two things:

identifies and corrects bead synthesis errors
extracts gene expression values from a BAM file where reads have been tagged with gene names

In the bead synthesis step (step 1) the tool identifies cell barcodes with aberrant fixed UMI bases. If only the last UMI base is fixed as a T, the cell barcode is corrected (the last base is trimmed off) and all cell barcodes with identical sequence at the first 11 bases are merged together. If any other UMI base is fixed, the reads with that cell barcode are discarded.
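The decision rule above can be sketched in a few lines of Python. This is only an illustration, not the tool's actual code: the function names, the input format (a list of fixed UMI positions per cell barcode) and the UMI length of 8 are assumptions made for the example.

```python
from collections import defaultdict

UMI_LENGTH = 8       # assumption: UMI length, not stated in this description
PREFIX_LENGTH = 11   # from the text: barcodes are merged on their first 11 bases


def classify_barcode(fixed_bases):
    """Decide what happens to one cell barcode.

    fixed_bases: list of (0-based UMI position, base) pairs flagged as fixed.
    """
    if not fixed_bases:
        return "keep"      # no aberrant fixed base
    if fixed_bases == [(UMI_LENGTH - 1, "T")]:
        return "repair"    # only the last UMI base is fixed as T
    return "discard"       # any other UMI base is fixed


def merge_repaired(barcodes):
    """Group repaired barcodes that share the same first 11 bases."""
    groups = defaultdict(list)
    for barcode in barcodes:
        groups[barcode[:PREFIX_LENGTH]].append(barcode)
    return dict(groups)
```

For example, classify_barcode([(7, "T")]) gives "repair", while classify_barcode([(3, "A")]) gives "discard".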

The tool asks the user to select the number of barcodes on which to perform the correction. The original Drop-seq manual recommends using roughly 2 times the anticipated number of cells, as the tool developers have empirically found that this recovers nearly every defective cell barcode that corresponds to a STAMP (rather than an empty bead cell barcode).

This program reads in the BAM file and looks at the distribution of bases at each position of all UMIs for a cell barcode. It detects unusual base frequency distributions: a base with ≥ 80% frequency at any position is flagged as an error. Barcodes with fewer than 25 total UMIs are ignored.
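A minimal sketch of this frequency check, assuming the UMIs of one cell barcode are already available as a list of equal-length strings. The function name and its arguments are hypothetical; only the 80% and 25 UMI thresholds come from the description above.

```python
from collections import Counter


def find_fixed_positions(umis, min_umis=25, max_fraction=0.8):
    """Return (0-based position, base) pairs where one base dominates.

    umis: UMI strings observed for a single cell barcode (equal length).
    Returns None when the barcode has fewer than min_umis UMIs and is
    therefore ignored.
    """
    if len(umis) < min_umis:
        return None
    fixed = []
    for pos in range(len(umis[0])):
        # Count how often each base occurs at this UMI position.
        counts = Counter(umi[pos] for umi in umis)
        base, count = counts.most_common(1)[0]
        if count / len(umis) >= max_fraction:
            fixed.append((pos, base))
    return fixed
```

An empty list means the barcode was tested and no position looked fixed. A result such as [(7, "T")] means only the last base of an 8-base UMI is fixed as T.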

The tool also checks for PRIMER_MATCHes, where the UMI perfectly matches one of the PCR primers. These cell barcodes are dropped. These errors are only detected if a PRIMER_SEQUENCE argument is supplied as a parameter.

From the digital expression stage (step 2), there are two outputs available:
-the primary output is the DGE matrix, with a row for each gene and a column for each cell
-the secondary output is a summary of the DGE matrix on a per-cell level, indicating the number of genes and transcripts observed.
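To illustrate how the two outputs relate, the per-cell summary can be derived from a DGE matrix as sketched below. The nested-dictionary layout and the function name are assumptions chosen for the example. They do not describe the tool's file format or internals.

```python
def summarize_cells(dge):
    """Compute (genes detected, transcripts) for every cell.

    dge: {gene_name: {cell_barcode: transcript_count}}
    """
    summary = {}
    for counts in dge.values():
        for cell, n in counts.items():
            genes, transcripts = summary.get(cell, (0, 0))
            if n > 0:
                genes += 1          # this gene is detected in the cell
                transcripts += n    # add its transcript (UMI) count
            summary[cell] = (genes, transcripts)
    return summary


# Toy matrix: two genes, two cells.
dge = {
    "GeneA": {"cell1": 2, "cell2": 0},
    "GeneB": {"cell1": 1, "cell2": 3},
}
print(summarize_cells(dge))
```

In this toy matrix, cell1 has 2 genes and 3 transcripts, and cell2 has 1 gene and 3 transcripts.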

Method:

For each gene, find the molecular barcodes on the exons of that gene.
Determine how many high-quality (HQ) mapped reads are assigned to each barcode.
Collapse barcodes by edit distance.
Throw away barcodes with fewer than the threshold number of reads.
Count the number of remaining unique molecular barcodes for the gene.
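The steps above can be sketched for a single gene in a single cell as follows. This is a simplified illustration under stated assumptions, not the tool's implementation: Hamming distance stands in for the edit distance, the collapse is a simple greedy merge into the barcode with the most reads, and the function and parameter names are hypothetical.

```python
from collections import Counter


def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def count_transcripts(umis, edit_distance=1, min_reads=1):
    """Count molecular barcodes for one gene in one cell.

    umis: one UMI string per HQ mapped read on the exons of the gene.
    """
    # Determine how many reads are assigned to each barcode.
    reads_per_umi = Counter(umis)

    # Collapse: visit barcodes from most to fewest reads and merge each one
    # into an already kept barcode that lies within the distance threshold.
    kept = {}
    for umi, n in sorted(reads_per_umi.items(), key=lambda kv: (-kv[1], kv[0])):
        parent = next((k for k in kept if hamming(k, umi) <= edit_distance), None)
        if parent is None:
            kept[umi] = n
        else:
            kept[parent] += n

    # Drop barcodes below the read threshold and count what remains.
    return sum(1 for n in kept.values() if n >= min_reads)
```

With reads carrying AAAAAAAA five times, AAAAAAAT once and CCCCCCCC twice, the single-mismatch barcode is merged into AAAAAAAA, so two molecular barcodes remain at min_reads=1 and one remains at min_reads=3.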

This program requires a tag for the gene a read is on, a molecular barcode tag, and an exon tag. The exon and gene tags may not be present on every read.

Selecting the set of cells:
First choose the selection criterion (How to filter the DGE matrix):

Number of core barcodes: Counts the number of reads per cell barcode and includes the N cells with the most reads.
Min number of genes per cell: Select cells that have at least this many genes detected.

Then set the filtering parameter (Filtering threshold).
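The two selection criteria can be illustrated with the short sketch below. The dictionaries of per-cell read and gene counts and the function names are assumptions made for the example. Ties between cells with equal read counts are broken alphabetically here, which may differ from the tool.

```python
def top_cells_by_reads(reads_per_cell, n):
    """Number of core barcodes: keep the n cells with the most reads."""
    ranked = sorted(reads_per_cell, key=lambda cell: (-reads_per_cell[cell], cell))
    return ranked[:n]


def cells_with_min_genes(genes_per_cell, min_genes):
    """Min number of genes per cell: keep cells with at least min_genes genes."""
    return [cell for cell, genes in genes_per_cell.items() if genes >= min_genes]


reads = {"cellA": 1200, "cellB": 300, "cellC": 4500}
genes = {"cellA": 410, "cellB": 35, "cellC": 980}

print(top_cells_by_reads(reads, 2))      # the two cells with the most reads
print(cells_with_min_genes(genes, 100))  # cells with at least 100 genes
```

With a threshold of 0 for the minimum number of genes, every cell passes the filter.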

For more details, please check the Drop-seq manual.

Parameters

Estimate the number of barcodes for correction [2000]
Sequence [AAGCAGTGGTATCAACGCAGAGTGAATGGG]
How to filter the DGE matrix [Min number of genes per cell]
Filtering threshold [0]

Output

cleaned.bam: Cleaned BAM
synthesis_stats.txt: Synthesis statistics
synthesis_stats_summary.txt: Counts of the synthesis errors
digital_expression.tsv: The DGE matrix (primary output)
digital_expression_summary.txt: Summary of the DGE matrix on a per-cell level (secondary output)