Create digital gene expression matrix
Description
This Drop-seq tool does two things:
- identifies and corrects bead synthesis errors
- extracts gene expression values from a BAM file where reads have been tagged with gene names
In the bead synthesis step (step 1) the tool identifies cell barcodes with aberrant fixed UMI bases.
If only the last UMI base is fixed as a T, the cell barcode is corrected (the last base is trimmed off) and
all cell barcodes with identical sequence at the first 11 bases are merged together.
If any other UMI base is fixed, the reads with that cell barcode are discarded.
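The repair rule above can be illustrated with a minimal Python sketch. It is not the Drop-seq implementation. The function name, the input dictionaries, the 8-base UMI length and the use of the 11-base prefix as the merged barcode name are all assumptions made for illustration. The sketch also assumes that a last UMI base fixed as anything other than T leads to the barcode being discarded, which the text above does not state.

```python
from collections import Counter

UMI_LENGTH = 8  # assumption: UMI length is not stated on this page


def repair_barcodes(read_counts, fixed_bases, umi_length=UMI_LENGTH):
    """Apply the repair/discard rule to per-barcode read counts.

    read_counts: {cell_barcode: number_of_reads}
    fixed_bases: {cell_barcode: {umi_position (0-based): fixed_base}}
    Returns (merged_read_counts, discarded_barcodes).
    """
    last = umi_length - 1
    discarded = set()
    repaired_prefixes = set()
    for barcode, fixed in fixed_bases.items():
        if not fixed:
            continue  # no fixed UMI base, barcode is left alone
        if fixed == {last: "T"}:
            # only the last UMI base is fixed as T: trim the last barcode base
            repaired_prefixes.add(barcode[:-1])
        else:
            # any other fixed base: reads of this barcode are discarded
            discarded.add(barcode)

    merged = Counter()
    for barcode, n in read_counts.items():
        if barcode in discarded:
            continue
        prefix = barcode[:-1]
        # barcodes sharing a repaired prefix are merged under that prefix
        merged[prefix if prefix in repaired_prefixes else barcode] += n
    return dict(merged), discarded
```

Here a barcode whose only fixed base is a T at the last UMI position is trimmed, and every surviving barcode with the same leading bases is pooled with it.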
The tool asks the user to select a number of barcodes on which to perform the correction.
The original Drop-seq manual recommends using roughly twice the anticipated number of cells,
as the tool developers have found empirically that this recovers nearly every defective cell barcode
that corresponds to a STAMP (rather than an empty bead cell barcode).
This program reads in the BAM file and looks at the distribution of bases at each position of all UMIs
for a cell barcode. It detects unusual base frequency distributions: a base with ≥ 80% frequency at
any position is flagged as an error. Barcodes with fewer than 25 total UMIs are ignored.
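As a rough illustration of this check, the following Python sketch flags UMI positions dominated by a single base. The function name and parameter names are hypothetical. The sketch assumes that all UMIs of a barcode have the same length and that every UMI in the list counts once. It is not the tool's actual code.

```python
from collections import Counter


def find_fixed_bases(umis, min_umis=25, min_fraction=0.8):
    """Return {position: base} for UMI positions dominated by one base.

    umis: list of UMI strings observed for one cell barcode.
    Barcodes with fewer than min_umis UMIs are ignored (empty result).
    """
    if len(umis) < min_umis:
        return {}
    fixed = {}
    for pos in range(len(umis[0])):
        # most frequent base at this position and how often it occurs
        base, count = Counter(umi[pos] for umi in umis).most_common(1)[0]
        if count / len(umis) >= min_fraction:
            fixed[pos] = base
    return fixed
```

An empty result means either that the barcode had too few UMIs to be assessed or that no position reached the frequency threshold.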
The tool also checks for PRIMER_MATCHes, where the UMI perfectly matches one of the PCR primers.
Cell barcodes with such matches are dropped.
Primer matches are only detected if a PRIMER_SEQUENCE argument is supplied as a parameter.
From the digital expression stage (step 2), there are two outputs available:
- the primary output is the DGE matrix, with a row for each gene and a column for each cell
- the secondary output is a summary of the DGE matrix on a per-cell level,
indicating the number of genes and transcripts observed.
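To make the relation between the two outputs concrete, here is a small Python sketch that derives a per-cell summary from a DGE matrix held in memory. The nested-dictionary layout and the function name are assumptions for illustration. The real files are tab-separated text, and the exact columns of the summary file are not reproduced here.

```python
def summarize_cells(dge):
    """Summarise a DGE matrix per cell.

    dge: {gene: {cell_barcode: transcript_count}}
    Returns {cell_barcode: {"genes": n_genes_detected, "transcripts": total_count}}.
    """
    summary = {}
    for gene, counts in dge.items():
        for cell, n in counts.items():
            entry = summary.setdefault(cell, {"genes": 0, "transcripts": 0})
            if n > 0:
                entry["genes"] += 1        # gene observed in this cell
                entry["transcripts"] += n  # add its transcript count
    return summary
```

In this sketch a gene counts as observed in a cell when its transcript count is above zero.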
Method:
- For each gene, find the molecular barcodes on the exons of that gene.
- Determine how many HQ mapped reads are assigned to each barcode.
- Collapse barcodes by edit distance.
- Throw away barcodes with fewer than a threshold number of reads.
- Count the number of remaining unique molecular barcodes for the gene.
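The counting steps above can be sketched in Python for a single gene in a single cell. This is an illustration only. The input is assumed to be the UMIs of the high-quality exonic reads already assigned to that gene. The greedy collapse (most-read barcode first, Hamming distance of at most 1) is one common way to collapse by edit distance and is an assumption here, as are the function and parameter names.

```python
from collections import Counter


def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def count_transcripts(umis, max_distance=1, min_reads=1):
    """Count unique molecular barcodes for one gene in one cell.

    umis: one UMI string per read assigned to the gene.
    """
    reads_per_umi = Counter(umis)
    collapsed = {}
    # visit barcodes from most to fewest reads (ties alphabetical)
    ordered = sorted(reads_per_umi.items(), key=lambda kv: (-kv[1], kv[0]))
    for umi, n in ordered:
        parent = next(
            (p for p in collapsed if hamming(p, umi) <= max_distance), None
        )
        if parent is None:
            collapsed[umi] = n      # new molecular barcode
        else:
            collapsed[parent] += n  # fold reads into the nearby barcode
    # drop barcodes below the read threshold, then count what is left
    return sum(1 for n in collapsed.values() if n >= min_reads)
```

Collapsing before thresholding follows the order of the steps listed above: a low-read barcode that differs by one base from a well-supported one adds its reads to that barcode instead of being counted as a separate molecule.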
This program requires a tag naming the gene a read is on, a molecular barcode tag, and an exon tag.
The exon and gene tags may not be present on every read.
Selecting the set of cells:
First choose the selection criterion (How to filter the DGE matrix):
- Number of core barcodes: Counts the number of reads per cell barcode and includes the N cells with
the most reads.
- Min number of genes per cell: Select cells that have at least this many genes detected.
Then set the filtering parameter.
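The two selection criteria can be illustrated with a short Python sketch. The function name, the criterion labels "core_barcodes" and "min_genes", and the dictionary inputs are hypothetical and do not correspond to the tool's internal names.

```python
def select_cells(reads_per_cell, genes_per_cell, criterion, threshold):
    """Pick cell barcodes to keep in the DGE matrix.

    reads_per_cell: {cell_barcode: number_of_reads}
    genes_per_cell: {cell_barcode: number_of_genes_detected}
    criterion: "core_barcodes" (top N cells by reads) or
               "min_genes" (cells with at least N genes detected)
    """
    if criterion == "core_barcodes":
        ranked = sorted(reads_per_cell, key=lambda c: (-reads_per_cell[c], c))
        return ranked[:threshold]
    if criterion == "min_genes":
        return sorted(c for c, g in genes_per_cell.items() if g >= threshold)
    raise ValueError(f"unknown criterion: {criterion}")
```

With the min-genes criterion and a threshold of 0, every cell passes the filter.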
For more details, please check the Drop-seq manual.
Parameters
- Estimate the number of barcodes for correction [2000]
- Sequence [AAGCAGTGGTATCAACGCAGAGTGAATGGG]
- How to filter the DGE matrix [Min number of genes per cell]
- Filtering threshold [0]
Output
- cleaned.bam: Cleaned BAM
- synthesis_stats.txt: Synthesis statistics
- synthesis_stats_summary.txt: Counts of the synthesis errors
- digital_expression.tsv: The DGE matrix (primary output)
- digital_expression_summary.txt: Summary of the DGE matrix on a per-cell level (secondary output)