Filter reads for duplicates

Description

This tool filters reads for duplicates. The input data can be in FASTQ or in FASTA format.

Details

The main purpose of removing duplicates is to mitigate the effects of PCR amplification bias introduced during library construction. In addition, removing duplicates can bring computational benefits by reducing the number of sequences that need to be processed and by lowering the memory requirements. Sequence duplicates can also distort abundance or expression measures and can result in false variant (SNP) calls.

The duplicate reads can be defined in several ways:

exact duplicates are reads that are identical copies of another read.
5' or 3' duplicates are reads that are identical to the 5' or 3' end of a longer read.
exact reverse complement duplicates are reads whose reverse complement is identical to another read.
reverse complement 5' or 3' duplicates are reads whose reverse complement is identical to the 5' or 3' end of a longer read.
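
The four definitions above can be illustrated with a minimal Python sketch. It is not the tool's own implementation: the function names (reverse_complement, duplicate_kinds) and the returned labels are hypothetical, and the sketch assumes plain DNA sequences compared as strings.

```python
def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence (A, C, G, T, N)."""
    table = str.maketrans("ACGTNacgtn", "TGCANtgcan")
    return seq.translate(table)[::-1]


def duplicate_kinds(read, other):
    """Return the definitions under which `read` is a duplicate of `other`.

    Hypothetical helper for illustration only; the labels are not tool output.
    """
    kinds = set()

    def check(candidate, prefix):
        if candidate == other:
            kinds.add(prefix + "exact")
        elif len(candidate) < len(other):
            # Shorter candidate: compare against the ends of the longer read.
            if other.startswith(candidate):
                kinds.add(prefix + "5'")
            if other.endswith(candidate):
                kinds.add(prefix + "3'")

    check(read, "")
    check(reverse_complement(read), "reverse complement ")
    return kinds


print(duplicate_kinds("AACG", "AACGGA"))
print(duplicate_kinds("AACG", "CGTT"))
```

A read can match under more than one definition at once (for example, a palindromic sequence equals its own reverse complement), which is why the sketch returns a set rather than a single label.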

In addition to choosing how duplicates are defined, you should also set a threshold value: Number of allowed duplicates. For example, to remove sequences that occur more than 5 times, you would specify the value 6. Note that this parameter is used only when filtering exact duplicates or exact reverse complement duplicates.
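
The threshold arithmetic can be sketched as follows. This is a minimal illustration, assuming exact duplicates only and reads given as plain strings; the function name sequences_over_threshold is hypothetical. The sketch only identifies which sequences the threshold affects. It does not model how many copies of those sequences the tool keeps, since that is not stated here.

```python
from collections import Counter


def sequences_over_threshold(reads, threshold):
    """Return the sequences that occur `threshold` times or more.

    With threshold = 6 this selects sequences occurring more than 5 times.
    Hypothetical helper for illustration; not part of the tool.
    """
    counts = Counter(reads)
    return {seq for seq, n in counts.items() if n >= threshold}


# Six copies of AAAA, five copies of CCCC, one copy of GGGG.
reads = ["AAAA"] * 6 + ["CCCC"] * 5 + ["GGGG"]
print(sequences_over_threshold(reads, 6))  # {'AAAA'}
```

With the value 6, the sequence seen five times is left alone, and only the sequence seen six times is selected.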

Output

The reads that pass the filtering condition are saved to a file called accepted.fastq or accepted.fasta. You can also choose to store the reads that are filtered out in a separate file (rejected.fastq or rejected.fasta), and to print out a log file that contains information about the filtering task and statistics on how many reads were accepted and rejected.

Reference

This tool is based on the PRINSEQ package.