This tool filters reads for duplicates. The input data can be in FASTQ or in FASTA format.
The main purpose of removing duplicates is to mitigate the effects of PCR amplification bias introduced during library construction. Removing duplicates can also bring computational benefits by reducing the number of sequences that need to be processed and by lowering memory requirements. In addition, sequence duplicates can distort abundance or expression measures and can lead to false variant (SNP) calls.
The duplicate reads can be defined in several ways:

The reads that pass the filtering condition are saved to a file called accepted.fastq or accepted.fasta. You can also choose to store the reads that are filtered out in a separate file (rejected.fastq or rejected.fasta). In addition, you can print out a log file that contains information about the filtering task and statistics on how many reads were accepted and rejected.
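
The accepted/rejected split described above can be illustrated with a minimal Python sketch. It is not the tool's actual implementation: it assumes the simplest possible definition of a duplicate (a read whose sequence is identical to an earlier read, with the first copy kept), handles only FASTA input, and uses hypothetical helper names (parse_fasta, split_duplicates).

```python
def parse_fasta(lines):
    """Yield (header, sequence) pairs from FASTA-formatted lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def split_duplicates(records):
    """Split records into (accepted, rejected, stats).

    Assumption: a duplicate is a read whose sequence is identical
    (ignoring case) to an earlier read; the first copy is accepted.
    """
    seen = set()
    accepted, rejected = [], []
    for header, seq in records:
        key = seq.upper()
        if key in seen:
            rejected.append((header, seq))
        else:
            seen.add(key)
            accepted.append((header, seq))
    # Counts of the kind a log file could report.
    stats = {"accepted": len(accepted), "rejected": len(rejected)}
    return accepted, rejected, stats


# Small in-memory example instead of real input files.
example = """>read1
ACGTACGT
>read2
TTGGCCAA
>read3
ACGTACGT
""".splitlines()

accepted, rejected, stats = split_duplicates(parse_fasta(example))
print(stats)  # {'accepted': 2, 'rejected': 1}
```

In this sketch read3 is rejected because its sequence matches read1. In a real run the accepted and rejected lists would be written to the accepted and rejected output files, and the counts would go into the log.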