Filter reads for low complexity

Description

This tool filters out low-complexity reads using either the DUST or the Entropy method. The method is selected by defining a threshold value for it.

Details

The input data can be in FASTQ or FASTA format. Give a threshold value for either the DUST or the Entropy filter, but not both. The method whose threshold value is defined is the one that is used.
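The selection rule described above can be sketched in Python as follows. The function name select_method and its parameter names are hypothetical, and raising an error when both or neither threshold is given is an assumption of this sketch; the tool itself may handle those cases differently.

```python
from typing import Optional, Tuple


def select_method(dust_threshold: Optional[float] = None,
                  entropy_threshold: Optional[float] = None) -> Tuple[str, float]:
    """Return the (method, threshold) pair implied by whichever threshold is set."""
    if dust_threshold is not None and entropy_threshold is not None:
        # Assumption: giving both thresholds is treated as an error here.
        raise ValueError("give a threshold for DUST or Entropy, not both")
    if dust_threshold is not None:
        return ("dust", dust_threshold)
    if entropy_threshold is not None:
        return ("entropy", entropy_threshold)
    # Assumption: giving no threshold at all is also treated as an error.
    raise ValueError("no threshold given")
```

For example, select_method(dust_threshold=7) returns ("dust", 7).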

The DUST method uses the threshold value as the maximum allowed score: reads that score above it are filtered out. The DUST approach is adapted from the algorithm used to mask low-complexity regions during BLAST search preprocessing. Scores are computed from how often each trinucleotide occurs in the sequence and are scaled from 0 to 100. Higher scores imply lower complexity. A homopolymer sequence (e.g. TTTTTTTTT) has a score of 100, a dinucleotide repeat (e.g. TATATATATA) a score of around 49, and a trinucleotide repeat (e.g. TAGTAGTAGTAG) a score of around 32.
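To make the scoring idea concrete, here is a minimal Python sketch of a DUST-style score. It is not the tool's own code: the function name dust_score is hypothetical, and the sketch assumes the whole read is scored as a single window and normalized by the largest score possible for its length. The tool's handling of long or very short reads may differ.

```python
from collections import Counter


def dust_score(seq: str) -> float:
    """Simplified single-window DUST-style score, scaled from 0 to 100."""
    seq = seq.upper()
    n = len(seq) - 2  # number of overlapping trinucleotides in the read
    if n < 2:
        # Assumption: reads too short to score are rejected by this sketch.
        raise ValueError("sequence too short to score")
    counts = Counter(seq[i:i + 3] for i in range(n))
    # A trinucleotide seen c times contributes c * (c - 1) / 2 repeated pairs.
    pairs = sum(c * (c - 1) / 2 for c in counts.values())
    # Normalize by the pair count of a homopolymer of the same length,
    # so that a homopolymer scores exactly 100.
    return 100.0 * pairs / (n * (n - 1) / 2)
```

With this sketch, a homopolymer scores 100, a 64-base TA repeat scores about 49, and a 64-base TAG repeat about 32. Shorter reads can score differently under this simplification; the 10-base TATATATATA gives about 43.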

The Entropy method uses the threshold value as the minimum allowed value: reads with an entropy below it are filtered out. The Entropy approach evaluates the entropy of the trinucleotides in a sequence. Entropy values are scaled from 0 to 100, and lower values imply lower complexity. A homopolymer sequence (e.g. TTTTTTTTT) has an entropy value of 0, a dinucleotide repeat (e.g. TATATATATA) a value of around 16, and a trinucleotide repeat (e.g. TAGTAGTAGTAG) a value of around 26.
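The entropy calculation can be sketched in a similar way. Again this is an illustration under stated assumptions rather than the tool's own code: the function name entropy_score is hypothetical, the whole read is treated as a single window, and the Shannon entropy is normalized by the logarithm of the number of trinucleotide positions (capped at 64, the number of distinct trinucleotides over A, C, G and T).

```python
import math
from collections import Counter


def entropy_score(seq: str) -> float:
    """Simplified single-window trinucleotide entropy, scaled from 0 to 100."""
    seq = seq.upper()
    n = len(seq) - 2  # number of overlapping trinucleotides in the read
    if n < 2:
        # Assumption: reads too short to score are rejected by this sketch.
        raise ValueError("sequence too short to score")
    counts = Counter(seq[i:i + 3] for i in range(n))
    # Shannon entropy of the observed trinucleotide frequencies (natural log).
    h = sum((c / n) * math.log(n / c) for c in counts.values())
    # Assumption: divide by the largest entropy reachable for this read,
    # which is log(min(n, 64)) for sequences over A, C, G and T.
    return 100.0 * h / math.log(min(n, 64))
```

With this sketch, a homopolymer gives 0, a 64-base TA repeat gives a little under 17, and a 64-base TAG repeat gives a little under 27, close to the values quoted above.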

Output

Reads that pass the filtering condition are saved to a file called accepted.fastq or accepted.fasta. You can also choose to store the reads that are filtered out in a separate file (rejected.fastq or rejected.fasta), and to produce a log file that contains information about the filtering task and statistics on how many reads were accepted and rejected.
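The split into accepted and rejected reads, together with the counts reported in the log, can be pictured with the following Python sketch. The names split_reads and passes are hypothetical, reads are represented as simple (identifier, sequence) pairs, and the sketch works in memory instead of writing the output files.

```python
from typing import Callable, Dict, Iterable, List, Tuple

Read = Tuple[str, str]  # (read identifier, sequence)


def split_reads(reads: Iterable[Read],
                passes: Callable[[str], bool]
                ) -> Tuple[List[Read], List[Read], Dict[str, int]]:
    """Partition reads by a pass/fail test and count each group."""
    accepted: List[Read] = []
    rejected: List[Read] = []
    for read in reads:
        # passes() receives the sequence and decides which group the read joins.
        (accepted if passes(read[1]) else rejected).append(read)
    stats = {"accepted": len(accepted), "rejected": len(rejected)}
    return accepted, rejected, stats
```

Here passes stands in for whichever complexity test is in use, for example a function that compares a read's score against the chosen threshold.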

Reference

This tool is based on the PRINSEQ package.