This tool filters reads by several criteria, such as length, low complexity, Ns and duplicates. The input data can be in FASTQ or FASTA format. Note that if you have paired-end reads and you need to keep them in matching order, you should give both files to this tool at the same time. The tool then automatically makes sure that only pairs in which both reads pass the filters are written to the paired output.
Only those filtering methods are used for which you assign a value. You can use several filtering methods at the same time. For example, filtering with the parameters Maximum length: 100 and Maximum count of Ns: 1 produces a read set in which every read is shorter than 101 bases and contains fewer than 2 Ns. For a detailed description of the different filtering conditions, please check the manual of the PRINSEQ package. The table below shows which PRINSEQ command line option each parameter corresponds to:
Parameter name | Command line | Description |
Max length | -max_len | Select only reads that are no longer than the given value. |
Min length | -min_len | Select only reads that are no shorter than the given value. |
Max GC content | -max_gc | Select only reads that have a GC content that is less than the given value. |
Min GC content | -min_gc | Select only reads that have a GC content that is more than the given value. |
Min quality score | -min_qual_score | Filter reads that have at least one base quality score below the given value. |
Max quality score | -max_qual_score | Filter reads that have at least one base quality score above the given value. |
Min mean quality | -min_qual_mean | Filter reads with quality score mean below the given value. |
Max mean quality | -max_qual_mean | Filter reads with quality score mean above the given value. |
Max percentage of Ns | -ns_max_p | Filter reads for which the percentage of Ns is higher than the given value. |
Max count of Ns | -ns_max_n | Filter reads for which the count of Ns is higher than the given value. |
Max number of reads | -seq_num | Only keep the given number of reads that pass all other filters. |
Type of duplicates to filter | -derep | Type of duplicates to filter. |
Number of allowed duplicates | -derep_min | This option specifies the number of allowed duplicates. For example, to remove reads that occur more than 2 times, you would specify value 3. |
DUST filter threshold | -lc_method dust, -lc_threshold | Use the DUST algorithm with the given maximum allowed low complexity score, between 0 and 100. |
Entropy filter threshold | -lc_method entropy, -lc_threshold | Use the entropy algorithm with the given minimum allowed entropy score, between 0 and 100. |
Base quality encoding | -phred64 | Select "Sanger" for Illumina 1.8+, Sanger, Roche/454, Ion Torrent and PacBio data. The -phred64 option applies to data with Phred+64 quality encoding. |
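
As an illustration of the correspondence in the table above, the following minimal Python sketch turns a set of parameter values into PRINSEQ options. The dictionary `PARAM_TO_OPTION` and the function `build_options` are hypothetical names used only for this example; they are not part of the tool or of PRINSEQ. The low complexity and base quality encoding parameters are left out because they do not map to a single option with a single value.

```python
# Parameter names and PRINSEQ options copied from the table above.
# PARAM_TO_OPTION and build_options are illustrative names only.
PARAM_TO_OPTION = {
    "Max length": "-max_len",
    "Min length": "-min_len",
    "Max GC content": "-max_gc",
    "Min GC content": "-min_gc",
    "Min quality score": "-min_qual_score",
    "Max quality score": "-max_qual_score",
    "Min mean quality": "-min_qual_mean",
    "Max mean quality": "-max_qual_mean",
    "Max percentage of Ns": "-ns_max_p",
    "Max count of Ns": "-ns_max_n",
    "Max number of reads": "-seq_num",
    "Type of duplicates to filter": "-derep",
    "Number of allowed duplicates": "-derep_min",
}


def build_options(params):
    """Return a list of PRINSEQ options for the parameters that have a value.

    Parameters set to None are skipped, mirroring the rule that only
    those filtering methods are used for which a value is assigned.
    """
    options = []
    for name, value in params.items():
        if value is None:
            continue
        options += [PARAM_TO_OPTION[name], str(value)]
    return options


# The example from the text: Maximum length 100 and Maximum count of Ns 1.
example = build_options(
    {"Max length": 100, "Min length": None, "Max count of Ns": 1}
)
print(" ".join(example))  # -max_len 100 -ns_max_n 1
```

Leaving a parameter empty simply drops the corresponding option, so the resulting option list contains only the filters that were actually requested.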
By default, the tool outputs the reads that pass the filtering conditions. You can also choose to output the reads that are filtered out to separate files. For paired-end reads, you can also choose to output the singletons (reads that passed the filters but whose mate did not) to separate files. In addition, you can produce a log file that contains information about the filtering task and statistics about how many reads were accepted and rejected.
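
The relationship between accepted pairs, singletons and rejected reads can be sketched in Python as follows. This is a simplified illustration, not how PRINSEQ is implemented: the function names `passes` and `filter_pairs` are hypothetical, the reads are made-up toy data, and only the two example criteria from above (maximum length 100, maximum count of Ns 1) are checked.

```python
def passes(seq, max_len=100, max_ns=1):
    """Return True if a read is at most max_len bases long and has at most max_ns Ns."""
    return len(seq) <= max_len and seq.upper().count("N") <= max_ns


def filter_pairs(reads1, reads2):
    """Split paired-end reads into pairs, singletons and rejected reads.

    reads1 and reads2 are lists of (name, sequence) tuples in matching order.
    """
    result = {"pairs": [], "singletons1": [], "singletons2": [], "rejected": []}
    for r1, r2 in zip(reads1, reads2):
        ok1, ok2 = passes(r1[1]), passes(r2[1])
        if ok1 and ok2:
            # Both mates pass: the pair is kept together, in matching order.
            result["pairs"].append((r1, r2))
            continue
        # Otherwise a passing mate becomes a singleton, a failing mate is rejected.
        if ok1:
            result["singletons1"].append(r1)
        else:
            result["rejected"].append(r1)
        if ok2:
            result["singletons2"].append(r2)
        else:
            result["rejected"].append(r2)
    return result


# Toy data for illustration only.
reads1 = [("a", "ACGT"), ("b", "ANNT"), ("c", "ACGN")]
reads2 = [("a", "TTTT"), ("b", "GGGG"), ("c", "NNNN")]
result = filter_pairs(reads1, reads2)
print(len(result["pairs"]), "pair kept,", len(result["rejected"]), "reads rejected")
```

In this toy data set, pair "a" is kept, the second read of "b" and the first read of "c" become singletons, and the remaining two reads are rejected.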