Filter reads for several criteria

Description

This tool filters reads by several criteria, such as length, low complexity, Ns and duplicates. The input data can be in FASTQ or FASTA format. Note that if you have paired-end reads and need to keep them in matching order, you should give both files to this tool at the same time. The tool will then automatically make sure that the paired output contains only pairs in which both reads passed the filtering.

Details

Only those filtering methods are used for which you assign a value. You can use several filtering methods at the same time. For example, filtering with the parameters Maximum length: 100 and Maximum count of Ns: 1 would produce a read set in which every read is shorter than 101 bases and contains fewer than 2 Ns. For a detailed description of the different filtering conditions, please check the manual of the PRINSEQ package. The table below shows which PRINSEQ command line options the parameters correspond to:

Parameter name	Command line	Description
Max length	-max_len	Select only reads that are no longer than the given value.
Min length	-min_len	Select only reads that are at least as long as the given value.
Max GC content	-max_gc	Select only reads whose GC content is no higher than the given value.
Min GC content	-min_gc	Select only reads whose GC content is no lower than the given value.
Min quality score	-min_qual_score	Filter reads that have at least one base quality score below the given value.
Max quality score	-max_qual_score	Filter reads that have at least one base quality score above the given value.
Min mean quality	-min_qual_mean	Filter reads with a mean quality score below the given value.
Max mean quality	-max_qual_mean	Filter reads with a mean quality score above the given value.
Max percentage of Ns	-ns_max_p	Filter reads for which the percentage of Ns is higher than the given value.
Max count of Ns	-ns_max_n	Filter reads for which the count of Ns is higher than the given value.
Max number of reads	-seq_num	Only keep the given number of reads that pass all other filters.
Type of duplicates to filter	-derep	Which type of duplicate reads to filter out.
Number of allowed duplicates	-derep_min	The number of allowed duplicates. To remove reads that occur more than x times, specify x + 1. For example, to remove reads that occur more than 2 times, specify the value 3.
DUST filter threshold	-lc_method dust, -lc_threshold	Use the DUST algorithm with the given maximum allowed low complexity score, between 0 and 100.
Entropy filter threshold	-lc_method entropy, -lc_threshold	Use the entropy algorithm with the given minimum allowed complexity score, between 0 and 100.
Base quality encoding	-phred64	Select "Sanger" (Phred+33) for Illumina 1.8+, Sanger, Roche/454, Ion Torrent and PacBio data. The -phred64 option tells PRINSEQ that the quality values are in Phred+64 encoding.
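The example given in Details (Maximum length: 100 and Maximum count of Ns: 1) can be illustrated with a small Python sketch. This is not PRINSEQ's own code. The function name passes_filters and its arguments are hypothetical, chosen to mirror the -max_len and -ns_max_n options, and a filter without a value is simply skipped.

```python
from typing import Optional


def passes_filters(seq: str,
                   max_len: Optional[int] = None,
                   ns_max_n: Optional[int] = None) -> bool:
    """Return True if the read survives every filter that has a value."""
    # A filter whose value is None is not applied at all.
    if max_len is not None and len(seq) > max_len:
        return False  # read is longer than the maximum length
    if ns_max_n is not None and seq.upper().count("N") > ns_max_n:
        return False  # read contains more Ns than allowed
    return True


reads = [
    "ACGT" * 25,        # 100 bases, no Ns: kept
    "ACGT" * 25 + "A",  # 101 bases: rejected by the length filter
    "ACGTNACGT",        # one N: kept
    "ANNGT",            # two Ns: rejected by the N-count filter
]
kept = [r for r in reads if passes_filters(r, max_len=100, ns_max_n=1)]
print(len(kept), "of", len(reads), "reads kept")
```

Both conditions must hold for a read to be kept, so adding more filters can only shrink the output.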

Output

By default, the tool outputs the reads that pass the filtering conditions. You can also choose to write the reads that are filtered out to separate files and, for paired-end reads, to write the singletons (reads whose mate was filtered out) to separate files. In addition, you can produce a log file that contains information about the filtering task and statistics on how many reads were accepted and rejected.
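How paired-end reads could be divided between these outputs is sketched below in Python. This is only an illustration of the idea, not the tool's implementation. The function split_pairs, the output names and the toy filter are all hypothetical.

```python
from typing import Callable, Dict, Iterable, List, Tuple


def split_pairs(pairs: Iterable[Tuple[str, str]],
                keep: Callable[[str], bool]) -> Dict[str, List]:
    """Sort read pairs into paired, singleton and rejected outputs."""
    out: Dict[str, List] = {
        "paired": [],        # both mates passed; order stays matched
        "singletons_1": [],  # mate 1 passed, mate 2 was rejected
        "singletons_2": [],  # mate 2 passed, mate 1 was rejected
        "rejected": [],      # individual reads that failed a filter
    }
    for read1, read2 in pairs:
        ok1, ok2 = keep(read1), keep(read2)
        if ok1 and ok2:
            out["paired"].append((read1, read2))
            continue
        # At least one mate failed, so each read is handled on its own.
        if ok1:
            out["singletons_1"].append(read1)
        else:
            out["rejected"].append(read1)
        if ok2:
            out["singletons_2"].append(read2)
        else:
            out["rejected"].append(read2)
    return out


# Toy filter for the example: reject any read that contains an N.
pairs = [("ACGT", "TTGA"), ("ACNT", "GGCA"), ("CCGT", "NNNN"), ("NACG", "TNTT")]
result = split_pairs(pairs, keep=lambda read: "N" not in read)
for name, items in result.items():
    print(name, items)
```

Because a pair is written to the paired output only when both mates pass, the two paired files keep the same number of reads in the same order.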

Reference

This tool is based on the PRINSEQ package.
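As a rough guide to how the parameters translate into a PRINSEQ call, the Python sketch below assembles a command line from the options listed in the parameter table. It rests on assumptions: that the program is run as prinseq-lite.pl with the input given via -fastq, and that the tool passes the values through unchanged. The helper build_command and the file name reads.fastq are hypothetical, and only a subset of the parameters is included.

```python
from typing import Dict, List, Optional, Union

Value = Optional[Union[int, float]]

# Option names are copied from the parameter table of this manual page.
FLAGS: Dict[str, List[str]] = {
    "Max length": ["-max_len"],
    "Min length": ["-min_len"],
    "Max count of Ns": ["-ns_max_n"],
    "Max percentage of Ns": ["-ns_max_p"],
    "Min mean quality": ["-min_qual_mean"],
    "Number of allowed duplicates": ["-derep_min"],
    "DUST filter threshold": ["-lc_method", "dust", "-lc_threshold"],
    "Entropy filter threshold": ["-lc_method", "entropy", "-lc_threshold"],
}


def build_command(fastq: str, params: Dict[str, Value]) -> List[str]:
    """Build an argument list; parameters without a value are skipped."""
    # Assumption: executable name and -fastq input option.
    cmd = ["prinseq-lite.pl", "-fastq", fastq]
    for name, value in params.items():
        if value is None:
            continue  # only filters with an assigned value are used
        cmd += FLAGS[name] + [str(value)]
    return cmd


cmd = build_command("reads.fastq",
                    {"Max length": 100, "Min length": None, "Max count of Ns": 1})
print(" ".join(cmd))
```

The resulting list can be inspected before use, or handed to subprocess.run if PRINSEQ is installed.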
