Read quality statistics with PRINSEQ

Description

This tool calculates several quality control statistics for reads using the PRINSEQ package. If your file is larger than 4 GB, we recommend submitting only a sample of the reads for the analysis, because PRINSEQ uses a lot of memory when producing the HTML report and might fail with bigger files. You can create the sample with the tool "Utilities / Make a subset of FASTQ".
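For readers who prepare the sample outside this environment, the idea of drawing a random subset of reads can be sketched in Python as follows. This is only an illustration under stated assumptions: it is not how "Utilities / Make a subset of FASTQ" is implemented, the function names read_fastq and sample_reads are hypothetical, and the input is assumed to use plain four-line FASTQ records.

```python
import random
from typing import Iterable, Iterator, List, Tuple

# One FASTQ record: header, sequence, separator line, quality string.
Record = Tuple[str, str, str, str]


def read_fastq(lines: Iterable[str]) -> Iterator[Record]:
    """Yield records from four-line FASTQ input (assumption: no wrapped lines)."""
    it = iter(lines)
    for header in it:
        header = header.rstrip("\n")
        if not header:
            continue  # skip blank lines between records
        try:
            seq = next(it).rstrip("\n")
            plus = next(it).rstrip("\n")
            qual = next(it).rstrip("\n")
        except StopIteration:
            raise ValueError("truncated FASTQ record") from None
        yield (header, seq, plus, qual)


def sample_reads(records: Iterable[Record], k: int, seed: int = 0) -> List[Record]:
    """Reservoir sampling: keep k records chosen uniformly at random,
    reading the input once and holding at most k records in memory."""
    rng = random.Random(seed)
    sample: List[Record] = []
    for i, rec in enumerate(records):
        if i < k:
            sample.append(rec)
        else:
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                sample[j] = rec
    return sample


# Hypothetical usage: write 100000 sampled reads to a new file.
# with open("reads.fastq") as src, open("subset.fastq", "w") as dst:
#     for rec in sample_reads(read_fastq(src), k=100000):
#         dst.write("\n".join(rec) + "\n")
```

Reservoir sampling is used here because it needs only one pass over the file and never holds more than k reads in memory, which suits large inputs.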

Details

The statistics are calculated using the PRINSEQ option -stats_all. The input data can be in FASTQ or FASTA format.
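As a rough illustration of what such a call looks like outside this tool, the sketch below builds a PRINSEQ command line in Python. It assumes the standalone script prinseq-lite.pl is available on the PATH; the helper names build_prinseq_command and run_stats are hypothetical, and the exact command this tool runs internally may differ.

```python
import subprocess
from typing import List


def build_prinseq_command(path: str, fmt: str = "fastq") -> List[str]:
    """Return the argument list for a -stats_all run on a FASTQ or FASTA file."""
    if fmt not in ("fastq", "fasta"):
        raise ValueError("fmt must be 'fastq' or 'fasta'")
    # prinseq-lite.pl takes the input as -fastq <file> or -fasta <file>.
    return ["prinseq-lite.pl", "-" + fmt, path, "-stats_all"]


def run_stats(path: str, fmt: str = "fastq") -> str:
    """Run PRINSEQ and return the statistics it prints to standard output.
    Assumption: prinseq-lite.pl is installed and on the PATH."""
    cmd = build_prinseq_command(path, fmt)
    result = subprocess.run(cmd, check=True, capture_output=True, text=True)
    return result.stdout


# Hypothetical usage:
# print(run_stats("reads.fastq"))
```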

Output

This tool produces a comprehensive quality report, reads-stats.html, containing many useful plots. To view this file, please select the visualization method "Open in external web browser". In addition, a table, reads-stats.tsv, is produced, containing the following information:

stats_dinuc   aatt               Dinucleotide odds ratio for AA/TT.
stats_dinuc   acgt               Dinucleotide odds ratio for AC/GT.
stats_dinuc   agct               Dinucleotide odds ratio for AG/CT.
stats_dinuc   at                 Dinucleotide odds ratio for AT.
stats_dinuc   catg               Dinucleotide odds ratio for CA/TG.
stats_dinuc   ccgg               Dinucleotide odds ratio for CC/GG.
stats_dinuc   cg                 Dinucleotide odds ratio for CG.
stats_dinuc   gatc               Dinucleotide odds ratio for GA/TC.
stats_dinuc   gc                 Dinucleotide odds ratio for GC.
stats_dinuc   ta                 Dinucleotide odds ratio for TA.
stats_dupl    3                  Number of 3' duplicates.
stats_dupl    3maxd
stats_dupl    5                  Number of 5' duplicates.
stats_dupl    5maxd
stats_dupl    exact              Number of exact duplicates.
stats_dupl    exactmaxd
stats_dupl    exactrevcomp       Number of exact duplicates with reverse complements.
stats_dupl    exactrevcompmaxd
stats_dupl    revcomp            Number of 5'/3' duplicates with reverse complements.
stats_dupl    revcompmaxd
stats_dupl    total              Total number of duplicates.
stats_info    bases              Total number of bases in the input file.
stats_info    reads              Number of reads in the input file.
stats_len     max                Length of the longest read (e.g. 101).
stats_len     mean               Mean length of the reads.
stats_len     median             Median of the read lengths.
stats_len     min                Length of the shortest read.
stats_len     mode               Mode of the read lengths.
stats_len     modeval            Number of mode length sequences.
stats_len     range              Range of the sequence lengths.
stats_len     stddev             Standard deviation of the read lengths.
stats_ns      maxn               Maximum number of Ns in one read.
stats_ns      maxp               Maximum percentage of Ns per read.
stats_ns      seqswithn          Number of reads with ambiguous base N.
stats_tag     midnum             Number of predefined MIDs.
stats_tag     prob3              Probability of a tag sequence at the 3'-end (in percentage).
stats_tag     prob5              Probability of a tag sequence at the 5'-end (in percentage).
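
For downstream use, the table can be read into a nested dictionary keyed by category and statistic name. The sketch below is a minimal Python example, assuming each row of reads-stats.tsv holds three tab-separated fields (category, statistic name, value); the function name parse_stats is hypothetical and the sample rows are made up for illustration.

```python
import csv
import io
from typing import Dict, TextIO


def parse_stats(handle: TextIO) -> Dict[str, Dict[str, str]]:
    """Parse rows of 'category<TAB>name<TAB>value' into a nested dict.
    Values are kept as strings; convert them where a number is needed."""
    stats: Dict[str, Dict[str, str]] = {}
    for row in csv.reader(handle, delimiter="\t"):
        if len(row) < 3:
            continue  # skip blank or incomplete rows
        category, name, value = row[0], row[1], row[2]
        stats.setdefault(category, {})[name] = value
    return stats


# Made-up example rows, not real results.
example = io.StringIO(
    "stats_info\treads\t10000\n"
    "stats_len\tmax\t101\n"
    "stats_dupl\ttotal\t250\n"
)
stats = parse_stats(example)
duplicate_fraction = int(stats["stats_dupl"]["total"]) / int(stats["stats_info"]["reads"])

# Hypothetical usage with the file produced by this tool:
# with open("reads-stats.tsv", newline="") as handle:
#     stats = parse_stats(handle)
```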

Reference

This tool is based on the PRINSEQ package.