Precluster aligned sequences

Description

Given a FASTA-formatted alignment and a count file or names file, this tool pre-clusters the sequences to remove those that are likely to contain sequencing errors.

Parameters

Number of differences allowed [1]

Details

The basic idea is that abundant sequences are more likely to generate erroneous sequences than rare sequences. With that in mind, the algorithm first ranks the sequences in order of their abundance. It then walks through the ranked list, and for each sequence looks for rarer sequences that differ from it by no more than the allowed number of mismatches (one by default). Those that are within this threshold are merged with the more abundant sequence. Pre-clustering removes a large number of sequences, which makes the subsequent distance calculation much faster.

You can allow more mismatches by increasing the Number of differences allowed parameter. A common guideline is to allow one mismatch for every 100 bases of sequence.
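
The merging procedure described above can be illustrated with a minimal Python sketch. It is not the Mothur implementation; the function names (count_mismatches, precluster) and the max_diffs argument are hypothetical and chosen only for this example. The sketch assumes that all sequences are aligned to the same length, that every differing position (including gap characters) counts as one mismatch, and that a sequence which has already been merged does not absorb further sequences itself.

```python
def count_mismatches(a, b):
    """Count the positions at which two aligned sequences differ."""
    if len(a) != len(b):
        raise ValueError("aligned sequences must have equal length")
    return sum(1 for x, y in zip(a, b) if x != y)


def precluster(counts, max_diffs=1):
    """Merge rarer sequences into more abundant ones (illustrative sketch).

    counts maps each aligned sequence to its abundance.
    max_diffs plays the role of "Number of differences allowed".
    Returns a dict mapping each surviving sequence to its merged abundance.
    """
    # Rank from most to least abundant; ties are broken alphabetically
    # so that the result is deterministic (an assumption of this sketch).
    ranked = sorted(counts, key=lambda s: (-counts[s], s))
    merged = {}
    absorbed = set()
    for i, seq in enumerate(ranked):
        if seq in absorbed:
            continue  # already merged into a more abundant sequence
        merged[seq] = counts[seq]
        # Walk through the rarer sequences that follow in the ranking.
        for other in ranked[i + 1:]:
            if other in absorbed:
                continue
            if count_mismatches(seq, other) <= max_diffs:
                merged[seq] += counts[other]
                absorbed.add(other)
    return merged
```

For example, with the abundances {"ACGTACGT": 10, "ACGTACGA": 2, "ACGTTCGA": 1, "TTTTACGT": 3} and max_diffs=1, only ACGTACGA is merged into ACGTACGT, giving it a count of 12. With max_diffs=2, ACGTTCGA is merged as well, and the count becomes 13.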

This tool is based on the pre.cluster command of the Mothur package.

Output

The analysis output consists of the following:

preclustered.fasta: Alignment with preclustered sequences
preclustered-summary.tsv: Summary statistics for the alignment
preclustered.count: Count file with preclustered sequences

References

Schloss PD, Westcott SL, Ryabin T, Hall JR, Hartmann M, Hollister EB, Lesniewski RA, Oakley BB, Parks DH, Robinson CJ, Sahl JW, Stres B, Thallinger GG, Van Horn DJ, Weber CF. Introducing mothur: Open-source, platform-independent, community-supported software for describing and comparing microbial communities. Appl Environ Microbiol, 2009. 75(23):7537-41.