Extract unique sequences

Description

Given a fasta file and a groups file, removes identical sequences from the fasta file.

Parameters

None

Details

Many sequences are identical, and aligning the same sequence many times later would be computationally wasteful. It is therefore better to keep only one representative of each distinct sequence in the fasta file and to record how many sequences it represents in a count_table file. Alternatively, we could list the names of all the sequences each representative stands for, but this names file would be very large because sequence names are long.
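The idea can be illustrated with a minimal Python sketch. This is not Mothur's implementation: the function name dereplicate, the in-memory input format, and the read and sample names are all hypothetical, chosen only to show how one representative is kept per distinct sequence while per-sample counts are tallied.

```python
def dereplicate(records, groups):
    """Keep one representative per distinct sequence and count copies per sample.

    records: iterable of (name, sequence) pairs, standing in for a fasta file
    groups:  dict mapping sequence name -> sample name, standing in for a groups file
    (Both input shapes are assumptions made for this illustration.)
    """
    representatives = {}  # sequence -> name of the first read seen with it
    counts = {}           # representative name -> {sample: number of reads}
    for name, seq in records:
        # The first read with a given sequence becomes its representative.
        rep = representatives.setdefault(seq, name)
        per_sample = counts.setdefault(rep, {})
        sample = groups[name]
        per_sample[sample] = per_sample.get(sample, 0) + 1
    unique = [(name, seq) for seq, name in representatives.items()]
    return unique, counts


# Hypothetical toy data: three reads share the same sequence.
records = [
    ("read1", "ACGT"),
    ("read2", "ACGT"),
    ("read3", "TTGA"),
    ("read4", "ACGT"),
]
groups = {"read1": "sampleA", "read2": "sampleB", "read3": "sampleA", "read4": "sampleA"}

unique, counts = dereplicate(records, groups)
for name, seq in unique:
    print(name, seq, counts[name])
```

Here the four reads collapse to two representatives, and the counts record how many reads from each sample every representative stands for, which is the information a count_table carries.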

This tool is based on the unique.seqs and count.seqs commands of the mothur package.

Output

The analysis output consists of the following:

unique.fasta: Unique sequences
unique.summary.tsv: Summary statistics for the sequences
unique.count_table: Number of sequences that each unique sequence represents in each sample

References

Schloss PD, Westcott SL, Ryabin T, Hall JR, Hartmann M, Hollister EB, Lesniewski RA, Oakley BB, Parks DH, Robinson CJ, Sahl JW, Stres B, Thallinger GG, Van Horn DJ, Weber CF. Introducing mothur: Open-source, platform-independent, community-supported software for describing and comparing microbial communities. Appl Environ Microbiol, 2009. 75(23):7537-41.