Combine paired reads to contigs

Description

Combines paired reads to contigs for each sample and then collects the sequences from all the samples in one fasta file.

Parameters

none

Details

This tool takes a Tar package of fastq files (you can create one using the tool Utilities / Make a Tar package). It extracts the sequence and quality data from the fastq files, creates a reverse complement of the reverse read, and joins the reads into contigs by performing a Needleman alignment. When it identifies positions where the two reads disagree, it proceeds as follows:

If one of the disagreeing bases has a quality score which is 6 or more points higher than the other, that base is reported. If not, that base is set to N.
If one of the reads suggest a gap, the other read has to have a base with a quality score over 25 to be considered real.

Note that Mothur Make.contigs tool is quite "greedy", so it will assemble contigs even if the reads are not high quality. Therefore it is recommended to use this tool only for good quality reads which totally overlap. Dr Patrick Schloss, the developer of Mothur, says that even with the advance in Illumina MiSeq technology to paired 300 nt reads, they are sticking with the paired 250 nt reads to sequence the 16S rRNA V4 region.

The tool figures out by itself which fastq files belong to each sample. You can check this in the output file samples.fastqs.txt. If the file assignment to samples is not correct, you can make a samples-files.txt yourself and give it as input. This file has 3 columns: sample name, forward fastq file name (R1), and reverse fastq file name (R2). So it should look something like this:
sample1 sample1_L001_R1_001.fastq sample1_L001_R2_001.fastq
sample2 sample2_L001_R1_001.fastq sample2_L001_R2_001.fastq
sample3 sample3_L001_R1_001.fastq sample3_L001_R2_001.fastq
etc.

This tool is based on the make.contigs and make.file commands of the Mothur package.

Output

The analysis output consists of the following:

contigs.fasta.gz: Fasta file containing the contig sequences.
contigs.summary.tsv: Summary statistics for the sequences
contigs.groups: A two column list with the first column indicating the sequence names in the input file and the second column the sample that it came from.
contig.numbers.txt: List how many contigs each sample has.
samples.fastqs.txt: Shows how fastq files were assigned to samples.

References

Schloss PD, Westcott SL, Ryabin T, Hall JR, Hartmann M, Hollister EB, Lesniewski RA, Oakley BB, Parks DH, Robinson CJ, Sahl JW, Stres B, Thallinger GG, Van Horn DJ, Weber CF. Introducing mothur: Open-source, platform-independent, community-supported software for describing and comparing microbial communities. Appl Environ Microbiol, 2009. 75(23):7537-41.