BWA for paired-end reads and own genome

Description

Aligns paired-end reads to a reference genome sequence supplied by the user. The alignment is constructed using the BWA aln algorithm. The genome must be supplied in FASTA format and the reads in FASTQ files.

Parameters

Seed length. How many bases of the left, good quality part of the read should be used as the seed region. If the seed length is longer than the reads, the seeding will be disabled. Corresponds to the command line parameter -l. [32].
Maximum number of differences in the seed region. Maximum number of differences, such as mismatches or indels, in the seed region. Corresponds to the command line parameter -k. [2].
Maximum edit distance for the whole read. If the value is an integer of 1 or more, it is the maximum edit distance. If the value is between 0 and 1, it defines the fraction of missing alignments given a 2% uniform base error rate; in that case, the maximum edit distance is chosen automatically for each read length. Corresponds to the command line parameter -n. [0.04].
Quality value format used. Note that this parameter is taken into account only if you chose to apply the mismatch limit to the seed region. Specifies whether the quality values are in the Sanger format (ASCII characters equal to the Phred quality plus 33) or in the Illumina Genome Analyzer Pipeline v1.3 or later format (ASCII characters equal to the Phred quality plus 64). Corresponds to the command line parameter -I. [Sanger].
Maximum number of gaps. Maximum number of gap openings for one read. Corresponds to the command line parameter -o. [1].
Maximum number of gap extensions. Maximum number of gap extensions, -1 for disabling long gaps. Corresponds to the command line parameter -e. [-1].
Gap opening penalty. Corresponds to the command line parameter -O. [11].
Gap extension penalty. Corresponds to the command line parameter -E. [4].
Mismatch penalty threshold. BWA will not search for suboptimal hits with a score lower than the alignment score minus this value. Corresponds to the command line parameter -M. [3].
Disallow gaps in region. Disallow a long deletion within the given number of bp towards the 3’-end. Corresponds to the command line parameter -d. [16].
Disallow an indel within the given number of bp towards the ends. Do not put an indel within the defined value of bp towards the ends. Corresponds to the command line parameter -i. [5].
Quality trimming threshold. Quality threshold for read trimming down to 35bp. Corresponds to the command line parameter -q. [0].
Barcode length. Length of the barcode, starting from the 5’-end. The barcode of each read will be trimmed before mapping. Corresponds to the command line parameter -B. [0].
How many valid alignments are reported per read. Maximum number of alignments to report. Corresponds to the command line parameter bwa samse -n. [3].
Maximum hits to output for paired reads. Maximum number of alignments to output in the XA tag for properly paired reads. If a read has more than the given number of hits, the XA tag will not be written. Corresponds to the command line parameter bwa sampe -n. [3].
Maximum hits to output for discordant pairs. Maximum number of alignments to output in the XA tag for discordant read pairs, excluding singletons. If a read has more than the given number of hits, the XA tag will not be written. Corresponds to the command line parameter bwa sampe -N. [10].
Maximum insert size. Maximum insert size for a read pair to be considered properly mapped. This option is only used when there are not enough good alignments to infer the distribution of insert sizes. Corresponds to the command line parameter bwa sampe -a. [500].
Maximum occurrences for one end. Maximum occurrences of a read for pairing. A read with more occurrences will be treated as a single-end read. Reducing this parameter speeds up pairing. The default value is 100000. For reads shorter than 30bp, a smaller value is recommended to get a sensible speed at the cost of pairing accuracy. Corresponds to the command line parameter bwa sampe -o. [100000].
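
As an illustration of how a fractional value of the maximum edit distance parameter (-n) can translate into a per-read-length limit, the following Python sketch assumes that sequencing errors follow a Poisson distribution with a 2% per-base error rate, and picks the smallest number of differences for which the fraction of reads exceeding it drops below the given value. The function name and the Poisson model are assumptions made for this example; it is not BWA's source code.

```python
import math


def max_diff_for_length(read_length, miss_fraction=0.04, base_error=0.02):
    """Return the smallest k with P(more than k errors) < miss_fraction.

    Assumes (for illustration) that the number of errors in a read is
    Poisson distributed with mean read_length * base_error.
    """
    lam = read_length * base_error   # expected number of errors per read
    term = math.exp(-lam)            # Poisson probability of 0 errors
    cumulative = term                # P(errors <= 0)
    for k in range(1, 1000):
        term *= lam / k              # Poisson P(errors = k) from P(errors = k - 1)
        cumulative += term           # P(errors <= k)
        if 1.0 - cumulative < miss_fraction:
            return k
    return 2                         # fallback if the loop never converges


if __name__ == "__main__":
    for length in (17, 37, 38, 100):
        print(length, "bp ->", max_diff_for_length(length))
```

Under this model, longer reads are allowed more differences: with the default value 0.04, a 37 bp read is allowed 2 differences and a 38 bp read is allowed 3.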

Details

This tool uses the BWA short read aligner to align a set of FASTQ formatted sequences against a FASTA formatted reference sequence. Aligning is performed with the Burrows-Wheeler Transform based BWA aln algorithm, which allows gaps in the alignments. This algorithm is designed for short queries up to ~200bp with a low error rate (<3%).
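
The following Python sketch shows how a paired-end BWA aln run could be assembled on the command line: index the genome, run bwa aln once per read file, and combine the two results with bwa sampe. The helper names, file names, and the small subset of parameters shown are assumptions chosen for this example; the exact commands this tool runs internally are not described here.

```python
import subprocess


def bwa_aln_cmd(genome, fastq, seed_length=32, seed_diffs=2,
                max_diff=0.04, illumina_quals=False):
    """Build the argument list for one 'bwa aln' run (one read file)."""
    argv = ["bwa", "aln",
            "-l", str(seed_length),   # seed length
            "-k", str(seed_diffs),    # max differences in the seed
            "-n", str(max_diff)]      # max edit distance or missing fraction
    if illumina_quals:
        argv.append("-I")             # qualities are in Illumina 1.3+ format
    return argv + [genome, fastq]


def paired_end_plan(genome, r1_fastq, r2_fastq, prefix, max_insert=500):
    """Return (argv, stdout_file) steps; stdout_file is None if unused."""
    sai1, sai2 = prefix + "_1.sai", prefix + "_2.sai"
    sampe = ["bwa", "sampe", "-a", str(max_insert),
             genome, sai1, sai2, r1_fastq, r2_fastq]
    return [
        (["bwa", "index", genome], None),         # build the BWT index
        (bwa_aln_cmd(genome, r1_fastq), sai1),    # align R1 reads
        (bwa_aln_cmd(genome, r2_fastq), sai2),    # align R2 reads
        (sampe, prefix + ".sam"),                 # pair the alignments
    ]


def run_plan(plan):
    """Run each step in order, redirecting stdout where a file is given."""
    for argv, stdout_path in plan:
        if stdout_path is None:
            subprocess.run(argv, check=True)
        else:
            with open(stdout_path, "wb") as out:
                subprocess.run(argv, check=True, stdout=out)


if __name__ == "__main__":
    # Hypothetical file names; only prints the commands, does not run them.
    for argv, stdout_path in paired_end_plan("genome.fa", "s_R1.fq", "s_R2.fq", "s"):
        print(" ".join(argv), ">" if stdout_path else "", stdout_path or "")
```

Keeping the command construction separate from execution makes it possible to inspect the planned commands before anything is run.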

It is possible to give the tool more than one FASTQ file pair. The tool will run the alignment for each file pair separately, and finally merge the resulting BAM files.

If you provide two FASTQ files, the tool will try to assign R1 and R2 reads correctly by file name.

If you have more than two FASTQ files, you will need to provide lists of the FASTQ file names as text files: one file for the R1 files and another for the R2 files (e.g. R1files.txt and R2files.txt). These lists can be generated with the tool Utilities / Make a list of file names. The read pairs must be ordered identically in both lists.
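
The requirement that both lists are ordered identically can be pictured with the following Python sketch, which reads two list files and pairs their entries line by line. It assumes one file name per line and rejects lists of different lengths; the function names are made up for this example, and the tool's own parsing may differ.

```python
from pathlib import Path


def read_name_list(path):
    """Read one file name per line, skipping blank lines."""
    lines = Path(path).read_text().splitlines()
    return [line.strip() for line in lines if line.strip()]


def pair_fastq_lists(r1_list_path, r2_list_path):
    """Pair the n-th R1 file with the n-th R2 file."""
    r1_names = read_name_list(r1_list_path)
    r2_names = read_name_list(r2_list_path)
    if len(r1_names) != len(r2_names):
        raise ValueError(
            f"{len(r1_names)} R1 files but {len(r2_names)} R2 files listed")
    # Pairing is purely positional, so the order of both lists matters.
    return list(zip(r1_names, r2_names))
```

Because the pairing is positional, swapping two lines in only one of the lists would silently pair reads from different samples, which is why the order must match.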

To run, select the genome file, the list files (R1files.txt and R2files.txt) and ALL FASTQ files, and assign the files correctly. Once the genome and list files have been assigned, they are automatically inactivated in the "reads" file list.

Output

As a result the tool returns a sorted and indexed BAM-formatted alignment.
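
For reference, the post-processing from per-pair alignments to a single sorted and indexed BAM file could look like the following Python sketch, which builds samtools sort, merge and index commands. The function name and file names are hypothetical, and the use of samtools is an assumption: this page does not state which commands the tool runs internally.

```python
def bam_postprocess_cmds(sam_files, out_bam):
    """Build samtools commands: sort each input, merge if needed, index."""
    if not sam_files:
        raise ValueError("at least one alignment file is required")
    cmds = []
    if len(sam_files) == 1:
        # A single alignment can be sorted straight into the final file.
        cmds.append(["samtools", "sort", "-o", out_bam, sam_files[0]])
    else:
        sorted_bams = []
        for sam in sam_files:
            sorted_bam = sam.rsplit(".", 1)[0] + ".sorted.bam"
            cmds.append(["samtools", "sort", "-o", sorted_bam, sam])
            sorted_bams.append(sorted_bam)
        # Merge the coordinate-sorted per-pair files into one BAM.
        cmds.append(["samtools", "merge", out_bam] + sorted_bams)
    # The index (.bai) is required by most genome browsers.
    cmds.append(["samtools", "index", out_bam])
    return cmds


if __name__ == "__main__":
    for cmd in bam_postprocess_cmds(["pair1.sam", "pair2.sam"], "alignment.bam"):
        print(" ".join(cmd))
```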