VirusDetect with own genome
Description
This tool runs the VirusDetect pipeline, that performs virus identification using small RNA (sRNA) sequencing data. Given a FASTQ file, it performs de novo assembly and reference-guided assembly by
aligning sRNA reads to the reference database of known viruses. The assembled contigs are compared to the reference virus sequences for virus identification first using BLASTN and then BLASTX.
Virus assignments are selected based on the three cutoff parameters described below.
More detailed definition of Virus detect pipeline can be found at the home page of VirusDetect.
If possible, the reads should be cleaned from sequences originating from the host genome. This can be done by mapping the reads against the host genome and selecting
only those reads that did not match. Also the resulting contigs are aligned to the host genome and the matching ones are removed.
If your host genome is not available in Chipster but you have the host genome sequence as a fasta file, you can use this tool to automatically calculate the required
BWA indexes and perform the host genome filtering for your data. Please note that it may take several hours to calculate BWA indexes for a large genome.
When the VirusDetect analysis is finished, the BWA indexes of the host genome are returned as one tar-formatted archive file. This archive file can be used as an input file, instead of the fasta formatted genome, for the subsequent
VirusDetect and BWA jobs so that you don't have to repeat the time consuming indexing process.
Input Files
This tool requires two input files:
- sRNA reads should be given as FASTQ formatted sequence file.
- Host genome should be given as fasta formatted sequence file or as tar formatted BWA index file from a previous VirusDetect job.
Parameters
- Reference virus database: VirusDetect is mainly used for detecting plant viruses, but you can use it for other viruses too. Use this parameter to select a virus reference database matching you virus type.
- Minimum fraction of a contig covered by virus reference: At least this proportion of the contig has to match to the reference in order to consider the match significant for the virus assignment.
For example the default value 0.75 means that only contigs that match to a virus reference for more than 75% of their length are considered as significant matches.
- Minimum fraction of virus reference covered by contigs: Virus assignment is reported only if more than the given proportion of the virus reference is
covered with significant matches to contigs. For example the default value 0.1 means that at least 10 % of the virus reference sequence must match to the contigs generated by VirusDetect.
- Minimum read depth: The average number of times each nucleotide of the reference sequence must be covered by reads in the sample.
- Use input names in output file names: The names of the VirusDetect result files will start with the name of the input file.
- Return results in one archive file: All the output files are stored in a single tar file. This feature is useful if you run VirusDetect on several samples at the same time.
The result files can be extracted from the tar package using the tool Extract .tar.gz file.
Output
VirusDetect produces a large number of result files. Output related options are used to select, what data is returned. By default VirusDetect returns the following files:
- blastn_matching_references.html Table listing reference viruses that have corresponding virus contigs identified by BLASTN. A pdf formatted report file is returned for each match.
- blastx_matching_references.html Table listing reference viruses that have corresponding virus contigs identified by BLASTX. A pdf formatted report file is returned for each match.
- blastn_matches.tsv Table of BLASTN matches to the reference virus database.
- blastx_matches.tsv Table of BLASTX matches to the reference virus database.
- contigs_with_blastn_matches.fa Sequences of contigs that match to virus references by BLASTN.
- contigs_with_blastx_matches.fa Sequences of contigs that match to virus references by BLASTX.
- undetermined.html Table listing the length, siRNA size distribution and 21-22nt percentage of undetermined contigs. Potential virus contigs (21-22 nt > 50%) are indicated in green.
- undetermined_blast.html Table listing contigs having hits in the virus reference database but not assigned to any reference viruses because they did not meet the coverage or depth criteria.
- undetermined_contigs.fa Sequences of contigs that do not match to virus references.
- virusdetect_contigs.fa Sequences of non-redundant contigs derived through reference-guided and de novo assemblies.
If the parameter Return matching reference sequences is turned on, also the following files are returned
- blastn_matching_references.fa and .fai. Virus reference sequences that produced hits for BLASTN search with the potential virus contigs.
- blastx_matching_references.fa and .fai. Virus reference sequences that produced hits for BLASTX search with the potential virus contigs.
If the parameter Return BAM formatted alignments is turned on, also the following files are returned
- blastn_matches.bam and .bai. BAM file containing the BLASTN alignment of each contig to its corresponding virus reference sequences.
- blastx_matches.bam and .bai. BAM file containing the BLASTX alignment of each contig to its corresponding virus reference sequences.
Note: If you select the blastn_matching_references.fa and blastn_matches.bam, you can use the Chipster Genome Browser to visualize the BLAST results.
In the Genome Browser the blastn_matching_references.fa is used as the genome and each reference virus sequence is listed in the Chromosome pull down menu.