HISAT2 for paired-end reads and own genome

Description

This tool aligns Illumina paired-end RNA-seq reads to a genome provided either as a FASTA format sequence or as a tar package containing a HISAT2 index.

You need to supply the reads in FASTQ files, which can be compressed with gzip. Note that if you have more than two FASTQ files per sample (for example, Illumina NextSeq produces 8 FASTQ files per sample), you also need to provide two list files containing the file names, so that the FASTQ files can be assigned to each direction. Please produce the list files using the tool "Utilities / Make a list of file names".

List files are optional if you provide just two FASTQ files. Chipster will try to assign the files to directions based on file names. This assumes the files are named so that the beginning of the name is identical and the directions are specified with _1 and _2, e.g. Abc123_1, Abc123_2. If your files are named differently, you need to provide list files to make sure the files are assigned correctly.
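The naming rule above can be illustrated with a minimal Python sketch. This is not Chipster's actual implementation: the function name assign_directions and the accepted extensions (.fastq, .fq, .gz) are assumptions made for illustration only.

```python
import re

# Hypothetical pattern: a shared stem, then _1 or _2, then an optional
# FASTQ and/or gzip extension. The extensions are an assumption.
_DIRECTION = re.compile(
    r"^(?P<stem>.+)_(?P<mate>[12])(?P<ext>(\.(fastq|fq))?(\.gz)?)$"
)


def assign_directions(file_names):
    """Return (read1_name, read2_name) if exactly two names are given and
    they differ only in the _1/_2 marker; otherwise return None."""
    if len(file_names) != 2:
        return None
    parsed = {}
    for name in file_names:
        match = _DIRECTION.match(name)
        if match is None:
            return None
        parsed[match.group("mate")] = (match.group("stem"), match.group("ext"), name)
    # Both directions must be present exactly once.
    if set(parsed) != {"1", "2"}:
        return None
    stem1, ext1, name1 = parsed["1"]
    stem2, ext2, name2 = parsed["2"]
    # The beginning of the name (and the extension) must be identical.
    if stem1 != stem2 or ext1 != ext2:
        return None
    return name1, name2
```

When the function returns None in this sketch, the files could not be paired from their names alone, which corresponds to the case where list files are needed.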

Parameters

RNA-strandness (unstranded, FR, RF) [unstranded]
How many hits is a read allowed to have (1-1000000) [5]
Base quality encoding used (phred+33, phred+64) [phred+33]
Minimum intron length [20]
Maximum intron length [500000]
Allow soft clipping (yes, no) [yes]
Are long anchors required (yes, no) [no]
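For readers who know the HISAT2 command line, the following Python sketch shows how the parameters above could correspond to HISAT2 options. The flags themselves (-k, --phred33, --phred64, --min-intronlen, --max-intronlen, --rna-strandness, --no-softclip, --dta) are HISAT2 options, but the mapping is an assumption, not a description of how Chipster actually invokes HISAT2. The function name build_hisat2_args is hypothetical, and the index, read file and output arguments are left out.

```python
def build_hisat2_args(strandness="unstranded", max_hits=5, quality="phred+33",
                      min_intron=20, max_intron=500000,
                      soft_clipping=True, long_anchors=False):
    """Translate the tool parameters (defaults as listed above) into a
    list of HISAT2 command-line arguments. Assumed mapping, for illustration."""
    if strandness not in ("unstranded", "FR", "RF"):
        raise ValueError("strandness must be unstranded, FR or RF")
    if quality not in ("phred+33", "phred+64"):
        raise ValueError("quality must be phred+33 or phred+64")

    args = ["hisat2",
            "-k", str(max_hits),                 # how many hits a read may have
            "--min-intronlen", str(min_intron),
            "--max-intronlen", str(max_intron)]
    args.append("--phred64" if quality == "phred+64" else "--phred33")
    if strandness != "unstranded":
        args += ["--rna-strandness", strandness]
    if not soft_clipping:
        args.append("--no-softclip")
    if long_anchors:
        args.append("--dta")                     # longer anchors for assemblers
    return args
```

With the default values, the sketch adds no strandness, soft-clipping or long-anchor options, so only the hit limit, the intron length limits and the quality encoding appear in the argument list.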

Details

HISAT2 (hierarchical indexing for spliced alignment of transcripts) is a highly efficient system for aligning reads from RNA sequencing experiments. HISAT uses an indexing scheme based on the Burrows-Wheeler transform and the Ferragina-Manzini (FM) index, employing two types of indexes for alignment: a whole-genome FM index to anchor each alignment and numerous local FM indexes for very rapid extensions of these alignments. HISAT's hierarchical index for the human genome contains 48,000 local FM indexes, each representing a genomic region of ~64,000 bp.

HISAT2 searches by default for up to 5 distinct, primary alignments for each read, but you can change this number. Primary alignments are alignments whose alignment score is equal to or higher than that of any other alignment of the same read. It is possible that multiple distinct alignments have the same score. The alignment score for a paired-end alignment equals the sum of the alignment scores of the individual mates. Note that HISAT2 does not "find" alignments in any specific order, so for reads that have more than 5 distinct, valid alignments, HISAT2 does not guarantee that the 5 alignments reported are the best possible in terms of alignment score. By default soft clipping is allowed, meaning that the ends of a read don't need to align if leaving them unaligned increases the alignment score.
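The scoring rule for pairs can be made concrete with a small Python sketch. The candidate positions and score values below are invented for illustration and are not real HISAT2 output. The helper names pair_score and primary_alignments are hypothetical.

```python
def pair_score(mate1_score, mate2_score):
    """Score of a paired-end alignment: the sum of the two mate scores."""
    return mate1_score + mate2_score


def primary_alignments(candidates):
    """Return the labels of the candidates whose pair score is equal to or
    higher than that of every other candidate.

    candidates: list of (label, mate1_score, mate2_score) tuples."""
    if not candidates:
        return []
    best = max(pair_score(s1, s2) for _, s1, s2 in candidates)
    return [label for label, s1, s2 in candidates if pair_score(s1, s2) == best]


# Invented example: two candidates tie with a pair score of -6,
# so both count as primary; the third scores -12 and does not.
example = [("chr1:100", 0, -6), ("chr5:2000", -3, -3), ("chr9:50", -6, -6)]
print(primary_alignments(example))
```

The example shows why several distinct alignments can be primary at the same time: different combinations of mate scores can add up to the same pair score.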

If you are planning to do transcriptome assembly afterwards, you should set the long anchor parameter to yes. With this option, HISAT2 requires longer anchor lengths for de novo discovery of splice sites. This leads to fewer alignments with short anchors, which significantly reduces the computation time and memory usage of transcript assemblers.

If you use a FASTA format genome, the tool will produce a .tar file with the HISAT2 indexes. If you run the tool again with the same genome, you should use the .tar file as the genome input, as this saves the time needed to generate the indexes.
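The reuse logic described above can be sketched in Python as follows. This is an illustration under stated assumptions, not Chipster's implementation: the function names are hypothetical, the input type is guessed from the file extension only, and the index files are assumed to use the .ht2 (or .ht2l) suffix written by hisat2-build.

```python
import tarfile
from pathlib import Path


def plan_index_step(genome_input, index_base="genome"):
    """Return the hisat2-build command needed for a FASTA genome, or None
    when a ready-made .tar index package is supplied (assumed naming)."""
    if genome_input.endswith(".tar"):
        return None  # index already built; skip the time-consuming step
    return ["hisat2-build", genome_input, index_base]


def pack_index(index_dir, tar_path):
    """Pack the HISAT2 index files found in index_dir into a tar package
    and return the packed file names."""
    index_files = sorted(Path(index_dir).glob("*.ht2*"))
    with tarfile.open(tar_path, "w") as tar:
        for index_file in index_files:
            # Store only the file name so the package unpacks flat.
            tar.add(index_file, arcname=index_file.name)
    return [index_file.name for index_file in index_files]
```

In this sketch, plan_index_step("genome.fa") yields a build command while plan_index_step("genome.hisat2.tar") yields None, which reflects the time saved by supplying the .tar file on later runs.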

After running HISAT2, Chipster indexes the BAM file using the SAMtools package. This way the results are ready to be visualized in the genome browser.

Output

This tool returns the following files:

*.bam: BAM file containing the alignments
*.bam.bai: Index for the BAM file
hisat.log: Summary of the alignment results
*.hisat2.tar: HISAT2 index as a tar package (only if FASTA genome provided)

Reference

This tool is based on the HISAT2 package. Please cite the following article: Kim D, Langmead B and Salzberg SL. HISAT: a fast spliced aligner with low memory requirements. Nature Methods 2015.