RNA-seq / Map aligned reads to genes with HTSeq using own GTF

Description

Given mapped reads in a BAM file and gene locations in a GTF file, this tool calculates how many reads map to each gene.

Parameters

Details

You need to provide gene locations in GTF format. Note that the chromosome names have to be same in the GTF and BAM files.

By default the GFF attribute gene_id is used as a feature ID. Several GFF lines (e.g. exons) with the same feature ID will be considered as parts of the same feature, and the feature ID is used to identify the counts in the output table. In other words, a gene is considered as a union of all its exons. There are three different modes to handle reads which overlap with more than one gene. These are illustrated in the HTSeq manual.

If your RNA-seq data was produced with a stranded protocol, it is important that you select the correct strandedness option in the parameter "Is the data stranded and how", as this affects the counting:

For example, if your paired end data was produced with an Illumina TruSeq Stranded kit, you have to select REVERSE here. You also have to make sure before that when you map the reads to the reference genome with TopHat, you set the parameter "Library type" to "fr-firststrand". Correspondingly, if you used "fr-secondstrand" as "Library type" in TopHat, select YES here. For "fr-unstranded" in TopHat, use NO here. You can read more about stranded data here.

Output

Output is a table with counts for each gene. In order to use the output for differential expression analysis in edgeR or DESeq, you need to select all the samples and run the tool "Utilities - Define NGS experiment".

The tool also generates a separate text file (htseq-count-info.txt) listing how many reads

References

This tool is based on the HTSeq package by Simon Anders.