RNA-seq / Count aligned reads per genes with HTSeq

Description

Given mapped reads in a BAM file, this tool counts how many reads map to each gene using Ensembl gene location information.

Parameters

Details

By default the GFF attribute gene_id is used as a feature ID. Several GFF lines (e.g. exons) with the same feature ID will be considered as parts of the same feature, and the feature ID is used to identify the counts in the output table. In other words, a gene is considered as a union of all its exons. There are three different modes to handle reads which overlap with more than one gene. These are illustrated in the HTSeq manual.

If your RNA-seq data was produced with a stranded protocol, it is important that you select the correct strandedness option in the parameter "Is the data stranded and how", as this affects the counting:

For example, if your paired end data was produced with an Illumina TruSeq Stranded kit, you have to select REVERSE here. You also have to make sure before that when you map the reads to the reference genome with TopHat, you set the parameter "Library type" to "fr-firststrand". Correspondingly, if you used "fr-secondstrand" as "Library type" in TopHat, select YES here. For "fr-unstranded" in TopHat, use NO here. You can read more about stranded data here.

You can add genomic location information to the count table. This enables the edgeR and DESeq tools to create a BED file of the differentially expressed genes, which allows you to visualize them easily in the genome browser.

Output

Output is a table with counts for each gene. In order to use the output for differential expression analysis in edgeR or DESeq, you need to select all the samples and run the tool "Utilities - Define NGS experiment".

The tool also generates a separate text file (htseq-count-info.txt) listing how many reads

References

This tool is based on the HTSeq package by Simon Anders.