RNA-seq / Count aligned reads per genes with HTSeq

Description

Given mapped reads in a BAM file, this tool counts how many reads map to each gene using Ensembl gene location information.

Parameters

Reference organism (...) [homo sapiens]
Chromosome names in my BAM file look like (chr1, 1) [1]
Does the alignment file contain paired-end data (yes, no) [no]
Is the data stranded and how (yes, no, reverse) [no]
Mode to handle reads overlapping more than one gene (union, intersection-strict, intersection-nonempty) [union]
Minimum mapping quality (0-100) [10]
Feature type to count (exon, CDS) [exon]
Feature ID to use (gene_id, transcript_id, gene_name, transcript_name, protein_name) [gene_id]
Add chromosomal coordinates to the count table (yes, no) [yes]

Details

By default the GFF attribute gene_id is used as a feature ID. Several GFF lines (e.g. exons) with the same feature ID will be considered as parts of the same feature, and the feature ID is used to identify the counts in the output table. In other words, a gene is considered as a union of all its exons. There are three different modes to handle reads which overlap with more than one gene. These are illustrated in the HTSeq manual.

If your RNA-seq data was produced with a stranded protocol, it is important that you select the correct strandedness option in the parameter "Is the data stranded and how", as this affects the counting:

If you select NO, a read will be counted for a gene regardless of which strand it maps to.
If you select YES and you have single end data, the read has to map to the same strand as the gene. For paired end data, the first read of a pair has to map to the same strand as the gene, and the second read has to map to the opposite strand.
If you select REVERSE and you have paired end data, the second read has to map to the same strand as the gene, and the first read has to map to the opposite strand.

For example, if your paired end data was produced with an Illumina TruSeq Stranded kit, you have to select REVERSE here. You also have to make sure before that when you map the reads to the reference genome with TopHat, you set the parameter "Library type" to "fr-firststrand". Correspondingly, if you used "fr-secondstrand" as "Library type" in TopHat, select YES here. For "fr-unstranded" in TopHat, use NO here. You can read more about stranded data here.

You can add genomic location information to the count table. This enables the edgeR and DESeq tools to create a BED file of the differentially expressed genes, which allows you to visualize them easily in the genome browser.

Output

Output is a table with counts for each gene. In order to use the output for differential expression analysis in edgeR or DESeq, you need to select all the samples and run the tool "Utilities - Define NGS experiment".

The tool also generates a separate text file (htseq-count-info.txt) listing how many reads

could not be assigned to any gene (no_feature)
mapped to a location containing several genes and were hence not counted (ambiguous)
had too low alignment quality (too_low_aQual)
were not aligned (not_aligned)
had more than one alignment (alignment_not_unique)

References

This tool is based on the HTSeq package by Simon Anders.