RNA-seq / Map aligned reads to genes with HTSeq using own GTF
Description
Given mapped reads in a BAM file and gene locations in a GTF file, this tool calculates how many reads map to each gene.
Parameters
- Does the alignment file contain paired-end data (yes, no) [no]
- Is the data stranded and how (yes, no, reverse) [no]
- Mode to handle reads overlapping more than one gene (union, intersection-strict, intersection-nonempty) [union]
- Minimum mapping quality (0-100) [10]
- Feature type to count (exon, CDS) [exon]
- Feature ID to use (gene_id, ID, GeneID, transcript_id, gene_name, transcript_name, protein_name) [gene_id]
Details
You need to provide gene locations in GTF format.
Note that the chromosome names have to be same in the GTF and BAM files.
By default the GFF attribute gene_id is used as a feature ID. Several GFF lines (e.g. exons) with the same feature ID will be considered as parts of the same feature,
and the feature ID is used to identify the counts in the output table. In other words, a gene is considered as a union of all its exons.
There are three different modes to handle reads which overlap with more than one gene.
These are illustrated in the HTSeq manual.
If your RNA-seq data was produced with a stranded protocol, it is important that you select the correct strandedness option in the parameter "Is the data stranded and how", as this affects the counting:
- If you select NO, a read will be counted for a gene regardless of which strand it maps to.
- If you select YES and you have single end data, the read has to map to the same strand as the gene. For paired end data, the first read of a pair has to map to the same strand as the gene, and the second read has to map to the opposite strand.
- If you select REVERSE and you have paired end data, the second read has to map to the same strand as the gene, and the first read has to map to the opposite strand.
For example, if your paired end data was produced with an Illumina TruSeq Stranded kit, you have to select REVERSE here.
You also have to make sure before that when you map the reads to the reference genome with TopHat, you set the parameter "Library type" to "fr-firststrand".
Correspondingly, if you used "fr-secondstrand" as "Library type" in TopHat, select YES here. For "fr-unstranded" in TopHat, use NO here. You can read more about stranded data here.
Output
Output is a table with counts for each gene. In order to use the output for differential expression analysis in edgeR or DESeq, you need to select all the samples and run the tool "Utilities - Define NGS experiment".
The tool also generates a separate text file (htseq-count-info.txt)
listing how many reads
- could not be assigned to any gene (no_feature)
- mapped to a location containing several genes and were hence not counted (ambiguous)
- had too low alignment quality (too_low_aQual)
- were not aligned (not_aligned)
- had more than one alignment (alignment_not_unique)
References
This tool is based on the HTSeq package by Simon Anders.