RNA-seq quality metrics with RSeQC

Description

Given an RNA-seq BAM file, this tool reports several quality metrics such as coverage uniformity, gene and junction saturation, junction annotation and alignment statistics. You can provide your own BED file with gene and exon locations as a second input file, or use one of the existing annotations by selecting the correct reference organism in the parameters. (Please note that using your own BED file as an input overrides the reference genome selection.) This tool is based on the RSeQC package.

If you use your own BED file, the chromosome names in the BAM file and the BED file must match. They match if you use BAM files produced by Chipster and BED files from the Chipster server (that is, the reference files selected using the Organism parameter). Both are based on Ensembl, which names chromosomes with plain numbers, without the chr prefix.
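The name check described above can be sketched in a few lines of Python. This is only an illustration, not part of the tool: the function names are hypothetical, and the BAM chromosome names are passed in as a plain set (in practice they would come from the BAM header).

```python
def chromosome_names(bed_lines):
    """Collect the chromosome names from the first column of BED lines."""
    names = set()
    for line in bed_lines:
        line = line.strip()
        # Skip blank lines and the optional header lines that BED allows.
        if not line or line.startswith(("#", "track", "browser")):
            continue
        names.add(line.split()[0])
    return names


def unmatched_names(bed_names, bam_names):
    """Return the BED chromosome names that are missing from the BAM names."""
    return sorted(set(bed_names) - set(bam_names))


# Hypothetical example: a UCSC-style BED file against an Ensembl-style BAM.
bed = ["chr1\t100\t200\tgeneA", "chr2\t50\t80\tgeneB"]
bam_header_names = {"1", "2", "MT"}  # no chr prefix
print(unmatched_names(chromosome_names(bed), bam_header_names))
```

An empty result means every chromosome in the BED file is also present in the BAM file.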

Parameters

Details

The tool performs five different analyses and one optional one (the following descriptions are from the RSeQC homepage):

geneBody_coverage

Read coverage over gene body. This module is used to check whether read coverage is uniform and whether there is any 5'/3' bias. It scales all transcripts to 100 nt and calculates the number of reads covering each nucleotide position. Finally, it generates a plot illustrating the coverage profile along the gene body.
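One way to read "scales all transcripts to 100 nt" is to sample each transcript's per-base coverage at 100 evenly spaced positions and sum the results over transcripts. The sketch below follows that reading. It is an assumption made for illustration, not RSeQC's actual implementation, and the function names are hypothetical.

```python
def scale_to_bins(per_base_coverage, n_bins=100):
    """Sample a per-base coverage profile at n_bins evenly spaced positions."""
    length = len(per_base_coverage)
    if length < n_bins:
        raise ValueError("transcript is shorter than the number of bins")
    # Bin i maps to the base at fraction i / n_bins along the transcript.
    return [per_base_coverage[(i * length) // n_bins] for i in range(n_bins)]


def aggregate_profile(transcripts, n_bins=100):
    """Sum the scaled profiles of several transcripts, from 5' to 3'."""
    total = [0] * n_bins
    for coverage in transcripts:
        for i, depth in enumerate(scale_to_bins(coverage, n_bins)):
            total[i] += depth
    return total


# Hypothetical data: one uniformly covered transcript, one with 3' bias.
uniform = [3] * 250
biased = list(range(200))
profile = aggregate_profile([uniform, biased])
print(profile[0], profile[-1])
```

A profile that rises or falls steadily from the first bin to the last suggests 3' or 5' bias.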

junction_saturation

It is very important to check whether the current sequencing depth is sufficient for alternative splicing analyses. For a well-annotated organism, the number of expressed genes in a particular tissue is almost fixed, so the number of splice junctions is also fixed. These splice junctions can be predetermined from the reference gene model. All annotated splice junctions should be rediscovered from saturated RNA-seq data; otherwise downstream alternative splicing analysis is problematic because low-abundance splice junctions are missing. This module checks for saturation by resampling 5%, 10%, 15%, ..., 95% of the total alignments from the BAM or SAM file, detecting splice junctions in each subset and comparing them to the reference gene model.
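The resampling idea can be sketched as follows. This is a simplified illustration, not RSeQC's code: each alignment is represented as a list of the junctions it spans, junctions are hypothetical (chromosome, start, end) tuples, and the subsets are nested prefixes of one random shuffle.

```python
import random


def junction_saturation(read_junctions, known_junctions,
                        percents=range(5, 100, 5), seed=0):
    """Count known and novel junctions detected in growing random subsets."""
    reads = list(read_junctions)
    random.Random(seed).shuffle(reads)
    results = []
    for pct in percents:
        subset = reads[: len(reads) * pct // 100]
        detected = {junction for read in subset for junction in read}
        known = len(detected & known_junctions)
        results.append((pct, known, len(detected) - known))
    return results


# Hypothetical data: two annotated junctions and one rare novel junction.
known = {("1", 100, 200), ("1", 300, 400)}
reads = ([[("1", 100, 200)]] * 50
         + [[("1", 300, 400)]] * 45
         + [[("1", 500, 600)]] * 5)
for pct, n_known, n_novel in junction_saturation(reads, known):
    print(pct, n_known, n_novel)
```

If the count of known junctions stops growing well before the largest subset, the sequencing depth is likely saturated for junction detection.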

junction_annotation

For a given alignment file in BAM or SAM format and a reference gene model in BED format, this program compares the detected splice junctions to the reference gene model. Splicing annotation is performed at two levels: the splice event level and the splice junction level.

All detected junctions can be grouped to 3 exclusive categories:

  1. Annotated: The junction is part of the gene model. Both splice sites, the 5' splice site (5'SS) and the 3' splice site (3'SS), can be annotated by the reference gene model.
  2. complete_novel: Completely new junction. Neither of the two splice sites can be annotated by the gene model.
  3. partial_novel: One of the splice sites (5'SS or 3'SS) is new, while the other splice site is annotated (known).
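The three categories above amount to a simple rule on the two splice sites. The sketch below assumes that the annotated 5' and 3' splice sites have already been collected from the gene model into two sets. The function name and the (chromosome, position) representation are hypothetical.

```python
def classify_junction(donor, acceptor, known_donors, known_acceptors):
    """Label a junction by how many of its two splice sites are annotated."""
    donor_known = donor in known_donors            # 5' splice site
    acceptor_known = acceptor in known_acceptors   # 3' splice site
    if donor_known and acceptor_known:
        return "annotated"
    if donor_known or acceptor_known:
        return "partial_novel"
    return "complete_novel"


# Hypothetical splice sites as (chromosome, position) pairs.
known_donors = {("1", 200)}
known_acceptors = {("1", 300)}
print(classify_junction(("1", 200), ("1", 300), known_donors, known_acceptors))
print(classify_junction(("1", 200), ("1", 350), known_donors, known_acceptors))
print(classify_junction(("1", 210), ("1", 350), known_donors, known_acceptors))
```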

RPKM_saturation

The precision of any sample statistic (here RPKM) is affected by sample size (sequencing depth). "Resampling" or "jackknifing" is a method to estimate the precision of a sample statistic by using subsets of the available data. This module resamples a series of subsets from the total RNA reads and then calculates the RPKM value from each subset. This makes it possible to check whether the current sequencing depth is saturated (that is, whether the RPKM values are stable) for estimating gene expression. If the sequencing depth is saturated, the estimated RPKM values will be stationary, or reproducible. By default, this module calculates 20 RPKM values (using 5%, 10%, ..., 95%, 100% of the total reads) for each transcript.

In the output figure, the Y axis is "Percent Relative Error" or "Percent Error", which measures how much the RPKM estimated from a subset of reads (i.e. RPKMobs) deviates from the real expression level (i.e. RPKMreal). In practice, however, one cannot know RPKMreal. As a proxy, the RPKM estimated from the total reads is used to approximate RPKMreal.
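The text does not spell out the formula. Assuming the usual definition of percent relative error, 100 * |RPKMobs - RPKMreal| / RPKMreal, the values on the Y axis can be sketched as below. The function name and the example numbers are hypothetical.

```python
def percent_relative_error(rpkm_obs, rpkm_real):
    """Percent deviation of a subset RPKM from the full-data RPKM."""
    if rpkm_real <= 0:
        raise ValueError("rpkm_real must be positive")
    return 100.0 * abs(rpkm_obs - rpkm_real) / rpkm_real


# Hypothetical RPKM estimates for one transcript at 5%, 50% and 100% of reads.
# The estimate from 100% of the reads stands in for RPKMreal.
estimates = {5: 8.0, 50: 9.5, 100: 10.0}
rpkm_real = estimates[100]
for pct, rpkm_obs in sorted(estimates.items()):
    print(pct, percent_relative_error(rpkm_obs, rpkm_real))
```

By construction the error is zero at 100% of the reads, so the plot shows how quickly the error shrinks as more reads are used.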

All transcripts are sorted in ascending order according to expression level (RPKM). Then they are divided into 4 groups:
Q1 (0-25%): Transcripts with expression level ranked below the 25th percentile.
Q2 (25-50%): Transcripts with expression level ranked between the 25th and 50th percentiles.
Q3 (50-75%): Transcripts with expression level ranked between the 50th and 75th percentiles.
Q4 (75-100%): Transcripts with expression level ranked above the 75th percentile.
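A rank-based split into the four groups could look like the sketch below. It is a minimal illustration with hypothetical names. How RSeQC treats ties and transcripts that fall exactly on a boundary is not stated here, so the rule used is an assumption.

```python
def quartile_groups(rpkm_by_transcript):
    """Split transcripts into Q1..Q4 by expression rank (ascending RPKM)."""
    ranked = sorted(rpkm_by_transcript, key=rpkm_by_transcript.get)
    n = len(ranked)
    groups = {"Q1": [], "Q2": [], "Q3": [], "Q4": []}
    for rank, name in enumerate(ranked):
        # rank * 4 // n is 0 for the lowest quarter and 3 for the highest.
        groups["Q" + str(rank * 4 // n + 1)].append(name)
    return groups


# Hypothetical expression values for eight transcripts.
rpkm = {"t%d" % i: float(i) for i in range(8)}
print(quartile_groups(rpkm))
```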

BAM_stat

This program calculates read mapping statistics from the provided BAM file. The script determines "uniquely mapped reads" from the mapping quality, which quantifies the probability that a read is misplaced. (Do NOT confuse mapping quality with sequence quality, which measures the probability that a base call is wrong.)
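In the SAM format, mapping quality (MAPQ) is defined as -10 * log10 of the probability that the mapping position is wrong, so the probability can be recovered as shown below. The cutoff of 30 used to call a read uniquely mapped is an assumption made for this illustration, and the function names are hypothetical.

```python
def mapq_to_error_probability(mapq):
    """Probability that the read is misplaced, given its MAPQ value."""
    return 10 ** (-mapq / 10)


def is_uniquely_mapped(mapq, cutoff=30):
    """Treat a read as uniquely mapped when MAPQ reaches the cutoff."""
    # The default cutoff of 30 is an assumption, not a value from this page.
    return mapq >= cutoff


for mapq in (3, 20, 30):
    print(mapq, mapq_to_error_probability(mapq), is_uniquely_mapped(mapq))
```

With this definition, a higher MAPQ means a lower chance that the read is placed at the wrong location.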

Inner_distance (optional)

This module is only applicable to paired-end data and it calculates the inner distance between two paired RNA reads. The distance is the mRNA length between two paired reads.
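Because the distance is measured on the mRNA, introns between the two reads do not count. The sketch below shows one way to compute such a distance from exon coordinates. It is a simplified illustration under stated assumptions (both reads on the same transcript, 0-based half-open coordinates, hypothetical function name), not RSeQC's implementation.

```python
def mrna_inner_distance(read1_end, read2_start, exons):
    """Exonic bases between the end of read 1 and the start of read 2.

    Coordinates are 0-based and half-open; exons is a list of (start, end).
    """
    if read2_start < read1_end:
        # Overlapping mates: report the overlap as a negative distance.
        return read2_start - read1_end
    distance = 0
    for exon_start, exon_end in exons:
        # Count only the part of each exon that lies between the two reads.
        overlap_start = max(exon_start, read1_end)
        overlap_end = min(exon_end, read2_start)
        if overlap_end > overlap_start:
            distance += overlap_end - overlap_start
    return distance


# Hypothetical transcript with two exons separated by a 100 bp intron.
exons = [(100, 200), (300, 400)]
print(mrna_inner_distance(180, 320, exons))  # gap spans the intron
print(mrna_inner_distance(190, 170, exons))  # the two reads overlap
```

In the first call the genomic gap is 140 bp, but only the exonic bases between the reads contribute to the mRNA distance.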

Output

References

This tool uses the RSeQC package. Please cite the article:

Wang L, Wang S, Li W. RSeQC: quality control of RNA-seq experiments. Bioinformatics (2012) 28(16): 2184-2185. doi: 10.1093/bioinformatics/bts356

Please see the RSeQC homepage for more details.