Differential expression analysis using DESeq2

Description

Differential expression analysis using the DESeq2 Bioconductor package. This tool allows you to have more than two experimental groups and account for a second experimental factor.

Parameters

Column describing groups [group]
Column describing additional experimental factor [EMPTY]
Cutoff for the adjusted P-value (0-1) [0.05]

Details

This tool takes a table of raw counts as input. The count table has to be associated with a phenodata file describing the experimental groups. These files are best created with the tool "Utilities / Define NGS experiment", which combines the count files for different samples into one table and creates a phenodata file for it.

DESeq2 performs an internal normalization in which a geometric mean is calculated for each gene across all samples. The count for a gene in each sample is then divided by this mean. The median of these ratios in a sample is the size factor for that sample. This procedure corrects for library size and RNA composition bias, which can arise, for example, when only a small number of genes are very highly expressed in one experimental condition but not in the other.
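The normalization described above can be illustrated with a minimal sketch in plain Python. It is not DESeq2's own code: the function name size_factors and the toy counts are hypothetical, and the sketch assumes that genes with a zero count in any sample are skipped because their geometric mean is zero.

```python
import math
from statistics import median


def size_factors(counts):
    """Median-of-ratios size factors.

    counts: dict mapping gene name -> list of raw counts, one per sample.
    Returns one size factor per sample.
    """
    n_samples = len(next(iter(counts.values())))
    log_ratios = [[] for _ in range(n_samples)]
    for gene_counts in counts.values():
        # Assumption: genes with a zero count are skipped, because their
        # geometric mean across samples is zero.
        if min(gene_counts) <= 0:
            continue
        # Log of the geometric mean of this gene across all samples.
        log_geo_mean = sum(math.log(c) for c in gene_counts) / n_samples
        for j, c in enumerate(gene_counts):
            log_ratios[j].append(math.log(c) - log_geo_mean)
    # The size factor of a sample is the median of its ratios.
    return [math.exp(median(r)) for r in log_ratios]


# Hypothetical toy data: sample 2 is sequenced at twice the depth of
# sample 1, and geneD is very highly expressed in sample 2 only.
toy_counts = {
    "geneA": [10, 20],
    "geneB": [100, 200],
    "geneC": [50, 100],
    "geneD": [5, 1000],
}

factors = size_factors(toy_counts)
normalized = {
    gene: [c / f for c, f in zip(cs, factors)]
    for gene, cs in toy_counts.items()
}
```

Because the median is used, the single highly expressed geneD does not distort the size factors: in this toy data the second sample's factor is twice the first, and the normalized counts of geneA, geneB and geneC become equal in both samples.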

As small numbers of replicates make it impossible to estimate within-group variance reliably, DESeq2 uses shrinkage estimation for dispersions and fold changes. A dispersion value is estimated for each gene through a model fit procedure. You need to have biological replicates of each experimental condition in order to estimate dispersion properly. If there are no replicates, DESeq2 will estimate dispersion using the samples from the different conditions as replicates.

DESeq2 fits a negative binomial generalized linear model for each gene and uses the Wald test for significance testing. In addition to the group information, you can include an additional experimental factor, such as pairing, in the analysis.
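As a rough illustration of the Wald test, the sketch below derives a test statistic and a two-sided p-value from a log2 fold change estimate and its standard error, which correspond to the log2FoldChange, lfcSE, stat and pvalue columns described under Output. The function name wald_test and the input values are hypothetical, and the sketch assumes the statistic is compared to a standard normal distribution.

```python
import math


def wald_test(log2_fold_change, lfc_se):
    """Return the Wald statistic and its two-sided p-value."""
    # The statistic is the estimate divided by its standard error.
    stat = log2_fold_change / lfc_se
    # Two-sided tail probability of a standard normal distribution.
    pvalue = math.erfc(abs(stat) / math.sqrt(2))
    return stat, pvalue


# Hypothetical estimate: log2 fold change of 2 with standard error 0.5.
stat, pvalue = wald_test(2.0, 0.5)

# A log2 fold change of 2 corresponds to a 4-fold change in expression.
fold_change = 2 ** 2.0
```

A larger standard error for the same fold change gives a smaller statistic and therefore a larger p-value, which is why genes with noisy estimates are less likely to be reported as significant.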

DESeq2 automatically detects count outliers using Cook's distance and removes these genes from the analysis. It also automatically removes genes whose mean of normalized counts is below a threshold determined by an optimization procedure. Removing these genes with low counts improves the detection power by making the multiple testing adjustment of the p-values less severe.
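The effect of filtering on the multiple testing adjustment can be sketched with a small Benjamini-Hochberg implementation. This is an illustration only, not the procedure DESeq2 uses to choose its filtering threshold; the function name benjamini_hochberg and all p-values are hypothetical.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(pvalues)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value to the smallest so that the adjusted
    # values never decrease as the raw p-values increase.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical p-values for three well-expressed genes.
pvals = [0.001, 0.01, 0.04]
# Hypothetical p-values for low-count genes with little power to detect
# differential expression.
low_count_pvals = [0.6, 0.7, 0.8, 0.9, 0.95, 0.99, 1.0]

padj_all = benjamini_hochberg(pvals + low_count_pvals)[:3]
padj_filtered = benjamini_hochberg(pvals)
```

In this toy data the adjusted p-values of the three well-expressed genes are smaller when the low-count genes are left out of the adjustment, so more of them can pass a given adjusted p-value cutoff.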

Output

The analysis output consists of the following files. Note that if you have more than two experimental groups, the output figures sum up information from all pairwise comparisons.

de-list-deseq2.tsv: Table containing the significantly differentially expressed genes. The columns include
- baseMean = the average of the normalized counts taken over all samples
- log2FoldChange = log2 fold change between the groups. For example, a value of 2 means that the expression has increased 4-fold
- lfcSE = standard error of the log2FoldChange estimate
- stat = Wald statistic
- pvalue = Wald test p-value
- padj = Benjamini-Hochberg adjusted p-value
de-list-deseq2.bed: The BED version of the results table contains genomic coordinates and log2 fold change values.
summary.txt: Textual summary of the differential expression results, including information on filtering and outliers.
deseq2_report.pdf: A PDF file containing:
- MA scatter plot where the significantly differentially expressed genes are highlighted.
- Plot of dispersion estimates at different count levels, showing
  - black dot = dispersion estimate for each gene as obtained by considering the information from each gene separately
  - red line = fitted estimates showing the dispersions' dependence on the mean
  - blue dot = the final dispersion estimates shrunk from the gene-wise estimates towards the fitted estimates. These values are used in the statistical testing.
  - blue circles = genes which have high gene-wise dispersion estimates and are hence labelled dispersion outliers and not shrunk toward the fitted trend line
- Plot of the raw and adjusted p-value distributions of the statistical test.

References

This tool uses the DESeq2 package. Please read the following article for more detailed information:

M Love, W Huber and S Anders: Moderated estimation of fold change and dispersion for RNA-Seq data with DESeq2. Genome Biol. 2014 15:550