Differential expression using edgeR
Description
Differential expression analysis using the exact test of the edgeR Bioconductor package.
Please note that this tool only does a pairwise comparison of two groups
(the "classic" approach in the edgeR
user guide, see chapter 3.2.2).
This tool takes as input the counts table and a phenodata file, where the sample groups are described.
You can generate these files using the Define
NGS experiment tool.
We recommend marking the control group with 0 and the other group with 1 in the phenodata column
(by default, this is the group column): the group with the smaller
number (or, if using letters, the alphabetically first letter) is used as the baseline.
This means that in the results table, a positive fold change marks a gene that is upregulated and a negative
fold change a gene that is downregulated relative to the control group.
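The baseline rule and the sign of the fold change can be illustrated with a minimal Python sketch. The function names `pick_baseline` and `log2_fold_change` are hypothetical and not part of the tool or of edgeR, and the fold change here is a plain ratio of group means rather than edgeR's own estimate.

```python
import math

def pick_baseline(groups):
    """Return (baseline, other) labels for a two-group comparison.

    Assumption: labels are ordered numerically when all of them parse as
    numbers, and alphabetically otherwise.
    """
    labels = {str(g) for g in groups}
    if len(labels) != 2:
        raise ValueError("expected exactly two distinct group labels")
    try:
        ordered = sorted(labels, key=float)
    except ValueError:
        ordered = sorted(labels)
    return ordered[0], ordered[1]

def log2_fold_change(baseline_mean, other_mean):
    """log2 ratio of the other group over the baseline group."""
    return math.log2(other_mean / baseline_mean)

baseline, other = pick_baseline(["0", "1", "0", "1"])
print(baseline, other)                 # control label first, then the other group
print(log2_fold_change(10.0, 40.0))    # positive: higher in the other group
print(log2_fold_change(40.0, 10.0))    # negative: lower in the other group
```

With the control group marked 0, a gene that goes from a mean of 10 to a mean of 40 gets a log2 fold change of 2, which corresponds to a 4-fold increase.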
For more complex comparisons, or multifactor experiments, you can use the
Differential expression using edgeR for
multivariate experiments tool,
which uses statistical methods based on generalized linear models ("glm edgeR", see the edgeR user guide for more
information).
Parameters
- Column describing groups [group]
- Filter out genes which don't have counts in at least this many samples (1-10000) [1]
- P-value cutoff (0-1) [0.05]
- Multiple testing correction (none, Bonferroni, Holm, Hochberg, BH, BY) [BH]
- Dispersion method (common, tagwise) [tagwise]
- Dispersion value used if no replicates are available (0-1) [0.16]
- Apply TMM normalization (yes, no) [yes]
- Plot width (200-3200) [600]
- Plot height (200-3200) [600]
Details
This tool takes as input a table of raw counts from the different samples. The count file has to be associated
with a phenodata file describing the experimental groups.
These files are best created with the tool "Utilities / Define NGS experiment", which combines the count files of
different samples into one table and creates a phenodata file for it.
You should set the filtering parameter to the number of samples in your smallest experimental group.
Filtering removes genes that are not expressed, or are expressed at very low levels (less than 5 counts), from
statistical testing.
These genes have little chance of showing significant evidence for differential expression, and removing them
reduces the severity of multiple testing correction of p-values.
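The filtering idea can be sketched in a few lines of Python. This is only an illustration: the function `filter_genes`, the gene names and the counts are hypothetical, and applying the 5-count threshold per sample is an assumption based on the description above, not a copy of the tool's code.

```python
MIN_COUNT = 5  # assumption: per-sample threshold taken from "less than 5 counts"

def filter_genes(counts, min_samples, min_count=MIN_COUNT):
    """Keep genes with at least min_count reads in at least min_samples samples.

    counts maps a gene name to its list of raw counts, one per sample.
    """
    return {
        gene: values
        for gene, values in counts.items()
        if sum(1 for v in values if v >= min_count) >= min_samples
    }

# Hypothetical count table: two groups with two samples each.
counts = {
    "geneA": [0, 0, 1, 0],     # essentially unexpressed
    "geneB": [12, 30, 8, 25],  # expressed in every sample
    "geneC": [0, 7, 0, 9],     # expressed in two samples only
}

kept = filter_genes(counts, min_samples=2)
print(sorted(kept))  # geneB and geneC pass, geneA is dropped
```

Setting `min_samples` to the size of the smallest group keeps genes that are expressed in only one of the two groups, which are often the most interesting ones.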
Trimmed mean of M-values (TMM) normalization is used to calculate normalization factors in order to correct for
different library sizes and to reduce RNA composition effect,
which arises when a small number of genes are very highly expressed in one experiment condition but not in the
other.
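The RNA composition effect can be demonstrated with a toy Python example. The counts are invented for illustration, and the median of ratios used here is a deliberately simplified stand-in for TMM: it omits the trimming, the weighting and the rescaling of factors that the real method performs.

```python
from statistics import median

def cpm(counts, lib_size):
    """Counts per million for one sample, given its (effective) library size."""
    return [c * 1e6 / lib_size for c in counts]

# Hypothetical samples: the first four genes have identical counts,
# the fifth gene is very highly expressed in sample B only.
sample_a = [100, 100, 100, 100, 100]
sample_b = [100, 100, 100, 100, 600]

cpm_a = cpm(sample_a, sum(sample_a))
cpm_b = cpm(sample_b, sum(sample_b))

# Scaling by total library size alone makes the unchanged genes look
# two-fold lower in B, because the fifth gene inflates B's total.
naive_ratio = cpm_b[0] / cpm_a[0]

# Simplified normalization factor (NOT real TMM): median of per-gene ratios.
# All counts are non-zero here, so the division is safe.
factor = median(b / a for a, b in zip(cpm_a, cpm_b))
effective_lib_b = sum(sample_b) * factor
adjusted_ratio = cpm(sample_b, effective_lib_b)[0] / cpm_a[0]

print(naive_ratio, factor, adjusted_ratio)
```

After multiplying the library size of sample B by the normalization factor, the four unchanged genes have the same normalized abundance in both samples, and only the fifth gene stands out.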
Dispersion is the squared biological coefficient of variation (BCV). For example, if the expression of a gene
typically differs from replicate to replicate by 20%, its BCV is 0.2 and its dispersion is 0.04.
edgeR estimates dispersion from replicates using the quantile-adjusted conditional maximum likelihood (qCML) method.
Common dispersion calculates a common dispersion value for all genes, while the tagwise method calculates
gene-specific dispersions.
The tagwise method uses an empirical Bayes strategy to squeeze the original gene-wise dispersions towards the global,
abundance-dependent trend.
You should always have at least three biological replicates for each experiment condition.
If you don't have replicates, you can still run the analysis by setting the dispersion value manually with the
'Dispersion value' parameter.
The default value for this parameter is 0.16. This corresponds to a BCV of 0.4, which is typical for human data
(where gene expression can differ from replicate to replicate by 40%).
A typical BCV for genetically identical model organisms is 0.1, so you should set the dispersion value to 0.01 for
this kind of data.
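The relationship between BCV and dispersion used above is simple enough to check by hand; the following Python sketch only restates it. The helper names are hypothetical and not part of edgeR.

```python
import math

def dispersion_from_bcv(bcv):
    """Dispersion is the square of the biological coefficient of variation."""
    return bcv ** 2

def bcv_from_dispersion(dispersion):
    """Inverse conversion: BCV is the square root of the dispersion."""
    return math.sqrt(dispersion)

# BCV values mentioned in the text: human data, model organisms, the 20% example.
for bcv in (0.4, 0.1, 0.2):
    print(f"BCV {bcv:.1f} -> dispersion {dispersion_from_bcv(bcv):.2f}")
```

If you know roughly how much expression varies between replicates in your system, square that proportion to get a value for the 'Dispersion value' parameter.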
Once negative binomial models are fitted and dispersion estimates are obtained,
edgeR proceeds with testing for differential expression using the exact test, which is based on the qCML methods.
Output
The analysis output consists of the following files:
- de-list-edger.tsv: Result table from statistical testing, including fold change estimates and p-values.
  - logFC = log2 fold change between the groups. E.g. value 2 means that the expression has increased 4-fold.
  - logCPM = the average log2 counts per million
  - PValue = the two-sided p-value
  - FDR = adjusted p-value
- de-list-edger.bed: If your data contained genomic coordinates, the result table is also given as a BED file
for genome browser use. The score column contains log2 fold change values.
- edgeR_report.pdf: A PDF file containing:
  - ma-plot-edger.pdf: MA plot where significantly differentially expressed features are highlighted.
  - dispersion-edger.pdf: Biological coefficient of variation plot.
  - mds-plot-edger.pdf: Multidimensional scaling plot to visualize sample similarities.
  - p-value-plot-edger.pdf: Raw and adjusted p-value distribution plot.
- edger-log.txt: Log file produced if no significantly differentially expressed genes were found.
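To make the FDR column more concrete, the following Python sketch shows how adjusted p-values are obtained with the standard Benjamini-Hochberg (BH) procedure, the default multiple testing correction. The function name `bh_adjust` and the example p-values are hypothetical; this illustrates the general procedure and is not the tool's own code.

```python
def bh_adjust(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    n = len(pvalues)
    # Walk from the largest p-value to the smallest.
    order = sorted(range(n), key=lambda i: pvalues[i], reverse=True)
    adjusted = [0.0] * n
    running_min = 1.0  # also caps adjusted values at 1
    for position, idx in enumerate(order):
        rank = n - position  # 1-based rank of this p-value in ascending order
        running_min = min(running_min, pvalues[idx] * n / rank)
        adjusted[idx] = running_min
    return adjusted

# Hypothetical raw p-values for four genes.
raw = [0.01, 0.04, 0.03, 0.005]
fdr = bh_adjust(raw)
significant = [i for i, q in enumerate(fdr) if q <= 0.05]
print(fdr, significant)
```

A gene is reported as differentially expressed when its adjusted p-value is at or below the chosen p-value cutoff (0.05 by default).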
References
This tool uses the edgeR package for statistical analysis.
Please read the edgeR
user guide
and the following article for more detailed information:
MD Robinson, DJ McCarthy, and GK Smyth. edgeR: a Bioconductor package for differential expression analysis of
digital gene expression data. Bioinformatics, 26(1):139-40, Jan 2010.