PCA and heatmap of samples with DESeq2


Given a table of read counts for an experiment, this tool performs principal component analysis (PCA) and hierarchical clustering of the samples using the DESeq2 Bioconductor package.



This tool takes a table of raw counts as input. The count table has to be associated with a phenodata file describing the experimental groups. These files are best created with the tool "Utilities / Define NGS experiment", which combines the count files of different samples into one table and creates a phenodata file for it.

The PCA plot and the sample heatmap give an overview of similarities and dissimilarities between samples. Visualizing the overall effect of experimental covariates and batch effects this way helps you to perform experiment-level quality control.

Read count data exhibit a strong dependence of the variance on the mean, especially when the counts are low. This trend needs to be removed by transforming the data prior to PCA and clustering. This tool uses the variance stabilizing transformation from the Bioconductor package DESeq2. The transformation is performed in a "blind" fashion, meaning that the experimental grouping of the samples is not taken into account.

DESeq2 also normalizes the data for library size and RNA composition effects. Composition effects can arise when a small number of genes are very highly expressed in one experimental condition but not in the other. To estimate a size factor for each sample, DESeq2 calculates the geometric mean of each gene across all samples and then divides the count of the gene in each sample by this mean. The median of these ratios over all genes in a sample is the size factor for that sample.
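The median-of-ratios calculation described above can be sketched in a few lines of Python. This is only an illustration of the idea, not the DESeq2 code itself: the function name size_factors and the toy count table are made up for this example, and the sketch assumes that genes with a zero count in any sample are left out, because their geometric mean is zero.

```python
import math
from statistics import median


def size_factors(counts):
    """Estimate one size factor per sample by the median-of-ratios method.

    counts: list of rows, one row per gene, one raw count per sample.
    (Hypothetical helper for illustration; not part of DESeq2.)
    """
    n_samples = len(counts[0])

    # Keep only genes with a non-zero count in every sample, because the
    # geometric mean of the other genes is zero (assumption of this sketch).
    usable = [row for row in counts if all(c > 0 for c in row)]
    if not usable:
        raise ValueError("every gene has a zero count in at least one sample")

    # Log of the geometric mean of each gene across all samples.
    log_geo_means = [sum(math.log(c) for c in row) / n_samples for row in usable]

    factors = []
    for j in range(n_samples):
        # Ratio of each gene's count to its geometric mean, on the log scale.
        log_ratios = [math.log(row[j]) - lg for row, lg in zip(usable, log_geo_means)]
        # The median ratio over genes is the size factor of sample j.
        factors.append(math.exp(median(log_ratios)))
    return factors


# Toy table: the second sample is sequenced twice as deeply as the first.
toy_counts = [
    [10, 20],
    [100, 200],
    [5, 10],
    [0, 7],  # skipped: zero count in the first sample
]
factors = size_factors(toy_counts)
```

Dividing the counts of each sample by its size factor puts the samples on a common scale. In the toy table the second size factor is twice the first, which reflects the doubled sequencing depth.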


The analysis output consists of a PCA plot and a sample heatmap with dendrograms. The PCA plot shows the first two principal components and the amount of variance explained by each component. The samples are colored according to their experimental group. You can also choose to use different shapes to visualize a second sample grouping that you have defined in the phenodata file. Ideally the groups separate along the first component. The heatmap shows the Euclidean distances between the samples. The more similar two samples are, the darker the color, and the samples are ordered according to the dendrogram. Sample names are taken from the Description column of the phenodata.
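To make the heatmap values concrete, the following Python sketch computes the Euclidean distance between every pair of samples. It is a simplified illustration, not the tool's own code: the function name sample_distances and the small matrix are invented for this example, and the input is assumed to be an already transformed expression matrix with genes in rows and samples in columns.

```python
import math


def sample_distances(matrix):
    """Return the samples x samples matrix of Euclidean distances.

    matrix: list of rows, one row per gene, one transformed value per sample.
    (Hypothetical helper for illustration.)
    """
    n_samples = len(matrix[0])
    # Collect the values of each sample (one column of the matrix).
    columns = [[row[j] for row in matrix] for j in range(n_samples)]
    # Distance between every pair of sample columns.
    return [
        [math.dist(columns[a], columns[b]) for b in range(n_samples)]
        for a in range(n_samples)
    ]


# Toy matrix: samples 1 and 3 are identical, sample 2 differs from both.
toy_values = [
    [0.0, 3.0, 0.0],
    [0.0, 4.0, 0.0],
]
distances = sample_distances(toy_values)
```

A distance of zero means that two samples have identical profiles, so in the toy matrix samples 1 and 3 would end up next to each other in the dendrogram and get the darkest color in the heatmap.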


This tool uses the DESeq2 package. Please read the following article for more detailed information:

M Love, W Huber and S Anders: Moderated estimation of fold change and dispersion for RNA-Seq data with DESeq2. Genome Biol. 2014 15:550