Clustering / Hierarchical

Description

Hierarchical clustering creates a dendrogram describing the relationships between genes or chips in a selected genelist. Clustering consists of two separate steps. First, all pairwise distances between the objects (genes or chips) are calculated. The dendrogram is then built from these distances using the selected tree method. The maximum number of genes or samples that can be clustered is 20000.

Parameters

What to cluster (genes, chips) [genes]
Distance method (euclidian, manhattan, pearson, spearman) [euclidian]
Tree method (single, average, complete, ward) [average]
Resampling method (none, permutation-topodist, bootstrap) [none]
Number of replications (1-10000) [1000]
Image width (200-3200) [600]
Image height (200-3200) [600]

Details

The user can choose from four distance methods:

Euclidean distance

This is the absolute, straight-line distance between two gene expression profiles or chips. Euclidean distances between objects, such as dots on paper, could be measured with a ruler.

Manhattan distance

The distance between gene expression profiles or chips measured along axes at right angles, that is, the sum of the absolute differences over all coordinates.

Pearson correlation

The distance is calculated as 1 minus the Pearson correlation coefficient.

Spearman correlation

The distance is calculated as 1 minus the Spearman correlation coefficient, which is computed on the ranks of the values.
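
The four distance methods can be illustrated with a minimal Python sketch. The tool itself relies on R packages, so this is not its implementation; the function names below are hypothetical and chosen only for illustration, and each profile is assumed to be a plain list of numbers of equal length.

```python
import math

def euclidean(x, y):
    """Straight-line distance between two profiles."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def manhattan(x, y):
    """Sum of absolute differences, i.e. distance along right-angled axes."""
    return sum(abs(a - b) for a, b in zip(x, y))

def pearson_corr(x, y):
    """Pearson correlation coefficient of two profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def ranks(values):
    """1-based ranks; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result

def pearson_distance(x, y):
    """1 minus the Pearson correlation coefficient."""
    return 1 - pearson_corr(x, y)

def spearman_distance(x, y):
    """1 minus the Spearman coefficient (Pearson correlation of the ranks)."""
    return 1 - pearson_corr(ranks(x), ranks(y))
```

With the correlation-based distances, two profiles that rise and fall together have a distance near 0 even if their absolute levels differ, while perfectly opposite profiles have a distance near 2.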

The method used to build the dendrogram can also be changed. The four options are:

Single linkage

The distance between two clusters in the tree is the shortest distance between any pair of their members.

Average linkage (UPGMA)

The distance between two clusters in the tree is the average of the distances between all pairs of their members.

Complete linkage

The distance between two clusters in the tree is the longest distance between any pair of their members.

Ward

At every step of clustering, the two clusters whose merging results in the smallest loss of information are combined. Information loss is measured using the error sum-of-squares criterion.
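
The tree methods above can be sketched in Python as a naive agglomerative loop. This is an illustration under stated assumptions, not the tool's R implementation: `dist` is assumed to be a full symmetric matrix of pairwise distances (a list of lists), and the names `cluster_distance`, `agglomerate` and `ward_cost` are hypothetical. Ward is shown only as the merge cost, expressed as the increase in the error sum of squares.

```python
def cluster_distance(a, b, dist, method):
    """Distance between clusters a and b, given as lists of item indices."""
    pairs = [dist[i][j] for i in a for j in b]
    if method == "single":
        return min(pairs)              # shortest pairwise distance
    if method == "complete":
        return max(pairs)              # longest pairwise distance
    if method == "average":
        return sum(pairs) / len(pairs) # mean pairwise distance (UPGMA)
    raise ValueError("unknown method: " + method)

def agglomerate(dist, method="average"):
    """Merge the two closest clusters until one remains.

    Returns a list of (members_a, members_b, height) tuples, one per merge.
    """
    clusters = [[i] for i in range(len(dist))]
    merges = []
    while len(clusters) > 1:
        best = None
        for p in range(len(clusters)):
            for q in range(p + 1, len(clusters)):
                d = cluster_distance(clusters[p], clusters[q], dist, method)
                if best is None or d < best[0]:
                    best = (d, p, q)
        d, p, q = best
        merges.append((sorted(clusters[p]), sorted(clusters[q]), d))
        merged = clusters[p] + clusters[q]
        clusters = [c for k, c in enumerate(clusters) if k not in (p, q)]
        clusters.append(merged)
    return merges

def ward_cost(points_a, points_b):
    """Increase in the error sum of squares if the two clusters are merged."""
    na, nb = len(points_a), len(points_b)
    dims = range(len(points_a[0]))
    ca = [sum(p[d] for p in points_a) / na for d in dims]
    cb = [sum(p[d] for p in points_b) / nb for d in dims]
    return na * nb / (na + nb) * sum((ca[d] - cb[d]) ** 2 for d in dims)
```

For three objects lying at positions 0, 1 and 5 on a line, the first merge joins the two closest objects in every method; only the height of the second merge differs (4 for single, 4.5 for average, 5 for complete linkage).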

The results of hierarchical clustering can be assessed with bootstrap resampling. Bootstrapping creates a user-specified number of pseudodatasets from the original one. In each pseudodataset, every row or column (depending on whether genes or chips were clustered) can be present zero, one or several times. Every bootstrapped dataset is then converted into a dendrogram. For example, if 100 bootstrap samples are used, 100 trees are produced. A majority-rule consensus tree is then created from these trees and displayed to the user. Every node in the consensus tree is labeled with the number of bootstrap trees in which that node was present. The higher the number, the better supported the node.

Note that bootstrapping can only be done using Euclidean distance or Pearson correlation. Bootstrap resampling is not possible on datasets larger than 1000 genes because of computing time limitations. Hierarchical clustering itself can be run on datasets of up to 20000 genes, provided the resampling option is turned off.
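
The resampling and consensus steps described above can be sketched in Python as follows. This is a simplified illustration, not the tool's R code: each tree is assumed to be represented as a set of nodes, each node being a frozenset of the labels below it, and the names `bootstrap_indices` and `majority_rule` are hypothetical.

```python
import random
from collections import Counter

def bootstrap_indices(n, rng):
    """Draw n row or column indices with replacement.

    Each index can therefore appear zero, one or several times.
    """
    return [rng.randrange(n) for _ in range(n)]

def majority_rule(bootstrap_trees):
    """Keep the nodes found in more than half of the bootstrap trees.

    Returns a dict mapping each kept node to the number of trees
    in which it was present.
    """
    counts = Counter(node for tree in bootstrap_trees for node in tree)
    half = len(bootstrap_trees) / 2
    return {node: n for node, n in counts.items() if n > half}

# Example: three bootstrap trees over the labels a, b and c.
trees = [
    {frozenset("ab"), frozenset("abc")},
    {frozenset("ab"), frozenset("abc")},
    {frozenset("bc"), frozenset("abc")},
]
support = majority_rule(trees)
```

In this example the node grouping a and b is kept with a count of 2 and the root with a count of 3, while the node grouping b and c appears in only one tree and is dropped.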

Output

A file with information on how to draw the tree. This file can be visualised using the interactive "Hierarchical clustering" visualisation. Please note that you can select genes in this visualisation by drawing a box in the heatmap area. Clicking on the tab "Selected" allows you to create a new data set based on your selection.

References

This tool uses the R packages ape and amap. The citation for ape is:

Paradis E., Claude J. & Strimmer K. 2004. APE: analyses of phylogenetics and evolution in R language. Bioinformatics 20: 289–290.