Clustering and Classification methods for Biologists


Background to clustering and classification

Dimensionality

The raw dimensions of a data matrix are R x C, the number of Rows multiplied by the number of Columns. However, except in special cases, the effective dimensionality is usually smaller: r x c, where r < R and c < C. How do we know this? Consider the extremes. If there is zero correlation, all variables convey independent information and the variables are said to be orthogonal. At the other extreme of perfect correlation, any one of the variables carries all of the information; no information is lost if all but one are discarded. Therefore, irrespective of the number of variables, the effective dimensionality is 1.
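To make the two extremes concrete, the short Python sketch below (an illustration added here, assuming only NumPy) generates three uncorrelated variables and three perfectly correlated ones, then prints the eigenvalues of each correlation matrix. In the uncorrelated case all three eigenvalues are close to 1, so three dimensions are needed; in the perfectly correlated case a single eigenvalue absorbs everything and the effective dimensionality collapses to 1.

    import numpy as np

    rng = np.random.default_rng(42)

    # Three uncorrelated (orthogonal) variables: each eigenvalue of the
    # correlation matrix is close to 1, so all three dimensions matter.
    independent = rng.normal(size=(500, 3))

    # Three perfectly correlated variables: one eigenvalue carries all
    # of the variance, so the effective dimensionality is 1.
    base = rng.normal(size=500)
    correlated = np.column_stack([base, 2 * base, -3 * base])

    for name, data in (("independent", independent), ("correlated", correlated)):
        eigenvalues = np.linalg.eigvalsh(np.corrcoef(data, rowvar=False))
        print(name, np.round(eigenvalues[::-1], 3))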

Generalising from these extremes, it can be shown that any correlation between variables reduces the effective dimensionality, and the stronger the correlation the greater the reduction. Principal Components Analysis (PCA) provides one mechanism for estimating the effective dimensionality of a data matrix consisting of continuous, correlated variables.
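As an illustrative sketch of this idea (the data and the helper function are hypothetical, and only NumPy is assumed), the fragment below applies one common PCA rule of thumb, the Kaiser criterion, which counts eigenvalues of the correlation matrix greater than 1. Ten variables built from only two underlying signals are reported as having an effective dimensionality of about 2.

    import numpy as np

    def effective_dimensionality(data, threshold=1.0):
        """Count principal components whose correlation-matrix
        eigenvalue exceeds the threshold (the Kaiser criterion)."""
        eigenvalues = np.linalg.eigvalsh(np.corrcoef(data, rowvar=False))
        return int(np.sum(eigenvalues > threshold))

    # Ten correlated variables generated from just two independent signals:
    rng = np.random.default_rng(0)
    signals = rng.normal(size=(200, 2))
    loadings = rng.normal(size=(2, 10))
    data = signals @ loadings + 0.1 * rng.normal(size=(200, 10))

    print(effective_dimensionality(data))  # typically prints 2, not 10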

There is a similar reduction in dimensionality if the similarities between cases, rather than variables, are taken into account. As cases become more similar, the effective row dimensionality decreases, and this redundancy can be exploited by clustering algorithms. The relatedness (non-independence) of cases is also a major problem for parametric statistical analyses. This is most obvious with spatial and temporal autocorrelation, but it also arises, less obviously, in taxonomic analyses. For example, the heights, weights, etc. of siblings (brothers and sisters) will be more similar to each other than those of non-siblings.
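To show how clustering algorithms can exploit this redundancy among similar cases, here is a minimal sketch using SciPy's hierarchical clustering (the sibling-like measurements are invented for illustration): six cases, drawn from two 'families', collapse into two groups.

    import numpy as np
    from scipy.cluster.hierarchy import fcluster, linkage

    # Hypothetical (height cm, weight kg) measurements for six cases:
    # two families whose members resemble each other more than outsiders.
    cases = np.array([
        [170.0, 65.0], [172.0, 67.0], [169.0, 64.0],  # family A
        [150.0, 48.0], [152.0, 50.0], [151.0, 49.0],  # family B
    ])

    # Average-linkage hierarchical clustering on Euclidean distances,
    # cut so that at most two clusters remain.
    tree = linkage(cases, method="average", metric="euclidean")
    labels = fcluster(tree, t=2, criterion="maxclust")
    print(labels)  # e.g. [1 1 1 2 2 2]: the six cases reduce to two groups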