MMU - 61BL4165: Cluster algorithms

Clusters: Clustering Algorithms

Background

Having decided how to measure similarity (the distance measure) we must now choose the clustering algorithm, i.e. the rules which govern how distances are measured between clusters. There are many methods available, the criteria used differ and hence different classifications may be obtained for the same data, even using the same distance measure.This is important since it tells us that, although cluster analysis may provide an objective method for the clustering of cases, there can be subjectivity in the choice of method.

Five clustering algorithms are described.

Average linkage clustering

The dissimilarity between clusters is calculated using average values. Unfortunately, there are many ways of calculating an average. The most common (and recommended if there is no reason for using other methods) is UPGMA - Unweighted Pair-Groups Method Average. The average distance is calculated from the distance between each point in a cluster and all other points in another cluster. The two clusters with the lowest average distance are joined together to form the new cluster.

Average measure showing lines connecting one data point to all data points in
a different cluster

The + signs mark the centres of the two clusters.

There are other methods based on CENTROID and MEDIAN averages. Centroid, or UPGMC (Unweighted Pair-Groups Method Centroid), uses the group centroid as the average. The centroid is defined as the centre of a cloud of points. A problem with the centroid method is that some switching and reversal may take place, for example as the agglomeration proceeds some cases may need to be switched from their original clusters. This makes interpretation of the dendrogram quite difficult.

Centroid clustering, a single line joining the centres of two clusters.

The + signs mark the centres of the two clusters.

top

COMPLETE LINKAGE CLUSTERING (Maximum or Furthest-Neighbour Method)

The dissimilarity between 2 groups is equal to the greatest dissimilarity between a member of cluster i and a member of cluster j. This method tends to produce very tight clusters of similar cases.

Complete linkage clustering showing a line between the two most
widely separated cases in two clusters

top

SINGLE LINKAGE CLUSTERING (Minimum or Nearest-Neighbour Method)

The dissimilarity between 2 clusters is the minimum dissimilarity between members of the two clusters. This methods produces long chains which form loose, straggly clusters. This method has been widely used in numerical taxonomy.

Single linkage clusteringshowing a line between the two closest
cases in two clusters

top

WITHIN GROUPS CLUSTERING

This is similar to UPGMA except that clusters are fused so that within cluster variance is minimised. This tends to produce tighter clusters than the UPGMA method.

top

Ward's method

Cluster membership is assessed by calculating the total sum of squared deviations from the mean of a cluster. The criterion for fusion is that it should produce the smallest possible increase in the error sum of squares.

top

Which distance measures are compatible with the above algorithms?

Lance and Williams (1967) investigated this question and came to the following conclusions. It appears that most combinations are compatible. The general distance definitions are given in the distance section.

Algorithm	Euclidean metric	Non-Euclidean metric	Semi-metric
Single
Complete
Average
Median
Centroid
Ward's

Lance, G. N. and Williams, W. T. 1967. A general theory of classificatory sorting strategies. Computer Journal, 9: 373-380.

next topic: dendrograms

top

Short SAQ

Algorithm advantages and disadvantages

Use the drop down boxes to select the most appropriate piece of text to complete the sentence.

A problem with the method is that some switching and reversal may take place, for example as the agglomeration proceeds some cases may need to be switched from their original clusters. The method tends to produce very tight clusters of similar cases and performs well when cases form distinct clusters. However, if the cases are loosely spread the algorithm may be appropriate. A algorithm is recommended when cluster sizes are very heterogeneous. The approach does not involve the calculation of the distances between clusters, instead it minimises the amount of variation in a cluster.

A problem with the centroid method is that some switching and reversal may take place, for example as the agglomeration proceeds some cases may need to be switched from their original clusters. The complete linkage method tends to produce very tight clusters of similar cases and performs well when cases form distinct clusters. However, if the cases are loosely spread the single linkage algorithm may be appropriate. A weighted algorithm is recommended when cluster sizes are very heterogeneous. The within-groups approach does not involve the calculation of the distances between clusters, instead it minimises the amount of variation in a cluster.

All clustering algorithms have advantages and disadvantages. You can only make an informed choice if you are aware of them.

top

Clustering and Classification methods for Biologists

Cluster Analysis - Algorithms

Page Outline