Clustering and Classification methods for Biologists


Methods review



Intended Learning Outcomes

At the end of this section you should be able to complete the following tasks.



Many biological systems are inherently multifactorial, often making it necessary to examine many variables in order to understand a particular system. Multivariate statistical techniques were developed to deal with such situations, in which two or more variables are analysed simultaneously. They can be placed in two broad categories: methods that predict the value of a response variable from a set of predictors (regression-type methods), and methods that explore the structure of the data themselves (ordination and clustering methods).

More recently, a wider range of methods has been developed that tend to be placed in a different category called machine learning. However, it is important to understand that most of these more recent methods could be placed in one of the above categories.


Some important definitions

Many new terms are used in these pages. At this stage there are two important terms that feature throughout this resource and relate to its title.

  1. The term classifier describes a general predictive technique which allocates cases to a small number of predefined classes on the basis of evidence collected from one or more predictor variables.
  2. A clustering technique is any method that organises cases into groups (clusters) that are not predefined in the data.

It is also possible to view these two categories in terms of two related concepts.

  1. Supervised learning

    Learning, as applied to classifier development, refers to the gradual reduction in classification errors as more cases are presented to a classifier. Supervised learning applies to all algorithms in which a 'teacher' continually assesses, or supervises, the classifier's performance during training by marking predictions as correct or incorrect. This supervision is only possible when each case has a valid class label.

  2. Unsupervised learning

    This type of learning has no teacher: the algorithm must find its own classes or clusters. The ability to find classes depends on the existence of 'natural' classes in the data. A typical example would be finding habitat classes in a satellite image, or patterns in gene expression levels that may represent a particular cell state. It is usually best to view unsupervised learning as a type of exploratory data analysis which may provide evidence for some underlying structure in the data.



The value of a single response variable is assumed to be a function of a set of predictor variables. For example we may wish to predict:

Response (single variable)                              Predictors (>1)
number of species                                       climate, soils, disturbance
amount of lead in body tissue                           traffic volume and distance from a road
bone mineral density in survivors of childhood cancer   age, weight, height, diet
probability of death                                    blood pressure, gender and diet

These methods are generally classed as regression analysis. This means that the variables are separated into a single 'response' or 'dependent' variable (y) and a set of one or more 'predictor' or 'independent' variables (x1 ... xP). The value of y is assumed to be some function of x, i.e. in a generalised format y = f(x). There are a range of techniques in this category which differ in the nature of both y and f(x) and include:



In these methods the primary aim is one of dimension reduction (see the ordination web page for more detailed information). Consider a data table that has n rows (e.g. cases, samples or sites) and p columns (variables). A table such as this can be compressed in one or two directions, effectively reducing the number of rows or columns and thereby reducing its dimensionality.

Two general categories of methods have been recognised: geometrical (ordination) methods, which compress the columns (variables), and clustering or partitioning methods, which group the rows (cases).

Methods such as correspondence analysis compress the table in two directions simultaneously.

The main aim of the geometrical methods is compression of the variables. The reason for this approach is that studies often collect data for many variables. A large number of variables is difficult to process and assimilate. However, if many variables are correlated it may be possible to combine them into a small number of groups (derived variables) which relate to more abstract features, for example increasing human influence or 'lifestyle'. These derived variables can be used as the primary measures in subsequent analyses. A variety of methods, such as PCA and Correspondence Analysis, have been employed to obtain these new variables.

Clustering and partitioning methods group cases on the basis of their similarity over a range of variables. The main examples of these techniques come under the general heading of cluster analysis. Many clustering algorithms are available; they differ with respect to the method used to measure similarities (or dissimilarities) and the points between which distances are measured. Thus, although clustering algorithms are objective, there is scope for subjectivity in the selection of an algorithm. The most common clustering algorithms are polythetic and agglomerative, i.e. a series of increasingly larger clusters is formed by the fusion of smaller clusters on the basis of similarity measured over more than one variable. A problem with hierarchical approaches is that they are computer-intensive, so large data sets may be difficult to analyse. A less computer-intensive approach is the nonhierarchical k-means, or iterative relocation, algorithm. Each case is initially placed in one of k clusters; cases are then moved between clusters if doing so reduces the differences between cases within a cluster.

The Ordination Web site has a lot of useful background about the application of ordination methods, particularly related to ecological problems.


Multiple Regression (A regression method)

If the dependent variable is continuous, or at least not binary or categorical, multiple regression is usually employed to investigate the nature of the relationship between the response variable and its predictors. Multiple regression attempts to fit, in the simplest case, a plane to the predictors such that the error (as measured by a sum of squares) between observed and predicted y values is minimised. R² (the coefficient of determination) provides a measure of the overall fit between observed and predicted values of y. The resultant relationship is described by an equation such as

y = b0 + b1x1 + b2x2 + ... bpxp

If a b coefficient is significantly different from 0 it provides a measure of the rate of change in y for each one-unit change in xi. The main problems with multiple regression relate to the validity of the assumptions, the interpretation of the coefficients and the selection of an appropriate model.
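As a concrete illustration, the sketch below fits the equation above by solving the normal equations (X'X)b = X'y directly in plain Python; the data, the function name `fit_ols` and the true coefficients (b0 = 2, b1 = 3, b2 = -1) are all invented for the example, not taken from this resource.

```python
def fit_ols(X, y):
    """Fit y = b0 + b1*x1 + ... + bp*xp by solving the normal
    equations (X'X) b = X'y with Gaussian elimination."""
    # prepend an intercept column of 1s
    Xa = [[1.0] + list(row) for row in X]
    n, p = len(Xa), len(Xa[0])
    # build X'X and X'y
    XtX = [[sum(Xa[k][i] * Xa[k][j] for k in range(n)) for j in range(p)]
           for i in range(p)]
    Xty = [sum(Xa[k][i] * y[k] for k in range(n)) for i in range(p)]
    # Gaussian elimination with partial pivoting on the augmented matrix
    A = [XtX[i][:] + [Xty[i]] for i in range(p)]
    for col in range(p):
        pivot = max(range(col, p), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(col + 1, p):
            f = A[r][col] / A[col][col]
            for c in range(col, p + 1):
                A[r][c] -= f * A[col][c]
    # back substitution
    b = [0.0] * p
    for i in range(p - 1, -1, -1):
        b[i] = (A[i][p] - sum(A[i][j] * b[j] for j in range(i + 1, p))) / A[i][i]
    return b

# noise-free data generated from y = 2 + 3*x1 - 1*x2, so the fitted
# coefficients should recover those values almost exactly
X = [(x1, x2) for x1 in range(5) for x2 in range(5)]
y = [2 + 3 * x1 - 1 * x2 for x1, x2 in X]
b0, b1, b2 = fit_ols(X, y)
```

With real, noisy data the coefficients would only approximate the generating values, and a statistics package would also supply the significance tests discussed above.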


Logistic Regression (A regression method)

If the dependent variable has two possible values, for example 0 and 1, methods such as multiple regression become invalid since predicted values of y are not constrained within the 0 and 1 limits. Discriminant analysis can be used in such circumstances. However discriminant analysis will only produce optimal solutions if its assumptions are supported by the data. An alternative approach is logistic regression. In logistic regression the dependent variable is the probability that an event will occur, hence y is constrained between 0 and 1. The logistic model is written as:

p(event) = 1/(1 + e^-z)

where z is b0 + b1x1 + b2x2 + ... bpxp

The logistic equation can be rearranged by converting the probability into a log odds or logit.

Log [Prob.(event)/Prob.(no event)] = b0 + b1x1 + b2x2 + ... bpxp

This produces a relationship similar to that in multiple regression, except that each one-unit change in a predictor is associated with a change in the log odds rather than in the response directly. Different types of response model can be investigated with this approach; for example, if squared predictors (quadratic terms) are included, the response model is assumed to be Gaussian (unimodal) rather than sigmoidal. As with multiple regression it is also possible to test a range of models by applying stepwise inclusion or elimination of predictors. Interpretation of the coefficients is complicated by the fact that they relate to changes in log odds rather than the response itself.
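The two forms of the logistic model above can be checked numerically. The short sketch below evaluates p(event) = 1/(1 + e^-z) and the logit transform; the coefficient values (b0 = -2, b1 = 0.5) are hypothetical, chosen only so that z = 0 and p = 0.5 for the example case.

```python
import math

def logistic_probability(coeffs, xs):
    """p(event) = 1 / (1 + e^-z), where z = b0 + b1*x1 + ... + bp*xp."""
    z = coeffs[0] + sum(b * x for b, x in zip(coeffs[1:], xs))
    return 1.0 / (1.0 + math.exp(-z))

def log_odds(p):
    """The logit transform: log[p / (1 - p)]."""
    return math.log(p / (1.0 - p))

# hypothetical coefficients b0 = -2, b1 = 0.5 for a single predictor;
# at x = 4 we get z = 0, so the predicted probability is exactly 0.5
p = logistic_probability([-2.0, 0.5], [4.0])
```

Note that however extreme the predictor values become, the predicted probability stays within the 0 to 1 limits, which is precisely the property that multiple regression lacks for a binary response.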


Generalized additive models

Generalized additive models (GAMs) are related to other general linear models such as logistic regression. However, they are only semi-parametric because they use a non-parametric 'smoothed' link function, where the form of the smoothing function is data-driven rather than set by some pre-defined probability model. Despite this flexibility, they are not fully non-parametric because the probability distribution of the class variable has to be specified (e.g. binomial). GAM models do not have simple coefficients (weights) for the predictors. Instead the relationship between a class and a predictor is represented by a smoothing function. Although they have some advantages over simple polynomial regressions, they are more difficult to fit and require more user involvement in their design. They may also be more difficult to interpret.


Discriminant Analysis (A regression method)

Identifying features that are associated with splitting a set of observations into two or more groups, such as nest and non-nest sites or benign and malignant tumours, is a common biological problem. If there is information about individual sampling units, obtained from a number of variables, it is reasonable to ask if these variables can be used to define the groups. Discriminant analysis works by combining the variables in such a way that the differences between the predefined groups are maximised. It also provides a classification rule (an equation or discriminant function) that can be used with other cases to predict which group they belong to.

Discriminant analysis can be considered to be a special case of regression analysis where the response variable identifies group membership, for example used and unused habitat. It is possible to use normal regression analysis programs to carry out discriminant analysis.
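For the simplest possible case, two groups measured on one variable with similar spread, the discriminant rule reduces to assigning a case to the group whose mean it is closest to, with the boundary at the midpoint of the two means. The sketch below illustrates this; the nest/non-nest measurements are invented for the example and the function name `linear_discriminant` is not from any library.

```python
def linear_discriminant(group_a, group_b):
    """Two groups measured on one variable: with roughly equal
    variances the discriminant boundary is the midpoint of the
    group means, and a case is assigned to the nearer mean."""
    mean_a = sum(group_a) / len(group_a)
    mean_b = sum(group_b) / len(group_b)
    boundary = (mean_a + mean_b) / 2

    def classify(x):
        return 'A' if abs(x - mean_a) < abs(x - mean_b) else 'B'

    return boundary, classify

nest = [3.1, 2.9, 3.3]       # hypothetical measurements at nest sites
non_nest = [5.8, 6.1, 6.0]   # hypothetical measurements at non-nest sites
boundary, classify = linear_discriminant(nest, non_nest)
```

Real discriminant analysis generalises this idea to many variables by finding the linear combination of predictors that maximises the separation between the group means relative to the within-group variation.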


Principal Components Analysis (An ordination method)

PCA is a geometrical ordination method which compresses a set of variables into a smaller number of derived variables or components. It is used to pick out patterns in the relationships between the variables in such a way that most of the original information can be represented by a reduced number of new variables. A useful metaphor is a photograph: a 2-dimensional representation of a 3-dimensional object. As long as an appropriate camera angle is chosen, little information about the subject will be lost. Thus, the original 3 dimensions can be compressed into 2 dimensions with little loss of information.

As with all statistical techniques there are assumptions about the data. The main one is that the derived components are normally distributed and uncorrelated (orthogonal). If PCA is being used to test statistical hypotheses the assumptions should be valid. The assumptions are less important when PCA is used as a descriptive and exploratory tool. In practice, if the principal components are normally distributed the assumptions may be considered valid. A more useful yardstick by which a PCA should be judged is whether the results make biological sense; a bad analysis does not.


Correspondence Analysis (An ordination method)

CA is an ordination method which aims to compress the rows and columns of a data table simultaneously, to achieve a single geometrical representation of cases and variables. CA has been applied widely in plant ecology. Uptake by English-language researchers may have been limited because its main proponents often write in French. In a typical ecological study a table of species by sites is subjected to analysis. The table does not contain environmental data such as soil pH, etc. It is hoped that CA will reorganise the table such that environmental gradients can be identified and used to interpret the species distributions.

The CA axes have associated eigenvalues in the range 0 to 1. Generally, only those axes with eigenvalues over 0.5 can be said to show good separation between sites or species. The eigenvalues can be thought of as correlations between row and column scores. An eigenvalue of 1.0 implies that one sample (or group of samples) shares no species with the other samples. A modified analysis called Detrended Correspondence Analysis (DCA) is one of the most widely used ordination techniques in ecology.


Canonical Correspondence Analysis (An ordination method)

The best ordination techniques reduce dimensions by detecting the most "important" dimensions (or gradients) in a data set and ignoring "noise" or unexplained variation. Most early ordination techniques were indirect and "exploratory": it was the analyst's job to work out which environmental gradients had been detected by the analysis. CCA is different because it allows statistical hypotheses to be tested. In many respects it can be considered a special case of multiple regression in which the y variable is no longer a single variable but an ordination axis representing a particular assemblage of species or sites. As with all statistical tests, a null hypothesis must be clearly stated and data must be collected in a repeatable manner.


Hierarchical clustering (A clustering method)

Cases are assigned to clusters on the basis of some similarity or distance measure. Clusters are organised into hierarchies, such that a tree of increasingly different clusters is formed.
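One common variant is single-linkage agglomerative clustering, where the distance between two clusters is the distance between their closest members. The sketch below applies it to invented one-dimensional data (the function name `single_linkage` and the example points are illustrative only):

```python
def single_linkage(points, n_clusters):
    """Agglomerative clustering: start with one cluster per case and
    repeatedly fuse the two clusters whose closest members are nearest
    (single linkage), until n_clusters remain."""
    clusters = [[p] for p in points]

    def gap(a, b):
        # single-linkage distance: smallest gap between any pair
        return min(abs(x - y) for x in a for y in b)

    while len(clusters) > n_clusters:
        # find the pair of clusters with the smallest gap and fuse them
        i, j = min(((i, j) for i in range(len(clusters))
                    for j in range(i + 1, len(clusters))),
                   key=lambda ij: gap(clusters[ij[0]], clusters[ij[1]]))
        clusters[i] += clusters.pop(j)
    return clusters

# two obvious groups on a number line
groups = single_linkage([1.0, 1.2, 1.1, 8.0, 8.3, 7.9], 2)
```

Recording the order and the distance at which fusions happen, rather than stopping at a fixed number of clusters, is what produces the familiar tree (dendrogram).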


k-means clustering (A clustering method)

Cases are clustered into k (a value set by the user) nonhierarchical classes. The method is conceptually simple but computationally intensive. It is useful for large datasets that may be difficult to analyse using a hierarchical method. It can also be used to assign cases to classes if the class characteristics (e.g. class means) are known.
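The iterative relocation idea can be sketched in a few lines. This toy version works on one-dimensional data with a crude choice of starting means; the data and the function name `k_means` are invented for illustration, and real implementations use better initialisation and a convergence test rather than a fixed iteration count.

```python
def k_means(points, k, iterations=20):
    """Iterative relocation: assign each case to the nearest of k
    cluster means, then recompute the means, and repeat."""
    means = list(points[:k])               # crude initial means
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: abs(p - means[i]))
            clusters[nearest].append(p)
        # recompute each mean; keep the old mean if a cluster empties
        means = [sum(c) / len(c) if c else means[i]
                 for i, c in enumerate(clusters)]
    return means, clusters

points = [1.0, 1.2, 0.8, 5.0, 5.3, 4.7]
means, clusters = k_means(points, 2)
```

Because each pass only compares every case against k means, the cost grows linearly with the number of cases, which is why k-means scales to data sets that defeat hierarchical methods.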


Machine Learning and Others

This is a very diverse set of methods. All of the classifiers described above are based on the general linear model. The methods listed below share little in common with each other, beyond being data-driven rather than model-driven like the general linear models. The first two methods are biologically inspired and the third has much in common with species identification keys. The other methods appear to work well but are less used in biology at the moment.

  1. Artificial neural network classifiers and Self-organising map clustering methods are very loosely based on the organisation of real neural networks.
  2. Genetic algorithms use techniques loosely based around natural selection to perform searches that may produce effective classifiers. No additional information is provided here, but the outdated genetic algorithm FAQs (AI Repository, 1997) are still a good source of background information.
  3. Decision trees produce classifications that resemble the types of keys often used for species identification. However, the techniques used to produce these rules are very different from the often qualitative methods used to construct identification keys.
  4. Support Vector Machines (SVMs) are a comparatively recent development, building on advances in computational learning theory. They are increasingly used in bioinformatic applications because of their accuracy and their ability to deal with a large number of predictors. Descriptions and explanations of SVMs are heavily mathematical and the reader is advised to consult Burges's (1998) tutorial, which uses the VC dimension as a starting point. DTREG is a commercial classification and regression tree program, and its web site has a very clear graphical explanation of the SVM algorithm.
  5. Nearest neighbour classifiers are conceptually simple and use instance-based learning methods, which only build a classifier when a new unknown case needs to be classified. The class of a case is the majority class (vote) of its nearest neighbours in predictor space.
  6. Naive Bayesian classifiers are probabilistic classifiers based on Bayes' theorem. They have an important role in much machine learning research because they are the optimal supervised learning method if the predictors are independent (uncorrelated), given the class. Despite this unrealistic assumption, the naive Bayes classifier is usually very effective, even when its assumptions do not apply. The effectiveness of this classifier arises because predictions will be accurate as long as the probability of being in the correct class is greater than that for any of the incorrect classes. In other words, only an approximate solution is required that has the correct rank order for class membership probabilities.
  7. Case-based reasoning (CBR) is often linked with expert systems. These methods aim to model human reasoning and thinking and are said to improve with time, as more cases are added. CBR algorithms assign classes by matching possible 'solutions' to a database of previous cases. Watson and Marir (1994) provide a comprehensive, although rather old, review of case-based reasoning methods and applications on the ai-cbr web site. Although this site is no longer updated it still contains some useful resources for CBR methods.
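The nearest neighbour idea (item 5 above) is simple enough to sketch in full. The data below are invented: two measurements per case and two hypothetical tumour classes, echoing the benign/malignant example used earlier on this page.

```python
from collections import Counter

def knn_predict(training, labels, case, k=3):
    """Classify a new case as the majority class (vote) of its k
    nearest neighbours in predictor space (Euclidean distance)."""
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

    # rank all training cases by distance to the unknown case
    ranked = sorted(range(len(training)),
                    key=lambda i: dist(training[i], case))
    votes = Counter(labels[i] for i in ranked[:k])
    return votes.most_common(1)[0][0]

# hypothetical two-class data: two measurements per case
training = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1),
            (4.0, 4.2), (4.1, 3.9), (3.8, 4.0)]
labels = ['benign', 'benign', 'benign',
          'malignant', 'malignant', 'malignant']
prediction = knn_predict(training, labels, (3.9, 4.1), k=3)
```

Note that no model is fitted in advance: all of the work happens when a new case arrives, which is exactly what "instance-based learning" means.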