Clustering and Classification Methods for Biologists



Decision Trees



Decision and Regression Trees

Background


Decision and regression trees are an example of a machine learning technique. You can find out more about machine learning methods, and their use with ecological data, in Fielding, A.H. (ed.) (1999) Machine Learning Methods for Ecological Applications, Kluwer Academic. A clue to how they function is provided by their alternative name: recursive partitioning methods.

In certain circumstances they have advantages over more common supervised learning methods such as discriminant analysis. In particular, they do not have to conform to the same probability distribution restrictions: there is no assumption of a linear model, and no need to pre-specify a probability distribution for the errors. They are particularly useful when the predictors may be associated in some non-linear fashion.
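To make this concrete, here is a minimal sketch in Python using scikit-learn (neither of which the original page uses): a small tree recovers a threshold interaction between two predictors without any linear model or distributional assumptions. All data, names and parameter values are invented for illustration.

```python
# Illustrative sketch only: data and thresholds are invented, and
# scikit-learn is just one of many packages that fit such trees.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.uniform(0, 100, size=(200, 2))        # two continuous predictors
# class membership depends on a threshold interaction of the two
# predictors, not on any single linear combination of them
y = ((X[:, 0] > 50) & (X[:, 1] > 50)).astype(int)

tree = DecisionTreeClassifier(max_depth=3).fit(X, y)
print(tree.score(X, y))   # the tree recovers the two thresholds almost exactly
```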

Decision points are called nodes, and at each node the data are partitioned. Each of these partitions is then partitioned independently of all the others, hence the alternative name of recursive partitioning. This could carry on until each partition consisted of a single case, giving a tree with many branches and as many terminal segments (leaves) as there are cases. Normally some 'stopping rule' is applied before this extreme condition is reached. Inevitably this may mean that some partitions are 'impure' (their cases are a mixture of classes), but it is necessary to balance accuracy against generality: a tree that produced a perfect classification of the training data would probably perform poorly with new data.
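The accuracy/generality trade-off can be seen directly by fitting one tree with no stopping rule and one constrained by a simple rule such as a minimum leaf size. The sketch below again assumes scikit-learn and invented data; min_samples_leaf is only one of several stopping rules such packages offer.

```python
# Sketch of a stopping rule balancing accuracy against generality.
# Data and parameter values are invented for illustration.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(1)
X = rng.normal(size=(300, 4))                                    # four predictors
y = (X[:, 0] + rng.normal(scale=1.0, size=300) > 0).astype(int)  # noisy response

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

full = DecisionTreeClassifier().fit(X_train, y_train)  # grows until leaves are pure
pruned = DecisionTreeClassifier(min_samples_leaf=20).fit(X_train, y_train)

# The unrestricted tree classifies the training data perfectly but is
# likely to do worse than the constrained tree on the held-back test data.
print(full.score(X_train, y_train), full.score(X_test, y_test))
print(pruned.score(X_train, y_train), pruned.score(X_test, y_test))
```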

The following diagram summarises the main characteristics of a classification tree for a problem with a binary response (Groups I and II) and four predictors (A, C & D are continuous, B is categorical).

[Figure: a simple decision tree, described below]

The data are first split (partitioned) into cases with values of A above or below 57.5. The left side of the tree (A < 57.5) is next split on the value of B (Yes/No). If B = Yes, a terminal node is reached in which all cases have B = Yes and A < 57.5. If B = No, the cases are next split on variable C (above or below 48.2). On the right side of the tree (A > 57.5) the cases are split on the value of C (above or below 15.5). Cases with C < 15.5 are split using variable B, while those with C > 15.5 are split using variable D.
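A fitted tree is simply a set of nested rules, so the tree above can be written out directly. In this plain-Python sketch the split variables and thresholds come from the description; the group assigned at each leaf, and the threshold for D, are not given in the text, so those values are hypothetical.

```python
def classify(a, b, c, d):
    """Classify one case. a, c, d are continuous; b is categorical ('Y'/'N').
    Splits follow the diagram; leaf labels and the D threshold are assumed."""
    if a < 57.5:
        if b == 'Y':
            return 'Group I'                    # terminal node: A < 57.5, B = Yes
        return 'Group I' if c < 48.2 else 'Group II'  # B = No: split on C at 48.2
    if c < 15.5:                                # A > 57.5: split on C at 15.5
        return 'Group I' if b == 'Y' else 'Group II'  # then split on B
    return 'Group II' if d < 50.0 else 'Group I'      # then split on D (50.0 assumed)

print(classify(a=40.0, b='N', c=30.0, d=10.0))   # -> 'Group I'
```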

The next section describes decision trees in more detail.

A separate page describes the Random Forest algorithm.



Resources

QUEST is a Windows-based, public-domain classification tree package. It is available from http://www.stat.wisc.edu/~loh.

There is a nice, free classification/decision tree add-in for Excel, available from http://www.geocities.com/adotsaha/CTree/CtreeinExcel.html.

The AI Depot has a short description of recursive partitioning, linked to a discussion list.

Chapter 5 of the book Machine Learning, Neural and Statistical Classification, edited by D. Michie, D.J. Spiegelhalter and C.C. Taylor, describes decision trees. The entire book can be downloaded as a PDF file (1.79 MB).

John Bell has written a chapter on the use of decision trees for ecological applications. The reference is: Bell, J.F. (1999) Tree-based methods: the use of classification trees to predict species distributions. In Fielding, A.H. (ed.) Machine Learning Methods for Ecological Applications, Kluwer Academic.



Self assessment questions

This series of questions uses the CTREE Excel add-in (see above) and a datafile loosely based on the pronghorn antelope data (pp. 30-36 in Manly et al. (2002) Resource Selection by Animals, 2nd edition, Kluwer Academic). You will need both CTREE and the datafile to answer these questions fully.
