Outline

Building a decision tree is computationally intensive, and programs can take a long time to run with large data sets. There are two stages. In the first, a tree is built. This tree is likely to be too complex: it has too many nodes, so it fits the training data well but may perform poorly with new data. Consequently, the second stage prunes the tree to a simpler, and hopefully more general, structure.

Stage 1 Building a tree

Starting at the root node

Stage 2 Pruning the tree
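
The two stages can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the method of any particular program: it assumes Gini impurity as the splitting criterion and reduced-error pruning against a separate set of pruning cases, neither of which is specified here. The function names (build, prune, predict, count_nodes) and the toy data are invented for the example.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a non-empty list of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def build(rows, labels):
    """Stage 1: grow a full tree by recursive binary splitting."""
    node = {"label": Counter(labels).most_common(1)[0][0]}  # majority class
    if len(set(labels)) == 1:
        return node  # pure node, stop splitting
    best = None
    for f in range(len(rows[0])):
        # Every observed value except the largest is a candidate threshold,
        # so both sides of a split always receive at least one case.
        for t in sorted({r[f] for r in rows})[:-1]:
            left = [y for r, y in zip(rows, labels) if r[f] <= t]
            right = [y for r, y in zip(rows, labels) if r[f] > t]
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
            if best is None or score < best[0]:
                best = (score, f, t)
    if best is None:
        return node  # all cases have identical predictor values
    _, f, t = best
    li = [i for i, r in enumerate(rows) if r[f] <= t]
    ri = [i for i, r in enumerate(rows) if r[f] > t]
    node.update(feature=f, threshold=t,
                left=build([rows[i] for i in li], [labels[i] for i in li]),
                right=build([rows[i] for i in ri], [labels[i] for i in ri]))
    return node

def predict(node, row):
    """Follow the splits from the root down to a leaf."""
    while "feature" in node:
        branch = "left" if row[node["feature"]] <= node["threshold"] else "right"
        node = node[branch]
    return node["label"]

def count_nodes(node):
    if "feature" not in node:
        return 1
    return 1 + count_nodes(node["left"]) + count_nodes(node["right"])

def prune(node, rows, labels):
    """Stage 2: working bottom-up, replace a subtree with a leaf whenever
    the leaf makes no more errors on the pruning cases that reach it."""
    if "feature" not in node:
        return node
    f, t = node["feature"], node["threshold"]
    li = [i for i, r in enumerate(rows) if r[f] <= t]
    ri = [i for i, r in enumerate(rows) if r[f] > t]
    node = dict(node,  # copy, so the full tree is left unchanged
                left=prune(node["left"], [rows[i] for i in li], [labels[i] for i in li]),
                right=prune(node["right"], [rows[i] for i in ri], [labels[i] for i in ri]))
    subtree_errors = sum(predict(node, r) != y for r, y in zip(rows, labels))
    leaf_errors = sum(node["label"] != y for y in labels)
    return {"label": node["label"]} if leaf_errors <= subtree_errors else node

# Toy training data (invented): class "a" below 5, "b" above, one noisy case at x = 2.
train_x = [[x] for x in range(10)]
train_y = ["a", "a", "b", "a", "a", "b", "b", "b", "b", "b"]
# Separate, noise-free cases used only for pruning.
prune_x = [[x + 0.5] for x in range(10)]
prune_y = ["a"] * 5 + ["b"] * 5

full_tree = build(train_x, train_y)
pruned_tree = prune(full_tree, prune_x, prune_y)
print(count_nodes(full_tree), count_nodes(pruned_tree))  # 7 3
```

The full tree spends extra nodes isolating the single noisy training case. Pruning removes them and leaves one split, which is the simpler and more general structure the second stage aims for.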

Important concept

A classifier, such as a decision tree, has little merit if the accuracy of its predictions cannot be, or is not, assessed using independent data. In other words, it is important to have some idea of how well the classifier will perform with new data. This matters because the accuracy achieved with the original data is often much greater than that achieved with new data. Consequently, it is generally accepted that robust measures of a classifier's accuracy must make use of independent data, i.e. data not used to develop the classifier. The two data sets needed to develop and test predictions are known by a variety of synonyms. The terms 'training' and 'testing' data are used here.
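
The gap between training and testing accuracy can be illustrated with a short Python sketch. Everything in it is an assumption made for the example: the data are simulated with 20% of labels flipped, every second case is held out as testing data, and a nearest-neighbour rule stands in for the classifier because it memorises its training data and so makes the effect easy to see. It is not a decision tree.

```python
import random

def accuracy(predict, rows, labels):
    """Proportion of cases whose predicted label matches the true label."""
    return sum(predict(r) == y for r, y in zip(rows, labels)) / len(labels)

def make_data(n, noise, rng):
    """Simulated data: one predictor x, a class that depends on x,
    and a fraction of labels flipped at random."""
    rows, labels = [], []
    for i in range(n):
        x = i / n
        label = "present" if x > 0.5 else "absent"
        if rng.random() < noise:
            label = "absent" if label == "present" else "present"
        rows.append(x)
        labels.append(label)
    return rows, labels

rows, labels = make_data(200, 0.2, random.Random(0))

# Training data develop the classifier; testing data are never shown to it.
train_x, train_y = rows[0::2], labels[0::2]
test_x, test_y = rows[1::2], labels[1::2]

def nearest_neighbour(x):
    """Predict the label of the closest training case."""
    i = min(range(len(train_x)), key=lambda j: abs(train_x[j] - x))
    return train_y[i]

train_acc = accuracy(nearest_neighbour, train_x, train_y)
test_acc = accuracy(nearest_neighbour, test_x, test_y)
print(f"training accuracy: {train_acc:.2f}")
print(f"testing accuracy:  {test_acc:.2f}")
```

Because each training case is its own nearest neighbour, the training accuracy is perfect by construction and says nothing about performance with new data. Only the testing figure is an independent estimate.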

You can find out more about these concepts, and the measurement of accuracy, in:

Fielding, A. H. and Bell, J. F. 1997. A review of methods for the assessment of prediction errors in conservation presence/absence models. Environmental Conservation 24: 38-49.

Fielding, A. H. 1999. How should accuracy be measured? pp. 209-223 in A. Fielding (ed.) Ecological Applications of Machine Learning Methods. Kluwer Academic, Boston, MA.

Fielding, A. H. 1999. A review of machine learning methods. pp. 1-37 in A. Fielding (ed.) Ecological Applications of Machine Learning Methods. Kluwer Academic, Boston, MA.

Fielding, A. H. 2002. What are the appropriate characteristics of an accuracy measure? pp. 271-280 in J. M. Scott, P. J. Heglund, M. Morrison, J. B. Haufler, M. G. Raphael, W. B. Wall, and F. Samson (eds.) Predicting Plant and Animal Occurrences: Issues of Scale and Accuracy. Island Press.