MMU - MSC Multivariate Statistics, Decision Tree SAQ: Analysis 1a

Analysis 1a: No criteria set and the tree is pruned

Before beginning make sure that the following options are set on the UserInput sheet. Later examples will investigate the consequences of other choices for these options.

Leaf Node Criteria - Make sure that none are ticked, so that none apply (see region highlighted by red box).

The three Leaf Node Criteria (Minimum Node Size, Minimum Purity and Maximum Depth) should all be unticked.

Tree pruning and test set options

Make sure that the Tree Pruning option is set to Yes.

If you click on the cell containing Yes, a drop-down menu will appear. Use this menu to change this and most other options.

The data should be partitioned using method 2 (use the drop down menu from the cell). The consequences of choosing option 1 will be explored later.

Partitioning method 2 holds back the number of rows specified by the option 2 choice cell as the test set. In the example it is set at 50; make sure that yours also says 50. This means that the first 200 cases (the training set) will be used to develop the tree. The tree will then be applied to the last 50 cases (the test set) to get an indication of how well it will work with data collected in the future. There is no drop-down menu for this cell, so you will have to type in a number AND press the Enter key.
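The split described above can be sketched in a few lines of Python. This is only an illustration of the idea, not the spreadsheet's own code; the function name partition_last_n and the stand-in list of case numbers are assumptions made for the example.

```python
def partition_last_n(rows, n_test):
    """Return (training, test), holding back the last n_test rows as the test set."""
    if not 0 < n_test < len(rows):
        raise ValueError("n_test must be between 1 and len(rows) - 1")
    return rows[:-n_test], rows[-n_test:]


# Stand-in for the 250 data rows: just the case numbers 1..250 (hypothetical).
cases = list(range(1, 251))
training, test = partition_last_n(cases, 50)
print(len(training), len(test))  # 200 50
```

Because the split is by position, the order of the rows on the Data sheet decides which cases end up in the test set.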

You begin the analysis by clicking on the Build Tree button on the UserInput sheet.

If nothing happens it may be because you have forgotten to press Enter after changing an option value.

The program may take 10-20 seconds to complete the tree building, after which the Results sheet will be displayed.

Results for Analysis 1a

Your results screen should look like this. I have added three boxes (Regions 1, 2 & 3) to simplify the description of the output.

Screen shot of the results page after the stage 1a analysis. Details are described in the next section.

Region 1 of the Results sheet gives some basic information about the data. There are 250 cases, split between 200 training and 50 test cases. There are 4 predictors (vegetation, slope, water and aspect). The name of the Class Variable is 'present'; this is the name in the P23 cell on the Data sheet. There are two classes and most of the cases (127 out of 250) belong to class 1. Class 1 is therefore the majority class. The final section gives a figure of 36% misclassified. This figure comes from the test data (the last 50 cases). There are 18 class 1 and 32 class 0 cases in the test data. Therefore, the majority class here is class 0. If all of the test cases were labelled as 0, the 18 class 1 cases would be misclassified as class 0. 18/50 = 36%.
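The 36% figure can be checked with a short Python sketch. The label list is rebuilt from the counts quoted above (32 class 0 and 18 class 1), so the order of the labels is an assumption and does not affect the result.

```python
from collections import Counter

# Test-set labels rebuilt from the Region 1 counts; the order is arbitrary.
test_labels = [0] * 32 + [1] * 18

# Label every test case with the most common class and count the errors.
majority_class, majority_count = Counter(test_labels).most_common(1)[0]
baseline_error = (len(test_labels) - majority_count) / len(test_labels)
print(majority_class, f"{baseline_error:.0%}")  # 0 36%
```

This is the error rate you would get without using any predictors at all, so it is a useful yardstick for the tree's test result.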

Region 2 gives details about the tree. The important part is at the top; the lower part just gives the timings for different parts of the analysis. There are 56 nodes in the tree and 30 leaf (terminal) nodes. The tree has 11 layers, which means that you would need to pass through 11 nodes to get to some leaf nodes. This tree does very well with the training data: only 8% (16) of the cases are given the wrong label. Unfortunately, 38% of the test cases are given the wrong label. The details of these misclassifications are given in Region 3, the confusion matrices.

Region 3 has two confusion matrices, one each for the training and test data. The training matrix shows that 200 cases were classified, of which 91 had a true class of 0 and 109 had a true class of 1. 95 of the 200 cases were predicted to have a class of 0 and 105 to have a class of 1. Of the 95 predicted to be class 0, 85 were correctly placed and 10 were incorrectly placed. Similarly, of the 105 predicted to be class 1, 99 were correctly identified and 6 were misclassified.
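The row and column totals of the training matrix, and the 8% training error, can be recovered from its four cells. The sketch below is a hedged illustration in Python; the dictionary layout keyed by (true class, predicted class) and the function name summarise are choices made for this example, not part of the spreadsheet.

```python
# Training confusion matrix cells, keyed by (true class, predicted class).
train = {(0, 0): 85, (0, 1): 6, (1, 0): 10, (1, 1): 99}


def summarise(matrix):
    """Return total cases, true-class totals, predicted-class totals and error rate."""
    total = sum(matrix.values())
    true_totals = {c: sum(n for (t, _), n in matrix.items() if t == c) for c in (0, 1)}
    pred_totals = {c: sum(n for (_, p), n in matrix.items() if p == c) for c in (0, 1)}
    errors = sum(n for (t, p), n in matrix.items() if t != p)
    return total, true_totals, pred_totals, errors / total


total, true_totals, pred_totals, error_rate = summarise(train)
print(total, true_totals, pred_totals, f"{error_rate:.0%}")
# 200 {0: 91, 1: 109} {0: 95, 1: 105} 8%
```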

Analysis 1a: Test data confusion matrix

Match these elements of the confusion matrix to the correct number.

a)	Number of cases with a true class of 1
b)	Number of cases predicted to be class 0
c)	Number of class 0 cases correctly predicted to be class 0
d)	Number of class 1 cases incorrectly predicted as class 0
e)	Number of class 0 cases incorrectly predicted as class 1
f)	Number of cases tested
See the description of the training data confusion matrix for a full account. 50 cases were tested, made up of 32 0s and 18 1s. The 18 1s were split between 6 incorrectly predicted as class 0 and 12 correctly predicted as class 1. The 32 0s were split between 19 correctly predicted as class 0 and 13 incorrectly predicted as class 1.
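As a check on the answers, each of the six quantities can be computed from the four test-set cells quoted above. This is a minimal Python sketch; the variable names are invented for the example.

```python
# Test-set counts quoted in the answer, named by true and predicted class.
true0_pred0, true0_pred1 = 19, 13
true1_pred0, true1_pred1 = 6, 12

answers = {
    "a": true1_pred0 + true1_pred1,  # cases with a true class of 1
    "b": true0_pred0 + true1_pred0,  # cases predicted to be class 0
    "c": true0_pred0,                # class 0 correctly predicted as class 0
    "d": true1_pred0,                # class 1 incorrectly predicted as class 0
    "e": true0_pred1,                # class 0 incorrectly predicted as class 1
    "f": true0_pred0 + true0_pred1 + true1_pred0 + true1_pred1,  # cases tested
}
test_error = (answers["d"] + answers["e"]) / answers["f"]
print(answers, f"{test_error:.0%}")
```

The two kinds of error (d and e) add up to 19 of the 50 test cases, which is the 38% test misclassification rate reported in Region 2.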

The Analysis 1a tree

The tree is far too large to display on one screen. The following was captured from the Excel page displayed at 10% zoom. Although details cannot be seen it does show the overall structure. This image has been rotated through 90 degrees.

Analysis 1a decision tree, details are described in the text.

The top left hand part of the tree is shown below.

Top part of the tree showing the root node and two main nodes, 1 & 2. Node 1 has a rule that water is less than 500 m; Node 2 has a rule that water is greater than 500 m. The support for Node 1 is 19% with a confidence of 100%.

This image of the tree has been edited for interpretative purposes. The Node 1 box has been coloured blue to highlight it and the horizontal line from node 1 has been drastically shortened to show the leaf node details (support and confidence). This part of the tree shows that cases in the root node are split by a rule which uses a distance to water threshold.

What is true about cases in Node 2?

If you click on Node 1 in the tree and then on the View Node button at the top left of the tree sheet, you will be given details of the cases for which this rule is true, i.e. they are all less than 500 m from water.

The View Node sheet informs us that this is Node 1 (others can be viewed using the up/down arrows). It is a leaf node and has 38 records (cases), which is 19% of the training cases (38/200 = 19%). The majority class is the second class (presence = 1), and all 38 cases are class 1, so there are no misclassifications. The pie chart shows that 100% of the cases are class 1.

If you follow the horizontal line to the right from Node 1 it eventually takes you to a box which is labelled 1, i.e. cases are predicted to be presence = 1. The support is 19% (19% of the training cases) and the confidence is 100% (because no cases are misclassified).
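The support and confidence figures for this leaf can be reproduced as follows. The definitions used here are inferred from the bracketed explanations above (support as the leaf's share of the training cases, confidence as the share of the leaf's cases that belong to the predicted class); the program's exact definitions are not stated, so treat this Python sketch as an assumption.

```python
def leaf_support_confidence(n_leaf, n_correct, n_training):
    """Support: leaf cases / training cases. Confidence: correct cases / leaf cases."""
    support = n_leaf / n_training
    confidence = n_correct / n_leaf
    return support, confidence


# Node 1: 38 cases, all of them class 1, out of 200 training cases.
support, confidence = leaf_support_confidence(n_leaf=38, n_correct=38, n_training=200)
print(f"{support:.0%} {confidence:.0%}")  # 19% 100%
```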

Interpreting Node 2

Select the correct entries from the drop down list for each gap.

This is the information presented in this view

Node ____ is a non-leaf node; this means that the cases are split again by rules at a greater ____. There are ____ cases and the majority are ____ locations. This node also contains a large number of ____ presence locations.

Node 2 is a non-leaf node; this means that the cases are split again by rules at a greater tree depth. There are 162 cases and the majority are absence locations. This node also contains a large number of misclassified presence locations. Remember that these cases all have a value for distance to water that is greater than 500 m, but this includes almost equal proportions of absence and presence locations. Hence, the tree will attempt to split them using a different, deeper rule.
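The make-up of Node 2 can be worked out from figures given earlier, assuming that the root split sends every training case to either Node 1 or Node 2. The Python sketch below is a derivation from the quoted totals, not output from the program.

```python
# Training totals from Region 3 and the Node 1 details from the View Node sheet.
n_training, n_true0, n_true1 = 200, 91, 109
node1_cases, node1_class1 = 38, 38  # Node 1 is a pure presence (class 1) leaf

# Whatever is not in Node 1 must be in Node 2 (assumed two-way root split).
node2_cases = n_training - node1_cases
node2_presence = n_true1 - node1_class1
node2_absence = n_true0 - (node1_cases - node1_class1)
print(node2_cases, node2_absence, node2_presence)  # 162 91 71
```

On these figures, labelling Node 2 with its majority class (absence) would misclassify 71 presence locations, which is why a further split is needed.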

No further interpretation will be made on this tree due to its complexity. The next sections explore some of the analysis options.