MMU - MSC Multivariate Statistics, Decision Tree SAQ: Analysis 2

Analysis 2: Investigating tree structure

These nine questions assume that the optimum tree is being used. This has a maximum depth of 6, with pruning and the last 50 cases used for testing. Make sure that you have the correct settings before building the tree.

The first rule

If a location is 400 m from water the prediction is that a mountain sheep will be present.

In order to answer the next question you need to examine the rules in each of the nodes.

Variables in rules

Which predictor featured in the smallest number of nodes?

The next question asks you to match the rules to the nodes. For example, node 5 has the rule "slope < 65". But remember that in order to reach node 5 a case must first have passed through node 2. Therefore, cases in node 5 have water > 500 and slope < 65.
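The way rules accumulate along a path can be sketched in a few lines of Python. This is only an illustration, not the course software: the function name `satisfies_path` and the tuple layout are assumptions made for the example, and the thresholds (water 500, slope 65) are used as illustrative values for the path to node 5.

```python
import operator

# Comparison operators allowed in a rule.
OPS = {"<": operator.lt, ">=": operator.ge}

# Hypothetical encoding of the rules met on the way to node 5:
# first node 2 (water not less than 500), then node 5 (slope less than 65).
PATH_TO_NODE_5 = [("water", ">=", 500), ("slope", "<", 65)]


def satisfies_path(case, path):
    """Return True only if the case passes every rule on the path."""
    return all(OPS[op](case[var], threshold) for var, op, threshold in path)


print(satisfies_path({"water": 800, "slope": 20}, PATH_TO_NODE_5))  # True
print(satisfies_path({"water": 800, "slope": 70}, PATH_TO_NODE_5))  # False
```

A case belongs to a node only when every rule on the path is true, which is why the rule printed in a node never tells the whole story on its own.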

Rules and Nodes

Match the rules to the Node numbers.

Nodes 7 to 10

Which feature are nodes 7 to 10 using to segregate cases?

Nodes are either non-leaf or leaf nodes. Leaf nodes are terminal nodes: they are not split by any further rules.

Leaf nodes

Which of the following combinations are composed entirely of leaf nodes?

Leaf nodes can be 'pure' (all cases have the same class) or 'impure' (cases are a mix of classes). The next question concerns an impure leaf node.

Node 16

Why was node 16 not split any further when it contains 10 absence and 8 presence locations?

	a)	Because not splitting these cases produces the minimum error with the test data.
	b)	Because the purity of the node is above the maximum for pruning.
	c)	Because its depth is 6.
	d)	Because the cases are identical with respect to their predictors.
Node 16 is at a depth of 6, so the maximum depth criterion applies. Accuracy with the test data is not used in tree building; recall that pruning makes use of cross-validated training data. Even if a maximum purity criterion had been set, answer b) would not apply because the split is almost 50:50. Finally, the cases are not identical; if they had been, it would not be possible to split this node.
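To see how close to 50:50 node 16 is, its impurity can be computed from the class counts given in the question (10 absence, 8 presence). The sketch below uses the Gini index, which is one common impurity measure; it is an assumption that this is comparable to whatever measure the tree-building software uses, and the function names are made up for the example.

```python
def gini_impurity(counts):
    """Gini impurity of a node from its class counts (0 means pure)."""
    total = sum(counts)
    return 1.0 - sum((n / total) ** 2 for n in counts)


def majority_fraction(counts):
    """Proportion of cases that belong to the dominant class."""
    return max(counts) / sum(counts)


node_16 = [10, 8]  # absence, presence (counts from the question)

print(round(gini_impurity(node_16), 3))      # 0.494
print(round(majority_fraction(node_16), 3))  # 0.556
```

With two classes the Gini index cannot exceed 0.5, so a value of about 0.494 confirms that neither class dominates this node.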

Some nodes are very impure and contain a lot of cases. If a node is very impure, one of the classes will not dominate by very much.

The worst node

Which leaf node number failed to separate the largest number of cases? (Type in your answer as a number)

The confusion matrices summarise the performance of the tree, in accuracy terms. The next question asks you to fill in the gaps.

Misclassifications and accuracy

Select the appropriate value from each drop-down list to complete this description of the accuracy of the tree.

About 1 in ____ cases were misclassified in both training and test sets. The largest number of misclassifications was ____, and these were misclassifications of ____ locations as ____. Very few absence locations were ____ as presence. The large number of presence locations that were misclassified is not too surprising; there are many reasons why sheep were not recorded in ____ locations. For example, if sheep ____ had been low it would be impossible for sheep to occupy all possible locations. Similarly, these are point observations at one point in time; it is quite possible that they had used an ____ location at some previous time.

About 1 in 4 cases were misclassified in both training and test sets. The largest number of misclassifications was 42, and these were misclassifications of presence locations as absences. Very few absence locations were misclassified as presence. The large number of presence locations that were misclassified is not too surprising; there are many reasons why sheep were not recorded in suitable locations. For example, if sheep density had been low it would be impossible for sheep to occupy all possible locations. Similarly, these are point observations at one point in time; it is quite possible that they had used an absence location at some previous time.
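The arithmetic behind a statement such as "about 1 in 4 cases were misclassified" can be sketched as follows. The counts below are hypothetical and are NOT the values from the exercise's confusion matrices; they are chosen only so that the sums are easy to follow. The function names are likewise assumptions for this example.

```python
# Hypothetical counts, not taken from the exercise.
# Keys are (observed class, predicted class).
confusion = {
    ("absence", "absence"): 50,
    ("absence", "presence"): 5,
    ("presence", "absence"): 20,
    ("presence", "presence"): 25,
}


def misclassification_rate(matrix):
    """Share of cases whose predicted class differs from the observed one."""
    total = sum(matrix.values())
    wrong = sum(n for (obs, pred), n in matrix.items() if obs != pred)
    return wrong / total


def largest_error(matrix):
    """Return the (observed, predicted) pair with the most misclassifications."""
    errors = {pair: n for pair, n in matrix.items() if pair[0] != pair[1]}
    return max(errors, key=errors.get)


print(misclassification_rate(confusion))  # 0.25
print(largest_error(confusion))           # ('presence', 'absence')
```

The off-diagonal cells are the misclassifications, so comparing them shows not only how often the tree is wrong but also in which direction.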

One of the reasons for building the tree is to make future predictions. For example, predicting the locations where you might expect to see sheep. You could test this by going out and collecting data which are then compared with the predictions. Although you wouldn't normally make the predictions by hand it is a good test of your comprehension of the tree.

Making predictions

Using the values for water, slope, aspect and vegetation, use the tree to identify which cases would be correctly classified.

	a)	ABSENT water 1750, slope 20, aspect 2, vegetation 0.8
	b)	ABSENT water 800, slope 70, aspect 1, vegetation 10
	c)	ABSENT water 450, slope 20, aspect 2, vegetation 0.5
	d)	PRESENT water 2500, slope 10, aspect 2, vegetation 0.1
	e)	PRESENT water 1750, slope 45, aspect 1, vegetation 5
	f)	PRESENT water 400, slope 35, aspect 1, vegetation 5
The nodes that these values satisfy are listed after each entry. For example, node 1 tests if water is less than 500.

	a)	ABSENT water 1750, slope 20, aspect 2, vegetation 0.8: node 2, node 3, node 5, node 8, node 13, which predicts absence
	b)	ABSENT water 800, slope 70, aspect 1, vegetation 10: node 2, node 3, node 6, which predicts absence
	c)	ABSENT water 450, slope 20, aspect 2, vegetation 0.5: Incorrect: node 1 predicts presence
	d)	PRESENT water 2500, slope 10, aspect 2, vegetation 0.1: Incorrect: node 2, node 4, which predicts absence
	e)	PRESENT water 1750, slope 45, aspect 1, vegetation 5: node 2, node 3, node 5, node 7, node 12, which predicts presence
	f)	PRESENT water 400, slope 35, aspect 1, vegetation 5: node 1 predicts presence
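Checking which cases are correctly classified comes down to comparing each observed class with the class predicted by the leaf the case ends in. The short Python sketch below tallies the six cases using the predictions given in the feedback above; the variable names are assumptions for the example, and the tree itself is not reproduced here.

```python
# (observed class, class predicted by the tree), in the order a) to f).
cases = [
    ("absent", "absent"),    # a) ends in node 13
    ("absent", "absent"),    # b) ends in node 6
    ("absent", "present"),   # c) ends in node 1
    ("present", "absent"),   # d) ends in node 4
    ("present", "present"),  # e) ends in node 12
    ("present", "present"),  # f) ends in node 1
]

labels = "abcdef"

# A case is correctly classified when observed and predicted classes agree.
correct = [label for label, (obs, pred) in zip(labels, cases) if obs == pred]

print(correct)  # ['a', 'b', 'e', 'f']
```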

Decision Tree Self Assessment