
Analysis 2: Investigating tree structure

These nine questions assume that the optimum tree is being used. This tree has a maximum depth of 6, uses pruning, and has the last 50 cases set aside for testing. Make sure that you have the correct settings before building the tree.

1

The first rule

If a location is 400 m from water the prediction is that a mountain sheep will be present.

a) True
b) False
True. Node 1 is the first rule, and is a test of the distance to water. If a location is less than 500 m from water the prediction is presence. This rule separates off 38 presence cases into a leaf node.

In order to answer the next question you need to examine the rules in each of the nodes.

2

Variables in rules

Which predictor featured in the smallest number of nodes?

a) Distance to water
b) Slope
c) Aspect
d) Vegetation
The correct answer is Vegetation. It was only used in 2 minor nodes, followed by aspect (4 nodes). Water and slope featured in 6 nodes each.

The next question asks you to match the rules to the nodes. For example, node 5 has the rule "slope < 65". But remember that in order to get to node 5 you must have passed through node 2. Therefore, cases in node 5 have a value for water > 500 and a slope < 65.
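The way rules accumulate along a path can be sketched in a few lines of Python. This is only an illustration, not part of the spreadsheet: the tuple layout and function names are hypothetical, only the two rules named in the paragraph above are included, and the node 2 threshold is taken to be 500 m (the complement of the node 1 rule).

```python
import operator

# Map the rule symbols to comparison functions.
OPS = {"<": operator.lt, ">": operator.gt, "=": operator.eq}

# Each entry is (node number, predictor, operator, threshold).
# Assumption: node 2 is the "water > 500" branch; node 5 adds "slope < 65".
path_to_node_5 = [
    (2, "water", ">", 500),
    (5, "slope", "<", 65),
]

def describe_path(path):
    """Combine the rules along a path into one readable condition."""
    return " and ".join(f"{var} {op} {value}" for _, var, op, value in path)

def satisfies_path(case, path):
    """Return True when the case passes every rule on the way to a node."""
    return all(OPS[op](case[var], value) for _, var, op, value in path)

print(describe_path(path_to_node_5))
print(satisfies_path({"water": 1750, "slope": 20}, path_to_node_5))
print(satisfies_path({"water": 800, "slope": 70}, path_to_node_5))
```

A case only belongs to a node if it satisfies every rule on the path, so the second case (slope 70) fails even though it passes the water test.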

3

Rules and Nodes

Match the rules to the Node numbers.

a) Water > 2250
b) Slope > 65
c) Aspect = 3
d) Slope < 15
e) Vegetation > 0.4
Go to the TREE sheet and click on a node. You will see the rule in the box; it will also appear in the insert function (fx) window.

4

Nodes 7 to 10

Which feature do nodes 7 to 10 use to segregate cases?

a) Distance to water
b) Slope
c) Aspect
d) Vegetation
The correct answer is Aspect. Note that this is a categorical variable, so the only possible test is one of equality. It makes no sense to use 'greater or less than' with this type of variable.

Nodes are either non-leaf or leaf nodes. Leaf nodes are terminal nodes: they are not split by any further rules.

5

Leaf nodes

Which of the following combinations are composed entirely of leaf nodes?

a) 6, 9, 11, 12, 16
b) 1, 6, 10, 16, 17
c) 1, 6, 11, 12, 16
d) 1, 7, 11, 12, 16
The correct answer is 1, 6, 11, 12, 16. All of the others contain at least one non-leaf node. Leaf nodes are the terminal nodes, i.e. they are not split again. Some are all one class, others are mixed. If a terminal node is mixed, all cases are labelled by the majority class (note that other programs may use more sophisticated mechanisms such as minimising the cost of misclassifications).

Leaf nodes can be 'pure' (all cases have the same class) or 'impure' (cases are a mix of classes). The next question concerns an impure leaf node.
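As a rough illustration of purity and majority-class labelling, the sketch below summarises a leaf from the classes of the cases it holds. The function name and the returned fields are hypothetical, and the way ties are broken (the class seen first wins) is an assumption rather than anything the spreadsheet is known to do. The example counts are the 10 absence and 8 presence cases quoted for node 16 in the next question.

```python
from collections import Counter

def summarise_leaf(classes):
    """Summarise a non-empty leaf given the class of each case in it."""
    counts = Counter(classes)
    # most_common(1) returns the class with the highest count; on a tie
    # it returns the class encountered first (an assumption for this sketch).
    label, n_majority = counts.most_common(1)[0]
    total = sum(counts.values())
    return {
        "label": label,                       # class assigned to every case
        "pure": len(counts) == 1,             # True if only one class present
        "purity": n_majority / total,         # share held by the majority
        "misclassified": total - n_majority,  # cases given the wrong label
    }

leaf = summarise_leaf(["absence"] * 10 + ["presence"] * 8)
print(leaf["label"], leaf["pure"], leaf["misclassified"])
```

Every case in this leaf is labelled absence, so the 8 presence cases it contains are misclassified.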

6

Node 16

Why was node 16 not split any further when it contains 10 absence and 8 presence locations?

a) Because not splitting these cases produces the minimum error with the test data.
b) Because the purity of the node is above the maximum for pruning.
c) Because its depth is 6.
d) Because the cases are identical with respect to their predictors.
Node 16 is at a depth of 6, so the maximum depth criterion is applied. Accuracy with the test data is not used in tree building. Recall that pruning makes use of cross-validated training data. Even if a maximum purity criterion had been set, answer b) would not apply because the split is almost 50:50. Finally, the cases are not identical; if they had been, it would not be possible to split this node.

Some nodes are very impure and contain a lot of cases. If a node is very impure, the majority class will not dominate by very much.

7

The worst node

Which leaf node number failed to separate the largest number of cases? (Type in your answer as a number)


The confusion matrices summarise the performance of the tree, in accuracy terms. The next question asks you to fill in the gaps.

8

Misclassifications and accuracy

Select the appropriate value from the drop down list to complete this description of the accuracy of the tree.

About 1 in ____ cases were misclassified in both training and test sets. The largest number of misclassifications was ____, and these were misclassifications of ____ locations as ____. Very few absence locations were ____ as presence. The large number of presence locations that were misclassified is not too surprising; there are many reasons why sheep were not recorded in ____ locations. For example, if sheep ____ had been low it would be impossible for sheep to occupy all possible locations. Similarly, these are point observations, at one point in time; it is quite possible that they had used an ____ location at some previous time.

About 1 in 4 cases were misclassified in both training and test sets. The largest number of misclassifications was 42, and these were misclassifications of presence locations as absences. Very few absence locations were misclassified as presence. The large number of presence locations that were misclassified is not too surprising; there are many reasons why sheep were not recorded in suitable locations. For example, if sheep density had been low it would be impossible for sheep to occupy all possible locations. Similarly, these are point observations, at one point in time; it is quite possible that they had used an absence location at some previous time.
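For readers who want to see how a misclassification rate is read off a confusion matrix, here is a minimal Python sketch. The counts are hypothetical, chosen only so that the arithmetic is easy to follow; they are not the values from the spreadsheet. The function names are also illustrative.

```python
def misclassification_rate(matrix):
    """matrix maps (actual class, predicted class) to a count of cases."""
    total = sum(matrix.values())
    wrong = sum(n for (actual, predicted), n in matrix.items()
                if actual != predicted)
    return wrong / total

def largest_error(matrix):
    """Return the (actual, predicted) pair with the most misclassifications."""
    errors = {cell: n for cell, n in matrix.items() if cell[0] != cell[1]}
    return max(errors, key=errors.get)

# Hypothetical counts, NOT the tutorial's data.
hypothetical = {
    ("presence", "presence"): 20,
    ("presence", "absence"): 12,
    ("absence", "absence"): 25,
    ("absence", "presence"): 3,
}

rate = misclassification_rate(hypothetical)
worst = largest_error(hypothetical)
print(rate, worst)
```

With these made-up counts, 15 of 60 cases are misclassified (1 in 4), and the largest off-diagonal cell is presence recorded but absence predicted.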

One of the reasons for building the tree is to make future predictions. For example, predicting the locations where you might expect to see sheep. You could test this by going out and collecting data which are then compared with the predictions. Although you wouldn't normally make the predictions by hand it is a good test of your comprehension of the tree.

9

Making predictions

Using the values for water, slope, aspect and vegetation, use the tree to identify which cases would be correctly classified.

a) ABSENT water 1750, slope 20, aspect 2, vegetation 0.8
b) ABSENT water 800, slope 70, aspect 1, vegetation 10
c) ABSENT water 450, slope 20, aspect 2, vegetation 0.5
d) PRESENT water 2500, slope 10, aspect 2, vegetation 0.1
e) PRESENT water 1750, slope 45, aspect 1, vegetation 5
f) PRESENT water 400, slope 35, aspect 1, vegetation 5
The nodes that these values satisfy are listed after each entry. For example, node 1 tests if water is less than 500.

ABSENT water 1750, slope 20, aspect 2, vegetation 0.8

node 2, node 3, node 5, node 8, node 13 which predicts absence

ABSENT water 800, slope 70, aspect 1, vegetation 10

node 2, node 3, node 6 which predicts absence

ABSENT water 450, slope 20, aspect 2, vegetation 0.5

Incorrect: node 1 predicts presence

PRESENT water 2500, slope 10, aspect 2, vegetation 0.1

Incorrect: node 2, node 4 which predicts absence

PRESENT water 1750, slope 45, aspect 1, vegetation 5

node 2, node 3, node 5, node 7, node 12, which predicts presence

PRESENT water 400, slope 35, aspect 1, vegetation 5

node 1 predicts presence
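The hand-tracing above can be mirrored in a short Python sketch that covers only the top of the tree. The water < 500 rule for node 1 is stated in the text; the thresholds for node 4 (water > 2250) and node 6 (slope > 65) are inferred from the rule list in question 3 and the paths listed above, so treat them as assumptions. The behaviour at exactly 500, 2250 or 65 is also assumed, and the rules below node 5 are not given here, so the sketch stops there and returns no prediction.

```python
def trace(case):
    """Return (nodes visited, prediction); prediction is None when the
    case reaches node 5, whose deeper rules are not reproduced here."""
    if case["water"] < 500:           # node 1: leaf, predicts presence
        return [1], "presence"
    path = [2]                        # node 2: water is 500 or more
    if case["water"] > 2250:          # node 4 (inferred): predicts absence
        return path + [4], "absence"
    path.append(3)                    # node 3: water up to 2250
    if case["slope"] > 65:            # node 6 (inferred): predicts absence
        return path + [6], "absence"
    path.append(5)                    # node 5: slope < 65, splits on aspect
    return path, None

def correctly_classified(recorded, case):
    """Compare the recorded class with the prediction, if one is reached."""
    _, predicted = trace(case)
    return None if predicted is None else recorded == predicted

print(trace({"water": 450, "slope": 20}))
print(trace({"water": 2500, "slope": 10}))
print(trace({"water": 800, "slope": 70}))
print(trace({"water": 1750, "slope": 20}))
```

For the four cases printed, the visited nodes match the paths listed above as far as node 5; `correctly_classified("ABSENT".lower(), ...)` style checks then reproduce the "Incorrect" verdicts for cases c) and d).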