Stepwise Statistics
The previous analysis is repeated using a stepwise variable selection procedure. Note that this approach shares the problems common to all variable selection procedures.
A range of selection methods is available. This example uses the SPSS default method, which is based on minimizing Wilks' lambda.
In this analysis pyruvate kinase was entered first, followed by hemopexin. No other variables were entered because they would not significantly improve the discrimination between the groups. Using the F values and correlation coefficients that you noted earlier, can you see why only these two variables were selected? (An explanation follows the table.)
Step | Entered | Wilks' Lambda | Lambda df1 | Lambda df2 | Lambda df3 | Exact F | F df1 | F df2 | Sig. |
---|---|---|---|---|---|---|---|---|---|
1 | pyruvate kinase | .631 | 1 | 1 | 71.000 | 41.458 | 1 | 71.000 | .000 |
2 | hemopexin | .524 | 2 | 1 | 71.000 | 31.781 | 2 | 70.000 | .000 |

At each step, the variable that minimizes the overall Wilks' Lambda is entered.

a Maximum number of steps is 8.

b Minimum partial F to enter is 3.84.

c Maximum partial F to remove is 2.71.

d F level, tolerance, or VIN insufficient for further computation.
Recall that pyruvate kinase had the largest F value, so the two groups differed most with respect to this variable. Hemopexin had the second largest F value and was also almost uncorrelated with pyruvate kinase. The other two variables were both correlated with pyruvate kinase, so they could do little to improve the separation of the groups.
A summary of the final model shows that at step 1 the pyruvate kinase predictor was selected. In the second, and last step, hemopexin was added.
Step | Variable | Tolerance | F to Remove | Wilks' Lambda |
---|---|---|---|---|
1 | pyruvate kinase | 1.000 | 41.458 | |
2 | pyruvate kinase | 0.995 | 26.142 | 0.720 |
2 | hemopexin | 0.995 | 14.324 | 0.631 |
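The tolerance of 0.995 supports the statement that the two selected predictors are almost uncorrelated. The following is a minimal sketch, assuming the usual definition of tolerance as 1 − R², where R² comes from regressing a predictor on the other predictors in the model. With only one other predictor, R² is simply the squared pooled within-groups correlation. The function name is illustrative and is not part of SPSS.

```python
import math


def implied_abs_correlation(tolerance: float) -> float:
    """Absolute correlation implied by a tolerance value.

    Assumes tolerance = 1 - r**2, which holds when exactly one other
    predictor is in the model.
    """
    if not 0.0 <= tolerance <= 1.0:
        raise ValueError("tolerance must lie between 0 and 1")
    return math.sqrt(1.0 - tolerance)


# Tolerance of pyruvate kinase and hemopexin at step 2 (from the table above).
r = implied_abs_correlation(0.995)
print(f"implied |r| between the two predictors: {r:.2f}")  # about 0.07
```

A tolerance close to 1 therefore means that a predictor carries information that is largely independent of the predictors already in the model.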
The next table lists the predictors that were not in the model at each step. At step 0 no predictors have been entered, and the 'F to Enter' statistics are used to rank the candidates.
Step | Variable | Tolerance | Min. Tolerance | F to Enter | Wilks' Lambda |
---|---|---|---|---|---|
0 | creatine kinase | 1.000 | 1.000 | 18.958 | 0.789 |
0 | hemopexin | 1.000 | 1.000 | 27.634 | 0.720 |
0 | lactate dehydrogenase | 1.000 | 1.000 | 16.258 | 0.814 |
0 | pyruvate kinase | 1.000 | 1.000 | 41.458 | 0.631 |
1 | creatine kinase | 0.857 | 0.857 | 2.682 | 0.608 |
1 | hemopexin | 0.995 | 0.995 | 14.324 | 0.524 |
1 | lactate dehydrogenase | 0.754 | 0.754 | 0.583 | 0.626 |
2 | creatine kinase | 0.854 | 0.850 | 2.858 | 0.503 |
2 | lactate dehydrogenase | 0.751 | 0.748 | 0.812 | 0.518 |
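The entry decisions can be replayed from the 'F to Enter' column. The sketch below is a simplified illustration, not SPSS's implementation: it ignores the removal test and the tolerance check, and relies on the fact that, among candidates at the same step, the smallest Wilks' Lambda corresponds to the largest F to enter. The dictionary and function names are hypothetical.

```python
F_TO_ENTER_MIN = 3.84  # minimum partial F to enter (footnote b of the first table)

# 'F to Enter' values copied from the table above, keyed by step.
candidates_by_step = {
    0: {"creatine kinase": 18.958, "hemopexin": 27.634,
        "lactate dehydrogenase": 16.258, "pyruvate kinase": 41.458},
    1: {"creatine kinase": 2.682, "hemopexin": 14.324,
        "lactate dehydrogenase": 0.583},
    2: {"creatine kinase": 2.858, "lactate dehydrogenase": 0.812},
}


def next_entry(f_to_enter, threshold=F_TO_ENTER_MIN):
    """Return the candidate with the largest F to enter, or None if it is below the threshold."""
    best = max(f_to_enter, key=f_to_enter.get)
    return best if f_to_enter[best] >= threshold else None


entered = [next_entry(candidates_by_step[step]) for step in sorted(candidates_by_step)]
print(entered)  # ['pyruvate kinase', 'hemopexin', None]
```

At step 2 the best remaining candidate, creatine kinase, has an F to enter of 2.858, which is below 3.84, so the procedure stops with two predictors.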
Step | Number of Variables | Lambda | Lambda df1 | Lambda df2 | Lambda df3 | Exact F | F df1 | F df2 | Sig. |
---|---|---|---|---|---|---|---|---|---|
1 | 1 | .631 | 1 | 1 | 71 | 41.458 | 1 | 71.000 | <0.0001 |
2 | 2 | .524 | 2 | 1 | 71 | 31.781 | 2 | 70.000 | <0.0001 |
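The F statistics in these tables can be approximately reproduced from the Wilks' Lambda values. The sketch below assumes the standard two-group relationship between lambda and the exact F, and the usual partial F for adding one variable. The function names are illustrative. Because the lambdas in the tables are rounded to three decimals, the results agree with the SPSS output only to about the first decimal place.

```python
def exact_f_two_groups(wilks_lambda, n_cases, n_vars):
    """Exact F for the overall Wilks' lambda when there are two groups.

    Returns (F, df1, df2) with df1 = p and df2 = n - p - 1.
    """
    df1 = n_vars
    df2 = n_cases - n_vars - 1
    f_value = (1.0 - wilks_lambda) / wilks_lambda * df2 / df1
    return f_value, df1, df2


def f_to_enter(lambda_before, lambda_after, n_cases, n_groups, n_vars_in):
    """Partial F for adding one variable to a model that already holds n_vars_in variables."""
    df2 = n_cases - n_groups - n_vars_in
    return (lambda_before / lambda_after - 1.0) * df2 / (n_groups - 1)


N_CASES, N_GROUPS = 73, 2

f_step1, _, df2_step1 = exact_f_two_groups(0.631, N_CASES, n_vars=1)
f_step2, _, df2_step2 = exact_f_two_groups(0.524, N_CASES, n_vars=2)
f_hemopexin = f_to_enter(0.631, 0.524, N_CASES, N_GROUPS, n_vars_in=1)

print(f"step 1 exact F: {f_step1:.2f} on (1, {df2_step1}) df")
print(f"step 2 exact F: {f_step2:.2f} on (2, {df2_step2}) df")
print(f"F to enter for hemopexin at step 1: {f_hemopexin:.2f}")
```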
Summary of Canonical Discriminant Functions
The remainder of the output is similar to the full model, so only the additional aspects are described further.
Function | Eigenvalue | % of Variance | Cumulative % | Canonical Correlation |
---|---|---|---|---|
1 | 0.908(a) | 100.0 | 100.0 | 0.690 |
a First 1 canonical discriminant functions were used in the analysis. |
Test of Function(s) | Wilks' Lambda | Chi-square | df | Sig. |
---|---|---|---|---|
1 | 0.524 | 45.225 | 2 | 0.000 |
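With a single discriminant function, the quantities in the two tables above are linked through the eigenvalue. The sketch below assumes the standard relationships (canonical correlation and Wilks' Lambda as functions of the eigenvalue) and Bartlett's chi-square approximation. The function name is illustrative.

```python
import math


def summarize_single_function(eigenvalue, n_cases, n_vars, n_groups):
    """Derive canonical correlation, Wilks' lambda, chi-square and df from one eigenvalue."""
    canonical_corr = math.sqrt(eigenvalue / (1.0 + eigenvalue))
    wilks_lambda = 1.0 / (1.0 + eigenvalue)
    # Bartlett's approximation: -(n - 1 - (p + g) / 2) * ln(lambda)
    chi_square = -(n_cases - 1 - (n_vars + n_groups) / 2.0) * math.log(wilks_lambda)
    df = n_vars * (n_groups - 1)
    return canonical_corr, wilks_lambda, chi_square, df


corr, lam, chi2, df = summarize_single_function(0.908, n_cases=73, n_vars=2, n_groups=2)
print(f"canonical correlation = {corr:.3f}")
print(f"Wilks' lambda = {lam:.3f}")
print(f"chi-square = {chi2:.2f} on {df} df")
```

The results match the reported canonical correlation (0.690), Wilks' Lambda (0.524) and chi-square (45.225) to within the rounding of the eigenvalue.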
Variable | Function 1 |
---|---|
hemopexin | 0.599 |
pyruvate kinase | 0.758 |
Although only two predictors were used in the discriminant function, it is still possible for other variables to be correlated with the discriminant score. From the correlation coefficients listed below it is apparent that the difference between the groups is largely due to the pyruvate kinase and hemopexin values.
Variable | Function 1 |
---|---|
pyruvate kinase | 0.802 |
hemopexin | 0.655 |
lactate dehydrogenase(a) | 0.366 |
creatine kinase(a) | 0.270 |

Pooled within-groups correlations between discriminating variables and standardized canonical discriminant functions. Variables ordered by absolute size of correlation within function.

a This variable not used in the analysis.
Duchenne Muscular Dystrophy | Function 1 |
---|---|
NonCarrier | -0.877 |
Carrier | 1.006 |
Classification Statistics
Duchenne Muscular Dystrophy | Prior | Cases Used in Analysis (Unweighted) | Cases Used in Analysis (Weighted) |
---|---|---|---|
NonCarrier | 0.500 | 39 | 39.000 |
Carrier | 0.500 | 34 | 34.000 |
Total | 1.000 | 73 | 73.000 |
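With equal priors and a single discriminant function, classification reduces to assigning each case to the group with the nearest centroid on the discriminant score, which is equivalent to cutting at the midpoint between the two centroids. The sketch below illustrates this under those assumptions. The scores passed to `classify` are hypothetical examples, not cases from the data.

```python
# Group centroids on function 1, taken from the centroid table above.
CENTROIDS = {"NonCarrier": -0.877, "Carrier": 1.006}


def classify(score, centroids=CENTROIDS):
    """Assign a discriminant score to the group with the nearest centroid.

    Valid as a shortcut only for equal priors and one discriminant function.
    """
    return min(centroids, key=lambda group: abs(score - centroids[group]))


# Midpoint between the centroids: scores above it are classified as Carrier.
cutoff = sum(CENTROIDS.values()) / 2
print(f"cutoff = {cutoff:.4f}")

for score in (-1.2, 0.0, 0.5):  # hypothetical discriminant scores
    print(score, "->", classify(score))
```

If the priors were unequal, the cutoff would shift away from the midpoint towards the centroid of the less probable group.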
Sample | Measure | Duchenne Muscular Dystrophy | Predicted NonCarrier | Predicted Carrier | Total |
---|---|---|---|---|---|
Original | Count | NonCarrier | 34 | 5 | 39 |
Original | Count | Carrier | 5 | 29 | 34 |
Original | % | NonCarrier | 87.2 | 12.8 | 100.0 |
Original | % | Carrier | 14.7 | 85.3 | 100.0 |
Cross-validated(a) | Count | NonCarrier | 33 | 6 | 39 |
Cross-validated(a) | Count | Carrier | 5 | 29 | 34 |
Cross-validated(a) | % | NonCarrier | 84.6 | 15.4 | 100.0 |
Cross-validated(a) | % | Carrier | 14.7 | 85.3 | 100.0 |

a Cross validation is done only for those cases in the analysis. In cross validation, each case is classified by the functions derived from all cases other than that case.

b 86.3% of original grouped cases correctly classified.

c 84.9% of cross-validated grouped cases correctly classified.
Despite using fewer predictors, there has been only a marginal decline in prediction accuracy. Indeed, it could be argued that performance has improved, since fewer carriers are misclassified.
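The percentages in the classification table can be recomputed directly from the counts. The following is a small sketch. The function name and the nested-dictionary layout are illustrative choices.

```python
def classification_rates(confusion):
    """Return overall accuracy and per-group correct rates, both in percent.

    `confusion[actual][predicted]` holds the number of cases.
    """
    total = sum(sum(row.values()) for row in confusion.values())
    correct = sum(confusion[group][group] for group in confusion)
    per_group = {
        group: 100.0 * confusion[group][group] / sum(confusion[group].values())
        for group in confusion
    }
    return 100.0 * correct / total, per_group


# Counts copied from the classification table.
original = {
    "NonCarrier": {"NonCarrier": 34, "Carrier": 5},
    "Carrier": {"NonCarrier": 5, "Carrier": 29},
}
cross_validated = {
    "NonCarrier": {"NonCarrier": 33, "Carrier": 6},
    "Carrier": {"NonCarrier": 5, "Carrier": 29},
}

acc_original, groups_original = classification_rates(original)
acc_cv, groups_cv = classification_rates(cross_validated)
print(f"original: {acc_original:.1f}% correct")       # 86.3
print(f"cross-validated: {acc_cv:.1f}% correct")      # 84.9
```

The cross-validated figure is the more realistic estimate of how the two-predictor function would perform on new cases, because each case is classified by a function fitted without it.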