Background
This analysis uses the same data as the previous example; however, the last 25 cases from each class have been removed from the training data and retained for testing (one way such a split could be made is sketched after the list below). Consequently, there are now two data sets.
- A training set with 100 cases (50 in each class)
- A testing set with 50 cases (25 in each class)
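The split itself was prepared in advance of the analysis; the lines below are only an illustration of how the last 25 cases of each class could be held out, assuming a single combined data frame (here called alldata, a hypothetical name) with a two-level CLASS factor.

# Hypothetical illustration only: 'alldata' is an assumed combined data frame.
# Split the row indices by class and keep the last 25 of each class for testing.
test_idx <- unlist(lapply(split(seq_len(nrow(alldata)), alldata$CLASS),
                          function(i) tail(i, 25)))
testset  <- alldata[test_idx, ]    # 50 cases, 25 per class
trainset <- alldata[-test_idx, ]   # 100 cases, 50 per class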
Two analyses are presented. The first uses a loess smoother with a 0.25 span, while the second increases the span to 0.75. The output from R is shown below with comments.
First, the two data sets are imported from SPSS files into the data frames trainset and testset. The read.spss() function comes from the foreign package; the gam package, which supplies gam() and lo(), is loaded at the same time.
library(foreign)   # provides read.spss()
library(gam)       # provides gam() and lo() for the analyses below

trainset <- read.spss("lr_b_train.sav", to.data.frame = TRUE)
testset  <- read.spss("lr_b_test.sav", to.data.frame = TRUE)
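As an optional check, not part of the original output, the class counts can be tabulated to confirm the 100/50 split described above; this assumes CLASS was read in as a factor.

table(trainset$CLASS)   # expected: 50 cases in each class
table(testset$CLASS)    # expected: 25 cases in each class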
Analysis 1, 0.25 span
Next, the model is specified and fitted. The summary() command then provides the information needed to interpret the analysis.
> gam.object25 <- gam(CLASS ~ lo(B1, span=0.25) + lo(B2, span=0.25) +
+ lo(B3, span=0.25) + lo(B4, span=0.25), family = binomial, data=trainset, trace=TRUE)
GAM lo.wam loop 1: deviance = 59.69833
GAM lo.wam loop 2: deviance = 45.15113
GAM lo.wam loop 3: deviance = 36.93244
GAM lo.wam loop 4: deviance = 31.94526
GAM lo.wam loop 5: deviance = 28.78603
GAM lo.wam loop 6: deviance = 26.93014
GAM lo.wam loop 7: deviance = 26.01743
GAM lo.wam loop 8: deviance = 25.61917
GAM lo.wam loop 9: deviance = 25.43764
GAM lo.wam loop 10: deviance = 25.35002
GAM lo.wam loop 11: deviance = 25.30850
GAM lo.wam loop 12: deviance = 25.28738
GAM lo.wam loop 13: deviance = 25.27666
GAM lo.wam loop 14: deviance = 25.27162
GAM lo.wam loop 15: deviance = 25.26929
GAM lo.wam loop 16: deviance = 25.26821
GAM lo.wam loop 17: deviance = 25.26771
GAM lo.wam loop 18: deviance = 25.26748
GAM lo.wam loop 19: deviance = 25.26737
GAM lo.wam loop 20: deviance = 25.26732
GAM lo.wam loop 21: deviance = 25.26729
GAM lo.wam loop 22: deviance = 25.26728
GAM lo.wam loop 23: deviance = 25.26728
GAM lo.wam loop 24: deviance = 25.26727
GAM lo.wam loop 25: deviance = 25.26727

> summary(gam.object25)

Call: gam(formula = CLASS ~ lo(B1, span = 0.25) + lo(B2, span = 0.25) +
    lo(B3, span = 0.25) + lo(B4, span = 0.25), family = binomial,
    data = trainset, trace = TRUE)

Deviance Residuals:
       Min         1Q     Median         3Q        Max
-1.5202875 -0.2197366 -0.0003062  0.2552785  1.8220946

(Dispersion Parameter for binomial family taken to be 1)

    Null Deviance: 138.6294 on 99 degrees of freedom
Residual Deviance: 25.2673 on 65.9892 degrees of freedom
AIC: 93.2889

Number of Local Scoring Iterations: 25
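The fitted smooths themselves can also be inspected graphically. This step is not part of the original output; it is a possible follow-up using the plotting method supplied by the gam package.

par(mfrow = c(2, 2))             # one panel per smooth term
plot(gam.object25, se = TRUE)    # fitted smooths with standard-error bands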
Finally, the model is used to make predictions for the test data, which are then written to an external CSV file (a note on the scale of these predictions follows the code).
> trnpredict25 <- predict(gam.object25, testset)
> write.csv(trnpredict25, file = "f:/lo25_predictions.csv")
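By default, predict() for a binomial GAM returns values on the linear-predictor (logit) scale, which is what is written to the file above. If fitted probabilities or hard class predictions are wanted instead, something like the following could be used; the 0.5 cut-off is an assumption for illustration, not part of the original analysis.

testprob25  <- predict(gam.object25, testset, type = "response")  # fitted probabilities
testclass25 <- ifelse(testprob25 > 0.5, 1, 0)                     # assumed 0.5 cut-off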
Analysis 2, 0.75 span
The same model is now refitted with the span of each loess smoother increased to 0.75, followed by predictions for the test data.
> gam.object75 <- gam(CLASS ~ lo(B1, span=0.75) + lo(B2, span=0.75) +
+ lo(B3, span=0.75) + lo(B4, span=0.75), family = binomial, data=trainset, trace=TRUE)
GAM lo.wam loop 1: deviance = 77.94521
GAM lo.wam loop 2: deviance = 70.27621
GAM lo.wam loop 3: deviance = 67.13024
GAM lo.wam loop 4: deviance = 66.0293
GAM lo.wam loop 5: deviance = 65.77753
GAM lo.wam loop 6: deviance = 65.74575
GAM lo.wam loop 7: deviance = 65.74333
GAM lo.wam loop 8: deviance = 65.74319
GAM lo.wam loop 9: deviance = 65.74318
GAM lo.wam loop 10: deviance = 65.74318

> summary(gam.object75)

Call: gam(formula = CLASS ~ lo(B1, span = 0.75) + lo(B2, span = 0.75) +
    lo(B3, span = 0.75) + lo(B4, span = 0.75), family = binomial,
    data = trainset, trace = TRUE)

Deviance Residuals:
     Min       1Q   Median       3Q      Max
-1.77468 -0.54864 -0.07332  0.44083  2.31262

(Dispersion Parameter for binomial family taken to be 1)

    Null Deviance: 138.6294 on 99 degrees of freedom
Residual Deviance: 65.7432 on 88.1628 degrees of freedom
AIC: 89.4175

Number of Local Scoring Iterations: 10

DF for Terms and Chi-squares for Nonparametric Effects

                     Df Npar Df Npar Chisq  P(Chi)
(Intercept)         1.0
lo(B1, span = 0.75) 1.0     1.6     1.8777  0.3096
lo(B2, span = 0.75) 1.0     2.0     4.5465  0.1062
lo(B3, span = 0.75) 1.0     1.2    10.5442  0.0017
lo(B4, span = 0.75) 1.0     1.9     1.2628  0.5158

> trnpredict75 <- predict(gam.object75, testset)
> write.csv(trnpredict75, file = "f:/lo75_predictions.csv")
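The two fits could also be compared directly on the training data. The comparison below is an addition, not part of the original output; it treats the change in deviance between the smoother (0.75) and rougher (0.25) fits as an approximate chi-squared statistic.

# Approximate chi-squared comparison of the two fitted GAMs (added illustration)
anova(gam.object75, gam.object25, test = "Chisq")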
Accuracy
The accuracy of these models was assessed by applying each model, fitted to the training data, to the test data and estimating the AUC (area under the ROC curve) together with its lower and upper confidence limits (LCL and UCL). The results are shown in the table below, followed by a sketch of how such an AUC could be obtained in R.
Model | AIC | AUC | LCL | UCL |
---|---|---|---|---|
Logistic regression | 96.09 | 0.918 | 0.847 | 0.990 |
GAM, loess span 0.25 | 93.29 | 0.862 | 0.763 | 0.962 |
GAM, loess span 0.75 | 89.42 | 0.912 | 0.838 | 0.986 |
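The AUC values and confidence limits in the table were computed outside the R session shown above. One way they could be obtained is sketched below using the pROC package; the package choice, and the use of the link-scale predictions (the AUC is rank-based, so the prediction scale does not matter), are assumptions rather than part of the original analysis.

library(pROC)
roc25 <- roc(testset$CLASS, as.numeric(trnpredict25))  # test-set ROC for the 0.25-span model
auc(roc25)      # area under the ROC curve
ci.auc(roc25)   # confidence limits for the AUC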
Questions
These questions relate to the two analyses described above and also make use of the results from the logistic regression analysis of the same data.
1.
2.
3.