Clustering and Classification methods for Biologists



Generalised Additive Model (GAM) Example



Background

This analysis uses the same data as the previous example. However, the last 25 cases from each class have been removed and retained for testing. Consequently, there are now two data sets.

  1. A training set with 100 cases (50 in each class)
  2. A testing set with 50 cases (25 in each class)
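Although the split was made before the files used below were created, a minimal sketch of how the last 25 cases per class could be held out in R (assuming a single combined data frame, here called alldata, with a CLASS column and cases ordered by class):

	# 'alldata' is a hypothetical combined data frame with 75 cases per class
	holdout <- unlist(lapply(split(seq_len(nrow(alldata)), alldata$CLASS),
	                         function(idx) tail(idx, 25)))   # last 25 of each class
	trainset <- alldata[-holdout, ]   # 100 training cases
	testset <- alldata[holdout, ]     # 50 test cases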

Two analyses are presented. The first uses a loess smoother with a span of 0.25, while the second increases the span to 0.75. The output from R is shown below, with comments.

First, the two data sets are imported from SPSS files into the data frames trainset and testset. The read.spss function is provided by the foreign package, and gam and lo by the gam package, so both packages must be loaded.

	library(foreign)   # provides read.spss()
	library(gam)       # provides gam() and lo()
	trainset <- read.spss("lr_b_train.sav", to.data.frame=TRUE)
	testset <- read.spss("lr_b_test.sav", to.data.frame=TRUE)

Analysis 1, 0.25 span

Next, the model is specified and fitted. The summary command provides the information needed to interpret the analysis.

	>gam.object25 <- gam(CLASS ~ lo(B1, span=0.25) + lo(B2, span=0.25) +
	+ lo(B3, span=0.25) + lo(B4, span=0.25), family = binomial, data=trainset, trace=TRUE)
	GAM lo.wam loop 1: deviance = 59.69833 
	GAM lo.wam loop 2: deviance = 45.15113 
	GAM lo.wam loop 3: deviance = 36.93244 
	GAM lo.wam loop 4: deviance = 31.94526 
	GAM lo.wam loop 5: deviance = 28.78603 
	GAM lo.wam loop 6: deviance = 26.93014 
	GAM lo.wam loop 7: deviance = 26.01743 
	GAM lo.wam loop 8: deviance = 25.61917 
	GAM lo.wam loop 9: deviance = 25.43764 
	GAM lo.wam loop 10: deviance = 25.35002 
	GAM lo.wam loop 11: deviance = 25.30850 
	GAM lo.wam loop 12: deviance = 25.28738 
	GAM lo.wam loop 13: deviance = 25.27666 
	GAM lo.wam loop 14: deviance = 25.27162 
	GAM lo.wam loop 15: deviance = 25.26929 
	GAM lo.wam loop 16: deviance = 25.26821 
	GAM lo.wam loop 17: deviance = 25.26771 
	GAM lo.wam loop 18: deviance = 25.26748 
	GAM lo.wam loop 19: deviance = 25.26737 
	GAM lo.wam loop 20: deviance = 25.26732 
	GAM lo.wam loop 21: deviance = 25.26729 
	GAM lo.wam loop 22: deviance = 25.26728 
	GAM lo.wam loop 23: deviance = 25.26728 
	GAM lo.wam loop 24: deviance = 25.26727 
	GAM lo.wam loop 25: deviance = 25.26727 

	>summary(gam.object25)

	Call: gam(formula = CLASS ~ lo(B1, span = 0.25) + lo(B2, span = 0.25) + 
	    lo(B3, span = 0.25) + lo(B4, span = 0.25), family = binomial, 
	    data = trainset, trace = TRUE)
	Deviance Residuals:
	       Min         1Q     Median         3Q        Max 
	-1.5202875 -0.2197366 -0.0003062  0.2552785  1.8220946 

	(Dispersion Parameter for binomial family taken to be 1)

	    Null Deviance: 138.6294 on 99 degrees of freedom
	Residual Deviance: 25.2673 on 65.9892 degrees of freedom
	AIC: 93.2889 

	Number of Local Scoring Iterations: 25
	

Finally, the model is used to make predictions for the test data, which are then written to an external csv file.

	
	>trnpredict25<-predict(gam.object25,testset)
	>write.csv(trnpredict25, file = "f:/lo25_predictions.csv")
	

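Note that, for a binomial model, predict used in this way returns values on the linear predictor (logit) scale rather than probabilities. A minimal sketch of converting them, assuming predict accepts type = "response" (as it does for glm objects):

	# probabilities rather than logits
	p25 <- predict(gam.object25, testset, type = "response")
	# predicted class (0/1), using an assumed 0.5 threshold
	class25 <- as.numeric(p25 > 0.5)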


Analysis 2, 0.75 span

The same model is now fitted with the span increased to 0.75, so that each smoother uses three quarters of the cases in every local fit and is therefore much less flexible.

	>gam.object75 <- gam(CLASS ~ lo(B1, span=0.75) + lo(B2, span=0.75) +
	+ lo(B3, span=0.75) + lo(B4, span=0.75), family = binomial, data=trainset, trace=TRUE)
	GAM lo.wam loop 1: deviance = 77.94521 
	GAM lo.wam loop 2: deviance = 70.27621 
	GAM lo.wam loop 3: deviance = 67.13024 
	GAM lo.wam loop 4: deviance = 66.0293 
	GAM lo.wam loop 5: deviance = 65.77753 
	GAM lo.wam loop 6: deviance = 65.74575 
	GAM lo.wam loop 7: deviance = 65.74333 
	GAM lo.wam loop 8: deviance = 65.74319 
	GAM lo.wam loop 9: deviance = 65.74318 
	GAM lo.wam loop 10: deviance = 65.74318 
	
	>summary(gam.object75)

	Call: gam(formula = CLASS ~ lo(B1, span = 0.75) + lo(B2, span = 0.75) + 
	    lo(B3, span = 0.75) + lo(B4, span = 0.75), family = binomial, 
	    data = trainset, trace = TRUE)
	Deviance Residuals:
	     Min       1Q   Median       3Q      Max 
	-1.77468 -0.54864 -0.07332  0.44083  2.31262 

	(Dispersion Parameter for binomial family taken to be 1)

	    Null Deviance: 138.6294 on 99 degrees of freedom
	Residual Deviance: 65.7432 on 88.1628 degrees of freedom
	AIC: 89.4175 

	Number of Local Scoring Iterations: 10 

	DF for Terms and Chi-squares for Nonparametric Effects

	                     Df Npar Df Npar Chisq  P(Chi)
	(Intercept)         1.0                           
	lo(B1, span = 0.75) 1.0     1.6     1.8777  0.3096
	lo(B2, span = 0.75) 1.0     2.0     4.5465  0.1062
	lo(B3, span = 0.75) 1.0     1.2    10.5442  0.0017
	lo(B4, span = 0.75) 1.0     1.9     1.2628  0.5158
	>trnpredict75<-predict(gam.object75,testset)
	>write.csv(trnpredict75, file = "f:/lo75_predictions.csv")
	



Accuracy

The accuracy of these models was assessed by estimating the AUC (Area Under the ROC Curve) for the test data, using the models developed with the training data (see the table below).
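The AUC calculations are not part of the R output shown above; a minimal sketch of how they could be obtained, assuming the pROC package and the test-set predictions made earlier:

	library(pROC)   # assumed package; not part of the original analysis
	roc75 <- roc(testset$CLASS, as.numeric(trnpredict75))
	auc(roc75)      # area under the ROC curve
	ci.auc(roc75)   # 95% confidence limits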

AIC (training data) and AUC estimates (with 95% confidence limits) for test data
	Model                  AIC     AUC    LCL    UCL
	Logistic regression    96.09   0.918  0.847  0.990
	GAM (0.25 span)        93.29   0.862  0.763  0.962
	GAM (0.75 span)        89.42   0.912  0.838  0.986



Questions

These questions relate to the two analyses described above and also make use of the results from the logistic regression analysis of the same data.

Question 1: Calculate Delta AIC values

Calculate the Delta AIC values for the two GAM models and the logistic regression. Decide, using these values, which is the best model.

The smallest AIC is 89.42 (0.75 span GAM). Therefore the three Delta AIC values are:

6.67 (logistic regression)

3.87 (0.25 span GAM)

0.00 (0.75 span GAM)

The lowest AIC is from the 0.75 span GAM model. The other two Delta AIC values are quite large, suggesting that the 0.75 span GAM model is clearly the 'best' of these three.
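The same calculation in R, using the AIC values from the table above:

	aic <- c(logistic = 96.09, gam25 = 93.29, gam75 = 89.42)
	aic - min(aic)   # Delta AIC: 6.67, 3.87, 0.00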

Question 2: Accuracy

Which is the best model in terms of its accuracy?

The largest AUC is from the logistic regression; however, it is very similar to the AUC from the 0.75 span GAM model. There is a marked decrease in accuracy with the 0.25 span GAM model.

Question 3: Why is the 0.75 span GAM model better?

Explain why you think that the 0.75 span GAM model is superior to the 0.25 span GAM model, given that in the previous analysis the 0.25 span model was clearly superior.

First recall that the training data have 100 cases, compared with 150 in the previous analysis. The complexity of the smoothers has remained the same, increasing the potential for overfitting. The residual deviance (25.27) with the more complex smoother is considerably smaller than that for the simpler smoother (65.74), suggesting a much more accurate model. However, the more complex model has used up many more degrees of freedom than the simpler model (approximately 34 and 12 respectively); the logistic regression uses only five. When the residual deviance is penalised by the number of parameters, the extra reduction in deviance achieved by the 0.25 span smoother is not enough to compensate, and the result is a larger AIC and a reduction in accuracy with novel data (the test set). So, despite its early promise, the more complex model is less effective at correctly identifying new cases. Almost certainly this is a consequence of sacrificing identification of the general trend for a closer match between the model and the training data.
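This trade-off can be checked from the output above, because AIC = residual deviance + 2 × (effective parameters), where the effective parameters are n minus the residual degrees of freedom:

	n <- 100                                      # training cases
	dev <- c(gam25 = 25.2673, gam75 = 65.7432)    # residual deviances
	resdf <- c(gam25 = 65.9892, gam75 = 88.1628)  # residual degrees of freedom
	dev + 2 * (n - resdf)                         # 93.29 and 89.42, the AICs reported above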
