Background

This analysis uses the same data as the first logistic regression example. The results of the analysis (using SPSS 12.0) are given below, followed by a series of self-assessment questions. For this analysis the data have been split, randomly, into 100 training cases and 50 test cases. The training data have 47 class 0 cases and 53 class 1 cases. In the test data the frequencies are 28 (class 0) and 22 (class 1). You may wish to print this material before attempting the questions.

Variables in the Equation
		B	S.E.	Wald	df	Sig.	Exp(B)
Step 0	Constant	0.120	0.200	0.360	1	0.549	1.128

Initial value for the constant

Explain why the initial value for the constant is 0.12.

The initial intercept is -LOGe(Proportion in class 0 / Proportion in class 1). In the training data these proportions are 0.47 and 0.53, so the initial value for the intercept is -LOGe(0.47/0.53) = -LOGe(0.8868) = 0.12.
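The arithmetic behind the initial constant can be checked with a few lines of Python. This is only a sketch: the variable names are illustrative, and the class counts are the training frequencies quoted in the Background section.

```python
import math

# Training-case counts quoted in the Background section.
n_class0, n_class1 = 47, 53
n_total = n_class0 + n_class1

p0 = n_class0 / n_total  # proportion in class 0 (0.47)
p1 = n_class1 / n_total  # proportion in class 1 (0.53)

# Initial constant: minus the natural log of the ratio of proportions,
# which is the same as log(p1 / p0), the log odds of class 1.
b0 = -math.log(p0 / p1)

# Exp(B) for the constant is the odds of class 1 in the training data.
odds = math.exp(b0)

print(f"b0 = {b0:.3f}, Exp(b0) = {odds:.3f}")
```

The two values agree with the B and Exp(B) entries for the constant in the Step 0 table (0.120 and 1.128).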

Omnibus Tests of Model Coefficients
		Chi-square	df
Step 1	Step	59.526	4
	Block	59.526	4
	Model	59.526	4

Analysis details

Decide which of the following statements are valid, with respect to this analysis.

	a)	Variables were selected using a stepwise algorithm.
	b)	The logistic regression equation has four coefficients.
	c)	Including the four predictors in the model significantly improved the fit.
	d)	The initial model deviance was 138.27.
	e)	The model explains almost 60% of the variation in the predictors.
	f)	All four predictor coefficients are significantly greater than 0.

a) Not valid. This is a direct-entry model, i.e. no stepwise selection.
b) Not valid. There are five coefficients: the constant plus four for the predictors.
c) Valid. There was a significant improvement, as indicated by the Model Chi-square (59.526) and its p value.
d) Valid. The initial deviance is the model improvement (59.526) plus the model's -2LL (78.743), i.e. 138.269.
e) Not valid. It is variation in the class, not the predictors, that is explained.
f) Not valid. Only b1, b2 and b3 have coefficients significantly greater than 0. The p value for b4 is greater than 0.05, suggesting that it does not differ significantly from 0.
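As a check on the initial deviance in statement d), it can be obtained in two ways: by adding the Model Chi-square to the final -2 Log likelihood, or directly from the training class counts, because the constant-only model predicts the class 1 proportion for every case. The sketch below assumes only the figures quoted in this analysis; the variable names are illustrative.

```python
import math

# Training class counts (Background section).
n0, n1 = 47, 53
n = n0 + n1

# Deviance (-2 log likelihood) of the constant-only model: every case is
# assigned the observed class proportions.
null_deviance = -2 * (n0 * math.log(n0 / n) + n1 * math.log(n1 / n))

# The same quantity from the SPSS output: improvement + remaining deviance.
model_chi_square = 59.526   # Omnibus Tests, Model row
model_minus_2ll = 78.743    # Model Summary, -2 Log likelihood
deviance_from_output = model_chi_square + model_minus_2ll

print(f"from class counts: {null_deviance:.3f}")
print(f"from SPSS output:  {deviance_from_output:.3f}")
```

Both routes give a value of about 138.27.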

Model Summary
Step	-2 Log likelihood	Cox & Snell R Square	Nagelkerke R Square
1	78.743	0.449	0.599
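The two pseudo R Square values in this table can be reproduced from the deviances using the standard Cox & Snell and Nagelkerke definitions. The sketch below assumes those textbook formulas and takes the initial deviance as the Model Chi-square plus the final -2 Log likelihood; the names are illustrative and the SPSS implementation itself is not shown here.

```python
import math

n = 100                     # number of training cases
model_chi_square = 59.526   # reduction in deviance due to the predictors
model_minus_2ll = 78.743    # deviance of the fitted model
null_deviance = model_chi_square + model_minus_2ll

# Cox & Snell: 1 - (L0 / L1)^(2/n), written in terms of deviances.
cox_snell = 1 - math.exp(-model_chi_square / n)

# Maximum attainable Cox & Snell value for these data: 1 - L0^(2/n).
cox_snell_max = 1 - math.exp(-null_deviance / n)

# Nagelkerke rescales Cox & Snell so that its maximum is 1.
nagelkerke = cox_snell / cox_snell_max

print(f"Cox & Snell R Square = {cox_snell:.3f}")
print(f"Nagelkerke R Square  = {nagelkerke:.3f}")
```

The results round to the tabulated values of 0.449 and 0.599.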

Hosmer and Lemeshow Test
Step	Chi-square	df	Sig.
1	3.482	8	0.901

Contingency Table for Hosmer and Lemeshow Test
		class = 0		class = 1		Total
		Observed	Expected	Observed	Expected	Total
Step 1	1	10	9.788	0	0.212	10
	2	8	9.093	2	0.907	10
	3	8	8.000	2	2.000	10
	4	8	7.047	2	2.953	10
	5	6	5.401	4	4.599	10
	6	3	3.802	7	6.198	10
	7	3	2.172	7	7.828	10
	8	1	1.193	9	8.807	10
	9	0	0.432	10	9.568	10
	10	0	0.072	10	9.928	10
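The Hosmer and Lemeshow chi-square can be recovered from this contingency table as the sum of (Observed - Expected)^2 / Expected over all twenty cells, with df equal to the number of groups minus 2. The sketch below is an illustrative reconstruction using the tabulated values; the helper `chi2_sf_even_df` is a hypothetical name, and it relies on the closed-form upper-tail probability of the chi-square distribution that holds when df is even.

```python
import math

# (observed, expected) pairs for each of the ten groups, from the table.
class0 = [(10, 9.788), (8, 9.093), (8, 8.000), (8, 7.047), (6, 5.401),
          (3, 3.802), (3, 2.172), (1, 1.193), (0, 0.432), (0, 0.072)]
class1 = [(0, 0.212), (2, 0.907), (2, 2.000), (2, 2.953), (4, 4.599),
          (7, 6.198), (7, 7.828), (9, 8.807), (10, 9.568), (10, 9.928)]

# Pearson-type statistic summed over both classes and all groups.
hl_chi_square = sum((o - e) ** 2 / e for o, e in class0 + class1)

# Degrees of freedom: number of groups minus 2.
df = len(class0) - 2


def chi2_sf_even_df(x, df):
    """Upper-tail chi-square probability; valid only for even df."""
    half = x / 2
    return math.exp(-half) * sum(half ** k / math.factorial(k)
                                 for k in range(df // 2))


p_value = chi2_sf_even_df(hl_chi_square, df)

print(f"chi-square = {hl_chi_square:.3f}, df = {df}, Sig. = {p_value:.3f}")
```

A large p value such as this one indicates no significant difference between the observed and expected frequencies.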

Classification Table(c)
	Observed		Predicted
			Training Cases			Testing Cases
			class 0	class 1	Percentage Correct	class 0	class 1	Percentage Correct
Step 1	class	0	38	9	80.9	19	9	67.9
	class	1	8	45	84.9	3	19	86.4
	Overall Percentage				83.0			76.0
a The cut value is .500.
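The percentages in the classification table follow directly from the counts. As a minimal sketch (the function name `percent_correct` and the nested-list layout are illustrative assumptions), each table is stored with observed classes as rows and predicted classes as columns:

```python
def percent_correct(table):
    """Return per-class and overall percentage correct for a 2 x 2 table.

    table[i][j] is the number of cases observed in class i and
    predicted as class j.
    """
    (c00, c01), (c10, c11) = table
    total = c00 + c01 + c10 + c11
    return {
        "class 0": 100 * c00 / (c00 + c01),
        "class 1": 100 * c11 / (c10 + c11),
        "overall": 100 * (c00 + c11) / total,
    }


training = [[38, 9], [8, 45]]   # 100 training cases
testing = [[19, 9], [3, 19]]    # 50 test cases

train_pct = percent_correct(training)
test_pct = percent_correct(testing)

for label, pct in (("Training", train_pct), ("Testing", test_pct)):
    summary = ", ".join(f"{k}: {v:.1f}" for k, v in pct.items())
    print(f"{label}: {summary}")
```

The lower accuracy for the test cases, particularly for class 0, is typical when a model is assessed on cases that were not used to estimate its coefficients.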

Confidence limits

Complete the following text by selecting the appropriate values.

95% confidence limits for the coefficients were obtained using a z value of ____. As an example, using b1, the confidence interval is z × 0.140, giving a confidence interval of ____. This gives a lower confidence limit for Exp(b1) of ____ and an upper confidence limit of ____. The lower confidence limits for Exp(b) were ____ for b2, ____ for b3 and ____ for b4. Similarly, the upper confidence limits were ____ for b2, ____ for b3 and ____ for b4. The confidence limits for b4 include 1.00 and therefore suggest that this predictor has no effect on the class of a case. This is supported by the p value of ____ for its Wald statistic.

To obtain the confidence limits, first find the confidence interval for a coefficient, which is 1.96 (the 2-tailed z value for 95% confidence limits) multiplied by its s.e. (the standard error from the table). This confidence interval is then subtracted from the coefficient to give a lower limit, and added to it to give an upper limit, before calculating Exp(confidence limit), where Exp(confidence limit) is e raised to the power of the confidence limit.

Variables in the Equation
		B	S.E.	Wald	df	Sig.	Exp(B)
Step 1(a)	b1	0.452	0.140	10.417	1	0.001	1.572
	b2	0.463	0.123	14.222	1	0.000	1.589
	b3	0.309	0.097	10.192	1	0.001	1.363
	b4	-.072	0.053	1.896	1	0.169	0.930
	Constant	-17.251	3.491	24.415	1	0.000	0.000
a Variable(s) entered on step 1: b1, b2, b3, b4.
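The confidence-limit calculation described in the Confidence limits section can be applied to the coefficients in this table as follows. This is a sketch of that procedure rather than the SPSS routine itself; the dictionary layout and the function name `exp_b_limits` are illustrative.

```python
import math

Z_95 = 1.96  # 2-tailed z value for 95% confidence limits

# Coefficient (B) and standard error (S.E.) for each predictor.
coefficients = {
    "b1": (0.452, 0.140),
    "b2": (0.463, 0.123),
    "b3": (0.309, 0.097),
    "b4": (-0.072, 0.053),
}


def exp_b_limits(b, se, z=Z_95):
    """Return (lower, upper) confidence limits for Exp(B)."""
    interval = z * se                      # confidence interval for B
    return math.exp(b - interval), math.exp(b + interval)


limits = {name: exp_b_limits(b, se) for name, (b, se) in coefficients.items()}

for name, (lower, upper) in limits.items():
    spans_one = lower <= 1.0 <= upper      # Exp(B) = 1 means no effect
    print(f"{name}: {lower:.3f} to {upper:.3f}"
          f"{'  (includes 1.00)' if spans_one else ''}")
```

Only the limits for b4 include 1.00, which is consistent with its Wald statistic not being significant.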

ROC Curve

AUC statistics

Decide whether the following statement is valid: the AUC for the training data is significantly larger than the test data AUC.

Area Under the Curve for Training Data
Area	Std. Error(a)	Asymptotic Sig.(b)	Asymptotic 95% CI Lower Bound	Asymptotic 95% CI Upper Bound
0.898	0.031	0.000	0.838	0.958
a Under the nonparametric assumption
b Null hypothesis: true area = 0.5

Area Under the Curve for Testing Data
Area	Std. Error(a)	Asymptotic Sig.(b)	Asymptotic 95% CI Lower Bound	Asymptotic 95% CI Upper Bound
0.859	0.053	0.000	0.754	0.963
a Under the nonparametric assumption
b Null hypothesis: true area = 0.5
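One way to judge whether the two AUC values differ is an approximate z test on their difference, using the tabulated areas and standard errors. This is not part of the SPSS output: the sketch assumes the training and test cases are independent samples, so that the variances of the two areas can be added, and that the difference is approximately normally distributed.

```python
import math

# Area and standard error from the two AUC tables.
auc_train, se_train = 0.898, 0.031
auc_test, se_test = 0.859, 0.053

difference = auc_train - auc_test

# Standard error of the difference between two independent estimates.
se_difference = math.sqrt(se_train ** 2 + se_test ** 2)

z = difference / se_difference

# Two-tailed p value from the standard normal distribution.
p_two_tailed = math.erfc(abs(z) / math.sqrt(2))

print(f"difference = {difference:.3f}, z = {z:.2f}, p = {p_two_tailed:.2f}")
```

Under these assumptions z is well below 1.96, so the difference is not significant at the 5% level. The overlapping 95% confidence intervals in the two tables point the same way.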

Clustering and Classification methods for Biologists

Logistic Regression SAQ1
