Clustering and Classification methods for Biologists


MMU logo

Logistic regression and categorical predictors

LTSN Bioscience logo

Page Outline

 

Intended Learning Outcomes

At the end of this section students should be able to complete the following tasks.

top


Background

Suppose that a logistic regression predicts the presence (1) or absence (0) of some outcome using a set of three binary predictors such as gender (A), smoking (B) and colour blindness (C), where one indicates presence and zero absence of that predictor. The following equation was obtained: log odds = -2.7 + 2.5 A - 3.7 B + 1.8 C. The predictor A coefficient of 2.5 is the estimated change in the logarithm of P(presence)/P(absence) for the outcome, when B and C are held constant.

If A, B & C are present (i.e. have the value 1) the log odds are: log odds = -2.7 + 2.5 - 3.7 + 1.8 = -2.1, giving odds of 0.1225 (e-2.1). The 'risk' (probability of outcome 1) is defined by the odds ratio (odds / (1 + odds)). Odds ratios less than one correspond to decreases in the odds as the predictor increases in value while odds ratios greater than one correspond to increases in the odds. If the odds ratio is approximately one, changes in the predictor do not alter the odds of the event. Note that a coefficient of zero is the same as an odds ratio of one (because e-0 is 1.0), both imply the predictor has no effect of the response. In this example the odds, when all predictors have a value of one are 0.1225 / 1.1225 or 0.1091 (10.91%). If A, B and C are absent (all = zero) the log odds equal the constant (-2.7) and the odds are 0.067 (e-2.7 ). This means that 'risk of' of event 1 is 0.067 / 1.067 or 0.063 (6.3%). Therefore, absence of the three predictors almost halves the chance of the event. However, if predictors A and C are present, while B is absent, the log odds are -2.7 + 2.5 + 1.8 = 1.6, giving odds of 4.953 (e-1.6) and a risk of 4.953/5.953 = 0.832 or 83.2%. It appears that the absence of predictor B vastly increases the probability of event 1.


Preliminaries

The previous data set could have been analysed using discriminant analysis because all of the predictors were continuous. However, it becomes more difficult to use discriminant analysis when predictors are categorical. This analysis uses data collected from 87 people (available as an Excel file or a plain text file). The response variable is whether or not a person smoke cigarettes (66 non-smokers and 21 smokers). The five predictors consist of two categorical (gender and ABO blood type) and three continuous (age (years), body mass index (bmi - kg m-2) and white blood cell count (wbc - x 109 l-1)) variables. Because gender has two values it is coded as a single binary predictor (0 = male, 1 = female). Blood group has four categories that must be coded as three dummy predictors. The coding is arbitrary and is shown below. Blood group AB has an implicit coding of zero for each of the three dummy blood group predictors.

Categorical Variables Codings


Frequency Parameter coding
abo O 21 1 0 0
A 24 0 1 0
B 22 0 0 1
AB 20 0 0 0
sex F 41 1
  M 460

top


Null Model

The null model does not include any of the predictors. Instead the only term in the model is the intercept, whose initial value is:
loge (P(smoke)/P(non-smoke))
= loge 0.241/0.759
= loge 0.318
= -1.145.

Null Model

B S.E. Wald df Sig. Exp(B)
Constant -1.145 .251 20.891 1 .000 0.318

top


Model significance

The deviance for the null model is 96.164, which reduces to 75.8 when all of the predictors are added. The reduction in the deviance (20.371) is significant (p=0.005) suggesting that the predictors significantly reduce the unexplained variation. This is further supported by the pseudo-R2 values, although values around 0.3 indicate that a large amount of variation remains unexplained by these predictors.

Omnibus Tests of Model Coefficients

Chi-square df Sig.
Model 20.371 7 0.005

 

Model Summary
Step -2 Log likelihood Cox & Snell R Square Nagelkerke R Square
1 75.793 0.209 0.312

top


Goodness of fit

The Hosmer and Lemeshow test compares observed and expected frequencies in 10 classes. If the model is a good fit the p value, as below, will be greater than 0.05.

Hosmer and Lemeshow Test
Step Chi-square df Sig.
1 9.004 8 0.342

 

Contingency Table for Hosmer and Lemeshow Test


smoke = N smoke = Y Total
Observed Expected Observed Expected
Step 1
1 9 8.752 0 0.248 9
2 8 8.429 1 0.571 9
3 8 8.243 1 0.757 9
4 9 8.043 0 0.957 9
5 6 7.647 3 1.353 9
6 6 6.958 3 2.042 9
7 7 6.571 2 2.429 9
8 7 5.722 2 3.278 9
9 6 4.174 3 4.826 9
10 0 1.461 6 4.539 6

top


Classification accuracy

text block

Classification Table(a)

Predicted class Percentage Correct

non-smokersmoker
Step 1 Observed class non-smoker 62 4 93.9
smoker 12 9 42.9
Overall Percentage

81.6
a The cut value is .500

These classification results suggest that this model tends to predict non-smokers quite accurately (93.9% accuracy). However, the accuracy with smokers is poor (42.9%). This inequality between the class predictions is at least partly related to the large difference in the proportions of smokers and non-smokers. If the class allocation threshold is adjusted (from 0.50 to 0.28) to match the class proportions in the sample data the overall accuracy declines to 73.6% but the number of smokers correctly predicted rises from 9/21 to 14/21. The decline in overall accuracy is a consequence of more false identifications of non-smokers as smokers. Possible solutions to the problems created by unequal class proportions are covered by the accuracy page.

top


Model structure

text block

Variables in the Equation


B S.E. Wald df Sig. Exp(B)95% LCL95% LCL
Step 1(a) bmi -0.214 0.096 4.942 1 0.026 0.807 0.669 0.975
age 0.031 0.038 0.682 1 0.409 1.032 0.958 1.110
wbc 0.632 0.214 8.705 1 0.003 1.882 1.236 2.864
sex(1) -0.188 0.615 0.094 1 0.760 0.829 0.248 2.764
abo

3.685 3 0.298


abo(1) -1.290 0.885 2.124 1 0.145 0.275 0.049 1.560
abo(2) -0.430 0.762 0.319 1 0.572 0.651 0.146 2.895
abo(3) -1.335 0.814 2.688 1 0.101 0.263 0.053 1.298
Constant -0.034 2.775 0.000 1 0.990 0.967

It is not surprising that age and blood group are not good predictors (p>0.05) of smoking habits (see above). It is less obvious if gender should be a significanr predictor. In these data gender does not predict smoking (p=.760). Only bmi and wbc are significant predictors. The probability of smoking appears to decline as bmi increases (negative coefficient). Because bmi is a measure of obesity (weight corrected for height) this suggests if two people are the same height, but different weights, the lighter person is the most likely to smoke. The coefficient for wbc is positive, suggesting that the probability that a person smokes is greater if their white blood cell count is larger. There is a well known relationship for white blood cell counts to be higher in smokers.

top


ROC analysis

The earlier classification table suggested that the model had an overall accuracy of 81.6%. The calculator from the accuracy page estimates that Kappa (0.422) is only just above the 0.4 threshold that is usually taken to indicate poor agreement.

There are two main problems with these accuracy assessments. Firstly, they are based on the training data and are likely to be over-optimistic (see the accuracy page). Secondly, they assume that the 0.5 cut-point is appropriate. It is generally better to assess accuracy with a threshold-independent measure from a ROC plot. The ROC plot below has an AUC of 0.895 (std error = 0.057), with 95% confidence limits of 0.671 and 0.895. These figures suggest that the model is reasonably accurate.

ROC Curve for smoking prediction

top


Diagnostics

Scatter of Cook's distance by smoking class.

Cook's Distance (D) measures the influence of an observation by estimating how much the other residuals would change if the case was removed. Its value is a function of the case's leverage and of the magnitude of its standardized residual. Normally, D > 1.0 identifies cases that might be influential. An arbitrary threshold criterion of D > 1.0 is normally used to identify influential cases. None of the cases in this analysis have a D value < 1, suggesting that none of the case is creating problems.

top