MMU - Clustering and Classification

Background

Most common statistical methods assume that the response variable, and the residuals or errors, are drawn from a normal distribution. If they are not, other methods must be used that make different assumptions about the frequency distribution of the response variable. These methods fall into a general class referred to as Generalised Linear Models.

Logistic regression is a generalised linear model that assumes the response variable is drawn from a binomial distribution. A brief introduction is given below and some worked examples are provided. There are links at the end which provide additional information.

Binary dependent variable

If the dependent, or response, variable has only two possible values, for example 0 and 1, it is unwise to use methods such as multiple regression to predict its values. This is because the predicted values of y from a multiple regression are not constrained to lie between 0 and 1. Discriminant analysis can be used in such circumstances. However, discriminant analysis will only produce optimal solutions if its assumptions are supported by the data. If logistic regression is used instead, the dependent variable is the probability that an event will occur, hence y is constrained between 0 and 1. Logistic regression has the additional advantage that predictors can be binary, categorical, continuous or a mixture.
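
To illustrate the point about unconstrained predictions, the following minimal Python sketch fits an ordinary least-squares line to a small binary response. The data values are made up purely for illustration, and the helper name fit_line is hypothetical.

```python
# Hypothetical data: one continuous predictor x and a binary response y.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
y = [0, 0, 0, 0, 1, 1, 1, 1]


def fit_line(x, y):
    """Ordinary least squares with one predictor; returns (intercept, slope)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


intercept, slope = fit_line(x, y)
predictions = [intercept + slope * xi for xi in x]

# Fitted values that cannot be read as probabilities.
out_of_range = [p for p in predictions if p < 0 or p > 1]
print(predictions)
print(out_of_range)
```

With these values the fitted line dips below 0 at the smallest x and rises above 1 at the largest x, which is the behaviour the logistic model avoids.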

The logistic model is written as:

Prob(event) = 1 / (1 + e^(-z))

where z is b₀ + b₁x₁ + b₂x₂ + ... + b_px_p

The logistic equation can be rearranged into a linear form by converting the probability into a log odds or logit.

log [Prob(event)/Prob(no event)] = b₀ + b₁x₁ + b₂x₂ + ... + b_px_p
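
The relationship between the two forms can be checked numerically. The sketch below assumes the usual logistic form, Prob(event) = 1 / (1 + e^(-z)), and uses hypothetical coefficients and predictor values; the function names are illustrative only.

```python
import math


def linear_predictor(coefficients, predictors):
    """z = b0 + b1*x1 + ... + bp*xp, where coefficients[0] is the intercept b0."""
    return coefficients[0] + sum(
        b * x for b, x in zip(coefficients[1:], predictors)
    )


def logistic(z):
    """Probability of the event for a given value of z."""
    return 1.0 / (1.0 + math.exp(-z))


def logit(p):
    """Log odds: log[Prob(event) / Prob(no event)]."""
    return math.log(p / (1.0 - p))


b = [-3.0, 0.8, 0.5]  # hypothetical b0, b1, b2
x = [2.0, 1.0]        # hypothetical x1, x2

z = linear_predictor(b, x)
p = logistic(z)

print(z, p, logit(p))  # logit(p) recovers z, up to rounding error
```

Whatever the value of z, the probability stays strictly between 0 and 1, and taking the logit of that probability returns the linear combination of predictors.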

This produces a relationship similar to that in multiple regression, except that each one-unit change in a predictor is associated with a change in the log odds rather than in the response directly. Different types of response model can be investigated with this approach. For example, if squared predictors (quadratic terms) are included, the modelled response curve is Gaussian (bell-shaped) rather than sigmoidal. As with multiple regression, it is also possible to test a range of models by stepwise inclusion or elimination of predictors. Because the coefficients relate to changes in log odds, their interpretation is less direct than in multiple regression; exponentiating a coefficient gives the odds ratio associated with a one-unit change in that predictor.
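
Two of these points can be sketched in a few lines of Python. All coefficient values below are hypothetical. The first part shows that a one-unit increase in a predictor multiplies the odds by e raised to the coefficient; the second shows that adding a negative quadratic term turns the sigmoidal response into a bell-shaped one.

```python
import math


def logistic(z):
    return 1.0 / (1.0 + math.exp(-z))


def odds(p):
    return p / (1.0 - p)


# Part 1: a coefficient as a change in log odds (hypothetical b0 and b1).
b0, b1 = -2.0, 0.7
odds_ratio = math.exp(b1)
p_at_3 = logistic(b0 + b1 * 3)
p_at_4 = logistic(b0 + b1 * 4)
ratio = odds(p_at_4) / odds(p_at_3)  # matches odds_ratio, up to rounding

# Part 2: a quadratic term (hypothetical c0, c1, c2 with c2 < 0).
c0, c1, c2 = -4.0, 2.0, -0.25
peak_x = -c1 / (2 * c2)  # x at which z, and hence the probability, is largest
curve = [logistic(c0 + c1 * x + c2 * x * x) for x in range(0, 9)]

print(odds_ratio, ratio)
print(peak_x, curve)
```

The probability itself does not change by a constant amount per unit of the predictor; only the odds change by a constant factor.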

Example analyses

Three analyses are provided to illustrate the details of a logistic regression. Each analysis opens in a new browser window.

The first uses an artificial dataset with continuous independent variables to predict the class of a case.
The second uses the same dataset as the basis for a self-assessment exercise.
The third uses some real data, with a mixture of continuous and categorical independent variables, to predict if a person smokes.

Logistic regression v Discriminant analysis

Many studies have compared these two methods but the comparisons remain inconclusive. The only universal criterion for selecting logistic regression over discriminant analysis relates to the predictor data type limitations of discriminant analysis. Because both methods build linear boundaries between the classes, they have the same functional form (a sum of weighted predictors). However, logistic regression (maximum likelihood estimation) and discriminant analysis (a least squares technique) use different methods to estimate the regression coefficients. Because of this difference, discriminant analysis should outperform logistic regression when its assumptions are valid. In other circumstances, logistic regression should be better. Empirical comparisons of the methods do not always bear out the data type restrictions. As long as the predictors are not categorical, there appears to be little difference between the performance of the two methods when sample sizes are reasonable (>50). If there are outliers, it is likely that logistic regression will be better, because the outliers could distort the variance-covariance matrix used by discriminant analysis to estimate the predictor weights.

Generalised Linear Models

These are extensions of general linear models that allow for non-linearity and non-constant variance structures in the data. They also assume a linear relationship between the mean of the response variable and the predictors. However, unlike general linear models, the linear component is established indirectly, via a link function. The link function is important because, in the case of logistic regression, it constrains predicted values to fall within the range 0 - 1. In addition, generalised linear models can use a response variable that comes from any distribution within the exponential family of probability distributions (including the normal, binomial and Poisson distributions). An important computational complication is that, if the response variable is not normally distributed, the regression coefficients, or weights, are estimated using maximum likelihood rather than least squares techniques.
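
As a rough illustration of maximum likelihood estimation, the sketch below fits a logistic regression with one predictor by Newton-Raphson iteration, which is one common way of maximising the binomial log-likelihood. It is not a description of what any particular package does. The data are made up, the function names are hypothetical, and a fixed number of iterations stands in for a proper convergence check.

```python
import math


def logistic(z):
    return 1.0 / (1.0 + math.exp(-z))


def log_likelihood(b0, b1, x, y):
    """Binomial log-likelihood of the data for given coefficients."""
    total = 0.0
    for xi, yi in zip(x, y):
        p = logistic(b0 + b1 * xi)
        total += yi * math.log(p) + (1 - yi) * math.log(1.0 - p)
    return total


def fit_logistic(x, y, iterations=25):
    """Newton-Raphson estimates of the intercept b0 and slope b1."""
    b0, b1 = 0.0, 0.0
    for _ in range(iterations):
        g0 = g1 = 0.0           # gradient of the log-likelihood
        h00 = h01 = h11 = 0.0   # information matrix (negative Hessian)
        for xi, yi in zip(x, y):
            p = logistic(b0 + b1 * xi)
            w = p * (1.0 - p)
            g0 += yi - p
            g1 += (yi - p) * xi
            h00 += w
            h01 += w * xi
            h11 += w * xi * xi
        det = h00 * h11 - h01 * h01
        # Newton step: add (information matrix)^-1 times the gradient.
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    return b0, b1


# Hypothetical data in which the two classes overlap.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
y = [0, 0, 1, 0, 1, 0, 1, 1]

b0, b1 = fit_logistic(x, y)
ll_null = log_likelihood(0.0, 0.0, x, y)
ll_fit = log_likelihood(b0, b1, x, y)
residual_sum = sum(yi - logistic(b0 + b1 * xi) for xi, yi in zip(x, y))
print(b0, b1, ll_null, ll_fit)
```

The classes are deliberately made to overlap: if they were perfectly separated by x, the likelihood would have no finite maximum and the iteration would not settle.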

A generalised linear model, y = f(L) + error, has a superficial similarity to the general linear model. Here L is a function that combines the predictors in a linear fashion and f is the inverse of a link function, g, which maps the expected value of the response onto the linear scale. This is normally written as g(expected value) = L. The form of the link function is related to the probability distribution of the response. For example, if the response is binary the link function is the logit, log [P(event)/P(no event)] = L, while it is ln(expected value) = L for a response variable from the Poisson distribution.
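
The pairing of response distribution and link function can be sketched as a small lookup table in Python. The names LINKS and expected_value are hypothetical. The logit and log links are the ones named in the text; the identity link for a normal response is the standard choice but is an addition here.

```python
import math

# Each entry pairs a link function g (expected value -> L)
# with its inverse (L -> expected value).
LINKS = {
    "binomial": (
        lambda p: math.log(p / (1.0 - p)),    # logit link
        lambda L: 1.0 / (1.0 + math.exp(-L)),
    ),
    "poisson": (
        lambda m: math.log(m),                # log link
        lambda L: math.exp(L),
    ),
    "normal": (
        lambda m: m,                          # identity link (assumed here)
        lambda L: L,
    ),
}


def expected_value(family, L):
    """Map a value of the linear predictor L back onto the response scale."""
    _link, inverse_link = LINKS[family]
    return inverse_link(L)


for family in LINKS:
    print(family, [expected_value(family, L) for L in (-2.0, 0.0, 2.0)])
```

The inverse logit keeps expected values between 0 and 1 and the inverse log keeps them positive, whatever value the linear predictor takes.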

Resources

There are nice examples, using the Challenger Shuttle disaster and the temperature determination of turtle sex, in the NCSSM Statistics Leadership Institute notes (which have other useful sections).

Stephen Lea's description for psychology students: http://www.exeter.ac.uk/~SEGLea/multvar2/disclogi.html

The Prophet Statguide has a detailed and comprehensive description of logistic regression (and other multivariate methods): http://www.basic.nwu.edu/statguidefiles/logistic.html

Michael Friendly's Categorical Data Analysis site has a detailed look at certain aspects of logistic regression, which includes a step-by-step description of program output: http://math.yorku.ca/SCS/Courses/grcat/grc6.html

Bill Miller's OpenStat is a free statistics package that can be used to undertake a logistic regression analysis.

The 'standard' text is Hosmer, D. W. and Lemeshow, S. 1989. Applied Logistic Regression. Wiley and Sons.

Hall, G. H. and Round, A. P. 1994. Logistic Regression - explanation and use. Journal of the Royal College of Physicians of London, 28: 242-246.

Bender, R. and Grouven, U. 1997. Ordinal logistic regression in medical research. Journal of the Royal College of Physicians of London, 31: 546-551.
