Clustering and Classification methods for Biologists


Principal Components Analysis

Mathematical background to PCA

Background

PCA is an eigenanalysis problem. The general method is outlined below to help you understand the output from a PCA program.

Let X be a matrix of standardised variable scores in which the rows are the cases and the columns are the variables. Thus, in the matrix below, case 1 has a score of -2 for the first variable, 3 for the second, etc.

X =
     -2    3    1    0
      1    0   -2    3
     ……
      2    1    0    4

Let S be the variance-covariance matrix derived from X. Since there are four variables, S is a 4 x 4 matrix. The diagonal elements are the variances of the four variables, while the remaining entries are the covariances between pairs of variables.

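        ht            wt            sbp            age
ht      var(ht)       cov(ht,wt)    cov(ht,sbp)    cov(ht,age)
wt      cov(wt,ht)    var(wt)       cov(wt,sbp)    cov(wt,age)
sbp     cov(sbp,ht)   cov(sbp,wt)   var(sbp)       cov(sbp,age)
age     cov(age,ht)   cov(age,wt)   cov(age,sbp)   var(age)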

For example, the second entry on the first row is the ht v wt covariance, the fourth entry is the ht v age covariance, etc. These are relatively easy to compute using matrix algebra: S = (X'X)/(n-1), where X' is the transpose of X and n is the number of cases.
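The matrix calculation can be sketched in a few lines of Python. This is only a minimal illustration, not the routine used by any particular statistics package. The raw measurements are invented for the example, and the helper names standardise and covariance_matrix are hypothetical. The scores are standardised first, so the columns of X have mean zero and the product of the transpose of X with X, divided by n-1, gives the variance-covariance matrix directly.

```python
from math import sqrt

# Hypothetical raw measurements (invented for illustration only).
# Columns: ht (cm), wt (kg), sbp (mmHg), age (years); rows are cases.
raw = [
    [160.0, 55.0, 110.0, 25.0],
    [170.0, 68.0, 125.0, 40.0],
    [175.0, 72.0, 120.0, 35.0],
    [180.0, 80.0, 135.0, 50.0],
    [190.0, 95.0, 130.0, 45.0],
]


def standardise(data):
    """Return column-wise z-scores (mean 0, sample standard deviation 1)."""
    n = len(data)
    cols = list(zip(*data))
    means = [sum(c) / n for c in cols]
    sds = [
        sqrt(sum((v - m) ** 2 for v in c) / (n - 1))
        for c, m in zip(cols, means)
    ]
    return [[(v - m) / s for v, m, s in zip(row, means, sds)] for row in data]


def covariance_matrix(X):
    """Compute S = X'X / (n - 1); assumes the columns of X have mean zero."""
    n = len(X)
    Xt = list(zip(*X))  # transpose: one tuple of scores per variable
    return [
        [sum(a * b for a, b in zip(ci, cj)) / (n - 1) for cj in Xt]
        for ci in Xt
    ]


X = standardise(raw)
S = covariance_matrix(X)

# Print S with row labels; the diagonal holds the variances.
for name, row in zip(["ht", "wt", "sbp", "age"], S):
    print(name, [round(v, 3) for v in row])
```

Because the scores were standardised, every diagonal entry of S comes out as 1 and each off-diagonal entry is the correlation between the two variables concerned.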

The method described above is just one of many computational routes for finding principal components. In practice you need not be too concerned with the underlying computations, since modern statistical packages insulate you from such details, although knowing them may help to explain some of the program output.

However, it is important to remember that the example above used standardised scores to find the variance-covariance matrix. If raw (unstandardised) data are used, differences in measurement scales will affect the principal components: the analysis will be dominated by the variables with the largest variances. By default, SPSS and Minitab calculate the components from a correlation matrix, which means that all variables are automatically standardised. Pre-standardising the variables will not affect an analysis based on correlation coefficients, but it will alter the findings of an analysis based on a variance-covariance matrix. Later versions of both Minitab and SPSS offer the variance-covariance matrix approach as an option.
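The effect of measurement scale can be seen in a small two-variable sketch. The data are invented, and the function first_component is a hypothetical helper that uses the closed-form solution for the leading eigenvector of a 2 x 2 symmetric matrix. It is not how SPSS or Minitab compute components. The first component is found once from the raw data (a variance-covariance analysis) and once from standardised data (equivalent to a correlation analysis).

```python
from math import sqrt

# Hypothetical data (invented for illustration): five cases measured on
# two variables with very different scales.
height_m = [1.60, 1.70, 1.75, 1.80, 1.90]
weight_kg = [55.0, 68.0, 72.0, 80.0, 95.0]


def cov(x, y):
    """Sample covariance of x and y; cov(x, x) is the variance of x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / (n - 1)


def standardise(x):
    """Return z-scores: subtract the mean, divide by the sample SD."""
    m = sum(x) / len(x)
    sd = sqrt(cov(x, x))
    return [(v - m) / sd for v in x]


def first_component(x, y):
    """Unit-length leading eigenvector of the 2 x 2 covariance matrix of x, y.

    Uses the closed form for a symmetric matrix [[a, b], [b, c]];
    assumes the covariance b is not zero.
    """
    a, b, c = cov(x, x), cov(x, y), cov(y, y)
    largest = (a + c) / 2 + sqrt(((a - c) / 2) ** 2 + b ** 2)  # top eigenvalue
    v = (b, largest - a)
    norm = sqrt(v[0] ** 2 + v[1] ** 2)
    return (v[0] / norm, v[1] / norm)


raw_pc1 = first_component(height_m, weight_kg)
std_pc1 = first_component(standardise(height_m), standardise(weight_kg))

print("raw data loadings (height, weight):         ", raw_pc1)
print("standardised data loadings (height, weight):", std_pc1)
```

With these made-up numbers the raw-data component is almost entirely weight, because weight in kilograms has a far larger variance than height in metres. After standardisation the two variables receive loadings of equal size.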

The next section is important because it explains why eigenanalysis is central to PCA.
