Mathematical background to PCA
Background
PCA is an eigenanalysis problem. The general method is outlined below as an aid to understanding the output from a PCA program.
Let X be a matrix of the standardised variable scores, where the rows are the cases and the columns are the variables. Thus, case 1 has a score of -2 for the first variable, 3 for the second, and so on.
X = | -2  3  1  0 |
    |  1  0 -2  3 |
    |  2  1  0  4 |
Let S be the variance-covariance matrix derived from X. Since there are 4 variables, S would be a 4 × 4 matrix. The diagonal elements would be the variances of the four variables, while the remaining entries would be the covariances between pairs of variables.
|     | ht | wt | sbp | age |
|-----|----|----|-----|-----|
| ht  |    |    |     |     |
| wt  |    |    |     |     |
| sbp |    |    |     |     |
| age |    |    |     |     |
For example, the second entry on the first row would be the ht v wt covariance, the fourth entry the ht v age covariance, etc. These are relatively easy to compute using matrix algebra techniques: S = (X'X)/(n-1), where X' is the transpose of X and n is the number of cases.
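As a concrete illustration, here is a minimal sketch of that matrix algebra in Python/NumPy, using the example scores above. Note that the formula only yields a true variance-covariance matrix when the columns have already been centred, which genuine standardisation would guarantee; the example values are illustrative rather than truly standardised.

```python
import numpy as np

# The three cases and four variables from the example matrix above.
# (In a real analysis each column would be standardised: mean 0, variance 1.)
X = np.array([[-2.0,  3.0,  1.0,  0.0],
              [ 1.0,  0.0, -2.0,  3.0],
              [ 2.0,  1.0,  0.0,  4.0]])

n = X.shape[0]              # number of cases
S = (X.T @ X) / (n - 1)     # S = X'X / (n - 1): a 4 x 4 matrix
print(S)                    # variances on the diagonal, covariances elsewhere
```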
- Because the variables have been standardised, each has a variance of 1. Thus, the total variation, measured by the variances, is simply the number of variables.
- The four eigenvalues of S give us the % of total variance accounted for by each of the four new components. For example, if the first component has an eigenvalue of 3, it accounts for 75% (3/4) of the total variation. If we kept only this component, the dimensionality of the data would be reduced from 4 to 1 while retaining 75% of the original information.
- The relative contribution of each of the original variables to the components is obtained from a matrix U whose columns are the normalised eigenvectors of S. These relative contributions are known as factor loadings and can be used to provide a biological interpretation for the components. Recall that there are infinitely many eigenvectors associated with each eigenvalue; normalisation provides an objective method for choosing between the alternatives. The new scores for each case on the components are obtained from the matrix product XU. Recall that the eigenvectors give the directions of the new axes (vectors), whose lengths are given by their eigenvalues. The number of elements in an eigenvector equals the number of original variables, and the value of each element is the loading for that variable on the vector. Therefore, if a variable has a high loading on a vector, it will be an important component of the new variable. These steps are tied together in the sketch after this list.
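The following sketch runs the whole eigenanalysis route in Python/NumPy on hypothetical data (50 cases, 4 variables); the data and variable count are assumptions for illustration only.

```python
import numpy as np

# Hypothetical data: 50 cases measured on 4 variables (e.g. ht, wt, sbp, age).
rng = np.random.default_rng(0)
raw = rng.normal(size=(50, 4))

# Standardise each column to mean 0 and variance 1.
X = (raw - raw.mean(axis=0)) / raw.std(axis=0, ddof=1)

# Variance-covariance matrix of the standardised scores: S = X'X / (n - 1).
S = (X.T @ X) / (X.shape[0] - 1)

# Eigenanalysis of S. eigh() handles symmetric matrices and returns
# eigenvalues in ascending order, so reverse to put component 1 first.
eigenvalues, U = np.linalg.eigh(S)
order = np.argsort(eigenvalues)[::-1]
eigenvalues, U = eigenvalues[order], U[:, order]

print(eigenvalues / eigenvalues.sum())  # proportion of total variance per component
print(U)                                # columns are normalised eigenvectors (the loadings)

# New component scores for each case: the matrix product XU.
scores = X @ U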
The method described above represents just one of many computational routes available for the determination of principal components. In practice you need not be too aware of the underlying computational procedures, since modern computer packages insulate you from such details (although it might help to explain some of the output from such programs). However, it is important to remember that in the example described above standardised scores were used to find the variance-covariance matrix. If raw, i.e. unstandardised, data are used, differences in measurement scales will affect the principal components: the analysis will be dominated by the variables with the largest variances. The default method used by SPSS and Minitab to calculate the components is from a correlation matrix, which means that all variables are automatically standardised. Pre-standardisation of the variables will not affect an analysis based on correlation coefficients, but it will alter the findings of an analysis based on a variance-covariance matrix. Later versions of both Minitab and SPSS offer the variance-covariance matrix approach as an option.
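This equivalence is easy to check numerically. The sketch below uses hypothetical data on deliberately different measurement scales and confirms that the correlation matrix of the raw data equals the variance-covariance matrix of the standardised data, which is why a correlation-based PCA standardises automatically.

```python
import numpy as np

# Hypothetical raw data: 30 cases, 4 variables on very different scales.
rng = np.random.default_rng(1)
raw = rng.normal(size=(30, 4)) * np.array([1.0, 10.0, 100.0, 0.1])

# Standardise, then take the variance-covariance matrix.
Z = (raw - raw.mean(axis=0)) / raw.std(axis=0, ddof=1)
cov_of_standardised = np.cov(Z, rowvar=False)

# Correlation matrix of the raw (unstandardised) data.
correlation_of_raw = np.corrcoef(raw, rowvar=False)

print(np.allclose(cov_of_standardised, correlation_of_raw))  # True
```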
The next section explains why eigenanalysis is so central to PCA.