Principal Components Analysis (using example data)
This analysis was carried out using Minitab. A short series of notes describes how to complete a PCA using Minitab.
Principal Component Analysis: Eigenanalysis of the Correlation Matrix (default method)
Statistic | PC1 | PC2 | PC3 | PC4 | PC5 |
---|---|---|---|---|---|
Eigenvalue | 2.449 | 1.6497 | 0.5140 | 0.3028 | 0.0845 |
Proportion | 0.490 | 0.330 | 0.103 | 0.061 | 0.017 |
Cumulative | 0.490 | 0.820 | 0.923 | 0.983 | 1.000 |
- The first row has the eigenvalues of the correlation matrix.
- The second row has the proportion of variation associated with each component, e.g. for PC1 2.449/5.000 = 0.490 or 49%.
- The third row gives the cumulative variation retained by the components, e.g. 0.820 is 0.490 + 0.330.
- Most of the information (82%) is contained within the first two components. Why do you think this has happened? (Hint: look at the correlation matrix and scatter plots.)
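The arithmetic behind the Proportion and Cumulative rows can be reproduced in a few lines. This is only an illustrative sketch in Python (the analysis itself was done in Minitab); the eigenvalues are copied from the output above and the variable names are our own.

```python
from itertools import accumulate

# Eigenvalues copied from the Minitab output above.
eigenvalues = [2.449, 1.6497, 0.5140, 0.3028, 0.0845]

# Total variance to be partitioned between the components.
total = sum(eigenvalues)

# Proportion of the variation associated with each component.
proportions = [ev / total for ev in eigenvalues]

# Running total of the proportions (variation retained so far).
cumulative = list(accumulate(proportions))

for i, (ev, p, c) in enumerate(zip(eigenvalues, proportions, cumulative), start=1):
    print(f"PC{i}: eigenvalue={ev:.4f} proportion={p:.3f} cumulative={c:.3f}")
```

Rounded to three decimal places, the printed proportions and cumulative values should agree with the second and third rows of the Minitab output.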
The eigenvalues (2.449, 1.6497, 0.514, 0.3028 and 0.0845) sum to 5.000, the same as the number of variables. Indeed, the sum is always equal to the number of variables if the correlation matrix is analysed. When the correlation matrix is used, each of the variables is standardised to have a mean of 0 and a variance of 1.0. Thus, the total variance to be partitioned between the components is equal to the number of variables.
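A small sketch may help to show why the total is the number of variables. The raw data below are hypothetical (three made-up variables, not the example data); only the eigenvalues come from the Minitab output. After standardising, each variable has a variance of 1.0, so the total variance is simply a count of the variables.

```python
from statistics import mean, stdev, variance

# Hypothetical raw data: three variables on very different scales, five cases each.
raw = {
    "a": [2.0, 4.0, 6.0, 8.0, 10.0],
    "b": [1.0, 3.0, 2.0, 5.0, 4.0],
    "c": [100.0, 80.0, 120.0, 90.0, 110.0],
}

def standardise(values):
    """Rescale values to a mean of 0 and a standard deviation of 1."""
    m, s = mean(values), stdev(values)
    return [(x - m) / s for x in values]

z = {name: standardise(values) for name, values in raw.items()}

# Each standardised variable contributes a variance of 1.0 to the total.
total_variance = sum(variance(values) for values in z.values())
print(f"Total variance of {len(z)} standardised variables: {total_variance:.3f}")

# The same check on the eigenvalues reported by Minitab for the five example variables.
eigenvalues = [2.449, 1.6497, 0.5140, 0.3028, 0.0845]
eigen_total = sum(eigenvalues)
print(f"Sum of the five eigenvalues: {eigen_total:.3f}")
```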
The next table gives the eigenvectors of the five components. By convention, which we follow here, interpretation is normally restricted to components whose eigenvalues are larger than 1.0, in this case PC1 and PC2 only. The eigenvectors are termed component loadings and they are used to calculate the component scores.
Variable | PC1 | PC2 | PC3 | PC4 | PC5 |
---|---|---|---|---|---|
v1 | -0.514 | 0.260 | 0.477 | 0.631 | -0.205 |
v2 | -0.605 | -0.134 | -0.228 | -0.022 | 0.751 |
v3 | -0.587 | -0.078 | -0.381 | -0.357 | -0.613 |
v4 | -0.120 | -0.652 | 0.667 | -0.340 | -0.020 |
v5 | 0.102 | -0.695 | -0.361 | 0.598 | -0.134 |
Note that the calculation of the component scores assumes standardised scores for the variables (mean = 0, standard deviation = 1.0). Because the variables share the same scale, the loading, or weight, controls the contribution that each variable makes to the component score. Thus, v4 and v5 make only minor contributions to the first component, and much larger contributions to the second component. In a real example we might attempt to assign some name to the combination of variables associated with a component. This aspect is considered in the examples that use real data.
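As a rough illustration of how the loadings act as weights, a component score is the sum of each loading multiplied by the case's standardised value on that variable. In the sketch below the loadings are taken from the table above, but the standardised values for the single case (`z_case`) are invented for the example and the function name is our own.

```python
# Loadings for PC1 and PC2 copied from the table above (order: v1, v2, v3, v4, v5).
loadings = {
    "PC1": [-0.514, -0.605, -0.587, -0.120, 0.102],
    "PC2": [0.260, -0.134, -0.078, -0.652, -0.695],
}

# Hypothetical standardised values (z-scores) for one case; NOT from the example data.
z_case = [1.0, 0.5, -0.5, 2.0, -1.0]

def component_score(weights, z_values):
    """Weighted sum of standardised values: sum of loading * z-score."""
    return sum(w * z for w, z in zip(weights, z_values))

scores = {pc: component_score(weights, z_case) for pc, weights in loadings.items()}

for pc, score in scores.items():
    print(f"{pc} score: {score:.3f}")
```

Changing the hypothetical v4 or v5 value alters the PC2 score far more than the PC1 score, which is what the small PC1 loadings for those variables imply.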
Thus, our interpretation of these data is that:
- The data can be represented adequately in just 2 dimensions.
- The first of these is associated with variables v1, v2 and v3.
- The second is associated with variables v4 and v5.
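The interpretation above can be traced back to the numbers with a short sketch. It keeps the components whose eigenvalues exceed 1.0 and then lists the variables with large loadings on each. The cutoff of 0.5 for a "large" loading is our own assumption for this illustration; it is not a Minitab setting or a fixed rule.

```python
# Eigenvalues and loadings copied from the Minitab output and table above.
eigenvalues = {"PC1": 2.449, "PC2": 1.6497, "PC3": 0.5140, "PC4": 0.3028, "PC5": 0.0845}
variables = ["v1", "v2", "v3", "v4", "v5"]
loadings = {
    "PC1": [-0.514, -0.605, -0.587, -0.120, 0.102],
    "PC2": [0.260, -0.134, -0.078, -0.652, -0.695],
    "PC3": [0.477, -0.228, -0.381, 0.667, -0.361],
    "PC4": [0.631, -0.022, -0.357, -0.340, 0.598],
    "PC5": [-0.205, 0.751, -0.613, -0.020, -0.134],
}

# Retain only the components whose eigenvalues are larger than 1.0.
retained = [pc for pc, ev in eigenvalues.items() if ev > 1.0]

# Assumed cutoff for calling a loading "large"; chosen only for this example.
CUTOFF = 0.5

# For each retained component, list the variables with large absolute loadings.
dominant = {
    pc: [v for v, w in zip(variables, loadings[pc]) if abs(w) >= CUTOFF]
    for pc in retained
}

for pc, names in dominant.items():
    print(f"{pc}: {', '.join(names)}")
```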
AN IMPORTANT POINT
The component scores that are calculated have a mean of 0. This means that a negative score indicates a case that is below average on that component, whilst a positive score indicates a case that is above average on it. A case with a score of 0 is average.
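The zero mean follows from the standardisation: any weighted sum of variables that each have a mean of 0 also has a mean of 0. The sketch below checks this on hypothetical data, using two made-up variables and made-up weights rather than the example data or its loadings.

```python
from statistics import mean, stdev

# Hypothetical raw data for two variables measured on five cases.
x1 = [2.0, 4.0, 6.0, 8.0, 10.0]
x2 = [1.0, 3.0, 2.0, 5.0, 4.0]

# Hypothetical weights standing in for component loadings.
weights = [0.7, -0.7]

def standardise(values):
    """Rescale values to a mean of 0 and a standard deviation of 1."""
    m, s = mean(values), stdev(values)
    return [(x - m) / s for x in values]

z1, z2 = standardise(x1), standardise(x2)

# One score per case: the weighted sum of its standardised values.
scores = [weights[0] * a + weights[1] * b for a, b in zip(z1, z2)]
mean_score = mean(scores)

for case, score in enumerate(scores, start=1):
    position = "above" if score > 0 else "below" if score < 0 else "at"
    print(f"case {case}: score={score:+.3f} ({position} average)")
print(f"mean of scores: {mean_score:.6f}")
```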