Principal Component Analysis

 

This post is a gentle introduction to the fundamental concept of Principal Component Analysis (PCA), which is arguably one of the most renowned statistical dimension reduction methods.

 


Statistical Dimension Reduction

In the previous post, I emphasized that we don’t necessarily need all of the variables in a dataset, because a subset of them may be enough to retain most of the information in the original data. This was related to the issue of multicollinearity, and as one possible remedy, variable selection and shrinkage methods such as ridge and lasso regression were introduced.

However, variable selection has a critical drawback of information loss: whatever information the discarded variables carried is lost completely.

Thus, we now turn to dimension reduction methods, which try to extract new variables from the original data without throwing any of it away. Specifically, as subcategories of dimension reduction, we will be talking about principal component analysis, factor analysis, cluster analysis, and discriminant analysis. Although these methods are grouped together as dimension reduction methods, how each of them works and what it aims to accomplish are all subtly different. So we are going to focus on their differences and intuitions by applying each method to a toy dataset and getting a glimpse of what it allows us to do.

Let’s start with the PCA!

 


Principal Component Analysis (PCA)

The goal of principal component analysis, in short, is to construct a whole new set of variables from the original dataset. In other words, PCA re-expresses the original data in terms of newly defined uncorrelated variables, each of which is a linear combination of the original variables.

To put it simply, it is like seeing an object from a different point of view. As an example, say we are walking down the street and spot a friend walking ahead of us. He happens to be our best friend, so we can recognize him immediately just by looking at the back of his head, without having to see his face. Although what we are most acquainted with is his front view, such as his face, we can still recognize him as our friend even if there is a little variation in what we actually see, because the friend himself does not change.

Let’s bring this example to the dataset. There are normally two ways of reading a typical matrix-like dataset: row-wise and column-wise. The row-wise viewpoint looks at the data from the perspective of individual observations, while the column-wise viewpoint looks at it from the perspective of features. The critical point here is that regardless of which viewpoint we take, the data itself is unique; it remains consistent no matter which viewpoint we see it from. Making this idea rigorous takes some background in linear algebra, but the intuition itself is simple and will come in handy. It would be strange if the personality of your friend changed depending on the direction you looked at him from.

This idea is the fundamental concept behind PCA, and we can visualize it with the following plot.

 

 

Above is a scatter plot of two variables that appear to be positively correlated. Each observation is marked as a circle and is represented by its coordinates on the x and y axes. Now let’s focus our attention on the green lines. Tilt your head a little to the left and try to think of the two green lines as the new x and y axes. The picture then looks like the following plot.

 

 

If we consider the two orthogonal green lines as our new x and y axes, we can see that the observations are now uncorrelated: they are scattered without any distinct pattern or trend. What we have just done is, in essence, PCA. We have re-expressed the data in terms of two uncorrelated variables, the first and second principal components (PCs). We have decided to look at the original data from a slightly different perspective (a different set of axes), but the nature of the data itself remains unchanged.
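As a quick illustration of this rotation idea, here is a minimal R sketch (the simulated data and object names are mine, not from the post): two positively correlated variables are rotated onto their principal axes, and the resulting PC scores come out uncorrelated.

```r
set.seed(1)
x <- rnorm(100)
y <- 0.8 * x + rnorm(100, sd = 0.5)   # y is positively correlated with x
dat <- cbind(x, y)

pca <- prcomp(dat)      # centering is on by default

round(cor(dat), 2)      # original variables: clear positive correlation
round(cor(pca$x), 2)    # PC scores: correlation is (numerically) zero
```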

Note that if we compare the lengths of the two green lines, which represent the principal components, the first PC is longer than the second. This indicates that the variance of the first PC is greater than that of the second PC, which is equivalent to saying that the first PC retains more information about the original data than the second PC. Therefore, in PCA the principal components are ordered by decreasing variance, where the first PC is the linear combination of $X$ with a unit-norm vector $u$ that results in maximum variance.
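In symbols, this is the standard variance-maximization formulation: assuming the columns of $X$ are centered,

$$ u_1 = \underset{\lVert u \rVert = 1}{\arg\max}\ \operatorname{Var}(Xu) = \underset{\lVert u \rVert = 1}{\arg\max}\ u^\top S u, $$

where $S$ is the sample covariance (or correlation) matrix, and each subsequent PC solves the same problem subject to being orthogonal to the previous ones.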

Then how can we find the principal components? Two linear-algebraic methods are commonly used:

  • By using Spectral Decomposition of the sample correlation matrix
  • By using Singular Value Decomposition of the original matrix

If you are familiar with linear algebra, the difference between these two methods is trivial, since the correlation matrix is guaranteed to be a square matrix. (SVD is a generalization of the spectral decomposition!)

Either way, we can prove analytically that the principal components correspond to the eigenvectors of the matrix used in the decomposition. The mathematical derivation is omitted here for simplicity; instead, we are going to take a look at its implementation using R.
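As a rough sketch of the two routes above (on a generic standardized toy matrix, not the post’s own data), both decompositions yield the same directions and variances:

```r
set.seed(1)
X <- scale(matrix(rnorm(200), ncol = 4))   # toy data: 50 rows, 4 standardized columns

# 1. Spectral decomposition of the sample correlation matrix
eig <- eigen(cor(X))
eig$vectors                      # PC directions (eigenvectors)
eig$values                       # variances of the corresponding PCs

# 2. Singular value decomposition of the standardized data matrix
sv <- svd(X)
sv$v                             # right singular vectors: same directions (up to sign)
sv$d^2 / (nrow(X) - 1)           # same variances

# prcomp() follows the SVD route internally
prcomp(X)$rotation
```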

Some practical tips on performing PCA are as follows:

  • When the measurement units (scales) of the variables differ, we have to standardize the variables, i.e. use the correlation matrix. This is because the decomposition depends on the scale of the variables.
  • For variables with the same measurement units, the covariance matrix can be used directly (a small sketch follows below).
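With R’s built-in `prcomp`, this choice corresponds to the `scale.` argument; a small sketch with made-up data:

```r
set.seed(1)
X <- data.frame(a = rnorm(50), b = rnorm(50, sd = 100))  # wildly different scales

pca_cor <- prcomp(X, scale. = TRUE)    # correlation-based PCA (standardized variables)
pca_cov <- prcomp(X, scale. = FALSE)   # covariance-based PCA (raw scales)

summary(pca_cor)$importance["Proportion of Variance", ]
summary(pca_cov)$importance["Proportion of Variance", ]  # PC1 dominated by the large-scale variable
```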

 


Example of PCA

Now let’s actually perform PCA on a toy dataset. We’ll be using a car dataset consisting of 74 different car types, with 13 features measured for each. Below is the data dictionary.

 

 

We’ll be using built-in functions in R. In the following code, simple preprocessing of the data is done and PCA is applied afterwards.
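The post’s own preprocessing code is not reproduced here; a minimal sketch of this step, assuming the data has been read into a data frame called `car` (the file name and column handling are assumptions), could look like this:

```r
# Hypothetical sketch -- the file name and exact columns are assumptions
car <- read.csv("car.csv")

# Keep the numeric measurement columns and drop rows with missing values
car_num <- na.omit(car[, sapply(car, is.numeric)])

# PCA on standardized variables, since the features use different units
pca <- prcomp(car_num, center = TRUE, scale. = TRUE)
summary(pca)
```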

 

From the summarized result of PCA, we got 12 principal components derived from the original data.

Let’s take a look at the proportion of variance explained by each principal component. This is equivalent to the amount of variation (or information) each principal component captures. The first two principal components alone retain around 71.7% of the total information, which means that components 3 through 12 together account for less than 30% of it. In practice, the number of principal components is often chosen so that the cumulative variance falls between 70% and 90%, so we can conclude that using the first two principal components is enough for our car dataset.
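These proportions come directly from the PC variances; with the `pca` object from the sketch above, they can be recomputed as:

```r
prop_var <- pca$sdev^2 / sum(pca$sdev^2)   # proportion of variance explained by each PC
cumsum(prop_var)                           # cumulative proportion; about 0.717 after PC2 per the post
```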

The number of principal components to use can also be examined with a scree plot.

 

In the scree plot, we choose the point where the decreasing curve levels off. The trend clearly flattens after the second principal component, which again confirms that using the first two components is enough.
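A scree plot like this can be drawn with base R’s `screeplot`, or by plotting the PC variances directly (again using the `pca` object from the sketch above):

```r
screeplot(pca, type = "lines", main = "Scree plot")

# equivalent manual version
plot(pca$sdev^2, type = "b",
     xlab = "Principal component", ylab = "Variance")
```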

As a next step, let’s interpret the result of PCA by using the first two principal components to represent the original dataset. A strong advantage of using just two components is that visualization becomes easy. This two-dimensional visualization of a PCA result is called a biplot, and here is the result. The specific code can be found here.
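The post’s styled biplot comes from the linked code; a bare-bones version can be drawn with base R’s `biplot` (using the `pca` object from the sketch above):

```r
biplot(pca, scale = 0, cex = 0.6)   # observations as labels, variables as arrows
```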

 

 

We can see that most of the variables are negatively correlated with the second PC, while the first PC separates the variables into two groups. Since the labels overlap quite a bit, let’s also examine the result numerically by looking at the principal component loadings, a matrix that expresses the correlation between the original variables and the PC variables.
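With `prcomp`, these numbers live in the `rotation` matrix; for standardized inputs, the correlations between the variables and the PCs can be recovered by scaling each column by the corresponding PC’s standard deviation (a sketch with the `pca` object from above):

```r
# Variable weights (eigenvector entries) for the first two PCs
round(pca$rotation[, 1:2], 2)

# Correlations between the original (standardized) variables and the first two PCs
round(sweep(pca$rotation[, 1:2], 2, pca$sdev[1:2], `*`), 2)
```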

 

 

The first PC is positively correlated with the variables <Price, Size of Headroom, Seat Clearance, Trunk Space, Weight, Length, Diameter, Displacement>. Can we find a common attribute among them? Referring to the data dictionary above, the first PC seems to be related to the size of a car. The second PC, on the other hand, is negatively correlated with most of the variables, and among them the repair records show the greatest negative correlation. Based on this, we can say the second PC is mostly related to the stability of a car.

In this way, we can sometimes name the principal components and use that information to interpret the biplot. In our biplot above, the cars grouped as category 1 (blue) are cars made by US companies. From this, we can conclude that American cars are generally large in size but show relatively weak stability compared with the cars in category 2 (Japanese cars) and category 3 (European cars).

In fact, the interpretation of the results may vary from researcher to researcher, since each person has a different amount of domain knowledge about the dataset. As I don’t have any profound knowledge of cars, my interpretation is probably shallow, but I still think it is interesting that we can name the newly defined variables and use them to interpret the results. This interpretability is one of the major advantages of statistical models.

This is the end of the first statistical dimension reduction method, PCA. In the following post, we will continue with another widely used dimension reduction method known as Factor Analysis.

