So, you have a sense of what the PCs are. Now, I have an important thing to tell you about how we use them. In fact, we frequently don’t use all the PCs. Instead, we use some fraction of them, starting with the first, which we think explain most of the variance, and I’m going to explain what I mean by that in a moment. So, for example, with the 2D dataset I’m showing you here, if we decide to only use the first PC, then instead of an x-coordinate and a y-coordinate for each data point, we just use a single coordinate that represents how far along the first basis dimension the point falls. This is a lower-dimensional representation of the same dataset, and it makes the most sense if the data approximately fall along a line to begin with. In this new representation, we lose the information about how far the original data points lie in the direction perpendicular to the first PC. However, if those distances are small, we don’t lose much information.

But if we have many dimensions to begin with, how do we decide how many PCs to use? Well, it can depend on the application, but one way is to calculate how much variance each of the PCs accounts for and drop those that account for the least variance. That seems reasonable, but how do we implement it quantitatively? Well, we know that the variance along each dimension is an important quantity in PCA. It turns out that the total variance is the same in the original basis as it is in the new basis. Let me explain what I mean by that.

Let’s start with the three data points we had before, where we’ve already mean-centered the data. The variance along the original horizontal dimension is the sum of the squares of these horizontal lengths, and the variance along the original vertical dimension is the sum of the squares of these vertical lengths. But because of the Pythagorean theorem, the sum of the squares of all of these lengths equals the sum of the squares of the distances from the origin to each data point.
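To make the one-coordinate idea concrete, here is a minimal NumPy sketch. The dataset and variable names are made up for illustration: four mean-centered 2D points that fall approximately along a line, projected onto their first PC.

```python
import numpy as np

# Hypothetical 2D dataset that lies approximately along a line.
X = np.array([[ 2.0,  1.9],
              [ 0.5,  0.6],
              [-1.0, -1.1],
              [-1.5, -1.4]])
X = X - X.mean(axis=0)          # mean-center the data

# The right singular vectors of the centered data are the PCs.
U, s, Vt = np.linalg.svd(X, full_matrices=False)
pc1 = Vt[0]                     # first principal component (unit vector)

# The 1D representation: how far along pc1 each point falls.
coords_1d = X @ pc1

# What we throw away: the component perpendicular to pc1.
residual = X - np.outer(coords_1d, pc1)
print(coords_1d)
print(np.linalg.norm(residual, axis=1))  # small when the data are nearly linear
```

The residual norms are exactly the perpendicular distances the passage describes; when they are small, the single coordinate `coords_1d` captures almost all of the spread.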
When we find the new PC basis, we can calculate the variances along the new dimensions. But since each new dimension is still orthogonal to every other, the Pythagorean theorem still applies: the sum of the variances along the new dimensions also equals the sum of the squares of the distances from the origin to each data point. And since those squared distances depend only on where the points sit relative to the center, not on which basis we use, the total variance is the same regardless of which basis you choose. This quantity, the sum of the squares of the distances from the origin to each data point, is called the total variance of the data.

As you have already seen, each principal component, that is, each new dimension for the data, is associated with some fraction of the total variance. The first PC is associated with the most, the second with the next most, and so on down the line. So, in order to decide how many PCs to keep, we might look at the variance of the data along each dimension and drop the dimensions along which the data vary the least. It is in this sense that PCA is used as a dimensionality reduction algorithm. When we drop the dimensions that capture the least of the spread of the data, we lose some information, but we retain most of the spread and thus most of the information.
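The basis-invariance of the total variance is easy to check numerically. Here is a short sketch, using a made-up 3D dataset: the per-axis variances in the original basis and the per-PC variances (the squared singular values of the centered data) both sum to the same total, and the per-PC fractions are what you would inspect to decide how many PCs to keep.

```python
import numpy as np

# Hypothetical 3D dataset with very different spreads along each axis.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3)) @ np.diag([3.0, 1.0, 0.1])
X = X - X.mean(axis=0)          # mean-center the data

# Total variance: sum of squared distances of the points from the origin.
total_var = np.sum(X ** 2)

# Variance along each original axis.
per_axis = np.sum(X ** 2, axis=0)

# Variance along each PC: the squared singular values of the centered data.
s = np.linalg.svd(X, compute_uv=False)
per_pc = s ** 2

print(total_var, per_axis.sum(), per_pc.sum())  # all three agree
print(per_pc / total_var)                       # fraction of total variance per PC
```

The last line is the quantity used in practice to pick how many PCs to keep: drop the trailing PCs whose fractions are negligible.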