Okay. Let’s get back to what we really want to talk about: PCA. In a nutshell, what is PCA? PCA is a series of calculations that gives us a new and special basis for our data. Why is it special? Well, the first dimension is the one along which the data points are the most spread out; we say they have the most variance along this dimension. What do we mean exactly? Suppose we start with a dataset and consider a new axis in the 2D plane, represented by this line through the origin. We find the coordinates of our points along this new axis by projecting each point onto the axis by the shortest path, that is, perpendicularly. Now consider the variance, or spread, of this set of coordinates along the line. When we do PCA, we are trying to choose our new axis so that the new coordinates are as spread out as possible, that is, so that they have maximum variance.
It turns out that the line along which the coordinates are most spread out is also the line that minimizes the perpendicular distances from the points to the line. We say that this basis minimizes reconstruction error. Maximizing variance and minimizing reconstruction error go hand in hand. For each point, the squared distance from the origin to the projection, plus the squared distance from the projection to the point, equals the squared distance from the origin to the point; this is just the Pythagorean theorem. The distance from the origin to the point does not change when you rotate the line, so as you change the orientation of the line, if one of the other two quantities increases, the other must decrease. The orientation chosen by PCA is the one that maximizes the sum of squared distances along the line over all points and simultaneously minimizes the sum of squared perpendicular distances to the line.
This is how we find the first basis direction. The next basis direction must be perpendicular, or orthogonal, to the first. In our little example, there is only one choice for this direction, because we are working in a two-dimensional space.
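To make the 2D picture concrete, here is a minimal sketch in Python with NumPy. It is not from the lecture: the dataset is made up (a cloud stretched along a direction of about 30 degrees), the data is centered so that lines through the origin are the right ones to consider, and the helper name `spread_and_error` is hypothetical. The sketch tries many orientations of the line and checks that spread along the line plus squared perpendicular distance stays constant, so the orientation with the most spread is also the one with the least reconstruction error.

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up 2D dataset: spread 3 along one direction, 1 along the other,
# then rotated by 30 degrees and centered on the origin.
theta = np.radians(30.0)
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])
X = (rng.normal(size=(200, 2)) * np.array([3.0, 1.0])) @ R.T
X = X - X.mean(axis=0)

def spread_and_error(X, angle):
    """Return (mean squared coordinate along the line, mean squared perpendicular distance)."""
    u = np.array([np.cos(angle), np.sin(angle)])  # unit vector along the candidate axis
    coords = X @ u                                # coordinate of each projected point
    residual = X - np.outer(coords, u)            # vector from each projection to its point
    return np.mean(coords ** 2), np.mean(np.sum(residual ** 2, axis=1))

# Try one candidate line per degree (0..179; 180 would repeat the line at 0).
angles = np.linspace(0.0, np.pi, 180, endpoint=False)
results = np.array([spread_and_error(X, a) for a in angles])
variance, error = results[:, 0], results[:, 1]

# Pythagorean theorem, averaged over all points: the sum does not depend on the angle.
total = np.mean(np.sum(X ** 2, axis=1))
print("variance + error constant:", np.allclose(variance + error, total))

# The angle with maximum spread is also the angle with minimum perpendicular error.
best = np.argmax(variance)
best_u = np.array([np.cos(angles[best]), np.sin(angles[best])])
print("best angle (degrees):", np.degrees(angles[best]))
print("same angle minimizes error:", best == np.argmin(error))

# Cross-check against the top eigenvector of the covariance matrix (sign is arbitrary).
eigvals, eigvecs = np.linalg.eigh(np.cov(X, rowvar=False))
v1 = eigvecs[:, -1]  # eigh returns eigenvalues in ascending order
print("agrees with top eigenvector:", abs(best_u @ v1) > 0.999)
```

The brute-force scan over angles is only for illustration; it works in 2D because a line through the origin is described by a single angle.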
But if we were working in a higher-dimensional space, the requirement for the next basis direction would be that it be orthogonal to the first and, among all such directions, maximize the variance of the points along it. Each later direction must be orthogonal to all the previous ones, and so on, until we have as many new dimensions as we had dimensions to start with. So, if we had data in four dimensions, PCA would give us four new axes.
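The four-dimensional case can be sketched the same way. The example below is an illustration under assumptions, not the lecture's own procedure: the dataset is made up, and instead of searching for the directions one at a time it uses the eigendecomposition of the covariance matrix, which is a standard way to obtain all of the directions at once. The names `X4`, `axes`, and `variances` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)

# Made-up 4D dataset: 500 points with a different spread per direction,
# mixed by a random matrix, then centered on the origin.
X4 = rng.normal(size=(500, 4)) * np.array([4.0, 2.0, 1.0, 0.5])
X4 = X4 @ rng.normal(size=(4, 4))
X4 = X4 - X4.mean(axis=0)

# Eigenvectors of the covariance matrix give the new axes;
# eigenvalues give the variance of the data along each axis.
cov = np.cov(X4, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)   # eigenvalues come back in ascending order
order = np.argsort(eigvals)[::-1]        # reorder so the most-spread axis is first
variances = eigvals[order]
axes = eigvecs[:, order]                 # each column is one new axis

# Coordinates of every point in the new basis.
new_coords = X4 @ axes

print("number of new axes:", axes.shape[1])
print("axes are mutually orthogonal unit vectors:", np.allclose(axes.T @ axes, np.eye(4)))
print("variance along each new axis:", new_coords.var(axis=0, ddof=1))
print("variances are in decreasing order:", bool(np.all(np.diff(variances) <= 0)))
```

Four input dimensions give four new axes, each orthogonal to the others, ordered from most to least variance.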