Hello and welcome. In this notebook, we will learn how to use principal component analysis for dimensionality reduction. Dimensionality reduction is one of the main applications of PCA. In the previous lessons, you’ve already learned how PCA works and about eigenvectors and eigenvalues. In this notebook, we will see how to apply PCA to a small dataset.

Let’s begin by understanding what dimensionality reduction is all about. Suppose we had some two-dimensional data that looks like this. We can see that most of the data points lie close to a straight line. Most of the variation in the data occurs along this direction, while there’s not much variation along this other direction. This means we can explain most of the variation in the data just by looking at how the data points are distributed along this straight line. Therefore, we could reduce this two-dimensional data to one-dimensional data by projecting all the data points onto the line. Projecting the data onto a straight line reduces the number of variables needed to describe the data, because only one number is needed to specify a data point’s position along a line. The two variables that describe the original data can therefore be replaced by a single new variable that encodes this linear relationship. It is important to note that the new variable is just an abstract tool that allows us to express the data in a more compact form; it may or may not be interpretable as a real-world quantity.

Now, let’s see how we can do this in code. For simplicity, we will use a small two-dimensional dataset. In a later notebook, you’ll get a chance to apply what you learn here to real stock data. We will start by creating some randomly correlated data. In this code, you can choose the range of your data and the amount of correlation. The code outputs a plot with the data points and the measured correlation.
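One common way to generate correlated 2-D data like this is to mix one random variable with independent noise. The sketch below is illustrative and is not the notebook’s actual code; the function name `create_correlated_data` and the mixing scheme are assumptions, and the resulting sample correlation is only approximately controlled by the `correlation` parameter.

```python
import numpy as np

def create_correlated_data(num_points=100, low=10, high=80, correlation=0.8, seed=0):
    """Generate 2-D data in [low, high] whose columns are roughly correlated."""
    rng = np.random.default_rng(seed)
    x = rng.uniform(low, high, num_points)
    noise = rng.uniform(low, high, num_points)
    # A convex combination of x and independent noise stays in [low, high]
    # and gives a stronger linear relationship as `correlation` grows.
    y = correlation * x + (1 - correlation) * noise
    return np.column_stack((x, y))

data = create_correlated_data()
# Sample Pearson correlation between the two columns
print(np.corrcoef(data[:, 0], data[:, 1])[0, 1])
```

Note that with this mixing scheme the measured correlation is higher than the `correlation` parameter itself; the parameter just controls how much of `y` comes from `x` versus from noise.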
In this case, we chose our data range to be between 10 and 80, so the data points range between 10 and 80 along both the x and y axes. Remember, a correlation of zero means no correlation at all, and a correlation of one means complete correlation. You can vary the amount of correlation to create whatever data you like.

Once you have created your data, the next step in PCA is to center your data around zero. It is also customary to normalize your data; together, these steps are known as mean normalization. While centering the data is necessary, normalizing it is optional, and that is what I have done here. Mean normalization not only centers your data around zero but also distributes it evenly in a small interval around zero. As you can see here, the data is no longer in the range between 10 and 80; after normalization, it is distributed between minus three and three. This will help your algorithm converge faster.

With our data centered, we’re ready to perform PCA. To do this, we will use a package called scikit-learn. Scikit-learn’s PCA class allows us to easily apply PCA to data. The first thing we need to do is create a PCA object with a given set of parameters, including the number of principal components we want to use. We’ll start with two components because we want to visualize them later. Here, we can see the parameters that the PCA algorithm is going to use. The next step is to pass the data to the PCA object using the fit method. A quick note: in scikit-learn, the PCA algorithm automatically centers the data for you, so you could pass the original dataset to the fit method instead of the normalized data, as we have done here. Once we fit the data, we can use the attributes of the PCA class to see the eigenvectors, also known as the principal components, and their eigenvalues. One important attribute of the PCA class is the explained variance ratio.
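The normalize-then-fit steps just described can be sketched as follows. The generated dataset here is a stand-in (correlated data in the 10-to-80 range, not the notebook’s exact data); the PCA attributes shown (`components_`, `explained_variance_`, `explained_variance_ratio_`) are scikit-learn’s real names.

```python
import numpy as np
from sklearn.decomposition import PCA

# Stand-in for the notebook's data: strongly correlated 2-D points in [10, 80]
rng = np.random.default_rng(42)
x = rng.uniform(10, 80, 200)
y = x + rng.normal(0, 5, 200)
data = np.column_stack((x, y))

# Mean normalization: center each column at zero and scale by its standard
# deviation, so the values end up in a small interval around zero
data_norm = (data - data.mean(axis=0)) / data.std(axis=0)

# Create the PCA object with two components, then fit it to the data
pca = PCA(n_components=2)
pca.fit(data_norm)

print(pca.components_)               # eigenvectors (principal components), one per row
print(pca.explained_variance_)       # the corresponding eigenvalues
print(pca.explained_variance_ratio_) # fraction of variance each component explains
```

Because `n_components` equals the number of features here, the explained variance ratios sum to one.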
The explained variance ratio gives us the percentage of variance explained by each of the principal components. In general, the principal components with the largest eigenvalues explain the majority of the variance, and these are usually the ones we want to keep. For example, here we can see that the first principal component explains 94 percent of the variance and has the largest eigenvalue.

Now that we have the principal components, we can visualize them. Here, we see the data along with the first and second principal components. We can see that the first principal component lies along the direction in which the data varies the most.

One question that you will frequently face is how many principal components you should use. For example, suppose you had a dataset with 1,000 dimensions. Should you reduce it to 500 dimensions, or could you do better and reduce it to 100? Usually, the number of principal components is chosen according to how much of the variance of the original data you want to retain. For example, you may want to retain 90 percent of the variance, or you may only want to retain 50 percent. You can use the explained variance ratio attribute of the PCA class to determine how many components you need to keep to retain a given amount of variance. For example, if you wanted to retain 90 percent of the variance, you can add up the elements in the explained variance ratio array until the desired value is reached. The number of elements you had to add up determines the number of principal components needed to retain that level of variance. For example, if you had to add up five elements to reach 90 percent of the variance, then you need five principal components.

Now that we have seen what all the principal components look like, we will use PCA to perform dimensionality reduction.
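The adding-up procedure for choosing the number of components can be sketched like this. The higher-dimensional dataset below is made up for illustration (a few hidden signals mixed into ten dimensions plus noise); the cumulative-sum trick over `explained_variance_ratio_` is the part that carries over to any dataset.

```python
import numpy as np
from sklearn.decomposition import PCA

# Hypothetical 10-dimensional dataset: 3 hidden signals mixed into 10
# observed variables, plus a little noise
rng = np.random.default_rng(0)
signal = rng.normal(size=(300, 3))
mixing = rng.normal(size=(3, 10))
data = signal @ mixing + 0.1 * rng.normal(size=(300, 10))

pca = PCA(n_components=10)
pca.fit(data)

# Running total of the explained variance ratios, component by component
cumulative = np.cumsum(pca.explained_variance_ratio_)

# Index of the first component at which the running total reaches 90 percent
n_components_90 = int(np.argmax(cumulative >= 0.90)) + 1
print(n_components_90, cumulative[:n_components_90])
```

The same idea works for any target: replace `0.90` with `0.50` to find how many components retain half the variance.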
Since the data we’re using in this simple example only has two dimensions, the best we can do is reduce it to one dimension. So, we now set the number of principal components in our PCA algorithm to one. Once we run the PCA algorithm with only one component, we can see what the transformed data looks like by using the transform method of the PCA class. In this simple case, the transform method projects the data onto the first principal component, so we will just get a straight line. When working with higher-dimensional data, the transform method will project the data onto the lower-dimensional surface determined by the number of principal components you used in your algorithm.
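The one-component reduction can be sketched as follows. The generated data is again a stand-in for the notebook’s dataset; `fit_transform` and `inverse_transform` are scikit-learn’s real methods, with `inverse_transform` added here to show that the projected points do lie on a straight line in the original space.

```python
import numpy as np
from sklearn.decomposition import PCA

# Stand-in 2-D dataset, mean-normalized as before
rng = np.random.default_rng(1)
x = rng.uniform(10, 80, 100)
data = np.column_stack((x, x + rng.normal(0, 5, 100)))
data_norm = (data - data.mean(axis=0)) / data.std(axis=0)

# Keep only the first principal component
pca = PCA(n_components=1)
data_1d = pca.fit_transform(data_norm)  # fit, then project, in one step
print(data_1d.shape)                    # one coordinate per data point

# Mapping the 1-D coordinates back into 2-D shows the projection: every
# reconstructed point lies on the line of the first principal component
reconstructed = pca.inverse_transform(data_1d)
```

The reconstruction is lossy: the variation along the second principal component is discarded, which is exactly the trade-off dimensionality reduction makes.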