# 7 – M2L3 10 Linear Regression V4

Now we’ll look at how to use one random variable to predict another. We’ll cover the basics of regression, since it forms the basis for several models used to analyze stock returns over time.

Suppose we want to estimate the price of a house. We may assume that home buyers are willing to pay more for a bigger house, all other things being equal, so we collect data on the area of each house as well as its price. We want to find a coefficient that we can multiply by the area, and a constant term to add, which we refer to as the intercept: price = coefficient × area + intercept. This is the equation for a straight line.

Fortunately, we don’t have to guess the best values for the coefficient and intercept. When we plot price against area, we can draw a line that passes as close as possible to the points. We measure how well the line fits the data by the vertical difference between each point and the line. The optimal regression line is the one that minimizes the sum of the squares of these differences, and the process that finds it is called ordinary least squares.

Even after we find the best regression line, we can expect to see differences between the data points and the line. These differences between each data point and the best-fit regression line are called residuals, or error terms. In other words, the residual is the difference between the actual value and the predicted value.

We can check whether the residuals follow a normal distribution. If they follow a normal distribution with a mean of zero and a constant standard deviation, then the residuals can be considered random. By random, I mean that the model’s predicted value is equally likely to be higher or lower than the actual value. If, however, the average of the residuals is not zero, this hints that the model has a bias in its prediction errors. One way to improve our model is to look for other independent variables.
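To make the house-price example concrete, here is a minimal sketch of ordinary least squares for a single independent variable, written in plain Python. The data values and the names `areas`, `prices`, `fit_ols`, and `residuals` are made up for illustration and do not come from the lecture. Note that when the line includes an intercept, the residuals on the fitting data average to zero by construction, so the bias check is most informative on data that was not used for the fit.

```python
# Hypothetical data: house areas (square meters) and prices (thousands).
areas = [50.0, 70.0, 80.0, 100.0, 120.0]
prices = [150.0, 200.0, 230.0, 285.0, 330.0]


def fit_ols(x, y):
    """Return (coefficient, intercept) that minimize the sum of squared residuals."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Closed-form least-squares solution for one independent variable.
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    coefficient = sxy / sxx
    intercept = mean_y - coefficient * mean_x
    return coefficient, intercept


def residuals(x, y, coefficient, intercept):
    """Residual = actual value minus the value predicted by the line."""
    return [yi - (coefficient * xi + intercept) for xi, yi in zip(x, y)]


def sum_of_squares(values):
    return sum(v * v for v in values)


coefficient, intercept = fit_ols(areas, prices)
resid = residuals(areas, prices, coefficient, intercept)
mean_resid = sum(resid) / len(resid)

print("coefficient:", coefficient)
print("intercept:", intercept)
print("mean residual:", mean_resid)
```

Nudging the coefficient or the intercept away from the fitted values and recomputing `sum_of_squares(residuals(...))` should always give a larger total, which is what "least squares" means.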
Using more than one independent variable to predict a dependent variable is called multiple regression.

Now that we have fit a regression model, we want to check whether we can rely on it for future predictions. One measure of how well our model fits the data is the R-squared value, a metric that ranges from zero to one. An R-squared of one means that all the variation in the dependent variable can be explained by the variation in the independent variables. A better metric is the adjusted R-squared, which penalizes adding independent variables that don’t improve the fit, and so helps us find the smallest combination of independent variables that are most relevant for our model.

Another way to check our model is by performing an F-test. The F-test checks whether our coefficients are jointly different from zero, and therefore meaningful. If we get a p-value of 0.05 or smaller, we can reject the hypothesis that the parameters are all zero. When the parameters are not all zero, we can say that our model describes a meaningful relationship.
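The fit metrics described above can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the lecture's own code: the data and the function names are hypothetical, the model is a least-squares line with an intercept, and the F-statistic uses the standard formula based on R-squared for such a model.

```python
# Hypothetical data: house areas (square meters) and prices (thousands).
areas = [50.0, 70.0, 80.0, 100.0, 120.0]
prices = [150.0, 200.0, 230.0, 285.0, 330.0]


def fit_line(x, y):
    """Least-squares coefficient and intercept for one independent variable."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    coefficient = sxy / sxx
    return coefficient, mean_y - coefficient * mean_x


def r_squared(actual, predicted):
    """Share of the variation in the actual values explained by the predictions."""
    mean_y = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_y) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot


def adjusted_r_squared(r2, n_obs, n_predictors):
    """R-squared penalized for the number of independent variables."""
    return 1.0 - (1.0 - r2) * (n_obs - 1) / (n_obs - n_predictors - 1)


def f_statistic(r2, n_obs, n_predictors):
    """Overall F-statistic: explained variation per predictor divided by
    unexplained variation per residual degree of freedom."""
    return (r2 / n_predictors) / ((1.0 - r2) / (n_obs - n_predictors - 1))


coefficient, intercept = fit_line(areas, prices)
predicted = [coefficient * a + intercept for a in areas]

n_obs = len(prices)
n_predictors = 1  # only the area is used here

r2 = r_squared(prices, predicted)
adj_r2 = adjusted_r_squared(r2, n_obs, n_predictors)
f_stat = f_statistic(r2, n_obs, n_predictors)

print("R-squared:", r2)
print("adjusted R-squared:", adj_r2)
print("F-statistic:", f_stat)
```

Turning the F-statistic into a p-value requires the F distribution with `n_predictors` and `n_obs - n_predictors - 1` degrees of freedom. If SciPy is available, `scipy.stats.f.sf(f_stat, n_predictors, n_obs - n_predictors - 1)` is one way to obtain it.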