We’ve just seen how to model our data if we assume that it is normally distributed. But how do we decide if our data can be described by a normal distribution in the first place? A quick way to check visually is to plot a histogram of our data. We can then compare the histogram to a plot of the normal distribution. Does this data’s distribution look normal? How about this one? Does it look like we can still describe it with a normal distribution?

We can also use a boxplot to check for normality. As you’ll soon see, the boxplot helps to check for symmetry in your data, and normal distributions are symmetric around their mean. Here is a boxplot. There is a box with a line inside. The bottom edge of the box, the middle line in the box, and the top edge of the box are quantiles that divide the data into groups. We call the dividing lines quantiles because they split the data into groups with the same number of points. Since these three dividing lines create four groups, we call them quartiles. Remember that the quartiles are the three dividing lines, not the four groups themselves.

The line inside the box is the median: if you ordered your entire dataset from smallest to largest, 50 percent of your data points would be less than the median. The bottom side of the box represents the first quartile; twenty-five percent of the data are less than the first quartile. The top side of the box is the third quartile; seventy-five percent of the data are less than the third quartile. We can calculate the interquartile range by subtracting the first quartile from the third quartile.

The boxplot has small lines on either end. We call these lines whiskers. The lower whisker is defined as the first quartile minus 1.5 times the interquartile range. Similarly, the upper whisker is set to the third quartile plus 1.5 times the interquartile range. Points lying outside of the whiskers can be considered outliers and are visualized as individual points.
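The boxplot quantities just described can be computed directly. Here is a minimal sketch using NumPy on a synthetic sample (the data and variable names are illustrative, not from the lecture):

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(loc=0.0, scale=1.0, size=1_000)  # synthetic sample

# The three quartiles: the dividing lines, not the four groups they create.
q1, median, q3 = np.percentile(data, [25, 50, 75])

# Interquartile range: third quartile minus first quartile.
iqr = q3 - q1

# Whiskers: 1.5 * IQR below the first quartile and above the third.
lower_whisker = q1 - 1.5 * iqr
upper_whisker = q3 + 1.5 * iqr

# Points beyond the whiskers are the ones drawn individually as outliers.
outliers = data[(data < lower_whisker) | (data > upper_whisker)]

print(f"Q1={q1:.2f}  median={median:.2f}  Q3={q3:.2f}  IQR={iqr:.2f}")
print(f"{outliers.size} of {data.size} points fall outside the whiskers")
```

For a roughly normal sample like this one, only a small fraction of points land outside the whiskers.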
Going back to our test for normality: the normal distribution is symmetric, and when a distribution is symmetric, its median is equal to its mean. The first quartile and third quartile are also the same distance away from the median, and so are the whiskers. By visualizing your data with a boxplot, you can see whether it is symmetric or not. If it is not symmetric, it is not normally distributed.

We can look at the data using both a box-and-whisker plot and a histogram. Here, it looks like it follows a normal distribution. This other dataset is not normal. In fact, we say that it is skewed, because it has a long tail of data to only one side of the distribution. When there is a long tail of data to the left, the mean of the distribution lies to the left of the median, and we say that the distribution is left skewed. In fact, stock returns tend to exhibit left skew and fat tails. This means that extreme negative returns occur with greater frequency than would be expected under a normal distribution.

Please note that even if your data is symmetric, it does not necessarily follow the bell-shaped curve of a normal distribution. So it helps to see the box-and-whisker plot as well as the histogram.

To do a more thorough check for normality, we can use a QQ plot. The QQ plot checks whether the shape of the data matches the shape of our probability density function. The first Q in QQ plot stands for quantile, and so does the second Q. Common quantiles are quartiles, deciles, and percentiles. Recall that when we sort the data points from smallest to largest, we can find boundaries that divide the data into groups with an equal number of points. If we divide the data into 10 buckets, the nine boundaries are called deciles. If we divide the data into 100 buckets, the 99 boundaries are called percentiles. Notice that this is not the same as plotting a histogram. With a histogram, the buckets have the same width.
With quantiles, the buckets all have the same number of data points. QQ plots let us compare any two distributions to check if they have the same shape. Since we want to check whether our data is shaped like a normal distribution, we can plot its quantiles against the quantiles of a normal distribution. As an example, let’s take the 50th percentile of our data as our y-coordinate. Next, we take the 50th percentile of the standard normal distribution as our x-coordinate. We plot this (x, y) point on the QQ plot, and we repeat this for all the other percentiles. If the two distributions have the same shape, then the plotted points will follow a straight line that goes from the bottom left to the top right. In this QQ plot, we can see that the data does not look like a normal distribution, because the plotted points form a curve instead of a straight line.

So far, we’ve seen plots that help us visually check whether our data looks normal. What if we wanted a single number to represent how normal the data is? With a single number, we could choose a cutoff point: anything on one side of the cutoff, and we decide the data is normal; anything on the other side, and we decide it is not.

The Shapiro-Wilk and D’Agostino-Pearson tests are hypothesis tests that give a p-value, which ranges from zero to one. In both tests, the null hypothesis assumes that the data is normally distributed. If the p-value is larger than 0.05, then we will assume that the data follows a normal distribution. If the p-value is less than 0.05, then the data likely does not follow a normal distribution.

There is another, more general test to decide whether any two distributions are the same, called the Kolmogorov-Smirnov test. Given two distributions, the Kolmogorov-Smirnov test returns a p-value. If the p-value is greater than 0.05, we will assume that the two distributions are the same.
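The quantile bookkeeping and the QQ-plot construction described above can be sketched in a few lines of Python. This is an illustrative example on synthetic data, not the lecture's own code:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
data = rng.normal(loc=5.0, scale=2.0, size=10_000)  # synthetic sample

# Deciles: the nine boundaries that split the sorted data into 10 buckets
# holding the same number of points (unlike histogram bins of equal width).
deciles = np.percentile(data, np.arange(10, 100, 10))
edges = np.concatenate(([data.min()], deciles, [data.max()]))
counts, _ = np.histogram(data, bins=edges)
# Each of the 10 buckets holds close to 1,000 of the 10,000 points.

# QQ plot by hand: pair each percentile of the data (y-coordinate) with
# the same percentile of the standard normal distribution (x-coordinate).
percentiles = np.arange(1, 100)
x = stats.norm.ppf(percentiles / 100)   # theoretical quantiles
y = np.percentile(data, percentiles)    # sample quantiles

# For normal data the (x, y) points fall near a straight line; fitting
# that line recovers roughly the sample's scale (slope) and mean (intercept).
slope, intercept = np.polyfit(x, y, 1)
print(f"slope ~ {slope:.2f}, intercept ~ {intercept:.2f}")
```

Plotting `x` against `y` (for instance with matplotlib) gives the straight line described above for normal data, and a curve otherwise.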
If the p-value is less than 0.05, then we say the two distributions are not the same.

Remember the reason why we care about the data being normal. When we use statistical models such as regression, we use hypothesis tests to check whether we can trust the model’s parameters. These tests assume that our data is normally distributed. If the data is not normally distributed, these tests tend to tell us that our model is valid when in fact it is not.
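All three hypothesis tests mentioned above are available in SciPy. Here is a minimal sketch on synthetic samples, using the 0.05 cutoff from the discussion (sample sizes and seed are arbitrary choices of mine):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
normal_sample = rng.normal(size=500)        # synthetic, drawn from a normal
skewed_sample = rng.exponential(size=500)   # synthetic, clearly not normal

# Shapiro-Wilk and D'Agostino-Pearson: null hypothesis = data is normal.
_, p_shapiro = stats.shapiro(normal_sample)
_, p_dagostino = stats.normaltest(normal_sample)

# Kolmogorov-Smirnov: compares a sample against any reference distribution,
# here the standard normal.
_, p_ks = stats.kstest(skewed_sample, "norm")

for name, p in [("Shapiro-Wilk", p_shapiro),
                ("D'Agostino-Pearson", p_dagostino),
                ("Kolmogorov-Smirnov", p_ks)]:
    verdict = "assume normal" if p > 0.05 else "likely not normal"
    print(f"{name}: p = {p:.3g} -> {verdict}")
```

The strongly skewed exponential sample should fail the Kolmogorov-Smirnov comparison against the normal distribution, while the normal sample will usually (though, being a random draw, not always) pass the first two tests.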