2 – M2L3 02 Distributions V2

Many statistical models assume that the data follow a normal distribution, also referred to as a Gaussian distribution or a bell curve. This is important when checking whether our models are valid. There are various tests that we use to check that our models describe a meaningful relationship, and these tests assume that the data are normally distributed. If the data are not normally distributed, then these tests can conclude that the model is valid when in fact it is not. We will review the concepts of normality, show how to check for normality, and introduce some ways to transform our data so that they follow a normal distribution.

What is a random variable? A random variable is a variable that can take on a random value. The probability that the random variable takes a particular value is determined by its probability distribution. You can think of a probability distribution as a range of numbers, each with a probability associated with it. With data from the real world, we don't actually know the underlying probability distribution. However, we can often approximate it with an equation.

If we say that a random variable is normally distributed, how do we visualize what this means? We can imagine the random variable as a tennis ball machine. This machine shoots tennis balls onto a number line that ranges from negative infinity to infinity. Buckets of equal size are placed along the number line, so that the tennis balls collect in these buckets. If the random variable's distribution is centered around zero, this means that tennis balls are more likely to hit the number line near zero than, say, near 500. If we shoot enough tennis balls along this number line, we end up with a hill-shaped pile of tennis balls. A plot of how many balls land in each bucket is called a histogram. With enough tennis balls, the histogram resembles the shape of the random variable's probability distribution.
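
To make the tennis ball machine concrete, here is a minimal sketch in Python using NumPy. It is only an illustration under stated assumptions: the function names (shoot_tennis_balls, count_per_bucket), the standard normal centered at zero, the bucket width of 0.5, and the range from -4 to 4 are all choices made for this example, not something given in the lecture.

```python
import numpy as np

def shoot_tennis_balls(n_balls, center=0.0, spread=1.0, seed=0):
    """Return landing positions for n_balls drawn from a normal distribution.

    center and spread are the mean and standard deviation (assumed values).
    """
    rng = np.random.default_rng(seed)
    return rng.normal(loc=center, scale=spread, size=n_balls)

def count_per_bucket(positions, bucket_width=0.5, lo=-4.0, hi=4.0):
    """Count how many balls land in each equal-width bucket between lo and hi.

    Balls that land outside [lo, hi] are not counted.
    """
    n_buckets = int(round((hi - lo) / bucket_width))
    counts, edges = np.histogram(positions, bins=n_buckets, range=(lo, hi))
    return counts, edges

positions = shoot_tennis_balls(10_000)
counts, edges = count_per_bucket(positions)

# Print a sideways text histogram: one row per bucket, one '#' per 50 balls.
for count, left_edge in zip(counts, edges[:-1]):
    print(f"{left_edge:5.1f} | {'#' * (count // 50)}")
```

Running this prints one row per bucket, and the rows closest to zero should be the longest, which is the hill-shaped pile described above. Increasing the number of balls should make the shape look smoother and closer to the bell curve.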