6 – M2L3 09 Transforming Data V3

So, now the question is: what do we do when our data is not normally distributed? Similarly, what do we do when our data is heteroscedastic? To reshape our data and make it more normal, we can feed our data into the log function. To get data that is homoscedastic, we can take the time difference between periods. By time difference, I mean that we can view our data as the rate of return from one day to the next. Similarly, we could get the time difference by subtracting each previous day’s value from the current day’s value. In practice, we take the rate of change for each period and then apply the log function, that is, we take the log of one plus the rate of change. You may have seen this earlier when learning about why we model financial data with log returns.
One way to make our data closer to both normally distributed and homoscedastic is to apply the Box-Cox transformation. To preview what the Box-Cox transformation does, let’s get a conceptual idea of what a transformation looks like. Imagine you place all your data points on a number line. They are not evenly spaced: there are odd clusters in some parts and no data points in others. Think of this as a necklace in which the beads are not spaced out very nicely. Now, imagine we could nudge each of these beads a little bit to the left or to the right to even out the spacing. Notice that since these beads are all on the same string, the relative order of the data points remains the same; we’re just spacing them out in a way that’s easier to work with. In math terms, we just applied a monotonic transformation. A monotonic transformation changes the values of a dataset but preserves their relative order.
The Box-Cox transformation takes a dataset and outputs a dataset that is more normally distributed. You can see that the Box-Cox transformation has one parameter, lambda. If you choose lambda to be zero, then the transformation function is defined as the natural log.
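
To make the log returns and the Box-Cox ideas above a bit more concrete, here is a minimal sketch. It assumes NumPy is available, the prices are made-up numbers, and the helper name box_cox is hypothetical. It uses the standard textbook form of Box-Cox for strictly positive data: (x^lambda - 1) / lambda when lambda is not zero, and the natural log of x when lambda is zero.

```python
import numpy as np


def box_cox(x, lam):
    """Box-Cox transform of strictly positive data for a given lambda.

    Hypothetical helper for illustration, not a library function.
    """
    x = np.asarray(x, dtype=float)
    if np.any(x <= 0):
        raise ValueError("Box-Cox requires strictly positive data")
    if lam == 0:
        # Special case: lambda = 0 is defined as the natural log.
        return np.log(x)
    return (x ** lam - 1.0) / lam


# Made-up daily closing prices, for illustration only.
prices = np.array([100.0, 102.0, 101.0, 105.0, 104.0, 110.0])

# Rate of change from one day to the next, then the log of (1 + rate).
simple_returns = prices[1:] / prices[:-1] - 1.0
log_returns = np.log(1.0 + simple_returns)

# Equivalent view: the day-to-day difference of the log prices.
log_price_diff = np.diff(np.log(prices))

# Box-Cox with lambda = 0 is just the natural log of the data.
bc_zero = box_cox(prices, 0.0)

# Box-Cox with another lambda changes the values but keeps their order.
bc_half = box_cox(prices, 0.5)
same_order = np.array_equal(np.argsort(prices), np.argsort(bc_half))
```

Note that Box-Cox in this form only accepts positive values, so in this sketch it is applied to the prices and not to the returns, which can be negative.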
You can try different values for lambda to transform the data, then perform tests for normality and homoscedasticity. Once you can say your data is normally distributed, you can use it in the various models that we’ll discuss in the rest of the lesson.
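
As a rough sketch of that trial-and-error loop, the example below assumes SciPy is installed and uses synthetic lognormal data in place of real data. It tries a few candidate lambdas with scipy.stats.boxcox and checks each result with the Shapiro-Wilk normality test (scipy.stats.shapiro). The candidate list and variable names are illustrative choices, and a homoscedasticity test is left out to keep the sketch short.

```python
import numpy as np
from scipy import stats

# Synthetic, strictly positive, right-skewed data (assumption: a stand-in
# for real data, generated with a fixed seed so the run is repeatable).
rng = np.random.default_rng(0)
data = rng.lognormal(mean=0.0, sigma=1.0, size=500)

# Try several lambdas and record the Shapiro-Wilk statistic and p-value.
candidate_lambdas = [-1.0, -0.5, 0.0, 0.5, 1.0]
results = {}
for lam in candidate_lambdas:
    transformed = stats.boxcox(data, lmbda=lam)
    w_stat, p_value = stats.shapiro(transformed)
    results[lam] = (w_stat, p_value)
    print(f"lambda={lam:+.1f}  W={w_stat:.4f}  p={p_value:.4g}")

# A W statistic closer to 1 means the data looks more normal.
best_lambda = max(results, key=lambda lam: results[lam][0])
print("best candidate lambda:", best_lambda)

# When no lambda is given, SciPy also returns a maximum-likelihood estimate.
_, fitted_lambda = stats.boxcox(data)
print("fitted lambda:", fitted_lambda)
```

In the Shapiro-Wilk test the null hypothesis is that the data is normal, so a small p-value is evidence against normality, while a large p-value only means the test found no reason to reject it.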