Let’s take a closer look at linear function approximation and how to estimate the parameter vector w. As you’ve seen already, a linear function is a simple sum over all the features multiplied by their corresponding weights. Let’s assume you have initialized these weights randomly and computed the value of a state, v hat(s, w). How would you tweak w to bring the approximation closer and closer to the true function? Sounds like a numerical optimization problem. Let’s use gradient descent to find the optimal parameter vector.

Firstly, note that since v hat is a linear function, its derivative with respect to w is simply the feature vector x(s). This is a nice property of linear functions and part of why they are so popular. We’ll use this shortly. Now let’s think about what we are trying to optimize. We said we want to reduce or minimize the difference between the true value function v pi and the approximate value function v hat. Let’s write that down as a squared difference, since we are not concerned with the sign of the error and we simply want to drive the difference down toward zero. To be more accurate, since reinforcement learning domains are typically stochastic, this is the expected squared error. We now have an objective function to minimize. To do that using gradient descent, let’s find the gradient or derivative of this function with respect to w. Using the chain rule of differentiation, we get minus two times the value difference times the derivative of v hat, which we earlier noted is simply the feature vector x(s). Note that we drop the expectation operator here to focus on the error gradient for a single state s, which we assume has been chosen stochastically. If we sample enough states, we can come close to the expected value. Let’s plug this into the general form of a gradient descent update rule, with alpha as a step-size or learning rate parameter.
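The update described above can be sketched in a few lines of code. This is a minimal illustration, not a full agent: the feature vector, the true value v pi(s), and the learning rate are toy values assumed for the example, and in practice the true value would be replaced by a sampled target.

```python
import numpy as np

def v_hat(x_s, w):
    # Linear approximation: a weighted sum of the state's features.
    return x_s @ w

def sgd_update(w, x_s, v_true, alpha):
    # delta_w = alpha * (v_pi(s) - v_hat(s, w)) * x(s)
    # The derivative of v_hat with respect to w is just x(s).
    error = v_true - v_hat(x_s, w)
    return w + alpha * error * x_s

# Toy example: repeatedly update w on one sampled state.
w = np.zeros(3)
x_s = np.array([1.0, 2.0, 0.5])   # feature vector x(s), assumed for illustration
for _ in range(1000):
    w = sgd_update(w, x_s, v_true=4.0, alpha=0.1)
print(v_hat(x_s, w))  # approaches the true value, 4.0
```

Each iteration shrinks the error by a constant factor here, so the approximation converges quickly on this single state; with many stochastically sampled states, the same rule drives down the expected squared error.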
Note that the minus half here is just to cancel out the minus two we got in the derivative. This is the basic formulation we will use to iteratively reduce the error for each sample state until the approximate and true functions are almost equal. Here’s an intuitive explanation of how gradient descent optimizes the parameter vector: in each iteration, you change the weights a small step away from the direction of increasing error. The feature vector tells us which direction is bad, so that we can move away from it.

So far, we’ve only been talking about approximating the state-value function. In order to solve a model-free control problem, that is, to take actions in an unknown environment, we need to approximate the action-value function. We can do this by defining a feature transformation that utilizes both the state and action; then we can use the same gradient descent method as we did for the state-value function.

Finally, let’s look at the case where we wish the approximation function to compute all of the action-values at once. We can think of this as producing an action-value vector. For this purpose, we can continue to use the same feature transformation as before, taking in both the state and action. But now, how do we generate the different action-values? One way of thinking about it is that we are trying to find n different action-value functions, one for each action dimension. Intuitively, though, we know that these functions are related, so it makes sense to compute them together. We can do this by extending our weight vector into a matrix. Each column of the matrix emulates a separate linear function, but the common features computed from the state and action keep these functions tied to each other. If we have a problem domain with a continuous state space but a discrete action space, which is very common, we can easily select the action with the maximum value.
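The weight-matrix idea can be sketched as follows. The feature values and weights here are made-up numbers purely for illustration; each column of W plays the role of one linear action-value function.

```python
import numpy as np

def q_hat_all(x_s, W):
    # W has one column per action; each column is a separate linear
    # function, but all columns share the same feature vector.
    return x_s @ W   # returns a vector with one value per action

# Toy example: 3 features, 2 discrete actions (weights assumed for illustration).
W = np.array([[0.5,  1.0],
              [0.2, -0.3],
              [0.0,  0.7]])
x_s = np.array([1.0, 2.0, 0.5])

q = q_hat_all(x_s, W)          # all action-values in one matrix product
best_action = int(np.argmax(q))  # greedy action selection over a discrete set
```

A single matrix product produces every action-value at once, so greedy action selection is just an argmax over the resulting vector.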
Without this sort of parallel processing, we would have to evaluate each action one by one and then find their maximum. If our action space is also continuous, then this form allows us to output more than a single value at once. For example, if we were driving a car, we’d want to control both steering and throttle at the same time. The primary limitation of linear function approximation is that we can only represent linear relationships between inputs and outputs. With one-dimensional input, this is basically a line. In 2D, it becomes a plane, and so on. What if our underlying value function has a non-linear shape? A linear approximation may give a very bad result. That’s when we need to start looking at non-linear functions.
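To make that limitation concrete, here is a small sketch. The toy value function v pi(s) = s squared is an assumption chosen only to illustrate the point; even the best possible linear fit leaves a large residual error.

```python
import numpy as np

# Assumed toy value function with a non-linear shape: v_pi(s) = s^2.
s = np.linspace(-1.0, 1.0, 101)
v_true = s ** 2

# Features for a linear approximator: [s, 1] (slope and bias term).
X = np.stack([s, np.ones_like(s)], axis=1)

# Best linear fit in the least-squares sense.
w, *_ = np.linalg.lstsq(X, v_true, rcond=None)
v_lin = X @ w

# Even the optimal linear weights leave a substantial worst-case error.
max_err = float(np.max(np.abs(v_lin - v_true)))
```

By symmetry the best line is nearly flat, so the approximation misses both the valley at s = 0 and the rise near s = ±1; no choice of weights can do better, which is what motivates non-linear function approximators.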