Reinforcement learning is ultimately about learning an optimal policy from interaction with the environment. So far, we’ve been looking at value-based methods, where we first tried to find an estimate of the optimal action-value function. For small state spaces, this optimal value function was represented in a table, with one row for each state and one column for each action. Then we used the table to build the optimal policy one state at a time: for each state, we just pull its corresponding row from the table, and the optimal action is the action with the largest entry.

But what about environments with much larger state spaces? For instance, consider the cart-pole environment, where the goal is to teach an agent to balance a pole that’s attached to a moving cart. At each time step, the agent pushes the cart either to the left or to the right to try to keep the pole from falling down. The state at any time step is a vector with four numbers containing the cart’s position and velocity, along with the pole’s angle and angular velocity. There’s a huge number of possible states, corresponding to every possible value that these four numbers can take. So, without some sort of discretization technique, it’s impossible to represent the optimal action-value function in a table. This is because we would need a row for each possible state, and that would make the table way too big to be useful in practice.

So, we investigated how to represent the optimal action-value function with a neural network, which formed the basis for the deep Q-learning algorithm. In this case, the neural network took the environment state as input. As output, it returned the value of each possible action. In the case of cart-pole, for instance, the possible actions are to move the cart left or right. Then, identical to the previous setting where we used the table, we can easily obtain the best action for any state by just looking at the values that the input state produces in the network.
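The greedy action selection described above can be sketched in a few lines of Python. This is just an illustration, not a trained model: the Q-table entries and the tiny linear "network" weights below are made-up numbers, and a real deep Q-network would be a multi-layer model trained from experience.

```python
import numpy as np

# Hypothetical Q-table for a small environment with 4 states and 2 actions.
# The values are made up for illustration only.
q_table = np.array([
    [0.1, 0.9],   # state 0
    [0.8, 0.2],   # state 1
    [0.5, 0.5],   # state 2
    [0.3, 0.7],   # state 3
])

def greedy_action_from_table(state):
    """Pull the state's row from the table; the best action has the largest entry."""
    return int(np.argmax(q_table[state]))

# With a network instead of a table, only the lookup changes: the state
# (e.g. cart-pole's 4-number vector) is fed forward through the network,
# and we take the argmax over the action values it returns. Here a single
# linear layer with made-up weights stands in for the network.
W = np.array([[0.2, -0.1],
              [0.0,  0.3],
              [0.5,  0.1],
              [-0.2, 0.4]])  # maps 4 state dimensions -> 2 action values

def greedy_action_from_network(state_vector):
    q_values = state_vector @ W   # forward pass: one value per action
    return int(np.argmax(q_values))

print(greedy_action_from_table(0))
print(greedy_action_from_network(np.array([0.1, 0.0, 0.2, 0.0])))
```

In both functions the policy itself is implicit: we never store "the best action" anywhere, we recover it on demand by maximizing over the estimated action values.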
It’s just the action that maximizes these values. But the important message here is that in both cases, whether we used a table for small state spaces or a neural network for much larger state spaces, we had to estimate the optimal action-value function first before we could tackle the optimal policy. But now the question is, can we directly find the optimal policy without worrying about a value function at all? The answer is yes, and we accomplish this through a class of algorithms known as policy-based methods. This is what we’ll explore in this lesson.