2 – M3 L5 02 Motivation V1

Actor-critic methods are at the intersection of value-based methods such as DQN and policy-based methods such as reinforce. If a deep reinforcement learning agent uses a deep neural network to approximate a value function, the agent is said to be value-based. If an agent uses a deep neural network to approximate a policy, the agent is said to be policy-based. The DQN agent you learned about, is a value-based agent because it learns about the optimal action value function. This is just one of the many functions you can approximate. You can learn about the state value function V pi, the action value function Q pi, the advantage function A pi and the optimal versions of these; V star, Q star and A star. If your agent learns a value function well, deriving a good policy from it is straight forward. They reinforce agent previously learned about is a policy-based agent. These agent parameterizes the policy and learns to optimize it directly. The policy is usually stochastic in this setting. But you can also learn about deterministic policies. Remember that stochastic policies, taking a state and returned a probability distribution over the actions. Though you often see is slightly different notation, in which you taking a state and an action and return the probability of taking that action in that state. But there are pretty much the same though. Given the same state, the policy could prescribe a different action. This policy is a stochastic. Deterministic policies on the other hand, prescribe a single action for any given state. So, they take in a state and return an action. There’s no stochasticity. The policy is deterministic Finally, you also learned about using baselines to reduce the variance of policy-based agents. Did you notice that you can use a value function as a baseline. So, think about it. If we train a neural network to approximate a value function and then use it as a baseline, would this make for a better baseline, and if so, would a better baseline further reduce the variance of policy-based methods? Indeed. In fact, that’s basically all actor-critic methods are trying to do, to use value-based techniques to further reduce the variance of policy-based methods.

%d 블로거가 이것을 좋아합니다: