6 – MC Prediction Part 3

Before you read the pseudocode, there's a special case we have to discuss. What if, in the same episode, we select the same action from the same state multiple times? For instance, say that at time step 2 we select action down from state 3, and say we do the same thing at time step 99. If we count from the first time, we get a return of negative 87, and if we count from the last time, we get a return of 10.

When this happens, we have two options, and that gives us two different algorithms. As a first option, we can average the returns from both time steps when filling in the table, so in this case we would get a value of negative 38.5, which is the average of negative 87 and 10. Another option that also works well is to use only the return from the first time we tried the state-action combination, so in this case we would get a value of negative 87. We refer to the first option as an every-visit Monte Carlo method, and to the second option as a first-visit Monte Carlo method.

To understand where these names come from, we'll have to introduce some new terminology. We define every occurrence of a state-action pair in an episode as a visit to that state-action pair. Every-visit Monte Carlo prediction averages the returns following every visit to a state-action pair. First-visit Monte Carlo prediction considers only the first visit to a state-action pair in each episode and averages those returns. These algorithms do yield different behavior, and you can read more about those differences below.
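To make the two options concrete, here is a minimal Python sketch. It is not the course's pseudocode. The function name `mc_prediction`, the `(state, action, reward)` episode format, and the two-step toy episode are all assumptions chosen for illustration. The rewards are picked so that the returns match the example above: negative 87 from the first visit and 10 from the last.

```python
from collections import defaultdict


def mc_prediction(episodes, gamma=1.0, first_visit=True):
    """Estimate action values by averaging sampled returns.

    episodes: list of episodes, each a list of (state, action, reward)
              tuples, where reward is received after taking the action.
    (Hypothetical helper for illustration only.)
    """
    collected = defaultdict(list)  # (state, action) -> list of returns
    for episode in episodes:
        # Compute the return following each time step, working backwards.
        returns = [0.0] * len(episode)
        g = 0.0
        for t in range(len(episode) - 1, -1, -1):
            reward = episode[t][2]
            g = reward + gamma * g
            returns[t] = g

        seen = set()  # pairs already visited in this episode
        for t, (state, action, _) in enumerate(episode):
            pair = (state, action)
            if first_visit and pair in seen:
                continue  # first-visit: skip repeat visits in this episode
            seen.add(pair)
            collected[pair].append(returns[t])

    return {pair: sum(gs) / len(gs) for pair, gs in collected.items()}


# Toy episode (assumed): action "down" is taken from state 3 twice.
# The return is -97 + 10 = -87 from the first visit and 10 from the last.
episode = [(3, "down", -97), (3, "down", 10)]

q_first = mc_prediction([episode], first_visit=True)
q_every = mc_prediction([episode], first_visit=False)
print(q_first[(3, "down")])  # -87.0
print(q_every[(3, "down")])  # -38.5
```

The only difference between the two variants is the `seen` check: first-visit keeps one return per state-action pair per episode, while every-visit keeps one return per visit.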
