6 – Context Training

OK. Now let’s talk about another trick. When we moved from recognizing isolated signs to recognizing phrases of signs, the combination of movements looked very different.

>> For example, when Thad signed NEED in isolation, his hands started from a rest position and finished in the rest position. When he signs NEED in the context of I NEED CAT, the first part of NEED runs into the last part of I, and the last part of NEED runs into the first part of CAT. His hands are no longer moving as much.

>> That’s right, the signs before and after a given sign can significantly affect how it looks.

>> So instead of recognizing NEED with its three-state model alone, we’re going to concatenate combined six-state models for I NEED and NEED CAT.

>> In practice, we often have lots of examples of phrases of sign and not individual signs to train on. In my original paper on recognizing phrases of ASL, I had 40 signs in the vocabulary, but the only training data I had was 500 phrases, each containing five signs.

>> In this case, let’s suppose we have lots of examples of our three-sign phrase. In our original example of training an HMM on data, we assumed that the data for an isolated version of the sign I was evenly divided among its three states. We then calculated the output probabilities given that assumption, adjusted the boundaries and transition probabilities, and iterated until convergence.

>> We’re going to do the same thing for our first step here, but this time we will assume that the data is evenly divided among the signs, and then evenly divided among the states within each sign.

>> And we iterate the same way we did before, adjusting the boundaries of each state and each sign until we converge.

>> Now’s when things get interesting. After we’ve converged everything for each sign, we’re going to go back and find every place where I NEED occurs. Notice that there will be far fewer of those than there are examples of NEED.

>> But let’s assume that there are enough examples.

>> Okay. Well, we are going to cut out the data we think belongs to I NEED and train the combined six-state model on it.

>> How does that help?

>> Well, the output probabilities at the boundary between I and NEED, here and here, as well as the transition probabilities in that region, will be tuned to better represent I NEED than the general case, which would also include WE NEED. In speech, the effect of one phoneme on an adjacent phoneme is called coarticulation, and this method of modeling is called context training.

>> So I see we’re going to do the same thing for NEED-CAT1.

>> Yep, and for every other two-sign combination: NEED-CAT2, WANT-CAT1, WANT-CAT2, I-WANT, WE-NEED, and WE-WANT. We then iterate with Baum-Welch on these larger context models, using embedded training, until we converge again.

>> Why not use three-sign contexts, or even more when the phrases are complex enough?

>> If we have enough data, that’s not a bad idea, because the benefits are actually pretty large. For recognition tasks where there’s a language structure, we expect context training to cut our error rate in half.
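
To make the flat-start step concrete, here is a minimal sketch in Python. It assumes one-dimensional feature frames, three states per sign, and Gaussian outputs per state; the helper name `flat_start` and all data shapes are illustrative assumptions, not from the lecture.

```python
# Flat start for embedded training: evenly divide each phrase's frames
# among its signs, then divide each sign's share evenly among its states,
# and estimate a Gaussian (mean, variance) per state from those frames.
# Assumes 1-D features and 3 states per sign; all names are illustrative.
import numpy as np

N_STATES = 3  # states per sign model

def flat_start(phrases, phrase_signs):
    buckets = {}  # (sign, state) -> frames assigned by the even split
    for frames, signs in zip(phrases, phrase_signs):
        for sign, chunk in zip(signs, np.array_split(frames, len(signs))):
            for state, part in enumerate(np.array_split(chunk, N_STATES)):
                buckets.setdefault((sign, state), []).extend(part)
    models = {}
    for sign in {s for signs in phrase_signs for s in signs}:
        means = np.array([np.mean(buckets[sign, k]) for k in range(N_STATES)])
        vars_ = np.array([np.var(buckets[sign, k]) + 1e-3 for k in range(N_STATES)])
        # Left-to-right topology: stay in a state or advance to the next.
        trans = 0.6 * np.eye(N_STATES) + 0.4 * np.eye(N_STATES, k=1)
        trans[-1, -1] = 1.0
        models[sign] = (means, vars_, trans)
    return models

# Toy usage: 50 synthetic phrases of the form I NEED CAT, 30 frames each.
rng = np.random.default_rng(0)
phrases = [rng.normal(0.0, 1.0, size=30) for _ in range(50)]
models = flat_start(phrases, [["I", "NEED", "CAT"]] * 50)
```

From here, Baum-Welch re-estimation would adjust the state boundaries and transition probabilities until convergence, exactly as described above for the isolated case.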
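
For the context-training step, the sketch below (same illustrative conventions as above) concatenates two trained three-state sign models into a single six-state model for a pair like I NEED, so that the states and transitions around the sign boundary can be re-tuned on just the I NEED segments.

```python
# Build a 6-state context model (e.g. I-NEED) from two 3-state sign
# models by stacking their Gaussians and linking the last state of the
# first sign to the first state of the second. A minimal sketch; the
# bridge probability of 0.4 is an arbitrary starting value that
# Baum-Welch re-estimation would tune on the I NEED segments.
import numpy as np

def concat_models(first, second, n_states=3, bridge=0.4):
    (mu1, v1, t1), (mu2, v2, t2) = first, second
    means = np.concatenate([mu1, mu2])
    vars_ = np.concatenate([v1, v2])
    trans = np.zeros((2 * n_states, 2 * n_states))
    trans[:n_states, :n_states] = t1
    trans[n_states:, n_states:] = t2
    # Boundary region: leave the last state of sign 1 into sign 2.
    trans[n_states - 1, n_states - 1] = 1.0 - bridge
    trans[n_states - 1, n_states] = bridge
    return means, vars_, trans

# Usage, given the flat-start models from the previous sketch:
# i_need = concat_models(models["I"], models["NEED"])
```

The same constructor would produce NEED-CAT1, WE-NEED, and every other two-sign context model before the final round of embedded Baum-Welch training.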
