8 – 06 Attention Decoder V1

Let’s now look at things on the decoder side. In models without attention, we’d only feed the last context vector to the decoder RNN, in addition to the embedding of the end token, and it will begin to generate an element of the output sequence at each time-step. The case is different in an attention decoder, however. An attention decoder has the ability to look at the inputted words, and the decoders own hidden state, and then it would do the following. It would use a scoring function to score each hidden state in the context matrix. We’ll talk later about the scoring function, but after scoring each context vector would end up with a certain score and if we feed these scores into a softmax function, we end up with scores that are all positive, that are all between zero and one, and that all sum up to one. These values are how much each vector will be expressed in the attention vector that the decoder will look at before producing an output. Simply multiplying each vector by its softmax score and then, summing up these vectors produces an attention contexts vector, this is a basic weighted sum operation. The context vector is an important milestone in this process, but it’s not the end goal. In a later video, we’ll explain how the context vector merges with the decoders hidden state to create the real output of the decoder at the time-step. The decoder has now looked at the input word and at the attention context vector, which focused its attention on the appropriate place in the input sequence. So, it produces a hidden state and it produces the first word in the output sequence. Now, this is still an over-simplified look, that’s why we have the asterisks here. There is still a step, whether we’ll talk about in later video, between the RNN and the final output. In the next time-step, the RNN takes its previous output as an input, and it generates its own context vector for that time-step, as well as the hidden state from the previous time-step, and that produces new hidden state for the decoder, and a new word in the output sequence, and this goes on until we’ve completed our output sequence.

%d 블로거가 이것을 좋아합니다: