10 – 08 Multiplicative Attention V2

Earlier in this lesson, we looked at how the key concept of attention is to calculate an attention weight vector, which is used to amplify the signal from the most relevant parts of the input sequence and, at the same time, drown out the irrelevant parts. In this video, we’ll begin to look at the scoring functions that produce these attention weights. An attention scoring function takes in the hidden state of the decoder and the set of hidden states of the encoder. Since this is something we do at each timestep on the decoder side, we only use the hidden state of the decoder at that timestep (or, in some scoring methods, at the previous timestep). Given these two inputs, this vector and this matrix, it produces a vector that scores each of these columns.

Before looking at the matrix version, which calculates the scores for all the encoder hidden states in one step, let’s simplify it by looking at how to score a single encoder hidden state. The first scoring method, and the simplest, is to just calculate the dot product of the two input vectors. The dot product of two vectors produces a single number, so that’s good. But the important thing is the significance of this number. Geometrically, the dot product of two vectors equals the product of the lengths of the two vectors and the cosine of the angle between them, and cosine has the convenient property that it equals one when the angle is zero and decreases as the angle widens. What this means is that if we have two vectors with the same length, the smaller the angle between them, the larger the dot product becomes. The dot product is therefore a similarity measure between vectors: the smaller the angle between them, the larger the number it produces.

In practice, however, we want to speed up the calculation by scoring all the encoder hidden states at once, which leads us to the formal mathematical definition of dot product attention. That’s what we have here: the hidden state of the current timestep, transposed, times the matrix of the encoder hidden states, and that produces the vector of scores. With the simplicity of this method comes the drawback of assuming the encoder and decoder have the same embedding space. So while this might work for text summarization, for example, where the encoder and decoder use the same language and the same embedding space, in machine translation each language tends to have its own embedding space. This is a case where we might want to use the second scoring method, which is a slight variation on the first. It simply introduces a weight matrix into the multiplication between the decoder hidden state and the encoder hidden states. This weight matrix is a linear transformation that allows the inputs and outputs to use different embeddings, and the result of this multiplication is the vector of scores.
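To make the two scoring options concrete, here is a minimal sketch in PyTorch. The tensor names and sizes here are illustrative assumptions rather than values from the lesson, and the weight matrix would of course be learned during training in a real model.

```python
# A minimal sketch of the two multiplicative scoring functions described above.
# All names and sizes are illustrative assumptions.
import torch

hidden_size = 4    # decoder hidden state size (assumed)
encoder_size = 6   # encoder hidden state size (assumed; differs to motivate W_a)
seq_len = 3        # number of encoder timesteps (assumed)

h_t = torch.randn(hidden_size)                        # decoder hidden state at this timestep
encoder_states = torch.randn(seq_len, encoder_size)   # one row per encoder hidden state

# Dot-product scoring: h_t^T h_s for every encoder hidden state at once.
# This only works when encoder and decoder hidden states share the same space,
# so here we score against same-sized states.
same_space_states = torch.randn(seq_len, hidden_size)
dot_scores = same_space_states @ h_t                  # shape: (seq_len,)

# General (multiplicative) scoring: h_t^T W_a h_s. The learned weight matrix W_a
# bridges the two spaces, so encoder and decoder sizes no longer have to match.
W_a = torch.randn(hidden_size, encoder_size)          # learned in practice
general_scores = encoder_states @ (W_a.T @ h_t)       # shape: (seq_len,)
```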
Let us now look back at this animation and incorporate everything that we know about attention. The first timestep in the attention decoder starts by taking an initial hidden state as well as the embedding for the end symbol. It does its calculation and generates the hidden state at that timestep; here we are ignoring the actual outputs of the RNN and just using the hidden states. Then we do our attention step. We do that by taking in the matrix of the hidden states of the encoder, and we produce the scores as we’ve mentioned. So, if we’re doing multiplicative attention, we’ll use the dot product. In general, we produce the scores, we apply a softmax, we multiply the softmax scores by each corresponding hidden state from the encoder, and we sum them up, producing our attention context vector. What we do next is concatenate the attention context vector with the hidden state of the decoder at that timestep, h4, so this would be c4 concatenated with h4. We basically glue them together as one vector and then pass it through a fully connected neural network, which is basically multiplying by the weight matrix Wc and applying a tanh activation. The output of this fully connected layer is our first output word in the output sequence.

We can now proceed to the second step, passing the hidden state to it along with the output from the first decoder timestep. We produce h5 and start our attention at this step as well: we score, producing a vector of scores, we apply softmax, we multiply, and we sum them up, producing c5, the attention context vector at step five. We glue it together with the hidden state and pass it through the same fully connected network with tanh activation, producing the second word in our output, and this goes on until we have completed the output sequence. This is pretty much the full view of how attention works in sequence-to-sequence models. In the next video, we’ll touch on additive attention.
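To recap the decoder step just described, here is a minimal sketch of a single attention step in PyTorch: score, softmax, weighted sum, concatenation with the decoder hidden state, then a fully connected layer with a tanh activation. The names and sizes are illustrative assumptions, and Wc (plus the final projection to the vocabulary) would be learned in a real model.

```python
# A minimal sketch of one full attention step of the decoder described above.
# All names and sizes are illustrative assumptions.
import torch
import torch.nn.functional as F

hidden_size, seq_len = 4, 3                          # assumed sizes

encoder_states = torch.randn(seq_len, hidden_size)   # matrix of encoder hidden states
h4 = torch.randn(hidden_size)                        # decoder hidden state at this timestep

# 1. Score every encoder hidden state (dot-product / multiplicative attention).
scores = encoder_states @ h4                         # shape: (seq_len,)

# 2. Softmax turns the scores into attention weights that sum to one.
attn_weights = F.softmax(scores, dim=0)              # shape: (seq_len,)

# 3. Multiply each encoder hidden state by its weight and sum them up:
#    this is the attention context vector c4.
c4 = attn_weights @ encoder_states                   # shape: (hidden_size,)

# 4. Concatenate c4 with h4 and pass the result through a fully connected
#    layer (multiply by W_c) with a tanh activation.
W_c = torch.randn(hidden_size, 2 * hidden_size)      # learned in practice
attentional_hidden = torch.tanh(W_c @ torch.cat([c4, h4]))  # shape: (hidden_size,)

# The output word for this timestep is then read off this vector
# (typically via a learned projection onto the vocabulary).
```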
