11 – 09 Additive Attention V2

In this video, we'll look at the third commonly used scoring method. It's called concat, and it works by using a feedforward neural network. To take a simple example, let's say we're scoring this one encoder hidden state at the fourth time step of the decoder. Again, this is an oversimplified example that scores only one state; in practice we'd stack the encoder hidden states into a matrix and score them all in one step.
The concat scoring method concatenates the two vectors, the decoder hidden state and the encoder hidden state, and makes that the input to a feedforward neural network. Let's see how that works. We concat them into one vector, and then we pass it through the network. This network has a single hidden layer and outputs the score. The parameters of this network are learned during the training process, namely the W_a weights matrix and the v_a weights matrix.
To look at how the calculation is done: this is our concatenated vector. We multiply it by W_a and apply a tanh activation, producing this two-by-one matrix. We multiply that by the v_a weights matrix, and we get the score for this encoder hidden state. Formally, it is expressed as score(h_t, h_s) = v_a^T tanh(W_a [h_t; h_s]), where h_t, as we've mentioned, is the decoder hidden state at the current time step, and h_s is the set of encoder hidden states. [h_t; h_s] is the concatenation; we multiply it by W_a, apply the tanh activation, and then multiply by v_a transpose.
One thing to note is how this differs between papers. Concat is very similar to the scoring method from the Bahdanau paper, but what we've shown here is the concat method from the Luong paper, where there's only one weight matrix. Compared with it, the Bahdanau paper has two major differences that we can look at. One of them is that the weights matrix is split into two, so we don't have just W_a, we have W_a and U_a, and each is applied to its respective vector: the decoder hidden state in one case and the encoder hidden state in the other.
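The calculation described above can be sketched in a few lines of NumPy. This is only an illustrative sketch, not code from either paper: the sizes (`hidden = 4`, `attn = 2`), the random weights, and the function names `concat_score`, `concat_scores_all`, and `split_score` are all assumptions made for the example. In a real model W_a, U_a, and v_a would be learned during training.

```python
import numpy as np

rng = np.random.default_rng(0)

hidden = 4  # size of each hidden state (assumed for illustration)
attn = 2    # size of the scoring network's hidden layer (assumed)

h_t = rng.standard_normal(hidden)        # decoder hidden state, current time step
H_s = rng.standard_normal((5, hidden))   # five encoder hidden states, one per row

# Randomly initialised stand-ins for the learned parameters.
W_a = rng.standard_normal((attn, 2 * hidden))
v_a = rng.standard_normal(attn)


def concat_score(h_t, h_s, W_a, v_a):
    """Score one encoder state: v_a^T tanh(W_a [h_t; h_s])."""
    concat = np.concatenate([h_t, h_s])          # shape (2 * hidden,)
    return float(v_a @ np.tanh(W_a @ concat))


def concat_scores_all(h_t, H_s, W_a, v_a):
    """Score every encoder state in one step using matrices."""
    tiled = np.broadcast_to(h_t, H_s.shape)          # repeat h_t for each row
    concat = np.concatenate([tiled, H_s], axis=1)    # (src_len, 2 * hidden)
    return np.tanh(concat @ W_a.T) @ v_a             # (src_len,)


def split_score(s_dec, h_s, W_dec, U_enc, v_a):
    """Two-matrix form: v_a^T tanh(W_dec s_dec + U_enc h_s)."""
    return float(v_a @ np.tanh(W_dec @ s_dec + U_enc @ h_s))


single = concat_score(h_t, H_s[0], W_a, v_a)
scores = concat_scores_all(h_t, H_s, W_a, v_a)

# Slicing W_a into its decoder and encoder column blocks gives a two-matrix
# version that computes the same number for the same inputs.
split = split_score(h_t, H_s[0], W_a[:, :hidden], W_a[:, hidden:], v_a)
```

Here `scores` holds one score per encoder hidden state, and `scores[0]` matches `single`. In the Bahdanau-style form you would pass the decoder state from the previous time step as `s_dec`, rather than `h_t`.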
Another thing to note is that the Bahdanau paper uses the decoder hidden state from the previous time step, while the Luong paper uses the one from the current time step. Let's make a note on notation here, in case you're planning to read the papers. Here we've used the notation mainly from the Luong paper, where we refer to both the encoder and the decoder hidden states as h: h_t for the decoder and h_s for the encoder. So h is for hidden state. t is for target, the target sequence that we're going to output, so it's associated with the decoder. s is for source. In the Bahdanau paper, the decoder hidden state is called s, not h. So now the picture is complete. We've gone over the entire attention process. It's time now to look at some applications in the next video.
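For reference when reading the papers, the two scoring functions can be written side by side, each in its own paper's notation. This is a sketch from the published formulas: Luong writes the encoder state as \bar{h}_s (h_s in this video), while in Bahdanau's notation s_{i-1} is the previous decoder hidden state and h_j is an encoder hidden state.

```latex
\begin{align*}
\text{Luong (concat):} \quad
  & \operatorname{score}(h_t, \bar{h}_s)
    = v_a^{\top} \tanh\!\left( W_a \, [h_t ; \bar{h}_s] \right) \\
\text{Bahdanau:} \quad
  & e_{ij}
    = v_a^{\top} \tanh\!\left( W_a \, s_{i-1} + U_a \, h_j \right)
\end{align*}
```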
