7 – 07 Additive And Multiplicative Attention V1

Before delving into the details of scoring functions, we need to distinguish between the two major types of attention. These are often referred to as additive attention and multiplicative attention. Sometimes they're also called Bahdanau attention and Luong attention, after the first authors of the papers that described them.

Bahdanau attention refers to Dzmitry Bahdanau, the first author of the paper "Neural Machine Translation by Jointly Learning to Align and Translate," which proposed a lot of these ideas. In the Bahdanau scoring function, h_j is the hidden state from the encoder at time step j, s_(i-1) is the hidden state of the decoder at the previous time step, and U_a, W_a, and v_a are all weight matrices learned during the training process. Basically, this is a scoring function that takes a hidden state of the encoder and the hidden state of the decoder and produces a single number; at each decoder time step, it scores every encoder hidden state. If this looks too complicated, don't worry about it; we'll get into more detail with a visual explanation as well. The scores are then passed through a softmax, and then comes the weighted sum operation, where we multiply each encoder hidden state by its score and sum them all up, producing our attention context vector. In their architecture, the encoder is a bidirectional RNN, and each encoder hidden state is produced by concatenating the states of the forward and backward layers.

Multiplicative attention, or Luong attention, refers to Thang Luong, the first author of the paper "Effective Approaches to Attention-based Neural Machine Translation." Luong attention built on top of Bahdanau attention by adding a couple more scoring functions. Their architecture is also different in that they used only the hidden states from the top RNN layer in the encoder. This allows the encoder and the decoder to both be stacks of RNNs, which we'll see later in the application videos, and which led to some of the premier models in production right now.

In multiplicative attention there are three scoring functions we can choose from. The simplest is the dot scoring function, which is just the dot product of a hidden state of the encoder with the hidden state of the decoder. The second, called general, builds on top of it by adding a weight matrix between them, and this multiplication in the dot product is where multiplicative attention gets its name. The third is very similar to Bahdanau attention, in that it adds up the hidden state of the encoder and the hidden state of the decoder (this addition is where additive attention gets its name), multiplies the result by a weight matrix, applies a tanh activation, and then multiplies that by another weight matrix. So, this is a function to which we give the hidden state of the decoder at the current time step and the hidden states of the encoder at all time steps, and it produces a score for each one of them. We then apply softmax just as before, and the weighted sum produces c_t, the attention context vector, which is what goes on to produce the final output of the decoder at this time step. Again, if this doesn't make a lot of sense right now, don't worry about it; we'll look at it more visually in the next video.
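To make these scoring functions concrete, here is a minimal NumPy sketch with toy dimensions. The variable names (enc_states, dec_state, W_a, U_a, v_a) and the single-vector, non-batched shapes are illustrative assumptions, not the exact parameterization used in either paper.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a vector of scores."""
    e = np.exp(x - np.max(x))
    return e / e.sum()

def additive_score(dec_state, enc_states, W_a, U_a, v_a):
    """Bahdanau-style (additive) scoring:
    score(s_{i-1}, h_j) = v_a . tanh(W_a s_{i-1} + U_a h_j),
    computed for every encoder hidden state h_j."""
    return np.array([v_a @ np.tanh(W_a @ dec_state + U_a @ h_j)
                     for h_j in enc_states])

def dot_score(dec_state, enc_states):
    """Luong 'dot' scoring: plain dot product with each encoder state."""
    return enc_states @ dec_state

def general_score(dec_state, enc_states, W_a):
    """Luong 'general' scoring: dot product with a learned weight matrix in between."""
    return enc_states @ (W_a @ dec_state)

def attention_context(scores, enc_states):
    """Softmax the scores, then take the weighted sum of encoder states."""
    weights = softmax(scores)      # attention weights, one per encoder time step
    return weights @ enc_states    # attention context vector c_t

# Toy example: 5 encoder time steps, hidden size 4.
rng = np.random.default_rng(0)
enc_states = rng.normal(size=(5, 4))   # h_1 ... h_5
dec_state = rng.normal(size=4)         # current (or previous) decoder hidden state

W_a = rng.normal(size=(4, 4))
U_a = rng.normal(size=(4, 4))
v_a = rng.normal(size=4)

c_add = attention_context(additive_score(dec_state, enc_states, W_a, U_a, v_a), enc_states)
c_dot = attention_context(dot_score(dec_state, enc_states), enc_states)
c_gen = attention_context(general_score(dec_state, enc_states, W_a), enc_states)
print(c_add.shape, c_dot.shape, c_gen.shape)  # each is (4,), one context vector per method
```

Note that the general score reduces to the dot score when W_a is the identity matrix, which is why it is described above as simply adding a weight matrix between the two hidden states.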
This is an example from the paper that illustrates these attention methods compared to previous sequence-to-sequence models without attention. So, this is an English-to-German translation: we have the source English phrase, the reference (the correct German translation used as the label), and then what their model produced. It translated the sentence very well, while the base, or benchmark, model without attention got the name wrong. This is something we can attribute to the difficulty of capturing all the information in just the last hidden state of the encoder. This is one of the powerful things that attention does: it gives the decoder the ability to look at any part of the input sequence, no matter how far back it was.
