Hello everyone. It is now the greatest time of the year and here we are today, ready to be amazed by Deep Learning.
Last time, we went through a neural machine translation project using the renowned Sequence-to-Sequence model empowered with Luong attention. As we saw, introducing attention mechanisms noticeably improved the Seq2Seq model’s performance. For those who haven’t seen my last post, I recommend that you have a skim here.
However, given that we are either NLP practitioners or researchers, we might have heard of this all over the place:
Attention Is All You Need
Okay guys, let me present to you the one that we have all been longing for: The Transformer!
By Transformer, I don’t mean the fancy bulky Optimus Prime, but this:
Ew, that looks scary. And believe it or not, today, we are going to create the Transformer entirely from scratch. That seems impossible at first, I know. But as you will see in a moment, with the help of TensorFlow 2.0 (and Keras at its core), building such a complicated model is no different from stacking up Lego pieces.
Concretely, today we will go through the steps below on the journey to create our own Transformer:
- Create the simple and straightforward version to understand how a Transformer works
- Update the simple version to enhance speed and optimize GPU memory
Now, let’s take a look at a couple of things that are worth checking out in advance.
As usual, to get the most out of this blog post, I highly recommend doing the following first:
- Install TensorFlow 2.0 (I made a tutorial on that, here)
- Read my article to understand how attention mechanisms work
- Skim through the original paper: Attention Is All You Need
Done with all of the above? Then we are ready to go. Let’s get started!
Note that by introducing the Transformer, I didn’t mean the Sequence-to-Sequence model sucks. For tasks like machine translation, the sequential characteristics of recurrent layers, with the help of an attention mechanism, are still capable of delivering great results.
Create the Transformer – the simple version
This time, I felt the urge to show you guys the quick-and-dirty version, which was what I actually wrote in the beginning. I strongly believe that will help you guys to see how the paper was interpreted and follow along with ease.
With that being said, it’s time to talk business.
Just like the Seq2Seq model, the Transformer has two separate parts: the Encoder and the Decoder, which deal with source sequences (English) and target sequences (French), respectively. Let’s take a closer look at the Encoder:
As we can see from that sketch, the Encoder is made of the following components:
- Positional Encoding
- Multi-Head Attention
- Position-wise Feed-Forward Network
That might sound exhausting. How can we create them all?
Although the Encoder consists of several different components, it turns out that we only need to create two of them ourselves: the Positional Encoding and the Multi-Head Attention.
We know that RNN units are inefficient because of their sequential nature, so we got rid of them and, sadly, of their ability to treat input data as sequences too (i.e. A->B->C: C comes after B and B comes after A).
So, what do we do now? How about explicitly providing the absolute position information of each token within the sequence?
Information of that kind, which the model can learn from, is usually called a feature. Yep, positional encoding is simply a feature! Here is how we can compute it:
It may seem a little bit scary, but it is extremely easy to implement as-is in Python.
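As a minimal sketch, here is the sinusoidal formula from the paper written in plain NumPy so that it runs on its own. The function name `positional_encoding` and the sizes `max_len=50` and `model_size=128` are assumptions made for this example, not necessarily the values used in the original code.

```python
import numpy as np

def positional_encoding(max_len, model_size):
    """Sinusoidal positional encoding from "Attention Is All You Need".

    PE[pos, 2i]   = sin(pos / 10000^(2i / model_size))
    PE[pos, 2i+1] = cos(pos / 10000^(2i / model_size))
    """
    pe = np.zeros((max_len, model_size), dtype=np.float32)
    for pos in range(max_len):
        for i in range(model_size):
            # Columns 2k and 2k+1 share the same exponent 2k / model_size.
            angle = pos / np.power(10000.0, 2 * (i // 2) / model_size)
            # Even columns use sine, odd columns use cosine.
            pe[pos, i] = np.sin(angle) if i % 2 == 0 else np.cos(angle)
    return pe

# Hypothetical sizes, chosen only for illustration.
pes = positional_encoding(max_len=50, model_size=128)
print(pes.shape)  # (50, 128)
```

Each row is the feature vector of one position, so adding row `pos` to the embedding of the token at that position is all it takes to inject the order information.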
That is our positional encoding (features).
The Multi-Head Attention
And here we are at the core of the Transformer: the Multi-Head Attention. In fact, if you know how the Luong attention mechanism works (which you should by now), this will be very easy to implement because they are pretty similar.
Typically, as far as we know, the decoder output attends to the encoder output to decide where to put more weight.
But that is no longer the case with the Transformer since it does not have the sequential constraint like the Sequence-to-Sequence architecture. Specifically, we can have three patterns like below:
- Source sequence pays attention to itself (Encoder’s self attention)
- Target sequence pays attention to itself (Decoder’s self attention)
- Target sequence pays attention to source sequence (same as Seq2Seq)
which is why the authors introduced three new terms: query, key and value.
- Query: the one which pays attention
- (Key, Value): the one to which attention is drawn. Key and value are exactly the same within this post.
So now we are ready to take a look at the Multi-Head Attention:
Again, it looks pretty complicated but in fact, the idea is pretty simple. The Scaled Dot-Product Attention is basically similar to Luong attention (dot score function) and we need to compute not one, but many of them simultaneously. Sounds complicated, doesn’t it?
Well, I always like to think that creating Deep Learning models is no different from assembling Lego pieces. Keeping track of the shapes at all times is key. Let’s take a closer look inside the Multi-Head Attention:
Everything is crystal clear now. Let’s code!
Firstly, we will create a class named MultiHeadAttention. The number of attention heads is controlled by h and remember that we must create separate Dense (Linear) layers for each head.
Next, we will implement the logic within the Multi-Head Attention. Let’s take a look at the special case first: One-Head (which is similar to Luong dot-score attention):
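To make the one-head case concrete, below is a minimal NumPy sketch of scaled dot-product attention in which key and value are the same tensor, as stated above. The helper names (`softmax`, `one_head_attention`) and the toy shapes are assumptions for illustration, and the Dense projections are left out on purpose.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(x)
    return e / np.sum(e, axis=axis, keepdims=True)

def one_head_attention(query, value):
    """query: (batch, query_len, depth), value: (batch, value_len, depth).

    The key is the same tensor as the value in this sketch.
    """
    depth = query.shape[-1]
    # (batch, query_len, depth) x (batch, depth, value_len) -> (batch, query_len, value_len)
    score = np.matmul(query, value.transpose(0, 2, 1)) / np.sqrt(depth)
    # Each query position gets a distribution over the value positions.
    alignment = softmax(score, axis=-1)
    # Weighted sum of the values: (batch, query_len, depth)
    context = np.matmul(alignment, value)
    return context, alignment

rng = np.random.default_rng(0)
query = rng.normal(size=(2, 3, 8))   # toy shapes, for illustration only
value = rng.normal(size=(2, 5, 8))
context, alignment = one_head_attention(query, value)
print(context.shape, alignment.shape)  # (2, 3, 8) (2, 3, 5)
```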
We need to compute h attention heads simultaneously this time. And the quickest way to write that is a for loop. Don’t forget the additional Dense layers as illustrated above. Here is the Multi-Head version:
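The loop over heads can be sketched as follows. This is a NumPy illustration under my own assumptions: the class name `MultiHeadAttentionSketch` is hypothetical, and the random matrices merely stand in for the per-head Dense layers (without bias), so it shows the data flow rather than a trainable layer.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(x)
    return e / np.sum(e, axis=axis, keepdims=True)

class MultiHeadAttentionSketch:
    def __init__(self, model_size, h, seed=0):
        rng = np.random.default_rng(seed)
        self.h = h
        self.key_size = model_size // h
        shape = (model_size, self.key_size)
        scale = 1.0 / np.sqrt(model_size)
        # Separate query/key/value projections for every head.
        self.wq = [rng.normal(0.0, scale, shape) for _ in range(h)]
        self.wk = [rng.normal(0.0, scale, shape) for _ in range(h)]
        self.wv = [rng.normal(0.0, scale, shape) for _ in range(h)]
        # Output projection applied to the concatenated heads.
        self.wo = rng.normal(0.0, scale, (h * self.key_size, model_size))

    def __call__(self, query, value):
        heads = []
        for i in range(self.h):
            q = query @ self.wq[i]   # (batch, query_len, key_size)
            k = value @ self.wk[i]   # (batch, value_len, key_size)
            v = value @ self.wv[i]   # (batch, value_len, key_size)
            score = q @ k.transpose(0, 2, 1) / np.sqrt(self.key_size)
            alignment = softmax(score, axis=-1)
            heads.append(alignment @ v)
        # (batch, query_len, h * key_size) -> (batch, query_len, model_size)
        return np.concatenate(heads, axis=-1) @ self.wo

mha = MultiHeadAttentionSketch(model_size=16, h=2)
out = mha(np.ones((2, 5, 16)), np.ones((2, 7, 16)))
print(out.shape)  # (2, 5, 16)
```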
And the Multi-Head Attention is done. Now we have all the ingredients we need. Let’s go ahead and create the Encoder!
We are now able to create our Encoder. Let’s visualize the data flow inside the Encoder first:
Specifically, the Encoder consists of:
- One Embedding layer
- One or many layers, each layer contains:
  - One Multi-Head Attention block
  - One Normalization layer for Attention block
  - One Feed-Forward Network block
  - One Normalization layer for FFN block
With that, let’s go ahead and define the Encoder class:
Here you might be wondering: Shouldn’t we be using the LayerNorm thing? Well, when I started this project, LayerNorm hadn’t been implemented yet. Rather than creating one on my own, I decided to go with BatchNormalization and it worked just fine. You will see in a second.
Now it’s time for the forward pass. Starting with the Embedding layer, we will then add the positional encoding to its output (position-wise):
Next, we will implement the layers of Multi-Head Attention and Feed-Forward Network. For the Multi-Head Attention, we will loop through the input sequence’s length and compute its context vector towards the whole-length sequence:
Then we have a residual connection, followed by a Normalization layer (out = LayerNorm(out + in)):
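For reference, the formula out = LayerNorm(out + in) can be sketched in NumPy like this. Note the assumptions: it is plain layer normalization without the learned scale and shift, whereas this post uses BatchNormalization, and the names `layer_norm` and `add_and_norm` are made up for the example.

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    """Normalize every feature vector (last axis) to zero mean and unit variance."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def add_and_norm(sub_in, sub_out):
    """Residual connection followed by normalization: LayerNorm(out + in)."""
    return layer_norm(sub_in + sub_out)

rng = np.random.default_rng(0)
sub_in = rng.normal(size=(2, 5, 16))    # input of the sub-layer
sub_out = rng.normal(size=(2, 5, 16))   # output of the sub-layer (e.g. attention)
normed = add_and_norm(sub_in, sub_out)
print(normed.shape)  # (2, 5, 16)
```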
And we are done with the Multi-Head Attention. The Feed-Forward Network coming next is much more straight-forward since it contains only Dense layers. Let’s do the same: compute the output, add the residual connection and normalize the result:
Below is the complete forward pass of the Encoder:
So, we have created the Encoder. We’ll go ahead to tackle the Decoder, which is very similar to the Encoder, except that … Well, let’s first take a look:
So basically, there is nothing that we haven’t covered yet. However, there are some differences to notice:
- There are two Multi-Head Attention blocks in one layer, one for the target sequences and one for the Encoder’s output
- The bottom Multi-Head Attention is masked
And as always, let’s visualize the data’s shapes inside the Decoder so that we can get ready to implement:
We can now create the Decoder class and define all the material we need:
Next, we will dive into the forward pass. The first step is similar to what we did in the Encoder: pass the sequences through the Embedding layer and add up the Positional Encoding information:
Then, we will have a for loop to create a bunch of layers as illustrated in Fig.11 above. In each layer, the first block is the Multi-Head Attention in which the target sequence draws attention to itself (self-attention). And as I mentioned above, this block needs to be masked. What does that mean?
Unlike the source sequence on the Encoder’s side, each token in the target sequence should not be trained to depend on its neighbors to the right. Think of the inference phase: we begin with the <start> token and predict word after word, right? There is gonna be no hint on the right!
Implementing that masking mechanism is pretty easy. We just need to modify the code used in the Encoder a little bit:
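One way to picture the loop-based masking is the NumPy sketch below, in which position j is only allowed to see positions 0..j. It is an illustration under my own assumptions (the function name is hypothetical, there are no Dense projections, and whether a token may also attend to itself is a choice I made here), not a copy of the Decoder code.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(x)
    return e / np.sum(e, axis=axis, keepdims=True)

def causal_self_attention_loop(seq):
    """seq: (batch, length, depth). Position j attends only to positions 0..j."""
    depth = seq.shape[-1]
    outputs = []
    for j in range(seq.shape[1]):
        query = seq[:, j:j + 1, :]    # (batch, 1, depth): the current token
        values = seq[:, :j + 1, :]    # (batch, j + 1, depth): nothing from the right
        score = query @ values.transpose(0, 2, 1) / np.sqrt(depth)
        outputs.append(softmax(score, axis=-1) @ values)
    return np.concatenate(outputs, axis=1)   # (batch, length, depth)

rng = np.random.default_rng(0)
target = rng.normal(size=(2, 5, 8))
out = causal_self_attention_loop(target)

# Changing tokens 3 and 4 must not affect the outputs at positions 0..2.
changed = target.copy()
changed[:, 3:, :] += 1.0
out_changed = causal_self_attention_loop(changed)
print(np.allclose(out[:, :3], out_changed[:, :3]))  # True
```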
Coming next is another Multi-Head Attention layer in which the query is the output of the Multi-Head Attention above and the value is the output of the Encoder. This is what we normally do with the Seq2Seq architecture:
The last piece is the FFN layer, which is no different from the Encoder:
Oh, and don’t forget to use the last Dense layer to compute the final output:
Here is the full code of the Decoder’s forward pass:
And the two pieces of the Transformer are ready. Let’s test them out:
You can see that I’m setting the number of attention heads h=2. The reason for that is to test out the multi-head mechanism (setting h=1 is sufficient for this experiment). I also keep the number of layers low as we are going to train on a tiny dataset (20 English-French pairs).
The code above should print out something like below:
Okay, the output shape was good. Let’s now add some more necessary pieces to conduct the experiment: to overfit the 20 pairs of English-French sentences.
We will start with the data preparation:
For step-by-step instructions, please refer to my previous blog post on NMT and the Luong attention mechanism.
Loss Function & Optimizer
The loss function also requires no modification from what we used before: SparseCategoricalCrossentropy with a mask to filter out padded tokens. And we are using the Adam optimizer with default settings:
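To show what the mask does to the loss, here is a NumPy sketch of a masked sparse categorical cross-entropy computed from logits. Two assumptions to keep in mind: padded tokens are assumed to have id 0, and the sketch averages over the non-padded tokens only, which may differ from the reduction Keras applies with a sample weight.

```python
import numpy as np

def masked_sparse_crossentropy(targets, logits, pad_id=0):
    """targets: (batch, length) integer ids, logits: (batch, length, vocab_size)."""
    # Log-softmax over the vocabulary axis, computed in a stable way.
    shifted = logits - logits.max(axis=-1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    # Log-probability assigned to the correct token at every position.
    picked = np.take_along_axis(log_probs, targets[..., np.newaxis], axis=-1)[..., 0]
    # 1 for real tokens, 0 for padding, so padded positions add nothing.
    mask = (targets != pad_id).astype(log_probs.dtype)
    return -(picked * mask).sum() / mask.sum()

targets = np.array([[1, 2, 0]])    # the last token is padding
logits = np.zeros((1, 3, 4))       # uniform prediction over 4 words
loss = masked_sparse_crossentropy(targets, logits)
print(np.isclose(loss, np.log(4)))  # True
```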
Again, the train_step function is pretty simple and similar to what we implemented before:
If you are getting bored and require something new, here it is. We have to make a small change to the predict function. What we used to do is as follows:
- Feed the <start> token into the model
- Take the last output as the predicted word
- Append the predicted word to the result
- Feed the predicted word and its associated state to the model and repeat from the second step
We cannot do the same with the Transformer. Why? Because we lost the sequential mechanism, i.e. the state. Instead, here is what we are going to do:
- Feed the <start> token into the model
- Take the last output as the predicted word
- Append the predicted word to the result
- Feed the entire result into the model and repeat from the second step
Let’s see an example:
- Feed “<start>”
- The last word: “I”
- Feed “<start> I”
- The last word: “am”
- Feed “<start> I am”
- Final result: “<start> I am Trung Tran . <end>”
And here is the code:
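The loop itself can be sketched in plain Python as below. Everything model-related is a stand-in: `next_token_fn` is a hypothetical callable that would wrap the Encoder, the Decoder and an argmax over the last position, and `toy_model` simply walks through a made-up vocabulary so that the example runs without a trained network.

```python
def greedy_decode(next_token_fn, start_id, end_id, max_len=14):
    """Feed the whole result back into the model at every step."""
    result = [start_id]
    while len(result) < max_len:
        # The model sees the entire sequence decoded so far, not just the last word.
        next_id = next_token_fn(result)
        result.append(next_id)
        if next_id == end_id:
            break
    return result

# Toy vocabulary and "model", for illustration only.
vocab = ["<start>", "I", "am", "Trung", "Tran", ".", "<end>"]

def toy_model(tokens):
    # Pretend the network always predicts the next word in vocabulary order.
    return tokens[-1] + 1

ids = greedy_decode(toy_model, start_id=0, end_id=6)
print(" ".join(vocab[i] for i in ids))  # <start> I am Trung Tran . <end>
```

The price of having no state is that every step re-processes the full prefix, so the cost of one step grows with the length of the result.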
The final thing to do is to create the training loop as follows. We will train the model for 100 epochs and periodically print out the loss value as well as some translation results for monitoring purposes.
Looks like we have everything we need. Let’s start training!
In the early stage, the model could only print out some meaningless phrases, which is absolutely normal.
But we don’t have to wait so long. Something cool started to appear after about 50 epochs:
The model kept learning until the 100th epoch (we can tell by the loss value). Let’s have the model translate all 20 training sentence pairs:
The Transformer worked as expected and it only took ~ 80 epochs to overfit the tiny dataset, whereas the vanilla Seq2Seq needed ~ 250 epochs to do the same thing. Attention’s power is now confirmed! The model still made some weird translations and we will soon know the reason why.
So that was how we should interpret the paper and implement a quick-and-dirty version of the Transformer. In the next section, let’s see what we can do to improve the model’s performance.
Enhance the simple Transformer
RNNs (of which the Seq2Seq model is made) are known to be inefficient on GPUs because of their sequential mechanism. The Transformer, on the other hand, consists mainly of matrix multiplications, which means that it is supposed to be (super) fast on GPUs.
At this point, we already created a working Transformer which can overfit the tiny dataset. But it is slow and we should not plug the full training dataset in yet. Let’s see what we can do to improve our model.
Improve the Encoder
Let’s start by taking a look at the Encoder. There are two bottlenecks within the forward pass that slow things down: the for loop for the Embedding layer and the for loop to compute attention heads.
The first one is easy. The second one is a bit trickier.
In fact, that for loop was there to provide us a good understanding of how attention is being computed and make things easier to debug. With the current implementation of the Multi-Head Attention layer, we can just stuff the whole sequence in without affecting the result. We can rewrite the Encoder’s forward pass as follows:
Are we done with the Encoder yet? Well, there is one more change we should make regarding the padded zeros.
As you already know, we have to append zeros to make all sequences equal in length. Those padded tokens basically have no meaning. And having our model (accidentally) pay attention to them is not a good thing at all. We haven’t done anything about it before and that was what caused the weird translation results.
The solution for that is to use a mask, which is 0 at padded tokens and 1 everywhere else.
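A minimal NumPy sketch of such a mask is shown below, assuming that padded positions carry the token id 0; the function name `padding_mask` and the toy batch are made up for the example.

```python
import numpy as np

def padding_mask(sequences, pad_id=0):
    """1.0 where there is a real token, 0.0 where the sequence was padded."""
    return (sequences != pad_id).astype(np.float32)

# Two toy source sequences padded with zeros to the same length.
batch = np.array([[5, 8, 3, 0, 0],
                  [7, 2, 0, 0, 0]])
mask = padding_mask(batch)
print(mask.shape)  # (2, 5)
# Rows of the mask: [1, 1, 1, 0, 0] and [1, 1, 0, 0, 0]
```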
Because we need to use the above mask in the Decoder, we will create it later inside the train_step function.
So now we have a mask, let’s modify the Multi-Head Attention to adopt that change.
Add masking to Multi-Head Attention
The modification to make in order to apply the masking mechanism is pretty simple. Essentially, we want the masked positions to become 0 after applying softmax (i.e. zero attention). We can achieve that by setting their scores to an extremely large negative value before the softmax. Simple math, right?
The new forward pass of the Multi-Head Attention is as follows:
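The masking trick on its own can be sketched in NumPy as follows. The assumptions: one attention head without Dense projections, a mask that is 1 where attention is allowed and 0 where it is not, and -1e9 as the "extremely large negative value"; the function names are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(x)
    return e / np.sum(e, axis=axis, keepdims=True)

def masked_attention(query, value, mask=None):
    """mask must be broadcastable to the score's shape (batch, query_len, value_len)."""
    depth = query.shape[-1]
    score = query @ value.transpose(0, 2, 1) / np.sqrt(depth)
    if mask is not None:
        # Positions where mask == 0 receive a huge negative score,
        # so their softmax weight becomes (numerically) zero.
        score = score + (1.0 - mask) * -1e9
    alignment = softmax(score, axis=-1)
    return alignment @ value, alignment

rng = np.random.default_rng(0)
query = rng.normal(size=(1, 2, 4))
value = rng.normal(size=(1, 3, 4))
mask = np.array([[[1.0, 1.0, 0.0]]])   # (1, 1, 3): the last value token is padding
context, alignment = masked_attention(query, value, mask)
print(np.allclose(alignment[..., 2], 0.0))  # True
```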
Improve the Decoder
In order to speed up the Decoder, we will basically do the same things as what we did with the Encoder, which is to get rid of the inefficient for loops.
Before diving into the code, let us recap a little bit. Do you remember that the Decoder has two Multi-Head Attention layers, of which the bottom one does not allow any token to pay attention to the tokens on its right-hand side? If we need to compute attention for the whole sequence at once, we’re gonna need a mask:
That mask is very easy to implement. You can do it in a pythonic way like this:
Or you can use a built-in TensorFlow function called band_part (tf.linalg.band_part).
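The two options can be sketched in NumPy as follows. Here `np.tril` plays the role of the built-in function: keeping only the lower triangle is what `tf.linalg.band_part(x, -1, 0)` does as well. The function names are made up for this example.

```python
import numpy as np

def look_left_only_mask_loop(length):
    """Row i is 1 for columns 0..i and 0 for everything to the right."""
    mask = np.zeros((length, length), dtype=np.float32)
    for i in range(length):
        mask[i, :i + 1] = 1.0
    return mask

def look_left_only_mask_tril(length):
    """Same mask, obtained by keeping the lower triangle of a matrix of ones."""
    return np.tril(np.ones((length, length), dtype=np.float32))

mask = look_left_only_mask_tril(4)
print(mask)
# [[1. 0. 0. 0.]
#  [1. 1. 0. 0.]
#  [1. 1. 1. 0.]
#  [1. 1. 1. 1.]]
```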
Both will produce a mask like above. Feel free to use the one that you like. And here is the new forward pass of the Decoder:
Next, as mentioned above, we need to modify the train_step function to generate a padding mask for the source sequence:
Now we are ready to train. Don’t forget to re-initialize the model to apply the changes we made:
We can see that the new model took approximately 3.76s for one epoch, whereas the old model needed 4.31s. It did cut out approximately 13% of the training time (I did this experiment on Colab so the numbers are not always the same).
Improve the Multi-Head Attention
Can we push it further? The answer is YES. There is one for loop left and it lies within the MultiHeadAttention class.
However, this one may not be as simple as the ones we tackled above. The difference is:
- Above: one matrix multiplication operator ((B, L1, M) x (B, M, L2) => L1 times of (B, 1, M) x (B, M, L2))
- This time: multiple matrix multiplication operators (H times of (B, L1, M) x (B, M, L2))
In fact, it is simpler than it sounds. The tf.matmul function is smart enough to retain not only the batch (0th) axis but also axes which are not directly involved in the dot operator.
What I meant by that is if we manage to have matrices of shapes (B, H, L1, M) and (B, H, M, L2) respectively, only calling tf.matmul once will produce the exact same result as the for loop. We may want to be extremely careful with the matrix transformation though. Here is the code illustrating the idea:
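The idea can be checked with a small NumPy sketch, since `np.matmul` treats leading axes as batch axes in the same way. The helper `split_heads` and the toy sizes are assumptions for illustration; the reshape followed by the transpose is the "matrix transformation" that needs care.

```python
import numpy as np

def split_heads(x, h):
    """(batch, length, model_size) -> (batch, h, length, model_size // h)."""
    batch, length, model_size = x.shape
    # Cut the feature axis into h chunks first, then move the head axis forward.
    x = x.reshape(batch, length, h, model_size // h)
    return x.transpose(0, 2, 1, 3)

rng = np.random.default_rng(0)
B, H, L1, L2, M = 2, 4, 5, 7, 8
query = split_heads(rng.normal(size=(B, L1, H * M)), H)   # (B, H, L1, M)
key = split_heads(rng.normal(size=(B, L2, H * M)), H)     # (B, H, L2, M)

# For-loop version: one (B, L1, M) x (B, M, L2) product per head.
looped = np.stack(
    [query[:, i] @ key[:, i].transpose(0, 2, 1) for i in range(H)], axis=1)

# Single call: the leading (B, H) axes are both treated as batch axes.
batched = np.matmul(query, key.transpose(0, 1, 3, 2))     # (B, H, L1, L2)

print(batched.shape)                 # (2, 4, 5, 7)
print(np.allclose(looped, batched))  # True
```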
To apply the change above, we need to make some modifications to the train_step function. Essentially, the masks used in the Multi-Head Attention layer must be broadcastable to the score’s shape, which is (batch_size, H, query_len, value_len). The look_left_only_mask has the shape of (query_len, value_len), so there should be no problem at all. The padding_mask’s shape, however, is currently (batch_size, value_len), so we need to explicitly turn it into (batch_size, 1, 1, value_len).
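The shape bookkeeping can be verified with a few lines of NumPy, whose broadcasting rules match what is described here. The toy sizes and the `source` batch are assumptions for illustration, and padded tokens are again assumed to have id 0.

```python
import numpy as np

batch_size, H, query_len, value_len = 2, 4, 5, 5
score = np.zeros((batch_size, H, query_len, value_len))

source = np.array([[5, 8, 3, 0, 0],
                   [7, 2, 9, 4, 0]])
padding_mask = (source != 0).astype(np.float64)             # (batch_size, value_len)
# Insert two singleton axes so the mask lines up with (H, query_len).
padding_mask = padding_mask[:, np.newaxis, np.newaxis, :]   # (batch_size, 1, 1, value_len)

# (query_len, value_len) already aligns with the two trailing axes of the score.
look_left_only_mask = np.tril(np.ones((query_len, value_len)))

print(padding_mask.shape)                    # (2, 1, 1, 5)
print((score * padding_mask).shape)          # (2, 4, 5, 5)
print((score * look_left_only_mask).shape)   # (2, 4, 5, 5)
```

Without the two extra axes, the batch axis of the padding mask would be matched against query_len, which either fails or, worse, silently masks the wrong positions when the two sizes happen to be equal.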
We are now ready to run the training process again and we shall see that everything is still working as before:
You might notice little to no improvement in the training speed. But believe me, once you plug in the real training data and use the full 8-head attention setup, the new MultiHeadAttention implementation will become a game changer!
Last but not least, let’s check out the translation result of all 20 training source sentences. A deep learning engineer should always be skeptical of everything:
And that is that! Everything worked flawlessly!
So we have finished implementing our own Transformer entirely from scratch. We started off by creating a quick-and-dirty version while interpreting the paper, then moved on to improve the model’s performance so that it’s production-ready. If we stay extremely cautious and take one baby step at a time, we can tackle literally every paper we are interested in. Believe me.
You can find all the code related to this post below:
- Colab notebook for this post: link
- Colab notebook for full training, including attention heatmap visualization: link
- Source code for full training data: link
Feel free to play with the code and give me some feedback; I would appreciate that. Thank you all for reading and we’re gonna see each other again very shortly.