Hello guys, spring has come and I guess you’re all feeling good. Today, join me on a journey of creating a neural machine translation model with an attention mechanism, using the hot-off-the-press Tensorflow 2.0.
Oh wait! I did have a series of blog posts on this topic, not so long ago. Here are the links:
Unfortunately, Tensorflow 2.0 made them outdated, and I never had the chance to write about the attention mechanism before, so I think those are good reasons to write a new blog post.
Don’t you worry, I won’t make another series of 3 to 4 blog posts this time (it would be dull to do so). Everything will be covered just within this blog post.
With that being said, our objective is pretty simple: we will use a very simple dataset (with only 20 examples) and we will try to overfit the training data with the renowned Seq2Seq model. For the attention mechanism, we’re gonna use Luong attention, which I personally prefer over Bahdanau’s.
At the end of this post, I will also provide source code to deal with actual training data (English – French pairs). The workflow is basically the same so you can check out by yourselves.
Without talking too much about theories today, let’s jump right in the implementation. As usual, we will go through the steps below:
- Data Preparation
- Seq2Seq without Attention
- Seq2Seq with Luong Attention
Let’s tackle them one by one. The fanciest part is obviously the last one. Feel free to skip to that section if you feel confident.
In order to get the most out of today’s post, I recommend that you have:
- Tensorflow 2.0 installed (I have a tutorial here)
- Read Sequence To Sequence Learning paper: here
- Read Luong Attention paper: here
Seems like we’re all ready. Let’s get started!
Let’s talk about the data. We’re gonna use 20 English – French pairs (which I extracted from the original dataset). The reasons for using such a small dataset are:
- Easier to understand how the sequences are processed
- Extremely fast to train
- No challenge in confirming the results even if you don’t speak French
Things will start to make sense shortly. First off, let’s import the necessary packages and take a look at the data:
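Here’s a minimal sketch of the setup. The package list, the variable name raw_data, and the sample pairs shown are my assumptions, not necessarily the original script (the sister example is one we will meet again later):

```python
import numpy as np
import tensorflow as tf

# A tiny toy dataset: each tuple is (English sentence, French sentence).
raw_data = (
    ('What a ridiculous concept!', 'Quel concept ridicule !'),
    ('Your idea is not entirely crazy.', "Votre idée n'est pas complètement folle."),
    ('I just want to have a sister.', 'Je veux juste avoir une soeur.'),
    # ... the rest of the 20 pairs goes here
)

print(raw_data[0])
```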
As you can see, the data is a list of tuples, each of which contains an English sentence and a French sentence.
Next, we will need to clean up the raw data a little bit. This kind of task usually involves normalizing strings, filtering unwanted tokens, adding space before punctuation, etc. Most of the time, what you need are two functions like below:
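Something along these lines should do; the exact regular expressions are my take, keeping only letters and the .!? punctuation mentioned below:

```python
import re
import unicodedata

def unicode_to_ascii(s):
    # Strip accents, e.g. 'idée' becomes 'idee'.
    return ''.join(
        c for c in unicodedata.normalize('NFD', s)
        if unicodedata.category(c) != 'Mn')

def normalize_string(s):
    s = unicode_to_ascii(s)
    # Add a space before the punctuation marks we keep as tokens.
    s = re.sub(r'([!.?])', r' \1', s)
    # Replace everything except letters and .!? with a space.
    s = re.sub(r'[^a-zA-Z.!?]+', r' ', s)
    # Collapse repeated whitespace and lowercase.
    s = re.sub(r'\s+', r' ', s)
    return s.strip().lower()
```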
We will now split the data into two separate lists, each containing its own language’s sentences. Then we will apply the functions above and add the two special tokens, <start> and <end>:
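A sketch, reusing the names from above:

```python
raw_data_en, raw_data_fr = list(zip(*raw_data))
raw_data_en = [normalize_string(s) for s in raw_data_en]
raw_data_fr = [normalize_string(s) for s in raw_data_fr]

# The decoder needs two versions of each target sentence:
# inputs start with <start>, targets end with <end>.
raw_data_fr_in = ['<start> ' + s for s in raw_data_fr]
raw_data_fr_out = [s + ' <end>' for s in raw_data_fr]
```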
I need to elaborate a little bit here. First off, let’s take a look at the figure below:
The Seq2Seq model consists of two networks: the encoder and the decoder. The encoder, which is on the left-hand side, requires only sequences from the source language as inputs.
The decoder, on the other hand, requires two versions of destination language’s sequences, one for inputs and one for targets (loss computation). The decoder itself is usually called a language model (we used it a lot for text generation, remember?).
From personal experiments, I also found that it would be better not to add <start> and <end> tokens to source sequences. Doing so would confuse the model, especially the attention mechanism later on, since all sequences start with the same token.
Next, let’s see how to tokenize the data, i.e. convert the raw strings into integer sequences. We’re gonna use the text tokenization utility class from Keras:
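A one-liner, with a variable name of my choosing:

```python
en_tokenizer = tf.keras.preprocessing.text.Tokenizer(filters='')
```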
Pay attention to the filters argument. By default, Keras’ Tokenizer trims out all punctuation, which is not what we want. Since we have already filtered out punctuation ourselves (except for .!?), we can just set filters to an empty string here.
The crucial part of tokenization is vocabulary. Keras’ Tokenizer class comes with a few methods for that. Since our data contains raw strings, we will use the one called fit_on_texts.
The tokenizer will create its own vocabulary as well as conversion dictionaries. Take a look:
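Something like this (your printout will differ slightly):

```python
en_tokenizer.fit_on_texts(raw_data_en)

# word_index maps tokens to integers; index_word goes the other way.
print(en_tokenizer.word_index)
```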
We can now have the raw English sentences converted to integer sequences:
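Again a one-liner, following the Keras API:

```python
data_en = en_tokenizer.texts_to_sequences(raw_data_en)
```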
Last but not least, we need to pad zeros so that all sequences have the same length. Otherwise, we won’t be able to create tf.data.Dataset object later on.
Let’s check if everything is okay:
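A sketch using Keras’ pad_sequences helper; padding='post' appends the zeros at the end:

```python
data_en = tf.keras.preprocessing.sequence.pad_sequences(
    data_en, padding='post')
print(data_en[:3])
```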
Everything is perfect. Go ahead and do exactly the same with French sentences:
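A sketch mirroring the English side; note the two fit_on_texts calls, which ties into the note below:

```python
fr_tokenizer = tf.keras.preprocessing.text.Tokenizer(filters='')

# Fit on both versions so that <start> and <end> make it into the vocabulary.
fr_tokenizer.fit_on_texts(raw_data_fr_in)
fr_tokenizer.fit_on_texts(raw_data_fr_out)

data_fr_in = fr_tokenizer.texts_to_sequences(raw_data_fr_in)
data_fr_in = tf.keras.preprocessing.sequence.pad_sequences(
    data_fr_in, padding='post')

data_fr_out = fr_tokenizer.texts_to_sequences(raw_data_fr_out)
data_fr_out = tf.keras.preprocessing.sequence.pad_sequences(
    data_fr_out, padding='post')
```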
One mid-way note, though: we can call fit_on_texts multiple times on different corpora, and the tokenizer will update its vocabulary automatically. Just always remember to finish all fit_on_texts calls before using texts_to_sequences.
The last step is easy, we only need to create an instance of tf.data.Dataset:
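A minimal sketch; the batch size of 5 is an arbitrary choice of mine:

```python
BATCH_SIZE = 5

dataset = tf.data.Dataset.from_tensor_slices(
    (data_en, data_fr_in, data_fr_out))
dataset = dataset.shuffle(20).batch(BATCH_SIZE)
```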
And that’s it. We have done preparing the data!
Seq2Seq model without Attention
By now, we probably know that attention mechanism is the new standard in machine translation tasks. But I think there are good reasons to create the vanilla Seq2Seq first:
- Pretty simple and easy with tf.keras
- No headache to debug when things go wrong
- Be able to answer: why do we need attention at all?
Okay, let’s assume that you are all convinced. We will start off with the encoder. Inside the encoder, there are an embedding layer and an RNN layer (which can be a vanilla RNN, an LSTM, or a GRU). At every forward pass, it takes in a batch of sequences and initial states and returns the output sequences as well as the final states:
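Here’s a sketch of such an encoder, assuming an LSTM (the class layout and names are mine, not necessarily the original code):

```python
class Encoder(tf.keras.Model):
    def __init__(self, vocab_size, embedding_size, lstm_size):
        super(Encoder, self).__init__()
        self.lstm_size = lstm_size
        self.embedding = tf.keras.layers.Embedding(vocab_size, embedding_size)
        self.lstm = tf.keras.layers.LSTM(
            lstm_size, return_sequences=True, return_state=True)

    def call(self, sequence, states):
        embed = self.embedding(sequence)
        output, state_h, state_c = self.lstm(embed, initial_state=states)
        return output, state_h, state_c

    def init_states(self, batch_size):
        # All-zero initial states for the LSTM.
        return (tf.zeros([batch_size, self.lstm_size]),
                tf.zeros([batch_size, self.lstm_size]))
```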
And here is how the data’s shape changes at each layer. I find that keeping track of the data’s shape is extremely helpful not to make silly mistakes, just like stacking up Lego pieces:
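Something like this, for the LSTM encoder sketched above:

```python
# sequence:          (batch_size, seq_len)
# after embedding:   (batch_size, seq_len, embedding_size)
# LSTM output:       (batch_size, seq_len, lstm_size)
# state_h, state_c:  (batch_size, lstm_size)
```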
We are done with the encoder. Next, let’s create the decoder. Without an attention mechanism, the decoder is basically the same as the encoder, except that it has a Dense layer to map the RNN’s outputs into vocabulary space:
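A matching sketch for the decoder:

```python
class Decoder(tf.keras.Model):
    def __init__(self, vocab_size, embedding_size, lstm_size):
        super(Decoder, self).__init__()
        self.lstm_size = lstm_size
        self.embedding = tf.keras.layers.Embedding(vocab_size, embedding_size)
        self.lstm = tf.keras.layers.LSTM(
            lstm_size, return_sequences=True, return_state=True)
        # Maps the LSTM outputs into vocabulary space.
        self.dense = tf.keras.layers.Dense(vocab_size)

    def call(self, sequence, state):
        embed = self.embedding(sequence)
        lstm_out, state_h, state_c = self.lstm(embed, initial_state=state)
        logits = self.dense(lstm_out)
        return logits, state_h, state_c
```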
Similarly, here’s the data’s shape at each layer:
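Again as comments, assuming the layers above:

```python
# sequence:          (batch_size, seq_len)
# after embedding:   (batch_size, seq_len, embedding_size)
# LSTM output:       (batch_size, seq_len, lstm_size)
# logits:            (batch_size, seq_len, vocab_size)
```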
As you might have noticed in Figure 1, the final states of the encoder will act as the initial states of the decoder. That’s the difference between a language model and a decoder of Seq2Seq model.
And that is the decoder we need to create. Before moving on, let’s check that we didn’t make any mistakes along the way:
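A quick sanity check along these lines should do; the hyper-parameters and the dummy batches are arbitrary picks of mine:

```python
EMBEDDING_SIZE = 32
LSTM_SIZE = 64

en_vocab_size = len(en_tokenizer.word_index) + 1
fr_vocab_size = len(fr_tokenizer.word_index) + 1

encoder = Encoder(en_vocab_size, EMBEDDING_SIZE, LSTM_SIZE)
decoder = Decoder(fr_vocab_size, EMBEDDING_SIZE, LSTM_SIZE)

# Push one dummy batch through both networks and inspect the shapes.
source_input = tf.constant([[1, 3, 5, 7, 2, 0, 0, 0]])
initial_state = encoder.init_states(1)
encoder_output, en_state_h, en_state_c = encoder(source_input, initial_state)

target_input = tf.constant([[1, 4, 6, 9, 2, 0, 0]])
decoder_output, de_state_h, de_state_c = decoder(
    target_input, (en_state_h, en_state_c))

print(encoder_output.shape, decoder_output.shape)
```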
Great! Everything is working as expected. The next thing to do is to define a loss function. Since we padded zeros into the sequences, let’s not take those zeros into account when computing the loss:
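A sketch of such a masked loss, built on Keras’ sparse categorical cross-entropy:

```python
def loss_func(targets, logits):
    crossentropy = tf.keras.losses.SparseCategoricalCrossentropy(
        from_logits=True)
    # Mask out the padded zeros so they don't contribute to the loss.
    mask = tf.math.logical_not(tf.math.equal(targets, 0))
    mask = tf.cast(mask, tf.int64)
    return crossentropy(targets, logits, sample_weight=mask)
```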
What else do we need? Right, we haven’t created an optimizer yet!
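Any of the built-in optimizers will do; Adam is a safe default:

```python
optimizer = tf.keras.optimizers.Adam()
```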
Now we’re ready to create the training function, in which we perform a forward pass followed by a backward pass (a sketch follows the list below). There are two things to remember:
- We use the @tf.function decorator to take advantage of static graph computation (remove it when you want to debug)
- The network’s computations need to be put inside a tf.GradientTape() context so that the gradients can be tracked
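Here’s a sketch of the training function, under the assumptions above (LSTM states, plus the loss_func and optimizer we just defined):

```python
@tf.function
def train_step(source_seq, target_seq_in, target_seq_out, en_initial_states):
    with tf.GradientTape() as tape:
        en_outputs = encoder(source_seq, en_initial_states)

        # The encoder's final states become the decoder's initial states.
        en_states = en_outputs[1:]
        de_outputs = decoder(target_seq_in, en_states)

        logits = de_outputs[0]
        loss = loss_func(target_seq_out, logits)

    variables = encoder.trainable_variables + decoder.trainable_variables
    gradients = tape.gradient(loss, variables)
    optimizer.apply_gradients(zip(gradients, variables))

    return loss
```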
Before creating the training loop, let’s define a method for inference. What it does is basically a forward pass, but instead of target sequences, we feed in the <start> token. Every subsequent time step takes the output of the previous time step as input, until we hit the <end> token or the output sequence exceeds a specified length:
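A sketch of such a predict function; the maximum length of 20 is an arbitrary safety limit of mine:

```python
def predict(test_source_text=None):
    if test_source_text is None:
        test_source_text = raw_data_en[np.random.choice(len(raw_data_en))]
    test_source_seq = en_tokenizer.texts_to_sequences([test_source_text])

    en_initial_states = encoder.init_states(1)
    en_outputs = encoder(tf.constant(test_source_seq), en_initial_states)

    # Start decoding with the <start> token.
    de_input = tf.constant([[fr_tokenizer.word_index['<start>']]])
    de_state_h, de_state_c = en_outputs[1:]
    out_words = []

    while True:
        de_output, de_state_h, de_state_c = decoder(
            de_input, (de_state_h, de_state_c))
        # Greedily pick the most likely word and feed it back in.
        de_input = tf.argmax(de_output, -1)
        out_words.append(fr_tokenizer.index_word[de_input.numpy()[0][0]])

        # Stop at <end> or when the output grows too long.
        if out_words[-1] == '<end>' or len(out_words) >= 20:
            break

    print(' '.join(out_words))
```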
And finally, here comes the training loop. At every epoch, we will grab batches of data for training. We also print out the loss value and see how the model performs at the end of each epoch:
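A minimal loop; 250 epochs matches the number mentioned below, the rest is my choice:

```python
NUM_EPOCHS = 250

for e in range(NUM_EPOCHS):
    en_initial_states = encoder.init_states(BATCH_SIZE)
    for source_seq, target_seq_in, target_seq_out in dataset:
        loss = train_step(source_seq, target_seq_in,
                          target_seq_out, en_initial_states)

    print('Epoch {} Loss {:.4f}'.format(e + 1, loss.numpy()))
    predict()
```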
Let’s monitor the training process. The whole thing takes less than 5 minutes on a GPU machine, so if something goes wrong, you will know immediately.
At first, the translation results didn’t make any sense at all. But gradually, the model learned to make more meaningful phrases. Finally, by the 250th epoch, it had completely memorized all 20 sentences. Below is my result:
Obviously, we can confirm that the model can actually learn to translate from a small dataset. I also trained the same model (with some modifications on hyper-parameters) using the full English-French dataset. You can tell from the result that the model’s translation is quite acceptable, right?
Links to source files will be provided at the end of this post so you can play with different settings by yourselves.
And guys, we have finished our first mission! We have successfully created a fully functional Seq2Seq model, with no attention mechanism yet.
In the next section, we will see that with just a few modifications, we can immediately upgrade our current model with Luong attention.
Seq2Seq model with Luong attention
Now, let’s talk about attention mechanism. What is it and why do we need it?
Things are pretty difficult to explain (especially when it comes to deep learning) if we only look at mathematical equations. So, let’s change our perspective and consider the machine translation model as a learner who is trying to learn a foreign language.
Speaking of learning a new language, personally, I think the two problems below are the most common ones we have all had to deal with:
- Difficulty remembering and processing long, complicated context
- Struggling with differences in syntax structure between the new language and your mother tongue
And guess what? Machine translation models face the same problems too. I’ll give you an example. Below I have a sentence in English:
I just want to have a sister.
And here is the French version:
Je veux juste avoir une soeur.
Let’s see how those fit in the Seq2Seq model. We will see the problems very soon:
The first thing to notice is that the encoder’s state is only passed to the first node of the decoder. For that reason, the information from the encoder becomes less and less relevant with every time step.
The second problem, though, is about word order: the phrase “just want” in English corresponds to “veux juste” in French, i.e. the two words swap positions between the languages, and the vanilla Seq2Seq model has a hard time capturing that.
So, how can we possibly solve those problems? Ideally, we want all time steps within the decoder to have access to the encoder’s output. That way, the decoder would be able to learn to focus partially on the encoder’s output and produce more accurate translations.
That was the idea behind Attention Mechanisms! What I just said can be illustrated as follows:
So now we know how attention mechanisms work and why we need one. Without wasting another second, let’s go ahead and implement the Luong-style attention mechanism.
Technically, there are two terms we need to know in advance: the alignment vector and the context vector.
- The alignment vector
The alignment vector is a vector that has the same length as the source sequence and is computed at every time step of the decoder. Each of its values is the score (or the probability) of the corresponding word within the source sequence:
What alignment vectors do is put weights onto the encoder’s output or, intuitively, tell the decoder what to focus on at each time step.
- The context vector
The context vector is what we use to compute the final output of the decoder. It is the weighted average of the encoder’s output. You can see that we can get the context vector by computing the dot product of the alignment vector and the encoder’s output:
That’s the whole secret of attention mechanisms. Next, let’s see how we can create one in Python. First, let’s take a look at the equations to know exactly what we need to do. Here is how we’re gonna compute the alignment vector:
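In the notation of the Luong et al. paper, with $h_t$ the current decoder output and $\bar{h}_s$ the encoder output at source position $s$, the alignment vector is a softmax over the scores:

$$a_t(s) = \frac{\exp\left(\operatorname{score}(h_t, \bar{h}_s)\right)}{\sum_{s'}\exp\left(\operatorname{score}(h_t, \bar{h}_{s'})\right)}$$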
The Luong attention paper proposes three types of score function: dot, general and concat:
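Again following the paper’s notation:

$$\operatorname{score}(h_t, \bar{h}_s) = \begin{cases} h_t^\top \bar{h}_s & \text{(dot)} \\ h_t^\top W_a \bar{h}_s & \text{(general)} \\ v_a^\top \tanh\left(W_a [h_t; \bar{h}_s]\right) & \text{(concat)} \end{cases}$$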
Since I’m not going to talk about Bahdanau-style attention, here are the key differences between the two:
- Bahdanau attention mechanism proposed only the concat score function
- Luong-style attention uses the current decoder output to compute the alignment vector, whereas Bahdanau’s uses the output of the previous time step
Okay, we are now clear. Let’s code. For demonstration purposes, I will only show the code for the general score function. As you can see in the equation above, we need to take the dot product of a matrix called Wa and the encoder’s output. What layer can do a dot product? The Dense layer:
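A sketch of the attention module’s constructor; the class and attribute names are mine:

```python
class LuongAttention(tf.keras.Model):
    def __init__(self, rnn_size):
        super(LuongAttention, self).__init__()
        # Wa of the "general" score function; a Dense layer is
        # essentially a learned matrix multiplication.
        self.wa = tf.keras.layers.Dense(rnn_size)
```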
Next, we will implement the forward pass. Note that we have to pass in the encoder’s output this time around. The first thing to do is to compute the score: the dot product of the current decoder output and the output of the Dense layer. We can then compute the alignment vector by simply applying the softmax function and, finally, the context vector: the weighted average of the encoder’s output, which is a different way of saying the dot product of the alignment vector and the encoder’s output:
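And here’s a sketch of the corresponding forward pass, meant to live inside the LuongAttention class above:

```python
    # (continued) the forward pass of the LuongAttention class
    def call(self, decoder_output, encoder_output):
        # decoder_output: (batch_size, 1, rnn_size)
        # encoder_output: (batch_size, source_len, rnn_size)

        # Score: dot product of the decoder output and Wa(encoder_output).
        # -> (batch_size, 1, source_len)
        score = tf.matmul(decoder_output, self.wa(encoder_output),
                          transpose_b=True)

        # Softmax over the source positions gives the alignment vector.
        alignment = tf.nn.softmax(score, axis=2)

        # Weighted average of the encoder output: the context vector.
        # -> (batch_size, 1, rnn_size)
        context = tf.matmul(alignment, encoder_output)

        return context, alignment
```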
And we have finished the implementation of Luong-style attention. Let’s test that out.
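A quick test with random tensors, reusing the sizes defined earlier (the source length of 8 is arbitrary):

```python
attention = LuongAttention(LSTM_SIZE)

context, alignment = attention(
    tf.random.normal([BATCH_SIZE, 1, LSTM_SIZE]),
    tf.random.normal([BATCH_SIZE, 8, LSTM_SIZE]))
print(context.shape)    # (BATCH_SIZE, 1, LSTM_SIZE)
print(alignment.shape)  # (BATCH_SIZE, 1, 8)
```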
So far so good. Next, we will have to make a few more changes in order to use the attention mechanism above. Let’s start with the decoder.
From above, we have obtained the context and alignment vectors. The alignment vector takes no further part in the decoder’s computation (we will only need it for a fancy visualization later on). Okay, let’s see how we’re gonna use the context vector:
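These are the output equations from the Luong et al. paper (the “Equation 3” referred to below), with $c_t$ the context vector and $h_t$ the decoder’s RNN output at time step $t$:

$$\tilde{h}_t = \tanh\left(W_c [c_t; h_t]\right) \qquad \hat{y}_t = \operatorname{softmax}\left(W_s \tilde{h}_t\right)$$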
So, let’s interpret those equations, shall we? At each time step t, we concatenate the context vector and the current output of the RNN unit to form a new output vector. We then continue as normal: convert that vector into vocabulary space for the final output.
In order to apply those changes, first off, we need to create an attention object when creating the decoder. We also need to define two Dense layers for the two matrices called Wc and Ws in Equation 3 above. Remember that the first Dense layer uses the tanh activation function:
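Putting both changes together, a sketch of the new constructor (it replaces the earlier Decoder):

```python
class Decoder(tf.keras.Model):
    def __init__(self, vocab_size, embedding_size, rnn_size):
        super(Decoder, self).__init__()
        self.attention = LuongAttention(rnn_size)

        self.rnn_size = rnn_size
        self.embedding = tf.keras.layers.Embedding(vocab_size, embedding_size)
        self.lstm = tf.keras.layers.LSTM(
            rnn_size, return_sequences=True, return_state=True)

        # Wc comes with tanh (Equation 3); Ws maps to vocabulary space.
        self.wc = tf.keras.layers.Dense(rnn_size, activation='tanh')
        self.ws = tf.keras.layers.Dense(vocab_size)
```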
We are done with the initialization. Next, let’s apply those changes to the forward pass, i.e. the call method. Since we do something with the attention mechanism at every time step, remember that the input sequence to the decoder is now a batch of one-word sequences.
We will begin by computing the embedded vector and getting the outputs from the RNN unit; notice that we do need to add the encoder’s output to the arguments. We need some attention now, so we use the decoder’s output together with the encoder’s output to get the context and alignment vectors. Once we have the context vector, it’s time to do exactly what is written in Equation 3: combine the context vector and the RNN output, then pass the combined vector through the two Dense layers:
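Here’s the whole call method in one piece, meant to live inside the Decoder class above:

```python
    # (continued) the forward pass of the attention-equipped Decoder
    def call(self, sequence, state, encoder_output):
        # sequence is a batch of one-word inputs:
        # (batch_size, 1) -> (batch_size, 1, embedding_size)
        embed = self.embedding(sequence)
        lstm_out, state_h, state_c = self.lstm(embed, initial_state=state)

        # Context and alignment vectors from the current decoder output
        # and the full encoder output.
        context, alignment = self.attention(lstm_out, encoder_output)

        # Concatenate the context vector and the LSTM output (Equation 3).
        lstm_out = tf.concat(
            [tf.squeeze(context, 1), tf.squeeze(lstm_out, 1)], 1)

        # Wc (with tanh), then Ws into vocabulary space.
        lstm_out = self.wc(lstm_out)
        logits = self.ws(lstm_out)

        return logits, state_h, state_c, alignment
```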
Okay, we are done with all the big changes. Next, let’s modify the train_step function. Since we now deal with one time step at a time on the decoder’s side, we need an explicit loop:
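A sketch of the modified train_step, under the same assumptions as before:

```python
@tf.function
def train_step(source_seq, target_seq_in, target_seq_out, en_initial_states):
    loss = 0
    with tf.GradientTape() as tape:
        en_outputs = encoder(source_seq, en_initial_states)
        de_state_h, de_state_c = en_outputs[1:]

        # Feed the target sequence to the decoder one word at a time.
        for i in range(target_seq_out.shape[1]):
            decoder_in = tf.expand_dims(target_seq_in[:, i], 1)
            logit, de_state_h, de_state_c, _ = decoder(
                decoder_in, (de_state_h, de_state_c), en_outputs[0])
            loss += loss_func(target_seq_out[:, i], logit)

    variables = encoder.trainable_variables + decoder.trainable_variables
    gradients = tape.gradient(loss, variables)
    optimizer.apply_gradients(zip(gradients, variables))

    return loss / target_seq_out.shape[1]
```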
Let’s do the same with the predict function. We also need to get the source sequence, the translated sequence and the alignment vectors for visualization purposes:
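A sketch; the shape juggling follows from the decoder now emitting one step’s logits at a time:

```python
def predict(test_source_text=None):
    if test_source_text is None:
        test_source_text = raw_data_en[np.random.choice(len(raw_data_en))]
    test_source_seq = en_tokenizer.texts_to_sequences([test_source_text])

    en_initial_states = encoder.init_states(1)
    en_outputs = encoder(tf.constant(test_source_seq), en_initial_states)

    de_input = tf.constant([[fr_tokenizer.word_index['<start>']]])
    de_state_h, de_state_c = en_outputs[1:]
    out_words = []
    alignments = []

    while True:
        de_output, de_state_h, de_state_c, alignment = decoder(
            de_input, (de_state_h, de_state_c), en_outputs[0])
        de_input = tf.expand_dims(tf.argmax(de_output, -1), 0)
        out_words.append(fr_tokenizer.index_word[de_input.numpy()[0][0]])

        # Keep the alignment vectors for the heat-map visualization.
        alignments.append(alignment.numpy())

        if out_words[-1] == '<end>' or len(out_words) >= 20:
            break

    print(' '.join(out_words))
    return np.array(alignments), test_source_text.split(' '), out_words
```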
And that’s it. We have finished the implementation of the Luong-style attention. Let’s start the training!
Okay, after a night of training, my model is done. It’s time to check it out: we want to know whether it has improved after being equipped with an attention mechanism.
So, to make it easy on the eyes, I decided to take the 20 examples we used at the beginning and compare the translations made by the two models. The result is as follows (the order is source sentence -> target sentence -> Seq2Seq -> Seq2Seq with Luong attention):
Judging by eye, the Seq2Seq model with Luong attention makes better translations than the vanilla Seq2Seq. Although it’s beyond the scope of this blog post, one may want to compute the BLEU score for a more rigorous evaluation.
Anyway, what is fun about using attention mechanisms is that we can visualize where the model is paying attention when making translations. Take a look at the GIF below. Can you see where my model is looking? 😉
I have a little problem here, though. When saving the figures to file, some of them didn’t display the labels correctly. I tried some solutions suggested on StackOverflow, but still no luck. I would appreciate it if you could tell me how to fix that.
You can find code to create those heat maps and convert to a GIF inside the source on my repository. Feel free to experiment on your own.
Phew! That’s it! We finally made it, guys. You were persistent enough to follow this long blog post all the way to the end, and I really appreciate that.
Let’s look back to see what we have accomplished today:
- We implemented a Sequence-to-Sequence model from scratch with Tensorflow 2.0
- We also know how attention mechanisms work and implemented Luong-style attention
That is a tremendous amount of work. What we have done today builds a strong foundation and can serve as a baseline for your next machine translation or chatbot projects. Keep up the good work and keep carrying out new experiments.
And as usual, you can find all the source code to reproduce the results above on my NLP repository:
- Seq2Seq on 20 examples: link
- The full English-French pairs: link (filename: fra-eng.zip)
- Seq2Seq on English-French pairs: link
- Seq2Seq + Luong attention on English-French pairs: link
That’s it for today everyone. Thank you again for your time. And I will see you in the next project.