Neural Machine Translation With TensorFlow: Model Creation

What’s up everybody. Welcome back to the Neural Machine Translation with TensorFlow (NMTwT) series. Last time, we went through the process of creating the input pipeline using the API. Today, I’m gonna show you how to create a model that can learn to translate human languages. So, let’s get started!


As usual, there are a couple of things that can help you get the most out of this post. This is the second part of the NMTwT series, so a good read through the first part is highly recommended. You can find it here: Data Preparation. Secondly, have a look at the Sequence to Sequence paper too!

Understand the Seq2Seq Model

Before we actually write any code, let’s talk a little bit about the model that we are going to create. First, I don’t know if it has an official name, but people in the NLP world often refer to it as Seq2Seq. You know why it’s called that, don’t you?

I won’t go into detail about the model architecture, which is well explained in the paper. I’m gonna point out only the things that I find necessary for the implementation. Basically, here’s what the Seq2Seq model looks like:

Figure 1: Seq2Seq model

So, what are the key points to notice here? Technically speaking, the Seq2Seq model has the following characteristics:

  • Like an Autoencoder (you should know about it), it consists of two parts: an Encoder and a Decoder. Unlike an Autoencoder, though, it learns to produce a different sequence (the translation) rather than to reconstruct its input.
  • Both networks are built from RNN units (since we are dealing with sequences).
  • During training, especially on complex problems, you should feed the targets as inputs to the Decoder (the technical term for this is teacher forcing).
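To make the teacher forcing idea concrete, here is a minimal plain-Python sketch (not the code from this series). It assumes hypothetical start and stop tokens named `<s>` and `</s>`; the helper name `make_decoder_pair` is also made up for illustration. The Decoder is fed the target shifted right by one step and is asked to predict the token that follows.

```python
# Hypothetical special tokens; the tokens used in the series may differ.
START, STOP = "<s>", "</s>"

def make_decoder_pair(target_tokens):
    """Return (decoder_inputs, decoder_targets) for teacher forcing."""
    decoder_inputs = [START] + list(target_tokens)   # what the Decoder reads
    decoder_targets = list(target_tokens) + [STOP]   # what it should predict
    return decoder_inputs, decoder_targets

inputs, targets = make_decoder_pair(["je", "suis", "content"])
for step, (fed, expected) in enumerate(zip(inputs, targets)):
    print(step, fed, "->", expected)
```

At every step the Decoder sees the correct previous word, even if its own prediction at that step was wrong, which is what keeps training stable.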

That’s all I want to tell you about the Seq2Seq model; you can dig into the details using the keywords above. And for now, let’s code!

Creating the Seq2Seq Model

The encoder

Now that we know everything we need to implement the Seq2Seq model, let’s go ahead and define a method for it:

Next, let’s create the Encoder network, which begins with an Embedding layer:
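Conceptually, an Embedding layer is just a lookup table: each token id selects one row of a (vocabulary size, embedding size) matrix. The sketch below shows that idea in plain Python rather than TensorFlow; the names `make_embedding_table` and `embed`, and the sizes used, are assumptions for illustration only.

```python
import random

def make_embedding_table(vocab_size, embedding_size, seed=0):
    """Build a (vocab_size, embedding_size) table of small random values."""
    rng = random.Random(seed)
    return [[rng.uniform(-0.1, 0.1) for _ in range(embedding_size)]
            for _ in range(vocab_size)]

def embed(token_ids, table):
    """Map a 2-D grid of token ids to a 3-D grid of embedding vectors."""
    return [[table[token_id] for token_id in row] for row in token_ids]

table = make_embedding_table(vocab_size=6, embedding_size=4)
token_ids = [[1, 2], [3, 4], [5, 0]]  # shape (sequence length 3, batch size 2)
embedded = embed(token_ids, table)
print(len(embedded), len(embedded[0]), len(embedded[0][0]))  # 3 2 4
```

In a real model the table is a trainable variable, so the rows are learned together with the rest of the network.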

The output of the Embedding layer will be fed into the LSTM units. Below is how you can stack up two or more LSTM layers: we first define a method called _create_encoder_cell to create a single LSTM layer, then call it within a loop.

Finally, we will call dynamic_rnn to get the output and the states of LSTM units:

Okay, it’s time to clear up the mystery, as you might have noticed a couple of odd things so far. First, why did we have to transpose the input sequence before feeding it to the Embedding layer? And what does time_major=True mean in the call to dynamic_rnn above?

Well, a batch of input sequences, viewed as one-hot vectors, has three dimensions: batch size, sequence length, and vocabulary size (after the Embedding layer, the last dimension becomes the embedding size instead). Normally, it has the shape (batch size, sequence length, vocabulary size), which is batch-major. If we swap the first two dimensions, we get a tensor of shape (sequence length, batch size, vocabulary size), which is time-major. Choosing between batch-major and time-major data is mostly a matter of personal preference (time-major is slightly more efficient because the RNN does not have to transpose the data internally), so you can use the batch-major format if you want.
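As a small illustration of the batch-major versus time-major layouts, here is a plain-Python sketch that swaps the first two axes of a nested list. The function name `to_time_major` and the toy token ids are made up for this example; in TensorFlow the same swap is done with a transpose.

```python
def to_time_major(batch):
    """Swap the first two axes: (batch, time, ...) -> (time, batch, ...)."""
    return [list(step) for step in zip(*batch)]

batch_major = [
    [11, 12, 13, 0],   # sentence 0, zero-padded to length 4
    [21, 22, 23, 24],  # sentence 1
]
time_major = to_time_major(batch_major)
print(time_major)  # [[11, 21], [12, 22], [13, 23], [0, 24]]
```

Each inner list of the time-major result holds what every sentence in the batch contributes at one time step, which is exactly what an RNN consumes per step.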

And that’s all we need to do to create the Encoder network. No different from the one we used to generate text, right? It’s time to move on.

The decoder (training)

Let’s take a look at Figure 1 above, but focus on the Decoder network this time. Basically, the Decoder looks similar to the Encoder during training, except that its initial states come from the Encoder and we do care about its outputs (in future posts about the Attention mechanism, we will utilize the Encoder’s outputs as well, but let’s just ignore them for now).

So let’s create the Decoder network. We will start off with the Embedding layer (no surprise):

Next, we will create the LSTM layers, exactly as we did in the Encoder:

What we’re doing after that is a little bit tricky. Of course, we could go ahead and call dynamic_rnn to compute the outputs and the states. But this time we will do something different. This approach, as you will see in future posts, will make implementing the Attention mechanism much easier.

First off, we will need something called a training helper, which helps us loop through the input sequences:
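To give a rough feel for what such a helper does, here is a plain-Python analogy (it is not the tf.contrib.seq2seq implementation). The generator name `training_helper_steps` is hypothetical. It walks a time-major batch one step at a time and reports which sequences have no real tokens left after the current step, based on their true lengths.

```python
def training_helper_steps(time_major_inputs, sequence_lengths):
    """Yield (step_inputs, finished_flags) for each time step."""
    for t, step_inputs in enumerate(time_major_inputs):
        # A sequence is finished once the next step would be past its length.
        finished = [t + 1 >= length for length in sequence_lengths]
        yield step_inputs, finished

inputs = [[1, 4], [2, 5], [3, 0]]  # (sequence length 3, batch size 2)
lengths = [3, 2]                   # the second sequence is zero-padded
steps = list(training_helper_steps(inputs, lengths))
for step_inputs, finished in steps:
    print(step_inputs, finished)
```

The decoder object uses this kind of information to decide what to feed next and when to stop stepping through a batch.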

Next, we will create a decoder object, which requires the following as inputs:

  • LSTM units
  • a training helper
  • initial states
  • a Dense layer

We already have the first two. The initial states, as I said above, are from the Encoder:

We need a Dense layer as the last ingredient. The Dense layer’s number of units is equal to the vocabulary size, as its purpose is to transform the Decoder’s outputs to vectors of the same shape as target sequences:
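The projection itself is a plain matrix multiplication plus a bias. The sketch below shows it for a single Decoder output vector in pure Python; the function name `dense` and the tiny sizes (hidden size 2, vocabulary size 3) are assumptions chosen so the numbers are easy to check by hand.

```python
def dense(vector, weights, bias):
    """Project a hidden vector to vocabulary-sized scores.

    weights has shape (hidden size, vocabulary size); bias has one entry
    per vocabulary word.
    """
    return [
        sum(v * row[j] for v, row in zip(vector, weights)) + bias[j]
        for j in range(len(bias))
    ]

hidden = [2.0, 3.0]              # one Decoder output, hidden size 2
weights = [[1.0, 0.0, 2.0],
           [0.0, 1.0, 3.0]]      # (hidden size 2, vocabulary size 3)
bias = [0.5, 0.5, 0.5]
logits = dense(hidden, weights, bias)
print(logits)  # [2.5, 3.5, 13.5]
```

Each entry of the result is an unnormalized score for one vocabulary word, which is what the loss function compares against the target word.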

And now we can create the decoder object:

Next, we will call the dynamic_decode function of the tf.contrib.seq2seq module to obtain the Decoder’s outputs and states (kind of similar to dynamic_rnn, right?):

Now we are ready to compute the loss. Don’t forget to apply transpose to the target sequences if you set time_major to True:

One more side note: we have to get rnn_output from decoder_outputs in order to use the actual outputs from the LSTM units.

We have to take one final step to compute the final loss. Remember we applied zero-padding when creating data batches? The zero-padded elements should not be taken into account when computing loss so let’s create a mask to filter them out:

Now we can compute the final loss as follows:
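Here is a minimal plain-Python sketch of the masking idea, assuming the padding id is 0 and that the per-token losses have already been computed. The names `padding_mask` and `masked_mean_loss` are hypothetical, the data is kept batch-major for readability, and averaging over the real tokens is only one common choice; the normalization used in this series may differ (dividing by the batch size is another option).

```python
def padding_mask(target_ids, pad_id=0):
    """Return 1.0 for real tokens and 0.0 for padded positions."""
    return [[0.0 if token == pad_id else 1.0 for token in row]
            for row in target_ids]

def masked_mean_loss(token_losses, mask):
    """Average the per-token losses over the non-padded positions only."""
    total = sum(loss * m
                for loss_row, mask_row in zip(token_losses, mask)
                for loss, m in zip(loss_row, mask_row))
    count = sum(m for mask_row in mask for m in mask_row)
    return total / count

target_ids = [[5, 7, 0],             # last position is padding
              [3, 4, 2]]
token_losses = [[1.0, 2.0, 9.0],     # the 9.0 sits on a padded position
                [3.0, 4.0, 5.0]]
mask = padding_mask(target_ids)
loss = masked_mean_loss(token_losses, mask)
print(mask)  # [[1.0, 1.0, 0.0], [1.0, 1.0, 1.0]]
print(loss)  # 3.0
```

Without the mask, the 9.0 on the padded position would be counted and the model would be penalized for something that is not part of the sentence.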

And that’s the Decoder we need for the training process.

The decoder (inference)

Let’s talk about the Decoder in inference mode. Obviously, we won’t have the target sequences to feed in, so what do we do?

Remember the states from the Encoder? We will feed those states plus the start token into the Decoder. From there, the output will become the input of the next step. And the process goes on and on until it meets the stop condition, which is the stop token. The whole process looks like below:

Figure 2: Inference mode
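In plain Python, the loop in Figure 2 could be sketched roughly as follows. This is only an illustration: `greedy_decode` and `toy_step` are hypothetical names, and `toy_step` is a scripted stand-in for the real Decoder (it simply emits the ids stored in its state, then the stop id).

```python
def greedy_decode(step_fn, initial_state, start_id, stop_id, max_steps):
    """Feed each predicted token back in until the stop token or the cap."""
    token, state, output = start_id, initial_state, []
    for _ in range(max_steps):
        scores, state = step_fn(token, state)
        # Greedy choice: index of the highest score.
        token = max(range(len(scores)), key=scores.__getitem__)
        if token == stop_id:
            break
        output.append(token)
    return output

def toy_step(token, state):
    """Stand-in for one Decoder step over a vocabulary of 5 ids."""
    next_token = state[0] if state else 1  # id 1 plays the stop token
    scores = [0.0] * 5
    scores[next_token] = 1.0
    return scores, state[1:]

result = greedy_decode(toy_step, initial_state=[3, 4, 2],
                       start_id=0, stop_id=1, max_steps=10)
print(result)  # [3, 4, 2]
```

The `max_steps` cap matters in practice: an untrained or confused model may never emit the stop token, so the loop needs an upper bound.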

Let’s implement the inference phase. You will see that, by using the tf.contrib.seq2seq module, the steps above can be built into the computational graph itself, instead of manually running the Decoder at each time step.

First, we need another scope with reuse=True, since we are in the inference phase and want to reuse the weights created for training. We then need to look up the start and stop tokens:

Can you guess what we will do next? Yes, we need a helper. But we can’t use the training helper above since we don’t have the target sequences. What we’re gonna use is called GreedyEmbeddingHelper, which takes the output with the highest probability and feeds it in as the input of the next time step.

Notice that it has Embedding in its name, which means that it will apply the embedding itself; we just need to pass the input and the embedding layer. We also tell it the stop condition, which is the stop token:

So far so good. Next, we will create the decoder object:

And then use it to get the output sequence. We will set the maximum length for the output to twice the input sequence’s length:

Finally, let’s extract the actual output. Instead of calling rnn_output as above, which will give us (max length, batch size, vocabulary size) tensors, we will call sample_id which returns tensors of shape (max length, batch size). No need to find argmax ourselves 😉
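To see how the two shapes relate, here is a plain-Python sketch that takes the argmax over the vocabulary dimension by hand. The function name `argmax_ids` and the toy scores are made up for this example; it only illustrates what reading the ids directly saves us from doing.

```python
def argmax_ids(scores):
    """(max length, batch size, vocabulary size) -> (max length, batch size)."""
    return [[max(range(len(vocab_scores)), key=vocab_scores.__getitem__)
             for vocab_scores in step]
            for step in scores]

scores = [
    [[0.1, 0.7, 0.2], [0.9, 0.05, 0.05]],  # time step 0, batch of 2
    [[0.2, 0.2, 0.6], [0.3, 0.4, 0.3]],    # time step 1, batch of 2
]
ids = argmax_ids(scores)
print(ids)  # [[1, 0], [2, 1]]
```

The resulting ids can be mapped straight back to words with the target vocabulary.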

And that’s it. We have finished creating the Seq2Seq model.

Final words

Congratulations, you made it! In this post, we took a look at the Seq2Seq model and hopefully we all understand how it works.

We also used the tf.contrib.seq2seq module to create the Seq2Seq networks, which turned out to be very convenient for implementing both the training and inference phases. We will see more of it in future posts about the Attention mechanism.

And that’s it for today. Thank you guys for reading and I will see you in no time.

Trung Tran is a software developer + AI engineer. He also works on networking & cybersecurity on the side. He loves blogging about new technologies and all posts are from his own experiences and opinions.
