Hello everyone, I hope that you’re all doing well. The weather is great this morning, so let’s sit down and write some code. Today, I want to talk about one of the most challenging (and most fun) tasks in the NLP world: machine translation. We will go through how to implement a neural machine translation model using TensorFlow. Yep, we will use the deep learning approach, which is why it has the word “neural” in its name.
First, let’s talk a little bit about what we’re gonna do. As you know, machine translation is a big task among NLP problems, one that humans have been trying to solve for a while. Some people even think that NLP is all about creating machine translation models, lol. Firstly, the task itself is not easy, considering how many languages there are, each with its own characteristics. And secondly, the task requires a bunch of knowledge in both mathematics and probability.
But things have changed since the boom of deep learning. Applying deep neural networks to machine translation didn’t just improve the models’ accuracy (significantly); it also made the problem more accessible. Now, you don’t have to be a linguist or a math guru. You’re just a tech person who knows enough math and knows deep learning? Welcome to the field!
That being said, although it’s easier than the traditional approach, creating deep learning models for the machine translation task is somewhat tricky. But no worries, we will walk through it step by step together. And because putting everything in one post is impossible, today we will only focus on the hardest part: getting the data ready!
You will get the most out of this post if you have:
- Python3 installed
- TensorFlow installed (the newer the version, the better)
- Experience with word embeddings and RNNs
- Read the Importing Data tutorial by TensorFlow
- Read the Sequence to Sequence Learning paper
No problems with anything above? Let’s go ahead and download the necessary data for today’s post. We will use the Vietnamese-English dataset since its size is suitable for experiments:
- train.vi (Vietnamese sequences)
- train.en (English sequences)
- vocab.vi (Vietnamese vocabulary)
- vocab.en (English vocabulary)
If you have downloaded all the files, you’re ready for the next part.
Now for the implementation details.
There are a lot of ways to get your data ready. In this project, we will use the tf.data API, the approach recommended by TensorFlow.
First, let’s import the necessary packages and some constants to use along the way:
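Something like the snippet below works here. The exact constant names and values (MAX_LENGTH, BATCH_SIZE, the special tokens) are my assumptions for this walkthrough, not a definitive setup:

```python
import tensorflow as tf

# File names from the downloaded dataset
SRC_DATA_FILE = 'train.vi'   # Vietnamese sequences
TGT_DATA_FILE = 'train.en'   # English sequences
SRC_VOCAB_FILE = 'vocab.vi'  # Vietnamese vocabulary
TGT_VOCAB_FILE = 'vocab.en'  # English vocabulary

# Hyperparameters for the input pipeline (assumed values)
MAX_LENGTH = 50   # drop or truncate sequences longer than this
BATCH_SIZE = 64
SOS = '<s>'       # start-of-sequence marker used by the vocab files
EOS = '</s>'      # end-of-sequence marker used by the vocab files
```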
Next, we need to define a method to get the vocabulary information from the text files. We will need that information to interpret the prediction sequences later on:
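One way to write that method is below: the vocabulary files store one token per line, so we read them into a token list plus a token-to-index dictionary. The function name `load_vocab` is my own choice, not the original code:

```python
def load_vocab(vocab_file):
    """Read a vocabulary file (one token per line) and return the
    token list, a token-to-index mapping, and the vocabulary size."""
    words = []
    with open(vocab_file, 'r', encoding='utf-8') as f:
        for line in f:
            words.append(line.strip())
    word_to_id = {word: i for i, word in enumerate(words)}
    return words, word_to_id, len(words)
```

The reverse direction (index to token) is just `words[i]`, which is exactly what we need when interpreting predictions later.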
Now, we will actually use the tf.data API to load those text files.
Since we have two datasets, one for the source language and the other for the target language, we need two separate objects for them:
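A sketch using tf.data.TextLineDataset, which yields one line of a file at a time as a string tensor. The tiny demo files below are only there so the snippet runs on its own; with the real data you would point these at train.vi and train.en:

```python
import tensorflow as tf

# Stand-in files so the snippet is self-contained (assumption: with
# the real dataset you would use train.vi and train.en instead)
with open('demo.vi', 'w', encoding='utf-8') as f:
    f.write('xin chào thế giới\n')
with open('demo.en', 'w', encoding='utf-8') as f:
    f.write('hello world\n')

# One dataset per language; each element is one line as a string tensor
src_dataset = tf.data.TextLineDataset('demo.vi')
tgt_dataset = tf.data.TextLineDataset('demo.en')
```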
Next, we’re gonna create two vocabulary objects to cast string values to integers. And since we are using tf.data, we want everything to be a Tensor object. Tensor objects often come with a debugging nightmare, which is a drawback of tf.data (and maybe the reason why some people prefer a more flexible approach).
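One way to build such a vocabulary object is a lookup table initialized straight from the vocabulary file, where each token maps to its line number. The demo file below stands in for vocab.vi / vocab.en:

```python
import tensorflow as tf

# Demo vocabulary file (assumption: with the real data you would
# point this at vocab.vi or vocab.en)
with open('demo_vocab.txt', 'w', encoding='utf-8') as f:
    f.write('<unk>\n<s>\n</s>\nhello\nworld\n')

# Map each token (whole line) to its line number; unknown tokens go
# to an out-of-vocabulary bucket past the end of the vocabulary
initializer = tf.lookup.TextFileInitializer(
    'demo_vocab.txt',
    key_dtype=tf.string, key_index=tf.lookup.TextFileIndex.WHOLE_LINE,
    value_dtype=tf.int64, value_index=tf.lookup.TextFileIndex.LINE_NUMBER)
vocab_table = tf.lookup.StaticVocabularyTable(initializer, num_oov_buckets=1)
```

You would create one table per language; in a 1.x-era setup the same idea was spelled `tf.contrib.lookup.index_table_from_file`.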
Now we have the vocabulary objects. Let’s use them to get the indices of special cases first: the start-of-sentence and end-of-sentence tokens.
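The lookup is just another table query. Here is a self-contained sketch with a tiny in-memory table standing in for the target-language table (the ids 1 and 2 reflect the usual `<unk>, <s>, </s>, …` ordering of the vocab files, which is an assumption):

```python
import tensorflow as tf

# Tiny in-memory table standing in for the target vocabulary table
keys = tf.constant(['<unk>', '<s>', '</s>', 'hello'])
values = tf.constant([0, 1, 2, 3], dtype=tf.int64)
tgt_vocab_table = tf.lookup.StaticHashTable(
    tf.lookup.KeyValueTensorInitializer(keys, values), default_value=0)

# Indices of the start-of-sentence and end-of-sentence tokens
tgt_sos_id = tf.cast(tgt_vocab_table.lookup(tf.constant('<s>')), tf.int32)
tgt_eos_id = tf.cast(tgt_vocab_table.lookup(tf.constant('</s>')), tf.int32)
```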
We then combine the two datasets into one, so that we can process them all at once:
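This is what tf.data.Dataset.zip is for; after zipping, each element is a (source, target) pair. The inline demo sentences stand in for the TextLineDataset objects:

```python
import tensorflow as tf

# Demo stand-ins for the two TextLineDataset objects
src_dataset = tf.data.Dataset.from_tensor_slices(['xin chào'])
tgt_dataset = tf.data.Dataset.from_tensor_slices(['hello'])

# Each element of the combined dataset is a (source, target) pair
dataset = tf.data.Dataset.zip((src_dataset, tgt_dataset))
```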
Now the dataset is ready to be processed. The first thing to do is to split the sequences into arrays of tokens:
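A sketch of this step: whitespace tokenization with tf.strings.split inside a map() call (demo pairs stand in for the zipped dataset above):

```python
import tensorflow as tf

# Demo stand-in for the zipped (source, target) dataset
dataset = tf.data.Dataset.zip((
    tf.data.Dataset.from_tensor_slices(['xin chào thế giới']),
    tf.data.Dataset.from_tensor_slices(['hello world'])))

# Split every sequence into tokens on whitespace
dataset = dataset.map(
    lambda src, tgt: (tf.strings.split(src), tf.strings.split(tgt)))
```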
Next, we will filter out any sequences with zero elements (any unnecessary line break would result in an empty array):
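The filter() transformation keeps only the elements for which the predicate is true; here the demo data includes one blank pair that should be dropped:

```python
import tensorflow as tf

# Demo stand-in: one normal pair and one blank pair
dataset = tf.data.Dataset.zip((
    tf.data.Dataset.from_tensor_slices(['xin chào', '']),
    tf.data.Dataset.from_tensor_slices(['hello', ''])))
dataset = dataset.map(
    lambda src, tgt: (tf.strings.split(src), tf.strings.split(tgt)))

# Keep only pairs where both sides have at least one token
dataset = dataset.filter(
    lambda src, tgt: tf.logical_and(tf.size(src) > 0, tf.size(tgt) > 0))
```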
We also need to make sure that all sequences don’t exceed the maximum length. Doing this will help ease the learning process a lot:
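One simple way to enforce the maximum length is to truncate every token array with a slice inside map(); the tiny MAX_LENGTH below is just for the demo:

```python
import tensorflow as tf

MAX_LENGTH = 3  # tiny value just for this demo

# Demo stand-in for the tokenized (source, target) dataset
dataset = tf.data.Dataset.from_tensor_slices(
    (['one two three four five'], ['a b c d']))
dataset = dataset.map(
    lambda src, tgt: (tf.strings.split(src), tf.strings.split(tgt)))

# Truncate every sequence to at most MAX_LENGTH tokens
dataset = dataset.map(
    lambda src, tgt: (src[:MAX_LENGTH], tgt[:MAX_LENGTH]))
```

Filtering out over-long pairs instead of truncating them would also work; truncation simply keeps more training examples.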
Remember the vocabulary objects we created above? It’s time to use them to convert all the string tokens into integers:
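The conversion is a lookup on each token array, followed by a cast to int32. A tiny in-memory table stands in for the ones built from vocab.vi / vocab.en:

```python
import tensorflow as tf

# In-memory stand-in for the file-based vocabulary table
keys = tf.constant(['<unk>', '<s>', '</s>', 'hello', 'world'])
values = tf.constant([0, 1, 2, 3, 4], dtype=tf.int64)
vocab_table = tf.lookup.StaticHashTable(
    tf.lookup.KeyValueTensorInitializer(keys, values), default_value=0)

dataset = tf.data.Dataset.from_tensor_slices(['hello world'])
dataset = dataset.map(lambda seq: tf.strings.split(seq))

# Cast every string token to its integer index
dataset = dataset.map(
    lambda seq: tf.cast(vocab_table.lookup(seq), tf.int32))
```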
Before moving on, let’s have a look at the model (we will create this in the next post):
Yeah, the figure above reveals a lot. You can see that there are two networks. But I want you to focus on the inputs and outputs (circled in red). We can clearly see that we need two versions of the target sequence: one with the start-of-sequence token at the beginning and one with the end-of-sequence token at the end.
If you read my previous post on text generation, you may recall that the model won’t stop generating until we tell it to. But in this project, we want some kind of constraint to force the model to learn where the result sequences should end.
Here is how we can create two versions of target sequences:
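A sketch of this step with tf.concat, assuming the special-token ids sos_id = 1 and eos_id = 2 from the earlier lookups:

```python
import tensorflow as tf

# Assumed special-token ids (these come from the vocabulary lookups)
sos_id = tf.constant([1], dtype=tf.int32)
eos_id = tf.constant([2], dtype=tf.int32)

# Demo stand-in for the integer-encoded target dataset
dataset = tf.data.Dataset.from_tensor_slices([[3, 4, 5]])

# target_in  = <s> + tokens  (decoder input)
# target_out = tokens + </s> (decoder label)
dataset = dataset.map(
    lambda tgt: (tf.concat([sos_id, tgt], axis=0),
                 tf.concat([tgt, eos_id], axis=0)))
```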
We are nearly there. When creating the two networks, we will need information about the sequence lengths, so let’s add them to the dataset object too:
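tf.size gives the number of tokens in each sequence, so another map() call can append the lengths to every element (demo pair stands in for the processed dataset):

```python
import tensorflow as tf

# Demo stand-in for the integer-encoded (source, target) dataset
dataset = tf.data.Dataset.from_tensor_slices(([[3, 4, 5]], [[6, 7]]))

# Append the token counts of both sides to each element
dataset = dataset.map(
    lambda src, tgt: (src, tgt, tf.size(src), tf.size(tgt)))
```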
The last processing step is something called zero padding. The reason is that sequences come in different lengths, which prevents us from feeding a minibatch into the model.
Padding sequences with zeros is a simple task, as long as you know how. I’m gonna show you right away:
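The trick is padded_batch: each sequence is padded with zeros up to the longest sequence in its minibatch, while the scalar lengths get an empty padded shape and are simply stacked. A self-contained sketch:

```python
import tensorflow as tf

# Demo stand-in: two integer sequences of different lengths plus
# their token counts
dataset = tf.data.Dataset.from_tensor_slices(['1 2 3', '4'])
dataset = dataset.map(tf.strings.split)
dataset = dataset.map(
    lambda t: (tf.strings.to_number(t, out_type=tf.int32), tf.size(t)))

# Pad the sequences ([None]) but not the scalar lengths ([])
dataset = dataset.padded_batch(2, padded_shapes=([None], []))
```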
And yeah, we don’t need to pad the lengths!
Now, just grab the iterator to loop through the entire dataset and we’re done here:
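With a recent TensorFlow version in eager mode, grabbing the iterator is just iter() and next() (in the 1.x style, the equivalent would be make_initializable_iterator plus get_next()):

```python
import tensorflow as tf

# Demo stand-in for the finished, batched dataset
dataset = tf.data.Dataset.from_tensor_slices([[1, 2], [3, 4]])

# Iterate over the whole dataset, one element at a time
iterator = iter(dataset)
first = next(iterator)
```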
From here, you can test the input pipeline by creating a session. I will leave this task to you as homework.
Today, we have gone through the process of creating an input pipeline for the neural machine translation project. We also learned how to use the tf.data API to do it.
In the next post, you will see how easily the input and the model integrate together, which may change your opinion about tf.data API.
And that’s it for today, guys. Thank you for reading and I will see you in the next post.
This project borrows some code from this repository. I put in the effort to simplify the source code and explain the implementation details.