Neural Machine Translation With Attention Mechanism

Hello guys, spring has come and I guess you’re all feeling good. Today, join me on the journey of creating a neural machine translation model with an attention mechanism, using the hottest-in-the-news TensorFlow 2.0.


Oh wait! I did have a series of blog posts on this topic, not so long ago.

Unfortunately, those posts were made outdated by TensorFlow 2.0. On top of that, I never had a chance to write about the attention mechanism before, so I think those are good reasons to write a new blog post.

Don’t you worry, I won’t make another series of 3 to 4 blog posts this time (it would be dull to do so). Everything will be covered just within this blog post.

With that being said, our objective is pretty simple: we will use a very simple dataset (with only 20 examples) and we will try to overfit the training data with the renowned Seq2Seq model. For the attention mechanism, we’re gonna use Luong attention, which I personally prefer over Bahdanau’s.

At the end of this post, I will also provide source code to deal with actual training data (English – French pairs). The workflow is basically the same, so you can check it out yourselves.

Without talking too much about theory today, let’s jump right into the implementation. As usual, we will go through the steps below:

  • Data Preparation
  • Seq2Seq without Attention
  • Seq2Seq with Luong Attention

Let’s tackle them one by one. The fanciest part is obviously the last one. Feel free to skip to that section if you feel confident.


In order to get the most out of today’s post, I recommend that you have:

  • Tensorflow 2.0 installed (I have a tutorial here)
  • Read Sequence To Sequence Learning paper: here
  • Read Luong Attention paper: here

Seems like we’re all ready. Let’s get started!

Data Preparation

Let’s talk about the data. We’re gonna use 20 English – French pairs (which I extracted from the original dataset). The reasons for using such a small dataset are:

  • Easier to understand how sequences are tokenized
  • Extremely fast to train
  • No challenge in confirming the results even if you don’t speak French

Things will start to make sense shortly. First off, let’s import necessary packages and take a look at the data:

As you can see, the data is a list of tuples, each containing an English sentence and a French sentence.

Next, we will need to clean up the raw data a little bit. This kind of task usually involves normalizing strings, filtering unwanted tokens, adding space before punctuation, etc. Most of the time, what you need are two functions like below:
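Here is a minimal sketch of what such a pair of helpers could look like. The function names `unicode_to_ascii` and `normalize_string` and the exact regular expressions are assumptions, picked to match the cleaning steps described above (strip accents, add a space before `.!?`, drop everything else that is not a letter); they are not necessarily what the repository uses.

```python
import re
import unicodedata


def unicode_to_ascii(s):
    # Decompose accented characters (NFD), then drop the combining marks,
    # so that an accented "e" becomes a plain "e".
    return ''.join(
        c for c in unicodedata.normalize('NFD', s)
        if unicodedata.category(c) != 'Mn')


def normalize_string(s):
    s = unicode_to_ascii(s)
    # Put a space in front of sentence-ending punctuation.
    s = re.sub(r'([!.?])', r' \1', s)
    # Replace any run of characters that are not letters or .!? with a space.
    s = re.sub(r'[^a-zA-Z.!?]+', ' ', s)
    # Collapse repeated whitespace and trim both ends.
    s = re.sub(r'\s+', ' ', s)
    return s.strip()


print(normalize_string('Peu de gens savent ce que cela veut réellement dire.'))
```

Note that apostrophes are turned into spaces by the second substitution, so a contraction such as "qu'il" ends up as two tokens.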

We will now split the data into two separate lists, one per language. Then we will apply the functions above and add two special tokens: <start> and <end>:
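A small sketch of this step, using two sentence pairs that are assumed to be cleaned already. The variable names `data_en`, `data_fr_in` and `data_fr_out` are illustrative assumptions. The source sentences get no special tokens, the decoder input gets `<start>` in front, and the decoder target gets `<end>` at the end.

```python
# Two already-cleaned pairs, used here purely as illustrative data.
raw_data = [
    ('Few people know the true meaning .',
     'Peu de gens savent ce que cela veut reellement dire .'),
    ('Both Tom and Mary work as models .',
     'Tom et Mary travaillent tous les deux comme mannequins .'),
]

# Split the list of pairs into one list per language.
raw_data_en, raw_data_fr = (list(x) for x in zip(*raw_data))

# The encoder sees the source sentences unchanged.
data_en = raw_data_en
# Decoder input starts with <start>; decoder target ends with <end>.
data_fr_in = ['<start> ' + s for s in raw_data_fr]
data_fr_out = [s + ' <end>' for s in raw_data_fr]

print(data_fr_in[0])
print(data_fr_out[0])
```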

I need to elaborate a little bit here. First off, let’s take a look at the figure below:

Figure 1: Input and Output

The Seq2Seq model consists of two networks: Encoder and Decoder. The encoder, which is on the left-hand side, requires only sequences from source language as inputs.

The decoder, on the other hand, requires two versions of destination language’s sequences, one for inputs and one for targets (loss computation). The decoder itself is usually called a language model (we used it a lot for text generation, remember?).

From personal experiments, I also found that it would be better not to add <start> and <end> tokens to source sequences. Doing so would confuse the model, especially the attention mechanism later on, since all sequences start with the same token.

Next, let’s see how to tokenize the data, i.e. convert the raw strings into integer sequences. We’re gonna use the text tokenization utility class from Keras:

Pay attention to the filters argument. By default, Keras’ Tokenizer strips out all punctuation, which is not what we want. Since we have already filtered out punctuation ourselves (except for .!?), we can just set filters to an empty string here.

The crucial part of tokenization is vocabulary. Keras’ Tokenizer class comes with a few methods for that. Since our data contains raw strings, we will use the one called fit_on_texts.

The tokenizer will create its own vocabulary as well as conversion dictionaries. Take a look:

We can now have the raw English sentences converted to integer sequences:

Last but not least, we need to pad the sequences with zeros so that they all have the same length. Otherwise, we won’t be able to create the dataset object later on.
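To make the three steps concrete (build a vocabulary, convert words to integers, pad with zeros), here is a plain-Python stand-in. It is not the Keras Tokenizer itself, only a sketch of the same idea under a few assumptions: words are split on whitespace and lower-cased, the most frequent word gets index 1, and index 0 is reserved for padding. The helper names mirror the Keras method names for readability; `pad_post` is a made-up name.

```python
from collections import Counter


def fit_on_texts(texts):
    # Count whitespace-separated, lower-cased words; no punctuation filtering.
    counts = Counter(w for t in texts for w in t.lower().split())
    # Most frequent word gets index 1; ties keep first-seen order.
    ordered = sorted(counts, key=lambda w: -counts[w])
    return {w: i for i, w in enumerate(ordered, start=1)}


def texts_to_sequences(texts, word_index):
    # Words missing from the vocabulary are skipped.
    return [[word_index[w] for w in t.lower().split() if w in word_index]
            for t in texts]


def pad_post(sequences):
    # Append zeros so every sequence is as long as the longest one.
    max_len = max(len(s) for s in sequences)
    return [s + [0] * (max_len - len(s)) for s in sequences]


texts = ['Few people know the true meaning .',
         'Both Tom and Mary work as models .']
word_index = fit_on_texts(texts)
padded = pad_post(texts_to_sequences(texts, word_index))
print(word_index)
print(padded)
```

The period is the only word that appears twice, so it ends up with index 1, and the shorter first sentence receives one trailing zero.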

Let’s check if everything is okay:

Everything is perfect. Go ahead and do exactly the same with French sentences:

One note along the way: we can call fit_on_texts multiple times on different corpora and it will update the vocabulary automatically. Always remember to finish all fit_on_texts calls before using texts_to_sequences.

The last step is easy: we only need to create the dataset instance.

And that’s it. We are done preparing the data!

Seq2Seq model without Attention

By now, we probably know that attention mechanism is the new standard in machine translation tasks. But I think there are good reasons to create the vanilla Seq2Seq first:

  • Pretty simple and easy with tf.keras
  • No headache to debug when things go wrong
  • Being able to answer: why do we need attention at all?

Okay, let’s assume that you are all convinced. We will start off with the encoder. Inside the encoder, there is an embedding layer and an RNN layer (which can be a vanilla RNN, an LSTM or a GRU). At every forward pass, it takes in a batch of sequences and initial states and returns output sequences as well as final states:

And here is how the data’s shape changes at each layer. I find that keeping track of the data’s shape is extremely helpful for avoiding silly mistakes, just like stacking up Lego pieces:

Figure 2: Encoder’s data shapes
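To follow the shapes without any framework, here is a NumPy sketch of an encoder built from an embedding lookup and a vanilla RNN. It is an illustration under assumptions, not the tf.keras model from the repository: the function name `encode`, the weight names and all the sizes are made up, and a plain tanh RNN stands in for the LSTM.

```python
import numpy as np


def encode(sequences, embedding, w_x, w_h, b, initial_state):
    # sequences: (batch, seq_len) integer token ids
    embedded = embedding[sequences]          # (batch, seq_len, embedding_size)
    state = initial_state                    # (batch, rnn_size)
    outputs = []
    for t in range(embedded.shape[1]):
        # Vanilla RNN update: h_t = tanh(x_t W_x + h_{t-1} W_h + b)
        state = np.tanh(embedded[:, t] @ w_x + state @ w_h + b)
        outputs.append(state)
    # outputs: (batch, seq_len, rnn_size), final state: (batch, rnn_size)
    return np.stack(outputs, axis=1), state


rng = np.random.default_rng(0)
vocab_size, embedding_size, rnn_size = 15, 8, 16
embedding = rng.normal(size=(vocab_size, embedding_size))
w_x = rng.normal(size=(embedding_size, rnn_size)) * 0.1
w_h = rng.normal(size=(rnn_size, rnn_size)) * 0.1
b = np.zeros(rnn_size)

batch = np.array([[2, 3, 4, 5, 6, 7, 1, 0],
                  [8, 9, 10, 11, 12, 13, 14, 1]])
outputs, final_state = encode(batch, embedding, w_x, w_h, b,
                              np.zeros((2, rnn_size)))
print(outputs.shape, final_state.shape)
```

The final state is simply the output at the last time step, which is what gets handed over to the decoder.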

We are done with the encoder. Next, let’s create the decoder. Without the attention mechanism, the decoder is basically the same as the encoder, except that it has a Dense layer to map the RNN’s outputs into vocabulary space:

Similarly, here’s the data’s shape at each layer:

Figure 3: Decoder’s data shapes

As you might have noticed in Figure 1, the final states of the encoder will act as the initial states of the decoder. That’s the difference between a language model and a decoder of Seq2Seq model.

And that is the decoder we need to create. Before moving on, let’s check that we didn’t make any mistakes along the way:

Great! Everything is working as expected. The next thing to do is to define a loss function. Since we padded zeros into the sequences, let’s not take those zeros into account when computing the loss:
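The masking idea can be sketched in NumPy as follows. This is an illustration, not the tf.keras loss used in the repository: the function name `masked_loss` is made up, and averaging over the non-padded tokens is a choice made here for clarity, so the absolute value may differ from what a framework loss with sample weights reports.

```python
import numpy as np


def masked_loss(targets, logits):
    # targets: (batch, seq_len) integer ids, where 0 marks padding
    # logits:  (batch, seq_len, vocab_size) unnormalized scores
    shifted = logits - logits.max(axis=-1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    # Negative log-likelihood of the correct token at every position.
    token_nll = -np.take_along_axis(
        log_probs, targets[..., None], axis=-1).squeeze(-1)
    # 1.0 for real tokens, 0.0 for padding.
    mask = (targets != 0).astype(token_nll.dtype)
    # Average over real tokens only.
    return (token_nll * mask).sum() / max(mask.sum(), 1.0)


targets = np.array([[3, 1, 0, 0]])
logits = np.zeros((1, 4, 5))
print(masked_loss(targets, logits))
```

Whatever the model predicts at the two padded positions has no effect on the result, which is exactly the point of the mask.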

What else do we need? Right, we haven’t created an optimizer yet!

Now we’re ready to create the training function in which we perform a forward pass followed by a backward pass. There are two things to remember:

  • We use the @tf.function decorator to take advantage of static graph computation (remove it when you want to debug)
  • Network’s computations need to be put under tf.GradientTape() to keep track of gradients

Before creating the training loop, let’s define a method for inference. What it does is basically a forward pass, but instead of target sequences, we feed in the <start> token. Each subsequent time step takes the output of the previous time step as input, until we hit the <end> token or the output sequence has exceeded a specific length:
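The control flow of that loop can be sketched independently of the model. In the sketch below, `step_fn` is a hypothetical stand-in for one decoder forward pass followed by an argmax, and the token ids (1 for `<start>`, 2 for `<end>`) are made up for the example.

```python
def greedy_decode(step_fn, start_id, end_id, max_len=20):
    # step_fn(token, state) -> (next_token, new_state) stands in for one
    # decoder forward pass followed by an argmax over the vocabulary.
    token, state, output = start_id, None, []
    while len(output) < max_len:
        token, state = step_fn(token, state)
        if token == end_id:
            break
        output.append(token)
    return output


# Toy "decoder": a fixed lookup table from the current token to the next one.
# Ids are hypothetical: 1 = <start>, 2 = <end>.
next_token = {1: 5, 5: 7, 7: 2}
print(greedy_decode(lambda tok, st: (next_token[tok], st), 1, 2))
```

The `max_len` guard matters in practice: an undertrained model may never emit `<end>`, and without the limit the loop would not terminate.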

And finally, here comes the training loop. At every epoch, we will grab batches of data for training. We also print out the loss value and see how the model performs at the end of each epoch:

Let’s monitor the training process. It should take less than 5 minutes on a machine with a GPU, so if something is wrong, you will know right away.

At first, the translation results didn’t make any sense at all. But gradually, the model learned to make more meaningful phrases. Finally, by the 250th epoch, the model had completely memorized all 20 sentences. Below is my result:

Obviously, we can confirm that the model can actually learn to translate from a small dataset. I also trained the same model (with some modifications on hyper-parameters) using the full English-French dataset. You can tell from the result that the model’s translation is quite acceptable, right?

Links to source files will be provided at the end of this post so you can play with different settings by yourselves.

And guys, we have finished our first mission! We have successfully created a fully functional Seq2Seq model without attention mechanism, yet.

In the next section, we will see that with just a few modifications, we can immediately upgrade our current model with Luong attention.

Seq2Seq model with Luong attention

Now, let’s talk about attention mechanism. What is it and why do we need it?

Things are pretty difficult to explain (especially when it comes to deep learning) if we only look at mathematical equations. So, let’s change our perspective and consider the machine translation model as a learner who is trying to learn a foreign language.

Speaking of learning a new language, personally, I think the two below are the most common problems we all had to deal with:

  • Difficulty remembering and processing long, complicated context
  • Struggling with differences in syntax from your mother tongue

And guess what? Machine translation models face the same problems too. I’ll give you an example. Below I have a sentence in English:

I just want to have a sister.

And here is the French version:

Je veux juste avoir une soeur.

Let’s see how those fit in the Seq2Seq model. We will see the problems very soon:

Figure 4: Seq2Seq disadvantages

The first thing to notice is that the encoder’s state is only passed to the first node of the decoder. For that reason, the information from the encoder becomes less and less relevant with every time step.

The second problem is word order: the phrase just want in the English sentence corresponds to veux juste in French (just = juste and want = veux), so the two words appear in reverse order. That gives the decoder a tough time working things out.

So, how can we possibly solve those problems? Ideally, we want all time steps within the decoder to have access to the encoder’s output. That way, the decoder would be able to learn to focus partially on the encoder’s output and produce more accurate translations.

That was the idea behind Attention Mechanisms! What I just said can be illustrated as follows:

Figure 5: Attention Mechanism

So now we know how attention mechanisms work and why we need one. Without wasting another second, let’s go ahead and implement the Luong-style attention mechanism.

Technically, there are two terms we need to know in advance: the alignment vector and the context vector.

  • The alignment vector

The alignment vector is a vector that has the same length as the source sequence and is computed at every time step of the decoder. Each of its values is the score (or the probability) of the corresponding word within the source sequence:

Figure 6: Alignment vector

What alignment vectors do is put weights onto the encoder’s output; intuitively, they tell the decoder what to focus on at each time step.

  • The context vector

The context vector is what we use to compute the final output of the decoder. It is the weighted average of the encoder’s output. You can see that we can get the context vector by computing the dot product of the alignment vector and the encoder’s output:

Figure 7: Context vector

That’s all about the secret of attention mechanisms. Next, let’s see how we can create one in Python. Let’s take a look at the equations to know exactly what we need to do. Here is how we’re gonna compute the alignment vector:

Equation 1: Equation for alignment vector

Luong et al. proposed three types of score function: dot, general and concat:

Equation 2: Score functions
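For reference, here are Equations 1 and 2 written out as they are given in the Luong et al. paper, where h_t is the current decoder hidden state and \bar{h}_s is the encoder output at source position s. The alignment vector is a softmax over the scores of all source positions, and the three score functions differ only in how h_t and \bar{h}_s are compared:

```latex
% Equation 1: alignment vector
a_t(s) = \frac{\exp\big(\mathrm{score}(h_t, \bar{h}_s)\big)}
              {\sum_{s'} \exp\big(\mathrm{score}(h_t, \bar{h}_{s'})\big)}

% Equation 2: score functions
\mathrm{score}(h_t, \bar{h}_s) =
\begin{cases}
  h_t^\top \bar{h}_s                             & \text{dot} \\
  h_t^\top W_a \bar{h}_s                         & \text{general} \\
  v_a^\top \tanh\big(W_a [h_t ; \bar{h}_s]\big)  & \text{concat}
\end{cases}
```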

Since I’m not going to talk about Bahdanau-style attention, here are the key differences between the two:

  • Bahdanau attention mechanism proposed only the concat score function
  • Luong-style attention uses the current decoder output to compute the alignment vector, whereas Bahdanau’s uses the output of the previous time step

Okay, we are now clear. Let’s code. For demonstration purposes, I will only show the code for the general score function. As you can see in the equation above, we need to multiply the encoder’s output by a matrix called Wa. Which layer can do a matrix multiplication? The Dense layer:

Next, we will implement the forward pass. Note that we have to pass in the encoder’s output this time around. The first thing to do is to compute the score. It’s the dot product of the current decoder’s output and the output of the Dense layer.

We can then compute the alignment vector by simply applying softmax function:

Finally, we will compute the context vector. It’s the weighted average of the encoder’s output, which is a different way of saying the dot product of the alignment vector and the encoder’s output:
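Putting the three steps together (score, alignment vector, context vector), here is a NumPy sketch of the general score variant. It is an illustration under assumptions rather than the tf.keras layer from the repository: the function names are made up, the matrix `wa` plays the role of the Dense layer's weights, and the bias that a Dense layer would add by default is left out.

```python
import numpy as np


def softmax(x, axis=-1):
    # Subtract the max for numerical stability before exponentiating.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)


def luong_general_attention(decoder_output, encoder_output, wa):
    # decoder_output: (batch, 1, rnn_size), one decoder time step
    # encoder_output: (batch, src_len, rnn_size)
    projected = encoder_output @ wa                          # (batch, src_len, rnn_size)
    # Score: dot product of the decoder output with each projected source step.
    score = decoder_output @ projected.transpose(0, 2, 1)    # (batch, 1, src_len)
    # Alignment vector: softmax over the source positions.
    alignment = softmax(score, axis=-1)                      # (batch, 1, src_len)
    # Context vector: weighted average of the encoder's output.
    context = alignment @ encoder_output                     # (batch, 1, rnn_size)
    return context, alignment


rng = np.random.default_rng(0)
dec = rng.normal(size=(2, 1, 4))
enc = rng.normal(size=(2, 3, 4))
wa = rng.normal(size=(4, 4))
context, alignment = luong_general_attention(dec, enc, wa)
print(context.shape, alignment.shape)
```

Each row of the alignment vector sums to one, so the context vector always lies inside the range spanned by the encoder's outputs.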

And we have finished the implementation of Luong-style attention. Let’s test that out.

So far so good. Next, we will have to make a few more changes in order to use the attention mechanism above. Let’s start with the decoder.

From above, we have obtained the context and alignment vectors. The alignment vector has nothing to do with the decoder (we will need it for a fancy visualization later on). Okay, let’s see how we’re gonna use the context vector:

Equation 3: Decoder output

So, let’s interpret those equations, shall we? At each time step t, we will concatenate the context vector and the current output (of the RNN unit) to form a new output vector. We then continue as normal: convert that vector to vocabulary space for the final output.
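Written out as in the Luong et al. paper, the two steps are as follows, where c_t is the context vector, h_t is the RNN output at time step t, and [c_t ; h_t] denotes concatenation:

```latex
% Equation 3: decoder output
\tilde{h}_t = \tanh\big(W_c [c_t ; h_t]\big)

p(y_t \mid y_{<t}, x) = \mathrm{softmax}\big(W_s \tilde{h}_t\big)
```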

In order to apply those changes, first off, we need to create an attention object when creating the decoder:

Next, we need to define two Dense layers. As we have seen in Equation 3 above, there are two matrices, called Wc and Ws respectively. Remember that the first Dense layer uses the tanh activation function.

We are done with the initialization. Next, let’s apply those changes to the forward pass, i.e. the call method. Since we now run the attention mechanism at every time step, remember that the input to the decoder is now a batch of one-word sequences.

We will begin with computing the embedded vector and get the outputs from the RNN unit. Notice that we do need to add the encoder’s output to the arguments:

We need some attention now. Let’s use the decoder’s output and the encoder’s output to get the context and the alignment vectors:

Once we have the context vector, it’s time to do exactly what is written in Equation 3. We will combine the context vector and the RNN output, then pass the combined vector through the two Dense layers:
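As a shape check for this step, here is a NumPy sketch. The function name `attentional_output` and all sizes are made up, the matrices `wc` and `ws` stand in for the two Dense layers, and biases are omitted, so treat it as an illustration of the computation rather than the actual decoder code.

```python
import numpy as np


def attentional_output(context, rnn_output, wc, ws):
    # context, rnn_output: (batch, 1, rnn_size)
    combined = np.concatenate([context, rnn_output], axis=-1)  # (batch, 1, 2 * rnn_size)
    # First layer (Wc) with tanh gives the attentional hidden state.
    attentional = np.tanh(combined @ wc)                       # (batch, 1, rnn_size)
    # Second layer (Ws) maps it to vocabulary space.
    return attentional @ ws                                    # (batch, 1, vocab_size)


rng = np.random.default_rng(0)
rnn_size, vocab_size = 4, 10
wc = rng.normal(size=(2 * rnn_size, rnn_size))
ws = rng.normal(size=(rnn_size, vocab_size))
context = rng.normal(size=(2, 1, rnn_size))
rnn_output = rng.normal(size=(2, 1, rnn_size))
logits = attentional_output(context, rnn_output, wc, ws)
print(logits.shape)
```

The result holds one row of unnormalized scores per batch element; applying a softmax over the last axis would turn them into the word probabilities of Equation 3.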

Okay, we are done with all the big changes. Next, let’s modify the train_step function. Since we are processing one time step at a time on the decoder’s side, we need to explicitly create a loop for that:

Let’s do the same to the predict function. We also need to get the source sequence, the translated sequence and the alignment vector for visualization purposes:

And that’s it. We have finished the implementation of the Luong-style attention. Let’s start the training!

Okay, after a night, my model has finished the training. It’s time to check it out. We want to know whether it has improved after being equipped with an attention mechanism.

So, to make it easy for the eyes, I decided to take the 20 examples that we used at the beginning and compare the translations made by the two models. And the result is as follows (the order is source sentence -> target sentence -> Seq2Seq -> Seq2Seq with Luong):

Just by looking at them, I can tell that the Seq2Seq model with Luong attention made better translations than the vanilla Seq2Seq. Although it’s beyond the scope of this blog post, you may want to compute the BLEU score for a more objective evaluation.

Anyway, what is fun about using attention mechanisms is that we can visualize where the model is paying attention when making translations. Let’s take a look at the GIF below. Can you see where my model is looking? 😉

Figure 8: Attention Heatmap

Here I have a little problem, though. When saving the figures to file, some of them didn’t display the labels correctly. I tried some solutions suggested on StackOverflow but still had no luck. I would appreciate it if you could tell me how to fix that.

You can find code to create those heat maps and convert to a GIF inside the source on my repository. Feel free to experiment on your own.

Final words

Phew! That’s it! We finally made it guys. How persistent you are to follow my long blog post till the end. I really appreciate that.

Let’s look back to see what we have accomplished today:

  • We implemented a Sequence-to-Sequence model from scratch with Tensorflow 2.0
  • We also know how attention mechanisms work and implemented Luong-style attention

That is a tremendous amount of work. What we have done so far will help build a strong foundation and can serve as a baseline for your next machine translation/chatbot projects. Keep up the good work and keep running new experiments.

And as usual, you can find all the source code to reproduce the results above on my NLP repository:

  • Seq2Seq on 20 examples: link
  • The full English-French pairs: link (filename:
  • Seq2Seq on English-French pairs: link
  • Seq2Seq + Luong attention on English-French pairs: link

That’s it for today everyone. Thank you again for your time. And I will see you in the next project.


  • Effective Approaches to Attention-based Neural Machine Translation paper (Luong attention): link
  • Tensorflow Neural Machine Translation with (Bahdanau) Attention tutorial: link
  • Luong’s Neural Machine Translation repository: link

Trung Tran is a software developer + AI engineer. He also works on networking & cybersecurity on the side. He loves blogging about new technologies and all posts are from his own experiences and opinions.

45 comments On Neural Machine Translation With Attention Mechanism

  • Very nice work! The idea to use small set of sentences is brilliant. Because I would be worrying that it would simply overfit, but it has so much learning and testing value.

    I updated the predict function like this:
    def predict(test_source_text=None):
    if test_source_text is None:
    r = np.random.choice(len(raw_data_en))
    test_source_text = raw_data_en[r]
    test_target_text = raw_data_fr[r]
    test_target_text = None

    if test_target_text is not None:

    test_source_seq = en_tokenizer.texts_to_sequences([test_source_text])

    so that the output is even better to look at:
    Epoch 1 Loss 0.0000
    je n arrive pas a croire que vous abandonniez .

    Epoch 2 Loss 0.0000
    me donnez vous une autre chance ?

    Epoch 3 Loss 0.0000
    vous avez besoin de faire cela tous les trois .

    Epoch 4 Loss 0.0000
    cette annee avez vous plante des citrouilles ?

    Epoch 5 Loss 0.0000
    votre idee n est pas completement .

    Epoch 6 Loss 0.0000
    ne vous laissez pas abuser par les apparences .

    Epoch 7 Loss 0.0000
    comment savez vous qu il ne s agit pas d un piege ?

  • My symbols screwed up the html: Here is the better picture:

    Epoch 66 Loss 0.0000
    src -> Few people know the true meaning .
    tar -> Peu de gens savent ce que cela veut réellement dire.
    trn -> peu de gens savent ce que cela veut reellement dire .

    Epoch 67 Loss 0.0000
    src -> Both Tom and Mary work as models .
    tar -> Tom et Mary travaillent tous les deux comme mannequins.
    trn -> tom et mary travaillent tous les deux comme mannequins .

  • I have made a colab notebook version of the simple version for anyone who might want to try on Google free GPU:–1wg48VdPzhW

  • Hi Trung,

    I am so happy to see you have also made the toy translator for Transformer working.

    I have been struggling on that for some time since it has many moving parts. Using your approach of ‘overfitting’ on small dataset to workout the
    code is a brilliant thing. Let me thank you for that approach.

    The transformer seems to be a breakthrough, even in this case we can see the speed and accuracy!

    I am sure you will apply the transformer on full eng-fre dataset. I will also try and share my experience. Good luck to you,

    I think if we use sentencepiece tokenization on full eng-fre that limits the vocabulary and results will be very good.
    I will try that.

    Thanks again for the sharing heart, and providing good description.

    • Hi Ravi,

      Thank you for following my humble work.

      As you noticed in my repo, I’ve been working on the Transformer. Actually, the optimized code is ready and on the final test at my local machine. Working on the blog post at the moment and I hope it can be released within this week.

      The final code will be pushed when ready too.

  • Hello Trung,
    I really appreciate your great and easily understandable code!

    I am trying to run attention model that you did.

    But i faced an error ‘SystemError: returned a result with an error set’.

    I’ve checked all shapes for each part(Encoder, Attention, Decoder).

    What i found is, in train_step function, gradients is not calculated correctly.

    Have you ever met this error during your work?

    Sorry for somewhat easy question.

    • Hi Kyu,

      I haven’t seen that error before. Can you please open an issue and provide the full console’s log? I will have a look and figure it out.


  • Hi Trung,

    How do you gauge over/underfitting for this sort of model? From my understanding, since multiple translations of one input can be correct, plotting validation loss vs epochs wouldn’t make sense, is that correct?

    • Hi Kyle,

      Monitoring training/validation loss provides a fast way to assure that the model is actually learning (as the loss gradually decreases over time). But it might not be as useful when training on large datasets. We need a metric that can judge the translation quality and at the moment, BLEU score is probably the most used metric for machine translation task.

      You can search for a detailed explanation on BLEU score. Here is one that came out recently: Understanding MT Quality.

      Trung Tran

  • Hi Trung,
    Thanks for an amazing post. I tried out your example for sentence to sentence machine translation which is fantastic. However, I wanted to use a Seq2Seq model for the purpose of translating input features (numeric) into sentences. Suppose, I have a dataframe, with 100 rows and 10 columns, if I want to use columns 1-9 (numeric) as features to predict/generate the sentences (which are corresponding to each row in the 10th column), I am not sure on how the model would need the data and its form. The confusion is about the English sentences (which in my case would be numeric features), and the French sentences would just be normal sentences in my case. Suppose my dataframe features are in the dataframe called df_features, which I have converted into a Numpy array looking like this

    array([[0.87596899, 0.98816347, 0.98803866, …, 0.87973607, 0.09400701,
    [0.87596899, 0.97709857, 0.97719679, …, 0.88528478, 0.09529423,

    I guess adapting to your example:-
    raw_data_en = df_features # this would be of dimensions 100×9
    # Now the problem would be whether to use
    raw_data_en = raw_data_en.ravel() # to convert this to a vector?

    Can you kindly share some insight for this case? Also,whether tokenizer and texts_to_sequences are at all required, because in this case the input data is itself in a sequence form.-

    • Hi,
      I was buried at work so I’m sorry for the late reply.
      As far as I know, most deep-learning approaches to NLP today require input data to be in discrete space, i.e. integer values or the embedding layer won’t work. If the values above are mapped to English words in a one-to-one manner, then I suggest two possible approaches:

      • Remove the embedding layer on Encoder’s side
      • Create a new mapping to convert float numbers above to integer

      Hope this helps. If you are still unclear, feel free to reply to this thread.
      Trung Tran

      • Hi,
        Thanks for the reply. At the moment, what I did based was to directly use a sequence of numbers like this: raw_data_en = [[1,2,3,4],[5,6,7,8],[0.9,0.1,4,6],[0.3,0.2,1,7],[1,5,6,0.2]], instead of the English sentences themselves. Based on your model and some suggestions from Ravi’s Colab notebook, I eliminated tokenising for the English sentences. Though, the model learns as the loss decreases, I am still confused regarding the role of the Embedding layer at the Encoder side. If you can have a look at my Colab Notebook on and give your suggestions, it would be greatly appreciated. Also, as you suggest removing the embedding layer on Encoder’s side, I tried the following changes:-

        encoder = Encoder(en_vocab_size, LSTM_SIZE) #Removed EMBEDDING_SIZE argument from the call
        class Encoder(tf.keras.Model):
            def __init__(self, vocab_size, lstm_size):
                super(Encoder, self).__init__()
                self.lstm_size = lstm_size
                #self.embedding = tf.keras.layers.Embedding(vocab_size, embedding_size) #Removed Embedding layer application here
                self.lstm = tf.keras.layers.LSTM(
                    lstm_size, return_sequences=True, return_state=True)

            def call(self, sequence, states):
                embed = self.embedding(sequence)
                output, state_h, state_c = self.lstm(embed, initial_state=states)

                return output, state_h, state_c

            def init_states(self, batch_size):
                return (tf.zeros([batch_size, self.lstm_size]),
                        tf.zeros([batch_size, self.lstm_size]))

        But am confused regarding the call() method in the encoder side. Can you give a small example in code of how the encoder would look like without an embedding layer based on the above? Also, is it necessary to create a new mapping to convert float numbers to integers even after removing the embedding layer, as I am not tokenising the input data. Thanks

  • Hi,
    The link to Neural Machine Translation with Bahdanau Attention at the end of your post seems to be broken, just so you know. Also, can you kindly tell me whether switching this script from Luong attention to Bahdanau requires substantial changes, or whether it is simply a matter of invoking a Bahdanau-based class from TensorFlow? Thanks!

  • Hi,

    Is the transformer implementation for this task complete? Eagerly waiting for the blog post.

  • Hi Trung,

    Excellent post! Really helpful to dive deeper into TF2 and seq2seq models.

    One thing that confused me a bit is the initial states of the Encoder ( def init_states(self, batch_size): above). This is always zeroed but, in fact, the LSTM implementation in TF/Keras creates its own zero initial states automatically, so one can completely get rid of this method and its initialization later in the code. (I ran both versions on the Colab provided here in the discussion and with fixed random seeds, it gives the very same results).

    It’s a detail but it was confusing me at the beginning before figuring that out (and TF documentation is not really clear about it, see )

    • Hi Ivan,

      When the LSTM cell is created, its states are always initialized with zeros . After a training batch is finished, the states need to be manually reset to zeros again, otherwise their current values will harm the next batch (I don’t think they get reset to zeros automatically).

  • Giuseppe Cannizzaro

    Hi Trung! Great job with this post! I have a little problem, (I’m using a custom dataset which has as input sentences with like dozens of words, and few as output) I get this error during the execution of the training loop, it appears after 2000 iterations, so it’s pretty weird:

    WARNING:tensorflow:Entity <bound method of > could not be transformed and will be executed as-is. Please report this to the AutoGraph team. When filing the bug, set the verbosity to 10 (on Linux, `export AUTOGRAPH_VERBOSITY=10`) and attach the full output. Cause: Bad argument number for Name: 3, expecting 4
    WARNING: Entity <bound method of > could not be transformed and will be executed as-is. Please report this to the AutoGraph team. When filing the bug, set the verbosity to 10 (on Linux, `export AUTOGRAPH_VERBOSITY=10`) and attach the full output. Cause: Bad argument number for Name: 3, expecting 4
    ValueError Traceback (most recent call last)
    in ()
    7 for batch, (source_seq, target_seq_in, target_seq_out) in enumerate(dataset.take(-1)):
    8 loss = train_step(source_seq, target_seq_in,
    —-> 9 target_seq_out, en_initial_states)
    10 if(i%100 == 0):
    11 print(i)

    6 frames
    /tensorflow-2.0.0-rc0/python3.6/tensorflow_core/python/framework/ in wrapper(*args, **kwargs)
    903 except Exception as e: # pylint:disable=broad-except
    904 if hasattr(e, “ag_error_metadata”):
    –> 905 raise e.ag_error_metadata.to_exception(e)
    906 else:
    907 raise

    ValueError: in converted code:

    :4 train_step
    en_outputs = encoder(source_seq, en_initial_states)
    :11 call
    output, state_h, state_c = self.lstm(embed, initial_state=states)
    /tensorflow-2.0.0-rc0/python3.6/tensorflow_core/python/keras/layers/ __call__
    return super(RNN, self).__call__(inputs, **kwargs)
    /tensorflow-2.0.0-rc0/python3.6/tensorflow_core/python/keras/engine/ __call__
    outputs = call_fn(cast_inputs, *args, **kwargs)
    /tensorflow-2.0.0-rc0/python3.6/tensorflow_core/python/keras/layers/ call
    runtime) = lstm_with_backend_selection(**normal_lstm_kwargs)
    /tensorflow-2.0.0-rc0/python3.6/tensorflow_core/python/keras/layers/ lstm_with_backend_selection
    /tensorflow-2.0.0-rc0/python3.6/tensorflow_core/python/eager/ __call__
    graph_function, args, kwargs = self._maybe_define_function(args, kwargs)
    /tensorflow-2.0.0-rc0/python3.6/tensorflow_core/python/eager/ _maybe_define_function
    graph_function = self._create_graph_function(args, kwargs)
    /tensorflow-2.0.0-rc0/python3.6/tensorflow_core/python/eager/ _create_graph_function
    /tensorflow-2.0.0-rc0/python3.6/tensorflow_core/python/framework/ func_graph_from_py_func
    func_outputs = python_func(*func_args, **func_kwargs)
    /tensorflow-2.0.0-rc0/python3.6/tensorflow_core/python/keras/layers/ standard_lstm
    /tensorflow-2.0.0-rc0/python3.6/tensorflow_core/python/keras/ rnn
    input_time_zero, tuple(initial_states) + tuple(constants))
    /tensorflow-2.0.0-rc0/python3.6/tensorflow_core/python/keras/layers/ step
    z +=, recurrent_kernel)
    /tensorflow-2.0.0-rc0/python3.6/tensorflow_core/python/ops/ binary_op_wrapper
    return func(x, y, name=name)
    /tensorflow-2.0.0-rc0/python3.6/tensorflow_core/python/ops/ _add_dispatch
    return gen_math_ops.add_v2(x, y, name=name)
    /tensorflow-2.0.0-rc0/python3.6/tensorflow_core/python/ops/ add_v2
    “AddV2”, x=x, y=y, name=name)
    /tensorflow-2.0.0-rc0/python3.6/tensorflow_core/python/framework/ _apply_op_helper
    /tensorflow-2.0.0-rc0/python3.6/tensorflow_core/python/framework/ create_op
    /tensorflow-2.0.0-rc0/python3.6/tensorflow_core/python/framework/ _create_op_internal
    /tensorflow-2.0.0-rc0/python3.6/tensorflow_core/python/framework/ __init__
    /tensorflow-2.0.0-rc0/python3.6/tensorflow_core/python/framework/ _create_c_op
    raise ValueError(str(e))

    ValueError: Dimensions must be equal, but are 2 and 5 for ‘add’ (op: ‘AddV2’) with input shapes: [2,256], [5,256].

  • Giuseppe Cannizzaro

    Hi Trung!
    I saved the weights of Encoder and Decoder to load them in another notebook but it doesn’t work, is there any particular procedure I need to do to save this kind of model?

  • Sir, if the language is Vietnamese, how should the preprocessing be done?

    • Hi Thọ,

      These preprocessing steps are not tied to any particular language. Whether it is Spanish or Vietnamese, you do it the same way.

  • Giuseppe Cannizzaro

    Hi Trung! I was wondering, what if I want to add another LSTM layer (after the one you have already inserted) both in the Encoder and Decoder? How should I modify the code?

  • Hello Trung Tran! Thank you very much for this post. I have seen that many tutorials about LSTM (including yours) include almost no evaluation method for the test data. I understand that with a large dataset, monitoring the validation loss is not necessarily a relevant metric, but don’t you think that calculating the LOSS over all TRAINING data and then over all TEST data would be, at least, a reasonable way to measure over/underfitting?
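
The comparison proposed in this comment can be sketched in a few lines. This is only a minimal illustration in plain Python, not code from the post: `mean_loss`, `cross_entropy`, and the toy batches are hypothetical names and made-up numbers, and a real run would feed the model's predicted probabilities instead.

```python
import math

def cross_entropy(probs, target_index):
    """Negative log-likelihood of the correct class for one example."""
    return -math.log(probs[target_index])

def mean_loss(batches, loss_fn):
    """Average the per-example loss over every batch in a dataset."""
    total, count = 0.0, 0
    for predictions, targets in batches:
        for probs, target in zip(predictions, targets):
            total += loss_fn(probs, target)
            count += 1
    return total / count

# Toy (made-up) predicted distributions and target indices.
train_batches = [([[0.9, 0.1], [0.2, 0.8]], [0, 1])]
test_batches = [([[0.6, 0.4], [0.5, 0.5]], [0, 1])]

train_loss = mean_loss(train_batches, cross_entropy)
test_loss = mean_loss(test_batches, cross_entropy)

# A test loss far above the training loss hints at overfitting.
gap = test_loss - train_loss
print(f"train={train_loss:.4f} test={test_loss:.4f} gap={gap:.4f}")
```

A large positive gap suggests overfitting, while two losses that are both high suggest underfitting.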

  • Very clear, sir. You are definitely the best I have come across in machine translation. Everything works perfectly. My request, however, is: how can I create a graphical client application, so that a user can type in an English sentence as input and get the target language as output? I am a complete beginner with these technologies.

    • Hi Leo, thank you for finding my post helpful. About your question though, you may want to look for tutorials on creating a simple website with Python. I highly recommend Flask because it’s super lightweight and easy to learn. Check out the official tutorial at

      • Hi Trung, I managed to find my way around, thanks to you, but issues arose when I attempted the transformer project. I copied the code in full but replaced the dataset: instead of French, I translated the same English sentences into my local language (Chichewa). Without changing anything but the sentences, I get this error:

        Input vocabulary size 106
        Encoder input shape (2, 10)
        Encoder output shape (2, 10, 128)

        InvalidArgumentError Traceback (most recent call last)
        19 fr_sequence_in = tf.constant([[1, 2, 3, 4, 5, 6, 7, 0, 0, 0, 0, 0, 0, 0],
        20 [1, 2, 3, 4, 5, 6, 7, 0, 0, 0, 0, 0, 0, 0]])
        —> 21 decoder_output = decoder(fr_sequence_in, encoder_output)
        23 print('Target vocabulary size', fr_vocab_size)

        ~/anaconda3/lib/python3.7/site-packages/tensorflow_core/python/keras/engine/ in __call__(self, inputs, *args, **kwargs)
        889 with base_layer_utils.autocast_context_manager(
        890 self._compute_dtype):
        –> 891 outputs =, *args, **kwargs)
        892 self._handle_activity_regularization(inputs, outputs)
        893 self._set_mask_metadata(inputs, outputs, input_masks)

        in call(self, sequence, encoder_output)
        22 for i in range(sequence.shape[1]):
        23 embed = self.embedding(tf.expand_dims(sequence[:, i], axis=1))
        —> 24 embed_out.append(embed + pes[i, :])
        26 embed_out = tf.concat(embed_out, axis=1)

        ~/anaconda3/lib/python3.7/site-packages/tensorflow_core/python/ops/ in _slice_helper(tensor, slice_spec, var)
        811 ellipsis_mask=ellipsis_mask,
        812 var=var,
        –> 813 name=name)

        ~/anaconda3/lib/python3.7/site-packages/tensorflow_core/python/ops/ in strided_slice(input_, begin, end, strides, begin_mask, end_mask, ellipsis_mask, new_axis_mask, shrink_axis_mask, var, name)
        977 ellipsis_mask=ellipsis_mask,
        978 new_axis_mask=new_axis_mask,
        –> 979 shrink_axis_mask=shrink_axis_mask)
        981 parent_name = name

        ~/anaconda3/lib/python3.7/site-packages/tensorflow_core/python/ops/ in strided_slice(input, begin, end, strides, begin_mask, end_mask, ellipsis_mask, new_axis_mask, shrink_axis_mask, name)
        10370 else:
        10371 message = e.message
        > 10372 _six.raise_from(_core._status_to_exception(e.code, message), None)
        10373 # Add nodes to the TensorFlow graph.
        10374 if begin_mask is None:

        ~/anaconda3/lib/python3.7/site-packages/ in raise_from(value, from_value)

        InvalidArgumentError: slice index 10 of dimension 0 out of bounds. [Op:StridedSlice] name: decoder_28/strided_slice/

        so after running this cell:

        H = 2
        NUM_LAYERS = 2

        en_vocab_size = len(en_tokenizer.word_index) + 1
        encoder = Encoder(en_vocab_size, MODEL_SIZE, NUM_LAYERS, H)

        en_sequence_in = tf.constant([[1, 2, 3, 4, 6, 7, 8, 0, 0, 0],
        [1, 2, 3, 4, 6, 7, 8, 0, 0, 0]])
        encoder_output = encoder(en_sequence_in)

        print('Input vocabulary size', en_vocab_size)
        print('Encoder input shape', en_sequence_in.shape)
        print('Encoder output shape', encoder_output.shape)

        fr_vocab_size = len(fr_tokenizer.word_index) + 1
        max_len_fr = data_fr_in.shape[1]
        decoder = Decoder(fr_vocab_size, MODEL_SIZE, NUM_LAYERS, H)

        fr_sequence_in = tf.constant([[1, 2, 3, 4, 5, 6, 7, 0, 0, 0, 0, 0, 0, 0],
        [1, 2, 3, 4, 5, 6, 7, 0, 0, 0, 0, 0, 0, 0]])
        decoder_output = decoder(fr_sequence_in, encoder_output)

        print('Target vocabulary size', fr_vocab_size)
        print('Decoder input shape', fr_sequence_in.shape)
        print('Decoder output shape', decoder_output.shape)

        The same raw dataset works perfectly with the attention model. Apparently, when I leave line number 19 of the sentences in its original content and change everything else from English to Chichewa, the code runs perfectly.

        • This is the change that I made to the dataset:
          raw_data = (
          ('What a ridiculous concept!', 'ganizo la uchisiru !'),
          ('Your idea is not entirely crazy.', "nzeru yakoyi siyobalalikiratu."),
          ("A man's worth lies in what he is.", "phindu la munthu liri mmene iye alili."),
          ('What he did is very wrong.', "zimene anachita zinali zolakwika."),
          ("All three of you need to do that.", "nonse atatu mukuyenera kuchita zimenezo."),
          ("Are you giving me another chance?", "mundipatsa mwayi wina ?"),
          ("Both Tom and Mary work as models.", "onse Tom ndi mary amagwira ngati ma modelo."),
          ("Can I have a few minutes, please?", "mungandipatseko mphindi zingapo ?"),
          ("Could you close the door, please?", "mungasekeko chitseko ?"),
          ("Did you plant pumpkins this year?", "munadzala mawungu chacha ichi ?"),
          ("Do you ever study in the library?", "umawerengerako ku laibulale ?"),
          ("Don't be deceived by appearances.", "usanyengedwe ndi maonekedwe."),
          ("Excuse me. Can you speak English?", "pepani mumalankhula chingerezi ?"),
          ("Few people know the true meaning.", "anthu ochepa amadziwa tanthauzo lenileni."),
          ("Germany produced many scientists.", "germany idapanga a za sayansi ambiri."),
          ("Guess whose birthday it is today.", "ukudziwa kuti ndi tsiku lobadwa landani lero !"),
          ("He acted like he owned the place.", "amapanga ngati malowo ndi ake."),
          ("Honesty will pay in the long run.", "chilungamo chimalipira patsogolo."),
          ("How do we know this isn't a trap?", "tidziwa bwanji kuti umenewuwu si sampha?"),
          ("I can't believe you're giving up.", "sindikukhulupilira kuti ukuzitsiya."),
          )
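
One possible reading of the `slice index 10 of dimension 0 out of bounds` error in this thread, assuming the positional-encoding table `pes` is sized from the longest training sentence: the hard-coded test sequence has 14 tokens, while the table only has 10 rows once the shorter Chichewa sentences replace the French ones. The sketch below is plain Python with hypothetical helper names (`positional_encoding`, `first_bad_position`); it illustrates the indexing problem only and is not the post's actual code.

```python
import math

def positional_encoding(max_length, model_size):
    """Build a sinusoidal table with one row per position (hypothetical helper)."""
    table = []
    for pos in range(max_length):
        row = []
        for i in range(model_size):
            angle = pos / (10000 ** ((2 * (i // 2)) / model_size))
            row.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
        table.append(row)
    return table

def first_bad_position(sequence, table):
    """Return the first position that has no row in the table, or None."""
    for i in range(len(sequence)):
        if i >= len(table):
            return i
    return None

# Assumed lengths: padded training data is 10 tokens long,
# but the hand-written test sequence is 14 tokens long.
train_max_len = 10
test_sequence = [1, 2, 3, 4, 5, 6, 7, 0, 0, 0, 0, 0, 0, 0]

pes = positional_encoding(train_max_len, model_size=8)
bad = first_bad_position(test_sequence, pes)

# Sizing the table for the longest sequence that will ever be fed avoids this.
safe_pes = positional_encoding(max(train_max_len, len(test_sequence)), model_size=8)
still_bad = first_bad_position(test_sequence, safe_pes)
print(bad, still_bad)
```

If this reading is right, either shortening the test sequence to at most the training maximum length or building the table with a larger maximum length should make the error go away.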

  • Just want to say very good job!

  • man how can I save it for deployment

  • Hi Trung,
    You have done an excellent job of explaining the seq2seq with attention model.
    However, I am getting an error with attention version of the code.
    The call method in the Decoder has 3 arguments
    def call(self, sequence, state, encoder_output):
    whereas when called from train_step function, four arguments are being passed.
    logit, de_state_h, de_state_c, _ = decoder(decoder_in, (de_state_h, de_state_c), en_outputs[0])

    Anyone else facing this issue?

  • Hi Trung. Thanks for the post. I am facing the following error after running the code:
    dataset =, data_fr_in, data_fr_out))
    ValueError: Dimensions 20 and 0 are not compatible

  • splendid job Trung.

  • Hey Bro,
    I’m getting the following error. Could you please help me with this?
    ValueError: tf.function-decorated function tried to create variables on non-first call.
