Create The Transformer With Tensorflow 2.0


Hello, buddy. The post you’re reading has been moved to my new blog:

You’ll love the new look over there. Promise!

Trung Tran is a software developer + AI engineer. He also works on networking & cybersecurity on the side. He loves blogging about new technologies and all posts are from his own experiences and opinions.

65 comments on Create The Transformer With Tensorflow 2.0

  • Thanks for a great post. But I have a simple question which is confusing me, being new to DL: how do I save the model, say in Google Colab, given that this method doesn’t use a TensorFlow session and presumably runs with eager execution? I guess that train_step() handles the training process, and predict() is to be called on new test data. Is there a suitable way to save the model (or something similar) so that it doesn’t have to be trained again from scratch? If you can suggest a way, it would be very helpful, as I have searched widely for a method to save without a Session but couldn’t find a simple solution. Thanks!

    • Hi Ramesh,

      You can save the weights directly to Google Drive and restore from that for inference.

      Of course, you will need to mount your Drive into Colab first. But that’s pretty simple.

      You can see a concrete example here:
      (See the answer from Tadej Magajna)


      • Hi Trung,
        Thanks for your reply. This is interesting; I could now mount the Drive like this:

        from google.colab import drive
        drive.mount('/gdrive')

        However, the answer you mention by Tadej Magajna on SO unfortunately doesn’t go into the details of saving the weights and restoring them for inference. Can you tell me a simple way to do this, i.e. save the weights and restore them later for using predict() without requiring training from scratch? I regularly follow your posts, like the one on Seq2Seq and this one on the Transformer, so I would really appreciate a standard way of doing this for models which do not use sessions in TensorFlow. Thanks.

        • Hi Ramesh,

          Thank you for following my blog for a while. I really appreciate that 🙂

          Back to the Colab & Drive thing, I don’t see any problems here (or maybe I don’t understand your question correctly).

          I think you can treat the mounted drive as a normal directory, and tell the model to save weights to that.
          For example, you can do something like model.save_weights('/gdrive/My Drive/your_model_name') during training, so your weights are kept permanently on your Drive.
          For inference, you can either download the weights from where you stored them and run locally, or you can infer directly on Colab (the Drive is still mounted and you know exactly where the weights are, right?).

          Hope this helps.
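The workflow described in this reply can be sketched as follows. This is a minimal sketch with a stand-in model and a local path; in Colab you would point the path at the mounted Drive instead (e.g. '/gdrive/My Drive/your_model_name.weights.h5'):

```python
import tensorflow as tf

# A stand-in model; in the post this would be the Transformer's
# encoder/decoder (any tf.keras.Model works the same way).
model = tf.keras.Sequential([tf.keras.layers.Dense(4)])
model.build((None, 8))

# During training: persist the weights. In Colab, use a path
# under the mounted Drive so they survive the session.
model.save_weights('/tmp/demo.weights.h5')

# Later, for inference: rebuild the same architecture, then restore.
restored = tf.keras.Sequential([tf.keras.layers.Dense(4)])
restored.build((None, 8))
restored.load_weights('/tmp/demo.weights.h5')
```

Since only weights are saved, the model-building code must be run again before loading; that is the trade-off of save_weights() versus saving the whole model.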

          • I see, this is how it works. In the Transformer post you have provided the way to save weights for the encoder-decoder, so I was looking for something similar for the Seq2Seq model. Also, one of my observations is that the Transformer takes more time to train than Seq2Seq on my data. Is this possible? Cheers

          • Hi Ramesh,

            If you want a complete code base, take a look here: my chatbot project is more polished (still a work in progress), with separate training/test scripts.

            About the training time though, it depends on the network’s settings so I can barely say anything.


  • Hi there,
    Thanks for a great post. Can you tell how to stop training to prevent overfitting in this case? I mean, we are not using a validation loss estimate, are we? I am confused about where to stop training, as the loss becomes 0 if I train for more than a few hundred epochs. Can you give an example of introducing early stopping in this code? Thank you.
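One common answer to this question, sketched under the assumption that some sentence pairs are held out as a validation set: stop when the validation loss stops improving for a few epochs. The `train_step` and `compute_val_loss` callables below are hypothetical placeholders for the post's training step and a held-out-set loss evaluation:

```python
# A minimal patience-based early-stopping skeleton for a custom
# training loop (no tf.keras callbacks needed).
def train_with_early_stopping(train_step, compute_val_loss,
                              max_epochs=300, patience=5):
    best_val, epochs_since_best = float('inf'), 0
    for epoch in range(max_epochs):
        train_step()                   # one pass over the training data
        val_loss = compute_val_loss()  # loss on sentences NOT trained on
        if val_loss < best_val:
            best_val, epochs_since_best = val_loss, 0
            # this is also the right moment to save_weights(...)
        else:
            epochs_since_best += 1
            if epochs_since_best >= patience:
                break                  # validation loss stopped improving
    return epoch + 1, best_val

# Toy check: a validation loss that improves and then drifts up.
losses = iter([3.0, 2.0, 1.5, 1.6, 1.6, 1.7, 1.8, 1.9, 2.0, 2.1])
epochs_run, best = train_with_early_stopping(
    lambda: None, lambda: next(losses), max_epochs=10, patience=3)
```

With patience=3 the toy run stops once three epochs in a row fail to beat the best validation loss, rather than driving the training loss to 0.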

  • Hi, I have a few questions:
    1. What is modelsize? Is it the embedding size for each word in the text, which is 512 in the paper?
    2. Why do you divide modelsize by the number of attention heads to get keysize? Can’t we use any keysize?
    3. Is the only constraint that keysize = querysize, so that their dot product is possible?
    4. query_len and value_len should be the same, right? As they represent the number of words in a sequence.

  • Hey,
    I want to say thanks. Your kernel helped me a lot in understanding the mathematics of the transformer model; now I am able to write the complete model by myself.

  • Hi,

    first of all, thank you for this article, it helped me a lot in understanding how transformer works and, specifically, how self-attention is implemented.

    Just one comment, I think I found a small error: in the snippet, the first row is going to be all zeros, and I think that’s not the idea, right? At least I tried that code and it wasn’t working properly until I changed it slightly:

    look_left_only_mask = tf.constant([[1] * (i + 1) + [0] * (seq_len - i - 1) for i in range(seq_len)], dtype=tf.float32)

    Many thanks again!

    • Hi,

      Thank you for reading.

      Great question. Personally, I think it is unnecessary to include the current word in key & value tensors, since we can still have access to that piece of information thanks to the residual connection later on.

      In fact, I have tried both of them and gained slightly better results if the current word is masked out too.
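The two variants being compared here (attend up to and including the current word, versus masking the current word out as well) can be written compactly with `tf.linalg.band_part`. A small sketch, with `seq_len` chosen arbitrarily:

```python
import tensorflow as tf

seq_len = 4

# Variant from the comment: each position may attend to itself
# and to everything on its left (standard look-ahead mask).
include_current = tf.linalg.band_part(tf.ones((seq_len, seq_len)), -1, 0)

# Variant from the reply: mask the current word out as well
# (strictly lower-triangular); the residual connection still
# carries the current word's information forward.
exclude_current = include_current - tf.eye(seq_len)
```

For `seq_len = 4` the first mask has ones on and below the diagonal; the second zeroes the diagonal too, so position i only sees positions 0..i-1.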

  • Also, one question: does it make sense to multiply the look_left_only_mask and the padding mask in the decoder?

  • There is an error to your code on the train step when I try to run it says this:
    "TypeError: Failed to convert object of type to Tensor. Contents: [Dimension(None), -1, 8, 16]. Consider casting elements to a supported type."

    Do you know how to fix this?

  • Hi dude,
    Thank you for the great post!

    I’m struggling to save the transformer model weights. Is there any way to save them, just like you saved the encoder and decoder attention weights and loaded them back for the training process in your previous post?
    I’m trying to use the saved weights to use it on another dataset.

    Many thanks.

  • Hi! Great post!

    I was able to run the entire project here, but realized that the prediction implementation takes a long time to execute if MAX_LENGTH is large and has several batches to run. (really long time)

    Do you have any suggestions to optimize, or if anyone has already done so?

    • Hi Arthur,

      As far as I know, it’s the trade-off of having no state carried between steps, which results in the long inference time. You may want to have a look at the newest papers on transformers (I haven’t really been keeping up with the trends lately).

      • Hi again!

        I had some other questions about loss and a possible accuracy function.

        What is the “from_logits” parameter? Is it necessary? I saw it in other posts, but when I used it, the loss value was always high and the same (maybe I’m missing something). Another parameter is reduction="none". Can you explain their use?

        In the accuracy function, a reshape is used first, but maybe it needs a mask like in the loss function?

        y_true = tf.reshape(y_true, shape=(-1, MAX_LENGTH))
        accuracy = tf.metrics.SparseCategoricalAccuracy()(y_true, y_pred)
        return accuracy

        Thanks a lot!

        • Hi Arthur,

          1. from_logits
          Basically, the raw output of a network is called a logit: no softmax or sigmoid has been applied to it yet.
          Why does it matter? Because the loss function I used will not apply softmax to its input by default, so if you pass in the raw output of the network, you must tell the loss function that.

          2. reduction="none"
          You can think of the logits as a multidimensional tensor, say (batch, step, depth), which means the loss function will compute a loss of shape (batch, step). By default, the loss function adds it up and returns just the sum (a scalar). That’s not what we want, which is why we have to tell it not to do that.

          3. accuracy
          I don’t think merely using the default accuracy metric is a good idea. A translation task is evaluated considering not only the syntactic but also the semantic aspect. You should read about some well-known metrics such as the BLEU score.
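Points 1 and 2 above can be illustrated on tiny fake data. The shapes, vocabulary size, and the choice of 0 as the padding id are invented for the example:

```python
import tensorflow as tf

# logits are raw network outputs: no softmax applied yet.
logits = tf.random.normal((2, 5, 7))       # (batch, step, vocab)
targets = tf.constant([[3, 1, 0, 0, 0],
                       [2, 4, 6, 1, 0]])   # 0 = padding token

crossentropy = tf.keras.losses.SparseCategoricalCrossentropy(
    from_logits=True,   # tell the loss the input is raw logits
    reduction='none')   # keep the per-(batch, step) losses

raw_loss = crossentropy(targets, logits)   # shape (2, 5)

# Because reduction='none' kept the shape, we can zero out the
# padded positions before averaging over the real tokens only.
mask = tf.cast(tf.not_equal(targets, 0), tf.float32)
loss = tf.reduce_sum(raw_loss * mask) / tf.reduce_sum(mask)
```

Without `reduction='none'` the loss would already be collapsed to a scalar and the padding positions could not be masked out.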

          • Great explanation!
            Re 3., I was just trying to make a simple text corrector, but OK, got it!

            Thank you again!

  • Maybe there is a mistake in the function “positional_embedding” in this post, at lines 5 and 7: PE[:, :, i] should be PE[:, i]. The code in your GitHub repo is PE[:, i].

  • Hi,
    I really like this transformer implementation, its clean and simple to navigate.

    I think it would be really useful if the function def predict(test_source_text=None) could take a batch of inputs. For example, the training function [def train_step] takes batches of inputs, so it’s quite fast; it would be amazing if you did the same for prediction.

    Because, like any transformer, it’s quite slow, and the execution time is about the same for predicting 100 sentences or 1 sentence on a GPU.

    Anyways, great work.

    • Hi Robert,

      Sorry I’m late. Thank you for your words.
      However, I don’t know whether it’s a good idea to infer 100 sentences at once.
      Of course, we can stuff them into one matrix, but since each sentence will end up with a different length (when it hits its end token), we would need to loop over the final result again to clean up each sentence.

      Anyway, it’s just my thought. Any feedback on that is welcome 😀
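The clean-up loop this reply mentions might look like the sketch below: after batched greedy decoding, every sequence is cut at its own end token. The token ids and the END_TOKEN value are invented for illustration:

```python
END_TOKEN = 1  # assumed id of the end-of-sentence token

def trim_at_end_token(batch, end_token=END_TOKEN):
    """Cut each decoded sequence at its own end token."""
    trimmed = []
    for seq in batch:
        out = []
        for tok in seq:
            if tok == end_token:
                break          # drop the end token and any padding after it
            out.append(tok)
        trimmed.append(out)
    return trimmed

decoded = [[5, 9, 2, 1, 0, 0],   # ends early, rest is padding
           [7, 3, 8, 6, 4, 1]]   # uses the full length
```

So batching inference is possible; the extra pass over the results is the price for sentences of different lengths sharing one matrix.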

  • Hi! I liked everything very much. But why is the softmax function missing in the decoder implementation?

    • Hi Sergey,

      Nothing is missing, my friend. The loss function will take care of applying softmax; the network only needs to compute the raw output, i.e. the logits.

  • Hi Trung Tran

    Beautiful blog, this has greatly helped my work as a developing ML engineer. I have a cheeky question… How would one go about incorporating the predictions at each t-step with a beam search decoder?

    Kind regards,

  • Hi Trung,

    Great post. It helped me a lot to understand the self-attention mechanism of Transformers in detail. I am facing trouble saving the model along with the weights after the model is trained. Could you please explain how to save the model and then load it to test on a new set of examples?

    Thank you very much once again

  • Hi Trung,
    Can you please provide a GitHub link for this source code? I would like to try out this project in PyCharm; I’m not used to notebooks.

  • Hi Trung Tran, that was an amazing post! The best I have seen on the topic. I feel like I kind of understand it now. You are a genius at teaching. Hopefully this will help me solve the problem I’m supposed to be figuring out at work! Thanks for taking the time to write this whole thing up.

  • Thank you!
    It’s amazing how hard it is to find a clear explanation and implementation of this architecture.
    I’m thankful. Cheers!

  • How can I save it for Flask deployment? Anyone?

  • Thank you for your generosity!
    I have one question:
    how can we save this model, and how can we load it back?

  • Hello, nice blog. I have a small doubt regarding transformers: how do we increase the conversation context and generate a response out of it? Is there any way that it can remember the past 5-6 exchanges and reply to us based on the conversation? Suppose I say my name and that I’m a teacher, and after 2 exchanges I change my occupation from teacher to doctor. Can it question me back about having changed my occupation? If yes, how?

  • Hi Trung Tran, thank you very much for this post. You made my day. I was having difficulty implementing the Transformer until I found your article.

  • Hi Trung,
    This helped me a lot. Thanks for your efforts. I am stuck with saving the model for making inferences. Can you please help me with this? If you can point out where I should look, that would be very helpful.
    Please!


  • Hi Trung,

    What do you mean by this: “once you plug in the real training data and use the full 8-head attention setup, the new MultiHeadAttention implementation will become a game changer!”
    Can you please elaborate? How would one use the full 8-head attention setup?


  • Hi, it was very informative, but I ran into an issue.

    I am able to train it and save the weights, but I am not able to find the best way to save the structure of the network.

    I am not able to save the entire model with the model.save() method; it always asks me to override the get_config method wherever the Layer class is used as a parent,

    i.e. in the MultiHeadAttention class and the PositionalEncoding class:
    so I added this in both classes; then it says to do the same for the learning rate too, so I wrote 0.01 in the optimizer in place of the decaying learning rate schedule.

    def get_config(self):
        config = super().get_config().copy()
        config.update({
            'vocab_size': self.vocab_size,
            'num_layers': self.num_layers,
            'units': self.units,
            'd_model': self.d_model,
            'num_heads': self.num_heads,
            'dropout': self.dropout,
        })
        return config

    I managed to save the entire model, but when I load it, it says "ValueError: Unknown layer: PositionalEncoding".

    Please tell me how to save the entire model, or the most efficient way to load the weights (rather than reprocessing the data every time).
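For reference, an "Unknown layer" error at load time usually means the custom classes were not registered with the loader; passing them via `custom_objects` is the common fix. A minimal sketch with a toy custom layer (the `Scale` class is invented here and stands in for PositionalEncoding / MultiHeadAttention):

```python
import tensorflow as tf

# A toy custom layer; the key points are the get_config override
# (so the layer can be serialized) and custom_objects at load time.
class Scale(tf.keras.layers.Layer):
    def __init__(self, factor=2.0, **kwargs):
        super().__init__(**kwargs)
        self.factor = factor

    def call(self, x):
        return x * self.factor

    def get_config(self):
        config = super().get_config().copy()
        config.update({'factor': self.factor})
        return config

model = tf.keras.Sequential([Scale(3.0)])
model.build((None, 3))
model.save('/tmp/custom_layer_model.h5')

# Without custom_objects this load raises "Unknown layer: Scale".
restored = tf.keras.models.load_model('/tmp/custom_layer_model.h5',
                                      custom_objects={'Scale': Scale})
```

For the post's model, the same pattern would pass each custom layer class (PositionalEncoding, MultiHeadAttention, and so on) in the `custom_objects` dict.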

  • Thank you very much! It’s really helpful. It’s the first time I have written such long code in TensorFlow, under your guidance.

    I am trying to translate Chinese to English, but the loss becomes NaN after 10 or 12 epochs. I will continue debugging.

    And there is a problem I cannot figure out: in Figure 11 (Data Shapes inside the Decoder), the output shape of the Decoder is (batch_size, length, model_size),
    but shouldn’t it be (batch_size, length, vocab_size)?

  • Hi Trung Tran,

    thank you very much for this tutorial, it has been very helpful. Nevertheless, I am having problems making it perform well.

    I have executed the file and after 100 epochs the results are terrible:

    She choked him with her bare hands .
    [[30, 4658, 43, 38, 55, 3077, 483, 1]]
    – . . . . . . . . . . . . . .

    and all the test sentences have the same output… is this behaviour familiar to you?

    Do you know what might be happening?

    • Hi Cae,
      Sorry for my late response. I’ve just tried to run it again from the top. The model can still translate normally, as I wrote in the post.
      Can you check it again?

      • Hi,

        unfortunately I face the same problem as Cae and Cesar. I copied the code from your GitHub and I have the feeling that the longer I train, the worse the results get. As Cesar pointed out, “the translation the model makes is just the same character repeated…”. Maybe the error is just in the GitHub repo and not in the notebook, since the code fragments are not identical. Hope we find a solution 😉

  • Hi Trung Tran,
    first of all, thank you very much for this post, it has been very useful for understanding how transformers work!

    But I am having problems reproducing your results. I executed the code from “Source code for full training data”, changing NUM_EPOCHS = 150, but the results I get are very bad:

    Tom thinks Mary doesn t get enough sleep .
    [[14, 862, 88, 124, 8, 56, 188, 309, 1]]
    . . . . . . . . . . . . . .

    Does this behavior look familiar to you? Is there something wrong?

    Thank you

  • Hi Trung Tran,

    Thank you for this post, it has been very useful. Nevertheless, I can’t reproduce the results: I downloaded the code and increased the number of epochs, but the translation the model makes is just the same character repeated. Does this behaviour sound familiar to you? Do you know where the problem is?

    thank you very much

  • Amazing job!

  • Hi Trung Tran, thank you for such a nice post. I have a doubt about fine-tuning transformers: after training on a large dataset, I want to fine-tune on some specific data for a specific purpose by freezing the encoder layers and the initial decoder layers. Can you please help with how to freeze layers in transformers?
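In a tf.keras model, freezing generally comes down to setting `trainable = False` on the sub-layers you want fixed before building the optimizer. A minimal sketch with stand-in Dense layers (the layer names are illustrative, not the post's actual architecture):

```python
import tensorflow as tf

# Stand-ins for the encoder and decoder blocks of a transformer.
model = tf.keras.Sequential([
    tf.keras.layers.Dense(16, name='encoder_block'),
    tf.keras.layers.Dense(16, name='decoder_block_1'),
    tf.keras.layers.Dense(8,  name='decoder_block_2'),
])
model.build((None, 4))

# Freeze the encoder and the first decoder block; only the
# remaining layers will receive gradient updates.
for layer in model.layers[:2]:
    layer.trainable = False
```

In a custom training loop like the post's, the same effect follows from applying gradients only to `model.trainable_variables`, which shrinks automatically once layers are frozen.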

  • Hi Trung Tran,
    First, thank you so much for your post. I feel it will be very useful. I am getting the following error when I try to run the source code. I am relatively new to this area. Can you please help me resolve this? Once I am able to run it, I want to train the model with my own translation data and see.

    Source code is taken from


    \ FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
    _np_qint32 = np.dtype([("qint32", np.int32, 1)])
    C:\ProgramData\Anaconda3\lib\site-packages\tensorboard\compat\tensorflow_stub\ FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
    np_resource = np.dtype([("resource", np.ubyte, 1)])
    Traceback (most recent call last):

    File "", line 46, in
    lines = maybe_download_and_read_file(URL, FILENAME)

    File "", line 38, in maybe_download_and_read_file
    zipf = ZipFile(filename)

    File "C:\ProgramData\Anaconda3\lib\", line 1258, in __init__

    File "C:\ProgramData\Anaconda3\lib\", line 1325, in _RealGetContents
    raise BadZipFile("File is not a zip file")

    BadZipFile: File is not a zip file

    • Right now I have the same problem too. I’ll dig into that. For now, download the file manually and adjust the code a little bit to ignore the download part.
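Until the download issue is sorted out, a defensive variant of the helper might look like the sketch below. This is an assumption-laden rewrite, not the post's actual code; the function name and the URL/FILENAME arguments simply mirror the traceback. The idea: a "File is not a zip file" error often means the download produced an HTML error page instead of the archive, so the file is validated before being opened:

```python
import os
import urllib.request
from zipfile import ZipFile, is_zipfile

def maybe_download_and_read_file(url, filename):
    # Fetch the archive only if it is not already on disk.
    if not os.path.exists(filename):
        urllib.request.urlretrieve(url, filename)
    # Guard against truncated or bogus downloads before unzipping.
    if not is_zipfile(filename):
        raise ValueError(
            f'{filename} is not a valid zip; download it manually '
            'and place it next to the script.')
    with ZipFile(filename) as zipf:
        name = zipf.namelist()[0]
        with zipf.open(name) as f:
            return f.read().decode('utf-8').splitlines()

# Tiny self-check: build a local zip so no download is needed.
import tempfile
demo_path = os.path.join(tempfile.mkdtemp(), 'data.zip')
with ZipFile(demo_path, 'w') as z:
    z.writestr('data.txt', 'hello\tworld\nfoo\tbar\n')
lines = maybe_download_and_read_file('http://unused.example', demo_path)
```

Downloading the file manually and placing it where `FILENAME` points has the same effect: the function then skips the download and reads the archive directly.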

  • Hi Trung,
    Thank you so much for publishing the post with a detailed explanation. I am trying to run the code, but unfortunately I am seeing errors. I am relatively new to this area. Can you please help me resolve the errors? I got the code from

    Please see the following warning and the errors.

    C:\ProgramData\Anaconda3\lib\site-packages\tensorboard\compat\tensorflow_stub\ FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
    _np_quint16 = np.dtype([("quint16", np.uint16, 1)])
    C:\ProgramData\Anaconda3\lib\site-packages\tensorboard\compat\tensorflow_stub\ FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
    _np_qint32 = np.dtype([("qint32", np.int32, 1)])
    C:\ProgramData\Anaconda3\lib\site-packages\tensorboard\compat\tensorflow_stub\ FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
    np_resource = np.dtype([("resource", np.ubyte, 1)])
    Traceback (most recent call last):

    File "", line 46, in
    lines = maybe_download_and_read_file(URL, FILENAME)

    File "", line 38, in maybe_download_and_read_file
    zipf = ZipFile(filename)

    File "C:\ProgramData\Anaconda3\lib\", line 1258, in __init__

    File "C:\ProgramData\Anaconda3\lib\", line 1325, in _RealGetContents
    raise BadZipFile("File is not a zip file")

    BadZipFile: File is not a zip file


  • Hi Trung,
    I posted 2 comments related to my query. I need your help. Somehow I cannot see those posts. Is it that posts will be displayed only after they are approved?


  • Angshuman Sarma

    Being an undergraduate student, and in my second year at that, it was hard for me to find resources to make all these things work out (everything was jumbled in my mind), but your post has all I need. Thank you very much.
    Please continue providing more posts like this.

  • Erwin Driessens

    It is so slow… I have a reasonable GeForce RTX GPU, and a call to predict() takes a significant fraction of a second. I used your code from GitHub. Is this normal?


  • I am getting the wrong shape for the encoder output: I am getting (2, 95, 128) while the correct shape is (2, 10, 128). Can somebody please help?

  • Hi Trung,
    Great post!!! Thank you so much for the post. The basic transformer worked just fine. However, when I tried to add the improvements, I faced two issues:
    1) Inside the call function of MultiHeadAttention, score has shape (5, 10, 10) but mask has shape (5, 10), so the operation score *= mask gives an error because of the shape mismatch. Do we need to make the mask of shape (5, 10, 10)? How do we do that?

    2) Shouldn’t the predict function also be updated when making the improvements, since the encoder inside it will now require padding_mask as an additional argument?
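On question 1), the usual fix for this kind of mismatch is not to rebuild the mask but to give it an extra query axis so it broadcasts against the scores. A small sketch, with the shapes taken from the comment:

```python
import tensorflow as tf

# Scores are (batch, query_len, key_len); the padding mask comes
# in as (batch, key_len). Adding a query axis lets it broadcast.
batch, q_len, k_len = 5, 10, 10
score = tf.random.normal((batch, q_len, k_len))
padding_mask = tf.cast(tf.random.uniform((batch, k_len)) > 0.3,
                       tf.float32)   # 1 = real token, 0 = padding

# (5, 10) -> (5, 1, 10), which broadcasts to (5, 10, 10):
score *= padding_mask[:, tf.newaxis, :]
```

Every query position then applies the same per-key padding pattern, which is exactly what a padding mask should do.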

  • Hi Trung,
    Great post. I’m new to NLP and have been trying to implement a transformer for some time now. This post is really well structured for someone like me who has experience in programming but not so much in matrix multiplications. The simple implementation maps well to how a regular programmer may think.

    I have a few questions-
    1. In the encoder, you have added 2 dense layers for every block. Any particular reason for that? I tried without "dense_1" and the model overfitted faster.
    2. I tried using the default method to train the model. While the loss during training reduced quickly, the prediction failed terribly. Below is the code; I have changed a few things in the model, for example I’m using the padding_mask provided by the embedding layer in the Encoder.

    enc_inputs = ks.layers.Input((data.en_len,))
    dec_inputs = ks.layers.Input((data.hn_len,))

    enc = Encoder(data.en_len, data.en_vocab_size + 1, dmodel, num_heads, num_units)
    dec = Decoder(data.hn_len, data.hn_vocab_size + 1, dmodel, num_heads, num_units)

    enc_out, enc_mask = enc(enc_inputs)
    dec_out = dec(dec_inputs, enc_out, enc_mask)
    model = ks.Model(inputs=[enc_inputs, dec_inputs], outputs=dec_out)
    model.compile(optimizer='adam', loss=loss_func)
    model.fit(x=[en_input, hn_input], y=hn_output, epochs=epochs)

    colab notebook –

    The train method you have provided, worked flawlessly on the same model.

  • Many thanks for the super helpful tutorial! This is a really good intro to how to implement transformers.

  • Thank you so much for this wonderful explanation of such a useful concept. I got to learn a lot, and in the simplest way possible. Thanks once again.

Leave a reply:

Your email address will not be published.