Hello everyone, this is my very first blog post on my new site. I will use this site for all of my future posts, which of course will be mostly about Deep Learning. But I will be in hardware mode sometimes, so I will write about those projects as well. Without further introduction, let’s dive in.
First, let’s talk about what we will be doing today. After following this blog post, we will have a model that can generate text like the sample below:
I am ever? George wanted a natural. “Dumbledore told me how Ron!” Ron gave him a voice dry clearly into his bag. “I think he doesn’t think it was doing?” Harry told Michael, Ron, who still resumed a squeaky lit face back with his beard. A emaciated cruel place fell yellowish green ten snowy witches and together emerging through Ron and served Harry’s camp bed and shredded more…
I wrote a blog post about how to create a similar model with Keras on my old blog (link in the reference section). That post is quite out of date now, but with some simple modifications, the model will work well with the current version of Keras. In today’s post, I will show you how to implement it in TensorFlow.
In order to make things work properly (and because I am not going to explain every piece of code), make sure that you have:
- Python installed (Python3 is recommended)
- TensorFlow installed
- Some experience with Python and TensorFlow, plus basic knowledge of RNNs and word embeddings
No problem with anything above? Then we are ready to go.
Above, we have already talked about what problem we will solve today. Next, let’s tackle it step by step. The first thing to consider when working on any Deep Learning project is which dataset to use. In this particular problem, we can use any raw text for the network to learn from. Want to mimic J.K. Rowling’s style? Any Harry Potter book should do the job. Want some old-fashioned style? How about Charles Dickens’s Oliver Twist or Charlotte Brontë’s Jane Eyre? Oh, you want to sound like Donald Trump? He’s on Twitter!
But don’t worry: to get your code up and running as fast as possible, I will provide two sets of data. One is Charles Dickens’s Oliver Twist, and the other is the transcript of Mr. Donald Trump’s speech. Each of those will be sufficient for the model to yield meaningful content.
Firstly, let’s import the necessary modules and define some FLAGS to store hyperparameters and constants:
Okay, let’s write some code to process the text data file. Since we want to train a word-based model, we will split the raw text into word tokens:
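A minimal sketch of the tokenizing step, assuming that plain whitespace splitting is good enough (so punctuation stays attached to words). The function name `tokenize` and the sample sentence are made up for illustration:

```python
def tokenize(raw_text):
    """Split raw text into word tokens on any whitespace."""
    # str.split() with no argument collapses runs of spaces and newlines.
    return raw_text.split()


sample = "Are we gonna do this?\nAre we   ready?"
tokens = tokenize(sample)
print(tokens)
```

With this approach "this?" and "this" count as different words, which keeps the code short at the cost of a larger vocabulary.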
Then, we will need to create two dictionaries, one to convert words into integer indices, and the other one to convert integer indices back to word tokens:
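Here is one way the two dictionaries could be built. The names `vocab_to_int` and `int_to_vocab` are hypothetical, and giving the most frequent words the smallest indices is just an assumption of this sketch:

```python
from collections import Counter


def build_vocab(tokens):
    """Return (word -> index, index -> word) lookup tables."""
    counts = Counter(tokens)
    # Most frequent words first; ties are broken alphabetically so the
    # mapping is the same on every run.
    sorted_words = sorted(counts, key=lambda w: (-counts[w], w))
    int_to_vocab = dict(enumerate(sorted_words))
    vocab_to_int = {word: index for index, word in int_to_vocab.items()}
    return vocab_to_int, int_to_vocab


tokens = "are we there yet are we".split()
vocab_to_int, int_to_vocab = build_vocab(tokens)
print(vocab_to_int)
```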
Next, we will convert word tokens into integer indices. These will be the input to the network. And because we will train on one mini-batch per iteration, we should be able to split the data evenly into batches. We can ensure that by dropping the last incomplete batch:
We will then create target data for the network to learn. In a text generation problem, the target for each input word is the word that follows it, so in order to obtain the target data, we just need to shift the whole input data to the left by one step:
Let’s print out the input and target data. It is always good practice to check that everything is OK before moving on. Hope that your screen displays something like mine.
Finally, we will define a function to generate batches for training. The implementation is pretty straightforward:
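A small NumPy sketch of the trimming step. The function name `make_input` and the toy numbers are assumptions; the point is only that the data length becomes an exact multiple of `batch_size * seq_size` before reshaping:

```python
import numpy as np


def make_input(int_text, batch_size, seq_size):
    """Drop the leftover words and reshape into batch_size rows."""
    num_batches = len(int_text) // (batch_size * seq_size)
    trimmed = int_text[:num_batches * batch_size * seq_size]
    # Each row is one long, contiguous stream of word indices.
    return np.reshape(trimmed, (batch_size, -1))


int_text = list(range(23))          # pretend these are word indices
in_text = make_input(int_text, batch_size=2, seq_size=5)
print(in_text.shape)                # 23 words -> 20 kept, 3 dropped
```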
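The shift itself can be sketched on a flat array of indices. What to put in the very last slot is a design choice; wrapping around to the first word, as done here, is an assumption of this sketch:

```python
import numpy as np


def make_target(flat_input):
    """Target at position i is the input word at position i + 1."""
    target = np.zeros_like(flat_input)
    target[:-1] = flat_input[1:]
    # The last word has no successor, so wrap around to the first word.
    target[-1] = flat_input[0]
    return target


flat_input = np.array([5, 6, 7, 8])
target = make_target(flat_input)
print(flat_input, target)
```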
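A batch generator along these lines might look like the sketch below. It assumes the input and target arrays are already shaped as `batch_size` rows, and the name `get_batches` is hypothetical:

```python
import numpy as np


def get_batches(in_text, out_text, seq_size):
    """Yield (input, target) windows of seq_size columns each."""
    num_steps = in_text.shape[1]
    # Step through the columns one full window at a time.
    for start in range(0, num_steps - seq_size + 1, seq_size):
        end = start + seq_size
        yield in_text[:, start:end], out_text[:, start:end]


in_text = np.arange(20).reshape(2, 10)
out_text = in_text + 1              # toy targets, just for the demo
batches = list(get_batches(in_text, out_text, seq_size=5))
print(len(batches), batches[0][0].shape)
```

Because consecutive windows come from the same rows, the state left over from one batch is a sensible starting state for the next one.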
And we are done with the data processing step! Congratulations!
After finishing the data processing task (the most tedious part), let’s define the backbone of this project: the network. First, let’s define the input and output ops:
Great! Next, we will create an embedding layer. Word embedding offers a better way to represent words in vector form than the one-hot approach. You might have seen this:
King – Man + Woman = Queen
In code, though, it takes no more than two lines:
Right after the embedding layer comes the recurrent layer. I will use an LSTM in this post, but you can experiment with other types of your choice. Remember that one thing that sets recurrent layers apart from dense (fully-connected) layers is their states. Without states, recurrent layers have no way to keep information from previous time steps.
We are nearly there. Now that we have defined the recurrent layer, we can get the output from the forward pass. The forward pass requires looping through the entire sequence. tf.nn.dynamic_rnn handles the looping for us, so we can simply call it and get the output like we normally do with dense or CNN layers (I elaborate a little more in the training section; this is just a sneak peek).
In the last step, we need a dense layer to map the output of the recurrent layer to vectors of vocabulary size. An important thing to remember: the recurrent layer returns a tuple containing the real output and its states, so don’t feed the whole tuple into the dense layer 😀
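Conceptually, an embedding layer is a trainable matrix with one row per word, plus a lookup by word index. The NumPy sketch below only illustrates that lookup idea and is not the TensorFlow code; all sizes and names are made up:

```python
import numpy as np

vocab_size, embedding_size = 6, 4
rng = np.random.default_rng(0)

# One row per word in the vocabulary (random here, learned in a real model).
embedding = rng.normal(size=(vocab_size, embedding_size))

# A batch of 2 sequences, 3 word indices each.
word_ids = np.array([[0, 2, 5],
                     [1, 1, 3]])

# The lookup is plain row indexing: every index is replaced by its vector.
embedded = embedding[word_ids]
print(embedded.shape)
```

The same word index always maps to the same vector, which is what lets the network share what it learns about a word across every position where that word appears.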
Now the model is ready to learn!
We are now done creating our model. Next, in order to train the model, we need to define the loss function and a mechanism to update the parameters. First off, the loss function we use is the cross-entropy loss:
One more note: don’t apply softmax by hand when using any loss function with softmax in its name. There are other loss functions that expect the softmax values as input, but you should avoid them, since computing softmax on your own can be numerically unstable.
Now we have the loss. Next, we need to define an optimizer to create a training op, which will make the model learn. Normally, we could do it like this:
No, don’t do that. When working with RNNs, we have to deal with something called exploding gradients, which means that the gradients suddenly become so large that they break the learning process. To prevent that, we need to apply gradient clipping, which keeps the gradients from exceeding a threshold:
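To see why a hand-rolled softmax can go wrong, here is a small pure-Python sketch of cross-entropy computed directly from logits with the usual log-sum-exp trick. It is an illustration of the idea only, not a copy of what any library does internally, and the function name is made up:

```python
import math


def softmax_cross_entropy(logits, target_index):
    """Cross-entropy of one example, computed straight from the logits."""
    # Subtracting the max keeps every exponent <= 0, so exp() cannot overflow.
    highest = max(logits)
    log_sum_exp = highest + math.log(sum(math.exp(x - highest) for x in logits))
    # -log(softmax(logits)[target]) == log_sum_exp - logits[target]
    return log_sum_exp - logits[target_index]


loss_small = softmax_cross_entropy([2.0, 1.0, 0.1], 0)
# A naive softmax would call math.exp(1000.0) here and raise OverflowError.
loss_large = softmax_cross_entropy([1000.0, 999.0, 998.1], 0)
print(loss_small, loss_large)
```

Both calls give the same loss, because shifting all logits by a constant does not change the softmax.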
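Clipping can be done in several ways. The NumPy sketch below assumes clipping by global norm, which is one common choice: all gradients are scaled down together whenever their combined norm exceeds the threshold. The function name and the toy gradients are made up for illustration:

```python
import numpy as np


def clip_by_global_norm(grads, max_norm):
    """Scale all gradients so their combined L2 norm is at most max_norm."""
    global_norm = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    # The scale is 1.0 when the norm is already within the limit.
    scale = max_norm / max(global_norm, max_norm)
    return [g * scale for g in grads], global_norm


grads = [np.array([3.0, 4.0]), np.array([12.0])]
clipped, norm = clip_by_global_norm(grads, max_norm=5.0)
print(norm, clipped)
```

Scaling everything by the same factor keeps the direction of the update and only shortens the step.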
We now have everything we need to start training!
Here we are, in the training section. We are about to reach a point where we can leave our machines for a cup of coffee. Firstly, let’s get all the necessary training data:
Next, we will create the network’s ops. I will show you the code first:
What magic is going on here? Why is the input a batch_size x sequence_size tensor during training, but only a 1 x 1 tensor during inference?
Well, I should have elaborated a little more when I mentioned tf.nn.dynamic_rnn, but I thought it would be better to explain it together with the difference between the training and inference processes.
Basically, during the training process, we have a full-length sequence as input. So, here is what it looks like under the hood when we call tf.nn.dynamic_rnn:
As you can see, at every time step, the input into the LSTM unit comes from our input data, whereas the hidden states of previous steps are kept and passed along.
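The loop over time steps can be sketched in NumPy. To keep it short, this sketch swaps the LSTM for a plain tanh RNN cell, so it shows only the looping and state passing, not the LSTM gates. All names and sizes are made up:

```python
import numpy as np


def unroll(inputs, state, w_x, w_h):
    """Run a simple tanh RNN cell over every time step of a batch."""
    outputs = []
    for t in range(inputs.shape[1]):
        # Input comes from the data; state comes from the previous step.
        state = np.tanh(inputs[:, t, :] @ w_x + state @ w_h)
        outputs.append(state)
    return np.stack(outputs, axis=1), state


rng = np.random.default_rng(0)
batch_size, seq_size, embedding_size, hidden_size = 2, 5, 4, 3
inputs = rng.normal(size=(batch_size, seq_size, embedding_size))
w_x = rng.normal(size=(embedding_size, hidden_size))
w_h = rng.normal(size=(hidden_size, hidden_size))
zero_state = np.zeros((batch_size, hidden_size))

outputs, final_state = unroll(inputs, zero_state, w_x, w_h)
print(outputs.shape, final_state.shape)
```

Feeding the first two steps, then the remaining three with the carried-over state, gives the same final state as one pass over all five steps. That is why the state has to be kept between calls.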
Meanwhile, in inference mode, we don’t have a full sequence. That is exactly why we want a model to generate sequences for us, right? Normally, we start off with just a few initial words and have the model generate the rest. That process looks like this:
So, imagine we only have the word “Are”. Feeding it into the LSTM unit gets us the word “we” (more on the inference logic later). Then, we take the output of the LSTM unit, which is “we”, as the input of the next time step. If we keep doing so, we get a sequence by concatenating all the outputs: “Are we gonna do this?”. Not so hard, right?
Now the mystery is cleared up. Let’s continue with getting the loss and training ops:
Next, let’s create a session. In TensorFlow, the Session is the guy who actually does all the computation once we have created a bunch of ops (which together form something called the computational graph). We also need to initialize all the variables that were created when we defined the embedding layer, the LSTM unit, etc.:
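The feedback loop can be shown with a toy stand-in for the network. Here a hard-coded lookup table plays the role of the trained LSTM, and the state is just a step counter. Everything in this sketch is hypothetical except the idea of feeding each output back in as the next input:

```python
def generate(step_fn, first_word, initial_state, max_words):
    """Feed each predicted word back in as the next input."""
    words = [first_word]
    word, state = first_word, initial_state
    for _ in range(max_words - 1):
        word, state = step_fn(word, state)
        if word is None:            # the toy model has nothing more to say
            break
        words.append(word)
    return " ".join(words)


# Toy stand-in for the trained network: a fixed next-word table.
NEXT_WORD = {"Are": "we", "we": "gonna", "gonna": "do", "do": "this?"}


def toy_step(word, state):
    # A real LSTM would update its state here; this one only counts steps.
    return NEXT_WORD.get(word), state + 1


sentence = generate(toy_step, "Are", initial_state=0, max_words=10)
print(sentence)
```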
The next part is the training loop. At the beginning of every epoch, we need to re-generate the training batches and reset the states of the LSTM unit to zero:
For each batch, we will compute the loss and update the parameters. Every 100 iterations, we will print out the loss to monitor the learning progress. And every 1000 iterations, we will see how well the model can generate content by calling the predict method (we haven’t implemented it yet, but we are about to):
For now, that is the training loop that we need.
The last thing we have to do is to implement the predict method. What we need to do can be illustrated as follows:
It can be boiled down into two steps:
- From the initial words, get the final output and final state.
So this step is much like the training process when the initial text contains more than two words:
Next, we need to compute the index from the output. Normally, we could use tf.argmax to get the index of the element with the maximum probability. But that is not good, since the model would then yield the same result every time we use the same initial words.
Rather than using tf.argmax, we will randomly choose among the five elements with the highest probabilities to get the index. For convenience, let’s define a method for that:
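A helper for this could look like the NumPy sketch below. It assumes a uniform pick among the top entries (the probabilities are only used for ranking). The name `get_word` and the `rng` parameter are made up for illustration:

```python
import numpy as np


def get_word(probs, top_n=5, rng=None):
    """Pick one index uniformly at random among the top_n most likely."""
    if rng is None:
        rng = np.random.default_rng()
    probs = np.asarray(probs).ravel()
    # argsort is ascending, so the last top_n entries are the most likely.
    top_indices = np.argsort(probs)[-top_n:]
    return int(rng.choice(top_indices))


probs = [0.01, 0.30, 0.02, 0.25, 0.20, 0.10, 0.12]
word_id = get_word(probs, top_n=5, rng=np.random.default_rng(0))
print(word_id)
```

Setting `top_n=1` brings back the argmax behaviour, which is handy for checking that the rest of the pipeline is deterministic.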
Let’s use that method to get the index token, convert it to string and append it to the end of the result sequence:
- Use the found index token and the final state above to generate text
In the second step, we will use the index token found above as the word “is” in Fig. 3. The process could continue endlessly, so we need to set a limit on how many words we want. Below is how we do it in code:
We limit the prediction to 200 words and begin the loop. It is pretty much the same as the loop in step 1, except that we use the previous output as input. We also append the output to the result sequence and print it out when the loop is finished.
And that is it, guys. Let’s hit run:
If there is nothing wrong with the implementation, you will see something like below in the console:
Now, go get yourself a cup of coffee while enjoying your AI’s creative work.
You have made it to the end. Well done, everyone! The complete code and training data can be found at my repo.
Hope that you enjoyed the post, and please let me know if you have any problems. Thank you guys for reading.