Using AWS SageMaker to train a model to generate text (part 2)

This is part 2 following on from my previous post, investigating how to take advantage of AWS SageMaker to train a model and use it to generate text for my Twitter bot, @kevinhookebot.

The AWS SageMaker docs, describing how to get data into a supported format for training a model, mention: “A script to convert data from tokenized text files to the protobuf format is included in the seq2seq example notebook”.

Ok, so from the SageMaker Notebook I created in part 1, let’s start it up via the AWS console:

Once it has started, click the ‘Open’ link to open Jupyter, then open the seq2seq example, which is in the ‘SageMaker Examples’ section:

Looking at the steps in this example notebook, it’s clear that the seq2seq (sequence-to-sequence) algorithm is more focused on translating text from source to destination (such as translating text in one language to another, as shown in this example notebook).

Ok, so this isn’t what I was looking for, so let’s change gears. My main objective is to be able to train a new model using the AWS SageMaker service, and generate text from it. From what I understand so far, you have two options for using SageMaker. You can either use the AWS Console for SageMaker to create Training Jobs using the built-in algorithms, or you can use a Jupyter notebook and define the steps yourself using Python to retrieve your data source, prepare the data, and train a model.

At this point the easiest thing might be to look for another Recurrent Neural Net (RNN) to generate characters to replace the Lua Torch char-rnn approach I was previously running locally on an Ubuntu server. Doing some searching I found char-rnn.pytorch.

This is my first experience setting up a Jupyter notebook, so at this point I’ve no idea if what I’m doing is the right approach, but I’ve got something working.

On the right-hand side of the notebook I pressed the New button and selected a Python PyTorch notebook:

Next I added a step to clone the char-rnn.pytorch repo into my notebook:

Next I added a step to use the aws cli to copy my data file for training the model into my notebook:

Next, adding the config options to train a model using char-rnn.pytorch, I added a step to run the training, but it gave an error about some Python modules missing:

Adding an extra step to use pip to install the required modules:

The default number of epochs is 2,000, which takes a while to run, so I decreased this to something smaller with --n_epochs 100. With that we get a successful run, and calling the generate script, we have content!
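Since the notebook cells themselves aren’t reproduced here, the steps above could be sketched roughly like this in Python. This is only a sketch under assumptions, not the exact cells from the notebook: the S3 bucket and file names are hypothetical placeholders, the repo URL and the pip module names (tqdm, unidecode) are assumptions, and the ".pt" model file name is an assumption about where train.py saves its output. By default it only prints the commands rather than running them.

```python
import shlex
import subprocess

# Assumed location of the char-rnn.pytorch repo (not stated in the post).
REPO_URL = "https://github.com/spro/char-rnn.pytorch.git"


def build_steps(s3_uri, data_file, n_epochs=100):
    """Return the notebook steps as a list of argv lists, in order."""
    # Assumption: train.py saves the model next to you as <basename>.pt
    model_file = data_file.rsplit(".", 1)[0] + ".pt"
    return [
        ["git", "clone", REPO_URL],
        ["aws", "s3", "cp", s3_uri, data_file],
        ["pip", "install", "tqdm", "unidecode"],  # assumed missing modules
        ["python", "char-rnn.pytorch/train.py", data_file,
         "--n_epochs", str(n_epochs)],
        ["python", "char-rnn.pytorch/generate.py", model_file],
    ]


def run_steps(steps, dry_run=True):
    """Print each command; only execute it when dry_run is False."""
    for argv in steps:
        print("$ " + " ".join(shlex.quote(a) for a in argv))
        if not dry_run:
            subprocess.run(argv, check=True)


if __name__ == "__main__":
    # Hypothetical bucket and file name, for illustration only.
    run_steps(build_steps("s3://my-example-bucket/posts.txt", "posts.txt"))
```

Keeping the steps as data makes it easy to change the epoch count or the input file and re-run the whole sequence from a single cell.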

I trained with an incredibly small file to get started, just 100 lines of text, for a very short time. So next steps I’m going to look at:

  • training with the full WordPress export of all my posts for a longer training time
  • training with a cleaned up export (remove URL links and other HTML markup)
  • automate the text generation from the model to feed my AWS Lambda based bot
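For the cleanup step, a minimal sketch of what stripping links and markup might look like is below. It assumes the export is plain text with inline HTML, and it uses simple regular expressions rather than a real HTML parser, so treat it as a starting point; the function name clean_post is made up for illustration.

```python
import html
import re

TAG_RE = re.compile(r"<[^>]+>")        # any HTML tag, e.g. <p> or <a href="...">
URL_RE = re.compile(r"https?://\S+")   # bare http(s) links
SPACE_RE = re.compile(r"[ \t]+")       # runs of spaces/tabs left behind


def clean_post(text):
    """Strip HTML tags and URLs from one exported post, keeping the prose."""
    text = TAG_RE.sub(" ", text)       # drop tags first, including href URLs
    text = html.unescape(text)         # turn entities like &amp; back into characters
    text = URL_RE.sub("", text)        # drop any URLs left in the body text
    lines = (SPACE_RE.sub(" ", line).strip() for line in text.splitlines())
    return "\n".join(line for line in lines if line)


if __name__ == "__main__":
    sample = '<p>Read <a href="http://example.com/x">this post</a> &amp; enjoy</p>'
    print(clean_post(sample))
```

Removing tags before unescaping entities means that escaped markup in a post (for example a code sample showing &lt;b&gt;) survives as text instead of being stripped.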

I’ll share another update on these enhancements in my next post.


Retraining my Recurrent Neural Net with content from this blog

Recently I trained torch-rnn with all my previous tweets, and used it to generate tweets from an AWS Lambda, which results in some incomprehensible but somewhat recognizable content typical of my software development tweets like this:

and this:

In most cases the vocab it’s generating new content with has a high occurrence of words I’d typically use, so computer, software, hardware and code are all pretty common in the output. I don’t think 2000+ tweets of 240 characters or less is a particularly great sample of data though, so I wondered what it would be like if I trained it with more data.

I have 2000+ articles on my blog here, so I ran a SQL query to extract all the post text to a file (see here), and then fed this 4MB file into the training script. The script has been running on an Ubuntu VM on my rack server for almost 24 hours at this point, and it’s probably the most load I’ve had on my server: the VM’s single vCPU is maxed out at 100%, although the server itself still has plenty of free vCPUs and RAM. It’s getting a little on the warm side in my office right now.

The torch-rnn script to train your model writes out a checkpoint file of the model in progress so far about once every hour, so it’s interesting to see how the generated content improves with every additional hour of training.

Here are some examples, starting with checkpoint 1 and then a few successive checkpoints, all running with temperature 0.7 (which gives good results after more training, but pretty wacky output earlier in the training):

Checkpoint 1, after about 2 hours:

a services the interease the result pecally was each service installing this up release have for a load have on vileent there at of althe Mork’ on it’s deforver, a some for

Checkpoint 5:

Store 4 minimal and Mavera FPC to speed and used that the original remeption of the Container and released and problem is any sudo looks most chated and Spring Setting the Started Java tagger

Checkpoint 10:

react for Java EE development and do it compended to the Java EE of code that yet this showing the desting common already back to be should announced to tracker with the problem and expenting

Checkpoint 15:

never that means that all performance developers of the just the hand of a microsch phone as not support with his all additional development though it’s better with the same by worker apache

Checkpoint 19:

The Java becomes are your server post configuring Manic Boot programming code in the PS3 lattled some time this is the last of the Docker direction is one it and a check the new features and a few new communities on the first is seen the destining

Getting pretty interesting at this point! Interesting that certain words appear pretty regularly in the generated output, although I don’t think I’ve included them in articles that often. PS2 and PS3 appear a lot, programming and computer are expected given the frequency in the majority of my articles, and there’s a lot of Java, Microsoft, Oracle, Docker and containers showing up.

I’m not sure how much longer the training is going to run for on a 4MB text file which I didn’t think was that large, but it’s been running for almost 24 hours at this point. I’ll let it run for another day and then see what the output looks like then.

If you start to see the tweets looking slightly more coherent over the next couple of days, it’s because the AWS Lambda is now using content generated from checkpoints of this new model, which should hopefully sound slightly more natural given the larger training input.

Generating tweets using a Recurrent Neural Net (torch-rnn)

Even if you’re not actively following recent trends in AI and Machine Learning, you may have come across articles by a researcher who experiments with training neural nets to generate interesting things such as:

  • Brown salmon in oil. Add creamed meat and another deep mixture

  • Chocolate Pickle Sauce
  • Completely Meat Chocolate Pie

So what’s going on here? What’s being used is something called a Recurrent Neural Net to generate text in a specific style. It’s trained with input data, which it analyzes to recognize patterns in the text, constructing a model of that data. It can then generate new text following the same patterns, sometimes with rather curious and amusing results.

A commonly referenced article on this topic is by Andrej Karpathy, titled “The Unreasonable Effectiveness of Recurrent Neural Networks” – it’s well worth a read to get an understanding of the theory and approach.

There are many RNN implementations you can download and start training with any input data you can imagine. Here are a few to take a look at:

So it occurred to me, what would happen if you trained an RNN with all your past Twitter tweets, and then used it to generate new tweets? Let’s find out 🙂

Let’s try it out with torch-rnn – the following is a summary of install steps from

sudo apt-get -y install python2.7-dev
sudo apt-get install libhdf5-dev

Install torch, from :

git clone ~/torch --recursive
cd ~/torch; bash install-deps;
#source new PATH for first time usage in current shell
source ~/.bashrc

Now clone the torch-rnn repo:

git clone

Install torch deps:

luarocks install torch
luarocks install nn
luarocks install optim
luarocks install lua-cjson

Install torch-hdf5:

git clone
cd torch-hdf5
luarocks make hdf5-0-0.rockspec

Install pip to install python deps:

sudo apt-get install python-pip

From inside torch-rnn dir:

pip install -r requirements.txt

Now following steps from docs to preprocess your text input:

python scripts/ \
  --input_txt my_data.txt \
  --output_h5 my_data.h5 \
  --output_json my_data.json

For my input tweet text this looks like:

python scripts/ \
  --input_txt ~/tweet-text/tweet-text.txt  \
  --output_h5 ~/tweet-text/tweet-text.h5 \
  --output_json ~/tweet-text/tweet-text.json

This gives me:

Total vocabulary size: 182
Total tokens in file: 313709
  Training size: 250969
  Val size: 31370
  Test size: 31370
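These numbers can be reproduced with a little arithmetic. The sketch below assumes the preprocessing script treats each character as a token, holds out 10% of the tokens each for validation and test (rounding down), and gives the remainder to training; that assumption is consistent with the sizes reported above, but the function names here are made up for illustration.

```python
def vocab_size(text):
    """Number of distinct characters, assuming one token per character."""
    return len(set(text))


def split_sizes(total_tokens, val_frac=0.1, test_frac=0.1):
    """Return (train, val, test) sizes; val/test round down, train gets the rest."""
    val_size = int(val_frac * total_tokens)
    test_size = int(test_frac * total_tokens)
    train_size = total_tokens - val_size - test_size
    return train_size, val_size, test_size


if __name__ == "__main__":
    print(split_sizes(313709))  # (250969, 31370, 31370)
```

A vocabulary of 182 for tweet text is plausible on this reading: upper and lower case letters, digits, punctuation, plus whatever emoji and accented characters show up in the tweets.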

Now to train the model:

th train.lua \
  -input_h5 my_data.h5 \
  -input_json my_data.json

For my input file containing my tweet text this looks like:

th train.lua \
  -input_h5 ~/tweet-text/tweet-text.h5 \
  -input_json ~/tweet-text/tweet-text.json

This gave me this error:

init.lua:389: module 'cutorch' not found:No LuaRocks module found for cutorch

 no field package.preload['cutorch']

Trying to manually install cutorch I got errors about cuda toolkit:

CMake Error at /usr/share/cmake-3.5/Modules/FindCUDA.cmake:617 (message):


Checking the docs:

By default this will run in GPU mode using CUDA; to run in CPU-only mode, add the flag -gpu -1

… so adding -gpu -1 and trying again, now I’ve got this output as it runs:

Epoch 1.44 / 50, i = 44 / 5000, loss = 3.493316

… one line every few seconds.
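To keep an eye on how far along the training is, the progress line can be parsed with a few lines of Python. This is a sketch that assumes every progress line has the same shape as the one above. The second function shows where the 5000 likely comes from, assuming a batch size of 50 and a sequence length of 50 (which I believe are the torch-rnn defaults, but treat that as an assumption): roughly 100 iterations per epoch, times 50 epochs.

```python
import re

LINE_RE = re.compile(
    r"Epoch (?P<epoch>[\d.]+) / (?P<max_epochs>\d+), "
    r"i = (?P<i>\d+) / (?P<total>\d+), loss = (?P<loss>[\d.]+)"
)


def parse_progress(line):
    """Extract epoch, iteration, loss and fraction done from one log line."""
    m = LINE_RE.search(line)
    if m is None:
        return None  # not a progress line
    i, total = int(m["i"]), int(m["total"])
    return {
        "epoch": float(m["epoch"]),
        "iteration": i,
        "total_iterations": total,
        "loss": float(m["loss"]),
        "fraction_done": i / total,
    }


def iterations_per_epoch(train_size, batch_size=50, seq_length=50):
    """Each iteration consumes batch_size * seq_length training tokens."""
    return train_size // (batch_size * seq_length)


if __name__ == "__main__":
    print(parse_progress("Epoch 1.44 / 50, i = 44 / 5000, loss = 3.493316"))
    print(iterations_per_epoch(250969) * 50)  # 5000
```

This also explains the fractional epoch: with 100 iterations per epoch, iteration 44 is 44% of the way through the first epoch, hence 1.44.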

After some time it completes a run, and you’ll find files like this in your cv dir beneath where you ran the previous script:


Now to run and get some generated text:

th sample.lua -checkpoint cv/checkpoint_5000.t7 -length 500 -gpu -1 -temperature 0.4

Breaking this down:

-checkpoint : as the model training runs, it saves point-in-time snapshots of the model. You can run the generation against any of these files, but it seems the last file it generates gives you the best results

-length : how many characters to generate from the model

-gpu -1 : turn off the gpu usage

-temperature : this ranges from 0.1 to 1 and with values closest to zero the generation is less creative, closer to 1 the generated output is, let’s say, more creative
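To get a feel for what temperature does, here is a minimal sketch of temperature-scaled sampling in plain Python. It is not torch-rnn’s code; it only illustrates the general idea that the model’s raw scores for each candidate character are divided by the temperature before being turned into probabilities. The characters and scores are made up for illustration.

```python
import math
import random


def softmax_with_temperature(logits, temperature=1.0):
    """Turn raw scores into probabilities, sharpened or flattened by temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def sample_char(chars, logits, temperature=1.0, rng=random):
    """Pick one character according to the temperature-adjusted probabilities."""
    probs = softmax_with_temperature(logits, temperature)
    return rng.choices(chars, weights=probs, k=1)[0]


if __name__ == "__main__":
    chars = ["e", "t", "z"]
    logits = [2.0, 1.0, 0.0]  # made-up scores for illustration
    for t in (0.1, 0.5, 1.0):
        print(t, [round(p, 3) for p in softmax_with_temperature(logits, t)])
```

At a low temperature almost all of the probability lands on the highest-scoring character, so the output is repetitive; at 1.0 the unlikely characters get picked far more often, so the output is more varied but also more random.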

Let’s run a couple of examples. Let’s do 140 chars at -temperature 0.1:

The programming to softting the some the programming to something the computer the computer the computer to a computer the com

and now let’s crank it up to 1.0:

z&loDOps be sumpriting sor’s a porriquilefore AR2 vanerone as dathing 201lus: It’s buct. Z) Amatere. PEs’me tha

Now we’ve got some pretty random stuff, including a randomly generated shortened URL too.

Using a value towards the middle, like 0.4 to 0.5 gets some reasonably interesting results that are not too random, but somewhat similar to my typical tweet style. What’s interesting is my regular retweets of software development quotes from @CodeWisdom have heavily influenced the model, so based on my 3000+ tweets it generates text like:

RT @CodeWisdom followed by random generated stuff

Given that the following text is clearly not content from @CodeWisdom, it wouldn’t be appropriate to use this text as-is and post it as a new tweet. Since I’m looking to take this text and use it as input for an automated Twitter-bot, as interesting as this generated pattern is in that it does look like the majority of my tweets, I’ve filtered out anything that starts with ‘RT @text’
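The filtering itself can be very simple. A minimal sketch, assuming the generated sample is split into one candidate tweet per line (the function name usable_tweets is made up for illustration):

```python
def usable_tweets(generated_text):
    """Split generated text into lines, dropping blanks and retweet-style lines."""
    tweets = []
    for line in generated_text.splitlines():
        line = line.strip()
        # Skip empty lines and anything that looks like a retweet of someone else.
        if not line or line.startswith("RT @"):
            continue
        tweets.append(line)
    return tweets


if __name__ == "__main__":
    sample = (
        "RT @CodeWisdom followed by random generated stuff\n"
        "The programming to something the computer\n"
    )
    print(usable_tweets(sample))
```

Filtering on the "RT @" prefix drops the whole line rather than just the prefix, so the bot never attributes generated text to a real account.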

I’ve already implemented a first attempt at a Twitter bot using this content, with an AWS Lambda running on a timed schedule. You can check it out here:


I’ll be following up with some additional posts on the implementation of my AWS Lambda soon.