Using AWS SageMaker to train a model to generate text (part 3): What I learned about Python AWS Lambdas this week

This is a follow-on from my investigation on how to use AWS SageMaker as an AWS replacement to my current approach to generate text from a Machine Learning model. Up until this point I’ve been running torch-rnn on a server locally. You can follow part 1 and part 2 of my progress so far.

In summary, here’s what I learned this week:

  • Some Python modules are OS platform specific. That means, if you install a module on MacOS, you can’t zip it up as a dependency in a .zip deployment for a Lambda which runs on Linux, as it won’t be OS compatible
  • The maximum size for an AWS Lambda deployment is 50MB. Zipping up what I’ve built so far (only a minimal script but relying on a number of modules) I’ve got a 500MB zip file. Clearly that’s too large to deploy as a Lambda
  • Following some suggestions here,  there are Python frameworks (such as Zappa) to help build Python based AWS Lambdas and address some of the issues with modules and deployment. Clearly I’ve got some learning here to get his to work 🙂

Using AWS SageMaker to train a model to generate text (part 2)

This is part 2 following on from my previous post, investigating how to take advantage of AWS SageMaker to train a model and use it to generate text for my Twitter bot, @kevinhookebot.

From the AWS SageMaker docs, in order to get the data in a supported format to use to train a model, it mentions “A script to convert data from tokenized text files to the protobuf format is included in the seq2seq example notebook”

Ok, so from the SageMaker Notebook I created in part 1, let’s start it up via the AWS console:

Once started, clicking the ‘Open’ link to open the Jupyter notebook, we can open the seq2seq example which is in the ‘SageMaker Examples’ section:

From looking at the steps in this example Notebook, it’s clear that this character2character algorithm is more focused on translating text from source to destination (such as translating text in one language to another, as shown in this example notebook).

Ok, so this isn’t what I was looking for so let’s change gears. My main objective is to be able to train a new model using AWS SageMaker service, and generate text from it. From what I understand so far, you have two options how you can use SageMaker. You can either use the AWS Console for SageMaker to create Training Jobs using the built in algorithms, or you can use a Juypter notebook and define the steps yourself using Python to retrieve your data source, prepare the data, and train a model.

At this point the easiest thing might be to look for another Recurrent Neural Net (RNN) to generate characters to replace the Lua Torch char-rnn approach I was previously running locally on an Ubuntu server. Doing some searching I found char-rnn.pytorch.

This is my first experience setting up a Juypter notebook, so at this point I’ve no idea if what I’ve doing is the right approach, but I’ve got something working.

On the righthand side of the notetbook I pressed the New button and selected a Python PyTorch notebook:

Next I added a step to clone the char-rnn.pytorch repo into my notebook:

Next I added a step to use the aws cli to copy my data file for training the model into my notebook:

Next, adding the config options to train a model using char-rnn.pytorch, I added a step to run the training, but it gave an error about some Python modules missing:

Adding an extra step to use pip to install the required modules:

The default number of epochs is 2,000 which takes a while to run, so decreasing this to something smaller with –n_epochs 100 we get a successful run, and calling the generate script, we have content!

I trained with an incredibly small file to get started, just 100 lines of text, for a very short time. So next steps I’m going to look at:

  • training with the full WordPress export of all my posts for a longer training time
  • training with a cleaned up export (remove URL links and other HTML markup)
  • automate the text generation from the model to feed my AWS Lambda based bot

I’ll share another update on these enhancements in my next upcoming post.

 

Using AWS Sagemaker to train a model to generate text (part 1)

If you’ve followed any of my recent posts, you’ll know I have been using RNN models to generate text from a model trained with my previous tweets, and the text from all of my previous blog posts, and feeding this into a Twitter bot: @kevinhookebot

The trouble I have right now is the scripts and generate models are running using Lua, and although I could install this to an EC2 instance, I don’t want to pay for an EC2 instance being up 100% of the time. Currently when I generate a new batch of text for my Twitter bot, I startup a local server running the scripts and the model, generate new text, and then stage it to DynamoDB to get picked up by the bot when it’s scheduled to next run. With the AWS provided Machine Learning services, there has to be something out of the box I can use on AWS that would automate these steps.

Let’s take a look at using AWS SageMaker.

First I created a SageMaker notebook with a new role, to access S3 buckets with ‘sagemaker’ in the name.

Then I created an S3 bucket – sagemaker-kevinhooke-ml – and uploaded a copy of my data file (all my previous posts from this blog, concatenated into a single file).

Next I created a new Training Job.

You need to pick an algorithm for the training and there’s a selection of provided algorithms for different purposes. To generate new text ‘in the style of’ the text that I’m going to training the model with, the ‘Sequence2Sequence’ looks like it does what I need.

On completing the Training Job, I got this error:

Ok, so let’s change the instance type. I picked the smallest of the instances before:

And it looks like you can’t change the type on the Notebook. So let’s create a new Notebook. Looking at the instance types, the ones with GPU support are on the large side, so let’s pick the smallest of the options and try again.

At this point I realized the instance type it’s talking about is for the training job not the notebook, and it’s specified here:

So let’s pick one of the GPU types and try again.

First training job is running:

Next error:

Hmm, off to do some reading in the docs to see what’s needed to run this training job. The docs here describe what’s needed for the sequence2sequence algorithm and I’m clearly missing some steps, so taking a pause here and will come back with an update later.

Retraining my Recurrent Neural Net with content from this blog

Recently I trained torch-rnn with all my previous tweets, and used it to generate tweets from an AWS Lambda, which results in some incomprehensible but somewhat recognizable content typical of my software development tweets like this:

and this:

In most cases the vocab it’s generating new content with has a high occurance of words I’d typically use, so computer, software, hardware, code are all pretty common in the output. Training the model with 2000+ tweets of 240 characters or less though I don’t think is a particular great sample of data, so I wondered what it would be like if I trained it with more data.

I have 2000+ articles on my blog here, so I ran a sql query to extract all the post text to a file (see here), and then fed this 4MB file into the training script. The script has been running on an Ubuntu VM on my rack server for almost 24 hours at this point, and it’s probably the most load I’ve had on my server (the 1 vCPU on the VM is maxed, but the server itself still has plenty of free vCPUs and RAM remaining, but this one vCPU is currently running 100%). It’s getting a little on the warm side in my office right now.

The torch-rnn script to train your model writes out a checkpoint file of the model in progress so far about once every hour, so it’s interesting to see how the generated content improves with every additional hour of training.

Here’s some examples starting with checkpoint 1, and then a few successive checkpoints as examples, running with temperature 0.7 (which gives good results after more training, but pretty wacky output earlier in the training):

Checkpoint 1, after about 2 hours:

a services the interease the result pecally was each service installing this up release have for a load have on vileent there at of althe Mork’ on it’s deforver, a some for

Checkpoint 5:

Store 4 minimal and Mavera FPC to speed and used that the original remeption of the Container and released and problem is any sudo looks most chated and Spring Setting the Started Java tagger

Checkpoint 10:

react for Java EE development and do it compended to the Java EE of code that yet this showing the desting common already back to be should announced to tracker with the problem and expenting

Checkpoint 15:

never that means that all performance developers of the just the hand of a microsch phone as not support with his all additional development though it’s better with the same by worker apache

Checkpoint 19:

The Java becomes are your server post configuring Manic Boot programming code in the PS3 lattled some time this is the last of the Docker direction is one it and a check the new features and a few new communities on the first is seen the destining

Getting pretty interesting at this point! Interesting that certain words appear pretty regularly in the generated output, although I don’t think I’ve included them in articles that often. PS2 and PS3 appear a lot, programming and computer are expected given the frequency in the majority of my articles, and there’s a lot of Java, Microsoft, Oracle, Docker and containers showing up.

I’m not sure how much longer the training is going to run for on a 4MB text file which I didn’t think was that large, but it’s been running for almost 24 hours at this point. I’ll let it run for another day and then see what the output looks like then.

If you start to see the tweets looking slightly more coherent over the next couple of days, the AWS Lambda is starting to use content generated from these new checkpoints on this new model, so it should be slightly more natural sounding hopefully, given the larger input file for training the new model.