Running aitextgen model training in a Docker container

I’m setting up a way to run text generation model training jobs on demand with aitextgen, and the first approach I’m looking at is to run the training in a Docker container. Later I may move this to an AWS service like ECS, but this is my first step.

I’ve built a Docker image with the following Dockerfile:

FROM amazonlinux
RUN yum update -y
RUN yum install -y python3
RUN pip3 install aitextgen
ADD source-file-for-fine-tuning.txt .
ADD generate.py .
ADD train.py .

... and then built my image with:

docker build -t aitextgen .

I then run a container, passing in the command I want to run, in this case ‘python3 train.py’:

docker run --volume /data/trained_model:/trained_model:rw -d aitextgen sh -c "cd / && python3 train.py && mv aitextgen.tokenizer.json /trained_model"

I’m also attaching a bind mount where the model output is written during the run, and passing -d to run the container in the background. The last step in the run command moves the tokenizer file to the mounted EBS volume so it can be reused for generation.
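Since the goal is to run these jobs on demand, the training command above could be assembled by a small script. This is only a sketch: build_train_command is a hypothetical helper name, and the defaults simply mirror the host path and image name used above.

```python
import shlex


def build_train_command(host_dir="/data/trained_model", image="aitextgen"):
    """Return the `docker run` argument list for a detached training job."""
    # Shell steps run inside the container, mirroring the command above.
    inner = "cd / && python3 train.py && mv aitextgen.tokenizer.json /trained_model"
    return [
        "docker", "run",
        # Bind-mount the host directory so the model output outlives the container.
        "--volume", f"{host_dir}:/trained_model:rw",
        "-d",  # detached: run the container in the background
        image,
        "sh", "-c", inner,
    ]


if __name__ == "__main__":
    # Print the command for inspection rather than launching it.
    print(shlex.join(build_train_command()))
```

To actually launch the job, the returned list can be passed to subprocess.run. Keeping the command as a list avoids shell quoting problems around the sh -c string.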

To generate text from the model, run:

docker run --volume /data/trained_model:/trained_model:rw -d aitextgen sh -c "cd / && python3 generate.py"

Experimenting with the gpt-2 models and text generation

I’m experimenting with the gpt-2 774M model to generate text based on a prompt. Starting up with:

python3 src/interactive_conditional_samples.py --temperature=0.7 --model_name=774M --nsamples=1 --length=100

And then providing a relevant prompt like:

“The problem with training Machine Learning models today is that the model is only as good as the data it is trained with.”

and let’s see what we get:

This almost sounds believable, as if it was written by hand. It might not make complete sense, but the sentences are well formed, which was an issue in my previous experiments with RNN text generation.

Here’s the text for reference:

“There is a desire to train the model on a large data set.

A very big data set is not always a good enough data set.

A good data set is more than just enough. A good data set is always growing.

To take advantage of this growing data set we need to train the model on the latest batch of data. However, there may be a huge amount of data that does not fit in the pre-defined batch size.”

I’m curious how the temperature setting affects what is generated, so let’s bump it up from 0.7 to 0.9 and try again with the same prompt text.
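As a rough illustration of what temperature does in general: the model's raw scores for each candidate next token are divided by the temperature before being turned into probabilities, so a higher value flattens the distribution and makes less likely tokens easier to pick. The scores below are made up for the example and are not taken from gpt-2.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Convert raw scores to probabilities, scaled by a sampling temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    # Subtract the max before exponentiating for numerical stability.
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


# Made-up scores for three candidate next tokens (illustrative only).
logits = [2.0, 1.0, 0.1]
for t in (0.7, 0.9):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 3) for p in probs])
```

At 0.9 the top candidate gets a smaller share of the probability than at 0.7, which is why higher temperatures tend to produce more surprising text.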

This sounds like it comes straight out of a research paper:

“This is why there is now a trend toward training a workforce of highly trained humans in a search engine optimization function (SRE) to monitor changes in signal and value in the big data.

Whether it is a search engine algorithm like Chi-squared or a machine learning model like a neural network, there is still a lot of work ahead to make those apps more relevant to the data they collect.

The need for better data

Having data that makes the predictions bigger, on”

Not knowing exactly how this works, it almost reads like it’s regurgitating whole chunks of the data it was trained on, but who knows. Given that it was trained from materials and sites online, I guess the easy way to test for this would be to just drop one of these sentences in Google and see if something like the source article comes up. Trying this out:

Huh. So this sentence is not found anywhere in Google’s indexes. That’s quite remarkable if the model actually generates sentences that are this convincing.
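If you want to repeat this kind of check for several generated sentences, an exact-phrase search link can be built with a few lines of Python. This is a minimal sketch: exact_phrase_search_url is a hypothetical helper, and the sentence used is just one picked from the generated text earlier in this post.

```python
from urllib.parse import quote_plus


def exact_phrase_search_url(sentence):
    """Build a Google search URL that looks for the sentence as an exact phrase."""
    # Wrapping the text in double quotes asks the search engine for an exact match.
    return "https://www.google.com/search?q=" + quote_plus(f'"{sentence}"')


print(exact_phrase_search_url("A good data set is always growing."))
```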

What about starting with a line from a recipe (for example from here):

“Heat oil in large pot. Add beef and flour, stirring to coat beef; brown.”

Ok, now this is interesting but slightly weird. It’s clearly recognizable as steps in a recipe, but the order doesn’t make too much sense and some of the steps are repeated variations of themselves:

“Add onions and garlic and cook until onion starts to turn translucent, 3 minutes. Stir in bell pepper, glass tomatoes, and celery; cook 1 minute. Add potatoes, beef broth, and bay leaves; bring to a boil. Reduce heat to low. Reduce heat to medium; cook until beef is no longer pink. Taste and add salt and pepper. Add beef and stir. Add liquid; cover and simmer until meat is tender, 10 minutes.”

What about lines from films?

“I find your lack of faith disturbing”

Now we’ve gone off into Crazy Town. This is weird:

“Why do you persist in viewing my appearance with contempt?”

“I want to believe I am worthy of your worship.”

“Therefore, you have no right to demand I lift my gaze.”

“I am unworthy of your reverence; I can only pray that you may address me in a less lewd manner.”

“You call me a beast?”

“You cannot deny the very fact you see me.”

“I see nothing.”

“Why?”

Whoa. Ok, keeping with the Star Wars theme, what does it do if you start it with a Yoda quote:

“When 900 years old you reach, look as good you will not”

And I got:

“I listen to every guru and I say, good luck with your yoga and counting numbers.”

Does this phrase exist anywhere?

Nope. You have to wonder if there were some Yoda quotes or lines from films that went into the model. It’s interesting that ‘yoga’ is mentioned, and you wonder if this is the model generating a variation of the name ‘Yoda’. And ‘When 900 years old you reach’ = ‘Counting numbers’? Very curious.

Ok, one more. How about a line from an HP monitor product description:

“Designed with purpose: multiple customization options allow you to adjust your monitor to best fit the way you work.”

Incredible:

“Customizable paired ergonomics and colors: customize every feature of your new monitor from its basic design and color customization options.

Intuitive control links: there are more than 100

cabling connections provided with each monitor.

Designed for comfort: more than twice as many inches of space between the screen and the desk top than standard monitors.

Dedicated power buttons for powering the monitor on and off.”

Ok, enough playing, I did have a project planned to use with this, so time to get busy 🙂

Issues running gpt-2 with provided Docker image

After following the instructions to build the gpt-2 Docker image here, I started up a bash shell into the container:

docker run -it gpt-2 bash

And then ran:

python3 src/generate_unconditional_samples.py | tee /tmp/samples

This failed with this error:

AttributeError: module 'tensorflow' has no attribute 'sort'

A post here says to upgrade to TensorFlow 1.14.0 in the container, so running:

pip install tensorflow==1.14.0

And then retrying, text generated! Now to start playing and see what the provided models will generate!
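A quick way to confirm the fix inside the container, before rerunning the sample script, is to check whether the installed module exposes the attribute the error complained about. This is a small sketch; module_has_attr is a hypothetical helper, not part of gpt-2 or TensorFlow.

```python
import importlib


def module_has_attr(module_name, attr):
    """Return True if the module imports cleanly and exposes the attribute."""
    try:
        module = importlib.import_module(module_name)
    except ImportError:
        # Module is not installed in this environment.
        return False
    return hasattr(module, attr)


if __name__ == "__main__":
    # False here means the AttributeError above would still occur.
    print(module_has_attr("tensorflow", "sort"))
```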

Using AWS SageMaker to train a model to generate text (part 2)

This is part 2 following on from my previous post, investigating how to take advantage of AWS SageMaker to train a model and use it to generate text for my Twitter bot, @kevinhookebot.

The AWS SageMaker docs, describing how to get data into a supported format for training a model, mention: “A script to convert data from tokenized text files to the protobuf format is included in the seq2seq example notebook”.

Ok, so from the SageMaker Notebook I created in part 1, let’s start it up via the AWS console:

Once started, clicking the ‘Open’ link to open the Jupyter notebook, we can open the seq2seq example which is in the ‘SageMaker Examples’ section:

From looking at the steps in this example Notebook, it’s clear that this character2character algorithm is more focused on translating text from source to destination (such as translating text in one language to another, as shown in this example notebook).

Ok, so this isn’t what I was looking for, so let’s change gears. My main objective is to be able to train a new model using the AWS SageMaker service, and generate text from it. From what I understand so far, you have two options for using SageMaker. You can either use the AWS Console for SageMaker to create Training Jobs using the built-in algorithms, or you can use a Jupyter notebook and define the steps yourself using Python to retrieve your data source, prepare the data, and train a model.

At this point the easiest thing might be to look for another Recurrent Neural Net (RNN) to generate characters to replace the Lua Torch char-rnn approach I was previously running locally on an Ubuntu server. Doing some searching I found char-rnn.pytorch.

This is my first experience setting up a Jupyter notebook, so at this point I’ve no idea if what I’m doing is the right approach, but I’ve got something working.

On the right-hand side of the notebook I pressed the New button and selected a Python PyTorch notebook:

Next I added a step to clone the char-rnn.pytorch repo into my notebook:

Next I added a step to use the aws cli to copy my data file for training the model into my notebook:

Next, adding the config options to train a model using char-rnn.pytorch, I added a step to run the training, but it gave an error about some Python modules missing:

Adding an extra step to use pip to install the required modules:

The default number of epochs is 2,000, which takes a while to run, so decreasing this to something smaller with --n_epochs 100 we get a successful run, and calling the generate script, we have content!

I trained with an incredibly small file to get started, just 100 lines of text, for a very short time. As next steps, I’m going to look at:

  • training with the full WordPress export of all my posts for a longer training time
  • training with a cleaned up export (remove URL links and other HTML markup)
  • automating the text generation from the model to feed my AWS Lambda based bot
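For the cleanup step in the list above, a first pass could look something like the sketch below. It assumes the post text has already been pulled out of the WordPress export as plain strings; clean_post_text is a hypothetical helper, and the regular expressions are deliberately crude (a proper HTML parser would be more robust for messy markup).

```python
import html
import re

TAG_RE = re.compile(r"<[^>]+>")
URL_RE = re.compile(r"https?://\S+")


def clean_post_text(raw):
    """Strip HTML tags and bare URLs from exported post text."""
    text = TAG_RE.sub(" ", raw)   # drop markup such as <p> or <a href="...">
    text = html.unescape(text)    # turn entities like &amp; back into characters
    text = URL_RE.sub(" ", text)  # drop any URLs left in the body text
    return re.sub(r"\s+", " ", text).strip()  # collapse leftover whitespace


sample = '<p>See <a href="https://example.com/post">my post</a> &amp; more at https://example.com/x today.</p>'
print(clean_post_text(sample))
```

The link text ("my post") is kept while the tag and its href are dropped, which keeps the training text readable.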

I’ll share another update on these enhancements in my next post.