Recently I trained torch-rnn on all my previous tweets and used it to generate new tweets from an AWS Lambda. The results are incomprehensible, but somewhat recognizable as typical of my software development tweets, like this:
in the programmer to far to programmangers of a something the a sere of see the with the software the something
— Kevin Hooke Bot (@kevinhookebot) April 17, 2018
and this:
the programmers with the mess a new is the computer
— Kevin Hooke Bot (@kevinhookebot) April 16, 2018
In most cases the vocabulary it generates new content from has a high occurrence of words I’d typically use, so computer, software, hardware and code are all pretty common in the output. I don’t think 2000+ tweets of 280 characters or less is a particularly great sample of data to train the model with, though, so I wondered what it would be like if I trained it with more data.
I have 2000+ articles on my blog here, so I ran a SQL query to extract all the post text to a file (see here), and then fed this 4MB file into the training script. The script has been running on an Ubuntu VM on my rack server for almost 24 hours at this point, and it’s probably the most load I’ve had on my server: the VM’s single vCPU is maxed out at 100%, although the server itself still has plenty of free vCPUs and RAM. It’s getting a little on the warm side in my office right now.
The torch-rnn training script writes out a checkpoint file of the model in progress about once every hour, so it’s interesting to see how the generated content improves with every additional hour of training.
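As a rough sketch of the workflow around those checkpoints, the Python below only builds the torch-rnn command lines for preprocessing, training and sampling, without running them. The flags are the ones documented in the torch-rnn README as far as I know; the file names, the cv/ checkpoint path, the sample length and the helper function names are placeholders made up for illustration, so check them against your own checkout before relying on them.

```python
import shlex


def preprocess_cmd(input_txt, output_h5, output_json):
    """Command that turns a plain text corpus into torch-rnn's HDF5/JSON inputs."""
    return [
        "python", "scripts/preprocess.py",
        "--input_txt", input_txt,
        "--output_h5", output_h5,
        "--output_json", output_json,
    ]


def train_cmd(input_h5, input_json, gpu=-1):
    """Command that trains the model; gpu=-1 asks for CPU-only mode (an assumption here)."""
    return [
        "th", "train.lua",
        "-input_h5", input_h5,
        "-input_json", input_json,
        "-gpu", str(gpu),
    ]


def sample_cmd(checkpoint, length=280, temperature=0.7, gpu=-1):
    """Command that generates text from one saved checkpoint."""
    return [
        "th", "sample.lua",
        "-checkpoint", checkpoint,
        "-length", str(length),
        "-temperature", str(temperature),
        "-gpu", str(gpu),
    ]


def as_shell(cmd):
    """Render an argument list as a copy-pasteable shell line."""
    return " ".join(shlex.quote(part) for part in cmd)


if __name__ == "__main__":
    # Hypothetical file names; substitute your own corpus and checkpoint paths.
    print(as_shell(preprocess_cmd("posts.txt", "posts.h5", "posts.json")))
    print(as_shell(train_cmd("posts.h5", "posts.json")))
    print(as_shell(sample_cmd("cv/checkpoint_1000.t7")))
```

Sampling each new checkpoint with the same settings makes it easier to compare the output from one checkpoint to the next.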
Here are some examples, starting with checkpoint 1 and then a few successive checkpoints, all generated with temperature 0.7 (which gives good results after more training, but pretty wacky output earlier in the training):
Checkpoint 1, after about 2 hours:
a services the interease the result pecally was each service installing this up release have for a load have on vileent there at of althe Mork’ on it’s deforver, a some for
Checkpoint 5:
Store 4 minimal and Mavera FPC to speed and used that the original remeption of the Container and released and problem is any sudo looks most chated and Spring Setting the Started Java tagger
Checkpoint 10:
react for Java EE development and do it compended to the Java EE of code that yet this showing the desting common already back to be should announced to tracker with the problem and expenting
Checkpoint 15:
never that means that all performance developers of the just the hand of a microsch phone as not support with his all additional development though it’s better with the same by worker apache
Checkpoint 19:
The Java becomes are your server post configuring Manic Boot programming code in the PS3 lattled some time this is the last of the Docker direction is one it and a check the new features and a few new communities on the first is seen the destining
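For context on the temperature of 0.7 used for the samples above: in character-level models like this, the temperature typically divides the model's raw scores before they are turned into probabilities, so values below 1 push the sampling toward the most likely next character and values above 1 make it more random. The sketch below is a generic illustration of that idea, not torch-rnn's actual code, and the scores in it are made-up numbers.

```python
import math
import random


def softmax_with_temperature(scores, temperature=1.0):
    """Convert raw scores to probabilities, scaled by a sampling temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [s / temperature for s in scores]
    # Subtract the maximum before exponentiating for numerical stability.
    highest = max(scaled)
    exps = [math.exp(s - highest) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def sample_index(scores, temperature=1.0, rng=random):
    """Pick one index at random, weighted by the temperature-scaled probabilities."""
    probs = softmax_with_temperature(scores, temperature)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


if __name__ == "__main__":
    # Made-up scores for three candidate characters.
    scores = [2.0, 1.0, 0.1]
    for temperature in (1.0, 0.7, 0.3):
        probs = softmax_with_temperature(scores, temperature)
        print(temperature, [round(p, 3) for p in probs])
```

Lowering the temperature concentrates probability on the top-scoring candidate.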
By checkpoint 19 the output is getting pretty interesting! It’s curious that certain words appear so regularly in the generated output, although I don’t think I’ve used them in articles that often. PS2 and PS3 appear a lot; programming and computer are expected, given how often they come up in the majority of my articles; and there’s a lot of Java, Microsoft, Oracle, Docker and containers showing up.
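One way to check whether a word really is over-represented in the generated output would be to compare how often it appears in the training text against how often it appears in the samples. The following is a minimal sketch of that comparison; the function names are hypothetical, and the two short strings in the demo are stand-ins for the real corpus file and a batch of generated samples.

```python
import re
from collections import Counter


def word_counts(text):
    """Count lowercase word occurrences, ignoring punctuation."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def rate_per_thousand(counts, word):
    """Occurrences of `word` per 1000 words, or 0.0 for an empty text."""
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return 1000.0 * counts[word.lower()] / total


def compare_rates(corpus_text, generated_text, words):
    """Map each word to (rate in the corpus, rate in the generated text)."""
    corpus = word_counts(corpus_text)
    generated = word_counts(generated_text)
    return {
        word: (rate_per_thousand(corpus, word), rate_per_thousand(generated, word))
        for word in words
    }


if __name__ == "__main__":
    # Stand-in strings; in practice these would be read from the corpus
    # file and from a file of generated samples.
    corpus_text = "Running Docker containers with Java on my server"
    generated_text = "the Java becomes are your server in the PS3"
    for word, rates in compare_rates(corpus_text, generated_text, ["java", "ps3"]).items():
        print(word, rates)
```

A word whose rate is much higher in the generated text than in the corpus would be one the model is favoring beyond its share of the training data.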
I’m not sure how much longer the training is going to run. I didn’t think a 4MB text file was that large, but it’s been running for almost 24 hours at this point. I’ll let it run for another day and then see what the output looks like.
If you see the tweets looking slightly more coherent over the next couple of days, it’s because the AWS Lambda has started using content generated from checkpoints of this new model. Given the larger input file used for training, the output should hopefully sound slightly more natural.