Generating tweets using a Recurrent Neural Net (torch-rnn)

Even if you’re not actively following recent trends in AI and Machine Learning, you may have come across articles by a researcher who experiments with training neural nets to generate interesting things such as:

Brown salmon in oil. Add creamed meat and another deep mixture

  • Chocolate Pickle Sauce
  • Completely Meat Chocolate Pie

So what’s going on here? What’s being used is something called a Recurrent Neural Net (RNN) to generate text in a specific style. It’s trained on input data, which it analyzes to recognize patterns in the text, constructing a model of that data. It can then generate new text following the same patterns, sometimes with rather curious and amusing results.

A commonly referenced article on this topic is Andrej Karpathy’s “The Unreasonable Effectiveness of Recurrent Neural Networks” – it’s well worth a read to get an understanding of the theory and approach.

There are many RNN implementations you can download and start training with any input data you can imagine.

So it occurred to me: what would happen if you trained an RNN with all your past Twitter tweets, and then used it to generate new tweets? Let’s find out 🙂

Let’s try it out with torch-rnn – the following is a summary of install steps from https://github.com/jcjohnson/torch-rnn:

sudo apt-get -y install python2.7-dev
sudo apt-get install libhdf5-dev

Install torch, from http://torch.ch/docs/getting-started.html#_ :

git clone https://github.com/torch/distro.git ~/torch --recursive
cd ~/torch; bash install-deps;
./install.sh
#source new PATH for first time usage in current shell
source ~/.bashrc

Now clone the torch-rnn repo:

git clone https://github.com/jcjohnson/torch-rnn.git

Install torch deps:

luarocks install torch
luarocks install nn
luarocks install optim
luarocks install lua-cjson

Install torch-hdf5:

git clone https://github.com/deepmind/torch-hdf5
cd torch-hdf5
luarocks make hdf5-0-0.rockspec

Install pip to install python deps:

sudo apt-get install python-pip

From inside torch-rnn dir:

pip install -r requirements.txt

Now, following the steps from the docs, preprocess your text input:

python scripts/preprocess.py \
  --input_txt my_data.txt \
  --output_h5 my_data.h5 \
  --output_json my_data.json

For my input tweet text this looks like:

python scripts/preprocess.py \
  --input_txt ~/tweet-text/tweet-text.txt  \
  --output_h5 ~/tweet-text/tweet-text.h5 \
  --output_json ~/tweet-text/tweet-text.json

This gives me:

Total vocabulary size: 182
Total tokens in file: 313709
  Training size: 250969
  Val size: 31370
  Test size: 31370

Now to train the model:

th train.lua \
  -input_h5 my_data.h5 \
  -input_json my_data.json

For my input file containing my tweet text this looks like:

th train.lua \
  -input_h5 ~/tweet-text/tweet-text.h5 \
  -input_json ~/tweet-text/tweet-text.json

This gave me this error:

init.lua:389: module 'cutorch' not found:No LuaRocks module found for cutorch
no field package.preload['cutorch']

Trying to manually install cutorch, I got errors about the CUDA toolkit:

CMake Error at /usr/share/cmake-3.5/Modules/FindCUDA.cmake:617 (message):
  Specify CUDA_TOOLKIT_ROOT_DIR

Checking the docs:

By default this will run in GPU mode using CUDA; to run in CPU-only mode, add the flag -gpu -1

… so let’s add -gpu -1 and try again.
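Putting that together, the CPU-only training command for my tweet data looks like this (same file paths as above, so adjust for your own input):

th train.lua \
  -input_h5 ~/tweet-text/tweet-text.h5 \
  -input_json ~/tweet-text/tweet-text.json \
  -gpu -1

Now I’ve got this output as it runs: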

Epoch 1.44 / 50, i = 44 / 5000, loss = 3.493316

… one line every few seconds.

After some time it completes a run, and you’ll find files like this in your cv dir beneath where you ran the previous script:

checkpoint_1000.json
checkpoint_1000.t7
checkpoint_2000.json
checkpoint_2000.t7
checkpoint_3000.json
checkpoint_3000.t7
checkpoint_4000.json
checkpoint_4000.t7
checkpoint_5000.json
checkpoint_5000.t7

Now to run and get some generated text:

th sample.lua -checkpoint cv/checkpoint_5000.t7 -length 500 -gpu -1 -temperature 0.4

Breaking this down:

  • -checkpoint : as the model training runs, it saves these point-in-time snapshots of the model. You can run the generation against any of these files, but it seems the last file it generates gives you the best results
  • -length : how many characters to generate from the model
  • -gpu -1 : turn off GPU usage
  • -temperature : ranges from 0.1 to 1; with values closest to zero the generation is less creative, while closer to 1 the generated output is, let’s say, more creative

Let’s run a couple of examples, generating 140 chars each. First, let’s try -temperature 0.1.
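Using the flags described above, that command looks something like this (same checkpoint file as before, with -length 140):

th sample.lua -checkpoint cv/checkpoint_5000.t7 -length 140 -gpu -1 -temperature 0.1

which gives: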

The programming to softting the some the programming to something the computer the computer the computer to a computer the com

and now let’s crank it up to -temperature 1.0:

z&loDOps be sumpriting sor’s a porriquilefore AR2 vanerone as dathing 201lus: It’s buct. Z) https://t.co/gEDr9Er24N Amatere. PEs’me tha

Now we’ve got some pretty random stuff, including a randomly generated shortened URL too.

Using a value towards the middle, like 0.4 to 0.5, gives some reasonably interesting results that are not too random, but somewhat similar to my typical tweet style. What’s interesting is that my regular retweets of software development quotes from @CodeWisdom have heavily influenced the model, so based on my 3,000+ tweets it generates text like:

RT @CodeWisdom followed by randomly generated stuff

Given that the text following ‘RT @CodeWisdom’ is clearly not content from @CodeWisdom, it wouldn’t be appropriate to use it as-is and post it as a new tweet. Since I’m looking to take this generated text and use it as input for an automated Twitter bot, as interesting as this pattern is (it does look like the majority of my tweets), I’ve filtered out anything that starts with ‘RT @’.
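As a rough sketch, a filter like this over the generated output does the job, assuming one candidate tweet per line (the file names here are just examples):

# drop any generated lines that look like retweets
grep -v '^RT @' generated-tweets.txt > filtered-tweets.txt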

I’ve already implemented a first attempt at a Twitter bot using this content with an AWS Lambda running on a timed schedule – you can check it out here:

 


I’ll be following up with some additional posts on the implementation of my AWS Lambda soon.

Exporting a select result from MySQL and getting the error ‘ERROR 1290 (HY000): The MySQL server is running with the --secure-file-priv option’

To use ‘select … into outfile’ on MySQL the user needs to have the FILE permission as described here.
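If the user doesn’t have it, granting it looks something like this (FILE is a global privilege so it’s granted on *.*; the user name here is just an example):

mysql> GRANT FILE ON *.* TO 'exampleuser'@'localhost';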

Even with this privilege however, if the server is running with the --secure-file-priv option you’ll see this error:

ERROR 1290 (HY000): The MySQL server is running with the --secure-file-priv option

When this option is enabled there’s usually one or more default paths configured that you can import from and export to. You can find these paths using:

mysql> SHOW VARIABLES LIKE "secure_file_priv";
+------------------+-----------------------+
| Variable_name    | Value                 |
+------------------+-----------------------+
| secure_file_priv | /var/lib/mysql-files/ |
+------------------+-----------------------+
1 row in set (0.02 sec)

Knowing where our trusted location is for importing/exporting, let’s try again:

select examplecol from exampletable 
order by createdate desc limit 1 
into outfile '/var/lib/mysql-files/mysqlout.txt';

Success!

This is discussed in answer to this question here.

Update: Java 10 with Eclipse Oxygen 3a (4.7.3a RC2) – working!

Following up on responses to my tweet and post yesterday, here’s a summary of the suggestions to get Java 10 support working with Eclipse 4.7.3a RC1:

  1. Download the 4.7.3a RC1 build. Yesterday I had tried the 4.7.3a M20180322-1835 build as stated on the previously available Java 10 support plugin:

but the recommendation from Noopur Gupta was to download the 4.7.3a RC1 release, which looks like it’s newer than the M20180322 build, and this apparently has specific support for the new var keyword:

  2. Ensure the latest Maven m2e plugin for Maven support is installed:

  3. Configure the Maven pom.xml with maven-compiler-plugin v3.7.0 and set the source and target versions to 10. Instead of setting the source and target versions in properties like this:

  <properties>
    <maven.compiler.source>10</maven.compiler.source>
    <maven.compiler.target>10</maven.compiler.target>
  </properties>

This tweet shows setting the 10 value as a <release> element in the plugin configuration instead:

So, replacing the source and target properties that I normally use with this:

  <build>
    <plugins>
      <plugin>
        <groupId>org.apache.maven.plugins</groupId>
        <artifactId>maven-compiler-plugin</artifactId>
        <version>3.7.0</version>
        <configuration>
          <release>10</release>
        </configuration>
      </plugin>
    </plugins>
  </build>

Now doing a Maven Update on my project, and what do you know, now it’s switched to Java 10:

And here it is, the var keyword in all its glory:
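As a quick illustrative snippet (my own trivial example, not the code from my project), var gives you local variable type inference like this:

// Trivial example of Java 10 local variable type inference (var)
import java.util.ArrayList;

public class VarExample {
    public static void main(String[] args) {
        var names = new ArrayList<String>(); // inferred as ArrayList<String>
        names.add("Java 10");
        var message = "Hello, " + names.get(0); // inferred as String
        System.out.println(message);
    }
}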

Thanks for everyone’s tips and suggestions to get this working!

Migrating an existing WordPress + nginx + php5-fpm + mysql website to Docker containers: lessons learned

I’ve covered in previous posts why I wanted to Dockerize my site and move it to containers; you can read about it in my other posts shared here. Having played with Docker for personal projects for several months at this point, I thought this migration was going to be easy, but I ran into several issues and unexpected decisions that I needed to make. In this post I’ll summarize a few of these issues and lessons learned.

Realizing the meaning of ‘containers are ephemeral’, or ‘where do I put my application data’?

Docker images are the blueprint for a container, while the container is a running instance of an image. It’s clear from the Docker docs and elsewhere that you should treat your containers as ‘ephemeral’, meaning they only exist while they’re up and running, their state is temporary, and once they are discarded their state is also lost.

This is an easy concept to grasp at a high level, but in practice it leads to important and valid questions, like ‘so where does my data go?’ This became very apparent to me when transferring my existing WordPress data. First, I have data in MySQL tables that needs to get imported into the new MySQL server running in a container. Second, where does wordpress/wp-content go, which in my case contains nearly 500MB of uploaded images from my 2,000+ posts?

The data for MySQL was easy to address, as the official MySQL docker image is already set up to use Docker’s data volume feature by default to externalize your MySQL data files outside of your running container.
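For example, running the official image with a named volume looks something like this (the container and volume names here are just placeholders):

# run MySQL with its data directory stored in a named Docker volume
docker run -d --name wordpress-mysql \
  -e MYSQL_ROOT_PASSWORD=examplepassword \
  -v mysql-data:/var/lib/mysql \
  mysql:5.7

The mysql-data volume lives outside the container, so the container can be stopped, removed, and recreated without losing the database files.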

The issue of where to put my WordPress wp-content directory containing 500MB of uploaded files is what caused my ‘aha’ moment with data volumes. Naively, you can create an image and use the COPY instruction to copy any number of files into it, even 500MB of images, but when you start to move this image around, like pushing it to a repository or a remote server, you quickly realize you’ve created something impractical. Make incremental changes to an image containing this quantity of files and you’ll find you’re unable to push it anywhere in a reasonable amount of time.
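That naive approach would be a Dockerfile along these lines (a hypothetical sketch, not my actual Dockerfile):

# Hypothetical sketch: baking the uploads into the image itself
FROM nginx
# every change to wp-content means rebuilding and re-pushing ~500MB of image layers
COPY wp-content/ /var/www/html/wp-content/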

To address this, I created an image with nginx and php5-fpm installed, but used Docker’s bind mount to reference and load my static content outside the container.
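At run time that looks something like this (the image name and host path are placeholders for my actual values):

# bind mount the host's wp-content directory into the container
docker run -d --name wordpress-web \
  -p 80:80 \
  -v /home/user/wp-content:/var/www/html/wp-content \
  my-nginx-php5-fpm-image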

Now I have my app in containers, how do I actually deploy to different servers?

Up until this point I’d built and run containers locally, and I’d set up a local Docker registry to push images to for testing, but the main things I wanted to enable with this migration were:

  • building and testing the containers locally
  • testing deployment to a VM server with an identical setup to my production KVM hosted server
  • pushing to my production server when I was ready to deploy to my live site

Before the Windows and MacOS native Docker installations, I thought docker-machine was just a way to deploy to a locally running Docker install in a VM. It never occurred to me that you can also use the docker-machine command to act on any remote Docker install too.

It turns out that even setting an env var DOCKER_HOST to point to the IP of any remote Docker server will enable you to direct commands to that remote server. I believe part of the ‘docker-machine create’ setup helps automate setting up TLS certs for communicating with your remote server, but you can also do this manually following the steps here. I took this approach because I wanted to use the certs from both my dev machine and my GitLab build machine.
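As a sketch, pointing the local docker CLI at a remote, TLS-secured daemon looks something like this (the IP address and cert path are placeholders):

# direct the local docker CLI at a remote Docker daemon over TLS
export DOCKER_HOST=tcp://192.168.1.50:2376
export DOCKER_TLS_VERIFY=1
export DOCKER_CERT_PATH=~/.docker/remote-certs

# commands now run against the remote server
docker ps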

I used this approach to build my images locally, and then, on committing my Dockerfile and source changes to my GitLab repo, a CI pipeline runs the same commands to push automatically to a locally running test VM server, with a manual step to push to my production server.

I’ll cover my GitLab CI Pipeline setup in an upcoming post.

How do you monitor an application running in containers?

I’ve been looking at a number of approaches. Prometheus looks like a great option, and I’ve been setting this up on my test server to take a look. I’m still looking at a few related options, maybe even using Grafana to visualize metrics. I’ll cover this in a future post too.