Using AWS SageMaker to train a model to generate text (part 2)

This is part 2, following on from my previous post, investigating how to take advantage of AWS SageMaker to train a model and use it to generate text for my Twitter bot, @kevinhookebot.

To get your data into a supported format for training a model, the AWS SageMaker docs mention that “A script to convert data from tokenized text files to the protobuf format is included in the seq2seq example notebook”.

Ok, so let’s start up the SageMaker notebook I created in part 1 via the AWS console:

Once it’s started, clicking the ‘Open’ link opens the Jupyter notebook, and from there we can open the seq2seq example in the ‘SageMaker Examples’ section:

From looking at the steps in this example notebook, it’s clear that the seq2seq algorithm is focused on translating text from a source to a target (such as translating text from one language to another, as shown in this example notebook).

Ok, so this isn’t what I was looking for, so let’s change gears. My main objective is to be able to train a new model using the AWS SageMaker service, and generate text from it. From what I understand so far, you have two options for how you can use SageMaker. You can either use the AWS Console for SageMaker to create Training Jobs using the built-in algorithms, or you can use a Jupyter notebook and define the steps yourself in Python to retrieve your data source, prepare the data, and train a model.

At this point the easiest thing might be to look for another Recurrent Neural Net (RNN) to generate characters to replace the Lua Torch char-rnn approach I was previously running locally on an Ubuntu server. Doing some searching I found char-rnn.pytorch.

This is my first experience setting up a Jupyter notebook, so at this point I’ve no idea if what I’m doing is the right approach, but I’ve got something working.

On the right-hand side of the notebook I pressed the New button and selected a Python PyTorch notebook:

Next I added a step to clone the char-rnn.pytorch repo into my notebook:
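The cell itself is just a shell command, using the notebook’s ! escape to clone the spro/char-rnn.pytorch repo from GitHub:

    # Clone the char-rnn.pytorch repo into the notebook instance's filesystem
    !git clone https://github.com/spro/char-rnn.pytorch.git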

Next I added a step to use the aws cli to copy my data file for training the model into my notebook:
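Roughly like this (the bucket is the one I set up in part 1; the file name here is a placeholder for my actual data file):

    # Copy the training data file from S3 into the cloned repo's directory
    !aws s3 cp s3://sagemaker-kevinhooke-ml/mydata.txt char-rnn.pytorch/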

Next, after working out the config options to train a model using char-rnn.pytorch, I added a step to run the training, but it gave an error about some missing Python modules:

Adding an extra step to use pip to install the required modules:
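The cell was along these lines; unidecode and tqdm are the modules char-rnn.pytorch imports on top of PyTorch, so those are most likely what the error was complaining about:

    # Install the modules char-rnn.pytorch needs that aren't preinstalled
    !pip install unidecode tqdm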

The default number of epochs is 2,000, which takes a while to run, so decreasing this to something smaller with --n_epochs 100 we get a successful run, and calling the generate script, we have content!
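The train and generate cells looked roughly like this (file names are placeholders; the flags are as documented in the char-rnn.pytorch README):

    # Train for 100 epochs instead of the default 2,000;
    # train.py saves the trained model alongside the data as mydata.pt
    !cd char-rnn.pytorch && python train.py mydata.txt --n_epochs 100

    # Generate text from the trained model
    !cd char-rnn.pytorch && python generate.py mydata.pt --predict_len 200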

I trained with an incredibly small file to get started, just 100 lines of text, and for a very short time. So as next steps I’m going to look at:

  • training with the full WordPress export of all my posts for a longer training time
  • training with a cleaned-up export (removing URL links and other HTML markup; see the sketch after this list)
  • automating the text generation from the model to feed my AWS Lambda based bot
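For the cleanup item, here’s a minimal sketch of the kind of thing I have in mind, stripping HTML tags and URLs from the WordPress export with Python’s standard library (file names are placeholders):

    import re

    # Strip HTML tags and URLs from the exported posts before training,
    # so the model learns prose rather than markup
    with open("wordpress-export.txt") as f:
        text = f.read()

    text = re.sub(r"<[^>]+>", " ", text)        # drop HTML tags
    text = re.sub(r"https?://\S+", " ", text)   # drop URL links
    text = re.sub(r"[ \t]+", " ", text)         # collapse leftover whitespace

    with open("wordpress-export-clean.txt", "w") as f:
        f.write(text)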

I’ll share an update on these enhancements in my next post.


Using AWS SageMaker to train a model to generate text (part 1)

If you’ve followed any of my recent posts, you’ll know I’ve been using an RNN model, trained on my previous tweets and the text from all of my previous blog posts, to generate text and feed it into a Twitter bot: @kevinhookebot.

The trouble I have right now is that the scripts and models for generating text run using Lua, and although I could install this on an EC2 instance, I don’t want to pay for an EC2 instance that’s up 100% of the time. Currently, when I generate a new batch of text for my Twitter bot, I start up a local server running the scripts and the model, generate new text, and then stage it to DynamoDB to get picked up by the bot when it’s next scheduled to run. With the AWS-provided Machine Learning services, there has to be something out of the box on AWS that I can use to automate these steps.
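For reference, the staging step is just writing items to a DynamoDB table; a minimal sketch with boto3, assuming a hypothetical table name and schema:

    import boto3

    # Hypothetical table name and schema for staging generated text;
    # the bot reads the next unposted item when it's scheduled to run
    table = boto3.resource("dynamodb").Table("twitterbot-staged-text")

    def stage_text(item_id: int, text: str) -> None:
        """Write one generated line of text for the bot to pick up later."""
        table.put_item(Item={"id": item_id, "text": text, "posted": False})

    for i, line in enumerate(open("generated.txt")):
        stage_text(i, line.strip())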

Let’s take a look at using AWS SageMaker.

First I created a SageMaker notebook with a new role that allows access to S3 buckets with ‘sagemaker’ in the name.

Then I created an S3 bucket – sagemaker-kevinhooke-ml – and uploaded a copy of my data file (all my previous posts from this blog, concatenated into a single file).
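(For anyone scripting this rather than using the console, the equivalent with boto3 would look something like the following; the region and the file name are assumptions.)

    import boto3

    s3 = boto3.client("s3", region_name="us-east-1")  # region is an assumption

    # A bucket name containing 'sagemaker' matches the notebook role's S3 access
    s3.create_bucket(Bucket="sagemaker-kevinhooke-ml")

    # Upload the concatenated blog post text as the training data file
    s3.upload_file("all-blog-posts.txt", "sagemaker-kevinhooke-ml", "all-blog-posts.txt")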

Next I created a new Training Job.

You need to pick an algorithm for the training, and there’s a selection of provided algorithms for different purposes. To generate new text ‘in the style of’ the text I’m going to train the model with, the ‘Sequence2Sequence’ algorithm looks like it does what I need.

On completing the Training Job, I got this error:

Ok, so let’s change the instance type. I picked the smallest of the instances before:

And it looks like you can’t change the instance type on an existing notebook, so let’s create a new notebook. Looking at the instance types, the ones with GPU support are on the large side, so let’s pick the smallest of the options and try again.

At this point I realized the instance type the error is referring to is for the training job, not the notebook, and it’s specified here:

So let’s pick one of the GPU types and try again.

First training job is running:

Next error:

Hmm, off to do some reading in the docs to see what’s needed to run this training job. The docs here describe what’s needed for the Sequence2Sequence algorithm, and I’m clearly missing some steps, so I’m taking a pause here and will come back with an update later.

Piping audio between applications: Configuring ham radio apps on macOS using Soundflower (virtual audio cables)

You’re running some digital mode software like WSJT-X on your Mac. Normally you would use a physical audio cable between your radio and your Mac, either via a soundcard interface like a Rigblaster, or even a direct USB connection between your Mac and your radio. What happens though if you want to route audio from one application to another? For example, can you pipe the audio from a Web SDR running in your browser straight into WSJT-X (or any other digital mode software)? What you need are ‘virtual audio cables’.

On Windows you have a product called VB-Cable (the approach for Windows is similar to what’s described here). On macOS you have a couple of options. There’s a commercial product called Loopback from Rogue Amoeba, or an open source alternative called Soundflower.

Follow the instructions to download and install. Once installed, you’ll find a couple of extra sound devices in your System Preferences:

Think of the Soundflower device as your cable. Instead of configuring Speakers for output and Mic for input, if you configure the output of one app as Soundflower (one end of the virtual cable) and the input of another app also as Soundflower (the other end of the cable), then the sound output from the first app is directed into the input of the second.

Let’s give this a go to connect the output from a WebSDR with the input to WSJT-X.

First, from System Preferences, select the Output to be Soundflower (shown above).

Start up a browser and pick a Web SDR station from http://websdr.org/

Here’s KFS, and we’re tuned in to 7.074 MHz USB to receive some FT8:

Next, start up WSJT-X and go to Preferences, Audio:

Note that the output audio from the WebSDR running in the browser is routed into the Soundflower virtual cable (via the system Output setting), and with Input = Soundflower, WSJT-X then takes its input audio from this virtual cable, effectively routing the audio from the web browser into WSJT-X.

Also note that with Output = Soundflower in WSJT-X, if we transmit in WSJT-X the audio will also go out over the virtual cable. With a WebSDR we obviously can’t transmit, but if you have access to a remote rig like remotehamradio.com, you can route the audio from WSJT-X into the remote rig app. More on that coming next.

You might note that with this current configuration there’s no actual audio coming out of your speakers. With some virtual cables you have the option to monitor the audio passing over the virtual cable. On macOS you also have the ability to create composite audio devices using the Audio MIDI Setup app:

This shows a ‘Multi-Output Device’ comprising both the regular built-in audio output (your speakers) and Soundflower. Now you’ve got the best of both worlds. More on this step, and on configuring WSJT-X to use remotehamradio.com, coming up next.

Eclipse Oxygen with Atlassian Connector plugin for accessing Jira issues

I’ve been kicking the tires in my local dev setup running my own Jira and GitLab installations. I’ve been meaning to take a look at how to access Jira tickets from within Eclipse, and then the next logical step is to look at the Jira to GitLab integration.

First up, let’s look at accessing Jira tickets in Eclipse. Docs on the Atlassian Connector are here: https://confluence.atlassian.com/ideplugin/atlassian-connector-for-eclipse. The installation guide gives Eclipse update sites for Eclipse versions up to Luna but not for more recent versions (Mars, Neon, Oxygen), although questions online (e.g. here) suggest the Luna version still installs and works with Oxygen (using update site: http://update.atlassian.com/atlassian-eclipse-plugin/rest/e3.7).

Integration within Eclipse is via Mylyn. After installing the plugin by adding the update site above, open the Mylyn Task Repositories view:

and select the Jira option:

Press Next and enter the URL for your Jira server:

When prompted to create a new Query, select Yes – this is what retrieves your assigned tasks from Jira:

Press Next for the next dialog. There are a lot of options here (continuing below the area shown in this screenshot), but in this case I’m interested just in the issues logged for my Blackjack Twitterbot project, so I selected that specific project:

In my Task List view I can now see a list of my assigned tasks, open and completed:

Double-clicking any of these opens the ticket in Eclipse:

This is pretty typical of any issue ticket tracking support in Eclipse. At this point I can edit and update the tickets.

Next up, I’ll look at Jira and GitLab integration.