Which Chat AIs are aware of incorrect facts in prompt questions, versus which generate wildly inaccurate responses?

The current Large Language Model (LLM) chat AIs like ChatGPT take an input prompt or sample sentence and generate text that follows on in the same style as the input. These Chat AIs do not (currently) comprehend the questions being asked or understand the response text they generate, although the models currently do a believable job of convincing you otherwise. The generated text reads as if the model understands the question or the input prompt because candidate words are scored or weighted: the words most likely to follow the preceding generated words or input text are included in the response, and less likely words are discarded. The weighting is based on the massive quantity of text that is used to train the model (GPT-3 was reportedly trained on 45TB of data extracted from multiple online sources). There are many articles on how these models work, but here is a good read for an overview
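To make the "most likely next word" idea concrete, here is a minimal sketch in Python. It is a toy word-pair (bigram) counter, not how ChatGPT or any real LLM is implemented; the corpus and the function names are made up for illustration. It only shows the basic principle of weighting candidate words by how often they followed the previous word in the training text.

```python
import random
from collections import Counter, defaultdict

def build_bigram_counts(text):
    """Count how often each word follows each other word in the text."""
    words = text.lower().split()
    counts = defaultdict(Counter)
    for current, following in zip(words, words[1:]):
        counts[current][following] += 1
    return counts

def next_word(counts, word, rng):
    """Pick a follower of `word`, weighted by how often it was seen."""
    followers = counts.get(word)
    if not followers:
        return None  # this word was never followed by anything in training
    candidates = list(followers.keys())
    weights = list(followers.values())
    return rng.choices(candidates, weights=weights, k=1)[0]

def generate(counts, start, length, seed=0):
    """Generate up to `length` words, starting from `start`."""
    rng = random.Random(seed)
    output = [start]
    for _ in range(length - 1):
        word = next_word(counts, output[-1], rng)
        if word is None:
            break
        output.append(word)
    return " ".join(output)

# Tiny made-up training text, purely for illustration
corpus = "the cat sat on the mat and the cat ate the fish"
counts = build_bigram_counts(corpus)
print(counts["the"].most_common(1))  # the most frequent follower of "the"
print(generate(counts, "the", 8))
```

Nothing in this sketch knows what a cat or a mat is. It only replays statistics from the text it was given, which is the point the paragraph above is making, just at a vastly smaller scale.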

I’ve spent a lot of time in the past few years playing with frameworks that train Recurrent Neural Networks for text generation, and for a few years had a Twitter bot running that tweeted text from a model trained on almost 20 years of my own blog posts (due to the recent API usage restrictions my Twitter bot lost its ability to tweet at the end of April 2023, but it lives on over on Mastodon here). It generates mostly nonsense, but it is a good illustration of the capabilities of AI text generation prior to the much larger language models that are now able to generate believable responses, believable to the point that you feel you are conversing with a real person.

Do these models generate factually incorrect responses?

There are many stories online of the current Chat AIs generating responses that are completely incorrect. They are a cautionary reminder that, with the current state of the technology, you should never accept a response as correct without doing your own research using alternative sources to confirm it. Given the effort involved in this additional fact checking, you could argue that you might as well do the research yourself in the first place, since trusting the output of a model without verifying it is not going to save you any time (if you need to be sure that the information you are looking for is actually correct). Using the conversational style of interacting with these models, you can also run into an issue where the model appears to be convinced that it is correct but is giving completely fabricated or just obviously incorrect responses. This is an issue with AI models known as hallucination.

To test this out I gave each of the commonly available chat AIs a prompt question based on an event that never occurred, and asked the AI to describe that event. You can obviously have a lot of fun with this, so I asked each of the Chat AIs to “tell me about the time Dave Mustaine from Megadeth toured with the British pop band Banarama”.

First up, here’s the response from ChatGPT:

… well played ChatGPT. There’s obviously some significant validation of prompt questions before the model generates a response, so this reply even in itself is impressive.

Next up, same question to Bing Chat:

… again, impressive input validation here.

Next, same question to Google Bard:

… here’s the weirdest of the responses. Clearly I had asked the model to describe an event where two bands toured together, and this is exactly what the model has described. It generated a completely fabricated description of an event that never occurred, which is impressive nonetheless. The generated text even includes a fabricated quote from Mustaine that he is “a big fan of Banarama”… maybe he is, but I’d be 99% sure this is completely generated.


So what’s the conclusion here? Given the viral attention these models are currently getting, we need to keep things in perspective:

  • output from these models is generated text – it is generated based on the training data used to train the model, but given the majority of the training data is scraped from the internet, there’s no guarantee the training data is correct, and therefore also no guarantee that the generated text is either. And even still, the responses are generated, which leads to the next point
  • there is a profound difference between searching for results using a search engine, and asking a question to a Chat AI that responds with generated text – a search engine result is content that exists on another website. That content may be factually correct, incorrect, or fiction, but either way, it is content that already exists. The response from a Chat AI is generated text; it is not content that already exists, but was generated from the data used to train the model. While it is possible a model was trained on data related to the question you ask, there is a difference between searching for and returning content that already exists, and generating new text.
  • With the current level of technology available, Chat AIs do not understand questions asked by users as input prompts, nor do they understand the responses that they generate. While the output can give the appearance of comprehension, the model is repeating the pattern of the input text and generating a response that follows the same pattern – this is not the same as comprehension

As Large Language Models continue to improve, it’s clear that the potential benefits of this technology are wide ranging… however, it’s also clear that the outputs from current models need to be treated with a degree of caution.

Using AWS SageMaker to train a model to generate text (part 2)

This is part 2 following on from my previous post, investigating how to take advantage of AWS SageMaker to train a model and use it to generate text for my Twitter bot, @kevinhookebot.

The AWS SageMaker docs, describing how to get data into a supported format for training a model, mention that “A script to convert data from tokenized text files to the protobuf format is included in the seq2seq example notebook”

Ok, so from the SageMaker Notebook I created in part 1, let’s start it up via the AWS console:

Once it has started, clicking the ‘Open’ link opens the Jupyter notebook, and we can open the seq2seq example which is in the ‘SageMaker Examples’ section:

From looking at the steps in this example Notebook, it’s clear that this seq2seq (sequence to sequence) algorithm is more focused on translating text from source to destination (such as translating text from one language to another, as shown in this example notebook).

Ok, so this isn’t what I was looking for, so let’s change gears. My main objective is to be able to train a new model using the AWS SageMaker service, and generate text from it. From what I understand so far, you have two options for how you can use SageMaker. You can either use the AWS Console for SageMaker to create Training Jobs using the built-in algorithms, or you can use a Jupyter notebook and define the steps yourself using Python to retrieve your data source, prepare the data, and train a model.

At this point the easiest thing might be to look for another Recurrent Neural Net (RNN) to generate characters to replace the Lua Torch char-rnn approach I was previously running locally on an Ubuntu server. Doing some searching I found char-rnn.pytorch.

This is my first experience setting up a Jupyter notebook, so at this point I’ve no idea if what I’m doing is the right approach, but I’ve got something working.

On the right-hand side of the notebook I pressed the New button and selected a Python PyTorch notebook:

Next I added a step to clone the char-rnn.pytorch repo into my notebook:

Next I added a step to use the aws cli to copy my data file for training the model into my notebook:

Next, adding the config options to train a model using char-rnn.pytorch, I added a step to run the training, but it gave an error about some Python modules missing:

Adding an extra step to use pip to install the required modules:

The default number of epochs is 2,000, which takes a while to run, so decreasing this to something smaller with --n_epochs 100 we get a successful run, and calling the generate script, we have content!

I trained with an incredibly small file to get started, just 100 lines of text, for a very short time. So next steps I’m going to look at:

  • training with the full WordPress export of all my posts for a longer training time
  • training with a cleaned up export (remove URL links and other HTML markup)
  • automate the text generation from the model to feed my AWS Lambda based bot

I’ll share another update on these enhancements in my next post.
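For the cleanup item in the list above (removing URL links and other HTML markup), a first pass could look something like the sketch below. This is not the script I ended up using; it’s a rough regex-based approach using only the Python standard library, and the function name and sample string are made up for illustration. A proper HTML parser would be more robust for messy markup.

```python
import html
import re

TAG = re.compile(r"<[^>]+>")        # anything that looks like an HTML tag
URL = re.compile(r"https?://\S+")   # bare http/https links left in the text
SPACES = re.compile(r"\s+")         # runs of whitespace to collapse

def clean_post(raw):
    """Strip HTML tags and URLs from one exported post, leaving plain text."""
    text = TAG.sub(" ", raw)
    text = html.unescape(text)      # turn entities like &amp; back into characters
    text = URL.sub(" ", text)
    return SPACES.sub(" ", text).strip()

# Hypothetical sample of exported post content
sample = ('<p>Read the <a href="https://example.com/post">docs</a> '
          'at https://example.com &amp; try it.</p>')
print(clean_post(sample))
```

Running this over every post before training should stop the model from learning to generate tags and link fragments as if they were words.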


Retraining my Recurrent Neural Net with content from this blog

Recently I trained torch-rnn with all my previous tweets, and used it to generate tweets from an AWS Lambda, which results in some incomprehensible but somewhat recognizable content typical of my software development tweets like this:

and this:

In most cases the vocab it’s generating new content with has a high occurrence of words I’d typically use, so computer, software, hardware, code are all pretty common in the output. Training the model with 2000+ tweets of 240 characters or less, though, I don’t think is a particularly great sample of data, so I wondered what it would be like if I trained it with more data.

I have 2000+ articles on my blog here, so I ran a SQL query to extract all the post text to a file (see here), and then fed this 4MB file into the training script. The script has been running on an Ubuntu VM on my rack server for almost 24 hours at this point, and it’s probably the most load I’ve had on my server: the 1 vCPU on the VM is maxed out at 100%, although the server itself still has plenty of free vCPUs and RAM remaining. It’s getting a little on the warm side in my office right now.

The torch-rnn script to train your model writes out a checkpoint file of the model in progress so far about once every hour, so it’s interesting to see how the generated content improves with every additional hour of training.

Here are some examples starting with checkpoint 1, and then a few successive checkpoints, running with temperature 0.7 (which gives good results after more training, but pretty wacky output earlier in the training):

Checkpoint 1, after about 2 hours:

a services the interease the result pecally was each service installing this up release have for a load have on vileent there at of althe Mork’ on it’s deforver, a some for

Checkpoint 5:

Store 4 minimal and Mavera FPC to speed and used that the original remeption of the Container and released and problem is any sudo looks most chated and Spring Setting the Started Java tagger

Checkpoint 10:

react for Java EE development and do it compended to the Java EE of code that yet this showing the desting common already back to be should announced to tracker with the problem and expenting

Checkpoint 15:

never that means that all performance developers of the just the hand of a microsch phone as not support with his all additional development though it’s better with the same by worker apache

Checkpoint 19:

The Java becomes are your server post configuring Manic Boot programming code in the PS3 lattled some time this is the last of the Docker direction is one it and a check the new features and a few new communities on the first is seen the destining

Getting pretty interesting at this point! Interesting that certain words appear pretty regularly in the generated output, although I don’t think I’ve included them in articles that often. PS2 and PS3 appear a lot, programming and computer are expected given the frequency in the majority of my articles, and there’s a lot of Java, Microsoft, Oracle, Docker and containers showing up.

I’m not sure how much longer the training is going to run for on a 4MB text file which I didn’t think was that large, but it’s been running for almost 24 hours at this point. I’ll let it run for another day and then see what the output looks like then.

If you start to see the tweets looking slightly more coherent over the next couple of days, the AWS Lambda is starting to use content generated from these new checkpoints on this new model, so it should be slightly more natural sounding hopefully, given the larger input file for training the new model.

Generating tweets using a Recurrent Neural Net (torch-rnn)

Even if you’re not actively following recent trends in AI and Machine Learning, you may have come across articles by a researcher who experiments with training neural nets to generate interesting things such as:

Brown salmon in oil. Add creamed meat and another deep mixture

  • Chocolate Pickle Sauce
  • Completely Meat Chocolate Pie

So what’s going on here? What’s being used is something called a Recurrent Neural Net to generate text in a specific style. It’s trained with input data, which it analyzes to recognize patterns in the text, constructing a model of that data. It can then generate new text following the same patterns, sometimes with rather curious and amusing results.

A commonly referred to article on this topic is by Andrej Karpathy, titled “The Unreasonable Effectiveness of Recurrent Neural Networks” – it’s well worth a read to get an understanding of the theory and approach.

There are many RNN implementations you can download and start training with any input data you can imagine. Here are a few to take a look at:

So it occurred to me, what would happen if you trained a RNN with all your past Twitter tweets, and then used it to generate new tweets? Let’s find out 🙂

Let’s try it out with torch-rnn – the following is a summary of install steps from https://github.com/jcjohnson/torch-rnn:

sudo apt-get -y install python2.7-dev
sudo apt-get install libhdf5-dev

Install torch, from http://torch.ch/docs/getting-started.html#_ :

git clone https://github.com/torch/distro.git ~/torch --recursive
cd ~/torch; bash install-deps;
./install.sh
#source new PATH for first time usage in current shell
source ~/.bashrc

Now clone the torch-rnn repo:

git clone https://github.com/jcjohnson/torch-rnn.git

Install torch deps:

luarocks install torch
luarocks install nn
luarocks install optim
luarocks install lua-cjson

Install torch-hdf5:

git clone https://github.com/deepmind/torch-hdf5
cd torch-hdf5
luarocks make hdf5-0-0.rockspec

Install pip to install python deps:

sudo apt-get install python-pip

From inside torch-rnn dir:

pip install -r requirements.txt

Now following steps from docs to preprocess your text input:

python scripts/preprocess.py \
  --input_txt my_data.txt \
  --output_h5 my_data.h5 \
  --output_json my_data.json

For my input tweet text this looks like:

python scripts/preprocess.py \
  --input_txt ~/tweet-text/tweet-text.txt  \
  --output_h5 ~/tweet-text/tweet-text.h5 \
  --output_json ~/tweet-text/tweet-text.json

This gives me:

Total vocabulary size: 182
Total tokens in file: 313709
  Training size: 250969
  Val size: 31370
  Test size: 31370
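The three sizes above are consistent with a simple 80/10/10 split of the 313709 tokens. As a quick check of the arithmetic, here is a small Python sketch. It assumes the validation and test fractions are both 0.1 and that the fractional part is rounded down, which is my reading of the numbers rather than something taken from the output itself; the function name is just for illustration.

```python
def split_sizes(total_tokens, val_frac=0.1, test_frac=0.1):
    """Split a token count into train/val/test sizes (fractions are assumptions)."""
    val_size = int(val_frac * total_tokens)    # rounds down
    test_size = int(test_frac * total_tokens)  # rounds down
    train_size = total_tokens - val_size - test_size  # training gets the remainder
    return train_size, val_size, test_size

print(split_sizes(313709))  # (250969, 31370, 31370)
```

So roughly 80% of the characters in my tweets are used for training, with the remaining 20% held back to measure how well the model is doing.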

Now to train the model:

th train.lua \
  -input_h5 my_data.h5 \
  -input_json my_data.json

For my input file containing my tweet text this looks like:

th train.lua \
  -input_h5 ~/tweet-text/tweet-text.h5 \
  -input_json ~/tweet-text/tweet-text.json

This gave me this error:

init.lua:389: module 'cutorch' not found:No LuaRocks module found for cutorch

 no field package.preload['cutorch']

Trying to manually install cutorch I got errors about cuda toolkit:

CMake Error at /usr/share/cmake-3.5/Modules/FindCUDA.cmake:617 (message):


Checking the docs:

By default this will run in GPU mode using CUDA; to run in CPU-only mode, add the flag -gpu -1

… so adding -gpu -1 and trying again, now I’ve got this output as it runs:

Epoch 1.44 / 50, i = 44 / 5000, loss = 3.493316

… one line every few seconds.
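The numbers in that progress line can be roughly reconstructed from the training size reported earlier. The Python sketch below assumes a batch size of 50 and a sequence length of 50, which I believe are the torch-rnn defaults but haven’t confirmed here, so treat both as assumptions; the function names are made up for illustration.

```python
def iterations_per_epoch(train_tokens, batch_size=50, seq_length=50):
    """Whole batches that fit in the training data (batch settings are assumptions)."""
    return train_tokens // (batch_size * seq_length)

def fractional_epoch(iteration, per_epoch):
    """Epoch counter as shown in the log, starting from 1."""
    return iteration / per_epoch + 1

per_epoch = iterations_per_epoch(250969)  # training size from the preprocess step
total_iterations = per_epoch * 50         # 50 epochs, from "/ 50" in the log line
print(per_epoch, total_iterations)
print(round(fractional_epoch(44, per_epoch), 2))
```

Under those assumptions this gives 100 iterations per epoch and 5000 iterations in total, and iteration 44 works out to epoch 1.44, which lines up with the log line above.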

After some time it completes a run, and you’ll find files like this in your cv dir beneath where you ran the previous script:


Now to run and get some generated text:

th sample.lua -checkpoint cv/checkpoint_5000.t7 -length 500 -gpu -1 -temperature 0.4

Breaking this down:

-checkpoint : as the model training runs, it saves these point in time snapshots of the model. You can run the generation against any of these files, but it seems the last file it generates gives you the best results

-length : how many characters to generate from the model

-gpu -1 : turn off the gpu usage

-temperature : this ranges from 0.1 to 1; with values closer to zero the generated output is less creative, and closer to 1 it is, let’s say, more creative
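To get a feel for what temperature does, here is a small Python sketch of the usual approach: divide the model’s raw scores by the temperature, then turn them into probabilities and sample from them. This illustrates the general technique and is not code taken from torch-rnn; the scores and function names are hypothetical.

```python
import math
import random

def apply_temperature(scores, temperature):
    """Turn raw scores into probabilities, sharpened or flattened by temperature."""
    scaled = [s / temperature for s in scores]
    peak = max(scaled)  # subtract the max so exp() stays numerically stable
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def sample_index(scores, temperature, rng):
    """Pick one candidate at random, weighted by its temperature-adjusted probability."""
    probs = apply_temperature(scores, temperature)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

scores = [2.0, 1.0, 0.5]  # hypothetical scores for three candidate characters
cold = apply_temperature(scores, 0.1)
warm = apply_temperature(scores, 1.0)
print([round(p, 3) for p in cold])  # nearly all weight on the top candidate
print([round(p, 3) for p in warm])  # weight spread across all candidates
print(sample_index(scores, 1.0, random.Random(0)))
```

At a low temperature the top-scoring character wins almost every time, which is why the output gets repetitive. At a higher temperature the less likely characters get picked more often, which is where the more random output comes from.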

Let’s run a couple of examples. Let’s do 140 chars at -temperature 0.1:

The programming to softting the some the programming to something the computer the computer the computer to a computer the com

and now let’s crank it up to 1.0:

z&loDOps be sumpriting sor’s a porriquilefore AR2 vanerone as dathing 201lus: It’s buct. Z) https://t.co/gEDr9Er24N Amatere. PEs’me tha

Now we’ve some pretty random stuff including a randomly generated shortened url too.

Using a value towards the middle, like 0.4 to 0.5 gets some reasonably interesting results that are not too random, but somewhat similar to my typical tweet style. What’s interesting is my regular retweets of software development quotes from @CodeWisdom have heavily influenced the model, so based on my 3000+ tweets it generates text like:

RT @CodeWisdom followed by random generated stuff

Given that the text following the prefix is clearly not content from @CodeWisdom, it wouldn’t be appropriate to use this text as-is and post it as a new tweet. Since I’m looking to take this text and use it as input for an automated Twitter bot, as interesting as this generated pattern is (it does look like the majority of my tweets), I’ve filtered out anything that starts with ‘RT @text’.

I’ve already implemented a first attempt at a Twitter bot using this content, with an AWS Lambda running on a timed schedule. You can check it out here:
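The filter itself can be very simple. Here is a minimal Python sketch of the idea; it’s not the code running in my Lambda, and the function names and sample lines are made up for illustration.

```python
import re

# Matches generated lines that begin like a retweet, e.g. "RT @CodeWisdom ..."
RETWEET_PREFIX = re.compile(r"^\s*RT @\w+")

def is_retweet_style(text):
    """True if the generated text starts with an 'RT @name' prefix."""
    return RETWEET_PREFIX.match(text) is not None

def usable_tweets(generated_lines):
    """Keep non-empty generated lines that do not look like retweets."""
    return [line.strip() for line in generated_lines
            if line.strip() and not is_retweet_style(line)]

samples = [
    "RT @CodeWisdom followed by random generated stuff",
    "The programming to something the computer",
    "",
]
print(usable_tweets(samples))
```

Only the second sample survives the filter, so the bot never posts generated text that looks like it is quoting someone else.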


I’ll be following up with some additional posts on the implementation of my AWS Lambda soon.