Which Chat AIs are aware of incorrect facts in prompt questions, versus which generate wildly inaccurate responses?

The current Large Language Model (LLM) chat AIs like ChatGPT take an input prompt or sample sentence and generate text that follows on from it in the same style. These Chat AIs do not (currently) comprehend the questions being asked or understand the response text they generate, although they do a believable job of convincing you otherwise. The generated text reads as if the model understands the question or the input prompt because candidate words are scored or weighted: the words most likely to follow the preceding generated words and input text are included in the response, and less likely words are discarded. The weighting is based on the massive quantity of text used to train the model (GPT-3 was trained on 45TB of data extracted from multiple online sources). There are many articles on how these models work, but here is a good read to get an overview.
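To make the idea of weighted word selection concrete, here’s a deliberately tiny toy sketch: it counts which word follows which in a made-up scrap of ‘training’ text, then uses those counts as weights when generating. A real LLM learns its weighting with a neural network over tokens and vastly more data, but the generate-the-next-likely-word loop is the same basic idea:

import random
from collections import Counter, defaultdict

# Toy illustration only: the "weighting" a real LLM learns comes from vast
# amounts of training text; here we just count which word follows which in a
# tiny made-up corpus and use those counts as weights.
training_text = (
    "the model generates text the model scores words "
    "the model generates responses"
)

next_word_counts = defaultdict(Counter)
words = training_text.split()
for current_word, next_word in zip(words, words[1:]):
    next_word_counts[current_word][next_word] += 1

def generate(start, length=5):
    output = [start]
    for _ in range(length):
        counts = next_word_counts[output[-1]]
        if not counts:
            break
        candidates, weights = zip(*counts.items())
        # More frequent follow-on words are more likely to be picked;
        # rare ones are effectively discarded.
        output.append(random.choices(candidates, weights=weights, k=1)[0])
    return " ".join(output)

print(generate("the"))  # e.g. "the model generates text the model"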

I’ve spent a lot of time in the past few years playing with frameworks that train Recurrent Neural Networks for text generation, and for a few years had a Twitter bot running that tweeted text from a model trained on almost 20 years of my own blog posts (due to the recent API usage restrictions my Twitter bot lost its ability to tweet at the end of April 2023, but it lives on over on Mastodon here). It generates mostly nonsense, but it’s a good illustration of the capabilities of AI text generation prior to the much larger language models that can now generate believable responses, believable to the point that it feels like you are conversing with a real person.

Do these models generate factually incorrect responses?

There are many stories online of the current Chat AIs generating responses that are completely incorrect, and they are a cautionary reminder that, with the current state of the technology, you should never accept a response as correct without doing your own research using alternative sources to confirm it. Given the effort this additional fact checking takes, you could argue that you might as well do the research yourself in the first place, since trusting the output of a model without doing the additional work to verify the responses is not going to save you any time (if you need to be sure that the information you are looking for is actually correct). Using the conversational style of interacting with these models, you can also run into an issue where the model appears to be convinced that it is correct but is giving completely fabricated or just obviously incorrect responses. This is an issue with AI models known as hallucination.

To test this out, I gave each of the commonly available Chat AIs a prompt question based on an event that never occurred, and asked it to describe that event. You can obviously have a lot of fun with this, so I asked each of the Chat AIs to “tell me about the time Dave Mustaine from Megadeth toured with the British pop band Bananarama”.

First up, here’s the response from ChatGPT:

… well played, ChatGPT. There’s obviously some significant validation of prompt questions before the model generates a response, so this reply in itself is impressive.

Next up, same question to Bing Chat:

… again, impressive input validation here.

Next, same question to Google Bard:

… here’s the weirdest of the responses. I had clearly asked the model to describe an event where two bands toured together, and that is exactly what the model described. It generated a completely fabricated description of an event that never occurred, but it is impressive nonetheless. The generated text even includes a fabricated quote from Mustaine saying he is “a big fan of Bananarama”… maybe he is, but I’d be 99% sure this is completely generated.

Conclusion

So what’s the conclusion here? Given the viral attention these models are currently getting, we need to keep things in perspective:

  • output from these models is generated text – it is generated based on the data used to train the model, but given that the majority of that training data is scraped from the internet, there’s no guarantee the training data is correct, and therefore no guarantee that the generated text is either. And even then, the responses are still generated, which leads to the next point
  • there is a profound difference between searching for results using a search engine and asking a question of a Chat AI that responds with generated text – a search engine result is content that exists on another website. That content may be factually correct, incorrect, or fiction, but either way, it is content that already exists. The response from a Chat AI is generated text; it is not content that already exists, it was generated from the data used to train the model. While it is possible a model was trained on data related to the question you ask, there is a difference between returning content that already exists and returning text that has been generated.
  • with the current level of technology available, Chat AIs do not understand the questions users ask as input prompts, nor do they understand the responses they generate. While the current technology gives the appearance of comprehension, the model is repeating the pattern of the input text and generating a response that follows the same pattern – this is not the same as comprehension

As Large Language Models continue to improve, it’s clear the potential benefits of this technology are wide ranging… however, it’s also clear the output from current models needs to be taken with a degree of caution.

Running aitextgen model training in a Docker container

I’m setting up an approach to run text generation model training jobs on demand with aitextgen, and the first step is to run the training in a Docker container. Later I may move this to an AWS service like ECS.

I’ve built a Docker image with the following Dockerfile:

# Base image with Python 3 and the aitextgen library installed
FROM amazonlinux
RUN yum update -y
RUN yum install -y python3
RUN pip3 install aitextgen

# Add the training text and the training/generation scripts to the image
ADD source-file-for-fine-tuning.txt .
ADD generate.py .
ADD train.py .

… and then built my image with:

docker build -t aitextgen .

I then run a container, passing in the command I want to run, in this case ‘python3 train.py’:

docker run --volume /data/trained_model:/trained_model:rw -d aitextgen sh -c "cd / && python3 train.py && mv aitextgen.tokenizer.json /trained_model"

I’m also attaching a bind mount where the model output is written to during the run, and using -d to run the container in the background. The last step in the run command copies the tokenizer file to the mounted EBS volume so it can be reused for generation.
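For reference, train.py isn’t shown in this post; a minimal version along the lines of the aitextgen examples might look something like the following sketch (the input file name matches the Dockerfile above, but the training parameters are assumptions):

# Hypothetical sketch of a train.py along these lines: train a tokenizer and a
# small GPT-2 model from a single text file using aitextgen.
from aitextgen import aitextgen
from aitextgen.TokenDataset import TokenDataset
from aitextgen.tokenizers import train_tokenizer
from aitextgen.utils import GPT2ConfigCPU

input_file = "source-file-for-fine-tuning.txt"  # added to the image in the Dockerfile

# Train a tokenizer on the input text; this writes aitextgen.tokenizer.json,
# which the docker run command above moves to the mounted volume.
train_tokenizer(input_file)
tokenizer_file = "aitextgen.tokenizer.json"

config = GPT2ConfigCPU()  # small model config suitable for CPU-only training
ai = aitextgen(tokenizer_file=tokenizer_file, config=config)

data = TokenDataset(input_file, tokenizer_file=tokenizer_file, block_size=64)

# Checkpoints are written to ./trained_model by default, which is the mounted
# /trained_model volume when the script is run from / in the container.
ai.train(data, batch_size=8, num_steps=50000, generate_every=5000, save_every=5000)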

To generate text from the model, run:

docker run --volume /data/trained_model:/trained_model:rw -d aitextgen sh -c "cd / && python3 generate.py"
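generate.py also isn’t shown here; a minimal sketch, assuming the trained model and tokenizer are loaded from the mounted /trained_model volume, might look like this:

# Hypothetical sketch of a generate.py: load the model and tokenizer produced
# by the training run and print some sample text.
from aitextgen import aitextgen

ai = aitextgen(
    model_folder="/trained_model",
    tokenizer_file="/trained_model/aitextgen.tokenizer.json",
)

# Generate a handful of samples to stdout
ai.generate(n=5, max_length=100)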

Experimenting with the gpt-2 models and text generation

I’m experimenting with the gpt-2 774M model to generate text based on a prompt. Starting up with:

python3 src/interactive_conditional_samples.py --temperature=0.7 --model_name=774M --nsamples=1 --length=100
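For context on what the --temperature flag is doing: sampling temperature divides the model’s scores (logits) for each candidate next token before they are turned into probabilities, so lower values concentrate probability on the most likely tokens and higher values flatten the distribution so less likely tokens are sampled more often. Here’s a rough, self-contained sketch of the idea (a toy example with made-up scores, not the actual gpt-2 sampling code):

import math
import random

def sample_with_temperature(logits, temperature):
    """Sample a next token from raw scores scaled by temperature.

    Toy illustration only: real models work over a vocabulary of tens of
    thousands of tokens, not a handful of words.
    """
    scaled = {tok: score / temperature for tok, score in logits.items()}
    max_s = max(scaled.values())  # subtract the max for numerical stability
    exp = {tok: math.exp(s - max_s) for tok, s in scaled.items()}
    total = sum(exp.values())
    probs = {tok: v / total for tok, v in exp.items()}
    tokens, weights = zip(*probs.items())
    return random.choices(tokens, weights=weights, k=1)[0]

# Made-up scores: at temperature 0.7 "data" dominates; at 0.9 (and above) the
# less likely alternatives are sampled noticeably more often.
scores = {"data": 5.0, "models": 3.0, "pizza": 0.5}
print(sample_with_temperature(scores, temperature=0.7))
print(sample_with_temperature(scores, temperature=0.9))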

And then providing a prompt of something relevant like:

“The problem with training Machine Learning models today is that the model is only as good as the data it is trained with.”

and let’s see what we get:

This almost sounds believable, as if it had been written by hand. It might not make complete sense, but the sentences are well formed, something that was an issue with my previous experiments with RNN model text generation.

Here’s the text for reference:

“There is a desire to train the model on a large data set.

A very big data set is not always a good enough data set.

A good data set is more than just enough. A good data set is always growing.

To take advantage of this growing data set we need to train the model on the latest batch of data. However, there may be a huge amount of data that does not fit in the pre-defined batch size.”

I’m curious how the temperature setting affects what is generated, so let’s bump it up from 0.7 to 0.9 and try again with the same prompt text.

This sounds like it comes straight out of a research paper:

“This is why there is now a trend toward training a workforce of highly trained humans in a search engine optimization function (SRE) to monitor changes in signal and value in the big data.

Whether it is a search engine algorithm like Chi-squared or a machine learning model like a neural network, there is still a lot of work ahead to make those apps more relevant to the data they collect.

The need for better data

Having data that makes the predictions bigger, on”

Not knowing exactly how this works, it almost reads like it’s regurgitating whole chunks of the data it was trained on, but who knows. Given that it was trained on material from sites online, I guess the easy way to test for this would be to drop one of these sentences into Google and see if something like a source article comes up. Trying this out:

Huh. So this sentence is not found anywhere in Google’s indexes. That’s quite remarkable if the model actually generates sentences that are this convincing.

What about starting with a line from a recipe (for example from here):

“Heat oil in large pot. Add beef and flour, stirring to coat beef; brown.”

Ok, now this is interesting but slightly weird. It’s clearly recognizable as steps in a recipe, but the order doesn’t make too much sense and some of the steps are repeated variations of themselves:

“Add onions and garlic and cook until onion starts to turn translucent, 3 minutes. Stir in bell pepper, glass tomatoes, and celery; cook 1 minute. Add potatoes, beef broth, and bay leaves; bring to a boil. Reduce heat to low. Reduce heat to medium; cook until beef is no longer pink. Taste and add salt and pepper. Add beef and stir. Add liquid; cover and simmer until meat is tender, 10 minutes.”

What about lines from films?

“I find your lack of faith disturbing”

Now we’re gone off into Crazy Town. This is weird:

“Why do you persist in viewing my appearance with contempt?”

“I want to believe I am worthy of your worship.”

“Therefore, you have no right to demand I lift my gaze.”

“I am unworthy of your reverence; I can only pray that you may address me in a less lewd manner.”

“You call me a beast?”

“You cannot deny the very fact you see me.”

“I see nothing.”

“Why?”

Whoah. Ok, keeping with the Star Wars theme, what does it do if you start it with some Yoda quotes:

“When 900 years old you reach, look as good you will not”

And I got:

“I listen to every guru and I say, good luck with your yoga and counting numbers.”

Does this phrase exist anywhere?

Nope. You have to wonder whether some Yoda quotes or lines from films went into the model. It’s interesting that ‘yoga’ is mentioned, and you wonder if this is the model generating a variation of the name ‘Yoda’. And “When 900 years old you reach” becoming “counting numbers”? Very curious.

Ok, one more. How about a line from an HP monitor product description:

“Designed with purpose: multiple customization options allow you to adjust your monitor to best fit the way you work.”

Incredible:

“Customizable paired ergonomics and colors: customize every feature of your new monitor from its basic design and color customization options.

Intuitive control links: there are more than 100

cabling connections provided with each monitor.

Designed for comfort: more than twice as many inches of space between the screen and the desk top than standard monitors.

Dedicated power buttons for powering the monitor on and off.”

Ok, enough playing, I did have a project planned to use with this, so time to get busy 🙂