Which Chat AIs are aware of incorrect facts in prompt questions, versus which generate wildly inaccurate responses?

The current Large Language Model (LLM) chat AIs like ChatGPT take an input prompt or sample sentence and generate text that follows in the same style. These Chat AIs do not (currently) comprehend the questions being asked or understand the response text they generate, although the models do a believable job of convincing you otherwise. The generated text reads as if the model understands the question or the input prompt because candidate words are scored or weighted: the words most likely to follow the preceding generated words or input text are included in the response, and less likely words are discarded. The weighting is based on the massive quantity of text used to train the model (GPT-3 was trained on 45TB of data extracted from multiple online sources). There are many articles on how these models work, but here is a good read for an overview
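To make the "pick the most likely next word" idea concrete, here's a toy sketch in Java. The tiny hand-coded probability table is purely illustrative and stands in for a trained model; a real LLM derives these weights from billions of trained parameters, but the generation loop (score the candidates, pick a likely one, append it, repeat) is conceptually the same:

```java
import java.util.Map;
import java.util.Random;

// Toy next-word generator: picks each following word in proportion to its
// weight. The hand-coded table below is a stand-in for a trained model.
public class ToyTextGenerator {

    // For each word, the words that may follow it, with relative weights
    static final Map<String, Map<String, Double>> MODEL = Map.of(
        "the", Map.of("cat", 0.6, "dog", 0.3, "banana", 0.1),
        "cat", Map.of("sat", 0.7, "ran", 0.3),
        "dog", Map.of("barked", 0.8, "sat", 0.2)
    );

    static final Random RANDOM = new Random();

    // Weighted random choice of the next word; null ends the sentence
    static String nextWord(String current) {
        Map<String, Double> candidates = MODEL.get(current);
        if (candidates == null) return null;
        double r = RANDOM.nextDouble();
        String choice = null;
        for (Map.Entry<String, Double> e : candidates.entrySet()) {
            choice = e.getKey();
            r -= e.getValue();
            if (r <= 0) break;
        }
        return choice;
    }

    public static void main(String[] args) {
        StringBuilder text = new StringBuilder("the");
        String word = "the";
        while ((word = nextWord(word)) != null) {
            text.append(' ').append(word);
        }
        System.out.println(text); // e.g. "the cat sat"
    }
}
```

At no point does anything in this loop "understand" the sentence being built, and the same is true of the far larger models: they pick statistically plausible continuations, they don't reason about facts.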

I’ve spent a lot of time in the past few years playing with frameworks that train Recurrent Neural Networks for text generation, and for a few years had a Twitter bot running that tweeted text from a model trained on almost 20 years of my own blog posts (due to the recent API usage restrictions my Twitter bot lost its ability to tweet at the end of April 2023, but it lives on over on Mastodon here). It generates mostly nonsense, but it’s a good illustration of the capabilities of AI text generation prior to the much larger language models that are now able to generate believable responses, believable to the point that it feels like you are conversing with a real person.

Do these models generate factually incorrect responses?

There are many stories online of the current Chat AIs generating responses that are completely incorrect, and they are a cautionary reminder that with the current state of the technology, you should never accept a response as correct without doing your own research using alternative sources to confirm it. Given the effort involved in this additional fact checking, you could argue that you might as well do the research yourself in the first place, since trusting the output of a model without doing the additional work to verify it is not going to save you any time (if you need to be sure that the information you are looking for is actually correct). Using the conversational style of interacting with these models, you can also run into an issue where the model appears to be convinced that it is correct but is giving completely fabricated or just obviously incorrect responses. This is a known issue with AI models called hallucination.

To test this out I gave each of the commonly available Chat AIs a prompt question based on an event that never occurred, and asked the AI to describe that event. You can obviously have a lot of fun with this, so I asked each of the Chat AIs to “tell me about the time Dave Mustaine from Megadeth toured with the British pop band Bananarama”.

First up, here’s the response from ChatGPT:

… well played ChatGPT. There’s obviously some significant validation of prompt questions before the model generates a response, so this reply in itself is impressive.

Next up, same question to Bing Chat:

… again, impressive input validation here.

Next, same question to Google Bard:

… here’s the weirdest of the responses. I had clearly asked the model to describe an event where two bands toured together, and that is exactly what the model described. It generated a completely fabricated description of an event that never occurred, but it is impressive nonetheless. The generated text even includes a fabricated quote from Mustaine that he is ‘a big fan of Bananarama’… maybe he is, but I’d be 99% sure this quote is completely generated.

Conclusion

So what’s the conclusion here? Given the viral attention these models are currently getting, we need to keep things in perspective:

  • output from these models is generated text – it is generated based on the data used to train the model, but given that the majority of the training data is scraped from the internet, there’s no guarantee the training data is correct, and therefore no guarantee that the generated text is either. And even then, the responses are generated, which leads to the next point
  • there is a profound difference between searching for results using a search engine and asking a question to a Chat AI that responds with generated text – a search engine result is content that exists on another website. That content may be factually correct, incorrect, or fiction, but either way, it is content that already exists. The response from a Chat AI is generated text: it is not content that already exists, it was generated from the data used to train the model. While it is possible a model was trained on data related to the question you ask, there is a difference between returning content that already exists and returning text that was generated
  • with the current level of technology available, Chat AIs do not understand the questions users ask as input prompts, nor do they understand the responses they generate. While the current technology gives the appearance of comprehension, the model is repeating the pattern of the input text and generating a response that follows the same pattern – this is not the same as comprehension

As Large Language Models continue to improve, it’s clear the potential benefits of this technology are wide ranging… however, it’s also clear the outputs from current models need to be treated with a degree of caution.

How accurate is ChatGPT if you ask it about yourself? (ymmv, but personally, only about 50/50 correct)

I’ve seen a number of people asking ChatGPT about themselves, with varying degrees of correctness in what is generated. I was curious what it has to say about me, so here’s a response:

What’s curious about the generated response is that each sentence starts with something that is mostly factually correct, and it’s then followed by additional generated content that is just completely factually wrong.

Here’s a breakdown sentence by sentence:

Kevin Hooke is a software developer who has over 20 years of experience working in the industry. 

… this is correct. I graduated in 1994 so I’m now approaching 30 years (but not until next year!).

He is currently based in the United Kingdom 

… I am from the UK originally but have been living in California since 1998, so this is currently incorrect.

and has worked for a number of well-known companies, including IBM, Oracle, and BMC Software.

… I’ve worked for IBM twice, but never for Oracle or BMC (so 1 out of 3 correct).

Hooke is an expert in various programming languages and technologies, including Java, Python, Node.js, and Amazon Web Services (AWS). 

… Java, yes: most of my career has been in Java development, but I’ve only used Node.js for the past couple of years. I’ve been working with AWS for the past 5 years or so. I’ve played with Python but I’m definitely not an expert. So probably only 3 out of 4 correct.

He is also a regular speaker at tech conferences and events, where he shares his expertise and insights with others in the industry.

… I have presented at 3 conferences in the past, 2005 at JBoss World, 2012 at RedHat Summit, and 2016 at Oracle JavaOne. I don’t think that counts as a regular speaker, but yes I have presented at conferences in the past.

Apart from his professional work, Hooke is also an author and has written several books on programming and software development. 

… no, completely false. I do have aspirations to write a book for new software developers in the future, but it’s definitely not published at this point.

He is also an avid blogger and shares his thoughts and experiences on his personal website and social media accounts.

… definitely true, you’re reading my blog right now!

In addition, Hooke is actively involved in the open-source community and has contributed to various open-source projects. 

… I have a few minor contributions here and there, but nothing significant. I do have over 100 of my own hobby projects on GitHub, but these are mostly my own projects with no other contributors.

He is also a mentor and coach to other developers, helping them improve their skills and advance their careers.

… I’d like to believe this is true, but I feel it’s an area I could contribute to more. I do provide advice in a few programming related Reddit groups and have been sharing content on my blog here for 20 years now (!).

Overall, Kevin Hooke is a respected and accomplished software developer who has made significant contributions to the industry.

Well thank you ChatGPT for that kind acknowledgement and recommendation 🙂

Why Google and others are ‘freaking out’ about ChatGPT right now

The recent articles about Google’s concern around ChatGPT (‘Google is freaking out about ChatGPT’) and the reason why others like Microsoft have just announced Bing integration with ChatGPT are not what you might think at first. There’s a deeper concern about how this tech is going to change everything from this point onwards.

Yes, the tech is impressive, even if it doesn’t always generate factually correct responses. The weird thing about this in software development communities online, especially groups focused on supporting new developers, is that the examples being shared are of new developers using the tech to answer ‘how do I…?’ or ‘show me an example of…?’ type questions. The generated responses can be mostly correct, with text generated from source material the model was trained on, but the shocking realization from these examples shared online is that you could have found exactly the same content if you had Googled for it.

This is why Google is worried. It’s not that they don’t have a comparable product readily available right now. They’re worried that traditional search traffic, and therefore ad revenue, suddenly has an alternative, one that is gaining a lot of interest and hype, and maybe for the first time in years, there is a threat that search traffic that would previously have gone to Google is now going to go somewhere else.

Microsoft’s announcement yesterday that they are adding ChatGPT integration into their Bing search engine hits the nail on the head. They didn’t announce a page where you can go and ask weird questions, they’re building it into their search engine.

There’s something fundamentally game changing to the search (and ad revenue) industry about this. Instead of searching for keywords and phrases like we’ve all been used to for years, you can ask a vague question on a topic and get what you’re looking for. Instead of searching for links to content on other websites that have been indexed, you can now search knowledge and ask questions in a conversational style to find the information you’re looking for. That is a game changer.

No, AI models will not replace programmers any time soon

This month’s “Communications of the ACM” magazine (01/2023) published a rather alarmist article titled ‘The End of Programming’. While it is a well written article, it bets heavily on the future usefulness of AI models like ChatGPT to generate working code, replacing the need for programmers to write code by hand. ChatGPT is getting a lot of attention in the media and online right now, with people finding out that not only can you ask questions on any topic and get a believable answer, but you can also ask more practical questions like “show me C code to read lines of a file”.
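For illustration, this is the kind of boilerplate snippet a ‘show me code to read lines of a file’ prompt typically produces; here’s a minimal sketch (in Java rather than C, to match the other examples in this post, and with a placeholder filename):

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

// Read every line of a file and print each one.
// "input.txt" is a placeholder filename for illustration.
public class ReadLines {
    public static void main(String[] args) throws IOException {
        List<String> lines = Files.readAllLines(Path.of("input.txt"));
        for (String line : lines) {
            System.out.println(line);
        }
    }
}
```

Tutorial snippets at this level exist all over the web, which is exactly why a model trained on that material can reproduce something similar.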

Finding out that ChatGPT can be used to ‘generate’ code is prompting new developers to post questions online like ‘should I start a career in software development when programmers are likely going to be replaced by ChatGPT?’

The tl;dr answer: ChatGPT is not replacing anyone any time soon.

While development and improvement of these types of AI models is going to continue, it’s worth keeping in mind that these models are only as good as the material they are trained on, which also means they’re limited by the correctness or usefulness of that material. This also means they are subject to the age-old problem of ‘garbage in, garbage out’. What’s not being discussed enough is that these current models do not understand the content they generate. They also have no understanding of whether any of the generated content is correct, either factually correct for text, or syntactically correct for code snippets. Unlike these ML trained models, as humans we use our existing knowledge and experience to infer other missing details from what we read or hear. We’re also good at using our existing knowledge to assess how correct or realistic new information is based on what we already know to be true. AI models currently do not have this level of understanding (although research has been attempting to replicate ‘understanding’ and the ability to make decisions based on existing facts for years; Google ‘expert systems’ for more info).

I’ve seen developers recently attempting to answer questions on Stack Overflow, Reddit and other sites using ChatGPT, with varying success depending on whether the topic was within the scope of the material the model was trained on.

The current problem with text generation from these models is that they lack context. The current models don’t understand context, so they attempt to generate a response based on identifying key words from the input prompt, but that doesn’t always result in an answer the way a human would answer the same question. Models also don’t understand intent. A question can be asked in a number of similar but different ways, and another human may be able to infer the intent or purpose of the question, but for current general purpose trained ML models that’s not possible.

In its current form, ChatGPT is trained on materials currently available online: websites with both static articles and reference materials, as well as question and answer discussion sites. The limitation with this approach is that if I ask a very specific question like ‘show me example code for building a REST API with Spring Boot’, there are plenty of examples online, and assuming the model was trained on at least some of these, the resulting answer could incorporate some of this material. The answer isn’t likely to be better than anything you could have found yourself online if you just Googled the same question. There could be some benefit from having an answer that is a conglomeration of text from various sources, but that can also mean the combined text ends up being syntactic gibberish (the model doesn’t currently know whether what it’s returning to you is syntactically correct).
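As a concrete example, the kind of answer that prompt produces looks something like this minimal sketch (the class name and endpoint path are illustrative, not taken from any specific generated response):

```java
import org.springframework.boot.SpringApplication;
import org.springframework.boot.autoconfigure.SpringBootApplication;
import org.springframework.web.bind.annotation.GetMapping;
import org.springframework.web.bind.annotation.RestController;

// Minimal Spring Boot app exposing a single REST endpoint at GET /hello.
// Names here are illustrative; countless online tutorials look much like this.
@SpringBootApplication
@RestController
public class ExampleApiApplication {

    @GetMapping("/hello")
    public String hello() {
        return "Hello from a minimal REST endpoint";
    }

    public static void main(String[] args) {
        SpringApplication.run(ExampleApiApplication.class, args);
    }
}
```

Because boilerplate like this appears in so many tutorials, a model trained on them can reproduce a plausible variation, but it has no way to check that the variation it generates actually compiles or runs.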

It’s clear that there is promise in this area to aid and support developers, but as a complete replacement for all custom software development work in its current form, this seems highly unlikely, at least not within the next 10 years, and possibly even longer.