
Google Assistant, Alexa and Siri can all do natural language processing (NLP), but ChatGPT is on a different level. How does it do that? Let’s take a look.
Before we begin, note that this lesson focuses on the big picture, with only a brief mention of engineering details. The essence of the discussion applies equally to other current “large language models” (LLMs), not just to ChatGPT.
To begin with, ChatGPT’s primary objective is to generate a “reasonable continuation” of the text it has been given. By “reasonable,” we mean that the output should resemble what one would expect to see next after analysing the vast corpus of text available on the internet.
Suppose we have the text “The benefits of Machine Learning and AI are”. One could scan through the immense amount of human-written text available in digitised books and on the internet, find every place this text occurs, and see which word most often follows it. ChatGPT does something similar, except that it doesn’t match the text literally. Instead, it looks for text that matches in meaning. The outcome is a list of words that could follow the given text, ranked by their probability of appearing next.
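The literal counting approach described above can be sketched in a few lines of Python. This is only an illustration of the naive idea, not of how ChatGPT works internally; the function name `next_word_probabilities` and the tiny corpus are made up for this example.

```python
from collections import Counter

def next_word_probabilities(corpus, prompt):
    """Rank the words that follow `prompt` in `corpus` by relative frequency."""
    words = corpus.lower().split()
    target = prompt.lower().split()
    n = len(target)
    counts = Counter()
    # Slide a window over the corpus; whenever it matches the prompt,
    # record the word that comes immediately after it.
    for i in range(len(words) - n):
        if words[i:i + n] == target:
            counts[words[i + n]] += 1
    total = sum(counts.values())
    return [(word, count / total) for word, count in counts.most_common()]

# Hypothetical toy corpus; a real corpus would contain billions of words.
corpus = (
    "the benefits of ai are many . "
    "the benefits of ai are real . "
    "the benefits of ai are many and varied ."
)
ranked = next_word_probabilities(corpus, "benefits of AI are")
print(ranked)
```

Here “many” is ranked first because it follows the prompt in two of the three occurrences. A prompt that never appears in the corpus returns an empty list, which hints at why pure counting breaks down for longer texts.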
What’s truly noteworthy is that when ChatGPT writes an essay, its fundamental process involves repeatedly asking the question, “what should be the next word given the existing text?” It then adds the suggested word to the existing text. More precisely, it adds a “token,” which may be only part of a word, and that is why it can occasionally make up new words.
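The “ask for the next token, append it, repeat” loop might look roughly like the sketch below. `TOY_MODEL` is a hypothetical lookup table standing in for the real model, and it only looks at the last token; an actual LLM takes all of the existing text into account.

```python
# Hypothetical stand-in for the model: last token -> suggested next token.
# Note that "learn" + "ing" are two tokens that together form one word.
TOY_MODEL = {
    "machine": "learn",
    "learn": "ing",
    "ing": "is",
    "is": "fun",
}

def generate(prompt_tokens, max_new_tokens=10):
    """Repeatedly ask the toy model for the next token and append it."""
    tokens = list(prompt_tokens)
    if not tokens:
        return tokens  # nothing to continue from
    for _ in range(max_new_tokens):
        next_token = TOY_MODEL.get(tokens[-1])
        if next_token is None:  # the toy model has no suggestion, so stop
            break
        tokens.append(next_token)
    return tokens

result = generate(["machine"])
print(result)
```

The loop stops either when the token limit is reached or when the toy model has nothing more to suggest.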
At each stage of the process, ChatGPT is presented with a list of words accompanied by their respective probabilities. The question then arises: which word should it choose to add to the essay or other content it’s generating? One might assume that the best approach would be to always select the word with the highest probability, but this is where things get somewhat mystical. It turns out that, for reasons that are still not fully clear, always selecting the highest-ranked word tends to produce a flat essay that shows little creativity and occasionally repeats itself. In contrast, if ChatGPT sometimes picks lower-ranked words at random, the resulting essay is likely to be more interesting.
Because of this randomness, using the same prompt multiple times will usually produce a different essay each time. How often lower-ranked words are selected is controlled by a parameter called “temperature”, and for essay generation a temperature of about 0.8 has been found to work well. It’s worth noting that this value does not come from theory; it is a practical choice arrived at through trial and error.
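One common way to implement temperature, shown here as a sketch rather than as ChatGPT’s actual code, is to raise each probability to the power 1/temperature and renormalise before sampling. The word list, the probabilities, and the function names below are invented for illustration.

```python
import random

def apply_temperature(probs, temperature):
    """Reweight a {word: probability} mapping; temperature must be > 0.

    Values below 1 favour the top-ranked words, values above 1 flatten
    the distribution, and 1 leaves it unchanged.
    """
    weights = {w: p ** (1.0 / temperature) for w, p in probs.items()}
    total = sum(weights.values())
    return {w: v / total for w, v in weights.items()}

def sample_word(probs, temperature, rng):
    """Pick one word at random according to the reweighted probabilities."""
    adjusted = apply_temperature(probs, temperature)
    words = list(adjusted)
    return rng.choices(words, weights=[adjusted[w] for w in words], k=1)[0]

# Hypothetical next-word probabilities.
probs = {"learning": 0.5, "intelligence": 0.3, "magic": 0.2}
rng = random.Random(0)  # seeded only so that repeated runs match

print(apply_temperature(probs, 0.8))
print([sample_word(probs, 0.8, rng) for _ in range(5)])
```

At a very low temperature the top word is chosen almost every time, which corresponds to the “always pick the highest-ranked word” strategy; higher temperatures give lower-ranked words more of a chance.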
As discussed earlier, ChatGPT does not predict exactly word by word; it works with “tokens”. Going into the technical details of tokens is beyond the scope of this beginner-level course, so let’s move on to the next point.
In natural language processing, there is a technical term known as an “n-gram”. An n-gram is a contiguous sequence of “n” items from a given sample of text or speech. An “item” can be a character, a word, or even a sentence. When the items are words, a 2-gram is a sequence of two words and a 3-gram is a sequence of three words. For example, in the sentence “The quick brown fox jumps over the lazy dog”, the 2-grams are “The quick”, “quick brown”, “brown fox”, “fox jumps”, “jumps over”, “over the”, “the lazy”, and “lazy dog”. The 3-grams are “The quick brown”, “quick brown fox”, “brown fox jumps”, “fox jumps over”, “jumps over the”, “over the lazy”, and “the lazy dog”.
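Extracting word n-grams is simple enough to show directly. The helper name `ngrams` below is our own; it splits on whitespace, which is a simplification of how real text is tokenised.

```python
def ngrams(text, n):
    """Return every contiguous sequence of n words in text."""
    words = text.split()
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]

sentence = "The quick brown fox jumps over the lazy dog"
bigrams = ngrams(sentence, 2)
trigrams = ngrams(sentence, 3)

print(bigrams)        # 8 two-word sequences, from "The quick" to "lazy dog"
print(trigrams)       # 7 three-word sequences
```

A sentence of 9 words yields 8 2-grams and 7 3-grams, matching the lists above.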
The amount of text available on the web and in digitised books is immense, with hundreds of billions of words in each. However, with a vocabulary of about 40,000 common English words, there are already 1.6 billion possible 2-grams (40,000 × 40,000) and roughly 60 trillion possible 3-grams. Most of these sequences never appear even once in the available text, so their probabilities cannot be estimated by counting. In fact, the number of possible essay fragments of 20 words is greater than the number of particles in the universe, so it is impossible to go through all of them.
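The figures above can be checked with a little arithmetic. The sketch assumes a vocabulary of 40,000 common words (the size that reproduces the 1.6 billion figure) and uses 10^80 as a rough, commonly quoted order of magnitude for the number of particles in the observable universe; both numbers are assumptions for illustration.

```python
VOCABULARY_SIZE = 40_000          # assumption: number of common English words
PARTICLES_IN_UNIVERSE = 10 ** 80  # assumption: rough order-of-magnitude estimate

def possible_ngrams(n, vocabulary_size=VOCABULARY_SIZE):
    """Number of distinct n-word sequences over the vocabulary."""
    return vocabulary_size ** n

print(f"{possible_ngrams(2):,}")                    # 1,600,000,000
print(f"{possible_ngrams(3):,}")                    # 64,000,000,000,000
print(possible_ngrams(20) > PARTICLES_IN_UNIVERSE)  # True
```

The exact count of 3-grams under this assumption is 64 trillion, which is the “roughly 60 trillion” mentioned above.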
To overcome this challenge, the solution is to build a model that estimates the probabilities of these sequences instead of counting them directly, so that it can assign a probability even to sequences that have never been seen in the corpus of text. This is precisely what a “large language model” (LLM) does, and it is the core technology behind ChatGPT. The LLM has been designed to estimate the probabilities of word sequences well enough to generate reasonable continuations of text. Specifically, the LLM behind ChatGPT is built on a type of neural network called a “transformer”, which was introduced by researchers at Google in 2017.
The transformer model works by processing the input text through a series of layers of artificial neural networks, which gradually build up a “representation” of the input text that the model can use to make predictions about what words are likely to come next. This representation is built up through a process of “self-attention”, in which the model learns to identify which parts of the input text are most relevant to predicting the next word, and assigns different weights to each part based on its relevance.
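To make the idea of “assigning weights by relevance” concrete, here is a heavily simplified sketch of self-attention in plain Python. It assumes each word is already represented by a small list of numbers, and it uses each vector directly as its own query, key, and value. Real transformers apply learned projections, use many attention heads, and stack many layers, none of which is shown here.

```python
import math

def softmax(scores):
    """Turn a list of scores into positive weights that sum to 1."""
    highest = max(scores)
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(vectors):
    """Simplified self-attention over a list of equal-length vectors."""
    dim = len(vectors[0])
    outputs, all_weights = [], []
    for query in vectors:
        # Relevance of every position to this one: scaled dot product.
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
            for key in vectors
        ]
        weights = softmax(scores)
        # New representation: weighted average of all the input vectors.
        mixed = [
            sum(w * v[d] for w, v in zip(weights, vectors))
            for d in range(dim)
        ]
        outputs.append(mixed)
        all_weights.append(weights)
    return outputs, all_weights

# Hypothetical 2-dimensional vectors for a three-word input.
vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = self_attention(vectors)
for row in weights:
    print([round(w, 3) for w in row])
```

Each row of weights shows how much one position “attends” to every other position. For the first vector, positions 1 and 3 receive more weight than position 2 because they point in a more similar direction.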
Once the transformer model has built up its representation of the input text, it uses this representation to make predictions about what is likely to come next. Specifically, it produces a score for every word (more precisely, every token) in its vocabulary and converts those scores into a probability distribution. These probabilities are not looked up from counts in the training data; they are computed by the model, whose parameters were learned from that data. The probability distribution is then used to select the next word to add to the output text.
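The usual way to turn raw scores into a probability distribution is the softmax function, sketched below. The words and scores are invented for illustration, and the function name `scores_to_probabilities` is our own.

```python
import math

def scores_to_probabilities(scores):
    """Convert a {word: raw score} mapping into probabilities via softmax."""
    highest = max(scores.values())
    # Subtracting the highest score keeps exp() from overflowing and
    # does not change the result.
    exps = {w: math.exp(s - highest) for w, s in scores.items()}
    total = sum(exps.values())
    return {w: e / total for w, e in exps.items()}

# Hypothetical raw scores for three candidate next words.
raw_scores = {"learning": 2.0, "intelligence": 1.0, "magic": 0.1}
probabilities = scores_to_probabilities(raw_scores)
most_likely = max(probabilities, key=probabilities.get)

print(probabilities)
print(most_likely)
```

Higher scores become higher probabilities, every probability is positive, and together they add up to 1.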
The key to the transformer model’s success is that it can capture long-range dependencies between words in a sentence, rather than just looking at pairs or triplets of adjacent words (as in the case of 2-grams and 3-grams). This allows it to generate much more fluent and natural-sounding text than earlier models that relied on simpler methods for predicting word probabilities.
This is a simple overview of the principles behind LLMs like ChatGPT. If you want to go deeper and learn how Artificial Intelligence works in general, you may enroll in our AI course.
If you want to go further, it’s essential that you know programming. Since Python is one of the most popular and widely used programming languages, it is a good place to start. You may check out our Python course.
Note: The above article is an excerpt from our online course on Artificial Intelligence.