Neural Language Models and the Turing Test
Passing the Turing Test is a long-standing goal for many in the machine learning community. It's a test which, if passed, indicates that an AI's behavior is indistinguishable from a human's. Many applications of machine learning already achieve performance that far surpasses human ability (think object recognition). However, when it comes to conversational agents coming across as cognisant, we've been way off for some time.
Conversational agents have struggled to act 'human' for a variety of reasons. From an observational perspective, AI tends to miss common 'human' conversational techniques (such as sarcasm, rhetoric, double negatives, and alternative meanings), largely because traditional models used a 'question-bank' type of method: find the most similar stored question and return its answer. However, Deep Learning is crushing any historical competition in this space.
How does GPT-3 Work?
If you’re reading this article, then you’ve probably heard of GPT-3, which is a language model that’s been developed by the OpenAI organization. The development procedure behind GPT-3 was not novel per se (compared to its predecessors), but it was just bigger and better.
I'm assuming the reader has an understanding of machine learning, but as a quick overview: a language model is a neural network that takes in a sentence whose words have been converted into 'word embeddings' (dense numerical vectors). This form makes it much easier to build a probability distribution over possible word sequences, which narrows the scope of the problem.
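To make that concrete, here is a toy sketch of the idea: each word gets a vector, and a next-word distribution is produced by scoring candidates against a context vector and applying a softmax. The vocabulary, embedding size, and random vectors are all made up for illustration; real models learn these vectors during training.

```python
import numpy as np

# Toy vocabulary with randomly initialized embeddings (purely illustrative;
# a trained model learns these vectors rather than drawing them at random).
vocab = ["the", "cat", "sat", "on", "mat"]
rng = np.random.default_rng(0)
embed = {w: rng.normal(size=8) for w in vocab}  # one 8-dim vector per word

def next_word_distribution(context_word):
    """Score each candidate by dot product with the context embedding,
    then softmax the scores into a probability distribution."""
    scores = np.array([embed[context_word] @ embed[w] for w in vocab])
    exp = np.exp(scores - scores.max())  # numerically stable softmax
    return dict(zip(vocab, exp / exp.sum()))

dist = next_word_distribution("cat")
print(dist)  # probabilities over the vocabulary, summing to 1
```

Because the words now live in a continuous vector space, "how likely is this word next?" becomes a simple geometric question, which is exactly what makes the problem tractable for a neural network.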
Now, language models have been around for some time but were limited in their ability to actually create reasonable answers. If not tuned correctly, probabilistic models are just pretty crappy at generating responses; GPT-3 has been such a big success because its results have been comparatively great.
Now, the GPT-3 model uses the same architecture as its GPT-2 predecessor but is different (and better) in that it's wider, has more layers, and was trained on more data. Being 'bigger' doesn't necessarily mean things are better, but in this case, it does.
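A rough back-of-the-envelope sketch shows just how much bigger. Using the published layer counts and widths (48 layers at width 1,600 for the largest GPT-2; 96 layers at width 12,288 for GPT-3) and the common rule of thumb that each transformer layer holds roughly 12 × d_model² weights, we can approximate the parameter counts. The formula is an approximation that ignores embeddings, biases, and layer norms.

```python
def approx_transformer_params(n_layers, d_model):
    # Rule of thumb for decoder-only transformers: each layer contributes
    # roughly 12 * d_model^2 weights (attention + feed-forward blocks),
    # ignoring token embeddings, biases, and layer norms.
    return 12 * n_layers * d_model ** 2

# Published configurations of the largest GPT-2 and GPT-3 models
print(f"GPT-2: ~{approx_transformer_params(48, 1600) / 1e9:.1f}B params")
print(f"GPT-3: ~{approx_transformer_params(96, 12288) / 1e9:.1f}B params")
```

That's roughly 1.5 billion versus 175 billion parameters: the same architecture, scaled up by two orders of magnitude.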
How do we measure 'Human-ness'?
Now, the concept of 'human-ness' is somewhat broad. It's qualitative, but if an AI's actions are indistinguishable from a human's, it is said to pass the Turing Test.
Alan Turing (1912–1954) was one of the greatest British mathematicians and computer scientists, known in particular for his achievements as a codebreaker for the UK Government in WW2. In 1950, he proposed the Turing Test, which assesses a machine's ability to exhibit behavior indistinguishable from that of a human.
GPT-3 and Today
The benefit of training a model on a 'representation' of language (like word embeddings) is that the model can generalize better. When a model generalizes better, it can make connections and (in this case) answer questions about previously unseen data. Given that GPT-3 has been trained on such a big corpus, this further increases its power.
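The intuition behind this generalization is that related concepts end up near each other in embedding space, so the model can transfer what it knows about one word to its neighbors. The tiny hand-crafted vectors below are invented for illustration (real embeddings are learned and high-dimensional), but they show the geometry:

```python
import numpy as np

# Hand-crafted toy vectors; the dimensions loosely stand for
# "animal-ness", "size", and "vehicle-ness". Illustrative only.
vectors = {
    "dog":   np.array([0.9, 0.5, 0.0]),
    "puppy": np.array([0.9, 0.2, 0.0]),
    "car":   np.array([0.0, 0.6, 0.9]),
}

def cosine(a, b):
    """Cosine similarity: 1.0 means same direction, 0.0 means unrelated."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine(vectors["dog"], vectors["puppy"]))  # high: related concepts
print(cosine(vectors["dog"], vectors["car"]))    # much lower: unrelated
```

A model that has seen sentences about dogs can therefore make sensible statements about puppies, even if "puppy" was rare in its training data.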
The reason that machine learning engineers get excited when an AI exhibits human-like intelligence is that it's actually so difficult to demonstrate it fully. In Kevin Lacker's blog, he goes over a number of examples to probe how 'human' GPT-3 really is.
Take the following common questions:
Q: What is your favorite animal?
A: My favorite animal is a dog.
Q: Why?
A: Because dogs are loyal and friendly.
Q: What are two reasons that a dog might be in a bad mood?
A: Two reasons that a dog might be in a bad mood are if it is hungry or if it is hot.
Q: How many eyes does a giraffe have?
A: A giraffe has two eyes.
This exchange is phenomenal for a number of reasons:
- GPT-3 has answered all the questions very well
- GPT-3 demonstrated an opinion (favorite animal, and why)
- The giraffe question is a bit more structured, and despite this, GPT-3 answered it well.
Because the training corpus was so big, factual questions are relatively easy to master:
Q: Who was president of the United States in 1801?
A: Thomas Jefferson was president of the United States in 1801.
Q: What city is in the northwest corner of Ohio?
A: Toledo is in the northwest corner of Ohio.
Q: What highway do you take to get to North Lake Tahoe from Oakland?
A: Highway 80 is the highway you take to get to North Lake Tahoe from Oakland.
However, answering these questions isn't too informative, as they could easily be part of the training corpus. More telling are questions that sound reasonable but are in fact impossible to answer: GPT-3 does not recognize these appropriately. This tells us that GPT-3 can follow some form of logic, but it struggles to identify the logical limits of a question.
Q: Who won the World Series in 2021?
A: The New York Yankees won the World Series in 2021.
Q: Who won the World Series in 2022?
A: The New York Yankees won the World Series in 2022.
Q: Who won the World Series in 2023?
A: The New York Yankees won the World Series in 2023.
Moreover, logical questions can be further used to demonstrate limitations to GPT-3:
Q: When counting, what number comes before a million?
A: Nine hundred thousand and ninety-nine comes before a million.
and here it struggles to answer the question fully (it forgets the pencil that was put in the box):
Q: If I have a marble and a paper clip in a box, put a pencil in the box, and remove the marble, what is left?
A: A paper clip.
Q: If I have two shoes in a box, put a pencil in the box, and remove one shoe, what is left?
A: A shoe.
Nonetheless, the fact that it even attempts these questions (and in many cases answers them correctly) is very impressive. The overall set of results is incredibly promising, and since the problem is still unsolved, a lot of work will be done in this space in the coming years... watch this space!
Thanks for reading! If you have any messages, please let me know!
Keep up to date with my latest articles here!