Overcoming Turing: Rethinking Evaluation in the Era of Large Language Models

by Dr. Megan Ma, CodeX Assistant Director and Jay Mandal, CodeX Fellow

As sensationalized headlines about the unexpected capabilities[1] of OpenAI’s GPT-4 swept the Internet, many continue to wonder where the fairy dust ends and where the era of human-machine collaboration begins. Yet, amid the excitement of GPT-4 passing known medical licensing exams,[2] bar exams,[3] and other forms of standardized testing, one question remains unanswered: the Big So What. What does an AI model’s ability to pass these complex reasoning tests represent? Does it mean that large language models (LLMs) are exhibiting some degree of intelligence comparable to humans? One of the more common observations touted was that LLMs were seemingly “breaking the Turing test.”[4] In fact, some even noted that the Turing test is no longer applicable in the era of LLMs. So, what exactly does that mean, and is it true?

To answer these questions, we must first understand the historical context of the Turing test and, importantly, the original intentions behind the experiment. The Turing test was famously proposed by renowned mathematician and computer scientist Alan Turing in 1950. It has since been considered an operational test of intelligence, namely, a test of a machine’s ability to exhibit behavior indistinguishable from that of a human. In the proposed test, a human evaluator engages in a natural language conversation with both another human and a machine and attempts to determine which is which. If the evaluator is unable to tell them apart, the machine is said to have passed the test.

Fundamentally, the Turing test was never actually conceived as a test, or even intended to be a validating assessment. The Turing test is more accurately understood as a philosophical thought experiment, akin to another well-known and related “test,” the Chinese Room Argument.[5] Further to this clarification, the question behind the Turing test was not actually whether machines could think: it was whether machines could mimic a person. Indeed, Turing begins his article by describing it as the “imitation game.”

Researchers from AI21 Labs recently conducted the largest global social experiment to date replicating the Turing test, via a game known as “Human or Not”. The experiment engaged over two million participants from around the world. The researchers asked participants to chat blindly, for two minutes, with either an AI bot (based on leading LLMs such as GPT-4) or a fellow participant, and then to guess whether they had chatted with a human or a machine.

The outcomes were rather interesting. Only 68% of people guessed correctly when asked to determine whether they had spoken to a fellow human or an AI bot. Perhaps most interesting were the patterns of approaches and strategies deployed to tease out whether they were indeed speaking with a human or a bot. Many of the common strategies were based on the perceived limitations of language models that people had encountered while using ChatGPT and similar interfaces, and on their prior perceptions of human behavior online.[6] These included asking deeply personal or philosophical questions. There were also odd assumptions that the AI chatbot would be more polite than a human or that it could not formulate slang. Many of these assumptions were proven false in the interactions.

Yet, there remains the unanswered question of what the results of the Turing game really mean. More importantly, the game does not necessarily provide further clarity on whether these models are qualified to perform certain tasks. In each of these experiments, the researchers were unable to articulate what exactly was being measured. Rather, the results suggest that, observably and from a conversational standpoint, these machines are indeed quite capable of engaging in dialogue.

But what continues to fascinate about the Turing test and its relationship to interpreting the competences of machines – in this case, generative AI – is the underlying inability to express what distinguishes humanity from artificiality. In Alan Turing’s original paper, the majority of the text is a presentation of arguments against potential criticisms of the test. In fact, he takes “such pains to point out the fallacies in contrary views.”[7] The tension that Turing identified carries forward with particular force in the face of LLMs.

The next challenge, of course, is to consider what test would make sense if not the Turing test. Naturally, a clinical psychologist turned to the IQ test.[8] Eka Roivainen gave ChatGPT an IQ test and observed that it performed better than 99.9% of test takers. Interestingly, while in certain circumstances these models may score better than humans on some complex aptitude tests, the results were not always consistent. LLMs tend to fare rather poorly on reasoning tests about physical objects that developmental psychologists often give to children.[9] Other abstraction and reasoning tests easily solvable by humans proved incredibly challenging for GPT-4 (~30-33% success rate).[10]

So, how should we interpret a machine that is capable of passing incredibly sophisticated exams but could also simultaneously fail preschool?[11] Roivainen notes that high IQ scores coupled with “amusing failures” evidently imply that aspects of intelligence cannot be measured by IQ tests alone. Perhaps the deeper question here is: what are we really testing for? What exactly do we want these models to demonstrate and, in effect, prove? Accordingly, how should we then be testing them?

What we have observed is that it is extraordinarily difficult to evaluate the performance of these models. What further complicates the matter is that traditional machine learning evaluation, using benchmarking, is no longer applicable in the era of LLMs. Benchmarking works for specific capabilities (e.g., grammar in language ability),[12] but not when the abilities in question require leveraging multiple skills at once. The argument, then, is that academic and professional exams designed for humans should be used to remedy the pitfalls of traditional benchmarking efforts.

Yet, these exams have since become a new form of benchmarking. The problem is that how humans perform on these exams is shaped not only by our social environment but also by differences in how we learn. This further blurs the aforementioned line between human and machine. More importantly, performance on these tests has never been applied in isolation as a metric of human capability. That is, many lawyers would agree that the bar exam is a far cry from assessing their ability to conduct quality legal work. In fact, one of the complexities of the legal industry is that quality and value-add are often implicit, contextualized by experience and the specific know-how of the client and industry base.

Is it, then, as Jack Stilgoe describes, that we need a Weizenbaum test for AI? In his article, Stilgoe suggests that the complexity of testing these models appears to derive from a sense of detachment from their use in this world – that the focus on machines simulating intelligence is an artifact of the past.[13] Rather, there should be a reframing of tests toward their usefulness and public value, “evaluating them according to their real-world implications.”[14] This is a well-taken argument, albeit difficult to truly operationalize: usefulness is not necessarily a variable that can be systematically assessed. However, it may be just as important to note the shift from “intelligence” to granular, task-based reasoning. That is, identifying the right use cases for LLMs should underpin evaluation.

So, what good are LLMs for? To this end, we begin to reflect on Moravec’s Paradox. Hans Moravec observed in 1988 that tasks which are easy for humans to perform may be difficult for machines to replicate. The key word, of course, is replication. Underlying evaluation is the impression that machines should behave as humans. Yet, we must remind ourselves that these models do not think like humans. While they are capable of producing human-like linguistic output, they do not have the ability to feel like a human. The capabilities that these models display, they achieve via their own (black-boxed) paths. And so, while it is not known whether their strong performance across numerous complex tasks could amount to “intelligence,” the continued insistence on pushing these models to perform tasks that require some level of human abstraction fosters misuse. Instead, we should not only acknowledge, but also investigate, the capabilities that LLMs have that people do not.

We are beginning to see research in this area, particularly on LLMs for simulating[15] and generating hypothetical scenarios and/or offering alternative perspectives. That is, LLMs serve as excellent learning and training tools, as they are capable of extending the imagination of human minds. Importantly, humans cannot recall information at the rate at which LLMs are able to do so. This suggests the early emergence of a division of labor between humans and machines. Just as humans have a wide spectrum of specific skill sets and talents, machines are capable of this extent of specialization. Consider, for example, assessing whether an LLM should be customized for certain tasks through adjustments at the model level (i.e., fine-tuning) or via in-context learning (i.e., prompt engineering). Making these determinations, and the process by which we do so, requires further evaluation and more systematic guidance. More importantly, we require a deeper analysis of the current metrics used to understand human talent and skill sets. Only then can we more confidently seek richer areas of human-machine collaboration. Until then, we remain relegated to superficial, vanity metrics that showcase only partial interpretations of a model’s relative advantage.[16]

What we will need is a holistic suite of assessments capable not only of articulating a model’s technical performance, but also of comparing it to a human’s ability and rationale for accomplishing a particular task. Therefore, the first step in this evaluative harness is perhaps to characterize, empirically and explicitly, the value of human performance on a task, mapping the ontologies and conceptual hierarchies of the domain. Not only would this provide a deeper understanding of our role in the future of work, but it would also encourage a higher sensitivity toward nuance and granularity in performance. Furthermore, we may be able to better distinguish the behavioral and perceptual differences between the world models of our machines and our own.[17] In effect, we transition from the world of generalists to one of agility and specialization.


[1] Bubeck et al., Sparks of Artificial General Intelligence: Early experiments with GPT-4 (March 2023), available at: https://www.microsoft.com/en-us/research/publication/sparks-of-artificial-general-intelligence-early-experiments-with-gpt-4.

[2] Tiffany Kung, Research Spotlight: Potential for AI-Assisted Medical Education Using Large Language Models, Massachusetts General Hospital (February 9, 2023), available at: https://www.massgeneral.org/news/research-spotlight/potential-for-ai-assisted-medical-education-using-large-language-models.

[3] Daniel Martin Katz et al., GPT-4 Passes the Bar Exam, SSRN (April 5, 2023), available at: https://papers.ssrn.com/sol3/papers.cfm?abstract_id=4389233.

[4] Celeste Biever, ChatGPT broke the Turing test – the race is on for new ways to assess AI, Nature (July 25, 2023), available at: https://www.nature.com/articles/d41586-023-02361-7.

[5] See original discussion of the Chinese Room Experiment: John R. Searle, Minds, brains, and programs, Behavioral and Brain Sciences 3:417-424 (1980), available at: https://www.cambridge.org/core/journals/behavioral-and-brain-sciences/article/abs/minds-brains-and-programs/DC644B47A4299C637C89772FACC2706A.

[6] Daniel Jannai et al., Human or Not? A Gamified Approach to the Turing test, available at: https://arxiv.org/abs/2305.20010.

[7] Alan Turing, Computing Machinery and Intelligence, Mind 59: 433-460 (1950), available at: https://academic.oup.com/mind/article/LIX/236/433/986238.

[8] Eka Roivainen, I Gave ChatGPT an IQ Test. Here’s What I Discovered, Scientific American (March 28, 2023), available at: https://www.scientificamerican.com/article/i-gave-chatgpt-an-iq-test-heres-what-i-discovered/.

[9] Will Douglas Heaven, AI hype is built on high test scores. Those tests are flawed, MIT Technology Review (August 30, 2023), available at: https://www.technologyreview.com/2023/08/30/1078670/large-language-models-arent-people-lets-stop-testing-them-like-they-were/.

[10] Note that these results are yet to be compared with newer multimodal models. See Biever supra 4.

[11] Id.

[12] Avijit Chatterjee, The Problem with LLM Benchmarks, AIM Research (September 14, 2023), available at: https://aimresearch.co/2023/09/14/leaders-opinion-the-problems-with-llm-benchmarks/.

[13] See reference by Stilgoe on Philip Ball, “LLMs signal that it’s time to stop making the human mind a measure of AI.” Jack Stilgoe, We need a Weizenbaum test for AI, Science (August 11, 2023), available at: https://www.science.org/doi/full/10.1126/science.adk0176.

[14] Id.

[15] Consider, for example, the idea of simulating conflict to assist with training conflict resolution. See Omar Shaikh et al., Rehearsal: Simulating Conflict to Teach Conflict Resolution, available at: https://arxiv.org/abs/2309.12309.

[16] Consider, for example, the comparison of models on their rate of fact hallucination. Emerging commentary has showcased the triviality of these types of assessment, suggesting that they could mislead and imply advantages that extend beyond questions of usefulness. See for fact hallucination assessment: https://github.com/vectara/hallucination-leaderboard. See for commentary: https://www.linkedin.com/posts/drjimfan_please-see-update-below-a-recent-llm-hallucination-activity-7130230516246593536-mxAY/.

[17] See for example, Yuling Gu, What are they thinking? Do language models have coherent mental models of everyday things?, Medium (July 7, 2023), available at: https://blog.allenai.org/what-are-they-thinking-do-language-models-have-coherent-mental-models-of-everyday-things-cc73035a0ec8. See also Wes Gurnee and Max Tegmark, Language Models Represent Space and Time, https://arxiv.org/pdf/2310.02207.pdf; Omar Shaikh et al., Grounding or Guesswork? Large Language Models are Presumptive Grounders, https://arxiv.org/pdf/2311.09144.pdf; and Melanie Mitchell, AI’s challenge of understanding the world, Science (November 10, 2023), available at: https://www.science.org/doi/full/10.1126/science.adm8175.