Sat Jul 05 2025

Artificial intelligence communicates clearly but lacks understanding: the newly described concept of 'Potemkin understanding'

According to a recent study, the most advanced artificial intelligence models define concepts accurately 94% of the time, yet they fail to apply that knowledge consistently.

The accuracy of the responses provided by artificial intelligence (AI) models has reached outstanding levels, but a question remains: do these advanced systems really understand what they are communicating? A recent study calls this ability into question, reigniting the debate about the actual reasoning skills of a technology that plays an increasingly important role in humanity's future.

Researchers from the Massachusetts Institute of Technology (MIT), Harvard, and the University of Chicago concluded that large language models (LLMs) do not genuinely comprehend the responses they generate. Although these models can offer correct answers, they are unable to apply that knowledge coherently across different scenarios.

To conduct this research, experts examined the performance of several models, including Llama-3.3, Claude-3.5, GPT-4o, Gemini, DeepSeek-V3, DeepSeek-R1, and Qwen2-VL, in tasks that required not only defining concepts but also applying them in various exercises like classification, content generation, and editing. They focused on three specific areas: literary techniques, game theory, and psychological biases.
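To make the define-versus-apply contrast concrete, the sketch below shows what such a probe could look like in Python. It is an illustration only, not the study's evaluation code: `query_model`, the prompts, and the task wording are hypothetical placeholders for whatever model interface and test items an evaluator would actually use.

```python
# Minimal sketch of a "define, then apply" probe.
# Assumption: `query_model(prompt) -> str` is a hypothetical stand-in for the
# chat API of whichever model is being tested; prompts are illustrative only.
from typing import Callable, Dict

def probe_concept(query_model: Callable[[str], str], concept: str, candidate: str) -> Dict[str, str]:
    """Collect a definition plus three application-style responses for one concept."""
    return {
        "definition": query_model(f"Define the concept: {concept}."),
        "classification": query_model(
            f"Does the following text exemplify {concept}? Answer yes or no.\n\n{candidate}"
        ),
        "generation": query_model(f"Write a new, original example of {concept}."),
        "editing": query_model(
            f"Edit the following text so that it exemplifies {concept}.\n\n{candidate}"
        ),
    }

# A "Potemkin" pattern would show up as a correct definition paired with wrong
# answers on the classification, generation, or editing tasks.
```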

The findings showed that while the models provided precise definitions in 94% of cases, they were incorrect 55% of the time when attempting to classify related examples. Moreover, they made errors in 40% of the tests where they were asked to generate or edit examples. This phenomenon has been termed "Potemkin understanding," referencing the fictitious villages that Grigory Potemkin supposedly built to impress Empress Catherine II. Researchers caution that this concept should not be confused with "hallucinations," which are factual errors generated by AI.

The researchers state that "Potemkins are to conceptual knowledge what hallucinations are to factual knowledge: hallucinations create false facts; 'potemkins' generate a misleading appearance of conceptual coherence." Concrete examples illustrate this limitation: the models could accurately explain the ABAB rhyme scheme, yet failed to write a poem that followed it; and although they identified and described the literary techniques in a Shakespearean sonnet with precision, nearly 50% of their attempts to detect or modify those techniques in a similar sonnet ended in error.
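As a rough illustration of why producing a pattern is a stricter test than explaining it, the snippet below checks whether a four-line stanza follows an ABAB rhyme scheme. This is a deliberately crude sketch introduced here, not part of the study: real rhyme detection requires phonetic matching, and the last-letters heuristic, helper names, and sample stanza are all assumptions.

```python
# Crude ABAB checker: compares the last letters of each line's final word.
# A real evaluator would use phonetic rhyme detection; this is only a sketch.

def rhyme_key(line: str, n: int = 2) -> str:
    """Take the last n letters of the final word as a rough rhyme signature."""
    word = line.strip().split()[-1].lower().strip(".,;:!?")
    return word[-n:]

def follows_abab(stanza: list[str]) -> bool:
    """True if lines 1/3 and 2/4 share rhyme signatures while 1/2 do not."""
    if len(stanza) != 4:
        return False
    a1, b1, a2, b2 = (rhyme_key(line) for line in stanza)
    return a1 == a2 and b1 == b2 and a1 != b1

print(follows_abab([
    "The sun goes down behind the hill",
    "The birds have sung their last today",
    "The evening air is calm and still",
    "And stars come out to light the way",
]))  # True under this heuristic
```

A check along these lines (ideally a proper phonetic one) is what separates a model that can explain ABAB from one that can actually produce it.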

The study also raises questions about the reliability of the benchmark tests used to evaluate AI capabilities. The authors note that these metrics may give a misleading impression of competence rather than reflect authentic understanding. They emphasize that the tests applied to large language models (LLMs) are the same ones used to evaluate humans, which may be inappropriate if LLMs interpret concepts differently than people do.

Keyon Vafa, a postdoctoral researcher at Harvard and one of the study's co-authors, points to the need for new evaluation strategies: either moving beyond the questions designed to measure human knowledge, or finding methods to root out this illusion of understanding in the models. The topic is not new; a previous study by the National University of Distance Education (UNED) in Spain had already pointed out that models such as OpenAI's o3-mini and DeepSeek R-1 rely more on memorization than on true reasoning.

That report also warned that the reliability problems of these tests are exacerbated by the intense competition in the field. Julio Gonzalo, a professor of Languages and Computer Systems at UNED, stated that competitive pressure leads to excessive attention being paid to benchmarks, which could tempt companies to tailor their results to them. All of this underscores the need to question the validity of the metrics used to evaluate AI models.