Tue Apr 15 2025

The debates on AI evaluation have arrived in the world of Pokémon.

Not even Pokémon is exempt from the controversy over AI evaluation. Last week, a post on X went viral claiming that Google's new Gemini model had surpassed Anthropic's flagship Claude model in the original Pokémon video game trilogy. Gemini had reportedly reached Lavender Town during a developer's Twitch stream, while Claude had been stuck at Mount Moon since late February.

What the post failed to mention, however, is that Gemini had a significant advantage. Users on Reddit pointed out that the developer hosting the Gemini stream had built a custom minimap that helps the model identify "tiles" in the game, such as trees that can be cut down. This spares Gemini from having to analyze screenshots before making gameplay decisions.

Pokémon is, admittedly, a dubious benchmark for AI; few would argue it offers much insight into a model's capabilities. But it is a revealing example of how different implementations of a benchmark can influence results. For instance, Anthropic reported two scores for its Claude 3.7 Sonnet model on SWE-bench Verified, a benchmark designed to assess a model's programming skills. Claude 3.7 Sonnet achieved 62.3% accuracy on SWE-bench Verified, but reached 70.3% using a "custom scaffolding" developed by Anthropic.

More recently, Meta fine-tuned a version of one of its new models, Llama 4 Maverick, to perform well on a specific benchmark, LM Arena; the standard version of the model scored significantly worse on the same evaluation. Since AI benchmarks, Pokémon included, are imperfect measures to begin with, custom and non-standard implementations threaten to muddy comparisons between models even further as new ones are released.