The toughest AI benchmark just got a whole lot tougher
ARC-AGI-3 is the latest version of a clever benchmark that challenges AI models to solve mini video games with no written instructions.
The flood of new AI models with increasingly advanced “reasoning” capabilities is forcing the AI industry to abandon early benchmark tests and invent new ones that probe a wider range of skills.
To watch the evolution of one such test — ARC-AGI — is to witness the huge technical leaps that today’s generative-AI models have made in a few short years. Tech CEOs brag about their models’ high scores on ARC-AGI, which is widely considered one of the most difficult and distinctive AI benchmarks in use today.
Rather than testing how well a model can translate an inscription on an ancient Roman tombstone, or offer a diagnosis for a complex medical case, ARC-AGI challenges AI models to analyze abstract geometric puzzles and games without any written instructions. That forces the models to work out solutions to complex multistep problems rather than regurgitate text from their training data.
We created an in-house game studio and built 135 novel environments from scratch
No instructions, Core Knowledge Priors-only
In order to beat these games, AI must:
• Explore the environment
• Form hypotheses
• Execute a plan
• Learn and adapt
— ARC Prize (@arcprize) March 25, 2026
The latest version, ARC-AGI-3, which just launched, is essentially a collection of mini games that the player works through by moving simple shapes across a pixelated game board using directional arrows. As designed, the games are easy for humans to figure out after a few minutes of experimentation, but incredibly difficult for computers to solve.
François Chollet, creator of ARC-AGI, told Sherwood in an email:
“You can't cram for the test. It requires you to explore and figure out each environment on the fly, on your own, instead of relying on extensive training data. Humans are really good at adapting to novelty, but AI systems are still fundamentally reliant on memorized templates.”
Chollet said that even AI models that have seen thousands of games struggle with those in ARC-AGI-3, since each one is unique.
One of the fascinating new features of the latest version is a replay mode that lets human observers read through an AI model’s “chain of thought” transcript to see how the model breaks down the problem and attempts a solution.
Humans can play through these games on the project’s website. For now it seems humans don’t have much to worry about.
The most capable state-of-the-art models in the wild haven’t even cracked a score of 1%. The current leaderboard for ARC-AGI-3 shows OpenAI’s GPT-5.4 in the lead at 0.3%, with Anthropic’s Opus 4.6 and Google’s Gemini 3.1 Pro tied for second place. xAI’s Grok 4.20 Reasoning model scored 0%.
Chollet says his team is already working on future versions of ARC-AGI:
“We are currently working on ARC-AGI-4 and ARC-AGI-5. We will release a new benchmark every year, each time asking the most important unsolved questions on the way to AGI. Three important topics we're looking at for future versions are continual learning, open-endedness, and autonomous invention.”
Updated to include comments from François Chollet.