The toughest AI benchmark just got a whole lot tougher

ARC-AGI-3 is the latest version of a clever benchmark that challenges AI models to solve mini video games with no written instructions.

The flood of new AI models with increasingly advanced “reasoning” capabilities is forcing the AI industry to retire early benchmark tests and invent new ones that test for a wider range of skills.

To watch the evolution of one such test, ARC-AGI, is to witness the huge technical leaps that today’s generative-AI models have made in a few short years. Tech CEOs brag about their models’ high scores on ARC-AGI, which is widely considered one of the most distinctive and difficult AI benchmarks in use today.

Rather than testing how well a model can translate an inscription on an ancient Roman tombstone, or offer a diagnosis for a complex medical case, ARC-AGI challenges AI models to analyze abstract geometric puzzles and games without any written instructions. This forces the models to create solutions to complex multistep problems, rather than regurgitate text from their training.

The newly launched version, ARC-AGI-3, is essentially a collection of mini games that the user plays by moving simple shapes around a pixelated game board with directional arrows. By design, the games are easy for humans to figure out after a few minutes of experimentation, but incredibly difficult for computers to solve.

François Chollet, the creator of ARC-AGI, told Sherwood News in an email:

“You can’t cram for the test. It requires you to explore and figure out each environment on the fly, on your own, instead of relying on extensive training data. Humans are really good at adapting to novelty, but AI systems are still fundamentally reliant on memorized templates.”

Chollet said that even after AI models have seen thousands of games, they struggle with ARC-AGI-3, because every game in it is unique.

One of the fascinating new features of the latest version is a replay mode that lets human observers read through AI models’ “chain of thought” transcript to see how a model breaks down the problem and attempts a solution.

Humans can play through these games on the project’s website. For now it seems humans don’t have much to worry about.

The most capable state-of-the-art models in the wild haven’t even cracked 1% (out of a possible 100%). The current leaderboard for ARC-AGI-3 shows OpenAI’s GPT-5.4 in the lead at 0.3%, with Anthropic’s Opus 4.6 and Google’s Gemini 3.1 Pro tied for second place. xAI’s Grok 4.20 Reasoning model scored 0%.

Chollet says his team is already working on future versions of ARC-AGI:

“We are currently working on ARC-AGI-4 and ARC-AGI-5. We will release a new benchmark every year, each time asking the most important unsolved questions on the way to AGI. Three important topics we’re looking at for future versions are continual learning, open-endedness, and autonomous invention.”

Updated to include comments from François Chollet.

Sherwood Media, LLC and Chartr Limited produce fresh and unique perspectives on topical financial news and are fully owned subsidiaries of Robinhood Markets, Inc., and any views expressed here do not necessarily reflect the views of any other Robinhood affiliate, including Robinhood Markets, Inc., Robinhood Financial LLC, Robinhood Securities, LLC, Robinhood Crypto, LLC, Robinhood Money, LLC, Robinhood U.K. Ltd, Robinhood Derivatives, LLC, Robinhood Gold, LLC, Robinhood Asset Management, LLC, Robinhood Credit, Inc., Robinhood Ventures DE, LLC and, where applicable, its managed investment vehicles.