The toughest AI benchmark just got a whole lot tougher
ARC-AGI-3 is the latest version of a clever benchmark that challenges AI models to solve mini video games with no written instructions.
The flood of new AI models with increasingly advanced “reasoning” capabilities is forcing the AI industry to abandon early benchmark tests and invent new ones that probe a wider range of skills.
To watch the evolution of one such test — ARC-AGI — is to witness the huge technical leaps that today’s generative-AI models have made in a few short years. Tech CEOs brag about their models’ high scores on ARC-AGI, which is widely considered one of the most difficult and distinctive AI benchmarks in use today.
Rather than testing how well a model can translate an inscription on an ancient Roman tombstone, or offer a diagnosis for a complex medical case, ARC-AGI challenges AI models to analyze abstract geometric puzzles and games without any written instructions. That forces the models to work out solutions to complex multistep problems rather than regurgitate text from their training data.
We created an in-house game studio and built 135 novel environments from scratch
No instructions, Core Knowledge Priors-only
In order to beat these games, AI must:
• Explore the environment
• Form hypotheses
• Execute a plan
• Learn and adapt
— ARC Prize (@arcprize) March 25, 2026
The latest version, ARC-AGI-3, which just launched, is essentially a collection of mini games that the player works through by moving simple shapes across a pixelated game board using directional arrows. As designed, the games are easy for humans to figure out after a few minutes of experimentation, but incredibly difficult for computers to solve.
François Chollet, creator of ARC-AGI, told Sherwood in an email:
“You can't cram for the test. It requires you to explore and figure out each environment on the fly, on your own, instead of relying on extensive training data. Humans are really good at adapting to novelty, but AI systems are still fundamentally reliant on memorized templates.”
Chollet said that even AI models that have seen thousands of games struggle with those in ARC-AGI-3, since each one is unique.
One of the fascinating new features of the latest version is a replay mode that lets human observers read through an AI model’s “chain of thought” transcript to see how the model breaks down the problem and attempts a solution.
Humans can play through these games on the project’s website. For now it seems humans don’t have much to worry about.
The most capable state-of-the-art models in the wild haven’t even cracked a score of 1%. The current leaderboard for ARC-AGI-3 shows OpenAI’s GPT-5.4 in the lead at 0.3%, with Anthropic’s Opus 4.6 and Google’s Gemini 3.1 Pro tied for second place. xAI’s Grok 4.20 Reasoning model scored 0%.
Chollet says his team is already working on future versions of ARC-AGI:
“We are currently working on ARC-AGI-4 and ARC-AGI-5. We will release a new benchmark every year, each time asking the most important unsolved questions on the way to AGI. Three important topics we're looking at for future versions are continual learning, open-endedness, and autonomous invention.”
Updated to include comments from François Chollet.