If AI models can ace every test, it’s actually not a good thing
AI companies are eager to show how much “smarter” and more capable their latest large language models are. To highlight these improvements, companies point to scores on widely used standardized tests like HellaSwag or MMLU as AI benchmarks, showcasing how well the tools can write code or solve problems.
But these benchmarks have some serious flaws, with many of the most-cited ones created to test much simpler AI systems years before chatbots like ChatGPT hit the scene. Many were built by scraping amateur websites and lacked expert oversight. And because these benchmarks have been widely available on the internet for years, AI models have likely seen them as part of their training data.
The Financial Times reports that AI companies are now racing to build a new class of benchmarks, as today’s AI models all score at or above 90% on these tests.
As AI companies start to pivot to a model of AI “agents” that can control your computer and use multistep reasoning, new benchmarks are being created that can test these capabilities in a more meaningful way.
But a lack of regulatory oversight or industry standards means the public must rely on companies like Meta and OpenAI to test their own products for safety, with no standard way to compare the results between them.