tech

Jon Keegan11/11/24

If AI models can ace every test, it’s actually not a good thing

AI companies are eager to show how much “smarter” and more capable their latest large language models are. To highlight these improvements, companies point to scores on widely used standardized tests like HellaSwg or MMLU as AI benchmarks to flex on how well the tools can write code or solve problems.

But these tools have some serious flaws, with many of the most-cited benchmarks created to test much simpler AI systems years before chatbots like ChatGPT hit the scene. Many of the tools were created by scraping amateur websites and lacked expert oversight. And because these benchmarks have been widely available on the internet for years, AI models have likely seen them as part of their training data.

The Financial Times reports that AI companies are now racing to build a new class of benchmarks, as today’s AI models all score at or above 90% on these tests.

As AI companies start to pivot to a model of AI “agents” that can control your computer and use multistep reasoning, new benchmarks are being created that can test these capabilities in a more meaningful way.

But a lack of regulatory oversight or industry standards means the public must rely on companies like Meta and OpenAI to test their own products for safety, and provides no standard way to compare the results between them.

AI groups rush to redesign model testing and create new benchmarks

AI groups rush to redesign model testing and create new benchmarks

But these tools have some serious flaws, with many of the most-cited benchmarks created to test much simpler AI systems years before chatbots like ChatGPT hit the scene. Many of the tools were created by scraping amateur websites and lacked expert oversight. And because these benchmarks have been widely available on the internet for years, AI models have likely seen them as part of their training data.

The Financial Times reports that AI companies are now racing to build a new class of benchmarks, as today’s AI models all score at or above 90% on these tests.

As AI companies start to pivot to a model of AI “agents” that can control your computer and use multistep reasoning, new benchmarks are being created that can test these capabilities in a more meaningful way.

But a lack of regulatory oversight or industry standards means the public must rely on companies like Meta and OpenAI to test their own products for safety, and provides no standard way to compare the results between them.

More Tech

Chris Stokel-Walker

Companies are getting AI chatbots to smear their competitors

The race to influence AI chatbots is leading to some companies to adopt shady competitive tactics.

AI Chatbot on the smart phone

tech

Tom Jones6/17/26

Prediction markets have, predictably, been given a boost by the summer of sports

Major platforms like Kalshi and Polymarket have seen huge upticks in users of late, thanks in no small part to what’s felt like a recent sporting smorgasbord, with major competitions across hockey, basketball, and soccer soaking up fans’ time (and spending, clearly) at the outset of summer.

While gaming industry groups may not like it, there’s been a huge change in the methods people are using to put money on the big games, with everyone from fortunate NYC bar owners, to a far less fortunate Spanish supporter, turning to prediction markets to try and turn their sports know-how into cold, hard cash.

According to a new report from Adam Blacker for apptopia, that shift might have been even more seismic than imagined in the wake of the NBA and NHL finals and around the 2026 World Cup kicking off.

2026 World Cup Reverses Seasonal Lull for Sports Betting Apps

2026 World Cup Reverses Seasonal Lull for Sports Betting Apps

While gaming industry groups may not like it, there’s been a huge change in the methods people are using to put money on the big games, with everyone from fortunate NYC bar owners, to a far less fortunate Spanish supporter, turning to prediction markets to try and turn their sports know-how into cold, hard cash.

According to a new report from Adam Blacker for apptopia, that shift might have been even more seismic than imagined in the wake of the NBA and NHL finals and around the 2026 World Cup kicking off.

South by Southwest Conference and Festivals

Gold Tesla Cybercabs are piling up, but they’re not picking up passengers yet

Low-volume production started in April. Now people are noticing them more and more in the wild.

Rani Molla6/15/26

TAKING ACCOUNTS

Britain announces social media ban for under-16s starting early 2027

The UK government plans to use the same model for the restrictions as Australia — but how successful has that case study been so far?

UK Prime Minister Announces Under-16s Social Media Ban

tech

Jon Keegan6/15/26

Anthropic pulls Fable and Mythos access worldwide after Trump administration bars their use by foreign nationals

Only days after releasing two versions of its next-gen AI model, Anthropic has disabled them for users worldwide.

Anthropic says it received a Friday night order from the Trump administration to suspend access to the models for any foreign national (anywhere in the world) — a group that included some Anthropic employees. In response, the company turned off access to everyone.

Last week, the company released to the public its much-anticipated Claude Fable 5 model (and its restricted version Claude Mythos 5, which is still being tested with trusted partners). Anthropic said in a blog post announcing the action that officials cited national security concerns with the new models, while offering few specific details.

The post said that the government gave the company “verbal evidence of a potential narrow, non-universal jailbreak” of the public Fable 5 model. A jailbreak is a means by which users can evade restrictions built into the code to unlock prohibited functionality. Anthropic downplayed the significance of the attack, and said other major models, such as OpenAI’s GPT-5.5, could also be affected by the technique described.

Fears of these first Mythos-class models being misused are running high, after Anthropic warned the cybersecurity world in May that the advanced cyber capabilities of Mythos have rapidly discovered thousands of vulnerabilities in ubiquitous software, leading to the decision to restrict the full version of the model to a close group of trusted partners for testing.

This morning, Axios reported that Anthropic technical staff have flown to Washington to meet with White House officials to resolve the issue.

The Wall Street Journal is reporting that the Trump administration’s decision to take action against Anthropic was prompted by discussions that Amazon CEO Andy Jassy had with officials, including Treasury Secretary Scott Bessent. According to the report, Amazon researchers said they had been able to evade some of Fable 5’s security restrictions using specific prompts. Amazon is a major investor in Anthropic.

Anthropic is currently suing the US government to fight the Pentagon’s blacklisting of the company on national security grounds.

Statement on the US government directive to suspend access to Fable 5 and Mythos 5

Statement on the US government directive to suspend access to Fable 5 and Mythos 5

Last week, the company released to the public its much-anticipated Claude Fable 5 model (and its restricted version Claude Mythos 5, which is still being tested with trusted partners). Anthropic said in a blog post announcing the action that officials cited national security concerns with the new models, while offering few specific details.

The post said that the government gave the company “verbal evidence of a potential narrow, non-universal jailbreak” of the public Fable 5 model. A jailbreak is a means by which users can evade restrictions built into the code to unlock prohibited functionality. Anthropic downplayed the significance of the attack, and said other major models, such as OpenAI’s GPT-5.5, could also be affected by the technique described.

Fears of these first Mythos-class models being misused are running high, after Anthropic warned the cybersecurity world in May that the advanced cyber capabilities of Mythos have rapidly discovered thousands of vulnerabilities in ubiquitous software, leading to the decision to restrict the full version of the model to a close group of trusted partners for testing.

This morning, Axios reported that Anthropic technical staff have flown to Washington to meet with White House officials to resolve the issue.

The Wall Street Journal is reporting that the Trump administration’s decision to take action against Anthropic was prompted by discussions that Amazon CEO Andy Jassy had with officials, including Treasury Secretary Scott Bessent. According to the report, Amazon researchers said they had been able to evade some of Fable 5’s security restrictions using specific prompts. Amazon is a major investor in Anthropic.

Anthropic is currently suing the US government to fight the Pentagon’s blacklisting of the company on national security grounds.

Latest Stories

Sherwood Media, LLC and Chartr Limited produce fresh and unique perspectives on topical financial news and are fully owned subsidiaries of Robinhood Markets, Inc., and any views expressed here do not necessarily reflect the views of any other Robinhood affiliate, including Robinhood Markets, Inc., Robinhood Financial LLC, Robinhood Securities, LLC, Robinhood Crypto, LLC, Robinhood Money, LLC, Robinhood U.K. Ltd, Robinhood Derivatives, LLC, Robinhood Gold, LLC, Robinhood Asset Management, LLC, Robinhood Credit, Inc., Robinhood Ventures DE, LLC and, where applicable, its managed investment vehicles.

©2026 Sherwood Media, LLC