Brain in a Bubble

illusion of thinking

With WWDC on deck, Apple says “reasoning” AI models collapse with complexity

Apple tested state-of-the-art “chain of thought” models and found that they aren’t “reasoning,” but merely pattern matching, calling into question the direction the industry is taking.

Jon Keegan

6/9/25 11:34AM

Apple’s troubled AI rollout was plagued by a series of remarkable feature failures and product delays.

What was supposed to be the year of “Apple Intelligence” has failed to deliver an AI-enhanced Siri on par with voice assistants from competitors like Google, OpenAI, and Meta. This week, all eyes are on Apple’s Worldwide Developers Conference (WWDC) to see how the company plans to get back in the AI race.

But behind the scenes, researchers at Apple have been digging into the competition’s latest and greatest “reasoning” models to see how they respond to tricky challenges as they scale in complexity.

In a new paper, Apple’s researchers found that the leading state-of-the-art “chain of thought” models “face a complete accuracy collapse” when the researchers dialed up the complexity of puzzle-based tests. The spectacular failures of the models led the researchers to question their “reasoning” label, calling it instead “the illusion of thinking.”

The suite of tests included puzzles like “Tower of Hanoi,” in which the player must move a stack of disks of various sizes from one post to another, moving only the top disk of a post, one disk at a time, and never placing a larger disk on a smaller one.
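To make the puzzle's rules concrete, here is a minimal Python sketch of the classic recursive solution plus a checker that enforces the rules described above. It is an illustration only, not the evaluation code from Apple's paper; the function names `hanoi` and `is_valid_solution` and the peg labels "A", "B", and "C" are assumptions made for this example. Assuming complexity is scaled by adding disks, the shortest solution grows exponentially: n disks need 2^n - 1 moves.

```python
def hanoi(n, source="A", target="C", spare="B"):
    """Return the (disk, from_peg, to_peg) moves that solve an n-disk puzzle."""
    if n == 0:
        return []
    return (
        hanoi(n - 1, source, spare, target)    # clear the n-1 smaller disks out of the way
        + [(n, source, target)]                # move the largest disk to the target peg
        + hanoi(n - 1, spare, target, source)  # restack the smaller disks on top of it
    )


def is_valid_solution(n, moves, source="A", target="C", spare="B"):
    """Replay the moves and check every rule of the puzzle."""
    pegs = {source: list(range(n, 0, -1)), target: [], spare: []}
    for disk, src, dst in moves:
        # Only the top disk of a peg may be moved.
        if not pegs[src] or pegs[src][-1] != disk:
            return False
        # A larger disk may never be placed on a smaller one.
        if pegs[dst] and pegs[dst][-1] < disk:
            return False
        pegs[dst].append(pegs[src].pop())
    # Solved only if the whole stack ends up on the target peg, largest at the bottom.
    return pegs[target] == list(range(n, 0, -1))


if __name__ == "__main__":
    for n in (3, 7, 10):
        moves = hanoi(n)
        print(n, len(moves), is_valid_solution(n, moves))
```

Running the script prints 7, 127, and 1023 moves for 3, 7, and 10 disks. A single wrong move anywhere in that sequence invalidates the whole solution, which is one reason longer instances are so much harder to get right.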

A figure from Apple’s “Illusion of Thinking” paper showing models’ collapse in accuracy as the complexity is dialed up. (Source: Apple)

While the models could solve the simplest versions of the puzzles, they fell flat on their faces once things got more complex. The research tested the reasoning models DeepSeek-R1, OpenAI’s o3-mini, and Anthropic’s Claude 3.7 Sonnet Thinking.

Chain of “thought”

After hitting performance plateaus from the “more data, more compute” approach, the industry followed OpenAI’s o1 release and started to build “chain of thought” reasoning models, which showed their “thought” processes.

This technique did boost the performance of large language models to new levels, offering a promising pathway out of what looked to be a computational dead end. Although these models required vastly more computational resources and time, the approach seemed to be the way forward.

Apple’s research seems to show that rather than reasoning, these models are merely displaying sophisticated pattern matching.

Apple researchers also examined the “thought” processes behind each solution to the puzzle, to better understand exactly how the models approached solutions.

Very little is known about how these recent models actually work. It remains to be seen whether Apple has been cooking up an alternate approach, but reports indicate an AI-enhanced Siri isn’t likely to debut at this week’s WWDC.


Gold Tesla Cybercabs are piling up, but they’re not picking up passengers yet

Low-volume production started in April. Now people are noticing them more and more in the wild.

Rani Molla

6/15/26

Millie Giles

TAKING ACCOUNTS

6/15/26

Britain announces social media ban for under-16s starting early 2027

The UK government plans to use the same model for the restrictions as Australia — but how successful has that case study been so far?

UK Prime Minister Announces Under-16s Social Media Ban

Jon Keegan

6/15/26

Anthropic pulls Fable and Mythos access worldwide after Trump administration bars their use by foreign nationals

Only days after releasing two versions of its next-gen AI model, Anthropic has disabled them for users worldwide.

Anthropic says it received a Friday night order from the Trump administration to suspend access to the models for any foreign national anywhere in the world, a group that included some Anthropic employees. In response, the company turned off access for everyone.

Last week, the company released to the public its much-anticipated Claude Fable 5 model (and its restricted version Claude Mythos 5, which is still being tested with trusted partners). Anthropic said in a blog post announcing the action that officials cited national security concerns with the new models, while offering few specific details.

The post said that the government gave the company “verbal evidence of a potential narrow, non-universal jailbreak” of the public Fable 5 model. A jailbreak is a technique that lets users evade a model’s built-in restrictions to unlock prohibited functionality. Anthropic downplayed the significance of the attack, and said other major models, such as OpenAI’s GPT-5.5, could also be affected by the technique described.

Fears that these first Mythos-class models could be misused are running high. In May, Anthropic warned the cybersecurity world that Mythos’s advanced cyber capabilities had rapidly uncovered thousands of vulnerabilities in ubiquitous software, which led to the decision to restrict the full version of the model to a small group of trusted partners for testing.

This morning, Axios reported that Anthropic technical staff have flown to Washington to meet with White House officials to resolve the issue.

The Wall Street Journal is reporting that the Trump administration’s decision to take action against Anthropic was prompted by discussions that Amazon CEO Andy Jassy had with officials, including Treasury Secretary Scott Bessent. According to the report, Amazon researchers said they had been able to evade some of Fable 5’s security restrictions using specific prompts. Amazon is a major investor in Anthropic.

Anthropic is currently suing the US government to fight the Pentagon’s blacklisting of the company on national security grounds.

Statement on the US government directive to suspend access to Fable 5 and Mythos 5


Rani Molla

6/15/26

Tesla used skewed data in push for European FSD approval, Reuters finds

Tesla has used highly questionable safety stats in an effort to win over European regulators and rekindle sales in the region, according to a Reuters investigation.

Tesla reportedly pitched regulators in Sweden and the Netherlands with claims that its Full Self-Driving (FSD) tech is over 7x safer than human drivers. However, independent researchers told Reuters that the stats are misleading because Tesla compares airbag-deployment crashes involving FSD-equipped vehicles with much broader US crash statistics, while also benchmarking newer Teslas against the entire US vehicle fleet, which is significantly older on average.

Despite the flawed metrics, the Dutch regulator approved FSD in April, saying its decision was based on its own “tests, analyses and verifications,” and Tesla is now pushing for EU-wide clearance. A version of FSD is currently available in five European markets.

Exclusive: Tesla presented misleading ‘Full Self-Driving’ safety data to European regulators

Rani Molla

6/12/26

Report: Microsoft weighs Xbox spin-off amid major overhaul

Microsoft is reportedly considering spinning out or restructuring its struggling Xbox unit, per The Information. While new Xbox CEO Asha Sharma, who took over in February, is preparing for layoffs, she’s simultaneously planning to boost investment in its biggest franchises like “Halo,” “Fallout,” and “Minecraft.”

The latest potential shake-up comes as the gaming division battles major headwinds, following a massive 33% plunge in Q3 console sales and a recent move to slash Game Pass prices while removing new “Call of Duty” titles.