Anthropic researchers crack open the black box of how LLMs “think”

OK, first off — LLMs don’t think. They are clever systems that use probabilistic methods to parse language by mapping tokens to underlying concepts via weighted connections. Got it?

But exactly how a model goes from the user’s prompt to “reasoning” a solution is the subject of great speculation.

Models are trained, not programmed, and there are definitely weird things happening inside these tools that no human put there. As the industry struggles with AI safety and hallucinations, understanding this process is key to developing trustworthy technology.

Researchers at the AI startup Anthropic have devised a way to perform a “circuit trace” that lets them dissect the pathways a model follows between concepts on its way to answering the prompt it was given. Their paper sheds new light on this mysterious process, much as a real-time fMRI brain scan can show which parts of the human brain “light up” in response to different stimuli.

Some of the interesting findings:

Language appears to be independent from concepts — it’s trivial for the model to parse a query in one language and answer in another. The French “petit” and English “small” map to the same concept.
When “reasoning,” the model is sometimes just bullshitting you. Researchers found that the “chain of thought” an end user sees does not always reflect the processes at work inside the model.
Models have created novel ways to solve math problems. Watching exactly how the model solved simple math problems revealed some weird techniques that humans definitely never learned in school.

Anthropic made a helpful video that describes the research clearly:

Anthropic is working hard to catch up to industry leader OpenAI as it seeks to grow revenues to cover the expensive computing resources needed to offer its services. Amazon has invested $8 billion in the company, and Anthropic’s Claude model will be used to power parts of the AI-enhanced Alexa.

Anthropic can now track the bizarre inner workings of a large language model
