Anthropic researchers crack open the black box of how LLMs “think”
OK, first off — LLMs don’t think. They are clever statistical systems that map tokens to underlying concepts through weighted connections and use them to predict what text comes next. Got it?
But exactly how a model goes from the user’s prompt to “reasoning” its way to an answer has been the subject of great speculation.
Models are trained, not programmed, and there are definitely weird things happening inside these tools that no human explicitly designed. As the industry struggles with AI safety and hallucinations, understanding this process is key to developing trustworthy technology.
Researchers at the AI startup Anthropic have devised a way to perform a “circuit trace” that lets them follow the pathways a model takes between concepts as it works from a prompt to an answer. Their paper sheds new light on this mysterious process, much like a real-time fMRI brain scan can show which parts of the human brain “light up” in response to different stimuli.
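To be clear about what follows: the sketch below is not Anthropic’s method, which builds attribution graphs over interpretable features extracted from the model. It is only a toy illustration of the more basic idea behind this kind of interpretability work: intervene on a model’s internal units and see which ones the final answer actually depends on. The tiny network, the ablation approach, and all the numbers are made up for illustration.

```python
# NOT Anthropic's circuit-tracing pipeline. A generic toy showing the flavor
# of the idea: knock out internal units one at a time and measure how much
# the output depends on each of them.
import numpy as np

rng = np.random.default_rng(0)

# A tiny random two-layer network standing in for "the model".
W1 = rng.normal(size=(8, 4))   # input -> 8 hidden features
W2 = rng.normal(size=(1, 8))   # hidden features -> a single output score

def forward(x, ablate=None):
    h = np.maximum(W1 @ x, 0.0)   # hidden feature activations (ReLU)
    if ablate is not None:
        h[ablate] = 0.0           # intervention: silence one hidden feature
    return (W2 @ h).item()

x = rng.normal(size=4)            # a stand-in for "the prompt"
baseline = forward(x)

# Rank hidden features by how much the answer moves when each one is removed.
effects = {i: abs(baseline - forward(x, ablate=i)) for i in range(8)}
for i, effect in sorted(effects.items(), key=lambda kv: -kv[1]):
    print(f"feature {i}: effect on output = {effect:.3f}")
```

The real work operates on features that correspond to human-recognizable concepts rather than raw hidden units, but the “perturb and see what the answer depends on” spirit is the same.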
Some of the interesting findings:
Language appears to be independent of concepts — it’s trivial for the model to parse a query in one language and answer in another. The French “petit” and the English “small” map to the same concept. (A rough illustration using an off-the-shelf multilingual embedding model appears after this list.)
When “reasoning,” the model is sometimes just bullshitting you. Researchers found that the “chain of thought” an end user sees does not always reflect the computation actually happening inside the model.
Models have created novel ways to solve math problems. Watching exactly how the model handled simple arithmetic revealed strategies that no human was ever taught in school. (A toy analogue of one such strategy appears after this list.)
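The paper makes the language point by looking inside Claude itself, which we can’t do from the outside. But the “shared concept space” idea can be loosely illustrated with any off-the-shelf multilingual embedding model: words that mean the same thing in different languages land near each other, and unrelated words don’t. A minimal sketch using the open-source sentence-transformers library follows; the model name and word choices are just illustrative, not taken from the paper.

```python
# Loose illustration of "same concept, different language" using a public
# multilingual embedding model; this is not Anthropic's circuit tracing.
from sentence_transformers import SentenceTransformer, util

# Publicly available multilingual model (an illustrative choice).
model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")

embeddings = model.encode(["small", "petit", "bicycle"])

# The English "small" and the French "petit" should sit close together in the
# shared vector space, while an unrelated word sits farther away.
print("small vs. petit:  ", util.cos_sim(embeddings[0], embeddings[1]).item())
print("small vs. bicycle:", util.cos_sim(embeddings[0], embeddings[2]).item())
```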
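On the math point, the paper describes Claude adding numbers like 36 + 59 along parallel paths: one path roughs out the approximate size of the answer while another pins down that it has to end in 5, and the two are combined into 95. Below is a deliberately crude toy analogue of that “rough estimate plus exact last digit” idea. The band construction is just one way of making the reconciliation step well-defined; it is not the model’s actual mechanism.

```python
# Toy analogue of the "parallel paths" addition strategy described in the
# paper: one path produces only a rough band for the answer, another path
# produces only the ones digit, and the two are reconciled at the end.
# This illustrates the idea; it is not the model's actual circuit.
import math

def rough_path(a: int, b: int) -> range:
    """Coarse path: a rough estimate of the sum, reported as a band of ten
    consecutive numbers that is guaranteed to contain the true sum."""
    estimate = a + math.floor(b / 10 + 0.5) * 10   # e.g. 36 + 59 -> 36 + 60 = 96
    return range(estimate - 5, estimate + 5)

def ones_digit_path(a: int, b: int) -> int:
    """Precise path: only the ones digit of the sum."""
    return (a % 10 + b % 10) % 10                  # e.g. 6 + 9 = 15 -> ends in 5

def combine(a: int, b: int) -> int:
    """Exactly one number in the rough band ends in the right digit."""
    digit = ones_digit_path(a, b)
    return next(n for n in rough_path(a, b) if n % 10 == digit)

print(combine(36, 59))  # 95
```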
Anthropic made a helpful video that describes the research clearly.
Anthropic is working hard to catch up to industry leader OpenAI as it seeks to grow revenues to cover the expensive computing resources needed to offer its services. Amazon has invested $8 billion in the company, and Anthropic’s Claude model will be used to power parts of the AI-enhanced Alexa.