Quit the yapping: New AI technique could cut costs 90% by saying less
A consensus is emerging in AI circles that the way forward involves models that use “chain of thought” reasoning to get better performance, at the expense of costlier computing resources. This process involves instructing the model to break a problem down into detailed, step-by-step reasoning. The problem is that those steps can be pretty verbose, and when it comes to AI, more words = more cost.
A new paper from researchers at Zoom introduces a technique dubbed “chain of draft” (CoD): tell the model to limit each reasoning step to a succinct “draft” of only five words or so, rather than a wordy sentence. The result, the researchers report, is that you can still achieve high performance on responses while cutting computing costs by up to 90%.
AI models are priced by the number of “tokens” — portions of words — that the model takes in and puts out. For example, OpenAI’s o3-mini “reasoning” model costs $1.10 per million input tokens and $4.40 per million output tokens. That may seem cheap, but when you’re processing millions of queries, it can really add up.
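To see how the per-token math plays out at scale, here's a rough back-of-the-envelope sketch using the o3-mini rates above. The workload sizes (5 million queries, 200 input tokens, 800 output tokens of reasoning) are hypothetical, chosen only to illustrate the arithmetic; the 7.6% figure is the token fraction the paper reports for chain of draft.

```python
# Back-of-the-envelope cost estimate at o3-mini's listed rates.
# Workload numbers below are hypothetical, for illustration only.
INPUT_RATE = 1.10 / 1_000_000   # dollars per input token
OUTPUT_RATE = 4.40 / 1_000_000  # dollars per output token

def batch_cost(queries, in_tokens, out_tokens):
    """Estimated spend for a batch of queries with the given token counts."""
    return queries * (in_tokens * INPUT_RATE + out_tokens * OUTPUT_RATE)

# 5M queries, 200 input tokens each, 800 output tokens of verbose reasoning:
verbose = batch_cost(5_000_000, 200, 800)

# Same queries, reasoning trimmed to ~7.6% of the output tokens,
# the fraction the chain-of-draft paper reports:
concise = batch_cost(5_000_000, 200, int(800 * 0.076))

print(f"verbose: ${verbose:,.0f}  concise: ${concise:,.0f}")
```

Under these assumptions, the verbose run comes to $18,700 and the trimmed run to $2,420 — most of the spend is in the reasoning tokens, which is exactly what chain of draft attacks.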
“By reducing verbosity and focusing on critical insights, CoD matches or surpasses CoT in accuracy while using as little as only 7.6% of the tokens, significantly reducing cost and latency across various reasoning tasks,” the paper reports.
Translation: it’s faster, cheaper, and sometimes better than chain of thought.
This approach is also notable for its ease of use. You can simply change the prompts you enter to get this benefit. That said, most of the gains were found using larger models like OpenAI’s GPT-4o and Anthropic’s Claude 3.5 Sonnet, while using smaller models resulted in poorer performance.
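Since the whole trick is a prompt change, here's what that change might look like in practice. The wording below is an approximation of the paper's prompts, not a verbatim copy, and the wrapper function is just an illustrative helper:

```python
# Illustrative system prompts contrasting chain of thought with chain of draft.
# Wording approximates the paper's prompts; it is not a verbatim copy.
COT_PROMPT = (
    "Think step by step to answer the following question. "
    "Return the answer at the end of the response after a separator ####."
)

COD_PROMPT = (
    "Think step by step, but only keep a minimum draft for each thinking "
    "step, with 5 words at most. "
    "Return the answer at the end of the response after a separator ####."
)

def build_messages(question, draft_mode=True):
    """Assemble a chat-style message list using the CoD or CoT instruction."""
    system = COD_PROMPT if draft_mode else COT_PROMPT
    return [
        {"role": "system", "content": system},
        {"role": "user", "content": question},
    ]
```

The only difference between the two setups is the system instruction — the model, the question, and everything else stay the same, which is why the technique is so easy to adopt.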
Go deeper: Here are OpenAI’s 50 Laws of Robotics