March 13, 2025 12:11 PM
Reasoning through chain-of-thought (CoT) — the process by which models break problems into manageable “thoughts” before deducing answers — has become an integral part of the latest generation of frontier large language models (LLMs).
However, the inference costs of reasoning models can quickly stack up as models generate excess CoT tokens. In a new paper, researchers at Carnegie Mellon University propose an LLM training technique that gives developers more control over the length of the CoT.
Called length controlled policy optimization (LCPO), the technique conditions the model to provide correct answers while keeping its “thoughts” within a predetermined token budget. Experiments show that models trained with LCPO offer a smooth tradeoff between accuracy and cost and, surprisingly, can outperform larger models at equal reasoning lengths. LCPO can help dramatically reduce the cost of inference in enterprise applications by saving thousands of tokens in each round of conversation with an LLM.
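At a high level, LCPO folds the length constraint into the reinforcement learning reward: the model is rewarded for answering correctly and penalized in proportion to how far its reasoning strays from the requested token budget. The sketch below illustrates that idea only; the function name and the penalty coefficient are illustrative assumptions, not the authors' code.

```python
def lcpo_style_reward(is_correct: bool, num_cot_tokens: int,
                      target_tokens: int, alpha: float = 0.0003) -> float:
    """Reward correct answers, penalize deviation from the token budget."""
    correctness = 1.0 if is_correct else 0.0
    length_penalty = alpha * abs(target_tokens - num_cot_tokens)
    return correctness - length_penalty

# Example: a correct answer that overshoots a 1,000-token budget by 500 tokens
# scores lower than one that hits the budget exactly.
print(lcpo_style_reward(True, 1500, 1000))  # 0.85
print(lcpo_style_reward(True, 1000, 1000))  # 1.0
```

Because the penalty scales smoothly with the deviation, the model learns to trade a small amount of accuracy for large token savings rather than being cut off abruptly at the budget.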
LLM performance leads to longer CoTs
Reasoning models such as OpenAI o1 and DeepSeek-R1 are trained through reinforcement learning (RL) to use test-time scaling and ge...