Google DeepMind Focused on Agent Performance
Google is laser-focused on increasing the performance of agents. The future will feature agents capturing a rising share of the "compute requirements," or cognitive load, of humanity. Users won't interact with companies or apps directly; agents will interact with other agents. This "agentic layer" will accelerate the pace of science, technology, commerce & creativity.
This research investigates why Large Language Models (LLMs), despite their successes across a wide range of tasks and their potential for complex reasoning via Chain-of-Thought (CoT), often perform poorly in decision-making scenarios.
This is one of the hottest areas of AI research right now. Agents will soon conduct 80 to 90% of all interactions on the internet… so getting them from point A to B as efficiently as possible means increasing the efficiency of the global economic engines. More and more of our R&D is going to be completed by AI agents as well. This is a critical surface area to focus on.
The core hypothesis, that LLMs can leverage their world knowledge and reasoning abilities for effective exploration and problem-solving in agentic tasks, doesn't always hold true. LLM agents frequently exhibit suboptimal exploration and suffer from the "knowing-doing gap": an inability to translate their knowledge into effective actions.
The paper systematically examines three key failure modes in small-to-medium-scale LLMs (Gemma 2 models at 2B, 9B, and 27B parameters):
Greediness: LLMs tend to prematurely commit to actions that yield early high rewards, neglecting exploration of potentially better options, leading to incomplete coverage of the action space.
Frequency Bias: Smaller LLMs (like 2B models) often copy the most frequent actions presented in their context history, irrespective of the rewards associated with those actions. Larger models reduce this bias but remain greedy.
Knowing-Doing Gap: LLMs can often articulate the correct strategy or reasoning for a task (they "know" what to do) but fail to execute the optimal action based on that reasoning, frequently defaulting to greedy choices instead. (A toy sketch of how the first two failure modes can be quantified follows below.)
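To make the first two failure modes concrete, here is a minimal toy sketch (my own illustration, not the paper's code) of how they could be quantified for any bandit-style agent: action coverage approximates greediness (how much of the action space the agent ever touches), while a frequency-match rate approximates how often the agent simply copies the most common action in its context regardless of reward. The helper names and the random stand-in policy are assumptions.

```python
import random
from collections import Counter

def action_coverage(actions_taken, num_arms):
    """Fraction of available arms the agent tried at least once (greediness proxy)."""
    return len(set(actions_taken)) / num_arms

def frequency_match_rate(contexts, chosen_actions):
    """How often the agent repeated the most frequent action in its context,
    regardless of reward -- a rough proxy for frequency bias."""
    matches = 0
    for context, chosen in zip(contexts, chosen_actions):
        most_frequent = Counter(context).most_common(1)[0][0]
        matches += int(chosen == most_frequent)
    return matches / len(chosen_actions)

# Toy usage with a random stand-in policy (replace with an LLM-driven agent).
num_arms = 10
actions = [random.randrange(num_arms) for _ in range(50)]
print("coverage:", action_coverage(actions, num_arms))
print("frequency match:", frequency_match_rate(
    [actions[:t] for t in range(1, len(actions))], actions[1:]))
```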
To address these limitations, the researchers propose using Reinforcement Learning Fine-Tuning (RLFT) tailored specifically for decision-making. This involves fine-tuning the LLM on self-generated CoT rationales, using rewards obtained from interacting with the environment. Their experiments, conducted on multi-armed bandits, contextual bandits, and a text-based Tic-tac-toe game, demonstrate that RLFT significantly improves the LLMs' decision-making capabilities: it enhances exploration, reduces greediness, counteracts frequency bias to some extent, and narrows the knowing-doing gap. The study also explores the effectiveness of classic RL exploration techniques and LLM-specific methods (like self-correction and self-consistency) in conjunction with RLFT to further boost performance.
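To show the overall shape of such a loop, here is a minimal, runnable sketch of a REINFORCE-style update. A tiny learnable softmax policy over bandit arms stands in for the LLM, and the rationale string is a placeholder, so this illustrates the structure (generate rationale + action, collect an environment reward, reinforce the pair) rather than the paper's exact training recipe; all names and hyperparameters here are assumptions.

```python
import torch
import torch.nn.functional as F

# Stand-ins so the sketch runs end to end: a real RLFT setup would use an LLM.
K = 5
true_means = torch.tensor([0.1, 0.3, 0.9, 0.2, 0.5])  # hidden arm reward rates
logits = torch.zeros(K, requires_grad=True)            # "policy" parameters
optimizer = torch.optim.Adam([logits], lr=0.1)

def generate_rationale_and_action():
    """Stand-in for the LLM producing a CoT rationale followed by an action."""
    probs = F.softmax(logits, dim=-1)
    action = torch.multinomial(probs, 1).item()
    rationale = f"Arm {action} has looked promising so far, so I will pull it."
    return rationale, action, torch.log(probs[action])

def env_step(action):
    """Bernoulli bandit: reward ~ Bernoulli(true_means[action])."""
    return float(torch.rand(()) < true_means[action])

# RLFT-style loop: reinforce rationale+action pairs by the environment reward.
for step in range(500):
    rationale, action, log_prob = generate_rationale_and_action()
    reward = env_step(action)
    loss = -log_prob * reward          # REINFORCE objective (no baseline)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

print("learned arm preferences:", F.softmax(logits, dim=-1).tolist())
```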
While the approach works, the paper notes the computational cost associated with rollout generation, especially with larger "thinking" budgets. Token prices are destined to plummet, but how many extra tokens will reasoning consume across billions of agents?
Future research may focus on improving the efficiency of RLFT for LLM agents, perhaps using more sample-efficient RL algorithms, model-based RL approaches, or alternative architectures like recurrent models (e.g., xLSTM) that might offer faster, cheaper inference.
Link to paper: https://arxiv.org/pdf/2504.16078
How This Advances on Prior Methods
Sometimes, the most important piece of information is why something didn’t work.
While previous studies observed that LLMs struggle with exploration and exhibit a knowing-doing gap, this paper provides a more systematic and focused investigation into why. It specifically isolates and quantifies three distinct failure modes: greediness, frequency bias, and the knowing-doing gap, particularly in small-to-medium scale models. Prior work often noted the symptoms (suboptimal performance, poor exploration), whereas this work delves deeper into diagnosing the underlying causes, such as premature commitment to greedy strategies or mimicking frequent actions. It also explicitly measures the knowing-doing gap by comparing the correctness of the LLM's reasoning (knowing) with its actual action choices (doing).
The proposed RLFT approach is specifically evaluated based on its ability to mitigate the identified failure modes. The results demonstrate that RLFT leads to increased action coverage (mitigating greediness), reduces the tendency to copy frequent actions (counteracting frequency bias), and improves the alignment between reasoning and action (narrowing the knowing-doing gap). This provides a targeted solution directly linked to the diagnosed problems, advancing beyond general observations of LLM shortcomings.
Reinforcement Learning from Human Feedback (RLHF) is commonly used to align LLMs with human preferences. This paper, however, adapts RL principles specifically for improving decision-making capabilities by fine-tuning on environment rewards rather than human preferences. Crucially, it leverages the LLM's own self-generated Chain-of-Thought rationales as part of the input for the RL update. This is distinct from standard RL approaches that might not incorporate explicit reasoning steps, and from prior LLM work that might use CoT for prompting without integrating it directly into an RL fine-tuning loop focused on decision tasks. The methodology fine-tunes the model to favor reasoning patterns and actions that lead to higher environmental rewards, directly addressing the knowing-doing gap by reinforcing effective reasoning-to-action pathways.
This contrasts with methods fine-tuning only on expert trajectories or using LLMs primarily as reward sources or exploration orchestrators for separate RL agents.
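As an illustration of how these pieces might meet at the input/output boundary, here is a small sketch of prompt assembly and action parsing for a bandit-style task: the model's context contains the action/reward history, the model is asked to reason before naming an action, and responses without a valid action receive a shaping penalty. The prompt wording, the `ACTION:` format, and the penalty value are my assumptions, not taken from the paper.

```python
import re

def build_prompt(action_names, history):
    """Assemble the context the model reasons over: instructions, the
    action/reward history so far, and a request for CoT then an action."""
    lines = [
        "You are playing a multi-armed bandit. Think step by step,",
        "then answer with: ACTION: <name>.",
        f"Available actions: {', '.join(action_names)}.",
        "History:",
    ]
    lines += [f"  step {t}: took {a}, got reward {r}"
              for t, (a, r) in enumerate(history)]
    return "\n".join(lines)

def parse_action(completion, action_names, invalid_penalty=-1.0):
    """Extract the action from the model's CoT output; if no valid action is
    found, return a shaping penalty (the penalty value is an assumption)."""
    match = re.search(r"ACTION:\s*(\w+)", completion)
    if match and match.group(1) in action_names:
        return match.group(1), 0.0
    return None, invalid_penalty

print(build_prompt(["red", "blue"], [("red", 1.0), ("red", 0.0)]))
print(parse_action("Blue is untested, so I'll try it. ACTION: blue", ["red", "blue"]))
```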
The authors compare classic RL techniques (ε-greedy, exploration bonuses) with LLM-specific strategies (context randomization, self-correction, self-consistency) within the same RLFT framework. This provides valuable insight into which strategies are most effective for enhancing exploration in LLM agents undergoing RLFT. For instance, the finding that a simple exploration bonus significantly boosts performance highlights the importance of reward shaping in this context. The finding that models perform near-optimally when forced to try all actions initially (the "try-all" strategy) further underscores that the primary deficit is exploration, not capability once the information is available.
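For intuition, here is a toy sketch of three of these exploration mechanisms applied to a generic bandit action-selection step: ε-greedy, an exploration bonus that favors rarely tried arms, and a "try-all" warm-up. The class design and the bonus formula are illustrative assumptions, not the paper's exact formulations.

```python
import math
import random
from collections import defaultdict

class BanditExplorer:
    """Toy action-selection helpers illustrating three exploration mechanisms."""

    def __init__(self, num_arms, epsilon=0.1, bonus_scale=1.0):
        self.num_arms = num_arms
        self.epsilon = epsilon
        self.bonus_scale = bonus_scale
        self.counts = defaultdict(int)             # pulls per arm
        self.value_estimates = defaultdict(float)  # running mean reward per arm

    def epsilon_greedy(self):
        if random.random() < self.epsilon:
            return random.randrange(self.num_arms)  # explore uniformly
        return max(range(self.num_arms), key=lambda a: self.value_estimates[a])

    def with_exploration_bonus(self):
        # Reward-shaping style bonus: rarely tried arms look more attractive.
        def score(a):
            return self.value_estimates[a] + self.bonus_scale / math.sqrt(self.counts[a] + 1)
        return max(range(self.num_arms), key=score)

    def try_all_first(self):
        # Force one pull of every arm before acting greedily.
        for a in range(self.num_arms):
            if self.counts[a] == 0:
                return a
        return max(range(self.num_arms), key=lambda a: self.value_estimates[a])

    def update(self, action, reward):
        self.counts[action] += 1
        n = self.counts[action]
        self.value_estimates[action] += (reward - self.value_estimates[action]) / n
```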
My dad made me try all the sports before I could settle into one. This reminded me of that.
Investigating the Role of CoT and "Thinking Time"
The study explicitly ablates the role of CoT during RLFT, confirming its critical importance not just for in-context learning (ICL) but also for the effectiveness of the fine-tuning process itself.
The paper also quantifies the impact of allowing more "thinking time" (a larger token budget for generation, G) during RLFT, showing that increased reasoning capacity leads to better performance, albeit at a higher computational cost. This provides practical insights into the trade-offs involved in training more capable LLM agents.
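As a back-of-envelope illustration of that trade-off (the rollout count and budgets below are made-up numbers chosen for the arithmetic, not figures from the paper):

```python
def rollout_token_cost(num_rollouts, thinking_budget, answer_tokens=10):
    """Total generated tokens for a training run at a given "thinking" budget G."""
    return num_rollouts * (thinking_budget + answer_tokens)

# Hypothetical comparison: doubling the thinking budget roughly doubles
# generation cost for the same number of rollouts.
for budget in (256, 512, 1024):
    tokens = rollout_token_cost(num_rollouts=100_000, thinking_budget=budget)
    print(f"G={budget:>5}: ~{tokens / 1e6:.1f}M generated tokens")
```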
To boil it all down: the paper advances the field by moving from observing LLM limitations in decision-making to diagnosing specific causes, proposing and validating a targeted RLFT method using CoT, and systematically comparing various exploration strategies to enhance this method, providing a more nuanced understanding of how to build better LLM agents.
You can tell from this paper that the Google team is hyper-focused on agentic performance.
Analogy to Explain the Paper
Imagine teaching a very smart student (the LLM) how to play a complex strategy game with many possible moves (actions), like exploring a maze with hidden rewards.
How would you go about it? What problems might get in your way?
Initially, the student, despite knowing a lot about mazes and strategies (world knowledge, CoT reasoning), plays poorly. They make three common mistakes:
Greediness: They find a path with some early rewards and just keep taking it, afraid to explore other paths that might lead to bigger treasures later. They get stuck in a small part of the maze.
Frequency Bias (especially younger students): If they see other players (or their own past moves in the history) repeatedly taking a specific turn (e.g., always turning left), they just copy that, even if it leads to dead ends.
Knowing-Doing Gap: They can perfectly explain the best strategy to explore the whole maze ("I should check all paths before settling...") but when it's time to move, they impulsively take the first rewarding path they see again.
The solution proposed (RLFT on CoT) is like a specialized coaching session.
The coach doesn't just tell the student what move to make but asks them to explain their thinking (CoT rationale) before each move. Then, based on whether the move actually led to finding treasure (environment reward), the coach reinforces good thinking-move combinations. This training helps the student become less greedy, explore more, ignore meaningless repetitions, and actually follow their own good strategies.
Talk about a great way to navigate a maze: by building yourself into a maze navigator.
AI Comes For Industries and Entire Business Models
Nearly all my rich friends are stressing out about the value of their companies and business efforts in a rapidly changing world. They are always on the run, trading time for income, and suddenly they see AI lowering the "cost of intelligence" and delivering continuously improving work.
What This Means for the Future
This research has significant implications for the future development and application of LLMs as autonomous agents capable of making decisions in complex, interactive environments.
The core contribution – demonstrating that RLFT on CoT rationales can mitigate inherent LLM weaknesses like greediness and the knowing-doing gap – paves the way for creating more effective AI agents. Future LLM-based agents could become better explorers and decision-makers in domains ranging from game playing and robotics to potentially more complex real-world applications like scientific discovery, personalized education assistants, or resource management systems. By reinforcing the link between reasoning and effective action, these agents could become more reliable and less prone to simplistic, suboptimal strategies.
The study underscores the critical role of Chain-of-Thought or similar explicit reasoning mechanisms, not just for prompting but as an integral part of the agent's learning process. Future agent architectures might need to explicitly incorporate and refine reasoning steps. The finding that more "thinking time" (generation budget) improves performance suggests that future work might explore trade-offs between computational cost and decision quality, potentially favoring models or methods that allow for deeper reasoning in critical situations. This could lead to agents capable of tackling problems requiring more deliberation.
The focus on the "knowing-doing gap" highlights a fundamental challenge in AI. This research provides a concrete method (RLFT on CoT) to narrow this gap. Future research will likely continue exploring ways to ensure that the vast knowledge contained within LLMs can be effectively translated into purposeful, optimal actions in dynamic environments. This might involve developing more sophisticated RL techniques, better methods for representing and utilizing internal knowledge during action selection, or architectures that inherently link reasoning and acting more tightly.
The paper suggests that standard LLM pre-training and generic fine-tuning might not be sufficient for optimal decision-making performance. Specialized fine-tuning techniques like the proposed RLFT, potentially incorporating environment-specific rewards and tailored exploration strategies (like exploration bonuses), will likely become increasingly important. This points towards a future where foundational models are adapted for specific agentic roles using targeted reinforcement learning approaches. The success of "Thought Cloning" (SFT on expert rationales) also suggests that high-quality expert data, including the reasoning process, could be highly valuable for bootstrapping agent capabilities.
Even with RLFT, exploration remains a challenge.
The study's investigation of various exploration mechanisms indicates that finding the right way to encourage LLMs to explore effectively is crucial. Future work will need to develop more sophisticated and potentially LLM-specific exploration strategies, especially for complex, stateful environments where targeted exploration is necessary. This might involve intrinsic motivation techniques adapted for LLMs or meta-learning approaches for learning exploration strategies themselves.
In summary, this work directs future efforts towards building more intelligent and autonomous agents by highlighting specific LLM weaknesses in decision-making and offering a promising RL-based methodology, centered on reasoning, to overcome them. It emphasizes the need for targeted training, better exploration techniques, and relevant evaluation metrics for advancing agentic AI.
As agents become more efficient, they will capture more and more of the cognitive workload across every part of our world: commerce, R&D, logistics, engineering, and even the creative fields.
Thank you for helping us accelerate Life in the Singularity by sharing.
I started Life in the Singularity in May 2023 to track all the accelerating changes in AI/ML, robotics, quantum computing and the rest of the technologies accelerating humanity forward into the future. I’m an investor in over a dozen technology companies and I needed a canvas to unfold and examine all the acceleration and breakthroughs across science and technology.
Our brilliant audience includes engineers and executives, incredible technologists, tons of investors, Fortune 500 board members and thousands of people who want to use technology to maximize the utility in their lives.
To help us continue our growth, would you please engage with this post and share us far and wide?! 🙏