Bottom-up Policy Optimization: Your Language Model Policy Secretly Contains Internal Policies
Abstract
The paper decomposes the policy of large language models into internal layer and modular policies, revealing distinct reasoning patterns across layers and proposing Bottom-up Policy Optimization to enhance performance on complex reasoning tasks.
Existing reinforcement learning (RL) approaches treat large language models (LLMs) as a single unified policy, overlooking their internal mechanisms. Understanding how the policy evolves across layers and modules is therefore crucial for enabling more targeted optimization and unraveling complex reasoning mechanisms. In this paper, we decompose the language model policy by leveraging the intrinsic split of the Transformer residual stream and the equivalence between the composition of hidden states with the unembedding matrix and the resulting samplable policy. This decomposition reveals Internal Layer Policies, corresponding to contributions from individual layers, and Internal Modular Policies, which align with the self-attention and feed-forward network (FFN) components within each layer. By analyzing the entropy of internal policies, we find that: (a) early layers maintain high entropy for exploration, while top layers converge to near-zero entropy for refinement, with convergence patterns varying across model series; (b) Llama's prediction space converges rapidly in the final layer, whereas Qwen-series models, especially Qwen3, exhibit a more human-like, progressively structured reasoning pattern. Motivated by these findings, we propose Bottom-up Policy Optimization (BuPO), a novel RL paradigm that directly optimizes the internal layer policy during early training. By aligning the training objective at lower layers, BuPO reconstructs foundational reasoning capabilities and achieves superior performance. Extensive experiments on complex reasoning benchmarks demonstrate the effectiveness of our method. Our code is available at https://github.com/Trae1ounG/BuPO.
Community
Bottom-up Policy Optimization (BuPO) provides a novel framework to decompose LLM policies into internal layer and modular policies, reveals distinct reasoning patterns across different model architectures, and introduces a bottom-up optimization algorithm that leverages these insights to enhance complex reasoning.
Key Findings:
- Internal Policies: Decomposes the unified LLM policy into samplable distributions from individual layers and modules (self-attention & FFN).
- Progressive Reasoning Pattern: Reveals a human-like "Exploration-Integration-Convergence" (EIC) pattern in Qwen models, contrasting with the abrupt convergence seen in Llama models.
- Bottom-up Policy Optimization (BuPO): A novel two-phase RL algorithm that first optimizes an internal, lower-layer policy to reconstruct foundational reasoning, then fine-tunes the full model.
- Enhanced Reasoning Performance: BuPO significantly outperforms standard RL on complex reasoning benchmarks.
Bottom-up Policy Optimization: Your Language Model Policy Secretly Contains Internal Policies, from the Institute of Automation, Chinese Academy of Sciences, and Tencent AI Lab, bridges the gap between Mechanistic Interpretability and Reinforcement Learning.
Instead of treating the LLM as a black-box policy, our work makes two key contributions:
🔍 Interpretability Analysis:
We decompose the LLM policy into "Internal Layer Policies" using the Logit Lens. Our analysis reveals a "Progressive Reasoning" pattern in models like Qwen, where lower layers maintain high entropy for exploration while upper layers converge for refinement.
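To make the decomposition concrete, here is a minimal logit-lens sketch (a hedged illustration, not the paper's released code): each layer's hidden state is passed through the final norm and the unembedding matrix to obtain a samplable internal layer policy, whose entropy can then be tracked across depth. The model name, the `model.model.norm` attribute (Llama/Qwen-style naming), and the choice of the last token position are illustrative assumptions.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Illustrative model choice; any decoder-only HF model with this layout works similarly.
model_name = "Qwen/Qwen2.5-0.5B"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

inputs = tokenizer("The integral of x^2 from 0 to 1 is", return_tensors="pt")
with torch.no_grad():
    out = model(**inputs, output_hidden_states=True)

unembed = model.get_output_embeddings().weight   # [vocab, hidden] unembedding matrix
final_norm = model.model.norm                    # final RMSNorm in Llama/Qwen-style models

for layer_idx, h in enumerate(out.hidden_states):
    # Project the last-token hidden state of this layer into vocabulary space.
    logits = final_norm(h[:, -1, :]) @ unembed.T
    probs = torch.softmax(logits, dim=-1)        # internal layer policy at this depth
    entropy = -(probs * torch.log(probs + 1e-12)).sum(-1)
    print(f"layer {layer_idx:2d}: entropy = {entropy.item():.3f}")
```

Plotting these entropies over layers is enough to see whether a model converges abruptly near the top or follows the more gradual, progressive pattern described above.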
⚙️ Methodology (BuPO):
Based on these insights, we propose Bottom-up Policy Optimization. It utilizes a two-stage training strategy:
- Internal Policy Optimization: We first optimize the internal layers to reconstruct and strengthen the model's foundational reasoning capabilities.
- Full Policy Optimization: We then switch to standard optimization to align the complete model.
This approach ensures that the model's internal reasoning process is optimized alongside its final output.
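For intuition, the sketch below shows how such a two-phase schedule could look in code. It is a simplified, hypothetical rendering: the REINFORCE-style loss stands in for the full RL objective (e.g. a GRPO/PPO-style surrogate), and `switch_step`, `internal_layer`, and the batch fields are made-up names rather than parameters from the BuPO repository.

```python
import torch

def layer_logprobs(model, input_ids, layer_idx):
    """Log-probs of the internal policy at `layer_idx`, obtained via the
    logit-lens projection (final norm + unembedding) from the earlier sketch."""
    out = model(input_ids, output_hidden_states=True)
    h = model.model.norm(out.hidden_states[layer_idx])
    logits = h @ model.get_output_embeddings().weight.T
    return torch.log_softmax(logits, dim=-1)

def bupo_style_step(model, optimizer, batch, step, switch_step=200, internal_layer=16):
    """One update of a two-phase schedule: optimize an internal lower-layer
    policy early in training, then switch to the standard final-layer policy."""
    if step < switch_step:
        logprobs = layer_logprobs(model, batch["input_ids"], internal_layer)
    else:
        logprobs = torch.log_softmax(model(batch["input_ids"]).logits, dim=-1)

    # Gather log-probs of the sampled response tokens and weight by advantages.
    token_logprobs = logprobs.gather(-1, batch["response_ids"].unsqueeze(-1)).squeeze(-1)
    loss = -(batch["advantages"] * token_logprobs).mean()

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

In this sketch, phase-one gradients flow only through the layers up to `internal_layer` (plus the shared norm and unembedding), which is what lets the lower, foundational layers be shaped before the full policy is aligned.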
This is a strong and timely piece of work, and it resonates surprisingly well with the “AEO vs SEO in 2026” conversation.
What this paper highlights—decomposing a monolithic LLM policy into internal layer and modular policies—mirrors the broader shift we’re seeing from traditional SEO toward AEO (Answer Engine Optimization). SEO optimized surface signals (keywords, links, rankings), while AEO must optimize for how answers are actually formed inside the model. BuPO essentially operates at the “answer-construction layer,” not just the output layer, which is exactly where AEO competition will be decided.
Several points stand out as especially relevant:
- Entropy dynamics across layers reflect how modern answer engines explore broadly early and converge late—very similar to how AEO systems will balance recall (exploration) and precision (final answer synthesis) in 2026.
- The observation that Qwen models exhibit more human-like progressive reasoning aligns with AEO’s goal: producing structured, stepwise, trustworthy answers rather than keyword-matched responses typical of legacy SEO.
- Bottom-up Policy Optimization feels like an architectural analog to “bottom-up content optimization” for AEO—strengthening foundational reasoning rather than overfitting final outputs.
In short, this paper isn’t just advancing RL for LLMs; it’s pointing toward how answer engines will outperform search engines: by optimizing internal reasoning pathways instead of external ranking tricks. As AEO eclipses SEO in 2026, approaches like BuPO may become the hidden backbone of competitive answer systems.
Excellent work—both technically rigorous and conceptually aligned with where AI-driven information retrieval is heading.
That's a great point! It seems that technology evolves in a very similar way. Thanks for your comment!
