Yan Sun, Jia Guo^, Stanley Kok, Zihao Wang, Zujie Wen, Zhiqiang Zhang
^Project Lead
Published on August 25, 2025 | https://github.com/yan-sun-x/PREPO
✨ Paper accepted at the NeurIPS 2025 Workshop on Efficient Reasoning
<aside>
🤔 Wondering how to choose data wisely and boost efficiency in reinforcement learning (RL) training?
We found that low-PPL prompts tend to yield higher pass rates.
</aside>
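Concretely, a prompt's perplexity (PPL) under a language model is the exponentiated mean token-level negative log-likelihood. Below is a minimal sketch of how such a score could be computed with Hugging Face `transformers`; the checkpoint name follows the Qwen2.5-Math-7B model used in the figures below, and the helper `prompt_ppl` is illustrative, not part of the released codebase.

```python
# Minimal sketch: scoring prompt perplexity with a causal LM.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen2.5-Math-7B"  # matches the model in the figures below
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)
model.eval()

def prompt_ppl(prompt: str) -> float:
    """PPL = exp(mean negative log-likelihood of the prompt tokens)."""
    input_ids = tokenizer(prompt, return_tensors="pt").input_ids
    with torch.no_grad():
        # Passing labels=input_ids makes HF return the shifted
        # cross-entropy loss averaged over predicted tokens.
        loss = model(input_ids, labels=input_ids).loss
    return torch.exp(loss).item()

# Lower PPL means the prompt is more "familiar" to the model.
```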
<aside>
In our preliminary study, we also observed that data with different prompt PPL leads to distinct training dynamics, as reflected in entropy, reward, all-correct ratio, zero-advantage ratio, and validation score.
</aside>
<aside>
Validation Score: We also found that the Low-PPL group tended to achieve higher validation performance in the early stage, indicating that learning from more “familiar” data helps the LLM adapt quickly and grasp knowledge.
[Figure: Validation Score (Qwen2.5-Math-7B)]
Zero Advantage Ratio: In later training phases, the Low-PPL group shows a higher zero-advantage ratio (all rollouts for a prompt receive the same reward, so the within-group reward variance is zero), leaving fewer effective rollouts and lower sample efficiency than the High-PPL group.
[Figure: Zero Advantage Ratio (Qwen2.5-Math-7B)]
</aside>
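For reference, here is a minimal sketch of how the zero-advantage ratio (and the related all-correct ratio) could be measured, assuming GRPO-style group-normalized advantages with binary correctness rewards; the post does not spell out the estimator, and all names below are illustrative. A prompt whose rollouts all receive the same reward contributes zero advantage, hence no gradient signal.

```python
# Sketch, assuming GRPO-style groups and binary (0/1) correctness rewards.
from typing import List

def zero_advantage_ratio(group_rewards: List[List[float]]) -> float:
    """Fraction of prompts whose rollout rewards have zero variance."""
    zero = sum(1 for rewards in group_rewards if len(set(rewards)) == 1)
    return zero / len(group_rewards)

def all_correct_ratio(group_rewards: List[List[float]]) -> float:
    """Fraction of prompts where every rollout is correct (reward == 1)."""
    return sum(1 for rewards in group_rewards if min(rewards) == 1) / len(group_rewards)

# Example: 8 rollouts per prompt.
batch = [
    [1, 1, 1, 1, 1, 1, 1, 1],  # all correct -> zero advantage
    [0, 0, 0, 0, 0, 0, 0, 0],  # all wrong   -> zero advantage
    [1, 0, 1, 1, 0, 1, 0, 1],  # mixed       -> informative gradient
]
print(zero_advantage_ratio(batch))  # 2/3
print(all_correct_ratio(batch))     # 1/3
```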