Yan Sun, Jia Guo^, Stanley Kok, Zihao Wang, Zujie Wen, Zhiqiang Zhang

^Project Lead

Work in Progress | Published on August 25, 2025

TL;DR

🤔  Wondering how to choose data wisely and boost efficiency in reinforcement learning (RL) training?

Our Motivation

Before Rollouts, Prompt Perplexity Foresees Training Signals

We found that low-PPL prompts tend to yield higher pass rates.
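Prompt perplexity, as used here, is the exponential of the negative mean per-token log-probability that the policy model assigns to the prompt. A minimal sketch, assuming per-token log-probabilities have already been obtained from the model (the function name and toy numbers below are hypothetical):

```python
import math

def prompt_perplexity(token_logprobs):
    """PPL of a prompt: exp of the negative mean per-token log-probability
    assigned by the policy model. Lower PPL = more 'familiar' prompt."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

# Toy per-token log-probs: a "familiar" prompt vs. a rarer one.
familiar = [-0.5, -0.8, -0.3, -0.6]
rare     = [-2.1, -3.0, -1.8, -2.5]

low_ppl  = prompt_perplexity(familiar)   # ~1.73
high_ppl = prompt_perplexity(rare)       # ~10.49
```

In practice the log-probabilities would come from a forward pass of the policy LLM over the prompt tokens; the observation above is that prompts on the low end of this score tend to reach higher pass rates during rollouts.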

<aside>

In our preliminary study, we also observed that prompts with different PPL lead to distinct training dynamics, reflected in entropy, reward, all-correct ratio, zero-advantage ratio, and validation score.

</aside>

Some interesting phenomena…

<aside> 🤔

Validation Score: We also found that the Low-PPL group tended to achieve higher validation performance in the early stage, indicating that learning from more "familiar" data helps the LLM quickly adapt and grasp knowledge.

Validation Score (Qwen2.5-Math-7B)

Zero Advantage Ratio: In later training phases, the Low-PPL group shows a higher zero-advantage ratio, i.e., fewer effective rollouts and reduced sample efficiency compared to the High-PPL group.

Zero Advantage Ratio (Qwen2.5-Math-7B)

</aside>
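In group-relative RL setups (e.g., GRPO-style normalization, which we assume here), each rollout's advantage is its reward minus the mean reward of its prompt's group, so a prompt whose rollouts all receive the same reward (all correct or all wrong) produces zero advantages and contributes no gradient. A minimal sketch of the zero-advantage ratio under that assumption (function name and toy rewards are hypothetical):

```python
def zero_advantage_ratio(reward_groups):
    """Fraction of prompts whose rollouts all got the same reward,
    so group-normalized advantages are all zero (no training signal)."""
    zero = sum(1 for rewards in reward_groups if len(set(rewards)) == 1)
    return zero / len(reward_groups)

# Toy batch: 4 prompts x 4 rollouts, with 0/1 correctness rewards.
batch = [
    [1, 1, 1, 1],  # all correct -> zero advantage
    [0, 0, 0, 0],  # all wrong   -> zero advantage
    [1, 0, 1, 0],  # mixed       -> useful signal
    [1, 1, 0, 1],  # mixed       -> useful signal
]
ratio = zero_advantage_ratio(batch)  # -> 0.5
```

This is why very low-PPL prompts become less useful late in training: once the model solves them reliably, their rollout groups are all-correct and fall into the zero-advantage bucket.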

During Rollouts, High Entropy Leads the Way into the Unknown