Yan Sun, Jia Guo^, Stanley Kok, Zihao Wang, Zujie Wen, Zhiqiang Zhang
^Project Lead
Work in Progress | Published on August 25, 2025
TL;DR
<aside>
🤔 Wondering how to choose data wisely and boost efficiency in reinforcement learning (RL) training?
- In our pilot study, we find that Perplexity (PPL) can serve as a low-overhead, intrinsic proxy for eliciting varied training dynamics in RL;
- We propose PPL-Curriculum learning📈 to gradually push the model to learn from data just beyond the edge of its “comfort zone” at the current training stage;
- To leverage rollouts effectively, we adopt Relative Entropy📊. With 41.2% fewer rollouts during RL training, our PREPO approach achieves performance comparable to the baseline on multiple challenging math benchmarks!
</aside>
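Prompt PPL can be computed from the policy's per-token log-probabilities before any rollout is generated, which is what makes it a low-overhead selection signal. A minimal sketch (the function names and the 50/50 split are our own illustrative choices, not the authors' implementation):

```python
import math

def prompt_ppl(token_logprobs: list[float]) -> float:
    # PPL = exp(-(1/N) * sum_t log p(x_t | x_<t)); logprobs come from
    # a single forward pass of the policy over the prompt tokens.
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

def split_by_ppl(prompts, logprobs_per_prompt, frac=0.5):
    """Sort prompts by PPL and return (low-PPL, high-PPL) partitions."""
    scored = sorted(zip(prompts, map(prompt_ppl, logprobs_per_prompt)),
                    key=lambda pair: pair[1])
    k = int(len(scored) * frac)
    return [p for p, _ in scored[:k]], [p for p, _ in scored[k:]]
```

Because the score needs only one forward pass per prompt (no sampling), it can be precomputed for the whole dataset and reused across training stages.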
Our Motivation
Before Rollouts, Prompt Perplexity foresees Training Signals
We found that Low-PPL prompts tend to yield higher pass rates.
<aside>
In our preliminary study, we also observed that data with different prompt PPL leads to distinct training dynamics, reflected in entropy, reward, all-correct ratio, zero-advantage ratio, and validation score.
</aside>
Some interesting phenomena…
<aside>
- Entropy:
High-PPL prompts produce a higher entropy curve than the Low-PPL ones.
</aside>
<aside>
- Reward:
Low-PPL prompts achieve a higher reward than the High-PPL group.
</aside>
<aside>
- All Correct Ratio:
Low-PPL prompts saturate faster than the High-PPL group, with more prompts yielding all-correct responses.
</aside>
<aside>
🤔
Validation Score:
We also found that the Low-PPL group tended to achieve higher validation performance in the early stage, indicating that learning from more “familiar” data helps the LLM quickly adapt and grasp knowledge.

Validation Score (Qwen2.5-Math-7B)
Zero Advantage Ratio:
In later training phases, the Low-PPL group shows a higher zero-advantage ratio, with fewer effective rollouts and reduced sample efficiency compared to the High-PPL group.

Zero Advantage Ratio (Qwen2.5-Math-7B)
</aside>
During Rollouts, High Entropy Leads the Way into the Unknown