Yan Sun, Jia Guo^, Stanley Kok, Zihao Wang, Zujie Wen, Zhiqiang Zhang
^Project Lead
Work in Progress | Published on August 25, 2025
TL;DR
<aside>
🤔 Wondering how to choose data wisely and boost efficiency in reinforcement learning (RL) training?
- In our pilot study, we find that Perplexity (PPL) can serve as a low-overhead, intrinsic proxy for eliciting varied training dynamics in RL;
- We propose PPL-Curriculum learning📈 to gradually push the model to learn from data just beyond the edge of its “comfort zone” at the current training stage;
- To leverage rollouts effectively, we adopt Relative Entropy📊. With 41.2% fewer rollouts during RL training, our PREPO approach achieves performance comparable to the baseline on multiple challenging math benchmarks!
</aside>
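Prompt PPL can be computed from the policy's per-token log-probabilities before any rollout is generated, which is what makes it a low-overhead selection signal. A minimal sketch (the function names and the 50/50 split are our own illustrative choices, not the authors' implementation):

```python
import math

def prompt_ppl(token_logprobs: list[float]) -> float:
    # PPL = exp(-(1/N) * sum_t log p(x_t | x_<t)); logprobs come from
    # a single forward pass of the policy over the prompt tokens.
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

def split_by_ppl(prompts, logprobs_per_prompt, frac=0.5):
    """Sort prompts by PPL and return (low-PPL, high-PPL) partitions."""
    scored = sorted(zip(prompts, map(prompt_ppl, logprobs_per_prompt)),
                    key=lambda pair: pair[1])
    k = int(len(scored) * frac)
    return [p for p, _ in scored[:k]], [p for p, _ in scored[k:]]
```

Because the score needs only one forward pass per prompt (no sampling), it can be precomputed for the whole dataset and reused across training stages.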
Our Motivation
Before Rollouts, Prompt Perplexity foresees Training Signals
We found that Low-PPL prompts tend to yield higher pass rates.
<aside>
In our preliminary study, we also observed that data with different prompt PPL leads to distinct training dynamics, reflected in entropy, reward, all-correct ratio, zero-advantage ratio, and validation score.
</aside>
Some interesting phenomena…
<aside>
- Entropy:
High-PPL prompts produce a higher entropy curve than the Low-PPL ones.
</aside>
<aside>
- Reward:
Low-PPL prompts achieve a higher reward than the High-PPL group.
</aside>
<aside>
- All Correct Ratio:
Low-PPL prompts saturate faster than the High-PPL group, with more prompts yielding all-correct responses.
</aside>
<aside>
🤔
Validation Score:
We also found that the Low-PPL group tended to achieve higher validation performance in the early stage, indicating that learning from more “familiar” data helps the LLM quickly adapt and grasp knowledge.

Validation Score (Qwen2.5-Math-7B)
Zero Advantage Ratio:
In later training phases, the Low-PPL group shows a higher zero-advantage ratio, with fewer effective rollouts and reduced sample efficiency compared to the High-PPL group.

Zero Advantage Ratio (Qwen2.5-Math-7B)
</aside>
During Rollouts, High Entropy Leads the Way into the Unknown