Rethinking Training Signals in RLVR

Speaker

University of Washington

Host

NLP Meetings Seminar Series

In this talk, I will share lessons we learned from our RLVR experiments: it is risky to draw general conclusions about RLVR from a single model family. Our recent work on spurious rewards shows that even random or incorrect rewards can elicit strong mathematical reasoning in certain models, despite having no or negative correlation with the correct answer. For example, RLVR improves Qwen2.5-Math-7B's MATH-500 performance by 21.4 absolute percentage points with a random reward and by 24.1 points with incorrect labels, nearly matching the 29.1-point gain from ground-truth rewards. However, these spurious rewards, which work for Qwen, often fail to yield gains with other model families such as Llama3 or OLMo2. When nearly any reward works for one model but not for others, the explanation lies in the model, not the method. In particular, we find code reasoning, that is, thinking in code without actually executing it, to be a distinctive Qwen2.5-Math behavior that becomes significantly more frequent after RLVR, rising from 65% to over 90% even with spurious rewards. In addition, we show that random rewards can still provide a meaningful training signal due to a bias in the GRPO training algorithm. Overall, we hypothesize that, given the lack of a useful reward signal, RLVR must somehow be surfacing useful reasoning representations learned during pretraining. We suggest that future RLVR research be validated on diverse models rather than on a single de facto choice, since we show that it is easy to obtain significant performance gains on Qwen models even with completely spurious reward signals.
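
For context on the GRPO bias mentioned in the abstract, the sketch below restates the standard group-relative objective as background; it is not a result from the talk. The notation (G rollouts per prompt, rewards R_i, clipping range ε, KL regularization omitted) follows common GRPO write-ups, and the closing comment about clipping is my reading of where such a bias could enter, not the speaker's stated mechanism.

```latex
% Group-relative advantage for G rollouts o_1, ..., o_G sampled for one prompt q,
% each scored with a scalar reward R_i (verifiable, random, or incorrect):
\[
  \hat{A}_i \;=\; \frac{R_i - \mathrm{mean}(R_1,\dots,R_G)}{\mathrm{std}(R_1,\dots,R_G)}
\]

% PPO-style clipped surrogate used by GRPO (KL regularization omitted), where
% r_{i,t}(\theta) = \pi_\theta(o_{i,t} | q, o_{i,<t}) / \pi_{\theta_{\mathrm{old}}}(o_{i,t} | q, o_{i,<t}):
\[
  J(\theta) \;=\; \mathbb{E}\!\left[
    \frac{1}{G} \sum_{i=1}^{G} \frac{1}{|o_i|} \sum_{t=1}^{|o_i|}
    \min\!\Big( r_{i,t}(\theta)\,\hat{A}_i,\;
                \mathrm{clip}\big(r_{i,t}(\theta),\, 1-\varepsilon,\, 1+\varepsilon\big)\,\hat{A}_i \Big)
  \right]
\]

% Note: even a random reward produces nonzero within-group variance, so the
% normalized advantages \hat{A}_i are nonzero and parameter updates still occur;
% the asymmetric clipping of r_{i,t}(\theta) is one place an algorithmic bias
% of the kind described in the abstract could enter.
```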