Reasoning-enhanced Reward Models with Structured Rubrics
RM-R1 introduces a reasoning-centered approach to reward modeling through Chain-of-Rubrics (CoR): the model first generates evaluation rubrics or an intermediate solution and only then renders its judgment, which improves interpretability and yields state-of-the-art accuracy on standard reward benchmarks.
Figure 1: Overview of the RM-R1 framework
All our models and datasets are available on Hugging Face.
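As a quick-start illustration, here is a minimal sketch of loading an RM-R1 checkpoint and judging a preference pair in the Chain-of-Rubrics style. The repo id and the prompt wording below are placeholders for illustration, not the exact template shipped with our models.

```python
# Minimal sketch: score a preference pair with an RM-R1 checkpoint.
# The repo id and prompt wording are placeholders, not the exact template.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "RM-R1/RM-R1-Qwen-Instruct-7B"  # placeholder repo id

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID, torch_dtype=torch.bfloat16, device_map="auto"
)

def judge(question: str, answer_a: str, answer_b: str) -> str:
    """Ask the model to reason (rubrics or a worked solution) first,
    then emit a final verdict such as [[A]] or [[B]]."""
    prompt = (
        "You are evaluating two responses to the same question.\n"
        "First write evaluation rubrics (or solve the problem yourself "
        "if it is a reasoning task), then compare the responses against "
        "them, and finally output your verdict as [[A]] or [[B]].\n\n"
        f"Question:\n{question}\n\n"
        f"Response A:\n{answer_a}\n\nResponse B:\n{answer_b}"
    )
    messages = [{"role": "user", "content": prompt}]
    inputs = tokenizer.apply_chat_template(
        messages, add_generation_prompt=True, return_tensors="pt"
    ).to(model.device)
    output = model.generate(inputs, max_new_tokens=2048, do_sample=False)
    text = tokenizer.decode(output[0][inputs.shape[-1]:], skip_special_tokens=True)
    return "A" if "[[A]]" in text else "B" if "[[B]]" in text else "unknown"
```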
Evaluated on RewardBench, RM-Bench, and RMB, RM-R1 models consistently outperform larger counterparts while producing interpretable judgments and generalizing across domains.
Models | RewardBench | RM-Bench | RMB | Average |
---|---|---|---|---|
ScalarRMs | ||||
SteerLM-RM-70B | 88.8 | 52.5 | 58.2 | 66.5 |
Eurus-RM-7b | 82.8 | 65.9 | 68.3 | 72.3 |
Internlm2-20b-reward | 90.2 | 68.3 | 62.9 | 73.6 |
Skywork-Reward-Gemma-2-27B | 93.8 | 67.3 | 60.2 | 73.8 |
Internlm2-7b-reward | 87.6 | 67.1 | 67.1 | 73.9 |
ArmoRM-Llama3-8B-v0.1 | 90.4 | 67.7 | 64.6 | 74.2 |
Nemotron-4-340B-Reward | 92.0 | 69.5 | 69.9 | 77.1 |
Skywork-Reward-Llama-3.1-8B | 92.5 | 70.1 | 69.3 | 77.5 |
INF-ORM-Llama3.1-70B | 95.1 | 70.9 | 70.5 | 78.8 |
GenRMs | ||||
Claude-3-5-sonnet-20240620 | 84.2 | 61.0 | 70.6 | 71.9 |
Llama3.1-70B-Instruct | 84.0 | 65.5 | 68.9 | 72.8 |
Gemini-1.5-pro | 88.2 | 75.2 | 56.5 | 73.3 |
Skywork-Critic-Llama-3.1-70B | 93.3 | 71.9 | 65.5 | 76.9 |
GPT-4o-0806 | 86.7 | 72.5 | 73.8 | 77.7 |
ReasRMs | ||||
JudgeLRM | 75.2 | 64.7 | 53.1 | 64.3 |
DeepSeek-PairRM-27B | 87.1 | – | 58.2 | – |
DeepSeek-GRM-27B-RFT | 84.5 | – | 67.0 | – |
DeepSeek-GRM-27B | 86.0 | – | 69.0 | – |
Self-taught-evaluator-llama3.1-70B | 90.2 | 71.4 | 67.0 | 76.2 |
Our Methods | ||||
RM-R1-DeepSeek-Distilled-Qwen-7B | 80.1 | 72.4 | 55.1 | 69.2 |
RM-R1-Qwen-Instruct-7B | 85.2 | 70.2 | 66.4 | 73.9 |
RM-R1-Qwen-Instruct-14B | 88.2 | 76.1 | 69.2 | 77.8 |
RM-R1-DeepSeek-Distilled-Qwen-14B | 88.9 | 81.5 | 68.5 | 79.6 |
RM-R1-Qwen-Instruct-32B | 91.4 | 79.1 | 73.0 | 81.2 |
RM-R1-DeepSeek-Distilled-Qwen-32B | 90.9 | 83.9 | 69.8 | 81.5 |
Directly replicating reinforcement learning recipes from mathematical tasks is insufficient for training strong reasoning reward models. Explicit query categorization and targeted distillation of high-quality reasoning traces are both crucial for achieving robust and generalizable improvements.
Method | Chat | Chat Hard | Safety | Reasoning | Average |
---|---|---|---|---|---|
Instruct (Original) | 95.8 | 74.3 | 86.8 | 86.3 | 85.8 |
Instruct + Cold Start RL | 92.5 | 81.5 | 89.7 | 94.4 | 89.5 |
Instruct + Cold Start RL + Rubrics | 93.0 | 82.5 | 90.8 | 94.2 | 90.1 |
Instruct + Cold Start RL + Rubrics + QC | 92.3 | 82.6 | 91.6 | 96.3 | 90.8 |
RM-R1 | 95.3 | 83.1 | 91.9 | 95.2 | 91.4 |
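For concreteness, the cold-start RL stage in the ablation above can be driven by a simple rule-based reward on the final verdict. The sketch below assumes a `[[A]]`/`[[B]]` verdict format and omits any additional shaping terms; it is an illustration of the idea, not the exact recipe.

```python
import re

def verdict_reward(completion: str, preferred: str) -> float:
    """Rule-based reward for RL on pairwise judgments: +1 when the final
    verdict matches the labeled preference, -1 otherwise (including when
    no parsable verdict is produced). Any extra shaping, e.g. format
    penalties, is an assumption rather than the exact recipe."""
    verdicts = re.findall(r"\[\[([AB])\]\]", completion)
    if not verdicts:
        return -1.0
    return 1.0 if verdicts[-1] == preferred else -1.0
```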
Scaling improves reward model performance: we observe a near-linear trend with both model size and inference-time compute. Larger models consistently benefit more from our reasoning-based training pipeline, and longer reasoning chains become increasingly effective under higher compute budgets.
Figure 4: Scaling effect of RM-R1. (a) Larger models benefit more from reasoning training. (b) Longer reasoning chains improve RM performance.
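The inference-time trend in Figure 4(b) can be probed with a simple sweep over the generation budget. The helper below is a sketch that assumes a judging callable (like the one sketched earlier, extended with a token-budget argument) and a small labeled evaluation set.

```python
# Sketch of an inference-time compute sweep: measure judgment accuracy
# while capping the number of generated reasoning tokens.
from typing import Callable, Iterable, Tuple

def accuracy_vs_budget(
    judge_fn: Callable[[str, str, str, int], str],
    labeled_pairs: Iterable[Tuple[str, str, str, str]],  # (question, A, B, gold)
    budgets: Iterable[int] = (512, 1024, 2048, 4096),
) -> dict:
    data = list(labeled_pairs)
    results = {}
    for budget in budgets:
        correct = sum(
            judge_fn(q, a, b, budget) == gold for q, a, b, gold in data
        )
        results[budget] = correct / max(len(data), 1)
    return results
```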
Reasoning training substantially improves reward modeling: it generalizes better across tasks and delivers consistent gains over direct-answer SFT, even in limited-data settings.
Method | RewardBench | RM-Bench | RMB | Avg. |
---|---|---|---|---|
Train on Full Data | ||||
Instruct + SFT | 90.9 | 75.4 | 65.9 | 77.4 |
Instruct + Distilled + SFT | 91.2 | 76.7 | 65.4 | 77.8 |
RM-R1 * | 91.4 | 79.1 | 73.0 | 81.2 |
Train on 9k (Distillation) Data | ||||
Instruct + SFT | 88.8 | 74.8 | 66.9 | 76.6 |
Instruct + Distilled * | 89.0 | 76.3 | 72.0 | 79.2 |
Table 3: Comparison of reasoning-based training versus SFT across benchmarks. * indicates reasoning-based methods. Reasoning training consistently yields better performance.
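For reference, the distillation SFT baselines in Table 3 can be built by pairing each judgment prompt with a high-quality reasoning trace that ends in the correct verdict. The sketch below assumes a hypothetical record schema (`question`, `response_a`, `response_b`, `preferred`, `reasoning_trace`); the actual data format may differ.

```python
# Sketch of assembling distillation SFT data: pair each judgment prompt
# with a reasoning trace (e.g. from a stronger model) that ends in the
# correct verdict. The field names below are assumptions.
import json

def build_sft_example(record: dict) -> dict:
    prompt = (
        "Evaluate the two responses below, reason step by step, "
        "and end with your verdict as [[A]] or [[B]].\n\n"
        f"Question:\n{record['question']}\n\n"
        f"Response A:\n{record['response_a']}\n\n"
        f"Response B:\n{record['response_b']}"
    )
    return {
        "messages": [
            {"role": "user", "content": prompt},
            {"role": "assistant", "content": record["reasoning_trace"]},
        ]
    }

def write_sft_file(records: list, path: str = "distill_sft.jsonl") -> None:
    with open(path, "w") as f:
        for record in records:
            # Keep only traces whose final verdict matches the preference label.
            if f"[[{record['preferred']}]]" in record["reasoning_trace"]:
                f.write(json.dumps(build_sft_example(record)) + "\n")
```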
If you find RM-R1 useful in your research, please consider citing our work:
```bibtex
@article{chen2025rm,
  title   = {{RM-R1}: Reward Modeling as Reasoning},
  author  = {Chen, Xiusi and Li, Gaotang and Wang, Ziqi and Jin, Bowen and Qian, Cheng and Wang, Yu and Wang, Hongru and Zhang, Yu and Zhang, Denghui and Zhang, Tong and others},
  journal = {arXiv preprint arXiv:2505.02387},
  year    = {2025}
}
```