Reasoning-enhanced Reward Models with Structured Rubrics
RM-R1 introduces a reasoning-centered approach to reward modeling through Chain-of-Rubrics (CoR): the model first generates evaluation rubrics or an intermediate solution and only then renders its judgment, which improves interpretability and yields state-of-the-art accuracy on standard reward benchmarks.
Figure 1: Overview of the RM-R1 framework
All our models and datasets are available on Hugging Face.
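As a quick-start illustration, here is a minimal sketch of loading an RM-R1 checkpoint and judging a preference pair in the Chain-of-Rubrics style. The repo id and the prompt wording below are placeholders for illustration, not the exact template shipped with our models.

```python
# Minimal sketch: score a preference pair with an RM-R1 checkpoint.
# The repo id and prompt wording are placeholders, not the exact template.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "RM-R1/RM-R1-Qwen-Instruct-7B"  # placeholder repo id

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID, torch_dtype=torch.bfloat16, device_map="auto"
)

def judge(question: str, answer_a: str, answer_b: str) -> str:
    """Ask the model to reason (rubrics or a worked solution) first,
    then emit a final verdict such as [[A]] or [[B]]."""
    prompt = (
        "You are evaluating two responses to the same question.\n"
        "First write evaluation rubrics (or solve the problem yourself "
        "if it is a reasoning task), then compare the responses against "
        "them, and finally output your verdict as [[A]] or [[B]].\n\n"
        f"Question:\n{question}\n\n"
        f"Response A:\n{answer_a}\n\nResponse B:\n{answer_b}"
    )
    messages = [{"role": "user", "content": prompt}]
    inputs = tokenizer.apply_chat_template(
        messages, add_generation_prompt=True, return_tensors="pt"
    ).to(model.device)
    output = model.generate(inputs, max_new_tokens=2048, do_sample=False)
    text = tokenizer.decode(output[0][inputs.shape[-1]:], skip_special_tokens=True)
    return "A" if "[[A]]" in text else "B" if "[[B]]" in text else "unknown"
```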
Evaluated on RewardBench, RM-Bench, and RMB, RM-R1 models consistently outperform larger counterparts while producing interpretable judgments and generalizing across domains.
Models | RewardBench | RM-Bench | RMB | Average |
---|---|---|---|---|
ScalarRMs | ||||
SteerLM-RM-70B | 88.8 | 52.5 | 58.2 | 66.5 |
Eurus-RM-7b | 82.8 | 65.9 | 68.3 | 72.3 |
Internlm2-20b-reward | 90.2 | 68.3 | 62.9 | 73.6 |
Skywork-Reward-Gemma-2-27B | 93.8 | 67.3 | 60.2 | 73.8 |
Internlm2-7b-reward | 87.6 | 67.1 | 67.1 | 73.9 |
ArmoRM-Llama3-8B-v0.1 | 90.4 | 67.7 | 64.6 | 74.2 |
Nemotron-4-340B-Reward | 92.0 | 69.5 | 69.9 | 77.1 |
Skywork-Reward-Llama-3.1-8B | 92.5 | 70.1 | 69.3 | 77.5 |
INF-ORM-Llama3.1-70B | 95.1 | 70.9 | 70.5 | 78.8 |
GenRMs | ||||
Claude-3-5-sonnet-20240620 | 84.2 | 61.0 | 70.6 | 71.9 |
Llama3.1-70B-Instruct | 84.0 | 65.5 | 68.9 | 72.8 |
Gemini-1.5-pro | 88.2 | 75.2 | 56.5 | 73.3 |
Skywork-Critic-Llama-3.1-70B | 93.3 | 71.9 | 65.5 | 76.9 |
GPT-4o-0806 | 86.7 | 72.5 | 73.8 | 77.7 |
ReasRMs | ||||
JudgeLRM | 75.2 | 64.7 | 53.1 | 64.3 |
DeepSeek-PairRM-27B | 87.1 | – | 58.2 | – |
DeepSeek-GRM-27B-RFT | 84.5 | – | 67.0 | – |
DeepSeek-GRM-27B | 86.0 | – | 69.0 | – |
Self-taught-evaluator-llama3.1-70B | 90.2 | 71.4 | 67.0 | 76.2 |
Our Methods | ||||
RM-R1-DeepSeek-Distilled-Qwen-7B | 80.1 | 72.4 | 55.1 | 69.2 |
RM-R1-Qwen-Instruct-7B | 85.2 | 70.2 | 66.4 | 73.9 |
RM-R1-Qwen-Instruct-14B | 88.2 | 76.1 | 69.2 | 77.8 |
RM-R1-DeepSeek-Distilled-Qwen-14B | 88.9 | 81.5 | 68.5 | 79.6 |
RM-R1-Qwen-Instruct-32B | 91.4 | 79.1 | 73.0 | 81.2 |
RM-R1-DeepSeek-Distilled-Qwen-32B | 90.9 | 83.9 | 69.8 | 81.5 |
Directly replicating reinforcement learning recipes from mathematical tasks is insufficient for training strong reasoning reward models. Explicit query categorization and targeted distillation of high-quality reasoning traces are both crucial for achieving robust and generalizable improvements.
Method | Chat | Chat Hard | Safety | Reasoning | Average |
---|---|---|---|---|---|
Instruct (Original) | 95.8 | 74.3 | 86.8 | 86.3 | 85.8 |
Instruct + Cold Start RL | 92.5 | 81.5 | 89.7 | 94.4 | 89.5 |
Instruct + Cold Start RL + Rubrics | 93.0 | 82.5 | 90.8 | 94.2 | 90.1 |
Instruct + Cold Start RL + Rubrics + QC | 92.3 | 82.6 | 91.6 | 96.3 | 90.8 |
RM-R1 | 95.3 | 83.1 | 91.9 | 95.2 | 91.4 |
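For concreteness, the cold-start RL stage in the ablation above can be driven by a simple rule-based reward on the final verdict. The sketch below assumes a `[[A]]`/`[[B]]` verdict format and omits any additional shaping terms; it is an illustration of the idea, not the exact recipe.

```python
import re

def verdict_reward(completion: str, preferred: str) -> float:
    """Rule-based reward for RL on pairwise judgments: +1 when the final
    verdict matches the labeled preference, -1 otherwise (including when
    no parsable verdict is produced). Any extra shaping, e.g. format
    penalties, is an assumption rather than the exact recipe."""
    verdicts = re.findall(r"\[\[([AB])\]\]", completion)
    if not verdicts:
        return -1.0
    return 1.0 if verdicts[-1] == preferred else -1.0
```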
Scaling improves reward model performance: we observe a near-linear trend with both model size and inference-time compute. Larger models consistently benefit more from our reasoning-based training pipeline, and longer reasoning chains become increasingly effective under higher compute budgets.
Figure 4: Scaling effect of RM-R1. (a) Larger models benefit more from reasoning training. (b) Longer reasoning chains improve RM performance.
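The inference-time trend in Figure 4(b) can be probed with a simple sweep over the generation budget. The helper below is a sketch that assumes a judging callable (like the one sketched earlier, extended with a token-budget argument) and a small labeled evaluation set.

```python
# Sketch of an inference-time compute sweep: measure judgment accuracy
# while capping the number of generated reasoning tokens.
from typing import Callable, Iterable, Tuple

def accuracy_vs_budget(
    judge_fn: Callable[[str, str, str, int], str],
    labeled_pairs: Iterable[Tuple[str, str, str, str]],  # (question, A, B, gold)
    budgets: Iterable[int] = (512, 1024, 2048, 4096),
) -> dict:
    data = list(labeled_pairs)
    results = {}
    for budget in budgets:
        correct = sum(
            judge_fn(q, a, b, budget) == gold for q, a, b, gold in data
        )
        results[budget] = correct / max(len(data), 1)
    return results
```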
Reasoning training substantially improves reward modeling: it generalizes better across tasks and delivers consistent gains over direct-answer SFT, even in limited-data settings.
Method | RewardBench | RM-Bench | RMB | Avg. |
---|---|---|---|---|
Train on Full Data | ||||
Instruct + SFT | 90.9 | 75.4 | 65.9 | 77.4 |
Instruct + Distilled + SFT | 91.2 | 76.7 | 65.4 | 77.8 |
RM-R1 * | 91.4 | 79.1 | 73.0 | 81.2 |
Train on 9k (Distillation) Data | ||||
Instruct + SFT | 88.8 | 74.8 | 66.9 | 76.6 |
Instruct + Distilled * | 89.0 | 76.3 | 72.0 | 79.2 |
Table 3: Comparison of reasoning-based training versus SFT across benchmarks. * indicates reasoning-based methods. Reasoning training consistently yields better performance.
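For reference, the distillation SFT baselines in Table 3 can be built by pairing each judgment prompt with a high-quality reasoning trace that ends in the correct verdict. The sketch below assumes a hypothetical record schema (`question`, `response_a`, `response_b`, `preferred`, `reasoning_trace`); the actual data format may differ.

```python
# Sketch of assembling distillation SFT data: pair each judgment prompt
# with a reasoning trace (e.g. from a stronger model) that ends in the
# correct verdict. The field names below are assumptions.
import json

def build_sft_example(record: dict) -> dict:
    prompt = (
        "Evaluate the two responses below, reason step by step, "
        "and end with your verdict as [[A]] or [[B]].\n\n"
        f"Question:\n{record['question']}\n\n"
        f"Response A:\n{record['response_a']}\n\n"
        f"Response B:\n{record['response_b']}"
    )
    return {
        "messages": [
            {"role": "user", "content": prompt},
            {"role": "assistant", "content": record["reasoning_trace"]},
        ]
    }

def write_sft_file(records: list, path: str = "distill_sft.jsonl") -> None:
    with open(path, "w") as f:
        for record in records:
            # Keep only traces whose final verdict matches the preference label.
            if f"[[{record['preferred']}]]" in record["reasoning_trace"]:
                f.write(json.dumps(build_sft_example(record)) + "\n")
```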
If you find RM-R1 useful in your research, please consider citing our work:
```bibtex
@article{chen2025rm,
  title   = {{RM-R1}: Reward Modeling as Reasoning},
  author  = {Chen, Xiusi and Li, Gaotang and Wang, Ziqi and Jin, Bowen and Qian, Cheng and Wang, Yu and Wang, Hongru and Zhang, Yu and Zhang, Denghui and Zhang, Tong and others},
  journal = {arXiv preprint arXiv:2505.02387},
  year    = {2025}
}
```