RM-R1: Reward Modeling as Reasoning

Reasoning-enhanced Reward Models with Structured Rubrics

RM-R1 Team
Xiusi Chen1*, Gaotang Li1*, Ziqi Wang1*, Bowen Jin1, Cheng Qian1, Yu Wang2, Hongru Wang1
Yu Zhang3, Denghui Zhang4, Tong Zhang1, Hanghang Tong1, Heng Ji1
*Equal Contribution

1University of Illinois Urbana-Champaign
2University of California, San Diego
3Texas A&M University
4Stevens Institute of Technology

{xiusic, gaotang3, htong, hengji}@illinois.edu

Overview

RM-R1 introduces a reasoning-centered approach to reward modeling through Chain-of-Rubrics (CoR). Before issuing a judgment, the model first generates evaluation rubrics (for chat-style queries) or works out its own solution (for reasoning queries), making its decisions interpretable while achieving state-of-the-art accuracy on standard reward-model benchmarks.

Figure 1: Overview of the RM-R1 framework
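To make CoR concrete, below is a minimal sketch of a pairwise judging prompt in its spirit: chat-style queries ask the judge to write rubrics first, reasoning queries ask it to solve the problem itself first, and the final verdict is parsed from a marker. The template wording and the `[[A]]`/`[[B]]` marker are illustrative assumptions, not the exact prompts shipped with RM-R1 (those live in the repository).

```python
import re

# Illustrative CoR-style pairwise judging templates (NOT the exact RM-R1 prompts;
# see the repository for the real templates and verdict format).
RUBRIC_TEMPLATE = """You are an impartial judge. First, write evaluation rubrics
for the user's question. Then grade each answer against the rubrics and end with
your verdict: [[A]] if Answer A is better, [[B]] if Answer B is better.

Question: {question}

Answer A: {answer_a}

Answer B: {answer_b}"""

SOLVE_TEMPLATE = """You are an impartial judge. First, solve the problem yourself
step by step. Then compare both answers against your solution and end with your
verdict: [[A]] if Answer A is better, [[B]] if Answer B is better.

Question: {question}

Answer A: {answer_a}

Answer B: {answer_b}"""


def build_cor_prompt(question: str, answer_a: str, answer_b: str, task_type: str) -> str:
    """Route reasoning queries to solve-then-judge and everything else to
    rubric generation, mirroring the two CoR branches described above."""
    template = SOLVE_TEMPLATE if task_type == "reasoning" else RUBRIC_TEMPLATE
    return template.format(question=question, answer_a=answer_a, answer_b=answer_b)


def parse_verdict(judgment: str) -> str | None:
    """Return 'A' or 'B' from the last [[A]]/[[B]] marker in the judgment, if any."""
    matches = re.findall(r"\[\[([AB])\]\]", judgment)
    return matches[-1] if matches else None
```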

Key Features

Model Cards

All our models and datasets are available on Hugging Face; a minimal loading sketch follows the model list below.

RM-R1-Qwen-Instruct-7B
RM-R1-Qwen-Instruct-14B
RM-R1-Qwen-Instruct-32B
RM-R1-DeepSeek-Distilled-Qwen-14B
RM-R1-DeepSeek-Distilled-Qwen-7B
RM-R1-DeepSeek-Distilled-Qwen-32B
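The sketch below loads one of the checkpoints with Hugging Face `transformers`. The repository id is a placeholder to be replaced with the exact model card name from the list above, and the prompt wording and generation settings are illustrative rather than the project's official inference script.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder repo id -- substitute the exact Hugging Face model card name from the list above.
MODEL_ID = "<hf-org>/RM-R1-Qwen-Instruct-7B"

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype="auto", device_map="auto")

# A pairwise judging request; the real prompt templates ship with the repository.
messages = [
    {"role": "user", "content": (
        "Question: What is 17 * 24?\n\n"
        "Answer A: 408\n\nAnswer B: 398\n\n"
        "Reason step by step, then state which answer is better as [[A]] or [[B]]."
    )},
]

inputs = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt").to(model.device)
outputs = model.generate(inputs, max_new_tokens=1024, do_sample=False)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```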

Benchmarks

Evaluated on RewardBench, RM-Bench, and RMB, RM-R1 models consistently outperform larger counterparts while producing interpretable judgments and generalizing across domains.

| Models | RewardBench | RM-Bench | RMB | Average |
|---|---|---|---|---|
| ScalarRMs | | | | |
| SteerLM-RM-70B | 88.8 | 52.5 | 58.2 | 66.5 |
| Eurus-RM-7b | 82.8 | 65.9 | 68.3 | 72.3 |
| Internlm2-20b-reward | 90.2 | 68.3 | 62.9 | 73.6 |
| Skywork-Reward-Gemma-2-27B | 93.8 | 67.3 | 60.2 | 73.8 |
| Internlm2-7b-reward | 87.6 | 67.1 | 67.1 | 73.9 |
| ArmoRM-Llama3-8B-v0.1 | 90.4 | 67.7 | 64.6 | 74.2 |
| Nemotron-4-340B-Reward | 92.0 | 69.5 | 69.9 | 77.1 |
| Skywork-Reward-Llama-3.1-8B | 92.5 | 70.1 | 69.3 | 77.5 |
| INF-ORM-Llama3.1-70B | 95.1 | 70.9 | 70.5 | 78.8 |
| GenRMs | | | | |
| Claude-3-5-sonnet-20240620 | 84.2 | 61.0 | 70.6 | 71.9 |
| Llama3.1-70B-Instruct | 84.0 | 65.5 | 68.9 | 72.8 |
| Gemini-1.5-pro | 88.2 | 75.2 | 56.5 | 73.3 |
| Skywork-Critic-Llama-3.1-70B | 93.3 | 71.9 | 65.5 | 76.9 |
| GPT-4o-0806 | 86.7 | 72.5 | 73.8 | 77.7 |
| ReasRMs | | | | |
| JudgeLRM | 75.2 | 64.7 | 53.1 | 64.3 |
| DeepSeek-PairRM-27B | 87.1 | – | 58.2 | – |
| DeepSeek-GRM-27B-RFT | 84.5 | – | 67.0 | – |
| DeepSeek-GRM-27B | 86.0 | – | 69.0 | – |
| Self-taught-evaluator-llama3.1-70B | 90.2 | 71.4 | 67.0 | 76.2 |
| Our Methods | | | | |
| RM-R1-DeepSeek-Distilled-Qwen-7B | 80.1 | 72.4 | 55.1 | 69.2 |
| RM-R1-Qwen-Instruct-7B | 85.2 | 70.2 | 66.4 | 73.9 |
| RM-R1-Qwen-Instruct-14B | 88.2 | 76.1 | 69.2 | 77.8 |
| RM-R1-DeepSeek-Distilled-Qwen-14B | 88.9 | 81.5 | 68.5 | 79.6 |
| RM-R1-Qwen-Instruct-32B | 91.4 | 79.1 | 73.0 | 81.2 |
| RM-R1-DeepSeek-Distilled-Qwen-32B | 90.9 | 83.9 | 69.8 | 81.5 |

– = score not reported.
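Up to rounding of the per-benchmark numbers, the Average column is the mean of the three benchmark scores; a quick sanity check for the two 32B RM-R1 rows:

```python
# Average = mean of (RewardBench, RM-Bench, RMB), rounded to one decimal place.
rows = {
    "RM-R1-Qwen-Instruct-32B": (91.4, 79.1, 73.0),           # reported Average: 81.2
    "RM-R1-DeepSeek-Distilled-Qwen-32B": (90.9, 83.9, 69.8),  # reported Average: 81.5
}
for name, scores in rows.items():
    print(name, round(sum(scores) / len(scores), 1))  # -> 81.2 and 81.5
```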

Key Takeaways

⭐ Takeaway 1:

Directly replicating reinforcement learning recipes from mathematical tasks is insufficient for training strong reasoning reward models. Explicit query categorization and targeted distillation of high-quality reasoning traces are both crucial for achieving robust and generalizable improvements.
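For intuition on the reinforcement-learning stage referenced here, the sketch below shows a verifiable, rule-based reward of the kind used to train judges with RL: a rollout is rewarded only when its parsed verdict matches the ground-truth preferred response. It reuses the same verdict-parsing idea as the CoR sketch above; the `[[A]]`/`[[B]]` marker and the ±1 shaping are illustrative assumptions, and the exact reward and training recipe follow the paper and repository.

```python
import re

def verdict(judgment: str) -> str | None:
    """Extract the last [[A]]/[[B]] marker from a generated judgment (illustrative format)."""
    found = re.findall(r"\[\[([AB])\]\]", judgment)
    return found[-1] if found else None

def correctness_reward(judgment: str, gold_preference: str) -> float:
    """Rule-based, verifiable reward: +1 if the parsed verdict matches the ground-truth
    preferred response ('A' or 'B'), -1 otherwise (including malformed outputs).
    A sketch of the signal, not the exact shaping used by RM-R1."""
    return 1.0 if verdict(judgment) == gold_preference else -1.0

# Example: a rollout that ends with "[[B]]" when the gold preference is "B" earns +1.
print(correctness_reward("...rubrics... therefore [[B]]", "B"))  # 1.0
```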

Table 2: Ablation study of design choices for reasoning training, evaluated on RewardBench (QC = query categorization).

| Method | Chat | Chat Hard | Safety | Reasoning | Average |
|---|---|---|---|---|---|
| Instruct (Original) | 95.8 | 74.3 | 86.8 | 86.3 | 85.8 |
| Instruct + Cold Start RL | 92.5 | 81.5 | 89.7 | 94.4 | 89.5 |
| Instruct + Cold Start RL + Rubrics | 93.0 | 82.5 | 90.8 | 94.2 | 90.1 |
| Instruct + Cold Start RL + Rubrics + QC | 92.3 | 82.6 | 91.6 | 96.3 | 90.8 |
| RM-R1 | 95.3 | 83.1 | 91.9 | 95.2 | 91.4 |
⭐ Takeaway 2:

Scaling improves reward model performance: we observe a near-linear trend with both model size and inference-time compute. Larger models consistently benefit more from our reasoning-based training pipeline, and longer reasoning chains become increasingly effective under higher compute budgets.

Figure 4: Scaling effect of RM-R1. (a) Larger models benefit more from reasoning training. (b) Longer reasoning chains improve RM performance.

⭐ Takeaway 3:

Reasoning training substantially improves reward modeling. Compared to direct-answer SFT, it not only generalizes better across tasks but also yields consistent gains even in limited-data settings.

| Method | RewardBench | RM-Bench | RMB | Avg. |
|---|---|---|---|---|
| Train on Full Data | | | | |
| Instruct + SFT | 90.9 | 75.4 | 65.9 | 77.4 |
| Instruct + Distilled + SFT | 91.2 | 76.7 | 65.4 | 77.8 |
| RM-R1 * | 91.4 | 79.1 | 73.0 | 81.2 |
| Train on 9k (Distillation) Data | | | | |
| Instruct + SFT | 88.8 | 74.8 | 66.9 | 76.6 |
| Instruct + Distilled * | 89.0 | 76.3 | 72.0 | 79.2 |
Table 3: Comparison of reasoning-based training versus SFT across benchmarks. * indicates reasoning-based methods. Reasoning training consistently yields better performance.

Citation

If you find RM-R1 useful in your research, please consider citing our work:

@article{chen2025rm,
  title={{RM-R1}: Reward Modeling as Reasoning},
  author={Chen, Xiusi and Li, Gaotang and Wang, Ziqi and Jin, Bowen and Qian, Cheng and Wang, Yu and Wang, Hongru and Zhang, Yu and Zhang, Denghui and Zhang, Tong and others},
  journal={arXiv preprint arXiv:2505.02387},
  year={2025}
}