arxiv:2606.09076

Beyond Scalar Rewards by Internalizing Reasoning into Score Distributions

Published on Jun 8 · Submitted by Huanqia Cai on Jun 11

Tongyi-MAI

Authors: Xin Jin, Huanqia Cai, Dengyang Jiang

Abstract

A teacher-student framework decouples complex reasoning from efficient reward deployment in text-to-image training, achieving superior preference accuracy and optimization performance.

Generated by Qwen/Qwen2.5-Coder-32B-Instruct

Reward models are central to text-to-image post-training, but visual preference is subjective and better represented as a distribution over rubric scores than as a deterministic scalar. Existing scalar, score-token, and pairwise reward models over-compress uncertainty and fine-grained score differences, while reasoning-based generative rewards provide stronger judgments but are costly to deploy and difficult to use as direct optimization signals. We propose Z-Reward, a teacher-student reward modeling framework that decouples reasoning-heavy judgment from efficient reward deployment. The teacher is a large VLM that uses reasoning to infer rubric-aligned score distributions, and is trained with Group-wise Direct Score Optimization (GDSO), which combines policy-gradient rewards from distribution expectations with direct pointwise and pairwise supervision on score distributions and score gaps. The student is trained with Reasoning-Internalized Score Distillation (RISD), which transfers the teacher's reasoning-conditioned score distribution into a compact VLM without requiring explicit reasoning chains at inference time. On our internally annotated evaluation set, the 27B GDSO teacher reaches 89.6% human preference accuracy, outperforming SFT, RewardDance, and GRPO, while the 9B RISD student reaches 88.6%, outperforming the OPD baseline and closely matching the larger teacher. We further show that Z-Reward can serve as a differentiable reward signal for text-to-image optimization, yielding a 41.3% net human-preference improvement over the SFT baseline.
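To make the "distribution over rubric scores" idea concrete, here is a minimal Python sketch that turns score logits into a probability distribution, uses its expectation as a scalar reward, and measures pairwise agreement with human preferences. The 5-point rubric, the logit values, and the helper names (`softmax`, `expected_score`, `pairwise_accuracy`) are illustrative assumptions, not details taken from the paper.

```python
import math

# Hypothetical 5-point rubric; the abstract does not state the actual score range.
RUBRIC_SCORES = [1, 2, 3, 4, 5]

def softmax(logits):
    """Convert raw score logits into a probability distribution."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def expected_score(probs, scores):
    """Expectation of the score distribution, used here as the scalar reward."""
    return sum(p * s for p, s in zip(probs, scores))

def pairwise_accuracy(pairs):
    """Fraction of (reward_a, reward_b, human_prefers_a) triples where the
    reward ordering agrees with the human label. Ties count as predicting b."""
    correct = sum((ra > rb) == prefers_a for ra, rb, prefers_a in pairs)
    return correct / len(pairs)

# Made-up logits for two candidate images under the same prompt.
logits_a = [0.1, 0.3, 1.2, 2.0, 0.5]   # mass concentrated near score 4
logits_b = [1.5, 1.0, 0.4, 0.2, 0.1]   # mass concentrated near score 1

reward_a = expected_score(softmax(logits_a), RUBRIC_SCORES)
reward_b = expected_score(softmax(logits_b), RUBRIC_SCORES)
```

Because the expectation is a weighted sum of softmax probabilities, it varies smoothly with the logits. That is one plausible reading of how a distribution-based reward can double as a differentiable optimization signal, though the paper's exact formulation may differ.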

Community

Orion-Cai (paper author, paper submitter) · 1 day ago

Tech report.

Z-Reward is a reasoning-internalized teacher-student reward modeling framework for visual generation, developed by the Z-Image Team for ⚡-Image.

Z-Reward decouples reasoning-heavy judgment from efficient reward deployment:

🧠 The Teacher (27B): A large VLM that uses reasoning to infer rubric-aligned score distributions. Trained with Group-wise Direct Score Optimization (GDSO), it reaches 89.6% human preference accuracy on our internally annotated evaluation set.
⚡ The Student (9B): Trained with Reasoning-Internalized Score Distillation (RISD), it internalizes the teacher’s reasoning-conditioned score distribution into a compact model. It reaches 88.6% accuracy (outperforming the OPD baseline) without needing explicit reasoning chains at inference time, enabling efficient direct scoring and gradient backpropagation.
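The teacher-to-student transfer of score distributions can be sketched in Python as below. This summary does not spell out the RISD objective, so the sketch assumes a generic distillation loss, the KL divergence from the teacher's score distribution to the student's. The function name `kl_divergence` and all probability values are hypothetical.

```python
import math

def softmax(logits):
    """Convert raw score logits into a probability distribution."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(teacher_probs, student_probs, eps=1e-12):
    """KL(teacher || student) over rubric scores.
    Assumed stand-in for a distillation loss, not the paper's RISD objective."""
    return sum(
        t * math.log((t + eps) / (s + eps))
        for t, s in zip(teacher_probs, student_probs)
        if t > 0
    )

# Made-up teacher distribution, imagined as produced after explicit reasoning.
teacher = [0.05, 0.10, 0.20, 0.50, 0.15]

# Two hypothetical student outputs produced directly, with no reasoning chain.
student_untrained = softmax([0.0, 0.0, 0.0, 0.0, 0.0])  # uniform over scores
student_trained = [0.06, 0.10, 0.20, 0.49, 0.15]        # close to the teacher

loss_untrained = kl_divergence(teacher, student_untrained)
loss_trained = kl_divergence(teacher, student_trained)
```

Under this assumed loss, a student whose direct output matches the teacher's reasoning-conditioned distribution incurs near-zero loss. At inference time the student would then emit the distribution in a single forward pass.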

We further show that Z-Reward can serve as a differentiable reward signal for text-to-image optimization, achieving a 41.3% net human-preference improvement over the SFT baseline.