This is an overview page with metadata for this scientific paper. The full article is available from the publisher.
DRIFT: Difference-Aware Reinforcement Through Iterative Fine-Tuning for Language Models
Citations: 0 · Authors: 3 · Year: 2026
Abstract
Self-play fine-tuning has emerged as a promising approach to improve Large Language Models (LLMs) without additional human annotations. However, existing methods struggle with complex generation tasks requiring long context understanding, where models produce partially correct outputs interleaved with errors. Traditional approaches train on entire sequences uniformly, failing to distinguish between well-predicted and erroneous regions, leading to diluted learning signals and slow convergence. We propose DRIFT (Difference-aware Reinforcement through Iterative Fine-Tuning), a novel self-play framework that selectively trains on prediction differences. DRIFT introduces two key innovations: (1) Difference-Aware Masking (DAM) that identifies and masks common subsequences between model outputs and ground truth, focusing training exclusively on error regions; (2) Occurrence-Aware Loss (OAL) that provides position-invariant vocabulary supervision, complementing the position-sensitive adversarial loss. This dual mechanism enables models to correct both positional and lexical errors effectively. Theoretically, we prove that DRIFT converges when masked distributions align. Empirically, we evaluate DRIFT on diverse summarization benchmarks using Qwen2.5-3B and LLaMA-3.1-8B models. Results show that DRIFT significantly outperforms both supervised fine-tuning (SFT) and self-play fine-tuning (SPIN), achieving up to 16% improvement on SAMSum dialogue summarization tasks while maintaining general capabilities. Notably, DRIFT breaks the performance ceiling of continued SFT and demonstrates superior efficiency compared to holistic self-play methods, validating that targeted optimization on prediction differences is crucial for structured text generation tasks.
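The two mechanisms described in the abstract can be sketched at the token level. The following is a minimal illustrative sketch, not the paper's implementation: it assumes sequences are compared as token lists, uses longest-common-subsequence-style matching (via Python's `difflib.SequenceMatcher`) to stand in for Difference-Aware Masking, and uses a simple bag-of-tokens count discrepancy to stand in for the position-invariant supervision idea behind the Occurrence-Aware Loss. The actual loss formulations in the paper may differ substantially.

```python
from collections import Counter
from difflib import SequenceMatcher

def difference_aware_mask(pred_tokens, ref_tokens):
    """Sketch of DAM: mark reference positions that are NOT part of a
    subsequence shared with the model's prediction. Only positions with
    mask value True would contribute to the training loss."""
    mask = [True] * len(ref_tokens)  # True = error region, train on it
    sm = SequenceMatcher(a=pred_tokens, b=ref_tokens, autojunk=False)
    for block in sm.get_matching_blocks():
        # Tokens inside a matching block are already well-predicted,
        # so they are masked out (excluded from the loss).
        for j in range(block.b, block.b + block.size):
            mask[j] = False
    return mask

def occurrence_loss(pred_tokens, ref_tokens):
    """Sketch of the OAL idea: a position-invariant penalty comparing
    how often each vocabulary item occurs in prediction vs. reference,
    regardless of where the tokens appear in the sequence."""
    p, r = Counter(pred_tokens), Counter(ref_tokens)
    vocab = set(p) | set(r)
    return sum(abs(p[t] - r[t]) for t in vocab)

# Example: prediction and reference agree on a prefix but diverge at the end.
pred = ["the", "cat", "sat", "on", "a", "rug"]
ref  = ["the", "cat", "sat", "on", "the", "mat"]
print(difference_aware_mask(pred, ref))  # only the divergent tail is trained
print(occurrence_loss(pred, ref))
```

In this toy example, the shared prefix "the cat sat on" is masked out, so the loss would focus only on the erroneous tail, which is the selective-training intuition the abstract attributes to DRIFT.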