OpenAlex · Updated hourly · Last updated: 02.04.2026, 02:29

This is an overview page with metadata for this scholarly work. The full article is available from the publisher.

DRIFT: Difference-Aware Reinforcement Through Iterative Fine-Tuning for Language Model

2026 · 0 citations · Proceedings of the AAAI Conference on Artificial Intelligence · Open Access

Citations: 0 · Authors: 3 · Year: 2026

Abstract

Self-play fine-tuning has emerged as a promising approach to improve Large Language Models (LLMs) without additional human annotations. However, existing methods struggle with complex generation tasks requiring long context understanding, where models produce partially correct outputs interleaved with errors. Traditional approaches train on entire sequences uniformly, failing to distinguish between well-predicted and erroneous regions, leading to diluted learning signals and slow convergence. We propose DRIFT (Difference-aware Reinforcement through Iterative Fine-Tuning), a novel self-play framework that selectively trains on prediction differences. DRIFT introduces two key innovations: (1) Difference-Aware Masking (DAM) that identifies and masks common subsequences between model outputs and ground truth, focusing training exclusively on error regions; (2) Occurrence-Aware Loss (OAL) that provides position-invariant vocabulary supervision, complementing the position-sensitive adversarial loss. This dual mechanism enables models to correct both positional and lexical errors effectively. Theoretically, we prove that DRIFT converges when masked distributions align. Empirically, we evaluate DRIFT on diverse summarization benchmarks using Qwen2.5-3B and LLaMA-3.1-8B models. Results show that DRIFT significantly outperforms both supervised fine-tuning (SFT) and self-play fine-tuning (SPIN), achieving up to 16% improvement on SAMSum dialogue summarization tasks while maintaining general capabilities. Notably, DRIFT breaks the performance ceiling of continued SFT and demonstrates superior efficiency compared to holistic self-play methods, validating that targeted optimization on prediction differences is crucial for structured text generation tasks.
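The abstract does not spell out how DAM identifies common subsequences, so the following is only an illustrative sketch: it uses longest-common-subsequence matching (via Python's standard `difflib`) to build a 0/1 loss mask over the model's output tokens, zeroing out regions that already agree with the reference so that training loss falls only on the error regions. The function name and token-level granularity are assumptions, not the paper's implementation.

```python
import difflib

def difference_aware_mask(output_tokens, reference_tokens):
    """Hypothetical DAM-style mask: 1 = token differs from the
    reference (keep loss), 0 = token lies in a common subsequence
    shared with the reference (exclude from loss)."""
    mask = [1] * len(output_tokens)
    matcher = difflib.SequenceMatcher(a=output_tokens, b=reference_tokens)
    # get_matching_blocks() returns the common-subsequence spans.
    for block in matcher.get_matching_blocks():
        for i in range(block.a, block.a + block.size):
            mask[i] = 0  # well-predicted region: no training signal
    return mask

# Only the erroneous middle tokens keep loss weight.
out = ["The", "cat", "sat", "on", "a", "rug"]
ref = ["The", "cat", "sat", "on", "the", "mat"]
print(difference_aware_mask(out, ref))  # → [0, 0, 0, 0, 1, 1]
```

In an actual training loop, a mask like this would be multiplied element-wise into the per-token loss, concentrating gradient updates on the differing spans, which is the targeted-optimization idea the abstract describes.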

Topics

Topic Modeling · Artificial Intelligence in Healthcare and Education · Multimodal Machine Learning Applications