OpenAlex · Updated hourly · Last updated: 04.05.2026, 19:24

This is an overview page with metadata for this scholarly work. The full article is available from the publisher.

LLMs in Debate: Does Arguing Make Them Better at Detecting Metamorphic Relations?

2025 · 0 citations
Open full text at the publisher

0 citations

3 authors

Year: 2025

Abstract

Large Language Models (LLMs) are transforming software engineering, including mobile Augmented Reality (AR) applications. AR software behavior often depends on dynamic environmental factors, making it difficult to use conventional testing and verification approaches. Metamorphic Testing (MT) offers an alternative by assessing whether expected transformations hold across varied conditions. However, there is limited work exploring how well LLMs can detect these transformations, known as Metamorphic Relations (MRs), in applications. We propose a stability-driven evaluation framework that examines whether LLMs consistently apply MRs across rephrasings. Our study finds that StarCoder and CodeLlama exhibit higher stability in MR identification compared to the general-purpose model Gemma. Additionally, we use a multi-agent debate framework to investigate whether combining multiple perspectives improves consistency in MR identification. The debate mechanism reduces MR inconsistencies, leading to more stable MR identification across all MRs. While debate helps stabilize MR identification, our evaluation against human-labeled ground truth reveals that stability alone does not always correlate with correctness. Some models maintain stable yet incorrect predictions (CodeLlama), whereas debate enhances both consistency and correctness alignment, making LLM reasoning more reliable. This work contributes a method to evaluate LLMs in the absence of ground truth, establishing stability as a metric for assessing model reliability. Applying a multi-agent debate framework offers a promising approach to enhancing LLM reliability, especially in contexts where the ground truth is elusive.
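The stability notion described in the abstract can be sketched with a small example. The function below is an illustrative assumption, not the paper's actual metric: it scores one MR by the fraction of prompt rephrasings on which a model's prediction agrees with the majority answer, so 1.0 means fully consistent responses.

```python
from collections import Counter

def stability(predictions):
    """Agreement rate with the majority label across rephrasings.

    `predictions` holds the MR labels (e.g. "holds" / "violated") a model
    returned for semantically equivalent rephrasings of the same prompt.
    A score of 1.0 means the model answered identically every time;
    note that a stably wrong model also scores 1.0, which is why the
    paper checks stability against ground truth separately.
    """
    counts = Counter(predictions)
    majority_count = counts.most_common(1)[0][1]
    return majority_count / len(predictions)

# Hypothetical model outputs for one MR probed with four rephrasings:
print(stability(["holds", "holds", "holds", "violated"]))  # 0.75
print(stability(["holds", "holds", "holds", "holds"]))     # 1.0
```

A consistently-wrong model (stable but incorrect, as observed for CodeLlama) would still score 1.0 here, illustrating why stability alone does not imply correctness.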

Topics

Software Engineering Research · Software Testing and Debugging Techniques · Artificial Intelligence in Healthcare and Education