OpenAlex · Updated hourly · Last updated: 27.04.2026, 12:15

This is an overview page with metadata about this scholarly work. The full article is available from the publisher.

Muteract: Interactive and Iterative Prompt Mutation Interface for LLM Developers and Evaluators

2025 · 0 citations · Open Access
Open full text at publisher

Citations: 0 · Authors: 3 · Year: 2025

Abstract

Large Language Models (LLMs) are next-token predictors trained on massive datasets, yet their use is often restricted to interaction within pristine environments and controlled contexts. While prompt-driven natural language response generation has received growing attention, little work examines how adversarial mutations of prompts affect LLM responses. In real-world scenarios, adversarial inputs can be used to deceive a model and elicit questionable responses. Most existing work on adversarial inputs takes algorithmic, system-centric approaches rather than capturing critical aspects of human experience and interaction. To address this gap, we introduce Muteract, a human-in-the-loop, interactive, and iterative prompt mutation interface that lets LLM developers and evaluators apply byte-level data mutations (which are hard to produce manually) to input prompts and analyse variations in responses across modalities such as text, audio, and image. Because perturbations operate at the byte level, adversaries can largely be generated from a single interface regardless of the input modality. We implemented Muteract and used it to interact with a state-of-the-art closed-source LLM, gpt-4o-mini. We sampled 116 natural language (text) prompts from the 738 available in the AdvGLUE developer dataset for classification tasks, demonstrating Muteract's potential to deceive models and elicit significantly dissimilar (text) responses, reducing task-specific model accuracy by 15-30 percentage points. We then conducted a pilot study with 26 participants using gpt-4.1, in which the task was to prompt the model into responses that violate OpenAI's Usage Policy; 12 participants succeeded within three successive mutations using Muteract. This work demonstrates Muteract's adversarial capabilities for LLM developers and evaluators. It offers potential use cases for assessing model robustness to noise during training and for supporting HCI research, particularly in evaluating resilience to adversarial inputs and aiding red-teaming efforts.


Topics

Adversarial Robustness in Machine Learning · Topic Modeling · Artificial Intelligence in Healthcare and Education