Dies ist eine Übersichtsseite mit Metadaten zu dieser wissenschaftlichen Arbeit. Der vollständige Artikel ist beim Verlag verfügbar.

Adversarial Prompting and Jailbreak Attacks in Large Language Models: Detection and Mitigation Strategies

2025·0 Zitationen

Volltext beim Verlag öffnen

Zitationen

Autoren

2025

Jahr

Abstract

The rapid integration of Large Language Models (LLMs) such as GPT-4 and similar architectures into real-world applications has revolutionized natural language understanding, but it has also unveiled pressing security challenges. Among the most critical are adversarial prompting and jailbreak attacks, wherein attackers craft subtle or cleverly engineered inputs to exploit model loopholes, bypassing built-in safeguards to elicit unintended, inappropriate, or harmful outputs. Addressing these evolving threats demands solutions that are not only accurate but adaptive and comprehensive. In this study, we introduce JBSHIELD—a unified detection and mitigation framework that leverages both supervised classification and unsupervised anomaly detection techniques. The proposed system employs multi-stage feature extraction and ensemble decision-making to identify a broad spectrum of hostile prompts in real time. Our mitigation stack consists of prompt filtering, robust response sanitization, and dynamic policy enforcement, working together to dramatically reduce successful attack rates. Through rigorous experimentation on a synthetic benchmark containing 5,000 diverse prompts, including 2,000 maliciously crafted examples, JBSHIELD achieves a detection accuracy of 96.4% and reduces the attack success rate from a baseline of 45% to just 5%. These results highlight the framework’s potential as a proactive defense for next-generation AI applications, promoting safer and more trustworthy deployment of LLMs in sensitive domains.

Autoren

Institutionen

K.S. Hegde Hospital(IN)

Themen

Adversarial Robustness in Machine LearningArtificial Intelligence in Healthcare and EducationTopic Modeling

Volltext beim Verlag öffnen

Adversarial Prompting and Jailbreak Attacks in Large Language Models: Detection and Mitigation Strategies

Abstract

Ähnliche Arbeiten

Autoren

Institutionen

Themen