Dies ist eine Übersichtsseite mit Metadaten zu dieser wissenschaftlichen Arbeit. Der vollständige Artikel ist beim Verlag verfügbar.
Adversarial Prompting and Jailbreak Attacks in Large Language Models: Detection and Mitigation Strategies
0
Zitationen
3
Autoren
2025
Jahr
Abstract
The rapid integration of Large Language Models (LLMs) such as GPT-4 and similar architectures into real-world applications has revolutionized natural language understanding, but it has also unveiled pressing security challenges. Among the most critical are adversarial prompting and jailbreak attacks, wherein attackers craft subtle or cleverly engineered inputs to exploit model loopholes, bypassing built-in safeguards to elicit unintended, inappropriate, or harmful outputs. Addressing these evolving threats demands solutions that are not only accurate but adaptive and comprehensive. In this study, we introduce JBSHIELD—a unified detection and mitigation framework that leverages both supervised classification and unsupervised anomaly detection techniques. The proposed system employs multi-stage feature extraction and ensemble decision-making to identify a broad spectrum of hostile prompts in real time. Our mitigation stack consists of prompt filtering, robust response sanitization, and dynamic policy enforcement, working together to dramatically reduce successful attack rates. Through rigorous experimentation on a synthetic benchmark containing 5,000 diverse prompts, including 2,000 maliciously crafted examples, JBSHIELD achieves a detection accuracy of 96.4% and reduces the attack success rate from a baseline of 45% to just 5%. These results highlight the framework’s potential as a proactive defense for next-generation AI applications, promoting safer and more trustworthy deployment of LLMs in sensitive domains.
Ähnliche Arbeiten
Rethinking the Inception Architecture for Computer Vision
2016 · 30.396 Zit.
MobileNetV2: Inverted Residuals and Linear Bottlenecks
2018 · 24.505 Zit.
CBAM: Convolutional Block Attention Module
2018 · 21.400 Zit.
An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale
2020 · 21.334 Zit.
Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification
2015 · 18.524 Zit.