This is an overview page with metadata for this scientific work. The full article is available from the publisher.
Benchmarking LLMs and Prompt Engineering Strategies for Consensus and Frontier Knowledge in Microsatellite Instability Cancers (Preprint)
Citations: 0
Authors: 7
Year: 2025
Abstract
BACKGROUND: The reliability of general-purpose Large Language Models (LLMs) for complex clinical tasks in specialized domains such as microsatellite instability (MSI) cancers remains critically uncharacterized. The absence of a domain-specific benchmark to evaluate and guide the optimization of their capabilities across diverse clinical tasks poses unevaluated risks to patient safety.

OBJECTIVE: The primary objective was to develop and validate MSIC-Bench, a novel two-tiered benchmark for MSI cancer covering both consensus and frontier knowledge. Using this framework, we aimed to systematically assess LLM performance across various prompting strategies, identify task-specific weaknesses, and reveal effective pathways for performance improvement.

METHODS: We developed MSIC-Bench, a 500-question benchmark derived from clinical guidelines and a curated knowledge base. Three state-of-the-art LLMs (GPT-4o, Gemini 2.5 Pro, and Claude Opus 4) were evaluated using four prompting strategies (vanilla, Chain-of-Thought (CoT), Reflection of Thoughts (RoT), and Retrieval-Augmented Generation (RAG)) under both multiple-choice and open-ended modalities. Performance was assessed on accuracy, safety (honesty), error composition, and token usage.

RESULTS: A significant 'scaffolding effect' was observed, with average LLM accuracy dropping from 89.81% in multiple-choice formats to 76.56% in open-ended scenarios. Task-specific analysis revealed that this decline was most pronounced in complex therapeutic decision-making tasks. Error analysis attributed failures in non-RAG models primarily to insufficient domain knowledge (55.51% of errors), manifesting as a high frequency of unsafe fabrication. The integration of RAG proved highly effective, substantially improving accuracy in these critical tasks (e.g., boosting Claude's performance from 76.8% to 90.4%) and inducing a crucial shift towards safety by increasing explicit statements of uncertainty (from 6.70% to 16.55% on average, and up to 75% in specific cases). Notably, these gains were achieved with significantly lower token usage (on average for GPT-4o, RAG: 115 tokens vs. CoT: 398 tokens and RoT: 613 tokens).

CONCLUSIONS: Our comprehensive evaluation reveals that LLMs lack the specialized domain knowledge required for complex MSI cancer-related tasks, rather than suffering from reasoning deficits. Prompting strategies substantially influence LLM accuracy, safety, and token usage, with RAG emerging as the most effective and reliable method for improving both accuracy and safety. Ultimately, MSIC-Bench not only provides a comprehensive resource for the systematic evaluation and optimization of LLMs in the MSI cancer domain; its two-tiered design also offers a replicable blueprint for developing similar benchmarks in other knowledge-intensive medical fields.
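The METHODS section above describes a 3 × 4 × 2 evaluation grid (models × prompting strategies × answer modalities). The snippet below is a minimal, hypothetical Python sketch of how such a grid evaluation could be organized; the benchmark items are toy stand-ins for the 500 MSIC-Bench questions, and query_model is a placeholder, not the authors' actual harness or any vendor API.

```python
from itertools import product

# Evaluation grid mirroring the abstract's setup; names are illustrative only.
MODELS = ["GPT-4o", "Gemini 2.5 Pro", "Claude Opus 4"]
STRATEGIES = ["vanilla", "CoT", "RoT", "RAG"]
MODALITIES = ["multiple-choice", "open-ended"]

# Toy stand-in for MSIC-Bench items; the real benchmark contains 500 questions.
BENCHMARK = [
    {"question": "Example MSI question 1", "answer": "A"},
    {"question": "Example MSI question 2", "answer": "B"},
]

def query_model(model: str, strategy: str, modality: str, question: str) -> str:
    """Placeholder for an actual LLM call (e.g., via a vendor SDK).
    Returns a fixed answer so the sketch runs end to end."""
    return "A"

def accuracy(model: str, strategy: str, modality: str) -> float:
    # Fraction of benchmark items answered correctly under this configuration.
    correct = sum(
        query_model(model, strategy, modality, item["question"]) == item["answer"]
        for item in BENCHMARK
    )
    return correct / len(BENCHMARK)

if __name__ == "__main__":
    for model, strategy, modality in product(MODELS, STRATEGIES, MODALITIES):
        print(f"{model:15s} | {strategy:7s} | {modality:15s} | "
              f"accuracy={accuracy(model, strategy, modality):.2%}")
```

In the paper's actual setting, each configuration would additionally log safety (honesty) judgments, error categories, and token usage per response, which this sketch omits for brevity.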
Related Works
Explainable Artificial Intelligence (XAI): Concepts, taxonomies, opportunities and challenges toward responsible AI
2019 · 8,339 citations
Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead
2019 · 8,211 citations
High-performance medicine: the convergence of human and artificial intelligence
2018 · 7,614 citations
Proceedings of the 19th International Joint Conference on Artificial Intelligence
2005 · 5,776 citations
Peeking Inside the Black-Box: A Survey on Explainable Artificial Intelligence (XAI)
2018 · 5,478 citations