This is an overview page with metadata for this scientific work. The full article is available from the publisher.
Benchmarking LLMs and Prompt Engineering Strategies for Consensus and Frontier Knowledge in Microsatellite Instability Cancers (Preprint)
Citations: 0
Authors: 7
Year: 2025
Abstract
BACKGROUND: The reliability of general-purpose Large Language Models (LLMs) for complex clinical tasks in specialized domains such as microsatellite instability (MSI) cancers remains critically uncharacterized. The absence of a domain-specific benchmark to evaluate and guide the optimization of their capabilities across diverse clinical tasks poses unevaluated risks to patient safety.

OBJECTIVE: The primary objective was to develop and validate MSIC-Bench, a novel two-tiered benchmark for MSI cancer covering both consensus and frontier knowledge. Using this framework, we aimed to systematically assess LLM performance across various prompting strategies, identify task-specific weaknesses, and reveal effective pathways for performance improvement.

METHODS: We developed MSIC-Bench, a 500-question benchmark derived from clinical guidelines and a curated knowledge base. Three state-of-the-art LLMs (GPT-4o, Gemini 2.5 Pro, and Claude Opus 4) were evaluated using four prompting strategies (vanilla, Chain-of-Thought (CoT), Reflection of Thoughts (RoT), and Retrieval-Augmented Generation (RAG)) under both multiple-choice and open-ended modalities. Performance was assessed on accuracy, safety (honesty), error composition, and token usage.

RESULTS: A significant 'scaffolding effect' was observed, with average LLM accuracy dropping from 89.81% in multiple-choice formats to 76.56% in open-ended scenarios. Task-specific analysis revealed that this decline was most pronounced in complex therapeutic decision-making tasks. Error analysis attributed failures in non-RAG models primarily to insufficient domain knowledge (55.51% of errors), manifesting as a high frequency of unsafe fabrication. The integration of RAG proved highly effective, substantially improving accuracy in these critical tasks (e.g., boosting Claude's performance from 76.8% to 90.4%) and inducing a crucial shift towards safety by increasing explicit statements of uncertainty (from 6.70% to 16.55% on average, and up to 75% in specific cases). Notably, these gains were achieved with significantly lower token usage (on average for GPT-4o, RAG: 115 tokens vs. CoT: 398 tokens and RoT: 613 tokens).

CONCLUSIONS: Our comprehensive evaluation reveals that LLMs lack the specialized domain knowledge required for complex MSI cancer-related tasks, rather than suffering from reasoning deficits. Prompting strategies substantially influence LLM accuracy, safety, and token usage, with RAG emerging as the most effective and reliable method for improving both accuracy and safety. Ultimately, MSIC-Bench not only provides a comprehensive resource for the systematic evaluation and optimization of LLMs in the MSI cancer domain; its two-tiered design also offers a replicable blueprint for developing similar benchmarks in other knowledge-intensive medical fields.
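The METHODS section above describes a 3 × 4 × 2 evaluation grid (models × prompting strategies × answer modalities). The snippet below is a minimal, hypothetical Python sketch of how such a grid evaluation could be organized; the benchmark items are toy stand-ins for the 500 MSIC-Bench questions, and query_model is a placeholder, not the authors' actual harness or any vendor API.

```python
from itertools import product

# Evaluation grid mirroring the abstract's setup; names are illustrative only.
MODELS = ["GPT-4o", "Gemini 2.5 Pro", "Claude Opus 4"]
STRATEGIES = ["vanilla", "CoT", "RoT", "RAG"]
MODALITIES = ["multiple-choice", "open-ended"]

# Toy stand-in for MSIC-Bench items; the real benchmark contains 500 questions.
BENCHMARK = [
    {"question": "Example MSI question 1", "answer": "A"},
    {"question": "Example MSI question 2", "answer": "B"},
]

def query_model(model: str, strategy: str, modality: str, question: str) -> str:
    """Placeholder for an actual LLM call (e.g., via a vendor SDK).
    Returns a fixed answer so the sketch runs end to end."""
    return "A"

def accuracy(model: str, strategy: str, modality: str) -> float:
    # Fraction of benchmark items answered correctly under this configuration.
    correct = sum(
        query_model(model, strategy, modality, item["question"]) == item["answer"]
        for item in BENCHMARK
    )
    return correct / len(BENCHMARK)

if __name__ == "__main__":
    for model, strategy, modality in product(MODELS, STRATEGIES, MODALITIES):
        print(f"{model:15s} | {strategy:7s} | {modality:15s} | "
              f"accuracy={accuracy(model, strategy, modality):.2%}")
```

In the paper's actual setting, each configuration would additionally log safety (honesty) judgments, error categories, and token usage per response, which this sketch omits for brevity.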
Related Works
Explainable Artificial Intelligence (XAI): Concepts, taxonomies, opportunities and challenges toward responsible AI
2019 · 8,339 citations
Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead
2019 · 8,211 citations
High-performance medicine: the convergence of human and artificial intelligence
2018 · 7,614 citations
Proceedings of the 19th International Joint Conference on Artificial Intelligence
2005 · 5,776 citations
Peeking Inside the Black-Box: A Survey on Explainable Artificial Intelligence (XAI)
2018 · 5,478 citations