OpenAlex · Updated hourly · Last updated: 01.04.2026, 05:24

This is an overview page with metadata for this scholarly work. The full article is available from the publisher.

Benchmarking proprietary and open-source language and vision-language models for gastroenterology clinical reasoning

2025 · 1 citation · npj Digital Medicine · Open Access

Citations: 1 · Authors: 18 · Year: 2025

Abstract

This study evaluated the effectiveness of large language models (LLMs) and vision-language models (VLMs) in gastroenterology. We used board-style multiple-choice questions to assess the performance of both proprietary and open-source LLMs and VLMs (including GPT, Claude, Gemini, Mistral, Llama, Mixtral, Phi, and Qwen) across different interfaces, computing environments, and levels of compression (quantization). Among the proprietary models, o1-preview (82.0%) and Claude3.5-Sonnet (74.0%) had the highest accuracy, outperforming the top open-source models, Llama3.3-70b (65.7%) and Qwen-2.5-72b (61.0%). Among the small quantized open-source models, the 8-bit Llama 3.2-11b (51.7%) and 6-bit Phi3-14b (48.7%) performed best, with scores comparable to their full-precision counterparts. Notably, VLM accuracy on image-containing questions improved (~10%) when human-generated captions were provided, remained unchanged with the original images, and declined with LLM-generated captions. Further research is warranted to evaluate model capabilities in real-world clinical decision-making scenarios.

Similar works