OpenAlex · Updated hourly · Last updated: 01.04.2026, 05:24

This is an overview page with metadata for this scholarly work. The full article is available from the publisher.

Benchmarking proprietary and open-source language and vision-language models for gastroenterology clinical reasoning

2025 · 1 citation · npj Digital Medicine · Open Access

Citations: 1 · Authors: 18 · Year: 2025

Abstract

This study evaluated the effectiveness of large language models (LLMs) and vision-language models (VLMs) in gastroenterology. We used board-style multiple-choice questions to assess the performance of both proprietary and open-source LLMs and VLMs (including GPT, Claude, Gemini, Mistral, Llama, Mixtral, Phi, and Qwen) across different interfaces, computing environments, and levels of compression (quantization). Among the proprietary models, o1-preview (82.0%) and Claude3.5-Sonnet (74.0%) had the highest accuracy, outperforming the top open-source models, Llama3.3-70b (65.7%) and Qwen-2.5-72b (61.0%). Among the small quantized open-source models, the 8-bit Llama 3.2-11b (51.7%) and 6-bit Phi3-14b (48.7%) performed best, with scores comparable to their full-precision counterparts. Notably, VLM accuracy on image-containing questions improved (~10%) when human-generated captions were provided, remained unchanged with the original images, and declined with LLM-generated captions. Further research is warranted to evaluate model capabilities in real-world clinical decision-making scenarios.

Similar works