This is an overview page with metadata about this scientific work. The full article is available from the publisher.
Abstract 2748: Arkangel AI, OpenEvidence, ChatGPT, Medisearch: Are they objectively up to medical standards? A real-life assessment of LLMs in healthcare.
Citations: 0
Authors: 6
Year: 2026
Abstract
Background: Large language models (LLMs) are increasingly used in healthcare, but standardized benchmarks fail to capture their validity and safety in real-world scenarios. Evaluating their quality is critical for safe integration into practice.

Methods: Four fictitious clinical vignettes were developed by independent specialists and tested in four conversational agents: ArkangelAI, OpenEvidence, ChatGPT, and Medisearch. Each vignette included four questions. Responses were evaluated by four external clinicians using an eight-criterion Likert scale: 1-2 = dissatisfaction, 3 = neutral, 4-5 = satisfaction, 6 = not applicable. The criteria covered correctness, consensus, bias, standard of care, updated information, patient safety, real sources in references, and context-awareness. Response times were summarized as medians with interquartile ranges (IQR), results were reported as frequencies, and hypothesis tests were applied (α = 0.05).

Results: There were 128 question-answer pairs. ArkangelAI-Deep had the highest satisfaction (92.9%), followed by OpenEvidence (83.6%), ChatGPT-Deep (80.5%), and Medisearch (71.1%). Most dissatisfaction concerned the real-source-of-references criterion: GPT-Personalized 75%, GPT-Regular 97%. Conversely, ArkangelAI-Deep, ChatGPT-Deep, and OpenEvidence obtained 100% satisfaction on that criterion. All agents performed well in correctness and agreement with the consensus. ChatGPT scored lowest on non-biased answers. The safest for patients was GPT-Personalized, followed by ArkangelAI-Deep. Medisearch had the fastest response time (18 s), while GPT-Deep (13 min) and ArkangelAI-Deep (7.4 min) were slowest, showing a trade-off between depth and usability.

Conclusions: ArkangelAI-Deep and OpenEvidence consistently outperformed the others, while Medisearch and GPT-Regular showed significant limitations. These results underscore the need for standardized evaluation frameworks to ensure the safe use of LLMs in healthcare.
Citation Format: Natalia Castano-Villegas, Maria Camila Villa, Katherine Monsalve, Isabella Llano, Laura Velásquez, Jose Zea. Arkangel AI, OpenEvidence, ChatGPT, Medisearch: Are they objectively up to medical standards? A real-life assessment of LLMs in healthcare [abstract]. In: Proceedings of the American Association for Cancer Research Annual Meeting 2026; Part 1 (Regular Abstracts); 2026 Apr 17-22; San Diego, CA. Philadelphia (PA): AACR; Cancer Res 2026;86(7 Suppl):Abstract nr 2748.
Similar works
Explainable Artificial Intelligence (XAI): Concepts, taxonomies, opportunities and challenges toward responsible AI
2019 · 8,626 citations
Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead
2019 · 8,532 citations
High-performance medicine: the convergence of human and artificial intelligence
2018 · 8,046 citations
BioBERT: a pre-trained biomedical language representation model for biomedical text mining
2019 · 6,843 citations
Proceedings of the 19th International Joint Conference on Artificial Intelligence
2005 · 5,781 citations