This is an overview page with metadata about this scientific work. The full article is available from the publisher.
Abstract 2748: Arkangel AI, OpenEvidence, ChatGPT, Medisearch: Are they objectively up to medical standards? A real-life assessment of LLMs in healthcare.
Citations: 0
Authors: 6
Year: 2026
Abstract
Background: Large language models (LLMs) are increasingly used in healthcare, but standardized benchmarks fail to capture their validity and safety in real-world scenarios. Evaluating their quality is critical for safe integration into practice.

Methods: Four fictitious clinical vignettes were developed by independent specialists and tested in four conversational agents: ArkangelAI, OpenEvidence, ChatGPT, and Medisearch. Each vignette included four questions. Responses were evaluated by four external clinicians using an eight-criterion Likert scale: 1-2 = dissatisfaction, 3 = neutral, 4-5 = satisfaction, 6 = not applicable. The criteria covered correctness, consensus, bias, standard of care, updated information, patient safety, real sources in references, and context-awareness. Response times were summarized as medians with interquartile ranges (IQR), results were reported as frequencies, and hypothesis tests were applied (α = 0.05).

Results: There were 128 question-answer pairs. ArkangelAI-Deep had the highest satisfaction (92.9%), followed by OpenEvidence (83.6%), ChatGPT-Deep (80.5%), and Medisearch (71.1%). Most dissatisfaction concerned the real-source-of-references criterion: GPT-Personalized 75%, GPT-Regular 97%. Conversely, ArkangelAI-Deep, ChatGPT-Deep, and OpenEvidence obtained 100% satisfaction on that criterion. All agents performed well in correctness and agreement with the consensus. ChatGPT scored lowest on non-biased answers. The safest for patients was GPT-Personalized, followed by ArkangelAI-Deep. Medisearch had the fastest response time (18 s), while GPT-Deep (13 min) and ArkangelAI-Deep (7.4 min) were slowest, showing a trade-off between depth and usability.

Conclusions: ArkangelAI-Deep and OpenEvidence consistently outperformed the others, while Medisearch and GPT-Regular showed significant limitations. These results underscore the need for standardized evaluation frameworks to ensure the safe use of LLMs in healthcare.
Citation Format: Natalia Castano-Villegas, Maria Camila Villa, Katherine Monsalve, Isabella Llano, Laura Velásquez, Jose Zea. Arkangel AI, OpenEvidence, ChatGPT, Medisearch: Are they objectively up to medical standards? A real-life assessment of LLMs in healthcare [abstract]. In: Proceedings of the American Association for Cancer Research Annual Meeting 2026; Part 1 (Regular Abstracts); 2026 Apr 17-22; San Diego, CA. Philadelphia (PA): AACR; Cancer Res 2026;86(7 Suppl):Abstract nr 2748.
Similar works
Explainable Artificial Intelligence (XAI): Concepts, taxonomies, opportunities and challenges toward responsible AI
2019 · 8,626 citations
Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead
2019 · 8,532 citations
High-performance medicine: the convergence of human and artificial intelligence
2018 · 8,046 citations
BioBERT: a pre-trained biomedical language representation model for biomedical text mining
2019 · 6,843 citations
Proceedings of the 19th International Joint Conference on Artificial Intelligence
2005 · 5,781 citations