This is an overview page with metadata for this scientific paper. The full article is available from the publisher.
Benchmarking readability, reliability, and scientific quality of large language models in communicating organoid science
Citations: 2
Authors: 3
Year: 2026
Abstract
Background: Organoids have become central platforms in precision oncology and translational research, increasing the need for communication that is accurate, transparent, and clinically responsible. Large language models (LLMs) are now widely consulted for organoid-related explanations, but their ability to balance readability, scientific rigor, and educational suitability has not been systematically established.
Methods: Five mainstream LLMs (GPT-5, DeepSeek, Doubao, Tongyi Qianwen, and Wenxin Yiyan) were systematically evaluated using a curated set of thirty representative organoid-related questions. For each model, twenty outputs were independently scored using the C-PEMAT-P scale, the Global Quality Score (GQS), and seven validated readability indices. Between-model differences were analyzed using one-way ANOVA or Kruskal-Wallis tests, and correlation analyses were performed to examine associations between readability and quality measures.
Results: Model performance differed markedly, with GPT-5 achieving the highest C-PEMAT and GQS scores (16.05 ± 1.10; 4.70 ± 0.47; both P < 0.001), followed by intermediate performance from DeepSeek and Doubao (C-PEMAT 11.75 ± 2.07 and 12.05 ± 1.82; GQS 3.65 ± 0.49 and 3.35 ± 0.49). Tongyi Qianwen and Wenxin Yiyan formed the lowest-performing tier (C-PEMAT 7.85 ± 1.09 and 9.00 ± 2.05; GQS 1.55 ± 0.51 and 2.10 ± 0.55). Score-distribution patterns further highlighted reliability gaps, with GPT-5 showing tightly clustered values and the Chinese-developed models displaying broader dispersion and unstable performance. Readability differed significantly across models and question categories, with safety-related, diagnostic, and technical questions showing the highest linguistic and conceptual complexity. Correlation analyses showed strong internal coherence among readability indices but only weak-to-moderate associations with C-PEMAT, GQS, and reliability metrics, indicating that linguistic simplicity is not a dependable surrogate for scientific quality.
Conclusion: LLMs exhibited substantial variability in communicating organoid-related information, forming distinct performance tiers with direct implications for patient education and translational decision-making. Because readability, scientific quality, and reliability diverged across models, linguistic simplification alone is insufficient to guarantee accurate or dependable interpretation. These findings underscore the need for organoid-adapted AI systems that integrate domain-specific knowledge, convey uncertainty transparently, ensure output reliability, and safeguard safety-critical information.
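The between-model comparison named in the Methods (a Kruskal-Wallis test across model groups) can be illustrated with a minimal sketch. The Python below computes the tie-corrected Kruskal-Wallis H statistic using only the standard library. It is not the authors' analysis code: the function names `average_ranks` and `kruskal_wallis_h` and the example scores are hypothetical and are not data from the study. The p-value is omitted; it would normally be read from a chi-square distribution with k - 1 degrees of freedom for k groups.

```python
from collections import Counter


def average_ranks(values):
    """Return 1-based ranks, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the whole run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # mean of 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def kruskal_wallis_h(groups):
    """Tie-corrected Kruskal-Wallis H for a dict mapping group name -> scores."""
    pooled = [x for scores in groups.values() for x in scores]
    n_total = len(pooled)
    ranks = average_ranks(pooled)

    # Sum of (rank sum squared / group size) over all groups.
    weighted = 0.0
    start = 0
    for scores in groups.values():
        n = len(scores)
        rank_sum = sum(ranks[start:start + n])
        weighted += rank_sum ** 2 / n
        start += n
    h = 12.0 / (n_total * (n_total + 1)) * weighted - 3 * (n_total + 1)

    # Tie correction: 1 - sum(t^3 - t) / (N^3 - N) over groups of tied values.
    tie_term = sum(c ** 3 - c for c in Counter(pooled).values())
    correction = 1 - tie_term / (n_total ** 3 - n_total)
    if correction == 0:
        raise ValueError("all observations are identical; H is undefined")
    return h / correction


if __name__ == "__main__":
    # Hypothetical 1-5 quality ratings for three unnamed models (illustration only).
    example = {
        "model_a": [5, 5, 4, 5, 4],
        "model_b": [4, 3, 4, 3, 3],
        "model_c": [2, 1, 2, 2, 1],
    }
    print(f"H = {kruskal_wallis_h(example):.3f}")
```

A larger H indicates that the rank distributions of the groups differ more than would be expected if all groups came from the same distribution. Quality scores on a 1-5 scale produce many tied values, which is why the sketch applies the tie correction.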