This is an overview page with metadata for this scientific paper. The full article is available from the publisher.

Benchmarking readability, reliability, and scientific quality of large language models in communicating organoid science

2026 · 2 citations · Frontiers in Bioengineering and Biotechnology · Open Access

Abstract

Background: Organoids have become central platforms in precision oncology and translational research, increasing the need for communication that is accurate, transparent, and clinically responsible. Large language models (LLMs) are now widely consulted for organoid-related explanations, but their ability to balance readability, scientific rigor, and educational suitability has not been systematically established.

Methods: Five mainstream LLMs (GPT-5, DeepSeek, Doubao, Tongyi Qianwen, and Wenxin Yiyan) were systematically evaluated using a curated set of thirty representative organoid-related questions. For each model, twenty outputs were independently scored using the C-PEMAT-P scale, the Global Quality Score (GQS), and seven validated readability indices. Between-model differences were analyzed using one-way ANOVA or Kruskal-Wallis tests, and correlation analyses were performed to examine associations between readability and quality measures.

Results: Model performance differed markedly, with GPT-5 achieving the highest C-PEMAT and GQS scores (16.05 ± 1.10; 4.70 ± 0.47; both P < 0.001), followed by intermediate performance from DeepSeek and Doubao (C-PEMAT 11.75 ± 2.07 and 12.05 ± 1.82; GQS 3.65 ± 0.49 and 3.35 ± 0.49). Tongyi Qianwen and Wenxin Yiyan comprised the lowest-performing tier (C-PEMAT 7.85 ± 1.09 and 9.00 ± 2.05; GQS 1.55 ± 0.51 and 2.10 ± 0.55). Score-distribution patterns further highlighted reliability gaps, with GPT-5 showing tightly clustered values and domestic models displaying broader dispersion and unstable performance. Readability differed significantly across models and question categories, with safety-related, diagnostic, and technical questions showing the highest linguistic and conceptual complexity. Correlation analyses showed strong internal coherence among readability indices but only weak-to-moderate associations with C-PEMAT, GQS, and reliability metrics, indicating that linguistic simplicity is not a dependable surrogate for scientific quality.

Conclusion: LLMs exhibited substantial variability in communicating organoid-related information, forming distinct performance tiers with direct implications for patient education and translational decision-making. Because readability, scientific quality, and reliability diverged across models, linguistic simplification alone is insufficient to guarantee accurate or dependable interpretation. These findings underscore the need for organoid-adapted AI systems that integrate domain-specific knowledge, convey uncertainty transparently, ensure output reliability, and safeguard safety-critical information.

Institutions

Second Affiliated Hospital of Dalian Medical University (CN)

Topics

Artificial Intelligence in Healthcare and Education
Radiomics and Machine Learning in Medical Imaging
Machine Learning in Healthcare
