This is an overview page with metadata for this scientific paper. The full article is available from the publisher.

Evaluation of error detection and treatment recommendations in nucleic acid test reports using ChatGPT models

2025 · 3 citations · Clinical Chemistry and Laboratory Medicine (CCLM)

Abstract

OBJECTIVES: Accurate medical laboratory reports are essential for delivering high-quality healthcare. Recently, advanced artificial intelligence models, such as those in the ChatGPT series, have shown considerable promise in this domain. This study assessed the performance of three GPT models (4o, o1, and o1 mini) in identifying errors within medical laboratory reports and in providing treatment recommendations.

METHODS: In this retrospective study, 86 nucleic acid test reports for seven upper respiratory tract pathogens were compiled. A total of 285 errors from four common error categories were intentionally and randomly introduced into the reports, generating 86 incorrect reports. GPT models were tasked with detecting these errors, with three senior medical laboratory scientists (SMLS) and three medical laboratory interns (MLI) serving as control groups. Additionally, GPT models were tasked with generating accurate and reliable treatment recommendations following positive test outcomes, based on 86 corrected reports. χ² tests, Kruskal-Wallis tests, and Wilcoxon tests were used for statistical analysis where appropriate.

RESULTS: In comparison with SMLS or MLI, GPT models accurately detected three error types; the average detection rates of the three GPT models were 88.9 % (omission), 91.6 % (time sequence), and 91.7 % (the same individual acting as both the inspector and the reviewer). However, the average detection rate of the three GPT models for errors in the result input format was only 51.9 %, indicating relatively poor performance in this category. GPT models exhibited substantial to almost perfect agreement with SMLS in detecting total errors (kappa [min, max]: 0.778, 0.837), whereas agreement between GPT models and MLI was lower (kappa [min, max]: 0.632, 0.696). In reading all 86 reports, GPT models required significantly less reading time than SMLS or MLI (all p<0.001). Notably, the GPT-o1 mini model identified errors more consistently than the GPT-o1 model, which in turn was more consistent than the GPT-4o model. Pairwise comparisons of the same GPT model's outputs across three repeated runs showed almost perfect agreement (kappa [min, max]: 0.912, 0.996). GPT-o1 mini required significantly less reading time than GPT-4o or GPT-o1 (all p<0.001). Additionally, GPT-o1 significantly outperformed GPT-4o and o1 mini in providing accurate and reliable treatment recommendations (all p<0.0001).

CONCLUSIONS: GPT models were competent at detecting some types of medical laboratory report errors and at providing accurate and reliable treatment recommendations, potentially reducing work hours and enhancing clinical decision-making.

Topics

Artificial Intelligence in Healthcare and Education · Meta-analysis and systematic reviews · Explainable Artificial Intelligence (XAI)
