This is an overview page with metadata for this scholarly work. The full article is available from the publisher.
Does ChatGPT Help Us Understand the Medical Literature?
8
Citations
1
Authors
2023
Year
Abstract
Integrated Discussion

Just as we were getting used to one class of applications of artificial intelligence (AI) in medicine and nephrology, along comes a powerful version of a large language model (LLM), ChatGPT-4, developed by the research firm OpenAI,1 offering a new set of AI tools. LLMs are a form of deep learning model operating on extremely large datasets of unlabeled text. Perhaps more familiar to nephrologists as examples of AI are deep learning models based on convolutional neural networks (CNNs) for processing digital biopsy images.2 Both are characterized by network architectures with a very large number of neurons that communicate with each other through weighted interactions. The weights are adjusted in an iterative process to minimize some form of error function, which evaluates how well the neural net accomplishes its assigned task. The task for training an LLM is to assign probabilities to how a particular input text (a prompt or query) should be continued (e.g., an answer following a query), although how the network assigns these probabilities is at present at least somewhat opaque.

One pivotal difference between visual CNN models and LLMs is the nature of the metric that defines the accuracy of the guesses the models make, a metric important both for training and for evaluating the models' performance. Although image segmentation models can be quantitatively evaluated (e.g., by Dice coefficient scores) against previously annotated examples, LLMs seem to be judged fundamentally by their syntactic plausibility or fluency post facto. They can always be evaluated by humans for content accuracy, but this too is only post facto, because there is no single correct answer to any given query (other than, perhaps, a yes/no question) that can be established in advance.

ChatGPT (GPT stands for generative pretrained transformer) is one of several LLMs finding increasing application as general-purpose generative search engines that use natural language to query very large, usually open-access data sources. The strengths and drawbacks of ChatGPT with respect to analyzing the medical literature are commented on by Jin and colleagues3 in a previous issue of JASN. Although the idea of hallucinations may be the most colorful drawback of LLMs, the most damning one they note is the fact that ChatGPT does not consult any source of truth. That is, training and performance are wholly syntactic, not semantic. This contrasts with the training and evaluation of CNNs, which are typically based on a quantifiable, objectively measurable ground truth. A remarkable finding of one study4 the authors cite is that the accuracy of citations provided by ChatGPT (as indicated by the components of the F1 index) in fact appears to be inversely correlated with the fluency and perceived utility of the model outputs.

The authors allow that LLMs may be more appropriate for text summarization than for giving valid answers to medical questions. Here validity means being backed up by citations and reasoning, which have proven to be a weakness of ChatGPT. Setting aside the more egregious errors such as fabricating PMID numbers, the authors present some examples of summarization failures by ChatGPT when seeking to answer a specific clinical question (identify possible mechanisms of AKI in coronavirus disease 2019), even when the model was presented with a small number of relevant publications to work from. Other areas of potential chatbot use raise other issues.
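As a hedged illustration of the probability-assignment task described above (my sketch, not OpenAI's implementation; the vocabulary and scores are invented), a minimal softmax in Python shows how raw network scores over candidate continuations become a probability distribution:

```python
# Minimal sketch (hypothetical, not OpenAI's code): turning a network's
# raw scores over a small vocabulary into next-token probabilities.
import math

def softmax(logits: list[float]) -> list[float]:
    """Convert raw scores into a probability distribution."""
    m = max(logits)                       # subtract max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Invented scores a model might assign to continuations of
# the prompt "The kidney filters ...":
vocab = ["blood", "light", "waste", "syntax"]
logits = [4.1, 0.2, 3.3, -1.0]
for token, p in zip(vocab, softmax(logits)):
    print(f"{token:>7s}: {p:.3f}")
# Training nudges the weights that produce these scores so that
# continuations seen in the corpus get higher probability; nothing
# in this objective consults an external source of truth.
```

To make the metric contrast above concrete, here is a toy sketch (mine, not taken from the cited study4) of the two evaluation measures mentioned: the Dice coefficient used to score segmentation models against annotated ground truth, and the F1 score, the harmonic mean of precision and recall, used to quantify citation accuracy:

```python
# Toy sketch (hypothetical data): Dice for segmentation masks,
# F1 (precision/recall) for citation retrieval.
def dice_coefficient(pred: list[int], truth: list[int]) -> float:
    """Dice score for flattened binary masks: 2|A∩B| / (|A| + |B|)."""
    intersection = sum(p & t for p, t in zip(pred, truth))
    return 2.0 * intersection / (sum(pred) + sum(truth))

def f1_score(retrieved: set[str], relevant: set[str]) -> float:
    """F1 for retrieved citations: harmonic mean of precision and recall."""
    true_positives = len(retrieved & relevant)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(retrieved)
    recall = true_positives / len(relevant)
    return 2 * precision * recall / (precision + recall)

pred, truth = [1, 1, 0, 0], [1, 0, 0, 0]       # toy 4-pixel masks
print(dice_coefficient(pred, truth))            # 0.666...
print(f1_score({"PMID1", "PMID9"}, {"PMID1"}))  # 0.666... (invented IDs)
```

The key design difference is visible in the inputs: the Dice score requires a pre-annotated ground-truth mask, whereas the F1 score for citations can only be computed after a human has judged which retrieved references are actually relevant.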
In medical education, reasonable performance on USMLE examinations5 may better represent book smarts than deep clinical insight. Simple single-reply questions cannot illustrate the balancing of priorities and human(e) interaction that is a hallmark of satisfying clinical encounters. Screening the medical literature to aid systematic reviews leaves out one of the largely unheralded advantages of an old-fashioned PubMed search on a topic: stumbling onto possible sources one was not even looking for, not to mention the fact that the text source of PubMed is essentially continuously updated. It is not clear how serendipity can be engineered into LLMs; I am sure that hallucinations are not a desirable way to achieve it. I am also not convinced that greater speed in producing a systematic review is always a good idea. One generally has to question the utility of LLMs for biomedical information seeking if the LLM is insensitive to semantics (information) and lives only in the realm of syntactic appropriateness. Some of these deficiencies do seem hard to reconcile with the reported abilities of LLM chatbots such as ChatGPT to generate useful computer code, which would appear to require a high level of technical competence and not just syntactic mimicry.

A couple of other stumbling blocks may need to be addressed before more enthusiasm for LLMs is appropriate. LLMs may show significant performance instability over very short periods, during which they are presumably undergoing development.6 This may be similar to the instability seen in nonlinear but deterministic dynamical systems (the butterfly effect, or extreme sensitivity to initial conditions7). In the case of LLMs, the effects of small changes in one part of the code may propagate through the network, leading to unpredictable changes in performance in distant domains of the model. In contrast to the classic butterfly effect, the extreme sensitivity may be with respect to the model weighting parameters rather than the input data. If this is the case, it may indicate a fatal limitation of LLMs.

In terms of publications reporting the performance of ChatGPT, the proprietary approach to its program code is fundamentally incompatible with our usual expectations for experimental science, in which we expect methods to be described in enough detail that another investigator could reproduce the reported experiment. As long as the training text sets, weightings, or source code of ChatGPT are not available, rigorous validation by independent investigators will be impossible. Only empirical testing of answers to queries, with human post facto validation, is possible. Although this is necessarily only an anecdotal approach to validation, much of the time these evaluations have not been particularly favorable.

Although the forward march of generative AI applications such as ChatGPT in medicine will and should proceed, the very highest standards of performance must be met before application in the clinical realm can even be considered. Having AI chatbots police themselves,1 although apparently effective in some situations, seems at best a questionable solution to their occasional aberrations. The use of LLMs in retrieval, summarization, and verification of the medical literature does not seem to this observer to have yet reached a particularly high standard, although Jin and colleagues3 as well as some other commentators8 remain guardedly optimistic.
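The butterfly effect invoked above is classically illustrated by the logistic map; the following toy sketch (my illustration, not from the cited work7) shows two nearly identical starting points diverging under fully deterministic dynamics:

```python
# Toy illustration of sensitivity to initial conditions: the logistic
# map x_{n+1} = r * x_n * (1 - x_n), chaotic at r = 4.
def logistic_map(x0: float, r: float = 4.0, steps: int = 40) -> list[float]:
    """Iterate the logistic map from x0 and return the trajectory."""
    xs = [x0]
    for _ in range(steps):
        xs.append(r * xs[-1] * (1 - xs[-1]))
    return xs

a = logistic_map(0.3000000)
b = logistic_map(0.3000001)  # perturbed in the 7th decimal place
for n in (0, 10, 20, 30, 40):
    print(f"step {n:2d}: |difference| = {abs(a[n] - b[n]):.6f}")
# The two trajectories drift apart to order-one differences within a
# few dozen iterations, despite the dynamics being fully deterministic.
```

The analogy in the text swaps the perturbed quantity: here it is the initial state, whereas for LLMs the suggestion is that small changes to the weights themselves may propagate unpredictably across the model's behavior.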
I am inclined to see greater utility in using LLMs to flesh out, fill in, and bill for clinical encounters from audio dictations, for example, where they might perform better when constrained to work within a much smaller universe of text and possible situations. Enhancements such as clinical decision support could be add-on features for such a simple transcription program. In general, decision support by expanding the clinician's initial list of differential diagnoses seems both safe and potentially very useful. Regardless of its merits, ChatGPT will likely find increasing numbers of users over time, in part because of its easy-to-use conversational format using natural language. The important guardrail of having a knowledgeable human end-user verify the accuracy and appropriateness of the chat output may be challenging to maintain in busy clinical settings. One need only think of the numerous transcribed dictations that are electronically signed with obvious, and clinically significant, errors left uncorrected.
Related Works
Explainable Artificial Intelligence (XAI): Concepts, taxonomies, opportunities and challenges toward responsible AI
2019 · 8,611 citations
Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead
2019 · 8,504 citations
High-performance medicine: the convergence of human and artificial intelligence
2018 · 8,025 citations
BioBERT: a pre-trained biomedical language representation model for biomedical text mining
2019 · 6,835 citations
Proceedings of the 19th International Joint Conference on Artificial Intelligence
2005 · 5,781 citations