This is an overview page with metadata for this scientific article. The full article is available from the publisher.
Artificial intelligence in the clinical setting
Citations: 7
Authors: 4
Year: 2022
Abstract
Advanced statistical models for predicting adverse clinical events have become omnipresent in the medical literature, and we often hear that concepts like artificial intelligence or machine learning are going to disrupt medicine. Given the amount of data generated during surgical procedures and intensive care admissions, these clinical areas are prototypical for the application of machine learning. Yet, in the face of massive attention and enormous research output, there are, so far, few clinically validated and implemented algorithms [1]. Within the disciplines of anaesthesia and intensive care, we are familiar with a few compelling sepsis prediction studies, but they are either small [2] or not designed as randomised controlled trials [3]. In this editorial, we broadly discuss some of the reasons why machine learning struggles with real-world implementation. Some of these reasons relate to methodology, others to clinical context.

Framing the question

Few machine learning researchers are intimately familiar with the clinical environment, and so it should be no surprise that many machine learning studies are not carried out in a way that allows for easy translation to the bedside. Framing a machine learning study appropriately, that is, properly defining the clinical event and the prediction task, requires interdisciplinary knowledge and detailed discussion of methodology. For a prediction task, for example, framing would include identifying the clinical outcome, specifying when exactly the prediction is made, selecting the observation window, and so on. These details are sometimes poorly considered, sometimes poorly described.
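As an illustration of what an explicit framing might look like, the following is a minimal Python sketch. The class and function names are hypothetical, the terms lead time and prediction window are borrowed from Table 1 of this editorial, and the half-open outcome interval is an assumption made only for this example; it is not a method described by the authors.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional, Tuple


@dataclass(frozen=True)
class PredictionFraming:
    """Hypothetical container for the framing choices of a prediction task."""
    outcome: str                   # clinical event being predicted
    observation_window: timedelta  # how far back input data are collected
    lead_time: timedelta           # gap between prediction and outcome window
    prediction_window: timedelta   # period in which an event counts as positive

    def observation_interval(self, prediction_time: datetime) -> Tuple[datetime, datetime]:
        """Interval from which input data may be drawn."""
        return prediction_time - self.observation_window, prediction_time

    def outcome_interval(self, prediction_time: datetime) -> Tuple[datetime, datetime]:
        """Interval in which the outcome is counted (assumed half-open)."""
        start = prediction_time + self.lead_time
        return start, start + self.prediction_window


def label_at(framing: PredictionFraming,
             prediction_time: datetime,
             event_time: Optional[datetime]) -> int:
    """Return 1 if the event falls inside the outcome interval, otherwise 0."""
    if event_time is None:
        return 0
    start, end = framing.outcome_interval(prediction_time)
    return int(start <= event_time < end)


# Illustrative values only, not taken from any study.
framing = PredictionFraming(
    outcome="sepsis",
    observation_window=timedelta(hours=24),
    lead_time=timedelta(hours=1),
    prediction_window=timedelta(hours=6),
)
```

Writing the framing down in this form makes it clear that the same event can be labelled positive or negative depending on when the prediction is made, which is why these choices need to be reported.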
Framing forms the very backbone of the machine learning model being developed, and evaluation takes place within the context of the framing [4]. Consequently, without clear and clinically relevant framing, a seemingly high-performing model may still not be clinically usable [5]. Many machine learning studies seek to address clinically relevant problems, but oversimplify the problem to the point where clinical relevance is eventually lost. The ubiquitous case-control design in machine learning studies is a good example of a design in which researchers seek to solve a clinically relevant problem in a way that is not aligned with clinical reality. The evidence level of a classic case-control study is weak, and the caveats of this design, such as selection bias, do not disappear just because a study applies machine learning techniques. For models that are meant to make predictions and update them over time, applying the case-control design in a ‘validation study’ often creates a temporal bias that should be avoided [6]. When a black box prediction algorithm developed this way is released, the positive predictive value often declines dramatically [6], and it becomes impossible for users to know which event alarms to trust.

The nature of observational data

Many studies are based on analyses of large retrospectively collected datasets, where missing data are a frequent and natural phenomenon. The treatment of missing data is often a major issue, given that data are rarely missing at random. One could think of the simple physiological example of SpO2 becoming unmeasurable in shock with hypotension. A clinical example is the difference between the patients who had an arterial blood gas taken in the emergency department and the patients who did not. A clinician decided to obtain that blood gas. This presence or missingness of an observation tells us something important. Taking this a step further: Where and when was the blood gas taken?
If taken in the first postoperative hours in the cardiac surgery recovery unit, that lab test result could well have been obtained to inform FiO2 adjustment, indicating a different ‘lab presence risk’ than in the emergency department patient. A large retrospective study found that the mere ‘presence of a laboratory test order, regardless of any other information about the test result, has a significant association with the odds of survival in 233 of 272 (86%) tests. Data about the timing of when laboratory tests were ordered were more accurate than the test results in predicting survival in 118 of 174 tests (68%)’ [7]. Observational studies, whether retrospective or prospective, are, in general, vulnerable to this missingness bias. While imputation techniques and the use of auxiliary variables may help to mitigate these issues, there should be no expectation of generalisability, given that missingness patterns probably reflect a specific ward's clinical culture [8,9].

Algorithm performance

In machine learning research, the goal is often to outperform previous reports on benchmark tasks, rather than to truly consider how models might perform in practice. Many studies that use large datasets are focused on the applied methods and algorithms [10], as well as on fine-tuning of performance metrics, such as area under the receiver operating characteristic curve (AUROC), sensitivity and specificity. This focus on algorithms and classification performance often comes at the expense of basic epidemiologic principles and clinical interpretation [11], and it is characterised by vague basic descriptions of the data and its origin, often indicating limited domain knowledge among authors.
Oversimplified methods sections and lack of code sharing often result in the inability of the community to even reproduce a study cohort, let alone the study outcome [12]. The unfortunate consequence is that algorithms become useless to the research community, contributing to research waste [10].

Fortunately, the unfavourable events that are typically the goal of prediction tasks occur rarely. Metrics such as AUROC, sensitivity and specificity are important features of an algorithm, but they do not address clinical usefulness and may be less informative in cases of rare events. For ‘imbalanced’ datasets, a metric such as the area under the precision-recall curve - plotting the positive predictive value (precision) against sensitivity (recall) - is often more informative, particularly in order to quantify the presence of false alarms. For the clinician at the bedside, a model with a high false alarm rate is unlikely to be a useful model. In addition, if a false-positive decision causes greater harm than a false-negative decision, a model with high specificity may be preferable to a model with high sensitivity and lower specificity, although the latter model might have, say, a higher AUROC. In general terms, a model is clinically useful if the use of its decisions for patients leads to a better ratio between benefits and harms than not using the model [11,13].

The conversion of a predicted probability into a binary label (positive or negative) is governed by a decision threshold in the range between 0 and 1. For example, with a decision threshold of 0.5, probabilities less than 0.5 are assigned to class 0 and values greater than or equal to 0.5 are assigned to class 1. Receiver operating characteristic (ROC) and precision-recall curves are both diagnostic plots that evaluate a set of probability predictions at varying decision thresholds.
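The thresholding rule and the effect of rare events on the positive predictive value can be made concrete with a short sketch. The function names below are hypothetical and the numbers are illustrative assumptions, not results from any study; the positive predictive value is obtained from sensitivity, specificity and prevalence using Bayes' rule.

```python
def classify(probabilities, threshold=0.5):
    """Assign class 1 when the probability is >= threshold, otherwise class 0."""
    return [1 if p >= threshold else 0 for p in probabilities]


def confusion_counts(labels, predictions):
    """Return (true positives, false positives, true negatives, false negatives)."""
    pairs = list(zip(labels, predictions))
    tp = sum(1 for y, yhat in pairs if y == 1 and yhat == 1)
    fp = sum(1 for y, yhat in pairs if y == 0 and yhat == 1)
    tn = sum(1 for y, yhat in pairs if y == 0 and yhat == 0)
    fn = sum(1 for y, yhat in pairs if y == 1 and yhat == 0)
    return tp, fp, tn, fn


def ppv_from_rates(sensitivity, specificity, prevalence):
    """Positive predictive value implied by sensitivity, specificity and prevalence."""
    true_positive_fraction = sensitivity * prevalence
    false_positive_fraction = (1 - specificity) * (1 - prevalence)
    return true_positive_fraction / (true_positive_fraction + false_positive_fraction)


# Same hypothetical classifier (90% sensitivity, 90% specificity), two prevalences.
ppv_balanced = ppv_from_rates(0.9, 0.9, prevalence=0.5)   # balanced case-control sample
ppv_rare = ppv_from_rates(0.9, 0.9, prevalence=0.01)      # event occurs in 1% of patients
```

Under these assumed numbers, nine in ten alarms are correct in the balanced sample, whereas only about one in twelve is correct when the event occurs in 1% of patients, although sensitivity and specificity are unchanged.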
In the case of a ROC curve, a set of different thresholds is used to interpret the true-positive rate and the false-positive rate of the predictions. In this sense, the ROC curve is a useful tool to help understand the trade-off between the true-positive rate and the false-positive rate for different thresholds. Similarly, decision curve analysis (DCA) [14] assesses the clinical usefulness of a prediction model by evaluating the so-called net benefit at varying decision thresholds for the model. In practice, this means that the decision threshold sets the exchange ratio, that is, the number of false positives that is acceptable in exchange for one true positive. This interpretation is important, because it is informative of how the clinician weighs the harm of a false decision against the benefit of a true decision. The harm/benefit exchange ratio is subjective and will vary among clinicians. A decision curve in DCA illustrates the consequence of an arbitrary choice of threshold by evaluating the net benefit of the binary decision to opt into the intervention or not across a range of different decision thresholds or, equivalently, a range of different harm-benefit exchange ratios [15].

Another key aspect of model performance that is often overlooked is algorithmic bias. Does the model exhibit behaviour that might reinforce inequalities? Strong overall performance of a model can be misleading, concealing poor performance in patient subgroups.

Validation and trust

If a new model is being proposed, then there is almost no reason not to provide one or more reference models for comparison. These might include a classic regression model and clinical scores of disease severity, such as the Acute Physiology And Chronic Health Evaluation (APACHE) score, the Sequential Organ Failure Assessment (SOFA) score and others.
Yet, reference models are often missing, which makes it impossible to determine if a new and less transparent model is adding any predictive value (at the expense of direct interpretability) [16]. In low-risk-of-bias studies, where an interpretable logistic regression model is reported, more advanced machine learning models rarely outperform logistic regression [17]. Methodological issues such as this may become less common once reporting guidelines are established for diagnostic and prognostic prediction model studies based on artificial intelligence [10] and are subsequently taken up in the reviewing process of scientific journals.

While developing approaches that enable the reasoning of complex machine learning models to be explained is an active research area, it is fair to say that this is still in its infancy. Interpretable models are likely to be preferred by clinical teams, even at the expense of performance, favouring traditional modelling approaches over the ‘black box’ of state-of-the-art models, such as neural networks.

Deployment at the bedside

There are also crucial contextual and technical reasons why so few artificial intelligence algorithms, even ones that are well validated from a scientific point of view, have been deployed successfully [18]. Arguably, studies that explore translation of algorithms to the bedside are scarce, at least in part because the academic system provides greater reward to the lower-hanging fruits of fast, successive publications that are unencumbered by the realities of the clinical environment. Technically, there is a huge gap between the data infrastructure needed to train an algorithm on a retrospective dataset, extracted once from a setup optimised for collecting and storing data, and the infrastructure needed to use the algorithm in a prospective, and maybe real-time, setup. There is also the issue of dataset shift. The clinical environment and its patient population are not static.
A model that works now may catastrophically fail when a laboratory reagent is switched, a protocol is updated, or patient characteristics change.

In conclusion, a number of challenges, summarised in Table 1, have inhibited widespread adoption of machine learning models at the bedside, but there has been progress nonetheless. This progress includes movement towards collaborative approaches for machine learning in health research; public datasets that are more representative of the clinical environment; more holistic metrics for assessing performance; and establishment of guidelines for reporting machine learning studies.

Table 1 - Issues and challenges often present in existing prediction model studies and possible ways to handle them

Issues to address or handle | Possible ways to handle the issues in future studies
--- | ---
Insufficient reporting of design and framing | Always align the framing and design with the clinical question/unmet need; the classic case-control design is rarely aligned with a clinically relevant question; report details about observation window, prediction window, lead time and window shift, preferably with a supporting figure
Missing values | Quantify the presence of missing values and its implications; possibly apply imputation if deemed meaningful; possibly model the missingness pattern to inform the prediction model
Insufficient reporting and clinical assessment of discrimination metrics | Thoroughly report discrimination metrics; discuss the presence and implications of a possibly unbalanced design/dataset; discuss perceived clinical benefits of the prediction model (compared with the reference model), e.g. using concepts of net benefit and decision curve analysis
Lack of a clinically meaningful reference model | Always report a reference model that could be considered current practice for predictions. Example of short-term outcomes (predicting imminent hypotension or tachycardia): blood pressure or heart rate itself should always be (part of) a reference model, preferably a transparent regression model if multivariate. Example of longer-term outcomes (predicting sepsis or mortality): clinical scores of disease severity, such as EWS, SOFA or APACHE scores, can be relevant, or a multivariate regression model based on the underlying variables used for such scores in order to calibrate better
Obstacles for actual implementation | Discuss why and where the prediction model could realistically be implemented
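The net benefit used in decision curve analysis, which the text discusses and Table 1 recommends reporting, can be sketched in a few lines. This is a minimal illustration using the standard definition (true positives per patient minus false positives per patient weighted by the odds of the decision threshold); the function names and the toy data are assumptions made for this example and do not come from the editorial.

```python
def net_benefit(labels, probabilities, threshold):
    """Net benefit of acting on predictions with probability >= threshold.

    The weight threshold / (1 - threshold) is the harm/benefit exchange ratio:
    at a threshold of 0.2 it equals 0.25, i.e. one true positive is judged
    to be worth four false positives.
    """
    if not 0 < threshold < 1:
        raise ValueError("threshold must lie strictly between 0 and 1")
    n = len(labels)
    pairs = list(zip(labels, probabilities))
    tp = sum(1 for y, p in pairs if p >= threshold and y == 1)
    fp = sum(1 for y, p in pairs if p >= threshold and y == 0)
    return tp / n - (fp / n) * threshold / (1 - threshold)


def net_benefit_treat_all(labels, threshold):
    """Net benefit of the reference strategy of intervening in every patient."""
    prevalence = sum(labels) / len(labels)
    return prevalence - (1 - prevalence) * threshold / (1 - threshold)


def decision_curve(labels, probabilities, thresholds):
    """Return (threshold, model net benefit, treat-all net benefit) triples."""
    return [
        (t, net_benefit(labels, probabilities, t), net_benefit_treat_all(labels, t))
        for t in thresholds
    ]


# Toy data for illustration only: 10 patients, 2 events.
toy_labels = [1, 0, 0, 1, 0, 0, 0, 0, 0, 0]
toy_probabilities = [0.8, 0.6, 0.1, 0.3, 0.2, 0.1, 0.1, 0.1, 0.1, 0.1]
curve = decision_curve(toy_labels, toy_probabilities, [0.1, 0.2, 0.3, 0.5])
```

Intervening in no patient has a net benefit of zero by definition, so a model is only worth using at a given threshold if its net benefit exceeds both zero and the treat-all strategy at that threshold.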
Similar works
Explainable Artificial Intelligence (XAI): Concepts, taxonomies, opportunities and challenges toward responsible AI
2019 · 8,316 citations
Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead
2019 · 8,177 citations
High-performance medicine: the convergence of human and artificial intelligence
2018 · 7,575 citations
Proceedings of the 19th International Joint Conference on Artificial Intelligence
2005 · 5,776 citations
Peeking Inside the Black-Box: A Survey on Explainable Artificial Intelligence (XAI)
2018 · 5,468 citations