This is an overview page with metadata for this scientific paper. The full article is available from the publisher.
When external validation isn’t enough: Simpson’s paradox, direction asymmetry, and calibration collapse in cross-continental perioperative mortality prediction
Citations: 0
Authors: 1
Year: 2025
Abstract
Objective: To test whether stratified within-cohort analysis, bidirectional external validation, and case-level paired bootstrap inference jointly surface failure-mode magnitudes in cross-continental clinical prediction that aggregate metrics conceal.

Materials and Methods: Eight machine learning models (XGBoost and logistic regression, on preoperative and preoperative+intraoperative feature sets) were trained on each of INSPIRE (Korea; n = 127,413) and MOVER (USA; n = 57,545), then evaluated bidirectionally between cohorts. Case-level paired bootstrap (2,000 iterations) was the primary inferential framework. Direction asymmetry was stress-tested via matched-subsampling across four case-mix dimensions (ASA, Elixhauser comorbidity, emergency proportion, temporal period). Feature-importance transferability used SHAP rank correlation; calibration used slope, intercept, O:E, and Brier score before and after Platt scaling.

Results: Across the eight cross-population runs, aggregate AUCs concealed substantially lower within-stratum AUCs (Simpson’s paradox gap range 5.0–16.5 pp; worst case: aggregate AUC 0.756 vs within-stratum AUCs 0.58–0.60). Cross-continental transferability was markedly direction-asymmetric (+8.53 pp, 95% CI 6.91–10.24, bootstrap p = 0.001); the asymmetry survived matching on ASA and comorbidity, attenuated to 70–86% of baseline after matching emergency proportion, and was untestable for temporal period. Pre-Platt calibration slopes ranged 0.41–1.29 across all eight cross-population runs; 5-fold CV Platt scaling restored slopes to 0.95–1.02. Intraoperative features conferred a mean external AUC advantage of +3.60 pp (95% CI: +2.75 to +4.39 pp; bootstrap p = 0.001).

Discussion: These magnitudes are clinically material and not visible to conventional aggregate reporting. The methodological commitments that surface them are well-established individually; their joint application here characterizes failure-mode magnitudes that single commitments would underestimate.

Conclusion: We present this case study as a cautionary reference for cross-population deployment of clinical prediction models, with reproducibility infrastructure released for verification and extension.
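To illustrate the case-level paired bootstrap named in the Methods, the following is a minimal sketch and not the authors' released code. It assumes two models scored on the same evaluation cases, a percentile confidence interval, and NumPy as the only dependency; the function names `auc` and `paired_bootstrap_auc_diff` and the synthetic data are hypothetical.

```python
import numpy as np


def auc(y_true, scores):
    """AUC via the Mann-Whitney U statistic, using average ranks for ties."""
    y_true = np.asarray(y_true)
    scores = np.asarray(scores, dtype=float)
    n_pos = int((y_true == 1).sum())
    n_neg = len(y_true) - n_pos
    if n_pos == 0 or n_neg == 0:
        return float("nan")  # AUC is undefined when only one class is present
    # Average 1-based rank of each unique score value.
    _, inverse, counts = np.unique(scores, return_inverse=True, return_counts=True)
    avg_rank = np.cumsum(counts) - (counts - 1) / 2.0
    ranks = avg_rank[inverse]
    u_stat = ranks[y_true == 1].sum() - n_pos * (n_pos + 1) / 2.0
    return float(u_stat / (n_pos * n_neg))


def paired_bootstrap_auc_diff(y_true, scores_a, scores_b, n_boot=2000, seed=0):
    """Percentile CI for AUC(a) - AUC(b), resampling whole cases.

    Each bootstrap draw picks case indices once and applies them to the labels
    and to both score vectors, which keeps the comparison paired.
    """
    y_true = np.asarray(y_true)
    scores_a = np.asarray(scores_a, dtype=float)
    scores_b = np.asarray(scores_b, dtype=float)
    rng = np.random.default_rng(seed)
    n = len(y_true)
    diffs = []
    for _ in range(n_boot):
        idx = rng.integers(0, n, size=n)  # sample cases with replacement
        d = auc(y_true[idx], scores_a[idx]) - auc(y_true[idx], scores_b[idx])
        if not np.isnan(d):  # skip degenerate resamples with a single class
            diffs.append(d)
    point = auc(y_true, scores_a) - auc(y_true, scores_b)
    lower, upper = np.percentile(diffs, [2.5, 97.5])
    return point, float(lower), float(upper)


if __name__ == "__main__":
    # Synthetic illustration only: model A is informative, model B is noise.
    rng = np.random.default_rng(42)
    y = rng.integers(0, 2, size=500)
    a = y + rng.normal(0.0, 1.0, size=500)
    b = rng.normal(0.0, 1.0, size=500)
    point, lower, upper = paired_bootstrap_auc_diff(y, a, b)
    print(f"AUC difference {point:.3f} (95% CI {lower:.3f} to {upper:.3f})")
```

Resampling indices once per iteration, rather than separately for each model, is what makes the interval reflect the within-case correlation between the two sets of predictions.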
Similar works
Classification of Surgical Complications
2004 · 30,631 citations
2013 ESH/ESC Guidelines for the management of arterial hypertension
2013 · 13,665 citations
CONSORT 2010 Statement: updated guidelines for reporting parallel group randomised trials
2010 · 13,499 citations
Seventh Report of the Joint National Committee on Prevention, Detection, Evaluation, and Treatment of High Blood Pressure
2003 · 13,273 citations
2013 ACCF/AHA Guideline for the Management of Heart Failure
2013 · 12,618 citations