OpenAlex · Updated hourly · Last updated: 18.05.2026, 23:34

This is an overview page with metadata for this scholarly work. The full article is available from the publisher.

When external validation isn’t enough: Simpson’s paradox, direction asymmetry, and calibration collapse in cross-continental perioperative mortality prediction

2025 · 0 citations · medRxiv · Open Access

Citations: 0
Authors: 1
Year: 2025

Abstract

Objective: To test whether stratified within-cohort analysis, bidirectional external validation, and case-level paired bootstrap inference jointly surface failure-mode magnitudes in cross-continental clinical prediction that aggregate metrics conceal.

Materials and Methods: Eight machine learning models (XGBoost and logistic regression, on preoperative and preoperative+intraoperative feature sets) were trained on each of INSPIRE (Korea; n = 127,413) and MOVER (USA; n = 57,545), then evaluated bidirectionally between cohorts. Case-level paired bootstrap (2,000 iterations) was the primary inferential framework. Direction asymmetry was stress-tested via matched-subsampling across four case-mix dimensions (ASA, Elixhauser comorbidity, emergency proportion, temporal period). Feature-importance transferability used SHAP rank correlation; calibration used slope, intercept, O:E, and Brier score before and after Platt scaling.

Results: Across the eight cross-population runs, aggregate AUCs concealed substantially lower within-stratum AUCs (Simpson’s paradox gap range 5.0–16.5 pp; worst case: aggregate AUC 0.756 vs within-stratum AUCs 0.58–0.60). Cross-continental transferability was markedly direction-asymmetric (+8.53 pp, 95% CI 6.91–10.24, bootstrap p = 0.001); the asymmetry survived matching on ASA and comorbidity, attenuated to 70–86% of baseline after matching emergency proportion, and was untestable for temporal period. Pre-Platt calibration slopes ranged 0.41–1.29 across all eight cross-population runs; 5-fold CV Platt scaling restored slopes to 0.95–1.02. Intraoperative features conferred a mean external AUC advantage of +3.60 pp (95% CI: +2.75 to +4.39 pp; bootstrap p = 0.001).

Discussion: These magnitudes are clinically material and not visible to conventional aggregate reporting. The methodological commitments that surface them are well-established individually; their joint application here characterizes failure-mode magnitudes that single commitments would underestimate.

Conclusion: We present this case study as a cautionary reference for cross-population deployment of clinical prediction models, with reproducibility infrastructure released for verification and extension.
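As a rough illustration of two ideas named in the abstract, the sketch below computes an aggregate AUC alongside within-stratum AUCs, and a case-level paired bootstrap interval for the AUC difference between two score vectors evaluated on the same cases. It is not the authors' code: the function names, the toy data, the percentile interval, and the choice to skip single-class resamples are all assumptions made for illustration, and the abstract does not specify how the paper defines its gap or its p-value.

```python
import random


def auc(y_true, y_score):
    """Rank-based (Mann-Whitney) AUC; tied scores count as half a win."""
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative case")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def within_stratum_aucs(y_true, y_score, strata):
    """AUC computed separately inside each stratum (hypothetical helper)."""
    result = {}
    for g in sorted(set(strata)):
        idx = [i for i, s in enumerate(strata) if s == g]
        result[g] = auc([y_true[i] for i in idx], [y_score[i] for i in idx])
    return result


def paired_bootstrap_auc_diff(y_true, score_a, score_b, n_boot=2000, seed=0):
    """Resample cases with replacement and score both models on the same
    resampled cases; return the point estimate and a 95% percentile interval.
    Assumption: resamples containing only one outcome class are redrawn."""
    rng = random.Random(seed)
    n = len(y_true)
    diffs = []
    while len(diffs) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]
        y = [y_true[i] for i in idx]
        if len(set(y)) < 2:
            continue
        a = auc(y, [score_a[i] for i in idx])
        b = auc(y, [score_b[i] for i in idx])
        diffs.append(a - b)
    diffs.sort()
    lo = diffs[int(0.025 * n_boot)]
    hi = diffs[int(0.975 * n_boot) - 1]
    point = auc(y_true, score_a) - auc(y_true, score_b)
    return point, (lo, hi)


# Toy data (invented): scores separate the two strata well,
# but rank cases poorly inside each stratum.
strata = ["low"] * 4 + ["high"] * 4
y = [0, 0, 0, 1, 1, 1, 1, 0]
scores = [0.10, 0.20, 0.30, 0.15, 0.70, 0.80, 0.90, 0.85]

agg = auc(y, scores)                            # 0.6875
within = within_stratum_aucs(y, scores, strata)  # 1/3 in each stratum
print("aggregate AUC:", agg)
print("within-stratum AUCs:", within)

# Compare against an uninformative constant score on the same cases.
baseline = [0.5] * len(y)
point, ci = paired_bootstrap_auc_diff(y, scores, baseline, n_boot=200)
print("AUC difference:", point, "95% interval:", ci)
```

In the toy data the aggregate AUC looks acceptable only because the score separates the low-risk stratum from the high-risk one; inside each stratum the ranking is worse than chance. That is the general pattern the abstract describes, at an exaggerated scale. The pairwise AUC is quadratic in cohort size, so for cohorts as large as those in the abstract a sorted-rank implementation would be needed instead.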

Topics

Cardiac, Anesthesia and Surgical Outcomes; Artificial Intelligence in Healthcare and Education; Machine Learning in Healthcare