This is an overview page with metadata for this scientific paper. The full article is available from the publisher.
The Challenge of Data Scarcity and Imbalanced Classes in Radiomics Performance
Citations: 0
Authors: 5
Year: 2025
Abstract
Background
Radiomics holds great promise for non-invasive clinical prediction, offering insights into disease characteristics that traditional methods might miss. However, its application is often constrained by small sample sizes and class imbalance, both common in real-world datasets. These limitations can lead to model overfitting and poor generalization. This study systematically investigates the impact of these two factors, in isolation and in combination, evaluating how they affect model performance and exploring strategies to mitigate their effects, with the goal of enhancing model robustness and clinical applicability.

Methods
Three radiomics datasets, PI-CAI (prostate cancer), BraTS2021 (glioblastoma), and Hunter2023 (lung cancer), were analyzed under four experimental conditions: a baseline (balanced, fixed-size dataset), progressive class imbalance, progressive sample size reduction, and a combined scenario. Five machine learning models were evaluated, with Random Forest ultimately selected as the reference model. Class imbalance was addressed using state-of-the-art sampling techniques, and data scarcity was mitigated using Tabular Variational Autoencoders (TVAE). Performance was assessed across five metrics (sensitivity, specificity, accuracy, ROC-AUC, and balanced accuracy), with statistical significance evaluated via t-tests.

Results
Feature selection played a key role in both model performance and interpretability. The most predictive selected features were biologically plausible and dataset-specific, such as perinodular texture heterogeneity in lung cancer or gray-level non-uniformity in glioblastoma. Class imbalance significantly degraded performance, especially when no sampling was applied. Applying the best-performing sampling method, typically an undersampling strategy, consistently improved balanced accuracy and specificity. TVAE provided modest improvements under sample size reduction, but these were not statistically significant. In combined scenarios, using TVAE together with the best sampler yielded the highest gains, particularly under moderate data constraints.

Conclusion
Class imbalance and small sample size each impair radiomics model performance, and their effects compound under combined conditions. Although targeted sampling and augmentation strategies provide partial mitigation, model generalizability remains constrained under extreme conditions, highlighting the ongoing need for methodological advancements.
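
To make the role of the reported metrics concrete, the following is a minimal sketch, not the authors' pipeline. It computes sensitivity, specificity, accuracy, and balanced accuracy from binary labels, and shows plain random undersampling as one possible stand-in for the unspecified "best sampler". The function names (`metrics`, `random_undersample`) and the 90/10 toy labels are assumptions made for illustration only.

```python
import random


def metrics(y_true, y_pred):
    """Return sensitivity, specificity, accuracy and balanced accuracy
    for binary labels (1 = positive class, 0 = negative class)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    # Guard against a class being absent from y_true.
    sensitivity = tp / (tp + fn) if (tp + fn) else float("nan")
    specificity = tn / (tn + fp) if (tn + fp) else float("nan")
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "accuracy": (tp + tn) / len(y_true),
        "balanced_accuracy": (sensitivity + specificity) / 2,
    }


def random_undersample(X, y, seed=0):
    """Illustrative random undersampling (an assumption, not necessarily
    the sampler used in the paper): keep an equal number of samples
    from each class by discarding majority-class samples at random."""
    rng = random.Random(seed)
    pos = [i for i, label in enumerate(y) if label == 1]
    neg = [i for i, label in enumerate(y) if label == 0]
    n = min(len(pos), len(neg))
    keep = sorted(rng.sample(pos, n) + rng.sample(neg, n))
    return [X[i] for i in keep], [y[i] for i in keep]


if __name__ == "__main__":
    # Hypothetical toy labels: 90 negatives, 10 positives.
    y_true = [0] * 90 + [1] * 10
    # A degenerate model that always predicts the majority class.
    y_pred = [0] * 100
    print(metrics(y_true, y_pred))

    X = list(range(100))  # placeholder feature rows
    X_bal, y_bal = random_undersample(X, y_true)
    print(len(y_bal), sum(y_bal))
```

On the toy labels, a model that always predicts the majority class scores high on plain accuracy but only chance level on balanced accuracy, which is why the latter is informative under class imbalance.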