This article provides a comprehensive examination of test-retest reliability in radiomic features, a critical foundation for developing robust imaging biomarkers. We explore the fundamental concepts of repeatability and reproducibility, addressing the paradigm-shifting perspective that feature stability and predictive power are independent properties. The content details methodological frameworks for reliability assessment, including test-retest protocols and computational perturbation techniques, and examines key variability sources across cancer types, imaging modalities, and segmentation practices. Through validation studies and comparative analyses of conventional radiomics versus deep learning approaches, we provide actionable insights for researchers and drug development professionals seeking to enhance radiomic model generalizability and accelerate clinical translation in precision oncology.
In the field of radiomics, where quantitative features are extracted from medical images to serve as biomarkers for diagnosis, prognosis, and treatment assessment, the reliability of these features is paramount for clinical translation. Two foundational concepts underpin this reliability: repeatability and reproducibility. These terms are often used interchangeably, but they represent distinct aspects of feature stability that are critical for validating the trustworthiness of radiomic signatures [1] [2].
Repeatability refers to the ability of a radiomic feature to remain consistent when the same subject is imaged multiple times under identical conditions, using the same equipment, software, and operators. It is often assessed through test-retest studies where a patient or phantom is scanned twice in a short time frame without changes to the imaging protocol [1] [3]. Reproducibility, a broader concept, refers to a feature's ability to remain stable despite variations in the imaging or analysis process. This includes changes in scanner manufacturer, acquisition parameters (such as slice thickness or tube current), reconstruction algorithms, imaging software, or operators across different institutions [1] [3]. Understanding and quantifying both aspects is essential for distinguishing true biological signal from technical noise, thereby ensuring that radiomic models perform robustly in multi-institutional clinical trials and eventual routine practice.
The stability of radiomic features is not uniform; it varies significantly by feature class, imaging modality, and specific acquisition parameters. The tables below synthesize quantitative data from multiple studies to provide a clear overview of which features demonstrate the highest reliability.
Table 1: Overall Repeatability and Reproducibility of Radiomic Features Across Key Studies
| Study Context | Total Features Analyzed | % with Good Repeatability (ICC > 0.9) | % with Good Reproducibility (ICC > 0.9) | Most Stable Feature Classes | Least Stable Feature Classes |
|---|---|---|---|---|---|
| Non-Small Cell Lung Cancer (Clinical CT) [3] | 1080 | 82% | 19% | First-order statistics; Wavelet features | Texture features |
| Multi-Scanner Phantom (CT) [3] | 1080 | 45% (across 3 scanners) | 14% (inter-scanner) | Laplacian of Gaussian (LoG); Wavelet | Texture features |
| Novel CBCT (Organic Phantoms) [4] | 107 | ~98-100% (test-retest) | ~66-97% (reposition/rotation) | Shape; First-order | Second-order (texture) |
| MR-Linac (Phantom, FLAIR sequence) [5] | 91 | 51.65% (longitudinal) | 62.64% (inter-platform) | Features from FLAIR sequences | Features from T1W sequences |
Table 2: Impact of Specific Variables on Feature Reproducibility (Clinical CT Cohort) [3]
| Variable Tested | Protocol Details | % of Features with Good Reproducibility (ICC > 0.9) | Key Finding |
|---|---|---|---|
| Slice Thickness | 2 mm vs. 5 mm | 47% | Over half of features were sensitive to a change in slice thickness. |
| IV Contrast | With vs. Without | 14% | A majority of features were sensitive to contrast administration, an even more pronounced effect than the change in slice thickness. |
| Inter-Observer Variability | Different segmenting radiologists [6] | >97% (209/214 features with ICC ≥ 0.8) | Software-derived features can be highly reproducible when segmentation is consistent. |
A rigorous assessment of radiomic feature stability requires controlled experiments. The following are detailed methodologies for key types of stability studies cited in the comparative data.
Objective: To quantify the intrinsic noise level of radiomic features under identical imaging conditions [1] [7].
Protocol Details (as used in a clinical cohort for NSCLC [3]):
Objective: To evaluate feature stability across different imaging platforms, simulating a multi-institutional setting [3].
Protocol Details (Phantom Study [3]):
Objective: To provide a practical alternative to test-retest imaging for assessing feature repeatability when re-scanning patients is not feasible [7].
Protocol Details (Breast Cancer MRI Study [7]):
The following diagram illustrates the logical relationship between the key concepts, assessment methods, and goals in evaluating radiomic feature stability.
Table 3: Key Research Reagents and Solutions for Radiomic Stability Studies
| Item Name | Function/Application | Example in Context |
|---|---|---|
| Radiomic Phantom | Serves as a stable, known reference object to isolate technical variability from biological variance. | The American College of Radiology (ACR) MRI phantom is used for standardized testing of MR-Linac systems [5]. Custom-textured phantoms simulate tumor heterogeneity for CT studies [3]. |
| Organic Low-Contrast Phantoms | Provides biologically realistic texture and density for more clinically relevant stability testing. | Scans of fruits like apples, oranges, and onions are used to evaluate novel Cone-Beam CT (CBCT) systems, testing feature stability across scan-presets and repositioning [4]. |
| Feature Extraction Software | High-throughput computational pipelines that convert medical images into quantitative data. | Software must be documented with name and version. The Image Biomarker Standardization Initiative (IBSI) provides reference values to standardize outputs across different platforms [2] [5]. |
| Stability Analysis Scripts | Code for calculating statistical metrics of feature stability, such as ICC and the Concordance Correlation Coefficient (CCC). | In-house or published scripts in R or Python are used to compute ICC values from test-retest or perturbation data, with a typical threshold of ICC > 0.9 for defining stable features [3] [7]. |
| Image Perturbation Algorithm | Generates simulated "re-test" images through controlled deformations, providing an alternative to physical test-retest. | Algorithms apply random translations, rotations, and contour randomizations to a single dataset, enabling repeatability analysis without additional patient scans [7]. |
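The ICC thresholds cited throughout (e.g., ICC > 0.9) are typically computed with validated packages, but the calculation itself is compact. Below is a minimal Python sketch of a two-way, single-measures ICC for paired test-retest data; the synthetic feature values are purely illustrative, and in practice a validated implementation (e.g., `pingouin.intraclass_corr`) should be preferred.

```python
import numpy as np

def icc_test_retest(test, retest):
    """Two-way, single-measures ICC(3,1) for paired test-retest values.

    A minimal sketch; a validated package should be used in real analyses.
    """
    data = np.stack([np.asarray(test, float), np.asarray(retest, float)], axis=1)
    n, k = data.shape                           # subjects x sessions (k = 2)
    subj_means = data.mean(axis=1)
    sess_means = data.mean(axis=0)
    grand = data.mean()
    # Mean squares from the two-way ANOVA decomposition
    ms_subj = k * np.sum((subj_means - grand) ** 2) / (n - 1)
    ss_err = np.sum((data - subj_means[:, None] - sess_means[None, :] + grand) ** 2)
    ms_err = ss_err / ((n - 1) * (k - 1))
    return (ms_subj - ms_err) / (ms_subj + (k - 1) * ms_err)

# Features with ICC above the chosen threshold would be retained as "stable"
rng = np.random.default_rng(0)
signal = rng.normal(0, 1, 50)                       # synthetic feature values
stable = icc_test_retest(signal, signal + rng.normal(0, 0.1, 50))
noisy = icc_test_retest(signal, rng.normal(0, 1, 50))
print(stable > 0.9, noisy < 0.9)
```

The same function applies unchanged to perturbation-derived pseudo-retest values, which is why the two workflows share their downstream statistics.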
A clear and empirically grounded understanding of repeatability and reproducibility is non-negotiable for advancing radiomics from research to clinical decision-support systems. The body of evidence demonstrates that while a substantial number of radiomic features exhibit good repeatability under identical conditions, their reproducibility across the variable landscape of clinical imaging is significantly more challenging. The consistent trend across studies is that first-order and intensity-histogram-based features tend to be the most stable, while textural features are more susceptible to variation. Furthermore, technical factors like slice thickness and scanner model have a profound impact on feature values.
Therefore, the routine incorporation of stability analyses—using either test-retest, multi-scanner phantom studies, or computationally efficient image perturbation—is a critical step in the radiomic workflow. Filtering for robust features that demonstrate both high repeatability and reproducibility is the most reliable path toward building predictive models that will generalize effectively across institutions and ultimately fulfill the promise of radiomics in personalized medicine.
In the pursuit of precision medicine, biomarkers derived from high-content technologies such as radiomics and omics platforms have emerged as powerful tools for diagnosis, prognosis, and treatment selection. However, their translation into clinical practice hinges critically on a fundamental property: feature stability. Feature stability, encompassing both repeatability (consistency under identical conditions) and reproducibility (consistency across varying conditions), serves as the foundational requirement for developing clinically applicable predictive models [1] [8]. The profound limitations of prematurely adopted biomarkers, exemplified by the historical case of the dexamethasone suppression test for major depressive disorder, underscore the necessity of rigorous validation before clinical implementation [9]. This guide provides a comparative analysis of experimental approaches for assessing feature stability, detailing protocols, and synthesizing empirical data to inform researchers and drug development professionals in navigating the critical path from biomarker discovery to clinical translation.
Experimental Protocol: The test-retest methodology is considered the reference standard for evaluating radiomic feature repeatability. The protocol involves scanning the same subject (patient or phantom) twice within a short time interval, without changes to the patient's position or the imaging equipment [7] [1]. For example, in a study using organic phantoms, researchers acquired repeated Cone-Beam CT (CBCT) scans without any changes ("re-test"), followed by scans after repositioning ("reposition-test"), and finally after a 90° rotation ("90°-test") [10] [4]. Features are then extracted from both imaging sessions, and their stability is quantified using statistical measures of agreement.
Key Stability Metrics:
Experimental Protocol: When test-retest imaging is not feasible due to resource constraints or patient dose concerns, image perturbation offers a viable alternative. This method involves computationally generating "pseudo-retest" images by applying controlled variations to the original images. Common perturbations include [7]:
The stability of features across these perturbed images is then assessed using the same metrics (ICC, CCC) as in test-retest studies.
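To make the perturbation idea concrete, here is a minimal Python sketch that generates simulated "re-test" images via translation, rotation, and noise injection using `scipy.ndimage`; the contour-randomization step from the cited protocol is omitted, and all parameter values and the noise-estimation heuristic are illustrative assumptions.

```python
import numpy as np
from scipy import ndimage

def perturb(image, shift_px=(0.0, 0.0), angle_deg=0.0, noise_scale=0.0, rng=None):
    """Return one simulated 're-test' image: translate, rotate, add noise.

    A minimal sketch of the perturbation idea; contour randomization via
    displacement fields, as used in the cited studies, is not implemented.
    """
    rng = rng or np.random.default_rng()
    out = ndimage.shift(image, shift_px, order=3, mode="nearest")
    out = ndimage.rotate(out, angle_deg, reshape=False, order=3, mode="nearest")
    if noise_scale > 0:
        # crude estimate of the image's own noise level via a median residual
        sigma = noise_scale * np.std(image - ndimage.median_filter(image, size=3))
        out = out + rng.normal(0.0, sigma, image.shape)
    return out

# One pseudo-retest per perturbation setting; features would be re-extracted
# from every copy and ICC/CCC computed across the stack.
img = np.random.default_rng(1).normal(0, 1, (64, 64))
copies = [perturb(img, (0.4, -0.4), a, n) for a in (-20, 0, 20) for n in (0, 1)]
print(len(copies))
```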
A direct comparison on a breast cancer dataset revealed both similarities and important distinctions between these two methods [7].
Table 1: Comparative Performance of Test-Retest vs. Image Perturbation
| Aspect | Test-Retest Imaging | Image Perturbation |
|---|---|---|
| Basis of Stability Assessment | Real-world biological and technical variations from actual rescanning [7] | Simulated technical variations from software-driven alterations [7] |
| Feature Repeatability | Generally lower and more conservative ICC values [7] | Systematically higher ICC values; more lenient [7] |
| Correlation of Results | Strong correlation (Pearson r = 0.79) with perturbation results, suggesting overlap in identified stable features [7] | Strong correlation with test-retest, but agrees on a limited set of highly stable features [7] |
| Model Reliability | Models trained on its stable features (ICC ≥ 0.9) showed high testing AUC (~0.77) and prediction ICC (>0.9) [7] | Achieved similar optimal model reliability (testing AUC ~0.76, prediction ICC >0.9) at the same ICC threshold [7] |
| Practical Application | Recommended when feasible and ethically justified [1] | Recommended as a necessary component when test-retest is not feasible [7] |
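The correlation and stable-set agreement summarized in the table can be computed directly from per-feature ICC vectors. The sketch below uses synthetic ICC values (with a small positive bias for the perturbation method, mirroring its more lenient tendency); the numbers are illustrative, not the study's data.

```python
import numpy as np

# Synthetic per-feature ICC vectors standing in for the two assessment methods
rng = np.random.default_rng(2)
icc_tr = rng.beta(5, 2, 200)                                   # test-retest ICCs
icc_pt = np.clip(icc_tr + rng.normal(0.05, 0.08, 200), 0, 1)   # perturbation ICCs

# Pearson correlation between the two ICC profiles (the study reported r = 0.79)
r = np.corrcoef(icc_tr, icc_pt)[0, 1]

# Agreement on which features count as "stable" at the ICC >= 0.9 threshold
stable_tr = icc_tr >= 0.9
stable_pt = icc_pt >= 0.9
overlap = (stable_tr & stable_pt).sum() / max((stable_tr | stable_pt).sum(), 1)
print(f"Pearson r = {r:.2f}; Jaccard overlap of stable sets = {overlap:.2f}")
```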
The stability of radiomic features is not uniform; it varies significantly based on feature class, imaging modality, and anatomical region.
Table 2: Feature Stability by Class in Novel CBCT Imaging (Based on Phantom Studies)
| Feature Class | Re-Test Stability (CCC >0.90) | Reposition-Test Stability (CCC >0.90) | 90° Rotation-Test Stability (CCC >0.90) |
|---|---|---|---|
| Shape Features | 100.0% | 97.0% | 86.3% |
| First-Order Features | 98.1% | 90.3% | 75.9% |
| Second-Order/Texture Features | 98.4% | 96.2% | 65.8% |
Data adapted from Willam et al. [10] [4]
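The CCC thresholds in the table above refer to Lin's concordance correlation coefficient, which, unlike Pearson's r, penalizes systematic offsets and scale differences between paired measurements. A minimal sketch:

```python
import numpy as np

def lin_ccc(x, y):
    """Lin's concordance correlation coefficient between paired measurements."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    mx, my = x.mean(), y.mean()
    vx, vy = x.var(), y.var()                 # population variances
    cov = ((x - mx) * (y - my)).mean()
    return 2 * cov / (vx + vy + (mx - my) ** 2)

a = np.array([1.0, 2.0, 3.0, 4.0])
print(lin_ccc(a, a))                # perfect agreement -> 1.0
print(lin_ccc(a, a + 1.0) < 1.0)    # a constant offset is penalized, unlike Pearson r
```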
A study on brain PET imaging evaluated the reproducibility of 93 features across six classes under different Partial Volume Correction (PVC) methods [11].
Table 3: Reproducible Radiomic Features in Brain PET (ICC ≥ 0.75)
| PVC Method | Most Reproducible Feature Classes | Least Reproducible Feature Classes | High-Reproducibility Regions (ICC ≥ 0.9) | Low-Reproducibility Regions (ICC < 0.5) |
|---|---|---|---|---|
| Reblurred Van Cittert (RVC) | GLCM, GLDM | First Order, NGTDM | Cerebellum, Lingual Gyrus | Fusiform Gyrus, Brainstem |
| Richardson-Lucy (RL) | GLCM, GLDM | First Order, NGTDM | Cerebellum, Lingual Gyrus | Fusiform Gyrus, Brainstem |
| Multi-Target Correction (MTC) | (Overall lowest reproducibility) | (Overall highest variability) | - | - |
Data synthesized from Gaj et al. [11]. GLCM: Gray Level Co-occurrence Matrix; GLDM: Gray Level Dependence Matrix; NGTDM: Neighborhood Gray Tone Difference Matrix.
A provocative perspective emerging in radiomics research challenges the dogma that individual feature reproducibility is an absolute prerequisite for predictive modeling. This view argues that predictive information can be distributed across multiple correlated features, much like the parable of the blind men and the elephant, where each person touches a different part but cannot comprehend the whole animal [12].
Experimental Evidence: An experiment mimicking a test-retest scenario using slices from MRI and CT datasets demonstrated that features classified as "nonreproducible" could still contribute significantly to model performance [12]. In some datasets (e.g., Desmoid), models trained exclusively on nonreproducible features outperformed those trained only on reproducible features, especially at certain reproducibility thresholds (CCC ~0.75). This suggests that rigidly filtering out nonreproducible features may sometimes discard valuable predictive information, as the underlying signal is captured by the collective behavior of features rather than the stability of any single one [12].
The challenge of high-dimensional data (where features far exceed samples) has spurred the development of advanced machine learning methods that integrate stability directly into the feature selection process.
Stabl: A Machine Learning Framework for Sparse, Reliable Biomarkers

Protocol: Stabl is an algorithm designed to identify a minimal set of highly reliable biomarkers from large omic datasets (e.g., transcriptomics, metabolomics) [13]. Its workflow integrates noise injection and a data-driven signal-to-noise threshold:
Performance: Benchmarking on synthetic and real-world datasets showed that Stabl achieves superior sparsity and reliability (lower false discovery rate) compared to traditional methods like Lasso and Stability Selection, while maintaining predictive performance. It can distill datasets of 1,400–35,000 features down to a concise set of 4–34 candidate biomarkers [13].
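A simplified sketch of the noise-injection idea follows (this is not the authors' published implementation): append random decoy features, estimate selection frequencies over bootstrapped sparse fits, and keep only the real features that outrank every decoy. All dimensions and the Lasso penalty are illustrative choices.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(4)
n, p_real, p_decoy = 120, 30, 30
X_real = rng.normal(0, 1, (n, p_real))
y = X_real[:, 0] - 2 * X_real[:, 1] + rng.normal(0, 0.5, n)   # 2 true signals
X = np.hstack([X_real, rng.normal(0, 1, (n, p_decoy))])       # decoys appended

# Selection frequency of each feature over bootstrapped sparse fits
counts = np.zeros(X.shape[1])
B = 100
for _ in range(B):
    idx = rng.integers(0, n, n)                 # bootstrap resample
    model = Lasso(alpha=0.1).fit(X[idx], y[idx])
    counts += model.coef_ != 0
freq = counts / B

# Data-driven threshold: the best frequency any pure-noise decoy achieves
threshold = freq[p_real:].max()
selected = np.flatnonzero(freq[:p_real] > threshold)
print(sorted(selected.tolist()))
```

Because the decoys are noise by construction, their best selection frequency gives an empirical ceiling for chance selection, so the threshold adapts to the dataset rather than being fixed in advance.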
The following diagram illustrates the core workflow and conceptual advance of the Stabl framework compared to a traditional stability analysis workflow.
Table 4: Key Research Reagent Solutions for Stability Assessment
| Tool / Resource | Function in Stability Research | Example Use Case |
|---|---|---|
| Organic Phantoms | Mimic low-contrast human tissue for controlled, repeatable imaging without patient variability. | Assessing baseline stability of radiomic features across different scanner presets (e.g., head, pelvis) [10] [4]. |
| Image Biomarker Standardisation Initiative (IBSI) | Provides standardized definitions and formulas for calculating radiomic features to enable cross-study comparisons. | Harmonizing feature extraction across multiple research sites to improve reproducibility [1] [8]. |
| PyRadiomics | An open-source Python package for the extraction of a large set of radiomic features from medical images. | High-throughput batch processing of images to generate feature values for subsequent stability analysis [12] [11]. |
| Stabl Algorithm | A machine learning method that integrates noise injection to select sparse, reliable biomarker sets from high-dimensional data. | Distilling thousands of omic features (proteomic, metabolomic) into a shortlist of high-confidence candidate biomarkers [13]. |
| Stability Metrics (ICC/CCC) | Statistical measures to quantify the agreement or consistency between repeated measurements. | Classifying features as "stable" or "unstable" based on a predefined threshold (e.g., ICC > 0.9) [7] [10] [11]. |
The journey toward clinically translatable biomarkers is complex and demands a rigorous, multi-faceted approach to feature stability. This guide has outlined the critical methodologies, from the reference standard of test-retest imaging to the practical alternative of image perturbation and the advanced computational approach of tools like Stabl. The empirical data consistently show that stability is highly dependent on context—feature class, imaging modality, and processing parameters all play a decisive role.
The emerging evidence that nonreproducible features can retain predictive power in multivariable models does not negate the importance of stability but rather reframes it. It underscores the need to move beyond evaluating features in isolation and toward assessing the stability and validity of the entire predictive model [12]. Future research must focus on robust validation in multi-institutional settings, the development of standardized, IBSI-compliant pipelines, and the creation of large, representative public datasets. By adhering to these principles and leveraging the tools and data presented, researchers can significantly enhance the reliability and clinical utility of biomarker-driven medicine.
Radiomics, the high-throughput extraction of quantitative features from medical images, has emerged as a promising field for developing non-invasive biomarkers in oncology and beyond [14] [8]. A fundamental principle that has guided radiomics research is that for a feature to be clinically useful, it must first be reproducible—stable across test-retest scenarios, different scanners, acquisition protocols, and reconstruction settings [14] [1]. This paradigm has led to the widespread practice of filtering out "nonreproducible" features before model development, based on metrics like the Intra-class Correlation Coefficient (ICC) or Concordance Correlation Coefficient (CCC) [7] [15].
However, a paradigm shift is emerging in the radiomics community, challenging the notion that individual feature reproducibility should be the primary gatekeeper for clinical translation. Growing evidence suggests that the relationship between feature reproducibility and predictive performance is more complex than previously assumed [12]. This guide examines this shifting landscape through a critical assessment of current evidence, methodological approaches, and the non-linear relationship between technical stability and clinical utility.
The conventional wisdom in radiomics prioritizes feature reproducibility based on sound scientific principles. Nonreproducible features are considered unreliable for clinical decision-making because they may vary unexpectedly when imaging conditions change, leading to inconsistent predictions [12]. Table 1 summarizes major sources of variability affecting radiomic feature reproducibility.
Table 1: Sources of Variability in Radiomic Feature Extraction
| Variability Category | Specific Examples | Impact on Features |
|---|---|---|
| Image Acquisition | Scanner manufacturer, protocol settings, kVp, mA (CT), magnetic field strength (MRI) | Affects noise, resolution, and signal characteristics [14] [1] |
| Image Reconstruction | Algorithms, kernels, slice thickness, iterative vs. filtered back projection | Influences texture and noise patterns [14] [15] |
| Segmentation | Manual vs. automated, inter-observer variability, contouring methods | Alters region of interest, affecting all extracted features [1] [8] |
| Feature Extraction | Software implementation, parameter settings, preprocessing filters | Causes systematic differences in feature values [1] [15] |
The radiomics community has developed rigorous methodologies to assess feature reproducibility:
Test-Retest Imaging: The gold standard approach where patients are scanned twice within a short interval using the same acquisition protocol [7] [1]. Features are then evaluated using ICC, with typical thresholds of ICC ≥ 0.75 or 0.8 indicating good reproducibility [1].
Image Perturbation: A computational alternative that applies simulated variations to images, including random translations, rotations, noise addition, and contour randomizations [7]. This method is particularly valuable when test-retest data is unavailable due to clinical or ethical constraints.
Phantom Studies: Using physical phantoms with known characteristics to evaluate feature stability across different scanners and protocols [1].
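Whichever of these assessment methods produces the reliability estimates, the downstream filtering step is the same: retain features whose ICC clears a chosen threshold. A minimal sketch, with invented feature names and ICC values:

```python
import pandas as pd

# Hypothetical per-feature ICC values; names and numbers are illustrative only.
icc = pd.Series({
    "firstorder_Mean": 0.97, "firstorder_Entropy": 0.92,
    "shape_Sphericity": 0.99, "glcm_Contrast": 0.71,
    "ngtdm_Coarseness": 0.55, "wavelet_LLH_glcm_Idm": 0.84,
})

# Stricter thresholds shrink the candidate pool, which is why threshold choice
# (0.75, 0.8, or 0.9) materially changes the downstream model.
for threshold in (0.75, 0.8, 0.9):
    kept = icc[icc >= threshold].index.tolist()
    print(f"ICC >= {threshold}: {len(kept)} features kept -> {kept}")
```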
A critical insight driving the paradigm shift is the understanding that reproducibility and predictiveness are independent properties of radiomic features [12]. A highly reproducible feature may have no predictive value for a specific clinical endpoint, while a feature with moderate reproducibility might be highly predictive.
Experimental evidence from multiple studies demonstrates this phenomenon. In a systematic investigation across four radiomic datasets (Lipo, Desmoid, CRLM, and GIST), researchers found that filtering features based on reproducibility thresholds did not consistently improve predictive performance [12]. In the Desmoid dataset, models trained exclusively on nonreproducible features (CCC < 0.75) outperformed those using reproducible features, achieving higher Area Under the Curve (AUC) values [12].
The limitations of evaluating features in isolation have been illustrated through a powerful analogy [12]. Consider determining if an elephant is in a house by checking multiple rooms (features). If the elephant moves between rooms between measurements, individual room checks will show poor reproducibility, yet the collective information perfectly indicates the elephant's presence. Similarly, in radiomics, predictive information may be distributed across multiple features rather than confined to individual, highly stable features [12].
Table 2 compares the performance of models built using reproducibility features identified through test-retest versus image perturbation methods across four classifiers [7].
Table 2: Model Performance Comparison Based on Reproducibility Assessment Method
| Classifier | Feature ICC Threshold | Testing AUC (Perturbation) | Testing AUC (Test-Retest) | Prediction ICC (Test-Retest) |
|---|---|---|---|---|
| Logistic Regression | 0.9 | 0.76 | 0.77 | 0.87 |
| Logistic Regression | 0.95 | 0.75 | 0.59 | Significant drop |
| SVM | 0.9 | Variable | Variable | > 0.9 |
| Random Forest | 0.9 | Variable | Variable | > 0.9 |
The data reveals that while both methods can achieve good predictive performance (AUC 0.7-0.8) and robustness (prediction ICC > 0.9) at optimal ICC thresholds, test-retest models experience significant performance degradation at very strict reproducibility thresholds (ICC = 0.95) [7]. This suggests that overemphasizing individual feature reproducibility can eliminate valuable predictive information.
Feature reproducibility shows significant dependence on imaging modality, anatomical region, and processing techniques [11]. In brain PET imaging, the choice of partial volume correction method dramatically affects feature reproducibility. The Reblurred Van Cittert (RVC) and Richardson-Lucy (RL) methods demonstrated the best reproducibility, with over 60% of features having Coefficient of Variation (COV) < 25% and ICC ≥ 0.75 [11]. Gray Level Co-occurrence Matrix (GLCM) and Gray Level Dependence Matrix (GLDM) features were most stable across regions, while first-order and Neighborhood Gray Tone Difference Matrix (NGTDM) features showed highest variability [11].
The standard protocol for test-retest reproducibility analysis involves [7] [1]:
As an alternative to test-retest, image perturbation protocols include [7]:
Recent evidence suggests an alternative workflow that [12] [15]:
The following workflow diagram illustrates this alternative approach:
Table 3 catalogues essential tools and methodologies for conducting reproducibility-predictiveness investigations in radiomics.
Table 3: Essential Research Toolkit for Radiomics Reproducibility Studies
| Tool Category | Specific Tools/Methods | Function/Purpose |
|---|---|---|
| Feature Extraction Software | PyRadiomics [15], IBSI-compliant tools [8] | Standardized extraction of radiomic features according to consensus definitions |
| Reproducibility Metrics | ICC, CCC, COV [7] [15] [11] | Quantifying feature stability across repeated measurements |
| Perturbation Libraries | Custom Python/ROI perturbation algorithms [7] | Simulating realistic image variations for robustness assessment |
| Public Datasets | WORC database [12], TCIA [16] | Access to test-retest and multi-institutional data for validation |
| Statistical Analysis | Linear mixed models [16], Delong test [16] | Comparing model performance across different feature selection strategies |
The emerging evidence suggests a revised framework for radiomics clinical translation that balances reproducibility concerns with predictive performance optimization:
This framework emphasizes:
The radiomics field is undergoing a necessary paradigm shift from a narrow focus on individual feature reproducibility toward a more holistic approach that prioritizes clinical predictive performance. While technical robustness remains essential, the evidence suggests that strict adherence to reproducibility thresholds may eliminate valuable predictive information distributed across multiple features [12].
This guide demonstrates through comparative analysis that the most effective path forward involves:
As the field matures, this balanced approach promises to enhance the clinical translation of radiomic biomarkers while maintaining scientific rigor, ultimately fulfilling the promise of quantitative imaging in personalized medicine.
Radiomics has emerged as a transformative approach in quantitative medical imaging, extracting sub-visual data from conventional images to create mineable feature spaces that can inform clinical decision-making [8]. This high-throughput extraction of quantitative features from medical images aims to identify biomarkers that can predict diagnosis, prognosis, and treatment response across various cancers [1] [18]. However, the distribution of information across multiple feature interactions presents substantial challenges for clinical translation, primarily due to questions about reliability and reproducibility.
The fundamental premise of radiomics rests on converting standard-of-care images into high-dimensional data through automated feature extraction [8]. These features—including morphological characteristics, first-order statistics, and higher-order textural patterns—theoretically encode information about tumor phenotype and microenvironment that surpasses human visual assessment [18]. Yet this very strength creates a critical vulnerability: the stability of information distributed across these feature interactions directly determines whether radiomic signatures can reliably transition from research environments to clinical practice.
Within the broader context of test-retest reliability research, understanding how information distributes across feature interactions requires examining both the sources of variability and methodological approaches to quantify robustness. The clinical imperative is clear: only reproducible and repeatable features should be incorporated into models intended for patient care decisions [1] [19]. This comparison guide objectively evaluates the experimental approaches, performance data, and methodological standards for assessing feature reliability across multiple interaction contexts.
Test-Retest Imaging Protocol The traditional test-retest approach involves repeatedly scanning the same subject within a short time interval using identical acquisition parameters. In practice, this requires patients to undergo additional scanning sessions, typically with a 15-minute to 2-day interval between scans [1] [20]. For example, in lung cancer studies, the RIDER dataset contains repeat CT scans taken 15 minutes apart for 31 patients, providing a benchmark for test-retest analysis [20]. The fundamental requirement is maintaining consistent imaging parameters (scanner model, acquisition protocol, reconstruction algorithms) between scans to isolate biological stability from technical variability.
Image Perturbation Protocol As an alternative to physical rescanning, image perturbation uses computational methods to simulate variations encountered during image acquisition and segmentation [21]. The validated protocol involves systematic modifications to original images and segmentations: translational shifts (0, 0.4, and 0.8 pixels), rotational changes (-20°, 0°, and 20°), random noise additions (0, 1, 2, and 5 times original noise levels), and contour randomizations via displacement fields [21]. Typically, 40-60 different perturbation combinations are generated to robustly estimate feature stability, with intraclass correlation coefficients (ICCs) calculated across perturbations to quantify repeatability [7] [21].
Phantom-Based Stability Testing Phantom studies provide a controlled approach to feature stability assessment, using either synthetic or organic materials scanned under varying parameters [22]. The experimental design involves scanning phantoms across different scanners (e.g., Philips Gemini TF16, Philips Gemini TF64, GE Discovery NM 570) with systematic variation in acquisition parameters (tube current, slice thickness, reconstruction kernels) [3] [22]. For example, one study utilized apples, kiwis, limes, and onions as organic phantoms, scanning each at 10 mAs, 50 mAs, and 100 mAs with 120-kV tube current to evaluate feature stability across imaging parameters [22].
Table 1: Methodological Comparison Between Reliability Assessment Approaches
| Parameter | Test-Retest Imaging | Image Perturbation | Phantom Studies |
|---|---|---|---|
| Clinical Burden | High (additional scans, patient radiation exposure) | Low (computational only) | None (no patient involvement) |
| Resource Requirements | Significant (scanner time, personnel) | Minimal (computational resources) | Moderate (scanner time, phantom materials) |
| Sample Size Considerations | Typically limited (patient availability) | Virtually unlimited (can use existing data) | Flexible (depends on phantom availability) |
| Realism for Human Tissue | High (actual human pathophysiology) | Moderate (simulated variations) | Variable (depends on phantom design) |
| Quantification Metric | Intraclass correlation coefficient (ICC) | Intraclass correlation coefficient (ICC) | Concordance correlation coefficient (CCC), ICC |
| Assessment Scope | Position, noise, biological variations | Position, noise, segmentation variations | Scanner, acquisition parameter variations |
| Implementation in Multi-center Studies | Challenging (protocol harmonization) | Straightforward (standardized algorithms) | Moderate (phantom distribution needed) |
Table 2: Quantitative Reliability Performance Across Assessment Methods
| Feature Category | Test-Retest Reliability (% with ICC > 0.9) | Perturbation Reliability (% with ICC > 0.9) | Phantom Reliability (% with ICC > 0.9) |
|---|---|---|---|
| First-Order Features | 78% [22] | 78% (wavelet-filtered) [3] | 78% [22] |
| Shape Features | 100% [22] | 65% [21] | 100% [22] |
| Texture Features | 63% [22] | 47% (LoG-filtered) [3] | 63% [22] |
| Wavelet Features | Not reported | 59% [3] | Not reported |
| Overall Features | 70% [22] | 34% (470/1395 features) [21] | 45-61% (scanner-dependent) [3] |
The reproducibility of radiomic features demonstrates significant dependence on image acquisition parameters, with slice thickness emerging as a particularly influential factor. In clinical cohort studies, changes in slice thickness resulted in poor reproducibility for 37% of features, while intravenous contrast administration affected 45% of features [3]. This parameter sensitivity varies substantially by feature class, with first-order features generally demonstrating higher stability compared to textural features under parameter variations.
Scanner variability represents another critical factor in feature reproducibility. Inter-scanner comparisons reveal substantially lower reproducibility compared to intra-scanner assessments, with only 14% of features maintaining good reproducibility (ICC > 0.9) across different scanner models [3]. The percentage of stable features decreases progressively with increasing protocol complexity: 30% maintain stability under intra-scanner variations, 19% across clinical protocol changes, and only 13% demonstrate combined repeatability and reproducibility across all tested conditions [3].
Different feature classes exhibit distinct stability profiles across test-retest and perturbation assessments. First-order statistics consistently demonstrate higher repeatability, with 78% of first-order features showing excellent test-retest stability (CCC > 0.9) in phantom studies [22]. Shape features show perfect stability (100%) in test-retest phantom experiments but reduced stability (65%) under perturbation conditions that include contour randomization [22] [21].
Texture features present the greatest variability, with only 63% demonstrating excellent test-retest stability [22]. Among filtered features, wavelet and Laplacian of Gaussian (LoG)-filtered features show moderate stability, with 59% of wavelet and 46% of LoG features maintaining ICC > 0.9 under perturbation testing [3]. These patterns underscore that reliability is strongly modulated by feature class, with first-order and shape features generally providing more dependable information channels than texture features.
Radiomics Reliability Assessment Workflow: This diagram illustrates the integration of reliability assessment methods within the standard radiomics pipeline, highlighting how test-retest, perturbation, and phantom studies feed into feature filtering and model validation stages.
The Image Biomarker Standardization Initiative (IBSI) represents a critical response to reproducibility challenges in radiomics, establishing consensus guidelines for image preprocessing and feature extraction [19]. This international collaboration has developed standardized definitions for computational phantoms, image processing techniques, and feature extraction methodologies to enable cross-study comparisons [21]. The initiative provides reference values for verified features, creating a framework for calibrating different radiomics software implementations against established standards.
IBSI compliance has become increasingly important for methodological rigor in radiomics research. Studies adhering to IBSI guidelines demonstrate improved interoperability between different feature extraction platforms [19]. For example, comparative analyses between MATLAB toolkits and PyRadiomics implementations show that 29 out of 43 common features maintain high reproducibility (Spearman's rs > 0.8) when IBSI standards are followed [20]. This standardization is particularly crucial for textural features, which show the highest variability between software implementations without standardized calculation methods.
For multi-center studies implementing radiomic models, several harmonization strategies have emerged to address feature reproducibility challenges. These include prospective protocol harmonization across institutions, statistical harmonization methods such as ComBat, and feature preselection based on robustness databases [19] [21]. The establishment of feature robustness databanks (RF-RobustDB) provides curated collections of stable features across different cancer types and imaging modalities, enabling researchers to preselect features with known reliability profiles before model development [21].
These harmonization approaches have demonstrated tangible benefits for model generalizability. Studies utilizing preselected highly repeatable features from robustness databanks show improved concordance indices in external validation cohorts and reduced performance gaps between development and validation datasets [21]. This strategy effectively safeguards model performance when applied to new patient populations or imaging protocols, addressing a critical barrier to clinical implementation.
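The feature-preselection strategy described above can be sketched in a few lines. This is an illustrative example: the databank excerpt, the ICC values, and the `preselect` helper are hypothetical, though the feature names follow PyRadiomics naming conventions.

```python
import pandas as pd

# Hypothetical excerpt from a feature-robustness databank: per-feature
# reliability estimates (e.g., ICC across perturbations) curated for a given
# cancer type and imaging modality.
robustness_db = pd.DataFrame({
    "feature": ["original_shape_Sphericity", "original_firstorder_Mean",
                "original_glcm_Contrast", "wavelet-LLH_glrlm_RunEntropy"],
    "icc": [0.97, 0.93, 0.62, 0.81],
})

def preselect(feature_matrix: pd.DataFrame, db: pd.DataFrame,
              threshold: float = 0.9) -> pd.DataFrame:
    """Keep only columns whose databank ICC exceeds the reliability threshold."""
    keep = set(db.loc[db["icc"] > threshold, "feature"])
    return feature_matrix[[c for c in feature_matrix.columns if c in keep]]

# Usage: subset a patient-by-feature matrix before any model development.
X = pd.DataFrame(0.0, index=range(3), columns=robustness_db["feature"])
X_stable = preselect(X, robustness_db, threshold=0.9)
print(list(X_stable.columns))
```

Applying the filter before feature selection and model fitting ensures that only features with known reliability profiles enter the modeling pipeline, which is the mechanism behind the generalizability gains reported above.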
Table 3: Essential Research Tools for Radiomic Feature Reliability Assessment
| Tool Category | Specific Tools/Solutions | Primary Function | Key Considerations |
|---|---|---|---|
| Feature Extraction Platforms | PyRadiomics [22] [21], MATLAB Radiomics Toolkit [20] | Standardized implementation of feature calculation algorithms | IBSI compliance essential for reproducibility |
| Image Perturbation Software | Custom implementations based on Zwanenburg method [21] | Simulation of test-retest variations without additional scanning | Should include translation, rotation, noise, and contour perturbations |
| Reliability Quantification Metrics | Intraclass Correlation Coefficient (ICC) [7] [3], Concordance Correlation Coefficient (CCC) [22] | Statistical assessment of feature stability across repetitions | ICC > 0.9 typically indicates excellent repeatability |
| Phantom Materials | Organic phantoms (apples, kiwis, limes) [22], Synthetic radiomics phantoms | Controlled assessment of feature stability across scanners | Organic materials provide more realistic texture than uniform phantoms |
| Statistical Analysis Environments | R Statistics [22], Python SciPy/NumPy/Scikit-learn | Implementation of reliability statistics and machine learning models | Flexible programming environments enable custom analysis pipelines |
| Standardization Reference | IBSI Guidelines [19] [21], IBSI Reference Manual | Standardized definitions for image processing and feature calculations | Critical for cross-study comparisons and software validation |
Feature Stability Profiles: This diagram visualizes the differential stability of radiomic feature categories under various sources of variation, with first-order and shape features demonstrating higher reliability than texture features.
The distribution of information across multiple feature interactions in radiomics presents both opportunities and challenges for clinical translation. Experimental evidence indicates that image perturbation methods can achieve comparable model reliability to traditional test-retest approaches while overcoming practical limitations of physical rescanning [7]. The emerging methodology of radiomic feature robustness databanks offers a promising path toward standardized feature preselection, potentially improving model generalizability across institutions and imaging protocols [21].
For researchers and drug development professionals, the strategic selection of stable feature classes represents a critical consideration in model development. First-order and shape features provide higher reliability foundations, while texture features require more rigorous stability assessment before clinical implementation [22]. The integration of stability assessment directly into the radiomics pipeline—through perturbation methods or robustness databases—ensures that models built on distributed feature interactions maintain their predictive power when deployed in heterogeneous clinical environments.
As standardization initiatives like IBSI continue to mature and robustness databases expand across cancer types and imaging modalities, the field moves closer to reliable clinical implementation. Future research directions should focus on expanding multi-institutional validation of stable feature sets, developing automated stability assessment pipelines, and establishing clinical guidelines for feature selection based on reliability evidence. Through these efforts, the distribution of information across multiple feature interactions can transform from a source of variability to a foundation for robust, clinically actionable biomarkers.
Test-retest imaging is a foundational methodology for assessing the reliability and precision of quantitative imaging biomarkers (QIBs) and radiomic features, establishing a benchmark for their use in scientific research and clinical trials [23]. In this paradigm, the same subject is scanned twice within a short time interval, under identical or nearly identical conditions, assuming no biological change has occurred in the target metric between scans [24]. The resulting data allows researchers to quantify measurement error arising from the entire imaging chain, from scanner physics to image analysis algorithms.
Despite its conceptual status as a gold standard for evaluating feature repeatability, the practical application of test-retest imaging faces significant limitations. These constraints have spurred the development of alternative methodologies, such as image perturbation and no-gold-standard (NGS) evaluation techniques, which aim to provide practical reliability assessments when conventional test-retest is infeasible [7] [24]. This guide examines the technical execution, comparative performance, and practical challenges of these approaches within radiomics research, providing researchers with a framework for methodological selection.
A well-designed test-retest study requires rigorous standardization across multiple dimensions. The core protocol involves scanning the same participant twice, with a critical interval typically ranging from minutes to hours to several days, depending on the biological stability of the measured feature [25]. During this interval, every effort is made to maintain identical conditions for scanner type, acquisition protocol, patient preparation, and positioning to isolate technical measurement variability from biological change.
Key methodological steps include:

- Scanning each participant twice on the same scanner with an identical acquisition protocol and, as far as possible, identical patient preparation and positioning.
- Choosing a test-retest interval short enough that no biological change is expected in the measured feature.
- Segmenting the region of interest in both scans with the same method, and ideally the same operator or algorithm.
- Extracting the full radiomic feature set from both scans with identical software settings.
- Quantifying per-feature agreement between the two sessions, typically with the ICC or CCC.
For example, in a prospective cardiac MRI study investigating radiomic feature repeatability in myocardial T1 and T2 mapping, 50 healthy volunteers underwent two identical MRI examinations on the same day with a break of at least 20 minutes between sessions, using the same 1.5T scanner and identical sequences for both scans [26].
Image perturbation has emerged as a practical alternative to test-retest imaging, especially when repeated scanning is ethically concerning or resource-prohibitive [7]. This computational approach applies controlled, random variations to existing images or their segmentations to simulate the effects of acquisition variability.
Common perturbation techniques include:

- Rigid translations and rotations of the image or segmentation, simulating repositioning between scans.
- Addition of random noise, mimicking variations in image quality.
- Randomization of the ROI contour, simulating inter- and intra-observer segmentation variability.
- Combinations of the above, applied repeatedly with randomized parameters to generate a family of perturbed image-segmentation pairs.
The process involves generating multiple perturbed versions of each original image, followed by radiomic feature extraction from all variants. Intra-class correlation coefficient (ICC) or concordance correlation coefficient (CCC) is then calculated to quantify feature repeatability across perturbations [7]. A systematic workflow for this methodology is illustrated in Figure 1.
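The perturbation-and-ICC workflow just described can be sketched end to end. This is a minimal illustration on synthetic 2-D "lesions": the perturbation ranges, the toy feature (ROI mean intensity), and the one-way ICC formulation are stand-ins, not parameters from any cited protocol.

```python
import numpy as np
from scipy import ndimage

rng = np.random.default_rng(42)

def perturb(img, max_shift=2.0, max_angle=5.0, noise_sd=0.02):
    """One randomized perturbation: rigid translation + rotation + Gaussian noise.
    Parameter ranges are illustrative placeholders."""
    shifted = ndimage.shift(img, rng.uniform(-max_shift, max_shift, size=2),
                            order=1, mode="nearest")
    rotated = ndimage.rotate(shifted, rng.uniform(-max_angle, max_angle),
                             reshape=False, order=1, mode="nearest")
    return rotated + rng.normal(0.0, noise_sd, img.shape)

def icc_oneway(values):
    """ICC(1,1): rows = subjects, columns = repeated 'measurements' (perturbations)."""
    v = np.asarray(values, float)
    n, k = v.shape
    grand = v.mean()
    msb = k * ((v.mean(axis=1) - grand) ** 2).sum() / (n - 1)            # between-subject MS
    msw = ((v - v.mean(axis=1, keepdims=True)) ** 2).sum() / (n * (k - 1))  # within-subject MS
    return (msb - msw) / (msb + (k - 1) * msw)

# Hypothetical "lesions": smooth random 2-D patches standing in for ROI crops.
lesions = [ndimage.gaussian_filter(rng.normal(size=(32, 32)), 3) for _ in range(8)]
n_perturbations = 20
mean_feature = np.array([[perturb(img).mean() for _ in range(n_perturbations)]
                         for img in lesions])
print(f"ICC of mean intensity across perturbations: {icc_oneway(mean_feature):.3f}")
```

In a real pipeline the toy feature extraction would be replaced by a full PyRadiomics run on each perturbed image-mask pair, with the same ICC computation applied per feature.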
The no-gold-standard evaluation (NGSE) framework represents a more recent statistical approach that estimates measurement precision without repeated scans or a known ground truth [24]. This technique operates on the fundamental assumption that measurements from multiple different methods (e.g., segmentation algorithms) are linearly related to the true (but unknown) quantitative values, with method-specific noise characteristics.
The NGSE methodology involves:

- Applying multiple measurement methods (e.g., different segmentation algorithms) to the same set of patient scans.
- Modeling each method's output as linearly related to the true, unobserved quantitative value, with method-specific noise.
- Estimating each method's linear coefficients and noise term, typically by maximum-likelihood estimation, to yield a method-specific precision estimate (σₖ).
- Ranking the candidate methods by estimated precision, without requiring repeated scans or a ground truth.
Table 1: Core Components of Reliability Assessment Methodologies
| Component | Test-Retest Imaging | Image Perturbation | No-Gold-Standard Evaluation |
|---|---|---|---|
| Primary Data Source | Repeated actual scans | Modified single scans | Multiple algorithms on single scans |
| Key Assumptions | No biological change between scans | Perturbations mimic real variability | Linear measurement relationships |
| Primary Output Metrics | ICC, coefficient of variation (CV) | ICC, CCC | Method precision (σₖ) |
| Biological Variability Capture | Yes | Partial | No |
| Resource Requirements | High | Low | Moderate |
Direct comparisons between test-retest and perturbation methodologies reveal both convergence and divergence in their assessments of feature reliability. In a comprehensive study using a 191-patient public breast cancer dataset with 71 test-retest scans, researchers evaluated radiomic model reliability based on repeatable features identified by both methods [7].
The study found that image perturbation systematically identified more features as repeatable compared to test-retest evaluation. Specifically, among 1120 volume-independent radiomic features, only 143 showed lower ICC under image perturbation than test-retest, with a strong correlation (Pearson r = 0.79) between the two ICC measures [7]. This systematic difference highlights how perturbation may capture different aspects of variability compared to actual test-retest imaging.
In terms of predictive model performance, filtering features by repeatability improved both internal generalizability (testing AUC) and robustness (prediction ICC) for both methods. The optimal reliability was achieved at an ICC threshold of 0.9 for both approaches, with testing AUC = 0.7-0.8 and prediction ICC > 0.9 [7]. However, at higher thresholds (ICC = 0.95), the test-retest model showed significant performance drops while perturbation-based models maintained more stable performance.
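The repeatability-filtering strategy evaluated in these comparisons can be sketched as follows. This is a synthetic illustration, not the cited breast cancer dataset: the feature matrix, outcome, and per-feature ICC values are simulated, with the predictive signal deliberately placed among high-ICC features.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(7)
n, p = 200, 30

# Hypothetical setup: patient-by-feature matrix, binary outcome, and per-feature
# ICCs as they would come from a test-retest or perturbation analysis.
X = rng.normal(size=(n, p))
y = (X[:, :5].sum(axis=1) + rng.normal(size=n)) > 0   # signal in the first 5 features
icc = np.concatenate([rng.uniform(0.92, 0.99, 10),    # first 10 features repeatable
                      rng.uniform(0.30, 0.90, 20)])   # the rest are not

def fit_auc(feature_idx):
    """Train a logistic model on the chosen feature subset, return testing AUC."""
    X_tr, X_te, y_tr, y_te = train_test_split(X[:, feature_idx], y, random_state=0)
    model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    return roc_auc_score(y_te, model.predict_proba(X_te)[:, 1])

all_idx = np.arange(p)
stable_idx = np.flatnonzero(icc > 0.9)  # repeatability filter at the 0.9 threshold
print(f"AUC, all features:    {fit_auc(all_idx):.2f}")
print(f"AUC, stable features: {fit_auc(stable_idx):.2f}")
```

The filter discards noisy, non-repeatable predictors before model fitting; whether this helps or hurts discriminatory power in practice depends on how much informative signal resides in the discarded features, which is precisely the tradeoff the threshold comparisons above quantify.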
Test-retest reliability demonstrates substantial variation across different imaging modalities and anatomical regions. The intraclass correlation coefficient (ICC) serves as the primary metric for quantifying this reliability, with values >0.9 typically classified as "excellent," 0.75-0.9 as "good," 0.5-0.75 as "moderate," and <0.5 as "poor" [26].
In neuroimaging, the YOUth cohort study reported good test-retest reliability for global brain measures derived from structural T1-weighted and diffusion-weighted imaging (DWI), with moderate reliability for resting-state functional connectivity and task-based fMRI measures [25]. This pattern of global measures outperforming local/functional measures is consistent across many neuroimaging applications.
In cardiac MRI, a prospective study of myocardial T1 and T2 mapping found that only a subset of radiomic features demonstrated good to excellent repeatability [26]. For T1 maps in short-axis orientation, 6 features showed excellent repeatability (ICC > 0.9), 29 good (ICC 0.75-0.90), 19 moderate (ICC 0.50-0.75), and 46 poor (ICC < 0.50). The study ultimately identified just 15 features from 6 classes that maintained good to excellent repeatability across all resolutions and orientations for T1 mapping [26].
Table 2: Test-Retest Reliability Across Imaging Applications
| Imaging Application | Reliability Level | Representative Features/Metrics | ICC Range |
|---|---|---|---|
| Brain MRI (Structural) | Good to Excellent | Global brain volume, Tissue classification | >0.75 |
| Brain MRI (Functional) | Moderate | Resting-state connectivity, Task activation | 0.5-0.75 |
| Cardiac MRI (T1/T2 Mapping) | Variable (Poor to Excellent) | Myocardial radiomic features (subset) | <0.5 to >0.9 |
| Body PET (Oncological) | Moderate to Good | Metabolic tumor volume, SUV metrics | 0.6-0.85 |
| Brain PET (Radiomics) | Highly variable | Texture features (dependent on PVC method) | <0.5 to >0.9 |
The implementation of test-retest imaging faces substantial practical barriers that limit its widespread application:

- Resource intensity: repeated scans consume scanner time and personnel that clinical services can rarely spare.
- Patient burden: additional sessions are inconvenient and, for CT and PET, add radiation exposure that is difficult to justify ethically.
- Biological stability: genuine physiological or tumor changes between sessions can contaminate the estimate of technical variability.
- Sample size: these constraints typically restrict test-retest cohorts to small numbers of participants, limiting statistical power.
While perturbation and NGSE approaches offer practical advantages, they introduce their own methodological limitations:
Image perturbation techniques may not fully capture the complex sources of variability present in actual repeated scans. The controlled nature of synthetic perturbations tends to produce systematically higher repeatability estimates compared to test-retest, potentially overestimating feature stability [7]. Furthermore, the optimal thresholds for classifying features as "repeatable" remain ambiguous and may vary across applications.
The no-gold-standard framework relies on several strong statistical assumptions, particularly the linear relationship between measured and true values, which may not hold in practice [24]. Violations of these assumptions can lead to biased estimates of method precision. Additionally, the NGSE technique requires data from multiple measurement methods and sufficient sample sizes (typically >80 lesions) to produce reliable estimates [24].
Critical reviews have highlighted concerns about the validity of commonly used reliability indices in quantitative imaging [27]. The intraclass correlation coefficient (ICC), while widely used, has limitations including:

- Dependence on between-subject variance: the same measurement error yields a higher ICC in a heterogeneous cohort than in a homogeneous one, limiting comparability across studies.
- Lack of an absolute error scale: as a unitless ratio, the ICC conveys nothing about the magnitude of measurement error in the feature's own units.
- Sensitivity to model choice: one-way versus two-way models and consistency versus absolute-agreement definitions can produce different values for the same data.
These limitations underscore the importance of complementing ICC with additional metrics such as the coefficient of variation (CV), standard error of measurement (SEM), and Bland-Altman analysis to provide a more comprehensive assessment of measurement reliability [27] [23].
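Two of these complementary metrics, the standard error of measurement and the within-subject coefficient of variation, can be computed directly from paired measurements. The formulas below are the standard definitions (SEM = SD · √(1 − ICC)); the numeric values are illustrative.

```python
import numpy as np

def sem_from_icc(values, icc):
    """Standard error of measurement: SD * sqrt(1 - ICC); in the feature's own units."""
    return np.std(values, ddof=1) * np.sqrt(1.0 - icc)

def within_subject_cv(test, retest):
    """Within-subject coefficient of variation (%) from one pair of measurements
    per subject; the SD of two values x, y is |x - y| / sqrt(2)."""
    test, retest = np.asarray(test, float), np.asarray(retest, float)
    pair_mean = (test + retest) / 2
    pair_sd = np.abs(test - retest) / np.sqrt(2)
    return 100 * np.mean(pair_sd / pair_mean)

# Illustrative test-retest values for a single radiomic feature.
test = np.array([10.2, 14.8, 9.5, 12.1])
retest = np.array([10.6, 14.1, 9.9, 12.4])
print(f"SEM (assuming ICC = 0.95): {sem_from_icc(np.r_[test, retest], 0.95):.2f}")
print(f"Within-subject CV: {within_subject_cv(test, retest):.1f}%")
```

Unlike the ICC, both metrics are expressed on interpretable scales (feature units and percent, respectively), which is why reliability guidelines recommend reporting them alongside correlation-based indices.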
Table 3: Key Materials and Analytical Tools for Reliability Studies
| Tool/Reagent | Function/Application | Representative Examples |
|---|---|---|
| Phantom Systems | Scanner calibration and performance monitoring | MRI homogeneity phantoms, PET/CT resolution inserts |
| Image Analysis Platforms | Feature extraction and quantification | PyRadiomics, ITK-SNAP, SPM, FSL |
| Statistical Software | Reliability analysis and visualization | R, Python (scikit-learn, Pingouin), SPSS |
| Radiomics Standardization Tools | Protocol harmonization and reporting | IBSI (Imaging Biomarker Standardization Initiative) guidelines |
| Computational Resources | Image processing and perturbation | High-performance computing clusters, GPU acceleration |
The experimental workflow for assessing radiomic feature reliability typically follows a structured pipeline, whether using test-retest, perturbation, or NGSE approaches. Figure 2 illustrates this generalized methodology, highlighting key decision points and analytical steps common to all three paradigms.
Test-retest imaging remains the methodological gold standard for establishing the reliability of radiomic features and quantitative imaging biomarkers, providing direct evidence of measurement stability under realistic conditions [23]. However, substantial practical limitations including resource intensity, patient burden, and biological stability concerns constrain its implementation [24] [7].
Alternative methodologies offer promising approaches for addressing these limitations. Image perturbation provides a practical, computationally efficient alternative that demonstrates reasonable concordance with test-retest results, though it may systematically overestimate feature repeatability [7]. The no-gold-standard framework represents a statistically sophisticated approach that eliminates the need for repeated scanning entirely, though it relies on strong assumptions that require careful validation [24].
The choice between these methodologies involves balancing practical constraints against methodological rigor, with the optimal approach depending on specific research contexts, available resources, and the intended clinical application of the radiomic features under investigation. As the field advances, standardization of reliability assessment protocols and reporting standards will be crucial for meaningful comparison across studies and eventual clinical translation of robust radiomic biomarkers.
In the field of radiomics, which aims to extract high-dimensional quantitative features from medical images to inform cancer diagnosis, prognosis, and treatment, the test-retest reliability of features is a fundamental prerequisite for clinical translation [28]. Traditionally, this reliability has been assessed through physical test-retest studies, where patients are scanned multiple times within a short interval [29]. However, such studies are resource-intensive, increase patient radiation exposure, and are often limited by small sample sizes [7]. In response, computational perturbation methods have emerged as a promising alternative. This guide objectively compares these innovative computational approaches against traditional physical rescanning for assessing radiomic feature reliability, providing researchers with the experimental data and methodologies needed to inform their study designs.
Radiomics converts standard medical images into minable data by extracting a vast number of quantitative features that describe tumor phenotype [28]. The multi-step radiomics workflow—from image acquisition and segmentation to feature extraction and model building—is susceptible to variations at every stage. Consequently, the reproducibility and repeatability of radiomic features are major concerns [30].
The following section provides a detailed, point-by-point comparison of the two approaches, covering their fundamental principles, implementation, and key characteristics.
Table 1: Direct comparison of physical rescanning and computational perturbation methods.
| Characteristic | Physical Test-Retest | Computational Perturbation |
|---|---|---|
| Primary Objective | Identify robust features against short-term scan-rescan variability [29]. | Assess feature stability against simulated imaging and segmentation variations [32] [33]. |
| Patient Burden | High (additional scan and radiation exposure). | None (uses existing clinical data). |
| Resource Intensity | High (scanner time, personnel). | Low (computational power only). |
| Dataset Size | Often limited (e.g., N=27-40) [29]. | Virtually unlimited (e.g., 60+ perturbations per patient) [32]. |
| Generalizability | May be specific to scanner, protocol, and cancer site [29]. | Can be tailored to a specific study's expected variations. |
| Controlled Variables | Limited to scan-rescan noise; cannot isolate other factors. | Can be designed to isolate specific variability sources (e.g., segmentation only). |
Recent studies have directly compared the outcomes of these two methods, providing quantitative data on their effectiveness in building reliable radiomic models.
A 2023 study on a breast cancer dataset provided a direct comparison [7]. The researchers filtered out non-repeatable features using both test-retest (CCC) and perturbation (ICC) methods, then built predictive models for pathological complete response (pCR) using the resulting feature sets.
Table 2: Model performance comparison based on feature repeatability method (adapted from [7]).
| Feature Repeatability Method | Testing AUC (Logistic Regression) at ICC/CCC Threshold=0.9 | Prediction ICC on Test-Retest Data |
|---|---|---|
| Test-Retest (CCC) | 0.77 | 0.87 |
| Image Perturbation (ICC) | 0.76 | 0.75 |
| Baseline (No Filtering) | ~0.56 | 0.45 |
The key finding was that while the model based on test-retest features (Mtr) showed slightly higher prediction reliability on the actual test-retest data, the model based on perturbation-filtered features (Mp) also achieved a significant and comparable improvement in performance and robustness over the baseline model. This demonstrates that perturbation is a highly effective alternative when test-retest data is unavailable [7].
A large-scale study on 1,419 head-and-neck cancer patients across four datasets systematically evaluated the impact of using perturbation-derived robust features [33]. The results clearly show that filtering out low-robust features significantly enhances the final radiomic model.
Table 3: Model performance with robust feature filtering (summary of findings from [33]).
| Feature Robustness Filtering Threshold | Model Robustness (ICC) | Train-Test AUC Difference | Average Testing AUC |
|---|---|---|---|
| None (All Features) | 0.65 | 0.21 | Not Reported |
| ICC > 0.75 | 0.78 | 0.18 | 0.58 |
| ICC > 0.95 | 0.91 | 0.12 | Lower than ICC>0.75 |
The study concluded that using features with good robustness (ICC > 0.75) yielded the best balance, providing substantially improved model robustness and generalizability (evidenced by a smaller train-test performance gap) while maintaining the model's discriminatory power. Overly strict robustness thresholds (e.g., ICC > 0.95), while further improving robustness, can reduce a model's predictive performance by eliminating informative features [33].
For researchers seeking to implement these methods, the following protocols detail the steps as described in the cited literature.
This protocol is synthesized from methodologies used in multiple studies [32] [33] [34]:

1. Delineate the region of interest (ROI) on each original image using a consistent segmentation approach.
2. Generate a family of perturbed image-segmentation pairs (e.g., 60+ per patient) by applying randomized combinations of translation, rotation, noise addition, and contour randomization.
3. Extract the full radiomic feature set from every perturbed variant with identical software settings.
4. Compute the ICC of each feature across the perturbed variants.
5. Retain only features exceeding a predefined robustness threshold (commonly ICC > 0.75 or > 0.9) for downstream modeling.
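One perturbation component, contour randomization, can be sketched as follows. This is a minimal 2-D illustration on a synthetic circular mask: the smoothing sigma, wobble amplitude, and re-threshold value are placeholders, not parameters from any cited protocol.

```python
import numpy as np
from scipy import ndimage

rng = np.random.default_rng(3)

def randomize_contour(mask, sigma=2.0, wobble_scale=0.15, threshold=0.5):
    """Contour randomization: blur the binary mask, add spatially smooth random
    noise, and re-threshold, so the boundary wobbles while the ROI core is kept."""
    smooth = ndimage.gaussian_filter(mask.astype(float), sigma)
    wobble = ndimage.gaussian_filter(rng.normal(size=mask.shape), sigma)
    wobble = wobble / wobble.std() * wobble_scale  # normalize, then set amplitude
    return (smooth + wobble) > threshold

# Hypothetical ROI: a filled circle standing in for a tumor segmentation.
yy, xx = np.mgrid[:64, :64]
mask = (yy - 32) ** 2 + (xx - 32) ** 2 < 15 ** 2

variants = [randomize_contour(mask) for _ in range(5)]
volumes = [int(v.sum()) for v in variants]
print("Perturbed ROI volumes (voxels):", volumes, "| original:", int(mask.sum()))
```

Each perturbed mask would then be paired with the (possibly also perturbed) image for feature extraction, so that per-feature ICCs reflect sensitivity to realistic segmentation variability.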
The following diagram illustrates the computational perturbation workflow.
This table catalogs key software tools and methodological components essential for implementing computational perturbation methods, as derived from the reviewed literature.
Table 4: Key research reagents and solutions for perturbation analysis.
| Item/Software | Type | Primary Function | Example Use in Context |
|---|---|---|---|
| PyRadiomics | Open-source Python package | Standardized extraction of radiomic features from medical images. | The core feature extraction engine used in multiple studies [33] [31]; integrates with workflows in 3D Slicer. |
| 3D Slicer / ITK-SNAP | Open-source software platform | Manual, semi-automatic, or deep learning-based image segmentation. | Used for initial delineation of Regions of Interest (ROIs) like tumors prior to perturbation analysis [31]. |
| Perturbation Framework | In-house Python code | Applies random transformations, noise, and contour deformations to images and segmentations. | Critical for simulating test-retest conditions; parameters include translation/rotation ranges and noise levels [33] [34]. |
| ICC Analysis Script | Statistical code (R/Python) | Quantifies feature robustness by calculating Intra-class Correlation Coefficient across perturbations. | Used to analyze the output from PyRadiomics, ranking features by their stability (ICC value) [32] [7]. |
| Laplacian-of-Gaussian (LoG) & Wavelet Filters | Image processing filters | Highlight texture patterns at different spatial scales before feature extraction. | Applied to images pre-feature extraction to create a multi-scale feature set; sigma values (e.g., 1-5mm) define texture coarseness [33] [34]. |
The body of evidence demonstrates that computational perturbation is a viable and effective alternative to physical test-retest imaging for assessing radiomic feature reliability. While physical rescanning may remain a benchmark in ideal scenarios, its practical limitations are significant. Perturbation methods offer a scalable, flexible, and patient-free solution that directly addresses the critical need for robust feature selection. Research shows that models built upon perturbation-validated features achieve markedly improved reliability and generalizability. For the broader scientific community, adopting computational perturbation is a pragmatic and powerful strategy for advancing the field of radiomics toward clinically reliable applications.
The field of radiomics faces a significant challenge in translating promising research findings into clinical practice, primarily due to concerns about feature reliability and reproducibility. Quantitative reliability metrics provide the essential framework for assessing this stability, helping researchers distinguish robust, biologically relevant features from those unduly influenced by technical variations. The Concordance Correlation Coefficient (CCC), Intraclass Correlation Coefficient (ICC), and Limits of Agreement (LOA) serve as fundamental statistical tools for this validation process. These metrics systematically evaluate different aspects of feature behavior under various conditions, forming the foundation for methodological rigor in radiomics research. Their proper application is critical for establishing the trustworthiness of radiomic signatures intended for clinical decision-making in areas such as cancer diagnosis, prognosis prediction, and treatment response assessment [35] [36] [37].
The importance of these metrics extends beyond mere technical validation. In the context of test-retest reliability, they provide objective measures of whether a feature remains stable when measured repeatedly under similar conditions (repeatability) or under changing conditions such as different scanners or segmentation methods (reproducibility). This distinction is crucial for determining whether a radiomic feature can serve as a reliable biomarker in multi-center studies or clinical trials, where variations in imaging protocols and analysis methods are inevitable [36] [38]. As radiomics moves closer to clinical implementation, understanding the proper application and interpretation of ICC, CCC, and LOA becomes paramount for ensuring that predictive models perform consistently and reliably in real-world settings.
The three core metrics—ICC, CCC, and LOA—each provide distinct insights into feature reliability through different mathematical frameworks.
The Intraclass Correlation Coefficient (ICC) quantifies reliability by partitioning variance components in data. The general formula for ICC is expressed as:
ICC = Between-subject variance / (Between-subject variance + Within-subject measurement variance) [36]
This ratio-based approach makes ICC particularly useful for assessing the proportion of total variance attributable to actual biological differences between subjects versus measurement error. Several forms of ICC exist depending on the experimental design, including one-way or two-way models, random or fixed effects, and single or multiple measurements. For radiomics applications, ICC(3,1)—which employs a two-way mixed-effects model for absolute agreement with single measurement—is frequently recommended when comparing fixed raters or conditions [35] [36].
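The recommended ICC(3,1) form can be computed from a two-way ANOVA decomposition of a subjects-by-sessions table. The sketch below is a hand-rolled implementation of the standard formula, (MS_subjects − MS_error) / (MS_subjects + (k − 1) · MS_error), with illustrative numbers; in practice an established implementation such as the one in the pingouin package can be used instead.

```python
import numpy as np

def icc_3_1(ratings):
    """ICC(3,1): two-way mixed-effects model, single measurement, consistency.
    `ratings` is subjects x repeated sessions (e.g., scan 1 vs scan 2)."""
    r = np.asarray(ratings, float)
    n, k = r.shape
    grand = r.mean()
    ms_rows = k * ((r.mean(axis=1) - grand) ** 2).sum() / (n - 1)  # between-subject MS
    # Residual: what remains after removing subject and session effects.
    resid = r - r.mean(axis=1, keepdims=True) - r.mean(axis=0, keepdims=True) + grand
    ms_err = (resid ** 2).sum() / ((n - 1) * (k - 1))              # residual MS
    return (ms_rows - ms_err) / (ms_rows + (k - 1) * ms_err)

# Paired test-retest values for one feature across 6 subjects (illustrative numbers).
scans = np.array([[8.1, 8.3], [5.0, 4.8], [9.7, 9.9],
                  [3.2, 3.1], [6.6, 6.4], [7.5, 7.8]])
print(f"ICC(3,1) = {icc_3_1(scans):.3f}")
```

Because this is a consistency-type coefficient, a constant offset between sessions does not lower the value; designs in which systematic session differences matter call for an absolute-agreement ICC instead.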
The Concordance Correlation Coefficient (CCC) evaluates the agreement between two measures by assessing how well pairs of observations fall along the line of perfect concordance (the 45-degree line). Unlike ICC, which focuses specifically on variance components, CCC incorporates both precision (deviation from the best-fit line) and accuracy (deviation from the 45-degree line) in its assessment of agreement. This makes CCC particularly valuable for test-retest and repositioning studies where both systematic and random errors need quantification [35].
Limits of Agreement (LOA) take a complementary, difference-based approach: they are computed as the mean difference between two measurements ± 1.96 standard deviations of the differences, establishing an interval within which most differences between measurements are expected to lie. This approach, often visualized through Bland-Altman plots, provides intuitive information about the magnitude of disagreement between measurement techniques or repeated assessments.
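The LOA computation is simple enough to state directly in code; the paired values below are illustrative placeholders.

```python
import numpy as np

def limits_of_agreement(x, y):
    """Bland-Altman 95% limits of agreement: mean difference +/- 1.96 SD of the
    differences, in the feature's own units."""
    d = np.asarray(x, float) - np.asarray(y, float)
    return d.mean() - 1.96 * d.std(ddof=1), d.mean() + 1.96 * d.std(ddof=1)

# Illustrative test-retest values for a single radiomic feature.
test = np.array([10.2, 14.8, 9.5, 12.1, 11.3, 13.0])
retest = np.array([10.6, 14.1, 9.9, 12.4, 11.0, 13.3])
lo, hi = limits_of_agreement(test, retest)
print(f"LOA = [{lo:.2f}, {hi:.2f}]")
```

A non-zero mean difference shifts the interval away from zero and flags a systematic bias between sessions, which correlation-based metrics such as the ICC can mask.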
Consistent interpretation of these metrics requires established thresholds for classifying reliability levels:
Table 1: Standard Interpretation Guidelines for Reliability Metrics
| Reliability Level | ICC Range | CCC Range | Typical Application |
|---|---|---|---|
| Poor | < 0.5 | < 0.90 | Unacceptable for clinical use |
| Moderate | 0.5 - 0.75 | - | Suitable for group-level research |
| Good | 0.75 - 0.9 | - | Approaching clinical utility |
| Excellent | > 0.9 | ≥ 0.90 | Suitable for clinical applications |
These thresholds follow established conventions in the literature. For ICC, the classification system proposed by Koo and Li is widely adopted: values below 0.5 indicate poor reliability, between 0.5 and 0.75 moderate, between 0.75 and 0.9 good, and above 0.9 excellent reliability [35] [39]. For CCC, a threshold of ≥ 0.9 is commonly used to define excellent stability in test-retest analyses [35].
It is important to recognize that these thresholds are guidelines rather than absolute rules. The required level of reliability depends on the specific clinical or research context. For example, features intended for treatment response assessment might require higher reliability standards than those used for exploratory research into disease mechanisms.
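When screening hundreds of features, the interpretation bands are easy to encode directly. A minimal sketch of the Koo and Li classification (function name is illustrative):

```python
def koo_li_category(icc):
    """Map an ICC estimate to the Koo & Li interpretation bands [35] [39]."""
    if icc < 0.5:
        return "poor"
    if icc < 0.75:
        return "moderate"
    if icc < 0.9:
        return "good"
    return "excellent"

print(koo_li_category(0.82))  # good
```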
Phantom studies serve as the foundation for establishing the technical reliability of radiomic features by controlling biological variability. A 2023 phantom study utilizing photon-counting detector CT (PCCT) exemplifies rigorous test-retest methodology. Researchers scanned organic phantoms (apples, kiwis, limes, and onions) at different exposure levels (10, 50, and 100 mAs) at a tube voltage of 120 kV. Each scan included immediate test-retest sequences without phantom repositioning, followed by additional scans after 90-degree clockwise repositioning. After semi-automated segmentation and extraction of 104 original radiomic features using PyRadiomics, stability was assessed using CCC and ICC [35].
The results demonstrated promising technical stability for radiomic features obtained with modern imaging technology. In test-retest comparisons, 73 features (70%) showed excellent stability with CCC values > 0.9. When assessing repositioning effects, 68 features (65.4%) maintained excellent stability (CCC > 0.9). Notably, all shape-based features exhibited excellent stability across test conditions. When evaluating the impact of different exposure settings, 75% of features demonstrated excellent stability across varying mAs values (10, 50, and 100 mAs) based on ICC analysis [35].
This phantom study design provides a template for technical validation of radiomic feature stability, isolating the effects of image acquisition parameters from biological variability. The high stability rates observed suggest that modern CT technology, particularly photon-counting detectors, may address some of the historical limitations impeding radiomics clinical translation.
Segmentation represents one of the most significant sources of variability in radiomic analysis, with studies consistently demonstrating its impact on feature stability. Research on oropharyngeal cancer CT images revealed that segmentation variability substantially affects both feature representation and predictive accuracy. When comparing original segmentations with deliberately resized versions (simulating under- and over-segmentation), most radiomic features showed considerable variation, with ICC and CCC values below 0.5 for all features in both representation and predictive agreement [39].
Different segmentation methodologies yield different reliability profiles. A study on cervical cancer DWI-MRI compared manual versus semi-automatic segmentation using a flood-fill algorithm. The semi-automatic approach demonstrated significantly higher reliability, with an average ICC of 0.952 compared to 0.897 for manual segmentation. This advantage was consistent across first-order, shape, and textural features [40].
Large-scale analyses have identified specific feature categories with differential sensitivity to segmentation variability. One comprehensive investigation using manual segmentations from four expert readers and probabilistic automated segmentations (generating 25 plausible segmentations per lesion) analyzed three publicly available datasets (lung, kidney, and liver lesions). The results consistently identified subsets of radiomic features robust to segmentation variability, while others demonstrated poor reproducibility across different segmentations. This pattern held for both manual and automated segmentation approaches [41].
Table 2: Comparative Reliability Across Segmentation Methodologies
| Study Focus | Segmentation Method | Reliability Level | Stable Features Identified |
|---|---|---|---|
| Cervical Cancer DWI-MRI [40] | Semi-automatic (flood-fill) | Average ICC = 0.952 | First-order, shape, textural features |
| Cervical Cancer DWI-MRI [40] | Manual | Average ICC = 0.897 | First-order, shape, textural features |
| Multi-site CT Analysis [41] | Manual (4 experts) | Feature-dependent | Subsets of robust features identified |
| Multi-site CT Analysis [41] | Probabilistic Automated | Feature-dependent | Similar robust features as manual |
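The effect of under- and over-segmentation can be illustrated with a small numpy-only simulation on a synthetic texture. Cubic ROIs and random intensities are crude stand-ins for real lesion contours and CT data; a real study would erode or dilate actual segmentations:

```python
import numpy as np

rng = np.random.default_rng(1)
image = rng.random((32, 32, 32))  # synthetic "texture" volume

def cubic_mask(shape, lo, hi):
    """Axis-aligned cubic ROI as a crude stand-in for a lesion contour."""
    m = np.zeros(shape, dtype=bool)
    m[lo:hi, lo:hi, lo:hi] = True
    return m

original = cubic_mask(image.shape, 8, 24)   # reference segmentation
under = cubic_mask(image.shape, 10, 22)     # shrunk: under-segmentation
over = cubic_mask(image.shape, 6, 26)       # grown: over-segmentation

# Volume (a shape feature) changes drastically, while first-order intensity
# statistics on a homogeneous texture are comparatively stable:
for name, m in [("original", original), ("under", under), ("over", over)]:
    roi = image[m]
    print(f"{name:8s} voxels={roi.size:5d} mean={roi.mean():.3f} std={roi.std():.3f}")
```

On real, heterogeneous lesions the intensity and texture statistics shift far more than in this homogeneous toy case, which is precisely the variability the cited studies quantify.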
The choice of feature extraction platform significantly influences radiomic feature reliability, even when analyzing identical images and segmentations. A multi-platform comparison study evaluated four software tools (PyRadiomics, LIFEx, CERR, and IBEX) across three clinical datasets (head and neck cancer, small-cell lung cancer, and non-small-cell lung cancer). When comparing all four platforms using harmonized calculation settings, only 4 out of 17 features demonstrated excellent reliability (ICC > 0.9) across all datasets. However, when the analysis was restricted to the three Image Biomarker Standardisation Initiative (IBSI)-compliant platforms (excluding IBEX), reliability improved substantially, with 15 out of 17 features showing excellent reliability [37].
This study also revealed that failure to harmonize calculation settings resulted in poor reliability, even across IBSI-compliant platforms. Additionally, software version choice had a marked effect on feature reliability for some platforms. Perhaps most importantly, features identified as having significant relationships to survival varied between platforms, as did the direction of hazard ratios, highlighting the profound implications of platform choice for clinical conclusions [37].
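One practical way to keep calculation settings harmonized across sites is to version-control a single extraction parameter file. The fragment below uses PyRadiomics' YAML parameter-file format; the specific values are common illustrative defaults, not the harmonized settings used in [37]:

```yaml
# params.yaml — shared PyRadiomics extraction settings (illustrative values)
imageType:
  Original: {}          # no filtered image types for this example
featureClass:
  firstorder:           # empty value enables all features in the class
  shape:
  glcm:
setting:
  binWidth: 25                    # fixed-width intensity discretization
  resampledPixelSpacing: [1, 1, 1]  # isotropic resampling in mm
  interpolator: sitkBSpline
  normalize: false
```

Loading this one file in every analysis (`RadiomicsFeatureExtractor("params.yaml")`) removes the settings drift that the study identified as a major source of inter-platform disagreement.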
Acquisition parameters represent another critical variable affecting feature stability. The Acquisition Impact on Radiomics Estimation (AcquIRE) study analyzed three chest CT datasets (749 patients from nine sites) to rank the impact of various acquisition parameters. Results identified CT software version and convolution kernel as the most influential parameters affecting feature variance. The study also found that different texture feature families were affected differently, with Haralick features being least affected in one dataset, while Gabor features were most stable in others, suggesting that acquisition parameter effects may be problem-specific [38].
Synthesizing data across multiple studies reveals consistent patterns in radiomic feature reliability and the factors that influence it.
Table 3: Reliability Metrics Across Experimental Conditions
| Experimental Condition | Metric | Performance | Reference |
|---|---|---|---|
| Phantom Test-Retest | CCC > 0.9 | 70% of features (73/104) | [35] |
| Phantom Repositioning | CCC > 0.9 | 65.4% of features (68/104) | [35] |
| mAs Variation (10–100 mAs) | ICC > 0.9 | 75% of features (78/104) | [35] |
| All Software Platforms | ICC > 0.9 | 23.5% of features (4/17) | [37] |
| IBSI-Compliant Platforms Only | ICC > 0.9 | 88.2% of features (15/17) | [37] |
| Segmentation Variability (OPC) | ICC/CCC | Below 0.5 for all features | [39] |
The data demonstrates that technical factors such as scanner type, image acquisition parameters, segmentation methodology, and feature extraction platforms collectively influence feature stability. Promisingly, modern imaging technology like photon-counting CT demonstrates high inherent feature stability under controlled conditions. However, methodological choices throughout the radiomics workflow can either preserve or degrade this inherent stability.
The consistency of findings across multiple studies and research groups strengthens the evidence base for radiomic feature reliability. For instance, the identification of similar subsets of robust features across different segmentation methodologies and datasets suggests that certain classes of features possess inherent mathematical properties that confer stability despite methodological variations [41].
Implementing standardized experimental protocols is essential for generating comparable, reliable data in radiomics feature stability analysis. The following workflow diagrams illustrate key methodological approaches documented in the literature:
Figure 1: Phantom Test-Retest Stability Protocol. This workflow illustrates the comprehensive approach for assessing technical reliability of radiomic features using organic phantoms, incorporating both test-retest and repositioning elements [35].
Figure 2: Segmentation Variability Assessment Workflow. This protocol evaluates feature stability across different segmentation methodologies, a critical consideration for multi-center studies [39] [40] [41].
Table 4: Essential Tools for Radiomics Feature Stability Research
| Tool Category | Specific Examples | Function & Importance |
|---|---|---|
| Phantom Systems | Organic phantoms (apples, kiwis, limes, onions) [35] | Provide controlled test objects without biological variability |
| Imaging Modalities | Photon-counting CT (PCCT) [35], 3T MRI [40] | Generate image data with specific resolution and noise characteristics |
| Segmentation Tools | MITK Workbench [35], 3D Slicer [40], VelocityAI [39] | Define regions of interest for feature extraction |
| Feature Extraction Platforms | PyRadiomics (IBSI-compliant) [35] [37], LIFEx [37], IBEX [39] | Calculate radiomic features from segmented images |
| Statistical Software | R with irr, survival packages [37] | Compute reliability metrics (ICC, CCC) and perform survival analyses |
The comprehensive assessment of ICC, CCC, and LOA provides the statistical foundation for establishing radiomic feature reliability across various technical and clinical contexts. The evidence synthesized from multiple studies indicates that while many radiomic features demonstrate excellent inherent stability under controlled conditions, their reliability can be significantly compromised by variations in segmentation methodology, feature extraction platforms, and image acquisition parameters. These findings have profound implications for both research conduct and clinical translation.
For researchers, the methodological recommendations are clear: implement phantom validation studies to establish technical performance, utilize multiple segmentation approaches to assess robustness, standardize feature extraction using IBSI-compliant platforms with harmonized calculation settings, and explicitly report reliability metrics for features used in predictive models. Furthermore, the consistent identification of robust feature subsets across studies suggests that future research should prioritize these stable features for clinical model development.
As radiomics progresses toward clinical integration, establishing rigorous reliability assessment protocols will be essential for regulatory approval and clinical adoption. The metrics and methodologies reviewed here provide a roadmap for this validation process, offering standardized approaches for demonstrating that radiomic biomarkers meet the rigorous reliability standards required for clinical decision-making. Through consistent application of these quantitative reliability metrics, the field can advance toward its promise of transforming medical images into mineable, clinically actionable data.
In the field of radiomics, the reliability of extracted features is a prerequisite for developing predictive models that can be translated into clinical practice. Robust radiomic features must remain stable against inevitable variations in image acquisition, reconstruction, and segmentation. The intraclass correlation coefficient (ICC) has emerged as a primary statistical tool for quantifying this reliability, with a threshold of ICC > 0.75 frequently established as a benchmark for identifying "good" robust features [42]. This guide objectively examines the experimental data supporting this threshold, compares its performance against alternative benchmarks, and details the methodologies for its implementation, providing a foundational resource for researchers and drug development professionals.
The ICC measures the consistency and agreement of quantitative measurements, serving as a ratio of true variance to the total variance (true plus error) [42]. While general guidelines classify ICC values greater than 0.9 as "excellent," those between 0.75 and 0.9 are considered to indicate "good" reliability [42]. This specific range has been validated in numerous radiomic studies as a pragmatic threshold that effectively balances feature stability with the retention of a sufficient number of biologically informative features for model development.
Experimental data from multiple cancer types and imaging modalities consistently demonstrates the utility of the 0.75 threshold. A key study on head-and-neck cancer CT imaging found that using an ICC > 0.75 filter significantly improved model robustness. The average model robustness ICC improved from 0.65 (using all features) to 0.78, and model generalizability increased, evidenced by a reduced train-test AUC difference from 0.21 to 0.18 [33]. Furthermore, models built with these "good-robust" features yielded the best average AUC (0.58) on unseen datasets [33]. In cardiac MRI, a test-retest study on T1 and T2 mapping reported that 44.9% and 38.8% of myocardial radiomic features, respectively, surpassed the ICC > 0.75 benchmark, helping to identify a subset of features with high repeatability for clinical application [43].
Table 1: Performance of the ICC > 0.75 Benchmark Across Different Studies
| Cancer Type/Organ | Imaging Modality | Key Finding with ICC > 0.75 | Source |
|---|---|---|---|
| Head-and-Neck Cancer | CT | Model robustness ICC improved to 0.78; best performance on unseen data [33]. | Frontiers in Oncology |
| Breast Cancer | ADC (MRI) | Achieved optimal model reliability with testing AUC=0.7–0.8 and prediction ICC > 0.9 [7]. | Scientific Reports |
| Myocardium | T1 Mapping (Cardiac MRI) | 44.9% of features were above the ICC > 0.75 threshold [43]. | Journal of Cardiovascular Magnetic Resonance |
| Myocardium | T2 Mapping (Cardiac MRI) | 38.8% of features were above the ICC > 0.75 threshold [43]. | Journal of Cardiovascular Magnetic Resonance |
Selecting an ICC threshold involves a trade-off between feature robustness and predictive power. Excessively high thresholds can eliminate weakly correlated but biologically significant features, thereby impairing a model's discriminative ability. Experimental comparisons provide critical data on the consequences of this choice.
A breast cancer study using apparent diffusion coefficient (ADC) MRI images evaluated model performance across multiple ICC thresholds. The findings revealed that while higher thresholds improved robustness, the optimal model reliability was achieved at an ICC threshold of 0.9, not higher [7]. Specifically, at a very stringent threshold of ICC = 0.95, the test-retest model's performance dropped significantly [7]. This suggests that while ICC > 0.75 is a good initial filter, a marginally higher threshold might sometimes be optimal for final model feature selection, depending on the context.
Another study on head-and-neck cancer provided a direct comparison of different thresholds, demonstrating a clear progression in model performance. The use of "excellent-robust" features (ICC > 0.95) further improved model robustness (ICC = 0.91) and generalizability (train-test AUC difference = 0.12) compared to the "good-robust" threshold [33]. However, the earlier finding that the "good-robust" features yielded the best performance on unseen datasets highlights that the most robust model is not always the most generalizable, underscoring the need for context-specific threshold selection [33].
Table 2: Impact of Different ICC Thresholds on Radiomic Model Performance
| ICC Threshold | Designation | Impact on Model Robustness | Impact on Model Generalizability | Considerations |
|---|---|---|---|---|
| > 0.75 | Good Reliability | Significant improvement over baseline [33]. | Improved generalizability; best performance on some unseen data [33]. | Optimal for retaining predictive features while ensuring stability. |
| > 0.90 | Excellent Reliability | Further improvement in robustness [7] [33]. | Can maintain high testing performance [7]. | May be an optimal final filter; balances stringency and feature retention. |
| > 0.95 | Very High Reliability | Highest model robustness (e.g., ICC=0.91) [33]. | Performance can drop significantly due to loss of predictive features [7] [33]. | Risk of being overly restrictive; may lower discrimination power. |
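Once per-feature reliability estimates exist, threshold selection itself is a one-line filter. The sketch below uses invented ICC values purely for illustration:

```python
# Hypothetical per-feature ICC estimates (illustrative values only)
feature_icc = {
    "shape_Sphericity": 0.97,
    "glcm_Contrast": 0.81,
    "firstorder_Entropy": 0.88,
    "wavelet_HLH_glszm_ZoneEntropy": 0.42,
}

def select_features(icc_map, threshold):
    """Keep features whose reliability exceeds the chosen ICC threshold."""
    return sorted(name for name, icc in icc_map.items() if icc > threshold)

for t in (0.75, 0.90, 0.95):
    kept = select_features(feature_icc, t)
    print(f"ICC > {t}: {len(kept)} feature(s) retained -> {kept}")
```

Raising the threshold from 0.75 to 0.95 shrinks the candidate pool, which is exactly the robustness-versus-discrimination trade-off described above.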
The test-retest protocol is considered the reference standard for assessing feature repeatability.
Given the challenges of test-retest imaging, image perturbation has been developed as a practical and effective alternative [44].
Diagram 1: Image perturbation is a practical alternative to test-retest imaging for assessing radiomic feature robustness.
Table 3: Key Research Reagent Solutions for Radiomic Robustness Studies
| Item/Resource | Function in Experiment | Specific Examples & Notes |
|---|---|---|
| PyRadiomics | Open-source Python package for standardized feature extraction. | Ensures reproducibility; allows configuration of preprocessing and extraction parameters [33] [45]. |
| 3D Slicer / ITK-SNAP | Software for image segmentation and visualization. | Used for manual, semi-automated, or automated delineation of Regions of Interest (ROIs) [45]. |
| Pingouin | Statistical package in Python for reliability analysis. | Used to calculate various forms of ICC along with their 95% confidence intervals [42]. |
| Test-Retest Datasets | Publicly available datasets to validate feature repeatability. | e.g., Public NSCLC (Non-Small Cell Lung Cancer) and breast cancer datasets [7] [44]. |
| Perturbation Code/Framework | In-house or published code for generating image perturbations. | Implements chains of operations (R, T, V, C) to simulate real-world variations [44]. |
The body of experimental evidence solidifies ICC > 0.75 as a common and scientifically validated benchmark for establishing robustness in radiomic features. Data from head-and-neck, breast, and cardiac studies confirm that this threshold significantly enhances model robustness and generalizability compared to using unfiltered features. While alternative, more stringent thresholds (e.g., ICC > 0.90) can further improve stability, they risk discarding predictive information, potentially leading to a drop in performance on unseen data [7] [33]. The choice between test-retest and image perturbation protocols depends on data availability, with the latter providing a highly effective and feasible alternative [44]. For researchers building reliable radiomic models, incorporating an ICC > 0.75 filter is a critical step, and its implementation is facilitated by a well-established toolkit of software and methodologies.
This guide provides an objective comparison of radiomic feature considerations across Computed Tomography (CT), Positron Emission Tomography (PET), and Magnetic Resonance (MR) imaging modalities, with a specific focus on their implications for test-retest reliability in radiomics research.
Radiomics extracts high-dimensional data from medical images to quantify tumor phenotypes. A core challenge in radiomics is ensuring these features are reproducible, meaning they yield stable measurements when the same subject is imaged under identical conditions. Test-retest reliability is a critical prerequisite for developing robust, clinically applicable models. However, this reliability is profoundly influenced by the imaging modality used, due to differences in their underlying physics, acquisition protocols, and reconstruction algorithms. This guide compares the test-retest reliability of radiomic features across CT, PET, and MR, providing researchers with the experimental data and methodologies needed to inform their study designs.
The diagnostic performance and technical characteristics of hybrid imaging modalities, often used in radiomics, are summarized below. Furthermore, the stability of radiomic features extracted from different modalities is highly variable, as shown by test-retest studies.
Table 1: Comparative Diagnostic Performance of PET/CT vs. PET/MR in Detecting Breast Cancer Recurrence (Patient-Level Analysis) [46]
| Modality | Sensitivity (%) | 95% CI for Sensitivity | Specificity (%) | 95% CI for Specificity |
|---|---|---|---|---|
| PET/CT | 93 | 88 – 96 | 87 | 80 – 93 |
| PET/MR | 99 | 94 – 100 | 98 | 90 – 100 |
| P-value (PET/CT vs PET/MR) | 0.07 | | 0.06 | |
Table 2: Comparative Diagnostic Performance in Detecting Liver Metastases [47]
| Modality | Sensitivity (%) | Specificity (%) | Statistical Significance (p-value) |
|---|---|---|---|
| Total-Body PET/CT | 66.7 | 83.3 | 0.016 |
| PET/MR | 96.3 | 91.7 | (Reference) |
Table 3: Radiomic Feature Stability in Test-Retest Scenarios [29]
| Feature Category | Total Features | Features with CCC > 0.85 (Lung, "Coffee-Break") | Features with CCC > 0.85 (Rectal, Clinical) |
|---|---|---|---|
| All Features | 542 | 234 | 9 |
| Shape | 11 | 11 | 11 |
| Texture (GLCM) | 44 | 40 | 30 |
| Tumor Intensity | 15 | 13 | 2 |
| Wavelet | 472 | 170 | 5 |
CT radiomics is influenced by acquisition parameters like tube voltage, current, and slice thickness. Test-retest studies reveal that feature stability is highly dependent on the imaging scenario.
PET radiomics faces unique challenges due to its lower spatial resolution, noisy data, and sensitivity to factors like uptake time and reconstruction algorithms. Quantifying feature stability is essential before building predictive models.
MR imaging presents the most complex landscape for radiomics due to its multi-parametric nature and high sensitivity to variations in sequence parameters (e.g., TR, TE, field strength). This can lead to significant challenges in test-retest reproducibility.
The workflow for assessing feature reliability, whether through test-retest or perturbation, is summarized in the diagram below.
Table 4: Essential Tools for Radiomics Reliability Research
| Solution / Tool | Function / Application | Relevance to Test-Retest |
|---|---|---|
| PyRadiomics | Open-source Python library for standardized extraction of a wide range of radiomic features. | Ensures consistent feature calculation, which is foundational for reproducibility studies [50]. |
| Concordance Correlation Coefficient (CCC) | Statistical measure to assess agreement between two measurements of the same variable. | Primary metric for quantifying feature stability in test-retest and perturbation analyses [29] [7]. |
| Image Perturbation Algorithms | Software scripts to apply random transformations (translations, rotations, contour noise) to images and ROIs. | Simulates test-retest variability when a second scan is unavailable; used to identify robust features [7]. |
| Quality Assessment of Diagnostic Accuracy Studies (QUADAS-2) | A critical appraisal tool for systematic reviews of diagnostic accuracy studies. | Used to assess the methodological quality and risk of bias in studies included in radiomics meta-analyses [46]. |
| Test-Retest Datasets (e.g., RIDER) | Publicly available datasets containing repeated scans of the same patient with minimal time interval. | Gold standard for conducting and validating feature repeatability analyses [29]. |
The reliability of radiomic features is intrinsically linked to the imaging modality. CT features show high stability in ideal "coffee-break" settings but can be highly variable in clinical practice. PET features, while quantitatively consistent across hybrid systems like PET/CT and PET/MR, require careful harmonization. MR offers superior soft-tissue contrast but presents the greatest reproducibility challenges due to its parametric complexity. A critical emerging insight is that the common practice of filtering out individually "non-reproducible" features may discard predictive information, as this information can be distributed across multiple features [50]. Therefore, the radiomics community must move beyond a narrow focus on feature-level reproducibility and adopt a more holistic, model-centric approach to ensure the development of robust and clinically valuable tools.
The development of robust biomarkers and therapeutic targets in oncology is fundamentally complicated by the pervasive issue of tissue specificity. Cancer driver genes, radiomic features, and drug responses demonstrate significant variation across different tissue types, creating substantial challenges for developing reliable pan-cancer models. Recent genomic analyses have revealed that the vast majority of cancer driver genes are mutated in a tissue-dependent manner, meaning they are altered in some cancers but not others [51]. This tissue specificity extends beyond genetic alterations to functional pathways and therapeutic responses, with even cancer immunotherapy achieving enduring clinical benefit in only a fraction of tumor types [51].
Understanding the origins of this tissue specificity requires consideration of both cell-intrinsic and cell-extrinsic factors. The cell type-specific wiring of signaling networks determines the outcome of cancer driver gene mutations, while exposure to tissue-specific microenvironments (e.g., immune cells, hormones) also shapes the tissue specificity of driver genes and therapy response [51]. This complex interplay creates a landscape where feature stability—whether genomic, radiomic, or proteomic—varies considerably across disease contexts, necessitating specialized methodologies for accurate assessment and interpretation.
Table 1: Comparison of Feature Stability Assessment Methods
| Method Type | Key Characteristics | Advantages | Limitations | Optimal Application Context |
|---|---|---|---|---|
| Test-Retest Imaging [52] | Repeated scanning of patients within short time intervals with identical acquisition settings | • Considered gold standard • Captures real biological and technical variance • Direct clinical relevance | • Requires additional medical resources • Potential extra radiation exposure • Limited patient cohorts available • Conclusions not easily generalizable | • Establishing ground truth for feature repeatability • Validation studies with sufficient resources |
| Image Perturbation [52] | Application of random transformations (translations, rotations, noise addition, contour randomizations) to generate simulated retest images | • No additional scanning required • Applicable to existing datasets • Cost-effective and efficient • No patient burden | • May not capture all real-world variance • Systematic overestimation of repeatability • Requires validation against test-retest when possible | • Routine radiomic studies without dedicated retest data • Initial feature screening and filtering |
| Multi-Pipeline Comparison [53] | Extraction of identical feature classes using different computational pipelines (e.g., Pyradiomics, Moddicom) | • Identifies algorithm-dependent stability • Highlights implementation variations • Assesses computational robustness | • Does not address biological variance • Limited to technical reproducibility • Platform-specific differences | • Protocol standardization studies • Pipeline selection and harmonization |
Table 2: Performance Comparison of Perturbation vs. Test-Retest Methods
| Performance Metric | Image Perturbation (ICC = 0.9) | Test-Retest (ICC = 0.9) | Statistical Significance |
|---|---|---|---|
| Testing AUC (Logistic Regression) | 0.76 (0.64-0.88) | 0.77 (0.64-0.88) | p = 0.021 (within method); p > 0.05 (between methods) |
| Prediction ICC | 0.86 (0.82-0.90) | 0.87 (0.80-0.92) | Not statistically significant |
| Feature Repeatability Agreement | 621 features (ICC > 0.5) | 621 features (ICC > 0.5) | Strong correlation (r = 0.79, p < 0.001) |
| Mutually Agreed Repeatable Features (ICC > 0.9) | 18 features | 18 features | 989 features showed disagreement |
The experimental data reveals that while test-retest remains the gold standard, image perturbation can achieve similar model reliability at optimal intra-class correlation coefficient (ICC) thresholds [52]. Both methods demonstrate significantly improved testing AUC (0.76-0.77) compared to baseline models (AUC = 0.56) when applying appropriate ICC filtering thresholds. However, researchers should note the systematic overestimation of feature repeatability by perturbation methods, with only 18 features achieving mutual agreement at ICC > 0.9 compared to 989 features showing disagreement between methods [52].
The image perturbation protocol involves several systematic steps to simulate realistic variations in image acquisition and segmentation. For a comprehensive assessment, researchers should implement the following workflow:
Image Transformation: Apply random translations (±5mm), rotations (±10°), and noise addition (Gaussian, σ=0.01) to the original images to simulate positioning variations [52].
Contour Randomization: Generate multiple perturbed segmentations by applying random deformations to the original region of interest (ROI) using statistical shape models with ±2mm variance to account for inter-observer variability [52].
Feature Extraction: Extract radiomic features from all perturbed images and segmentations using standardized pipelines such as Pyradiomics [53].
Stability Calculation: Compute intra-class correlation coefficient (ICC) for each feature across all perturbations using a two-way random effects model assessing absolute agreement.
Feature Filtering: Apply predetermined ICC thresholds (typically 0.8-0.9) to select stable features for downstream modeling [52].
This protocol can be implemented using publicly available tools such as Pyradiomics within the 3D Slicer platform, which provides standardized feature definitions compliant with the Image Biomarker Standardization Initiative (IBSI) [53].
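A stripped-down version of steps 1–4 can be sketched with numpy alone. Integer voxel shifts stand in for the ±5 mm translations, and rotations and contour randomization are omitted for brevity; all names and parameters here are illustrative, not the published protocol:

```python
import numpy as np

rng = np.random.default_rng(0)

def perturb(image, max_shift=3, noise_sigma=0.01):
    """One random perturbation: integer-voxel translation plus Gaussian noise."""
    shifts = rng.integers(-max_shift, max_shift + 1, size=image.ndim)
    moved = np.roll(image, shifts, axis=tuple(range(image.ndim)))
    return moved + rng.normal(0.0, noise_sigma, size=image.shape)

# Synthetic volume and a fixed cubic ROI standing in for a lesion contour
image = rng.random((16, 16, 16))
mask = np.zeros(image.shape, dtype=bool)
mask[4:12, 4:12, 4:12] = True

# "Feature" = ROI mean intensity, recomputed over 20 perturbations;
# its spread across perturbations is the raw material for the ICC.
values = [float(perturb(image)[mask].mean()) for _ in range(20)]
print(f"mean={np.mean(values):.3f}, spread (SD)={np.std(values):.4f}")
```

Feeding the resulting per-perturbation feature values into a two-way random-effects ICC (step 4) completes the stability estimate for that feature.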
Given the significant heterogeneity between radiomics pipelines, validation across multiple computational tools is essential:
Parallel Feature Extraction: Extract identical feature classes using at least two independent pipelines such as Pyradiomics (3D voxel-to-voxel relationships) and Moddicom (2D slice-wise analysis with aggregation) [53].
Correlation Analysis: Assess inter-pipeline concordance using Spearman's rank correlation, with significance threshold of p ≤ 0.05 [53].
Stability Concordance: Identify features demonstrating consistent stability measures across pipelines, prioritizing those with correlation coefficients >0.7.
Downstream Validation: Evaluate how pipeline heterogeneity affects clustering with known clinical parameters such as T/N categories and tumor volume [53].
This multi-tool approach is particularly important for texture features, which show higher inter-pipeline variability (61.9% correlation for CT vs. 19.0% for MRI) compared to shape features (100% correlation for both modalities) [53].
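For the correlation-analysis step, Spearman's rank correlation between two pipelines' outputs for the same feature can be computed with numpy alone. This double-argsort ranking ignores ties (which `scipy.stats.spearmanr` handles properly), and the toy values are illustrative:

```python
import numpy as np

def spearman(a, b):
    """Spearman rank correlation via Pearson correlation of ranks (no tie handling)."""
    ra = np.argsort(np.argsort(a)).astype(float)
    rb = np.argsort(np.argsort(b)).astype(float)
    return float(np.corrcoef(ra, rb)[0, 1])

# Toy per-lesion values of one feature from two pipelines (illustrative numbers)
pyradiomics_vals = [12.1, 35.0, 18.7, 90.4, 44.2]
moddicom_vals = [11.8, 36.2, 19.1, 88.9, 43.0]
print(spearman(pyradiomics_vals, moddicom_vals))  # 1.0 — identical rank ordering
```

Rank correlation is a deliberate choice here: pipelines may differ by systematic scale or offset (e.g., 2D versus 3D aggregation) while still ordering lesions identically.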
Figure 1: Experimental workflow for comprehensive assessment of feature stability incorporating both image perturbation and multi-pipeline validation.
The tissue specificity observed in cancer features stems from fundamental biological mechanisms that vary across organ systems and tissue types. Understanding these mechanisms is essential for interpreting feature stability variations:
DNA Damage Response Pathways: DDR genes demonstrate striking tissue specificity in their mutation patterns. For example, germline mutations in nucleotide excision repair (NER) pathway genes (XPA, XPC) predominantly cause xeroderma pigmentosum with high skin cancer risk, while BRCA1/2 mutations in homologous recombination pathways primarily increase breast and ovarian cancer risk [51]. This specificity occurs despite relatively uniform expression of these DNA repair genes across tissues [54].
Cell-Extrinsic Factors: Tissue-specific microenvironments significantly influence feature stability through:
Cell-Intrinsic Factors: The developmental origin and differentiation state of cells creates tissue-specific vulnerabilities:
Figure 2: Biological mechanisms underlying tissue specificity in cancer features, showing how cell-intrinsic and cell-extrinsic factors collectively influence feature stability variations.
Patients with multiple tumors present unique challenges for feature stability assessment and predictive modeling. Several aggregation methods have been developed to address this challenge:
Table 3: Performance of Radiomic Feature Aggregation Methods in Multifocal Brain Metastases
| Aggregation Method | Description | C-Index (Cox PH) | C-Index (Cox LASSO) | C-Index (Random Forest) |
|---|---|---|---|---|
| Weighted Average (Largest 3) | Volume-weighted mean of features from 3 largest tumors | 0.627 (0.595-0.661) | 0.628 (0.591-0.666) | 0.652 (0.565-0.727) |
| Unweighted Average (All) | Simple mean of features from all tumors | 0.619 (0.586-0.652) | 0.621 (0.585-0.660) | 0.637 (0.550-0.712) |
| Largest Only | Features from single largest tumor only | 0.615 (0.582-0.648) | 0.618 (0.581-0.657) | 0.640 (0.553-0.715) |
| Largest + Count | Features from largest tumor plus metastasis count | 0.622 (0.589-0.655) | 0.624 (0.587-0.663) | 0.645 (0.558-0.720) |
The volume-weighted average of the largest three metastases consistently outperformed other aggregation methods across all survival models, suggesting that in multifocal disease, the largest tumors drive prognosis and provide the most stable feature sets [55]. This approach also offers practical advantages for computational efficiency and clinical implementation by reducing segmentation burden.
The optimal aggregation method varies based on disease characteristics:
For patients with <5 metastases: Weighted average of largest three tumors performs best (C-index = 0.640) [55].
For patients with 5-10 metastases: Unweighted average of all metastases shows superior performance (C-index = 0.697) [55].
For patients with 11+ metastases: Model including only the largest metastasis plus metastasis count performs best (C-index = 0.909) [55].
These findings indicate that as metastatic burden increases, incorporating clinical measures of multifocality (e.g., number of metastases) becomes increasingly important for accurate prognostication.
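The volume-weighted strategy that performed best for low metastatic burden can be sketched as follows. This is an illustrative helper under assumed inputs (per-lesion feature rows plus lesion volumes); the function name `aggregate_multifocal` is hypothetical and not from the cited study.

```python
import numpy as np

def aggregate_multifocal(features, volumes, k=3):
    """Volume-weighted mean of per-lesion radiomic features.

    features : (n_lesions, n_features) array, one row per lesion.
    volumes  : (n_lesions,) lesion volumes used as weights.
    Keeps only the k largest lesions, mirroring the 'weighted average
    (largest 3)' aggregation described above.
    """
    order = np.argsort(volumes)[::-1][:k]      # indices of the k largest lesions
    w = volumes[order] / volumes[order].sum()  # normalized volume weights
    return w @ features[order]                 # (n_features,) patient-level vector

# Toy patient: 4 lesions, 2 features each.
feats = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0], [4.0, 40.0]])
vols = np.array([5.0, 1.0, 3.0, 2.0])
print(aggregate_multifocal(feats, vols, k=3))
```

For a patient with 11+ metastases, the same framework would instead keep only the largest lesion's features and append the lesion count as an extra covariate.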
Table 4: Key Research Resources for Feature Stability Studies
| Resource Category | Specific Tools/Solutions | Primary Function | Implementation Considerations |
|---|---|---|---|
| Radiomics Pipelines | Pyradiomics (v2.1.2+), Moddicom (v0.51+), CERR | Standardized feature extraction from medical images | Pyradiomics uses 3D voxel relationships; Moddicom uses 2D slice aggregation; significant heterogeneity exists between pipelines [53] |
| Statistical Analysis | R/Python with survival, ICC calculation packages | Feature stability assessment and survival modeling | Implement mixed-effects models for nested data; use LASSO regularization for high-dimensional feature selection [52] [55] |
| Image Perturbation | Custom scripts for translation, rotation, contour randomization | Simulation of technical variations without additional scanning | Systematic overestimation of repeatability requires validation against test-retest when possible [52] |
| Multi-Omics Integration | mix-lasso model, PharmacoGx R package | Identification of tissue-specific predictive features across data types | Incorporates group penalty terms for tissue-specific effects; handles high-dimensional correlated features [56] |
| Validation Frameworks | IBSI-standardized phantoms, public test-retest datasets | Method benchmarking and harmonization | Limited generalizability across modalities and cancer sites necessitates study-specific validation [52] [53] |
The systematic evaluation of feature stability across cancer types requires multifaceted approaches that account for both technical and biological sources of variation. Image perturbation methods provide a practical alternative to test-retest imaging for routine feature stability assessment, achieving comparable model reliability at optimal ICC thresholds [52]. However, researchers must account for the systematic overestimation of feature repeatability by perturbation methods and the significant heterogeneity between radiomics pipelines [52] [53].
The biological context of tissue specificity—driven by DNA damage response heterogeneity, environmental exposures, and tissue-specific signaling networks—fundamentally limits pan-cancer applications of molecular and radiomic features [51] [54]. Successful modeling strategies must incorporate both feature stability measures and biological plausibility, with aggregation methods tailored to disease-specific characteristics such as metastatic burden [55].
Emerging methodologies that explicitly model tissue-specific effects, such as the mix-lasso approach for pan-cancer drug response prediction, offer promising frameworks for addressing feature stability variations across diseases [56]. By integrating technical validation with biological reasoning, researchers can develop more reliable, interpretable models that advance precision oncology across diverse cancer types.
Radiomics, the high-throughput extraction of quantitative features from medical images, has emerged as a cornerstone of precision oncology, offering non-invasive insights into tumor phenotype and microenvironment [18]. The reliability of these radiomic features (RFs) is paramount for developing robust predictive and prognostic models that can guide clinical decision-making. Among the critical factors influencing feature reliability, the pathological region from which features are extracted—specifically, the primary tumor, peritumoral area, and lymph nodes—represents a fundamental but often overlooked variable.
This review synthesizes current evidence on how these distinct pathological regions impact radiomic feature consistency, framing the discussion within the broader context of test-retest reliability research. Understanding these regional variations is essential for researchers and drug development professionals seeking to build generalizable radiomic models that can reliably inform therapeutic development and personalized treatment strategies.
A comprehensive 2025 study investigating esophageal cancer (EC) and nasopharyngeal carcinoma (NPC) provided direct comparative data on RF repeatability across pathological regions. The research, which utilized perturbation analysis and intraclass correlation coefficients (ICC) for repeatability assessment, revealed significant region-dependent variations [18].
Table 1: Radiomic Feature Repeatability Across Pathological Regions and Modalities
| Cancer Type | Imaging Modality | Pathological Region | Repeatability (Median ICC) | Statistical Comparison |
|---|---|---|---|---|
| Esophageal Cancer (EC) | CT | Tumor | 0.806 | Reference value |
| Esophageal Cancer (EC) | CT | Peritumor | 0.824 | Comparable to tumor (p > 0.05) |
| Esophageal Cancer (EC) | PET | Tumor | 0.897 | Significantly higher than CT-based tumor features (p < 0.05) |
| Esophageal Cancer (EC) | PET | Peritumor | 0.819 | Significantly lower than PET-based tumor features (p < 0.05) |
| Nasopharyngeal Carcinoma (NPC) | CT | Tumor | 0.886 | Reference value |
| Nasopharyngeal Carcinoma (NPC) | CT | Lymph Nodes | 0.863 | Significantly lower than tumor features (p < 0.05) |
This study demonstrated that CT-based peritumoral features in EC showed comparable repeatability to tumor features, whereas PET-based peritumoral features exhibited significantly lower repeatability than their tumor counterparts. Additionally, CT-based lymph node features in NPC demonstrated significantly lower repeatability than primary tumor features [18].
The prognostic significance of features extracted from different regions varies substantially. Research in non-small cell lung cancer (NSCLC) found that radiomic data from lymph nodes provided valuable complementary information to primary tumor features for predicting pathological complete response (pCR) after neoadjuvant chemoradiation. Specifically, lymph node homogeneity features were significantly predictive of gross residual disease (AUC range: 0.72–0.75) and performed significantly better than primary tumor features (AUC = 0.62) [57].
Traditional test-retest imaging, while considered a gold standard for repeatability assessment, presents practical challenges in clinical settings due to resource constraints and additional radiation exposure [18] [7]. Consequently, perturbation analysis has emerged as a validated alternative methodology.
Table 2: Key Methodological Approaches for Repeatability Assessment
| Methodology | Core Principle | Implementation | Validation |
|---|---|---|---|
| Test-Retest Imaging | Repeated scanning of same patient within short interval (typically 1-7 days) | Fixed scanner protocol; minimal changes in patient positioning | Considered reference standard but clinically impractical for large studies |
| Image Perturbation | Simulates spatial variations through computational transformations | Affine transformations (rotation); contour randomization via supervoxels | Strong correlation (r=0.79) with test-retest results [7] |
| ICC Calculation | Quantifies consistency of measurements | One-way, random, absolute-agreement ICC | ICC >0.8 considered repeatable; >0.95 highly repeatable [58] |
The perturbation approach typically applies affine transformations (translation, rotation) and contour randomization via supervoxels to the original images, after which features are re-extracted and compared against the unperturbed values [7].
Studies have demonstrated strong correlation between feature repeatability assessed via perturbation and traditional test-retest methods (Pearson correlation r = 0.79, p < 0.001), supporting its validity as an assessment tool [7].
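A minimal sketch of generating one pseudo-retest image by a small random rigid transform, in the spirit of the perturbation protocol above. The function `perturb` and its parameter ranges are illustrative assumptions, not the validated pipeline from the cited studies (which also includes contour randomization).

```python
import numpy as np
from scipy import ndimage

def perturb(image, rng, max_shift=2.0, max_angle=5.0):
    """Generate one pseudo-retest image via a small random rigid transform.

    Applies a random sub-voxel translation (up to max_shift voxels per axis)
    and an in-plane rotation (up to max_angle degrees), approximating patient
    repositioning between scans without actually rescanning.
    """
    shift = rng.uniform(-max_shift, max_shift, size=image.ndim)
    angle = rng.uniform(-max_angle, max_angle)
    out = ndimage.shift(image, shift, order=1, mode="nearest")
    return ndimage.rotate(out, angle, axes=(0, 1), reshape=False,
                          order=1, mode="nearest")

rng = np.random.default_rng(42)
img = rng.normal(size=(32, 32))        # stand-in for a 2D image slice
retest = perturb(img, rng)
print(img.shape == retest.shape)       # perturbed image keeps the original grid
```

Features extracted from many such perturbed copies can then be compared to the originals via the ICC, exactly as in a test-retest design.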
Accurate ROI definition is crucial for regional feature consistency. Key segmentation approaches include manual whole-tumor delineation and threshold-based isocontouring.
In cervical cancer studies, features extracted from threshold-based VOI40 isocontours demonstrated significantly better repeatability than those from manually delineated whole-tumor volumes (VOIWT). For instance, gray-level run length matrix (GLRLM) features showed poor repeatability (CCC < 0.52) when extracted from VOIWT but high repeatability (CCC > 0.96) from VOI40 [58].
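Threshold-based isocontouring of the kind behind VOI40 is simple to express: keep every voxel at or above a fixed fraction of the maximum uptake. This is a minimal sketch assuming a PET-like uptake array; the function name `voi_isocontour` is hypothetical.

```python
import numpy as np

def voi_isocontour(suv, fraction=0.40):
    """Threshold-based VOI: voxels at or above `fraction` of the maximum
    uptake (VOI40 when fraction=0.40), a common PET segmentation rule."""
    return suv >= fraction * suv.max()

# Synthetic 1D uptake profile with a hot spot (max SUV = 10, threshold = 4.0).
suv = np.array([0.5, 1.0, 4.0, 9.0, 10.0, 8.0, 3.0, 1.0])
mask = voi_isocontour(suv)
print(mask.astype(int))
```

Because the contour is determined entirely by the image intensities, such masks remove the observer-dependent boundary variability that degrades repeatability in manually delineated VOIWT volumes.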
Table 3: Key Research Reagents and Computational Tools for Radiomic Repeatability Studies
| Tool Category | Specific Tools | Primary Function | Application in Regional Analysis |
|---|---|---|---|
| Image Processing | ITK-SNAP | Manual ROI segmentation | Precise delineation of tumor, peritumoral, and lymph node regions |
| Feature Extraction | PyRadiomics, LIFEx | High-throughput feature calculation | Standardized extraction across different pathological regions |
| Statistical Analysis | R, Python (scikit-learn) | Statistical modeling and ICC calculation | Quantifying repeatability differences between regions |
| Phantom Materials | Customized texture phantoms | Scanner calibration and protocol validation | Ensuring consistent imaging across different tissue densities |
| Data Harmonization | ComBat, Z-score normalization | Mitigating multicenter variability | Reducing institutional bias in multi-region feature analysis |
The experimental workflow for assessing regional feature consistency typically follows a structured pipeline of region delineation, standardized feature extraction, and statistical repeatability analysis.
The repeatability of radiomic features directly impacts the generalizability of predictive models across institutions. Research in esophageal squamous cell cancer demonstrated that models built using high-repeatable features maintained significantly better performance in external validation sets compared to those using low-repeatable features (C-index: 0.67 vs. 0.61 for local recurrence-free survival) [59].
Furthermore, certain feature classes demonstrate more consistent repeatability across pathological regions than others.
The consistency of radiomic features varies significantly across different pathological regions, with primary tumor features generally demonstrating higher repeatability than peritumoral or lymph node features. These regional variations are influenced by multiple factors including imaging modality, segmentation methodology, and feature class.
For researchers developing radiomic models, these findings underscore the importance of assessing repeatability separately for each pathological region, selecting segmentation methods suited to the region and modality, and restricting models to high-repeatability features.
Advancing our understanding of how pathological region impacts feature consistency will enhance the reliability of radiomic models, ultimately accelerating their integration into precision oncology workflows and therapeutic development pipelines.
In radiomics, the high-throughput extraction of minable data from medical images, feature stability is a prerequisite for developing reliable, clinically relevant biomarkers [60] [61]. The radiomics workflow, from image acquisition to model building, is complex and introduces multiple potential sources of variability. Among these, segmentation variability—the differences in delineating the region of interest (ROI) by different observers (inter-observer) or by the same observer at different times (intra-observer)—is a critical bottleneck [60]. This variability in defining the volume from which features are extracted can significantly influence feature values, potentially compromising their reliability and subsequent clinical utility [62] [60]. Therefore, within the broader context of test-retest reliability research, assessing the impact of segmentation variability is paramount for distinguishing robust, physiologically meaningful biomarkers from unstable, segmentation-dependent artifacts. This guide objectively compares the effects of inter- and intra-observer segmentation variability on radiomic feature stability across different imaging modalities, anatomical sites, and experimental setups, providing a synthesis of current experimental data and methodologies.
The impact of segmentation variability has been quantitatively assessed in numerous studies, each employing specific experimental designs and metrics. The data below summarize key findings from multiple investigations, highlighting how feature stability changes under different conditions.
Table 1: Summary of Study Designs and Key Metrics in Segmentation Variability Research
| Study Focus / Anatomical Site | Imaging Modality | Number of Observers / Segmentations | Primary Stability Metric(s) | Key Segmentation Metric (e.g., DSC) |
|---|---|---|---|---|
| Breast Cancer [60] | MRI | 4 observers (Radiologist to Student) | ICC > 0.90 | Mean DSC: 0.81 (Range: 0.19-0.96) |
| Coronary Arteries [63] | PET/CT | 2 observers (Expert) | ICC (Lower Bound) | Auto-segmentation DSC: 0.61 ± 0.05 |
| Organic Phantoms [4] | Novel CBCT | Re-test, Reposition, 90°-rotation | CCC > 0.90 | Not Applicable (Phantom Study) |
| DWI Phantom [64] | MRI | Re-test, Reposition, Intra-/Inter-reader | ICC > 0.90 | Not Applicable (Phantom Study) |
| Clinical Example (Prostate) [4] | Novel CBCT | Re-test (Two scans) | CCC > 0.90 | Not Applicable |
Table 2: Impact of Variability on Radiomic Feature Stability
| Study / Condition | Total Features Extracted | Stable Features (Count or %) | Most Stable Feature Classes / Notes |
|---|---|---|---|
| Breast MRI (Inter-observer) [60] | 1,328 (RadiomiX) | 552 (41.6%) | Local Intensity, GLRLM |
| Breast MRI (Inter-observer) [60] | 833 (PyRadiomics) | 273 (32.8%) | First-Order Statistics |
| Breast MRI - "Easy" Tumors [60] | 1,328 (RadiomiX) | 763 (57.5%) | Higher stability with higher DSC |
| Breast MRI - "Challenging" Tumors [60] | 1,328 (RadiomiX) | 228 (17.2%) | Lower stability with lower DSC |
| Cardiac PET (Inter-observer) [63] | 373 (Unfiltered) | 47 (12.6%) CT, 25 (7.5%) PET | First-Order, GLCM |
| Cardiac PET (Intra-observer) [63] | 373 (Unfiltered) | 133 (35.8%) CT, 57 (15.3%) PET | First-Order, GLCM |
| CBCT Phantoms (Re-test) [4] | 107 | ~98-100% stable | Shape, First-Order, Second-Order |
| CBCT Phantoms (90°-test) [4] | 107 | ~66-86% stable | Stability decreases with rotation |
| Clinical CBCT (Re-test) [4] | 107 | 63% (Prostate), 15% (Bladder/Rectum) | Context-dependent stability |
A critical step in evaluating segmentation variability is understanding the standard experimental protocols used to quantify its effects.
A common study design involves multiple observers manually segmenting the same set of images. The observers should have varying levels of expertise (e.g., dedicated radiologists, residents, students) to assess the generalizability of features across real-world clinical settings [60]. Typically, a crossed design is used, where all observers segment all patient images or phantoms [65]. To assess intra-observer variability, the same observer repeats the segmentation after a suitable time interval (e.g., two months) while being blinded to their initial segmentations to prevent recall bias [66] [63]. The workflow for a typical segmentation reliability study is outlined below.
The consistency of the segmentations themselves is first evaluated using spatial overlap metrics. The Dice Similarity Coefficient (DSC) is the most commonly used metric, quantifying the spatial overlap between two segmentations, with a value of 1 indicating perfect agreement and 0 indicating no overlap [60] [63]. Other metrics like the Hausdorff Distance (HD) may be used to assess the maximum boundary separation [63].
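The DSC described above reduces to a two-line computation on binary masks. This is a generic sketch (the helper name `dice` is ours), with the empty-mask edge case defined explicitly as an assumption.

```python
import numpy as np

def dice(seg_a, seg_b):
    """Dice Similarity Coefficient between two binary segmentations:
    2|A ∩ B| / (|A| + |B|); 1 = perfect overlap, 0 = no overlap."""
    a, b = seg_a.astype(bool), seg_b.astype(bool)
    denom = a.sum() + b.sum()
    if denom == 0:
        return 1.0  # convention assumed here: two empty masks agree perfectly
    return 2.0 * np.logical_and(a, b).sum() / denom

# Two observers delineate a 5x5 square, offset by one voxel in each direction.
obs1 = np.zeros((10, 10), dtype=bool); obs1[2:7, 2:7] = True  # 25 voxels
obs2 = np.zeros((10, 10), dtype=bool); obs2[3:8, 3:8] = True  # 25 voxels
print(round(dice(obs1, obs2), 3))  # overlap 4x4 = 16 -> 2*16/50 = 0.64
```

Even this one-voxel offset drops the DSC to 0.64, which illustrates why small tumors and thin structures (e.g., coronary arteries) yield low overlap scores and unstable features.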
For feature stability, the Intraclass Correlation Coefficient (ICC) is the most widely adopted statistical metric for assessing reliability for continuous variables [61]. It is defined as the ratio of between-subject variance to the total variance (between-subject plus within-subject measurement variance) [61]. Different forms of ICC exist, but models that incorporate absolute agreement are typically required [67]. A common benchmark is to deem features with an ICC > 0.90 as "excellent" and robust to segmentation variability [64] [60] [61]. The Concordance Correlation Coefficient (CCC), which evaluates both accuracy and precision, is also used with a similar threshold (e.g., CCC > 0.90) [4]. The relationship between study design, statistical analysis, and conclusions is shown in the following workflow.
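The two reliability statistics above can be computed directly from their definitions. The sketch below implements the one-way random-effects, absolute-agreement, single-measure ICC(1,1) and Lin's CCC from first principles; the function names and the simulated reader data are illustrative assumptions, and published studies typically use vetted packages (e.g., pingouin in Python) instead.

```python
import numpy as np

def icc_oneway(ratings):
    """One-way random-effects, absolute-agreement, single-measure ICC(1,1).

    ratings : (n_subjects, k_raters) array of one feature's values.
    ICC = (MSB - MSW) / (MSB + (k-1) * MSW), i.e. the between-subject
    share of total variance described in the text.
    """
    n, k = ratings.shape
    grand = ratings.mean()
    subj_means = ratings.mean(axis=1)
    msb = k * ((subj_means - grand) ** 2).sum() / (n - 1)          # between-subject mean square
    msw = ((ratings - subj_means[:, None]) ** 2).sum() / (n * (k - 1))  # within-subject mean square
    return (msb - msw) / (msb + (k - 1) * msw)

def ccc(x, y):
    """Lin's Concordance Correlation Coefficient: penalizes both poor
    correlation (precision) and systematic shift/scale (accuracy)."""
    mx, my = x.mean(), y.mean()
    cov = ((x - mx) * (y - my)).mean()
    return 2 * cov / (x.var() + y.var() + (mx - my) ** 2)

# Simulated feature: strong subject-level signal, small reader noise.
rng = np.random.default_rng(1)
truth = rng.normal(10, 3, size=30)                          # 30 subjects
reads = truth[:, None] + rng.normal(0, 0.5, size=(30, 2))   # 2 readers
print(icc_oneway(reads) > 0.9, ccc(reads[:, 0], reads[:, 1]) > 0.9)
```

With reader noise small relative to between-subject variation, both statistics clear the 0.90 "excellent" threshold; inflating the noise term pushes them below it.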
Table 3: Essential Tools for Segmentation Variability and Radiomics Research
| Tool Category | Specific Examples | Function in Research |
|---|---|---|
| Radiomics Software | PyRadiomics [62] [60] [68], RadiomiX [60] | Open-source & commercial platforms for standardized feature extraction. Adherence to IBSI standards is critical. |
| Segmentation Software | 3D Slicer [62], MIM [63], MicroDicom [66] | Applications for manual, semi-automatic, or automatic delineation of Regions of Interest (ROIs). |
| AI Segmentation Models | nnUNet [63] | State-of-the-art deep learning framework for automated segmentation, used as a comparator to manual variability. |
| Statistical Analysis | R (pingouin), Python (Pingouin, SciPy) [66] | Programming languages/packages for calculating ICC, CCC, and other reliability statistics. |
| Stability Metrics | Dice Similarity Coefficient (DSC), Intraclass Correlation Coefficient (ICC) [60] [67] [63] | Quantitative metrics to evaluate spatial agreement of segmentations and reliability of extracted features. |
The data consistently demonstrates that segmentation variability is a major determinant of radiomic feature stability. A substantial proportion of features are unstable when faced with inter- and intra-observer segmentation differences. For instance, in breast MRI, only about one-third to two-fifths of features were robust across four observers [60]. This effect is magnified in challenging segmentation tasks, such as with irregular, spiculated tumors or complex anatomical structures like coronary arteries, where the number of robust features can drop significantly [60] [63].
The class of radiomic features plays a role in stability. While no single feature class is universally stable, first-order statistics and texture features from the Gray Level Co-occurrence Matrix (GLCM) and Gray Level Run Length Matrix (GLRLM) are frequently among the more robust groups [4] [60] [63]. In contrast, shape features have been shown to be the least reliable when derived from AI-based segmentations compared to manual ones, which is intuitive given their direct dependence on the segmentation boundary [63].
Furthermore, the image modality and context influence stability. Phantom studies, which control for biological noise, often show very high stability in re-test scenarios [4] [64]. However, this stability can degrade dramatically with changes in positioning or rotation, and the transfer to clinical patient data is not straightforward, as seen with the lower stable feature fraction in prostate, rectum, and bladder compared to phantoms using the same CBCT system [4]. This underscores that stability is context-dependent and must be verified in the specific clinical setting.
This comparison guide underscores that inter- and intra-observer segmentation variability presents a significant challenge to the stability and reproducibility of radiomic features. The widespread use of metrics like the DSC and ICC provides a standardized framework for quantifying these effects. The evidence shows that a failure to account for segmentation variability risks building radiomic models on technically unstable, non-generalizable biomarkers.
Moving forward, the field is adopting several strategies to mitigate these issues, including automated deep learning segmentation (e.g., nnUNet), a priori filtering of features that are unstable under delineation differences, and adherence to IBSI standardization guidelines.
In conclusion, rigorous assessment of segmentation-related effects is not an optional step but a foundational requirement in the test-retest reliability framework for radiomics. By objectively quantifying these effects and focusing on robust biomarkers, the path toward clinically applicable and reliable radiomic models can be achieved.
The reliability of quantitative radiomic features is paramount for their translation into clinical research and drug development. A core challenge lies in the sensitivity of these features to variations in image acquisition parameters, including the scanner type, imaging protocol, and reconstruction settings. This variability can obscure genuine biological signals, compromising the validity of longitudinal studies and multi-center trials. Consequently, harmonization strategies are essential to ensure that radiomic features are robust and reproducible, meeting the stringent requirements of test-retest reliability studies. This guide provides a comparative evaluation of prominent harmonization techniques, assessing their efficacy in mitigating technical variability to produce reliable radiomic biomarkers.
Harmonization techniques can be broadly categorized into image processing methods, deep learning-based approaches, and acquisition protocol standardization. The optimal choice depends on the specific application, computational resources, and the desired outcome—whether for visual interpretation or quantitative feature reproducibility [69].
Deep learning techniques have demonstrated superior performance in harmonization tasks. Convolutional Neural Networks (CNNs) excel at enhancing image quality for visual interpretation, while Generative Adversarial Networks (GANs) are more effective at ensuring the reproducibility of quantitative radiomic and deep features [69].
Table 1: Performance Comparison of Deep Learning Harmonization Techniques on CT Images
| Harmonization Technique | Key Strength | Quantitative Performance (Sample Data) | Best For |
|---|---|---|---|
| Convolutional Neural Networks (CNNs) [69] | High image similarity enhancement | PSNR: ↑ 17.76 to 31.93; SSIM: ↑ 0.22 to 0.75 [69] | Visual interpretation, diagnostic tasks |
| Generative Adversarial Networks (GANs) [69] | Superior feature reproducibility | Radiomic feature CCC: 0.97; Deep feature CCC: 0.84 [69] | Quantitative radiomics, machine learning models |
Traditional methods and acquisition standardization provide foundational and practical approaches to reduce variability.
Table 2: Performance of Traditional Methods and Feature Selection in Different Modalities
| Method / Observation | Modality / Context | Performance / Finding | Reference |
|---|---|---|---|
| Feature Repeatability (ICC>0.75) | Cardiac MRI (T1 maps) | 44.9% of features (44/98) showed good-excellent repeatability [43] | Marfisi et al. |
| Feature Repeatability (ICC>0.75) | Cardiac MRI (T2 maps) | 38.8% of features (38/98) showed good-excellent repeatability [43] | Marfisi et al. |
| Image Perturbation vs. Test-Retest | Breast MRI (ADC maps) | High correlation (r=0.79) in feature ICC; model with ICC>0.9 features showed AUC=0.76-0.77 and prediction ICC>0.9 [7] | Song et al. |
This protocol systematically assesses harmonization techniques against variations in radiation dose and reconstruction kernels [69].
This protocol compares two methods for evaluating feature repeatability: test-retest imaging and computational image perturbation [7].
Table 3: Essential Tools and Resources for Radiomics Harmonization Research
| Tool / Resource | Function / Purpose | Example Use Case |
|---|---|---|
| PyRadiomics [12] [70] | Open-source Python library for standardized extraction of radiomic features from medical images. | Extracting 1015 radiomic features from ROIs for repeatability analysis [12]. |
| IBSI Guidelines [71] | Reference standards (Image Biomarker Standardisation Initiative) ensuring consistent calculation and reporting of radiomic features. | Providing a consensus-based, exhaustive set of mathematical definitions for features [71]. |
| ICC & RC Metrics [43] [72] | Statistical measures (Intraclass Correlation Coefficient, Repeatability Coefficient) for quantifying feature repeatability and reproducibility. | Identifying a subset of stable myocardial radiomic features with ICC > 0.75 [43]. |
| Image Perturbation Algorithms [7] | Computational generation of pseudo-retest images via random transformations (translation, rotation, contour randomization). | Assessing feature repeatability when a true test-retest dataset is not available [7]. |
| Deep Learning Frameworks | Software libraries (e.g., TensorFlow, PyTorch) for implementing and training CNN and GAN harmonization models. | Training a U-Net or Pix2Pix model to map images from one acquisition parameter set to another [69]. |
| Resampling & Normalization [70] | Preprocessing techniques to achieve uniform voxel spacing and intensity value ranges across heterogeneous datasets. | Mitigating variability from different scanners or protocols before feature extraction [70]. |
The harmonization of acquisition parameters is a critical step in the development of robust and clinically relevant radiomic models. The evidence indicates that while traditional pre-processing and feature selection are necessary, deep learning-based harmonization offers a powerful, data-driven solution. The choice of technique should be guided by the study's endpoint: CNNs are superior for tasks requiring high image fidelity for visual interpretation, whereas GANs are more effective for ensuring the reproducibility of quantitative features in predictive models. Furthermore, incorporating feature repeatability analysis, whether through test-retest or image perturbation, is essential for building reliable models. A multi-pronged strategy combining protocol standardization, advanced harmonization techniques, and rigorous feature stability assessment paves the way for translating radiomics from research into drug development and clinical practice.
The reliability and reproducibility of radiomic features are fundamental prerequisites for their translation into clinical research and practice. The stability of these features across multiple tests and retests—a concept known as test-retest reliability—is profoundly influenced by choices made during the image preprocessing phase. Discretization, filtering, and standardization are not merely technical preliminaries but are critical determinants of whether a radiomic signature will hold predictive value when validated on independent datasets or in clinical settings. Variations in preprocessing protocols can introduce substantial non-biological variance, potentially obscuring true phenotypic signatures and compromising the validity of radiomic models [31]. This guide systematically compares prevalent preprocessing strategies, evaluating their impact on feature stability and model performance within the specific context of test-retest reliability research.
Intensity discretization, the process of grouping continuous image intensity values into a finite number of discrete bins, is a crucial step for calculating texture features. The method and parameters of discretization significantly influence the resultant radiomic feature values and their stability.
Absolute vs. Relative Discretization: Absolute discretization employs a fixed bin width (e.g., 6 or 42), preserving absolute intensity differences, which can be beneficial for CT data with Hounsfield Units. In contrast, relative discretization uses a fixed number of bins (e.g., 16, 32, or 128) across the intensity range of the Region of Interest (ROI), effectively normalizing the intensities and is often recommended for MRI data with arbitrary units [73] [31].
Parameter Selection: The choice of bin number or width represents a trade-off between texture detail and feature stability. Excessively few bins may oversimplify the texture, while too many can amplify noise. In a pancreas MRI study, a fixed bin number of 16 yielded 42 significant second-order texture features, outperforming other bin numbers and widths [73]. Conversely, a brain metastasis study on MRI found a model with 32 bins achieved the highest accuracy (70%) and AUC (0.70), while a model with 10 bins performed best among the "fixed bin number" approaches (79% accuracy) [74]. This indicates that the optimal parameter may be context-dependent.
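The two discretization schemes contrasted above differ only in how the bin edges are derived. This is a minimal sketch of both (the helper names `discretize_fbn` and `discretize_fbw` are ours); production pipelines would follow the IBSI definitions exactly, including edge-case handling.

```python
import numpy as np

def discretize_fbn(roi, n_bins=16):
    """Relative discretization: fixed bin number spread over the ROI's own
    intensity range, normalizing away absolute units (suited to MRI)."""
    lo, hi = roi.min(), roi.max()
    bins = ((roi - lo) / (hi - lo) * n_bins).astype(int) + 1
    return np.clip(bins, 1, n_bins)            # top voxel falls in the last bin

def discretize_fbw(roi, bin_width=6.0, origin=0.0):
    """Absolute discretization: fixed bin width, preserving absolute
    intensity differences (suited to CT Hounsfield Units)."""
    return np.floor((roi - origin) / bin_width).astype(int) + 1

roi = np.array([0.0, 10.0, 25.0, 63.0, 100.0])
print(discretize_fbn(roi, n_bins=16))    # bins relative to the ROI range
print(discretize_fbw(roi, bin_width=6))  # bins tied to absolute intensities
```

Note that under fixed bin number the mapping changes whenever the ROI's min/max changes, whereas fixed bin width keeps the same intensity-to-bin mapping across images, which is exactly the trade-off driving the context-dependent results in Table 1.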
Table 1: Impact of Discretization Parameters on Radiomic Analysis Outcomes
| Discretization Method | Key Parameter | Reported Effect on Features / Model Performance | Study Context |
|---|---|---|---|
| Relative (Fixed Bin Number) | 16 bins | Yielded 42 significant second-order texture features [73] | Pancreas MRI [73] |
| Relative (Fixed Bin Number) | 32 bins | Achieved 70% accuracy, AUC 0.70 [74] | Brain Metastasis MRI [74] |
| Relative (Fixed Bin Number) | 10 bins | Achieved 79% accuracy [74] | Brain Metastasis MRI [74] |
| Relative (Fixed Bin Number) | 128 bins | Yielded 38 significant second-order texture features [73] | Pancreas MRI [73] |
| Absolute (Fixed Bin Width) | Width of 6 | Yielded 24 significant second-order texture features [73] | Pancreas MRI [73] |
| Absolute (Fixed Bin Width) | Width of 42 | Yielded 26 significant second-order texture features [73] | Pancreas MRI [73] |
Image filtering techniques are applied to emphasize or suppress specific image characteristics prior to feature extraction. The choice of filter can selectively enhance feature sets relevant to particular biological questions.
Laplacian of Gaussian (LoG): This filter is effective for edge enhancement and highlighting blob-like structures. It is commonly used in radiomics studies of the pancreas and brain metastases [73] [74]. The sigma (σ) parameter controls the coarseness of the texture analyzed, with smaller σ (e.g., 2 mm) emphasizing finer textures and larger σ (e.g., 5 mm) emphasizing coarser textures [73].
Wavelet & Logarithm Filters: Wavelet filters decompose images into frequency components, enabling multi-scale texture analysis. Logarithm filters can help in handling data with multiplicative noise and are promising in diffuse diseases of the pancreas [73].
Mean Filter: A simple filter for noise reduction and smoothing. In brain metastasis studies, both the LoG and Mean filters demonstrated superior performance for model development [74].
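The LoG and mean filters discussed above are available directly in scipy.ndimage; the sketch below applies both to a synthetic image, with the sigma values chosen to mirror the fine/coarse texture scales mentioned in the text (the image itself is an illustrative stand-in).

```python
import numpy as np
from scipy import ndimage

rng = np.random.default_rng(7)
img = rng.normal(size=(64, 64))  # stand-in for an image slice

# Laplacian of Gaussian: sigma controls the coarseness of enhanced texture.
fine   = ndimage.gaussian_laplace(img, sigma=2.0)  # emphasizes finer structure
coarse = ndimage.gaussian_laplace(img, sigma=5.0)  # emphasizes coarser structure

# Mean filter: simple neighborhood averaging for noise reduction.
smooth = ndimage.uniform_filter(img, size=3)

print(fine.shape == img.shape, smooth.std() < img.std())
```

Each filtered image is then fed to the feature extractor as an additional image type, multiplying the feature set; this is why filter choice directly shapes which features enter a model.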
Table 2: Common Filters in Radiomic Preprocessing and Their Applications
| Filter Type | Primary Function | Impact on Radiomics | Exemplary Use Case |
|---|---|---|---|
| Laplacian of Gaussian (LoG) | Edge enhancement, blob detection | Highlights structural boundaries and coarse/fine textures; superior performance in brain metastasis models [74] | Brain metastasis treatment response prediction [74] |
| Wavelet | Multi-scale frequency decomposition | Extracts textural information at different spatial scales | General multi-scale texture analysis [74] |
| Logarithm | Multiplicative noise reduction, dynamic range compression | Improves significance of first-order features [73] | Chronic pancreatitis assessment in pancreas MRI [73] |
| Mean | Noise reduction, smoothing | Demonstrated superior model performance in brain metastasis [74] | Brain metastasis treatment response prediction [74] |
Intensity rescaling aims to normalize the intensity values across different images or scanners, reducing domain shift. A common method is Z-score normalization, which calculates the mean (μ) and standard deviation (σ) of grey-levels within the ROI and excludes or clips grey-levels outside the range μ ± 3σ to remove outliers [31]. In brain metastasis studies, mean relative ROI ±3SD rescaling improved model accuracy (73% vs 61%) and AUC (0.74 vs 0.60) compared to min-max rescaling, highlighting its importance for model performance [74].
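The μ ± 3σ clipping rule described above reduces to a short function. This is a simplified sketch; PyRadiomics exposes similar behavior through its `normalize` and `removeOutliers` settings.

```python
import numpy as np

def zscore_normalize(roi, clip_sigma=3.0):
    # Z-score normalize ROI grey levels, clipping values that fall
    # outside mu +/- clip_sigma * sd (the mu +/- 3*sigma rule).
    mu, sd = roi.mean(), roi.std()
    clipped = np.clip(roi, mu - clip_sigma * sd, mu + clip_sigma * sd)
    return (clipped - mu) / sd

# 99 typical voxels plus one extreme outlier.
roi = np.concatenate([np.full(99, 10.0), [1000.0]])
z = zscore_normalize(roi)
print(z.min().round(3), z.max().round(3))
```

After normalization, the outlier is pinned to exactly 3 standard deviations, so it can no longer dominate intensity-based features.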
A study on chronic pancreatitis and healthy controls provides a clear protocol for evaluating preprocessing effects, systematically varying the discretization and filter settings and counting the significant features obtained under each configuration [73].
Finding: The number of significant features, especially second-order textures, was highly sensitive to the discretization method and parameters [73].
True test-retest studies, where a patient is scanned twice within a short interval, represent the gold standard for assessing feature repeatability. However, due to practical and ethical constraints, image perturbation methods have emerged as a viable alternative.
Test-Retest Analysis: A study on rectal cancer used a clinical test-retest CT dataset. Feature stability was assessed using the Concordance Correlation Coefficient (CCC), with a threshold of CCC > 0.85 considered reproducible. In this challenging clinical setting, only 9 out of 542 features met this criterion, underscoring the profound impact of real-world variability [29].
Image Perturbation Protocol: When test-retest images are unavailable, synthetic perturbations can assess feature robustness. A validated workflow applies randomized perturbations to the original scan, such as random translations, rotations, noise addition, and contour randomizations, and re-extracts features from each perturbed image [7].
Comparative Finding: Research shows that feature repeatability assessed by perturbation strongly correlates (r=0.79) with test-retest stability. Models built on features filtered by perturbation (ICC>0.9) can achieve similar reliability to those based on test-retest, with testing AUC of 0.7-0.8 and prediction ICC > 0.9 [7].
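A minimal sketch of such a perturbation chain (translation, rotation, noise) using SciPy. Contour randomization, which requires a segmentation-mask model, is omitted, and the parameter ranges here are illustrative rather than those of [7].

```python
import numpy as np
from scipy import ndimage

rng = np.random.default_rng(0)

def perturb(image, max_shift=2.0, max_angle=5.0, noise_sd=1.0):
    # One "pseudo-retest" image: random sub-voxel translation, small
    # rotation, and additive Gaussian noise.
    shifted = ndimage.shift(image, rng.uniform(-max_shift, max_shift, size=2))
    rotated = ndimage.rotate(shifted, rng.uniform(-max_angle, max_angle),
                             reshape=False)
    return rotated + rng.normal(0.0, noise_sd, size=image.shape)

img = np.zeros((32, 32))
img[12:20, 12:20] = 100.0
pseudo_retests = [perturb(img) for _ in range(30)]  # 30 perturbed copies
print(len(pseudo_retests))
```

Features extracted from the original and each perturbed copy are then compared with the ICC to flag unstable features, mimicking a test-retest analysis without extra scans.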
The following diagram illustrates the standard radiomics workflow, highlighting the preprocessing steps and their critical role in ensuring feature reliability.
Radiomics Preprocessing and Stability Workflow
Choosing the right preprocessing strategy depends on the imaging modality, clinical question, and need for feature stability. The following diagram provides a logical pathway for making these choices.
Preprocessing Strategy Decision Pathway
Table 3: Key Software Tools and Analytical Solutions for Radiomics Research
| Tool Name / Category | Primary Function | Role in Preprocessing & Stability Analysis |
|---|---|---|
| PyRadiomics | Radiomic Feature Extraction | An open-source Python package that implements standardized extraction of a wide range of features, allowing for precise configuration of discretization and filtering parameters. [31] |
| 3D Slicer | Medical Image Visualization & Analysis | An open-source platform with a PyRadiomics plugin, enabling interactive image segmentation, preprocessing, and feature extraction without extensive programming. [31] |
| LIFEx | Radiomics Stand-Alone Software | A stand-alone platform with an integrated graphical user interface for segmentation and texture analysis, facilitating user-friendly radiomics studies. [31] |
| ITK-SNAP | Interactive Image Segmentation | A specialized tool for detailed manual and semi-automatic segmentation of structures in medical images, a critical step preceding preprocessing. |
| Intra-class Correlation Coefficient (ICC) | Statistical Metric | Measures feature repeatability between test-retest scans or perturbed images. Features with high ICC (e.g., >0.8 or >0.9) are considered stable and selected for model building. [7] |
| Concordance Correlation Coefficient (CCC) | Statistical Metric | An alternative metric for assessing agreement between two measurements, often used in test-retest analyses to identify robust features. [29] |
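Both agreement metrics in the table are straightforward to compute. As an example, Lin's CCC, applied with the 0.85 threshold in [29], can be sketched as follows (the test/retest values are illustrative).

```python
import numpy as np

def lin_ccc(x, y):
    # Lin's concordance correlation coefficient between two measurements,
    # e.g. a feature's values on the test and retest scans.
    x, y = np.asarray(x, float), np.asarray(y, float)
    mx, my = x.mean(), y.mean()
    vx, vy = x.var(), y.var()                # population variances
    cov = ((x - mx) * (y - my)).mean()
    return 2.0 * cov / (vx + vy + (mx - my) ** 2)

test = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
retest = np.array([1.1, 2.0, 2.9, 4.2, 5.0])
print(round(lin_ccc(test, retest), 3))
# A feature would be retained if its CCC exceeded 0.85, as in [29].
```

Unlike plain Pearson correlation, the `(mx - my)**2` term penalizes systematic shifts between scans, so the CCC measures agreement, not just association.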
The path to clinically reliable radiomic models is inextricably linked to rigorous and standardized preprocessing. Evidence consistently shows that the choices of discretization parameters, filter types, and rescaling methods significantly impact the stability and discriminative power of extracted features. While no single set of parameters is universally optimal, the consensus leans toward relative discretization (fixed bin number of 16-32 for MRI) and the use of filters like LoG for enhancing relevant textural information. Critically, the practice of assessing feature stability—whether through gold-standard test-retest studies or computationally efficient image perturbation—must be integrated into the radiomic workflow. By adopting a systematic and evidence-based approach to preprocessing, researchers can significantly enhance the reproducibility and translational potential of their radiomic models.
Radiomics harnesses quantitative features extracted from medical images to predict clinical outcomes, offering significant potential for personalized medicine. However, the transition of radiomic models from research to clinical practice is hindered by challenges in feature repeatability and reproducibility. This guide provides a comparative analysis of experimental methodologies for establishing the test-retest reliability of radiomic features and linking them to robust prognostic models. We synthesize evidence from multiple cancer types and imaging modalities, offering a structured framework for researchers and drug development professionals to validate the prognostic value of highly repeatable radiomic features.
Radiomics converts routine medical images into mineable, high-dimensional data by extracting numerous quantitative features that describe tumor phenotype. These features—encompassing morphology, intensity statistics, and texture—can serve as non-invasive biomarkers for diagnosis, prognosis, and treatment response prediction [75] [8]. A fundamental prerequisite for any radiomic biomarker to be clinically useful is repeatability (stability under identical imaging conditions) and reproducibility (stability across varying imaging conditions) [1].
The high dimensionality of radiomic data, often characterized by many more features than patient samples, increases the risk of model overfitting and spurious findings. Without establishing feature repeatability first, a model may appear predictive in a development cohort but fail in independent validation, not due to a lack of biological signal, but because it was built on unstable, non-repeatable features [75] [8]. This guide compares the primary experimental approaches for identifying repeatable features and demonstrates how this critical step underpins the development of reliable prognostic models.
Two primary experimental paradigms exist for evaluating radiomic feature repeatability: the test-retest study and the image perturbation approach. The table below compares their protocols, advantages, and challenges.
Table 1: Comparison of Radiomic Feature Repeatability Assessment Methods
| Aspect | Test-Retest Imaging | Image Perturbation |
|---|---|---|
| Core Protocol | Repeatedly scanning the same patient within a short time interval under near-identical conditions [7] [76]. | Applying simulated variations to a single scan (e.g., random translations, rotations, contour randomizations, noise addition) [7] [8]. |
| Key Metric | Intraclass Correlation Coefficient (ICC) between feature values from the two scans [43] [77]. | ICC between feature values from the original and multiple perturbed images [7]. |
| Advantages | Captures real-world variability from the entire imaging process [7]; considered the "gold standard" for assessing repeatability. | No additional patient radiation dose or scanner time [7]; can generate large numbers of "pseudo-retest" images from a single scan; allows controlled study of specific variation sources. |
| Challenges | Logistically challenging and expensive [7]; requires ethical consideration for extra scans/radiation; limited sample sizes in existing studies [7]. | May not fully capture all real-world biological and technical variances [7]; requires careful selection of perturbation parameters. |
| Prognostic Performance | Models built with test-retest-selected features (ICC>0.9) showed high testing AUC (0.77) and prediction ICC (0.87) [7]. | Models built with perturbation-selected features (ICC>0.9) achieved comparable testing AUC (0.76) and prediction ICC (0.90) [7]. |
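For illustration, the one-way random-effects ICC(1,1) for an n-subjects × 2-scans matrix can be computed as below. Published studies more often use the two-way forms ICC(2,1) or ICC(3,1), for which packages such as Pingouin provide implementations; the feature values here are illustrative.

```python
import numpy as np

def icc_oneway(measurements):
    # One-way random-effects ICC(1,1) for an (n_subjects x k_repeats)
    # matrix of feature values, e.g. test and retest columns.
    m = np.asarray(measurements, float)
    n, k = m.shape
    grand = m.mean()
    subj_means = m.mean(axis=1)
    msb = k * ((subj_means - grand) ** 2).sum() / (n - 1)         # between subjects
    msw = ((m - subj_means[:, None]) ** 2).sum() / (n * (k - 1))  # within subjects
    return (msb - msw) / (msb + (k - 1) * msw)

# Feature values for 5 subjects on test (col 0) and retest (col 1).
vals = np.array([[10.1, 10.3], [12.0, 11.8], [15.2, 15.0],
                 [9.7, 9.9], [13.4, 13.3]])
print(round(icc_oneway(vals), 3))
```

A feature with ICC above the chosen threshold (e.g., 0.9) would pass the repeatability filter described in the table.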
Evidence consistently shows that not all radiomic feature classes are equally repeatable. The table below synthesizes findings from multiple studies across different anatomical sites and imaging modalities.
Table 2: Repeatability of Radiomic Feature Classes Across Studies
| Feature Class | Reported Repeatability | Context and Examples |
|---|---|---|
| First-Order Statistics | Generally the most reproducible class [1] [2]. | Entropy is consistently among the most stable features [1]. In cardiac T1 mapping, Mean, Median, and 10Percentile showed high repeatability (ICC > 0.75) [43]. |
| Shape Features | Generally show good repeatability. | Particularly repeatable in cardiac MRI [43] [77]. |
| Textural Features | Generally less robust than first-order and shape features [1] [2]. | Coarseness and contrast are among the least reproducible [1]. In cardiac MRI, a subset (e.g., RunLengthNonUniformityNormalized, RunPercentage) can show high repeatability [43]. |
| General Observation | Sensitive to processing details across all classes [1] [2]. | Repeatability and reproducibility depend on image acquisition settings, reconstruction algorithms, and segmentation methods [1] [2]. |
Filtering features based on repeatability metrics directly impacts the reliability of subsequent prognostic models.
Breast Cancer Prediction Model: A study on breast cancer (191 patients) predicting pathological complete response (pCR) found that model reliability improved with higher ICC thresholds for feature selection. The testing AUC for a logistic regression model increased from 0.56 (no ICC filter) to a maximum of 0.76 using image perturbation (ICC≥0.9) and 0.77 using test-retest (ICC≥0.9). Model robustness, measured by prediction ICC, also improved significantly (>0.9 at ICC≥0.9 threshold). Notably, overly stringent filtering (ICC≥0.95) caused a performance drop in test-retest models, highlighting the need to balance repeatability with predictive information [7].
Gastric Cancer Prognostic Model: A large multicenter study developed a machine learning model for overall and cancer-specific survival in gastric cancer. While not explicitly detailing repeatability filtering, the study emphasized robust feature selection and external validation, achieving a C-index of 0.719 for cancer-specific survival. This underscores that rigorous methodology, which should include stability assessment, leads to generalizable models [78] [79].
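The filtering step these studies rely on reduces, in code, to simple thresholding of per-feature repeatability estimates. The feature names and ICC values below are purely illustrative stand-ins for results of a test-retest or perturbation analysis.

```python
# Hypothetical per-feature repeatability values; in practice these ICCs
# would come from a test-retest or perturbation analysis as in [7].
feature_icc = {
    "original_firstorder_Entropy": 0.96,
    "original_firstorder_Mean": 0.93,
    "original_glcm_Contrast": 0.81,
    "original_ngtdm_Coarseness": 0.42,
    "original_shape_Sphericity": 0.91,
}

def select_repeatable(icc_by_feature, threshold=0.9):
    # Keep only features whose repeatability ICC meets the threshold,
    # the filtering step applied before model building in [7].
    return sorted(f for f, icc in icc_by_feature.items() if icc >= threshold)

print(select_repeatable(feature_icc, 0.9))    # stricter filter
print(select_repeatable(feature_icc, 0.75))   # looser filter keeps more
```

Varying the threshold trades repeatability against retained predictive information, which is exactly the balance the breast cancer study highlights at ICC ≥ 0.95.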
The following diagram illustrates the logical workflow for linking repeatability analysis to validated clinical outcomes.
A study investigating myocardial T1 and T2 mapping provides a detailed test-retest protocol, in which subjects were scanned twice and feature agreement was quantified with the ICC [43] [77].
A study on a breast cancer dataset demonstrated the image perturbation approach, generating pseudo-retest images from single scans and filtering features by their repeatability before modeling [7].
Table 3: Essential Tools for Radiomic Repeatability and Prognostic Validation Studies
| Tool / Resource | Function | Examples & Notes |
|---|---|---|
| Test-Retest Datasets | Provides ground-truth data for assessing feature repeatability under real scanning conditions. | Public datasets like RIDER (CT) [76]. Specific disease cohorts (e.g., breast cancer [7], cardiac patients [43]). |
| Image Perturbation Software | Generates simulated test-retest images, offering a flexible and dose-free alternative. | In-house or open-source algorithms for translations, rotations, noise addition, and contour randomizations [7]. |
| Segmentation Software | Defines the region of interest (ROI) from which features are extracted. | ITK-SNAP [43], 3D Slicer. Manual, semi-, or fully-automated methods impact reproducibility [76]. |
| Radiomic Feature Extraction Platforms | Standardized extraction of quantitative features from images. | PyRadiomics (Python) [77], IBSI-compliant software. Standardization is critical for reproducibility [8] [2]. |
| Statistical Analysis Software | Calculates repeatability metrics and builds prognostic models. | R, Python (Scipy, Pingouin). Used for ICC, CCC, machine learning (Cox, RSF, SVM, etc.) [7] [77]. |
| Reporting Checklist | Ensures comprehensive reporting to enable study replication. | Checklist based on systematic reviews to improve reporting quality [2]. |
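Many of the preprocessing and extraction choices discussed in this guide are declared in a single parameter file for tools like PyRadiomics. A hypothetical configuration might look as follows; the key names follow PyRadiomics conventions, but the specific values are examples, not recommendations.

```yaml
imageType:
  Original: {}
  LoG:
    sigma: [2.0, 5.0]      # fine and coarse texture scales (mm)
  Wavelet: {}

featureClass:
  shape:
  firstorder:
  glcm:

setting:
  binCount: 32             # relative discretization (binWidth for absolute)
  normalize: true          # z-score intensity normalization
  removeOutliers: 3        # clip grey levels outside mu +/- 3 sigma
  resampledPixelSpacing: [1.0, 1.0, 1.0]
```

Versioning such a file alongside the analysis code is a practical way to make an extraction pipeline reproducible across sites.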
The pathway to clinically validated radiomic prognostic models is inextricably linked to the rigorous assessment of feature repeatability. Both test-retest and image perturbation methods provide viable pathways to filter out unstable features, thereby improving model generalizability and robustness. While test-retest remains the gold standard, image perturbation offers a practical and powerful alternative, especially when test-retest imaging is not feasible. The consistent finding that a subset of radiomic features demonstrates high repeatability across diverse clinical contexts is encouraging. Future research should focus on standardizing workflows, validating repeatable feature sets in larger multi-institutional cohorts, and formally establishing their value in prospective clinical trials for drug development and personalized therapy.
The identification of robust biomarkers that reliably predict clinical outcomes across multiple cancer types represents a pivotal challenge in oncology research. Pan-cancer analyses, which interrogate molecular data across diverse malignancies, have emerged as powerful approaches for discovering conserved biological mechanisms and consistent prognostic features. Such cross-cancer biomarkers offer significant advantages for understanding shared tumorigenic processes, developing broadly applicable diagnostic tools, and identifying therapeutic targets with potential utility beyond individual cancer types. This review synthesizes recent methodological advances and empirical findings in pan-cancer biomarker discovery, with particular attention to the stability and reliability of these features—a concern prominently highlighted in parallel research on test-retest reliability of radiomic features.
The integration of diverse molecular data types, or multi-omics analysis, significantly enhances the discovery of robust pan-cancer biomarkers. One comprehensive approach simultaneously analyzed DNA methylation (DM), gene expression (GE), somatic copy number alteration (SCNA), and microRNA expression (ME) data from 13 cancer types [80]. This method transformed each omics dataset into a standardized gene matrix, applied z-score normalization, and computed a unified "Score" to rank genes by their prognostic potential [80]. The resulting biomarkers demonstrated impressive prognostic power, with C-indexes ranging from 0.76 to 0.96 across cancer types [80].
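The scoring idea, z-score each omics layer across genes and combine the layers into one ranking statistic, can be sketched as below. The per-layer inputs are random stand-ins for real TCGA-derived association statistics, and the equal-weight average is our simplification of the published "Score".

```python
import numpy as np

rng = np.random.default_rng(1)
genes = ["SLK", "API5", "BTBD2", "PTAR1", "VPS37A"]

# Hypothetical per-gene association statistics from four omics layers.
omics = {layer: rng.normal(size=len(genes))
         for layer in ["DM", "GE", "SCNA", "ME"]}

def unified_score(omics_stats):
    # Z-score each omics layer across genes, then average the layers
    # into a single score used to rank prognostic candidates.
    z_layers = []
    for stats in omics_stats.values():
        s = np.asarray(stats, float)
        z_layers.append((s - s.mean()) / s.std())
    return np.mean(z_layers, axis=0)

scores = unified_score(omics)
ranking = [genes[i] for i in np.argsort(-scores)]
print(ranking)
```

Ranking by a combined z-score rewards genes whose signal is consistent across layers, rather than strong in only one data type.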
Table 1: Multi-Omics Data Types in Pan-Cancer Biomarker Discovery
| Data Type | Biological Significance | Analysis Approach |
|---|---|---|
| DNA Methylation (DM) | Epigenetic regulation, transcriptional silencing/activation | Promoter region hyper/hypomethylation analysis |
| Gene Expression (GE) | Transcriptional activity, cellular phenotype | RNA-seq data normalization and differential expression |
| Somatic Copy Number Alteration (SCNA) | Genomic amplification/deletion, oncogene activation | GISTIC 2.0 processing, correlation with expression |
| microRNA Expression (ME) | Post-transcriptional regulation, mRNA stability | miRNA-mRNA interaction mapping from databases |
An alternative to single-gene biomarkers focuses on pathway-level disruptions. The iPath method identifies prognostic biomarker pathways by detecting significant deviations from transcriptional norms at the individual sample level [81] [82]. This approach operates on the hypothesis that disruption of transcription homeostasis in key pathways has profound implications for clinical outcomes [81]. Pathway-based biomarkers have demonstrated superior robustness and effectiveness compared to single-gene biomarkers because they capture the coordinated activity of multiple genes involved in tumorigenesis [81].
Machine learning approaches, particularly pan-cancer models, have shown enhanced performance for specific prediction tasks compared to cancer-specific models. For predicting 30-day mortality in patients with advanced cancer, a pan-cancer model based on the eXtreme Gradient Boosting (XGBoost) algorithm achieved an average precision of 0.56, outperforming single-cancer models (average precision: 0.51) [83]. Important features identified by this approach—including plasma albumin level, white blood cell count, and lactate dehydrogenase levels—were shared across cancer types, indicating conserved predictors of short-term mortality [83].
The application of multi-omics integration to 13 cancer types identified seven genes consistently associated with prognosis across multiple cancers: SLK, API5, BTBD2, PTAR1, VPS37A, EIF2B1, and ZRANB1 [80]. Among these, SLK emerged as particularly cancer-relevant due to its high missense mutation rate and association with cell adhesion processes [80]. Additional network analysis identified EPRS, HNRNPA2B1, BPTF, LRRK1, and PUM1 as having broad correlations with cancers [80].
Table 2: Experimentally Validated Pan-Cancer Biomarkers
| Biomarker | Molecular Function | Cancer Associations | Prognostic Value |
|---|---|---|---|
| SLK | Serine/threonine kinase, cell adhesion | Multiple cancers, high missense mutation rate | Associated with prognosis in various cancers |
| CENPN | Centromere protein, cell cycle progression | Elevated in most cancer types | Correlates with survival across 33 cancer types |
| API5 | Apoptosis inhibitor | Multiple cancers | Pan-cancer prognostic association |
| EPRS | Glutamyl-prolyl-tRNA synthetase | Network analysis showing broad cancer correlation | Potential pan-cancer biomarker |
| Pathway-based signatures | Coordinated expression of pathway genes | Multiple cancers | Superior to single-gene biomarkers |
A comprehensive pan-cancer analysis of Centromere Protein N (CENPN) demonstrates the biomarker potential of centromere proteins across diverse malignancies [84]. CENPN expression was elevated in most of 33 analyzed cancer types and showed differential expression across molecular and immune subtypes [84]. The protein demonstrated significant diagnostic value with area under the curve (AUC) values in the "good" to "high" range (0.7-0.9+) across multiple cancers [84]. Functionally, CENPN enrichment correlates with cell cycle progression, mitotic nuclear division, and oocyte meiosis pathways [84]. Its expression also positively correlates with Th2 and Tcm cells in most cancers and associates with immunomodulator genetic markers, suggesting relevance for cancer immunotherapy [84].
The following experimental workflow outlines the key steps for multi-omics biomarker discovery:
Multi-Omics Biomarker Discovery Workflow
For comprehensive biomarker identification across cancer types:
Pan-Cancer Expression Analysis Protocol
The reliability of biomarkers—whether molecular or radiomic—fundamentally impacts their clinical utility. Extensive research in radiomics has highlighted the critical importance of test-retest stability in feature selection.
Radiomics research employs rigorous methods to identify stable features. Test-retest experiments typically involve scanning the same subject multiple times under identical conditions, then using intraclass correlation coefficient (ICC) or concordance correlation coefficient (CCC) to quantify feature stability [7] [29] [43]. Commonly, features with ICC > 0.75-0.9 are considered sufficiently stable for further analysis [43] [22].
One study comparing test-retest stability across cancer types found dramatic differences between ideal "coffee-break" scenarios (15-minute intervals) and clinical settings (days between scans) [29]. In lung cancer with a 15-minute interval, 234/542 features showed high stability (CCC > 0.85), while only 9 features met this threshold in rectal cancer with days between scans [29]. This highlights the significant impact of experimental conditions on perceived feature stability.
When test-retest imaging is impractical, image perturbation offers an alternative approach for assessing feature repeatability. This method applies random translations, rotations, and contour randomizations to existing images [7]. Studies comparing both methods have found perturbation can achieve similar optimal reliability with testing AUC = 0.7-0.8 and prediction ICC > 0.9 at ICC threshold of 0.9 [7].
Table 3: Performance Comparison of Pan-Cancer Biomarker Approaches
| Approach | Key Advantages | Limitations | Performance Metrics |
|---|---|---|---|
| Multi-omics Integration | Comprehensive biological insight, higher prognostic power | Computational complexity, data availability requirements | C-indexes: 0.76-0.96 across 13 cancers [80] |
| Pathway-Based (iPath) | Robust to technical variability, captures biological coherence | Pathway definition dependency, interpretation complexity | Superior to single-gene biomarkers for survival prediction [81] |
| Machine Learning (Pan-cancer) | Leverages shared predictors, improved performance | Potential masking of cancer-specific signals | Average precision: 0.56 vs 0.51 for single-cancer models [83] |
| Single-Cancer Models | Cancer-specific optimization, direct clinical applicability | Limited sample sizes, reduced generalizability | Variable performance across cancer types [83] |
Table 4: Key Research Resources for Pan-Cancer Biomarker Discovery
| Resource | Type | Primary Function | Application Example |
|---|---|---|---|
| The Cancer Genome Atlas (TCGA) | Database | Multi-omics cancer data | Primary data source for molecular analyses [80] [84] |
| Genotype-Tissue Expression (GTEx) | Database | Normal tissue reference | Control samples for differential expression [84] |
| cBioPortal | Analysis Tool | Genetic alteration analysis | Somatic mutation frequency across cancers [84] |
| TISIDB | Database | Tumor-immune system interactions | Correlation with immune subtypes [84] |
| PyRadiomics | Software | Radiomic feature extraction | Standardized feature calculation from medical images [22] |
| STRING | Database | Protein-protein interactions | Network analysis of biomarker interactions [84] |
| LinkedOmics | Database | Multi-omics data analysis | Exploration of associations across cancer types [80] |
The identification of consistent prognostic features across cancer types represents a promising frontier in oncology research. Multi-omics integration, pathway-centric approaches, and machine learning models have demonstrated the existence of biomarkers with genuine pan-cancer prognostic potential. The stability and reliability of these biomarkers—mirroring concerns in radiomics research—remain paramount for clinical translation. As methods advance and datasets expand, the continued discovery and validation of cross-cancer consistent features will enhance our understanding of shared tumor biology and accelerate the development of broadly applicable prognostic tools.
The pursuit of robust, non-invasive biomarkers for cancer diagnosis, prognosis, and treatment response has positioned radiomics at the forefront of oncological research. This comparison guide objectively evaluates the robustness of conventional radiomics versus deep learning-based feature extraction methods, framed within the critical context of test-retest reliability. For researchers and drug development professionals, the stability of these quantitative imaging features against variations in image acquisition, segmentation, and processing is a prerequisite for clinical translation. We synthesize experimental data from recent studies across multiple cancer types, detailing methodologies, presenting quantitative performance comparisons, and outlining essential research tools. The evidence indicates that while conventional radiomics requires rigorous robustness filtering to achieve reliability, deep learning models demonstrate inherent stability and can outperform radiomics in real-world heterogeneous settings. Furthermore, fusion models that integrate both approaches show promising synergistic effects, achieving superior predictive performance.
Radiomics is the high-throughput extraction of quantitative features from medical images to divulge cancer biological and genetic characteristics that are imperceptible to the human eye [33]. These features, which include morphological, first-order statistical, and textural descriptors, aim to quantify tumor phenotype [85] [86]. However, the reliability and generalizability of radiomic models are major concerns for clinical adoption [7] [30]. A primary challenge is feature robustness—the stability of a feature's value when measured under varying conditions, such as different imaging scanners, acquisition parameters, segmentation, or even from the same subject imaged twice within a short interval (test-retest) [44] [29].
The test-retest reliability of imaging features is the foundational step for any robust radiomic study. Features that are not repeatable and reproducible are likely to lead to models that fail when applied to new, independent data [44] [30]. This guide systematically compares how conventional handcrafted radiomics features and deep learning (DL)-based features perform in this regard. We examine experimental protocols designed to stress-test feature stability and present synthesized data to help researchers choose the optimal approach for their specific precision oncology goals.
Conventional radiomics involves a multi-step process where handcrafted features are engineered from defined regions of interest (ROIs). The workflow typically includes image acquisition, ROI segmentation, preprocessing (discretization, filtering, and intensity rescaling), extraction of morphological, first-order, and textural features, feature selection, and model building.
Deep learning, particularly Convolutional Neural Networks (CNNs), offers an end-to-end learning paradigm: hierarchical features are learned directly from the images during training, typically with data augmentation, rather than being handcrafted and selected in advance.
A critical component of radiomics research is the experimental design for evaluating feature robustness. Key protocols include test-retest imaging and image perturbation, the latter simulating retest variability through random transformations, noise addition, and contour deformation [44] [33].
Diagram 1: Workflow for assessing radiomic feature robustness via image perturbation. ICC, Intraclass Correlation Coefficient.
The table below synthesizes performance metrics from multiple studies that directly or indirectly compared conventional radiomics and deep learning models.
Table 1: Comparative performance of radiomics, deep learning, and fusion models across different clinical tasks.
| Cancer Type / Task | Model Type | Performance (Metric) | Key Finding / Context | Source |
|---|---|---|---|---|
| Lung Nodule Malignancy | Conventional Radiomics (Baseline) | AUROC: 0.792 ± 0.025 | Performance improved significantly with optimization (feature selection, data balancing). | [85] |
| | Deep Learning (Baseline) | AUROC: 0.801 ± 0.018 | Outperformed baseline radiomics without much fine-tuning. | [85] |
| | Deep-Feature Radiomics | AUROC: 0.817 ± 0.032 | | [85] |
| | Conventional Radiomics (Optimized) | AUROC: 0.921 ± 0.010 | | [85] |
| | Deep-Feature Radiomics (Optimized) | AUROC: 0.936 ± 0.011 | | [85] |
| | Hybrid (Radiomics + Deep Features) | AUROC: 0.938 ± 0.010 | The most promising model, indicating complementary information. | [85] |
| HCC Overall Survival | Clinical Model Only | C-index: 0.74 [0.57–0.86] | Best performing model in validation. | [87] |
| | Conventional Radiomics Models | C-index: 0.51–0.66 [0.30–0.79] | Susceptible to data heterogeneity. | [87] |
| | Deep Learning Models | C-index: 0.63–0.71 [0.39–0.88] | Superior prognostic potential under clinical conditions. | [87] |
| MIA vs. IAC Classification | Conventional Radiomics | AUROC: 0.794 | | [88] |
| | 2D Deep Learning (ResNet50) | AUROC: 0.754 | | [88] |
| | 3D Deep Learning (ResNet50) | AUROC: 0.847 | Leveraged full spatial context. | [88] |
| | Late Fusion (Rad + 2D/3D DL) | AUROC: 0.898 | Highest performance, ensembling output probabilities. | [88] |
| Multi-modality Image Classification | Statistical / Radiomics Features | Sensitivity: 90.8%–92.2%; Latency: High | Less effective, time-intensive. | [89] |
| | Deep Learning Features (ResNet50) | Sensitivity: 96.0%–96.9%; Latency: Low (4× faster) | Efficient, high performance for rapid diagnostics. | [89] |
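The late-fusion strategy that tops the MIA/IAC comparison amounts to averaging the class probabilities of independently trained models. A minimal sketch, where the model names and probability values are illustrative:

```python
import numpy as np

def late_fusion(prob_lists, weights=None):
    # Late fusion: average the predicted class probabilities of
    # independently trained models (e.g. radiomics LR, 2D CNN, 3D CNN).
    probs = np.asarray(prob_lists, float)          # (n_models, n_samples)
    if weights is None:
        weights = np.full(len(probs), 1.0 / len(probs))
    return np.average(probs, axis=0, weights=weights)

p_radiomics = [0.62, 0.30, 0.81]   # per-sample P(class 1) from each model
p_cnn2d = [0.55, 0.42, 0.77]
p_cnn3d = [0.70, 0.25, 0.90]
fused = late_fusion([p_radiomics, p_cnn2d, p_cnn3d])
print(fused.round(3))
```

Because each model is trained and validated separately, this design lets the handcrafted and learned feature streams contribute complementary information without sharing a training pipeline.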
The table below focuses specifically on studies that measured the stability and reliability of features.
Table 2: Comparative robustness of conventional radiomics versus deep learning features.
| Aspect of Robustness | Conventional Radiomics | Deep Learning | Source |
|---|---|---|---|
| Inherent Stability | Highly sensitive to acquisition parameters, reconstruction algorithms, and segmentation. Requires explicit robustness filtering. | More inherently robust to image variations due to data augmentation and hierarchical feature learning. | [87] [89] |
| Impact of Robustness Filtering | Model robustness (ICC) improved from 0.65 to 0.91 by using excellent-robust (ICC>0.95) features. Generalizability also increased. | Not typically required as a separate step; robustness is often learned during training with augmentation. | [33] |
| Performance in Heterogeneous Data | Performance drops significantly with variation in scanners and protocols. A "coffee-break" test-retest study found 234/542 robust features, but only 9 were robust in a clinical scenario with different scanners. | Demonstrates superior prognostic potential in real-world settings with varied acquisition parameters and tumor stages. | [87] [29] |
| Robustness by Feature Class | Most Robust: First-order statistics (e.g., Entropy). Least Robust: Many texture and wavelet features. Shape features are often highly reproducible. | Robustness is not as easily categorized by feature class, as features are learned and abstract. | [29] [30] |
Diagram 2: A comparative framework for conventional radiomics and deep learning analysis, culminating in a hybrid fusion model.
For researchers designing experiments in this field, the following tools and materials are essential:
Table 3: Key research reagents and computational solutions for radiomics and deep learning studies.
| Item / Solution | Function / Description | Example Tools / Libraries |
|---|---|---|
| Image Analysis Platform | Software for image visualization, registration, and manual segmentation of Regions of Interest (ROIs). | 3D Slicer, ITK-SNAP [87] [88] |
| Radiomics Feature Extraction | Open-source platforms that standardize the extraction of handcrafted radiomic features per IBSI guidelines. | PyRadiomics (Python) [88], MIRP [87] |
| Deep Learning Framework | Libraries providing pre-built components and automatic differentiation for developing and training CNN models. | PyTorch, TensorFlow |
| Pre-trained DL Models | Models trained on large datasets (natural images or medical images) used as a starting point for transfer learning. | ResNet50 (ImageNet), Med3D [88] |
| Perturbation Analysis Tool | Software to simulate test-retest variations via random transformations, noise addition, and contour deformation. | Custom implementations based on methods from Zwanenburg et al. [44] [33] |
| Feature Robustness Quantification | Statistical method to assess feature stability across perturbations or test-retest scans. | Intraclass Correlation Coefficient (ICC) [44] [33] |
| Model Reliability Assessment | Frameworks to evaluate the robustness and generalizability of the final predictive model. | FAMILIAR (R package) [87] |
This comparative analysis demonstrates that the choice between conventional radiomics and deep learning involves a critical trade-off between interpretability and inherent robustness. Conventional radiomics provides handcrafted, biologically-plausible features but is highly susceptible to technical variations, necessitating rigorous, study-specific robustness assessments using test-retest or perturbation methods. In contrast, deep learning approaches demonstrate greater native robustness to clinical heterogeneity and can outperform radiomics in real-world scenarios, though they often function as "black boxes." The most promising path forward appears to be hybrid models that integrate both handcrafted radiomic features and deep learning features, as they leverage the strengths of both approaches and have been shown to achieve state-of-the-art predictive performance [85] [88].
For the field to advance, future work must focus on standardizing robustness assessment protocols and improving the transparency of deep learning models. Furthermore, as demonstrated by their application in predicting complex tumor microenvironments [86], these non-invasive tools hold immense potential to redefine precision oncology by providing scalable, repeatable, and informative biomarkers for drug development and personalized therapy.
The clinical translation of radiomics in oncology hinges on the development of robust predictive models whose performance generalizes beyond single-institution datasets. External validation through multi-center and cross-institutional frameworks provides the critical evidence base needed to assess model reliability and reproducibility before clinical deployment [1] [90]. These frameworks systematically evaluate how radiomic signatures perform across different patient populations, imaging protocols, and institutional settings, addressing key challenges that have historically impeded radiomics' clinical adoption [90].
This guide objectively compares methodological approaches for assessing the external validity of radiomic features, with a particular emphasis on test-retest reliability within multi-center contexts. We synthesize experimental data and protocols from key studies to provide researchers with practical frameworks for designing validation studies that meet rigorous scientific standards. The comparative analysis focuses on quantitative performance metrics, including stability indices and reliability coefficients, to guide selection of appropriate methodologies for different research scenarios.
In radiomics research, precise terminology is essential for proper experimental design and interpretation:
Feature selection plays a pivotal role in enhancing radiomic stability. The table below summarizes the performance of different feature selection methods based on multi-institutional validation studies:
Table 1: Performance comparison of feature selection methods for radiomic stability
| Feature Selection Method | Jaccard Index (JI) | Dice-Sorensen Index (DSI) | Overall Performance (OP) | Key Strengths | Stability Limitations |
|---|---|---|---|---|---|
| Graph-FS (Connected Components) | 0.46 | 0.62 | 45.8% | Models feature interdependencies; High cross-center reproducibility | Computational complexity |
| mRMR | 0.014 | - | - | Reduces feature redundancy | Low stability across parameter variations (JI=0.014) |
| Lasso | 0.010 | - | - | Handles high-dimensional data well | Sensitive to preprocessing parameters (JI=0.010) |
| RFE | 0.006 | - | - | Iterative refinement of feature set | Low stability (JI=0.006) |
| Boruta | 0.005 | - | - | Comprehensive feature importance | Lowest stability in comparison (JI=0.005) |
Data adapted from graph-based feature selection study evaluating 1,648 radiomic features from 752 HNSCC patients across three institutions [68].
Different statistical approaches are available for quantifying various aspects of reliability in radiomic studies:
Table 2: Reliability assessment metrics and their applications
| Metric | Formula | Application Context | Interpretation | Evidence Quality |
|---|---|---|---|---|
| Intraclass Correlation Coefficient (ICC) | ICC = (MSR - MSE)/(MSR + (k-1)MSE + (k/n)(MSC - MSE)) [92] | Test-retest reliability; Inter-rater reliability | 0-1.0 (Higher values indicate better reliability) | Strong evidence for continuous measures [93] |
| Jaccard Index (JI) | JI = \|A ∩ B\|/\|A ∪ B\| | Feature selection stability | 0-1.0 (Measures similarity of selected feature sets) | Emerging evidence in radiomics [68] |
| Dice-Sorensen Index (DSI) | DSI = 2\|A ∩ B\|/(\|A\| + \|B\|) | Feature selection stability | 0-1.0 (Similar to JI but more sensitive) | Emerging evidence in radiomics [68] |
| Coefficient of Variation (CV) | CV = σ/μ × 100% | Measurement precision | Lower values indicate higher precision | Well-established for physiological measures [93] |
| Kendall's Coefficient of Concordance (W) | - | Feature ranking consistency | 0-1.0 (Higher values indicate more consistent rankings) | Applied in graph-based feature selection [68] |
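The set-overlap metrics in Table 2 (JI and DSI) are simple to compute over the feature sets a selection method returns under different conditions. A minimal sketch; the feature names and the three "runs" are hypothetical:

```python
from itertools import combinations

def jaccard_index(a, b):
    """JI = |A ∩ B| / |A ∪ B| between two selected-feature sets."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if (a | b) else 1.0

def dice_sorensen(a, b):
    """DSI = 2|A ∩ B| / (|A| + |B|)."""
    a, b = set(a), set(b)
    return 2 * len(a & b) / (len(a) + len(b)) if (a or b) else 1.0

def mean_pairwise_stability(selections, index=jaccard_index):
    """Average pairwise similarity of the feature sets selected under
    different parameter configurations (or at different centers)."""
    pairs = list(combinations(selections, 2))
    return sum(index(a, b) for a, b in pairs) / len(pairs)

# Feature sets selected under three hypothetical parameter configurations.
runs = [
    {"glcm_Contrast", "firstorder_Mean", "glrlm_GLN", "shape_Sphericity"},
    {"glcm_Contrast", "firstorder_Mean", "glrlm_GLN", "ngtdm_Busyness"},
    {"glcm_Contrast", "firstorder_Mean", "shape_Sphericity", "ngtdm_Busyness"},
]
print(f"JI  = {mean_pairwise_stability(runs):.3f}")                 # JI  = 0.600
print(f"DSI = {mean_pairwise_stability(runs, dice_sorensen):.3f}")  # DSI = 0.750
```

As the example shows, DSI is systematically at least as large as JI for the same pair of sets, which is why the two indices should not be compared across studies without noting which was used.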
The following workflow diagram illustrates a comprehensive experimental design for assessing radiomic feature reliability across multiple institutions:
Multi-Center Reliability Assessment Workflow
Key Methodological Components:
Multi-Center Cohort Design: A retrospective analysis of 752 patients with head and neck squamous cell carcinoma (HNSCC) across three independent institutions demonstrates an adequately powered study design [68]. Cohorts should represent realistic clinical variation in demographics, treatment approaches, and imaging protocols.
Systematic Parameter Variation: To simulate real-world variability, researchers applied 36 different radiomics parameter configurations, varying normalization scales (50 and 100), discretized gray levels (5, 10, 15, 20, 25, 30), and outlier removal thresholds (2, 3, 4) [68].
Comprehensive Feature Extraction: Using PyRadiomics (v3.1.0), extract 1,648 features from original CT scans and eight distinct image transformations, including Laplacian-of-Gaussian filters with sigma values of 1-5 mm and wavelet decompositions [68].
Stability-Oriented Feature Selection: Apply graph-based feature selection (Graph-FS) that constructs feature similarity networks where edges represent statistical similarities (Pearson correlation). Select representative features using centrality measures (betweenness centrality) to enhance stability across imaging conditions [68].
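The graph-based selection idea described above can be sketched with `networkx`. This is a simplified illustration, not the published Graph-FS implementation: features become nodes, edges connect strongly correlated pairs (an assumed threshold of |Pearson r| > 0.8), and one representative per connected component is kept by betweenness centrality, with degree centrality and node name breaking ties (betweenness is uniformly zero inside small cliques).

```python
import numpy as np
import networkx as nx

rng = np.random.default_rng(42)

# Synthetic feature matrix: 100 patients x 8 features, with two correlated blocks
# (features 0-2 track one latent signal, 3-5 another, 6-7 are independent).
latent_a, latent_b = rng.normal(size=100), rng.normal(size=100)
X = np.column_stack(
    [latent_a + 0.1 * rng.normal(size=100) for _ in range(3)]
    + [latent_b + 0.1 * rng.normal(size=100) for _ in range(3)]
    + [rng.normal(size=100) for _ in range(2)]
)
names = [f"feat_{i}" for i in range(X.shape[1])]

# Feature-similarity graph: edge wherever |Pearson r| exceeds the threshold.
corr = np.corrcoef(X, rowvar=False)
G = nx.Graph()
G.add_nodes_from(names)
for i in range(len(names)):
    for j in range(i + 1, len(names)):
        if abs(corr[i, j]) > 0.8:
            G.add_edge(names[i], names[j], weight=abs(corr[i, j]))

# Keep one representative per connected component, ranked by betweenness
# centrality; degree centrality and name break ties.
selected = []
for comp in nx.connected_components(G):
    sub = G.subgraph(comp)
    bc, dc = nx.betweenness_centrality(sub), nx.degree_centrality(sub)
    selected.append(max(comp, key=lambda n: (bc[n], dc[n], n)))
print(sorted(selected))  # one feature per correlated block, plus the two singletons
```

The design choice here mirrors the rationale in the protocol: redundant features collapse into a single representative, so the surviving set is less sensitive to which of several near-duplicate features a given configuration happens to prefer.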
For test-retest reliability studies specifically, implement this methodological framework:
Test-Retest Reliability Assessment Protocol
Critical Design Considerations:
Optimal Time Intervals: Implement a 4-week control period between test and retest sessions to minimize learning effects while capturing true biological variability, as demonstrated in neuromuscular reliability studies [93].
Standardized Acquisition Protocols: Maintain identical imaging parameters, equipment, and patient preparation procedures across all sessions. Document any deviations that might affect measurements.
Comprehensive Metric Reporting: Beyond ICC values, report the Standard Error of Measurement (SEM), Minimal Detectable Change (MDC), and Coefficient of Variation (CV) to provide complete information about measurement precision [93].
Stability Thresholds: Establish predefined reliability thresholds for feature selection. Features with ICC values >0.8 are generally considered to have excellent reliability, while those with ICC <0.5 show poor reliability and should be excluded from predictive models [1] [92].
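The metric-reporting and thresholding steps above can be combined into a small screening routine. A minimal sketch for a two-session design: `icc_2_1` implements the two-way formula from Table 2, and SEM and MDC95 use the standard relations SEM = SD·√(1 − ICC) and MDC95 = 1.96·√2·SEM; the synthetic cohort is illustrative.

```python
import numpy as np

def icc_2_1(data):
    """ICC(2,1): two-way random effects, absolute agreement, single measurement.
    Implements ICC = (MSR - MSE) / (MSR + (k-1)MSE + (k/n)(MSC - MSE))."""
    data = np.asarray(data, dtype=float)
    n, k = data.shape
    grand = data.mean()
    msr = k * ((data.mean(axis=1) - grand) ** 2).sum() / (n - 1)   # subjects (rows)
    msc = n * ((data.mean(axis=0) - grand) ** 2).sum() / (k - 1)   # sessions (columns)
    resid = data - data.mean(axis=1, keepdims=True) - data.mean(axis=0, keepdims=True) + grand
    mse = (resid ** 2).sum() / ((n - 1) * (k - 1))
    return (msr - mse) / (msr + (k - 1) * mse + (k / n) * (msc - mse))

def reliability_report(test, retest):
    """ICC, SEM, MDC95 and CV for one feature measured at two sessions."""
    data = np.column_stack([test, retest])
    icc = icc_2_1(data)
    sd = data.std(ddof=1)                      # SD pooled over all measurements
    sem = sd * np.sqrt(max(1.0 - icc, 0.0))    # Standard Error of Measurement
    mdc95 = 1.96 * np.sqrt(2.0) * sem          # Minimal Detectable Change (95%)
    cv = 100.0 * sd / data.mean()              # Coefficient of Variation
    return {"ICC": icc, "SEM": sem, "MDC95": mdc95, "CV%": cv}

rng = np.random.default_rng(1)
truth = rng.normal(50, 10, size=30)            # 30 subjects, true feature values
test = truth + rng.normal(0, 2, size=30)       # session 1, with measurement noise
retest = truth + rng.normal(0, 2, size=30)     # session 2
report = reliability_report(test, retest)
print({k: round(v, 3) for k, v in report.items()})

# Predefined stability threshold: retain the feature only if ICC > 0.8.
print("feature retained:", report["ICC"] > 0.8)
```

Reporting SEM and MDC95 alongside the ICC tells readers not just whether a feature is reliable, but how large a longitudinal change must be before it exceeds measurement noise.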
Table 3: Essential research reagents and computational tools for radiomics reliability assessment
| Tool/Category | Specific Examples | Function/Purpose | Key Features | Evidence Base |
|---|---|---|---|---|
| Feature Extraction Software | PyRadiomics (v3.1.0) | Standardized feature extraction from medical images | IBSI-compliant; 1,648+ extractable features; Open-source [68] | Extensive validation in multi-center studies [68] [94] |
| Feature Selection Algorithms | Graph-FS (Graph-Based Feature Selection) | Identifies stable features across institutions | Models feature interdependencies; Superior stability (JI=0.46) [68] | Validated on 752 HNSCC patients across 3 centers [68] |
| Statistical Analysis Packages | R Statistical Software (relfeas package) | Reliability feasibility analysis and sample size estimation | Estimates reliability for new samples; Power analysis [92] | Peer-reviewed methodology [92] |
| Image Processing Tools | B-spline interpolation algorithms | Image resampling and registration | Isotropic voxel resampling (1mm³); Standardized preprocessing [68] | Essential for reproducibility [68] |
| Reliability Analysis Metrics | Intraclass Correlation Coefficient (ICC) | Quantifies test-retest reliability | Various forms for different experimental designs [92] [93] | Gold standard for reliability assessment [1] [93] |
| Phantom Validation Systems | Radiomic phantoms | Controlled test-retest reliability studies | Controlled variability assessment; Protocol optimization [1] | Reference standard for technical validation [1] |
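The B-spline resampling step listed in the table can be sketched with `scipy.ndimage.zoom`, used here as a simple stand-in for a full registration toolchain. Assumed conventions: cubic B-splines (`order=3`) for the image and nearest-neighbour (`order=0`) for the label mask, so ROI labels stay integral; the spacings are examples.

```python
import numpy as np
from scipy import ndimage

def resample_isotropic(volume, spacing, new_spacing=(1.0, 1.0, 1.0), order=3):
    """Resample a voxel array to isotropic spacing with spline interpolation.

    order=3 selects cubic B-spline interpolation for intensity images;
    use order=0 (nearest neighbour) for label masks.
    """
    zoom = np.asarray(spacing, dtype=float) / np.asarray(new_spacing, dtype=float)
    return ndimage.zoom(volume, zoom, order=order, mode="nearest")

# Example: a CT volume with 0.98 x 0.98 mm pixels and 3 mm slices (z, y, x order).
ct = np.random.default_rng(0).normal(0, 1, size=(40, 128, 128))
mask = np.zeros_like(ct, dtype=np.uint8)
mask[15:25, 50:80, 50:80] = 1

iso_ct = resample_isotropic(ct, spacing=(3.0, 0.98, 0.98), order=3)
iso_mask = resample_isotropic(mask, spacing=(3.0, 0.98, 0.98), order=0)
print(ct.shape, "->", iso_ct.shape)  # z dimension grows ~3x, in-plane shrinks slightly
```

Resampling both image and mask with the same geometry, but different interpolation orders, is what keeps texture features comparable across scanners with different native slice thicknesses.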
This comparison guide has synthesized experimental data and methodological frameworks for assessing the external validity and reliability of radiomic features across multiple institutions. The quantitative comparisons demonstrate that graph-based feature selection methods offer superior stability (JI=0.46, DSI=0.62) compared to traditional approaches like Lasso (JI=0.010) or Boruta (JI=0.005) in multi-center validation studies [68].
For researchers designing reliability assessment studies, we recommend: (1) implementing standardized imaging protocols across all participating centers; (2) incorporating systematic parameter variations to test feature stability; (3) utilizing graph-based feature selection to identify robust radiomic signatures; and (4) reporting comprehensive reliability metrics including ICC, SEM, MDC, and CV to enable proper interpretation of results.
Rigorous multi-center validation remains the cornerstone of clinically applicable radiomics research. By adopting the frameworks and methodologies compared in this guide, researchers can enhance the reproducibility and clinical translation of radiomic biomarkers for oncology applications.
This guide provides an objective comparison of the Radiomics Quality Score (RQS) with emerging alternatives, focusing on their application for evaluating methodological rigor in radiomic feature research, particularly within test-retest reliability studies.
Radiomics quality assessment tools are designed to evaluate the multi-step analytical pipeline in radiomics research, which extracts quantitative features from medical images to build predictive models for clinical decision-making [95]. The complex nature of this pipeline—encompassing image acquisition, segmentation, feature extraction, and model validation—introduces numerous potential sources of bias and variability that can compromise research reproducibility and clinical translation [96]. The Radiomics Quality Score (RQS) was the first comprehensive tool developed to address these challenges by providing a standardized assessment framework for methodological rigor [95]. Recently, the METhodological RadiomICs Score (METRICS) has emerged as a new consensus-based tool endorsed by the European Society of Medical Imaging Informatics (EuSoMII), developed through a modified Delphi process involving a large international expert panel [95] [97]. Understanding the implementation, strengths, and limitations of these tools is particularly crucial for research on test-retest reliability, which forms the foundation for assessing radiomic feature stability across repeated image acquisitions.
The following table provides a detailed comparison of the RQS and METRICS assessment tools across multiple dimensions relevant to test-retest reliability research:
Table 1: Comprehensive Comparison of RQS and METRICS Assessment Tools
| Feature | Radiomics Quality Score (RQS) | METRICS |
|---|---|---|
| Year Introduced | 2017 [95] | 2024 [95] |
| Number of Items | 16 items [98] | 30 items across 9 categories [95] |
| Scoring Range | -8 to 36 [98] | 0-100% [98] |
| Development Process | Developed by a small research group [95] | Modified Delphi study with 59 international experts from 19 countries [95] |
| Weighting System | Unclear rationale for point allocation [98] | Transparent, expert opinion-based weights [95] |
| Test-Retest Consideration | Includes "multiple time points" as Item #4 [99] | Incorporated within methodological framework [95] |
| Tool Conditionality | Limited conditionality [98] | Conditional format for different methodological variations [95] |
| Coverage of Deep Learning | Limited [95] | Explicitly covers handcrafted and deep learning approaches [95] |
| Calculation Tools | Manual calculation | Web application available [95] |
A 2023 multi-reader study evaluated the intra- and inter-rater reliability of RQS, with nine raters of differing expertise levels assessing 33 original radiomics research papers; it revealed significant challenges in applying RQS consistently [100].
A 2025 systematic review and meta-analysis of 130 systematic reviews, covering 3,258 individual RQS assessments, likewise documented frequent scoring errors and low average scores in the published literature [101].
Data on METRICS implementation remain limited, as the tool is newer; early findings for both tools are summarized in Table 2.
Table 2: Comparative Performance Metrics of Radiomics Assessment Tools
| Performance Metric | RQS | METRICS |
|---|---|---|
| Inter-rater Reliability (ICC) | 0.30-0.55 [100] | Varies (early data) [96] |
| Typical Application Time | 13.9 minutes per article (human evaluator) [103] | Similar timeframe (human evaluator) |
| LLM-assisted Evaluation Time | 2.9-3.5 minutes per article [103] | Comparable reduction possible [102] |
| Common Scoring Errors | 39.8% of applications [101] | Limited data (newer tool) |
| Avg. Score in Literature | 26.1% (9.4/36 points) [101] | Limited application data available |
Test-retest imaging represents the reference standard for assessing radiomic feature repeatability: each patient is scanned twice within a short time period under identical acquisition settings [7].
When test-retest imaging is not feasible, owing to resource constraints or the additional radiation exposure to patients, image perturbation methods provide an alternative.
The following diagram illustrates the comparative workflow for assessing feature reliability using both test-retest and perturbation methods:
Research directly comparing test-retest and perturbation methods informs the choice between the two approaches; the resources commonly used in such quality assessments are listed below.
Table 3: Essential Research Resources for Radiomics Quality Assessment
| Resource Category | Specific Tool/Resource | Function in Quality Assessment |
|---|---|---|
| Quality Scoring Tools | RQS (16 items) [98] | Methodological quality assessment for traditional radiomics |
| | METRICS (30 items) [95] | Methodological quality assessment for handcrafted and deep learning approaches |
| Calculation Platforms | METRICS Web Application [95] | Streamlines score calculation and feedback collection |
| | Manual RQS calculation | Traditional scoring method |
| Reference Standards | METRICS-E3 [96] | Explanation and elaboration with 227 examples |
| | CLEAR Guidelines [95] | Reporting guidelines for radiomics research |
| Feature Standardization | Image Biomarker Standardization Initiative (IBSI) [96] | Harmonized feature extraction protocols |
| Test-Retest Alternatives | Image Perturbation [7] | Assesses feature repeatability when test-retest imaging is unavailable |
| Automation Assistance | Large Language Models (LLMs) [102] [103] | Accelerates and standardizes quality assessment |
The Radiomics Quality Score (RQS) represents a pioneering effort to standardize methodological quality assessment in radiomics research, with demonstrated value for test-retest reliability studies. However, evidence reveals significant limitations in reproducibility and consistent application. The newer METRICS tool addresses several RQS limitations through transparent development methodology, comprehensive coverage of modern approaches, and conditional scoring adaptation. For researchers focusing on test-retest reliability, both tools provide structured frameworks for methodological evaluation, though METRICS offers more contemporary alignment with evolving radiomics methodologies. Implementation can be enhanced through supplementary resources like METRICS-E3 and emerging LLM-assisted evaluation tools that improve consistency and efficiency. The choice between assessment tools should consider specific research objectives, with METRICS increasingly positioned as the preferred choice for comprehensive methodological evaluation despite RQS's established history and extensive literature application.
The evolving understanding of test-retest reliability in radiomics emphasizes that feature reproducibility, while important, should not be the sole determinant of clinical utility. The paradigm is shifting toward recognizing that predictive information can be distributed across multiple features, with even non-reproducible features potentially contributing significantly to model performance when considered within their interactive context. Future directions must focus on standardizing assessment methodologies, developing pan-cancer reliable biomarkers, and establishing robust validation frameworks that prioritize clinical relevance alongside technical stability. For biomedical and clinical research, this means adopting more holistic evaluation approaches that balance feature reliability with predictive power, ultimately accelerating the translation of radiomic biomarkers into clinical trials and routine practice for personalized medicine applications.