Validating Real-Time Oncology Data from EHRs: A Framework for Reliable Real-World Evidence in Cancer Research and Drug Development

Eli Rivera · Nov 26, 2025

Abstract

This article provides a comprehensive framework for the validation of real-time oncology data extracted from Electronic Health Records (EHRs), tailored for researchers and drug development professionals. It explores the critical imperative for timely, high-quality data in oncology surveillance and research. The content details advanced methodological approaches for data extraction and harmonization, including the use of common data models and AI-driven curation. It further addresses pervasive challenges such as data fragmentation and interoperability, offering practical optimization strategies. Finally, the article synthesizes validation frameworks and comparative studies, presenting key metrics for assessing data fitness-for-use in generating reliable real-world evidence for regulatory and health technology assessment decisions.

The Imperative for Real-Time Data in Modern Oncology

The Growing Burden of Cancer and the Limitations of Traditional Surveillance

The global burden of cancer continues to rise, necessitating robust surveillance systems to generate accurate, comprehensive data for public health interventions and clinical research. Traditional cancer surveillance methodologies face significant challenges in data standardization, interoperability, and adaptability to diverse healthcare settings. This is particularly evident in the context of utilizing electronic health records (EHRs), which contain valuable real-world data but present substantial extraction and standardization hurdles. The transition from EHRs to reliable real-world evidence requires sophisticated approaches to data validation, especially in precision oncology where data complexity is substantial. This article examines current methodologies for validating oncology data extracted from EHRs, comparing traditional and emerging approaches to address these critical challenges.

Experimental Protocols for EHR Data Validation

Research teams have developed distinct methodological frameworks to ensure the quality and reliability of data extracted from EHRs for oncology applications. These protocols generally fall into three categories: manual abstraction as a gold standard, traditional natural language processing (NLP) pipelines, and emerging large language model (LLM)-based approaches.

1. Manual Abstraction and Gold Standard Validation

The most established validation approach uses manual chart abstraction by clinical experts to create a gold-standard dataset. In one implementation, researchers drew 106 lung cancer and 45 sarcoma patient cases from databases conforming to the Precision Oncology Core Data Model (Precision-DM). This reference dataset enabled quantitative evaluation of automated extraction tools, though with variable results: descriptive fields were accurately retrieved, but temporal variables such as Date of Diagnosis and Treatment Start Date showed accuracy ranging from 50% to 86%, limiting reliable calculation of key oncology endpoints such as Overall Survival and Time to First Treatment [1].
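Because temporal variables drive endpoint calculation, validation studies typically score date fields by exact or near-exact matching against the abstracted gold standard. The sketch below is a minimal illustration of such a matching rule; the tolerance window and field layout are assumptions for illustration, not the protocol of the cited study.

```python
from datetime import date

def date_accuracy(gold_dates, extracted_dates, tolerance_days=0):
    """Share of extracted dates that match the gold standard within a window.

    Exact matching (tolerance_days=0) is the strictest rule; some
    validation designs allow a small window for documentation lag.
    """
    hits = total = 0
    for patient_id, gold in gold_dates.items():
        total += 1
        extracted = extracted_dates.get(patient_id)
        if extracted is not None and abs((extracted - gold).days) <= tolerance_days:
            hits += 1
    return hits / total if total else 0.0

gold = {"p1": date(2021, 3, 2), "p2": date(2020, 11, 18), "p3": date(2022, 7, 5)}
pred = {"p1": date(2021, 3, 2), "p2": date(2020, 12, 1)}   # p3 missed entirely
print(f"{date_accuracy(gold, pred):.2f}")                      # 0.33, exact matching
print(f"{date_accuracy(gold, pred, tolerance_days=14):.2f}")   # 0.67, 14-day window
```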

2. Traditional NLP and Data Mining Pipelines

Prior to the advent of LLMs, many institutions implemented toolkits incorporating data mining scripts and rule-based NLP to automatically retrieve structured variables from EHRs. These pipelines faced challenges with the predominantly unstructured nature of clinical notes (approximately 80% of EHR data), requiring extensive customization to handle site-specific documentation styles and coding practices [1] [2]. The infrastructure based on Precision-DM standardization demonstrated potential for cross-institutional adoption but required enhancement for improved accuracy on specific variables [1].

3. LLM-Based Extraction Frameworks

More recently, research teams have developed structured frameworks specifically for evaluating LLM-based data extraction. Flatiron Health's Validation of Accuracy for LLM/ML-Extracted Information and Data (VALID) framework implements a GDPR-compliant platform for duplicate abstraction, in which two expert reviewers independently extract clinical data from patient records. This enables calculation of performance metrics (recall, precision, F1 score) to benchmark LLMs against human extraction across different healthcare systems [3]. Similarly, researchers at Ontada employed prompt engineering with oncology-specific terminology to guide LLMs in extracting structured data from unstructured clinical documents, validating outputs against a gold standard created by clinical specialists [4].
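At its core, duplicate-abstraction benchmarking of the kind VALID performs reduces to comparing model-extracted values against expert-abstracted references and tallying agreement. The following sketch illustrates that comparison for a single categorical variable; the field names, matching rule, and data are illustrative assumptions, not published details of the VALID framework.

```python
def benchmark_extraction(gold, predicted):
    """Compare model-extracted values to expert-abstracted gold labels.

    gold, predicted: dicts mapping patient_id -> extracted value, with
    None meaning the variable was judged absent. Returns precision,
    recall, and F1 over exact-match hits, mirroring the metrics used by
    duplicate-abstraction frameworks.
    """
    tp = fp = fn = 0
    for patient_id, gold_value in gold.items():
        pred_value = predicted.get(patient_id)
        if gold_value is not None and pred_value == gold_value:
            tp += 1                      # correct extraction
        elif gold_value is not None:
            fn += 1                      # missed or wrong value
            if pred_value is not None:
                fp += 1                  # a wrong value is also a false positive
        elif pred_value is not None:
            fp += 1                      # spurious extraction
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# Toy example: histology extracted for five patients.
gold = {"p1": "adenocarcinoma", "p2": "squamous", "p3": None,
        "p4": "adenocarcinoma", "p5": "small cell"}
pred = {"p1": "adenocarcinoma", "p2": "adenocarcinoma", "p3": "squamous",
        "p4": "adenocarcinoma", "p5": None}
print(benchmark_extraction(gold, pred))  # (0.5, 0.5, 0.5)
```

In practice, frameworks of this kind compute such metrics per variable and per healthcare system, since extraction quality varies with documentation style.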

Comparative Performance Evaluation

The table below summarizes quantitative performance data for different approaches to oncology data extraction from EHRs:

Table 1: Performance Metrics of Oncology Data Extraction Methodologies

| Extraction Methodology | Cancer Types Evaluated | Key Data Elements | Reported Performance | Reference Dataset |
| --- | --- | --- | --- | --- |
| Traditional NLP Pipeline | Lung cancer, sarcoma | Descriptive fields, Date of Diagnosis, Treatment Start Date | Accuracy: 50%–86% for temporal variables | 151 patient cases from Precision-DM databases [1] |
| LLM with Prompt Engineering | 26 solid tumors, 14 hematologic malignancies | Cancer diagnosis, histology, grade, TNM staging | F1 scores ≥0.85 for all key clinical elements | Validation against manual extraction by clinical specialists [4] |
| AI-Enhanced Cancer Surveillance (Meta-analysis) | Cervical, oral, urological, gastrointestinal, thoracic | Diagnostic accuracy across imaging and screening | Pooled sensitivity: 88.5% (95% CI 83.2–92.6); specificity: 84.3% (95% CI 78.9–88.7) | 5 studies across 1,234,093 patients or imaging cases [5] |
| Smartphone-Based AI Screening | Oral cancer | Visual detection of suspicious lesions | Sensitivity: 96.7%; specificity: 96.7% | 108,948 images [5] |

Visualization of EHR Data Extraction Workflows

The following diagram illustrates the core workflow for validating oncology data extracted from electronic health records:

[Workflow diagram: unstructured EHR data flows into a data extraction method; the extracted data are checked in a performance-validation step against reference data from gold-standard creation, which is fed by manual abstraction (clinical experts), traditional NLP (rule-based), and LLM-based extraction (prompt engineering); the quality-assured result is validated RWE output.]

EHR Data Validation Workflow

Research Reagent Solutions for Oncology Data Extraction

The table below details key technologies and platforms used in advanced oncology data extraction and validation:

Table 2: Essential Research Tools for Oncology EHR Data Extraction

| Tool/Platform | Type | Primary Function | Key Features |
| --- | --- | --- | --- |
| Flatiron Health VALID Framework | Validation Framework | Evaluating LLM-extracted information | GDPR-compliant platform, duplicate abstraction, recall/precision/F1 metrics [3] |
| Precision-DM (Precision Oncology Core Data Model) | Data Standardization | Standardizing EHR data for precision oncology | Common data model for molecular tumor boards, structured data elements [1] |
| Ontada LLM Platform | Large Language Model | Extracting structured oncology data from clinical notes | Prompt engineering with oncology terminology, validation against manual abstraction [4] |
| IQVIA RWE Platform | Real-World Evidence Platform | Clinical trial data management and analysis | Centralized data management, advanced analytics, regulatory compliance [6] |
| TriNetX | Real-World Evidence Platform | Clinical research and trial optimization | Data encryption, access controls, audit trails, advanced analytics [6] |

Discussion

The growing burden of cancer demands more sophisticated approaches to surveillance that can overcome the limitations of traditional systems. Current research demonstrates that while traditional NLP pipelines for EHR data extraction have provided a foundation for automation, they face significant challenges with variable accuracy, particularly for temporal oncology endpoints [1]. The emergence of LLM-based approaches represents a substantial advancement, achieving high F1 scores (≥0.85) across diverse cancer types by leveraging prompt engineering with oncology-specific terminology [4].

Critical to the adoption of these technologies are robust validation frameworks like Flatiron's VALID, which implements systematic approaches to benchmark automated extraction against human experts [3]. The field is also addressing infrastructure challenges through standardized data models like Precision-DM, which enables cross-institutional collaboration while maintaining data quality [1].

For researchers and drug development professionals, these advancements enable more reliable generation of real-world evidence from routine clinical practice. This has profound implications for understanding treatment patterns, supporting regulatory decisions, and accelerating the development of novel therapies, particularly in precision oncology where molecular data complexity compounds traditional surveillance challenges [2].

As the field evolves, future directions will likely focus on expanding extraction capabilities to include biomarkers, medication history, and treatment outcomes, further enriching the real-world data available for cancer research and care optimization [4].

Cancer registries have traditionally served as static repositories of historical data, compiled through labor-intensive manual processes with significant time lags. However, a paradigm shift is underway toward dynamic systems capable of real-time data reporting. This transformation, powered by automated extraction technologies and standardized data models, is creating unprecedented opportunities for epidemiological research, drug development, and clinical decision-making. This guide compares the performance of emerging real-time reporting methodologies against traditional registry approaches, examining their validation through recent experimental implementations. We provide comprehensive experimental data and technical specifications to inform researchers, scientists, and drug development professionals navigating this rapidly evolving landscape.

The Evolution from Traditional to Real-Time Cancer Registries

Traditional population-based cancer registries have been indispensable for understanding cancer epidemiology, tracking incidence trends, and informing public health policy. These systems typically rely on manual data extraction from electronic health records (EHRs), a process that is both time-consuming and labor-intensive [7]. The Netherlands Cancer Registry (NCR), for instance, exemplifies this conventional approach where all Dutch cancer patients are manually recorded, creating significant delays between patient encounters and data availability for research and surveillance [7].

The limitations of this static model have become increasingly apparent amid rapid advances in cancer treatment. The growing demand for real-world evidence to evaluate diagnostic and therapeutic strategies used in daily practice has exposed the inadequacies of manual registration systems [7]. Furthermore, the rise of precision oncology, with its transition from hundreds of diagnoses to thousands of distinct cancer subtypes driven by molecular testing, places unique burdens on traditional registry structures that were not designed for such complexity [2].

In response to these challenges, a new model of dynamic, real-time reporting has emerged. These systems leverage automated data extraction technologies that harmonize structured EHR data across multiple healthcare institutions into common data models, supporting near real-time enrichment of cancer registries [7] [8]. This transition represents a fundamental shift from cancer registries as historical archives to their new role as living resources that can support contemporary clinical decision-making and accelerate oncology research.

Performance Comparison: Real-Time vs. Traditional Registry Systems

Direct comparisons between emerging real-time reporting systems and traditional registry approaches reveal significant differences in data accuracy, timeliness, and operational efficiency. The following analysis is based on experimental implementations across multiple research initiatives.

Table 1: Comparative Performance of Real-Time vs. Traditional Registry Systems

| Performance Metric | Traditional Registry Systems | Real-Time Reporting Systems | Validation Study |
| --- | --- | --- | --- |
| Diagnosis Accuracy | Not directly reported | 100% concordance with registered NCR diagnoses | Datagateway System [7] [8] |
| New Case Identification | Not directly reported | 95% accuracy against inclusion criteria | Datagateway System [7] [8] |
| Treatment Regimen Accuracy | Not directly reported | 97%–100% across cancer types | Datagateway System [7] |
| Combination Therapy Classification | Not directly reported | 97% accuracy (3% misclassification) | Datagateway System [7] |
| Laboratory Data Accuracy | Not directly reported | ~100% match | Datagateway System [7] |
| Toxicity Indicators Accuracy | Not directly reported | 72%–100% accuracy | Datagateway System [7] |
| Data Currency | Months to years | Near real-time | Multiple Studies [7] [9] |
| EHR–EDC Concordance (CDS) | Not applicable | 34% (increasing to 87% when disease evaluation captured in both systems) | ICAREdata Project [9] |
| EHR–EDC Concordance (TPC) | Not applicable | 79% | ICAREdata Project [9] |

Table 2: Specialized Performance Metrics by Cancer Type

| Cancer Type | Validation Focus | Accuracy Rate | Sample Size | System |
| --- | --- | --- | --- | --- |
| Acute Myeloid Leukemia (AML) | Treatment regimens | 100% | 254 patients | Datagateway [7] |
| Multiple Myeloma (MM) | Treatment regimens | 97% | 117 patients, 198 regimens | Datagateway [7] |
| Lung Cancer | New diagnosis extraction | 95% | 938 patients | Datagateway [7] |
| Breast Cancer | Overall system performance | Included in multi-cancer validation | Not specified | Datagateway [7] |
| Sarcoma | Descriptive EHR fields | Variable (50%–86%) | 45 cases | Precision-DM Toolset [10] |
| Solid Tumors | Cancer Disease Status (CDS) | 87% (when disease evaluation captured in both systems) | 15 trials | ICAREdata [9] |

Experimental Protocols and Methodologies

The Datagateway Validation Study

The Netherlands Cancer Registry implemented and validated an automated real-time data extraction system called "Datagateway" that harmonizes structured EHR data across multiple hospitals into a common model [7] [8].

Experimental Protocol:

  • Data Sources: EHR data from patients with acute myeloid leukemia, multiple myeloma, lung cancer, and breast cancer were extracted via the Datagateway system [7]
  • Validation Method: Extracted data were compared against manually registered NCR data and original EHR source data [7]
  • Patient Cohorts:
    • Prospective validation: 1,287 patient records across three hospitals (349 AML patients, 938 lung cancer patients) [7]
    • Retrospective validation: 384 patient records (168 AML patients, 216 lung cancer patients) [7]
  • Accuracy Assessment: Multiple data elements were validated including diagnoses, treatment regimens, laboratory values, and toxicity indicators [7]

Key Findings:

  • The system achieved 100% accuracy for identifying existing diagnoses compared to manually registered NCR data [7]
  • For new diagnoses, the system demonstrated 95% accuracy against NCR inclusion criteria [7]
  • Treatment identification showed high accuracy (100% for AML, 97% for multiple myeloma) with only 3% of combination therapies misclassified [7]
  • Laboratory values matched "virtually completely" between systems [7]
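Conceptually, the diagnosis-concordance arm of this validation is a join between automatically extracted records and manually registered ones on a patient key, followed by a value comparison. A minimal sketch, assuming simple code-valued records and exact matching (both illustrative assumptions, not the NCR's actual matching logic):

```python
def diagnosis_concordance(extracted, registered):
    """Fraction of automatically extracted diagnoses matching the manually
    registered diagnosis for the same patient.

    extracted, registered: dicts mapping patient_id -> diagnosis code.
    Only patients present in both sources are compared, mirroring the
    'existing diagnoses' validation arm.
    """
    shared = extracted.keys() & registered.keys()
    if not shared:
        return 0.0, 0
    matches = sum(1 for pid in shared if extracted[pid] == registered[pid])
    return matches / len(shared), len(shared)

extracted = {"a": "C92.0", "b": "C34.9", "c": "C90.0"}   # AML, lung, MM
registered = {"a": "C92.0", "b": "C34.9", "d": "C50.9"}  # registry records
rate, n = diagnosis_concordance(extracted, registered)
print(f"concordance {rate:.0%} over {n} shared patients")  # 100% over 2
```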

The following workflow diagram illustrates the Datagateway validation process:

[Diagram: EHR → structured data extraction → Datagateway → automated enrichment → NCR → manual vs. automated comparison → validation → source data verification back against the EHR.]

The ICAREdata Project Implementation

The Integrating Clinical Trials and Real-World Endpoints (ICAREdata) project demonstrated an alternative approach to real-world data capture using standardized oncology data elements [9].

Experimental Protocol:

  • Data Standards: Implemented minimal Common Oncology Data Elements (mCODE) within HL7 FHIR standard [9]
  • Study Scope: 10 clinical sites (academic and community centers) across 15 trials [9]
  • Data Elements: Focused on Cancer Disease Status (CDS) and Treatment Plan Change (TPC) [9]
  • Implementation Tools:
    • CDS captured via Epic SmartForms attached to problem list [9]
    • TPC captured via Epic SmartPhrases in clinical notes [9]
  • Extraction Method: mCODE Extraction Framework with FHIR-based transmission [9]

Key Findings:

  • Overall concordance rate of 79% for Treatment Plan Change between EHR and electronic data capture systems [9]
  • Concordance of 34% for Cancer Disease Status, increasing to 87% when disease evaluation was captured in both systems [9]
  • Demonstrated feasibility of standards-based structured data capture and transmission for clinical trials [9]

Technological Infrastructure for Real-Time Reporting

Common Data Models and Standards

Successful real-time reporting systems rely on standardized data models that enable interoperability across different healthcare systems:

mCODE (Minimal Common Oncology Data Elements): An open-source set of structured oncology data elements part of the HL7 FHIR standard, designed to facilitate electronic exchange of cancer-specific data between systems [9].

Precision-DM (Precision Oncology Core Data Model): A comprehensive model developed to support clinical-genomic data standardization, containing 22 profiles and 494 data elements with mappings to standardized terminologies [10].

FHIR (Fast Healthcare Interoperability Resources): A standard for electronic healthcare data exchange that supports API-based data access, increasingly adopted by EHR vendors for research purposes [9].
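To make these standards concrete, the snippet below assembles a minimal FHIR Observation carrying a cancer disease status value, in the spirit of the mCODE profile used by ICAREdata. It is a hand-rolled dictionary rather than a validated mCODE instance: the profile URL and codings follow mCODE's general shape but should be treated as assumptions and checked against the published mCODE StructureDefinitions before exchanging real data.

```python
import json

# Minimal, illustrative FHIR Observation for cancer disease status.
# Profile URL and codes are assumptions for illustration -- validate
# against the published mCODE StructureDefinitions before real use.
disease_status = {
    "resourceType": "Observation",
    "meta": {"profile": [
        "http://hl7.org/fhir/us/mcode/StructureDefinition/mcode-cancer-disease-status"
    ]},
    "status": "final",
    "code": {"coding": [{
        "system": "http://loinc.org",
        "code": "88040-1",               # assumed LOINC code for disease status
        "display": "Response to cancer treatment",
    }]},
    "subject": {"reference": "Patient/example-123"},
    "effectiveDateTime": "2024-05-01",
    "valueCodeableConcept": {"coding": [{
        "system": "http://snomed.info/sct",
        "code": "268910001",             # assumed SNOMED code: condition improved
        "display": "Patient's condition improved",
    }]},
}

print(json.dumps(disease_status, indent=2))
# Transmission is then a standard FHIR REST interaction, e.g.
# POST {fhir_base_url}/Observation with this JSON body.
```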

Architecture of Real-Time Reporting Systems

The technological infrastructure supporting real-time cancer registry reporting typically follows a layered architecture:

[Diagram: EHR → data extraction (ETL) → harmonization into a common data model (CDM) → automated submission to the registry → API access for research applications.]

Research Reagent Solutions: Essential Tools and Technologies

Table 3: Key Research Reagents and Technologies for Real-Time Cancer Registry Implementation

| Tool/Technology | Function | Implementation Example |
| --- | --- | --- |
| HL7 FHIR Standard | Enables electronic exchange of health data between systems | ICAREdata project used FHIR for data transmission [9] |
| mCODE (Minimal Common Oncology Data Elements) | Standardized structured data elements for oncology | Used for cancer disease status and treatment representation [9] |
| Epic SmartForms | Structured data capture within EHR problem lists | Implemented for Cancer Disease Status questions [9] |
| Epic SmartPhrases | Structured documentation in clinical notes | Used for Treatment Plan Change documentation [9] |
| mCODE Extraction Framework | Open-source tool for data formatting and transmission | Interim solution for FHIR-based transmission [9] |
| Natural Language Processing (NLP) | Extracts information from unstructured clinical text | Used for retrieving performance status with 93% accuracy [10] |
| Precision-DM Model | Comprehensive clinical-genomic data standardization | Supports molecular data integration with clinical phenotypes [10] |
| Common Data Model Harmonization | Transforms heterogeneous EHR data into standardized format | Datagateway system harmonized data across multiple hospitals [7] [8] |

Implications for Research and Drug Development

The transition to real-time reporting in cancer registries presents significant opportunities for the research community and pharmaceutical industry:

Accelerated Clinical Research: Real-world data from automated systems can supplement or serve as external control cohorts in clinical trials, potentially reducing recruitment timelines and costs [7]. The ability to identify patient populations meeting specific criteria in near real-time enhances clinical trial feasibility and efficiency.

Enhanced Safety Monitoring: Automated systems can provide more timely insights into treatment toxicities and adverse events, with studies demonstrating 72-100% accuracy for toxicity indicators [7]. This enables more responsive safety monitoring and pharmacovigilance.

Precision Medicine Applications: Standardized data models that incorporate molecular testing results support the development of targeted therapies for specific cancer subtypes [10]. The integration of genomic and clinical data is essential for advancing personalized treatment approaches.

Health Economics and Outcomes Research: More current and comprehensive data on treatment patterns and outcomes facilitates robust cost-effectiveness analyses and population health management, supporting value-based care initiatives in oncology.

The evolution from static to dynamic cancer registries represents a transformative advancement in oncology data infrastructure. Validation studies demonstrate that automated real-time reporting systems can achieve high accuracy rates—95-100% for key data elements—while dramatically improving data currency compared to traditional manual approaches [7] [8]. The successful implementation of standards-based approaches like mCODE and FHIR further supports the scalability and interoperability of these systems [9].

For researchers, scientists, and drug development professionals, these technological advances create unprecedented opportunities to leverage real-world evidence throughout the therapeutic development lifecycle. As these systems continue to mature, incorporating artificial intelligence and enhanced natural language processing capabilities, the potential for innovation in cancer research and care delivery will continue to expand, ultimately accelerating progress against cancer.

The validation of real-time oncology data from electronic health records (EHRs) is transforming oncology research and drug development. By converting unstructured clinical narratives into structured, research-ready data, advanced computational methods are enabling more efficient evidence generation and supporting the advancement of precision medicine. This guide objectively compares the key technologies and methodologies driving this transformation.

Table 1: Performance Comparison of Data Processing Models in Oncology

| Model Name | Primary Function | Test Data | Key Performance Metrics | Reported Limitations / Challenges |
| --- | --- | --- | --- | --- |
| GPT-4o (OpenAI) [11] | Classify cancer diagnoses from ICD/free text | 762 unique diagnoses (326 ICD, 436 free text) [11] | ICD code accuracy: 90.8%; free-text accuracy: 81.9%; weighted macro F1 (free text): 71.8 [11] | Confusion between metastasis and CNS tumors; errors with ambiguous terminology [11] |
| BioBERT (dmis-lab) [11] | Biomedical-specific classification | 762 unique diagnoses [11] | ICD code accuracy: 90.8%; free-text accuracy: 81.6%; weighted macro F1 (free text): 61.5 [11] | Lower performance on unstructured free text compared to structured ICD codes [11] |
| LLM for Clinical Data Extraction (Ontada) [4] | Extract cancer diagnosis, histology, grade, stage | 26 solid tumors, 14 hematologic malignancies [4] | F1 scores > 0.85 for key data elements (TNM stage, grade, histology) [4] | Requires testing for bias across all cancer populations to ensure fairness [4] |
| Precision-DM Data Pipeline [1] | Standardize EHR data for precision oncology | 106 lung cancer & 45 sarcoma cases [1] | Accuracy for Age at Diagnosis, Overall Survival: 50%–86% [1] | Lower accuracy in extracting dates (e.g., Date of Diagnosis, Treatment Start) [1] |

Detailed Experimental Protocols

Protocol for Validating LLMs in Cancer Diagnosis Categorization

This protocol is based on a benchmark study evaluating large language models (LLMs) and a specialized model on their ability to classify cancer diagnoses from EHRs [11].

  • Objective: To evaluate the performance of LLMs (GPT-3.5, GPT-4o, Llama 3.2, Gemini 1.5) and BioBERT in classifying cancer diagnoses from both structured and unstructured EHR data into clinically relevant categories [11].
  • Dataset Curation:
    • Source data was obtained from 3,456 patient records in the Research Enterprise Data Warehouse [11].
    • The test set consisted of 762 unique diagnoses: 326 structured International Classification of Diseases (ICD) code descriptions and 436 unstructured free-text entries from clinical notes [11].
    • Two oncology experts defined and validated 14 cancer categories (e.g., Breast, Lung, Gastrointestinal, Central Nervous System) [11].
  • Model Implementation & Prompting:
    • General-purpose LLMs were accessed via their respective cloud APIs, while BioBERT was deployed via the Hugging Face Transformers library. Llama 3.2 was run locally using Ollama [11].
    • A standardized prompt was used for the LLMs: "Given the following ICD-10 description or treatment note for a radiation therapy patient: {input}, select the most appropriate category from the predefined list: {Category list}. Respond only with the exact category name from the list..." [11].
  • Validation and Metrics:
    • Model outputs were compared against expert classifications by oncology specialists [11].
    • Performance was quantified using accuracy and weighted macro F1-score, with 95% confidence intervals calculated via nonparametric bootstrapping [11].
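The nonparametric bootstrap used for those confidence intervals can be reproduced in a few lines: resample the per-case correctness indicators with replacement and take percentiles of the resampled accuracy. The sketch below uses fabricated indicators sized to the study's 762-diagnosis test set; the values are placeholders, not the study's results.

```python
import random

def bootstrap_accuracy_ci(correct, n_boot=2000, alpha=0.05, seed=7):
    """Percentile-bootstrap confidence interval for accuracy.

    correct: list of 0/1 indicators, one per evaluated diagnosis,
    where 1 means the model's category matched the expert label.
    """
    rng = random.Random(seed)
    n = len(correct)
    stats = sorted(sum(rng.choices(correct, k=n)) / n for _ in range(n_boot))
    lo = stats[int((alpha / 2) * n_boot)]
    hi = stats[int((1 - alpha / 2) * n_boot) - 1]
    return sum(correct) / n, (lo, hi)

# Placeholder: 762 indicators with roughly 82% correct (illustrative only).
random.seed(0)
indicators = [1 if random.random() < 0.82 else 0 for _ in range(762)]
accuracy, (lo, hi) = bootstrap_accuracy_ci(indicators)
print(f"accuracy {accuracy:.3f} (95% CI {lo:.3f}-{hi:.3f})")
```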

Protocol for Building a Standardized Real-World Data Pipeline

This methodology focuses on creating a scalable infrastructure to extract and standardize EHR data for precision oncology use cases [1].

  • Objective: To develop and evaluate a toolset that automatically retrieves and standardizes descriptive variables and common endpoints from EHRs according to the Precision Oncology Core Data Model (Precision-DM) [1].
  • Data Processing & Toolset Development:
    • The infrastructure incorporated data mining and natural language processing (NLP) scripts to extract variables from unstructured EHRs [1].
    • Extracted data was structured to comply with the Precision-DM standard to ensure consistency and interoperability [1].
  • Validation Approach:
    • The toolset's performance was validated against a reference dataset of 106 lung cancer and 45 sarcoma patient cases from the Johns Hopkins Molecular Tumor Board [1].
    • Accuracy was assessed for key clinical variables, including Age at Diagnosis, Overall Survival, and Time to First Treatment, which were calculated from extracted dates [1].
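Because endpoints such as Age at Diagnosis, Overall Survival, and Time to First Treatment are derived arithmetically from extracted dates, any date-extraction error propagates directly into them, which is why the 50%–86% date accuracy reported for this pipeline is so consequential. A minimal sketch of the derivation step, with hypothetical field names:

```python
from datetime import date

def derive_endpoints(birth: date, diagnosis: date,
                     first_treatment: date, last_followup: date,
                     deceased: bool):
    """Derive common oncology endpoints from extracted dates.

    Any error in `diagnosis` or `first_treatment` shifts every
    downstream endpoint, so date accuracy gates endpoint reliability.
    """
    return {
        "age_at_diagnosis_years": (diagnosis - birth).days // 365,
        "time_to_first_treatment_days": (first_treatment - diagnosis).days,
        # Overall survival is measured from diagnosis; censored if alive.
        "overall_survival_days": (last_followup - diagnosis).days,
        "os_event_observed": deceased,
    }

print(derive_endpoints(
    birth=date(1956, 3, 14),
    diagnosis=date(2021, 6, 2),
    first_treatment=date(2021, 6, 30),
    last_followup=date(2023, 1, 15),
    deceased=False,
))
```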

The Scientist's Toolkit: Research Reagent Solutions

The following tools and data standards are essential for conducting real-world evidence research in oncology.

| Tool / Solution | Type | Primary Function in Research |
| --- | --- | --- |
| Large Language Models (LLMs) [11] [4] | Software Model | Automate the extraction and structuring of complex clinical information (e.g., diagnosis, stage) from unstructured EHR text, enabling high-throughput data curation. |
| BioBERT [11] | Software Model | Provide a domain-specific language model pre-trained on biomedical literature, enhancing performance on tasks involving specialized medical terminology. |
| Precision-DM (Precision Oncology Core Data Model) [1] | Data Standard | Offer a standardized data model to harmonize EHR-derived real-world data, ensuring consistency and facilitating data sharing across different cancer centers and studies. |
| AACR Project GENIE [12] | Data Registry | Serve as a large, publicly available, clinically annotated genomic registry used to accelerate precision oncology discovery and validate findings across diverse patient populations. |
| Flatiron Health EHR-Derived Databases [13] | Data Resource | Provide de-identified, structured, and unstructured data derived from routine oncology care across a nationwide network of providers, supporting outcomes research and regulatory-grade evidence generation. |

Experimental Workflow and Data Validation Pathways

The following diagram illustrates the standard workflow for processing and validating real-world data from EHRs for oncology research.

Real-World Data Processing Workflow

Framework for Regulatory-Grade Real-World Evidence

Generating evidence fit for regulatory decisions requires a robust methodological framework to address biases inherent in observational data [14].

RWE Validation Framework

This framework emphasizes the importance of pre-specifying a causal question and using tools like Directed Acyclic Graphs (DAGs) to map relationships between variables [14]. Target trial emulation involves designing an observational study to mimic a hypothetical randomized controlled trial as closely as possible, which includes precisely defining eligibility criteria, treatment strategies, and outcomes [14]. Analytic methods like Inverse Probability of Treatment Weighting (IPTW) are then used to control for confounding and generate reliable evidence for regulatory and reimbursement decisions [14].
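To illustrate the analytic step, the sketch below estimates stabilized inverse probability of treatment weights from a fitted propensity model, assuming a simple binary-treatment setting with two measured confounders. The cohort, variables, and effect structure are fabricated for illustration only.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(42)

# Fabricated observational cohort: age and ECOG performance score act
# as measured confounders of a binary treatment choice.
n = 500
X = np.column_stack([rng.normal(65, 10, n), rng.integers(0, 3, n)])

# Treatment assignment depends on the confounders (confounding by indication).
p_treat = 1 / (1 + np.exp(-(0.03 * (X[:, 0] - 65) - 0.5 * X[:, 1])))
treated = rng.binomial(1, p_treat)

# Fit the propensity model and form stabilized IPT weights.
ps = LogisticRegression().fit(X, treated).predict_proba(X)[:, 1]
p_marginal = treated.mean()
weights = np.where(treated == 1, p_marginal / ps, (1 - p_marginal) / (1 - ps))

# In the weighted pseudo-population, confounder means should be balanced.
for j, name in enumerate(["age", "ecog"]):
    mean_treated = np.average(X[treated == 1, j], weights=weights[treated == 1])
    mean_control = np.average(X[treated == 0, j], weights=weights[treated == 0])
    print(f"{name}: weighted mean treated={mean_treated:.2f} vs control={mean_control:.2f}")
```

Checking confounder balance in the weighted pseudo-population, as the loop above does, is the usual first diagnostic before estimating treatment effects.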

In the field of oncology research, the validation of real-time data from electronic health records (EHRs) represents a critical frontier for advancing evidence-based medicine. Real-world data (RWD) offers the potential to capture diverse patient experiences often missed by traditional randomized controlled trials (RCTs), particularly for older adults, those with comorbidities, and individuals with rare cancers [15] [16]. However, the journey from raw EHR data to trustworthy evidence is fraught with challenges related to data completeness, accuracy, and timeliness. These data gaps directly impact the reliability of insights drawn from RWD and can consequently affect drug development timelines, clinical decision-making, and ultimately, patient outcomes. This guide objectively compares the performance of different data collection and validation methodologies, providing researchers with a framework for navigating the complex landscape of oncology RWD.

The following tables summarize the performance characteristics of various oncology data sources and validation systems based on recent research findings.

Table 1: Performance Metrics of Automated Oncology Data Extraction Systems

| System / Study | Data Source | Key Performance Metrics | Primary Limitations |
| --- | --- | --- | --- |
| Datagateway System [7] | EHR data from multiple hospitals (Netherlands Cancer Registry) | 100% concordance with registered NCR diagnoses; 95% accuracy in new diagnosis extraction; 97% accuracy in treatment regimen identification (MM); 100% accuracy in AML treatment identification | 3% of combination therapies misclassified; toxicity indicators showed variable accuracy (72%–100%) |
| Privacy-Preserving ML Tool [17] | Oncology EHR data across multiple institutions | Improved ML model performance by 10–15%; accelerated feedback cycles from weeks to days | Requires human expert-curated gold standard for validation; must comply with European data protection standards |
| Oncology Data Network (ODN) [18] | 124 cancer centers across 7 European countries | Near real-time analytics within 24 hours of data entry; captures treatment duration, intervals, and discontinuation | Concise initial dataset focused primarily on cancer medicine use; achieving critical mass of contributors proved challenging |

Table 2: Data Completeness Across Different Oncology Registry Types

| Registry Type | Strengths | Data Gaps & Limitations | Example Research Applications |
| --- | --- | --- | --- |
| Population-Based Registries (SEER, NPCR) [16] | Large, diverse samples representative of populations; common coding schema; details on tumor characteristics | Incomplete treatment information; lack of detailed data on health behaviors and age-related conditions (frailty, cognition) | Trends in cancer incidence and mortality; health disparities research across age, race, and geography |
| Hospital-Based Registries (National Cancer Database) [16] | Captures ~70% of incident US cancers; detailed clinical information from accredited hospitals | Findings may not be generalizable to the full US population; limited information on geriatric impairments | Quality-of-care comparisons across institutions; treatment pattern analysis |
| Specialized Geriatric Registries (Carolina Seniors Registry) [16] | Captures geriatric assessment data for all participants; focuses on older adults in academic and community settings | Limited geographic coverage; may not represent all care settings | Understanding geriatric impairments in older cancer patients; linking functional status to treatment outcomes |

Experimental Protocols for Data Validation

Protocol 1: Validation of Automated EHR Data Extraction Systems

Objective: To validate the accuracy of an automated system (Datagateway) for extracting and harmonizing structured EHR data into a common model to support near real-time enrichment of cancer registries [7].

Methodology:

  • Patient Cohort Selection: Data from patients with acute myeloid leukemia (AML), multiple myeloma, lung cancer, and breast cancer were extracted via the Datagateway system.
  • Comparison Framework: Extracted data was compared against two standards: the manually curated Netherlands Cancer Registry (NCR) and original EHR source data.
  • Validation Metrics:
    • Diagnostic Accuracy: Compared automatically extracted diagnoses with NCR-registered diagnoses.
    • Treatment Identification: Validated treatment regimens against manually curated records.
    • Data Element Accuracy: Assessed concordance for laboratory values and toxicity indicators.

Key Findings: The system demonstrated 100% accuracy for retrieving patients recorded in the NCR and 95% accuracy in identifying new diagnoses meeting NCR inclusion criteria. Treatment identification showed high accuracy (97-100%) across cancer types, with only 3% of complex combination therapies misclassified [7].

Protocol 2: Privacy-Preserving Machine Learning Error Analysis

Objective: To develop a workflow that allows clinical experts and data scientists to collaboratively identify machine learning (ML) extraction errors while maintaining privacy compliance [17].

Methodology:

  • Gold Standard Establishment: Human expert-curated datasets served as the validation benchmark.
  • Interactive Dashboard: Implemented a Snowflake-based interactive dashboard for reviewing model outputs against human benchmarks.
  • Error Categorization: Team reviewed discrepancies to categorize errors and inform model improvements.
  • Iterative Refinement: Established feedback loops to continuously improve ML model performance.

Key Findings: This approach improved ML model performance by 10-15% and accelerated feedback cycles from weeks to days, ensuring that data extraction remains both precise and compliant with European data protection standards [17].
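The heart of such an error-analysis loop is a discrepancy report: join model outputs to the expert-curated gold standard, flag field-level mismatches, and queue them for expert categorization. A tool-agnostic sketch with hypothetical record fields follows (the cited implementation runs on a Snowflake dashboard; none of its internals are reproduced here):

```python
def discrepancy_report(gold_rows, model_rows, fields):
    """Pair gold and model records by patient and list field-level mismatches.

    gold_rows, model_rows: dicts of patient_id -> {field: value}.
    Returns a list of (patient_id, field, gold_value, model_value) ready
    for expert review and error categorization.
    """
    report = []
    for pid, gold in gold_rows.items():
        model = model_rows.get(pid, {})
        for field in fields:
            if gold.get(field) != model.get(field):
                report.append((pid, field, gold.get(field), model.get(field)))
    return report

gold = {"p1": {"stage": "IIIA", "histology": "adenocarcinoma"},
        "p2": {"stage": "IV", "histology": "squamous"}}
model = {"p1": {"stage": "IIIB", "histology": "adenocarcinoma"},
         "p2": {"stage": "IV", "histology": "squamous"}}

for row in discrepancy_report(gold, model, ["stage", "histology"]):
    print(row)   # ('p1', 'stage', 'IIIA', 'IIIB')
```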

Visualization of Data Validation Workflows

[Diagram: structured EHR data → common data model → validation process against a gold-standard reference → accuracy assessment; high-accuracy records become validated RWD output, while identified errors enter an error-analysis loop that drives ML model improvement and refined extraction back into the common data model.]

Oncology RWD Validation Workflow: This diagram illustrates the sequential process of transforming raw EHR data into validated real-world data through common data models, validation against gold standards, and iterative error analysis loops.

[Diagram: multinational EHR data → disease-specific common data models → structured and unstructured data curation → variable standardization → trusted research environment → global research studies.]

Multinational RWD Harmonization: This visualization shows the workflow for creating globally applicable oncology datasets through disease-specific common data models, robust curation processes, and secure trusted research environments.

The Scientist's Toolkit: Research Reagent Solutions

Table 3: Essential Resources for Oncology RWD Research

| Resource / Tool | Type | Primary Function | Access Considerations |
| --- | --- | --- | --- |
| Common Data Models [7] [18] | Data Infrastructure | Harmonizes data from diverse EHR systems into standardized formats for aggregation and comparison | Requires mapping local data elements to common standards; must be maintained as clinical practices evolve |
| Core Regimen Reference Library (CRRL) [18] | Reference Database | Codifies treatment regimens used in clinical practice and maps them against established guidelines | Essential for comparing treatment patterns across institutions and countries; requires clinical expertise to maintain |
| Privacy-Preserving Error Analysis Dashboard [17] | Analytical Tool | Enables collaborative identification of ML extraction errors against human expert-curated gold standards | Must comply with regional data protection regulations (e.g., GDPR); requires specialized technical implementation |
| USP Medicine Supply Map [19] | Supply Chain Analytics | Uses predictive analytics to identify vulnerability factors in drug supply chains and calculate shortage risk scores | Commercial tool requiring subscription; provides crucial data for understanding drug availability impacts on treatment |
| Flatiron Health Multinational Datasets [20] | Curated RWD Source | Provides structured, curated oncology EHR-derived data across multiple countries using disease-specific common data models | Access via trusted research environment; enables global comparative studies while maintaining data privacy |

The validation of real-time oncology data from EHRs remains a complex but essential endeavor for advancing cancer research and treatment. Automated data extraction systems show promising accuracy, particularly for diagnosis identification and monotherapy regimens, but challenges persist with complex combination therapies and toxicity documentation. The consequences of incomplete and delayed information are significant, potentially leading to suboptimal treatment decisions, inefficient drug development, and inadequate understanding of real-world therapeutic effectiveness.

Researchers must carefully select data sources and validation methodologies based on their specific use cases, recognizing that different registry types and data systems exhibit distinct profiles of strengths and limitations. The emerging toolkit of common data models, privacy-preserving error analysis frameworks, and multinational data harmonization approaches offers promising pathways for addressing critical data gaps. As the field evolves, continued refinement of these methodologies and technologies will be essential for generating reliable real-world evidence that can truly inform clinical practice and improve outcomes for cancer patients.

Building the Pipeline: Methodologies for Real-Time Data Extraction and Harmonization

The modern landscape of oncology research is increasingly dependent on the rapid and reliable use of real-world data from Electronic Health Records (EHRs). However, the clinical utility of this data is often hampered by significant challenges, including fragmentation across proprietary systems, inconsistent data structures, and burdensome manual extraction processes that are both time-consuming and labor-intensive [7] [2]. These limitations create critical bottlenecks for research and delay insights into cancer treatment efficacy and safety. Common Data Models (CDMs) have emerged as a foundational architectural solution to these problems, providing a standardized framework that harmonizes disparate data sources into a consistent, analyzable format. By transforming heterogeneous EHR data into a unified structure, CDMs enable the automated, real-time data pipelines essential for a responsive Learning Health System in oncology [7] [21]. This guide objectively evaluates the role of CDMs, with a specific focus on validating their performance in automating the extraction of real-time oncology data for research and drug development.

CDM Performance Comparison: Validating Automated Oncology Data Extraction

To assess the practical value of Common Data Models in a real-world oncology context, we examine performance data from a validation study of an automated data extraction system that leveraged a CDM to harmonize EHR data across multiple hospitals for the Netherlands Cancer Registry (NCR) [7] [8]. The study provides critical quantitative metrics on the accuracy and feasibility of using a CDM for real-time data enrichment in a population-based registry.

Diagnostic and Treatment Data Accuracy

The validation demonstrated a high level of accuracy across key oncology data domains, confirming the reliability of the CDM-based automated system.

Table 1: Accuracy of CDM-Based Data Extraction for Oncology Diagnoses and Treatment

| Data Category | Specific Metric | Performance | Context / Sample Size |
| --- | --- | --- | --- |
| Diagnosis Validation | Concordance with registered NCR diagnoses | 100% | Compared to NCR gold standard [7] |
| Diagnosis Validation | Accuracy in identifying new diagnoses per NCR criteria | 95% | 1,219 of 1,287 patient records [7] |
| Treatment Validation | Acute Myeloid Leukemia (AML) treatment regimens | 100% | 254 patients [7] |
| Treatment Validation | Multiple Myeloma (MM) treatment regimens | 97% | 198 regimens from 117 patients [7] |
| Treatment Validation | Combination therapy misclassification | 3% | Small subset of MM regimens [7] |

Clinical and Laboratory Data Accuracy

The system also excelled in capturing detailed clinical data, which is crucial for comprehensive research and safety monitoring.

Table 2: Accuracy of CDM-Based Clinical and Laboratory Data Extraction

| Data Type | Performance | Notes |
| --- | --- | --- |
| Laboratory Values | Virtually complete match [7] | High fidelity in transferring structured numeric data. |
| Toxicity Indicators | 72%–100% accuracy [7] | Range indicates variation in capture accuracy for different types of toxicities. |

Experimental Protocols: Methodologies for CDM Validation

The performance data cited in the previous section were derived from rigorous validation studies. The following protocols detail the methodologies used to generate that evidence, providing a blueprint for researchers seeking to validate similar CDM-based systems.

Protocol 1: Prospective Validation of New Cancer Diagnoses

This protocol was designed to test the system's ability to accurately identify and include new cancer cases in real-time.

  • Objective: To determine the accuracy of the CDM-based automated system in identifying new patient diagnoses that meet the registry's inclusion criteria, compared to manual registration processes [7].
  • Data Source: Structured data extracted directly from hospital EHRs and harmonized via the CDM [7].
  • Patient Cohort: 1,287 patient records from three hospitals, encompassing patients with Acute Myeloid Leukemia (AML) and lung cancer [7].
  • Validation Method: Each patient record identified by the automated system was checked against the NCR inclusion criteria to confirm they represented a valid, new cancer diagnosis [7].
  • Output Metric: The percentage of automatically identified patients who correctly met all inclusion criteria (95%) [7].

Protocol 2: Retrospective Validation of Treatment Regimens

This protocol assessed the system's precision in capturing complex cancer treatment information.

  • Objective: To validate the correctness of treatment regimen identification by the CDM system against previously recorded NCR data and the original EHR source data [7].
  • Data Source: Harmonized treatment data from the CDM, compared to the gold-standard NCR records and source EHRs [7].
  • Patient Cohort: 254 AML patients and 117 Multiple Myeloma (MM) patients, encompassing a total of 198 distinct treatment regimens for MM [7].
  • Validation Method: A detailed, record-by-record comparison was conducted. For example, each specific drug combination (e.g., D-VRd, D-VTd) identified for an MM patient was verified against the actual prescribed therapy in the medical record [7].
  • Output Metric: The percentage of treatment regimens that were correctly classified (100% for AML, 97% for MM) [7].

System Architecture: Workflow of a CDM for Oncology Data

The following diagram illustrates the logical flow and key components of an automated system that uses a Common Data Model to process oncology data from source EHRs to research-ready outputs.

[Diagram: multiple EHR systems feed a data ingestion layer into a Bronze layer of raw data; CDM harmonization and mapping (schema application, vocabulary standardization) transforms and loads it into a Silver layer of standardized data (patient, diagnosis, treatment, lab); aggregation produces a Gold layer of analytics and KPIs (research, dashboards, registries) consumed by research and drug development.]

Diagram 1: Logical workflow of a CDM-based automated data pipeline for oncology research, from heterogeneous EHR sources to research consumption. Based on a scalable data architecture framework [22].
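The bronze-to-silver hop in this layered design is essentially vocabulary standardization: raw, site-specific codes are mapped onto the model's standard concepts. A minimal pandas sketch, with a tiny invented mapping table standing in for a full vocabulary service such as the OHDSI standardized vocabularies:

```python
import pandas as pd

# Bronze layer: raw records as they arrive from heterogeneous EHRs.
bronze = pd.DataFrame({
    "patient_id": ["a", "b", "c"],
    "source_code": ["C34.9", "lung ca", "C92.00"],
    "source_system": ["ehr1", "ehr2", "ehr1"],
})

# Illustrative mapping table; a real CDM relies on curated vocabularies
# far larger than this two-concept toy example.
vocab = pd.DataFrame({
    "source_code": ["C34.9", "lung ca", "C92.00"],
    "standard_concept": ["Malignant neoplasm of lung",
                         "Malignant neoplasm of lung",
                         "Acute myeloid leukemia"],
})

# Silver layer: harmonized records conforming to the common model.
silver = bronze.merge(vocab, on="source_code", how="left")
print(silver[["patient_id", "standard_concept"]])
```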

Successful implementation of a CDM for oncology research relies on a combination of specific data standards, technical tools, and governance frameworks.

Table 3: Key Resources for Implementing a CDM in Oncology Research

| Tool / Standard | Category | Primary Function in CDM Implementation |
| --- | --- | --- |
| OMOP Common Data Model [21] | Data Standard | Provides an open-source, standardized data model and structure for observational health data, enabling systematic analytics. |
| OHDSI Standardized Vocabularies [21] | Terminology | Allows organization and standardization of medical terms (e.g., medications, conditions) across clinical domains for consistent phenotype definition. |
| Dataverse / Microsoft CDM [23] | Platform & Standard | Offers a standardized, cloud-based schema and storage for business data, promoting interoperability between applications like Dynamics 365 and Power BI. |
| Data Catalogs (e.g., Alation) [24] | Governance Tool | Centralizes documentation of the CDM, tracks data lineage, manages metadata, and ensures governance and discoverability of standardized entities. |
| ETL/ELT Tools (e.g., dbt, Talend) [22] [24] | Technical Tool | Executes the transformation and loading of source data into the CDM structure, often within modular, version-controlled pipelines. |
| Unity Catalog (on Databricks) [22] | Governance Tool | Provides centralized governance, lineage tracking, and access control for data within a lakehouse architecture, securing the CDM. |

The empirical validation of a Common Data Model for automating oncology data extraction demonstrates that this architectural approach is not only feasible but also highly reliable. With performance benchmarks showing 95% to 100% accuracy in identifying diagnoses and treatment regimens, CDMs provide a robust foundation for real-time data integration in cancer research [7]. By overcoming the inherent fragmentation of EHR systems, CDMs enable scalable, high-quality data pipelines that are essential for accelerating real-world evidence generation, supporting drug development, and ultimately advancing patient care in a Learning Health System framework.

Harnessing AI and Natural Language Processing (NLP) for Unstructured Data

In modern oncology, the pursuit of precision medicine generates massive volumes of patient data, most of which exists in unstructured formats within electronic health records (EHRs). Critical information regarding cancer diagnosis, histology, staging, treatment responses, and patient-reported outcomes often remains buried in clinical narratives, pathology reports, and physician notes rather than structured, analyzable fields. This unstructured data represents both a formidable challenge and a tremendous opportunity for cancer research and drug development. The U.S. healthcare system alone has exceeded 2000 exabytes of data, much of which is unstructured clinical information requiring sophisticated processing techniques [25]. For researchers and pharmaceutical professionals, unlocking this information is crucial for generating robust real-world evidence, streamlining clinical trials, and advancing personalized treatment strategies.

Artificial intelligence, particularly natural language processing, has emerged as a transformative solution to this data accessibility problem. NLP technologies can automatically extract, structure, and analyze clinical information from unstructured text, converting qualitative narratives into quantitative, research-ready data [4]. This capability is especially valuable in oncology, where the heterogeneity of cancer subtypes, treatment protocols, and patient outcomes demands sophisticated data integration across multiple sources. This guide provides a comprehensive comparison of AI and NLP methodologies for oncology data extraction, evaluates their performance against traditional approaches, and details experimental protocols for validating these technologies in real-world research settings, with a specific focus on applications for real-time oncology data validation from EHRs.

Comparative Analysis of NLP Approaches in Oncology

Evolution of NLP Technologies: From Rules to Deep Learning

The field of natural language processing has undergone significant evolution, transitioning through three distinct technological eras that build upon each other in complexity and capability. Each approach offers different advantages for oncology applications, from extracting simple diagnostic information to understanding complex clinical contexts.

[Figure: timeline of NLP technologies. Rule-based systems (1980s–2000s): predefined grammatical rules, dictionary lookups, limited contextual understanding; high precision for specific tasks but labor-intensive development and poor generalization. Statistical ML (1990s–2010s): statistical pattern recognition with engineered features (TF-IDF, n-grams), better handling of variation, improved scalability, moderate accuracy gains. Neural networks/LLMs (2010s–present): Transformer-based contextual understanding, transfer learning (BERT, GPT), end-to-end learning with minimal feature engineering and superior accuracy on complex tasks that generalizes across domains.]

Figure 1: The evolution of NLP technologies shows a progression from rigid rule-based systems to contextually aware large language models, with each generation building upon the previous to handle increasingly complex clinical language tasks.

Performance Comparison: Traditional NLP vs. Modern LLMs

Multiple studies have quantitatively evaluated the performance of different NLP approaches for extracting oncology concepts from unstructured clinical text. The table below summarizes key performance metrics across various extraction tasks and cancer types, demonstrating the comparative advantages of modern approaches.

Table 1: Performance comparison of NLP approaches on oncology data extraction tasks

| NLP Approach | Cancer Types Evaluated | Key Data Elements Extracted | Performance Metrics | Reference |
| --- | --- | --- | --- | --- |
| Rule-Based Systems | Breast, colorectal, prostate | Symptoms, urinary function, pain intensity | Precision: 0.72–0.89; recall: 0.68–0.85 [26] | Systematic review (2024) |
| Traditional Machine Learning | Multiple (26 solid tumors, 14 hematologic) | Diagnosis, histology, staging | F1 score: 0.78–0.82 [4] | ASCO 2025 validation study |
| Large Language Models (LLMs) | Multiple (26 solid tumors, 14 hematologic) | Cancer diagnosis, histology, grade, TNM staging | F1 score: >0.85 [4] | ASCO 2025 validation study |
| Deep Convolutional Neural Networks | Gastric cancer (early detection) | Endoscopic image classification | Sensitivity: 0.94; specificity: 0.91; AUC: 0.98 [27] | Meta-analysis (2025) |

The performance advantage of large language models is particularly evident in complex extraction tasks such as TNM staging, where contextual understanding is essential. Modern LLMs like BERT and GPT variants achieve F1 scores exceeding 0.85 for extracting key clinical elements across 26 solid tumors and 14 hematologic malignancies, outperforming traditional machine learning approaches that require extensive feature engineering and task-specific training [4]. This represents a significant advancement for oncology research, where accurate, automated extraction of structured data from clinical narratives enables more comprehensive patient cohort identification for clinical trials and more robust real-world evidence generation.

Domain-Specific Performance Variations

While modern NLP approaches generally outperform traditional methods, their effectiveness varies across specific oncology domains and documentation types. For instance, in early gastric cancer detection using endoscopic images, deep convolutional neural networks (DCNNs) demonstrate remarkable sensitivity (0.94) and specificity (0.91), significantly outperforming both traditional computer vision approaches and clinician assessment in controlled studies [27]. This performance advantage is particularly pronounced in dynamic video analysis, where DCNNs achieve an AUC of 0.98 compared to clinician AUC ranges of 0.85-0.90, highlighting their potential for real-time clinical decision support [27].

However, the performance of any NLP system is highly dependent on the quality and representativeness of its training data. Models trained on specific cancer types or institutional documentation styles typically perform better within those domains than general-purpose models applied to unfamiliar contexts. This underscores the importance of domain-specific tuning and validation when implementing NLP solutions for oncology research applications [26] [4].

Experimental Protocols for NLP Validation in Oncology

Methodological Framework for Validation Studies

Robust validation of NLP systems for oncology applications requires carefully designed experimental protocols that assess both technical performance and clinical utility. The following workflow outlines a comprehensive validation methodology adapted from recent high-quality studies in the field.

[Figure: five-stage validation workflow. 1. Data collection & curation (diverse oncology documents such as pathology reports and progress notes, multiple cancer types, institutional diversity) → 2. Gold standard creation (manual abstraction by clinical specialists, predefined annotation guidelines, inter-rater reliability assessment) → 3. Model development (prompt engineering for LLMs, feature engineering for traditional ML, rule definition for rule-based systems) → 4. Performance evaluation (precision, recall, F1 scores; comparison to gold standard; stratification by cancer type and document) → 5. Clinical utility assessment (clinician performance with and without AI, trial screening efficiency, documentation time reduction).]

Figure 2: Comprehensive validation workflow for NLP systems in oncology, progressing from data collection through clinical utility assessment, with specific methodological considerations at each stage.

Key Performance Metrics and Evaluation Criteria

Rigorous evaluation of NLP systems requires multiple performance dimensions assessed through standardized metrics. The oncology research context demands particular attention to clinical relevance and potential impact on research workflows.

Table 2: Standard evaluation metrics for NLP systems in oncology applications

Performance Dimension Key Metrics Target Benchmarks Evaluation Method
Concept Extraction Accuracy Precision, Recall, F1-score F1 > 0.85 for key concepts [4] Comparison to gold standard manual abstraction
Clinical Validity Sensitivity, Specificity, AUC Sensitivity: 0.90-0.94, Specificity: 0.87-0.95 [27] Cross-reference with clinical outcomes
Generalizability Performance variation across cancer types, institutions F1 >= 0.85 across all cancer types [4] Cross-validation, external validation
Clinical Utility Time savings, clinician accuracy improvement Improved clinician performance with AI assistance [28] Pre-post implementation studies
Calibration Calibration plots, Brier score Ratio between predicted/observed outcomes [28] Graphical assessment of prediction reliability
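To make the concept-extraction metrics in Table 2 concrete, the following sketch computes micro-averaged precision, recall, and F1 for extracted concepts against a manually abstracted gold standard. The record structure and concept labels are illustrative assumptions, not drawn from any cited study.

```python
def concept_extraction_metrics(predicted: list, gold: list) -> dict:
    """Micro-averaged precision/recall/F1 of extracted concept sets,
    compared document by document against gold-standard abstraction."""
    tp = fp = fn = 0
    for pred_concepts, gold_concepts in zip(predicted, gold):
        tp += len(pred_concepts & gold_concepts)  # correctly extracted
        fp += len(pred_concepts - gold_concepts)  # spurious extractions
        fn += len(gold_concepts - pred_concepts)  # missed concepts
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}

# Hypothetical example: concepts extracted from two pathology reports
predicted = [{"stage:T2N0M0", "grade:2"}, {"histology:adenocarcinoma"}]
gold = [{"stage:T2N0M0", "grade:3"}, {"histology:adenocarcinoma"}]
print(concept_extraction_metrics(predicted, gold))
# precision = recall = F1 = 0.667: one wrong grade counts against all three
```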

Beyond these technical metrics, successful validation should include assessment of clinical utility involving end-users. Recent studies encompassing 499 clinicians and 12 different assessment tools have demonstrated that AI assistance improves clinician performance in tasks such as trial eligibility screening and documentation accuracy [28]. This real-world validation is essential for establishing the practical value of NLP systems in oncology research and clinical contexts.

Research Reagent Solutions: Essential Tools for Oncology NLP

Implementing successful NLP projects in oncology requires both technical infrastructure and domain-specific resources. The following table details essential components of the research toolkit for developing and validating NLP systems for oncology data extraction.

Table 3: Essential research reagents and tools for oncology NLP projects

Tool Category Specific Solutions Function Application Context
Data Management Platforms iCore [29], OSIRIS RWD [30], OMOP CDM [30] Harmonizes diverse datasets (genomics, proteomics, imaging) and ensures regulatory compliance Multi-institutional research collaborations, regulatory-grade RWE generation
NLP Frameworks & Models BERT [25], GPT variants [25], Transformer models [26] Provides pre-trained language understanding capabilities for clinical text Rapid development of information extraction pipelines
Standardized Data Models OMOP Common Data Model [30], OSIRIS [30], FHIR [30] Enables standardized data representation and cross-system interoperability Health system integrations, regulatory submissions
Annotation Tools Clinical specialist manual abstraction [4], Structured annotation guidelines Creates gold standard datasets for model training and validation Supervised learning projects, model validation
Validation Frameworks QUADAS-2 [27], Clinical utility assessments [28] Assesses risk of bias and clinical applicability of NLP systems Peer-reviewed research, regulatory evaluation

These research reagents collectively enable the end-to-end development, validation, and deployment of NLP systems for oncology research. Platforms like iCore are particularly valuable for addressing the "data dilemma" in AI development by ensuring proper harmonization of diverse datasets from genomics, proteomics, and imaging sources, which is essential for building trustworthy AI models [29]. Similarly, standardized data models like OSIRIS and OMOP facilitate the structured representation of extracted information, enabling cross-system interoperability and collaborative research initiatives across multiple cancer centers [30].

The integration of AI and NLP technologies into oncology research represents a paradigm shift in how we extract knowledge from unstructured clinical data. The quantitative evidence demonstrates that modern approaches, particularly large language models, achieve clinically acceptable performance levels for automating the extraction of critical oncology concepts from EHRs. These capabilities directly address fundamental challenges in real-world oncology data validation by enabling more efficient, comprehensive, and accurate structuring of patient information for research purposes.

Looking forward, several emerging trends will shape the next generation of oncology NLP applications. The integration of multi-omics data—drawing from genomics, transcriptomics, proteomics, and metabolomics—will provide a more comprehensive picture of cancer biology that extends beyond singular dysregulated genes or signaling pathways [29]. Additionally, the growing regulatory acceptance of AI-defined biomarkers and the intentional incorporation of AI tools into clinical trial designs promise to accelerate the translation of these technologies into practical research applications [29]. However, success will ultimately depend on how well these AI tools integrate into clinical and operational workflows, not just the sophistication of the underlying algorithms [29].

For researchers, scientists, and drug development professionals, these advancements offer unprecedented opportunities to leverage real-world data at scale. By implementing robust validation methodologies and selecting appropriate NLP approaches for specific research questions, the oncology research community can harness the full potential of unstructured data to accelerate drug development, personalize treatment approaches, and improve outcomes for cancer patients.

The shift towards data-driven oncology research, accelerated by initiatives like the Cancer Moonshot, has made the curation of electronic health record (EHR) data a critical scientific competency [31] [32]. Real-world evidence (RWE) generated from EHRs is now integral to understanding disease progression, supporting drug development, and optimizing patient care [31]. However, EHR data exists in two fundamentally different forms—structured and unstructured—each requiring distinct curation methodologies. This guide objectively compares techniques for handling these data types, focusing on their validation within real-time oncology research contexts. For researchers and drug development professionals, selecting the appropriate curation strategy is paramount for ensuring data quality, relevance, and reliability for specific use cases, from clinical trial design to post-market surveillance.

Structured data refers to highly organized information with predefined formats, typically stored in tabular forms like relational databases. In oncology EHRs, this includes demographic information, laboratory test results (e.g., numerical values from blood tests), vital signs, medication prescriptions, and standardized diagnosis codes like ICD-10 [33] [34]. Unstructured data, which constitutes an estimated 80-90% of all digital information, lacks a pre-defined model and includes clinical notes, pathology reports, radiology interpretations, and discharge summaries [33]. A third category, semi-structured data (e.g., JSON, XML formats), offers some organizational tags without rigid schema requirements [35].

The core distinctions between structured and unstructured data impact every aspect of their management, from storage to analysis. The table below summarizes these key differences.

Table 1: Core Characteristics of Data Types in Oncology Research

Aspect Structured Data Unstructured Data
Schema & Format Predefined, tabular format (rows/columns); schema-dependent [33] [35] Schemaless; stored in native formats (text, PDF, images) [33] [35]
Oncology Examples Patient demographics, ICD-10 codes, lab values, medication orders, TNM staging [31] [34] Pathology reports, clinical narratives, radiology notes, surgical summaries [31] [34]
Storage Solutions Relational databases (SQL); Data Warehouses [33] [35] Data lakes, NoSQL databases; Cloud object storage [33] [35]
Primary Analysis Tools SQL, traditional BI and statistical tools [33] [35] NLP, Machine Learning, AI-based indexing [36] [37]
Inherent Nature Quantitative, easily countable [33] Qualitative, rich in context and nuance [33]

Data Provenance and Workflow Integration

In clinical settings, structured data is often generated through discrete entry fields in EHRs, such as dropdown menus for Eastern Cooperative Oncology Group (ECOG) performance status or checkboxes for symptoms [31]. This data is extracted from various hospital systems and harmonized into computable standard terminologies. Unstructured data, conversely, originates from free-text entries composed by clinicians. This includes the rich contextual details found in clinical narratives and tumor board notes, which are crucial for understanding patient-specific factors and complex disease presentations [31] [34].

Curation Techniques and Methodologies

The transformation of raw EHR data into a research-ready resource requires sophisticated, fit-for-purpose curation pipelines. The following workflow diagram illustrates the parallel processes for structured and unstructured data, culminating in a unified dataset for evidence generation.

[Workflow diagram: Electronic Health Record (EHR) → Structured Data Extraction (automated SQL queries, ETL) → Structured Variables (demographics, lab values, staging); EHR → Unstructured Data Processing (NLP, LLM abstraction, human review) → Curated Structured Variables from text (recurrence, biomarkers); both streams converge into a Unified Research Dataset for evidence generation]

Diagram 1: Oncology Data Curation Workflow

Structured Data Curation

The curation of structured data focuses on harmonization and validation. Data from disparate EHR systems and formats are mapped to common data models, such as the Fast Healthcare Interoperability Resources (FHIR) standard, and standardized terminologies (e.g., SNOMED CT, LOINC) [36] [34]. The process involves:

  • Extract, Transform, Load (ETL): Automated processes extract data from source systems, apply business rules for transformation, and load it into a target database or warehouse [33].
  • Data Quality Checks: Implementing verification for conformance (data matches expected type), consistency (values are logically consistent across related fields), and plausibility (values fall within expected ranges) [31]; a minimal sketch of these checks follows this list.
  • Validation Techniques: Accuracy is assessed by comparing curated variables to internal or external reference standards where available, or through indirect benchmarking against known population distributions [31].
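A minimal sketch of the conformance, consistency, and plausibility checks described above, using pandas; the column names and plausibility ranges are illustrative assumptions rather than values from the cited framework.

```python
import pandas as pd

def run_quality_checks(df: pd.DataFrame) -> pd.DataFrame:
    """Return one boolean column per check; False marks a failing row."""
    flags = pd.DataFrame(index=df.index)
    # Conformance: dates must parse to valid datetimes
    flags["dx_date_conforms"] = pd.to_datetime(
        df["diagnosis_date"], errors="coerce").notna()
    # Consistency: treatment cannot start before diagnosis
    flags["tx_after_dx"] = (
        pd.to_datetime(df["treatment_start"], errors="coerce")
        >= pd.to_datetime(df["diagnosis_date"], errors="coerce"))
    # Plausibility: age at diagnosis within an expected range
    flags["age_plausible"] = df["age_at_dx"].between(0, 110)
    return flags

records = pd.DataFrame({
    "diagnosis_date": ["2022-03-01", "not-a-date"],
    "treatment_start": ["2022-03-15", "2022-01-01"],
    "age_at_dx": [63, 140],
})
print(run_quality_checks(records))  # second row fails all three checks
```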

Unstructured Data Curation

Curation of unstructured data is the process of converting clinical text into structured, analyzable fields. Methodologies exist on a spectrum from manual to fully automated.

  • Manual Abstraction: Traditionally the gold standard, this involves trained abstractors (e.g., clinical research coordinators) reviewing clinical notes to extract and code specific variables. While highly accurate for complex concepts, it is resource-intensive and difficult to scale [31].
  • Rule-Based Natural Language Processing (NLP): This approach uses custom-written rules or dictionaries to identify and extract specific concepts from text (e.g., flagging a note that contains "tumor size" followed by a measurement).
  • Machine Learning (ML)/Deep Learning Models: Supervised ML models can be trained on pre-annotated clinical text to identify and extract complex clinical concepts, such as disease progression or recurrence status [36] [31]. These models can capture context and nuance better than rigid rules.
  • Large Language Model (LLM) Processing: Recent studies demonstrate the use of LLMs like Claude 3.5 Sonnet to automate the structuring of clinical data from deidentified EHR extracts [37]. A typical protocol involves a multi-phase prompt refinement process where the LLM is trained on sample data to accurately extract and structure factors like tumor characteristics, nodal status, and biomarker information from complex clinical narratives [37].
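A sketch of this kind of LLM-based structuring step appears below. The prompt wording and field list are an illustrative subset of the 31 factors, and call_llm is a placeholder for whatever model API is used (the cited study used Claude 3.5 Sonnet); it is not a real SDK call.

```python
import json

EXTRACTION_PROMPT = """You are curating deidentified oncology records.
From the clinical note below, return a JSON object with exactly these keys:
tumor_size_mm (number or null), nodal_status (e.g. "N1", or null),
er_status, pr_status, her2_status ("positive"/"negative"/null).
Use null whenever the note does not state a value.
---
{note}
---"""

def call_llm(prompt: str) -> str:
    """Placeholder for a real model API call (hosted or locally cached);
    expected to return the model's raw text response."""
    raise NotImplementedError

def structure_note(note: str) -> dict:
    """Convert one deidentified clinical note into a structured record."""
    raw = call_llm(EXTRACTION_PROMPT.format(note=note))
    record = json.loads(raw)           # fail loudly on malformed output
    missing = {"tumor_size_mm", "nodal_status"} - record.keys()
    if missing:
        raise ValueError(f"LLM response missing fields: {missing}")
    return record
```

In practice, the multi-phase refinement reported in the study would iterate on the prompt against an annotated sample until extraction accuracy stabilizes, before the pipeline is applied at scale.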

Performance Comparison: Experimental Data and Validation

Evaluating the fitness of curated data requires assessing its performance across multiple dimensions, including prediction accuracy, operational efficiency, and alignment with established data quality frameworks.

Predictive Model Performance

A 2023 study directly compared the performance of Machine Learning models in predicting 5-year breast cancer recurrence using different data sources [36]. The eXtreme Gradient Boosting (XGB) model was trained on three distinct datasets derived from the same patient cohort.

Table 2: ML Performance for Breast Cancer Recurrence Prediction (5-Year)

Dataset Type Precision Recall F1-Score AUROC
Structured Data Only 0.900 0.907 0.897 0.807 [36]
Unstructured Data Only (Performance lower than structured data across all four metrics) [36]
Combined Dataset (Poorest performance among the three across all four metrics) [36]

This study found that structured data alone yielded the best predictive performance [36]. The authors noted that an NLP-based approach on unstructured data offered comparable results with potentially less manual mapping effort, suggesting context-dependent utility [36].
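The modeling setup can be sketched as follows, assuming a preassembled structured feature matrix; the feature values here are random placeholders and the hyperparameters are defaults, so the sketch reproduces the study design rather than its reported numbers.

```python
import numpy as np
from xgboost import XGBClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score, f1_score

# X: curated structured features (age, stage, receptor status, ...)
# y: 5-year recurrence labels from abstraction; both are placeholders
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 20))
y = rng.integers(0, 2, size=500)

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)
model = XGBClassifier(eval_metric="logloss")
model.fit(X_tr, y_tr)

proba = model.predict_proba(X_te)[:, 1]
print("AUROC:", round(roc_auc_score(y_te, proba), 3))
print("F1:   ", round(f1_score(y_te, proba > 0.5), 3))
```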

Curation Efficiency and Accuracy

A 2025 study compared traditional manual review against an LLM-based processing pipeline for curating breast cancer surgical oncology data [37]. The experimental protocol involved extracting 31 clinical factors from patient records.

Table 3: Manual Review vs. LLM-Based Curation Efficiency

Curation Metric Manual Physician Review LLM-Based Processing
Processing Time 7 months (5 physicians) 12 days (2 physicians) [37]
Total Physician Hours 1025 hours 96 hours (91% reduction) [37]
Reported Accuracy (Benchmark for comparison) 90.8% [37]
Cost per Case (Labor-intensive) US $0.15 [37]
Key Strength Established benchmark Superior capture of survival events (41 vs. 11) [37]

The study concluded that the two-step approach—automated data extraction followed by LLM curation—addressed both privacy and efficiency needs, providing a scalable solution for retrospective clinical research while maintaining data quality [37].

Data Quality Framework Alignment

Regulatory agencies like the FDA and EMA emphasize relevance and reliability as primary data quality dimensions for RWE generation [31]. The table below applies this framework to the two curation paradigms.

Table 4: Quality Dimension Assessment for Curation Outputs

Quality Dimension Structured Data Curation Unstructured Data Curation
Relevance High for defined variables (e.g., treatments, lab values); availability is clear [31] Enables relevance for concepts not in structured fields (e.g., disease severity, symptom details) [31]
Reliability: Accuracy Assessed via validation against reference standards; high conformance to predefined rules [31] Accuracy is task-dependent; LLMs show >90% on structured tasks but require validation [37]
Reliability: Completeness Easily measured against expected data points [31] Completeness depends on source documentation and extraction thoroughness [31]
Reliability: Provenance Highly traceable through ETL pipelines and data transformation logs [31] Requires detailed metadata on abstraction method (human, NLP, LLM) and versioning [31]

Successful curation and utilization of oncology data often involve leveraging a suite of public resources and analytical tools.

Table 5: Essential Resources for Oncology Data Curation and Validation

Resource or Tool Type Primary Function in Curation & Research
FHIR (Fast Healthcare Interoperability Resources) Data Standard Provides a modern, web-based standard for exchanging EHR data, facilitating the harmonization of both structured and unstructured elements [36].
cBioPortal Genomic Database A public resource for exploring, visualizing, and analyzing multidimensional cancer genomics data; useful for validating molecular findings from EHRs [32].
The Cancer Genome Atlas (TCGA) Genomic Database A landmark public dataset containing multi-omics data from thousands of patients; serves as a critical reference for benchmarking and discovery [38].
PROBAST (Prediction model Risk Of Bias ASsessment Tool) Methodological Tool A structured tool to assess the risk of bias and applicability of diagnostic and prognostic prediction model studies, crucial for evaluating ML models [39].
NLP/LLM Platforms (e.g., Claude, GPT) Curation Tool Used to automate the structuring of information from clinical narratives, pathology reports, and other unstructured text sources [37].
Data Lakes (e.g., Amazon S3, Azure Blob) Storage Solution Cloud object storage systems designed to hold vast volumes of raw, unstructured data in its native format prior to curation [35].

The choice between structured and unstructured data curation is not a binary one; rather, it is a strategic decision based on the research question, available resources, and required level of precision. Structured data curation provides a robust, efficient pathway for variables that are routinely and discretely captured in EHRs, consistently demonstrating high performance in predictive modeling tasks [36] [39]. In contrast, unstructured data curation, through NLP or modern LLMs, is indispensable for unlocking the rich, contextual details of patient care and capturing clinical phenotypes not represented in structured fields [31] [37].

The emerging paradigm is one of integration. The most powerful real-world evidence will come from studies that intelligently combine the quantitative precision of curated structured data with the qualitative depth extracted from unstructured narratives. Future advancements will continue to blur the lines between these two types, with LLMs playing an increasingly central role in scaling the curation of complex clinical concepts, thereby accelerating oncology research and drug development.

The Datagateway system represents a significant advancement in real-time oncology data extraction, demonstrating high reliability in automating the transfer of structured Electronic Health Record (EHR) data to the Netherlands Cancer Registry (NCR). This validation study assesses the system's performance against the established standard of manual data entry, which has been the traditional methodology for population-based cancer registries. The imperative for this technological evolution is clear: manual registration is time-consuming and labor-intensive, creating limitations in the timeliness and scalability of data collection essential for modern oncology research and real-world evidence generation [7]. The findings indicate that automated data extraction via the Datagateway is not only feasible but also highly accurate, enabling near real-time insights into cancer treatment patterns and outcomes [7].

The Datagateway system is designed to address critical bottlenecks in cancer data aggregation. Its core function is to automatically harmonize and transfer structured EHR data from multiple hospitals into a common data model, directly supporting the enrichment of the NCR [7]. This positions it as a next-generation solution against a backdrop of traditional and contemporary alternatives.

The table below outlines the key characteristics of the Datagateway system compared to other common data collection methodologies.

Table: Comparison of Oncology Data Collection Methodologies

Methodology Description Key Advantages Key Limitations
Manual Abstraction (Traditional Standard) Trained registration clerks abstract data directly from medical records [40]. Established, high-quality data; handles unstructured data [40]. Extremely time-consuming, labor-intensive, costly, slower data availability [7].
Datagateway (Automated System) Automated system that harmonizes structured EHR data into a common model for real-time transfer [7]. High-speed, scalable, enables real-time surveillance, reduces manual burden [7]. Limited to structured EHR data; accuracy dependent on source data quality and system coding.
Enterprise Data Warehouses (EDWs) Centralized databases that aggregate EHR data for research and reporting [2]. Consolidates data from across a health system; useful for internal analytics. Prone to data quality issues (missing data, inconsistent coding); complex queries require informatics support; often not designed for interoperability [2].
Basic EHR Export & Reporting Use of built-in, hospital-specific EHR reporting tools. Leverages existing system functionality; no new infrastructure needed. Lack of data standardization across hospitals; ill-documented local codes; poor interoperability [2].

Experimental Validation & Performance Data

The validation of the Datagateway system was conducted rigorously, assessing its performance across multiple data domains critical to a cancer registry. The study utilized data from patients with acute myeloid leukemia (AML), multiple myeloma, lung cancer, and breast cancer [7].

Validation of Diagnostic Data

The system's ability to correctly identify and process new cancer diagnoses was tested both prospectively and retrospectively.

  • Prospective Validation: Of 1,287 patient records evaluated, 1,219 (95%) met the NCR inclusion criteria via the Datagateway. The 5% that did not were primarily due to relapsed disease or preliminary, unconfirmed diagnoses already in the EHR [7].
  • Retrospective Validation: The system successfully retrieved 100% of patients recorded in the NCR from a sample of 384 records. Furthermore, 89% of these patients were identified with a care trajectory and diagnosis within the same year as the NCR record [7].

Validation of Treatment Data

Treatment data is complex, often involving numerous combination regimens. The Datagateway system was validated against manually recorded NCR data and EHR source data.

  • AML Treatment: Concordance was 100% when comparing treatment regimens for 254 AML patients identified by the Datagateway to the reference standard [7].
  • Multiple Myeloma (MM) Treatment: Across 198 treatment regimens for 117 MM patients, 192 (97%) were correctly identified. The 3% misclassification rate was primarily related to specific dosing nuances and regimens not included in the system's initial classification rules [7].

Table: Summary of Datagateway System Validation Performance

Validation Metric Data Type Sample Size Accuracy Notes
Diagnosis Retrieval Retrospective 384 patients 100% All NCR-recorded patients were retrieved [7].
New Diagnosis Inclusion Prospective 1,287 patients 95% Compared to NCR inclusion criteria [7].
Treatment Regimen (AML) Cross-sectional 254 patients 100% 100% concordance with NCR/EHR source data [7].
Treatment Regimen (MM) Cross-sectional 198 regimens 97% Misclassifications involved specific drug combinations and dosing [7].
Laboratory Values Cross-sectional Various ~100% Virtually complete match with source data [7].
Toxicity Indicators Cross-sectional Various 72%-100% Accuracy varied by specific toxicity indicator [7].

Experimental Protocol & Methodology

The validation process for the Datagateway system can be summarized in the following workflow, which illustrates the key stages of data extraction, harmonization, and validation.

[Workflow diagram: Heterogeneous hospital EHRs → Data Extraction → Data Harmonization (Common Data Model) → Datagateway Output → Validation Comparison against Reference Standards → Performance Analysis → Accuracy Assessment]

Detailed Methodological Steps

  • Data Extraction: Structured data regarding diagnosis, treatment, laboratory values, and toxicity indicators were extracted from the EHRs of multiple participating hospitals [7].
  • Data Harmonization: The extracted data was processed and harmonized by the Datagateway system into a Common Data Model. This critical step ensures that data from different source systems with varying formats and codes is standardized into a consistent structure for the registry [7]; a schematic mapping example appears after this list.
  • Validation Comparison: The output from the Datagateway was compared against two primary reference standards:
    • The established Netherlands Cancer Registry (NCR) data, which is manually abstracted by trained clerks and serves as the gold standard for population-level data [40].
    • Direct review of the EHR source data to verify treatment details and laboratory values [7].
  • Performance Analysis: Quantitative metrics including accuracy, concordance, and misclassification rates were calculated for each data domain (e.g., diagnoses, treatment regimens) to provide a comprehensive performance assessment [7].
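The harmonization step can be illustrated with a simple mapping layer, as noted in the list above; the local drug codes, field names, and target model below are invented for illustration and do not reflect the Datagateway's actual rules.

```python
from dataclasses import dataclass

# Illustrative site-specific drug codes mapped to a shared vocabulary
SITE_A_DRUG_MAP = {"CYTA": "cytarabine", "DAUNO": "daunorubicin"}
SITE_B_DRUG_MAP = {"ara-C": "cytarabine", "DNR": "daunorubicin"}

@dataclass
class CdmTreatmentRecord:
    patient_id: str
    drug: str        # standardized drug name
    start_date: str  # ISO 8601

def harmonize(raw: dict, drug_map: dict) -> CdmTreatmentRecord:
    """Map one site-specific EHR row onto the common data model,
    rejecting local codes the mapping table does not recognize."""
    code = raw["local_drug_code"]
    if code not in drug_map:
        raise ValueError(f"Unmapped local code: {code!r}")
    return CdmTreatmentRecord(
        patient_id=raw["mrn"],
        drug=drug_map[code],
        start_date=raw["start"],
    )

row = {"mrn": "A-001", "local_drug_code": "ara-C", "start": "2024-05-02"}
print(harmonize(row, SITE_B_DRUG_MAP))
```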

The Researcher's Toolkit: Essential Components for Real-Time Data Validation

The successful implementation and validation of an automated system like the Datagateway rely on a combination of technological infrastructure, data standards, and methodological frameworks.

Table: Essential Components for Real-Time Oncology Data Validation

Component Function in Validation Application in Datagateway Study
Common Data Model (CDM) Provides a standardized structure for harmonizing heterogeneous data from multiple sources, enabling consistent analysis and comparison. The core of the Datagateway system, allowing it to integrate data from different hospital EHRs into a unified format for the NCR [7].
Electronic Health Records (EHRs) Serve as the primary source of real-world patient data, including diagnoses, treatments, lab results, and outcomes. The source systems from which structured data on diagnosis, treatment, and lab values were extracted for validation [7].
Validation Framework A structured protocol defining the reference standards, comparison metrics, and statistical methods for assessing data accuracy. The study design comparing Datagateway output to manual NCR data and EHR source data across multiple cancer types and data domains [7].
Reference Standard Registry (NCR) A high-quality, manually curated data source that serves as the benchmark for validating the automated system's output. The Netherlands Cancer Registry itself was used as the gold standard for validating diagnoses and treatment data [7] [40].
Structured Data Fields Pre-defined, coded fields within the EHR (e.g., medication lists, lab codes) that are essential for reliable automated extraction. The validation focused on structured EHR data, which is a prerequisite for the high accuracy achieved by the Datagateway system [7].

The validation study demonstrates that the Datagateway system is a highly accurate and reliable method for automating data flow from EHRs to a population-based cancer registry. With performance metrics exceeding 95% accuracy for critical data points like new diagnoses and complex treatment regimens, it presents a robust alternative to traditional manual abstraction [7]. This capability is a cornerstone for building a true Learning Health System (LHS) in oncology, where data from routine clinical practice can be rapidly analyzed to generate knowledge and inform care [2].

The primary challenge, as seen in the minor misclassifications of MM regimens, lies in the nuances of clinical data, such as dosing and complex treatment sequences. This underscores that automated systems require continuous refinement and validation against clinical reality. Furthermore, the effectiveness of systems like the Datagateway is contingent upon the availability and quality of structured data within source EHRs [2].

In conclusion, the Datagateway system validates the feasibility of automated, real-time EHR data integration using a harmonized common model. It offers a scalable and high-quality solution to the growing demands for timely real-world oncology data, thereby accelerating research and enhancing the ability to monitor and improve cancer care on a population level [7].

Navigating Challenges: Strategies for Data Quality and Interoperability

Conquering Data Fragmentation and Lack of Interoperability

For oncology researchers and drug development professionals, data fragmentation remains a formidable obstacle to generating robust real-world evidence. This guide objectively compares three leading technological approaches—standardized federated networks, natural language processing (NLP)-driven integration, and continuous multimodal supply chains—based on their implementation in current research ecosystems. Performance data extracted from peer-reviewed studies demonstrate that while each approach offers distinct advantages for specific research use cases, FHIR-based federated models currently provide the most balanced solution for multi-institutional observational studies, whereas NLP-enabled platforms excel at unlocking unstructured data for clinicogenomic discovery. The validation protocols and performance metrics presented herein provide a framework for selecting appropriate interoperability solutions based on research objectives, data types, and operational constraints.

Oncology research increasingly relies on electronic health record (EHR) data, yet this information exists in siloed systems with varying standards and structures. This fragmentation impedes the aggregation of sufficient datasets for meaningful analysis, particularly for rare cancers or subpopulations. Beyond technical compatibility issues, semantic interoperability—ensuring data elements maintain consistent meaning across systems—presents additional complexity for multi-site studies. Current research initiatives are deploying diverse strategies to overcome these barriers, each with validated performance characteristics that inform their optimal application in real-world evidence generation.

Comparative Analysis of Interoperability Approaches

Table 1: Performance Comparison of Interoperability Solutions in Oncology Research

Solution Approach Implementation Scope Data Quality Accuracy Primary Use Cases Scalability Assessment
FHIR-based Federated Networks [41] 6 university hospitals; 17,885 cancer cases Comparable to cancer registry data Multi-institutional observational studies; Privacy-preserving analysis Modular architecture supports expansion; Handles diverse IT infrastructures
NLP-Enabled Clinicogenomic Platforms [42] 24,950 patients; 705,241 radiology reports AUC >0.9; Precision/recall >0.78 for NLP tasks Clinicogenomic discovery; Survival outcome prediction Six times larger than manually curated cohorts; Generalizes across cancer types
Continuous Multimodal Data Supply Chains [43] 171,128 patients across 11 cancer types 92.6% accuracy (surgical pathology); 98.7% (molecular pathology) Clinical decision support; Longitudinal treatment tracking Daily updates of 800+ features; Processes ~81 quality control cases daily

Table 2: Technical Implementation Characteristics

Technical Feature FHIR Federated Pipeline [41] NLP-Driven Integration [42] Multimodal Supply Chain [43]
Data Standards HL7 FHIR; oBDS PRISSMM methodology; Structured & unstructured data ICD-O coding; DICOM for imaging
Transformation Methods XML-to-FHIR mapping; Tabular format conversion Transformer models; Rule-based extraction ETL with NLP; Tokenization techniques
Quality Validation Comparison with cancer registry data Cross-validation; External dataset testing 143 logical QC checks; Manual verification
Unstructured Data Handling Limited Core capability (notes, reports) NLP for pathology/radiology reports

Experimental Protocols and Validation Methodologies

FHIR-Based Federated Network Implementation

The Bavarian Cancer Research Center consortium implemented a modular data transformation pipeline across six university hospitals with heterogeneous IT systems [41]. Their experimental protocol involved:

  • Data Extraction: Two input interfaces were deployed—a direct ONKOSTAR database connector and a folder import mechanism for XML-based oBDS collections from other tumor documentation systems.

  • Transformation Process: Source data was converted to HL7 FHIR format, then to tabular format compatible with the DataSHIELD federated analysis environment. Pseudonymization was performed using site-specific tools (entici or gPAS) before analysis.

  • Validation Methodology: Researchers defined a cohort of patients diagnosed with cancer in 2022 to address two research questions on tumor entity distribution and gender patterns. Validation compared federated analysis results against the Bavarian Cancer Registry and local tumor documentation systems, assessing discrepancies through manual audit.

  • Performance Outcomes: The analysis successfully incorporated 17,885 cancer cases from 2021/2022. Expected variations from registry data (e.g., higher malignant melanoma rates: 10.7% vs 5.3%) were attributed to differing time periods and data source scope, confirming the pipeline's validity while highlighting contextual factors in interoperability assessments [41].

NLP-Enabled Clinicogenomic Data Integration

Memorial Sloan Kettering's MSK-CHORD initiative developed a framework for integrating structured and unstructured oncology data [42]. Their experimental approach included:

  • NLP Model Development: Transformer architectures were trained on the Project GENIE Biopharma Collaborative dataset, with manual clinician annotations serving as ground truth. Models were validated using fivefold cross-validation against manual curation labels.

  • Feature Annotation: Algorithms were designed to extract specific clinical features from free-text reports: cancer progression and sites from radiology reports; prior outside treatment from clinician notes; receptor status from clinical documentation.

  • Performance Metrics: Model performance was quantified using area under the curve (AUC), precision, and recall. Discrepancies between model predictions and curation labels underwent retrospective clinician review, which revealed that confident transformer scores often indicated curator error rather than model failure.

  • Validation Outcomes: All NLP models achieved AUC >0.9 with precision and recall >0.78. In a head-to-head comparison, NLP-derived annotations for metastatic sites demonstrated precision and recall improvements of 0.03-0.32 over billing codes alone. Hold-one-cancer-out experiments confirmed generalizability across cancer types not represented in training data [42].
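The fivefold cross-validation loop described above follows a standard pattern; the sketch below substitutes a simple TF-IDF/logistic-regression classifier for the transformer models, since reproducing those is out of scope, and the report snippets and labels are invented.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold
from sklearn.pipeline import make_pipeline

# Placeholder corpus: radiology report snippets with progression labels
reports = ["new hepatic lesions consistent with progression",
           "stable disease, no new lesions",
           "interval growth of pulmonary nodules",
           "no evidence of disease"] * 25
labels = np.array([1, 0, 1, 0] * 25)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
aucs = []
for train_idx, test_idx in cv.split(reports, labels):
    model = make_pipeline(TfidfVectorizer(),
                          LogisticRegression(max_iter=1000))
    model.fit([reports[i] for i in train_idx], labels[train_idx])
    scores = model.predict_proba([reports[i] for i in test_idx])[:, 1]
    aucs.append(roc_auc_score(labels[test_idx], scores))

print("fold AUCs:", np.round(aucs, 3), "mean:", round(float(np.mean(aucs)), 3))
```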

Continuous Multimodal Data Supply Chain

The Yonsei Cancer Data Library framework established a real-time data integration system within a single academic cancer center [43]. The implementation methodology consisted of:

  • Data Acquisition: Developed a patient-centric data model anchored by hospital identification numbers, linking anonymized datasets across clinical, genomic, and imaging domains.

  • ETL Process: Customized Extract-Transform-Load algorithms were created for each of 817 predefined features, incorporating NLP for unstructured data processing. Specific selection approaches were tailored for 11 cancer types based on ICD-O coding and cancer registry criteria.

  • Quality Control Framework: Implemented 143 logical comparisons for quality control: 70 for missing data, 41 for temporal validity, 15 for outlier detection, 13 for relevant value selection, and 4 for duplicate/inconsistency identification; representative checks are sketched after this list.

  • Validation Protocol: Accuracy was assessed through manual chart review comparison for surgical and molecular pathology features. NLP classification models were evaluated against 1,000 CT reports using AUROC and F1 scores, with temporal accuracy measured as correct prediction within ±30 days of disease progression.

  • Performance Outcomes: The system achieved median accuracies of 92.6% for surgical pathology and 98.7% for molecular pathology data extraction. The NLP model for CT reports achieved AUROC of 0.956 and accurately predicted disease progression day within ±30 days for 72.3% of cases [43].
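Representative checks from the four largest QC categories, as referenced in the list above, can be expressed as simple predicates over the curated table; the thresholds and column names here are illustrative assumptions.

```python
import pandas as pd

def qc_report(df: pd.DataFrame) -> dict:
    """Count violations for four representative QC categories."""
    return {
        # Missing data: required features must be populated
        "missing_stage": int(df["stage"].isna().sum()),
        # Temporal validity: progression cannot precede diagnosis
        "progression_before_dx": int(
            (df["progression_date"] < df["diagnosis_date"]).sum()),
        # Outlier detection: implausible laboratory values
        "wbc_outliers": int((~df["wbc_k_per_ul"].between(0.1, 500)).sum()),
        # Duplicates: one record per patient per diagnosis date
        "duplicate_records": int(
            df.duplicated(subset=["patient_id", "diagnosis_date"]).sum()),
    }

df = pd.DataFrame({
    "patient_id": ["p1", "p1", "p2"],
    "diagnosis_date": pd.to_datetime(["2021-01-05", "2021-01-05", "2022-06-01"]),
    "progression_date": pd.to_datetime(["2021-08-01", "2021-08-01", "2022-05-01"]),
    "stage": ["III", "III", None],
    "wbc_k_per_ul": [7.4, 7.4, 900.0],
})
print(qc_report(df))  # one violation in each category
```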

Visualizing Interoperability Workflows

[Architecture diagram: Data sources (EHR systems, imaging systems, genomic data, pathology reports) feed an interoperability layer in which pathology reports pass through NLP processing and all data are standardized (FHIR, OMOP, DICOM), integrated, and quality controlled before flowing to research applications (federated analysis, clinical decision support, predictive modeling, real-world evidence)]

Oncology Data Interoperability Pipeline

[Diagram: Validation inputs (gold standard data from cancer registries and manual abstraction; source system data from EHRs and tumor registries; external multi-institutional datasets; manual chart review; 143 automated logical QC checks) feed validation methods (comparative analysis; fivefold cross-validation), yielding metrics: accuracy (%), precision/recall, AUC-ROC, and temporal accuracy (± days)]

Data Validation Framework Methodology

The Researcher's Toolkit: Essential Solutions for Oncology Data Interoperability

Table 3: Essential Research Reagents and Solutions for Oncology Data Interoperability

Solution Category Specific Tools/Standards Research Application
Data Standards HL7 FHIR [41] [44]; OMOP CDM [44]; ICD-O-3 [45] Standardized data exchange and semantic interoperability across systems
NLP Technologies Transformer architectures [42]; Rule-based models [42]; Tokenization techniques [43] Extraction of structured information from unstructured clinical notes and reports
Integration Platforms DataSHIELD [41]; Apache Kafka [41] [44]; ETL algorithms [43] Privacy-preserving analysis and real-time data pipeline management
Quality Control Frameworks Logical comparison checks [43]; Cross-validation [42]; Registry benchmarking [41] Ensuring data accuracy, completeness, and reliability for research use
Terminology Systems SNOMED CT [44]; LOINC [44]; OncoKB [44] Standardized coding of clinical concepts and molecular alterations

The comparative analysis presented demonstrates that no single solution completely resolves oncology data fragmentation, yet each approach offers validated pathways for specific research contexts. FHIR-based federated networks prove optimal for multi-institutional studies requiring privacy preservation, while NLP-enabled platforms provide superior unstructured data extraction for clinicogenomic discovery. Continuous multimodal supply chains offer the most comprehensive solution for single-institution research environments requiring real-time data access.

For drug development professionals, these interoperability solutions directly enhance real-world evidence generation by improving data quality, expanding cohort sizes, and enabling more sophisticated predictive modeling. Future directions should emphasize hybrid approaches that combine the strengths of these methodologies, particularly as regulatory standards evolve toward structured electronic case reporting and real-time cancer surveillance [45]. The experimental protocols and validation metrics provided here offer a framework for researchers to implement and extend these approaches in their own oncology research ecosystems.

In the evolving field of oncology research, real-world data (RWD) derived from electronic health records (EHRs) has become indispensable for studying disease patterns, treatment effectiveness, and patient outcomes. However, the fragmented health information technology landscape and varying data curation methods present significant challenges for ensuring data quality [2]. The determination of fitness for use in research and regulatory decision-making hinges on systematically evaluating data across two primary dimensions: relevance—whether data can adequately address the research question—and reliability—the accuracy and consistency of the data elements themselves [31]. This guide compares how leading frameworks and data providers implement quality checks across these dimensions, providing researchers with methodologies to critically evaluate oncology RWD sources.

Core Dimensions of Data Quality

Foundational Principles

Robust data quality assessment in oncology RWD rests on two pillars established by regulatory agencies including the US Food and Drug Administration (FDA) and the European Medicines Agency (EMA) [31].

  • Relevance: The extent to which a data set contains the necessary variables (exposures, outcomes, covariates) and a sufficient number of representative patients within the appropriate time period to address a specific research question [31]. This dimension encompasses subdimensions of availability, sufficiency, and representativeness.

  • Reliability: The degree to which data accurately represent the intended clinical concepts, encompassing subdimensions of accuracy, completeness, provenance, and timeliness [31]. Reliability ensures data are trustworthy for evidence generation.

Relationship Between Quality Dimensions

The following diagram illustrates how these core dimensions and their subdimensions interact within a robust data quality framework:

[Diagram: Data Quality Framework → Relevance (availability, sufficiency, representativeness) and Reliability (accuracy, completeness, provenance, timeliness)]

Framework Comparison and Performance Evaluation

Different methodological approaches have been developed to implement these quality dimensions in practice. The following table compares two prominent approaches applied to oncology use cases.

Table 1: Comparative Performance of Data Quality Frameworks in Oncology

Framework Developer/Provider Primary Approach Use Case Key Performance Findings
UReQA Merck & Co. researchers [46] [47] Use case-specific assessment linking data quality and relevance Real-world time to treatment discontinuation (rwTTD) Data Set A: 24.96% (1,200/4,808) of patients received target therapy; Data Set B: 5.92% (237/4,003) received target therapy, demonstrating superior relevance of Data Set A [46]
Multi-Dimensional Quality Processes Flatiron Health [31] [48] Systematic processes across data lifecycle aligned to regulatory frameworks Broad oncology RWD applications Accuracy addressed via validation approaches (external/internal reference standards, indirect benchmarking); provenance via auditable metadata; timeliness via refresh frequency optimization [31]
Automated EHR Data Reuse Academic Medical Center Netherlands [49] Validation of automated data extraction against manual collection Head and neck cancer quality dashboard High agreement (up to 99.0%) for most variables; one variable showed only 20.0% agreement; most quality indicators showed <3.5% discrepancy rates [49]

Experimental Outcomes in Practice

The implementation of these frameworks reveals substantial variability in data quality. In the UReQA framework evaluation for rwTTD assessment, the two oncology data sets differed significantly in the terminology used for systemic anticancer therapy (SACT) drugs, line of therapy (LOT) format, and target SACT LOT distribution over time [46]. Data Set B exhibited less complete SACT records, longer lags in incorporating the latest data, and incomplete mortality data, rendering it unfit for estimating rwTTD [46] [47].

The Dutch validation study of automated EHR data extraction found that while most variables showed excellent agreement with manual abstraction, certain variables demonstrated poor performance, with one specific variable showing only 20.0% agreement between automated and manual collection methods [49]. This highlights the critical need for variable-level validation even within generally reliable systems.

Methodological Protocols

Use Case-Specific Assessment (UReQA)

The UReQA framework employs a structured four-step methodology for assessing fitness for purpose [47] [50]:

  • Conceptual Definition: Precisely define the research concept (e.g., rwTTD as time from initiation to discontinuation of medication, with discontinuation defined as death, new treatment initiation, or ≥120-day gap after last dose) [50].

  • Operational Mapping: Deconstruct the conceptual definition into required RWD elements commonly available from oncology EHR-derived data sets (SACT data, line of therapy, mortality status, and follow-up time) [47].

  • Quality Checks Development: Identify specific verification tasks to assess data quality at both variable and cohort levels. For rwTTD, this included 20 distinct checks across completeness and plausibility dimensions [47].

  • Framework Implementation: Apply quality checks to evaluate RWD fitness through descriptive statistics and comparative analysis between data sources [46].
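Step 1's conceptual definition translates directly into a computable rule. The sketch below derives rwTTD for a single patient under the stated discontinuation criteria (death, new treatment initiation, or a ≥120-day gap after the last dose); the function signature and the use of a data cutoff for the gap rule are assumptions for illustration.

```python
import pandas as pd

GAP_DAYS = 120

def rwttd_days(first_dose, last_dose, death=None, new_tx=None,
               data_cutoff=None):
    """rwTTD for one patient: time from initiation to the earliest of
    death, new-treatment start, or the last dose when >=120 days elapse
    after it without a further dose. Returns None if no event (censored)."""
    events = [d for d in (death, new_tx) if d is not None]
    if data_cutoff is not None and (data_cutoff - last_dose).days >= GAP_DAYS:
        events.append(last_dose)  # gap rule: discontinued at last dose
    if not events:
        return None
    return (min(events) - first_dose).days

d = pd.Timestamp
print(rwttd_days(first_dose=d("2023-01-10"), last_dose=d("2023-06-01"),
                 data_cutoff=d("2024-01-01")))
# 142: the >=120-day gap after the last dose marks discontinuation
```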

The workflow for implementing this use case-specific assessment is methodically structured as follows:

[Workflow diagram: 1. Conceptual Definition (define research concept, e.g., rwTTD) → 2. Operational Mapping (map to RWD elements: SACT, LOT, mortality) → 3. Quality Checks Development (create verification tasks; 20 checks for rwTTD) → 4. Framework Implementation (apply descriptive statistics; compare data sources)]

AI-Enhanced Data Extraction

Advanced computational methods are increasingly employed to address the challenges of unstructured clinical data:

  • Natural Language Processing (NLP) Pipelines: Comprehensive pipelines performing optical character recognition, entity extraction, assertion detection, and relationship mapping from physician notes and various medical reports [51]. One implementation processed over 1.4 million physician notes and approximately 1 million PDF reports, identifying 113.6 million entities with high accuracy (entity extraction F1 score: ~93%) [51].

  • Hybrid NLP/LLM Approaches: For contexts where standard NLP performs poorly, such as identifying complex conditions like thrombosis, a sophisticated hybrid approach uses NLP-predicted adverse events with flanking text processed through a locally cached base large language model (LLM) with prompt engineering [51]. This hybrid approach significantly improved precision from <0.5 to 0.87 for challenging extraction tasks [51].
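This hybrid pattern, high-recall candidate detection followed by LLM adjudication of the flanking text, can be sketched as follows; llm_confirms is a placeholder for the prompt-engineered call to the locally cached model, and the regex and context-window size are assumptions.

```python
import re

FLANK_CHARS = 300  # assumed context window around each candidate mention

THROMBOSIS_PATTERN = re.compile(
    r"\b(thrombus|thrombosis|embolus|embolism|DVT|PE)\b", re.IGNORECASE)

def llm_confirms(snippet: str) -> bool:
    """Placeholder for a prompt-engineered call to a locally cached LLM
    deciding whether the snippet asserts a current thrombotic event
    (as opposed to negation, history, or a risk discussion)."""
    raise NotImplementedError

def detect_thrombosis(note: str) -> list:
    """High-recall regex pass for candidates, then LLM adjudication."""
    confirmed = []
    for m in THROMBOSIS_PATTERN.finditer(note):
        lo = max(0, m.start() - FLANK_CHARS)
        snippet = note[lo:m.end() + FLANK_CHARS]
        if llm_confirms(snippet):
            confirmed.append(snippet)
    return confirmed
```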

Automated Data Validation Protocol

The Dutch methodology for validating automated EHR data reuse involved [49]:

  • Dataset Comparison: Comparative analysis between manually extracted dataset (MED) and automatically extracted dataset (AED) for 262 patients treated for head and neck cancer.

  • Linking Procedure: Records in both datasets were linked based on a unique patient identifier, with a linkage indicator identifying matching records (325 patients linked, coverage of 98.48%).

  • Statistical Analysis: Percentage agreement calculation per data element, difference in days for date variables, kappa statistics for categorical variables, and discrepancy rates for quality indicators.
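The per-variable statistics in this protocol map onto standard routines. The sketch below assumes two aligned columns of values for one categorical variable, one from the manually extracted dataset (MED) and one from the automatically extracted dataset (AED); the T-stage values are invented.

```python
import numpy as np
from sklearn.metrics import cohen_kappa_score

def agreement_stats(med, aed):
    """Percentage agreement and Cohen's kappa for one categorical
    variable, comparing manual (MED) and automated (AED) extraction."""
    med, aed = np.asarray(med), np.asarray(aed)
    return {
        "percent_agreement": float((med == aed).mean()) * 100,
        "kappa": cohen_kappa_score(med, aed),
    }

med = ["T1", "T2", "T2", "T3", "T1", "T2"]
aed = ["T1", "T2", "T2", "T3", "T2", "T2"]
print(agreement_stats(med, aed))
# ~83.3% agreement; kappa is lower because it corrects for chance
```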

The Scientist's Toolkit

Table 2: Essential Research Reagents and Computational Tools for Data Quality Assessment

Tool/Resource Type Primary Function Application Example
UReQA Framework Methodological Framework Use case-specific data quality and relevance assessment Estimating real-world time to treatment discontinuation for oncology therapies [46]
NLP/LLM Pipelines Computational Tool Extraction of structured information from unstructured clinical notes Identifying cancer staging and biomarker findings from physician notes for clinical trial matching [51]
Medimapp Analytics Business Intelligence Software Data refinement using logic rules for care pathway assignment Assigning specific appointments as start points for various stages in patient care pathways [49]
EPIC EHR Database Data Source Centralized electronic health record system with structured data capture Automated extraction of structured oncology data for quality measurement [49]
BioBERT Machine Learning Model Biomedical domain-specific language representation Entity extraction from clinical notes with ~93% F1 score [51]

Implementing robust data quality frameworks requires systematic attention to both relevance and reliability dimensions throughout the data lifecycle. The comparative analysis presented here demonstrates that fitness for purpose depends heavily on both the specific research use case and the methodological rigor applied to data curation and validation. As oncology RWD continues to evolve with advanced techniques like NLP and LLMs, the fundamental principles of relevance and reliability remain paramount for generating trustworthy real-world evidence. Researchers should select and implement frameworks that provide transparency into quality processes and validate data sources against their specific research objectives.

The shift towards data-driven oncology research, particularly using real-world data from electronic health records (EHRs), demands robust frameworks for automated validation and continuous monitoring. For researchers, scientists, and drug development professionals, ensuring the quality, accuracy, and reliability of complex oncology data is paramount for generating trustworthy evidence. This guide objectively compares emerging and established technological solutions, focusing on their performance in operationalizing data quality within the specific context of real-time oncology data validation.

The Validation Landscape: Tool Comparison and Performance Data

The following next-generation tools are redefining quality standards in oncology data extraction and validation by applying artificial intelligence to automate complex processes.

Table 1: Performance Comparison of Oncology Data Validation Tools

Tool / Technology Primary Method Reported Accuracy (F1 Score) Key Strength Primary Data Source
LLM with Prompt Engineering [4] Large Language Model extraction >0.85 (26 solid, 14 hematologic cancers) [4] High accuracy across diverse cancer types, no model training needed [4] Unstructured EHR clinical notes [4]
Traditional NLP Tools [4] Trained or fine-tuned models Information Not Provided Customizable for specific tasks Structured & Unstructured Data
Automated Rule-Based Systems [52] Predefined validation rules Information Not Provided Real-time error prevention, ensures data format/range integrity [52] Spreadsheets, Databases, Cloud Apps [52]
AI-Powered Data Validation [52] Machine learning anomaly detection Information Not Provided Identifies novel inconsistencies and patterns in large datasets [52] Large-scale, diverse datasets [52]

Experimental Protocols: Methodologies for Validation

To evaluate and compare these tools, researchers employ rigorous experimental protocols. The methodologies below are critical for assessing tool performance in real-world oncology research scenarios.

Protocol 1: Validating LLM Performance for EHR Data Extraction

This protocol outlines the process for validating the accuracy of Large Language Models in extracting structured oncology data from unstructured clinical notes, a common challenge in real-world evidence generation [4].

  • Objective: To measure the accuracy of a Large Language Model (LLM) in extracting key clinical data elements—including cancer diagnosis, TNM stage, grade, and histology—from unstructured EHR documents [4].
  • Data Set Curation:
    • A diverse dataset of oncology-specific documents, such as pathology reports and progress notes, is assembled [4].
    • A gold standard validation set is created through manual extraction and curation by clinical specialists [4].
  • Model Application:
    • The LLM is applied using prompt engineering, where natural language prompts containing relevant oncology-specific terminology guide the model without requiring training or fine-tuning [4].
  • Validation & Metrics:
    • The model's outputs (e.g., extracted stage, grade) are compared against the gold standard manual abstractions [4].
    • Performance is quantified using the F1 score (the harmonic mean of precision and recall) to provide a balanced accuracy metric [4].

Protocol 2: Automated Data Validation for Oncology Research Data

This methodology tests automated tools designed to ensure the quality and integrity of structured research datasets, which is fundamental for any subsequent analysis.

  • Objective: To assess an automated data validation tool's ability to identify and flag errors in structured oncology research datasets (e.g., from clinical trials or curated registries).
  • Test Data Preparation: A dataset with known, pre-introduced errors is created. This includes missing values, incorrect data formats (e.g., invalid date formats), values outside acceptable ranges (e.g., tumor size > 500mm), and duplicate patient records [52].
  • Rule Configuration: Predefined validation rules are established based on study protocols and clinical logic. This includes range validation (e.g., ensuring laboratory values are physiologically plausible), format validation, uniqueness validation (e.g., for patient IDs), and cross-field validation (e.g., ensuring a progression date is not before diagnosis) [52].
  • Tool Execution & Evaluation: The automated tool is run against the test dataset. Its performance is evaluated based on the percentage of pre-introduced errors correctly flagged (sensitivity) and the rate of false positives (specificity) [52].
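The evaluation loop can be made concrete: seed a clean dataset with known errors, run the rules, and score the tool against the seeded ground truth. The rules below are the four types listed in the protocol, with illustrative thresholds and column names.

```python
import pandas as pd

def validate(df: pd.DataFrame) -> pd.Series:
    """Boolean flag per row; True means at least one rule is violated."""
    bad_range = ~df["tumor_size_mm"].between(0, 500)                  # range
    bad_date = pd.to_datetime(df["dx_date"], errors="coerce").isna()  # format
    dup_id = df["patient_id"].duplicated(keep=False)                  # uniqueness
    cross = (pd.to_datetime(df["progression_date"], errors="coerce")
             < pd.to_datetime(df["dx_date"], errors="coerce"))        # cross-field
    return bad_range | bad_date | dup_id | cross

df = pd.DataFrame({
    "patient_id": ["p1", "p2", "p2", "p4"],
    "dx_date": ["2022-01-01", "13/45/2022", "2022-02-01", "2022-03-01"],
    "progression_date": ["2022-06-01", None, "2021-12-01", "2022-09-01"],
    "tumor_size_mm": [22.0, 18.0, 700.0, 35.0],
})
seeded = pd.Series([False, True, True, False])  # pre-introduced errors
flags = validate(df)
sensitivity = (flags & seeded).sum() / seeded.sum()
fp_rate = (flags & ~seeded).sum() / (~seeded).sum()
print(f"sensitivity={sensitivity:.2f}, false-positive rate={fp_rate:.2f}")
```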

Visualizing the Workflows

The following diagrams illustrate the core workflows for the two primary validation approaches discussed, providing a clear schematic of their processes.

Diagram 1: LLM Data Extraction from EHRs

[Workflow diagram: Unstructured clinical notes → data ingestion and anonymization → oncology-specific prompt engineering → LLM processing and structured data extraction → structured output (diagnosis, stage, histology) → validation against gold standard, with feedback to prompt refinement; once accuracy is confirmed, integration into the EHR/research database]

Diagram 2: Automated Rule-Based Data Validation

[Workflow diagram: Structured research dataset → validation rule engine applying format and range validation, uniqueness and consistency checks, and cross-field logic validation → anomalies flagged and reported for correction while passing data continue → clean, validated dataset → data quality audit log]

The Scientist's Toolkit: Essential Research Reagent Solutions

Beyond software, operationalizing quality requires a suite of specialized "reagent" solutions. The tools and data sources below form the foundational toolkit for modern oncology data validation and monitoring.

Table 2: Essential Toolkit for Oncology Data Validation and Monitoring

| Tool / Resource | Category | Primary Function in Validation |
|---|---|---|
| LLM with Prompt Engineering [4] | Software/AI | Extracts and structures critical oncology variables (TNM stage, histology) from unstructured EHR text for analysis. |
| AI-Powered Data Validation Tool [52] | Software/Automation | Automates error detection (duplicates, missing values, format errors) in large structured research datasets. |
| Linked Clinical & Claims Data (e.g., SEER-Medicare) [53] | Hybrid Data | Provides a rich, population-level source for validating treatment patterns and outcomes. |
| Validation Study Data (e.g., POC, MCBS) [53] | Hybrid Data | Serves as a gold standard or reference dataset to overcome limitations in primary and secondary data sources. |
| Electronic Health Records (EHR) [53] | Fixed Data | The core real-world data source requiring validation; provides point-of-care data for clinical research. |
| Registry Data (e.g., SEER) [53] | Fixed Data | Offers high-quality, curated data on cancer incidence and outcomes, useful as a validation benchmark. |

For oncology researchers, the transition to automated validation and continuous monitoring is no longer a future goal but a present necessity. The experimental data and protocols presented demonstrate that AI-driven tools, particularly LLMs for unstructured data and automated validators for structured data, are achieving the accuracy and efficiency required for robust real-world evidence generation. Success in this evolving landscape will depend on the strategic selection and integration of these tools, guided by rigorous validation protocols and a clear understanding of their respective strengths. By leveraging this toolkit, the oncology research community can enhance the integrity of their data and, consequently, the reliability of the evidence used to guide future cancer care.

In the field of oncology research, the ability to leverage real-world data (RWD) from electronic health records (EHR) hinges on the efficiency and accuracy of data extraction and harmonization processes. Automating these workflows is crucial for reducing registrar burden and enabling the timely data collection needed for robust real-world evidence (RWE). This guide compares technological approaches and solutions that support these optimized workflows, framing the analysis within the broader thesis of validating real-time oncology data for research and regulatory decision-making.

Validated Approaches for Automated Data Extraction

The core challenge in reducing registrar burden lies in creating reliable, automated systems that can harmonize complex EHR data. The following table summarizes key performance data from a validated, real-world implementation.

| System / Feature | Performance Metric | Result / Outcome |
|---|---|---|
| Datagateway Automated System (Netherlands Cancer Registry) | Diagnostic Accuracy vs. Manual Registry | 100% [54] |
| | Accuracy for New Diagnoses (vs. inclusion criteria) | 95% [54] |
| | Treatment Identification Accuracy | 100% [54] |
| | Combination Therapy Misclassification | 3% [54] |
| ON.Genuity RWD Platform (Ontada) | Data Completeness (for common variables) | >80% [55] |
| | Data Standardization | FHIR, mCODE [55] |

The Datagateway system, which supports the Netherlands Cancer Registry (NCR), demonstrates that automated data harmonization from multiple hospitals into a common model is not only feasible but highly reliable. It achieved perfect accuracy in capturing registered diagnoses and near-perfect accuracy in identifying new cases against registry inclusion criteria [54]. Furthermore, its ability to correctly identify treatments in all evaluated cases, with only a minimal rate of misclassifying complex combination therapies, shows a high level of sophistication in processing clinical information [54].

For an automated system to be effective for research, the quality and "fit-for-purpose" of its data must be rigorously assessed. The ON.Genuity RWD platform exemplifies this practice by implementing quality dimensions from the FDA-led QCARD initiative. The platform integrates EHR data from approximately 500 US community oncology clinics and focuses on key areas such as relevance (availability of standardized variables across numerous clinical domains) and reliability, where it achieves high data completeness by enhancing structured data with chart abstraction and natural language processing [55]. The conformance of its data to standards like FHIR (Fast Healthcare Interoperability Resources) and mCODE (Minimal Common Oncology Data Elements) is critical for ensuring interoperability and consistency in oncology research [55].

Protocols for Data Validation and Workflow Integration

Implementing an automated data workflow requires a structured approach to ensure the output is valid and useful for research. The following experimental protocols detail the methodologies for system validation and workflow optimization.

Protocol 1: Validating an Automated EHR Data Extraction System

This protocol is based on the validation study of the Datagateway system for the Netherlands Cancer Registry [54].

  • 1. Objective: To determine the accuracy and reliability of an automated data extraction system in harmonizing EHR data for near real-time enrichment of a cancer registry.
  • 2. Data Source Setup: The automated "Datagateway" system is configured to extract and harmonize structured EHR data from multiple, disparate hospital systems into a single, common data model.
  • 3. Validation Method:
    • Comparison 1 (Diagnoses): Extracted data on patient diagnoses is compared to the existing, manually curated gold-standard data in the Netherlands Cancer Registry (NCR).
    • Comparison 2 (New Cases): New diagnoses identified by the automated system are checked against the formal NCR inclusion criteria.
    • Comparison 3 (Treatments): The treatments identified by the system are compared to the source EHR data and registry records, with specific attention to combination therapies.
    • Comparison 4 (Lab Data): Laboratory values and toxicity indicators extracted by the system are validated against the original EHR source data.
  • 4. Outcome Measures:
    • Percentage accuracy for diagnoses, new case identification, and treatment regimens (a scoring sketch follows this protocol).
    • Misclassification rate for combination therapies.
    • Accuracy range for laboratory values and toxicity indicators.
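
A minimal sketch of how Comparison 1 might be scored, assuming extraction output and registry records are available as dictionaries keyed by patient ID (an illustrative structure, not the Datagateway implementation):

```python
# Minimal sketch of Comparison 1: field-level accuracy of automated
# extraction against manually curated registry records. The dict-of-dicts
# layout keyed by patient ID is an illustrative assumption.

def field_accuracy(extracted: dict, registry: dict, field: str) -> float:
    """Fraction of shared patients whose extracted field matches the registry."""
    shared = extracted.keys() & registry.keys()
    matches = sum(
        extracted[pid].get(field) == registry[pid].get(field) for pid in shared
    )
    return matches / len(shared)

extracted = {"p1": {"diagnosis": "C34.1"}, "p2": {"diagnosis": "C50.9"}}
registry = {"p1": {"diagnosis": "C34.1"}, "p2": {"diagnosis": "C34.1"}}
print(f"diagnosis accuracy: {field_accuracy(extracted, registry, 'diagnosis'):.0%}")
```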

Protocol 2: Implementing a QCARD Framework for RWD Quality Assessment

This protocol is derived from the methodology used to evaluate the ON.Genuity RWD platform, based on the QCARD initiative [55].

  • 1. Objective: To assess the relevance, reliability, and external validity of an EHR-derived oncology database for regulatory decision-making.
  • 2. Platform Configuration: An RWD platform (e.g., ON.Genuity) is established, integrating EHR data from a large network of community oncology clinics. This data is linked with external data sources, such as claims data and the National Death Index for mortality validation.
  • 3. Quality Dimension Assessment:
    • Relevance:
      • Catalog the availability and coverage of standardized variables (e.g., demographics, clinical characteristics, treatment patterns, outcomes).
      • Assess the ability of the data to describe the patient journey over a long-term period (e.g., 10+ years).
    • Reliability:
      • Measure completeness for key variables, using methods like chart abstraction and NLP to enhance structured data (a minimal completeness check is sketched after this protocol).
      • Ensure conformance by mapping and standardizing native EHR data to common models like FHIR and mCODE.
    • External Validity:
      • Compare key outcomes, such as overall survival data, with external benchmarks (e.g., National Death Index) using statistical tests for consistency.
  • 4. Outcome Measures:
    • Number of available standardized variables and patient coverage.
    • Percentage completeness for common variables.
    • Statistical consistency (e.g., p-values) with external validity benchmarks.
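
For the completeness measure in step 3 (Reliability), a minimal pandas sketch computes the percentage of non-missing values per key variable; the column names are illustrative:

```python
# Minimal sketch of the completeness measurement in step 3 (Reliability),
# using pandas; the column names are illustrative.
import pandas as pd

df = pd.DataFrame({
    "patient_id": ["p1", "p2", "p3", "p4"],
    "stage": ["IV", None, "III", "IV"],
    "histology": ["adenocarcinoma", "squamous", None, "adenocarcinoma"],
})

# Percentage completeness per key variable (share of non-missing records)
completeness = df[["stage", "histology"]].notna().mean() * 100
print(completeness.round(1))  # stage 75.0, histology 75.0
```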

Workflow and Data Validation Pathways

The following diagrams illustrate the logical flow of the two key experimental protocols, providing a visual guide to the processes of automated data validation and quality assessment.

Protocol 1 (Automated System Validation): Extract & Harmonize EHR Data → Validate Against Gold Standard → Calculate Accuracy Metrics. Protocol 2 (QCARD Quality Assessment): Integrate EHR & External Data → three parallel assessments of Relevance (Data Coverage), Reliability (Completeness & Conformance), and External Validity (Benchmarks) → Determine "Fit-for-Use" for Research.

Diagram 1: Experimental Pathways for Data Workflow Validation. This illustrates the two primary methodologies: validating an automated extraction system against a gold standard (Protocol 1) and implementing a multi-dimensional quality assessment framework (Protocol 2).

The Scientist's Toolkit: Essential Reagents & Solutions

Beyond software platforms, optimizing oncology data workflows relies on a foundation of specific data standards, quality frameworks, and technologies. The following table details these essential components.

| Tool / Solution | Category | Primary Function |
|---|---|---|
| FHIR (Fast Healthcare Interoperability Resources) | Data Standard | Provides a modern, web-based standard for exchanging healthcare data electronically, enabling interoperability between EHRs and research systems [56] [55]. |
| mCODE (Minimal Common Oncology Data Elements) | Data Standard | Defines a core set of structured data elements for oncology EHRs, ensuring consistent capture of critical clinical data points across different providers [55]. |
| FDA QCARD Initiative | Quality Framework | Offers a structured methodology to evaluate the relevance, reliability, and external validity of real-world data, ensuring it is fit-for-use in regulatory contexts [55]. |
| Natural Language Processing (NLP) | Enabling Technology | Used to extract and structure information from unstructured clinical notes (e.g., pathology reports), significantly improving data completeness [55]. |
| OSF (Open Science Framework) | Workflow Tool | A collaborative, open-source platform to manage, share, and document the entire research lifecycle, helping to streamline processes and maintain project transparency [57]. |
| Common Data Model | Data Architecture | A standardized data structure (e.g., the one used by the Datagateway system) that allows for the harmonization of EHR data from multiple, disparate source systems [54]. |

Key Insights for Implementation

The data and methodologies presented reveal several critical considerations for selecting and implementing workflow optimization solutions. Systems that leverage common data models and standards like FHIR and mCODE demonstrate a clear path toward scalable, high-quality data integration [54] [55]. Furthermore, the move toward automated, real-time EHR data extraction is not a distant goal but a present-day reality, proven to be both feasible and reliable for population-level oncology surveillance [54].

Ultimately, the choice of tools and protocols should be guided by the principle of "fit-for-purpose." As demonstrated by the application of the QCARD framework, a solution's effectiveness is not absolute but must be evaluated against the specific objectives of the research, whether for clinical practice insights, health technology assessment, or regulatory submissions [55].

Establishing Trust: Validation Frameworks and Comparative Evidence

The adoption of Electronic Health Records (EHRs) has created unprecedented opportunities for oncology research, yet their full potential remains constrained by significant data quality challenges. These systems, originally designed for billing and scheduling, now contain vast amounts of structured and unstructured clinical data that require rigorous validation before they can reliably support research and clinical decision-making [58]. The complex, ever-evolving nature of cancer diagnostics and therapeutics demands specialized benchmarking approaches to ensure data completeness, accuracy, and consistency across diverse healthcare settings. This guide provides a comprehensive framework for benchmarking oncology data quality by synthesizing current validation methodologies, performance metrics, and experimental protocols from recent research, enabling researchers to critically assess the reliability of real-world oncology data for scientific and clinical applications.

Core Validation Metrics for Oncology Data Quality

The quality of oncology EHR data can be quantified using standardized metrics that evaluate different dimensions of data fitness for purpose. These metrics are essential for establishing confidence in real-world evidence generated from EHR sources.

Table 1: Key Validation Metrics for Oncology EHR Data Quality

| Metric Category | Specific Metrics | Performance Range in Recent Studies | Application Context |
|---|---|---|---|
| Accuracy & Completeness | Sensitivity, Specificity, Positive Predictive Value (PPV) | Sensitivity: 50.0-95.3% (closed claims); PPV: 79.1-98.3% (infusions) [59] | Treatment data identification from claims vs. abstracted EHRs |
| Temporal Accuracy | Exact start date matching; date matching within ±7 days | 45.5-82.5% (infusion start dates); 27.6-65.9% (oral start dates within 7 days) [59] | Medication administration timing |
| Clinical Concept Extraction | Sensitivity, Precision, F1-score | GPT-4: 96.8% sensitivity, 96.8% precision; physicians: 88.8% sensitivity, 97.7% precision [60] | Comorbidity identification from clinical notes |
| Disease Progression Capture | Sentence-level accuracy; patient-level accuracy (±30 days) | 98.2% (sentence-level); 88% (patient-level) [61] | Real-world progression-free survival calculation |
| Model Performance | Area Under ROC Curve (AUROC), Accuracy | Woollie LLM: 0.97 overall AUROC (MSK data); 0.88 overall AUROC (UCSF data) [62] | Cancer progression prediction from radiology notes |

Benchmarking Methodologies: Experimental Protocols for Data Validation

Validating Treatment Data Completeness: Claims vs. EHR Benchmarking

Objective: To assess the completeness of oncology treatment data from administrative claims compared to manually abstracted EHRs (considered the gold standard) using patient-level linkages [59].

Methodology:

  • Data Source Linking: Extract abstracted EHRs from a clinico-genomic database for 6,487 stage 4 lung adenocarcinoma patients diagnosed between 2020-2023. Link claims data (both open and closed) using de-identified patient tokens.
  • Temporal Alignment: Select claims data occurring between patients' first and last abstracted treatment dates from EHRs.
  • Validation Framework: Consider abstracted EHR data as ground truth. Classify claims for the same medication between abstracted start and end dates as true positives, unmatched claims as false positives, and unmatched abstracted treatments as false negatives.
  • Metric Calculation: Compute sensitivity, specificity, and positive predictive values across 13 infusional and 3 oral medications, reported as ranges across all medications (see the sketch after this list).
  • Temporal Accuracy Assessment: Calculate the percentage of abstracted start dates with exact matches in claims data, and within 7-day tolerance for oral medications (accommodating prescription fill delays).
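
The classification logic above reduces to set operations over (patient, medication) pairs. A minimal sketch, with illustrative data structures rather than the study's actual linkage tables:

```python
# Minimal sketch of the validation framework: abstracted EHR treatments are
# ground truth; a claim for the same medication within the abstracted window
# is a true positive. The (patient, drug) pair representation is illustrative.

def claims_vs_ehr(ehr: set, claims: set):
    tp = len(ehr & claims)     # claims matching an abstracted treatment
    fp = len(claims - ehr)     # claims with no abstracted counterpart
    fn = len(ehr - claims)     # abstracted treatments missing from claims
    return tp / (tp + fn), tp / (tp + fp)   # sensitivity, PPV

sensitivity, ppv = claims_vs_ehr(
    ehr={("p1", "pembrolizumab"), ("p2", "carboplatin")},
    claims={("p1", "pembrolizumab"), ("p3", "pemetrexed")},
)
print(f"sensitivity={sensitivity:.2f}, PPV={ppv:.2f}")
```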

Key Findings: Closed claims enrollment periods showed significantly higher sensitivities (50.0-95.3%) than open claims (14.3-54.8%). Sensitivities differed substantially by route of administration, with infusions (closed: 76.5-95.3%; open: 32.4-54.8%) higher than orals (closed: 50.0-76.2%; open: 14.3-34.1%) [59].

Extracting Real-World Progression from Unstructured Text

Objective: To configure a pretrained, general-purpose healthcare natural language processing (NLP) framework to transform free-text clinical notes and radiology reports into structured progression events for computing real-world progression-free survival (rwPFS) in metastatic breast cancer [61].

Methodology:

  • Cohort Identification: Identify breast cancer patients using structured diagnosis codes (ICD-9: 174; ICD-10: C50) plus NLP-based positive confirmations of disease-related terms in clinical notes. Identify metastatic disease using structured codes (ICD-9: 197, 198; ICD-10: C78, C79).
  • Cohort Refinement: Apply inclusion criteria (N=316): female patients aged ≥18 years with hormone receptor-positive, HER-2-negative metastatic breast cancer receiving first-line palbociclib and letrozole combination therapy between 2015-2021.
  • Progression Pattern Identification: Conduct manual abstraction of 200 cases to identify progression-indicative phrases through iterative clustering of sentences from oncology and radiology notes.
  • NLP Engine Configuration: Configure a clinical NLP engine (ensemble of deep learning-based multi-BERT framework) to perform named entity recognition and predict sentiment labels for subject, temporality, and certainty of captured entities.
  • Validation: Evaluate performance at sentence level and patient level (±30 days) against manually curated ground truth datasets.
  • rwPFS Calculation: Define NLP-captured progression or change in therapy line as outcome events; death, loss to follow-up, and end of study period as censoring events (a Kaplan-Meier sketch follows this list).
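
A minimal sketch of the rwPFS estimate, assuming the lifelines package and an illustrative cohort table with time-to-event and event-indicator columns:

```python
# Minimal sketch of the rwPFS estimate, assuming the lifelines package.
# event=1 marks NLP-captured progression or a new therapy line; event=0
# marks censoring (death, loss to follow-up, or study end, per the
# definition above). Values are illustrative.
import pandas as pd
from lifelines import KaplanMeierFitter

cohort = pd.DataFrame({
    "months_from_tx_start": [20.0, 8.5, 25.0, 14.0],
    "event": [1, 1, 0, 0],
})

kmf = KaplanMeierFitter()
kmf.fit(cohort["months_from_tx_start"], event_observed=cohort["event"])
print(f"median rwPFS: {kmf.median_survival_time_} months")
```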

Key Findings: The configured NLP engine achieved 98.2% sentence-level progression capture accuracy and 88% patient-level accuracy within ±30 days. Median rwPFS was 20 months (95% CI 18-25), closely aligning with manual curation (25 months, 95% CI 15-35) [61].

Specialized Named Entity Recognition for Chinese Cancer EHRs

Objective: To develop and validate a specialized named entity recognition (NER) model for extracting medical entities from Chinese breast cancer EHRs to overcome limitations of general models designed for English medical records [63].

Methodology:

  • Model Architecture: Construct ChCancerBERT by incorporating a Chinese cancer corpus for pretraining based on the MC-BERT foundation. Implement a multi-model, multi-level integrated NER approach combining dilated-gated convolutional neural networks, bidirectional long short-term memory (BiLSTM), multihead attention mechanism, and conditional random fields.
  • Data Collection: Collect desensitized inpatient EHRs related to breast cancer from a leading hospital in Beijing, with manual annotations for model training and validation.
  • Entity Categorization: Focus on recognizing medical entities related to symptoms, signs, tests, treatments, and time in Chinese breast cancer EHRs.
  • Performance Evaluation: Compare the proposed model against baseline and other models using precision, recall, and F1-score metrics. Validate the model on the CCKS2019 dataset to benchmark against existing approaches.

Key Findings: The proposed model achieved an F1-score of 86.93% (precision: 87.24%, recall: 86.61%), surpassing baseline models and demonstrating exceptional performance on the CCKS2019 dataset with an F1-score of 87.26% [63].

Comparative Performance of Data Extraction Approaches

Large Language Models for Clinical Data Extraction

Objective: To evaluate the accuracy, efficiency, and cost-effectiveness of large language models (LLMs) in extracting and structuring information from free-text clinical reports, specifically for identifying and classifying patient comorbidities within oncology EHRs, compared to specialized human evaluators [60].

Methodology:

  • Model Selection: Test gpt-3.5-turbo-1106 and gpt-4-1106-preview models using the OpenAI API.
  • Prompt Engineering: Implement an iterative process to develop optimal prompts, framing the task as a JSON dictionary completion problem with "YES/NO" values for specific comorbidities and risk factors (illustrated in the sketch after this list).
  • Data Set: Analyze 250 personal history reports in Spanish, processed in batches of 50 by 5 radiation oncology specialists for comparison.
  • Evaluation Metrics: Calculate sensitivity, specificity, precision, accuracy, F-value, κ index, and apply McNemar test for statistical significance.
  • Consistency Assessment: Repeat analyses 10 times to measure result stability.
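
The JSON-completion framing can be sketched as follows; the prompt wording and comorbidity list are illustrative assumptions rather than the published prompt, and the call uses the OpenAI Python client (v1+):

```python
# Minimal sketch of the JSON-completion framing, using the OpenAI Python
# client (v1+). The model name matches the study; the prompt wording and
# comorbidity list are illustrative assumptions, not the published prompt.
import json
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

PROMPT = """Read the clinical history below and complete this JSON dictionary
with "YES" or "NO" for each entry. Return only the JSON.
{{"diabetes": "", "hypertension": "", "copd": "", "active_smoker": ""}}

Clinical history:
{report}"""

def extract_comorbidities(report_text: str) -> dict:
    response = client.chat.completions.create(
        model="gpt-4-1106-preview",
        messages=[{"role": "user", "content": PROMPT.format(report=report_text)}],
        temperature=0,
    )
    return json.loads(response.choices[0].message.content)
```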

Key Findings: GPT-4 demonstrated clear superiority in several key metrics (McNemar test, P<0.001), achieving a sensitivity of 96.8% compared to 88.2% for GPT-3.5 and 88.8% for physicians. Physicians marginally outperformed GPT-4 in precision (97.7% vs. 96.8%). GPT-4 showed greater consistency, replicating the exact same results in 76% of reports across 10 repeated analyses, compared to 59% for GPT-3.5 [60].

Table 2: Performance Comparison of LLMs in Oncology Data Extraction

| Model/Evaluator | Sensitivity | Precision | Accuracy | Result Consistency | Key Strengths |
|---|---|---|---|---|---|
| GPT-4 | 96.8% [60] | 96.8% [60] | Not specified | 76% [60] | Superior sensitivity, high consistency |
| GPT-3.5 | 88.2% [60] | Not specified | Not specified | 59% [60] | Faster, more economical |
| Physicians | 88.8% [60] | 97.7% [60] | Not specified | Not specified | Slightly higher precision |
| Woollie (oncology-specific LLM) | Not specified | Not specified | PubMedQA: 0.81 [62] | Not specified | Domain-specific optimization |
| ChCancerBERT (Chinese NER) | 86.61% (recall) [63] | 87.24% [63] | Not specified | Not specified | Language and domain specialization |

Oncology-Specific LLMs for Cancer Progression Prediction

Objective: To develop and validate Woollie, an open-source, oncology-specific large language model trained on real-world data from Memorial Sloan Kettering Cancer Center for predicting cancer progression from radiology reports [62].

Methodology:

  • Model Development: Train a family of Woollie models (7B to 65B parameters) using a stacked alignment process based on the open-source Llama models from META, progressively building oncology-specific knowledge.
  • Data Set: Utilize 39,319 radiology impression notes from 4,002 patients across lung, breast, prostate, pancreatic, and colorectal cancers from MSK.
  • External Validation: Validate performance using an independent dataset of 600 radiology impressions from 600 unique patients from UCSF, focusing on lung, breast, and prostate cancers.
  • Performance Assessment: Evaluate using area under the receiver operating characteristic curve (AUROC) for cancer progression prediction across different cancer types (computed as in the sketch after this list).
  • Comparative Analysis: Benchmark against standard LLMs (LLaMA) and general medical models on standard medical benchmarks (PubMedQA, MedMCQA, USMLE).
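
The AUROC assessment reduces to comparing model scores with chart-confirmed progression labels. A minimal scikit-learn sketch with illustrative values:

```python
# Minimal sketch of the AUROC assessment with scikit-learn: y_true are
# chart-confirmed progression labels, y_score the model's probabilities
# (both illustrative values).
from sklearn.metrics import roc_auc_score

y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_score = [0.92, 0.10, 0.85, 0.40, 0.30, 0.45, 0.88, 0.05]
print(f"AUROC = {roc_auc_score(y_true, y_score):.2f}")
```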

Key Findings: Woollie achieved an overall AUROC of 0.97 for cancer progression prediction on MSK data, including 0.98 AUROC for pancreatic cancer. On UCSF data, it achieved an overall AUROC of 0.88, excelling in lung cancer detection with an AUROC of 0.95. The 65B parameter model significantly outperformed Llama 65B on medical benchmarks (PubMedQA: 0.81 vs. 0.70; MedMCQA: 0.50 vs. 0.37; USMLE: 0.52 vs. 0.42) [62].

Visualization of Experimental Workflows

EHR Data Sources split into Structured Data, Unstructured Text, and Claims Data. Structured and unstructured data undergo Data Preprocessing; claims data undergo Patient Token Linking. Preprocessing, linking, and Manual Annotation all feed the Ground Truth Definition, which anchors the Validation Framework → Metric Calculation → Validation Results.

Validation Workflow for Oncology EHR Data

Clinical Notes & Radiology Reports → Progression Pattern Identification → Manual Abstraction & Review → Rule Set Development → NLP Engine Configuration → Multi-BERT Ensemble performing Named Entity Recognition and Sentiment Analysis → Structured Progression Events. The progression events feed both the rwPFS Calculation and Performance Validation at the sentence level and at the patient level (±30 days).

NLP Workflow for Progression Extraction

Table 3: Essential Research Reagents and Computational Tools for Oncology EHR Validation

| Tool/Resource | Type | Function | Example Applications |
|---|---|---|---|
| Precision-DM (Precision Oncology Core Data Model) | Data Standardization Framework | Facilitates complete clinical-genomic data standardization for oncology research [10] | Structuring EHR data for molecular tumor boards, immunotherapy adverse event tracking |
| ChCancerBERT | Domain-Specific Language Model | Named entity recognition for Chinese cancer EHRs [63] | Extracting medical entities from Chinese breast cancer records |
| Woollie | Oncology-Specific LLM | Predicts cancer progression from radiology notes [62] | Analyzing radiology impressions across multiple cancer types |
| Clinical NLP Engine | Natural Language Processing Framework | Transforms free-text clinical notes into structured progression events [61] | Calculating real-world progression-free survival in metastatic breast cancer |
| Nference nSights Platform | Analytics Platform | Hosts deidentified EHR data for large-scale oncology studies [61] | Multicenter retrospective observational studies |
| PathBench | Benchmarking Framework | Evaluates pathology foundation models across cancer types [64] | Standardized comparison of computational pathology models |
| Google's Healthcare Natural Language API | General-Purpose NLP Tool | Extracts clinical concepts from unstructured text [61] | Clinical concept recognition, entity linking in diverse medical texts |

Real-world evidence (RWE) has rapidly become a fixture in regulatory and Health Technology Assessment (HTA) discussions, often heralded as the missing link between controlled trials and clinical reality for oncology treatments [65]. The promise is compelling: to understand how treatments perform in the complexity of routine care, particularly for cancer patients whose characteristics often differ significantly from those in highly controlled clinical trials. However, this promise rests on potentially fragile foundations, with data quality, methodological rigor, and transparency remaining persistent challenges that can undermine the credibility of RWE if not properly addressed [65].

The year 2025 marks a significant turning point in Europe with the implementation of the new EU Regulation on Health Technology Assessment (HTAR), which became applicable on January 12, 2025 [66]. This regulation establishes a formal framework for cooperation between the European Medicines Agency (EMA) and HTA bodies through mechanisms such as Joint Clinical Assessments (JCAs) and parallel Joint Scientific Consultations (JSCs) [66]. Despite this coordinated framework, significant divergences remain in how these bodies accept and apply RWE, reflecting an ongoing struggle to move from rhetoric to reliable practice in real-world evidence generation [65].

Quantitative Comparison of RWE Acceptance

The acceptance and application of RWE by the EMA and European HTA bodies varies significantly across multiple dimensions. The following tables summarize these key differences based on current guidelines and practices.

Table 1: Comparative Overview of RWE Acceptance and Application

| Aspect | EMA (Regulatory Focus) | European HTA Bodies (Payer/Reimbursement Focus) |
|---|---|---|
| Primary Role of RWE | Supporting drug approvals and safety monitoring [67] | Informing comparative effectiveness and cost-effectiveness [68] |
| Key Applications | Post-market surveillance; supporting drug approvals; label expansions [67] | Comparative effectiveness research (CER); subgroup analysis; long-term outcomes; economic modeling [68] |
| Evidence Standards | Emphasis on data provenance and methodology [65] | Focus on relevance to clinical practice and comparators [65] |
| Implementation Timing | Already utilized in various regulatory contexts [67] | Phased approach under HTAR (2025-2030) [66] |
| Major Concerns | Data quality and methodological rigor [65] | Relevance to PICOs and comparative effectiveness [69] |

Table 2: Acceptance Levels for Different RWE Applications in 2025

| RWE Application | EMA Acceptance Level | HTA Bodies Acceptance Level | Key Factors Influencing Divergence |
|---|---|---|---|
| Complementary Evidence for RCTs | High | Moderate | HTA skepticism about generalizability from controlled settings [68] |
| Single-Arm Trial External Control | High | Variable | Differences in acceptability of historical controls [68] |
| Post-Authorization Safety Studies | High | Low | Beyond HTA mandate of relative effectiveness [66] |
| Long-Term Effectiveness | Moderate | High | Critical for HTA economic modeling [68] |
| Subgroup Effectiveness | Moderate | High | Essential for HTA pricing and reimbursement decisions [68] |

Methodological Approaches for RWE Generation

Experimental Protocols for Oncology EHR Data Characterization

Generating reliable RWE from electronic health records (EHR) requires rigorous methodological approaches to ensure data quality and comparability. The following protocol outlines a standardized process for characterizing oncology EHR-derived real-world data across multiple countries, based on established research practices [70].

Protocol Title: Characterization of Oncology EHR-Derived Real-World Data Across Multiple Healthcare Systems

Objective: To create transnational oncology RWD datasets with sufficient clinical depth, consistency, and country-level comparability to support regulatory and HTA decision-making.

Materials and Research Reagent Solutions:

Table 3: Essential Research Components for EHR-Derived RWE Generation

| Component | Function | Implementation Example |
|---|---|---|
| EHR Source Data | Raw clinical data capture from routine practice | Structured and unstructured data from oncology EHR systems [70] |
| Common Data Model | Enables data harmonization and pooling across sources | Standardized ontology mapping for multinational data integration [70] |
| ISPOR Suitability Checklist | Framework for ensuring data quality | Assessment tool for data completeness, accuracy, and relevance [70] |
| Trusted Research Environment | Secure data analysis platform | De-identified and anonymized data analysis environment with governance [70] |
| Clinical Validation Framework | Ensures clinical relevance of structured data | Oncologist-led curation and validation of key clinical variables [70] |

Methodology:

  • Multi-Disciplinary Team Assembly: Establish diverse teams comprising researchers, software engineers, and medical oncologists local to each geography to ensure contextual understanding of clinical practices and healthcare systems [70].

  • Data Provenance and Governance Establishment: Define clear data lineage, ownership, and usage rights frameworks compliant with local regulations (e.g., GDPR in EU countries) [70].

  • Common Data Model Implementation: Develop and implement harmonized data models that enable pooling of data across different countries while maintaining semantic consistency for key oncology concepts such as line of therapy, response, and progression [70].

  • De-identification and Anonymization: Apply rigorous de-identification and anonymization processes to protect patient privacy while preserving data utility for research purposes [70].

  • Analytic Trusted Research Environment Setup: Establish secure computational environments where researchers can access and analyze the characterized data while maintaining privacy and security protocols [70].

  • Quality Validation Against Standards: Adhere to established quality checklists such as the ISPOR EHR-Derived Data Suitability Checklist to ensure the trustworthiness, reliability, and relevance of the final datasets [70].

Workflow for RWE Generation and Submission

The process of generating RWE for both regulatory and HTA assessment follows a structured pathway that aligns with drug development and marketing authorization timelines. The diagram below illustrates this workflow, highlighting points of divergence between EMA and HTA requirements.

Real-World Data Sources (EHR, Claims, Registries) → Evidence Generation (Study Design, Analysis), which branches into two parallel submissions: EMA Submission (Day 170) → EMA Review (Day 180-210), and HTA JCA Submission (Day 170) → JCA Review (PICO Alignment) → National HTA Decisions (Pricing & Reimbursement).

RWE Generation and Assessment Workflow

Joint Clinical Assessment Process Under EU HTA Regulation

The new EU HTA Regulation establishes a formal process for Joint Clinical Assessments (JCAs) that significantly impacts how RWE is evaluated for oncology products. The following diagram outlines this process and the critical role of RWE within it.

PICO Discussion (Population, Intervention, Comparator, Outcomes) → JCA Dossier Preparation (Including RWE Components) → JCA Submission (Day 170 of the EMA Process) → Joint Clinical Assessment (Relative Effectiveness) → National HTA Procedures (Informed by the JCA).

JCA Process Under EU HTA Regulation

Analysis of Divergent Requirements and Standards

Temporal Misalignment in Evidence Requirements

A fundamental challenge in aligning EMA and HTA perspectives on RWE stems from the different timelines for evidence submission. Under the new HTAR framework, manufacturers must submit the JCA dossier at approximately Day 170 of the centralized marketing authorization process [69] [71]. This occurs before the EMA releases its list of outstanding issues (at Day 180) and before the Committee for Medicinal Products for Human Use (CHMP) opinion (at Day 210) [71]. This timing creates significant challenges for manufacturers who must prepare HTA submissions without knowing the final approved population or indication, potentially requiring restarted assessments if the label changes substantially [71].

Divergent Perspectives on Comparators and PICOs

The Population, Intervention, Comparator, Outcomes (PICO) framework represents another area of significant divergence between regulatory and HTA needs. While the EMA focuses primarily on establishing efficacy and safety against placebo or standard care, HTA bodies require comparisons against specific technologies already available in their healthcare systems [69]. This divergence is particularly pronounced in oncology, where standards of care can vary significantly across European countries, leading to potentially numerous PICOs for a single product [71]. The JCA process aims to consolidate PICOs "as far as possible," but substantial challenges remain in determining a clear strategy and generating appropriate evidence, especially given the "apparent lack of manufacturer involvement" in this part of the process [71].

Methodological Standards and Evidence Hierarchy

While both entities recognize the potential value of RWE, their methodological standards and evidence hierarchies continue to differ. The EMA has developed specific guidelines such as the "Guideline on Registry-Based Studies" and participates in initiatives like the EMA-FDA collaboration on RWE [68]. HTA bodies, meanwhile, maintain greater skepticism about the validity of RWE generated outside controlled settings, particularly concerning unmeasured confounding and selection bias [65]. This skepticism manifests in requirements for more robust sensitivity analyses and stricter validation of endpoints, particularly when RWE is used to support economic modeling or coverage decisions [68] [71].

Case Studies and Practical Applications

Oncology EHR Data Characterization Across Borders

Recent research demonstrates practical approaches to addressing transnational RWE challenges in oncology. A 2025 publication detailed the characterization of EHR-derived oncology datasets across the UK, Germany, and Japan, developing common data models that enable harmonized pooling with US data while working within local regulatory requirements [70]. This work highlights both the feasibility and challenges of generating RWE suitable for both regulatory and HTA purposes across diverse healthcare systems. The methodology employed—including local clinical expertise, standardized data models, and rigorous governance frameworks—provides a template for overcoming the divergent requirements of different assessment bodies [70].

RWE in Cost-Effectiveness Modeling for HTA Submissions

Another illustrative case involves using RWD to inform cost-effectiveness models for HTA submissions. In one example, a pharmaceutical company utilized patient-level records from real-world settings to quantify key areas of differentiation (improved lung function, reduced healthcare resource utilization) for a new inhalation solution compared with standard of care in chronic pulmonary infection [71]. These RWE-derived insights informed a cost-effectiveness model that demonstrated the benefit of reduced drug, hospitalization, and transplantation costs, ultimately supporting a national HTA submission [71]. This case demonstrates the critical role of RWE in bridging between clinical efficacy demonstrated in trials and the real-world economic outcomes required by HTA bodies.

Future Directions and Strategic Implications

Market Growth and Evolution

The RWE oncology market is experiencing significant growth, expected to reach $893 million in 2025 and projected to grow at a compound annual growth rate (CAGR) of 14.7% to $3.51 billion by 2035 [72]. This growth reflects increasing demand for RWE across drug development, market access, and post-market surveillance applications, with the market access and reimbursement segment holding the largest share in 2025 [72]. This expansion is driven by multiple factors including growing regulatory acceptance, increasing focus on value-based healthcare, rising cancer incidence, and advancements in data analytics and AI technologies [72].

Strategic Imperatives for Drug Developers

For pharmaceutical companies and drug developers, navigating the divergent acceptance of RWE requires strategic shifts in evidence generation planning:

  • Earlier PICO Planning: Companies must anticipate PICO requirements significantly earlier in drug development, ideally during Phase II trials, to ensure that RWE generation addresses relevant comparators and outcomes for HTA bodies [69].

  • Cross-Functional Governance: Successful navigation of both regulatory and HTA requirements demands integrated cross-functional teams encompassing clinical development, HEOR, regulatory affairs, and market access functions, established early in the development process [69].

  • Investment in Data Quality and Transparency: Addressing concerns about RWE validity requires robust data governance, transparent methodology, and adherence to emerging standards such as the ISPOR EHR-Derived Data Suitability Checklist [70].

  • Engagement in Parallel Consultations: Proactive engagement in parallel joint scientific consultations (JSCs) with both EMA and HTA bodies can help align evidence generation plans with both regulatory and HTA requirements from the outset [66].

The divergent acceptance of RWE by the EMA and European HTA bodies represents both a challenge and an opportunity for oncology drug development. While the new EU HTA Regulation creates a more structured framework for cooperation, significant differences remain in evidence standards, temporal requirements, and methodological expectations. Navigating these divergences requires sophisticated evidence generation strategies that anticipate both regulatory and HTA needs from early development stages. As the field evolves, successful organizations will be those that treat RWE not as a supplementary activity but as a core competency integrated throughout the drug development lifecycle. The ongoing standardization of methodologies and growing experience with successful RWE submissions offer promise for greater alignment in the future, potentially accelerating patient access to innovative oncology treatments while ensuring appropriate assessment of their real-world value.

The Role of US EHR-Derived Data in Informing International Health Technology Assessments

The growing complexity of new cancer therapies, coupled with limitations inherent in traditional clinical trials, has prompted Health Technology Assessment (HTA) bodies worldwide to increasingly seek out real-world evidence (RWE) to inform reimbursement and access decisions [73]. Electronic health record (EHR)-derived real-world data (RWD) from the United States has emerged as a particularly valuable resource in this context. The earlier approval and market entry of most oncology drugs in the U.S. creates a unique opportunity to generate timely evidence on how these therapies perform in routine clinical practice, potentially bridging critical evidence gaps for HTA agencies in other countries [73] [74].

This guide objectively examines the role of U.S. EHR-derived data in international HTA, focusing on its application, the frameworks ensuring its quality, and its practical use in addressing specific evidence needs. The content is framed within the broader thesis of validating real-world oncology data from EHRs for rigorous research purposes.

Quantitative Advantages of US EHR-Derived Data for HTA

The primary advantage of US data lies in the significant head start in data accumulation prior to decisions in other markets. A retrospective cohort study analyzing 60 NICE technology appraisals (TAs) between 2014 and 2019 quantified this lead time and available data.

Table 1: Data Availability from US EHRs at Key NICE Milestones

| NICE Milestone | Median Time from FDA Approval (Months) | Average Number of Patients Available per TA | Average Median Follow-up per TA (Months) |
|---|---|---|---|
| Company Submission to NICE | 6.4 | 147 | 4.5 |
| Final Appraisal Determination | 14.4 | Not specified | Not specified |
| Final Guidance Publication | 18.5 | 269 | Not specified |

Source: Adapted from [73]

The same study found that at the 18.5-month mark post-FDA approval, US EHR-derived databases contained a median of 75.3 person-years of time-at-risk data for analysis [73]. This substantial volume of data, available before or around the time of HTA decisions in other countries, can be pivotal for reducing decision-making uncertainties.

Foundational Frameworks for Data Quality and Relevance

For US-derived RWD to be credible for HTA, its quality must be systematically demonstrated. Leading regulatory and HTA bodies have established frameworks focusing on two primary quality dimensions: relevance and reliability [31].

Key Quality Dimensions

Table 2: Core Data Quality Dimensions for EHR-Derived RWD

| Quality Dimension | Subdimensions | Definition and Application in HTA Context |
|---|---|---|
| Relevance | Availability, Sufficiency, Representativeness | Determines if the data provides sufficient information on exposures, outcomes, and covariates to produce robust and generalizable results for the specific HTA question [31]. |
| Reliability | Accuracy, Completeness, Provenance, Timeliness | Assesses how closely the data reflects the intended clinical concepts and the trustworthiness of the data, encompassing data accrual and quality control processes [31]. |

These dimensions are operationalized through specific data curation processes. For example, accuracy is addressed through validation against external or internal reference standards and verification checks for conformance and plausibility. Completeness is assessed against expected source documentation, while provenance is maintained by recording all data transformations and management procedures [31].

The SUITABILITY Checklist for HTA

The ISPOR Good Practices Report provides a use-case-specific framework for HTA bodies to evaluate EHR-derived RWD. The SUITABILITY Checklist focuses on two main elements [75]:

  • Data Delineation: Provides a complete understanding of the data and an assessment of its trustworthiness.
  • Data Fitness-for-Purpose: Examines the accuracy and suitability of the data to answer the particular HTA question at hand.

This framework encourages HTA agencies to move beyond a one-size-fits-all approach and actively assess whether the data is fit for its intended purpose, such as characterizing treatment pathways or modeling long-term survival [75] [76].

Experimental Protocols for Key HTA Use Cases

The application of US EHR-derived data in HTA is demonstrated through specific, method-driven use cases. The following protocols detail the methodologies for generating evidence.

Use Case 1: Characterizing Early Drug Utilization and Outcomes

Objective: To describe the accumulation of US RWD for new cancer therapies between FDA approval and HTA milestones in other countries, quantifying available patient numbers and follow-up time [73].

Methodology:

  • Data Source: Nationwide, longitudinal US EHR-derived database (e.g., Flatiron Health database), comprising data from ~280 cancer clinics (~800 sites of care) [73].
  • Cohort Identification: Patients with one of 11 specified advanced cancer types who received a cancer therapy of interest.
  • Time Intervals: Data accumulation is measured from the FDA approval date to relevant HTA milestones (e.g., submission to HTA agency, final guidance publication).
  • Key Metrics:
    • Patient Count: The number of unique patients who initiated the therapy after FDA approval and before the HTA milestone.
    • Follow-up Time (Time-at-risk): Calculated in person-years for each patient from the therapy start date to death, last EHR activity, or the HTA milestone date, whichever occurs first [73] (see the sketch below).
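
A minimal sketch of the time-at-risk computation, with hypothetical dates; the censoring hierarchy follows the definition above:

```python
# Minimal sketch of the time-at-risk metric: person-years accrue from
# therapy start to the earliest of death, last EHR activity, or the HTA
# milestone date. Dates are illustrative.
from datetime import date

def person_years(start: date, death: date | None,
                 last_activity: date, milestone: date) -> float:
    end = min(d for d in (death, last_activity, milestone) if d is not None)
    return max((end - start).days, 0) / 365.25

cohort = [
    (date(2021, 3, 1), None, date(2022, 1, 15), date(2022, 6, 1)),
    (date(2021, 7, 1), date(2021, 12, 1), date(2021, 12, 1), date(2022, 6, 1)),
]
total = sum(person_years(*patient) for patient in cohort)
print(f"total time-at-risk: {total:.2f} person-years")
```
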
Use Case 2: Estimating Real-World Time to Treatment Discontinuation (rwTTD)

Objective: To implement a fit-for-purpose data quality assessment for estimating rwTTD, a pragmatic end point for continuously administered therapies, using the UReQA framework [77].

Methodology:

  • Conceptual Definition: rwTTD is defined as the time from initiation to discontinuation of a medication. Discontinuation is triggered by death, initiation of a new treatment, or a gap of ≥120 days after the last recorded dose [77] (the gap rule is sketched in code after this list).
  • Operational Mapping: The rwTTD definition is deconstructed and mapped to four data elements required from the EHR:
    • Systemic Anticancer Therapy (SACT): Drug name, administration date, and order date.
    • Line of Therapy (LOT): LOT name, number, start, and end dates to identify subsequent treatments.
    • Mortality Status: Vital status or date of death.
    • Follow-up Time: Date of last follow-up in the EHR.
  • Data Quality Checks: A series of checks (20 in the referenced study) is performed to verify the completeness and plausibility of the required data elements, including checks on the distribution of gaps between drug orders, completeness of mortality data, and lag times in data updates [77].
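
The ≥120-day gap rule at the heart of the rwTTD definition can be sketched as follows; the data structures and helper name are illustrative, not the UReQA implementation:

```python
# Minimal sketch of the discontinuation logic: the end date is the earliest
# of death, a new treatment start, or the last dose preceding a >=120-day
# gap. Names and structures are illustrative.
from datetime import date

GAP_DAYS = 120

def discontinuation_date(doses: list[date], death: date | None,
                         next_line_start: date | None,
                         last_followup: date) -> date:
    candidates = [d for d in (death, next_line_start) if d is not None]
    # Gap rule: discontinued as of the last dose before a >=120-day gap
    for d1, d2 in zip(doses, doses[1:]):
        if (d2 - d1).days >= GAP_DAYS:
            candidates.append(d1)
            break
    else:
        if (last_followup - doses[-1]).days >= GAP_DAYS:
            candidates.append(doses[-1])
    return min(candidates) if candidates else last_followup

doses = [date(2022, 1, 5), date(2022, 2, 2), date(2022, 7, 20)]
end = discontinuation_date(doses, death=None, next_line_start=None,
                           last_followup=date(2022, 12, 31))
print(end)  # 2022-02-02: the 168-day gap to the next dose triggers the rule
```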

The logical flow of this fit-for-purpose assessment is outlined in the diagram below.

Define rwTTD Use Case → Map to EHR Data Elements → Perform Data Quality Checks (drawing on SACT data, line of therapy, mortality status, and follow-up time) → Assess Fitness-for-Purpose.

Use Case 3: Providing an External Control Arm

Objective: To generate an external control cohort from US RWD for single-arm trials, supporting the contextualization of intervention effectiveness for HTA [78].

Methodology:

  • Cohort Definition: Patients from the US EHR-derived database are selected to match the key eligibility criteria of the single-arm trial (e.g., cancer type, stage, biomarker status, prior lines of therapy).
  • Outcome Alignment: The outcomes of interest (e.g., overall survival, progression-free survival) are defined and curated from the EHR data to align as closely as possible with the trial end points.
  • Statistical Analysis: Appropriate statistical methods, such as propensity score matching or weighting, are applied to adjust for differences in baseline characteristics between the trial cohort and the real-world external control arm, helping to address potential selection bias [78] (a matching sketch follows this list).
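
A minimal sketch of the adjustment step, fitting a propensity model and greedily matching each trial patient to the nearest real-world control; the data and covariates are illustrative:

```python
# Minimal sketch of propensity-score matching: fit a model on baseline
# covariates, then greedily match each trial patient 1:1 to the nearest
# real-world control on the score. Data and covariates are illustrative.
import pandas as pd
from sklearn.linear_model import LogisticRegression

df = pd.DataFrame({
    "age": [64, 58, 71, 66, 60, 69],
    "ecog": [0, 1, 1, 0, 1, 2],
    "in_trial": [1, 1, 1, 0, 0, 0],   # 1 = single-arm trial, 0 = RWD control
})

covars = ["age", "ecog"]
ps_model = LogisticRegression().fit(df[covars], df["in_trial"])
df["ps"] = ps_model.predict_proba(df[covars])[:, 1]

trial = df[df["in_trial"] == 1]
controls = df[df["in_trial"] == 0]

matches = {}
for idx, row in trial.iterrows():
    nearest = (controls["ps"] - row["ps"]).abs().idxmin()
    matches[idx] = nearest
    controls = controls.drop(nearest)   # 1:1 matching without replacement
print(matches)
```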

The Scientist's Toolkit: Essential Reagents for RWE Generation

The rigorous application of US EHR-derived data in HTA relies on a suite of methodological "reagents" and considerations.

Table 3: Key Reagents for HTA-Focused RWE Studies

| Research Reagent | Function and Importance in HTA Context |
|---|---|
| Curated EHR-Derived Database | Provides the foundational data, requiring depth (clinical variables) and breadth (patient numbers/representativeness) to address HTA questions [73] [31]. |
| Data Quality Framework (e.g., SUITABILITY) | A structured tool to ensure and communicate data trustworthiness and fitness-for-purpose to HTA reviewers [75]. |
| Terminology Harmonization | Processes that map local EHR codes to standard terminologies (e.g., RxNorm for drugs) to ensure consistent variable definition across the network [31]. |
| Line of Therapy (LOT) Algorithm | A rule-based algorithm to reconstruct treatment sequences from raw EHR data, critical for understanding treatment patterns and defining endpoints like rwTTD [77]. |
| Validated Mortality Data | A composite mortality variable, often combining data from multiple sources (e.g., EHR, Social Security Death Index), which is crucial for robust overall survival analysis [73]. |
| Methodology for Addressing Selection Bias | Statistical techniques (e.g., propensity scores, inverse probability weighting) to minimize confounding when comparing RWD cohorts, a common concern for HTA bodies [78]. |

US EHR-derived data represents a powerful and rapidly evolving resource for informing international HTA decisions in oncology. Its value is not merely a function of early availability but is contingent on the systematic application of data quality frameworks, transparent reporting, and rigorous methodological approaches tailored to specific HTA evidence gaps. As the field advances, the integration of these data sources into HTA submissions is poised to become more standardized, playing a central role in ensuring that innovative cancer therapies reach patients globally based on a comprehensive understanding of their real-world value.

The use of real-world data (RWD) from electronic health records (EHRs) in oncology research has accelerated substantially, driven by the need for evidence on diagnostic and therapeutic interventions across diverse patient populations. Regulatory agencies increasingly recognize real-world evidence (RWE) to support regulatory decisions on drug effectiveness and safety, as evidenced by recent U.S. Food and Drug Administration (FDA) guidance documents [79]. This evolution creates an urgent need for standardized approaches to assess data quality and fitness-for-use—the degree to which a dataset is suitable for answering a specific scientific question [80]. For researchers, scientists, and drug development professionals working with real-time oncology data, understanding and applying regulatory data quality frameworks is essential for generating reliable evidence that can inform treatment paradigms and regulatory decision-making.

The fundamental challenge in utilizing EHR-derived oncology data lies in its inherent complexity. These data are captured during routine clinical practice rather than through controlled research protocols, resulting in fragmented information across systems, varied documentation practices, and significant information embedded in unstructured clinical notes [31]. This article compares predominant regulatory frameworks for assessing data quality, provides experimental protocols for validating key oncology endpoints, and presents practical toolkits for implementing fitness-for-use assessments in oncology research contexts.

Comparative Analysis of Regulatory Data Quality Frameworks

Core Quality Dimensions Across Major Frameworks

A targeted review of frameworks from major regulatory and health technology assessment agencies reveals two primary data quality dimensions: relevance and reliability [31]. These dimensions provide a structured approach for evaluating whether real-world data sources contain the necessary information (relevance) and accurately represent the clinical concepts they purport to measure (reliability) for specific research questions in oncology.

Table 1: Core Data Quality Dimensions Across Regulatory Frameworks

| Quality Dimension | FDA Focus | EMA Focus | NICE Focus | Common Application in Oncology |
|---|---|---|---|---|
| Relevance | Availability of key data elements (exposure, outcomes, covariates) and sufficient numbers of representative patients [31] | Extent to which a dataset presents data elements useful to answer a research question [31] | Whether data provide sufficient information for robust results and generalizability to healthcare system populations [31] | Assessing whether EHR data capture critical oncology-specific elements (e.g., biomarkers, cancer stage, treatment regimens, outcomes) |
| Reliability | Data accuracy, completeness, provenance, and traceability [31] | How closely data reflect what they are designed to measure [31] | Ability to get similar results when a study is repeated with different populations [31] | Ensuring accuracy of cancer diagnoses, treatment dates, and outcomes across diverse data sources |
| Key Subdimensions | Accuracy; completeness; provenance; timeliness | Precision; completeness; consistency | Accuracy; completeness; consistency | Tumor histology accuracy; treatment capture completeness; outcome ascertainment |

Implementation Approaches: Single-Stage vs. Two-Stage Processes

Operationalizing fitness-for-use assessments involves either single-stage or two-stage processes. In a single-stage process, researchers apply cleaning, transformation, and linkage steps directly to raw RWD to generate an output dataset deemed fit for a specific use. This approach is efficient for studies with well-defined purposes, such as device registries or specific label extensions. In contrast, a two-stage process first brings raw RWD to a baseline "research-ready" quality level, with additional study-specific cleaning and transformation applied subsequently. This approach is better suited for data used across multiple studies with different objectives or when linking with diverse data sources [81].

Major distributed research networks have implemented variations of these approaches. The Sentinel Initiative and PCORnet employ comprehensive data characterization routines that run against common data models, providing descriptive statistics on missing values, outliers, frequency distributions, and results from systematic quality checks. These networks then layer study-specific assessments prior to analysis [81]. This iterative process gradually improves overall data quality while providing researchers with documented quality metrics for determining fitness-for-use.
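
A minimal sketch of a characterization routine of the kind these networks run against a common data model, using pandas with illustrative columns; production tools such as Achilles run far more extensive check suites:

```python
# Minimal sketch of a data characterization routine: per-column missingness,
# cardinality, and a plausibility check. Columns and thresholds are
# illustrative.
import pandas as pd

df = pd.DataFrame({
    "tumor_size_mm": [22, 35, None, 900, 18],   # 900 mm is implausible
    "stage": ["II", "IV", "IV", None, "III"],
})

profile = pd.DataFrame({
    "pct_missing": df.isna().mean() * 100,
    "n_unique": df.nunique(),
})
print(profile.round(1))

# Plausibility check: flag values outside a clinically sensible range
implausible = df[(df["tumor_size_mm"] < 0) | (df["tumor_size_mm"] > 500)]
print(f"{len(implausible)} implausible tumor-size record(s)")
```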

Experimental Protocols for Validating Oncology-Specific Endpoints

Methodologies for EHR Data Extraction and Standardization

Recent research demonstrates sophisticated approaches for extracting and standardizing oncology endpoints from EHRs. A study implementing a real-world data pipeline for precision oncology developed infrastructure incorporating data mining and natural language processing (NLP) scripts to automatically retrieve descriptive variables and common endpoints from EHRs complying with the Precision Oncology Core Data Model (Precision-DM) [10]. The methodology involved:

  • Data Source Establishment: Creating a comprehensive cancer registry containing clinical, molecular, genomic, radiographic, pathology, and operational data for all cancer patients within a healthcare system
  • Structured Data Extraction: Using standard SQL queries to directly pull structured data elements (e.g., race, ethnicity, sex, smoking status) from the EHR
  • Unstructured Data Processing: Implementing NLP scripts to extract critical variables from clinical notes, pathology reports, and other unstructured sources
  • Containerized Toolset Development: Packaging extraction algorithms into a web-based toolset installed on a virtual machine and interfaced with the cancer registry

This pipeline accurately retrieved most descriptive EHR fields but demonstrated variable performance for dates needed to calculate key oncology endpoints, with accuracy ranging from 50%-86% for Date of Diagnosis and Treatment Start Date, which directly impact the calculation of Age at Diagnosis, Overall Survival, and Time to First Treatment [10].

Validation Study Design for Critical Data Elements

The FDA guidance recommends that operational definitions for key variables should be demonstrated using sufficiently large samples, appropriate sampling techniques, and reasonable reference standards [82]. For oncology endpoints, validation studies should include:

  • Reference Standard Definition: Establishing a reliable reference (e.g., manual chart abstraction by trained tumor registrars, linked tumor registry data) against which to compare EHR-derived elements
  • Sampling Strategy: Implementing appropriate sampling techniques that account for potential variation in data quality across patient subgroups, practice settings, and time periods
  • Performance Metric Calculation: Quantifying positive predictive value, sensitivity, specificity, and negative predictive value of algorithms used to identify oncology endpoints
  • Transportability Assessment: Evaluating whether algorithm performance remains consistent across different healthcare settings, coding systems, and calendar time periods

For example, in validating an algorithm to identify immunotherapy-related adverse events, researchers might compare EHR-derived identification against manual chart review by clinical experts, calculating performance metrics overall and within key subgroups (e.g., by cancer type, treatment regimen, practice setting) [10].
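
A minimal sketch of such a metric calculation, using illustrative counts only, might look as follows; computing the same metrics within a subgroup probes the transportability question raised above.

```python
def performance_metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    """2x2 confusion-matrix metrics for an EHR phenotyping algorithm,
    computed against a reference standard such as manual chart review."""
    return {
        "sensitivity": tp / (tp + fn) if tp + fn else None,
        "specificity": tn / (tn + fp) if tn + fp else None,
        "ppv": tp / (tp + fp) if tp + fp else None,
        "npv": tn / (tn + fn) if tn + fn else None,
    }

# Illustrative counts only: algorithm flags vs. chart-review truth,
# overall and within one practice-setting subgroup.
overall = performance_metrics(tp=88, fp=12, fn=9, tn=391)
community_setting = performance_metrics(tp=30, fp=9, fn=7, tn=154)
print(overall)
print(community_setting)
```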

Table 2: Performance Metrics for Oncology Endpoint Validation

| Oncology Endpoint | Validation Reference Standard | Reported Performance in Recent Studies | Key Challenges |
| --- | --- | --- | --- |
| Date of Diagnosis | Manual chart abstraction by tumor registrars [10] | Accuracy range: 50%-86% [10] | Multiple potential dates (first symptom, first presentation, pathologic confirmation) |
| Treatment Start Date | Medication administration records with manual verification [10] | Accuracy range: 50%-86% [10] | Distinguishing between order date, administration date, and actual start |
| Overall Survival | Linked vital status records from state death registries [10] | Dependent on accurate diagnosis date and death capture [10] | Incomplete death capture in EHR; requires data linkage |
| Performance Status | NLP extraction from clinical notes [10] | Reproduced model with 93% accuracy [10] | Varied documentation patterns across providers |

Visualization of Fitness-for-Use Assessment Workflow

The following diagram illustrates the conceptual workflow for assessing fitness-for-use of real-world data sources in oncology research, integrating elements from regulatory frameworks and experimental validation approaches:

Define Research Question (PICOTS Framework) → Characterize RWD Source (Provenance, Access, Curation) → Assess Relevance (Data Element Availability) → Assess Reliability (Accuracy, Completeness) → Perform Validation Study (Critical Variables Only) → Implement Quality Assurance (Risk-Based Monitoring) → Generate Evidence (With Quality Documentation) → Regulatory Decision or Clinical Insight

Fitness-for-Use Assessment Workflow

The Researcher's Toolkit: Essential Solutions for Oncology Data Quality

Implementing robust fitness-for-use assessments requires both methodological approaches and practical tools. The following toolkit provides essential solutions for researchers working with oncology real-world data:

Table 3: Research Reagent Solutions for Oncology Data Quality Assessment

| Tool Category | Specific Solutions | Function in Quality Assessment | Implementation Considerations |
| --- | --- | --- | --- |
| Data Quality Frameworks | FDA RWD Guidance Framework [79], EMA Quality Framework [31], NESTcc Data Quality Framework [81] | Provide structured approaches for evaluating relevance and reliability of data sources | Framework selection should align with intended use case and regulatory context |
| Common Data Models | Precision-DM [10], mCODE [10], PCORnet CDM [81], Sentinel CDM [81] | Standardize structure and terminology for oncology data elements, enabling interoperability | Implementation requires mapping from source EHR data to standardized model |
| Data Characterization Tools | Achilles [81], PCORnet Data Curation Query Package [81], Sentinel Data Characterization | Generate descriptive statistics on data completeness, outliers, and value distributions | Should be implemented iteratively with each data refresh |
| Validation Tools | NLP scripts for unstructured data [10], algorithm performance calculators, quantitative bias analysis tools [82] | Assess accuracy of critical variables against reference standards | Focus validation efforts on exposure, outcome, and key confounder variables |
| Quality Documentation | Data provenance trackers, audit trail systems, data quality metric dashboards | Document data transformations and quality metrics for regulatory submission | Should capture lineage from source data to analytic dataset (sketched below) |
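
As an illustration of the quality-documentation row above, the following is a minimal sketch of a provenance log; it is a toy rendering of the concept, not any specific commercial or network tool.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class ProvenanceLog:
    """Minimal lineage record from source extract to analytic dataset.

    A sketch of the 'data provenance tracker' idea only."""
    source: str
    steps: list = field(default_factory=list)

    def record(self, operation: str, detail: str) -> None:
        # Timestamped entry documenting one transformation step
        self.steps.append({
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "operation": operation,
            "detail": detail,
        })

log = ProvenanceLog(source="EHR extract 2025-10-01")
log.record("deduplicate", "dropped 42 duplicate patient rows")
log.record("map", "mapped local histology codes to ICD-O-3")
print(log.steps)
```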

Assessing fitness-for-use represents a fundamental requirement for generating reliable evidence from real-world oncology data. Regulatory frameworks provide structured approaches centered on relevance and reliability dimensions, while experimental protocols offer methodologies for validating oncology-specific endpoints with variable performance characteristics. Successful implementation requires careful consideration of research questions, data source characteristics, and appropriate validation strategies focused on critical study elements. As regulatory standards continue to evolve, researchers must maintain rigorous yet practical approaches to data quality assessment that enable robust evidence generation while acknowledging the inherent limitations of real-world data sources. Through systematic application of these frameworks and tools, the oncology research community can advance the appropriate use of real-world evidence in regulatory decision-making and clinical care.

Conclusion

The validation of real-time oncology data from EHRs is no longer a theoretical ambition but a feasible and critical component of a modern cancer data ecosystem. Evidence demonstrates that automated systems can achieve high accuracy in capturing diagnoses, treatments, and outcomes, transforming registries from retrospective archives into proactive tools for clinical decision-making. Success hinges on the strategic implementation of common data models, AI-enabled curation, and continuous quality assurance aligned with regulatory frameworks. For researchers and drug developers, this validated, timely data opens new frontiers for generating robust real-world evidence, supporting everything from external control arms to post-marketing surveillance. Future efforts must focus on standardizing these validation approaches globally to ensure that the accelerated pace of oncology innovation is matched by equally agile and trustworthy data systems, ultimately improving patient access to care and outcomes.

References