10 Evaluating AI Systems
Learning Objectives
By the end of this chapter, you will:
- Understand fundamental principles of AI system evaluation in public health contexts
- Select and apply appropriate performance metrics for different types of AI models
- Design rigorous validation strategies including internal, external, and prospective validation
- Evaluate AI systems beyond technical performance to include clinical utility and implementation outcomes
- Identify and avoid common pitfalls in AI evaluation (data leakage, optimization bias, improper cross-validation)
- Conduct comprehensive fairness and equity assessments across population subgroups
- Interpret and critically appraise AI evaluation studies in published literature
- Design prospective evaluation studies including RCTs and implementation trials
Time to complete: 90-120 minutes
Prerequisites: Chapter 2: AI Basics, Chapter 3: Data, Chapter 8: Clinical AI
What you’ll build: 💻 Performance evaluation framework, external validation study design, fairness audit toolkit, critical appraisal checklist, and prospective testing protocol
10.1 Introduction: The Evaluation Crisis in AI
December 2020, Radiology:
Researchers at UC Berkeley publish a comprehensive review of 62 deep learning studies in medical imaging published in high-impact journals.
Their sobering finding: Only 6% performed external validation on data from different institutions.
The vast majority tested models only on hold-out sets from the same dataset used for training—a practice that provides minimal evidence of real-world performance.
July 2021, JAMA Internal Medicine:
Wong et al. publish external validation of Epic’s sepsis prediction model—deployed at >100 US hospitals, affecting millions of patients.
The model’s performance: - Sensitivity: 33% (missed 2 out of 3 sepsis cases) - Positive predictive value: 7% (93% of alerts were false positives) - Early detection: Only 6% of alerts fired before clinical recognition
Conclusion from authors: “The algorithm rarely alerted clinicians to sepsis before it was clinically recognized and had poor predictive accuracy.”
This wasn’t a research study—this was a deployed clinical system used in real patient care.
The Evaluation Gap:
Between lab performance and real-world deployment lies a chasm that has claimed many promising AI systems:
In the lab: - Curated, high-quality datasets - Balanced classes (50% positive, 50% negative) - Consistent protocols - Expert-confirmed labels - AUC-ROC = 0.95
In the real world: - Messy, incomplete data - Rare events (1-5% prevalence) - Variable protocols across sites - Ambiguous cases - AUC-ROC = 0.68
The consequences are severe:
❌ Failed deployments — Models that work in development but fail in production
❌ Hidden biases — Systems that perform well on average but poorly for specific groups
❌ Wasted resources — Millions invested in systems that don’t deliver promised benefits
❌ Patient harm — Incorrect predictions leading to inappropriate treatments
❌ Eroded trust — Clinicians lose confidence in AI after experiencing failures
Rigorous evaluation is the bridge between AI research and AI implementation. Without it, we’re deploying unvalidated systems and hoping for the best.
This chapter provides a comprehensive framework for evaluating AI systems across five critical dimensions:
- Technical performance — Accuracy, calibration, robustness
- Generalizability — External validity, temporal stability, geographic transferability
- Clinical/Public health utility — Impact on decisions and outcomes
- Fairness and equity — Performance across demographic subgroups
- Implementation outcomes — Adoption, usability, sustainability
You’ll learn how to evaluate AI systems rigorously, design validation studies, and critically appraise published research.
10.2 The Multidimensional Nature of Evaluation
10.2.1 What Are We Really Evaluating?
Evaluating an AI system is not just about measuring accuracy. In public health and clinical contexts, we need to assess multiple dimensions.
10.2.1.1 1. Technical Performance
Question: Does the model make accurate predictions on new data?
Key aspects: - Discrimination: Can the model distinguish between positive and negative cases? - Calibration: Do predicted probabilities match observed frequencies? - Robustness: Does performance degrade with missing data or noise? - Computational efficiency: Speed and resource requirements for deployment
Relevant for: All AI systems (foundational requirement)
10.2.1.2 2. Generalizability
Question: Will the model work in settings different from where it was developed?
Key aspects: - Geographic transferability: Performance at different institutions, regions, countries - Temporal stability: Does performance degrade as time passes and data distributions shift? - Population differences: Performance across different patient demographics, disease prevalence - Setting transferability: Hospital vs. primary care vs. community settings
Relevant for: Any system intended for broad deployment
Critical insight: A 2020 Nature Medicine paper showed that AI models often learn “shortcuts”—spurious correlations specific to training data that don’t generalize. For example, pneumonia detection models learned to identify portable X-ray machines (used for sicker patients) rather than actual pneumonia.
10.2.1.3 3. Clinical/Public Health Utility
Question: Does the model improve decision-making and outcomes?
Key aspects: - Decision impact: Does it change clinician decisions? - Outcome improvement: Does it lead to better patient outcomes? - Net benefit: Does it provide value above existing approaches? - Cost-effectiveness: Does it provide value commensurate with costs?
Relevant for: Clinical decision support, diagnostic tools
Critical distinction: A model can be statistically accurate but clinically useless. Example: A model predicting hospital mortality with AUC-ROC = 0.85 sounds impressive, but if it doesn’t change management or improve outcomes, it adds no value.
For framework on clinical utility assessment, see Van Calster et al., 2019, BMJ on decision curve analysis.
10.2.1.4 4. Fairness and Equity
Question: Does the model perform equitably across population subgroups?
Key aspects: - Subgroup performance: Stratified metrics by race, ethnicity, gender, age, socioeconomic status - Error rate disparities: Differential false positive/negative rates - Outcome equity: Does deployment narrow or widen health disparities? - Representation: Are all groups adequately represented in training data?
Relevant for: All systems affecting humans
Essential reading: Obermeyer et al., 2019, Science - racial bias in healthcare algorithm affecting millions; Gianfrancesco et al., 2018, Nature Communications - sex bias in clinical AI.
10.2.1.5 5. Implementation Outcomes
Question: Is the model adopted and used effectively in practice?
Key aspects: - Adoption: Are users actually using it as intended? - Usability: Can users operate it efficiently? - Workflow integration: Does it fit smoothly into existing processes? - Sustainability: Will it continue to be used and maintained over time?
Relevant for: Any deployed system
Framework: Proctor et al., 2011, Administration and Policy in Mental Health - implementation outcome taxonomy.
10.3 The Hierarchy of Evidence for AI Systems
Just as clinical medicine has evidence hierarchies (case reports → cohort studies → RCTs), AI systems should progress through increasingly rigorous validation stages.
10.3.1 Level 1: Development and Internal Validation
What it is: - Split-sample validation (train-test split) or cross-validation on development dataset - Model trained and tested on data from same source
Evidence strength: ⭐ (Weakest)
Value: - Initial proof-of-concept - Model selection and hyperparameter tuning - Feasibility assessment
Limitations: - Optimistic bias (model may overfit to dataset-specific quirks) - No evidence of generalizability - Cannot assess real-world performance
Common in: Early-stage research, algorithm development
10.3.2 Level 2: Temporal Validation
What it is: - Train on data from earlier time period - Test on data from later time period (same source)
Evidence strength: ⭐⭐
Value: - Tests temporal stability - Detects concept drift (changes in data distribution over time) - Better than spatial hold-out from same time period
Limitations: - Still from same institution/setting - May not generalize geographically
Example: Sendak et al., 2020, npj Digital Medicine - demonstrated temporal degradation of sepsis models
10.3.3 Level 3: External Geographic Validation
What it is: - Train on data from one institution/region - Test on data from different institution(s)/region(s)
Evidence strength: ⭐⭐⭐
Value: - Strongest evidence of generalizability without prospective deployment - Tests performance across different patient populations, clinical practices, data collection protocols - Identifies setting-specific dependencies
Limitations: - Still retrospective - Doesn’t assess impact on clinical decisions or outcomes
Gold standard for retrospective evaluation: Collins et al., 2015, BMJ - TRIPOD guidelines recommend external validation as minimal standard.
10.3.4 Level 4: Retrospective Impact Assessment
What it is: - Simulate what would have happened if model had been used - Estimate impact on decision-making without actual deployment
Evidence strength: ⭐⭐⭐
Value: - Estimates potential benefit before prospective deployment - Identifies potential implementation barriers - Justifies resource allocation for prospective studies
Limitations: - Cannot capture changes in clinician behavior - Assumptions about how predictions would be used may be incorrect
Example: Jung et al., 2020, JAMA Network Open - retrospective assessment of deep learning for diabetic retinopathy screening
10.3.5 Level 5: Prospective Observational Studies
What it is: - Model deployed in real clinical practice - Predictions shown to clinicians - Outcomes observed but no experimental control
Evidence strength: ⭐⭐⭐⭐
Value: - Real-world performance data - Identifies implementation challenges (workflow disruption, alert fatigue) - Measures actual usage patterns
Limitations: - Cannot establish causality (improvements may be due to other factors) - Selection bias if clinicians choose when to use model - No counterfactual (what would have happened without model?)
Example: Tomašev et al., 2019, Nature - DeepMind AKI prediction deployed prospectively at VA hospitals
10.3.6 Level 6: Randomized Controlled Trials
What it is: - Randomize patients/clinicians/units to model-assisted vs. control groups - Measure outcomes in both groups - Compare to establish causal effect
Evidence strength: ⭐⭐⭐⭐⭐ (Strongest)
Value: - Definitive evidence of impact on outcomes - Establishes causality - Meets regulatory and reimbursement standards
Limitations: - Expensive and time-consuming - Requires large sample sizes - Ethical considerations (withholding potentially beneficial intervention from control group)
Example: Komorowski et al., 2018, Nature Medicine - RL-based sepsis treatment (retrospective RCT simulation); actual prospective RCTs rare but emerging.
Best practice: Progress systematically through validation stages:
- Internal validation → Establish proof-of-concept
- Temporal validation → Test stability over time
- External validation → Test generalizability
- Retrospective impact → Estimate potential benefit
- Prospective observational → Measure real-world performance
- RCT → Establish causal impact
Don’t skip steps. Each level provides critical information before committing resources to higher levels.
10.4 Performance Metrics: Choosing the Right Measures
10.4.1 Classification Metrics
For binary classification (disease/no disease, outbreak/no outbreak), numerous metrics exist. No single metric tells the whole story.
10.4.1.1 The Confusion Matrix Foundation
All classification metrics derive from the 2×2 confusion matrix:
| | Predicted Positive | Predicted Negative |
|---|---|---|
| Actually Positive | True Positives (TP) | False Negatives (FN) |
| Actually Negative | False Positives (FP) | True Negatives (TN) |
Example: TB screening of 1,000 individuals; 100 actually have TB
| | Predicted TB+ | Predicted TB- |
|---|---|---|
| Actually TB+ | 85 (TP) | 15 (FN) |
| Actually TB- | 90 (FP) | 810 (TN) |
From this matrix, we calculate all other metrics.
10.4.1.2 Core Metrics
1. Sensitivity (Recall, True Positive Rate)
\[\text{Sensitivity} = \frac{TP}{TP + FN} = \frac{TP}{\text{All Actual Positives}}\]
- Interpretation: Of all actual positives, what proportion did we identify?
- Example: 85/100 = 85% (identified 85 of 100 TB cases)
- When to prioritize: High-stakes screening (must catch most cases), early disease detection, rule-out tests
- Trade-off: Maximizing sensitivity → more false positives
2. Specificity (True Negative Rate)
\[\text{Specificity} = \frac{TN}{TN + FP} = \frac{TN}{\text{All Actual Negatives}}\]
- Interpretation: Of all actual negatives, what proportion did we correctly identify?
- Example: 810/900 = 90% (correctly ruled out TB in 810 of 900 healthy people)
- When to prioritize: Confirmatory tests, when false alarms are costly, rule-in tests
- Trade-off: Maximizing specificity → more false negatives
3. Positive Predictive Value (Precision, PPV)
\[\text{PPV} = \frac{TP}{TP + FP} = \frac{TP}{\text{All Predicted Positives}}\]
- Interpretation: Of all predicted positives, what proportion are actually positive?
- Example: 85/175 = 49% (49% of positive predictions are correct)
- When to prioritize: When acting on predictions is costly (treatments, interventions)
- Critical property: Depends heavily on disease prevalence
Prevalence dependence example:
| Scenario | Prevalence | Sensitivity | Specificity | PPV |
|---|---|---|---|---|
| High-burden TB setting | 10% | 85% | 90% | 49% |
| Low-burden TB setting | 1% | 85% | 90% | 8% |
Same model, vastly different PPV! In low-prevalence settings, even high specificity leads to poor PPV.
For detailed explanation, see Altman & Bland, 1994, BMJ on diagnostic tests and prevalence.
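To see where these numbers come from, here is a minimal sketch of the Bayes'-rule arithmetic behind the table above; the sensitivity, specificity, and prevalence values are simply the ones from the table.

```python
def ppv(sensitivity, specificity, prevalence):
    """Positive predictive value from test characteristics and prevalence (Bayes' theorem)."""
    true_positives = sensitivity * prevalence
    false_positives = (1 - specificity) * (1 - prevalence)
    return true_positives / (true_positives + false_positives)

for prev in [0.10, 0.01]:
    print(f"Prevalence {prev:.0%}: PPV = {ppv(0.85, 0.90, prev):.0%}")
# Prevalence 10%: PPV = 49%
# Prevalence 1%: PPV = 8%
```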
4. Negative Predictive Value (NPV)
\[\text{NPV} = \frac{TN}{TN + FN} = \frac{TN}{\text{All Predicted Negatives}}\]
- Interpretation: Of all predicted negatives, what proportion are actually negative?
- Example: 810/825 = 98% (98% of negative predictions are correct)
- When to prioritize: Rule-out tests, when missing disease is catastrophic
- Critical property: Also depends on prevalence (high prevalence → lower NPV)
5. Accuracy
\[\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} = \frac{\text{Correct Predictions}}{\text{All Predictions}}\]
- Interpretation: Overall proportion of correct predictions
- Example: (85+810)/1000 = 89.5%
- Major limitation: Misleading for imbalanced datasets
Classic pitfall:
Dataset: 1,000 patients, 10 with disease (1% prevalence)
Naive model: Predict “no disease” for everyone - Accuracy: 990/1000 = 99% - But sensitivity = 0% (misses all disease cases!)
Takeaway: Accuracy alone is insufficient, especially for rare events.
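As a quick check on the definitions above, this short sketch reproduces the core metrics directly from the four cell counts of the TB screening table (a minimal illustration, not a library implementation):

```python
# Cell counts from the TB screening example above
TP, FN, FP, TN = 85, 15, 90, 810

sensitivity = TP / (TP + FN)                 # 85/100  = 0.85
specificity = TN / (TN + FP)                 # 810/900 = 0.90
ppv = TP / (TP + FP)                         # 85/175  ≈ 0.49
npv = TN / (TN + FN)                         # 810/825 ≈ 0.98
accuracy = (TP + TN) / (TP + TN + FP + FN)   # 895/1000 = 0.895

print(f"Sensitivity {sensitivity:.2f}, Specificity {specificity:.2f}, "
      f"PPV {ppv:.2f}, NPV {npv:.2f}, Accuracy {accuracy:.3f}")
```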
6. F1 Score (Harmonic Mean of Precision and Recall)
\[F_1 = 2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}} = \frac{2TP}{2TP + FP + FN}\]
- Interpretation: Balance between precision and recall
- Range: 0 (worst) to 1 (perfect)
- When to use: When you need single metric balancing both concerns
- Limitation: Ignores true negatives (not suitable when TN important)
Variants: - \(F_2\) score: Weights recall higher than precision - \(F_{0.5}\) score: Weights precision higher than recall
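In practice these scores are rarely computed by hand; a hedged sketch using scikit-learn, assuming y_true holds the actual labels and y_pred the thresholded binary predictions from your own model:

```python
from sklearn.metrics import f1_score, fbeta_score

# y_true: actual labels; y_pred: binary predictions (assumed available from your model)
f1 = f1_score(y_true, y_pred)
f2 = fbeta_score(y_true, y_pred, beta=2)      # weights recall higher than precision
f05 = fbeta_score(y_true, y_pred, beta=0.5)   # weights precision higher than recall

print(f"F1: {f1:.3f}  F2: {f2:.3f}  F0.5: {f05:.3f}")
```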
10.4.1.3 Threshold-Independent Metrics
7. Area Under the ROC Curve (AUC-ROC)
The Receiver Operating Characteristic (ROC) curve plots: - Y-axis: True Positive Rate (Sensitivity) - X-axis: False Positive Rate (1 - Specificity)
…across all possible classification thresholds (0 to 1).
AUC-ROC interpretation: - 0.5 = Random guessing (diagonal line) - 0.6-0.7 = Poor discrimination - 0.7-0.8 = Acceptable - 0.8-0.9 = Excellent - >0.9 = Outstanding (rare in clinical applications)
Alternative interpretation: Probability that a randomly selected positive case is ranked higher than a randomly selected negative case.
Advantages: - Threshold-independent (single summary metric) - Not affected by class imbalance (in terms of metric itself) - Standard metric for model comparison
Limitations: - May overemphasize performance at thresholds you wouldn’t use clinically - Doesn’t indicate optimal threshold - Can be misleading for highly imbalanced data (see Average Precision)
For comprehensive guide, see Hanley & McNeil, 1982, Radiology on the meaning and use of AUC.
8. Average Precision (Area Under Precision-Recall Curve)
The Precision-Recall (PR) curve plots: - Y-axis: Precision (PPV) - X-axis: Recall (Sensitivity)
…across all thresholds.
Average Precision (AP): Area under PR curve
Why use PR curve instead of ROC? - More informative for imbalanced datasets where positive class is rare - Focuses on performance on the positive class (which you care about more when it’s rare) - ROC can be misleadingly optimistic when negative class dominates
Example: Disease with 1% prevalence
- AUC-ROC = 0.90 (sounds great!)
- Average Precision = 0.25 (reveals poor performance on actual disease cases)
When to use: Rare disease detection, outbreak detection, any imbalanced problem
For detailed comparison, see Saito & Rehmsmeier, 2015, PLOS ONE on precision-recall vs. ROC curves.
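The gap between the two metrics is easy to demonstrate on synthetic data. The sketch below builds an illustrative ~1%-prevalence dataset (all names and parameter values are arbitrary choices for the example) and compares AUC-ROC with average precision:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score, average_precision_score

# Synthetic dataset with roughly 1% positive class to mimic a rare outcome
X, y = make_classification(n_samples=20000, n_features=20, weights=[0.99], random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=42)

model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
proba = model.predict_proba(X_te)[:, 1]

# AUC-ROC can look strong while average precision exposes weak positive-class performance
print(f"Prevalence:        {y_te.mean():.1%}")
print(f"AUC-ROC:           {roc_auc_score(y_te, proba):.3f}")
print(f"Average precision: {average_precision_score(y_te, proba):.3f}")
```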
10.4.1.4 Choosing Metrics by Scenario
| Scenario | Primary Metrics | Rationale |
|---|---|---|
| COVID-19 airport screening | Sensitivity, NPV | Must catch most cases; false positives acceptable (confirmatory testing available) |
| Cancer diagnosis confirmation | Specificity, PPV | False positives → unnecessary surgery; high bar for confirmation |
| Automated triage system | AUC-ROC, Calibration | Need good ranking across full risk spectrum |
| Rare disease detection | Average Precision, Sensitivity | Standard AUC-ROC misleading when imbalanced |
| Syndromic surveillance | Sensitivity, Timeliness | Early detection critical; false alarms tolerable (investigation cheap) |
| Clinical decision support | PPV, Calibration | Clinicians ignore if too many false alarms; need well-calibrated probabilities |
10.4.2 Calibration: Do Predicted Probabilities Mean What They Say?
Calibration assesses whether predicted probabilities match observed frequencies.
Example of well-calibrated model: - Model predicts “30% risk of readmission” for 100 patients - About 30 of those 100 are actually readmitted - Predicted probability ≈ observed frequency
Poor calibration: - Model predicts “30% risk” but 50% are actually readmitted → underconfident - Model predicts “30% risk” but 15% are actually readmitted → overconfident
10.4.2.1 Measuring Calibration
1. Calibration Plot
Method: 1. Bin predictions into groups (e.g., 0-10%, 10-20%, …, 90-100%) 2. For each bin, calculate: - Mean predicted probability (x-axis) - Observed frequency of outcome (y-axis) 3. Plot points 4. Perfect calibration: points lie on diagonal line (y = x)
Interpretation: - Points above diagonal: Model underconfident (predicts lower risk than reality) - Points below diagonal: Model overconfident (predicts higher risk than reality)
2. Brier Score
\[\text{Brier Score} = \frac{1}{N} \sum_{i=1}^{N} (p_i - y_i)^2\]
where \(p_i\) = predicted probability, \(y_i\) = actual outcome (0 or 1)
- Range: 0 (perfect) to 1 (worst)
- Lower is better
- Combines discrimination and calibration into single metric
- Can be decomposed into calibration and refinement components
Interpretation (rough benchmarks): - 0.25 = uninformative baseline from predicting 50% for everyone (predicting the prevalence for everyone gives p(1−p), which is lower for rare outcomes) - <0.15 = Good calibration - <0.10 = Excellent calibration
For Brier score deep dive, see Rufibach, 2010, Clinical Trials.
3. Expected Calibration Error (ECE)
\[\text{ECE} = \sum_{m=1}^{M} \frac{n_m}{N} |\text{acc}(B_m) - \text{conf}(B_m)|\]
where: - \(M\) = number of bins - \(B_m\) = set of predictions in bin \(m\) - \(n_m\) = number of predictions in bin \(m\) - \(\text{acc}(B_m)\) = accuracy in bin \(m\) - \(\text{conf}(B_m)\) = average confidence in bin \(m\)
Interpretation: Average difference between predicted and observed probabilities across bins (weighted by bin size)
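A sketch of how these three checks might be computed with scikit-learn and NumPy, assuming y_true (0/1 outcomes) and y_prob (predicted probabilities) come from your own model; the ECE helper is a simplified version that bins by predicted probability:

```python
import numpy as np
from sklearn.calibration import calibration_curve
from sklearn.metrics import brier_score_loss

# y_true (0/1) and y_prob (predicted probabilities) assumed available from your model
y_true = np.asarray(y_true)
y_prob = np.asarray(y_prob)

prob_true, prob_pred = calibration_curve(y_true, y_prob, n_bins=10)  # points for a calibration plot
brier = brier_score_loss(y_true, y_prob)

def expected_calibration_error(y_true, y_prob, n_bins=10):
    """Bin-size-weighted gap between observed frequency and mean predicted probability."""
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (y_prob >= lo) & (y_prob < hi)
        if mask.sum() == 0:
            continue
        ece += (mask.sum() / len(y_prob)) * abs(y_true[mask].mean() - y_prob[mask].mean())
    return ece

print(f"Brier: {brier:.3f}  ECE: {expected_calibration_error(y_true, y_prob):.3f}")
```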
10.4.2.2 Why Calibration Matters
Clinical decision-making requires well-calibrated probabilities:
Scenario 1: Treatment threshold - If risk >20%, prescribe preventive medication - Poorly calibrated model: risk actually 40% when model says 20% - Result: Under-treatment of high-risk patients
Scenario 2: Resource allocation - Allocate home health visits to top 10% risk - Overconfident model: predicted “high risk” patients aren’t actually high risk - Result: Resources wasted on low-risk patients, true high-risk patients missed
Scenario 3: Patient counseling - Tell patient: “You have 30% chance of complications” - If model poorly calibrated, this number is meaningless - Result: Informed consent based on inaccurate information
Common issue: Deep neural networks often produce poorly calibrated probabilities out-of-the-box. They tend to be overconfident (predicted probabilities too extreme).
Why? Modern neural networks are optimized for accuracy, not calibration. Regularization techniques that prevent overfitting can actually worsen calibration.
Evidence: Guo et al., 2017, ICML - “On Calibration of Modern Neural Networks”
Solution: Post-hoc calibration methods: - Temperature scaling: Simplest and most effective - Platt scaling: Logistic regression on model outputs - Isotonic regression: Non-parametric calibration
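A hedged sketch of post-hoc calibration using scikit-learn's CalibratedClassifierCV (Platt scaling and isotonic regression); temperature scaling operates on a neural network's logits and is not shown here. The gradient-boosting base model is just a stand-in for your own classifier:

```python
from sklearn.calibration import CalibratedClassifierCV
from sklearn.ensemble import GradientBoostingClassifier

# X_train, y_train, X_test assumed from your own pipeline; the base model is a placeholder
base_model = GradientBoostingClassifier(random_state=42)

# Platt scaling: logistic regression fit on held-out predictions ('sigmoid')
platt = CalibratedClassifierCV(base_model, method='sigmoid', cv=5).fit(X_train, y_train)

# Isotonic regression: non-parametric, usually needs more calibration data
iso = CalibratedClassifierCV(base_model, method='isotonic', cv=5).fit(X_train, y_train)

calibrated_probs = platt.predict_proba(X_test)[:, 1]
```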
Takeaway: Always assess and correct calibration for deep learning models before deployment.
10.4.3 Regression Metrics
For continuous outcome prediction (disease burden, resource utilization, epidemic size):
1. Mean Absolute Error (MAE)
\[\text{MAE} = \frac{1}{N} \sum_{i=1}^{N} |y_i - \hat{y}_i|\]
- Interpretation: Average absolute difference between prediction and truth
- Unit: Same as outcome variable
- Advantage: Interpretable, robust to outliers
- Example: MAE = 3.2 days (average error in predicting length of stay)
2. Root Mean Squared Error (RMSE)
\[\text{RMSE} = \sqrt{\frac{1}{N} \sum_{i=1}^{N} (y_i - \hat{y}_i)^2}\]
- Interpretation: Square root of average squared error
- Property: Penalizes large errors more heavily than MAE (due to squaring)
- When to use: When large errors are particularly problematic
Relationship: RMSE ≥ MAE always (equality only if all errors identical)
3. R-squared (Coefficient of Determination)
\[R^2 = 1 - \frac{\sum_{i=1}^{N} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{N} (y_i - \bar{y})^2} = 1 - \frac{\text{SS}_{\text{res}}}{\text{SS}_{\text{tot}}}\]
- Range: up to 1 (1 = perfect; 0 = no better than predicting the mean; negative if the model is worse than the mean)
- Interpretation: Proportion of variance in outcome explained by model
- Example: R² = 0.65 means model explains 65% of variance
- Limitation: Can be artificially inflated by adding more features
4. Mean Absolute Percentage Error (MAPE)
\[\text{MAPE} = \frac{100\%}{N} \sum_{i=1}^{N} \left| \frac{y_i - \hat{y}_i}{y_i} \right|\]
- Interpretation: Average percentage error
- Advantage: Scale-independent (can compare across different units)
- Example: MAPE = 15% (average error is 15% of true value)
- Limitation: Undefined when the actual value is zero; asymmetric—because errors are divided by the actual value, over-predictions are penalized more heavily than under-predictions, so minimizing MAPE biases models toward under-forecasting
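A short sketch computing all four regression metrics, assuming y_true and y_pred are continuous arrays (for example, predicted versus actual length of stay in days); the MAPE line masks zero actual values because the metric is undefined there:

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# y_true and y_pred assumed: continuous outcomes from your own model
y_true = np.asarray(y_true, dtype=float)
y_pred = np.asarray(y_pred, dtype=float)

mae = mean_absolute_error(y_true, y_pred)
rmse = np.sqrt(mean_squared_error(y_true, y_pred))
r2 = r2_score(y_true, y_pred)

# MAPE is undefined where y_true == 0, so mask those cases
nonzero = y_true != 0
mape = 100 * np.mean(np.abs((y_true[nonzero] - y_pred[nonzero]) / y_true[nonzero]))

print(f"MAE: {mae:.2f}  RMSE: {rmse:.2f}  R²: {r2:.3f}  MAPE: {mape:.1f}%")
```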
10.4.4 Survival Analysis Metrics
For time-to-event prediction (mortality, readmission, disease progression):
1. Concordance Index (C-index, Harrell’s C-statistic)
- Extension of AUC-ROC to survival data with censoring
- Interpretation: Probability that, for two randomly selected individuals, the one who experiences event first has higher predicted risk
- Range: 0.5 (random) to 1.0 (perfect)
- Handles censoring: Pairs where censoring occurs are excluded or weighted
For details: Harrell et al., 1982, JAMA - original C-index paper.
2. Integrated Brier Score (IBS)
- Extension of Brier score to survival analysis
- Interpretation: Average prediction error over time, accounting for censoring
- Range: 0 (perfect) to 1 (worst)
- Advantage: Assesses calibration of survival probability predictions over follow-up period
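To make the pairwise definition of the C-index concrete, here is a minimal, illustrative implementation on toy data with censoring (ties in event time are ignored for simplicity); production analyses would typically use a survival library such as lifelines or scikit-survival:

```python
import numpy as np

def harrell_c_index(event_times, event_observed, risk_scores):
    """Among comparable pairs, fraction where the higher-risk individual has the earlier event.
    Ties in risk score count as 0.5; a pair is comparable if i has an observed event before j's time."""
    concordant, comparable = 0.0, 0
    n = len(event_times)
    for i in range(n):
        for j in range(n):
            if event_observed[i] and event_times[i] < event_times[j]:
                comparable += 1
                if risk_scores[i] > risk_scores[j]:
                    concordant += 1
                elif risk_scores[i] == risk_scores[j]:
                    concordant += 0.5
    return concordant / comparable

# Toy data: times in days, 1 = event observed, 0 = censored; risk scores are illustrative
times = np.array([5, 10, 12, 20, 25])
events = np.array([1, 1, 0, 1, 0])
risk = np.array([0.9, 0.4, 0.5, 0.7, 0.2])
print(f"C-index: {harrell_c_index(times, events, risk):.2f}")   # 0.75 (6 of 8 comparable pairs concordant)
```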
10.5 Validation Strategies: Testing Generalization
The validation strategy determines how trustworthy your performance estimates are.
10.5.1 Internal Validation
Purpose: Estimate model performance on new data from the same source.
Critical limitation: Provides no evidence about performance on different populations, institutions, or time periods.
10.5.1.1 Method 1: Train-Test Split (Hold-Out Validation)
Procedure: 1. Randomly split data into training (70-80%) and test (20-30%) 2. Train model on training set 3. Evaluate on test set (one time only)
Advantages: - Simple and fast - Clear separation between training and testing
Disadvantages: - Single split can be unrepresentative (bad luck in random split) - Wastes data (test set not used for training) - High variance in performance estimate
When to use: Large datasets (>10,000 samples), quick experiments
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.2,       # 20% for testing
    random_state=42,     # Reproducible split
    stratify=y           # Maintain class balance
)
10.5.1.2 Method 2: K-Fold Cross-Validation
Procedure: 1. Divide data into K folds (typically 5 or 10) 2. For each fold: - Train on K-1 folds - Validate on remaining fold 3. Average performance across all K folds
Advantages: - Uses all data for both training and validation - More stable performance estimate (less variance) - Standard practice in machine learning
Disadvantages: - Computationally expensive (train K models) - Still no external validation
When to use: Moderate-sized datasets (1,000-10,000 samples), model selection
from sklearn.model_selection import cross_val_score
scores = cross_val_score(
    model, X, y,
    cv=5,                # 5-fold CV
    scoring='roc_auc'    # Metric to optimize
)
print(f"AUC-ROC: {scores.mean():.3f} (±{scores.std():.3f})")
10.5.1.3 Method 3: Stratified K-Fold Cross-Validation
Modification: Ensures each fold maintains the same class distribution as the full dataset.
Critical for imbalanced datasets (e.g., 5% disease prevalence).
Why it matters: Without stratification, some folds might have very few positive cases (or none!), leading to unstable estimates.
from sklearn.model_selection import StratifiedKFold, cross_val_score

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(model, X, y, cv=skf, scoring='roc_auc')
10.5.1.4 Method 4: Time-Series Cross-Validation
For temporal data: Never train on future, test on past!
Procedure (expanding window):
Fold 1: Train [1:100] → Test [101:120]
Fold 2: Train [1:120] → Test [121:140]
Fold 3: Train [1:140] → Test [141:160]
...
Critical for: Epidemic forecasting, time-series prediction, any data with temporal structure
from sklearn.model_selection import TimeSeriesSplit
tscv = TimeSeriesSplit(n_splits=5)
for train_idx, test_idx in tscv.split(X):
    X_train, X_test = X[train_idx], X[test_idx]
    y_train, y_test = y[train_idx], y[test_idx]
    # Train and evaluate
10.5.2 Critical Considerations for Internal Validation
10.5.2.1 1. Data Leakage Prevention
Data leakage: Information from test set influencing training process.
Common sources:
❌ Feature engineering on entire dataset:
# WRONG: Standardize before splitting
X_scaled = StandardScaler().fit_transform(X)   # Uses mean/std from ALL data
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y)
# Test set info leaked into training!
✅ Feature engineering within train/test:
# CORRECT: Fit scaler on training only
X_train, X_test, y_train, y_test = train_test_split(X, y)
scaler = StandardScaler().fit(X_train)         # Learn from training only
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)       # Apply to test
❌ Feature selection on entire dataset:
# WRONG
selector = SelectKBest(k=10).fit(X, y)   # Uses ALL data
X_selected = selector.transform(X)
X_train, X_test = train_test_split(X_selected)
✅ Feature selection within training:
# CORRECT
X_train, X_test, y_train, y_test = train_test_split(X, y)
selector = SelectKBest(k=10).fit(X_train, y_train)
X_train_selected = selector.transform(X_train)
X_test_selected = selector.transform(X_test)
For comprehensive guide on data leakage, see Kaufman et al., 2012, SIGKDD.
10.5.2.2 2. Cluster-Aware Splitting
Problem: If data has natural clusters (patients within hospitals, repeated measures within individuals), random splitting can lead to overfitting.
Example: Patient has 5 hospitalizations. Random split → some hospitalizations in training, others in test. Model learns patient-specific patterns → overoptimistic performance.
Solution: Group K-Fold — ensure all samples from same group stay together
from sklearn.model_selection import GroupKFold
# patient_ids: array indicating which patient each sample belongs to
gkf = GroupKFold(n_splits=5)
for train_idx, test_idx in gkf.split(X, y, groups=patient_ids):
    # All samples from same patient stay in same fold
    X_train, X_test = X[train_idx], X[test_idx]
10.5.3 External Validation: The Gold Standard
External validation: Testing on data from entirely different source—different institution(s), population, time period, or setting.
Why it matters:
Models often learn dataset-specific quirks that don’t generalize: - Hospital equipment signatures - Documentation practices - Patient population characteristics - Data collection protocols
Without external validation, you don’t know if model learned disease patterns or dataset artifacts.
10.5.3.1 Types of External Validation
1. Geographic External Validation
Design: - Train: Hospital A (or multiple hospitals in one region) - Test: Hospital B (or hospitals in different region)
What it tests: - Different patient demographics - Different clinical practices - Different data collection protocols - Different equipment (for imaging)
Example: McKinney et al., 2020, Nature - Google breast cancer AI trained on UK data, validated on US data (and vice versa). Performance dropped: UK→US AUC decreased from 0.889 to 0.858.
2. Temporal External Validation
Design: - Train: Data from 2015-2018 - Test: Data from 2019-2021
What it tests: - Temporal stability (concept drift) - Changes in disease patterns - Changes in clinical practice - Changes in data collection
Example: Davis et al., 2017, JAMIA - Clinical prediction models degrade over time; most models need recalibration after 2-3 years.
3. Setting External Validation
Design: - Train: Intensive care unit (ICU) data - Test: General ward data
What it tests: - Performance in different clinical settings - Generalization across disease severity spectra
Example: Sepsis models trained on ICU patients often fail on ward patients (different disease presentation, different monitoring intensity).
10.5.3.2 External Validation Case Study
10.5.4 Prospective Validation: Real-World Testing
Prospective validation: Model deployed in actual clinical practice, evaluated in real-time.
Why it matters: Retrospective validation can’t capture: - How clinicians actually use (or ignore) model predictions - Workflow integration challenges - Alert fatigue and override patterns - Behavioral changes in response to predictions - Unintended consequences
10.5.4.1 Study Design 1: Silent Mode Deployment
Design: - Deploy model in background - Generate predictions but don’t show to clinicians - Compare predictions to actual outcomes (collected as usual)
Advantages: - Tests real-world data quality and distribution - No risk to patients (clinicians unaware of predictions) - Can assess performance before making decisions based on model
Disadvantages: - Doesn’t test impact on clinical decisions - Doesn’t assess workflow integration
Example: Tomašev et al., 2019, Nature - DeepMind AKI prediction initially deployed silently at VA hospitals to validate real-time performance before clinical integration.
10.5.4.2 Study Design 2: Randomized Controlled Trial (RCT)
Design: - Randomize: Patients, clinicians, or hospital units to: - Intervention: Model-assisted care - Control: Standard care (no model) - Measure: Clinical outcomes in both groups - Compare: Test if model improves outcomes
Advantages: - Strongest causal evidence for impact - Can establish cost-effectiveness - Meets regulatory/reimbursement standards
Disadvantages: - Expensive (often millions of dollars) - Time-consuming (months to years) - Requires large sample size - Ethical considerations (withholding potentially beneficial intervention)
Example: Semler et al., 2018, JAMA - SMART trial for sepsis management (not AI, but example of rigorous prospective design)
10.5.4.3 Study Design 3: Stepped-Wedge Design
Design: - Roll out model sequentially to different units/sites - Each unit serves as its own control (before vs. after) - Eventually all units receive intervention
Advantages: - More feasible than full RCT - All units eventually get intervention (addresses ethical concerns) - Within-unit comparisons reduce confounding
Disadvantages: - Temporal trends can confound results - Less rigorous than RCT (no contemporaneous control group)
Example: Common in health system implementations where full RCT infeasible.
10.5.4.4 Study Design 4: A/B Testing
Design: - Randomly assign users to model-assisted vs. control in real-time - Continuously measure outcomes - Iterate rapidly based on results
Advantages: - Rapid experimentation - Can test multiple model versions - Common in tech industry
Challenges in healthcare: - Ethical concerns (different care for similar patients) - Regulatory considerations (IRB approval required) - Contamination (clinicians may share information)
10.6 Beyond Accuracy: Clinical Utility Assessment
Critical insight: A model can be statistically accurate but clinically useless.
Example: - Model predicts hospital mortality with AUC-ROC = 0.85 - But: If it doesn’t change clinical decisions or improve outcomes, what’s the value? - Moreover: If implementing it disrupts workflow or generates alert fatigue, net impact may be negative.
Before deploying any clinical AI:
- Does it change decisions?
- Do those changed decisions improve outcomes?
- Is the improvement worth the cost (financial, workflow disruption, alert burden)?
If you can’t answer “yes” to all three, don’t deploy.
10.6.1 Decision Curve Analysis (DCA)
Developed by: Vickers & Elkin, 2006, Medical Decision Making
Purpose: Assess the clinical net benefit of using a prediction model compared to alternative strategies.
Concept: A model is clinically useful only if using it leads to better decisions than: - Treating everyone - Treating no one - Using clinical judgment alone
10.6.1.1 How Decision Curve Analysis Works
For each possible risk threshold \(p_t\) (e.g., “treat if risk >10%”):
Calculate Net Benefit (NB):
\[\text{NB}(p_t) = \frac{TP}{N} - \frac{FP}{N} \times \frac{p_t}{1 - p_t}\]
Where: - \(TP/N\) = proportion of all patients who are true positives (the benefit of correctly treating disease) - \(FP/N \times p_t/(1-p_t)\) = proportion who are false positives, weighted by the harm of unnecessary treatment
Interpretation: - If treating disease has high benefit relative to harm of unnecessary treatment → lower \(p_t\) threshold - If treating disease has low benefit relative to harm → higher \(p_t\) threshold
Weight \(p_t/(1-p_t)\): Reflects how much we weight false positives. - At \(p_t\) = 0.10: Weight = 0.10/0.90 ≈ 0.11 (FP weighted 1/9 as much as TP) - At \(p_t\) = 0.50: Weight = 0.50/0.50 = 1.00 (FP and TP equally weighted)
10.6.1.2 DCA Plot and Interpretation
Create DCA plot: - X-axis: Threshold probability (risk at which you’d intervene) - Y-axis: Net benefit - Plot curves for: - Model: Net benefit using model predictions - Treat all: Net benefit if everyone treated - Treat none: Net benefit if no one treated (= 0)
Interpretation: - Model is useful where its curve is above both “treat all” and “treat none” - Higher net benefit = better clinical value - Range of thresholds where model useful = decision curve clinical range
Example interpretation:
At 15% risk threshold: - Model NB = 0.12 - Treat all NB = 0.05 - Treat none NB = 0.00
Meaning: Using model at 15% threshold is equivalent to correctly treating 12 out of 100 patients with no false positives, compared to only 5 for “treat all” strategy.
Python implementation:
import numpy as np
import matplotlib.pyplot as plt

def calculate_net_benefit(y_true, y_pred_proba, thresholds):
    """Calculate net benefit across thresholds for decision curve analysis"""
    net_benefits = []

    for threshold in thresholds:
        # Classify based on threshold
        y_pred = (y_pred_proba >= threshold).astype(int)

        # Calculate TP, FP, N
        TP = ((y_pred == 1) & (y_true == 1)).sum()
        FP = ((y_pred == 1) & (y_true == 0)).sum()
        N = len(y_true)

        # Net benefit formula
        nb = (TP / N) - (FP / N) * (threshold / (1 - threshold))
        net_benefits.append(nb)

    return np.array(net_benefits)

# Calculate for model, treat all, treat none
thresholds = np.linspace(0.01, 0.99, 100)
nb_model = calculate_net_benefit(y_test, y_pred_proba, thresholds)
nb_treat_all = y_test.mean() - (1 - y_test.mean()) * (thresholds / (1 - thresholds))
nb_treat_none = np.zeros_like(thresholds)

# Plot decision curve
plt.figure(figsize=(10, 6))
plt.plot(thresholds, nb_model, label='Model', linewidth=2)
plt.plot(thresholds, nb_treat_all, label='Treat All', linestyle='--', linewidth=2)
plt.plot(thresholds, nb_treat_none, label='Treat None', linestyle=':', linewidth=2)
plt.xlabel('Threshold Probability', fontsize=12)
plt.ylabel('Net Benefit', fontsize=12)
plt.title('Decision Curve Analysis', fontsize=14, fontweight='bold')
plt.legend()
plt.grid(alpha=0.3)
plt.xlim(0, 0.5)   # Focus on clinically relevant range
plt.show()
For comprehensive tutorial, see Vickers et al., 2019, Diagnostic and Prognostic Research.
10.6.2 Reclassification Metrics
Purpose: Quantify whether new model improves risk stratification compared to existing approach.
Context: You have an existing risk model (or clinical judgment). New model proposed. Does it reclassify patients into more appropriate risk categories?
10.6.2.1 Net Reclassification Improvement (NRI)
Concept: Among events (people with disease), what proportion correctly moved to higher risk? Among non-events, what proportion correctly moved to lower risk?
Formula:
\[\text{NRI} = (\text{NRI}_{\text{events}} + \text{NRI}_{\text{non-events}}) / 2\]
Where: - \(\text{NRI}_{\text{events}}\) = P(moved up | event) - P(moved down | event) - \(\text{NRI}_{\text{non-events}}\) = P(moved down | non-event) - P(moved up | non-event)
Interpretation: - NRI > 0: New model improves classification - NRI < 0: New model worsens classification - Typically report with 95% CI
Example:
| Group | Moved Up | Stayed | Moved Down | NRI Component |
|---|---|---|---|---|
| Events (n=100) | 35 | 50 | 15 | (35-15)/100 = 0.20 |
| Non-events (n=900) | 50 | 800 | 50 | (50-50)/900 = 0.00 |
NRI = (0.20 + 0.00) / 2 = 0.10
Interpretation: Net 10% improvement in classification.
For detailed explanation, see Pencina et al., 2008, Statistics in Medicine.
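The worked example above can be reproduced directly from the reclassification counts; a minimal sketch using the averaged-NRI convention adopted in this chapter (note that some sources report the unweighted sum of the two components instead):

```python
# Counts from the reclassification table above
events_up, events_down, n_events = 35, 15, 100
nonevents_up, nonevents_down, n_nonevents = 50, 50, 900

# Among events, moving UP is an improvement; among non-events, moving DOWN is
nri_events = (events_up - events_down) / n_events              # (35 - 15)/100 = 0.20
nri_nonevents = (nonevents_down - nonevents_up) / n_nonevents  # (50 - 50)/900 = 0.00

nri = (nri_events + nri_nonevents) / 2                         # averaged convention used here
print(f"NRI = {nri:.2f}")                                      # 0.10
```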
10.6.2.2 Integrated Discrimination Improvement (IDI)
Concept: Difference in average predicted probabilities between events and non-events.
Formula:
\[\text{IDI} = [\overline{P}_{\text{new}}(\text{events}) - \overline{P}_{\text{old}}(\text{events})] - [\overline{P}_{\text{new}}(\text{non-events}) - \overline{P}_{\text{old}}(\text{non-events})]\]
Interpretation: - How much does new model increase separation between events and non-events? - IDI > 0: Better discrimination - Less sensitive to arbitrary cut-points than NRI
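A minimal sketch of the IDI calculation, assuming p_old and p_new are predicted probabilities from the existing and proposed models for the same patients and y_true holds the observed outcomes:

```python
import numpy as np

def idi(p_old, p_new, y_true):
    """Change in mean predicted-probability separation between events and non-events."""
    p_old, p_new, y_true = map(np.asarray, (p_old, p_new, y_true))
    events = y_true == 1
    gain_events = p_new[events].mean() - p_old[events].mean()
    gain_nonevents = p_new[~events].mean() - p_old[~events].mean()
    return gain_events - gain_nonevents

# Example usage (arrays assumed available from your own models):
# print(f"IDI = {idi(p_old, p_new, y_test):.3f}")
```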
10.7 Fairness and Equity in Evaluation
AI systems can exhibit disparate performance across demographic groups, even when overall performance appears strong.
Failure to assess fairness can: - Perpetuate or amplify existing health disparities - Result in differential quality of care based on race, gender, socioeconomic status - Violate ethical principles of justice and equity - Expose organizations to legal liability
Assessing fairness is not optional—it’s essential.
10.7.1 Mathematical Definitions of Fairness
Challenge: Multiple, often conflicting, definitions of fairness exist.
10.7.1.1 1. Demographic Parity (Statistical Parity)
Definition: Positive prediction rates equal across groups
\[P(\hat{Y}=1 | A=0) = P(\hat{Y}=1 | A=1)\]
where \(A\) = protected attribute (e.g., race, gender)
Example: Model predicts high risk for 20% of White patients and 20% of Black patients
When appropriate: - Resource allocation (equal access to interventions) - Contexts where base rates should be equal
Problem: Ignores actual outcome rates. If disease prevalence differs between groups (due to structural factors), enforcing demographic parity may reduce overall accuracy.
10.7.1.2 2. Equalized Odds (Equal Opportunity)
Definition: True positive and false positive rates equal across groups
\[P(\hat{Y}=1 | Y=1, A=0) = P(\hat{Y}=1 | Y=1, A=1)\] \[P(\hat{Y}=1 | Y=0, A=0) = P(\hat{Y}=1 | Y=0, A=1)\]
Example: 85% sensitivity for both White and Black patients; 90% specificity for both
When appropriate: - Clinical diagnosis and screening - When both types of errors (false positives and false negatives) matter
More clinically relevant than demographic parity in most healthcare applications.
10.7.1.3 3. Calibration Fairness
Definition: Predicted probabilities calibrated for all groups
\[P(Y=1 | \hat{Y}=p, A=0) = P(Y=1 | \hat{Y}=p, A=1) = p\]
Example: Among patients predicted 30% risk, ~30% in each group actually experience outcome
When appropriate: - Risk prediction for clinical decision-making - When predicted probabilities guide treatment thresholds
Most important for clinical applications where decisions based on predicted probabilities.
10.7.1.4 4. Predictive Parity
Definition: Positive predictive values equal across groups
\[P(Y=1 | \hat{Y}=1, A=0) = P(Y=1 | \hat{Y}=1, A=1)\]
Example: Among patients predicted positive, same proportion are true positives in both groups
When appropriate: - When acting on positive predictions (e.g., treatment initiation)
10.7.2 The Impossibility Theorem
Fundamental challenge: Chouldechova, 2017, FAT and Kleinberg et al., 2017, ITCS proved:
If base rates differ between groups, you cannot simultaneously satisfy: 1. Calibration 2. Equalized odds 3. Predictive parity
Implication: Must choose which fairness criterion to prioritize based on context and values.
For healthcare: Calibration typically most important (want predicted probabilities to mean the same thing across groups).
10.7.3 Practical Fairness Assessment
10.7.3.1 Step-by-Step Fairness Audit
Step 1: Define Protected Attributes
Identify characteristics that should not influence model performance: - Race/ethnicity - Gender/sex - Age - Socioeconomic status (income, insurance, ZIP code) - Language - Disability status
Step 2: Stratify Performance Metrics
Calculate metrics separately for each subgroup:
from sklearn.metrics import recall_score, precision_score, roc_auc_score, brier_score_loss
import pandas as pd

# Example: Performance by race/ethnicity
groups = data.groupby('race')

fairness_metrics = []
for race, group_data in groups:
    y_true = group_data['outcome']
    y_pred = group_data['prediction']

    metrics = {
        'race': race,
        'n': len(group_data),
        'prevalence': y_true.mean(),
        'sensitivity': recall_score(y_true, y_pred > 0.5),
        'specificity': recall_score(1 - y_true, 1 - (y_pred > 0.5)),
        'PPV': precision_score(y_true, y_pred > 0.5),
        'NPV': precision_score(1 - y_true, 1 - (y_pred > 0.5)),
        'AUC': roc_auc_score(y_true, y_pred),
        'Brier': brier_score_loss(y_true, y_pred)
    }
    fairness_metrics.append(metrics)

fairness_df = pd.DataFrame(fairness_metrics)
print(fairness_df)
Step 3: Assess Calibration by Subgroup
from sklearn.calibration import calibration_curve
import matplotlib.pyplot as plt

# Calibration plots by race
fig, axes = plt.subplots(1, len(groups), figsize=(15, 5))

for idx, (race, group_data) in enumerate(groups):
    y_true = group_data['outcome']
    y_pred = group_data['prediction']

    prob_true, prob_pred = calibration_curve(y_true, y_pred, n_bins=10)

    axes[idx].plot(prob_pred, prob_true, marker='o', label=race)
    axes[idx].plot([0, 1], [0, 1], 'k--', label='Perfect calibration')
    axes[idx].set_title(f'{race} (n={len(group_data)})')
    axes[idx].set_xlabel('Predicted Probability')
    axes[idx].set_ylabel('Observed Frequency')
    axes[idx].legend()
Step 4: Identify Disparities
Calculate disparity metrics:
Absolute disparity: Difference between groups
sens_white = metrics_white['sensitivity']
sens_black = metrics_black['sensitivity']
disparity_abs = sens_white - sens_black
Relative disparity: Ratio between groups
disparity_rel = sens_white / sens_black
Threshold for concern: - Absolute disparity >5 percentage points - Relative disparity >1.1 or <0.9 (10% difference)
Step 5: Investigate Root Causes
Potential causes of disparities:
- Data representation
- Underrepresentation in training data
- Different sample sizes → unstable estimates for small groups
- Label bias
- Outcome labels reflect biased processes (e.g., healthcare access disparities)
- Example: Hospitalization rates lower in group with less access, not because they’re healthier
- Feature bias
- Features proxy for protected attributes
- Example: ZIP code strongly correlated with race
- Measurement bias
- Different data quality across groups
- Example: Pulse oximetry less accurate in dark skin (Sjoding et al., 2020, NEJM)
- Prevalence differences
- True differences in disease prevalence
- May be due to structural factors (e.g., environmental exposures)
Step 6: Mitigation Strategies
Pre-processing (adjust training data): - Increase representation of underrepresented groups (oversampling, synthetic data) - Re-weight samples to balance groups - Remove or transform biased features
In-processing (modify algorithm): - Add fairness constraints during training - Adversarial debiasing (penalize predictions that reveal protected attribute) - Multi-objective optimization (accuracy + fairness)
Post-processing (adjust predictions): - Separate thresholds per group to achieve equalized odds - Calibration adjustment per group - Reject option classification (defer to human for uncertain cases)
Structural interventions: - Address root causes (improve data collection for underrepresented groups) - Partner with communities to ensure appropriate representation - Consider whether model should be deployed if disparities cannot be adequately mitigated
For comprehensive fairness toolkit, see Fairlearn by Microsoft.
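As one hedged illustration of the post-processing option above, the sketch below picks a separate decision threshold per group so that each group reaches roughly the same sensitivity; it reuses the 'race', 'outcome', and 'prediction' columns assumed in the audit code earlier.

```python
import numpy as np

def threshold_for_target_sensitivity(y_true, y_prob, target_sensitivity=0.85):
    """Highest threshold that still achieves the target sensitivity within a group."""
    for t in np.sort(np.unique(y_prob))[::-1]:   # scan thresholds from high to low
        y_pred = (y_prob >= t).astype(int)
        tp = ((y_pred == 1) & (y_true == 1)).sum()
        fn = ((y_pred == 0) & (y_true == 1)).sum()
        if tp / max(tp + fn, 1) >= target_sensitivity:
            return t
    return float(np.min(y_prob))

# data assumed to have 'race', 'outcome', 'prediction' columns as in the audit code above
group_thresholds = {
    race: threshold_for_target_sensitivity(g['outcome'].values, g['prediction'].values)
    for race, g in data.groupby('race')
}
print(group_thresholds)
```

Equalizing one error rate this way usually shifts others (and can affect calibration), so any group-specific thresholds should themselves be audited before deployment.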
10.7.4 Landmark Bias Case Study
10.8 Implementation Outcomes: Beyond Technical Performance
Even models with strong technical performance can fail in practice if not properly implemented.
10.8.1 The Implementation Science Framework
Proctor et al., 2011 define implementation outcomes:
10.8.1.1 1. Acceptability
Definition: Perception that system is agreeable/satisfactory
Measures: - User satisfaction surveys (Likert scales, Net Promoter Score) - Qualitative interviews (what do users like/dislike?) - Perceived usefulness and ease of use
Example questions: - “This system improves my clinical decision-making” (1-5 scale) - “I would recommend this system to colleagues” (yes/no)
10.8.1.2 2. Adoption
Definition: Intention/action to use the system
Measures: - Utilization rate (% of eligible cases where system used) - Number of users who have activated/logged in - Time to initial use
Red flag: Low adoption despite availability suggests problems with acceptability, workflow fit, or perceived utility.
10.8.1.3 3. Appropriateness
Definition: Perceived fit for setting/population/problem
Measures: - Stakeholder perception surveys - Alignment with clinical workflows (workflow mapping) - Relevance to clinical questions
Example: ICU mortality prediction may be appropriate for ICU but inappropriate for outpatient clinic.
10.8.1.4 4. Feasibility
Definition: Ability to successfully implement
Measures: - Technical integration challenges (API compatibility, data availability) - Resource requirements (cost, staff time, training) - Infrastructure needs (computing, network)
10.8.1.5 5. Fidelity
Definition: Degree to which system used as designed
Measures: - Override rates (how often do clinicians dismiss alerts?) - Deviation from intended use (using for wrong purpose) - Workarounds (users circumventing system)
High override rates signal problems: - Too many false positives (alert fatigue) - Predictions don’t match clinical judgment (trust issues) - Workflow disruption (alerts at wrong time)
10.8.1.6 6. Penetration
Definition: Integration across settings/populations
Measures: - Number of sites/units using system - Proportion of target population reached - Geographic spread
10.8.1.7 7. Sustainability
Definition: Continued use over time
Measures: - Retention of users over 6-12 months - Model updating/maintenance plan - Long-term performance monitoring
Common failure: “Pilot-itis” — successful pilot, but system not sustained after initial implementation period.
10.8.2 Common Implementation Failures
10.8.2.1 1. Alert Fatigue
Problem: Excessive false alarms → clinicians ignore alerts
Evidence: Ancker et al., 2017, BMJ Quality & Safety - Drug-drug interaction alerts overridden 49-96% of time.
Example: Epic sepsis model - 93% false positive rate → clinicians stopped paying attention.
Solutions: - Minimize false positives (sacrifice sensitivity if needed) - Tiered alerts (critical vs. informational) - Smart timing (deliver when actionable, not during documentation) - Actionable recommendations (“Order blood cultures” not “Consider sepsis”)
10.8.2.2 2. Workflow Disruption
Problem: System doesn’t integrate smoothly into existing processes
Examples: - Extra clicks required - Separate application (need to switch contexts) - Alerts interrupt at inopportune times (during patient exam)
Solutions: - User-centered design (involve clinicians early and often) - Embed in existing EHR workflows - Minimize friction (one-click actions)
For workflow integration principles, see Bates et al., 2003, NEJM on clinical decision support systems.
10.8.2.3 3. Lack of Trust
Problem: Clinicians don’t trust “black box” predictions
Example: Deep learning model provides risk score with no explanation
Solutions: - Provide explanations (SHAP values, attention weights) - Show evidence base (similar cases, supporting literature) - Transparent validation (publish performance data) - Gradual trust-building (start with low-stakes recommendations)
10.8.2.4 4. Model Drift
Problem: Performance degrades over time as data distribution changes
Example: COVID-19 pandemic changed disease patterns → pre-pandemic models failed
Solutions: - Continuous monitoring (track performance metrics over time) - Automated alerts for performance degradation - Regular retraining schedule - Triggers for urgent retraining (sudden performance drop)
Framework: Davis et al., 2017, JAMIA - Clinical prediction models degrade; most need recalibration after 2-3 years.
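A hedged sketch of the continuous-monitoring idea: compute AUC-ROC per calendar month on a running log of predictions and flag months that fall well below the validation-period baseline. The prediction-log layout and the 0.05 alert margin are assumptions for illustration.

```python
import pandas as pd
from sklearn.metrics import roc_auc_score

def monthly_auc(df, date_col='date', y_col='outcome', prob_col='prediction'):
    """AUC-ROC per calendar month, to surface performance drift over time."""
    df = df.copy()
    df['month'] = pd.to_datetime(df[date_col]).dt.to_period('M')
    return df.groupby('month').apply(
        lambda g: roc_auc_score(g[y_col], g[prob_col]) if g[y_col].nunique() == 2 else float('nan')
    )

# predictions_log assumed: one row per scored patient with date, observed outcome, predicted probability
auc_by_month = monthly_auc(predictions_log)
baseline = auc_by_month.iloc[:3].mean()                  # e.g., AUC over the first months after go-live
alerts = auc_by_month[auc_by_month < baseline - 0.05]    # flag drops of more than 0.05 AUC
print(alerts)
```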
10.9 Comprehensive Evaluation Framework
10.9.1 Complete Evaluation Checklist
Use this when evaluating AI systems:
10.10 Reporting Guidelines
10.10.1 TRIPOD: Transparent Reporting of Prediction Models
Collins et al., 2015, BMJ - TRIPOD statement
22-item checklist for prediction model studies:
Title and Abstract 1. Identify as prediction model study 2. Summary of objectives, design, setting, participants, outcome, prediction model, results
Introduction 3. Background and objectives 4. Rationale for development or validation
Methods - Source of Data 5. Study design and data sources 6. Eligibility criteria and study period
Methods - Participants 7. Participant characteristics 8. Outcome definition 9. Predictors (features) clearly defined
Methods - Sample Size 10. Sample size determination
Methods - Missing Data 11. How missing data were handled
Methods - Model Development 12. Statistical methods for model development 13. Model selection procedure 14. Model performance measures
Results - Participants 15. Participant flow diagram 16. Descriptive characteristics
Results - Model Specification 17. Model specification (all parameters) 18. Model performance (discrimination and calibration)
Discussion 19. Interpretation (clinical meaning, implications) 20. Limitations 21. Implications for practice
Other 22. Funding and conflicts of interest
TRIPOD-AI extension (in development): Additional items for AI/ML models: - Training/validation/test set composition - Data augmentation - Hyperparameter tuning - Computational environment - Algorithm selection process
10.10.2 STARD-AI: Standards for Reporting Diagnostic Accuracy Using AI
Extension of STARD guidelines for diagnostic AI.
Additional items: - Model architecture details - Training procedure (epochs, batch size, optimization) - Validation strategy - External validation results - Subgroup analyses - Calibration assessment - Comparison to human performance (if applicable)
10.11 Critical Appraisal of Published Studies
10.11.1 Systematic Evaluation Framework
When reading AI studies:
10.11.1.1 1. Study Design and Data Quality
Questions: - Representative sample of target population? - External validation performed? - Test set truly independent? - Outcome objectively defined and consistently measured? - Potential for data leakage?
Red flags: - No external validation - Small sample size (<500 events) - Convenience sampling - Vague outcome definitions - Feature engineering on entire dataset before splitting
10.11.1.2 2. Model Development and Reporting
Questions: - Multiple models compared? - Simple baseline included (logistic regression)? - Hyperparameters tuned on separate validation set? - Feature selection appropriate? - Model clearly described?
Red flags: - No baseline comparison - Hyperparameter tuning on test set - Inadequate model description - No cross-validation
10.11.1.3 3. Performance Evaluation
Questions: - Appropriate metrics for task? - Confidence intervals provided? - Calibration assessed? - Multiple metrics reported? - Statistical testing appropriate?
Red flags: - Only accuracy reported (especially for imbalanced data) - No calibration assessment - No confidence intervals - Cherry-picked metrics
10.11.1.4 4. Fairness and Generalizability
Questions: - Performance stratified by subgroups? - Diverse populations included? - Generalizability limitations discussed? - Potential biases identified?
Red flags: - No subgroup analysis - Homogeneous study population - Claims of broad generalizability without external validation - Dismissal of fairness concerns
10.11.1.5 5. Clinical Utility
Questions: - Clinical utility assessed (beyond accuracy)? - Compared to current practice? - Implementation considerations discussed? - Cost-effectiveness assessed?
Red flags: - Only technical metrics - No comparison to existing approaches - No implementation discussion - Overstated clinical claims
10.11.1.6 6. Transparency and Reproducibility
Questions: - Code and data available? - Reporting guidelines followed? - Sufficient detail to reproduce? - Limitations clearly stated? - Conflicts of interest disclosed?
Red flags: - No code/data availability - Insufficient methodological detail - Overstated conclusions - Undisclosed industry funding
10.12 Key Takeaways
Evaluation is multidimensional — Technical performance, clinical utility, fairness, and implementation outcomes all matter
Internal validation is insufficient — External validation on independent data is essential to assess generalizability
Calibration is critical — Predicted probabilities must be meaningful for clinical decisions, not just discriminative
Assess fairness proactively — Stratify performance by demographic subgroups; disparities invisible otherwise
Clinical utility ≠ statistical performance — A model can be statistically accurate but clinically useless without improving outcomes
Prospective validation is the gold standard — Real-world testing provides strongest evidence
Common pitfalls are avoidable — Data leakage, improper CV, threshold optimization on test set lead to overoptimistic estimates
Implementation determines success — Even well-performing models fail if workflow integration ignored
Transparency enables trust — Follow reporting guidelines (TRIPOD, STARD-AI); share code and data when possible
Continuous monitoring is essential — Model performance drifts over time; plan for ongoing evaluation and updating
Check Your Understanding
Test your knowledge of the key concepts from this chapter. Click “Show Answer” to reveal the correct response and explanation.
You’re building a model to predict hospital readmissions using data from 2018-2023. Which cross-validation strategy is MOST appropriate?
- 10-fold random cross-validation
- Leave-one-out cross-validation
- Stratified K-fold cross-validation
- Time-based forward-chaining cross-validation
Answer: d) Time-based forward-chaining cross-validation
Explanation: Time-based (temporal) cross-validation is essential for healthcare data with temporal dependencies:
Why temporal CV is critical:
Fold 1: Train 2018-2019 → Test 2020
Fold 2: Train 2018-2020 → Test 2021
Fold 3: Train 2018-2021 → Test 2022
Fold 4: Train 2018-2022 → Test 2023
What this tests: - Model performance as deployed (using past to predict future) - Robustness to temporal drift (treatment changes, policy updates) - Realistic performance estimates
Why not random K-fold (a)? Creates data leakage:
Train: [2019, 2021, 2023]
Test: [2018, 2020, 2022]
You'd be using 2023 data to predict 2018, which artificially inflates performance.
Why not leave-one-out (b)? - Computationally expensive - Still has temporal leakage problem - High variance in estimates
Why not stratified K-fold (c)? - Useful for class imbalance - But still allows temporal leakage - Doesn’t test temporal robustness
Real-world impact: Models validated with random CV often show 10-20% performance drops when deployed because they never faced forward-looking prediction during validation.
Lesson: Healthcare data has temporal structure. Always validate as you’ll deploy—using past to predict future, never the reverse.
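As a concrete illustration, here is a minimal sketch of year-based forward-chaining validation, assuming a pandas DataFrame `df` with an `admission_year` column aligned to a feature matrix `X` and binary labels `y`, plus any scikit-learn-style classifier. All names are illustrative, not a prescribed implementation.

```python
# Hypothetical sketch of forward-chaining (temporal) cross-validation.
# Assumes df['admission_year'] is aligned row-for-row with X and y.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

def forward_chaining_cv(df, X, y, model, year_col='admission_year'):
    """Train on all years strictly before each test year; never the reverse."""
    aucs = {}
    years = sorted(df[year_col].unique())
    for test_year in years[1:]:                        # earliest year is training-only
        train_mask = (df[year_col] < test_year).to_numpy()
        test_mask = (df[year_col] == test_year).to_numpy()
        model.fit(X[train_mask], y[train_mask])
        preds = model.predict_proba(X[test_mask])[:, 1]
        aucs[test_year] = roc_auc_score(y[test_mask], preds)  # assumes both classes occur each year
    return aucs

# e.g. forward_chaining_cv(df, X, y, LogisticRegression(max_iter=1000))
```

Reporting the per-year AUCs (rather than a single average) also shows whether performance is drifting over time.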
A cancer risk model predicts 20% risk for 1,000 patients. In reality, 300 of these patients develop cancer. What does this indicate?
- a) The model is well-calibrated
- b) The model is overconfident (underestimates risk)
- c) The model is underconfident (overestimates risk)
- d) The model has good discrimination but poor calibration
Answer: b) The model is overconfident (underestimates risk)
Explanation: Calibration compares predicted probabilities to observed outcomes:
Analysis: - Predicted: 20% of 1,000 patients = 200 patients expected to develop cancer - Observed: 300 patients actually developed cancer - Gap: Predicted 200, observed 300 → Underestimating risk
Calibration terminology: - Well-calibrated: Predicted ≈ Observed (20% predicted → 20% observed) - Overconfident/Underestimate: Predicted < Observed (20% predicted → 30% observed) ✓ This case - Underconfident/Overestimate: Predicted > Observed (20% predicted → 10% observed)
Why it matters:
    # Clinical decision: Treat if risk > 25%
    model.predict_proba(patient)   # = 0.20 → Below threshold → No treatment
    # Reality: True risk was 0.30
    # Patient should have been treated!
How to detect: 1. Calibration plot: Predicted vs observed by risk bin 2. Brier score: Mean squared error of probabilities 3. Expected Calibration Error (ECE): Average absolute calibration error
Lesson: High AUC doesn’t guarantee calibration. When predictions inform decisions with probability thresholds, calibration is critical. Always check calibration plots, not just discrimination metrics.
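As an illustration of the detection methods just listed, here is a minimal sketch of a calibration check, assuming arrays `y_true` (0/1 outcomes) and `y_prob` (predicted probabilities) from a held-out set; the simple unweighted ECE here is an assumption for brevity.

```python
# Hypothetical calibration check; y_true and y_prob come from held-out data.
import numpy as np
from sklearn.calibration import calibration_curve
from sklearn.metrics import brier_score_loss

def calibration_report(y_true, y_prob, n_bins=10):
    # Observed event rate vs. mean predicted probability within each risk bin
    obs, pred = calibration_curve(y_true, y_prob, n_bins=n_bins, strategy='quantile')
    ece = float(np.mean(np.abs(obs - pred)))     # simple, unweighted calibration error
    brier = brier_score_loss(y_true, y_prob)     # mean squared error of probabilities
    return {'bin_predicted': pred, 'bin_observed': obs, 'ece': ece, 'brier': brier}

# A well-calibrated model has bin_observed ≈ bin_predicted across all bins.
```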
Your sepsis model achieves AUC 0.88 on internal test set. You test on external hospitals and get AUC 0.72-0.82 (varying by site). What does this variability indicate?
- a) The model is overfitting
- b) External sites have poor data quality
- c) There is substantial site-specific heterogeneity
- d) The model should not be used
Answer: c) There is substantial site-specific heterogeneity
Explanation: Performance variability across sites reveals important heterogeneity:
What varies between hospitals:
- Patient populations:
- Demographics (age, race, socioeconomic status)
- Disease severity (tertiary referral vs community hospital)
- Comorbidity profiles
- Clinical practices:
- Sepsis protocols (early vs delayed antibiotics)
- ICU admission criteria
- Documentation practices
- Infrastructure:
- EHR systems (Epic vs Cerner vs homegrown)
- Lab equipment (different reference ranges)
- Staffing models (nurse-to-patient ratios)
- Data capture:
- Missing data patterns
- Measurement frequency
- Feature definitions
Why not overfitting (a)? Overfitting shows up as a gap between training and test performance within the same dataset. Here, the internal test was fine (0.88)—it's external generalization that varies.
Why not poor data quality (b)? Could contribute, but more likely reflects legitimate differences in populations and practices.
Why not unusable (d)? AUC 0.72-0.82 is still useful! But indicates need for: - Site-specific calibration - Understanding what drives differences - Possibly site-specific models or adjustments
Best practice: External validation almost always shows performance drops. Variability across sites is normal and informative—reveals where model struggles and needs adaptation.
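One common response to this kind of heterogeneity is local recalibration before deployment at each site. Below is a minimal, hypothetical sketch of logistic (Platt-style) recalibration per site, assuming a local validation DataFrame with columns `site`, `y` (observed outcome), and `p` (the original model's predicted probability); column names are illustrative.

```python
# Hypothetical per-site logistic recalibration of an existing model's probabilities.
import numpy as np
from sklearn.linear_model import LogisticRegression

def fit_site_recalibrators(df):
    """Fit one logistic recalibration model per site on the logit of p."""
    recalibrators = {}
    for site, grp in df.groupby('site'):
        p = np.clip(grp['p'].to_numpy(), 1e-6, 1 - 1e-6)   # avoid logit of exactly 0 or 1
        logit_p = np.log(p / (1 - p)).reshape(-1, 1)
        recalibrators[site] = LogisticRegression().fit(logit_p, grp['y'])
    return recalibrators

def recalibrate(p, recalibrator):
    """Map the original model's probabilities onto the local outcome scale."""
    p = np.clip(np.asarray(p, dtype=float), 1e-6, 1 - 1e-6)
    logit_p = np.log(p / (1 - p)).reshape(-1, 1)
    return recalibrator.predict_proba(logit_p)[:, 1]
```

Recalibration adjusts the probability scale to local outcome rates; it does not fix poor discrimination, which may require site-specific features or retraining.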
True or False: If a model improvement is statistically significant (p < 0.05), it is clinically meaningful and should be deployed.
Answer: False
Explanation: Statistical significance ≠ clinical significance. Both are necessary but neither alone is sufficient:
Statistical significance: - Tests if difference is unlikely due to chance - Depends on sample size (large N → small differences become significant) - Answers: “Is there an effect?”
Clinical significance: - Tests if difference matters for patient care - Independent of sample size - Answers: “Is the effect large enough to care?”
Example:
    # New model vs baseline
    results = {
        'baseline_auc': 0.820,
        'new_model_auc': 0.825,
        'difference': 0.005,
        'p_value': 0.03,        # Statistically significant
        'sample_size': 50000    # Large sample
    }
Analysis: - ✅ Statistically significant: p = 0.03 < 0.05 - ❌ Clinically insignificant: a 0.005 AUC improvement is negligible - Why significant? A large sample detects tiny differences - Should deploy? No—not worth the cost and disruption
Lesson: Always evaluate both statistical and clinical significance. With large samples, trivial differences become statistically significant. Ask: “Is this improvement large enough to change practice?” Consider effect sizes, confidence intervals, and practical impact—not just p-values.
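To keep the focus on effect size rather than the p-value alone, one option is a paired bootstrap of the AUC difference. A minimal sketch, assuming both models were scored on the same test set; `y_true`, `p_baseline`, and `p_new` are illustrative names.

```python
# Hypothetical paired bootstrap for the AUC difference between two models.
import numpy as np
from sklearn.metrics import roc_auc_score

def bootstrap_auc_difference(y_true, p_baseline, p_new, n_boot=2000, seed=0):
    rng = np.random.default_rng(seed)
    y_true, p_baseline, p_new = map(np.asarray, (y_true, p_baseline, p_new))
    n, diffs = len(y_true), []
    for _ in range(n_boot):
        idx = rng.integers(0, n, n)                    # resample patients with replacement
        if len(np.unique(y_true[idx])) < 2:            # AUC needs both classes present
            continue
        diffs.append(roc_auc_score(y_true[idx], p_new[idx]) -
                     roc_auc_score(y_true[idx], p_baseline[idx]))
    lo, hi = np.percentile(diffs, [2.5, 97.5])
    return float(np.mean(diffs)), (float(lo), float(hi))

# Ask whether the interval clears a clinically meaningful improvement,
# not just whether it excludes zero.
```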
Two models have been evaluated: - Model A: AUC 0.85 (95% CI: 0.83-0.87) - Model B: AUC 0.86 (95% CI: 0.79-0.93)
Which statement is correct?
- a) Model B is definitely better because it has higher AUC
- b) Model A is more reliable because it has a narrower confidence interval
- c) The models cannot be compared without more information
- d) Model B is better if you’re willing to accept more uncertainty
Answer: b) Model A is more reliable because it has a narrower confidence interval
Explanation: Confidence intervals reveal precision/uncertainty, not just point estimates:
Model A: - AUC: 0.85 - 95% CI: 0.83-0.87 - Width: 0.04 (narrow) - Interpretation: We’re 95% confident true AUC is between 0.83-0.87 (precise estimate)
Model B: - AUC: 0.86 - 95% CI: 0.79-0.93 - Width: 0.14 (wide) - Interpretation: We’re 95% confident true AUC is between 0.79-0.93 (imprecise estimate)
Key insight: CIs overlap substantially (0.83-0.87 vs 0.79-0.93). Cannot conclude Model B is actually better—difference might be due to chance.
In practice: Most organizations prefer Model A: - Predictable performance for planning - Lower risk of underperformance - Easier to set appropriate thresholds - Small gain (0.01 AUC) not worth the uncertainty
Lesson: Always report and consider confidence intervals, not just point estimates. Narrow CIs indicate reliable performance. Wide CIs indicate uncertainty—might get much worse (or better) than point estimate suggests.
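If a paper does not report intervals, you can often compute them yourself from predictions on your own validation data, reusing the same resampling idea as the paired bootstrap above. A minimal sketch, assuming arrays `y_true` and `y_prob` (illustrative names):

```python
# Hypothetical bootstrap 95% CI for a single model's AUC; the interval width
# is a direct measure of how precisely performance is estimated.
import numpy as np
from sklearn.metrics import roc_auc_score

def bootstrap_auc_ci(y_true, y_prob, n_boot=2000, seed=0):
    rng = np.random.default_rng(seed)
    y_true, y_prob = np.asarray(y_true), np.asarray(y_prob)
    n, aucs = len(y_true), []
    for _ in range(n_boot):
        idx = rng.integers(0, n, n)
        if len(np.unique(y_true[idx])) < 2:
            continue
        aucs.append(roc_auc_score(y_true[idx], y_prob[idx]))
    lo, hi = np.percentile(aucs, [2.5, 97.5])
    return {'auc': roc_auc_score(y_true, y_prob), 'ci': (float(lo), float(hi)),
            'width': float(hi - lo)}
```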
You evaluate a diagnostic model and find: - Overall AUC: 0.84 - Men: AUC 0.88 - Women: AUC 0.78
What should you do?
- a) Report only overall performance (0.84)
- b) Report overall performance but note subgroup differences exist
- c) Investigate why women’s performance is lower and consider separate models or adjustments
- d) Conclude the model is biased and should not be used
Answer: c) Investigate why women’s performance is lower and consider separate models or adjustments
Explanation: Subgroup performance disparities require investigation and action, not just reporting:
Why might performance differ? Possible reasons:
- Biological differences:
- Disease presents differently (atypical symptoms in women)
- Different physiological reference ranges
- Example: Heart attack symptoms differ by sex
- Data representation:
- Fewer women in training data → model learns men’s patterns better
- Women may be underdiagnosed historically → labels biased
- Feature appropriateness:
- Features optimized for men
- Missing features relevant for women
- Example: Pregnancy-related factors not included
- Measurement bias:
- Tests/measurements less accurate for women
- Different documentation patterns
Potential solutions:

Collect more women's data (if sample size is the issue)

Add sex-specific features:

    # Include pregnancy status, hormonal factors
    features += ['pregnant', 'menopause_status', 'hormone_therapy']

Stratified modeling:

    # Separate models for men/women
    if patient.sex == 'M':
        prediction = model_men.predict(patient)
    else:
        prediction = model_women.predict(patient)

Weighted loss function:

    # Penalize errors on women more heavily during training
    sample_weights = [2.0 if sex == 'F' else 1.0 for sex in data['sex']]
    model.fit(X, y, sample_weight=sample_weights)
Lesson: Subgroup analysis is mandatory, not optional. When disparities found, investigate root causes and take corrective action. Don’t hide disparities in overall metrics.
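A minimal sketch of the kind of subgroup audit this implies, assuming a validation DataFrame with an outcome column `y`, a predicted-probability column `p`, and one or more grouping columns (all names illustrative):

```python
# Hypothetical subgroup performance audit; extend the metrics as needed
# (sensitivity at the deployment threshold, calibration, PPV, etc.).
import pandas as pd
from sklearn.metrics import roc_auc_score

def subgroup_auc(df, group_col='sex', y_col='y', p_col='p', min_n=100):
    rows = []
    for group, grp in df.groupby(group_col):
        if len(grp) < min_n or grp[y_col].nunique() < 2:
            continue                                  # too small or single-class: AUC undefined
        rows.append({group_col: group,
                     'n': len(grp),
                     'prevalence': grp[y_col].mean(),
                     'auc': roc_auc_score(grp[y_col], grp[p_col])})
    return pd.DataFrame(rows)

# e.g. subgroup_auc(val_df, group_col='sex'); repeat for race, age band, payer, site.
```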
10.13 Discussion Questions
Validation hierarchy: You’ve developed a hospital-acquired infection prediction model. What validation studies would you conduct before deploying? In what order? What evidence would convince you to deploy at other hospitals?
Fairness trade-offs: Your sepsis model has AUC-ROC = 0.85 overall, but sensitivity is 0.90 for White patients vs. 0.75 for Black patients. What would you do? What are the trade-offs of different mitigation approaches?
Calibration vs. discrimination: Model A: AUC-ROC = 0.85, Brier score = 0.30 (poor calibration). Model B: AUC-ROC = 0.80, Brier score = 0.15 (excellent calibration). Which would you deploy? Why?
External validation failure: Your model achieves AUC-ROC = 0.82 in internal validation but 0.68 in external validation at a different hospital. What might explain this? What are your next steps?
Clinical utility skepticism: A model predicts 30-day mortality with AUC-ROC = 0.88. Does this mean it is clinically useful? What additional evaluations are needed?
Prospective study design: You need to evaluate a hospital readmission model prospectively. Would you use an RCT, a stepped-wedge design, or silent mode? What are the trade-offs?
Alert threshold selection: A clinical decision support tool can alert at >10%, >20%, or >30% predicted risk. How would you choose? What factors matter?
Model drift: A COVID-19 forecasting model was trained on 2020 data; it is now 2023 and new variants are circulating. How would you assess whether it is still valid? What would trigger retraining?
10.14 Further Resources
10.14.1 📚 Books
- Steyerberg, 2019, Clinical Prediction Models - Comprehensive guide to prediction modeling
- Barocas et al., 2023, Fairness and Machine Learning - Free online textbook
10.14.2 📄 Essential Papers
Validation: - Collins et al., 2015, BMJ - TRIPOD guidelines 🎯 - Liu et al., 2019, Radiology - Medical imaging AI systematic review 🎯 - Oakden-Rayner et al., 2020, Nature Medicine - Hidden stratification 🎯
Fairness: - Obermeyer et al., 2019, Science - Racial bias case study 🎯 - Chouldechova, 2017, FAT - Impossibility theorem 🎯
Clinical Utility: - Vickers & Elkin, 2006, Medical Decision Making - Decision curve analysis 🎯
Implementation: - Proctor et al., 2011 - Implementation outcomes 🎯
10.14.3 💻 Tools
Metrics and Validation: - Scikit-learn - Comprehensive metrics - Scikit-survival - Survival analysis metrics
Fairness: - Fairlearn - Microsoft fairness toolkit - AI Fairness 360 - IBM toolkit - Aequitas - Bias audit
Explainability: - SHAP - Feature importance - LIME - Local explanations