10  Evaluating AI Systems

Learning Objectives

Tip

By the end of this chapter, you will:

  • Understand fundamental principles of AI system evaluation in public health contexts
  • Select and apply appropriate performance metrics for different types of AI models
  • Design rigorous validation strategies including internal, external, and prospective validation
  • Evaluate AI systems beyond technical performance to include clinical utility and implementation outcomes
  • Identify and avoid common pitfalls in AI evaluation (data leakage, optimization bias, improper cross-validation)
  • Conduct comprehensive fairness and equity assessments across population subgroups
  • Interpret and critically appraise AI evaluation studies in published literature
  • Design prospective evaluation studies including RCTs and implementation trials

Time to complete: 90-120 minutes
Prerequisites: Chapter 2: AI Basics, Chapter 3: Data, Chapter 8: Clinical AI

What you’ll build: 💻 Performance evaluation framework, external validation study design, fairness audit toolkit, critical appraisal checklist, and prospective testing protocol


10.1 Introduction: The Evaluation Crisis in AI

December 2020, Radiology:

Researchers at UC Berkeley publish a comprehensive review of 62 deep learning studies in medical imaging published in high-impact journals.

Their sobering finding: Only 6% performed external validation on data from different institutions.

The vast majority tested models only on hold-out sets from the same dataset used for training—a practice that provides minimal evidence of real-world performance.


July 2021, JAMA Internal Medicine:

Wong et al. publish external validation of Epic’s sepsis prediction model—deployed at >100 US hospitals, affecting millions of patients.

The model’s performance:

  • Sensitivity: 33% (missed 2 out of 3 sepsis cases)
  • Positive predictive value: 7% (93% of alerts were false positives)
  • Early detection: Only 6% of alerts fired before clinical recognition

Conclusion from authors: “The algorithm rarely alerted clinicians to sepsis before it was clinically recognized and had poor predictive accuracy.”

This wasn’t a research study—this was a deployed clinical system used in real patient care.


The Evaluation Gap:

Between lab performance and real-world deployment lies a chasm that has claimed many promising AI systems:

In the lab:

  • Curated, high-quality datasets
  • Balanced classes (50% positive, 50% negative)
  • Consistent protocols
  • Expert-confirmed labels
  • AUC-ROC = 0.95

In the real world:

  • Messy, incomplete data
  • Rare events (1-5% prevalence)
  • Variable protocols across sites
  • Ambiguous cases
  • AUC-ROC = 0.68

The consequences are severe:

Failed deployments — Models that work in development but fail in production
Hidden biases — Systems that perform well on average but poorly for specific groups
Wasted resources — Millions invested in systems that don’t deliver promised benefits
Patient harm — Incorrect predictions leading to inappropriate treatments
Eroded trust — Clinicians lose confidence in AI after experiencing failures

Important: Why This Chapter Matters

Rigorous evaluation is the bridge between AI research and AI implementation. Without it, we’re deploying unvalidated systems and hoping for the best.

This chapter provides a comprehensive framework for evaluating AI systems across five critical dimensions:

  1. Technical performance — Accuracy, calibration, robustness
  2. Generalizability — External validity, temporal stability, geographic transferability
  3. Clinical/Public health utility — Impact on decisions and outcomes
  4. Fairness and equity — Performance across demographic subgroups
  5. Implementation outcomes — Adoption, usability, sustainability

You’ll learn how to evaluate AI systems rigorously, design validation studies, and critically appraise published research.


10.2 The Multidimensional Nature of Evaluation

10.2.1 What Are We Really Evaluating?

Evaluating an AI system is not just about measuring accuracy. In public health and clinical contexts, we need to assess multiple dimensions.

10.2.1.1 1. Technical Performance

Question: Does the model make accurate predictions on new data?

Key aspects:

  • Discrimination: Can the model distinguish between positive and negative cases?
  • Calibration: Do predicted probabilities match observed frequencies?
  • Robustness: Does performance degrade with missing data or noise?
  • Computational efficiency: Speed and resource requirements for deployment

Relevant for: All AI systems (foundational requirement)


10.2.1.2 2. Generalizability

Question: Will the model work in settings different from where it was developed?

Key aspects:

  • Geographic transferability: Performance at different institutions, regions, countries
  • Temporal stability: Does performance degrade as time passes and data distributions shift?
  • Population differences: Performance across different patient demographics and disease prevalence
  • Setting transferability: Hospital vs. primary care vs. community settings

Relevant for: Any system intended for broad deployment

Critical insight: A 2020 Nature Medicine paper showed that AI models often learn “shortcuts”—spurious correlations specific to training data that don’t generalize. For example, pneumonia detection models learned to identify portable X-ray machines (used for sicker patients) rather than actual pneumonia.


10.2.1.3 3. Clinical/Public Health Utility

Question: Does the model improve decision-making and outcomes?

Key aspects:

  • Decision impact: Does it change clinician decisions?
  • Outcome improvement: Does it lead to better patient outcomes?
  • Net benefit: Does it provide value above existing approaches?
  • Cost-effectiveness: Does it provide value commensurate with costs?

Relevant for: Clinical decision support, diagnostic tools

Critical distinction: A model can be statistically accurate but clinically useless. Example: A model predicting hospital mortality with AUC-ROC = 0.85 sounds impressive, but if it doesn’t change management or improve outcomes, it adds no value.

For framework on clinical utility assessment, see Van Calster et al., 2019, BMJ on decision curve analysis.


10.2.1.4 4. Fairness and Equity

Question: Does the model perform equitably across population subgroups?

Key aspects:

  • Subgroup performance: Stratified metrics by race, ethnicity, gender, age, socioeconomic status
  • Error rate disparities: Differential false positive/negative rates
  • Outcome equity: Does deployment narrow or widen health disparities?
  • Representation: Are all groups adequately represented in training data?

Relevant for: All systems affecting humans

Essential reading: Obermeyer et al., 2019, Science - racial bias in healthcare algorithm affecting millions; Gianfrancesco et al., 2018, Nature Communications - sex bias in clinical AI.


10.2.1.5 5. Implementation Outcomes

Question: Is the model adopted and used effectively in practice?

Key aspects:

  • Adoption: Are users actually using it as intended?
  • Usability: Can users operate it efficiently?
  • Workflow integration: Does it fit smoothly into existing processes?
  • Sustainability: Will it continue to be used and maintained over time?

Relevant for: Any deployed system

Framework: Proctor et al., 2011, Administration and Policy in Mental Health - implementation outcome taxonomy.


10.3 The Hierarchy of Evidence for AI Systems

Just as clinical medicine has evidence hierarchies (case reports → cohort studies → RCTs), AI systems should progress through increasingly rigorous validation stages.

10.3.1 Level 1: Development and Internal Validation

What it is: - Split-sample validation (train-test split) or cross-validation on development dataset - Model trained and tested on data from same source

Evidence strength: ⭐ (Weakest)

Value: - Initial proof-of-concept - Model selection and hyperparameter tuning - Feasibility assessment

Limitations: - Optimistic bias (model may overfit to dataset-specific quirks) - No evidence of generalizability - Cannot assess real-world performance

Common in: Early-stage research, algorithm development


10.3.2 Level 2: Temporal Validation

What it is: - Train on data from earlier time period - Test on data from later time period (same source)

Evidence strength: ⭐⭐

Value: - Tests temporal stability - Detects concept drift (changes in data distribution over time) - Better than spatial hold-out from same time period

Limitations: - Still from same institution/setting - May not generalize geographically

Example: Sendak et al., 2020, npj Digital Medicine - demonstrated temporal degradation of sepsis models


10.3.3 Level 3: External Geographic Validation

What it is: - Train on data from one institution/region - Test on data from different institution(s)/region(s)

Evidence strength: ⭐⭐⭐

Value: - Strongest evidence of generalizability without prospective deployment - Tests performance across different patient populations, clinical practices, data collection protocols - Identifies setting-specific dependencies

Limitations: - Still retrospective - Doesn’t assess impact on clinical decisions or outcomes

Gold standard for retrospective evaluation: Collins et al., 2015, BMJ - TRIPOD guidelines recommend external validation as minimal standard.


10.3.4 Level 4: Retrospective Impact Assessment

What it is: - Simulate what would have happened if model had been used - Estimate impact on decision-making without actual deployment

Evidence strength: ⭐⭐⭐

Value: - Estimates potential benefit before prospective deployment - Identifies potential implementation barriers - Justifies resource allocation for prospective studies

Limitations: - Cannot capture changes in clinician behavior - Assumptions about how predictions would be used may be incorrect

Example: Jung et al., 2020, JAMA Network Open - retrospective assessment of deep learning for diabetic retinopathy screening


10.3.5 Level 5: Prospective Observational Studies

What it is: - Model deployed in real clinical practice - Predictions shown to clinicians - Outcomes observed but no experimental control

Evidence strength: ⭐⭐⭐⭐

Value: - Real-world performance data - Identifies implementation challenges (workflow disruption, alert fatigue) - Measures actual usage patterns

Limitations: - Cannot establish causality (improvements may be due to other factors) - Selection bias if clinicians choose when to use model - No counterfactual (what would have happened without model?)

Example: Tomašev et al., 2019, Nature - DeepMind AKI prediction deployed prospectively at VA hospitals


10.3.6 Level 6: Randomized Controlled Trials

What it is: - Randomize patients/clinicians/units to model-assisted vs. control groups - Measure outcomes in both groups - Compare to establish causal effect

Evidence strength: ⭐⭐⭐⭐⭐ (Strongest)

Value: - Definitive evidence of impact on outcomes - Establishes causality - Meets regulatory and reimbursement standards

Limitations: - Expensive and time-consuming - Requires large sample sizes - Ethical considerations (withholding potentially beneficial intervention from control group)

Example: Prospective RCTs of clinical AI remain rare but are emerging. Komorowski et al., 2018, Nature Medicine (RL-based sepsis treatment), for instance, was evaluated only retrospectively, illustrating how few systems have reached this evidence level.

Note: Progression Through Evidence Levels

Best practice: Progress systematically through validation stages:

  1. Internal validation → Establish proof-of-concept
  2. Temporal validation → Test stability over time
  3. External validation → Test generalizability
  4. Retrospective impact → Estimate potential benefit
  5. Prospective observational → Measure real-world performance
  6. RCT → Establish causal impact

Don’t skip steps. Each level provides critical information before committing resources to higher levels.


10.4 Performance Metrics: Choosing the Right Measures

10.4.1 Classification Metrics

For binary classification (disease/no disease, outbreak/no outbreak), numerous metrics exist. No single metric tells the whole story.

10.4.1.1 The Confusion Matrix Foundation

All classification metrics derive from the 2×2 confusion matrix:

|                   | Predicted Positive   | Predicted Negative   |
|-------------------|----------------------|----------------------|
| Actually Positive | True Positives (TP)  | False Negatives (FN) |
| Actually Negative | False Positives (FP) | True Negatives (TN)  |

Example: TB screening of 1,000 individuals; 100 actually have TB

|              | Predicted TB+ | Predicted TB- |
|--------------|---------------|---------------|
| Actually TB+ | 85 (TP)       | 15 (FN)       |
| Actually TB- | 90 (FP)       | 810 (TN)      |

From this matrix, we calculate all other metrics.
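
As a quick sanity check, the core metrics defined in the next section can be computed directly from these four cell counts; a minimal sketch using the TB screening numbers above:

# Cell counts from the TB screening example above
TP, FN, FP, TN = 85, 15, 90, 810

sensitivity = TP / (TP + FN)                   # 85/100  = 0.85
specificity = TN / (TN + FP)                   # 810/900 = 0.90
ppv = TP / (TP + FP)                           # 85/175  ≈ 0.49
npv = TN / (TN + FN)                           # 810/825 ≈ 0.98
accuracy = (TP + TN) / (TP + TN + FP + FN)     # 895/1000 = 0.895

print(f"Sensitivity: {sensitivity:.2f}, Specificity: {specificity:.2f}")
print(f"PPV: {ppv:.2f}, NPV: {npv:.2f}, Accuracy: {accuracy:.3f}")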


10.4.1.2 Core Metrics

1. Sensitivity (Recall, True Positive Rate)

\[\text{Sensitivity} = \frac{TP}{TP + FN} = \frac{TP}{\text{All Actual Positives}}\]

  • Interpretation: Of all actual positives, what proportion did we identify?
  • Example: 85/100 = 85% (identified 85 of 100 TB cases)
  • When to prioritize: High-stakes screening (must catch most cases), early disease detection, rule-out tests
  • Trade-off: Maximizing sensitivity → more false positives

2. Specificity (True Negative Rate)

\[\text{Specificity} = \frac{TN}{TN + FP} = \frac{TN}{\text{All Actual Negatives}}\]

  • Interpretation: Of all actual negatives, what proportion did we correctly identify?
  • Example: 810/900 = 90% (correctly ruled out TB in 810 of 900 healthy people)
  • When to prioritize: Confirmatory tests, when false alarms are costly, rule-in tests
  • Trade-off: Maximizing specificity → more false negatives

3. Positive Predictive Value (Precision, PPV)

\[\text{PPV} = \frac{TP}{TP + FP} = \frac{TP}{\text{All Predicted Positives}}\]

  • Interpretation: Of all predicted positives, what proportion are actually positive?
  • Example: 85/175 = 49% (49% of positive predictions are correct)
  • When to prioritize: When acting on predictions is costly (treatments, interventions)
  • Critical property: Depends heavily on disease prevalence

Prevalence dependence example:

| Scenario               | Prevalence | Sensitivity | Specificity | PPV |
|------------------------|------------|-------------|-------------|-----|
| High-burden TB setting | 10%        | 85%         | 90%         | 49% |
| Low-burden TB setting  | 1%         | 85%         | 90%         | 8%  |

Same model, vastly different PPV! In low-prevalence settings, even high specificity leads to poor PPV.

For detailed explanation, see Altman & Bland, 1994, BMJ on diagnostic tests and prevalence.
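
To make the prevalence dependence concrete, PPV can be recomputed from sensitivity, specificity, and prevalence via Bayes' theorem; a minimal sketch reproducing the table above (the helper function is ours, not from any library):

def ppv_from_prevalence(sensitivity, specificity, prevalence):
    """PPV via Bayes' theorem given sensitivity, specificity, and prevalence."""
    true_pos = sensitivity * prevalence
    false_pos = (1 - specificity) * (1 - prevalence)
    return true_pos / (true_pos + false_pos)

for prev in [0.10, 0.01]:
    print(f"Prevalence {prev:.0%}: PPV = {ppv_from_prevalence(0.85, 0.90, prev):.0%}")
# Prevalence 10%: PPV = 49%
# Prevalence 1%:  PPV = 8%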


4. Negative Predictive Value (NPV)

\[\text{NPV} = \frac{TN}{TN + FN} = \frac{TN}{\text{All Predicted Negatives}}\]

  • Interpretation: Of all predicted negatives, what proportion are actually negative?
  • Example: 810/825 = 98% (98% of negative predictions are correct)
  • When to prioritize: Rule-out tests, when missing disease is catastrophic
  • Critical property: Also depends on prevalence (high prevalence → lower NPV)

5. Accuracy

\[\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} = \frac{\text{Correct Predictions}}{\text{All Predictions}}\]

  • Interpretation: Overall proportion of correct predictions
  • Example: (85+810)/1000 = 89.5%
  • Major limitation: Misleading for imbalanced datasets

Classic pitfall:

Dataset: 1,000 patients, 10 with disease (1% prevalence)

Naive model: Predict “no disease” for everyone - Accuracy: 990/1000 = 99% - But sensitivity = 0% (misses all disease cases!)

Takeaway: Accuracy alone is insufficient, especially for rare events.


6. F1 Score (Harmonic Mean of Precision and Recall)

\[F_1 = 2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}} = \frac{2TP}{2TP + FP + FN}\]

  • Interpretation: Balance between precision and recall
  • Range: 0 (worst) to 1 (perfect)
  • When to use: When you need single metric balancing both concerns
  • Limitation: Ignores true negatives (not suitable when TN important)

Variants: - \(F_2\) score: Weights recall higher than precision - \(F_{0.5}\) score: Weights precision higher than recall


10.4.1.3 Threshold-Independent Metrics

7. Area Under the ROC Curve (AUC-ROC)

The Receiver Operating Characteristic (ROC) curve plots: - Y-axis: True Positive Rate (Sensitivity) - X-axis: False Positive Rate (1 - Specificity)

…across all possible classification thresholds (0 to 1).

AUC-ROC interpretation:

  • 0.5 = Random guessing (diagonal line)
  • 0.6-0.7 = Poor discrimination
  • 0.7-0.8 = Acceptable
  • 0.8-0.9 = Excellent
  • >0.9 = Outstanding (rare in clinical applications)

Alternative interpretation: Probability that a randomly selected positive case is ranked higher than a randomly selected negative case.

Advantages: - Threshold-independent (single summary metric) - Not affected by class imbalance (in terms of metric itself) - Standard metric for model comparison

Limitations: - May overemphasize performance at thresholds you wouldn’t use clinically - Doesn’t indicate optimal threshold - Can be misleading for highly imbalanced data (see Average Precision)

For comprehensive guide, see Hanley & McNeil, 1982, Radiology on the meaning and use of AUC.
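
AUC point estimates should be reported with uncertainty; a minimal percentile-bootstrap sketch is shown below, assuming y_test and y_pred_proba are arrays of true labels and predicted probabilities (the helper function is ours):

import numpy as np
from sklearn.metrics import roc_auc_score

def bootstrap_auc_ci(y_true, y_score, n_boot=1000, alpha=0.05, seed=42):
    """Percentile bootstrap confidence interval for AUC-ROC."""
    rng = np.random.default_rng(seed)
    y_true, y_score = np.asarray(y_true), np.asarray(y_score)
    aucs = []
    for _ in range(n_boot):
        idx = rng.integers(0, len(y_true), len(y_true))   # resample with replacement
        if len(np.unique(y_true[idx])) < 2:               # skip resamples with one class
            continue
        aucs.append(roc_auc_score(y_true[idx], y_score[idx]))
    lower, upper = np.percentile(aucs, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return roc_auc_score(y_true, y_score), (lower, upper)

# Example: auc, (lo, hi) = bootstrap_auc_ci(y_test, y_pred_proba)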


8. Average Precision (Area Under Precision-Recall Curve)

The Precision-Recall (PR) curve plots: - Y-axis: Precision (PPV) - X-axis: Recall (Sensitivity)

…across all thresholds.

Average Precision (AP): Area under PR curve

Why use PR curve instead of ROC? - More informative for imbalanced datasets where positive class is rare - Focuses on performance on the positive class (which you care about more when it’s rare) - ROC can be misleadingly optimistic when negative class dominates

Example: Disease with 1% prevalence

  • AUC-ROC = 0.90 (sounds great!)
  • Average Precision = 0.25 (reveals poor performance on actual disease cases)

When to use: Rare disease detection, outbreak detection, any imbalanced problem

For detailed comparison, see Saito & Rehmsmeier, 2015, PLOS ONE on precision-recall vs. ROC curves.
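
The divergence between AUC-ROC and average precision is easy to reproduce on simulated rare-event data; a minimal scikit-learn sketch (synthetic data with roughly 1% prevalence; exact numbers will vary):

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, average_precision_score
from sklearn.model_selection import train_test_split

# Synthetic data with roughly 1% positive cases
X, y = make_classification(n_samples=50_000, n_features=20, weights=[0.99],
                           random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
proba = model.predict_proba(X_test)[:, 1]

print(f"AUC-ROC:           {roc_auc_score(y_test, proba):.3f}")
print(f"Average precision: {average_precision_score(y_test, proba):.3f}")
# Average precision is typically far below AUC-ROC when positives are rare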


10.4.1.4 Choosing Metrics by Scenario

| Scenario | Primary Metrics | Rationale |
|---|---|---|
| COVID-19 airport screening | Sensitivity, NPV | Must catch most cases; false positives acceptable (confirmatory testing available) |
| Cancer diagnosis confirmation | Specificity, PPV | False positives → unnecessary surgery; high bar for confirmation |
| Automated triage system | AUC-ROC, Calibration | Need good ranking across full risk spectrum |
| Rare disease detection | Average Precision, Sensitivity | Standard AUC-ROC misleading when imbalanced |
| Syndromic surveillance | Sensitivity, Timeliness | Early detection critical; false alarms tolerable (investigation cheap) |
| Clinical decision support | PPV, Calibration | Clinicians ignore if too many false alarms; need well-calibrated probabilities |

10.4.2 Calibration: Do Predicted Probabilities Mean What They Say?

Calibration assesses whether predicted probabilities match observed frequencies.

Example of well-calibrated model: - Model predicts “30% risk of readmission” for 100 patients - About 30 of those 100 are actually readmitted - Predicted probability ≈ observed frequency

Poor calibration: - Model predicts “30% risk” but 50% are actually readmitted → underconfident - Model predicts “30% risk” but 15% are actually readmitted → overconfident


10.4.2.1 Measuring Calibration

1. Calibration Plot

Method:

  1. Bin predictions into groups (e.g., 0-10%, 10-20%, …, 90-100%)
  2. For each bin, calculate the mean predicted probability (x-axis) and the observed frequency of the outcome (y-axis)
  3. Plot the points
  4. Perfect calibration: points lie on the diagonal line (y = x)

Interpretation: - Points above diagonal: Model underconfident (predicts lower risk than reality) - Points below diagonal: Model overconfident (predicts higher risk than reality)


2. Brier Score

\[\text{Brier Score} = \frac{1}{N} \sum_{i=1}^{N} (p_i - y_i)^2\]

where \(p_i\) = predicted probability, \(y_i\) = actual outcome (0 or 1)

  • Range: 0 (perfect) to 1 (worst)
  • Lower is better
  • Combines discrimination and calibration into single metric
  • Can be decomposed into calibration and refinement components

Interpretation (binary outcome with ~50% prevalence):

  • 0.25 = uninformative baseline (predicting 0.5 for every case)
  • <0.15 = good calibration
  • <0.10 = excellent calibration

Note that the baseline depends on prevalence: predicting the observed prevalence \(p\) for everyone yields a Brier score of \(p(1-p)\), so rarer outcomes have lower baselines and these rules of thumb should be interpreted accordingly.

For Brier score deep dive, see Rufibach, 2010, Clinical Trials.


3. Expected Calibration Error (ECE)

\[\text{ECE} = \sum_{m=1}^{M} \frac{n_m}{N} |\text{acc}(B_m) - \text{conf}(B_m)|\]

where:

  • \(M\) = number of bins
  • \(B_m\) = set of predictions in bin \(m\)
  • \(n_m\) = number of predictions in bin \(m\)
  • \(\text{acc}(B_m)\) = accuracy in bin \(m\)
  • \(\text{conf}(B_m)\) = average confidence in bin \(m\)

Interpretation: Average difference between predicted and observed probabilities across bins (weighted by bin size)
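
The Brier score, calibration curve, and a simple binned ECE can all be computed in a few lines; a minimal sketch, assuming y_test (0/1 outcomes) and y_pred_proba (predicted probabilities) are NumPy arrays:

import numpy as np
from sklearn.calibration import calibration_curve
from sklearn.metrics import brier_score_loss

# Brier score: mean squared difference between predicted probability and outcome
brier = brier_score_loss(y_test, y_pred_proba)

# Calibration curve: observed frequency vs. mean predicted probability per bin
prob_true, prob_pred = calibration_curve(y_test, y_pred_proba, n_bins=10)

# Expected calibration error: bin-size-weighted |observed - predicted| gap
bin_edges = np.linspace(0, 1, 11)
bin_ids = np.digitize(y_pred_proba, bin_edges[1:-1])
ece = 0.0
for b in range(10):
    in_bin = bin_ids == b
    if in_bin.sum() > 0:
        gap = abs(np.mean(y_test[in_bin]) - np.mean(y_pred_proba[in_bin]))
        ece += (in_bin.sum() / len(y_test)) * gap

print(f"Brier score: {brier:.3f}, ECE: {ece:.3f}")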


10.4.2.2 Why Calibration Matters

Clinical decision-making requires well-calibrated probabilities:

Scenario 1: Treatment threshold - If risk >20%, prescribe preventive medication - Poorly calibrated model: risk actually 40% when model says 20% - Result: Under-treatment of high-risk patients

Scenario 2: Resource allocation - Allocate home health visits to top 10% risk - Overconfident model: predicted “high risk” patients aren’t actually high risk - Result: Resources wasted on low-risk patients, true high-risk patients missed

Scenario 3: Patient counseling - Tell patient: “You have 30% chance of complications” - If model poorly calibrated, this number is meaningless - Result: Informed consent based on inaccurate information

Warning: The Deep Learning Calibration Problem

Common issue: Deep neural networks often produce poorly calibrated probabilities out-of-the-box. They tend to be overconfident (predicted probabilities too extreme).

Why? Modern neural networks are optimized for accuracy, not calibration. Regularization techniques that prevent overfitting can actually worsen calibration.

Evidence: Guo et al., 2017, ICML - “On Calibration of Modern Neural Networks”

Solution: Post-hoc calibration methods:

  • Temperature scaling: Simplest and most effective
  • Platt scaling: Logistic regression on model outputs
  • Isotonic regression: Non-parametric calibration

Takeaway: Always assess and correct calibration for deep learning models before deployment.
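
As one illustration of post-hoc calibration, scikit-learn's CalibratedClassifierCV supports Platt scaling (method='sigmoid') and isotonic regression; temperature scaling is not built in and usually requires a small custom optimization over the network's logits. A minimal sketch, assuming base_model, X_train/y_train, and a held-out X_val/y_val exist:

from sklearn.calibration import CalibratedClassifierCV
from sklearn.metrics import brier_score_loss

# Platt scaling (sigmoid) fitted via internal cross-validation on training data
calibrated = CalibratedClassifierCV(base_model, method='sigmoid', cv=5)
calibrated.fit(X_train, y_train)

# Compare calibration before and after on held-out data
raw_proba = base_model.fit(X_train, y_train).predict_proba(X_val)[:, 1]
cal_proba = calibrated.predict_proba(X_val)[:, 1]
print(f"Brier (uncalibrated): {brier_score_loss(y_val, raw_proba):.3f}")
print(f"Brier (calibrated):   {brier_score_loss(y_val, cal_proba):.3f}")
# method='isotonic' is the non-parametric alternative (needs more data)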


10.4.3 Regression Metrics

For continuous outcome prediction (disease burden, resource utilization, epidemic size):

1. Mean Absolute Error (MAE)

\[\text{MAE} = \frac{1}{N} \sum_{i=1}^{N} |y_i - \hat{y}_i|\]

  • Interpretation: Average absolute difference between prediction and truth
  • Unit: Same as outcome variable
  • Advantage: Interpretable, robust to outliers
  • Example: MAE = 3.2 days (average error in predicting length of stay)

2. Root Mean Squared Error (RMSE)

\[\text{RMSE} = \sqrt{\frac{1}{N} \sum_{i=1}^{N} (y_i - \hat{y}_i)^2}\]

  • Interpretation: Square root of average squared error
  • Property: Penalizes large errors more heavily than MAE (due to squaring)
  • When to use: When large errors are particularly problematic

Relationship: RMSE ≥ MAE always (equality only if all errors identical)


3. R-squared (Coefficient of Determination)

\[R^2 = 1 - \frac{\sum_{i=1}^{N} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{N} (y_i - \bar{y})^2} = 1 - \frac{\text{SS}_{\text{res}}}{\text{SS}_{\text{tot}}}\]

  • Range: 0 to 1 (can be negative if model worse than mean)
  • Interpretation: Proportion of variance in outcome explained by model
  • Example: R² = 0.65 means model explains 65% of variance
  • Limitation: Can be artificially inflated by adding more features

4. Mean Absolute Percentage Error (MAPE)

\[\text{MAPE} = \frac{100\%}{N} \sum_{i=1}^{N} \left| \frac{y_i - \hat{y}_i}{y_i} \right|\]

  • Interpretation: Average percentage error
  • Advantage: Scale-independent (can compare across different units)
  • Example: MAPE = 15% (average error is 15% of true value)
  • Limitation: Undefined when the actual value is zero; asymmetric, penalizing over-predictions more heavily than under-predictions (for positive actuals, under-prediction errors are capped at 100% while over-prediction errors are unbounded)
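
The regression metrics above map directly onto scikit-learn helpers; MAPE is computed by hand here so the zero-denominator exclusion is explicit. A minimal sketch, assuming y_true and y_pred are arrays of actual and predicted values:

import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

mae = mean_absolute_error(y_true, y_pred)
rmse = np.sqrt(mean_squared_error(y_true, y_pred))
r2 = r2_score(y_true, y_pred)

# MAPE computed manually, excluding zero actual values
actual = np.asarray(y_true, dtype=float)
predicted = np.asarray(y_pred, dtype=float)
nonzero = actual != 0
mape = 100 * np.mean(np.abs((actual[nonzero] - predicted[nonzero]) / actual[nonzero]))

print(f"MAE: {mae:.2f}  RMSE: {rmse:.2f}  R^2: {r2:.2f}  MAPE: {mape:.1f}%")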

10.4.4 Survival Analysis Metrics

For time-to-event prediction (mortality, readmission, disease progression):

1. Concordance Index (C-index, Harrell’s C-statistic)

  • Extension of AUC-ROC to survival data with censoring
  • Interpretation: Probability that, for two randomly selected individuals, the one who experiences event first has higher predicted risk
  • Range: 0.5 (random) to 1.0 (perfect)
  • Handles censoring: Pairs where censoring occurs are excluded or weighted

For details: Harrell et al., 1982, JAMA - original C-index paper.
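
One way to compute the C-index is with the lifelines library (packages such as scikit-survival offer equivalents); a minimal sketch, assuming durations (follow-up times), event_observed (1 = event, 0 = censored), and risk_scores (higher = higher predicted risk) are arrays:

import numpy as np
from lifelines.utils import concordance_index

# lifelines scores concordance between predictions and survival times,
# so a risk score (higher = earlier expected event) is negated first
c_index = concordance_index(durations, -np.asarray(risk_scores), event_observed)
print(f"C-index: {c_index:.3f}")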


2. Integrated Brier Score (IBS)

  • Extension of Brier score to survival analysis
  • Interpretation: Average prediction error over time, accounting for censoring
  • Range: 0 (perfect) to 1 (worst)
  • Advantage: Assesses calibration of survival probability predictions over follow-up period

10.5 Validation Strategies: Testing Generalization

The validation strategy determines how trustworthy your performance estimates are.

10.5.1 Internal Validation

Purpose: Estimate model performance on new data from the same source.

Critical limitation: Provides no evidence about performance on different populations, institutions, or time periods.


10.5.1.1 Method 1: Train-Test Split (Hold-Out Validation)

Procedure:

  1. Randomly split data into training (70-80%) and test (20-30%) sets
  2. Train model on the training set
  3. Evaluate on the test set (one time only)

Advantages: - Simple and fast - Clear separation between training and testing

Disadvantages: - Single split can be unrepresentative (bad luck in random split) - Wastes data (test set not used for training) - High variance in performance estimate

When to use: Large datasets (>10,000 samples), quick experiments

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, 
    test_size=0.2,      # 20% for testing
    random_state=42,    # Reproducible split
    stratify=y          # Maintain class balance
)

10.5.1.2 Method 2: K-Fold Cross-Validation

Procedure:

  1. Divide data into K folds (typically 5 or 10)
  2. For each fold: train on the other K-1 folds, validate on the remaining fold
  3. Average performance across all K folds

Advantages: - Uses all data for both training and validation - More stable performance estimate (less variance) - Standard practice in machine learning

Disadvantages: - Computationally expensive (train K models) - Still no external validation

When to use: Moderate-sized datasets (1,000-10,000 samples), model selection

from sklearn.model_selection import cross_val_score

scores = cross_val_score(
    model, X, y,
    cv=5,              # 5-fold CV
    scoring='roc_auc'  # Metric to optimize
)

print(f"AUC-ROC: {scores.mean():.3f}{scores.std():.3f})")

10.5.1.3 Method 3: Stratified K-Fold Cross-Validation

Modification: Ensures each fold maintains the same class distribution as the full dataset.

Critical for imbalanced datasets (e.g., 5% disease prevalence).

Why it matters: Without stratification, some folds might have very few positive cases (or none!), leading to unstable estimates.

from sklearn.model_selection import StratifiedKFold

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(model, X, y, cv=skf, scoring='roc_auc')

10.5.1.4 Method 4: Time-Series Cross-Validation

For temporal data: Never train on future, test on past!

Procedure (expanding window):

Fold 1: Train [1:100]  → Test [101:120]
Fold 2: Train [1:120]  → Test [121:140]
Fold 3: Train [1:140]  → Test [141:160]
...

Critical for: Epidemic forecasting, time-series prediction, any data with temporal structure

from sklearn.model_selection import TimeSeriesSplit

tscv = TimeSeriesSplit(n_splits=5)
for train_idx, test_idx in tscv.split(X):
    X_train, X_test = X[train_idx], X[test_idx]
    y_train, y_test = y[train_idx], y[test_idx]
    # Train and evaluate

10.5.2 Critical Considerations for Internal Validation

10.5.2.1 1. Data Leakage Prevention

Data leakage: Information from test set influencing training process.

Common sources:

❌ Feature engineering on entire dataset:

# WRONG: Standardize before splitting
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

X_scaled = StandardScaler().fit_transform(X)  # Uses mean/std from ALL data
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y)
# Test set info leaked into training!

✅ Feature engineering within train/test:

# CORRECT: Fit scaler on training only
X_train, X_test, y_train, y_test = train_test_split(X, y)
scaler = StandardScaler().fit(X_train)  # Learn from training only
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)  # Apply to test

❌ Feature selection on entire dataset:

# WRONG
from sklearn.feature_selection import SelectKBest

selector = SelectKBest(k=10).fit(X, y)  # Uses ALL data
X_selected = selector.transform(X)
X_train, X_test = train_test_split(X_selected)

✅ Feature selection within training:

# CORRECT
X_train, X_test, y_train, y_test = train_test_split(X, y)
selector = SelectKBest(k=10).fit(X_train, y_train)
X_train_selected = selector.transform(X_train)
X_test_selected = selector.transform(X_test)

For comprehensive guide on data leakage, see Kaufman et al., 2012, SIGKDD.


10.5.2.2 2. Cluster-Aware Splitting

Problem: If data has natural clusters (patients within hospitals, repeated measures within individuals), random splitting can lead to overfitting.

Example: Patient has 5 hospitalizations. Random split → some hospitalizations in training, others in test. Model learns patient-specific patterns → overoptimistic performance.

Solution: Group K-Fold — ensure all samples from same group stay together

from sklearn.model_selection import GroupKFold

# patient_ids: array indicating which patient each sample belongs to
gkf = GroupKFold(n_splits=5)
for train_idx, test_idx in gkf.split(X, y, groups=patient_ids):
    # All samples from same patient stay in same fold
    X_train, X_test = X[train_idx], X[test_idx]

10.5.3 External Validation: The Gold Standard

External validation: Testing on data from entirely different source—different institution(s), population, time period, or setting.

Why it matters:

Models often learn dataset-specific quirks that don’t generalize:

  • Hospital equipment signatures
  • Documentation practices
  • Patient population characteristics
  • Data collection protocols

Without external validation, you don’t know if model learned disease patterns or dataset artifacts.


10.5.3.1 Types of External Validation

1. Geographic External Validation

Design: - Train: Hospital A (or multiple hospitals in one region) - Test: Hospital B (or hospitals in different region)

What it tests:

  • Different patient demographics
  • Different clinical practices
  • Different data collection protocols
  • Different equipment (for imaging)

Example: McKinney et al., 2020, Nature - Google breast cancer AI trained on UK data, validated on US data (and vice versa). Performance dropped: UK→US AUC decreased from 0.889 to 0.858.


2. Temporal External Validation

Design: - Train: Data from 2015-2018 - Test: Data from 2019-2021

What it tests:

  • Temporal stability (concept drift)
  • Changes in disease patterns
  • Changes in clinical practice
  • Changes in data collection

Example: Davis et al., 2017, JAMIA - Clinical prediction models degrade over time; most models need recalibration after 2-3 years.
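
In code, temporal validation is simply a split on calendar time rather than a random split; a minimal pandas sketch, assuming a DataFrame df with an admission_date column, a binary outcome column, and a feature_cols list (all names are illustrative):

import pandas as pd

df['admission_date'] = pd.to_datetime(df['admission_date'])

# Train on the earlier period, test on the later period (never the reverse)
train = df[df['admission_date'] < '2019-01-01']
test = df[df['admission_date'] >= '2019-01-01']

X_train, y_train = train[feature_cols], train['outcome']
X_test, y_test = test[feature_cols], test['outcome']
# Fit on the past, evaluate on the future, and compare against the internal estimate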


3. Setting External Validation

Design: - Train: Intensive care unit (ICU) data - Test: General ward data

What it tests: - Performance in different clinical settings - Generalization across disease severity spectra

Example: Sepsis models trained on ICU patients often fail on ward patients (different disease presentation, different monitoring intensity).


10.5.3.2 External Validation Case Study

Note: Case Study: CheXNet External Validation Failure

Original paper: Rajpurkar et al., 2017, arXiv - CheXNet

Training:

  • ChestX-ray14 dataset: 112,120 X-rays from NIH Clinical Center
  • 14 pathology classification tasks
  • Claimed: “Radiologist-level pneumonia detection”
  • Performance: AUC-ROC = 0.7632 for pneumonia

External validation: DeGrave et al., 2021, Nature Machine Intelligence

Tested on:

  • MIMIC-CXR: 377,110 X-rays from Beth Israel Deaconess Medical Center
  • PadChest: 160,000 X-rays from Hospital San Juan, Spain
  • CheXpert: 224,000 X-rays from Stanford Hospital

Results:

  • AUC-ROC ranged from 0.51 to 0.70 across sites (vs. 0.76 internal)
  • Poor calibration: predicted probabilities didn’t match observed frequencies
  • Explanation: Model learned to detect portable X-ray machines (used for sicker patients) rather than pneumonia itself

Lessons:

  1. Internal validation dramatically overestimated performance
  2. Single-institution data insufficient for generalizability claims
  3. Models can learn spurious correlations specific to training site
  4. External validation is essential before clinical deployment

See also: Zech et al., 2018, PLOS Medicine - “Variable generalization performance of a deep learning model to detect pneumonia in chest radiographs”


10.5.4 Prospective Validation: Real-World Testing

Prospective validation: Model deployed in actual clinical practice, evaluated in real-time.

Why it matters: Retrospective validation can’t capture:

  • How clinicians actually use (or ignore) model predictions
  • Workflow integration challenges
  • Alert fatigue and override patterns
  • Behavioral changes in response to predictions
  • Unintended consequences


10.5.4.1 Study Design 1: Silent Mode Deployment

Design: - Deploy model in background - Generate predictions but don’t show to clinicians - Compare predictions to actual outcomes (collected as usual)

Advantages: - Tests real-world data quality and distribution - No risk to patients (clinicians unaware of predictions) - Can assess performance before making decisions based on model

Disadvantages: - Doesn’t test impact on clinical decisions - Doesn’t assess workflow integration

Example: Tomašev et al., 2019, Nature - DeepMind AKI prediction initially deployed silently at VA hospitals to validate real-time performance before clinical integration.


10.5.4.2 Study Design 2: Randomized Controlled Trial (RCT)

Design:

  • Randomize patients, clinicians, or hospital units to:
    • Intervention: Model-assisted care
    • Control: Standard care (no model)
  • Measure: Clinical outcomes in both groups
  • Compare: Test whether the model improves outcomes

Advantages: - Strongest causal evidence for impact - Can establish cost-effectiveness - Meets regulatory/reimbursement standards

Disadvantages: - Expensive (often millions of dollars) - Time-consuming (months to years) - Requires large sample size - Ethical considerations (withholding potentially beneficial intervention)

Example: Semler et al., 2018, JAMA - SMART trial for sepsis management (not AI, but example of rigorous prospective design)


10.5.4.3 Study Design 3: Stepped-Wedge Design

Design: - Roll out model sequentially to different units/sites - Each unit serves as its own control (before vs. after) - Eventually all units receive intervention

Advantages: - More feasible than full RCT - All units eventually get intervention (addresses ethical concerns) - Within-unit comparisons reduce confounding

Disadvantages: - Temporal trends can confound results - Less rigorous than RCT (no contemporaneous control group)

Example: Common in health system implementations where full RCT infeasible.


10.5.4.4 Study Design 4: A/B Testing

Design: - Randomly assign users to model-assisted vs. control in real-time - Continuously measure outcomes - Iterate rapidly based on results

Advantages: - Rapid experimentation - Can test multiple model versions - Common in tech industry

Challenges in healthcare: - Ethical concerns (different care for similar patients) - Regulatory considerations (IRB approval required) - Contamination (clinicians may share information)


10.6 Beyond Accuracy: Clinical Utility Assessment

Critical insight: A model can be statistically accurate but clinically useless.

Example: - Model predicts hospital mortality with AUC-ROC = 0.85 - But: If it doesn’t change clinical decisions or improve outcomes, what’s the value? - Moreover: If implementing it disrupts workflow or generates alert fatigue, net impact may be negative.

Important: The Clinical Utility Question

Before deploying any clinical AI:

  1. Does it change decisions?
  2. Do those changed decisions improve outcomes?
  3. Is the improvement worth the cost (financial, workflow disruption, alert burden)?

If you can’t answer “yes” to all three, don’t deploy.


10.6.1 Decision Curve Analysis (DCA)

Developed by: Vickers & Elkin, 2006, Medical Decision Making

Purpose: Assess the clinical net benefit of using a prediction model compared to alternative strategies.

Concept: A model is clinically useful only if using it leads to better decisions than: - Treating everyone - Treating no one - Using clinical judgment alone


10.6.1.1 How Decision Curve Analysis Works

For each possible risk threshold \(p_t\) (e.g., “treat if risk >10%”):

Calculate Net Benefit (NB):

\[\text{NB}(p_t) = \frac{TP}{N} - \frac{FP}{N} \times \frac{p_t}{1 - p_t}\]

Where:

  • \(TP/N\) = proportion of all patients who are true positives (benefit from correctly treating disease)
  • \(FP/N \times p_t/(1-p_t)\) = proportion who are false positives, weighted by the harm of unnecessary treatment

Interpretation: - If treating disease has high benefit relative to harm of unnecessary treatment → lower \(p_t\) threshold - If treating disease has low benefit relative to harm → higher \(p_t\) threshold

Weight \(p_t/(1-p_t)\): Reflects how much we weight false positives. - At \(p_t\) = 0.10: Weight = 0.10/0.90 ≈ 0.11 (FP weighted 1/9 as much as TP) - At \(p_t\) = 0.50: Weight = 0.50/0.50 = 1.00 (FP and TP equally weighted)


10.6.1.2 DCA Plot and Interpretation

Create DCA plot:

  • X-axis: Threshold probability (risk at which you’d intervene)
  • Y-axis: Net benefit
  • Plot curves for:
    • Model: Net benefit using model predictions
    • Treat all: Net benefit if everyone treated
    • Treat none: Net benefit if no one treated (= 0)

Interpretation: - Model is useful where its curve is above both “treat all” and “treat none” - Higher net benefit = better clinical value - Range of thresholds where model useful = decision curve clinical range

Example interpretation:

At 15% risk threshold: - Model NB = 0.12 - Treat all NB = 0.05 - Treat none NB = 0.00

Meaning: Using model at 15% threshold is equivalent to correctly treating 12 out of 100 patients with no false positives, compared to only 5 for “treat all” strategy.

Python implementation:

import numpy as np
import matplotlib.pyplot as plt

def calculate_net_benefit(y_true, y_pred_proba, thresholds):
    """Calculate net benefit across thresholds for decision curve analysis"""
    net_benefits = []
    
    for threshold in thresholds:
        # Classify based on threshold
        y_pred = (y_pred_proba >= threshold).astype(int)
        
        # Calculate TP, FP, TN, FN
        TP = ((y_pred == 1) & (y_true == 1)).sum()
        FP = ((y_pred == 1) & (y_true == 0)).sum()
        N = len(y_true)
        
        # Net benefit formula
        nb = (TP / N) - (FP / N) * (threshold / (1 - threshold))
        net_benefits.append(nb)
    
    return np.array(net_benefits)

# Calculate for model, treat all, treat none
thresholds = np.linspace(0.01, 0.99, 100)
nb_model = calculate_net_benefit(y_test, y_pred_proba, thresholds)
nb_treat_all = y_test.mean() - (1 - y_test.mean()) * (thresholds / (1 - thresholds))
nb_treat_none = np.zeros_like(thresholds)

# Plot decision curve
plt.figure(figsize=(10, 6))
plt.plot(thresholds, nb_model, label='Model', linewidth=2)
plt.plot(thresholds, nb_treat_all, label='Treat All', linestyle='--', linewidth=2)
plt.plot(thresholds, nb_treat_none, label='Treat None', linestyle=':', linewidth=2)
plt.xlabel('Threshold Probability', fontsize=12)
plt.ylabel('Net Benefit', fontsize=12)
plt.title('Decision Curve Analysis', fontsize=14, fontweight='bold')
plt.legend()
plt.grid(alpha=0.3)
plt.xlim(0, 0.5)  # Focus on clinically relevant range
plt.show()

For comprehensive tutorial, see Vickers et al., 2019, Diagnostic and Prognostic Research.


10.6.2 Reclassification Metrics

Purpose: Quantify whether new model improves risk stratification compared to existing approach.

Context: You have an existing risk model (or clinical judgment). New model proposed. Does it reclassify patients into more appropriate risk categories?


10.6.2.1 Net Reclassification Improvement (NRI)

Concept: Among events (people with disease), what proportion correctly moved to higher risk? Among non-events, what proportion correctly moved to lower risk?

Formula:

\[\text{NRI} = \text{NRI}_{\text{events}} + \text{NRI}_{\text{non-events}}\]

Where:

  • \(\text{NRI}_{\text{events}}\) = P(moved up | event) - P(moved down | event)
  • \(\text{NRI}_{\text{non-events}}\) = P(moved down | non-event) - P(moved up | non-event)

NRI is the sum of the two components (range -2 to +2); the components are often also reported separately.

Interpretation: - NRI > 0: New model improves classification - NRI < 0: New model worsens classification - Typically report with 95% CI

Example:

| Group | Moved Up | Stayed | Moved Down | NRI Component |
|---|---|---|---|---|
| Events (n=100) | 35 | 50 | 15 | (35 - 15)/100 = 0.20 |
| Non-events (n=900) | 50 | 800 | 50 | (50 - 50)/900 = 0.00 |

NRI = 0.20 + 0.00 = 0.20

Interpretation: A net 20% improvement in classification, driven entirely by better reclassification of events.

For detailed explanation, see Pencina et al., 2008, Statistics in Medicine.
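
Category-based NRI can be computed directly from old and new risk categories; a minimal sketch (the helper function is ours), assuming y_true holds 0/1 outcomes and old_cat / new_cat hold ordered risk categories where higher means higher risk:

import numpy as np

def net_reclassification_improvement(y_true, old_cat, new_cat):
    """Category-based NRI = NRI_events + NRI_non-events (Pencina et al., 2008)."""
    y_true = np.asarray(y_true)
    moved_up = np.asarray(new_cat) > np.asarray(old_cat)
    moved_down = np.asarray(new_cat) < np.asarray(old_cat)

    events, non_events = y_true == 1, y_true == 0
    nri_events = moved_up[events].mean() - moved_down[events].mean()
    nri_non_events = moved_down[non_events].mean() - moved_up[non_events].mean()
    return nri_events + nri_non_events, nri_events, nri_non_events

# Example: nri, nri_e, nri_ne = net_reclassification_improvement(y_test, old_cat, new_cat)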


10.6.2.2 Integrated Discrimination Improvement (IDI)

Concept: Difference in average predicted probabilities between events and non-events.

Formula:

\[\text{IDI} = [\overline{P}_{\text{new}}(\text{events}) - \overline{P}_{\text{old}}(\text{events})] - [\overline{P}_{\text{new}}(\text{non-events}) - \overline{P}_{\text{old}}(\text{non-events})]\]

Interpretation: - How much does new model increase separation between events and non-events? - IDI > 0: Better discrimination - Less sensitive to arbitrary cut-points than NRI
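
IDI follows the same pattern, comparing the mean predicted-probability gap between events and non-events under the old and new models; a minimal sketch (the helper function is ours), assuming y_true, p_old, and p_new are arrays of outcomes and predicted probabilities:

import numpy as np

def integrated_discrimination_improvement(y_true, p_old, p_new):
    """IDI: change in the mean predicted-probability gap between events and non-events."""
    y_true, p_old, p_new = map(np.asarray, (y_true, p_old, p_new))
    events, non_events = y_true == 1, y_true == 0
    return ((p_new[events].mean() - p_old[events].mean())
            - (p_new[non_events].mean() - p_old[non_events].mean()))

# Example: idi = integrated_discrimination_improvement(y_test, p_old, p_new)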


10.7 Fairness and Equity in Evaluation

AI systems can exhibit disparate performance across demographic groups, even when overall performance appears strong.

Warning: The Fairness Imperative

Failure to assess fairness can:

  • Perpetuate or amplify existing health disparities
  • Result in differential quality of care based on race, gender, or socioeconomic status
  • Violate ethical principles of justice and equity
  • Expose organizations to legal liability

Assessing fairness is not optional—it’s essential.


10.7.1 Mathematical Definitions of Fairness

Challenge: Multiple, often conflicting, definitions of fairness exist.

10.7.1.1 1. Demographic Parity (Statistical Parity)

Definition: Positive prediction rates equal across groups

\[P(\hat{Y}=1 | A=0) = P(\hat{Y}=1 | A=1)\]

where \(A\) = protected attribute (e.g., race, gender)

Example: Model predicts high risk for 20% of White patients and 20% of Black patients

When appropriate: - Resource allocation (equal access to interventions) - Contexts where base rates should be equal

Problem: Ignores actual outcome rates. If disease prevalence differs between groups (due to structural factors), enforcing demographic parity may reduce overall accuracy.


10.7.1.2 2. Equalized Odds (Equal Opportunity)

Definition: True positive and false positive rates equal across groups

\[P(\hat{Y}=1 | Y=1, A=0) = P(\hat{Y}=1 | Y=1, A=1)\] \[P(\hat{Y}=1 | Y=0, A=0) = P(\hat{Y}=1 | Y=0, A=1)\]

Example: 85% sensitivity for both White and Black patients; 90% specificity for both

When appropriate: - Clinical diagnosis and screening - When both types of errors (false positives and false negatives) matter

More clinically relevant than demographic parity in most healthcare applications.


10.7.1.3 3. Calibration Fairness

Definition: Predicted probabilities calibrated for all groups

\[P(Y=1 | \hat{Y}=p, A=0) = P(Y=1 | \hat{Y}=p, A=1) = p\]

Example: Among patients predicted 30% risk, ~30% in each group actually experience outcome

When appropriate: - Risk prediction for clinical decision-making - When predicted probabilities guide treatment thresholds

Most important for clinical applications where decisions based on predicted probabilities.


10.7.1.4 4. Predictive Parity

Definition: Positive predictive values equal across groups

\[P(Y=1 | \hat{Y}=1, A=0) = P(Y=1 | \hat{Y}=1, A=1)\]

Example: Among patients predicted positive, same proportion are true positives in both groups

When appropriate: - When acting on positive predictions (e.g., treatment initiation)


10.7.2 The Impossibility Theorem

Fundamental challenge: Chouldechova, 2017, FAT and Kleinberg et al., 2017, ITCS proved:

If base rates differ between groups, you cannot simultaneously satisfy: 1. Calibration 2. Equalized odds 3. Predictive parity

Implication: Must choose which fairness criterion to prioritize based on context and values.

For healthcare: Calibration typically most important (want predicted probabilities to mean the same thing across groups).


10.7.3 Practical Fairness Assessment

10.7.3.1 Step-by-Step Fairness Audit

Step 1: Define Protected Attributes

Identify characteristics that should not influence model performance:

  • Race/ethnicity
  • Gender/sex
  • Age
  • Socioeconomic status (income, insurance, ZIP code)
  • Language
  • Disability status


Step 2: Stratify Performance Metrics

Calculate metrics separately for each subgroup:

# Example: Performance by race/ethnicity
import pandas as pd
from sklearn.metrics import (recall_score, precision_score,
                             roc_auc_score, brier_score_loss)

groups = data.groupby('race')

fairness_metrics = []
for race, group_data in groups:
    y_true = group_data['outcome']
    y_pred = group_data['prediction']
    
    metrics = {
        'race': race,
        'n': len(group_data),
        'prevalence': y_true.mean(),
        'sensitivity': recall_score(y_true, y_pred > 0.5),
        'specificity': recall_score(1 - y_true, 1 - (y_pred > 0.5)),
        'PPV': precision_score(y_true, y_pred > 0.5),
        'NPV': precision_score(1 - y_true, 1 - (y_pred > 0.5)),
        'AUC': roc_auc_score(y_true, y_pred),
        'Brier': brier_score_loss(y_true, y_pred)
    }
    fairness_metrics.append(metrics)

fairness_df = pd.DataFrame(fairness_metrics)
print(fairness_df)

Step 3: Assess Calibration by Subgroup

# Calibration plots by race
import matplotlib.pyplot as plt
from sklearn.calibration import calibration_curve

fig, axes = plt.subplots(1, len(groups), figsize=(15, 5))

for idx, (race, group_data) in enumerate(groups):
    y_true = group_data['outcome']
    y_pred = group_data['prediction']
    
    prob_true, prob_pred = calibration_curve(y_true, y_pred, n_bins=10)
    
    axes[idx].plot(prob_pred, prob_true, marker='o', label=race)
    axes[idx].plot([0, 1], [0, 1], 'k--', label='Perfect calibration')
    axes[idx].set_title(f'{race} (n={len(group_data)})')
    axes[idx].set_xlabel('Predicted Probability')
    axes[idx].set_ylabel('Observed Frequency')
    axes[idx].legend()

Step 4: Identify Disparities

Calculate disparity metrics:

Absolute disparity: Difference between groups

sens_white = metrics_white['sensitivity']
sens_black = metrics_black['sensitivity']
disparity_abs = sens_white - sens_black

Relative disparity: Ratio between groups

disparity_rel = sens_white / sens_black

Threshold for concern: - Absolute disparity >5 percentage points - Relative disparity >1.1 or <0.9 (10% difference)


Step 5: Investigate Root Causes

Potential causes of disparities:

  1. Data representation
    • Underrepresentation in training data
    • Different sample sizes → unstable estimates for small groups
  2. Label bias
    • Outcome labels reflect biased processes (e.g., healthcare access disparities)
    • Example: Hospitalization rates lower in group with less access, not because they’re healthier
  3. Feature bias
    • Features proxy for protected attributes
    • Example: ZIP code strongly correlated with race
  4. Measurement bias
    • Features or outcomes measured less accurately for some groups (e.g., pulse oximetry overestimating oxygen saturation in patients with darker skin)
  5. Prevalence differences
    • True differences in disease prevalence
    • May be due to structural factors (e.g., environmental exposures)

Step 6: Mitigation Strategies

Pre-processing (adjust training data): - Increase representation of underrepresented groups (oversampling, synthetic data) - Re-weight samples to balance groups - Remove or transform biased features

In-processing (modify algorithm): - Add fairness constraints during training - Adversarial debiasing (penalize predictions that reveal protected attribute) - Multi-objective optimization (accuracy + fairness)

Post-processing (adjust predictions): - Separate thresholds per group to achieve equalized odds - Calibration adjustment per group - Reject option classification (defer to human for uncertain cases)

Structural interventions: - Address root causes (improve data collection for underrepresented groups) - Partner with communities to ensure appropriate representation - Consider whether model should be deployed if disparities cannot be adequately mitigated

For comprehensive fairness toolkit, see Fairlearn by Microsoft.
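
If you adopt a toolkit such as Fairlearn, group-stratified metrics and disparity summaries take only a few lines; a hedged sketch based on Fairlearn's documented MetricFrame API (y_test, thresholded y_pred, and a race column in data are assumed), which you should verify against the version you install:

from fairlearn.metrics import MetricFrame, equalized_odds_difference
from sklearn.metrics import recall_score, precision_score

mf = MetricFrame(
    metrics={'sensitivity': recall_score, 'PPV': precision_score},
    y_true=y_test,
    y_pred=y_pred,                       # hard 0/1 predictions at the chosen threshold
    sensitive_features=data['race'],
)
print(mf.by_group)                       # each metric, per subgroup
print(mf.difference())                   # largest between-group gap per metric

# Largest gap in TPR/FPR across groups (0 would mean perfectly equalized odds)
eod = equalized_odds_difference(y_test, y_pred, sensitive_features=data['race'])
print(f"Equalized odds difference: {eod:.3f}")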


10.7.4 Landmark Bias Case Study

Warning: Case Study: Racial Bias in Healthcare Risk Algorithm

Paper: Obermeyer et al., 2019, Science

Context: - Commercial algorithm used by major US health systems to identify high-risk patients for care management programs - Affected millions of patients nationwide

The Algorithm: - Predicted future healthcare costs as proxy for healthcare needs - Used to determine eligibility for high-touch care management

The Bias Discovered:

Black patients had: - 26% more chronic conditions than White patients at same risk score - Lower predicted costs despite being sicker

Why?

  • The algorithm used healthcare costs as the outcome label
  • Black patients historically received less care due to systemic barriers
  • Less care → lower costs → the model learned “Black = lower cost = healthier”
  • Result: At the same risk score, Black patients were sicker than White patients

Impact: - To qualify for care management, Black patients needed to be sicker than White patients - Black patients at 97th percentile of risk score had similar medical needs as White patients at 85th percentile - Reduced access to care management programs for Black patients

Solution: - Re-label using direct measures of health need (number of chronic conditions, biomarkers) instead of costs - Result: Reduced bias by 84%

Lessons:

  1. Outcome label choice is critical — using healthcare utilization as proxy for need embeds systemic bias
  2. Overall accuracy can mask subgroup disparities — algorithm performed well on average
  3. Historical bias propagates — model learned from biased past care patterns
  4. Evaluate across subgroups — disparities invisible without stratified analysis
  5. Audit deployed systems — this was a production system, not a research study

Follow-up: Buolamwini & Gebru, 2018, FAT - similar biases in facial recognition; Gichoya et al., 2022, Lancet Digital Health - AI can predict race from medical images (concerning proxy variable).


10.8 Implementation Outcomes: Beyond Technical Performance

Even models with strong technical performance can fail in practice if not properly implemented.

10.8.1 The Implementation Science Framework

Proctor et al., 2011 define implementation outcomes:

10.8.1.1 1. Acceptability

Definition: Perception that system is agreeable/satisfactory

Measures: - User satisfaction surveys (Likert scales, Net Promoter Score) - Qualitative interviews (what do users like/dislike?) - Perceived usefulness and ease of use

Example questions: - “This system improves my clinical decision-making” (1-5 scale) - “I would recommend this system to colleagues” (yes/no)


10.8.1.2 2. Adoption

Definition: Intention/action to use the system

Measures: - Utilization rate (% of eligible cases where system used) - Number of users who have activated/logged in - Time to initial use

Red flag: Low adoption despite availability suggests problems with acceptability, workflow fit, or perceived utility.


10.8.1.3 3. Appropriateness

Definition: Perceived fit for setting/population/problem

Measures: - Stakeholder perception surveys - Alignment with clinical workflows (workflow mapping) - Relevance to clinical questions

Example: ICU mortality prediction may be appropriate for ICU but inappropriate for outpatient clinic.


10.8.1.4 4. Feasibility

Definition: Ability to successfully implement

Measures: - Technical integration challenges (API compatibility, data availability) - Resource requirements (cost, staff time, training) - Infrastructure needs (computing, network)


10.8.1.5 5. Fidelity

Definition: Degree to which system used as designed

Measures: - Override rates (how often do clinicians dismiss alerts?) - Deviation from intended use (using for wrong purpose) - Workarounds (users circumventing system)

High override rates signal problems: - Too many false positives (alert fatigue) - Predictions don’t match clinical judgment (trust issues) - Workflow disruption (alerts at wrong time)


10.8.1.6 6. Penetration

Definition: Integration across settings/populations

Measures: - Number of sites/units using system - Proportion of target population reached - Geographic spread


10.8.1.7 7. Sustainability

Definition: Continued use over time

Measures: - Retention of users over 6-12 months - Model updating/maintenance plan - Long-term performance monitoring

Common failure: “Pilot-itis” — successful pilot, but system not sustained after initial implementation period.


10.8.2 Common Implementation Failures

10.8.2.1 1. Alert Fatigue

Problem: Excessive false alarms → clinicians ignore alerts

Evidence: Ancker et al., 2017, BMJ Quality & Safety - Drug-drug interaction alerts overridden 49-96% of time.

Example: Epic sepsis model - 93% false positive rate → clinicians stopped paying attention.

Solutions:

  • Minimize false positives (sacrifice sensitivity if needed)
  • Tiered alerts (critical vs. informational)
  • Smart timing (deliver when actionable, not during documentation)
  • Actionable recommendations (“Order blood cultures” not “Consider sepsis”)


10.8.2.2 2. Workflow Disruption

Problem: System doesn’t integrate smoothly into existing processes

Examples: - Extra clicks required - Separate application (need to switch contexts) - Alerts interrupt at inopportune times (during patient exam)

Solutions: - User-centered design (involve clinicians early and often) - Embed in existing EHR workflows - Minimize friction (one-click actions)

For workflow integration principles, see Bates et al., 2003, NEJM on clinical decision support systems.


10.8.2.3 3. Lack of Trust

Problem: Clinicians don’t trust “black box” predictions

Example: Deep learning model provides risk score with no explanation

Solutions:

  • Provide explanations (SHAP values, attention weights)
  • Show evidence base (similar cases, supporting literature)
  • Transparent validation (publish performance data)
  • Gradual trust-building (start with low-stakes recommendations)


10.8.2.4 4. Model Drift

Problem: Performance degrades over time as data distribution changes

Example: COVID-19 pandemic changed disease patterns → pre-pandemic models failed

Solutions:

  • Continuous monitoring (track performance metrics over time)
  • Automated alerts for performance degradation
  • Regular retraining schedule
  • Triggers for urgent retraining (sudden performance drop)

Framework: Davis et al., 2017, JAMIA - Clinical prediction models degrade; most need recalibration after 2-3 years.
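
A simple way to operationalize drift monitoring is to track a key metric over calendar-time windows and flag drops below a pre-agreed trigger; a minimal sketch, assuming a DataFrame preds with date, y_true, and y_score columns logged from the deployed model (the 0.70 trigger is an illustrative assumption):

import pandas as pd
from sklearn.metrics import roc_auc_score

preds['month'] = pd.to_datetime(preds['date']).dt.to_period('M')

# AUC-ROC per calendar month of deployment
monthly_auc = preds.groupby('month').apply(
    lambda g: roc_auc_score(g['y_true'], g['y_score'])
    if g['y_true'].nunique() > 1 else float('nan')
)
print(monthly_auc)

# Flag months falling below the agreed retraining trigger
ALERT_THRESHOLD = 0.70   # illustrative; set during deployment planning
for month, auc in monthly_auc.items():
    if auc < ALERT_THRESHOLD:
        print(f"{month}: AUC {auc:.3f} below threshold - investigate and consider retraining")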


10.9 Comprehensive Evaluation Framework

10.9.1 Complete Evaluation Checklist

Use this when evaluating AI systems:

Note: ✅ AI System Evaluation Checklist

TECHNICAL PERFORMANCE
- [ ] Discrimination metrics reported (AUC-ROC, sensitivity, specificity, PPV, NPV)
- [ ] 95% confidence intervals provided for all metrics
- [ ] Calibration assessed (calibration plot, Brier score, ECE)
- [ ] Appropriate for class imbalance (if applicable)
- [ ] Comparison to baseline model (e.g., logistic regression, clinical judgment)
- [ ] Multiple metrics reported (not just accuracy)

VALIDATION RIGOR
- [ ] Internal validation performed (CV or hold-out)
- [ ] Temporal validation (train on past, test on future)
- [ ] External validation on independent dataset from different institution
- [ ] Prospective validation performed or planned
- [ ] Data leakage prevented (feature engineering within folds)
- [ ] Appropriate cross-validation for data structure (stratified, grouped, time-series)

FAIRNESS AND EQUITY
- [ ] Performance stratified by demographic subgroups (race, gender, age, SES)
- [ ] Disparities quantified (absolute and relative differences)
- [ ] Calibration assessed per subgroup
- [ ] Training data representation documented
- [ ] Potential for bias explicitly discussed
- [ ] Mitigation strategies proposed if disparities identified

CLINICAL UTILITY
- [ ] Decision curve analysis or similar utility assessment
- [ ] Comparison to current standard of care
- [ ] Clinical workflow integration considered
- [ ] Net benefit quantified
- [ ] Cost-effectiveness assessed (if applicable)
- [ ] Actionable outputs (not just risk scores)

TRANSPARENCY AND REPRODUCIBILITY
- [ ] Model architecture and type clearly described
- [ ] Feature engineering documented
- [ ] Hyperparameters and training procedure reported
- [ ] Reporting guidelines followed (TRIPOD, STARD-AI)
- [ ] Code availability stated
- [ ] Data availability (with appropriate privacy protections)
- [ ] Conflicts of interest disclosed

IMPLEMENTATION PLANNING
- [ ] Target users and use cases defined
- [ ] Workflow integration plan described
- [ ] Alert/decision threshold selection justified
- [ ] Plan for performance monitoring post-deployment
- [ ] Model updating and maintenance plan
- [ ] Training plan for end users
- [ ] Contingency plan for model failure

LIMITATIONS
- [ ] Limitations clearly stated
- [ ] Generalizability constraints acknowledged
- [ ] Potential biases discussed
- [ ] Appropriate caveats about clinical use


10.10 Reporting Guidelines

10.10.1 TRIPOD: Transparent Reporting of Prediction Models

Collins et al., 2015, BMJ - TRIPOD statement

22-item checklist for prediction model studies:

Title and Abstract 1. Identify as prediction model study 2. Summary of objectives, design, setting, participants, outcome, prediction model, results

Introduction 3. Background and objectives 4. Rationale for development or validation

Methods - Source of Data 5. Study design and data sources 6. Eligibility criteria and study period

Methods - Participants 7. Participant characteristics 8. Outcome definition 9. Predictors (features) clearly defined

Methods - Sample Size 10. Sample size determination

Methods - Missing Data 11. How missing data were handled

Methods - Model Development 12. Statistical methods for model development 13. Model selection procedure 14. Model performance measures

Results - Participants 15. Participant flow diagram 16. Descriptive characteristics

Results - Model Specification 17. Model specification (all parameters) 18. Model performance (discrimination and calibration)

Discussion 19. Interpretation (clinical meaning, implications) 20. Limitations 21. Implications for practice

Other 22. Funding and conflicts of interest

TRIPOD-AI extension (in development): Additional items for AI/ML models: - Training/validation/test set composition - Data augmentation - Hyperparameter tuning - Computational environment - Algorithm selection process


10.10.2 STARD-AI: Standards for Reporting Diagnostic Accuracy Using AI

Extension of STARD guidelines for diagnostic AI.

Additional items: - Model architecture details - Training procedure (epochs, batch size, optimization) - Validation strategy - External validation results - Subgroup analyses - Calibration assessment - Comparison to human performance (if applicable)


10.11 Critical Appraisal of Published Studies

10.11.1 Systematic Evaluation Framework

When reading AI studies:

10.11.1.1 1. Study Design and Data Quality

Questions: - Representative sample of target population? - External validation performed? - Test set truly independent? - Outcome objectively defined and consistently measured? - Potential for data leakage?

Red flags: - No external validation - Small sample size (<500 events) - Convenience sampling - Vague outcome definitions - Feature engineering on entire dataset before splitting
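One concrete way to avoid the last red flag is to wrap all preprocessing inside a scikit-learn Pipeline so that imputation and scaling are refit within each cross-validation fold rather than on the full dataset. A minimal sketch with placeholder data:

# Minimal sketch: keep preprocessing inside the CV loop to prevent leakage (placeholder data)
from sklearn.datasets import make_classification
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=2000, n_features=20, weights=[0.9, 0.1], random_state=0)

# Imputation and scaling are fit on the training portion of each fold only
pipeline = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
    ("model", LogisticRegression(max_iter=1000)),
])

scores = cross_val_score(pipeline, X, y, cv=5, scoring="roc_auc")
print(f"Cross-validated AUC: {scores.mean():.3f} (+/- {scores.std():.3f})")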


10.11.1.2 2. Model Development and Reporting

Questions: - Multiple models compared? - Simple baseline included (logistic regression)? - Hyperparameters tuned on separate validation set? - Feature selection appropriate? - Model clearly described?

Red flags: - No baseline comparison - Hyperparameter tuning on test set - Inadequate model description - No cross-validation
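To avoid the "hyperparameter tuning on test set" flag, tuning can be nested inside an outer cross-validation loop so the data used to select hyperparameters never contributes to the final performance estimate. A minimal sketch with a placeholder model and grid:

# Minimal nested-CV sketch: tune in the inner loop, estimate performance in the outer loop (placeholder data)
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, cross_val_score

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)

inner_search = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid={"C": [0.01, 0.1, 1.0, 10.0]},  # illustrative grid
    cv=3,
    scoring="roc_auc",
)

# The outer loop scores the tuned model on folds it never saw during tuning
outer_scores = cross_val_score(inner_search, X, y, cv=5, scoring="roc_auc")
print(f"Nested CV AUC: {outer_scores.mean():.3f}")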


10.11.1.3 3. Performance Evaluation

Questions: - Appropriate metrics for task? - Confidence intervals provided? - Calibration assessed? - Multiple metrics reported? - Statistical testing appropriate?

Red flags: - Only accuracy reported (especially for imbalanced data) - No calibration assessment - No confidence intervals - Cherry-picked metrics


10.11.1.4 4. Fairness and Generalizability

Questions: - Performance stratified by subgroups? - Diverse populations included? - Generalizability limitations discussed? - Potential biases identified?

Red flags: - No subgroup analysis - Homogeneous study population - Claims of broad generalizability without external validation - Dismissal of fairness concerns


10.11.1.5 5. Clinical Utility

Questions: - Clinical utility assessed (beyond accuracy)? - Compared to current practice? - Implementation considerations discussed? - Cost-effectiveness assessed?

Red flags: - Only technical metrics - No comparison to existing approaches - No implementation discussion - Overstated clinical claims
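For the clinical utility questions above, decision curve analysis reduces to a short net-benefit calculation: net benefit = TP/n - FP/n × p_t/(1 - p_t), where p_t is the decision threshold (Vickers & Elkin, 2006). The predictions and outcomes below are placeholders, shown only to illustrate the arithmetic.

# Minimal decision-curve sketch: net benefit of "treat if predicted risk >= threshold" (placeholder data)
import numpy as np

rng = np.random.default_rng(1)
risk = rng.uniform(0, 1, 5000)      # model-predicted risks (placeholder)
outcome = rng.binomial(1, risk)     # observed outcomes (placeholder)

def net_benefit(y, p, threshold):
    n = len(y)
    treat = p >= threshold
    tp = np.sum(treat & (y == 1))   # treated patients who had the event
    fp = np.sum(treat & (y == 0))   # treated patients who did not
    return tp / n - fp / n * threshold / (1 - threshold)

for pt in (0.1, 0.2, 0.3):
    nb_model = net_benefit(outcome, risk, pt)
    nb_treat_all = outcome.mean() - (1 - outcome.mean()) * pt / (1 - pt)
    print(f"threshold {pt:.0%}: model net benefit {nb_model:.3f} vs treat-all {nb_treat_all:.3f}")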


10.11.1.6 6. Transparency and Reproducibility

Questions: - Code and data available? - Reporting guidelines followed? - Sufficient detail to reproduce? - Limitations clearly stated? - Conflicts of interest disclosed?

Red flags: - No code/data availability - Insufficient methodological detail - Overstated conclusions - Undisclosed industry funding


10.12 Key Takeaways

ImportantEssential Principles
  1. Evaluation is multidimensional — Technical performance, clinical utility, fairness, and implementation outcomes all matter

  2. Internal validation is insufficient — External validation on independent data is essential to assess generalizability

  3. Calibration is critical — Predicted probabilities must be meaningful for clinical decisions, not just discriminative

  4. Assess fairness proactively — Stratify performance by demographic subgroups; otherwise disparities remain invisible

  5. Clinical utility ≠ statistical performance — A model can be statistically accurate but clinically useless without improving outcomes

  6. Prospective validation is the gold standard — Real-world testing provides strongest evidence

  7. Common pitfalls are avoidable — Data leakage, improper CV, and threshold optimization on the test set lead to overoptimistic estimates

  8. Implementation determines success — Even well-performing models fail if workflow integration is ignored

  9. Transparency enables trust — Follow reporting guidelines (TRIPOD, STARD-AI); share code and data when possible

  10. Continuous monitoring is essential — Model performance drifts over time; plan for ongoing evaluation and updating


Check Your Understanding

Test your knowledge of the key concepts from this chapter. Click “Show Answer” to reveal the correct response and explanation.

NoteQuestion 1: Cross-Validation Strategy

You’re building a model to predict hospital readmissions using data from 2018-2023. Which cross-validation strategy is MOST appropriate?

  1. 10-fold random cross-validation
  2. Leave-one-out cross-validation
  3. Stratified K-fold cross-validation
  4. Time-based forward-chaining cross-validation

Answer: d) Time-based forward-chaining cross-validation

Explanation: Time-based (temporal) cross-validation is essential for healthcare data with temporal dependencies:

Why temporal CV is critical:

Fold 1: Train 2018-2019 → Test 2020
Fold 2: Train 2018-2020 → Test 2021
Fold 3: Train 2018-2021 → Test 2022
Fold 4: Train 2018-2022 → Test 2023

What this tests: - Model performance as deployed (using past to predict future) - Robustness to temporal drift (treatment changes, policy updates) - Realistic performance estimates

Why not random K-fold (a)? Creates data leakage:

Train: [2019, 2021, 2023]
Test:  [2018, 2020, 2022]

You’re using 2023 data to predict 2018! Inflates performance artificially.

Why not leave-one-out (b)? - Computationally expensive - Still has temporal leakage problem - High variance in estimates

Why not stratified K-fold (c)? - Useful for class imbalance - But still allows temporal leakage - Doesn’t test temporal robustness

Real-world impact: Models validated with random CV often show 10-20% performance drops when deployed because they never faced forward-looking prediction during validation.

Lesson: Healthcare data has temporal structure. Always validate as you’ll deploy—using past to predict future, never the reverse.
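A minimal sketch of this forward-chaining scheme using scikit-learn's TimeSeriesSplit, assuming the records are already sorted from oldest to newest (the data here are synthetic placeholders):

# Minimal forward-chaining CV sketch: always train on earlier records, test on later ones (placeholder data)
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import TimeSeriesSplit

# Placeholder data, assumed sorted by admission date (oldest first)
X, y = make_classification(n_samples=3000, n_features=15, random_state=0)

for fold, (train_idx, test_idx) in enumerate(TimeSeriesSplit(n_splits=4).split(X), start=1):
    model = LogisticRegression(max_iter=1000).fit(X[train_idx], y[train_idx])
    auc = roc_auc_score(y[test_idx], model.predict_proba(X[test_idx])[:, 1])
    print(f"Fold {fold}: train through row {train_idx[-1]}, test on later rows, AUC = {auc:.3f}")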

NoteQuestion 2: Calibration Assessment

A cancer risk model predicts 20% risk for 1,000 patients. In reality, 300 of these patients develop cancer. What does this indicate?

  1. The model is well-calibrated
  2. The model is overconfident (underestimates risk)
  3. The model is underconfident (overestimates risk)
  4. The model has good discrimination but poor calibration

Answer: b) The model is overconfident (underestimates risk)

Explanation: Calibration compares predicted probabilities to observed outcomes:

Analysis: - Predicted: 20% of 1,000 patients = 200 patients expected to develop cancer - Observed: 300 patients actually developed cancer - Gap: Predicted 200, observed 300 → Underestimating risk

Calibration terminology: - Well-calibrated: Predicted ≈ Observed (20% predicted → 20% observed) - Overconfident/Underestimate: Predicted < Observed (20% predicted → 30% observed) ✓ This case - Underconfident/Overestimate: Predicted > Observed (20% predicted → 10% observed)

Why it matters:

# Clinical decision rule: treat if predicted risk > 25%
predicted_risk = model.predict_proba(patient)[0, 1]  # e.g., 0.20 → below threshold → no treatment

# Reality: the patient's true risk was closer to 0.30
# Because the model underestimates risk, this patient goes untreated when treatment was warranted

How to detect: 1. Calibration plot: Predicted vs observed by risk bin 2. Brier score: Mean squared error of probabilities 3. Expected Calibration Error (ECE): Average absolute calibration error
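The first two checks take only a few lines with scikit-learn. The predictions and outcomes below are placeholders constructed so that observed risk runs about 10 percentage points above predicted risk, mirroring the underestimation in this question (ECE needs a few extra lines or a separate package).

# Minimal calibration-check sketch: reliability table and Brier score (placeholder predictions)
import numpy as np
from sklearn.calibration import calibration_curve
from sklearn.metrics import brier_score_loss

rng = np.random.default_rng(0)
predicted = rng.uniform(0, 1, 5000)                            # model-predicted risks (placeholder)
observed = rng.binomial(1, np.clip(predicted + 0.10, 0, 1))    # events occur ~10 points more often than predicted

# Compare mean predicted risk to the observed event rate within each risk bin
frac_events, mean_pred = calibration_curve(observed, predicted, n_bins=10)
for p, o in zip(mean_pred, frac_events):
    print(f"predicted ~{p:.2f} -> observed {o:.2f}")

print(f"Brier score: {brier_score_loss(observed, predicted):.3f}")  # lower is better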

Lesson: High AUC doesn’t guarantee calibration. When predictions inform decisions with probability thresholds, calibration is critical. Always check calibration plots, not just discrimination metrics.

NoteQuestion 3: External Validation

Your sepsis model achieves AUC 0.88 on internal test set. You test on external hospitals and get AUC 0.72-0.82 (varying by site). What does this variability indicate?

  1. The model is overfitting
  2. External sites have poor data quality
  3. There is substantial site-specific heterogeneity
  4. The model should not be used

Answer: c) There is substantial site-specific heterogeneity

Explanation: Performance variability across sites reveals important heterogeneity:

What varies between hospitals:

  1. Patient populations:
    • Demographics (age, race, socioeconomic status)
    • Disease severity (tertiary referral vs community hospital)
    • Comorbidity profiles
  2. Clinical practices:
    • Sepsis protocols (early vs delayed antibiotics)
    • ICU admission criteria
    • Documentation practices
  3. Infrastructure:
    • EHR systems (Epic vs Cerner vs homegrown)
    • Lab equipment (different reference ranges)
    • Staffing models (nurse-to-patient ratios)
  4. Data capture:
    • Missing data patterns
    • Measurement frequency
    • Feature definitions

Why not overfitting (a)? Overfitting shows as gap between training and test within same dataset. Here, internal test was fine (0.88)—it’s external generalization that varies.

Why not poor data quality (b)? Could contribute, but more likely reflects legitimate differences in populations and practices.

Why not unusable (d)? AUC 0.72-0.82 is still useful! But indicates need for: - Site-specific calibration - Understanding what drives differences - Possibly site-specific models or adjustments

Best practice: External validation almost always shows performance drops. Variability across sites is normal and informative—reveals where model struggles and needs adaptation.
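One common response to the site-specific calibration point is logistic recalibration: keep the original model's risk score, but refit an intercept and slope on a sample from the new site. A minimal sketch under assumed placeholder data:

# Minimal logistic-recalibration sketch: refit intercept and slope on the external site's data (placeholders)
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(3)
site_pred = rng.uniform(0.01, 0.99, 2000)        # original model's predicted risks at the new site
site_outcome = rng.binomial(1, site_pred * 0.6)  # the new site has lower event rates (placeholder)

# Recalibrate on the log-odds of the original predictions
logit = np.log(site_pred / (1 - site_pred)).reshape(-1, 1)
recalibrator = LogisticRegression().fit(logit, site_outcome)

recalibrated = recalibrator.predict_proba(logit)[:, 1]
print(f"Mean original risk: {site_pred.mean():.2f}, observed rate: {site_outcome.mean():.2f}, "
      f"mean recalibrated risk: {recalibrated.mean():.2f}")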

NoteQuestion 4: Statistical Significance vs Clinical Significance

True or False: If a model improvement is statistically significant (p < 0.05), it is clinically meaningful and should be deployed.

Answer: False

Explanation: Statistical significance ≠ clinical significance. Both are necessary but neither alone is sufficient:

Statistical significance: - Tests if difference is unlikely due to chance - Depends on sample size (large N → small differences become significant) - Answers: “Is there an effect?”

Clinical significance: - Tests if difference matters for patient care - Independent of sample size - Answers: “Is the effect large enough to care?”

Example:

# New model vs baseline
results = {
    'baseline_auc': 0.820,
    'new_model_auc': 0.825,
    'difference': 0.005,
    'p_value': 0.03,  # Statistically significant
    'sample_size': 50000  # Large sample
}

Analysis: - ✅ Statistically significant: p=0.03 < 0.05 - ❌ Clinically insignificant: 0.5% AUC improvement negligible - Why significant? Large sample detects tiny differences - Should deploy? No—not worth the cost/disruption

Lesson: Always evaluate both statistical and clinical significance. With large samples, trivial differences become statistically significant. Ask: “Is this improvement large enough to change practice?” Consider effect sizes, confidence intervals, and practical impact—not just p-values.
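To put numbers behind effect sizes and confidence intervals, a paired bootstrap of the AUC difference between the two models is one simple option. The outcome labels and scores below are placeholders constructed so the two models are nearly identical.

# Minimal paired-bootstrap sketch: 95% CI for the AUC difference between two models (placeholder data)
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(7)
y = rng.binomial(1, 0.1, 5000)                                   # outcomes (placeholder)
score_a = np.clip(y * 0.3 + rng.normal(0.3, 0.2, 5000), 0, 1)    # baseline model scores (placeholder)
score_b = np.clip(score_a + rng.normal(0.0, 0.02, 5000), 0, 1)   # "new" model, nearly identical

diffs = []
for _ in range(1000):
    idx = rng.integers(0, len(y), len(y))    # resample patients with replacement
    if y[idx].min() == y[idx].max():         # skip resamples containing a single class
        continue
    diffs.append(roc_auc_score(y[idx], score_b[idx]) - roc_auc_score(y[idx], score_a[idx]))

lo, hi = np.percentile(diffs, [2.5, 97.5])
print(f"AUC difference: {np.mean(diffs):.4f} (95% CI {lo:.4f} to {hi:.4f})")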

NoteQuestion 5: Confidence Intervals

Two models have been evaluated: - Model A: AUC 0.85 (95% CI: 0.83-0.87) - Model B: AUC 0.86 (95% CI: 0.79-0.93)

Which statement is correct?

  1. Model B is definitely better because it has higher AUC
  2. Model A is more reliable because it has a narrower confidence interval
  3. The models cannot be compared without more information
  4. Model B is better if you’re willing to accept more uncertainty

Answer: b) Model A is more reliable because it has a narrower confidence interval

Explanation: Confidence intervals reveal precision/uncertainty, not just point estimates:

Model A: - AUC: 0.85 - 95% CI: 0.83-0.87 - Width: 0.04 (narrow) - Interpretation: We’re 95% confident true AUC is between 0.83-0.87 (precise estimate)

Model B: - AUC: 0.86 - 95% CI: 0.79-0.93 - Width: 0.14 (wide) - Interpretation: We’re 95% confident true AUC is between 0.79-0.93 (imprecise estimate)

Key insight: CIs overlap substantially (0.83-0.87 vs 0.79-0.93). Cannot conclude Model B is actually better—difference might be due to chance.

In practice: Most organizations prefer Model A: - Predictable performance for planning - Lower risk of underperformance - Easier to set appropriate thresholds - Small gain (0.01 AUC) not worth the uncertainty

Lesson: Always report and consider confidence intervals, not just point estimates. Narrow CIs indicate reliable performance. Wide CIs indicate uncertainty—might get much worse (or better) than point estimate suggests.

NoteQuestion 6: Subgroup Analysis

You evaluate a diagnostic model and find: - Overall AUC: 0.84 - Men: AUC 0.88 - Women: AUC 0.78

What should you do?

  1. Report only overall performance (0.84)
  2. Report overall performance but note subgroup differences exist
  3. Investigate why women’s performance is lower and consider separate models or adjustments
  4. Conclude the model is biased and should not be used

Answer: c) Investigate why women’s performance is lower and consider separate models or adjustments

Explanation: Subgroup performance disparities require investigation and action, not just reporting:

Why performance differs: Possible reasons

  1. Biological differences:
    • Disease presents differently (atypical symptoms in women)
    • Different physiological reference ranges
    • Example: Heart attack symptoms differ by sex
  2. Data representation:
    • Fewer women in training data → model learns men’s patterns better
    • Women may be underdiagnosed historically → labels biased
  3. Feature appropriateness:
    • Features optimized for men
    • Missing features relevant for women
    • Example: Pregnancy-related factors not included
  4. Measurement bias:
    • Tests/measurements less accurate for women
    • Different documentation patterns

Potential solutions:

  1. Collect more women’s data (if sample size issue)

  2. Add sex-specific features:

    # Include pregnancy status, hormonal factors
    features += ['pregnant', 'menopause_status', 'hormone_therapy']
  3. Stratified modeling:

    # Separate models for men/women
    if patient.sex == 'M':
        prediction = model_men.predict(patient)
    else:
        prediction = model_women.predict(patient)
  4. Weighted loss function:

    # Penalize errors on women more heavily during training
    sample_weights = [2.0 if sex=='F' else 1.0 for sex in data['sex']]
    model.fit(X, y, sample_weight=sample_weights)

Lesson: Subgroup analysis is mandatory, not optional. When disparities found, investigate root causes and take corrective action. Don’t hide disparities in overall metrics.
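Below is a minimal sketch of the stratified evaluation that surfaces this kind of gap in the first place, assuming a dataframe with predicted risks, observed outcomes, and a sex column (all placeholders, constructed so the model tracks risk less closely for one subgroup).

# Minimal subgroup-evaluation sketch: discrimination and event rates by sex (placeholder data)
import numpy as np
import pandas as pd
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(11)
df = pd.DataFrame({
    "sex": rng.choice(["F", "M"], size=4000),
    "predicted_risk": rng.uniform(0, 1, 4000),
})
# Placeholder outcomes: predictions are noisier for one subgroup, lowering its AUC
noise = np.where(df["sex"] == "F", 0.35, 0.15)
df["outcome"] = rng.binomial(1, np.clip(df["predicted_risk"] + rng.normal(0, noise), 0.01, 0.99))

for sex, grp in df.groupby("sex"):
    auc = roc_auc_score(grp["outcome"], grp["predicted_risk"])
    print(f"{sex}: n={len(grp)}, AUC={auc:.3f}, "
          f"mean predicted={grp['predicted_risk'].mean():.2f}, observed rate={grp['outcome'].mean():.2f}")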


10.13 Discussion Questions

  1. Validation hierarchy: You’ve developed a hospital-acquired infection prediction model. Which validation studies would you conduct before deployment, and in what order? What evidence would convince you to deploy at other hospitals?

  2. Fairness trade-offs: Your sepsis model has AUC-ROC = 0.85 overall, but sensitivity is 0.90 for White patients vs. 0.75 for Black patients. What would you do? What are the trade-offs of different mitigation approaches?

  3. Calibration vs. discrimination: Model A has AUC-ROC = 0.85 and Brier score = 0.30 (poor calibration); Model B has AUC-ROC = 0.80 and Brier score = 0.15 (excellent calibration). Which would you deploy, and why?

  4. External validation failure: Your model achieves AUC-ROC = 0.82 in internal validation but 0.68 in external validation at a different hospital. What could explain this? What are your next steps?

  5. Clinical utility skepticism: A model predicts 30-day mortality with AUC-ROC = 0.88. Does this mean it is clinically useful? What additional evaluations are needed?

  6. Prospective study design: You need to evaluate a hospital readmission model prospectively. Would you choose an RCT, a stepped-wedge design, or silent-mode deployment? What are the trade-offs?

  7. Alert threshold selection: A clinical decision support tool can alert at >10%, >20%, or >30% predicted risk. How would you choose the threshold? What factors matter?

  8. Model drift: A COVID-19 forecasting model was trained on 2020 data; it is now 2023 and new variants are circulating. How would you assess whether it is still valid? What should trigger retraining?


10.14 Further Resources

10.14.1 📚 Books

10.14.2 📄 Essential Papers

Validation: - Collins et al., 2015, BMJ - TRIPOD guidelines 🎯 - Liu et al., 2019, Radiology - Medical imaging AI systematic review 🎯 - Oakden-Rayner et al., 2020, Nature Medicine - Hidden stratification 🎯

Fairness: - Obermeyer et al., 2019, Science - Racial bias case study 🎯 - Chouldechova, 2017, FAT - Impossibility theorem 🎯

Clinical Utility: - Vickers & Elkin, 2006, Medical Decision Making - Decision curve analysis 🎯

Implementation: - Proctor et al., 2011 - Implementation outcomes 🎯

10.14.3 💻 Tools

Metrics and Validation: - Scikit-learn - Comprehensive metrics - Scikit-survival - Survival analysis metrics

Fairness: - Fairlearn - Microsoft fairness toolkit - AI Fairness 360 - IBM toolkit - Aequitas - Bias audit

Explainability: - SHAP - Feature importance - LIME - Local explanations


Next: Chapter 11: Ethics, Bias, and Health Equity →