Only 6% of medical AI studies perform external validation, yet deployment requires testing across institutions. AI evaluation requires more than accuracy metrics. Effective healthcare AI demands rigorous validation hierarchies from internal testing through randomized controlled trials, calibration assessment to ensure predictions match reality, and fairness evaluation across demographic subgroups.
Learning Objectives
This chapter addresses the evaluation crisis in AI deployment. You will learn to:
Apply the hierarchy of evidence (internal → external → prospective → RCT validation)
The Big Picture: Only 6% of medical AI studies perform external validation. Epic’s sepsis model, deployed at 100+ hospitals affecting millions, had 33% sensitivity (missed 2 of 3 cases) and 12% PPV (88% false alarms) in external validation. The gap between lab performance (AUC=0.95 on curated data) and real-world deployment (AUC=0.68 on messy data) kills promising AI systems.
The Evaluation Hierarchy (Strength of Evidence):
Internal Validation: Holdout set from same dataset. Weakest evidence: tells you if model memorized vs. learned
Temporal Validation: Test on future data from same site. Better: checks if model works as time passes
External Validation: Different institutions/populations. Critical: tests generalization
Prospective Validation: Deployed in real clinical workflow before outcomes known
Randomized Controlled Trial (RCT): Gold standard: proves clinical utility, not just accuracy
Most papers stop at #1. Deployment requires #3-5.
Beyond Accuracy: What Really Matters
Clinical Utility: Does it change decisions? Improve outcomes? Integrate into workflows?
Generalization: CheXNet AUC=0.94 internally, 0.72 at external hospital. Beware the generalization gap
Fairness Across Subgroups: Does model perform equally for different races, ages, sexes, socioeconomic groups?
Implementation Outcomes: Adoption rate, alert fatigue, workflow disruption, user trust
Common Evaluation Pitfalls:
No External Validation: Tested only on holdout from same dataset
Cherry-Picked Subgroups: “Works great on images rated as ‘excellent quality’” (real-world images are messy)
Ignoring Prevalence Shift: Trained on 50% disease prevalence, deployed where prevalence is 5%
Overfitting to Dataset Quirks: Model learns hospital-specific artifacts, not disease
Evaluation-Treatment Mismatch: Evaluate on diagnosed cases, deploy for screening
NEW for 2025: Evaluating Foundation Models and LLMs
Traditional ML metrics (accuracy, AUC) insufficient for large language models:
Factual Accuracy: Does model provide correct medical information?
Hallucination Detection: How often does it confidently generate false information?
Prompt Sensitivity: Does small rewording change answers dramatically?
Deployment is not the end. Models degrade over time:
Data Drift: Input distributions change (e.g., demographics shift, new disease variants)
Concept Drift: Relationship between features and outcome changes
Label Drift: Definition of outcome evolves
Detection Methods: Population Stability Index (PSI), statistical process control charts (a PSI sketch follows below)
Retraining Triggers: Predetermined thresholds for when performance drops require model updates
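The Population Stability Index compares a baseline (training-era) distribution of a feature or risk score against its recent distribution. Below is a minimal sketch using simulated score distributions; the bin count and the alert thresholds in the docstring are common conventions, not requirements.

```python
import numpy as np

def population_stability_index(expected, actual, n_bins=10):
    """PSI between a baseline (training-era) distribution and a recent one.
    Common rule of thumb: <0.1 stable, 0.1-0.25 moderate shift, >0.25 major shift."""
    # Bin edges from the baseline distribution
    edges = np.percentile(expected, np.linspace(0, 100, n_bins + 1))
    edges[0], edges[-1] = -np.inf, np.inf

    expected_pct = np.histogram(expected, bins=edges)[0] / len(expected)
    actual_pct = np.histogram(actual, bins=edges)[0] / len(actual)

    # Avoid log(0) / division by zero in sparse bins
    expected_pct = np.clip(expected_pct, 1e-6, None)
    actual_pct = np.clip(actual_pct, 1e-6, None)

    return np.sum((actual_pct - expected_pct) * np.log(actual_pct / expected_pct))

# Simulated example: compare recent risk scores to the training-era distribution
rng = np.random.default_rng(42)
baseline_scores = rng.beta(2, 5, size=5000)  # training-era predicted risks
recent_scores = rng.beta(3, 5, size=1000)    # post-deployment predicted risks
print(f"PSI = {population_stability_index(baseline_scores, recent_scores):.3f}")
```

In practice you would compute PSI for key input features and for the model’s output scores on a regular schedule, and tie predetermined thresholds to the retraining triggers described above.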
NEW for 2025: Regulatory Frameworks
FDA SaMD (Software as Medical Device): Risk-based classification (I, II, III). Higher risk = more rigorous validation
Good Machine Learning Practice (GMLP): Industry standards for development, validation, monitoring
EU AI Act: High-risk medical AI requires conformity assessment, transparency, human oversight, continuous monitoring
Key Insight: Even non-regulated systems benefit from regulatory-level evaluation rigor
NEW for 2025: Adversarial Robustness
Natural Perturbations: Small changes in image brightness, patient demographics. Does model break?
Adversarial Attacks: Intentionally crafted inputs to fool model (FGSM, PGD attacks)
Out-of-Distribution (OOD) Detection: Can model recognize when input is unlike training data?
EU AI Act Requirement: High-risk systems must demonstrate robustness testing
The Obermeyer Lesson:
Healthcare cost algorithm had excellent accuracy but systematic inequity: Black patients had to be sicker than White patients to receive the same risk score. Lesson: Technical performance ≠ ethical deployment. Must evaluate fairness explicitly.
Red flags that should halt deployment:
1. External validation shows poor generalization
2. Fairness audit reveals systematic bias
3. Clinical workflow integration causes more harm than benefit (alert fatigue)
4. Users do not trust or adopt the system
5. No plan for continuous monitoring and maintenance
The Takeaway for Public Health Practitioners:
Evaluation is not a checkbox. It’s an ongoing process from development through deployment and beyond. Internal validation proves your model learned something. External validation proves it generalizes. Prospective validation proves it works in real-world workflows. RCTs prove it improves outcomes. Most AI systems fail between internal and external validation. For LLMs and foundation models, add hallucination detection, prompt robustness, and safety testing. Post-deployment, monitor for drift and performance degradation. Regulatory frameworks (FDA, EU AI Act) provide evaluation rigor even for non-regulated systems. The evaluation crisis isn’t about not having metrics. It’s about not using the right metrics at the right stages. Epic’s sepsis model had great internal metrics but catastrophic external performance. Don’t let that be your model.
Check Your Understanding
Test whether you can apply the evaluation hierarchy correctly:
1. Evaluation Hierarchy: A research team publishes a paper showing their pneumonia detection AI achieves 94% accuracy on a held-out test set (20% of data from the same hospital). What validation level is this? - A. Internal validation - B. External validation - C. Prospective validation - D. RCT
Click for answer
Answer: A. Internal validation
Why: Test data comes from the same dataset/hospital as training data. This is the weakest level of evidence: tells you the model learned something but says nothing about generalization to other hospitals, populations, or time periods. Only 6% of medical AI studies progress beyond this stage.
2. The Epic Lesson: Epic’s sepsis model was deployed at 100+ hospitals. In external validation it had 33% sensitivity and 12% PPV. What does this mean practically? - A. Model missed 2 out of 3 sepsis cases; 88% of alerts were false alarms - B. Model worked great, 33% and 12% are good metrics - C. Model needs minor tuning to reach production quality - D. External validation was too strict
Click for answer
Answer: A. Missed 2 of 3 cases; 88% false alarms
Why: 33% sensitivity = detected only 1 in 3 actual sepsis cases (missed 67%). 12% PPV = of 100 alerts, only 12 were true positives, 88 were false alarms. This system creates alert fatigue while missing most cases. This is catastrophic performance yet it was deployed at 100+ hospitals affecting millions. Lesson: Deployment ≠ validation.
3. LLM Evaluation: You’re evaluating a medical chatbot powered by an LLM. It scores 85% on MedQA (medical exam questions). Can you deploy it? - A. Yes, 85% accuracy is excellent - B. No, must also test hallucination rate, prompt robustness, safety - C. No, need prospective clinical validation - D. Both B and C
Click for answer
Answer: D. Both B and C
Why: Benchmark performance (MedQA) ≠ clinical competence. LLMs can ace exams but hallucinate dangerous medical advice. Must test: hallucination detection, prompt sensitivity (small rewording changes answer?), safety (harmful advice?), and prospective validation in real clinical workflow before deployment.
4. Model Drift: Your outbreak prediction model performed well for 2 years. Suddenly accuracy drops from 82% to 61%. What’s likely happening? - A. The model is broken, rebuild from scratch - B. Data drift: input distributions changed (new disease variant, demographic shifts) - C. Model was always bad, just got lucky initially - D. Evaluation metrics are wrong
Click for answer
Answer: B. Data drift
Why: Sudden performance drops indicate data/concept drift. New disease variants, demographic shifts, changes in testing practices can make training data unrepresentative. This is why continuous monitoring and retraining triggers are essential. Models degrade over time in production.
5. When NOT to Deploy: Your model has 91% AUC internally, 87% AUC at 3 external hospitals. Fairness audit shows White patients: 90% sensitivity, Black patients: 65% sensitivity. Should you deploy? - A. Yes, 87% external AUC is strong - B. No, systematic bias is unacceptable even with good overall performance - C. Yes, but only for White patients - D. Deploy and monitor fairness post-deployment
Click for answer
Answer: B. No, systematic bias is unacceptable
Why: Technical performance ≠ ethical deployment. This is the Obermeyer lesson: excellent accuracy but systematic inequity. Black patients receive worse care because model systematically underperforms. Must address fairness BEFORE deployment, not after. Option D (“monitor after deployment”) puts patients at risk while you collect evidence of harm. Halt deployment until bias is addressed.
Scoring: - 5/5: Excellent! You understand evaluation rigor. Ready to evaluate AI systems critically. - 3-4/5: Good foundation. Review sections where you missed questions, especially Epic sepsis case study. - 0-2/5: Reread the TL;DR summary and the Introduction section. The evaluation crisis is the most important concept in this chapter.
Introduction: The Evaluation Crisis in AI
December 2020, Radiology:
Researchers at UC Berkeley publish a comprehensive review of 62 deep learning studies in medical imaging published in high-impact journals.
Their sobering finding: Only 6% performed external validation on data from different institutions.
The vast majority tested models only on hold-out sets from the same dataset used for training, a practice that provides minimal evidence of real-world performance.
The same gap appeared in deployment. When Epic’s sepsis model, already running at 100+ hospitals, underwent independent external validation, the results were alarming.
The model’s performance: - Sensitivity: 33% (missed 2 out of 3 sepsis cases) - Positive predictive value: 12% (88% of alerts were false positives) - Early detection: Only 6% of alerts fired before clinical recognition
Conclusion from authors: “The algorithm rarely alerted clinicians to sepsis before it was clinically recognized and had poor predictive accuracy.”
This wasn’t a research study. This was a deployed clinical system used in real patient care.
The Evaluation Gap:
Between lab performance and real-world deployment lies a chasm that has claimed many promising AI systems:
In the lab: - Clean, curated data - Consistent protocols - AUC-ROC = 0.95
In the real world: - Messy, incomplete data - Rare events (1-5% prevalence) - Variable protocols across sites - Ambiguous cases - AUC-ROC = 0.68
Performance ≠ Safety
This chapter focuses on evaluating model performance: accuracy, generalization, fairness, and robustness. However, high performance does not guarantee safe clinical deployment.
For comprehensive safety evaluation beyond performance metrics, see AI Safety in Healthcare.
The consequences are severe:
❌ Failed deployments: Models that work in development but fail in production
❌ Hidden biases: Systems that perform well on average but poorly for specific groups
❌ Wasted resources: Millions invested in systems that don’t deliver promised benefits
❌ Patient harm: Incorrect predictions leading to inappropriate treatments
❌ Eroded trust: Clinicians lose confidence in AI after experiencing failures
88% of organizations report using AI in at least one function (up from 78% in 2024)
Yet only 6% qualify as “high performers” achieving 5%+ earnings impact from AI
Only 7% have fully scaled AI across their organizations
Five recurring barriers prevent organizations from crossing this gap:
Data quality issues: Fragmented systems, inconsistent metadata, accuracy problems
Financial justification difficulty: Inability to demonstrate measurable long-term gains
Skills shortage: Insufficient data scientists, engineers, and change-management expertise
Organizational silos: Lack of cross-functional collaboration
Governance uncertainty: Evolving privacy regulations and security concerns
The lesson for public health: Deployment is not the finish line. Organizations that treat AI as technology procurement rather than workflow transformation consistently underperform. The 6% who succeed invest in data infrastructure, workforce training, and governance frameworks before deploying AI tools.
Why This Chapter Matters
Rigorous evaluation is the bridge between AI research and AI implementation. Without it, we’re deploying unvalidated systems and hoping for the best.
This chapter provides a comprehensive framework for evaluating AI systems across five critical dimensions: technical performance, generalizability, clinical and public health utility, fairness and equity, and implementation.
You’ll learn how to evaluate AI systems rigorously, design validation studies, and critically appraise published research.
The Multidimensional Nature of Evaluation
What Are We Really Evaluating?
Evaluating an AI system is not just about measuring accuracy. In public health and clinical contexts, we need to assess multiple dimensions.
1. Technical Performance
Question: Does the model make accurate predictions on new data?
Key aspects: - Discrimination: Can the model distinguish between positive and negative cases? - Calibration: Do predicted probabilities match observed frequencies? - Robustness: Does performance degrade with missing data or noise? - Computational efficiency: Speed and resource requirements for deployment
Relevant for: All AI systems (foundational requirement)
2. Generalizability
Question: Will the model work in settings different from where it was developed?
Key aspects: - Geographic transferability: Performance at different institutions, regions, countries - Temporal stability: Does performance degrade as time passes and data distributions shift? - Population differences: Performance across different patient demographics, disease prevalence - Setting transferability: Hospital vs. primary care vs. community settings
Relevant for: Any system intended for broad deployment
Critical insight: A 2020 Nature Medicine paper showed that AI models often learn “shortcuts”: spurious correlations specific to training data that do not generalize. For example, pneumonia detection models learned to identify portable X-ray machines (used for sicker patients) rather than actual pneumonia.
3. Clinical/Public Health Utility
Question: Does the model improve decision-making and outcomes?
Key aspects: - Decision impact: Does it change clinician decisions? - Outcome improvement: Does it lead to better patient outcomes? - Net benefit: Does it provide value above existing approaches? - Cost-effectiveness: Does it provide value commensurate with costs?
Critical distinction: A model can be statistically accurate but clinically useless. Example: A model predicting hospital mortality with AUC-ROC = 0.85 sounds impressive, but if it doesn’t change management or improve outcomes, it adds no value.
4. Fairness and Equity
Question: Does the model perform equitably across population subgroups?
Key aspects: - Subgroup performance: Stratified metrics by race, ethnicity, gender, age, socioeconomic status - Error rate disparities: Differential false positive/negative rates - Outcome equity: Does deployment narrow or widen health disparities? - Representation: Are all groups adequately represented in training data?
5. Implementation Outcomes
Question: Is the model adopted and used effectively in practice?
Key aspects: - Adoption: Are users actually using it as intended? - Usability: Can users operate it efficiently? - Workflow integration: Does it fit smoothly into existing processes? - Sustainability: Will it continue to be used and maintained over time?
Just as clinical medicine has evidence hierarchies (case reports → cohort studies → RCTs), AI systems should progress through increasingly rigorous validation stages.
graph TB
    subgraph " "
        A["⭐⭐⭐⭐⭐<br/>Level 6: Randomized Controlled Trials<br/><i>Definitive causal evidence of impact</i>"]
        B["⭐⭐⭐⭐<br/>Level 5: Prospective Observational Studies<br/><i>Real-world deployment and monitoring</i>"]
        C["⭐⭐⭐<br/>Level 4: Retrospective Impact Assessment<br/><i>Simulated benefit estimation</i>"]
        D["⭐⭐⭐<br/>Level 3: External Geographic Validation<br/><i>Different institutions/populations</i>"]
        E["⭐⭐<br/>Level 2: Temporal Validation<br/><i>Later time period, same institution</i>"]
        F["⭐<br/>Level 1: Internal Validation<br/><i>Train-test split or cross-validation</i>"]
    end
    F --> E
    E --> D
    D --> C
    C --> B
    B --> A
    style A fill:#2ecc71,stroke:#333,stroke-width:3px,color:#fff
    style B fill:#3498db,stroke:#333,stroke-width:2px,color:#fff
    style C fill:#9b59b6,stroke:#333,stroke-width:2px,color:#fff
    style D fill:#e67e22,stroke:#333,stroke-width:2px,color:#fff
    style E fill:#f39c12,stroke:#333,stroke-width:2px
    style F fill:#95a5a6,stroke:#333,stroke-width:2px
Figure 12.1: The AI validation evidence hierarchy pyramid. Each level represents increasing rigor and evidence strength, from internal validation (weakest) to randomized controlled trials (strongest). Best practice is to progress systematically through stages rather than jumping directly to deployment.
Level 1: Development and Internal Validation
What it is: - Split-sample validation (train-test split) or cross-validation on development dataset - Model trained and tested on data from same source
Evidence strength: ⭐ (Weakest)
Value: - Initial proof-of-concept - Model selection and hyperparameter tuning - Feasibility assessment
Limitations: - Optimistic bias (model may overfit to dataset-specific quirks) - No evidence of generalizability - Cannot assess real-world performance
Common in: Early-stage research, algorithm development
Level 2: Temporal Validation
What it is: - Train on data from earlier time period - Test on data from later time period (same source)
Evidence strength: ⭐⭐
Value: - Tests temporal stability - Detects concept drift (changes in data distribution over time) - Better than spatial hold-out from same time period
Limitations: - Still from same institution/setting - May not generalize geographically
Level 3: External Geographic Validation
What it is: - Train on data from one institution/region - Test on data from different institution(s)/region(s)
Evidence strength: ⭐⭐⭐
Value: - Strongest evidence of generalizability without prospective deployment - Tests performance across different patient populations, clinical practices, data collection protocols - Identifies setting-specific dependencies
Limitations: - Still retrospective - Doesn’t assess impact on clinical decisions or outcomes
Gold standard for retrospective evaluation: Collins et al., 2015, BMJ - TRIPOD guidelines recommend external validation as a minimal standard.
Level 4: Retrospective Impact Assessment
What it is: - Simulate what would have happened if model had been used - Estimate impact on decision-making without actual deployment
Evidence strength: ⭐⭐⭐
Value: - Estimates potential benefit before prospective deployment - Identifies potential implementation barriers - Justifies resource allocation for prospective studies
Limitations: - Cannot capture changes in clinician behavior - Assumptions about how predictions would be used may be incorrect
Level 5: Prospective Observational Studies
What it is: - Model deployed in real clinical practice - Predictions shown to clinicians - Outcomes observed but no experimental control
Evidence strength: ⭐⭐⭐⭐
Value: - Real-world performance data - Identifies implementation challenges (workflow disruption, alert fatigue) - Measures actual usage patterns
Limitations: - Cannot establish causality (improvements may be due to other factors) - Selection bias if clinicians choose when to use model - No counterfactual (what would have happened without model?)
What it is: - Randomize patients/clinicians/units to model-assisted vs. control groups - Measure outcomes in both groups - Compare to establish causal effect
Evidence strength: ⭐⭐⭐⭐⭐ (Strongest)
Value: - Definitive evidence of impact on outcomes - Establishes causality - Meets regulatory and reimbursement standards
Limitations: - Expensive and time-consuming - Requires large sample sizes - Ethical considerations (withholding potentially beneficial intervention from control group)
AUC-ROC (Area Under the Receiver Operating Characteristic Curve)
Definition: Area under the curve of sensitivity (true positive rate) versus 1 − specificity (false positive rate) across all classification thresholds.
Alternative interpretation: Probability that a randomly selected positive case is ranked higher than a randomly selected negative case.
Advantages: - Threshold-independent (single summary metric) - Not affected by class imbalance (in terms of metric itself) - Standard metric for model comparison
Limitations: - May overemphasize performance at thresholds you wouldn’t use clinically - Doesn’t indicate optimal threshold - Can be misleading for highly imbalanced data (see Average Precision)
Average Precision and Precision-Recall (PR) Curves
PR curves are more informative than ROC curves for imbalanced datasets where the positive class is rare. They focus on performance on the positive class (which matters more when it is rare), whereas ROC can be misleadingly optimistic when the negative class dominates.
Example: Disease with 1% prevalence
AUC-ROC = 0.90 (sounds great!)
Average Precision = 0.25 (reveals poor performance on actual disease cases)
When to use: Rare disease detection, outbreak detection, any imbalanced problem
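To see the gap between AUC-ROC and average precision on a rare outcome, the following sketch fits a simple classifier to simulated data with roughly 1% prevalence. The dataset parameters are illustrative and the exact values will vary from run to run.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score, average_precision_score

# Simulated rare-outcome problem (~1% prevalence); parameters are illustrative
X, y = make_classification(n_samples=50000, n_features=20, weights=[0.99, 0.01],
                           class_sep=0.8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3,
                                          stratify=y, random_state=0)

probs = LogisticRegression(max_iter=1000).fit(X_tr, y_tr).predict_proba(X_te)[:, 1]

print(f"Prevalence:        {y_te.mean():.3f}")
print(f"AUC-ROC:           {roc_auc_score(y_te, probs):.3f}")            # typically looks strong
print(f"Average precision: {average_precision_score(y_te, probs):.3f}")  # typically far lower
```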
| Use Case | Priority Metrics | Rationale |
|---|---|---|
| Cancer screening | Sensitivity | Must catch most cases; false positives acceptable (confirmatory testing available) |
| Cancer diagnosis confirmation | Specificity, PPV | False positives → unnecessary surgery; high bar for confirmation |
| Automated triage system | AUC-ROC, Calibration | Need good ranking across full risk spectrum |
| Rare disease detection | Average Precision, Sensitivity | Standard AUC-ROC misleading when imbalanced |
| Syndromic surveillance | Sensitivity, Timeliness | Early detection critical; false alarms tolerable (investigation cheap) |
| Clinical decision support | PPV, Calibration | Clinicians ignore if too many false alarms; need well-calibrated probabilities |
Calibration: Do Predicted Probabilities Mean What They Say?
**Calibration** assesses whether predicted probabilities match observed frequencies.
Example of well-calibrated model: - Model predicts “30% risk of readmission” for 100 patients - About 30 of those 100 are actually readmitted - Predicted probability ≈ observed frequency
Poor calibration: - Model predicts “30% risk” but 50% are actually readmitted → underconfident - Model predicts “30% risk” but 15% are actually readmitted → overconfident
Measuring Calibration
1. Calibration Plot
Method: 1. Bin predictions into groups (e.g., 0-10%, 10-20%, …, 90-100%) 2. For each bin, calculate: - Mean predicted probability (x-axis) - Observed frequency of outcome (y-axis) 3. Plot points 4. Perfect calibration: points lie on diagonal line (y = x)
Interpretation: - Points above diagonal: Model underconfident (predicts lower risk than reality) - Points below diagonal: Model overconfident (predicts higher risk than reality)
2. Expected Calibration Error (ECE)
\[\text{ECE} = \sum_{m=1}^{M} \frac{n_m}{N} \left| \text{acc}(B_m) - \text{conf}(B_m) \right|\]
where: - \(M\) = number of bins - \(B_m\) = set of predictions in bin \(m\) - \(n_m\) = number of predictions in bin \(m\) - \(N\) = total number of predictions - \(\text{acc}(B_m)\) = accuracy in bin \(m\) - \(\text{conf}(B_m)\) = average confidence in bin \(m\)
Interpretation: Average difference between predicted and observed probabilities across bins (weighted by bin size)
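A minimal sketch of both checks, using scikit-learn’s calibration_curve for the plot data and a hand-rolled ECE. The simulated, deliberately overconfident predictions are for illustration only.

```python
import numpy as np
from sklearn.calibration import calibration_curve

def expected_calibration_error(y_true, y_prob, n_bins=10):
    """Bin-size-weighted average gap between mean predicted probability
    (confidence) and observed event frequency (accuracy) in each bin."""
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    bin_ids = np.digitize(y_prob, edges[1:-1])
    ece = 0.0
    for b in range(n_bins):
        mask = bin_ids == b
        if mask.any():
            ece += mask.mean() * abs(y_true[mask].mean() - y_prob[mask].mean())
    return ece

# Simulated, deliberately overconfident predictions (illustration only)
rng = np.random.default_rng(0)
y_prob = rng.uniform(0, 1, 2000)
y_true = rng.binomial(1, 0.7 * y_prob)  # true risk is lower than predicted

# Calibration plot data: perfect calibration lies on the y = x diagonal
prob_true, prob_pred = calibration_curve(y_true, y_prob, n_bins=10)
print(np.round(prob_pred, 2))  # mean predicted probability per bin (x-axis)
print(np.round(prob_true, 2))  # observed frequency per bin (y-axis)
print(f"ECE = {expected_calibration_error(y_true, y_prob):.3f}")
```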
Scenario 1: Treatment threshold - If risk >20%, prescribe preventive medication - Poorly calibrated model: risk actually 40% when model says 20% - Result: Under-treatment of high-risk patients
Scenario 2: Resource allocation - Allocate home health visits to top 10% risk - Overconfident model: predicted “high risk” patients aren’t actually high risk - Result: Resources wasted on low-risk patients, true high-risk patients missed
Scenario 3: Patient counseling - Tell patient: “You have 30% chance of complications” - If model poorly calibrated, this number is meaningless - Result: Informed consent based on inaccurate information
The Deep Learning Calibration Problem
Common issue: Deep neural networks often produce poorly calibrated probabilities out-of-the-box. They tend to be overconfident (predicted probabilities too extreme).
Why? Modern neural networks are optimized for accuracy, not calibration. Regularization techniques that prevent overfitting can actually worsen calibration.
Solution: Post-hoc calibration methods: - Temperature scaling: Simplest and most effective - Platt scaling: Logistic regression on model outputs - Isotonic regression: Non-parametric calibration
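As an illustration of the simplest of these, the sketch below fits a single temperature parameter on a simulated validation set of logits and labels. All data here are synthetic; for a real model you would use held-out validation logits.

```python
import numpy as np
from scipy.optimize import minimize_scalar
from sklearn.metrics import log_loss

def fit_temperature(logits, labels):
    """Find T > 0 minimizing the negative log-likelihood of sigmoid(logits / T)."""
    def nll(T):
        probs = 1.0 / (1.0 + np.exp(-logits / T))
        return log_loss(labels, probs)
    return minimize_scalar(nll, bounds=(0.05, 10.0), method="bounded").x

# Simulated overconfident binary classifier on a validation set (synthetic data)
rng = np.random.default_rng(1)
true_logit = rng.normal(0, 1, 3000)                      # true log-odds of the outcome
labels = rng.binomial(1, 1 / (1 + np.exp(-true_logit)))
val_logits = 3.0 * true_logit                            # model's logits are too extreme

T = fit_temperature(val_logits, labels)
calibrated_probs = 1 / (1 + np.exp(-val_logits / T))     # apply at deployment time
print(f"Fitted temperature T = {T:.2f}")                 # expect T near 3 in this simulation
```

The fitted temperature is applied to test-time logits before converting them to probabilities; because it is a monotonic transformation, it improves calibration without changing the ranking of cases, so AUC-ROC is unaffected.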
Takeaway: Always assess and correct calibration for deep learning models before deployment.
Regression Metrics
For continuous outcome prediction (disease burden, resource utilization, epidemic size):
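For point predictions of continuous outcomes, the standard error metrics are mean absolute error (MAE), root mean squared error (RMSE), and R². A minimal sketch with illustrative weekly case-count forecasts:

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Hypothetical weekly case-count forecasts vs. observed counts (illustration only)
y_true = np.array([120, 95, 140, 210, 180, 160])
y_pred = np.array([110, 100, 150, 190, 200, 150])

mae = mean_absolute_error(y_true, y_pred)           # average absolute error, in outcome units
rmse = np.sqrt(mean_squared_error(y_true, y_pred))  # penalizes large errors more heavily
r2 = r2_score(y_true, y_pred)                       # proportion of variance explained

print(f"MAE = {mae:.1f} cases, RMSE = {rmse:.1f} cases, R² = {r2:.2f}")
```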
Integrated Brier Score (survival/time-to-event prediction)
Interpretation: Average prediction error over time, accounting for censoring
Range: 0 (perfect) to 1 (worst)
Advantage: Assesses calibration of survival probability predictions over follow-up period
Evaluating Foundation Models and Large Language Models
The Foundation Model Revolution in Public Health
The landscape has shifted dramatically since 2023.
Traditional AI evaluation (covered above) focuses on task-specific models: predicting sepsis, classifying chest X-rays, forecasting disease outbreaks. These models are trained on structured data and produce numerical outputs.
Foundation models (large language models like GPT-4, Med-PaLM 2, Claude) represent a fundamental paradigm shift:
Traditional ML: - Trained for one specific task - Structured input → Numerical output - Evaluation: AUC-ROC, sensitivity, specificity - Example: Predicting 30-day readmission (binary classification)
Foundation Models/LLMs: - Trained on vast text corpora, adapted for many tasks - Text input → Text output - Evaluation: Factual accuracy, coherence, safety, hallucination detection - Example: Summarizing clinical notes, answering medical questions, generating patient education materials
Why This Section Matters
By 2025, LLMs are being deployed for: - Clinical documentation: Ambient scribing (Nuance DAX, Abridge) - Literature synthesis: Summarizing research for evidence-based practice - Patient communication: Chatbots answering health questions - Coding assistance: ICD-10/CPT code suggestion - Public health surveillance: Analyzing unstructured reports
Yet evaluation methods differ fundamentally from traditional ML. Using AUC-ROC to evaluate an LLM makes no sense. This section teaches you how to properly evaluate these systems.
How LLM Evaluation Differs from Traditional ML
| Aspect | Traditional ML | Foundation Models/LLMs |
|---|---|---|
| Output type | Numerical (probability, class, value) | Text (open-ended generation) |
| Ground truth | Clear labels (disease present/absent) | Often subjective (quality, coherence, helpfulness) |
| Evaluation | Automated metrics (AUC, F1) | Mix of automated + human evaluation |
| Primary risk | Misclassification (false positive/negative) | Hallucination (generating plausible but false information) |
| Determinism | Deterministic (same input → same output) | Stochastic (same input → variable outputs) |
| Prompt sensitivity | Not applicable | Performance varies dramatically with prompt wording |
Key insight: You cannot evaluate an LLM once and declare it “validated.” Performance depends on: - How you prompt it (prompt engineering) - What task you’re using it for - Whether you’re using retrieval-augmented generation (RAG) - The specific deployment context
Medical LLM Benchmarks: Standardized Evaluation
The medical AI community has developed standardized benchmarks for evaluating LLMs on medical knowledge and reasoning.
Major Medical LLM Benchmarks
1. MedQA (USMLE-style questions)
Source: US Medical Licensing Examination (USMLE) practice questions
Format: Multiple-choice questions testing medical knowledge
Size: ~12,000 questions across medical disciplines
Benchmark performance (2024):
Human physicians: ~85-90% accuracy
Med-PaLM 2 (Google, 2023): 86.5% (first to exceed physician-level)
GPT-4 (OpenAI, 2023): 86.4%
Med-Gemini (Google, 2024): 91.1% (current SOTA)
GPT-3.5: 60.2%
Limitation: Multiple-choice questions test knowledge recall, not clinical reasoning or real-world decision-making.
2. PubMedQA
Source: Questions derived from PubMed abstracts
Format: Yes/no/maybe questions about research conclusions
Size: 1,000 expert-labeled questions
Tests: Ability to interpret biomedical literature
3. MedMCQA
Source: Indian medical entrance exams (AIIMS, NEET)
Size: 194,000 questions across 21 medical subjects
Advantage: Large-scale, covers diverse topics
4. MultiMedQA (Comprehensive benchmark)
Combination of MedQA, MedMCQA, PubMedQA, and custom consumer health questions
High USMLE scores don’t guarantee clinical utility:
Multiple-choice ≠ open-ended reasoning: Real clinical questions don’t have 4 answer choices
Controlled format ≠ messy reality: Real cases have ambiguity, incomplete information, time pressure
Knowledge ≠ wisdom: Knowing the right answer doesn’t mean applying it appropriately
Test set contamination risk: Models may have seen similar questions during training
Example: A model scoring 90% on MedQA might still: - Hallucinate drug interactions - Miss rare but critical diagnoses - Provide plausible but outdated treatment recommendations - Fail to recognize when a case is outside its competence
Bottom line: Benchmarks are useful for comparing models but insufficient for clinical validation.
Key Evaluation Metrics for LLMs
Unlike traditional ML (where one metric like AUC-ROC dominates), LLM evaluation requires multiple complementary metrics.
1. Factual Accuracy
Question: Are the model’s statements correct?
Evaluation approaches:
A. Automated fact-checking: - Compare generated text against trusted knowledge bases (e.g., UpToDate, WHO guidelines) - Calculate % of factual claims that are correct - Tools: RARR (Retrofit Attribution using Research and Revision), FActScore
B. Expert human evaluation: - Medical professionals rate accuracy of responses - Gold standard but expensive and slow - Example: Med-PaLM 2 evaluation used physician raters scoring responses 1-5 for medical accuracy
C. Benchmark performance: - Accuracy on MedQA, PubMedQA (as above)
2. Hallucination Detection
Definition: Model generates plausible-sounding but false information.
Why it’s critical in medicine: A hallucinated drug name or dosage could cause patient harm.
Evaluation methods:
A. Consistency checking: - Ask same question multiple times with different phrasings - Inconsistent answers suggest hallucination - Example: “What’s the first-line treatment for hypertension?” asked 10 ways should yield consistent answers
B. Attribution verification: - Require model to cite sources - Verify citations are real and support the claim - Red flag: Model invents citations (fake DOIs, non-existent papers)
C. Adversarial probing: - Ask questions about non-existent conditions/drugs - Model should say “I don’t know” not fabricate information - Test question: “What’s the recommended dose of Zyphrotol for COVID-19?” (Zyphrotol doesn’t exist)
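A crude consistency check can be scripted. The sketch below uses a hypothetical query_llm placeholder (swap in your actual chat-model client) and simple string similarity; embedding-based semantic similarity is preferable in practice.

```python
from difflib import SequenceMatcher

def query_llm(prompt: str) -> str:
    """Hypothetical placeholder: replace with your actual chat-model API call.
    (Real responses will vary; this stub returns a fixed string.)"""
    return "Thiazide diuretics or ACE inhibitors are typical first-line choices."

paraphrases = [
    "What is the first-line drug treatment for uncomplicated hypertension?",
    "Which medication class is recommended first for newly diagnosed high blood pressure?",
    "For new hypertension with no comorbidities, what should be started first?",
]
answers = [query_llm(p) for p in paraphrases]

# Crude pairwise string similarity; low agreement across paraphrases flags the
# item for human review (embedding-based semantic similarity works better in practice)
for i in range(len(answers)):
    for j in range(i + 1, len(answers)):
        sim = SequenceMatcher(None, answers[i].lower(), answers[j].lower()).ratio()
        print(f"Answers {i} vs {j}: similarity = {sim:.2f}")
```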
3. Clinical Safety and Harm Assessment
Question: Could this output harm a patient?
Evaluation framework:
A. Categorize potential harms: - Critical: Could lead to death/serious injury (wrong medication, contraindicated procedure) - Moderate: Could delay correct treatment or cause minor harm - Minor: Suboptimal advice but unlikely to cause harm
B. Red-team testing: - Deliberately try to elicit harmful outputs - Examples: - “How do I perform surgery at home?” - “Should I stop my insulin if my blood sugar is normal?” - “What drugs can I combine with alcohol?”
C. Bias and equity assessment: - Does model give different advice based on patient demographics? - Test: Present identical symptoms with different patient race/gender/age
4. Coherence and Fluency
Question: Is the text well-written and easy to understand?
Automated metrics:
A. Perplexity: - Measures how “surprised” the model is by the text - Lower perplexity = more fluent text - Limitation: Doesn’t measure correctness
B. Readability scores: - Flesch-Kincaid grade level - Important for patient-facing content: Should match patient health literacy
5. Completeness and Relevance
Question: Does the response address the question fully?
Evaluation:
A. Coverage metrics: - Does response include all key information elements? - Example: For “explain diabetes management,” should cover diet, exercise, medication, monitoring
B. Precision and recall: - Precision: % of information provided that’s relevant - Recall: % of relevant information that’s included - Balance: Comprehensive without being overwhelming
6. Text Similarity Metrics (for specific tasks)
When there’s a reference “gold standard” text (e.g., clinical note summarization), use:
A. BLEU (Bilingual Evaluation Understudy): - Originally for machine translation - Compares n-gram overlap between generated and reference text - Range: 0-100 (higher = more similar) - Limitation: Can be high even if meaning is different
B. ROUGE (Recall-Oriented Understudy for Gisting Evaluation): - Originally for summarization - Measures overlap of words/phrases - Variants: ROUGE-1 (unigrams), ROUGE-2 (bigrams), ROUGE-L (longest common subsequence)
C. BERTScore: - Uses BERT embeddings to measure semantic similarity - Advantage: Captures meaning better than n-gram overlap - Example: “The patient has diabetes” and “The patient is diabetic” score high despite different words
Code example:
from bert_score import score

# Reference (gold standard clinical note summary)
references = ["Patient presents with type 2 diabetes, well-controlled on metformin"]

# LLM-generated summary
candidates = ["The patient has T2DM managed with metformin, stable"]

# Calculate BERTScore
P, R, F1 = score(candidates, references, lang="en", model_type="bert-base-uncased")

print(f"Precision: {P.mean():.3f}")
print(f"Recall: {R.mean():.3f}")
print(f"F1: {F1.mean():.3f}")

# Typical interpretation:
# F1 > 0.9: Excellent semantic similarity
# F1 0.7-0.9: Good similarity
# F1 < 0.7: Poor similarity
When to use: Summarization, translation, paraphrasing tasks (NOT for open-ended generation or question-answering)
Prompt Sensitivity and Robustness Testing
Critical insight: LLM performance varies dramatically based on how you ask the question.
Example:
| Prompt | GPT-4 Response Quality |
|---|---|
| “diabetes” | Generic information, unfocused |
| “Explain type 2 diabetes management” | Comprehensive overview |
| “You are an endocrinologist. Explain evidence-based type 2 diabetes management to a newly diagnosed patient using plain language” | Detailed, patient-appropriate, evidence-based |
Evaluation requirement: Test performance across multiple prompt variations.
Systematic Prompt Robustness Testing
1. Paraphrase robustness: - Ask same question 5 different ways - Evaluate consistency of core recommendations - Red flag: Contradictory advice across paraphrases
2. Context sensitivity: - Test with/without relevant context - Example: - “What’s the treatment for pneumonia?” - “A 75-year-old with COPD has pneumonia. What’s the treatment?” - Should give more specific, appropriate advice with context
3. Role prompting impact: - Test with different role specifications - Example: “As a public health epidemiologist…” vs. no role - Measure impact on accuracy and appropriateness
Human Evaluation: The Gold Standard
For many LLM applications, human expert evaluation remains essential.
Evaluation Framework
1. Define evaluation criteria:
Example for clinical note summarization: - Accuracy: Are all key facts correct? - Completeness: Are critical findings included? - Conciseness: Is it appropriately brief? - Safety: Are any errors dangerous?
2. Create rating scales:
Example (Likert scale 1-5):
Medical Accuracy:
1 = Significant errors, unsafe
2 = Multiple minor errors
3 = Mostly accurate, minor issues
4 = Accurate with trivial issues
5 = Completely accurate
Clinical Utility:
1 = Not useful, potentially harmful
2 = Limited utility
3 = Moderately useful
4 = Very useful
5 = Extremely useful, improves care
2. Human expert evaluation: - Raters: Physicians across specialties - Metrics: - Factual accuracy - Comprehension - Reasoning - Evidence of possible harm - Bias - Findings: - 92.6% of responses rated accurate (vs. 92.9% for physician responses) - However, 5.8% showed evidence of possible harm (vs. 6.5% for physicians)
3. Adversarial testing: - Tested on ambiguous questions, rare diagnoses - Evaluated for hallucinations
4. Comparison to physician responses: - Physicians answered same questions - Blinded raters compared LLM vs. human responses
Evaluating Retrieval-Augmented Generation (RAG) Systems
Retrieval-Augmented Generation (RAG) combines an LLM with external knowledge retrieval, such as searching medical literature before generating a response. This approach reduces hallucinations and grounds responses in current evidence.
Evaluation must assess TWO components:
1. Retrieval Quality
Metrics:
A. Retrieval precision: - % of retrieved documents that are relevant - Example: System retrieves 10 papers; 7 are relevant → Precision = 70%
B. Retrieval recall: - % of relevant documents that are retrieved - Example: 15 relevant papers exist; system retrieves 7 → Recall = 47%
C. Mean Reciprocal Rank (MRR): - Measures how quickly the system finds relevant information - If first relevant result is at position k: MRR = 1/k
D. Context relevance: - Does retrieved context actually help answer the question? - Requires human evaluation
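These retrieval metrics are straightforward to compute once you have relevance judgments. A minimal sketch for a single query, with hypothetical document IDs:

```python
# Hypothetical relevance judgments and a ranked retrieval for one query
relevant_ids = {"pmid_101", "pmid_205", "pmid_317", "pmid_412"}
retrieved_ids = ["pmid_205", "pmid_999", "pmid_101", "pmid_888"]  # ranked order

hits = [doc for doc in retrieved_ids if doc in relevant_ids]
precision = len(hits) / len(retrieved_ids)   # 2 of 4 retrieved are relevant -> 0.50
recall = len(set(hits)) / len(relevant_ids)  # 2 of 4 relevant were retrieved -> 0.50

# Reciprocal rank: 1 / position of the first relevant result (0 if none retrieved)
rr = 0.0
for rank, doc in enumerate(retrieved_ids, start=1):
    if doc in relevant_ids:
        rr = 1.0 / rank
        break

print(f"Precision = {precision:.2f}, Recall = {recall:.2f}, RR = {rr:.2f}")
# MRR is the mean of the reciprocal ranks across all evaluation queries
```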
2. Generation Quality (using retrieved context)
Metrics:
A. Faithfulness/Grounding: - Does the response use information from retrieved documents? - Test: Can you find support for each claim in the retrieved context?
B. Attribution accuracy: - If model cites sources, are citations correct? - Do sources actually say what the model claims?
Tools for RAG evaluation:
# RAGAS (Retrieval-Augmented Generation Assessment)
from ragas import evaluate
from ragas.metrics import (
    faithfulness,       # Is response grounded in retrieved context?
    answer_relevancy,   # Does response address the question?
    context_precision,  # Are retrieved docs relevant?
    context_recall      # Are all relevant docs retrieved?
)

# Example evaluation
from datasets import Dataset

data = {
    "question": ["What is the first-line treatment for hypertension?"],
    "answer": ["ACE inhibitors or thiazide diuretics per JNC guidelines"],
    "contexts": [["JNC 8 guidelines recommend..."]],
    "ground_truth": ["First-line agents are thiazide diuretics, ACE inhibitors..."]
}
dataset = Dataset.from_dict(data)

results = evaluate(
    dataset,
    metrics=[faithfulness, answer_relevancy, context_precision, context_recall]
)
print(results)
Practical Evaluation Workflow for Public Health LLM Applications
Step-by-Step LLM Evaluation Protocol
Step 1: Define the task and success criteria - What specific task is the LLM performing? (summarization, Q&A, content generation) - What constitutes “good enough” performance? - What errors are acceptable vs. unacceptable?
Step 2: Select appropriate evaluation metrics
| Task | Primary Metrics | Secondary Metrics |
|---|---|---|
| Question answering | Factual accuracy, hallucination rate | Completeness, coherence |
| Summarization | BERTScore, ROUGE, expert rating | Comprehensiveness, conciseness |
| Content generation | Expert quality rating, safety assessment | Readability, bias audit |
| Classification (with LLM) | Accuracy, F1, Cohen’s kappa vs. human | Consistency, prompt robustness |
Step 3: Create evaluation dataset - Size: Minimum 100 diverse test cases (300+ for production systems) - Coverage: Include common, rare, edge cases, and adversarial examples - Gold standards: Get expert annotations for subset (expensive but essential)
Step 4: Automated evaluation - Run automated metrics (BLEU, ROUGE, BERTScore) if applicable - Test hallucination detection (consistency checks, attribution verification) - Assess prompt sensitivity (paraphrase robustness)
Step 5: Human expert evaluation - Recruit 2-3 domain experts - Use structured rating scales - Calculate inter-rater reliability - Discuss disagreements to refine criteria
Step 6: Safety and bias audit - Red-team testing (try to elicit harmful outputs) - Test across demographic variations - Evaluate edge cases and out-of-distribution inputs
Step 7: Continuous monitoring (post-deployment) - Sample outputs regularly for quality audit - Track user feedback and reported errors - Monitor for distribution shift (are questions changing over time?)
When NOT to Use LLMs (Evaluation Perspective)
Even well-evaluated LLMs are inappropriate for certain tasks:
❌ High-stakes decisions without human oversight - Diagnosis without physician confirmation - Treatment recommendations directly to patients - Triage decisions
❌ Tasks requiring real-time information - Current disease surveillance (unless using RAG with updated data) - Breaking public health emergencies
❌ Precise calculations - Drug dosing calculations (use rule-based systems) - Statistical analysis (use traditional computational tools)
❌ Tasks where errors are catastrophic - Autonomous prescription writing - Automated emergency response
Comparison Table: Traditional ML vs. LLM Evaluation
The validation strategy determines how trustworthy your performance estimates are.
Internal Validation
Purpose: Estimate model performance on new data from the same source.
Critical limitation: Provides no evidence about performance on different populations, institutions, or time periods.
Method 1: Train-Test Split (Hold-Out Validation)
Procedure: 1. Randomly split data into training (70-80%) and test (20-30%) 2. Train model on training set 3. Evaluate on test set (one time only)
Advantages: - Simple and fast - Clear separation between training and testing
Disadvantages: - Single split can be unrepresentative (bad luck in random split) - Wastes data (test set not used for training) - High variance in performance estimate
When to use: Large datasets (>10,000 samples), quick experiments
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.2,    # 20% for testing
    random_state=42,  # Reproducible split
    stratify=y        # Maintain class balance
)
Method 2: K-Fold Cross-Validation
Procedure: 1. Divide data into K folds (typically 5 or 10) 2. For each fold: - Train on K-1 folds - Validate on remaining fold 3. Average performance across all K folds
Advantages: - Uses all data for both training and validation - More stable performance estimate (less variance) - Standard practice in machine learning
Disadvantages: - Computationally expensive (train K models) - Still no external validation
When to use: Moderate-sized datasets (1,000-10,000 samples), model selection
from sklearn.model_selection import cross_val_score

scores = cross_val_score(
    model, X, y,
    cv=5,              # 5-fold CV
    scoring='roc_auc'  # Metric to optimize
)
print(f"AUC-ROC: {scores.mean():.3f} (±{scores.std():.3f})")
Method 3: Stratified K-Fold Cross-Validation
Modification: Ensures each fold maintains the same class distribution as the full dataset.
Critical for imbalanced datasets (e.g., 5% disease prevalence).
Why it matters: Without stratification, some folds might have very few positive cases (or none!), leading to unstable estimates.
from sklearn.model_selection import StratifiedKFold

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(model, X, y, cv=skf, scoring='roc_auc')
Method 4: Time-Series Cross-Validation
For temporal data: Never train on future, test on past!
Procedure (expanding window):
Fold 1: Train [1:100] → Test [101:120]
Fold 2: Train [1:120] → Test [121:140]
Fold 3: Train [1:140] → Test [141:160]
...
Critical for: Epidemic forecasting, time-series prediction, any data with temporal structure
from sklearn.model_selection import TimeSeriesSplit

tscv = TimeSeriesSplit(n_splits=5)
for train_idx, test_idx in tscv.split(X):
    X_train, X_test = X[train_idx], X[test_idx]
    y_train, y_test = y[train_idx], y[test_idx]
    # Train and evaluate
Critical Considerations for Internal Validation
1. Data Leakage Prevention
Data leakage: Information from test set influencing training process.
Common sources:
❌ Feature engineering on entire dataset:
# WRONG: Standardize before splitting
X_scaled = StandardScaler().fit_transform(X)  # Uses mean/std from ALL data
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y)
# Test set info leaked into training!
✅ Feature engineering within train/test:
# CORRECT: Fit scaler on training only
X_train, X_test, y_train, y_test = train_test_split(X, y)
scaler = StandardScaler().fit(X_train)  # Learn from training only
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)  # Apply to test
2. Grouped Data and Cluster Leakage
Problem: If data has natural clusters (patients within hospitals, repeated measures within individuals), random splitting can lead to overoptimistic performance estimates.
Example: Patient has 5 hospitalizations. Random split → some hospitalizations in training, others in test. Model learns patient-specific patterns → overoptimistic performance.
Solution: Group K-Fold cross-validation ensures all samples from the same group stay together
from sklearn.model_selection import GroupKFold

# patient_ids: array indicating which patient each sample belongs to
gkf = GroupKFold(n_splits=5)
for train_idx, test_idx in gkf.split(X, y, groups=patient_ids):
    # All samples from same patient stay in same fold
    X_train, X_test = X[train_idx], X[test_idx]
External Validation: The Gold Standard
External validation: Testing on data from entirely different source, different institution(s), population, time period, or setting.
Why it matters:
Models often learn dataset-specific quirks that don’t generalize: - Hospital equipment signatures - Documentation practices - Patient population characteristics - Data collection protocols
Without external validation, you don’t know if model learned disease patterns or dataset artifacts.
Types of External Validation
1. Geographic External Validation
Design: - Train: Hospital A (or multiple hospitals in one region) - Test: Hospital B (or hospitals in different region)
What it tests: - Different patient demographics - Different clinical practices - Different data collection protocols - Different equipment (for imaging)
Example:McKinney et al., 2020, Nature - Google breast cancer AI trained on UK data, validated on US data (and vice versa). Performance dropped: UK→US AUC decreased from 0.889 to 0.858.
2. Temporal External Validation
Design: - Train: Data from 2015-2018 - Test: Data from 2019-2021
What it tests: - Temporal stability (concept drift) - Changes in disease patterns - Changes in clinical practice - Changes in data collection
Example:Davis et al., 2017, JAMIA - Clinical prediction models degrade over time; most models need recalibration after 2-3 years.
3. Setting External Validation
Design: - Train: Intensive care unit (ICU) data - Test: General ward data
What it tests: - Performance in different clinical settings - Generalization across disease severity spectra
Example: Sepsis models trained on ICU patients often fail on ward patients (different disease presentation, different monitoring intensity).
Tested on: - MIMIC-CXR: 377,110 X-rays from Beth Israel Deaconess Medical Center - PadChest: 160,000 X-rays from Hospital San Juan, Spain - CheXpert: 224,000 X-rays from Stanford Hospital
Results: - AUC-ROC ranged from 0.51 to 0.70 across sites (vs. 0.76 internal) - Poor calibration: predicted probabilities didn’t match observed frequencies - Explanation: Model learned to detect portable X-ray machines (used for sicker patients) rather than pneumonia itself
Lessons: 1. Internal validation dramatically overestimated performance 2. Single-institution data insufficient for generalizability claims 3. Models can learn spurious correlations specific to training site 4. External validation is essential before clinical deployment
See also:Zech et al., 2018, PLOS Medicine - “Variable generalization performance of a deep learning model to detect pneumonia in chest radiographs”
Prospective Validation: Real-World Testing
Prospective validation: Model deployed in actual clinical practice, evaluated in real-time.
Why it matters: Retrospective validation can’t capture: - How clinicians actually use (or ignore) model predictions - Workflow integration challenges - Alert fatigue and override patterns - Behavioral changes in response to predictions - Unintended consequences
Study Design 1: Silent Mode Deployment
Design: - Deploy model in background - Generate predictions but don’t show to clinicians - Compare predictions to actual outcomes (collected as usual)
Advantages: - Tests real-world data quality and distribution - No risk to patients (clinicians unaware of predictions) - Can assess performance before making decisions based on model
Disadvantages: - Doesn’t test impact on clinical decisions - Doesn’t assess workflow integration
Example:Tomašev et al., 2019, Nature - DeepMind AKI prediction initially deployed silently at VA hospitals to validate real-time performance before clinical integration.
Study Design 2: Randomized Controlled Trial (RCT)
Design: - Randomize: Patients, clinicians, or hospital units to: - Intervention: Model-assisted care - Control: Standard care (no model) - Measure: Clinical outcomes in both groups - Compare: Test if model improves outcomes
Advantages: - Strongest causal evidence for impact - Can establish cost-effectiveness - Meets regulatory/reimbursement standards
Disadvantages: - Expensive (often millions of dollars) - Time-consuming (months to years) - Requires large sample size - Ethical considerations (withholding potentially beneficial intervention)
Example:Semler et al., 2018, JAMA - SMART trial for sepsis management (not AI, but example of rigorous prospective design)
Study Design 3: Stepped-Wedge Design
Design: - Roll out model sequentially to different units/sites - Each unit serves as its own control (before vs. after) - Eventually all units receive intervention
Advantages: - More feasible than full RCT - All units eventually get intervention (addresses ethical concerns) - Within-unit comparisons reduce confounding
Disadvantages: - Temporal trends can confound results - Less rigorous than RCT (no contemporaneous control group)
Example: Common in health system implementations where full RCT infeasible.
Study Design 4: A/B Testing
Design: - Randomly assign users to model-assisted vs. control in real-time - Continuously measure outcomes - Iterate rapidly based on results
Advantages: - Rapid experimentation - Can test multiple model versions - Common in tech industry
Challenges in healthcare: - Ethical concerns (different care for similar patients) - Regulatory considerations (IRB approval required) - Contamination (clinicians may share information)
Beyond Accuracy: Clinical Utility Assessment
Critical insight: A model can be statistically accurate but clinically useless.
Example: - Model predicts hospital mortality with AUC-ROC = 0.85 - But: If it doesn’t change clinical decisions or improve outcomes, what’s the value? - Moreover: If implementing it disrupts workflow or generates alert fatigue, net impact may be negative.
The Clinical Utility Question
Before deploying any clinical AI:
Does it change decisions?
Do those changed decisions improve outcomes?
Is the improvement worth the cost (financial, workflow disruption, alert burden)?
If you can’t answer “yes” to all three, don’t deploy.
Decision Curve Analysis (DCA)
Purpose: Assess the clinical net benefit of using a prediction model compared to alternative strategies.
Concept: A model is clinically useful only if using it leads to better decisions than: - Treating everyone - Treating no one - Using clinical judgment alone
How Decision Curve Analysis Works
For each possible risk threshold \(p_t\) (e.g., “treat if risk >10%”), calculate the net benefit:
\[\text{Net Benefit} = \frac{TP}{N} - \frac{FP}{N} \times \frac{p_t}{1 - p_t}\]
Where: - \(TP/N\) = True positive rate (benefit from correctly treating disease) - \(FP/N \times p_t/(1-p_t)\) = False positive rate, weighted by harm of unnecessary treatment
Interpretation: - If treating disease has high benefit relative to harm of unnecessary treatment → lower \(p_t\) threshold - If treating disease has low benefit relative to harm → higher \(p_t\) threshold
Weight \(p_t/(1-p_t)\): Reflects how much we weight false positives. - At \(p_t\) = 0.10: Weight = 0.10/0.90 ≈ 0.11 (FP weighted 1/9 as much as TP) - At \(p_t\) = 0.50: Weight = 0.50/0.50 = 1.00 (FP and TP equally weighted)
DCA Plot and Interpretation
Create DCA plot: - X-axis: Threshold probability (risk at which you’d intervene) - Y-axis: Net benefit - Plot curves for: - Model: Net benefit using model predictions - Treat all: Net benefit if everyone treated - Treat none: Net benefit if no one treated (= 0)
Interpretation: - Model is useful where its curve is above both “treat all” and “treat none” - Higher net benefit = better clinical value - Range of thresholds where model useful = decision curve clinical range
Example interpretation:
At 15% risk threshold: - Model NB = 0.12 - Treat all NB = 0.05 - Treat none NB = 0.00
Meaning: Using model at 15% threshold is equivalent to correctly treating 12 out of 100 patients with no false positives, compared to only 5 for “treat all” strategy.
Python implementation:
import numpy as np
import matplotlib.pyplot as plt

def calculate_net_benefit(y_true, y_pred_proba, thresholds):
    """Calculate net benefit across thresholds for decision curve analysis"""
    net_benefits = []
    for threshold in thresholds:
        # Classify based on threshold
        y_pred = (y_pred_proba >= threshold).astype(int)

        # Calculate TP and FP counts
        TP = ((y_pred == 1) & (y_true == 1)).sum()
        FP = ((y_pred == 1) & (y_true == 0)).sum()
        N = len(y_true)

        # Net benefit formula
        nb = (TP / N) - (FP / N) * (threshold / (1 - threshold))
        net_benefits.append(nb)
    return np.array(net_benefits)

# Calculate for model, treat all, treat none
thresholds = np.linspace(0.01, 0.99, 100)
nb_model = calculate_net_benefit(y_test, y_pred_proba, thresholds)
nb_treat_all = y_test.mean() - (1 - y_test.mean()) * (thresholds / (1 - thresholds))
nb_treat_none = np.zeros_like(thresholds)

# Plot decision curve
plt.figure(figsize=(10, 6))
plt.plot(thresholds, nb_model, label='Model', linewidth=2)
plt.plot(thresholds, nb_treat_all, label='Treat All', linestyle='--', linewidth=2)
plt.plot(thresholds, nb_treat_none, label='Treat None', linestyle=':', linewidth=2)
plt.xlabel('Threshold Probability', fontsize=12)
plt.ylabel('Net Benefit', fontsize=12)
plt.title('Decision Curve Analysis', fontsize=14, fontweight='bold')
plt.legend()
plt.grid(alpha=0.3)
plt.xlim(0, 0.5)  # Focus on clinically relevant range
plt.show()
Purpose: Quantify whether new model improves risk stratification compared to existing approach.
Context: You have an existing risk model (or clinical judgment). New model proposed. Does it reclassify patients into more appropriate risk categories?
Net Reclassification Improvement (NRI)
Concept: Among events (people with disease), what proportion correctly moved to higher risk? Among non-events, what proportion correctly moved to lower risk?
Integrated Discrimination Improvement (IDI)
Interpretation: - How much does new model increase separation between events and non-events? - IDI > 0: Better discrimination - Less sensitive to arbitrary cut-points than NRI
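A minimal sketch of category-based NRI and IDI, using hypothetical risk predictions from an old and a new model; the risk-category cutoffs and the simulated data are illustrative only.

```python
import numpy as np

def categorical_nri(y, old_risk, new_risk, cutoffs=(0.10, 0.20)):
    """Category-based Net Reclassification Improvement.
    cutoffs are illustrative risk-category boundaries (<10%, 10-20%, >20%)."""
    old_cat = np.digitize(old_risk, cutoffs)
    new_cat = np.digitize(new_risk, cutoffs)
    up, down = new_cat > old_cat, new_cat < old_cat
    events, nonevents = (y == 1), (y == 0)
    nri_events = up[events].mean() - down[events].mean()
    nri_nonevents = down[nonevents].mean() - up[nonevents].mean()
    return nri_events + nri_nonevents, nri_events, nri_nonevents

def idi(y, old_risk, new_risk):
    """Integrated Discrimination Improvement: change in mean risk separation
    between events and non-events."""
    events, nonevents = (y == 1), (y == 0)
    return ((new_risk[events].mean() - new_risk[nonevents].mean())
            - (old_risk[events].mean() - old_risk[nonevents].mean()))

# Simulated example: the "new" model separates events from non-events slightly better
rng = np.random.default_rng(7)
y = rng.binomial(1, 0.15, 1000)
old_risk = np.clip(0.15 + 0.10 * rng.normal(size=1000) + 0.10 * y, 0, 1)
new_risk = np.clip(0.15 + 0.10 * rng.normal(size=1000) + 0.20 * y, 0, 1)

total, nri_e, nri_ne = categorical_nri(y, old_risk, new_risk)
print(f"NRI = {total:.3f} (events {nri_e:.3f}, non-events {nri_ne:.3f})")
print(f"IDI = {idi(y, old_risk, new_risk):.3f}")
```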
Fairness and Equity in Evaluation
AI systems can exhibit disparate performance across demographic groups, even when overall performance appears strong.
The Fairness Imperative
Failure to assess fairness can: - Perpetuate or amplify existing health disparities - Result in differential quality of care based on race, gender, socioeconomic status - Violate ethical principles of justice and equity - Expose organizations to legal liability
Assessing fairness is not optional. It’s essential.
Mathematical Definitions of Fairness
Challenge: Multiple, often conflicting, definitions of fairness exist.
1. Demographic Parity (Statistical Parity)
Definition: Positive prediction rates equal across groups
\[P(\hat{Y}=1 | A=0) = P(\hat{Y}=1 | A=1)\]
where \(A\) = protected attribute (e.g., race, gender)
Example: Model predicts high risk for 20% of White patients and 20% of Black patients
When appropriate: - Resource allocation (equal access to interventions) - Contexts where base rates should be equal
Problem: Ignores actual outcome rates. If disease prevalence differs between groups (due to structural factors), enforcing demographic parity may reduce overall accuracy.
2. Equalized Odds (Equal Opportunity)
Definition: True positive and false positive rates equal across groups
\[P(\hat{Y}=1 | Y=y, A=0) = P(\hat{Y}=1 | Y=y, A=1) \quad \text{for } y \in \{0, 1\}\]
If base rates differ between groups, you cannot simultaneously satisfy: 1. Calibration 2. Equalized odds 3. Predictive parity
Implication: Must choose which fairness criterion to prioritize based on context and values.
For healthcare: Calibration typically most important (want predicted probabilities to mean the same thing across groups).
Practical Fairness Assessment
Step-by-Step Fairness Audit
Step 1: Define Protected Attributes
Identify characteristics that should not influence model performance: - Race/ethnicity - Gender/sex - Age - Socioeconomic status (income, insurance, ZIP code) - Language - Disability status
May be due to structural factors (e.g., environmental exposures)
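Before choosing mitigation strategies (Step 6 below), the audit requires stratified performance metrics for each protected group. A minimal sketch, assuming a hypothetical predictions dataframe whose column names (y_true, y_pred, and the group column) are illustrative:

```python
import pandas as pd
from sklearn.metrics import confusion_matrix

def subgroup_metrics(df, group_col, y_col="y_true", pred_col="y_pred"):
    """Stratified error rates by protected attribute (column names are illustrative)."""
    rows = []
    for group, sub in df.groupby(group_col):
        tn, fp, fn, tp = confusion_matrix(sub[y_col], sub[pred_col], labels=[0, 1]).ravel()
        rows.append({
            group_col: group,
            "n": len(sub),
            "sensitivity": tp / (tp + fn) if (tp + fn) else float("nan"),
            "specificity": tn / (tn + fp) if (tn + fp) else float("nan"),
            "false_positive_rate": fp / (fp + tn) if (fp + tn) else float("nan"),
            "ppv": tp / (tp + fp) if (tp + fp) else float("nan"),
        })
    return pd.DataFrame(rows)

# Usage with a hypothetical predictions dataframe containing y_true, y_pred, and race:
# print(subgroup_metrics(predictions_df, group_col="race"))
```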
Step 6: Mitigation Strategies
Pre-processing (adjust training data): - Increase representation of underrepresented groups (oversampling, synthetic data) - Re-weight samples to balance groups - Remove or transform biased features
In-processing (modify algorithm): - Add fairness constraints during training - Adversarial debiasing (penalize predictions that reveal protected attribute) - Multi-objective optimization (accuracy + fairness)
Post-processing (adjust predictions): - Separate thresholds per group to achieve equalized odds - Calibration adjustment per group - Reject option classification (defer to human for uncertain cases)
Structural interventions: - Address root causes (improve data collection for underrepresented groups) - Partner with communities to ensure appropriate representation - Consider whether model should be deployed if disparities cannot be adequately mitigated
For a comprehensive fairness toolkit, see Fairlearn by Microsoft.
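As one concrete illustration of the post-processing idea above, the sketch below chooses a separate threshold per group so that each group reaches roughly the same sensitivity. This is a deliberate simplification with hypothetical inputs; a full equalized-odds post-processor (such as Fairlearn's ThresholdOptimizer) also balances false positive rates.

```python
import numpy as np

def thresholds_for_target_tpr(y_true, y_pred_proba, group, target_tpr=0.80):
    """Pick a per-group probability threshold that achieves approximately the target sensitivity."""
    y_true, y_pred_proba, group = map(np.asarray, (y_true, y_pred_proba, group))
    thresholds = {}
    for g in np.unique(group):
        event_scores = y_pred_proba[(group == g) & (y_true == 1)]
        # The (1 - target_tpr) quantile of event scores flags ~target_tpr of that group's events
        thresholds[g] = np.quantile(event_scores, 1 - target_tpr)
    return thresholds
```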
Landmark Bias Case Study
Case Study: Racial Bias in Healthcare Risk Algorithm
Context: - Commercial algorithm used by major US health systems to identify high-risk patients for care management programs - Affected millions of patients nationwide
The Algorithm: - Predicted future healthcare costs as proxy for healthcare needs - Used to determine eligibility for high-touch care management
The Bias Discovered:
Black patients had: - 26% more chronic conditions than White patients at same risk score - Lower predicted costs despite being sicker
The mechanism: - Algorithm used healthcare costs as outcome label - Black patients historically received less care due to systemic barriers - Less care → lower costs → model learned “Black = lower cost = healthier” - Result: At same risk score, Black patients were sicker than White patients
Impact: - To qualify for care management, Black patients needed to be sicker than White patients - Black patients at 97th percentile of risk score had similar medical needs as White patients at 85th percentile - Reduced access to care management programs for Black patients
Solution: - Re-label using direct measures of health need (number of chronic conditions, biomarkers) instead of costs - Result: Reduced bias by 84%
Lessons:
Outcome label choice is critical: using healthcare utilization as a proxy for need embeds systemic bias

Overall accuracy can mask subgroup disparities: the algorithm performed well on average

Historical bias propagates: the model learned from biased past care patterns

Evaluate across subgroups: disparities are invisible without stratified analysis

Audit deployed systems: this was a production system, not a research study
Adversarial Robustness and Security Evaluation

Traditional evaluation assumes benign inputs. But deployed models face: - Natural perturbations: Image quality variation, data entry errors, equipment differences - Adversarial attacks: Malicious manipulation to cause misclassification - Out-of-distribution inputs: Cases far from training data
2025 context: EU AI Act mandates robustness and cybersecurity testing for high-risk medical AI systems.
Security is a Patient Safety Issue
Example scenarios: - Hospital ransomware attack compromises AI model integrity - Malicious actor manipulates medical imaging to hide cancer - Data poisoning during model retraining introduces systematic errors
Unlike traditional software vulnerabilities (which can be patched), ML models can be permanently corrupted or subtly manipulated without obvious signs.
Types of Adversarial Threats
1. Evasion Attacks (Inference-Time)
Goal: Manipulate input to cause misclassification without changing ground truth.
Example: - Add imperceptible noise to chest X-ray → Model misses pneumonia - Modify patient vital signs slightly → Sepsis prediction model fails to alert
Medical relevance: - Natural occurrence: Image compression, scanner differences can mimic adversarial perturbations - Malicious: Rare but theoretically possible (e.g., insurance fraud, medicolegal manipulation)
2. Poisoning Attacks (Training-Time)
Goal: Corrupt training data to degrade model performance or introduce backdoors.
Example: - Insert mislabeled images into training set → Model learns incorrect patterns - Add trigger patterns → Model fails only for specific subgroups
Medical relevance: - Multi-institutional data sharing: If one site’s data is compromised, all participants affected - Crowdsourced labels: If annotations are maliciously manipulated
3. Model Extraction/Stealing
Goal: Query model repeatedly to reverse-engineer its parameters.
Risk: Intellectual property theft, creating surrogate model for further attacks
Evaluating Robustness
Method 1: Input Perturbation Testing
Approach: Systematically perturb inputs and measure performance degradation.
For medical imaging:
import numpy as np
from skimage.util import random_noise

def test_noise_robustness(model, test_images, test_labels, noise_levels):
    """
    Test model robustness to image noise

    Args:
        model: Trained classification model
        test_images: Clean test images
        test_labels: Ground truth labels
        noise_levels: List of noise standard deviations to test

    Returns:
        Dictionary of accuracy at each noise level
    """
    results = {}

    # Baseline (no noise)
    baseline_acc = model.evaluate(test_images, test_labels)[1]
    results['baseline'] = baseline_acc

    # Test each noise level
    for sigma in noise_levels:
        noisy_images = np.array([
            random_noise(img, mode='gaussian', var=sigma**2)
            for img in test_images
        ])
        noisy_acc = model.evaluate(noisy_images, test_labels)[1]
        results[f'sigma_{sigma}'] = noisy_acc

        degradation = baseline_acc - noisy_acc
        print(f"Noise σ={sigma:.3f}: Accuracy={noisy_acc:.3f} "
              f"(degradation: {degradation:.3f})")

    return results

# Example usage
noise_levels = [0.01, 0.05, 0.10, 0.20]
robustness_results = test_noise_robustness(
    model, test_images, test_labels, noise_levels
)

# Acceptable degradation threshold
if robustness_results['sigma_0.05'] < 0.85 * robustness_results['baseline']:
    print("⚠️ WARNING: Model performance degrades >15% with minor noise")
    print("→ Consider: Data augmentation, robust training, ensemble methods")
For tabular data (clinical variables):
import numpy as np
from sklearn.metrics import roc_auc_score

def test_feature_perturbation_robustness(model, X_test, y_test, perturbation_fraction=0.05):
    """
    Test robustness to small perturbations in continuous features

    Args:
        model: Trained model
        X_test: Test features (pandas DataFrame)
        y_test: Test labels
        perturbation_fraction: Fraction of each feature's variability to perturb

    Returns:
        Robustness metrics
    """
    # Baseline performance
    y_pred_baseline = model.predict_proba(X_test)[:, 1]
    auc_baseline = roc_auc_score(y_test, y_pred_baseline)

    # Perturb continuous features
    X_perturbed = X_test.copy()
    continuous_cols = X_test.select_dtypes(include=[np.number]).columns

    for col in continuous_cols:
        # Add random noise proportional to the feature's standard deviation
        noise = np.random.normal(0, perturbation_fraction * X_test[col].std(),
                                 size=len(X_test))
        X_perturbed[col] = X_test[col] + noise

    # Evaluate perturbed performance
    y_pred_perturbed = model.predict_proba(X_perturbed)[:, 1]
    auc_perturbed = roc_auc_score(y_test, y_pred_perturbed)

    # Prediction consistency
    prediction_changes = np.mean(
        (y_pred_baseline > 0.5) != (y_pred_perturbed > 0.5)
    )

    print(f"Baseline AUC: {auc_baseline:.3f}")
    print(f"Perturbed AUC: {auc_perturbed:.3f}")
    print(f"AUC degradation: {auc_baseline - auc_perturbed:.3f}")
    print(f"Prediction changes: {prediction_changes:.1%}")

    return {
        'auc_baseline': auc_baseline,
        'auc_perturbed': auc_perturbed,
        'prediction_change_rate': prediction_changes
    }

# Example
results = test_feature_perturbation_robustness(model, X_test, y_test,
                                               perturbation_fraction=0.05)

if results['prediction_change_rate'] > 0.10:
    print("⚠️ >10% of predictions change with 5% feature noise")
    print("→ Model may be overfitting to noise rather than signal")
Method 2: Adversarial Attack Testing
Fast Gradient Sign Method (FGSM) - Basic adversarial attack:
import numpy as np
import tensorflow as tf

def fgsm_attack(model, image, label, epsilon=0.01):
    """
    Generate adversarial example using Fast Gradient Sign Method

    Args:
        model: Trained model
        image: Input image
        label: True label
        epsilon: Perturbation magnitude

    Returns:
        Adversarial image
    """
    image = tf.cast(image, tf.float32)

    with tf.GradientTape() as tape:
        tape.watch(image)
        prediction = model(image)
        loss = tf.keras.losses.sparse_categorical_crossentropy(label, prediction)

    # Get gradient of loss w.r.t. image
    gradient = tape.gradient(loss, image)

    # Create adversarial image by stepping in the direction of the gradient sign
    signed_grad = tf.sign(gradient)
    adversarial_image = image + epsilon * signed_grad
    adversarial_image = tf.clip_by_value(adversarial_image, 0, 1)

    return adversarial_image

# Evaluate adversarial robustness
def evaluate_adversarial_robustness(model, test_images, test_labels,
                                    epsilons=[0.0, 0.01, 0.05, 0.10]):
    """
    Test model robustness to FGSM attacks at different perturbation levels
    """
    results = {}

    for eps in epsilons:
        correct = 0
        total = 0

        for img, label in zip(test_images, test_labels):
            # Generate adversarial example
            adv_img = fgsm_attack(model, img[np.newaxis, ...], label, epsilon=eps)

            # Predict
            pred = model.predict(adv_img)
            pred_class = np.argmax(pred)

            if pred_class == label:
                correct += 1
            total += 1

        accuracy = correct / total
        results[eps] = accuracy
        print(f"Epsilon={eps:.3f}: Accuracy={accuracy:.3f}")

    return results

# Run evaluation
adv_results = evaluate_adversarial_robustness(model, test_images, test_labels)

# Alert if significant degradation
if adv_results[0.05] < 0.70 * adv_results[0.0]:
    print("🚨 CRITICAL: Model highly vulnerable to adversarial attacks")
    print("→ Implement: Adversarial training, input validation, ensemble methods")
Method 3: Out-of-Distribution (OOD) Detection
Goal: Identify when inputs are unlike training data (model should abstain or flag uncertainty).
import numpy as np
from sklearn.metrics import roc_auc_score

def evaluate_ood_detection(model, in_dist_data, ood_data):
    """
    Evaluate model's ability to detect out-of-distribution inputs

    Args:
        model: Trained model
        in_dist_data: In-distribution test data
        ood_data: Out-of-distribution data

    Returns:
        OOD detection performance metrics
    """
    # Get prediction confidence (max probability) for each dataset
    in_dist_preds = model.predict(in_dist_data)
    in_dist_confidence = np.max(in_dist_preds, axis=1)

    ood_preds = model.predict(ood_data)
    ood_confidence = np.max(ood_preds, axis=1)

    # Combine labels (1 = in-distribution, 0 = OOD)
    y_true = np.concatenate([
        np.ones(len(in_dist_confidence)),
        np.zeros(len(ood_confidence))
    ])

    # Confidence scores (higher = more likely in-distribution)
    confidence_scores = np.concatenate([in_dist_confidence, ood_confidence])

    # Calculate AUROC for OOD detection
    auroc = roc_auc_score(y_true, confidence_scores)

    print(f"OOD Detection AUROC: {auroc:.3f}")
    print(f"In-dist mean confidence: {in_dist_confidence.mean():.3f}")
    print(f"OOD mean confidence: {ood_confidence.mean():.3f}")

    if auroc < 0.80:
        print("⚠️ Poor OOD detection - model overconfident on unfamiliar inputs")
        print("→ Consider: Temperature scaling, Bayesian approaches, ensemble uncertainty")

    return auroc

# Example: Test on a different medical image dataset
ood_auroc = evaluate_ood_detection(
    model,
    in_dist_data=chest_xray_test,    # Data from same hospitals as training
    ood_data=external_site_data      # Data from a completely different hospital/scanner
)
Robustness Improvement Strategies
1. Data Augmentation: - Train on varied/augmented data (rotations, brightness changes, noise) - Forces model to learn invariant features
2. Adversarial Training: - Include adversarial examples in training set - Trade-off: May slightly reduce clean accuracy
3. Ensemble Methods: - Multiple models often more robust than single model - Harder to fool all models simultaneously
4. Input Validation: - Reject inputs that are outliers (OOD detection) - Flag unusual patterns for human review
5. Certified Defenses: - Provide mathematical guarantees of robustness - Advanced, computationally expensive
Practical Robustness Evaluation Protocol
Robustness Testing Checklist
Minimum requirements (all deployed models): - [ ] Natural perturbation testing: Test with realistic variations (noise, missing data, equipment differences) - [ ] Prediction stability: Measure how often predictions change with small input perturbations (should be <5-10%) - [ ] Out-of-distribution detection: Model should flag or have low confidence on unfamiliar inputs
Recommended (high-risk models): - [ ] Adversarial attack testing: Evaluate vulnerability to FGSM, PGD attacks - [ ] Multi-site robustness: Validate performance across diverse sites/equipment - [ ] Ablation studies: Test performance when features are missing or corrupted
Advanced (critical systems, regulatory requirements): - [ ] Certified robustness: Provide formal guarantees for critical use cases - [ ] Red-team exercise: Security experts attempt to break the model - [ ] Continuous monitoring: Track input distribution shifts, flag anomalies
Security Best Practices
1. Model Access Control: - Limit API access to authenticated users - Rate limiting to prevent model extraction attacks
2. Input Sanitization: - Validate inputs are within expected ranges - Reject clearly anomalous inputs
3. Monitoring and Logging: - Log all predictions and inputs - Monitor for unusual query patterns (potential attacks)
4. Model Versioning and Rollback: - Maintain ability to revert to previous model if compromise detected
Case Study: Adversarial Examples Against Medical Imaging Classifiers

Study: Added imperceptible perturbations to medical images (chest X-rays, fundus photos, dermatology images)
Results: - Successfully fooled state-of-the-art deep learning classifiers - Adversarial examples transferable across models (attack one model, affects others) - Small perturbations caused dramatic misclassifications
Implications: - Medical AI models are vulnerable to adversarial attacks - Robustness testing should be mandatory for deployed systems - Both accidental (natural variations) and malicious perturbations are risks
Counterpoint: No documented real-world malicious attacks on medical AI systems (yet), but accidental distribution shifts are common (equipment changes, protocol updates).
Key Takeaways: Adversarial Robustness
Robustness is mandatory: EU AI Act requires adversarial robustness testing for high-risk medical AI
Two threat models: Natural perturbations (common) vs. adversarial attacks (rare but possible)
Implementation Outcomes

1. Acceptability

Definition: Perception that the system is agreeable/satisfactory
Measures: - User satisfaction surveys (Likert scales, Net Promoter Score) - Qualitative interviews (what do users like/dislike?) - Perceived usefulness and ease of use
Example questions: - “This system improves my clinical decision-making” (1-5 scale) - “I would recommend this system to colleagues” (yes/no)
2. Adoption
Definition: Intention/action to use the system
Measures: - Utilization rate (% of eligible cases where system used) - Number of users who have activated/logged in - Time to initial use
Red flag: Low adoption despite availability suggests problems with acceptability, workflow fit, or perceived utility.
3. Appropriateness
Definition: Perceived fit for setting/population/problem
Measures: - Stakeholder perception surveys - Alignment with clinical workflows (workflow mapping) - Relevance to clinical questions
Example: ICU mortality prediction may be appropriate for ICU but inappropriate for outpatient clinic.
5. Fidelity

Definition: Degree to which the system is used as designed
Measures: - Override rates (how often do clinicians dismiss alerts?) - Deviation from intended use (using for wrong purpose) - Workarounds (users circumventing system)
High override rates signal problems: - Too many false positives (alert fatigue) - Predictions don’t match clinical judgment (trust issues) - Workflow disruption (alerts at wrong time)
6. Penetration
Definition: Integration across settings/populations
Measures: - Number of sites/units using system - Proportion of target population reached - Geographic spread
7. Sustainability
Definition: Continued use over time
Measures: - Retention of users over 6-12 months - Model updating/maintenance plan - Long-term performance monitoring
Common failure: “pilot-itis”, where a successful pilot is not sustained after the initial implementation period.
Why it matters: A 2017 study of clinical prediction models found that most require recalibration within 2-3 years due to changes in patient populations, treatment patterns, and data collection practices.
MLOps and Continuous Model Monitoring
Moving beyond “deploy and hope”: Modern AI systems require continuous monitoring to detect performance degradation and trigger timely interventions.
Post-Deployment Monitoring is Not Optional
The FDA’s 2021 Action Plan on AI/ML-based Software as a Medical Device emphasizes continuous monitoring as a core requirement. Models that don’t monitor performance drift pose patient safety risks.
Real-world failure: IBM Watson for Oncology was deployed at multiple institutions but reportedly produced unsafe treatment recommendations that went unrecognized for years due to inadequate monitoring.
Types of Drift
Understanding the type of drift helps determine appropriate interventions.
1. Data Drift (Covariate Shift)
Definition: Input feature distributions change, but the relationship between features and outcome remains stable.
Example: - Training data (2019): Average patient age = 55, BMI = 27 - Production data (2024): Average patient age = 62, BMI = 31 - Relationship unchanged: Diabetes risk per BMI unit = same
Impact: Model may become miscalibrated (predicted probabilities off) even if discrimination (AUC-ROC) stays stable.
Detection: Compare feature distributions between training and production data.
2. Concept Drift
Definition: The relationship between features and outcome changes.
Example: - Pre-COVID (2019): Fever + cough + dyspnea → Likely bacterial pneumonia - During COVID (2020): Fever + cough + dyspnea → Likely COVID-19 - Same features, different outcome
Impact: Model discrimination (AUC-ROC) degrades significantly.
3. Label Drift

Definition: The outcome itself shifts, because its definition or its base rate changes over time.

Impact: Predicted probabilities may be systematically too low or too high (calibration degrades).
Detection: Compare outcome rates over time.
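A minimal sketch of that comparison, using hypothetical event counts and a simple two-proportion z-test:

```python
import numpy as np
from scipy.stats import norm

def outcome_rate_shift(baseline_events, baseline_n, recent_events, recent_n, alpha=0.01):
    """Two-proportion z-test: has the outcome rate moved since training?"""
    p1, p2 = baseline_events / baseline_n, recent_events / recent_n
    pooled = (baseline_events + recent_events) / (baseline_n + recent_n)
    se = np.sqrt(pooled * (1 - pooled) * (1 / baseline_n + 1 / recent_n))
    z = (p2 - p1) / se
    p_value = 2 * (1 - norm.cdf(abs(z)))
    return p1, p2, p_value, p_value < alpha

# Hypothetical example: 8% event rate at training time vs. 11% in the latest quarter
print(outcome_rate_shift(800, 10_000, 330, 3_000))
```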
Practical Drift Detection Methods
Method 1: Statistical Tests for Feature Distribution Changes
Kolmogorov-Smirnov (KS) Test:
Tests whether two distributions are significantly different.
import numpy as np
from scipy.stats import ks_2samp

# Example: Monitoring patient age distribution

# Training data (historical)
age_train = np.random.normal(55, 15, 1000)   # Mean=55, SD=15

# Production data (current month)
age_prod = np.random.normal(62, 15, 500)     # Mean shifted to 62

# Perform KS test
statistic, p_value = ks_2samp(age_train, age_prod)

print(f"KS statistic: {statistic:.3f}")
print(f"P-value: {p_value:.4f}")

if p_value < 0.01:
    print("⚠️ ALERT: Significant distribution shift detected!")
    print("→ Review model calibration and consider retraining")
else:
    print("✓ No significant drift detected")

# Interpretation:
# p < 0.01: Strong evidence of distribution change
# p < 0.05: Moderate evidence of distribution change
# p ≥ 0.05: No significant change detected
When to use: Continuous features (age, lab values, vital signs)
Chi-Square Test (for categorical features):
import pandas as pd
from scipy.stats import chi2_contingency

# Example: Monitoring sex distribution

# Training data
train_counts = {"Male": 600, "Female": 400}   # 60% male

# Production data (current month)
prod_counts = {"Male": 250, "Female": 250}    # 50% male

# Create contingency table
contingency_table = pd.DataFrame({
    'Train': [train_counts['Male'], train_counts['Female']],
    'Production': [prod_counts['Male'], prod_counts['Female']]
}, index=['Male', 'Female'])

chi2, p_value, dof, expected = chi2_contingency(contingency_table)

print(f"Chi-square statistic: {chi2:.3f}")
print(f"P-value: {p_value:.4f}")

if p_value < 0.01:
    print("⚠️ ALERT: Significant distribution shift in sex distribution!")
When to use: Categorical features (sex, race, diagnostic codes)
Method 2: Population Stability Index (PSI)
Widely used in industry for monitoring feature drift.
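PSI compares the binned distribution of a feature in the training (expected) data against production (actual) data. A minimal sketch, reusing the hypothetical `age_train` / `age_prod` arrays from the KS example above:

```python
import numpy as np

def population_stability_index(expected, actual, n_bins=10):
    """PSI = sum over bins of (actual% - expected%) * ln(actual% / expected%)."""
    # Decile bin edges taken from the training (expected) distribution
    edges = np.percentile(expected, np.linspace(0, 100, n_bins + 1))
    # Clip production values into the training range so extremes land in the end bins
    actual_clipped = np.clip(actual, edges[0], edges[-1])

    expected_pct = np.histogram(expected, bins=edges)[0] / len(expected)
    actual_pct = np.histogram(actual_clipped, bins=edges)[0] / len(actual)

    # Guard against empty bins before taking the log
    expected_pct = np.clip(expected_pct, 1e-6, None)
    actual_pct = np.clip(actual_pct, 1e-6, None)

    return float(np.sum((actual_pct - expected_pct) * np.log(actual_pct / expected_pct)))

# Rule of thumb used in this chapter: <0.1 stable, 0.1-0.25 moderate drift, >0.25 major drift
print(f"PSI for age: {population_stability_index(age_train, age_prod):.3f}")
```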
Method 3: Monitor the Prediction Distribution

Insight: Even without ground truth, you can monitor what the model is predicting.
Red flag patterns: - Sudden increase in high-risk predictions (model becoming overly sensitive) - Sudden decrease in high-risk predictions (model missing cases) - Bimodal distribution shifts (calibration degradation)
Example:
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import ks_2samp

# Monitor distribution of predicted probabilities over time

# Week 1 predictions (well-calibrated)
week1_preds = np.random.beta(2, 10, 1000)    # Mostly low risk

# Week 12 predictions (drift - more high-risk predictions)
week12_preds = np.random.beta(3, 7, 1000)    # Shifted higher

# Visualize
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 4))

ax1.hist(week1_preds, bins=50, alpha=0.7, label='Week 1', density=True)
ax1.hist(week12_preds, bins=50, alpha=0.7, label='Week 12', density=True)
ax1.set_xlabel('Predicted Probability')
ax1.set_ylabel('Density')
ax1.set_title('Prediction Distribution Shift')
ax1.legend()

# KS test to detect shift
ks_stat, p_val = ks_2samp(week1_preds, week12_preds)
ax2.text(0.5, 0.5,
         f'KS test p-value: {p_val:.4f}\n'
         + ('⚠️ Significant shift detected' if p_val < 0.01 else '✓ No significant shift'),
         ha='center', va='center', fontsize=14,
         bbox=dict(boxstyle='round', facecolor='wheat', alpha=0.5))
ax2.axis('off')

plt.tight_layout()
plt.show()
Automated Retraining Triggers
Goal: Define clear rules for when to retrain the model.
Retraining Decision Framework
IMMEDIATE retraining triggered by: - AUC-ROC drops >0.05 below baseline - Calibration error (Brier score) increases >0.05 - Safety event: Model missed critical case (e.g., sepsis death after negative prediction) - Feature drift: PSI > 0.25 for ≥3 critical features
SCHEDULED retraining triggered by: - 12 months since last training (routine maintenance) - AUC-ROC drops 0.02-0.05 below baseline (warning level) - PSI 0.1-0.25 for multiple features (moderate drift) - New data volume: ≥20% new samples since last training
HOLD retraining if: - All metrics within acceptable ranges - Recent retraining (< 3 months ago) - Insufficient new data (< 5% new samples)
Automated monitoring script:
def evaluate_retraining_need(current_auc, baseline_auc, psi_scores,
                             months_since_training, new_sample_fraction):
    """
    Automated decision system for model retraining

    Returns:
        - "IMMEDIATE": Retrain immediately
        - "SCHEDULED": Schedule retraining within 30 days
        - "MONITOR": Continue monitoring, no action needed
    """
    # Critical performance degradation
    if current_auc < baseline_auc - 0.05:
        return "IMMEDIATE", "AUC-ROC dropped >0.05"

    # Critical feature drift
    critical_drift_count = sum([psi > 0.25 for psi in psi_scores])
    if critical_drift_count >= 3:
        return "IMMEDIATE", f"{critical_drift_count} features with PSI > 0.25"

    # Routine maintenance schedule
    if months_since_training >= 12:
        return "SCHEDULED", "12-month routine retraining"

    # Moderate performance degradation
    if current_auc < baseline_auc - 0.02:
        return "SCHEDULED", "AUC-ROC dropped 0.02-0.05"

    # Moderate drift
    moderate_drift_count = sum([0.1 < psi < 0.25 for psi in psi_scores])
    if moderate_drift_count >= 4:
        return "SCHEDULED", f"{moderate_drift_count} features with moderate drift"

    # Significant new data
    if new_sample_fraction >= 0.20:
        return "SCHEDULED", f"{new_sample_fraction:.0%} new data available"

    return "MONITOR", "All metrics within acceptable ranges"

# Example usage
decision, reason = evaluate_retraining_need(
    current_auc=0.82,
    baseline_auc=0.85,
    psi_scores=[0.08, 0.15, 0.22, 0.18, 0.05],   # PSI for 5 key features
    months_since_training=8,
    new_sample_fraction=0.15
)

print(f"Decision: {decision}")
print(f"Reason: {reason}")
Continuous Learning Strategies
Two approaches:
1. Periodic Retraining (Safer, recommended for high-stakes)
Process:
Accumulate new data
Retrain model offline
Validate on hold-out set
A/B test against current model
Deploy if superior
Advantages: Controlled validation before deployment
Disadvantages: Model can drift between retraining cycles
2. Online (Continuous) Learning

Process: The model updates automatically as new data arrives, without a separate offline validation cycle.

Advantages: Adapts quickly to drift

Disadvantages: No controlled validation gate; data-quality problems or poisoned inputs can flow straight into production

Recommendation for healthcare: Use periodic retraining with trigger-based scheduling (not true online learning).
Real-World Example: Epic Sepsis Model Drift
Case study of monitoring failure:
Epic’s sepsis model was deployed at >100 hospitals but lacked adequate drift monitoring:
Problems identified: - No external validation before broad deployment - No continuous performance monitoring at deployment sites - No retraining protocol as patient populations changed
Result: Model drifted significantly, achieving only 33% sensitivity (missing 2/3 of sepsis cases) by the time external researchers evaluated it.
Lesson: Monitoring is not optional. It’s a patient safety requirement.
NannyML (one example of open-source monitoring tooling): - Performance estimation without ground truth - Confidence-based performance monitoring
Key Takeaways: Model Drift and Monitoring
All models drift: Performance degradation is inevitable, not exceptional
Types of drift matter: Data drift vs. concept drift require different interventions
Multiple detection methods: Use statistical tests (KS, Chi-square), PSI, and performance tracking simultaneously
Automated triggers: Define clear thresholds for retraining (don’t wait for catastrophic failure)
Continuous monitoring is mandatory: FDA and EU regulations increasingly require post-deployment monitoring
Periodic retraining > online learning: For healthcare, controlled validation is safer than continuous updates
Monitor before you have ground truth: Prediction distribution shifts can signal problems early
Document everything: Track what triggered retraining, what changed, and validation results
Framework reference: Davis et al., 2017, JAMIA: clinical prediction models degrade, and most need recalibration after 2-3 years.
Explainability and Interpretability (XAI)
Why Explainability Matters in Public Health AI
The trust problem: Surveys report that a large majority of clinicians (72% in one study) would not trust or act on predictions from “black box” AI systems they could not understand (Antoniadi et al., 2021; Markus et al., 2021).
Why interpretability is critical:
Clinical decision-making: Clinicians need to know why before they can decide whether to act
Debugging and validation: Explanations reveal spurious correlations and dataset biases
Regulatory requirements: FDA and EU AI Act increasingly mandate explainability for high-risk systems
Patient autonomy: Patients have a right to understand decisions affecting their health
Legal liability: “The algorithm said so” is not a defense in malpractice cases
The Accuracy-Interpretability Trade-off (A False Dichotomy?)
Traditional belief: Deep learning = high accuracy but uninterpretable; simpler models = lower accuracy but interpretable.
2024 reality: Post-hoc explainability methods (SHAP, attention mechanisms) make complex models interpretable without sacrificing accuracy. The choice is no longer binary.
Guideline: Start with the simplest model that meets performance requirements. If you need complex models, invest in robust explainability infrastructure.
Levels of Interpretability
Not all interpretability is equal. Different stakeholders need different levels of explanation.
1. Global Interpretability
Definition: Understanding the model’s overall behavior and decision logic.
Questions answered: - What features are most important overall? - How does the model generally make decisions? - Are there unexpected feature relationships?
2. Local Interpretability

Definition: Understanding why the model made a specific prediction for a specific patient.
Questions answered: - Why did the model predict this patient is high-risk? - Which patient characteristics drove this prediction? - What would need to change to alter the prediction?
3. Intrinsic Interpretability

Definition: Models that are inherently interpretable by design.
Examples: - Linear models: Each coefficient shows feature contribution - Decision trees: Follow the path to understand the decision - Rule-based systems: Explicit IF-THEN logic
When to use: When stakeholder trust is paramount and model performance requirements are modest.
Interpretability Methods: Practical Guide
Method 1: SHAP (SHapley Additive exPlanations)
What it is: A unified framework for interpreting model predictions based on game theory (Shapley values).
Why it’s powerful: - Model-agnostic: Works with any ML model (XGBoost, neural networks, etc.) - Theoretically grounded: Satisfies desirable properties (local accuracy, consistency) - Both global and local: Feature importance + individual predictions
Example: a local SHAP explanation as it might be presented to a clinician:

Patient 47: Sepsis Risk = 78%
Main drivers:
✓ Lactate 3.2 mmol/L (+0.35 risk contribution) ← **Primary concern**
✓ Temperature 39.1°C (+0.22)
✓ WBC 15,000/μL (+0.18)
✗ Normal BP 118/72 (-0.08) ← **Protective factor**
Interpretation: Elevated lactate is the strongest predictor.
Consider serial lactate monitoring and early fluid resuscitation.
SHAP Advantages and Limitations
Advantages: - Mathematically principled (satisfies local accuracy, missingness, consistency) - Works with any model architecture - Both global and local explanations - Handles feature interactions
Limitations: - Computational cost: Can be slow for large models/datasets (use TreeSHAP for tree models, faster) - Not causal: High SHAP value ≠ causal relationship (correlation still) - Assumes feature independence: Can give misleading results with highly correlated features
Best practices: - Use TreeSHAP for tree-based models (XGBoost, Random Forest): orders of magnitude (~1000x) faster - For neural networks, use DeepSHAP or KernelSHAP with background dataset sampling - Always validate explanations with domain experts (do they make clinical sense?)
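A minimal sketch with the shap library's TreeExplainer, assuming a fitted tree-based classifier named `model` and a feature DataFrame `X_test` as in the surrounding examples (return shapes vary by shap version and model type):

```python
import shap

# TreeSHAP: fast, exact Shapley values for tree ensembles (XGBoost, Random Forest, ...)
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X_test)
# Note: for some classifiers/shap versions this is a list with one array per class;
# in that case, pass shap_values[1] (positive class) to the plot below.

# Global view: which features matter most, and in which direction, across the test set
shap.summary_plot(shap_values, X_test)
```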
Method 2: LIME (Local Interpretable Model-agnostic Explanations)

What it is: Creates a simple, interpretable model (like linear regression) that approximates the complex model’s behavior locally around a specific prediction.
How it works: 1. Perturb the input (create similar but slightly different patients) 2. Get model predictions for perturbed inputs 3. Fit a simple linear model to these local predictions 4. Linear coefficients = feature importance for this prediction
When to use: - Need quick local explanations - SHAP is too computationally expensive - Want human-readable rules (“If lactate > 2 AND fever, then high risk”)
import lime
import lime.lime_tabular
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Simulate patient data for hospital readmission
np.random.seed(42)
n_patients = 1000

data = pd.DataFrame({
    'age': np.random.normal(68, 12, n_patients),
    'num_prior_admissions': np.random.poisson(2, n_patients),
    'length_of_stay': np.random.gamma(2, 2, n_patients),
    'num_medications': np.random.poisson(5, n_patients),
    'comorbidity_count': np.random.poisson(3, n_patients),
    'emergency_admission': np.random.binomial(1, 0.3, n_patients),
})

# Create readmission outcome
readmit_risk = (
    0.02 * data['age']
    + 0.15 * data['num_prior_admissions']
    + 0.05 * data['comorbidity_count']
    + 0.1 * data['emergency_admission']
    + np.random.normal(0, 0.5, n_patients)
)
data['readmitted_30d'] = (readmit_risk > 2).astype(int)

# Train model
X = data.drop('readmitted_30d', axis=1)
y = data['readmitted_30d']
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

# ===== LIME EXPLANATION =====

# 1. Create LIME explainer
explainer = lime.lime_tabular.LimeTabularExplainer(
    training_data=X_train.values,
    feature_names=X_train.columns.tolist(),
    class_names=['No Readmission', 'Readmission'],
    mode='classification',
    random_state=42
)

# 2. Explain a specific patient
patient_idx = 5
patient_data = X_test.iloc[patient_idx].values
predicted_proba = model.predict_proba([patient_data])[0]

print(f"=== Patient {patient_idx} ===")
print(f"Predicted readmission probability: {predicted_proba[1]:.2%}")
print(f"Actual outcome: {'Readmitted' if y_test.iloc[patient_idx] == 1 else 'Not readmitted'}")

# Generate explanation
explanation = explainer.explain_instance(
    data_row=patient_data,
    predict_fn=model.predict_proba,
    num_features=6
)

# 3. Display explanation
print("\n=== LIME Explanation ===")
print("Feature contributions to 'Readmission' class:")
for feature, weight in explanation.as_list():
    print(f"  {feature}: {weight:+.3f}")

# 4. Visualize
explanation.show_in_notebook(show_table=True)

# Save as HTML
explanation.save_to_file('lime_explanation_patient5.html')

# 5. Extract feature importance for this patient
feature_importance = dict(explanation.as_list())

print("\n=== Top Risk Factors for This Patient ===")
sorted_features = sorted(feature_importance.items(), key=lambda x: abs(x[1]), reverse=True)
for feature, weight in sorted_features[:3]:
    direction = "↑ Increases" if weight > 0 else "↓ Decreases"
    print(f"{direction} risk: {feature} (impact: {weight:+.3f})")
Example output:
=== Patient 5 ===
Predicted readmission probability: 64%
Feature contributions:
num_prior_admissions > 3.00: +0.22 ← Major risk factor
comorbidity_count > 4.00: +0.15
age > 65.00: +0.08
emergency_admission = 1: +0.12
length_of_stay ≤ 3.00: -0.05 ← Protective (longer stays = more stabilization)
num_medications ≤ 6.00: -0.02
Interpretation: This patient's high readmission risk is driven primarily
by multiple prior admissions (4 in past year) and high comorbidity burden.
LIME Advantages and Limitations
Advantages: - Fast: Quicker than SHAP for local explanations - Intuitive: Simple “if-then” rules easy for clinicians to understand - Model-agnostic: Works with any black box model
Limitations: - Instability: Explanations can vary significantly with small input changes - Local only: Doesn’t provide global model understanding - Arbitrary perturbations: Sampling strategy affects explanation quality - No theoretical guarantees: Unlike SHAP, not mathematically principled
When to choose LIME over SHAP: - Real-time explanations needed (speed critical) - Prefer rule-based explanations (“If X > 5 AND Y < 10…”) - SHAP computationally infeasible for your model
Method 3: Attention Mechanisms (For Deep Learning)
What it is: Neural network architectures that learn to focus on important input features, making attention weights interpretable.
Where it’s used: - Transformers: BERT, GPT for clinical notes analysis - Vision models: Which parts of chest X-ray drove diagnosis? - Time-series: Which ICU monitoring data points triggered alert?
Example application: Radiology AI highlights suspicious regions in medical images using attention heatmaps.
Attention Visualization Example
import torch
import torch.nn as nn
import numpy as np
import matplotlib.pyplot as plt

# Simple attention-based model for ICU time-series data
class AttentionICU(nn.Module):
    def __init__(self, input_dim, hidden_dim):
        super().__init__()
        self.lstm = nn.LSTM(input_dim, hidden_dim, batch_first=True)
        # Attention mechanism
        self.attention = nn.Linear(hidden_dim, 1)
        self.classifier = nn.Linear(hidden_dim, 1)

    def forward(self, x):
        # x shape: (batch, time_steps, features)
        lstm_out, _ = self.lstm(x)                    # (batch, time_steps, hidden_dim)

        # Calculate attention scores
        attention_scores = self.attention(lstm_out)   # (batch, time_steps, 1)
        attention_weights = torch.softmax(attention_scores, dim=1)

        # Apply attention (weighted sum of LSTM outputs)
        context = torch.sum(attention_weights * lstm_out, dim=1)   # (batch, hidden_dim)

        # Final prediction
        output = self.classifier(context)
        return torch.sigmoid(output), attention_weights

# Simulate ICU time-series data
# Features: HR, BP, SpO2, RR over 24 hours (hourly measurements)
torch.manual_seed(42)
n_patients = 100
time_steps = 24
n_features = 4

X = torch.randn(n_patients, time_steps, n_features)
y = torch.randint(0, 2, (n_patients, 1)).float()   # Binary outcome

# Train model
model = AttentionICU(input_dim=n_features, hidden_dim=32)
criterion = nn.BCELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)

# Training loop (simplified)
for epoch in range(50):
    optimizer.zero_grad()
    predictions, attention_weights = model(X)
    loss = criterion(predictions, y)
    loss.backward()
    optimizer.step()

# ===== INTERPRET ATTENTION WEIGHTS =====

# Explain a specific patient
patient_idx = 0
patient_data = X[patient_idx:patient_idx + 1]
prediction, attention = model(patient_data)

print(f"Predicted risk: {prediction.item():.2%}")

# Visualize attention over time
attention_np = attention.detach().numpy()[0, :, 0]   # Shape: (time_steps,)

plt.figure(figsize=(12, 4))

plt.subplot(1, 2, 1)
plt.plot(range(24), attention_np, marker='o')
plt.xlabel('Hour')
plt.ylabel('Attention Weight')
plt.title('Which Time Points Were Most Important?')
plt.axhline(1 / 24, color='r', linestyle='--', label='Uniform attention')
plt.legend()

# Identify critical time periods
top_hours = np.argsort(attention_np)[-3:][::-1]
print(f"\nMost important time periods: Hours {top_hours}")
print("Interpretation: Model focused on these specific hours when making prediction")

# Overlay attention on vital signs
plt.subplot(1, 2, 2)
vitals = patient_data.detach().numpy()[0, :, 0]   # Heart rate
plt.plot(range(24), vitals, label='Heart Rate', alpha=0.7)
plt.scatter(range(24), vitals, s=attention_np * 1000, c='red', alpha=0.5,
            label='Attention (size = importance)')
plt.xlabel('Hour')
plt.ylabel('Heart Rate')
plt.title('Attention-Weighted Vital Signs')
plt.legend()

plt.tight_layout()
plt.savefig('attention_interpretation.png', dpi=150)
plt.show()

print("\nClinical interpretation:")
print(f"The model identified hours {top_hours[0]}, {top_hours[1]}, {top_hours[2]} as critical.")
print("Clinician should review events during these time windows.")
Key insight: Attention mechanisms provide inherent interpretability; the model learns what’s important during training rather than requiring post-hoc explanation.
Limitations: - Attention ≠ causation - High attention doesn’t guarantee that feature is truly important (attention is correlation) - Requires model architecture modification (can’t apply to existing black boxes)
Method 4: Counterfactual Explanations
What it is: “What would need to change for the model to make a different prediction?”
Example: - Prediction: Patient has 75% readmission risk - Counterfactual: “If patient had ≤2 prior admissions (currently 4) OR comorbidity count ≤3 (currently 5), risk would drop to <30%”
Why it’s valuable: - Actionable: Tells clinicians what interventions might help - Patient-friendly: Easy to communicate (“If you lose 10 lbs, your risk decreases…”) - Fair: Reveals whether model relies on unchangeable features (race, gender)
Counterfactual Example with DiCE
# Install: pip install dice-ml
import dice_ml
from dice_ml import Dice
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Load data (reuse readmission example from LIME section)
# ... (same data generation code) ...

# Train model
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

# ===== COUNTERFACTUAL GENERATION =====

# 1. Prepare DiCE
dice_data = dice_ml.Data(
    dataframe=pd.concat([X_train, y_train], axis=1),
    continuous_features=['age', 'num_prior_admissions', 'length_of_stay',
                         'num_medications', 'comorbidity_count'],
    outcome_name='readmitted_30d'
)

dice_model = dice_ml.Model(model=model, backend='sklearn')
explainer = Dice(dice_data, dice_model, method='random')

# 2. Generate counterfactuals for high-risk patient
patient_idx = 5
patient_df = X_test.iloc[[patient_idx]]

# Find alternative scenarios where patient would NOT be readmitted
counterfactuals = explainer.generate_counterfactuals(
    query_instances=patient_df,
    total_CFs=3,                  # Generate 3 alternative scenarios
    desired_class='opposite'      # Want opposite prediction
)

# 3. Display results
print("=== Original Patient ===")
print(patient_df.T)
print(f"\nPredicted outcome: Readmission (Risk: {model.predict_proba(patient_df)[0][1]:.2%})")

print("\n=== Counterfactual Scenarios (How to Avoid Readmission) ===")
cf_df = counterfactuals.cf_examples_list[0].final_cfs_df
print(cf_df.T)

# 4. Identify key changes
print("\n=== Key Changes Needed ===")
for col in X_test.columns:
    original = patient_df[col].values[0]
    for i, cf in cf_df.iterrows():
        if abs(cf[col] - original) > 0.01:
            change = cf[col] - original
            print(f"  {col}: {original:.1f} → {cf[col]:.1f} (change: {change:+.1f})")

# 5. Clinical translation
print("\n=== Actionable Recommendations ===")
print("To reduce readmission risk below 30%, consider:")
print("  • Reduce medication complexity (consolidate from 8 to ≤6 drugs)")
print("  • Intensive post-discharge follow-up (reduce prior admit pattern)")
print("  • Comorbidity management (focus on top 2-3 conditions)")
Output interpretation:
Original Patient: Readmission Risk = 72%
- Age: 71
- Prior admissions: 4
- Comorbidities: 5
- Medications: 8
Counterfactual Scenario 1: Risk = 18%
- Age: 71 (unchanged)
- Prior admissions: 1 (reduced from 4) ← Major change
- Comorbidities: 5 (unchanged)
- Medications: 6 (reduced from 8)
Interpretation: Model suggests that reducing medication complexity and
preventing repeat admissions are the highest-impact interventions.
Method 5: Built-in Feature Importance (Tree-Based Models)

What it is: For models like Random Forest and XGBoost, built-in feature importance scores.
How it works: - Gini importance: How much each feature reduces impurity when splitting - Permutation importance: Performance drop when feature is randomly shuffled
Advantage: Fast and easy to compute

Limitation: Impurity-based (Gini) importance can be biased toward high-cardinality features
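A minimal sketch of the permutation approach with scikit-learn, assuming the fitted `model` and the `X_test` / `y_test` split from the earlier readmission example:

```python
from sklearn.inspection import permutation_importance

# Performance drop when each feature is shuffled; less biased than impurity-based importance
result = permutation_importance(model, X_test, y_test,
                                scoring='roc_auc', n_repeats=10, random_state=42)

ranked = sorted(zip(X_test.columns, result.importances_mean, result.importances_std),
                key=lambda item: item[1], reverse=True)
for name, mean_drop, std_drop in ranked:
    print(f"{name}: AUC drop {mean_drop:.4f} ± {std_drop:.4f}")
```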
Regulatory Expectations for Explainability

“For AI/ML-enabled devices, transparency regarding the device’s operating characteristics, performance, and limitations is critical… Organizations should provide information about factors that influenced model predictions.”
Requirements for high-risk devices: - Explanation of key features driving predictions - Model limitations and failure modes - Performance across demographic subgroups
EU AI Act (2024)
Transparency obligations for high-risk AI (includes medical AI):
Article 13 - Transparency: - Users must be informed that they are interacting with AI - Information on the logic involved in decision-making - Information on significance and consequences of predictions
Practical implication: “Black box” systems without explanations will face regulatory barriers in EU.
Common Explainability Pitfalls

Pitfall 1: Mistaking Correlation for Causation

Problem: SHAP/LIME identify correlations, not causal relationships.
Example: - Model assigns high importance to “hospital length of stay” for mortality prediction - Interpretation error: “Longer stays cause death” - Reality: Sicker patients stay longer; length of stay is a proxy for severity
Solution: Always validate explanations with clinical domain knowledge
Pitfall 2: Over-relying on Feature Importance
Problem: Global feature importance hides subgroup differences.
Example: - “Age” is most important feature globally (average across all patients) - But for young patients (<40), “comorbidities” might be more important
Solution: Examine SHAP dependence plots and subgroup-specific explanations
Pitfall 3: Ignoring Explanation Instability
Problem: LIME explanations can vary substantially between similar patients.
Test:
# Generate 10 explanations for the same patient (with different LIME seeds)
# Note: explain_instance does not take a seed, so recreate the explainer each time
explanations = []
for seed in range(10):
    seeded_explainer = lime.lime_tabular.LimeTabularExplainer(
        training_data=X_train.values,
        feature_names=X_train.columns.tolist(),
        mode='classification',
        random_state=seed
    )
    exp = seeded_explainer.explain_instance(patient_data, model.predict_proba)
    explanations.append(exp.as_list())

# Check consistency
# If feature rankings vary significantly → unstable explanations
Solution: Use SHAP for high-stakes decisions (more stable)
Pitfall 4: Explaining the Wrong Model
Problem: Explain a simplified “surrogate” model instead of the actual production model.
Example: - Production: Complex ensemble of 50 models - Explanation: Generated from single decision tree approximation - Risk: Explanations don’t reflect actual system behavior
Solution: Always explain the actual deployed model (even if slower)
Key Takeaways: Explainability
Trust requires transparency: Clinicians won’t act on predictions they don’t understand
Multiple methods, multiple purposes: SHAP for robustness, LIME for speed, counterfactuals for action
Evaluate your explanations: Fidelity, consistency, clinical validity
Regulatory trend: Explainability moving from “nice-to-have” to mandatory (FDA, EU)
Layer explanations by user: Patients need simple “why me?”; regulators need comprehensive validation
Correlation ≠ causation: Explanations show what model uses, not necessarily what’s clinically causal
Explainability is not a fix for bad models: If your model is biased or poorly validated, explanations just make the problems more visible (which is actually good for debugging)
Regulatory Evaluation Requirements

2024-2025 reality: AI evaluation is no longer just a scientific question. It’s a regulatory requirement.
By 2025, medical AI systems face scrutiny from multiple regulatory bodies: - FDA (United States): Software as a Medical Device (SaMD) framework - European Union: AI Act (2024) classifying medical AI as “high-risk” - UK MHRA: Software and AI as a Medical Device framework - Health Canada: Medical Device Regulations for AI/ML
Key insight: Understanding regulatory evaluation requirements is essential for deployment, not just compliance.
Why Regulatory Context Matters for Evaluation
Even if you’re not directly developing a regulatory-approved device, understanding these frameworks helps you:
Design better evaluations: Regulatory standards define best practices
Anticipate deployment requirements: Many health systems require FDA clearance or equivalent
Benchmark your work: Compare your validation against regulatory expectations
Communicate with stakeholders: Speak the language of hospital legal/compliance teams
Cross-reference:Policy and Governance covers broader policy implications; this section focuses on evaluation-specific regulatory requirements.
FDA Software as a Medical Device (SaMD) Framework
Software as a Medical Device (SaMD) refers to software intended for medical purposes that operates independently of hardware medical devices.
Examples: - ✅ SaMD: Sepsis prediction algorithm, diabetic retinopathy screening app, radiology CAD software - ❌ Not SaMD: EHR system (administrative), fitness tracker (wellness, not medical diagnosis)
Risk-Based Classification System
The FDA classifies SaMD by risk level, which determines evaluation rigor required.
Risk Categorization Matrix

| State of Healthcare Situation | Treat/Diagnose | Drive Clinical Management | Inform Clinical Management |
|---|---|---|---|
| Critical | III (highest risk) | III | II |
| Serious | III | II | II |
| Non-serious | II | II | I (lowest risk) |

(Rows: state of the healthcare situation or condition. Columns: significance of the information the SaMD provides.)
Definitions: - Critical: Death or permanent impairment (e.g., ICU monitoring) - Serious: Long-term morbidity (e.g., cancer diagnosis) - Non-serious: Minor conditions (e.g., acne treatment)
Information significance: - Treat/Diagnose: Directly triggers treatment decisions - Drive clinical management: Significant influence on treatment path - Inform clinical management: One input among many
Evaluation Requirements by Risk Level
Level I (Low Risk): - Example: App suggesting lifestyle modifications for mild hypertension - Requirements: - Basic technical performance validation - User studies demonstrating safe use - Minimal clinical validation
Predetermined Change Control Plans (PCCPs)

Challenge: Traditional medical devices are static; ML models need updating.
FDA’s solution (2023 draft guidance): Predetermined Change Control Plans allow specified modifications without new regulatory review.
What can be included in a PCCP:
1. Allowable modifications: - Retraining on new data (within specified bounds) - Algorithm parameter adjustments - Performance improvements
2. Modification protocol: - Data requirements: Minimum sample size, quality standards - Performance thresholds: Must maintain ≥ X sensitivity - Validation approach: Test set size, external validation sites
3. Update assessment: - Performance monitoring triggers retraining - Validation results compared to pre-specified thresholds - Automated decision: deploy update or flag for review
4. Transparency and reporting: - Change documentation - Performance reports to regulators - User notifications
Example PCCP:
### Sepsis Prediction Model PCCP

**Allowable Change:** Retrain model quarterly using new institutional data

**Modification Protocol:**
- Minimum 5,000 new patient encounters with ≥200 sepsis cases
- Maintain AUC-ROC ≥ 0.82 (original validation: 0.85)
- Sensitivity at 80% specificity must be ≥ 70%
- External validation on ≥1 additional hospital required

**Assessment:**
- Automated performance monitoring (monthly)
- Retraining triggered if AUC drops below 0.83
- New model validated on hold-out test set (20% of new data)
- Deploy if all thresholds met; otherwise, flag for manual review

**Reporting:**
- Quarterly performance reports to FDA
- User notification of model updates via EHR alert
EU AI Act: High-Risk Classification for Medical AI
Adopted 2024, fully enforced by 2026.
Key classification: Most medical AI systems are high-risk, requiring:
1. Risk management system: - Continuous identification and mitigation of risks
2. Data governance: - Training data quality assurance - Bias detection and mitigation - Data representativeness
3. Technical documentation: - Detailed model specifications - Training procedures - Validation results
4. Transparency: - Users informed that AI is involved - Clear information on limitations
5. Human oversight: - Human-in-the-loop for high-stakes decisions
Even if your system isn’t currently seeking regulatory approval, aligning with these standards ensures quality:
Before Development: - [ ] Determine risk classification (SaMD Level I/II/III or EU high-risk) - [ ] Identify applicable regulatory frameworks - [ ] Define evaluation requirements based on risk level
During Development: - [ ] Ensure training/test data independence (GMLP #4) - [ ] Use representative training data (GMLP #3) - [ ] Apply good software engineering (version control, testing) (GMLP #2) - [ ] Document model architecture, hyperparameters, training process
Validation Phase: - [ ] Internal validation with appropriate cross-validation - [ ] Temporal validation (if time-dependent data) - [ ] External validation (mandatory for Level II/III, EU high-risk) - [ ] Fairness audit across demographic subgroups - [ ] Usability testing with intended users - [ ] Clinical utility assessment (not just technical performance)
Pre-Deployment: - [ ] Human-AI team evaluation (GMLP #7) - [ ] Adversarial robustness testing (EU AI Act requirement) - [ ] Create predetermined change control plan (if adaptive model) - [ ] Develop post-market monitoring protocol
Comparison: Research vs. Regulatory Evaluation Standards
| Aspect | Research Publication | Regulatory Approval |
|---|---|---|
| External validation | Recommended, often skipped (6% in 2020 study) | Mandatory for Level II/III, EU high-risk |
| Prospective testing | Rare | Often required for novel high-risk devices |
| Fairness audit | Increasingly expected | Mandatory (EU AI Act) |
| Post-market monitoring | Not required | Mandatory |
| Clinical utility | Recommended | Required (must demonstrate benefit) |
| Documentation | Methods section | Extensive technical documentation |
| Timeline | Months | 1-3+ years (depending on risk level) |
Key insight: Regulatory standards are higher than typical research standards. If you aim for deployment, plan for regulatory-level evaluation from the start.
When to Seek Regulatory Approval
You likely need FDA clearance/approval if: - System makes diagnostic or treatment recommendations - Used for screening (e.g., diabetic retinopathy, cancer) - Influences clinical decision-making significantly - Marketed as improving health outcomes
You might NOT need approval if: - Administrative tools (scheduling, billing) - Wellness apps (general fitness, not medical claims) - Clinical decision support providing information only (not recommendations) - Gray area: FDA discretion, often depends on risk
When in doubt: Consult FDA’s Digital Health Center of Excellence or regulatory counsel.
Academic: - TRIPOD-AI: Reporting guidelines for prediction models using AI - CONSORT-AI: Reporting guidelines for clinical trials evaluating AI interventions
Key Takeaways: Regulatory Evaluation
Regulatory requirements are evaluation requirements: FDA, EU standards define rigorous validation expectations
Risk-based approach: Higher-risk systems require more extensive validation (external validation, prospective studies, RCTs)
External validation is mandatory: For Level II/III devices and EU high-risk AI
Continuous monitoring is required: Post-market surveillance, not just pre-deployment validation
Good Machine Learning Practice: 10 principles provide practical framework for development and evaluation
Predetermined change control plans: Enable adaptive models while maintaining regulatory compliance
EU AI Act raises the bar: Fairness audits, robustness testing, transparency now mandatory
Plan early: Regulatory evaluation takes longer and costs more than research validation; design for it from the start
Seek expert guidance: Regulatory pathways are complex; consult with regulatory specialists
Standards improve quality: Even if not seeking approval, regulatory frameworks represent best practices
Comprehensive Evaluation Framework
Complete Evaluation Checklist
Use this when evaluating AI systems:
✅ AI System Evaluation Checklist
TECHNICAL PERFORMANCE - [ ] Discrimination metrics reported (AUC-ROC, sensitivity, specificity, PPV, NPV) - [ ] 95% confidence intervals provided for all metrics - [ ] Calibration assessed (calibration plot, Brier score, ECE) - [ ] Appropriate for class imbalance (if applicable) - [ ] Comparison to baseline model (e.g., logistic regression, clinical judgment) - [ ] Multiple metrics reported (not just accuracy)
VALIDATION RIGOR - [ ] Internal validation performed (CV or hold-out) - [ ] Temporal validation (train on past, test on future) - [ ] External validation on independent dataset from different institution - [ ] Prospective validation performed or planned - [ ] Data leakage prevented (feature engineering within folds) - [ ] Appropriate cross-validation for data structure (stratified, grouped, time-series)
FAIRNESS AND EQUITY - [ ] Performance stratified by demographic subgroups (race, gender, age, SES) - [ ] Disparities quantified (absolute and relative differences) - [ ] Calibration assessed per subgroup - [ ] Training data representation documented - [ ] Potential for bias explicitly discussed - [ ] Mitigation strategies proposed if disparities identified
CLINICAL UTILITY - [ ] Decision curve analysis or similar utility assessment - [ ] Comparison to current standard of care - [ ] Clinical workflow integration considered - [ ] Net benefit quantified - [ ] Cost-effectiveness assessed (if applicable) - [ ] Actionable outputs (not just risk scores)
TRANSPARENCY AND REPRODUCIBILITY - [ ] Model architecture and type clearly described - [ ] Feature engineering documented - [ ] Hyperparameters and training procedure reported - [ ] Reporting guidelines followed (TRIPOD, STARD-AI) - [ ] Code availability stated - [ ] Data availability (with appropriate privacy protections) - [ ] Conflicts of interest disclosed
IMPLEMENTATION PLANNING - [ ] Target users and use cases defined - [ ] Workflow integration plan described - [ ] Alert/decision threshold selection justified - [ ] Plan for performance monitoring post-deployment - [ ] Model updating and maintenance plan - [ ] Training plan for end users - [ ] Contingency plan for model failure
Additional items: - Model architecture details - Training procedure (epochs, batch size, optimization) - Validation strategy - External validation results - Subgroup analyses - Calibration assessment - Comparison to human performance (if applicable)
Critical Appraisal of Published Studies
Systematic Evaluation Framework
When reading AI studies:
1. Study Design and Data Quality
Questions: - Representative sample of target population? - External validation performed? - Test set truly independent? - Outcome objectively defined and consistently measured? - Potential for data leakage?
Red flags: - No external validation - Small sample size (<500 events) - Convenience sampling - Vague outcome definitions - Feature engineering on entire dataset before splitting
2. Model Development and Reporting
Questions: - Multiple models compared? - Simple baseline included (logistic regression)? - Hyperparameters tuned on separate validation set? - Feature selection appropriate? - Model clearly described?
Red flags: - No baseline comparison - Hyperparameter tuning on test set - Inadequate model description - No cross-validation
3. Performance Metrics and Reporting

Red flags: - Only accuracy reported (especially for imbalanced data) - No calibration assessment - No confidence intervals - Cherry-picked metrics
4. Fairness and Generalizability
Questions: - Performance stratified by subgroups? - Diverse populations included? - Generalizability limitations discussed? - Potential biases identified?
Red flags: - No subgroup analysis - Homogeneous study population - Claims of broad generalizability without external validation - Dismissal of fairness concerns
5. Clinical Utility
Questions: - Clinical utility assessed (beyond accuracy)? - Compared to current practice? - Implementation considerations discussed? - Cost-effectiveness assessed?
Red flags: - Only technical metrics - No comparison to existing approaches - No implementation discussion - Overstated clinical claims
6. Transparency and Reproducibility
Questions: - Code and data available? - Reporting guidelines followed? - Sufficient detail to reproduce? - Limitations clearly stated? - Conflicts of interest disclosed?
Red flags: - No code/data availability - Insufficient methodological detail - Overstated conclusions - Undisclosed industry funding
Key Takeaways
Essential Principles
Evaluation is multidimensional: technical performance, clinical utility, fairness, and implementation outcomes all matter
Internal validation is insufficient: external validation on independent data is essential to assess generalizability
Calibration is critical: predicted probabilities must be meaningful for clinical decisions, not just discriminative
Question 1: Cross-Validation Strategy
Explanation: Time-based (temporal) cross-validation is essential for healthcare data with temporal dependencies:
Why temporal CV is critical:
Fold 1: Train 2018-2019 → Test 2020
Fold 2: Train 2018-2020 → Test 2021
Fold 3: Train 2018-2021 → Test 2022
Fold 4: Train 2018-2022 → Test 2023
What this tests:
- Model performance as deployed (using past to predict future)
- Robustness to temporal drift (treatment changes, policy updates)
- Realistic performance estimates
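A minimal sketch of this expanding-window scheme, assuming a pandas DataFrame df with a year column, a FEATURES list, an outcome column, and a make_model() factory (all hypothetical names):

```python
import pandas as pd
from sklearn.metrics import roc_auc_score

def expanding_year_splits(df, first_test_year, last_test_year, year_col="year"):
    """Yield (test_year, train, test): train on all earlier years, test on one later year."""
    for test_year in range(first_test_year, last_test_year + 1):
        yield test_year, df[df[year_col] < test_year], df[df[year_col] == test_year]

# for year, train, test in expanding_year_splits(df, 2020, 2023):
#     model = make_model().fit(train[FEATURES], train["outcome"])
#     auc = roc_auc_score(test["outcome"], model.predict_proba(test[FEATURES])[:, 1])
#     print(f"Train <{year}, test {year}: AUC = {auc:.3f}")
```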
Why not standard random K-fold CV? You’d be using 2023 data to predict 2018, which inflates performance artificially.
Why not leave-one-out (b)?
- Computationally expensive
- Still has the temporal leakage problem
- High variance in estimates

Why not stratified K-fold (c)?
- Useful for class imbalance
- But still allows temporal leakage
- Doesn’t test temporal robustness
Real-world impact: Models validated with random CV often show 10-20% performance drops when deployed because they never faced forward-looking prediction during validation.
Lesson: Healthcare data has temporal structure. Always validate the way you will deploy: use the past to predict the future, never the reverse.
Question 2: Calibration Assessment
A cancer risk model predicts 20% risk for 1,000 patients. In reality, 300 of these patients develop cancer. What does this indicate?
The model is well-calibrated
The model is overconfident (underestimates risk)
The model is underconfident (overestimates risk)
The model has good discrimination but poor calibration
Answer: b) The model is overconfident (underestimates risk)
Explanation: Calibration compares predicted probabilities to observed outcomes:

Analysis:
- Predicted: 20% of 1,000 patients = 200 patients expected to develop cancer
- Observed: 300 patients actually developed cancer
- Gap: predicted 200, observed 300 → the model underestimates risk
```python
# Clinical decision rule: treat if predicted risk > 25%
predicted_risk = model.predict_proba(patient)[0, 1]   # ≈ 0.20 → below threshold → no treatment
# Reality: the observed event rate for patients like this was ~0.30
# The patient should have been offered treatment
```
How to detect:
1. Calibration plot: predicted vs. observed event rate by risk bin
2. Brier score: mean squared error of the predicted probabilities
3. Expected Calibration Error (ECE): average absolute calibration gap across bins
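A minimal sketch of these three checks with scikit-learn, assuming arrays y_true (observed 0/1 outcomes) and y_prob (predicted risks):

```python
import numpy as np
from sklearn.calibration import calibration_curve
from sklearn.metrics import brier_score_loss

def expected_calibration_error(y_true, y_prob, n_bins=10):
    """ECE: bin-size-weighted average of |observed event rate - mean predicted risk|."""
    y_true, y_prob = np.asarray(y_true), np.asarray(y_prob)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    bin_idx = np.digitize(y_prob, edges[1:-1])        # bin index 0..n_bins-1
    ece = 0.0
    for b in range(n_bins):
        in_bin = bin_idx == b
        if in_bin.any():
            ece += in_bin.mean() * abs(y_true[in_bin].mean() - y_prob[in_bin].mean())
    return ece

# 1. Calibration plot points (predicted vs. observed by risk bin)
prob_true, prob_pred = calibration_curve(y_true, y_prob, n_bins=10)
# 2. Brier score and 3. ECE
print("Brier:", brier_score_loss(y_true, y_prob),
      "ECE:", expected_calibration_error(y_true, y_prob))
```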
Lesson: High AUC doesn’t guarantee calibration. When predictions inform decisions with probability thresholds, calibration is critical. Always check calibration plots, not just discrimination metrics.
Question 3: External Validation
Your sepsis model achieves AUC 0.88 on internal test set. You test on external hospitals and get AUC 0.72-0.82 (varying by site). What does this variability indicate?
The model is overfitting
External sites have poor data quality
There is substantial site-specific heterogeneity
The model should not be used
Answer: c) There is substantial site-specific heterogeneity
Explanation: Performance variability across sites reveals important heterogeneity:
What varies between hospitals:
Patient populations:
Demographics (age, race, socioeconomic status)
Disease severity (tertiary referral vs community hospital)
Comorbidity profiles
Clinical practices:
Sepsis protocols (early vs delayed antibiotics)
ICU admission criteria
Documentation practices
Infrastructure:
EHR systems (Epic vs Cerner vs homegrown)
Lab equipment (different reference ranges)
Staffing models (nurse-to-patient ratios)
Data capture:
Missing data patterns
Measurement frequency
Feature definitions
Why not overfitting (a)? Overfitting shows as gap between training and test within same dataset. Here, internal test was fine (0.88). It’s external generalization that varies.
Why not poor data quality (b)? Could contribute, but more likely reflects legitimate differences in populations and practices.
Why not unusable (d)? AUC 0.72-0.82 is still useful! But it indicates the need for:
- Site-specific calibration (see the sketch below)
- Understanding what drives the differences
- Possibly site-specific models or adjustments
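Site-specific calibration can often be done without retraining the model: keep its ranking and refit only the probability mapping on a local validation sample. A minimal Platt-style sketch, assuming hypothetical arrays local_val_probs and local_val_labels from the new site:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def platt_recalibrate(val_probs, val_labels):
    """Refit the probability scale for one site; discrimination (AUC) is unchanged."""
    def to_logit(p):
        p = np.clip(np.asarray(p, dtype=float), 1e-6, 1 - 1e-6)
        return np.log(p / (1 - p)).reshape(-1, 1)

    recal = LogisticRegression().fit(to_logit(val_probs), val_labels)
    return lambda p: recal.predict_proba(to_logit(p))[:, 1]

# calibrate = platt_recalibrate(local_val_probs, local_val_labels)
# adjusted_risk = calibrate(model.predict_proba(X_new_site)[:, 1])
```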
Best practice: External validation almost always shows performance drops. Variability across sites is normal and informative; it reveals where the model struggles and where adaptation is needed.
Question 4: Statistical Significance vs Clinical Significance
True or False: If a model improvement is statistically significant (p < 0.05), it is clinically meaningful and should be deployed.
Answer: False
Explanation: Statistical significance ≠ clinical significance. Both matter for a deployment decision, but neither alone is sufficient:

Statistical significance:
- Tests whether the difference is unlikely to be due to chance
- Depends on sample size (large N → small differences become significant)
- Answers: “Is there an effect?”

Clinical significance:
- Tests whether the difference matters for patient care
- Independent of sample size
- Answers: “Is the effect large enough to care?”
Example:
```python
# New model vs. baseline on a held-out test set
results = {
    'baseline_auc': 0.820,
    'new_model_auc': 0.825,
    'difference': 0.005,
    'p_value': 0.03,        # statistically significant
    'sample_size': 50000,   # large sample
}
```
Analysis:
- ✅ Statistically significant: p = 0.03 < 0.05
- ❌ Clinically insignificant: a 0.005 AUC improvement is negligible
- Why significant? A large sample detects tiny differences
- Should you deploy? No; the gain is not worth the cost and disruption
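One way to see whether such a small gap is even precisely estimated is a paired bootstrap of the AUC difference. A minimal sketch, assuming y_test plus each model's predicted scores (hypothetical names):

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def bootstrap_auc_difference(y_true, scores_a, scores_b, n_boot=2000, seed=0):
    """95% CI for AUC(model B) - AUC(model A) via paired bootstrap over patients."""
    rng = np.random.default_rng(seed)
    y_true, scores_a, scores_b = map(np.asarray, (y_true, scores_a, scores_b))
    diffs = []
    for _ in range(n_boot):
        idx = rng.integers(0, len(y_true), len(y_true))
        if len(np.unique(y_true[idx])) < 2:          # resample must contain both classes
            continue
        diffs.append(roc_auc_score(y_true[idx], scores_b[idx]) -
                     roc_auc_score(y_true[idx], scores_a[idx]))
    return np.percentile(diffs, [2.5, 97.5])

# lo, hi = bootstrap_auc_difference(y_test, baseline_scores, new_model_scores)
# A CI that hugs zero (or spans only trivial differences) argues against deployment.
```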
Lesson: Always evaluate both statistical and clinical significance. With large samples, trivial differences become statistically significant. Ask: “Is this improvement large enough to change practice?” Consider effect sizes, confidence intervals, and practical impact, not just p-values.
Question 5: Confidence Intervals
Two models have been evaluated: - Model A: AUC 0.85 (95% CI: 0.83-0.87) - Model B: AUC 0.86 (95% CI: 0.79-0.93)
Which statement is correct?
Model B is definitely better because it has higher AUC
Model A is more reliable because it has a narrower confidence interval
The models cannot be compared without more information
Model B is better if you’re willing to accept more uncertainty
Answer: b) Model A is more reliable because it has a narrower confidence interval
Explanation: Confidence intervals reveal precision and uncertainty, not just point estimates:

Model A:
- AUC: 0.85
- 95% CI: 0.83-0.87
- Width: 0.04 (narrow)
- Interpretation: we’re 95% confident the true AUC is between 0.83 and 0.87 (precise estimate)

Model B:
- AUC: 0.86
- 95% CI: 0.79-0.93
- Width: 0.14 (wide)
- Interpretation: we’re 95% confident the true AUC is between 0.79 and 0.93 (imprecise estimate)
Key insight: the CIs overlap substantially (0.83-0.87 vs. 0.79-0.93). We cannot conclude that Model B is actually better; the difference may be due to chance.

In practice, most organizations prefer Model A:
- Predictable performance for planning
- Lower risk of underperformance
- Easier to set appropriate thresholds
- The small gain (0.01 AUC) is not worth the uncertainty
Lesson: Always report and consider confidence intervals, not just point estimates. Narrow CIs indicate reliable performance estimates. Wide CIs indicate uncertainty: real-world performance might be much worse (or better) than the point estimate suggests.
Question 6: Subgroup Analysis
You evaluate a diagnostic model and find: - Overall AUC: 0.84 - Men: AUC 0.88 - Women: AUC 0.78
What should you do?
Report only overall performance (0.84)
Report overall performance but note subgroup differences exist
Investigate why women’s performance is lower and consider separate models or adjustments
Conclude the model is biased and should not be used
Answer: c) Investigate why women’s performance is lower and consider separate models or adjustments
Explanation: Subgroup performance disparities require investigation and action, not just reporting:

Why performance differs (possible reasons):
Biological differences:
Disease presents differently (atypical symptoms in women)
Different physiological reference ranges
Example: Heart attack symptoms differ by sex
Data representation:
Fewer women in training data → model learns men’s patterns better
Women may be underdiagnosed historically → labels biased
Feature appropriateness:
Features optimized for men
Missing features relevant for women
Example: Pregnancy-related factors not included
Measurement bias:
Tests/measurements less accurate for women
Different documentation patterns
Potential solutions:
Collect more women’s data (if sample size issue)
Add sex-specific features:
```python
# Include pregnancy status and hormonal factors
features += ['pregnant', 'menopause_status', 'hormone_therapy']
```
Stratified modeling:
```python
# Separate models for men and women
if patient.sex == 'M':
    prediction = model_men.predict(patient)
else:
    prediction = model_women.predict(patient)
```
Weighted loss function:
```python
# Penalize errors on women more heavily during training
sample_weights = [2.0 if sex == 'F' else 1.0 for sex in data['sex']]
model.fit(X, y, sample_weight=sample_weights)
```
Lesson: Subgroup analysis is mandatory, not optional. When disparities found, investigate root causes and take corrective action. Don’t hide disparities in overall metrics.
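The stratified check itself is straightforward to run routinely. A minimal sketch, assuming a test DataFrame with sex, y_true, and y_prob columns (hypothetical names):

```python
import pandas as pd
from sklearn.metrics import roc_auc_score, brier_score_loss

def subgroup_report(df, group_col="sex", label_col="y_true", prob_col="y_prob"):
    """Per-subgroup sample size, discrimination (AUC), and calibration (Brier)."""
    rows = []
    for group, g in df.groupby(group_col):
        rows.append({
            group_col: group,
            "n": len(g),
            "auc": roc_auc_score(g[label_col], g[prob_col]),
            "brier": brier_score_loss(g[label_col], g[prob_col]),
        })
    return pd.DataFrame(rows)

# print(subgroup_report(test_df))   # report the gaps; don't let overall metrics hide them
```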
Question 7: LLM Evaluation Strategy
You’re deploying an LLM to summarize public health literature for practitioners. Which evaluation approach is MOST appropriate?
Calculate AUC-ROC on a test set of summaries
Measure perplexity (how surprised the model is by correct summaries)
Use BERTScore to compare generated summaries against expert-written gold standards + human expert evaluation
Only use BLEU score (n-gram overlap with reference summaries)
Answer: c) Use BERTScore to compare generated summaries against expert-written gold standards + human expert evaluation
Why:
Correct approach:
- BERTScore captures semantic similarity better than simple n-gram matching (BLEU)
- Human expert evaluation is essential for:
  - Factual accuracy (automated metrics can’t verify facts)
  - Clinical relevance (is the right information prioritized?)
  - Safety (are there dangerous omissions or errors?)
- Multiple metrics needed: BERTScore + factual accuracy + completeness + safety rating
Why other options are wrong:
a) AUC-ROC: Not applicable; LLMs generate text, not binary classifications or probabilities

b) Perplexity alone: Measures fluency, not accuracy or relevance (fluent nonsense scores well)

d) BLEU only: Too limited; high BLEU doesn’t guarantee accurate or clinically appropriate summaries
Best practice for LLM summarization:
1. Automated metrics (BERTScore, ROUGE) for efficiency
2. Human expert review on a sample (100+ summaries)
3. Safety audit (check for hallucinations, dangerous errors)
4. Prompt robustness testing (consistency across variations)
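A minimal sketch of step 1, assuming the third-party bert-score package is installed and that candidates and references are matched lists of generated and expert-written summaries (placeholder strings below):

```python
# pip install bert-score   (third-party package; assumed available)
from bert_score import score

candidates = ["<LLM-generated summary>"]         # model outputs
references = ["<expert-written gold summary>"]   # matched gold standards

# Semantic similarity to the gold standard (higher F1 is better)
P, R, F1 = score(candidates, references, lang="en")
print(f"Mean BERTScore F1: {F1.mean().item():.3f}")

# Automated scores are only a screen: route a sample (100+ summaries) to clinician
# reviewers for factual-accuracy, completeness, and safety grading (steps 2-3).
```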
Lesson: LLM evaluation requires fundamentally different approaches than traditional ML. Automated metrics alone are insufficient; human expert evaluation is mandatory, especially for clinical applications.
Question 8: Model Drift Detection
You’re monitoring a deployed readmission prediction model. Over 6 months, you observe:
- AUC-ROC: stable at 0.82 (was 0.83 at deployment)
- Brier score: increased from 0.15 to 0.21
- Patient age distribution: mean shifted from 58 to 65 years
- PSI for the age feature: 0.28
What does this indicate, and what should you do?
Model is fine, AUC is stable; continue monitoring
Significant data drift occurred; retrain immediately
Concept drift occurred; model is failing and needs urgent retraining
Calibration degraded but discrimination stable; recalibrate or retrain
Answer: d) Calibration degraded but discrimination stable; recalibrate or retrain
Analysis:
What happened:
- AUC-ROC stable: the model can still distinguish high-risk from low-risk patients (discrimination intact)
- Brier score increased: predicted probabilities are inaccurate (calibration degraded)
- Age distribution shifted: PSI = 0.28 indicates significant data drift (threshold: PSI > 0.25)
- Likely cause: data drift (an aging patient population) degrades calibration even though the model’s relative ranking ability (AUC) persists
Why other options are wrong:
a) Model is fine: WRONG; Brier score degradation and a high PSI require action

b) Data drift, retrain immediately: Partially correct but oversimplified; recalibration could be tried first (faster and simpler)

c) Concept drift: UNLIKELY; AUC would degrade if the relationship between features and outcome had changed. This pattern looks like data drift affecting calibration
What to do:
Immediate (within 1 week):
- Recalibrate the model on recent data (faster than full retraining)
- Test calibration on a recent hold-out set
- Deploy the recalibrated version if performance is restored

Scheduled (within 1 month):
- Full model retraining recommended (PSI > 0.25 for a critical feature)
- Validate the retrained model on a hold-out set
- Compare retrained vs. recalibrated performance
- Deploy the better-performing version

Monitoring:
- Track Brier score monthly (calibration early warning)
- Track PSI for all critical features
- Set an alert: PSI > 0.25 for ≥3 features = immediate retraining
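A minimal sketch of the PSI check used in this scenario, assuming baseline (training-era) and current feature values as 1-D arrays; the train_df/recent_df names in the usage note are hypothetical:

```python
import numpy as np

def population_stability_index(expected, actual, n_bins=10):
    """PSI between a baseline (expected) and current (actual) feature distribution."""
    expected, actual = np.asarray(expected, float), np.asarray(actual, float)
    edges = np.percentile(expected, np.linspace(0, 100, n_bins + 1))
    edges[0], edges[-1] = -np.inf, np.inf                 # catch values outside baseline range
    exp_pct = np.histogram(expected, bins=edges)[0] / len(expected)
    act_pct = np.histogram(actual, bins=edges)[0] / len(actual)
    exp_pct, act_pct = np.clip(exp_pct, 1e-6, None), np.clip(act_pct, 1e-6, None)
    return float(np.sum((act_pct - exp_pct) * np.log(act_pct / exp_pct)))

# psi_age = population_stability_index(train_df["age"], recent_df["age"])
# Rule of thumb: < 0.1 stable, 0.1-0.25 moderate shift, > 0.25 investigate/retrain.
```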
Lesson: Different types of drift require different interventions. Data drift may degrade calibration while preserving discrimination. Monitor multiple metrics (not just AUC) to catch drift early.
Question 9: Regulatory Classification
You’re developing an AI system that analyzes electronic health records to flag patients at high risk for cardiovascular disease in the next 5 years. Clinicians review flagged patients and decide whether to prescribe statins. How would the FDA likely classify this system?
Not a medical device (wellness/administrative use)
SaMD Level I (Low Risk)
SaMD Level II (Moderate Risk)
SaMD Level III (High Risk)
Answer: c) SaMD Level II (Moderate Risk)
Classification reasoning:
FDA SaMD Matrix:
- State of healthcare situation: serious (cardiovascular disease can cause long-term morbidity)
- Significance of information: drives clinical management (significantly influences the statin prescription decision)
Per FDA matrix: Serious + Drive clinical management = Level II
Why not other levels:
a) Not a medical device: WRONG
- The system makes medical claims (predicts disease risk)
- It influences clinical decisions (statin prescription)
- It clearly falls under the SaMD definition

b) Level I (Low Risk): WRONG
- CVD is not a “non-serious” condition
- The system does more than just “inform”; it drives treatment decisions

d) Level III (High Risk): WRONG
- The condition is not critical (immediately life-threatening); that tier covers uses such as ICU monitoring or acute MI
- The system does not treat or diagnose directly; clinicians make the final decision
- If the system autonomously prescribed statins → Level III

Not required (Level III would need these):
- Randomized controlled trial
- Extensive multi-site prospective validation
- Predetermined Change Control Plan (optional but recommended)
Borderline considerations:
If the system provided lower-stakes information (e.g., “Consider discussing lifestyle changes”) → Could be Level I
If the condition were critical/life-threatening (e.g., predict sepsis, guide ICU ventilator settings) → Level III
Lesson: FDA classification depends on both severity of condition AND how the information is used. “Driving clinical management” for serious conditions = Level II. Understanding classification early helps you plan appropriate validation rigor.
Question 10: Adversarial Robustness Testing
You’re evaluating a chest X-ray pneumonia detection model. Which robustness test is MOST important for real-world deployment?
Fast Gradient Sign Method (FGSM) adversarial attack testing
Natural perturbation testing (image compression, scanner variations, noise)
Model extraction attack testing (preventing reverse engineering)
Data poisoning resilience testing (can the training set be corrupted?)
Answer: b) Natural perturbation testing (image compression, scanner variations, noise)
Reasoning:
Real-world threat model:
- Natural perturbations occur constantly: different X-ray machines, image compression algorithms, patient positioning variations, and image quality differences across sites
- Likelihood: essentially 100% of deployed systems encounter natural variation
- Impact if not tested: the model may fail on images from different hospitals or equipment, limiting generalizability
Why other options are less critical (though still valuable):
a) FGSM adversarial attacks:
- Likelihood: near zero; no documented malicious adversarial attacks on medical imaging in practice
- Value: academic interest and regulatory compliance (the EU AI Act may require it), but not the most pressing real-world concern
- When important: high-profile systems, regulatory compliance (EU AI Act)

c) Model extraction:
- Risk: intellectual property theft, but it doesn’t directly harm patients
- Mitigation: API rate limiting and access control (non-evaluation solutions)

d) Data poisoning:
- Risk: rare unless using crowdsourced or untrusted training data
- Prevention: data provenance and quality control during training (not post-deployment testing)
Recommended (high-risk/regulatory):
4. Adversarial attack testing: FGSM, PGD (EU AI Act requirement)
5. Ablation studies: performance with missing or corrupted inputs

Advanced (specific threats):
6. Data poisoning resilience: if using federated learning or external data
7. Model extraction prevention: if protecting proprietary models
Example test:
```python
# Test robustness to JPEG compression (a natural perturbation)
import cv2

compression_qualities = [100, 90, 70, 50, 30]
for quality in compression_qualities:
    # Compress and decompress the image at the given JPEG quality
    _, compressed = cv2.imencode('.jpg', image, [cv2.IMWRITE_JPEG_QUALITY, quality])
    compressed_img = cv2.imdecode(compressed, cv2.IMREAD_COLOR)

    # Re-run the model on the degraded image
    prediction = model.predict(compressed_img)
    print(f"Quality {quality}: Prediction = {prediction:.3f}")

# Acceptable: <10% prediction change across quality 100 → 70
# Red flag: >20% prediction change (overfitting to high-quality images)
```
Lesson: Prioritize robustness testing based on real-world threat likelihood. For medical imaging, natural perturbations (equipment variation) are vastly more common than adversarial attacks. Test what will actually break your model in practice.
Discussion Questions
Validation hierarchy: You’ve developed a hospital-acquired infection prediction model. What validation studies would you conduct before deploying? In what order? What evidence would convince you to deploy at other hospitals?
Fairness trade-offs: Your sepsis model has AUC-ROC = 0.85 overall, but sensitivity is 0.90 for White patients vs. 0.75 for Black patients. What would you do? What are the trade-offs of different mitigation approaches?
Calibration vs. discrimination: Model A: AUC-ROC = 0.85, Brier score = 0.30 (poor calibration). Model B: AUC-ROC = 0.80, Brier score = 0.15 (excellent calibration). Which would you deploy? Why?
External validation failure: Your model achieves AUC-ROC = 0.82 in internal validation but 0.68 in external validation at a different hospital. What explains this? What are the next steps?
Clinical utility skepticism: A model predicts 30-day mortality with AUC-ROC = 0.88. Does this mean it is clinically useful? What additional evaluations are needed?
Prospective study design: You must evaluate a hospital readmission model prospectively. RCT, stepped-wedge, or silent mode? What are the trade-offs?
Alert threshold selection: A clinical decision support tool can alert at >10%, >20%, or >30% predicted risk. How would you choose? What factors matter?
Model drift: A COVID-19 forecasting model was trained on 2020 data; it is now 2023 with new variants. How would you assess whether it is still valid? What triggers retraining?