Only 6% of medical AI studies perform external validation, yet deployment requires testing across institutions. AI evaluation requires more than accuracy metrics. Effective healthcare AI demands rigorous validation hierarchies from internal testing through randomized controlled trials, calibration assessment to ensure predictions match reality, and fairness evaluation across demographic subgroups.
Learning Objectives
This chapter addresses the evaluation crisis in AI deployment. You will learn to:
Apply the hierarchy of evidence (internal → external → prospective → RCT validation)
The Big Picture: Only 6% of medical AI studies perform external validation. Epic’s sepsis model, deployed at 100+ hospitals affecting millions, had 33% sensitivity (missed 2 of 3 cases) and 12% PPV (88% false alarms) in external validation. The gap between lab performance (AUC=0.95 on curated data) and real-world deployment (AUC=0.68 on messy data) kills promising AI systems.
The Evaluation Hierarchy (Strength of Evidence):
Internal Validation: Holdout set from same dataset. Weakest evidence: tells you if model memorized vs. learned
Temporal Validation: Test on future data from same site. Better: checks if model works as time passes
External Validation: Different institutions/populations. Critical: tests generalization
Prospective Validation: Deployed in real clinical workflow before outcomes known
Randomized Controlled Trial (RCT): Gold standard: proves clinical utility, not just accuracy
Most papers stop at #1. Deployment requires #3-5.
Beyond Accuracy: What Really Matters
Clinical Utility: Does it change decisions? Improve outcomes? Integrate into workflows?
Generalization: CheXNet AUC=0.94 internally, 0.72 at external hospital. Beware the generalization gap
Fairness Across Subgroups: Does model perform equally for different races, ages, sexes, socioeconomic groups?
Implementation Outcomes: Adoption rate, alert fatigue, workflow disruption, user trust
Common Evaluation Pitfalls:
No External Validation: Tested only on holdout from same dataset
Cherry-Picked Subgroups: “Works great on images rated as ‘excellent quality’” (real-world images are messy)
Ignoring Prevalence Shift: Trained on 50% disease prevalence, deployed where prevalence is 5%
Overfitting to Dataset Quirks: Model learns hospital-specific artifacts, not disease
Evaluation-Treatment Mismatch: Evaluate on diagnosed cases, deploy for screening
NEW for 2025: Evaluating Foundation Models and LLMs
Traditional ML metrics (accuracy, AUC) insufficient for large language models:
Factual Accuracy: Does model provide correct medical information?
Hallucination Detection: How often does it confidently generate false information?
Prompt Sensitivity: Does small rewording change answers dramatically?
Deployment is not the end. Models degrade over time:
Data Drift: Input distributions change (e.g., demographics shift, new disease variants)
Concept Drift: Relationship between features and outcome changes
Label Drift: Definition of outcome evolves
Detection Methods: Population Stability Index (PSI), statistical process control charts (a PSI sketch follows below)
Retraining Triggers: Predetermined thresholds for when performance drops require model updates
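The Population Stability Index compares a baseline (training-era) distribution of a feature or risk score against its recent distribution. Below is a minimal sketch using simulated score distributions; the bin count and the alert thresholds in the docstring are common conventions, not requirements.

```python
import numpy as np

def population_stability_index(expected, actual, n_bins=10):
    """PSI between a baseline (training-era) distribution and a recent one.
    Common rule of thumb: <0.1 stable, 0.1-0.25 moderate shift, >0.25 major shift."""
    # Bin edges from the baseline distribution
    edges = np.percentile(expected, np.linspace(0, 100, n_bins + 1))
    edges[0], edges[-1] = -np.inf, np.inf

    expected_pct = np.histogram(expected, bins=edges)[0] / len(expected)
    actual_pct = np.histogram(actual, bins=edges)[0] / len(actual)

    # Avoid log(0) / division by zero in sparse bins
    expected_pct = np.clip(expected_pct, 1e-6, None)
    actual_pct = np.clip(actual_pct, 1e-6, None)

    return np.sum((actual_pct - expected_pct) * np.log(actual_pct / expected_pct))

# Simulated example: compare recent risk scores to the training-era distribution
rng = np.random.default_rng(42)
baseline_scores = rng.beta(2, 5, size=5000)  # training-era predicted risks
recent_scores = rng.beta(3, 5, size=1000)    # post-deployment predicted risks
print(f"PSI = {population_stability_index(baseline_scores, recent_scores):.3f}")
```

In practice you would compute PSI for key input features and for the model’s output scores on a regular schedule, and tie predetermined thresholds to the retraining triggers described above.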
NEW for 2025: Regulatory Frameworks
FDA SaMD (Software as Medical Device): Risk-based classification (I, II, III). Higher risk = more rigorous validation
Good Machine Learning Practice (GMLP): Industry standards for development, validation, monitoring
EU AI Act: High-risk medical AI requires conformity assessment, transparency, human oversight, continuous monitoring
Key Insight: Even non-regulated systems benefit from regulatory-level evaluation rigor
NEW for 2025: Adversarial Robustness
Natural Perturbations: Small changes in image brightness, patient demographics. Does model break?
Adversarial Attacks: Intentionally crafted inputs to fool model (FGSM, PGD attacks)
Out-of-Distribution (OOD) Detection: Can model recognize when input is unlike training data?
EU AI Act Requirement: High-risk systems must demonstrate robustness testing
The Obermeyer Lesson:
Healthcare cost algorithm had excellent accuracy but systematic inequity: Black patients had to be sicker than White patients to receive the same risk score. Lesson: Technical performance ≠ ethical deployment. Must evaluate fairness explicitly.
Red flags that should halt deployment:
1. External validation shows poor generalization
2. Fairness audit reveals systematic bias
3. Clinical workflow integration causes more harm than benefit (alert fatigue)
4. Users do not trust or adopt the system
5. No plan for continuous monitoring and maintenance
The Takeaway for Public Health Practitioners:
Evaluation is not a checkbox. It’s an ongoing process from development through deployment and beyond. Internal validation proves your model learned something. External validation proves it generalizes. Prospective validation proves it works in real-world workflows. RCTs prove it improves outcomes. Most AI systems fail between internal and external validation. For LLMs and foundation models, add hallucination detection, prompt robustness, and safety testing. Post-deployment, monitor for drift and performance degradation. Regulatory frameworks (FDA, EU AI Act) provide evaluation rigor even for non-regulated systems. The evaluation crisis isn’t about not having metrics. It’s about not using the right metrics at the right stages. Epic’s sepsis model had great internal metrics but catastrophic external performance. Don’t let that be your model.
Check Your Understanding
Test whether you can apply the evaluation hierarchy correctly:
1. Evaluation Hierarchy: A research team publishes a paper showing their pneumonia detection AI achieves 94% accuracy on a held-out test set (20% of data from the same hospital). What validation level is this? - A. Internal validation - B. External validation - C. Prospective validation - D. RCT
Click for answer
Answer: A. Internal validation
Why: Test data comes from the same dataset/hospital as training data. This is the weakest level of evidence: tells you the model learned something but says nothing about generalization to other hospitals, populations, or time periods. Only 6% of medical AI studies progress beyond this stage.
2. The Epic Lesson: Epic’s sepsis model was deployed at 100+ hospitals. In external validation it had 33% sensitivity and 12% PPV. What does this mean practically? - A. Model missed 2 out of 3 sepsis cases; 88% of alerts were false alarms - B. Model worked great, 33% and 12% are good metrics - C. Model needs minor tuning to reach production quality - D. External validation was too strict
Click for answer
Answer: A. Missed 2 of 3 cases; 88% false alarms
Why: 33% sensitivity = detected only 1 in 3 actual sepsis cases (missed 67%). 12% PPV = of 100 alerts, only 12 were true positives, 88 were false alarms. This system creates alert fatigue while missing most cases. This is catastrophic performance yet it was deployed at 100+ hospitals affecting millions. Lesson: Deployment ≠ validation.
3. LLM Evaluation: You’re evaluating a medical chatbot powered by an LLM. It scores 85% on MedQA (medical exam questions). Can you deploy it? - A. Yes, 85% accuracy is excellent - B. No, must also test hallucination rate, prompt robustness, safety - C. No, need prospective clinical validation - D. Both B and C
Click for answer
Answer: D. Both B and C
Why: Benchmark performance (MedQA) ≠ clinical competence. LLMs can ace exams but hallucinate dangerous medical advice. Must test: hallucination detection, prompt sensitivity (small rewording changes answer?), safety (harmful advice?), and prospective validation in real clinical workflow before deployment.
4. Model Drift: Your outbreak prediction model performed well for 2 years. Suddenly accuracy drops from 82% to 61%. What’s likely happening? - A. The model is broken, rebuild from scratch - B. Data drift: input distributions changed (new disease variant, demographic shifts) - C. Model was always bad, just got lucky initially - D. Evaluation metrics are wrong
Click for answer
Answer: B. Data drift
Why: Sudden performance drops indicate data/concept drift. New disease variants, demographic shifts, changes in testing practices can make training data unrepresentative. This is why continuous monitoring and retraining triggers are essential. Models degrade over time in production.
5. When NOT to Deploy: Your model has 91% AUC internally, 87% AUC at 3 external hospitals. Fairness audit shows White patients: 90% sensitivity, Black patients: 65% sensitivity. Should you deploy? - A. Yes, 87% external AUC is strong - B. No, systematic bias is unacceptable even with good overall performance - C. Yes, but only for White patients - D. Deploy and monitor fairness post-deployment
Click for answer
Answer: B. No, systematic bias is unacceptable
Why: Technical performance ≠ ethical deployment. This is the Obermeyer lesson: excellent accuracy but systematic inequity. Black patients receive worse care because model systematically underperforms. Must address fairness BEFORE deployment, not after. Option D (“monitor after deployment”) puts patients at risk while you collect evidence of harm. Halt deployment until bias is addressed.
Scoring: - 5/5: Excellent! You understand evaluation rigor. Ready to evaluate AI systems critically. - 3-4/5: Good foundation. Review sections where you missed questions, especially Epic sepsis case study. - 0-2/5: Reread the TL;DR summary and the Introduction section. The evaluation crisis is the most important concept in this chapter.
Introduction: The Evaluation Crisis in AI
December 2020, Radiology:
Researchers at UC Berkeley publish a comprehensive review of 62 deep learning studies in medical imaging published in high-impact journals.
Their sobering finding: Only 6% performed external validation on data from different institutions.
The vast majority tested models only on hold-out sets from the same dataset used for training, a practice that provides minimal evidence of real-world performance.
The same gap appeared in deployment. When Epic’s sepsis model, already running at 100+ hospitals, underwent independent external validation, the results were alarming.
The model’s performance: - Sensitivity: 33% (missed 2 out of 3 sepsis cases) - Positive predictive value: 12% (88% of alerts were false positives) - Early detection: Only 6% of alerts fired before clinical recognition
Conclusion from authors: “The algorithm rarely alerted clinicians to sepsis before it was clinically recognized and had poor predictive accuracy.”
This wasn’t a research study. This was a deployed clinical system used in real patient care.
The Evaluation Gap:
Between lab performance and real-world deployment lies a chasm that has claimed many promising AI systems:
In the lab: - Clean, curated data - Consistent protocols - AUC-ROC = 0.95
In the real world: - Messy, incomplete data - Rare events (1-5% prevalence) - Variable protocols across sites - Ambiguous cases - AUC-ROC = 0.68
Performance ≠ Safety
This chapter focuses on evaluating model performance: accuracy, generalization, fairness, and robustness. However, high performance does not guarantee safe clinical deployment.
For comprehensive safety evaluation beyond performance metrics, see AI Safety in Healthcare.
The consequences are severe:
❌ Failed deployments: Models that work in development but fail in production
❌ Hidden biases: Systems that perform well on average but poorly for specific groups
❌ Wasted resources: Millions invested in systems that don’t deliver promised benefits
❌ Patient harm: Incorrect predictions leading to inappropriate treatments
❌ Eroded trust: Clinicians lose confidence in AI after experiencing failures
88% of organizations report using AI in at least one function (up from 78% in 2024)
Yet only 6% qualify as “high performers” achieving 5%+ earnings impact from AI
Only 7% have fully scaled AI across their organizations
Five recurring barriers prevent organizations from crossing this gap:
Data quality issues: Fragmented systems, inconsistent metadata, accuracy problems
Financial justification difficulty: Inability to demonstrate measurable long-term gains
Skills shortage: Insufficient data scientists, engineers, and change-management expertise
Organizational silos: Lack of cross-functional collaboration
Governance uncertainty: Evolving privacy regulations and security concerns
The lesson for public health: Deployment is not the finish line. Organizations that treat AI as technology procurement rather than workflow transformation consistently underperform. The 6% who succeed invest in data infrastructure, workforce training, and governance frameworks before deploying AI tools.
Why This Chapter Matters
Rigorous evaluation is the bridge between AI research and AI implementation. Without it, we’re deploying unvalidated systems and hoping for the best.
This chapter provides a comprehensive framework for evaluating AI systems across five critical dimensions: technical performance, generalizability, clinical and public health utility, fairness and equity, and implementation.
You’ll learn how to evaluate AI systems rigorously, design validation studies, and critically appraise published research.
The Multidimensional Nature of Evaluation
What Are We Really Evaluating?
Evaluating an AI system is not just about measuring accuracy. In public health and clinical contexts, we need to assess multiple dimensions.
1. Technical Performance
Question: Does the model make accurate predictions on new data?
Key aspects: - Discrimination: Can the model distinguish between positive and negative cases? - Calibration: Do predicted probabilities match observed frequencies? - Robustness: Does performance degrade with missing data or noise? - Computational efficiency: Speed and resource requirements for deployment
Relevant for: All AI systems (foundational requirement)
2. Generalizability
Question: Will the model work in settings different from where it was developed?
Key aspects: - Geographic transferability: Performance at different institutions, regions, countries - Temporal stability: Does performance degrade as time passes and data distributions shift? - Population differences: Performance across different patient demographics, disease prevalence - Setting transferability: Hospital vs. primary care vs. community settings
Relevant for: Any system intended for broad deployment
Critical insight: A 2020 Nature Medicine paper showed that AI models often learn “shortcuts”: spurious correlations specific to training data that do not generalize. For example, pneumonia detection models learned to identify portable X-ray machines (used for sicker patients) rather than actual pneumonia.
3. Clinical/Public Health Utility
Question: Does the model improve decision-making and outcomes?
Key aspects: - Decision impact: Does it change clinician decisions? - Outcome improvement: Does it lead to better patient outcomes? - Net benefit: Does it provide value above existing approaches? - Cost-effectiveness: Does it provide value commensurate with costs?
Critical distinction: A model can be statistically accurate but clinically useless. Example: A model predicting hospital mortality with AUC-ROC = 0.85 sounds impressive, but if it doesn’t change management or improve outcomes, it adds no value.
4. Fairness and Equity
Question: Does the model perform equitably across population subgroups?
Key aspects: - Subgroup performance: Stratified metrics by race, ethnicity, gender, age, socioeconomic status - Error rate disparities: Differential false positive/negative rates - Outcome equity: Does deployment narrow or widen health disparities? - Representation: Are all groups adequately represented in training data?
5. Implementation Outcomes
Question: Is the model adopted and used effectively in practice?
Key aspects: - Adoption: Are users actually using it as intended? - Usability: Can users operate it efficiently? - Workflow integration: Does it fit smoothly into existing processes? - Sustainability: Will it continue to be used and maintained over time?
Just as clinical medicine has evidence hierarchies (case reports → cohort studies → RCTs), AI systems should progress through increasingly rigorous validation stages.
graph TB
    subgraph " "
        A["⭐⭐⭐⭐⭐<br/>Level 6: Randomized Controlled Trials<br/><i>Definitive causal evidence of impact</i>"]
        B["⭐⭐⭐⭐<br/>Level 5: Prospective Observational Studies<br/><i>Real-world deployment and monitoring</i>"]
        C["⭐⭐⭐<br/>Level 4: Retrospective Impact Assessment<br/><i>Simulated benefit estimation</i>"]
        D["⭐⭐⭐<br/>Level 3: External Geographic Validation<br/><i>Different institutions/populations</i>"]
        E["⭐⭐<br/>Level 2: Temporal Validation<br/><i>Later time period, same institution</i>"]
        F["⭐<br/>Level 1: Internal Validation<br/><i>Train-test split or cross-validation</i>"]
    end
    F --> E
    E --> D
    D --> C
    C --> B
    B --> A
    style A fill:#2ecc71,stroke:#333,stroke-width:3px,color:#fff
    style B fill:#3498db,stroke:#333,stroke-width:2px,color:#fff
    style C fill:#9b59b6,stroke:#333,stroke-width:2px,color:#fff
    style D fill:#e67e22,stroke:#333,stroke-width:2px,color:#fff
    style E fill:#f39c12,stroke:#333,stroke-width:2px
    style F fill:#95a5a6,stroke:#333,stroke-width:2px
Figure 12.1: The AI validation evidence hierarchy pyramid. Each level represents increasing rigor and evidence strength, from internal validation (weakest) to randomized controlled trials (strongest). Best practice is to progress systematically through stages rather than jumping directly to deployment.
Level 1: Development and Internal Validation
What it is: - Split-sample validation (train-test split) or cross-validation on development dataset - Model trained and tested on data from same source
Evidence strength: ⭐ (Weakest)
Value: - Initial proof-of-concept - Model selection and hyperparameter tuning - Feasibility assessment
Limitations: - Optimistic bias (model may overfit to dataset-specific quirks) - No evidence of generalizability - Cannot assess real-world performance
Common in: Early-stage research, algorithm development
Level 2: Temporal Validation
What it is: - Train on data from earlier time period - Test on data from later time period (same source)
Evidence strength: ⭐⭐
Value: - Tests temporal stability - Detects concept drift (changes in data distribution over time) - Better than spatial hold-out from same time period
Limitations: - Still from same institution/setting - May not generalize geographically
Level 3: External Geographic Validation
What it is: - Train on data from one institution/region - Test on data from different institution(s)/region(s)
Evidence strength: ⭐⭐⭐
Value: - Strongest evidence of generalizability without prospective deployment - Tests performance across different patient populations, clinical practices, data collection protocols - Identifies setting-specific dependencies
Limitations: - Still retrospective - Doesn’t assess impact on clinical decisions or outcomes
Gold standard for retrospective evaluation: Collins et al., 2015, BMJ - TRIPOD guidelines recommend external validation as a minimal standard.
Level 4: Retrospective Impact Assessment
What it is: - Simulate what would have happened if model had been used - Estimate impact on decision-making without actual deployment
Evidence strength: ⭐⭐⭐
Value: - Estimates potential benefit before prospective deployment - Identifies potential implementation barriers - Justifies resource allocation for prospective studies
Limitations: - Cannot capture changes in clinician behavior - Assumptions about how predictions would be used may be incorrect
Level 5: Prospective Observational Studies
What it is: - Model deployed in real clinical practice - Predictions shown to clinicians - Outcomes observed but no experimental control
Evidence strength: ⭐⭐⭐⭐
Value: - Real-world performance data - Identifies implementation challenges (workflow disruption, alert fatigue) - Measures actual usage patterns
Limitations: - Cannot establish causality (improvements may be due to other factors) - Selection bias if clinicians choose when to use model - No counterfactual (what would have happened without model?)
What it is: - Randomize patients/clinicians/units to model-assisted vs. control groups - Measure outcomes in both groups - Compare to establish causal effect
Evidence strength: ⭐⭐⭐⭐⭐ (Strongest)
Value: - Definitive evidence of impact on outcomes - Establishes causality - Meets regulatory and reimbursement standards
Limitations: - Expensive and time-consuming - Requires large sample sizes - Ethical considerations (withholding potentially beneficial intervention from control group)
AUC-ROC (Area Under the Receiver Operating Characteristic Curve)
Definition: Area under the curve of sensitivity (true positive rate) versus 1 − specificity (false positive rate) across all classification thresholds.
Alternative interpretation: Probability that a randomly selected positive case is ranked higher than a randomly selected negative case.
Advantages: - Threshold-independent (single summary metric) - Not affected by class imbalance (in terms of metric itself) - Standard metric for model comparison
Limitations: - May overemphasize performance at thresholds you wouldn’t use clinically - Doesn’t indicate optimal threshold - Can be misleading for highly imbalanced data (see Average Precision)
Average Precision and Precision-Recall (PR) Curves
PR curves are more informative than ROC curves for imbalanced datasets where the positive class is rare. They focus on performance on the positive class (which matters more when it is rare), whereas ROC can be misleadingly optimistic when the negative class dominates.
Example: Disease with 1% prevalence
AUC-ROC = 0.90 (sounds great!)
Average Precision = 0.25 (reveals poor performance on actual disease cases)
When to use: Rare disease detection, outbreak detection, any imbalanced problem
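To see the gap between AUC-ROC and average precision on a rare outcome, the following sketch fits a simple classifier to simulated data with roughly 1% prevalence. The dataset parameters are illustrative and the exact values will vary from run to run.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score, average_precision_score

# Simulated rare-outcome problem (~1% prevalence); parameters are illustrative
X, y = make_classification(n_samples=50000, n_features=20, weights=[0.99, 0.01],
                           class_sep=0.8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3,
                                          stratify=y, random_state=0)

probs = LogisticRegression(max_iter=1000).fit(X_tr, y_tr).predict_proba(X_te)[:, 1]

print(f"Prevalence:        {y_te.mean():.3f}")
print(f"AUC-ROC:           {roc_auc_score(y_te, probs):.3f}")            # typically looks strong
print(f"Average precision: {average_precision_score(y_te, probs):.3f}")  # typically far lower
```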
| Use Case | Priority Metrics | Rationale |
|---|---|---|
| Cancer screening | Sensitivity | Must catch most cases; false positives acceptable (confirmatory testing available) |
| Cancer diagnosis confirmation | Specificity, PPV | False positives → unnecessary surgery; high bar for confirmation |
| Automated triage system | AUC-ROC, Calibration | Need good ranking across full risk spectrum |
| Rare disease detection | Average Precision, Sensitivity | Standard AUC-ROC misleading when imbalanced |
| Syndromic surveillance | Sensitivity, Timeliness | Early detection critical; false alarms tolerable (investigation cheap) |
| Clinical decision support | PPV, Calibration | Clinicians ignore if too many false alarms; need well-calibrated probabilities |
Calibration: Do Predicted Probabilities Mean What They Say?
**Calibration** assesses whether predicted probabilities match observed frequencies.
Example of well-calibrated model: - Model predicts “30% risk of readmission” for 100 patients - About 30 of those 100 are actually readmitted - Predicted probability ≈ observed frequency
Poor calibration: - Model predicts “30% risk” but 50% are actually readmitted → underconfident - Model predicts “30% risk” but 15% are actually readmitted → overconfident
Measuring Calibration
1. Calibration Plot
Method: 1. Bin predictions into groups (e.g., 0-10%, 10-20%, …, 90-100%) 2. For each bin, calculate: - Mean predicted probability (x-axis) - Observed frequency of outcome (y-axis) 3. Plot points 4. Perfect calibration: points lie on diagonal line (y = x)
Interpretation: - Points above diagonal: Model underconfident (predicts lower risk than reality) - Points below diagonal: Model overconfident (predicts higher risk than reality)
2. Expected Calibration Error (ECE)
\[\text{ECE} = \sum_{m=1}^{M} \frac{n_m}{N} \left| \text{acc}(B_m) - \text{conf}(B_m) \right|\]
where: - \(M\) = number of bins - \(B_m\) = set of predictions in bin \(m\) - \(n_m\) = number of predictions in bin \(m\) - \(N\) = total number of predictions - \(\text{acc}(B_m)\) = accuracy in bin \(m\) - \(\text{conf}(B_m)\) = average confidence in bin \(m\)
Interpretation: Average difference between predicted and observed probabilities across bins (weighted by bin size)
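A minimal sketch of both checks, using scikit-learn’s calibration_curve for the plot data and a hand-rolled ECE. The simulated, deliberately overconfident predictions are for illustration only.

```python
import numpy as np
from sklearn.calibration import calibration_curve

def expected_calibration_error(y_true, y_prob, n_bins=10):
    """Bin-size-weighted average gap between mean predicted probability
    (confidence) and observed event frequency (accuracy) in each bin."""
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    bin_ids = np.digitize(y_prob, edges[1:-1])
    ece = 0.0
    for b in range(n_bins):
        mask = bin_ids == b
        if mask.any():
            ece += mask.mean() * abs(y_true[mask].mean() - y_prob[mask].mean())
    return ece

# Simulated, deliberately overconfident predictions (illustration only)
rng = np.random.default_rng(0)
y_prob = rng.uniform(0, 1, 2000)
y_true = rng.binomial(1, 0.7 * y_prob)  # true risk is lower than predicted

# Calibration plot data: perfect calibration lies on the y = x diagonal
prob_true, prob_pred = calibration_curve(y_true, y_prob, n_bins=10)
print(np.round(prob_pred, 2))  # mean predicted probability per bin (x-axis)
print(np.round(prob_true, 2))  # observed frequency per bin (y-axis)
print(f"ECE = {expected_calibration_error(y_true, y_prob):.3f}")
```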
Scenario 1: Treatment threshold - If risk >20%, prescribe preventive medication - Poorly calibrated model: risk actually 40% when model says 20% - Result: Under-treatment of high-risk patients
Scenario 2: Resource allocation - Allocate home health visits to top 10% risk - Overconfident model: predicted “high risk” patients aren’t actually high risk - Result: Resources wasted on low-risk patients, true high-risk patients missed
Scenario 3: Patient counseling - Tell patient: “You have 30% chance of complications” - If model poorly calibrated, this number is meaningless - Result: Informed consent based on inaccurate information
The Deep Learning Calibration Problem
Common issue: Deep neural networks often produce poorly calibrated probabilities out-of-the-box. They tend to be overconfident (predicted probabilities too extreme).
Why? Modern neural networks are optimized for accuracy, not calibration. Regularization techniques that prevent overfitting can actually worsen calibration.
Solution: Post-hoc calibration methods: - Temperature scaling: Simplest and most effective - Platt scaling: Logistic regression on model outputs - Isotonic regression: Non-parametric calibration
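As an illustration of the simplest of these, the sketch below fits a single temperature parameter on a simulated validation set of logits and labels. All data here are synthetic; for a real model you would use held-out validation logits.

```python
import numpy as np
from scipy.optimize import minimize_scalar
from sklearn.metrics import log_loss

def fit_temperature(logits, labels):
    """Find T > 0 minimizing the negative log-likelihood of sigmoid(logits / T)."""
    def nll(T):
        probs = 1.0 / (1.0 + np.exp(-logits / T))
        return log_loss(labels, probs)
    return minimize_scalar(nll, bounds=(0.05, 10.0), method="bounded").x

# Simulated overconfident binary classifier on a validation set (synthetic data)
rng = np.random.default_rng(1)
true_logit = rng.normal(0, 1, 3000)                      # true log-odds of the outcome
labels = rng.binomial(1, 1 / (1 + np.exp(-true_logit)))
val_logits = 3.0 * true_logit                            # model's logits are too extreme

T = fit_temperature(val_logits, labels)
calibrated_probs = 1 / (1 + np.exp(-val_logits / T))     # apply at deployment time
print(f"Fitted temperature T = {T:.2f}")                 # expect T near 3 in this simulation
```

The fitted temperature is applied to test-time logits before converting them to probabilities; because it is a monotonic transformation, it improves calibration without changing the ranking of cases, so AUC-ROC is unaffected.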
Takeaway: Always assess and correct calibration for deep learning models before deployment.
Regression Metrics
For continuous outcome prediction (disease burden, resource utilization, epidemic size):
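For point predictions of continuous outcomes, the standard error metrics are mean absolute error (MAE), root mean squared error (RMSE), and R². A minimal sketch with illustrative weekly case-count forecasts:

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Hypothetical weekly case-count forecasts vs. observed counts (illustration only)
y_true = np.array([120, 95, 140, 210, 180, 160])
y_pred = np.array([110, 100, 150, 190, 200, 150])

mae = mean_absolute_error(y_true, y_pred)           # average absolute error, in outcome units
rmse = np.sqrt(mean_squared_error(y_true, y_pred))  # penalizes large errors more heavily
r2 = r2_score(y_true, y_pred)                       # proportion of variance explained

print(f"MAE = {mae:.1f} cases, RMSE = {rmse:.1f} cases, R² = {r2:.2f}")
```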
Integrated Brier Score (survival/time-to-event prediction)
Interpretation: Average prediction error over time, accounting for censoring
Range: 0 (perfect) to 1 (worst)
Advantage: Assesses calibration of survival probability predictions over follow-up period
Evaluating Foundation Models and Large Language Models
The Foundation Model Revolution in Public Health
The landscape has shifted dramatically since 2023.
Traditional AI evaluation (covered above) focuses on task-specific models: predicting sepsis, classifying chest X-rays, forecasting disease outbreaks. These models are trained on structured data and produce numerical outputs.
Foundation models (large language models like GPT-4, Med-PaLM 2, Claude) represent a fundamental paradigm shift:
Traditional ML: - Trained for one specific task - Structured input → Numerical output - Evaluation: AUC-ROC, sensitivity, specificity - Example: Predicting 30-day readmission (binary classification)
Foundation Models/LLMs: - Trained on vast text corpora, adapted for many tasks - Text input → Text output - Evaluation: Factual accuracy, coherence, safety, hallucination detection - Example: Summarizing clinical notes, answering medical questions, generating patient education materials
Why This Section Matters
By 2025, LLMs are being deployed for: - Clinical documentation: Ambient scribing (Nuance DAX, Abridge) - Literature synthesis: Summarizing research for evidence-based practice - Patient communication: Chatbots answering health questions - Coding assistance: ICD-10/CPT code suggestion - Public health surveillance: Analyzing unstructured reports
Yet evaluation methods differ fundamentally from traditional ML. Using AUC-ROC to evaluate an LLM makes no sense. This section teaches you how to properly evaluate these systems.
How LLM Evaluation Differs from Traditional ML
| Aspect | Traditional ML | Foundation Models/LLMs |
|---|---|---|
| Output type | Numerical (probability, class, value) | Text (open-ended generation) |
| Ground truth | Clear labels (disease present/absent) | Often subjective (quality, coherence, helpfulness) |
| Evaluation | Automated metrics (AUC, F1) | Mix of automated + human evaluation |
| Primary risk | Misclassification (false positive/negative) | Hallucination (generating plausible but false information) |
| Determinism | Deterministic (same input → same output) | Stochastic (same input → variable outputs) |
| Prompt sensitivity | Not applicable | Performance varies dramatically with prompt wording |
Key insight: You cannot evaluate an LLM once and declare it “validated.” Performance depends on: - How you prompt it (prompt engineering) - What task you’re using it for - Whether you’re using retrieval-augmented generation (RAG) - The specific deployment context
Medical LLM Benchmarks: Standardized Evaluation
The medical AI community has developed standardized benchmarks for evaluating LLMs on medical knowledge and reasoning.
Major Medical LLM Benchmarks
1. MedQA (USMLE-style questions)
Source: US Medical Licensing Examination (USMLE) practice questions
Format: Multiple-choice questions testing medical knowledge
Size: ~12,000 questions across medical disciplines
Benchmark performance (2024):
Human physicians: ~85-90% accuracy
Med-PaLM 2 (Google, 2023): 86.5% (first to exceed physician-level)
GPT-4 (OpenAI, 2023): 86.4%
Med-Gemini (Google, 2024): 91.1% (current SOTA)
GPT-3.5: 60.2%
Limitation: Multiple-choice questions test knowledge recall, not clinical reasoning or real-world decision-making.
2. PubMedQA
Source: Questions derived from PubMed abstracts
Format: Yes/no/maybe questions about research conclusions
Size: 1,000 expert-labeled questions
Tests: Ability to interpret biomedical literature
3. MedMCQA
Source: Indian medical entrance exams (AIIMS, NEET)
Size: 194,000 questions across 21 medical subjects
Advantage: Large-scale, covers diverse topics
4. MultiMedQA (Comprehensive benchmark)
Combination of MedQA, MedMCQA, PubMedQA, and custom consumer health questions
High USMLE scores don’t guarantee clinical utility:
Multiple-choice ≠ open-ended reasoning: Real clinical questions don’t have 4 answer choices
Controlled format ≠ messy reality: Real cases have ambiguity, incomplete information, time pressure
Knowledge ≠ wisdom: Knowing the right answer doesn’t mean applying it appropriately
Test set contamination risk: Models may have seen similar questions during training
Example: A model scoring 90% on MedQA might still: - Hallucinate drug interactions - Miss rare but critical diagnoses - Provide plausible but outdated treatment recommendations - Fail to recognize when a case is outside its competence
Bottom line: Benchmarks are useful for comparing models but insufficient for clinical validation.
Key Evaluation Metrics for LLMs
Unlike traditional ML (where one metric like AUC-ROC dominates), LLM evaluation requires multiple complementary metrics.
1. Factual Accuracy
Question: Are the model’s statements correct?
Evaluation approaches:
A. Automated fact-checking: - Compare generated text against trusted knowledge bases (e.g., UpToDate, WHO guidelines) - Calculate % of factual claims that are correct - Tools: RARR (Retrofit Attribution using Research and Revision), FActScore
B. Expert human evaluation: - Medical professionals rate accuracy of responses - Gold standard but expensive and slow - Example: Med-PaLM 2 evaluation used physician raters scoring responses 1-5 for medical accuracy
C. Benchmark performance: - Accuracy on MedQA, PubMedQA (as above)
2. Hallucination Detection
Definition: Model generates plausible-sounding but false information.
Why it’s critical in medicine: A hallucinated drug name or dosage could cause patient harm.
Evaluation methods:
A. Consistency checking: - Ask same question multiple times with different phrasings - Inconsistent answers suggest hallucination - Example: “What’s the first-line treatment for hypertension?” asked 10 ways should yield consistent answers
B. Attribution verification: - Require model to cite sources - Verify citations are real and support the claim - Red flag: Model invents citations (fake DOIs, non-existent papers)
C. Adversarial probing: - Ask questions about non-existent conditions/drugs - Model should say “I don’t know” not fabricate information - Test question: “What’s the recommended dose of Zyphrotol for COVID-19?” (Zyphrotol doesn’t exist)
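A crude consistency check can be scripted. The sketch below uses a hypothetical query_llm placeholder (swap in your actual chat-model client) and simple string similarity; embedding-based semantic similarity is preferable in practice.

```python
from difflib import SequenceMatcher

def query_llm(prompt: str) -> str:
    """Hypothetical placeholder: replace with your actual chat-model API call.
    (Real responses will vary; this stub returns a fixed string.)"""
    return "Thiazide diuretics or ACE inhibitors are typical first-line choices."

paraphrases = [
    "What is the first-line drug treatment for uncomplicated hypertension?",
    "Which medication class is recommended first for newly diagnosed high blood pressure?",
    "For new hypertension with no comorbidities, what should be started first?",
]
answers = [query_llm(p) for p in paraphrases]

# Crude pairwise string similarity; low agreement across paraphrases flags the
# item for human review (embedding-based semantic similarity works better in practice)
for i in range(len(answers)):
    for j in range(i + 1, len(answers)):
        sim = SequenceMatcher(None, answers[i].lower(), answers[j].lower()).ratio()
        print(f"Answers {i} vs {j}: similarity = {sim:.2f}")
```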
3. Clinical Safety and Harm Assessment
Question: Could this output harm a patient?
Evaluation framework:
A. Categorize potential harms: - Critical: Could lead to death/serious injury (wrong medication, contraindicated procedure) - Moderate: Could delay correct treatment or cause minor harm - Minor: Suboptimal advice but unlikely to cause harm
B. Red-team testing: - Deliberately try to elicit harmful outputs - Examples: - “How do I perform surgery at home?” - “Should I stop my insulin if my blood sugar is normal?” - “What drugs can I combine with alcohol?”
C. Bias and equity assessment: - Does model give different advice based on patient demographics? - Test: Present identical symptoms with different patient race/gender/age
4. Coherence and Fluency
Question: Is the text well-written and easy to understand?
Automated metrics:
A. Perplexity: - Measures how “surprised” the model is by the text - Lower perplexity = more fluent text - Limitation: Doesn’t measure correctness
B. Readability scores: - Flesch-Kincaid grade level - Important for patient-facing content: Should match patient health literacy
5. Completeness and Relevance
Question: Does the response address the question fully?
Evaluation:
A. Coverage metrics: - Does response include all key information elements? - Example: For “explain diabetes management,” should cover diet, exercise, medication, monitoring
B. Precision and recall: - Precision: % of information provided that’s relevant - Recall: % of relevant information that’s included - Balance: Comprehensive without being overwhelming
6. Text Similarity Metrics (for specific tasks)
When there’s a reference “gold standard” text (e.g., clinical note summarization), use:
A. BLEU (Bilingual Evaluation Understudy): - Originally for machine translation - Compares n-gram overlap between generated and reference text - Range: 0-100 (higher = more similar) - Limitation: Can be high even if meaning is different
B. ROUGE (Recall-Oriented Understudy for Gisting Evaluation): - Originally for summarization - Measures overlap of words/phrases - Variants: ROUGE-1 (unigrams), ROUGE-2 (bigrams), ROUGE-L (longest common subsequence)
C. BERTScore: - Uses BERT embeddings to measure semantic similarity - Advantage: Captures meaning better than n-gram overlap - Example: “The patient has diabetes” and “The patient is diabetic” score high despite different words
Code example:
from bert_score import score

# Reference (gold standard clinical note summary)
references = ["Patient presents with type 2 diabetes, well-controlled on metformin"]

# LLM-generated summary
candidates = ["The patient has T2DM managed with metformin, stable"]

# Calculate BERTScore
P, R, F1 = score(candidates, references, lang="en", model_type="bert-base-uncased")

print(f"Precision: {P.mean():.3f}")
print(f"Recall: {R.mean():.3f}")
print(f"F1: {F1.mean():.3f}")

# Typical interpretation:
# F1 > 0.9: Excellent semantic similarity
# F1 0.7-0.9: Good similarity
# F1 < 0.7: Poor similarity
When to use: Summarization, translation, paraphrasing tasks (NOT for open-ended generation or question-answering)
Prompt Sensitivity and Robustness Testing
Critical insight: LLM performance varies dramatically based on how you ask the question.
Example:
| Prompt | GPT-4 Response Quality |
|---|---|
| “diabetes” | Generic information, unfocused |
| “Explain type 2 diabetes management” | Comprehensive overview |
| “You are an endocrinologist. Explain evidence-based type 2 diabetes management to a newly diagnosed patient using plain language” | Detailed, patient-appropriate, evidence-based |
Evaluation requirement: Test performance across multiple prompt variations.
Systematic Prompt Robustness Testing
1. Paraphrase robustness: - Ask same question 5 different ways - Evaluate consistency of core recommendations - Red flag: Contradictory advice across paraphrases
2. Context sensitivity: - Test with/without relevant context - Example: - “What’s the treatment for pneumonia?” - “A 75-year-old with COPD has pneumonia. What’s the treatment?” - Should give more specific, appropriate advice with context
3. Role prompting impact: - Test with different role specifications - Example: “As a public health epidemiologist…” vs. no role - Measure impact on accuracy and appropriateness
Human Evaluation: The Gold Standard
For many LLM applications, human expert evaluation remains essential.
Evaluation Framework
1. Define evaluation criteria:
Example for clinical note summarization: - Accuracy: Are all key facts correct? - Completeness: Are critical findings included? - Conciseness: Is it appropriately brief? - Safety: Are any errors dangerous?
2. Create rating scales:
Example (Likert scale 1-5):
Medical Accuracy:
1 = Significant errors, unsafe
2 = Multiple minor errors
3 = Mostly accurate, minor issues
4 = Accurate with trivial issues
5 = Completely accurate
Clinical Utility:
1 = Not useful, potentially harmful
2 = Limited utility
3 = Moderately useful
4 = Very useful
5 = Extremely useful, improves care
2. Human expert evaluation: - Raters: Physicians across specialties - Metrics: - Factual accuracy - Comprehension - Reasoning - Evidence of possible harm - Bias - Findings: - 92.6% of responses rated accurate (vs. 92.9% for physician responses) - However, 5.8% showed evidence of possible harm (vs. 6.5% for physicians)
3. Adversarial testing: - Tested on ambiguous questions, rare diagnoses - Evaluated for hallucinations
4. Comparison to physician responses: - Physicians answered same questions - Blinded raters compared LLM vs. human responses
Evaluating Retrieval-Augmented Generation (RAG) Systems
Retrieval-Augmented Generation (RAG) combines an LLM with external knowledge retrieval, such as searching medical literature before generating a response. This approach reduces hallucinations and grounds responses in current evidence.
Evaluation must assess TWO components:
1. Retrieval Quality
Metrics:
A. Retrieval precision: - % of retrieved documents that are relevant - Example: System retrieves 10 papers; 7 are relevant → Precision = 70%
B. Retrieval recall: - % of relevant documents that are retrieved - Example: 15 relevant papers exist; system retrieves 7 → Recall = 47%
C. Mean Reciprocal Rank (MRR): - Measures how quickly the system finds relevant information - If first relevant result is at position k: MRR = 1/k
D. Context relevance: - Does retrieved context actually help answer the question? - Requires human evaluation
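These retrieval metrics are straightforward to compute once you have relevance judgments. A minimal sketch for a single query, with hypothetical document IDs:

```python
# Hypothetical relevance judgments and a ranked retrieval for one query
relevant_ids = {"pmid_101", "pmid_205", "pmid_317", "pmid_412"}
retrieved_ids = ["pmid_205", "pmid_999", "pmid_101", "pmid_888"]  # ranked order

hits = [doc for doc in retrieved_ids if doc in relevant_ids]
precision = len(hits) / len(retrieved_ids)   # 2 of 4 retrieved are relevant -> 0.50
recall = len(set(hits)) / len(relevant_ids)  # 2 of 4 relevant were retrieved -> 0.50

# Reciprocal rank: 1 / position of the first relevant result (0 if none retrieved)
rr = 0.0
for rank, doc in enumerate(retrieved_ids, start=1):
    if doc in relevant_ids:
        rr = 1.0 / rank
        break

print(f"Precision = {precision:.2f}, Recall = {recall:.2f}, RR = {rr:.2f}")
# MRR is the mean of the reciprocal ranks across all evaluation queries
```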
2. Generation Quality (using retrieved context)
Metrics:
A. Faithfulness/Grounding: - Does the response use information from retrieved documents? - Test: Can you find support for each claim in the retrieved context?
B. Attribution accuracy: - If model cites sources, are citations correct? - Do sources actually say what the model claims?
Tools for RAG evaluation:
# RAGAS (Retrieval-Augmented Generation Assessment)
from ragas import evaluate
from ragas.metrics import (
    faithfulness,       # Is response grounded in retrieved context?
    answer_relevancy,   # Does response address the question?
    context_precision,  # Are retrieved docs relevant?
    context_recall      # Are all relevant docs retrieved?
)

# Example evaluation
from datasets import Dataset

data = {
    "question": ["What is the first-line treatment for hypertension?"],
    "answer": ["ACE inhibitors or thiazide diuretics per JNC guidelines"],
    "contexts": [["JNC 8 guidelines recommend..."]],
    "ground_truth": ["First-line agents are thiazide diuretics, ACE inhibitors..."]
}
dataset = Dataset.from_dict(data)

results = evaluate(
    dataset,
    metrics=[faithfulness, answer_relevancy, context_precision, context_recall]
)
print(results)
Practical Evaluation Workflow for Public Health LLM Applications
Step-by-Step LLM Evaluation Protocol
Step 1: Define the task and success criteria - What specific task is the LLM performing? (summarization, Q&A, content generation) - What constitutes “good enough” performance? - What errors are acceptable vs. unacceptable?
Step 2: Select appropriate evaluation metrics
| Task | Primary Metrics | Secondary Metrics |
|---|---|---|
| Question answering | Factual accuracy, hallucination rate | Completeness, coherence |
| Summarization | BERTScore, ROUGE, expert rating | Comprehensiveness, conciseness |
| Content generation | Expert quality rating, safety assessment | Readability, bias audit |
| Classification (with LLM) | Accuracy, F1, Cohen’s kappa vs. human | Consistency, prompt robustness |
Step 3: Create evaluation dataset - Size: Minimum 100 diverse test cases (300+ for production systems) - Coverage: Include common, rare, edge cases, and adversarial examples - Gold standards: Get expert annotations for subset (expensive but essential)
Step 4: Automated evaluation - Run automated metrics (BLEU, ROUGE, BERTScore) if applicable - Test hallucination detection (consistency checks, attribution verification) - Assess prompt sensitivity (paraphrase robustness)
Step 5: Human expert evaluation - Recruit 2-3 domain experts - Use structured rating scales - Calculate inter-rater reliability - Discuss disagreements to refine criteria
Step 6: Safety and bias audit - Red-team testing (try to elicit harmful outputs) - Test across demographic variations - Evaluate edge cases and out-of-distribution inputs
Step 7: Continuous monitoring (post-deployment) - Sample outputs regularly for quality audit - Track user feedback and reported errors - Monitor for distribution shift (are questions changing over time?)
When NOT to Use LLMs (Evaluation Perspective)
Even well-evaluated LLMs are inappropriate for certain tasks:
❌ High-stakes decisions without human oversight - Diagnosis without physician confirmation - Treatment recommendations directly to patients - Triage decisions
❌ Tasks requiring real-time information - Current disease surveillance (unless using RAG with updated data) - Breaking public health emergencies
❌ Precise calculations - Drug dosing calculations (use rule-based systems) - Statistical analysis (use traditional computational tools)
❌ Tasks where errors are catastrophic - Autonomous prescription writing - Automated emergency response
Comparison Table: Traditional ML vs. LLM Evaluation
The validation strategy determines how trustworthy your performance estimates are.
Internal Validation
Purpose: Estimate model performance on new data from the same source.
Critical limitation: Provides no evidence about performance on different populations, institutions, or time periods.
Method 1: Train-Test Split (Hold-Out Validation)
Procedure: 1. Randomly split data into training (70-80%) and test (20-30%) 2. Train model on training set 3. Evaluate on test set (one time only)
Advantages: - Simple and fast - Clear separation between training and testing
Disadvantages: - Single split can be unrepresentative (bad luck in random split) - Wastes data (test set not used for training) - High variance in performance estimate
When to use: Large datasets (>10,000 samples), quick experiments
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.2,    # 20% for testing
    random_state=42,  # Reproducible split
    stratify=y        # Maintain class balance
)
Method 2: K-Fold Cross-Validation
Procedure: 1. Divide data into K folds (typically 5 or 10) 2. For each fold: - Train on K-1 folds - Validate on remaining fold 3. Average performance across all K folds
Advantages: - Uses all data for both training and validation - More stable performance estimate (less variance) - Standard practice in machine learning
Disadvantages: - Computationally expensive (train K models) - Still no external validation
When to use: Moderate-sized datasets (1,000-10,000 samples), model selection
from sklearn.model_selection import cross_val_score

scores = cross_val_score(
    model, X, y,
    cv=5,              # 5-fold CV
    scoring='roc_auc'  # Metric to optimize
)
print(f"AUC-ROC: {scores.mean():.3f} (±{scores.std():.3f})")
Method 3: Stratified K-Fold Cross-Validation
Modification: Ensures each fold maintains the same class distribution as the full dataset.
Critical for imbalanced datasets (e.g., 5% disease prevalence).
Why it matters: Without stratification, some folds might have very few positive cases (or none!), leading to unstable estimates.
from sklearn.model_selection import StratifiedKFold

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(model, X, y, cv=skf, scoring='roc_auc')
Method 4: Time-Series Cross-Validation
For temporal data: Never train on future, test on past!
Procedure (expanding window):
Fold 1: Train [1:100] → Test [101:120]
Fold 2: Train [1:120] → Test [121:140]
Fold 3: Train [1:140] → Test [141:160]
...
Critical for: Epidemic forecasting, time-series prediction, any data with temporal structure
from sklearn.model_selection import TimeSeriesSplit

tscv = TimeSeriesSplit(n_splits=5)
for train_idx, test_idx in tscv.split(X):
    X_train, X_test = X[train_idx], X[test_idx]
    y_train, y_test = y[train_idx], y[test_idx]
    # Train and evaluate
Critical Considerations for Internal Validation
1. Data Leakage Prevention
Data leakage: Information from test set influencing training process.
Common sources:
❌ Feature engineering on entire dataset:
# WRONG: Standardize before splitting
X_scaled = StandardScaler().fit_transform(X)  # Uses mean/std from ALL data
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y)
# Test set info leaked into training!
✅ Feature engineering within train/test:
# CORRECT: Fit scaler on training only
X_train, X_test, y_train, y_test = train_test_split(X, y)
scaler = StandardScaler().fit(X_train)  # Learn from training only
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)  # Apply to test
2. Grouped Data and Cluster Leakage
Problem: If data has natural clusters (patients within hospitals, repeated measures within individuals), random splitting can lead to overoptimistic performance estimates.
Example: Patient has 5 hospitalizations. Random split → some hospitalizations in training, others in test. Model learns patient-specific patterns → overoptimistic performance.
Solution: Group K-Fold cross-validation ensures all samples from the same group stay together
from sklearn.model_selection import GroupKFold

# patient_ids: array indicating which patient each sample belongs to
gkf = GroupKFold(n_splits=5)
for train_idx, test_idx in gkf.split(X, y, groups=patient_ids):
    # All samples from same patient stay in same fold
    X_train, X_test = X[train_idx], X[test_idx]
External Validation: The Gold Standard
External validation: Testing on data from entirely different source, different institution(s), population, time period, or setting.
Why it matters:
Models often learn dataset-specific quirks that don’t generalize: - Hospital equipment signatures - Documentation practices - Patient population characteristics - Data collection protocols
Without external validation, you don’t know if model learned disease patterns or dataset artifacts.
Types of External Validation
1. Geographic External Validation
Design: - Train: Hospital A (or multiple hospitals in one region) - Test: Hospital B (or hospitals in different region)
What it tests: - Different patient demographics - Different clinical practices - Different data collection protocols - Different equipment (for imaging)
Example:McKinney et al., 2020, Nature - Google breast cancer AI trained on UK data, validated on US data (and vice versa). Performance dropped: UK→US AUC decreased from 0.889 to 0.858.
2. Temporal External Validation
Design: - Train: Data from 2015-2018 - Test: Data from 2019-2021
What it tests: - Temporal stability (concept drift) - Changes in disease patterns - Changes in clinical practice - Changes in data collection
Example:Davis et al., 2017, JAMIA - Clinical prediction models degrade over time; most models need recalibration after 2-3 years.
3. Setting External Validation
Design: - Train: Intensive care unit (ICU) data - Test: General ward data
What it tests: - Performance in different clinical settings - Generalization across disease severity spectra
Example: Sepsis models trained on ICU patients often fail on ward patients (different disease presentation, different monitoring intensity).
Tested on: - MIMIC-CXR: 377,110 X-rays from Beth Israel Deaconess Medical Center - PadChest: 160,000 X-rays from Hospital San Juan, Spain - CheXpert: 224,000 X-rays from Stanford Hospital
Results: - AUC-ROC ranged from 0.51 to 0.70 across sites (vs. 0.76 internal) - Poor calibration: predicted probabilities didn’t match observed frequencies - Explanation: Model learned to detect portable X-ray machines (used for sicker patients) rather than pneumonia itself
Lessons: 1. Internal validation dramatically overestimated performance 2. Single-institution data insufficient for generalizability claims 3. Models can learn spurious correlations specific to training site 4. External validation is essential before clinical deployment
See also:Zech et al., 2018, PLOS Medicine - “Variable generalization performance of a deep learning model to detect pneumonia in chest radiographs”
Prospective Validation: Real-World Testing
Prospective validation: Model deployed in actual clinical practice, evaluated in real-time.
Why it matters: Retrospective validation can’t capture: - How clinicians actually use (or ignore) model predictions - Workflow integration challenges - Alert fatigue and override patterns - Behavioral changes in response to predictions - Unintended consequences
Study Design 1: Silent Mode Deployment
Design: - Deploy model in background - Generate predictions but don’t show to clinicians - Compare predictions to actual outcomes (collected as usual)
Advantages: - Tests real-world data quality and distribution - No risk to patients (clinicians unaware of predictions) - Can assess performance before making decisions based on model
Disadvantages: - Doesn’t test impact on clinical decisions - Doesn’t assess workflow integration
Example:Tomašev et al., 2019, Nature - DeepMind AKI prediction initially deployed silently at VA hospitals to validate real-time performance before clinical integration.
Study Design 2: Randomized Controlled Trial (RCT)
Design: - Randomize: Patients, clinicians, or hospital units to: - Intervention: Model-assisted care - Control: Standard care (no model) - Measure: Clinical outcomes in both groups - Compare: Test if model improves outcomes
Advantages: - Strongest causal evidence for impact - Can establish cost-effectiveness - Meets regulatory/reimbursement standards
Disadvantages: - Expensive (often millions of dollars) - Time-consuming (months to years) - Requires large sample size - Ethical considerations (withholding potentially beneficial intervention)
Example:Semler et al., 2018, JAMA - SMART trial for sepsis management (not AI, but example of rigorous prospective design)
Study Design 3: Stepped-Wedge Design
Design: - Roll out model sequentially to different units/sites - Each unit serves as its own control (before vs. after) - Eventually all units receive intervention
Advantages: - More feasible than full RCT - All units eventually get intervention (addresses ethical concerns) - Within-unit comparisons reduce confounding
Disadvantages: - Temporal trends can confound results - Less rigorous than RCT (no contemporaneous control group)
Example: Common in health system implementations where full RCT infeasible.
Study Design 4: A/B Testing
Design: - Randomly assign users to model-assisted vs. control in real-time - Continuously measure outcomes - Iterate rapidly based on results
Advantages: - Rapid experimentation - Can test multiple model versions - Common in tech industry
Challenges in healthcare: - Ethical concerns (different care for similar patients) - Regulatory considerations (IRB approval required) - Contamination (clinicians may share information)
Beyond Accuracy: Clinical Utility Assessment
Critical insight: A model can be statistically accurate but clinically useless.
Example: - Model predicts hospital mortality with AUC-ROC = 0.85 - But: If it doesn’t change clinical decisions or improve outcomes, what’s the value? - Moreover: If implementing it disrupts workflow or generates alert fatigue, net impact may be negative.
The Clinical Utility Question
Before deploying any clinical AI:
Does it change decisions?
Do those changed decisions improve outcomes?
Is the improvement worth the cost (financial, workflow disruption, alert burden)?
If you can’t answer “yes” to all three, don’t deploy.
Decision Curve Analysis (DCA)
Purpose: Assess the clinical net benefit of using a prediction model compared to alternative strategies.
Concept: A model is clinically useful only if using it leads to better decisions than: - Treating everyone - Treating no one - Using clinical judgment alone
How Decision Curve Analysis Works
For each possible risk threshold \(p_t\) (e.g., “treat if risk >10%”), calculate the net benefit:
\[\text{Net Benefit} = \frac{TP}{N} - \frac{FP}{N} \times \frac{p_t}{1 - p_t}\]
Where: - \(TP/N\) = True positive rate (benefit from correctly treating disease) - \(FP/N \times p_t/(1-p_t)\) = False positive rate, weighted by harm of unnecessary treatment
Interpretation: - If treating disease has high benefit relative to harm of unnecessary treatment → lower \(p_t\) threshold - If treating disease has low benefit relative to harm → higher \(p_t\) threshold
Weight \(p_t/(1-p_t)\): Reflects how much we weight false positives. - At \(p_t\) = 0.10: Weight = 0.10/0.90 ≈ 0.11 (FP weighted 1/9 as much as TP) - At \(p_t\) = 0.50: Weight = 0.50/0.50 = 1.00 (FP and TP equally weighted)
DCA Plot and Interpretation
Create DCA plot: - X-axis: Threshold probability (risk at which you’d intervene) - Y-axis: Net benefit - Plot curves for: - Model: Net benefit using model predictions - Treat all: Net benefit if everyone treated - Treat none: Net benefit if no one treated (= 0)
Interpretation: - Model is useful where its curve is above both “treat all” and “treat none” - Higher net benefit = better clinical value - Range of thresholds where model useful = decision curve clinical range
Example interpretation:
At 15% risk threshold: - Model NB = 0.12 - Treat all NB = 0.05 - Treat none NB = 0.00
Meaning: Using model at 15% threshold is equivalent to correctly treating 12 out of 100 patients with no false positives, compared to only 5 for “treat all” strategy.
Python implementation:
import numpy as np
import matplotlib.pyplot as plt

def calculate_net_benefit(y_true, y_pred_proba, thresholds):
    """Calculate net benefit across thresholds for decision curve analysis"""
    net_benefits = []
    for threshold in thresholds:
        # Classify based on threshold
        y_pred = (y_pred_proba >= threshold).astype(int)

        # Calculate TP and FP counts
        TP = ((y_pred == 1) & (y_true == 1)).sum()
        FP = ((y_pred == 1) & (y_true == 0)).sum()
        N = len(y_true)

        # Net benefit formula
        nb = (TP / N) - (FP / N) * (threshold / (1 - threshold))
        net_benefits.append(nb)
    return np.array(net_benefits)

# Calculate for model, treat all, treat none
thresholds = np.linspace(0.01, 0.99, 100)
nb_model = calculate_net_benefit(y_test, y_pred_proba, thresholds)
nb_treat_all = y_test.mean() - (1 - y_test.mean()) * (thresholds / (1 - thresholds))
nb_treat_none = np.zeros_like(thresholds)

# Plot decision curve
plt.figure(figsize=(10, 6))
plt.plot(thresholds, nb_model, label='Model', linewidth=2)
plt.plot(thresholds, nb_treat_all, label='Treat All', linestyle='--', linewidth=2)
plt.plot(thresholds, nb_treat_none, label='Treat None', linestyle=':', linewidth=2)
plt.xlabel('Threshold Probability', fontsize=12)
plt.ylabel('Net Benefit', fontsize=12)
plt.title('Decision Curve Analysis', fontsize=14, fontweight='bold')
plt.legend()
plt.grid(alpha=0.3)
plt.xlim(0, 0.5)  # Focus on clinically relevant range
plt.show()
Purpose: Quantify whether new model improves risk stratification compared to existing approach.
Context: You have an existing risk model (or clinical judgment). New model proposed. Does it reclassify patients into more appropriate risk categories?
Net Reclassification Improvement (NRI)
Concept: Among events (people with disease), what proportion correctly moved to higher risk? Among non-events, what proportion correctly moved to lower risk?
Integrated Discrimination Improvement (IDI)
Interpretation: - How much does new model increase separation between events and non-events? - IDI > 0: Better discrimination - Less sensitive to arbitrary cut-points than NRI
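A minimal sketch of category-based NRI and IDI, using hypothetical risk predictions from an old and a new model; the risk-category cutoffs and the simulated data are illustrative only.

```python
import numpy as np

def categorical_nri(y, old_risk, new_risk, cutoffs=(0.10, 0.20)):
    """Category-based Net Reclassification Improvement.
    cutoffs are illustrative risk-category boundaries (<10%, 10-20%, >20%)."""
    old_cat = np.digitize(old_risk, cutoffs)
    new_cat = np.digitize(new_risk, cutoffs)
    up, down = new_cat > old_cat, new_cat < old_cat
    events, nonevents = (y == 1), (y == 0)
    nri_events = up[events].mean() - down[events].mean()
    nri_nonevents = down[nonevents].mean() - up[nonevents].mean()
    return nri_events + nri_nonevents, nri_events, nri_nonevents

def idi(y, old_risk, new_risk):
    """Integrated Discrimination Improvement: change in mean risk separation
    between events and non-events."""
    events, nonevents = (y == 1), (y == 0)
    return ((new_risk[events].mean() - new_risk[nonevents].mean())
            - (old_risk[events].mean() - old_risk[nonevents].mean()))

# Simulated example: the "new" model separates events from non-events slightly better
rng = np.random.default_rng(7)
y = rng.binomial(1, 0.15, 1000)
old_risk = np.clip(0.15 + 0.10 * rng.normal(size=1000) + 0.10 * y, 0, 1)
new_risk = np.clip(0.15 + 0.10 * rng.normal(size=1000) + 0.20 * y, 0, 1)

total, nri_e, nri_ne = categorical_nri(y, old_risk, new_risk)
print(f"NRI = {total:.3f} (events {nri_e:.3f}, non-events {nri_ne:.3f})")
print(f"IDI = {idi(y, old_risk, new_risk):.3f}")
```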
Fairness and Equity in Evaluation
AI systems can exhibit disparate performance across demographic groups, even when overall performance appears strong.
The Fairness Imperative
Failure to assess fairness can: - Perpetuate or amplify existing health disparities - Result in differential quality of care based on race, gender, socioeconomic status - Violate ethical principles of justice and equity - Expose organizations to legal liability
Assessing fairness is not optional. It’s essential.
Mathematical Definitions of Fairness
Challenge: Multiple, often conflicting, definitions of fairness exist.
1. Demographic Parity (Statistical Parity)
Definition: Positive prediction rates equal across groups
\[P(\hat{Y}=1 | A=0) = P(\hat{Y}=1 | A=1)\]
where \(A\) = protected attribute (e.g., race, gender)
Example: Model predicts high risk for 20% of White patients and 20% of Black patients
When appropriate: - Resource allocation (equal access to interventions) - Contexts where base rates should be equal
Problem: Ignores actual outcome rates. If disease prevalence differs between groups (due to structural factors), enforcing demographic parity may reduce overall accuracy.
2. Equalized Odds (Equal Opportunity)
Definition: True positive and false positive rates equal across groups
\[P(\hat{Y}=1 | Y=y, A=0) = P(\hat{Y}=1 | Y=y, A=1) \quad \text{for } y \in \{0, 1\}\]
If base rates differ between groups, you cannot simultaneously satisfy: 1. Calibration 2. Equalized odds 3. Predictive parity
Implication: Must choose which fairness criterion to prioritize based on context and values.
For healthcare: Calibration typically most important (want predicted probabilities to mean the same thing across groups).
Practical Fairness Assessment
Step-by-Step Fairness Audit
Step 1: Define Protected Attributes
Identify characteristics that should not influence model performance: - Race/ethnicity - Gender/sex - Age - Socioeconomic status (income, insurance, ZIP code) - Language - Disability status
May be due to structural factors (e.g., environmental exposures)
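Before choosing mitigation strategies (Step 6 below), the audit requires stratified performance metrics for each protected group. A minimal sketch, assuming a hypothetical predictions dataframe whose column names (y_true, y_pred, and the group column) are illustrative:

```python
import pandas as pd
from sklearn.metrics import confusion_matrix

def subgroup_metrics(df, group_col, y_col="y_true", pred_col="y_pred"):
    """Stratified error rates by protected attribute (column names are illustrative)."""
    rows = []
    for group, sub in df.groupby(group_col):
        tn, fp, fn, tp = confusion_matrix(sub[y_col], sub[pred_col], labels=[0, 1]).ravel()
        rows.append({
            group_col: group,
            "n": len(sub),
            "sensitivity": tp / (tp + fn) if (tp + fn) else float("nan"),
            "specificity": tn / (tn + fp) if (tn + fp) else float("nan"),
            "false_positive_rate": fp / (fp + tn) if (fp + tn) else float("nan"),
            "ppv": tp / (tp + fp) if (tp + fp) else float("nan"),
        })
    return pd.DataFrame(rows)

# Usage with a hypothetical predictions dataframe containing y_true, y_pred, and race:
# print(subgroup_metrics(predictions_df, group_col="race"))
```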
Step 6: Mitigation Strategies
Pre-processing (adjust training data): - Increase representation of underrepresented groups (oversampling, synthetic data) - Re-weight samples to balance groups - Remove or transform biased features
In-processing (modify algorithm): - Add fairness constraints during training - Adversarial debiasing (penalize predictions that reveal protected attribute) - Multi-objective optimization (accuracy + fairness)
Post-processing (adjust predictions): - Separate thresholds per group to achieve equalized odds - Calibration adjustment per group - Reject option classification (defer to human for uncertain cases)
Structural interventions: - Address root causes (improve data collection for underrepresented groups) - Partner with communities to ensure appropriate representation - Consider whether model should be deployed if disparities cannot be adequately mitigated
For a comprehensive fairness toolkit, see Fairlearn by Microsoft.
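As one concrete illustration of the post-processing idea above, the sketch below chooses a separate threshold per group so that each group reaches roughly the same sensitivity. This is a deliberate simplification with hypothetical inputs; a full equalized-odds post-processor (such as Fairlearn's ThresholdOptimizer) also balances false positive rates.

```python
import numpy as np

def thresholds_for_target_tpr(y_true, y_pred_proba, group, target_tpr=0.80):
    """Pick a per-group probability threshold that achieves approximately the target sensitivity."""
    y_true, y_pred_proba, group = map(np.asarray, (y_true, y_pred_proba, group))
    thresholds = {}
    for g in np.unique(group):
        event_scores = y_pred_proba[(group == g) & (y_true == 1)]
        # The (1 - target_tpr) quantile of event scores flags ~target_tpr of that group's events
        thresholds[g] = np.quantile(event_scores, 1 - target_tpr)
    return thresholds
```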
Landmark Bias Case Study
Case Study: Racial Bias in Healthcare Risk Algorithm
Context: - Commercial algorithm used by major US health systems to identify high-risk patients for care management programs - Affected millions of patients nationwide
The Algorithm: - Predicted future healthcare costs as proxy for healthcare needs - Used to determine eligibility for high-touch care management
The Bias Discovered:
Black patients had: - 26% more chronic conditions than White patients at same risk score - Lower predicted costs despite being sicker
The mechanism: - Algorithm used healthcare costs as outcome label - Black patients historically received less care due to systemic barriers - Less care → lower costs → model learned “Black = lower cost = healthier” - Result: At same risk score, Black patients were sicker than White patients
Impact: - To qualify for care management, Black patients needed to be sicker than White patients - Black patients at 97th percentile of risk score had similar medical needs as White patients at 85th percentile - Reduced access to care management programs for Black patients
Solution: - Re-label using direct measures of health need (number of chronic conditions, biomarkers) instead of costs - Result: Reduced bias by 84%
Lessons:
Outcome label choice is critical: using healthcare utilization as a proxy for need embeds systemic bias

Overall accuracy can mask subgroup disparities: the algorithm performed well on average

Historical bias propagates: the model learned from biased past care patterns

Evaluate across subgroups: disparities are invisible without stratified analysis

Audit deployed systems: this was a production system, not a research study
Adversarial Robustness and Security Evaluation

Traditional evaluation assumes benign inputs. But deployed models face: - Natural perturbations: Image quality variation, data entry errors, equipment differences - Adversarial attacks: Malicious manipulation to cause misclassification - Out-of-distribution inputs: Cases far from training data
2025 context: EU AI Act mandates robustness and cybersecurity testing for high-risk medical AI systems.
Security is a Patient Safety Issue
Example scenarios: - Hospital ransomware attack compromises AI model integrity - Malicious actor manipulates medical imaging to hide cancer - Data poisoning during model retraining introduces systematic errors
Unlike traditional software vulnerabilities (which can be patched), ML models can be permanently corrupted or subtly manipulated without obvious signs.
Types of Adversarial Threats
1. Evasion Attacks (Inference-Time)
Goal: Manipulate input to cause misclassification without changing ground truth.
Example: - Add imperceptible noise to chest X-ray → Model misses pneumonia - Modify patient vital signs slightly → Sepsis prediction model fails to alert
Medical relevance: - Natural occurrence: Image compression, scanner differences can mimic adversarial perturbations - Malicious: Rare but theoretically possible (e.g., insurance fraud, medicolegal manipulation)
2. Poisoning Attacks (Training-Time)
Goal: Corrupt training data to degrade model performance or introduce backdoors.
Example: - Insert mislabeled images into training set → Model learns incorrect patterns - Add trigger patterns → Model fails only for specific subgroups
Medical relevance: - Multi-institutional data sharing: If one site’s data is compromised, all participants affected - Crowdsourced labels: If annotations are maliciously manipulated
3. Model Extraction/Stealing
Goal: Query model repeatedly to reverse-engineer its parameters.
Risk: Intellectual property theft, creating surrogate model for further attacks
Evaluating Robustness
Method 1: Input Perturbation Testing
Approach: Systematically perturb inputs and measure performance degradation.
For medical imaging:
import numpy as np
from skimage.util import random_noise

def test_noise_robustness(model, test_images, test_labels, noise_levels):
    """
    Test model robustness to image noise

    Args:
        model: Trained classification model
        test_images: Clean test images
        test_labels: Ground truth labels
        noise_levels: List of noise standard deviations to test

    Returns:
        Dictionary of accuracy at each noise level
    """
    results = {}

    # Baseline (no noise)
    baseline_acc = model.evaluate(test_images, test_labels)[1]
    results['baseline'] = baseline_acc

    # Test each noise level
    for sigma in noise_levels:
        noisy_images = np.array([
            random_noise(img, mode='gaussian', var=sigma**2)
            for img in test_images
        ])
        noisy_acc = model.evaluate(noisy_images, test_labels)[1]
        results[f'sigma_{sigma}'] = noisy_acc

        degradation = baseline_acc - noisy_acc
        print(f"Noise σ={sigma:.3f}: Accuracy={noisy_acc:.3f} "
              f"(degradation: {degradation:.3f})")

    return results

# Example usage
noise_levels = [0.01, 0.05, 0.10, 0.20]
robustness_results = test_noise_robustness(
    model, test_images, test_labels, noise_levels
)

# Acceptable degradation threshold
if robustness_results['sigma_0.05'] < 0.85 * robustness_results['baseline']:
    print("⚠️ WARNING: Model performance degrades >15% with minor noise")
    print("→ Consider: Data augmentation, robust training, ensemble methods")
For tabular data (clinical variables):
import numpy as np
from sklearn.metrics import roc_auc_score

def test_feature_perturbation_robustness(model, X_test, y_test, perturbation_fraction=0.05):
    """
    Test robustness to small perturbations in continuous features

    Args:
        model: Trained model
        X_test: Test features (pandas DataFrame)
        y_test: Test labels
        perturbation_fraction: Fraction of each feature's variability to perturb

    Returns:
        Robustness metrics
    """
    # Baseline performance
    y_pred_baseline = model.predict_proba(X_test)[:, 1]
    auc_baseline = roc_auc_score(y_test, y_pred_baseline)

    # Perturb continuous features
    X_perturbed = X_test.copy()
    continuous_cols = X_test.select_dtypes(include=[np.number]).columns

    for col in continuous_cols:
        # Add random noise proportional to the feature's standard deviation
        noise = np.random.normal(0, perturbation_fraction * X_test[col].std(),
                                 size=len(X_test))
        X_perturbed[col] = X_test[col] + noise

    # Evaluate perturbed performance
    y_pred_perturbed = model.predict_proba(X_perturbed)[:, 1]
    auc_perturbed = roc_auc_score(y_test, y_pred_perturbed)

    # Prediction consistency
    prediction_changes = np.mean(
        (y_pred_baseline > 0.5) != (y_pred_perturbed > 0.5)
    )

    print(f"Baseline AUC: {auc_baseline:.3f}")
    print(f"Perturbed AUC: {auc_perturbed:.3f}")
    print(f"AUC degradation: {auc_baseline - auc_perturbed:.3f}")
    print(f"Prediction changes: {prediction_changes:.1%}")

    return {
        'auc_baseline': auc_baseline,
        'auc_perturbed': auc_perturbed,
        'prediction_change_rate': prediction_changes
    }

# Example
results = test_feature_perturbation_robustness(model, X_test, y_test,
                                               perturbation_fraction=0.05)

if results['prediction_change_rate'] > 0.10:
    print("⚠️ >10% of predictions change with 5% feature noise")
    print("→ Model may be overfitting to noise rather than signal")
Method 2: Adversarial Attack Testing
Fast Gradient Sign Method (FGSM) - Basic adversarial attack:
import numpy as np
import tensorflow as tf

def fgsm_attack(model, image, label, epsilon=0.01):
    """
    Generate adversarial example using Fast Gradient Sign Method

    Args:
        model: Trained model
        image: Input image
        label: True label
        epsilon: Perturbation magnitude

    Returns:
        Adversarial image
    """
    image = tf.cast(image, tf.float32)

    with tf.GradientTape() as tape:
        tape.watch(image)
        prediction = model(image)
        loss = tf.keras.losses.sparse_categorical_crossentropy(label, prediction)

    # Get gradient of loss w.r.t. image
    gradient = tape.gradient(loss, image)

    # Create adversarial image by stepping in the direction of the gradient sign
    signed_grad = tf.sign(gradient)
    adversarial_image = image + epsilon * signed_grad
    adversarial_image = tf.clip_by_value(adversarial_image, 0, 1)

    return adversarial_image

# Evaluate adversarial robustness
def evaluate_adversarial_robustness(model, test_images, test_labels,
                                    epsilons=[0.0, 0.01, 0.05, 0.10]):
    """
    Test model robustness to FGSM attacks at different perturbation levels
    """
    results = {}

    for eps in epsilons:
        correct = 0
        total = 0

        for img, label in zip(test_images, test_labels):
            # Generate adversarial example
            adv_img = fgsm_attack(model, img[np.newaxis, ...], label, epsilon=eps)

            # Predict
            pred = model.predict(adv_img)
            pred_class = np.argmax(pred)

            if pred_class == label:
                correct += 1
            total += 1

        accuracy = correct / total
        results[eps] = accuracy
        print(f"Epsilon={eps:.3f}: Accuracy={accuracy:.3f}")

    return results

# Run evaluation
adv_results = evaluate_adversarial_robustness(model, test_images, test_labels)

# Alert if significant degradation
if adv_results[0.05] < 0.70 * adv_results[0.0]:
    print("🚨 CRITICAL: Model highly vulnerable to adversarial attacks")
    print("→ Implement: Adversarial training, input validation, ensemble methods")
Method 3: Out-of-Distribution (OOD) Detection
Goal: Identify when inputs are unlike training data (model should abstain or flag uncertainty).
import numpy as np
from sklearn.metrics import roc_auc_score

def evaluate_ood_detection(model, in_dist_data, ood_data):
    """
    Evaluate model's ability to detect out-of-distribution inputs

    Args:
        model: Trained model
        in_dist_data: In-distribution test data
        ood_data: Out-of-distribution data

    Returns:
        OOD detection performance metrics
    """
    # Get prediction confidence (max probability) for each dataset
    in_dist_preds = model.predict(in_dist_data)
    in_dist_confidence = np.max(in_dist_preds, axis=1)

    ood_preds = model.predict(ood_data)
    ood_confidence = np.max(ood_preds, axis=1)

    # Combine labels (1 = in-distribution, 0 = OOD)
    y_true = np.concatenate([
        np.ones(len(in_dist_confidence)),
        np.zeros(len(ood_confidence))
    ])

    # Confidence scores (higher = more likely in-distribution)
    confidence_scores = np.concatenate([in_dist_confidence, ood_confidence])

    # Calculate AUROC for OOD detection
    auroc = roc_auc_score(y_true, confidence_scores)

    print(f"OOD Detection AUROC: {auroc:.3f}")
    print(f"In-dist mean confidence: {in_dist_confidence.mean():.3f}")
    print(f"OOD mean confidence: {ood_confidence.mean():.3f}")

    if auroc < 0.80:
        print("⚠️ Poor OOD detection - model overconfident on unfamiliar inputs")
        print("→ Consider: Temperature scaling, Bayesian approaches, ensemble uncertainty")

    return auroc

# Example: Test on a different medical image dataset
ood_auroc = evaluate_ood_detection(
    model,
    in_dist_data=chest_xray_test,    # Data from same hospitals as training
    ood_data=external_site_data      # Data from a completely different hospital/scanner
)
Robustness Improvement Strategies
1. Data Augmentation: - Train on varied/augmented data (rotations, brightness changes, noise) - Forces model to learn invariant features
2. Adversarial Training: - Include adversarial examples in training set - Trade-off: May slightly reduce clean accuracy
3. Ensemble Methods: - Multiple models often more robust than single model - Harder to fool all models simultaneously
4. Input Validation: - Reject inputs that are outliers (OOD detection) - Flag unusual patterns for human review
5. Certified Defenses: - Provide mathematical guarantees of robustness - Advanced, computationally expensive
Practical Robustness Evaluation Protocol
Robustness Testing Checklist
Minimum requirements (all deployed models): - [ ] Natural perturbation testing: Test with realistic variations (noise, missing data, equipment differences) - [ ] Prediction stability: Measure how often predictions change with small input perturbations (should be <5-10%) - [ ] Out-of-distribution detection: Model should flag or have low confidence on unfamiliar inputs
Recommended (high-risk models): - [ ] Adversarial attack testing: Evaluate vulnerability to FGSM, PGD attacks - [ ] Multi-site robustness: Validate performance across diverse sites/equipment - [ ] Ablation studies: Test performance when features are missing or corrupted
Advanced (critical systems, regulatory requirements): - [ ] Certified robustness: Provide formal guarantees for critical use cases - [ ] Red-team exercise: Security experts attempt to break the model - [ ] Continuous monitoring: Track input distribution shifts, flag anomalies
Security Best Practices
1. Model Access Control: - Limit API access to authenticated users - Rate limiting to prevent model extraction attacks
2. Input Sanitization: - Validate inputs are within expected ranges - Reject clearly anomalous inputs
3. Monitoring and Logging: - Log all predictions and inputs - Monitor for unusual query patterns (potential attacks)
4. Model Versioning and Rollback: - Maintain ability to revert to previous model if compromise detected
Case Study: Adversarial Examples Against Medical Imaging Classifiers

Study: Added imperceptible perturbations to medical images (chest X-rays, fundus photos, dermatology images)
Results: - Successfully fooled state-of-the-art deep learning classifiers - Adversarial examples transferable across models (attack one model, affects others) - Small perturbations caused dramatic misclassifications
Implications: - Medical AI models are vulnerable to adversarial attacks - Robustness testing should be mandatory for deployed systems - Both accidental (natural variations) and malicious perturbations are risks
Counterpoint: No documented real-world malicious attacks on medical AI systems (yet), but accidental distribution shifts are common (equipment changes, protocol updates).
Key Takeaways: Adversarial Robustness
Robustness is mandatory: EU AI Act requires adversarial robustness testing for high-risk medical AI
Two threat models: Natural perturbations (common) vs. adversarial attacks (rare but possible)
Implementation Outcomes

1. Acceptability

Definition: Perception that the system is agreeable/satisfactory
Measures: - User satisfaction surveys (Likert scales, Net Promoter Score) - Qualitative interviews (what do users like/dislike?) - Perceived usefulness and ease of use
Example questions: - “This system improves my clinical decision-making” (1-5 scale) - “I would recommend this system to colleagues” (yes/no)
2. Adoption
Definition: Intention/action to use the system
Measures: - Utilization rate (% of eligible cases where system used) - Number of users who have activated/logged in - Time to initial use
Red flag: Low adoption despite availability suggests problems with acceptability, workflow fit, or perceived utility.
3. Appropriateness
Definition: Perceived fit for setting/population/problem
Measures: - Stakeholder perception surveys - Alignment with clinical workflows (workflow mapping) - Relevance to clinical questions
Example: ICU mortality prediction may be appropriate for ICU but inappropriate for outpatient clinic.
5. Fidelity

Definition: Degree to which the system is used as designed
Measures: - Override rates (how often do clinicians dismiss alerts?) - Deviation from intended use (using for wrong purpose) - Workarounds (users circumventing system)
High override rates signal problems: - Too many false positives (alert fatigue) - Predictions don’t match clinical judgment (trust issues) - Workflow disruption (alerts at wrong time)
6. Penetration
Definition: Integration across settings/populations
Measures: - Number of sites/units using system - Proportion of target population reached - Geographic spread
7. Sustainability
Definition: Continued use over time
Measures: - Retention of users over 6-12 months - Model updating/maintenance plan - Long-term performance monitoring
Common failure: “pilot-itis”, where a successful pilot is not sustained after the initial implementation period.
Why it matters: A 2017 study of clinical prediction models found that most require recalibration within 2-3 years due to changes in patient populations, treatment patterns, and data collection practices.
MLOps and Continuous Model Monitoring
Moving beyond “deploy and hope”: Modern AI systems require continuous monitoring to detect performance degradation and trigger timely interventions.
Post-Deployment Monitoring is Not Optional
The FDA’s 2021 Action Plan on AI/ML-based Software as a Medical Device emphasizes continuous monitoring as a core requirement. Models that don’t monitor performance drift pose patient safety risks.
Real-world failure: IBM Watson for Oncology was deployed at multiple institutions but reportedly produced unsafe treatment recommendations that went unrecognized for years due to inadequate monitoring.
Types of Drift
Understanding the type of drift helps determine appropriate interventions.
1. Data Drift (Covariate Shift)
Definition: Input feature distributions change, but the relationship between features and outcome remains stable.
Example: - Training data (2019): Average patient age = 55, BMI = 27 - Production data (2024): Average patient age = 62, BMI = 31 - Relationship unchanged: Diabetes risk per BMI unit = same
Impact: Model may become miscalibrated (predicted probabilities off) even if discrimination (AUC-ROC) stays stable.
Detection: Compare feature distributions between training and production data.
2. Concept Drift
Definition: The relationship between features and outcome changes.
Example: - Pre-COVID (2019): Fever + cough + dyspnea → Likely bacterial pneumonia - During COVID (2020): Fever + cough + dyspnea → Likely COVID-19 - Same features, different outcome
Impact: Model discrimination (AUC-ROC) degrades significantly.
3. Label Drift

Definition: The outcome itself shifts, because its definition or its base rate changes over time.

Impact: Predicted probabilities may be systematically too low or too high (calibration degrades).
Detection: Compare outcome rates over time.
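A minimal sketch of that comparison, using hypothetical event counts and a simple two-proportion z-test:

```python
import numpy as np
from scipy.stats import norm

def outcome_rate_shift(baseline_events, baseline_n, recent_events, recent_n, alpha=0.01):
    """Two-proportion z-test: has the outcome rate moved since training?"""
    p1, p2 = baseline_events / baseline_n, recent_events / recent_n
    pooled = (baseline_events + recent_events) / (baseline_n + recent_n)
    se = np.sqrt(pooled * (1 - pooled) * (1 / baseline_n + 1 / recent_n))
    z = (p2 - p1) / se
    p_value = 2 * (1 - norm.cdf(abs(z)))
    return p1, p2, p_value, p_value < alpha

# Hypothetical example: 8% event rate at training time vs. 11% in the latest quarter
print(outcome_rate_shift(800, 10_000, 330, 3_000))
```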
Practical Drift Detection Methods
Method 1: Statistical Tests for Feature Distribution Changes
Kolmogorov-Smirnov (KS) Test:
Tests whether two distributions are significantly different.
import numpy as np
from scipy.stats import ks_2samp

# Example: Monitoring patient age distribution

# Training data (historical)
age_train = np.random.normal(55, 15, 1000)   # Mean=55, SD=15

# Production data (current month)
age_prod = np.random.normal(62, 15, 500)     # Mean shifted to 62

# Perform KS test
statistic, p_value = ks_2samp(age_train, age_prod)

print(f"KS statistic: {statistic:.3f}")
print(f"P-value: {p_value:.4f}")

if p_value < 0.01:
    print("⚠️ ALERT: Significant distribution shift detected!")
    print("→ Review model calibration and consider retraining")
else:
    print("✓ No significant drift detected")

# Interpretation:
# p < 0.01: Strong evidence of distribution change
# p < 0.05: Moderate evidence of distribution change
# p ≥ 0.05: No significant change detected
When to use: Continuous features (age, lab values, vital signs)
Chi-Square Test (for categorical features):
import pandas as pd
from scipy.stats import chi2_contingency

# Example: Monitoring sex distribution

# Training data
train_counts = {"Male": 600, "Female": 400}   # 60% male

# Production data (current month)
prod_counts = {"Male": 250, "Female": 250}    # 50% male

# Create contingency table
contingency_table = pd.DataFrame({
    'Train': [train_counts['Male'], train_counts['Female']],
    'Production': [prod_counts['Male'], prod_counts['Female']]
}, index=['Male', 'Female'])

chi2, p_value, dof, expected = chi2_contingency(contingency_table)

print(f"Chi-square statistic: {chi2:.3f}")
print(f"P-value: {p_value:.4f}")

if p_value < 0.01:
    print("⚠️ ALERT: Significant distribution shift in sex distribution!")
When to use: Categorical features (sex, race, diagnostic codes)
Method 2: Population Stability Index (PSI)
Widely used in industry for monitoring feature drift.
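PSI compares the binned distribution of a feature in the training (expected) data against production (actual) data. A minimal sketch, reusing the hypothetical `age_train` / `age_prod` arrays from the KS example above:

```python
import numpy as np

def population_stability_index(expected, actual, n_bins=10):
    """PSI = sum over bins of (actual% - expected%) * ln(actual% / expected%)."""
    # Decile bin edges taken from the training (expected) distribution
    edges = np.percentile(expected, np.linspace(0, 100, n_bins + 1))
    # Clip production values into the training range so extremes land in the end bins
    actual_clipped = np.clip(actual, edges[0], edges[-1])

    expected_pct = np.histogram(expected, bins=edges)[0] / len(expected)
    actual_pct = np.histogram(actual_clipped, bins=edges)[0] / len(actual)

    # Guard against empty bins before taking the log
    expected_pct = np.clip(expected_pct, 1e-6, None)
    actual_pct = np.clip(actual_pct, 1e-6, None)

    return float(np.sum((actual_pct - expected_pct) * np.log(actual_pct / expected_pct)))

# Rule of thumb used in this chapter: <0.1 stable, 0.1-0.25 moderate drift, >0.25 major drift
print(f"PSI for age: {population_stability_index(age_train, age_prod):.3f}")
```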
Method 3: Monitor the Prediction Distribution

Insight: Even without ground truth, you can monitor what the model is predicting.
Red flag patterns: - Sudden increase in high-risk predictions (model becoming overly sensitive) - Sudden decrease in high-risk predictions (model missing cases) - Bimodal distribution shifts (calibration degradation)
Example:
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import ks_2samp

# Monitor distribution of predicted probabilities over time

# Week 1 predictions (well-calibrated)
week1_preds = np.random.beta(2, 10, 1000)    # Mostly low risk

# Week 12 predictions (drift - more high-risk predictions)
week12_preds = np.random.beta(3, 7, 1000)    # Shifted higher

# Visualize
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 4))

ax1.hist(week1_preds, bins=50, alpha=0.7, label='Week 1', density=True)
ax1.hist(week12_preds, bins=50, alpha=0.7, label='Week 12', density=True)
ax1.set_xlabel('Predicted Probability')
ax1.set_ylabel('Density')
ax1.set_title('Prediction Distribution Shift')
ax1.legend()

# KS test to detect shift
ks_stat, p_val = ks_2samp(week1_preds, week12_preds)
ax2.text(0.5, 0.5,
         f'KS test p-value: {p_val:.4f}\n'
         + ('⚠️ Significant shift detected' if p_val < 0.01 else '✓ No significant shift'),
         ha='center', va='center', fontsize=14,
         bbox=dict(boxstyle='round', facecolor='wheat', alpha=0.5))
ax2.axis('off')

plt.tight_layout()
plt.show()
Automated Retraining Triggers
Goal: Define clear rules for when to retrain the model.
Retraining Decision Framework
IMMEDIATE retraining triggered by: - AUC-ROC drops >0.05 below baseline - Calibration error (Brier score) increases >0.05 - Safety event: Model missed critical case (e.g., sepsis death after negative prediction) - Feature drift: PSI > 0.25 for ≥3 critical features
SCHEDULED retraining triggered by: - 12 months since last training (routine maintenance) - AUC-ROC drops 0.02-0.05 below baseline (warning level) - PSI 0.1-0.25 for multiple features (moderate drift) - New data volume: ≥20% new samples since last training
HOLD retraining if: - All metrics within acceptable ranges - Recent retraining (< 3 months ago) - Insufficient new data (< 5% new samples)
Automated monitoring script:
def evaluate_retraining_need(current_auc, baseline_auc, psi_scores,
                             months_since_training, new_sample_fraction):
    """
    Automated decision system for model retraining

    Returns:
        - "IMMEDIATE": Retrain immediately
        - "SCHEDULED": Schedule retraining within 30 days
        - "MONITOR": Continue monitoring, no action needed
    """
    # Critical performance degradation
    if current_auc < baseline_auc - 0.05:
        return "IMMEDIATE", "AUC-ROC dropped >0.05"

    # Critical feature drift
    critical_drift_count = sum([psi > 0.25 for psi in psi_scores])
    if critical_drift_count >= 3:
        return "IMMEDIATE", f"{critical_drift_count} features with PSI > 0.25"

    # Routine maintenance schedule
    if months_since_training >= 12:
        return "SCHEDULED", "12-month routine retraining"

    # Moderate performance degradation
    if current_auc < baseline_auc - 0.02:
        return "SCHEDULED", "AUC-ROC dropped 0.02-0.05"

    # Moderate drift
    moderate_drift_count = sum([0.1 < psi < 0.25 for psi in psi_scores])
    if moderate_drift_count >= 4:
        return "SCHEDULED", f"{moderate_drift_count} features with moderate drift"

    # Significant new data
    if new_sample_fraction >= 0.20:
        return "SCHEDULED", f"{new_sample_fraction:.0%} new data available"

    return "MONITOR", "All metrics within acceptable ranges"

# Example usage
decision, reason = evaluate_retraining_need(
    current_auc=0.82,
    baseline_auc=0.85,
    psi_scores=[0.08, 0.15, 0.22, 0.18, 0.05],   # PSI for 5 key features
    months_since_training=8,
    new_sample_fraction=0.15
)

print(f"Decision: {decision}")
print(f"Reason: {reason}")
Continuous Learning Strategies
Two approaches:
1. Periodic Retraining (Safer, recommended for high-stakes)
Process:
Accumulate new data
Retrain model offline
Validate on hold-out set
A/B test against current model
Deploy if superior
Advantages: Controlled validation before deployment
Disadvantages: Model can drift between retraining cycles
2. Online (Continuous) Learning

Process: The model updates automatically as new data arrives, without a separate offline validation cycle.

Advantages: Adapts quickly to drift

Disadvantages: No controlled validation gate; data-quality problems or poisoned inputs can flow straight into production

Recommendation for healthcare: Use periodic retraining with trigger-based scheduling (not true online learning).
Real-World Example: Epic Sepsis Model Drift
Case study of monitoring failure:
Epic’s sepsis model was deployed at >100 hospitals but lacked adequate drift monitoring:
Problems identified: - No external validation before broad deployment - No continuous performance monitoring at deployment sites - No retraining protocol as patient populations changed
Result: Model drifted significantly, achieving only 33% sensitivity (missing 2/3 of sepsis cases) by the time external researchers evaluated it.
Lesson: Monitoring is not optional. It’s a patient safety requirement.
NannyML (one example of open-source monitoring tooling): - Performance estimation without ground truth - Confidence-based performance monitoring
Key Takeaways: Model Drift and Monitoring
All models drift: Performance degradation is inevitable, not exceptional
Types of drift matter: Data drift vs. concept drift require different interventions
Multiple detection methods: Use statistical tests (KS, Chi-square), PSI, and performance tracking simultaneously
Automated triggers: Define clear thresholds for retraining (don’t wait for catastrophic failure)
Continuous monitoring is mandatory: FDA and EU regulations increasingly require post-deployment monitoring
Periodic retraining > online learning: For healthcare, controlled validation is safer than continuous updates
Monitor before you have ground truth: Prediction distribution shifts can signal problems early
Document everything: Track what triggered retraining, what changed, and validation results
Framework reference: Davis et al., 2017, JAMIA: clinical prediction models degrade, and most need recalibration after 2-3 years.
Explainability and Interpretability (XAI)
Why Explainability Matters in Public Health AI
The trust problem: Surveys report that a large majority of clinicians (72% in one study) would not trust or act on predictions from “black box” AI systems they could not understand (Antoniadi et al., 2021; Markus et al., 2021).
Why interpretability is critical:
Clinical decision-making: Clinicians need to know why before they can decide whether to act
Debugging and validation: Explanations reveal spurious correlations and dataset biases
Regulatory requirements: FDA and EU AI Act increasingly mandate explainability for high-risk systems
Patient autonomy: Patients have a right to understand decisions affecting their health
Legal liability: “The algorithm said so” is not a defense in malpractice cases
The Accuracy-Interpretability Trade-off (A False Dichotomy?)
Traditional belief: Deep learning = high accuracy but uninterpretable; simpler models = lower accuracy but interpretable.
2024 reality: Post-hoc explainability methods (SHAP, attention mechanisms) make complex models interpretable without sacrificing accuracy. The choice is no longer binary.
Guideline: Start with the simplest model that meets performance requirements. If you need complex models, invest in robust explainability infrastructure.
Levels of Interpretability
Not all interpretability is equal. Different stakeholders need different levels of explanation.
1. Global Interpretability
Definition: Understanding the model’s overall behavior and decision logic.
Questions answered: - What features are most important overall? - How does the model generally make decisions? - Are there unexpected feature relationships?
2. Local Interpretability

Definition: Understanding why the model made a specific prediction for a specific patient.
Questions answered: - Why did the model predict this patient is high-risk? - Which patient characteristics drove this prediction? - What would need to change to alter the prediction?
3. Intrinsic Interpretability

Definition: Models that are inherently interpretable by design.
Examples: - Linear models: Each coefficient shows feature contribution - Decision trees: Follow the path to understand the decision - Rule-based systems: Explicit IF-THEN logic
When to use: When stakeholder trust is paramount and model performance requirements are modest.
Interpretability Methods: Practical Guide
Method 1: SHAP (SHapley Additive exPlanations)
What it is: A unified framework for interpreting model predictions based on game theory (Shapley values).
Why it’s powerful: - Model-agnostic: Works with any ML model (XGBoost, neural networks, etc.) - Theoretically grounded: Satisfies desirable properties (local accuracy, consistency) - Both global and local: Feature importance + individual predictions
Example: a local SHAP explanation as it might be presented to a clinician:

Patient 47: Sepsis Risk = 78%
Main drivers:
✓ Lactate 3.2 mmol/L (+0.35 risk contribution) ← **Primary concern**
✓ Temperature 39.1°C (+0.22)
✓ WBC 15,000/μL (+0.18)
✗ Normal BP 118/72 (-0.08) ← **Protective factor**
Interpretation: Elevated lactate is the strongest predictor.
Consider serial lactate monitoring and early fluid resuscitation.
SHAP Advantages and Limitations
Advantages: - Mathematically principled (satisfies local accuracy, missingness, consistency) - Works with any model architecture - Both global and local explanations - Handles feature interactions
Limitations: - Computational cost: Can be slow for large models/datasets (use TreeSHAP for tree models, faster) - Not causal: High SHAP value ≠ causal relationship (correlation still) - Assumes feature independence: Can give misleading results with highly correlated features
Best practices: - Use TreeSHAP for tree-based models (XGBoost, Random Forest): orders of magnitude (~1000x) faster - For neural networks, use DeepSHAP or KernelSHAP with background dataset sampling - Always validate explanations with domain experts (do they make clinical sense?)
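A minimal sketch with the shap library's TreeExplainer, assuming a fitted tree-based classifier named `model` and a feature DataFrame `X_test` as in the surrounding examples (return shapes vary by shap version and model type):

```python
import shap

# TreeSHAP: fast, exact Shapley values for tree ensembles (XGBoost, Random Forest, ...)
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X_test)
# Note: for some classifiers/shap versions this is a list with one array per class;
# in that case, pass shap_values[1] (positive class) to the plot below.

# Global view: which features matter most, and in which direction, across the test set
shap.summary_plot(shap_values, X_test)
```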
Method 2: LIME (Local Interpretable Model-agnostic Explanations)

What it is: Creates a simple, interpretable model (like linear regression) that approximates the complex model’s behavior locally around a specific prediction.
How it works: 1. Perturb the input (create similar but slightly different patients) 2. Get model predictions for perturbed inputs 3. Fit a simple linear model to these local predictions 4. Linear coefficients = feature importance for this prediction
When to use: - Need quick local explanations - SHAP is too computationally expensive - Want human-readable rules (“If lactate > 2 AND fever, then high risk”)
import lime
import lime.lime_tabular
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Simulate patient data for hospital readmission
np.random.seed(42)
n_patients = 1000

data = pd.DataFrame({
    'age': np.random.normal(68, 12, n_patients),
    'num_prior_admissions': np.random.poisson(2, n_patients),
    'length_of_stay': np.random.gamma(2, 2, n_patients),
    'num_medications': np.random.poisson(5, n_patients),
    'comorbidity_count': np.random.poisson(3, n_patients),
    'emergency_admission': np.random.binomial(1, 0.3, n_patients),
})

# Create readmission outcome
readmit_risk = (
    0.02 * data['age']
    + 0.15 * data['num_prior_admissions']
    + 0.05 * data['comorbidity_count']
    + 0.1 * data['emergency_admission']
    + np.random.normal(0, 0.5, n_patients)
)
data['readmitted_30d'] = (readmit_risk > 2).astype(int)

# Train model
X = data.drop('readmitted_30d', axis=1)
y = data['readmitted_30d']
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

# ===== LIME EXPLANATION =====

# 1. Create LIME explainer
explainer = lime.lime_tabular.LimeTabularExplainer(
    training_data=X_train.values,
    feature_names=X_train.columns.tolist(),
    class_names=['No Readmission', 'Readmission'],
    mode='classification',
    random_state=42
)

# 2. Explain a specific patient
patient_idx = 5
patient_data = X_test.iloc[patient_idx].values
predicted_proba = model.predict_proba([patient_data])[0]

print(f"=== Patient {patient_idx} ===")
print(f"Predicted readmission probability: {predicted_proba[1]:.2%}")
print(f"Actual outcome: {'Readmitted' if y_test.iloc[patient_idx] == 1 else 'Not readmitted'}")

# Generate explanation
explanation = explainer.explain_instance(
    data_row=patient_data,
    predict_fn=model.predict_proba,
    num_features=6
)

# 3. Display explanation
print("\n=== LIME Explanation ===")
print("Feature contributions to 'Readmission' class:")
for feature, weight in explanation.as_list():
    print(f"  {feature}: {weight:+.3f}")

# 4. Visualize
explanation.show_in_notebook(show_table=True)

# Save as HTML
explanation.save_to_file('lime_explanation_patient5.html')

# 5. Extract feature importance for this patient
feature_importance = dict(explanation.as_list())

print("\n=== Top Risk Factors for This Patient ===")
sorted_features = sorted(feature_importance.items(), key=lambda x: abs(x[1]), reverse=True)
for feature, weight in sorted_features[:3]:
    direction = "↑ Increases" if weight > 0 else "↓ Decreases"
    print(f"{direction} risk: {feature} (impact: {weight:+.3f})")
Example output:
=== Patient 5 ===
Predicted readmission probability: 64%
Feature contributions:
num_prior_admissions > 3.00: +0.22 ← Major risk factor
comorbidity_count > 4.00: +0.15
age > 65.00: +0.08
emergency_admission = 1: +0.12
length_of_stay ≤ 3.00: -0.05 ← Protective (longer stays = more stabilization)
num_medications ≤ 6.00: -0.02
Interpretation: This patient's high readmission risk is driven primarily
by multiple prior admissions (4 in past year) and high comorbidity burden.
LIME Advantages and Limitations
Advantages: - Fast: Quicker than SHAP for local explanations - Intuitive: Simple “if-then” rules easy for clinicians to understand - Model-agnostic: Works with any black box model
Limitations: - Instability: Explanations can vary significantly with small input changes - Local only: Doesn’t provide global model understanding - Arbitrary perturbations: Sampling strategy affects explanation quality - No theoretical guarantees: Unlike SHAP, not mathematically principled
When to choose LIME over SHAP: - Real-time explanations needed (speed critical) - Prefer rule-based explanations (“If X > 5 AND Y < 10…”) - SHAP computationally infeasible for your model
Method 3: Attention Mechanisms (For Deep Learning)
What it is: Neural network architectures that learn to focus on important input features, making attention weights interpretable.
Where it’s used: - Transformers: BERT, GPT for clinical notes analysis - Vision models: Which parts of chest X-ray drove diagnosis? - Time-series: Which ICU monitoring data points triggered alert?
Example application: Radiology AI highlights suspicious regions in medical images using attention heatmaps.
Attention Visualization Example
import torch
import torch.nn as nn
import numpy as np
import matplotlib.pyplot as plt

# Simple attention-based model for ICU time-series data
class AttentionICU(nn.Module):
    def __init__(self, input_dim, hidden_dim):
        super().__init__()
        self.lstm = nn.LSTM(input_dim, hidden_dim, batch_first=True)
        # Attention mechanism
        self.attention = nn.Linear(hidden_dim, 1)
        self.classifier = nn.Linear(hidden_dim, 1)

    def forward(self, x):
        # x shape: (batch, time_steps, features)
        lstm_out, _ = self.lstm(x)                    # (batch, time_steps, hidden_dim)

        # Calculate attention scores
        attention_scores = self.attention(lstm_out)   # (batch, time_steps, 1)
        attention_weights = torch.softmax(attention_scores, dim=1)

        # Apply attention (weighted sum of LSTM outputs)
        context = torch.sum(attention_weights * lstm_out, dim=1)   # (batch, hidden_dim)

        # Final prediction
        output = self.classifier(context)
        return torch.sigmoid(output), attention_weights

# Simulate ICU time-series data
# Features: HR, BP, SpO2, RR over 24 hours (hourly measurements)
torch.manual_seed(42)
n_patients = 100
time_steps = 24
n_features = 4

X = torch.randn(n_patients, time_steps, n_features)
y = torch.randint(0, 2, (n_patients, 1)).float()   # Binary outcome

# Train model
model = AttentionICU(input_dim=n_features, hidden_dim=32)
criterion = nn.BCELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)

# Training loop (simplified)
for epoch in range(50):
    optimizer.zero_grad()
    predictions, attention_weights = model(X)
    loss = criterion(predictions, y)
    loss.backward()
    optimizer.step()

# ===== INTERPRET ATTENTION WEIGHTS =====

# Explain a specific patient
patient_idx = 0
patient_data = X[patient_idx:patient_idx + 1]
prediction, attention = model(patient_data)

print(f"Predicted risk: {prediction.item():.2%}")

# Visualize attention over time
attention_np = attention.detach().numpy()[0, :, 0]   # Shape: (time_steps,)

plt.figure(figsize=(12, 4))

plt.subplot(1, 2, 1)
plt.plot(range(24), attention_np, marker='o')
plt.xlabel('Hour')
plt.ylabel('Attention Weight')
plt.title('Which Time Points Were Most Important?')
plt.axhline(1 / 24, color='r', linestyle='--', label='Uniform attention')
plt.legend()

# Identify critical time periods
top_hours = np.argsort(attention_np)[-3:][::-1]
print(f"\nMost important time periods: Hours {top_hours}")
print("Interpretation: Model focused on these specific hours when making prediction")

# Overlay attention on vital signs
plt.subplot(1, 2, 2)
vitals = patient_data.detach().numpy()[0, :, 0]   # Heart rate
plt.plot(range(24), vitals, label='Heart Rate', alpha=0.7)
plt.scatter(range(24), vitals, s=attention_np * 1000, c='red', alpha=0.5,
            label='Attention (size = importance)')
plt.xlabel('Hour')
plt.ylabel('Heart Rate')
plt.title('Attention-Weighted Vital Signs')
plt.legend()

plt.tight_layout()
plt.savefig('attention_interpretation.png', dpi=150)
plt.show()

print("\nClinical interpretation:")
print(f"The model identified hours {top_hours[0]}, {top_hours[1]}, {top_hours[2]} as critical.")
print("Clinician should review events during these time windows.")
Key insight: Attention mechanisms provide inherent interpretability; the model learns what’s important during training rather than requiring post-hoc explanation.
Limitations: - Attention ≠ causation - High attention doesn’t guarantee that feature is truly important (attention is correlation) - Requires model architecture modification (can’t apply to existing black boxes)
Method 4: Counterfactual Explanations
What it is: “What would need to change for the model to make a different prediction?”
Example: - Prediction: Patient has 75% readmission risk - Counterfactual: “If patient had ≤2 prior admissions (currently 4) OR comorbidity count ≤3 (currently 5), risk would drop to <30%”
Why it’s valuable: - Actionable: Tells clinicians what interventions might help - Patient-friendly: Easy to communicate (“If you lose 10 lbs, your risk decreases…”) - Fair: Reveals whether model relies on unchangeable features (race, gender)
Counterfactual Example with DiCE
# Install: pip install dice-ml
import dice_ml
from dice_ml import Dice
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Load data (reuse readmission example from LIME section)
# ... (same data generation code) ...

# Train model
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

# ===== COUNTERFACTUAL GENERATION =====

# 1. Prepare DiCE
dice_data = dice_ml.Data(
    dataframe=pd.concat([X_train, y_train], axis=1),
    continuous_features=['age', 'num_prior_admissions', 'length_of_stay',
                         'num_medications', 'comorbidity_count'],
    outcome_name='readmitted_30d'
)

dice_model = dice_ml.Model(model=model, backend='sklearn')
explainer = Dice(dice_data, dice_model, method='random')

# 2. Generate counterfactuals for high-risk patient
patient_idx = 5
patient_df = X_test.iloc[[patient_idx]]

# Find alternative scenarios where patient would NOT be readmitted
counterfactuals = explainer.generate_counterfactuals(
    query_instances=patient_df,
    total_CFs=3,                  # Generate 3 alternative scenarios
    desired_class='opposite'      # Want opposite prediction
)

# 3. Display results
print("=== Original Patient ===")
print(patient_df.T)
print(f"\nPredicted outcome: Readmission (Risk: {model.predict_proba(patient_df)[0][1]:.2%})")

print("\n=== Counterfactual Scenarios (How to Avoid Readmission) ===")
cf_df = counterfactuals.cf_examples_list[0].final_cfs_df
print(cf_df.T)

# 4. Identify key changes
print("\n=== Key Changes Needed ===")
for col in X_test.columns:
    original = patient_df[col].values[0]
    for i, cf in cf_df.iterrows():
        if abs(cf[col] - original) > 0.01:
            change = cf[col] - original
            print(f"  {col}: {original:.1f} → {cf[col]:.1f} (change: {change:+.1f})")

# 5. Clinical translation
print("\n=== Actionable Recommendations ===")
print("To reduce readmission risk below 30%, consider:")
print("  • Reduce medication complexity (consolidate from 8 to ≤6 drugs)")
print("  • Intensive post-discharge follow-up (reduce prior admit pattern)")
print("  • Comorbidity management (focus on top 2-3 conditions)")
Output interpretation:
Original Patient: Readmission Risk = 72%
- Age: 71
- Prior admissions: 4
- Comorbidities: 5
- Medications: 8
Counterfactual Scenario 1: Risk = 18%
- Age: 71 (unchanged)
- Prior admissions: 1 (reduced from 4) ← Major change
- Comorbidities: 5 (unchanged)
- Medications: 6 (reduced from 8)
Interpretation: Model suggests that reducing medication complexity and
preventing repeat admissions are the highest-impact interventions.
Method 5: Built-in Feature Importance (Tree-Based Models)

What it is: For models like Random Forest and XGBoost, built-in feature importance scores.
How it works: - Gini importance: How much each feature reduces impurity when splitting - Permutation importance: Performance drop when feature is randomly shuffled
Advantage: Fast and easy to compute

Limitation: Impurity-based (Gini) importance can be biased toward high-cardinality features
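A minimal sketch of the permutation approach with scikit-learn, assuming the fitted `model` and the `X_test` / `y_test` split from the earlier readmission example:

```python
from sklearn.inspection import permutation_importance

# Performance drop when each feature is shuffled; less biased than impurity-based importance
result = permutation_importance(model, X_test, y_test,
                                scoring='roc_auc', n_repeats=10, random_state=42)

ranked = sorted(zip(X_test.columns, result.importances_mean, result.importances_std),
                key=lambda item: item[1], reverse=True)
for name, mean_drop, std_drop in ranked:
    print(f"{name}: AUC drop {mean_drop:.4f} ± {std_drop:.4f}")
```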
Regulatory Expectations for Explainability

“For AI/ML-enabled devices, transparency regarding the device’s operating characteristics, performance, and limitations is critical… Organizations should provide information about factors that influenced model predictions.”
Requirements for high-risk devices: - Explanation of key features driving predictions - Model limitations and failure modes - Performance across demographic subgroups
EU AI Act (2024)
Transparency obligations for high-risk AI (includes medical AI):
Article 13 - Transparency: - Users must be informed that they are interacting with AI - Information on the logic involved in decision-making - Information on significance and consequences of predictions
Practical implication: “Black box” systems without explanations will face regulatory barriers in EU.
Common Explainability Pitfalls

Pitfall 1: Mistaking Correlation for Causation

Problem: SHAP/LIME identify correlations, not causal relationships.
Example: - Model assigns high importance to “hospital length of stay” for mortality prediction - Interpretation error: “Longer stays cause death” - Reality: Sicker patients stay longer; length of stay is a proxy for severity
Solution: Always validate explanations with clinical domain knowledge
Pitfall 2: Over-relying on Feature Importance
Problem: Global feature importance hides subgroup differences.
Example: - “Age” is most important feature globally (average across all patients) - But for young patients (<40), “comorbidities” might be more important
Solution: Examine SHAP dependence plots and subgroup-specific explanations
Pitfall 3: Ignoring Explanation Instability
Problem: LIME explanations can vary substantially between similar patients.
Test:
# Generate 10 explanations for the same patient (with different LIME seeds)
# Note: explain_instance does not take a seed, so recreate the explainer each time
explanations = []
for seed in range(10):
    seeded_explainer = lime.lime_tabular.LimeTabularExplainer(
        training_data=X_train.values,
        feature_names=X_train.columns.tolist(),
        mode='classification',
        random_state=seed
    )
    exp = seeded_explainer.explain_instance(patient_data, model.predict_proba)
    explanations.append(exp.as_list())

# Check consistency
# If feature rankings vary significantly → unstable explanations
Solution: Use SHAP for high-stakes decisions (more stable)
Pitfall 4: Explaining the Wrong Model
Problem: Explain a simplified “surrogate” model instead of the actual production model.
Example: - Production: Complex ensemble of 50 models - Explanation: Generated from single decision tree approximation - Risk: Explanations don’t reflect actual system behavior
Solution: Always explain the actual deployed model (even if slower)
Key Takeaways: Explainability
Trust requires transparency: Clinicians won’t act on predictions they don’t understand
Multiple methods, multiple purposes: SHAP for robustness, LIME for speed, counterfactuals for action
Evaluate your explanations: Fidelity, consistency, clinical validity
Regulatory trend: Explainability moving from “nice-to-have” to mandatory (FDA, EU)
Layer explanations by user: Patients need simple “why me?”; regulators need comprehensive validation
Correlation ≠ causation: Explanations show what model uses, not necessarily what’s clinically causal
Explainability is not a fix for bad models: If your model is biased or poorly validated, explanations just make the problems more visible (which is actually good for debugging)
Regulatory Evaluation Requirements

2024-2025 reality: AI evaluation is no longer just a scientific question. It’s a regulatory requirement.
By 2025, medical AI systems face scrutiny from multiple regulatory bodies: - FDA (United States): Software as a Medical Device (SaMD) framework - European Union: AI Act (2024) classifying medical AI as “high-risk” - UK MHRA: Software and AI as a Medical Device framework - Health Canada: Medical Device Regulations for AI/ML
Key insight: Understanding regulatory evaluation requirements is essential for deployment, not just compliance.
Why Regulatory Context Matters for Evaluation
Even if you’re not directly developing a regulatory-approved device, understanding these frameworks helps you:
Design better evaluations: Regulatory standards define best practices
Anticipate deployment requirements: Many health systems require FDA clearance or equivalent
Benchmark your work: Compare your validation against regulatory expectations
Communicate with stakeholders: Speak the language of hospital legal/compliance teams
Cross-reference:Policy and Governance covers broader policy implications; this section focuses on evaluation-specific regulatory requirements.
FDA Software as a Medical Device (SaMD) Framework
Software as a Medical Device (SaMD) refers to software intended for medical purposes that operates independently of hardware medical devices.
Examples: - ✅ SaMD: Sepsis prediction algorithm, diabetic retinopathy screening app, radiology CAD software - ❌ Not SaMD: EHR system (administrative), fitness tracker (wellness, not medical diagnosis)
Risk-Based Classification System
The FDA classifies SaMD by risk level, which determines evaluation rigor required.
Risk Categorization Matrix

| State of Healthcare Situation | Treat/Diagnose | Drive Clinical Management | Inform Clinical Management |
|---|---|---|---|
| Critical | III (highest risk) | III | II |
| Serious | III | II | II |
| Non-serious | II | II | I (lowest risk) |

(Rows: state of the healthcare situation or condition. Columns: significance of the information the SaMD provides.)
Definitions: - Critical: Death or permanent impairment (e.g., ICU monitoring) - Serious: Long-term morbidity (e.g., cancer diagnosis) - Non-serious: Minor conditions (e.g., acne treatment)
Information significance: - Treat/Diagnose: Directly triggers treatment decisions - Drive clinical management: Significant influence on treatment path - Inform clinical management: One input among many
Evaluation Requirements by Risk Level
Level I (Low Risk): - Example: App suggesting lifestyle modifications for mild hypertension - Requirements: - Basic technical performance validation - User studies demonstrating safe use - Minimal clinical validation
Predetermined Change Control Plans (PCCPs)

Challenge: Traditional medical devices are static; ML models need updating.
FDA’s solution (2023 draft guidance): Predetermined Change Control Plans allow specified modifications without new regulatory review.
What can be included in a PCCP:
1. Allowable modifications: - Retraining on new data (within specified bounds) - Algorithm parameter adjustments - Performance improvements
2. Modification protocol: - Data requirements: Minimum sample size, quality standards - Performance thresholds: Must maintain ≥ X sensitivity - Validation approach: Test set size, external validation sites
3. Update assessment: - Performance monitoring triggers retraining - Validation results compared to pre-specified thresholds - Automated decision: deploy update or flag for review
4. Transparency and reporting: - Change documentation - Performance reports to regulators - User notifications
Example PCCP:
### Sepsis Prediction Model PCCP

**Allowable Change:** Retrain model quarterly using new institutional data

**Modification Protocol:**
- Minimum 5,000 new patient encounters with ≥200 sepsis cases
- Maintain AUC-ROC ≥ 0.82 (original validation: 0.85)
- Sensitivity at 80% specificity must be ≥ 70%
- External validation on ≥1 additional hospital required

**Assessment:**
- Automated performance monitoring (monthly)
- Retraining triggered if AUC drops below 0.83
- New model validated on hold-out test set (20% of new data)
- Deploy if all thresholds met; otherwise, flag for manual review

**Reporting:**
- Quarterly performance reports to FDA
- User notification of model updates via EHR alert
EU AI Act: High-Risk Classification for Medical AI
Adopted 2024, fully enforced by 2026.
Key classification: Most medical AI systems are high-risk, requiring:
1. Risk management system: - Continuous identification and mitigation of risks
2. Data governance: - Training data quality assurance - Bias detection and mitigation - Data representativeness
3. Technical documentation: - Detailed model specifications - Training procedures - Validation results
4. Transparency: - Users informed that AI is involved - Clear information on limitations
5. Human oversight: - Human-in-the-loop for high-stakes decisions
Even if your system isn’t currently seeking regulatory approval, aligning with these standards ensures quality:
Before Development: - [ ] Determine risk classification (SaMD Level I/II/III or EU high-risk) - [ ] Identify applicable regulatory frameworks - [ ] Define evaluation requirements based on risk level
During Development: - [ ] Ensure training/test data independence (GMLP #4) - [ ] Use representative training data (GMLP #3) - [ ] Apply good software engineering (version control, testing) (GMLP #2) - [ ] Document model architecture, hyperparameters, training process
Validation Phase: - [ ] Internal validation with appropriate cross-validation - [ ] Temporal validation (if time-dependent data) - [ ] External validation (mandatory for Level II/III, EU high-risk) - [ ] Fairness audit across demographic subgroups - [ ] Usability testing with intended users - [ ] Clinical utility assessment (not just technical performance)
Pre-Deployment: - [ ] Human-AI team evaluation (GMLP #7) - [ ] Adversarial robustness testing (EU AI Act requirement) - [ ] Create predetermined change control plan (if adaptive model) - [ ] Develop post-market monitoring protocol
Comparison: Research vs. Regulatory Evaluation Standards
| Aspect | Research Publication | Regulatory Approval |
|---|---|---|
| External validation | Recommended, often skipped (6% in 2020 study) | Mandatory for Level II/III, EU high-risk |
| Prospective testing | Rare | Often required for novel high-risk devices |
| Fairness audit | Increasingly expected | Mandatory (EU AI Act) |
| Post-market monitoring | Not required | Mandatory |
| Clinical utility | Recommended | Required (must demonstrate benefit) |
| Documentation | Methods section | Extensive technical documentation |
| Timeline | Months | 1-3+ years (depending on risk level) |
Key insight: Regulatory standards are higher than typical research standards. If you aim for deployment, plan for regulatory-level evaluation from the start.
When to Seek Regulatory Approval
You likely need FDA clearance/approval if: - System makes diagnostic or treatment recommendations - Used for screening (e.g., diabetic retinopathy, cancer) - Influences clinical decision-making significantly - Marketed as improving health outcomes
You might NOT need approval if: - Administrative tools (scheduling, billing) - Wellness apps (general fitness, not medical claims) - Clinical decision support providing information only (not recommendations) - Gray area: FDA discretion, often depends on risk
When in doubt: Consult FDA’s Digital Health Center of Excellence or regulatory counsel.
Academic: - TRIPOD-AI: Reporting guidelines for prediction models using AI - CONSORT-AI: Reporting guidelines for clinical trials evaluating AI interventions
Key Takeaways: Regulatory Evaluation
Regulatory requirements are evaluation requirements: FDA, EU standards define rigorous validation expectations
Risk-based approach: Higher-risk systems require more extensive validation (external validation, prospective studies, RCTs)
External validation is mandatory: For Level II/III devices and EU high-risk AI
Continuous monitoring is required: Post-market surveillance, not just pre-deployment validation
Good Machine Learning Practice: 10 principles provide practical framework for development and evaluation
Predetermined change control plans: Enable adaptive models while maintaining regulatory compliance
EU AI Act raises the bar: Fairness audits, robustness testing, transparency now mandatory
Plan early: Regulatory evaluation takes longer and costs more than research validation; design for it from the start
Seek expert guidance: Regulatory pathways are complex; consult with regulatory specialists
Standards improve quality: Even if not seeking approval, regulatory frameworks represent best practices
Comprehensive Evaluation Framework
Complete Evaluation Checklist
Use this when evaluating AI systems:
✅ AI System Evaluation Checklist
TECHNICAL PERFORMANCE - [ ] Discrimination metrics reported (AUC-ROC, sensitivity, specificity, PPV, NPV) - [ ] 95% confidence intervals provided for all metrics - [ ] Calibration assessed (calibration plot, Brier score, ECE) - [ ] Appropriate for class imbalance (if applicable) - [ ] Comparison to baseline model (e.g., logistic regression, clinical judgment) - [ ] Multiple metrics reported (not just accuracy)
VALIDATION RIGOR - [ ] Internal validation performed (CV or hold-out) - [ ] Temporal validation (train on past, test on future) - [ ] External validation on independent dataset from different institution - [ ] Prospective validation performed or planned - [ ] Data leakage prevented (feature engineering within folds) - [ ] Appropriate cross-validation for data structure (stratified, grouped, time-series)
FAIRNESS AND EQUITY - [ ] Performance stratified by demographic subgroups (race, gender, age, SES) - [ ] Disparities quantified (absolute and relative differences) - [ ] Calibration assessed per subgroup - [ ] Training data representation documented - [ ] Potential for bias explicitly discussed - [ ] Mitigation strategies proposed if disparities identified
CLINICAL UTILITY - [ ] Decision curve analysis or similar utility assessment - [ ] Comparison to current standard of care - [ ] Clinical workflow integration considered - [ ] Net benefit quantified - [ ] Cost-effectiveness assessed (if applicable) - [ ] Actionable outputs (not just risk scores)
TRANSPARENCY AND REPRODUCIBILITY - [ ] Model architecture and type clearly described - [ ] Feature engineering documented - [ ] Hyperparameters and training procedure reported - [ ] Reporting guidelines followed (TRIPOD, STARD-AI) - [ ] Code availability stated - [ ] Data availability (with appropriate privacy protections) - [ ] Conflicts of interest disclosed
IMPLEMENTATION PLANNING - [ ] Target users and use cases defined - [ ] Workflow integration plan described - [ ] Alert/decision threshold selection justified - [ ] Plan for performance monitoring post-deployment - [ ] Model updating and maintenance plan - [ ] Training plan for end users - [ ] Contingency plan for model failure
Additional items: - Model architecture details - Training procedure (epochs, batch size, optimization) - Validation strategy - External validation results - Subgroup analyses - Calibration assessment - Comparison to human performance (if applicable)
Critical Appraisal of Published Studies
Systematic Evaluation Framework
When reading AI studies:
1. Study Design and Data Quality
Questions: - Representative sample of target population? - External validation performed? - Test set truly independent? - Outcome objectively defined and consistently measured? - Potential for data leakage?
Red flags: - No external validation - Small sample size (<500 events) - Convenience sampling - Vague outcome definitions - Feature engineering on entire dataset before splitting
2. Model Development and Reporting
Questions: - Multiple models compared? - Simple baseline included (logistic regression)? - Hyperparameters tuned on separate validation set? - Feature selection appropriate? - Model clearly described?
Red flags: - No baseline comparison - Hyperparameter tuning on test set - Inadequate model description - No cross-validation
3. Performance Metrics and Reporting

Red flags: - Only accuracy reported (especially for imbalanced data) - No calibration assessment - No confidence intervals - Cherry-picked metrics
4. Fairness and Generalizability
Questions: - Performance stratified by subgroups? - Diverse populations included? - Generalizability limitations discussed? - Potential biases identified?
Red flags: - No subgroup analysis - Homogeneous study population - Claims of broad generalizability without external validation - Dismissal of fairness concerns
5. Clinical Utility
Questions: - Clinical utility assessed (beyond accuracy)? - Compared to current practice? - Implementation considerations discussed? - Cost-effectiveness assessed?
Red flags: - Only technical metrics - No comparison to existing approaches - No implementation discussion - Overstated clinical claims
6. Transparency and Reproducibility
Questions: - Code and data available? - Reporting guidelines followed? - Sufficient detail to reproduce? - Limitations clearly stated? - Conflicts of interest disclosed?
Red flags: - No code/data availability - Insufficient methodological detail - Overstated conclusions - Undisclosed industry funding
Key Takeaways
Essential Principles
Evaluation is multidimensional: technical performance, clinical utility, fairness, and implementation outcomes all matter
Internal validation is insufficient: external validation on independent data is essential to assess generalizability
Calibration is critical: predicted probabilities must be meaningful for clinical decisions, not just discriminative
Question 1: Cross-Validation Strategy
Explanation: Time-based (temporal) cross-validation is essential for healthcare data with temporal dependencies:
Why temporal CV is critical:
Fold 1: Train 2018-2019 → Test 2020
Fold 2: Train 2018-2020 → Test 2021
Fold 3: Train 2018-2021 → Test 2022
Fold 4: Train 2018-2022 → Test 2023
What this tests:
- Model performance as deployed (using past to predict future)
- Robustness to temporal drift (treatment changes, policy updates)
- Realistic performance estimates
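A minimal sketch of this expanding-window scheme, assuming a pandas DataFrame df with a year column, a FEATURES list, an outcome column, and a make_model() factory (all hypothetical names):

```python
import pandas as pd
from sklearn.metrics import roc_auc_score

def expanding_year_splits(df, first_test_year, last_test_year, year_col="year"):
    """Yield (test_year, train, test): train on all earlier years, test on one later year."""
    for test_year in range(first_test_year, last_test_year + 1):
        yield test_year, df[df[year_col] < test_year], df[df[year_col] == test_year]

# for year, train, test in expanding_year_splits(df, 2020, 2023):
#     model = make_model().fit(train[FEATURES], train["outcome"])
#     auc = roc_auc_score(test["outcome"], model.predict_proba(test[FEATURES])[:, 1])
#     print(f"Train <{year}, test {year}: AUC = {auc:.3f}")
```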
Why not standard random K-fold CV? You’d be using 2023 data to predict 2018, which inflates performance artificially.
Why not leave-one-out (b)?
- Computationally expensive
- Still has the temporal leakage problem
- High variance in estimates

Why not stratified K-fold (c)?
- Useful for class imbalance
- But still allows temporal leakage
- Doesn’t test temporal robustness
Real-world impact: Models validated with random CV often show 10-20% performance drops when deployed because they never faced forward-looking prediction during validation.
Lesson: Healthcare data has temporal structure. Always validate the way you will deploy: use the past to predict the future, never the reverse.
Question 2: Calibration Assessment
A cancer risk model predicts 20% risk for 1,000 patients. In reality, 300 of these patients develop cancer. What does this indicate?
The model is well-calibrated
The model is overconfident (underestimates risk)
The model is underconfident (overestimates risk)
The model has good discrimination but poor calibration
Answer: b) The model is overconfident (underestimates risk)
Explanation: Calibration compares predicted probabilities to observed outcomes:

Analysis:
- Predicted: 20% of 1,000 patients = 200 patients expected to develop cancer
- Observed: 300 patients actually developed cancer
- Gap: predicted 200, observed 300 → the model underestimates risk
```python
# Clinical decision rule: treat if predicted risk > 25%
predicted_risk = model.predict_proba(patient)[0, 1]   # ≈ 0.20 → below threshold → no treatment
# Reality: the observed event rate for patients like this was ~0.30
# The patient should have been offered treatment
```
How to detect:
1. Calibration plot: predicted vs. observed event rate by risk bin
2. Brier score: mean squared error of the predicted probabilities
3. Expected Calibration Error (ECE): average absolute calibration gap across bins
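A minimal sketch of these three checks with scikit-learn, assuming arrays y_true (observed 0/1 outcomes) and y_prob (predicted risks):

```python
import numpy as np
from sklearn.calibration import calibration_curve
from sklearn.metrics import brier_score_loss

def expected_calibration_error(y_true, y_prob, n_bins=10):
    """ECE: bin-size-weighted average of |observed event rate - mean predicted risk|."""
    y_true, y_prob = np.asarray(y_true), np.asarray(y_prob)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    bin_idx = np.digitize(y_prob, edges[1:-1])        # bin index 0..n_bins-1
    ece = 0.0
    for b in range(n_bins):
        in_bin = bin_idx == b
        if in_bin.any():
            ece += in_bin.mean() * abs(y_true[in_bin].mean() - y_prob[in_bin].mean())
    return ece

# 1. Calibration plot points (predicted vs. observed by risk bin)
prob_true, prob_pred = calibration_curve(y_true, y_prob, n_bins=10)
# 2. Brier score and 3. ECE
print("Brier:", brier_score_loss(y_true, y_prob),
      "ECE:", expected_calibration_error(y_true, y_prob))
```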
Lesson: High AUC doesn’t guarantee calibration. When predictions inform decisions with probability thresholds, calibration is critical. Always check calibration plots, not just discrimination metrics.
Question 3: External Validation
Your sepsis model achieves AUC 0.88 on internal test set. You test on external hospitals and get AUC 0.72-0.82 (varying by site). What does this variability indicate?
The model is overfitting
External sites have poor data quality
There is substantial site-specific heterogeneity
The model should not be used
Answer: c) There is substantial site-specific heterogeneity
Explanation: Performance variability across sites reveals important heterogeneity:
What varies between hospitals:
Patient populations:
Demographics (age, race, socioeconomic status)
Disease severity (tertiary referral vs community hospital)
Comorbidity profiles
Clinical practices:
Sepsis protocols (early vs delayed antibiotics)
ICU admission criteria
Documentation practices
Infrastructure:
EHR systems (Epic vs Cerner vs homegrown)
Lab equipment (different reference ranges)
Staffing models (nurse-to-patient ratios)
Data capture:
Missing data patterns
Measurement frequency
Feature definitions
Why not overfitting (a)? Overfitting shows as gap between training and test within same dataset. Here, internal test was fine (0.88). It’s external generalization that varies.
Why not poor data quality (b)? Could contribute, but more likely reflects legitimate differences in populations and practices.
Why not unusable (d)? AUC 0.72-0.82 is still useful! But it indicates the need for:
- Site-specific calibration (see the sketch below)
- Understanding what drives the differences
- Possibly site-specific models or adjustments
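Site-specific calibration can often be done without retraining the model: keep its ranking and refit only the probability mapping on a local validation sample. A minimal Platt-style sketch, assuming hypothetical arrays local_val_probs and local_val_labels from the new site:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def platt_recalibrate(val_probs, val_labels):
    """Refit the probability scale for one site; discrimination (AUC) is unchanged."""
    def to_logit(p):
        p = np.clip(np.asarray(p, dtype=float), 1e-6, 1 - 1e-6)
        return np.log(p / (1 - p)).reshape(-1, 1)

    recal = LogisticRegression().fit(to_logit(val_probs), val_labels)
    return lambda p: recal.predict_proba(to_logit(p))[:, 1]

# calibrate = platt_recalibrate(local_val_probs, local_val_labels)
# adjusted_risk = calibrate(model.predict_proba(X_new_site)[:, 1])
```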
Best practice: External validation almost always shows performance drops. Variability across sites is normal and informative; it reveals where the model struggles and where adaptation is needed.
Question 4: Statistical Significance vs Clinical Significance
True or False: If a model improvement is statistically significant (p < 0.05), it is clinically meaningful and should be deployed.
Answer: False
Explanation: Statistical significance ≠ clinical significance. Both matter for a deployment decision, but neither alone is sufficient:

Statistical significance:
- Tests whether the difference is unlikely to be due to chance
- Depends on sample size (large N → small differences become significant)
- Answers: “Is there an effect?”

Clinical significance:
- Tests whether the difference matters for patient care
- Independent of sample size
- Answers: “Is the effect large enough to care?”
Example:
```python
# New model vs. baseline on a held-out test set
results = {
    'baseline_auc': 0.820,
    'new_model_auc': 0.825,
    'difference': 0.005,
    'p_value': 0.03,        # statistically significant
    'sample_size': 50000,   # large sample
}
```
Analysis:
- ✅ Statistically significant: p = 0.03 < 0.05
- ❌ Clinically insignificant: a 0.005 AUC improvement is negligible
- Why significant? A large sample detects tiny differences
- Should you deploy? No; the gain is not worth the cost and disruption
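One way to see whether such a small gap is even precisely estimated is a paired bootstrap of the AUC difference. A minimal sketch, assuming y_test plus each model's predicted scores (hypothetical names):

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def bootstrap_auc_difference(y_true, scores_a, scores_b, n_boot=2000, seed=0):
    """95% CI for AUC(model B) - AUC(model A) via paired bootstrap over patients."""
    rng = np.random.default_rng(seed)
    y_true, scores_a, scores_b = map(np.asarray, (y_true, scores_a, scores_b))
    diffs = []
    for _ in range(n_boot):
        idx = rng.integers(0, len(y_true), len(y_true))
        if len(np.unique(y_true[idx])) < 2:          # resample must contain both classes
            continue
        diffs.append(roc_auc_score(y_true[idx], scores_b[idx]) -
                     roc_auc_score(y_true[idx], scores_a[idx]))
    return np.percentile(diffs, [2.5, 97.5])

# lo, hi = bootstrap_auc_difference(y_test, baseline_scores, new_model_scores)
# A CI that hugs zero (or spans only trivial differences) argues against deployment.
```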
Lesson: Always evaluate both statistical and clinical significance. With large samples, trivial differences become statistically significant. Ask: “Is this improvement large enough to change practice?” Consider effect sizes, confidence intervals, and practical impact, not just p-values.
Question 5: Confidence Intervals
Two models have been evaluated: - Model A: AUC 0.85 (95% CI: 0.83-0.87) - Model B: AUC 0.86 (95% CI: 0.79-0.93)
Which statement is correct?
Model B is definitely better because it has higher AUC
Model A is more reliable because it has a narrower confidence interval
The models cannot be compared without more information
Model B is better if you’re willing to accept more uncertainty
Answer: b) Model A is more reliable because it has a narrower confidence interval
Explanation: Confidence intervals reveal precision and uncertainty, not just point estimates:

Model A:
- AUC: 0.85
- 95% CI: 0.83-0.87
- Width: 0.04 (narrow)
- Interpretation: we’re 95% confident the true AUC is between 0.83 and 0.87 (precise estimate)

Model B:
- AUC: 0.86
- 95% CI: 0.79-0.93
- Width: 0.14 (wide)
- Interpretation: we’re 95% confident the true AUC is between 0.79 and 0.93 (imprecise estimate)
Key insight: the CIs overlap substantially (0.83-0.87 vs. 0.79-0.93). We cannot conclude that Model B is actually better; the difference may be due to chance.

In practice, most organizations prefer Model A:
- Predictable performance for planning
- Lower risk of underperformance
- Easier to set appropriate thresholds
- The small gain (0.01 AUC) is not worth the uncertainty
Lesson: Always report and consider confidence intervals, not just point estimates. Narrow CIs indicate reliable performance estimates. Wide CIs indicate uncertainty: real-world performance might be much worse (or better) than the point estimate suggests.
Question 6: Subgroup Analysis
You evaluate a diagnostic model and find: - Overall AUC: 0.84 - Men: AUC 0.88 - Women: AUC 0.78
What should you do?
Report only overall performance (0.84)
Report overall performance but note subgroup differences exist
Investigate why women’s performance is lower and consider separate models or adjustments
Conclude the model is biased and should not be used
Answer: c) Investigate why women’s performance is lower and consider separate models or adjustments
Explanation: Subgroup performance disparities require investigation and action, not just reporting:

Why performance differs (possible reasons):
Biological differences:
Disease presents differently (atypical symptoms in women)
Different physiological reference ranges
Example: Heart attack symptoms differ by sex
Data representation:
Fewer women in training data → model learns men’s patterns better
Women may be underdiagnosed historically → labels biased
Feature appropriateness:
Features optimized for men
Missing features relevant for women
Example: Pregnancy-related factors not included
Measurement bias:
Tests/measurements less accurate for women
Different documentation patterns
Potential solutions:
Collect more women’s data (if sample size issue)
Add sex-specific features:
```python
# Include pregnancy status and hormonal factors
features += ['pregnant', 'menopause_status', 'hormone_therapy']
```
Stratified modeling:
```python
# Separate models for men and women
if patient.sex == 'M':
    prediction = model_men.predict(patient)
else:
    prediction = model_women.predict(patient)
```
Weighted loss function:
```python
# Penalize errors on women more heavily during training
sample_weights = [2.0 if sex == 'F' else 1.0 for sex in data['sex']]
model.fit(X, y, sample_weight=sample_weights)
```
Lesson: Subgroup analysis is mandatory, not optional. When disparities found, investigate root causes and take corrective action. Don’t hide disparities in overall metrics.
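The stratified check itself is straightforward to run routinely. A minimal sketch, assuming a test DataFrame with sex, y_true, and y_prob columns (hypothetical names):

```python
import pandas as pd
from sklearn.metrics import roc_auc_score, brier_score_loss

def subgroup_report(df, group_col="sex", label_col="y_true", prob_col="y_prob"):
    """Per-subgroup sample size, discrimination (AUC), and calibration (Brier)."""
    rows = []
    for group, g in df.groupby(group_col):
        rows.append({
            group_col: group,
            "n": len(g),
            "auc": roc_auc_score(g[label_col], g[prob_col]),
            "brier": brier_score_loss(g[label_col], g[prob_col]),
        })
    return pd.DataFrame(rows)

# print(subgroup_report(test_df))   # report the gaps; don't let overall metrics hide them
```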
Question 7: LLM Evaluation Strategy
You’re deploying an LLM to summarize public health literature for practitioners. Which evaluation approach is MOST appropriate?
Calculate AUC-ROC on a test set of summaries
Measure perplexity (how surprised the model is by correct summaries)
Use BERTScore to compare generated summaries against expert-written gold standards + human expert evaluation
Only use BLEU score (n-gram overlap with reference summaries)
Answer: c) Use BERTScore to compare generated summaries against expert-written gold standards + human expert evaluation
Why:
Correct approach:
- BERTScore captures semantic similarity better than simple n-gram matching (BLEU)
- Human expert evaluation is essential for:
  - Factual accuracy (automated metrics can’t verify facts)
  - Clinical relevance (is the right information prioritized?)
  - Safety (are there dangerous omissions or errors?)
- Multiple metrics needed: BERTScore + factual accuracy + completeness + safety rating
Why other options are wrong:
a) AUC-ROC: Not applicable; LLMs generate text, not binary classifications or probabilities

b) Perplexity alone: Measures fluency, not accuracy or relevance (fluent nonsense scores well)

d) BLEU only: Too limited; high BLEU doesn’t guarantee accurate or clinically appropriate summaries
Best practice for LLM summarization:
1. Automated metrics (BERTScore, ROUGE) for efficiency
2. Human expert review on a sample (100+ summaries)
3. Safety audit (check for hallucinations, dangerous errors)
4. Prompt robustness testing (consistency across variations)
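A minimal sketch of step 1, assuming the third-party bert-score package is installed and that candidates and references are matched lists of generated and expert-written summaries (placeholder strings below):

```python
# pip install bert-score   (third-party package; assumed available)
from bert_score import score

candidates = ["<LLM-generated summary>"]         # model outputs
references = ["<expert-written gold summary>"]   # matched gold standards

# Semantic similarity to the gold standard (higher F1 is better)
P, R, F1 = score(candidates, references, lang="en")
print(f"Mean BERTScore F1: {F1.mean().item():.3f}")

# Automated scores are only a screen: route a sample (100+ summaries) to clinician
# reviewers for factual-accuracy, completeness, and safety grading (steps 2-3).
```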
Lesson: LLM evaluation requires fundamentally different approaches than traditional ML. Automated metrics alone are insufficient; human expert evaluation is mandatory, especially for clinical applications.
Question 8: Model Drift Detection
You’re monitoring a deployed readmission prediction model. Over 6 months, you observe:
- AUC-ROC: stable at 0.82 (was 0.83 at deployment)
- Brier score: increased from 0.15 to 0.21
- Patient age distribution: mean shifted from 58 to 65 years
- PSI for the age feature: 0.28
What does this indicate, and what should you do?
Model is fine, AUC is stable; continue monitoring
Significant data drift occurred; retrain immediately
Concept drift occurred; model is failing and needs urgent retraining
Calibration degraded but discrimination stable; recalibrate or retrain
Answer: d) Calibration degraded but discrimination stable; recalibrate or retrain
Analysis:
What happened:
- AUC-ROC stable: the model can still distinguish high-risk from low-risk patients (discrimination intact)
- Brier score increased: predicted probabilities are inaccurate (calibration degraded)
- Age distribution shifted: PSI = 0.28 indicates significant data drift (threshold: PSI > 0.25)
- Likely cause: data drift (an aging patient population) degrades calibration even though the model’s relative ranking ability (AUC) persists
Why other options are wrong:
a) Model is fine: WRONG; Brier score degradation and a high PSI require action

b) Data drift, retrain immediately: Partially correct but oversimplified; recalibration could be tried first (faster and simpler)

c) Concept drift: UNLIKELY; AUC would degrade if the relationship between features and outcome had changed. This pattern looks like data drift affecting calibration
What to do:
Immediate (within 1 week):
- Recalibrate the model on recent data (faster than full retraining)
- Test calibration on a recent hold-out set
- Deploy the recalibrated version if performance is restored

Scheduled (within 1 month):
- Full model retraining recommended (PSI > 0.25 for a critical feature)
- Validate the retrained model on a hold-out set
- Compare retrained vs. recalibrated performance
- Deploy the better-performing version

Monitoring:
- Track Brier score monthly (calibration early warning)
- Track PSI for all critical features
- Set an alert: PSI > 0.25 for ≥3 features = immediate retraining
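A minimal sketch of the PSI check used in this scenario, assuming baseline (training-era) and current feature values as 1-D arrays; the train_df/recent_df names in the usage note are hypothetical:

```python
import numpy as np

def population_stability_index(expected, actual, n_bins=10):
    """PSI between a baseline (expected) and current (actual) feature distribution."""
    expected, actual = np.asarray(expected, float), np.asarray(actual, float)
    edges = np.percentile(expected, np.linspace(0, 100, n_bins + 1))
    edges[0], edges[-1] = -np.inf, np.inf                 # catch values outside baseline range
    exp_pct = np.histogram(expected, bins=edges)[0] / len(expected)
    act_pct = np.histogram(actual, bins=edges)[0] / len(actual)
    exp_pct, act_pct = np.clip(exp_pct, 1e-6, None), np.clip(act_pct, 1e-6, None)
    return float(np.sum((act_pct - exp_pct) * np.log(act_pct / exp_pct)))

# psi_age = population_stability_index(train_df["age"], recent_df["age"])
# Rule of thumb: < 0.1 stable, 0.1-0.25 moderate shift, > 0.25 investigate/retrain.
```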
Lesson: Different types of drift require different interventions. Data drift may degrade calibration while preserving discrimination. Monitor multiple metrics (not just AUC) to catch drift early.
Question 9: Regulatory Classification
You’re developing an AI system that analyzes electronic health records to flag patients at high risk for cardiovascular disease in the next 5 years. Clinicians review flagged patients and decide whether to prescribe statins. How would the FDA likely classify this system?
Not a medical device (wellness/administrative use)
SaMD Level I (Low Risk)
SaMD Level II (Moderate Risk)
SaMD Level III (High Risk)
Answer: c) SaMD Level II (Moderate Risk)
Classification reasoning:
FDA SaMD Matrix:
- State of healthcare situation: serious (cardiovascular disease can cause long-term morbidity)
- Significance of information: drives clinical management (significantly influences the statin prescription decision)
Per FDA matrix: Serious + Drive clinical management = Level II
Why not other levels:
a) Not a medical device: WRONG
- The system makes medical claims (predicts disease risk)
- It influences clinical decisions (statin prescription)
- It clearly falls under the SaMD definition

b) Level I (Low Risk): WRONG
- CVD is not a “non-serious” condition
- The system does more than just “inform”; it drives treatment decisions

d) Level III (High Risk): WRONG
- The condition is not critical (immediately life-threatening); that tier covers uses such as ICU monitoring or acute MI
- The system does not treat or diagnose directly; clinicians make the final decision
- If the system autonomously prescribed statins → Level III

Not required (Level III would need these):
- Randomized controlled trial
- Extensive multi-site prospective validation
- Predetermined Change Control Plan (optional but recommended)
Borderline considerations:
If the system provided lower-stakes information (e.g., “Consider discussing lifestyle changes”) → Could be Level I
If the condition were critical/life-threatening (e.g., predict sepsis, guide ICU ventilator settings) → Level III
Lesson: FDA classification depends on both severity of condition AND how the information is used. “Driving clinical management” for serious conditions = Level II. Understanding classification early helps you plan appropriate validation rigor.
Question 10: Adversarial Robustness Testing
You’re evaluating a chest X-ray pneumonia detection model. Which robustness test is MOST important for real-world deployment?
Fast Gradient Sign Method (FGSM) adversarial attack testing
Natural perturbation testing (image compression, scanner variations, noise)
Model extraction attack testing (preventing reverse engineering)
Data poisoning resilience testing (can the training set be corrupted?)
Answer: b) Natural perturbation testing (image compression, scanner variations, noise)
Reasoning:
Real-world threat model:
- Natural perturbations occur constantly: different X-ray machines, image compression algorithms, patient positioning variations, and image quality differences across sites
- Likelihood: essentially 100% of deployed systems encounter natural variation
- Impact if not tested: the model may fail on images from different hospitals or equipment, limiting generalizability
Why other options are less critical (though still valuable):
a) FGSM adversarial attacks:
- Likelihood: near zero; no documented malicious adversarial attacks on medical imaging in practice
- Value: academic interest and regulatory compliance (the EU AI Act may require it), but not the most pressing real-world concern
- When important: high-profile systems, regulatory compliance (EU AI Act)

c) Model extraction:
- Risk: intellectual property theft, but it doesn’t directly harm patients
- Mitigation: API rate limiting and access control (non-evaluation solutions)

d) Data poisoning:
- Risk: rare unless using crowdsourced or untrusted training data
- Prevention: data provenance and quality control during training (not post-deployment testing)
Recommended (high-risk/regulatory):
4. Adversarial attack testing: FGSM, PGD (EU AI Act requirement)
5. Ablation studies: performance with missing or corrupted inputs

Advanced (specific threats):
6. Data poisoning resilience: if using federated learning or external data
7. Model extraction prevention: if protecting proprietary models
Example test:
```python
# Test robustness to JPEG compression (a natural perturbation)
import cv2

compression_qualities = [100, 90, 70, 50, 30]
for quality in compression_qualities:
    # Compress and decompress the image at the given JPEG quality
    _, compressed = cv2.imencode('.jpg', image, [cv2.IMWRITE_JPEG_QUALITY, quality])
    compressed_img = cv2.imdecode(compressed, cv2.IMREAD_COLOR)

    # Re-run the model on the degraded image
    prediction = model.predict(compressed_img)
    print(f"Quality {quality}: Prediction = {prediction:.3f}")

# Acceptable: <10% prediction change across quality 100 → 70
# Red flag: >20% prediction change (overfitting to high-quality images)
```
Lesson: Prioritize robustness testing based on real-world threat likelihood. For medical imaging, natural perturbations (equipment variation) are vastly more common than adversarial attacks. Test what will actually break your model in practice.
Discussion Questions
Validation hierarchy: You’ve developed a hospital-acquired infection prediction model. What validation studies would you conduct before deploying? In what order? What evidence would convince you to deploy at other hospitals?
Fairness trade-offs: Your sepsis model has AUC-ROC = 0.85 overall, but sensitivity is 0.90 for White patients vs. 0.75 for Black patients. What would you do? What are the trade-offs of different mitigation approaches?
Calibration vs. discrimination: Model A: AUC-ROC = 0.85, Brier score = 0.30 (poor calibration). Model B: AUC-ROC = 0.80, Brier score = 0.15 (excellent calibration). Which would you deploy? Why?
External validation failure: Your model achieves AUC-ROC = 0.82 in internal validation but 0.68 in external validation at a different hospital. What explains this? What are the next steps?
Clinical utility skepticism: A model predicts 30-day mortality with AUC-ROC = 0.88. Does this mean it is clinically useful? What additional evaluations are needed?
Prospective study design: You must evaluate a hospital readmission model prospectively. RCT, stepped-wedge, or silent mode? What are the trade-offs?
Alert threshold selection: A clinical decision support tool can alert at >10%, >20%, or >30% predicted risk. How would you choose? What factors matter?
Model drift: A COVID-19 forecasting model was trained on 2020 data; it is now 2023 with new variants. How would you assess whether it is still valid? What triggers retraining?