💡 The Big Picture: Only 6% of medical AI studies perform external validation. Epic’s sepsis model—deployed at 100+ hospitals affecting millions—had 33% sensitivity (missed 2 of 3 cases) and 7% PPV (93% false alarms) in external validation. The gap between lab performance (AUC=0.95 on curated data) and real-world deployment (AUC=0.68 on messy data) kills promising AI systems.
The Evaluation Hierarchy (Strength of Evidence):
1. Internal Validation: Holdout set from the same dataset. Weakest evidence—tells you whether the model memorized or learned
2. Temporal Validation: Test on future data from the same site. Better—checks whether the model holds up as time passes
3. External Validation: Different institutions/populations. Critical—tests generalization
4. Prospective Validation: Deployed in the real clinical workflow before outcomes are known
5. Randomized Controlled Trial (RCT): Gold standard—proves clinical utility, not just accuracy
Most papers stop at #1. Deployment requires #3-5.
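A minimal sketch of why temporal validation catches what an internal random holdout hides, using synthetic data in which one feature's relationship to the outcome drifts over time. All numbers and variable names here are illustrative, and scikit-learn is assumed available:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)

# Synthetic cohort where one feature's link to the outcome flips over time,
# mimicking a practice change between the training era and deployment.
n = 4000
t = np.arange(n)                        # admission order as a proxy for time
X = rng.normal(size=(n, 2))
sign = np.where(t < 3000, 1.0, -1.0)    # feature 1 reverses in the final era
logits = 1.5 * X[:, 0] + sign * X[:, 1]
y = (rng.random(n) < 1 / (1 + np.exp(-logits))).astype(int)

# Internal validation: a random holdout mixes eras and hides the drift.
idx = rng.permutation(n)
tr, te = idx[:3000], idx[3000:]
auc_random = roc_auc_score(
    y[te], LogisticRegression().fit(X[tr], y[tr]).predict_proba(X[te])[:, 1])

# Temporal validation: train on early patients, test strictly on later ones.
auc_temporal = roc_auc_score(
    y[3000:],
    LogisticRegression().fit(X[:3000], y[:3000]).predict_proba(X[3000:])[:, 1])

print(f"random-holdout AUC:   {auc_random:.2f}")
print(f"temporal-holdout AUC: {auc_temporal:.2f}")
```

On this synthetic drift, the temporal AUC comes in well below the random-holdout AUC—the random split leaks future-era examples into training and hides the degradation.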
Beyond Accuracy—What Really Matters:
- Clinical Utility: Does it change decisions? Improve outcomes? Integrate into workflows?
- Generalization: CheXNet AUC=0.94 internally, 0.72 at external hospital. Beware the generalization gap
- Fairness Across Subgroups: Does model perform equally for different races, ages, sexes, socioeconomic groups?
- Implementation Outcomes: Adoption rate, alert fatigue, workflow disruption, user trust
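One way to make the fairness bullet concrete is to stratify standard metrics by subgroup and compare. A hedged sketch on simulated data—the group labels, error rates, and the size of the gap are all invented for illustration:

```python
import numpy as np
from sklearn.metrics import confusion_matrix

rng = np.random.default_rng(1)
n = 2000
group = rng.choice(["A", "B"], size=n)
y_true = rng.integers(0, 2, size=n)
# Simulated predictions: the model is right less often for group B.
p_correct = np.where(group == "A", 0.90, 0.70)
y_pred = np.where(rng.random(n) < p_correct, y_true, 1 - y_true)

# Same model, metrics stratified by subgroup.
metrics = {}
for g in ("A", "B"):
    m = group == g
    tn, fp, fn, tp = confusion_matrix(y_true[m], y_pred[m]).ravel()
    metrics[g] = {"sensitivity": tp / (tp + fn), "ppv": tp / (tp + fp)}
    print(g, metrics[g])
```

An aggregate AUC or accuracy would average these groups together; the stratified view is what reveals whether one subgroup bears most of the errors.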
❌ Common Evaluation Pitfalls:
- No External Validation: Tested only on holdout from same dataset
- Cherry-Picked Subgroups: “Works great on images rated as ‘excellent quality’” (real-world images are messy)
- Ignoring Prevalence Shift: Trained on 50% disease prevalence, deployed where prevalence is 5%
- Overfitting to Dataset Quirks: Model learns hospital-specific artifacts, not disease
- Evaluation-Deployment Mismatch: Evaluated on already-diagnosed cases, deployed for population screening (spectrum bias)
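The prevalence-shift pitfall follows directly from Bayes' theorem: with sensitivity and specificity held fixed, PPV collapses as prevalence falls. A quick worked check (the 90%/90% operating point is illustrative):

```python
def ppv(sens, spec, prev):
    """Positive predictive value from sensitivity, specificity, prevalence."""
    tp = sens * prev                # true positives per person screened
    fp = (1 - spec) * (1 - prev)    # false positives per person screened
    return tp / (tp + fp)

sens, spec = 0.90, 0.90
print(ppv(sens, spec, 0.50))  # balanced research cohort: PPV = 0.90
print(ppv(sens, spec, 0.05))  # screening population: PPV ≈ 0.32
```

Same model, same sensitivity and specificity—yet at 5% prevalence roughly two of every three alerts are false alarms, which is exactly the Epic-sepsis failure mode.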
✨ NEW for 2025—Evaluating Foundation Models & LLMs:
Traditional ML metrics (accuracy, AUC) insufficient for large language models:
- Factual Accuracy: Does model provide correct medical information?
- Hallucination Detection: How often does it confidently generate false information?
- Prompt Sensitivity: Does small rewording change answers dramatically?
- Safety: Harmful advice, biased responses, privacy leaks
- Medical Benchmarks: MedQA, PubMedQA, USMLE-style questions (but benchmarks ≠ clinical competence)
- RAG Evaluation: For retrieval-augmented generation—evaluate retrieval quality AND generation quality separately
→ See also: Chapter 20: Large Language Models in Public Health for comprehensive LLM evaluation frameworks and practical validation strategies
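Prompt sensitivity can be probed by asking the same clinical question several ways and measuring answer agreement. A minimal sketch—`toy_model` is a hypothetical stand-in for any prompt-to-answer callable, not a real LLM API, and the prompts are invented:

```python
from collections import Counter

def prompt_sensitivity(model, paraphrases):
    """Fraction of paraphrases yielding the modal answer (1.0 = fully stable).
    `model` is any callable mapping a prompt string to an answer string."""
    answers = [model(p) for p in paraphrases]
    modal_count = Counter(answers).most_common(1)[0][1]
    return modal_count / len(answers)

# Hypothetical stand-in "model" that flips its answer on one rewording.
def toy_model(prompt):
    return "aspirin" if "first-line" in prompt else "ibuprofen"

paraphrases = [
    "What is the first-line treatment for X?",
    "Which drug is first-line for X?",
    "What would you prescribe first for X?",
]
print(prompt_sensitivity(toy_model, paraphrases))  # 2 of 3 agree → ~0.67
```

In practice you would run this over a bank of clinically validated question sets and flag items whose stability falls below a preset threshold.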
✨ NEW for 2025—Continuous Monitoring (ML Ops):
Deployment isn’t the end—models degrade over time:
- Data Drift: Input distributions change (e.g., demographics shift, new disease variants)
- Concept Drift: Relationship between features and outcome changes
- Label Drift: Definition of outcome evolves
- Detection Methods: Population Stability Index (PSI), statistical process control charts
- Retraining Triggers: Predefined performance thresholds that, once crossed, trigger model retraining or replacement
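The Population Stability Index mentioned above can be computed from quantile bins of the baseline distribution. A sketch assuming NumPy; the 0.1/0.25 interpretation bands are a common industry rule of thumb, not a formal standard:

```python
import numpy as np

def psi(expected, actual, bins=10):
    """Population Stability Index between baseline (training-era) and current
    data. Rule of thumb: <0.1 stable, 0.1-0.25 moderate, >0.25 major shift."""
    edges = np.quantile(expected, np.linspace(0, 1, bins + 1))
    e_counts, _ = np.histogram(expected, bins=edges)
    a_counts, _ = np.histogram(np.clip(actual, edges[0], edges[-1]), bins=edges)
    e_pct = np.clip(e_counts / len(expected), 1e-6, None)  # avoid log(0)
    a_pct = np.clip(a_counts / len(actual), 1e-6, None)
    return float(np.sum((a_pct - e_pct) * np.log(a_pct / e_pct)))

rng = np.random.default_rng(42)
baseline = rng.normal(50, 10, 10_000)                 # e.g., age at training
psi_same = psi(baseline, rng.normal(50, 10, 10_000))  # same distribution
psi_shift = psi(baseline, rng.normal(58, 10, 10_000)) # demographic shift
print(f"no drift:   PSI = {psi_same:.3f}")
print(f"mean shift: PSI = {psi_shift:.3f}")
```

Computed weekly or monthly on each key input feature, this gives a cheap automatic drift alarm well before downstream performance metrics (which need outcome labels) can confirm degradation.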
✨ NEW for 2025—Regulatory Frameworks:
- FDA SaMD (Software as Medical Device): Risk-based classification (I, II, III). Higher risk = more rigorous validation
- Good Machine Learning Practice (GMLP): Industry standards for development, validation, monitoring
- EU AI Act: High-risk medical AI requires conformity assessment, transparency, human oversight, continuous monitoring
- 💡 Key Insight: Even non-regulated systems benefit from regulatory-level evaluation rigor
✨ NEW for 2025—Adversarial Robustness:
- Natural Perturbations: Small changes in image brightness, patient demographics—does model break?
- Adversarial Attacks: Intentionally crafted inputs to fool model (FGSM, PGD attacks)
- Out-of-Distribution (OOD) Detection: Can model recognize when input is unlike training data?
- EU AI Act Requirement: High-risk systems must demonstrate robustness testing
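A crude natural-perturbation probe: re-run predictions under small random input jitter and count label flips. Everything here—the threshold "model", the noise scale, the data—is illustrative:

```python
import numpy as np

def flip_rate(predict, X, noise_scale=0.05, trials=20, seed=0):
    """Share of samples whose predicted label changes at least once under
    small random input perturbations (0.0 = robust, 1.0 = fully brittle)."""
    rng = np.random.default_rng(seed)
    base = predict(X)
    flipped = np.zeros(len(X), dtype=bool)
    for _ in range(trials):
        noisy = X + rng.normal(scale=noise_scale, size=X.shape)
        flipped |= predict(noisy) != base
    return flipped.mean()

# Hypothetical threshold "model": fragile near its decision boundary.
predict = lambda X: (X[:, 0] > 0.5).astype(int)

rng = np.random.default_rng(1)
X_far = rng.uniform(0.80, 1.00, size=(200, 1))   # far from the boundary
X_near = rng.uniform(0.48, 0.52, size=(200, 1))  # hugging the boundary
print(flip_rate(predict, X_far))   # near 0: robust
print(flip_rate(predict, X_near))  # near 1: brittle
```

For images, the same idea applies with brightness, contrast, or compression perturbations in place of Gaussian noise; adversarial attacks like FGSM/PGD are the worst-case version of this test and need a gradient-based toolkit.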
⚠️ The Obermeyer Lesson:
A healthcare cost-prediction algorithm had excellent accuracy but systematic inequity: because it used healthcare spending as a proxy for health need, and less is spent on Black patients with the same conditions, Black patients had to be sicker than White patients to receive the same risk score. Lesson: Technical performance ≠ ethical deployment. Must evaluate fairness explicitly.
→ See also: Chapter 9: Ethics, Bias, and Equity for comprehensive frameworks on evaluating AI fairness
⚠️ When NOT to Deploy (Despite Good Performance):
Red flags that should halt deployment:
1. External validation shows poor generalization
2. Fairness audit reveals systematic bias
3. Clinical workflow integration causes more harm than benefit (alert fatigue)
4. Users don’t trust or adopt the system
5. No plan for continuous monitoring and maintenance
🎯 The Takeaway for Public Health Practitioners:
Evaluation is not a checkbox—it’s an ongoing process from development through deployment and beyond. Internal validation proves your model learned something. External validation proves it generalizes. Prospective validation proves it works in real-world workflows. RCTs prove it improves outcomes. Most AI systems fail between internal and external validation. For LLMs and foundation models, add hallucination detection, prompt robustness, and safety testing. Post-deployment, monitor for drift and performance degradation. Regulatory frameworks (FDA, EU AI Act) provide evaluation rigor even for non-regulated systems. The evaluation crisis isn’t about not having metrics—it’s about not using the right metrics at the right stages. Epic’s sepsis model had great internal metrics but catastrophic external performance. Don’t let that be your model.