Appendix D: Case Study Library Overview
A collection of 15 real-world AI implementations in public health, examining successes, failures, and lessons learned. Each case study provides technical depth, practical insights, and evidence-based recommendations.
- Quick scan? Review the summary table and key themes below
- Specific domain? Jump to relevant sections using the navigation links
- Deep dive? Read the complete case studies
- Code implementation? All cases include working Python examples
Summary: All 15 Case Studies at a Glance
| # | Case Study | Domain | Outcome | Key Metric | Lines |
|---|---|---|---|---|---|
| 1 | BlueDot COVID-19 Detection | Surveillance | Success | 9 days before WHO | ~600 |
| 2 | Google Flu Trends | Surveillance | Failure→Recovery | 135% error (2013) | ~550 |
| 3 | ProMED + HealthMap | Surveillance | Success | 50K reports/year | ~500 |
| 4 | IDx-DR Autonomous Diagnostic | Diagnostics | Success | FDA approved 2018 | ~700 |
| 5 | DeepMind AKI Prediction | Diagnostics | Technical Success, Clinical Failure | 90% alerts ignored | ~650 |
| 6 | Breast Cancer AI | Diagnostics | Mixed | Performance varies 10-15% | ~900 |
| 7 | Sepsis RL Treatment | Treatment | Controversial | RCT needed | ~900 |
| 8 | COVID-19 Prediction Models | Treatment | Mostly Failed | 98% high bias | ~700 |
| 9 | Ventilator Allocation | Resources | Rejected | Hospitals chose humans | ~900 |
| 10 | Allegheny Child Welfare | Population Health | Controversial | 47% FPR (Black families) | ~1,100 |
| 11 | UK NHS Disparity Detection | Population Health | Success | 40% disparity reduction | ~900 |
| 12 | Hospital Bed Allocation | Health Economics | Success | 2,054% ROI | ~1,200 |
| 13 | Crisis Text Line | Mental Health | Success | 250 lives saved (est.) | ~1,100 |
| 14 | AlphaFold Drug Discovery | Drug Discovery | Partial Success | 30 drugs in trials | ~1,500 |
| 15 | Project ECHO + AI | Rural Health | Success | 840% ROI | ~1,300 |
Total: ~12,500 lines | Over 50 code implementations | Over 100 references
Part I: Disease Surveillance and Outbreak Detection
Case Study 1: BlueDot - Early COVID-19 Detection
Context: AI-powered global disease surveillance using news reports, flight data, and climate patterns.
Key Achievements:
- Detected the COVID-19 outbreak on December 31, 2019 (9 days before WHO, 6 days before CDC)
- Predicted initial spread destinations (Bangkok, Hong Kong, Tokyo, Seoul, Singapore)
- Alerted clients immediately
Limitations:
- Couldn't predict pandemic severity or spread dynamics
- Early warning alone is insufficient without policy action
Technical Highlights:
- Multi-source data integration (news in 65 languages, flight networks, climate data)
- NLP for disease-mention extraction
- Geographic risk scoring
- 24/7 automated monitoring
Lesson: AI excels at early detection, not prediction of impact
Case Study 2: Google Flu Trends - Rise and Fall
Context: Search query-based flu surveillance (2008-2015)
Timeline:
- 2008-2011: Accurate predictions (correlation 0.90 with CDC data)
- 2012-2013: Catastrophic failure (135% overprediction)
- 2014-2015: Recovery through a hybrid human-AI approach
Why It Failed:
- Algorithm opacity (correlation without causation)
- Search behavior changes (media coverage effects)
- No update mechanism (model drift)
- Overfitting to data artifacts
Recovery Strategy:
- Combined with CDC data (ensemble approach)
- Increased transparency
- Regular recalibration
- Acknowledged limitations
Lesson: Simple correlations fail; need robust, interpretable models with update mechanisms
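The recovery strategy above can be illustrated with a toy least-squares ensemble that blends a drifting search signal with lagged official surveillance, fit against historical ground truth. Everything below is synthetic and hypothetical (a sketch of the idea, not Google's actual method):

```python
import numpy as np

def fit_hybrid_weights(search_pred, cdc_lagged, observed):
    """Fit ensemble weights for [search_signal, lagged_surveillance, intercept]."""
    X = np.column_stack([search_pred, cdc_lagged, np.ones(len(observed))])
    w, *_ = np.linalg.lstsq(X, observed, rcond=None)
    return w

def hybrid_estimate(search_pred, cdc_lagged, w):
    return w[0] * search_pred + w[1] * cdc_lagged + w[2]

# Synthetic illustration: the search signal is biased upward (a stand-in for
# media-driven search spikes); lagged CDC data anchors the ensemble.
rng = np.random.default_rng(0)
truth = 2 + np.sin(np.linspace(0, 6, 60))                 # "true" flu activity
search = truth * 1.5 + rng.normal(0, 0.1, 60)             # inflated, like GFT 2012-13
cdc = np.roll(truth, 2) + rng.normal(0, 0.05, 60)         # accurate but 2 weeks late

w = fit_hybrid_weights(search[:40], cdc[:40], truth[:40])  # fit on history
pred = hybrid_estimate(search[40:], cdc[40:], w)           # predict recent weeks
print("mean abs error, search alone:", np.mean(np.abs(search[40:] - truth[40:])))
print("mean abs error, hybrid:", np.mean(np.abs(pred - truth[40:])))
```

In production this would be refit on a rolling window, which is the "regular recalibration" piece: the weights track drift in search behavior instead of freezing a 2008-era correlation.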
Case Study 3: ProMED-mail + HealthMap - Human-AI Collaboration
Context: Hybrid human-AI disease surveillance system
Success Factors:
- Processes over 50,000 reports annually
- Human experts verify AI classifications
- Maintained >90% accuracy while scaling
- Multi-language support (English, Spanish, French, Russian, Chinese, Arabic, Portuguese)
Key Innovation: AI handles volume, humans provide expertise
Lesson: Humans + AI > Either alone
Part II: Diagnostic AI
Case Study 4: IDx-DR - First Autonomous AI Diagnostic
Historic Achievement:
- First FDA-authorized autonomous diagnostic AI (April 2018)
- Can diagnose without clinician interpretation
Performance:
- Sensitivity: 87.2%
- Specificity: 90.7%
- Clinical trial: 900 patients, 10 sites
Real-World Deployment:
- Used in over 100 primary care clinics
- Performance lower than in trials (image quality issues)
- Reimbursement challenges
Regulatory Pathway:
- FDA De Novo classification
- Extensive clinical validation
- Post-market surveillance requirements
Lesson: Autonomous AI possible but requires extensive validation and monitoring
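For reference, sensitivity and specificity for a diagnostic like this come straight from confusion-matrix counts. The counts below are illustrative values chosen only so the arithmetic reproduces the reported 87.2% and 90.7%; they are not the actual trial tabulation:

```python
def diagnostic_metrics(tp, fn, tn, fp):
    """Sensitivity, specificity, and PPV from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)   # true positive rate: diseased correctly flagged
    specificity = tn / (tn + fp)   # true negative rate: healthy correctly cleared
    ppv = tp / (tp + fp) if tp + fp else float("nan")   # precision among positives
    return sensitivity, specificity, ppv

# Hypothetical counts (~900 patients), NOT the published trial cell counts:
sens, spec, ppv = diagnostic_metrics(tp=170, fn=25, tn=635, fp=65)
print(f"sensitivity={sens:.1%}  specificity={spec:.1%}  ppv={ppv:.1%}")
# sensitivity=87.2%  specificity=90.7%  ppv=72.3%
```

Note how PPV is far below sensitivity: at realistic disease prevalence, even a good specificity produces many false positives, which is one reason post-market monitoring matters.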
Case Study 5: DeepMind AKI - Clinical Failure Despite Technical Success
The Paradox:
- Technical performance: AUC 0.92 for AKI prediction (48-hour advance warning)
- Clinical impact: no change in patient outcomes
- Alert fatigue: 90% of alerts not acted upon
Why It Failed:
- Alerts not actionable (no clear intervention specified)
- Wrong timing (too early or too late for intervention)
- Poor clinical workflow integration
- Nurse/physician resistance (trust issues)
Critical Insight: Technical accuracy ≠ clinical utility
Lessons for the Future:
- Design WITH clinicians, not FOR them
- Provide actionable recommendations, not just predictions
- Integrate into existing workflows
- A clear value proposition is required
Case Study 6: Breast Cancer Detection - Inconsistent Results
Multiple Systems Evaluated:
- Google Health/DeepMind
- Lunit INSIGHT MMG
- iCAD ProFound AI
Performance Variability:
- Internal validation: AUC 0.94-0.96
- External validation: AUC drops 10-15 percentage points
- Equipment matters: different manufacturers → different performance
Success Story:
- Sweden (Lund): 44% radiologist workload reduction with detection rate maintained
- Key: AI as a concurrent reader, not a replacement
Lesson: Internal validation insufficient; need multi-site, multi-equipment testing
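The internal-to-external drop is easy to reproduce on synthetic data: a linear score fit at development sites is evaluated at a site where the outcome depends on the features differently, a crude stand-in for different equipment and populations. All data, weights, and the simple least-squares "model" below are invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(42)

def make_site(n, weights):
    # Outcome depends on "imaging features" via site-specific weights;
    # different weights stand in for different equipment and case mix.
    X = rng.normal(0, 1, (n, 5))
    y = (X @ weights + rng.normal(0, 1, n) > 0).astype(int)
    return X, y

def auc(y_true, scores):
    # Rank-based AUC: probability a random positive outranks a random negative.
    order = np.argsort(scores)
    ranks = np.empty(len(scores))
    ranks[order] = np.arange(1, len(scores) + 1)
    pos = y_true == 1
    n_pos, n_neg = pos.sum(), (~pos).sum()
    return (ranks[pos].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

w_dev = np.array([1.0, 0.5, 0.0, 0.0, 0.0])   # signal at development sites
w_ext = np.array([0.2, 0.1, 1.0, 0.0, 0.0])   # external site: signal shifted

X_train, y_train = make_site(2000, w_dev)
X_int, y_int = make_site(1000, w_dev)
X_ext, y_ext = make_site(1000, w_ext)

# "Model": least-squares linear score fit only on development-site data.
w_hat, *_ = np.linalg.lstsq(X_train, y_train, rcond=None)
auc_int = auc(y_int, X_int @ w_hat)
auc_ext = auc(y_ext, X_ext @ w_hat)
print(f"internal AUC = {auc_int:.2f}, external AUC = {auc_ext:.2f}")
```

The model is "correct" for the data it saw; the gap appears only when someone bothers to test elsewhere, which is exactly why multi-site validation is non-negotiable.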
Part III: Treatment Optimization
Case Study 7: Sepsis Treatment - AI-RL Controversy
The AI Clinician (Imperial College London):
- Learned a treatment policy from 100,000 ICU patients
- Recommended less fluid than standard guidelines
- Controversial: trained on biased observational data
The Confounding Problem:
- Sicker patients receive more aggressive treatment → worse outcomes
- The AI learns: more treatment → worse outcomes (confounded!)
- Reality: treatment couldn't overcome initial severity
Current Status:
- Randomized trials underway (SMARTT, AISEPSIS)
- Results expected 2024-2025
- Industry watching closely
Lesson: Reinforcement learning on observational data = hypothesis-generating, not practice-changing (until validated in RCT)
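The confounding trap is easy to reproduce in simulation: when sicker patients get more treatment, a naive comparison makes a genuinely helpful treatment look harmful. This is a hypothetical illustration with made-up parameters, not the AI Clinician's data:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100_000

# Severity drives BOTH treatment intensity and outcome (the confounder).
severity = rng.normal(0, 1, n)
treated = severity + rng.normal(0, 0.5, n) > 0              # sicker -> treated more
# Ground truth we control: treatment IMPROVES outcome by 0.5 (lower = better).
outcome = severity - 0.5 * treated + rng.normal(0, 0.5, n)

# Naive comparison: what a learner on raw observational data "sees".
naive = outcome[treated].mean() - outcome[~treated].mean()

# Severity-adjusted comparison: stratify into 20 severity bins.
edges = np.quantile(severity, np.linspace(0, 1, 21)[1:-1])
bins = np.digitize(severity, edges)
diffs = [outcome[(bins == b) & treated].mean() - outcome[(bins == b) & ~treated].mean()
         for b in range(20)
         if ((bins == b) & treated).any() and ((bins == b) & ~treated).any()]
adjusted = np.mean(diffs)

print(f"naive effect: {naive:+.2f} (treatment looks harmful)")
print(f"stratified effect: {adjusted:+.2f} (true effect is -0.50)")
```

Stratification recovers the sign of the true effect here because severity is fully observed; in real ICU data, unmeasured severity remains, which is why an RCT is still the decisive test.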
Case Study 8: COVID-19 Prediction Models - Limited Impact
The Pandemic Rush:
- 232 COVID models published by October 2020
- 98% had a high risk of bias
- Only 1 externally validated with low bias
- Most never used clinically
Common Problems:
- Small sample sizes (<500 patients)
- Lack of external validation
- Poor reporting standards
- Overfitting
Models That Worked:
- 4C Mortality Score (UK): 35,000 patients, multiple sites, simple and interpretable
- ISARIC-4C: 75,000 patients, properly validated
Lesson: Urgency doesn’t justify poor methods. Simple, validated models > complex, unvalidated ones
Part IV: Resource Allocation
Case Study 9: Ventilator Allocation - Ethics Meets AI
The Dilemma:
- COVID-19 ventilator shortages required triage decisions
- AI systems were proposed for allocation
- Most hospitals rejected AI-driven allocation
The Trilemma (cannot maximize all three):
1. Utility (save the most lives)
2. Fairness (equal treatment)
3. Autonomy (individual rights)
Why AI Was Rejected:
- Insufficient accuracy (70-80% is not enough for life-and-death decisions)
- Bias concerns (perpetuating historical inequities)
- Legal risks (disability discrimination)
- Trust and legitimacy issues
What Hospitals Did Instead:
- Human clinical assessment with ethical oversight
- Triage officers (experienced clinicians)
- An appeals process
- Re-evaluation every 48-120 hours
Lesson: Some decisions should remain human. AI can inform but not decide life-or-death allocation.
Part V: Population Health and Health Equity
Case Study 10: Allegheny Family Screening Tool - Algorithmic Child Welfare
Context: Risk assessment for child welfare referrals (used since 2016)
Performance:
- Predicts child removal risk (AUC 0.76)
- Used by caseworkers to prioritize investigations
Bias Found:
- Black families scored 1.4 points higher on average
- 47% false positive rate for Black families vs 37% for White families
The Feedback Loop Problem:
- Historical over-surveillance of Black and poor families
- More system contact → higher risk scores
- Higher scores → more investigation
- The cycle perpetuates itself
Responses:
- Public documentation and transparency
- Community engagement
- Regular fairness audits
- Human override capability maintained
Ongoing Debate:
- Supporters: more consistent than human bias alone
- Critics: automates and scales existing discrimination
- Both perspectives have validity
Lesson: Historical bias in data perpetuates inequality. Transparency and community input essential.
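A fairness audit of this kind reduces to computing error rates per group. The sketch below uses simulated labels and flags (not the Allegheny tool's outputs), constructed to mirror the reported 47% vs 37% false-positive gap:

```python
import numpy as np

def fpr_by_group(y_true, y_pred, group):
    """False positive rate per group: share of true negatives flagged high-risk."""
    rates = {}
    for g in np.unique(group):
        mask = (group == g) & (y_true == 0)   # true negatives in this group
        rates[str(g)] = y_pred[mask].mean() if mask.any() else float("nan")
    return rates

# Simulated data only, constructed so group B's true negatives are
# flagged more often (illustrating the audit, not the real tool).
rng = np.random.default_rng(7)
group = rng.choice(["A", "B"], 10_000)
y_true = rng.binomial(1, 0.2, 10_000)
p_flag = np.where(y_true == 1, 0.8,                    # positives usually flagged
                  np.where(group == "B", 0.47, 0.37))  # unequal FPR by group
y_pred = rng.binomial(1, p_flag)

rates = fpr_by_group(y_true, y_pred, group)
print(rates)   # roughly {'A': 0.37, 'B': 0.47}
```

The same few lines, run on real scores with real outcomes, are the core of a recurring fairness audit; the hard part is committing to run them and publish the results.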
Case Study 11: UK NHS AI - Revealing Systemic Racism
What Made This Different:
- AI identified disparities in HUMAN care delivery, not AI decisions
- Used as a diagnostic tool for systemic racism
- Findings led to concrete policy changes
Disparities Found:
- Black patients: 2.5x mortality rate (1.8x after adjusting for comorbidities)
- 8-hour longer admission wait times for Black patients
- Lower ICU admission rates despite similar severity
- Lower rates of guideline-concordant care
NHS Response (Interventions):
- Enhanced translation services (24/7 availability)
- Mandatory cultural competency training
- Community health workers
- Care pathway standardization
- Real-time disparity monitoring dashboards
Results After 2 Years:
- Admission disparities reduced 40%
- ICU access disparities reduced 25%
- Mortality disparities reduced 15%
- Still work to do, but measurable progress
Lesson: AI can expose systemic problems for intervention. Used correctly, it’s a tool for justice, not just a source of bias.
Part VI: Health Economics
Case Study 12: AI-Driven Hospital Bed Allocation
Johns Hopkins Implementation (2018-2022):
Challenge: balance competing objectives:
- Efficiency (maximize utilization)
- Access (minimize wait times)
- Quality (appropriate care level)
- Equity (fair access across populations)
Results:
- Bed utilization: 82% → 88% (+6 percentage points)
- ED wait times: 4.2 → 3.0 hours (28% reduction)
- Ambulance diversions: 45% reduction
- Elective surgery delays: 35% reduction
Economic Impact:
- 3-year ROI: 2,054%
- Total costs: $650,000
- Total benefits: $14,004,000
- Net benefit: $13,354,000
- Payback period: 2.3 months
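The headline ROI figure follows directly from the reported cost and benefit totals:

```python
def roi_percent(total_benefits, total_costs):
    """Simple ROI: net benefit as a percentage of cost."""
    return (total_benefits - total_costs) / total_costs * 100

# 3-year totals reported for the Johns Hopkins program
costs, benefits = 650_000, 14_004_000
print(f"net benefit = ${benefits - costs:,}")                 # $13,354,000
print(f"3-year ROI = {roi_percent(benefits, costs):,.0f}%")   # 2,054%
```

(The payback period additionally depends on how benefits ramped up in year one, so it is not recomputed here.)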
Equity Impact:
- REDUCED racial disparities by 80%+
- Fairness constraints embedded in the optimization
- Wait time disparities for Black patients: +1.2 hours → +0.2 hours
Replication:
- Mayo Clinic (2020)
- Cleveland Clinic (2021)
- Mass General Brigham (2022)
- Over 50 other hospitals
Lesson: Optimization with explicit fairness constraints delivers both efficiency and equity
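The idea of embedding fairness constraints in the optimization can be sketched as a toy linear program: admit patients from two groups under a shared capacity cap while bounding the gap in group admission rates. All numbers are invented and the model is far simpler than any real bed-allocation system:

```python
from scipy.optimize import linprog

# Hypothetical toy problem (not the hospital's actual model):
demand = {"A": 100, "B": 60}      # patients seeking admission, by group
capacity = 120                    # total beds available
priority = {"A": 1.1, "B": 1.0}   # e.g. acuity weighting in the objective
max_rate_gap = 0.05               # fairness: admission rates within 5 points

# Decision variables: x = [admissions_A, admissions_B].
c = [-priority["A"], -priority["B"]]            # maximize weighted admissions
A_ub = [[1, 1],                                 # capacity: x_A + x_B <= 120
        [1 / demand["A"], -1 / demand["B"]],    # rate_A - rate_B <= gap
        [-1 / demand["A"], 1 / demand["B"]]]    # rate_B - rate_A <= gap
b_ub = [capacity, max_rate_gap, max_rate_gap]
res = linprog(c, A_ub=A_ub, b_ub=b_ub,
              bounds=[(0, demand["A"]), (0, demand["B"])])

xA, xB = res.x
print(f"admit A={xA:.0f} (rate {xA/demand['A']:.2f}), "
      f"B={xB:.0f} (rate {xB/demand['B']:.2f})")
```

Dropping the rate-gap rows lets the weighted objective fill group A first (rates 1.00 vs 0.33 in this toy); the constraint trades a small amount of weighted utility for near-equal access, which is the design choice the case study credits.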
Part VII: Mental Health AI
Case Study 13: Crisis Text Line - AI Triage for Suicide Prevention
Context:
- Over 100,000 crisis texts monthly
- 48,000 suicide deaths per year in the US
- Minutes matter in prevention
Impact:
- Wait times for the highest-risk texters: 45 min → 3 min (93% reduction)
- Sensitivity: 92% (detecting high risk)
- Estimated 250 lives saved over 7 years (conservative)
- False negative rate: 8% (concerning but unavoidable with current technology)
Safety Features:
- Multiple screening layers (keywords → ML → human counselor)
- Conservative thresholds (high sensitivity, accepting some false positives)
- Human counselor maintains final authority
- Continuous conversation monitoring
- Supervisor alerts for escalation
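The layered screening design (keywords → ML → human) can be sketched as a tiered pipeline. The keywords, threshold, and stand-in "model" below are invented placeholders, not Crisis Text Line's actual system:

```python
# Illustrative tiered triage mirroring the layered design described above.
URGENT_KEYWORDS = {"pills", "tonight", "goodbye"}   # hypothetical examples
ML_THRESHOLD = 0.3   # deliberately low: favor sensitivity over precision

def keyword_screen(text: str) -> bool:
    """Layer 1: fast, transparent keyword check."""
    return any(k in text.lower() for k in URGENT_KEYWORDS)

def ml_risk_score(text: str) -> float:
    """Layer 2: stand-in for a trained classifier returning P(high risk).
    A real model catches phrasings that keyword lists miss."""
    return 0.8 if "no reason to keep going" in text.lower() else 0.1

def triage(text: str) -> str:
    """Queue placement only; a human counselor always handles the conversation."""
    if keyword_screen(text) or ml_risk_score(text) >= ML_THRESHOLD:
        return "front-of-queue"   # escalate immediately; alert a supervisor
    return "standard-queue"

print(triage("I took some pills"))                  # front-of-queue (keyword layer)
print(triage("there is no reason to keep going"))   # front-of-queue (ML layer)
print(triage("having a rough week"))                # standard-queue
```

Either layer alone can escalate, and nothing in the pipeline closes a conversation: the conservative OR-of-layers design is what buys high sensitivity at the cost of extra false positives for counselors to absorb.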
Counselor Impact:
- 40% efficiency increase
- Better workload management
- Reduced burnout
- Context provided before each conversation
Challenges:
- False negatives (8% of high-risk individuals missed)
- Privacy concerns (AI analyzing sensitive content)
- Bias risks (addressed through continuous auditing)
- Preventing over-reliance (training emphasizes human judgment)
National Replication:
- National Suicide Prevention Lifeline (US)
- Samaritans (UK)
- Lifeline Australia
- Crisis Services Canada
Lesson: High-stakes applications require extreme caution, multiple safety layers, and human authority
Part VIII: Drug Discovery
Case Study 14: AlphaFold and AI-Accelerated Drug Discovery
The AlphaFold Breakthrough:
- Solved the 50-year protein folding problem
- CASP14 competition: 92.4% median accuracy
- Hours of computation vs months of lab work
- Democratized structural biology
AI Drug Discovery Progress (as of 2024):
- ~30 AI-discovered drugs in clinical trials
- Discovery timeline: 50-60% faster (4-6 years → 2-3 years)
- Cost reduction: 60-70% in the preclinical phase
- Zero approved drugs yet (approval takes over 10 years)
Where AI Helped:
- Virtual screening (10-100x faster)
- Lead optimization (property prediction)
- Target identification (multi-omics analysis)
- Protein structure prediction (major advance)
Where AI Fell Short of the Hype:
- "AI eliminates the need for chemists" → expert chemists are still needed
- "AI drugs have higher success rates" → too early to tell
- "AI eliminates animal testing" → still required by regulators
- "10x faster overall" → more like 2-3x (clinical trials are not faster)
Real Examples:
- Exscientia DSP-1181: first AI-designed drug to enter Phase 1 (obsessive-compulsive disorder)
- Insilico Medicine ISM001-055: Phase 2 for idiopathic pulmonary fibrosis (18-month discovery)
- BenevolentAI BEN-2293: Phase 2 for atopic dermatitis
- Relay Therapeutics RLY-4008: Phase 1 for cancer
Economic Reality:
- Over $20 billion invested (2018-2023)
- Zero approved drugs yet (still in the investment phase)
- Company valuations declined 40-60% (2021-2023 market correction)
- First approvals expected 2025-2027
Lesson: Real progress, but more modest than hyped. AI is powerful tool, not magic. Experimental validation still essential.
Part IX: Rural Health
Case Study 15: Project ECHO + AI - Democratizing Specialist Expertise
Context:
- 60 million Americans live in rural areas
- 2x longer specialist wait times
- Many drive over 100 miles for care
- Rural mortality rates 20% higher than urban
The ECHO Model:
- Hub-and-spoke (specialists mentor PCPs)
- Case-based learning
- "Moving knowledge, not patients"
- Community of practice
AI Enhancements:
- Clinical decision support for PCPs
- Automated case classification
- Remote monitoring with AI triage
- Predictive analytics for high-risk patients
New Mexico Pilot Results (2020-2023):
Access Improvements:
- PCP confidence: 4.2 → 7.8 out of 10 (+86%)
- Cases managed locally: 45% → 72% (+27 points)
- Specialist referrals: 38% reduction
- Wait times: 6.5 → 2.1 weeks (for cases still needing a specialist)
Clinical Outcomes:
- Diabetes control: 32% → 51% at goal (+19 points)
- Hypertension control: 48% → 64% at goal (+16 points)
- Hepatitis C cure rate: 67% → 89% (+22 points)
- Hospitalization rate: 23% reduction
Economic Impact:
- 3-year ROI: 840%
- Cost per patient per year: $8,500 (traditional) → $6,100 (ECHO+AI)
- Savings: $2,400 per patient per year
- Total savings: $32.4 million (45,000 patients over 3 years)
Provider Impact:
- Satisfaction: 6.2 → 8.7 out of 10
- Burnout: 58% → 34% reporting burnout
Patient Impact:
- No more 3-hour drives to specialists
- Local care with specialist backing
- Satisfaction: 7.1 → 8.9 out of 10
National Scale:
- Now in 120 clinics across 10 states
- ~200,000 patients reached
- CMS Innovation Award: $50M for national expansion
- 15 states cover the model via Medicaid
Lesson: Technology + human networks > Either alone. Sustainable model with clear ROI and equity benefits.
Key Themes Across All 15 Cases
1. Technical Success ≠ Clinical Impact
Evidence: DeepMind AKI (AUC 0.92, no outcome change), COVID models (high accuracy, low clinical use)
Implication: Must measure patient-centered endpoints, not just algorithm performance
2. External Validation is Mandatory
Evidence: Mammography AI (internal AUC 0.95 → external 0.82), COVID models (98% high bias)
Implication: Internal test performance overestimates real-world performance. Models must be validated on different populations, sites, and time periods.
3. Fairness Requires Active Design
Evidence: Allegheny (47% FPR Black families), Hospital beds (fairness constraints reduced disparities 80%)
Implication: Algorithms perpetuate bias unless explicitly designed for fairness. Regular auditing essential.
4. Human-AI Collaboration Optimal
Evidence: ProMED (human+AI >90% accuracy), ECHO+AI (840% ROI), Crisis Text Line (250 lives saved)
Implication: AI provides scale and consistency, humans provide judgment and accountability. Hybrid > either alone.
5. Context Matters Profoundly
Evidence: Mammography AI (performance varies by equipment), Ventilators (hospitals rejected despite technical feasibility)
Implication: Same algorithm performs differently in different settings. Must adapt to local context.
6. Economic Value Can Be Substantial
Evidence: Hospital beds (2,054% ROI), ECHO+AI (840% ROI), Crisis Text Line (cost-effective at $1,444/QALY)
Implication: When done right, ROI often >500%. But requires upfront investment and sustainable payment model.
7. Implementation is Half the Battle
Evidence: DeepMind AKI (poor workflow integration), ECHO+AI (training and change management critical)
Implication: Algorithm quality insufficient. Must address change management, training, workflow integration.
8. Transparency Builds Trust
Evidence: Allegheny (public documentation), ProMED (human verification visible), NHS (open disparity reporting)
Implication: Explainable AI preferred by clinicians. Public documentation increases accountability.
9. Continuous Monitoring Required
Evidence: Google Flu Trends (model drift), IDx-DR (post-market surveillance), Hospital beds (COVID adaptation)
Implication: Performance degrades over time. Need ongoing evaluation and model updates.
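A minimal version of such monitoring is a rolling performance window with a degradation alert. The window size, baseline, and tolerance below are arbitrary placeholders to be set per application:

```python
from collections import deque

class PerformanceMonitor:
    """Rolling accuracy monitor with a degradation alert (illustrative sketch)."""
    def __init__(self, window=500, baseline=0.90, tolerance=0.05):
        self.window = deque(maxlen=window)   # most recent prediction results
        self.baseline = baseline             # accuracy at deployment
        self.tolerance = tolerance           # allowed drop before alerting

    def record(self, prediction, outcome):
        self.window.append(prediction == outcome)

    def degraded(self):
        if len(self.window) < self.window.maxlen:
            return False   # not enough recent data to judge
        return sum(self.window) / len(self.window) < self.baseline - self.tolerance

monitor = PerformanceMonitor(window=100, baseline=0.90, tolerance=0.05)
for _ in range(100):
    monitor.record(1, 1)          # model performing well
print(monitor.degraded())         # False
for _ in range(40):
    monitor.record(1, 0)          # drift: recent predictions wrong
print(monitor.degraded())         # True
```

Real deployments would track calibration and subgroup metrics too, not just accuracy, but the pattern is the same: compare a recent window to the validated baseline and alert when it slips.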
10. Some Decisions Should Remain Human
Evidence: Ventilator allocation (hospitals chose humans), Crisis Text Line (counselor maintains authority)
Implication: Life-or-death decisions require human judgment. AI should inform, not decide.
Outcome Distribution
Clear Successes (5 cases, 33%):
1. BlueDot - Early detection
2. ProMED - Human-AI collaboration
3. IDx-DR - FDA-approved autonomous diagnostic
4. Hospital Bed Allocation - Economic and equity wins
5. Project ECHO + AI - Rural health transformation

Partial Successes (6 cases, 40%):
6. Google Flu Trends - Recovered from failure
7. Mammography AI - Works in some contexts (Sweden)
8. Crisis Text Line - High impact but 8% false negatives
9. AlphaFold - Faster discovery but no approved drugs yet
10. NHS Disparity Detection - Revealed problems, partial fixes
11. Allegheny - More consistent but perpetuates some bias

Instructive Failures (4 cases, 27%):
12. DeepMind AKI - Technical success, clinical failure
13. Sepsis RL - Observational data limitations
14. COVID Models - 98% high risk of bias
15. Ventilator Allocation - Ethical concerns outweighed benefits
Using This Appendix
For Students
- Start here: Read cases relevant to your interests
- Study implementations: All cases include working Python code
- Analyze outcomes: What worked vs what didn’t, and why
- Extract lessons: Apply to your own projects
For Practitioners
- Before implementation: Review cases in your domain
- Learn from mistakes: Study the failures to avoid repeating them
- Adapt code: Use examples as starting templates
- Evaluate properly: Follow validation frameworks demonstrated
For Researchers
- Identify gaps: What hasn’t been studied yet?
- Deep dives: Follow references for full literature review
- Benchmark your work: Compare to these real-world results
- Contribute evidence: Help build the evidence base
For Policymakers
- Understand impact: See real-world effects, not just promises
- Evidence-based policy: Design regulations based on actual outcomes
- Prioritize investments: Focus on proven ROI models
- Equity focus: Learn from both successes (NHS, Hospital beds) and failures (Allegheny)
Content Metrics
Coverage
- Geographic: US (10 cases), UK (3 cases), Global (2 cases)
- Domains: 9 distinct domains covered
- Time span: 2008-2024 (16 years of AI evolution)
- Sample size: Over 30 real-world implementations documented
Technical Depth
- Code examples: Over 50 complete Python implementations
- Lines of code: Over 5,000 lines across all cases
- Algorithms covered: CNN, RNN, RL, NLP, optimization, ensemble methods
- Frameworks: TensorFlow, PyTorch, scikit-learn, XGBoost, SHAP, Fairlearn
Evidence Base
- Peer-reviewed references: Over 100 papers cited
- Clinical trials: Multiple RCTs and observational studies
- Economic analyses: Over 5 ROI/cost-effectiveness studies
- Fairness audits: 4 complete bias analyses
Updates and Contributions
This appendix is continuously updated. For corrections or to suggest additional case studies, contact the author via bryantegomoh.com.
Citation
If you use these case studies in your work, please cite:
@incollection{tegomoh2025casestudies,
  title     = {Case Study Library: Real-World AI in Public Health},
  booktitle = {The Public Health AI Handbook: Evaluating AI Tools for Public Health Practice},
  author    = {Tegomoh, Bryan},
  year      = {2025},
  doi       = {10.5281/zenodo.18263442},
  url       = {https://publichealthaihandbook.com/appendices/case-study-overview.html}
}

See How to Cite This Handbook for additional citation formats.
Next Steps
Ready to dive deeper?
→ Read Complete Case Studies - Full technical details, code, and analysis
→ Code Repository Guide - Access companion code and examples
→ Further Reading - Curated resources for continued learning
This overview provides navigation across cases. For complete technical details, methodology, code implementations, and full analysis, see the full case studies.