Appendix D — Case Study Library - Overview

A collection of 15 real-world AI implementations in public health, examining successes, failures, and lessons learned. Each case study provides technical depth, practical insights, and evidence-based recommendations.

How to Use This Appendix
  • Quick scan? Review the summary table and key themes below
  • Specific domain? Jump to relevant sections using the navigation links
  • Deep dive? Read the complete case studies
  • Code implementation? All cases include working Python examples

The Big Picture: This appendix examines 15 real-world AI implementations in public health spanning 2008-2024, documenting successes, failures, and critical lessons. Reality check: only 33% were clear successes, 40% partial successes, and 27% instructive failures. The gap between laboratory performance and clinical impact is substantial and predictable.

Outcome Distribution:

  • Clear Successes (5 cases): BlueDot COVID-19 detection (9 days before WHO), ProMED human-AI collaboration (50K reports/year, >90% accuracy), IDx-DR FDA-approved autonomous diagnostic, hospital bed allocation (2,054% ROI and 80% disparity reduction), Project ECHO + AI rural health (840% ROI)
  • Partial Successes (6 cases): Google Flu Trends (recovered from a 135% error through a hybrid approach), mammography AI (works in some contexts; 10-15% external performance drop), Crisis Text Line (250 lives saved but 8% false negatives), AlphaFold drug discovery (faster discovery but zero approvals yet), NHS disparity detection (revealed problems; partial fixes), Allegheny Family Screening Tool (more consistent than humans alone but perpetuates some bias)
  • Instructive Failures (4 cases): DeepMind AKI (AUC 0.92 but 90% of alerts ignored, zero outcome improvement), sepsis RL (observational-data confounding), COVID models (98% at high risk of bias, minimal clinical use), ventilator allocation (hospitals rejected AI for life-and-death decisions)

Top 10 Lessons Across All Cases:

  1. Technical success ≠ clinical impact - DeepMind AKI: 0.92 AUC, no patient outcome change
  2. External validation mandatory - Internal performance consistently overestimates real-world results by 10-15 percentage points
  3. Fairness requires active design - Algorithms perpetuate bias unless explicitly constrained (Allegheny 47% FPR Black families vs hospital beds 80% disparity reduction with fairness constraints)
  4. Human-AI collaboration optimal - ProMED, ECHO+AI, Crisis Text Line all outperform either alone
  5. Context matters profoundly - Same algorithm performs differently across settings (equipment, workflows, populations)
  6. Economic value can be substantial - Hospital beds 2,054% ROI, ECHO+AI 840% ROI when done right (but requires upfront investment)
  7. Implementation is half the battle - Algorithm quality insufficient without workflow integration, training, change management
  8. Transparency builds trust - Explainable AI preferred, public documentation increases accountability
  9. Continuous monitoring required - Performance degrades over time (Google Flu Trends model drift)
  10. Some decisions should remain human - Ventilator allocation and Crisis Text Line's human final authority show that life-or-death decisions require human judgment

Coverage:

  • 9 domains: surveillance, diagnostics, treatment, resource allocation, population health, health economics, mental health, drug discovery, rural health
  • Over 50 code implementations in Python (TensorFlow, PyTorch, scikit-learn, XGBoost, SHAP, Fairlearn)
  • Over 100 peer-reviewed papers cited
  • 16 years of AI evolution (2008-2024)

Key Insight for Practitioners: External validation reveals the truth: internal test sets consistently overestimate performance by 10-15 percentage points. If you don’t validate externally (different sites, populations, equipment, time periods), expect failure at deployment. The cases that succeeded all had extensive external validation; those that failed skipped this step.

Economic Reality: ROI ranges from 840% (ECHO+AI) to 2,054% (hospital beds) for successful implementations, but requires 3-5 year investment horizon, sustainable payment models, and organizational commitment. Failed implementations lose investment entirely.

Fairness Reality: Historical bias in data perpetuates inequality unless actively mitigated. Allegheny (no fairness constraints): 47% FPR Black families. Hospital beds (explicit fairness constraints): 80% disparity reduction. The difference is intentional design, not algorithm choice.

Takeaway: Study failures before building. The 4 instructive failures (DeepMind AKI, Sepsis RL, COVID models, Ventilator allocation) share common pitfalls: poor workflow integration, observational data confounding, inadequate validation, inappropriate use cases. Avoiding these errors is more valuable than chasing the latest algorithms.


Summary: All 15 Case Studies at a Glance

#  | Case Study                   | Domain            | Outcome                             | Key Metric                | Lines
---|------------------------------|-------------------|-------------------------------------|---------------------------|-------
1  | BlueDot COVID-19 Detection   | Surveillance      | Success                             | 9 days before WHO         | ~600
2  | Google Flu Trends            | Surveillance      | Failure → Recovery                  | 135% error (2013)         | ~550
3  | ProMED + HealthMap           | Surveillance      | Success                             | 50K reports/year          | ~500
4  | IDx-DR Autonomous Diagnostic | Diagnostics       | Success                             | FDA approved 2018         | ~700
5  | DeepMind AKI Prediction      | Diagnostics       | Technical success, clinical failure | 90% alerts ignored        | ~650
6  | Breast Cancer AI             | Diagnostics       | Mixed                               | Performance varies 10-15% | ~900
7  | Sepsis RL Treatment          | Treatment         | Controversial                       | RCT needed                | ~900
8  | COVID-19 Prediction Models   | Treatment         | Mostly failed                       | 98% high bias             | ~700
9  | Ventilator Allocation        | Resources         | Rejected                            | Hospitals chose humans    | ~900
10 | Allegheny Child Welfare      | Population Health | Controversial                       | 47% FPR (Black families)  | ~1,100
11 | UK NHS Disparity Detection   | Population Health | Success                             | 40% disparity reduction   | ~900
12 | Hospital Bed Allocation      | Health Economics  | Success                             | 2,054% ROI                | ~1,200
13 | Crisis Text Line             | Mental Health     | Success                             | 250 lives saved (est.)    | ~1,100
14 | AlphaFold Drug Discovery     | Drug Discovery    | Partial success                     | 30 drugs in trials        | ~1,500
15 | Project ECHO + AI            | Rural Health      | Success                             | 840% ROI                  | ~1,300

Total: ~12,500 lines | Over 50 code implementations | Over 100 references


Part I: Disease Surveillance and Outbreak Detection

Case Study 1: BlueDot - Early COVID-19 Detection

Context: AI-powered global disease surveillance using news reports, flight data, and climate patterns.

Key Achievement: - Detected COVID-19 outbreak December 31, 2019 (9 days before WHO, 6 days before CDC) - Predicted initial spread destinations (Bangkok, Hong Kong, Tokyo, Seoul, Singapore) - Alerted clients immediately

Limitations: - Couldn’t predict pandemic severity or spread dynamics - Early warning alone insufficient without policy action

Technical Highlights: - Multi-source data integration (news in 65 languages, flight networks, climate data) - NLP for disease mention extraction - Geographic risk scoring - 24/7 automated monitoring
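The pipeline can be sketched in miniature. The snippet below is an illustrative toy, not BlueDot's system: the keyword pattern, passenger figures, and scoring formula are all hypothetical, and a production system would use multilingual NLP models rather than a regex.

```python
import re

# Illustrative disease-mention matcher; a real system uses multilingual
# NLP models, not a keyword list.
DISEASE_TERMS = re.compile(r"pneumonia|influenza|outbreak|cluster", re.I)

def extract_mentions(report: str) -> int:
    """Count disease-related mentions in one news report."""
    return len(DISEASE_TERMS.findall(report))

def destination_risk(case_signal: float, monthly_passengers: dict) -> dict:
    """Score likely spread destinations by signal strength x travel volume."""
    total = sum(monthly_passengers.values())
    return {city: case_signal * volume / total
            for city, volume in monthly_passengers.items()}

report = "Officials report a cluster of pneumonia cases of unknown cause."
signal = extract_mentions(report)
risks = destination_risk(signal, {"Bangkok": 60_000,
                                  "Hong Kong": 55_000,
                                  "Tokyo": 45_000})
```

Ranking destinations by combined signal and flight volume is the mechanism behind the Bangkok/Hong Kong/Tokyo predictions above.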

Lesson: AI excels at early detection, not prediction of impact

→ Read full case study


Case Study 2: Google Flu Trends - Rise and Fall

Context: Search query-based flu surveillance (2008-2015)

Timeline: - 2008-2011: Accurate predictions (correlation 0.90 with CDC data) - 2012-2013: Catastrophic failure (135% overprediction) - 2014-2015: Recovery through hybrid human-AI approach

Why It Failed: - Algorithm opacity (correlation without causation) - Search behavior changes (media coverage effects) - No update mechanism (model drift) - Overfitting to data artifacts

Recovery Strategy: - Combined with CDC data (ensemble approach) - Increased transparency - Regular recalibration - Acknowledged limitations
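The hybrid recovery idea can be sketched as follows. This is an illustrative reconstruction, not Google's actual method: the search-based estimate is blended with the latest (lagged) CDC value, and the blend weight is re-chosen from recent errors, which is what "regular recalibration" amounts to.

```python
def hybrid_estimate(search_est: float, lagged_cdc: float, w: float) -> float:
    """Blend the search-based estimate with the latest (lagged) CDC value."""
    return w * search_est + (1.0 - w) * lagged_cdc

def recalibrate(search_hist, truth_hist, grid=(0.0, 0.25, 0.5, 0.75, 1.0)):
    """Choose the blend weight with the lowest mean absolute error on
    recent weeks; truth_hist[i - 1] plays the role of the lagged CDC input."""
    def mae(w):
        errors = [abs(hybrid_estimate(s, truth_hist[i - 1], w) - truth_hist[i])
                  for i, s in enumerate(search_hist) if i > 0]
        return sum(errors) / len(errors)
    return min(grid, key=mae)

# 2012-13 failure mode: the search signal runs ~2x hot; CDC data lags a
# week but is unbiased. Recalibration shifts most weight onto the CDC anchor.
truth = [1.0, 1.2, 1.5, 1.9, 2.4]
search = [2.0, 2.4, 3.0, 3.8, 4.8]
best_weight = recalibrate(search, truth)
```

With a biased search signal, the chosen weight lands mostly on the official data, exactly the ensemble behavior described above.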

Lesson: Simple correlations fail; need robust, interpretable models with update mechanisms

→ Read full case study


Case Study 3: ProMED-mail + HealthMap - Human-AI Collaboration

Context: Hybrid human-AI disease surveillance system

Success Factors: - Processes over 50,000 reports annually - Human experts verify AI classifications - Maintained >90% accuracy while scaling - Multi-language support (English, Spanish, French, Russian, Chinese, Arabic, Portuguese)
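The division of labor can be sketched as a confidence-thresholded router; the threshold, labels, and field names below are hypothetical, not ProMED's implementation.

```python
def route_report(ai_label: str, confidence: float, threshold: float = 0.90):
    """Publish high-confidence AI classifications automatically; queue the
    rest for a human moderator so accuracy holds as volume scales."""
    lane = "auto_publish" if confidence >= threshold else "human_review"
    return {"lane": lane, "label": ai_label}

incoming = [("avian influenza H5N1", 0.97),
            ("undiagnosed pneumonia cluster", 0.55)]
routed = [route_report(label, conf) for label, conf in incoming]
```

The AI absorbs the report volume; only the uncertain residue reaches the expert queue.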

Key Innovation: AI handles volume, humans provide expertise

Lesson: Humans + AI > Either alone

→ Read full case study


Part II: Diagnostic AI

Case Study 4: IDx-DR - First Autonomous AI Diagnostic

Historic Achievement: - First FDA-authorized autonomous diagnostic AI (April 2018) - Can diagnose without clinician interpretation

Performance: - Sensitivity: 87.2% - Specificity: 90.7% - Clinical trial: 900 patients, 10 sites
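For reference, the two headline metrics come straight from confusion-matrix counts. The counts below are hypothetical, chosen only to mirror the reported operating point; they are not the trial's actual figures.

```python
def sensitivity(tp: int, fn: int) -> float:
    """Of patients who truly have disease, the fraction the system flags."""
    return tp / (tp + fn)

def specificity(tn: int, fp: int) -> float:
    """Of patients without disease, the fraction the system clears."""
    return tn / (tn + fp)

# Hypothetical confusion-matrix counts (NOT the trial data), chosen to
# land near the reported 87.2% / 90.7% operating point.
tp, fn, tn, fp = 192, 28, 631, 65
sens = sensitivity(tp, fn)   # ~0.873
spec = specificity(tn, fp)   # ~0.907
```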

Real-World Challenges: - Performance lower than in trials (image quality issues) - Scaling deployment across over 100 primary care clinics - Reimbursement challenges

Regulatory Pathway: - FDA De Novo classification - Extensive clinical validation - Post-market surveillance requirements

Lesson: Autonomous AI possible but requires extensive validation and monitoring

→ Read full case study


Case Study 5: DeepMind AKI - Clinical Failure Despite Technical Success

The Paradox: - Technical performance: AUC 0.92 for AKI prediction (48-hour advance warning) - Clinical impact: No change in patient outcomes - Alert fatigue: 90% of alerts not acted upon
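The alert-fatigue arithmetic follows from Bayes' rule: at low event prevalence, even a strong model produces mostly false alarms. The operating point below is illustrative, not DeepMind's published numbers.

```python
def ppv(sens: float, spec: float, prevalence: float) -> float:
    """Positive predictive value via Bayes' rule: of all alerts fired,
    the fraction that are true events."""
    true_alerts = sens * prevalence
    false_alerts = (1.0 - spec) * (1.0 - prevalence)
    return true_alerts / (true_alerts + false_alerts)

# Illustrative operating point: a strong model (90% sens, 90% spec)
# applied to a rare event (2% of patient-days).
alert_precision = ppv(sens=0.90, spec=0.90, prevalence=0.02)
# ~0.155: roughly 5 of every 6 alerts are false alarms.
```

Low alert precision, not model accuracy, is what drives clinicians to start ignoring the pager.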

Why It Failed: - Alerts not actionable (no clear intervention specified) - Wrong timing (too early or too late for intervention) - Poor clinical workflow integration - Nurse/physician resistance (trust issues)

Critical Insight: Technical accuracy ≠ clinical utility

Lessons for Future: - Design WITH clinicians, not FOR them - Provide actionable recommendations, not just predictions - Integrate into existing workflows - Clear value proposition required

→ Read full case study


Case Study 6: Breast Cancer Detection - Inconsistent Results

Multiple Systems Evaluated: - Google Health/DeepMind - Lunit INSIGHT MMG - iCAD ProFound AI

Performance Variability: - Internal validation: AUC 0.94-0.96 - External validation: AUC drops 10-15 percentage points - Equipment matters: Different manufacturers → different performance

Success Story: - Sweden (Lund): 44% radiologist workload reduction, maintained detection rate - Key: AI as concurrent reader, not replacement

Lesson: Internal validation insufficient; need multi-site, multi-equipment testing

→ Read full case study


Part III: Treatment Optimization

Case Study 7: Sepsis Treatment - AI-RL Controversy

The AI Clinician (MIT): - Learned treatment policy from 100,000 ICU patients - Recommended less fluid than standard guidelines - Controversial: Observational data biased

The Confounding Problem: - Sicker patients receive more aggressive treatment → worse outcomes - AI learns: More treatment → Worse outcomes (confounded!) - Reality: Treatment couldn’t overcome initial severity
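The confounding mechanism can be demonstrated with a toy simulation in which treatment truly helps, yet a naive comparison makes it look harmful. All parameters are invented for illustration; this is not the AI Clinician's data or model.

```python
import random

random.seed(0)

def simulate(n: int = 20_000):
    """Toy ICU world: sicker patients get fluid more often AND die more
    often, while (by construction) fluid itself lowers mortality."""
    rows = []
    for _ in range(n):
        severity = random.random()               # 0 = well, 1 = critical
        treated = random.random() < severity     # sicker -> more treatment
        p_death = 0.6 * severity - (0.1 if treated else 0.0)
        died = random.random() < max(p_death, 0.0)
        rows.append((severity, treated, died))
    return rows

def death_rate(rows):
    return sum(died for _, _, died in rows) / len(rows)

data = simulate()
treated = [r for r in data if r[1]]
untreated = [r for r in data if not r[1]]
# Naive comparison: treatment looks harmful (the confounded conclusion)...
naive_harm = death_rate(treated) > death_rate(untreated)
# ...but within a high-severity stratum the true benefit reappears.
sick = [r for r in data if r[0] > 0.7]
stratified_benefit = (death_rate([r for r in sick if r[1]])
                      < death_rate([r for r in sick if not r[1]]))
```

An RL agent trained on the naive association would learn "less treatment is better," which is exactly the controversy above.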

Current Status: - Randomized trials underway (SMARTT, AISEPSIS) - Results expected 2024-2025 - Industry watching closely

Lesson: Reinforcement learning on observational data = hypothesis-generating, not practice-changing (until validated in RCT)

→ Read full case study


Case Study 8: COVID-19 Prediction Models - Limited Impact

The Pandemic Rush: - 232 COVID models published by October 2020 - 98% had high risk of bias - Only 1 externally validated with low bias - Most never used clinically

Common Problems: - Small sample sizes (<500 patients) - Lack of external validation - Poor reporting standards - Overfitting

Models That Worked: - 4C Mortality Score (UK): 35,000 patients, multiple sites, simple and interpretable - ISARIC-4C: 75,000 patients, properly validated
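The models that worked share a simple, point-based pattern, which can be sketched as an additive score. The items, weights, and cut-offs below are hypothetical illustrations of the pattern, not the actual 4C coefficients.

```python
def mortality_points(age: int, resp_rate: int, crp: float,
                     comorbidities: int) -> int:
    """Additive point score in the style of 4C-like tools. Items, weights,
    and thresholds here are hypothetical, not the published coefficients."""
    points = 0
    points += 0 if age < 50 else (2 if age < 70 else 4)
    points += 1 if resp_rate >= 24 else 0
    points += 1 if crp >= 100 else 0
    points += min(comorbidities, 2)
    return points

def risk_band(points: int) -> str:
    return "low" if points <= 2 else ("intermediate" if points <= 5 else "high")

high = risk_band(mortality_points(age=78, resp_rate=26, crp=150, comorbidities=2))
low = risk_band(mortality_points(age=40, resp_rate=16, crp=20, comorbidities=0))
```

A clinician can audit every point by eye, which is a large part of why such scores were trusted and used.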

Lesson: Urgency doesn’t justify poor methods. Simple, validated models > complex, unvalidated ones

→ Read full case study


Part IV: Resource Allocation

Case Study 9: Ventilator Allocation - Ethics Meets AI

The Dilemma: - COVID-19 ventilator shortages required triage decisions - AI systems proposed for allocation - Most hospitals rejected AI-driven allocation

The Trilemma (cannot maximize all three):

  1. Utility (save the most lives)
  2. Fairness (equal treatment)
  3. Autonomy (individual rights)

Why AI Was Rejected: - Insufficient accuracy (70-80% not enough for life/death) - Bias concerns (perpetuate historical inequities) - Legal risks (disability discrimination) - Trust and legitimacy issues

What Hospitals Did Instead: - Human clinical assessment with ethical oversight - Triage officers (experienced clinicians) - Appeals process - Re-evaluation every 48-120 hours

Lesson: Some decisions should remain human. AI can inform but not decide life-or-death allocation.

→ Read full case study


Part V: Population Health and Health Equity

Case Study 10: Allegheny Family Screening Tool - Algorithmic Child Welfare

Context: Risk assessment for child welfare referrals (used since 2016)

Performance: - Predicts child removal risk (AUC 0.76) - Used by caseworkers to prioritize investigations

Bias Found: - Black families scored 1.4 points higher on average - 47% false positive rate for Black families vs 37% for White families

The Feedback Loop Problem: - Historical over-surveillance of Black/poor families - More system contact → Higher risk scores - Higher scores → More investigation - Cycle perpetuates
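The feedback loop can be made concrete with a toy iteration: two families with identical dynamics but different starting contact counts diverge, because investigation itself generates new contact records. All numbers and the threshold are illustrative, not the actual tool's logic.

```python
def feedback_loop(initial_contacts: float, steps: int = 5,
                  investigation_threshold: float = 2.0) -> list:
    """Toy dynamic: the risk score IS the count of prior system contacts;
    scoring above threshold triggers an investigation, and the
    investigation itself adds a new contact record for next time."""
    contacts = initial_contacts
    history = [contacts]
    for _ in range(steps):
        if contacts >= investigation_threshold:   # "high risk" -> investigate
            contacts += 1.0                       # investigation adds a record
        history.append(contacts)
    return history

# Identical dynamics, different starting exposure to the system:
over_surveilled = feedback_loop(initial_contacts=2.0)   # climbs every step
baseline = feedback_loop(initial_contacts=1.0)          # never flagged
```

Nothing about the two trajectories differs except historical exposure; the score then manufactures the very evidence it reads.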

Responses: - Public documentation and transparency - Community engagement - Regular fairness audits - Human override capability maintained

Ongoing Debate: - Supporters: More consistent than human bias alone - Critics: Automates and scales existing discrimination - Both perspectives have validity

Lesson: Historical bias in data perpetuates inequality. Transparency and community input essential.

→ Read full case study


Case Study 11: UK NHS AI - Revealing Systemic Racism

What Made This Different: - AI identified disparities in HUMAN care delivery, not AI decisions - Used as diagnostic tool for systemic racism - Findings led to concrete policy changes

Disparities Found: - Black patients: 2.5x mortality rate (1.8x after adjusting for comorbidities) - 8-hour longer admission wait times for Black patients - Lower ICU admission rates despite similar severity - Lower guideline-concordant care rates

NHS Response (Interventions): - Enhanced translation services (24/7 availability) - Cultural competency training (mandatory) - Community health workers - Care pathway standardization - Real-time disparity monitoring dashboards
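At its core, a real-time disparity dashboard reduces to rate ratios with a review threshold. The figures and tolerance below are illustrative, not NHS data.

```python
def disparity_ratio(group_events: int, group_pop: int,
                    ref_events: int, ref_pop: int) -> float:
    """Crude rate ratio: monitored group's event rate over the reference's."""
    return (group_events / group_pop) / (ref_events / ref_pop)

def dashboard_flag(ratio: float, tolerance: float = 1.25) -> bool:
    """Flag the metric for review once the disparity exceeds tolerance."""
    return ratio > tolerance

# Illustrative monthly counts (not NHS data): a 2.5x mortality ratio.
ratio = disparity_ratio(group_events=50, group_pop=1_000,
                        ref_events=20, ref_pop=1_000)
flagged = dashboard_flag(ratio)
```

Tracking the same ratio month over month is what turns a one-off audit into continuous monitoring.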

Results After 2 Years: - Admission disparities reduced 40% - ICU access disparities reduced 25% - Mortality disparities reduced 15% - Still work to do, but measurable progress

Lesson: AI can expose systemic problems for intervention. Used correctly, it’s a tool for justice, not just a source of bias.

→ Read full case study


Part VI: Health Economics

Case Study 12: AI-Driven Hospital Bed Allocation

Johns Hopkins Implementation (2018-2022):

Challenge: Balance competing objectives: - Efficiency (maximize utilization) - Access (minimize wait times) - Quality (appropriate care level) - Equity (fair access across populations)

Results: - Bed utilization: 82% → 88% (+6 percentage points) - ED wait times: 4.2 → 3.0 hours (28% reduction) - Ambulance diversions: 45% reduction - Elective surgery delays: 35% reduction

Economic Impact: - 3-Year ROI: 2,054% - Total costs: $650,000 - Total benefits: $14,004,000 - Net benefit: $13,354,000 - Payback period: 2.3 months
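The headline ROI follows directly from the cost and benefit totals above:

```python
def roi_percent(total_benefits: float, total_costs: float) -> float:
    """Return on investment: net benefit as a percentage of total cost."""
    return (total_benefits - total_costs) / total_costs * 100.0

# Totals reported for the 3-year Johns Hopkins implementation.
roi = roi_percent(total_benefits=14_004_000, total_costs=650_000)
# roi ~ 2,054%, matching the headline figure.
```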

Equity Impact: - Reduced racial disparities by more than 80% - Fairness constraints embedded in the optimization - Wait-time disparities: Black patients +1.2 hours → +0.2 hours
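Embedding a fairness constraint in an allocation problem can be sketched with a toy two-group model (this is not the Hopkins optimizer): admit as many patients as capacity allows, while keeping the two groups' admission rates within a tolerance of each other.

```python
def allocate_beds(demand_a: int, demand_b: int, capacity: int,
                  max_rate_gap: float = 0.10) -> tuple:
    """Admit as many patients as capacity allows while keeping the two
    groups' admission rates within max_rate_gap of each other."""
    best = (0, 0)
    for a in range(min(demand_a, capacity) + 1):
        b = min(demand_b, capacity - a)
        rate_gap = abs(a / demand_a - b / demand_b)
        if rate_gap <= max_rate_gap + 1e-9 and a + b > sum(best):
            best = (a, b)
    return best

# Without the gap constraint, a greedy policy could serve one group fully
# and leave the other waiting; the constraint forces near-equal access.
admitted_a, admitted_b = allocate_beds(demand_a=100, demand_b=100, capacity=150)
```

Efficiency (beds filled) and equity (rate gap) are both objectives of the same optimization, which is the case study's central design choice.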

Replication: - Mayo Clinic (2020) - Cleveland Clinic (2021) - Mass General Brigham (2022) - Over 50 other hospitals

Lesson: Optimization with explicit fairness constraints delivers both efficiency and equity

→ Read full case study


Part VII: Mental Health AI

Case Study 13: Crisis Text Line - AI Triage for Suicide Prevention

Context: - Over 100,000 crisis texts monthly - 48,000 suicide deaths/year in US - Minutes matter in prevention

Impact: - Wait times for highest-risk: 45 min → 3 min (93% reduction) - Sensitivity: 92% (detecting high-risk) - Estimated 250 lives saved over 7 years (conservative) - False negative rate: 8% (concerning but unavoidable with current technology)

Safety Features: - Multiple screening layers (keywords → ML → human counselor) - Conservative thresholds (high sensitivity, accept some false positives) - Human counselor maintains final authority - Continuous conversation monitoring - Supervisor alerts for escalation
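The layered screening can be sketched as a pipeline in which keyword and model scores are combined under a deliberately low threshold, and the output is only a queue priority, never a final decision. The term list, weights, and threshold are hypothetical, not Crisis Text Line's.

```python
HIGH_RISK_TERMS = {"pills", "tonight", "goodbye"}   # illustrative list only

def keyword_score(message: str) -> float:
    """First screening layer: fraction of high-risk terms present."""
    words = set(message.lower().split())
    return len(words & HIGH_RISK_TERMS) / len(HIGH_RISK_TERMS)

def triage(message: str, ml_risk: float, keyword_weight: float = 0.5) -> dict:
    """Combine keyword and model layers under a deliberately low threshold
    (favoring sensitivity over precision). Output is a queue priority only;
    the human counselor always keeps final authority."""
    score = (keyword_weight * keyword_score(message)
             + (1.0 - keyword_weight) * ml_risk)
    priority = "immediate" if score >= 0.3 else "standard"
    return {"priority": priority, "final_authority": "human_counselor"}

urgent = triage("i took pills tonight", ml_risk=0.9)
routine = triage("rough day at school", ml_risk=0.05)
```

The low threshold is the "conservative" part: it trades false positives for sensitivity, matching the safety posture described above.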

Counselor Impact: - 40% efficiency increase - Better workload management - Reduced burnout - Context provided before conversation

Challenges: - False negatives (8% miss high-risk individuals) - Privacy concerns (AI analyzing sensitive content) - Bias risks (addressed through continuous auditing) - Preventing over-reliance (training emphasizes human judgment)

National Replication: - National Suicide Prevention Lifeline (US) - Samaritans (UK) - Lifeline Australia - Crisis Services Canada

Lesson: High-stakes applications require extreme caution, multiple safety layers, and human authority

→ Read full case study


Part VIII: Drug Discovery

Case Study 14: AlphaFold and AI-Accelerated Drug Discovery

The AlphaFold Breakthrough: - Solved 50-year protein folding problem - CASP14 competition: 92.4% median accuracy - Hours of computation vs months of lab work - Democratized structural biology

AI Drug Discovery Progress (as of 2024): - ~30 AI-discovered drugs in clinical trials - Discovery timeline: 50-60% faster (4-6 years → 2-3 years) - Cost reduction: 60-70% in preclinical phase - Zero approved drugs yet (takes over 10 years)

Where AI Helped: - Virtual screening (10-100x faster) - Lead optimization (predict properties) - Target identification (multi-omics analysis) - Protein structure prediction (major advance)

Where AI Fell Short of Hype: - “AI eliminates need for chemists” → Still need expert chemists - “AI drugs have higher success rates” → Too early to tell - “AI eliminates animal testing” → Still required by regulators - “10x faster overall” → More like 2-3x (clinical trials not faster)

Real Examples: - Exscientia DSP-1181: First AI-designed drug to enter Phase 1 trials (obsessive-compulsive disorder) - Insilico ISP001: Phase 2 for pulmonary fibrosis (18-month discovery) - BenevolentAI BN01: Phase 2 for atopic dermatitis - Relay Therapeutics RLX030: Phase 1 for cancer

Economic Reality: - Over $20 billion invested (2018-2023) - Zero approved drugs yet (still in investment phase) - Company valuations declined 40-60% (2021-2023 market correction) - First approvals expected 2025-2027

Lesson: Real progress, but more modest than hyped. AI is powerful tool, not magic. Experimental validation still essential.

→ Read full case study


Part IX: Rural Health

Case Study 15: Project ECHO + AI - Democratizing Specialist Expertise

Context: - 60 million Americans live in rural areas - 2x longer specialist wait times - Many drive over 100 miles for care - Rural mortality rates 20% higher than urban

The ECHO Model: - Hub-and-spoke (specialists mentor PCPs) - Case-based learning - “Moving knowledge, not patients” - Community of practice

AI Enhancements: - Clinical decision support for PCPs - Automated case classification - Remote monitoring with AI triage - Predictive analytics for high-risk patients

New Mexico Pilot Results (2020-2023):

Access Improvements: - PCP confidence: 4.2 → 7.8 out of 10 (+86%) - Cases managed locally: 45% → 72% (+27 points) - Specialist referrals: 38% reduction - Wait times: 6.5 → 2.1 weeks (for cases still needing a specialist)

Clinical Outcomes: - Diabetes control: 32% → 51% at goal (+19 points) - Hypertension control: 48% → 64% at goal (+16 points) - Hepatitis C cure rate: 67% → 89% (+22 points) - Hospitalization rate: 23% reduction

Economic Impact: - 3-Year ROI: 840% - Cost per patient/year: $8,500 (traditional) → $6,100 (ECHO+AI) - Savings: $2,400 per patient per year - Total savings: $32.4 million (45,000 patients over 3 years)

Provider Impact: - Satisfaction: 6.2 → 8.7 out of 10 - Burnout: 58% → 34% reporting burnout

Patient Impact: - No more 3-hour drives to specialists - Local care with specialist backing - Satisfaction: 7.1 → 8.9 out of 10

National Scale: - Now in 120 clinics across 10 states - ~200,000 patients reached - CMS Innovation Award: $50M for national expansion - Covered by Medicaid in 15 states

Lesson: Technology + human networks > Either alone. Sustainable model with clear ROI and equity benefits.

→ Read full case study


Key Themes Across All 15 Cases

1. Technical Success ≠ Clinical Impact

Evidence: DeepMind AKI (AUC 0.92, no outcome change), COVID models (high accuracy, low clinical use)

Implication: Must measure patient-centered endpoints, not just algorithm performance


2. External Validation is Mandatory

Evidence: Mammography AI (internal AUC 0.95 → external 0.82), COVID models (98% high bias)

Implication: Internal test performance always overestimates real-world. Must validate on different populations, sites, time periods.


3. Fairness Requires Active Design

Evidence: Allegheny (47% FPR Black families), Hospital beds (fairness constraints reduced disparities 80%)

Implication: Algorithms perpetuate bias unless explicitly designed for fairness. Regular auditing essential.
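A minimal fairness audit is just a per-group error-rate comparison, as in this sketch (labels and groups are synthetic):

```python
def false_positive_rate(y_true, y_pred) -> float:
    """FPR: of the true negatives, the fraction incorrectly flagged."""
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return fp / (fp + tn)

def fpr_by_group(y_true, y_pred, groups) -> dict:
    """Minimal fairness audit: FPR computed separately per group."""
    audit = {}
    for g in sorted(set(groups)):
        idx = [i for i, gi in enumerate(groups) if gi == g]
        audit[g] = false_positive_rate([y_true[i] for i in idx],
                                       [y_pred[i] for i in idx])
    return audit

# Synthetic labels: group "a" is flagged twice as often among true negatives.
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
y_pred = [1, 1, 0, 0, 0, 1, 0, 0, 1, 0]
groups = ["a", "a", "a", "a", "b", "b", "b", "b", "a", "b"]
audit = fpr_by_group(y_true, y_pred, groups)   # {"a": 0.5, "b": 0.25}
```

This is the computation behind the 47%-vs-37% FPR finding in the Allegheny case; libraries such as Fairlearn package the same idea with more metrics.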


4. Human-AI Collaboration Optimal

Evidence: ProMED (human+AI >90% accuracy), ECHO+AI (840% ROI), Crisis Text Line (250 lives saved)

Implication: AI provides scale and consistency, humans provide judgment and accountability. Hybrid > either alone.


5. Context Matters Profoundly

Evidence: Mammography AI (performance varies by equipment), Ventilators (hospitals rejected despite technical feasibility)

Implication: Same algorithm performs differently in different settings. Must adapt to local context.


6. Economic Value Can Be Substantial

Evidence: Hospital beds (2,054% ROI), ECHO+AI (840% ROI), Crisis Text Line (cost-effective at $1,444/QALY)

Implication: When done right, ROI often >500%. But requires upfront investment and sustainable payment model.


7. Implementation is Half the Battle

Evidence: DeepMind AKI (poor workflow integration), ECHO+AI (training and change management critical)

Implication: Algorithm quality insufficient. Must address change management, training, workflow integration.


8. Transparency Builds Trust

Evidence: Allegheny (public documentation), ProMED (human verification visible), NHS (open disparity reporting)

Implication: Explainable AI preferred by clinicians. Public documentation increases accountability.


9. Continuous Monitoring Required

Evidence: Google Flu Trends (model drift), IDx-DR (post-market surveillance), Hospital beds (COVID adaptation)

Implication: Performance degrades over time. Need ongoing evaluation and model updates.
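A minimal drift monitor compares recent performance against the validation-time baseline; the window, tolerance, and history below are illustrative values, not from any of the deployed systems.

```python
def rolling_accuracy(outcomes, window: int = 4) -> float:
    """Fraction correct over the most recent `window` predictions
    (1 = correct, 0 = wrong)."""
    recent = outcomes[-window:]
    return sum(recent) / len(recent)

def drift_alert(outcomes, baseline: float, tolerance: float = 0.10,
                window: int = 4) -> bool:
    """Flag for retraining when recent accuracy falls well below the
    validation-time baseline (the Google Flu Trends failure mode)."""
    return rolling_accuracy(outcomes, window) < baseline - tolerance

history = [1, 1, 1, 0, 1, 1, 0, 0, 1, 0]   # performance decaying over time
needs_retraining = drift_alert(history, baseline=0.90)
```

Wiring such a check into routine reporting is the cheapest form of the ongoing evaluation this theme calls for.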


10. Some Decisions Should Remain Human

Evidence: Ventilator allocation (hospitals chose humans), Crisis Text Line (counselor maintains authority)

Implication: Life-or-death decisions require human judgment. AI should inform, not decide.


Outcome Distribution

Clear Successes (5 cases - 33%):

  1. BlueDot - Early detection
  2. ProMED - Human-AI collaboration
  3. IDx-DR - FDA-approved autonomous diagnostic
  4. Hospital Bed Allocation - Economic + equity wins
  5. Project ECHO + AI - Rural health transformation

Partial Successes (6 cases - 40%):

  6. Google Flu Trends - Recovered from failure
  7. Mammography AI - Works in some contexts (Sweden)
  8. Crisis Text Line - High impact but 8% false negatives
  9. AlphaFold - Faster discovery but no approved drugs yet
  10. NHS Disparity Detection - Revealed problems, partial fixes
  11. Allegheny - More consistent but perpetuates some bias

Instructive Failures (4 cases - 27%):

  12. DeepMind AKI - Technical success, clinical failure
  13. Sepsis RL - Observational data limitations
  14. COVID Models - 98% high bias
  15. Ventilator Allocation - Ethical concerns outweighed benefits


Using This Appendix

For Students

  • Start here: Read cases relevant to your interests
  • Study implementations: All cases include working Python code
  • Analyze outcomes: What worked vs what didn’t, and why
  • Extract lessons: Apply to your own projects

For Practitioners

  • Before implementation: Review cases in your domain
  • Learn from mistakes: Study the failures to avoid repeating them
  • Adapt code: Use examples as starting templates
  • Evaluate properly: Follow validation frameworks demonstrated

For Researchers

  • Identify gaps: What hasn’t been studied yet?
  • Deep dives: Follow references for full literature review
  • Benchmark your work: Compare to these real-world results
  • Contribute evidence: Help build the evidence base

For Policymakers

  • Understand impact: See real-world effects, not just promises
  • Evidence-based policy: Design regulations based on actual outcomes
  • Prioritize investments: Focus on proven ROI models
  • Equity focus: Learn from both successes (NHS, Hospital beds) and failures (Allegheny)

Content Metrics

Coverage

  • Geographic: US (10 cases), UK (3 cases), Global (2 cases)
  • Domains: 9 distinct domains covered
  • Time span: 2008-2024 (16 years of AI evolution)
  • Sample size: Over 30 real-world implementations documented

Technical Depth

  • Code examples: Over 50 complete Python implementations
  • Lines of code: Over 5,000 lines across all cases
  • Algorithms covered: CNN, RNN, RL, NLP, optimization, ensemble methods
  • Frameworks: TensorFlow, PyTorch, scikit-learn, XGBoost, SHAP, Fairlearn

Evidence Base

  • Peer-reviewed references: Over 100 papers cited
  • Clinical trials: Multiple RCTs and observational studies
  • Economic analyses: Over 5 ROI/cost-effectiveness studies
  • Fairness audits: 4 complete bias analyses

Updates and Contributions

This appendix is continuously updated. For corrections or to suggest additional case studies, contact the author at bryantegomoh.com.


Citation

If you use these case studies in your work, please cite:

@incollection{tegomoh2025casestudies,
 title = {Case Study Library: Real-World AI in Public Health},
 booktitle = {The Public Health AI Handbook: Evaluating AI Tools for Public Health Practice},
 author = {Tegomoh, Bryan},
 year = {2025},
 doi = {10.5281/zenodo.18263442},
 url = {https://publichealthaihandbook.com/appendices/case-study-overview.html}
}

See How to Cite This Handbook for additional citation formats.


Next Steps

Ready to dive deeper?

→ Read Complete Case Studies - Full technical details, code, and analysis

→ Code Repository Guide - Access companion code and examples

→ Further Reading - Curated resources for continued learning


This overview provides navigation across cases. For complete technical details, methodology, code implementations, and full analysis, see the full case studies.