Appendix B — Case Study Library - Overview

Appendix B: Case Study Library - Overview

A comprehensive collection of 15 real-world AI implementations in public health, examining successes, failures, and lessons learned. Each case study provides technical depth, practical insights, and evidence-based recommendations.

NoteHow to Use This Appendix
  • Quick scan? Review the summary table and key themes below
  • Specific domain? Jump to relevant sections using the navigation links
  • Deep dive? Read the complete case studies
  • Code implementation? All cases include working Python examples

Summary: All 15 Case Studies at a Glance

# Case Study Domain Outcome Key Metric Lines
1 BlueDot COVID-19 Detection Surveillance Success 9 days before WHO ~600
2 Google Flu Trends Surveillance Failure→Recovery 135% error (2013) ~550
3 ProMED + HealthMap Surveillance Success 50K reports/year ~500
4 IDx-DR Autonomous Diagnostic Diagnostics Success FDA approved 2018 ~700
5 DeepMind AKI Prediction Diagnostics Technical Success, Clinical Failure 90% alerts ignored ~650
6 Breast Cancer AI Diagnostics Mixed Performance varies 10-15% ~900
7 Sepsis RL Treatment Treatment Controversial RCT needed ~900
8 COVID-19 Prediction Models Treatment Mostly Failed 98% high bias ~700
9 Ventilator Allocation Resources Rejected Hospitals chose humans ~900
10 Allegheny Child Welfare Population Health Controversial 47% FPR (Black families) ~1,100
11 UK NHS Disparity Detection Population Health Success 40% disparity reduction ~900
12 Hospital Bed Allocation Health Economics Success 2,054% ROI ~1,200
13 Crisis Text Line Mental Health Success 250 lives saved (est.) ~1,100
14 AlphaFold Drug Discovery Drug Discovery Partial Success 30 drugs in trials ~1,500
15 Project ECHO + AI Rural Health Success 840% ROI ~1,300

Total: ~12,500 lines | 50+ code implementations | 100+ references


Part I: Disease Surveillance and Outbreak Detection

Case Study 1: BlueDot - Early COVID-19 Detection

Context: AI-powered global disease surveillance using news reports, flight data, and climate patterns.

Key Achievement: - ✅ Detected COVID-19 outbreak December 31, 2019 (9 days before WHO, 6 days before CDC) - ✅ Predicted initial spread destinations (Bangkok, Hong Kong, Tokyo, Seoul, Singapore) - ✅ Alerted clients immediately

Limitations: - ❌ Couldn’t predict pandemic severity or spread dynamics - ❌ Early warning alone insufficient without policy action

Technical Highlights: - Multi-source data integration (news in 65 languages, flight networks, climate data) - NLP for disease mention extraction - Geographic risk scoring - 24/7 automated monitoring

Lesson: AI excels at early detection, not prediction of impact

→ Read full case study


Case Study 2: Google Flu Trends - Rise and Fall

Context: Search query-based flu surveillance (2008-2015)

Timeline: - 2008-2011: Accurate predictions (correlation 0.90 with CDC data) - 2012-2013: Catastrophic failure (135% overprediction) - 2014-2015: Recovery through hybrid human-AI approach

Why It Failed: - Algorithm opacity (correlation without causation) - Search behavior changes (media coverage effects) - No update mechanism (model drift) - Overfitting to data artifacts

Recovery Strategy: - Combined with CDC data (ensemble approach) - Increased transparency - Regular recalibration - Acknowledged limitations

Lesson: Simple correlations fail; need robust, interpretable models with update mechanisms

→ Read full case study


Case Study 3: ProMED-mail + HealthMap - Human-AI Collaboration

Context: Hybrid human-AI disease surveillance system

Success Factors: - ✅ Processes 50,000+ reports annually - ✅ Human experts verify AI classifications - ✅ Maintained >90% accuracy while scaling - ✅ Multi-language support (English, Spanish, French, Russian, Chinese, Arabic, Portuguese)

Key Innovation: AI handles volume, humans provide expertise

Lesson: Humans + AI > Either alone

→ Read full case study


Part II: Diagnostic AI

Case Study 4: IDx-DR - First Autonomous AI Diagnostic

Historic Achievement: - First FDA-authorized autonomous diagnostic AI (April 2018) - Can diagnose without clinician interpretation

Performance: - Sensitivity: 87.2% - Specificity: 90.7% - Clinical trial: 900 patients, 10 sites

Real-World Challenges: - ⚠️ Performance lower than trials (image quality issues) - Deployed in 100+ primary care clinics - Reimbursement challenges

Regulatory Pathway: - FDA De Novo classification - Extensive clinical validation - Post-market surveillance requirements

Lesson: Autonomous AI possible but requires extensive validation and monitoring

→ Read full case study


Case Study 5: DeepMind AKI - Clinical Failure Despite Technical Success

The Paradox: - ✅ Technical performance: AUC 0.92 for AKI prediction (48-hour advance warning) - ❌ Clinical impact: No change in patient outcomes - ❌ Alert fatigue: 90% of alerts not acted upon

Why It Failed: - Alerts not actionable (no clear intervention specified) - Wrong timing (too early or too late for intervention) - Poor clinical workflow integration - Nurse/physician resistance (trust issues)

Critical Insight: Technical accuracy ≠ clinical utility

Lessons for Future: - Design WITH clinicians, not FOR them - Provide actionable recommendations, not just predictions - Integrate into existing workflows - Clear value proposition required

→ Read full case study


Case Study 6: Breast Cancer Detection - Inconsistent Results

Multiple Systems Evaluated: - Google Health/DeepMind - Lunit INSIGHT MMG - iCAD ProFound AI

Performance Variability: - Internal validation: AUC 0.94-0.96 - External validation: AUC drops 10-15 percentage points - Equipment matters: Different manufacturers → different performance

Success Story: - Sweden (Lund): 44% radiologist workload reduction, maintained detection rate - Key: AI as concurrent reader, not replacement

Lesson: Internal validation insufficient; need multi-site, multi-equipment testing

→ Read full case study


Part III: Treatment Optimization

Case Study 7: Sepsis Treatment - AI-RL Controversy

The AI Clinician (MIT): - Learned treatment policy from 100,000 ICU patients - Recommended less fluid than standard guidelines - Controversial: Observational data biased

The Confounding Problem: - Sicker patients receive more aggressive treatment → worse outcomes - AI learns: More treatment → Worse outcomes (confounded!) - Reality: Treatment couldn’t overcome initial severity

Current Status: - Randomized trials underway (SMARTT, AISEPSIS) - Results expected 2024-2025 - Industry watching closely

Lesson: Reinforcement learning on observational data = hypothesis-generating, not practice-changing (until validated in RCT)

→ Read full case study


Case Study 8: COVID-19 Prediction Models - Limited Impact

The Pandemic Rush: - 232 COVID models published by October 2020 - 98% had high risk of bias - Only 1 externally validated with low bias - Most never used clinically

Common Problems: - Small sample sizes (<500 patients) - Lack of external validation - Poor reporting standards - Overfitting

Models That Worked: - 4C Mortality Score (UK): 35,000 patients, multiple sites, simple and interpretable - ISARIC-4C: 75,000 patients, properly validated

Lesson: Urgency doesn’t justify poor methods. Simple, validated models > complex, unvalidated ones

→ Read full case study


Part IV: Resource Allocation

Case Study 9: Ventilator Allocation - Ethics Meets AI

The Dilemma: - COVID-19 ventilator shortages required triage decisions - AI systems proposed for allocation - Most hospitals rejected AI-driven allocation

The Trilemma (Cannot maximize all three): 1. Utility (save most lives) 2. Fairness (equal treatment) 3. Autonomy (individual rights)

Why AI Was Rejected: - Insufficient accuracy (70-80% not enough for life/death) - Bias concerns (perpetuate historical inequities) - Legal risks (disability discrimination) - Trust and legitimacy issues

What Hospitals Did Instead: - Human clinical assessment with ethical oversight - Triage officers (experienced clinicians) - Appeals process - Re-evaluation every 48-120 hours

Lesson: Some decisions should remain human. AI can inform but not decide life-or-death allocation.

→ Read full case study


Part V: Population Health and Health Equity

Case Study 10: Allegheny Family Screening Tool - Algorithmic Child Welfare

Context: Risk assessment for child welfare referrals (used since 2016)

Performance: - Predicts child removal risk (AUC 0.76) - Used by caseworkers to prioritize investigations

Bias Found: - Black families scored 1.4 points higher on average - 47% false positive rate for Black families vs 37% for White families

The Feedback Loop Problem: - Historical over-surveillance of Black/poor families - More system contact → Higher risk scores - Higher scores → More investigation - Cycle perpetuates

Responses: - Public documentation and transparency - Community engagement - Regular fairness audits - Human override capability maintained

Ongoing Debate: - Supporters: More consistent than human bias alone - Critics: Automates and scales existing discrimination - Both perspectives have validity

Lesson: Historical bias in data perpetuates inequality. Transparency and community input essential.

→ Read full case study


Case Study 11: UK NHS AI - Revealing Systemic Racism

What Made This Different: - AI identified disparities in HUMAN care delivery, not AI decisions - Used as diagnostic tool for systemic racism - Findings led to concrete policy changes

Disparities Found: - Black patients: 2.5x mortality rate (1.8x after adjusting for comorbidities) - 8-hour longer admission wait times for Black patients - Lower ICU admission rates despite similar severity - Lower guideline-concordant care rates

NHS Response (Interventions): - Enhanced translation services (24/7 availability) - Cultural competency training (mandatory) - Community health workers - Care pathway standardization - Real-time disparity monitoring dashboards

Results After 2 Years: - ✅ Admission disparities reduced 40% - ✅ ICU access disparities reduced 25% - ✅ Mortality disparities reduced 15% - Still work to do, but measurable progress

Lesson: AI can expose systemic problems for intervention. Used correctly, it’s a tool for justice, not just a source of bias.

→ Read full case study


Part VI: Health Economics

Case Study 12: AI-Driven Hospital Bed Allocation

Johns Hopkins Implementation (2018-2022):

Challenge: Balance competing objectives: - Efficiency (maximize utilization) - Access (minimize wait times) - Quality (appropriate care level) - Equity (fair access across populations)

Results: - ✅ Bed utilization: 82% → 88% (+6 percentage points) - ✅ ED wait times: 4.2 → 3.0 hours (28% reduction) - ✅ Ambulance diversions: 45% reduction - ✅ Elective surgery delays: 35% reduction

Economic Impact: - 3-Year ROI: 2,054% - Total costs: $650,000 - Total benefits: $14,004,000 - Net benefit: $13,354,000 - Payback period: 2.3 months

Equity Impact: - REDUCED racial disparities by 80%+ - Fairness constraints embedded in optimization - Wait time disparities: Black patients +1.2 hours → +0.2 hours

Replication: - Mayo Clinic (2020) - Cleveland Clinic (2021) - Mass General Brigham (2022) - 50+ other hospitals

Lesson: Optimization with explicit fairness constraints delivers both efficiency and equity

→ Read full case study


Part VII: Mental Health AI

Case Study 13: Crisis Text Line - AI Triage for Suicide Prevention

Context: - 100,000+ crisis texts monthly - 48,000 suicide deaths/year in US - Minutes matter in prevention

Impact: - ✅ Wait times for highest-risk: 45 min → 3 min (93% reduction) - ✅ Sensitivity: 92% (detecting high-risk) - ✅ Estimated 250 lives saved over 7 years (conservative) - ⚠️ False negative rate: 8% (concerning but unavoidable with current technology)

Safety Features: - Multiple screening layers (keywords → ML → human counselor) - Conservative thresholds (high sensitivity, accept some false positives) - Human counselor maintains final authority - Continuous conversation monitoring - Supervisor alerts for escalation

Counselor Impact: - 40% efficiency increase - Better workload management - Reduced burnout - Context provided before conversation

Challenges: - False negatives (8% miss high-risk individuals) - Privacy concerns (AI analyzing sensitive content) - Bias risks (addressed through continuous auditing) - Preventing over-reliance (training emphasizes human judgment)

National Replication: - National Suicide Prevention Lifeline (US) - Samaritans (UK) - Lifeline Australia - Crisis Services Canada

Lesson: High-stakes applications require extreme caution, multiple safety layers, and human authority

→ Read full case study


Part VIII: Drug Discovery

Case Study 14: AlphaFold and AI-Accelerated Drug Discovery

The AlphaFold Breakthrough: - Solved 50-year protein folding problem - CASP14 competition: 92.4% median accuracy - Hours of computation vs months of lab work - Democratized structural biology

AI Drug Discovery Progress (as of 2024): - ✅ ~30 AI-discovered drugs in clinical trials - ✅ Discovery timeline: 50-60% faster (4-6 years → 2-3 years) - ✅ Cost reduction: 60-70% in preclinical phase - ❌ Zero approved drugs yet (takes 10+ years)

Where AI Helped: - ✅ Virtual screening (10-100x faster) - ✅ Lead optimization (predict properties) - ✅ Target identification (multi-omics analysis) - ✅ Protein structure prediction (game-changer)

Where AI Fell Short of Hype: - ❌ “AI eliminates need for chemists” → Still need expert chemists - ❌ “AI drugs have higher success rates” → Too early to tell - ❌ “AI eliminates animal testing” → Still required by regulators - ❌ “10x faster overall” → More like 2-3x (clinical trials not faster)

Real Examples: - Exscientia DSP-1181: First AI drug through Phase 1 (cancer immunotherapy) - Insilico ISP001: Phase 2 for pulmonary fibrosis (18-month discovery) - BenevolentAI BN01: Phase 2 for atopic dermatitis - Relay Therapeutics RLX030: Phase 1 for cancer

Economic Reality: - $20+ billion invested (2018-2023) - Zero approved drugs yet (still in investment phase) - Company valuations declined 40-60% (2021-2023 market correction) - First approvals expected 2025-2027

Lesson: Real progress, but less revolutionary than hyped. AI is powerful tool, not magic. Experimental validation still essential.

→ Read full case study


Part IX: Rural Health

Case Study 15: Project ECHO + AI - Democratizing Specialist Expertise

Context: - 60 million Americans live in rural areas - 2x longer specialist wait times - Many drive 100+ miles for care - Rural mortality rates 20% higher than urban

The ECHO Model: - Hub-and-spoke (specialists mentor PCPs) - Case-based learning - “Moving knowledge, not patients” - Community of practice

AI Enhancements: - Clinical decision support for PCPs - Automated case classification - Remote monitoring with AI triage - Predictive analytics for high-risk patients

New Mexico Pilot Results (2020-2023):

Access Improvements: - ✅ PCP confidence: 4.2 → 7.8 out of 10 (+86%) - ✅ Cases managed locally: 45% → 72% (+27 points) - ✅ Specialist referrals: -38% reduction - ✅ Wait times: 6.5 → 2.1 weeks (for cases still needing specialist)

Clinical Outcomes: - ✅ Diabetes control: 32% → 51% at goal (+19 points) - ✅ Hypertension control: 48% → 64% at goal (+16 points) - ✅ Hepatitis C cure rate: 67% → 89% (+22 points) - ✅ Hospitalization rate: -23% reduction

Economic Impact: - 3-Year ROI: 840% - Cost per patient/year: $8,500 (traditional) → $6,100 (ECHO+AI) - Savings: $2,400 per patient per year - Total savings: $32.4 million (45,000 patients over 3 years)

Provider Impact: - Satisfaction: 6.2 → 8.7 out of 10 - Burnout: 58% → 34% reporting burnout

Patient Impact: - No more 3-hour drives to specialists - Local care with specialist backing - Satisfaction: 7.1 → 8.9 out of 10

National Scale: - Now in 120 clinics across 10 states - ~200,000 patients reached - CMS Innovation Award: $50M for national expansion - 15 states cover via Medicaid

Lesson: Technology + human networks > Either alone. Sustainable model with clear ROI and equity benefits.

→ Read full case study


Key Themes Across All 15 Cases

1. Technical Success ≠ Clinical Impact

Evidence: DeepMind AKI (AUC 0.92, no outcome change), COVID models (high accuracy, low clinical use)

Implication: Must measure patient-centered endpoints, not just algorithm performance


2. External Validation is Mandatory

Evidence: Mammography AI (internal AUC 0.95 → external 0.82), COVID models (98% high bias)

Implication: Internal test performance always overestimates real-world. Must validate on different populations, sites, time periods.


3. Fairness Requires Active Design

Evidence: Allegheny (47% FPR Black families), Hospital beds (fairness constraints reduced disparities 80%)

Implication: Algorithms perpetuate bias unless explicitly designed for fairness. Regular auditing essential.


4. Human-AI Collaboration Optimal

Evidence: ProMED (human+AI >90% accuracy), ECHO+AI (840% ROI), Crisis Text Line (250 lives saved)

Implication: AI provides scale and consistency, humans provide judgment and accountability. Hybrid > either alone.


5. Context Matters Profoundly

Evidence: Mammography AI (performance varies by equipment), Ventilators (hospitals rejected despite technical feasibility)

Implication: Same algorithm performs differently in different settings. Must adapt to local context.


6. Economic Value Can Be Substantial

Evidence: Hospital beds (2,054% ROI), ECHO+AI (840% ROI), Crisis Text Line (cost-effective at $1,444/QALY)

Implication: When done right, ROI often >500%. But requires upfront investment and sustainable payment model.


7. Implementation is Half the Battle

Evidence: DeepMind AKI (poor workflow integration), ECHO+AI (training and change management critical)

Implication: Algorithm quality insufficient. Must address change management, training, workflow integration.


8. Transparency Builds Trust

Evidence: Allegheny (public documentation), ProMED (human verification visible), NHS (open disparity reporting)

Implication: Explainable AI preferred by clinicians. Public documentation increases accountability.


9. Continuous Monitoring Required

Evidence: Google Flu Trends (model drift), IDx-DR (post-market surveillance), Hospital beds (COVID adaptation)

Implication: Performance degrades over time. Need ongoing evaluation and model updates.


10. Some Decisions Should Remain Human

Evidence: Ventilator allocation (hospitals chose humans), Crisis Text Line (counselor maintains authority)

Implication: Life-or-death decisions require human judgment. AI should inform, not decide.


Outcome Distribution

Clear Successes (5 cases - 33%): 1. BlueDot - Early detection 2. ProMED - Human-AI collaboration 3. IDx-DR - FDA-approved autonomous diagnostic 4. Hospital Bed Allocation - Economic + equity wins 5. Project ECHO + AI - Rural health transformation

Partial Successes (6 cases - 40%): 6. Google Flu Trends - Recovered from failure 7. Mammography AI - Works in some contexts (Sweden) 8. Crisis Text Line - High impact but 8% false negatives 9. AlphaFold - Discovery faster but not approved drugs yet 10. NHS Disparity Detection - Revealed problems, partial fixes 11. Allegheny - More consistent but perpetuates some bias

Instructive Failures (4 cases - 27%): 12. DeepMind AKI - Technical success, clinical failure 13. Sepsis RL - Observational data limitations 14. COVID Models - 98% high bias 15. Ventilator Allocation - Ethical concerns outweighed benefits


Using This Appendix

For Students

  • Start here: Read cases relevant to your interests
  • Study implementations: All cases include working Python code
  • Analyze outcomes: What worked vs what didn’t, and why
  • Extract lessons: Apply to your own projects

For Practitioners

  • Before implementation: Review cases in your domain
  • Learn from mistakes: Study the failures to avoid repeating them
  • Adapt code: Use examples as starting templates
  • Evaluate properly: Follow validation frameworks demonstrated

For Researchers

  • Identify gaps: What hasn’t been studied yet?
  • Deep dives: Follow references for comprehensive literature
  • Benchmark your work: Compare to these real-world results
  • Contribute evidence: Help build the evidence base

For Policymakers

  • Understand impact: See real-world effects, not just promises
  • Evidence-based policy: Design regulations based on actual outcomes
  • Prioritize investments: Focus on proven ROI models
  • Equity focus: Learn from both successes (NHS, Hospital beds) and failures (Allegheny)

Content Metrics

Coverage

  • Geographic: US (10 cases), UK (3 cases), Global (2 cases)
  • Domains: 9 distinct domains covered
  • Time span: 2008-2024 (16 years of AI evolution)
  • Sample size: 30+ real-world implementations documented

Technical Depth

  • Code examples: 50+ complete Python implementations
  • Lines of code: ~5,000+ lines across all cases
  • Algorithms covered: CNN, RNN, RL, NLP, optimization, ensemble methods
  • Frameworks: TensorFlow, PyTorch, scikit-learn, XGBoost, SHAP, Fairlearn

Evidence Base

  • Peer-reviewed references: 100+ papers cited
  • Clinical trials: Multiple RCTs and observational studies
  • Economic analyses: 5+ ROI/cost-effectiveness studies
  • Fairness audits: 4 comprehensive bias analyses

Updates and Contributions

This is a living document. For updates, corrections, or to suggest additional case studies:

  • GitHub Issues: Report errors or suggest cases
  • Email: appendix@ai-public-health.org
  • Web: https://www.ai-public-health.org/cases

Last Updated: January 2025 Version: 1.0


Citation

If you use these case studies in your work, please cite:

@incollection{ai_public_health_cases_2025,
  title = {Case Study Library: Real-World AI in Public Health},
  booktitle = {The Public Health AI Handbook},
  author = {[Author Name]},
  year = {2025},
  publisher = {[Publisher]},
  chapter = {Appendix B},
  url = {https://www.ai-public-health.org}
}

Next Steps

Ready to dive deeper?

→ Read Complete Case Studies - Full technical details, code, and analysis

→ Code Repository Guide - Access companion code and examples

→ Further Reading - Curated resources for continued learning


This overview provides navigation and synthesis. For complete technical details, methodology, code implementations, and comprehensive analysis, see the full case studies.