Appendix D — Case Study Library - Overview

A collection of 15 real-world AI implementations in public health, examining successes, failures, and lessons learned. Each case study provides technical depth, practical insights, and evidence-based recommendations.

How to Use This Appendix
  • Quick scan? Review the summary table and key themes below
  • Specific domain? Jump to relevant sections using the navigation links
  • Deep dive? Read the complete case studies
  • Code implementation? All cases include working Python examples

The Big Picture: This appendix examines 15 real-world AI implementations in public health spanning 2008-2024, documenting successes, failures, and critical lessons. Reality check: only 33% were clear successes, 40% partial successes, and 27% instructive failures. The gap between laboratory performance and clinical impact is substantial and predictable.

Outcome Distribution:

  • Clear Successes (5 cases): BlueDot COVID-19 detection (9 days before WHO), ProMED human-AI collaboration (50K reports/year, >90% accuracy), IDx-DR FDA-approved autonomous diagnostic, hospital bed allocation (2,054% ROI and 80% disparity reduction), Project ECHO + AI rural health (840% ROI)
  • Partial Successes (6 cases): Google Flu Trends (recovered from a 135% error through a hybrid approach), mammography AI (works in some contexts; 10-15% external performance drop), Crisis Text Line (250 lives saved but 8% false negatives), AlphaFold drug discovery (faster discovery but zero approvals yet), NHS disparity detection (revealed problems; partial fixes), Allegheny Family Screening Tool (more consistent than humans alone but perpetuates some bias)
  • Instructive Failures (4 cases): DeepMind AKI (AUC 0.92 but 90% of alerts ignored, zero outcome improvement), sepsis RL (observational-data confounding), COVID models (98% at high risk of bias, minimal clinical use), ventilator allocation (hospitals rejected AI for life-and-death decisions)

Top 10 Lessons Across All Cases:

  1. Technical success ≠ clinical impact - DeepMind AKI: 0.92 AUC, no patient outcome change
  2. External validation mandatory - Internal performance consistently overestimates real-world results by 10-15 percentage points
  3. Fairness requires active design - Algorithms perpetuate bias unless explicitly constrained (Allegheny 47% FPR Black families vs hospital beds 80% disparity reduction with fairness constraints)
  4. Human-AI collaboration optimal - ProMED, ECHO+AI, Crisis Text Line all outperform either alone
  5. Context matters profoundly - Same algorithm performs differently across settings (equipment, workflows, populations)
  6. Economic value can be substantial - Hospital beds 2,054% ROI, ECHO+AI 840% ROI when done right (but requires upfront investment)
  7. Implementation is half the battle - Algorithm quality insufficient without workflow integration, training, change management
  8. Transparency builds trust - Explainable AI preferred, public documentation increases accountability
  9. Continuous monitoring required - Performance degrades over time (Google Flu Trends model drift)
  10. Some decisions should remain human - Ventilator allocation and Crisis Text Line's human final authority show that life-or-death decisions require human judgment

Coverage:

  • 9 domains: surveillance, diagnostics, treatment, resource allocation, population health, health economics, mental health, drug discovery, rural health
  • Over 50 code implementations in Python (TensorFlow, PyTorch, scikit-learn, XGBoost, SHAP, Fairlearn)
  • Over 100 peer-reviewed papers cited
  • 16 years of AI evolution (2008-2024)

Key Insight for Practitioners: External validation reveals the truth: internal test sets consistently overestimate performance by 10-15 percentage points. If you don’t validate externally (different sites, populations, equipment, time periods), expect failure at deployment. The cases that succeeded all had extensive external validation; those that failed skipped this step.

Economic Reality: ROI ranges from 840% (ECHO+AI) to 2,054% (hospital beds) for successful implementations, but requires 3-5 year investment horizon, sustainable payment models, and organizational commitment. Failed implementations lose investment entirely.

Fairness Reality: Historical bias in data perpetuates inequality unless actively mitigated. Allegheny (no fairness constraints): 47% FPR Black families. Hospital beds (explicit fairness constraints): 80% disparity reduction. The difference is intentional design, not algorithm choice.

Takeaway: Study failures before building. The 4 instructive failures (DeepMind AKI, Sepsis RL, COVID models, Ventilator allocation) share common pitfalls: poor workflow integration, observational data confounding, inadequate validation, inappropriate use cases. Avoiding these errors is more valuable than chasing the latest algorithms.


Summary: All 15 Case Studies at a Glance

#  | Case Study                   | Domain            | Outcome                             | Key Metric                | Lines
---|------------------------------|-------------------|-------------------------------------|---------------------------|-------
1  | BlueDot COVID-19 Detection   | Surveillance      | Success                             | 9 days before WHO         | ~600
2  | Google Flu Trends            | Surveillance      | Failure → Recovery                  | 135% error (2013)         | ~550
3  | ProMED + HealthMap           | Surveillance      | Success                             | 50K reports/year          | ~500
4  | IDx-DR Autonomous Diagnostic | Diagnostics       | Success                             | FDA approved 2018         | ~700
5  | DeepMind AKI Prediction      | Diagnostics       | Technical success, clinical failure | 90% alerts ignored        | ~650
6  | Breast Cancer AI             | Diagnostics       | Mixed                               | Performance varies 10-15% | ~900
7  | Sepsis RL Treatment          | Treatment         | Controversial                       | RCT needed                | ~900
8  | COVID-19 Prediction Models   | Treatment         | Mostly failed                       | 98% high bias             | ~700
9  | Ventilator Allocation        | Resources         | Rejected                            | Hospitals chose humans    | ~900
10 | Allegheny Child Welfare      | Population Health | Controversial                       | 47% FPR (Black families)  | ~1,100
11 | UK NHS Disparity Detection   | Population Health | Success                             | 40% disparity reduction   | ~900
12 | Hospital Bed Allocation      | Health Economics  | Success                             | 2,054% ROI                | ~1,200
13 | Crisis Text Line             | Mental Health     | Success                             | 250 lives saved (est.)    | ~1,100
14 | AlphaFold Drug Discovery     | Drug Discovery    | Partial success                     | 30 drugs in trials        | ~1,500
15 | Project ECHO + AI            | Rural Health      | Success                             | 840% ROI                  | ~1,300

Total: ~12,500 lines | Over 50 code implementations | Over 100 references


Part I: Disease Surveillance and Outbreak Detection

Case Study 1: BlueDot - Early COVID-19 Detection

Context: AI-powered global disease surveillance using news reports, flight data, and climate patterns.

Key Achievement: - Detected COVID-19 outbreak December 31, 2019 (9 days before WHO, 6 days before CDC) - Predicted initial spread destinations (Bangkok, Hong Kong, Tokyo, Seoul, Singapore) - Alerted clients immediately

Limitations: - Couldn’t predict pandemic severity or spread dynamics - Early warning alone insufficient without policy action

Technical Highlights: - Multi-source data integration (news in 65 languages, flight networks, climate data) - NLP for disease mention extraction - Geographic risk scoring - 24/7 automated monitoring
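The pipeline can be sketched in miniature. The snippet below is an illustrative toy, not BlueDot's system: the keyword pattern, passenger figures, and scoring formula are all hypothetical, and a production system would use multilingual NLP models rather than a regex.

```python
import re

# Illustrative disease-mention matcher; a real system uses multilingual
# NLP models, not a keyword list.
DISEASE_TERMS = re.compile(r"pneumonia|influenza|outbreak|cluster", re.I)

def extract_mentions(report: str) -> int:
    """Count disease-related mentions in one news report."""
    return len(DISEASE_TERMS.findall(report))

def destination_risk(case_signal: float, monthly_passengers: dict) -> dict:
    """Score likely spread destinations by signal strength x travel volume."""
    total = sum(monthly_passengers.values())
    return {city: case_signal * volume / total
            for city, volume in monthly_passengers.items()}

report = "Officials report a cluster of pneumonia cases of unknown cause."
signal = extract_mentions(report)
risks = destination_risk(signal, {"Bangkok": 60_000,
                                  "Hong Kong": 55_000,
                                  "Tokyo": 45_000})
```

Ranking destinations by combined signal and flight volume is the mechanism behind the Bangkok/Hong Kong/Tokyo predictions above.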

Lesson: AI excels at early detection, not prediction of impact

→ Read full case study


Case Study 2: Google Flu Trends - Rise and Fall

Context: Search query-based flu surveillance (2008-2015)

Timeline: - 2008-2011: Accurate predictions (correlation 0.90 with CDC data) - 2012-2013: Catastrophic failure (135% overprediction) - 2014-2015: Recovery through hybrid human-AI approach

Why It Failed: - Algorithm opacity (correlation without causation) - Search behavior changes (media coverage effects) - No update mechanism (model drift) - Overfitting to data artifacts

Recovery Strategy: - Combined with CDC data (ensemble approach) - Increased transparency - Regular recalibration - Acknowledged limitations
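The hybrid recovery idea can be sketched as follows. This is an illustrative reconstruction, not Google's actual method: the search-based estimate is blended with the latest (lagged) CDC value, and the blend weight is re-chosen from recent errors, which is what "regular recalibration" amounts to.

```python
def hybrid_estimate(search_est: float, lagged_cdc: float, w: float) -> float:
    """Blend the search-based estimate with the latest (lagged) CDC value."""
    return w * search_est + (1.0 - w) * lagged_cdc

def recalibrate(search_hist, truth_hist, grid=(0.0, 0.25, 0.5, 0.75, 1.0)):
    """Choose the blend weight with the lowest mean absolute error on
    recent weeks; truth_hist[i - 1] plays the role of the lagged CDC input."""
    def mae(w):
        errors = [abs(hybrid_estimate(s, truth_hist[i - 1], w) - truth_hist[i])
                  for i, s in enumerate(search_hist) if i > 0]
        return sum(errors) / len(errors)
    return min(grid, key=mae)

# 2012-13 failure mode: the search signal runs ~2x hot; CDC data lags a
# week but is unbiased. Recalibration shifts most weight onto the CDC anchor.
truth = [1.0, 1.2, 1.5, 1.9, 2.4]
search = [2.0, 2.4, 3.0, 3.8, 4.8]
best_weight = recalibrate(search, truth)
```

With a biased search signal, the chosen weight lands mostly on the official data, exactly the ensemble behavior described above.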

Lesson: Simple correlations fail; need robust, interpretable models with update mechanisms

→ Read full case study


Case Study 3: ProMED-mail + HealthMap - Human-AI Collaboration

Context: Hybrid human-AI disease surveillance system

Success Factors: - Processes over 50,000 reports annually - Human experts verify AI classifications - Maintained >90% accuracy while scaling - Multi-language support (English, Spanish, French, Russian, Chinese, Arabic, Portuguese)
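The division of labor can be sketched as a confidence-thresholded router; the threshold, labels, and field names below are hypothetical, not ProMED's implementation.

```python
def route_report(ai_label: str, confidence: float, threshold: float = 0.90):
    """Publish high-confidence AI classifications automatically; queue the
    rest for a human moderator so accuracy holds as volume scales."""
    lane = "auto_publish" if confidence >= threshold else "human_review"
    return {"lane": lane, "label": ai_label}

incoming = [("avian influenza H5N1", 0.97),
            ("undiagnosed pneumonia cluster", 0.55)]
routed = [route_report(label, conf) for label, conf in incoming]
```

The AI absorbs the report volume; only the uncertain residue reaches the expert queue.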

Key Innovation: AI handles volume, humans provide expertise

Lesson: Humans + AI > Either alone

→ Read full case study


Part II: Diagnostic AI

Case Study 4: IDx-DR - First Autonomous AI Diagnostic

Historic Achievement: - First FDA-authorized autonomous diagnostic AI (April 2018) - Can diagnose without clinician interpretation

Performance: - Sensitivity: 87.2% - Specificity: 90.7% - Clinical trial: 900 patients, 10 sites
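For reference, the two headline metrics come straight from confusion-matrix counts. The counts below are hypothetical, chosen only to mirror the reported operating point; they are not the trial's actual figures.

```python
def sensitivity(tp: int, fn: int) -> float:
    """Of patients who truly have disease, the fraction the system flags."""
    return tp / (tp + fn)

def specificity(tn: int, fp: int) -> float:
    """Of patients without disease, the fraction the system clears."""
    return tn / (tn + fp)

# Hypothetical confusion-matrix counts (NOT the trial data), chosen to
# land near the reported 87.2% / 90.7% operating point.
tp, fn, tn, fp = 192, 28, 631, 65
sens = sensitivity(tp, fn)   # ~0.873
spec = specificity(tn, fp)   # ~0.907
```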

Real-World Challenges: - Performance lower than in trials (image quality issues) - Scaling deployment across over 100 primary care clinics - Reimbursement challenges

Regulatory Pathway: - FDA De Novo classification - Extensive clinical validation - Post-market surveillance requirements

Lesson: Autonomous AI possible but requires extensive validation and monitoring

→ Read full case study


Case Study 5: DeepMind AKI - Clinical Failure Despite Technical Success

The Paradox: - Technical performance: AUC 0.92 for AKI prediction (48-hour advance warning) - Clinical impact: No change in patient outcomes - Alert fatigue: 90% of alerts not acted upon
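The alert-fatigue arithmetic follows from Bayes' rule: at low event prevalence, even a strong model produces mostly false alarms. The operating point below is illustrative, not DeepMind's published numbers.

```python
def ppv(sens: float, spec: float, prevalence: float) -> float:
    """Positive predictive value via Bayes' rule: of all alerts fired,
    the fraction that are true events."""
    true_alerts = sens * prevalence
    false_alerts = (1.0 - spec) * (1.0 - prevalence)
    return true_alerts / (true_alerts + false_alerts)

# Illustrative operating point: a strong model (90% sens, 90% spec)
# applied to a rare event (2% of patient-days).
alert_precision = ppv(sens=0.90, spec=0.90, prevalence=0.02)
# ~0.155: roughly 5 of every 6 alerts are false alarms.
```

Low alert precision, not model accuracy, is what drives clinicians to start ignoring the pager.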

Why It Failed: - Alerts not actionable (no clear intervention specified) - Wrong timing (too early or too late for intervention) - Poor clinical workflow integration - Nurse/physician resistance (trust issues)

Critical Insight: Technical accuracy ≠ clinical utility

Lessons for Future: - Design WITH clinicians, not FOR them - Provide actionable recommendations, not just predictions - Integrate into existing workflows - Clear value proposition required

→ Read full case study


Case Study 6: Breast Cancer Detection - Inconsistent Results

Multiple Systems Evaluated: - Google Health/DeepMind - Lunit INSIGHT MMG - iCAD ProFound AI

Performance Variability: - Internal validation: AUC 0.94-0.96 - External validation: AUC drops 10-15 percentage points - Equipment matters: Different manufacturers → different performance

Success Story: - Sweden (Lund): 44% radiologist workload reduction, maintained detection rate - Key: AI as concurrent reader, not replacement

Lesson: Internal validation insufficient; need multi-site, multi-equipment testing

→ Read full case study


Part III: Treatment Optimization

Case Study 7: Sepsis Treatment - AI-RL Controversy

The AI Clinician (MIT): - Learned treatment policy from 100,000 ICU patients - Recommended less fluid than standard guidelines - Controversial: Observational data biased

The Confounding Problem: - Sicker patients receive more aggressive treatment → worse outcomes - AI learns: More treatment → Worse outcomes (confounded!) - Reality: Treatment couldn’t overcome initial severity
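The confounding mechanism can be demonstrated with a toy simulation in which treatment truly helps, yet a naive comparison makes it look harmful. All parameters are invented for illustration; this is not the AI Clinician's data or model.

```python
import random

random.seed(0)

def simulate(n: int = 20_000):
    """Toy ICU world: sicker patients get fluid more often AND die more
    often, while (by construction) fluid itself lowers mortality."""
    rows = []
    for _ in range(n):
        severity = random.random()               # 0 = well, 1 = critical
        treated = random.random() < severity     # sicker -> more treatment
        p_death = 0.6 * severity - (0.1 if treated else 0.0)
        died = random.random() < max(p_death, 0.0)
        rows.append((severity, treated, died))
    return rows

def death_rate(rows):
    return sum(died for _, _, died in rows) / len(rows)

data = simulate()
treated = [r for r in data if r[1]]
untreated = [r for r in data if not r[1]]
# Naive comparison: treatment looks harmful (the confounded conclusion)...
naive_harm = death_rate(treated) > death_rate(untreated)
# ...but within a high-severity stratum the true benefit reappears.
sick = [r for r in data if r[0] > 0.7]
stratified_benefit = (death_rate([r for r in sick if r[1]])
                      < death_rate([r for r in sick if not r[1]]))
```

An RL agent trained on the naive association would learn "less treatment is better," which is exactly the controversy above.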

Current Status: - Randomized trials underway (SMARTT, AISEPSIS) - Results expected 2024-2025 - Industry watching closely

Lesson: Reinforcement learning on observational data = hypothesis-generating, not practice-changing (until validated in RCT)

→ Read full case study


Case Study 8: COVID-19 Prediction Models - Limited Impact

The Pandemic Rush: - 232 COVID models published by October 2020 - 98% had high risk of bias - Only 1 externally validated with low bias - Most never used clinically

Common Problems: - Small sample sizes (<500 patients) - Lack of external validation - Poor reporting standards - Overfitting

Models That Worked: - 4C Mortality Score (UK): 35,000 patients, multiple sites, simple and interpretable - ISARIC-4C: 75,000 patients, properly validated
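The models that worked share a simple, point-based pattern, which can be sketched as an additive score. The items, weights, and cut-offs below are hypothetical illustrations of the pattern, not the actual 4C coefficients.

```python
def mortality_points(age: int, resp_rate: int, crp: float,
                     comorbidities: int) -> int:
    """Additive point score in the style of 4C-like tools. Items, weights,
    and thresholds here are hypothetical, not the published coefficients."""
    points = 0
    points += 0 if age < 50 else (2 if age < 70 else 4)
    points += 1 if resp_rate >= 24 else 0
    points += 1 if crp >= 100 else 0
    points += min(comorbidities, 2)
    return points

def risk_band(points: int) -> str:
    return "low" if points <= 2 else ("intermediate" if points <= 5 else "high")

high = risk_band(mortality_points(age=78, resp_rate=26, crp=150, comorbidities=2))
low = risk_band(mortality_points(age=40, resp_rate=16, crp=20, comorbidities=0))
```

A clinician can audit every point by eye, which is a large part of why such scores were trusted and used.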

Lesson: Urgency doesn’t justify poor methods. Simple, validated models > complex, unvalidated ones

→ Read full case study


Part IV: Resource Allocation

Case Study 9: Ventilator Allocation - Ethics Meets AI

The Dilemma: - COVID-19 ventilator shortages required triage decisions - AI systems proposed for allocation - Most hospitals rejected AI-driven allocation

The Trilemma (cannot maximize all three):

  1. Utility (save the most lives)
  2. Fairness (equal treatment)
  3. Autonomy (individual rights)

Why AI Was Rejected: - Insufficient accuracy (70-80% not enough for life/death) - Bias concerns (perpetuate historical inequities) - Legal risks (disability discrimination) - Trust and legitimacy issues

What Hospitals Did Instead: - Human clinical assessment with ethical oversight - Triage officers (experienced clinicians) - Appeals process - Re-evaluation every 48-120 hours

Lesson: Some decisions should remain human. AI can inform but not decide life-or-death allocation.

→ Read full case study


Part V: Population Health and Health Equity

Case Study 10: Allegheny Family Screening Tool - Algorithmic Child Welfare

Context: Risk assessment for child welfare referrals (used since 2016)

Performance: - Predicts child removal risk (AUC 0.76) - Used by caseworkers to prioritize investigations

Bias Found: - Black families scored 1.4 points higher on average - 47% false positive rate for Black families vs 37% for White families

The Feedback Loop Problem: - Historical over-surveillance of Black/poor families - More system contact → Higher risk scores - Higher scores → More investigation - Cycle perpetuates
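The feedback loop can be made concrete with a toy iteration: two families with identical dynamics but different starting contact counts diverge, because investigation itself generates new contact records. All numbers and the threshold are illustrative, not the actual tool's logic.

```python
def feedback_loop(initial_contacts: float, steps: int = 5,
                  investigation_threshold: float = 2.0) -> list:
    """Toy dynamic: the risk score IS the count of prior system contacts;
    scoring above threshold triggers an investigation, and the
    investigation itself adds a new contact record for next time."""
    contacts = initial_contacts
    history = [contacts]
    for _ in range(steps):
        if contacts >= investigation_threshold:   # "high risk" -> investigate
            contacts += 1.0                       # investigation adds a record
        history.append(contacts)
    return history

# Identical dynamics, different starting exposure to the system:
over_surveilled = feedback_loop(initial_contacts=2.0)   # climbs every step
baseline = feedback_loop(initial_contacts=1.0)          # never flagged
```

Nothing about the two trajectories differs except historical exposure; the score then manufactures the very evidence it reads.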

Responses: - Public documentation and transparency - Community engagement - Regular fairness audits - Human override capability maintained

Ongoing Debate: - Supporters: More consistent than human bias alone - Critics: Automates and scales existing discrimination - Both perspectives have validity

Lesson: Historical bias in data perpetuates inequality. Transparency and community input essential.

→ Read full case study


Case Study 11: UK NHS AI - Revealing Systemic Racism

What Made This Different: - AI identified disparities in HUMAN care delivery, not AI decisions - Used as diagnostic tool for systemic racism - Findings led to concrete policy changes

Disparities Found: - Black patients: 2.5x mortality rate (1.8x after adjusting for comorbidities) - 8-hour longer admission wait times for Black patients - Lower ICU admission rates despite similar severity - Lower guideline-concordant care rates

NHS Response (Interventions): - Enhanced translation services (24/7 availability) - Cultural competency training (mandatory) - Community health workers - Care pathway standardization - Real-time disparity monitoring dashboards
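At its core, a real-time disparity dashboard reduces to rate ratios with a review threshold. The figures and tolerance below are illustrative, not NHS data.

```python
def disparity_ratio(group_events: int, group_pop: int,
                    ref_events: int, ref_pop: int) -> float:
    """Crude rate ratio: monitored group's event rate over the reference's."""
    return (group_events / group_pop) / (ref_events / ref_pop)

def dashboard_flag(ratio: float, tolerance: float = 1.25) -> bool:
    """Flag the metric for review once the disparity exceeds tolerance."""
    return ratio > tolerance

# Illustrative monthly counts (not NHS data): a 2.5x mortality ratio.
ratio = disparity_ratio(group_events=50, group_pop=1_000,
                        ref_events=20, ref_pop=1_000)
flagged = dashboard_flag(ratio)
```

Tracking the same ratio month over month is what turns a one-off audit into continuous monitoring.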

Results After 2 Years: - Admission disparities reduced 40% - ICU access disparities reduced 25% - Mortality disparities reduced 15% - Still work to do, but measurable progress

Lesson: AI can expose systemic problems for intervention. Used correctly, it’s a tool for justice, not just a source of bias.

→ Read full case study


Part VI: Health Economics

Case Study 12: AI-Driven Hospital Bed Allocation

Johns Hopkins Implementation (2018-2022):

Challenge: Balance competing objectives: - Efficiency (maximize utilization) - Access (minimize wait times) - Quality (appropriate care level) - Equity (fair access across populations)

Results: - Bed utilization: 82% → 88% (+6 percentage points) - ED wait times: 4.2 → 3.0 hours (28% reduction) - Ambulance diversions: 45% reduction - Elective surgery delays: 35% reduction

Economic Impact: - 3-Year ROI: 2,054% - Total costs: $650,000 - Total benefits: $14,004,000 - Net benefit: $13,354,000 - Payback period: 2.3 months
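The headline ROI follows directly from the cost and benefit totals above:

```python
def roi_percent(total_benefits: float, total_costs: float) -> float:
    """Return on investment: net benefit as a percentage of total cost."""
    return (total_benefits - total_costs) / total_costs * 100.0

# Totals reported for the 3-year Johns Hopkins implementation.
roi = roi_percent(total_benefits=14_004_000, total_costs=650_000)
# roi ~ 2,054%, matching the headline figure.
```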

Equity Impact: - Reduced racial disparities by more than 80% - Fairness constraints embedded in the optimization - Wait-time disparities: Black patients +1.2 hours → +0.2 hours
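Embedding a fairness constraint in an allocation problem can be sketched with a toy two-group model (this is not the Hopkins optimizer): admit as many patients as capacity allows, while keeping the two groups' admission rates within a tolerance of each other.

```python
def allocate_beds(demand_a: int, demand_b: int, capacity: int,
                  max_rate_gap: float = 0.10) -> tuple:
    """Admit as many patients as capacity allows while keeping the two
    groups' admission rates within max_rate_gap of each other."""
    best = (0, 0)
    for a in range(min(demand_a, capacity) + 1):
        b = min(demand_b, capacity - a)
        rate_gap = abs(a / demand_a - b / demand_b)
        if rate_gap <= max_rate_gap + 1e-9 and a + b > sum(best):
            best = (a, b)
    return best

# Without the gap constraint, a greedy policy could serve one group fully
# and leave the other waiting; the constraint forces near-equal access.
admitted_a, admitted_b = allocate_beds(demand_a=100, demand_b=100, capacity=150)
```

Efficiency (beds filled) and equity (rate gap) are both objectives of the same optimization, which is the case study's central design choice.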

Replication: - Mayo Clinic (2020) - Cleveland Clinic (2021) - Mass General Brigham (2022) - Over 50 other hospitals

Lesson: Optimization with explicit fairness constraints delivers both efficiency and equity

→ Read full case study


Part VII: Mental Health AI

Case Study 13: Crisis Text Line - AI Triage for Suicide Prevention

Context: - Over 100,000 crisis texts monthly - 48,000 suicide deaths/year in US - Minutes matter in prevention

Impact: - Wait times for highest-risk: 45 min → 3 min (93% reduction) - Sensitivity: 92% (detecting high-risk) - Estimated 250 lives saved over 7 years (conservative) - False negative rate: 8% (concerning but unavoidable with current technology)

Safety Features: - Multiple screening layers (keywords → ML → human counselor) - Conservative thresholds (high sensitivity, accept some false positives) - Human counselor maintains final authority - Continuous conversation monitoring - Supervisor alerts for escalation
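The layered screening can be sketched as a pipeline in which keyword and model scores are combined under a deliberately low threshold, and the output is only a queue priority, never a final decision. The term list, weights, and threshold are hypothetical, not Crisis Text Line's.

```python
HIGH_RISK_TERMS = {"pills", "tonight", "goodbye"}   # illustrative list only

def keyword_score(message: str) -> float:
    """First screening layer: fraction of high-risk terms present."""
    words = set(message.lower().split())
    return len(words & HIGH_RISK_TERMS) / len(HIGH_RISK_TERMS)

def triage(message: str, ml_risk: float, keyword_weight: float = 0.5) -> dict:
    """Combine keyword and model layers under a deliberately low threshold
    (favoring sensitivity over precision). Output is a queue priority only;
    the human counselor always keeps final authority."""
    score = (keyword_weight * keyword_score(message)
             + (1.0 - keyword_weight) * ml_risk)
    priority = "immediate" if score >= 0.3 else "standard"
    return {"priority": priority, "final_authority": "human_counselor"}

urgent = triage("i took pills tonight", ml_risk=0.9)
routine = triage("rough day at school", ml_risk=0.05)
```

The low threshold is the "conservative" part: it trades false positives for sensitivity, matching the safety posture described above.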

Counselor Impact: - 40% efficiency increase - Better workload management - Reduced burnout - Context provided before conversation

Challenges: - False negatives (8% miss high-risk individuals) - Privacy concerns (AI analyzing sensitive content) - Bias risks (addressed through continuous auditing) - Preventing over-reliance (training emphasizes human judgment)

National Replication: - National Suicide Prevention Lifeline (US) - Samaritans (UK) - Lifeline Australia - Crisis Services Canada

Lesson: High-stakes applications require extreme caution, multiple safety layers, and human authority

→ Read full case study


Part VIII: Drug Discovery

Case Study 14: AlphaFold and AI-Accelerated Drug Discovery

The AlphaFold Breakthrough: - Solved 50-year protein folding problem - CASP14 competition: 92.4% median accuracy - Hours of computation vs months of lab work - Democratized structural biology

AI Drug Discovery Progress (as of 2024): - ~30 AI-discovered drugs in clinical trials - Discovery timeline: 50-60% faster (4-6 years → 2-3 years) - Cost reduction: 60-70% in preclinical phase - Zero approved drugs yet (takes over 10 years)

Where AI Helped: - Virtual screening (10-100x faster) - Lead optimization (predict properties) - Target identification (multi-omics analysis) - Protein structure prediction (major advance)

Where AI Fell Short of Hype: - “AI eliminates need for chemists” → Still need expert chemists - “AI drugs have higher success rates” → Too early to tell - “AI eliminates animal testing” → Still required by regulators - “10x faster overall” → More like 2-3x (clinical trials not faster)

Real Examples: - Exscientia DSP-1181: First AI-designed drug to enter Phase 1 trials (obsessive-compulsive disorder) - Insilico ISP001: Phase 2 for pulmonary fibrosis (18-month discovery) - BenevolentAI BN01: Phase 2 for atopic dermatitis - Relay Therapeutics RLX030: Phase 1 for cancer

Economic Reality: - Over $20 billion invested (2018-2023) - Zero approved drugs yet (still in investment phase) - Company valuations declined 40-60% (2021-2023 market correction) - First approvals expected 2025-2027

Lesson: Real progress, but more modest than hyped. AI is powerful tool, not magic. Experimental validation still essential.

→ Read full case study


Part IX: Rural Health

Case Study 15: Project ECHO + AI - Democratizing Specialist Expertise

Context: - 60 million Americans live in rural areas - 2x longer specialist wait times - Many drive over 100 miles for care - Rural mortality rates 20% higher than urban

The ECHO Model: - Hub-and-spoke (specialists mentor PCPs) - Case-based learning - “Moving knowledge, not patients” - Community of practice

AI Enhancements: - Clinical decision support for PCPs - Automated case classification - Remote monitoring with AI triage - Predictive analytics for high-risk patients

New Mexico Pilot Results (2020-2023):

Access Improvements: - PCP confidence: 4.2 → 7.8 out of 10 (+86%) - Cases managed locally: 45% → 72% (+27 points) - Specialist referrals: 38% reduction - Wait times: 6.5 → 2.1 weeks (for cases still needing a specialist)

Clinical Outcomes: - Diabetes control: 32% → 51% at goal (+19 points) - Hypertension control: 48% → 64% at goal (+16 points) - Hepatitis C cure rate: 67% → 89% (+22 points) - Hospitalization rate: 23% reduction

Economic Impact: - 3-Year ROI: 840% - Cost per patient/year: $8,500 (traditional) → $6,100 (ECHO+AI) - Savings: $2,400 per patient per year - Total savings: $32.4 million (45,000 patients over 3 years)

Provider Impact: - Satisfaction: 6.2 → 8.7 out of 10 - Burnout: 58% → 34% reporting burnout

Patient Impact: - No more 3-hour drives to specialists - Local care with specialist backing - Satisfaction: 7.1 → 8.9 out of 10

National Scale: - Now in 120 clinics across 10 states - ~200,000 patients reached - CMS Innovation Award: $50M for national expansion - Covered by Medicaid in 15 states

Lesson: Technology + human networks > Either alone. Sustainable model with clear ROI and equity benefits.

→ Read full case study


Key Themes Across All 15 Cases

1. Technical Success ≠ Clinical Impact

Evidence: DeepMind AKI (AUC 0.92, no outcome change), COVID models (high accuracy, low clinical use)

Implication: Must measure patient-centered endpoints, not just algorithm performance


2. External Validation is Mandatory

Evidence: Mammography AI (internal AUC 0.95 → external 0.82), COVID models (98% high bias)

Implication: Internal test performance always overestimates real-world. Must validate on different populations, sites, time periods.


3. Fairness Requires Active Design

Evidence: Allegheny (47% FPR Black families), Hospital beds (fairness constraints reduced disparities 80%)

Implication: Algorithms perpetuate bias unless explicitly designed for fairness. Regular auditing essential.
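A minimal fairness audit is just a per-group error-rate comparison, as in this sketch (labels and groups are synthetic):

```python
def false_positive_rate(y_true, y_pred) -> float:
    """FPR: of the true negatives, the fraction incorrectly flagged."""
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return fp / (fp + tn)

def fpr_by_group(y_true, y_pred, groups) -> dict:
    """Minimal fairness audit: FPR computed separately per group."""
    audit = {}
    for g in sorted(set(groups)):
        idx = [i for i, gi in enumerate(groups) if gi == g]
        audit[g] = false_positive_rate([y_true[i] for i in idx],
                                       [y_pred[i] for i in idx])
    return audit

# Synthetic labels: group "a" is flagged twice as often among true negatives.
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
y_pred = [1, 1, 0, 0, 0, 1, 0, 0, 1, 0]
groups = ["a", "a", "a", "a", "b", "b", "b", "b", "a", "b"]
audit = fpr_by_group(y_true, y_pred, groups)   # {"a": 0.5, "b": 0.25}
```

This is the computation behind the 47%-vs-37% FPR finding in the Allegheny case; libraries such as Fairlearn package the same idea with more metrics.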


4. Human-AI Collaboration Optimal

Evidence: ProMED (human+AI >90% accuracy), ECHO+AI (840% ROI), Crisis Text Line (250 lives saved)

Implication: AI provides scale and consistency, humans provide judgment and accountability. Hybrid > either alone.


5. Context Matters Profoundly

Evidence: Mammography AI (performance varies by equipment), Ventilators (hospitals rejected despite technical feasibility)

Implication: Same algorithm performs differently in different settings. Must adapt to local context.


6. Economic Value Can Be Substantial

Evidence: Hospital beds (2,054% ROI), ECHO+AI (840% ROI), Crisis Text Line (cost-effective at $1,444/QALY)

Implication: When done right, ROI often >500%. But requires upfront investment and sustainable payment model.


7. Implementation is Half the Battle

Evidence: DeepMind AKI (poor workflow integration), ECHO+AI (training and change management critical)

Implication: Algorithm quality insufficient. Must address change management, training, workflow integration.


8. Transparency Builds Trust

Evidence: Allegheny (public documentation), ProMED (human verification visible), NHS (open disparity reporting)

Implication: Explainable AI preferred by clinicians. Public documentation increases accountability.


9. Continuous Monitoring Required

Evidence: Google Flu Trends (model drift), IDx-DR (post-market surveillance), Hospital beds (COVID adaptation)

Implication: Performance degrades over time. Need ongoing evaluation and model updates.
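A minimal drift monitor compares recent performance against the validation-time baseline; the window, tolerance, and history below are illustrative values, not from any of the deployed systems.

```python
def rolling_accuracy(outcomes, window: int = 4) -> float:
    """Fraction correct over the most recent `window` predictions
    (1 = correct, 0 = wrong)."""
    recent = outcomes[-window:]
    return sum(recent) / len(recent)

def drift_alert(outcomes, baseline: float, tolerance: float = 0.10,
                window: int = 4) -> bool:
    """Flag for retraining when recent accuracy falls well below the
    validation-time baseline (the Google Flu Trends failure mode)."""
    return rolling_accuracy(outcomes, window) < baseline - tolerance

history = [1, 1, 1, 0, 1, 1, 0, 0, 1, 0]   # performance decaying over time
needs_retraining = drift_alert(history, baseline=0.90)
```

Wiring such a check into routine reporting is the cheapest form of the ongoing evaluation this theme calls for.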


10. Some Decisions Should Remain Human

Evidence: Ventilator allocation (hospitals chose humans), Crisis Text Line (counselor maintains authority)

Implication: Life-or-death decisions require human judgment. AI should inform, not decide.


Outcome Distribution

Clear Successes (5 cases - 33%):

  1. BlueDot - Early detection
  2. ProMED - Human-AI collaboration
  3. IDx-DR - FDA-approved autonomous diagnostic
  4. Hospital Bed Allocation - Economic + equity wins
  5. Project ECHO + AI - Rural health transformation

Partial Successes (6 cases - 40%):

  6. Google Flu Trends - Recovered from failure
  7. Mammography AI - Works in some contexts (Sweden)
  8. Crisis Text Line - High impact but 8% false negatives
  9. AlphaFold - Faster discovery but no approved drugs yet
  10. NHS Disparity Detection - Revealed problems, partial fixes
  11. Allegheny - More consistent but perpetuates some bias

Instructive Failures (4 cases - 27%):

  12. DeepMind AKI - Technical success, clinical failure
  13. Sepsis RL - Observational data limitations
  14. COVID Models - 98% high bias
  15. Ventilator Allocation - Ethical concerns outweighed benefits


Using This Appendix

For Students

  • Start here: Read cases relevant to your interests
  • Study implementations: All cases include working Python code
  • Analyze outcomes: What worked vs what didn’t, and why
  • Extract lessons: Apply to your own projects

For Practitioners

  • Before implementation: Review cases in your domain
  • Learn from mistakes: Study the failures to avoid repeating them
  • Adapt code: Use examples as starting templates
  • Evaluate properly: Follow validation frameworks demonstrated

For Researchers

  • Identify gaps: What hasn’t been studied yet?
  • Deep dives: Follow references for full literature review
  • Benchmark your work: Compare to these real-world results
  • Contribute evidence: Help build the evidence base

For Policymakers

  • Understand impact: See real-world effects, not just promises
  • Evidence-based policy: Design regulations based on actual outcomes
  • Prioritize investments: Focus on proven ROI models
  • Equity focus: Learn from both successes (NHS, Hospital beds) and failures (Allegheny)

Content Metrics

Coverage

  • Geographic: US (10 cases), UK (3 cases), Global (2 cases)
  • Domains: 9 distinct domains covered
  • Time span: 2008-2024 (16 years of AI evolution)
  • Sample size: Over 30 real-world implementations documented

Technical Depth

  • Code examples: Over 50 complete Python implementations
  • Lines of code: Over 5,000 lines across all cases
  • Algorithms covered: CNN, RNN, RL, NLP, optimization, ensemble methods
  • Frameworks: TensorFlow, PyTorch, scikit-learn, XGBoost, SHAP, Fairlearn

Evidence Base

  • Peer-reviewed references: Over 100 papers cited
  • Clinical trials: Multiple RCTs and observational studies
  • Economic analyses: Over 5 ROI/cost-effectiveness studies
  • Fairness audits: 4 complete bias analyses

Updates and Contributions

This appendix is continuously updated. For corrections or to suggest additional case studies, contact the author at bryantegomoh.com.


Citation

If you use these case studies in your work, please cite:

@incollection{tegomoh2025casestudies,
 title = {Case Study Library: Real-World AI in Public Health},
 booktitle = {The Public Health AI Handbook: Evaluating AI Tools for Public Health Practice},
 author = {Tegomoh, Bryan},
 year = {2025},
 doi = {10.5281/zenodo.18263442},
 url = {https://publichealthaihandbook.com/appendices/case-study-overview.html}
}

See How to Cite This Handbook for additional citation formats.


Next Steps

Ready to dive deeper?

→ Read Complete Case Studies - Full technical details, code, and analysis

→ Code Repository Guide - Access companion code and examples

→ Further Reading - Curated resources for continued learning


This overview provides navigation across cases. For complete technical details, methodology, code implementations, and full analysis, see the full case studies.