Appendix F — The AI Morgue: Failure Post-Mortems
By the end of this appendix, you will:
- Understand the common failure modes of AI systems in healthcare and public health
- Analyze root causes of high-profile AI failures through detailed post-mortems
- Identify warning signs that predict project failure before deployment
- Apply failure prevention frameworks to your own AI projects
- Learn from $100M+ in failed AI investments without repeating the mistakes
- Develop a critical eye for evaluating AI vendor claims and research findings
Introduction: Why Study Failure?
The Value of Failure Analysis
“Success is a lousy teacher. It seduces smart people into thinking they can’t lose.” - Bill Gates
In public health AI, failure is not just a learning opportunity—it can mean lives lost, trust destroyed, and health equity worsened. Yet the literature overwhelmingly focuses on successes. Failed projects are quietly shelved, vendors move on to the next product, and the same mistakes repeat.
This appendix is different.
We document 10 major AI failures in healthcare and public health with forensic detail:
- What was promised vs. what was delivered
- Root cause analysis (technical, organizational, ethical)
- Real-world consequences and costs
- What should have happened
- Prevention strategies you can apply
Who Should Read This
For practitioners: Learn to spot red flags before investing time and resources in doomed projects.
For researchers: Understand why technically sound models fail in deployment.
For policymakers: See the consequences of inadequate oversight and validation requirements.
For students: Develop the critical thinking skills to evaluate AI systems skeptically.
Quick Reference: All 10 Failures at a Glance
| # | Project | Domain | Cost | Primary Failure Mode | Key Lesson |
|---|---|---|---|---|---|
| 1 | IBM Watson for Oncology | Clinical Decision Support | $62M+ investment | Unsafe recommendations from synthetic training data | Synthetic data ≠ real expertise |
| 2 | DeepMind Streams NHS | Patient Monitoring | £0 (free), cost = trust | Privacy violations, unlawful data sharing | Innovation doesn’t excuse privacy violations |
| 3 | Google Health India | Diabetic Retinopathy | $M investment | Lab-to-field performance gap | 96% accuracy in lab ≠ field success |
| 4 | Epic Sepsis Model | Clinical Prediction | Implemented in 100+ hospitals | Poor external validation, high false alarms | Vendor claims need independent validation |
| 5 | UK NHS COVID App | Contact Tracing | £12M spent | Technical + privacy issues | Social acceptability ≠ technical feasibility |
| 6 | OPTUM/UnitedHealth | Resource Allocation | Affected millions | Systematic racial bias via proxy variable | Proxy variables encode discrimination |
| 7 | Singapore TraceTogether | Contact Tracing | $10M+ | Broken privacy promises | Mission creep destroys public trust |
| 8 | Babylon GP at Hand | Symptom Checker | £0 pilot, trust cost | Unsafe triage recommendations | Chatbots ≠ medical diagnosis |
| 9 | COVID-19 Forecasting Models | Epidemic Prediction | 232 models published | 98% high risk of bias, overfitting | Urgency ≠ excuse for poor methods |
| 10 | Apple Watch AFib Study | Digital Epidemiology | $M research | Selection bias, unrepresentative sample | Convenience samples ≠ population inference |
Total documented costs: $100M+ in direct spending, incalculable trust damage
Case Study 1: IBM Watson for Oncology - Unsafe Recommendations from AI
The Promise
2013-2017: IBM heavily marketed Watson for Oncology as an AI system that could:
- Analyze massive amounts of medical literature
- Provide evidence-based treatment recommendations
- Match or exceed expert oncologists
- Democratize access to world-class cancer care

Marketing claims:
- “Watson can read and remember all medical literature”
- “90% concordance with expert tumor boards”
- Hospitals paid $200K-$1M+ for licensing
The Reality
July 2018: Internal documents leaked to STAT News revealed Watson recommended unsafe treatment combinations, incorrect dosing, and treatments contradicting medical evidence (Ross and Swetlitz 2018).
Example from leaked documents:
- Patient: 65-year-old with severe bleeding
- Watson recommendation: Prescribe chemotherapy + bevacizumab (increases bleeding risk)
- Expert oncologist assessment: “This would be harmful or fatal”
Root Cause Analysis
1. Training Data Problem
Critical flaw: Watson was trained on synthetic cases, not real patient outcomes.
# WHAT IBM DID (WRONG)
class WatsonTrainingApproach:
    """
    Watson for Oncology training methodology (simplified)
    """
    def generate_training_data(self):
        """Generate synthetic cases from expert opinions"""
        training_cases = []
        # Experts at Memorial Sloan Kettering created hypothetical cases
        for scenario in self.expert_generated_scenarios:
            case = {
                'patient_features': scenario['demographics'],
                'diagnosis': scenario['cancer_type'],
                'recommended_treatment': scenario['expert_preference'],  # NOT actual outcomes
                'confidence': 'high'  # Based on expert assertion, not evidence
            }
            training_cases.append(case)
        return training_cases

    def train_model(self, training_cases):
        """Train on expert preferences, not patient outcomes"""
        # Model learns: "Expert X prefers treatment Y"
        # Model does NOT learn: "Treatment Y improves survival"
        # This is preference learning, not outcome learning
        self.model.fit(
            X=[case['patient_features'] for case in training_cases],
            y=[case['recommended_treatment'] for case in training_cases]
        )
        # Result: Watson mimics expert opinions
        # Problem: Expert opinions can be wrong, biased, outdated

# WHAT SHOULD HAVE BEEN DONE (CORRECT)
class EvidenceBasedApproach:
    """
    How oncology decision support should be developed
    """
    def generate_training_data(self):
        """Use real patient outcomes from EHR data"""
        training_cases = []
        # Use actual patient data with outcomes
        for patient in self.ehr_database:
            if patient.has_outcome_data():
                case = {
                    'patient_features': patient.demographics + patient.tumor_characteristics,
                    'treatment_received': patient.treatment_history,
                    'outcome': patient.survival_months,  # ACTUAL OUTCOME
                    'adverse_events': patient.complications,
                    'quality_of_life': patient.qol_scores
                }
                training_cases.append(case)
        return training_cases

    def validate_against_rcts(self, model_recommendations):
        """Validate recommendations against randomized trial evidence"""
        for recommendation in model_recommendations:
            # Check if recommendation aligns with RCT evidence
            rct_evidence = self.search_clinical_trials(
                condition=recommendation['diagnosis'],
                intervention=recommendation['treatment']
            )
            if rct_evidence.contradicts(recommendation):
                recommendation['flag'] = 'CONTRADICTS_RCT_EVIDENCE'
                recommendation['use'] = False

            # Check for safety signals
            safety_data = self.fda_adverse_event_database.query(
                drug=recommendation['treatment'],
                patient_profile=recommendation['patient_features']
            )
            if safety_data.has_contraindications():
                recommendation['flag'] = 'SAFETY_CONTRAINDICATION'
                recommendation['use'] = False
        return model_recommendations

Why synthetic data failed:
- Expert preferences ≠ evidence-based best practices
- No validation against actual patient outcomes
- Biases in expert opinions propagated at scale
- No feedback loop from real-world results
2. Validation Failure
What IBM reported: 90% concordance with expert tumor boards
What this actually meant:
- Watson agreed with the same experts who trained it (circular validation)
- NOT validated against independent oncologists
- NOT validated against patient survival outcomes
- NOT validated in external hospitals before widespread deployment
The validation fallacy:
# IBM's circular validation approach
def evaluate_watson(test_cases):
    """
    Problematic validation methodology
    """
    # Test cases created by same experts who trained Watson
    expert_recommendations = memorial_sloan_kettering_experts.recommend(test_cases)
    watson_recommendations = watson_model.predict(test_cases)

    # Concordance = how often Watson agrees with trainers
    concordance = agreement_rate(expert_recommendations, watson_recommendations)

    # PROBLEM: This measures memorization, not clinical validity
    print(f"Concordance: {concordance}%")  # 90%!

    # MISSING: Does Watson improve patient outcomes?
    # MISSING: External validation at different hospitals
    # MISSING: Comparison to actual survival data
    # MISSING: Safety evaluation

3. Commercial Pressure Over Clinical Rigor
Timeline reveals rushed deployment:
- 2013: Partnership announced with Memorial Sloan Kettering
- 2015: First hospital deployments begin
- 2016-2017: Aggressive global sales push
- 2018: Safety issues surface

Financial incentives misaligned with patient safety:
- IBM under pressure to monetize Watson investments
- Hospitals wanted prestigious “AI partnership”
- Marketing preceded clinical validation
The Fallout
Hospitals that abandoned Watson for Oncology:
- MD Anderson Cancer Center, after spending $62M (Fry 2017)
- Jupiter Hospital (India), citing “unsafe recommendations” (Hernandez and Greenwald 2018)
- Gachon University Gil Medical Center (South Korea)
- Multiple European hospitals (Strickland 2019)

Patient impact:
- Unknown number exposed to potentially unsafe recommendations
- Degree of harm unknown (no systematic study)
- Oncologists reported catching unsafe suggestions before implementation
- Trust in AI-based clinical support damaged

Financial costs:
- MD Anderson: $62M investment, project shut down (Ross and Swetlitz 2017)
- Multiple hospitals: $200K-$1M licensing fees
- IBM: massive reputational damage; Watson Health eventually sold to an investment firm in 2022 (Lohr 2022)

Lessons for the field:
- Set back clinical AI adoption by years
- Increased regulatory skepticism
- Hospitals now demand extensive validation before AI adoption
What Should Have Happened
Phase 1: Proper Development (2-3 years)
1. Train on real patient outcomes from EHR data across multiple institutions
2. Validate against randomized clinical trial evidence
3. Build safety checks to flag contraindications
4. Involve diverse oncologists from community hospitals, not just academic centers

Phase 2: Rigorous Validation (1-2 years)
1. External validation at hospitals not involved in development
2. Prospective study comparing Watson recommendations to actual outcomes
3. Safety monitoring for adverse events
4. Subgroup analysis by cancer type, stage, patient characteristics

Phase 3: Controlled Deployment (1+ years)
1. Pilot at 3-5 hospitals with intensive monitoring
2. Oncologist oversight of all recommendations
3. Track concordance, outcomes, and safety
4. Iterative improvement based on real-world data

Phase 4: Gradual Scale (if Phase 3 succeeds)
1. Expand only after demonstrating clinical benefit or equivalence
2. Continuous monitoring and model updates
3. Transparent reporting of performance
Total timeline: 4-6 years before widespread deployment
What actually happened: 2 years from partnership to aggressive global sales
Prevention Checklist
Use this checklist to evaluate clinical AI systems:
Training Data ❌ Watson failed all of these
- [ ] Trained on real patient outcomes (not synthetic cases)
- [ ] Data from multiple institutions (not single center)
- [ ] Includes diverse patient populations
- [ ] Outcomes include survival, not just expert opinion

Validation ❌ Watson failed all of these
- [ ] External validation at independent sites
- [ ] Compared to patient outcomes (not just expert agreement)
- [ ] Safety evaluation included
- [ ] Subgroup performance reported
- [ ] Validation by independent researchers (not just vendor)

Deployment ❌ Watson failed all of these
- [ ] Prospective pilot study completed
- [ ] Clinical benefit demonstrated (not just claimed)
- [ ] Physician oversight required
- [ ] Continuous monitoring plan
- [ ] Transparent performance reporting

Governance ❌ Watson failed all of these
- [ ] Development timeline allows for proper validation
- [ ] Commercial pressure doesn’t override clinical rigor
- [ ] Independent ethics review
- [ ] Patient safety prioritized over revenue
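As a thought experiment, the checklist above can be turned into a mechanical gate that refuses deployment until every item is evidenced. This is an illustrative sketch, not a vendor tool; the section names, item keys, and the `readiness_gate` function are all hypothetical:

```python
# Hypothetical deployment-readiness gate mirroring the checklist above.
# All names are illustrative.

CHECKLIST = {
    "training": [
        "trained_on_real_outcomes",
        "multi_institution_data",
        "diverse_populations",
        "survival_outcomes_included",
    ],
    "validation": [
        "external_sites",
        "outcome_comparison",
        "safety_evaluation",
        "subgroup_performance",
        "independent_researchers",
    ],
    "deployment": [
        "prospective_pilot",
        "clinical_benefit_shown",
        "physician_oversight",
        "continuous_monitoring",
        "transparent_reporting",
    ],
}

def readiness_gate(evidence: dict) -> tuple:
    """Return (ready, missing_items). Every item must be evidenced."""
    missing = [
        item
        for section, items in CHECKLIST.items()
        for item in items
        if not evidence.get(item, False)
    ]
    return (len(missing) == 0, missing)

# A system with no documented evidence (Watson circa 2016) fails every item:
ready, missing = readiness_gate({})
# ready is False; all 14 items are missing
```

The point of making the gate mechanical is that a single unchecked item blocks deployment, rather than being argued away under commercial pressure.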
Key Takeaways
Synthetic data ≠ Real-world evidence - Expert-generated hypothetical cases cannot substitute for actual patient outcomes
Circular validation is not validation - Concordance with the experts who trained the system proves nothing about clinical validity
Marketing claims require independent verification - Vendor assertions must be validated by independent researchers
Commercial pressure kills patients - Rushing to market before proper validation has consequences
AI is not a substitute for clinical trials - Evidence-based medicine requires… evidence
References
Primary sources:
- Ross, C., & Swetlitz, I. (2018). IBM’s Watson supercomputer recommended ‘unsafe and incorrect’ cancer treatments. STAT News.
- Hernandez, D., & Greenwald, T. (2018). IBM Has a Watson Dilemma. Wall Street Journal.
- Strickland, E. (2019). How IBM Watson Overpromised and Underdelivered on AI Health Care. IEEE Spectrum.
- Fry, E. (2017). MD Anderson Benches IBM Watson in Setback for Artificial Intelligence. Fortune.
- Ross, C., & Swetlitz, I. (2017). MD Anderson Cancer Center’s $62 million Watson project is scrapped after audit. STAT News.
- Lohr, S. (2022). IBM Sells Watson Health Assets to Investment Firm. New York Times.
Case Study 2: DeepMind Streams and the NHS Data Sharing Scandal
The Promise
2015-2016: DeepMind (owned by Google) partnered with Royal Free NHS Trust to develop Streams, a mobile app to help nurses and doctors detect acute kidney injury (AKI) earlier.
Stated goals:
- Alert clinicians to deteriorating patients
- Reduce preventable deaths from AKI
- Demonstrate Google’s commitment to healthcare
- “Save lives with AI”
The Reality
July 2017: UK Information Commissioner’s Office ruled the data sharing agreement unlawful (Information Commissioner’s Office 2017).
What went wrong:
- Royal Free shared 1.6 million patient records with DeepMind (Hodson 2017)
- Patients not informed their data would be used
- Data included entire medical histories (not just kidney-related)
- Used for purposes beyond the stated clinical care
- No proper legal basis under the UK Data Protection Act (Powles and Hodson 2017)

Data included:
- HIV status
- Abortion records
- Drug overdose history
- Complete medical histories dating back 5 years
- Data from patients who never consented
Root Cause Analysis
1. Privacy Framework Violations
Legal failures:
# What DeepMind/Royal Free did (UNLAWFUL)
class DataSharingAgreement:
    """
    DeepMind Streams data sharing approach
    """
    def __init__(self):
        self.legal_basis = "Implied consent for direct care"  # WRONG
        self.data_minimization = False  # Took everything
        self.patient_notification = False  # Patients not informed

    def collect_patient_data(self, nhs_trust):
        """Collect patient data for app development"""
        # PROBLEM 1: Scope creep beyond stated purpose
        stated_purpose = "Detect acute kidney injury"
        actual_purpose = "Develop AI algorithms + train models + product development"

        # PROBLEM 2: Excessive data collection (violates data minimization)
        data_requested = {
            'kidney_function_tests': True,  # Relevant to AKI
            'vital_signs': True,  # Relevant to AKI
            'complete_medical_history': True,  # NOT necessary for AKI alerts
            'hiv_status': True,  # NOT necessary for AKI alerts
            'mental_health_records': True,  # NOT necessary for AKI alerts
            'abortion_history': True,  # NOT necessary for AKI alerts
            'historical_data': '5 years'  # Far exceeds clinical need
        }

        # PROBLEM 3: No patient consent or notification
        # self.obtain_explicit_consent()  -- never called
        # self.notify_patients()          -- never called

        # PROBLEM 4: Commercial use of NHS data
        data_use = {
            'clinical_care': True,  # OK
            'algorithm_development': True,  # Requires different legal basis
            'google_ai_research': True,  # Requires patient consent
            'product_development': True  # Requires patient consent
        }

        patient_data = nhs_trust.extract(data_requested)  # full, unminimized extract
        return patient_data

# What SHOULD have been done (LAWFUL)
class LawfulDataSharingApproach:
    """
    Privacy-preserving approach to clinical AI development
    """
    def __init__(self):
        self.legal_basis = "Explicit consent for research"  # CORRECT
        self.data_minimization = True
        self.patient_notification = True
        self.independent_ethics_review = True

    def collect_patient_data_lawfully(self, nhs_trust):
        """Lawful approach to data collection"""
        # Step 1: Define minimum necessary data
        minimum_data_set = {
            'patient_id': 'pseudonymized',
            'age': True,
            'sex': True,
            'kidney_function_tests': True,
            'relevant_vital_signs': True,
            'relevant_medications': True,  # Only nephrotoxic drugs
            'aki_history': True
        }

        # Explicitly exclude unnecessary data
        excluded_data = [
            'complete_medical_history',
            'unrelated_diagnoses',
            'mental_health_records',
            'reproductive_history',
            'hiv_status'
        ]

        # Step 2: Obtain explicit informed consent
        consent_process = {
            'plain_language_explanation': True,
            'purpose_clearly_stated': "Develop AKI detection algorithm",
            'data_use_specified': "Clinical care AND algorithm development",
            'commercial_partner_disclosed': "Google DeepMind",
            'opt_out_option': True,
            'withdrawal_rights': True
        }

        # Step 3: Ethics approval
        ethics_review = self.submit_to_ethics_committee({
            'study_protocol': self.protocol,
            'consent_forms': self.consent_forms,
            'data_protection_impact_assessment': self.dpia,
            'benefit_risk_analysis': self.analysis
        })
        if not ethics_review.approved:
            return None  # Don't proceed without approval

        # Step 4: Transparent patient notification
        self.notify_all_patients(
            method='letter + posters + website',
            content='Data being used for AI research with Google',
            opt_out_period='30 days'
        )

        # Step 5: Collect only consented data
        consented_patients = self.get_consented_patients()
        data = self.extract_minimum_data_set(consented_patients)
        return data

2. Organizational Culture: "Move Fast, Get Permission Later"
Evidence of privacy-second culture:
- Data sharing agreement signed before proper legal review
  - Agreement signed: September 2015
  - Information Governance review: after the fact
  - Legal basis analysis: inadequate
- No Data Protection Impact Assessment (DPIA)
  - Required for high-risk processing under GDPR
  - Should have been completed BEFORE data sharing
  - Would have identified legal issues
- Patient safety used to justify privacy violations
  - “We need all the data to save lives”
  - False dichotomy: privacy OR patient safety
  - Reality: can have both with proper safeguards
3. Power Imbalance: Google vs. NHS
Structural factors:
- NHS chronically underfunded, attracted by “free” Google technology
- DeepMind offered app development at no cost
- Royal Free eager for prestigious partnership
- Imbalance in legal and technical expertise: Google’s lawyers vs. under-resourced NHS legal teams
The Fallout
Regulatory action:
- UK Information Commissioner’s Office ruled the data sharing unlawful (July 2017) (Information Commissioner’s Office 2017)
- Royal Free Trust found in breach of the Data Protection Act
- Required to update practices and systems (Hern 2017)

Reputational damage:
- Massive media coverage: “Google got NHS patient data improperly”
- Patient trust in NHS data sharing damaged
- DeepMind’s healthcare ambitions set back
- Chilling effect on beneficial NHS-tech partnerships

Patient impact:
- 1.6 million patients’ privacy violated
- Highly sensitive data (HIV status, abortions, overdoses) shared without consent
- No evidence of direct patient harm from data misuse
- BUT: violation of patient autonomy and dignity

Policy impact:
- Strengthened NHS data sharing requirements
- Increased scrutiny of commercial partnerships
- Contributed to GDPR implementation awareness
- NHS data transparency initiatives
What Should Have Happened
Lawful pathway (would have added 6-12 months):
Phase 1: Planning and Legal Review (2-3 months)
1. Define minimum necessary data set for AKI detection
2. Conduct Data Protection Impact Assessment (DPIA)
3. Obtain legal opinion on appropriate legal basis
4. Design patient consent/notification process
5. Submit to NHS Research Ethics Committee

Phase 2: Ethics and Governance (2-3 months)
1. Ethics committee review and approval
2. Information Governance approval
3. Caldicott Guardian sign-off (NHS data guardian)
4. Transparent public announcement of partnership

Phase 3: Patient Engagement (3-6 months)
1. Patient information campaign (letters, posters, website)
2. 30-day opt-out period
3. Mechanism for patient questions and concerns
4. Patient advisory group involvement

Phase 4: Data Sharing with Safeguards (ongoing)
1. Share only minimum necessary data
2. Pseudonymization and encryption
3. Audit trail of all data access
4. Regular privacy audits
5. Transparent reporting to patients and public
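The minimization and pseudonymization safeguards of Phase 4 can be sketched in a few lines of Python. This is a hypothetical illustration, not the Streams implementation: the field names, `extract_minimum_record` helper, and in-code key are assumptions, and a real deployment would hold the key in a secret store managed by the data controller alone:

```python
import hashlib
import hmac

# Only the fields justified for AKI detection leave the trust (illustrative set).
MINIMUM_FIELDS = {"age", "sex", "creatinine", "urea", "relevant_medications"}
SECRET_KEY = b"replace-with-managed-secret"  # held by the data controller only

def pseudonymize(nhs_number: str) -> str:
    """Keyed hash so identifiers can't be reversed or dictionary-attacked."""
    return hmac.new(SECRET_KEY, nhs_number.encode(), hashlib.sha256).hexdigest()[:16]

def extract_minimum_record(record: dict) -> dict:
    """Keep only justified fields; everything else is dropped, not redacted later."""
    out = {k: v for k, v in record.items() if k in MINIMUM_FIELDS}
    out["patient_id"] = pseudonymize(record["nhs_number"])
    return out

raw = {
    "nhs_number": "943 476 5919",   # example-format identifier
    "age": 67,
    "sex": "F",
    "creatinine": 182,
    "hiv_status": "positive",        # never leaves the trust
    "mental_health_records": "...",  # never leaves the trust
}
shared = extract_minimum_record(raw)
# 'hiv_status', 'mental_health_records', and 'nhs_number' are absent from `shared`
```

The design choice worth noting is the allow-list: fields are shared only if explicitly justified, which inverts the "take everything, filter later" posture that the ICO found unlawful.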
Would this have delayed the project? Yes, by 6-12 months.
Would it have preserved trust? Yes.
Would the app still have saved lives? Yes, and without violating patient privacy.
Prevention Checklist
Use this checklist before any health data sharing for AI:
Legal Basis ❌ DeepMind failed all of these
- [ ] Explicit legal basis identified (consent, legal obligation, legitimate interest with balance test)
- [ ] Legal basis appropriate for ALL intended uses (including commercial AI development)
- [ ] Legal review by qualified data protection lawyer
- [ ] Data sharing agreement reviewed by independent party

Data Minimization ❌ DeepMind failed this
- [ ] Only minimum necessary data collected
- [ ] Scope limited to stated purpose
- [ ] Irrelevant data explicitly excluded
- [ ] Justification documented for each data element

Transparency ❌ DeepMind failed all of these
- [ ] Patients informed about data use
- [ ] Commercial partners disclosed
- [ ] Purpose clearly explained
- [ ] Opt-out option provided

Governance ❌ DeepMind failed all of these
- [ ] Ethics committee approval obtained
- [ ] Data Protection Impact Assessment completed
- [ ] Information Governance approval
- [ ] Independent oversight (e.g., Caldicott Guardian)
- [ ] Patient advisory group consulted

Safeguards (DeepMind did implement some technical safeguards)
- [x] Data encrypted in transit and at rest
- [x] Access controls and audit logs
- [ ] Regular privacy audits
- [ ] Breach notification plan
Key Takeaways
Innovation doesn’t excuse privacy violations - “Saving lives” is not a justification for unlawful data sharing
Data minimization is not optional - Collect only what you need, not everything you can access
Patient consent matters - Even for “beneficial” uses, patients have a right to know and choose
Power imbalances create risk - Under-resourced public health agencies need independent legal support when partnering with tech giants
“Free” technology is not free - Costs may be paid in patient privacy and public trust
Trust, once broken, is hard to rebuild - This scandal damaged NHS-tech partnerships for years
References
Primary sources:
- UK Information Commissioner’s Office. (2017). Royal Free - Google DeepMind trial failed to comply with data protection law.
- Powles, J., & Hodson, H. (2017). Google DeepMind and healthcare in an age of algorithms. Health and Technology, 7(4), 351-367. DOI: 10.1007/s12553-017-0179-1
- Hodson, H. (2017). DeepMind’s NHS patient data deal was illegal, says UK watchdog. New Scientist.
- Hern, A. (2017). Royal Free breached UK data law in 1.6m patient deal with Google’s DeepMind. The Guardian.
- Powles, J. (2016). DeepMind’s latest NHS deal leaves big questions unanswered. The Guardian.

Analysis:
- Veale, M., & Binns, R. (2017). Fairer machine learning in the real world: Mitigating discrimination without collecting sensitive data. Big Data & Society, 4(2). DOI: 10.1177/2053951717743530
Case Study 3: Google Health India - The Lab-to-Field Performance Gap
The Promise
2016-2018: Google Health developed an AI system for diabetic retinopathy (DR) screening with impressive results (Gulshan et al. 2016):
- 96% sensitivity in validation studies
- Published in JAMA (high-impact journal) (Krause et al. 2018)
- Regulatory approval in Europe
- Deployment in India to address ophthalmologist shortage

The vision:
- Democratize DR screening in low-resource settings
- Address 415 million people with diabetes globally
- Prevent blindness through early detection
- Showcase AI’s potential for global health equity
The Reality
2019-2020: Field deployment in rural India clinics encountered severe problems (Beede et al. 2020):
- Nurses couldn’t use the system effectively
- Poor image quality from non-standard cameras
- Internet connectivity too unreliable
- Workflow disruptions caused bottlenecks
- Patient follow-up rates plummeted
- Program quietly scaled back (Mukherjee 2021)

Performance degradation:
- Lab conditions: 96% sensitivity
- Field conditions: ~55% of images were ungradable (system rejected them as too poor quality) (Beede et al. 2020)
- Of gradable images, performance unknown (not systematically evaluated)
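A back-of-envelope calculation shows why the ungradable-image rate dominates everything else. The 55% rejection rate is the field figure reported above; the 96% sensitivity is the lab figure applied per gradable image:

```python
lab_sensitivity = 0.96   # per gradable image, lab conditions
ungradable_rate = 0.55   # field-reported rejection rate

# A patient whose image is rejected is not screened by the AI at all,
# and with <20% referral follow-up, most of those patients get no exam.
graded_fraction = 1 - ungradable_rate
effective_sensitivity = graded_fraction * lab_sensitivity
print(round(effective_sensitivity, 3))  # 0.432
```

Even granting the full lab sensitivity on every gradable image, the screening program as a whole could detect at most about 43% of disease, less than half the headline figure.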
Root Cause Analysis
1. Lab-to-Field Translation Failure
Controlled research environment vs. real-world chaos:
# Lab environment (where AI performed well)
class LabEnvironment:
    """
    Idealized conditions for AI development
    """
    def __init__(self):
        self.camera = "High-end retinal camera ($40,000)"
        self.operator = "Trained ophthalmology photographer"
        self.lighting = "Optimal, controlled"
        self.patient_cooperation = "High (research volunteers)"
        self.internet = "Fast, reliable hospital WiFi"
        self.support = "On-site AI researchers for troubleshooting"

    def capture_image(self, patient):
        """Image capture in lab conditions"""
        # Professional photographer with optimal equipment
        image = self.camera.capture(
            patient=patient,
            attempts=5,  # Can retry multiple times
            lighting='optimal',
            dilation='complete'  # Pupils fully dilated
        )

        # Quality control before AI analysis
        if image.quality_score < 0.9:
            image = self.recapture()  # Try again

        # Fast, reliable internet for cloud processing
        result = self.ai_model.predict(
            image,
            internet_speed='1 Gbps',
            latency='<100ms'
        )
        return result  # High quality input → High quality output

# Field environment (where AI failed)
class FieldEnvironmentIndia:
    """
    Reality of rural Indian primary care clinics
    """
    def __init__(self):
        self.camera = "Portable retinal camera ($5,000, different model than training data)"
        self.operator = "Nurse with 2-hour training"
        self.lighting = "Variable, often poor"
        self.patient_cooperation = "Variable (many elderly, diabetic complications)"
        self.internet = "Intermittent, slow (when available)"
        self.support = "None (Google researchers in California)"

    def capture_image(self, patient):
        """Image capture in field conditions"""
        # PROBLEM 1: Equipment mismatch
        # AI trained on $40K cameras, deployed with $5K cameras
        # Different image characteristics, compression, resolution

        # PROBLEM 2: Operator skill gap
        # Nurse has 2 hours of training vs. professional photographers
        image = self.camera.capture(
            patient=patient,
            attempts=2,  # Limited time per patient
            lighting='suboptimal',  # Poor clinic lighting
            dilation='partial'  # Patients dislike dilation, often incomplete
        )

        # PROBLEM 3: Image quality issues (approximate rates)
        image_quality_issues = {
            'blurry': 0.25,  # Camera shake, patient movement
            'poor_lighting': 0.30,  # Inadequate illumination
            'wrong_angle': 0.20,  # Inexperienced operator
            'incomplete_dilation': 0.35,  # Patient discomfort
            'off_center': 0.15  # Targeting errors
        }

        # AI rejects poor quality images
        if image.quality_score < 0.7:
            return "UNGRADABLE IMAGE - REFER TO OPHTHALMOLOGIST"
            # Problem: Clinic has no ophthalmologist
            # Patient told to travel 50km to district hospital
            # Most patients don't follow up

        # PROBLEM 4: Connectivity failure
        try:
            result = self.ai_model.predict(
                image,
                internet_speed='0.5 Mbps',  # 2000x slower than lab
                latency='2000ms',  # 20x worse than lab
                timeout='30 seconds'
            )
        except TimeoutError:
            # Internet too slow, AI in cloud can't process
            # Patient leaves without screening
            return "SYSTEM ERROR - UNABLE TO PROCESS"

        # PROBLEM 5: Workflow disruption
        processing_time_minutes = 5  # vs 30 seconds in lab
        # Clinic sees 50 patients/day
        # 5 min/patient for DR screening = 250 minutes = 4+ hours
        # Entire clinic workflow collapses

        return result

2. User-Centered Design Failure
Google designed for ophthalmologists, deployed with nurses:
Training gap:
- Ophthalmology photographers: years of training, hundreds of images daily
- Rural clinic nurses: 2-hour training session, first time using a retinal camera
- No ongoing support or troubleshooting

Workflow integration failure:
- System added 5+ minutes per patient (clinic operates on tight schedules)
- Required internet connectivity (unreliable in rural areas)
- Cloud-based processing created dependency on Google servers
- No offline mode for areas with poor connectivity

Error handling:
- System rejected 55% of images as “ungradable”
- No actionable guidance for nurses on how to improve image quality
- Patients told “refer to ophthalmologist” but the nearest one was 50km+ away
- Follow-up rate for referrals: <20%
3. Validation Mismatch
What was validated:
- AI performance on high-quality images from research-grade cameras
- Agreement with expert ophthalmologists on curated datasets
- Technical accuracy in controlled settings

What was NOT validated:
- End-to-end workflow in actual clinic settings
- Performance with the portable cameras used in the field
- Nurse ability to obtain gradable images
- Patient acceptance and follow-up rates
- Impact on clinic workflow and throughput
- Actual health outcomes (did blindness decrease?)
The Fallout
Program outcomes:
- Quietly scaled back in 2020 (Mukherjee 2021)
- No published results on real-world impact
- Unknown number of patients screened
- Unknown impact on diabetic retinopathy detection or blindness prevention

Lessons for Google:
- Led to major changes in Google Health strategy (Mukherjee 2021)
- Increased focus on user research and field testing
- Recognition that “AI accuracy” ≠ “system effectiveness” (Beede et al. 2020)
- Several key researchers left Google Health

Impact on field:
- Highlighted the gap between AI research and implementation science
- Demonstrated the need for human-centered design in clinical AI
- Showed that technical performance is necessary but not sufficient

Missed opportunity:
- India has a massive DR screening gap (millions unscreened)
- A well-designed system could have made a real impact
- The failure set back AI adoption in Indian primary care
What Should Have Happened
Implementation science approach:
Phase 1: Formative Research (6-12 months)
1. Ethnographic study of actual clinic workflows
   - Shadow nurses in rural clinics for weeks
   - Document real-world constraints (time, connectivity, equipment)
   - Identify workflow integration points
   - Understand patient barriers (cost, distance, literacy)
2. Technology assessment
   - Test portable cameras actually available in rural clinics
   - Measure real-world internet connectivity
   - Assess power reliability
   - Identify equipment constraints
3. User research with nurses
   - What training do they need?
   - What support systems are required?
   - How much time can be allocated per patient?
   - What error messages are actionable?
Phase 2: Adapt AI System (6-12 months)
1. Retrain AI on images from field equipment
   - Collect training data using the actual portable cameras deployed
   - Include poor lighting, motion blur, incomplete dilation
   - Train AI to be robust to field conditions
2. Design for intermittent connectivity
   - Offline mode for AI processing (edge deployment)
   - Sync results when connectivity available
   - No dependency on cloud for basic functionality
3. Improve usability for nurses
   - Real-time feedback on image quality
   - Guidance system: “Move camera up,” “Improve lighting,” etc.
   - Simplified training program with ongoing support
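The offline-first design described above can be sketched as a local queue that screens on-device and syncs opportunistically. This is a minimal sketch under stated assumptions: `OfflineScreeningQueue` and the `edge_model`/`uplink` stubs are hypothetical names, standing in for a quantized on-device model and a network client:

```python
from collections import deque

class OfflineScreeningQueue:
    """Store captures locally; run an on-device model; sync when online."""

    def __init__(self, edge_model, uplink):
        self.edge_model = edge_model  # callable: image -> screening result
        self.uplink = uplink          # network client; is_up() may be False
        self.pending_sync = deque()

    def screen(self, image):
        # Always screen locally, so the patient never leaves unserved.
        result = self.edge_model(image)
        self.pending_sync.append((image, result))
        self.try_sync()
        return result

    def try_sync(self):
        # Push queued results only while connectivity happens to be up.
        while self.pending_sync and self.uplink.is_up():
            self.uplink.send(self.pending_sync.popleft())

# Smoke test with a permanently-down link (the rural-clinic worst case):
class DownLink:
    def is_up(self):
        return False
    def send(self, item):
        pass

q = OfflineScreeningQueue(edge_model=lambda img: "refer", uplink=DownLink())
assert q.screen("img-001") == "refer"   # patient still gets a result
assert len(q.pending_sync) == 1        # result queued until the network returns
```

The design choice is that cloud connectivity affects only reporting latency, never whether the patient is screened, which is exactly the dependency the deployed system got backwards.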
Phase 3: Pilot Implementation (12 months)
1. Small-scale pilot (3-5 clinics)
   - Intensive monitoring and support
   - Rapid iteration based on feedback
   - Document workflow integration challenges
   - Measure key outcomes: gradable image rate, screening completion, referral follow-up
2. Hybrid approach
   - AI flags high-risk cases
   - Tele-ophthalmology for borderline cases
   - Local health workers support follow-up
   - Integration with existing health systems
Phase 4: Evaluation and Iteration (12 months)
1. Process evaluation
   - What percentage of eligible patients screened?
   - What percentage of images gradable?
   - Nurse satisfaction and confidence
   - Workflow impact on clinic operations
2. Outcome evaluation
   - Detection rates (vs baseline)
   - Referral completion rates
   - Time to treatment
   - Long-term impact on vision outcomes
Phase 5: Scale Only If Successful
1. Expand only if the pilot demonstrates:
   - Feasible workflow integration
   - High gradable image rate (>80%)
   - Improved patient outcomes
   - Sustainability without ongoing external support
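The Phase 5 gate can be written down as an explicit go/no-go check, so that scaling is a pre-committed decision rather than a judgment call under political pressure. The metric names and thresholds below are illustrative.

```python
def ready_to_scale(pilot_metrics: dict) -> bool:
    """Scale only if every Phase 5 criterion is met (thresholds illustrative)."""
    return (
        pilot_metrics.get("workflow_integration_feasible", False)
        and pilot_metrics.get("gradable_image_rate", 0.0) > 0.80
        and pilot_metrics.get("outcomes_improved", False)
        and pilot_metrics.get("sustainable_without_external_support", False)
    )
```

The point of encoding it: a missing metric counts as a failure, so “we didn’t measure that” cannot become a reason to scale anyway.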
Total timeline: 3-4 years from development to scale
What actually happened: Lab validation → immediate deployment → failure
Prevention Checklist
Use this checklist for AI deployed in resource-limited settings:
User Research ❌ Google failed all of these
- [ ] Ethnographic study of actual deployment environment
- [ ] End-user involvement in design (not just technical experts)
- [ ] Workflow analysis in real-world conditions
- [ ] Identification of infrastructure constraints (connectivity, power, equipment)

Technology Adaptation ❌ Google failed all of these
- [ ] AI trained on data from actual deployment equipment
- [ ] System designed for worst-case conditions (poor connectivity, power outages)
- [ ] Offline functionality for critical features
- [ ] Performance validated with target end-users (not just technical performance)

Pilot Testing ❌ Google failed to run an adequate pilot
- [ ] Small-scale pilot before full deployment
- [ ] Intensive monitoring and rapid iteration
- [ ] Process metrics tracked (gradable image rate, completion rate, time per patient)
- [ ] Outcome metrics tracked (detection rate, referral follow-up, health impact)

Training and Support ❌ Google failed these
- [ ] Adequate training for end-users (not a 2-hour session)
- [ ] Ongoing support and troubleshooting
- [ ] Local champions and peer support
- [ ] Refresher training and skill maintenance

Sustainability ❌ Google failed to assess this
- [ ] System sustainable without external support
- [ ] Integration with existing health system
- [ ] Local ownership and maintenance
- [ ] Cost-effectiveness analysis
Key Takeaways
96% accuracy in the lab ≠ Success in the field - Technical performance is necessary but not sufficient
Design for real-world conditions, not idealized lab settings - Rural clinics ≠ Research hospitals
Technology must fit workflow, not the other way around - Adding 5 minutes per patient collapsed clinic operations
End-users must be involved in design - Designing for ophthalmologists, deploying with nurses = failure
Infrastructure constraints are not optional - Intermittent internet, poor lighting, limited equipment are realities to design around
Pilot, iterate, then scale - Not deploy globally and hope for the best
Implementation science matters as much as AI science - Getting technology into hands of users requires different expertise than developing the technology
References
Primary research:
- Gulshan, V., et al. (2016). Development and Validation of a Deep Learning Algorithm for Detection of Diabetic Retinopathy in Retinal Fundus Photographs. JAMA, 316(22), 2402-2410. DOI: 10.1001/jama.2016.17216
- Krause, J., et al. (2018). Grader Variability and the Importance of Reference Standards for Evaluating Machine Learning Models for Diabetic Retinopathy. Ophthalmology, 125(8), 1264-1272. DOI: 10.1016/j.ophtha.2018.01.034
- Beede, E., et al. (2020). A Human-Centered Evaluation of a Deep Learning System Deployed in Clinics for the Detection of Diabetic Retinopathy. Proceedings of the 2020 CHI Conference on Human Factors in Computing Systems, 1-12. DOI: 10.1145/3313831.3376718

Media coverage and analysis:
- Mukherjee, S. (2017). A.I. Versus M.D.: What Happens When Diagnosis Is Automated? The New Yorker
- Heaven, W. D. (2020). Google’s medical AI was super accurate in a lab. Real life was a different story. MIT Technology Review

Implementation science context:
- Keane, P. A., & Topol, E. J. (2018). With an eye to AI and autonomous diagnosis. npj Digital Medicine, 1(1), 40. DOI: 10.1038/s41746-018-0048-y
Case Study 4: Epic Sepsis Model - When Vendor Claims Meet Reality
The Promise
Epic (the largest EHR vendor in the US, used by 50%+ of US hospitals) developed and deployed a machine learning model to predict sepsis risk and alert clinicians.
Vendor claims:
- High accuracy (AUC 0.76-0.83, depending on version)
- Early warning (hours before sepsis diagnosis)
- Implemented in 100+ hospitals
- Potential to save thousands of lives

Marketing message:
- “AI-powered early warning system”
- Integrated seamlessly into Epic EHR workflow
- Evidence-based and clinically validated
The Reality
2021: External validation study published in JAMA Internal Medicine (Wong et al. 2021)
Researchers at University of Michigan tested Epic’s sepsis model on their patients:
Findings:
- Sensitivity: 63% (missed 37% of sepsis cases)
- Positive Predictive Value: 12% (88% of alerts were false alarms)
- Of every 100 alerts, only 12 patients actually had sepsis
- Alert fatigue: clinicians ignored most alerts
- No evidence of improved patient outcomes
External validation results diverged dramatically from vendor claims (Wong et al. 2021).
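To see how these headline numbers relate, the metrics can be recomputed from a confusion matrix. The counts below are synthetic, chosen only to reproduce approximately the sensitivity, specificity, and PPV reported by the Michigan team, not taken from the study itself.

```python
def classification_metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Sensitivity, specificity, and PPV from raw confusion-matrix counts."""
    return {
        "sensitivity": tp / (tp + fn),  # share of true sepsis cases alerted on
        "specificity": tn / (tn + fp),  # share of non-sepsis patients not alerted on
        "ppv": tp / (tp + fp),          # share of alerts that are true sepsis
    }

# Synthetic counts shaped to mimic the external validation:
# 37% of sepsis cases missed, and only 12 of every 100 alerts correct.
m = classification_metrics(tp=63, fp=462, fn=37, tn=900)
```

The exercise makes the core problem concrete: a model can post a respectable-looking AUC while its PPV at the deployed threshold makes almost nine in ten alerts false alarms.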
Root Cause Analysis
1. Internal vs. External Validation Gap
The validation problem:
```python
# What Epic likely did (internal validation)
class InternalValidation:
    """Vendor validation approach."""

    def __init__(self):
        self.training_data = "Epic customer hospitals (unspecified number)"
        self.test_data = "Same Epic customer hospitals (different time period)"

    def validate_model(self):
        """Internal validation methodology."""
        # Train on Epic customer data
        model = self.train_model(
            data=self.get_epic_customer_ehr_data(),
            features=self.epic_specific_features,
            labels=self.sepsis_cases
        )

        # Test on a different time period from the same hospitals
        # PROBLEM: same patient population, same documentation practices, same workflows
        test_performance = model.evaluate(
            data=self.get_epic_customer_ehr_data(time_period='later'),
            metric='AUC'
        )

        # Report performance
        print(f"AUC: {test_performance['auc']}")  # 0.83!

        # WHAT'S MISSING:
        # - Validation on hospitals not in training data
        # - Validation on non-Epic EHR systems
        # - Different patient populations
        # - Different clinical workflows
        # - Real-world alert rate and clinician response
        # - Impact on patient outcomes


# What independent researchers did (external validation)
class ExternalValidation:
    """University of Michigan external validation."""

    def __init__(self):
        self.test_hospital = "University of Michigan (not in Epic training data)"
        self.ehr_system = "Epic (same vendor, different implementation)"

    def validate_model(self):
        """Independent validation methodology."""
        # Test Epic's deployed model on a completely new population
        results = epic_sepsis_model.evaluate(
            data=self.umich_patient_data,  # NEW hospital, NEW patients
            ground_truth=self.chart_review_sepsis_diagnosis  # Gold standard
        )

        # Comprehensive metrics
        performance = {
            'auc': 0.63,  # Lower than Epic's claim of 0.83
            'sensitivity': 0.63,  # Misses 37% of sepsis cases
            'specificity': 0.66,  # Many false alarms
            'ppv': 0.12,  # 88% of alerts are wrong
            'alert_rate': '1 per 2.1 patients',  # Overwhelming alert burden
            'alert_burden': 'Median 84 alerts per day per ICU team'
        }

        # Clinical workflow impact
        clinician_response = self.survey_clinicians()
        # "Too many false alarms"
        # "Ignored most alerts due to alert fatigue"
        # "No change in sepsis management"

        # Patient outcomes
        outcome_analysis = self.compare_outcomes(
            before_epic_sepsis_model,
            after_epic_sepsis_model
        )
        # No significant change in:
        # - Time to antibiotics
        # - Time to sepsis bundle completion
        # - ICU length of stay
        # - Mortality

        return performance
```

Why performance degraded:
- Different patient populations
  - Training hospitals vs. Michigan patient mix
  - Different case severity distributions
  - Different comorbidity profiles
- Different documentation practices
  - How clinicians document varies by institution
  - Model learned institution-specific patterns
  - Patterns don’t generalize
- Different workflows
  - How quickly vitals are entered
  - Which lab tests are ordered when
  - Documentation timing and completeness
2. The False Alarm Problem
Alert burden analysis:
```python
class AlertFatigueAnalysis:
    """Understanding the alert burden problem."""

    def calculate_alert_burden(self):
        """Michigan ICU alert volume."""
        hospital_stats = {
            'icu_patients_per_day': 100,
            'alert_rate': '1 per 2.1 patients',  # Per Michigan study
            'alerts_per_day': 100 / 2.1  # ~48 alerts/day
        }

        # Each alert requires:
        alert_overhead = {
            'time_to_review_alert': '2-3 minutes',
            'review_patient_chart': '3-5 minutes',
            'assess_clinical_relevance': '2-3 minutes',
            'document_response': '1-2 minutes',
            'total_per_alert': '8-13 minutes'
        }

        # For an ICU team seeing 48 alerts/day:
        daily_burden = {
            'time_spent_on_alerts': '6-10 hours',  # Of nursing/physician time
            'true_sepsis_cases': 48 * 0.12,  # Only ~6 patients actually have sepsis
            'false_alarms': 48 * 0.88,  # ~42 false alarms
            'true_positives_missed': 'Unknown (63% sensitivity means 37% of cases missed)'
        }

        # Outcome: alert fatigue
        clinician_response = {
            'alert_responsiveness': 'Decreases over time',
            'cognitive_burden': 'High',
            'trust_in_system': 'Low',
            'actual_behavior_change': 'Minimal'
        }

        return "System adds burden without clear benefit"
```

The specificity-alert burden tradeoff:
If you want to catch more sepsis cases (higher sensitivity), you must accept more false alarms (lower specificity). But in a hospital with:
- 100 ICU patients
- 5% sepsis prevalence
- Target: 95% sensitivity (catch almost all cases)

you need to accept that:
- ~80% of alerts will be false alarms
- Clinicians will become fatigued and ignore alerts
- The rare true positives will be lost in noise

Epic’s model had:
- 63% sensitivity (missed 37% of cases) ← not good enough
- 66% specificity (34% false positive rate) ← alert burden too high
- The worst of both worlds
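This tradeoff follows directly from Bayes’ rule: at low prevalence, even a fairly specific alert produces mostly false alarms. A worked example with the numbers above:

```python
def ppv(prevalence: float, sensitivity: float, specificity: float) -> float:
    """Positive predictive value of an alert, via Bayes' rule."""
    true_alerts = prevalence * sensitivity
    false_alerts = (1 - prevalence) * (1 - specificity)
    return true_alerts / (true_alerts + false_alerts)

# 5% sepsis prevalence, 95% sensitivity target:
# even at 80% specificity, 4 out of 5 alerts are false alarms.
p = ppv(prevalence=0.05, sensitivity=0.95, specificity=0.80)  # 0.20
```

No threshold tuning escapes this: with 5% prevalence, pushing PPV much higher requires a specificity that would gut sensitivity.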
3. Lack of Outcome Validation
Epic measured:
- ✓ AUC (model discrimination)
- ✓ Sensitivity/specificity
- ✓ Calibration

Epic did NOT measure (or publish):
- ✗ Impact on time to antibiotics
- ✗ Impact on sepsis bundle completion
- ✗ Impact on ICU length of stay
- ✗ Impact on mortality
- ✗ Cost-effectiveness
- ✗ Clinician alert fatigue and response rates
Model accuracy ≠ Clinical impact
The Fallout
Hospital response:
- Many hospitals that implemented the Epic sepsis model reported similar problems
- Some hospitals turned off the alerts due to alert fatigue
- Others raised alert thresholds (fewer alerts, but more cases missed)
- Unknown how many hospitals continue to use it effectively

Patient impact:
- No evidence of benefit (outcomes not improved)
- Potential harm from alert fatigue causing real alerts to be ignored
- Unknown number of sepsis cases missed due to 63% sensitivity

Trust impact:
- Increased skepticism of vendor AI claims
- Hospitals demanding independent validation before adoption
- Regulatory interest in AI medical device claims

Research impact:
- Highlighted the need for external validation (Wong et al. 2021)
- Demonstrated the gap between technical performance and clinical utility
- Showed the importance of measuring patient outcomes, not just AUC (McCoy & Das 2017)
What Should Have Happened
Responsible AI deployment pathway:
Phase 1: Transparent Development (Epic’s responsibility)
1. Publish development methodology
   - Training data sources and characteristics
   - Feature engineering approach
   - Validation methodology and results
   - Known limitations
2. Make the model available for independent validation
3. Provide an implementation guide with expected performance ranges

Phase 2: External Validation (independent researchers)
1. Pre-deployment validation at 3-5 hospitals not in training data
2. Report performance across diverse settings
3. Measure clinical outcomes, not just AUC
4. Assess alert burden and clinician response
5. Publish results in a peer-reviewed journal

Phase 3: Pilot Implementation (hospitals considering adoption)
1. Small-scale pilot (1-2 ICU units)
2. Intensive monitoring:
   - Alert volume and clinician response rate
   - Time to sepsis interventions
   - Patient outcomes (mortality, length of stay)
   - Clinician satisfaction and alert fatigue
3. Compare to historical controls
4. Decide: scale, modify, or abandon

Phase 4: Iterative Improvement
1. Customize the model to the local patient population
2. Adjust alert thresholds based on local clinician feedback
3. Integrate with local sepsis protocols
4. Continuous monitoring and updates

What actually happened:
1. Epic developed and deployed the model
2. Hospitals adopted it based on vendor claims
3. External researchers discovered poor performance
4. Damage to trust already done
Prevention Checklist
Before adopting any commercial clinical AI:
Validation Evidence ❌ Epic sepsis model failed these
- [ ] External validation at multiple independent sites
- [ ] Validation results published in a peer-reviewed journal (not just a vendor white paper)
- [ ] Independent researchers (not vendor employees) conducted validation
- [ ] Performance reported across diverse patient populations
- [ ] Sensitivity to different EHR documentation practices assessed

Outcome Evidence ❌ Epic sepsis model failed all of these
- [ ] Impact on patient outcomes measured (not just model accuracy)
- [ ] Clinical workflow impact assessed
- [ ] Alert burden quantified
- [ ] Clinician acceptance and response rates reported
- [ ] Cost-effectiveness analysis

Transparency ❌ Epic sepsis model failed these
- [ ] Training data characteristics disclosed
- [ ] Feature engineering documented
- [ ] Known limitations clearly stated
- [ ] Performance expectations realistic (not just best-case)
- [ ] Conflicts of interest disclosed

Implementation Support (variable)
- [ ] Implementation guide provided
- [ ] Training for clinical staff
- [ ] Ongoing technical support
- [ ] Monitoring dashboards for performance tracking
- [ ] Customization to local population possible
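One way to make this checklist bite is to encode it as a hard gate in procurement: any missing evidence item blocks even a pilot. A minimal sketch; the item names are ours, not a standard.

```python
REQUIRED_EVIDENCE = (
    "external_validation_at_independent_sites",
    "peer_reviewed_publication",
    "independent_researchers",
    "patient_outcomes_measured",
    "alert_burden_quantified",
    "training_data_disclosed",
)

def procurement_gate(evidence_provided):
    """Return (passes, missing_items) for a vendor's evidence package."""
    missing = [item for item in REQUIRED_EVIDENCE if item not in evidence_provided]
    return (len(missing) == 0, missing)
```

Returning the list of gaps, not just a boolean, gives the hospital a concrete request to send back to the vendor.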
Key Takeaways
Vendor claims require independent verification - Epic’s reported performance did not hold up to external validation
Internal validation overfits to training data - Same hospitals, same workflows, same documentation practices
AUC is not enough - Model accuracy must translate to clinical benefit and workflow fit
Alert burden matters more than you think - 88% false alarm rate causes alert fatigue and system abandonment
Measure outcomes, not just model performance - Did patients actually benefit? Were sepsis deaths prevented?
Hospitals need to demand evidence - “Deployed in 100+ hospitals” is not evidence of effectiveness
Transparency enables trust - Vendor opacity prevents independent validation and slows progress
References
Primary research:
- Wong, A., et al. (2021). External Validation of a Widely Implemented Proprietary Sepsis Prediction Model in Hospitalized Patients. JAMA Internal Medicine, 181(8), 1065-1070. DOI: 10.1001/jamainternmed.2021.2626
- McCoy, A., & Das, R. (2017). Reducing patient mortality, length of stay and readmissions through machine learning-based sepsis prediction in the emergency department, intensive care unit and hospital floor units. BMJ Open Quality, 6(2), e000158. DOI: 10.1136/bmjoq-2017-000158

Commentary and analysis:
- Sendak, M. P., et al. (2020). A Path for Translation of Machine Learning Products into Healthcare Delivery. EMJ Innovations, 10(1), 19-00172. DOI: 10.33590/emjinnov/19-00172
- Ginestra, J. C., et al. (2019). Clinician Perception of a Machine Learning-Based Early Warning System Designed to Predict Severe Sepsis and Septic Shock. Critical Care Medicine, 47(11), 1477-1484. DOI: 10.1097/CCM.0000000000003803

Media coverage:
- Ross, C., & Swetlitz, I. (2021). Epic sepsis prediction tool shows sizable overestimation in external study. STAT News
- Strickland, E. (2022). How Sepsis Prediction Algorithms Failed in Real-World Implementation. IEEE Spectrum
Case Study 5: UK NHS COVID-19 Contact Tracing App - Technical and Social Failure
The Promise
May 2020: UK government announced a smartphone app for COVID-19 contact tracing to enable rapid identification and isolation of contacts, allowing the country to ease lockdown safely.
Stated goals:
- Rapid contact identification (within hours, not days)
- Privacy-preserving (no central database of contacts)
- Enable safe reopening of the economy
- Complement manual contact tracing
- A “world-beating” system (PM Boris Johnson’s words)

Initial timeline:
- App promised by mid-May 2020
- Nationwide rollout by June 2020

The Reality

Timeline of failures:
- May 2020: Pilot on the Isle of Wight reveals technical problems
- June 2020: Original app abandoned after £12M spent
- September 2020: New app finally launched (4 months late)
- Adoption: Only 28% of the population downloaded the final app (60%+ needed for effectiveness)
- Impact: Limited evidence of meaningful contact tracing benefit

Technical problems:
- Original app couldn’t detect contacts on iPhones when the screen was locked
- Bluetooth proximity detection unreliable (detected people through walls, missed close contacts)
- Battery drain issues
- False positives from fleeting encounters

Social and political problems:
- Centralized data collection raised privacy concerns
- Trust eroded by constantly changing approaches
- Forced switch to the Apple/Google API after rejecting it initially
- £12M wasted on the failed first version
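Why Bluetooth proximity is so unreliable is easiest to see from how distance is estimated: apps invert the log-distance path-loss model, so a few dB of attenuation from a wall, pocket, or body roughly doubles the estimated distance. A sketch with illustrative calibration constants:

```python
def estimated_distance(rssi_dbm: float,
                       tx_power_dbm: float = -59.0,
                       path_loss_exponent: float = 2.0) -> float:
    """Invert RSSI = tx_power - 10 * n * log10(d) to estimate distance in metres."""
    return 10 ** ((tx_power_dbm - rssi_dbm) / (10 * path_loss_exponent))

d_clear = estimated_distance(-65.0)       # ~2 m in clear conditions
d_attenuated = estimated_distance(-71.0)  # same separation, 6 dB attenuation: ~4 m
```

A 6 dB swing is well within normal device-to-device and body-blocking variation, which is why the same pair of phones could register as a close contact or be missed entirely.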
Root Cause Analysis
1. Technical Hubris: Reinventing the Wheel Badly
The Apple/Google API decision:
```python
# What the UK tried to do: centralized model
class UKCentralizedModel:
    """UK's original approach: centralized contact database."""

    def __init__(self):
        self.architecture = "centralized"
        self.data_location = "NHS servers"
        self.compatible_with_ios = False  # CRITICAL FLAW

    def detect_contacts(self, user_phone):
        """Contact detection on phones."""
        # PROBLEM 1: iOS restrictions
        # Apple does not allow Bluetooth to run in the background
        # unless the app uses the official Apple/Google Exposure Notification API
        if user_phone.os == 'iOS':
            if user_phone.screen_locked or user_phone.app_backgrounded:
                # Bluetooth turned off by iOS
                # Can't detect any contacts
                return []  # MASSIVE FAILURE
                # ~50% of the UK uses iPhones
                # App useless for half the population

        # PROBLEM 2: Even on Android, unreliable
        contacts = self.scan_bluetooth()
        # Bluetooth RSSI (signal strength) is a poor proxy for distance
        false_positives = {
            'through_walls': True,  # Detects neighbors through walls
            'through_windows': True,  # Detects people outside
            'fleeting_encounters': True,  # Walking past someone for 2 seconds
        }
        false_negatives = {
            'signal_interference': True,  # Phone in pocket or bag
            'device_variability': True,  # Different phones = different Bluetooth
            'body_blocking': True  # Human body blocks signal
        }
        return contacts  # Low-quality, unreliable data

    def send_data_to_server(self, contacts):
        """Upload contact data to NHS servers."""
        # PROBLEM 3: Privacy concerns
        # All contact data sent to a central government database
        # Who you met, when, where (if combined with location)
        # Potential for mission creep and surveillance
        privacy_concerns = {
            'central_database': True,
            'government_access': True,
            'mission_creep_risk': True,
            'public_trust': 'Low'
        }

        # Upload to NHS servers
        self.nhs_server.store_contact_graph(contacts)
        # Creates a network graph of the entire population's contacts
        # Privacy advocates alarmed
        # Public skeptical


# What Apple/Google designed: decentralized model
class AppleGoogleExposureNotification:
    """Apple/Google's approach: decentralized, privacy-preserving."""

    def __init__(self):
        self.architecture = "decentralized"
        self.data_location = "on device only"
        self.compatible_with_ios = True  # WORKS on all devices
        self.privacy_preserving = True

    def detect_contacts(self, user_phone):
        """Contact detection using the OS-level API."""
        # ADVANTAGE 1: Works on iOS and Android
        # Apple and Google built this into the operating system,
        # so Bluetooth works even when the screen is locked,
        # because the OS manages it, not the app
        contacts = self.exposure_notification_api.scan_bluetooth()

        # ADVANTAGE 2: Privacy by design
        # Phones exchange random, rotating identifiers
        # No names, no phone numbers, no location -
        # just anonymous tokens that change every 15 minutes
        return contacts  # Better technical performance

    def handle_positive_test(self, user):
        """User tests positive for COVID-19."""
        # ADVANTAGE 3: Decentralized matching
        # When a user tests positive, they upload their random IDs;
        # other phones download the list of positive IDs
        # and match locally on the device.
        # No central database of who met whom.
        if user.tests_positive:
            user.phone.upload_anonymous_ids()  # Just random numbers
            # Other users' phones check:
            # "Have I encountered any of these random IDs?"
            # If yes: "You may have been exposed, get tested"
            # If no: no notification

            # NHS servers never know:
            # - Who you are
            # - Who you met
            # - Where you were
            # - When encounters happened
            privacy_preserved = True
            functionality_maintained = True
```

Why the UK rejected the Apple/Google API initially:
- Wanted centralized data for epidemiological research
- Wanted to control the contact matching algorithm
- Believed they could build a better system
- Pride: “We don’t need American tech companies”
Why the UK ultimately had to adopt it:
- Their app literally didn’t work on iPhones
- Couldn’t bypass Apple’s iOS restrictions
- Public wouldn’t accept privacy violations
- Months wasted before admitting the mistake
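The decentralized design the UK eventually adopted can be sketched end to end in a few lines: phones broadcast rotating random tokens, a positive user uploads only their own tokens, and matching happens on the device. This is a deliberate simplification (the real Exposure Notification protocol derives rolling identifiers from daily keys), but the privacy property is the same.

```python
import secrets

class Phone:
    def __init__(self):
        self.my_tokens = []        # rotating tokens this phone has broadcast
        self.heard_tokens = set()  # tokens heard from nearby phones

    def broadcast(self) -> str:
        """Emit a fresh anonymous token (rotated every few minutes in reality)."""
        token = secrets.token_hex(16)
        self.my_tokens.append(token)
        return token

    def hear(self, token: str):
        self.heard_tokens.add(token)

    def check_exposure(self, published_positive_tokens) -> bool:
        """On-device matching: the server never learns who met whom."""
        return any(t in self.heard_tokens for t in published_positive_tokens)

alice, bob, carol = Phone(), Phone(), Phone()
bob.hear(alice.broadcast())        # Bob and Alice were in Bluetooth range
published = list(alice.my_tokens)  # Alice tests positive and uploads her tokens
```

The server only ever stores anonymous tokens from positive users; the contact graph exists nowhere.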
3. Organizational Dysfunction
Flawed decision-making process:
- No diversity of expertise
  - Led by NHSX (digital transformation unit)
  - Limited public health input initially
  - No social scientists or behavioral economists early on
  - Technical team isolated from policy team
- Sunk cost fallacy
  - £12M invested in the centralized model
  - Reluctance to abandon it and switch to the Apple/Google API
  - Months wasted before admitting failure
  - Political pressure to show “progress”
- Overpromising
  - PM Boris Johnson: “world-beating” system
  - Timeline commitments impossible to meet
  - Set up for public disappointment
- Ignoring international experience
  - Other countries already struggling with similar apps
  - Germany, France, Australia all faced low adoption
  - Could have learned from their mistakes
  - Instead: “British exceptionalism”
The Fallout
Financial costs:
- £12M on the failed first version
- Unknown additional costs for the second version
- Opportunity cost of not investing in manual contact tracing capacity

Public health impact:
- App launched 4 months late (during crucial second-wave preparation)
- Low adoption (28%) meant limited effectiveness
- No strong evidence the app prevented significant transmission
- Resources diverted from manual contact tracing

Trust damage:
- Public trust in government COVID response eroded
- Privacy concerns about future health data initiatives
- Skepticism about government digital projects

Political fallout:
- Embarrassment for the UK government
- Delayed reopening plans (the app was a prerequisite)
- International reputation damage (“world-beating” became a punchline)
What Should Have Happened
Evidence-based approach:
Phase 1: Rapid Evidence Review (Week 1-2)
1. Review international contact tracing app experiences
   - Singapore, Australia, Germany were early adopters
   - Document technical challenges and adoption barriers
   - Learn from their mistakes
2. Consult with Apple and Google
   - Understand iOS/Android constraints BEFORE building
   - Evaluate the Apple/Google Exposure Notification API
3. Engage privacy and civil liberties experts
   - Design privacy-preserving architecture from the start
   - Address concerns proactively
4. Model adoption requirements
   - What adoption rate is needed for effectiveness?
   - What factors drive adoption?
   - Is an app even necessary given manual tracing capacity?

Phase 2: Stakeholder Engagement (Week 3-4)
1. Public consultation on the privacy model
   - Centralized vs. decentralized tradeoffs
   - Transparency about data use
   - Build trust before launch
2. Behavioral science input
   - How to message the app to maximize adoption?
   - What concerns need addressing?
   - Pilot messaging with focus groups
3. Integrate with the broader testing strategy
   - The app is just one tool, not a silver bullet
   - Ensure testing capacity can handle app-generated demand
   - Link to NHS Test and Trace infrastructure

Phase 3: Technical Development (Week 5-12)
1. Use the Apple/Google Exposure Notification API from the start
   - Saves months of development time
   - Ensures cross-platform compatibility
   - Provides privacy guarantees
2. Design for intermittent engagement
   - Most users won’t check the app daily
   - Push notifications for exposures
   - Low-friction user experience
3. Plan for false positives
   - How to calibrate sensitivity vs. specificity?
   - What support for people receiving exposure alerts?
   - Avoid overwhelming the testing system

Phase 4: Pilot and Iterate (Week 13-16)
1. Pilot on the Isle of Wight (a good choice, actually)
   - But be transparent about results
   - Acknowledge problems quickly
   - Iterate rapidly based on feedback
2. Monitor technical performance AND social adoption
   - Don’t just measure downloads
   - Measure active use and notification response rates
   - Survey reasons for non-adoption

Phase 5: Conditional National Rollout
1. Only scale if the pilot shows:
   - Technical reliability (works on all phones)
   - Adequate adoption (>50% in pilot area)
   - Manageable false positive rate
   - Integration with the Test & Trace system works
2. If the pilot fails: abandon or redesign
   - Don’t waste £12M on a failed system
   - Redirect resources to manual contact tracing
Realistic timeline: 12-16 weeks (vs. promised 4-6 weeks)
Key difference: Honest about technical constraints and social requirements from day one
Prevention Checklist
Use this before launching apps or digital tools:
Technical Feasibility ❌ UK NHS app initially failed all of these
- [ ] Platform constraints understood (iOS, Android limitations)
- [ ] Technical architecture validated by external experts
- [ ] Pilot testing in real-world conditions
- [ ] Failure modes identified and mitigated
- [ ] Battery, data, accessibility considered

Privacy and Ethics ❌ UK initially failed these
- [ ] Privacy-by-design from the start (not added later)
- [ ] Independent ethics review
- [ ] Privacy Impact Assessment completed
- [ ] Data minimization principle applied
- [ ] Transparency about data use and storage

Social Adoption ❌ UK failed to model these properly
- [ ] Adoption requirements modeled (what % is needed for effectiveness?)
- [ ] Barriers to adoption identified through user research
- [ ] Trust-building strategy developed
- [ ] Behavioral science input integrated
- [ ] Communication strategy tested with target audience

Integration with Health System ❌ UK struggled with this
- [ ] Integration with existing contact tracing infrastructure
- [ ] Testing capacity adequate for app-generated demand
- [ ] Workflow for handling notifications designed
- [ ] Training for contact tracers and clinical staff
- [ ] Monitoring and evaluation plan

Governance ❌ UK failed several of these
- [ ] Diverse expert input (technical, clinical, social science, ethics)
- [ ] Realistic timeline (not politically driven)
- [ ] Contingency plans if adoption is low
- [ ] Transparent reporting of performance
- [ ] Willingness to abandon if not working
Key Takeaways
Technical feasibility ≠ Social acceptability - App that works technically can still fail if people won’t use it
You can’t bypass platform constraints - Apple and Google control iOS/Android; work with them, not against them
Privacy skepticism is rational - Especially for government surveillance; design for privacy from start
Overpromising backfires - “World-beating” claims set up for humiliating failure
Sunk cost fallacy is dangerous - Abandon failed approaches quickly; don’t throw good money after bad
Digital is not a substitute for infrastructure - App can’t compensate for inadequate testing and manual contact tracing
Learn from others’ mistakes - Multiple countries’ failures preceded UK’s; could have learned from them
Diverse expertise matters - Technical teams alone make bad policy; need social science, ethics, public health input
References
Government reports and official documents:
- National Audit Office. (2021). Test and Trace in England - progress update
- Information Commissioner’s Office. (2020). ICO statement on NHS COVID-19 app

Research and analysis:
- Rowe, F., et al. (2021). When is it effective to use digital contact tracing as a response to COVID-19? PLOS ONE, 16(5), e0248250. DOI: 10.1371/journal.pone.0248250
- Abeler, J., et al. (2020). COVID-19 Contact Tracing and Data Protection Can Go Together. JMIR mHealth and uHealth, 8(4), e19359. DOI: 10.2196/19359
- Williams, S. N., et al. (2021). Public attitudes towards COVID-19 contact tracing apps: A UK-based focus group study. Health Expectations, 24(2), 377-385. DOI: 10.1111/hex.13179

Media coverage:
- Kelion, L. (2020). Coronavirus: UK contact-tracing app switches to Apple-Google model. BBC News
- Murphy, M., & Bradshaw, T. (2020). UK abandons contact-tracing app for Apple and Google model. Financial Times
- Hern, A. (2020). UK’s NHS Covid-19 contact tracing app has cost £35m so far. The Guardian
Case Study 6: OPTUM/UnitedHealth Algorithmic Bias - Proxy Discrimination
The Promise
UnitedHealth’s OPTUM developed an algorithm used to identify high-risk patients for “care management programs” - extra support and resources for complex medical needs.
Stated purpose:
- Identify patients who would benefit from intensive care management
- Reduce emergency department visits and hospitalizations
- Improve health outcomes for complex patients
- Allocate healthcare resources efficiently

Scale:
- Used by many US healthcare systems
- Affected approximately 200 million people annually
The Reality
October 2019: Published in Science (Obermeyer et al. 2019)
Researchers at UC Berkeley discovered the algorithm exhibited severe racial bias:
Key findings:
- Algorithm systematically scored Black patients as lower risk than White patients with the same level of health needs
- At a given risk score, Black patients were significantly sicker than White patients
- Bias magnitude: to achieve equal access, 46.5% more Black patients would need to be enrolled
- Impact: millions of Black patients denied access to care management programs they needed

How it worked:
- Algorithm predicted healthcare costs as a proxy for healthcare needs
- Problem: Black patients have lower healthcare spending even when equally or more sick
- Result: Algorithm learned that Black patients are “lower risk” because they spend less
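The proxy failure can be reproduced with a few lines of synthetic data: two groups with identical health need but a spending gap, enrolled by cost. The numbers are toy values, not OPTUM's data, but the mechanism is the one Obermeyer et al. documented.

```python
def enroll_highest_cost(patients, top_fraction=0.5):
    """Enrol the highest-cost patients - the proxy the algorithm actually used."""
    ranked = sorted(patients, key=lambda p: p["cost"], reverse=True)
    return ranked[: int(len(ranked) * top_fraction)]

# Identical need in both groups; group B spends less due to access barriers.
patients = (
    [{"group": "A", "need": 5, "cost": 15_000 + i} for i in range(10)]
    + [{"group": "B", "need": 5, "cost": 8_000 + i} for i in range(10)]
)
enrolled = enroll_highest_cost(patients)
share_b = sum(p["group"] == "B" for p in enrolled) / len(enrolled)  # 0.0
```

Despite equal need in both groups, no group B patient is enrolled; ranking on a direct measure of need instead of cost would enrol the two groups equally.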
Root Cause Analysis
1. Proxy Variable Bias: Healthcare Costs ≠ Healthcare Needs
The fundamental design flaw:
```python
# What OPTUM did (WRONG): Predict costs as a proxy for health needs
class OPTUMRiskAlgorithm:
    """
    OPTUM's approach: Use cost as a proxy for health needs
    """

    def __init__(self):
        self.target_variable = "total_healthcare_costs"  # WRONG CHOICE
        self.intended_outcome = "identify patients with high health needs"
        # PROBLEM: Costs ≠ Needs (especially across racial groups)

    def train_model(self, patient_data):
        """Train a model to predict healthcare costs"""
        features = [
            'age',
            'sex',
            'diagnosis_codes',  # ICD-10 codes
            'medications',
            'prior_utilization',
            'comorbidities',
            # NOTE: 'race' is not explicitly included as a feature,
            # but race is correlated with many of these features
        ]
        # Target: total healthcare costs in the next year
        target = 'total_healthcare_costs_next_year'
        # Train the predictive model
        model = self.ml_algorithm.fit(
            X=patient_data[features],
            y=patient_data[target]  # Predicting costs
        )
        return model

    def predict_risk(self, patient):
        """Predict a patient's risk score"""
        # Higher predicted cost = higher risk score
        predicted_cost = self.model.predict(patient)
        risk_score = predicted_cost  # Cost used as a proxy for health need
        return risk_score

    def analyze_why_bias_occurs(self):
        """Why this approach creates racial bias"""
        # PROBLEM: Healthcare costs reflect systemic inequities
        # Example: two patients with the same chronic kidney disease
        white_patient = {
            'ckd_stage': 4,  # Advanced kidney disease
            'symptoms': 'Severe',
            'access_to_care': 'Good insurance, nearby nephrologist',
            'historical_spending': '$15,000/year',
            'receives_appropriate_care': True,
        }
        black_patient = {
            'ckd_stage': 4,        # Same advanced kidney disease
            'symptoms': 'Severe',  # Same symptom severity
            'access_to_care': 'Medicaid, far from nephrologist, work scheduling conflicts',
            'historical_spending': '$8,000/year',  # LOWER spending despite same disease
            'receives_appropriate_care': False,    # Barriers prevent getting needed care
        }
        # The algorithm learns:
        #   White patient costs $15K -> high risk  -> enroll in care management
        #   Black patient costs $8K  -> lower risk -> don't enroll
        # Reality: the Black patient has the SAME disease severity,
        # but structural barriers reduce their healthcare spending.
        # The algorithm learns "Black patients cost less" = "Black patients
        # are healthier" -- which is FALSE and harmful.

        # Root causes of the spending disparities:
        reasons_for_lower_black_spending = {
            'access_barriers': [
                'Lack of transportation',
                'Work schedule inflexibility',
                'Childcare responsibilities',
                'Provider shortages in predominantly Black neighborhoods',
                'Distance to specialists',
            ],
            'insurance_barriers': [
                'Higher rates of Medicaid (lower reimbursement)',
                'Higher rates of uninsurance',
                'High deductibles deterring care-seeking',
            ],
            'systemic_factors': [
                "Physician implicit bias (Black patients' pain undertreated)",
                'Lower referral rates to specialists',
                'Medical mistrust from historical abuses (Tuskegee, etc.)',
                'Discrimination in healthcare settings',
            ],
            'economic_factors': [
                'Lower incomes -> delay seeking care',
                'Cost-related medication non-adherence',
                'Unable to afford copays and deductibles',
            ],
        }
        # Result: the algorithm encodes systemic racism into automated decisions
        return reasons_for_lower_black_spending
```
What SHOULD have been done (correct): predict health needs directly:

```python
class ImprovedRiskAlgorithm:
    """
    Better approach: Predict health needs, not costs
    """

    def __init__(self):
        self.target_variable = "active_chronic_conditions"  # Better proxy
        self.intended_outcome = "identify patients with high health needs"
        # Much better alignment between target and intended outcome

    def train_model(self, patient_data):
        """Train a model to predict health needs directly"""
        # Use multiple indicators of health need:
        health_need_indicators = [
            'number_active_chronic_conditions',
            'disease_severity_scores',
            'functional_status',           # Activities of daily living
            'biomarkers',                  # HbA1c for diabetes, GFR for kidney disease
            'patient_reported_symptoms',
            'risk_of_deterioration',
        ]
        # Composite target: actual health need, not spending;
        # each chronic condition weighted by severity
        target = self.calculate_health_need_score(patient_data)
        model = self.ml_algorithm.fit(
            X=patient_data[health_need_indicators],
            y=target  # Predicting health need, not costs
        )
        return model

    def evaluate_for_bias(self, model, test_data, threshold=0.05):
        """Proactive bias testing.

        Check: do patients with the same health needs get the same
        risk scores regardless of race?
        """
        fairness_audit_report = []
        for condition_severity in ['mild', 'moderate', 'severe']:
            white_patients = test_data[
                (test_data['race'] == 'White') &
                (test_data['condition_severity'] == condition_severity)
            ]
            black_patients = test_data[
                (test_data['race'] == 'Black') &
                (test_data['condition_severity'] == condition_severity)
            ]
            white_predictions = model.predict(white_patients)
            black_predictions = model.predict(black_patients)
            # Test for disparate impact
            gap = abs(white_predictions.mean() - black_predictions.mean())
            if gap > threshold:
                print(f"WARNING: Racial disparity in predictions for {condition_severity}")
                print(f"White patients' average risk: {white_predictions.mean()}")
                print(f"Black patients' average risk: {black_predictions.mean()}")
                fairness_audit_report.append((condition_severity, gap))
                # Investigate and fix before deployment
                self.investigate_disparity(condition_severity)
        return fairness_audit_report
```

The cost-based approach encoded existing healthcare disparities:
- Black patients historically receive less care for the same conditions (due to systemic racism)
- The algorithm learned this pattern
- The algorithm perpetuated and amplified the disparity by denying Black patients access to intensive care management
- Feedback loop: less care → sicker → but still predicted as "low risk" because spending remains low
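The feedback loop can be made concrete with a minimal sketch of the enrollment decision: cost-as-proxy scoring ranks the patient with better access higher even when clinical need is identical. The patients, fields, and dollar figures below mirror the CKD example from this case study but are illustrative, not OPTUM's actual implementation.

```python
def risk_scores(patients):
    """Cost-as-proxy scoring: rank patients by historical spending."""
    return {p["id"]: p["spending"] for p in patients}

patients = [
    {"id": "A", "need": 15_000, "spending": 15_000},  # good access to care
    {"id": "B", "need": 15_000, "spending": 8_000},   # access barriers
]

scores = risk_scores(patients)
enrolled = max(scores, key=scores.get)
# Patient "A" is enrolled in care management; patient "B" is not,
# despite identical clinical need -- spending stands in for need.
```

Because patient "B" is never enrolled, their care gap persists, their future spending stays low, and the next scoring cycle again ranks them as "low risk".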
2. Organizational Blindness to Bias
How did this system get deployed to 200 million people without the bias being detected?
Lack of bias testing:
```python
class OrganizationalFailure:
    """
    Why bias wasn't caught before deployment
    """

    def pre_deployment_testing(self):
        """What OPTUM likely tested"""
        tests_performed = {
            'predictive_accuracy': True,  # AUC, R-squared
            'calibration': True,          # Do predictions match actual costs?
            'stability': True,            # Performance over time?
            'racial_fairness': False,     # NOT TESTED
        }
        # OPTUM measured:
        #   "Does the algorithm accurately predict healthcare costs?" YES
        #   "Is the algorithm calibrated?" YES
        #   "Does the algorithm perform consistently?" YES
        # OPTUM did NOT measure:
        #   "Do patients with the same health needs get the same risk
        #    scores regardless of race?" NO -- never asked this question
        # Why not?
        why_fairness_not_tested = {
            'lack_of_awareness': "Didn't consider algorithmic bias as a risk",
            'lack_of_expertise': "No fairness/ethics experts on the development team",
            'lack_of_mandate': "No regulatory requirement to test for bias",
            'lack_of_incentive': "Incentivized to predict costs accurately, not fairly",
            'organizational_culture': "Tech solutionism, not critical reflection",
        }
        return "Bias went undetected for years"

    def incentive_misalignment(self):
        """Why costs were chosen as the target variable"""
        # OPTUM's business model:
        #   - Paid by health insurers
        #   - Goal: reduce costs (hospitalizations, ED visits)
        #   - Success metric: cost reduction
        # This incentivizes:
        #   - Target variable: predict costs (aligns with the business goal)
        #   - Not: predict health needs (doesn't align with the business goal)
        # Problem: what's good for business (cost reduction)
        # ≠ what's good for patients (equitable access to care)
        business_incentives = {
            'primary_goal': 'Reduce healthcare costs',
            'secondary_goal': 'Improve outcomes',
            'equity_goal': 'Not prioritized',
        }
        # Result: the algorithm was optimized for the wrong objective
        return "Profit motive misaligned with equity"

    def lack_of_diverse_perspectives(self):
        """Homogeneous teams miss bias"""
        typical_ml_team_composition = {
            'data_scientists': 'Mostly White and Asian',
            'engineers': 'Mostly White and Asian men',
            'product_managers': 'Mostly White',
            'domain_experts': 'Healthcare economists, some clinicians',
        }
        missing_perspectives = {
            'Black_and_Latino_clinicians': 'Could have flagged health access disparities',
            'health_equity_researchers': 'Could have identified the proxy variable problem',
            'ethicists': 'Could have raised fairness questions',
            'community_representatives': 'Could have voiced concerns about discrimination',
        }
        # Homogeneous teams have blind spots, especially around how
        # systems affect marginalized communities.
        return "Lack of diversity → Lack of critical perspectives"
```

3. The Illusion of Objectivity
“The algorithm doesn’t use race, so it can’t be racist” - WRONG
```python
class FairnessWashingMyths:
    """
    Common misconceptions about algorithmic fairness
    """

    def myth_1_not_using_race_means_unbiased(self):
        """
        MYTH: If we don't include race as a feature, the algorithm is fair.
        REALITY: Race is correlated with many other features.
        """
        # OPTUM did not include 'race' as an explicit feature,
        # but many features are highly correlated with race:
        race_proxies = {
            'zip_code': 'Residential segregation means zip code predicts race',
            'type_of_insurance': 'Medicaid vs private correlates with race',
            'hospital_where_treated': 'Hospital segregation',
            'primary_language': 'Spanish, Creole, etc.',
            'diagnosis_codes': 'Some conditions more prevalent in certain racial groups',
            'historical_spending': 'Reflects past access barriers',
        }
        # The algorithm doesn't need 'race' explicitly --
        # it learns racial patterns from correlated features.
        # Example:
        #   Zip code 02119 (Roxbury, Boston)       -> 65% Black
        #   Zip code 02467 (Chestnut Hill, Boston) -> 95% White
        # The algorithm learns different risk profiles by zip code
        # = learning race without explicitly using race.
        return "Fairness through unawareness does NOT work"

    def myth_2_algorithms_are_objective(self):
        """
        MYTH: Algorithms are objective, humans are biased.
        REALITY: Algorithms encode human choices and societal biases.
        """
        human_choices_in_algorithm = {
            'what_to_predict': 'CHOICE: Costs vs health needs',
            'what_data_to_use': 'CHOICE: Historical EHR data (contains bias)',
            'what_features_to_include': 'CHOICE: Symptom reports, biomarkers, spending?',
            'how_to_measure_success': 'CHOICE: Predictive accuracy vs fairness',
            'who_to_test_on': 'CHOICE: Diverse population or convenience sample',
            'what_threshold_to_use': 'CHOICE: How high a risk score triggers intervention?',
        }
        # Every choice is made by humans; every choice can encode bias.
        # The algorithm amplifies and scales those choices.
        return "Algorithms are not objective; they're laundered subjectivity"

    def myth_3_accuracy_implies_fairness(self):
        """
        MYTH: If an algorithm is accurate, it must be fair.
        REALITY: It can be highly accurate AND highly biased.
        """
        # OPTUM's algorithm WAS accurate at predicting costs,
        # but costs ≠ needs, especially across racial groups.
        # The algorithm accurately learned: "Black patients spend less money."
        # This is TRUE (accurate) but UNFAIR (spending is lower due to
        # discrimination).
        # Accuracy on the wrong objective = accurate unfairness.
        return "Accuracy ≠ Fairness"
```

The Fallout
Publication and reaction:
- October 2019: Research published in Science (Obermeyer et al., 2019)
- Massive media coverage: "Algorithm discriminates against Black patients"
- UnitedHealth/OPTUM acknowledged the bias and committed to fixing it
Impact on patients:
- Unknown how many Black patients were denied access to care management over the years
- Impossible to retroactively identify all affected patients
- Disparities in chronic disease outcomes may have been worsened

Legal and regulatory:
- The FTC opened an inquiry into health AI fairness
- Multiple states investigated algorithmic bias in healthcare
- Increased regulatory attention to health algorithms
- Calls for auditing requirements

Research impact:
- Sparked a wave of research on algorithmic fairness in healthcare
- Led to the development of bias detection and mitigation methods
- Changed how health AI is evaluated (fairness is now a standard consideration)

Industry impact:
- Health AI vendors now (claim to) test for bias
- Fairness audits are becoming standard practice
- Increased scrutiny of proxy variables
What Should Have Happened
Responsible development process:
Phase 1: Problem Formulation (Weeks 1-4)

1. Define the objective clearly
   - What are we trying to achieve? (Identify patients with high health needs)
   - What is the right target variable? (NOT costs)
   - Are there proxy variable concerns?
2. Assemble a diverse team
   - Data scientists + clinicians
   - Health equity researchers
   - Ethicists
   - Black and Latino community health advocates
   - Social determinants of health experts
3. Literature review
   - What is known about racial disparities in healthcare spending?
   - What is known about racial disparities in access to care?
   - Are costs a good proxy for health needs across racial groups? NO
Phase 2: Data and Modeling (Months 2-6)

1. Choose an appropriate target variable
   - Health need indicators (chronic conditions, disease severity)
   - NOT healthcare costs
2. Exploratory data analysis
   - Examine racial disparities in the data
   - Understand correlations between race and features
   - Document known biases in the historical data
3. Model development with fairness constraints
   - Develop multiple models
   - Test fairness metrics alongside accuracy metrics
   - Use fairness-aware ML methods if needed
Phase 3: Bias Testing (Months 7-9)

1. Comprehensive fairness audit
   - Do patients with the same health needs get the same risk scores across racial groups?
   - Disparate impact analysis
   - Calibration by subgroup
   - Multiple fairness definitions tested
2. Clinical validation
   - Do clinicians agree that risk scores reflect health needs?
   - Are Black patients at a given risk score as sick as White patients?
3. External validation
   - Test at healthcare systems not in the development data
   - Diverse patient populations
   - Independent researchers evaluate
Phase 4: Pilot Implementation (Months 10-15)

1. Small-scale pilot
   - 3-5 healthcare systems
   - Monitor enrollment rates by race
   - Track whether enrolled patients actually have high health needs
   - Monitor outcomes (do enrolled patients benefit?)
2. Continuous bias monitoring
   - Dashboard showing enrollment rates by race
   - Alerts if disparities emerge
   - Regular audits
Phase 5: Transparent Deployment

1. Public documentation
   - How the algorithm works
   - What fairness testing was done
   - Known limitations
   - Ongoing monitoring plan
2. Appeals process
   - Patients and clinicians can challenge risk scores
   - Manual review of borderline cases
   - Feedback loop for model improvement
Timeline: 15-18 months before full deployment (with extensive testing)
What actually happened: Algorithm deployed widely without fairness testing, bias discovered by external researchers years later
Prevention Checklist
Use this checklist for any AI system affecting people:
Problem Formulation (❌ OPTUM failed these)
- [ ] Target variable directly measures the intended outcome (not a proxy)
- [ ] Proxy variables examined for bias potential
- [ ] Diverse team involved in problem formulation
- [ ] Health equity expert consulted
- [ ] Historical bias in data acknowledged

Model Development (❌ OPTUM failed these)
- [ ] Fairness metrics defined alongside accuracy metrics
- [ ] Multiple fairness definitions tested
- [ ] Subgroup analysis by race, ethnicity, gender, age
- [ ] Fairness-aware ML methods considered
- [ ] Interpretability (can identify sources of bias)

Validation and Testing (❌ OPTUM failed all of these)
- [ ] Comprehensive fairness audit conducted
- [ ] External validation on diverse populations
- [ ] Independent researchers evaluate fairness
- [ ] Clinical validation (do scores match clinical judgment?)
- [ ] Published results in a peer-reviewed journal

Deployment (❌ OPTUM failed these)
- [ ] Continuous bias monitoring
- [ ] Dashboard showing outcomes by demographic group
- [ ] Appeals process for contested decisions
- [ ] Regular re-auditing (bias can emerge over time)
- [ ] Transparency about how the algorithm works

Governance (❌ OPTUM failed these)
- [ ] Diverse team (not just White/Asian men)
- [ ] Ethics review board approval
- [ ] Community stakeholder input
- [ ] Alignment between business incentives and equity goals
- [ ] Accountability (who is responsible if the algorithm causes harm?)
Key Takeaways
Proxy variables encode discrimination - Healthcare costs reflect systemic racism; using costs as proxy perpetuates bias
“Not using race” doesn’t prevent bias - Race correlated with many features; algorithm learns racial patterns indirectly
Accuracy ≠ Fairness - Algorithm can be highly accurate at wrong objective and still cause harm
Diverse teams catch bias - Homogeneous teams have blind spots about how systems affect marginalized groups
Business incentives can misalign with equity - Optimizing for cost reduction ≠ Optimizing for equitable care
Fairness testing must be proactive - Waiting for external researchers to find bias years later is unacceptable
Transparency enables accountability - Black-box algorithms escape scrutiny
Algorithmic bias is not a technical problem alone - It’s a socio-technical problem requiring diverse expertise
References
Primary research:
- Obermeyer, Z., Powers, B., Vogeli, C., & Mullainathan, S. (2019). Dissecting racial bias in an algorithm used to manage the health of populations. Science, 366(6464), 447-453. DOI: 10.1126/science.aax2342

Analysis and commentary:
- Parikh, R. B., Teeple, S., & Navathe, A. S. (2019). Addressing Bias in Artificial Intelligence in Health Care. JAMA, 322(24), 2377-2378. DOI: 10.1001/jama.2019.18058
- Vyas, D. A., Eisenstein, L. G., & Jones, D. S. (2020). Hidden in Plain Sight - Reconsidering the Use of Race Correction in Clinical Algorithms. New England Journal of Medicine, 383(9), 874-882. DOI: 10.1056/NEJMms2004740
- Rajkomar, A., et al. (2018). Ensuring Fairness in Machine Learning to Advance Health Equity. Annals of Internal Medicine, 169(12), 866-872. DOI: 10.7326/M18-1990

Media coverage:
- Ledford, H. (2019). Millions of black people affected by racial bias in health-care algorithms. Nature.
- Hoffman, B. (2019). Racial bias in a medical algorithm favors white patients over sicker black patients. Washington Post.

Broader context on algorithmic fairness:
- O'Neil, C. (2016). Weapons of Math Destruction: How Big Data Increases Inequality and Threatens Democracy. Crown.
- Noble, S. U. (2018). Algorithms of Oppression: How Search Engines Reinforce Racism. NYU Press.
Case Study 7-10: Additional Failure Post-Mortems (Concise Format)
The following case studies are presented in concise format. Each failure illustrates a distinct pattern that complements the detailed case studies above. Full versions are available in the online appendix.
Case Study 7: Singapore TraceTogether - Privacy Promise Broken
The Promise: Privacy-preserving COVID-19 contact tracing app with explicit guarantees that data would ONLY be used for contact tracing.
What Happened:
- January 2021: The Singapore government revealed that TraceTogether data was accessible to police for criminal investigations
- A direct contradiction of prior privacy assurances
- Public outcry and trust collapse

Key Failure: Mission creep and broken privacy promises
- "Data will only be used for contact tracing" became a crime-fighting tool
- ~4.2 million people (78% of the population) had downloaded the app based on privacy guarantees
- Retroactive disclosure destroyed trust

Lesson: Privacy promises must be legally binding and technically enforced. Voluntary assurances are insufficient. Use-limitation must be coded into system architecture, not just policy documents.

Prevention:
- Technical controls preventing unauthorized data access
- Legislative limits on data use (not just policy)
- Independent oversight and auditing
- Transparent disclosure of all potential uses BEFORE deployment
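What "use-limitation coded into system architecture" means can be sketched in a few lines: every data access must declare a purpose, and anything outside the allow-list is refused by the system itself rather than by policy. `ALLOWED_PURPOSES` and `access_records` are hypothetical names for illustration; real enforcement would also require cryptographic controls, legislation, and independent audit.

```python
ALLOWED_PURPOSES = {"contact_tracing"}

def access_records(purpose, records):
    """Deny any access whose declared purpose is not on the allow-list."""
    if purpose not in ALLOWED_PURPOSES:
        raise PermissionError(f"purpose '{purpose}' is not permitted")
    return records

exposure_log = ["token-1", "token-2"]
access_records("contact_tracing", exposure_log)            # permitted
# access_records("criminal_investigation", exposure_log)   # raises PermissionError
```

The point is architectural: a later policy reversal would require changing deployed code in the open, not quietly reinterpreting a written assurance.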
References:
- Wong, J. (2021). Singapore reveals Covid-tracing data available to police. BBC News.
- Lim, A. (2021). Trust in Singapore government plunges after TraceTogether data scandal. Straits Times.
Case Study 8: Babylon GP at Hand - Chatbot Playing Doctor
The Promise: AI chatbot (Babylon Health) that could diagnose conditions and triage patients as well as or better than GPs.
Marketing claims:
- "AI matches or exceeds doctors in diagnostic accuracy"
- Claims of 93% accuracy based on internal testing

What Happened:
- External validation revealed serious safety concerns (Fraser et al., 2020)
- The chatbot failed to recognize serious conditions (sepsis, meningitis)
- Unsafe triage recommendations (patients with serious symptoms told to stay home)
- Regulators investigated the advertising claims

Key Examples of Failures:
- Chest pain case: a patient with cardiac symptoms told "no urgent action needed"
- Meningitis case: classic meningitis symptoms flagged as "non-urgent"
- Bias: the system performed worse for non-English speakers and the elderly

Root Causes:
1. Validated on easy cases, not emergency presentations
2. No external clinical validation before deployment
3. Overstated marketing claims not supported by evidence
4. Commercial pressure to launch before proving safety
Lesson: Chatbots ≠ Medical diagnosis. Symptom checkers for triage require different validation than consumer apps. Safety bar must be much higher.
Prevention Checklist:
- [ ] External validation by independent clinicians
- [ ] Test on real emergency presentations (not textbook cases)
- [ ] Safety testing: does the system catch life-threatening conditions?
- [ ] Clear disclaimers about limitations
- [ ] Marketing claims supported by peer-reviewed evidence
References:
- Fraser, H., et al. (2020). Safety of patient-facing digital symptom checkers. The Lancet, 395(10233), 1199. DOI: 10.1016/S0140-6736(20)30819-8
- Gilbert, S., et al. (2020). GPT-3 for healthcare: Potential and pitfalls. BMJ.
Case Study 9: COVID-19 Forecasting Models - Mass Overfitting
The Promise: Hundreds of ML models published claiming to predict COVID-19 outcomes (mortality, ICU need, disease progression).
What Happened:
- A systematic review found 232 COVID-19 prediction models published in 2020 (Wynants et al., 2020)
- Result: 98% were at high risk of bias
- Almost none were suitable for clinical use
- Many were overfitted to early, unrepresentative data

Common Failures:
- Small sample sizes (some models trained on <100 patients)
- Data leakage (test data leaked into training)
- Geographic overfitting (trained on Wuhan, deployed in New York)
- Outcome mismeasurement (PCR used as ground truth despite a high false negative rate)
- No external validation before publication

Example: A model predicting COVID mortality from chest X-rays
- Training data: X-rays from COVID patients (supine, in ICU) vs. healthy controls (upright, outpatient)
- The model learned "supine position = COVID" (not actual disease features)
- It failed completely on external validation

Why So Many Failures:
- Urgency overrode rigor: the "pandemic emergency" was used to justify cutting corners
- Publication pressure: journals fast-tracked COVID papers without the usual scrutiny
- Lack of clinical involvement: many models were built by CS/AI teams without clinician collaboration
- Data quality ignored: early COVID data was messy, incomplete, and biased
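Two of the failures above, data leakage and geographic overfitting, share a mechanical cause: splitting records row by row, so the same hospital (or patient) appears in both training and test data. A stdlib-only sketch of the fix is to hold out whole sites; `site_level_split` and the record fields are hypothetical names for illustration.

```python
import random

def site_level_split(records, test_fraction=0.3, seed=0):
    """Hold out whole hospitals so no site appears in both train and test."""
    sites = sorted({r["site"] for r in records})
    rng = random.Random(seed)
    rng.shuffle(sites)
    n_test = max(1, int(len(sites) * test_fraction))
    test_sites = set(sites[:n_test])
    train = [r for r in records if r["site"] not in test_sites]
    test = [r for r in records if r["site"] in test_sites]
    return train, test

# Toy data: 20 patients spread across 5 hospitals
records = [{"site": f"H{i % 5}", "patient_id": i} for i in range(20)]
train, test = site_level_split(records)
# No hospital contributes records to both splits
assert {r["site"] for r in train}.isdisjoint(r["site"] for r in test)
```

A row-level random split would pass every accuracy check while silently leaking site-specific artifacts (scanner type, patient positioning, local protocols) into the test set, which is exactly how the supine X-ray model looked good until external validation.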
Lesson: Urgency is not an excuse for poor methods. Bad AI is worse than no AI. Clinical decisions require validated tools, not proof-of-concept models.
What Should Happen in Pandemic Response:
1. Coordination: a central registry of prediction models (avoid duplication)
2. Standards: minimum validation requirements before publication
3. External validation: independent test sets from multiple sites
4. Clinical partnership: every model needs clinician co-leads
5. Transparency: open data, open code, reproducibility
References:
- Wynants, L., et al. (2020). Prediction models for diagnosis and prognosis of covid-19: systematic review and critical appraisal. BMJ, 369. DOI: 10.1136/bmj.m1328
- Roberts, M., et al. (2021). Common pitfalls and recommendations for using machine learning to detect and prognosticate for COVID-19 using chest radiographs and CT scans. Nature Machine Intelligence, 3(3), 199-217. DOI: 10.1038/s42256-021-00307-0
Case Study 10: Apple Watch AFib Study - Selection Bias in Digital Epidemiology
The Promise: Large-scale digital epidemiology study using Apple Watch to detect atrial fibrillation (AFib) in general population.
The Study:
- 419,297 participants enrolled (a massive sample size!)
- Apple Watch detects an irregular pulse → alerts the user → EKG confirmation
- Published in NEJM in 2019 (Perez et al., 2019)
The Problem: Severe Selection Bias
Who participated:
- Apple Watch owners (not representative of the general population)
- People who opted into a research study (self-selected)
- Younger, wealthier, healthier, and more educated than the general population
- Already health-conscious (they bought a fitness tracking device)

Biases compounded:
1. Socioeconomic: an Apple Watch costs $400+ (excludes low-income participants)
2. Age: a younger population (AFib is more common in the elderly, who are less likely to own smartwatches)
3. Health literacy: self-selected participants are more health-engaged
4. Technology access: requires a smartphone, internet, and technical proficiency
5. Geographic: US-centric, limited racial/ethnic diversity

Why This Matters:
- Can't generalize to the general population: prevalence estimates are biased
- Missed high-risk populations: the elderly, low-income people, and minorities were underrepresented
- Widened health disparities: benefits accrue to already-advantaged groups
- Flawed public health inference: policy can't be guided by an unrepresentative sample

The AFib Detection Paradox:
- Detected AFib in a younger, lower-risk population (who own Apple Watches)
- Missed AFib in the older, higher-risk population (who don't)
- Result: the disease is detected in the people who need detection least and missed in the people who need it most
Lesson: Convenience samples ≠ Population inference. Digital epidemiology requires explicit attention to representativeness. Technology-based recruitment inherently biases toward privileged groups.
How to Do Digital Epidemiology Responsibly:
1. Acknowledge limitations clearly
   - "This is a study of Apple Watch owners, not the general population"
   - Don't overgeneralize findings
2. Recruit representatively
   - Provide devices to underrepresented groups
   - Use multiple recruitment channels (not just the app store)
   - Actively recruit high-risk populations
3. Report demographics transparently
   - Compare the sample to the target population
   - Quantify selection bias
   - Discuss implications for generalizability
4. Don't claim population-level inference from convenience samples
   - Be honest about who the findings apply to
   - Acknowledge equity implications
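The "quantify selection bias" step above can be sketched with post-stratification: reweight each subgroup's prevalence by the target population's composition instead of the sample's. All counts, shares, and field names below are hypothetical, chosen only to show how a young-skewed convenience sample understates AFib prevalence.

```python
def prevalence(sample):
    """Crude prevalence: fraction of participants with the condition."""
    return sum(p["afib"] for p in sample) / len(sample)

def poststratified_prevalence(sample, population_shares):
    """Reweight subgroup prevalences by the target population's composition."""
    est = 0.0
    for group, share in population_shares.items():
        members = [p for p in sample if p["age_group"] == group]
        est += share * prevalence(members)
    return est

# Convenience sample: 90% under-65 watch owners, 10% over-65
sample = (
    [{"age_group": "under65", "afib": 0}] * 89
    + [{"age_group": "under65", "afib": 1}] * 1
    + [{"age_group": "65plus", "afib": 0}] * 8
    + [{"age_group": "65plus", "afib": 1}] * 2
)
raw = prevalence(sample)  # 0.03 in the skewed sample
adjusted = poststratified_prevalence(sample, {"under65": 0.7, "65plus": 0.3})
# adjusted ≈ 0.068: weighting toward the older population more than
# doubles the estimate the convenience sample would report.
```

Post-stratification only corrects for the variables you weight on; it cannot repair unmeasured differences (health engagement, income), which is why representative recruitment still matters.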
Prevention Checklist:
- [ ] Sample representativeness assessed
- [ ] Selection bias quantified and reported
- [ ] High-risk populations intentionally recruited
- [ ] Generalizability limitations clearly stated
- [ ] Health equity implications considered
References:
- Perez, M. V., et al. (2019). Large-Scale Assessment of a Smartwatch to Identify Atrial Fibrillation. New England Journal of Medicine, 381(20), 1909-1917. DOI: 10.1056/NEJMoa1901183
- Goldsack, J. C., et al. (2020). Verification, analytical validation, and clinical validation (V3): the foundation of determining fit-for-purpose for Biometric Monitoring Technologies (BioMeTs). npj Digital Medicine, 3(1), 55. DOI: 10.1038/s41746-020-0260-4
Common Failure Patterns: Synthesis Across All 10 Cases
After analyzing 10 major failures, clear patterns emerge. Here’s the taxonomy:
Failure Pattern 1: Training Data Problems ⚠️
Seen in: Watson, Epic, OPTUM, COVID models
Manifestations:
- Synthetic data instead of real outcomes (Watson)
- Biased historical data (OPTUM: costs reflect discrimination)
- Small, unrepresentative samples (COVID models)
- Data leakage (COVID: test data contamination)

Root cause: Garbage in, garbage out. Models learn what's in the data, including biases and artifacts.

Prevention:
- Real patient outcomes, not hypotheticals
- Diverse, representative samples
- Document known biases in data
- Data quality checks before modeling
Failure Pattern 2: Validation Failures 🔍
Seen in: Watson, Epic, Google India, Babylon, COVID models
Manifestations:
- Circular validation (Watson: the same experts train and test)
- Internal-only validation (Epic: same hospitals)
- Lab ≠ Field (Google India: clinic performance diverged)
- No external validation (COVID models)

Root cause: Models overfit to development data. Performance degrades in new settings.

Prevention:
- External validation at independent sites
- Test in deployment conditions (not just the lab)
- Diverse patient populations
- Independent researchers validate
Failure Pattern 3: Wrong Objective Function 🎯
Seen in: OPTUM, Epic (indirectly)
Manifestations:
- Proxy variables (OPTUM: predict costs to infer needs)
- Business goals ≠ Patient goals (optimize for cost reduction vs. equitable care)
- Accuracy on the wrong metric (technically correct, ethically wrong)

Root cause: What you optimize for is what you get. If you optimize for the wrong thing, you get harmful outcomes.

Prevention:
- Align the objective function with the intended outcome
- Question proxy variables (do they introduce bias?)
- A multidisciplinary team defines the goals
- Fairness metrics alongside accuracy metrics
Failure Pattern 4: Privacy and Ethics Violations 🔒
Seen in: DeepMind, UK NHS app, TraceTogether
Manifestations:
- Unlawful data sharing (DeepMind)
- Broken privacy promises (TraceTogether)
- Privacy-hostile architecture (UK centralized model)
- Insufficient patient consent (DeepMind)

Root cause: "Move fast and break things" applied to sensitive health data. Innovation prioritized over privacy.

Prevention:
- Privacy by design (not an afterthought)
- Legal review BEFORE data sharing
- Technical controls enforcing privacy
- Independent oversight
Failure Pattern 5: Deployment Without Measuring Outcomes 📊
Seen in: Watson, Epic, Google India, UK app
Manifestations:
- AUC reported, patient outcomes not (Epic)
- Deployment without an outcome study (Watson)
- No impact evaluation (UK app)

Root cause: Focus on technical performance, not clinical impact. Model accuracy ≠ Patient benefit.

Prevention:
- Prospective outcome studies BEFORE wide deployment
- Measure what matters: mortality, quality of life, health equity
- Pilot with intensive monitoring
- Only scale if outcomes demonstrate benefit
Failure Pattern 6: Lack of Diverse Expertise 👥
Seen in: All cases to varying degrees
Manifestations:
- Homogeneous teams miss bias (OPTUM)
- No clinical input (COVID models)
- No social science input (UK app: underestimated adoption barriers)
- No ethics expertise (multiple)

Root cause: AI problems are socio-technical. They can't be solved with technical expertise alone.

Prevention:
- Multidisciplinary teams (clinicians, ethicists, social scientists, patients)
- Diverse racial/ethnic representation
- Domain experts co-lead (not just consultants)
- Community stakeholder input
Failure Pattern 7: Commercial Pressure Over Clinical Rigor 💰
Seen in: Watson, Babylon, UK app (political pressure)
Manifestations:
- Rushed timelines (Watson: 2 years from lab to global deployment)
- Overpromised marketing (Babylon: "matches doctors")
- Skipped validation (multiple)
- Sunk cost fallacy (UK app: £12M wasted)

Root cause: Financial and political incentives misaligned with patient safety.

Prevention:
- Independent safety oversight
- Regulatory requirements for validation
- Transparency about limitations
- Willingness to abandon failed projects
Failure Pattern 8: Lab-to-Field Translation Gap 🏥
Seen in: Google India, UK app, COVID models
Manifestations:
- Idealized lab conditions ≠ messy reality (Google India: poor cameras, unreliable internet)
- Workflow disruption (Google India: 5 min/patient collapsed clinics)
- User skill gap (nurses vs. photographers)
- Infrastructure assumptions (connectivity, power, support)

Root cause: The deployment environment is fundamentally different from the development environment.

Prevention:
- Ethnographic study of deployment settings
- Design for worst-case conditions
- End-user involvement from day one
- Pilot before scale
Failure Pattern 9: Selection Bias in Samples 📉
Seen in: Apple AFib, implicitly in others
Manifestations:
- Convenience samples (Apple Watch owners)
- Self-selection bias (research volunteers)
- Socioeconomic exclusion (expensive devices)
- Technology access barriers

Root cause: Who has access to technology ≠ Who needs healthcare most.

Prevention:
- Actively recruit underrepresented groups
- Provide technology to ensure access
- Report sample representativeness
- Acknowledge generalizability limits
Failure Pattern 10: Alert Fatigue and Human Factors 🚨
Seen in: Epic sepsis model
Manifestations:
- High false alarm rate (88% in the Epic case)
- Clinicians ignore alerts (alert fatigue)
- Workflow disruption
- Degraded human performance

Root cause: AI systems designed in isolation from human workflow.

Prevention:
- Human factors engineering from the start
- An acceptable false alarm rate defined with clinicians
- Measure clinician response and satisfaction
- Iterative refinement based on workflow integration
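Why false alarm rates like Epic's 88% arise is worth seeing numerically: when the condition is rare, even a screen with respectable sensitivity and specificity produces mostly false positives. The sketch below applies Bayes' rule with illustrative numbers, not Epic's actual operating point.

```python
def false_alarm_fraction(sensitivity, specificity, prevalence):
    """Fraction of fired alerts that are false positives (Bayes' rule)."""
    true_alerts = sensitivity * prevalence
    false_alerts = (1 - specificity) * (1 - prevalence)
    return false_alerts / (true_alerts + false_alerts)

# 80% sensitivity, 90% specificity, 2% of monitored patients septic:
rate = false_alarm_fraction(sensitivity=0.80, specificity=0.90, prevalence=0.02)
# rate ≈ 0.86: roughly six of every seven alerts are false,
# despite metrics that look reasonable in isolation.
```

This is why "acceptable false alarm rate defined with clinicians" belongs in the prevention list: the tolerable alert burden depends on prevalence in the deployed population, which no amount of AUC tuning on its own reveals.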
Unified Prevention Framework: The “Public Health AI Safety Checklist” 📋
Based on all 10 failures, here’s a comprehensive pre-deployment checklist:
Use this BEFORE deploying any AI system in healthcare or public health:
Phase 1: Problem Formulation & Team Assembly
Problem Definition
- [ ] Objective clearly defined (what are we trying to achieve?)
- [ ] Target variable directly measures the objective (not a proxy)
- [ ] Problem suitable for AI (vs. non-AI alternatives)
- [ ] Success criteria defined (including patient outcomes)

Team Composition
- [ ] Multidisciplinary team (technical + clinical + social science + ethics)
- [ ] Racial/ethnic diversity in the team
- [ ] Domain experts co-lead (not just consultants)
- [ ] Patient/community representatives involved
Phase 2: Data & Model Development
Data Quality
- [ ] Real patient outcomes (not synthetic, not hypothetical)
- [ ] Representative sample (diverse by age, race, sex, geography, socioeconomic status)
- [ ] Historical biases documented
- [ ] Data quality assessment completed
- [ ] Data provenance and lineage documented

Ethical Data Use
- [ ] Appropriate consent obtained
- [ ] Privacy Impact Assessment completed
- [ ] Data minimization applied
- [ ] Ethics review board approval

Model Development
- [ ] Multiple fairness metrics defined (alongside accuracy)
- [ ] Subgroup analysis (performance by race, age, sex, etc.)
- [ ] Interpretability/explainability built in
- [ ] Known limitations documented
Phase 3: Validation & Testing
Technical Validation
- [ ] External validation at independent sites
- [ ] Tested in deployment conditions (not just the lab)
- [ ] Diverse test populations
- [ ] Independent researchers validate
- [ ] Performance reported transparently (including failures)

Fairness Audit
- [ ] Disparate impact analysis
- [ ] Calibration by subgroup
- [ ] Multiple fairness definitions tested
- [ ] Equity impact assessment

Safety Testing
- [ ] Failure mode analysis
- [ ] False positive rate acceptable?
- [ ] Alert burden quantified
- [ ] Worst-case scenarios tested
Phase 4: Deployment Preparation
Workflow Integration
- [ ] Ethnographic study of the deployment setting
- [ ] User research with actual end users
- [ ] Workflow impact assessed
- [ ] Training program for users
- [ ] Technical support available

Infrastructure
- [ ] Deployment environment meets requirements (internet, power, equipment)
- [ ] Works under worst-case conditions
- [ ] Offline mode if needed
- [ ] Integration with existing systems

Governance
- [ ] Clear accountability (who is responsible for failures?)
- [ ] Monitoring plan (continuous performance tracking)
- [ ] Incident response protocol
- [ ] Appeals process for contested decisions
- [ ] Plan for model updates and maintenance
Phase 5: Pilot Implementation
Small-Scale Pilot
- [ ] 3-5 sites, intensive monitoring
- [ ] Technical performance tracked
- [ ] Clinical outcomes measured
- [ ] User satisfaction assessed
- [ ] Equity impacts monitored

Outcome Evaluation
- [ ] Primary outcome defined (patient benefit, not just AUC)
- [ ] Comparison to baseline or control
- [ ] Subgroup outcomes reported
- [ ] Cost-effectiveness analyzed
- [ ] Qualitative feedback collected

Go/No-Go Decision
- [ ] Pre-defined success criteria met?
- [ ] Pilot demonstrated benefit (not just technical feasibility)?
- [ ] Equity not worsened?
- [ ] If no: iterate, redesign, or abandon
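One way to keep the Go/No-Go decision honest is to encode the pre-registered success criteria as executable checks before the pilot begins, so the decision cannot be quietly renegotiated afterward. A minimal sketch (the threshold values and result fields are hypothetical):

```python
def go_no_go(results, criteria):
    """Evaluate pre-registered pilot criteria. Every criterion must pass;
    failed criteria are listed so the team can iterate, redesign, or
    abandon — no sunk-cost fallacy."""
    failures = [name for name, check in criteria.items() if not check(results)]
    return ("GO" if not failures else "NO-GO", failures)

# Hypothetical thresholds agreed BEFORE the pilot started.
criteria = {
    "mortality_not_worse": lambda r: r["mortality_delta"] <= 0,
    "false_alarms_ok":     lambda r: r["false_alarm_rate"] <= 0.30,
    "equity_gap_ok":       lambda r: r["max_subgroup_gap"] <= 0.05,
}
pilot = {"mortality_delta": -0.01, "false_alarm_rate": 0.45,
         "max_subgroup_gap": 0.03}
decision, failed = go_no_go(pilot, criteria)
# -> ("NO-GO", ["false_alarms_ok"]): good outcomes, but the alert burden
#    exceeds what clinicians agreed to tolerate.
```

The design point is that the criteria are data, reviewed and fixed in advance, rather than judgments made after results are known.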
Phase 6: Scaled Deployment (Only if Pilot Succeeds)
Transparency
- [ ] Public documentation of how the system works
- [ ] Validation results published (peer-reviewed)
- [ ] Limitations clearly communicated
- [ ] Conflicts of interest disclosed

Ongoing Monitoring
- [ ] Dashboard: performance by subgroup
- [ ] Drift detection (data distribution changes)
- [ ] Adverse event reporting
- [ ] Regular re-auditing (quarterly or annually)
- [ ] Feedback loop for continuous improvement

Exit Strategy
- [ ] Conditions under which the system would be turned off
- [ ] Sunset plan if not providing benefit
- [ ] No sunk cost fallacy
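The drift-detection item in the Ongoing Monitoring list is often implemented with the Population Stability Index (PSI), which compares a feature's binned distribution at training time against production. A minimal sketch (the bin proportions are illustrative, and the common "PSI > 0.2 means meaningful drift" threshold is a rule of thumb, not a standard):

```python
import math

def population_stability_index(expected, actual):
    """PSI between two binned distributions (proportions summing to 1).
    Rule of thumb: PSI > 0.2 suggests drift worth investigating."""
    psi = 0.0
    for e, a in zip(expected, actual):
        e, a = max(e, 1e-6), max(a, 1e-6)   # guard against log(0)
        psi += (a - e) * math.log(a / e)
    return psi

# Hypothetical: a feature's distribution at training time vs. in production.
baseline = [0.25, 0.25, 0.25, 0.25]
current  = [0.10, 0.20, 0.30, 0.40]
psi = population_stability_index(baseline, current)
# PSI ≈ 0.228 — above the 0.2 heuristic, so trigger a re-audit.
```

In a monitoring dashboard this would run per feature (and per subgroup) on a schedule, feeding the re-auditing and incident-response items above.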
Case Study Comparison Matrix
| Failure Case | Primary Pattern | Secondary Pattern | Cost | Patient Harm | Trust Damage |
|---|---|---|---|---|---|
| Watson | Training Data | Validation | $62M+ | Unknown | High |
| DeepMind | Privacy | Governance | £0 (trust) | None direct | Very High |
| Google India | Lab-Field Gap | User Design | $M | Indirect | Medium |
| Epic Sepsis | Validation | Alert Fatigue | Opportunity cost | Possible | High |
| UK NHS App | Technical Hubris | Social Adoption | £35M | None direct | High |
| Optum Bias | Wrong Objective | Lack of Fairness Audit | Systemic | Millions affected | High |
| TraceTogether | Privacy | Mission Creep | Trust collapse | None direct | Very High |
| Babylon Chatbot | Overpromising | Safety | Unknown | Possible | Medium |
| COVID Models | Urgency Rush | Validation | Opportunity cost | Possible | Medium |
| Apple AFib | Selection Bias | Equity | None | None | Low |
Total documented costs: $100M+ direct, incalculable opportunity and trust costs
Interactive Self-Assessment Quiz
Test your ability to spot red flags before deployment:
Context: Your hospital is considering purchasing a commercial AI sepsis prediction tool. The vendor provides the following information:
- Trained on 100,000 patients from 5 major academic medical centers
- AUC: 0.88 in internal validation
- “Deployed in 150+ hospitals nationwide”
- Price: $250,000/year
Question: What additional information would you demand before adoption?
Answer:
Critical questions to ask:

- External validation:
  - Has the model been validated at independent hospitals?
  - What was performance at hospitals NOT in the training data?
  - Peer-reviewed publication of validation results?
- Your hospital specifics:
  - Patient population similar to the training data?
  - EHR documentation practices similar?
  - Sepsis prevalence in your ICU?
- Operational realities:
  - Alert rate (alerts per day)?
  - False positive rate (what % are false alarms)?
  - Alert response protocol (what do nurses/physicians do with alerts)?
- Outcomes:
  - Has deployment improved patient outcomes anywhere?
  - Time to antibiotics reduced?
  - Mortality reduced?
  - Or just technical metrics (AUC)?
- Fairness:
  - Performance by race, ethnicity, age, sex?
  - Does the model perform equally well for your diverse patient population?
- Pilot plan:
  - Can you pilot in 1-2 units before going hospital-wide?
  - What metrics would trigger the Go/No-Go decision?
Red flags in this scenario:
- “Deployed in 150+ hospitals” ≠ evidence of effectiveness
- No mention of patient outcomes
- No mention of fairness audits
- No mention of alert burden
Epic sepsis model taught us: Vendor claims need independent verification!
Context: It’s April 2020, early in the pandemic. Your research team develops an ML model that predicts COVID-19 severity from chest X-rays.
Training data: 500 chest X-rays
- 250 COVID-positive (from ICU patients, all supine position)
- 250 COVID-negative controls (healthy volunteers, outpatient, upright position)
Performance: 95% sensitivity, 93% specificity on held-out test set
Question: Your team wants to publish immediately in a fast-track COVID journal. Should you? What’s wrong with this model?
Answer:
DO NOT PUBLISH. Major problems:

- Sampling bias:
  - COVID patients: ICU, supine, severe disease
  - Controls: healthy, outpatient, upright
  - The model is likely learning patient position, not disease
- Data leakage risk:
  - Are test set patients from the same hospital/scanner?
  - Same time period?
  - The model may be learning institution-specific artifacts
- Small sample size:
  - 500 patients is tiny for deep learning
  - High risk of overfitting
  - Not enough diversity
- Lack of clinical validation:
  - No clinician co-authors?
  - Does a radiologist agree with the model?
  - Clinical utility unclear
- Absence of external validation:
  - Needs testing at a completely different hospital
  - Different scanners, different populations
  - Before publishing, let alone deploying
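A cheap sanity check for the first two problems is to ask whether acquisition metadata alone predicts the label. If it does, a model can "succeed" without ever looking at pathology. A minimal sketch mirroring this scenario (the data are hypothetical):

```python
from collections import Counter

def label_metadata_confounding(labels, metadata):
    """Fraction of cases where metadata alone predicts the label
    (majority vote within each metadata value). A value near 1.0 means
    the model can score well by learning the artifact, not the disease."""
    by_meta = {}
    for lab, meta in zip(labels, metadata):
        by_meta.setdefault(meta, []).append(lab)
    correct = sum(max(Counter(labs).values()) for labs in by_meta.values())
    return correct / len(labels)

# Mirrors the scenario: all positives supine, all controls upright.
labels   = ["covid"] * 5 + ["control"] * 5
position = ["supine"] * 5 + ["upright"] * 5
leakage = label_metadata_confounding(labels, position)
# -> 1.0: patient position perfectly predicts the label, so a 95%-accurate
#    model may have learned nothing about COVID at all.
```

Running this on scanner ID, hospital, and acquisition date before training would have flagged many of the pandemic-era X-ray models before a single network was fit.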
What you SHOULD do:

- Recruit appropriate controls:
  - COVID-negative patients from the same ICU
  - Same positioning, same severity
  - Remove confounders
- Expand the dataset:
  - Multiple hospitals
  - Multiple scanners
  - Geographic diversity
  - Thousands of images, not 500
- External validation:
  - Test at 2-3 completely independent hospitals
  - Report performance honestly (likely much lower than 95%)
- Clinical collaboration:
  - Partner with radiologists and ICU physicians
  - Define the clinical use case clearly
  - Compare to existing tools
- Be honest about limitations:
  - “Proof of concept, not ready for clinical use”
  - “Requires extensive further validation”
COVID forecasting models taught us: Urgency is not an excuse for poor methods. Bad AI is worse than no AI.
Context: Your public health department wants to deploy a smartphone app for depression screening in the community.
App features:
- Survey (PHQ-9 depression questionnaire)
- AI analyzes voice patterns during a short phone call
- Predicts depression severity
- Refers high-risk individuals to mental health services
Target: 100,000 residents in your county
Question: What are the health equity concerns? Who will this help? Who will it miss?
Answer:
Major equity concerns:
Who will be EXCLUDED:

- No smartphone:
  - ~15% of US adults don’t own smartphones
  - Higher rates among elderly, low-income, and rural residents
- Limited English proficiency:
  - Voice AI likely trained on English speakers
  - May not work for Spanish, Creole, Vietnamese, etc.
- Disabilities:
  - Visual impairment: screen reader compatibility?
  - Hearing impairment: voice-based screening excludes them
  - Motor disabilities: touchscreen accessibility?
- Low tech literacy:
  - Elderly and low-education users may struggle with the app
  - No support for troubleshooting
- No internet/data:
  - Requires a data plan
  - Rural areas: poor connectivity
  - Low-income users may ration data
- Distrust of technology:
  - Privacy concerns
  - Cultural barriers
  - Historical medical exploitation (Tuskegee, etc.)
Who WILL use it:
- Younger, wealthier, educated, tech-savvy
- English speakers
- Already engaged with the healthcare system
- = Lowest risk for untreated depression

Who NEEDS it most:
- Older adults (highest suicide rates)
- Low-income (highest untreated depression rates)
- Minorities (lower mental health service access)
- = Least likely to use the app

Result: Widened health disparities
- Technology benefits already-advantaged groups
- Misses those with the greatest need
What you SHOULD do:

- Multi-channel strategy:
  - The app is ONE tool, not the only tool
  - In-person screening at clinics
  - Community health worker outreach
  - Telephone hotline (not smartphone-based)
- Make the app accessible:
  - Multiple languages
  - Screen reader compatible
  - Phone call option for those without smartphones
  - SMS/text option (lower tech burden)
- Provide technology access:
  - Free smartphones for those who need them
  - Free data plans
  - In-person support for onboarding
- Community engagement:
  - Partner with trusted community organizations
  - Address privacy and trust concerns up front
  - Cultural adaptations
- Monitor equity:
  - Track who uses the app by race, income, age, and language
  - Measure: are we reaching high-need populations?
  - If not: redesign the approach
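The equity-monitoring step can be reduced to a simple reach ratio: each group's share of app users divided by its share of the target population. Ratios well below 1.0 flag groups being missed. A minimal sketch (the counts are hypothetical):

```python
def reach_ratios(users_by_group, population_by_group):
    """Ratio of each group's share of users to its share of the target
    population. A ratio well below 1.0 means the group is under-reached."""
    total_users = sum(users_by_group.values())
    total_pop = sum(population_by_group.values())
    return {g: (users_by_group.get(g, 0) / total_users)
               / (population_by_group[g] / total_pop)
            for g in population_by_group}

# Hypothetical county: older adults are 30% of residents but 10% of users.
population = {"18-44": 40_000, "45-64": 30_000, "65+": 30_000}
users      = {"18-44":  6_000, "45-64":  3_000, "65+":  1_000}
ratios = reach_ratios(users, population)
# 18-44 is over-reached (ratio 1.5); 65+ — the group with the highest
# suicide rates — is reached at one-third its population share.
```

Tracking these ratios by age, race, income, and language over time turns "are we reaching high-need populations?" from a rhetorical question into a dashboard metric.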
Apple AFib taught us: Convenience samples miss those who need healthcare most.
Summary: How to Build Public Health AI That Doesn’t Fail
Core Principles:
Safety First: “Do no harm” applies to AI. If not proven safe, don’t deploy.
Validation Before Scale: Pilot → Evaluate outcomes → Only scale if beneficial.
Fairness is Not Optional: Test for bias proactively. Health equity is a design requirement.
Privacy by Design: Technical controls, not just policies. Privacy promises must be enforceable.
Diverse Teams: Multidisciplinary collaboration from day one. No technical solutions to socio-technical problems.
Transparency: Document limitations honestly. Publish validation results. Enable independent scrutiny.
Measure Outcomes: AUC is not enough. Did patients benefit? Did health equity improve? Do the costs justify the benefits?
Fail Fast: If not working, abandon or redesign. No sunk cost fallacy. Every project is an experiment.
Learn from Failures: Yours and others’. This appendix exists so you don’t repeat these mistakes.
Humility: AI is a tool, not a panacea. Many problems don’t need AI. Some AI makes things worse.
The most important lesson: Every failure in this appendix was preventable. The warning signs were there. The expertise to identify problems existed. What was missing was:
- Willingness to slow down
- Diverse perspectives at the table
- Prioritization of safety and equity over speed and profit
- Courage to abandon failing projects
You can do better.
This appendix gives you the knowledge to spot red flags, ask hard questions, demand rigorous validation, and build AI systems that help rather than harm.
The question is: Will you?
Notes on References
Each case study includes a dedicated References section with:
- Primary sources: peer-reviewed research papers, official reports, regulatory documents
- Media coverage: investigative journalism and analysis from reputable outlets
- Commentary: expert analysis and academic perspectives
All references include:
- Full citation information
- Direct links to sources (DOIs for academic papers, URLs for media)
- Categorization by source type for easier navigation