Appendix F — The AI Morgue: Failure Post-Mortems

Tip: Learning Objectives

By the end of this appendix, you will:

  • Understand the common failure modes of AI systems in healthcare and public health
  • Analyze root causes of high-profile AI failures through detailed post-mortems
  • Identify warning signs that predict project failure before deployment
  • Apply failure prevention frameworks to your own AI projects
  • Learn from $100M+ in failed AI investments without repeating the mistakes
  • Develop a critical eye for evaluating AI vendor claims and research findings

Introduction: Why Study Failure?

The Value of Failure Analysis

“Success is a lousy teacher. It seduces smart people into thinking they can’t lose.” - Bill Gates

In public health AI, failure is not just a learning opportunity—it can mean lives lost, trust destroyed, and health equity worsened. Yet the literature overwhelmingly focuses on successes. Failed projects are quietly shelved, vendors move on to the next product, and the same mistakes repeat.

This appendix is different.

We document 10 major AI failures in healthcare and public health with forensic detail:

  • What was promised vs. what was delivered
  • Root cause analysis (technical, organizational, ethical)
  • Real-world consequences and costs
  • What should have happened
  • Prevention strategies you can apply

Who Should Read This

For practitioners: Learn to spot red flags before investing time and resources in doomed projects.

For researchers: Understand why technically sound models fail in deployment.

For policymakers: See the consequences of inadequate oversight and validation requirements.

For students: Develop the critical thinking skills to evaluate AI systems skeptically.


Quick Reference: All 10 Failures at a Glance

| #  | Project                   | Domain                    | Cost                          | Primary Failure Mode                                | Key Lesson                                   |
|----|---------------------------|---------------------------|-------------------------------|-----------------------------------------------------|----------------------------------------------|
| 1  | IBM Watson for Oncology   | Clinical Decision Support | $62M+ investment              | Unsafe recommendations from synthetic training data | Synthetic data ≠ real expertise              |
| 2  | DeepMind Streams NHS      | Patient Monitoring        | £0 (free), cost = trust       | Privacy violations, unlawful data sharing           | Innovation doesn’t excuse privacy violations |
| 3  | Google Health India       | Diabetic Retinopathy      | $M investment                 | Lab-to-field performance gap                        | 96% accuracy in lab ≠ field success          |
| 4  | Epic Sepsis Model         | Clinical Prediction       | Implemented in 100+ hospitals | Poor external validation, high false alarms         | Vendor claims need independent validation    |
| 5  | UK NHS COVID App          | Contact Tracing           | £12M spent                    | Technical + privacy issues                          | Social acceptability ≠ technical feasibility |
| 6  | OPTUM/UnitedHealth        | Resource Allocation       | Affected millions             | Systematic racial bias via proxy variable           | Proxy variables encode discrimination        |
| 7  | Singapore TraceTogether   | Contact Tracing           | $10M+                         | Broken privacy promises                             | Mission creep destroys public trust          |
| 8  | Babylon GP at Hand        | Symptom Checker           | £0 pilot, trust cost          | Unsafe triage recommendations                       | Chatbots ≠ medical diagnosis                 |
| 9  | COVID-19 Forecasting Models | Epidemic Prediction     | 232 models published          | 98% high risk of bias, overfitting                  | Urgency ≠ excuse for poor methods            |
| 10 | Apple Watch AFib Study    | Digital Epidemiology      | $M research                   | Selection bias, unrepresentative sample             | Convenience samples ≠ population inference   |

Total documented costs: $100M+ in direct spending, incalculable trust damage


Case Study 1: IBM Watson for Oncology - Unsafe Recommendations from AI

The Promise

2013-2017: IBM heavily marketed Watson for Oncology as an AI system that could:

  • Analyze massive amounts of medical literature
  • Provide evidence-based treatment recommendations
  • Match or exceed expert oncologists
  • Democratize access to world-class cancer care

Marketing claims:

  • “Watson can read and remember all medical literature”
  • “90% concordance with expert tumor boards”
  • Hospitals paid $200K-$1M+ for licensing

The Reality

July 2018: Internal documents leaked to STAT News revealed Watson recommended unsafe treatment combinations, incorrect dosing, and treatments contradicting medical evidence (Ross and Swetlitz 2018).

Example from leaked documents:

  • Patient: 65-year-old with severe bleeding
  • Watson recommendation: Prescribe chemotherapy + bevacizumab (increases bleeding risk)
  • Expert oncologist assessment: “This would be harmful or fatal”

Root Cause Analysis

1. Training Data Problem

Critical flaw: Watson was trained on synthetic cases, not real patient outcomes.

# WHAT IBM DID (WRONG)
class WatsonTrainingApproach:
    """
    Watson for Oncology training methodology (simplified)
    """

    def generate_training_data(self):
        """Generate synthetic cases from expert opinions"""
        training_cases = []

        # Experts at Memorial Sloan Kettering created hypothetical cases
        for scenario in self.expert_generated_scenarios:
            case = {
                'patient_features': scenario['demographics'],
                'diagnosis': scenario['cancer_type'],
                'recommended_treatment': scenario['expert_preference'],  # NOT actual outcomes
                'confidence': 'high'  # Based on expert assertion, not evidence
            }
            training_cases.append(case)

        return training_cases

    def train_model(self, training_cases):
        """Train on expert preferences, not patient outcomes"""
        # Model learns: "Expert X prefers treatment Y"
        # Model does NOT learn: "Treatment Y improves survival"

        # This is preference learning, not outcome learning
        self.model.fit(
            X=[case['patient_features'] for case in training_cases],
            y=[case['recommended_treatment'] for case in training_cases]
        )

        # Result: Watson mimics expert opinions
        # Problem: Expert opinions can be wrong, biased, outdated


# WHAT SHOULD HAVE BEEN DONE (CORRECT)
class EvidenceBasedApproach:
    """
    How oncology decision support should be developed
    """

    def generate_training_data(self):
        """Use real patient outcomes from EHR data"""
        training_cases = []

        # Use actual patient data with outcomes
        for patient in self.ehr_database:
            if patient.has_outcome_data():
                case = {
                    'patient_features': patient.demographics + patient.tumor_characteristics,
                    'treatment_received': patient.treatment_history,
                    'outcome': patient.survival_months,  # ACTUAL OUTCOME
                    'adverse_events': patient.complications,
                    'quality_of_life': patient.qol_scores
                }
                training_cases.append(case)

        return training_cases

    def validate_against_rcts(self, model_recommendations):
        """Validate recommendations against randomized trial evidence"""

        for recommendation in model_recommendations:
            # Check if recommendation aligns with RCT evidence
            rct_evidence = self.search_clinical_trials(
                condition=recommendation['diagnosis'],
                intervention=recommendation['treatment']
            )

            if rct_evidence.contradicts(recommendation):
                recommendation['flag'] = 'CONTRADICTS_RCT_EVIDENCE'
                recommendation['use'] = False

            # Check for safety signals
            safety_data = self.fda_adverse_event_database.query(
                drug=recommendation['treatment'],
                patient_profile=recommendation['patient_features']
            )

            if safety_data.has_contraindications():
                recommendation['flag'] = 'SAFETY_CONTRAINDICATION'
                recommendation['use'] = False

        return model_recommendations

Why synthetic data failed:

  • Expert preferences ≠ evidence-based best practices
  • No validation against actual patient outcomes
  • Biases in expert opinions propagated at scale
  • No feedback loop from real-world results
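The gap between preference learning and outcome learning can be made concrete with a toy example (all numbers invented for illustration, not drawn from Watson’s actual data): a model that copies the most common expert choice can systematically pick the worse treatment.

```python
# Toy contrast between preference learning and outcome learning.
# Records: (severe_bleeding, expert_treatment_choice, survival_months) - invented.
from collections import Counter

records = [
    (True,  "chemo+bevacizumab", 2),   # habitual choice, poor outcome
    (True,  "chemo+bevacizumab", 3),
    (True,  "chemo_alone",       14),  # safer choice, better outcome
    (False, "chemo+bevacizumab", 18),
    (False, "chemo+bevacizumab", 20),
]

def preference_model(bleeding: bool) -> str:
    """Recommend whatever experts most often chose (preference learning)."""
    votes = Counter(t for b, t, _ in records if b == bleeding)
    return votes.most_common(1)[0][0]

def outcome_model(bleeding: bool) -> str:
    """Recommend the treatment with the best observed mean survival."""
    survival = {}
    for b, t, months in records:
        if b == bleeding:
            survival.setdefault(t, []).append(months)
    return max(survival, key=lambda t: sum(survival[t]) / len(survival[t]))

print(preference_model(True))  # chemo+bevacizumab - copies the habit
print(outcome_model(True))     # chemo_alone - follows the outcomes
```

Same data, same features: the preference learner reproduces the majority habit for bleeding patients, while the outcome learner picks the treatment that actually extended survival.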

2. Validation Failure

What IBM reported: 90% concordance with expert tumor boards

What this actually meant:

  • Watson agreed with the same experts who trained it (circular validation)
  • NOT validated against independent oncologists
  • NOT validated against patient survival outcomes
  • NOT validated in external hospitals before widespread deployment

The validation fallacy:

# IBM's circular validation approach
def evaluate_watson(test_cases):
    """
    Problematic validation methodology
    """

    # Test cases created by same experts who trained Watson
    expert_recommendations = memorial_sloan_kettering_experts.recommend(test_cases)
    watson_recommendations = watson_model.predict(test_cases)

    # Concordance = how often Watson agrees with trainers
    concordance = agreement_rate(expert_recommendations, watson_recommendations)

    # PROBLEM: This measures memorization, not clinical validity
    print(f"Concordance: {concordance}%")  # 90%!

    # MISSING: Does Watson improve patient outcomes?
    # MISSING: External validation at different hospitals
    # MISSING: Comparison to actual survival data
    # MISSING: Safety evaluation
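A minimally sound alternative would score the model at external sites against guideline options, contraindications, and actual survival. A sketch of what that evaluation loop could look like (record fields and the model object are hypothetical, not IBM’s API):

```python
# Sketch of outcome-anchored external validation (hypothetical record fields).
def external_validation(model, external_patients):
    """Evaluate at sites uninvolved in development, against real outcomes."""
    concordant = flagged = 0
    followed, overridden = [], []

    for p in external_patients:
        rec = model.predict(p["features"])
        if rec in p["guideline_concordant_options"]:
            concordant += 1
        if rec in p["contraindicated_treatments"]:
            flagged += 1  # safety signal: should be near zero
        # Observational and confounded - a prospective trial is the real test -
        # but even this beats agreement with the system's own trainers.
        (followed if p["treatment_given"] == rec else overridden).append(
            p["survival_months"])

    n = len(external_patients)
    mean = lambda xs: sum(xs) / len(xs) if xs else None
    return {
        "guideline_concordance": concordant / n,
        "contraindication_rate": flagged / n,
        "mean_survival_followed": mean(followed),
        "mean_survival_overridden": mean(overridden),
    }
```

Unlike concordance with the training experts, these metrics can fall when the model is wrong: a nonzero contraindication rate or worse survival when clinicians follow the model is a failure signal, not a marketing number.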

3. Commercial Pressure Over Clinical Rigor

Timeline reveals rushed deployment:

  • 2013: Partnership announced with Memorial Sloan Kettering
  • 2015: First hospital deployments begin
  • 2016-2017: Aggressive global sales push
  • 2018: Safety issues surface

Financial incentives misaligned with patient safety:

  • IBM under pressure to monetize Watson investments
  • Hospitals wanted prestigious “AI partnership”
  • Marketing preceded clinical validation

The Fallout

Hospitals that abandoned Watson for Oncology:

  • MD Anderson Cancer Center (after spending $62M) (Fry 2017)
  • Jupiter Hospital (India): cited “unsafe recommendations” (Hernandez and Greenwald 2018)
  • Gachon University Gil Medical Center (South Korea)
  • Multiple European hospitals (Strickland 2019)

Patient impact:

  • Unknown number exposed to potentially unsafe recommendations
  • Degree of harm unknown (no systematic study)
  • Oncologists reported catching unsafe suggestions before implementation
  • Trust in AI-based clinical support damaged

Financial costs:

  • MD Anderson: $62M investment, project shut down (Ross and Swetlitz 2017)
  • Multiple hospitals: $200K-$1M licensing fees
  • IBM: Massive reputational damage; eventually sold Watson Health to an investment firm in 2022 (Lohr 2022)

Lessons for the field:

  • Set back clinical AI adoption by years
  • Increased regulatory skepticism
  • Hospitals now demand extensive validation before AI adoption

What Should Have Happened

Phase 1: Proper Development (2-3 years)

  1. Train on real patient outcomes from EHR data across multiple institutions
  2. Validate against randomized clinical trial evidence
  3. Build safety checks to flag contraindications
  4. Involve diverse oncologists from community hospitals, not just academic centers

Phase 2: Rigorous Validation (1-2 years)

  1. External validation at hospitals not involved in development
  2. Prospective study comparing Watson recommendations to actual outcomes
  3. Safety monitoring for adverse events
  4. Subgroup analysis by cancer type, stage, patient characteristics

Phase 3: Controlled Deployment (1+ years)

  1. Pilot at 3-5 hospitals with intensive monitoring
  2. Oncologist oversight of all recommendations
  3. Track concordance, outcomes, and safety
  4. Iterative improvement based on real-world data

Phase 4: Gradual Scale (if Phase 3 succeeds)

  1. Expand only after demonstrating clinical benefit or equivalence
  2. Continuous monitoring and model updates
  3. Transparent reporting of performance

Total timeline: 4-6 years before widespread deployment

What actually happened: 2 years from partnership to aggressive global sales

Prevention Checklist

Warning: Red Flags That Predicted Watson’s Failure

Use this checklist to evaluate clinical AI systems:

Training Data ❌ Watson failed all of these

  - [ ] Trained on real patient outcomes (not synthetic cases)
  - [ ] Data from multiple institutions (not single center)
  - [ ] Includes diverse patient populations
  - [ ] Outcomes include survival, not just expert opinion

Validation ❌ Watson failed all of these

  - [ ] External validation at independent sites
  - [ ] Compared to patient outcomes (not just expert agreement)
  - [ ] Safety evaluation included
  - [ ] Subgroup performance reported
  - [ ] Validation by independent researchers (not just vendor)

Deployment ❌ Watson failed all of these

  - [ ] Prospective pilot study completed
  - [ ] Clinical benefit demonstrated (not just claimed)
  - [ ] Physician oversight required
  - [ ] Continuous monitoring plan
  - [ ] Transparent performance reporting

Governance ❌ Watson failed all of these

  - [ ] Development timeline allows for proper validation
  - [ ] Commercial pressure doesn’t override clinical rigor
  - [ ] Independent ethics review
  - [ ] Patient safety prioritized over revenue

Key Takeaways

  1. Synthetic data ≠ Real-world evidence - Expert-generated hypothetical cases cannot substitute for actual patient outcomes

  2. Circular validation is not validation - Concordance with the experts who trained the system proves nothing about clinical validity

  3. Marketing claims require independent verification - Vendor assertions must be validated by independent researchers

  4. Commercial pressure kills patients - Rushing to market before proper validation has consequences

  5. AI is not a substitute for clinical trials - Evidence-based medicine requires… evidence

References

Primary sources:

  • Ross, C., & Swetlitz, I. (2018). IBM’s Watson supercomputer recommended ‘unsafe and incorrect’ cancer treatments. STAT News
  • Hernandez, D., & Greenwald, T. (2018). IBM Has a Watson Dilemma. Wall Street Journal
  • Strickland, E. (2019). How IBM Watson Overpromised and Underdelivered on AI Health Care. IEEE Spectrum
  • Fry, E. (2017). MD Anderson Benches IBM Watson in Setback for Artificial Intelligence. Fortune
  • Ross, C., & Swetlitz, I. (2017). MD Anderson Cancer Center’s $62 million Watson project is scrapped after audit. STAT News
  • Lohr, S. (2022). IBM Sells Watson Health Assets to Investment Firm. New York Times


Case Study 2: DeepMind Streams and the NHS Data Sharing Scandal

The Promise

2015-2016: DeepMind (owned by Google) partnered with Royal Free NHS Trust to develop Streams, a mobile app to help nurses and doctors detect acute kidney injury (AKI) earlier.

Stated goals:

  • Alert clinicians to deteriorating patients
  • Reduce preventable deaths from AKI
  • Demonstrate Google’s commitment to healthcare
  • “Save lives with AI”

The Reality

July 2017: UK Information Commissioner’s Office ruled the data sharing agreement unlawful (Information Commissioner’s Office 2017).

What went wrong:

  • Royal Free shared 1.6 million patient records with DeepMind (Hodson 2017)
  • Patients not informed their data would be used
  • Data included entire medical histories (not just kidney-related)
  • Used for purposes beyond the stated clinical care
  • No proper legal basis under UK Data Protection Act (Powles and Hodson 2017)

Data included:

  • HIV status
  • Abortion records
  • Drug overdose history
  • Complete medical histories dating back 5 years
  • Data from patients who never consented

Root Cause Analysis

1. Privacy Framework Violations

Legal failures:

# What DeepMind/Royal Free did (UNLAWFUL)
class DataSharingAgreement:
    """
    DeepMind Streams data sharing approach
    """

    def __init__(self):
        self.legal_basis = "Implied consent for direct care"  # WRONG
        self.data_minimization = False  # Took everything
        self.patient_notification = False  # Patients not informed

    def collect_patient_data(self, nhs_trust):
        """Collect patient data for app development"""

        # PROBLEM 1: Scope creep beyond stated purpose
        stated_purpose = "Detect acute kidney injury"
        actual_purpose = "Develop AI algorithms + train models + product development"

        # PROBLEM 2: Excessive data collection (violates data minimization)
        data_requested = {
            'kidney_function_tests': True,  # Relevant to AKI
            'vital_signs': True,  # Relevant to AKI
            'complete_medical_history': True,  # NOT necessary for AKI alerts
            'hiv_status': True,  # NOT necessary for AKI alerts
            'mental_health_records': True,  # NOT necessary for AKI alerts
            'abortion_history': True,  # NOT necessary for AKI alerts
            'historical_data': '5 years'  # Far exceeds clinical need
        }

        # PROBLEM 3: No patient consent or notification
        # self.obtain_explicit_consent()  # never performed
        # self.notify_patients()          # never performed

        # PROBLEM 4: Commercial use of NHS data
        data_use = {
            'clinical_care': True,  # OK
            'algorithm_development': True,  # Requires different legal basis
            'google_ai_research': True,  # Requires patient consent
            'product_development': True  # Requires patient consent
        }

        return data_requested  # everything, regardless of necessity


# What SHOULD have been done (LAWFUL)
class LawfulDataSharingApproach:
    """
    Privacy-preserving approach to clinical AI development
    """

    def __init__(self):
        self.legal_basis = "Explicit consent for research"  # CORRECT
        self.data_minimization = True
        self.patient_notification = True
        self.independent_ethics_review = True

    def collect_patient_data_lawfully(self, nhs_trust):
        """Lawful approach to data collection"""

        # Step 1: Define minimum necessary data
        minimum_data_set = {
            'patient_id': 'pseudonymized',
            'age': True,
            'sex': True,
            'kidney_function_tests': True,
            'relevant_vital_signs': True,
            'relevant_medications': True,  # Only nephrotoxic drugs
            'aki_history': True
        }

        # Explicitly exclude unnecessary data
        excluded_data = [
            'complete_medical_history',
            'unrelated_diagnoses',
            'mental_health_records',
            'reproductive_history',
            'hiv_status'
        ]

        # Step 2: Obtain explicit informed consent
        consent_process = {
            'plain_language_explanation': True,
            'purpose_clearly_stated': "Develop AKI detection algorithm",
            'data_use_specified': "Clinical care AND algorithm development",
            'commercial_partner_disclosed': "Google DeepMind",
            'opt_out_option': True,
            'withdrawal_rights': True
        }

        # Step 3: Ethics approval
        ethics_review = self.submit_to_ethics_committee({
            'study_protocol': self.protocol,
            'consent_forms': self.consent_forms,
            'data_protection_impact_assessment': self.dpia,
            'benefit_risk_analysis': self.analysis
        })

        if not ethics_review.approved:
            return None  # Don't proceed without approval

        # Step 4: Transparent patient notification
        self.notify_all_patients(
            method='letter + posters + website',
            content='Data being used for AI research with Google',
            opt_out_period='30 days'
        )

        # Step 5: Collect only consented data
        consented_patients = self.get_consented_patients()
        data = self.extract_minimum_data_set(consented_patients)

        return data

2. Organizational Culture: “Move Fast, Get Permission Later”

Evidence of privacy-second culture:

  1. Data sharing agreement signed before proper legal review
    • Agreement signed: September 2015
    • Information Governance review: After the fact
    • Legal basis analysis: Inadequate
  2. No Data Protection Impact Assessment (DPIA)
    • Required for high-risk processing under GDPR
    • Should have been completed BEFORE data sharing
    • Would have identified legal issues
  3. Patient safety used to justify privacy violations
    • “We need all the data to save lives”
    • False dichotomy: privacy OR patient safety
    • Reality: Can have both with proper safeguards

3. Power Imbalance: Google vs. NHS

Structural factors:

  • NHS chronically underfunded, attracted by “free” Google technology
  • DeepMind offered app development at no cost
  • Royal Free eager for prestigious partnership
  • Imbalance in legal and technical expertise: Google’s lawyers vs. under-resourced NHS legal teams

The Fallout

Regulatory action:

  • UK Information Commissioner’s Office ruled the data sharing unlawful (July 2017) (Information Commissioner’s Office 2017)
  • Royal Free Trust found in breach of the Data Protection Act
  • Required to update practices and systems (Hern 2017)

Reputational damage:

  • Massive media coverage: “Google got NHS patient data improperly”
  • Patient trust in NHS data sharing damaged
  • DeepMind’s healthcare ambitions set back
  • Chilling effect on beneficial NHS-tech partnerships

Patient impact:

  • 1.6 million patients’ privacy violated
  • Highly sensitive data (HIV status, abortions, overdoses) shared without consent
  • No evidence of direct patient harm from data misuse
  • BUT: Violation of patient autonomy and dignity

Policy impact:

  • Strengthened NHS data sharing requirements
  • Increased scrutiny of commercial partnerships
  • Contributed to GDPR implementation awareness
  • NHS data transparency initiatives

What Should Have Happened

Lawful pathway (would have added 6-12 months):

Phase 1: Planning and Legal Review (2-3 months)

  1. Define minimum necessary data set for AKI detection
  2. Conduct Data Protection Impact Assessment (DPIA)
  3. Obtain legal opinion on appropriate legal basis
  4. Design patient consent/notification process
  5. Submit to NHS Research Ethics Committee

Phase 2: Ethics and Governance (2-3 months)

  1. Ethics committee review and approval
  2. Information Governance approval
  3. Caldicott Guardian sign-off (NHS data guardian)
  4. Transparent public announcement of partnership

Phase 3: Patient Engagement (3-6 months)

  1. Patient information campaign (letters, posters, website)
  2. 30-day opt-out period
  3. Mechanism for patient questions and concerns
  4. Patient advisory group involvement

Phase 4: Data Sharing with Safeguards (ongoing)

  1. Share only minimum necessary data
  2. Pseudonymization and encryption
  3. Audit trail of all data access
  4. Regular privacy audits
  5. Transparent reporting to patients and public
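Two of the Phase 4 safeguards, pseudonymization and an audit trail, are cheap to sketch. A minimal illustration (identifiers and key handling are placeholders; a real deployment needs managed keys, key rotation, and tamper-evident log storage):

```python
# Minimal pseudonymization + access-audit sketch (illustrative only).
import datetime
import hashlib
import hmac

SECRET_KEY = b"replace-with-managed-secret"  # held by the data controller

def pseudonymize(nhs_number: str) -> str:
    """Keyed hash: the same patient always maps to the same token,
    but the mapping cannot be reversed without the key."""
    return hmac.new(SECRET_KEY, nhs_number.encode(), hashlib.sha256).hexdigest()[:16]

audit_log = []

def access_record(pseudonym: str, user: str, purpose: str) -> None:
    """Log every data access so privacy audits have something to audit."""
    audit_log.append({
        "when": datetime.datetime.now(datetime.timezone.utc).isoformat(),
        "who": user,
        "patient": pseudonym,
        "purpose": purpose,
    })

token = pseudonymize("943 476 5919")  # made-up NHS number
access_record(token, user="aki_alerting_service", purpose="direct care")
```

The raw identifier never enters the log; auditors can still trace which service touched which (pseudonymized) patient, and why.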

Would this have delayed the project? Yes, by 6-12 months.

Would it have preserved trust? Yes.

Would the app still have saved lives? Yes, and without violating patient privacy.

Prevention Checklist

Warning: Red Flags for Privacy Violations

Use this checklist before any health data sharing for AI:

Legal Basis ❌ DeepMind failed all of these

  - [ ] Explicit legal basis identified (consent, legal obligation, legitimate interest with balance test)
  - [ ] Legal basis appropriate for ALL intended uses (including commercial AI development)
  - [ ] Legal review by qualified data protection lawyer
  - [ ] Data sharing agreement reviewed by independent party

Data Minimization ❌ DeepMind failed this

  - [ ] Only minimum necessary data collected
  - [ ] Scope limited to stated purpose
  - [ ] Irrelevant data explicitly excluded
  - [ ] Justification documented for each data element

Transparency ❌ DeepMind failed all of these

  - [ ] Patients informed about data use
  - [ ] Commercial partners disclosed
  - [ ] Purpose clearly explained
  - [ ] Opt-out option provided

Governance ❌ DeepMind failed all of these

  - [ ] Ethics committee approval obtained
  - [ ] Data Protection Impact Assessment completed
  - [ ] Information Governance approval
  - [ ] Independent oversight (e.g., Caldicott Guardian)
  - [ ] Patient advisory group consulted

Safeguards (DeepMind did implement some technical safeguards)

  - [x] Data encrypted in transit and at rest
  - [x] Access controls and audit logs
  - [ ] Regular privacy audits
  - [ ] Breach notification plan

Key Takeaways

  1. Innovation doesn’t excuse privacy violations - “Saving lives” is not a justification for unlawful data sharing

  2. Data minimization is not optional - Collect only what you need, not everything you can access

  3. Patient consent matters - Even for “beneficial” uses, patients have a right to know and choose

  4. Power imbalances create risk - Under-resourced public health agencies need independent legal support when partnering with tech giants

  5. “Free” technology is not free - Costs may be paid in patient privacy and public trust

  6. Trust, once broken, is hard to rebuild - This scandal damaged NHS-tech partnerships for years

References

Primary sources:

  • UK Information Commissioner’s Office. (2017). Royal Free - Google DeepMind trial failed to comply with data protection law
  • Powles, J., & Hodson, H. (2017). Google DeepMind and healthcare in an age of algorithms. Health and Technology, 7(4), 351-367. DOI: 10.1007/s12553-017-0179-1
  • Hodson, H. (2017). DeepMind’s NHS patient data deal was illegal, says UK watchdog. New Scientist
  • Hern, A. (2017). Royal Free breached UK data law in 1.6m patient deal with Google’s DeepMind. The Guardian
  • Powles, J. (2016). DeepMind’s latest NHS deal leaves big questions unanswered. The Guardian

Analysis:

  • Veale, M., & Binns, R. (2017). Fairer machine learning in the real world: Mitigating discrimination without collecting sensitive data. Big Data & Society, 4(2). DOI: 10.1177/2053951717743530


Case Study 3: Google Health India - The Lab-to-Field Performance Gap

The Promise

2016-2018: Google Health developed an AI system for diabetic retinopathy (DR) screening with impressive results (Gulshan et al. 2016):

  • 96% sensitivity in validation studies
  • Published in JAMA (high-impact journal) (Krause et al. 2018)
  • Regulatory approval in Europe
  • Deployment in India to address ophthalmologist shortage

The vision:

  • Democratize DR screening in low-resource settings
  • Address 415 million people with diabetes globally
  • Prevent blindness through early detection
  • Showcase AI’s potential for global health equity

The Reality

2019-2020: Field deployment in rural India clinics encountered severe problems (Beede et al. 2020):

  • Nurses couldn’t use the system effectively
  • Poor image quality from non-standard cameras
  • Internet connectivity too unreliable
  • Workflow disruptions caused bottlenecks
  • Patient follow-up rates plummeted
  • Program quietly scaled back (Mukherjee 2021)

Performance degradation:

  • Lab conditions: 96% sensitivity
  • Field conditions: ~55% of images were ungradable (system rejected them as too poor quality) (Beede et al. 2020)
  • Of gradable images, performance unknown (not systematically evaluated)
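The ungradable rate alone bounds what the deployment could achieve: even granting the AI its lab sensitivity on every gradable image (an optimistic and unverified assumption), field-level sensitivity is capped by the fraction of images the system will accept.

```python
# Back-of-envelope ceiling on field sensitivity, using the figures above.
ungradable_rate = 0.55   # ~55% of field images rejected as ungradable
lab_sensitivity = 0.96   # sensitivity reported in validation studies

gradable_rate = 1 - ungradable_rate
field_ceiling = gradable_rate * lab_sensitivity  # optimistic upper bound

print(f"Best-case field sensitivity: {field_ceiling:.0%}")  # ~43%
```

In other words, a best-case 43% of patients with disease could be flagged on site, before accounting for any real drop in per-image accuracy under field conditions.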

Root Cause Analysis

1. Lab-to-Field Translation Failure

Controlled research environment vs. real-world chaos:

# Lab environment (where AI performed well)
class LabEnvironment:
    """
    Idealized conditions for AI development
    """

    def __init__(self):
        self.camera = "High-end retinal camera ($40,000)"
        self.operator = "Trained ophthalmology photographer"
        self.lighting = "Optimal, controlled"
        self.patient_cooperation = "High (research volunteers)"
        self.internet = "Fast, reliable hospital WiFi"
        self.support = "On-site AI researchers for troubleshooting"

    def capture_image(self, patient):
        """Image capture in lab conditions"""

        # Professional photographer with optimal equipment
        image = self.camera.capture(
            patient=patient,
            attempts=5,  # Can retry multiple times
            lighting='optimal',
            dilation='complete'  # Pupils fully dilated
        )

        # Quality control before AI analysis
        if image.quality_score < 0.9:
            image = self.recapture()  # Try again

        # Fast, reliable internet for cloud processing
        result = self.ai_model.predict(
            image,
            internet_speed='1 Gbps',
            latency='<100ms'
        )

        return result  # High quality input → High quality output


# Field environment (where AI failed)
class FieldEnvironmentIndia:
    """
    Reality of rural Indian primary care clinics
    """

    def __init__(self):
        self.camera = "Portable retinal camera ($5,000, different model than training data)"
        self.operator = "Nurse with 2-hour training"
        self.lighting = "Variable, often poor"
        self.patient_cooperation = "Variable (many elderly, diabetic complications)"
        self.internet = "Intermittent, slow (when available)"
        self.support = "None (Google researchers in California)"

    def capture_image(self, patient):
        """Image capture in field conditions"""

        # PROBLEM 1: Equipment mismatch
        # AI trained on $40K cameras, deployed with $5K cameras
        # Different image characteristics, compression, resolution

        # PROBLEM 2: Operator skill gap
        # Nurse has 2 hours of training vs. professional photographers
        image = self.camera.capture(
            patient=patient,
            attempts=2,  # Limited time per patient
            lighting='suboptimal',  # Poor clinic lighting
            dilation='partial'  # Patients dislike dilation, often incomplete
        )

        # PROBLEM 3: Image quality issues
        image_quality_issues = {
            'blurry': 0.25,  # Camera shake, patient movement
            'poor_lighting': 0.30,  # Inadequate illumination
            'wrong_angle': 0.20,  # Inexperienced operator
            'incomplete_dilation': 0.35,  # Patient discomfort
            'off_center': 0.15  # Targeting errors
        }

        # AI rejects poor quality images
        if image.quality_score < 0.7:
            return "UNGRADABLE IMAGE - REFER TO OPHTHALMOLOGIST"
            # Problem: Clinic has no ophthalmologist
            # Patient told to travel 50km to district hospital
            # Most patients don't follow up

        # PROBLEM 4: Connectivity failure
        try:
            result = self.ai_model.predict(
                image,
                internet_speed='0.5 Mbps',  # 2000x slower than lab
                latency='2000ms',  # 20x worse than lab
                timeout='30 seconds'
            )
        except TimeoutError:
            # Internet too slow, AI in cloud can't process
            # Patient leaves without screening
            return "SYSTEM ERROR - UNABLE TO PROCESS"

        # PROBLEM 5: Workflow disruption
        processing_time_minutes = 5  # vs 30 seconds in lab
        # Clinic sees 50 patients/day
        # 5 min/patient for DR screening = 250 minutes = 4+ hours
        # Entire clinic workflow collapses

        return result

2. User-Centered Design Failure

Google designed for ophthalmologists, deployed with nurses:

Training gap:

  • Ophthalmology photographers: years of training, hundreds of images daily
  • Rural clinic nurses: 2-hour training session, first time using a retinal camera
  • No ongoing support or troubleshooting

Workflow integration failure:

  • System added 5+ minutes per patient (clinics operate on tight schedules)
  • Required internet connectivity (unreliable in rural areas)
  • Cloud-based processing created dependency on Google servers
  • No offline mode for areas with poor connectivity

Error handling:

  • System rejected 55% of images as “ungradable”
  • No actionable guidance for nurses on how to improve image quality
  • Patients told “refer to ophthalmologist,” but the nearest one was 50km+ away
  • Follow-up rate for referrals: <20%
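Chaining the figures in this section shows how the error-handling design eroded the program’s reach. A sketch per 1,000 presenting patients, taking the stated <20% follow-up rate at its upper bound:

```python
# Screening funnel per 1,000 presenting patients (figures from this case study).
patients = 1000
gradable = patients * (1 - 0.55)   # 45% of images were gradable on site
referred = patients - gradable     # ungradable -> referred 50km+ away
followed_up = referred * 0.20      # <20% completed the referral

effectively_screened = gradable + followed_up
print(f"Screened or examined: {effectively_screened:.0f} of {patients}")  # 560
```

Under these assumptions, nearly half the patients who showed up were never evaluated at all, which is why the referral pathway, not the model, dominated the program’s real-world yield.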

3. Validation Mismatch

What was validated:

  • AI performance on high-quality images from research-grade cameras
  • Agreement with expert ophthalmologists on curated datasets
  • Technical accuracy in controlled settings

What was NOT validated:

  • End-to-end workflow in actual clinic settings
  • Performance with the portable cameras used in the field
  • Nurse ability to obtain gradable images
  • Patient acceptance and follow-up rates
  • Impact on clinic workflow and throughput
  • Actual health outcomes (did blindness decrease?)

The Fallout

Program outcomes:

  • Quietly scaled back in 2020 (Mukherjee 2021)
  • No published results on real-world impact
  • Unknown number of patients screened
  • Unknown impact on diabetic retinopathy detection or blindness prevention

Lessons for Google:

  • Led to major changes in Google Health strategy (Mukherjee 2021)
  • Increased focus on user research and field testing
  • Recognition that “AI accuracy” ≠ “system effectiveness” (Beede et al. 2020)
  • Several key researchers left Google Health

Impact on field:

  • Highlighted the gap between AI research and implementation science
  • Demonstrated the need for human-centered design in clinical AI
  • Showed that technical performance is necessary but not sufficient

Missed opportunity:

  • India has a massive DR screening gap (millions unscreened)
  • A well-designed system could have made real impact
  • The failure set back AI adoption in Indian primary care

What Should Have Happened

Implementation science approach:

Phase 1: Formative Research (6-12 months)

  1. Ethnographic study of actual clinic workflows
    • Shadow nurses in rural clinics for weeks
    • Document real-world constraints (time, connectivity, equipment)
    • Identify workflow integration points
    • Understand patient barriers (cost, distance, literacy)
  2. Technology assessment
    • Test portable cameras actually available in rural clinics
    • Measure real-world internet connectivity
    • Assess power reliability
    • Identify equipment constraints
  3. User research with nurses
    • What training do they need?
    • What support systems are required?
    • How much time can be allocated per patient?
    • What error messages are actionable?

Phase 2: Adapt AI System (6-12 months)

  1. Retrain the AI on images from field equipment
    • Collect training data using the portable cameras actually deployed
    • Include poor lighting, motion blur, incomplete dilation
    • Train the AI to be robust to field conditions
  2. Design for intermittent connectivity
    • Offline mode for AI processing (edge deployment)
    • Sync results when connectivity is available
    • No dependency on the cloud for basic functionality
  3. Improve usability for nurses
    • Real-time feedback on image quality
    • Guidance system: “Move camera up,” “Improve lighting,” etc.
    • Simplified training program with ongoing support

Phase 3: Pilot Implementation (12 months)

  1. Small-scale pilot (3-5 clinics)
    • Intensive monitoring and support
    • Rapid iteration based on feedback
    • Document workflow integration challenges
    • Measure key outcomes: gradable image rate, screening completion, referral follow-up
  2. Hybrid approach
    • AI flags high-risk cases
    • Tele-ophthalmology for borderline cases
    • Local health workers support follow-up
    • Integration with existing health systems

Phase 4: Evaluation and Iteration (12 months)

  1. Process evaluation
    • What percentage of eligible patients were screened?
    • What percentage of images were gradable?
    • Nurse satisfaction and confidence
    • Workflow impact on clinic operations
  2. Outcome evaluation
    • Detection rates (vs. baseline)
    • Referral completion rates
    • Time to treatment
    • Long-term impact on vision outcomes

Phase 5: Scale Only If Successful

  1. Expand only if the pilot demonstrates:
    • Feasible workflow integration
    • High gradable image rate (>80%)
    • Improved patient outcomes
    • Sustainability without ongoing external support

Total timeline: 3-4 years from development to scale

What actually happened: Lab validation → immediate deployment → failure

Prevention Checklist

Warning: Red Flags for Implementation Failure

Use this checklist for AI deployed in resource-limited settings:

User Research ❌ Google failed all of these
- [ ] Ethnographic study of actual deployment environment
- [ ] End-user involvement in design (not just technical experts)
- [ ] Workflow analysis in real-world conditions
- [ ] Identification of infrastructure constraints (connectivity, power, equipment)

Technology Adaptation ❌ Google failed all of these
- [ ] AI trained on data from actual deployment equipment
- [ ] System designed for worst-case conditions (poor connectivity, power outages)
- [ ] Offline functionality for critical features
- [ ] Performance validated with target end-users (not just technical performance)

Pilot Testing ❌ Google failed to run an adequate pilot
- [ ] Small-scale pilot before full deployment
- [ ] Intensive monitoring and rapid iteration
- [ ] Process metrics tracked (gradable image rate, completion rate, time per patient)
- [ ] Outcome metrics tracked (detection rate, referral follow-up, health impact)

Training and Support ❌ Google failed these
- [ ] Adequate training for end-users (not a 2-hour session)
- [ ] Ongoing support and troubleshooting
- [ ] Local champions and peer support
- [ ] Refresher training and skill maintenance

Sustainability ❌ Google failed to assess this
- [ ] System sustainable without external support
- [ ] Integration with existing health system
- [ ] Local ownership and maintenance
- [ ] Cost-effectiveness analysis

Key Takeaways

  1. 96% accuracy in the lab ≠ Success in the field - Technical performance is necessary but not sufficient

  2. Design for real-world conditions, not idealized lab settings - Rural clinics ≠ Research hospitals

  3. Technology must fit workflow, not the other way around - Adding 5 minutes per patient collapsed clinic operations

  4. End-users must be involved in design - Designing for ophthalmologists, deploying with nurses = failure

  5. Infrastructure constraints are not optional - Intermittent internet, poor lighting, limited equipment are realities to design around

  6. Pilot, iterate, then scale - Not deploy globally and hope for the best

  7. Implementation science matters as much as AI science - Getting technology into hands of users requires different expertise than developing the technology

References

Primary research:
- Gulshan, V., et al. (2016). Development and Validation of a Deep Learning Algorithm for Detection of Diabetic Retinopathy in Retinal Fundus Photographs. JAMA, 316(22), 2402-2410. DOI: 10.1001/jama.2016.17216
- Krause, J., et al. (2018). Grader Variability and the Importance of Reference Standards for Evaluating Machine Learning Models for Diabetic Retinopathy. Ophthalmology, 125(8), 1264-1272. DOI: 10.1016/j.ophtha.2018.01.034
- Beede, E., et al. (2020). A Human-Centered Evaluation of a Deep Learning System Deployed in Clinics for the Detection of Diabetic Retinopathy. CHI 2020: Proceedings of the 2020 CHI Conference on Human Factors in Computing Systems, 1-12. DOI: 10.1145/3313831.3376718

Media coverage and analysis:
- Mukherjee, S. (2021). A.I. Versus M.D.: What Happens When Diagnosis Is Automated? The New Yorker
- Heaven, W. D. (2020). Google’s medical AI was super accurate in a lab. Real life was a different story. MIT Technology Review

Implementation science context:
- Keane, P. A., & Topol, E. J. (2018). With an eye to AI and autonomous diagnosis. npj Digital Medicine, 1(1), 40. DOI: 10.1038/s41746-018-0048-y


Case Study 4: Epic Sepsis Model - When Vendor Claims Meet Reality

The Promise

Epic, the largest EHR vendor in the US (its systems are used by 50%+ of US hospitals), developed and deployed a machine learning model to predict sepsis risk and alert clinicians.

Vendor claims:
- High accuracy (AUC 0.76-0.83, depending on version)
- Early warning (hours before sepsis diagnosis)
- Implemented in 100+ hospitals
- Potential to save thousands of lives

Marketing message:
- “AI-powered early warning system”
- Integrated seamlessly into the Epic EHR workflow
- Evidence-based and clinically validated

The Reality

2021: External validation study published in JAMA Internal Medicine (Wong et al. 2021)

Researchers at University of Michigan tested Epic’s sepsis model on their patients:

Findings:
- Sensitivity: 63% (missed 37% of sepsis cases)
- Positive predictive value: 12% (88% of alerts were false alarms)
- Of every 100 alerts, only 12 patients actually had sepsis
- Alert fatigue: clinicians ignored most alerts
- No evidence of improved patient outcomes

External validation results diverged dramatically from vendor claims (Wong et al. 2021).

Root Cause Analysis

1. Internal vs. External Validation Gap

The validation problem:

# What Epic likely did (internal validation)
class InternalValidation:
    """
    Vendor validation approach
    """

    def __init__(self):
        self.training_data = "Epic customer hospitals (unspecified number)"
        self.test_data = "Same Epic customer hospitals (different time period)"

    def validate_model(self):
        """Internal validation methodology"""

        # Train on Epic customer data
        model = self.train_model(
            data=self.get_epic_customer_ehr_data(),
            features=self.epic_specific_features,
            labels=self.sepsis_cases
        )

        # Test on different time period from same hospitals
        # PROBLEM: Same patient population, same documentation practices, same workflows
        test_performance = model.evaluate(
            data=self.get_epic_customer_ehr_data(time_period='later'),
            metric='AUC'
        )

        # Report performance
        print(f"AUC: {test_performance['auc']}")  # 0.83!

        # WHAT'S MISSING:
        # - Validation on hospitals not in training data
        # - Validation on non-Epic EHR systems
        # - Different patient populations
        # - Different clinical workflows
        # - Real-world alert rate and clinician response
        # - Impact on patient outcomes


# What independent researchers did (external validation)
class ExternalValidation:
    """
    University of Michigan external validation
    """

    def __init__(self):
        self.test_hospital = "University of Michigan (not in Epic training data)"
        self.ehr_system = "Epic (same vendor, different implementation)"

    def validate_model(self):
        """Independent validation methodology"""

        # Test Epic's deployed model on completely new population
        results = epic_sepsis_model.evaluate(
            data=self.umich_patient_data,  # NEW hospital, NEW patients
            ground_truth=self.chart_review_sepsis_diagnosis  # Gold standard
        )

        # Comprehensive metrics
        performance = {
            'auc': 0.63,  # Lower than Epic's claim of 0.83
            'sensitivity': 0.63,  # Misses 37% of sepsis cases
            'specificity': 0.66,  # Many false alarms
            'ppv': 0.12,  # 88% of alerts are wrong
            'alert_rate': '1 per 2.1 patients',  # Overwhelming alert burden
            'alert_burden': 'Median 84 alerts per day per ICU team'
        }

        # Clinical workflow impact
        clinician_response = self.survey_clinicians()
        # "Too many false alarms"
        # "Ignored most alerts due to alert fatigue"
        # "No change in sepsis management"

        # Patient outcomes
        outcome_analysis = self.compare_outcomes(
            before_epic_sepsis_model,
            after_epic_sepsis_model
        )
        # No significant change in:
        # - Time to antibiotics
        # - Time to sepsis bundle completion
        # - ICU length of stay
        # - Mortality

        return performance

Why performance degraded:

  1. Different patient populations
    • Training hospitals vs. Michigan patient mix
    • Different case severity distributions
    • Different comorbidity profiles
  2. Different documentation practices
    • How clinicians document varies by institution
    • Model learned institution-specific patterns
    • Patterns don’t generalize
  3. Different workflows
    • How quickly vitals are entered
    • Which lab tests are ordered when
    • Documentation timing and completeness
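The third failure mode can be made concrete with a toy example (all data hypothetical): a model that has latched onto documentation speed as a proxy for sepsis keeps its sensitivity only at sites that chart the same way as the training hospitals.

```python
def alert(record):
    # "Learned" proxy: at the training site, nurses chart vitals within
    # minutes for deteriorating patients, so charting speed predicts sepsis.
    return record["minutes_to_chart_vitals"] < 15

def sensitivity(patients):
    """Fraction of true sepsis cases that trigger an alert."""
    septic = [p for p in patients if p["sepsis"]]
    return sum(alert(p) for p in septic) / len(septic)

# Site A (like the training data): acuity drives charting speed.
site_a = ([{"minutes_to_chart_vitals": 5, "sepsis": True}] * 9
          + [{"minutes_to_chart_vitals": 55, "sepsis": True}] * 1
          + [{"minutes_to_chart_vitals": 50, "sepsis": False}] * 90)

# Site B: vitals are batch-charted hourly for everyone, sick or not.
site_b = ([{"minutes_to_chart_vitals": 60, "sepsis": True}] * 10
          + [{"minutes_to_chart_vitals": 60, "sepsis": False}] * 90)

print(sensitivity(site_a))  # 0.9 -> looks great in internal validation
print(sensitivity(site_b))  # 0.0 -> collapses under a different workflow
```

The feature itself is fictional, but the mechanism is the one the Michigan study points to: institution-specific documentation patterns do not generalize.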

2. The False Alarm Problem

Alert burden analysis:

class AlertFatigueAnalysis:
    """
    Understanding the alert burden problem
    """

    def calculate_alert_burden(self):
        """Michigan ICU alert volume"""

        hospital_stats = {
            'icu_patients_per_day': 100,
            'alert_rate': '1 per 2.1 patients',  # Per Michigan study
            'alerts_per_day': 100 / 2.1  # ≈ 48 alerts/day
        }

        # Each alert requires:
        alert_overhead = {
            'time_to_review_alert': '2-3 minutes',
            'review_patient_chart': '3-5 minutes',
            'assess_clinical_relevance': '2-3 minutes',
            'document_response': '1-2 minutes',
            'total_per_alert': '8-13 minutes'
        }

        # For ICU team seeing 48 alerts/day:
        daily_burden = {
            'time_spent_on_alerts': '6-10 hours',  # Of nursing/physician time
            'true_sepsis_cases': 48 * 0.12,  # ≈ 6 patients actually have sepsis
            'false_alarms': 48 * 0.88,  # ≈ 42 false alarms
            'true_positives_missed': 'Unknown (63% sensitivity means ~37% of cases never alert)'
        }

        # Outcome: Alert fatigue
        clinician_response = {
            'alert_responsiveness': 'Decreases over time',
            'cognitive_burden': 'High',
            'trust_in_system': 'Low',
            'actual_behavior_change': 'Minimal'
        }

        return "System adds burden without clear benefit"

The specificity-alert burden tradeoff:

If you want to catch more sepsis cases (higher sensitivity), you must accept more false alarms (lower specificity). Consider a hospital with:
- 100 ICU patients
- 5% sepsis prevalence
- Target: 95% sensitivity (catch almost all cases)

You must then accept that:
- ~80% of alerts will be false alarms
- Clinicians will become fatigued and ignore alerts
- The rare true positives will be lost in the noise

Epic’s model had:
- 63% sensitivity (missed 37% of cases) ← not good enough
- 66% specificity (34% false positive rate) ← alert burden too high
- The worst of both worlds
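The arithmetic behind these figures is just Bayes’ rule. A quick sketch (the 5% prevalence and the 70% specificity in the second example are illustrative assumptions, not numbers from the studies):

```python
def ppv(sensitivity: float, specificity: float, prevalence: float) -> float:
    """Positive predictive value: P(sepsis | alert) via Bayes' rule."""
    true_pos = sensitivity * prevalence
    false_pos = (1 - specificity) * (1 - prevalence)
    return true_pos / (true_pos + false_pos)

# Epic's reported operating point at an assumed 5% prevalence:
print(round(ppv(0.63, 0.66, 0.05), 3))  # 0.089 -> roughly 9 of 10 alerts are false

# Chasing 95% sensitivity (assumed 70% specificity) barely helps PPV:
print(round(ppv(0.95, 0.70, 0.05), 3))  # 0.143 -> most alerts are still false
```

At low prevalence, PPV is dominated by the false positive rate, which is why even an accurate-sounding model drowns clinicians in false alarms.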

3. Lack of Outcome Validation

Epic measured:
- ✓ AUC (model discrimination)
- ✓ Sensitivity/specificity
- ✓ Calibration

Epic did NOT measure (or publish):
- ✗ Impact on time to antibiotics
- ✗ Impact on sepsis bundle completion
- ✗ Impact on ICU length of stay
- ✗ Impact on mortality
- ✗ Cost-effectiveness
- ✗ Clinician alert fatigue and response rates

Model accuracy ≠ Clinical impact

The Fallout

Hospital response:
- Many hospitals that implemented the Epic sepsis model reported similar problems
- Some hospitals turned off the alerts due to alert fatigue
- Others raised the alerting threshold (fewer alerts, but more missed cases)
- Unknown how many hospitals continue to use it effectively

Patient impact:
- No evidence of benefit (outcomes not improved)
- Potential harm from alert fatigue causing real alerts to be ignored
- Unknown number of sepsis cases missed due to 63% sensitivity

Trust impact:
- Increased skepticism of vendor AI claims
- Hospitals demanding independent validation before adoption
- Regulatory interest in AI medical device claims

Research impact:
- Highlighted the need for external validation (Wong et al. 2021)
- Demonstrated the gap between technical performance and clinical utility
- Showed the importance of measuring patient outcomes, not just AUC (McCoy et al. 2020)

What Should Have Happened

Responsible AI deployment pathway:

Phase 1: Transparent Development (Epic’s responsibility)

  1. Publish the development methodology
    • Training data sources and characteristics
    • Feature engineering approach
    • Validation methodology and results
    • Known limitations
  2. Make the model available for independent validation
  3. Provide an implementation guide with expected performance ranges

Phase 2: External Validation (independent researchers)

  1. Pre-deployment validation at 3-5 hospitals not in the training data
  2. Report performance across diverse settings
  3. Measure clinical outcomes, not just AUC
  4. Assess alert burden and clinician response
  5. Publish results in a peer-reviewed journal

Phase 3: Pilot Implementation (hospitals considering adoption)

  1. Small-scale pilot (1-2 ICU units)
  2. Intensive monitoring:
    • Alert volume and clinician response rate
    • Time to sepsis interventions
    • Patient outcomes (mortality, length of stay)
    • Clinician satisfaction and alert fatigue
  3. Compare to historical controls
  4. Decide: scale, modify, or abandon

Phase 4: Iterative Improvement

  1. Customize the model to the local patient population
  2. Adjust alert thresholds based on local clinician feedback
  3. Integrate with local sepsis protocols
  4. Continuous monitoring and updates

What actually happened:

  1. Epic developed and deployed the model
  2. Hospitals adopted it based on vendor claims
  3. External researchers discovered poor performance
  4. The damage to trust was already done

Prevention Checklist

Warning: Red Flags for Vendor AI Systems

Before adopting any commercial clinical AI:

Validation Evidence ❌ Epic sepsis model failed these
- [ ] External validation at multiple independent sites
- [ ] Validation results published in a peer-reviewed journal (not just a vendor white paper)
- [ ] Independent researchers (not vendor employees) conducted the validation
- [ ] Performance reported across diverse patient populations
- [ ] Sensitivity to different EHR documentation practices assessed

Outcome Evidence ❌ Epic sepsis model failed all of these
- [ ] Impact on patient outcomes measured (not just model accuracy)
- [ ] Clinical workflow impact assessed
- [ ] Alert burden quantified
- [ ] Clinician acceptance and response rates reported
- [ ] Cost-effectiveness analysis

Transparency ❌ Epic sepsis model failed these
- [ ] Training data characteristics disclosed
- [ ] Feature engineering documented
- [ ] Known limitations clearly stated
- [ ] Performance expectations realistic (not just best-case)
- [ ] Conflicts of interest disclosed

Implementation Support (variable)
- [ ] Implementation guide provided
- [ ] Training for clinical staff
- [ ] Ongoing technical support
- [ ] Monitoring dashboards for performance tracking
- [ ] Customization to local population possible

Key Takeaways

  1. Vendor claims require independent verification - Epic’s reported performance did not hold up to external validation

  2. Internal validation overfits to training data - Same hospitals, same workflows, same documentation practices

  3. AUC is not enough - Model accuracy must translate to clinical benefit and workflow fit

  4. Alert burden matters more than you think - 88% false alarm rate causes alert fatigue and system abandonment

  5. Measure outcomes, not just model performance - Did patients actually benefit? Were sepsis deaths prevented?

  6. Hospitals need to demand evidence - “Deployed in 100+ hospitals” is not evidence of effectiveness

  7. Transparency enables trust - Vendor opacity prevents independent validation and slows progress

References

Primary research:
- Wong, A., et al. (2021). External Validation of a Widely Implemented Proprietary Sepsis Prediction Model in Hospitalized Patients. JAMA Internal Medicine, 181(8), 1065-1070. DOI: 10.1001/jamainternmed.2021.2626
- McCoy, A., & Das, R. (2017). Reducing patient mortality, length of stay and readmissions through machine learning-based sepsis prediction in the emergency department, intensive care unit and hospital floor units. BMJ Open Quality, 6(2), e000158. DOI: 10.1136/bmjoq-2017-000158

Commentary and analysis:
- Sendak, M. P., et al. (2020). A Path for Translation of Machine Learning Products into Healthcare Delivery. EMJ Innovations, 10(1), 19-00172. DOI: 10.33590/emjinnov/19-00172
- Ginestra, J. C., et al. (2019). Clinician Perception of a Machine Learning-Based Early Warning System Designed to Predict Severe Sepsis and Septic Shock. Critical Care Medicine, 47(11), 1477-1484. DOI: 10.1097/CCM.0000000000003803

Media coverage:
- Ross, C., & Swetlitz, I. (2021). Epic sepsis prediction tool shows sizable overestimation in external study. STAT News
- Strickland, E. (2022). How Sepsis Prediction Algorithms Failed in Real-World Implementation. IEEE Spectrum

Case Study 5: UK NHS COVID-19 Contact Tracing App - Technical and Social Failure

The Promise

May 2020: UK government announced a smartphone app for COVID-19 contact tracing to enable rapid identification and isolation of contacts, allowing the country to ease lockdown safely.

Stated goals:
- Rapid contact identification (within hours, not days)
- Privacy-preserving (no central database of contacts)
- Enable safe reopening of the economy
- Complement manual contact tracing
- A “world-beating” system (PM Boris Johnson’s words)

Initial timeline:
- App promised by mid-May 2020
- Nationwide rollout by June 2020

The Reality

Timeline of failures:
- May 2020: Pilot on the Isle of Wight reveals technical problems
- June 2020: Original app abandoned after £12M spent
- September 2020: New app finally launched (4 months late)
- Adoption: Only 28% of the population downloaded the final app (60%+ was needed for effectiveness)
- Impact: Limited evidence of meaningful contact tracing benefit

Technical problems:
- Original app couldn’t detect contacts on iPhones when the screen was locked
- Bluetooth proximity detection was unreliable (detected people through walls, missed close contacts)
- Battery drain issues
- False positives from fleeting encounters

Social and political problems:
- Centralized data collection raised privacy concerns
- Trust eroded by constantly changing approaches
- Forced switch to the Apple/Google API after initially rejecting it
- £12M wasted on the failed first version
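Why 28% adoption is so damaging follows from a simple observation: a digital exposure notification requires BOTH people in a contact pair to run the app. Under the simplifying assumption of independent, uniform adoption, effective coverage scales with adoption squared:

```python
def pairwise_coverage(adoption_rate: float) -> float:
    """Probability that both people in a contact pair have the app.

    Simplified model: adoption is independent and uniform across the
    population, so pair coverage is adoption squared.
    """
    return adoption_rate ** 2

print(round(pairwise_coverage(0.28), 3))  # 0.078 -> ~8% of contact pairs covered
print(round(pairwise_coverage(0.60), 3))  # 0.36  -> ~36% at the 60% target
```

The real epidemiological threshold depends on contact patterns and notification follow-through, but the quadratic penalty explains why low adoption cripples these apps so quickly.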

Root Cause Analysis

1. Technical Hubris: Reinventing the Wheel Badly

The Apple/Google API decision:

# What UK tried to do: Centralized model
class UKCentralizedModel:
    """
    UK's original approach: centralized contact database
    """

    def __init__(self):
        self.architecture = "centralized"
        self.data_location = "NHS servers"
        self.compatible_with_ios = False  # CRITICAL FLAW

    def detect_contacts(self, user_phone):
        """Contact detection on phones"""

        # PROBLEM 1: iOS restrictions
        # Apple does not allow Bluetooth to run in background
        # unless using official Apple/Google Exposure Notification API

        if user_phone.os == 'iOS':
            if user_phone.screen_locked or user_phone.app_backgrounded:
                # Bluetooth turned off by iOS
                # Can't detect any contacts
                return []  # MASSIVE FAILURE
                # ~50% of UK uses iPhones
                # App useless for half the population

        # PROBLEM 2: Even on Android, unreliable
        contacts = self.scan_bluetooth()

        # Bluetooth RSSI (signal strength) poor proxy for distance
        false_positives = {
            'through_walls': True,  # Detects neighbors through walls
            'through_windows': True,  # Detects people outside
            'fleeting_encounters': True,  # Walking past someone for 2 seconds
        }

        false_negatives = {
            'signal_interference': True,  # Phone in pocket or bag
            'device_variability': True,  # Different phones = different Bluetooth
            'body_blocking': True  # Human body blocks signal
        }

        return contacts  # Low quality, unreliable data

    def send_data_to_server(self, contacts):
        """Upload contact data to NHS servers"""

        # PROBLEM 3: Privacy concerns
        # All contact data sent to central government database
        # Who you met, when, where (if combined with location)
        # Potential for mission creep and surveillance

        privacy_concerns = {
            'central_database': True,
            'government_access': True,
            'mission_creep_risk': True,
            'public_trust': 'Low'
        }

        # Upload to NHS servers
        self.nhs_server.store_contact_graph(contacts)
        # Creates network graph of entire population's contacts
        # Privacy advocates alarmed
        # Public skeptical


# What Apple/Google designed: Decentralized model
class AppleGoogleExposureNotification:
    """
    Apple/Google's approach: decentralized, privacy-preserving
    """

    def __init__(self):
        self.architecture = "decentralized"
        self.data_location = "on device only"
        self.compatible_with_ios = True  # WORKS on all devices
        self.privacy_preserving = True

    def detect_contacts(self, user_phone):
        """Contact detection using OS-level API"""

        # ADVANTAGE 1: Works on iOS and Android
        # Apple and Google built this into operating system
        # Bluetooth works even when screen locked
        # Because OS manages it, not app

        contacts = self.exposure_notification_api.scan_bluetooth()

        # ADVANTAGE 2: Privacy by design
        # Exchange random, rotating identifiers
        # No names, no phone numbers, no location
        # Just anonymous tokens that change every 15 minutes

        return contacts  # Better technical performance

    def handle_positive_test(self, user):
        """User tests positive for COVID-19"""

        # ADVANTAGE 3: Decentralized matching
        # When user tests positive, upload their random IDs
        # Other phones download list of positive IDs
        # Match locally on device
        # No central database of who met whom

        if user.tests_positive:
            user.phone.upload_anonymous_ids()  # Just random numbers

        # Other users' phones check:
        # "Have I encountered any of these random IDs?"
        # If yes: "You may have been exposed, get tested"
        # If no: No notification

        # NHS servers never know:
        # - Who you are
        # - Who you met
        # - Where you were
        # - When encounters happened

        privacy_preserved = True
        functionality_maintained = True

Why the UK initially rejected the Apple/Google API:
- Wanted centralized data for epidemiological research
- Wanted to control the contact-matching algorithm
- Believed it could build a better system
- Pride: “We don’t need American tech companies”

Why the UK ultimately had to adopt it:
- Its app literally didn’t work on iPhones
- It couldn’t bypass Apple’s iOS restrictions
- The public wouldn’t accept privacy violations
- Months were wasted before the mistake was admitted
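The decentralized matching idea above can be sketched in a few lines. This is a toy model only: the real Exposure Notification protocol derives rotating identifiers cryptographically from daily keys, while here random tokens stand in for them:

```python
import secrets

def new_rolling_id() -> str:
    # Real protocol: IDs derived from a daily key, rotating every ~15 min.
    # Toy model: just draw a random token.
    return secrets.token_hex(16)

class Phone:
    def __init__(self):
        self.my_ids = []        # IDs this phone has broadcast
        self.heard_ids = set()  # IDs overheard from nearby phones

    def broadcast(self) -> str:
        rid = new_rolling_id()
        self.my_ids.append(rid)
        return rid

    def hear(self, rolling_id: str) -> None:
        self.heard_ids.add(rolling_id)

    def check_exposure(self, published_positive_ids) -> bool:
        # Matching happens ON DEVICE: the server only distributes the
        # anonymous IDs of users who tested positive and chose to share.
        return bool(self.heard_ids & set(published_positive_ids))

alice, bob = Phone(), Phone()
bob.hear(alice.broadcast())                 # Alice and Bob are in proximity
server_list = alice.my_ids                  # Alice tests positive, shares IDs
print(bob.check_exposure(server_list))      # True  -> Bob gets notified
print(Phone().check_exposure(server_list))  # False -> no contact, no alert
```

The server never learns who met whom; it only relays anonymous tokens, which is exactly the property the centralized NHS design gave up.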

2. Misunderstanding Social Adoption Requirements

Technical performance ≠ Public adoption

class AppAdoptionModeling:
    """
    What factors drive contact tracing app adoption?
    """

    def predict_adoption_rate(self, app_characteristics):
        """Model of app adoption"""

        # UK's initial assumptions (WRONG)
        uk_assumptions = {
            'primary_factor': 'technical_performance',
            'assumption': 'If app works technically, people will download it',
            'privacy_concerns': 'minimal',  # WRONG
            'trust_in_government': 'high'  # WRONG (especially after lockdown)
        }

        # Actual drivers of adoption (research-based)
        actual_factors = {
            'trust_in_government': 0.35,  # 35% of variance
            'privacy_concerns': 0.25,  # 25% of variance
            'perceived_effectiveness': 0.20,  # 20% of variance
            'social_norms': 0.10,  # 10% of variance
            'ease_of_use': 0.10  # 10% of variance
        }

        # UK's app had problems with ALL factors:
        uk_app_scores = {
            'trust': 'Low (after privacy controversies, changing approaches)',
            'privacy': 'Low (centralized data collection)',
            'effectiveness': 'Low (didn\'t work on iPhones)',
            'social_norms': 'Weak (low initial adoption)',
            'ease_of_use': 'Medium (battery drain, false alerts)'
        }

        predicted_adoption = self.calculate(actual_factors, uk_app_scores)
        # Result: 28% actual adoption
        # Threshold for effectiveness: 60%+
        # Conclusion: Insufficient adoption to be effective

        return predicted_adoption

    def analyze_feedback_loops(self):
        """Adoption creates feedback loops"""

        # Negative feedback loop:
        # 1. Few people download app (privacy concerns)
        # 2. App ineffective with low adoption
        # 3. Word spreads: "App doesn't work"
        # 4. Even fewer people download
        # 5. Repeat

        # Positive feedback loop (if achieved):
        # 1. Many people download (high trust)
        # 2. App effective with high adoption
        # 3. Success stories: "App alerted me, got tested, prevented outbreak"
        # 4. More people download
        # 5. Repeat

        # UK ended up in negative feedback loop
        # Never reached critical mass for effectiveness

Social factors the UK underestimated:

  1. Privacy skepticism - government mass-surveillance concerns (post-Snowden era)
  2. Trust erosion - constantly changing approaches signaled incompetence
  3. Pandemic fatigue - by the September launch, the public was exhausted and skeptical
  4. Alternative strategies - manual contact tracing and testing infrastructure mattered more
  5. Transparency deficit - technical details were not clearly communicated

3. Organizational Dysfunction

Flawed decision-making process:

  1. No diversity of expertise
    • Led by NHSX (digital transformation unit)
    • Limited public health input initially
    • No social scientists or behavioral economists early on
    • Technical team isolated from policy team
  2. Sunk cost fallacy
    • £12M invested in centralized model
    • Reluctance to abandon and switch to Apple/Google API
    • Months wasted before admitting failure
    • Political pressure to show “progress”
  3. Overpromising
    • PM Boris Johnson: “world-beating” system
    • Timeline commitments impossible to meet
    • Set up for public disappointment
  4. Ignoring international experience
    • Other countries already struggling with similar apps
    • Germany, France, Australia all faced low adoption
    • Could have learned from their mistakes
    • Instead: “British exceptionalism”

The Fallout

Financial costs:
- £12M on the failed first version
- Unknown additional costs for the second version
- Opportunity cost of not investing in manual contact tracing capacity

Public health impact:
- App launched 4 months late (during crucial second-wave preparation)
- Low adoption (28%) meant limited effectiveness
- No strong evidence the app prevented significant transmission
- Resources diverted from manual contact tracing

Trust damage:
- Public trust in the government’s COVID response eroded
- Privacy concerns about future health data initiatives
- Skepticism about government digital projects

Political fallout:
- Embarrassment for the UK government
- Delayed reopening plans (the app was a prerequisite)
- International reputation damage (“world-beating” became a punchline)

What Should Have Happened

Evidence-based approach:

Phase 1: Rapid Evidence Review (Week 1-2) 1. Review international contact tracing app experiences - Singapore, Australia, Germany early adopters - Document technical challenges and adoption barriers - Learn from their mistakes 2. Consult with Apple and Google - Understand iOS/Android constraints BEFORE building - Evaluate Apple/Google Exposure Notification API 3. Engage privacy and civil liberties experts - Design privacy-preserving architecture from start - Address concerns proactively 4. Model adoption requirements - What adoption rate needed for effectiveness? - What factors drive adoption? - Is app even necessary given manual tracing capacity?

Phase 2: Stakeholder Engagement (Week 3-4) 1. Public consultation on privacy model - Centralized vs. decentralized tradeoffs - Transparency about data use - Build trust before launch 2. Behavioral science input - How to message app to maximize adoption? - What concerns need addressing? - Pilot messaging with focus groups 3. Integrate with broader testing strategy - App is just one tool, not silver bullet - Ensure testing capacity can handle app-generated demand - Link to NHS Test and Trace infrastructure

Phase 3: Technical Development (Week 5-12) 1. Use Apple/Google Exposure Notification API from start - Saves months of development time - Ensures cross-platform compatibility - Provides privacy guarantees 2. Design for intermittent engagement - Most users won’t check app daily - Push notifications for exposures - Low friction user experience 3. Plan for false positives - How to calibrate sensitivity vs. specificity? - What support for people receiving exposure alerts? - Avoid overwhelming testing system

Phase 4: Pilot and Iterate (Weeks 13-16)

1. Pilot on Isle of Wight (actually a good choice)
   - But be transparent about results
   - Acknowledge problems quickly
   - Iterate rapidly based on feedback
2. Monitor technical performance AND social adoption
   - Don't just measure downloads
   - Measure active use, notification response rates
   - Survey reasons for non-adoption

Phase 5: Conditional National Rollout

1. Only scale if pilot shows:
   - Technical reliability (works on all phones)
   - Adequate adoption (>50% in pilot area)
   - Manageable false positive rate
   - Integration with Test & Trace system works
2. If pilot fails: abandon or redesign
   - Don't waste £12M on a failed system
   - Redirect resources to manual contact tracing

Realistic timeline: 12-16 weeks (vs. promised 4-6 weeks)

Key difference: Honest about technical constraints and social requirements from day one
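The adoption modeling called for in Phase 1 can be made concrete. A toy calculation, under the assumption (ours, not from the source) that an exposure is only logged when both people in a contact run the app, shows why even apparently healthy uptake numbers leave most contacts invisible:

```python
# Toy model (illustrative assumption): if BOTH parties must run the app
# for an exposure to be logged, the fraction of contact events the app
# can see scales with the square of uptake. Function name is ours.

def detectable_contact_fraction(uptake: float) -> float:
    """Fraction of contact events visible to the app."""
    if not 0.0 <= uptake <= 1.0:
        raise ValueError("uptake must be a proportion between 0 and 1")
    return uptake ** 2

for uptake in (0.2, 0.5, 0.8):
    print(f"{uptake:.0%} uptake -> "
          f"{detectable_contact_fraction(uptake):.0%} of contacts visible")
```

Under this sketch, even 50% uptake (the pilot-area threshold suggested above) exposes only a quarter of contact events to the app, which is why the app can complement, but never replace, manual contact tracing.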

Prevention Checklist

Warning: Red Flags for Digital Public Health Interventions

Use this before launching apps or digital tools:

Technical Feasibility ❌ UK NHS app failed all initially

- [ ] Platform constraints understood (iOS, Android limitations)
- [ ] Technical architecture validated by external experts
- [ ] Pilot testing in real-world conditions
- [ ] Failure modes identified and mitigated
- [ ] Battery, data, accessibility considered

Privacy and Ethics ❌ UK initially failed these

- [ ] Privacy-by-design from start (not added later)
- [ ] Independent ethics review
- [ ] Privacy Impact Assessment completed
- [ ] Data minimization principle applied
- [ ] Transparency about data use and storage

Social Adoption ❌ UK failed to model these properly

- [ ] Adoption requirements modeled (what % needed for effectiveness?)
- [ ] Barriers to adoption identified through user research
- [ ] Trust-building strategy developed
- [ ] Behavioral science input integrated
- [ ] Communication strategy tested with target audience

Integration with Health System ❌ UK struggled with this

- [ ] Integration with existing contact tracing infrastructure
- [ ] Testing capacity adequate for app-generated demand
- [ ] Workflow for handling notifications designed
- [ ] Training for contact tracers and clinical staff
- [ ] Monitoring and evaluation plan

Governance ❌ UK failed several of these

- [ ] Diverse expert input (technical, clinical, social science, ethics)
- [ ] Realistic timeline (not politically driven)
- [ ] Contingency plans if adoption low
- [ ] Transparent reporting of performance
- [ ] Willingness to abandon if not working

Key Takeaways

  1. Technical feasibility ≠ Social acceptability - App that works technically can still fail if people won’t use it

  2. You can’t bypass platform constraints - Apple and Google control iOS/Android; work with them, not against them

  3. Privacy skepticism is rational - Especially for government surveillance; design for privacy from start

  4. Overpromising backfires - “World-beating” claims set up for humiliating failure

  5. Sunk cost fallacy is dangerous - Abandon failed approaches quickly; don’t throw good money after bad

  6. Digital is not a substitute for infrastructure - App can’t compensate for inadequate testing and manual contact tracing

  7. Learn from others’ mistakes - Multiple countries’ failures preceded UK’s; could have learned from them

  8. Diverse expertise matters - Technical teams alone make bad policy; need social science, ethics, public health input

References

Government reports and official documents:

- National Audit Office. (2021). Test and Trace in England - progress update.
- Information Commissioner's Office. (2020). ICO statement on NHS COVID-19 app.

Research and analysis:

- Rowe, F., et al. (2021). When is it effective to use digital contact tracing as a response to COVID-19? PLOS ONE, 16(5), e0248250. DOI: 10.1371/journal.pone.0248250
- Abeler, J., et al. (2020). COVID-19 Contact Tracing and Data Protection Can Go Together. JMIR mHealth and uHealth, 8(4), e19359. DOI: 10.2196/19359
- Williams, S. N., et al. (2021). Public attitudes towards COVID-19 contact tracing apps: A UK-based focus group study. Health Expectations, 24(2), 377-385. DOI: 10.1111/hex.13179

Media coverage:

- Kelion, L. (2020). Coronavirus: UK contact-tracing app switches to Apple-Google model. BBC News.
- Murphy, M., & Bradshaw, T. (2020). UK abandons contact-tracing app for Apple and Google model. Financial Times.
- Hern, A. (2020). UK's NHS Covid-19 contact tracing app has cost £35m so far. The Guardian.


Case Study 6: OPTUM/UnitedHealth Algorithmic Bias - Proxy Discrimination

The Promise

UnitedHealth’s OPTUM developed an algorithm used to identify high-risk patients for “care management programs” - extra support and resources for complex medical needs.

Stated purpose:

- Identify patients who would benefit from intensive care management
- Reduce emergency department visits and hospitalizations
- Improve health outcomes for complex patients
- Allocate healthcare resources efficiently

Scale:

- Used by many US healthcare systems
- Affected approximately 200 million people annually

The Reality

October 2019: Published in Science (Obermeyer et al., 2019)

Researchers at UC Berkeley discovered the algorithm exhibited severe racial bias:

Key findings:

- Algorithm systematically scored Black patients as lower risk than White patients with the same level of health needs
- At a given risk score, Black patients were significantly sicker than White patients
- Bias magnitude: to achieve equal access, 46.5% more Black patients would need to be enrolled
- Impact: millions of Black patients denied access to care management programs they needed

How it worked:

- Algorithm predicted healthcare costs as a proxy for healthcare needs
- Problem: Black patients have lower healthcare spending even when equally or more sick
- Result: algorithm learned that Black patients are "lower risk" because they spend less

Root Cause Analysis

1. Proxy Variable Bias: Healthcare Costs ≠ Healthcare Needs

The fundamental design flaw:

# What OPTUM did (WRONG): Predict costs as proxy for health needs
class OPTUMRiskAlgorithm:
    """
    OPTUM's approach: Use cost as proxy for health needs
    """

    def __init__(self):
        self.target_variable = "total_healthcare_costs"  # WRONG CHOICE
        self.intended_outcome = "identify patients with high health needs"
        # PROBLEM: Costs ≠ Needs (especially across racial groups)

    def train_model(self, patient_data):
        """Train model to predict healthcare costs"""

        features = [
            'age',
            'sex',
            'diagnoses_codes',  # ICD-10 codes
            'medications',
            'prior_utilization',
            'comorbidities'
            # NOTE: 'race' not explicitly included as feature
            # But race correlated with many other features
        ]

        # Target: Total healthcare costs in next year
        target = 'total_healthcare_costs_next_year'

        # Train predictive model
        model = self.ml_algorithm.fit(
            X=patient_data[features],
            y=patient_data[target]  # Predicting costs
        )

        return model

    def predict_risk(self, patient):
        """Predict patient's risk score"""

        # Higher predicted cost = higher risk score
        predicted_cost = self.model.predict(patient)
        risk_score = predicted_cost  # Cost used as proxy for health need

        return risk_score

    def analyze_why_bias_occurs(self):
        """Why this approach creates racial bias"""

        # PROBLEM: Healthcare costs reflect systemic inequities

        # Example: Two patients with same chronic kidney disease
        white_patient = {
            'ckd_stage': 4,  # Advanced kidney disease
            'symptoms': 'Severe',
            'access_to_care': 'Good insurance, nearby nephrologist',
            'historical_spending': '$15,000/year',
            'receives_appropriate_care': True
        }

        black_patient = {
            'ckd_stage': 4,  # Same advanced kidney disease
            'symptoms': 'Severe',  # Same symptom severity
            'access_to_care': 'Medicaid, far from nephrologist, work scheduling conflicts',
            'historical_spending': '$8,000/year',  # LOWER spending despite same disease
            'receives_appropriate_care': False  # Barriers prevent getting needed care
        }

        # Algorithm learns:
        # White patient costs $15K → High risk → Enroll in care management
        # Black patient costs $8K → Lower risk → Don't enroll

        # Reality:
        # Black patient has SAME disease severity
        # But structural barriers reduce their healthcare spending
        # Algorithm learns "Black patients cost less" = "Black patients are healthier"
        # This is FALSE and harmful

        # Root causes of spending disparities:
        reasons_for_lower_black_spending = {
            'access_barriers': [
                'Lack of transportation',
                'Work schedule inflexibility',
                'Childcare responsibilities',
                'Provider shortages in predominantly Black neighborhoods',
                'Distance to specialists'
            ],
            'insurance_barriers': [
                'Higher rates of Medicaid (lower reimbursement)',
                'Higher rates of uninsurance',
                'High deductibles deterring care-seeking'
            ],
            'systemic_factors': [
                "Physician implicit bias (Black patients' pain undertreated)",
                'Lower referral rates to specialists',
                'Medical mistrust from historical abuses (Tuskegee, etc.)',
                'Discrimination in healthcare settings'
            ],
            'economic_factors': [
                'Lower incomes → delay seeking care',
                'Cost-related medication non-adherence',
                'Unable to afford copays and deductibles'
            ]
        }

        # Result: Algorithm encodes systemic racism into automated decisions


# What SHOULD have been done (CORRECT): Predict health needs directly
class ImprovedRiskAlgorithm:
    """
    Better approach: Predict health needs, not costs
    """

    def __init__(self):
        self.target_variable = "active_chronic_conditions"  # Better proxy
        self.intended_outcome = "identify patients with high health needs"
        # Much better alignment between target and intended outcome

    def train_model(self, patient_data):
        """Train model to predict health needs directly"""

        # Use multiple indicators of health need:
        health_need_indicators = [
            'number_active_chronic_conditions',
            'disease_severity_scores',
            'functional_status',  # Activities of daily living
            'biomarkers',  # HbA1c for diabetes, GFR for kidney disease
            'patient_reported_symptoms',
            'risk_of_deterioration'
        ]

        # Composite target: Actual health need, not spending
        # Each chronic condition weighted by severity
        target = self.calculate_health_need_score(patient_data)

        model = self.ml_algorithm.fit(
            X=patient_data[health_need_indicators],
            y=target  # Predicting health need, not costs
        )

        return model

    def evaluate_for_bias(self, model, test_data, threshold=0.1):
        """Proactive bias testing"""

        fairness_audit_report = []  # findings collected during the audit

        # Check: Do patients with same health needs get same risk scores
        # regardless of race?

        for condition_severity in ['mild', 'moderate', 'severe']:
            white_patients = test_data[
                (test_data['race'] == 'White') &
                (test_data['condition_severity'] == condition_severity)
            ]

            black_patients = test_data[
                (test_data['race'] == 'Black') &
                (test_data['condition_severity'] == condition_severity)
            ]

            white_predictions = model.predict(white_patients)
            black_predictions = model.predict(black_patients)

            # Test for disparate impact
            if abs(white_predictions.mean() - black_predictions.mean()) > threshold:
                print(f"WARNING: Racial disparity in predictions for {condition_severity}")
                print(f"White patients average risk: {white_predictions.mean()}")
                print(f"Black patients average risk: {black_predictions.mean()}")
                fairness_audit_report.append(condition_severity)

                # Investigate and fix before deployment
                self.investigate_disparity(condition_severity)

        return fairness_audit_report

The cost-based approach encoded existing healthcare disparities:

- Black patients historically receive less care for the same conditions (due to systemic racism)
- The algorithm learned this pattern
- The algorithm perpetuated and amplified the disparity by denying Black patients access to intensive care management
- Feedback loop: less care → sicker → but still predicted as "low risk" because spending remains low

2. Organizational Blindness to Bias

How did this get deployed affecting 200 million people without detection?

Lack of bias testing:

class OrganizationalFailure:
    """
    Why bias wasn't caught before deployment
    """

    def pre_deployment_testing(self):
        """What OPTUM likely tested"""

        tests_performed = {
            'predictive_accuracy': True,  # AUC, R-squared
            'calibration': True,  # Do predictions match actual costs?
            'stability': True,  # Performance over time?
            'racial_fairness': False  # NOT TESTED
        }

        # OPTUM measured:
        # "Does algorithm accurately predict healthcare costs?" YES
        # "Is algorithm calibrated?" YES
        # "Does algorithm perform consistently?" YES

        # OPTUM did NOT measure:
        # "Do patients with same health needs get same risk scores
        # regardless of race?" NO - Never asked this question

        # Why not?
        why_fairness_not_tested = {
            'lack_of_awareness': "Didn't consider algorithmic bias as risk",
            'lack_of_expertise': "No fairness/ethics experts on development team",
            'lack_of_mandate': "No regulatory requirement to test for bias",
            'lack_of_incentive': "Incentivized to predict costs accurately, not fairly",
            'organizational_culture': "Tech solutionism, not critical reflection"
        }

        return "Bias went undetected for years"

    def incentive_misalignment(self):
        """Why choosing costs as target variable"""

        # OPTUM's business model:
        # - Paid by health insurers
        # - Goal: Reduce costs (hospitalizations, ED visits)
        # - Success metric: Cost reduction

        # This incentivizes:
        # - Target variable: Predict costs (align with business goal)
        # - Not: Predict health needs (doesn't align with business goal)

        # Problem: What's good for business (cost reduction)
        # ≠ What's good for patients (equitable access to care)

        business_incentives = {
            'primary_goal': 'Reduce healthcare costs',
            'secondary_goal': 'Improve outcomes',
            'equity_goal': 'Not prioritized'
        }

        # Result: Algorithm optimized for wrong objective
        return "Profit motive misaligned with equity"

    def lack_of_diverse_perspectives(self):
        """Homogeneous teams miss bias"""

        typical_ml_team_composition = {
            'data_scientists': 'Mostly White and Asian',
            'engineers': 'Mostly White and Asian men',
            'product_managers': 'Mostly White',
            'domain_experts': 'Healthcare economists, some clinicians'
        }

        missing_perspectives = {
            'Black_and_Latino_clinicians': 'Could have flagged health access disparities',
            'health_equity_researchers': 'Could have identified proxy variable problem',
            'ethicists': 'Could have raised fairness questions',
            'community_representatives': 'Could have voiced concerns about discrimination'
        }

        # Homogeneous teams have blind spots
        # Especially around how systems affect marginalized communities

        return "Lack of diversity → Lack of critical perspectives"

3. The Illusion of Objectivity

“The algorithm doesn’t use race, so it can’t be racist” - WRONG

class FairnessWashingMyths:
    """
    Common misconceptions about algorithmic fairness
    """

    def myth_1_not_using_race_means_unbiased(self):
        """
        MYTH: If we don't include race as a feature, algorithm is fair
        REALITY: Race correlated with many other features
        """

        # OPTUM did not include 'race' as explicit feature
        # But many features highly correlated with race:

        race_proxies = {
            'zip_code': 'Residential segregation means zip code predicts race',
            'type_of_insurance': 'Medicaid vs private correlates with race',
            'hospital_where_treated': 'Hospital segregation',
            'primary_language': 'Spanish, Creole, etc.',
            'diagnosis_codes': 'Some conditions more prevalent in certain racial groups',
            'historical_spending': 'Reflects past access barriers'
        }

        # Algorithm doesn't need 'race' explicitly
        # It learns racial patterns from correlated features

        # Example:
        # Zip code 02119 (Roxbury, Boston) → 65% Black
        # Zip code 02467 (Chestnut Hill, Boston) → 95% White
        # Algorithm learns different risk profiles by zip code
        # = Learning race without explicitly using race

        return "Fairness through unawareness does NOT work"

    def myth_2_algorithms_are_objective(self):
        """
        MYTH: Algorithms are objective, humans are biased
        REALITY: Algorithms encode human choices and societal biases
        """

        human_choices_in_algorithm = {
            'what_to_predict': 'CHOICE: Costs vs health needs',
            'what_data_to_use': 'CHOICE: Historical EHR data (contains bias)',
            'what_features_to_include': 'CHOICE: Symptom reports, biomarkers, spending?',
            'how_to_measure_success': 'CHOICE: Predictive accuracy vs fairness',
            'who_to_test_on': 'CHOICE: Diverse population or convenience sample',
            'what_threshold_to_use': 'CHOICE: How high a risk score triggers intervention?',
        }

        # Every choice made by humans
        # Every choice can encode bias
        # Algorithm amplifies and scales these choices

        return "Algorithms are not objective, they're laundered subjectivity"

    def myth_3_accuracy_implies_fairness(self):
        """
        MYTH: If algorithm is accurate, it must be fair
        REALITY: Can be highly accurate AND highly biased
        """

        # OPTUM's algorithm WAS accurate at predicting costs
        # But costs ≠ needs, especially across racial groups

        # Algorithm accurately learned:
        # "Black patients spend less money"
        # This is TRUE (accurate)
        # But UNFAIR (spending lower due to discrimination)

        # Accuracy on wrong objective = Accurate unfairness

        return "Accuracy ≠ Fairness"
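The zip-code proxy in myth 1 above can be demonstrated end to end. This is a toy sketch: the zip codes come from the example, but the spending figures and the "model" are invented for illustration. A score built only from zip-level average spending reproduces the racial gap without ever seeing a race variable:

```python
# Invented records: race is recorded for the audit, but is NEVER
# given to the "model" below.
history = [
    {"zip": "02119", "race": "Black", "spend": 8000},
    {"zip": "02119", "race": "Black", "spend": 9000},
    {"zip": "02467", "race": "White", "spend": 15000},
    {"zip": "02467", "race": "White", "spend": 16000},
]

# Toy "model": predicted risk score = mean historical spend in the
# patient's zip code (a stand-in for any cost-trained predictor).
spend_by_zip = {}
for row in history:
    spend_by_zip.setdefault(row["zip"], []).append(row["spend"])
zip_risk = {z: sum(v) / len(v) for z, v in spend_by_zip.items()}

# Two equally sick patients from different zips get different scores:
# residential segregation smuggles race back in through the cost proxy.
print(zip_risk["02119"], zip_risk["02467"])
```

The point of the sketch is that removing the race column changes nothing: any feature correlated with race carries the signal through.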

The Fallout

Publication and reaction:

- October 2019: Research published in Science (Obermeyer et al., 2019)
- Massive media coverage: "Algorithm discriminates against Black patients"
- UnitedHealth/OPTUM acknowledged the bias and committed to fixing it

Impact on patients:

- Unknown how many Black patients were denied access to care management over the years
- Impossible to retroactively identify all affected patients
- Disparities in chronic disease outcomes may have been worsened

Legal and regulatory:

- FTC opens inquiry into health AI fairness
- Multiple states investigate algorithmic bias in healthcare
- Increased regulatory attention to health algorithms
- Calls for auditing requirements

Research impact:

- Sparked a wave of research on algorithmic fairness in healthcare
- Led to development of bias detection and mitigation methods
- Changed how health AI is evaluated (fairness now a standard consideration)

Industry impact:

- Health AI vendors now (claim to) test for bias
- Fairness audits becoming standard practice
- Increased scrutiny of proxy variables

What Should Have Happened

Responsible development process:

Phase 1: Problem Formulation (Weeks 1-4)

1. Define objective clearly
   - What are we trying to achieve? (Identify patients with high health needs)
   - What is the right target variable? (NOT costs)
   - Are there proxy variable concerns?
2. Assemble diverse team
   - Data scientists + clinicians
   - Health equity researchers
   - Ethicists
   - Black and Latino community health advocates
   - Social determinants of health experts
3. Literature review
   - What is known about racial disparities in healthcare spending?
   - What is known about racial disparities in access to care?
   - Are costs a good proxy for health needs across racial groups? NO

Phase 2: Data and Modeling (Months 2-6)

1. Choose appropriate target variable
   - Health need indicators (chronic conditions, disease severity)
   - NOT healthcare costs
2. Exploratory data analysis
   - Examine racial disparities in data
   - Understand correlation between race and features
   - Document known biases in historical data
3. Model development with fairness constraints
   - Develop multiple models
   - Test fairness metrics alongside accuracy metrics
   - Use fairness-aware ML methods if needed

Phase 3: Bias Testing (Months 7-9)

1. Comprehensive fairness audit
   - Do patients with same health needs get same risk scores across racial groups?
   - Disparate impact analysis
   - Calibration by subgroup
   - Multiple fairness definitions tested
2. Clinical validation
   - Do clinicians agree risk scores reflect health needs?
   - Are Black patients at a given risk score as sick as White patients?
3. External validation
   - Test at healthcare systems not in development data
   - Diverse patient populations
   - Independent researchers evaluate

Phase 4: Pilot Implementation (Months 10-15)

1. Small-scale pilot
   - 3-5 healthcare systems
   - Monitor enrollment rates by race
   - Track whether patients enrolled actually have high health needs
   - Monitor outcomes (do enrolled patients benefit?)
2. Continuous bias monitoring
   - Dashboard showing enrollment rates by race
   - Alerts if disparities emerge
   - Regular audits

Phase 5: Transparent Deployment

1. Public documentation
   - How algorithm works
   - What fairness testing was done
   - Known limitations
   - Ongoing monitoring plan
2. Appeals process
   - Patients and clinicians can challenge risk scores
   - Manual review of borderline cases
   - Feedback loop for model improvement

Timeline: 15-18 months before full deployment (with extensive testing)

What actually happened: Algorithm deployed widely without fairness testing, bias discovered by external researchers years later

Prevention Checklist

Warning: Red Flags for Algorithmic Bias

Use this checklist for any AI system affecting people:

Problem Formulation ❌ OPTUM failed these

- [ ] Target variable directly measures intended outcome (not a proxy)
- [ ] Proxy variables examined for bias potential
- [ ] Diverse team involved in problem formulation
- [ ] Health equity expert consulted
- [ ] Historical bias in data acknowledged

Model Development ❌ OPTUM failed these

- [ ] Fairness metrics defined alongside accuracy metrics
- [ ] Multiple fairness definitions tested
- [ ] Subgroup analysis by race, ethnicity, gender, age
- [ ] Fairness-aware ML methods considered
- [ ] Interpretability (can identify sources of bias)

Validation and Testing ❌ OPTUM failed all of these

- [ ] Comprehensive fairness audit conducted
- [ ] External validation on diverse populations
- [ ] Independent researchers evaluate fairness
- [ ] Clinical validation (do scores match clinical judgment?)
- [ ] Published results in peer-reviewed journal

Deployment ❌ OPTUM failed these

- [ ] Continuous bias monitoring
- [ ] Dashboard showing outcomes by demographic group
- [ ] Appeals process for contested decisions
- [ ] Regular re-auditing (bias can emerge over time)
- [ ] Transparency about how algorithm works

Governance ❌ OPTUM failed these

- [ ] Diverse team (not just White/Asian men)
- [ ] Ethics review board approval
- [ ] Community stakeholder input
- [ ] Alignment between business incentives and equity goals
- [ ] Accountability (who is responsible if algorithm causes harm?)

Key Takeaways

  1. Proxy variables encode discrimination - Healthcare costs reflect systemic racism; using costs as proxy perpetuates bias

  2. “Not using race” doesn’t prevent bias - Race correlated with many features; algorithm learns racial patterns indirectly

  3. Accuracy ≠ Fairness - Algorithm can be highly accurate at wrong objective and still cause harm

  4. Diverse teams catch bias - Homogeneous teams have blind spots about how systems affect marginalized groups

  5. Business incentives can misalign with equity - Optimizing for cost reduction ≠ Optimizing for equitable care

  6. Fairness testing must be proactive - Waiting for external researchers to find bias years later is unacceptable

  7. Transparency enables accountability - Black-box algorithms escape scrutiny

  8. Algorithmic bias is not a technical problem alone - It’s a socio-technical problem requiring diverse expertise

References

Primary research:

- Obermeyer, Z., Powers, B., Vogeli, C., & Mullainathan, S. (2019). Dissecting racial bias in an algorithm used to manage the health of populations. Science, 366(6464), 447-453. DOI: 10.1126/science.aax2342

Analysis and commentary:

- Parikh, R. B., Teeple, S., & Navathe, A. S. (2019). Addressing Bias in Artificial Intelligence in Health Care. JAMA, 322(24), 2377-2378. DOI: 10.1001/jama.2019.18058
- Vyas, D. A., Eisenstein, L. G., & Jones, D. S. (2020). Hidden in Plain Sight - Reconsidering the Use of Race Correction in Clinical Algorithms. New England Journal of Medicine, 383(9), 874-882. DOI: 10.1056/NEJMms2004740
- Rajkomar, A., et al. (2018). Ensuring Fairness in Machine Learning to Advance Health Equity. Annals of Internal Medicine, 169(12), 866-872. DOI: 10.7326/M18-1990

Media coverage:

- Ledford, H. (2019). Millions of black people affected by racial bias in health-care algorithms. Nature.
- Hoffman, B. (2019). Racial bias in a medical algorithm favors white patients over sicker black patients. Washington Post.

Broader context on algorithmic fairness:

- O'Neil, C. (2016). Weapons of Math Destruction: How Big Data Increases Inequality and Threatens Democracy. Crown.
- Noble, S. U. (2018). Algorithms of Oppression: How Search Engines Reinforce Racism. NYU Press.

Case Study 7-10: Additional Failure Post-Mortems (Concise Format)

The following case studies are presented in concise format. Each failure illustrates a distinct pattern that complements the detailed case studies above. Full versions are available in the online appendix.


Case Study 7: Singapore TraceTogether - Privacy Promise Broken

The Promise: Privacy-preserving COVID-19 contact tracing app with explicit guarantees that data would ONLY be used for contact tracing.

What Happened:

- January 2021: Singapore government revealed TraceTogether data was accessible to police for criminal investigations
- Direct contradiction of prior privacy assurances
- Public outcry and trust collapse

Key Failure: Mission creep and broken privacy promises

- "Data will only be used for contact tracing" → became a crime-fighting tool
- ~4.2 million people (78% of population) had downloaded the app based on privacy guarantees
- Retroactive disclosure destroyed trust

Lesson: Privacy promises must be legally binding and technically enforced. Voluntary assurances are insufficient. Use-limitation must be coded into system architecture, not just policy documents.

Prevention:

- Technical controls preventing unauthorized data access
- Legislative limits on data use (not just policy)
- Independent oversight and auditing
- Transparent disclosure of all potential uses BEFORE deployment

References:

- Wong, J. (2021). Singapore reveals Covid-tracing data available to police. BBC News.
- Lim, A. (2021). Trust in Singapore government plunges after TraceTogether data scandal. Straits Times.


Case Study 8: Babylon GP at Hand - Chatbot Playing Doctor

The Promise: AI chatbot (Babylon Health) that could diagnose conditions and triage patients as well as or better than GPs.

Marketing claims:

- "AI matches or exceeds doctors in diagnostic accuracy"
- Claims of 93% accuracy based on internal testing

What Happened:

- External validation revealed serious safety concerns (Fraser et al., 2020)
- Chatbot failed to recognize serious conditions (sepsis, meningitis)
- Unsafe triage recommendations (told patients with serious symptoms to stay home)
- Regulators investigated advertising claims

Key Examples of Failures:

- Chest pain case: patient with cardiac symptoms told "no urgent action needed"
- Meningitis case: classic meningitis symptoms flagged as "non-urgent"
- Bias: system performed worse for non-English speakers and the elderly

Root Causes:

1. Validated on easy cases, not emergency presentations
2. No external clinical validation before deployment
3. Overstated marketing claims not supported by evidence
4. Commercial pressure to launch before proving safety

Lesson: Chatbots ≠ Medical diagnosis. Symptom checkers for triage require different validation than consumer apps. The safety bar must be much higher.

Prevention Checklist:

- [ ] External validation by independent clinicians
- [ ] Test on real emergency presentations (not textbook cases)
- [ ] Safety testing: does the system catch life-threatening conditions?
- [ ] Clear disclaimers about limitations
- [ ] Marketing claims supported by peer-reviewed evidence

References:

- Fraser, H., et al. (2020). Safety of patient-facing digital symptom checkers. The Lancet, 395(10233), 1199. DOI: 10.1016/S0140-6736(20)30819-8
- Gilbert, S., et al. (2020). GPT-3 for healthcare: Potential and pitfalls. BMJ


Case Study 9: COVID-19 Forecasting Models - Mass Overfitting

The Promise: Hundreds of ML models published claiming to predict COVID-19 outcomes (mortality, ICU need, disease progression).

What Happened:

- Systematic review found 232 COVID-19 prediction models published in 2020 (Wynants et al., 2020)
- Result: 98% at high risk of bias
- Almost none suitable for clinical use
- Many overfitted to early, unrepresentative data

Common Failures:

- Small sample sizes (some models trained on <100 patients)
- Data leakage (test data leaked into training)
- Geographic overfitting (trained on Wuhan, deployed in New York)
- Outcome mismeasurement (PCR as ground truth when false negative rate high)
- No external validation before publication
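Of these failures, data leakage is the easiest to commit by accident. A minimal sketch with invented numbers: computing a normalization statistic on the pooled data lets the held-out set influence training-time features, which is exactly the leak the review describes.

```python
# Invented toy data: the test site has a very different distribution
# (e.g., lab values from a different hospital).
train = [1.0, 2.0, 3.0, 4.0]
test = [100.0, 101.0]

def mean(values):
    return sum(values) / len(values)

leaky_center = mean(train + test)  # WRONG: statistic computed on all data
clean_center = mean(train)         # RIGHT: statistic computed on train only

# The leaky pipeline has already "seen" that test values are large,
# so its normalized features carry information from the held-out set,
# inflating apparent test performance.
print(leaky_center, clean_center)
```

The same trap appears whenever any preprocessing step (scaling, imputation, feature selection) is fit before the train/test split rather than inside it.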

Example: Model predicting COVID mortality from chest X-rays

- Training data: X-rays from COVID patients (supine, ICU) vs. healthy controls (upright, outpatient)
- Model learned: "supine position = COVID" (not actual disease features)
- Failed completely on external validation

Why So Many Failures:

- Urgency overrode rigor: "pandemic emergency" used to justify cutting corners
- Publication pressure: journals fast-tracked COVID papers without the usual scrutiny
- Lack of clinical involvement: many models built by CS/AI teams without clinician collaboration
- Data quality ignored: early COVID data was messy, incomplete, biased

Lesson: Urgency is not an excuse for poor methods. Bad AI is worse than no AI. Clinical decisions require validated tools, not proof-of-concept models.

What Should Happen in Pandemic Response:

1. Coordination: central registry of prediction models (avoid duplication)
2. Standards: minimum validation requirements before publication
3. External validation: independent test sets from multiple sites
4. Clinical partnership: every model needs clinician co-leads
5. Transparency: open data, open code, reproducibility

References:

- Wynants, L., et al. (2020). Prediction models for diagnosis and prognosis of covid-19: systematic review and critical appraisal. BMJ, 369. DOI: 10.1136/bmj.m1328
- Roberts, M., et al. (2021). Common pitfalls and recommendations for using machine learning to detect and prognosticate for COVID-19 using chest radiographs and CT scans. Nature Machine Intelligence, 3(3), 199-217. DOI: 10.1038/s42256-021-00307-0


Case Study 10: Apple Watch AFib Study - Selection Bias in Digital Epidemiology

The Promise: Large-scale digital epidemiology study using Apple Watch to detect atrial fibrillation (AFib) in general population.

The Study: - 419,297 participants enrolled (massive sample size!) - Apple Watch detects irregular pulse → Alert users → Get EKG confirmation - Published in NEJM 2019 (Perez et al., 2019)

The Problem: Severe Selection Bias

Who participated: - Apple Watch owners (not representative of general population) - Opted into research study (self-selected) - Younger, wealthier, healthier, more educated than general population - Already health-conscious (bought fitness tracking device)

Biases compounded: 1. Socioeconomic: Apple Watch costs $400+ (excludes low-income) 2. Age: Younger population (AFib more common in elderly, who are less likely to own smartwatches) 3. Health literacy: Self-selected participants more health-engaged 4. Technology access: Requires smartphone, internet, technical proficiency 5. Geographic: US-centric, limited racial/ethnic diversity

Why This Matters: - Can’t generalize to general population: Prevalence estimates biased - Missed high-risk populations: Elderly, low-income, minorities underrepresented - Widened health disparities: Benefits accrue to already-advantaged groups - Flawed public health inference: Can’t guide policy with unrepresentative sample

The AFib Detection Paradox: - Detected AFib in younger, lower-risk population (who own Apple Watches) - Missed AFib in older, higher-risk population (who don’t) - Result: Detect disease in people who need it least, miss it in people who need it most

Lesson: Convenience samples ≠ Population inference. Digital epidemiology requires explicit attention to representativeness. Technology-based recruitment inherently biases toward privileged groups.

How to Do Digital Epidemiology Responsibly:

  1. Acknowledge limitations clearly
    • “This is a study of Apple Watch owners, not general population”
    • Don’t overgeneralize findings
  2. Recruit representatively
    • Provide devices to underrepresented groups
    • Multiple recruitment channels (not just app store)
    • Actively recruit high-risk populations
  3. Report demographics transparently
    • Compare sample to target population
    • Quantify selection bias
    • Discuss implications for generalizability
  4. Don’t claim population-level inference from convenience samples
    • Be honest about who findings apply to
    • Acknowledge equity implications
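Step 3’s “quantify selection bias” has a standard tool: the standardized difference between sample and target-population proportions for each demographic trait. The proportions below are invented for illustration, not figures from the study:

```python
import math

def standardized_difference(p_sample, p_target):
    """Standardized difference for a binary trait (e.g. share aged 65+).
    |d| > 0.10 is a common rule-of-thumb flag for meaningful imbalance."""
    pooled = (p_sample * (1 - p_sample) + p_target * (1 - p_target)) / 2
    return (p_sample - p_target) / math.sqrt(pooled)

# Illustrative only: 6% of participants aged 65+ vs 17% of the target population.
print(round(standardized_difference(0.06, 0.17), 2))  # -0.35 — well past the 0.10 flag
```

Reporting this number per trait (age, race, income, geography) makes the “compare sample to target population” step concrete and auditable.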

Prevention Checklist: - [ ] Sample representativeness assessed - [ ] Selection bias quantified and reported - [ ] High-risk populations intentionally recruited - [ ] Generalizability limitations clearly stated - [ ] Health equity implications considered

References: - Perez, M. V., et al. (2019). Large-Scale Assessment of a Smartwatch to Identify Atrial Fibrillation. New England Journal of Medicine, 381(20), 1909-1917. DOI: 10.1056/NEJMoa1901183 - Goldsack, J. C., et al. (2020). Verification, analytical validation, and clinical validation (V3): the foundation of determining fit-for-purpose for Biometric Monitoring Technologies (BioMeTs). npj Digital Medicine, 3(1), 55. DOI: 10.1038/s41746-020-0260-4


Common Failure Patterns: Synthesis Across All 10 Cases

After analyzing 10 major failures, clear patterns emerge. Here’s the taxonomy:

Failure Pattern 1: Training Data Problems ⚠️

Seen in: Watson, Epic, OPTUM, COVID models

Manifestations: - Synthetic data instead of real outcomes (Watson) - Biased historical data (OPTUM: costs reflect discrimination) - Small, unrepresentative samples (COVID models) - Data leakage (COVID: test data contamination)

Root cause: Garbage in, garbage out. Models learn what’s in the data, including biases and artifacts.

Prevention: - Real patient outcomes, not hypotheticals - Diverse, representative samples - Document known biases in data - Data quality checks before modeling
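The leakage failure above is usually a split-level bug: raw rows are shuffled while patients with repeat encounters straddle the train/test boundary. A minimal patient-level split sketch (field names are hypothetical):

```python
import random

def split_by_patient(records, test_frac=0.2, seed=0):
    """Train/test split guaranteeing no patient_id appears on both sides.
    Shuffling raw rows instead leaks repeat visits from the same patient
    across the boundary and inflates test scores."""
    patients = sorted({r["patient_id"] for r in records})
    rng = random.Random(seed)
    rng.shuffle(patients)
    cut = int(len(patients) * (1 - test_frac))
    held_out = set(patients[cut:])
    train = [r for r in records if r["patient_id"] not in held_out]
    test = [r for r in records if r["patient_id"] in held_out]
    return train, test

# Two visits per patient; each patient lands wholly in train or wholly in test.
records = [{"patient_id": i // 2, "visit": i % 2} for i in range(200)]
train, test = split_by_patient(records)
assert {r["patient_id"] for r in train}.isdisjoint(r["patient_id"] for r in test)
```

The same grouping logic applies to hospitals, scanners, or time periods when those are the leakage channels.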


Failure Pattern 2: Validation Failures 🔍

Seen in: Watson, Epic, Google India, Babylon, COVID models

Manifestations: - Circular validation (Watson: same experts train and test) - Internal-only validation (Epic: same hospitals) - Lab ≠ Field (Google India: clinic performance diverged) - No external validation (COVID models)

Root cause: Models overfit to development data. Performance degrades in new settings.

Prevention: - External validation at independent sites - Test in deployment conditions (not just lab) - Diverse patient populations - Independent researchers validate
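The per-site discipline urged here can be mechanized: compute the discrimination metric separately for each validation site rather than pooling everything into one number. A minimal sketch with a plain-Python AUC (site names are hypothetical):

```python
def auc(scores, labels):
    """Probability a random positive outranks a random negative (ties count 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def auc_by_site(scores, labels, sites):
    """External validation in miniature: AUC reported separately per site,
    instead of one pooled number that hides degradation at new sites."""
    out = {}
    for site in sorted(set(sites)):
        s = [x for x, g in zip(scores, sites) if g == site]
        y = [x for x, g in zip(labels, sites) if g == site]
        out[site] = auc(s, y)
    return out
```

A model that posts 0.88 pooled but 0.65 at the one site not in its training data is telling you exactly where it will fail next.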


Failure Pattern 3: Wrong Objective Function 🎯

Seen in: OPTUM, Epic (indirectly)

Manifestations: - Proxy variables (OPTUM: predict costs to infer needs) - Business goals ≠ Patient goals (optimize for cost reduction vs. equitable care) - Accuracy on wrong metric (technically correct, ethically wrong)

Root cause: What you optimize for is what you get. If you optimize for the wrong thing, you get harmful outcomes.

Prevention: - Align objective function with intended outcome - Question proxy variables (do they introduce bias?) - Multidisciplinary team defines goals - Fairness metrics alongside accuracy metrics
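One cheap fairness metric to run alongside accuracy is the gap in selection (referral) rates across groups at the operating threshold — a score can be “accurate” on its proxy target and still refer one group far less often, which was the OPTUM failure mode. A sketch (group labels hypothetical):

```python
def selection_rate_gap(scores, groups, threshold):
    """Difference between the highest and lowest referral rates across
    groups at a decision threshold, plus the per-group rates. A large gap
    on a proxy-trained score is a red flag for objective-function bias."""
    rates = {}
    for g in set(groups):
        flags = [s >= threshold for s, gg in zip(scores, groups) if gg == g]
        rates[g] = sum(flags) / len(flags)
    return max(rates.values()) - min(rates.values()), rates

gap, rates = selection_rate_gap([0.9, 0.8, 0.2, 0.1], ["A", "A", "B", "B"], 0.5)
print(gap, rates)  # group B never gets referred at this threshold
```

A gap is not proof of unfairness on its own (base rates may differ), but it forces the question the OPTUM team never asked.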


Failure Pattern 4: Privacy and Ethics Violations 🔒

Seen in: DeepMind, UK NHS app, TraceTogether

Manifestations: - Unlawful data sharing (DeepMind) - Broken privacy promises (TraceTogether) - Privacy-hostile architecture (UK centralized model) - Insufficient patient consent (DeepMind)

Root cause: “Move fast and break things” applied to sensitive health data. Innovation prioritized over privacy.

Prevention: - Privacy by design (not afterthought) - Legal review BEFORE data sharing - Technical controls enforcing privacy - Independent oversight


Failure Pattern 5: Deployment Without Measuring Outcomes 📊

Seen in: Watson, Epic, Google India, UK app

Manifestations: - AUC reported, patient outcomes not (Epic) - Deployment without outcome study (Watson) - No impact evaluation (UK app)

Root cause: Focus on technical performance, not clinical impact. Model accuracy ≠ Patient benefit.

Prevention: - Prospective outcome studies BEFORE wide deployment - Measure what matters: mortality, quality of life, health equity - Pilot with intensive monitoring - Only scale if outcomes demonstrate benefit


Failure Pattern 6: Lack of Diverse Expertise 👥

Seen in: All cases to varying degrees

Manifestations: - Homogeneous teams miss bias (OPTUM) - No clinical input (COVID models) - No social science input (UK app: underestimated adoption barriers) - No ethics expertise (multiple)

Root cause: AI problems are socio-technical. Can’t solve with technical expertise alone.

Prevention: - Multidisciplinary teams (clinicians, ethicists, social scientists, patients) - Diverse racial/ethnic representation - Domain experts co-lead (not just consultants) - Community stakeholder input


Failure Pattern 7: Commercial Pressure Over Clinical Rigor 💰

Seen in: Watson, Babylon, UK app (political pressure)

Manifestations: - Rushed timelines (Watson: 2 years lab to global deployment) - Overpromised marketing (Babylon: “matches doctors”) - Skipped validation (multiple) - Sunk cost fallacy (UK app: £12M wasted)

Root cause: Financial/political incentives misaligned with patient safety.

Prevention: - Independent safety oversight - Regulatory requirements for validation - Transparency about limitations - Willingness to abandon failed projects


Failure Pattern 8: Lab-to-Field Translation Gap 🏥

Seen in: Google India, UK app, COVID models

Manifestations: - Idealized lab conditions ≠ messy reality (Google India: poor cameras, unreliable internet) - Workflow disruption (Google India: 5 min/patient collapsed clinics) - User skill gap (nurses vs. photographers) - Infrastructure assumptions (connectivity, power, support)

Root cause: Deployment environment fundamentally different from development environment.

Prevention: - Ethnographic study of deployment settings - Design for worst-case conditions - End-user involvement from day one - Pilot before scale


Failure Pattern 9: Selection Bias in Samples 📉

Seen in: Apple AFib, implicitly in others

Manifestations: - Convenience samples (Apple Watch owners) - Self-selection bias (research volunteers) - Socioeconomic exclusion (expensive devices) - Technology access barriers

Root cause: Who has access to technology ≠ Who needs healthcare most.

Prevention: - Actively recruit underrepresented groups - Provide technology to ensure access - Report sample representativeness - Acknowledge generalizability limits


Failure Pattern 10: Alert Fatigue and Human Factors 🚨

Seen in: Epic sepsis model

Manifestations: - High false alarm rate (88% in Epic case) - Clinicians ignore alerts (alert fatigue) - Workflow disruption - Degraded human performance

Root cause: AI systems designed in isolation from human workflow.

Prevention: - Human factors engineering from start - Acceptable false alarm rate defined with clinicians - Measure clinician response and satisfaction - Iterative refinement based on workflow integration
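The prevention list is easier to act on with the underlying arithmetic in view: expected alert volume and precision follow directly from sensitivity, specificity, and prevalence. The operating point below is illustrative, chosen to be consistent with the 88% false-alarm figure cited above:

```python
def alert_burden(sensitivity, specificity, prevalence, patients_per_day):
    """Expected alert volume and precision for a screening alert."""
    tp = sensitivity * prevalence              # true alerts per patient screened
    fp = (1 - specificity) * (1 - prevalence)  # false alerts per patient screened
    ppv = tp / (tp + fp)
    return {"alerts_per_day": (tp + fp) * patients_per_day,
            "ppv": ppv,
            "false_alarm_rate": 1 - ppv}

# Illustrative: 33% sensitivity, 82% specificity, 7% sepsis prevalence,
# 1,000 patients screened per day.
b = alert_burden(0.33, 0.82, 0.07, 1000)
print(round(b["false_alarm_rate"], 2))  # 0.88 — roughly 9 of 10 alerts are false
```

Running this calculation with clinicians before deployment turns “acceptable false alarm rate” from a vague aspiration into a negotiated number.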


Unified Prevention Framework: The “Public Health AI Safety Checklist” 📋

Based on all 10 failures, here’s a comprehensive pre-deployment checklist:

Important: The Public Health AI Safety Checklist

Use this BEFORE deploying any AI system in healthcare or public health:

Phase 1: Problem Formulation & Team Assembly

Problem Definition - [ ] Objective clearly defined (what are we trying to achieve?) - [ ] Target variable directly measures objective (not proxy) - [ ] Problem suitable for AI (vs. non-AI alternatives) - [ ] Success criteria defined (including patient outcomes)

Team Composition - [ ] Multidisciplinary team (technical + clinical + social science + ethics) - [ ] Racial/ethnic diversity in team - [ ] Domain experts co-lead (not just consultants) - [ ] Patient/community representatives involved

Phase 2: Data & Model Development

Data Quality - [ ] Real patient outcomes (not synthetic, not hypothetical) - [ ] Representative sample (diverse by age, race, sex, geography, socioeconomic status) - [ ] Historical biases documented - [ ] Data quality assessment completed - [ ] Data provenance and lineage documented

Ethical Data Use - [ ] Appropriate consent obtained - [ ] Privacy Impact Assessment completed - [ ] Data minimization applied - [ ] Ethics review board approval

Model Development - [ ] Multiple fairness metrics defined (alongside accuracy) - [ ] Subgroup analysis (performance by race, age, sex, etc.) - [ ] Interpretability/explainability built in - [ ] Known limitations documented

Phase 3: Validation & Testing

Technical Validation - [ ] External validation at independent sites - [ ] Tested in deployment conditions (not just lab) - [ ] Diverse test populations - [ ] Independent researchers validate - [ ] Performance reported transparently (including failures)

Fairness Audit - [ ] Disparate impact analysis - [ ] Calibration by subgroup - [ ] Multiple fairness definitions tested - [ ] Equity impact assessment
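The “calibration by subgroup” item can be sketched directly: bin predicted risks, then compare observed event rates per bin across groups. The function and bin edges here are illustrative, not a library API:

```python
def calibration_by_group(probs, outcomes, groups, bins=(0.0, 0.25, 0.5, 0.75, 1.0)):
    """Observed event rate per predicted-risk bin, split by subgroup.
    If one group's observed rate sits consistently above another's in the
    same bin, equal scores do not mean equal risk across groups."""
    table = {}
    for g in sorted(set(groups)):
        rows = [(p, y) for p, y, gg in zip(probs, outcomes, groups) if gg == g]
        table[g] = {}
        for lo, hi in zip(bins, bins[1:]):
            in_bin = [y for p, y in rows if lo <= p < hi or (hi == bins[-1] and p == hi)]
            table[g][(lo, hi)] = sum(in_bin) / len(in_bin) if in_bin else None
    return table
```

This is exactly the analysis that exposed the cost-proxy bias in the OPTUM algorithm: patients with equal scores had unequal observed illness by race.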

Safety Testing - [ ] Failure mode analysis - [ ] False positive rate acceptable? - [ ] Alert burden quantified - [ ] Worst-case scenarios tested

Phase 4: Deployment Preparation

Workflow Integration - [ ] Ethnographic study of deployment setting - [ ] User research with actual end-users - [ ] Workflow impact assessed - [ ] Training program for users - [ ] Technical support available

Infrastructure - [ ] Deployment environment meets requirements (internet, power, equipment) - [ ] Works under worst-case conditions - [ ] Offline mode if needed - [ ] Integration with existing systems

Governance - [ ] Clear accountability (who is responsible for failures?) - [ ] Monitoring plan (continuous performance tracking) - [ ] Incident response protocol - [ ] Appeals process for contested decisions - [ ] Plan for model updates and maintenance

Phase 5: Pilot Implementation

Small-Scale Pilot - [ ] 3-5 sites, intensive monitoring - [ ] Technical performance tracked - [ ] Clinical outcomes measured - [ ] User satisfaction assessed - [ ] Equity impacts monitored

Outcome Evaluation - [ ] Primary outcome defined (patient benefit, not just AUC) - [ ] Comparison to baseline or control - [ ] Subgroup outcomes reported - [ ] Cost-effectiveness analyzed - [ ] Qualitative feedback collected

Go/No-Go Decision - [ ] Pre-defined success criteria met? - [ ] Pilot demonstrated benefit (not just technical feasibility)? - [ ] Equity not worsened? - [ ] If No: Iterate, redesign, or abandon

Phase 6: Scaled Deployment (Only if Pilot Succeeds)

Transparency - [ ] Public documentation of how system works - [ ] Validation results published (peer-reviewed) - [ ] Limitations clearly communicated - [ ] Conflicts of interest disclosed

Ongoing Monitoring - [ ] Dashboard: performance by subgroup - [ ] Drift detection (data distribution changes) - [ ] Adverse event reporting - [ ] Regular re-auditing (quarterly or annually) - [ ] Feedback loop for continuous improvement
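“Drift detection” on the monitoring dashboard is often implemented as a Population Stability Index over model scores; a self-contained sketch (the 0.1/0.25 thresholds are a common rule of thumb, not a standard):

```python
import math

def psi(baseline, current, bins=10):
    """Population Stability Index between baseline and current score samples.
    Rule of thumb: < 0.1 stable, 0.1-0.25 worth watching, > 0.25 likely drift."""
    lo, hi = min(baseline), max(baseline)
    edges = [lo + (hi - lo) * i / bins for i in range(1, bins)]
    def frac(xs):
        counts = [0] * bins
        for x in xs:
            counts[sum(x > e for e in edges)] += 1
        return [max(c / len(xs), 1e-4) for c in counts]  # floor avoids log(0)
    b, c = frac(baseline), frac(current)
    return sum((ci - bi) * math.log(ci / bi) for bi, ci in zip(b, c))

baseline = [i / 100 for i in range(100)]
print(psi(baseline, baseline))                            # 0.0 — identical distributions
print(psi(baseline, [x + 0.5 for x in baseline]) > 0.25)  # True — clear shift
```

Computing PSI per subgroup, not just overall, catches the case where drift hits one population first.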

Exit Strategy - [ ] Conditions under which system would be turned off - [ ] Sunset plan if not providing benefit - [ ] No sunk cost fallacy


Case Study Comparison Matrix

| Failure Case    | Primary Pattern  | Secondary Pattern       | Cost ($)         | Patient Harm      | Trust Damage |
|-----------------|------------------|-------------------------|------------------|-------------------|--------------|
| Watson          | Training Data    | Validation              | $62M+            | Unknown           | High         |
| DeepMind        | Privacy          | Governance              | £0 (trust)       | None direct       | Very High    |
| Google India    | Lab-Field Gap    | User Design             | $M               | Indirect          | Medium       |
| Epic Sepsis     | Validation       | Alert Fatigue           | Opportunity cost | Possible          | High         |
| UK NHS App      | Technical Hubris | Social Adoption         | £35M             | None direct       | High         |
| OPTUM Bias      | Wrong Objective  | Lack of Fairness Audit  | Systemic         | Millions affected | High         |
| TraceTogether   | Privacy          | Mission Creep           | Trust collapse   | None direct       | Very High    |
| Babylon Chatbot | Overpromising    | Safety                  | Unknown          | Possible          | Medium       |
| COVID Models    | Urgency Rush     | Validation              | Opportunity cost | Possible          | Medium       |
| Apple AFib      | Selection Bias   | Equity                  | None             | None              | Low          |

Total documented costs: $100M+ direct, incalculable opportunity and trust costs


Interactive Self-Assessment Quiz

Test your ability to spot red flags before deployment:

Scenario 1. Context: Your hospital is considering purchasing a commercial AI sepsis prediction tool. The vendor provides the following information:

  • Trained on 100,000 patients from 5 major academic medical centers
  • AUC: 0.88 in internal validation
  • “Deployed in 150+ hospitals nationwide”
  • Price: $250,000/year

Question: What additional information would you demand before adoption?

Answer:

Critical questions to ask:

  1. External validation:
    • Has model been validated at independent hospitals?
    • What was performance at hospitals NOT in training data?
    • Peer-reviewed publication of validation results?
  2. Your hospital specifics:
    • Patient population similar to training data?
    • EHR documentation practices similar?
    • Sepsis prevalence in your ICU?
  3. Operational realities:
    • Alert rate (alerts per day)?
    • False positive rate (what % are false alarms)?
    • Alert response protocol (what do nurses/physicians do with alerts)?
  4. Outcomes:
    • Has deployment improved patient outcomes anywhere?
    • Time to antibiotics reduced?
    • Mortality reduced?
    • Or just technical metrics (AUC)?
  5. Fairness:
    • Performance by race, ethnicity, age, sex?
    • Does model perform equally well for your diverse patient population?
  6. Pilot plan:
    • Can you pilot in 1-2 units before hospital-wide?
    • What metrics would trigger Go/No-Go decision?

Red flags in this scenario: - “Deployed in 150+ hospitals” ≠ Evidence of effectiveness - No mention of patient outcomes - No mention of fairness audits - No mention of alert burden

Epic sepsis model taught us: Vendor claims need independent verification!

Scenario 2. Context: It’s April 2020, early in the pandemic. Your research team develops an ML model predicting COVID-19 severity from chest X-rays.

Training data: 500 chest X-rays - 250 COVID-positive (from ICU patients, all supine position) - 250 COVID-negative controls (healthy volunteers, outpatient, upright position)

Performance: 95% sensitivity, 93% specificity on held-out test set

Question: Your team wants to publish immediately in a fast-track COVID journal. Should you? What’s wrong with this model?

Answer:

DO NOT PUBLISH. Major problems:

  1. Sampling bias:
    • COVID patients: ICU, supine, severe disease
    • Controls: Healthy, outpatient, upright
    • Model likely learning position, not disease
  2. Data leakage risk:
    • Are test set patients from same hospital/scanner?
    • Same time period?
    • Model may be learning institution-specific artifacts
  3. Small sample size:
    • 500 patients is tiny for deep learning
    • High risk of overfitting
    • Not enough diversity
  4. Lack of clinical validation:
    • No clinician co-authors?
    • Does radiologist agree with model?
    • Clinical utility unclear
  5. Absence of external validation:
    • Need to test on completely different hospital
    • Different scanners, different populations
    • Before publishing, let alone deploying

What you SHOULD do:

  1. Recruit appropriate controls:
    • COVID-negative patients from same ICU
    • Same positioning, same severity
    • Remove confounders
  2. Expand dataset:
    • Multiple hospitals
    • Multiple scanners
    • Geographic diversity
    • 1000s of images, not 500
  3. External validation:
    • Test at 2-3 completely independent hospitals
    • Report performance honestly (likely much lower than 95%)
  4. Clinical collaboration:
    • Partner with radiologists and ICU physicians
    • Define clinical use case clearly
    • Compare to existing tools
  5. Honest about limitations:
    • “Proof of concept, not ready for clinical use”
    • “Requires extensive further validation”

COVID forecasting models taught us: Urgency is not an excuse for poor methods. Bad AI is worse than no AI.
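A quick arithmetic check makes the control-group problem worse still: sensitivity and specificity may transfer, but positive predictive value collapses once the balanced 50/50 test mix gives way to realistic prevalence. The 5% screening prevalence below is an assumption for illustration:

```python
def ppv(sensitivity, specificity, prevalence):
    """Positive predictive value via Bayes' rule."""
    true_pos = sensitivity * prevalence
    false_pos = (1 - specificity) * (1 - prevalence)
    return true_pos / (true_pos + false_pos)

# The same 95%/93% model, on a balanced test set vs a 5%-prevalence population:
print(round(ppv(0.95, 0.93, 0.50), 2))  # 0.93 — looks great in the paper
print(round(ppv(0.95, 0.93, 0.05), 2))  # 0.42 — most positives are false in the field
```

Reporting PPV at the intended deployment prevalence, not just sensitivity/specificity on a curated test set, should be part of any honest limitations section.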

Scenario 3. Context: Your public health department wants to deploy a smartphone app for depression screening in the community.

App features: - Survey (PHQ-9 depression questionnaire) - AI analyzes voice patterns during short phone call - Predicts depression severity - Refers high-risk individuals to mental health services

Target: 100,000 residents in your county

Question: What are the health equity concerns? Who will this help? Who will it miss?

Answer:

Major equity concerns:

Who will be EXCLUDED:

  1. No smartphone:
    • ~15% of US adults don’t own smartphones
    • Higher rates among elderly, low-income, rural
  2. Limited English proficiency:
    • Voice AI likely trained on English speakers
    • May not work for Spanish, Creole, Vietnamese, etc.
  3. Disabilities:
    • Visual impairment: Screen reader compatibility?
    • Hearing impairment: Voice-based screening excludes
    • Motor disabilities: Touchscreen accessibility?
  4. Low tech literacy:
    • Elderly, low education may struggle with app
    • No support for troubleshooting
  5. No internet/data:
    • Requires data plan
    • Rural areas: poor connectivity
    • Low-income: may ration data
  6. Distrust of technology:
    • Privacy concerns
    • Cultural barriers
    • Historical medical exploitation (Tuskegee, etc.)

Who WILL use it: - Younger, wealthier, educated, tech-savvy - English speakers - Already engaged with healthcare system - = Lowest risk for untreated depression

Who NEEDS it most: - Older adults (highest suicide rates) - Low-income (highest untreated depression rates) - Minorities (lower mental health service access) - = Least likely to use app

Result: Widened health disparities - Technology benefits already-advantaged groups - Misses those with greatest need

What you SHOULD do:

  1. Multi-channel strategy:
    • App is ONE tool, not only tool
    • In-person screening at clinics
    • Community health worker outreach
    • Telephone hotline (not smartphone-based)
  2. Make app accessible:
    • Multiple languages
    • Screen reader compatible
    • Phone call option for those without smartphones
    • SMS/text option (lower tech burden)
  3. Provide technology access:
    • Free smartphones for those who need them
    • Free data plans
    • In-person support for onboarding
  4. Community engagement:
    • Partner with trusted community organizations
    • Address privacy and trust concerns upfront
    • Cultural adaptations
  5. Monitor equity:
    • Track who uses app by race, income, age, language
    • Measure: Are we reaching high-need populations?
    • If No: Redesign approach

Apple AFib taught us: Convenience samples miss those who need healthcare most.


Summary: How to Build Public Health AI That Doesn’t Fail

Core Principles:

  1. Safety First: “Do no harm” applies to AI. If not proven safe, don’t deploy.

  2. Validation Before Scale: Pilot → Evaluate outcomes → Only scale if beneficial.

  3. Fairness is Not Optional: Test for bias proactively. Health equity is a design requirement.

  4. Privacy by Design: Technical controls, not just policies. Privacy promises must be enforceable.

  5. Diverse Teams: Multidisciplinary collaboration from day one. No technical solutions to socio-technical problems.

  6. Transparency: Document limitations honestly. Publish validation results. Enable independent scrutiny.

  7. Measure Outcomes: AUC is not enough. Did patients benefit? Health equity improve? Costs justify benefits?

  8. Fail Fast: If not working, abandon or redesign. No sunk cost fallacy. Every project is an experiment.

  9. Learn from Failures: Yours and others’. This appendix exists so you don’t repeat these mistakes.

  10. Humility: AI is a tool, not a panacea. Many problems don’t need AI. Some AI makes things worse.


The most important lesson: Every failure in this appendix was preventable. The warning signs were there. The expertise to identify problems existed. What was missing was: - Willingness to slow down - Diverse perspectives at the table - Prioritization of safety and equity over speed and profit - Courage to abandon failing projects

You can do better.

This appendix gives you the knowledge to spot red flags, ask hard questions, demand rigorous validation, and build AI systems that help rather than harm.

The question is: Will you?


Notes on References

Each case study includes a dedicated References section with: - Primary sources: Peer-reviewed research papers, official reports, regulatory documents - Media coverage: Investigative journalism and analysis from reputable outlets - Commentary: Expert analysis and academic perspectives

All references include: - Full citation information - Direct links to sources (DOIs for academic papers, URLs for media) - Categorization by source type for easier navigation