13  AI Safety: Protecting Patients and Populations

TipLearning Objectives

This chapter addresses AI safety in healthcare and public health. You will learn to:

  • Distinguish AI safety from security, privacy, and quality assurance
  • Apply safety-critical systems frameworks (IEC 62304, ISO 14971) to AI
  • Identify failure modes using FMEA and hazard analysis
  • Implement safety validation beyond standard ML evaluation
  • Navigate regulatory safety requirements (FDA, EU MDR, MHRA)
  • Design safeguards: human-in-the-loop, fallback systems, monitoring
  • Develop incident response protocols for AI-caused harm
  • Understand why technical excellence ≠ safe deployment

Prerequisites: Evaluating AI Systems, Ethics, Bias, and Equity, Privacy, Security, and Governance.

The Big Picture: AI safety ≠ security ≠ privacy ≠ quality. Epic's sepsis model passed technical validation (90% accuracy) but failed the fundamental safety test—33% sensitivity means 67% of sepsis cases were missed, causing potential harm. Healthcare AI requires safety-critical systems frameworks borrowed from aviation, nuclear power, and medical devices. When AI fails in healthcare, people die—safety must be systematic, not an afterthought.

Safety vs. Related Concepts (Distinct But Complementary):

  • Safety: Preventing harm from system failures (false negatives miss disease, false positives cause unnecessary treatment)
  • Security: Protecting from adversarial attacks (poisoning training data, model theft, adversarial examples)
  • Privacy: Protecting individual data (re-identification, inference attacks, data breaches)
  • Quality: Meeting specifications (accuracy, precision, data quality)

All necessary; none sufficient alone.

Why Healthcare AI Has a Safety Problem:

  1. Epic Sepsis: 90% retrospective accuracy → 33% prospective sensitivity. Model technically “worked” but clinically dangerous
  2. Distribution Shift: Models degrade silently when data patterns change (AUROC 0.77 → 0.63 over 3 years)
  3. Alert Fatigue: 95% false positive rate means warnings ignored, defeating purpose
  4. Concept Drift: Clinical practice evolves, invalidating training assumptions
  5. Black Box Opacity: Can’t inspect neural network decision logic to verify safety

Safety-Critical Systems Frameworks for AI:

IEC 62304 (Medical Device Software Lifecycle):

  • Class A: No injury or damage to health possible
  • Class B: Non-serious injury possible
  • Class C: Death or serious injury possible

Most clinical AI is Class B or C, requiring rigorous safety documentation, testing, change control.

ISO 14971 (Risk Management):

  1. Risk Analysis: Identify hazards systematically (FMEA - Failure Modes and Effects Analysis)
  2. Risk Evaluation: Assess severity × probability
  3. Risk Control: Mitigation strategies (reduce likelihood, reduce severity, add safeguards)
  4. Post-Market Surveillance: Continuous monitoring for emerging risks

FDA GMLP (Good Machine Learning Practice) - 10 Principles:

  1. Multi-stakeholder involvement from conception
  2. Clear intended use and target population
  3. Training/testing datasets representative and curated
  4. Independently developed training and test sets
  5. Best practices for model design and selection
  6. Focus on human-AI team performance
  7. Testing demonstrates clinically meaningful performance
  8. Transparency and interpretability
  9. Users provided with clear, essential information
  10. Deployed models monitored for performance decay

Predetermined Change Control Plans (PCCP): FDA innovation allowing model updates within pre-specified boundaries without new clearance (e.g., retrain on new data using same architecture).

Failure Modes and Effects Analysis (FMEA):

Systematic hazard identification:

For a Sepsis Prediction Model:

  1. False Negative (Missed Sepsis): Patient deteriorates untreated. Severity: Critical. Mitigation: High-sensitivity threshold, human oversight
  2. False Positive (False Alarm): Alert fatigue; clinicians ignore. Severity: Moderate. Mitigation: High specificity, contextual alerts
  3. Data Pipeline Failure: Missing labs → wrong prediction. Severity: High. Mitigation: Input validation, fallback to clinical judgment
  4. Distribution Shift: New patient population; model degrades. Severity: High. Mitigation: Continuous monitoring, automated alerts
  5. Integration Failure: Predictions delayed, arriving after the clinical decision. Severity: Moderate. Mitigation: Real-time processing, latency SLAs

Each failure mode needs: Severity rating, probability estimate, detectability assessment, mitigation strategy.

Safety Safeguards (Defense in Depth):

  1. Human-in-the-Loop: AI recommends, human decides. High-stakes decisions require human confirmation
  2. Circuit Breakers: Automatic fallback when model confidence low or input unusual (out-of-distribution detection)
  3. Monitoring Dashboards: Real-time performance tracking (sensitivity, specificity, alert rate, override rate)
  4. Graceful Degradation: System continues functioning safely when AI component fails
  5. Audit Logging: Record all predictions, decisions, overrides for post-hoc analysis

Adversarial Robustness (Security Overlaps Safety):

Medical imaging AI is vulnerable to adversarial attacks:

  • Pixel-level perturbations (invisible to humans) change the diagnosis
  • Natural perturbations (image brightness, contrast) can flip predictions
  • Out-of-distribution inputs: Model is confident on images unlike its training data

Testing: FGSM, PGD attacks assess vulnerability. EU AI Act requires robustness testing for high-risk systems.

Post-Market Surveillance (Continuous Monitoring):

Deployment isn’t the end—models degrade over time:

Monitor for:

  • Data Drift: Input distributions change (demographics, disease prevalence, diagnostic protocols)
  • Concept Drift: Relationship between features and outcome changes
  • Performance Degradation: AUC, sensitivity, specificity decline
  • Fairness Drift: Disparities emerge in subgroups
  • Alert Metrics: False alarm rate, override rate, time-to-action

Automated Alerts: Trigger investigations when performance drops below thresholds.

Retraining Decisions: When to retrain? How to validate? FDA PCCP pre-specifies retraining protocols.

Incident Response Protocols:

When AI causes harm:

  1. Immediate: Disable/rollback problematic model, mitigate patient harm
  2. Investigation: Root cause analysis—was it data, model, integration, workflow?
  3. Reporting: FDA MedWatch (serious adverse events), institutional review boards
  4. Corrective Action: Fix root cause, prevent recurrence
  5. Documentation: Transparent reporting builds trust, enables learning
  6. Communication: Notify stakeholders, affected patients (if applicable), regulatory bodies

Blame-free culture: Encourage reporting, not punishment—hidden failures more dangerous than disclosed ones.

Why Technical Excellence ≠ Safe Deployment:

Epic Sepsis Example:

  • ✅ 90% accuracy retrospectively
  • ✅ Published research
  • ✅ Deployed at 100+ hospitals
  • ❌ 33% sensitivity prospectively (67% of cases missed)
  • ❌ 12% PPV (88% of alerts were false alarms)
  • ❌ Alert fatigue; clinicians overriding
  • ❌ No external validation before deployment

Lesson: Safety requires prospective validation, clinical workflow integration, continuous monitoring—not just retrospective metrics.

The Takeaway for Public Health Practitioners:

Healthcare AI has a safety problem—Epic sepsis proves technical validation ≠ safe deployment. Safety requires systematic frameworks (IEC 62304, ISO 14971, FDA GMLP), not just accuracy metrics. FMEA identifies failure modes before deployment. Defense in depth: human-in-the-loop, circuit breakers, monitoring dashboards, audit logging. Models degrade silently—post-market surveillance essential. Adversarial robustness matters (EU AI Act requires testing). Incident response protocols needed for when AI harms patients. Blame-free culture encourages reporting. Most importantly: 90% accuracy can still cause catastrophic harm if 10% errors concentrate in high-stakes scenarios. Safety isn’t a feature—it’s a systematic practice requiring continuous vigilance, honest reporting, and humility about AI limitations. When in doubt, fail safe (false alarm better than missed disease in screening; opposite for confirmatory tests).


13.1 Introduction: When AI Fails, People Die

13.1.1 The Epic Sepsis Story: A Safety Failure

In July 2021, researchers from the University of Michigan published a sobering study in JAMA Internal Medicine that sent shockwaves through the healthcare AI community (Wong et al., 2021). They evaluated Epic’s Sepsis Model (ESM), one of the most widely deployed AI systems in American hospitals, used to predict which patients would develop sepsis—a life-threatening condition requiring immediate intervention.

The results were devastating:

  • Sensitivity: 33% - The model missed 67% of sepsis cases (2 in 3 patients who needed urgent care)
  • Discrimination: Hospitalization-level AUROC of 0.63 - far below the 0.76-0.83 reported by the vendor
  • Positive Predictive Value: 12% - Of all sepsis alerts, 88% were false alarms
  • Alert burden: Alerts generated for 18% of all hospitalized patients across 38,455 hospitalizations

Real-world consequences:

The model wasn’t just inaccurate—it was clinically dangerous. With 88% of alerts being false alarms, clinicians experienced severe alert fatigue, learning to ignore the very warnings meant to save lives. Meanwhile, two in three actual sepsis cases went undetected until patients became critically ill.

Epic had validated the model on historical data and achieved impressive-looking metrics. But those metrics didn’t translate to real-world safety. The system passed technical validation but failed the fundamental test: Does this make patients safer?

What went wrong?

This wasn’t a deployment failure or an MLOps problem. It was a safety failure at multiple levels:

  1. Validation gap: High retrospective accuracy ≠ prospective clinical utility
  2. Operational context ignored: Alert burden and clinician workflow not considered
  3. Safety-critical validation missing: No FMEA, no hazard analysis, no worst-case testing
  4. False sense of security: Technical metrics masked operational dangers
  5. Post-market surveillance delayed: Took years for independent evaluation

The Epic sepsis case crystallizes why healthcare AI needs rigorous safety frameworks, not just high accuracy scores.

ImportantThe Central Lesson

Technical excellence does not guarantee safety. A model can achieve 90% accuracy and still cause catastrophic harm if deployed without safety-critical validation.

This chapter provides the frameworks to prevent such failures.

13.1.2 Why Safety Is Different from Security, Privacy, and Quality

Many practitioners conflate these concepts. They are distinct but complementary:

Dimension What It Protects Primary Threats Example Failure
Safety Patients/populations from harm caused by system behavior Design flaws, operational failures, edge cases Epic sepsis: missed 67% of cases
Security Systems/data from malicious actors Hackers, unauthorized access, tampering Anthem breach: 78M records stolen
Privacy Sensitive data from unauthorized disclosure Re-identification, inference attacks Cambridge Analytica: 87M users exploited
Quality Meeting specified requirements Implementation bugs, spec violations Software crashes, incorrect calculations

Why the distinction matters:

  • Secure but unsafe: A perfectly secure system can still harm patients if it makes wrong predictions
  • High-quality but unsafe: Bug-free code with no specification errors can still fail in edge cases
  • Private but unsafe: Perfect data protection doesn’t prevent clinical harm from model errors

Safety is about operational behavior in the real world, not just technical correctness in controlled environments.

13.1.3 The Safety Crisis in Healthcare AI

The Epic case isn’t isolated. Healthcare AI has a safety problem:

FDA recalls and safety alerts (2019-2024):

  • Eko stethoscope AI (2023): Cardiac detection algorithm recalled due to incorrect predictions
  • Caption Health ultrasound AI (2022): Guidance software producing poor-quality images
  • Multiple radiology AI systems: Post-market surveillance revealing silent performance degradation

High-profile failures:

  • IBM Watson for Oncology: Unsafe treatment recommendations (e.g., chemotherapy for patient with severe bleeding)
  • Google’s diabetic retinopathy AI: Couldn’t handle common image quality issues in real clinics
  • UK NHS COVID risk algorithm: Systematic bias causing inequitable care

The deployment gap:

According to Sendak et al. (2020, Nature Medicine), fewer than 20% of healthcare AI models that show promise in research actually get deployed. Safety concerns are a major barrier.

Why traditional medical device regulation falls short:

Traditional FDA/EU frameworks assume:

  • Static devices - Performance doesn’t change after approval
  • Transparent logic - Decision rules can be inspected
  • Predictable behavior - Same input always yields same output
  • Controlled environment - Operates within specified conditions

AI systems violate all these assumptions:

  • Continuous learning - Models update with new data
  • Black box - Neural networks lack interpretability
  • Stochastic - Non-deterministic outputs
  • Distribution shift - Performance degrades when data changes

Example: Concept drift in sepsis prediction

Finlayson et al. (2021, Nature Medicine) showed systematic performance degradation:

  • Model trained on 2017 data: AUROC 0.77
  • Same model in 2020 (no retraining): AUROC 0.63
  • Degradation cause: Changes in clinical practice, patient demographics, documentation patterns

This degradation happened silently—no alarms, no warnings, just gradually increasing patient risk.

13.1.4 What This Chapter Covers

We address AI safety comprehensively:

  1. Safety frameworks and standards - IEC 62304, ISO 14971, FDA guidance
  2. Hazard analysis - Identifying failure modes before they cause harm
  3. Safety validation - Testing beyond standard ML metrics
  4. Safety-critical design - Building safety into system architecture
  5. Post-market surveillance - Detecting and responding to safety incidents
  6. Organizational culture - Creating a culture of safety

What we don’t cover:

  • General software quality assurance (see IEC 62304)
  • Clinical trial safety (see ICH E6 guidance)
  • Occupational safety (see OSHA standards)

Our focus: AI-specific safety challenges in healthcare and public health applications.


13.2 Safety Frameworks and Standards

13.2.1 Medical Device Software Standards

13.2.1.1 IEC 62304: Medical Device Software Lifecycle

Overview: International standard for software development lifecycle of medical devices.

Safety classification (determines rigor of development process):

Class Definition Examples Requirements
A No injury or damage to health possible Administrative tools, non-diagnostic displays Basic documentation
B Non-serious injury possible Decision support for non-critical conditions Moderate rigor
C Death or serious injury possible Diagnostic tools for critical conditions, treatment recommendations Maximum rigor

Most healthcare AI falls into Class B or C.

Key requirements for Class C software:

  1. Risk management process (ISO 14971 integration)
  2. Software safety requirements derived from risk analysis
  3. Architecture designed for safety
  4. Unit, integration, and system testing with traceability
  5. Problem resolution process for post-market issues
  6. Version control and configuration management
NoteIEC 62304 and AI

Traditional IEC 62304 assumes deterministic software with reviewable code. AI systems challenge this:

  • Black box models: Can’t review decision logic
  • Data-dependent behavior: Changes with training data
  • Probabilistic outputs: Non-deterministic

FDA and EU regulators are updating guidance to address AI-specific concerns, but fundamental tension remains: safety frameworks designed for transparency applied to inherently opaque systems.

13.2.1.2 ISO 14971: Risk Management for Medical Devices

Overview: Systematic process for identifying and controlling risks throughout device lifecycle.

Five-step risk management process:

graph LR
    A[1. Risk Analysis] --> B[2. Risk Evaluation]
    B --> C[3. Risk Control]
    C --> D[4. Residual Risk Evaluation]
    D --> E[5. Risk Management Report]
    E --> F[Post-Market Surveillance]
    F --> A
Figure 13.1: ISO 14971 five-step risk management process for AI medical devices. The process is cyclical, with post-market surveillance feeding back into ongoing risk analysis to ensure continuous improvement.

Step 1: Risk Analysis

Identify hazards, hazardous situations, and harms:

  • Hazard: Potential source of harm (e.g., incorrect prediction)
  • Hazardous situation: Circumstance where hazard could lead to harm (e.g., clinician acts on false negative)
  • Harm: Injury or damage to health (e.g., patient dies from undetected sepsis)

Step 2: Risk Evaluation

Estimate risk using:

  • Severity: How bad is the harm? (1=negligible, 5=catastrophic)
  • Probability: How likely is it? (1=remote, 5=frequent)
  • Risk = Severity × Probability

Risk matrix example:

Probability ↓  Severity → Negligible (1) Minor (2) Moderate (3) Major (4) Catastrophic (5)
Frequent (5) 5 10 15 20 25
Probable (4) 4 8 12 16 20
Occasional (3) 3 6 9 12 15
Remote (2) 2 4 6 8 10
Improbable (1) 1 2 3 4 5

Risk acceptability:

  • 1-5: Acceptable (monitor)
  • 6-12: ALARP (As Low As Reasonably Practicable) - reduce if feasible
  • 15-25: Unacceptable - must reduce before deployment
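
As a minimal sketch, the severity × probability scoring and the acceptability bands above translate directly into code (function names are illustrative):

def risk_score(severity: int, probability: int) -> int:
    """ISO 14971-style risk estimate: severity (1-5) × probability (1-5)."""
    assert 1 <= severity <= 5 and 1 <= probability <= 5
    return severity * probability

def risk_acceptability(score: int) -> str:
    """Map a risk score to the acceptability bands used in this chapter."""
    if score <= 5:
        return "Acceptable (monitor)"
    elif score <= 12:
        return "ALARP - reduce if feasible"
    return "Unacceptable - must reduce before deployment"

# Example: clinician acts on a false negative (severity 5) occurring occasionally (3)
print(risk_acceptability(risk_score(severity=5, probability=3)))  # Unacceptable (15)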

Step 3: Risk Control

Hierarchy of control measures (in order of preference):

  1. Inherent safety by design - Eliminate hazard (best)
  2. Protective measures - Add safeguards (e.g., human-in-the-loop)
  3. Information for safety - Warnings, training (weakest)

Step 4: Residual Risk Evaluation

After controls, re-evaluate:

  • Is residual risk acceptable?
  • Do control measures introduce new risks?
  • Is the overall benefit-risk ratio favorable?

Step 5: Risk Management Report

Document:

  • All identified hazards
  • Risk estimates before/after controls
  • Rationale for acceptability decisions
  • Post-market surveillance plan

WarningCommon ISO 14971 Mistakes in AI Projects
  1. Performing risk analysis only once - Must update as model changes
  2. Using only accuracy metrics - Severity and probability require clinical context
  3. Treating all errors equally - False negative in cancer screening ≠ false positive
  4. Insufficient hazard identification - Missing operational and systemic hazards
  5. No post-market surveillance plan - Required by standard but often omitted

13.2.2 Regulatory Safety Guidance

13.2.2.1 FDA: AI/ML-Based Software as a Medical Device (SaMD)

FDA’s AI/ML Action Plan (2021) addresses unique AI challenges:

1. Good Machine Learning Practice (GMLP)

Ten guiding principles:

  1. Clinical study participants reflect intended population
  2. Independent test sets
  3. Reference datasets with robust features
  4. Models designed for robustness and interpretability
  5. Training and test procedures documented
  6. Human-AI team performance demonstrated
  7. Deployed model matches validated version
  8. Users provided with clear information
  9. Deployed models monitored for performance
  10. Risk mitigations for known failure modes

2. Predetermined Change Control Plans (PCCP)

Allows approved AI/ML models to evolve without new 510(k) for each change, if:

  • SaMD Pre-Specifications (SPS): Pre-defined types of modifications
  • Algorithm Change Protocol (ACP): Methods to implement and validate changes
  • Post-market performance monitoring

Example SPS for diabetic retinopathy screening:

  • Retrain with new data quarterly
  • Add image preprocessing techniques
  • Update decision threshold based on population
  • NOT allowed: Change intended use, add new conditions
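
A PCCP's Algorithm Change Protocol can be operationalized as an automated acceptance gate that every retrained model must pass before release. A sketch with purely illustrative thresholds (these are assumptions, not regulatory values):

# Hypothetical pre-specified acceptance criteria (illustrative values only)
ACP_CRITERIA = {
    'sensitivity_min': 0.85,
    'specificity_min': 0.80,
    'auroc_min': 0.85,
    'max_subgroup_auroc_gap': 0.05,
}

def passes_change_protocol(metrics, criteria=ACP_CRITERIA):
    """Return (approved, failures) for a retrained model under pre-specified bounds."""
    failures = []
    if metrics['sensitivity'] < criteria['sensitivity_min']:
        failures.append('sensitivity below pre-specified floor')
    if metrics['specificity'] < criteria['specificity_min']:
        failures.append('specificity below pre-specified floor')
    if metrics['auroc'] < criteria['auroc_min']:
        failures.append('AUROC below pre-specified floor')
    if metrics['subgroup_auroc_gap'] > criteria['max_subgroup_auroc_gap']:
        failures.append('subgroup performance gap exceeds pre-specified bound')
    return len(failures) == 0, failures

# The retrained model is released only if it clears every pre-specified criterion
approved, failures = passes_change_protocol({
    'sensitivity': 0.88, 'specificity': 0.82, 'auroc': 0.86, 'subgroup_auroc_gap': 0.07
})
print(approved, failures)  # False, ['subgroup performance gap exceeds pre-specified bound']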

3. Real-World Performance Monitoring

FDA expects:

  • Continuous monitoring of safety and effectiveness
  • Detection of performance degradation
  • Timely response to safety signals

FDA risk categorization for SaMD:

State of Healthcare Situation Treat/Diagnose Drive Clinical Management Inform Clinical Management
Critical IV III II
Serious III II I
Non-Serious II I I

Approval pathways:

  • Class I (Low risk): General controls, usually exempt from premarket review
  • Class II (Moderate risk): 510(k) clearance (substantial equivalence) - most AI devices
  • Class III (High risk): Premarket Approval (PMA) - most rigorous
TipFDA 510(k) Shortcut

90% of AI medical devices clear through 510(k) by claiming “substantial equivalence” to existing devices.

Problem: Often compared to predicate devices from 1990s-2000s that were never rigorously validated. Creates “grandfather paradox”—new AI compared to old non-AI tools that never proved clinical utility.

Recent FDA tightening: Requiring more clinical evidence, especially for high-risk applications.

13.2.2.2 EU: Medical Device Regulation (MDR) and AI

EU Medical Device Regulation (2017/745) took full effect May 2021, replacing older directives.

Key requirements for AI SaMD:

1. Clinical Evaluation

Much more rigorous than FDA 510(k):

  • Clinical investigation often required (not just analytical validation)
  • Post-market clinical follow-up (PMCF) mandatory
  • Equivalence harder to claim for novel AI

2. Risk Classification

Rule AI Application Class
Rule 11 Software informing diagnostic or treatment decisions (default) IIa (moderate-low risk)
Rule 11 Software informing decisions that could cause serious deterioration or require surgical intervention IIb (moderate-high risk)
Rule 11 Software informing decisions that could cause death or irreversible deterioration III (high risk)
Rule 11 Software monitoring vital physiological parameters where variations could result in immediate danger IIb

Most diagnostic AI = Class IIa or IIb (requiring Notified Body involvement).

3. Transparency and Explainability

MDR requires:

  • “Information to be supplied with the device” must describe how the algorithm works
  • Rationale for clinical decisions
  • Limitations and warnings

4. Post-Market Surveillance

Manufacturers must:

  • Have post-market surveillance system
  • Report serious incidents and near-misses
  • Update clinical evaluation with post-market data
  • Periodic Safety Update Reports (PSUR)

13.2.2.3 UK: MHRA Software and AI as Medical Device Change Programme

Post-Brexit, UK developing own framework:

The Software and AI as a Medical Device Change Programme emphasizes:

  1. Proportionate regulation based on risk and autonomy level
  2. Lifecycle regulation - not just pre-market
  3. Real-world evidence for approval
  4. Transparency - explainability where feasible
NoteInternational Divergence

FDA, EU, and UK increasingly diverging on AI regulation:

  • FDA: Risk-based, allows “substantial equivalence,” PCCP for adaptive algorithms
  • EU MDR: Stricter clinical evidence, harder to claim equivalence, mandatory PMCF
  • UK: Still defining post-Brexit framework, emphasizing innovation

Implication: Medical AI companies need three different validation strategies for US, EU, UK markets.

13.2.3 Safety-Critical Systems Theory from Other Industries

Healthcare can learn from industries with mature safety cultures:

13.2.3.1 Commercial Aviation Safety

Swiss Cheese Model of Accidents (James Reason)

[Hazard] → | Hole | → | Hole | → | Hole | → | Hole | → [Accident]
            Layer 1   Layer 2   Layer 3   Layer 4
            (Design)  (Training) (Procedure) (Defense)

Key insight: Accidents require multiple failures aligning. Safety requires defense in depth—multiple independent layers.

Application to AI:

  • Layer 1 (Design): Robust model architecture, diverse training data
  • Layer 2 (Validation): Comprehensive testing including edge cases
  • Layer 3 (Deployment): Human-in-the-loop, fallback systems
  • Layer 4 (Monitoring): Real-time performance tracking, circuit breakers

Just Culture: Aviation learned blame-free reporting increases safety. Healthcare AI needs similar culture—punishing failures drives them underground.

13.2.3.2 Nuclear Safety

Principles from nuclear industry:

  1. Defense in depth: Multiple independent protection layers
  2. Redundancy: Backup systems for critical functions
  3. Diversity: Different technologies for same function (prevents common-mode failures)
  4. Conservative decision-making: Err on side of caution when uncertain
  5. Continuous oversight: Independent safety monitoring

Application to AI:

  • Defense in depth: Don’t rely solely on model accuracy—add monitoring, human oversight, fallbacks
  • Redundancy: Multiple models or rule-based backup for critical decisions
  • Diversity: Ensemble methods, different model architectures
  • Conservative: Higher confidence thresholds for critical decisions
  • Oversight: Independent evaluation before deployment, continuous monitoring after
ImportantFail-Safe vs. Fail-Secure

Fail-safe: System defaults to safe state on failure (e.g., railway signals default to red)

Fail-secure: System maintains security on failure (e.g., door lock stays locked)

For medical AI: Usually want fail-safe (defer to human) not fail-secure (block access).

Example: Sepsis prediction model fails → Alert clinician to manual assessment (fail-safe) NOT continue with last prediction (fail-secure).


13.3 Hazard Analysis and Failure Modes

13.3.1 AI-Specific Failure Modes

Traditional software has well-understood failure modes (bugs, crashes, security vulnerabilities). AI introduces novel failure modes:

13.3.1.1 Category 1: Technical Failures

1. Data Drift and Distribution Shift

Definition: Training data distribution differs from deployment distribution.

Types:

  • Covariate shift: P(X) changes but P(Y|X) stays same
    • Example: COVID-19 changes patient demographics (age, comorbidities)
  • Prior probability shift: P(Y) changes but P(X|Y) stays same
    • Example: Disease prevalence changes seasonally
  • Concept drift: P(Y|X) changes
    • Example: Treatment protocols evolve, changing disease outcomes

Why it’s dangerous: Model performance silently degrades without obvious errors or crashes.

Real-world example: Finlayson et al., 2021 - Sepsis prediction AUROC dropped from 0.77 to 0.63 over 3 years due to:

  • Changes in clinical documentation practices
  • Updates to EHR systems
  • Shifts in patient population
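
Because this kind of degradation produces no error messages, drift must be detected statistically. A minimal sketch using a two-sample Kolmogorov-Smirnov test to compare a feature's deployment distribution against its training baseline (feature choice and alert threshold are illustrative):

import numpy as np
from scipy.stats import ks_2samp

def detect_feature_drift(train_values, live_values, alpha=0.01):
    """Flag drift when the live distribution differs significantly from training."""
    statistic, p_value = ks_2samp(train_values, live_values)
    return {'ks_statistic': statistic, 'p_value': p_value, 'drift': p_value < alpha}

# Example: heart rates in the deployed population shift upward over time
rng = np.random.default_rng(0)
train_hr = rng.normal(80, 12, size=5000)  # training-era heart rates
live_hr = rng.normal(88, 15, size=1000)   # deployment-era heart rates

result = detect_feature_drift(train_hr, live_hr)
if result['drift']:
    print(f"⚠️  Drift detected (KS={result['ks_statistic']:.3f}, p={result['p_value']:.2e})")

In practice, tests like this run per feature on a schedule, with alerts feeding the post-market surveillance process described elsewhere in this chapter.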

2. Adversarial Examples

Definition: Inputs deliberately crafted to fool model, often imperceptible to humans.

Medical imaging example:

# Simplified adversarial example (conceptual sketch; get_loss_gradient below is a placeholder, not a real API)
import numpy as np

def generate_adversarial_example(model, image, true_label, epsilon=0.01):
    """
    Generate adversarial perturbation using Fast Gradient Sign Method

    Parameters:
    - model: Trained classifier
    - image: Original medical image (e.g., X-ray)
    - true_label: Correct diagnosis
    - epsilon: Perturbation magnitude (small = imperceptible)

    Returns:
    - Adversarial image that model misclassifies
    """
    # Get model's gradient with respect to input
    loss_gradient = model.get_loss_gradient(image, true_label)

    # Create perturbation in direction that increases loss
    perturbation = epsilon * np.sign(loss_gradient)

    # Add imperceptible noise
    adversarial_image = image + perturbation

    # Model misclassifies, but human sees no difference
    return adversarial_image

# Example: Chest X-ray
original_prediction = model.predict(xray)  # "Pneumonia: 95% confidence"
adversarial_xray = generate_adversarial_example(model, xray, "pneumonia")
adversarial_prediction = model.predict(adversarial_xray)  # "Normal: 92% confidence"

# ⚠️ Radiologist sees identical images, but model flips diagnosis

Why it matters:

  • Accidental adversarial examples: Real medical images might accidentally fall into adversarial regions
  • Malicious attacks: Bad actors could manipulate images to change diagnoses
  • Fragility indicator: Sensitivity to tiny perturbations suggests model relies on spurious features

Defense strategies:

  • Adversarial training (include adversarial examples in training; sketched below)
  • Input sanitization and validation
  • Ensemble methods (adversarial examples often don’t transfer)
  • Certified defenses (formal guarantees of robustness)
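
As a concrete sketch of the first strategy, adversarial training augments each training batch with FGSM-perturbed copies of its own images. This assumes a Keras image classifier; function and variable names are illustrative:

import tensorflow as tf

def adversarial_training_step(model, images, labels, loss_fn, optimizer, epsilon=0.03):
    """One training step on a 50/50 mix of clean and FGSM-perturbed images."""
    images = tf.convert_to_tensor(images)
    labels = tf.convert_to_tensor(labels)

    # 1. Craft FGSM adversarial versions of the current batch
    with tf.GradientTape() as tape:
        tape.watch(images)
        loss = loss_fn(labels, model(images, training=False))
    signed_grad = tf.sign(tape.gradient(loss, images))
    adv_images = tf.clip_by_value(images + epsilon * signed_grad, 0.0, 1.0)

    # 2. Update weights on the mixed batch so the model learns to resist the perturbation
    mixed_images = tf.concat([images, adv_images], axis=0)
    mixed_labels = tf.concat([labels, labels], axis=0)
    with tf.GradientTape() as tape:
        loss = loss_fn(mixed_labels, model(mixed_images, training=True))
    grads = tape.gradient(loss, model.trainable_variables)
    optimizer.apply_gradients(zip(grads, model.trainable_variables))
    return loss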

3. Spurious Correlations

Definition: Model learns from confounders or artifacts rather than true causal features.

Classic example: Zech et al., 2018, PLoS Medicine

  • Pneumonia detection model achieved high accuracy
  • Analysis revealed model was detecting hospital systems (from image markers) not pneumonia
  • Model learned: “Hospital A uses portable X-rays for ICU patients → Portable X-ray marker → Pneumonia”
  • Confounding: Sicker patients → ICU → Portable X-rays

Safety hazard: Model works in training hospital, fails completely when deployed elsewhere.

Another example: COVID-19 detection from chest X-rays

Many models learned:

  • Patients with COVID → Admitted to hospital → Supine position X-rays
  • Healthy people → Outpatient imaging → Standing position X-rays
  • The model detected position (supine vs. standing) rather than COVID

4. Training Data Poisoning

Definition: Malicious or erroneous data corrupts training set.

Scenarios:

  • Accidental: Mislabeled data, data entry errors, EHR errors
  • Malicious: Adversary injects poisoned samples during training

Example: Model trained on EHR data with documentation errors

  • Diagnosis codes frequently misapplied
  • Billing codes used inconsistently
  • Copy-paste errors in clinical notes

Consequence: Model learns incorrect patterns from noisy labels.

5. Out-of-Distribution (OOD) Detection Failure

Definition: Model makes confident predictions on data unlike anything in training set.

Example: Diabetic retinopathy screening AI deployed in rural clinic

  • Training data: High-quality retinal images from academic centers
  • Deployment: Poor lighting, cataracts, lower-quality cameras
  • Model made confident (wrong) predictions on low-quality images it should have rejected

Safety requirement: Model must detect OOD inputs and defer to human.

import logging

def safe_prediction_with_ood_detection(model, ood_detector, input_data, ood_threshold=0.5):
    """
    Make a prediction only if the input is in-distribution

    Returns:
    - (prediction, confidence) if in-distribution
    - (None, explanation) if out-of-distribution
    """
    # Check whether the input resembles the training distribution
    ood_score = ood_detector.score(input_data)

    if ood_score > ood_threshold:
        # Out of distribution - defer to human
        logging.warning(f"OOD input detected: score={ood_score:.3f}")
        return None, "Input outside model's training distribution - manual review required"

    # In distribution - proceed with prediction
    prediction = model.predict(input_data)
    return prediction, model.confidence

13.3.1.2 Category 2: Operational Failures

6. Misuse and Off-Label Use

Definition: Using AI for purposes beyond validated scope.

Examples:

  • Diabetic retinopathy AI used for other eye diseases
  • COVID-19 detection AI used for other respiratory conditions
  • Sepsis prediction AI used in populations different from training

Why it happens:

  • Intended use not clearly communicated
  • Clinicians assume broader applicability
  • Pressure to “do something” with AI investment

Safety requirement: Clear labeling of validated use cases and populations.

7. Automation Bias

Definition: Tendency to favor automated decisions over human judgment, even when human is correct.

Study: Goddard et al., 2012, Journal of General Internal Medicine

  • Physicians shown EHR alerts for drug interactions
  • 40% overrode their own correct judgment when computer suggested different action
  • Automation bias stronger when physicians were time-pressured or fatigued

Application to AI: High-confidence predictions can override clinical judgment, even when AI is wrong.

Mitigation:

  • Present AI as “decision support” not “decision maker”
  • Display uncertainty/confidence intervals
  • Encourage critical evaluation
  • Training on AI limitations

8. Alert Fatigue

Definition: Desensitization to warnings due to high false positive rate.

Epic sepsis model: 88% of alerts were false alarms → Clinicians learned to ignore alerts → Missed actual sepsis cases

Approximate thresholds:

  • >90% false positives: Clinicians ignore almost all alerts
  • 70-90% false positives: Alerts lose credibility
  • <50% false positives: Alert burden is generally actionable

Trade-off: Sensitivity vs. specificity

  • Increase sensitivity → Catch more cases but more false alarms
  • Increase specificity → Fewer false alarms but miss more cases

Safety-critical design: Must account for alert burden in operational context, not just optimize AUROC.
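
The low PPV at low prevalence follows directly from Bayes' rule, which is why alert burden explodes even with seemingly reasonable specificity. A quick calculation using the published Epic figures (sensitivity 33%, specificity 83%, sepsis in roughly 7% of hospitalizations):

def positive_predictive_value(sensitivity, specificity, prevalence):
    """PPV = P(disease | alert), via Bayes' rule."""
    true_positives = sensitivity * prevalence
    false_positives = (1 - specificity) * (1 - prevalence)
    return true_positives / (true_positives + false_positives)

ppv = positive_predictive_value(sensitivity=0.33, specificity=0.83, prevalence=0.07)
print(f"PPV: {ppv:.0%}")  # ≈13% → nearly 9 in 10 alerts are false alarms

No threshold tuning fixes this alone: at 7% prevalence, even 95% specificity would leave roughly two-thirds of alerts false.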

9. Workflow Integration Failures

Definition: AI disrupts clinical workflow, leading to workarounds or abandonment.

Example: AI diagnostic tool requires:

  • Separate login
  • Data re-entry
  • 5-minute wait for prediction
  • Results in different system from EHR

Consequence: Clinicians skip using tool or use incorrectly.

Safety requirement: AI must fit seamlessly into existing workflow.

13.3.1.3 Category 3: Systemic Failures

10. Cascading Failures

Definition: AI failure triggers failures in dependent systems.

Example: Hospital resource allocation AI

  • AI predicts low emergency admissions
  • Hospital reduces staffing
  • Unpredicted surge occurs
  • Insufficient staff → Delayed care → Patient harm

System view: AI is embedded in complex sociotechnical system. Local optimization can create global failure.

11. Common-Cause Failures

Definition: Single event causes multiple AI systems to fail simultaneously.

Example: EHR system update changes data format

  • Multiple AI models expecting old format fail
  • All deployed models affected at once
  • No backup since all rely on same infrastructure

Mitigation: Diversity in system architecture, independent fallback systems.

12. Latent Failures

Definition: Errors or vulnerabilities that exist but don’t cause harm until triggered by specific conditions.

Example: Model trained predominantly on data from one demographic

  • Performs well on similar populations (latent failure undetected)
  • Deployed to different demographic → Poor performance → Patient harm

Swiss cheese model: Latent failures are “holes” waiting for alignment.

13.3.2 Failure Mode and Effects Analysis (FMEA) for AI

FMEA: Systematic method for identifying failure modes before they cause harm.

Process:

  1. Identify potential failure modes (what could go wrong?)
  2. Determine effects (what happens if it fails?)
  3. Assess severity (how bad?)
  4. Assess occurrence (how likely?)
  5. Assess detection (can we catch it before harm?)
  6. Calculate Risk Priority Number (RPN) = Severity × Occurrence × Detection
  7. Prioritize mitigation (address highest RPNs first)

13.3.2.1 FMEA Template for AI Systems

Scenario: Sepsis prediction model

Failure Mode Potential Cause Effect Severity (1-5) Occurrence (1-5) Detection (1-5) RPN Mitigation
Model misses sepsis case (false negative) Data drift, edge case, training bias Patient doesn’t receive timely treatment → Death/severe harm 5 3 4 60 (1) Human review for borderline cases; (2) Ensemble with rule-based backup; (3) Lower confidence threshold
False alarm (false positive) Low specificity, noisy input data Alert fatigue → Ignore future alerts 4 5 2 40 (1) Increase specificity threshold; (2) Two-stage alerting; (3) Monitor alert burden
Model unavailable (system crash) Server failure, network outage No predictions → Revert to standard care 3 2 1 6 (1) Redundant systems; (2) Graceful degradation; (3) Automatic failover
Data drift (silent degradation) Population changes, practice changes Gradually increasing errors 5 4 5 100 (1) Continuous monitoring; (2) Automated drift detection; (3) Retraining protocol
OOD input (unfamiliar case) Patient outside training distribution Confident wrong prediction 5 3 4 60 (1) OOD detection; (2) Uncertainty quantification; (3) Defer to human
Adversarial attack Malicious manipulation of input Incorrect diagnosis 5 1 5 25 (1) Input validation; (2) Adversarial training; (3) Anomaly detection

Severity scale:

  • 1: No harm
  • 2: Minor harm (temporary, reversible)
  • 3: Moderate harm (prolonged recovery)
  • 4: Major harm (permanent injury)
  • 5: Catastrophic (death or multiple major harms)

Occurrence scale:

  • 1: Very rare (< 1 in 10,000)
  • 2: Rare (1 in 1,000)
  • 3: Occasional (1 in 100)
  • 4: Frequent (1 in 10)
  • 5: Very frequent (> 1 in 10)

Detection scale:

  • 1: Almost certain to detect before harm
  • 2: High chance of detection
  • 3: Moderate chance
  • 4: Low chance
  • 5: Almost certain NOT to detect

Risk Priority Number (RPN):

  • 1-30: Low risk (monitor)
  • 31-100: Moderate risk (reduce if feasible)
  • 101-125: High risk (must reduce before deployment)

Post-mitigation: Re-calculate RPN after implementing controls. Goal: All RPN < 100, ideally < 50.

TipFMEA Workshop

Conduct FMEA with multidisciplinary team:

  • Data scientists: Technical failure modes
  • Clinicians: Clinical context and effects
  • Safety officers: Severity assessment
  • Quality assurance: Detection methods
  • Patients: Real-world impact

Timing: Conduct before deployment, then update:

  • When the model changes
  • When the deployment context changes
  • After incidents or near-misses
  • Annually at minimum

13.3.2.2 Automated FMEA Template

import pandas as pd

class AISystemFMEA:
    """
    Systematic FMEA for AI healthcare systems
    """

    def __init__(self, system_name):
        self.system_name = system_name
        self.failure_modes = []

    def add_failure_mode(self, mode, cause, effect, severity, occurrence, detection):
        """
        Add failure mode to FMEA

        Parameters:
        - mode: Description of failure
        - cause: What causes this failure
        - effect: Consequence of failure
        - severity: 1-5 (1=negligible, 5=catastrophic)
        - occurrence: 1-5 (1=rare, 5=frequent)
        - detection: 1-5 (1=certain to detect, 5=unlikely to detect)
        """
        rpn = severity * occurrence * detection

        self.failure_modes.append({
            'Failure Mode': mode,
            'Cause': cause,
            'Effect': effect,
            'Severity': severity,
            'Occurrence': occurrence,
            'Detection': detection,
            'RPN': rpn,
            'Risk Level': self._classify_risk(rpn)
        })

    def _classify_risk(self, rpn):
        if rpn <= 30:
            return 'Low'
        elif rpn <= 100:
            return 'Moderate'
        else:
            return 'High'

    def generate_report(self):
        """Generate FMEA report sorted by RPN"""
        df = pd.DataFrame(self.failure_modes)
        df = df.sort_values('RPN', ascending=False)

        print(f"\n{'='*80}")
        print(f"FMEA Report: {self.system_name}")
        print(f"{'='*80}\n")

        print(df.to_string(index=False))

        # Risk summary
        risk_counts = df['Risk Level'].value_counts()
        print(f"\n{'='*80}")
        print("Risk Summary:")
        print(f"  High Risk:     {risk_counts.get('High', 0)} failure modes (RPN > 100)")
        print(f"  Moderate Risk: {risk_counts.get('Moderate', 0)} failure modes (RPN 31-100)")
        print(f"  Low Risk:      {risk_counts.get('Low', 0)} failure modes (RPN ≤ 30)")

        if risk_counts.get('High', 0) > 0:
            print(f"\n⚠️  WARNING: {risk_counts.get('High', 0)} HIGH RISK failure modes must be mitigated before deployment")

        return df

# Example usage
fmea = AISystemFMEA("Sepsis Prediction Model")

# Add failure modes from clinical team
fmea.add_failure_mode(
    mode="False negative (missed sepsis case)",
    cause="Data drift, patient outside training distribution",
    effect="Delayed treatment → Patient death",
    severity=5,
    occurrence=3,
    detection=4
)

fmea.add_failure_mode(
    mode="False alarm (false positive)",
    cause="Low model specificity",
    effect="Alert fatigue → Future alerts ignored",
    severity=4,
    occurrence=5,
    detection=2
)

fmea.add_failure_mode(
    mode="Silent performance degradation",
    cause="Concept drift over time",
    effect="Gradually increasing error rate",
    severity=5,
    occurrence=4,
    detection=5
)

# Generate report
report = fmea.generate_report()

Output:

================================================================================
FMEA Report: Sepsis Prediction Model
================================================================================

                    Failure Mode                              Cause                                       Effect  Severity  Occurrence  Detection  RPN Risk Level
Silent performance degradation                 Concept drift over time              Gradually increasing error rate         5           4          5  100   Moderate
False negative (missed sepsis case)  Data drift, patient outside training...         Delayed treatment → Patient death         5           3          4   60   Moderate
  False alarm (false positive)                     Low model specificity  Alert fatigue → Future alerts ignored         4           5          2   40   Moderate

================================================================================
Risk Summary:
  High Risk:     0 failure modes (RPN > 100)
  Moderate Risk: 3 failure modes (RPN 31-100)
  Low Risk:      0 failure modes (RPN ≤ 30)
WarningFMEA Limitations for AI

Traditional FMEA assumes:

  • Known failure modes
  • Quantifiable probabilities
  • Independent failures

AI challenges:

  • Unknown unknowns: Failure modes we haven’t thought of
  • Uncertain probabilities: Hard to estimate occurrence of data drift
  • Correlated failures: Adversarial examples affect a whole model class

Solution: Combine FMEA with:

  • Stress testing for unknown failure modes
  • Continuous monitoring for early detection
  • Red team exercises to find vulnerabilities

13.3.3 Other Hazard Analysis Methods

Fault Tree Analysis (FTA)

Top-down approach: Start with harm, work backward to causes.

                  Patient Death from Sepsis
                           |
              +------------+------------+
              |                         |
        Sepsis Not Detected     Treatment Delayed
              |                         |
     +--------+--------+       +--------+--------+
     |                 |       |                 |
Model Missed     Clinician      Alert Not      System
    Case       Didn't Notice      Seen          Down

Event Tree Analysis (ETA)

Forward approach: Start with event, trace possible outcomes.

Sepsis Develops → Model Alerts? → Clinician Reviews? → Treatment Given? → Outcome
                      ↓                    ↓                   ↓
                    Yes/No              Yes/No             Yes/No       Survive/Die

HAZOP (Hazard and Operability Study)

Use guide words to identify hazards:

Guide Word Application to AI Example Hazard
No/Not No prediction available System downtime → Manual workflow
More More false positives Alert fatigue
Less Less sensitivity Missed cases
As Well As Model captures spurious features Detects hospital system, not disease
Part Of Model trained on subset Bias against underrepresented groups
Reverse Opposite of intended Model predicts low risk for high-risk patient
Other Than Different output Model provides prediction for wrong patient

13.4 Safety Validation Beyond ML Evaluation

Standard ML evaluation (accuracy, AUROC, precision/recall on test set) is necessary but insufficient for safety validation.

13.4.1 Why Standard Metrics Don’t Ensure Safety

Problem 1: Average performance masks critical failures

  • 95% accuracy sounds great
  • But 5% errors might all be in high-severity cases

Example: Cancer screening AI

  • 98% accuracy overall
  • But 30% false negative rate on early-stage cancers (most treatable)
  • Model performs well on late-stage (obvious) and no-cancer cases
  • Fails exactly where it would be most valuable

Problem 2: Test set doesn’t represent edge cases

  • IID (independent, identically distributed) assumption
  • Real world has outliers, distribution shift, adversarial conditions

Problem 3: Metrics don’t account for operational context

  • High sensitivity with low specificity → Alert fatigue → System abandonment
  • Metrics evaluated in isolation, not system-level impact

Problem 4: No assessment of failure modes

  • Which errors are acceptable?
  • Which are catastrophic?
  • Standard metrics treat all errors equally

13.4.2 Safety-Specific Validation Methods

13.4.2.1 1. Worst-Case Analysis

Goal: Evaluate performance on hardest cases, not average cases.

import numpy as np
from sklearn.metrics import accuracy_score

def worst_case_validation(model, X, y, protected_groups):
    """
    Evaluate model on worst-performing subgroups

    Returns performance on:
    - Hardest examples (lowest confidence correct predictions)
    - Underrepresented groups
    - Edge cases
    """
    results = {}

    # 1. Evaluate on low-confidence predictions
    predictions = model.predict_proba(X)
    confidence = np.max(predictions, axis=1)

    # Bottom quartile confidence
    low_conf_mask = confidence < np.percentile(confidence, 25)
    results['low_confidence_accuracy'] = accuracy_score(
        y[low_conf_mask],
        model.predict(X[low_conf_mask])
    )

    # 2. Evaluate on underrepresented groups
    for group in protected_groups:
        group_mask = (X['demographic_group'] == group)
        results[f'{group}_accuracy'] = accuracy_score(
            y[group_mask],
            model.predict(X[group_mask])
        )
        results[f'{group}_sample_size'] = group_mask.sum()

    # 3. Evaluate on edge cases (outliers)
    # Use isolation forest or other outlier detection
    from sklearn.ensemble import IsolationForest
    outlier_detector = IsolationForest()
    outlier_scores = outlier_detector.fit_predict(X)
    outlier_mask = (outlier_scores == -1)

    results['outlier_accuracy'] = accuracy_score(
        y[outlier_mask],
        model.predict(X[outlier_mask])
    )

    return results

# Example
worst_case_results = worst_case_validation(
    model=sepsis_model,
    X=test_data,
    y=test_labels,
    protected_groups=['Black', 'Hispanic', 'Asian', 'White']
)

print("Worst-Case Validation Results:")
print(f"  Overall accuracy: 92%")
print(f"  Low-confidence accuracy: {worst_case_results['low_confidence_accuracy']:.1%}")
print(f"  Outlier accuracy: {worst_case_results['outlier_accuracy']:.1%}")
print(f"  Black patient accuracy: {worst_case_results['Black_accuracy']:.1%}")
print(f"  Hispanic patient accuracy: {worst_case_results['Hispanic_accuracy']:.1%}")

Safety threshold: Worst-case performance must meet minimum safety requirements, not just average.

13.4.2.2 2. Boundary Condition Testing

Test inputs at extremes:

Input Type Example Why It Matters
Out of range Age = 150, Heart rate = 0 How does model handle impossible values?
Missing data 50% of features missing Graceful degradation or catastrophic failure?
Corrupted inputs Image with noise, text with typos Real-world data is messy
Adversarial examples Small perturbations designed to fool model Robustness to attacks
OOD inputs Data from different population/site Detects distribution shift?
def boundary_condition_testing(model, test_data):
    """
    Test model behavior on edge cases

    Assumes helper functions (create_missing_data, evaluate,
    evaluate_adversarial_robustness, load_ood_dataset) are defined elsewhere.
    """
    test_results = {}

    # Test 1: Out-of-range values
    invalid_input = {
        'age': 200,  # Impossible
        'heart_rate': -50,  # Invalid
        'temperature': 150  # Lethal
    }

    try:
        prediction = model.predict(invalid_input)
        test_results['invalid_input'] = f"⚠️  FAILURE: Model accepted invalid input (predicted: {prediction})"
    except ValueError as e:
        test_results['invalid_input'] = f"✅ PASS: Model rejected invalid input ({e})"

    # Test 2: Missing data
    for missing_pct in [0.1, 0.3, 0.5, 0.7, 0.9]:
        partial_input = create_missing_data(test_data, missing_pct)
        accuracy = evaluate(model, partial_input)
        test_results[f'missing_{int(missing_pct*100)}pct'] = accuracy

    # Test 3: Adversarial robustness
    adversarial_accuracy = evaluate_adversarial_robustness(model, test_data)
    test_results['adversarial_robustness'] = adversarial_accuracy

    # Test 4: OOD detection
    ood_data = load_ood_dataset()  # Data from different hospital
    ood_detected = model.detect_ood(ood_data)
    test_results['ood_detection_rate'] = ood_detected.mean()

    return test_results

13.4.2.3 3. Adversarial Robustness Evaluation

Attacks to test:

Fast Gradient Sign Method (FGSM):

import tensorflow as tf

# Cross-entropy loss for classification (assumed; match the model's training loss)
loss_function = tf.keras.losses.CategoricalCrossentropy()

def fgsm_attack(model, image, label, epsilon=0.03):
    """
    Generate adversarial example using FGSM

    Adds small perturbation in direction of gradient
    """
    # Compute gradient of loss with respect to input
    image = tf.convert_to_tensor(image)
    with tf.GradientTape() as tape:
        tape.watch(image)
        prediction = model(image)
        loss = loss_function(label, prediction)

    gradient = tape.gradient(loss, image)

    # Create adversarial image
    adversarial_image = image + epsilon * tf.sign(gradient)

    # Clip to valid range [0, 1]
    adversarial_image = tf.clip_by_value(adversarial_image, 0, 1)

    return adversarial_image

# Evaluate robustness
original_accuracy = evaluate(model, test_images, test_labels)
adversarial_images = [fgsm_attack(model, img, lbl) for img, lbl in zip(test_images, test_labels)]
adversarial_accuracy = evaluate(model, adversarial_images, test_labels)

print(f"Original accuracy: {original_accuracy:.1%}")
print(f"Adversarial accuracy: {adversarial_accuracy:.1%}")
print(f"Robustness gap: {original_accuracy - adversarial_accuracy:.1%}")

if adversarial_accuracy < 0.5:
    print("⚠️  SAFETY CONCERN: Model highly vulnerable to adversarial attacks")

Projected Gradient Descent (PGD): Stronger iterative attack

Certified defenses: Formal guarantees of robustness within epsilon-ball

13.4.2.4 4. Stress Testing

Test system under adverse conditions:

import time

import numpy as np

class SafetyStressTest:
    """
    Comprehensive stress testing for AI safety

    Assumes an evaluate(model, data) helper is defined elsewhere.
    """

    def test_high_load(self, model, num_requests=10000):
        """Can system handle peak load?"""
        start = time.time()
        predictions = model.batch_predict(num_requests)
        latency = (time.time() - start) / num_requests

        if latency > 1.0:  # > 1 second per prediction
            return f"⚠️  FAIL: Latency {latency:.2f}s exceeds 1s requirement"
        return f"✅ PASS: Latency {latency:.2f}s"

    def test_degraded_input_quality(self, model, test_data):
        """Performance with poor-quality inputs"""
        results = {}

        # Add increasing amounts of noise
        for noise_level in [0.1, 0.3, 0.5]:
            noisy_data = test_data + noise_level * np.random.randn(*test_data.shape)
            accuracy = evaluate(model, noisy_data)
            results[f'noise_{noise_level}'] = accuracy

            if accuracy < 0.7:  # Arbitrary safety threshold
                results[f'noise_{noise_level}_status'] = "⚠️  UNSAFE"

        return results

    def test_concurrent_failures(self, model):
        """What happens when multiple things go wrong?"""
        # Simulate database connection failure during high load
        # Simulate network latency while processing corrupted input
        # etc.
        pass

    def test_recovery(self, model):
        """Can system recover from failures?"""
        # Simulate crash and restart
        # Verify state restoration
        # Check for data loss
        pass

13.4.2.5 5. Subgroup Analysis

Adapted from Chapter 9 (Ethics), but safety-focused:

from sklearn.metrics import accuracy_score, recall_score

def safety_subgroup_analysis(model, X, y, subgroups):
    """
    Evaluate safety metrics across demographic subgroups

    Focus on clinical safety, not just fairness
    """
    safety_metrics = {}

    for group, mask in subgroups.items():
        # Standard metrics
        accuracy = accuracy_score(y[mask], model.predict(X[mask]))

        # Safety-critical metrics
        # False negatives in high-severity cases
        high_severity_mask = (y[mask] == 'high_risk')
        false_negative_rate = 1 - recall_score(
            y[mask][high_severity_mask],
            model.predict(X[mask][high_severity_mask])
        )

        safety_metrics[group] = {
            'accuracy': accuracy,
            'false_negative_rate': false_negative_rate,
            'sample_size': mask.sum()
        }

    # Check for subgroup safety disparities
    fnr_values = [m['false_negative_rate'] for m in safety_metrics.values()]
    max_fnr_disparity = max(fnr_values) - min(fnr_values)

    if max_fnr_disparity > 0.1:  # 10% disparity threshold
        print(f"⚠️  SAFETY ALERT: {max_fnr_disparity:.1%} disparity in false negative rates across subgroups")

    return safety_metrics

13.4.3 Regulatory Safety Testing Requirements

13.4.3.1 FDA Perspective

Software Validation Guidance:

  1. Analytical validation: Does model work on retrospective data?
  2. Clinical validation: Does model work in prospective clinical use?
  3. Usability validation: Can clinicians use it correctly?

AI/ML-specific:

  • Predetermined change control plan (PCCP)
  • Real-world performance monitoring
  • Reporting of performance degradation

13.4.3.2 EU MDR Perspective

Clinical Evaluation Report (CER) requirements:

  • Literature review of similar devices
  • Clinical investigation data (if available)
  • Post-market clinical follow-up (PMCF) plan

For AI:

  • Demonstrate clinical benefit vs. current standard of care
  • Address uncertainty in AI predictions
  • Plan for ongoing evaluation as model evolves
TipSafety Validation Checklist

Before deployment:

  • Worst-case and subgroup performance meet minimum safety thresholds (not just averages)
  • Boundary conditions tested: invalid, missing, corrupted, and OOD inputs
  • Adversarial robustness evaluated (e.g., FGSM/PGD)
  • Stress testing passed under peak load and degraded input quality
  • FMEA completed, with all high-RPN failure modes mitigated
  • Post-market monitoring and retraining plan in place


13.5 Safety-Critical Design Patterns

Safety must be designed in, not tested in. This section covers architectural patterns that build safety into AI systems.

13.5.1 Human-in-the-Loop (HITL) Systems

Levels of automation (adapted from aviation):

Level Name Description AI Role Human Role Healthcare Example
1 Manual Human decides, no AI None Decision maker Traditional diagnosis
2 Decision support AI suggests, human decides Advisor Decision maker Diagnostic suggestions that clinician reviews
3 Conditional automation AI decides, human approves before action Recommender Approver Treatment plan requires clinician sign-off
4 High automation AI acts, human monitors and can intervene Autonomous actor Supervisor Automated insulin pump with override
5 Full automation AI acts, no human involvement Fully autonomous None (Rare in healthcare)

Safety principle: Higher automation requires higher reliability.

When to use each level:

  • Level 2 (Decision support): Most current healthcare AI
    • Diagnostic aids
    • Risk prediction
    • Treatment suggestions
  • Level 3 (Human approval): When AI mistakes would be catastrophic but human can catch
    • Chemotherapy dosing
    • Surgical planning
    • Critical medication changes
  • Level 4 (Human monitoring): When immediate action required, human response too slow
    • Ventilator management
    • Insulin pumps
    • Anesthesia control
  • Level 5 (Full automation): Almost never appropriate in healthcare (except routine tasks)
WarningThe Ironies of Automation

Bainbridge, 1983 identified paradoxes:

  1. Automation reduces human engagement → When automation fails, humans are out of the loop and can’t respond
  2. Automation handles routine cases → Humans only see hard cases, losing routine expertise
  3. More automation = more complex failures → When AI fails, failures are harder for humans to diagnose

Application to healthcare AI:

  • Don’t automate routine cases and leave only hard cases to humans
  • Maintain human expertise through regular engagement
  • Design for graceful degradation, not sudden failure

13.5.2 Fallback Systems and Graceful Degradation

Defense in depth: Multiple independent safety layers.

13.5.2.1 Primary-Backup Architecture

import logging

class SafeAISystem:
    """
    AI system with fallback to rule-based or human decision
    """

    def __init__(self, ai_model, rule_based_backup, confidence_threshold=0.8):
        self.ai_model = ai_model
        self.rule_based_backup = rule_based_backup
        self.confidence_threshold = confidence_threshold
        self.system_health = "operational"

    def predict(self, input_data):
        """
        Make prediction with safety fallbacks

        Always returns a dict with 'prediction', 'confidence', and 'source' keys
        """
        # Health check
        if self.system_health != "operational":
            return self.safe_mode_prediction(input_data)

        # Try AI prediction
        try:
            prediction = self.ai_model.predict(input_data)
            confidence = prediction.confidence

            # Check confidence threshold
            if confidence >= self.confidence_threshold:
                return {'prediction': prediction, 'confidence': confidence,
                        'source': 'ai_model'}
            else:
                # Low confidence → Fallback
                logging.warning(f"Low confidence ({confidence:.2f}) → Falling back to rule-based")
                return {'prediction': self.rule_based_backup.predict(input_data),
                        'confidence': confidence, 'source': 'rule_based'}

        except Exception as e:
            # AI failure → Fallback
            logging.error(f"AI prediction failed: {e} → Falling back to rule-based")
            return {'prediction': self.rule_based_backup.predict(input_data),
                    'confidence': None, 'source': 'rule_based'}

    def safe_mode_prediction(self, input_data):
        """
        Conservative predictions when system degraded
        """
        # Use most conservative/safe default
        return {
            'prediction': 'manual_review_required',
            'confidence': 0.0,
            'source': 'manual',
            'reason': 'System in safe mode - human evaluation required'
        }

    def health_check(self):
        """
        Continuous system health monitoring
        """
        try:
            # Check model availability
            test_prediction = self.ai_model.predict(self.get_test_input())

            # Check latency
            import time
            start = time.time()
            _ = self.ai_model.predict(self.get_test_input())
            latency = time.time() - start

            if latency > 5.0:  # 5 second threshold
                self.system_health = "degraded"
                return "degraded"

            self.system_health = "operational"
            return "operational"

        except Exception as e:
            self.system_health = "failed"
            return "failed"

# Example usage
ai_system = SafeAISystem(
    ai_model=neural_network_model,
    rule_based_backup=clinical_decision_rules,
    confidence_threshold=0.8
)

# Prediction with automatic fallback
result = ai_system.predict(patient_data)

if result['source'] == 'rule_based':
    print("⚠️  AI fallback activated - using rule-based system")
elif result['source'] == 'manual':
    print("⚠️  Safe mode - manual review required")

13.5.2.2 Ensemble Methods for Safety

Diverse models reduce common-mode failures:

class SafetyEnsemble:
    """
    Ensemble with safety-focused disagreement handling
    """

    def __init__(self, models, min_agreement=0.75):
        self.models = models
        self.min_agreement = min_agreement

    def predict(self, input_data):
        """
        Predict with ensemble agreement checking
        """
        predictions = [model.predict(input_data) for model in self.models]

        # Check agreement
        agreement = self.calculate_agreement(predictions)

        if agreement >= self.min_agreement:
            # High agreement → Use ensemble prediction
            return {
                'prediction': self.aggregate(predictions),
                'confidence': agreement,
                'source': 'ensemble'
            }
        else:
            # Low agreement → Flag for human review
            return {
                'prediction': 'uncertain',
                'confidence': agreement,
                'source': 'manual_review_required',
                'reason': f'Low model agreement ({agreement:.2%})'
            }

    def calculate_agreement(self, predictions):
        """
        Fraction of models that agree with majority
        """
        from collections import Counter
        counts = Counter(predictions)
        majority_count = counts.most_common(1)[0][1]
        return majority_count / len(predictions)

    def aggregate(self, predictions):
        """
        Aggregate predictions (majority vote, averaging, etc.)
        """
        from collections import Counter
        return Counter(predictions).most_common(1)[0][0]
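
A brief usage sketch, assuming three hypothetical, diversely trained classifiers (model_a, model_b, model_c) whose predict methods return a discrete class label:

ensemble = SafetyEnsemble(
    models=[model_a, model_b, model_c],  # diverse architectures / training data
    min_agreement=0.75  # with 3 models, 0.75 requires unanimity (2/3 ≈ 0.67 falls below it)
)

result = ensemble.predict(patient_data)

if result['source'] == 'manual_review_required':
    # Disagreement among diverse models is itself a safety signal
    print(f"⚠️  Models disagree ({result['confidence']:.0%}) - routing to clinician")

Setting min_agreement relative to ensemble size is a design choice: the stricter the threshold, the more cases get routed to human review, trading alert burden for safety.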

13.5.3 Safety Monitoring and Circuit Breakers

Continuous monitoring for safety signals:

import logging
import time

class SafetyMonitor:
    """
    Real-time safety monitoring with automatic circuit breakers
    """

    def __init__(self, model, safety_thresholds):
        self.model = model
        self.thresholds = safety_thresholds
        self.metrics_history = []
        self.circuit_breaker_active = False

    def monitor_prediction(self, input_data, prediction, ground_truth=None):
        """
        Monitor individual prediction for safety concerns
        """
        alerts = []

        # Check 1: Confidence too low
        if prediction['confidence'] < self.thresholds['min_confidence']:
            alerts.append(f"Low confidence: {prediction['confidence']:.2f}")

        # Check 2: OOD detection (assumes the model exposes an OOD detector)
        ood_score = self.model.ood_detector.score(input_data)
        if ood_score > self.thresholds['max_ood_score']:
            alerts.append(f"Out-of-distribution input: {ood_score:.2f}")

        # Check 3: If ground truth available, check error
        if ground_truth is not None:
            error = (prediction['value'] != ground_truth)
            self.metrics_history.append({'error': error, 'timestamp': time.time()})

            # Check recent error rate
            recent_error_rate = self.calculate_recent_error_rate()
            if recent_error_rate > self.thresholds['max_error_rate']:
                alerts.append(f"Elevated error rate: {recent_error_rate:.1%}")
                self.activate_circuit_breaker()

        if alerts:
            self.log_safety_alert(alerts, input_data, prediction)

        return alerts

    def calculate_recent_error_rate(self, window_minutes=60):
        """
        Calculate error rate in recent time window
        """
        now = time.time()
        window_start = now - (window_minutes * 60)

        recent_errors = [
            m['error'] for m in self.metrics_history
            if m['timestamp'] > window_start
        ]

        if len(recent_errors) == 0:
            return 0.0

        return sum(recent_errors) / len(recent_errors)

    def activate_circuit_breaker(self):
        """
        Stop using AI model when safety threshold exceeded
        """
        self.circuit_breaker_active = True
        logging.critical("🚨 CIRCUIT BREAKER ACTIVATED - AI model disabled")
        logging.critical("System falling back to safe mode until manual review")

        # Alert on-call team
        self.send_alert_to_oncall()

    def send_alert_to_oncall(self):
        """Send urgent alert to on-call safety team"""
        # Integration with paging system (assumed external hook)
        pass

    def log_safety_alert(self, alerts, input_data, prediction):
        """Persist safety alerts for audit trail and trend analysis"""
        logging.warning(f"Safety alerts raised: {alerts}")

# Example usage
monitor = SafetyMonitor(
    model=sepsis_model,
    safety_thresholds={
        'min_confidence': 0.7,
        'max_ood_score': 0.5,
        'max_error_rate': 0.2  # 20% error rate
    }
)

# Monitor each prediction
prediction = sepsis_model.predict(patient_data)
alerts = monitor.monitor_prediction(patient_data, prediction, ground_truth)

if monitor.circuit_breaker_active:
    # Circuit breaker tripped → use fallback system
    prediction = fallback_system.predict(patient_data)

13.5.4 Safety Interlocks and Guardrails

Prevent unsafe actions before they occur:

class SafetyInterlocks:
    """
    Hard constraints preventing unsafe model behavior
    """

    def validate_input(self, input_data):
        """
        Validate input before allowing prediction
        """
        errors = []

        # Range checks
        if input_data['age'] < 0 or input_data['age'] > 120:
            errors.append(f"Invalid age: {input_data['age']}")

        if input_data['heart_rate'] < 20 or input_data['heart_rate'] > 300:
            errors.append(f"Invalid heart rate: {input_data['heart_rate']}")

        # Completeness checks
        required_fields = ['age', 'heart_rate', 'systolic_bp', 'diastolic_bp', 'temperature']
        missing = [f for f in required_fields if f not in input_data or input_data[f] is None]
        if missing:
            errors.append(f"Missing required fields: {missing}")

        # Consistency checks (only when both values are present)
        if not missing and input_data['systolic_bp'] < input_data['diastolic_bp']:
            errors.append("Systolic BP < Diastolic BP (physiologically impossible)")

        if errors:
            raise ValueError(f"Input validation failed: {errors}")

    def validate_output(self, prediction, input_data):
        """
        Validate prediction before returning to clinician
        """
        warnings = []

        # Confidence check
        if prediction['confidence'] < 0.5:
            warnings.append("Low confidence prediction")

        # Plausibility check
        # Example: Risk score shouldn't increase dramatically for stable patient
        if 'previous_risk' in input_data:
            change = abs(prediction['risk'] - input_data['previous_risk'])
            if change > 0.5:  # >50 percentage-point change
                warnings.append(f"Large risk change: {input_data['previous_risk']:.2f} → {prediction['risk']:.2f}")

        # Output bounds check
        if prediction['risk'] < 0 or prediction['risk'] > 1:
            raise ValueError(f"Risk score out of bounds: {prediction['risk']}")

        if warnings:
            prediction['warnings'] = warnings

        return prediction

    def cross_check_with_rules(self, prediction, input_data):
        """
        Cross-check AI prediction against clinical decision rules
        """
        # Example: qSOFA score for sepsis
        qsofa = self.calculate_qsofa(input_data)

        if qsofa >= 2 and prediction['sepsis_risk'] < 0.5:
            # qSOFA suggests sepsis but AI says low risk
            return {
                **prediction,
                'cross_check_alert': f"AI predicts low risk but qSOFA={qsofa} suggests high risk - manual review recommended"
            }

        return prediction
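
The cross-check above calls a calculate_qsofa helper that the snippet leaves undefined. A minimal sketch of that method (to live inside SafetyInterlocks) is below; qSOFA assigns one point each for respiratory rate ≥ 22/min, systolic BP ≤ 100 mmHg, and altered mentation (GCS < 15). The field names are assumptions about the input schema.

    def calculate_qsofa(self, input_data):
        """
        Quick SOFA score (0-3): one point each for tachypnea,
        hypotension, and altered mental status.
        Field names are assumed to match the input schema above.
        """
        score = 0
        if input_data.get('respiratory_rate', 0) >= 22:
            score += 1
        if input_data.get('systolic_bp', 999) <= 100:
            score += 1
        if input_data.get('gcs', 15) < 15:  # Glasgow Coma Scale
            score += 1
        return score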
TipSafety Design Principles Summary
  1. Fail-safe: Default to safe state on failure (defer to human)
  2. Defense in depth: Multiple independent safety layers
  3. Diversity: Different approaches prevent common-mode failures
  4. Conservative: When uncertain, err on side of caution
  5. Monitoring: Continuous surveillance for safety signals
  6. Circuit breakers: Automatic shutoff when thresholds exceeded
  7. Graceful degradation: Reduce functionality, don’t fail completely
  8. Human-in-the-loop: Appropriate level of automation for risk level

13.6 Post-Market Safety Surveillance

Deployment isn’t the end of safety validation—it’s the beginning of continuous safety monitoring.

13.6.1 Why Post-Market Surveillance Is Critical for AI

Traditional medical devices are static. AI systems are dynamic:

  • Data drift: Population changes over time
  • Concept drift: Relationship between features and outcome changes
  • Software updates: Model retraining, architecture changes
  • Integration changes: New EHR versions, workflow updates

Result: Performance validated pre-deployment can degrade silently post-deployment.

13.6.2 Incident Detection and Response

13.6.2.1 Incident Severity Classification

FDA MedWatch categories (adapted for AI):

| Severity | Definition | Examples | Reporting |
|---|---|---|---|
| Critical | Death or serious injury | AI missed cancer diagnosis → Patient died | Immediate (within 24h) |
| Major | Significant harm, prolonged recovery | AI-guided surgery error → Complications requiring additional procedures | Within 30 days |
| Minor | Temporary harm, full recovery | False alarm causing unnecessary procedure | Annual summary |
| Near miss | Potential for harm, none occurred | AI error caught by clinician before action | Track for trending |
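
One way to keep these reporting windows from depending on human memory is to encode the table directly. A minimal sketch follows; the categories and windows mirror the table above, and the function name is illustrative.

from datetime import datetime, timedelta

# Internal reporting windows mirroring the severity table above
REPORTING_DEADLINES = {
    'critical': timedelta(hours=24),   # immediate internal escalation
    'major': timedelta(days=30),
    'minor': None,                     # rolled into annual summary
    'near_miss': None,                 # tracked for trending only
}

def reporting_deadline(severity: str, detected_at: datetime):
    """Return the internal reporting deadline for an incident, or None."""
    window = REPORTING_DEADLINES[severity]
    return detected_at + window if window else None

# Example
print(reporting_deadline('critical', datetime.now()))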

13.6.2.2 Incident Response Protocol

from datetime import datetime

class IncidentResponse:
    """
    Systematic incident response for AI safety events
    """

    def __init__(self, system_name):
        self.system_name = system_name
        self.incidents = []

    def report_incident(self, severity, description, patient_impact, root_cause_preliminary):
        """
        Report and initiate response to safety incident
        """
        incident = {
            'timestamp': datetime.now(),
            'severity': severity,
            'description': description,
            'patient_impact': patient_impact,
            'root_cause_preliminary': root_cause_preliminary,
            'status': 'reported'
        }

        self.incidents.append(incident)

        # Immediate actions based on severity
        if severity == 'critical':
            self.critical_incident_response(incident)
        elif severity == 'major':
            self.major_incident_response(incident)

        # All incidents trigger investigation
        self.initiate_investigation(incident)

    def critical_incident_response(self, incident):
        """
        Immediate response to a critical safety event.
        The helpers called below (activate_circuit_breaker, page_safety_team,
        notify_regulators, quarantine_predictions, initiate_rca) are assumed
        hooks into paging, regulatory, and audit systems.
        """
        # 1. Activate circuit breaker if systematic issue
        if self.is_systematic_failure(incident):
            self.activate_circuit_breaker()

        # 2. Alert on-call safety team
        self.page_safety_team(incident)

        # 3. Notify regulatory authorities (FDA MedWatch within 24h)
        self.notify_regulators(incident, deadline_hours=24)

        # 4. Quarantine affected predictions
        self.quarantine_predictions(incident)

        # 5. Initiate root cause analysis
        self.initiate_rca(incident)

    def is_systematic_failure(self, incident):
        """
        Determine if incident indicates systematic problem
        """
        # Check for similar recent incidents
        recent_similar = [
            i for i in self.incidents[-10:]
            if i['root_cause_preliminary'] == incident['root_cause_preliminary']
        ]

        if len(recent_similar) >= 3:
            return True  # Trend suggests systematic issue

        return False

    def initiate_investigation(self, incident):
        """
        Start formal investigation
        """
        investigation = {
            'incident_id': len(self.incidents),
            'team': ['data_scientist', 'clinician', 'safety_officer', 'quality_assurance'],
            'timeline': '30 days for major/critical, 90 days for minor',
            'deliverables': ['root_cause_analysis', 'corrective_actions', 'preventive_actions']
        }

        return investigation

# Example
incident_response = IncidentResponse(system_name="Sepsis Prediction Model")

# Critical incident reported
incident_response.report_incident(
    severity='critical',
    description="Model failed to alert for patient who developed septic shock",
    patient_impact="Patient required ICU admission, prolonged recovery",
    root_cause_preliminary="Data drift - patient demographics outside training distribution"
)

13.6.3 Root Cause Analysis (RCA) for AI Failures

Five Whys method:

Incident: AI missed sepsis case

Why #1: Why did AI miss sepsis?
→ Patient's early sepsis presented atypically

Why #2: Why didn't AI detect atypical presentation?
→ Training data had few atypical cases

Why #3: Why did training data lack atypical cases?
→ Training data from single academic center with specific patient population

Why #4: Why wasn't data diversity validated?
→ Validation focused on overall metrics, not subgroup coverage

Why #5: Why wasn't subgroup validation required?
→ Validation protocol didn't include diversity assessment

ROOT CAUSE: Inadequate validation protocol lacking diversity requirements

Corrective Action: Update validation to require minimum sample sizes across demographic and clinical subgroups

Preventive Action: Implement ongoing diversity monitoring in post-market surveillance

13.6.3.1 Fishbone Diagram (Ishikawa) for AI Failures

AI Failure (Missed Sepsis)
├── People:     alert fatigue; insufficient training
├── Process:    no FMEA done; inadequate validation
├── Technology: model too simple; no OOD detection; no monitoring
└── Data:       data drift; unrepresentative training set

13.6.4 Continuous Performance Monitoring

from datetime import datetime
from sklearn.metrics import roc_auc_score, accuracy_score
from scipy.stats import ks_2samp

class PostMarketSurveillance:
    """
    Continuous monitoring of AI safety and performance.
    The alerting helpers (alert_degradation, alert_data_drift,
    alert_subgroup_disparity) and reporting getters are assumed
    organizational hooks.
    """

    def __init__(self, model, baseline_metrics):
        self.model = model
        self.baseline_metrics = baseline_metrics
        self.current_metrics = {}
        self.degradation_threshold = 0.1  # 10% relative degradation triggers alert

    def monitor_performance(self, predictions, ground_truth):
        """
        Track performance metrics over time
        """
        # Calculate current metrics (metric functions imported at top)
        self.current_metrics = {
            'auroc': roc_auc_score(ground_truth, predictions),
            'accuracy': accuracy_score(ground_truth, (predictions > 0.5)),
            'sample_size': len(predictions),
            'timestamp': datetime.now()
        }

        # Check for degradation
        for metric in ['auroc', 'accuracy']:
            baseline = self.baseline_metrics[metric]
            current = self.current_metrics[metric]
            degradation = (baseline - current) / baseline

            if degradation > self.degradation_threshold:
                self.alert_degradation(metric, baseline, current, degradation)

    def monitor_data_drift(self, new_data, reference_data):
        """
        Detect distribution shift using statistical tests
        """
        drift_detected = {}

        for feature in new_data.columns:
            # Kolmogorov-Smirnov test
            statistic, p_value = ks_2samp(
                reference_data[feature],
                new_data[feature]
            )

            if p_value < 0.05:  # Significant drift
                drift_detected[feature] = {
                    'statistic': statistic,
                    'p_value': p_value
                }

        if drift_detected:
            self.alert_data_drift(drift_detected)

        return drift_detected

    def monitor_subgroup_performance(self, predictions, ground_truth, subgroups):
        """
        Ensure no subgroup experiencing degradation
        """
        subgroup_metrics = {}

        for group, mask in subgroups.items():
            if mask.sum() > 30:  # Minimum sample size
                subgroup_metrics[group] = {
                    'auroc': roc_auc_score(ground_truth[mask], predictions[mask]),
                    'sample_size': mask.sum()
                }

        # Check for subgroup disparities
        aurocs = [m['auroc'] for m in subgroup_metrics.values()]
        disparity = max(aurocs) - min(aurocs)

        if disparity > 0.1:  # 10% disparity threshold
            self.alert_subgroup_disparity(subgroup_metrics, disparity)

        return subgroup_metrics

    def generate_periodic_safety_update(self, period='quarterly'):
        """
        Periodic Safety Update Report (PSUR) as required by EU MDR
        """
        report = {
            'period': period,
            'timestamp': datetime.now(),
            'system': self.model.name,
            'deployment_stats': {
                'total_predictions': self.get_total_predictions(),
                'active_sites': self.get_active_sites(),
                'user_count': self.get_user_count()
            },
            'performance_summary': {
                'baseline_auroc': self.baseline_metrics['auroc'],
                'current_auroc': self.current_metrics['auroc'],
                'change': self.current_metrics['auroc'] - self.baseline_metrics['auroc']
            },
            'safety_events': {
                'critical': self.count_incidents('critical'),
                'major': self.count_incidents('major'),
                'minor': self.count_incidents('minor'),
                'near_miss': self.count_incidents('near_miss')
            },
            'corrective_actions': self.get_corrective_actions(),
            'recommendations': self.generate_recommendations()
        }

        return report
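
A brief usage sketch, assuming y_scores holds model risk scores and y_true the observed outcomes for a recent batch, with recent_df and training_df as pandas DataFrames of features (all variable names are illustrative):

surveillance = PostMarketSurveillance(
    model=sepsis_model,
    baseline_metrics={'auroc': 0.85, 'accuracy': 0.82}
)

# Monthly batch: compare live performance against the validation baseline
surveillance.monitor_performance(predictions=y_scores, ground_truth=y_true)

# Compare this month's feature distributions against the training reference
drift = surveillance.monitor_data_drift(new_data=recent_df, reference_data=training_df)

# Check that no demographic subgroup is silently degrading
subgroups = {'age_65_plus': recent_df['age'] >= 65,
             'age_under_65': recent_df['age'] < 65}
surveillance.monitor_subgroup_performance(y_scores, y_true, subgroups)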

13.6.5 Regulatory Reporting Requirements

13.6.5.1 FDA: MedWatch and MAUDE

When to report:

  • Death or serious injury caused by or associated with device: Within 30 days (5 work days if the event requires remedial action to prevent an unreasonable risk of substantial public harm)
  • Malfunction that could cause death/serious injury: Within 30 days

What to report:

  • Device identification (model, version; for AI, a model hash can serve in place of a serial number)
  • Patient outcomes
  • Description of event
  • Suspected cause
  • Corrective actions taken

13.6.5.2 EU: Vigilance System

Serious incidents must be reported to the competent authority:

  • Within 2 days for a serious public health threat
  • Within 10 days for death or unanticipated serious deterioration in health
  • Within 15 days for other serious incidents

Periodic Safety Update Reports (PSUR):

  • Class IIb and III devices: Updated at least annually
  • Class IIa: Updated at least every 2 years
ImportantWhat Constitutes “Serious Incident” for AI
  • Patient death or serious injury attributed to or contributed by AI system
  • Patient death/injury that the AI could have prevented but failed to (false negative)
  • Patient harm from unnecessary intervention due to false positive
  • Systematic error affecting multiple patients
  • Performance degradation below safety threshold

Gray area: When AI is “decision support” (clinician makes final call), attribution is unclear. Regulators increasingly expect reporting even when AI is indirect contributor.


13.7 Organizational Safety Culture

Technology alone cannot ensure safety. Organizations need culture, processes, and governance supporting safety.

13.7.1 Safety Management Systems

Components of effective safety management:

  1. Safety policy and objectives
    • Leadership commitment
    • Clear safety priorities
    • Resources allocated to safety
  2. Safety roles and responsibilities
    • Safety officer role
    • AI safety review board
    • Incident response team
  3. Safety risk management
    • FMEA and hazard analysis
    • Risk registers
    • Periodic risk reviews
  4. Safety assurance
    • Post-market surveillance
    • Audits and inspections
    • Continuous improvement
  5. Safety promotion
    • Training and education
    • Communication of safety information
    • Just culture and reporting

13.7.2 Training and Competency

Clinicians must understand:

  • How AI works (conceptually)
  • What AI can and cannot do
  • Limitations and failure modes
  • How to use AI safely
  • What to do when AI fails or gives unexpected results

Developers must understand:

  • Clinical context and workflows
  • Consequences of errors
  • Safety-critical systems engineering
  • Regulatory requirements
  • Ethical implications

Example training curriculum for clinicians:

Module 1: AI Basics (30 min)
- What is machine learning?
- How does our model work?
- What data was it trained on?

Module 2: Using the AI System (45 min)
- Workflow integration
- Interpreting predictions
- Understanding confidence scores
- Hands-on practice

Module 3: Limitations and Failure Modes (30 min)
- Known failure modes
- When to override AI
- Reporting errors and near-misses

Module 4: Case Studies (30 min)
- Real examples of AI failures
- Lessons learned
- Discussion and Q&A

Assessment:
- Written quiz (80% passing)
- Practical simulation (demonstrate safe use)
- Annual refresher required

13.7.3 Just Culture vs. Blame Culture

Aviation safety lesson: Punishing errors drives them underground. Just culture encourages reporting.

Just Culture principles:

  1. Distinguish errors from violations
    • Honest mistakes → Learning opportunity
    • Reckless behavior → Disciplinary action
  2. Encourage reporting
    • Blame-free for genuine errors
    • Confidential reporting systems
    • Recognition for reporting near-misses
  3. System improvements over individual blame
    • Why did system allow error to occur?
    • What barriers failed?
    • How can we prevent recurrence?

Application to AI safety:

| Blame culture | Just culture |
|---|---|
| "Dr. Smith ignored the AI alert" | "Why was the alert rate so high that ignoring became the norm?" |
| "Data scientist didn't test edge cases" | "Why didn't our validation protocol require edge case testing?" |
| "Hospital didn't monitor performance" | "What barriers prevented effective monitoring?" |
TipBuilding Safety Culture
  1. Leadership commitment: Senior leadership visibly prioritizes safety
  2. Psychological safety: Team members feel safe reporting concerns
  3. Learning mindset: View failures as learning opportunities
  4. Resources: Adequate time and budget for safety activities
  5. Recognition: Celebrate safety successes, recognize reporting
  6. Transparency: Share safety data openly (within organization)
  7. Accountability: Clear roles and responsibilities for safety
  8. Continuous improvement: Regular safety reviews and updates

13.8 Key Takeaways

  1. Safety ≠ Security ≠ Quality: Distinct dimensions requiring different frameworks

  2. Technical excellence ≠ Safe deployment: High accuracy doesn’t guarantee safety

  3. Fail-safe by design: Build safety into architecture (HITL, fallbacks, monitoring)

  4. Multiple barriers: Defense in depth prevents single point of failure

  5. Proactive hazard analysis: FMEA and risk assessment before deployment, not after harm

  6. Validation beyond ML metrics: Worst-case analysis, boundary testing, adversarial robustness

  7. Continuous monitoring: Safety validation doesn’t end at deployment—requires ongoing surveillance

  8. Regulatory compliance is minimum: FDA/EU requirements are floor, not ceiling

  9. Learn from failures: Both yours and others’ (see Appendix E)

  10. Organizational culture matters: Technology cannot overcome culture that doesn’t prioritize safety

  11. Edge cases matter more than averages: Safety depends on worst-case performance

  12. Human-centered design: Appropriate autonomy level, interpretability, workflow integration


13.9 Hands-On Exercise: Conduct Safety FMEA

Scenario: Your public health department is deploying an AI model to predict COVID-19 hospitalization risk. The model will be used to prioritize antiviral treatments during supply shortages.

Context:

  • Input: Patient demographics, comorbidities, vaccination status, symptoms
  • Output: Risk score (0-1) for hospitalization within 14 days
  • Action: High-risk patients (score > 0.7) offered monoclonal antibodies
  • Setting: Outpatient clinics, emergency departments
  • Users: Physicians, nurse practitioners, physician assistants

13.9.1 Task 1: Identify Failure Modes (20 minutes)

For each category, identify at least 2 potential failure modes:

Technical failures:
  • Data drift
  • OOD inputs
  • Adversarial examples
  • Training data issues

Operational failures:
  • Misuse/off-label use
  • Automation bias
  • Alert fatigue
  • Workflow integration

Systemic failures:
  • Cascading failures
  • Common-cause failures
  • Equity issues

13.9.2 Task 2: Complete FMEA Table (30 minutes)

For each failure mode, assess:

| Failure Mode | Cause | Effect | Severity (1-5) | Occurrence (1-5) | Detection (1-5) | RPN |
|---|---|---|---|---|---|---|
| [Your failure mode] | | | | | | |

13.9.3 Task 3: Design Mitigations (20 minutes)

For failure modes with RPN > 50, design specific mitigation strategies:

  • Design changes
  • Protective measures
  • Monitoring and detection
  • Response protocols

13.9.4 Task 4: Create Monitoring Plan (20 minutes)

Design post-deployment safety monitoring:

What metrics to track:
  • Performance metrics (AUROC, calibration)
  • Safety metrics (false negative rate in high-risk groups)
  • Operational metrics (alert burden, override rate)
  • Equity metrics (performance across demographics)

How to detect degradation (a minimal control-chart sketch follows this list):
  • Automated alerts
  • Statistical process control
  • Manual review frequency

Response protocol:
  • Degradation thresholds
  • Escalation procedures
  • Circuit breaker criteria
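
As a starting point for the statistical process control item, here is a minimal sketch of a Shewhart-style control chart on weekly error rates. The baseline values and the three-sigma rule are illustrative assumptions, not prescribed thresholds.

import numpy as np

def control_limits(baseline_error_rates):
    """Three-sigma control limits estimated from a stable baseline period."""
    center = np.mean(baseline_error_rates)
    sigma = np.std(baseline_error_rates, ddof=1)
    return center, center + 3 * sigma

def check_degradation(weekly_error_rate, upper_limit):
    """Flag a week whose error rate exceeds the upper control limit."""
    return weekly_error_rate > upper_limit

# Example with made-up weekly error rates
baseline = [0.08, 0.09, 0.07, 0.10, 0.08, 0.09]
center, ucl = control_limits(baseline)
print(f"Center {center:.3f}, upper control limit {ucl:.3f}")
print("Alert!" if check_degradation(0.15, ucl) else "In control")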

13.9.5 Task 5: Draft Incident Response Protocol (10 minutes)

Create incident response workflow:

  1. How are incidents detected and reported?
  2. Who is notified?
  3. What immediate actions are taken?
  4. How is investigation conducted?
  5. When are regulators notified?
  6. How is learning disseminated?

Check Your Understanding

Test your knowledge of the key concepts from this chapter. Click “Show Answer” to reveal the correct response and explanation.

NoteQuestion 1: AI Accuracy and Safety Validation

A COVID-19 diagnostic AI achieves 97% accuracy in validation. Why might this be insufficient for safety clearance?

Answer: Multiple safety concerns beyond accuracy:

  1. Accuracy doesn’t reveal failure mode distribution - 3% errors might all be false negatives (missing COVID cases), which is more dangerous than false positives

  2. Aggregate metric masks subgroup performance - Model might be 99% accurate on typical presentations but 70% accurate on atypical cases (elderly, immunocompromised)

  3. Test set may not represent edge cases - Validation data might not include poor image quality, operator errors, or equipment variations seen in real clinics

  4. No assessment of operational impact - High accuracy with 30% false positive rate could cause alert fatigue and system abandonment

  5. Missing safety-specific validation - No adversarial robustness testing, OOD detection, worst-case analysis, or boundary condition testing

  6. No consideration of context - In low-prevalence settings, even 97% accuracy can have very low positive predictive value

Safety clearance requires: Comprehensive FMEA, worst-case analysis, subgroup validation, operational testing, and safety-critical design (fallbacks, monitoring, HITL).

NoteQuestion 2: Autonomous AI Safety Requirements

Hospital wants to deploy fully autonomous AI (Level 5 automation) for sepsis prediction—no human review, automatic treatment orders. What safety requirements must be met?

Answer: Level 5 (fully autonomous) is almost never appropriate in healthcare. If considered, would require:

Technical requirements:
  • Near-perfect accuracy (>99.9%) across ALL subgroups
  • Certified robustness to adversarial examples
  • Reliable OOD detection
  • Continuous monitoring with real-time circuit breakers
  • Redundant fallback systems
  • Formal verification of safety properties

Regulatory requirements:
  • Class III designation (highest risk) → PMA (premarket approval), not 510(k)
  • Extensive clinical trials demonstrating superiority to human decision-making
  • Post-market surveillance plan with frequent reporting
  • Predetermined change control plan for any updates

Operational requirements:
  • Comprehensive FMEA with all RPN < 10
  • Incident response protocol with immediate escalation
  • Human monitoring capability (contradiction: “autonomous” yet monitored)
  • Clear liability assignment

Ethical requirements:
  • Patient consent for fully autonomous care
  • Opt-out mechanism
  • Demonstration that benefits significantly outweigh risks

Realistic recommendation: Use Level 2-3 (decision support or human approval required) instead. Full autonomy creates unacceptable liability, technical, and ethical risks for life-critical decisions.

NoteQuestion 3: Incident Response for Model Degradation

Your AI model’s performance begins degrading after 6 months in production (AUROC dropped from 0.85 to 0.78). Walk through the incident response steps.

Answer: Systematic incident response:

Step 1: Detect and confirm (24-48 hours)
  • Automated monitoring alerts to degradation
  • Confirm with manual validation on recent data
  • Calculate magnitude and statistical significance of degradation
  • Assess impact: How many patients affected? Has any harm occurred?

Step 2: Immediate response (Day 1)
  • Classify severity: Major (significant degradation) or Critical (if harm occurred)
  • Notify safety officer, clinical leadership, development team
  • DO NOT immediately disable system (could disrupt care) unless degradation is severe
  • Increase human oversight temporarily (lower automation level)
  • Consider lowering confidence threshold (increase sensitivity at cost of specificity)

Step 3: Root cause investigation (Days 2-7)
  • Data drift analysis: Has the population changed? New demographics, comorbidities?
  • Concept drift analysis: Have clinical practices changed? New protocols, medications?
  • System changes: EHR updates, integration changes, data pipeline issues?
  • Ground truth verification: Is the degradation real or a measurement artifact?

Step 4: Develop corrective actions (Days 7-14)
  • If data drift: Retrain model on recent data
  • If concept drift: Retrain with updated labels
  • If system issue: Fix the technical problem
  • If measurement artifact: Correct the monitoring system

Step 5: Validate correction (Days 14-21)
  • Test updated model on held-out recent data
  • Confirm performance restored
  • Perform safety validation (not just performance)
  • Subgroup analysis to ensure no equity issues

Step 6: Deploy correction and monitor closely (Day 21+)
  • Deploy updated model
  • Increase monitoring frequency temporarily
  • Verify performance in production
  • Document lessons learned

Step 7: Regulatory reporting
  • If harm occurred: FDA MedWatch within 30 days, EU vigilance within 10-15 days
  • If no harm: Include in periodic safety update report (PSUR)

Step 8: Preventive actions
  • Update monitoring to detect this drift pattern earlier
  • Implement automated retraining protocol
  • Update FMEA with new failure mode
  • Train team on lessons learned

NoteQuestion 4: FMEA vs. Traditional ML Evaluation

Compare FMEA vs. traditional ML evaluation. When is each appropriate?

Answer:

| Dimension | Traditional ML Evaluation | FMEA (Failure Mode and Effects Analysis) |
|---|---|---|
| Goal | Assess average performance | Identify potential failure modes and harms |
| Timing | After model training | Before deployment (proactive) |
| Scope | Model performance only | End-to-end system including operational context |
| Metrics | Accuracy, AUROC, precision, recall | Risk Priority Number (Severity × Occurrence × Detection) |
| Errors treated | All errors weighted equally | Errors weighted by clinical severity |
| Team | Data scientists | Multidisciplinary (clinicians, safety, QA, patients) |
| Output | Performance report | Risk mitigation plan |

When to use ML evaluation:
  • Iterative model development
  • Comparing model architectures
  • Hyperparameter tuning
  • Model selection

When to use FMEA:
  • Before clinical deployment
  • After major system changes
  • Following incidents or near-misses
  • Annual safety reviews
  • Regulatory submissions (ISO 14971 requirement)

Why both are needed:

ML evaluation asks: “How accurate is the model?”

FMEA asks: “What can go wrong, how bad would it be, and how do we prevent it?”

Example:

  • ML evaluation: “Model has 90% sensitivity and 85% specificity” ✓
  • FMEA: “5 failure modes identified, including ‘false negative in early-stage cancer’ with RPN=100 (high risk). Mitigations: lower threshold for high-risk demographics, ensemble with rule-based screening, manual review for borderline cases.” ✓✓

You cannot safely deploy with ML evaluation alone.

NoteQuestion 5: FDA vs. EU MDR Regulatory Requirements

Regulatory perspective: FDA vs. EU MDR safety requirements for AI. What are the key differences?

Answer:

| Aspect | FDA (US) | EU MDR |
|---|---|---|
| Primary pathway | 510(k) - substantial equivalence (most AI) | Conformity assessment with Notified Body |
| Clinical evidence | Can rely on equivalence to predicate device; less clinical data required | Requires robust clinical evaluation; equivalence harder to claim |
| Adaptive algorithms | Predetermined Change Control Plan (PCCP) allows updates without new submission | More restrictive; significant changes require new assessment |
| Post-market | Voluntary post-market surveillance for Class I/II; MedWatch reporting | Mandatory Post-Market Clinical Follow-up (PMCF); Periodic Safety Update Reports (PSUR) |
| Transparency | Less stringent explainability requirements | MDR requires information about how the algorithm works |
| Timeline | Faster (510(k): 3-6 months) | Slower (12-18+ months with Notified Body) |
| Validation rigor | Analytical + clinical validation (clinical often retrospective) | Strong emphasis on prospective clinical evidence |
| Updates | PCCP allows planned updates pre-approved | Each significant change reassessed |

Key philosophical difference:

FDA approach: Innovation-friendly, risk-based, allows incremental validation, trusts substantial equivalence

EU MDR approach: Precautionary, requires stronger evidence upfront, less reliance on equivalence, more stringent post-market requirements

Practical implications:

  1. Development strategy: EU market requires more upfront clinical data; plan prospective studies early

  2. Adaptive AI: If model will update frequently, FDA PCCP pathway more viable than EU

  3. Cost: EU MDR compliance generally more expensive (Notified Body fees, extensive clinical evaluation)

  4. Timeline: Plan 12-18 months for EU vs. 6-9 months for FDA 510(k)

  5. Post-market: EU requires ongoing clinical follow-up and regular PSUR; budget for this

Divergence increasing: US, EU, and UK (post-Brexit) regulatory approaches diverging. AI companies may need separate validation strategies per market.

NoteQuestion 6: Safety Interlocks for Chemotherapy AI

Design safety interlocks for chemotherapy dose recommendation AI. What checks must occur before allowing dosing?

Answer: Multi-layered safety interlocks:

Layer 1: Input Validation

def validate_chemotherapy_inputs(patient_data):
    checks_failed = []

    # Range checks
    if patient_data['weight'] < 20 or patient_data['weight'] > 200:
        checks_failed.append("Weight out of plausible range")

    if patient_data['bsa'] < 0.5 or patient_data['bsa'] > 3.0:
        checks_failed.append("Body surface area out of range")

    # Required labs present and recent
    # (the '<lab>_date' fields are assumed to hold the lab's age in days)
    required_labs = ['creatinine', 'bilirubin', 'wbc', 'platelets', 'hemoglobin']
    for lab in required_labs:
        if lab not in patient_data:
            checks_failed.append(f"Missing required lab: {lab}")
        elif patient_data.get(f'{lab}_date', 0) > 7:  # days old
            checks_failed.append(f"Lab too old: {lab} ({patient_data[f'{lab}_date']} days)")

    # Organ function adequate for chemotherapy
    if patient_data['creatinine'] > 2.0:
        checks_failed.append("Renal function inadequate (Cr > 2.0)")

    if patient_data['bilirubin'] > 2.0:
        checks_failed.append("Hepatic function inadequate (Bili > 2.0)")

    if checks_failed:
        raise ValueError(f"Input validation failed: {checks_failed}")

    return True

Layer 2: Dose Bounds Checking

def validate_dose_recommendation(drug, dose, patient_bsa):
    # Check against standard dosing ranges
    standard_doses = {
        'doxorubicin': {'min': 40, 'max': 75, 'unit': 'mg/m2'},
        'cyclophosphamide': {'min': 500, 'max': 1500, 'unit': 'mg/m2'},
        # etc.
    }

    dose_per_bsa = dose / patient_bsa

    if dose_per_bsa < standard_doses[drug]['min']:
        raise ValueError(f"Dose below standard range: {dose_per_bsa} < {standard_doses[drug]['min']} {standard_doses[drug]['unit']}")

    if dose_per_bsa > standard_doses[drug]['max']:
        raise ValueError(f"⚠️  CRITICAL: Dose above standard range: {dose_per_bsa} > {standard_doses[drug]['max']} {standard_doses[drug]['unit']}")

    return True

Layer 3: Cross-Check with Clinical Guidelines

def cross_check_with_guidelines(diagnosis, regimen, patient):
    # Verify regimen appropriate for diagnosis
    # (get_nccn_guidelines is an assumed lookup into an NCCN guideline service)
    approved_regimens = get_nccn_guidelines(diagnosis)

    if regimen not in approved_regimens:
        return {
            'alert': 'WARNING',
            'message': f'Regimen {regimen} not in NCCN guidelines for {diagnosis}',
            'action': 'require_pharmacist_approval'
        }

    # Check for contraindications
    if 'doxorubicin' in regimen and patient['cardiac_ejection_fraction'] < 50:
        return {
            'alert': 'CONTRAINDICATION',
            'message': 'Doxorubicin contraindicated with EF < 50%',
            'action': 'block_order'
        }

    return {'alert': 'PASS'}

Layer 4: Drug Interaction Check

def check_drug_interactions(chemotherapy_drugs, current_medications):
    critical_interactions = []

    # load_interaction_database is an assumed hook into a drug-interaction database
    interaction_db = load_interaction_database()

    for chemo_drug in chemotherapy_drugs:
        for med in current_medications:
            interaction = interaction_db.check(chemo_drug, med)
            if interaction['severity'] == 'severe':
                critical_interactions.append({
                    'drug1': chemo_drug,
                    'drug2': med,
                    'interaction': interaction['description'],
                    'recommendation': interaction['recommendation']
                })

    if critical_interactions:
        return {
            'alert': 'CRITICAL INTERACTION',
            'interactions': critical_interactions,
            'action': 'require_oncologist_review'
        }

    return {'alert': 'PASS'}

Layer 5: Mandatory Human Review

class ChemotherapyAI:
    def recommend_dose(self, patient_data):
        # AI makes recommendation
        recommendation = self.model.predict(patient_data)

        # All safety interlocks MUST pass
        validate_chemotherapy_inputs(patient_data)
        validate_dose_recommendation(
            recommendation['drug'],
            recommendation['dose'],
            patient_data['bsa']
        )
        guideline_check = cross_check_with_guidelines(
            patient_data['diagnosis'],
            recommendation['regimen'],
            patient_data
        )
        interaction_check = check_drug_interactions(
            recommendation['drugs'],
            patient_data['current_meds']
        )

        # Add mandatory review flags
        recommendation['require_oncologist_approval'] = True
        recommendation['require_pharmacist_verification'] = True
        recommendation['safety_checks'] = {
            'input_validation': 'PASS',
            'dose_bounds': 'PASS',
            'guidelines': guideline_check,
            'interactions': interaction_check
        }

        # NEVER allow fully autonomous chemotherapy ordering
        recommendation['autonomous_ordering'] = False

        return recommendation

Key safety principles:

  1. Multiple independent checks: Input validation, dose bounds, guidelines, interactions
  2. Fail-safe: Any check failure blocks recommendation
  3. Mandatory human approval: Oncologist AND pharmacist must review
  4. Audit trail: Document all checks and approvals
  5. Hard limits: Some checks (contraindications) block entirely, not just warn
  6. Conservative defaults: When uncertain, require human review

Chemotherapy is Level 3 automation at most (AI recommends, human approves), never Level 4-5.


13.10 Discussion Questions

  1. Ethics vs. Safety: When is it ethical to deploy AI with known failure modes? What risk level is acceptable?

  2. Liability: When a safely-designed AI still causes harm (no negligence, just residual risk), who should be liable: developer, hospital, clinician, or patient assumes risk?

  3. Safety standards: Should healthcare AI face higher safety standards than commercial aviation? Why or why not? What about compared to pharmaceuticals?

  4. Innovation vs. Safety: How do we balance innovation velocity (rapid deployment of beneficial AI) with safety rigor (extensive validation)? Is FDA too slow or too fast?

  5. Proof of safety: Can we ever prove AI is “safe enough”? What’s the evidentiary standard? How does this compare to drugs (RCTs) or procedures (surgical training)?

  6. Automation levels: As AI becomes more capable, should we allow higher automation (Level 4-5), or should healthcare always require human decision-maker? What about non-critical decisions?

  7. Patient agency: Should patients have the right to opt out of AI-assisted care? The right to know when AI was used? The right to appeal AI decisions?

  8. Failure transparency: Should AI failures be publicly disclosed (like aviation accidents)? Or does this create liability disincentives and chill innovation?


13.11 Further Resources

13.11.1 Essential Books

13.11.2 Key Papers

13.11.2.1 Foundational AI Safety

13.11.2.2 Healthcare AI Safety

13.11.2.3 Medical Device Regulation

13.11.2.4 Adversarial Robustness

13.11.3 Regulatory Guidance

13.11.3.1 FDA

13.11.3.2 European Union

13.11.3.3 UK (MHRA)

13.11.4 Standards

13.11.5 Tools and Frameworks

13.11.5.1 Safety Validation

  • Foolbox - Python library for adversarial robustness testing

  • CleverHans - Library for adversarial examples

  • AI Verify - Toolkit for validating AI systems

13.11.5.2 Risk Management

13.11.5.3 Monitoring

  • Evidently AI - Open-source ML monitoring (drift detection)

  • Whylabs - Data and ML monitoring platform

  • Fiddler - AI observability and monitoring

13.11.6 Online Courses


Next: Chapter 12: Deployment, Monitoring, and Maintenance → covers the “how” of deploying AI that has passed safety validation.