This chapter addresses AI safety in healthcare and public health. You will learn to:
Prerequisites: Evaluating AI Systems, Ethics, Bias, and Equity, Privacy, Security, and Governance.
In July 2021, researchers from the University of Michigan published a sobering study in JAMA Internal Medicine that sent shockwaves through the healthcare AI community (Wong et al., 2021). They evaluated Epic’s Sepsis Model (ESM), one of the most widely deployed AI systems in American hospitals, used to predict which patients would develop sepsis—a life-threatening condition requiring immediate intervention.
The results were devastating:
Real-world consequences:
The model wasn’t just inaccurate—it was clinically dangerous. With 87% false alarms, clinicians experienced severe alert fatigue, learning to ignore the very warnings meant to save lives. Meanwhile, one in three actual sepsis cases went undetected until patients became critically ill.
Epic had validated the model on historical data and achieved impressive-looking metrics. But those metrics didn’t translate to real-world safety. The system passed technical validation but failed the fundamental test: Does this make patients safer?
What went wrong?
This wasn’t a deployment failure or an MLOps problem. It was a safety failure at multiple levels:
The Epic sepsis case crystallizes why healthcare AI needs rigorous safety frameworks, not just high accuracy scores.
Technical excellence does not guarantee safety. A model can achieve 90% accuracy and still cause catastrophic harm if deployed without safety-critical validation.
This chapter provides the frameworks to prevent such failures.
Many practitioners conflate these concepts. They are distinct but complementary:
| Dimension | What It Protects | Primary Threats | Example Failure |
|---|---|---|---|
| Safety | Patients/populations from harm caused by system behavior | Design flaws, operational failures, edge cases | Epic sepsis: missed 37% of cases |
| Security | Systems/data from malicious actors | Hackers, unauthorized access, tampering | Anthem breach: 78M records stolen |
| Privacy | Sensitive data from unauthorized disclosure | Re-identification, inference attacks | Cambridge Analytica: 87M users exploited |
| Quality | Meeting specified requirements | Implementation bugs, spec violations | Software crashes, incorrect calculations |
Why the distinction matters:
Safety is about operational behavior in the real world, not just technical correctness in controlled environments.
The Epic case isn’t isolated. Healthcare AI has a safety problem:
FDA recalls and safety alerts (2019-2024):
High-profile failures:
The deployment gap:
According to Sendak et al. (2020, Nature Medicine), fewer than 20% of healthcare AI models that show promise in research actually get deployed. Safety concerns are a major barrier.
Why traditional medical device regulation falls short:
Traditional FDA/EU frameworks assume:
AI systems violate all these assumptions:
Example: Concept drift in sepsis prediction
Finlayson et al. (2021, Nature Medicine) showed systematic performance degradation:
This degradation happened silently—no alarms, no warnings, just gradually increasing patient risk.
We address AI safety comprehensively:
What we don’t cover:
Our focus: AI-specific safety challenges in healthcare and public health applications.
Overview: International standard for software development lifecycle of medical devices.
Safety classification (determines rigor of development process):
| Class | Definition | Examples | Requirements |
|---|---|---|---|
| A | No injury or damage to health possible | Administrative tools, non-diagnostic displays | Basic documentation |
| B | Non-serious injury possible | Decision support for non-critical conditions | Moderate rigor |
| C | Death or serious injury possible | Diagnostic tools for critical conditions, treatment recommendations | Maximum rigor |
Most healthcare AI falls into Class B or C.
Key requirements for Class C software:
Traditional IEC 62304 assumes deterministic software with reviewable code. AI systems challenge this:
FDA and EU regulators are updating guidance to address AI-specific concerns, but a fundamental tension remains: safety frameworks designed for transparency are being applied to inherently opaque systems.
Overview: Systematic process for identifying and controlling risks throughout device lifecycle.
Five-step risk management process:
```mermaid
graph LR
    A[1. Risk Analysis] --> B[2. Risk Evaluation]
    B --> C[3. Risk Control]
    C --> D[4. Residual Risk Evaluation]
    D --> E[5. Risk Management Report]
    E --> F[Post-Market Surveillance]
    F --> A
```

Step 1: Risk Analysis
Identify hazards, hazardous situations, and harms:
Step 2: Risk Evaluation
Estimate risk using:
Risk matrix example:
| Probability ↓ Severity → | Negligible (1) | Minor (2) | Moderate (3) | Major (4) | Catastrophic (5) |
|---|---|---|---|---|---|
| Frequent (5) | 5 | 10 | 15 | 20 | 25 |
| Probable (4) | 4 | 8 | 12 | 16 | 20 |
| Occasional (3) | 3 | 6 | 9 | 12 | 15 |
| Remote (2) | 2 | 4 | 6 | 8 | 10 |
| Improbable (1) | 1 | 2 | 3 | 4 | 5 |
Risk acceptability:

- 1-5: Acceptable (monitor)
- 6-12: ALARP (As Low As Reasonably Practicable), reduce if feasible
- 15-25: Unacceptable, must reduce before deployment
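The matrix and acceptability bands above can be written as a small lookup. The sketch below assumes the score is simply probability × severity on the 1-5 scales shown; the function name `classify_risk` and the band labels are illustrative and not taken from any standard.

```python
def classify_risk(probability: int, severity: int) -> tuple[int, str]:
    """Return (score, band) for 1-5 probability and severity ratings."""
    if not (1 <= probability <= 5 and 1 <= severity <= 5):
        raise ValueError("probability and severity must be integers from 1 to 5")
    score = probability * severity
    if score <= 5:
        band = "Acceptable (monitor)"
    elif score <= 12:
        band = "ALARP (reduce if feasible)"
    else:
        # Scores of 13 and 14 cannot occur as a product of two 1-5 ratings,
        # so this branch covers the 15-25 band from the matrix.
        band = "Unacceptable (must reduce before deployment)"
    return score, band

# Occasional (3) probability of a Catastrophic (5) harm
score, band = classify_risk(probability=3, severity=5)
print(score, band)
```

A multiplicative score treats a frequent minor harm and a rare catastrophic harm as equivalent when their products match, so a team may reasonably choose to review high-severity cells separately.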
Step 3: Risk Control
Hierarchy of control measures (in order of preference):
Step 4: Residual Risk Evaluation
After controls, re-evaluate:

- Is residual risk acceptable?
- Do control measures introduce new risks?
- Is overall benefit-risk ratio favorable?
Step 5: Risk Management Report
Document:

- All identified hazards
- Risk estimates before/after controls
- Rationale for acceptability decisions
- Post-market surveillance plan
FDA’s AI/ML Action Plan (2021) addresses unique AI challenges:
1. Good Machine Learning Practice (GMLP)
Ten guiding principles:
2. Predetermined Change Control Plans (PCCP)
Allows approved AI/ML models to evolve without new 510(k) for each change, if:
Example SPS (SaMD Pre-Specifications) for diabetic retinopathy screening:

- Retrain with new data quarterly
- Add image preprocessing techniques
- Update decision threshold based on population
- NOT allowed: Change intended use, add new conditions
3. Real-World Performance Monitoring
FDA expects:

- Continuous monitoring of safety and effectiveness
- Detection of performance degradation
- Timely response to safety signals
FDA risk categorization for SaMD:
| State of Healthcare Situation | Significance of Information Provided: Treat/Diagnose |
|---|---|
| Critical | III |
| Serious | III |
| Non-Serious | II |
Approval pathways:
90% of AI medical devices clear through 510(k) by claiming “substantial equivalence” to existing devices.
Problem: New AI devices are often compared to predicate devices from the 1990s-2000s that were never rigorously validated. This creates a “grandfather paradox”: new AI is judged against old non-AI tools that never proved clinical utility.
Recent FDA tightening: Requiring more clinical evidence, especially for high-risk applications.
EU Medical Device Regulation (2017/745) took full effect May 2021, replacing older directives.
Key requirements for AI SaMD:
1. Clinical Evaluation
Much more rigorous than FDA 510(k):
2. Risk Classification
| Rule | AI Application | Class |
|---|---|---|
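| Rule 11 | Software informing diagnostic or therapeutic decisions | IIa (moderate-low risk) |
| Rule 11 | Software whose decisions could cause serious deterioration of health or a surgical intervention | IIb (moderate-high risk) |
| Rule 11 | Software whose decisions could cause death or irreversible deterioration of health | III (high risk) |
| Rule 11 | Software monitoring vital physiological parameters where variations could result in immediate danger | IIb |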
Most diagnostic AI = Class IIa or IIb (requiring Notified Body involvement).
3. Transparency and Explainability
MDR requires:
4. Post-Market Surveillance
Manufacturers must:
Post-Brexit, the UK is developing its own framework:
Software and AI as a Medical Device (SAMDaaS) initiative emphasizes:
FDA, EU, and UK increasingly diverging on AI regulation:
Implication: Medical AI companies need three different validation strategies for US, EU, UK markets.
Healthcare can learn from industries with mature safety cultures:
Swiss Cheese Model of Accidents (James Reason)
```
[Hazard] → | Hole | → | Hole | → | Hole | → | Hole | → [Accident]
            Layer 1    Layer 2    Layer 3     Layer 4
            (Design)  (Training) (Procedure) (Defense)
```
Key insight: Accidents require multiple failures aligning. Safety requires defense in depth—multiple independent layers.
Application to AI:
Just Culture: Aviation learned that blame-free reporting increases safety. Healthcare AI needs a similar culture, because punishing failures drives them underground.
Principles from nuclear industry:
Application to AI:
Fail-safe: System defaults to safe state on failure (e.g., railway signals default to red)
Fail-secure: System maintains security on failure (e.g., door lock stays locked)
For medical AI: Usually want fail-safe (defer to human) not fail-secure (block access).
Example: Sepsis prediction model fails → Alert clinician to manual assessment (fail-safe) NOT continue with last prediction (fail-secure).
Traditional software has well-understood failure modes (bugs, crashes, security vulnerabilities). AI introduces novel failure modes:
1. Data Drift and Distribution Shift
Definition: Training data distribution differs from deployment distribution.
Types:
Why it’s dangerous: Model performance silently degrades without obvious errors or crashes.
Real-world example: Finlayson et al., 2021. Sepsis prediction AUROC dropped from 0.77 to 0.63 over 3 years due to:

- Changes in clinical documentation practices
- Updates to EHR systems
- Shifts in patient population
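Because this kind of degradation produces no errors or crashes, it has to be looked for explicitly. One common way to do that is to compare the distribution of each input feature at deployment against its training-era distribution. The sketch below uses the Population Stability Index (PSI) as one possible drift statistic; the function name, the synthetic "lactate" feature, and the choice of ten quantile bins are all assumptions made for illustration.

```python
import numpy as np

def population_stability_index(reference, current, bins=10):
    """PSI between a reference (training-era) sample and a current sample."""
    reference = np.asarray(reference, dtype=float)
    current = np.asarray(current, dtype=float)
    # Interior bin edges are reference quantiles, so each bin holds roughly
    # equal reference mass; values beyond the range fall in the end bins.
    inner_edges = np.quantile(reference, np.linspace(0, 1, bins + 1))[1:-1]
    ref_counts = np.bincount(np.searchsorted(inner_edges, reference), minlength=bins)
    cur_counts = np.bincount(np.searchsorted(inner_edges, current), minlength=bins)
    # A small floor avoids log(0) when a bin is empty
    ref_frac = np.clip(ref_counts / len(reference), 1e-6, None)
    cur_frac = np.clip(cur_counts / len(current), 1e-6, None)
    return float(np.sum((cur_frac - ref_frac) * np.log(cur_frac / ref_frac)))

# Synthetic data only: a feature at training time, the same feature with no
# drift, and the same feature after a shift in mean and spread.
rng = np.random.default_rng(0)
train_lactate = rng.normal(2.0, 0.5, 5000)
stable_lactate = rng.normal(2.0, 0.5, 5000)
shifted_lactate = rng.normal(2.6, 0.7, 5000)

psi_stable = population_stability_index(train_lactate, stable_lactate)
psi_shifted = population_stability_index(train_lactate, shifted_lactate)
print(f"PSI, no drift: {psi_stable:.3f}")
print(f"PSI, drifted:  {psi_shifted:.3f}")
```

A frequently quoted rule of thumb treats PSI below 0.1 as stable and above 0.25 as a substantial shift. That is a convention, not a clinically validated threshold, and input drift alone does not show that model performance has changed.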
2. Adversarial Examples
Definition: Inputs deliberately crafted to fool model, often imperceptible to humans.
Medical imaging example:
```python
# Simplified adversarial example
import numpy as np

def generate_adversarial_example(model, image, true_label, epsilon=0.01):
    """
    Generate adversarial perturbation using Fast Gradient Sign Method

    Parameters:
    - model: Trained classifier
    - image: Original medical image (e.g., X-ray)
    - true_label: Correct diagnosis
    - epsilon: Perturbation magnitude (small = imperceptible)

    Returns:
    - Adversarial image intended to make the model misclassify
    """
    # Get model's gradient with respect to input
    # (get_loss_gradient is a placeholder for the framework's gradient call)
    loss_gradient = model.get_loss_gradient(image, true_label)

    # Create perturbation in direction that increases loss
    perturbation = epsilon * np.sign(loss_gradient)

    # Add imperceptible noise
    adversarial_image = image + perturbation

    # Model may now misclassify, although a human sees no difference
    return adversarial_image

# Example: Chest X-ray (model and xray are assumed to be defined elsewhere)
original_prediction = model.predict(xray)  # "Pneumonia: 95% confidence"
adversarial_xray = generate_adversarial_example(model, xray, "pneumonia")
adversarial_prediction = model.predict(adversarial_xray)  # "Normal: 92% confidence"
# ⚠️ Radiologist sees identical images, but model flips diagnosis
```

Why it matters:
Defense strategies:
3. Spurious Correlations
Definition: Model learns from confounders or artifacts rather than true causal features.
Classic example: Zech et al., 2018, PLoS Medicine
Safety hazard: Model works in training hospital, fails completely when deployed elsewhere.
Another example: COVID-19 detection from chest X-rays
Many models learned:

- Patients with COVID → Admitted to hospital → Supine position X-rays
- Healthy people → Outpatient imaging → Standing position X-rays
- Model detected position (supine vs. standing) rather than COVID
4. Training Data Poisoning
Definition: Malicious or erroneous data corrupts training set.
Scenarios:
Example: Model trained on EHR data with documentation errors
Consequence: Model learns incorrect patterns from noisy labels.
5. Out-of-Distribution (OOD) Detection Failure
Definition: Model makes confident predictions on data unlike anything in training set.
Example: Diabetic retinopathy screening AI deployed in rural clinic
Safety requirement: Model must detect OOD inputs and defer to human.
```python
import logging

def safe_prediction_with_ood_detection(model, ood_detector, input_data, threshold):
    """
    Make prediction only if input is in-distribution

    Parameters:
    - threshold: OOD score above which the input is deferred to a human;
      it must be calibrated on validation data

    Returns:
    - (prediction, confidence) if in-distribution
    - (None, reason) if out-of-distribution
    """
    # Check if input is similar to training distribution
    ood_score = ood_detector.score(input_data)

    if ood_score > threshold:
        # Out of distribution - defer to human
        logging.warning("OOD input detected: %s", ood_score)
        return None, "Input outside model's training distribution - manual review required"

    # In distribution - proceed with prediction
    prediction = model.predict(input_data)
    # model.confidence is assumed to expose the confidence of the last prediction
    return prediction, model.confidence
```

6. Misuse and Off-Label Use
Definition: Using AI for purposes beyond validated scope.
Examples:
Why it happens:
Safety requirement: Clear labeling of validated use cases and populations.
7. Automation Bias
Definition: Tendency to favor automated decisions over human judgment, even when human is correct.
Study: Goddard et al., 2012, Journal of General Internal Medicine
Application to AI: High-confidence predictions can override clinical judgment, even when AI is wrong.
Mitigation:
8. Alert Fatigue
Definition: Desensitization to warnings due to high false positive rate.
Epic sepsis model: 87% false alarm rate → Clinicians learned to ignore alerts → Missed actual sepsis cases
Threshold:
90% false positives: Clinicians ignore almost all alerts
Trade-off: Sensitivity vs. specificity
Safety-critical design: Must account for alert burden in operational context, not just optimize AUROC.
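A back-of-the-envelope calculation shows why alert burden depends on the operating context and not only on the model. The sketch below applies Bayes' rule to hypothetical sensitivity, specificity, and prevalence values; the numbers are illustrative and are not the Epic model's figures, and `alert_burden` is a name chosen here.

```python
def alert_burden(sensitivity, specificity, prevalence, n_patients=1000):
    """Expected alert counts and positive predictive value for a cohort."""
    true_alerts = sensitivity * prevalence * n_patients
    false_alerts = (1 - specificity) * (1 - prevalence) * n_patients
    alerts = true_alerts + false_alerts
    ppv = true_alerts / alerts if alerts else float("nan")
    return {
        "alerts": alerts,
        "true_alerts": true_alerts,
        "false_alerts": false_alerts,
        "ppv": ppv,
    }

# Same hypothetical model, three different prevalence settings
for prevalence in (0.02, 0.10, 0.30):
    r = alert_burden(sensitivity=0.80, specificity=0.85, prevalence=prevalence)
    print(
        f"prevalence={prevalence:.0%}: {r['alerts']:.0f} alerts per 1000 patients, "
        f"{1 - r['ppv']:.0%} of them false"
    )
```

With sensitivity and specificity held fixed, the share of false alerts rises sharply as prevalence falls, which is why a threshold tuned on one ward or population can create an unmanageable alert load on another.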
9. Workflow Integration Failures
Definition: AI disrupts clinical workflow, leading to workarounds or abandonment.
Example: AI diagnostic tool requires:
Consequence: Clinicians skip using tool or use incorrectly.
Safety requirement: AI must fit seamlessly into existing workflow.
10. Cascading Failures
Definition: AI failure triggers failures in dependent systems.
Example: Hospital resource allocation AI
System view: AI is embedded in complex sociotechnical system. Local optimization can create global failure.
11. Common-Cause Failures
Definition: Single event causes multiple AI systems to fail simultaneously.
Example: EHR system update changes data format
Mitigation: Diversity in system architecture, independent fallback systems.
12. Latent Failures
Definition: Errors or vulnerabilities that exist but don’t cause harm until triggered by specific conditions.
Example: Model trained predominantly on data from one demographic
Swiss cheese model: Latent failures are “holes” waiting for alignment.
FMEA: Systematic method for identifying failure modes before they cause harm.
Process:
Scenario: Sepsis prediction model
| Failure Mode | Potential Cause | Effect | Severity (1-5) | Occurrence (1-5) | Detection (1-5) | RPN | Mitigation |
|---|---|---|---|---|---|---|---|
| Model misses sepsis case (false negative) | Data drift, edge case, training bias | Patient doesn’t receive timely treatment → Death/severe harm | 5 | 3 | 4 | 60 | 1) Human review for borderline cases 2) Ensemble with rule-based backup 3) Lower confidence threshold |
| False alarm (false positive) | Low specificity, noisy input data | Alert fatigue → Ignore future alerts | 4 | 5 | 2 | 40 | 1) Increase specificity threshold 2) Two-stage alerting 3) Monitor alert burden |
| Model unavailable (system crash) | Server failure, network outage | No predictions → Revert to standard care | 3 | 2 | 1 | 6 | 1) Redundant systems 2) Graceful degradation 3) Automatic failover |
| Data drift (silent degradation) | Population changes, practice changes | Gradually increasing errors | 5 | 4 | 5 | 100 | 1) Continuous monitoring 2) Automated drift detection 3) Retraining protocol |
| OOD input (unfamiliar case) | Patient outside training distribution | Confident wrong prediction | 5 | 3 | 4 | 60 | 1) OOD detection 2) Uncertainty quantification 3) Defer to human |
| Adversarial attack | Malicious manipulation of input | Incorrect diagnosis | 5 | 1 | 5 | 25 | 1) Input validation 2) Adversarial training 3) Anomaly detection |
Severity scale:

- 1: No harm
- 2: Minor harm (temporary, reversible)
- 3: Moderate harm (prolonged recovery)
- 4: Major harm (permanent injury)
- 5: Catastrophic (death or multiple major harms)

Occurrence scale:

- 1: Very rare (< 1 in 10,000)
- 2: Rare (1 in 1,000)
- 3: Occasional (1 in 100)
- 4: Frequent (1 in 10)
- 5: Very frequent (> 1 in 10)

Detection scale:

- 1: Almost certain to detect before harm
- 2: High chance of detection
- 3: Moderate chance
- 4: Low chance
- 5: Almost certain NOT to detect

Risk Priority Number (RPN = Severity × Occurrence × Detection):

- 1-30: Low risk (monitor)
- 31-100: Moderate risk (reduce if feasible)
- 101-125: High risk (must reduce before deployment)
Post-mitigation: Re-calculate RPN after implementing controls. Goal: All RPN < 100, ideally < 50.
Conduct FMEA with multidisciplinary team:
Timing: Before deployment, then updated:

- When model changes
- When deployment context changes
- After incidents or near-misses
- Annually at minimum
```python
import pandas as pd

class AISystemFMEA:
    """
    Systematic FMEA for AI healthcare systems
    """
    def __init__(self, system_name):
        self.system_name = system_name
        self.failure_modes = []

    def add_failure_mode(self, mode, cause, effect, severity, occurrence, detection):
        """
        Add failure mode to FMEA

        Parameters:
        - mode: Description of failure
        - cause: What causes this failure
        - effect: Consequence of failure
        - severity: 1-5 (1=negligible, 5=catastrophic)
        - occurrence: 1-5 (1=rare, 5=frequent)
        - detection: 1-5 (1=certain to detect, 5=unlikely to detect)
        """
        rpn = severity * occurrence * detection
        self.failure_modes.append({
            'Failure Mode': mode,
            'Cause': cause,
            'Effect': effect,
            'Severity': severity,
            'Occurrence': occurrence,
            'Detection': detection,
            'RPN': rpn,
            'Risk Level': self._classify_risk(rpn)
        })

    def _classify_risk(self, rpn):
        if rpn <= 30:
            return 'Low'
        elif rpn <= 100:
            return 'Moderate'
        else:
            return 'High'

    def generate_report(self):
        """Generate FMEA report sorted by RPN"""
        df = pd.DataFrame(self.failure_modes)
        df = df.sort_values('RPN', ascending=False)

        print(f"\n{'='*80}")
        print(f"FMEA Report: {self.system_name}")
        print(f"{'='*80}\n")
        print(df.to_string(index=False))

        # Risk summary
        risk_counts = df['Risk Level'].value_counts()
        print(f"\n{'='*80}")
        print("Risk Summary:")
        print(f"  High Risk: {risk_counts.get('High', 0)} failure modes (RPN > 100)")
        print(f"  Moderate Risk: {risk_counts.get('Moderate', 0)} failure modes (RPN 31-100)")
        print(f"  Low Risk: {risk_counts.get('Low', 0)} failure modes (RPN ≤ 30)")

        if risk_counts.get('High', 0) > 0:
            print(f"\n⚠️ WARNING: {risk_counts.get('High', 0)} HIGH RISK failure modes must be mitigated before deployment")

        return df

# Example usage
fmea = AISystemFMEA("Sepsis Prediction Model")

# Add failure modes from clinical team
fmea.add_failure_mode(
    mode="False negative (missed sepsis case)",
    cause="Data drift, patient outside training distribution",
    effect="Delayed treatment → Patient death",
    severity=5,
    occurrence=3,
    detection=4
)
fmea.add_failure_mode(
    mode="False alarm (false positive)",
    cause="Low model specificity",
    effect="Alert fatigue → Future alerts ignored",
    severity=4,
    occurrence=5,
    detection=2
)
fmea.add_failure_mode(
    mode="Silent performance degradation",
    cause="Concept drift over time",
    effect="Gradually increasing error rate",
    severity=5,
    occurrence=4,
    detection=5
)

# Generate report
report = fmea.generate_report()
```

Output:
```
================================================================================
FMEA Report: Sepsis Prediction Model
================================================================================
Failure Mode Cause Effect Severity Occurrence Detection RPN Risk Level
Silent performance degradation Concept drift over time Gradually increasing error rate 5 4 5 100 Moderate
False negative (missed sepsis case) Data drift, patient outside training... Delayed treatment → Patient death 5 3 4 60 Moderate
False alarm (false positive) Low model specificity Alert fatigue → Future alerts ignored 4 5 2 40 Moderate
================================================================================
Risk Summary:
  High Risk: 0 failure modes (RPN > 100)
  Moderate Risk: 3 failure modes (RPN 31-100)
  Low Risk: 0 failure modes (RPN ≤ 30)
```
Traditional FMEA assumes:

- Known failure modes
- Quantifiable probabilities
- Independent failures

AI challenges:

- Unknown unknowns: Failure modes we haven’t thought of
- Uncertain probabilities: Hard to estimate occurrence of data drift
- Correlated failures: Adversarial examples affect whole model class

Solution: Combine FMEA with:

- Stress testing for unknown failure modes
- Continuous monitoring for early detection
- Red team exercises to find vulnerabilities
Fault Tree Analysis (FTA)
Top-down approach: Start with harm, work backward to causes.
```
                Patient Death from Sepsis
                           |
              +------------+------------+
              |                         |
     Sepsis Not Detected        Treatment Delayed
              |                         |
      +-------+-------+         +-------+-------+
      |               |         |               |
 Model Missed     Clinician   Alert Not      System
     Case       Didn't Notice    Seen         Down
```
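A fault tree becomes quantitative once each basic event has a probability and each junction has a gate type. The tree above does not mark its gates, so the sketch below makes explicit assumptions: "Sepsis Not Detected" is treated as an AND gate (both the model and the clinician must miss the case), "Treatment Delayed" and the top event as OR gates, and all basic events as independent. The probabilities are hypothetical placeholders.

```python
from math import prod

def and_gate(*probabilities):
    """All input events must occur (independence assumed)."""
    return prod(probabilities)

def or_gate(*probabilities):
    """At least one input event occurs (independence assumed)."""
    return 1 - prod(1 - p for p in probabilities)

# Hypothetical basic-event probabilities, for illustration only
p_model_missed = 0.30
p_clinician_missed = 0.20
p_alert_not_seen = 0.10
p_system_down = 0.01

p_not_detected = and_gate(p_model_missed, p_clinician_missed)
p_treatment_delayed = or_gate(p_alert_not_seen, p_system_down)

# Probability that at least one pathway toward the top event is open
p_top = or_gate(p_not_detected, p_treatment_delayed)
print(f"Sepsis not detected: {p_not_detected:.4f}")
print(f"Treatment delayed:   {p_treatment_delayed:.4f}")
print(f"Top event pathway:   {p_top:.4f}")
```

The independence assumption is the weak point for AI systems: automation bias, for example, makes a clinician more likely to miss a case precisely when the model misses it, so the AND gate above may understate the real risk.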
Event Tree Analysis (ETA)
Forward approach: Start with event, trace possible outcomes.
```
Sepsis Develops → Model Alerts? → Clinician Reviews? → Treatment Given? → Outcome
                       ↓                 ↓                    ↓
                     Yes/No            Yes/No               Yes/No         Survive/Die
```
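The same chain can be enumerated programmatically to see how much probability mass reaches each path. This is a minimal sketch that assumes each branch has a fixed "yes" probability independent of the earlier branches; a real event tree would use conditional probabilities at each step. The branch values and the `event_tree` helper are hypothetical.

```python
from itertools import product

def event_tree(branches):
    """
    branches: list of (name, p_yes) pairs in chronological order.
    Returns {tuple_of_bools: probability} for every yes/no path.
    """
    paths = {}
    for outcome in product([True, False], repeat=len(branches)):
        probability = 1.0
        for (_, p_yes), happened in zip(branches, outcome):
            probability *= p_yes if happened else 1 - p_yes
        paths[outcome] = probability
    return paths

# Hypothetical branch probabilities, for illustration only
steps = [
    ("Model alerts", 0.70),
    ("Clinician reviews", 0.90),
    ("Treatment given", 0.95),
]
paths = event_tree(steps)

p_full_success = paths[(True, True, True)]
print(f"Alert, review, and treatment all happen: {p_full_success:.4f}")
print(f"At least one step fails:                 {1 - p_full_success:.4f}")
```

Even with each step looking reliable on its own, the chance that every step succeeds is the product of the branch probabilities, which is why each layer in the chain needs its own monitoring.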
HAZOP (Hazard and Operability Study)
Use guide words to identify hazards:
| Guide Word | Application to AI | Example Hazard |
|---|---|---|
| No/Not | No prediction available | System downtime → Manual workflow |
| More | More false positives | Alert fatigue |
| Less | Less sensitivity | Missed cases |
| As Well As | Model captures spurious features | Detects hospital system, not disease |
| Part Of | Model trained on subset | Bias against underrepresented groups |
| Reverse | Opposite of intended | Model predicts low risk for high-risk patient |
| Other Than | Different output | Model provides prediction for wrong patient |
Standard ML evaluation (accuracy, AUROC, precision/recall on test set) is necessary but insufficient for safety validation.
Problem 1: Average performance masks critical failures
Example: Cancer screening AI
Problem 2: Test set doesn’t represent edge cases
Problem 3: Metrics don’t account for operational context
Problem 4: No assessment of failure modes
Goal: Evaluate performance on hardest cases, not average cases.
```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.metrics import accuracy_score

def worst_case_validation(model, X, y, protected_groups):
    """
    Evaluate model on worst-performing subgroups

    Assumes X is a pandas DataFrame containing a 'demographic_group' column
    and y is a pandas Series aligned with X.

    Returns performance on:
    - Hardest examples (lowest-confidence predictions)
    - Underrepresented groups
    - Edge cases
    """
    results = {}
    y_pred = np.asarray(model.predict(X))
    results['overall_accuracy'] = accuracy_score(y, y_pred)

    # 1. Evaluate on low-confidence predictions
    predictions = model.predict_proba(X)
    confidence = np.max(predictions, axis=1)
    # Bottom quartile confidence
    low_conf_mask = confidence < np.percentile(confidence, 25)
    results['low_confidence_accuracy'] = accuracy_score(
        y[low_conf_mask], y_pred[low_conf_mask]
    )

    # 2. Evaluate on underrepresented groups
    for group in protected_groups:
        group_mask = (X['demographic_group'] == group).to_numpy()
        results[f'{group}_accuracy'] = accuracy_score(
            y[group_mask], y_pred[group_mask]
        )
        results[f'{group}_sample_size'] = int(group_mask.sum())

    # 3. Evaluate on edge cases (outliers)
    # Isolation forest needs numeric input, so non-numeric columns are dropped
    numeric_X = X.select_dtypes(include='number')
    outlier_detector = IsolationForest(random_state=0)
    outlier_mask = outlier_detector.fit_predict(numeric_X) == -1
    results['outlier_accuracy'] = accuracy_score(
        y[outlier_mask], y_pred[outlier_mask]
    )
    return results

# Example (sepsis_model, test_data and test_labels are assumed to exist)
worst_case_results = worst_case_validation(
    model=sepsis_model,
    X=test_data,
    y=test_labels,
    protected_groups=['Black', 'Hispanic', 'Asian', 'White']
)
print("Worst-Case Validation Results:")
print(f"  Overall accuracy: {worst_case_results['overall_accuracy']:.1%}")
print(f"  Low-confidence accuracy: {worst_case_results['low_confidence_accuracy']:.1%}")
print(f"  Outlier accuracy: {worst_case_results['outlier_accuracy']:.1%}")
print(f"  Black patient accuracy: {worst_case_results['Black_accuracy']:.1%}")
print(f"  Hispanic patient accuracy: {worst_case_results['Hispanic_accuracy']:.1%}")
```

Safety threshold: Worst-case performance must meet minimum safety requirements, not just average.
Test inputs at extremes:
| Input Type | Example | Why It Matters |
|---|---|---|
| Out of range | Age = 150, Heart rate = 0 | How does model handle impossible values? |
| Missing data | 50% of features missing | Graceful degradation or catastrophic failure? |
| Corrupted inputs | Image with noise, text with typos | Real-world data is messy |
| Adversarial examples | Small perturbations designed to fool model | Robustness to attacks |
| OOD inputs | Data from different population/site | Detects distribution shift? |
```python
def boundary_condition_testing(model, test_data):
    """
    Test model behavior on edge cases

    create_missing_data, evaluate, evaluate_adversarial_robustness and
    load_ood_dataset are project-specific helpers assumed to exist.
    """
    test_results = {}

    # Test 1: Out-of-range values
    invalid_input = {
        'age': 200,          # Impossible
        'heart_rate': -50,   # Invalid
        'temperature': 150   # Lethal
    }
    try:
        prediction = model.predict(invalid_input)
        test_results['invalid_input'] = f"⚠️ FAILURE: Model accepted invalid input (predicted: {prediction})"
    except ValueError as e:
        test_results['invalid_input'] = f"✅ PASS: Model rejected invalid input ({e})"

    # Test 2: Missing data
    for missing_pct in [0.1, 0.3, 0.5, 0.7, 0.9]:
        partial_input = create_missing_data(test_data, missing_pct)
        accuracy = evaluate(model, partial_input)
        test_results[f'missing_{int(missing_pct*100)}pct'] = accuracy

    # Test 3: Adversarial robustness
    adversarial_accuracy = evaluate_adversarial_robustness(model, test_data)
    test_results['adversarial_robustness'] = adversarial_accuracy

    # Test 4: OOD detection
    ood_data = load_ood_dataset()  # Data from different hospital
    ood_detected = model.detect_ood(ood_data)
    test_results['ood_detection_rate'] = ood_detected.mean()

    return test_results
```

Attacks to test:
Fast Gradient Sign Method (FGSM):
```python
import tensorflow as tf

# Assumes integer class labels and a model that outputs class probabilities
loss_function = tf.keras.losses.SparseCategoricalCrossentropy()

def fgsm_attack(model, image, label, epsilon=0.03):
    """
    Generate adversarial example using FGSM
    Adds small perturbation in direction of gradient

    image and label are assumed to carry a leading batch dimension.
    """
    image = tf.cast(image, tf.float32)

    # Compute gradient of loss with respect to input
    with tf.GradientTape() as tape:
        tape.watch(image)
        prediction = model(image)
        loss = loss_function(label, prediction)
    gradient = tape.gradient(loss, image)

    # Create adversarial image
    adversarial_image = image + epsilon * tf.sign(gradient)

    # Clip to valid range [0, 1]
    adversarial_image = tf.clip_by_value(adversarial_image, 0, 1)
    return adversarial_image

# Evaluate robustness (evaluate, test_images and test_labels are assumed to exist)
original_accuracy = evaluate(model, test_images, test_labels)
adversarial_images = [fgsm_attack(model, img, lbl) for img, lbl in zip(test_images, test_labels)]
adversarial_accuracy = evaluate(model, adversarial_images, test_labels)

print(f"Original accuracy: {original_accuracy:.1%}")
print(f"Adversarial accuracy: {adversarial_accuracy:.1%}")
print(f"Robustness gap: {original_accuracy - adversarial_accuracy:.1%}")
if adversarial_accuracy < 0.5:
    print("⚠️ SAFETY CONCERN: Model highly vulnerable to adversarial attacks")
```

Projected Gradient Descent (PGD): Stronger iterative attack
Certified defenses: Formal guarantees of robustness within epsilon-ball
Test system under adverse conditions:
```python
import time
import numpy as np

class SafetyStressTest:
    """
    Comprehensive stress testing for AI safety

    model.batch_predict and evaluate are placeholders for project-specific code.
    """
    def test_high_load(self, model, num_requests=10000):
        """Can system handle peak load?"""
        start = time.perf_counter()
        model.batch_predict(num_requests)
        latency = (time.perf_counter() - start) / num_requests
        if latency > 1.0:  # > 1 second per prediction
            return f"⚠️ FAIL: Latency {latency:.2f}s exceeds 1s requirement"
        return f"✅ PASS: Latency {latency:.2f}s"

    def test_degraded_input_quality(self, model, test_data):
        """Performance with poor-quality inputs"""
        results = {}
        # Add increasing amounts of Gaussian noise
        for noise_level in [0.1, 0.3, 0.5]:
            noisy_data = test_data + noise_level * np.random.randn(*test_data.shape)
            accuracy = evaluate(model, noisy_data)
            results[f'noise_{noise_level}'] = accuracy
            if accuracy < 0.7:  # Arbitrary safety threshold
                results[f'noise_{noise_level}_status'] = "⚠️ UNSAFE"
        return results

    def test_concurrent_failures(self, model):
        """What happens when multiple things go wrong?"""
        # Simulate database connection failure during high load
        # Simulate network latency while processing corrupted input
        # etc.
        pass

    def test_recovery(self, model):
        """Can system recover from failures?"""
        # Simulate crash and restart
        # Verify state restoration
        # Check for data loss
        pass
```

Adapted from Chapter 9 (Ethics), but safety-focused:
```python
import numpy as np
from sklearn.metrics import accuracy_score

def safety_subgroup_analysis(model, X, y, subgroups):
    """
    Evaluate safety metrics across demographic subgroups
    Focus on clinical safety, not just fairness

    subgroups maps a group name to a boolean mask over X and y.
    """
    safety_metrics = {}
    for group, mask in subgroups.items():
        y_group = np.asarray(y[mask])
        y_pred = np.asarray(model.predict(X[mask]))

        # Standard metrics
        accuracy = accuracy_score(y_group, y_pred)

        # Safety-critical metrics
        # False negatives in high-severity cases: share of true 'high_risk'
        # patients the model did not label 'high_risk' (NaN if the group has none)
        high_severity_mask = (y_group == 'high_risk')
        false_negative_rate = float(
            np.mean(y_pred[high_severity_mask] != 'high_risk')
        )

        safety_metrics[group] = {
            'accuracy': accuracy,
            'false_negative_rate': false_negative_rate,
            'sample_size': int(np.sum(mask))
        }

    # Check for subgroup safety disparities
    fnr_values = [m['false_negative_rate'] for m in safety_metrics.values()]
    max_fnr_disparity = max(fnr_values) - min(fnr_values)
    if max_fnr_disparity > 0.1:  # 10% disparity threshold
        print(f"⚠️ SAFETY ALERT: {max_fnr_disparity:.1%} disparity in false negative rates across subgroups")

    return safety_metrics
```

Software Validation Guidance:
AI/ML-specific:
Clinical Evaluation Report (CER) requirements:
For AI:
Before deployment:
Safety must be designed in, not tested in. This section covers architectural patterns that build safety into AI systems.
Levels of automation (adapted from aviation):
| Level | Name | Description | AI Role | Human Role | Healthcare Example |
|---|---|---|---|---|---|
| 1 | Manual | Human decides, no AI | None | Decision maker | Traditional diagnosis |
| 2 | Decision support | AI suggests, human decides | Advisor | Decision maker | Diagnostic suggestions that clinician reviews |
| 3 | Conditional automation | AI decides, human approves before action | Recommender | Approver | Treatment plan requires clinician sign-off |
| 4 | High automation | AI acts, human monitors and can intervene | Autonomous actor | Supervisor | Automated insulin pump with override |
| 5 | Full automation | AI acts, no human involvement | Fully autonomous | None | (Rare in healthcare) |
Safety principle: Higher automation requires higher reliability.
When to use each level:
Bainbridge, 1983 identified paradoxes:
Application to healthcare AI:
Defense in depth: Multiple independent safety layers.
```python
import logging
import time

class SafeAISystem:
    """
    AI system with fallback to rule-based or human decision
    """
    def __init__(self, ai_model, rule_based_backup, confidence_threshold=0.8):
        self.ai_model = ai_model
        self.rule_based_backup = rule_based_backup
        self.confidence_threshold = confidence_threshold
        self.system_health = "operational"

    def predict(self, input_data):
        """
        Make prediction with safety fallbacks.
        Always returns a dict with 'prediction', 'confidence' and 'source'.
        """
        # Health check
        if self.system_health != "operational":
            return self.safe_mode_prediction(input_data)

        # Try AI prediction
        try:
            prediction = self.ai_model.predict(input_data)
            # The AI model's result is assumed to expose a .confidence attribute
            confidence = prediction.confidence

            # Check confidence threshold
            if confidence >= self.confidence_threshold:
                return {'prediction': prediction, 'confidence': confidence, 'source': 'ai'}

            # Low confidence → Fallback
            logging.warning(f"Low confidence ({confidence:.2f}) → Falling back to rule-based")
            return self._rule_based_prediction(input_data)
        except Exception as e:
            # AI failure → Fallback
            logging.error(f"AI prediction failed: {e} → Falling back to rule-based")
            return self._rule_based_prediction(input_data)

    def _rule_based_prediction(self, input_data):
        """Wrap the rule-based result in the common result format"""
        return {
            'prediction': self.rule_based_backup.predict(input_data),
            'confidence': None,
            'source': 'rule_based'
        }

    def safe_mode_prediction(self, input_data):
        """
        Conservative predictions when system degraded
        """
        # Use most conservative/safe default
        return {
            'prediction': 'manual_review_required',
            'confidence': 0.0,
            'source': 'manual',
            'reason': 'System in safe mode - human evaluation required'
        }

    def health_check(self, test_input):
        """
        Continuous system health monitoring.
        test_input is a known-good input used to probe availability and latency.
        """
        try:
            start = time.perf_counter()
            self.ai_model.predict(test_input)
            latency = time.perf_counter() - start
            # 5 second latency threshold
            self.system_health = "degraded" if latency > 5.0 else "operational"
        except Exception:
            self.system_health = "failed"
        return self.system_health

# Example usage (the model, rules and patient data are assumed to exist)
ai_system = SafeAISystem(
    ai_model=neural_network_model,
    rule_based_backup=clinical_decision_rules,
    confidence_threshold=0.8
)

# Prediction with automatic fallback
result = ai_system.predict(patient_data)
if result['source'] == 'rule_based':
    print("⚠️ AI fallback activated - using rule-based system")
elif result['source'] == 'manual':
    print("⚠️ Safe mode - manual review required")
```

Diverse models reduce common-mode failures:
from collections import Counter


class SafetyEnsemble:
    """
    Ensemble with safety-focused disagreement handling.
    Assumes each model returns a hashable class label (e.g. 'sepsis' / 'no_sepsis').
    """

    def __init__(self, models, min_agreement=0.75):
        self.models = models
        self.min_agreement = min_agreement

    def predict(self, input_data):
        """
        Predict with ensemble agreement checking
        """
        predictions = [model.predict(input_data) for model in self.models]
        # Check agreement
        agreement = self.calculate_agreement(predictions)
        if agreement >= self.min_agreement:
            # High agreement → Use ensemble prediction
            return {
                'prediction': self.aggregate(predictions),
                'confidence': agreement,
                'source': 'ensemble'
            }
        # Low agreement → Flag for human review
        return {
            'prediction': 'uncertain',
            'confidence': agreement,
            'source': 'manual_review_required',
            'reason': f'Low model agreement ({agreement:.2%})'
        }

    def calculate_agreement(self, predictions):
        """
        Fraction of models that agree with majority
        """
        majority_count = Counter(predictions).most_common(1)[0][1]
        return majority_count / len(predictions)

    def aggregate(self, predictions):
        """
        Aggregate predictions by majority vote (averaging is an alternative
        for continuous outputs)
        """
        return Counter(predictions).most_common(1)[0][0]
Continuous monitoring for safety signals:
import logging
import time


class SafetyMonitor:
    """
    Real-time safety monitoring with automatic circuit breakers
    """

    def __init__(self, model, safety_thresholds):
        self.model = model
        self.thresholds = safety_thresholds
        self.metrics_history = []
        self.circuit_breaker_active = False

    def monitor_prediction(self, input_data, prediction, ground_truth=None):
        """
        Monitor individual prediction for safety concerns
        """
        alerts = []
        # Check 1: Confidence too low
        if prediction['confidence'] < self.thresholds['min_confidence']:
            alerts.append(f"Low confidence: {prediction['confidence']:.2f}")
        # Check 2: OOD detection (assumes the model exposes an ood_detector)
        ood_score = self.model.ood_detector.score(input_data)
        if ood_score > self.thresholds['max_ood_score']:
            alerts.append(f"Out-of-distribution input: {ood_score:.2f}")
        # Check 3: If ground truth available, check error
        if ground_truth is not None:
            error = (prediction['value'] != ground_truth)
            self.metrics_history.append({'error': error, 'timestamp': time.time()})
            # Check recent error rate
            recent_error_rate = self.calculate_recent_error_rate()
            if recent_error_rate > self.thresholds['max_error_rate']:
                alerts.append(f"Elevated error rate: {recent_error_rate:.1%}")
                self.activate_circuit_breaker()
        if alerts:
            self.log_safety_alert(alerts, input_data, prediction)
        return alerts

    def calculate_recent_error_rate(self, window_minutes=60):
        """
        Calculate error rate in recent time window
        """
        window_start = time.time() - (window_minutes * 60)
        recent_errors = [
            m['error'] for m in self.metrics_history
            if m['timestamp'] > window_start
        ]
        if not recent_errors:
            return 0.0
        return sum(recent_errors) / len(recent_errors)

    def log_safety_alert(self, alerts, input_data, prediction):
        """
        Record alerts for later review (inputs are not logged, to avoid
        writing patient data to log files)
        """
        logging.warning(f"Safety alerts: {alerts}")

    def activate_circuit_breaker(self):
        """
        Stop using AI model when safety threshold exceeded
        """
        self.circuit_breaker_active = True
        logging.critical("🚨 CIRCUIT BREAKER ACTIVATED - AI model disabled")
        logging.critical("System falling back to safe mode until manual review")
        # Alert on-call team
        self.send_alert_to_oncall()

    def send_alert_to_oncall(self):
        """Send urgent alert to on-call safety team"""
        # Integration with paging system
        pass


# Example usage (sepsis_model, fallback_system, and patient_data are
# placeholders; ground_truth is the confirmed outcome, or None if unknown)
monitor = SafetyMonitor(
    model=sepsis_model,
    safety_thresholds={
        'min_confidence': 0.7,
        'max_ood_score': 0.5,
        'max_error_rate': 0.2  # 20% error rate
    }
)
# Monitor each prediction
prediction = sepsis_model.predict(patient_data)
alerts = monitor.monitor_prediction(patient_data, prediction, ground_truth)
if monitor.circuit_breaker_active:
    # Use fallback system
    prediction = fallback_system.predict(patient_data)
Prevent unsafe actions before they occur:
class SafetyInterlocks:
    """
    Hard constraints preventing unsafe model behavior
    """

    def validate_input(self, input_data):
        """
        Validate input before allowing prediction
        """
        # Completeness checks run first so later checks never hit a missing key.
        # Blood pressure is required as separate systolic/diastolic fields
        # because the consistency check below compares them.
        required_fields = ['age', 'heart_rate', 'systolic_bp', 'diastolic_bp', 'temperature']
        missing = [f for f in required_fields if input_data.get(f) is None]
        if missing:
            raise ValueError(f"Input validation failed: missing required fields {missing}")
        errors = []
        # Range checks
        if input_data['age'] < 0 or input_data['age'] > 120:
            errors.append(f"Invalid age: {input_data['age']}")
        if input_data['heart_rate'] < 20 or input_data['heart_rate'] > 300:
            errors.append(f"Invalid heart rate: {input_data['heart_rate']}")
        # Consistency checks
        if input_data['systolic_bp'] < input_data['diastolic_bp']:
            errors.append("Systolic BP < Diastolic BP (impossible)")
        if errors:
            raise ValueError(f"Input validation failed: {errors}")

    def validate_output(self, prediction, input_data):
        """
        Validate prediction before returning to clinician
        """
        # Output bounds check: a hard failure, so it runs before any warnings
        if prediction['risk'] < 0 or prediction['risk'] > 1:
            raise ValueError(f"Risk score out of bounds: {prediction['risk']}")
        warnings = []
        # Confidence check
        if prediction['confidence'] < 0.5:
            warnings.append("Low confidence prediction")
        # Plausibility check
        # Example: Risk score shouldn't increase dramatically for stable patient
        if 'previous_risk' in input_data:
            change = abs(prediction['risk'] - input_data['previous_risk'])
            if change > 0.5:  # absolute change of more than 0.5 on the 0-1 scale
                warnings.append(f"Large risk change: {input_data['previous_risk']:.2f} → {prediction['risk']:.2f}")
        if warnings:
            prediction['warnings'] = warnings
        return prediction

    def calculate_qsofa(self, input_data):
        """
        qSOFA: one point each for respiratory rate >= 22/min,
        systolic BP <= 100 mmHg, and altered mentation (GCS < 15).
        The field names used here are assumed for illustration.
        """
        return sum([
            input_data['respiratory_rate'] >= 22,
            input_data['systolic_bp'] <= 100,
            input_data['gcs'] < 15,
        ])

    def cross_check_with_rules(self, prediction, input_data):
        """
        Cross-check AI prediction against clinical decision rules
        """
        # Example: qSOFA score for sepsis
        qsofa = self.calculate_qsofa(input_data)
        if qsofa >= 2 and prediction['sepsis_risk'] < 0.5:
            # qSOFA suggests sepsis but AI says low risk
            return {
                **prediction,
                'cross_check_alert': f"AI predicts low risk but qSOFA={qsofa} suggests high risk - manual review recommended"
            }
        return prediction
Deployment isn’t the end of safety validation; it’s the beginning of continuous safety monitoring.
Traditional medical devices are static. AI systems are dynamic:
Result: Performance validated pre-deployment can degrade silently post-deployment.
FDA MedWatch categories (adapted for AI):
| Severity | Definition | Examples | Reporting |
|---|---|---|---|
| Critical | Death or serious injury | AI missed cancer diagnosis → Patient died | Immediate (within 24h) |
| Major | Significant harm, prolonged recovery | AI-guided surgery error → Complications requiring additional procedures | Within 30 days |
| Minor | Temporary harm, full recovery | False alarm causing unnecessary procedure | Annual summary |
| Near miss | Potential for harm, none occurred | AI error caught by clinician before action | Track for trending |
from datetime import datetime


class IncidentResponse:
    """
    Systematic incident response for AI safety events
    """

    def __init__(self, system_name):
        self.system_name = system_name
        self.incidents = []

    def report_incident(self, severity, description, patient_impact, root_cause_preliminary):
        """
        Report and initiate response to safety incident
        """
        incident = {
            'timestamp': datetime.now(),
            'severity': severity,
            'description': description,
            'patient_impact': patient_impact,
            'root_cause_preliminary': root_cause_preliminary,
            'status': 'reported'
        }
        self.incidents.append(incident)
        # Immediate actions based on severity
        if severity == 'critical':
            self.critical_incident_response(incident)
        elif severity == 'major':
            self.major_incident_response(incident)
        # All incidents trigger investigation
        return self.initiate_investigation(incident)

    def critical_incident_response(self, incident):
        """
        Immediate response to critical safety event
        """
        # 1. Activate circuit breaker if systematic issue
        if self.is_systematic_failure(incident):
            self.activate_circuit_breaker()
        # 2. Alert on-call safety team
        self.page_safety_team(incident)
        # 3. Notify regulatory authorities (24h window from the severity table above)
        self.notify_regulators(incident, deadline_hours=24)
        # 4. Quarantine affected predictions
        self.quarantine_predictions(incident)
        # 5. Initiate root cause analysis
        self.initiate_rca(incident)

    def is_systematic_failure(self, incident):
        """
        Determine if incident indicates systematic problem
        """
        # Check for similar recent incidents (the current one is already in the list)
        recent_similar = [
            i for i in self.incidents[-10:]
            if i['root_cause_preliminary'] == incident['root_cause_preliminary']
        ]
        return len(recent_similar) >= 3  # Trend suggests systematic issue

    def initiate_investigation(self, incident):
        """
        Start formal investigation
        """
        investigation = {
            'incident_id': len(self.incidents),
            'team': ['data_scientist', 'clinician', 'safety_officer', 'quality_assurance'],
            'timeline': '30 days for major/critical, 90 days for minor',
            'deliverables': ['root_cause_analysis', 'corrective_actions', 'preventive_actions']
        }
        return investigation

    # The hooks below are site-specific integrations, left as placeholders.
    def major_incident_response(self, incident):
        pass

    def activate_circuit_breaker(self):
        pass

    def page_safety_team(self, incident):
        pass

    def notify_regulators(self, incident, deadline_hours):
        pass

    def quarantine_predictions(self, incident):
        pass

    def initiate_rca(self, incident):
        pass


# Example
incident_response = IncidentResponse(system_name="Sepsis Prediction Model")
# Critical incident reported
incident_response.report_incident(
    severity='critical',
    description="Model failed to alert for patient who developed septic shock",
    patient_impact="Patient required ICU admission, prolonged recovery",
    root_cause_preliminary="Data drift - patient demographics outside training distribution"
)
Five Whys method:
Incident: AI missed sepsis case
Why #1: Why did AI miss sepsis?
→ Patient's early sepsis presented atypically
Why #2: Why didn't AI detect atypical presentation?
→ Training data had few atypical cases
Why #3: Why did training data lack atypical cases?
→ Training data from single academic center with specific patient population
Why #4: Why wasn't data diversity validated?
→ Validation focused on overall metrics, not subgroup coverage
Why #5: Why wasn't subgroup validation required?
→ Validation protocol didn't include diversity assessment
ROOT CAUSE: Inadequate validation protocol lacking diversity requirements
Corrective Action: Update validation to require minimum sample sizes across demographic and clinical subgroups
Preventive Action: Implement ongoing diversity monitoring in post-market surveillance
AI Failure (Missed Sepsis): contributing causes by category
People: Alert fatigue; Insufficient training
Process: No FMEA done; Inadequate validation
Technology: Model too simple; No OOD detection; No monitoring
Data: Data drift; Unrepresentative training set
from datetime import datetime

import numpy as np
from scipy.stats import ks_2samp
from sklearn.metrics import accuracy_score, roc_auc_score


class PostMarketSurveillance:
    """
    Continuous monitoring of AI safety and performance.
    The alert_*, get_*, count_incidents, and generate_recommendations methods
    called below are deployment-specific hooks that are not shown here.
    """

    def __init__(self, model, baseline_metrics):
        self.model = model
        self.baseline_metrics = baseline_metrics
        self.current_metrics = {}
        self.degradation_threshold = 0.1  # 10% relative degradation triggers alert

    def monitor_performance(self, predictions, ground_truth):
        """
        Track performance metrics over time
        """
        predictions = np.asarray(predictions)  # predicted probabilities
        ground_truth = np.asarray(ground_truth)
        # Calculate current metrics
        self.current_metrics = {
            'auroc': roc_auc_score(ground_truth, predictions),
            'accuracy': accuracy_score(ground_truth, predictions > 0.5),
            'sample_size': len(predictions),
            'timestamp': datetime.now()
        }
        # Check for degradation
        for metric in ['auroc', 'accuracy']:
            baseline = self.baseline_metrics[metric]
            current = self.current_metrics[metric]
            degradation = (baseline - current) / baseline
            if degradation > self.degradation_threshold:
                self.alert_degradation(metric, baseline, current, degradation)

    def monitor_data_drift(self, new_data, reference_data):
        """
        Detect distribution shift using statistical tests.
        Expects DataFrames with the same numeric feature columns.
        """
        drift_detected = {}
        for feature in new_data.columns:
            # Kolmogorov-Smirnov test
            statistic, p_value = ks_2samp(
                reference_data[feature],
                new_data[feature]
            )
            # Note: with many features, some will cross p < 0.05 by chance alone
            if p_value < 0.05:  # Significant drift
                drift_detected[feature] = {
                    'statistic': statistic,
                    'p_value': p_value
                }
        if drift_detected:
            self.alert_data_drift(drift_detected)
        return drift_detected

    def monitor_subgroup_performance(self, predictions, ground_truth, subgroups):
        """
        Ensure no subgroup experiencing degradation.
        subgroups maps a group name to a boolean mask over the samples.
        """
        predictions = np.asarray(predictions)
        ground_truth = np.asarray(ground_truth)
        subgroup_metrics = {}
        for group, mask in subgroups.items():
            mask = np.asarray(mask, dtype=bool)
            # Minimum sample size, and AUROC needs both classes present
            if mask.sum() > 30 and len(np.unique(ground_truth[mask])) == 2:
                subgroup_metrics[group] = {
                    'auroc': roc_auc_score(ground_truth[mask], predictions[mask]),
                    'sample_size': int(mask.sum())
                }
        # Check for subgroup disparities (needs at least two evaluable groups)
        aurocs = [m['auroc'] for m in subgroup_metrics.values()]
        if len(aurocs) >= 2:
            disparity = max(aurocs) - min(aurocs)
            if disparity > 0.1:  # 0.1 AUROC disparity threshold
                self.alert_subgroup_disparity(subgroup_metrics, disparity)
        return subgroup_metrics

    def generate_periodic_safety_update(self, period='quarterly'):
        """
        Periodic Safety Update Report (PSUR) as required by EU MDR
        """
        report = {
            'period': period,
            'timestamp': datetime.now(),
            'system': self.model.name,
            'deployment_stats': {
                'total_predictions': self.get_total_predictions(),
                'active_sites': self.get_active_sites(),
                'user_count': self.get_user_count()
            },
            'performance_summary': {
                'baseline_auroc': self.baseline_metrics['auroc'],
                'current_auroc': self.current_metrics['auroc'],
                'change': self.current_metrics['auroc'] - self.baseline_metrics['auroc']
            },
            'safety_events': {
                'critical': self.count_incidents('critical'),
                'major': self.count_incidents('major'),
                'minor': self.count_incidents('minor'),
                'near_miss': self.count_incidents('near_miss')
            },
            'corrective_actions': self.get_corrective_actions(),
            'recommendations': self.generate_recommendations()
        }
        return report
When to report:
What to report:
Serious incidents must be reported to competent authority:
Periodic Safety Update Reports (PSUR):
Gray area: When AI is “decision support” (clinician makes final call), attribution is unclear. Regulators increasingly expect reporting even when AI is indirect contributor.
Technology alone cannot ensure safety. Organizations need culture, processes, and governance supporting safety.
Components of effective safety management:
Clinicians must understand:
Developers must understand:
Example training curriculum for clinicians:
Module 1: AI Basics (30 min)
- What is machine learning?
- How does our model work?
- What data was it trained on?
Module 2: Using the AI System (45 min)
- Workflow integration
- Interpreting predictions
- Understanding confidence scores
- Hands-on practice
Module 3: Limitations and Failure Modes (30 min)
- Known failure modes
- When to override AI
- Reporting errors and near-misses
Module 4: Case Studies (30 min)
- Real examples of AI failures
- Lessons learned
- Discussion and Q&A
Assessment:
- Written quiz (80% passing)
- Practical simulation (demonstrate safe use)
- Annual refresher required
Aviation safety lesson: Punishing errors drives them underground. Just culture encourages reporting.
Just Culture principles:
Application to AI safety:
| Blame culture | Just culture |
|---|---|
| "Dr. Smith ignored the AI alert" | "Why was alert rate so high that ignoring became norm?" |
| "Data scientist didn't test edge cases" | "Why didn't our validation protocol require edge case testing?" |
| "Hospital didn't monitor performance" | "What barriers prevented effective monitoring?" |
Safety ≠ Security ≠ Quality: Distinct dimensions requiring different frameworks
Technical excellence ≠ Safe deployment: High accuracy doesn’t guarantee safety
Fail-safe by design: Build safety into architecture (HITL, fallbacks, monitoring)
Multiple barriers: Defense in depth prevents single point of failure
Proactive hazard analysis: FMEA and risk assessment before deployment, not after harm
Validation beyond ML metrics: Worst-case analysis, boundary testing, adversarial robustness
Continuous monitoring: Safety validation doesn’t end at deployment—requires ongoing surveillance
Regulatory compliance is minimum: FDA/EU requirements are floor, not ceiling
Learn from failures: Both yours and others’ (see Appendix E)
Organizational culture matters: Technology cannot overcome culture that doesn’t prioritize safety
Edge cases matter more than averages: Safety depends on worst-case performance
Human-centered design: Appropriate autonomy level, interpretability, workflow integration
Scenario: Your public health department is deploying an AI model to predict COVID-19 hospitalization risk. The model will be used to prioritize antiviral treatments during supply shortages.
Context:
For each category, identify at least 2 potential failure modes:
Technical failures:
- Data drift
- OOD inputs
- Adversarial examples
- Training data issues
Operational failures:
- Misuse/off-label use
- Automation bias
- Alert fatigue
- Workflow integration
Systemic failures:
- Cascading failures
- Common-cause failures
- Equity issues
For each failure mode, assess:
| Failure Mode | Cause | Effect | Severity (1-5) | Occurrence (1-5) | Detection (1-5) | RPN |
|---|---|---|---|---|---|---|
| [Your failure mode] | | | | | | |
For failure modes with RPN > 50, design specific mitigation strategies:
Design post-deployment safety monitoring:
What metrics to track:
- Performance metrics (AUROC, calibration)
- Safety metrics (false negative rate in high-risk groups)
- Operational metrics (alert burden, override rate)
- Equity metrics (performance across demographics)
How to detect degradation:
- Automated alerts
- Statistical process control
- Manual review frequency
Response protocol:
- Degradation thresholds
- Escalation procedures
- Circuit breaker criteria
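As one possible starting point for the statistical process control item, a p-chart flags review periods whose error proportion falls outside three-sigma limits around a baseline rate. The sketch below is illustrative only: the function name `p_chart_flags`, the 10% baseline rate, and the weekly counts are hypothetical, and real limits would be set with clinical and quality-improvement input.

```python
import math


def p_chart_flags(error_counts, sample_sizes, baseline_rate):
    """Return one (rate, lower, upper, out_of_control) tuple per period.

    Uses the standard p-chart limits: p +/- 3 * sqrt(p * (1 - p) / n),
    where p is the baseline rate and n is the period's sample size.
    """
    results = []
    for errors, n in zip(error_counts, sample_sizes):
        rate = errors / n
        sigma = math.sqrt(baseline_rate * (1 - baseline_rate) / n)
        lower = max(0.0, baseline_rate - 3 * sigma)
        upper = min(1.0, baseline_rate + 3 * sigma)
        results.append((rate, lower, upper, rate < lower or rate > upper))
    return results


# Hypothetical weekly counts of missed hospitalizations among reviewed cases
weekly_errors = [10, 12, 9, 25]
weekly_cases = [100, 100, 100, 100]
flags = p_chart_flags(weekly_errors, weekly_cases, baseline_rate=0.10)
for week, (rate, lower, upper, flagged) in enumerate(flags, start=1):
    status = "OUT OF CONTROL" if flagged else "ok"
    print(f"Week {week}: rate={rate:.2f} limits=({lower:.2f}, {upper:.2f}) {status}")
```

With these made-up numbers only the fourth week exceeds the upper limit. A point outside the limits is a prompt for investigation, not proof of model failure; it could also reflect a change in case mix or in how outcomes are recorded.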
Create incident response workflow:
Test your knowledge of the key concepts from this chapter.
A COVID-19 diagnostic AI achieves 97% accuracy in validation. Why might this be insufficient for safety clearance?
Answer: Multiple safety concerns beyond accuracy:
Accuracy doesn’t reveal failure mode distribution - 3% errors might all be false negatives (missing COVID cases), which is more dangerous than false positives
Aggregate metric masks subgroup performance - Model might be 99% accurate on typical presentations but 70% accurate on atypical cases (elderly, immunocompromised)
Test set may not represent edge cases - Validation data might not include poor image quality, operator errors, or equipment variations seen in real clinics
No assessment of operational impact - High accuracy with 30% false positive rate could cause alert fatigue and system abandonment
Missing safety-specific validation - No adversarial robustness testing, OOD detection, worst-case analysis, or boundary condition testing
No consideration of context - In low-prevalence settings, even 97% accuracy can have very low positive predictive value
Safety clearance requires: Comprehensive FMEA, worst-case analysis, subgroup validation, operational testing, and safety-critical design (fallbacks, monitoring, HITL).
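The low-prevalence point can be checked with Bayes' rule. The sketch below assumes, purely for illustration, that "97% accuracy" means 97% sensitivity and 97% specificity; the function name and the prevalence values are hypothetical.

```python
def positive_predictive_value(sensitivity, specificity, prevalence):
    """PPV via Bayes' rule: TP / (TP + FP), as fractions of the population."""
    true_positives = sensitivity * prevalence
    false_positives = (1 - specificity) * (1 - prevalence)
    return true_positives / (true_positives + false_positives)


# Assumed for illustration: 97% sensitivity and 97% specificity
for prevalence in (0.20, 0.05, 0.01):
    ppv = positive_predictive_value(0.97, 0.97, prevalence)
    print(f"Prevalence {prevalence:.0%}: PPV = {ppv:.1%}")
```

Under these assumptions, at 1% prevalence roughly one positive result in four is a true positive, even though the test is "97% accurate" in both directions.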
Hospital wants to deploy fully autonomous AI (Level 5 automation) for sepsis prediction—no human review, automatic treatment orders. What safety requirements must be met?
Answer: Level 5 (fully autonomous) is almost never appropriate in healthcare. If it were considered, it would require:
Technical requirements:
- Near-perfect accuracy (>99.9%) across ALL subgroups
- Certified robustness to adversarial examples
- Reliable OOD detection
- Continuous monitoring with real-time circuit breakers
- Redundant fallback systems
- Formal verification of safety properties
Regulatory requirements:
- Class III designation (highest risk) → PMA (premarket approval), not 510(k)
- Extensive clinical trials demonstrating superiority to human decision-making
- Post-market surveillance plan with frequent reporting
- Predetermined change control plan for any updates
Operational requirements:
- Comprehensive FMEA with all RPN < 10
- Incident response protocol with immediate escalation
- Human monitoring capability (contradiction: “autonomous” yet monitored)
- Clear liability assignment
Ethical requirements:
- Patient consent for fully autonomous care
- Opt-out mechanism
- Demonstration that benefits significantly outweigh risks
Realistic recommendation: Use Level 2-3 (decision support or human approval required) instead. Full autonomy creates unacceptable liability, technical, and ethical risks for life-critical decisions.
Your AI model’s performance begins degrading after 6 months in production (AUROC dropped from 0.85 to 0.78). Walk through the incident response steps.
Answer: Systematic incident response:
Step 1: Detect and confirm (24-48 hours)
- Automated monitoring flags the degradation
- Confirm with manual validation on recent data
- Calculate magnitude and statistical significance of degradation
- Assess impact: How many patients were affected? Did any harm occur?
Step 2: Immediate response (Day 1)
- Classify severity: Major (significant degradation) or Critical (if harm occurred)
- Notify safety officer, clinical leadership, development team
- DO NOT immediately disable system (could disrupt care) unless degradation is severe
- Increase human oversight temporarily (lower automation level)
- Consider lowering confidence threshold (increase sensitivity at cost of specificity)
Step 3: Root cause investigation (Days 2-7)
- Data drift analysis: Has population changed? New demographics, comorbidities?
- Concept drift analysis: Have clinical practices changed? New protocols, medications?
- System changes: EHR updates, integration changes, data pipeline issues?
- Ground truth verification: Is degradation real or measurement artifact?
Step 4: Develop corrective actions (Days 7-14)
- If data drift: Retrain model on recent data
- If concept drift: Retrain with updated labels
- If system issue: Fix technical problem
- If measurement artifact: Correct monitoring system
Step 5: Validate correction (Days 14-21)
- Test updated model on held-out recent data
- Confirm performance restored
- Perform safety validation (not just performance)
- Subgroup analysis to ensure no equity issues
Step 6: Deploy correction and monitor closely (Day 21+)
- Deploy updated model
- Increase monitoring frequency temporarily
- Verify performance in production
- Document lessons learned
Step 7: Regulatory reporting
- If harm occurred: FDA MedWatch within 30 days; EU vigilance within 15 days at most (sooner for deaths and serious public health threats)
- If no harm: Include in periodic safety update report (PSUR)
Step 8: Preventive actions
- Update monitoring to detect this drift pattern earlier
- Implement automated retraining protocol
- Update FMEA with new failure mode
- Train team on lessons learned
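To make the "calculate magnitude" part of Step 1 concrete, the arithmetic for this scenario can be sketched as follows. The 10% relative threshold mirrors the `degradation_threshold = 0.1` used in the `PostMarketSurveillance` example earlier in the chapter; the helper name and the absolute-drop threshold are assumptions for illustration.

```python
def relative_degradation(baseline, current):
    """Fractional drop relative to baseline (positive means worse)."""
    return (baseline - current) / baseline


baseline_auroc, current_auroc = 0.85, 0.78
absolute_drop = baseline_auroc - current_auroc
relative_drop = relative_degradation(baseline_auroc, current_auroc)

RELATIVE_THRESHOLD = 0.10          # 10% relative degradation
ASSUMED_ABSOLUTE_THRESHOLD = 0.05  # hypothetical absolute AUROC drop

print(f"Absolute drop: {absolute_drop:.3f}")
print(f"Relative drop: {relative_drop:.1%}")
print(f"Relative alert fires: {relative_drop > RELATIVE_THRESHOLD}")
print(f"Absolute alert fires: {absolute_drop > ASSUMED_ABSOLUTE_THRESHOLD}")
```

A drop from 0.85 to 0.78 is about 8.2% in relative terms, so a 10% relative threshold alone would not have raised an alert here. That is one reason to consider tracking an absolute threshold, or a statistical test on the difference, alongside the relative one.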
Compare FMEA vs. traditional ML evaluation. When is each appropriate?
Answer:
| Dimension | Traditional ML Evaluation | FMEA (Failure Mode and Effects Analysis) |
|---|---|---|
| Goal | Assess average performance | Identify potential failure modes and harms |
| Timing | After model training | Before deployment (proactive) |
| Scope | Model performance only | End-to-end system including operational context |
| Metrics | Accuracy, AUROC, precision, recall | Risk Priority Number (Severity × Occurrence × Detection) |
| Errors treated | All errors weighted equally | Errors weighted by clinical severity |
| Team | Data scientists | Multidisciplinary (clinicians, safety, QA, patients) |
| Output | Performance report | Risk mitigation plan |
When to use ML evaluation:
- Iterative model development
- Comparing model architectures
- Hyperparameter tuning
- Model selection
When to use FMEA:
- Before clinical deployment
- After major system changes
- Following incidents or near-misses
- Annual safety reviews
- Regulatory submissions (ISO 14971 requirement)
Why both are needed:
ML evaluation asks: “How accurate is the model?”
FMEA asks: “What can go wrong, how bad would it be, and how do we prevent it?”
Example:
You cannot safely deploy with ML evaluation alone.
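The Risk Priority Number from the comparison table (Severity × Occurrence × Detection) can be computed with a few lines of code. The sketch below uses the 1-5 scales and the RPN > 50 action threshold from the hands-on exercise; the failure modes and their scores are made up for illustration, and the `FailureMode` class is a hypothetical helper.

```python
from dataclasses import dataclass


@dataclass
class FailureMode:
    name: str
    severity: int    # 1 (negligible) to 5 (catastrophic)
    occurrence: int  # 1 (rare) to 5 (frequent)
    detection: int   # 1 (almost always detected) to 5 (rarely detected)

    @property
    def rpn(self) -> int:
        """Risk Priority Number = Severity x Occurrence x Detection."""
        return self.severity * self.occurrence * self.detection


# Hypothetical entries with made-up scores, for illustration only
failure_modes = [
    FailureMode("Data drift after new variant", 4, 3, 4),
    FailureMode("Automation bias in triage", 4, 3, 5),
    FailureMode("Missing lab values at intake", 2, 4, 1),
]

RPN_ACTION_THRESHOLD = 50
needs_mitigation = [fm.name for fm in failure_modes if fm.rpn > RPN_ACTION_THRESHOLD]

for fm in sorted(failure_modes, key=lambda f: f.rpn, reverse=True):
    action = "mitigate" if fm.rpn > RPN_ACTION_THRESHOLD else "monitor"
    print(f"{fm.name}: RPN={fm.rpn} ({action})")
```

Note that the two highest-scoring entries differ only in detection: a failure that is hard to notice ranks higher than an equally severe and equally likely one that monitoring would catch.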
Regulatory perspective: FDA vs. EU MDR safety requirements for AI. What are the key differences?
Answer:
| Aspect | FDA (US) | EU MDR |
|---|---|---|
| Primary pathway | 510(k) - Substantial Equivalence (most AI) | Conformity assessment with Notified Body |
| Clinical evidence | Can rely on equivalence to predicate device; less clinical data required | Requires robust clinical evaluation; equivalence harder to claim |
| Adaptive algorithms | Predetermined Change Control Plan (PCCP) allows updates without new submission | More restrictive; significant changes require new assessment |
| Post-market | Voluntary post-market surveillance for Class I/II; MedWatch reporting | Mandatory Post-Market Clinical Follow-up (PMCF); Periodic Safety Update Reports (PSUR) |
| Transparency | Less stringent explainability requirements | MDR requires information about how algorithm works |
| Timeline | Faster (510(k): 3-6 months) | Slower (12-18+ months with Notified Body) |
| Validation rigor | Analytical + clinical validation (clinical often retrospective) | Strong emphasis on prospective clinical evidence |
| Updates | PCCP allows planned updates pre-approved | Each significant change reassessed |
Key philosophical difference:
FDA approach: Innovation-friendly, risk-based, allows incremental validation, trusts substantial equivalence
EU MDR approach: Precautionary, requires stronger evidence upfront, less reliance on equivalence, more stringent post-market requirements
Practical implications:
Development strategy: EU market requires more upfront clinical data; plan prospective studies early
Adaptive AI: If model will update frequently, FDA PCCP pathway more viable than EU
Cost: EU MDR compliance generally more expensive (Notified Body fees, extensive clinical evaluation)
Timeline: Plan 12-18 months for EU vs. 6-9 months for FDA 510(k)
Post-market: EU requires ongoing clinical follow-up and regular PSUR; budget for this
Divergence increasing: US, EU, and UK (post-Brexit) regulatory approaches diverging. AI companies may need separate validation strategies per market.
Design safety interlocks for chemotherapy dose recommendation AI. What checks must occur before allowing dosing?
Answer: Multi-layered safety interlocks:
Layer 1: Input Validation
def validate_chemotherapy_inputs(patient_data):
    checks_failed = []
    # Range checks
    if patient_data['weight'] < 20 or patient_data['weight'] > 200:
        checks_failed.append("Weight out of plausible range")
    if patient_data['bsa'] < 0.5 or patient_data['bsa'] > 3.0:
        checks_failed.append("Body surface area out of range")
    # Required labs present and recent
    # (the '<lab>_date' field holds the age of the result in days)
    required_labs = ['creatinine', 'bilirubin', 'wbc', 'platelets', 'hemoglobin']
    for lab in required_labs:
        if lab not in patient_data:
            checks_failed.append(f"Missing required lab: {lab}")
        elif patient_data[f'{lab}_date'] > 7:  # days old
            checks_failed.append(f"Lab too old: {lab} ({patient_data[f'{lab}_date']} days)")
    # Organ function adequate for chemotherapy
    # (.get avoids a KeyError when the lab was already reported missing above)
    creatinine = patient_data.get('creatinine')
    if creatinine is not None and creatinine > 2.0:
        checks_failed.append("Renal function inadequate (Cr > 2.0)")
    bilirubin = patient_data.get('bilirubin')
    if bilirubin is not None and bilirubin > 2.0:
        checks_failed.append("Hepatic function inadequate (Bili > 2.0)")
    if checks_failed:
        raise ValueError(f"Input validation failed: {checks_failed}")
    return True
Layer 2: Dose Bounds Checking
def validate_dose_recommendation(drug, dose, patient_bsa):
    # Check against standard dosing ranges
    standard_doses = {
        'doxorubicin': {'min': 40, 'max': 75, 'unit': 'mg/m2'},
        'cyclophosphamide': {'min': 500, 'max': 1500, 'unit': 'mg/m2'},
        # etc.
    }
    # A drug with no reference range cannot be bounds-checked, so fail closed
    if drug not in standard_doses:
        raise ValueError(f"No standard dosing range on file for {drug}")
    limits = standard_doses[drug]
    dose_per_bsa = dose / patient_bsa
    if dose_per_bsa < limits['min']:
        raise ValueError(f"Dose below standard range: {dose_per_bsa} < {limits['min']} {limits['unit']}")
    if dose_per_bsa > limits['max']:
        raise ValueError(f"⚠️ CRITICAL: Dose above standard range: {dose_per_bsa} > {limits['max']} {limits['unit']}")
    return True
Layer 3: Cross-Check with Clinical Guidelines
def cross_check_with_guidelines(diagnosis, regimen, patient):
    # Verify regimen appropriate for diagnosis
    # (get_nccn_guidelines is a placeholder for a guideline lookup)
    approved_regimens = get_nccn_guidelines(diagnosis)
    if regimen not in approved_regimens:
        return {
            'alert': 'WARNING',
            'message': f'Regimen {regimen} not in NCCN guidelines for {diagnosis}',
            'action': 'require_pharmacist_approval'
        }
    # Check for contraindications
    if 'doxorubicin' in regimen and patient['cardiac_ejection_fraction'] < 50:
        return {
            'alert': 'CONTRAINDICATION',
            'message': 'Doxorubicin contraindicated with EF < 50%',
            'action': 'block_order'
        }
    return {'alert': 'PASS'}
Layer 4: Drug Interaction Check
def check_drug_interactions(chemotherapy_drugs, current_medications):
    critical_interactions = []
    # load_interaction_database is a placeholder for an interaction lookup service
    interaction_db = load_interaction_database()
    for chemo_drug in chemotherapy_drugs:
        for med in current_medications:
            interaction = interaction_db.check(chemo_drug, med)
            if interaction['severity'] == 'severe':
                critical_interactions.append({
                    'drug1': chemo_drug,
                    'drug2': med,
                    'interaction': interaction['description'],
                    'recommendation': interaction['recommendation']
                })
    if critical_interactions:
        return {
            'alert': 'CRITICAL INTERACTION',
            'interactions': critical_interactions,
            'action': 'require_oncologist_review'
        }
    return {'alert': 'PASS'}
Layer 5: Mandatory Human Review
class ChemotherapyAI:
    def recommend_dose(self, patient_data):
        # Inputs are validated BEFORE the model sees them
        validate_chemotherapy_inputs(patient_data)
        # AI makes recommendation
        recommendation = self.model.predict(patient_data)
        # All safety interlocks MUST pass
        validate_dose_recommendation(
            recommendation['drug'],
            recommendation['dose'],
            patient_data['bsa']
        )
        guideline_check = cross_check_with_guidelines(
            patient_data['diagnosis'],
            recommendation['regimen'],
            patient_data
        )
        interaction_check = check_drug_interactions(
            recommendation['drugs'],
            patient_data['current_meds']
        )
        # Add mandatory review flags
        recommendation['require_oncologist_approval'] = True
        recommendation['require_pharmacist_verification'] = True
        recommendation['safety_checks'] = {
            'input_validation': 'PASS',
            'dose_bounds': 'PASS',
            'guidelines': guideline_check,
            'interactions': interaction_check
        }
        # Surface a contraindication as an explicit block so it cannot be missed
        recommendation['order_blocked'] = guideline_check.get('action') == 'block_order'
        # NEVER allow fully autonomous chemotherapy ordering
        recommendation['autonomous_ordering'] = False
        return recommendation
Key safety principles:
Chemotherapy is Level 3 automation at most (AI recommends, human approves), never Level 4-5.
Ethics vs. Safety: When is it ethical to deploy AI with known failure modes? What risk level is acceptable?
Liability: When a safely-designed AI still causes harm (no negligence, just residual risk), who should be liable: developer, hospital, clinician, or patient assumes risk?
Safety standards: Should healthcare AI face higher safety standards than commercial aviation? Why or why not? What about compared to pharmaceuticals?
Innovation vs. Safety: How do we balance innovation velocity (rapid deployment of beneficial AI) with safety rigor (extensive validation)? Is FDA too slow or too fast?
Proof of safety: Can we ever prove AI is “safe enough”? What’s the evidentiary standard? How does this compare to drugs (RCTs) or procedures (surgical training)?
Automation levels: As AI becomes more capable, should we allow higher automation (Level 4-5), or should healthcare always require human decision-maker? What about non-critical decisions?
Patient agency: Should patients have right to opt out of AI-assisted care? Right to know when AI was used? Right to appeal AI decisions?
Failure transparency: Should AI failures be publicly disclosed (like aviation accidents)? Or does this create liability disincentives and chill innovation?
Safety-Critical Computer Systems by Neil Storey - Foundational text on safety-critical systems engineering
Engineering a Safer World by Nancy Leveson - Systems-theoretic approach to safety (STAMP methodology)
Medical Device Software Development - IEC 62304 compliance and software validation
Normal Accidents by Charles Perrow - Why complex systems fail
Amodei et al., 2016, “Concrete Problems in AI Safety” - Taxonomy of AI safety challenges
Leveson & Turner, 1993, “An Investigation of the Therac-25 Accidents” - Classic case study: radiation therapy software involved in six overdose accidents, several of them fatal
Wong et al., 2021, “External Validation of a Widely Implemented Proprietary Sepsis Prediction Model” - Epic sepsis model failure analysis
Sendak et al., 2020, “Real-World Integration of a Sepsis Deep Learning Tool” - Deployment challenges
Finlayson et al., 2021, “The Clinician and Dataset Shift in Artificial Intelligence” - Data drift causing performance degradation
Scott et al., 2021, “Post-market Surveillance of Clinical Decision Support Tools” - Framework for continuous monitoring
Gerke et al., 2020, “The Need for a System View to Regulate AI/ML-Based Medical Devices” - Regulatory gaps
Parikh et al., 2019, “Addressing Bias in AI for Health Care” - Regulatory approaches to fairness
Finlayson et al., 2019, “Adversarial Attacks on Medical ML” - Adversarial examples in healthcare
Madry et al., 2018, “Towards Deep Learning Models Resistant to Adversarial Attacks” - Defense strategies
IEC 62304:2006+AMD1:2015 - Medical device software lifecycle
ISO 14971:2019 - Risk management for medical devices
ISO 13485:2016 - Quality management systems for medical devices
Foolbox - Python library for adversarial robustness testing
CleverHans - Library for adversarial examples
AI Verify - Toolkit for validating AI systems
FMEA-AI - Failure mode analysis templates for AI
Risk Register Templates - ISO 14971 risk management
Evidently AI - Open-source ML monitoring (drift detection)
Whylabs - Data and ML monitoring platform
Fiddler - AI observability and monitoring
MIT 6.S191: Introduction to Deep Learning - Safety Lecture - AI safety fundamentals
Stanford CS329S: Machine Learning Systems Design - Deployment and monitoring
Fast.ai: Practical Deep Learning - Ethics Module - Practical AI ethics and safety
Next: Chapter 12: Deployment, Monitoring, and Maintenance covers the “how” of deploying AI that has passed safety validation.