Appendix G — Vendor AI Evaluation Toolkit

Purpose

This appendix provides a comprehensive, actionable framework for evaluating commercial AI products for healthcare and public health settings. Use this before purchasing or deploying any vendor AI system.

Who should use this:
  • Health department leaders considering AI tools
  • Hospital administrators evaluating AI vendors
  • Procurement officers assessing AI products
  • Public health practitioners making technology decisions
  • IRB/ethics committees reviewing AI deployments

What you’ll find:
  • Structured evaluation scorecards
  • Red flag identification guides
  • Sample questions to ask vendors
  • Decision frameworks for Go/No-Go
  • Procurement contract language recommendations


Introduction: Why You Need This Checklist

The AI Morgue (Appendix E) documented $100M+ in failed AI investments. Many of those failures were predictable: warning signs were ignored, vendor claims went unquestioned, and due diligence was insufficient.

This toolkit helps you avoid those mistakes.

The Vendor-Buyer Information Asymmetry

The vendor knows:
  • Where the model was trained
  • What its limitations are
  • Where external validation failed
  • What patient populations it doesn’t work for
  • What the false alarm rate is in real-world use

You know:
  • What the vendor tells you in sales materials
  • (Often: not much more)

This checklist helps level the playing field.


Quick Reference: The 6-Domain Evaluation Framework

Use this framework to systematically evaluate any health AI vendor:

| Domain | Key Questions | Red Flags |
|---|---|---|
| 1. Technical Validation | Is there external validation? Peer-reviewed publication? | Internal validation only; no publications; vendor-funded studies |
| 2. Clinical Safety | Safety testing? Adverse events tracked? | No safety data; no mention of harms; “100% safe” claims |
| 3. Fairness & Equity | Performance across demographics? Bias audits? | No fairness testing; “we don’t use race so it’s fair” |
| 4. Privacy & Security | HIPAA compliant? BAA provided? Encryption? | Vague privacy claims; no BAA; data sent to foreign servers |
| 5. Workflow Integration | End-user testing? Training provided? Support available? | No user research; “plug and play”; minimal training |
| 6. Business Viability | Company financially stable? Other customers? Roadmap? | Startup with no revenue; no references; unclear roadmap |

Domain 1: Technical Validation 🔍

The Questions to Ask

Critical Validation Questions
  1. Training Data
  2. Validation Studies
  3. Performance Metrics
  4. Generalizability

Red Flags 🚩

Proceed with extreme caution if:

  • “Validated on 100,000 patients” - But all from the same institution (not external validation)
  • “98% accuracy” - On cherry-picked test set; no external validation
  • “Deployed in 150+ hospitals” - Deployment ≠ Effectiveness; no outcome data
  • “Proprietary validation” - No peer-reviewed publications; trust us
  • “Can’t share validation data” - Red flag for lack of transparency
  • Internal validation only - Models always perform better on development data
  • Vendor-funded validation studies - Conflicts of interest

Scoring Rubric

Rate each item (0-2 points):

| Criterion | 0 Points | 1 Point | 2 Points |
|---|---|---|---|
| External Validation | None or internal only | 1 site | 3+ independent sites |
| Peer-Reviewed Publication | No | Conference abstract | Full peer-reviewed paper |
| Independent Researchers | Vendor employees only | Partial independence | Fully independent validation |
| Performance Reporting | AUC only | Sensitivity/specificity | Full confusion matrix + subgroup analysis |
| Generalizability Evidence | No evidence | Similar populations | Validated on your patient population |

Scoring:
  • 8-10 points: Strong validation evidence
  • 5-7 points: Moderate evidence; pilot testing recommended
  • 0-4 points: Insufficient evidence; do NOT deploy
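
The rubric above lends itself to a simple scorecard. The sketch below (criterion names and scores are illustrative, not a real vendor assessment) sums the 0-2 point items and maps the total to the bands above:

```python
# Interpretation bands match the Domain 1 scoring guide.
RUBRIC_BANDS = [
    (8, "Strong validation evidence"),
    (5, "Moderate evidence; pilot testing recommended"),
    (0, "Insufficient evidence; do NOT deploy"),
]

def score_domain(item_scores):
    """Sum 0-2 point rubric items and map the total to an interpretation band."""
    if any(s not in (0, 1, 2) for s in item_scores.values()):
        raise ValueError("each rubric item must score 0, 1, or 2")
    total = sum(item_scores.values())
    for floor, label in RUBRIC_BANDS:
        if total >= floor:
            return total, label

# Illustrative vendor: single-site external validation, published, mixed evidence
example = {
    "external_validation": 1,
    "peer_reviewed_publication": 2,
    "independent_researchers": 1,
    "performance_reporting": 1,
    "generalizability_evidence": 0,
}
print(score_domain(example))  # → (5, 'Moderate evidence; pilot testing recommended')
```

The same pattern applies to the rubrics in Domains 2-6; only the band labels change.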


Domain 2: Clinical Safety 🏥

The Questions to Ask

Critical Safety Questions
  1. Safety Testing
  2. Clinical Outcomes
  3. Alert Burden
  4. Human Factors

Red Flags 🚩

Proceed with extreme caution if:

  • “No reported adverse events” - Likely means no monitoring system, not that it’s safe
  • “Clinicians love it” - No quantitative data on alert fatigue or response rates
  • “Seamless integration” - No workflow analysis or user research
  • High false positive rate (>20%) - Will cause alert fatigue
  • No outcome data - Only technical performance (AUC) reported
  • “100% safe” - Overconfident claims; every system has failure modes
  • No failure mode analysis - Every AI system can fail; what happens when it does?

Lessons from Epic Sepsis Model

The Epic sepsis model had:
  • ✅ High AUC (0.76-0.83): impressive technical performance
  • ❌ But an 88% false positive rate in external validation
  • ❌ No evidence of improved patient outcomes
  • ❌ Alert fatigue that caused clinicians to ignore alerts

Don’t repeat this mistake. Demand outcome data, not just AUC.
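
The arithmetic behind that failure is worth internalizing: by Bayes’ rule, a model with respectable sensitivity and specificity still floods clinicians with false alarms when the condition is rare. The numbers below are hypothetical, not Epic’s actual operating point:

```python
def ppv(sensitivity, specificity, prevalence):
    """Positive predictive value (precision) of an alert, via Bayes' rule."""
    true_pos = sensitivity * prevalence
    false_pos = (1 - specificity) * (1 - prevalence)
    return true_pos / (true_pos + false_pos)

# Hypothetical sepsis-like scenario: 5% prevalence, 90% sensitivity, 85% specificity
p = ppv(0.90, 0.85, 0.05)
print(f"PPV = {p:.0%}; {1 - p:.0%} of alerts are false alarms")  # → PPV = 24%; 76% of alerts are false alarms
```

Ask the vendor for the real-world alert PPV directly; if they can only quote AUC, run this calculation yourself using your own population’s prevalence.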

Scoring Rubric

Rate each item (0-2 points):

| Criterion | 0 Points | 1 Point | 2 Points |
|---|---|---|---|
| Safety Testing | None mentioned | Some testing | Comprehensive failure mode analysis |
| Outcome Evidence | None | Retrospective analysis | Prospective RCT or quasi-experimental |
| Alert Burden | Unknown | <50% false positive | <20% false positive + manageable alert rate |
| Human Factors | No testing | Limited usability testing | Comprehensive workflow analysis |
| Adverse Event Monitoring | No system | Passive reporting | Active surveillance system |

Scoring:
  • 8-10 points: Strong safety evidence
  • 5-7 points: Moderate evidence; close monitoring required
  • 0-4 points: Insufficient safety evidence; do NOT deploy


Domain 3: Fairness & Equity ⚖️

The Questions to Ask

Critical Fairness Questions
  1. Bias Testing
  2. Training Data Representativeness
  3. Proxy Variables
  4. Health Equity Impact

Red Flags 🚩

Proceed with extreme caution if:

  • “We don’t use race as a feature, so it’s fair” - Fairness through unawareness doesn’t work; race is correlated with many other features
  • “Our algorithm is objective” - Algorithms encode human choices and societal biases
  • “High accuracy means fair” - Accuracy ≠ Fairness (see OPTUM case)
  • No fairness testing - Bias is default; fairness must be tested, not assumed
  • Using costs as proxy for health needs - See OPTUM case: costs reflect access barriers, not just health needs
  • Trained on non-representative data - Homogeneous training data (academic medical centers only, commercially insured only, etc.)

Lessons from OPTUM Algorithmic Bias

The OPTUM algorithm:
  • ✅ Accurately predicted healthcare costs: high technical performance
  • ❌ But costs ≠ health needs, especially for Black patients
  • ❌ Systematically underestimated Black patients’ health needs
  • ❌ 46.5% more Black patients should have been enrolled for equity

Don’t repeat this mistake. Test for fairness explicitly.

Scoring Rubric

Rate each item (0-2 points):

| Criterion | 0 Points | 1 Point | 2 Points |
|---|---|---|---|
| Fairness Audit | None conducted | Internal audit | Independent external audit |
| Subgroup Analysis | No reporting | Some subgroups | Comprehensive (race, ethnicity, age, sex, SES) |
| Training Data Diversity | Homogeneous | Somewhat diverse | Highly representative of your population |
| Proxy Variable Assessment | No assessment | Acknowledged | Validated against direct outcomes |
| Equity Impact Plan | No plan | Monitoring planned | Active mitigation strategies |

Scoring:
  • 8-10 points: Strong fairness evidence
  • 5-7 points: Moderate evidence; continuous monitoring essential
  • 0-4 points: High bias risk; do NOT deploy without mitigation


Domain 4: Privacy & Security 🔒

The Questions to Ask

Critical Privacy & Security Questions
  1. Regulatory Compliance
  2. Data Handling
  3. Security Measures
  4. Privacy by Design
  5. Transparency & Accountability

Red Flags 🚩

Proceed with extreme caution if:

  • Refuses to sign BAA - Non-starter for HIPAA compliance
  • “We’ll sign BAA later” - Must be in place BEFORE data sharing
  • Vague about data storage location - “Cloud” is not specific enough; which cloud? Which region?
  • Data sent to foreign servers - Compliance and privacy risks
  • “We anonymize data so HIPAA doesn’t apply” - Re-identification risk; HIPAA still applies to most health data
  • No SOC 2 or security certification - Unvetted security practices
  • “Trust us with your data” - Trust requires verification
  • Unclear data retention/deletion - Your data may persist indefinitely

Lessons from DeepMind & TraceTogether

DeepMind Streams:
  • ❌ Collected entire medical histories (not just kidney-related data)
  • ❌ Patients not informed
  • ❌ No proper legal basis
  • ❌ Result: ruled unlawful by the UK ICO

TraceTogether:
  • ❌ “Data only for contact tracing” → became a crime-fighting tool
  • ❌ Privacy promises broken
  • ❌ Trust destroyed

Lesson: Privacy promises must be legally binding and technically enforced.

Scoring Rubric

Rate each item (0-2 points):

| Criterion | 0 Points | 1 Point | 2 Points |
|---|---|---|---|
| HIPAA Compliance | No BAA or refuses | Will sign BAA | BAA + SOC 2 + HITRUST |
| Data Minimization | Collects everything | Some minimization | Strict minimization; edge deployment option |
| Security Certifications | None | SOC 2 Type I | SOC 2 Type II + penetration testing |
| Transparency | Vague policies | Clear policies | Detailed + third-party audit |
| Data Control | Vendor retains indefinitely | Retention period defined | You control data; deletion guaranteed |

Scoring:
  • 8-10 points: Strong privacy & security
  • 5-7 points: Moderate; additional safeguards needed
  • 0-4 points: Unacceptable risk; do NOT proceed


Domain 5: Workflow Integration 🔄

The Questions to Ask

Critical Workflow Integration Questions
  1. User Research
  2. Implementation
  3. Training & Support
  4. Customization
  5. Monitoring & Feedback

Red Flags 🚩

Proceed with extreme caution if:

  • “Plug and play” - Healthcare is complex; no system is truly plug-and-play
  • “Works with all EHRs” - Each EHR integration is custom; this is implausible
  • “No training needed” - Users always need training for clinical decision support tools
  • “One-size-fits-all” - Different institutions have different workflows and patient populations
  • “We can implement in 2 weeks” - Unrealistic for complex systems; implementation takes months
  • No user research - Designed in isolation from actual clinical workflows
  • Minimal support - Email-only support; no phone; no dedicated account manager
  • Black box, no customization - Can’t adjust thresholds or workflows to fit your institution

Lessons from Google Health India

Google’s diabetic retinopathy AI:
  • ✅ 96% accuracy in the lab with research-grade cameras
  • ❌ But 55% of images were ungradable in the field with portable cameras
  • ❌ Nurses couldn’t operate the system effectively (2-hour training was inadequate)
  • ❌ 5 min/patient of workflow disruption collapsed clinics
  • ❌ No offline mode; internet connectivity required

Lesson: Lab performance ≠ Field performance. Workflow integration matters.

Scoring Rubric

Rate each item (0-2 points):

| Criterion | 0 Points | 1 Point | 2 Points |
|---|---|---|---|
| User Research | None | Some user testing | Extensive ethnographic research |
| EHR Integration | No integration or manual entry | Some EHR support | Native integration with your EHR |
| Training Program | Minimal (<2 hours) | Half-day training | Comprehensive with ongoing support |
| Customization | Black box, no customization | Limited adjustments | Highly customizable |
| Support Quality | Email only | Email + phone | Dedicated account manager + on-site support |

Scoring:
  • 8-10 points: Strong workflow integration
  • 5-7 points: Moderate; expect implementation challenges
  • 0-4 points: High risk of failure; don’t proceed without pilot


Domain 6: Business Viability 💼

The Questions to Ask

Critical Business Viability Questions
  1. Company Stability
  2. Customer Base
  3. Product Maturity
  4. Regulatory Status
  5. Pricing & Contracts

Red Flags 🚩

Proceed with extreme caution if:

  • Early-stage startup, no revenue - High risk of going out of business; your investment lost
  • Can’t provide customer references - No one willing to vouch for them; bad sign
  • Version 1.0 product - Expect bugs and instability; you’re the beta tester
  • Vague about pricing - “It depends”; no transparency; potential for unexpected costs
  • Long-term contract with no exit clause - You’re locked in even if it doesn’t work
  • No regulatory clearance when required - Legal risk
  • Leadership with no healthcare experience - Tech team with no domain expertise

Lessons from IBM Watson for Oncology

IBM Watson:
  • ✅ IBM is a massive, stable company
  • ❌ But even IBM couldn’t make Watson work
  • ❌ $62M spent by MD Anderson; project failed
  • ❌ Hospitals that bought in early lost millions

Lesson: Big company ≠ Good product. Validation matters more than brand name.

Scoring Rubric

Rate each item (0-2 points):

| Criterion | 0 Points | 1 Point | 2 Points |
|---|---|---|---|
| Company Stability | Startup, no revenue | Funded startup or small profitable company | Established, profitable, >5 years |
| Customer Base | <5 customers or no references | 5-20 customers, some references | 20+ customers, multiple willing references |
| Product Maturity | V1.0 | V2-3 | V4+ with track record |
| Regulatory | No clearance (when required) | Clearance in progress | FDA cleared/approved |
| Pricing Transparency | Vague or hidden | Somewhat clear | Fully transparent, fair terms |

Scoring:
  • 8-10 points: Financially stable, low risk
  • 5-7 points: Moderate risk; negotiate favorable contract terms
  • 0-4 points: High financial risk; consider waiting


Putting It All Together: The Overall Evaluation Matrix

Use this to synthesize scores across all domains:

| Domain | Weight | Your Score (0-10) | Weighted Score |
|---|---|---|---|
| 1. Technical Validation | 25% | _____ | _____ |
| 2. Clinical Safety | 25% | _____ | _____ |
| 3. Fairness & Equity | 20% | _____ | _____ |
| 4. Privacy & Security | 15% | _____ | _____ |
| 5. Workflow Integration | 10% | _____ | _____ |
| 6. Business Viability | 5% | _____ | _____ |
| TOTAL | 100% | | _____ / 10 |

Decision Framework

Overall Score Interpretation:

  • 8.0 - 10.0: Proceed with Deployment
    • Strong evidence across all domains
    • Still: Start with pilot in 1-2 units before hospital-wide deployment
    • Monitor closely for first 6 months
  • 6.0 - 7.9: Conditional Deployment with Mitigation
    • Identify weak domains and create mitigation plans
    • Example: Weak fairness score → Implement continuous bias monitoring
    • Pilot with intensive monitoring
    • Re-evaluate after 6 months
  • 4.0 - 5.9: Do NOT Deploy; Negotiate Improvements
    • Too many gaps in evidence
    • Go back to vendor with requirements:
      • “We need external validation study before we’ll consider”
      • “We need fairness audit before we’ll proceed”
    • Consider waiting for product maturity
  • 0 - 3.9: Do NOT Deploy
    • Insufficient evidence
    • High risk of failure or harm
    • Wait for better products or invest in developing your own
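
As a sketch, the matrix and decision bands above can be combined into a small calculator. The domain names mirror the table; the example scores are invented for illustration:

```python
# Weights from the overall evaluation matrix (must sum to 1.0)
WEIGHTS = {
    "technical_validation": 0.25,
    "clinical_safety": 0.25,
    "fairness_equity": 0.20,
    "privacy_security": 0.15,
    "workflow_integration": 0.10,
    "business_viability": 0.05,
}

DECISION_BANDS = [  # (minimum overall score, interpretation)
    (8.0, "Proceed with Deployment"),
    (6.0, "Conditional Deployment with Mitigation"),
    (4.0, "Do NOT Deploy; Negotiate Improvements"),
    (0.0, "Do NOT Deploy"),
]

def overall_score(domain_scores):
    """Weighted average of per-domain scores (each 0-10)."""
    if set(domain_scores) != set(WEIGHTS):
        raise ValueError("score every domain exactly once")
    return sum(WEIGHTS[d] * s for d, s in domain_scores.items())

def decision(score):
    for floor, verdict in DECISION_BANDS:
        if score >= floor:
            return verdict

# Invented example scores for illustration
scores = {
    "technical_validation": 7,
    "clinical_safety": 6,
    "fairness_equity": 8,
    "privacy_security": 9,
    "workflow_integration": 5,
    "business_viability": 7,
}
total = overall_score(scores)
print(f"{total:.2f} → {decision(total)}")  # → 7.05 → Conditional Deployment with Mitigation
```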

Sample Questions for Vendor Meetings

Use these scripts to extract critical information:

Technical Validation Questions

Script:

> “Can you provide the peer-reviewed publication of your external validation study? We’d like to see performance metrics at institutions not involved in development, broken down by patient demographics.”

Follow-ups if vendor hesitates:
  • “If there’s no peer-reviewed external validation, when do you plan to conduct one?”
  • “Can you share the names of institutions where validation occurred so we can contact them?”
  • “What was the performance at hospitals most similar to ours?”

Fairness Questions

Script:

> “We’re concerned about health equity. Can you show us the fairness audit results? Specifically, performance by race, ethnicity, socioeconomic status, and insurance type?”

Follow-ups:
  • “If no fairness audit has been done, why not?”
  • “What is your plan for ongoing bias monitoring?”
  • “What happens if we discover bias after deployment?”

Privacy Questions

Script:

> “Walk us through exactly what data leaves our institution, where it goes, how it’s stored, and how we can verify this. Can we see the Business Associate Agreement and SOC 2 report?”

Follow-ups:
  • “What specific PHI elements does your system need?”
  • “Can the system work on-premise without sending data to the cloud?”
  • “What happens to our data if we terminate the contract?”

Safety Questions

Script:

> “What patient outcomes have improved at hospitals using your system? Can you provide data on mortality, length of stay, readmissions, or quality of life?”

Follow-ups if vendor only cites AUC:
  • “AUC is a technical metric. Has deployment improved patient outcomes?”
  • “What is the false positive rate in real-world use?”
  • “What happens when the model fails? What are the failure modes?”

Workflow Integration Questions

Script:

> “Has your system been tested with nurses and physicians at institutions like ours? What did the usability testing reveal?”

Follow-ups:
  • “What is the typical implementation timeline?”
  • “What ongoing support do you provide?”
  • “Can we speak with your customers about their implementation experience?”


Procurement Contract Language Recommendations

If you decide to proceed, include these provisions in your contract:

1. Performance Guarantees

Vendor guarantees that the AI system will achieve the following performance metrics
at [Institution Name] during the pilot period:

- AUC ≥ [threshold] (or other appropriate metric)
- False positive rate ≤ [threshold]%
- Alert burden ≤ [number] alerts per day per unit
- User satisfaction ≥ [threshold] (measured by survey)

If performance falls below these thresholds for [timeframe], [Institution] may
terminate the contract without penalty and receive full refund.

2. Fairness Requirements

Vendor warrants that the AI system has undergone bias testing and demonstrates
equitable performance across patient demographic groups (race, ethnicity, age, sex,
socioeconomic status).

Vendor will provide [Institution] with quarterly bias audit reports showing
performance metrics stratified by demographic subgroups.

If disparate impact is identified (performance difference >10% across groups),
Vendor will work with [Institution] to mitigate bias within [timeframe] or
[Institution] may terminate without penalty.
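
The >10% disparate-impact trigger in the clause above is easy to operationalize in your own monitoring. A minimal sketch (subgroup names and metric values are hypothetical):

```python
def max_disparity(metric_by_group):
    """Largest absolute gap in a performance metric between any two subgroups."""
    values = list(metric_by_group.values())
    return max(values) - min(values)

def disparate_impact(metric_by_group, threshold=0.10):
    """True if the gap exceeds the contractual threshold (default 10 points)."""
    return max_disparity(metric_by_group) > threshold

# Hypothetical quarterly audit: model sensitivity stratified by subgroup
sensitivity_by_group = {"group_a": 0.88, "group_b": 0.74, "group_c": 0.85}
print(disparate_impact(sensitivity_by_group))  # → True (a 0.14 gap triggers the clause)
```

In practice you would run this per metric (sensitivity, PPV, calibration) and attach confidence intervals, since small subgroups produce noisy gaps.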

3. Data Privacy & Security

Vendor agrees to:
- Sign Business Associate Agreement (BAA) prior to any PHI access
- Store all data in HIPAA-compliant infrastructure
- Encrypt data at rest (AES-256) and in transit (TLS 1.3+)
- Provide SOC 2 Type II audit report annually
- Not use [Institution] data for Vendor's own R&D without explicit written consent
- Delete all [Institution] data within 30 days of contract termination
- Provide audit logs of all data access quarterly

4. Validation & Monitoring

[Institution] has the right to:
- Conduct independent validation studies of the AI system
- Publish validation results (positive or negative)
- Access model performance dashboards in real-time
- Receive quarterly performance reports from Vendor

Vendor will provide:
- Technical documentation for independent validation
- API access for performance monitoring
- Support for [Institution]'s evaluation efforts

5. Liability & Indemnification

Vendor agrees to indemnify [Institution] for:
- Any patient harm caused by AI system errors or failures
- Regulatory fines resulting from Vendor's non-compliance (HIPAA, etc.)
- Data breaches resulting from Vendor's security failures

Liability cap: [Amount] (no less than annual contract value x 5)

Vendor maintains professional liability insurance of at least [Amount].

6. Termination Rights

[Institution] may terminate this agreement:
- For cause (breach of contract): Immediate termination, full refund
- For convenience: 90-day notice, pro-rated refund for remaining term
- For safety concerns: Immediate termination if AI system poses patient safety risk
- For non-performance: If system fails to meet performance guarantees after [timeframe]

Upon termination:
- Vendor must delete all [Institution] data within 30 days
- Vendor must provide data export in standard format (CSV, FHIR, etc.)
- [Institution] retains all rights to its data

Pilot Implementation Plan

Even after thorough evaluation, start small:

Phase 1: Controlled Pilot (Months 1-3)

Scope:
  • 1-2 clinical units (ICU, primary care clinic, etc.)
  • 10-50 patients/day
  • Intensive monitoring

Metrics to Track:
  • Technical performance (AUC, sensitivity, specificity, PPV)
  • Alert burden (alerts/day, false positive rate)
  • User experience (satisfaction, time spent per alert, response rate)
  • Workflow impact (time added per patient)
  • Clinical outcomes (compare to baseline)
  • Equity impact (outcomes by race, ethnicity, SES)

Success Criteria (Define Before Pilot):
  • Technical: AUC ≥ [threshold], FP rate ≤ [threshold]%
  • User: Satisfaction ≥ [threshold]/5, Response rate ≥ [threshold]%
  • Outcome: [Primary outcome] improved by ≥ [threshold]% vs baseline
  • Equity: No performance disparities >10% across groups

Go/No-Go Decision:
  • ✅ Proceed to Phase 2 if ALL success criteria met
  • ⚠️ Iterate/adjust if some criteria met (address specific gaps)
  • ❌ Terminate if major criteria not met (don’t throw good money after bad)
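
One way to keep the pilot honest is to encode the pre-registered success criteria before any data arrives. A sketch with placeholder thresholds and metric names:

```python
# Pre-registered criteria: metric → (comparison, threshold). Values are placeholders.
CRITERIA = {
    "auc":               (">=", 0.75),
    "false_pos_rate":    ("<=", 0.20),
    "user_satisfaction": (">=", 4.0),   # out of 5
    "response_rate":     (">=", 0.80),
}

def evaluate_pilot(observed):
    """Return pass/fail for each pre-registered criterion."""
    results = {}
    for metric, (op, threshold) in CRITERIA.items():
        value = observed[metric]
        results[metric] = value >= threshold if op == ">=" else value <= threshold
    return results

# Hypothetical pilot results: everything passes except alert burden
observed = {"auc": 0.78, "false_pos_rate": 0.25,
            "user_satisfaction": 4.2, "response_rate": 0.83}
results = evaluate_pilot(observed)
verdict = "Proceed to Phase 2" if all(results.values()) else "Iterate or terminate"
print(verdict)  # → Iterate or terminate
```

Defining the criteria in code (or any fixed artifact) before the pilot starts makes it harder to move the goalposts after the fact.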

Phase 2: Expanded Pilot (Months 4-9)

Scope:
  • 5-10 units
  • Continued intensive monitoring

Objectives:
  • Validate Phase 1 results at larger scale
  • Test in diverse clinical settings
  • Identify implementation challenges
  • Refine workflows

Phase 3: Full Deployment (Month 10+)

Scope: Hospital-wide or system-wide

Requirements:
  • Phase 2 demonstrated sustained benefit
  • User training completed
  • Ongoing monitoring system in place
  • Regular re-auditing planned (quarterly)

Never skip the pilot.


Real-World Case Study: Using the Checklist

Example: Evaluating a Hypothetical Sepsis Prediction Tool

Vendor Claims:
  • “AI predicts sepsis 6 hours before clinical recognition”
  • “90% sensitivity, 85% specificity”
  • “Deployed in 200+ hospitals”
  • “$500K/year for hospital-wide license”

Your Evaluation Using This Checklist:

Domain 1: Technical Validation (Score: 4/10)

  • ✅ Published in peer-reviewed journal
  • ❌ Internal validation only (same hospital group)
  • ❌ No independent external validation
  • ❌ 90% sensitivity/85% specificity in paper, but what about external sites?
  • Red flag: “Deployed in 200+ hospitals” ≠ Evidence of effectiveness (Epic!)

Domain 2: Clinical Safety (Score: 3/10)

  • ✅ Safety mentioned in paper
  • ❌ No prospective outcome studies
  • ❌ No data on whether deployment reduced mortality
  • ❌ False positive rate not reported
  • ❌ Alert burden unknown
  • Red flag: Only technical metrics (AUC), no patient outcomes

Domain 3: Fairness & Equity (Score: 2/10)

  • ❌ No fairness audit mentioned
  • ❌ No performance by race/ethnicity
  • ❌ When asked, vendor says “we don’t use race as a feature, so it’s fair”
  • Red flag: Fairness through unawareness (doesn’t work!)

Domain 4: Privacy & Security (Score: 7/10)

  • ✅ Will sign BAA
  • ✅ SOC 2 Type II certified
  • ✅ Data encrypted at rest and in transit
  • ⚠️ Data stored in vendor cloud (not on-premise option)

Domain 5: Workflow Integration (Score: 5/10)

  • ✅ Integrates with your EHR (Epic)
  • ⚠️ Implementation takes 3-6 months
  • ⚠️ Training: 2-hour online module
  • ❌ No customization; one-size-fits-all

Domain 6: Business Viability (Score: 8/10)

  • ✅ Established company, 7 years in business
  • ✅ 200 customers (they say)
  • ✅ Willing to provide 2 references
  • ⚠️ Pricing seems high ($500K/year)

Overall Weighted Score: 4.1 / 10

Decision: Do NOT Deploy; Negotiate Improvements
  • Insufficient validation (internal only)
  • No outcome evidence (AUC is not enough)
  • No fairness testing (high bias risk)
  • Workflow concerns (alert burden unknown)

Recommendation to Leadership:

> “We evaluated [Vendor]’s sepsis prediction tool using the AI Vendor Evaluation Framework. The system scores 4.1/10, below our threshold for deployment.
>
> Key concerns:
>
> - No external validation (validation only at vendor’s hospital group)
> - No evidence of improved patient outcomes (only technical metrics reported)
> - No fairness audit (risk of bias similar to the Epic sepsis model)
>
> We recommend:
>
> 1. Request an external validation study at independent hospitals
> 2. Request a fairness audit with performance by race/ethnicity
> 3. Pilot at 2-3 similar institutions before we consider adoption
> 4. Re-evaluate in 12 months if the vendor addresses these gaps
>
> Alternative: Invest in building our own sepsis prediction model using our data, which would be tailored to our patient population and workflows.”


Summary: Key Principles for Vendor Evaluation

  1. Trust, but verify - Vendor claims mean nothing without independent validation
  2. External validation is non-negotiable - Internal validation always looks better than real-world performance
  3. Outcomes > Accuracy - AUC doesn’t save lives; improved patient outcomes do
  4. Fairness testing is mandatory - Bias is the default; fairness must be proven
  5. Start small, scale slowly - Pilot → Evaluate → Scale only if successful
  6. Negotiate strong contracts - Performance guarantees, termination rights, data control
  7. You can say no - Bad AI is worse than no AI; don’t deploy systems that aren’t ready

The most important lesson: You are not obligated to buy a product just because it has “AI” in the name. Demand evidence. Ask hard questions. Walk away if the evidence isn’t there.

Your patients deserve better than unvalidated AI systems.



Remember: The best AI system is one that improves patient outcomes, operates fairly, respects privacy, integrates into workflows, and has strong evidence supporting its use. Don’t settle for less.