Appendix G — Vendor AI Evaluation Toolkit
This appendix provides a comprehensive, actionable framework for evaluating commercial AI products for healthcare and public health settings. Use this before purchasing or deploying any vendor AI system.
Who should use this:
- Health department leaders considering AI tools
- Hospital administrators evaluating AI vendors
- Procurement officers assessing AI products
- Public health practitioners making technology decisions
- IRB/ethics committees reviewing AI deployments

What you’ll find:
- Structured evaluation scorecards
- Red flag identification guides
- Sample questions to ask vendors
- Decision frameworks for Go/No-Go
- Procurement contract language recommendations
Introduction: Why You Need This Checklist
The AI Morgue (Appendix E) documented $100M+ in failed AI investments. Many failures were predictable. Warning signs existed but were ignored. Vendor claims went unquestioned. Due diligence was insufficient.
This toolkit helps you avoid those mistakes.
The Vendor-Buyer Information Asymmetry
Vendor knows:
- Where the model was trained
- What its limitations are
- Where external validation failed
- Which patient populations it doesn’t work for
- What the false alarm rate is in real-world use

You know:
- What the vendor tells you in sales materials
- (Often: not much more)
This checklist helps level the playing field.
Quick Reference: The 6-Domain Evaluation Framework
Use this framework to systematically evaluate any health AI vendor:
| Domain | Key Questions | Red Flags |
|---|---|---|
| 1. Technical Validation | Is there external validation? Peer-reviewed publication? | Internal validation only; no publications; vendor-funded studies |
| 2. Clinical Safety | Safety testing? Adverse events tracked? | No safety data; no mention of harms; “100% safe” claims |
| 3. Fairness & Equity | Performance across demographics? Bias audits? | No fairness testing; “we don’t use race so it’s fair” |
| 4. Privacy & Security | HIPAA compliant? BAA provided? Encryption? | Vague privacy claims; no BAA; data sent to foreign servers |
| 5. Workflow Integration | End-user testing? Training provided? Support available? | No user research; “plug and play”; minimal training |
| 6. Business Viability | Company financially stable? Other customers? Roadmap? | Startup with no revenue; no references; unclear roadmap |
Domain 1: Technical Validation 🔍
The Questions to Ask
- Training Data
- Validation Studies
- Performance Metrics
- Generalizability
Red Flags 🚩
Proceed with extreme caution if:
- ❌ “Validated on 100,000 patients” - But all from the same institution (not external validation)
- ❌ “98% accuracy” - On cherry-picked test set; no external validation
- ❌ “Deployed in 150+ hospitals” - Deployment ≠ Effectiveness; no outcome data
- ❌ “Proprietary validation” - No peer-reviewed publications; trust us
- ❌ “Can’t share validation data” - Red flag for lack of transparency
- ❌ Internal validation only - Models always perform better on development data
- ❌ Vendor-funded validation studies - Conflicts of interest
Scoring Rubric
Rate each item (0-2 points):
| Criterion | 0 Points | 1 Point | 2 Points |
|---|---|---|---|
| External Validation | None or internal only | 1-2 sites | 3+ independent sites |
| Peer-Reviewed Publication | No | Conference abstract | Full peer-reviewed paper |
| Independent Researchers | Vendor employees only | Partial independence | Fully independent validation |
| Performance Reporting | AUC only | Sensitivity/specificity | Full confusion matrix + subgroup analysis |
| Generalizability Evidence | No evidence | Similar populations | Validated on your patient population |
Scoring:
- 8-10 points: Strong validation evidence
- 5-7 points: Moderate evidence; pilot testing recommended
- 0-4 points: Insufficient evidence; do NOT deploy
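The rubric above rewards full confusion-matrix reporting rather than AUC alone. As a quick reference, here is a minimal Python sketch of the metrics you should ask a vendor to report; the counts are illustrative, not from any real vendor study:

```python
# Deriving the metrics the rubric asks for from a confusion matrix.
# The counts below are hypothetical placeholders.

def confusion_metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Compute the headline metrics a vendor should report alongside AUC."""
    return {
        "sensitivity": tp / (tp + fn),          # recall: share of true cases flagged
        "specificity": tn / (tn + fp),          # share of non-cases correctly cleared
        "ppv": tp / (tp + fp),                  # precision: share of alerts that are real
        "false_positive_rate": fp / (fp + tn),  # share of non-cases falsely flagged
    }

m = confusion_metrics(tp=80, fp=400, fn=20, tn=9500)
print({k: round(v, 3) for k, v in m.items()})
```

Note how this hypothetical model clears 96% specificity yet only about 1 in 6 of its alerts is a true case; that gap is exactly what AUC-only reporting hides.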
Domain 2: Clinical Safety 🏥
The Questions to Ask
- Safety Testing
- Clinical Outcomes
- Alert Burden
- Human Factors
Red Flags 🚩
Proceed with extreme caution if:
- ❌ “No reported adverse events” - Likely means no monitoring system, not that it’s safe
- ❌ “Clinicians love it” - No quantitative data on alert fatigue or response rates
- ❌ “Seamless integration” - No workflow analysis or user research
- ❌ High false positive rate (>20%) - Will cause alert fatigue
- ❌ No outcome data - Only technical performance (AUC) reported
- ❌ “100% safe” - Overconfident claims; every system has failure modes
- ❌ No failure mode analysis - Every AI system can fail; what happens when it does?
Lessons from Epic Sepsis Model
The Epic sepsis model had:
- ✅ High AUC (0.76-0.83): impressive technical performance
- ❌ An 88% false positive rate in external validation
- ❌ No evidence of improved patient outcomes
- ❌ Alert fatigue that led clinicians to ignore alerts
Don’t repeat this mistake. Demand outcome data, not just AUC.
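The Epic numbers are not a fluke: when a condition is rare, even a model with respectable sensitivity and specificity produces mostly false alarms. A short sketch using Bayes’ rule, with hypothetical numbers (an assumed 80%/80% operating point and 2% prevalence), makes the arithmetic concrete:

```python
# Illustration (hypothetical numbers): why a model with a decent AUC can still
# drown clinicians in false alarms when the target condition is rare.

def ppv(sensitivity: float, specificity: float, prevalence: float) -> float:
    """Positive predictive value via Bayes' rule."""
    true_pos = sensitivity * prevalence
    false_pos = (1 - specificity) * (1 - prevalence)
    return true_pos / (true_pos + false_pos)

# Plausible-looking operating point; assume sepsis in ~2% of monitored patients:
p = ppv(sensitivity=0.80, specificity=0.80, prevalence=0.02)
print(f"PPV = {p:.1%}  ->  {1 - p:.1%} of alerts are false positives")
```

At these assumed numbers the PPV is about 7.5%, meaning roughly 92% of alerts are false positives. Ask vendors for PPV at your prevalence, not just AUC.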
Scoring Rubric
Rate each item (0-2 points):
| Criterion | 0 Points | 1 Point | 2 Points |
|---|---|---|---|
| Safety Testing | None mentioned | Some testing | Comprehensive failure mode analysis |
| Outcome Evidence | None | Retrospective analysis | Prospective RCT or quasi-experimental |
| Alert Burden | Unknown | <50% false positive | <20% false positive + manageable alert rate |
| Human Factors | No testing | Limited usability testing | Comprehensive workflow analysis |
| Adverse Event Monitoring | No system | Passive reporting | Active surveillance system |
Scoring:
- 8-10 points: Strong safety evidence
- 5-7 points: Moderate evidence; close monitoring required
- 0-4 points: Insufficient safety evidence; do NOT deploy
Domain 3: Fairness & Equity ⚖️
The Questions to Ask
- Bias Testing
- Training Data Representativeness
- Proxy Variables
- Health Equity Impact
Red Flags 🚩
Proceed with extreme caution if:
- ❌ “We don’t use race as a feature, so it’s fair” - Fairness through unawareness doesn’t work; race is correlated with many other features
- ❌ “Our algorithm is objective” - Algorithms encode human choices and societal biases
- ❌ “High accuracy means fair” - Accuracy ≠ Fairness (see OPTUM case)
- ❌ No fairness testing - Bias is default; fairness must be tested, not assumed
- ❌ Using costs as proxy for health needs - See OPTUM case: costs reflect access barriers, not just health needs
- ❌ Trained on non-representative data - Homogeneous training data (academic medical centers only, commercially insured only, etc.)
Lessons from OPTUM Algorithmic Bias
The OPTUM algorithm:
- ✅ Accurately predicted healthcare costs: high technical performance
- ❌ But costs ≠ health needs, especially for Black patients
- ❌ Systematically underestimated Black patients’ health needs
- ❌ 46.5% more Black patients should have been enrolled for equitable care
Don’t repeat this mistake. Test for fairness explicitly.
Scoring Rubric
Rate each item (0-2 points):
| Criterion | 0 Points | 1 Point | 2 Points |
|---|---|---|---|
| Fairness Audit | None conducted | Internal audit | Independent external audit |
| Subgroup Analysis | No reporting | Some subgroups | Comprehensive (race, ethnicity, age, sex, SES) |
| Training Data Diversity | Homogeneous | Somewhat diverse | Highly representative of your population |
| Proxy Variable Assessment | No assessment | Acknowledged | Validated against direct outcomes |
| Equity Impact Plan | No plan | Monitoring planned | Active mitigation strategies |
Scoring:
- 8-10 points: Strong fairness evidence
- 5-7 points: Moderate evidence; continuous monitoring essential
- 0-4 points: High bias risk; do NOT deploy without mitigation
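The 10% disparity threshold used in the rubric (and again in the contract language later in this appendix) is straightforward to operationalize. A minimal sketch, with hypothetical subgroup sensitivities and placeholder group names:

```python
# Sketch (hypothetical data): flagging subgroup performance gaps larger than
# the 10% disparity threshold used in the rubric and contract language.

def max_disparity(metric_by_group: dict[str, float]) -> tuple[float, str, str]:
    """Return the largest absolute gap between any two subgroups,
    plus the worst- and best-performing group names."""
    items = sorted(metric_by_group.items(), key=lambda kv: kv[1])
    (lo_group, lo), (hi_group, hi) = items[0], items[-1]
    return hi - lo, lo_group, hi_group

# Placeholder subgroup sensitivities from a hypothetical fairness audit:
sensitivity_by_group = {"Group A": 0.81, "Group B": 0.68, "Group C": 0.79}
gap, worst, best = max_disparity(sensitivity_by_group)
if gap > 0.10:
    print(f"Disparate impact: {gap:.0%} gap between {worst} and {best}")
```

The same check should be run for every reported metric (sensitivity, PPV, false positive rate) and every demographic axis, not just one.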
Domain 4: Privacy & Security 🔒
The Questions to Ask
- Regulatory Compliance
- Data Handling
- Security Measures
- Privacy by Design
- Transparency & Accountability
Red Flags 🚩
Proceed with extreme caution if:
- ❌ Refuses to sign BAA - Non-starter for HIPAA compliance
- ❌ “We’ll sign BAA later” - Must be in place BEFORE data sharing
- ❌ Vague about data storage location - “Cloud” is not specific enough; which cloud? Which region?
- ❌ Data sent to foreign servers - Compliance and privacy risks
- ❌ “We anonymize data so HIPAA doesn’t apply” - Re-identification risk is real; HIPAA applies until data meets its Safe Harbor or Expert Determination de-identification standards
- ❌ No SOC 2 or security certification - Unvetted security practices
- ❌ “Trust us with your data” - Trust requires verification
- ❌ Unclear data retention/deletion - Your data may persist indefinitely
Lessons from DeepMind & TraceTogether
DeepMind Streams:
- ❌ Collected entire medical histories (not just kidney-related data)
- ❌ Patients not informed
- ❌ No proper legal basis
- ❌ Result: ruled unlawful by the UK ICO

TraceTogether:
- ❌ “Data only for contact tracing” → became a crime-fighting tool
- ❌ Privacy promises broken
- ❌ Trust destroyed
Lesson: Privacy promises must be legally binding and technically enforced.
Scoring Rubric
Rate each item (0-2 points):
| Criterion | 0 Points | 1 Point | 2 Points |
|---|---|---|---|
| HIPAA Compliance | No BAA or refuses | Will sign BAA | BAA + SOC 2 + HITRUST |
| Data Minimization | Collects everything | Some minimization | Strict minimization; edge deployment option |
| Security Certifications | None | SOC 2 Type I | SOC 2 Type II + penetration testing |
| Transparency | Vague policies | Clear policies | Detailed + third-party audit |
| Data Control | Vendor retains indefinitely | Retention period defined | You control data; deletion guaranteed |
Scoring:
- 8-10 points: Strong privacy & security
- 5-7 points: Moderate; additional safeguards needed
- 0-4 points: Unacceptable risk; do NOT proceed
Domain 5: Workflow Integration 🔄
The Questions to Ask
- User Research
- Implementation
- Training & Support
- Customization
- Monitoring & Feedback
Red Flags 🚩
Proceed with extreme caution if:
- ❌ “Plug and play” - Healthcare is complex; no system is truly plug-and-play
- ❌ “Works with all EHRs” - Each EHR integration is custom; this is implausible
- ❌ “No training needed” - Users always need training for clinical decision support tools
- ❌ “One-size-fits-all” - Different institutions have different workflows and patient populations
- ❌ “We can implement in 2 weeks” - Unrealistic for complex systems; implementation takes months
- ❌ No user research - Designed in isolation from actual clinical workflows
- ❌ Minimal support - Email-only support; no phone; no dedicated account manager
- ❌ Black box, no customization - Can’t adjust thresholds or workflows to fit your institution
Lessons from Google Health India
Google’s diabetic retinopathy AI:
- ✅ 96% accuracy in the lab with research-grade cameras
- ❌ But 55% of images ungradable in the field with portable cameras
- ❌ Nurses couldn’t operate the system effectively (2-hour training was inadequate)
- ❌ A 5 min/patient workflow disruption overwhelmed clinics
- ❌ No offline mode; internet connectivity required
Lesson: Lab performance ≠ Field performance. Workflow integration matters.
Scoring Rubric
Rate each item (0-2 points):
| Criterion | 0 Points | 1 Point | 2 Points |
|---|---|---|---|
| User Research | None | Some user testing | Extensive ethnographic research |
| EHR Integration | No integration or manual entry | Some EHR support | Native integration with your EHR |
| Training Program | Minimal (<2 hours) | Half-day training | Comprehensive with ongoing support |
| Customization | Black box, no customization | Limited adjustments | Highly customizable |
| Support Quality | Email only | Email + phone | Dedicated account manager + on-site support |
Scoring:
- 8-10 points: Strong workflow integration
- 5-7 points: Moderate; expect implementation challenges
- 0-4 points: High risk of failure; don’t proceed without a pilot
Domain 6: Business Viability 💼
The Questions to Ask
- Company Stability
- Customer Base
- Product Maturity
- Regulatory Status
- Pricing & Contracts
Red Flags 🚩
Proceed with extreme caution if:
- ❌ Early-stage startup, no revenue - High risk of going out of business; your investment lost
- ❌ Can’t provide customer references - No one willing to vouch for them; bad sign
- ❌ Version 1.0 product - Expect bugs and instability; you’re the beta tester
- ❌ Vague about pricing - “It depends”; no transparency; potential for unexpected costs
- ❌ Long-term contract with no exit clause - You’re locked in even if it doesn’t work
- ❌ No regulatory clearance when required - Legal risk
- ❌ Leadership with no healthcare experience - Tech team with no domain expertise
Lessons from IBM Watson for Oncology
IBM Watson:
- ✅ IBM is a massive, stable company
- ❌ But even IBM couldn’t make Watson work
- ❌ MD Anderson spent $62M before the project failed
- ❌ Hospitals that bought in early lost millions
Lesson: Big company ≠ Good product. Validation matters more than brand name.
Scoring Rubric
Rate each item (0-2 points):
| Criterion | 0 Points | 1 Point | 2 Points |
|---|---|---|---|
| Company Stability | Startup, no revenue | Funded startup or small profitable company | Established, profitable, >5 years |
| Customer Base | <5 customers or no references | 5-20 customers, some references | 20+ customers, multiple willing references |
| Product Maturity | V1.0 | V2-3 | V4+ with track record |
| Regulatory | No clearance (when required) | Clearance in progress | FDA cleared/approved |
| Pricing Transparency | Vague or hidden | Somewhat clear | Fully transparent, fair terms |
Scoring:
- 8-10 points: Financially stable, low risk
- 5-7 points: Moderate risk; negotiate favorable contract terms
- 0-4 points: High financial risk; consider waiting
Putting It All Together: The Overall Evaluation Matrix
Use this to synthesize scores across all domains:
| Domain | Weight | Your Score (0-10) | Weighted Score |
|---|---|---|---|
| 1. Technical Validation | 25% | _____ | _____ |
| 2. Clinical Safety | 25% | _____ | _____ |
| 3. Fairness & Equity | 20% | _____ | _____ |
| 4. Privacy & Security | 15% | _____ | _____ |
| 5. Workflow Integration | 10% | _____ | _____ |
| 6. Business Viability | 5% | _____ | _____ |
| TOTAL | 100% |  | _____ / 10 |
Decision Framework
Overall Score Interpretation:
- 8.0 - 10.0: Proceed with Deployment
- Strong evidence across all domains
- Still: Start with pilot in 1-2 units before hospital-wide deployment
- Monitor closely for first 6 months
- 6.0 - 7.9: Conditional Deployment with Mitigation
- Identify weak domains and create mitigation plans
- Example: Weak fairness score → Implement continuous bias monitoring
- Pilot with intensive monitoring
- Re-evaluate after 6 months
- 4.0 - 5.9: Do NOT Deploy; Negotiate Improvements
- Too many gaps in evidence
- Go back to vendor with requirements:
- “We need external validation study before we’ll consider”
- “We need fairness audit before we’ll proceed”
- Consider waiting for product maturity
- 0 - 3.9: Do NOT Deploy
- Insufficient evidence
- High risk of failure or harm
- Wait for better products or invest in developing your own
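The matrix and decision bands above can be expressed directly as code, which is handy when comparing several vendors side by side. The domain scores below are illustrative placeholders, not real evaluations:

```python
# Sketch: the weighted evaluation matrix and decision bands as code.
# Domain scores (0-10 each) are hypothetical.

WEIGHTS = {
    "technical_validation": 0.25,
    "clinical_safety": 0.25,
    "fairness_equity": 0.20,
    "privacy_security": 0.15,
    "workflow_integration": 0.10,
    "business_viability": 0.05,
}

def overall_score(domain_scores: dict[str, float]) -> float:
    """Weighted sum over the six domains (result on a 0-10 scale)."""
    assert set(domain_scores) == set(WEIGHTS), "score every domain"
    return sum(WEIGHTS[d] * s for d, s in domain_scores.items())

def decision(score: float) -> str:
    """Map an overall score to the decision bands above."""
    if score >= 8.0:
        return "Proceed with deployment (pilot first)"
    if score >= 6.0:
        return "Conditional deployment with mitigation"
    if score >= 4.0:
        return "Do NOT deploy; negotiate improvements"
    return "Do NOT deploy"

scores = {"technical_validation": 4, "clinical_safety": 3, "fairness_equity": 2,
          "privacy_security": 7, "workflow_integration": 5, "business_viability": 8}
total = overall_score(scores)
print(f"{total:.1f}/10 -> {decision(total)}")
```

Keeping the weights in one place also makes it easy to document and justify any institution-specific reweighting.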
Sample Questions for Vendor Meetings
Use these scripts to extract critical information:
Technical Validation Questions
Script: > “Can you provide the peer-reviewed publication of your external validation study? We’d like to see performance metrics at institutions not involved in development, broken down by patient demographics.”
Follow-ups if the vendor hesitates:
- “If there’s no peer-reviewed external validation, when do you plan to conduct one?”
- “Can you share the names of institutions where validation occurred so we can contact them?”
- “What was the performance at hospitals most similar to ours?”
Fairness Questions
Script: > “We’re concerned about health equity. Can you show us the fairness audit results? Specifically, performance by race, ethnicity, socioeconomic status, and insurance type?”
Follow-ups:
- “If no fairness audit has been done, why not?”
- “What is your plan for ongoing bias monitoring?”
- “What happens if we discover bias after deployment?”
Privacy Questions
Script: > “Walk us through exactly what data leaves our institution, where it goes, how it’s stored, and how we can verify this. Can we see the Business Associate Agreement and SOC 2 report?”
Follow-ups:
- “What specific PHI elements does your system need?”
- “Can the system work on-premise without sending data to the cloud?”
- “What happens to our data if we terminate the contract?”
Safety Questions
Script: > “What patient outcomes have improved at hospitals using your system? Can you provide data on mortality, length of stay, readmissions, or quality of life?”
Follow-ups if the vendor only cites AUC:
- “AUC is a technical metric. Has deployment improved patient outcomes?”
- “What is the false positive rate in real-world use?”
- “What happens when the model fails? What are the failure modes?”
Workflow Integration Questions
Script: > “Has your system been tested with nurses and physicians at institutions like ours? What did the usability testing reveal?”
Follow-ups:
- “What is the typical implementation timeline?”
- “What ongoing support do you provide?”
- “Can we speak with your customers about their implementation experience?”
Procurement Contract Language Recommendations
If you decide to proceed, include these provisions in your contract:
1. Performance Guarantees
Vendor guarantees that the AI system will achieve the following performance metrics
at [Institution Name] during the pilot period:
- AUC ≥ [threshold] (or other appropriate metric)
- False positive rate ≤ [threshold]%
- Alert burden ≤ [number] alerts per day per unit
- User satisfaction ≥ [threshold] (measured by survey)
If performance falls below these thresholds for [timeframe], [Institution] may
terminate the contract without penalty and receive full refund.
2. Fairness Requirements
Vendor warrants that the AI system has undergone bias testing and demonstrates
equitable performance across patient demographic groups (race, ethnicity, age, sex,
socioeconomic status).
Vendor will provide [Institution] with quarterly bias audit reports showing
performance metrics stratified by demographic subgroups.
If disparate impact is identified (performance difference >10% across groups),
Vendor will work with [Institution] to mitigate bias within [timeframe] or
[Institution] may terminate without penalty.
3. Data Privacy & Security
Vendor agrees to:
- Sign Business Associate Agreement (BAA) prior to any PHI access
- Store all data in HIPAA-compliant infrastructure
- Encrypt data at rest (AES-256) and in transit (TLS 1.3+)
- Provide SOC 2 Type II audit report annually
- Not use [Institution] data for Vendor's own R&D without explicit written consent
- Delete all [Institution] data within 30 days of contract termination
- Provide audit logs of all data access quarterly
4. Validation & Monitoring
[Institution] has the right to:
- Conduct independent validation studies of the AI system
- Publish validation results (positive or negative)
- Access model performance dashboards in real-time
- Receive quarterly performance reports from Vendor
Vendor will provide:
- Technical documentation for independent validation
- API access for performance monitoring
- Support for [Institution]'s evaluation efforts
5. Liability & Indemnification
Vendor agrees to indemnify [Institution] for:
- Any patient harm caused by AI system errors or failures
- Regulatory fines resulting from Vendor's non-compliance (HIPAA, etc.)
- Data breaches resulting from Vendor's security failures
Liability cap: [Amount] (no less than annual contract value x 5)
Vendor maintains professional liability insurance of at least [Amount].
6. Termination Rights
[Institution] may terminate this agreement:
- For cause (breach of contract): Immediate termination, full refund
- For convenience: 90-day notice, pro-rated refund for remaining term
- For safety concerns: Immediate termination if AI system poses patient safety risk
- For non-performance: If system fails to meet performance guarantees after [timeframe]
Upon termination:
- Vendor must delete all [Institution] data within 30 days
- Vendor must provide data export in standard format (CSV, FHIR, etc.)
- [Institution] retains all rights to its data
Pilot Implementation Plan
Even after thorough evaluation, start small:
Phase 1: Controlled Pilot (Months 1-3)
Scope:
- 1-2 clinical units (ICU, primary care clinic, etc.)
- 10-50 patients/day
- Intensive monitoring

Metrics to Track:
- Technical performance (AUC, sensitivity, specificity, PPV)
- Alert burden (alerts/day, false positive rate)
- User experience (satisfaction, time spent per alert, response rate)
- Workflow impact (time added per patient)
- Clinical outcomes (compared to baseline)
- Equity impact (outcomes by race, ethnicity, SES)

Success Criteria (Define Before the Pilot):
- Technical: AUC ≥ [threshold], FP rate ≤ [threshold]%
- User: Satisfaction ≥ [threshold]/5, response rate ≥ [threshold]%
- Outcome: [Primary outcome] improved by ≥ [threshold]% vs. baseline
- Equity: No performance disparities >10% across groups

Go/No-Go Decision:
- ✅ Proceed to Phase 2 if ALL success criteria are met
- ⚠️ Iterate/adjust if some criteria are met (address the specific gaps)
- ❌ Terminate if major criteria are not met (don’t throw good money after bad)
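The Go/No-Go logic can be captured as an explicit checklist so the decision is pre-committed rather than improvised. Every threshold, measured value, and the “iterate if only one criterion fails” rule below are hypothetical placeholders; define your own before the pilot starts:

```python
# Sketch (hypothetical thresholds and pilot results): Phase 1 Go/No-Go
# as an explicit checklist. Each entry maps a criterion name to a
# (pass_test, measured_value) pair.

CRITERIA = {
    "auc":               (lambda v: v >= 0.75, 0.78),
    "fp_rate":           (lambda v: v <= 0.20, 0.31),
    "user_satisfaction": (lambda v: v >= 4.0,  4.2),   # survey score out of 5
    "response_rate":     (lambda v: v >= 0.80, 0.85),
    "equity_gap":        (lambda v: v <= 0.10, 0.06),
}

results = {name: passes(value) for name, (passes, value) in CRITERIA.items()}
failed = [name for name, ok in results.items() if not ok]

# Assumed decision rule: all pass -> GO; one failure -> iterate; more -> NO-GO.
if not failed:
    print("GO: proceed to Phase 2")
elif len(failed) <= 1:
    print(f"ITERATE: address {failed} before expanding")
else:
    print(f"NO-GO: terminate pilot; failed criteria: {failed}")
```

Writing the criteria down as code (or simply in the pilot protocol) before results arrive prevents post-hoc goalpost moving.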
Phase 2: Expanded Pilot (Months 4-9)
Scope:
- 5-10 units
- Continued intensive monitoring

Objectives:
- Validate Phase 1 results at larger scale
- Test in diverse clinical settings
- Identify implementation challenges
- Refine workflows
Phase 3: Full Deployment (Month 10+)
Scope:
- Hospital-wide or system-wide

Requirements:
- Phase 2 demonstrated sustained benefit
- User training completed
- Ongoing monitoring system in place
- Regular re-auditing planned (quarterly)
Never skip the pilot.
Real-World Case Study: Using the Checklist
Example: Evaluating a Hypothetical Sepsis Prediction Tool
Vendor Claims:
- “AI predicts sepsis 6 hours before clinical recognition”
- “90% sensitivity, 85% specificity”
- “Deployed in 200+ hospitals”
- “$500K/year for a hospital-wide license”
Your Evaluation Using This Checklist:
Domain 1: Technical Validation (Score: 4/10)
- ✅ Published in peer-reviewed journal
- ❌ Internal validation only (same hospital group)
- ❌ No independent external validation
- ❌ 90% sensitivity/85% specificity in paper, but what about external sites?
- Red flag: “Deployed in 200+ hospitals” ≠ Evidence of effectiveness (remember the Epic sepsis model)
Domain 2: Clinical Safety (Score: 3/10)
- ✅ Safety mentioned in paper
- ❌ No prospective outcome studies
- ❌ No data on whether deployment reduced mortality
- ❌ False positive rate not reported
- ❌ Alert burden unknown
- Red flag: Only technical metrics (AUC), no patient outcomes
Domain 3: Fairness & Equity (Score: 2/10)
- ❌ No fairness audit mentioned
- ❌ No performance by race/ethnicity
- ❌ When asked, vendor says “we don’t use race as a feature, so it’s fair”
- Red flag: Fairness through unawareness (doesn’t work!)
Domain 4: Privacy & Security (Score: 7/10)
- ✅ Will sign BAA
- ✅ SOC 2 Type II certified
- ✅ Data encrypted at rest and in transit
- ⚠️ Data stored in vendor cloud (not on-premise option)
Domain 5: Workflow Integration (Score: 5/10)
- ✅ Integrates with your EHR (Epic)
- ⚠️ Implementation takes 3-6 months
- ⚠️ Training: 2-hour online module
- ❌ No customization; one-size-fits-all
Domain 6: Business Viability (Score: 8/10)
- ✅ Established company, 7 years in business
- ✅ 200 customers (they say)
- ✅ Willing to provide 2 references
- ⚠️ Pricing seems high ($500K/year)
Overall Weighted Score: 4.1 / 10
(0.25×4 + 0.25×3 + 0.20×2 + 0.15×7 + 0.10×5 + 0.05×8 = 4.1)

Decision: Do NOT Deploy; negotiate improvements
- Insufficient validation (internal only)
- No outcome evidence (AUC is not enough)
- No fairness testing (high bias risk)
- Workflow concerns (alert burden unknown)

Recommendation to Leadership:
> “We evaluated [Vendor]’s sepsis prediction tool using the AI Vendor Evaluation Framework.
> The system scores 4.1/10, below our threshold for deployment.
>
> Key concerns:
> - No external validation (validation only at the vendor’s hospital group)
> - No evidence of improved patient outcomes (only technical metrics reported)
> - No fairness audit (risk of bias similar to the Epic sepsis model)
>
> We recommend:
> 1. Request an external validation study at independent hospitals
> 2. Request a fairness audit with performance by race/ethnicity
> 3. Pilot at 2-3 similar institutions before we consider purchase
> 4. Re-evaluate in 12 months if the vendor addresses these gaps
>
> Alternative: Invest in building our own sepsis prediction model using our data,
> which would be tailored to our patient population and workflows.”
Summary: Key Principles for Vendor Evaluation
- Trust, but verify - Vendor claims mean nothing without independent validation
- External validation is non-negotiable - Internal validation always looks better than real-world performance
- Outcomes > Accuracy - AUC doesn’t save lives; improved patient outcomes do
- Fairness testing is mandatory - Bias is the default; fairness must be proven
- Start small, scale slowly - Pilot → Evaluate → Scale only if successful
- Negotiate strong contracts - Performance guarantees, termination rights, data control
- You can say no - Bad AI is worse than no AI; don’t deploy systems that aren’t ready
The most important lesson: You are not obligated to buy a product just because it has “AI” in the name. Demand evidence. Ask hard questions. Walk away if the evidence isn’t there.
Your patients deserve better than unvalidated AI systems.
Additional Resources
- Appendix E (The AI Morgue): Detailed failure case studies showing what goes wrong
- AHRQ Health IT Safety Toolkit: https://www.ahrq.gov/patient-safety/resources/hitsafety/index.html
- FDA AI/ML Medical Device Guidance: https://www.fda.gov/medical-devices/software-medical-device-samd/artificial-intelligence-and-machine-learning-aiml-enabled-medical-devices
- WHO Ethics & Governance of AI for Health: https://www.who.int/publications/i/item/9789240029200
Remember: The best AI system is one that improves patient outcomes, operates fairly, respects privacy, integrates into workflows, and has strong evidence supporting its use. Don’t settle for less.