Executive Summary

This handbook addresses a gap in public health practice: understanding when and how AI tools work, when they fail, and what infrastructure is needed for responsible deployment. The evidence comes from peer-reviewed research, documented deployments, published evaluations, and implementation case studies.


Key Findings

AI Capabilities in Public Health Are Real, But Limited

Disease surveillance benefits from AI, with important caveats. Event-based surveillance systems like HealthMap, ProMED, and BlueDot demonstrated value in early outbreak detection during COVID-19. However, models trained on historical data struggle with novel pathogens. Forecast accuracy degrades rapidly beyond 2-4 week horizons.

Google Flu Trends remains the defining cautionary tale. Launched in 2008, the system initially produced influenza estimates 1-2 weeks ahead of CDC surveillance reports. By 2013 it was overestimating flu prevalence by up to 140%, largely because media coverage had changed search behavior. Google retired the project in 2015. The lesson: surveillance tools must account for behavioral feedback loops and be recalibrated continuously.

Genomic surveillance has proven operational value. AI-accelerated variant detection during the SARS-CoV-2 pandemic provided weeks of earlier warning than phenotypic testing alone. Antimicrobial resistance prediction from whole-genome sequencing now achieves 90%+ accuracy for well-characterized resistance mechanisms in some pathogens.

Clinical AI shows mixed results in deployment. Epic’s widely deployed sepsis prediction model showed a positive predictive value of only 7% in external validation, meaning 93% of alerts were false positives. This pattern of strong development performance and weak deployment performance repeats across clinical AI applications.
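
The arithmetic behind that gap is worth internalizing: at the low prevalence typical of deployment, even a model with respectable sensitivity and specificity generates mostly false alarms. A minimal illustration in Python (the numbers below are assumed for illustration, not taken from the Epic evaluation):

    # Illustrative only: why decent sensitivity and specificity still
    # yield a low positive predictive value at low prevalence.
    prevalence  = 0.03   # assumed: 3% of monitored patients develop sepsis
    sensitivity = 0.80   # assumed true positive rate
    specificity = 0.90   # assumed true negative rate

    true_pos  = prevalence * sensitivity
    false_pos = (1 - prevalence) * (1 - specificity)
    ppv = true_pos / (true_pos + false_pos)

    print(f"PPV = {ppv:.1%}")   # ~19.8%; weaker real-world performance drives it lower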

Common Failure Modes Are Predictable

Dataset shift causes most deployed model failures. Models trained on one population, time period, or data source systematically underperform when conditions change. COVID-19 testing patterns, case definitions, and reporting practices shifted constantly, breaking models trained on early pandemic data.
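
One lightweight safeguard is to compare incoming data against the training data before trusting model output. A minimal sketch, assuming tabular numeric features; it uses a per-feature two-sample Kolmogorov-Smirnov test, and the significance threshold is illustrative:

    # Minimal drift check: flag features whose live distribution has
    # shifted away from the training distribution.
    import pandas as pd
    from scipy.stats import ks_2samp

    def flag_dataset_shift(train_df: pd.DataFrame,
                           live_df: pd.DataFrame,
                           alpha: float = 0.01) -> list[str]:
        """Return numeric columns whose live values differ significantly
        from training values under a two-sample KS test."""
        shifted = []
        for col in train_df.select_dtypes("number").columns:
            _, p_value = ks_2samp(train_df[col].dropna(), live_df[col].dropna())
            if p_value < alpha:
                shifted.append(col)
        return shifted

    # If key inputs (test volume, reporting lag, age mix) have drifted,
    # re-validate or recalibrate before acting on predictions.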

Algorithmic bias reflects training data inequities. Dermatology AI trained predominantly on light skin underperforms on darker skin. Chest X-ray algorithms trained at academic centers miss patterns common in community hospitals. Language models trained on English text work poorly for non-English populations.

Implementation gaps exceed technical gaps. Most AI project failures stem from workflow integration problems, not algorithm performance. Tools that perform well in validation fail in practice because clinicians don’t trust them, alerts interrupt workflows at the wrong moments, or outputs require interpretation expertise that frontline workers lack.

Infrastructure Requirements Are Substantial

Data quality is the binding constraint. Public health data is messy: inconsistent case definitions, variable completeness, reporting delays, duplicate records, linkage errors. AI systems amplify data quality problems rather than solving them.

Skilled workforce gaps limit adoption. Health departments need staff who understand both epidemiology and AI capabilities. This hybrid expertise remains rare. Training programs are emerging but scaling slowly.

Computational infrastructure varies widely. State and local health departments operate with vastly different technical capacity. Cloud computing democratizes some capabilities but introduces data governance complexity.


Recommendations by Stakeholder

For Health Department Directors

Require external validation before deployment. Vendor-reported performance metrics are insufficient. Demand validation on your population, with your data quality, under your operational conditions. Budget for pilot studies.

Build evaluation capacity, not just procurement capacity. Your staff needs to assess AI claims critically. Invest in training that covers:

  • Performance metrics interpretation
  • Bias detection
  • Deployment monitoring
  • Vendor evaluation

Start with augmentation, not automation. AI tools should support human decision-making before replacing it. Clinicians and epidemiologists need to understand model outputs, recognize errors, and maintain override authority.

Plan for maintenance from day one. Deployed models degrade. Budget for ongoing monitoring, recalibration, and potential replacement. A tool that works today may fail next year.

For Epidemiologists and Data Scientists

Validate on held-out data that reflects deployment conditions. Time-series split validation (train on earlier data, test on later data) matters more than random cross-validation for surveillance applications. Geographic holdouts reveal transferability.
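
A minimal sketch of the time-ordered approach, using scikit-learn’s TimeSeriesSplit so that every test fold lies strictly after its training fold (the data and model are placeholders):

    # Time-ordered validation: train on earlier weeks, test on later weeks.
    import numpy as np
    from sklearn.linear_model import PoissonRegressor
    from sklearn.model_selection import TimeSeriesSplit

    rng = np.random.default_rng(0)
    X = rng.normal(size=(104, 5))       # two years of weekly features (placeholder)
    y = rng.poisson(lam=20, size=104)   # weekly case counts (placeholder)

    model = PoissonRegressor()
    for train_idx, test_idx in TimeSeriesSplit(n_splits=4).split(X):
        model.fit(X[train_idx], y[train_idx])
        score = model.score(X[test_idx], y[test_idx])
        print(f"train through week {train_idx.max()}, "
              f"test weeks {test_idx.min()}-{test_idx.max()}, D^2 = {score:.2f}")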

Document data preprocessing decisions. Choices about missing data imputation, outlier handling, case definition changes, and temporal alignment affect model behavior. Future users need this documentation to interpret and update models.
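
One low-effort way to make these decisions durable is to serialize them alongside the trained model. A sketch with illustrative fields and values:

    # Illustrative preprocessing record stored next to the model artifact
    # so future users can reproduce or revisit every decision.
    import json
    from dataclasses import dataclass, asdict

    @dataclass(frozen=True)
    class PreprocessingLog:
        case_definition: str        # e.g., "confirmed + probable"
        missing_age_strategy: str   # e.g., "median imputation within county"
        outlier_rule: str           # e.g., "cap daily counts at 99.5th percentile"
        temporal_alignment: str     # e.g., "indexed by report date, not onset date"
        source_data_version: str    # e.g., "surveillance extract 2023-06-01"

    log = PreprocessingLog("confirmed + probable",
                           "median imputation within county",
                           "cap daily counts at 99.5th percentile",
                           "indexed by report date, not onset date",
                           "surveillance extract 2023-06-01")
    with open("preprocessing_log.json", "w") as f:
        json.dump(asdict(log), f, indent=2)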

Measure calibration, not just discrimination. A model that distinguishes high-risk from low-risk (good AUC) may still systematically overestimate or underestimate probabilities (poor calibration). For decision support, calibration often matters more.
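
A minimal check on held-out data, assuming a fitted binary classifier that outputs probabilities; scikit-learn’s calibration_curve and Brier score cover the basics:

    # Report calibration alongside discrimination on held-out data.
    from sklearn.calibration import calibration_curve
    from sklearn.metrics import brier_score_loss, roc_auc_score

    def report_calibration(y_true, y_prob, n_bins: int = 10) -> None:
        """y_true: observed outcomes (0/1); y_prob: predicted probabilities."""
        auc = roc_auc_score(y_true, y_prob)        # discrimination
        brier = brier_score_loss(y_true, y_prob)   # overall probability error
        frac_pos, mean_pred = calibration_curve(y_true, y_prob, n_bins=n_bins)
        print(f"AUC = {auc:.2f}, Brier score = {brier:.3f}")
        for observed, predicted in zip(frac_pos, mean_pred):
            # Well-calibrated bins have observed close to predicted.
            print(f"predicted {predicted:.2f} -> observed {observed:.2f}")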

Report confidence intervals and uncertainty. Point predictions without uncertainty bounds invite overconfidence. Probabilistic forecasts that communicate uncertainty support better decisions than precise-sounding wrong answers.
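
One straightforward way to produce bounds rather than point estimates is quantile regression: fit separate models for a lower, central, and upper quantile. A sketch on placeholder data:

    # Report a forecast interval instead of a single point prediction.
    import numpy as np
    from sklearn.ensemble import GradientBoostingRegressor

    rng = np.random.default_rng(1)
    X = rng.normal(size=(200, 4))                            # placeholder features
    y = 50 + 10 * X[:, 0] + rng.normal(scale=8, size=200)    # placeholder case counts

    quantile_models = {
        q: GradientBoostingRegressor(loss="quantile", alpha=q).fit(X, y)
        for q in (0.1, 0.5, 0.9)
    }
    x_new = X[:1]
    lo, mid, hi = (quantile_models[q].predict(x_new)[0] for q in (0.1, 0.5, 0.9))
    print(f"forecast: {mid:.0f} cases (80% interval {lo:.0f}-{hi:.0f})")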

For Policymakers

Regulate outcomes, not algorithms. Technology-specific rules become obsolete quickly. Focus on outcomes:

  • Accuracy requirements
  • Bias audits
  • Transparency obligations
  • Recourse mechanisms for affected individuals

Require algorithmic impact assessments for high-stakes applications. Before deployment in clinical decision support, resource allocation, or outbreak response, mandate structured evaluation of potential harms, affected populations, and mitigation strategies.

Fund public health data infrastructure. AI capabilities depend on data quality. Investment in electronic laboratory reporting, case-based surveillance systems, and health information exchange infrastructure pays dividends across all analytic approaches.

Support workforce development. Create training pipelines for public health informaticists. Integrate AI literacy into MPH curricula. Fund continuing education for current practitioners.

For AI Developers Working in Public Health

Involve domain experts from the start. Epidemiologists and public health practitioners understand data limitations, workflow constraints, and deployment contexts that pure ML approaches miss. Build interdisciplinary teams, not consulting relationships.

Publish validation on diverse populations. Single-site validations don’t demonstrate generalizability. Multi-site studies with explicit subgroup analyses reveal where models work and where they fail.
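
A sketch of the kind of subgroup reporting that makes generalizability claims checkable (column names are placeholders):

    # Report performance per site and subgroup rather than one pooled metric.
    import pandas as pd
    from sklearn.metrics import roc_auc_score

    def subgroup_performance(df: pd.DataFrame,
                             group_cols=("site", "race_ethnicity"),
                             y_col="outcome",
                             p_col="predicted_prob") -> pd.DataFrame:
        rows = []
        for col in group_cols:
            for level, grp in df.groupby(col):
                if grp[y_col].nunique() < 2:
                    continue  # AUC undefined when only one class is present
                rows.append({"group": f"{col}={level}",
                             "n": len(grp),
                             "auc": roc_auc_score(grp[y_col], grp[p_col])})
        return pd.DataFrame(rows)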

Document failure modes explicitly. Under what conditions does your model break? What populations are underrepresented in training data? What data quality issues cause problems? Honest documentation builds appropriate trust.

Design for interpretability where decisions affect individuals. Black-box models may achieve higher accuracy, but practitioners need to understand predictions to act appropriately on them. For high-stakes applications, interpretability is not optional.


Evidence Quality Assessment

High-Confidence Findings

These conclusions rest on multiple high-quality studies with consistent results:

  • Dataset shift degrades deployed model performance
  • Algorithmic bias reflects training data inequities
  • External validation shows worse performance than development validation
  • Workflow integration determines real-world impact more than technical performance

Moderate-Confidence Findings

Evidence supports these conclusions but gaps remain:

  • 2-4 week forecast horizons represent practical accuracy limits for epidemic prediction
  • Ensemble methods outperform single models for most public health applications
  • Active learning can reduce labeling burden while maintaining performance

Emerging Areas (Limited Evidence)

These represent promising directions with insufficient evidence for strong conclusions:

  • Large language models for epidemiological literature synthesis
  • Foundation models for clinical decision support
  • Federated learning for privacy-preserving multi-site analysis
  • Causal inference methods for policy evaluation

Priority Investment Areas

Based on evidence and feasibility:

Tier 1: Foundational

  1. Data quality improvement: Invest in data standardization, completeness monitoring, and linkage infrastructure before advanced analytics
  2. Workforce development: Build hybrid epidemiology/data science expertise through structured training programs
  3. Evaluation capacity: Establish methods for ongoing model monitoring and bias detection

Tier 2: Near-Term Applications

  1. Genomic surveillance integration: AI-accelerated variant detection and antimicrobial resistance prediction have demonstrated value
  2. Syndromic surveillance enhancement: Automated signal detection with human-in-the-loop verification (see the sketch after this list)
  3. Administrative automation: Reduce burden on routine tasks (data cleaning, report generation) to free staff for analytic work
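
Item 2 above follows a simple pattern: an algorithm flags candidate signals and an analyst decides what to do with them. A minimal sketch using a moving-baseline z-score, loosely in the spirit of CDC’s EARS methods but simplified; the window size and threshold are illustrative:

    # Simple aberration detection on a daily syndromic count series:
    # flag days that exceed the recent baseline, then route flags to a human.
    import numpy as np

    def flag_aberrations(counts: np.ndarray,
                         baseline_days: int = 14,
                         threshold: float = 3.0) -> list[int]:
        """Return indices of days whose count exceeds the baseline mean
        by more than `threshold` baseline standard deviations."""
        flags = []
        for t in range(baseline_days, len(counts)):
            baseline = counts[t - baseline_days:t]
            mu, sigma = baseline.mean(), baseline.std(ddof=1)
            if sigma > 0 and (counts[t] - mu) / sigma > threshold:
                flags.append(t)   # queue for analyst review, do not auto-act
        return flags

    daily_ed_visits = np.array([12, 15, 11, 14, 13, 12, 16, 15, 14, 13,
                                12, 15, 14, 13, 12, 14, 13, 15, 14, 38])
    print(flag_aberrations(daily_ed_visits))   # -> [19]: the spike needs analyst review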

Tier 3: Longer-Term Development

  1. Clinical decision support: Requires solving workflow integration and trust challenges first
  2. Predictive resource allocation: Depends on forecast accuracy improvements and ethical frameworks
  3. Automated literature surveillance: Promising for tracking emerging evidence but validation needed

What This Handbook Does Not Cover

Comprehensive clinical AI for individual patient care. This handbook’s primary focus is population health applications. Clinical decision support is covered where it intersects with public health operations (e.g., sepsis alerts affecting hospital capacity, diagnostic AI enabling screening at scale). However, detailed treatment protocols, individual prescribing decisions, and specialty-specific clinical workflows are beyond scope.

Research methodology for developing new AI systems. The focus is on evaluation, deployment, and governance of existing approaches, not novel algorithm development.

Comprehensive technical implementation. Code examples demonstrate concepts but production systems require engineering expertise beyond this handbook’s scope.


How to Use This Executive Summary

Health department directors: Use the stakeholder recommendations section to structure AI governance conversations and procurement requirements.

Epidemiologists: The evidence quality assessment and common failure modes sections inform critical evaluation of AI tools you encounter.

Policymakers: The recommendations and priority investment areas provide a framework for resource allocation and regulatory development.

AI developers: The failure modes and stakeholder needs sections identify gaps where technical work can have public health impact.

For detailed analysis, case studies, and implementation guidance, see the full chapters.


This executive summary is part of The Public Health AI Handbook. For evidence citations, case studies, and technical details, see the full chapters.