Understand why public health data is uniquely challenging for AI
Assess data quality using four key dimensions (completeness, accuracy, timeliness, representativeness)
Identify and handle common data issues: missing data, reporting delays, selection bias
Apply practical cleaning and preprocessing strategies
Engineer meaningful features from messy real-world data
Recognize when data problems make AI inappropriate
What you’ll build: 💻 Complete data quality assessment pipeline with missing data analysis, outlier detection, and feature engineering for surveillance data
5.1 Introduction
Here’s the uncomfortable truth: Most AI tutorials use clean, well-formatted datasets. Most public health data is neither of those things.
Real public health data is:
- Incomplete (missing values everywhere)
- Delayed (cases reported days or weeks late)
- Biased (who gets tested determines who appears in your data)
- Inconsistent (formats change, definitions evolve)
- Messy (typos, duplicates, impossible values)
- Heterogeneous (multiple sources with different standards)
And yet: This is the data you have to work with. Learning to handle messy data is arguably more important than learning fancy algorithms.
Important: The Iron Law of Machine Learning
Garbage in = Garbage out
The most sophisticated deep learning model cannot overcome fundamentally flawed data. A simple logistic regression on clean, well-understood data will outperform a cutting-edge neural network on garbage data.
5.2 Why Public Health Data Is Uniquely Challenging
5.2.1 The Surveillance Pyramid Problem
Public health surveillance captures only a fraction of reality:
- 🔬 Lab-Confirmed Cases (the data you have; the tip of the pyramid)
- 🏥 Hospitalized Cases
- 🏠 Symptomatic Cases Seeking Care
- 😷 All Symptomatic Cases
- 😊 All Infections, including asymptomatic (the base of the pyramid)
The issue: Your dataset represents the tip of the pyramid, but the population of interest is the entire pyramid. This is selection bias by design.
Example: COVID-19 Testing Bias
Early in the pandemic, testing was limited to:
- Symptomatic individuals
- Healthcare workers
- Travelers
- Close contacts of confirmed cases
Any model trained on this data would:
- Overestimate symptom severity (asymptomatic cases invisible)
- Misunderstand demographic risk factors (testing access varied by socioeconomic status)
- Produce biased risk predictions (dataset not representative of population)
If your training data comes from one level of the surveillance pyramid but you want to make predictions about another level, no amount of ML sophistication can fix this.
You need one of three things:
1. Data from the correct population (often impossible)
2. Statistical methods to adjust for selection, such as Hernán’s target trial framework (a minimal reweighting sketch follows this list)
3. A different research question that matches your available data
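For option 2, a simple first step (well short of a full target-trial design) is post-stratification reweighting toward known population margins. The sketch below assumes a hypothetical `age_group` column and illustrative census proportions; it corrects only the dimensions you can measure, not unmeasured selection.

```python
import pandas as pd

# Illustrative census age distribution (assumed values)
population_dist = {'0-17': 0.22, '18-64': 0.62, '65+': 0.16}

def post_stratification_weights(df, column, population_dist):
    """Weight each record by (population share) / (sample share) for its stratum."""
    sample_dist = df[column].value_counts(normalize=True)
    return df[column].map(lambda g: population_dist[g] / sample_dist[g])

# Usage sketch: weighted summaries better reflect the population age mix
# df['weight'] = post_stratification_weights(df, 'age_group', population_dist)
# weighted_rate = (df['hospitalized'] * df['weight']).sum() / df['weight'].sum()
```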
5.2.2 Reporting Delays and Temporal Misalignment
The scenario: You want to predict tomorrow’s case counts based on today’s data. But “today’s data” includes:
- Cases from today (20%)
- Cases from yesterday, reported today (40%)
- Cases from 2 days ago, reported today (30%)
- Cases from 3+ days ago, reported today (10%)
Your “September 15th” dataset is actually a mixture of cases from September 12-15, with unknown proportions.
Why this matters for ML:
- Features and labels are temporally misaligned
- Recent trends are systematically underestimated (right-censoring)
- Models learn spurious patterns from reporting artifacts
Real example: Early COVID-19 forecasting models struggled because they treated reported case counts as actual case counts, ignoring 5-10 day reporting lags that varied by jurisdiction.
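The sketch below simulates this effect using the assumed delay proportions from the list above (all numbers are illustrative): counts for the most recent onset dates look artificially low simply because many of their cases have not been reported yet.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

# 60 days of "true" onsets at a constant rate (illustrative)
dates = pd.date_range('2023-01-01', periods=60)
true_cases = pd.Series(100, index=dates)
today = dates[-1]

# Assumed reporting-delay distribution: 20% same day, 40% +1 day, 30% +2, 10% +3
delay_probs = [0.2, 0.4, 0.3, 0.1]

# For each onset date, count only the cases already reported by "today"
observed = {}
for onset_day, n in true_cases.items():
    delays = rng.choice([0, 1, 2, 3], size=n, p=delay_probs)
    report_days = onset_day + pd.to_timedelta(delays, unit='D')
    observed[onset_day] = int((report_days <= today).sum())

observed = pd.Series(observed)
print(observed.tail(5))  # the last few onset dates look artificially low (right-censoring)
```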
5.2.3 Changing Case Definitions and Surveillance Practices

Case definition changes:
- COVID-19 case definitions changed multiple times in 2020-2021
- Adding antigen tests to confirmed cases
- Including probable cases vs. confirmed only
Testing availability changes:
- Week 1: Only sick hospitalized patients tested
- Week 10: Symptomatic individuals can get tested
- Week 20: Asymptomatic screening becomes common
Your dataset appears to show:
- Exponential case growth
- Changing age distribution
- Different symptom profiles over time
But reality: These might be artifacts of changing surveillance, not actual epidemiological changes.
Warning: The Fundamental Challenge
Machine learning assumes the relationship between features and outcomes is stationary (constant over time).
Public health surveillance violates this assumption constantly. The data-generating process itself evolves.
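One practical check, sketched below, is to compare feature distributions between an early and a late slice of the data; a large shift suggests the data-generating process changed. The file path and column names (`report_date`, `age`) match the examples later in this chapter and are assumed to exist in your data.

```python
import pandas as pd
from scipy.stats import ks_2samp

# Surveillance line list (same file used later in this chapter); column names assumed
df = pd.read_csv('../data/examples/surveillance_data.csv', parse_dates=['report_date'])

# Split into an early and a late period at the median report date
cutoff = df['report_date'].quantile(0.5)
early = df[df['report_date'] <= cutoff]
late = df[df['report_date'] > cutoff]

# Kolmogorov-Smirnov test: has the age distribution of reported cases shifted?
stat, p_value = ks_2samp(early['age'].dropna(), late['age'].dropna())
if p_value < 0.05:
    print(f"⚠️ Age distribution shifted between periods (KS statistic = {stat:.2f})")
    print("   This may reflect changing surveillance rather than changing epidemiology.")
```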
5.2.4 The Denominator Problem

Compare two regions:

Region A: 1,000 cases, 50 deaths → 5% CFR
Region B: 500 cases, 100 deaths → 20% CFR
Is Region B truly more dangerous? Or does it have:
- Less testing (missing mild cases in denominator)
- Older population
- Worse healthcare access
- Different case definitions
The problem: Your dataset has numerators (cases, deaths) but often lacks good denominators (population at risk, testing rates, exposure levels).
AI models trained on case counts without accounting for testing intensity will confuse “more testing” with “more disease,” as Ioannidis cautioned.
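A quick back-of-the-envelope calculation shows how much ascertainment matters; the detection fractions below are purely illustrative assumptions.

```python
# Naive CFR uses detected cases as the denominator
region_a = {'cases': 1000, 'deaths': 50}
region_b = {'cases': 500, 'deaths': 100}

for name, r in [('A', region_a), ('B', region_b)]:
    print(f"Region {name}: naive CFR = {r['deaths'] / r['cases']:.1%}")

# Suppose Region A detects ~50% of infections and Region B only ~10% (assumed values)
detection = {'A': 0.50, 'B': 0.10}
for name, r in [('A', region_a), ('B', region_b)]:
    estimated_infections = r['cases'] / detection[name]
    print(f"Region {name}: infection fatality proxy = {r['deaths'] / estimated_infections:.1%}")
# Under these assumptions, Region B's apparent 20% CFR drops to ~2%
```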
5.2.5 Privacy Constraints and Data Aggregation
To protect patient privacy, public health data is often:
- Aggregated (county-level instead of individual-level)
- Suppressed (cells with <5 cases shown as “<5” or asterisk)
- Coarsened (age shown as brackets: “20-29” instead of exact age)
- Delayed (real-time data withheld, released weeks later)
Impact on ML:
- Loss of granularity reduces predictive power
- Non-standard missing data patterns
- Cannot link across datasets (no unique identifiers)
- Temporal resolution insufficient for some analyses
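Suppression also breaks naive numeric parsing: counts arrive as strings like “<5” or “*”. A minimal handling sketch follows; the column names are hypothetical, and whether to treat suppressed counts as missing or as an assumed midpoint is a modeling choice you should document.

```python
import numpy as np
import pandas as pd

# Example aggregated table with privacy suppression (illustrative data)
county_counts = pd.DataFrame({
    'county': ['Adams', 'Brown', 'Clark'],
    'cases': ['12', '<5', '37'],
})

def parse_suppressed(value, strategy='nan'):
    """Convert suppressed cells ('<5', '*') to NaN, or to a midpoint if requested."""
    if value in ('<5', '*'):
        return np.nan if strategy == 'nan' else 2.0  # midpoint of 0-4 is a modeling choice
    return float(value)

county_counts['cases_clean'] = county_counts['cases'].apply(parse_suppressed)
print(county_counts)
```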
5.3 Assessing Data Quality: Four Key Dimensions

5.3.1 Completeness: Are Values Present?

The question: What percentage of values are missing? Is the missingness random or systematic?
Code to assess:
```python
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Load your data
df = pd.read_csv('../data/examples/surveillance_data.csv')

# Calculate missingness
missing_summary = pd.DataFrame({
    'column': df.columns,
    'missing_count': df.isnull().sum(),
    'missing_percent': (df.isnull().sum() / len(df) * 100).round(2)
}).sort_values('missing_percent', ascending=False)

print("Missing Data Summary:")
print(missing_summary)

# Visualize missingness patterns
import missingno as msno

# Matrix plot: see where missing values cluster
msno.matrix(df, figsize=(12, 6), sparkline=False)
plt.title('Missing Data Pattern')
plt.tight_layout()
plt.savefig('../images/examples/missing_data_matrix.png', dpi=300)
```
```python
# Check if missingness is related to other variables
df['symptom_onset_missing'] = df['symptom_onset'].isnull()

# Compare groups with and without missing values
print(df.groupby('symptom_onset_missing')['age'].mean())
print(df.groupby('symptom_onset_missing')['hospitalized'].mean())

# Statistical test
from scipy.stats import chi2_contingency

contingency_table = pd.crosstab(df['symptom_onset_missing'], df['hospitalized'])
chi2, p_value, dof, expected = chi2_contingency(contingency_table)

if p_value < 0.05:
    print("⚠️ Missingness is NOT random - be careful with imputation")
```
5.3.2 Accuracy: Are Values Correct?
Common accuracy issues:
5.3.2.1 Impossible or Implausible Values
```python
# Detect impossible values
issues = []

# Age checks
if (df['age'] < 0).any() or (df['age'] > 120).any():
    issues.append("Impossible ages detected")

# Date checks
df['symptom_date'] = pd.to_datetime(df['symptom_date'], errors='coerce')
df['test_date'] = pd.to_datetime(df['test_date'], errors='coerce')

# Symptom onset after test date?
invalid_dates = df['symptom_date'] > df['test_date']
if invalid_dates.any():
    issues.append(f"{invalid_dates.sum()} cases with symptoms after test")

# Temperature checks
if 'temperature' in df.columns:
    if (df['temperature'] < 35).any() or (df['temperature'] > 42).any():
        issues.append("Implausible temperatures detected")

for issue in issues:
    print(issue)
```
5.3.2.2 Data Entry Errors
```python
# Common data entry errors

# 1. Height/weight transpositions
df['bmi'] = df['weight'] / (df['height'] / 100) ** 2
suspicious_bmi = (df['bmi'] < 10) | (df['bmi'] > 60)
print(f"Suspicious BMI values: {suspicious_bmi.sum()}")

# 2. Unit inconsistencies (Celsius vs. Fahrenheit)
if 'temperature' in df.columns:
    potential_fahrenheit = (df['temperature'] > 50).sum()
    if potential_fahrenheit > 0:
        print(f"⚠️ Possible unit mixing: {potential_fahrenheit} values >50°")

# 3. Decimal point errors (370 instead of 37.0)
if 'temperature' in df.columns:
    extreme_values = df['temperature'] > 100
    print(f"Possible decimal errors: {extreme_values.sum()}")

# 4. Age heaping (digit preference: ages rounded to 0s and 5s)
from scipy.stats import chisquare

age_last_digit = (df['age'].dropna() % 10).astype(int)
digit_counts = age_last_digit.value_counts().reindex(range(10), fill_value=0).sort_index()
expected_freq = [len(age_last_digit) / 10] * 10
chi2, p_value = chisquare(digit_counts, expected_freq)
if p_value < 0.05:
    print("⚠️ Age heaping detected - rounding to 0s and 5s")
```
Important: The “Too Clean” Red Flag
If your public health dataset has zero missing values, perfect consistency, and no outliers—be suspicious. Real-world data is messy.
5.3.3 Timeliness: Are Values Up-to-Date?
Quantifying reporting delays:
```python
# Calculate reporting lag
df['report_date'] = pd.to_datetime(df['report_date'])
df['symptom_onset'] = pd.to_datetime(df['symptom_onset'])

# Lag from symptom onset to report
df['onset_to_report_days'] = (df['report_date'] - df['symptom_onset']).dt.days

# Visualize
fig, ax = plt.subplots(figsize=(10, 6))
ax.hist(df['onset_to_report_days'].dropna(), bins=50, edgecolor='black')
ax.set_xlabel('Days from Symptom Onset to Report')
ax.set_ylabel('Frequency')
ax.axvline(df['onset_to_report_days'].median(), color='red', linestyle='--',
           label=f"Median: {df['onset_to_report_days'].median():.1f} days")
ax.legend()
plt.savefig('../images/examples/reporting_delays.png', dpi=300)
```
5.3.4 Representativeness: Does Data Match Target Population?
A common bias is demographic skew. Compare your sample to census data:
```python
# Check for demographic bias: compare to census data
population_age_dist = {'0-17': 0.22, '18-64': 0.62, '65+': 0.16}
sample_age_dist = df['age_group'].value_counts(normalize=True).to_dict()

print("Age Distribution Comparison:")
print("Age Group | Population | Sample | Difference")
print("-" * 50)
for age_group in population_age_dist:
    pop_pct = population_age_dist[age_group] * 100
    sample_pct = sample_age_dist.get(age_group, 0) * 100
    diff = sample_pct - pop_pct
    print(f"{age_group:8} | {pop_pct:6.1f}% | {sample_pct:6.1f}% | {diff:+6.1f}%")
```
Warning: When Representativeness Matters Most
Representativeness is critical when:
- Making population-level inferences
- Forecasting total burden
- Evaluating interventions
Less critical when:
- Early outbreak detection (any signal helps)
- Comparing relative risks within sample
5.4 Common Data Issues and How to Handle Them
5.4.1 Issue 1: Missing Data
When data are MCAR (Missing Completely At Random), simple imputation is reasonable:
```python
from sklearn.impute import SimpleImputer

# Median imputation for numerical variables
numerical_cols = df.select_dtypes(include=[np.number]).columns
imputer = SimpleImputer(strategy='median')
df[numerical_cols] = imputer.fit_transform(df[numerical_cols])
```
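When missingness is MAR rather than MCAR (it depends on other observed variables, as in the age/income example in the questions later in this chapter), imputation should condition on those variables. A hedged sketch with scikit-learn’s IterativeImputer follows; note that MAR itself is an untestable assumption about your data.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (activates IterativeImputer)
from sklearn.impute import IterativeImputer

# Model each variable's missing values from the other numerical variables
numerical_cols = df.select_dtypes(include=[np.number]).columns
mar_imputer = IterativeImputer(random_state=0, max_iter=10)
df[numerical_cols] = mar_imputer.fit_transform(df[numerical_cols])
```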
5.4.2 Issue 2: Outliers

A decision framework for suspicious values:

Is it biologically impossible?
├─ Yes → Error. Correct or remove.
└─ No → Is it a data entry error?
├─ Yes → Correct it
└─ No → Does it represent important subpopulation?
├─ Yes → Keep it!
└─ Uncertain → Sensitivity analysis
Important: The “Interesting Outlier” Trap
Real example: Early COVID-19 models removed patients with “impossibly fast” progression as outliers. Later discovered these were cytokine storm cases—rare but critical.
Lesson: Outliers often teach you the most. Be cautious about automatic removal.
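In code, this translates into flagging rather than dropping. A minimal sketch follows; the column name `onset_to_death_days` is hypothetical, and the 3×IQR threshold is an arbitrary choice for illustration.

```python
# Flag, don't drop: extreme values go to a review list instead of being deleted
col = 'onset_to_death_days'  # hypothetical column

q1, q3 = df[col].quantile([0.25, 0.75])
iqr = q3 - q1
is_extreme = (df[col] < q1 - 3 * iqr) | (df[col] > q3 + 3 * iqr)

df['needs_review'] = is_extreme
print(f"{is_extreme.sum()} records flagged for manual review (not removed)")

# Sensitivity analysis: fit the model with and without flagged records and compare
```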
The COVID-19 pandemic provided a natural experiment in real-time forecasting. Despite unprecedented data availability, most forecasts performed poorly.
What went wrong:
- Reporting delay ignorance: models treated “reported March 15” as “occurred March 15”
“Forecast accuracy was limited more by data quality issues—reporting delays, changing surveillance, and selection bias—than by modeling approaches. Simple models with good data outperformed complex models with poor data.”
The pandemic exposed that public health data infrastructure, not algorithms, is the bottleneck.
5.7 When Data Problems Make AI Inappropriate
Red flags that should stop you:
1. Extreme selection bias with no adjustment
   - Example: Training on hospitalized, deploying on general population
   - Why fatal: Model doesn’t generalize
2. Outcome label quality is poor
   - Example: Inconsistent disease diagnosis criteria
   - Why fatal: Model learns noise, not signal
3. Critical features >50% missing
   - Example: Symptom onset missing for most cases
   - Why fatal: Can’t build temporal features
4. Data generating process changed mid-dataset
   - Example: Pre/post policy change, different testing
   - Why fatal: Violates stationarity assumption
5. Sample size too small
   - Example: 100 cases, 50 features, trying deep learning
   - Why fatal: Severe overfitting
6. External validity concerns
   - Example: One hospital’s data, deploying elsewhere
   - Why fatal: Context matters, model doesn’t transfer
Important: The Courage to Say “No”
Good data science includes knowing when NOT to build a model.
If data quality is insufficient:
1. Report limitations clearly
2. Recommend improving data collection
3. Suggest alternatives (traditional epi, qualitative research)
Building on garbage data:
- Wastes resources
- Produces misleading results
- Can harm people
- Damages trust in AI
As Box said: “All models are wrong, but some are useful.” If your data is wrong enough, your model will be useless.
5.8 Key Takeaways
Data quality determines model quality. Algorithms can’t overcome flawed data.
Understand missingness mechanisms. MCAR, MAR, MNAR require different strategies.
Public health data has unique challenges: selection bias, reporting delays, changing surveillance.
Investigate, don’t automatically fix. Outliers might be the most important cases.
Document everything. Future users need to know what you cleaned and why.
Know when to stop. Sometimes the ethical choice is not to build a model.
Temporal ordering matters. Never use future to predict past.
5.9 Practice Exercises
5.9.1 Exercise 1: Missing Data Mechanisms
1. Identify variables with missing data
2. Test if missingness is MCAR, MAR, or MNAR
3. Choose imputation strategies
4. Compare model performance
5.9.2 Exercise 2: Reporting Delay Adjustment
1. Calculate the reporting triangle
2. Implement nowcasting
3. Compare nowcast vs. raw counts
4. Evaluate accuracy over time
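If you want a starting point, below is a deliberately simple multiplication-factor nowcast built from the empirical delay distribution. Column names match the earlier examples in this chapter and are assumed to exist in the file; a real reporting-triangle or Bayesian nowcast would handle the censored recent delays more carefully.

```python
import pandas as pd

# Line list with onset and report dates (column names assumed)
cases = pd.read_csv('../data/examples/surveillance_data.csv',
                    parse_dates=['symptom_onset', 'report_date'])
cases['delay_days'] = (cases['report_date'] - cases['symptom_onset']).dt.days

today = cases['report_date'].max()
onset_counts = cases.groupby('symptom_onset').size()

# Nowcast each onset day: divide the observed count by the empirical
# probability that a case is reported within the elapsed time
nowcast = {}
for onset_day, observed in onset_counts.items():
    days_elapsed = (today - onset_day).days
    prob_reported = (cases['delay_days'] <= days_elapsed).mean()
    nowcast[onset_day] = observed / max(prob_reported, 0.05)  # cap extreme inflation

nowcast = pd.Series(nowcast).sort_index()
print(pd.DataFrame({'reported': onset_counts, 'nowcast': nowcast.round(1)}).tail(7))
```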
5.9.3 Exercise 3: Feature Engineering
Create outbreak predictor using:
- Daily cases
- Temperature
- Rainfall
- Population density
Engineer 10+ features. Which matter most?
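As a starter sketch only (the file path and column names are hypothetical), here are a few backward-looking features of the kind this exercise asks for; note that every feature uses only past information.

```python
import pandas as pd

# Daily district-level data (hypothetical file and columns: date, cases, temperature, rainfall)
daily = pd.read_csv('../data/examples/outbreak_daily.csv', parse_dates=['date'])
daily = daily.sort_values('date').set_index('date')

# Lag features: what happened 1, 7, and 14 days ago
for lag in [1, 7, 14]:
    daily[f'cases_lag_{lag}'] = daily['cases'].shift(lag)

# Trailing (backward-looking) rolling statistics -- no future information
daily['cases_7d_mean'] = daily['cases'].shift(1).rolling(window=7).mean()
daily['cases_growth'] = daily['cases_7d_mean'] / daily['cases'].shift(1).rolling(14).mean()

# Domain-driven environmental features (e.g., standing water for Aedes breeding)
daily['rainfall_14d_total'] = daily['rainfall'].shift(1).rolling(window=14).sum()
daily['temp_7d_mean'] = daily['temperature'].shift(1).rolling(window=7).mean()

print(daily.dropna().head())
```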
5.9.4 Exercise 4: Data Quality Audit
Audit a “cleaned” dataset:
1. Find 5+ quality issues
2. Determine if fixable or fatal
3. Write quality report
Check Your Understanding
Test your knowledge of the key concepts from this chapter. The answer and explanation follow each question.
Question 1: Data Quality Foundations
Which dimension of data quality is violated when COVID-19 testing was initially limited to hospitalized patients, but models were deployed to predict community transmission?
a) Completeness
b) Accuracy
c) Timeliness
d) Representativeness
Answer: d) Representativeness
Explanation: This is a representativeness problem. The training data (hospitalized patients) doesn’t represent the target population (community cases). This selection bias means the model learns patterns from a severely ill subpopulation that don’t generalize to mild/asymptomatic cases in the community.
Question 2: Missing Data Mechanisms
You notice that income data is missing for 30% of younger respondents but only 5% of older respondents. What type of missingness is this?
a) Missing Completely At Random (MCAR)
b) Missing At Random (MAR)
c) Missing Not At Random (MNAR)
d) Systematic error
Answer: b) Missing At Random (MAR)
Explanation: This is MAR (Missing At Random) because the missingness depends on an observed variable (age), not on the missing value itself (income). You can impute income using age as a predictor since you know the missingness pattern.
Question 3: Temporal Data Leakage
True or False: When creating rolling averages for time series forecasting, using rolling(window=7, center=True) is appropriate because it creates smoother trends.
Answer: False
Explanation: Using center=True causes data leakage by incorporating future values into the calculation. For a value on day 10, it would include data from days 7-13, meaning you’re using days 11-13 (the future) to predict day 10. For forecasting, you must use center=False (the default) to only look backwards in time.
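A small demonstration with a toy series makes the leak visible:

```python
import pandas as pd

cases = pd.Series(range(1, 15), index=pd.date_range('2023-09-01', periods=14), name='cases')

# Leaky: centered window uses days after the target date
leaky = cases.rolling(window=7, center=True).mean()

# Safe for forecasting: backward-looking window only (default center=False)
safe = cases.rolling(window=7).mean()

comparison = pd.DataFrame({'cases': cases, 'centered_leaky': leaky, 'trailing_safe': safe})
print(comparison.loc['2023-09-10'])  # the centered value for Sep 10 already "knows" Sep 11-13
```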
Question 4: Reporting Delays
A surveillance system shows cases declining on the most recent dates. What is the most likely explanation before concluding the outbreak is ending?
a) Effective public health intervention
b) Population-level immunity developing
c) Right-censoring from reporting delays
d) Seasonal weather changes
Answer: c) Right-censoring from reporting delays
Explanation: Right-censoring from reporting delays is the most common explanation for apparent declines in recent data. Cases from recent days haven’t fully been reported yet, creating an artificial downward trend. Always check reporting lag distributions before interpreting recent trends. This is why nowcasting methods exist.
Question 5: Feature Engineering
Which of these features would be MOST useful for predicting dengue outbreaks and demonstrates good domain expertise?
a) Day of the week
b) Raw daily temperature
c) 14-day cumulative rainfall
d) Month number (1-12)
Answer: c) 14-day cumulative rainfall
Explanation: 14-day cumulative rainfall shows domain expertise because dengue mosquitoes (Aedes) breed in standing water that accumulates over time. Raw daily temperature or simple month number don’t capture the mechanism. Day of the week is irrelevant for disease biology. This demonstrates how epidemiological understanding drives effective feature engineering.
Question 6: Outliers
You find 5 patients with “impossibly fast” disease progression (symptom onset to death in <24 hours). What should you do?
a) Remove them as data entry errors
b) Remove them to avoid overfitting on outliers
c) Investigate them carefully—they might be important rare cases
d) Set their values to the median progression time
Answer: c) Investigate them carefully—they might be important rare cases
Explanation: Investigate carefully before removing. These could be:
- Real rare but critical cases (e.g., cytokine storm, overwhelming infection)
- An important subpopulation with different risk factors
- Data entry errors (date fields swapped)
The COVID-19 pandemic taught us that “outliers” like rapidly progressing cases were often the most clinically important to understand. Only remove after investigation confirms they’re truly errors.
5.10 Discussion Questions
1. Your training data is 80% from one hospital. How does this affect generalizability? What can you do?
2. Your model has 95% accuracy on training, 65% on test, and 30% of labels are missing. How do you interpret this?
3. A colleague wants to impute symptom onset as report date minus the median delay. What are the pros and cons? When is this appropriate?
4. Forecasting COVID-19 hospitalizations: should you use raw counts or adjust for test positivity? Justify your choice.
5. When is it acceptable to remove outliers from public health data? Give examples of appropriate vs. inappropriate removal.
You now understand why data quality is the foundation of successful AI in public health. Clean data and thoughtful feature engineering matter more than fancy algorithms.
Tip: Part I Summary: What You Should Now Know
Congratulations! You’ve completed Part I: Foundations. Before moving to applications, ensure you can confidently:
5.11.6 From Chapter 1 (History & Context)
Explain how AI has evolved in public health from expert systems to modern machine learning
Identify when AI adds value vs. when traditional epidemiology is sufficient
Recognize patterns of AI hype and separate genuine capabilities from marketing
Understand the unique ethical considerations of AI in population health
5.11.7 From Chapter 2 (AI Basics)
Distinguish between supervised, unsupervised, and reinforcement learning paradigms
Choose appropriate algorithms for different problem types (logistic regression → Random Forests → gradient boosting → deep learning)
Interpret evaluation metrics (accuracy, sensitivity, specificity, ROC-AUC) and know when each matters
Identify overfitting and data leakage issues
Understand why most public health tabular data works best with tree-based methods, not deep learning
Explain predictions using feature importance and SHAP values
5.11.8 From Chapter 3 (Data Quality)
Assess data quality across four dimensions: completeness, accuracy, timeliness, representativeness
Identify missing data mechanisms (MCAR, MAR, MNAR) and choose appropriate imputation strategies
Handle reporting delays and right-censoring in surveillance data
Engineer meaningful features from public health data using domain expertise
Recognize when data problems make AI inappropriate (selection bias, insufficient sample size, changing data-generating processes)
Avoid temporal data leakage when building forecasting models
5.11.9 Core Skills Checklist
Can you:
- [ ] Read and modify basic ML code in Python (scikit-learn, pandas)?
- [ ] Run a complete ML pipeline from data loading → preprocessing → training → evaluation?
- [ ] Interpret confusion matrices and choose metrics appropriate for your use case?
- [ ] Detect and handle missing data without blindly dropping rows?
- [ ] Create time-based features without leaking future information?
- [ ] Explain why a simple model on clean data beats a complex model on dirty data?
- [ ] Identify when NOT to build an AI model due to data quality issues?
5.11.10 What’s Next
Part II: Applications applies these foundations to real public health problems:
- Disease surveillance and outbreak detection
- Epidemic forecasting
- Genomic surveillance
- Large language models in public health
- Clinical decision support
You now have the conceptual foundation and technical skills to understand how AI works in practice. The next chapters show you specific applications where AI creates value—and where it falls short.
If you’re unsure about any foundation concept, review the relevant chapter before proceeding. The applications build directly on these fundamentals.