Understand why public health data is uniquely challenging for AI
Assess data quality using four key dimensions (completeness, accuracy, timeliness, representativeness)
Identify and handle common data issues: missing data, reporting delays, selection bias
Apply practical cleaning and preprocessing strategies
Engineer meaningful features from messy real-world data
Recognize when data problems make AI inappropriate
What you’ll build: 💻 Complete data quality assessment pipeline with missing data analysis, outlier detection, and feature engineering for surveillance data
5.1 Introduction
Here’s the uncomfortable truth: Most AI tutorials use clean, well-formatted datasets. Most public health data is neither of those things.
Real public health data is:
- Incomplete (missing values everywhere)
- Delayed (cases reported days or weeks late)
- Biased (who gets tested determines who appears in your data)
- Inconsistent (formats change, definitions evolve)
- Messy (typos, duplicates, impossible values)
- Heterogeneous (multiple sources with different standards)
And yet: This is the data you have to work with. Learning to handle messy data is arguably more important than learning fancy algorithms.
Important: The Iron Law of Machine Learning
Garbage in = Garbage out
The most sophisticated deep learning model cannot overcome fundamentally flawed data. A simple logistic regression on clean, well-understood data will outperform a cutting-edge neural network on garbage data.
5.2 Why Public Health Data Is Uniquely Challenging
5.2.1 The Surveillance Pyramid Problem
Public health surveillance captures only a fraction of reality:
- 🔬 Lab-Confirmed Cases (the data you have; the tip of the pyramid)
- 🏥 Hospitalized Cases
- 🏠 Symptomatic Cases Seeking Care
- 😷 All Symptomatic Cases
- 😊 All Infections, including asymptomatic (the base of the pyramid)
The issue: Your dataset represents the tip of the pyramid, but the population of interest is the entire pyramid. This is selection bias by design.
Example: COVID-19 Testing Bias
Early in the pandemic, testing was limited to:
- Symptomatic individuals
- Healthcare workers
- Travelers
- Close contacts of confirmed cases
Any model trained on this data would:
- Overestimate symptom severity (asymptomatic cases invisible)
- Misunderstand demographic risk factors (testing access varied by socioeconomic status)
- Produce biased risk predictions (dataset not representative of population)
If your training data comes from one level of the surveillance pyramid but you want to make predictions about another level, no amount of ML sophistication can fix this.
You need one of three things:
1. Data from the correct population (often impossible)
2. Statistical methods to adjust for selection, such as Hernán’s target trial framework (a minimal reweighting sketch follows this list)
3. A different research question that matches your available data
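For option 2, a simple first step (well short of a full target-trial design) is post-stratification reweighting toward known population margins. The sketch below assumes a hypothetical `age_group` column and illustrative census proportions; it corrects only the dimensions you can measure, not unmeasured selection.

```python
import pandas as pd

# Illustrative census age distribution (assumed values)
population_dist = {'0-17': 0.22, '18-64': 0.62, '65+': 0.16}

def post_stratification_weights(df, column, population_dist):
    """Weight each record by (population share) / (sample share) for its stratum."""
    sample_dist = df[column].value_counts(normalize=True)
    return df[column].map(lambda g: population_dist[g] / sample_dist[g])

# Usage sketch: weighted summaries better reflect the population age mix
# df['weight'] = post_stratification_weights(df, 'age_group', population_dist)
# weighted_rate = (df['hospitalized'] * df['weight']).sum() / df['weight'].sum()
```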
5.2.2 Reporting Delays and Temporal Misalignment
The scenario: You want to predict tomorrow’s case counts based on today’s data. But “today’s data” includes:
- Cases from today (20%)
- Cases from yesterday, reported today (40%)
- Cases from 2 days ago, reported today (30%)
- Cases from 3+ days ago, reported today (10%)
Your “September 15th” dataset is actually a mixture of cases from September 12-15, with unknown proportions.
Why this matters for ML:
- Features and labels are temporally misaligned
- Recent trends are systematically underestimated (right-censoring)
- Models learn spurious patterns from reporting artifacts
Real example: Early COVID-19 forecasting models struggled because they treated reported case counts as actual case counts, ignoring 5-10 day reporting lags that varied by jurisdiction.
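The sketch below simulates this effect using the assumed delay proportions from the list above (all numbers are illustrative): counts for the most recent onset dates look artificially low simply because many of their cases have not been reported yet.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

# 60 days of "true" onsets at a constant rate (illustrative)
dates = pd.date_range('2023-01-01', periods=60)
true_cases = pd.Series(100, index=dates)
today = dates[-1]

# Assumed reporting-delay distribution: 20% same day, 40% +1 day, 30% +2, 10% +3
delay_probs = [0.2, 0.4, 0.3, 0.1]

# For each onset date, count only the cases already reported by "today"
observed = {}
for onset_day, n in true_cases.items():
    delays = rng.choice([0, 1, 2, 3], size=n, p=delay_probs)
    report_days = onset_day + pd.to_timedelta(delays, unit='D')
    observed[onset_day] = int((report_days <= today).sum())

observed = pd.Series(observed)
print(observed.tail(5))  # the last few onset dates look artificially low (right-censoring)
```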
5.2.3 Changing Case Definitions and Surveillance Practices

Case definition changes:
- COVID-19 case definitions changed multiple times in 2020-2021
- Adding antigen tests to confirmed cases
- Including probable cases vs. confirmed only
Testing availability changes:
- Week 1: Only sick hospitalized patients tested
- Week 10: Symptomatic individuals can get tested
- Week 20: Asymptomatic screening becomes common
Your dataset appears to show:
- Exponential case growth
- Changing age distribution
- Different symptom profiles over time
But reality: These might be artifacts of changing surveillance, not actual epidemiological changes.
Warning: The Fundamental Challenge
Machine learning assumes the relationship between features and outcomes is stationary (constant over time).
Public health surveillance violates this assumption constantly. The data-generating process itself evolves.
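One practical check, sketched below, is to compare feature distributions between an early and a late slice of the data; a large shift suggests the data-generating process changed. The file path and column names (`report_date`, `age`) match the examples later in this chapter and are assumed to exist in your data.

```python
import pandas as pd
from scipy.stats import ks_2samp

# Surveillance line list (same file used later in this chapter); column names assumed
df = pd.read_csv('../data/examples/surveillance_data.csv', parse_dates=['report_date'])

# Split into an early and a late period at the median report date
cutoff = df['report_date'].quantile(0.5)
early = df[df['report_date'] <= cutoff]
late = df[df['report_date'] > cutoff]

# Kolmogorov-Smirnov test: has the age distribution of reported cases shifted?
stat, p_value = ks_2samp(early['age'].dropna(), late['age'].dropna())
if p_value < 0.05:
    print(f"⚠️ Age distribution shifted between periods (KS statistic = {stat:.2f})")
    print("   This may reflect changing surveillance rather than changing epidemiology.")
```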
5.2.4 The Denominator Problem

Compare two regions:

Region A: 1,000 cases, 50 deaths → 5% CFR
Region B: 500 cases, 100 deaths → 20% CFR
Is Region B truly more dangerous? Or does it have:
- Less testing (missing mild cases in denominator)
- Older population
- Worse healthcare access
- Different case definitions
The problem: Your dataset has numerators (cases, deaths) but often lacks good denominators (population at risk, testing rates, exposure levels).
AI models trained on case counts without accounting for testing intensity will confuse “more testing” with “more disease,” as Ioannidis cautioned.
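A quick back-of-the-envelope calculation shows how much ascertainment matters; the detection fractions below are purely illustrative assumptions.

```python
# Naive CFR uses detected cases as the denominator
region_a = {'cases': 1000, 'deaths': 50}
region_b = {'cases': 500, 'deaths': 100}

for name, r in [('A', region_a), ('B', region_b)]:
    print(f"Region {name}: naive CFR = {r['deaths'] / r['cases']:.1%}")

# Suppose Region A detects ~50% of infections and Region B only ~10% (assumed values)
detection = {'A': 0.50, 'B': 0.10}
for name, r in [('A', region_a), ('B', region_b)]:
    estimated_infections = r['cases'] / detection[name]
    print(f"Region {name}: infection fatality proxy = {r['deaths'] / estimated_infections:.1%}")
# Under these assumptions, Region B's apparent 20% CFR drops to ~2%
```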
5.2.5 Privacy Constraints and Data Aggregation
To protect patient privacy, public health data is often:
- Aggregated (county-level instead of individual-level)
- Suppressed (cells with <5 cases shown as “<5” or asterisk)
- Coarsened (age shown as brackets: “20-29” instead of exact age)
- Delayed (real-time data withheld, released weeks later)
Impact on ML:
- Loss of granularity reduces predictive power
- Non-standard missing data patterns
- Cannot link across datasets (no unique identifiers)
- Temporal resolution insufficient for some analyses
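Suppression also breaks naive numeric parsing: counts arrive as strings like “<5” or “*”. A minimal handling sketch follows; the column names are hypothetical, and whether to treat suppressed counts as missing or as an assumed midpoint is a modeling choice you should document.

```python
import numpy as np
import pandas as pd

# Example aggregated table with privacy suppression (illustrative data)
county_counts = pd.DataFrame({
    'county': ['Adams', 'Brown', 'Clark'],
    'cases': ['12', '<5', '37'],
})

def parse_suppressed(value, strategy='nan'):
    """Convert suppressed cells ('<5', '*') to NaN, or to a midpoint if requested."""
    if value in ('<5', '*'):
        return np.nan if strategy == 'nan' else 2.0  # midpoint of 0-4 is a modeling choice
    return float(value)

county_counts['cases_clean'] = county_counts['cases'].apply(parse_suppressed)
print(county_counts)
```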
5.3 Assessing Data Quality: Four Key Dimensions

5.3.1 Completeness: Are Values Present?

The question: What percentage of values are missing? Is the missingness random or systematic?
Code to assess:
```python
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Load your data
df = pd.read_csv('../data/examples/surveillance_data.csv')

# Calculate missingness
missing_summary = pd.DataFrame({
    'column': df.columns,
    'missing_count': df.isnull().sum(),
    'missing_percent': (df.isnull().sum() / len(df) * 100).round(2)
}).sort_values('missing_percent', ascending=False)

print("Missing Data Summary:")
print(missing_summary)

# Visualize missingness patterns
import missingno as msno

# Matrix plot: see where missing values cluster
msno.matrix(df, figsize=(12, 6), sparkline=False)
plt.title('Missing Data Pattern')
plt.tight_layout()
plt.savefig('../images/examples/missing_data_matrix.png', dpi=300)
```
```python
# Check if missingness is related to other variables
df['symptom_onset_missing'] = df['symptom_onset'].isnull()

# Compare groups with and without missing values
print(df.groupby('symptom_onset_missing')['age'].mean())
print(df.groupby('symptom_onset_missing')['hospitalized'].mean())

# Statistical test
from scipy.stats import chi2_contingency

contingency_table = pd.crosstab(df['symptom_onset_missing'], df['hospitalized'])
chi2, p_value, dof, expected = chi2_contingency(contingency_table)

if p_value < 0.05:
    print("⚠️ Missingness is NOT random - be careful with imputation")
```
5.3.2 Accuracy: Are Values Correct?
Common accuracy issues:
5.3.2.1 Impossible or Implausible Values
```python
# Detect impossible values
issues = []

# Age checks
if (df['age'] < 0).any() or (df['age'] > 120).any():
    issues.append("Impossible ages detected")

# Date checks
df['symptom_date'] = pd.to_datetime(df['symptom_date'], errors='coerce')
df['test_date'] = pd.to_datetime(df['test_date'], errors='coerce')

# Symptom onset after test date?
invalid_dates = df['symptom_date'] > df['test_date']
if invalid_dates.any():
    issues.append(f"{invalid_dates.sum()} cases with symptoms after test")

# Temperature checks
if 'temperature' in df.columns:
    if (df['temperature'] < 35).any() or (df['temperature'] > 42).any():
        issues.append("Implausible temperatures detected")

for issue in issues:
    print(issue)
```
5.3.2.2 Data Entry Errors
```python
# Common data entry errors

# 1. Height/weight transpositions
df['bmi'] = df['weight'] / (df['height'] / 100) ** 2
suspicious_bmi = (df['bmi'] < 10) | (df['bmi'] > 60)
print(f"Suspicious BMI values: {suspicious_bmi.sum()}")

# 2. Unit inconsistencies (Celsius vs. Fahrenheit)
if 'temperature' in df.columns:
    potential_fahrenheit = (df['temperature'] > 50).sum()
    if potential_fahrenheit > 0:
        print(f"⚠️ Possible unit mixing: {potential_fahrenheit} values >50°")

# 3. Decimal point errors (370 instead of 37.0)
if 'temperature' in df.columns:
    extreme_values = df['temperature'] > 100
    print(f"Possible decimal errors: {extreme_values.sum()}")

# 4. Age heaping (digit preference: ages rounded to 0s and 5s)
from scipy.stats import chisquare

age_last_digit = (df['age'].dropna() % 10).astype(int)
digit_counts = age_last_digit.value_counts().reindex(range(10), fill_value=0).sort_index()
expected_freq = [len(age_last_digit) / 10] * 10
chi2, p_value = chisquare(digit_counts, expected_freq)
if p_value < 0.05:
    print("⚠️ Age heaping detected - rounding to 0s and 5s")
```
Important: The “Too Clean” Red Flag
If your public health dataset has zero missing values, perfect consistency, and no outliers—be suspicious. Real-world data is messy.
5.3.3 Timeliness: Are Values Up-to-Date?
Quantifying reporting delays:
```python
# Calculate reporting lag
df['report_date'] = pd.to_datetime(df['report_date'])
df['symptom_onset'] = pd.to_datetime(df['symptom_onset'])

# Lag from symptom onset to report
df['onset_to_report_days'] = (df['report_date'] - df['symptom_onset']).dt.days

# Visualize
fig, ax = plt.subplots(figsize=(10, 6))
ax.hist(df['onset_to_report_days'].dropna(), bins=50, edgecolor='black')
ax.set_xlabel('Days from Symptom Onset to Report')
ax.set_ylabel('Frequency')
ax.axvline(df['onset_to_report_days'].median(), color='red', linestyle='--',
           label=f"Median: {df['onset_to_report_days'].median():.1f} days")
ax.legend()
plt.savefig('../images/examples/reporting_delays.png', dpi=300)
```
5.3.4 Representativeness: Does Data Match Target Population?
A common bias is demographic skew. Compare your sample to census data:
```python
# Check for demographic bias: compare to census data
population_age_dist = {'0-17': 0.22, '18-64': 0.62, '65+': 0.16}
sample_age_dist = df['age_group'].value_counts(normalize=True).to_dict()

print("Age Distribution Comparison:")
print("Age Group | Population | Sample | Difference")
print("-" * 50)
for age_group in population_age_dist:
    pop_pct = population_age_dist[age_group] * 100
    sample_pct = sample_age_dist.get(age_group, 0) * 100
    diff = sample_pct - pop_pct
    print(f"{age_group:8} | {pop_pct:6.1f}% | {sample_pct:6.1f}% | {diff:+6.1f}%")
```
Warning: When Representativeness Matters Most
Representativeness is critical when:
- Making population-level inferences
- Forecasting total burden
- Evaluating interventions
Less critical when:
- Early outbreak detection (any signal helps)
- Comparing relative risks within sample
5.4 Common Data Issues and How to Handle Them
5.4.1 Issue 1: Missing Data
When data are MCAR (Missing Completely At Random), simple imputation is reasonable:
```python
from sklearn.impute import SimpleImputer

# Median imputation for numerical variables
numerical_cols = df.select_dtypes(include=[np.number]).columns
imputer = SimpleImputer(strategy='median')
df[numerical_cols] = imputer.fit_transform(df[numerical_cols])
```
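When missingness is MAR rather than MCAR (it depends on other observed variables, as in the age/income example in the questions later in this chapter), imputation should condition on those variables. A hedged sketch with scikit-learn’s IterativeImputer follows; note that MAR itself is an untestable assumption about your data.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (activates IterativeImputer)
from sklearn.impute import IterativeImputer

# Model each variable's missing values from the other numerical variables
numerical_cols = df.select_dtypes(include=[np.number]).columns
mar_imputer = IterativeImputer(random_state=0, max_iter=10)
df[numerical_cols] = mar_imputer.fit_transform(df[numerical_cols])
```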
5.4.2 Issue 2: Outliers

A decision framework for suspicious values:

Is it biologically impossible?
├─ Yes → Error. Correct or remove.
└─ No → Is it a data entry error?
├─ Yes → Correct it
└─ No → Does it represent important subpopulation?
├─ Yes → Keep it!
└─ Uncertain → Sensitivity analysis
Important: The “Interesting Outlier” Trap
Real example: Early COVID-19 models removed patients with “impossibly fast” progression as outliers. Later discovered these were cytokine storm cases—rare but critical.
Lesson: Outliers often teach you the most. Be cautious about automatic removal.
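In code, this translates into flagging rather than dropping. A minimal sketch follows; the column name `onset_to_death_days` is hypothetical, and the 3×IQR threshold is an arbitrary choice for illustration.

```python
# Flag, don't drop: extreme values go to a review list instead of being deleted
col = 'onset_to_death_days'  # hypothetical column

q1, q3 = df[col].quantile([0.25, 0.75])
iqr = q3 - q1
is_extreme = (df[col] < q1 - 3 * iqr) | (df[col] > q3 + 3 * iqr)

df['needs_review'] = is_extreme
print(f"{is_extreme.sum()} records flagged for manual review (not removed)")

# Sensitivity analysis: fit the model with and without flagged records and compare
```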
The COVID-19 pandemic provided a natural experiment in real-time forecasting. Despite unprecedented data availability, most forecasts performed poorly.
What went wrong:
- Reporting delay ignorance: models treated “reported March 15” as “occurred March 15”
“Forecast accuracy was limited more by data quality issues—reporting delays, changing surveillance, and selection bias—than by modeling approaches. Simple models with good data outperformed complex models with poor data.”
The pandemic exposed that public health data infrastructure, not algorithms, is the bottleneck.
5.7 When Data Problems Make AI Inappropriate
Red flags that should stop you:
1. Extreme selection bias with no adjustment
   - Example: Training on hospitalized, deploying on general population
   - Why fatal: Model doesn’t generalize
2. Outcome label quality is poor
   - Example: Inconsistent disease diagnosis criteria
   - Why fatal: Model learns noise, not signal
3. Critical features >50% missing
   - Example: Symptom onset missing for most cases
   - Why fatal: Can’t build temporal features
4. Data generating process changed mid-dataset
   - Example: Pre/post policy change, different testing
   - Why fatal: Violates stationarity assumption
5. Sample size too small
   - Example: 100 cases, 50 features, trying deep learning
   - Why fatal: Severe overfitting
6. External validity concerns
   - Example: One hospital’s data, deploying elsewhere
   - Why fatal: Context matters, model doesn’t transfer
Important: The Courage to Say “No”
Good data science includes knowing when NOT to build a model.
If data quality is insufficient:
1. Report limitations clearly
2. Recommend improving data collection
3. Suggest alternatives (traditional epi, qualitative research)
Building on garbage data:
- Wastes resources
- Produces misleading results
- Can harm people
- Damages trust in AI
As Box said: “All models are wrong, but some are useful.” If your data is wrong enough, your model will be useless.
5.8 Key Takeaways
Data quality determines model quality. Algorithms can’t overcome flawed data.
Understand missingness mechanisms. MCAR, MAR, MNAR require different strategies.
Public health data has unique challenges: selection bias, reporting delays, changing surveillance.
Investigate, don’t automatically fix. Outliers might be the most important cases.
Document everything. Future users need to know what you cleaned and why.
Know when to stop. Sometimes the ethical choice is not to build a model.
Temporal ordering matters. Never use future to predict past.
5.9 Practice Exercises
5.9.1 Exercise 1: Missing Data Mechanisms
1. Identify variables with missing data
2. Test if missingness is MCAR, MAR, or MNAR
3. Choose imputation strategies
4. Compare model performance
5.9.2 Exercise 2: Reporting Delay Adjustment
1. Calculate the reporting triangle
2. Implement nowcasting
3. Compare nowcast vs. raw counts
4. Evaluate accuracy over time
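If you want a starting point, below is a deliberately simple multiplication-factor nowcast built from the empirical delay distribution. Column names match the earlier examples in this chapter and are assumed to exist in the file; a real reporting-triangle or Bayesian nowcast would handle the censored recent delays more carefully.

```python
import pandas as pd

# Line list with onset and report dates (column names assumed)
cases = pd.read_csv('../data/examples/surveillance_data.csv',
                    parse_dates=['symptom_onset', 'report_date'])
cases['delay_days'] = (cases['report_date'] - cases['symptom_onset']).dt.days

today = cases['report_date'].max()
onset_counts = cases.groupby('symptom_onset').size()

# Nowcast each onset day: divide the observed count by the empirical
# probability that a case is reported within the elapsed time
nowcast = {}
for onset_day, observed in onset_counts.items():
    days_elapsed = (today - onset_day).days
    prob_reported = (cases['delay_days'] <= days_elapsed).mean()
    nowcast[onset_day] = observed / max(prob_reported, 0.05)  # cap extreme inflation

nowcast = pd.Series(nowcast).sort_index()
print(pd.DataFrame({'reported': onset_counts, 'nowcast': nowcast.round(1)}).tail(7))
```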
5.9.3 Exercise 3: Feature Engineering
Create outbreak predictor using:
- Daily cases
- Temperature
- Rainfall
- Population density
Engineer 10+ features. Which matter most?
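As a starter sketch only (the file path and column names are hypothetical), here are a few backward-looking features of the kind this exercise asks for; note that every feature uses only past information.

```python
import pandas as pd

# Daily district-level data (hypothetical file and columns: date, cases, temperature, rainfall)
daily = pd.read_csv('../data/examples/outbreak_daily.csv', parse_dates=['date'])
daily = daily.sort_values('date').set_index('date')

# Lag features: what happened 1, 7, and 14 days ago
for lag in [1, 7, 14]:
    daily[f'cases_lag_{lag}'] = daily['cases'].shift(lag)

# Trailing (backward-looking) rolling statistics -- no future information
daily['cases_7d_mean'] = daily['cases'].shift(1).rolling(window=7).mean()
daily['cases_growth'] = daily['cases_7d_mean'] / daily['cases'].shift(1).rolling(14).mean()

# Domain-driven environmental features (e.g., standing water for Aedes breeding)
daily['rainfall_14d_total'] = daily['rainfall'].shift(1).rolling(window=14).sum()
daily['temp_7d_mean'] = daily['temperature'].shift(1).rolling(window=7).mean()

print(daily.dropna().head())
```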
5.9.4 Exercise 4: Data Quality Audit
Audit a “cleaned” dataset:
1. Find 5+ quality issues
2. Determine if fixable or fatal
3. Write quality report
Check Your Understanding
Test your knowledge of the key concepts from this chapter. The answer and explanation follow each question.
Question 1: Data Quality Foundations
Which dimension of data quality is violated when COVID-19 testing was initially limited to hospitalized patients, but models were deployed to predict community transmission?
a) Completeness
b) Accuracy
c) Timeliness
d) Representativeness
Answer: d) Representativeness
Explanation: This is a representativeness problem. The training data (hospitalized patients) doesn’t represent the target population (community cases). This selection bias means the model learns patterns from a severely ill subpopulation that don’t generalize to mild/asymptomatic cases in the community.
Question 2: Missing Data Mechanisms
You notice that income data is missing for 30% of younger respondents but only 5% of older respondents. What type of missingness is this?
a) Missing Completely At Random (MCAR)
b) Missing At Random (MAR)
c) Missing Not At Random (MNAR)
d) Systematic error
Answer: b) Missing At Random (MAR)
Explanation: This is MAR (Missing At Random) because the missingness depends on an observed variable (age), not on the missing value itself (income). You can impute income using age as a predictor since you know the missingness pattern.
Question 3: Temporal Data Leakage
True or False: When creating rolling averages for time series forecasting, using rolling(window=7, center=True) is appropriate because it creates smoother trends.
Answer: False
Explanation: Using center=True causes data leakage by incorporating future values into the calculation. For a value on day 10, it would include data from days 7-13, meaning you’re using days 11-13 (the future) to predict day 10. For forecasting, you must use center=False (the default) to only look backwards in time.
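A small demonstration with a toy series makes the leak visible:

```python
import pandas as pd

cases = pd.Series(range(1, 15), index=pd.date_range('2023-09-01', periods=14), name='cases')

# Leaky: centered window uses days after the target date
leaky = cases.rolling(window=7, center=True).mean()

# Safe for forecasting: backward-looking window only (default center=False)
safe = cases.rolling(window=7).mean()

comparison = pd.DataFrame({'cases': cases, 'centered_leaky': leaky, 'trailing_safe': safe})
print(comparison.loc['2023-09-10'])  # the centered value for Sep 10 already "knows" Sep 11-13
```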
Question 4: Reporting Delays
A surveillance system shows cases declining on the most recent dates. What is the most likely explanation before concluding the outbreak is ending?
a) Effective public health intervention
b) Population-level immunity developing
c) Right-censoring from reporting delays
d) Seasonal weather changes
Answer: c) Right-censoring from reporting delays
Explanation: Right-censoring from reporting delays is the most common explanation for apparent declines in recent data. Cases from recent days haven’t fully been reported yet, creating an artificial downward trend. Always check reporting lag distributions before interpreting recent trends. This is why nowcasting methods exist.
Question 5: Feature Engineering
Which of these features would be MOST useful for predicting dengue outbreaks and demonstrates good domain expertise?
a) Day of the week
b) Raw daily temperature
c) 14-day cumulative rainfall
d) Month number (1-12)
Answer: c) 14-day cumulative rainfall
Explanation: 14-day cumulative rainfall shows domain expertise because dengue mosquitoes (Aedes) breed in standing water that accumulates over time. Raw daily temperature or simple month number don’t capture the mechanism. Day of the week is irrelevant for disease biology. This demonstrates how epidemiological understanding drives effective feature engineering.
Question 6: Outliers
You find 5 patients with “impossibly fast” disease progression (symptom onset to death in <24 hours). What should you do?
a) Remove them as data entry errors
b) Remove them to avoid overfitting on outliers
c) Investigate them carefully—they might be important rare cases
d) Set their values to the median progression time
Answer: c) Investigate them carefully—they might be important rare cases
Explanation: Investigate carefully before removing. These could be:
- Real rare but critical cases (e.g., cytokine storm, overwhelming infection)
- An important subpopulation with different risk factors
- Data entry errors (date fields swapped)
The COVID-19 pandemic taught us that “outliers” like rapidly progressing cases were often the most clinically important to understand. Only remove after investigation confirms they’re truly errors.
5.10 Discussion Questions
1. Your training data is 80% from one hospital. How does this affect generalizability? What can you do?
2. Your model has 95% accuracy on training, 65% on test, and 30% of labels are missing. How do you interpret this?
3. A colleague wants to impute symptom onset as report date minus the median delay. What are the pros and cons? When is this appropriate?
4. Forecasting COVID-19 hospitalizations: should you use raw counts or adjust for test positivity? Justify your choice.
5. When is it acceptable to remove outliers from public health data? Give examples of appropriate vs. inappropriate removal.
You now understand why data quality is the foundation of successful AI in public health. Clean data and thoughtful feature engineering matter more than fancy algorithms.
Tip: Part I Summary: What You Should Now Know
Congratulations! You’ve completed Part I: Foundations. Before moving to applications, ensure you can confidently:
5.11.6 From Chapter 1 (History & Context)
Explain how AI has evolved in public health from expert systems to modern machine learning
Identify when AI adds value vs. when traditional epidemiology is sufficient
Recognize patterns of AI hype and separate genuine capabilities from marketing
Understand the unique ethical considerations of AI in population health
5.11.7 From Chapter 2 (AI Basics)
Distinguish between supervised, unsupervised, and reinforcement learning paradigms
Choose appropriate algorithms for different problem types (logistic regression → Random Forests → gradient boosting → deep learning)
Interpret evaluation metrics (accuracy, sensitivity, specificity, ROC-AUC) and know when each matters
Identify overfitting and data leakage issues
Understand why most public health tabular data works best with tree-based methods, not deep learning
Explain predictions using feature importance and SHAP values
5.11.8 From Chapter 3 (Data Quality)
Assess data quality across four dimensions: completeness, accuracy, timeliness, representativeness
Identify missing data mechanisms (MCAR, MAR, MNAR) and choose appropriate imputation strategies
Handle reporting delays and right-censoring in surveillance data
Engineer meaningful features from public health data using domain expertise
Recognize when data problems make AI inappropriate (selection bias, insufficient sample size, changing data-generating processes)
Avoid temporal data leakage when building forecasting models
5.11.9 Core Skills Checklist
Can you:
- [ ] Read and modify basic ML code in Python (scikit-learn, pandas)?
- [ ] Run a complete ML pipeline from data loading → preprocessing → training → evaluation?
- [ ] Interpret confusion matrices and choose metrics appropriate for your use case?
- [ ] Detect and handle missing data without blindly dropping rows?
- [ ] Create time-based features without leaking future information?
- [ ] Explain why a simple model on clean data beats a complex model on dirty data?
- [ ] Identify when NOT to build an AI model due to data quality issues?
5.11.10 What’s Next
Part II: Applications applies these foundations to real public health problems:
- Disease surveillance and outbreak detection
- Epidemic forecasting
- Genomic surveillance
- Large language models in public health
- Clinical decision support
You now have the conceptual foundation and technical skills to understand how AI works in practice. The next chapters show you specific applications where AI creates value—and where it falls short.
If you’re unsure about any foundation concept, review the relevant chapter before proceeding. The applications build directly on these fundamentals.