5  The Data Problem

Tip: Learning Objectives

Time to complete: 75-90 minutes
Prerequisites: Chapter 1: History, Chapter 2: AI Basics

By the end of this chapter, you will:

  • Understand why public health data is uniquely challenging for AI
  • Assess data quality using four key dimensions (completeness, accuracy, timeliness, representativeness)
  • Identify and handle common data issues: missing data, reporting delays, selection bias
  • Apply practical cleaning and preprocessing strategies
  • Engineer meaningful features from messy real-world data
  • Recognize when data problems make AI inappropriate

What you’ll build: 💻 Complete data quality assessment pipeline with missing data analysis, outlier detection, and feature engineering for surveillance data

5.1 Introduction

Here’s the uncomfortable truth: Most AI tutorials use clean, well-formatted datasets. Most public health data is neither clean nor well formatted.

Real public health data is:

  • Incomplete (missing values everywhere)
  • Delayed (cases reported days or weeks late)
  • Biased (who gets tested determines who appears in your data)
  • Inconsistent (formats change, definitions evolve)
  • Messy (typos, duplicates, impossible values)
  • Heterogeneous (multiple sources with different standards)

And yet: This is the data you have to work with. Learning to handle messy data is arguably more important than learning fancy algorithms.

Important: The Iron Law of Machine Learning

Garbage in = Garbage out

The most sophisticated deep learning model cannot overcome fundamentally flawed data. A simple logistic regression on clean, well-understood data will outperform a cutting-edge neural network on garbage data.

As Andrew Ng emphasizes in his data-centric AI work: successful AI is 80% data, 20% algorithms. Yet most courses spend 80% of time on algorithms and 20% on data.

This chapter flips that ratio.

5.2 Why Public Health Data Is Uniquely Challenging

5.2.1 1. The Surveillance Pyramid Problem

Public health surveillance captures only a fraction of reality:

                    🔬 Lab-Confirmed Cases (Data you have)
                   /
              🏥 Hospitalized Cases
             /
        🏠 Symptomatic Cases Seeking Care
       /
  😷 All Symptomatic Cases
 /
😊 All Infections (Including Asymptomatic)

The issue: Your dataset represents the tip of the pyramid, but the population of interest is the entire pyramid. This is selection bias by design.

Example: COVID-19 Testing Bias

Early in the pandemic, testing was limited to:

  • Symptomatic individuals
  • Healthcare workers
  • Travelers
  • Close contacts of confirmed cases

Any model trained on this data would:

  • Overestimate symptom severity (asymptomatic cases invisible)
  • Misunderstand demographic risk factors (testing access varied by socioeconomic status)
  • Produce biased risk predictions (dataset not representative of population)

As documented in Lipsitch et al.’s analysis of surveillance biases, these selection effects can lead to systematic underestimation of disease burden.

Note: When Selection Bias Invalidates AI

If your training data comes from one level of the surveillance pyramid but you want to make predictions about another level, no amount of ML sophistication can fix this.

You need either:

  1. Data from the correct population (often impossible)
  2. Statistical methods to adjust for selection (Hernán’s target trial framework)
  3. A different research question that matches your available data

5.2.2 2. Reporting Delays and Temporal Misalignment

The scenario: You want to predict tomorrow’s case counts based on today’s data. But “today’s data” includes:

  • Cases from today (20%)
  • Cases from yesterday, reported today (40%)
  • Cases from 2 days ago, reported today (30%)
  • Cases from 3+ days ago, reported today (10%)

Your “September 15th” dataset is actually a mixture of cases from September 12-15, with unknown proportions.

Why this matters for ML:

  • Features and labels are temporally misaligned
  • Recent trends are systematically underestimated (right-censoring)
  • Models learn spurious patterns from reporting artifacts

Real example: Early COVID-19 forecasting models struggled because they treated reported case counts as actual case counts, ignoring 5-10 day reporting lags that varied by jurisdiction.
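One way to see this structure directly is to tabulate the reporting triangle: case counts by onset date (rows) and reporting delay in days (columns), where the most recent rows are visibly incomplete. Below is a minimal sketch; it assumes the surveillance line list used later in this chapter, with symptom_onset and report_date columns, and reuses the same illustrative file path from Section 5.3.1.

import pandas as pd

# Line list with one row per case; parse both date columns up front
df = pd.read_csv('../data/examples/surveillance_data.csv',
                 parse_dates=['symptom_onset', 'report_date'])

# Reporting delay in days for each case
df['delay'] = (df['report_date'] - df['symptom_onset']).dt.days

# Reporting triangle: rows = onset date, columns = delay, values = case counts
triangle = (df[df['delay'].between(0, 14)]
            .pivot_table(index='symptom_onset', columns='delay',
                         values='report_date', aggfunc='count', fill_value=0))

print(triangle.tail())  # recent onset dates have filled only the small-delay columns

The triangle is the starting point for the nowcasting methods discussed later in this chapter and in Exercise 2.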

5.2.3 3. Changing Definitions and Data Collection

Public health data collection evolves over time, as extensively documented in Reich et al.’s analysis of forecasting challenges.

Case definition changes:

  • COVID-19 case definitions changed multiple times in 2020-2021
  • Adding antigen tests to confirmed cases
  • Including probable cases vs. confirmed only

Testing availability changes:

  • Week 1: Only sick hospitalized patients tested
  • Week 10: Symptomatic individuals can get tested
  • Week 20: Asymptomatic screening becomes common

Your dataset appears to show:

  • Exponential case growth
  • Changing age distribution
  • Different symptom profiles over time

But reality: These might be artifacts of changing surveillance, not actual epidemiological changes.

Warning: The Fundamental Challenge

Machine learning assumes the relationship between features and outcomes is stationary (constant over time).

Public health surveillance violates this assumption constantly. The data-generating process itself evolves.

This is why pandemic forecasting is so difficult—the ground truth keeps shifting under your feet.
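A cheap diagnostic for this kind of drift is to compare a feature’s distribution before and after a suspected surveillance change. The sketch below is illustrative only: the change date is hypothetical, and it assumes the line list from Section 5.3.1 with report_date and age columns.

import pandas as pd
from scipy.stats import ks_2samp

df = pd.read_csv('../data/examples/surveillance_data.csv', parse_dates=['report_date'])

change_point = pd.Timestamp('2020-04-01')  # hypothetical date of a testing-policy change
before = df.loc[df['report_date'] < change_point, 'age'].dropna()
after = df.loc[df['report_date'] >= change_point, 'age'].dropna()

# Two-sample Kolmogorov-Smirnov test for a shift in the age distribution
stat, p_value = ks_2samp(before, after)
print(f"KS statistic = {stat:.3f}, p = {p_value:.4f}")
if p_value < 0.05:
    print("⚠️ Age distribution shifted; check for surveillance changes before modeling")

A significant shift does not tell you which change caused it, but it flags periods that should not be naively pooled into one training set.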

5.2.4 4. The Denominator Problem

You observe:

Region A: 1,000 cases, 50 deaths → 5% CFR
Region B: 500 cases, 100 deaths → 20% CFR

Is Region B truly more dangerous? Or does it have:

  • Less testing (missing mild cases in the denominator)
  • An older population
  • Worse healthcare access
  • Different case definitions

The problem: Your dataset has numerators (cases, deaths) but often lacks good denominators (population at risk, testing rates, exposure levels).

AI models trained on case counts without accounting for testing intensity will confuse “more testing” with “more disease,” as Ioannidis cautioned.
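To make the denominator problem concrete, the snippet below attaches hypothetical testing volumes and populations (assumed numbers, not from any real dataset) to the Region A / Region B example above and recomputes the rates:

import pandas as pd

# Hypothetical denominators for the Region A / Region B example
regions = pd.DataFrame({
    'region': ['A', 'B'],
    'cases': [1000, 500],
    'deaths': [50, 100],
    'tests': [50000, 4000],         # assumed testing volume
    'population': [200000, 180000]  # assumed population at risk
})

regions['crude_cfr'] = regions['deaths'] / regions['cases']
regions['test_positivity'] = regions['cases'] / regions['tests']
regions['cases_per_100k'] = regions['cases'] / regions['population'] * 1e5
regions['deaths_per_100k'] = regions['deaths'] / regions['population'] * 1e5

print(regions[['region', 'crude_cfr', 'test_positivity',
               'cases_per_100k', 'deaths_per_100k']])

With these made-up denominators, Region B’s test positivity is far higher (12.5% vs. 2%), suggesting it detects only its sickest cases; the apparent CFR gap may say more about testing than about the virus.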

5.2.5 5. Privacy Constraints and Data Aggregation

To protect patient privacy, public health data is often:

  • Aggregated (county-level instead of individual-level)
  • Suppressed (cells with <5 cases shown as “<5” or an asterisk)
  • Coarsened (age shown as brackets: “20-29” instead of exact age)
  • Delayed (real-time data withheld, released weeks later)

Impact on ML:

  • Loss of granularity reduces predictive power
  • Non-standard missing data patterns
  • Cannot link across datasets (no unique identifiers)
  • Temporal resolution insufficient for some analyses

Standards such as HL7 FHIR shape how the data you receive is formatted, and HIPAA Safe Harbor de-identification rules constrain what you can access in the first place.
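Suppression also breaks naive numeric parsing: a column of counts containing the string “<5” will not load as numbers. A minimal sketch of one way to handle it (the column names and values here are hypothetical):

import pandas as pd

def parse_suppressed(value, midpoint=2):
    """Convert a suppressed cell like '<5' to a numeric placeholder plus a flag.

    Keeping the flag preserves the fact that the true value is only known
    to lie between 1 and 4."""
    if isinstance(value, str) and value.strip().startswith('<'):
        return midpoint, True
    return float(value), False

# Hypothetical county-week counts as released, with small cells suppressed
weekly = pd.DataFrame({'county': ['A', 'B', 'C'], 'cases': [12, '<5', 37]})
parsed = [parse_suppressed(v) for v in weekly['cases']]
weekly['cases_numeric'] = [p[0] for p in parsed]
weekly['cases_suppressed'] = [p[1] for p in parsed]
print(weekly)

Downstream models can then treat suppression explicitly (for example, via the flag) instead of silently learning from an arbitrary fill value.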

5.3 The Four Dimensions of Data Quality

When assessing whether your data is suitable for AI, evaluate these four dimensions, as outlined in Weiskopf and Weng’s data quality framework.

5.3.1 1. Completeness: Are Values Missing?

The question: What percentage of values are missing? Is the missingness random or systematic?

Code to assess:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Load your data
df = pd.read_csv('../data/examples/surveillance_data.csv')

# Calculate missingness
missing_summary = pd.DataFrame({
    'column': df.columns,
    'missing_count': df.isnull().sum(),
    'missing_percent': (df.isnull().sum() / len(df) * 100).round(2)
}).sort_values('missing_percent', ascending=False)

print("Missing Data Summary:")
print(missing_summary)

# Visualize missingness patterns
import missingno as msno

# Matrix plot: see where missing values cluster
msno.matrix(df, figsize=(12, 6), sparkline=False)
plt.title('Missing Data Pattern')
plt.tight_layout()
plt.savefig('../images/examples/missing_data_matrix.png', dpi=300)

Interpreting results:

Missingness %    Assessment     Action
< 5%             Minimal        Simple imputation okay
5-20%            Moderate       Investigate patterns, careful imputation
20-50%           Substantial    Deep investigation required
> 50%            Severe         Likely unusable without major effort

Types of missingness (Little & Rubin’s taxonomy):

  1. Missing Completely At Random (MCAR): Missingness is unrelated to any variables
    • Example: Lab results lost due to random system glitches
    • Implication: Simple imputation is safe
  2. Missing At Random (MAR): Missingness depends on observed variables
    • Example: Income more likely missing for younger respondents
    • Implication: Can impute using related variables
  3. Missing Not At Random (MNAR): Missingness depends on the missing value itself
    • Example: Symptom onset missing because the sickest patients could not be interviewed
    • Implication: Standard imputation is biased; add missingness indicators and run sensitivity analyses

Tip: Testing Missingness Patterns
# Check if missingness is related to other variables
df['symptom_onset_missing'] = df['symptom_onset'].isnull()

# Compare groups with and without missing values
print(df.groupby('symptom_onset_missing')['age'].mean())
print(df.groupby('symptom_onset_missing')['hospitalized'].mean())

# Statistical test
from scipy.stats import chi2_contingency
contingency_table = pd.crosstab(df['symptom_onset_missing'],
                                 df['hospitalized'])
chi2, p_value, dof, expected = chi2_contingency(contingency_table)

if p_value < 0.05:
    print("⚠️ Missingness is NOT random - be careful with imputation")

5.3.2 2. Accuracy: Are Values Correct?

Common accuracy issues:

5.3.2.1 Impossible or Implausible Values

# Detect impossible values
issues = []

# Age checks
if (df['age'] < 0).any() or (df['age'] > 120).any():
    issues.append("Impossible ages detected")

# Date checks
df['symptom_date'] = pd.to_datetime(df['symptom_date'], errors='coerce')
df['test_date'] = pd.to_datetime(df['test_date'], errors='coerce')

# Symptom onset after test date?
invalid_dates = df['symptom_date'] > df['test_date']
if invalid_dates.any():
    issues.append(f"{invalid_dates.sum()} cases with symptoms after test")

# Temperature checks
if 'temperature' in df.columns:
    if (df['temperature'] < 35).any() or (df['temperature'] > 42).any():
        issues.append("Implausible temperatures detected")

for issue in issues:
    print(issue)

5.3.2.2 Data Entry Errors

# Common data entry errors

# 1. Height/weight transpositions
df['bmi'] = df['weight'] / (df['height']/100)**2
suspicious_bmi = (df['bmi'] < 10) | (df['bmi'] > 60)
print(f"Suspicious BMI values: {suspicious_bmi.sum()}")

# 2. Unit inconsistencies
if 'temperature' in df.columns:
    potential_fahrenheit = (df['temperature'] > 50).sum()
    if potential_fahrenheit > 0:
        print(f"⚠️ Possible unit mixing: {potential_fahrenheit} values >50°")

# 3. Decimal point errors (370 instead of 37.0)
if 'temperature' in df.columns:
    extreme_values = df['temperature'] > 100
    print(f"Possible decimal errors: {extreme_values.sum()}")

# 4. Age heaping (digit preference)
ages = df['age'].dropna()
age_last_digit = (ages % 10).astype(int)
# Ensure all 10 digits appear so observed and expected counts line up
digit_counts = age_last_digit.value_counts().reindex(range(10), fill_value=0)

from scipy.stats import chisquare
expected_freq = [len(ages) / 10] * 10
chi2, p_value = chisquare(digit_counts, expected_freq)
if p_value < 0.05:
    print("⚠️ Age heaping detected - rounding to 0s and 5s")

Important: The “Too Clean” Red Flag

If your public health dataset has zero missing values, perfect consistency, and no outliers—be suspicious. Real-world data is messy.

5.3.3 3. Timeliness: Are Values Up-to-Date?

Quantifying reporting delays:

# Calculate reporting lag
df['report_date'] = pd.to_datetime(df['report_date'])
df['symptom_onset'] = pd.to_datetime(df['symptom_onset'])

# Lag from symptom onset to report
df['onset_to_report_days'] = (df['report_date'] - df['symptom_onset']).dt.days

# Visualize
fig, ax = plt.subplots(figsize=(10, 6))
ax.hist(df['onset_to_report_days'].dropna(), bins=50, edgecolor='black')
ax.set_xlabel('Days from Symptom Onset to Report')
ax.set_ylabel('Frequency')
ax.axvline(df['onset_to_report_days'].median(), color='red',
           linestyle='--',
           label=f"Median: {df['onset_to_report_days'].median():.1f} days")
ax.legend()
plt.savefig('../images/examples/reporting_delays.png', dpi=300)

Nowcasting: Adjusting for reporting delays as described in McGough et al.’s work.

Note: Advanced Nowcasting

For rigorous handling of reporting delays:

  • EpiNow2 R package
  • CDC’s COVID-19 nowcasting approach
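To make the idea concrete, here is a deliberately crude multiplicative nowcast, a sketch rather than a method to deploy. It reuses the parsed symptom_onset and report_date columns from the timeliness code above and assumes the reporting-delay distribution is stable, which is exactly what the packages above do not have to assume.

import numpy as np
import pandas as pd

def simple_nowcast(df, onset_col='symptom_onset', report_col='report_date',
                   max_delay=14, as_of=None):
    """Inflate each day's observed onset count by 1 / P(reported by now),
    where the reporting-delay distribution is estimated from history."""
    d = df.dropna(subset=[onset_col, report_col]).copy()
    d['delay'] = (d[report_col] - d[onset_col]).dt.days
    d = d[d['delay'].between(0, max_delay)]
    if as_of is None:
        as_of = d[report_col].max()

    # Empirical cumulative reporting distribution P(delay <= t), gaps filled
    delay_cdf = (d['delay'].value_counts(normalize=True).sort_index().cumsum()
                   .reindex(range(max_delay + 1)).ffill().fillna(0))

    observed = d.groupby(onset_col).size()
    adjusted = {}
    for onset_date, count in observed.items():
        elapsed = min((as_of - onset_date).days, max_delay)
        frac_reported = delay_cdf.loc[elapsed]
        adjusted[onset_date] = count / frac_reported if frac_reported > 0 else np.nan
    return pd.Series(adjusted).sort_index()

nowcast_counts = simple_nowcast(df)
print(nowcast_counts.tail())  # the most recent onset dates get inflated the most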

5.3.4 4. Representativeness: Does Data Match Target Population?

Common biases:

# Check for demographic bias

# Compare to census data
population_age_dist = {
    '0-17': 0.22,
    '18-64': 0.62,
    '65+': 0.16
}

sample_age_dist = df['age_group'].value_counts(normalize=True).to_dict()

print("Age Distribution Comparison:")
print("Age Group | Population | Sample | Difference")
print("-" * 50)
for age_group in population_age_dist:
    pop_pct = population_age_dist[age_group] * 100
    sample_pct = sample_age_dist.get(age_group, 0) * 100
    diff = sample_pct - pop_pct
    print(f"{age_group:8} | {pop_pct:6.1f}%    | {sample_pct:6.1f}% | {diff:+6.1f}%")
Warning: When Representativeness Matters Most

Representativeness is critical when:

  • Making population-level inferences
  • Forecasting total burden
  • Evaluating interventions

Less critical when:

  • Early outbreak detection (any signal helps)
  • Comparing relative risks within the sample

5.4 Common Data Issues and How to Handle Them

5.4.1 Issue 1: Missing Data

When MCAR:

from sklearn.impute import SimpleImputer

# Median imputation for numerical variables
numerical_cols = df.select_dtypes(include=[np.number]).columns
imputer = SimpleImputer(strategy='median')
df[numerical_cols] = imputer.fit_transform(df[numerical_cols])

When MAR:

from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer

# Iterative imputation (like MICE)
imputer = IterativeImputer(max_iter=10, random_state=42)
df_imputed = pd.DataFrame(
    imputer.fit_transform(df[numerical_cols]),
    columns=numerical_cols
)

When MNAR:

# Create missingness indicator
df['income_missing'] = df['income'].isnull().astype(int)

# Impute with flag value
df['income_imputed'] = df['income'].fillna(-999)
Tip: Best Practices for Missing Data

  1. Never silently drop missing data
  2. Compare complete-case vs. imputed analysis (see the sketch after this list)
  3. Try multiple imputation
  4. Report missingness patterns
  5. Create missingness indicators when uncertain
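For point 2 above, a quick way to make the comparison concrete is to evaluate the same model under both strategies. This is only a sketch: the feature and outcome columns are placeholders, and accuracy stands in for whatever metric actually matters for your problem.

from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

features, target = ['age', 'income'], 'hospitalized'  # placeholder columns
labeled = df.dropna(subset=[target])

# Complete-case analysis: drop any row with a missing feature
cc = labeled.dropna(subset=features)
cc_score = cross_val_score(LogisticRegression(max_iter=1000),
                           cc[features], cc[target], cv=5).mean()

# Imputed analysis: impute inside the pipeline so cross-validation folds don't leak
pipe = make_pipeline(SimpleImputer(strategy='median'),
                     LogisticRegression(max_iter=1000))
imp_score = cross_val_score(pipe, labeled[features], labeled[target], cv=5).mean()

print(f"Complete-case CV accuracy: {cc_score:.3f}")
print(f"Imputed CV accuracy:       {imp_score:.3f}")

If the two numbers diverge sharply, the missingness mechanism is probably informative and worth investigating before trusting either estimate.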

5.4.2 Issue 2: Outliers and Extreme Values

Not all outliers are errors!

def detect_outliers_iqr(df, column, multiplier=1.5):
    """Detect outliers using IQR method"""
    Q1 = df[column].quantile(0.25)
    Q3 = df[column].quantile(0.75)
    IQR = Q3 - Q1

    lower_bound = Q1 - multiplier * IQR
    upper_bound = Q3 + multiplier * IQR

    outliers = df[(df[column] < lower_bound) | (df[column] > upper_bound)]
    return outliers, lower_bound, upper_bound

# Investigate before removing
age_outliers, lower, upper = detect_outliers_iqr(df, 'age', multiplier=3)
print(f"Age outliers: {len(age_outliers)}")
print(age_outliers[['age', 'outcome', 'hospital']].head())

Decision tree for outliers:

Is it biologically impossible?
├─ Yes → Error. Correct or remove.
└─ No → Is it a data entry error?
    ├─ Yes → Correct it
    └─ No → Does it represent important subpopulation?
        ├─ Yes → Keep it!
        └─ Uncertain → Sensitivity analysis
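For the “Uncertain” branch, a sensitivity analysis can be as simple as fitting the same model with and without the flagged records and comparing results. The sketch below reuses age_outliers from the detection code above; the feature and outcome columns are placeholders.

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

features, target = ['age'], 'hospitalized'  # placeholder columns
model = LogisticRegression(max_iter=1000)

full = df.dropna(subset=features + [target])
trimmed = full.drop(index=age_outliers.index, errors='ignore')

for label, data in [('with outliers', full), ('without outliers', trimmed)]:
    score = cross_val_score(model, data[features], data[target], cv=5).mean()
    print(f"{label}: CV accuracy = {score:.3f}")

If conclusions change materially between the two fits, report both rather than quietly choosing one.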
Important: The “Interesting Outlier” Trap

Real example: Early COVID-19 models removed patients with “impossibly fast” progression as outliers. Later discovered these were cytokine storm cases—rare but critical.

Lesson: Outliers often teach you the most. Be cautious about automatic removal.

5.4.3 Issue 3: Duplicate Records

# Detect duplicates
exact_dupes = df[df.duplicated(keep=False)]
print(f"Exact duplicates: {len(exact_dupes)}")

patient_dupes = df[df.duplicated(subset=['patient_id'], keep=False)]
print(f"Patients with multiple records: {patient_dupes['patient_id'].nunique()}")

# Handle duplicates
# Option 1: Keep first
df_deduped = df.drop_duplicates(subset=['patient_id'], keep='first')

# Option 2: Keep most recent
df_sorted = df.sort_values('report_date', ascending=False)
df_deduped = df_sorted.drop_duplicates(subset=['patient_id'], keep='first')

5.4.4 Issue 4: Inconsistent Coding

# Standardize categorical variables
sex_mapping = {
    'M': 'Male', 'Male': 'Male', 'MALE': 'Male', 'm': 'Male',
    'F': 'Female', 'Female': 'Female', 'FEMALE': 'Female', 'f': 'Female',
    'Unknown': 'Unknown', 'U': 'Unknown'
}
# Anything unmapped (typos, unexpected codes, NaN) becomes 'Unknown'
df['sex_clean'] = df['sex'].map(sex_mapping).fillna('Unknown')

# Standardize dates (format inference is automatic in modern pandas)
df['date_clean'] = pd.to_datetime(df['date'], errors='coerce')

5.5 Feature Engineering for Public Health

Domain expertise matters most, as Domingos emphasizes.

5.5.1 Time-Based Features

# Extract temporal features
df['date'] = pd.to_datetime(df['report_date'])

df['year'] = df['date'].dt.year
df['month'] = df['date'].dt.month
df['day_of_week'] = df['date'].dt.dayofweek
df['week_of_year'] = df['date'].dt.isocalendar().week
df['is_weekend'] = df['day_of_week'].isin([5, 6]).astype(int)

# Seasonal indicators
def assign_season(month):
    if month in [12, 1, 2]:
        return 'Winter'
    elif month in [3, 4, 5]:
        return 'Spring'
    elif month in [6, 7, 8]:
        return 'Summer'
    else:
        return 'Fall'

df['season'] = df['month'].apply(assign_season)

# Days since outbreak start
outbreak_start = df['date'].min()
df['days_since_start'] = (df['date'] - outbreak_start).dt.days

5.5.2 Rolling Statistics

# Sort by date first!
df = df.sort_values('date')

# 7-day rolling average
df['cases_7day_avg'] = df['daily_cases'].rolling(window=7, min_periods=1).mean()

# 14-day rolling sum
df['cases_14day_sum'] = df['daily_cases'].rolling(window=14, min_periods=1).sum()

# Growth rate
df['case_growth_rate'] = df['daily_cases'].pct_change(periods=7)

# Lag features
df['cases_lag_7'] = df['daily_cases'].shift(7)
df['cases_lag_14'] = df['daily_cases'].shift(14)

# Ratio features (avoid dividing by zero when last week's count was 0)
df['cases_this_week_vs_last'] = df['daily_cases'] / df['cases_lag_7'].replace(0, np.nan)

Warning: Avoiding Data Leakage

NEVER use future information, as warned in Kaufman et al.’s analysis:

# ❌ WRONG: Uses future data
df['cases_rolling'] = df['daily_cases'].rolling(window=7, center=True).mean()

# ✅ CORRECT: Only past data
df['cases_rolling'] = df['daily_cases'].rolling(window=7, min_periods=1).mean()

For time series splits:

# ❌ WRONG for time series
X_train, X_test = train_test_split(X, test_size=0.2, random_state=42)

# ✅ CORRECT for time series
split_date = df['date'].quantile(0.8)
train_df = df[df['date'] < split_date]
test_df = df[df['date'] >= split_date]

5.5.3 Epidemiological Features

# R0 estimation from growth rate
def estimate_R0(cases, generation_time=5):
    """Estimate R0 from exponential growth (linear approximation: R0 ≈ 1 + r * generation_time)"""
    from scipy.optimize import curve_fit

    days = np.arange(len(cases))

    def exponential(x, a, b):
        return a * np.exp(b * x)

    try:
        params, _ = curve_fit(exponential, days, cases, p0=[1, 0.1], maxfev=2000)
    except RuntimeError:
        return np.nan  # fit can fail on flat or all-zero windows
    growth_rate = params[1]
    return 1 + growth_rate * generation_time

df['R0_estimate'] = df['daily_cases'].rolling(window=14).apply(
    lambda x: estimate_R0(x.values) if len(x) > 7 else np.nan,
    raw=False
)

# Attack rate
population = 1000000
df['attack_rate'] = df['cumulative_cases'] / population

# Case fatality rate
df['CFR'] = df['cumulative_deaths'] / df['cumulative_cases']

# Test positivity rate
df['test_positivity'] = df['positive_tests'] / df['total_tests']

5.5.4 Geographic Features

# Distance to nearest healthcare facility
from sklearn.metrics.pairwise import haversine_distances

case_coords = df[['latitude', 'longitude']].values
# Assumes a separate `hospitals` DataFrame with 'lat' and 'lon' columns
hospital_coords = hospitals[['lat', 'lon']].values

# Convert to radians
case_coords_rad = np.radians(case_coords)
hospital_coords_rad = np.radians(hospital_coords)

# Calculate distances (km)
distances = haversine_distances(case_coords_rad, hospital_coords_rad) * 6371
df['nearest_hospital_km'] = distances.min(axis=1)

# Spatial clustering: count other cases within 5 km of each case
from sklearn.neighbors import NearestNeighbors

# With metric='haversine', coordinates and the radius are in radians,
# so convert 5 km to radians using the Earth's radius (6371 km)
nn = NearestNeighbors(radius=5.0 / 6371, metric='haversine')
nn.fit(case_coords_rad)

neighbor_distances, indices = nn.radius_neighbors(case_coords_rad)
df['cases_within_5km'] = [len(idx) - 1 for idx in indices]  # subtract 1 to exclude the case itself

5.6 Case Study: COVID-19 Forecasting Failures

The COVID-19 pandemic provided a natural experiment in real-time forecasting. Despite unprecedented data availability, most forecasts performed poorly.

What went wrong:

  1. Ignoring reporting delays
  2. Changing testing regime
    • Week 1: Only hospitalized patients
    • Week 8: Asymptomatic screening
    • Models confused testing expansion with disease growth
  3. Selection bias
    • Early models trained on Chinese data (strict lockdowns)
    • Applied to US/Europe (different compliance)
    • External validity failed
  4. Missing confounders

Lessons learned:

Important: Data Quality > Model Sophistication

Reich et al.’s analysis concluded:

“Forecast accuracy was limited more by data quality issues—reporting delays, changing surveillance, and selection bias—than by modeling approaches. Simple models with good data outperformed complex models with poor data.”

The pandemic exposed that public health data infrastructure, not algorithms, is the bottleneck.

5.7 When Data Problems Make AI Inappropriate

Red flags that should stop you:

  1. Extreme selection bias with no adjustment
    • Example: Training on hospitalized, deploying on general population
    • Why fatal: Model doesn’t generalize
  2. Outcome label quality is poor
    • Example: Inconsistent disease diagnosis criteria
    • Why fatal: Model learns noise, not signal
  3. Critical features >50% missing
    • Example: Symptom onset missing for most cases
    • Why fatal: Can’t build temporal features
  4. Data generating process changed mid-dataset
    • Example: Pre/post policy change, different testing
    • Why fatal: Violates stationarity assumption
  5. Sample size too small
    • Example: 100 cases, 50 features, trying deep learning
    • Why fatal: Severe overfitting (see the toy simulation after this list)
  6. External validity concerns
    • Example: One hospital’s data, deploying elsewhere
    • Why fatal: Context matters, model doesn’t transfer
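To see why red flag 5 is fatal, the toy simulation below uses pure noise (100 samples, 50 features, labels unrelated to anything), a hypothetical setup rather than real surveillance data, and still “learns” the training set almost perfectly:

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(42)
X = rng.normal(size=(100, 50))    # 100 cases, 50 uninformative features
y = rng.integers(0, 2, size=100)  # labels with no relationship to X

model = RandomForestClassifier(n_estimators=200, random_state=42)
model.fit(X, y)

print("Training accuracy:", model.score(X, y))  # typically close to 1.0
print("Cross-validated accuracy:",
      cross_val_score(model, X, y, cv=5).mean())  # hovers around 0.5 (chance)

With small samples and many features, in-sample performance tells you almost nothing; insist on honest out-of-sample evaluation.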
Important: The Courage to Say “No”

Good data science includes knowing when NOT to build a model.

If data quality is insufficient:

  1. Report limitations clearly
  2. Recommend improving data collection
  3. Suggest alternatives (traditional epi, qualitative research)

Building on garbage data:

  • Wastes resources
  • Produces misleading results
  • Can harm people
  • Damages trust in AI

As Box said: “All models are wrong, but some are useful.” If your data is wrong enough, your model will be useless.

5.8 Key Takeaways

  1. Data quality determines model quality. Algorithms can’t overcome flawed data.

  2. Understand missingness mechanisms. MCAR, MAR, MNAR require different strategies.

  3. Public health data has unique challenges: selection bias, reporting delays, changing surveillance.

  4. Investigate, don’t automatically fix. Outliers might be the most important cases.

  5. Feature engineering > algorithm choice. Domain expertise beats sophistication.

  6. Document everything. Future users need to know what you cleaned and why.

  7. Know when to stop. Sometimes the ethical choice is not to build a model.

  8. Temporal ordering matters. Never use future to predict past.

5.9 Practice Exercises

5.9.1 Exercise 1: Missing Data Mechanisms

  1. Identify variables with missing data
  2. Test if missingness is MCAR, MAR, or MNAR
  3. Choose imputation strategies
  4. Compare model performance

5.9.2 Exercise 2: Reporting Delay Adjustment

  1. Calculate reporting triangle
  2. Implement nowcasting
  3. Compare nowcast vs. raw counts
  4. Evaluate accuracy over time

5.9.3 Exercise 3: Feature Engineering

Create an outbreak predictor using:

  • Daily cases
  • Temperature
  • Rainfall
  • Population density

Engineer 10+ features. Which matter most?

5.9.4 Exercise 4: Data Quality Audit

Audit a “cleaned” dataset:

  1. Find 5+ quality issues
  2. Determine if each is fixable or fatal
  3. Write a quality report

Check Your Understanding

Test your knowledge of the key concepts from this chapter. Click “Show Answer” to reveal the correct response and explanation.

Question 1: Data Quality Foundations

Which dimension of data quality is violated when COVID-19 testing was initially limited to hospitalized patients, but models were deployed to predict community transmission?

  1. Completeness
  2. Accuracy
  3. Timeliness
  4. Representativeness

Answer: d) Representativeness

Explanation: This is a representativeness problem. The training data (hospitalized patients) doesn’t represent the target population (community cases). This selection bias means the model learns patterns from a severely ill subpopulation that don’t generalize to mild/asymptomatic cases in the community.

Question 2: Missing Data Mechanisms

You notice that income data is missing for 30% of younger respondents but only 5% of older respondents. What type of missingness is this?

  1. Missing Completely At Random (MCAR)
  2. Missing At Random (MAR)
  3. Missing Not At Random (MNAR)
  4. Systematic error

Answer: b) Missing At Random (MAR)

Explanation: This is MAR (Missing At Random) because the missingness depends on an observed variable (age), not on the missing value itself (income). You can impute income using age as a predictor since you know the missingness pattern.

Question 3: Temporal Data Leakage

True or False: When creating rolling averages for time series forecasting, using rolling(window=7, center=True) is appropriate because it creates smoother trends.

Answer: False

Explanation: Using center=True causes data leakage by incorporating future values into the calculation. For a value on day 10, it would include data from days 7-13, meaning you’re using days 11-13 (the future) to predict day 10. For forecasting, you must use center=False (the default) to only look backwards in time.

Question 4: Reporting Delays

A surveillance system shows cases declining on the most recent dates. What is the most likely explanation before concluding the outbreak is ending?

  1. Effective public health intervention
  2. Population-level immunity developing
  3. Right-censoring from reporting delays
  4. Seasonal weather changes

Answer: c) Right-censoring from reporting delays

Explanation: Right-censoring from reporting delays is the most common explanation for apparent declines in recent data. Cases from recent days haven’t fully been reported yet, creating an artificial downward trend. Always check reporting lag distributions before interpreting recent trends. This is why nowcasting methods exist.

Question 5: Feature Engineering

Which of these features would be MOST useful for predicting dengue outbreaks and demonstrates good domain expertise?

  1. Day of the week
  2. Raw daily temperature
  3. 14-day cumulative rainfall
  4. Month number (1-12)

Answer: c) 14-day cumulative rainfall

Explanation: 14-day cumulative rainfall shows domain expertise because dengue mosquitoes (Aedes) breed in standing water that accumulates over time. Raw daily temperature or simple month number don’t capture the mechanism. Day of the week is irrelevant for disease biology. This demonstrates how epidemiological understanding drives effective feature engineering.

Question 6: Outliers

You find 5 patients with “impossibly fast” disease progression (symptom onset to death in <24 hours). What should you do?

  1. Remove them as data entry errors
  2. Remove them to avoid overfitting on outliers
  3. Investigate them carefully—they might be important rare cases
  4. Set their values to the median progression time

Answer: c) Investigate them carefully—they might be important rare cases

Explanation: Investigate carefully before removing. These could be:

  • Real, rare, but critical cases (e.g., cytokine storm, overwhelming infection)
  • An important subpopulation with different risk factors
  • Data entry errors (date fields swapped)

The COVID-19 pandemic taught us that “outliers” like rapidly progressing cases were often the most clinically important to understand. Only remove after investigation confirms they’re truly errors.

5.10 Discussion Questions

  1. Your training data is 80% from one hospital. How does this affect generalizability? What can you do?

  2. Model has 95% accuracy on training, 65% on test, with 30% missing labels. How do you interpret this?

  3. Colleague wants to impute symptom onset as report date minus median delay. Pros/cons? When appropriate?

  4. Forecasting COVID-19 hospitalizations: use raw counts or adjust for test positivity? Justify.

  5. When is it acceptable to remove outliers from public health data? Examples of appropriate vs. inappropriate removal.

5.11 Further Reading

Topics for further reading: data quality frameworks, missing data methods, surveillance challenges, COVID-19 forecasting lessons, and data leakage (see the works cited inline throughout this chapter).


You now understand why data quality is the foundation of successful AI in public health. Clean data and thoughtful feature engineering matter more than fancy algorithms.


Tip: Part I Summary: What You Should Now Know

Congratulations! You’ve completed Part I: Foundations. Before moving to applications, ensure you can confidently:

From Chapter 1 (History & Context)

  • Explain how AI has evolved in public health from expert systems to modern machine learning
  • Identify when AI adds value vs. when traditional epidemiology is sufficient
  • Recognize patterns of AI hype and separate genuine capabilities from marketing
  • Understand the unique ethical considerations of AI in population health

From Chapter 2 (AI Basics)

  • Distinguish between supervised, unsupervised, and reinforcement learning paradigms
  • Choose appropriate algorithms for different problem types (logistic regression → Random Forests → gradient boosting → deep learning)
  • Interpret evaluation metrics (accuracy, sensitivity, specificity, ROC-AUC) and know when each matters
  • Identify overfitting and data leakage issues
  • Understand why most public health tabular data works best with tree-based methods, not deep learning
  • Explain predictions using feature importance and SHAP values

From Chapter 3 (Data Quality)

  • Assess data quality across four dimensions: completeness, accuracy, timeliness, representativeness
  • Identify missing data mechanisms (MCAR, MAR, MNAR) and choose appropriate imputation strategies
  • Handle reporting delays and right-censoring in surveillance data
  • Engineer meaningful features from public health data using domain expertise
  • Recognize when data problems make AI inappropriate (selection bias, insufficient sample size, changing data-generating processes)
  • Avoid temporal data leakage when building forecasting models

Core Skills Checklist

Can you:

  [ ] Read and modify basic ML code in Python (scikit-learn, pandas)?
  [ ] Run a complete ML pipeline from data loading → preprocessing → training → evaluation?
  [ ] Interpret confusion matrices and choose metrics appropriate for your use case?
  [ ] Detect and handle missing data without blindly dropping rows?
  [ ] Create time-based features without leaking future information?
  [ ] Explain why a simple model on clean data beats a complex model on dirty data?
  [ ] Identify when NOT to build an AI model due to data quality issues?

What’s Next

Part II: Applications applies these foundations to real public health problems:

  • Disease surveillance and outbreak detection
  • Epidemic forecasting
  • Genomic surveillance
  • Large language models in public health
  • Clinical decision support

You now have the conceptual foundation and technical skills to understand how AI works in practice. The next chapters show you specific applications where AI creates value—and where it falls short.

If you’re unsure about any foundation concept, review the relevant chapter before proceeding. The applications build directly on these fundamentals.


Next chapter: Chapter 4: Disease Surveillance and Outbreak Detection