The Big Picture: Most AI tutorials use clean, curated datasets. Real public health data is incomplete, delayed, biased, inconsistent, and heterogeneous. Learning to handle messy data matters more than learning fancy algorithms. Garbage in, garbage out: the most sophisticated deep learning model cannot overcome fundamentally flawed data.
Why Public Health Data Is Uniquely Challenging:
- The Surveillance Pyramid Problem: You see lab-confirmed cases (tip of iceberg) but want to predict all infections (entire iceberg). Training on hospitalized patients, deploying on community cases = selection bias by design
- Reporting Delays: “Today’s” data is actually a mixture of cases from 2-10 days ago with unknown proportions → right-censoring systematically underestimates recent trends
- Changing Definitions: COVID-19 case definitions changed multiple times. Testing availability evolved. Models learn spurious patterns from surveillance artifacts, not real epidemiology
- The Denominator Problem: 1,000 cases with 50 deaths (5% CFR) vs. 500 cases with 100 deaths (20% CFR)—which is truly more dangerous, or which just has less testing?
- Privacy Constraints: Aggregation, suppression (<5 cases), coarsening (age brackets), delayed release → loss of granularity reduces predictive power
The Four Dimensions of Data Quality (Assess Before Building):
Completeness: How much is missing? <5% = minimal, 5-20% = moderate, 20-50% = substantial, >50% = likely unusable
- MCAR (Missing Completely At Random): Safe for simple imputation
- MAR (Missing At Random): Depends on observed variables, can impute using related features
- MNAR (Missing Not At Random): Missingness itself is informative (sickest patients too ill to complete surveys)
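The distinction between missingness mechanisms can be made concrete in code. Below is a minimal sketch on fabricated data (all column names and values are hypothetical): it quantifies missingness, checks whether missingness correlates with an observed variable (a MAR signal), and then imputes group-wise medians using that related feature.

```python
import numpy as np
import pandas as pd

# Hypothetical line list: symptom_score is missing more often for older
# patients (a MAR pattern, since age is observed)
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "age": rng.integers(20, 90, size=200),
    "symptom_score": rng.normal(5, 2, size=200),
})
mask = (df["age"] > 65) & (rng.random(200) < 0.4)
df.loc[mask, "symptom_score"] = np.nan

# Step 1: quantify missingness per column (maps to the <5% / 5-20% / ... bands)
missing_pct = df.isna().mean() * 100
print(missing_pct)

# Step 2: does missingness depend on an observed variable? If the mean age of
# rows with missing scores differs sharply, simple mean imputation will bias results
print(df.assign(is_missing=df["symptom_score"].isna())
        .groupby("is_missing")["age"].mean())

# Step 3: under MAR, impute within strata of the related feature
df["age_band"] = pd.cut(df["age"], bins=[0, 40, 65, 120])
df["symptom_imputed"] = df.groupby("age_band", observed=True)["symptom_score"] \
    .transform(lambda s: s.fillna(s.median()))
```

Under MNAR, no imputation based on observed columns fixes the problem; the honest move is to model or at least report the missingness itself.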
Accuracy: Are values correct? Check for impossible values (age >120), implausible values (BMI <10), unit mixing (temperature in Fahrenheit vs. Celsius), data entry errors (decimal points), age heaping
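These accuracy checks translate directly into range rules. A sketch on made-up rows (thresholds and column names are illustrative, not standards): flag suspect values for audit rather than silently dropping them, and convert suspected unit errors where the fix is unambiguous.

```python
import pandas as pd

# Hypothetical line list with typical entry errors
df = pd.DataFrame({
    "age":  [34, 150, 61, 2, 45],             # 150 is impossible
    "bmi":  [22.5, 8.1, 27.0, 17.3, 240.0],   # 8.1 implausible; 240 likely a decimal slip
    "temp": [37.1, 98.6, 36.8, 38.2, 37.5],   # 98.6 looks like Fahrenheit, not Celsius
})

# Flag rather than drop: keep an audit trail of suspect rows
flags = pd.DataFrame({
    "age_impossible":    ~df["age"].between(0, 120),
    "bmi_implausible":   ~df["bmi"].between(10, 80),
    "temp_unit_suspect":  df["temp"] > 45,  # Celsius body temps never reach 45
})
print(df[flags.any(axis=1)])

# Convert suspected Fahrenheit readings instead of discarding the record
df.loc[flags["temp_unit_suspect"], "temp"] = (
    (df.loc[flags["temp_unit_suspect"], "temp"] - 32) * 5 / 9
)
```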
Timeliness: Reporting lags create temporal misalignment. Nowcasting methods adjust for delays
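One simple nowcasting idea is to inflate recent counts by the fraction of cases expected to have been reported so far. The sketch below uses invented numbers; the delay distribution would in practice be estimated from historical data that is known to be complete.

```python
import numpy as np

# Hypothetical cumulative reporting probability: P(reported within d days), d = 0..4
cum_report_prob = np.array([0.2, 0.5, 0.75, 0.9, 1.0])

# Counts by event date, newest last; the apparent recent decline is an artifact
reported = np.array([120, 130, 125, 90, 40])
days_ago = np.arange(len(reported) - 1, -1, -1)          # 4, 3, 2, 1, 0
completeness = cum_report_prob[np.minimum(days_ago, len(cum_report_prob) - 1)]

# Inflate each day's count by its expected completeness
nowcast = reported / completeness
print(nowcast)  # the "decline" in raw counts reverses after delay adjustment
```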
Representativeness: Does your sample match the target population? Compare demographics to census data. Critical for population-level inferences, less critical for outbreak detection
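Comparing sample demographics to census benchmarks can be as simple as a proportion table; one crude correction is post-stratification reweighting. All proportions below are invented for illustration.

```python
import pandas as pd

# Hypothetical age distribution of the sample vs. census benchmarks
sample = pd.Series({"0-17": 0.10, "18-44": 0.55, "45-64": 0.25, "65+": 0.10})
census = pd.Series({"0-17": 0.22, "18-44": 0.36, "45-64": 0.25, "65+": 0.17})

gap = sample - census
print(gap.round(2))  # positive gap = group over-represented in the sample

# Crude post-stratification weights pulling the sample toward census proportions
weights = census / sample
print(weights.round(2))
```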
The COVID-19 Forecasting Disaster (Real-World Lesson):
Despite unprecedented data, most forecasts failed spectacularly. Why?
- Ignored reporting delays (treated “reported March 15” as “occurred March 15”)
- Confused testing expansion with disease growth
- Selection bias (trained on strict-lockdown data, applied to periods with different compliance)
- Missing confounders (behavioral responses, policy changes)
- Result: median error of 25-50% for 1-2 week forecasts
The Critical Lesson: Forecast accuracy was limited more by data quality issues than modeling approaches. Simple models with good data outperformed complex models with poor data.
Feature Engineering (Where Domain Expertise Wins):
- Time-based: Day of week, season, days since outbreak start, rolling averages (7-day, 14-day)
- Epidemiological: R0 estimation, attack rate, case fatality rate, test positivity
- Geographic: Distance to healthcare, spatial clustering, cases within 5km
- Lag features: Cases 7 days ago, 14 days ago (but never use future information!)
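The feature families above can be sketched on a toy daily surveillance table (dates and counts are fabricated). Note the rolling average is trailing-only and lags use `shift()`, so no row sees future information.

```python
import pandas as pd

# Hypothetical daily surveillance table, ordered by date
df = pd.DataFrame({
    "date":   pd.date_range("2024-01-01", periods=10),
    "cases":  [5, 8, 12, 20, 31, 45, 60, 82, 110, 150],
    "tests":  [100, 110, 130, 160, 200, 240, 280, 330, 400, 480],
    "deaths": [0, 0, 0, 1, 1, 2, 2, 3, 4, 5],
})

# Time-based features
df["dow"] = df["date"].dt.dayofweek
df["cases_7d_avg"] = df["cases"].rolling(7, min_periods=1).mean()  # trailing window only

# Epidemiological features
df["test_positivity"] = df["cases"] / df["tests"]
df["cfr_crude"] = df["deaths"].cumsum() / df["cases"].cumsum()  # crude, delay-unadjusted

# Lag features: strictly past information
df["cases_lag7"] = df["cases"].shift(7)
```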
Critical Pitfall—Temporal Data Leakage:
❌ WRONG: df['cases_rolling'] = df['daily_cases'].rolling(window=7, center=True).mean()
✅ CORRECT: df['cases_rolling'] = df['daily_cases'].rolling(window=7).mean()
`center=True` includes future observations in each window—information the model would not have at prediction time. For time series, always split train/test by date, never by random shuffle.
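A date-based split can be sketched as follows (synthetic series; the 80/20 cutoff is an arbitrary illustration):

```python
import pandas as pd

# Hypothetical daily series
df = pd.DataFrame({
    "date": pd.date_range("2024-01-01", periods=100),
    "daily_cases": range(100),
})
df["cases_rolling"] = df["daily_cases"].rolling(window=7).mean()  # trailing window

# Hold out the most recent 20% of dates; never random-shuffle time series
cutoff = df["date"].iloc[int(len(df) * 0.8)]
train = df[df["date"] < cutoff]
test = df[df["date"] >= cutoff]
assert train["date"].max() < test["date"].min()  # no temporal overlap
```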
When Data Problems Make AI Inappropriate (Know When to Stop):
Red flags that should halt the project:
1. Extreme selection bias with no adjustment method
2. Outcome labels are inconsistent or unreliable
3. Critical features >50% missing
4. Data-generating process changed mid-dataset (pre/post policy change)
5. Sample size too small for model complexity (100 cases, 50 features, deep learning = disaster)
6. External validity concerns (one hospital’s data, deploying elsewhere)
The Courage to Say “No”: Good data science includes knowing when NOT to build a model. If data quality is insufficient, recommend improving data collection instead. Building on garbage data wastes resources, produces misleading results, can harm people, and damages trust in AI.
The Takeaway for Public Health Practitioners: Data quality determines model quality. Investigate outliers before removing them—they might be the most important cases (COVID-19 cytokine storm). Document everything—what you cleaned and why. Feature engineering with domain expertise beats algorithm sophistication. Know when to stop—sometimes the ethical choice is not to build a model. As Box said: “All models are wrong, but some are useful.” If your data is wrong enough, your model will be useless.