Hide code
graph LR
S[S: Susceptible] -->|β: Transmission rate| I[I: Infected]
I -->|γ: Recovery rate| R[R: Recovered/Removed]
style S fill:#e1f5dd
style I fill:#ffe1e1
style R fill:#e1e8ffThis chapter distinguishes outbreak detection from epidemic forecasting. You will learn to:
Prerequisites: Disease Surveillance and Outbreak Detection recommended, Just Enough AI to Be Dangerous, The Data Problem.
March 2020, Washington, D.C.: The White House Coronavirus Task Force presents epidemic models to the President. The projections are sobering: without mitigation, 1.5-2.2 million Americans could die. With social distancing, 100,000-240,000 deaths.
The nation locked down. Schools closed. Businesses shuttered.
January 2023: The U.S. passes 1.1 million COVID-19 deaths—far exceeding even the “with mitigation” scenario.
What went wrong? Were the models bad? Was the data flawed? Or is epidemic forecasting simply impossible?
October 2021, Global: Multiple forecasting models completely missed the magnitude of the Omicron wave. Most predicted a moderate increase in cases. Reality: a vertical spike that shattered all previous records.
January 2022: Models that worked well for Delta suddenly failed for Omicron. Why? The virus evolved, immune landscapes changed, behavior shifted, and the fundamental assumptions of every model broke.
These aren’t just modeling failures—they’re lessons about the fundamental limits of prediction in complex adaptive systems.
Surveillance (Chapter 4): “An outbreak is happening NOW” - Based on current/recent data - Relatively objective - Can be verified immediately
Forecasting (this chapter): “Here’s what will happen in 2 weeks” - Predicting the future - Depends on assumptions - Can’t be verified until the future arrives
The difference matters enormously:
For resource allocation: - How many hospital beds will we need? - Should we order more ventilators? - Do we need to open field hospitals?
For policy decisions: - Should we implement lockdowns? - When can schools safely reopen? - Should we mandate masks?
For public communication: - Should people cancel travel plans? - Is it safe to visit elderly relatives? - Do we need to stock up on supplies?
Get surveillance wrong → Miss an outbreak
Get forecasting wrong → Waste billions, lose public trust, or fail to prevent catastrophe
In chaos theory, the butterfly effect describes how small changes in initial conditions lead to vastly different outcomes. Epidemics exhibit this behavior in extreme form:
Small changes in behavior → Massive changes in outcomes
R₀ = 2.5 (no interventions)
→ Peak: 500,000 cases/day
R₀ = 2.3 (modest social distancing)
→ Peak: 300,000 cases/day
R₀ = 2.0 (strong distancing)
→ Peak: 100,000 cases/day
R₀ = 0.9 (lockdown)
→ No peak, epidemic dies out
A 10% reduction in transmission can mean the difference between epidemic and no epidemic. Forecasting requires getting this exactly right.
Unlike weather forecasting, epidemic forecasts influence the thing being forecasted.
Scenario 1: Self-defeating prophecy - Model predicts catastrophic outbreak - Public panics, implements strict measures - Outbreak is mild - Conclusion: “Model was wrong! Overreaction!”
Scenario 2: Self-fulfilling prophecy - Model predicts mild outbreak - Public relaxes precautions - Outbreak explodes - Conclusion: “Model was wrong! Should have prepared!”
This creates an impossible situation: Success makes you look wrong.
In October 1987, BBC weatherman Michael Fish famously dismissed viewer concerns about an approaching hurricane: “Earlier on today, apparently, a woman rang the BBC and said she heard there was a hurricane on the way. Well, if you’re watching, don’t worry, there isn’t.”
That night, the Great Storm of 1987 hit the UK—the worst storm in 300 years. Eighteen people died. Millions of trees fell. The public never forgot.
Lesson for epidemic forecasting: Reassuring forecasts that turn out wrong destroy trust permanently. Conservative (alarming) forecasts that don’t materialize also destroy trust, but more slowly.
The safest strategy: Communicate uncertainty honestly. Never publish a single-number forecast without ranges.
Forecast accuracy degrades rapidly with time:
| Forecast Horizon | Typical Accuracy | Usefulness |
|---|---|---|
| 1 week ahead | ±10-20% | High - operational planning |
| 2 weeks ahead | ±20-40% | Moderate - resource allocation |
| 4 weeks ahead | ±40-100% | Low - scenario planning only |
| 3+ months ahead | Often worse than naive baseline | Essentially guesswork |
Why does accuracy degrade? - Interventions you can’t predict (lockdowns, vaccine campaigns) - Behavior changes (people react to case counts) - Viral evolution (new variants) - Reporting changes (testing policy shifts) - Weather and seasonality effects - Random superspreading events
The 2-week wall: Most successful forecasting systems focus on 1-2 week horizons. Anything beyond that is scenario planning, not prediction.
The COVID-19 Forecast Hub, coordinated by the Reich Lab at UMass Amherst, collected forecasts from 50+ teams starting in March 2020.
Key findings from Cramer et al., 2022, PNAS (Cramer et al. 2022):
We’ll examine this in detail later in the chapter.
Before machine learning existed, epidemiologists developed mechanistic models—mathematical frameworks based on how diseases actually spread.
These models remain essential today. Why? They’re interpretable, biologically grounded, and enable scenario planning.
The SIR model, introduced by Kermack and McKendrick in 1927, divides a population into three compartments:
The flow:
S → I → R
graph LR
S[S: Susceptible] -->|β: Transmission rate| I[I: Infected]
I -->|γ: Recovery rate| R[R: Recovered/Removed]
style S fill:#e1f5dd
style I fill:#ffe1e1
style R fill:#e1e8ffGoverned by two parameters: - β (beta): Transmission rate (contacts per day × probability of transmission) - γ (gamma): Recovery rate (1/infectious period)
The basic reproduction number: \[R_0 = \frac{\beta}{\gamma}\]
\(R_0\) tells you: How many people does one infected person infect in a fully susceptible population?
The mathematics:
\[\frac{dS}{dt} = -\beta \frac{S \cdot I}{N}\]
\[\frac{dI}{dt} = \beta \frac{S \cdot I}{N} - \gamma I\]
\[\frac{dR}{dt} = \gamma I\]
Where \(N = S + I + R\) (total population).
Let’s implement this:
import numpy as np
from scipy.integrate import odeint
import matplotlib.pyplot as plt
def sir_model(y, t, beta, gamma, N):
"""
Basic SIR model differential equations
Parameters:
- y: [S, I, R] current state
- t: time
- beta: transmission rate
- gamma: recovery rate
- N: total population
Returns:
- [dS/dt, dI/dt, dR/dt]
"""
S, I, R = y
dS_dt = -beta * S * I / N
dI_dt = beta * S * I / N - gamma * I
dR_dt = gamma * I
return dS_dt, dI_dt, dR_dt
# Parameters for a measles-like disease
N = 1_000_000 # Total population (e.g., a city)
I0 = 100 # Initial infected
R0 = 0 # Initial recovered (no prior immunity)
S0 = N - I0 - R0 # Initial susceptible
beta = 0.5 # Transmission rate (contacts per day × transmission probability)
gamma = 0.1 # Recovery rate (1/10 day infectious period)
# Time vector (days)
t = np.linspace(0, 200, 200)
# Initial conditions
y0 = S0, I0, R0
# Solve ODE system
solution = odeint(sir_model, y0, t, args=(beta, gamma, N))
S, I, R = solution.T
# Calculate R₀
R0_value = beta / gamma
print(f"Basic Reproduction Number (R₀): {R0_value:.2f}")
# Visualize
fig, axes = plt.subplots(1, 2, figsize=(16, 6))
# Left panel: Epidemic curves
axes[0].plot(t, S, label='Susceptible', color='blue', linewidth=2)
axes[0].plot(t, I, label='Infected', color='red', linewidth=3)
axes[0].plot(t, R, label='Recovered', color='green', linewidth=2)
# Mark peak
peak_idx = np.argmax(I)
axes[0].axvline(t[peak_idx], color='red', linestyle='--', alpha=0.5,
label=f'Peak: Day {t[peak_idx]:.0f}')
axes[0].axhline(I[peak_idx], color='red', linestyle='--', alpha=0.5)
axes[0].set_xlabel('Days', fontsize=12)
axes[0].set_ylabel('Population', fontsize=12)
axes[0].set_title(f'SIR Model Epidemic Curve (R₀={R0_value:.1f})', fontsize=14)
axes[0].legend(fontsize=11)
axes[0].grid(True, alpha=0.3)
# Right panel: Phase portrait (S vs I)
axes[1].plot(S, I, 'k-', linewidth=2)
axes[1].scatter([S0], [I0], color='red', s=200, zorder=5,
marker='*', label='Start', edgecolors='black', linewidth=2)
axes[1].scatter([S[-1]], [I[-1]], color='green', s=200, zorder=5,
marker='*', label='End', edgecolors='black', linewidth=2)
# Herd immunity threshold line
herd_immunity_threshold = N * (1 - 1/R0_value)
axes[1].axvline(herd_immunity_threshold, color='orange', linestyle='--',
linewidth=2, label=f'Herd Immunity: {herd_immunity_threshold/N*100:.0f}%')
axes[1].set_xlabel('Susceptible Population', fontsize=12)
axes[1].set_ylabel('Infected Population', fontsize=12)
axes[1].set_title('Phase Portrait: S vs I', fontsize=14)
axes[1].legend(fontsize=11)
axes[1].grid(True, alpha=0.3)
plt.tight_layout()
plt.savefig('sir_basic_model.png', dpi=300)
plt.show()
# Calculate key metrics
peak_infected_idx = np.argmax(I)
peak_day = t[peak_infected_idx]
peak_cases = I[peak_infected_idx]
final_attack_rate = (N - S[-1]) / N * 100
herd_immunity = (1 - 1/R0_value) * 100
print("\n" + "="*60)
print("SIR MODEL RESULTS")
print("="*60)
print(f"\nR₀ (Basic Reproduction Number): {R0_value:.2f}")
print(f"Herd Immunity Threshold: {herd_immunity:.1f}% of population")
print(f"\nPeak Infections: {peak_cases:,.0f} people")
print(f"Peak Day: {peak_day:.0f}")
print(f"\nFinal Attack Rate: {final_attack_rate:.1f}% of population infected")
print(f"Final Susceptible: {S[-1]:,.0f} people ({S[-1]/N*100:.1f}%)")
print(f"Total Infected: {R[-1]:,.0f} people")
# Demonstrate impact of interventions
print("\n" + "="*60)
print("SCENARIO ANALYSIS: Impact of Interventions")
print("="*60)
scenarios = {
'No intervention (R₀=2.5)': 2.5,
'Masks + distancing (R₀=2.0)': 2.0,
'Partial lockdown (R₀=1.5)': 1.5,
'Full lockdown (R₀=0.9)': 0.9
}
for scenario_name, R0_scenario in scenarios.items():
beta_scenario = R0_scenario * gamma
solution_scenario = odeint(sir_model, y0, t, args=(beta_scenario, gamma, N))
S_s, I_s, R_s = solution_scenario.T
peak_infections = I_s.max()
total_infected = R_s[-1]
attack_rate = total_infected / N * 100
print(f"\n{scenario_name}")
print(f" Peak infections: {peak_infections:,.0f}")
print(f" Total infected: {total_infected:,.0f} ({attack_rate:.1f}%)")R₀ is the most important number in epidemiology, but it’s often misunderstood.
What it is: - Number of secondary infections caused by one infected person in a fully susceptible population - Determines whether an epidemic grows or dies out - Threshold for herd immunity: \(1 - 1/R_0\)
What it is NOT: - Not the number of people infected per day (that’s incidence) - Not constant over time (behavior, interventions, immunity all change it) - Not inherent to a pathogen alone (depends on population behavior and contact patterns)
Examples: - Measles: \(R_0 \approx\) 12-18 (extremely contagious) - Influenza: \(R_0 \approx\) 1.3-1.8 (moderately contagious) - Original SARS-CoV-2: \(R_0 \approx\) 2-3 - Delta variant: \(R_0 \approx\) 5-8 - Omicron: \(R_0 \approx\) 10-15
For comprehensive review, see Delamater et al., 2019, Emerging Infectious Diseases.
The SIR model assumes people become infectious immediately after infection. For most diseases, this is wrong—there’s an incubation period.
SEIR adds the Exposed compartment:
S → E → I → R
graph LR
S[S: Susceptible] -->|β: Transmission rate| E[E: Exposed]
E -->|σ: Incubation rate| I[I: Infected]
I -->|γ: Recovery rate| R[R: Recovered/Removed]
style S fill:#e1f5dd
style E fill:#fff4e1
style I fill:#ffe1e1
style R fill:#e1e8ffNew parameter: - σ (sigma): Rate of progression from E to I (1/incubation period)
The mathematics:
\[\frac{dS}{dt} = -\beta \frac{S \cdot I}{N}\]
\[\frac{dE}{dt} = \beta \frac{S \cdot I}{N} - \sigma E\]
\[\frac{dI}{dt} = \sigma E - \gamma I\]
\[\frac{dR}{dt} = \gamma I\]
Why this matters:
For diseases like COVID-19 with long incubation periods (5-6 days), the SEIR model produces more realistic epidemic curves with delayed and broader peaks.
def seir_model(y, t, beta, sigma, gamma, N):
"""
SEIR model with exposed/latent period
Parameters:
- y: [S, E, I, R] current state
- sigma: rate of progression from E to I (1/incubation period)
"""
S, E, I, R = y
dS_dt = -beta * S * I / N
dE_dt = beta * S * I / N - sigma * E
dI_dt = sigma * E - gamma * I
dR_dt = gamma * I
return dS_dt, dE_dt, dI_dt, dR_dt
# Parameters for COVID-19-like disease
sigma = 1/5.5 # 5.5-day incubation period
gamma = 1/10 # 10-day infectious period
beta = 0.5
E0 = 50 # Initial exposed
I0 = 10 # Initial infected
R0 = 0
S0 = N - E0 - I0 - R0
# Solve SEIR
y0_seir = S0, E0, I0, R0
solution_seir = odeint(seir_model, y0, t, args=(beta, sigma, gamma, N))
S_seir, E_seir, I_seir, R_seir = solution_seir.T
# Solve SIR for comparison (combine E and I into I)
y0_sir = S0 + E0, I0, R0
solution_sir = odeint(sir_model, y0_sir, t, args=(beta, gamma, N))
S_sir, I_sir, R_sir = solution_sir.T
# Compare
fig, axes = plt.subplots(2, 1, figsize=(14, 10))
# Top panel: SEIR compartments
axes[0].plot(t, S_seir, label='Susceptible', color='blue', linewidth=2)
axes[0].plot(t, E_seir, label='Exposed (not yet infectious)', color='orange', linewidth=2)
axes[0].plot(t, I_seir, label='Infected (infectious)', color='red', linewidth=3)
axes[0].plot(t, R_seir, label='Recovered', color='green', linewidth=2)
axes[0].set_ylabel('Population', fontsize=12)
axes[0].set_title('SEIR Model: Four Compartments', fontsize=14)
axes[0].legend(fontsize=11)
axes[0].grid(True, alpha=0.3)
# Bottom panel: Compare infectious populations (SIR vs SEIR)
axes[1].plot(t, I_sir, '--', label='SIR: Infected (no latency)',
color='pink', linewidth=3, alpha=0.7)
axes[1].plot(t, I_seir, '-', label='SEIR: Infected (with latent period)',
color='red', linewidth=3)
axes[1].plot(t, E_seir, '-', label='SEIR: Exposed (not infectious)',
color='orange', linewidth=2, alpha=0.7)
# Mark peaks
sir_peak_day = t[np.argmax(I_sir)]
seir_peak_day = t[np.argmax(I_seir)]
delay = seir_peak_day - sir_peak_day
axes[1].axvline(sir_peak_day, color='pink', linestyle=':', alpha=0.5)
axes[1].axvline(seir_peak_day, color='red', linestyle=':', alpha=0.5)
axes[1].annotate(f'Peak delay: {delay:.0f} days',
xy=(seir_peak_day, I_seir.max()),
xytext=(seir_peak_day + 20, I_seir.max() * 0.8),
arrowprops=dict(arrowstyle='->', color='black', lw=2),
fontsize=12, bbox=dict(boxstyle='round', facecolor='wheat'))
axes[1].set_xlabel('Days', fontsize=12)
axes[1].set_ylabel('Population', fontsize=12)
axes[1].set_title('SIR vs SEIR: Impact of Latent Period', fontsize=14)
axes[1].legend(fontsize=11)
axes[1].grid(True, alpha=0.3)
plt.tight_layout()
plt.savefig('seir_vs_sir.png', dpi=300)
plt.show()
print(f"\nPeak timing difference: {delay:.1f} days")
print(f"SIR peak: {I_sir.max():,.0f} people on day {sir_peak_day:.0f}")
print(f"SEIR peak: {I_seir.max():,.0f} people on day {seir_peak_day:.0f}")
print(f"Peak reduction: {(1 - I_seir.max()/I_sir.max())*100:.1f}%")The E compartment: 1. Delays the epidemic peak (by ~1 incubation period) 2. Reduces peak height slightly (infections spread out more) 3. Makes interventions less effective if they only target symptomatic individuals (E people are spreading it without symptoms) 4. Creates a “pre-symptomatic transmission” problem (critical for COVID-19)
For diseases with long incubation periods (COVID-19, Ebola), SEIR is essential. For diseases with short/negligible incubation (norovirus), SIR suffices.
Real epidemic modeling often requires more complexity:
\[S \to E \to I \to \begin{cases} R \text{ (recovered)} \\ D \text{ (dead)} \end{cases}\]
\[S \to E \to I \to R \to S\]
Different parameters for different age groups:
# Children (0-17)
beta_children = 0.3
gamma_children = 0.15
# Adults (18-64)
beta_adults = 0.5
gamma_adults = 0.1
# Elderly (65+)
beta_elderly = 0.4
gamma_elderly = 0.08
cfr_elderly = 0.15 # Higher case fatality rateCritical for COVID-19 where age dramatically affects outcomes.
Divide population into geographic regions with mobility between them:
# Movement between regions
mobility_matrix = [
[0.95, 0.03, 0.02], # Region 1: 95% stay, 3% to R2, 2% to R3
[0.02, 0.96, 0.02], # Region 2
[0.01, 0.02, 0.97] # Region 3
]For more on spatial models, see Keeling & Rohani, 2008, Modeling Infectious Diseases.
Strengths:
✅ Interpretable: Every parameter has biological meaning (transmission rate, recovery rate, etc.)
✅ Scenario planning: Can simulate interventions (vaccination campaigns, social distancing, school closures)
✅ Theory-grounded: Built on century of epidemiological knowledge
✅ Extrapolation: Can predict novel scenarios (new pathogen, untested interventions)
✅ Policy-relevant: Policymakers understand “if R₀ drops to X, we get Y cases”
Limitations:
❌ Simplistic assumptions: Homogeneous mixing (everyone contacts everyone equally—clearly false)
❌ Constant parameters: \(\beta\) and \(\gamma\) change with behavior, seasons, immunity
❌ No heterogeneity: Superspreaders, age structure, spatial clustering all ignored in basic models
❌ Parameter uncertainty: Don’t know true \(\beta\), \(\gamma\) until epidemic is over
❌ Forecast degradation: Accurate for 1-2 weeks, unreliable beyond that
Mechanistic models are: - Transparent and interpretable - Based on biological mechanisms - But require assumptions that are often violated in reality
Machine learning models (next section) are: - Flexible and data-driven - Don’t require mechanistic assumptions - But opaque and hard to extrapolate beyond training data
Neither alone is sufficient. The best forecasting systems use both.
What if we don’t want to specify transmission rates and recovery rates? What if we just want to learn patterns from data?
Enter machine learning forecasting.
ARIMA (AutoRegressive Integrated Moving Average) is a classical statistical method for time series forecasting.
Why ARIMA for epidemics: - Captures temporal autocorrelation (yesterday’s cases predict today’s) - Handles seasonality (flu peaks in winter, dengue in rainy season) - No mechanistic assumptions required - Works with limited historical data
ARIMA components:
AR (AutoRegressive): Use past values to predict future \[y_t = c + \phi_1 y_{t-1} + \phi_2 y_{t-2} + ... + \epsilon_t\]
I (Integrated): Differencing to make series stationary \[\Delta y_t = y_t - y_{t-1}\]
MA (Moving Average): Use past errors to improve prediction \[y_t = \mu + \epsilon_t + \theta_1 \epsilon_{t-1} + \theta_2 \epsilon_{t-2} + ...\]
ARIMA(p, d, q): - p: Number of autoregressive terms - d: Degree of differencing - q: Number of moving average terms
For seasonal data: SARIMA adds seasonal components.
Complete implementation for influenza forecasting:
import pandas as pd
import numpy as np
from statsmodels.tsa.statespace.sarimax import SARIMAX
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf
from statsmodels.tsa.seasonal import seasonal_decompose
import matplotlib.pyplot as plt
from sklearn.metrics import mean_absolute_error, mean_squared_error
# Simulate realistic influenza data (ILINet-style)
np.random.seed(42)
weeks = pd.date_range('2018-01-01', periods=260, freq='W-MON')
week_of_year = weeks.isocalendar().week
# Seasonal flu pattern (peaks in winter)
baseline = 2.0 + 2.5 * np.exp(-((week_of_year - 6) % 52)**2 / 50)
# Add year-to-year variation
year_effect = np.repeat([0, 0.3, -0.2, 0.5, 0.1], 52)[:len(weeks)]
# Random noise
noise = np.random.normal(0, 0.25, len(weeks))
# Combine
ili_pct = baseline + year_effect + noise
ili_pct = np.maximum(ili_pct, 0.5) # ILI never goes to zero
df = pd.DataFrame({'date': weeks, 'ili_pct': ili_pct})
df.set_index('date', inplace=True)
print(df.head(10))
print(f"\nTotal weeks: {len(df)}")
print(f"Date range: {df.index.min()} to {df.index.max()}")
# Train/test split (reserve last season for testing)
train_size = 208 # 4 years training
train = df.iloc[:train_size]
test = df.iloc[train_size:]
print(f"\nTraining weeks: {len(train)}")
print(f"Test weeks: {len(test)}")
# Exploratory analysis
fig, axes = plt.subplots(3, 1, figsize=(14, 12))
# Raw time series
axes[0].plot(train.index, train['ili_pct'], label='Training Data', color='blue')
axes[0].plot(test.index, test['ili_pct'], label='Test Data', color='orange')
axes[0].set_ylabel('ILI %', fontsize=12)
axes[0].set_title('Influenza-Like Illness Time Series', fontsize=14)
axes[0].legend()
axes[0].grid(True, alpha=0.3)
# ACF (Autocorrelation Function)
plot_acf(train['ili_pct'], lags=52, ax=axes[1], alpha=0.05)
axes[1].set_title('Autocorrelation Function (ACF)', fontsize=14)
axes[1].set_xlabel('Lag (weeks)')
# PACF (Partial Autocorrelation Function)
plot_pacf(train['ili_pct'], lags=52, ax=axes[2], alpha=0.05)
axes[2].set_title('Partial Autocorrelation Function (PACF)', fontsize=14)
axes[2].set_xlabel('Lag (weeks)')
plt.tight_layout()
plt.savefig('arima_exploratory.png', dpi=300)
plt.show()
# Seasonal decomposition
decomposition = seasonal_decompose(train['ili_pct'], model='additive', period=52)
fig, axes = plt.subplots(4, 1, figsize=(14, 12))
decomposition.observed.plot(ax=axes[0], title='Observed')
decomposition.trend.plot(ax=axes[1], title='Trend')
decomposition.seasonal.plot(ax=axes[2], title='Seasonal')
decomposition.resid.plot(ax=axes[3], title='Residual')
for ax in axes:
ax.grid(True, alpha=0.3)
plt.tight_layout()
plt.savefig('seasonal_decomposition.png', dpi=300)
plt.show()
# Fit SARIMA model
# SARIMA(p, d, q)(P, D, Q, s)
# (2, 0, 1) for non-seasonal components
# (1, 0, 1, 52) for seasonal components with 52-week period
print("\nFitting SARIMA model...")
model = SARIMAX(
train['ili_pct'],
order=(2, 0, 1), # (p, d, q)
seasonal_order=(1, 0, 1, 52), # (P, D, Q, s)
enforce_stationarity=False,
enforce_invertibility=False
)
fit = model.fit(disp=False, maxiter=200)
print("\n" + "="*60)
print("MODEL SUMMARY")
print("="*60)
print(fit.summary())
# Forecast
forecast_steps = len(test)
forecast_result = fit.get_forecast(steps=forecast_steps)
# Get predictions and confidence intervals
forecast_mean = forecast_result.predicted_mean
forecast_ci = forecast_result.conf_int()
# Create forecast dataframe
forecast_df = pd.DataFrame({
'date': test.index,
'actual': test['ili_pct'].values,
'forecast': forecast_mean.values,
'lower_ci': forecast_ci.iloc[:, 0].values,
'upper_ci': forecast_ci.iloc[:, 1].values
})
# Visualize forecast
fig, axes = plt.subplots(2, 1, figsize=(16, 10))
# Top: Full time series with forecast
axes[0].plot(train.index, train['ili_pct'], 'o-',
label='Training Data', color='blue', alpha=0.6, markersize=4)
axes[0].plot(test.index, test['ili_pct'], 'o-',
label='Actual (Test)', color='black', linewidth=2, markersize=6)
axes[0].plot(forecast_df['date'], forecast_df['forecast'], 's-',
label='SARIMA Forecast', color='red', linewidth=2, markersize=6)
# Confidence interval
axes[0].fill_between(forecast_df['date'],
forecast_df['lower_ci'],
forecast_df['upper_ci'],
alpha=0.3, color='red', label='95% Confidence Interval')
axes[0].axvline(train.index[-1], color='green', linestyle='--',
linewidth=2, label='Train/Test Split')
axes[0].set_ylabel('ILI %', fontsize=12)
axes[0].set_title('SARIMA Forecast: Influenza-Like Illness', fontsize=14)
axes[0].legend(fontsize=10)
axes[0].grid(True, alpha=0.3)
# Bottom: Forecast errors
errors = forecast_df['actual'] - forecast_df['forecast']
colors = ['red' if e < 0 else 'green' for e in errors]
axes[1].bar(forecast_df['date'], errors, color=colors, alpha=0.6, edgecolor='black')
axes[1].axhline(0, color='black', linestyle='-', linewidth=1.5)
# Mark forecast horizons
for horizon, label in [(4, '1 month'), (13, '3 months')]:
if horizon < len(errors):
axes[1].axvline(forecast_df['date'].iloc[horizon],
color='blue', linestyle=':', linewidth=2)
axes[1].text(forecast_df['date'].iloc[horizon], errors.max() * 0.9,
f'{label} ahead', rotation=90, fontsize=10)
axes[1].set_xlabel('Date', fontsize=12)
axes[1].set_ylabel('Forecast Error (Actual - Predicted)', fontsize=12)
axes[1].set_title('Forecast Errors Over Time', fontsize=14)
axes[1].grid(True, alpha=0.3)
plt.tight_layout()
plt.savefig('arima_flu_forecast.png', dpi=300)
plt.show()
# Evaluate performance at different horizons
horizons = [1, 4, 8, 13] # 1 week, 1 month, 2 months, 3 months
print("\n" + "="*60)
print("FORECAST PERFORMANCE BY HORIZON")
print("="*60)
for horizon in horizons:
if horizon <= len(errors):
subset_actual = forecast_df['actual'].iloc[:horizon]
subset_pred = forecast_df['forecast'].iloc[:horizon]
mae = mean_absolute_error(subset_actual, subset_pred)
rmse = np.sqrt(mean_squared_error(subset_actual, subset_pred))
mape = np.mean(np.abs((subset_actual - subset_pred) / subset_actual)) * 100
# Coverage: % of actuals within CI
subset_lower = forecast_df['lower_ci'].iloc[:horizon]
subset_upper = forecast_df['upper_ci'].iloc[:horizon]
coverage = ((subset_actual >= subset_lower) & (subset_actual <= subset_upper)).mean() * 100
print(f"\n{horizon} week(s) ahead:")
print(f" MAE: {mae:.3f}")
print(f" RMSE: {rmse:.3f}")
print(f" MAPE: {mape:.1f}%")
print(f" 95% CI Coverage: {coverage:.0f}%")
# Compare to naive baseline (persistence forecast)
naive_forecast = train['ili_pct'].iloc[-1] # Last observed value
naive_errors = forecast_df['actual'] - naive_forecast
naive_mae = mean_absolute_error(forecast_df['actual'], [naive_forecast]*len(forecast_df))
arima_mae = mean_absolute_error(forecast_df['actual'], forecast_df['forecast'])
improvement = (1 - arima_mae / naive_mae) * 100
print("\n" + "="*60)
print("COMPARISON TO NAIVE BASELINE")
print("="*60)
print(f"Naive forecast (persistence): MAE = {naive_mae:.3f}")
print(f"SARIMA forecast: MAE = {arima_mae:.3f}")
print(f"Improvement: {improvement:.1f}%")ARIMA excels when: ✓ Disease has regular seasonal patterns (influenza, RSV, dengue)
✓ Reporting is stable and consistent
✓ Short-term forecasts (1-4 weeks)
✓ No major interventions or behavior changes expected
✓ Historical data covers multiple seasons
ARIMA struggles when: ✗ Novel outbreak (no historical pattern to learn)
✗ Sudden interventions (lockdowns, vaccine rollouts)
✗ Reporting systems change
✗ Long-term forecasts (>4 weeks)
✗ Nonlinear, chaotic dynamics
For comprehensive ARIMA tutorial, see Hyndman & Athanasopoulos, Forecasting: Principles and Practice.
LSTM (Long Short-Term Memory) networks are a type of recurrent neural network designed for sequential data.
Why LSTMs for epidemics: - Can capture long-term dependencies (last month’s cases affect this month) - Handle multiple input features (cases + mobility + weather + interventions) - Don’t require stationarity assumptions - Can learn nonlinear patterns
The basic architecture:
Input sequence → LSTM layers → Dense layers → Output (future cases)
(past 14 days) (next 7 days)
Complete implementation for COVID-19 forecasting:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.preprocessing import MinMaxScaler
from sklearn.metrics import mean_absolute_error, mean_squared_error
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers
import warnings
warnings.filterwarnings('ignore')
# Set random seeds for reproducibility
np.random.seed(42)
tf.random.set_seed(42)
# Simulate realistic COVID-19 case data
days = np.arange(500)
# Multi-wave pattern (Original → Delta → Omicron-like)
wave1 = 500 * np.exp(-((days - 50)**2) / 300)
wave2 = 2000 * np.exp(-((days - 200)**2) / 400)
wave3 = 5000 * np.exp(-((days - 350)**2) / 500)
# Combine waves with baseline and noise
cases = 100 + wave1 + wave2 + wave3 + np.random.normal(0, 100, len(days))
cases = np.maximum(cases, 50) # Floor at 50 cases/day
# Add day-of-week effect (reporting artifacts)
day_of_week_effect = -200 * (days % 7 >= 5) # Lower on weekends
cases = cases + day_of_week_effect
cases = np.maximum(cases, 0)
# Create DataFrame
df = pd.DataFrame({
'day': days,
'cases': cases
})
print(f"Total days: {len(df)}")
print(f"Case range: {cases.min():.0f} to {cases.max():.0f}")
# Visualize raw data
plt.figure(figsize=(14, 6))
plt.plot(df['day'], df['cases'], 'o-', alpha=0.6, markersize=3)
plt.xlabel('Day', fontsize=12)
plt.ylabel('Daily Cases', fontsize=12)
plt.title('Simulated COVID-19 Daily Cases (Multi-Wave Pattern)', fontsize=14)
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.savefig('lstm_data_raw.png', dpi=300)
plt.show()
# Prepare data for LSTM
def create_sequences(data, seq_length, forecast_horizon=1):
"""
Create input sequences and targets for LSTM
Parameters:
- data: 1D array of values
- seq_length: number of past time steps to use
- forecast_horizon: number of future time steps to predict
Returns:
- X: [n_samples, seq_length] input sequences
- y: [n_samples, forecast_horizon] target values
"""
X, y = [], []
for i in range(len(data) - seq_length - forecast_horizon + 1):
# Input: seq_length days of history
X.append(data[i:i+seq_length])
# Target: next forecast_horizon days
if forecast_horizon == 1:
y.append(data[i+seq_length])
else:
y.append(data[i+seq_length:i+seq_length+forecast_horizon])
return np.array(X), np.array(y)
# Normalize data (LSTM trains better with normalized inputs)
scaler = MinMaxScaler(feature_range=(0, 1))
cases_scaled = scaler.fit_transform(df['cases'].values.reshape(-1, 1)).flatten()
# Create sequences (use past 14 days to predict next day)
seq_length = 14
forecast_horizon = 1
X, y = create_sequences(cases_scaled, seq_length, forecast_horizon)
print(f"\nSequence creation:")
print(f"Input shape: {X.shape}") # (n_samples, seq_length)
print(f"Target shape: {y.shape}") # (n_samples,)
# Train/test split (80/20)
train_size = int(0.8 * len(X))
X_train, X_test = X[:train_size], X[train_size:]
y_train, y_test = y[:train_size], y[train_size:]
print(f"\nTrain samples: {len(X_train)}")
print(f"Test samples: {len(X_test)}")
# Reshape for LSTM [samples, time steps, features]
X_train = X_train.reshape((X_train.shape[0], X_train.shape[1], 1))
X_test = X_test.reshape((X_test.shape[0], X_test.shape[1], 1))
print(f"\nReshaped for LSTM:")
print(f"X_train shape: {X_train.shape}")
print(f"X_test shape: {X_test.shape}")
# Build LSTM model
model = keras.Sequential([
# First LSTM layer (return sequences for stacking)
layers.LSTM(64, activation='relu', return_sequences=True,
input_shape=(seq_length, 1)),
layers.Dropout(0.2),
# Second LSTM layer
layers.LSTM(32, activation='relu'),
layers.Dropout(0.2),
# Dense layers
layers.Dense(16, activation='relu'),
layers.Dense(forecast_horizon) # Output layer
])
model.compile(
optimizer='adam',
loss='mse',
metrics=['mae']
)
print("\n" + "="*60)
print("LSTM MODEL ARCHITECTURE")
print("="*60)
model.summary()
# Train model
print("\n" + "="*60)
print("TRAINING MODEL")
print("="*60)
early_stopping = keras.callbacks.EarlyStopping(
monitor='val_loss',
patience=10,
restore_best_weights=True
)
history = model.fit(
X_train, y_train,
epochs=100,
batch_size=32,
validation_split=0.2,
callbacks=[early_stopping],
verbose=1
)
# Make predictions
y_train_pred = model.predict(X_train, verbose=0)
y_test_pred = model.predict(X_test, verbose=0)
# Inverse transform to original scale
y_train_actual = scaler.inverse_transform(y_train.reshape(-1, 1)).flatten()
y_train_pred_original = scaler.inverse_transform(y_train_pred).flatten()
y_test_actual = scaler.inverse_transform(y_test.reshape(-1, 1)).flatten()
y_test_pred_original = scaler.inverse_transform(y_test_pred).flatten()
# Evaluate
train_mae = mean_absolute_error(y_train_actual, y_train_pred_original)
train_rmse = np.sqrt(mean_squared_error(y_train_actual, y_train_pred_original))
test_mae = mean_absolute_error(y_test_actual, y_test_pred_original)
test_rmse = np.sqrt(mean_squared_error(y_test_actual, y_test_pred_original))
print("\n" + "="*60)
print("MODEL PERFORMANCE")
print("="*60)
print(f"\nTraining Set:")
print(f" MAE: {train_mae:.1f} cases")
print(f" RMSE: {train_rmse:.1f} cases")
print(f"\nTest Set:")
print(f" MAE: {test_mae:.1f} cases")
print(f" RMSE: {test_rmse:.1f} cases")
# Visualize results
fig, axes = plt.subplots(3, 1, figsize=(16, 14))
# Top: Training history
axes[0].plot(history.history['loss'], label='Training Loss', linewidth=2)
axes[0].plot(history.history['val_loss'], label='Validation Loss', linewidth=2)
axes[0].set_xlabel('Epoch', fontsize=12)
axes[0].set_ylabel('Loss (MSE)', fontsize=12)
axes[0].set_title('LSTM Training History', fontsize=14)
axes[0].legend(fontsize=11)
axes[0].grid(True, alpha=0.3)
# Middle: Predictions vs actual (full dataset)
train_days = days[seq_length:seq_length+len(y_train_actual)]
test_days = days[seq_length+len(y_train_actual):seq_length+len(y_train_actual)+len(y_test_actual)]
axes[1].plot(train_days, y_train_actual, 'o-',
label='Actual (Training)', color='blue', alpha=0.5, markersize=3)
axes[1].plot(test_days, y_test_actual, 'o-',
label='Actual (Test)', color='black', linewidth=2, markersize=4)
axes[1].plot(test_days, y_test_pred_original, 's-',
label='LSTM Forecast', color='red', linewidth=2, markersize=4)
axes[1].axvline(train_days[-1], color='green', linestyle='--',
linewidth=2, label='Train/Test Split')
axes[1].set_xlabel('Day', fontsize=12)
axes[1].set_ylabel('Daily Cases', fontsize=12)
axes[1].set_title('LSTM Forecast: COVID-19 Daily Cases', fontsize=14)
axes[1].legend(fontsize=10)
axes[1].grid(True, alpha=0.3)
# Bottom: Forecast errors
errors = y_test_actual - y_test_pred_original
colors = ['red' if e < 0 else 'green' for e in errors]
axes[2].bar(test_days, errors, color=colors, alpha=0.6, edgecolor='black')
axes[2].axhline(0, color='black', linestyle='-', linewidth=1.5)
axes[2].set_xlabel('Day', fontsize=12)
axes[2].set_ylabel('Forecast Error (Actual - Predicted)', fontsize=12)
axes[2].set_title('LSTM Forecast Errors', fontsize=14)
axes[2].grid(True, alpha=0.3)
plt.tight_layout()
plt.savefig('lstm_covid_forecast.png', dpi=300)
plt.show()
# Multi-step ahead forecasting
print("\n" + "="*60)
print("MULTI-STEP AHEAD FORECASTING")
print("="*60)
# Use the last seq_length days as input
last_sequence = cases_scaled[-seq_length:]
current_sequence = last_sequence.reshape(1, seq_length, 1)
# Forecast next 14 days iteratively
forecast_days = 14
multi_step_forecast = []
for i in range(forecast_days):
# Predict next day
next_pred = model.predict(current_sequence, verbose=0)[0, 0]
multi_step_forecast.append(next_pred)
# Update sequence: drop oldest, add prediction
current_sequence = np.append(current_sequence[0, 1:, 0], next_pred).reshape(1, seq_length, 1)
# Inverse transform
multi_step_forecast_original = scaler.inverse_transform(
np.array(multi_step_forecast).reshape(-1, 1)
).flatten()
# Visualize multi-step forecast
fig, ax = plt.subplots(figsize=(14, 6))
# Historical data
ax.plot(days, cases, 'o-', label='Historical Cases',
color='blue', alpha=0.6, markersize=3)
# Multi-step forecast
forecast_days_array = np.arange(len(days), len(days) + forecast_days)
ax.plot(forecast_days_array, multi_step_forecast_original, 's-',
label=f'{forecast_days}-Day Forecast', color='red',
linewidth=3, markersize=8)
ax.axvline(days[-1], color='green', linestyle='--', linewidth=2,
label='Forecast Start')
ax.set_xlabel('Day', fontsize=12)
ax.set_ylabel('Daily Cases', fontsize=12)
ax.set_title(f'LSTM Multi-Step Ahead Forecast ({forecast_days} days)', fontsize=14)
ax.legend(fontsize=11)
ax.grid(True, alpha=0.3)
plt.tight_layout()
plt.savefig('lstm_multistep_forecast.png', dpi=300)
plt.show()
print(f"\n{forecast_days}-day forecast:")
for i, pred in enumerate(multi_step_forecast_original, 1):
print(f" Day {i}: {pred:.0f} cases")LSTMs learn patterns, not mechanisms.
Example scenario: - Training data includes natural epidemic curve + lockdown → cases drop - LSTM learns: “After cases reach X, they drop” - New epidemic: Cases reach X, but no lockdown this time - LSTM predicts drop (wrong!)
The problem: LSTM has no concept of “lockdown caused the drop.” It just sees a pattern.
When this matters: - Novel interventions (no historical data) - Changing policy responses - New variants with different characteristics - Long-term forecasts where context matters
Solution: Use LSTMs for short-term pattern recognition, mechanistic models for scenario planning.
For practical deep learning in epidemiology, see Chimmula & Zhang, 2020, Chaos, Solitons & Fractals.
No single model is best for all situations. Ensemble methods combine multiple models to create more robust forecasts.
The CDC’s FluSight challenge, running since 2013, invites teams to forecast influenza-like illness (ILI) weekly.
The setup: - Teams submit forecasts for 1-4 weeks ahead - Also forecast: peak week, peak intensity, onset week - Evaluated on prediction interval coverage and point forecast accuracy
Key finding (Reich et al., 2019, Clinical Infectious Diseases) (Reich et al. 2019):
Ensemble models consistently outperformed any individual model.
Why ensembles work:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from scipy.integrate import odeint
from statsmodels.tsa.statespace.sarimax import SARIMAX
from sklearn.metrics import mean_absolute_error
import warnings
warnings.filterwarnings('ignore')
# Generate synthetic epidemic data
np.random.seed(42)
days = np.arange(200)
# Simulated cases (mix of exponential growth + intervention)
phase1 = 100 * np.exp(0.03 * days[:80]) # Growth phase
phase2 = phase1[-1] * np.exp(-0.02 * (days[80:] - 80)) # Decline (intervention)
cases = np.concatenate([phase1, phase2])
cases = cases + np.random.normal(0, cases * 0.1) # Add noise
cases = np.maximum(cases, 10)
df = pd.DataFrame({'day': days, 'cases': cases})
# Split: train on first 150 days, forecast next 50
train_size = 150
train = df.iloc[:train_size]
test = df.iloc[train_size:]
print(f"Training days: {len(train)}")
print(f"Test days: {len(test)}")
# === MODEL 1: SEIR Mechanistic Model ===
def seir_forecast(train_data, forecast_steps, beta=0.03, sigma=0.2, gamma=0.1):
"""
Fit SEIR model to training data and forecast
"""
# Population
N = 1000000
# Initial conditions (estimate from data)
I0 = train_data['cases'].iloc[0]
E0 = I0
R0_init = 0
S0 = N - E0 - I0 - R0_init
def seir_model(y, t, beta, sigma, gamma, N):
S, E, I, R = y
dS = -beta * S * I / N
dE = beta * S * I / N - sigma * E
dI = sigma * E - gamma * I
dR = gamma * I
return dS, dE, dI, dR
# Solve SEIR for training + forecast period
t = np.arange(len(train_data) + forecast_steps)
y0 = S0, E0, I0, R0_init
solution = odeint(seir_model, y0, t, args=(beta, sigma, gamma, N))
_, _, I, _ = solution.T
# Return only forecast period
forecast = I[len(train_data):]
return forecast
seir_forecast_values = seir_forecast(train, len(test))
# === MODEL 2: ARIMA ===
def arima_forecast(train_data, forecast_steps):
"""
Fit ARIMA model and forecast
"""
try:
model = SARIMAX(train_data['cases'], order=(2, 1, 1))
fit = model.fit(disp=False)
forecast = fit.forecast(steps=forecast_steps)
return forecast.values
except:
# Fallback: return mean if ARIMA fails
return np.full(forecast_steps, train_data['cases'].mean())
arima_forecast_values = arima_forecast(train, len(test))
# === MODEL 3: Machine Learning (Simple Trend Extrapolation) ===
def ml_forecast(train_data, forecast_steps):
"""
Simple ML: Fit polynomial to recent trend and extrapolate
"""
from numpy.polynomial import Polynomial
# Use last 30 days for trend
recent = train_data.iloc[-30:]
# Fit polynomial
p = Polynomial.fit(recent['day'], recent['cases'], deg=2)
# Forecast
forecast_days = np.arange(train_data['day'].iloc[-1] + 1,
train_data['day'].iloc[-1] + 1 + forecast_steps)
forecast = p(forecast_days)
return np.maximum(forecast, 0) # No negative cases
ml_forecast_values = ml_forecast(train, len(test))
# === ENSEMBLE: Combine Models ===
# Method 1: Simple average
ensemble_simple = (seir_forecast_values + arima_forecast_values + ml_forecast_values) / 3
# Method 2: Weighted average (based on domain knowledge or historical performance)
weights = {
'seir': 0.4, # Higher weight for mechanistic model (interpretable)
'arima': 0.3, # Medium weight (good for short-term)
'ml': 0.3 # Medium weight (captures recent trends)
}
ensemble_weighted = (
weights['seir'] * seir_forecast_values +
weights['arima'] * arima_forecast_values +
weights['ml'] * ml_forecast_values
)
# Method 3: Median ensemble (more robust to outliers)
all_forecasts = np.array([seir_forecast_values, arima_forecast_values, ml_forecast_values])
ensemble_median = np.median(all_forecasts, axis=0)
# Visualize
fig, axes = plt.subplots(2, 1, figsize=(16, 12))
# Top: All forecasts
axes[0].plot(train['day'], train['cases'], 'o-',
label='Training Data', color='blue', alpha=0.5, markersize=3)
axes[0].plot(test['day'], test['cases'], 'o-',
label='Actual (Test)', color='black', linewidth=3, markersize=6)
axes[0].plot(test['day'], seir_forecast_values, '^--',
label='SEIR Model', alpha=0.7, markersize=6)
axes[0].plot(test['day'], arima_forecast_values, 's--',
label='ARIMA Model', alpha=0.7, markersize=6)
axes[0].plot(test['day'], ml_forecast_values, 'd--',
label='ML Model', alpha=0.7, markersize=6)
axes[0].plot(test['day'], ensemble_weighted, '*-',
label='Weighted Ensemble', color='red', linewidth=3, markersize=10)
axes[0].axvline(train['day'].iloc[-1], color='green', linestyle='--',
linewidth=2, label='Train/Test Split')
axes[0].set_ylabel('Daily Cases', fontsize=12)
axes[0].set_title('Ensemble Forecasting: Combining Multiple Models', fontsize=14)
axes[0].legend(fontsize=10, loc='upper right')
axes[0].grid(True, alpha=0.3)
# Bottom: Performance comparison
models = {
'SEIR': seir_forecast_values,
'ARIMA': arima_forecast_values,
'ML': ml_forecast_values,
'Simple Ensemble': ensemble_simple,
'Weighted Ensemble': ensemble_weighted,
'Median Ensemble': ensemble_median
}
performance = []
for name, forecast in models.items():
mae = mean_absolute_error(test['cases'], forecast)
rmse = np.sqrt(((test['cases'] - forecast) ** 2).mean())
performance.append({'Model': name, 'MAE': mae, 'RMSE': rmse})
perf_df = pd.DataFrame(performance).sort_values('MAE')
x_pos = np.arange(len(perf_df))
axes[1].barh(x_pos, perf_df['MAE'], alpha=0.8, edgecolor='black')
axes[1].set_yticks(x_pos)
axes[1].set_yticklabels(perf_df['Model'])
axes[1].set_xlabel('Mean Absolute Error (MAE)', fontsize=12)
axes[1].set_title('Model Performance Comparison', fontsize=14)
axes[1].grid(True, alpha=0.3, axis='x')
# Highlight ensemble
for i, model in enumerate(perf_df['Model']):
if 'Ensemble' in model:
axes[1].get_children()[i].set_color('red')
axes[1].get_children()[i].set_alpha(1.0)
plt.tight_layout()
plt.savefig('ensemble_forecast.png', dpi=300)
plt.show()
# Print performance
print("\n" + "="*60)
print("MODEL PERFORMANCE COMPARISON")
print("="*60)
print(perf_df.to_string(index=False))
# Best model
best_model = perf_df.iloc[0]
print(f"\n🏆 Best Model: {best_model['Model']}")
print(f" MAE: {best_model['MAE']:.1f}")
print(f" RMSE: {best_model['RMSE']:.1f}")
# Improvement of best ensemble over best individual model
best_individual = perf_df[~perf_df['Model'].str.contains('Ensemble')].iloc[0]
improvement = (1 - best_model['MAE'] / best_individual['MAE']) * 100
print(f"\nEnsemble improvement over best individual model: {improvement:.1f}%")Weight Selection:
When ensemble doesn’t help: - All models use the same data and make the same mistakes - One model is vastly superior to others (ensemble dilutes its performance) - Forecast horizon is so long that all models are guessing
For theoretical foundations, see Ray & Reich, 2018, Journal of Forecasting.
The cardinal sin of forecasting: Publishing a single number without uncertainty.
Example of bad forecasting: > “Model predicts 50,000 cases next week.”
What if the real range is 30,000-80,000? Decision-makers need to know!
1. Parameter Uncertainty - Don’t know exact values of \(\beta\), \(\gamma\), etc. - Solution: Bayesian inference, confidence intervals
2. Structural Uncertainty - Model is wrong (all models are wrong!) - SIR assumes homogeneous mixing (false) - Solution: Multiple models, ensemble
3. Stochastic Uncertainty - Random chance (superspreading events, who gets infected) - Solution: Monte Carlo simulation
4. Scenario Uncertainty - Future interventions unknown (will there be a lockdown?) - Solution: Scenario planning (best/worst/expected case)
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from scipy.integrate import odeint
from scipy.stats import norm
import seaborn as sns
# === METHOD 1: Bootstrap Confidence Intervals ===
def bootstrap_forecast(train_data, forecast_steps, n_bootstrap=1000):
"""
Use bootstrap resampling to quantify uncertainty
"""
forecasts = []
for i in range(n_bootstrap):
# Resample training data with replacement
boot_sample = train_data.sample(n=len(train_data), replace=True)
boot_sample = boot_sample.sort_values('day').reset_index(drop=True)
# Fit model (simple exponential for speed)
from scipy.optimize import curve_fit
def exp_growth(x, a, b):
return a * np.exp(b * x)
try:
params, _ = curve_fit(exp_growth, boot_sample['day'], boot_sample['cases'],
p0=[100, 0.01], maxfev=5000)
# Forecast
forecast_days = np.arange(train_data['day'].iloc[-1] + 1,
train_data['day'].iloc[-1] + 1 + forecast_steps)
forecast = exp_growth(forecast_days, *params)
forecasts.append(forecast)
except:
# If fit fails, use mean
forecasts.append(np.full(forecast_steps, train_data['cases'].mean()))
forecasts = np.array(forecasts)
# Calculate percentiles
forecast_mean = np.mean(forecasts, axis=0)
forecast_lower = np.percentile(forecasts, 2.5, axis=0)
forecast_upper = np.percentile(forecasts, 97.5, axis=0)
return forecast_mean, forecast_lower, forecast_upper
# === METHOD 2: Bayesian Parameter Uncertainty ===
def bayesian_seir_forecast(train_data, forecast_steps, n_samples=500):
"""
SEIR with Bayesian parameter uncertainty
"""
N = 1000000
# Prior distributions for parameters (based on literature/expert knowledge)
beta_samples = np.random.uniform(0.02, 0.05, n_samples) # Transmission rate
gamma_samples = np.random.uniform(0.08, 0.12, n_samples) # Recovery rate
sigma_samples = np.random.uniform(0.15, 0.25, n_samples) # E→I rate
forecasts = []
for beta, gamma, sigma in zip(beta_samples, gamma_samples, sigma_samples):
# Initial conditions
I0 = train_data['cases'].iloc[0]
E0 = I0
R0_init = 0
S0 = N - E0 - I0 - R0_init
def seir(y, t, beta, sigma, gamma, N):
S, E, I, R = y
dS = -beta * S * I / N
dE = beta * S * I / N - sigma * E
dI = sigma * E - gamma * I
dR = gamma * I
return dS, dE, dI, dR
# Solve
t = np.arange(len(train_data) + forecast_steps)
y0 = S0, E0, I0, R0_init
solution = odeint(seir, y0, t, args=(beta, sigma, gamma, N))
_, _, I, _ = solution.T
forecasts.append(I[len(train_data):])
forecasts = np.array(forecasts)
forecast_mean = np.mean(forecasts, axis=0)
forecast_lower = np.percentile(forecasts, 2.5, axis=0)
forecast_upper = np.percentile(forecasts, 97.5, axis=0)
return forecast_mean, forecast_lower, forecast_upper, forecasts
# === METHOD 3: Ensemble Spread as Uncertainty ===
def ensemble_uncertainty(individual_forecasts):
"""
Use spread across ensemble members as uncertainty measure
"""
forecasts_array = np.array(individual_forecasts)
forecast_mean = np.mean(forecasts_array, axis=0)
forecast_std = np.std(forecasts_array, axis=0)
# 95% CI assuming normal distribution
forecast_lower = forecast_mean - 1.96 * forecast_std
forecast_upper = forecast_mean + 1.96 * forecast_std
return forecast_mean, forecast_lower, forecast_upper
# Generate example data
np.random.seed(42)
days = np.arange(150)
cases = 100 * np.exp(0.025 * days) + np.random.normal(0, 50, len(days))
cases = np.maximum(cases, 10)
df = pd.DataFrame({'day': days, 'cases': cases})
train_size = 120
train = df.iloc[:train_size]
test = df.iloc[train_size:]
# Generate forecasts with uncertainty
print("Generating forecasts with uncertainty quantification...")
# Bootstrap
print(" Running bootstrap...")
boot_mean, boot_lower, boot_upper = bootstrap_forecast(train, len(test), n_bootstrap=500)
# Bayesian
print(" Running Bayesian SEIR...")
bayes_mean, bayes_lower, bayes_upper, bayes_samples = bayesian_seir_forecast(
train, len(test), n_samples=200
)
# Visualize
fig, axes = plt.subplots(3, 1, figsize=(16, 16))
# Top: Bootstrap uncertainty
axes[0].plot(train['day'], train['cases'], 'o-',
label='Training Data', color='blue', alpha=0.5, markersize=3)
axes[0].plot(test['day'], test['cases'], 'o-',
label='Actual (Test)', color='black', linewidth=2, markersize=6)
axes[0].plot(test['day'], boot_mean, 's-',
label='Bootstrap Forecast', color='red', linewidth=2, markersize=6)
axes[0].fill_between(test['day'], boot_lower, boot_upper,
alpha=0.3, color='red', label='95% Bootstrap CI')
axes[0].axvline(train['day'].iloc[-1], color='green', linestyle='--', linewidth=2)
axes[0].set_ylabel('Daily Cases', fontsize=12)
axes[0].set_title('Bootstrap Uncertainty Quantification', fontsize=14)
axes[0].legend(fontsize=10)
axes[0].grid(True, alpha=0.3)
# Middle: Bayesian uncertainty
axes[1].plot(train['day'], train['cases'], 'o-',
label='Training Data', color='blue', alpha=0.5, markersize=3)
axes[1].plot(test['day'], test['cases'], 'o-',
label='Actual (Test)', color='black', linewidth=2, markersize=6)
axes[1].plot(test['day'], bayes_mean, '^-',
label='Bayesian SEIR Forecast', color='purple', linewidth=2, markersize=6)
axes[1].fill_between(test['day'], bayes_lower, bayes_upper,
alpha=0.3, color='purple', label='95% Credible Interval')
axes[1].axvline(train['day'].iloc[-1], color='green', linestyle='--', linewidth=2)
axes[1].set_ylabel('Daily Cases', fontsize=12)
axes[1].set_title('Bayesian Parameter Uncertainty', fontsize=14)
axes[1].legend(fontsize=10)
axes[1].grid(True, alpha=0.3)
# Bottom: Distribution of forecasts (spaghetti plot)
for i in range(min(50, len(bayes_samples))): # Plot subset for clarity
axes[2].plot(test['day'], bayes_samples[i], '-',
color='purple', alpha=0.1, linewidth=0.5)
axes[2].plot(train['day'], train['cases'], 'o-',
label='Training Data', color='blue', alpha=0.7, markersize=3)
axes[2].plot(test['day'], test['cases'], 'o-',
label='Actual (Test)', color='black', linewidth=3, markersize=6, zorder=5)
axes[2].plot(test['day'], bayes_mean, '-',
label='Mean Forecast', color='red', linewidth=3, zorder=4)
axes[2].axvline(train['day'].iloc[-1], color='green', linestyle='--', linewidth=2)
axes[2].set_xlabel('Day', fontsize=12)
axes[2].set_ylabel('Daily Cases', fontsize=12)
axes[2].set_title('Forecast Uncertainty: Distribution of Possible Futures (Spaghetti Plot)', fontsize=14)
axes[2].legend(fontsize=10)
axes[2].grid(True, alpha=0.3)
plt.tight_layout()
plt.savefig('uncertainty_quantification.png', dpi=300)
plt.show()
# Evaluate coverage
def calculate_coverage(actual, lower, upper):
"""Calculate what % of actual values fall within prediction interval"""
within_interval = (actual >= lower) & (actual <= upper)
coverage = within_interval.mean() * 100
return coverage
boot_coverage = calculate_coverage(test['cases'].values, boot_lower, boot_upper)
bayes_coverage = calculate_coverage(test['cases'].values, bayes_lower, bayes_upper)
print("\n" + "="*60)
print("UNCERTAINTY QUANTIFICATION EVALUATION")
print("="*60)
print(f"\nBootstrap 95% CI:")
print(f" Coverage: {boot_coverage:.0f}% (should be ~95%)")
print(f"\nBayesian 95% Credible Interval:")
print(f" Coverage: {bayes_coverage:.0f}% (should be ~95%)")
# Interval width (narrower is better, if coverage is maintained)
boot_width = (boot_upper - boot_lower).mean()
bayes_width = (bayes_upper - bayes_lower).mean()
print(f"\nMean Interval Width:")
print(f" Bootstrap: {boot_width:.1f} cases")
print(f" Bayesian: {bayes_width:.1f} cases")Coverage: % of actual values that fall within your prediction interval - Good: 95% coverage for a 95% interval - Bad: Only 70% coverage (interval too narrow)
Sharpness: How narrow your prediction interval is - Good: Narrow intervals (more informative) - Bad: Wide intervals (“cases will be between 10 and 10,000”)
The trade-off: - Wide intervals → High coverage, low utility - Narrow intervals → Low coverage, misleading
Goal: Maximize sharpness while maintaining nominal coverage.
For rigorous evaluation, see Gneiting & Raftery, 2007, Journal of the American Statistical Association.
The COVID-19 Forecast Hub, launched in March 2020, represents the largest epidemic forecasting effort in history.
Coordinated by: Reich Lab at UMass Amherst
Participants: 50+ teams (academia, industry, government)
Frequency: Weekly forecasts
Targets: - Incident deaths (1-4 weeks ahead) - Incident hospitalizations - Incident cases (when data reliable)
Ensemble method: Weighted median of all submitted forecasts
Use: CDC and state health departments for decision-making
Cramer et al., 2022, PNAS evaluated performance over 20 months:
Key findings:
1-week ahead forecasts: - Median absolute error: 10-15% (useful!) - Ensemble outperformed 90% of individual models
4-week ahead forecasts: - Median absolute error: 40-60% (barely better than guessing) - Huge uncertainty, wide prediction intervals
Challenges encountered:
❌ Novel variants (Alpha, Delta, Omicron) → All models failed simultaneously
❌ Changing testing → Case forecasts unreliable
❌ Policy changes → Lockdowns, mask mandates unpredictable
❌ Behavior changes → People reacted to case counts
❌ Data quality issues → Reporting delays, definition changes
Short-term forecasts (1-2 weeks ahead): - Useful for operational planning (hospital beds, staffing) - Ensemble approach was robust - Simple models often competitive with complex models
Transparency and collaboration: - Open data, open code, open evaluation - Rapid iteration and model improvement - Community learned from collective failures
Ensemble method: - Consistently better than any individual model - Robust to individual model failures - Wisdom of crowds
For detailed analysis, see Reich Lab’s evaluation reports.
Long-term forecasts (>4 weeks): - Worse than simple “trend continues” baseline - Essentially scenario planning, not prediction - Should not have been used for major policy decisions
Forecasting through major disruptions: - New variants broke all models - Policy changes (lockdowns) invalidated forecasts immediately - No model predicted Omicron’s magnitude
Point forecasts without uncertainty: - Some early models provided single numbers - Dangerously misleading - Led to bad decisions
The fundamental problem: COVID-19 exhibited extreme non-stationarity—the data-generating process changed every few months (new variants, vaccines, behavior).
When the rules keep changing, forecasting becomes impossible.
December 2021: Omicron variant detected, spreads rapidly.
Forecast Hub predictions (week of Dec 20): - Ensemble forecast: 450,000 cases/day peak - Range across models: 300,000 - 700,000 cases/day
Reality (Jan 2022): - Actual peak: 1.4 million cases/day (underestimated by 2-3x)
Why all models failed: - Omicron’s R₀ was 2-3x higher than Delta - Immune escape wasn’t fully captured - Behavior didn’t change as much as expected - Testing patterns shifted
The lesson: Novel variants with dramatically different properties break all models trained on past data.
Forecasts have consequences. Sometimes, not forecasting is the ethical choice.
Scenario planning disguised as prediction:
Many “forecasts” during COVID-19 were actually scenarios: - “If we do nothing, 2 million will die” - “If we lock down, 100,000 will die”
These aren’t predictions—they’re conditional statements about policy choices.
The problem: Media reports them as predictions, public gets confused.
You should NOT publish forecasts when:
❌ Novel pathogen with no historical data - Early COVID-19 (January 2020): Every model was pure speculation - Better: “Here’s a range of scenarios from mild to catastrophic”
❌ Major intervention pending - Government about to announce lockdown - Forecasting what happens “if current trends continue” is misleading
❌ Data quality too poor - Testing availability changing daily - Reporting systems overwhelmed - Better: Focus on qualitative assessment
❌ Model disagreement too high - If your ensemble has predictions ranging from 10,000 to 500,000 cases, what’s the point? - Better: “Current uncertainty is too high for actionable forecasts”
❌ Horizon too long - Forecasting 3+ months ahead for rapidly changing epidemic - Better: Provide scenarios, not forecasts
From Ioannidis et al., 2020, International Journal of Forecasting:
“Epidemic forecasters have a special responsibility because their forecasts can profoundly influence public policy and behavior. Forecasts that are too optimistic may lead to complacency. Forecasts that are too pessimistic may cause unnecessary panic and economic damage. Forecasters must resist pressure to provide false certainty.”
Best practices:
✅ Always provide uncertainty quantification
✅ Clearly distinguish forecasts from scenarios
✅ Update forecasts as data changes
✅ Acknowledge model limitations publicly
✅ Archive forecasts for retrospective evaluation
✅ Resist pressure to forecast when uncertainty is unquantifiable
Before you start forecasting:
☐ Do you have sufficient historical data? - Time series: ≥2 years for seasonal diseases - SEIR models: At least early epidemic data to estimate parameters
☐ Is your data quality acceptable? - Complete Chapter 3 assessment - Understand reporting lags - Know how testing/surveillance changed over time
☐ What decisions will this forecast inform? - Hospital resource allocation → 1-2 week forecasts sufficient - Long-term policy planning → Need scenarios, not forecasts - Public communication → Uncertainty is essential
☐ What’s your forecast horizon? - 1-2 weeks: Achievable with most methods - 3-4 weeks: Difficult, high uncertainty - >4 weeks: Scenario planning, not forecasting
Start simple, add complexity only when justified:
1. Naive baseline (persistence: tomorrow = today)
↓
2. Time series model (ARIMA, Prophet)
↓ (if you need scenario planning)
3. Mechanistic model (SEIR)
↓ (if you have large training data)
4. Machine learning (LSTM, ensemble)
Never skip the naive baseline! If your fancy model doesn’t beat “tomorrow will be like today,” it’s not adding value.
Proper time series cross-validation:
# WRONG: Random train/test split
X_train, X_test = train_test_split(X, test_size=0.2) # ❌
# CORRECT: Respect temporal order
train_size = int(0.8 * len(X))
X_train = X[:train_size]
X_test = X[train_size:] # ✅Report multiple metrics: - MAE (Mean Absolute Error) - RMSE (Root Mean Squared Error) - Coverage (95% CI should contain 95% of actuals) - Sharpness (average interval width)
Track performance over time: - Forecast accuracy degrades → Retrain models - Sudden drop in accuracy → Something changed (investigate!)
class EpidemicForecaster:
"""
Production-ready epidemic forecasting system
"""
def __init__(self):
self.models = {}
self.performance_history = []
def add_model(self, name, model):
"""Add a model to the ensemble"""
self.models[name] = model
def fit(self, train_data):
"""Fit all models"""
for name, model in self.models.items():
model.fit(train_data)
def forecast(self, horizon, method='weighted_ensemble'):
"""Generate ensemble forecast"""
# Get forecasts from all models
forecasts = {}
for name, model in self.models.items():
forecasts[name] = model.predict(horizon)
# Combine
if method == 'simple_average':
ensemble = np.mean(list(forecasts.values()), axis=0)
elif method == 'weighted_ensemble':
# Weight by recent performance
weights = self._calculate_weights()
ensemble = sum(w * f for w, f in zip(weights, forecasts.values()))
return ensemble, forecasts
def evaluate(self, actual, forecast):
"""Evaluate forecast and log performance"""
mae = mean_absolute_error(actual, forecast)
self.performance_history.append({
'date': pd.Timestamp.now(),
'mae': mae
})
return mae
def _calculate_weights(self):
"""Calculate model weights based on recent performance"""
# Implementation depends on your needs
pass
# Usage
forecaster = EpidemicForecaster()
forecaster.add_model('seir', SEIRModel())
forecaster.add_model('arima', ARIMAModel())
forecaster.add_model('lstm', LSTMModel())
forecaster.fit(training_data)
forecast, individual_forecasts = forecaster.forecast(horizon=14)Forecasting ≠ Surveillance. Detection tells you what’s happening now; forecasting predicts the future. The latter is exponentially harder.
Short-term works, long-term doesn’t. 1-2 week forecasts are useful. 4+ week forecasts are often worse than naive baselines.
Mechanistic models provide interpretability and scenario planning. SEIR models let you ask “what if we vaccinate 50%?” ML models can’t answer that.
Machine learning excels at pattern recognition in stable conditions. But breaks when conditions change (new variants, interventions).
Ensemble methods are most robust. Combining multiple approaches beats any single model.
Uncertainty quantification is mandatory. Never publish a single-number forecast. Always provide confidence/prediction intervals.
Evaluation is essential. Track forecast performance over time. If accuracy degrades, investigate why and retrain.
Sometimes, don’t forecast. When uncertainty is unquantifiable or forecasts would be misleading, say “we don’t know.”
Forecasts influence outcomes. The act of forecasting changes behavior, creating self-fulfilling or self-defeating prophecies.
COVID-19 Forecast Hub proved ensemble methods work. But also showed fundamental limits—novel variants break all models.
Build SIR, SEIR, ARIMA, and LSTM models for the same epidemic dataset. Which performs best at 1-week, 2-week, and 4-week horizons? Why?
For an SEIR model, vary \(\beta\) by ±20%. How much does peak timing and magnitude change? What does this tell you about parameter uncertainty?
Given forecasts from 5 models, find the optimal weights that minimize MAE on a validation set. Do these weights generalize to the test set?
Generate forecasts with 95% prediction intervals. What percentage of actual values fall within your intervals? If not 95%, how can you calibrate?
Using an SEIR model, create three scenarios: (1) No intervention, (2) 50% transmission reduction, (3) 80% transmission reduction. How different are the outcomes?
Test your knowledge of epidemic forecasting methods and their appropriate applications. Each question builds on the key concepts from this chapter.
In a basic SIR compartmental model, you calculate R₀ = 2.5. What does this mean for epidemic dynamics, and what happens when R₀ drops to 0.9 due to interventions?
Correct Answer: b) R₀ = 2.5 means the epidemic grows exponentially; R₀ = 0.9 means the epidemic dies out
The basic reproduction number (R₀) is the most important parameter in epidemic modeling:
What R₀ means: - R₀ = number of secondary infections caused by one infected person in a fully susceptible population - R₀ = 2.5: Each infected person infects 2.5 others → epidemic grows exponentially - R₀ = 0.9: Each infected person infects <1 person → epidemic dies out - R₀ = 1: Endemic equilibrium (constant case rate)
Critical threshold: R₀ > 1 → epidemic grows; R₀ < 1 → epidemic dies out
How interventions affect R₀: - R₀ = β/γ where β = transmission rate, γ = recovery rate - Social distancing reduces β → lowers R₀ - Faster isolation reduces infectious period → increases γ → lowers R₀ - A modest 10% reduction in transmission (R₀ = 2.5 → 2.3) can dramatically reduce peak cases
Examples from the chapter: - R₀ = 2.5 (no intervention) → Peak: 500,000 cases/day - R₀ = 2.0 (distancing) → Peak: 100,000 cases/day - R₀ = 0.9 (lockdown) → No peak, epidemic dies out
Why other answers are wrong: - R₀ ≠ percentage infected (that’s the attack rate) - R₀ ≠ contacts per day (it’s contacts × transmission probability / infectious period) - R₀ absolutely changes with interventions (that’s the whole point!)
The COVID-19 Forecast Hub found that 1-week ahead forecasts had ~10-15% median error (useful), while 4-week ahead forecasts had ~50% error (barely better than guessing). What fundamental challenge explains this forecast horizon problem?
Correct Answer: b) Epidemics exhibit chaotic dynamics where small changes in initial conditions, behaviors, and interventions compound over time
This is the butterfly effect in epidemiology - a fundamental characteristic of complex adaptive systems:
Why accuracy degrades with forecast horizon:
From the chapter:
| Forecast Horizon | Typical Accuracy | Usefulness |
|---|---|---|
| 1 week ahead | ±10-20% | High - operational planning |
| 2 weeks ahead | ±20-40% | Moderate - resource allocation |
| 4 weeks ahead | ±40-100% | Low - scenario planning only |
| 3+ months ahead | Worse than baseline | Essentially guesswork |
The self-fulfilling/self-defeating prophecy: - Forecasts influence behavior, which changes outcomes - Unlike weather forecasting (rain doesn’t care about predictions), epidemic forecasts affect the thing being forecasted - Long horizons amplify this feedback loop
Real-world example: - December 2021: Models predicted Omicron peak of 300K-700K cases/day - Reality (January 2022): 1.4 million cases/day (underestimated by 2-3x) - Why? Novel variant with 2-3x higher R₀ broke all assumptions
The lesson: Focus on short-term forecasts (1-2 weeks) for operational decisions. Treat longer horizons as scenario planning, not prediction.
You’re comparing mechanistic SEIR models versus LSTM deep learning models for epidemic forecasting. Which statement best describes their complementary strengths and appropriate use cases?
Correct Answer: b) SEIR models enable interpretable scenario planning (what if we vaccinate 50%?), while LSTMs excel at short-term pattern recognition in stable conditions
This captures the fundamental tension between mechanistic and machine learning approaches:
SEIR/Mechanistic Models:
Strengths: - ✅ Interpretable: Every parameter (β, γ, σ) has biological meaning - ✅ Scenario planning: Can simulate “what if we reduce transmission by 30%?” - ✅ Theory-grounded: Built on century of epidemiological knowledge - ✅ Extrapolation: Can predict novel scenarios without historical data - ✅ Policy-relevant: Stakeholders understand “if R₀ drops to X, we get Y cases”
Limitations: - ❌ Simplistic assumptions (homogeneous mixing, constant parameters) - ❌ Don’t capture heterogeneity (superspreaders, age structure) - ❌ Require knowing parameters (β, γ) that are hard to estimate
LSTM/Machine Learning Models:
Strengths: - ✅ Data-driven: Learn patterns without mechanistic assumptions - ✅ Flexible: Capture complex nonlinear relationships - ✅ Short-term accuracy: Excel at 1-2 week forecasts when conditions are stable - ✅ Multiple inputs: Can integrate cases + mobility + weather + interventions
Limitations: - ❌ No mechanistic understanding: Learn patterns, not causes - ❌ Break on distribution shift: New variants, interventions invalidate training data - ❌ Opaque: Can’t explain why predictions changed - ❌ Can’t answer counterfactuals: “What if we vaccinate?” → no historical data for that
The LSTM limitation example from the chapter: - Training data: epidemic curve + lockdown → cases drop - LSTM learns: “After cases reach X, they drop” - New epidemic: cases reach X, but no lockdown - LSTM predicts drop (WRONG!) - Problem: No concept that lockdown caused the drop
Best practice (from COVID-19 Forecast Hub): - Use both in an ensemble - SEIR for scenario planning and understanding mechanisms - LSTM for short-term pattern recognition - Ensemble combines their strengths and averages out individual failures
Why did the COVID-19 Forecast Hub consistently find that ensemble methods (combining multiple models) outperformed any individual model, even though no single “ensemble algorithm” existed?
Correct Answer: b) Different models capture different patterns; averaging reduces overfitting and provides robustness when individual models fail
Ensemble forecasting is one of the most important lessons from both the CDC FluSight Challenge and COVID-19 Forecast Hub:
Why ensembles work:
Evidence from COVID-19 Forecast Hub (Cramer et al., 2022, PNAS): - Ensemble consistently outperformed 90% of individual models - Even when all models struggled (Omicron), ensemble was least wrong - Simple median ensemble worked surprisingly well
Ensemble methods: 1. Simple average: Equal weight to all models (works well!) 2. Weighted average: Weight by historical performance 3. Median ensemble: More robust to outliers 4. Adaptive weights: Change based on recent performance
When ensemble doesn’t help: - All models use same data and make same mistakes - One model vastly superior (rare in practice) - Forecast horizon so long all models are guessing
From the chapter example: Models compared: - SEIR, ARIMA, ML → best individual MAE - Simple ensemble → better than best individual - Weighted ensemble → best overall - Improvement: ~10-20% over best single model
The lesson: Don’t put all your eggs in one model basket. Ensemble methods are more robust and reliable for real-world forecasting.
You publish a forecast predicting “severe outbreak in 2 weeks” with wide confidence intervals (10,000-100,000 cases). The public panics and implements strict measures. The outbreak is mild (15,000 cases). What forecasting challenge does this illustrate?
Correct Answer: b) The self-defeating prophecy problem: forecasts influence behavior, which changes outcomes, making the forecast appear wrong even though it drove the right action
This is one of the most profound challenges in epidemic forecasting - unique to predictions that influence human behavior:
The paradox: Success makes you look wrong
Scenario 1: Self-defeating prophecy (this question) - Model predicts catastrophic outbreak - Public panics, implements strict measures - Outbreak is mild - Conclusion: “Model was wrong! Overreaction!” - Reality: Model was right about unmitigated scenario; intervention changed trajectory
Scenario 2: Self-fulfilling prophecy - Model predicts mild outbreak - Public relaxes precautions - Outbreak explodes - Conclusion: “Model was wrong! Should have prepared!” - Reality: Model was right for baseline scenario; behavior change amplified outbreak
Unlike weather forecasting: - Rain forecast doesn’t affect whether it rains - Epidemic forecasts DO affect transmission (people change behavior) - Creates impossible situation for forecasters
The Michael Fish Problem (from chapter): - 1987: BBC weatherman dismissed hurricane concerns - Hurricane hit anyway → 18 died, permanent trust damage - Lesson: Reassuring forecasts that turn out wrong destroy trust permanently - But alarming forecasts that don’t materialize also erode trust (more slowly)
Ethical best practices:
✅ Clearly distinguish forecasts from scenarios: - “If current trends continue…” (forecast) - “If we do nothing…” (scenario) - “If we implement lockdown…” (scenario)
✅ Communicate uncertainty honestly: - Wide confidence intervals (10K-100K) show genuine uncertainty - Better than false precision (53,472 cases!)
✅ Frame conditionally: - “Without intervention: 50,000-100,000 cases” - “With strong measures: 10,000-30,000 cases”
✅ Update as conditions change: - “Our forecast assumed no intervention; behavior changed, so we update”
From Ioannidis et al., 2020: > “Forecasters must resist pressure to provide false certainty. Forecasts influence policy and behavior, creating special responsibility.”
When NOT to forecast: - Novel pathogen (no data) - Major intervention pending - Model disagreement too high - Better to say: “Current uncertainty is too high for actionable forecasts”
You’re building an epidemic forecasting system for influenza. Your ARIMA model achieves 12% MAE, while a “naive” persistence model (tomorrow = today) achieves 15% MAE. Your LSTM achieves 11% MAE but requires 100x more computational resources. What should you deploy?
Correct Answer: b) ARIMA model because it balances accuracy and simplicity, and beats the naive baseline
This question tests practical judgment about model selection and deployment tradeoffs:
Why ARIMA is the best choice here:
Why NOT choose LSTM (11% MAE): - Only 1% absolute improvement over ARIMA (8% relative improvement) - 100x computational cost for marginal gain - Harder to maintain and debug - Opaque predictions harder to explain to health departments - Principle: Simplest model that beats baseline and meets requirements
Why NOT choose naive baseline: - 15% MAE is worse than ARIMA - Doesn’t capture seasonality - However, always compute it as benchmark!
Why ensemble might not be best: - More complex to maintain - Marginal improvement unlikely to justify complexity for weekly flu forecasts - Better for high-stakes scenarios (pandemic)
From the chapter’s model selection strategy:
1. Naive baseline (persistence: tomorrow = today)
↓ (beat this? Yes → continue)
2. Time series model (ARIMA, Prophet) ← WE ARE HERE
↓ (need scenario planning? No)
3. Mechanistic model (SEIR)
↓ (have large training data + need more accuracy? No)
4. Machine learning (LSTM, ensemble)
Best practice for production systems: 1. Deploy ARIMA as primary model 2. Monitor performance weekly 3. Keep naive baseline for comparison 4. If accuracy degrades, investigate and retrain 5. Consider LSTM only if: - Accuracy requirements become critical - Computational resources available - Interpretability less important
Context matters: - Hospital bed allocation (low stakes): ARIMA is fine - Pandemic ventilator procurement (high stakes): Might justify LSTM or ensemble - Public communication: Need interpretable model (ARIMA)
The chapter emphasizes: “Never skip the naive baseline! If your fancy model doesn’t beat ‘tomorrow will be like today,’ it’s not adding value.”
SEIR models assume homogeneous mixing (everyone contacts everyone equally). How does this assumption break down in reality? How might you modify SEIR to be more realistic?
The COVID-19 Forecast Hub ensemble outperformed individual models, but still failed to predict Omicron. What does this tell us about the limits of forecasting? When should we abandon forecasting and focus on scenario planning?
Mechanistic models (SEIR) are interpretable but make strong assumptions. ML models (LSTMs) are flexible but opaque. For public health decision-making, which is more important: interpretability or accuracy? Does it depend on the decision?
Imagine you’re a state health department epidemiologist in March 2020. Multiple models predict vastly different COVID-19 death tolls (100K to 2M). How do you communicate this uncertainty to policymakers? To the public?
Google Flu Trends failed because search behavior changed when people became aware of the model. Is this problem inherent to all epidemic forecasting, or specific to web-based surveillance?
The COVID-19 Forecast Hub published forecasts weekly, even when model uncertainty was extremely high. Was this ethical? Should forecasters refuse to forecast when uncertainty is unquantifiable?
Forecasting Evaluation: - Reich et al., 2019, Clinical Infectious Diseases - FluSight evaluation (Reich et al. 2019) - Key case study showing ensemble methods work - Cramer et al., 2022, PNAS - COVID-19 Forecast Hub (Cramer et al. 2022) - Essential reading on lessons learned from COVID-19 forecasting - Viboud et al., 2018, PNAS - RAPIDD forecasting challenge - Comparative evaluation of forecasting approaches
Machine Learning Methods: - Chimmula & Zhang, 2020, Chaos, Solitons & Fractals - LSTM forecasting - Deep learning for epidemic forecasting
Ethics and Limitations: - Ioannidis et al., 2020, International Journal of Forecasting - Forecasting for COVID-19 - Critical perspective on forecasting ethics and responsibility
Forecasting is humbling. The future is uncertain, models are wrong, and interventions change outcomes. But with proper humility, transparency, and ensemble methods, forecasts can still inform better decisions than guessing.
Next: Chapter 6: Genomic Surveillance and Pathogen Analysis →