7 Epidemic Forecasting
Learning Objectives
By the end of this chapter, you will:
- Understand the fundamental difference between outbreak detection and prediction
- Implement mechanistic models (SIR, SEIR) and understand their assumptions
- Apply time series methods (ARIMA) for short-term forecasting
- Build deep learning models (LSTMs) for complex epidemic patterns
- Create ensemble forecasting systems that combine multiple approaches
- Quantify and communicate forecast uncertainty rigorously
- Recognize why epidemic forecasting is fundamentally harder than weather forecasting
- Navigate the ethical implications of publishing epidemic forecasts
- Evaluate forecast performance using proper metrics
Time to complete: 120-150 minutes
Prerequisites: Chapter 4: Surveillance (strongly recommended), Chapter 2: AI Basics, Chapter 3: Data
What you’ll build: 💻 Six complete epidemic forecasting systems including SIR/SEIR mechanistic models, ARIMA time series forecasting, LSTM deep learning models, ensemble forecasting combining multiple approaches, and uncertainty quantification with confidence intervals
7.2 Introduction: From “What Is Happening?” to “What Will Happen?”
March 2020, Washington, D.C.: The White House Coronavirus Task Force presents epidemic models to the President. The projections are sobering: without mitigation, 1.5-2.2 million Americans could die. With social distancing, 100,000-240,000 deaths.
The nation locked down. Schools closed. Businesses shuttered.
January 2023: The U.S. passes 1.1 million COVID-19 deaths—far exceeding even the “with mitigation” scenario.
What went wrong? Were the models bad? Was the data flawed? Or is epidemic forecasting simply impossible?
Late 2021, Global: Multiple forecasting models completely missed the magnitude of the Omicron wave. Most predicted a moderate increase in cases. Reality: a vertical spike that shattered all previous records.
January 2022: Models that worked well for Delta suddenly failed for Omicron. Why? The virus evolved, immune landscapes changed, behavior shifted, and the fundamental assumptions of every model broke.
These aren’t just modeling failures—they’re lessons about the fundamental limits of prediction in complex adaptive systems.
7.2.1 The Critical Distinction
Surveillance (Chapter 4): “An outbreak is happening NOW” - Based on current/recent data - Relatively objective - Can be verified immediately
Forecasting (this chapter): “Here’s what will happen in 2 weeks” - Predicting the future - Depends on assumptions - Can’t be verified until the future arrives
The difference matters enormously:
For resource allocation: - How many hospital beds will we need? - Should we order more ventilators? - Do we need to open field hospitals?
For policy decisions: - Should we implement lockdowns? - When can schools safely reopen? - Should we mandate masks?
For public communication: - Should people cancel travel plans? - Is it safe to visit elderly relatives? - Do we need to stock up on supplies?
Get surveillance wrong → Miss an outbreak
Get forecasting wrong → Waste billions, lose public trust, or fail to prevent catastrophe
7.2.2 Why Forecasting Is Harder Than Detection
7.2.2.1 The Butterfly Effect in Epidemiology
In chaos theory, the butterfly effect describes how small changes in initial conditions lead to vastly different outcomes. Epidemics exhibit this behavior in extreme form:
Small changes in behavior → Massive changes in outcomes
R₀ = 2.5 (no interventions)
→ Peak: 500,000 cases/day
R₀ = 2.3 (modest social distancing)
→ Peak: 300,000 cases/day
R₀ = 2.0 (strong distancing)
→ Peak: 100,000 cases/day
R₀ = 0.9 (lockdown)
→ No peak, epidemic dies out
A 10% reduction in transmission can mean the difference between epidemic and no epidemic. Forecasting requires getting this exactly right.
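To make this sensitivity concrete, the basic SIR model introduced later in this chapter (Section 7.3) has a closed-form peak prevalence when nearly everyone starts out susceptible; treat the values below as a rough illustration under that idealized assumption, not a forecast:
\[\frac{I_{\max}}{N} = 1 - \frac{1 + \ln R_0}{R_0} \quad\Rightarrow\quad R_0 = 2.5: \approx 23\%, \qquad R_0 = 2.0: \approx 15\%, \qquad R_0 = 1.1: \approx 0.4\%\]
A modest reduction in \(R_0\) shrinks the peak disproportionately, and just below \(R_0 = 1\) there is no peak at all.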
7.2.2.2 The Self-Fulfilling/Self-Defeating Prophecy Problem
Unlike weather forecasting, epidemic forecasts influence the thing being forecasted.
Scenario 1: Self-defeating prophecy - Model predicts catastrophic outbreak - Public panics, implements strict measures - Outbreak is mild - Conclusion: “Model was wrong! Overreaction!”
Scenario 2: Self-fulfilling prophecy - Model predicts mild outbreak - Public relaxes precautions - Outbreak explodes - Conclusion: “Model was wrong! Should have prepared!”
This creates an impossible situation: Success makes you look wrong.
In October 1987, BBC weatherman Michael Fish famously dismissed viewer concerns about an approaching hurricane: “Earlier on today, apparently, a woman rang the BBC and said she heard there was a hurricane on the way. Well, if you’re watching, don’t worry, there isn’t.”
That night, the Great Storm of 1987 hit the UK—the worst storm in 300 years. Eighteen people died. Millions of trees fell. The public never forgot.
Lesson for epidemic forecasting: Reassuring forecasts that turn out wrong destroy trust permanently. Conservative (alarming) forecasts that don’t materialize also destroy trust, but more slowly.
The safest strategy: Communicate uncertainty honestly. Never publish a single-number forecast without ranges.
7.2.2.3 The Horizon Problem
Forecast accuracy degrades rapidly with time:
| Forecast Horizon | Typical Accuracy | Usefulness |
|---|---|---|
| 1 week ahead | ±10-20% | High - operational planning |
| 2 weeks ahead | ±20-40% | Moderate - resource allocation |
| 4 weeks ahead | ±40-100% | Low - scenario planning only |
| 3+ months ahead | Often worse than naive baseline | Essentially guesswork |
Why does accuracy degrade? - Interventions you can’t predict (lockdowns, vaccine campaigns) - Behavior changes (people react to case counts) - Viral evolution (new variants) - Reporting changes (testing policy shifts) - Weather and seasonality effects - Random superspreading events
The 2-week wall: Most successful forecasting systems focus on 1-2 week horizons. Anything beyond that is scenario planning, not prediction.
The COVID-19 Forecast Hub, coordinated by the Reich Lab at UMass Amherst, collected forecasts from 50+ teams starting in March 2020.
Key findings from Cramer et al., 2022, PNAS:
- 1-week ahead forecasts: Median error ~10-15% (useful!)
- 4-week ahead forecasts: Median error ~50% (barely better than guessing)
- Ensemble models consistently outperformed any individual model
- Novel variants caused all models to fail simultaneously
We’ll examine this in detail later in the chapter.
7.3 The Mechanistic Approach: Compartmental Models
Before machine learning existed, epidemiologists developed mechanistic models—mathematical frameworks based on how diseases actually spread.
These models remain essential today. Why? They’re interpretable, biologically grounded, and enable scenario planning.
7.3.1 SIR Model: The Foundation
The SIR model, introduced by Kermack and McKendrick in 1927, divides a population into three compartments:
- S: Susceptible (can get infected)
- I: Infected (currently infectious)
- R: Recovered/Removed (immune or dead)
The flow:
S → I → R
Governed by two parameters: - β (beta): Transmission rate (contacts per day × probability of transmission) - γ (gamma): Recovery rate (1/infectious period)
The basic reproduction number: \[R_0 = \frac{\beta}{\gamma}\]
\(R_0\) tells you: How many people does one infected person infect in a fully susceptible population?
- \(R_0 < 1\): Epidemic dies out
- \(R_0 = 1\): Threshold: each case exactly replaces itself (no sustained growth)
- \(R_0 > 1\): Epidemic grows
The mathematics:
\[\frac{dS}{dt} = -\beta \frac{S \cdot I}{N}\]
\[\frac{dI}{dt} = \beta \frac{S \cdot I}{N} - \gamma I\]
\[\frac{dR}{dt} = \gamma I\]
Where \(N = S + I + R\) (total population).
Let’s implement this:
import numpy as np
from scipy.integrate import odeint
import matplotlib.pyplot as plt
def sir_model(y, t, beta, gamma, N):
    """
    Basic SIR model differential equations

    Parameters:
    - y: [S, I, R] current state
    - t: time
    - beta: transmission rate
    - gamma: recovery rate
    - N: total population

    Returns:
    - [dS/dt, dI/dt, dR/dt]
    """
    S, I, R = y

    dS_dt = -beta * S * I / N
    dI_dt = beta * S * I / N - gamma * I
    dR_dt = gamma * I

    return dS_dt, dI_dt, dR_dt

# Parameters for a measles-like disease
N = 1_000_000     # Total population (e.g., a city)
I0 = 100          # Initial infected
R0 = 0            # Initial recovered (no prior immunity)
S0 = N - I0 - R0  # Initial susceptible

beta = 0.5   # Transmission rate (contacts per day × transmission probability)
gamma = 0.1  # Recovery rate (1/10 day infectious period)

# Time vector (days)
t = np.linspace(0, 200, 200)

# Initial conditions
y0 = S0, I0, R0

# Solve ODE system
solution = odeint(sir_model, y0, t, args=(beta, gamma, N))
S, I, R = solution.T

# Calculate R₀
R0_value = beta / gamma
print(f"Basic Reproduction Number (R₀): {R0_value:.2f}")

# Visualize
fig, axes = plt.subplots(1, 2, figsize=(16, 6))

# Left panel: Epidemic curves
axes[0].plot(t, S, label='Susceptible', color='blue', linewidth=2)
axes[0].plot(t, I, label='Infected', color='red', linewidth=3)
axes[0].plot(t, R, label='Recovered', color='green', linewidth=2)

# Mark peak
peak_idx = np.argmax(I)
axes[0].axvline(t[peak_idx], color='red', linestyle='--', alpha=0.5,
                label=f'Peak: Day {t[peak_idx]:.0f}')
axes[0].axhline(I[peak_idx], color='red', linestyle='--', alpha=0.5)

axes[0].set_xlabel('Days', fontsize=12)
axes[0].set_ylabel('Population', fontsize=12)
axes[0].set_title(f'SIR Model Epidemic Curve (R₀={R0_value:.1f})', fontsize=14)
axes[0].legend(fontsize=11)
axes[0].grid(True, alpha=0.3)

# Right panel: Phase portrait (S vs I)
axes[1].plot(S, I, 'k-', linewidth=2)
axes[1].scatter([S0], [I0], color='red', s=200, zorder=5,
                marker='*', label='Start', edgecolors='black', linewidth=2)
axes[1].scatter([S[-1]], [I[-1]], color='green', s=200, zorder=5,
                marker='*', label='End', edgecolors='black', linewidth=2)

# Herd immunity threshold line
herd_immunity_threshold = N * (1 - 1/R0_value)
axes[1].axvline(herd_immunity_threshold, color='orange', linestyle='--',
                linewidth=2, label=f'Herd Immunity: {herd_immunity_threshold/N*100:.0f}%')

axes[1].set_xlabel('Susceptible Population', fontsize=12)
axes[1].set_ylabel('Infected Population', fontsize=12)
axes[1].set_title('Phase Portrait: S vs I', fontsize=14)
axes[1].legend(fontsize=11)
axes[1].grid(True, alpha=0.3)

plt.tight_layout()
plt.savefig('sir_basic_model.png', dpi=300)
plt.show()

# Calculate key metrics
peak_infected_idx = np.argmax(I)
peak_day = t[peak_infected_idx]
peak_cases = I[peak_infected_idx]
final_attack_rate = (N - S[-1]) / N * 100
herd_immunity = (1 - 1/R0_value) * 100

print("\n" + "="*60)
print("SIR MODEL RESULTS")
print("="*60)
print(f"\nR₀ (Basic Reproduction Number): {R0_value:.2f}")
print(f"Herd Immunity Threshold: {herd_immunity:.1f}% of population")
print(f"\nPeak Infections: {peak_cases:,.0f} people")
print(f"Peak Day: {peak_day:.0f}")
print(f"\nFinal Attack Rate: {final_attack_rate:.1f}% of population infected")
print(f"Final Susceptible: {S[-1]:,.0f} people ({S[-1]/N*100:.1f}%)")
print(f"Total Infected: {R[-1]:,.0f} people")

# Demonstrate impact of interventions
print("\n" + "="*60)
print("SCENARIO ANALYSIS: Impact of Interventions")
print("="*60)

scenarios = {
    'No intervention (R₀=2.5)': 2.5,
    'Masks + distancing (R₀=2.0)': 2.0,
    'Partial lockdown (R₀=1.5)': 1.5,
    'Full lockdown (R₀=0.9)': 0.9
}

for scenario_name, R0_scenario in scenarios.items():
    beta_scenario = R0_scenario * gamma
    solution_scenario = odeint(sir_model, y0, t, args=(beta_scenario, gamma, N))
    S_s, I_s, R_s = solution_scenario.T

    peak_infections = I_s.max()
    total_infected = R_s[-1]
    attack_rate = total_infected / N * 100

    print(f"\n{scenario_name}")
    print(f"  Peak infections: {peak_infections:,.0f}")
    print(f"  Total infected: {total_infected:,.0f} ({attack_rate:.1f}%)")
R₀ is the most important number in epidemiology, but it’s often misunderstood.
What it is: - Number of secondary infections caused by one infected person in a fully susceptible population - Determines whether an epidemic grows or dies out - Threshold for herd immunity: \(1 - 1/R_0\)
What it is NOT: - Not the number of people infected per day (that’s incidence) - Not constant over time (behavior, interventions, immunity all change it) - Not inherent to a pathogen alone (depends on population behavior and contact patterns)
Examples: - Measles: \(R_0 \approx\) 12-18 (extremely contagious) - Influenza: \(R_0 \approx\) 1.3-1.8 (moderately contagious) - Original SARS-CoV-2: \(R_0 \approx\) 2-3 - Delta variant: \(R_0 \approx\) 5-8 - Omicron: \(R_0 \approx\) 10-15
For comprehensive review, see Delamater et al., 2019, Emerging Infectious Diseases.
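As a quick check on the threshold formula \(1 - 1/R_0\), a few lines of Python (using single representative values from the ranges above, which is a simplification) reproduce why measles demands near-universal immunity while influenza does not:

```python
# Herd immunity threshold: fraction immune needed so each case infects < 1 other person
representative_R0 = {"Influenza": 1.5, "Original SARS-CoV-2": 2.5, "Delta": 6, "Measles": 15}

for pathogen, r0 in representative_R0.items():
    threshold = 1 - 1 / r0
    print(f"{pathogen:20s} R0 = {r0:>4}  threshold ≈ {threshold:.0%}")
# Influenza ≈ 33%, original SARS-CoV-2 ≈ 60%, Delta ≈ 83%, Measles ≈ 93%
```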
7.3.2 SEIR Model: Adding the Latent Period
The SIR model assumes people become infectious immediately after infection. For most diseases, this is wrong—there’s an incubation period.
SEIR adds the Exposed compartment:
S → E → I → R
- E: Exposed (infected but not yet infectious)
New parameter: - σ (sigma): Rate of progression from E to I (1/incubation period)
The mathematics:
\[\frac{dS}{dt} = -\beta \frac{S \cdot I}{N}\]
\[\frac{dE}{dt} = \beta \frac{S \cdot I}{N} - \sigma E\]
\[\frac{dI}{dt} = \sigma E - \gamma I\]
\[\frac{dR}{dt} = \gamma I\]
Why this matters:
For diseases like COVID-19 with long incubation periods (5-6 days), the SEIR model produces more realistic epidemic curves with delayed and broader peaks.
def seir_model(y, t, beta, sigma, gamma, N):
    """
    SEIR model with exposed/latent period

    Parameters:
    - y: [S, E, I, R] current state
    - sigma: rate of progression from E to I (1/incubation period)
    """
    S, E, I, R = y

    dS_dt = -beta * S * I / N
    dE_dt = beta * S * I / N - sigma * E
    dI_dt = sigma * E - gamma * I
    dR_dt = gamma * I

    return dS_dt, dE_dt, dI_dt, dR_dt

# Parameters for COVID-19-like disease
sigma = 1/5.5  # 5.5-day incubation period
gamma = 1/10   # 10-day infectious period
beta = 0.5

E0 = 50  # Initial exposed
I0 = 10  # Initial infected
R0 = 0
S0 = N - E0 - I0 - R0

# Solve SEIR
y0_seir = S0, E0, I0, R0
solution_seir = odeint(seir_model, y0_seir, t, args=(beta, sigma, gamma, N))
S_seir, E_seir, I_seir, R_seir = solution_seir.T

# Solve SIR for comparison (combine E and I into I)
y0_sir = S0 + E0, I0, R0
solution_sir = odeint(sir_model, y0_sir, t, args=(beta, gamma, N))
S_sir, I_sir, R_sir = solution_sir.T

# Compare
fig, axes = plt.subplots(2, 1, figsize=(14, 10))

# Top panel: SEIR compartments
axes[0].plot(t, S_seir, label='Susceptible', color='blue', linewidth=2)
axes[0].plot(t, E_seir, label='Exposed (not yet infectious)', color='orange', linewidth=2)
axes[0].plot(t, I_seir, label='Infected (infectious)', color='red', linewidth=3)
axes[0].plot(t, R_seir, label='Recovered', color='green', linewidth=2)

axes[0].set_ylabel('Population', fontsize=12)
axes[0].set_title('SEIR Model: Four Compartments', fontsize=14)
axes[0].legend(fontsize=11)
axes[0].grid(True, alpha=0.3)

# Bottom panel: Compare infectious populations (SIR vs SEIR)
axes[1].plot(t, I_sir, '--', label='SIR: Infected (no latency)',
             color='pink', linewidth=3, alpha=0.7)
axes[1].plot(t, I_seir, '-', label='SEIR: Infected (with latent period)',
             color='red', linewidth=3)
axes[1].plot(t, E_seir, '-', label='SEIR: Exposed (not infectious)',
             color='orange', linewidth=2, alpha=0.7)

# Mark peaks
sir_peak_day = t[np.argmax(I_sir)]
seir_peak_day = t[np.argmax(I_seir)]
delay = seir_peak_day - sir_peak_day

axes[1].axvline(sir_peak_day, color='pink', linestyle=':', alpha=0.5)
axes[1].axvline(seir_peak_day, color='red', linestyle=':', alpha=0.5)

axes[1].annotate(f'Peak delay: {delay:.0f} days',
                 xy=(seir_peak_day, I_seir.max()),
                 xytext=(seir_peak_day + 20, I_seir.max() * 0.8),
                 arrowprops=dict(arrowstyle='->', color='black', lw=2),
                 fontsize=12, bbox=dict(boxstyle='round', facecolor='wheat'))

axes[1].set_xlabel('Days', fontsize=12)
axes[1].set_ylabel('Population', fontsize=12)
axes[1].set_title('SIR vs SEIR: Impact of Latent Period', fontsize=14)
axes[1].legend(fontsize=11)
axes[1].grid(True, alpha=0.3)

plt.tight_layout()
plt.savefig('seir_vs_sir.png', dpi=300)
plt.show()

print(f"\nPeak timing difference: {delay:.1f} days")
print(f"SIR peak: {I_sir.max():,.0f} people on day {sir_peak_day:.0f}")
print(f"SEIR peak: {I_seir.max():,.0f} people on day {seir_peak_day:.0f}")
print(f"Peak reduction: {(1 - I_seir.max()/I_sir.max())*100:.1f}%")
The E compartment:
1. Delays the epidemic peak (by roughly one incubation period)
2. Reduces the peak height slightly (infections are spread out over more time)
3. Makes interventions less effective if they only target symptomatic individuals (people in E are already infected but show no symptoms yet)
4. Creates a "pre-symptomatic transmission" problem (critical for COVID-19)
For diseases with long incubation periods (COVID-19, Ebola), SEIR is essential. For diseases with short/negligible incubation (norovirus), SIR suffices.
7.3.3 Advanced Compartmental Models
Real epidemic modeling often requires more complexity:
7.3.3.1 SEIRD: Adding Death
\[S \to E \to I \to \begin{cases} R \text{ (recovered)} \\ D \text{ (dead)} \end{cases}\]
- Split \(I \to R\) into \(I \to R\) (probability \(1-\text{CFR}\)) and \(I \to D\) (probability \(\text{CFR}\))
- Essential for forecasting mortality burden
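A minimal sketch of the SEIRD right-hand side, assuming a constant case fatality ratio (CFR) splits the outflow from I between recovery and death; the parameter values below are illustrative, not calibrated to any real outbreak:

```python
import numpy as np
from scipy.integrate import odeint

def seird_model(y, t, beta, sigma, gamma, cfr, N):
    """SEIRD: SEIR plus a death compartment, splitting exits from I by the CFR."""
    S, E, I, R, D = y
    dS = -beta * S * I / N
    dE = beta * S * I / N - sigma * E
    dI = sigma * E - gamma * I
    dR = (1 - cfr) * gamma * I   # fraction (1 - CFR) recovers
    dD = cfr * gamma * I         # fraction CFR dies
    return dS, dE, dI, dR, dD

# Illustrative run: 1% CFR, COVID-19-like timing
N = 1_000_000
y0 = (N - 100, 50, 50, 0, 0)   # S, E, I, R, D
t = np.linspace(0, 300, 300)
solution = odeint(seird_model, y0, t, args=(0.5, 1/5.5, 1/10, 0.01, N))
print(f"Projected cumulative deaths at day 300: {solution[-1, 4]:,.0f}")
```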
7.3.3.2 SEIRS: Waning Immunity
\[S \to E \to I \to R \to S\]
- Immunity doesn’t last forever (influenza, coronaviruses)
- Creates endemic equilibrium (disease doesn’t disappear)
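A matching SEIRS sketch adds a waning rate (here called omega, the reciprocal of the average duration of immunity; this parameter name is not defined elsewhere in the chapter) that returns recovered individuals to the susceptible pool:

```python
def seirs_model(y, t, beta, sigma, gamma, omega, N):
    """SEIRS: immunity wanes at rate omega, so R flows back into S."""
    S, E, I, R = y
    dS = -beta * S * I / N + omega * R   # waning immunity replenishes susceptibles
    dE = beta * S * I / N - sigma * E
    dI = sigma * E - gamma * I
    dR = gamma * I - omega * R
    return dS, dE, dI, dR
# With omega > 0 the epidemic settles toward an endemic level instead of burning out.
```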
7.3.3.3 Age-Structured Models
Different parameters for different age groups:
# Children (0-17)
beta_children = 0.3
gamma_children = 0.15

# Adults (18-64)
beta_adults = 0.5
gamma_adults = 0.1

# Elderly (65+)
beta_elderly = 0.4
gamma_elderly = 0.08
cfr_elderly = 0.15  # Higher case fatality rate
Critical for COVID-19 where age dramatically affects outcomes.
7.3.3.4 Spatial Models
Divide population into geographic regions with mobility between them:
# Movement between regions
mobility_matrix = [
    [0.95, 0.03, 0.02],  # Region 1: 95% stay, 3% to R2, 2% to R3
    [0.02, 0.96, 0.02],  # Region 2
    [0.01, 0.02, 0.97]   # Region 3
]
For more on spatial models, see Keeling & Rohani, 2008, Modeling Infectious Diseases.
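One simple way to couple regions with such a matrix (a sketch of a generic metapopulation scheme, not the specific formulation in Keeling & Rohani; the helper name and parameters are invented for illustration) is to compute each region's force of infection from the people who actually mix there after movement:

```python
import numpy as np

def metapop_sir_step(S, I, R, beta, gamma, mobility, dt=1.0):
    """One Euler step of a toy metapopulation SIR.

    S, I, R are per-region counts; mobility[i][j] is the fraction of region i's
    residents who mix in region j (rows sum to 1).
    """
    M = np.asarray(mobility)
    N = S + I + R
    N_present = M.T @ N            # people present in each region after movement
    I_present = M.T @ I            # infectious people present in each region
    prevalence = I_present / N_present
    foi = beta * (M @ prevalence)  # force of infection felt by each home region
    new_inf = foi * S * dt
    new_rec = gamma * I * dt
    return S - new_inf, I + new_inf - new_rec, R + new_rec

# Illustrative run with the 3-region matrix above
mobility_matrix = [[0.95, 0.03, 0.02], [0.02, 0.96, 0.02], [0.01, 0.02, 0.97]]
S = np.array([99_000.0, 49_500.0, 19_800.0])
I = np.array([1_000.0, 500.0, 200.0])
R = np.zeros(3)
for _ in range(30):
    S, I, R = metapop_sir_step(S, I, R, beta=0.4, gamma=0.2, mobility=mobility_matrix)
print("Infectious per region after 30 days:", np.round(I))
```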
7.3.4 Why Mechanistic Models Matter
Strengths:
✅ Interpretable: Every parameter has biological meaning (transmission rate, recovery rate, etc.)
✅ Scenario planning: Can simulate interventions (vaccination campaigns, social distancing, school closures)
✅ Theory-grounded: Built on century of epidemiological knowledge
✅ Extrapolation: Can predict novel scenarios (new pathogen, untested interventions)
✅ Policy-relevant: Policymakers understand “if R₀ drops to X, we get Y cases”
Limitations:
❌ Simplistic assumptions: Homogeneous mixing (everyone contacts everyone equally—clearly false)
❌ Constant parameters: \(\beta\) and \(\gamma\) change with behavior, seasons, immunity
❌ No heterogeneity: Superspreaders, age structure, spatial clustering all ignored in basic models
❌ Parameter uncertainty: Don’t know true \(\beta\), \(\gamma\) until epidemic is over
❌ Forecast degradation: Accurate for 1-2 weeks, unreliable beyond that
Mechanistic models are: - Transparent and interpretable - Based on biological mechanisms - But require assumptions that are often violated in reality
Machine learning models (next section) are: - Flexible and data-driven - Don’t require mechanistic assumptions - But opaque and hard to extrapolate beyond training data
Neither alone is sufficient. The best forecasting systems use both.
7.4 The Machine Learning Approach: Data-Driven Forecasting
What if we don’t want to specify transmission rates and recovery rates? What if we just want to learn patterns from data?
Enter machine learning forecasting.
7.4.1 Time Series Forecasting with ARIMA
ARIMA (AutoRegressive Integrated Moving Average) is a classical statistical method for time series forecasting.
Why ARIMA for epidemics: - Captures temporal autocorrelation (yesterday’s cases predict today’s) - Handles seasonality (flu peaks in winter, dengue in rainy season) - No mechanistic assumptions required - Works with limited historical data
ARIMA components:
AR (AutoRegressive): Use past values to predict future \[y_t = c + \phi_1 y_{t-1} + \phi_2 y_{t-2} + ... + \epsilon_t\]
I (Integrated): Differencing to make series stationary \[\Delta y_t = y_t - y_{t-1}\]
MA (Moving Average): Use past errors to improve prediction \[y_t = \mu + \epsilon_t + \theta_1 \epsilon_{t-1} + \theta_2 \epsilon_{t-2} + ...\]
ARIMA(p, d, q): - p: Number of autoregressive terms - d: Degree of differencing - q: Number of moving average terms
For seasonal data: SARIMA adds seasonal components.
Complete implementation for influenza forecasting:
import pandas as pd
import numpy as np
from statsmodels.tsa.statespace.sarimax import SARIMAX
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf
from statsmodels.tsa.seasonal import seasonal_decompose
import matplotlib.pyplot as plt
from sklearn.metrics import mean_absolute_error, mean_squared_error
# Simulate realistic influenza data (ILINet-style)
np.random.seed(42)
weeks = pd.date_range('2018-01-01', periods=260, freq='W-MON')
week_of_year = weeks.isocalendar().week

# Seasonal flu pattern (peaks in winter)
baseline = 2.0 + 2.5 * np.exp(-((week_of_year - 6) % 52)**2 / 50)

# Add year-to-year variation
year_effect = np.repeat([0, 0.3, -0.2, 0.5, 0.1], 52)[:len(weeks)]

# Random noise
noise = np.random.normal(0, 0.25, len(weeks))

# Combine
ili_pct = baseline + year_effect + noise
ili_pct = np.maximum(ili_pct, 0.5)  # ILI never goes to zero

df = pd.DataFrame({'date': weeks, 'ili_pct': ili_pct})
df.set_index('date', inplace=True)

print(df.head(10))
print(f"\nTotal weeks: {len(df)}")
print(f"Date range: {df.index.min()} to {df.index.max()}")

# Train/test split (reserve last season for testing)
train_size = 208  # 4 years training
train = df.iloc[:train_size]
test = df.iloc[train_size:]

print(f"\nTraining weeks: {len(train)}")
print(f"Test weeks: {len(test)}")

# Exploratory analysis
fig, axes = plt.subplots(3, 1, figsize=(14, 12))

# Raw time series
axes[0].plot(train.index, train['ili_pct'], label='Training Data', color='blue')
axes[0].plot(test.index, test['ili_pct'], label='Test Data', color='orange')
axes[0].set_ylabel('ILI %', fontsize=12)
axes[0].set_title('Influenza-Like Illness Time Series', fontsize=14)
axes[0].legend()
axes[0].grid(True, alpha=0.3)

# ACF (Autocorrelation Function)
plot_acf(train['ili_pct'], lags=52, ax=axes[1], alpha=0.05)
axes[1].set_title('Autocorrelation Function (ACF)', fontsize=14)
axes[1].set_xlabel('Lag (weeks)')

# PACF (Partial Autocorrelation Function)
plot_pacf(train['ili_pct'], lags=52, ax=axes[2], alpha=0.05)
axes[2].set_title('Partial Autocorrelation Function (PACF)', fontsize=14)
axes[2].set_xlabel('Lag (weeks)')

plt.tight_layout()
plt.savefig('arima_exploratory.png', dpi=300)
plt.show()

# Seasonal decomposition
decomposition = seasonal_decompose(train['ili_pct'], model='additive', period=52)

fig, axes = plt.subplots(4, 1, figsize=(14, 12))
decomposition.observed.plot(ax=axes[0], title='Observed')
decomposition.trend.plot(ax=axes[1], title='Trend')
decomposition.seasonal.plot(ax=axes[2], title='Seasonal')
decomposition.resid.plot(ax=axes[3], title='Residual')

for ax in axes:
    ax.grid(True, alpha=0.3)

plt.tight_layout()
plt.savefig('seasonal_decomposition.png', dpi=300)
plt.show()

# Fit SARIMA model
# SARIMA(p, d, q)(P, D, Q, s)
# (2, 0, 1) for non-seasonal components
# (1, 0, 1, 52) for seasonal components with 52-week period
print("\nFitting SARIMA model...")

model = SARIMAX(
    train['ili_pct'],
    order=(2, 0, 1),               # (p, d, q)
    seasonal_order=(1, 0, 1, 52),  # (P, D, Q, s)
    enforce_stationarity=False,
    enforce_invertibility=False
)
fit = model.fit(disp=False, maxiter=200)

print("\n" + "="*60)
print("MODEL SUMMARY")
print("="*60)
print(fit.summary())

# Forecast
forecast_steps = len(test)
forecast_result = fit.get_forecast(steps=forecast_steps)

# Get predictions and confidence intervals
forecast_mean = forecast_result.predicted_mean
forecast_ci = forecast_result.conf_int()

# Create forecast dataframe
forecast_df = pd.DataFrame({
    'date': test.index,
    'actual': test['ili_pct'].values,
    'forecast': forecast_mean.values,
    'lower_ci': forecast_ci.iloc[:, 0].values,
    'upper_ci': forecast_ci.iloc[:, 1].values
})

# Visualize forecast
fig, axes = plt.subplots(2, 1, figsize=(16, 10))

# Top: Full time series with forecast
axes[0].plot(train.index, train['ili_pct'], 'o-',
             label='Training Data', color='blue', alpha=0.6, markersize=4)
axes[0].plot(test.index, test['ili_pct'], 'o-',
             label='Actual (Test)', color='black', linewidth=2, markersize=6)
axes[0].plot(forecast_df['date'], forecast_df['forecast'], 's-',
             label='SARIMA Forecast', color='red', linewidth=2, markersize=6)

# Confidence interval
axes[0].fill_between(forecast_df['date'],
                     forecast_df['lower_ci'],
                     forecast_df['upper_ci'],
                     alpha=0.3, color='red', label='95% Confidence Interval')

axes[0].axvline(train.index[-1], color='green', linestyle='--',
                linewidth=2, label='Train/Test Split')

axes[0].set_ylabel('ILI %', fontsize=12)
axes[0].set_title('SARIMA Forecast: Influenza-Like Illness', fontsize=14)
axes[0].legend(fontsize=10)
axes[0].grid(True, alpha=0.3)

# Bottom: Forecast errors
errors = forecast_df['actual'] - forecast_df['forecast']
colors = ['red' if e < 0 else 'green' for e in errors]

axes[1].bar(forecast_df['date'], errors, color=colors, alpha=0.6, edgecolor='black')
axes[1].axhline(0, color='black', linestyle='-', linewidth=1.5)

# Mark forecast horizons
for horizon, label in [(4, '1 month'), (13, '3 months')]:
    if horizon < len(errors):
        axes[1].axvline(forecast_df['date'].iloc[horizon],
                        color='blue', linestyle=':', linewidth=2)
        axes[1].text(forecast_df['date'].iloc[horizon], errors.max() * 0.9,
                     f'{label} ahead', rotation=90, fontsize=10)

axes[1].set_xlabel('Date', fontsize=12)
axes[1].set_ylabel('Forecast Error (Actual - Predicted)', fontsize=12)
axes[1].set_title('Forecast Errors Over Time', fontsize=14)
axes[1].grid(True, alpha=0.3)

plt.tight_layout()
plt.savefig('arima_flu_forecast.png', dpi=300)
plt.show()

# Evaluate performance at different horizons
horizons = [1, 4, 8, 13]  # 1 week, 1 month, 2 months, 3 months

print("\n" + "="*60)
print("FORECAST PERFORMANCE BY HORIZON")
print("="*60)

for horizon in horizons:
    if horizon <= len(errors):
        subset_actual = forecast_df['actual'].iloc[:horizon]
        subset_pred = forecast_df['forecast'].iloc[:horizon]

        mae = mean_absolute_error(subset_actual, subset_pred)
        rmse = np.sqrt(mean_squared_error(subset_actual, subset_pred))
        mape = np.mean(np.abs((subset_actual - subset_pred) / subset_actual)) * 100

        # Coverage: % of actuals within CI
        subset_lower = forecast_df['lower_ci'].iloc[:horizon]
        subset_upper = forecast_df['upper_ci'].iloc[:horizon]
        coverage = ((subset_actual >= subset_lower) & (subset_actual <= subset_upper)).mean() * 100

        print(f"\n{horizon} week(s) ahead:")
        print(f"  MAE: {mae:.3f}")
        print(f"  RMSE: {rmse:.3f}")
        print(f"  MAPE: {mape:.1f}%")
        print(f"  95% CI Coverage: {coverage:.0f}%")

# Compare to naive baseline (persistence forecast)
naive_forecast = train['ili_pct'].iloc[-1]  # Last observed value
naive_errors = forecast_df['actual'] - naive_forecast
naive_mae = mean_absolute_error(forecast_df['actual'], [naive_forecast]*len(forecast_df))
arima_mae = mean_absolute_error(forecast_df['actual'], forecast_df['forecast'])

improvement = (1 - arima_mae / naive_mae) * 100

print("\n" + "="*60)
print("COMPARISON TO NAIVE BASELINE")
print("="*60)
print(f"Naive forecast (persistence): MAE = {naive_mae:.3f}")
print(f"SARIMA forecast: MAE = {arima_mae:.3f}")
print(f"Improvement: {improvement:.1f}%")
ARIMA excels when:
✓ Disease has regular seasonal patterns (influenza, RSV, dengue)
✓ Reporting is stable and consistent
✓ Short-term forecasts (1-4 weeks)
✓ No major interventions or behavior changes expected
✓ Historical data covers multiple seasons

ARIMA struggles when:
✗ Novel outbreak (no historical pattern to learn)
✗ Sudden interventions (lockdowns, vaccine rollouts)
✗ Reporting systems change
✗ Long-term forecasts (>4 weeks)
✗ Nonlinear, chaotic dynamics
For comprehensive ARIMA tutorial, see Hyndman & Athanasopoulos, Forecasting: Principles and Practice.
7.4.2 Deep Learning: LSTMs for Epidemic Forecasting
LSTM (Long Short-Term Memory) networks are a type of recurrent neural network designed for sequential data.
Why LSTMs for epidemics: - Can capture long-term dependencies (last month’s cases affect this month) - Handle multiple input features (cases + mobility + weather + interventions) - Don’t require stationarity assumptions - Can learn nonlinear patterns
The basic architecture:
Input sequence (past 14 days) → LSTM layers → Dense layers → Output (future cases, e.g., next 7 days)
Complete implementation for COVID-19 forecasting:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.preprocessing import MinMaxScaler
from sklearn.metrics import mean_absolute_error, mean_squared_error
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers
import warnings
warnings.filterwarnings('ignore')

# Set random seeds for reproducibility
np.random.seed(42)
tf.random.set_seed(42)

# Simulate realistic COVID-19 case data
days = np.arange(500)

# Multi-wave pattern (Original → Delta → Omicron-like)
wave1 = 500 * np.exp(-((days - 50)**2) / 300)
wave2 = 2000 * np.exp(-((days - 200)**2) / 400)
wave3 = 5000 * np.exp(-((days - 350)**2) / 500)

# Combine waves with baseline and noise
cases = 100 + wave1 + wave2 + wave3 + np.random.normal(0, 100, len(days))
cases = np.maximum(cases, 50)  # Floor at 50 cases/day

# Add day-of-week effect (reporting artifacts)
day_of_week_effect = -200 * (days % 7 >= 5)  # Lower on weekends
cases = cases + day_of_week_effect
cases = np.maximum(cases, 0)

# Create DataFrame
df = pd.DataFrame({
    'day': days,
    'cases': cases
})

print(f"Total days: {len(df)}")
print(f"Case range: {cases.min():.0f} to {cases.max():.0f}")

# Visualize raw data
plt.figure(figsize=(14, 6))
plt.plot(df['day'], df['cases'], 'o-', alpha=0.6, markersize=3)
plt.xlabel('Day', fontsize=12)
plt.ylabel('Daily Cases', fontsize=12)
plt.title('Simulated COVID-19 Daily Cases (Multi-Wave Pattern)', fontsize=14)
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.savefig('lstm_data_raw.png', dpi=300)
plt.show()

# Prepare data for LSTM
def create_sequences(data, seq_length, forecast_horizon=1):
    """
    Create input sequences and targets for LSTM

    Parameters:
    - data: 1D array of values
    - seq_length: number of past time steps to use
    - forecast_horizon: number of future time steps to predict

    Returns:
    - X: [n_samples, seq_length] input sequences
    - y: [n_samples, forecast_horizon] target values
    """
    X, y = [], []

    for i in range(len(data) - seq_length - forecast_horizon + 1):
        # Input: seq_length days of history
        X.append(data[i:i+seq_length])

        # Target: next forecast_horizon days
        if forecast_horizon == 1:
            y.append(data[i+seq_length])
        else:
            y.append(data[i+seq_length:i+seq_length+forecast_horizon])

    return np.array(X), np.array(y)

# Normalize data (LSTM trains better with normalized inputs)
scaler = MinMaxScaler(feature_range=(0, 1))
cases_scaled = scaler.fit_transform(df['cases'].values.reshape(-1, 1)).flatten()

# Create sequences (use past 14 days to predict next day)
seq_length = 14
forecast_horizon = 1

X, y = create_sequences(cases_scaled, seq_length, forecast_horizon)

print(f"\nSequence creation:")
print(f"Input shape: {X.shape}")   # (n_samples, seq_length)
print(f"Target shape: {y.shape}")  # (n_samples,)

# Train/test split (80/20)
train_size = int(0.8 * len(X))
X_train, X_test = X[:train_size], X[train_size:]
y_train, y_test = y[:train_size], y[train_size:]

print(f"\nTrain samples: {len(X_train)}")
print(f"Test samples: {len(X_test)}")

# Reshape for LSTM [samples, time steps, features]
X_train = X_train.reshape((X_train.shape[0], X_train.shape[1], 1))
X_test = X_test.reshape((X_test.shape[0], X_test.shape[1], 1))

print(f"\nReshaped for LSTM:")
print(f"X_train shape: {X_train.shape}")
print(f"X_test shape: {X_test.shape}")

# Build LSTM model
model = keras.Sequential([
    # First LSTM layer (return sequences for stacking)
    layers.LSTM(64, activation='relu', return_sequences=True,
                input_shape=(seq_length, 1)),
    layers.Dropout(0.2),

    # Second LSTM layer
    layers.LSTM(32, activation='relu'),
    layers.Dropout(0.2),

    # Dense layers
    layers.Dense(16, activation='relu'),

    # Output layer
    layers.Dense(forecast_horizon)
])

model.compile(
    optimizer='adam',
    loss='mse',
    metrics=['mae']
)

print("\n" + "="*60)
print("LSTM MODEL ARCHITECTURE")
print("="*60)
model.summary()

# Train model
print("\n" + "="*60)
print("TRAINING MODEL")
print("="*60)

early_stopping = keras.callbacks.EarlyStopping(
    monitor='val_loss',
    patience=10,
    restore_best_weights=True
)

history = model.fit(
    X_train, y_train,
    epochs=100,
    batch_size=32,
    validation_split=0.2,
    callbacks=[early_stopping],
    verbose=1
)

# Make predictions
y_train_pred = model.predict(X_train, verbose=0)
y_test_pred = model.predict(X_test, verbose=0)

# Inverse transform to original scale
y_train_actual = scaler.inverse_transform(y_train.reshape(-1, 1)).flatten()
y_train_pred_original = scaler.inverse_transform(y_train_pred).flatten()

y_test_actual = scaler.inverse_transform(y_test.reshape(-1, 1)).flatten()
y_test_pred_original = scaler.inverse_transform(y_test_pred).flatten()

# Evaluate
train_mae = mean_absolute_error(y_train_actual, y_train_pred_original)
train_rmse = np.sqrt(mean_squared_error(y_train_actual, y_train_pred_original))

test_mae = mean_absolute_error(y_test_actual, y_test_pred_original)
test_rmse = np.sqrt(mean_squared_error(y_test_actual, y_test_pred_original))

print("\n" + "="*60)
print("MODEL PERFORMANCE")
print("="*60)
print(f"\nTraining Set:")
print(f"  MAE: {train_mae:.1f} cases")
print(f"  RMSE: {train_rmse:.1f} cases")
print(f"\nTest Set:")
print(f"  MAE: {test_mae:.1f} cases")
print(f"  RMSE: {test_rmse:.1f} cases")

# Visualize results
fig, axes = plt.subplots(3, 1, figsize=(16, 14))

# Top: Training history
axes[0].plot(history.history['loss'], label='Training Loss', linewidth=2)
axes[0].plot(history.history['val_loss'], label='Validation Loss', linewidth=2)
axes[0].set_xlabel('Epoch', fontsize=12)
axes[0].set_ylabel('Loss (MSE)', fontsize=12)
axes[0].set_title('LSTM Training History', fontsize=14)
axes[0].legend(fontsize=11)
axes[0].grid(True, alpha=0.3)

# Middle: Predictions vs actual (full dataset)
train_days = days[seq_length:seq_length+len(y_train_actual)]
test_days = days[seq_length+len(y_train_actual):seq_length+len(y_train_actual)+len(y_test_actual)]

axes[1].plot(train_days, y_train_actual, 'o-',
             label='Actual (Training)', color='blue', alpha=0.5, markersize=3)
axes[1].plot(test_days, y_test_actual, 'o-',
             label='Actual (Test)', color='black', linewidth=2, markersize=4)
axes[1].plot(test_days, y_test_pred_original, 's-',
             label='LSTM Forecast', color='red', linewidth=2, markersize=4)

axes[1].axvline(train_days[-1], color='green', linestyle='--',
                linewidth=2, label='Train/Test Split')

axes[1].set_xlabel('Day', fontsize=12)
axes[1].set_ylabel('Daily Cases', fontsize=12)
axes[1].set_title('LSTM Forecast: COVID-19 Daily Cases', fontsize=14)
axes[1].legend(fontsize=10)
axes[1].grid(True, alpha=0.3)

# Bottom: Forecast errors
errors = y_test_actual - y_test_pred_original
colors = ['red' if e < 0 else 'green' for e in errors]

axes[2].bar(test_days, errors, color=colors, alpha=0.6, edgecolor='black')
axes[2].axhline(0, color='black', linestyle='-', linewidth=1.5)

axes[2].set_xlabel('Day', fontsize=12)
axes[2].set_ylabel('Forecast Error (Actual - Predicted)', fontsize=12)
axes[2].set_title('LSTM Forecast Errors', fontsize=14)
axes[2].grid(True, alpha=0.3)

plt.tight_layout()
plt.savefig('lstm_covid_forecast.png', dpi=300)
plt.show()

# Multi-step ahead forecasting
print("\n" + "="*60)
print("MULTI-STEP AHEAD FORECASTING")
print("="*60)

# Use the last seq_length days as input
last_sequence = cases_scaled[-seq_length:]
current_sequence = last_sequence.reshape(1, seq_length, 1)

# Forecast next 14 days iteratively
forecast_days = 14
multi_step_forecast = []

for i in range(forecast_days):
    # Predict next day
    next_pred = model.predict(current_sequence, verbose=0)[0, 0]
    multi_step_forecast.append(next_pred)

    # Update sequence: drop oldest, add prediction
    current_sequence = np.append(current_sequence[0, 1:, 0], next_pred).reshape(1, seq_length, 1)

# Inverse transform
multi_step_forecast_original = scaler.inverse_transform(
    np.array(multi_step_forecast).reshape(-1, 1)
).flatten()

# Visualize multi-step forecast
fig, ax = plt.subplots(figsize=(14, 6))

# Historical data
ax.plot(days, cases, 'o-', label='Historical Cases',
        color='blue', alpha=0.6, markersize=3)

# Multi-step forecast
forecast_days_array = np.arange(len(days), len(days) + forecast_days)
ax.plot(forecast_days_array, multi_step_forecast_original, 's-',
        label=f'{forecast_days}-Day Forecast', color='red',
        linewidth=3, markersize=8)

ax.axvline(days[-1], color='green', linestyle='--', linewidth=2,
           label='Forecast Start')

ax.set_xlabel('Day', fontsize=12)
ax.set_ylabel('Daily Cases', fontsize=12)
ax.set_title(f'LSTM Multi-Step Ahead Forecast ({forecast_days} days)', fontsize=14)
ax.legend(fontsize=11)
ax.grid(True, alpha=0.3)
plt.tight_layout()
plt.savefig('lstm_multistep_forecast.png', dpi=300)
plt.show()

print(f"\n{forecast_days}-day forecast:")
for i, pred in enumerate(multi_step_forecast_original, 1):
    print(f"  Day {i}: {pred:.0f} cases")
LSTMs learn patterns, not mechanisms.
Example scenario: - Training data includes natural epidemic curve + lockdown → cases drop - LSTM learns: “After cases reach X, they drop” - New epidemic: Cases reach X, but no lockdown this time - LSTM predicts drop (wrong!)
The problem: LSTM has no concept of “lockdown caused the drop.” It just sees a pattern.
When this matters: - Novel interventions (no historical data) - Changing policy responses - New variants with different characteristics - Long-term forecasts where context matters
Solution: Use LSTMs for short-term pattern recognition, mechanistic models for scenario planning.
For practical deep learning in epidemiology, see Chimmula & Zhang, 2020, Chaos, Solitons & Fractals.
7.5 Ensemble Methods: Combining Models
No single model is best for all situations. Ensemble methods combine multiple models to create more robust forecasts.
7.5.1 The CDC FluSight Challenge
The CDC’s FluSight challenge, running since 2013, invites teams to forecast influenza-like illness (ILI) weekly.
The setup: - Teams submit forecasts for 1-4 weeks ahead - Also forecast: peak week, peak intensity, onset week - Evaluated on prediction interval coverage and point forecast accuracy
Key finding (Reich et al., 2019, Clinical Infectious Diseases):
Ensemble models consistently outperformed any individual model.
Why ensembles work:
1. Different models capture different patterns
   - Mechanistic models: understand transmission dynamics
   - ARIMA: captures seasonal patterns
   - ML models: learn complex nonlinear relationships
2. Averaging reduces overfitting
   - Individual models may be overconfident
   - The ensemble spreads risk across models
3. Robustness to model-specific failures
   - If one model fails (e.g., an unexpected intervention), others compensate
7.5.2 Building an Ensemble Forecaster
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from scipy.integrate import odeint
from statsmodels.tsa.statespace.sarimax import SARIMAX
from sklearn.metrics import mean_absolute_error
import warnings
warnings.filterwarnings('ignore')

# Generate synthetic epidemic data
np.random.seed(42)
days = np.arange(200)

# Simulated cases (mix of exponential growth + intervention)
phase1 = 100 * np.exp(0.03 * days[:80])                 # Growth phase
phase2 = phase1[-1] * np.exp(-0.02 * (days[80:] - 80))  # Decline (intervention)

cases = np.concatenate([phase1, phase2])
cases = cases + np.random.normal(0, cases * 0.1)  # Add noise
cases = np.maximum(cases, 10)

df = pd.DataFrame({'day': days, 'cases': cases})

# Split: train on first 150 days, forecast next 50
train_size = 150
train = df.iloc[:train_size]
test = df.iloc[train_size:]

print(f"Training days: {len(train)}")
print(f"Test days: {len(test)}")

# === MODEL 1: SEIR Mechanistic Model ===
def seir_forecast(train_data, forecast_steps, beta=0.03, sigma=0.2, gamma=0.1):
    """
    Fit SEIR model to training data and forecast
    """
    # Population
    N = 1000000

    # Initial conditions (estimate from data)
    I0 = train_data['cases'].iloc[0]
    E0 = I0
    R0_init = 0
    S0 = N - E0 - I0 - R0_init

    def seir_model(y, t, beta, sigma, gamma, N):
        S, E, I, R = y
        dS = -beta * S * I / N
        dE = beta * S * I / N - sigma * E
        dI = sigma * E - gamma * I
        dR = gamma * I
        return dS, dE, dI, dR

    # Solve SEIR for training + forecast period
    t = np.arange(len(train_data) + forecast_steps)
    y0 = S0, E0, I0, R0_init

    solution = odeint(seir_model, y0, t, args=(beta, sigma, gamma, N))
    _, _, I, _ = solution.T

    # Return only forecast period
    forecast = I[len(train_data):]

    return forecast

seir_forecast_values = seir_forecast(train, len(test))

# === MODEL 2: ARIMA ===
def arima_forecast(train_data, forecast_steps):
    """
    Fit ARIMA model and forecast
    """
    try:
        model = SARIMAX(train_data['cases'], order=(2, 1, 1))
        fit = model.fit(disp=False)
        forecast = fit.forecast(steps=forecast_steps)
        return forecast.values
    except:
        # Fallback: return mean if ARIMA fails
        return np.full(forecast_steps, train_data['cases'].mean())

arima_forecast_values = arima_forecast(train, len(test))

# === MODEL 3: Machine Learning (Simple Trend Extrapolation) ===
def ml_forecast(train_data, forecast_steps):
    """
    Simple ML: Fit polynomial to recent trend and extrapolate
    """
    from numpy.polynomial import Polynomial

    # Use last 30 days for trend
    recent = train_data.iloc[-30:]

    # Fit polynomial
    p = Polynomial.fit(recent['day'], recent['cases'], deg=2)

    # Forecast
    forecast_days = np.arange(train_data['day'].iloc[-1] + 1,
                              train_data['day'].iloc[-1] + 1 + forecast_steps)
    forecast = p(forecast_days)

    return np.maximum(forecast, 0)  # No negative cases

ml_forecast_values = ml_forecast(train, len(test))

# === ENSEMBLE: Combine Models ===
# Method 1: Simple average
ensemble_simple = (seir_forecast_values + arima_forecast_values + ml_forecast_values) / 3

# Method 2: Weighted average (based on domain knowledge or historical performance)
weights = {
    'seir': 0.4,   # Higher weight for mechanistic model (interpretable)
    'arima': 0.3,  # Medium weight (good for short-term)
    'ml': 0.3      # Medium weight (captures recent trends)
}

ensemble_weighted = (
    weights['seir'] * seir_forecast_values +
    weights['arima'] * arima_forecast_values +
    weights['ml'] * ml_forecast_values
)

# Method 3: Median ensemble (more robust to outliers)
all_forecasts = np.array([seir_forecast_values, arima_forecast_values, ml_forecast_values])
ensemble_median = np.median(all_forecasts, axis=0)

# Visualize
fig, axes = plt.subplots(2, 1, figsize=(16, 12))

# Top: All forecasts
axes[0].plot(train['day'], train['cases'], 'o-',
             label='Training Data', color='blue', alpha=0.5, markersize=3)
axes[0].plot(test['day'], test['cases'], 'o-',
             label='Actual (Test)', color='black', linewidth=3, markersize=6)

axes[0].plot(test['day'], seir_forecast_values, '^--',
             label='SEIR Model', alpha=0.7, markersize=6)
axes[0].plot(test['day'], arima_forecast_values, 's--',
             label='ARIMA Model', alpha=0.7, markersize=6)
axes[0].plot(test['day'], ml_forecast_values, 'd--',
             label='ML Model', alpha=0.7, markersize=6)

axes[0].plot(test['day'], ensemble_weighted, '*-',
             label='Weighted Ensemble', color='red', linewidth=3, markersize=10)

axes[0].axvline(train['day'].iloc[-1], color='green', linestyle='--',
                linewidth=2, label='Train/Test Split')

axes[0].set_ylabel('Daily Cases', fontsize=12)
axes[0].set_title('Ensemble Forecasting: Combining Multiple Models', fontsize=14)
axes[0].legend(fontsize=10, loc='upper right')
axes[0].grid(True, alpha=0.3)

# Bottom: Performance comparison
models = {
    'SEIR': seir_forecast_values,
    'ARIMA': arima_forecast_values,
    'ML': ml_forecast_values,
    'Simple Ensemble': ensemble_simple,
    'Weighted Ensemble': ensemble_weighted,
    'Median Ensemble': ensemble_median
}

performance = []
for name, forecast in models.items():
    mae = mean_absolute_error(test['cases'], forecast)
    rmse = np.sqrt(((test['cases'] - forecast) ** 2).mean())
    performance.append({'Model': name, 'MAE': mae, 'RMSE': rmse})

perf_df = pd.DataFrame(performance).sort_values('MAE')

x_pos = np.arange(len(perf_df))
axes[1].barh(x_pos, perf_df['MAE'], alpha=0.8, edgecolor='black')
axes[1].set_yticks(x_pos)
axes[1].set_yticklabels(perf_df['Model'])
axes[1].set_xlabel('Mean Absolute Error (MAE)', fontsize=12)
axes[1].set_title('Model Performance Comparison', fontsize=14)
axes[1].grid(True, alpha=0.3, axis='x')

# Highlight ensemble
for i, model in enumerate(perf_df['Model']):
    if 'Ensemble' in model:
        axes[1].get_children()[i].set_color('red')
        axes[1].get_children()[i].set_alpha(1.0)

plt.tight_layout()
plt.savefig('ensemble_forecast.png', dpi=300)
plt.show()

# Print performance
print("\n" + "="*60)
print("MODEL PERFORMANCE COMPARISON")
print("="*60)
print(perf_df.to_string(index=False))

# Best model
best_model = perf_df.iloc[0]
print(f"\n🏆 Best Model: {best_model['Model']}")
print(f"   MAE: {best_model['MAE']:.1f}")
print(f"   RMSE: {best_model['RMSE']:.1f}")

# Improvement of best ensemble over best individual model
best_individual = perf_df[~perf_df['Model'].str.contains('Ensemble')].iloc[0]
improvement = (1 - best_model['MAE'] / best_individual['MAE']) * 100

print(f"\nEnsemble improvement over best individual model: {improvement:.1f}%")
Weight Selection:

1. Equal weights (simple average)
   - Easiest to implement
   - Works surprisingly well
   - Default starting point
2. Performance-based weights (see the sketch after this list)
   - Weight by historical accuracy
   - Requires a validation set
   - Can overfit to past performance
3. Domain-based weights
   - Mechanistic models: higher weight for scenario planning
   - ML models: higher weight for short-term forecasts
   - Time series models: higher weight for seasonal diseases
4. Adaptive weights
   - Change weights based on recent performance
   - More complex but more robust
   - Requires careful validation
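As a concrete example of performance-based (and, if recomputed each week, adaptive) weighting, the sketch below weights each model by its inverse MAE on a held-out validation window; the variable names and toy numbers are invented for illustration, not taken from the code above:

```python
import numpy as np

def inverse_mae_weights(val_actual, val_forecasts):
    """Weight each model proportional to 1 / MAE on a validation window (weights sum to 1)."""
    inv_error = {
        name: 1.0 / (np.mean(np.abs(np.asarray(val_actual) - np.asarray(pred))) + 1e-9)
        for name, pred in val_forecasts.items()
    }
    total = sum(inv_error.values())
    return {name: v / total for name, v in inv_error.items()}

# Toy usage with made-up validation data
actual = [100, 120, 150, 170]
forecasts = {
    'seir':  [90, 115, 160, 180],
    'arima': [105, 118, 148, 165],
    'ml':    [80, 100, 130, 200],
}
weights = inverse_mae_weights(actual, forecasts)
print({name: round(w, 2) for name, w in weights.items()})
```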
When ensemble doesn’t help: - All models use the same data and make the same mistakes - One model is vastly superior to others (ensemble dilutes its performance) - Forecast horizon is so long that all models are guessing
For theoretical foundations, see Ray & Reich, 2018, Journal of Forecasting.
7.6 Uncertainty Quantification: Beyond Point Forecasts
The cardinal sin of forecasting: Publishing a single number without uncertainty.
Example of bad forecasting: > “Model predicts 50,000 cases next week.”
What if the real range is 30,000-80,000? Decision-makers need to know!
7.6.1 Types of Uncertainty
1. Parameter Uncertainty - Don’t know exact values of \(\beta\), \(\gamma\), etc. - Solution: Bayesian inference, confidence intervals
2. Structural Uncertainty - Model is wrong (all models are wrong!) - SIR assumes homogeneous mixing (false) - Solution: Multiple models, ensemble
3. Stochastic Uncertainty - Random chance (superspreading events, who gets infected) - Solution: Monte Carlo simulation (see the sketch after this list)
4. Scenario Uncertainty - Future interventions unknown (will there be a lockdown?) - Solution: Scenario planning (best/worst/expected case)
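For stochastic uncertainty (type 3 above), one Monte Carlo approach is to replace the deterministic flows with random draws, for example a chain-binomial SIR run many times with identical parameters; this is a sketch with illustrative parameters, separate from the bootstrap and Bayesian methods implemented below:

```python
import numpy as np

def stochastic_sir_run(N=100_000, I0=10, beta=0.3, gamma=0.1, days=120, rng=None):
    """One chain-binomial SIR trajectory: daily infections and recoveries are binomial draws."""
    rng = rng or np.random.default_rng()
    S, I = N - I0, I0
    daily_cases = []
    for _ in range(days):
        p_inf = 1 - np.exp(-beta * I / N)   # per-susceptible infection probability
        new_inf = rng.binomial(S, p_inf)
        new_rec = rng.binomial(I, 1 - np.exp(-gamma))
        S -= new_inf
        I += new_inf - new_rec
        daily_cases.append(new_inf)
    return np.array(daily_cases)

# Identical parameters, 200 runs: randomness alone produces a wide range of peaks
rng = np.random.default_rng(42)
runs = np.array([stochastic_sir_run(rng=rng) for _ in range(200)])
peaks = runs.max(axis=1)
print(f"Peak daily incidence: median {np.median(peaks):,.0f}, "
      f"95% range {np.percentile(peaks, 2.5):,.0f}-{np.percentile(peaks, 97.5):,.0f}")
```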
7.6.2 Quantifying Forecast Uncertainty
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from scipy.integrate import odeint
from scipy.stats import norm
import seaborn as sns
# === METHOD 1: Bootstrap Confidence Intervals ===
def bootstrap_forecast(train_data, forecast_steps, n_bootstrap=1000):
    """
    Use bootstrap resampling to quantify uncertainty
    """
    forecasts = []

    for i in range(n_bootstrap):
        # Resample training data with replacement
        boot_sample = train_data.sample(n=len(train_data), replace=True)
        boot_sample = boot_sample.sort_values('day').reset_index(drop=True)

        # Fit model (simple exponential for speed)
        from scipy.optimize import curve_fit

        def exp_growth(x, a, b):
            return a * np.exp(b * x)

        try:
            params, _ = curve_fit(exp_growth, boot_sample['day'], boot_sample['cases'],
                                  p0=[100, 0.01], maxfev=5000)

            # Forecast
            forecast_days = np.arange(train_data['day'].iloc[-1] + 1,
                                      train_data['day'].iloc[-1] + 1 + forecast_steps)
            forecast = exp_growth(forecast_days, *params)

            forecasts.append(forecast)
        except:
            # If fit fails, use mean
            forecasts.append(np.full(forecast_steps, train_data['cases'].mean()))

    forecasts = np.array(forecasts)

    # Calculate percentiles
    forecast_mean = np.mean(forecasts, axis=0)
    forecast_lower = np.percentile(forecasts, 2.5, axis=0)
    forecast_upper = np.percentile(forecasts, 97.5, axis=0)

    return forecast_mean, forecast_lower, forecast_upper

# === METHOD 2: Bayesian Parameter Uncertainty ===
def bayesian_seir_forecast(train_data, forecast_steps, n_samples=500):
    """
    SEIR with Bayesian parameter uncertainty
    """
    N = 1000000

    # Prior distributions for parameters (based on literature/expert knowledge)
    beta_samples = np.random.uniform(0.02, 0.05, n_samples)   # Transmission rate
    gamma_samples = np.random.uniform(0.08, 0.12, n_samples)  # Recovery rate
    sigma_samples = np.random.uniform(0.15, 0.25, n_samples)  # E→I rate

    forecasts = []

    for beta, gamma, sigma in zip(beta_samples, gamma_samples, sigma_samples):
        # Initial conditions
        I0 = train_data['cases'].iloc[0]
        E0 = I0
        R0_init = 0
        S0 = N - E0 - I0 - R0_init

        def seir(y, t, beta, sigma, gamma, N):
            S, E, I, R = y
            dS = -beta * S * I / N
            dE = beta * S * I / N - sigma * E
            dI = sigma * E - gamma * I
            dR = gamma * I
            return dS, dE, dI, dR

        # Solve
        t = np.arange(len(train_data) + forecast_steps)
        y0 = S0, E0, I0, R0_init

        solution = odeint(seir, y0, t, args=(beta, sigma, gamma, N))
        _, _, I, _ = solution.T

        forecasts.append(I[len(train_data):])

    forecasts = np.array(forecasts)

    forecast_mean = np.mean(forecasts, axis=0)
    forecast_lower = np.percentile(forecasts, 2.5, axis=0)
    forecast_upper = np.percentile(forecasts, 97.5, axis=0)

    return forecast_mean, forecast_lower, forecast_upper, forecasts

# === METHOD 3: Ensemble Spread as Uncertainty ===
def ensemble_uncertainty(individual_forecasts):
    """
    Use spread across ensemble members as uncertainty measure
    """
    forecasts_array = np.array(individual_forecasts)

    forecast_mean = np.mean(forecasts_array, axis=0)
    forecast_std = np.std(forecasts_array, axis=0)

    # 95% CI assuming normal distribution
    forecast_lower = forecast_mean - 1.96 * forecast_std
    forecast_upper = forecast_mean + 1.96 * forecast_std

    return forecast_mean, forecast_lower, forecast_upper

# Generate example data
np.random.seed(42)
days = np.arange(150)
cases = 100 * np.exp(0.025 * days) + np.random.normal(0, 50, len(days))
cases = np.maximum(cases, 10)

df = pd.DataFrame({'day': days, 'cases': cases})

train_size = 120
train = df.iloc[:train_size]
test = df.iloc[train_size:]

# Generate forecasts with uncertainty
print("Generating forecasts with uncertainty quantification...")

# Bootstrap
print("  Running bootstrap...")
boot_mean, boot_lower, boot_upper = bootstrap_forecast(train, len(test), n_bootstrap=500)

# Bayesian
print("  Running Bayesian SEIR...")
bayes_mean, bayes_lower, bayes_upper, bayes_samples = bayesian_seir_forecast(
    train, len(test), n_samples=200
)

# Visualize
fig, axes = plt.subplots(3, 1, figsize=(16, 16))

# Top: Bootstrap uncertainty
axes[0].plot(train['day'], train['cases'], 'o-',
             label='Training Data', color='blue', alpha=0.5, markersize=3)
axes[0].plot(test['day'], test['cases'], 'o-',
             label='Actual (Test)', color='black', linewidth=2, markersize=6)

axes[0].plot(test['day'], boot_mean, 's-',
             label='Bootstrap Forecast', color='red', linewidth=2, markersize=6)
axes[0].fill_between(test['day'], boot_lower, boot_upper,
                     alpha=0.3, color='red', label='95% Bootstrap CI')

axes[0].axvline(train['day'].iloc[-1], color='green', linestyle='--', linewidth=2)

axes[0].set_ylabel('Daily Cases', fontsize=12)
axes[0].set_title('Bootstrap Uncertainty Quantification', fontsize=14)
axes[0].legend(fontsize=10)
axes[0].grid(True, alpha=0.3)

# Middle: Bayesian uncertainty
axes[1].plot(train['day'], train['cases'], 'o-',
             label='Training Data', color='blue', alpha=0.5, markersize=3)
axes[1].plot(test['day'], test['cases'], 'o-',
             label='Actual (Test)', color='black', linewidth=2, markersize=6)

axes[1].plot(test['day'], bayes_mean, '^-',
             label='Bayesian SEIR Forecast', color='purple', linewidth=2, markersize=6)
axes[1].fill_between(test['day'], bayes_lower, bayes_upper,
                     alpha=0.3, color='purple', label='95% Credible Interval')

axes[1].axvline(train['day'].iloc[-1], color='green', linestyle='--', linewidth=2)

axes[1].set_ylabel('Daily Cases', fontsize=12)
axes[1].set_title('Bayesian Parameter Uncertainty', fontsize=14)
axes[1].legend(fontsize=10)
axes[1].grid(True, alpha=0.3)

# Bottom: Distribution of forecasts (spaghetti plot)
for i in range(min(50, len(bayes_samples))):  # Plot subset for clarity
    axes[2].plot(test['day'], bayes_samples[i], '-',
                 color='purple', alpha=0.1, linewidth=0.5)

axes[2].plot(train['day'], train['cases'], 'o-',
             label='Training Data', color='blue', alpha=0.7, markersize=3)
axes[2].plot(test['day'], test['cases'], 'o-',
             label='Actual (Test)', color='black', linewidth=3, markersize=6, zorder=5)
axes[2].plot(test['day'], bayes_mean, '-',
             label='Mean Forecast', color='red', linewidth=3, zorder=4)

axes[2].axvline(train['day'].iloc[-1], color='green', linestyle='--', linewidth=2)

axes[2].set_xlabel('Day', fontsize=12)
axes[2].set_ylabel('Daily Cases', fontsize=12)
axes[2].set_title('Forecast Uncertainty: Distribution of Possible Futures (Spaghetti Plot)', fontsize=14)
axes[2].legend(fontsize=10)
axes[2].grid(True, alpha=0.3)

plt.tight_layout()
plt.savefig('uncertainty_quantification.png', dpi=300)
plt.show()

# Evaluate coverage
def calculate_coverage(actual, lower, upper):
    """Calculate what % of actual values fall within prediction interval"""
    within_interval = (actual >= lower) & (actual <= upper)
    coverage = within_interval.mean() * 100
    return coverage

boot_coverage = calculate_coverage(test['cases'].values, boot_lower, boot_upper)
bayes_coverage = calculate_coverage(test['cases'].values, bayes_lower, bayes_upper)

print("\n" + "="*60)
print("UNCERTAINTY QUANTIFICATION EVALUATION")
print("="*60)
print(f"\nBootstrap 95% CI:")
print(f"  Coverage: {boot_coverage:.0f}% (should be ~95%)")
print(f"\nBayesian 95% Credible Interval:")
print(f"  Coverage: {bayes_coverage:.0f}% (should be ~95%)")

# Interval width (narrower is better, if coverage is maintained)
boot_width = (boot_upper - boot_lower).mean()
bayes_width = (bayes_upper - bayes_lower).mean()

print(f"\nMean Interval Width:")
print(f"  Bootstrap: {boot_width:.1f} cases")
print(f"  Bayesian: {bayes_width:.1f} cases")
Coverage: % of actual values that fall within your prediction interval - Good: 95% coverage for a 95% interval - Bad: Only 70% coverage (interval too narrow)
Sharpness: How narrow your prediction interval is - Good: Narrow intervals (more informative) - Bad: Wide intervals (“cases will be between 10 and 10,000”)
The trade-off: - Wide intervals → High coverage, low utility - Narrow intervals → Low coverage, misleading
Goal: Maximize sharpness while maintaining nominal coverage.
For rigorous evaluation, see Gneiting & Raftery, 2007, Journal of the American Statistical Association.
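A standard way to score coverage and sharpness together is the interval score described by Gneiting & Raftery (cited above): interval width plus a penalty, scaled by \(2/\alpha\), for observations that fall outside the interval. A minimal sketch:

```python
import numpy as np

def interval_score(actual, lower, upper, alpha=0.05):
    """Interval score for a central (1 - alpha) prediction interval; lower is better."""
    actual, lower, upper = map(np.asarray, (actual, lower, upper))
    width = upper - lower
    penalty_below = (2 / alpha) * (lower - actual) * (actual < lower)
    penalty_above = (2 / alpha) * (actual - upper) * (actual > upper)
    return np.mean(width + penalty_below + penalty_above)

# A narrow interval that misses scores far worse than a wider one that covers
print(interval_score(actual=[120], lower=[90], upper=[110]))   # 20 + 40*10 = 420
print(interval_score(actual=[120], lower=[80], upper=[140]))   # covers: width only = 60
```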
7.7 Case Study: COVID-19 Forecasting Hub
The COVID-19 Forecast Hub, launched in March 2020, represents the largest epidemic forecasting effort in history.
7.7.1 Setup
Coordinated by: Reich Lab at UMass Amherst
Participants: 50+ teams (academia, industry, government)
Frequency: Weekly forecasts
Targets: - Incident deaths (1-4 weeks ahead) - Incident hospitalizations - Incident cases (when data reliable)
Ensemble method: Weighted median of all submitted forecasts
Use: CDC and state health departments for decision-making
7.7.2 Results
Cramer et al., 2022, PNAS evaluated performance over 20 months:
Key findings:
1-week ahead forecasts: - Median absolute error: 10-15% (useful!) - Ensemble outperformed 90% of individual models
4-week ahead forecasts: - Median absolute error: 40-60% (barely better than guessing) - Huge uncertainty, wide prediction intervals
Challenges encountered:
❌ Novel variants (Alpha, Delta, Omicron) → All models failed simultaneously
❌ Changing testing → Case forecasts unreliable
❌ Policy changes → Lockdowns, mask mandates unpredictable
❌ Behavior changes → People reacted to case counts
❌ Data quality issues → Reporting delays, definition changes
7.7.3 Lessons Learned
Short-term forecasts (1-2 weeks ahead): - Useful for operational planning (hospital beds, staffing) - Ensemble approach was robust - Simple models often competitive with complex models
Transparency and collaboration: - Open data, open code, open evaluation - Rapid iteration and model improvement - Community learned from collective failures
Ensemble method: - Consistently better than any individual model - Robust to individual model failures - Wisdom of crowds
For detailed analysis, see Reich Lab’s evaluation reports.
Long-term forecasts (>4 weeks): - Worse than simple “trend continues” baseline - Essentially scenario planning, not prediction - Should not have been used for major policy decisions
Forecasting through major disruptions: - New variants broke all models - Policy changes (lockdowns) invalidated forecasts immediately - No model predicted Omicron’s magnitude
Point forecasts without uncertainty: - Some early models provided single numbers - Dangerously misleading - Led to bad decisions
The fundamental problem: COVID-19 exhibited extreme non-stationarity—the data-generating process changed every few months (new variants, vaccines, behavior).
When the rules keep changing, forecasting becomes impossible.
7.7.4 Example: Omicron Forecasting Failure
December 2021: Omicron variant detected, spreads rapidly.
Forecast Hub predictions (week of Dec 20): - Ensemble forecast: 450,000 cases/day peak - Range across models: 300,000 - 700,000 cases/day
Reality (Jan 2022): - Actual peak: 1.4 million cases/day (underestimated by 2-3x)
Why all models failed: - Omicron’s R₀ was 2-3x higher than Delta - Immune escape wasn’t fully captured - Behavior didn’t change as much as expected - Testing patterns shifted
The lesson: Novel variants with dramatically different properties break all models trained on past data.
7.8 When NOT to Forecast: The Ethics of Publishing Predictions
Forecasts have consequences. Sometimes, not forecasting is the ethical choice.
7.8.1 The Self-Fulfilling/Self-Defeating Problem Revisited
Scenario planning disguised as prediction:
Many “forecasts” during COVID-19 were actually scenarios: - “If we do nothing, 2 million will die” - “If we lock down, 100,000 will die”
These aren’t predictions—they’re conditional statements about policy choices.
The problem: Media reports them as predictions, public gets confused.
7.8.2 When to Say “We Don’t Know”
You should NOT publish forecasts when:
❌ Novel pathogen with no historical data - Early COVID-19 (January 2020): Every model was pure speculation - Better: “Here’s a range of scenarios from mild to catastrophic”
❌ Major intervention pending - Government about to announce lockdown - Forecasting what happens “if current trends continue” is misleading
❌ Data quality too poor - Testing availability changing daily - Reporting systems overwhelmed - Better: Focus on qualitative assessment
❌ Model disagreement too high - If your ensemble has predictions ranging from 10,000 to 500,000 cases, what’s the point? - Better: “Current uncertainty is too high for actionable forecasts”
❌ Horizon too long - Forecasting 3+ months ahead for rapidly changing epidemic - Better: Provide scenarios, not forecasts
From Ioannidis et al., 2020, International Journal of Forecasting:
“Epidemic forecasters have a special responsibility because their forecasts can profoundly influence public policy and behavior. Forecasts that are too optimistic may lead to complacency. Forecasts that are too pessimistic may cause unnecessary panic and economic damage. Forecasters must resist pressure to provide false certainty.”
Best practices:
✅ Always provide uncertainty quantification
✅ Clearly distinguish forecasts from scenarios
✅ Update forecasts as data changes
✅ Acknowledge model limitations publicly
✅ Archive forecasts for retrospective evaluation
✅ Resist pressure to forecast when uncertainty is unquantifiable
7.9 Practical Guidance: Building Your First Forecasting System
7.9.1 Pre-Flight Checklist
Before you start forecasting:
☐ Do you have sufficient historical data? - Time series: ≥2 years for seasonal diseases - SEIR models: At least early epidemic data to estimate parameters
☐ Is your data quality acceptable? - Complete Chapter 3 assessment - Understand reporting lags - Know how testing/surveillance changed over time
☐ What decisions will this forecast inform? - Hospital resource allocation → 1-2 week forecasts sufficient - Long-term policy planning → Need scenarios, not forecasts - Public communication → Uncertainty is essential
☐ What’s your forecast horizon? - 1-2 weeks: Achievable with most methods - 3-4 weeks: Difficult, high uncertainty - >4 weeks: Scenario planning, not forecasting
7.9.2 Model Selection Strategy
Start simple, add complexity only when justified:
1. Naive baseline (persistence: tomorrow = today)
↓
2. Time series model (ARIMA, Prophet)
↓ (if you need scenario planning)
3. Mechanistic model (SEIR)
↓ (if you have large training data)
4. Machine learning (LSTM, ensemble)
Never skip the naive baseline! If your fancy model doesn’t beat “tomorrow will be like today,” it’s not adding value.
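As a concrete check, the persistence baseline takes only a few lines; the series below is a hypothetical run of weekly case counts:

```python
import numpy as np
import pandas as pd

def persistence_forecast(series: pd.Series, horizon: int) -> np.ndarray:
    """Naive baseline: every future value equals the last observed value."""
    return np.repeat(series.iloc[-1], horizon)

# Hypothetical weekly case counts; report this baseline's error alongside
# any ARIMA/SEIR/LSTM result before claiming the fancier model adds value.
cases = pd.Series([120, 150, 180, 210, 260, 300])
print(persistence_forecast(cases, horizon=4))  # [300 300 300 300]
```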
7.9.3 Evaluation Framework
Proper time series cross-validation:
```python
from sklearn.model_selection import train_test_split

# WRONG: Random train/test split leaks future information into training
X_train, X_test = train_test_split(X, test_size=0.2)  # ❌

# CORRECT: Respect temporal order
train_size = int(0.8 * len(X))
X_train = X[:train_size]
X_test = X[train_size:]  # ✅
```
Report multiple metrics: - MAE (Mean Absolute Error) - RMSE (Root Mean Squared Error) - Coverage (95% CI should contain 95% of actuals) - Sharpness (average interval width)
Track performance over time: - Forecast accuracy degrades → Retrain models - Sudden drop in accuracy → Something changed (investigate!)
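A common way to generate these metrics repeatedly is rolling-origin (walk-forward) evaluation. Here is a minimal sketch using scikit-learn's TimeSeriesSplit, with a placeholder fit_and_forecast function (our own stand-in) for whichever model you deploy:

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit
from sklearn.metrics import mean_absolute_error

def fit_and_forecast(train, horizon):
    """Placeholder model: persistence baseline. Swap in ARIMA/SEIR/LSTM here."""
    return np.repeat(train[-1], horizon)

# Hypothetical weekly case counts
y = np.array([100, 120, 150, 180, 210, 260, 300, 340, 320, 290, 260, 240])

maes = []
for train_idx, test_idx in TimeSeriesSplit(n_splits=4, test_size=2).split(y):
    forecast = fit_and_forecast(y[train_idx], horizon=len(test_idx))
    maes.append(mean_absolute_error(y[test_idx], forecast))

print(f"Walk-forward MAE per fold: {np.round(maes, 1)}")
print(f"Mean MAE: {np.mean(maes):.1f}")
```

Each fold trains only on data before the forecast origin, which mimics how the model would actually be used in production.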
7.9.4 Implementation Template
```python
import numpy as np
import pandas as pd
from sklearn.metrics import mean_absolute_error


class EpidemicForecaster:
    """
    Production-ready epidemic forecasting system
    """

    def __init__(self):
        self.models = {}
        self.performance_history = []

    def add_model(self, name, model):
        """Add a model to the ensemble"""
        self.models[name] = model

    def fit(self, train_data):
        """Fit all models"""
        for name, model in self.models.items():
            model.fit(train_data)

    def forecast(self, horizon, method='weighted_ensemble'):
        """Generate ensemble forecast"""
        # Get forecasts from all models
        forecasts = {}
        for name, model in self.models.items():
            forecasts[name] = model.predict(horizon)

        # Combine
        if method == 'simple_average':
            ensemble = np.mean(list(forecasts.values()), axis=0)
        elif method == 'weighted_ensemble':
            # Weight by recent performance
            weights = self._calculate_weights()
            ensemble = sum(w * f for w, f in zip(weights, forecasts.values()))

        return ensemble, forecasts

    def evaluate(self, actual, forecast):
        """Evaluate forecast and log performance"""
        mae = mean_absolute_error(actual, forecast)
        self.performance_history.append({
            'date': pd.Timestamp.now(),
            'mae': mae
        })
        return mae

    def _calculate_weights(self):
        """Calculate model weights based on recent performance"""
        # Implementation depends on your needs
        pass


# Usage (SEIRModel, ARIMAModel, LSTMModel and training_data come from earlier sections)
forecaster = EpidemicForecaster()
forecaster.add_model('seir', SEIRModel())
forecaster.add_model('arima', ARIMAModel())
forecaster.add_model('lstm', LSTMModel())

forecaster.fit(training_data)
forecast, individual_forecasts = forecaster.forecast(horizon=14)
```
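The `_calculate_weights` stub is deliberately left open above. One simple option (an assumption on our part, not a prescribed method) is inverse-MAE weighting over recent evaluations:

```python
def inverse_mae_weights(recent_mae: dict) -> dict:
    """Weight each model proportionally to 1 / MAE, normalized to sum to 1."""
    inverse = {name: 1.0 / mae for name, mae in recent_mae.items()}
    total = sum(inverse.values())
    return {name: value / total for name, value in inverse.items()}

# Hypothetical recent errors for the three models registered above
print(inverse_mae_weights({'seir': 25.0, 'arima': 12.0, 'lstm': 11.0}))
# -> roughly {'seir': 0.19, 'arima': 0.39, 'lstm': 0.42}
```

Inside the class, `_calculate_weights` would return these values in the same order as `self.models` so that `zip(weights, forecasts.values())` lines up model by model.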
7.10 Key Takeaways
Forecasting ≠ Surveillance. Detection tells you what’s happening now; forecasting predicts the future. The latter is exponentially harder.
Short-term works, long-term doesn’t. 1-2 week forecasts are useful. 4+ week forecasts are often worse than naive baselines.
Mechanistic models provide interpretability and scenario planning. SEIR models let you ask “what if we vaccinate 50%?” ML models can’t answer that.
Machine learning excels at pattern recognition in stable conditions. But breaks when conditions change (new variants, interventions).
Ensemble methods are most robust. Combining multiple approaches beats any single model.
Uncertainty quantification is mandatory. Never publish a single-number forecast. Always provide confidence/prediction intervals.
Evaluation is essential. Track forecast performance over time. If accuracy degrades, investigate why and retrain.
Sometimes, don’t forecast. When uncertainty is unquantifiable or forecasts would be misleading, say “we don’t know.”
Forecasts influence outcomes. The act of forecasting changes behavior, creating self-fulfilling or self-defeating prophecies.
COVID-19 Forecast Hub proved ensemble methods work. But also showed fundamental limits—novel variants break all models.
7.11 Practice Exercises
7.11.1 Exercise 1: Implement and Compare Models
Build SIR, SEIR, ARIMA, and LSTM models for the same epidemic dataset. Which performs best at 1-week, 2-week, and 4-week horizons? Why?
7.11.2 Exercise 2: Sensitivity Analysis
For an SEIR model, vary \(\beta\) by ±20%. How much does peak timing and magnitude change? What does this tell you about parameter uncertainty?
7.11.3 Exercise 3: Ensemble Optimization
Given forecasts from 5 models, find the optimal weights that minimize MAE on a validation set. Do these weights generalize to the test set?
7.11.4 Exercise 4: Uncertainty Calibration
Generate forecasts with 95% prediction intervals. What percentage of actual values fall within your intervals? If not 95%, how can you calibrate?
7.11.5 Exercise 5: Scenario Planning
Using an SEIR model, create three scenarios: (1) No intervention, (2) 50% transmission reduction, (3) 80% transmission reduction. How different are the outcomes?
Check Your Understanding
Test your knowledge of epidemic forecasting methods and their appropriate applications. Each question builds on the key concepts from this chapter.
In a basic SIR compartmental model, you calculate R₀ = 2.5. What does this mean for epidemic dynamics, and what happens when R₀ drops to 0.9 due to interventions?
- R₀ = 2.5 means 2.5% of the population gets infected; R₀ = 0.9 means 0.9% gets infected
- R₀ = 2.5 means the epidemic grows exponentially; R₀ = 0.9 means the epidemic dies out
- R₀ values only matter for novel pathogens and don’t change with interventions
- R₀ = 2.5 means each infected person has 2.5 contacts per day
Correct Answer: b) R₀ = 2.5 means the epidemic grows exponentially; R₀ = 0.9 means the epidemic dies out
The basic reproduction number (R₀) is the most important parameter in epidemic modeling:
What R₀ means: - R₀ = number of secondary infections caused by one infected person in a fully susceptible population - R₀ = 2.5: Each infected person infects 2.5 others → epidemic grows exponentially - R₀ = 0.9: Each infected person infects <1 person → epidemic dies out - R₀ = 1: Endemic equilibrium (constant case rate)
Critical threshold: R₀ > 1 → epidemic grows; R₀ < 1 → epidemic dies out
How interventions affect R₀: - R₀ = β/γ where β = transmission rate, γ = recovery rate - Social distancing reduces β → lowers R₀ - Faster isolation reduces infectious period → increases γ → lowers R₀ - A modest 10% reduction in transmission (R₀ = 2.5 → 2.3) can dramatically reduce peak cases
Examples from the chapter: - R₀ = 2.5 (no intervention) → Peak: 500,000 cases/day - R₀ = 2.0 (distancing) → Peak: 100,000 cases/day - R₀ = 0.9 (lockdown) → No peak, epidemic dies out
Why other answers are wrong: - R₀ ≠ percentage infected (that's the attack rate) - R₀ ≠ contacts per day (it's contact rate × transmission probability × infectious period) - R₀ absolutely changes with interventions (that's the whole point!)
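A minimal SIR sketch (our own illustrative parameters and population size, not the chapter's earlier figures) makes the threshold concrete: above R₀ = 1 the outbreak peaks, below 1 it fades out.

```python
def sir_peak_infected(beta, gamma, days=300, N=1_000_000, I0=100):
    """Euler-stepped SIR model; returns the peak number simultaneously infected."""
    S, I, R = N - I0, I0, 0
    peak = I
    for _ in range(days):
        new_infections = beta * S * I / N
        new_recoveries = gamma * I
        S, I, R = S - new_infections, I + new_infections - new_recoveries, R + new_recoveries
        peak = max(peak, I)
    return peak

gamma = 1 / 7  # ~7-day infectious period
for R0 in [2.5, 1.5, 0.9]:
    peak = sir_peak_infected(beta=R0 * gamma, gamma=gamma)
    print(f"R0 = {R0}: peak infected ≈ {peak:,.0f}")
```

With R₀ = 2.5 the peak is large, with R₀ = 1.5 it is far smaller, and with R₀ = 0.9 infections never rise above the initial seed, i.e., the outbreak dies out.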
The COVID-19 Forecast Hub found that 1-week ahead forecasts had ~10-15% median error (useful), while 4-week ahead forecasts had ~50% error (barely better than guessing). What fundamental challenge explains this forecast horizon problem?
- Longer forecasts require more computational power that wasn’t available
- Epidemics exhibit chaotic dynamics where small changes in initial conditions, behaviors, and interventions compound over time
- Four weeks is beyond the maximum incubation period so the data becomes unreliable
- Forecasters deliberately make long-term forecasts less accurate to avoid liability
Correct Answer: b) Epidemics exhibit chaotic dynamics where small changes in initial conditions, behaviors, and interventions compound over time
This is the butterfly effect in epidemiology - a fundamental characteristic of complex adaptive systems:
Why accuracy degrades with forecast horizon:
- Compounding uncertainty: Small errors in parameter estimation (β, γ) grow exponentially
- Unpredictable interventions: Lockdowns, mask mandates, vaccine campaigns can’t be predicted
- Behavior changes: People react to case counts, changing transmission dynamics
- Viral evolution: New variants with different R₀ values break model assumptions
- Reporting changes: Testing policy shifts invalidate surveillance data patterns
- Random events: Superspreading events can dramatically shift trajectories
From the chapter:
| Forecast Horizon | Typical Accuracy | Usefulness |
|---|---|---|
| 1 week ahead | ±10-20% | High - operational planning |
| 2 weeks ahead | ±20-40% | Moderate - resource allocation |
| 4 weeks ahead | ±40-100% | Low - scenario planning only |
| 3+ months ahead | Worse than baseline | Essentially guesswork |
The self-fulfilling/self-defeating prophecy: - Forecasts influence behavior, which changes outcomes - Unlike weather forecasting (rain doesn’t care about predictions), epidemic forecasts affect the thing being forecasted - Long horizons amplify this feedback loop
Real-world example: - December 2021: Models predicted Omicron peak of 300K-700K cases/day - Reality (January 2022): 1.4 million cases/day (underestimated by 2-3x) - Why? Novel variant with 2-3x higher R₀ broke all assumptions
The lesson: Focus on short-term forecasts (1-2 weeks) for operational decisions. Treat longer horizons as scenario planning, not prediction.
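A back-of-the-envelope sketch shows how the compounding works (hypothetical numbers): if cases grow exponentially and your estimate of the daily growth rate is off by just two percentage points, the forecast error at four weeks is several times the error at one week.

```python
import numpy as np

true_rate, estimated_rate = 0.10, 0.12   # hypothetical daily exponential growth rates
current_cases = 10_000

for days in [7, 14, 28]:
    truth = current_cases * np.exp(true_rate * days)
    forecast = current_cases * np.exp(estimated_rate * days)
    print(f"{days:>2} days ahead: forecast off by {100 * (forecast / truth - 1):.0f}%")
```

The same small parameter error costs roughly 15% at one week but roughly 75% at four weeks, which is broadly the pattern shown in the table above.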
You’re comparing mechanistic SEIR models versus LSTM deep learning models for epidemic forecasting. Which statement best describes their complementary strengths and appropriate use cases?
- LSTM models are always superior because they use more advanced AI technology
- SEIR models enable interpretable scenario planning (what if we vaccinate 50%?), while LSTMs excel at short-term pattern recognition in stable conditions
- SEIR models are outdated; modern forecasting should only use deep learning
- LSTM models understand causal mechanisms better than SEIR models
Correct Answer: b) SEIR models enable interpretable scenario planning (what if we vaccinate 50%?), while LSTMs excel at short-term pattern recognition in stable conditions
This captures the fundamental tension between mechanistic and machine learning approaches:
SEIR/Mechanistic Models:
Strengths: - ✅ Interpretable: Every parameter (β, γ, σ) has biological meaning - ✅ Scenario planning: Can simulate “what if we reduce transmission by 30%?” - ✅ Theory-grounded: Built on century of epidemiological knowledge - ✅ Extrapolation: Can predict novel scenarios without historical data - ✅ Policy-relevant: Stakeholders understand “if R₀ drops to X, we get Y cases”
Limitations: - ❌ Simplistic assumptions (homogeneous mixing, constant parameters) - ❌ Don’t capture heterogeneity (superspreaders, age structure) - ❌ Require knowing parameters (β, γ) that are hard to estimate
LSTM/Machine Learning Models:
Strengths: - ✅ Data-driven: Learn patterns without mechanistic assumptions - ✅ Flexible: Capture complex nonlinear relationships - ✅ Short-term accuracy: Excel at 1-2 week forecasts when conditions are stable - ✅ Multiple inputs: Can integrate cases + mobility + weather + interventions
Limitations: - ❌ No mechanistic understanding: Learn patterns, not causes - ❌ Break on distribution shift: New variants, interventions invalidate training data - ❌ Opaque: Can’t explain why predictions changed - ❌ Can’t answer counterfactuals: “What if we vaccinate?” → no historical data for that
The LSTM limitation example from the chapter: - Training data: epidemic curve + lockdown → cases drop - LSTM learns: “After cases reach X, they drop” - New epidemic: cases reach X, but no lockdown - LSTM predicts drop (WRONG!) - Problem: No concept that lockdown caused the drop
Best practice (from COVID-19 Forecast Hub): - Use both in an ensemble - SEIR for scenario planning and understanding mechanisms - LSTM for short-term pattern recognition - Ensemble combines their strengths and averages out individual failures
Why did the COVID-19 Forecast Hub consistently find that ensemble methods (combining multiple models) outperformed any individual model, even though no single “ensemble algorithm” existed?
- Ensemble methods use more computational resources so they’re inherently more accurate
- Different models capture different patterns; averaging reduces overfitting and provides robustness when individual models fail
- Ensemble methods average out all the errors so they always perform better mathematically
- The ensemble was weighted toward the best model, which is why it performed well
Correct Answer: b) Different models capture different patterns; averaging reduces overfitting and provides robustness when individual models fail
Ensemble forecasting is one of the most important lessons from both the CDC FluSight Challenge and COVID-19 Forecast Hub:
Why ensembles work:
- Different models capture different patterns:
  - Mechanistic models (SEIR): Understand transmission dynamics
  - Time series models (ARIMA): Capture seasonal patterns
  - ML models (LSTM): Learn complex nonlinear relationships
  - Each model has blind spots; ensembles fill in the gaps
- Averaging reduces overfitting:
  - Individual models may be overconfident in their predictions
  - The ensemble spreads risk across multiple approaches
  - Extreme predictions get averaged out
- Robustness to model-specific failures:
  - If one model fails (e.g., LSTM breaks on a new variant), others compensate
  - No need to identify the "best" model in advance
  - Wisdom of crowds
Evidence from COVID-19 Forecast Hub (Cramer et al., 2022, PNAS): - Ensemble consistently outperformed 90% of individual models - Even when all models struggled (Omicron), ensemble was least wrong - Simple median ensemble worked surprisingly well
Ensemble methods: 1. Simple average: Equal weight to all models (works well!) 2. Weighted average: Weight by historical performance 3. Median ensemble: More robust to outliers 4. Adaptive weights: Change based on recent performance
When ensemble doesn’t help: - All models use same data and make same mistakes - One model vastly superior (rare in practice) - Forecast horizon so long all models are guessing
From the chapter example: - Individual models (SEIR, ARIMA, ML) → the best single model sets the baseline MAE - Simple ensemble → better than the best individual model - Weighted ensemble → best overall - Improvement: ~10-20% over the best single model
The lesson: Don’t put all your eggs in one model basket. Ensemble methods are more robust and reliable for real-world forecasting.
You publish a forecast predicting “severe outbreak in 2 weeks” with wide confidence intervals (10,000-100,000 cases). The public panics and implements strict measures. The outbreak is mild (15,000 cases). What forecasting challenge does this illustrate?
- Your model had high bias and should have been regularized better
- The self-defeating prophecy problem: forecasts influence behavior, which changes outcomes, making the forecast appear wrong even though it drove the right action
- Confidence intervals were too wide and should have been narrower for credibility
- The model should have predicted the intervention and adjusted for it
Correct Answer: b) The self-defeating prophecy problem: forecasts influence behavior, which changes outcomes, making the forecast appear wrong even though it drove the right action
This is one of the most profound challenges in epidemic forecasting - unique to predictions that influence human behavior:
The paradox: Success makes you look wrong
Scenario 1: Self-defeating prophecy (this question) - Model predicts catastrophic outbreak - Public panics, implements strict measures - Outbreak is mild - Conclusion: “Model was wrong! Overreaction!” - Reality: Model was right about unmitigated scenario; intervention changed trajectory
Scenario 2: Self-fulfilling prophecy - Model predicts mild outbreak - Public relaxes precautions - Outbreak explodes - Conclusion: “Model was wrong! Should have prepared!” - Reality: Model was right for baseline scenario; behavior change amplified outbreak
Unlike weather forecasting: - Rain forecast doesn’t affect whether it rains - Epidemic forecasts DO affect transmission (people change behavior) - Creates impossible situation for forecasters
The Michael Fish Problem (from chapter): - 1987: BBC weatherman dismissed hurricane concerns - Hurricane hit anyway → 18 died, permanent trust damage - Lesson: Reassuring forecasts that turn out wrong destroy trust permanently - But alarming forecasts that don’t materialize also erode trust (more slowly)
Ethical best practices:
✅ Clearly distinguish forecasts from scenarios: - “If current trends continue…” (forecast) - “If we do nothing…” (scenario) - “If we implement lockdown…” (scenario)
✅ Communicate uncertainty honestly: - Wide confidence intervals (10K-100K) show genuine uncertainty - Better than false precision (53,472 cases!)
✅ Frame conditionally: - “Without intervention: 50,000-100,000 cases” - “With strong measures: 10,000-30,000 cases”
✅ Update as conditions change: - “Our forecast assumed no intervention; behavior changed, so we update”
From Ioannidis et al., 2020: > “Forecasters must resist pressure to provide false certainty. Forecasts influence policy and behavior, creating special responsibility.”
When NOT to forecast: - Novel pathogen (no data) - Major intervention pending - Model disagreement too high - Better to say: “Current uncertainty is too high for actionable forecasts”
You’re building an epidemic forecasting system for influenza. Your ARIMA model achieves 12% MAE, while a “naive” persistence model (tomorrow = today) achieves 15% MAE. Your LSTM achieves 11% MAE but requires 100x more computational resources. What should you deploy?
- LSTM model because it has the lowest error
- ARIMA model because it balances accuracy and simplicity, and beats the naive baseline
- Naive model because it’s simplest and the differences are too small to matter
- An ensemble of all three models weighted by their inverse MAE
Correct Answer: b) ARIMA model because it balances accuracy and simplicity, and beats the naive baseline
This question tests practical judgment about model selection and deployment tradeoffs:
Why ARIMA is the best choice here:
1. Beats the naive baseline (12% vs 15% MAE)
   - Rule from chapter: "If your fancy model doesn't beat 'tomorrow will be like today,' it's not adding value"
   - A 20% relative improvement over the baseline is meaningful
2. Reasonable complexity
   - ARIMA is interpretable and well-understood
   - Can explain seasonal patterns and trends to stakeholders
   - Easier to diagnose when forecasts degrade
3. Computational efficiency
   - Fast training and inference
   - Can update frequently as new data arrives
   - No specialized hardware needed
4. Appropriate for influenza
   - Flu has strong seasonal patterns (ARIMA excels here)
   - Reporting is stable and consistent
   - Short-term forecasts (1-4 weeks) match the use case
Why NOT choose LSTM (11% MAE): - Only 1 percentage point better than ARIMA (~8% relative improvement) - 100x computational cost for a marginal gain - Harder to maintain and debug - Opaque predictions are harder to explain to health departments - Principle: deploy the simplest model that beats the baseline and meets requirements
Why NOT choose naive baseline: - 15% MAE is worse than ARIMA - Doesn’t capture seasonality - However, always compute it as benchmark!
Why ensemble might not be best: - More complex to maintain - Marginal improvement unlikely to justify complexity for weekly flu forecasts - Better for high-stakes scenarios (pandemic)
From the chapter’s model selection strategy:
1. Naive baseline (persistence: tomorrow = today)
↓ (beat this? Yes → continue)
2. Time series model (ARIMA, Prophet) ← WE ARE HERE
↓ (need scenario planning? No)
3. Mechanistic model (SEIR)
↓ (have large training data + need more accuracy? No)
4. Machine learning (LSTM, ensemble)
Best practice for production systems: 1. Deploy ARIMA as primary model 2. Monitor performance weekly 3. Keep naive baseline for comparison 4. If accuracy degrades, investigate and retrain 5. Consider LSTM only if: - Accuracy requirements become critical - Computational resources available - Interpretability less important
Context matters: - Hospital bed allocation (low stakes): ARIMA is fine - Pandemic ventilator procurement (high stakes): Might justify LSTM or ensemble - Public communication: Need interpretable model (ARIMA)
The chapter emphasizes: “Never skip the naive baseline! If your fancy model doesn’t beat ‘tomorrow will be like today,’ it’s not adding value.”
7.12 Discussion Questions
SEIR models assume homogeneous mixing (everyone contacts everyone equally). How does this assumption break down in reality? How might you modify SEIR to be more realistic?
The COVID-19 Forecast Hub ensemble outperformed individual models, but still failed to predict Omicron. What does this tell us about the limits of forecasting? When should we abandon forecasting and focus on scenario planning?
Mechanistic models (SEIR) are interpretable but make strong assumptions. ML models (LSTMs) are flexible but opaque. For public health decision-making, which is more important: interpretability or accuracy? Does it depend on the decision?
Imagine you’re a state health department epidemiologist in March 2020. Multiple models predict vastly different COVID-19 death tolls (100K to 2M). How do you communicate this uncertainty to policymakers? To the public?
Google Flu Trends failed because search behavior changed when people became aware of the model. Is this problem inherent to all epidemic forecasting, or specific to web-based surveillance?
The COVID-19 Forecast Hub published forecasts weekly, even when model uncertainty was extremely high. Was this ethical? Should forecasters refuse to forecast when uncertainty is unquantifiable?
7.13 Further Resources
7.13.1 📚 Books and Guides
- Keeling & Rohani, 2008, Modeling Infectious Diseases in Humans and Animals - Foundational textbook on mechanistic epidemic models
- Vynnycky & White, 2010, An Introduction to Infectious Disease Modelling - Practical guide to SEIR and related models
7.13.2 📄 Academic Papers
Forecasting Evaluation: - Reich et al., 2019, Clinical Infectious Diseases - FluSight evaluation - 🎯 Key case study showing ensemble methods work - Cramer et al., 2022, PNAS - COVID-19 Forecast Hub - 🎯 Essential reading on lessons learned from COVID-19 forecasting - Viboud et al., 2018, PNAS - RAPIDD forecasting challenge - Comparative evaluation of forecasting approaches
Machine Learning Methods: - Chimmula & Zhang, 2020, Chaos, Solitons & Fractals - LSTM forecasting - Deep learning for epidemic forecasting
Ethics and Limitations: - Ioannidis et al., 2020, International Journal of Forecasting - Forecasting for COVID-19 - Critical perspective on forecasting ethics and responsibility
7.13.3 💻 Tools and Platforms
- COVID-19 Forecast Hub - Archive of all COVID-19 forecasts with open data and code
- CDC FluSight - Annual influenza forecasting challenge
- EpiModel R package - Network-based epidemic modeling framework
- PyEpiDAT - Python toolkit for epidemic data and forecasting
7.13.4 🎓 Online Courses
- Coursera: Epidemics - Penn State course on epidemic modeling fundamentals
- edX: Infectious Disease Modelling - Imperial College London course on advanced modeling techniques
Forecasting is humbling. The future is uncertain, models are wrong, and interventions change outcomes. But with proper humility, transparency, and ensemble methods, forecasts can still inform better decisions than guessing.
Next: Chapter 6: Genomic Surveillance and Pathogen Analysis → ```