7 Epidemic Forecasting

Learning Objectives

This chapter distinguishes outbreak detection from epidemic forecasting. You will learn to:

Understand why epidemic forecasting is harder than weather forecasting
Implement mechanistic compartmental models (SIR, SEIR) and their assumptions
Apply statistical time series methods (ARIMA) for short-term prediction
Build LSTM neural networks for complex epidemic patterns
Create ensemble approaches combining multiple forecasting methods
Quantify uncertainty with prediction intervals and scenario analyses
Evaluate probabilistic predictions using proper scoring rules (CRPS, Brier)
Navigate ethical obligations when predictions influence behavior

Prerequisites: Disease Surveillance and Outbreak Detection recommended, Just Enough AI to Be Dangerous, The Data Problem.

📋 Chapter Summary (TL;DR)

The Big Picture: Epidemic forecasting is fundamentally harder than weather forecasting because humans change behavior in response to predictions (self-fulfilling/self-defeating prophecies), viruses evolve, and small errors cascade into massive divergence. COVID-19 exposed this brutally—even with unprecedented data, most models failed spectacularly.

Surveillance vs. Forecasting (Critical Distinction):

Surveillance: “An outbreak is happening NOW” → based on current data, relatively objective, verifiable immediately
Forecasting: “Here’s what will happen in 2 weeks” → predicting future, depends on assumptions, can’t verify until future arrives
Stakes: Get surveillance wrong → miss outbreak. Get forecasting wrong → waste billions, lose public trust, fail to prevent catastrophe

Why Forecasting Is So Hard:

The Butterfly Effect: R₀ = 2.5 → 500K cases/day. R₀ = 2.3 → 300K cases/day. R₀ = 0.9 → epidemic dies out. 10% reduction in transmission = difference between epidemic and no epidemic
Self-Defeating Prophecy: Dire forecast → behavior change → forecast doesn’t happen → public loses trust (“models were wrong!”)
Human Adaptation: People change behavior when they see forecasts, breaking model assumptions
Viral Evolution: Omicron invalidated every Delta-trained model overnight
Data Quality Issues: Reporting delays, testing bias, changing definitions (see Chapter 3)

Three Forecasting Approaches:

Mechanistic Models (SIR, SEIR)
- Based on epidemiological theory (compartments: Susceptible → Exposed → Infected → Recovered)
- Strengths: Interpretable, captures disease biology, scenario analysis
- Weaknesses: Strong assumptions (homogeneous mixing, constant parameters), parameter estimation challenges
- When to use: Understanding transmission dynamics, evaluating intervention scenarios
Statistical Time Series (ARIMA, Prophet)
- Learn patterns from historical data without biological assumptions
- Strengths: Good short-term predictions (1-2 weeks), minimal assumptions
- Weaknesses: No mechanism, fails when patterns change, can’t handle interventions
- When to use: Short-term operational forecasting with stable patterns
Machine Learning (LSTMs, Gradient Boosting)
- Neural networks capture complex temporal patterns
- Strengths: Flexible, handles non-linear relationships, integrates multiple data streams
- Weaknesses: “Black box,” requires large training data, overfits to specific outbreak contexts
- When to use: Rich feature sets, complex patterns, adequate training data

Ensemble Methods (The Best Approach):

Combine multiple models to hedge against individual failures. CDC FluSight uses ensemble averaging—typically outperforms any single model. Aggregating diverse approaches captures different aspects of epidemic dynamics.

Evaluation (Beyond Accuracy):

Proper Scoring Rules: CRPS (Continuous Ranked Probability Score), Brier Score—reward probabilistic calibration, not just point estimates
Prediction Intervals: Quantify uncertainty. 95% interval should contain truth 95% of time
Scenario Analysis: “If R₀ drops by 20%, what happens?” Not single prediction but range of possibilities
Short-term vs. Long-term: Models do okay 1-2 weeks ahead, fail catastrophically 4+ weeks ahead

The COVID-19 Forecasting Disaster:

March 2020 models: 100K-240K deaths with mitigation. Reality: 1.1M+ deaths
Omicron (Jan 2022): Models predicted moderate increase, reality was vertical spike
Why models failed: Behavioral responses not captured, viral evolution, testing bias confusing disease growth with testing expansion, parameter assumptions broke
Median error: 25-50% for 1-2 week forecasts. Worse for longer horizons.

Ethical Obligations:

Forecasts influence the outcomes they predict: - Transparency: Communicate uncertainty honestly, not false precision - Humility: Acknowledge limitations publicly - Avoid Crying Wolf: Repeated failed predictions destroy credibility for genuine threats - Equity: Ensure forecasts don’t worsen disparities (resource allocation based on predictions)

When to Use Which Approach:

Mechanistic (SIR/SEIR): Policy scenario analysis, understanding transmission, resource planning
Statistical (ARIMA): Short-term operational forecasting (hospital bed planning 1-2 weeks)
ML (LSTMs): Complex multi-source data, non-linear patterns, adequate training data
Ensemble: Whenever possible—hedge against model failures

The Takeaway for Public Health Practitioners:

Epidemic forecasting will always be uncertain—embrace it, quantify it, communicate it. Don’t chase false precision. Use forecasts for scenario planning (“If this happens, then that”), not crystal ball predictions. Mechanistic models help understand biology; statistical models work for short-term ops; ML adds flexibility but loses interpretability. Ensemble methods typically win. Most importantly: forecasts change behavior, which changes outcomes—this isn’t a bug, it’s the whole point. Failed forecasts that motivated effective action are successes, not failures.

7.1 Introduction: From “What Is Happening?” to “What Will Happen?”

March 2020, Washington, D.C.: The White House Coronavirus Task Force presents epidemic models to the President. The projections are sobering: without mitigation, 1.5-2.2 million Americans could die. With social distancing, 100,000-240,000 deaths.

The nation locked down. Schools closed. Businesses shuttered.

January 2023: The U.S. passes 1.1 million COVID-19 deaths—far exceeding even the “with mitigation” scenario.

What went wrong? Were the models bad? Was the data flawed? Or is epidemic forecasting simply impossible?

October 2021, Global: Multiple forecasting models completely missed the magnitude of the Omicron wave. Most predicted a moderate increase in cases. Reality: a vertical spike that shattered all previous records.

January 2022: Models that worked well for Delta suddenly failed for Omicron. Why? The virus evolved, immune landscapes changed, behavior shifted, and the fundamental assumptions of every model broke.

These aren’t just modeling failures—they’re lessons about the fundamental limits of prediction in complex adaptive systems.

7.1.1 The Critical Distinction

Surveillance (Chapter 4): “An outbreak is happening NOW” - Based on current/recent data - Relatively objective - Can be verified immediately

Forecasting (this chapter): “Here’s what will happen in 2 weeks” - Predicting the future - Depends on assumptions - Can’t be verified until the future arrives

The difference matters enormously:

For resource allocation: - How many hospital beds will we need? - Should we order more ventilators? - Do we need to open field hospitals?

For policy decisions: - Should we implement lockdowns? - When can schools safely reopen? - Should we mandate masks?

For public communication: - Should people cancel travel plans? - Is it safe to visit elderly relatives? - Do we need to stock up on supplies?

Get surveillance wrong → Miss an outbreak
Get forecasting wrong → Waste billions, lose public trust, or fail to prevent catastrophe

7.1.2 Why Forecasting Is Harder Than Detection

7.1.2.1 The Butterfly Effect in Epidemiology

In chaos theory, the butterfly effect describes how small changes in initial conditions lead to vastly different outcomes. Epidemics exhibit this behavior in extreme form:

Small changes in behavior → Massive changes in outcomes

R₀ = 2.5 (no interventions)
→ Peak: 500,000 cases/day

R₀ = 2.3 (modest social distancing)
→ Peak: 300,000 cases/day

R₀ = 2.0 (strong distancing)
→ Peak: 100,000 cases/day

R₀ = 0.9 (lockdown)
→ No peak, epidemic dies out

A 10% reduction in transmission can mean the difference between epidemic and no epidemic. Forecasting requires getting this exactly right.

7.1.2.2 The Self-Fulfilling/Self-Defeating Prophecy Problem

Unlike weather forecasting, epidemic forecasts influence the thing being forecasted.

Scenario 1: Self-defeating prophecy - Model predicts catastrophic outbreak - Public panics, implements strict measures - Outbreak is mild - Conclusion: “Model was wrong! Overreaction!”

Scenario 2: Self-fulfilling prophecy - Model predicts mild outbreak - Public relaxes precautions - Outbreak explodes - Conclusion: “Model was wrong! Should have prepared!”

This creates an impossible situation: Success makes you look wrong.

The Michael Fish Problem

In October 1987, BBC weatherman Michael Fish famously dismissed viewer concerns about an approaching hurricane: “Earlier on today, apparently, a woman rang the BBC and said she heard there was a hurricane on the way. Well, if you’re watching, don’t worry, there isn’t.”

That night, the Great Storm of 1987 hit the UK—the worst storm in 300 years. Eighteen people died. Millions of trees fell. The public never forgot.

Lesson for epidemic forecasting: Reassuring forecasts that turn out wrong destroy trust permanently. Conservative (alarming) forecasts that don’t materialize also destroy trust, but more slowly.

The safest strategy: Communicate uncertainty honestly. Never publish a single-number forecast without ranges.

7.1.2.3 The Horizon Problem

Forecast accuracy degrades rapidly with time:

Forecast Horizon	Typical Accuracy	Usefulness
1 week ahead	±10-20%	High - operational planning
2 weeks ahead	±20-40%	Moderate - resource allocation
4 weeks ahead	±40-100%	Low - scenario planning only
3+ months ahead	Often worse than naive baseline	Essentially guesswork

Why does accuracy degrade? - Interventions you can’t predict (lockdowns, vaccine campaigns) - Behavior changes (people react to case counts) - Viral evolution (new variants) - Reporting changes (testing policy shifts) - Weather and seasonality effects - Random superspreading events

The 2-week wall: Most successful forecasting systems focus on 1-2 week horizons. Anything beyond that is scenario planning, not prediction.

Case Study Preview: The COVID-19 Forecast Hub

The COVID-19 Forecast Hub, coordinated by the Reich Lab at UMass Amherst, collected forecasts from 50+ teams starting in March 2020.

Key findings from Cramer et al., 2022, PNAS (Cramer et al. 2022):

1-week ahead forecasts: Median error ~10-15% (useful!)
4-week ahead forecasts: Median error ~50% (barely better than guessing)
Ensemble models consistently outperformed any individual model
Novel variants caused all models to fail simultaneously

We’ll examine this in detail later in the chapter.

7.2 The Mechanistic Approach: Compartmental Models

Before machine learning existed, epidemiologists developed mechanistic models—mathematical frameworks based on how diseases actually spread.

These models remain essential today. Why? They’re interpretable, biologically grounded, and enable scenario planning.

7.2.1 SIR Model: The Foundation

The SIR model, introduced by Kermack and McKendrick in 1927, divides a population into three compartments:

S: Susceptible (can get infected)
I: Infected (currently infectious)
R: Recovered/Removed (immune or dead)

The flow:

S → I → R

Hide code

graph LR
    S[S: Susceptible] -->|β: Transmission rate| I[I: Infected]
    I -->|γ: Recovery rate| R[R: Recovered/Removed]

    style S fill:#e1f5dd
    style I fill:#ffe1e1
    style R fill:#e1e8ff

Figure 7.1: SIR compartmental model showing the flow from Susceptible to Infected to Recovered populations. The transmission rate β governs movement from S to I, while the recovery rate γ governs movement from I to R.

Governed by two parameters: - β (beta): Transmission rate (contacts per day × probability of transmission) - γ (gamma): Recovery rate (1/infectious period)

The basic reproduction number: \[R_0 = \frac{\beta}{\gamma}\]

\(R_0\) tells you: How many people does one infected person infect in a fully susceptible population?

\(R_0 < 1\): Epidemic dies out
\(R_0 = 1\): Endemic equilibrium
\(R_0 > 1\): Epidemic grows

The mathematics:

\[\frac{dS}{dt} = -\beta \frac{S \cdot I}{N}\]

\[\frac{dI}{dt} = \beta \frac{S \cdot I}{N} - \gamma I\]

\[\frac{dR}{dt} = \gamma I\]

Where \(N = S + I + R\) (total population).

Let’s implement this:

import numpy as np
from scipy.integrate import odeint
import matplotlib.pyplot as plt

def sir_model(y, t, beta, gamma, N):
    """
    Basic SIR model differential equations
    
    Parameters:
    - y: [S, I, R] current state
    - t: time
    - beta: transmission rate
    - gamma: recovery rate
    - N: total population
    
    Returns:
    - [dS/dt, dI/dt, dR/dt]
    """
    S, I, R = y
    
    dS_dt = -beta * S * I / N
    dI_dt = beta * S * I / N - gamma * I
    dR_dt = gamma * I
    
    return dS_dt, dI_dt, dR_dt

# Parameters for a measles-like disease
N = 1_000_000  # Total population (e.g., a city)
I0 = 100       # Initial infected
R0 = 0         # Initial recovered (no prior immunity)
S0 = N - I0 - R0  # Initial susceptible

beta = 0.5     # Transmission rate (contacts per day × transmission probability)
gamma = 0.1    # Recovery rate (1/10 day infectious period)

# Time vector (days)
t = np.linspace(0, 200, 200)

# Initial conditions
y0 = S0, I0, R0

# Solve ODE system
solution = odeint(sir_model, y0, t, args=(beta, gamma, N))
S, I, R = solution.T

# Calculate R₀
R0_value = beta / gamma
print(f"Basic Reproduction Number (R₀): {R0_value:.2f}")

# Visualize
fig, axes = plt.subplots(1, 2, figsize=(16, 6))

# Left panel: Epidemic curves
axes[0].plot(t, S, label='Susceptible', color='blue', linewidth=2)
axes[0].plot(t, I, label='Infected', color='red', linewidth=3)
axes[0].plot(t, R, label='Recovered', color='green', linewidth=2)

# Mark peak
peak_idx = np.argmax(I)
axes[0].axvline(t[peak_idx], color='red', linestyle='--', alpha=0.5, 
                label=f'Peak: Day {t[peak_idx]:.0f}')
axes[0].axhline(I[peak_idx], color='red', linestyle='--', alpha=0.5)

axes[0].set_xlabel('Days', fontsize=12)
axes[0].set_ylabel('Population', fontsize=12)
axes[0].set_title(f'SIR Model Epidemic Curve (R₀={R0_value:.1f})', fontsize=14)
axes[0].legend(fontsize=11)
axes[0].grid(True, alpha=0.3)

# Right panel: Phase portrait (S vs I)
axes[1].plot(S, I, 'k-', linewidth=2)
axes[1].scatter([S0], [I0], color='red', s=200, zorder=5, 
                marker='*', label='Start', edgecolors='black', linewidth=2)
axes[1].scatter([S[-1]], [I[-1]], color='green', s=200, zorder=5,
                marker='*', label='End', edgecolors='black', linewidth=2)

# Herd immunity threshold line
herd_immunity_threshold = N * (1 - 1/R0_value)
axes[1].axvline(herd_immunity_threshold, color='orange', linestyle='--',
                linewidth=2, label=f'Herd Immunity: {herd_immunity_threshold/N*100:.0f}%')

axes[1].set_xlabel('Susceptible Population', fontsize=12)
axes[1].set_ylabel('Infected Population', fontsize=12)
axes[1].set_title('Phase Portrait: S vs I', fontsize=14)
axes[1].legend(fontsize=11)
axes[1].grid(True, alpha=0.3)

plt.tight_layout()
plt.savefig('sir_basic_model.png', dpi=300)
plt.show()

# Calculate key metrics
peak_infected_idx = np.argmax(I)
peak_day = t[peak_infected_idx]
peak_cases = I[peak_infected_idx]
final_attack_rate = (N - S[-1]) / N * 100
herd_immunity = (1 - 1/R0_value) * 100

print("\n" + "="*60)
print("SIR MODEL RESULTS")
print("="*60)
print(f"\nR₀ (Basic Reproduction Number): {R0_value:.2f}")
print(f"Herd Immunity Threshold: {herd_immunity:.1f}% of population")
print(f"\nPeak Infections: {peak_cases:,.0f} people")
print(f"Peak Day: {peak_day:.0f}")
print(f"\nFinal Attack Rate: {final_attack_rate:.1f}% of population infected")
print(f"Final Susceptible: {S[-1]:,.0f} people ({S[-1]/N*100:.1f}%)")
print(f"Total Infected: {R[-1]:,.0f} people")

# Demonstrate impact of interventions
print("\n" + "="*60)
print("SCENARIO ANALYSIS: Impact of Interventions")
print("="*60)

scenarios = {
    'No intervention (R₀=2.5)': 2.5,
    'Masks + distancing (R₀=2.0)': 2.0,
    'Partial lockdown (R₀=1.5)': 1.5,
    'Full lockdown (R₀=0.9)': 0.9
}

for scenario_name, R0_scenario in scenarios.items():
    beta_scenario = R0_scenario * gamma
    solution_scenario = odeint(sir_model, y0, t, args=(beta_scenario, gamma, N))
    S_s, I_s, R_s = solution_scenario.T
    
    peak_infections = I_s.max()
    total_infected = R_s[-1]
    attack_rate = total_infected / N * 100
    
    print(f"\n{scenario_name}")
    print(f"  Peak infections: {peak_infections:,.0f}")
    print(f"  Total infected: {total_infected:,.0f} ({attack_rate:.1f}%)")

Understanding R₀

R₀ is the most important number in epidemiology, but it’s often misunderstood.

What it is: - Number of secondary infections caused by one infected person in a fully susceptible population - Determines whether an epidemic grows or dies out - Threshold for herd immunity: \(1 - 1/R_0\)

What it is NOT: - Not the number of people infected per day (that’s incidence) - Not constant over time (behavior, interventions, immunity all change it) - Not inherent to a pathogen alone (depends on population behavior and contact patterns)

Examples: - Measles: \(R_0 \approx\) 12-18 (extremely contagious) - Influenza: \(R_0 \approx\) 1.3-1.8 (moderately contagious) - Original SARS-CoV-2: \(R_0 \approx\) 2-3 - Delta variant: \(R_0 \approx\) 5-8 - Omicron: \(R_0 \approx\) 10-15

For comprehensive review, see Delamater et al., 2019, Emerging Infectious Diseases.

7.2.2 SEIR Model: Adding the Latent Period

The SIR model assumes people become infectious immediately after infection. For most diseases, this is wrong—there’s an incubation period.

SEIR adds the Exposed compartment:

S → E → I → R

Hide code

graph LR
    S[S: Susceptible] -->|β: Transmission rate| E[E: Exposed]
    E -->|σ: Incubation rate| I[I: Infected]
    I -->|γ: Recovery rate| R[R: Recovered/Removed]

    style S fill:#e1f5dd
    style E fill:#fff4e1
    style I fill:#ffe1e1
    style R fill:#e1e8ff

Figure 7.2: SEIR compartmental model extending SIR with an Exposed (E) compartment representing the incubation/latent period. Infected individuals (E) are not yet infectious, making this model more realistic for diseases like COVID-19, influenza, and Ebola that have significant incubation periods.

E: Exposed (infected but not yet infectious)

New parameter: - σ (sigma): Rate of progression from E to I (1/incubation period)

The mathematics:

\[\frac{dS}{dt} = -\beta \frac{S \cdot I}{N}\]

\[\frac{dE}{dt} = \beta \frac{S \cdot I}{N} - \sigma E\]

\[\frac{dI}{dt} = \sigma E - \gamma I\]

\[\frac{dR}{dt} = \gamma I\]

Why this matters:

For diseases like COVID-19 with long incubation periods (5-6 days), the SEIR model produces more realistic epidemic curves with delayed and broader peaks.

def seir_model(y, t, beta, sigma, gamma, N):
    """
    SEIR model with exposed/latent period
    
    Parameters:
    - y: [S, E, I, R] current state
    - sigma: rate of progression from E to I (1/incubation period)
    """
    S, E, I, R = y
    
    dS_dt = -beta * S * I / N
    dE_dt = beta * S * I / N - sigma * E
    dI_dt = sigma * E - gamma * I
    dR_dt = gamma * I
    
    return dS_dt, dE_dt, dI_dt, dR_dt

# Parameters for COVID-19-like disease
sigma = 1/5.5  # 5.5-day incubation period
gamma = 1/10   # 10-day infectious period
beta = 0.5

E0 = 50   # Initial exposed
I0 = 10   # Initial infected
R0 = 0
S0 = N - E0 - I0 - R0

# Solve SEIR
y0_seir = S0, E0, I0, R0
solution_seir = odeint(seir_model, y0, t, args=(beta, sigma, gamma, N))
S_seir, E_seir, I_seir, R_seir = solution_seir.T

# Solve SIR for comparison (combine E and I into I)
y0_sir = S0 + E0, I0, R0
solution_sir = odeint(sir_model, y0_sir, t, args=(beta, gamma, N))
S_sir, I_sir, R_sir = solution_sir.T

# Compare
fig, axes = plt.subplots(2, 1, figsize=(14, 10))

# Top panel: SEIR compartments
axes[0].plot(t, S_seir, label='Susceptible', color='blue', linewidth=2)
axes[0].plot(t, E_seir, label='Exposed (not yet infectious)', color='orange', linewidth=2)
axes[0].plot(t, I_seir, label='Infected (infectious)', color='red', linewidth=3)
axes[0].plot(t, R_seir, label='Recovered', color='green', linewidth=2)

axes[0].set_ylabel('Population', fontsize=12)
axes[0].set_title('SEIR Model: Four Compartments', fontsize=14)
axes[0].legend(fontsize=11)
axes[0].grid(True, alpha=0.3)

# Bottom panel: Compare infectious populations (SIR vs SEIR)
axes[1].plot(t, I_sir, '--', label='SIR: Infected (no latency)', 
            color='pink', linewidth=3, alpha=0.7)
axes[1].plot(t, I_seir, '-', label='SEIR: Infected (with latent period)', 
            color='red', linewidth=3)
axes[1].plot(t, E_seir, '-', label='SEIR: Exposed (not infectious)', 
            color='orange', linewidth=2, alpha=0.7)

# Mark peaks
sir_peak_day = t[np.argmax(I_sir)]
seir_peak_day = t[np.argmax(I_seir)]
delay = seir_peak_day - sir_peak_day

axes[1].axvline(sir_peak_day, color='pink', linestyle=':', alpha=0.5)
axes[1].axvline(seir_peak_day, color='red', linestyle=':', alpha=0.5)

axes[1].annotate(f'Peak delay: {delay:.0f} days', 
                xy=(seir_peak_day, I_seir.max()),
                xytext=(seir_peak_day + 20, I_seir.max() * 0.8),
                arrowprops=dict(arrowstyle='->', color='black', lw=2),
                fontsize=12, bbox=dict(boxstyle='round', facecolor='wheat'))

axes[1].set_xlabel('Days', fontsize=12)
axes[1].set_ylabel('Population', fontsize=12)
axes[1].set_title('SIR vs SEIR: Impact of Latent Period', fontsize=14)
axes[1].legend(fontsize=11)
axes[1].grid(True, alpha=0.3)

plt.tight_layout()
plt.savefig('seir_vs_sir.png', dpi=300)
plt.show()

print(f"\nPeak timing difference: {delay:.1f} days")
print(f"SIR peak: {I_sir.max():,.0f} people on day {sir_peak_day:.0f}")
print(f"SEIR peak: {I_seir.max():,.0f} people on day {seir_peak_day:.0f}")
print(f"Peak reduction: {(1 - I_seir.max()/I_sir.max())*100:.1f}%")

Why the Latent Period Matters

The E compartment: 1. Delays the epidemic peak (by ~1 incubation period) 2. Reduces peak height slightly (infections spread out more) 3. Makes interventions less effective if they only target symptomatic individuals (E people are spreading it without symptoms) 4. Creates a “pre-symptomatic transmission” problem (critical for COVID-19)

For diseases with long incubation periods (COVID-19, Ebola), SEIR is essential. For diseases with short/negligible incubation (norovirus), SIR suffices.

7.2.3 Advanced Compartmental Models

Real epidemic modeling often requires more complexity:

7.2.3.1 SEIRD: Adding Death

\[S \to E \to I \to \begin{cases} R \text{ (recovered)} \\ D \text{ (dead)} \end{cases}\]

Split \(I \to R\) into \(I \to R\) (probability \(1-\text{CFR}\)) and \(I \to D\) (probability \(\text{CFR}\))
Essential for forecasting mortality burden

7.2.3.2 SEIRS: Waning Immunity

\[S \to E \to I \to R \to S\]

Immunity doesn’t last forever (influenza, coronaviruses)
Creates endemic equilibrium (disease doesn’t disappear)

7.2.3.3 Age-Structured Models

Different parameters for different age groups:

# Children (0-17)
beta_children = 0.3
gamma_children = 0.15

# Adults (18-64)
beta_adults = 0.5
gamma_adults = 0.1

# Elderly (65+)
beta_elderly = 0.4
gamma_elderly = 0.08
cfr_elderly = 0.15  # Higher case fatality rate

Critical for COVID-19 where age dramatically affects outcomes.

7.2.3.4 Spatial Models

Divide population into geographic regions with mobility between them:

# Movement between regions
mobility_matrix = [
    [0.95, 0.03, 0.02],  # Region 1: 95% stay, 3% to R2, 2% to R3
    [0.02, 0.96, 0.02],  # Region 2
    [0.01, 0.02, 0.97]   # Region 3
]

For more on spatial models, see Keeling & Rohani, 2008, Modeling Infectious Diseases.

7.2.4 Why Mechanistic Models Matter

Strengths:

✅ Interpretable: Every parameter has biological meaning (transmission rate, recovery rate, etc.)
✅ Scenario planning: Can simulate interventions (vaccination campaigns, social distancing, school closures)
✅ Theory-grounded: Built on century of epidemiological knowledge
✅ Extrapolation: Can predict novel scenarios (new pathogen, untested interventions)
✅ Policy-relevant: Policymakers understand “if R₀ drops to X, we get Y cases”

Limitations:

❌ Simplistic assumptions: Homogeneous mixing (everyone contacts everyone equally—clearly false)
❌ Constant parameters: \(\beta\) and \(\gamma\) change with behavior, seasons, immunity
❌ No heterogeneity: Superspreaders, age structure, spatial clustering all ignored in basic models
❌ Parameter uncertainty: Don’t know true \(\beta\), \(\gamma\) until epidemic is over
❌ Forecast degradation: Accurate for 1-2 weeks, unreliable beyond that

The Fundamental Tension

Mechanistic models are: - Transparent and interpretable - Based on biological mechanisms - But require assumptions that are often violated in reality

Machine learning models (next section) are: - Flexible and data-driven - Don’t require mechanistic assumptions - But opaque and hard to extrapolate beyond training data

Neither alone is sufficient. The best forecasting systems use both.

7.3 The Machine Learning Approach: Data-Driven Forecasting

What if we don’t want to specify transmission rates and recovery rates? What if we just want to learn patterns from data?

Enter machine learning forecasting.

7.3.1 Time Series Forecasting with ARIMA

ARIMA (AutoRegressive Integrated Moving Average) is a classical statistical method for time series forecasting.

Why ARIMA for epidemics: - Captures temporal autocorrelation (yesterday’s cases predict today’s) - Handles seasonality (flu peaks in winter, dengue in rainy season) - No mechanistic assumptions required - Works with limited historical data

ARIMA components:

AR (AutoRegressive): Use past values to predict future \[y_t = c + \phi_1 y_{t-1} + \phi_2 y_{t-2} + ... + \epsilon_t\]
I (Integrated): Differencing to make series stationary \[\Delta y_t = y_t - y_{t-1}\]
MA (Moving Average): Use past errors to improve prediction \[y_t = \mu + \epsilon_t + \theta_1 \epsilon_{t-1} + \theta_2 \epsilon_{t-2} + ...\]

ARIMA(p, d, q): - p: Number of autoregressive terms - d: Degree of differencing - q: Number of moving average terms

For seasonal data: SARIMA adds seasonal components.

Complete implementation for influenza forecasting:

import pandas as pd
import numpy as np
from statsmodels.tsa.statespace.sarimax import SARIMAX
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf
from statsmodels.tsa.seasonal import seasonal_decompose
import matplotlib.pyplot as plt
from sklearn.metrics import mean_absolute_error, mean_squared_error

# Simulate realistic influenza data (ILINet-style)
np.random.seed(42)
weeks = pd.date_range('2018-01-01', periods=260, freq='W-MON')
week_of_year = weeks.isocalendar().week

# Seasonal flu pattern (peaks in winter)
baseline = 2.0 + 2.5 * np.exp(-((week_of_year - 6) % 52)**2 / 50)

# Add year-to-year variation
year_effect = np.repeat([0, 0.3, -0.2, 0.5, 0.1], 52)[:len(weeks)]

# Random noise
noise = np.random.normal(0, 0.25, len(weeks))

# Combine
ili_pct = baseline + year_effect + noise
ili_pct = np.maximum(ili_pct, 0.5)  # ILI never goes to zero

df = pd.DataFrame({'date': weeks, 'ili_pct': ili_pct})
df.set_index('date', inplace=True)

print(df.head(10))
print(f"\nTotal weeks: {len(df)}")
print(f"Date range: {df.index.min()} to {df.index.max()}")

# Train/test split (reserve last season for testing)
train_size = 208  # 4 years training
train = df.iloc[:train_size]
test = df.iloc[train_size:]

print(f"\nTraining weeks: {len(train)}")
print(f"Test weeks: {len(test)}")

# Exploratory analysis
fig, axes = plt.subplots(3, 1, figsize=(14, 12))

# Raw time series
axes[0].plot(train.index, train['ili_pct'], label='Training Data', color='blue')
axes[0].plot(test.index, test['ili_pct'], label='Test Data', color='orange')
axes[0].set_ylabel('ILI %', fontsize=12)
axes[0].set_title('Influenza-Like Illness Time Series', fontsize=14)
axes[0].legend()
axes[0].grid(True, alpha=0.3)

# ACF (Autocorrelation Function)
plot_acf(train['ili_pct'], lags=52, ax=axes[1], alpha=0.05)
axes[1].set_title('Autocorrelation Function (ACF)', fontsize=14)
axes[1].set_xlabel('Lag (weeks)')

# PACF (Partial Autocorrelation Function)
plot_pacf(train['ili_pct'], lags=52, ax=axes[2], alpha=0.05)
axes[2].set_title('Partial Autocorrelation Function (PACF)', fontsize=14)
axes[2].set_xlabel('Lag (weeks)')

plt.tight_layout()
plt.savefig('arima_exploratory.png', dpi=300)
plt.show()

# Seasonal decomposition
decomposition = seasonal_decompose(train['ili_pct'], model='additive', period=52)

fig, axes = plt.subplots(4, 1, figsize=(14, 12))
decomposition.observed.plot(ax=axes[0], title='Observed')
decomposition.trend.plot(ax=axes[1], title='Trend')
decomposition.seasonal.plot(ax=axes[2], title='Seasonal')
decomposition.resid.plot(ax=axes[3], title='Residual')

for ax in axes:
    ax.grid(True, alpha=0.3)

plt.tight_layout()
plt.savefig('seasonal_decomposition.png', dpi=300)
plt.show()

# Fit SARIMA model
# SARIMA(p, d, q)(P, D, Q, s)
# (2, 0, 1) for non-seasonal components
# (1, 0, 1, 52) for seasonal components with 52-week period

print("\nFitting SARIMA model...")
model = SARIMAX(
    train['ili_pct'],
    order=(2, 0, 1),              # (p, d, q)
    seasonal_order=(1, 0, 1, 52), # (P, D, Q, s)
    enforce_stationarity=False,
    enforce_invertibility=False
)

fit = model.fit(disp=False, maxiter=200)

print("\n" + "="*60)
print("MODEL SUMMARY")
print("="*60)
print(fit.summary())

# Forecast
forecast_steps = len(test)
forecast_result = fit.get_forecast(steps=forecast_steps)

# Get predictions and confidence intervals
forecast_mean = forecast_result.predicted_mean
forecast_ci = forecast_result.conf_int()

# Create forecast dataframe
forecast_df = pd.DataFrame({
    'date': test.index,
    'actual': test['ili_pct'].values,
    'forecast': forecast_mean.values,
    'lower_ci': forecast_ci.iloc[:, 0].values,
    'upper_ci': forecast_ci.iloc[:, 1].values
})

# Visualize forecast
fig, axes = plt.subplots(2, 1, figsize=(16, 10))

# Top: Full time series with forecast
axes[0].plot(train.index, train['ili_pct'], 'o-', 
            label='Training Data', color='blue', alpha=0.6, markersize=4)
axes[0].plot(test.index, test['ili_pct'], 'o-',
            label='Actual (Test)', color='black', linewidth=2, markersize=6)
axes[0].plot(forecast_df['date'], forecast_df['forecast'], 's-',
            label='SARIMA Forecast', color='red', linewidth=2, markersize=6)

# Confidence interval
axes[0].fill_between(forecast_df['date'],
                     forecast_df['lower_ci'],
                     forecast_df['upper_ci'],
                     alpha=0.3, color='red', label='95% Confidence Interval')

axes[0].axvline(train.index[-1], color='green', linestyle='--', 
               linewidth=2, label='Train/Test Split')

axes[0].set_ylabel('ILI %', fontsize=12)
axes[0].set_title('SARIMA Forecast: Influenza-Like Illness', fontsize=14)
axes[0].legend(fontsize=10)
axes[0].grid(True, alpha=0.3)

# Bottom: Forecast errors
errors = forecast_df['actual'] - forecast_df['forecast']
colors = ['red' if e < 0 else 'green' for e in errors]

axes[1].bar(forecast_df['date'], errors, color=colors, alpha=0.6, edgecolor='black')
axes[1].axhline(0, color='black', linestyle='-', linewidth=1.5)

# Mark forecast horizons
for horizon, label in [(4, '1 month'), (13, '3 months')]:
    if horizon < len(errors):
        axes[1].axvline(forecast_df['date'].iloc[horizon], 
                       color='blue', linestyle=':', linewidth=2)
        axes[1].text(forecast_df['date'].iloc[horizon], errors.max() * 0.9,
                    f'{label} ahead', rotation=90, fontsize=10)

axes[1].set_xlabel('Date', fontsize=12)
axes[1].set_ylabel('Forecast Error (Actual - Predicted)', fontsize=12)
axes[1].set_title('Forecast Errors Over Time', fontsize=14)
axes[1].grid(True, alpha=0.3)

plt.tight_layout()
plt.savefig('arima_flu_forecast.png', dpi=300)
plt.show()

# Evaluate performance at different horizons
horizons = [1, 4, 8, 13]  # 1 week, 1 month, 2 months, 3 months

print("\n" + "="*60)
print("FORECAST PERFORMANCE BY HORIZON")
print("="*60)

for horizon in horizons:
    if horizon <= len(errors):
        subset_actual = forecast_df['actual'].iloc[:horizon]
        subset_pred = forecast_df['forecast'].iloc[:horizon]
        
        mae = mean_absolute_error(subset_actual, subset_pred)
        rmse = np.sqrt(mean_squared_error(subset_actual, subset_pred))
        mape = np.mean(np.abs((subset_actual - subset_pred) / subset_actual)) * 100
        
        # Coverage: % of actuals within CI
        subset_lower = forecast_df['lower_ci'].iloc[:horizon]
        subset_upper = forecast_df['upper_ci'].iloc[:horizon]
        coverage = ((subset_actual >= subset_lower) & (subset_actual <= subset_upper)).mean() * 100
        
        print(f"\n{horizon} week(s) ahead:")
        print(f"  MAE:  {mae:.3f}")
        print(f"  RMSE: {rmse:.3f}")
        print(f"  MAPE: {mape:.1f}%")
        print(f"  95% CI Coverage: {coverage:.0f}%")

# Compare to naive baseline (persistence forecast)
naive_forecast = train['ili_pct'].iloc[-1]  # Last observed value
naive_errors = forecast_df['actual'] - naive_forecast
naive_mae = mean_absolute_error(forecast_df['actual'], [naive_forecast]*len(forecast_df))
arima_mae = mean_absolute_error(forecast_df['actual'], forecast_df['forecast'])

improvement = (1 - arima_mae / naive_mae) * 100

print("\n" + "="*60)
print("COMPARISON TO NAIVE BASELINE")
print("="*60)
print(f"Naive forecast (persistence): MAE = {naive_mae:.3f}")
print(f"SARIMA forecast: MAE = {arima_mae:.3f}")
print(f"Improvement: {improvement:.1f}%")

When ARIMA Works (and When It Doesn’t)

ARIMA excels when: ✓ Disease has regular seasonal patterns (influenza, RSV, dengue)
✓ Reporting is stable and consistent
✓ Short-term forecasts (1-4 weeks)
✓ No major interventions or behavior changes expected
✓ Historical data covers multiple seasons

ARIMA struggles when: ✗ Novel outbreak (no historical pattern to learn)
✗ Sudden interventions (lockdowns, vaccine rollouts)
✗ Reporting systems change
✗ Long-term forecasts (>4 weeks)
✗ Nonlinear, chaotic dynamics

For comprehensive ARIMA tutorial, see Hyndman & Athanasopoulos, Forecasting: Principles and Practice.

7.3.2 Deep Learning: LSTMs for Epidemic Forecasting

LSTM (Long Short-Term Memory) networks are a type of recurrent neural network designed for sequential data.

Why LSTMs for epidemics: - Can capture long-term dependencies (last month’s cases affect this month) - Handle multiple input features (cases + mobility + weather + interventions) - Don’t require stationarity assumptions - Can learn nonlinear patterns

The basic architecture:

Input sequence → LSTM layers → Dense layers → Output (future cases)
   (past 14 days)                              (next 7 days)

Complete implementation for COVID-19 forecasting:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.preprocessing import MinMaxScaler
from sklearn.metrics import mean_absolute_error, mean_squared_error
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers
import warnings
warnings.filterwarnings('ignore')

# Set random seeds for reproducibility
np.random.seed(42)
tf.random.set_seed(42)

# Simulate realistic COVID-19 case data
days = np.arange(500)

# Multi-wave pattern (Original → Delta → Omicron-like)
wave1 = 500 * np.exp(-((days - 50)**2) / 300)
wave2 = 2000 * np.exp(-((days - 200)**2) / 400)
wave3 = 5000 * np.exp(-((days - 350)**2) / 500)

# Combine waves with baseline and noise
cases = 100 + wave1 + wave2 + wave3 + np.random.normal(0, 100, len(days))
cases = np.maximum(cases, 50)  # Floor at 50 cases/day

# Add day-of-week effect (reporting artifacts)
day_of_week_effect = -200 * (days % 7 >= 5)  # Lower on weekends
cases = cases + day_of_week_effect
cases = np.maximum(cases, 0)

# Create DataFrame
df = pd.DataFrame({
    'day': days,
    'cases': cases
})

print(f"Total days: {len(df)}")
print(f"Case range: {cases.min():.0f} to {cases.max():.0f}")

# Visualize raw data
plt.figure(figsize=(14, 6))
plt.plot(df['day'], df['cases'], 'o-', alpha=0.6, markersize=3)
plt.xlabel('Day', fontsize=12)
plt.ylabel('Daily Cases', fontsize=12)
plt.title('Simulated COVID-19 Daily Cases (Multi-Wave Pattern)', fontsize=14)
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.savefig('lstm_data_raw.png', dpi=300)
plt.show()

# Prepare data for LSTM
def create_sequences(data, seq_length, forecast_horizon=1):
    """
    Create input sequences and targets for LSTM
    
    Parameters:
    - data: 1D array of values
    - seq_length: number of past time steps to use
    - forecast_horizon: number of future time steps to predict
    
    Returns:
    - X: [n_samples, seq_length] input sequences
    - y: [n_samples, forecast_horizon] target values
    """
    X, y = [], []
    
    for i in range(len(data) - seq_length - forecast_horizon + 1):
        # Input: seq_length days of history
        X.append(data[i:i+seq_length])
        
        # Target: next forecast_horizon days
        if forecast_horizon == 1:
            y.append(data[i+seq_length])
        else:
            y.append(data[i+seq_length:i+seq_length+forecast_horizon])
    
    return np.array(X), np.array(y)

# Normalize data (LSTM trains better with normalized inputs)
scaler = MinMaxScaler(feature_range=(0, 1))
cases_scaled = scaler.fit_transform(df['cases'].values.reshape(-1, 1)).flatten()

# Create sequences (use past 14 days to predict next day)
seq_length = 14
forecast_horizon = 1

X, y = create_sequences(cases_scaled, seq_length, forecast_horizon)

print(f"\nSequence creation:")
print(f"Input shape: {X.shape}")  # (n_samples, seq_length)
print(f"Target shape: {y.shape}")  # (n_samples,)

# Train/test split (80/20)
train_size = int(0.8 * len(X))
X_train, X_test = X[:train_size], X[train_size:]
y_train, y_test = y[:train_size], y[train_size:]

print(f"\nTrain samples: {len(X_train)}")
print(f"Test samples: {len(X_test)}")

# Reshape for LSTM [samples, time steps, features]
X_train = X_train.reshape((X_train.shape[0], X_train.shape[1], 1))
X_test = X_test.reshape((X_test.shape[0], X_test.shape[1], 1))

print(f"\nReshaped for LSTM:")
print(f"X_train shape: {X_train.shape}")
print(f"X_test shape: {X_test.shape}")

# Build LSTM model
model = keras.Sequential([
    # First LSTM layer (return sequences for stacking)
    layers.LSTM(64, activation='relu', return_sequences=True, 
                input_shape=(seq_length, 1)),
    layers.Dropout(0.2),
    
    # Second LSTM layer
    layers.LSTM(32, activation='relu'),
    layers.Dropout(0.2),
    
    # Dense layers
    layers.Dense(16, activation='relu'),
    layers.Dense(forecast_horizon)  # Output layer
])

model.compile(
    optimizer='adam',
    loss='mse',
    metrics=['mae']
)

print("\n" + "="*60)
print("LSTM MODEL ARCHITECTURE")
print("="*60)
model.summary()

# Train model
print("\n" + "="*60)
print("TRAINING MODEL")
print("="*60)

early_stopping = keras.callbacks.EarlyStopping(
    monitor='val_loss',
    patience=10,
    restore_best_weights=True
)

history = model.fit(
    X_train, y_train,
    epochs=100,
    batch_size=32,
    validation_split=0.2,
    callbacks=[early_stopping],
    verbose=1
)

# Make predictions
y_train_pred = model.predict(X_train, verbose=0)
y_test_pred = model.predict(X_test, verbose=0)

# Inverse transform to original scale
y_train_actual = scaler.inverse_transform(y_train.reshape(-1, 1)).flatten()
y_train_pred_original = scaler.inverse_transform(y_train_pred).flatten()

y_test_actual = scaler.inverse_transform(y_test.reshape(-1, 1)).flatten()
y_test_pred_original = scaler.inverse_transform(y_test_pred).flatten()

# Evaluate
train_mae = mean_absolute_error(y_train_actual, y_train_pred_original)
train_rmse = np.sqrt(mean_squared_error(y_train_actual, y_train_pred_original))

test_mae = mean_absolute_error(y_test_actual, y_test_pred_original)
test_rmse = np.sqrt(mean_squared_error(y_test_actual, y_test_pred_original))

print("\n" + "="*60)
print("MODEL PERFORMANCE")
print("="*60)
print(f"\nTraining Set:")
print(f"  MAE:  {train_mae:.1f} cases")
print(f"  RMSE: {train_rmse:.1f} cases")

print(f"\nTest Set:")
print(f"  MAE:  {test_mae:.1f} cases")
print(f"  RMSE: {test_rmse:.1f} cases")

# Visualize results
fig, axes = plt.subplots(3, 1, figsize=(16, 14))

# Top: Training history
axes[0].plot(history.history['loss'], label='Training Loss', linewidth=2)
axes[0].plot(history.history['val_loss'], label='Validation Loss', linewidth=2)
axes[0].set_xlabel('Epoch', fontsize=12)
axes[0].set_ylabel('Loss (MSE)', fontsize=12)
axes[0].set_title('LSTM Training History', fontsize=14)
axes[0].legend(fontsize=11)
axes[0].grid(True, alpha=0.3)

# Middle: Predictions vs actual (full dataset)
train_days = days[seq_length:seq_length+len(y_train_actual)]
test_days = days[seq_length+len(y_train_actual):seq_length+len(y_train_actual)+len(y_test_actual)]

axes[1].plot(train_days, y_train_actual, 'o-', 
            label='Actual (Training)', color='blue', alpha=0.5, markersize=3)
axes[1].plot(test_days, y_test_actual, 'o-',
            label='Actual (Test)', color='black', linewidth=2, markersize=4)
axes[1].plot(test_days, y_test_pred_original, 's-',
            label='LSTM Forecast', color='red', linewidth=2, markersize=4)

axes[1].axvline(train_days[-1], color='green', linestyle='--', 
               linewidth=2, label='Train/Test Split')

axes[1].set_xlabel('Day', fontsize=12)
axes[1].set_ylabel('Daily Cases', fontsize=12)
axes[1].set_title('LSTM Forecast: COVID-19 Daily Cases', fontsize=14)
axes[1].legend(fontsize=10)
axes[1].grid(True, alpha=0.3)

# Bottom: Forecast errors
errors = y_test_actual - y_test_pred_original
colors = ['red' if e < 0 else 'green' for e in errors]

axes[2].bar(test_days, errors, color=colors, alpha=0.6, edgecolor='black')
axes[2].axhline(0, color='black', linestyle='-', linewidth=1.5)

axes[2].set_xlabel('Day', fontsize=12)
axes[2].set_ylabel('Forecast Error (Actual - Predicted)', fontsize=12)
axes[2].set_title('LSTM Forecast Errors', fontsize=14)
axes[2].grid(True, alpha=0.3)

plt.tight_layout()
plt.savefig('lstm_covid_forecast.png', dpi=300)
plt.show()

# Multi-step ahead forecasting
print("\n" + "="*60)
print("MULTI-STEP AHEAD FORECASTING")
print("="*60)

# Use the last seq_length days as input
last_sequence = cases_scaled[-seq_length:]
current_sequence = last_sequence.reshape(1, seq_length, 1)

# Forecast next 14 days iteratively
forecast_days = 14
multi_step_forecast = []

for i in range(forecast_days):
    # Predict next day
    next_pred = model.predict(current_sequence, verbose=0)[0, 0]
    multi_step_forecast.append(next_pred)
    
    # Update sequence: drop oldest, add prediction
    current_sequence = np.append(current_sequence[0, 1:, 0], next_pred).reshape(1, seq_length, 1)

# Inverse transform
multi_step_forecast_original = scaler.inverse_transform(
    np.array(multi_step_forecast).reshape(-1, 1)
).flatten()

# Visualize multi-step forecast
fig, ax = plt.subplots(figsize=(14, 6))

# Historical data
ax.plot(days, cases, 'o-', label='Historical Cases', 
        color='blue', alpha=0.6, markersize=3)

# Multi-step forecast
forecast_days_array = np.arange(len(days), len(days) + forecast_days)
ax.plot(forecast_days_array, multi_step_forecast_original, 's-',
        label=f'{forecast_days}-Day Forecast', color='red', 
        linewidth=3, markersize=8)

ax.axvline(days[-1], color='green', linestyle='--', linewidth=2,
          label='Forecast Start')

ax.set_xlabel('Day', fontsize=12)
ax.set_ylabel('Daily Cases', fontsize=12)
ax.set_title(f'LSTM Multi-Step Ahead Forecast ({forecast_days} days)', fontsize=14)
ax.legend(fontsize=11)
ax.grid(True, alpha=0.3)

plt.tight_layout()
plt.savefig('lstm_multistep_forecast.png', dpi=300)
plt.show()

print(f"\n{forecast_days}-day forecast:")
for i, pred in enumerate(multi_step_forecast_original, 1):
    print(f"  Day {i}: {pred:.0f} cases")

The LSTM Limitation: No Mechanistic Understanding

LSTMs learn patterns, not mechanisms.

Example scenario: - Training data includes natural epidemic curve + lockdown → cases drop - LSTM learns: “After cases reach X, they drop” - New epidemic: Cases reach X, but no lockdown this time - LSTM predicts drop (wrong!)

The problem: LSTM has no concept of “lockdown caused the drop.” It just sees a pattern.

When this matters: - Novel interventions (no historical data) - Changing policy responses - New variants with different characteristics - Long-term forecasts where context matters

Solution: Use LSTMs for short-term pattern recognition, mechanistic models for scenario planning.

For practical deep learning in epidemiology, see Chimmula & Zhang, 2020, Chaos, Solitons & Fractals.

7.4 Ensemble Methods: Combining Models

No single model is best for all situations. Ensemble methods combine multiple models to create more robust forecasts.

7.4.1 The CDC FluSight Challenge

The CDC’s FluSight challenge, running since 2013, invites teams to forecast influenza-like illness (ILI) weekly.

The setup: - Teams submit forecasts for 1-4 weeks ahead - Also forecast: peak week, peak intensity, onset week - Evaluated on prediction interval coverage and point forecast accuracy

Key finding (Reich et al., 2019, Clinical Infectious Diseases) (Reich et al. 2019):

Ensemble models consistently outperformed any individual model.

Why ensembles work:

Different models capture different patterns
- Mechanistic models: Understand transmission dynamics
- ARIMA: Captures seasonal patterns
- ML models: Learn complex nonlinear relationships
Averaging reduces overfitting
- Individual models may be overconfident
- Ensemble spreads risk across models
Robustness to model-specific failures
- If one model fails (e.g., unexpected intervention), others compensate

7.4.2 Building an Ensemble Forecaster

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from scipy.integrate import odeint
from statsmodels.tsa.statespace.sarimax import SARIMAX
from sklearn.metrics import mean_absolute_error
import warnings
warnings.filterwarnings('ignore')

# Generate synthetic epidemic data
np.random.seed(42)
days = np.arange(200)

# Simulated cases (mix of exponential growth + intervention)
phase1 = 100 * np.exp(0.03 * days[:80])  # Growth phase
phase2 = phase1[-1] * np.exp(-0.02 * (days[80:] - 80))  # Decline (intervention)

cases = np.concatenate([phase1, phase2])
cases = cases + np.random.normal(0, cases * 0.1)  # Add noise
cases = np.maximum(cases, 10)

df = pd.DataFrame({'day': days, 'cases': cases})

# Split: train on first 150 days, forecast next 50
train_size = 150
train = df.iloc[:train_size]
test = df.iloc[train_size:]

print(f"Training days: {len(train)}")
print(f"Test days: {len(test)}")

# === MODEL 1: SEIR Mechanistic Model ===

def seir_forecast(train_data, forecast_steps, beta=0.03, sigma=0.2, gamma=0.1):
    """
    Fit SEIR model to training data and forecast
    """
    # Population
    N = 1000000
    
    # Initial conditions (estimate from data)
    I0 = train_data['cases'].iloc[0]
    E0 = I0
    R0_init = 0
    S0 = N - E0 - I0 - R0_init
    
    def seir_model(y, t, beta, sigma, gamma, N):
        S, E, I, R = y
        dS = -beta * S * I / N
        dE = beta * S * I / N - sigma * E
        dI = sigma * E - gamma * I
        dR = gamma * I
        return dS, dE, dI, dR
    
    # Solve SEIR for training + forecast period
    t = np.arange(len(train_data) + forecast_steps)
    y0 = S0, E0, I0, R0_init
    
    solution = odeint(seir_model, y0, t, args=(beta, sigma, gamma, N))
    _, _, I, _ = solution.T
    
    # Return only forecast period
    forecast = I[len(train_data):]
    
    return forecast

seir_forecast_values = seir_forecast(train, len(test))

# === MODEL 2: ARIMA ===

def arima_forecast(train_data, forecast_steps):
    """
    Fit ARIMA model and forecast
    """
    try:
        model = SARIMAX(train_data['cases'], order=(2, 1, 1))
        fit = model.fit(disp=False)
        forecast = fit.forecast(steps=forecast_steps)
        return forecast.values
    except:
        # Fallback: return mean if ARIMA fails
        return np.full(forecast_steps, train_data['cases'].mean())

arima_forecast_values = arima_forecast(train, len(test))

# === MODEL 3: Machine Learning (Simple Trend Extrapolation) ===

def ml_forecast(train_data, forecast_steps):
    """
    Simple ML: Fit polynomial to recent trend and extrapolate
    """
    from numpy.polynomial import Polynomial
    
    # Use last 30 days for trend
    recent = train_data.iloc[-30:]
    
    # Fit polynomial
    p = Polynomial.fit(recent['day'], recent['cases'], deg=2)
    
    # Forecast
    forecast_days = np.arange(train_data['day'].iloc[-1] + 1, 
                              train_data['day'].iloc[-1] + 1 + forecast_steps)
    forecast = p(forecast_days)
    
    return np.maximum(forecast, 0)  # No negative cases

ml_forecast_values = ml_forecast(train, len(test))

# === ENSEMBLE: Combine Models ===

# Method 1: Simple average
ensemble_simple = (seir_forecast_values + arima_forecast_values + ml_forecast_values) / 3

# Method 2: Weighted average (based on domain knowledge or historical performance)
weights = {
    'seir': 0.4,   # Higher weight for mechanistic model (interpretable)
    'arima': 0.3,  # Medium weight (good for short-term)
    'ml': 0.3      # Medium weight (captures recent trends)
}

ensemble_weighted = (
    weights['seir'] * seir_forecast_values +
    weights['arima'] * arima_forecast_values +
    weights['ml'] * ml_forecast_values
)

# Method 3: Median ensemble (more robust to outliers)
all_forecasts = np.array([seir_forecast_values, arima_forecast_values, ml_forecast_values])
ensemble_median = np.median(all_forecasts, axis=0)

# Visualize
fig, axes = plt.subplots(2, 1, figsize=(16, 12))

# Top: All forecasts
axes[0].plot(train['day'], train['cases'], 'o-', 
            label='Training Data', color='blue', alpha=0.5, markersize=3)
axes[0].plot(test['day'], test['cases'], 'o-',
            label='Actual (Test)', color='black', linewidth=3, markersize=6)

axes[0].plot(test['day'], seir_forecast_values, '^--',
            label='SEIR Model', alpha=0.7, markersize=6)
axes[0].plot(test['day'], arima_forecast_values, 's--',
            label='ARIMA Model', alpha=0.7, markersize=6)
axes[0].plot(test['day'], ml_forecast_values, 'd--',
            label='ML Model', alpha=0.7, markersize=6)

axes[0].plot(test['day'], ensemble_weighted, '*-',
            label='Weighted Ensemble', color='red', linewidth=3, markersize=10)

axes[0].axvline(train['day'].iloc[-1], color='green', linestyle='--',
               linewidth=2, label='Train/Test Split')

axes[0].set_ylabel('Daily Cases', fontsize=12)
axes[0].set_title('Ensemble Forecasting: Combining Multiple Models', fontsize=14)
axes[0].legend(fontsize=10, loc='upper right')
axes[0].grid(True, alpha=0.3)

# Bottom: Performance comparison
models = {
    'SEIR': seir_forecast_values,
    'ARIMA': arima_forecast_values,
    'ML': ml_forecast_values,
    'Simple Ensemble': ensemble_simple,
    'Weighted Ensemble': ensemble_weighted,
    'Median Ensemble': ensemble_median
}

performance = []
for name, forecast in models.items():
    mae = mean_absolute_error(test['cases'], forecast)
    rmse = np.sqrt(((test['cases'] - forecast) ** 2).mean())
    performance.append({'Model': name, 'MAE': mae, 'RMSE': rmse})

perf_df = pd.DataFrame(performance).sort_values('MAE')

x_pos = np.arange(len(perf_df))
axes[1].barh(x_pos, perf_df['MAE'], alpha=0.8, edgecolor='black')
axes[1].set_yticks(x_pos)
axes[1].set_yticklabels(perf_df['Model'])
axes[1].set_xlabel('Mean Absolute Error (MAE)', fontsize=12)
axes[1].set_title('Model Performance Comparison', fontsize=14)
axes[1].grid(True, alpha=0.3, axis='x')

# Highlight ensemble
for i, model in enumerate(perf_df['Model']):
    if 'Ensemble' in model:
        axes[1].get_children()[i].set_color('red')
        axes[1].get_children()[i].set_alpha(1.0)

plt.tight_layout()
plt.savefig('ensemble_forecast.png', dpi=300)
plt.show()

# Print performance
print("\n" + "="*60)
print("MODEL PERFORMANCE COMPARISON")
print("="*60)
print(perf_df.to_string(index=False))

# Best model
best_model = perf_df.iloc[0]
print(f"\n🏆 Best Model: {best_model['Model']}")
print(f"   MAE: {best_model['MAE']:.1f}")
print(f"   RMSE: {best_model['RMSE']:.1f}")

# Improvement of best ensemble over best individual model
best_individual = perf_df[~perf_df['Model'].str.contains('Ensemble')].iloc[0]
improvement = (1 - best_model['MAE'] / best_individual['MAE']) * 100

print(f"\nEnsemble improvement over best individual model: {improvement:.1f}%")

Ensemble Best Practices

Weight Selection:

Equal weights (simple average)
- Easiest to implement
- Works surprisingly well
- Default starting point
Performance-based weights
- Weight by historical accuracy
- Requires validation set
- Can overfit to past performance
Domain-based weights
- Mechanistic models: Higher weight for scenario planning
- ML models: Higher weight for short-term forecasts
- Time series models: Higher weight for seasonal diseases
Adaptive weights
- Change weights based on recent performance
- More complex but more robust
- Requires careful validation

When ensemble doesn’t help: - All models use the same data and make the same mistakes - One model is vastly superior to others (ensemble dilutes its performance) - Forecast horizon is so long that all models are guessing

For theoretical foundations, see Ray & Reich, 2018, Journal of Forecasting.

7.5 Uncertainty Quantification: Beyond Point Forecasts

The cardinal sin of forecasting: Publishing a single number without uncertainty.

Example of bad forecasting: > “Model predicts 50,000 cases next week.”

What if the real range is 30,000-80,000? Decision-makers need to know!

7.5.1 Types of Uncertainty

1. Parameter Uncertainty - Don’t know exact values of \(\beta\), \(\gamma\), etc. - Solution: Bayesian inference, confidence intervals

2. Structural Uncertainty - Model is wrong (all models are wrong!) - SIR assumes homogeneous mixing (false) - Solution: Multiple models, ensemble

3. Stochastic Uncertainty - Random chance (superspreading events, who gets infected) - Solution: Monte Carlo simulation

4. Scenario Uncertainty - Future interventions unknown (will there be a lockdown?) - Solution: Scenario planning (best/worst/expected case)

7.5.2 Quantifying Forecast Uncertainty

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from scipy.integrate import odeint
from scipy.stats import norm
import seaborn as sns

# === METHOD 1: Bootstrap Confidence Intervals ===

def bootstrap_forecast(train_data, forecast_steps, n_bootstrap=1000):
    """
    Use bootstrap resampling to quantify uncertainty
    """
    forecasts = []
    
    for i in range(n_bootstrap):
        # Resample training data with replacement
        boot_sample = train_data.sample(n=len(train_data), replace=True)
        boot_sample = boot_sample.sort_values('day').reset_index(drop=True)
        
        # Fit model (simple exponential for speed)
        from scipy.optimize import curve_fit
        
        def exp_growth(x, a, b):
            return a * np.exp(b * x)
        
        try:
            params, _ = curve_fit(exp_growth, boot_sample['day'], boot_sample['cases'],
                                 p0=[100, 0.01], maxfev=5000)
            
            # Forecast
            forecast_days = np.arange(train_data['day'].iloc[-1] + 1,
                                     train_data['day'].iloc[-1] + 1 + forecast_steps)
            forecast = exp_growth(forecast_days, *params)
            forecasts.append(forecast)
        except:
            # If fit fails, use mean
            forecasts.append(np.full(forecast_steps, train_data['cases'].mean()))
    
    forecasts = np.array(forecasts)
    
    # Calculate percentiles
    forecast_mean = np.mean(forecasts, axis=0)
    forecast_lower = np.percentile(forecasts, 2.5, axis=0)
    forecast_upper = np.percentile(forecasts, 97.5, axis=0)
    
    return forecast_mean, forecast_lower, forecast_upper

# === METHOD 2: Bayesian Parameter Uncertainty ===

def bayesian_seir_forecast(train_data, forecast_steps, n_samples=500):
    """
    SEIR with Bayesian parameter uncertainty
    """
    N = 1000000
    
    # Prior distributions for parameters (based on literature/expert knowledge)
    beta_samples = np.random.uniform(0.02, 0.05, n_samples)  # Transmission rate
    gamma_samples = np.random.uniform(0.08, 0.12, n_samples)  # Recovery rate
    sigma_samples = np.random.uniform(0.15, 0.25, n_samples)  # E→I rate
    
    forecasts = []
    
    for beta, gamma, sigma in zip(beta_samples, gamma_samples, sigma_samples):
        # Initial conditions
        I0 = train_data['cases'].iloc[0]
        E0 = I0
        R0_init = 0
        S0 = N - E0 - I0 - R0_init
        
        def seir(y, t, beta, sigma, gamma, N):
            S, E, I, R = y
            dS = -beta * S * I / N
            dE = beta * S * I / N - sigma * E
            dI = sigma * E - gamma * I
            dR = gamma * I
            return dS, dE, dI, dR
        
        # Solve
        t = np.arange(len(train_data) + forecast_steps)
        y0 = S0, E0, I0, R0_init
        
        solution = odeint(seir, y0, t, args=(beta, sigma, gamma, N))
        _, _, I, _ = solution.T
        
        forecasts.append(I[len(train_data):])
    
    forecasts = np.array(forecasts)
    
    forecast_mean = np.mean(forecasts, axis=0)
    forecast_lower = np.percentile(forecasts, 2.5, axis=0)
    forecast_upper = np.percentile(forecasts, 97.5, axis=0)
    
    return forecast_mean, forecast_lower, forecast_upper, forecasts

# === METHOD 3: Ensemble Spread as Uncertainty ===

def ensemble_uncertainty(individual_forecasts):
    """
    Use spread across ensemble members as uncertainty measure
    """
    forecasts_array = np.array(individual_forecasts)
    
    forecast_mean = np.mean(forecasts_array, axis=0)
    forecast_std = np.std(forecasts_array, axis=0)
    
    # 95% CI assuming normal distribution
    forecast_lower = forecast_mean - 1.96 * forecast_std
    forecast_upper = forecast_mean + 1.96 * forecast_std
    
    return forecast_mean, forecast_lower, forecast_upper

# Generate example data
np.random.seed(42)
days = np.arange(150)
cases = 100 * np.exp(0.025 * days) + np.random.normal(0, 50, len(days))
cases = np.maximum(cases, 10)

df = pd.DataFrame({'day': days, 'cases': cases})

train_size = 120
train = df.iloc[:train_size]
test = df.iloc[train_size:]

# Generate forecasts with uncertainty
print("Generating forecasts with uncertainty quantification...")

# Bootstrap
print("  Running bootstrap...")
boot_mean, boot_lower, boot_upper = bootstrap_forecast(train, len(test), n_bootstrap=500)

# Bayesian
print("  Running Bayesian SEIR...")
bayes_mean, bayes_lower, bayes_upper, bayes_samples = bayesian_seir_forecast(
    train, len(test), n_samples=200
)

# Visualize
fig, axes = plt.subplots(3, 1, figsize=(16, 16))

# Top: Bootstrap uncertainty
axes[0].plot(train['day'], train['cases'], 'o-',
            label='Training Data', color='blue', alpha=0.5, markersize=3)
axes[0].plot(test['day'], test['cases'], 'o-',
            label='Actual (Test)', color='black', linewidth=2, markersize=6)

axes[0].plot(test['day'], boot_mean, 's-',
            label='Bootstrap Forecast', color='red', linewidth=2, markersize=6)
axes[0].fill_between(test['day'], boot_lower, boot_upper,
                     alpha=0.3, color='red', label='95% Bootstrap CI')

axes[0].axvline(train['day'].iloc[-1], color='green', linestyle='--', linewidth=2)

axes[0].set_ylabel('Daily Cases', fontsize=12)
axes[0].set_title('Bootstrap Uncertainty Quantification', fontsize=14)
axes[0].legend(fontsize=10)
axes[0].grid(True, alpha=0.3)

# Middle: Bayesian uncertainty
axes[1].plot(train['day'], train['cases'], 'o-',
            label='Training Data', color='blue', alpha=0.5, markersize=3)
axes[1].plot(test['day'], test['cases'], 'o-',
            label='Actual (Test)', color='black', linewidth=2, markersize=6)

axes[1].plot(test['day'], bayes_mean, '^-',
            label='Bayesian SEIR Forecast', color='purple', linewidth=2, markersize=6)
axes[1].fill_between(test['day'], bayes_lower, bayes_upper,
                     alpha=0.3, color='purple', label='95% Credible Interval')

axes[1].axvline(train['day'].iloc[-1], color='green', linestyle='--', linewidth=2)

axes[1].set_ylabel('Daily Cases', fontsize=12)
axes[1].set_title('Bayesian Parameter Uncertainty', fontsize=14)
axes[1].legend(fontsize=10)
axes[1].grid(True, alpha=0.3)

# Bottom: Distribution of forecasts (spaghetti plot)
for i in range(min(50, len(bayes_samples))):  # Plot subset for clarity
    axes[2].plot(test['day'], bayes_samples[i], '-',
                color='purple', alpha=0.1, linewidth=0.5)

axes[2].plot(train['day'], train['cases'], 'o-',
            label='Training Data', color='blue', alpha=0.7, markersize=3)
axes[2].plot(test['day'], test['cases'], 'o-',
            label='Actual (Test)', color='black', linewidth=3, markersize=6, zorder=5)
axes[2].plot(test['day'], bayes_mean, '-',
            label='Mean Forecast', color='red', linewidth=3, zorder=4)

axes[2].axvline(train['day'].iloc[-1], color='green', linestyle='--', linewidth=2)

axes[2].set_xlabel('Day', fontsize=12)
axes[2].set_ylabel('Daily Cases', fontsize=12)
axes[2].set_title('Forecast Uncertainty: Distribution of Possible Futures (Spaghetti Plot)', fontsize=14)
axes[2].legend(fontsize=10)
axes[2].grid(True, alpha=0.3)

plt.tight_layout()
plt.savefig('uncertainty_quantification.png', dpi=300)
plt.show()

# Evaluate coverage
def calculate_coverage(actual, lower, upper):
    """Calculate what % of actual values fall within prediction interval"""
    within_interval = (actual >= lower) & (actual <= upper)
    coverage = within_interval.mean() * 100
    return coverage

boot_coverage = calculate_coverage(test['cases'].values, boot_lower, boot_upper)
bayes_coverage = calculate_coverage(test['cases'].values, bayes_lower, bayes_upper)

print("\n" + "="*60)
print("UNCERTAINTY QUANTIFICATION EVALUATION")
print("="*60)
print(f"\nBootstrap 95% CI:")
print(f"  Coverage: {boot_coverage:.0f}% (should be ~95%)")

print(f"\nBayesian 95% Credible Interval:")
print(f"  Coverage: {bayes_coverage:.0f}% (should be ~95%)")

# Interval width (narrower is better, if coverage is maintained)
boot_width = (boot_upper - boot_lower).mean()
bayes_width = (bayes_upper - bayes_lower).mean()

print(f"\nMean Interval Width:")
print(f"  Bootstrap: {boot_width:.1f} cases")
print(f"  Bayesian:  {bayes_width:.1f} cases")

The Coverage vs. Sharpness Trade-Off

Coverage: % of actual values that fall within your prediction interval - Good: 95% coverage for a 95% interval - Bad: Only 70% coverage (interval too narrow)

Sharpness: How narrow your prediction interval is - Good: Narrow intervals (more informative) - Bad: Wide intervals (“cases will be between 10 and 10,000”)

The trade-off: - Wide intervals → High coverage, low utility - Narrow intervals → Low coverage, misleading

Goal: Maximize sharpness while maintaining nominal coverage.

For rigorous evaluation, see Gneiting & Raftery, 2007, Journal of the American Statistical Association.

7.6 Case Study: COVID-19 Forecasting Hub

The COVID-19 Forecast Hub, launched in March 2020, represents the largest epidemic forecasting effort in history.

7.6.1 Setup

Coordinated by: Reich Lab at UMass Amherst
Participants: 50+ teams (academia, industry, government)
Frequency: Weekly forecasts
Targets: - Incident deaths (1-4 weeks ahead) - Incident hospitalizations - Incident cases (when data reliable)

Ensemble method: Weighted median of all submitted forecasts
Use: CDC and state health departments for decision-making

7.6.2 Results

Cramer et al., 2022, PNAS evaluated performance over 20 months:

Key findings:

1-week ahead forecasts: - Median absolute error: 10-15% (useful!) - Ensemble outperformed 90% of individual models

4-week ahead forecasts: - Median absolute error: 40-60% (barely better than guessing) - Huge uncertainty, wide prediction intervals

Challenges encountered:

❌ Novel variants (Alpha, Delta, Omicron) → All models failed simultaneously
❌ Changing testing → Case forecasts unreliable
❌ Policy changes → Lockdowns, mask mandates unpredictable
❌ Behavior changes → People reacted to case counts
❌ Data quality issues → Reporting delays, definition changes

7.6.3 Lessons Learned

What Worked

Short-term forecasts (1-2 weeks ahead): - Useful for operational planning (hospital beds, staffing) - Ensemble approach was robust - Simple models often competitive with complex models

Transparency and collaboration: - Open data, open code, open evaluation - Rapid iteration and model improvement - Community learned from collective failures

Ensemble method: - Consistently better than any individual model - Robust to individual model failures - Wisdom of crowds

For detailed analysis, see Reich Lab’s evaluation reports.

What Didn’t Work

Long-term forecasts (>4 weeks): - Worse than simple “trend continues” baseline - Essentially scenario planning, not prediction - Should not have been used for major policy decisions

Forecasting through major disruptions: - New variants broke all models - Policy changes (lockdowns) invalidated forecasts immediately - No model predicted Omicron’s magnitude

Point forecasts without uncertainty: - Some early models provided single numbers - Dangerously misleading - Led to bad decisions

The fundamental problem: COVID-19 exhibited extreme non-stationarity—the data-generating process changed every few months (new variants, vaccines, behavior).

When the rules keep changing, forecasting becomes impossible.

7.6.4 Example: Omicron Forecasting Failure

December 2021: Omicron variant detected, spreads rapidly.

Forecast Hub predictions (week of Dec 20): - Ensemble forecast: 450,000 cases/day peak - Range across models: 300,000 - 700,000 cases/day

Reality (Jan 2022): - Actual peak: 1.4 million cases/day (underestimated by 2-3x)

Why all models failed: - Omicron’s R₀ was 2-3x higher than Delta - Immune escape wasn’t fully captured - Behavior didn’t change as much as expected - Testing patterns shifted

The lesson: Novel variants with dramatically different properties break all models trained on past data.

7.7 When NOT to Forecast: The Ethics of Publishing Predictions

Forecasts have consequences. Sometimes, not forecasting is the ethical choice.

7.7.1 The Self-Fulfilling/Self-Defeating Problem Revisited

Scenario planning disguised as prediction:

Many “forecasts” during COVID-19 were actually scenarios: - “If we do nothing, 2 million will die” - “If we lock down, 100,000 will die”

These aren’t predictions—they’re conditional statements about policy choices.

The problem: Media reports them as predictions, public gets confused.

7.7.2 When to Say “We Don’t Know”

You should NOT publish forecasts when:

❌ Novel pathogen with no historical data - Early COVID-19 (January 2020): Every model was pure speculation - Better: “Here’s a range of scenarios from mild to catastrophic”

❌ Major intervention pending - Government about to announce lockdown - Forecasting what happens “if current trends continue” is misleading

❌ Data quality too poor - Testing availability changing daily - Reporting systems overwhelmed - Better: Focus on qualitative assessment

❌ Model disagreement too high - If your ensemble has predictions ranging from 10,000 to 500,000 cases, what’s the point? - Better: “Current uncertainty is too high for actionable forecasts”

❌ Horizon too long - Forecasting 3+ months ahead for rapidly changing epidemic - Better: Provide scenarios, not forecasts

The Responsibility of Forecasters

From Ioannidis et al., 2020, International Journal of Forecasting:

“Epidemic forecasters have a special responsibility because their forecasts can profoundly influence public policy and behavior. Forecasts that are too optimistic may lead to complacency. Forecasts that are too pessimistic may cause unnecessary panic and economic damage. Forecasters must resist pressure to provide false certainty.”

Best practices:

✅ Always provide uncertainty quantification
✅ Clearly distinguish forecasts from scenarios
✅ Update forecasts as data changes
✅ Acknowledge model limitations publicly
✅ Archive forecasts for retrospective evaluation
✅ Resist pressure to forecast when uncertainty is unquantifiable

7.8 Practical Guidance: Building Your First Forecasting System

7.8.1 Pre-Flight Checklist

Before you start forecasting:

☐ Do you have sufficient historical data? - Time series: ≥2 years for seasonal diseases - SEIR models: At least early epidemic data to estimate parameters

☐ Is your data quality acceptable? - Complete Chapter 3 assessment - Understand reporting lags - Know how testing/surveillance changed over time

☐ What decisions will this forecast inform? - Hospital resource allocation → 1-2 week forecasts sufficient - Long-term policy planning → Need scenarios, not forecasts - Public communication → Uncertainty is essential

☐ What’s your forecast horizon? - 1-2 weeks: Achievable with most methods - 3-4 weeks: Difficult, high uncertainty - >4 weeks: Scenario planning, not forecasting

7.8.2 Model Selection Strategy

Start simple, add complexity only when justified:

1. Naive baseline (persistence: tomorrow = today)
   ↓
2. Time series model (ARIMA, Prophet)
   ↓ (if you need scenario planning)
3. Mechanistic model (SEIR)
   ↓ (if you have large training data)
4. Machine learning (LSTM, ensemble)

Never skip the naive baseline! If your fancy model doesn’t beat “tomorrow will be like today,” it’s not adding value.

7.8.3 Evaluation Framework

Proper time series cross-validation:

# WRONG: Random train/test split
X_train, X_test = train_test_split(X, test_size=0.2)  # ❌

# CORRECT: Respect temporal order
train_size = int(0.8 * len(X))
X_train = X[:train_size]
X_test = X[train_size:]  # ✅

Report multiple metrics: - MAE (Mean Absolute Error) - RMSE (Root Mean Squared Error) - Coverage (95% CI should contain 95% of actuals) - Sharpness (average interval width)

Track performance over time: - Forecast accuracy degrades → Retrain models - Sudden drop in accuracy → Something changed (investigate!)

7.8.4 Implementation Template

class EpidemicForecaster:
    """
    Production-ready epidemic forecasting system
    """
    def __init__(self):
        self.models = {}
        self.performance_history = []
    
    def add_model(self, name, model):
        """Add a model to the ensemble"""
        self.models[name] = model
    
    def fit(self, train_data):
        """Fit all models"""
        for name, model in self.models.items():
            model.fit(train_data)
    
    def forecast(self, horizon, method='weighted_ensemble'):
        """Generate ensemble forecast"""
        # Get forecasts from all models
        forecasts = {}
        for name, model in self.models.items():
            forecasts[name] = model.predict(horizon)
        
        # Combine
        if method == 'simple_average':
            ensemble = np.mean(list(forecasts.values()), axis=0)
        elif method == 'weighted_ensemble':
            # Weight by recent performance
            weights = self._calculate_weights()
            ensemble = sum(w * f for w, f in zip(weights, forecasts.values()))
        
        return ensemble, forecasts
    
    def evaluate(self, actual, forecast):
        """Evaluate forecast and log performance"""
        mae = mean_absolute_error(actual, forecast)
        
        self.performance_history.append({
            'date': pd.Timestamp.now(),
            'mae': mae
        })
        
        return mae
    
    def _calculate_weights(self):
        """Calculate model weights based on recent performance"""
        # Implementation depends on your needs
        pass

# Usage
forecaster = EpidemicForecaster()
forecaster.add_model('seir', SEIRModel())
forecaster.add_model('arima', ARIMAModel())
forecaster.add_model('lstm', LSTMModel())

forecaster.fit(training_data)
forecast, individual_forecasts = forecaster.forecast(horizon=14)

7.9 Key Takeaways

Forecasting ≠ Surveillance. Detection tells you what’s happening now; forecasting predicts the future. The latter is exponentially harder.
Short-term works, long-term doesn’t. 1-2 week forecasts are useful. 4+ week forecasts are often worse than naive baselines.
Mechanistic models provide interpretability and scenario planning. SEIR models let you ask “what if we vaccinate 50%?” ML models can’t answer that.
Machine learning excels at pattern recognition in stable conditions. But breaks when conditions change (new variants, interventions).
Ensemble methods are most robust. Combining multiple approaches beats any single model.
Uncertainty quantification is mandatory. Never publish a single-number forecast. Always provide confidence/prediction intervals.
Evaluation is essential. Track forecast performance over time. If accuracy degrades, investigate why and retrain.
Sometimes, don’t forecast. When uncertainty is unquantifiable or forecasts would be misleading, say “we don’t know.”
Forecasts influence outcomes. The act of forecasting changes behavior, creating self-fulfilling or self-defeating prophecies.
COVID-19 Forecast Hub proved ensemble methods work. But also showed fundamental limits—novel variants break all models.

7.10 Practice Exercises

7.10.1 Exercise 1: Implement and Compare Models

Build SIR, SEIR, ARIMA, and LSTM models for the same epidemic dataset. Which performs best at 1-week, 2-week, and 4-week horizons? Why?

7.10.2 Exercise 2: Sensitivity Analysis

For an SEIR model, vary \(\beta\) by ±20%. How much does peak timing and magnitude change? What does this tell you about parameter uncertainty?

7.10.3 Exercise 3: Ensemble Optimization

Given forecasts from 5 models, find the optimal weights that minimize MAE on a validation set. Do these weights generalize to the test set?

7.10.4 Exercise 4: Uncertainty Calibration

Generate forecasts with 95% prediction intervals. What percentage of actual values fall within your intervals? If not 95%, how can you calibrate?

7.10.5 Exercise 5: Scenario Planning

Using an SEIR model, create three scenarios: (1) No intervention, (2) 50% transmission reduction, (3) 80% transmission reduction. How different are the outcomes?

Check Your Understanding

Test your knowledge of epidemic forecasting methods and their appropriate applications. Each question builds on the key concepts from this chapter.

Question 1

In a basic SIR compartmental model, you calculate R₀ = 2.5. What does this mean for epidemic dynamics, and what happens when R₀ drops to 0.9 due to interventions?

R₀ = 2.5 means 2.5% of the population gets infected; R₀ = 0.9 means 0.9% gets infected
R₀ = 2.5 means the epidemic grows exponentially; R₀ = 0.9 means the epidemic dies out
R₀ values only matter for novel pathogens and don’t change with interventions
R₀ = 2.5 means each infected person has 2.5 contacts per day

Correct Answer: b) R₀ = 2.5 means the epidemic grows exponentially; R₀ = 0.9 means the epidemic dies out

The basic reproduction number (R₀) is the most important parameter in epidemic modeling:

What R₀ means: - R₀ = number of secondary infections caused by one infected person in a fully susceptible population - R₀ = 2.5: Each infected person infects 2.5 others → epidemic grows exponentially - R₀ = 0.9: Each infected person infects <1 person → epidemic dies out - R₀ = 1: Endemic equilibrium (constant case rate)

Critical threshold: R₀ > 1 → epidemic grows; R₀ < 1 → epidemic dies out

How interventions affect R₀: - R₀ = β/γ where β = transmission rate, γ = recovery rate - Social distancing reduces β → lowers R₀ - Faster isolation reduces infectious period → increases γ → lowers R₀ - A modest 10% reduction in transmission (R₀ = 2.5 → 2.3) can dramatically reduce peak cases

Examples from the chapter: - R₀ = 2.5 (no intervention) → Peak: 500,000 cases/day - R₀ = 2.0 (distancing) → Peak: 100,000 cases/day - R₀ = 0.9 (lockdown) → No peak, epidemic dies out

Why other answers are wrong: - R₀ ≠ percentage infected (that’s the attack rate) - R₀ ≠ contacts per day (it’s contacts × transmission probability / infectious period) - R₀ absolutely changes with interventions (that’s the whole point!)

Question 2

The COVID-19 Forecast Hub found that 1-week ahead forecasts had ~10-15% median error (useful), while 4-week ahead forecasts had ~50% error (barely better than guessing). What fundamental challenge explains this forecast horizon problem?

Longer forecasts require more computational power that wasn’t available
Epidemics exhibit chaotic dynamics where small changes in initial conditions, behaviors, and interventions compound over time
Four weeks is beyond the maximum incubation period so the data becomes unreliable
Forecasters deliberately make long-term forecasts less accurate to avoid liability

Correct Answer: b) Epidemics exhibit chaotic dynamics where small changes in initial conditions, behaviors, and interventions compound over time

This is the butterfly effect in epidemiology - a fundamental characteristic of complex adaptive systems:

Why accuracy degrades with forecast horizon:

Compounding uncertainty: Small errors in parameter estimation (β, γ) grow exponentially
Unpredictable interventions: Lockdowns, mask mandates, vaccine campaigns can’t be predicted
Behavior changes: People react to case counts, changing transmission dynamics
Viral evolution: New variants with different R₀ values break model assumptions
Reporting changes: Testing policy shifts invalidate surveillance data patterns
Random events: Superspreading events can dramatically shift trajectories

From the chapter:

Forecast Horizon	Typical Accuracy	Usefulness
1 week ahead	±10-20%	High - operational planning
2 weeks ahead	±20-40%	Moderate - resource allocation
4 weeks ahead	±40-100%	Low - scenario planning only
3+ months ahead	Worse than baseline	Essentially guesswork

The self-fulfilling/self-defeating prophecy: - Forecasts influence behavior, which changes outcomes - Unlike weather forecasting (rain doesn’t care about predictions), epidemic forecasts affect the thing being forecasted - Long horizons amplify this feedback loop

Real-world example: - December 2021: Models predicted Omicron peak of 300K-700K cases/day - Reality (January 2022): 1.4 million cases/day (underestimated by 2-3x) - Why? Novel variant with 2-3x higher R₀ broke all assumptions

The lesson: Focus on short-term forecasts (1-2 weeks) for operational decisions. Treat longer horizons as scenario planning, not prediction.

Question 3

You’re comparing mechanistic SEIR models versus LSTM deep learning models for epidemic forecasting. Which statement best describes their complementary strengths and appropriate use cases?

LSTM models are always superior because they use more advanced AI technology
SEIR models enable interpretable scenario planning (what if we vaccinate 50%?), while LSTMs excel at short-term pattern recognition in stable conditions
SEIR models are outdated; modern forecasting should only use deep learning
LSTM models understand causal mechanisms better than SEIR models

Correct Answer: b) SEIR models enable interpretable scenario planning (what if we vaccinate 50%?), while LSTMs excel at short-term pattern recognition in stable conditions

This captures the fundamental tension between mechanistic and machine learning approaches:

SEIR/Mechanistic Models:

Strengths: - ✅ Interpretable: Every parameter (β, γ, σ) has biological meaning - ✅ Scenario planning: Can simulate “what if we reduce transmission by 30%?” - ✅ Theory-grounded: Built on century of epidemiological knowledge - ✅ Extrapolation: Can predict novel scenarios without historical data - ✅ Policy-relevant: Stakeholders understand “if R₀ drops to X, we get Y cases”

Limitations: - ❌ Simplistic assumptions (homogeneous mixing, constant parameters) - ❌ Don’t capture heterogeneity (superspreaders, age structure) - ❌ Require knowing parameters (β, γ) that are hard to estimate

LSTM/Machine Learning Models:

Strengths: - ✅ Data-driven: Learn patterns without mechanistic assumptions - ✅ Flexible: Capture complex nonlinear relationships - ✅ Short-term accuracy: Excel at 1-2 week forecasts when conditions are stable - ✅ Multiple inputs: Can integrate cases + mobility + weather + interventions

Limitations: - ❌ No mechanistic understanding: Learn patterns, not causes - ❌ Break on distribution shift: New variants, interventions invalidate training data - ❌ Opaque: Can’t explain why predictions changed - ❌ Can’t answer counterfactuals: “What if we vaccinate?” → no historical data for that

The LSTM limitation example from the chapter: - Training data: epidemic curve + lockdown → cases drop - LSTM learns: “After cases reach X, they drop” - New epidemic: cases reach X, but no lockdown - LSTM predicts drop (WRONG!) - Problem: No concept that lockdown caused the drop

Best practice (from COVID-19 Forecast Hub): - Use both in an ensemble - SEIR for scenario planning and understanding mechanisms - LSTM for short-term pattern recognition - Ensemble combines their strengths and averages out individual failures

Question 4

Why did the COVID-19 Forecast Hub consistently find that ensemble methods (combining multiple models) outperformed any individual model, even though no single “ensemble algorithm” existed?

Ensemble methods use more computational resources so they’re inherently more accurate
Different models capture different patterns; averaging reduces overfitting and provides robustness when individual models fail
Ensemble methods average out all the errors so they always perform better mathematically
The ensemble was weighted toward the best model, which is why it performed well

Correct Answer: b) Different models capture different patterns; averaging reduces overfitting and provides robustness when individual models fail

Ensemble forecasting is one of the most important lessons from both the CDC FluSight Challenge and COVID-19 Forecast Hub:

Why ensembles work:

Different models capture different patterns:
- Mechanistic models (SEIR): Understand transmission dynamics
- Time series models (ARIMA): Capture seasonal patterns
- ML models (LSTM): Learn complex nonlinear relationships
- Each model has blind spots; ensembles fill in gaps
Averaging reduces overfitting:
- Individual models may be overconfident in their predictions
- Ensemble spreads risk across multiple approaches
- Extreme predictions get averaged out
Robustness to model-specific failures:
- If one model fails (e.g., LSTM breaks on new variant), others compensate
- No need to identify “best” model in advance
- Wisdom of crowds

Evidence from COVID-19 Forecast Hub (Cramer et al., 2022, PNAS): - Ensemble consistently outperformed 90% of individual models - Even when all models struggled (Omicron), ensemble was least wrong - Simple median ensemble worked surprisingly well

Ensemble methods: 1. Simple average: Equal weight to all models (works well!) 2. Weighted average: Weight by historical performance 3. Median ensemble: More robust to outliers 4. Adaptive weights: Change based on recent performance

When ensemble doesn’t help: - All models use same data and make same mistakes - One model vastly superior (rare in practice) - Forecast horizon so long all models are guessing

From the chapter example: Models compared: - SEIR, ARIMA, ML → best individual MAE - Simple ensemble → better than best individual - Weighted ensemble → best overall - Improvement: ~10-20% over best single model

The lesson: Don’t put all your eggs in one model basket. Ensemble methods are more robust and reliable for real-world forecasting.

Question 5

You publish a forecast predicting “severe outbreak in 2 weeks” with wide confidence intervals (10,000-100,000 cases). The public panics and implements strict measures. The outbreak is mild (15,000 cases). What forecasting challenge does this illustrate?

Your model had high bias and should have been regularized better
The self-defeating prophecy problem: forecasts influence behavior, which changes outcomes, making the forecast appear wrong even though it drove the right action
Confidence intervals were too wide and should have been narrower for credibility
The model should have predicted the intervention and adjusted for it

Correct Answer: b) The self-defeating prophecy problem: forecasts influence behavior, which changes outcomes, making the forecast appear wrong even though it drove the right action

This is one of the most profound challenges in epidemic forecasting - unique to predictions that influence human behavior:

The paradox: Success makes you look wrong

Scenario 1: Self-defeating prophecy (this question) - Model predicts catastrophic outbreak - Public panics, implements strict measures - Outbreak is mild - Conclusion: “Model was wrong! Overreaction!” - Reality: Model was right about unmitigated scenario; intervention changed trajectory

Scenario 2: Self-fulfilling prophecy - Model predicts mild outbreak - Public relaxes precautions - Outbreak explodes - Conclusion: “Model was wrong! Should have prepared!” - Reality: Model was right for baseline scenario; behavior change amplified outbreak

Unlike weather forecasting: - Rain forecast doesn’t affect whether it rains - Epidemic forecasts DO affect transmission (people change behavior) - Creates impossible situation for forecasters

The Michael Fish Problem (from chapter): - 1987: BBC weatherman dismissed hurricane concerns - Hurricane hit anyway → 18 died, permanent trust damage - Lesson: Reassuring forecasts that turn out wrong destroy trust permanently - But alarming forecasts that don’t materialize also erode trust (more slowly)

Ethical best practices:

✅ Clearly distinguish forecasts from scenarios: - “If current trends continue…” (forecast) - “If we do nothing…” (scenario) - “If we implement lockdown…” (scenario)

✅ Communicate uncertainty honestly: - Wide confidence intervals (10K-100K) show genuine uncertainty - Better than false precision (53,472 cases!)

✅ Frame conditionally: - “Without intervention: 50,000-100,000 cases” - “With strong measures: 10,000-30,000 cases”

✅ Update as conditions change: - “Our forecast assumed no intervention; behavior changed, so we update”

From Ioannidis et al., 2020: > “Forecasters must resist pressure to provide false certainty. Forecasts influence policy and behavior, creating special responsibility.”

When NOT to forecast: - Novel pathogen (no data) - Major intervention pending - Model disagreement too high - Better to say: “Current uncertainty is too high for actionable forecasts”

Question 6

You’re building an epidemic forecasting system for influenza. Your ARIMA model achieves 12% MAE, while a “naive” persistence model (tomorrow = today) achieves 15% MAE. Your LSTM achieves 11% MAE but requires 100x more computational resources. What should you deploy?

LSTM model because it has the lowest error
ARIMA model because it balances accuracy and simplicity, and beats the naive baseline
Naive model because it’s simplest and the differences are too small to matter
An ensemble of all three models weighted by their inverse MAE

Correct Answer: b) ARIMA model because it balances accuracy and simplicity, and beats the naive baseline

This question tests practical judgment about model selection and deployment tradeoffs:

Why ARIMA is the best choice here:

Beats naive baseline (12% vs 15% MAE)
- Rule from chapter: “If your fancy model doesn’t beat ‘tomorrow will be like today,’ it’s not adding value”
- 20% improvement over baseline is meaningful
Reasonable complexity
- ARIMA is interpretable and well-understood
- Can explain seasonal patterns, trends to stakeholders
- Easier to diagnose when forecasts degrade
Computational efficiency
- Fast training and inference
- Can update frequently as new data arrives
- No specialized hardware needed
Appropriate for influenza
- Flu has strong seasonal patterns (ARIMA excels here)
- Reporting is stable and consistent
- Short-term forecasts (1-4 weeks) are use case

Why NOT choose LSTM (11% MAE): - Only 1% absolute improvement over ARIMA (8% relative improvement) - 100x computational cost for marginal gain - Harder to maintain and debug - Opaque predictions harder to explain to health departments - Principle: Simplest model that beats baseline and meets requirements

Why NOT choose naive baseline: - 15% MAE is worse than ARIMA - Doesn’t capture seasonality - However, always compute it as benchmark!

Why ensemble might not be best: - More complex to maintain - Marginal improvement unlikely to justify complexity for weekly flu forecasts - Better for high-stakes scenarios (pandemic)

From the chapter’s model selection strategy:

1. Naive baseline (persistence: tomorrow = today)
   ↓ (beat this? Yes → continue)
2. Time series model (ARIMA, Prophet)  ← WE ARE HERE
   ↓ (need scenario planning? No)
3. Mechanistic model (SEIR)
   ↓ (have large training data + need more accuracy? No)
4. Machine learning (LSTM, ensemble)

Best practice for production systems: 1. Deploy ARIMA as primary model 2. Monitor performance weekly 3. Keep naive baseline for comparison 4. If accuracy degrades, investigate and retrain 5. Consider LSTM only if: - Accuracy requirements become critical - Computational resources available - Interpretability less important

Context matters: - Hospital bed allocation (low stakes): ARIMA is fine - Pandemic ventilator procurement (high stakes): Might justify LSTM or ensemble - Public communication: Need interpretable model (ARIMA)

The chapter emphasizes: “Never skip the naive baseline! If your fancy model doesn’t beat ‘tomorrow will be like today,’ it’s not adding value.”

7.11 Discussion Questions

SEIR models assume homogeneous mixing (everyone contacts everyone equally). How does this assumption break down in reality? How might you modify SEIR to be more realistic?
The COVID-19 Forecast Hub ensemble outperformed individual models, but still failed to predict Omicron. What does this tell us about the limits of forecasting? When should we abandon forecasting and focus on scenario planning?
Mechanistic models (SEIR) are interpretable but make strong assumptions. ML models (LSTMs) are flexible but opaque. For public health decision-making, which is more important: interpretability or accuracy? Does it depend on the decision?
Imagine you’re a state health department epidemiologist in March 2020. Multiple models predict vastly different COVID-19 death tolls (100K to 2M). How do you communicate this uncertainty to policymakers? To the public?
Google Flu Trends failed because search behavior changed when people became aware of the model. Is this problem inherent to all epidemic forecasting, or specific to web-based surveillance?
The COVID-19 Forecast Hub published forecasts weekly, even when model uncertainty was extremely high. Was this ethical? Should forecasters refuse to forecast when uncertainty is unquantifiable?

7.12 Further Resources

7.12.1 Books and Guides

Keeling & Rohani, 2008, Modeling Infectious Diseases in Humans and Animals - Foundational textbook on mechanistic epidemic models
Vynnycky & White, 2010, An Introduction to Infectious Disease Modelling - Practical guide to SEIR and related models

7.12.2 Academic Papers

Forecasting Evaluation: - Reich et al., 2019, Clinical Infectious Diseases - FluSight evaluation (Reich et al. 2019) - Key case study showing ensemble methods work - Cramer et al., 2022, PNAS - COVID-19 Forecast Hub (Cramer et al. 2022) - Essential reading on lessons learned from COVID-19 forecasting - Viboud et al., 2018, PNAS - RAPIDD forecasting challenge - Comparative evaluation of forecasting approaches

Machine Learning Methods: - Chimmula & Zhang, 2020, Chaos, Solitons & Fractals - LSTM forecasting - Deep learning for epidemic forecasting

Ethics and Limitations: - Ioannidis et al., 2020, International Journal of Forecasting - Forecasting for COVID-19 - Critical perspective on forecasting ethics and responsibility

7.12.3 Tools and Platforms

COVID-19 Forecast Hub - Archive of all COVID-19 forecasts with open data and code
CDC FluSight - Annual influenza forecasting challenge
EpiModel R package - Network-based epidemic modeling framework
PyEpiDAT - Python toolkit for epidemic data and forecasting

7.12.4 Online Courses

Coursera: Epidemics - Penn State course on epidemic modeling fundamentals
edX: Infectious Disease Modelling - Imperial College London course on advanced modeling techniques

Forecasting is humbling. The future is uncertain, models are wrong, and interventions change outcomes. But with proper humility, transparency, and ensemble methods, forecasts can still inform better decisions than guessing.

Next: Chapter 6: Genomic Surveillance and Pathogen Analysis →