4  Just Enough AI to Be Dangerous

Tip: Learning Objectives

Time to Complete: 90-120 minutes
Prerequisites: Chapter 1: AI in Context

By the end of this chapter, you will:

  • Understand the fundamental types of machine learning and when to use each
  • Recognize common algorithms and their strengths/weaknesses
  • Read and modify basic ML code confidently
  • Distinguish between genuine AI capabilities and marketing hype
  • Know which concepts matter for public health (and which don’t)
  • Run your first ML model on real public health data

What you’ll build: 💻 Complete outbreak prediction models using logistic regression, Random Forests, and XGBoost with evaluation metrics

4.1 Introduction

You don’t need a PhD in machine learning to use AI effectively in public health. You need to understand just enough to:

  • Choose the right tool for your problem
  • Interpret results critically
  • Avoid common pitfalls
  • Communicate with technical teams
  • Know when AI is the wrong solution

This chapter gives you that foundation. We’ll skip the heavy math and focus on concepts, intuition, and practical application.

Warning: What This Chapter Is NOT

This is not a comprehensive ML course. For deep technical knowledge, see Andrew Ng’s Machine Learning Specialization or the Deep Learning book by Goodfellow et al.

This chapter teaches you enough to be productively dangerous in public health AI.

4.2 The Three Fundamental Paradigms

Nearly all of machine learning falls into one of three categories, as comprehensively reviewed in Mitchell’s foundational work on machine learning. Master these, and you’ll understand 90% of AI applications in public health.

4.2.1 1. Supervised Learning: Learning from Examples

The Idea: You show the algorithm examples of inputs and correct outputs. It learns the pattern. Then it predicts outputs for new inputs.

Analogy: Teaching a medical student by showing them patient cases with known diagnoses.

Real-world example:

Input: Patient symptoms (fever, cough, chest pain)
Output: Diagnosis (pneumonia)

You show the algorithm 10,000 labeled examples.
It learns: "fever + productive cough + chest pain → pneumonia (85% probability)"

When to use it:

  • You have labeled historical data (inputs + correct answers)
  • You want to predict a specific outcome
  • Examples: disease diagnosis, outbreak prediction, risk stratification

Public health applications:

  • Predicting which patients will develop sepsis using electronic health records
  • Classifying disease from medical images
  • Forecasting epidemic curves
  • Identifying high-risk populations

Note: The Key Constraint

Supervised learning requires labeled data—meaning someone (often human experts) must have already provided the “correct answers” for your training examples.

In public health, getting quality labels is often the hardest part. As Rajkomar et al. (2019) note in their NEJM review, a model is only as good as its training data.

4.2.2 2. Unsupervised Learning: Finding Hidden Patterns

The Idea: You give the algorithm data with NO labels. It finds structure, patterns, or groups on its own.

Analogy: Sorting a messy pile of patient records into natural categories without being told what categories to use.

Real-world example:

Input: Symptom reports from 50,000 emergency department visits
Output: "I found 7 distinct clusters—one looks like flu-like illness,
         another looks like gastrointestinal illness, etc."

When to use it:

  • You have unlabeled data (which is most data!)
  • You want to discover structure or segments
  • You’re exploring without a specific prediction goal

Public health applications:

  • Identifying disease subtypes from symptom patterns
  • Segmenting populations by health behavior
  • Detecting anomalies in outbreak surveillance
  • Reducing data dimensions for visualization

Common techniques:

  • Clustering (k-means, hierarchical, DBSCAN)
  • Dimensionality reduction (PCA, t-SNE, UMAP)
  • Anomaly detection
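
To make this concrete, here is a minimal clustering sketch using scikit-learn’s KMeans on synthetic symptom-count data. The column names, cluster count, and simulated values are illustrative assumptions, not from a real surveillance system:

Hide code
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Synthetic example: symptom counts per emergency department visit
rng = np.random.default_rng(42)
visits = pd.DataFrame({
    'fever': rng.poisson(2, 500),
    'cough': rng.poisson(1, 500),
    'vomiting': rng.poisson(1, 500),
    'diarrhea': rng.poisson(1, 500),
})

# Scale features so no single symptom dominates the distance calculation
X_scaled = StandardScaler().fit_transform(visits)

# Ask for 3 clusters (in practice, compare several values of k)
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
visits['cluster'] = kmeans.fit_predict(X_scaled)

# Inspect the average symptom profile of each cluster
print(visits.groupby('cluster').mean())

No labels were needed; the algorithm groups visits purely by how similar their symptom profiles are.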

4.2.3 3. Reinforcement Learning: Learning by Trial and Error

The Idea: An agent takes actions in an environment and receives rewards or penalties. It learns which actions lead to the best outcomes. This approach, formalized by Sutton and Barto in their seminal text, has applications in sequential decision-making.

Analogy: Training a dog—reward good behavior, discourage bad behavior, let it learn through experience.

When to use it:

  • You’re optimizing sequential decisions
  • There’s no “correct answer” dataset
  • You have a simulator or can interact safely

Public health applications:

  • Optimizing vaccine distribution strategies
  • Adaptive clinical trial designs
  • Resource allocation during outbreaks
  • Personalized treatment sequencing

Important: Reality Check

Reinforcement learning sounds exciting but is much harder to implement than supervised learning. It requires careful reward engineering and lots of computational resources.

For most public health problems, supervised learning is more practical. Don’t use RL just because it sounds cool.

For this handbook, we’ll focus primarily on supervised learning—it’s where 80% of practical public health AI lives.

4.3 Core Supervised Learning Concepts

4.3.1 The Machine Learning Workflow

Every supervised learning project follows this pattern, as outlined in Géron’s practical guide:

Hide code
# 1. Collect labeled data
data = load_outbreak_data()  # Features + outcomes

# 2. Split into train/test sets
train_data, test_data = split(data, test_size=0.2)

# 3. Choose and train a model
model = RandomForestClassifier()
model.fit(train_data.features, train_data.outcomes)

# 4. Evaluate on unseen test data
predictions = model.predict(test_data.features)
accuracy = evaluate(predictions, test_data.outcomes)

# 5. Deploy and monitor
model.save('outbreak_predictor.pkl')

Let’s break down each step.

4.3.2 Step 1: Features and Labels

Features (also called “predictors” or “independent variables”): The inputs your model uses to make predictions.

Labels (also called “targets” or “dependent variables”): The outputs you’re trying to predict.

Example: Predicting dengue outbreaks

Week   Temperature   Rainfall   Prev_Cases   Label: Outbreak?
1      28°C          120mm      15           No
2      30°C          200mm      18           No
3      31°C          250mm      45           Yes
4      29°C          180mm      60           Yes

Features: Temperature, Rainfall, Previous Cases
Label: Outbreak (Yes/No)

Tip: Feature Engineering, the Secret Sauce

Raw data is rarely ideal for ML. Feature engineering—creating new features from existing data—is where domain expertise matters most, as emphasized in Domingos’ influential paper on machine learning pitfalls.

Examples:

  • Raw: Daily case counts → Engineered: 7-day rolling average
  • Raw: Birth date → Engineered: Age in years
  • Raw: GPS coordinates → Engineered: Distance to nearest hospital
  • Raw: Text symptoms → Engineered: Presence/absence of key symptom terms

This is where epidemiologists have an advantage over pure data scientists. You know which features actually matter biologically and epidemiologically.
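
As a quick illustration, here is how two of those transformations might look in pandas. The toy line list below (dates, counts, a single birth date) is made up for the example:

Hide code
import pandas as pd

# Toy line list with made-up values
df = pd.DataFrame({
    'date': pd.date_range('2024-01-01', periods=10, freq='D'),
    'cases': [3, 5, 4, 8, 12, 9, 15, 20, 18, 22],
    'birth_date': pd.to_datetime(['1990-05-01'] * 10),
})

# Raw daily counts -> 7-day rolling average (smooths day-of-week noise)
df['cases_7day_avg'] = df['cases'].rolling(window=7, min_periods=1).mean()

# Birth date -> age in years at the time of the report
df['age_years'] = (df['date'] - df['birth_date']).dt.days // 365

print(df[['date', 'cases', 'cases_7day_avg', 'age_years']])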

4.3.3 Step 2: Training, Validation, and Testing

A critical concept: You need THREE data splits, not just two, as explained in Hastie et al.’s comprehensive statistical learning text.

  1. Training set (60-70%): Model learns patterns from this
  2. Validation set (10-20%): Used to tune model settings (hyperparameters)
  3. Test set (20%): Never touched until final evaluation—simulates real-world performance
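
One common way to create the three splits is to call scikit-learn’s train_test_split twice. The sketch below assumes X holds your features and y your labels; the 60/20/20 proportions are illustrative:

Hide code
from sklearn.model_selection import train_test_split

# First carve off the final test set (20%), then split the remainder
# into training (60% of the total) and validation (20% of the total)
X_temp, X_test, y_temp, y_test = train_test_split(
    X, y, test_size=0.20, random_state=42
)
X_train, X_val, y_train, y_val = train_test_split(
    X_temp, y_temp, test_size=0.25, random_state=42  # 0.25 of the remaining 80% = 20%
)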

Why this matters:

Hide code
# BAD: Tuning on test data
model = train_model(train_data)
# Adjust model settings...
performance = evaluate(model, test_data)  # Too optimistic!

# GOOD: Using validation set
model = train_model(train_data)
# Adjust model settings based on validation performance...
performance = evaluate(model, validation_data)
# Only after all tuning is done:
final_performance = evaluate(model, test_data)  # True performance

Warning: The Cardinal Sin of Data Leakage

Data leakage happens when information from the test set “leaks” into training. This makes your model look better than it really is, a problem extensively documented in Kaufman et al.’s analysis of leakage in healthcare ML.

Common mistakes:

  • Normalizing features before splitting data
  • Including future information in features
  • Using the same patients in both train and test
  • Cross-validation on time series without respecting temporal order


4.3.4 Step 3: What “Training” Actually Means

When we say a model “trains,” what’s happening?

Simple answer: The algorithm adjusts internal parameters to minimize prediction errors.

Let’s make this concrete with logistic regression (which you already know from epi!):

You’re familiar with: \[\text{logit}(p) = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + ...\]

In traditional statistics, you estimate \(\beta\) coefficients using maximum likelihood estimation.

Machine learning does the same thing—but:

  • Often with millions of parameters
  • Using iterative optimization (gradient descent, not closed-form solutions)
  • With regularization to prevent overfitting
  • On much larger datasets

Here’s what’s happening under the hood:

Hide code
# Simplified but runnable training loop: logistic regression fit by gradient descent on toy data
import numpy as np

rng = np.random.default_rng(42)
X_train = rng.normal(size=(500, 3))                      # toy features
true_beta = np.array([1.5, -2.0, 0.5])
y_train = (rng.random(500) < 1 / (1 + np.exp(-X_train @ true_beta))).astype(float)

parameters = np.zeros(X_train.shape[1])                  # start with all-zero coefficients
learning_rate = 0.1

for epoch in range(200):
    # Make predictions with current parameters
    predictions = 1 / (1 + np.exp(-(X_train @ parameters)))

    # Calculate error (log loss)
    error = -np.mean(y_train * np.log(predictions) + (1 - y_train) * np.log(1 - predictions))

    # Adjust parameters to reduce error
    gradients = X_train.T @ (predictions - y_train) / len(y_train)
    parameters = parameters - learning_rate * gradients

    # Repeat until error stops decreasing

The model is literally learning from mistakes—making predictions, checking how wrong they are, and adjusting to do better next time.

4.4 Common Machine Learning Algorithms

Let’s survey the algorithms you’ll encounter most often in public health AI. For each, we’ll cover how it works, strengths/weaknesses, when to use it, and a practical example.

4.4.1 1. Logistic Regression: Your Old Friend

You already know this one! Logistic regression predicts binary outcomes (disease/no disease, outbreak/no outbreak).

When to use it:

  • Binary classification problems
  • You want interpretable coefficients (odds ratios)
  • You have modest amounts of data
  • Linear relationships are reasonable

Strengths:

  ✅ Interpretable (you can explain to stakeholders)
  ✅ Fast to train
  ✅ Works well with small data
  ✅ Provides probability estimates
  ✅ Epidemiologists understand it intuitively

Weaknesses:

  ❌ Assumes linear relationships (in log-odds space)
  ❌ Can’t capture complex interactions automatically
  ❌ Performance limited on very complex patterns

Code example:

Hide code
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
import pandas as pd
import numpy as np

# Load outbreak data
data = pd.read_csv('../data/examples/dengue_outbreaks.csv')

# Features and label
X = data[['temperature', 'rainfall', 'prev_cases', 'population_density']]
y = data['outbreak']  # 0 or 1

# Train/test split
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Train model
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

# Evaluate
y_pred = model.predict(X_test)
print(classification_report(y_test, y_pred))

# Interpret: Look at coefficients (like ORs in epi)
feature_names = X.columns
coefficients = model.coef_[0]

for feature, coef in zip(feature_names, coefficients):
    odds_ratio = np.exp(coef)
    print(f"{feature}: OR = {odds_ratio:.2f}")

Output interpretation:

temperature: OR = 1.15  → Each 1°C increase → 15% higher outbreak odds
rainfall: OR = 1.08     → Each 1mm increase → 8% higher outbreak odds
prev_cases: OR = 1.25   → Each additional case → 25% higher outbreak odds

Tip: Start Here

Logistic regression should be your default starting point for classification problems. Only move to complex models if it doesn’t work well enough.

Many “AI” successes in public health are just well-applied logistic regression with good feature engineering.

4.4.2 2. Decision Trees: Readable Rules

The idea: The algorithm builds a tree of yes/no questions to classify data, as introduced in Breiman et al.’s seminal CART work.

Analogy: A flowchart a doctor might use: “If fever > 38°C AND cough present AND chest X-ray abnormal → Pneumonia likely”

Visual example:

                    Fever > 38°C?
                   /              \
                 Yes               No
                /                    \
        Cough present?           Low risk
           /        \
         Yes         No
         /             \
   Chest X-ray      Moderate risk
   abnormal?
    /      \
  Yes      No
  /          \
Pneumonia   Viral infection

Code example:

Hide code
from sklearn.tree import DecisionTreeClassifier
from sklearn import tree
import matplotlib.pyplot as plt

# Train decision tree
model = DecisionTreeClassifier(max_depth=3, random_state=42)
model.fit(X_train, y_train)

# Visualize the tree
plt.figure(figsize=(20,10))
tree.plot_tree(model,
               feature_names=X.columns,
               class_names=['No outbreak', 'Outbreak'],
               filled=True,
               fontsize=10)
plt.savefig('../images/examples/decision_tree.png', dpi=300, bbox_inches='tight')
plt.show()

# Make predictions
predictions = model.predict(X_test)
probabilities = model.predict_proba(X_test)

Strengths:

  ✅ Highly interpretable (you can draw the exact decision process)
  ✅ Handles non-linear relationships automatically
  ✅ No need to normalize features
  ✅ Works with mixed data types (numerical + categorical)
  ✅ Can capture interactions between features

Weaknesses:

  ❌ Prone to overfitting (memorizing training data)
  ❌ Unstable (small data changes → very different trees)
  ❌ Often less accurate than ensemble methods

Note: Single Trees vs. Forests

Single decision trees are rarely used in production because they overfit easily. Instead, we use ensembles of many trees (Random Forests, Gradient Boosting)—see next sections.

But single trees are excellent for exploratory analysis and communicating findings to non-technical stakeholders.

4.4.3 3. Random Forests: Democracy of Trees

The idea: Instead of one decision tree, train hundreds or thousands of trees on slightly different subsets of data. Each tree “votes” on the final prediction. This ensemble method introduced by Breiman (2001) has become one of the most widely used algorithms in machine learning.

Why it works: Individual trees might overfit and make mistakes, but their collective wisdom averages out errors.

Analogy: Instead of asking one doctor, you ask 500 doctors and take the majority opinion.

Code example:

Hide code
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score, confusion_matrix
import numpy as np

# Train random forest
model = RandomForestClassifier(
    n_estimators=500,      # Number of trees
    max_depth=10,          # Limit tree depth (prevent overfitting)
    min_samples_split=20,  # Require at least 20 samples to split
    random_state=42
)

model.fit(X_train, y_train)

# Predict probabilities (not just 0/1)
y_pred_proba = model.predict_proba(X_test)[:, 1]

# Evaluate with ROC-AUC
auc = roc_auc_score(y_test, y_pred_proba)
print(f"ROC-AUC: {auc:.3f}")

# Feature importance: Which features matter most?
importances = model.feature_importances_
feature_importance_df = pd.DataFrame({
    'feature': X.columns,
    'importance': importances
}).sort_values('importance', ascending=False)

print("\nFeature Importance:")
print(feature_importance_df)

# Plot with handbook styling
import sys
sys.path.append('..')
from styles.plot_config import set_handbook_style, HANDBOOK_COLORS

set_handbook_style()
import seaborn as sns

plt.figure(figsize=(10, 6))
sns.barplot(data=feature_importance_df, x='importance', y='feature',
            color=HANDBOOK_COLORS['primary_blue'])
plt.title('Feature Importance in Outbreak Prediction')
plt.xlabel('Importance Score')
plt.tight_layout()
plt.savefig('../images/examples/feature_importance.png', dpi=300, bbox_inches='tight')

Strengths:

  ✅ Very accurate on most problems
  ✅ Handles thousands of features without feature selection
  ✅ Resistant to overfitting (unlike single trees)
  ✅ Provides feature importance scores
  ✅ No need for feature scaling
  ✅ Works out-of-the-box with minimal tuning

Weaknesses:

  ❌ Less interpretable than single trees (500 trees is hard to explain)
  ❌ Slower to train and predict than simple models
  ❌ Large file sizes (saving 500 trees takes space)
  ❌ Can struggle with highly imbalanced data

Tip: Practical Advice

Random Forests are often the best first “real” ML model to try after logistic regression. They’re:

  • Forgiving of mistakes in data preparation
  • Robust to outliers
  • Accurate enough for many applications
  • Easy to use with scikit-learn

For many public health prediction tasks, a well-tuned Random Forest is all you need.

4.4.4 4. Gradient Boosting: The Current Champion

The idea: Build trees sequentially, where each new tree focuses on correcting the mistakes of previous trees. This approach, formalized by Friedman (2001), has become the dominant method for structured data competitions.

Analogy: Instead of 500 doctors voting independently (Random Forest), you have a team of 500 doctors where each successive doctor specifically tries to fix the previous doctor’s misdiagnoses.

Popular implementations:

  • XGBoost (most popular)
  • LightGBM (faster, good for large data)
  • CatBoost (handles categorical features well)

Code example:

Hide code
from xgboost import XGBClassifier
from sklearn.model_selection import cross_val_score

# Train XGBoost model
model = XGBClassifier(
    n_estimators=500,
    learning_rate=0.05,    # How much each tree contributes
    max_depth=6,
    subsample=0.8,         # Use 80% of data for each tree
    colsample_bytree=0.8,  # Use 80% of features for each tree
    random_state=42
)

model.fit(X_train, y_train)

# Cross-validation for robust performance estimate
cv_scores = cross_val_score(model, X_train, y_train,
                             cv=5, scoring='roc_auc')
print(f"Cross-validated AUC: {cv_scores.mean():.3f} (+/- {cv_scores.std():.3f})")

# Test set performance
y_pred_proba = model.predict_proba(X_test)[:, 1]
test_auc = roc_auc_score(y_test, y_pred_proba)
print(f"Test AUC: {test_auc:.3f}")

# Feature importance (SHAP values for better interpretation)
import shap
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X_test)

# Visualize what drives predictions
shap.summary_plot(shap_values, X_test, feature_names=X.columns)

Strengths:

  ✅ Often the most accurate “classical” ML algorithm
  ✅ Wins most Kaggle competitions
  ✅ Handles missing data automatically
  ✅ Built-in regularization prevents overfitting
  ✅ Excellent with tabular data (most public health data)

Weaknesses:

  ❌ More hyperparameters to tune than Random Forests
  ❌ Easier to overfit if not careful
  ❌ Requires more computational resources
  ❌ Even less interpretable than Random Forests (but SHAP values help)

Important: When to Use Gradient Boosting

Use gradient boosting when:

  • You’ve tried Random Forests and need better performance
  • You have sufficient data (>10,000 observations)
  • Prediction accuracy is critical
  • You have time/resources for hyperparameter tuning

Stick with Random Forests when:

  • You want simplicity and fast iteration
  • Data is limited (<1,000 observations)
  • Interpretability is paramount

4.4.5 5. Neural Networks and Deep Learning

The idea: Layers of interconnected “neurons” that transform inputs into outputs through learned weights. Neural networks are universal approximators, meaning they can theoretically learn any function given enough data and capacity.

Simplified structure:

Input Layer → Hidden Layer(s) → Output Layer
   ↓              ↓                 ↓
[Features]    [Learned       [Predictions]
              representations]

Each connection has a “weight” that’s adjusted during training using backpropagation.

When neural networks shine:

  • Large datasets (millions of examples)
  • Complex patterns (images, text, audio)
  • Representation learning (automatic feature extraction)

When they struggle:

  • Small datasets (<10,000 examples)
  • Tabular data (trees often work better, as shown in Shwartz-Ziv & Armon 2022)
  • Interpretability is required

Simple neural network example:

Hide code
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import StandardScaler

# Neural networks need scaled features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Build neural network
# (100, 50) means:
#   - First hidden layer: 100 neurons
#   - Second hidden layer: 50 neurons
model = MLPClassifier(
    hidden_layer_sizes=(100, 50),
    activation='relu',        # Activation function
    max_iter=500,
    random_state=42,
    early_stopping=True,      # Stop if validation performance plateaus
    validation_fraction=0.1
)

model.fit(X_train_scaled, y_train)

# Predict
y_pred = model.predict(X_test_scaled)
y_pred_proba = model.predict_proba(X_test_scaled)[:, 1]

# Evaluate
from sklearn.metrics import classification_report
print(classification_report(y_test, y_pred))

Warning: Reality Check on Neural Networks for Tabular Data

For most public health tabular data (spreadsheets with rows and columns), gradient boosting methods usually outperform neural networks, as demonstrated in multiple systematic comparisons.

Neural networks excel at:

  • Medical imaging (chest X-rays, histopathology)
  • Time series with complex temporal patterns
  • Natural language processing
  • Multi-modal data (combining images + text + numbers)

For simple prediction from structured data, stick with tree-based methods first.

4.4.6 Deep Learning Architectures for Public Health

Deep learning = neural networks with many layers (10s to 100s of layers), as comprehensively covered in Goodfellow et al.’s textbook.

Key architectures relevant to public health:

4.4.6.1 Convolutional Neural Networks (CNNs)

  • Purpose: Image analysis
  • Applications: Medical imaging (X-rays, CT scans, pathology slides), skin lesion classification
  • How they work: Learn spatial hierarchies (edges → textures → organs → diagnoses)
Hide code
# Example: Chest X-ray pneumonia detection with CNN
from tensorflow.keras.applications import ResNet50
from tensorflow.keras.layers import Dense, GlobalAveragePooling2D
from tensorflow.keras.models import Model

# Load pre-trained ResNet50 (transfer learning)
base_model = ResNet50(weights='imagenet', include_top=False,
                     input_shape=(224, 224, 3))

# Add custom classification layers
x = base_model.output
x = GlobalAveragePooling2D()(x)
x = Dense(256, activation='relu')(x)
predictions = Dense(1, activation='sigmoid')(x)  # Binary: pneumonia or not

model = Model(inputs=base_model.input, outputs=predictions)

# Freeze pre-trained layers, train only new layers
for layer in base_model.layers:
    layer.trainable = False

model.compile(optimizer='adam', loss='binary_crossentropy',
              metrics=['accuracy', 'AUC'])

# Train on chest X-ray dataset
# model.fit(train_images, train_labels, epochs=10, validation_split=0.2)

Key concept: Transfer Learning. Instead of training from scratch, start with a model pre-trained on millions of images (ImageNet), then fine-tune it on your medical images. This transfer learning approach works even with small datasets (1,000-10,000 images).

4.4.6.2 Recurrent Neural Networks (RNNs) and LSTMs

  • Purpose: Sequential data (time series, text)
  • Applications: Epidemic forecasting, patient trajectory prediction, clinical note analysis
  • How they work: “Remember” previous time steps when processing the current one, using LSTM architectures
Hide code
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense, Dropout

# Example: Flu case forecasting from weekly sequences
timesteps, features = 52, 4   # placeholder shape: 52 weeks of history, 4 features per week

model = Sequential([
    LSTM(50, return_sequences=True, input_shape=(timesteps, features)),
    Dropout(0.2),
    LSTM(50, return_sequences=False),
    Dropout(0.2),
    Dense(1)  # Predict next week's case count
])

model.compile(optimizer='adam', loss='mse')
# model.fit(X_train, y_train, epochs=50, batch_size=32)

4.4.6.3 Transformers and Large Language Models

  • Purpose: Natural language understanding
  • Applications: Clinical note extraction, literature review, patient Q&A, surveillance
  • Key models: BERT, GPT-4, Bio_ClinicalBERT, PubMedBERT

We’ll cover LLMs extensively in Chapter 7: Large Language Models in Public Health.

Tip: Deep Learning in Practice

Good news: You rarely need to build deep learning models from scratch. Instead:

  1. Use pre-trained models (transfer learning)
  2. Use existing libraries: TensorFlow, PyTorch, Hugging Face
  3. Start with tutorials: Fast.ai, DeepLearning.AI

For most public health applications, fine-tuning existing models is more practical than training from scratch.

4.5 Quick Algorithm Selection Guide

Use this decision tree to choose algorithms:

Do you have labeled data?
  └─ No  → Unsupervised learning (clustering, anomaly detection)
  └─ Yes → Continue...

What type of data?
  └─ Images        → CNNs (ResNet, EfficientNet)
  └─ Text          → Transformers (BERT, GPT)
  └─ Time series   → LSTMs, Transformers, or gradient boosting
  └─ Tabular       → Continue...

How much data?
  └─ < 1,000 rows     → Logistic regression, single decision tree
  └─ 1,000-10,000     → Random Forest
  └─ > 10,000         → Gradient boosting (XGBoost)
  └─ > 100,000        → Deep learning (if complex patterns)

Do you need interpretability?
  └─ Critical → Logistic regression or single decision tree
  └─ Nice     → Random Forest + SHAP values
  └─ Less important → Gradient boosting or neural networks

4.6 Evaluation Metrics: Measuring Success

You’ve trained a model. How do you know if it’s good?

4.6.1 For Classification (Disease/No Disease, Outbreak/No Outbreak)

4.6.1.1 1. Confusion Matrix

The foundation of classification evaluation:

                       Predicted: Disease     Predicted: No Disease
Actual: Disease        True Positive (TP)     False Negative (FN)
Actual: No Disease     False Positive (FP)    True Negative (TN)

From this, we derive all other metrics.
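
In scikit-learn you can get these counts directly from your test-set labels and predictions (reusing y_test and y_pred from the earlier examples):

Hide code
from sklearn.metrics import confusion_matrix

# Rows = actual class, columns = predicted class.
# labels=[1, 0] puts the disease class first, matching the table above.
cm = confusion_matrix(y_test, y_pred, labels=[1, 0])
tp, fn, fp, tn = cm[0, 0], cm[0, 1], cm[1, 0], cm[1, 1]
print(f"TP={tp}  FN={fn}  FP={fp}  TN={tn}")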

4.6.1.2 2. Accuracy

\[\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}\]

When it’s misleading: Imbalanced data

Example: If only 1% of patients have disease, a model that predicts “No disease” for everyone gets 99% accuracy but is useless!

4.6.1.3 3. Sensitivity (Recall, True Positive Rate)

\[\text{Sensitivity} = \frac{TP}{TP + FN}\]

“Of all people with disease, what fraction did we detect?”

Critical when: Missing cases is dangerous (e.g., cancer screening)

4.6.1.4 4. Specificity (True Negative Rate)

\[\text{Specificity} = \frac{TN}{TN + FP}\]

“Of all healthy people, what fraction did we correctly identify?”

Critical when: False alarms are costly or harmful

4.6.1.5 5. Positive Predictive Value (Precision)

\[\text{PPV} = \frac{TP}{TP + FP}\]

“Of all positive predictions, how many are correct?”

Critical when: You act on positive predictions (e.g., administering expensive treatment)

4.6.1.6 6. F1 Score

\[F1 = 2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}\]

Harmonic mean of precision and recall. Useful when you need a single metric balancing both.
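
All of these metrics can be computed from the confusion-matrix counts above, or with scikit-learn’s helper functions (again reusing y_test and y_pred):

Hide code
from sklearn.metrics import accuracy_score, recall_score, precision_score, f1_score

sensitivity = recall_score(y_test, y_pred)                 # TP / (TP + FN)
specificity = recall_score(y_test, y_pred, pos_label=0)    # TN / (TN + FP)
ppv = precision_score(y_test, y_pred)                      # TP / (TP + FP)
f1 = f1_score(y_test, y_pred)
acc = accuracy_score(y_test, y_pred)

print(f"Sensitivity: {sensitivity:.2f}  Specificity: {specificity:.2f}")
print(f"PPV: {ppv:.2f}  F1: {f1:.2f}  Accuracy: {acc:.2f}")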

4.6.1.7 7. ROC-AUC (Area Under the Receiver Operating Characteristic Curve)

The gold standard for binary classification, introduced in signal detection theory.

  • Plots True Positive Rate vs. False Positive Rate at all threshold values
  • AUC = 0.5: Random guessing
  • AUC = 1.0: Perfect classifier
  • AUC > 0.7: Decent
  • AUC > 0.8: Good
  • AUC > 0.9: Excellent (but check for overfitting!)

Code example:

Hide code
from sklearn.metrics import roc_curve, auc, RocCurveDisplay
import matplotlib.pyplot as plt
import sys
sys.path.append('..')
from styles.plot_config import set_handbook_style, HANDBOOK_COLORS

set_handbook_style()

# Get predicted probabilities
y_pred_proba = model.predict_proba(X_test)[:, 1]

# Calculate ROC curve
fpr, tpr, thresholds = roc_curve(y_test, y_pred_proba)
roc_auc = auc(fpr, tpr)

# Plot with handbook colors
fig, ax = plt.subplots(figsize=(8, 6))
ax.plot(fpr, tpr, color=HANDBOOK_COLORS['primary_blue'], lw=2, label=f'ROC curve (AUC = {roc_auc:.2f})')
ax.plot([0, 1], [0, 1], color=HANDBOOK_COLORS['mid_gray'], linestyle='--', lw=1.5, label='Random Chance')
ax.set_xlim([0.0, 1.0])
ax.set_ylim([0.0, 1.05])
ax.set_xlabel('False Positive Rate')
ax.set_ylabel('True Positive Rate')
ax.set_title(f'ROC Curve (AUC = {roc_auc:.3f})')
ax.legend(loc="lower right")
ax.grid(True, alpha=0.3)
plt.tight_layout()
plt.savefig('../images/examples/roc_curve.png', dpi=300, bbox_inches='tight')

Important: Choosing the Right Metric

No single metric tells the whole story, as emphasized in Powers’ comprehensive review of evaluation metrics. Report multiple metrics:

  • Screening/surveillance: Prioritize sensitivity (don’t miss cases)
  • Confirmatory testing: Prioritize specificity (avoid false alarms)
  • Balanced: Report F1 or ROC-AUC
  • Always report: Confusion matrix so readers can calculate their preferred metric

4.6.2 For Regression (Predicting Continuous Values)

4.6.2.1 1. Mean Absolute Error (MAE)

\[MAE = \frac{1}{n}\sum |y_{\text{true}} - y_{\text{pred}}|\]

Average absolute difference. Easy to interpret (same units as target).

4.6.2.2 2. Root Mean Squared Error (RMSE)

\[RMSE = \sqrt{\frac{1}{n}\sum (y_{\text{true}} - y_{\text{pred}})^2}\]

Penalizes large errors more than MAE. More sensitive to outliers.

4.6.2.3 3. R² (Coefficient of Determination)

\[R^2 = 1 - \frac{\sum(y_{\text{true}} - y_{\text{pred}})^2}{\sum(y_{\text{true}} - \bar{y})^2}\]

  • R² = 1: Perfect predictions
  • R² = 0: Model no better than predicting mean
  • R² < 0: Model worse than predicting mean (yikes!)

Code example:

Hide code
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
import numpy as np

# Predict continuous outcome (e.g., number of cases next week)
y_pred = model.predict(X_test)

# Calculate metrics
mae = mean_absolute_error(y_test, y_pred)
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
r2 = r2_score(y_test, y_pred)

print(f"MAE:  {mae:.2f}")
print(f"RMSE: {rmse:.2f}")
print(f"R²:   {r2:.3f}")

# Visualize predictions vs. actual with handbook styling
import sys
sys.path.append('..')
from styles.plot_config import set_handbook_style, HANDBOOK_COLORS

set_handbook_style()

plt.figure(figsize=(8, 6))
plt.scatter(y_test, y_pred, alpha=0.6, color=HANDBOOK_COLORS['primary_blue'],
            edgecolors=HANDBOOK_COLORS['dark_gray'], linewidth=0.5, s=50)
plt.plot([y_test.min(), y_test.max()],
         [y_test.min(), y_test.max()],
         color=HANDBOOK_COLORS['secondary_coral'], linestyle='--', lw=2, label='Perfect Prediction')
plt.xlabel('Actual Values')
plt.ylabel('Predicted Values')
plt.title('Predictions vs. Actual')
plt.legend()
plt.tight_layout()
plt.savefig('../images/examples/regression_performance.png', dpi=300, bbox_inches='tight')

4.7 Overfitting and Underfitting: The Bias-Variance Tradeoff

4.7.1 The Central Challenge

This fundamental concept, formalized in Geman et al.’s seminal paper, represents the core challenge of machine learning.

Underfitting (High Bias): Model is too simple. Fails to capture patterns even in training data.

Overfitting (High Variance): Model is too complex. Memorizes training data perfectly but fails on new data.

The Goal: Find the sweet spot—a model complex enough to capture true patterns but not so complex it memorizes noise.

Visual intuition:

Underfitting:        Just Right:          Overfitting:

   •  •  •            •  •  •              •  •  •
  •  ──  •           •  ~~~  •            •   │  •
 •        •         •        •           •    │  •
•          •       •          •         •     │  •
   (Line)           (Smooth curve)        (Zigzag through
                                           every point)

4.7.2 How to Detect Overfitting

Hide code
# Compare training vs. test performance
train_accuracy = model.score(X_train, y_train)
test_accuracy = model.score(X_test, y_test)

print(f"Training accuracy: {train_accuracy:.3f}")
print(f"Test accuracy:     {test_accuracy:.3f}")
print(f"Difference:        {train_accuracy - test_accuracy:.3f}")

Red flags:

  • Training accuracy = 0.95, Test accuracy = 0.65 → Overfitting!
  • Training accuracy = 0.60, Test accuracy = 0.58 → Underfitting (both low)
  • Training accuracy = 0.82, Test accuracy = 0.80 → Just right (close and good)

4.7.3 How to Prevent Overfitting

  1. More data (the best solution, if possible)
  2. Regularization (penalize model complexity using L1/L2 penalties)
  3. Cross-validation (validate on multiple splits)
  4. Simpler model (fewer features, less depth)
  5. Early stopping (stop training before overfitting starts)
  6. Ensemble methods (average multiple models)

Example with regularization:

Hide code
from sklearn.linear_model import LogisticRegressionCV

# CV suffix = automatic cross-validation for hyperparameter tuning
model = LogisticRegressionCV(
    Cs=10,              # Try 10 different regularization strengths
    cv=5,               # 5-fold cross-validation
    scoring='roc_auc',
    random_state=42
)

model.fit(X_train, y_train)

# Best regularization parameter found automatically
print(f"Best C (regularization): {model.C_[0]:.4f}")

4.8 Common Pitfalls and How to Avoid Them

4.8.1 1. Data Leakage

The problem: Test data “leaks” into training, inflating performance estimates, as documented in Kaufman et al.’s comprehensive analysis.

Example mistake:

Hide code
# BAD: Normalize before splitting
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)  # Uses ALL data including test!
X_train, X_test = train_test_split(X_scaled, test_size=0.2)

Correct approach:

Hide code
# GOOD: Split first, then normalize
X_train, X_test = train_test_split(X, test_size=0.2)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # Fit on train only
X_test_scaled = scaler.transform(X_test)        # Apply to test

4.8.2 2. Ignoring Class Imbalance

The problem: When one class is rare (e.g., 1% outbreak rate), models can get high accuracy by always predicting the majority class.

Solutions:

Hide code
# Option 1: Use class_weight parameter
model = RandomForestClassifier(class_weight='balanced')

# Option 2: Use appropriate evaluation metrics (F1, ROC-AUC, not accuracy)

# Option 3: Resample data
from imblearn.over_sampling import SMOTE
smote = SMOTE(random_state=42)
X_train_balanced, y_train_balanced = smote.fit_resample(X_train, y_train)

4.8.3 3. Not Using Cross-Validation

The problem: Single train/test split might be lucky or unlucky.

Solution:

Hide code
from sklearn.model_selection import cross_val_score

# 5-fold cross-validation
cv_scores = cross_val_score(model, X_train, y_train,
                             cv=5, scoring='roc_auc')
print(f"CV AUC: {cv_scores.mean():.3f} (+/- {cv_scores.std():.3f})")

4.8.4 4. Ignoring Temporal Ordering

The problem: In time series, future information leaks into past predictions.

Wrong:

Hide code
# Random split destroys temporal order
X_train, X_test = train_test_split(X, test_size=0.2)  # BAD for time series

Correct:

Hide code
# Respect temporal order
split_point = int(0.8 * len(df))
train_df = df[:split_point]
test_df = df[split_point:]

X_train, y_train = train_df[features], train_df['outcome']
X_test, y_test = test_df[features], test_df['outcome']
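
scikit-learn also provides TimeSeriesSplit for cross-validation that always trains on earlier data and validates on later data. A minimal sketch, assuming X and y are ordered chronologically and model is any classifier from this chapter:

Hide code
from sklearn.model_selection import TimeSeriesSplit, cross_val_score

# Each fold trains on earlier weeks and validates on the weeks that follow
tscv = TimeSeriesSplit(n_splits=5)
cv_scores = cross_val_score(model, X, y, cv=tscv, scoring='roc_auc')
print(f"Time-aware CV AUC: {cv_scores.mean():.3f} (+/- {cv_scores.std():.3f})")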

4.9 Key Takeaways

  1. Start simple: Logistic regression → Random Forest → Gradient Boosting → Deep Learning. Don’t skip steps.

  2. Feature engineering > Algorithm choice: Domain expertise in creating good features matters more than fancy algorithms.

  3. Evaluation is multi-dimensional: No single metric tells the whole story. Report confusion matrix, ROC-AUC, and metrics relevant to your use case.

  4. Overfitting is the enemy: Always validate on held-out test data. Use cross-validation. Be skeptical of “too good” results.

  5. Interpretability matters: In public health, stakeholders need to understand and trust models. Use SHAP values, feature importance, and simple models when possible.

  6. AI ≠ Magic: It’s pattern recognition from data. Garbage in = garbage out. Your domain knowledge is irreplaceable.

  7. Most problems don’t need deep learning: Trees and ensembles work great for tabular data. Save neural networks for images, text, and truly complex patterns.

4.11 Practice Exercises

4.11.1 Exercise 1: Your First Classifier

Using the outbreak detection code patterns above:

  1. Modify features (try removing or adding features)
  2. Try different algorithms (logistic regression, decision tree)
  3. Compare performance—which works best? Why?

4.11.2 Exercise 2: Handle Imbalance

Create a dataset where outbreaks are only 2% of weeks:

  1. Train a model without addressing imbalance
  2. Train with class_weight='balanced'
  3. Compare using F1 score and ROC-AUC

4.11.3 Exercise 3: Interpret Predictions

Pick one misclassified test example:

  1. What features contributed to the wrong prediction?
  2. Use SHAP values to visualize
  3. Could a human have made the same mistake?

Check Your Understanding

Test your knowledge of machine learning fundamentals and their applications in public health. Each question builds on the key concepts from this chapter.

Note: Question 1

You’re building a model to predict dengue outbreaks using historical data with known outbreak labels (yes/no). You have temperature, rainfall, and previous case counts as features. Which machine learning paradigm should you use?

  1. Unsupervised learning, because weather patterns are complex
  2. Supervised learning, because you have labeled historical outbreak data
  3. Reinforcement learning, because outbreak prediction requires sequential decision-making
  4. Deep learning, because outbreak prediction is a complex task

Correct Answer: b) Supervised learning, because you have labeled historical outbreak data

This is a classic supervised learning problem because:

  • You have labeled data: Historical weeks with known outbreak outcomes (yes/no)
  • You have features: Temperature, rainfall, previous case counts
  • You want to predict a specific outcome: Whether an outbreak will occur

The three fundamental paradigms are distinguished by:

  1. Supervised learning: You have labeled examples (inputs + correct outputs), and you want to predict outcomes for new data
  2. Unsupervised learning: You have unlabeled data and want to discover patterns or structure (clustering, anomaly detection)
  3. Reinforcement learning: You’re optimizing sequential decisions through trial and error with rewards/penalties

Common supervised learning applications in public health include:

  • Disease diagnosis from symptoms or images
  • Outbreak prediction from surveillance data
  • Risk stratification for high-risk patients
  • Hospital readmission prediction

For this dengue outbreak problem, you’d likely start with logistic regression (binary classification), then try Random Forests or gradient boosting if you need better performance.

Note: Question 2

You’ve trained a model to predict hospital readmissions. Training accuracy is 95%, but test accuracy is only 68%. What is the most likely problem, and what should you do?

  1. Underfitting - use a more complex model with more parameters
  2. Overfitting - the model memorized training data and doesn’t generalize well
  3. Data leakage - information from the test set leaked into training
  4. Class imbalance - readmissions are rare events

Correct Answer: b) Overfitting - the model memorized training data and doesn’t generalize well

The large gap between training (95%) and test (68%) accuracy is a classic sign of overfitting. The model is too complex and has essentially memorized the training data, including its noise and quirks, rather than learning generalizable patterns.

Why overfitting happened:

  • Model is too complex (too many parameters, too deep trees)
  • Too little training data for the model’s complexity
  • Insufficient regularization
  • Training for too many iterations

How to fix overfitting:

  1. Use regularization: Add L1/L2 penalties to constrain model complexity
  2. Simplify the model: Use fewer features, shallower trees, or simpler algorithms
  3. Get more data: The best solution if possible
  4. Use cross-validation: Validate on multiple data splits to catch overfitting early
  5. Early stopping: Stop training before the model starts memorizing
  6. Ensemble methods: Average multiple models to reduce overfitting

Why other answers are wrong:

  • Underfitting would show both training and test accuracy being low (e.g., 60% and 58%)
  • Data leakage would make test performance too good (unrealistically high)
  • Class imbalance could be a problem but wouldn’t explain the train/test gap

Note: Question 3

For most public health tabular data (structured rows and columns with patient demographics, lab results, and outcomes), which approach typically performs BEST?

  1. Deep neural networks with many hidden layers
  2. Gradient boosting methods (XGBoost, LightGBM) or Random Forests
  3. Simple logistic regression with no feature engineering
  4. Convolutional neural networks (CNNs)

Correct Answer: b) Gradient boosting methods (XGBoost, LightGBM) or Random Forests

For tabular data (spreadsheet-style structured data), tree-based ensemble methods consistently outperform neural networks, as demonstrated in multiple systematic comparisons. Here’s why:

Advantages of gradient boosting/Random Forests for tabular data:

  • Naturally handle mixed data types (numerical + categorical)
  • Don’t require feature scaling or normalization
  • Robust to outliers and missing data
  • Automatic feature interaction detection
  • Excellent performance with modest amounts of data (thousands to tens of thousands of rows)
  • Provide interpretable feature importance scores

When deep learning excels (NOT tabular data):

  • Images: CNNs for chest X-rays, pathology slides, skin lesions
  • Text: Transformers/LLMs for clinical notes, literature, patient messages
  • Complex time series: LSTMs for epidemic forecasting with long-term dependencies
  • Multimodal data: Combining images + text + structured data

Recommended approach for tabular public health data:

  1. Start with logistic regression (interpretable baseline)
  2. Try Random Forest (forgiving, works well out-of-the-box)
  3. Optimize with XGBoost/LightGBM if you need maximum performance
  4. Only use neural networks if tree methods fail and you have massive amounts of data

Most winning Kaggle solutions for tabular data use gradient boosting, not deep learning.

Note: Question 4

You’re evaluating a disease screening tool for a rare condition (1% prevalence). The model achieves 99% accuracy. Your colleague says “This is excellent! Let’s deploy it.” What’s the problem?

  1. There is no problem - 99% accuracy is excellent for any application
  2. Accuracy is misleading with imbalanced data; the model might just predict “no disease” for everyone
  3. 99% is too high and suggests the model is overfitting
  4. Screening tools should prioritize specificity over accuracy

Correct Answer: b) Accuracy is misleading with imbalanced data; the model might just predict “no disease” for everyone

This is the class imbalance problem. With 1% disease prevalence, a completely useless model that predicts “no disease” for everyone would achieve 99% accuracy while providing zero clinical value.

Why accuracy fails:

  • Accuracy = (TP + TN) / (TP + TN + FP + FN)
  • If the model predicts “no disease” for all 1,000 patients:
      • True Negatives (TN): 990 (correctly identified healthy)
      • False Negatives (FN): 10 (missed all disease cases!)
      • Accuracy = 990/1000 = 99%
      • Sensitivity (recall) = 0/10 = 0% (detected zero disease cases!)

What you should check instead:

  1. Confusion matrix: See actual TP, TN, FP, FN counts
  2. Sensitivity (recall): TP/(TP+FN) - critical for screening (don’t miss disease)
  3. Specificity: TN/(TN+FP) - avoid false alarms
  4. Positive Predictive Value (PPV): TP/(TP+FP) - if you flag someone, what’s the probability they actually have disease?
  5. ROC-AUC: Overall discriminative ability across all thresholds
  6. F1 Score: Balances precision and recall

For disease screening, sensitivity (detecting actual cases) is usually prioritized over specificity. You can’t afford to miss cases of a serious disease, even if it means some false positives that require follow-up testing.

How to fix:

  • Use class_weight='balanced' in your model
  • Use appropriate evaluation metrics (F1, ROC-AUC, not accuracy)
  • Consider SMOTE or other resampling techniques

Note: Question 5

A critical concept in machine learning is preventing “data leakage.” Which scenario represents data leakage that would invalidate your model evaluation?

  1. Using domain expertise to engineer new features before splitting train/test sets
  2. Normalizing features using statistics (mean, std) computed on the entire dataset before splitting into train/test sets
  3. Including demographic variables that might correlate with protected classes
  4. Training on data from 2020-2022 and testing on data from 2023

Correct Answer: b) Normalizing features using statistics (mean, std) computed on the entire dataset before splitting into train/test sets

This is a classic data leakage mistake that makes your model appear better than it really is.

Why this is leakage:

WRONG approach:

# Compute statistics using ALL data (including test!)
X_scaled = scaler.fit_transform(X)
X_train, X_test = train_test_split(X_scaled)

The test set’s mean and standard deviation influenced the scaling transformation. The model indirectly “saw” information from the test set during training.

CORRECT approach:

# Split FIRST, then normalize
X_train, X_test = train_test_split(X)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # Fit on train only
X_test_scaled = scaler.transform(X_test)        # Apply to test

Other common data leakage scenarios:

  • Including future information in features (e.g., using next week’s data to predict this week)
  • Using the same patients in both train and test sets
  • Performing feature selection on the full dataset before splitting
  • Cross-validation on time series without respecting temporal order

Why other answers are NOT leakage:

  • Feature engineering before splitting: Fine, as long as you don’t use test labels or test-specific statistics
  • Including demographic variables: An ethical/fairness concern, not leakage
  • Time-based train/test split: Actually the CORRECT approach for temporal data to avoid leakage!

The cardinal rule: Test data must remain completely untouched until final evaluation. Any processing (scaling, feature selection, etc.) must be fit on training data only.

Note: Question 6

You’re choosing between a Random Forest with 85% accuracy that you can’t easily explain, and logistic regression with 80% accuracy that provides interpretable odds ratios. For predicting which communities should receive limited outbreak response resources, which should you choose and why?

  1. Random Forest - 5% better accuracy could save lives, interpretability doesn’t matter
  2. Logistic regression - interpretability is critical for stakeholder trust and accountability in resource allocation
  3. Deep learning - should always use the most sophisticated method available
  4. Either model is fine since they’re both above 80% accuracy

Correct Answer: b) Logistic regression - interpretability is critical for stakeholder trust and accountability in resource allocation

This question highlights a fundamental tension in applied machine learning: accuracy vs. interpretability. For public health applications, especially those involving resource allocation, interpretability often matters more than marginal accuracy gains.

Why interpretability is critical here:

  1. Stakeholder trust: Public health officials, community leaders, and the public need to understand why certain communities were prioritized. “The black-box algorithm said so” erodes trust.

  2. Accountability: If resources are misallocated, you need to explain what went wrong and fix it. With logistic regression, you can identify which factors (e.g., population density, previous outbreak rates) drove the decision.

  3. Equity concerns: Resource allocation decisions can perpetuate health inequities. Interpretable models let you audit for fairness - are we discriminating based on race, income, or other protected attributes?

  4. Domain validation: Epidemiologists can review odds ratios and say “this makes biological sense” or “this is spurious.” Can’t do that with a Random Forest.

  5. Legal/regulatory: Many jurisdictions require explainable decision-making for consequential decisions affecting people’s lives.

When to prioritize accuracy over interpretability:

  • Medical imaging diagnosis (radiology AI supporting clinicians)
  • Internal risk flagging (clinician reviews all flagged cases anyway)
  • Non-consequential applications (exploratory analysis)

Best practice approach:

  1. Start with an interpretable model (logistic regression)
  2. If accuracy is insufficient for clinical value, try more complex models
  3. Use SHAP values or LIME to interpret complex models
  4. Consider the context - resource allocation demands more interpretability than internal screening tools

The 5% accuracy gap must be weighed against:

  • Can you defend decisions to communities?
  • Can you audit for bias?
  • Can domain experts validate the logic?
  • Will stakeholders trust and adopt the system?

For consequential public health decisions, the answer is usually: choose the interpretable model unless the accuracy gap is unacceptably large.

4.12 Discussion Questions

  1. When would you choose logistic regression over a Random Forest, even if the Random Forest has higher accuracy?

  2. Your model has 95% accuracy but only 40% sensitivity. Is this acceptable for an outbreak detection system? Why or why not?

  3. A colleague says “deep learning is always better than traditional ML.” How do you respond?

  4. You train a model to predict hospital readmission with AUC = 0.92 on test data. Should you deploy it immediately? What else should you check?

  5. If forced to choose between a model that’s 90% accurate but uninterpretable vs. 80% accurate but fully interpretable, which would you choose for a public health application? Does the context matter?

4.13 Further Reading

4.13.1 Foundational Papers

4.13.2 Public Health Applications

4.13.3 Explainability and Fairness


You now have the foundation to understand, evaluate, and apply AI in public health contexts. The remaining chapters build on these concepts with domain-specific applications.

Next chapter: Chapter 3: The Data Problem - Understanding why public health data is uniquely challenging and how to handle it.