This chapter teaches you enough to be productively dangerous in public health AI.
4.2 The Three Fundamental Paradigms
Nearly all of machine learning falls into three broad categories, as described in Mitchell's foundational work on machine learning. Master these, and you'll understand 90% of AI applications in public health.
4.2.1 1. Supervised Learning: Learning from Examples
The Idea: You show the algorithm examples of inputs and correct outputs. It learns the pattern. Then it predicts outputs for new inputs.
Analogy: Teaching a medical student by showing them patient cases with known diagnoses.
Real-world example:
Input: Patient symptoms (fever, cough, chest pain)
Output: Diagnosis (pneumonia)
You show the algorithm 10,000 labeled examples.
It learns: "fever + productive cough + chest pain → pneumonia (85% probability)"
When to use it: - You have labeled historical data (inputs + correct answers) - You want to predict a specific outcome - Examples: disease diagnosis, outbreak prediction, risk stratification
Supervised learning requires labeled data—meaning someone (often human experts) must have already provided the “correct answers” for your training examples.
4.2.2 2. Unsupervised Learning: Finding Hidden Patterns
The Idea: You give the algorithm data with NO labels. It finds structure, patterns, or groups on its own.
Analogy: Sorting a messy pile of patient records into natural categories without being told what categories to use.
Real-world example:
Input: Symptom reports from 50,000 emergency department visits
Output: "I found 7 distinct clusters—one looks like flu-like illness,
another looks like gastrointestinal illness, etc."
When to use it: - You have unlabeled data (which is most data!) - You want to discover structure or segments - You’re exploring without a specific prediction goal
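To make the contrast with supervised learning concrete, here is a minimal clustering sketch on a tiny made-up symptom matrix (the data and column names are illustrative, not from the handbook's datasets). No labels are given; k-means simply groups similar visits.

from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
import pandas as pd

# Hypothetical symptom matrix: one row per ED visit, one column per symptom (0/1)
visits = pd.DataFrame({
    'fever':    [1, 1, 0, 0, 1, 0],
    'cough':    [1, 1, 0, 0, 1, 0],
    'vomiting': [0, 0, 1, 1, 0, 1],
    'diarrhea': [0, 0, 1, 1, 0, 1],
})

# Scale features so no single symptom dominates the distance calculation
X_scaled = StandardScaler().fit_transform(visits)

# Ask for 2 clusters; in practice you would compare several values of k
kmeans = KMeans(n_clusters=2, n_init=10, random_state=42)
labels = kmeans.fit_predict(X_scaled)

# Inspect each cluster: mean symptom prevalence per group
print(visits.assign(cluster=labels).groupby('cluster').mean())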
4.2.3 3. Reinforcement Learning: Learning by Trial and Error
The Idea: An agent takes actions in an environment and receives rewards or penalties. It learns which actions lead to the best outcomes. This approach, formalized by Sutton and Barto in their seminal text, has applications in sequential decision-making.
Analogy: Training a dog—reward good behavior, discourage bad behavior, let it learn through experience.
When to use it: - You’re optimizing sequential decisions - There’s no “correct answer” dataset - You have a simulator or can interact safely
Reinforcement learning sounds exciting but is much harder to implement than supervised learning. It requires careful reward engineering and lots of computational resources.
For most public health problems, supervised learning is more practical. Don’t use RL just because it sounds cool.
For this handbook, we’ll focus primarily on supervised learning—it’s where 80% of practical public health AI lives.
4.3 Core Supervised Learning Concepts
4.3.1 The Machine Learning Workflow
Every supervised learning project follows this pattern, as outlined in Géron’s practical guide:
# 1. Collect labeled data
data = load_outbreak_data()  # Features + outcomes

# 2. Split into train/test sets
train_data, test_data = split(data, test_size=0.2)

# 3. Choose and train a model
model = RandomForestClassifier()
model.fit(train_data.features, train_data.outcomes)

# 4. Evaluate on unseen test data
predictions = model.predict(test_data.features)
accuracy = evaluate(predictions, test_data.outcomes)

# 5. Deploy and monitor
model.save('outbreak_predictor.pkl')
Let’s break down each step.
4.3.2 Step 1: Features and Labels
Features (also called “predictors” or “independent variables”): The inputs your model uses to make predictions.
Labels (also called “targets” or “dependent variables”): The outputs you’re trying to predict.
Feature engineering turns raw inputs into more informative features. Examples:
- Raw: Daily case counts → Engineered: 7-day rolling average
- Raw: Birth date → Engineered: Age in years
- Raw: GPS coordinates → Engineered: Distance to nearest hospital
- Raw: Text symptoms → Engineered: Presence/absence of key symptom terms
This is where epidemiologists have an advantage over pure data scientists. You know which features actually matter biologically and epidemiologically.
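As a rough pandas sketch of the raw-to-engineered transformations above (the data frame and column names are hypothetical):

import pandas as pd

# Hypothetical surveillance data; column names are illustrative
df = pd.DataFrame({
    'date': pd.date_range('2024-01-01', periods=10, freq='D'),
    'daily_cases': [2, 3, 5, 4, 8, 12, 9, 15, 14, 20],
    'birth_date': pd.to_datetime(['1980-05-01'] * 10),
    'symptom_text': ['fever and cough', 'headache', 'fever', 'cough',
                     'fever cough', 'nausea', 'fever', 'cough', 'fever', 'none'],
})

# Raw daily counts -> 7-day rolling average (smooths day-of-week noise)
df['cases_7day_avg'] = df['daily_cases'].rolling(window=7, min_periods=1).mean()

# Birth date -> age in years
df['age_years'] = (df['date'] - df['birth_date']).dt.days // 365

# Free-text symptoms -> presence/absence of a key term
df['has_fever'] = df['symptom_text'].str.contains('fever').astype(int)

print(df[['date', 'cases_7day_avg', 'age_years', 'has_fever']].head())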
4.3.3 Step 2: Split Your Data
Training set (60-70%): Model learns patterns from this
Validation set (10-20%): Used to tune model settings (hyperparameters)
Test set (20%): Never touched until final evaluation—simulates real-world performance
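A common way to get all three sets is to call train_test_split twice; the proportions below are illustrative, and X and y are assumed to hold your features and labels:

from sklearn.model_selection import train_test_split

# First carve off the final test set (20%); it stays untouched until the end
X_temp, X_test, y_temp, y_test = train_test_split(
    X, y, test_size=0.20, random_state=42, stratify=y
)

# Then split the remainder into training (60% overall) and validation (20% overall)
X_train, X_val, y_train, y_val = train_test_split(
    X_temp, y_temp, test_size=0.25, random_state=42, stratify=y_temp
)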
Why this matters:
# BAD: Tuning on test data
model = train_model(train_data)
# Adjust model settings...
performance = evaluate(model, test_data)  # Too optimistic!

# GOOD: Using validation set
model = train_model(train_data)
# Adjust model settings based on validation performance...
performance = evaluate(model, validation_data)

# Only after all tuning is done:
final_performance = evaluate(model, test_data)  # True performance
Warning: The Cardinal Sin: Data Leakage
Data leakage happens when information from the test set “leaks” into training. This makes your model look better than it really is, a problem extensively documented in Kaufman et al.’s analysis of leakage in healthcare ML.
Common mistakes: - Normalizing features before splitting data - Including future information in features - Using the same patients in both train and test - Cross-validation on time series without respecting temporal order
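The last item deserves a concrete illustration. A sketch using scikit-learn's TimeSeriesSplit, assuming X and y are ordered chronologically, so each validation fold is strictly later in time than the data the model was trained on:

from sklearn.model_selection import TimeSeriesSplit, cross_val_score
from sklearn.linear_model import LogisticRegression

# Each fold trains on earlier weeks and validates only on later weeks
tscv = TimeSeriesSplit(n_splits=5)

model = LogisticRegression(max_iter=1000)
scores = cross_val_score(model, X, y, cv=tscv, scoring='roc_auc')
print(f"Temporal CV AUC: {scores.mean():.3f} (+/- {scores.std():.3f})")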
4.3.4 Step 3: Training the Model
If you have ever fit a logistic regression by estimating coefficients that minimize error, you have already seen model training. Machine learning does the same thing, but: - Often with millions of parameters - Using iterative optimization (gradient descent, not closed-form solutions) - With regularization to prevent overfitting - On much larger datasets
Here’s what’s happening under the hood:
# Simplified training loop (don't worry about details)
for epoch in range(num_epochs):
    # Make predictions with current parameters
    predictions = model.predict(X_train)

    # Calculate error (loss)
    error = calculate_loss(predictions, y_train)

    # Adjust parameters to reduce error
    gradients = compute_gradients(error)
    parameters = parameters - learning_rate * gradients

    # Repeat until error stops decreasing
The model is literally learning from mistakes—making predictions, checking how wrong they are, and adjusting to do better next time.
4.4 Common Machine Learning Algorithms
Let’s survey the algorithms you’ll encounter most often in public health AI. For each, we’ll cover how it works, strengths/weaknesses, when to use it, and a practical example.
4.4.1 1. Logistic Regression: Your Old Friend
You already know this one! Logistic regression predicts binary outcomes (disease/no disease, outbreak/no outbreak).
When to use it: - Binary classification problems - You want interpretable coefficients (odds ratios) - You have modest amounts of data - Linear relationships are reasonable
Strengths: ✅ Interpretable (you can explain to stakeholders) ✅ Fast to train ✅ Works well with small data ✅ Provides probability estimates ✅ Epidemiologists understand it intuitively
Weaknesses: ❌ Assumes linear relationships (in log-odds space) ❌ Can’t capture complex interactions automatically ❌ Performance limited on very complex patterns
Code example:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
import pandas as pd
import numpy as np

# Load outbreak data
data = pd.read_csv('../data/examples/dengue_outbreaks.csv')

# Features and label
X = data[['temperature', 'rainfall', 'prev_cases', 'population_density']]
y = data['outbreak']  # 0 or 1

# Train/test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Train model
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

# Evaluate
y_pred = model.predict(X_test)
print(classification_report(y_test, y_pred))

# Interpret: Look at coefficients (like ORs in epi)
feature_names = X.columns
coefficients = model.coef_[0]
for feature, coef in zip(feature_names, coefficients):
    odds_ratio = np.exp(coef)
    print(f"{feature}: OR = {odds_ratio:.2f}")
Output interpretation:
temperature: OR = 1.15 → Each 1°C increase → 15% higher outbreak odds
rainfall: OR = 1.08 → Each 1mm increase → 8% higher outbreak odds
prev_cases: OR = 1.25 → Each additional case → 25% higher outbreak odds
Tip: Start Here
Logistic regression should be your default starting point for classification problems. Only move to complex models if it doesn’t work well enough.
Many “AI” successes in public health are just well-applied logistic regression with good feature engineering.
4.4.2 2. Decision Trees
The idea: Split the data with a sequence of simple if/then questions (for example, "Is rainfall above 100 mm?"), forming a tree of decisions that ends in a prediction.
Code example:
from sklearn.tree import DecisionTreeClassifier
from sklearn import tree
import matplotlib.pyplot as plt

# Train decision tree
model = DecisionTreeClassifier(max_depth=3, random_state=42)
model.fit(X_train, y_train)

# Visualize the tree
plt.figure(figsize=(20, 10))
tree.plot_tree(model,
               feature_names=X.columns,
               class_names=['No outbreak', 'Outbreak'],
               filled=True,
               fontsize=10)
plt.savefig('../images/examples/decision_tree.png', dpi=300, bbox_inches='tight')
plt.show()

# Make predictions
predictions = model.predict(X_test)
probabilities = model.predict_proba(X_test)
Strengths: ✅ Highly interpretable (you can draw the exact decision process) ✅ Handles non-linear relationships automatically ✅ No need to normalize features ✅ Works with mixed data types (numerical + categorical) ✅ Can capture interactions between features
Weaknesses: ❌ Prone to overfitting (memorizing training data) ❌ Unstable (small data changes → very different trees) ❌ Often less accurate than ensemble methods
Note: Single trees vs. Forests
Single decision trees are rarely used in production because they overfit easily. Instead, we use ensembles of many trees (Random Forests, Gradient Boosting)—see next sections.
But single trees are excellent for exploratory analysis and communicating findings to non-technical stakeholders.
4.4.3 3. Random Forests: Democracy of Trees
The idea: Instead of one decision tree, train hundreds or thousands of trees on slightly different subsets of data. Each tree “votes” on the final prediction. This ensemble method introduced by Breiman (2001) has become one of the most widely used algorithms in machine learning.
Why it works: Individual trees might overfit and make mistakes, but their collective wisdom averages out errors.
Analogy: Instead of asking one doctor, you ask 500 doctors and take the majority opinion.
Code example:
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score, confusion_matrix
import numpy as np

# Train random forest
model = RandomForestClassifier(
    n_estimators=500,        # Number of trees
    max_depth=10,            # Limit tree depth (prevent overfitting)
    min_samples_split=20,    # Require at least 20 samples to split
    random_state=42
)
model.fit(X_train, y_train)

# Predict probabilities (not just 0/1)
y_pred_proba = model.predict_proba(X_test)[:, 1]

# Evaluate with ROC-AUC
auc = roc_auc_score(y_test, y_pred_proba)
print(f"ROC-AUC: {auc:.3f}")

# Feature importance: Which features matter most?
importances = model.feature_importances_
feature_importance_df = pd.DataFrame({
    'feature': X.columns,
    'importance': importances
}).sort_values('importance', ascending=False)

print("\nFeature Importance:")
print(feature_importance_df)

# Plot with handbook styling
import sys
sys.path.append('..')
from styles.plot_config import set_handbook_style, HANDBOOK_COLORS
set_handbook_style()

import seaborn as sns
plt.figure(figsize=(10, 6))
sns.barplot(data=feature_importance_df, x='importance', y='feature',
            color=HANDBOOK_COLORS['primary_blue'])
plt.title('Feature Importance in Outbreak Prediction')
plt.xlabel('Importance Score')
plt.tight_layout()
plt.savefig('../images/examples/feature_importance.png', dpi=300, bbox_inches='tight')
Strengths: ✅ Very accurate on most problems ✅ Handles thousands of features without feature selection ✅ Resistant to overfitting (unlike single trees) ✅ Provides feature importance scores ✅ No need for feature scaling ✅ Works out-of-the-box with minimal tuning
Weaknesses: ❌ Less interpretable than single trees (500 trees is hard to explain) ❌ Slower to train and predict than simple models ❌ Large file sizes (saving 500 trees takes space) ❌ Can struggle with highly imbalanced data
Tip: Practical Advice
Random Forests are often the best first “real” ML model to try after logistic regression. They’re: - Forgiving of mistakes in data preparation - Robust to outliers - Accurate enough for many applications - Easy to use with scikit-learn
For many public health prediction tasks, a well-tuned Random Forest is all you need.
4.4.4 4. Gradient Boosting: The Current Champion
The idea: Build trees sequentially, where each new tree focuses on correcting the mistakes of previous trees. This approach, formalized by Friedman (2001), has become the dominant method for structured data competitions.
Analogy: Instead of 500 doctors voting independently (Random Forest), you have a team of 500 doctors where each successive doctor specifically tries to fix the previous doctor’s misdiagnoses.
Popular implementations: - XGBoost (most popular) - LightGBM (faster, good for large data) - CatBoost (handles categorical features well)
Code example:
from xgboost import XGBClassifier
from sklearn.model_selection import cross_val_score

# Train XGBoost model
model = XGBClassifier(
    n_estimators=500,
    learning_rate=0.05,      # How much each tree contributes
    max_depth=6,
    subsample=0.8,           # Use 80% of data for each tree
    colsample_bytree=0.8,    # Use 80% of features for each tree
    random_state=42
)
model.fit(X_train, y_train)

# Cross-validation for robust performance estimate
cv_scores = cross_val_score(model, X_train, y_train, cv=5, scoring='roc_auc')
print(f"Cross-validated AUC: {cv_scores.mean():.3f} (+/- {cv_scores.std():.3f})")

# Test set performance
y_pred_proba = model.predict_proba(X_test)[:, 1]
test_auc = roc_auc_score(y_test, y_pred_proba)
print(f"Test AUC: {test_auc:.3f}")

# Feature importance (SHAP values for better interpretation)
import shap
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X_test)

# Visualize what drives predictions
shap.summary_plot(shap_values, X_test, feature_names=X.columns)
Strengths: ✅ Often the most accurate “classical” ML algorithm ✅ Wins most Kaggle competitions ✅ Handles missing data automatically ✅ Built-in regularization prevents overfitting ✅ Excellent with tabular data (most public health data)
Weaknesses: ❌ More hyperparameters to tune than Random Forests ❌ Easier to overfit if not careful ❌ Requires more computational resources ❌ Even less interpretable than Random Forests (but SHAP values help)
Important: When to Use Gradient Boosting
Use gradient boosting when: - You’ve tried Random Forests and need better performance - You have sufficient data (>10,000 observations) - Prediction accuracy is critical - You have time/resources for hyperparameter tuning
Stick with Random Forests when: - You want simplicity and fast iteration - Data is limited (<1,000 observations) - Interpretability is paramount
4.4.5 5. Neural Networks and Deep Learning
The idea: Layers of interconnected “neurons” that transform inputs into outputs through learned weights. Neural networks are universal approximators, meaning they can theoretically learn any function given enough data and capacity.
Each connection has a “weight” that’s adjusted during training using backpropagation.
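As a minimal sketch of a small feed-forward network on tabular features, using scikit-learn's MLPClassifier (layer sizes are arbitrary, and X_train/X_test are the splits from the earlier examples):

from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Two hidden layers of "neurons"; weights are adjusted by backpropagation.
# Neural networks are sensitive to feature scale, so scaling is part of the pipeline.
nn_model = make_pipeline(
    StandardScaler(),
    MLPClassifier(hidden_layer_sizes=(32, 16), max_iter=1000, random_state=42)
)
nn_model.fit(X_train, y_train)
print(f"Test accuracy: {nn_model.score(X_test, y_test):.3f}")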
When neural networks shine: - Large datasets (millions of examples) - Complex patterns (images, text, audio) - Representation learning (automatic feature extraction)
When they struggle: - Small datasets (<10,000 examples) - Tabular data (trees often work better, as shown in Shwartz-Ziv & Armon 2022) - Interpretability is required
Warning: Reality Check: Neural Networks for Tabular Data
For most public health tabular data (spreadsheets with rows and columns), gradient boosting methods usually outperform neural networks, as demonstrated in multiple systematic comparisons.
Neural networks excel at: - Medical imaging (chest X-rays, histopathology) - Time series with complex temporal patterns - Natural language processing - Multi-modal data (combining images + text + numbers)
For simple prediction from structured data, stick with tree-based methods first.
4.4.6 Deep Learning Architectures for Public Health
Deep learning = neural networks with many layers (10s to 100s of layers), as comprehensively covered in Goodfellow et al.’s textbook.
4.4.6.1 Convolutional Neural Networks (CNNs)
CNNs learn spatial filters that pick out visual patterns (edges, textures, lesions) directly from pixels, which makes them the standard architecture for medical imaging.
Key concept: Transfer learning. Instead of training from scratch, start with a model pre-trained on millions of images (ImageNet), then fine-tune it on your medical images. This transfer learning approach works even with small datasets (1,000-10,000 images).
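A hedged Keras sketch of this idea (the base model, image size, and added layers are illustrative choices; it assumes TensorFlow is installed and that you supply your own labeled images):

import tensorflow as tf

# Start from a network pre-trained on ImageNet, without its final classification layer
base = tf.keras.applications.EfficientNetB0(
    weights='imagenet', include_top=False, pooling='avg', input_shape=(224, 224, 3)
)
base.trainable = False  # Freeze pre-trained weights; train only the new head

# Add a small classification head for a binary task (e.g., abnormal vs. normal X-ray)
model = tf.keras.Sequential([
    base,
    tf.keras.layers.Dropout(0.2),
    tf.keras.layers.Dense(1, activation='sigmoid'),
])
model.compile(optimizer='adam', loss='binary_crossentropy',
              metrics=[tf.keras.metrics.AUC()])

# model.fit(train_images, train_labels, validation_data=(val_images, val_labels), epochs=5)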
4.4.6.2 Recurrent Neural Networks (RNNs) and LSTMs
RNNs and LSTMs process sequences (for example, weekly case counts) by carrying information forward through time, which suits epidemic forecasting and other temporal prediction tasks.
For most public health applications, fine-tuning existing models is more practical than training from scratch.
4.5 Quick Algorithm Selection Guide
Use this decision tree to choose algorithms:
Do you have labeled data?
└─ No → Unsupervised learning (clustering, anomaly detection)
└─ Yes → Continue...
What type of data?
└─ Images → CNNs (ResNet, EfficientNet)
└─ Text → Transformers (BERT, GPT)
└─ Time series → LSTMs, Transformers, or gradient boosting
└─ Tabular → Continue...
How much data?
└─ < 1,000 rows → Logistic regression, single decision tree
└─ 1,000-10,000 → Random Forest
└─ > 10,000 → Gradient boosting (XGBoost)
└─ > 100,000 → Deep learning (if complex patterns)
Do you need interpretability?
└─ Critical → Logistic regression or single decision tree
└─ Nice → Random Forest + SHAP values
└─ Less important → Gradient boosting or neural networks
4.6 Evaluation Metrics: Measuring Success
You’ve trained a model. How do you know if it’s good?
4.6.1 For Classification (Disease/No Disease, Outbreak/No Outbreak)
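A minimal sketch of the core classification metrics, reusing the outbreak model and test split from earlier in the chapter:

from sklearn.metrics import confusion_matrix, classification_report, roc_auc_score

y_pred = model.predict(X_test)
y_pred_proba = model.predict_proba(X_test)[:, 1]

# Confusion matrix counts: rows = actual, columns = predicted
tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()
print(f"Sensitivity (recall): {tp / (tp + fn):.3f}")  # Of real outbreaks, how many did we catch?
print(f"Specificity:          {tn / (tn + fp):.3f}")  # Of quiet weeks, how many did we correctly clear?
print(f"PPV (precision):      {tp / (tp + fp):.3f}")  # When we raise an alarm, how often is it real?

# Precision, recall, and F1 per class
print(classification_report(y_test, y_pred))

# Threshold-independent discrimination
print(f"ROC-AUC: {roc_auc_score(y_test, y_pred_proba):.3f}")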
To spot overfitting, compare training and test performance:
# Compare training vs. test performance
train_accuracy = model.score(X_train, y_train)
test_accuracy = model.score(X_test, y_test)

print(f"Training accuracy: {train_accuracy:.3f}")
print(f"Test accuracy: {test_accuracy:.3f}")
print(f"Difference: {train_accuracy - test_accuracy:.3f}")
Red flags: - Training accuracy = 0.95, Test accuracy = 0.65 → Overfitting! - Training accuracy = 0.60, Test accuracy = 0.58 → Underfitting (both low) - Training accuracy = 0.82, Test accuracy = 0.80 → Just right (close and good)
4.7.3 How to Prevent Overfitting
More data (the best solution, if possible)
Regularization (penalize model complexity using L1/L2 penalties)
Cross-validation (validate on multiple splits)
Simpler model (fewer features, less depth)
Early stopping (stop training before overfitting starts)
Ensemble methods (average multiple models)
Example with regularization:
from sklearn.linear_model import LogisticRegressionCV

# CV suffix = automatic cross-validation for hyperparameter tuning
model = LogisticRegressionCV(
    Cs=10,              # Try 10 different regularization strengths
    cv=5,               # 5-fold cross-validation
    scoring='roc_auc',
    random_state=42
)
model.fit(X_train, y_train)

# Best regularization parameter found automatically
print(f"Best C (regularization): {model.C_[0]:.4f}")
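Example with early stopping (a sketch using scikit-learn's HistGradientBoostingClassifier, which holds out part of the training data and stops adding trees once the validation score stops improving):

from sklearn.ensemble import HistGradientBoostingClassifier

model = HistGradientBoostingClassifier(
    max_iter=1000,            # Upper bound on boosting iterations
    early_stopping=True,      # Monitor a validation score during training
    validation_fraction=0.1,  # Hold out 10% of training data for that check
    n_iter_no_change=10,      # Stop after 10 iterations without improvement
    random_state=42
)
model.fit(X_train, y_train)
print(f"Boosting iterations actually used: {model.n_iter_}")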
4.8.1 1. Data Leakage in Preprocessing
The problem: Fitting preprocessing steps (such as scaling) on the full dataset lets information from the test set leak into training, which inflates performance estimates.
# BAD: Normalize before splitting
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)  # Uses ALL data including test!
X_train, X_test = train_test_split(X_scaled, test_size=0.2)
Correct approach:
# GOOD: Split first, then normalize
X_train, X_test = train_test_split(X, test_size=0.2)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # Fit on train only
X_test_scaled = scaler.transform(X_test)        # Apply to test
4.8.2 2. Ignoring Class Imbalance
The problem: When one class is rare (e.g., 1% outbreak rate), models can get high accuracy by always predicting the majority class.
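One common fix, which Exercise 2 below also explores: reweight the rare class during training and judge the model with imbalance-aware metrics. A minimal sketch using the earlier train/test split:

from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score, roc_auc_score

# class_weight='balanced' up-weights the rare outbreak class during training
model = RandomForestClassifier(
    n_estimators=500, class_weight='balanced', random_state=42
)
model.fit(X_train, y_train)

y_pred = model.predict(X_test)
y_pred_proba = model.predict_proba(X_test)[:, 1]

# Accuracy is misleading here; report metrics that account for the rare class
print(f"F1 score: {f1_score(y_test, y_pred):.3f}")
print(f"ROC-AUC:  {roc_auc_score(y_test, y_pred_proba):.3f}")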
Key takeaways from this chapter:
Start simple: Logistic regression → Random Forest → Gradient Boosting → Deep Learning. Don't skip steps.
Feature engineering > Algorithm choice: Domain expertise in creating good features matters more than fancy algorithms.
Evaluation is multi-dimensional: No single metric tells the whole story. Report confusion matrix, ROC-AUC, and metrics relevant to your use case.
Overfitting is the enemy: Always validate on held-out test data. Use cross-validation. Be skeptical of “too good” results.
Interpretability matters: In public health, stakeholders need to understand and trust models. Use SHAP values, feature importance, and simple models when possible.
AI ≠ Magic: It’s pattern recognition from data. Garbage in = garbage out. Your domain knowledge is irreplaceable.
Most problems don’t need deep learning: Trees and ensembles work great for tabular data. Save neural networks for images, text, and truly complex patterns.
4.11.1 Exercise 1: Compare Algorithms
Using the outbreak detection code patterns above: 1. Modify features (try removing or adding features) 2. Try different algorithms (logistic regression, decision tree) 3. Compare performance—which works best? Why?
4.11.2 Exercise 2: Handle Imbalance
Create a dataset where outbreaks are only 2% of weeks: 1. Train a model without addressing imbalance 2. Train with class_weight='balanced' 3. Compare using F1 score and ROC-AUC
4.11.3 Exercise 3: Interpret Predictions
Pick one misclassified test example: 1. What features contributed to the wrong prediction? 2. Use SHAP values to visualize 3. Could a human have made the same mistake?
Check Your Understanding
Test your knowledge of machine learning fundamentals and their applications in public health. Each question builds on the key concepts from this chapter.
Question 1
You’re building a model to predict dengue outbreaks using historical data with known outbreak labels (yes/no). You have temperature, rainfall, and previous case counts as features. Which machine learning paradigm should you use?
Unsupervised learning, because weather patterns are complex
Supervised learning, because you have labeled historical outbreak data
Reinforcement learning, because outbreak prediction requires sequential decision-making
Deep learning, because outbreak prediction is a complex task
Correct Answer: b) Supervised learning, because you have labeled historical outbreak data
This is a classic supervised learning problem because:
You have labeled data: Historical weeks with known outbreak outcomes (yes/no)
You have features: Temperature, rainfall, previous case counts
You want to predict a specific outcome: Whether an outbreak will occur
The three fundamental paradigms are distinguished by:
Supervised learning: You have labeled examples (inputs + correct outputs), and you want to predict outcomes for new data
Unsupervised learning: You have unlabeled data and want to discover patterns or structure (clustering, anomaly detection)
Reinforcement learning: You’re optimizing sequential decisions through trial and error with rewards/penalties
Common supervised learning applications in public health include: - Disease diagnosis from symptoms or images - Outbreak prediction from surveillance data - Risk stratification for high-risk patients - Hospital readmission prediction
For this dengue outbreak problem, you’d likely start with logistic regression (binary classification), then try Random Forests or gradient boosting if you need better performance.
Question 2
You’ve trained a model to predict hospital readmissions. Training accuracy is 95%, but test accuracy is only 68%. What is the most likely problem, and what should you do?
Underfitting - use a more complex model with more parameters
Overfitting - the model memorized training data and doesn’t generalize well
Data leakage - information from the test set leaked into training
Class imbalance - readmissions are rare events
Correct Answer: b) Overfitting - the model memorized training data and doesn’t generalize well
The large gap between training (95%) and test (68%) accuracy is a classic sign of overfitting. The model is too complex and has essentially memorized the training data, including its noise and quirks, rather than learning generalizable patterns.
Why overfitting happened: - Model is too complex (too many parameters, too deep trees) - Too little training data for the model’s complexity - Insufficient regularization - Training for too many iterations
How to fix overfitting: 1. Use regularization: Add L1/L2 penalties to constrain model complexity 2. Simplify the model: Use fewer features, shallower trees, or simpler algorithms 3. Get more data: The best solution if possible 4. Use cross-validation: Validate on multiple data splits to catch overfitting early 5. Early stopping: Stop training before the model starts memorizing 6. Ensemble methods: Average multiple models to reduce overfitting
Why other answers are wrong: - Underfitting would show both training and test accuracy being low (e.g., 60% and 58%) - Data leakage would make test performance too good (unrealistically high) - Class imbalance could be a problem but wouldn’t explain the train/test gap
Question 3
For most public health tabular data (structured rows and columns with patient demographics, lab results, and outcomes), which approach typically performs BEST?
Deep neural networks with many hidden layers
Gradient boosting methods (XGBoost, LightGBM) or Random Forests
Simple logistic regression with no feature engineering
Convolutional neural networks (CNNs)
Correct Answer: b) Gradient boosting methods (XGBoost, LightGBM) or Random Forests
For tabular data (spreadsheet-style structured data), tree-based ensemble methods consistently outperform neural networks, as demonstrated in multiple systematic comparisons. Here’s why:
Advantages of gradient boosting/Random Forests for tabular data: - Naturally handle mixed data types (numerical + categorical) - Don’t require feature scaling or normalization - Robust to outliers and missing data - Automatic feature interaction detection - Excellent performance with modest amounts of data (thousands to tens of thousands of rows) - Provide interpretable feature importance scores
When deep learning excels (NOT tabular data): - Images: CNNs for chest X-rays, pathology slides, skin lesions - Text: Transformers/LLMs for clinical notes, literature, patient messages - Complex time series: LSTMs for epidemic forecasting with long-term dependencies - Multimodal data: Combining images + text + structured data
Recommended approach for tabular public health data: 1. Start with logistic regression (interpretable baseline) 2. Try Random Forest (forgiving, works well out-of-the-box) 3. Optimize with XGBoost/LightGBM if you need maximum performance 4. Only use neural networks if tree methods fail and you have massive amounts of data
Most winning Kaggle solutions for tabular data use gradient boosting, not deep learning.
Question 4
You’re evaluating a disease screening tool for a rare condition (1% prevalence). The model achieves 99% accuracy. Your colleague says “This is excellent! Let’s deploy it.” What’s the problem?
There is no problem - 99% accuracy is excellent for any application
Accuracy is misleading with imbalanced data; the model might just predict “no disease” for everyone
99% is too high and suggests the model is overfitting
Screening tools should prioritize specificity over accuracy
Correct Answer: b) Accuracy is misleading with imbalanced data; the model might just predict “no disease” for everyone
This is the class imbalance problem. With 1% disease prevalence, a completely useless model that predicts “no disease” for everyone would achieve 99% accuracy while providing zero clinical value.
Why accuracy fails:
- Accuracy = (TP + TN) / (TP + TN + FP + FN)
- If the model predicts "no disease" for all 1,000 patients:
  - True Negatives (TN): 990 (correctly identified healthy)
  - False Negatives (FN): 10 (missed all disease cases!)
  - Accuracy = 990/1000 = 99%
  - Sensitivity (recall) = 0/10 = 0% (detected zero disease cases!)
What you should check instead: 1. Confusion matrix: See actual TP, TN, FP, FN counts 2. Sensitivity (recall): TP/(TP+FN) - critical for screening (don’t miss disease) 3. Specificity: TN/(TN+FP) - avoid false alarms 4. Positive Predictive Value (PPV): TP/(TP+FP) - if you flag someone, what’s the probability they actually have disease? 5. ROC-AUC: Overall discriminative ability across all thresholds 6. F1 Score: Balances precision and recall
For disease screening, sensitivity (detecting actual cases) is usually prioritized over specificity. You can’t afford to miss cases of a serious disease, even if it means some false positives that require follow-up testing.
How to fix: - Use class_weight='balanced' in your model - Use appropriate evaluation metrics (F1, ROC-AUC, not accuracy) - Consider SMOTE or other resampling techniques
Question 5
A critical concept in machine learning is preventing “data leakage.” Which scenario represents data leakage that would invalidate your model evaluation?
Using domain expertise to engineer new features before splitting train/test sets
Normalizing features using statistics (mean, std) computed on the entire dataset before splitting into train/test sets
Including demographic variables that might correlate with protected classes
Training on data from 2020-2022 and testing on data from 2023
Correct Answer: b) Normalizing features using statistics (mean, std) computed on the entire dataset before splitting into train/test sets
This is a classic data leakage mistake that makes your model appear better than it really is.
Why this is leakage:
WRONG approach:
# Compute statistics using ALL data (including test!)
X_scaled = scaler.fit_transform(X)
X_train, X_test = train_test_split(X_scaled)
The test set’s mean and standard deviation influenced the scaling transformation. The model indirectly “saw” information from the test set during training.
CORRECT approach:
# Split FIRST, then normalize
X_train, X_test = train_test_split(X)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # Fit on train only
X_test_scaled = scaler.transform(X_test)        # Apply to test
Other common data leakage scenarios: - Including future information in features (e.g., using next week’s data to predict this week) - Using the same patients in both train and test sets - Performing feature selection on the full dataset before splitting - Cross-validation on time series without respecting temporal order
Why other answers are NOT leakage: - Feature engineering before splitting: Fine, as long as you don’t use test labels or test-specific statistics - Including demographic variables: An ethical/fairness concern, not leakage - Time-based train/test split: Actually the CORRECT approach for temporal data to avoid leakage!
The cardinal rule: Test data must remain completely untouched until final evaluation. Any processing (scaling, feature selection, etc.) must be fit on training data only.
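A practical way to enforce this rule is to wrap preprocessing and the model in a scikit-learn Pipeline, so the scaler is re-fit on the training portion of every split automatically (a sketch, assuming X and y as before):

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# The scaler is fit only on the training fold inside each CV split,
# so validation/test folds never influence the preprocessing.
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('clf', LogisticRegression(max_iter=1000)),
])

scores = cross_val_score(pipeline, X, y, cv=5, scoring='roc_auc')
print(f"Leak-free CV AUC: {scores.mean():.3f}")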
Question 6
You’re choosing between a Random Forest with 85% accuracy that you can’t easily explain, and logistic regression with 80% accuracy that provides interpretable odds ratios. For predicting which communities should receive limited outbreak response resources, which should you choose and why?
Random Forest - 5% better accuracy could save lives, interpretability doesn’t matter
Logistic regression - interpretability is critical for stakeholder trust and accountability in resource allocation
Deep learning - should always use the most sophisticated method available
Either model is fine since they’re both above 80% accuracy
Correct Answer: b) Logistic regression - interpretability is critical for stakeholder trust and accountability in resource allocation
This question highlights a fundamental tension in applied machine learning: accuracy vs. interpretability. For public health applications, especially those involving resource allocation, interpretability often matters more than marginal accuracy gains.
Why interpretability is critical here:
Stakeholder trust: Public health officials, community leaders, and the public need to understand why certain communities were prioritized. “The black-box algorithm said so” erodes trust.
Accountability: If resources are misallocated, you need to explain what went wrong and fix it. With logistic regression, you can identify which factors (e.g., population density, previous outbreak rates) drove the decision.
Equity concerns: Resource allocation decisions can perpetuate health inequities. Interpretable models let you audit for fairness - are we discriminating based on race, income, or other protected attributes?
Domain validation: Epidemiologists can review odds ratios and say “this makes biological sense” or “this is spurious.” Can’t do that with a Random Forest.
Legal/regulatory: Many jurisdictions require explainable decision-making for consequential decisions affecting people’s lives.
When to prioritize accuracy over interpretability: - Medical imaging diagnosis (radiology AI supporting clinicians) - Internal risk flagging (clinician reviews all flagged cases anyway) - Non-consequential applications (exploratory analysis)
Best practice approach: 1. Start with interpretable model (logistic regression) 2. If accuracy is insufficient for clinical value, try more complex models 3. Use SHAP values or LIME to interpret complex models 4. Consider the context - resource allocation demands more interpretability than internal screening tools
The 5% accuracy gap must be weighed against: - Can you defend decisions to communities? - Can you audit for bias? - Can domain experts validate the logic? - Will stakeholders trust and adopt the system?
For consequential public health decisions, the answer is usually: choose the interpretable model unless the accuracy gap is unacceptably large.
4.12 Discussion Questions
When would you choose logistic regression over a Random Forest, even if the Random Forest has higher accuracy?
Your model has 95% accuracy but only 40% sensitivity. Is this acceptable for an outbreak detection system? Why or why not?
A colleague says “deep learning is always better than traditional ML.” How do you respond?
You train a model to predict hospital readmission with AUC = 0.92 on test data. Should you deploy it immediately? What else should you check?
If forced to choose between a model that’s 90% accurate but uninterpretable vs. 80% accurate but fully interpretable, which would you choose for a public health application? Does the context matter?
You now have the foundation to understand, evaluate, and apply AI in public health contexts. The remaining chapters build on these concepts with domain-specific applications.
Next chapter: The Data Problem - Understanding why public health data is uniquely challenging and how to handle it.