15 Your AI Toolkit
Hands-on learning: This chapter has a companion Jupyter notebook with executable code examples, available for local download: chapter13-toolkit-interactive.ipynb
This chapter guides you in building your AI development environment. You will learn to:
- Select development environments based on project stage (Jupyter, VS Code, cloud)
- Configure essential Python libraries (pandas, scikit-learn, PyTorch)
- Establish reproducible workflows (Git, conda/venv, Docker)
- Navigate the AI tool ecosystem for public health applications
- Set up data management infrastructure (storage, backups, versioning)
- Configure training on local and cloud resources
- Implement experiment tracking platforms (Weights & Biases, MLflow)
- Evaluate epidemiological tools and public health datasets
Prerequisites: Just Enough AI to Be Dangerous, The Data Problem.
15.1 Introduction: Building Your AI Development Environment
15.1.1 The Paradox of Choice
Walk into any data science conference and you’ll hear heated debates: Python vs R. PyTorch vs TensorFlow. VS Code vs Jupyter vs PyCharm. Cloud vs local. Docker vs virtual environments. MLflow vs Weights & Biases.
The truth? Most of these debates don’t matter for getting started.
What matters is having a working environment where you can:
1. Load and explore data
2. Train and evaluate models
3. Reproduce your results
4. Collaborate with others
5. Eventually deploy to production
This chapter cuts through the noise. Instead of surveying every tool in the sprawling AI ecosystem, we focus on battle-tested, production-ready tools that thousands of data scientists use daily to build real systems.
15.1.2 Why Your Toolkit Matters
Bad tools waste time:
- Installing packages breaks your environment
- Code “works on my machine” but not on a colleague’s
- Experiments are lost because you didn’t track parameters
- Results from 3 months ago can’t be reproduced
- Code is scattered across 50 untitled Jupyter notebooks
Good tools compound:
- Reproducible from day 1 (Future You will thank Present You)
- Collaboration becomes easy (Git, shared environments)
- Experiments tracked automatically (never lose a promising model)
- Scale from laptop to cloud seamlessly
- Deploy to production without rewriting code
15.1.3 The AI Tool Ecosystem in 2025
The landscape has matured significantly:
What’s stabilized (safe bets):
- Python won the AI language war (>80% of practitioners)
- Jupyter for exploration, VS Code for development
- PyTorch for deep learning research and increasingly production
- scikit-learn for classical ML
- Git/GitHub for version control
- Docker for reproducibility
What’s still evolving:
- MLOps platforms (many options, no clear winner)
- Cloud providers (all viable, choose based on organizational relationship)
- Deep learning frameworks (PyTorch gaining, TensorFlow holding ground)
- LLM tools (LangChain, LlamaIndex competing, space moving fast)
What’s overhyped:
- No-code AI platforms (limited for serious work)
- Specialized AI hardware for most use cases (GPUs sufficient)
- Proprietary tools locking you into vendors
15.1.4 This Chapter’s Philosophy
1. Open source first
- No vendor lock-in
- Community support
- Free or affordable
- Transparent (see how things work)
2. Python-first (but not Python-only)
- Largest ecosystem
- Best libraries
- Most jobs
- Easiest collaboration
- Note: R is still excellent for statistics and epidemiology; use what fits your team
3. Start simple, add complexity gradually
- Don’t adopt tools before you feel the pain they solve
- Jupyter → VS Code → MLflow → Docker → Cloud
- Each step solves a real problem
4. Production-ready from the start
- Learn tools that scale to deployment
- Avoid “toy” tools requiring a rewrite later
- Think “will this work when my pilot becomes a product?”
15.1.5 Who This Chapter Is For
If you’re starting from zero:
- Follow every step sequentially
- Don’t skip the fundamentals
- Set up everything before Chapter 14 (Building Your First Project)
If you have some experience:
- Skim familiar sections
- Focus on MLOps tools (experiment tracking, reproducibility)
- Check the public health-specific tools section
If you’re experienced:
- Jump to advanced topics (Docker, cloud, MLOps)
- Review public health domain-specific tools
- Use as reference for team standardization
15.1.6 A Note on Updates
The AI tool landscape evolves rapidly. This chapter focuses on foundational tools with multi-year staying power. To stay current, keep in mind:
- Package versions update frequently (always check documentation)
- Cloud services add features continuously
- New tools emerge (evaluate against principles above)
Core principle remains: Master fundamentals before chasing novelty. A data scientist skilled in pandas, scikit-learn, and Git will be productive for years. Someone who chases every new tool will constantly start over.
15.2 What You’ll Build
By working through this chapter, you will set up:
- Complete Python Development Environment - Anaconda/Miniconda, Jupyter, VS Code
- Project Template - Structured directory with configuration files and documentation
- Model Training Pipeline - End-to-end workflow from data to trained model
- MLOps Stack - Experiment tracking, version control, and model registry
- Deployment-Ready Container - Docker configuration for reproducible environments
15.3 1. Introduction: Choosing the Right Tools
15.3.1 The Modern AI Stack
The AI ecosystem is vast and evolving rapidly. This chapter focuses on:
- Open-source tools (accessible, transparent, community-supported)
- Python-first (dominant language for data science and ML)
- Production-ready (tools that scale from prototype to deployment)
- Public health-relevant (applicable to epidemiology and population health)
15.3.2 Tool Categories
┌────────────────────────────────────────────────────────────┐
│ AI Development Stack │
├────────────────────────────────────────────────────────────┤
│ │
│ Development Environment │
│ ├─ IDE/Editor (VS Code, PyCharm, Jupyter) │
│ ├─ Package Manager (conda, pip) │
│ └─ Version Control (Git, GitHub) │
│ │
│ Core Libraries │
│ ├─ Data (pandas, numpy, polars) │
│ ├─ ML (scikit-learn, XGBoost, LightGBM) │
│ ├─ Deep Learning (PyTorch, TensorFlow, JAX) │
│ └─ Visualization (matplotlib, seaborn, plotly) │
│ │
│ MLOps & Workflow │
│ ├─ Experiment Tracking (MLflow, Weights & Biases) │
│ ├─ Data Versioning (DVC, LakeFS) │
│ ├─ Pipeline Orchestration (Airflow, Prefect) │
│ └─ Model Serving (FastAPI, TorchServe, TFServing) │
│ │
│ Infrastructure │
│ ├─ Containerization (Docker, Kubernetes) │
│ ├─ Cloud Platforms (AWS, GCP, Azure) │
│ ├─ Compute (GPUs, TPUs, serverless) │
│ └─ Databases (PostgreSQL, MongoDB, InfluxDB) │
│ │
│ Domain-Specific │
│ ├─ Epidemiology (EpiModel, surveillance-py) │
│ ├─ Genomics (BioPython, scikit-bio) │
│ ├─ GIS/Spatial (GeoPandas, folium) │
│ └─ NLP/LLMs (Hugging Face, spaCy, LangChain) │
│ │
└────────────────────────────────────────────────────────────┘
15.4 2. Setting Up Your Development Environment
15.4.1 Python Distribution: Anaconda vs. Miniconda
Anaconda (Recommended for beginners)
Pros:
- Pre-installed with 250+ popular data science packages
- Graphical installer and Navigator UI
- Includes Jupyter and Spyder (VS Code is no longer bundled and must be installed separately)
- No compilation needed for scientific libraries
Cons:
- Large download (~3 GB)
- Takes significant disk space (~5 GB)
- May include packages you don’t need
Installation:
# Download from https://www.anaconda.com/download
# Follow graphical installer
# Verify installation
conda --version
python --version
Miniconda (Recommended for experienced users)
Pros:
- Minimal installer (~50 MB)
- Install only what you need
- Faster environment creation
- Same conda functionality
Cons:
- Requires manual package installation
- Command-line focused
- More setup steps
Installation:
# Download from https://docs.conda.io/en/latest/miniconda.html
# Linux/macOS
bash Miniconda3-latest-Linux-x86_64.sh
# Windows: Run .exe installer
# Verify
conda --version
15.4.2 Creating Virtual Environments
Why virtual environments?
- Isolate project dependencies
- Avoid version conflicts
- Reproducible environments
- Easy to share and recreate
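Before installing anything, it helps to confirm which environment the interpreter is actually running in. The sketch below uses only the standard library. The helper names `in_virtual_env` and `active_conda_env` are hypothetical, and the conda check assumes that `conda activate` has set the `CONDA_DEFAULT_ENV` environment variable.

```python
import os
import sys
from typing import Optional


def in_virtual_env() -> bool:
    """Return True when running inside an environment created by venv."""
    # venv points sys.prefix at the environment directory, while
    # sys.base_prefix keeps pointing at the base Python installation.
    return sys.prefix != sys.base_prefix


def active_conda_env() -> Optional[str]:
    """Return the active conda environment name, or None if none is set."""
    # Assumption: conda sets CONDA_DEFAULT_ENV when an environment is activated.
    return os.environ.get("CONDA_DEFAULT_ENV")


print(f"Interpreter: {sys.executable}")
print(f"Inside a venv: {in_virtual_env()}")
print(f"Active conda environment: {active_conda_env()}")
```

If both checks come back negative, packages installed with pip will most likely land in the system-wide Python rather than in a project environment.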
Creating environments with conda:
# Create new environment with Python 3.10
conda create -n publichealth-ai python=3.10
# Activate environment
conda activate publichealth-ai
# Install core packages
conda install numpy pandas scikit-learn matplotlib jupyter
# Deactivate when done
conda deactivate
# List all environments
conda env list
# Remove environment
conda env remove -n publichealth-ai
Alternative: venv (Python built-in)
# Create virtual environment
python -m venv venv
# Activate
# Windows
venv\Scripts\activate
# Linux/macOS
source venv/bin/activate
# Install packages
pip install numpy pandas scikit-learn matplotlib jupyter
# Deactivate
deactivate
15.4.3 IDE Options
15.4.3.1 Jupyter Notebook/Lab (Interactive exploration)
Best for:
- Data exploration and visualization
- Iterative analysis
- Teaching and presentations
- Documenting analysis with narrative
Installation:
conda install jupyter jupyterlab
# Launch Jupyter Lab
jupyter lab
# Launch classic Jupyter Notebook
jupyter notebook
Key features:
- Cell-based execution
- Inline plots and visualizations
- Rich markdown support
- Easy sharing (.ipynb files)
Extensions to install:
# Variable inspector (published on PyPI as lckr-jupyterlab-variableinspector)
pip install lckr-jupyterlab-variableinspector
# Table of contents: built into JupyterLab 3.0 and later, so no extra install is needed
# Code formatter
pip install jupyterlab-code-formatter black isort
15.4.3.2 VS Code (General-purpose development)
Best for:
- Writing production code
- Debugging complex issues
- Working with multiple files
- Git integration
- Remote development (SSH, WSL, containers)
Installation: Download from https://code.visualstudio.com/
Essential extensions for AI/ML:
{
    "recommendations": [
        "ms-python.python", // Python IntelliSense
        "ms-toolsai.jupyter", // Jupyter notebooks
        "ms-python.vscode-pylance", // Type checking
        "ms-python.black-formatter", // Code formatting
        "ms-python.debugpy", // Debugging
        "ms-azuretools.vscode-docker", // Docker support
        "github.copilot", // AI code suggestions
        "visualstudioexptteam.vscodeintellicode" // ML-powered suggestions
    ]
}
Useful settings (.vscode/settings.json):
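{
    "python.defaultInterpreterPath": "/path/to/conda/env/python",
    // Formatting is configured per language through the Black formatter extension;
    // the older "python.formatting.*" and "python.linting.*" settings are deprecated,
    // and linting now comes from separate extensions such as ms-python.pylint
    "[python]": {
        "editor.defaultFormatter": "ms-python.black-formatter",
        "editor.formatOnSave": true
    },
    "files.autoSave": "afterDelay",
    "jupyter.askForKernelRestart": false
}
15.4.3.3 PyCharm (Professional Python IDE)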
Best for:
- Large-scale projects
- Professional development
- Advanced debugging
- Refactoring tools
Editions:
- Community - Free, open-source, sufficient for most projects
- Professional - Paid, adds scientific tools, web frameworks, databases
Download: https://www.jetbrains.com/pycharm/
15.4.3.4 Google Colab (Cloud-based, free GPUs)
Best for:
- Learning and experimentation
- GPU access without local hardware
- Quick prototyping
- Sharing notebooks
Features:
- Free GPU/TPU access (limited hours)
- Pre-installed ML libraries
- Google Drive integration
- No setup required
Access: https://colab.research.google.com/
Example: Enabling GPU:
# Check GPU availability
import torch
print(f"GPU available: {torch.cuda.is_available()}")
print(f"GPU name: {torch.cuda.get_device_name(0) if torch.cuda.is_available() else 'None'}")
15.5 3. Essential Python Libraries
15.5.1 Data Manipulation
15.5.1.1 pandas - Data structures and analysis
Installation:
conda install pandas
# or
pip install pandas
Core functionality:
- DataFrames (2D tables)
- Series (1D arrays)
- Reading/writing CSV, Excel, SQL, JSON
- Data cleaning, filtering, aggregation
- Time series analysis
Quick start:
import pandas as pd
# Read data, parsing the date column as datetimes up front
df = pd.read_csv('covid_cases.csv', parse_dates=['date'])
# Basic exploration
print(df.head())
df.info()  # info() prints its summary directly and returns None
print(df.describe())
# Filtering
recent = df[df['date'] > '2023-01-01']
# Grouping
by_region = df.groupby('region')['cases'].sum()
# Time series: weekly totals of the numeric column only
df = df.set_index('date')
weekly = df['cases'].resample('W').sum()
Resources:
- 📚 pandas documentation
- 📄 10 minutes to pandas
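Because `covid_cases.csv` is not supplied with the text, the same grouping and weekly resampling steps can be tried on a small in-memory table. The rows below are made-up toy values, not real surveillance data. The column names `date`, `region`, and `cases` follow the quick start above.

```python
import pandas as pd

# Hypothetical toy records standing in for covid_cases.csv
records = [
    {"date": "2023-01-02", "region": "North", "cases": 10},
    {"date": "2023-01-03", "region": "North", "cases": 12},
    {"date": "2023-01-04", "region": "South", "cases": 7},
    {"date": "2023-01-09", "region": "North", "cases": 20},
    {"date": "2023-01-10", "region": "South", "cases": 5},
]
df = pd.DataFrame(records)
df["date"] = pd.to_datetime(df["date"])

# Total cases per region
by_region = df.groupby("region")["cases"].sum()

# Weekly totals; the 'W' frequency labels each week by its closing Sunday
weekly = df.set_index("date")["cases"].resample("W").sum()

print(by_region)
print(weekly)
```

Selecting the `cases` column before resampling keeps text columns such as `region` out of the sum.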
15.5.1.2 NumPy - Numerical computing
Installation:
conda install numpy
Core functionality:
- Multi-dimensional arrays
- Mathematical operations
- Linear algebra
- Random number generation
- Broadcasting
Quick start:
import numpy as np
# Create arrays
arr = np.array([1, 2, 3, 4, 5])
matrix = np.array([[1, 2], [3, 4]])
# Operations
print(arr.mean(), arr.std())
print(matrix @ matrix.T) # Matrix multiplication
# Random numbers (for simulations)
np.random.seed(42)
samples = np.random.normal(loc=100, scale=15, size=1000)
15.5.1.3 Polars - Fast DataFrame library (Alternative to pandas)
Installation:
pip install polars
Why Polars?
- Often much faster than pandas on large datasets (the speedup depends on the workload)
- Lower memory usage
- Better query optimization
- Lazy evaluation
Quick comparison:
import polars as pl
# Polars syntax
df_polars = pl.read_csv('large_dataset.csv')
result = (
    df_polars
    .filter(pl.col('age') > 18)
    .group_by('region')
    .agg([
        pl.col('cases').sum().alias('total_cases'),
        pl.col('deaths').sum().alias('total_deaths')
    ])
)
# pandas equivalent
import pandas as pd
df_pandas = pd.read_csv('large_dataset.csv')
result = (
    df_pandas[df_pandas['age'] > 18]
    .groupby('region')
    .agg({'cases': 'sum', 'deaths': 'sum'})
)
15.5.2 Machine Learning
15.5.2.1 scikit-learn - Classical ML algorithms
Installation:
conda install scikit-learn
Coverage:
- Classification, regression, clustering
- Preprocessing and feature engineering
- Model selection and evaluation
- Pipeline construction
Core modules:
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score, classification_report
# Typical workflow
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Preprocessing
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
# Training
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train_scaled, y_train)
# Evaluation
y_pred = model.predict(X_test_scaled)
y_pred_proba = model.predict_proba(X_test_scaled)[:, 1]
print(f"AUC-ROC: {roc_auc_score(y_test, y_pred_proba):.3f}")
print(classification_report(y_test, y_pred))
Resources:
- 📚 scikit-learn documentation
- 📄 Choosing the right estimator
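The workflow above imports `cross_val_score` and lists pipeline construction as a core feature, but uses neither. The sketch below combines the two on synthetic data from `make_classification`; the dataset and hyperparameters are illustrative assumptions. Wrapping the scaler and the model in one pipeline means the scaler is refit on each training fold, so no statistics leak from the held-out fold.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic binary classification data (a stand-in for a real dataset)
X, y = make_classification(
    n_samples=300, n_features=10, n_informative=5, random_state=42
)

# Preprocessing and model bundled into a single estimator
pipeline = make_pipeline(
    StandardScaler(),
    RandomForestClassifier(n_estimators=50, random_state=42),
)

# 5-fold cross-validation; each fold fits the scaler on its own training split
scores = cross_val_score(pipeline, X, y, cv=5, scoring="roc_auc")
print(f"AUC-ROC per fold: {scores.round(3)}")
print(f"Mean AUC-ROC: {scores.mean():.3f}")
```

Tree-based models such as random forests do not need scaled inputs, so the scaler here mainly shows the pattern. It matters more for linear models and neural networks.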
15.5.2.2 XGBoost - Gradient boosting
Installation:
conda install -c conda-forge xgboost
# or
pip install xgboost
Why XGBoost?
- State-of-the-art performance on tabular data
- Handles missing values
- Built-in regularization
- Feature importance
Quick start:
import xgboost as xgb
from sklearn.metrics import roc_auc_score
# Create DMatrix (XGBoost's data structure)
dtrain = xgb.DMatrix(X_train, label=y_train)
dtest = xgb.DMatrix(X_test, label=y_test)
# Parameters
params = {
    'objective': 'binary:logistic',
    'eval_metric': 'auc',
    'max_depth': 6,
    'eta': 0.1,
    'subsample': 0.8,
    'colsample_bytree': 0.8
}
# Train
model = xgb.train(
    params,
    dtrain,
    num_boost_round=100,
    evals=[(dtrain, 'train'), (dtest, 'test')],
    early_stopping_rounds=10,
    verbose_eval=10
)
# Predict
y_pred_proba = model.predict(dtest)
print(f"AUC-ROC: {roc_auc_score(y_test, y_pred_proba):.3f}")
15.5.2.3 LightGBM - Fast gradient boosting
Installation:
conda install -c conda-forge lightgbm
Advantages over XGBoost:
- Faster training on large datasets
- Lower memory usage
- Better handling of categorical features
- Leaf-wise tree growth (vs. level-wise)
Quick start:
import lightgbm as lgb
# Create dataset
train_data = lgb.Dataset(X_train, label=y_train)
test_data = lgb.Dataset(X_test, label=y_test, reference=train_data)
# Parameters
params = {
    'objective': 'binary',
    'metric': 'auc',
    'boosting_type': 'gbdt',
    'num_leaves': 31,
    'learning_rate': 0.05
}
# Train
model = lgb.train(
    params,
    train_data,
    num_boost_round=100,
    valid_sets=[test_data],
    callbacks=[lgb.early_stopping(10)]
)
15.5.3 Deep Learning
15.5.3.1 PyTorch - Dynamic neural networks
Installation:
# Install commands change between releases; confirm the current one at https://pytorch.org/get-started/locally/
# CPU only
conda install pytorch torchvision torchaudio cpuonly -c pytorch
# GPU (CUDA 11.8)
conda install pytorch torchvision torchaudio pytorch-cuda=11.8 -c pytorch -c nvidia
Why PyTorch?
- Pythonic, intuitive API
- Dynamic computation graphs
- Strong research community
- Excellent for NLP and computer vision
Simple example:
import torch
import torch.nn as nn
import torch.optim as optim
# Define model
class SimpleNN(nn.Module):
    def __init__(self, input_dim, hidden_dim, output_dim):
        super(SimpleNN, self).__init__()
        self.fc1 = nn.Linear(input_dim, hidden_dim)
        self.relu = nn.ReLU()
        self.fc2 = nn.Linear(hidden_dim, output_dim)
        self.sigmoid = nn.Sigmoid()
    def forward(self, x):
        x = self.fc1(x)
        x = self.relu(x)
        x = self.fc2(x)
        x = self.sigmoid(x)
        return x
# Initialize
model = SimpleNN(input_dim=10, hidden_dim=64, output_dim=1)
criterion = nn.BCELoss()
optimizer = optim.Adam(model.parameters(), lr=0.001)
# Training loop
for epoch in range(100):
    # Forward pass
    outputs = model(X_train_tensor)
    loss = criterion(outputs, y_train_tensor)
    # Backward pass
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    if (epoch + 1) % 10 == 0:
        print(f'Epoch [{epoch+1}/100], Loss: {loss.item():.4f}')
Resources:
- 📚 PyTorch documentation
- 🎓 PyTorch tutorials
15.5.3.2 TensorFlow/Keras - Production-ready deep learning
Installation:
pip install tensorflow
Why TensorFlow?
- Widely deployed in production systems
- TensorFlow Serving for deployment
- TensorFlow Lite for mobile/edge
- Keras API for ease of use
Keras example:
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers
# Define model
model = keras.Sequential([
    layers.Dense(64, activation='relu', input_shape=(10,)),
    layers.Dropout(0.2),
    layers.Dense(32, activation='relu'),
    layers.Dropout(0.2),
    layers.Dense(1, activation='sigmoid')
])
# Compile
model.compile(
    optimizer='adam',
    loss='binary_crossentropy',
    metrics=['accuracy', 'AUC']
)
# Train
history = model.fit(
    X_train, y_train,
    epochs=50,
    batch_size=32,
    validation_split=0.2,
    callbacks=[
        keras.callbacks.EarlyStopping(patience=5, restore_best_weights=True)
    ]
)
# Evaluate
test_loss, test_acc, test_auc = model.evaluate(X_test, y_test)
print(f'Test AUC: {test_auc:.3f}')
15.5.4 Visualization
15.5.4.1 Matplotlib - Publication-quality plots
Installation:
conda install matplotlib
Core plotting:
import matplotlib.pyplot as plt
import numpy as np
# Line plot
plt.figure(figsize=(10, 6))
plt.plot(dates, cases, label='Daily Cases', linewidth=2)
plt.xlabel('Date')
plt.ylabel('Number of Cases')
plt.title('COVID-19 Daily Cases')
plt.legend()
plt.grid(alpha=0.3)
plt.tight_layout()
plt.show()
# Multiple subplots
fig, axes = plt.subplots(2, 2, figsize=(12, 10))
axes[0, 0].plot(data['cases'])
axes[0, 0].set_title('Cases')
axes[0, 1].plot(data['deaths'])
axes[0, 1].set_title('Deaths')
axes[1, 0].scatter(data['age'], data['severity'])
axes[1, 0].set_title('Age vs. Severity')
axes[1, 1].hist(data['recovery_time'], bins=30)
axes[1, 1].set_title('Recovery Time Distribution')
plt.tight_layout()
plt.show()
15.5.4.2 Seaborn - Statistical visualization
Installation:
conda install seaborn
Why Seaborn?
- Built on matplotlib
- Beautiful default styles
- Statistical plots out-of-the-box
- Great for exploratory analysis
Examples:
import matplotlib.pyplot as plt
import seaborn as sns
# Set style
sns.set_style('whitegrid')
sns.set_palette('husl')
# Distribution plot
sns.histplot(data=df, x='age', hue='outcome', kde=True)
plt.title('Age Distribution by Outcome')
plt.show()
# Correlation heatmap (numeric_only skips text columns, which would raise an error in pandas 2.x)
plt.figure(figsize=(10, 8))
sns.heatmap(df.corr(numeric_only=True), annot=True, cmap='coolwarm', center=0)
plt.title('Feature Correlations')
plt.show()
# Pairplot
sns.pairplot(df, hue='disease_status', vars=['age', 'bmi', 'blood_pressure'])
plt.show()
# Box plot
sns.boxplot(data=df, x='region', y='incidence_rate')
plt.xticks(rotation=45)
plt.title('Disease Incidence by Region')
plt.show()
15.5.4.3 Plotly - Interactive visualizations
Installation:
conda install -c plotly plotly
Why Plotly?
- Interactive plots (zoom, pan, hover)
- Web-based (works in Jupyter, dashboards)
- 3D plots
- Maps and geospatial visualization
Examples:
import plotly.express as px
import plotly.graph_objects as go
# Interactive line plot
fig = px.line(df, x='date', y='cases', color='region',
              title='COVID-19 Cases by Region')
fig.update_layout(hovermode='x unified')
fig.show()
# Choropleth map
fig = px.choropleth(df,
                    locations='country_code',
                    color='incidence_rate',
                    hover_name='country',
                    color_continuous_scale='Reds',
                    title='Disease Incidence by Country')
fig.show()
# 3D scatter
fig = px.scatter_3d(df, x='age', y='bmi', z='risk_score',
                    color='outcome', size='severity',
                    title='Risk Factors 3D Visualization')
fig.show()
15.6 4. MLOps and Experiment Tracking
15.6.1 MLflow - Experiment tracking and model registry
Installation:
pip install mlflow
Core features:
- Experiment tracking (parameters, metrics, artifacts)
- Model registry (versioning, staging, production)
- Model serving
- Project packaging
Example workflow:
import matplotlib.pyplot as plt
import mlflow
import mlflow.sklearn
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, roc_auc_score, roc_curve
# Assumes X_train, X_test, y_train, y_test already exist (see the scikit-learn workflow)
# Set experiment
mlflow.set_experiment("disease-prediction")
# Start run
with mlflow.start_run(run_name="random-forest-v1"):
    # Log parameters
    mlflow.log_param("n_estimators", 100)
    mlflow.log_param("max_depth", 10)
    # Train model
    model = RandomForestClassifier(n_estimators=100, max_depth=10)
    model.fit(X_train, y_train)
    # Evaluate
    y_pred_proba = model.predict_proba(X_test)[:, 1]
    auc = roc_auc_score(y_test, y_pred_proba)
    # Log metrics
    mlflow.log_metric("auc", auc)
    mlflow.log_metric("accuracy", accuracy_score(y_test, model.predict(X_test)))
    # Log model
    mlflow.sklearn.log_model(model, "model")
    # Log artifacts: compute the ROC curve, save the plot, attach it to the run
    fpr, tpr, _ = roc_curve(y_test, y_pred_proba)
    plt.figure()
    plt.plot(fpr, tpr)
    plt.savefig("roc_curve.png")
    mlflow.log_artifact("roc_curve.png")
# View UI
# mlflow ui
Resources:
- 📚 MLflow documentation
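To make concrete what an experiment tracker records, here is a deliberately tiny stand-in built on the standard library. It is not how MLflow is implemented. The helpers `log_run` and `best_run`, the JSON Lines file, and the AUC values are all hypothetical and serve only to illustrate the idea of storing parameters next to metrics for every run.

```python
import json
import tempfile
import time
import uuid
from pathlib import Path


def log_run(log_path, params, metrics):
    """Append one run (parameters plus metrics) to a JSON Lines file."""
    record = {
        "run_id": uuid.uuid4().hex[:8],
        "timestamp": time.time(),
        "params": params,
        "metrics": metrics,
    }
    with open(log_path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")
    return record


def best_run(log_path, metric):
    """Return the logged run with the highest value of the given metric."""
    with open(log_path, encoding="utf-8") as f:
        runs = [json.loads(line) for line in f if line.strip()]
    return max(runs, key=lambda run: run["metrics"][metric])


# Demo in a temporary directory; the metric values are made up
with tempfile.TemporaryDirectory() as tmp:
    log_file = Path(tmp) / "runs.jsonl"
    log_run(log_file, {"n_estimators": 100, "max_depth": 10}, {"auc": 0.81})
    log_run(log_file, {"n_estimators": 200, "max_depth": 5}, {"auc": 0.84})
    best = best_run(log_file, "auc")

print(f"Best parameters: {best['params']}")
```

A real tracker adds what this sketch lacks: artifact storage, a UI for comparing runs, and a model registry.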
15.6.2 Weights & Biases - Experiment tracking and collaboration
Installation:
pip install wandb
Why W&B?
- Beautiful visualizations
- Team collaboration features
- Model versioning
- Hyperparameter sweeps
- Free for individuals and academics
Quick start:
import wandb
# Initialize
wandb.init(project="public-health-ai", name="experiment-1")
# Log config
wandb.config.update({
    "learning_rate": 0.001,
    "batch_size": 32,
    "epochs": 50
})
# Training loop
for epoch in range(50):
    # ... training code ...
    # Log metrics
    wandb.log({
        "epoch": epoch,
        "train_loss": train_loss,
        "val_loss": val_loss,
        "val_auc": val_auc
    })
# Log model
wandb.log_model(path="model.pth", name="disease-predictor")
# Finish run
wandb.finish()
15.6.3 DVC - Data version control
Installation:
pip install dvc
Why DVC?
- Version large datasets (like Git for data)
- Track data pipelines
- Remote storage (S3, GCS, Azure, SSH)
- Reproducible experiments
Setup:
# Initialize DVC in Git repo
git init
dvc init
# Add data to DVC
dvc add data/raw/covid_data.csv
# Add to git
git add data/raw/covid_data.csv.dvc .gitignore
git commit -m "Add COVID data"
# Configure remote storage
dvc remote add -d storage s3://mybucket/dvc-store
dvc push
# On another machine
git clone <repo-url>
dvc pull
15.7 5. Domain-Specific Tools
15.7.1 Epidemiology and Public Health
15.7.1.1 EpiEstim - Estimating reproduction number
Installation:
# R package
install.packages("EpiEstim")
Alternative Python: EpiNow2-py
15.7.1.2 Epiweeks - Epidemiological week calculations
Installation:
pip install epiweeks
Usage:
from epiweeks import Week, Year
# Get current epi week
current_week = Week.thisweek()
print(f"Current epi week: {current_week}")
# Convert date to epi week
from datetime import date
week = Week.fromdate(date(2024, 3, 15))
print(f"Epi week for 2024-03-15: {week}")
15.7.1.3 Geospatial Analysis - GeoPandas, Folium
Installation:
conda install geopandas folium
Example:
import geopandas as gpd
import folium
# Read shapefile
gdf = gpd.read_file('countries.shp')
# Merge with disease data
gdf = gdf.merge(disease_data, on='country_code')
# Create interactive map
m = folium.Map(location=[0, 0], zoom_start=2)
folium.Choropleth(
    geo_data=gdf,
    data=gdf,
    columns=['country_code', 'incidence_rate'],
    key_on='feature.properties.country_code',
    fill_color='YlOrRd',
    legend_name='Incidence Rate per 100,000'
).add_to(m)
m.save('disease_map.html')
15.7.2 Natural Language Processing
15.7.2.1 Hugging Face Transformers - Pre-trained language models
Installation:
pip install transformers
Quick start:
from transformers import pipeline
# Sentiment analysis
classifier = pipeline("sentiment-analysis")
result = classifier("The patient reported feeling much better after treatment.")
print(result)
# Named entity recognition (medical entities)
ner = pipeline("ner", model="d4data/biomedical-ner-all")
entities = ner("Patient has diabetes and hypertension.")
print(entities)
# Text generation
generator = pipeline("text-generation", model="gpt2")
text = generator("Symptoms of COVID-19 include", max_length=50)
print(text)
15.7.2.2 spaCy - Industrial-strength NLP
Installation:
pip install spacy
python -m spacy download en_core_web_sm
Medical NLP:
import spacy
# Load model (en_core_web_sm is a general-purpose English model;
# clinical text usually benefits from a domain-specific model such as those in scispaCy)
nlp = spacy.load("en_core_web_sm")
# Process text
text = "Patient presents with fever, cough, and shortness of breath."
doc = nlp(text)
# Extract entities
for ent in doc.ents:
    print(f"{ent.text}: {ent.label_}")
# POS tagging
for token in doc:
    print(f"{token.text}: {token.pos_}")
15.7.3 Genomics and Bioinformatics
15.7.3.1 Biopython - Computational biology
Installation:
conda install -c conda-forge biopython
Quick example:
from Bio import SeqIO
from Bio.Seq import Seq
from Bio import Align
# Read FASTA file
for record in SeqIO.parse("sequences.fasta", "fasta"):
    print(f"ID: {record.id}")
    print(f"Sequence: {record.seq}")
    print(f"Length: {len(record)}")
# Sequence alignment
aligner = Align.PairwiseAligner()
seq1 = Seq("ACCGT")
seq2 = Seq("ACGT")
alignments = aligner.align(seq1, seq2)
print(alignments[0])
15.8 6. Cloud and Compute Resources
15.8.1 Cloud Platforms
15.8.1.1 AWS (Amazon Web Services)
Relevant services:
- EC2 - Virtual machines with GPU support
- SageMaker - Managed ML platform
- S3 - Object storage
- Lambda - Serverless computing
Getting started:
# Install AWS CLI
pip install awscli
# Configure credentials
aws configure
# Launch GPU instance
aws ec2 run-instances --image-id ami-xxx --instance-type p3.2xlarge
15.8.1.2 Google Cloud Platform (GCP)
Relevant services:
- Compute Engine - VMs with TPU support
- Vertex AI - ML platform
- Cloud Storage - Object storage
- BigQuery - Data warehouse
Getting started:
# Install gcloud CLI
# Download from https://cloud.google.com/sdk/docs/install
# Initialize
gcloud init
# Create VM with GPU
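# GPU instances cannot live-migrate, so the maintenance policy must be TERMINATE
gcloud compute instances create ml-instance \
    --machine-type=n1-highmem-8 \
    --accelerator=type=nvidia-tesla-t4,count=1 \
    --maintenance-policy=TERMINATE
15.8.1.3 Microsoft Azure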
Relevant services:
- Azure ML - ML platform
- Azure Databricks - Apache Spark
- Blob Storage - Object storage
15.8.2 Free GPU Resources
15.8.2.1 Google Colab (Free tier)
Pros:
- Free GPU access (T4)
- No setup required
- Pre-installed libraries
Cons:
- Limited to 12-hour sessions
- Can’t customize environment fully
- No persistent storage (use Google Drive)
15.8.2.2 Kaggle Kernels (Free GPU)
Pros:
- Free GPU (P100) for 30 hours/week
- Access to Kaggle datasets
- Community notebooks
Access: https://www.kaggle.com/code
15.9 7. Reproducibility and Collaboration
15.9.1 Version Control with Git
Essential Git commands:
# Initialize repository
git init
# Add files
git add .
git commit -m "Initial commit"
# Create branch
git checkout -b feature/new-model
# Push to remote
git remote add origin https://github.com/username/repo.git
git push -u origin main
# Pull updates
git pull origin main
# Merge branch
git checkout main
git merge feature/new-model
.gitignore for data science:
# Data
data/
*.csv
*.h5
*.pkl
# Models
models/
*.pth
*.h5
*.pkl
# Notebooks
.ipynb_checkpoints/
*-checkpoint.ipynb
# Environment
venv/
.venv/
__pycache__/
*.pyc
# IDE
.vscode/
.idea/
# OS
.DS_Store
Thumbs.db
15.9.2 Containerization with Docker
Why Docker?
- Reproducible environments
- Works anywhere (local, cloud, clusters)
- Version-controlled infrastructure
- Easy collaboration
Example Dockerfile:
FROM python:3.10-slim
WORKDIR /app
# Install system dependencies
RUN apt-get update && apt-get install -y \
    gcc \
    && rm -rf /var/lib/apt/lists/*
# Copy requirements
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
# Copy application
COPY . .
# Run
CMD ["python", "train.py"]
Building and running:
# Build image
docker build -t disease-predictor:latest .
# Run container
docker run -v $(pwd)/data:/app/data disease-predictor:latest
# Interactive shell
docker run -it disease-predictor:latest /bin/bash
15.9.3 Project Structure
Recommended directory layout:
project-name/
├── README.md                 # Project overview
├── requirements.txt          # Python dependencies
├── environment.yml           # Conda environment
├── .gitignore                # Git ignore rules
├── Dockerfile                # Container definition
├── setup.py                  # Package installation
│
├── data/
│   ├── raw/                  # Original, immutable data
│   ├── processed/            # Cleaned, transformed data
│   └── external/             # Third-party data
│
├── notebooks/                # Jupyter notebooks
│   ├── 01-exploration.ipynb
│   ├── 02-preprocessing.ipynb
│   └── 03-modeling.ipynb
│
├── src/                      # Source code
│   ├── __init__.py
│   ├── data/                 # Data processing
│   │   ├── __init__.py
│   │   └── make_dataset.py
│   ├── features/             # Feature engineering
│   │   ├── __init__.py
│   │   └── build_features.py
│   ├── models/               # Model definitions
│   │   ├── __init__.py
│   │   ├── train.py
│   │   └── predict.py
│   └── visualization/        # Plotting code
│       ├── __init__.py
│       └── visualize.py
│
├── models/                   # Trained models
│   └── .gitkeep
│
├── reports/                  # Analysis reports
│   ├── figures/              # Plots and images
│   └── results.md
│
└── tests/                    # Unit tests
    ├── __init__.py
    └── test_features.py
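Creating this layout by hand for every project gets tedious, so a small script can scaffold it. The sketch below uses only `pathlib`. The function name `scaffold_project` is hypothetical, and it creates empty placeholder files for only part of the layout (the package `__init__.py` files and top-level configuration files), not the notebooks or modules. The demo runs in a temporary directory so nothing is left behind.

```python
import tempfile
from pathlib import Path

# Directories taken from the recommended layout
DIRECTORIES = [
    "data/raw", "data/processed", "data/external",
    "notebooks",
    "src/data", "src/features", "src/models", "src/visualization",
    "models", "reports/figures", "tests",
]

# Empty placeholder files; fill these in as the project grows
FILES = [
    "README.md", "requirements.txt", "environment.yml", ".gitignore",
    "src/__init__.py", "src/data/__init__.py", "src/features/__init__.py",
    "src/models/__init__.py", "src/visualization/__init__.py",
    "models/.gitkeep", "tests/__init__.py",
]


def scaffold_project(root):
    """Create the directory skeleton under root and return its path."""
    root = Path(root)
    for directory in DIRECTORIES:
        (root / directory).mkdir(parents=True, exist_ok=True)
    for file in FILES:
        (root / file).touch(exist_ok=True)
    return root


# Demo: build the skeleton in a temporary directory and list what was created
with tempfile.TemporaryDirectory() as tmp:
    project = scaffold_project(Path(tmp) / "project-name")
    created = sorted(p.relative_to(project).as_posix() for p in project.rglob("*"))

print(f"Created {len(created)} entries, for example: {created[:3]}")
```

Because both `mkdir` and `touch` are called with `exist_ok=True`, rerunning the script on an existing project adds only the missing pieces.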
15.10 8. Pre-trained Models and Datasets
15.10.1 Model Hubs
15.10.1.1 Hugging Face Hub
Access thousands of models:
from transformers import AutoModel, AutoTokenizer
# Load pre-trained model
model_name = "microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract"
model = AutoModel.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)
# Use for downstream task
text = "Patient diagnosed with pneumonia."
inputs = tokenizer(text, return_tensors="pt")
outputs = model(**inputs)
Browse models: https://huggingface.co/models
15.10.1.2 PyTorch Hub
Access:
import torch
# Load pre-trained model (recent torchvision releases use `weights`; `pretrained=True` is deprecated)
model = torch.hub.load('pytorch/vision', 'resnet50', weights='IMAGENET1K_V2')
model.eval()
15.10.2 Public Health Datasets
15.10.2.1 Johns Hopkins COVID-19 Data
import pandas as pd
# Note: JHU CSSE stopped updating this repository on March 10, 2023; the archived data remains available
url = "https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_confirmed_global.csv"
covid_data = pd.read_csv(url)
15.10.2.2 WHO Global Health Observatory
Access: https://www.who.int/data/gho
15.10.2.3 CDC Data
WONDER Database: https://wonder.cdc.gov/
15.10.2.4 Kaggle Datasets
Public health datasets:
- COVID-19 forecasting
- Disease surveillance
- Health indicators
Access: https://www.kaggle.com/datasets
15.11 9. Building Your First Pipeline
15.11.1 Complete Example: Disease Prediction Pipeline
1. Project setup:
# Create project
mkdir disease-predictor
cd disease-predictor
# Create environment
conda create -n disease-pred python=3.10
conda activate disease-pred
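# Install packages (mlflow is distributed through the conda-forge channel)
conda install -c conda-forge pandas numpy scikit-learn matplotlib jupyter mlflow
# Initialize git
git init
# printf interprets \n portably; plain echo does not in every shell
printf "data/\nmodels/\n*.pyc\n" > .gitignore
git add .gitignore
git commit -m "Initial commit"
2. Data loading (src/data/load_data.py):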
import pandas as pd
from sklearn.model_selection import train_test_split
def load_and_split_data(filepath, test_size=0.2, random_state=42):
    """Load data and split into train/test sets"""
    # Load data
    df = pd.read_csv(filepath)
    # Separate features and target
    X = df.drop('outcome', axis=1)
    y = df['outcome']
    # Split
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=test_size, random_state=random_state, stratify=y
    )
    return X_train, X_test, y_train, y_test
if __name__ == "__main__":
    X_train, X_test, y_train, y_test = load_and_split_data('data/raw/disease_data.csv')
    print(f"Training samples: {len(X_train)}")
    print(f"Test samples: {len(X_test)}")
3. Feature engineering (src/features/build_features.py):
from sklearn.preprocessing import StandardScaler
import pandas as pd
def build_features(X_train, X_test):
    """Engineer features and scale data"""
    # Work on copies so the caller's DataFrames are not modified
    X_train, X_test = X_train.copy(), X_test.copy()
    # Create age groups (include_lowest keeps age 0; ages above 100 become NaN)
    age_bins = [0, 18, 65, 100]
    age_labels = ['child', 'adult', 'senior']
    X_train['age_group'] = pd.cut(X_train['age'], bins=age_bins, labels=age_labels, include_lowest=True)
    X_test['age_group'] = pd.cut(X_test['age'], bins=age_bins, labels=age_labels, include_lowest=True)
    # One-hot encode categorical
    X_train = pd.get_dummies(X_train, columns=['age_group', 'region'])
    X_test = pd.get_dummies(X_test, columns=['age_group', 'region'])
    # Align columns to the training set (categories unseen in training are dropped)
    X_train, X_test = X_train.align(X_test, join='left', axis=1, fill_value=0)
    # Scale numerical features (fit on training data only to avoid leakage)
    scaler = StandardScaler()
    numerical_cols = ['age', 'bmi', 'blood_pressure']
    X_train[numerical_cols] = scaler.fit_transform(X_train[numerical_cols])
    X_test[numerical_cols] = scaler.transform(X_test[numerical_cols])
    return X_train, X_test, scaler
4. Model training (src/models/train.py):
import mlflow
import mlflow.sklearn
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score, accuracy_score, classification_report
def train_model(X_train, y_train, X_test, y_test):
    """Train and log model with MLflow"""
    mlflow.set_experiment("disease-prediction")
    with mlflow.start_run():
        # Parameters
        params = {
            'n_estimators': 100,
            'max_depth': 10,
            'min_samples_split': 5,
            'random_state': 42
        }
        mlflow.log_params(params)
        # Train
        model = RandomForestClassifier(**params)
        model.fit(X_train, y_train)
        # Evaluate
        y_pred = model.predict(X_test)
        y_pred_proba = model.predict_proba(X_test)[:, 1]
        auc = roc_auc_score(y_test, y_pred_proba)
        acc = accuracy_score(y_test, y_pred)
        mlflow.log_metrics({
            'auc': auc,
            'accuracy': acc
        })
        # Log model
        mlflow.sklearn.log_model(model, "model")
    print(f"AUC: {auc:.3f}, Accuracy: {acc:.3f}")
    print(classification_report(y_test, y_pred))
    return model
if __name__ == "__main__":
    from src.data.load_data import load_and_split_data
    from src.features.build_features import build_features
    # Load data
    X_train, X_test, y_train, y_test = load_and_split_data('data/raw/disease_data.csv')
    # Build features
    X_train, X_test, scaler = build_features(X_train, X_test)
    # Train model
    model = train_model(X_train, y_train, X_test, y_test)
5. Run pipeline:
# Train model (run as a module from the project root so the src imports resolve)
python -m src.models.train
# View MLflow UI
mlflow ui
# Open browser: http://localhost:5000
15.12 10. Key Takeaways
Development Environment:
- Python distribution: Anaconda or Miniconda
- IDE: Jupyter Lab for exploration, VS Code for production code
- Virtual environments: conda or venv for dependency isolation
Core Libraries:
- Data: pandas (tabular), NumPy (arrays), Polars (fast alternative)
- ML: scikit-learn (classical), XGBoost/LightGBM (boosting)
- Deep Learning: PyTorch (research), TensorFlow/Keras (production)
- Visualization: Matplotlib (static), Seaborn (statistical), Plotly (interactive)
MLOps:
- Experiment tracking: MLflow or Weights & Biases
- Data versioning: DVC
- Containerization: Docker
Best Practices:
- Use virtual environments for every project
- Version control code with Git
- Track experiments systematically
- Structure projects consistently
- Document dependencies (requirements.txt, environment.yml)
- Containerize for reproducibility
15.13 Hands-On Exercise
15.13.1 Exercise: Set Up Your AI Development Environment
Objective: Build a complete, reproducible development environment for AI/ML projects.
Tasks:
1. Install Miniconda
- Download and install from official website
- Verify installation with conda --version
2. Create project environment
conda create -n my-ml-project python=3.10
conda activate my-ml-project
conda install pandas numpy scikit-learn matplotlib jupyter mlflow
3. Set up VS Code
- Install VS Code
- Install Python and Jupyter extensions
- Configure to use your conda environment
4. Create project structure
mkdir my-ml-project
cd my-ml-project
mkdir data models notebooks src tests
touch README.md requirements.txt .gitignore
5. Initialize Git repository
git init
git add .
git commit -m "Initial project structure"
6. Create simple ML pipeline
- Load sample dataset (e.g., from scikit-learn)
- Train simple model
- Log experiment with MLflow
- Save model
7. Export environment
conda env export > environment.yml
pip freeze > requirements.txt
8. Create Dockerfile
- Write Dockerfile for your project
- Build image
- Test running training script in container
Bonus:
- Set up Weights & Biases account and log experiment
- Create GitHub repository and push code
- Set up DVC for data versioning
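For the "Create simple ML pipeline" task, one possible starting point is sketched below. It uses scikit-learn's built-in breast cancer dataset and saves the model with joblib. The MLflow logging step is deliberately left out so the sketch runs without a tracking setup, and the `train_and_save` helper and the file name model.joblib are arbitrary choices made for illustration.

```python
from pathlib import Path
import tempfile

import joblib
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

def train_and_save(model_dir: Path, random_state: int = 42) -> float:
    """Train a small classifier, save it under model_dir, return the test AUC."""
    X, y = load_breast_cancer(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=random_state, stratify=y
    )
    model = RandomForestClassifier(n_estimators=100, random_state=random_state)
    model.fit(X_train, y_train)
    auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
    model_dir.mkdir(parents=True, exist_ok=True)
    joblib.dump(model, model_dir / "model.joblib")
    return auc

# A temporary directory keeps the sketch from leaving files behind
with tempfile.TemporaryDirectory() as tmp:
    auc = train_and_save(Path(tmp) / "models")
    reloaded = joblib.load(Path(tmp) / "models" / "model.joblib")
    print(f"Test AUC: {auc:.3f}")
```

In your own project you would point `model_dir` at the models directory created earlier and add the MLflow calls around the training step.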
Check Your Understanding
Test your knowledge of AI development tools and best practices. Each question builds on the key concepts from this chapter.
A data scientist starts a new project analyzing hospital readmission data. They install packages globally on their system using pip install without creating a virtual environment. Six months later, they need to share the project with a colleague, who reports numerous package version conflicts and cannot run the code. What is the PRIMARY lesson this scenario illustrates about development environments?
a) pip is unreliable and conda should always be used instead
b) Virtual environments are essential for dependency isolation, reproducibility, and collaboration
c) Python packages change too rapidly to maintain long-term projects
d) Sharing code is inherently difficult and requires containerization from day one
Correct Answer: b) Virtual environments are essential for dependency isolation, reproducibility, and collaboration
This scenario illustrates a fundamental best practice in software development: dependency isolation through virtual environments. The chapter emphasizes this throughout the environment setup section, listing virtual environments as critical for avoiding version conflicts, ensuring reproducible environments, and enabling easy sharing.
The problem: Installing packages globally causes several issues:
- Dependency conflicts: Different projects need different versions of the same package (project A needs pandas 1.3, project B needs pandas 2.0)
- System pollution: Global installation affects all Python projects on the system
- Irreproducibility: The colleague doesn’t know which package versions were used
- Fragility: Upgrading a package for one project breaks another project
The chapter explains that virtual environments solve these problems by creating isolated Python environments per project, each with its own package versions. Because the data scientist never tracked versions, they cannot tell the colleague “install these exact versions” when the time comes to share.
Correct approach:
# Create isolated environment
conda create -n readmission-analysis python=3.10
conda activate readmission-analysis
# Install packages
conda install pandas numpy scikit-learn
# Export for sharing
conda env export > environment.yml
pip freeze > requirements.txt
Now colleagues can recreate the exact environment:
conda env create -f environment.yml
Option (a) misses the point: both pip and conda support virtual environments (venv/virtualenv for pip, conda environments for conda). The tool isn’t the issue; the practice of isolation is. Option (c) is defeatist. Yes, packages evolve, which is precisely why version pinning and environment management are necessary. The solution isn’t to avoid Python but to use proper tooling. Option (d) overstates the requirement. Docker is valuable (discussed in the chapter), but virtual environments plus requirements files handle most sharing scenarios. Jumping to containers “from day one” for every project is overkill.
The chapter’s section on virtual environments lists clear benefits:
- Isolate project dependencies
- Avoid version conflicts
- Reproducible environments
- Easy to share and recreate
Real-world implications: This scenario mirrors the famous “works on my machine” problem that plagues software development. The healthcare AI context makes it worse—if the model can’t be reproduced, research findings can’t be validated, clinical deployments become risky, and regulatory compliance (FDA requirements for reproducibility) may be violated.
The chapter provides specific commands for both conda and venv, emphasizing that the choice of tool matters less than consistent use of isolation. For public health practitioners: start every project with environment creation, export dependencies regularly (don’t wait until sharing time), document environment setup in README, and consider it part of “scientific method” for computational work—others must be able to reproduce your results.
The broader principle: computational reproducibility requires infrastructure. Just as lab scientists document reagent lot numbers and equipment settings, data scientists must document software environments. Virtual environments are the foundational tool for this documentation.
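As a small illustration of documenting the software environment, the sketch below records the Python version, platform, and installed versions of a few packages using only the standard library. The `snapshot_environment` helper and the package list are hypothetical. It complements `conda env export` or `pip freeze` (for example, as a record saved next to each result) and does not replace them.

```python
import json
import platform
import sys
from importlib import metadata

def snapshot_environment(packages):
    """Return Python/OS details plus installed versions of the named packages."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None  # not installed in this environment
    return {
        "python": sys.version.split()[0],
        "platform": platform.platform(),
        "packages": versions,
    }

snapshot = snapshot_environment(["pandas", "numpy", "scikit-learn"])
print(json.dumps(snapshot, indent=2))
```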
A research team is building a COVID-19 forecasting model. They need to decide between using pandas (which they know well) versus Polars (which they’ve heard is much faster). Their current dataset is 5 million rows and model training takes 30 minutes with pandas. Which factors should MOST heavily influence their decision?
a) Always choose Polars because speed improvements are always valuable
b) Evaluate the actual bottleneck: if data processing is <5% of runtime, pandas is fine; if it’s >50%, consider Polars; also weigh team familiarity and deadline pressure
c) Stick with pandas because learning new tools wastes time that could be spent on modeling
d) Use both—pandas for development and Polars for production
Correct Answer: b) Evaluate the actual bottleneck: if data processing is <5% of runtime, pandas is fine; if it’s >50%, consider Polars; also weigh team familiarity and deadline pressure
This question tests understanding of pragmatic tool selection—a key theme throughout the chapter. The scenario requires evaluating trade-offs between performance, learning curve, team capability, and project constraints rather than defaulting to “fastest tool wins” or “familiar tool wins.”
The chapter discusses Polars as “10-100x faster than pandas for large datasets” but presents it as an alternative, not a replacement for all cases. The key is understanding when speed matters enough to justify switching costs.
Decision framework:
1. Profile to find bottlenecks: Suppose a full run takes 30 minutes. Two possible breakdowns lead to different decisions:
- Data loading + preprocessing 2 minutes (7%), model training 28 minutes (93%) → pandas is fine; optimize the model, not the data pipeline
- Data loading + preprocessing 20 minutes (67%), model training 10 minutes (33%) → Polars might help significantly
The chapter’s philosophy emphasizes: optimize what matters. Premature optimization wastes time.
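One lightweight way to get such a breakdown is to time each stage and report its share of the total. The sketch below uses only the standard library. The `timed` context manager, the `runtime_shares` helper, and the two stand-in stages are illustrative assumptions, not part of any profiling library.

```python
import time
from contextlib import contextmanager

timings = {}

@contextmanager
def timed(stage):
    """Add the wall-clock seconds spent inside the with-block to timings[stage]."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[stage] = timings.get(stage, 0.0) + time.perf_counter() - start

def runtime_shares(stage_seconds):
    """Convert seconds per stage into fractions of the total runtime."""
    total = sum(stage_seconds.values())
    return {stage: seconds / total for stage, seconds in stage_seconds.items()}

# Stand-ins for real pipeline stages
with timed("data_prep"):
    rows = [i * 2 for i in range(200_000)]
with timed("training"):
    checksum = sum(x * x for x in rows)

for stage, share in runtime_shares(timings).items():
    print(f"{stage}: {share:.0%} of runtime")
```

For finer detail than per-stage totals, Python's built-in cProfile module reports time per function.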
2. Calculate switching costs:
- Learning curve: How long to become proficient with Polars?
- Code rewrite: How much existing code needs translation?
- Testing: How much validation is needed after switching?
- Documentation: Does switching affect reproducibility or team knowledge transfer?
3. Consider team and project context:
- Team familiarity: If everyone knows pandas, collective productivity matters more than individual script speed
- Deadline pressure: If launch is imminent, don’t introduce new tools mid-project
- Long-term maintenance: If Polars reduces training time from 30 min to 3 min and you’ll run thousands of experiments, the investment pays off
4. Evaluate alternative optimizations: Before switching tools, consider:
- Optimize pandas code (vectorization, efficient data types, chunking)
- Sample data for development, full data for final training
- Parallel processing with Dask (pandas-compatible API)
- Cloud resources for faster compute
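As an example of the "efficient data types" option, the sketch below downcasts integer columns and converts low-cardinality text columns to categoricals, then compares memory use. The `shrink` helper, the 50% uniqueness threshold, and the synthetic data are assumptions chosen for illustration. Actual savings depend on the data.

```python
import numpy as np
import pandas as pd

def shrink(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy with smaller dtypes where that loses no information."""
    out = df.copy()
    for col in out.columns:
        if pd.api.types.is_integer_dtype(out[col]):
            # Pick the smallest integer type that holds every value
            out[col] = pd.to_numeric(out[col], downcast="integer")
        elif pd.api.types.is_string_dtype(out[col]) and out[col].nunique() < 0.5 * len(out):
            # Repeated text values are stored once each as categories
            out[col] = out[col].astype("category")
    return out

rng = np.random.default_rng(0)
n = 100_000
df = pd.DataFrame({
    "age": rng.integers(0, 100, size=n),       # stored as int64 by default
    "region": rng.choice(["north", "south", "east", "west"], size=n),
    "cases": rng.integers(0, 5000, size=n),
})
small = shrink(df)
before = df.memory_usage(deep=True).sum()
after = small.memory_usage(deep=True).sum()
print(f"{before / 1e6:.1f} MB -> {after / 1e6:.1f} MB")
```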
Option (a)—“always choose faster”—ignores switching costs and assumes speed is the only bottleneck. This violates the chapter’s emphasis on pragmatic tool selection. The chapter presents multiple tools precisely because different situations call for different solutions. Option (c)—“never learn new tools”—is the opposite error. Technical debt accumulates when teams refuse to adopt better tools. If Polars genuinely solves a major bottleneck and the project has longevity, learning it is worthwhile. However, context matters. Option (d)—“use both”—introduces unnecessary complexity. Maintaining two implementations (one for dev, one for prod) creates version skew risks, doubles testing burden, and complicates debugging. The chapter’s emphasis on reproducibility argues against dev/prod environment divergence.
The chapter’s guidance on Polars:
- When to consider: Large datasets (>1GB), complex aggregations, memory constraints
- When to skip: Small datasets, simple operations, team unfamiliar and no time to learn
- Alternative: Polars has pandas-like API, lowering learning curve
Real-world considerations for public health AI:
- Regulatory: If the model is for FDA-cleared device, environment consistency matters—don’t switch tools casually
- Collaboration: If working with epidemiologists unfamiliar with data science, pandas’ widespread adoption and documentation base creates lower barriers
- Iteration speed: In outbreak response, getting a working model fast may matter more than optimal performance
The chapter’s broader theme is pragmatic tool selection: choose tools that match your problem, team, and constraints. The AI ecosystem offers hundreds of options precisely because no single tool is best for all situations. The chapter provides guidance on multiple alternatives (pandas vs. Polars, PyTorch vs. TensorFlow, Jupyter vs. VS Code) to equip readers for context-appropriate decisions.
For public health practitioners: resist both technology dogmatism (“always use X”) and technology conservatism (“never change from Y”). Profile your code to find actual bottlenecks, evaluate the cost-benefit of switching tools, involve the team in decisions, and make trade-offs explicit. Sometimes the best choice is the tool your team already knows; sometimes it’s worth investing in something better. Data and context should drive the decision.
A hospital AI team has been using Jupyter notebooks for all development, from initial exploration through model deployment. They experience several problems: notebooks with 200+ cells becoming unwieldy, difficulty tracking which cells were run in what order, merge conflicts when multiple team members edit notebooks, and challenges deploying notebook code to production. What does this scenario suggest about development tool selection?
a) Jupyter notebooks are inappropriate for ML development and should be avoided
b) The team needs better notebook organization through naming conventions and documentation
c) Different development stages require different tools: notebooks for exploration, Python scripts/IDEs for production code, with clear transitions between phases
d) The team should switch entirely to VS Code for all development activities
Correct Answer: c) Different development stages require different tools: notebooks for exploration, Python scripts/IDEs for production code, with clear transitions between phases
This question tests understanding of the chapter’s nuanced guidance on IDE selection and workflow design. The chapter doesn’t advocate for one tool over another universally but rather explains each tool’s strengths and appropriate use cases.
The chapter’s tool guidance:
Jupyter Notebooks:
- Best for: Data exploration, iterative analysis, teaching, documenting analysis with narrative
- Strengths: Cell-based execution, inline visualizations, rich markdown, easy sharing
- Limitations: Not mentioned explicitly but implied by VS Code’s positioning
VS Code:
- Best for: Production code, debugging complex issues, multiple files, Git integration
- Strengths: Refactoring, testing, deployment, version control
The scenario describes classic problems that arise when using exploration tools for production workflows:
1. Unwieldy 200+ cell notebooks: Notebooks become unmaintainable at scale. They lack the modular structure of properly organized Python packages with functions, classes, and modules. The chapter’s project structure section shows how production code should be organized:
src/
├── data/make_dataset.py
├── features/build_features.py
├── models/train.py
└── visualization/visualize.py
This modular structure enables:
- Testing individual components
- Reusing code across projects
- Clear dependencies and interfaces
- Team division of labor
2. Execution order confusion: Notebooks allow non-linear execution. Cell 5 might depend on Cell 10, but this dependency is invisible. Production scripts have clear top-to-bottom execution, making behavior predictable. The chapter’s example training script (train.py) demonstrates linear, predictable flow.
3. Git merge conflicts: Notebooks are JSON files with embedded metadata (execution counts, outputs, cell IDs). When two people edit a notebook, Git struggles to merge. Python scripts are plain text, merging cleanly. The chapter’s emphasis on Git for version control implicitly assumes code structured for version control.
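To make the merge problem concrete, the sketch below removes the parts of a notebook that change on every run (outputs and execution counts) while leaving the source untouched. It assumes the nbformat 4 JSON layout, where each code cell carries `outputs` and `execution_count` fields. The `strip_outputs` helper is hypothetical. Dedicated tools such as nbstripout do this more thoroughly and can run automatically before commits.

```python
import copy
import json

def strip_outputs(notebook: dict) -> dict:
    """Return a copy of an nbformat-4 notebook without outputs or execution counts."""
    clean = copy.deepcopy(notebook)
    for cell in clean.get("cells", []):
        if cell.get("cell_type") == "code":
            cell["outputs"] = []
            cell["execution_count"] = None
    return clean

# Minimal in-memory notebook; a real one would come from json.load on an .ipynb file
nb = {
    "nbformat": 4, "nbformat_minor": 5, "metadata": {},
    "cells": [
        {"cell_type": "markdown", "metadata": {}, "source": ["# Exploration"]},
        {"cell_type": "code", "metadata": {}, "execution_count": 7,
         "source": ["print(1 + 1)"],
         "outputs": [{"output_type": "stream", "name": "stdout", "text": ["2\n"]}]},
    ],
}
clean = strip_outputs(nb)
print(json.dumps(clean["cells"][1], indent=1))
```

Two people who run the same notebook then commit identical JSON unless they actually changed a cell's source.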
4. Deployment challenges: Deploying a notebook to production is awkward. You need to strip interactive elements, convert to script, handle cell dependencies, and remove exploration code. Starting with production-structured code avoids this conversion.
The right workflow (implied by the chapter):
Phase 1: Exploration (Jupyter)
- Load data, visualize distributions
- Try different features, models
- Document findings with markdown
- Iterate quickly
Phase 2: Production (VS Code + scripts)
- Extract working code from notebooks
- Refactor into functions and modules
- Add tests and documentation
- Structure according to chapter’s project template
- Commit to Git with clean history
Phase 3: Deployment
- Production scripts are deployment-ready
- Clear entry points (train.py, predict.py)
- Containerization (Dockerfile)
- CI/CD integration
Option (a) dismisses notebooks entirely, contradicting the chapter’s guidance that notebooks excel for exploration. Many critical data science insights come from exploratory work best done interactively. Option (b) treats symptoms rather than causes. Better organization helps, but fundamental limitations remain—notebooks aren’t designed for production code, and trying to force that use case creates friction. Option (d) makes the opposite error of (a)—VS Code isn’t ideal for initial exploration. The chapter presents VS Code as powerful for production development, not for replacing notebooks’ exploratory strengths.
The chapter’s project structure provides the solution:
- notebooks/ directory for exploration (01-exploration.ipynb, 02-preprocessing.ipynb)
- src/ directory for production code (modular Python scripts)
- Clear workflow: explore in notebooks, productionize in src/
Real-world implications for public health AI:
Regulatory compliance: FDA-cleared medical devices require validated, version-controlled code. Notebooks don’t meet these standards; properly structured Python packages do.
Reproducibility: Research publications require reproducible methods. The chapter emphasizes containers and dependency management, which work much better with scripts than notebooks.
Team collaboration: Hospital AI teams include epidemiologists, clinicians, data scientists, ML engineers. Notebooks work for sharing analysis with stakeholders; scripts work for engineers building production systems.
Operational reliability: When the model runs in production serving patient predictions, it must be reliable, tested, monitored. The chapter’s MLOps section discusses logging, error handling, monitoring—all easier with production-structured code.
The chapter’s toolkit philosophy is use the right tool for the job. Notebooks and scripts aren’t competitors but collaborators in a complete workflow. Mature ML practice involves knowing when to use each and how to transition between them. The team’s problems stem not from choosing notebooks but from failing to transition to production-appropriate tools when development matured beyond exploration.
For public health practitioners: embrace notebooks for exploration and communication, transition to scripts for production, maintain both in your project (chapter’s directory structure includes both notebooks/ and src/), document the workflow, and train team members on when to use each tool. The chapter provides all these pieces; the practitioner’s job is assembling them into appropriate workflows.
A team building a disease outbreak prediction model tracks experiments inconsistently: some parameters are in Excel spreadsheets, some in paper notebooks, some in code comments. When they need to reproduce their best model six months later for a regulatory submission, they cannot determine which hyperparameters, data version, or random seed were used. What MLOps practice would have MOST directly prevented this problem?
a) Better documentation practices and standardized note-taking
b) Systematic experiment tracking using MLflow or Weights & Biases to automatically log parameters, metrics, code versions, and artifacts
c) More frequent code reviews and team meetings to discuss experiments
d) Requiring all experiments to be approved by a senior scientist before running
Correct Answer: b) Systematic experiment tracking using MLflow or Weights & Biases to automatically log parameters, metrics, code versions, and artifacts
This question tests understanding of MLOps experiment tracking—a core component of the chapter’s toolkit section. The scenario describes a classic reproducibility crisis that experiment tracking tools are specifically designed to prevent.
The problem: The team has no systematic record of:
- Hyperparameters: learning rate, model architecture, regularization
- Data version: which preprocessing, which data split
- Code version: which model code, which feature engineering
- Random seeds: critical for reproducibility in ML
- Environment: library versions, hardware (GPU vs CPU)
- Results: metrics, artifacts, model files
Six months later, they can’t reproduce the “best model” because this information is scattered, incomplete, or lost.
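Random seeds are the easiest of these items to record and the easiest to forget. The sketch below seeds Python's and NumPy's global generators and checks that two seeded runs produce the same draws. The `set_seeds` and `draw` helpers are illustrative. scikit-learn estimators take their seed through the `random_state` argument instead, and deep learning frameworks have their own seeding calls (for example `torch.manual_seed` in PyTorch).

```python
import random

import numpy as np

def set_seeds(seed: int = 42) -> None:
    """Seed Python's and NumPy's global random number generators."""
    random.seed(seed)
    np.random.seed(seed)

def draw():
    """Draw one number from each global generator."""
    return random.random(), float(np.random.rand())

set_seeds(42)
first = draw()
set_seeds(42)
second = draw()
print(first == second)
```

Whatever seed you choose, log it with the run so that it can be recovered later.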
The chapter’s solution: MLflow
The chapter provides a concrete example showing exactly how MLflow solves this:
import matplotlib.pyplot as plt
import mlflow
import mlflow.sklearn
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import RocCurveDisplay, roc_auc_score
# Assumes X_train, X_test, y_train, y_test are already loaded
mlflow.set_experiment("disease-prediction")
with mlflow.start_run(run_name="random-forest-v1"):
    # Log parameters - stored with the run
    mlflow.log_param("n_estimators", 100)
    mlflow.log_param("max_depth", 10)
    mlflow.log_param("random_state", 42)  # CRITICAL FOR REPRODUCIBILITY
    # Train model
    model = RandomForestClassifier(n_estimators=100, max_depth=10, random_state=42)
    model.fit(X_train, y_train)
    # Log metrics - stored with the run
    y_pred_proba = model.predict_proba(X_test)[:, 1]
    auc = roc_auc_score(y_test, y_pred_proba)
    mlflow.log_metric("auc", auc)
    # Log model - ARTIFACT STORED
    mlflow.sklearn.log_model(model, "model")
    # Log plots - ARTIFACTS STORED
    RocCurveDisplay.from_predictions(y_test, y_pred_proba)
    plt.savefig("roc_curve.png")
    mlflow.log_artifact("roc_curve.png")
What an MLflow run records:
1. Parameters: All hyperparameters you log (or that autologging captures)
2. Metrics: Performance metrics at training and validation
3. Artifacts: Models, plots, data files
4. Code version: Git commit hash when the code runs from a Git repository
5. Environment: Library versions, when they are saved with the logged model
6. Timestamp: When the experiment ran
7. User: Who ran the experiment
8. Hardware: System information, if you log it
Six months later, the team can:
- Query MLflow: “Show me all runs with AUC > 0.9”
- Find the best run’s ID
- Retrieve exact parameters, code version, model file
- Reproduce by running with those exact settings
The chapter also discusses Weights & Biases as an alternative with similar capabilities plus:
- Better visualizations
- Team collaboration features
- Hyperparameter sweep automation
- Model versioning
Why other options are insufficient:
Option (a), better documentation, is necessary but insufficient. Manual documentation is:
- Error-prone: People forget to record things
- Inconsistent: Different team members document differently
- Time-consuming: Creates friction (“I’ll document it later”)
- Incomplete: Hard to capture everything manually
- Not programmatic: Can’t query “find all experiments with learning_rate < 0.01”
The chapter emphasizes systematic, automated tracking precisely because manual processes fail.
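To show what "programmatic" means here, the sketch below implements a deliberately tiny stand-in for an experiment tracker using only the standard library: each run is appended to a JSON-lines file with its parameters, metrics, timestamp, and Git commit, and can later be filtered by metric. This is not MLflow's API. The `log_run`, `find_runs`, and `current_commit` helpers and the example values are hypothetical, and a real tracker adds artifact storage, a UI, and much more.

```python
import json
import subprocess
import tempfile
import time
from pathlib import Path

def current_commit():
    """Return the current Git commit hash, or None outside a repository."""
    try:
        result = subprocess.run(
            ["git", "rev-parse", "HEAD"],
            capture_output=True, text=True, check=True,
        )
        return result.stdout.strip()
    except (subprocess.CalledProcessError, FileNotFoundError):
        return None

def log_run(log_path: Path, params: dict, metrics: dict) -> dict:
    """Append one run record as a JSON line and return it."""
    record = {
        "timestamp": time.time(),
        "commit": current_commit(),
        "params": params,
        "metrics": metrics,
    }
    with log_path.open("a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")
    return record

def find_runs(log_path: Path, metric: str, minimum: float) -> list:
    """Return the logged runs whose metric is at least minimum."""
    with log_path.open(encoding="utf-8") as f:
        runs = [json.loads(line) for line in f if line.strip()]
    return [r for r in runs if r["metrics"].get(metric, float("-inf")) >= minimum]

with tempfile.TemporaryDirectory() as tmp:
    log_file = Path(tmp) / "runs.jsonl"
    # Parameter and metric values below are invented for illustration
    log_run(log_file, {"max_depth": 5, "random_state": 42}, {"auc": 0.84})
    log_run(log_file, {"max_depth": 10, "random_state": 42}, {"auc": 0.91})
    best = find_runs(log_file, "auc", 0.9)
    print(best[0]["params"])
```

The point of the sketch is the shape of the record: once every run writes one, a question like "which settings produced AUC of at least 0.9?" becomes a query rather than a search through notebooks and spreadsheets.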
Option (c)—code reviews and meetings—helps team coordination but doesn’t solve the tracking problem. Discussions don’t create machine-readable records. When the regulatory submission happens months later, meeting notes won’t suffice.
Option (d)—approval requirements—adds bureaucracy without solving the core issue. Even approved experiments need tracking. This option slows the team without improving reproducibility.
The regulatory angle (important for public health AI):
The scenario mentions “regulatory submission”, likely to the FDA for a medical device. FDA rules on electronic records (21 CFR Part 11) and related guidance emphasize:
- Traceability: Ability to trace model provenance
- Reproducibility: Ability to recreate exact models
- Auditability: Records of all development decisions
- Validation: Evidence of systematic testing
MLflow/W&B provide exactly this audit trail. The chapter’s emphasis on these tools reflects not just development convenience but regulatory necessity.
The chapter’s complete MLOps stack:
- Experiment tracking: MLflow or W&B (this problem)
- Data versioning: DVC (tracks which data version)
- Code versioning: Git (tracks which code version)
- Environment: Docker + requirements.txt (tracks which libraries)
- Project structure: Standard directories (organizes everything)
Together, these provide complete reproducibility—the ability to recreate any experiment exactly.
Real-world workflow:
The chapter’s example training script shows integration:
1. Initialize MLflow experiment
2. Log all parameters at start
3. Train model
4. Log all metrics
5. Save model as artifact
6. Query MLflow UI to compare runs
This becomes habitual: every experiment automatically tracked, every model reproducible.
For public health practitioners:
The chapter makes MLflow/W&B setup straightforward:
- Install: pip install mlflow or pip install wandb
- Wrap training code with tracking calls
- View UI: mlflow ui provides web interface
- Query programmatically or via UI
Cost: minimal (W&B free for individuals/academics, MLflow open-source)
Benefit: enormous (complete reproducibility, regulatory compliance, team coordination)
The reproducibility crisis in ML is well-documented. The chapter addresses it directly by presenting experiment tracking as essential infrastructure, not optional. The scenario’s regulatory submission failure illustrates precisely why: without systematic tracking, even successful models may be unusable if they can’t be reproduced.
The key principle: automate tracking, don’t rely on human memory. The chapter provides the tools; the practitioner must use them consistently. Start using experiment tracking on day one of a project, not when reproducibility problems emerge.
A data science team is deciding between TensorFlow/Keras and PyTorch for a new image-based tuberculosis screening project. The team has limited deep learning experience, needs to deploy to hospital systems within 6 months, and the model must run on edge devices (low-power tablets). Which factors should MOST heavily influence their framework choice?
a) PyTorch because it’s more popular in research and has more cutting-edge features
b) TensorFlow because Keras provides easier learning curve, TensorFlow Lite enables edge deployment, and TensorFlow Serving supports production deployment—matching their constraints
c) Neither—they should use scikit-learn since deep learning is overkill for this problem
d) Both—prototype in PyTorch for flexibility, then convert to TensorFlow for deployment
Correct Answer: b) TensorFlow because Keras provides easier learning curve, TensorFlow Lite enables edge deployment, and TensorFlow Serving supports production deployment—matching their constraints
This question tests understanding of the chapter’s framework comparison and requires matching tool capabilities to project requirements. The scenario deliberately provides specific constraints that favor one framework over the other.
The chapter’s framework comparison:
PyTorch:
- “Pythonic, intuitive API”
- “Dynamic computation graphs”
- “Strong research community”
- “Excellent for NLP and computer vision”
TensorFlow/Keras:
- “Industry standard for production”
- “TensorFlow Serving for deployment”
- “TensorFlow Lite for mobile/edge”
- “Keras API for ease of use”
Analyzing project constraints:
1. Limited deep learning experience: The chapter positions Keras as having an easier learning curve. The example code shows:
from tensorflow import keras
from tensorflow.keras import layers
# Keras: Very concise, readable
model = keras.Sequential([
    layers.Dense(64, activation='relu', input_shape=(10,)),
    layers.Dropout(0.2),
    layers.Dense(1, activation='sigmoid')
])
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy', 'AUC'])
model.fit(X_train, y_train, epochs=50)
PyTorch requires more boilerplate (define class, forward method, training loop). For beginners, Keras’ high-level API reduces cognitive load.
2. Edge device deployment (low-power tablets): The chapter explicitly states “TensorFlow Lite for mobile/edge.” This is a critical capability. TensorFlow Lite:
- Optimizes models for mobile/edge devices
- Reduces model size and inference latency
- Supports quantization for lower precision (faster, smaller)
- Has extensive mobile deployment tooling
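The idea behind quantization can be shown in a few lines of NumPy. The sketch below maps float32 weights onto 8-bit integers with a single scale and zero point, then measures the size reduction and the rounding error. It illustrates the principle only: the `quantize_int8` and `dequantize` helpers are hypothetical, the weights are random, and production converters use more elaborate schemes than this.

```python
import numpy as np

def quantize_int8(weights: np.ndarray):
    """Map float values onto int8 using one scale and zero point for the array."""
    w_min, w_max = float(weights.min()), float(weights.max())
    scale = (w_max - w_min) / 255.0 or 1.0   # fall back to 1.0 for constant arrays
    zero_point = int(round(-128 - w_min / scale))
    q = np.clip(np.round(weights / scale) + zero_point, -128, 127).astype(np.int8)
    return q, scale, zero_point

def dequantize(q: np.ndarray, scale: float, zero_point: int) -> np.ndarray:
    """Recover approximate float values from the int8 representation."""
    return (q.astype(np.float32) - zero_point) * scale

rng = np.random.default_rng(0)
weights = rng.normal(0, 0.1, size=(256, 256)).astype(np.float32)
q, scale, zero_point = quantize_int8(weights)
restored = dequantize(q, scale, zero_point)
max_error = float(np.abs(weights - restored).max())
print(f"{weights.nbytes} bytes -> {q.nbytes} bytes, max error {max_error:.5f}")
```

Storage drops by a factor of four, and each weight moves by about half a quantization step at most. Whether that loss is acceptable for a screening model has to be checked on held-out clinical data.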
PyTorch has mobile deployment (PyTorch Mobile) but TensorFlow Lite is more mature and widely adopted for healthcare applications.
3. Hospital production deployment within 6 months: The chapter emphasizes “TensorFlow Serving for deployment” and calls TensorFlow “industry standard for production.” TensorFlow Serving:
- Production-grade model serving system
- Handles versioning, A/B testing, monitoring
- Integrates with enterprise IT systems
- Well-documented for healthcare deployments
The 6-month timeline is tight. Learning a framework AND building production infrastructure is challenging. TensorFlow’s ecosystem provides more out-of-the-box production tooling.
4. Image-based TB screening: Both frameworks excel at computer vision. This constraint doesn’t differentiate them. Both have excellent pre-trained models (TensorFlow Hub, PyTorch Hub), transfer learning capabilities, and computer vision libraries.
The decision matrix:
| Factor | PyTorch | TensorFlow/Keras |
|---|---|---|
| Learning curve | Moderate | Easy (Keras) |
| Research flexibility | Excellent | Good |
| Production tooling | Adequate | Excellent |
| Edge deployment | PyTorch Mobile | TensorFlow Lite ✓ |
| Enterprise support | Growing | Mature ✓ |
| Time to production | Longer | Shorter ✓ |
Given constraints (beginner team, edge deployment, tight timeline), TensorFlow/Keras matches better.
Option (a) prioritizes research popularity over project needs. “Cutting-edge features” matter for research, not for deploying a TB screening tool to hospitals. The chapter distinguishes research use (PyTorch strengths) from production use (TensorFlow strengths). This project is production-focused.
Option (c) dismisses deep learning prematurely. The chapter discusses deep learning specifically for image analysis. TB screening from chest X-rays is a canonical deep learning application where CNNs outperform traditional computer vision. scikit-learn lacks the image-specific capabilities that deep learning frameworks provide.
Option (d) suggests using both frameworks—doubling learning curve, maintaining two implementations, and introducing conversion complexity. The chapter doesn’t recommend this approach. Framework conversion (PyTorch → TensorFlow) is non-trivial, error-prone, and time-consuming. With a 6-month deadline, focusing on one framework makes sense.
Additional considerations from the chapter:
Pre-trained models: The chapter discusses both TensorFlow Hub and PyTorch Hub. For TB screening, transfer learning from ImageNet models is common. Both frameworks support this, though specific TB screening models may be available for one or the other—worth checking model zoos before deciding.
Healthcare AI examples: The handbook’s earlier chapters (clinical AI, imaging) likely reference TensorFlow implementations because of its production maturity in healthcare settings.
Team trajectory: If the team later shifts to research (exploring novel architectures), they could learn PyTorch then. For initial production deployment, TensorFlow’s tooling provides scaffolding that accelerates delivery.
Real-world context for public health AI:
FDA clearance: If the TB screening tool needs FDA clearance, TensorFlow’s maturity and extensive documentation can help when describing the software in a submission. Regulatory submissions benefit from widely validated tooling.
Hospital IT integration: Hospital IT departments are conservative and prefer mature, supported technology. TensorFlow’s enterprise backing (Google) and widespread deployment precedents reduce institutional friction.
Resource constraints: Public health settings often have limited computational resources. TensorFlow Lite’s optimization for low-power devices directly addresses this reality.
Maintenance: After initial deployment, who maintains the model? If the team is small or experiences turnover, TensorFlow’s larger talent pool (due to industry adoption) makes hiring easier.
The chapter’s philosophy:
The chapter presents multiple tools not to create confusion but to equip readers for context-appropriate decisions. PyTorch vs. TensorFlow isn’t “which is better?” but “which matches our needs?”
For this scenario: beginner team + edge deployment + tight timeline = TensorFlow/Keras.
A different scenario (experienced team, research project, no deployment constraints) might favor PyTorch.
For public health practitioners:
When choosing frameworks:
1. List project constraints (team experience, deployment target, timeline, requirements)
2. Map them to framework capabilities (the chapter provides this mapping)
3. Choose the framework that matches the constraints, not the most popular or newest
4. Start learning and prototyping quickly to validate the choice
5. Commit to one framework rather than hedging across multiple
The chapter provides information for informed decisions; practitioners must apply decision frameworks that prioritize project success over tool fashion. This question tests that practical decision-making skill—understanding not just what each tool does but when to use which tool.
A graduate student is working on epidemic forecasting for their dissertation. They maintain all their code and data on their laptop’s desktop folder with names like “model_v2_final_FINAL.py” and “data_cleaned.csv.” After their laptop crashes and they discover their backups are outdated, they lose three months of work. Which combination of practices from the chapter would have MOST effectively prevented this catastrophe?
- a) Cloud storage (Google Drive, Dropbox) for automatic backup
- b) Git for code version control + GitHub for remote backup + DVC for data versioning + remote storage (S3/GCS) for data backup
- c) More frequent manual backups to external hard drives
- d) Working exclusively on cloud platforms (Google Colab, AWS SageMaker) instead of local machines
Correct Answer: b) Git for code version control + GitHub for remote backup + DVC for data versioning + remote storage (S3/GCS) for data backup
This question synthesizes multiple practices from the chapter’s reproducibility and collaboration section. The scenario illustrates common disasters that proper version control and backup infrastructure prevent.
The problem breakdown:
1. Code loss (“model_v2_final_FINAL.py”):
- No version control
- Manual versioning through filenames (error-prone, unmaintainable)
- Single point of failure (laptop)
- Can’t recover previous versions
- No collaboration capability
2. Data loss (“data_cleaned.csv”):
- No data versioning
- No tracking of data transformations
- No remote backup
- Can’t reproduce data cleaning steps
3. Lack of remote backup:
- All work on single device
- Backups not systematic/automated
- No ability to recover from device failure
The chapter’s solution:
Git for code version control: The chapter emphasizes Git throughout, providing concrete examples:
git init # Initialize repository
git add . # Stage files
git commit -m "Initial commit" # Create checkpoint
Benefits:
- Every commit is a checkpoint: Can recover any previous version
- No more “final_FINAL” naming: Git tracks versions automatically
- Branching: Experiment safely without destroying working code
- History: See what changed, when, and why
- Merging: Combine different development paths
With Git, committed code is never silently overwritten. A local repository still lives on the laptop, so the commit history survives a crash only once it has been pushed to a remote.
GitHub for remote backup: The chapter’s Git section shows:
git remote add origin https://github.com/username/repo.git
git push -u origin main
Benefits:
- Offsite backup: Every git push copies the commit history to the cloud
- Free hosting: Public repos support open science, and private repos are also free
- Collaboration ready: Advisor/committee can review code
- Institutional compliance: Many universities require code archiving
When the laptop crashes, the student clones the repository from GitHub onto a new machine. No pushed code is lost.
DVC for data versioning: The chapter introduces DVC specifically for versioning large datasets (which Git doesn’t handle well):
dvc init # Initialize DVC in Git repo
dvc add data/raw/epidemic_data.csv # Track data file
dvc remote add -d storage s3://mybucket/dvc-store
dvc push # Upload data to remote storage
# After crash, on new machine:
git clone <repo-url> # Get code
dvc pull # Get data
Benefits:
- Version large datasets: Git struggles with >100MB files, DVC handles terabytes
- Track data transformations: Record preprocessing steps
- Remote storage: Data backed up to S3/GCS/Azure
- Reproduce pipelines: DVC tracks data processing pipelines
With DVC, the student’s “data_cleaned.csv” is versioned, backed up remotely, and accompanied by code that documents the cleaning process. Data loss is prevented, and the cleaning process is reproducible.
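To make the idea of data versioning concrete, here is a minimal sketch of content hashing, the general mechanism that lets a small metadata file stand in for a large dataset. DVC records a content hash for each tracked file in a `.dvc` metafile that Git versions; the code below is a simplified illustration of that idea, not DVC's implementation, and the function names `file_md5` and `has_changed` are hypothetical.

```python
import hashlib
import tempfile
from pathlib import Path


def file_md5(path, chunk_size=1024 * 1024):
    """Return the hex MD5 digest of a file, reading in chunks to bound memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def has_changed(path, recorded_hash):
    """True if the file's current content no longer matches the recorded hash."""
    return file_md5(path) != recorded_hash


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        data_file = Path(tmp) / "data_cleaned.csv"  # illustrative file name
        data_file.write_text("week,cases\n1,10\n")
        recorded = file_md5(data_file)  # what a metafile would store

        print("changed after no edit:", has_changed(data_file, recorded))
        data_file.write_text("week,cases\n1,10\n2,15\n")
        print("changed after new row:", has_changed(data_file, recorded))
```

Because the hash depends only on content, committing the recorded hash alongside the code ties each commit to the exact data version it was run against.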
Why this combination is essential:
Git alone doesn’t solve the data problem (large files, binary data). DVC alone doesn’t version code. Together, they provide complete protection for computational research.
Analyzing alternatives:
Option (a)—Cloud storage (Drive/Dropbox): - Pros: Easy, automatic backup - Cons: Not version control (overwrites files, limited history), no meaningful history/diffs, doesn’t track relationships between code and data, not designed for code workflows, poor collaboration features for code
Option (a) prevents total loss but doesn’t provide versioning, reproducibility, or proper collaboration. The chapter presents it as insufficient for serious development work.
Option (c)—Manual backups to external drives: - Pros: Physical control of data - Cons: Requires discipline (people forget), no automation, drives can fail too, versioning still manual, doesn’t enable collaboration
Manual processes fail. The chapter emphasizes automated systems precisely because humans are unreliable about backups.
Option (d)—Cloud platforms exclusively: - Pros: Built-in backup, compute resources - Cons: Vendor lock-in, requires internet, limited customization, can be expensive, doesn’t address version control (still need Git)
Cloud platforms are tools, not replacements for version control. The chapter discusses Colab for compute, not as a version control solution. Colab + Git + GitHub is appropriate; Colab alone is not.
The chapter’s complete reproducibility stack:
- Code versioning: Git
- Code backup: GitHub/GitLab
- Data versioning: DVC
- Data backup: S3/GCS/Azure
- Environment: Docker + requirements.txt
- Experiments: MLflow/W&B
- Project structure: Standard directories
This stack provides: - Recoverability: Device failure doesn’t cause data loss - Reproducibility: Anyone can recreate exact environment and results - Collaboration: Team members work together efficiently - Auditability: Complete history of all changes - Correctness: Version control enables review and validation
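The environment layer of this stack depends on recording exact package versions. As a rough sketch, the standard library's `importlib.metadata` can list what is installed in the current interpreter. The helper names `snapshot_environment` and `write_requirements` are hypothetical, and the output is a simplified approximation of what `pip freeze` produces, not a replacement for it.

```python
from importlib import metadata


def snapshot_environment():
    """Return sorted 'name==version' lines for the installed distributions."""
    pins = set()
    for dist in metadata.distributions():
        name = dist.metadata["Name"]
        if name:  # skip installs with missing metadata
            pins.add(f"{name}=={dist.version}")
    return sorted(pins, key=str.lower)


def write_requirements(path):
    """Write the snapshot to a requirements-style file and return the line count."""
    lines = snapshot_environment()
    with open(path, "w", encoding="utf-8") as handle:
        handle.write("\n".join(lines) + "\n")
    return len(lines)


if __name__ == "__main__":
    for line in snapshot_environment()[:10]:
        print(line)
```

Committing the resulting file next to the code means a later `git checkout` of an old commit also recovers the package versions that commit was developed against.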
Real-world implications for dissertation work:
1. Advisor review: “Send me your code” → “Here’s the GitHub link.” Much cleaner than emailing zip files.
2. Committee examination: “How did you clean the data?” → “See commit 3a7f9b2 and the DVC pipeline.” Full traceability of methods.
3. Publication: “Make code available” → Already on GitHub, and a release can be archived (for example through Zenodo) to obtain a DOI. Meets open science requirements.
4. Post-graduation: Future researchers can build on the work because code + data + environment are documented and preserved.
The “three months of work” loss:
Without version control, this is catastrophic. With Git + GitHub: - Latest commit is from yesterday → lose one day maximum - DVC + remote storage → data fully backed up - git clone + dvc pull on new machine → back to work in hours
The chapter’s project setup exercise (Section 12):
The hands-on exercise walks through exactly this setup: 1. Create environment (conda/venv) 2. Create project structure (directories) 3. Initialize Git 4. Create GitHub repo 5. Set up DVC 6. Build pipeline 7. Export environment 8. Create Dockerfile
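Step 2 of the list above (create the project structure) is easy to script so that every project starts the same way. The sketch below uses only the standard library. The directory names in `PROJECT_DIRS` are a hypothetical layout chosen for illustration, not the chapter's prescribed structure, and `.gitkeep` is a common convention (not a Git feature) for keeping empty directories under version control.

```python
import tempfile
from pathlib import Path

# Hypothetical layout; adjust to your team's conventions.
PROJECT_DIRS = [
    "data/raw",
    "data/processed",
    "notebooks",
    "src",
    "models",
    "reports",
]


def create_project(root, subdirs=PROJECT_DIRS):
    """Create the directory skeleton under root; safe to run more than once."""
    root = Path(root)
    created = []
    for sub in subdirs:
        target = root / sub
        target.mkdir(parents=True, exist_ok=True)
        # Git does not track empty directories, so add a placeholder file.
        (target / ".gitkeep").touch()
        created.append(target)
    readme = root / "README.md"
    if not readme.exists():
        readme.write_text(f"# {root.name}\n", encoding="utf-8")
    return created


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        for path in create_project(Path(tmp) / "my-project"):
            print(path.relative_to(tmp))
```

Running the function first and then initializing Git and DVC inside the new root gives the remaining steps a consistent place to put code, data, and models.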
This is the chapter’s recommended starting point for ANY project. Following it would have completely prevented the student’s disaster.
For public health practitioners:
The chapter presents this infrastructure not as optional nice-to-have but as essential professional practice. Just as laboratory researchers maintain lab notebooks and document procedures, computational researchers must version control code and data.
Start every project with:
mkdir my-project
cd my-project
git init # Version control from day one
dvc init # Data versioning from day one
git remote add origin <github-url> # Remote backup from day one
dvc remote add -d storage <s3-url> # Data backup from day one
The chapter provides all these tools and explains their use. The practitioner’s responsibility is establishing habits: commit regularly, push frequently, version data systematically. These practices prevent catastrophes and enable reproducible science.
The scenario’s disaster—three months of work lost—would be impossible with proper version control. This isn’t hypothetical; this happens to real students/researchers regularly. The chapter equips readers with tools to prevent it. The question tests whether readers understand not just what the tools do but why they’re essential and how they work together to provide comprehensive protection.
15.14 Discussion Questions
Environment choice: When would you choose Jupyter notebooks over VS Code for development? What are the trade-offs?
Library selection: For a new tabular data classification project, would you start with scikit-learn, XGBoost, or deep learning? Why?
Cloud vs. local: When does it make sense to move from local development to cloud computing? What factors influence this decision?
Reproducibility: How would you ensure someone else can exactly reproduce your analysis? What tools and practices are essential?
Tool overload: The AI ecosystem has hundreds of tools. How do you decide what to learn? What’s essential vs. nice-to-have?
Version control for data: Git works well for code, but large datasets pose challenges. How would you version control a 100GB dataset?
15.15 Further Resources
15.15.1 Books
- Python Data Science Handbook by Jake VanderPlas - Free online
- Hands-On Machine Learning by Aurélien Géron - Practical guide
- Deep Learning with Python by François Chollet - Keras creator
15.15.2 Documentation
- scikit-learn - Excellent tutorials and examples
- PyTorch - Comprehensive documentation
- pandas - User guide and API reference
- MLflow - MLOps platform
15.15.3 Interactive Learning
- Kaggle Learn - Free micro-courses
- Fast.ai - Practical deep learning
- Google Colab Tutorials - Notebooks
15.15.4 Online Courses
- Andrew Ng’s ML Specialization - Coursera
- Deep Learning Specialization - Coursera
- Full Stack Deep Learning - Production ML