14  Your AI Toolkit

Note: Learning Objectives

By the end of this chapter, you will be able to:

  1. Select appropriate development environments for different AI project types
  2. Install and configure essential Python libraries for data science and machine learning
  3. Choose the right tools for data management, preprocessing, and visualization
  4. Set up cloud and local compute resources for model training and deployment
  5. Navigate MLOps platforms for experiment tracking, versioning, and collaboration
  6. Integrate specialized tools for epidemiological analysis and public health applications
  7. Build a reproducible workflow using containers, version control, and virtual environments
  8. Access and utilize pre-trained models and datasets relevant to public health

Estimated time: 60-90 minutes

Prerequisites:

  • Chapter 2: Just Enough AI to Be Dangerous (basic ML concepts)
  • Chapter 3: The Data Problem (data fundamentals)


14.1 What You’ll Build

By working through this chapter, you will set up:

  1. Complete Python Development Environment - Anaconda/Miniconda, Jupyter, VS Code
  2. Project Template - Structured directory with configuration files and documentation
  3. Model Training Pipeline - End-to-end workflow from data to trained model
  4. MLOps Stack - Experiment tracking, version control, and model registry
  5. Deployment-Ready Container - Docker configuration for reproducible environments

14.2 1. Introduction: Choosing the Right Tools

14.2.1 The Modern AI Stack

The AI ecosystem is vast and evolving rapidly. This chapter focuses on:

  • Open-source tools (accessible, transparent, community-supported)
  • Python-first (dominant language for data science and ML)
  • Production-ready (tools that scale from prototype to deployment)
  • Public health-relevant (applicable to epidemiology and population health)

14.2.2 Tool Categories

┌────────────────────────────────────────────────────────────┐
│                    AI Development Stack                     │
├────────────────────────────────────────────────────────────┤
│                                                             │
│  Development Environment                                    │
│  ├─ IDE/Editor (VS Code, PyCharm, Jupyter)                │
│  ├─ Package Manager (conda, pip)                          │
│  └─ Version Control (Git, GitHub)                         │
│                                                             │
│  Core Libraries                                            │
│  ├─ Data (pandas, numpy, polars)                          │
│  ├─ ML (scikit-learn, XGBoost, LightGBM)                 │
│  ├─ Deep Learning (PyTorch, TensorFlow, JAX)             │
│  └─ Visualization (matplotlib, seaborn, plotly)          │
│                                                             │
│  MLOps & Workflow                                          │
│  ├─ Experiment Tracking (MLflow, Weights & Biases)       │
│  ├─ Data Versioning (DVC, LakeFS)                        │
│  ├─ Pipeline Orchestration (Airflow, Prefect)            │
│  └─ Model Serving (FastAPI, TorchServe, TFServing)       │
│                                                             │
│  Infrastructure                                            │
│  ├─ Containerization (Docker, Kubernetes)                 │
│  ├─ Cloud Platforms (AWS, GCP, Azure)                    │
│  ├─ Compute (GPUs, TPUs, serverless)                     │
│  └─ Databases (PostgreSQL, MongoDB, InfluxDB)            │
│                                                             │
│  Domain-Specific                                           │
│  ├─ Epidemiology (EpiModel, surveillance-py)             │
│  ├─ Genomics (BioPython, scikit-bio)                     │
│  ├─ GIS/Spatial (GeoPandas, folium)                      │
│  └─ NLP/LLMs (Hugging Face, spaCy, LangChain)           │
│                                                             │
└────────────────────────────────────────────────────────────┘

14.3 2. Setting Up Your Development Environment

14.3.1 Python Distribution: Anaconda vs. Miniconda

Anaconda (Recommended for beginners)

Pros: - Pre-installed with 250+ popular data science packages - Graphical installer and Navigator UI - Includes Jupyter, Spyder, VS Code - No compilation needed for scientific libraries

Cons: - Large download (~3 GB) - Takes significant disk space (~5 GB) - May include packages you don’t need

Installation:

# Download from https://www.anaconda.com/download
# Follow graphical installer

# Verify installation
conda --version
python --version

Miniconda (Recommended for experienced users)

Pros: - Minimal installer (~50 MB) - Install only what you need - Faster environment creation - Same conda functionality

Cons: - Requires manual package installation - Command-line focused - More setup steps

Installation:

# Download from https://docs.conda.io/en/latest/miniconda.html

# Linux/macOS
bash Miniconda3-latest-Linux-x86_64.sh

# Windows: Run .exe installer

# Verify
conda --version

14.3.2 Creating Virtual Environments

Why virtual environments?

  • Isolate project dependencies
  • Avoid version conflicts
  • Reproducible environments
  • Easy to share and recreate

Creating environments with conda:

# Create new environment with Python 3.10
conda create -n publichealth-ai python=3.10

# Activate environment
conda activate publichealth-ai

# Install core packages
conda install numpy pandas scikit-learn matplotlib jupyter

# Deactivate when done
conda deactivate

# List all environments
conda env list

# Remove environment
conda env remove -n publichealth-ai

Alternative: venv (Python built-in)

# Create virtual environment
python -m venv venv

# Activate
# Windows
venv\Scripts\activate
# Linux/macOS
source venv/bin/activate

# Install packages
pip install numpy pandas scikit-learn matplotlib jupyter

# Deactivate
deactivate

14.3.3 IDE Options

14.3.3.1 Jupyter Notebook/Lab (Interactive exploration)

Best for: - Data exploration and visualization - Iterative analysis - Teaching and presentations - Documenting analysis with narrative

Installation:

conda install jupyter jupyterlab

# Launch Jupyter Lab
jupyter lab

# Launch classic Jupyter Notebook
jupyter notebook

Key features: - Cell-based execution - Inline plots and visualizations - Rich markdown support - Easy sharing (.ipynb files)

Extensions to install:

# Variable inspector
pip install jupyterlab-variableinspector

# Table of contents
pip install jupyterlab-toc

# Code formatter
pip install jupyterlab-code-formatter black isort

14.3.3.2 VS Code (General-purpose development)

Best for: - Writing production code - Debugging complex issues - Working with multiple files - Git integration - Remote development (SSH, WSL, containers)

Installation: Download from https://code.visualstudio.com/

Essential extensions for AI/ML:

{
  "recommendations": [
    "ms-python.python",              // Python IntelliSense
    "ms-toolsai.jupyter",            // Jupyter notebooks
    "ms-python.vscode-pylance",      // Type checking
    "ms-python.black-formatter",     // Code formatting
    "ms-python.debugpy",             // Debugging
    "ms-azuretools.vscode-docker",   // Docker support
    "github.copilot",                // AI code suggestions
    "visualstudioexptteam.vscodeintellicode" // ML-powered suggestions
  ]
}

Useful settings (.vscode/settings.json):

{
  "python.defaultInterpreterPath": "/path/to/conda/env/python",
  "python.formatting.provider": "black",
  "python.linting.enabled": true,
  "python.linting.pylintEnabled": true,
  "editor.formatOnSave": true,
  "files.autoSave": "afterDelay",
  "jupyter.askForKernelRestart": false
}

14.3.3.3 PyCharm (Professional Python IDE)

Best for: - Large-scale projects - Professional development - Advanced debugging - Refactoring tools

Editions: - Community - Free, open-source, sufficient for most projects - Professional - Paid, adds scientific tools, web frameworks, databases

Download: https://www.jetbrains.com/pycharm/


14.3.3.4 Google Colab (Cloud-based, free GPUs)

Best for: - Learning and experimentation - GPU access without local hardware - Quick prototyping - Sharing notebooks

Features: - Free GPU/TPU access (limited hours) - Pre-installed ML libraries - Google Drive integration - No setup required

Access: https://colab.research.google.com/

Example: Enabling GPU:

# Check GPU availability
import torch
print(f"GPU available: {torch.cuda.is_available()}")
print(f"GPU name: {torch.cuda.get_device_name(0) if torch.cuda.is_available() else 'None'}")

14.4 3. Essential Python Libraries

14.4.1 Data Manipulation

14.4.1.1 pandas - Data structures and analysis

Installation:

conda install pandas
# or
pip install pandas

Core functionality: - DataFrames (2D tables) - Series (1D arrays) - Reading/writing CSV, Excel, SQL, JSON - Data cleaning, filtering, aggregation - Time series analysis

Quick start:

import pandas as pd

# Read data
df = pd.read_csv('covid_cases.csv')

# Basic exploration
print(df.head())
print(df.info())
print(df.describe())

# Filtering
recent = df[df['date'] > '2023-01-01']

# Grouping
by_region = df.groupby('region')['cases'].sum()

# Time series
df['date'] = pd.to_datetime(df['date'])
df.set_index('date', inplace=True)
weekly = df.resample('W').sum()

Resources: - 📚 pandas documentation - 📄 10 minutes to pandas


14.4.1.2 NumPy - Numerical computing

Installation:

conda install numpy

Core functionality: - Multi-dimensional arrays - Mathematical operations - Linear algebra - Random number generation - Broadcasting

Quick start:

import numpy as np

# Create arrays
arr = np.array([1, 2, 3, 4, 5])
matrix = np.array([[1, 2], [3, 4]])

# Operations
print(arr.mean(), arr.std())
print(matrix @ matrix.T)  # Matrix multiplication

# Random numbers (for simulations)
np.random.seed(42)
samples = np.random.normal(loc=100, scale=15, size=1000)

14.4.1.3 Polars - Fast DataFrame library (Alternative to pandas)

Installation:

pip install polars

Why Polars? - 10-100x faster than pandas for large datasets - Lower memory usage - Better query optimization - Lazy evaluation

Quick comparison:

import polars as pl

# Polars syntax
df_polars = pl.read_csv('large_dataset.csv')
result = (
    df_polars
    .filter(pl.col('age') > 18)
    .group_by('region')
    .agg([
        pl.col('cases').sum().alias('total_cases'),
        pl.col('deaths').sum().alias('total_deaths')
    ])
)

# pandas equivalent
import pandas as pd
df_pandas = pd.read_csv('large_dataset.csv')
result = (
    df_pandas[df_pandas['age'] > 18]
    .groupby('region')
    .agg({'cases': 'sum', 'deaths': 'sum'})
)

14.4.2 Machine Learning

14.4.2.1 scikit-learn - Classical ML algorithms

Installation:

conda install scikit-learn

Coverage: - Classification, regression, clustering - Preprocessing and feature engineering - Model selection and evaluation - Pipeline construction

Core modules:

from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score, classification_report

# Typical workflow
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Preprocessing
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Training
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train_scaled, y_train)

# Evaluation
y_pred = model.predict(X_test_scaled)
y_pred_proba = model.predict_proba(X_test_scaled)[:, 1]
print(f"AUC-ROC: {roc_auc_score(y_test, y_pred_proba):.3f}")
print(classification_report(y_test, y_pred))

Resources: - 📚 scikit-learn documentation - 📄 Choosing the right estimator


14.4.2.2 XGBoost - Gradient boosting

Installation:

conda install -c conda-forge xgboost
# or
pip install xgboost

Why XGBoost? - State-of-the-art performance on tabular data - Handles missing values - Built-in regularization - Feature importance

Quick start:

import xgboost as xgb
from sklearn.metrics import roc_auc_score

# Create DMatrix (XGBoost's data structure)
dtrain = xgb.DMatrix(X_train, label=y_train)
dtest = xgb.DMatrix(X_test, label=y_test)

# Parameters
params = {
    'objective': 'binary:logistic',
    'eval_metric': 'auc',
    'max_depth': 6,
    'eta': 0.1,
    'subsample': 0.8,
    'colsample_bytree': 0.8
}

# Train
model = xgb.train(
    params,
    dtrain,
    num_boost_round=100,
    evals=[(dtrain, 'train'), (dtest, 'test')],
    early_stopping_rounds=10,
    verbose_eval=10
)

# Predict
y_pred_proba = model.predict(dtest)
print(f"AUC-ROC: {roc_auc_score(y_test, y_pred_proba):.3f}")

14.4.2.3 LightGBM - Fast gradient boosting

Installation:

conda install -c conda-forge lightgbm

Advantages over XGBoost: - Faster training on large datasets - Lower memory usage - Better handling of categorical features - Leaf-wise tree growth (vs. level-wise)

Quick start:

import lightgbm as lgb

# Create dataset
train_data = lgb.Dataset(X_train, label=y_train)
test_data = lgb.Dataset(X_test, label=y_test, reference=train_data)

# Parameters
params = {
    'objective': 'binary',
    'metric': 'auc',
    'boosting_type': 'gbdt',
    'num_leaves': 31,
    'learning_rate': 0.05
}

# Train
model = lgb.train(
    params,
    train_data,
    num_boost_round=100,
    valid_sets=[test_data],
    callbacks=[lgb.early_stopping(10)]
)
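
A short continuation of the example above (not in the original listing) showing prediction and evaluation; for the 'binary' objective, the trained Booster's predict method returns probabilities:

# Predict probabilities using the best iteration found by early stopping
from sklearn.metrics import roc_auc_score

y_pred_proba = model.predict(X_test, num_iteration=model.best_iteration)
print(f"AUC-ROC: {roc_auc_score(y_test, y_pred_proba):.3f}")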

14.4.3 Deep Learning

14.4.3.1 PyTorch - Dynamic neural networks

Installation:

# CPU only
conda install pytorch torchvision torchaudio cpuonly -c pytorch

# GPU (CUDA 11.8)
conda install pytorch torchvision torchaudio pytorch-cuda=11.8 -c pytorch -c nvidia

Why PyTorch? - Pythonic, intuitive API - Dynamic computation graphs - Strong research community - Excellent for NLP and computer vision

Simple example:

import torch
import torch.nn as nn
import torch.optim as optim

# Define model
class SimpleNN(nn.Module):
    def __init__(self, input_dim, hidden_dim, output_dim):
        super(SimpleNN, self).__init__()
        self.fc1 = nn.Linear(input_dim, hidden_dim)
        self.relu = nn.ReLU()
        self.fc2 = nn.Linear(hidden_dim, output_dim)
        self.sigmoid = nn.Sigmoid()

    def forward(self, x):
        x = self.fc1(x)
        x = self.relu(x)
        x = self.fc2(x)
        x = self.sigmoid(x)
        return x

# Initialize
model = SimpleNN(input_dim=10, hidden_dim=64, output_dim=1)
criterion = nn.BCELoss()
optimizer = optim.Adam(model.parameters(), lr=0.001)

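# Prepare tensors (assumed setup, not shown above): convert NumPy arrays to float tensors
X_train_tensor = torch.from_numpy(X_train).float()
y_train_tensor = torch.from_numpy(y_train).float().unsqueeze(1)  # shape (n, 1) to match output_dim=1
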
# Training loop
for epoch in range(100):
    # Forward pass
    outputs = model(X_train_tensor)
    loss = criterion(outputs, y_train_tensor)

    # Backward pass
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    if (epoch + 1) % 10 == 0:
        print(f'Epoch [{epoch+1}/100], Loss: {loss.item():.4f}')

Resources: - 📚 PyTorch documentation - 🎓 PyTorch tutorials


14.4.3.2 TensorFlow/Keras - Production-ready deep learning

Installation:

pip install tensorflow

Why TensorFlow? - Industry standard for production - TensorFlow Serving for deployment - TensorFlow Lite for mobile/edge - Keras API for ease of use

Keras example:

import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

# Define model
model = keras.Sequential([
    layers.Dense(64, activation='relu', input_shape=(10,)),
    layers.Dropout(0.2),
    layers.Dense(32, activation='relu'),
    layers.Dropout(0.2),
    layers.Dense(1, activation='sigmoid')
])

# Compile
model.compile(
    optimizer='adam',
    loss='binary_crossentropy',
    metrics=['accuracy', 'AUC']
)

# Train
history = model.fit(
    X_train, y_train,
    epochs=50,
    batch_size=32,
    validation_split=0.2,
    callbacks=[
        keras.callbacks.EarlyStopping(patience=5, restore_best_weights=True)
    ]
)

# Evaluate
test_loss, test_acc, test_auc = model.evaluate(X_test, y_test)
print(f'Test AUC: {test_auc:.3f}')

14.4.4 Visualization

14.4.4.1 Matplotlib - Publication-quality plots

Installation:

conda install matplotlib

Core plotting:

import matplotlib.pyplot as plt
import numpy as np

# Line plot
plt.figure(figsize=(10, 6))
plt.plot(dates, cases, label='Daily Cases', linewidth=2)
plt.xlabel('Date')
plt.ylabel('Number of Cases')
plt.title('COVID-19 Daily Cases')
plt.legend()
plt.grid(alpha=0.3)
plt.tight_layout()
plt.show()

# Multiple subplots
fig, axes = plt.subplots(2, 2, figsize=(12, 10))

axes[0, 0].plot(data['cases'])
axes[0, 0].set_title('Cases')

axes[0, 1].plot(data['deaths'])
axes[0, 1].set_title('Deaths')

axes[1, 0].scatter(data['age'], data['severity'])
axes[1, 0].set_title('Age vs. Severity')

axes[1, 1].hist(data['recovery_time'], bins=30)
axes[1, 1].set_title('Recovery Time Distribution')

plt.tight_layout()
plt.show()

14.4.4.2 Seaborn - Statistical visualization

Installation:

conda install seaborn

Why Seaborn? - Built on matplotlib - Beautiful default styles - Statistical plots out-of-the-box - Great for exploratory analysis

Examples:

import seaborn as sns

# Set style
sns.set_style('whitegrid')
sns.set_palette('husl')

# Distribution plot
sns.histplot(data=df, x='age', hue='outcome', kde=True)
plt.title('Age Distribution by Outcome')
plt.show()

# Correlation heatmap
plt.figure(figsize=(10, 8))
sns.heatmap(df.corr(), annot=True, cmap='coolwarm', center=0)
plt.title('Feature Correlations')
plt.show()

# Pairplot
sns.pairplot(df, hue='disease_status', vars=['age', 'bmi', 'blood_pressure'])
plt.show()

# Box plot
sns.boxplot(data=df, x='region', y='incidence_rate')
plt.xticks(rotation=45)
plt.title('Disease Incidence by Region')
plt.show()

14.4.4.3 Plotly - Interactive visualizations

Installation:

conda install -c plotly plotly

Why Plotly? - Interactive plots (zoom, pan, hover) - Web-based (works in Jupyter, dashboards) - 3D plots - Maps and geospatial visualization

Examples:

import plotly.express as px
import plotly.graph_objects as go

# Interactive line plot
fig = px.line(df, x='date', y='cases', color='region',
              title='COVID-19 Cases by Region')
fig.update_layout(hovermode='x unified')
fig.show()

# Choropleth map
fig = px.choropleth(df,
                    locations='country_code',
                    color='incidence_rate',
                    hover_name='country',
                    color_continuous_scale='Reds',
                    title='Disease Incidence by Country')
fig.show()

# 3D scatter
fig = px.scatter_3d(df, x='age', y='bmi', z='risk_score',
                    color='outcome', size='severity',
                    title='Risk Factors 3D Visualization')
fig.show()

14.5 4. MLOps and Experiment Tracking

14.5.1 MLflow - Experiment tracking and model registry

Installation:

pip install mlflow

Core features: - Experiment tracking (parameters, metrics, artifacts) - Model registry (versioning, staging, production) - Model serving - Project packaging

Example workflow:

import mlflow
import mlflow.sklearn

# Set experiment
mlflow.set_experiment("disease-prediction")

# Start run
with mlflow.start_run(run_name="random-forest-v1"):
    # Log parameters
    mlflow.log_param("n_estimators", 100)
    mlflow.log_param("max_depth", 10)

    # Train model
    model = RandomForestClassifier(n_estimators=100, max_depth=10)
    model.fit(X_train, y_train)

    # Evaluate
    y_pred_proba = model.predict_proba(X_test)[:, 1]
    auc = roc_auc_score(y_test, y_pred_proba)

    # Log metrics
    mlflow.log_metric("auc", auc)
    mlflow.log_metric("accuracy", accuracy_score(y_test, model.predict(X_test)))

    # Log model
    mlflow.sklearn.log_model(model, "model")

    # Log artifacts (compute ROC curve points for the plot)
    from sklearn.metrics import roc_curve
    fpr, tpr, _ = roc_curve(y_test, y_pred_proba)
    plt.figure()
    plt.plot(fpr, tpr)
    plt.savefig("roc_curve.png")
    mlflow.log_artifact("roc_curve.png")

# View UI
# mlflow ui

Resources: - 📚 MLflow documentation


14.5.2 Weights & Biases - Experiment tracking and collaboration

Installation:

pip install wandb

Why W&B? - Beautiful visualizations - Team collaboration features - Model versioning - Hyperparameter sweeps - Free for individuals and academics

Quick start:

import wandb

# Initialize
wandb.init(project="public-health-ai", name="experiment-1")

# Log config
wandb.config.update({
    "learning_rate": 0.001,
    "batch_size": 32,
    "epochs": 50
})

# Training loop
for epoch in range(50):
    # ... training code ...

    # Log metrics
    wandb.log({
        "epoch": epoch,
        "train_loss": train_loss,
        "val_loss": val_loss,
        "val_auc": val_auc
    })

# Log model
wandb.log_model(path="model.pth", name="disease-predictor")

# Finish run
wandb.finish()

14.5.3 DVC - Data version control

Installation:

pip install dvc

Why DVC? - Version large datasets (like Git for data) - Track data pipelines - Remote storage (S3, GCS, Azure, SSH) - Reproducible experiments

Setup:

# Initialize DVC in Git repo
git init
dvc init

# Add data to DVC
dvc add data/raw/covid_data.csv

# Add to git
git add data/raw/covid_data.csv.dvc .gitignore
git commit -m "Add COVID data"

# Configure remote storage
dvc remote add -d storage s3://mybucket/dvc-store
dvc push

# On another machine
git clone <repo-url>
dvc pull

14.6 5. Domain-Specific Tools

14.6.1 Epidemiology and Public Health

14.6.1.1 EpiEstim - Estimating reproduction number

Installation:

# R package
install.packages("EpiEstim")

Alternative Python: EpiNow2-py
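
For intuition about what EpiEstim computes, the sketch below implements the renewal-equation idea behind it (the Cori method): the instantaneous reproduction number is roughly today's incidence divided by past incidence weighted by the serial-interval distribution. This is an illustration only, not EpiEstim's API; it omits the Bayesian smoothing the package performs, and the case counts and serial-interval weights are hypothetical.

import numpy as np

def crude_rt(incidence, serial_interval_pmf):
    """Crude R_t ~ I_t / sum_s(I_{t-s} * w_s); illustration of the renewal equation only."""
    rt = np.full(len(incidence), np.nan)
    for t in range(1, len(incidence)):
        # Reverse the recent window so index 0 is yesterday (lag 1, weight w_1)
        window = incidence[max(0, t - len(serial_interval_pmf)):t][::-1]
        weights = serial_interval_pmf[:len(window)]
        denominator = np.dot(window, weights)
        if denominator > 0:
            rt[t] = incidence[t] / denominator
    return rt

# Hypothetical daily case counts and a discretized serial-interval distribution
cases = np.array([5, 8, 13, 21, 30, 42, 55, 70, 88, 105], dtype=float)
serial_interval = np.array([0.25, 0.35, 0.25, 0.15])  # weights sum to 1
print(np.round(crude_rt(cases, serial_interval), 2))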


14.6.1.2 Epiweeks - Epidemiological week calculations

Installation:

pip install epiweeks

Usage:

from epiweeks import Week, Year

# Get current epi week
current_week = Week.thisweek()
print(f"Current epi week: {current_week}")

# Convert date to epi week
from datetime import date
week = Week.fromdate(date(2024, 3, 15))
print(f"Epi week for 2024-03-15: {week}")

14.6.1.3 Geospatial Analysis - GeoPandas, Folium

Installation:

conda install geopandas folium

Example:

import geopandas as gpd
import folium

# Read shapefile
gdf = gpd.read_file('countries.shp')

# Merge with disease data
gdf = gdf.merge(disease_data, on='country_code')

# Create interactive map
m = folium.Map(location=[0, 0], zoom_start=2)

folium.Choropleth(
    geo_data=gdf,
    data=gdf,
    columns=['country_code', 'incidence_rate'],
    key_on='feature.properties.country_code',
    fill_color='YlOrRd',
    legend_name='Incidence Rate per 100,000'
).add_to(m)

m.save('disease_map.html')

14.6.2 Natural Language Processing

14.6.2.1 Hugging Face Transformers - Pre-trained language models

Installation:

pip install transformers

Quick start:

from transformers import pipeline

# Sentiment analysis
classifier = pipeline("sentiment-analysis")
result = classifier("The patient reported feeling much better after treatment.")
print(result)

# Named entity recognition (medical entities)
ner = pipeline("ner", model="d4data/biomedical-ner-all")
entities = ner("Patient has diabetes and hypertension.")
print(entities)

# Text generation
generator = pipeline("text-generation", model="gpt2")
text = generator("Symptoms of COVID-19 include", max_length=50)
print(text)

14.6.2.2 spaCy - Industrial-strength NLP

Installation:

pip install spacy
python -m spacy download en_core_web_sm

Medical NLP:

import spacy

# Load model
nlp = spacy.load("en_core_web_sm")

# Process text
text = "Patient presents with fever, cough, and shortness of breath."
doc = nlp(text)

# Extract entities
for ent in doc.ents:
    print(f"{ent.text}: {ent.label_}")

# POS tagging
for token in doc:
    print(f"{token.text}: {token.pos_}")

14.6.3 Genomics and Bioinformatics

14.6.3.1 Biopython - Computational biology

Installation:

conda install -c conda-forge biopython

Quick example:

from Bio import SeqIO
from Bio.Seq import Seq
from Bio import Align

# Read FASTA file
for record in SeqIO.parse("sequences.fasta", "fasta"):
    print(f"ID: {record.id}")
    print(f"Sequence: {record.seq}")
    print(f"Length: {len(record)}")

# Sequence alignment
aligner = Align.PairwiseAligner()
seq1 = Seq("ACCGT")
seq2 = Seq("ACGT")
alignments = aligner.align(seq1, seq2)
print(alignments[0])

14.7 6. Cloud and Compute Resources

14.7.1 Cloud Platforms

14.7.1.1 AWS (Amazon Web Services)

Relevant services: - EC2 - Virtual machines with GPU support - SageMaker - Managed ML platform - S3 - Object storage - Lambda - Serverless computing

Getting started:

# Install AWS CLI
pip install awscli

# Configure credentials
aws configure

# Launch GPU instance
aws ec2 run-instances --image-id ami-xxx --instance-type p3.2xlarge
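
Moving data between your machine and S3 from Python typically goes through the boto3 SDK. A minimal sketch, assuming credentials are already configured with aws configure; the bucket name and file paths are placeholders:

import boto3

s3 = boto3.client("s3")

# Upload a local file to a bucket (placeholder names)
s3.upload_file("data/processed/features.csv", "my-health-ai-bucket", "features/features.csv")

# Download it back to another location
s3.download_file("my-health-ai-bucket", "features/features.csv", "data/external/features.csv")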

14.7.1.2 Google Cloud Platform (GCP)

Relevant services: - Compute Engine - VMs with TPU support - Vertex AI - ML platform - Cloud Storage - Object storage - BigQuery - Data warehouse

Getting started:

# Install gcloud CLI
# Download from https://cloud.google.com/sdk/docs/install

# Initialize
gcloud init

# Create VM with GPU
gcloud compute instances create ml-instance \
  --machine-type=n1-highmem-8 \
  --accelerator=type=nvidia-tesla-t4,count=1
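
BigQuery results can be pulled straight into a pandas DataFrame with the google-cloud-bigquery client. A sketch, assuming the google-cloud-bigquery and db-dtypes packages are installed, credentials are set up via gcloud, and the table name is a placeholder:

from google.cloud import bigquery

client = bigquery.Client()

# Placeholder query against a hypothetical surveillance table
query = """
    SELECT report_date, region, SUM(new_cases) AS cases
    FROM `my-project.surveillance.daily_cases`
    GROUP BY report_date, region
"""
df = client.query(query).to_dataframe()
print(df.head())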

14.7.1.3 Microsoft Azure

Relevant services: - Azure ML - ML platform - Azure Databricks - Apache Spark - Blob Storage - Object storage


14.7.2 Free GPU Resources

14.7.2.1 Google Colab (Free tier)

Pros: - Free GPU access (T4) - No setup required - Pre-installed libraries

Cons: - Limited to 12-hour sessions - Can’t customize environment fully - No persistent storage (use Google Drive)
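
To work around the lack of persistent storage, mount Google Drive from inside a Colab notebook using Colab's built-in google.colab module:

# Mount Google Drive so outputs persist across Colab sessions
from google.colab import drive
drive.mount('/content/drive')

# Files written under this path are saved to your Drive
output_dir = '/content/drive/MyDrive/public-health-ai'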


14.7.2.2 Kaggle Kernels (Free GPU)

Pros: - Free GPU (P100) for 30 hours/week - Access to Kaggle datasets - Community notebooks

Access: https://www.kaggle.com/code


14.8 7. Reproducibility and Collaboration

14.8.1 Version Control with Git

Essential Git commands:

# Initialize repository
git init

# Add files
git add .
git commit -m "Initial commit"

# Create branch
git checkout -b feature/new-model

# Push to remote
git remote add origin https://github.com/username/repo.git
git push -u origin main

# Pull updates
git pull origin main

# Merge branch
git checkout main
git merge feature/new-model

.gitignore for data science:

# Data
data/
*.csv
*.h5
*.pkl

# Models
models/
*.pth
*.h5
*.pkl

# Notebooks
.ipynb_checkpoints/
*-checkpoint.ipynb

# Environment
venv/
.venv/
__pycache__/
*.pyc

# IDE
.vscode/
.idea/

# OS
.DS_Store
Thumbs.db

14.8.2 Containerization with Docker

Why Docker? - Reproducible environments - Works anywhere (local, cloud, clusters) - Version-controlled infrastructure - Easy collaboration

Example Dockerfile:

FROM python:3.10-slim

WORKDIR /app

# Install system dependencies
RUN apt-get update && apt-get install -y \
    gcc \
    && rm -rf /var/lib/apt/lists/*

# Copy requirements
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Copy application
COPY . .

# Run
CMD ["python", "train.py"]

Building and running:

# Build image
docker build -t disease-predictor:latest .

# Run container
docker run -v $(pwd)/data:/app/data disease-predictor:latest

# Interactive shell
docker run -it disease-predictor:latest /bin/bash

14.8.3 Project Structure

Recommended directory layout:

project-name/
├── README.md              # Project overview
├── requirements.txt       # Python dependencies
├── environment.yml        # Conda environment
├── .gitignore            # Git ignore rules
├── Dockerfile            # Container definition
├── setup.py              # Package installation
│
├── data/
│   ├── raw/              # Original, immutable data
│   ├── processed/        # Cleaned, transformed data
│   └── external/         # Third-party data
│
├── notebooks/            # Jupyter notebooks
│   ├── 01-exploration.ipynb
│   ├── 02-preprocessing.ipynb
│   └── 03-modeling.ipynb
│
├── src/                  # Source code
│   ├── __init__.py
│   ├── data/             # Data processing
│   │   ├── __init__.py
│   │   └── make_dataset.py
│   ├── features/         # Feature engineering
│   │   ├── __init__.py
│   │   └── build_features.py
│   ├── models/           # Model definitions
│   │   ├── __init__.py
│   │   ├── train.py
│   │   └── predict.py
│   └── visualization/    # Plotting code
│       ├── __init__.py
│       └── visualize.py
│
├── models/               # Trained models
│   └── .gitkeep
│
├── reports/              # Analysis reports
│   ├── figures/          # Plots and images
│   └── results.md
│
└── tests/                # Unit tests
    ├── __init__.py
    └── test_features.py

14.9 8. Pre-trained Models and Datasets

14.9.1 Model Hubs

14.9.1.1 Hugging Face Hub

Access thousands of models:

from transformers import AutoModel, AutoTokenizer

# Load pre-trained model
model_name = "microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract"
model = AutoModel.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Use for downstream task
text = "Patient diagnosed with pneumonia."
inputs = tokenizer(text, return_tensors="pt")
outputs = model(**inputs)

Browse models: https://huggingface.co/models


14.9.1.2 PyTorch Hub

Access:

import torch

# Load pre-trained model
model = torch.hub.load('pytorch/vision', 'resnet50', pretrained=True)
model.eval()

14.9.2 Public Health Datasets

14.9.2.1 Johns Hopkins COVID-19 Data

import pandas as pd

url = "https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_confirmed_global.csv"
covid_data = pd.read_csv(url)

14.9.2.2 WHO Global Health Observatory

Access: https://www.who.int/data/gho
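
The GHO also exposes an OData API that can be read directly from Python. A sketch, assuming the documented endpoint at https://ghoapi.azureedge.net/api and the requests library; the indicator code is a placeholder to replace with one browsed from the API's indicator list:

import pandas as pd
import requests

# Placeholder indicator code; browse available codes via the GHO OData API
url = "https://ghoapi.azureedge.net/api/WHOSIS_000001"
response = requests.get(url, timeout=30)
gho_df = pd.DataFrame(response.json()["value"])
print(gho_df.head())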


14.9.2.3 CDC Data

WONDER Database: https://wonder.cdc.gov/


14.9.2.4 Kaggle Datasets

Public health datasets: - COVID-19 forecasting - Disease surveillance - Health indicators

Access: https://www.kaggle.com/datasets


14.10 9. Building Your First Pipeline

14.10.1 Complete Example: Disease Prediction Pipeline

1. Project setup:

# Create project
mkdir disease-predictor
cd disease-predictor

# Create environment
conda create -n disease-pred python=3.10
conda activate disease-pred

# Install packages
conda install pandas numpy scikit-learn matplotlib jupyter mlflow

# Initialize git
git init
echo "data/\nmodels/\n*.pyc" > .gitignore
git add .gitignore
git commit -m "Initial commit"

2. Data loading (src/data/load_data.py):

import pandas as pd
from sklearn.model_selection import train_test_split

def load_and_split_data(filepath, test_size=0.2, random_state=42):
    """Load data and split into train/test sets"""

    # Load data
    df = pd.read_csv(filepath)

    # Separate features and target
    X = df.drop('outcome', axis=1)
    y = df['outcome']

    # Split
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=test_size, random_state=random_state, stratify=y
    )

    return X_train, X_test, y_train, y_test

if __name__ == "__main__":
    X_train, X_test, y_train, y_test = load_and_split_data('data/raw/disease_data.csv')
    print(f"Training samples: {len(X_train)}")
    print(f"Test samples: {len(X_test)}")

3. Feature engineering (src/features/build_features.py):

from sklearn.preprocessing import StandardScaler
import pandas as pd

def build_features(X_train, X_test):
    """Engineer features and scale data"""

    # Create age groups
    X_train['age_group'] = pd.cut(X_train['age'], bins=[0, 18, 65, 100], labels=['child', 'adult', 'senior'])
    X_test['age_group'] = pd.cut(X_test['age'], bins=[0, 18, 65, 100], labels=['child', 'adult', 'senior'])

    # One-hot encode categorical
    X_train = pd.get_dummies(X_train, columns=['age_group', 'region'])
    X_test = pd.get_dummies(X_test, columns=['age_group', 'region'])

    # Align columns
    X_train, X_test = X_train.align(X_test, join='left', axis=1, fill_value=0)

    # Scale numerical features
    scaler = StandardScaler()
    numerical_cols = ['age', 'bmi', 'blood_pressure']

    X_train[numerical_cols] = scaler.fit_transform(X_train[numerical_cols])
    X_test[numerical_cols] = scaler.transform(X_test[numerical_cols])

    return X_train, X_test, scaler

4. Model training (src/models/train.py):

import mlflow
import mlflow.sklearn
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score, accuracy_score, classification_report

def train_model(X_train, y_train, X_test, y_test):
    """Train and log model with MLflow"""

    mlflow.set_experiment("disease-prediction")

    with mlflow.start_run():
        # Parameters
        params = {
            'n_estimators': 100,
            'max_depth': 10,
            'min_samples_split': 5,
            'random_state': 42
        }
        mlflow.log_params(params)

        # Train
        model = RandomForestClassifier(**params)
        model.fit(X_train, y_train)

        # Evaluate
        y_pred = model.predict(X_test)
        y_pred_proba = model.predict_proba(X_test)[:, 1]

        auc = roc_auc_score(y_test, y_pred_proba)
        acc = accuracy_score(y_test, y_pred)

        mlflow.log_metrics({
            'auc': auc,
            'accuracy': acc
        })

        # Log model
        mlflow.sklearn.log_model(model, "model")

        print(f"AUC: {auc:.3f}, Accuracy: {acc:.3f}")
        print(classification_report(y_test, y_pred))

        return model

if __name__ == "__main__":
    from src.data.load_data import load_and_split_data
    from src.features.build_features import build_features

    # Load data
    X_train, X_test, y_train, y_test = load_and_split_data('data/raw/disease_data.csv')

    # Build features
    X_train, X_test, scaler = build_features(X_train, X_test)

    # Train model
    model = train_model(X_train, y_train, X_test, y_test)

5. Run pipeline:

# Train model
python src/models/train.py

# View MLflow UI
mlflow ui

# Open browser: http://localhost:5000
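
Once a run looks good in the MLflow UI, the logged model can be loaded back for inference. A minimal sketch: the run ID is a placeholder copied from the UI, and X_new stands for new records passed through the same feature pipeline as training:

import mlflow.sklearn

# Placeholder run ID copied from the MLflow UI
run_id = "abc123"
model = mlflow.sklearn.load_model(f"runs:/{run_id}/model")

# Predict risk probabilities for new, already-featurized data
probabilities = model.predict_proba(X_new)[:, 1]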

14.11 10. Key Takeaways

Important: Essential Tools Summary

Development Environment: - Python distribution: Anaconda or Miniconda - IDE: Jupyter Lab for exploration, VS Code for production code - Virtual environments: conda or venv for dependency isolation

Core Libraries: - Data: pandas (tabular), NumPy (arrays), Polars (fast alternative) - ML: scikit-learn (classical), XGBoost/LightGBM (boosting) - Deep Learning: PyTorch (research), TensorFlow/Keras (production) - Visualization: Matplotlib (static), Seaborn (statistical), Plotly (interactive)

MLOps: - Experiment tracking: MLflow or Weights & Biases - Data versioning: DVC - Containerization: Docker

Best Practices: - Use virtual environments for every project - Version control code with Git - Track experiments systematically - Structure projects consistently - Document dependencies (requirements.txt, environment.yml) - Containerize for reproducibility


14.12 Hands-On Exercise

14.12.1 Exercise: Set Up Your AI Development Environment

Objective: Build a complete, reproducible development environment for AI/ML projects.

Tasks:

  1. Install Miniconda

    • Download and install from official website
    • Verify installation with conda --version
  2. Create project environment

    conda create -n my-ml-project python=3.10
    conda activate my-ml-project
    conda install pandas numpy scikit-learn matplotlib jupyter mlflow
  3. Set up VS Code

    • Install VS Code
    • Install Python and Jupyter extensions
    • Configure to use your conda environment
  4. Create project structure

    mkdir my-ml-project
    cd my-ml-project
    mkdir data models notebooks src tests
    touch README.md requirements.txt .gitignore
  5. Initialize Git repository

    git init
    git add .
    git commit -m "Initial project structure"
  6. Create simple ML pipeline

    • Load sample dataset (e.g., from scikit-learn)
    • Train simple model
    • Log experiment with MLflow
    • Save model
  7. Export environment

    conda env export > environment.yml
    pip freeze > requirements.txt
  8. Create Dockerfile

    • Write Dockerfile for your project
    • Build image
    • Test running training script in container

Bonus: - Set up Weights & Biases account and log experiment - Create GitHub repository and push code - Set up DVC for data versioning


14.13 Check Your Understanding

Test your knowledge of AI development tools and best practices. Each question builds on the key concepts from this chapter.

Note: Question 1

A data scientist starts a new project analyzing hospital readmission data. They install packages globally on their system using pip install without creating a virtual environment. Six months later, they need to share the project with a colleague, who reports numerous package version conflicts and cannot run the code. What is the PRIMARY lesson this scenario illustrates about development environments?

  1. pip is unreliable and conda should always be used instead
  2. Virtual environments are essential for dependency isolation, reproducibility, and collaboration
  3. Python packages change too rapidly to maintain long-term projects
  4. Sharing code is inherently difficult and requires containerization from day one

Correct Answer: b) Virtual environments are essential for dependency isolation, reproducibility, and collaboration

This scenario illustrates a fundamental best practice in software development: dependency isolation through virtual environments. The chapter emphasizes this throughout the environment setup section, listing virtual environments as critical for avoiding version conflicts, ensuring reproducible environments, and enabling easy sharing.

The problem: Installing packages globally causes several issues: - Dependency conflicts: Different projects need different versions of the same package (project A needs pandas 1.3, project B needs pandas 2.0) - System pollution: Global installation affects all Python projects on the system - Irreproducibility: The colleague doesn’t know which package versions were used - Fragility: Upgrading a package for one project breaks another project

The chapter explains that virtual environments solve these problems by creating isolated Python environments per project, each with its own package versions. When the data scientist eventually tries to share, they can’t easily communicate “install these exact versions” because they never tracked them.

Correct approach:

# Create isolated environment
conda create -n readmission-analysis python=3.10
conda activate readmission-analysis

# Install packages
conda install pandas numpy scikit-learn

# Export for sharing
conda env export > environment.yml
pip freeze > requirements.txt

Now colleagues can recreate the exact environment:

conda env create -f environment.yml

Option (a) misses the point—both pip and conda support virtual environments (venv/virtualenv for pip, conda environments for conda). The tool isn’t the issue; the practice of isolation is. Option (c) is defeatist—yes, packages evolve, which is precisely why version pinning and environment management are necessary. The solution isn’t to avoid Python but to use proper tooling. Option (d) overstates the requirement—while Docker is valuable (discussed in the chapter), virtual environments + requirements files handle most sharing scenarios. Jumping to containers “from day one” for every project is overkill.

The chapter’s section on virtual environments lists clear benefits: - Isolate project dependencies - Avoid version conflicts - Reproducible environments - Easy to share and recreate

Real-world implications: This scenario mirrors the famous “works on my machine” problem that plagues software development. The healthcare AI context makes it worse—if the model can’t be reproduced, research findings can’t be validated, clinical deployments become risky, and regulatory compliance (FDA requirements for reproducibility) may be violated.

The chapter provides specific commands for both conda and venv, emphasizing that the choice of tool matters less than consistent use of isolation. For public health practitioners: start every project with environment creation, export dependencies regularly (don’t wait until sharing time), document environment setup in README, and consider it part of “scientific method” for computational work—others must be able to reproduce your results.

The broader principle: computational reproducibility requires infrastructure. Just as lab scientists document reagent lot numbers and equipment settings, data scientists must document software environments. Virtual environments are the foundational tool for this documentation.

Note: Question 2

A research team is building a COVID-19 forecasting model. They need to decide between using pandas (which they know well) versus Polars (which they’ve heard is much faster). Their current dataset is 5 million rows and model training takes 30 minutes with pandas. Which factors should MOST heavily influence their decision?

  1. Always choose Polars because speed improvements are always valuable
  2. Evaluate the actual bottleneck: if data processing is <5% of runtime, pandas is fine; if it’s >50%, consider Polars; also weigh team familiarity and deadline pressure
  3. Stick with pandas because learning new tools wastes time that could be spent on modeling
  4. Use both—pandas for development and Polars for production

Correct Answer: b) Evaluate the actual bottleneck: if data processing is <5% of runtime, pandas is fine; if it’s >50%, consider Polars; also weigh team familiarity and deadline pressure

This question tests understanding of pragmatic tool selection—a key theme throughout the chapter. The scenario requires evaluating trade-offs between performance, learning curve, team capability, and project constraints rather than defaulting to “fastest tool wins” or “familiar tool wins.”

The chapter discusses Polars as “10-100x faster than pandas for large datasets” but presents it as an alternative, not a replacement for all cases. The key is understanding when speed matters enough to justify switching costs.

Decision framework:

1. Profile to find bottlenecks (a timing sketch follows this framework): If the full run takes 30 minutes:

  • Data loading + preprocessing: 2 minutes (7%) → pandas is fine
  • Model training: 28 minutes (93%) → optimize the model, not the data pipeline
  • Data loading + preprocessing: 20 minutes (67%) → Polars might help significantly

The chapter’s philosophy emphasizes: optimize what matters. Premature optimization wastes time.

2. Calculate switching costs: - Learning curve: How long to become proficient with Polars? - Code rewrite: How much existing code needs translation? - Testing: How much validation is needed after switching? - Documentation: Does switching affect reproducibility or team knowledge transfer?

3. Consider team and project context: - Team familiarity: If everyone knows pandas, collective productivity matters more than individual script speed - Deadline pressure: If launch is imminent, don’t introduce new tools mid-project - Long-term maintenance: If Polars reduces training time from 30 min to 3 min and you’ll run thousands of experiments, the investment pays off

4. Evaluate alternative optimizations: Before switching tools, consider: - Optimize pandas code (vectorization, efficient data types, chunking) - Sample data for development, full data for final training - Parallel processing with Dask (pandas-compatible API) - Cloud resources for faster compute
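
Before committing to any of these options, a quick pass with a timer shows where the 30 minutes actually go. A rough sketch using only the standard library; load_and_preprocess and train_model are hypothetical stand-ins for the team's own functions:

import time

def timed(label, func, *args, **kwargs):
    """Run func, print how long it took, and return its result."""
    start = time.perf_counter()
    result = func(*args, **kwargs)
    print(f"{label}: {time.perf_counter() - start:.1f} s")
    return result

df = timed("load + preprocess", load_and_preprocess, "data/raw/cases.csv")  # hypothetical helper
model = timed("model training", train_model, df)                            # hypothetical helper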

Option (a)—“always choose faster”—ignores switching costs and assumes speed is the only bottleneck. This violates the chapter’s emphasis on pragmatic tool selection. The chapter presents multiple tools precisely because different situations call for different solutions. Option (c)—“never learn new tools”—is the opposite error. Technical debt accumulates when teams refuse to adopt better tools. If Polars genuinely solves a major bottleneck and the project has longevity, learning it is worthwhile. However, context matters. Option (d)—“use both”—introduces unnecessary complexity. Maintaining two implementations (one for dev, one for prod) creates version skew risks, doubles testing burden, and complicates debugging. The chapter’s emphasis on reproducibility argues against dev/prod environment divergence.

The chapter’s guidance on Polars: - When to consider: Large datasets (>1GB), complex aggregations, memory constraints - When to skip: Small datasets, simple operations, team unfamiliar and no time to learn - Alternative: Polars has pandas-like API, lowering learning curve

Real-world considerations for public health AI: - Regulatory: If the model is for FDA-cleared device, environment consistency matters—don’t switch tools casually - Collaboration: If working with epidemiologists unfamiliar with data science, pandas’ widespread adoption and documentation base creates lower barriers - Iteration speed: In outbreak response, getting a working model fast may matter more than optimal performance

The chapter’s broader theme is pragmatic tool selection: choose tools that match your problem, team, and constraints. The AI ecosystem offers hundreds of options precisely because no single tool is best for all situations. The chapter provides guidance on multiple alternatives (pandas vs. Polars, PyTorch vs. TensorFlow, Jupyter vs. VS Code) to equip readers for context-appropriate decisions.

For public health practitioners: resist both technology dogmatism (“always use X”) and technology conservatism (“never change from Y”). Profile your code to find actual bottlenecks, evaluate the cost-benefit of switching tools, involve the team in decisions, and make trade-offs explicit. Sometimes the best choice is the tool your team already knows; sometimes it’s worth investing in something better. Data and context should drive the decision.

Note: Question 3

A hospital AI team has been using Jupyter notebooks for all development, from initial exploration through model deployment. They experience several problems: notebooks with 200+ cells becoming unwieldy, difficulty tracking which cells were run in what order, merge conflicts when multiple team members edit notebooks, and challenges deploying notebook code to production. What does this scenario suggest about development tool selection?

  1. Jupyter notebooks are inappropriate for ML development and should be avoided
  2. The team needs better notebook organization through naming conventions and documentation
  3. Different development stages require different tools: notebooks for exploration, Python scripts/IDEs for production code, with clear transitions between phases
  4. The team should switch entirely to VS Code for all development activities

Correct Answer: c) Different development stages require different tools: notebooks for exploration, Python scripts/IDEs for production code, with clear transitions between phases

This question tests understanding of the chapter’s nuanced guidance on IDE selection and workflow design. The chapter doesn’t advocate for one tool over another universally but rather explains each tool’s strengths and appropriate use cases.

The chapter’s tool guidance:

Jupyter Notebooks: - Best for: Data exploration, iterative analysis, teaching, documenting analysis with narrative - Strengths: Cell-based execution, inline visualizations, rich markdown, easy sharing - Limitations: Not mentioned explicitly but implied by VS Code’s positioning

VS Code: - Best for: Production code, debugging complex issues, multiple files, Git integration - Strengths: Refactoring, testing, deployment, version control

The scenario describes classic problems that arise when using exploration tools for production workflows:

1. Unwieldy 200+ cell notebooks: Notebooks become unmaintainable at scale. They lack the modular structure of properly organized Python packages with functions, classes, and modules. The chapter’s project structure section shows how production code should be organized:

src/
├── data/make_dataset.py
├── features/build_features.py
├── models/train.py
└── visualization/visualize.py

This modular structure enables: - Testing individual components - Reusing code across projects - Clear dependencies and interfaces - Team division of labor

2. Execution order confusion: Notebooks allow non-linear execution. Cell 5 might depend on Cell 10, but this dependency is invisible. Production scripts have clear top-to-bottom execution, making behavior predictable. The chapter’s example training script (train.py) demonstrates linear, predictable flow.

3. Git merge conflicts: Notebooks are JSON files with embedded metadata (execution counts, outputs, cell IDs). When two people edit a notebook, Git struggles to merge. Python scripts are plain text, merging cleanly. The chapter’s emphasis on Git for version control implicitly assumes code structured for version control.

4. Deployment challenges: Deploying a notebook to production is awkward. You need to strip interactive elements, convert to script, handle cell dependencies, and remove exploration code. Starting with production-structured code avoids this conversion.

The right workflow (implied by the chapter):

Phase 1: Exploration (Jupyter) - Load data, visualize distributions - Try different features, models - Document findings with markdown - Iterate quickly

Phase 2: Production (VS Code + scripts) - Extract working code from notebooks - Refactor into functions and modules - Add tests and documentation - Structure according to chapter’s project template - Commit to Git with clean history

Phase 3: Deployment - Production scripts are deployment-ready - Clear entry points (train.py, predict.py) - Containerization (Dockerfile) - CI/CD integration

Option (a) dismisses notebooks entirely, contradicting the chapter’s guidance that notebooks excel for exploration. Many critical data science insights come from exploratory work best done interactively. Option (b) treats symptoms rather than causes. Better organization helps, but fundamental limitations remain—notebooks aren’t designed for production code, and trying to force that use case creates friction. Option (d) makes the opposite error of (a)—VS Code isn’t ideal for initial exploration. The chapter presents VS Code as powerful for production development, not for replacing notebooks’ exploratory strengths.

The chapter’s project structure provides the solution: - notebooks/ directory for exploration (01-exploration.ipynb, 02-preprocessing.ipynb) - src/ directory for production code (modular Python scripts) - Clear workflow: explore in notebooks, productionize in src/

Real-world implications for public health AI:

Regulatory compliance: FDA-cleared medical devices require validated, version-controlled code. Notebooks don’t meet these standards; properly structured Python packages do.

Reproducibility: Research publications require reproducible methods. The chapter emphasizes containers and dependency management, which work much better with scripts than notebooks.

Team collaboration: Hospital AI teams include epidemiologists, clinicians, data scientists, ML engineers. Notebooks work for sharing analysis with stakeholders; scripts work for engineers building production systems.

Operational reliability: When the model runs in production serving patient predictions, it must be reliable, tested, monitored. The chapter’s MLOps section discusses logging, error handling, monitoring—all easier with production-structured code.

The chapter’s toolkit philosophy is use the right tool for the job. Notebooks and scripts aren’t competitors but collaborators in a complete workflow. Mature ML practice involves knowing when to use each and how to transition between them. The team’s problems stem not from choosing notebooks but from failing to transition to production-appropriate tools when development matured beyond exploration.

For public health practitioners: embrace notebooks for exploration and communication, transition to scripts for production, maintain both in your project (chapter’s directory structure includes both notebooks/ and src/), document the workflow, and train team members on when to use each tool. The chapter provides all these pieces; the practitioner’s job is assembling them into appropriate workflows.

Note: Question 4

A team building a disease outbreak prediction model tracks experiments inconsistently: some parameters are in Excel spreadsheets, some in paper notebooks, some in code comments. When they need to reproduce their best model six months later for a regulatory submission, they cannot determine which hyperparameters, data version, or random seed were used. What MLOps practice would have MOST directly prevented this problem?

  1. Better documentation practices and standardized note-taking
  2. Systematic experiment tracking using MLflow or Weights & Biases to automatically log parameters, metrics, code versions, and artifacts
  3. More frequent code reviews and team meetings to discuss experiments
  4. Requiring all experiments to be approved by a senior scientist before running

Correct Answer: b) Systematic experiment tracking using MLflow or Weights & Biases to automatically log parameters, metrics, code versions, and artifacts

This question tests understanding of MLOps experiment tracking—a core component of the chapter’s toolkit section. The scenario describes a classic reproducibility crisis that experiment tracking tools are specifically designed to prevent.

The problem: The team has no systematic record of: - Hyperparameters: learning rate, model architecture, regularization - Data version: which preprocessing, which data split - Code version: which model code, which feature engineering - Random seeds: critical for reproducibility in ML - Environment: library versions, hardware (GPU vs CPU) - Results: metrics, artifacts, model files

Six months later, they can’t reproduce the “best model” because this information is scattered, incomplete, or lost.

The chapter’s solution: MLflow

The chapter provides a concrete example showing exactly how MLflow solves this:

import mlflow
import mlflow.sklearn

mlflow.set_experiment("disease-prediction")

with mlflow.start_run(run_name="random-forest-v1"):
    # Log parameters - AUTOMATICALLY TRACKED
    mlflow.log_param("n_estimators", 100)
    mlflow.log_param("max_depth", 10)
    mlflow.log_param("random_state", 42)  # CRITICAL FOR REPRODUCIBILITY

    # Train model
    model = RandomForestClassifier(n_estimators=100, max_depth=10, random_state=42)
    model.fit(X_train, y_train)

    # Log metrics - AUTOMATICALLY TRACKED
    y_pred_proba = model.predict_proba(X_test)[:, 1]
    auc = roc_auc_score(y_test, y_pred_proba)
    mlflow.log_metric("auc", auc)

    # Log model - ARTIFACT STORED
    mlflow.sklearn.log_model(model, "model")

    # Log plots - ARTIFACTS STORED
    plt.savefig("roc_curve.png")
    mlflow.log_artifact("roc_curve.png")

What MLflow captures automatically:

  1. Parameters: All hyperparameters logged explicitly
  2. Metrics: Performance metrics at training and validation
  3. Artifacts: Models, plots, data files
  4. Code version: Git commit hash if the repo is initialized
  5. Environment: Library versions if captured
  6. Timestamp: When the experiment ran
  7. User: Who ran the experiment
  8. Hardware: System information

Six months later, the team can: - Query MLflow: “Show me all runs with AUC > 0.9” - Find the best run’s ID - Retrieve exact parameters, code version, model file - Reproduce by running with those exact settings

The chapter also discusses Weights & Biases as an alternative with similar capabilities plus: - Better visualizations - Team collaboration features - Hyperparameter sweep automation - Model versioning

Why other options are insufficient:

Option (a)—better documentation—is necessary but insufficient. Manual documentation is: - Error-prone: People forget to record things - Inconsistent: Different team members document differently - Time-consuming: Creates friction (“I’ll document it later”) - Incomplete: Hard to capture everything manually - Not programmatic: Can’t query “find all experiments with learning_rate < 0.01”

The chapter emphasizes systematic, automated tracking precisely because manual processes fail.

Option (c)—code reviews and meetings—helps team coordination but doesn’t solve the tracking problem. Discussions don’t create machine-readable records. When the regulatory submission happens months later, meeting notes won’t suffice.

Option (d)—approval requirements—adds bureaucracy without solving the core issue. Even approved experiments need tracking. This option slows the team without improving reproducibility.

The regulatory angle (important for public health AI):

The scenario mentions “regulatory submission”—likely FDA for a medical device. FDA 21 CFR Part 11 and related guidance require: - Traceability: Ability to trace model provenance - Reproducibility: Ability to recreate exact models - Auditability: Records of all development decisions - Validation: Evidence of systematic testing

MLflow/W&B provide exactly this audit trail. The chapter’s emphasis on these tools reflects not just development convenience but regulatory necessity.

The chapter’s complete MLOps stack:

  • Experiment tracking: MLflow or W&B (this problem)
  • Data versioning: DVC (tracks which data version)
  • Code versioning: Git (tracks which code version)
  • Environment: Docker + requirements.txt (tracks which libraries)
  • Project structure: Standard directories (organizes everything)

Together, these provide complete reproducibility—the ability to recreate any experiment exactly.

Real-world workflow:

The chapter’s example training script shows the integration:

1. Initialize MLflow experiment
2. Log all parameters at start
3. Train model
4. Log all metrics
5. Save model as artifact
6. Query MLflow UI to compare runs

This becomes habitual: every experiment automatically tracked, every model reproducible.

For public health practitioners:

The chapter makes MLflow/W&B setup straightforward:

- Install: pip install mlflow or pip install wandb
- Wrap training code with tracking calls
- View UI: mlflow ui provides a web interface
- Query programmatically or via the UI

- Cost: minimal (W&B free for individuals/academics, MLflow open-source)
- Benefit: enormous (complete reproducibility, regulatory compliance, team coordination)

The reproducibility crisis in ML is well-documented. The chapter addresses it directly by presenting experiment tracking as essential infrastructure, not optional. The scenario’s regulatory submission failure illustrates precisely why: without systematic tracking, even successful models may be unusable if they can’t be reproduced.

The key principle: automate tracking, don’t rely on human memory. The chapter provides the tools; the practitioner must use them consistently. Start using experiment tracking on day one of a project, not when reproducibility problems emerge.

NoteQuestion 5

A data science team is deciding between TensorFlow/Keras and PyTorch for a new image-based tuberculosis screening project. The team has limited deep learning experience, needs to deploy to hospital systems within 6 months, and the model must run on edge devices (low-power tablets). Which factors should MOST heavily influence their framework choice?

  1. PyTorch because it’s more popular in research and has more cutting-edge features
  2. TensorFlow because Keras provides easier learning curve, TensorFlow Lite enables edge deployment, and TensorFlow Serving supports production deployment—matching their constraints
  3. Neither—they should use scikit-learn since deep learning is overkill for this problem
  4. Both—prototype in PyTorch for flexibility, then convert to TensorFlow for deployment

Correct Answer: b) TensorFlow because Keras provides easier learning curve, TensorFlow Lite enables edge deployment, and TensorFlow Serving supports production deployment—matching their constraints

This question tests understanding of the chapter’s framework comparison and requires matching tool capabilities to project requirements. The scenario deliberately provides specific constraints that favor one framework over the other.

The chapter’s framework comparison:

PyTorch:

- “Pythonic, intuitive API”
- “Dynamic computation graphs”
- “Strong research community”
- “Excellent for NLP and computer vision”

TensorFlow/Keras:

- “Industry standard for production”
- “TensorFlow Serving for deployment”
- “TensorFlow Lite for mobile/edge”
- “Keras API for ease of use”

Analyzing project constraints:

1. Limited deep learning experience: The chapter positions Keras as having an easier learning curve. The example code shows:

# Keras: Very concise, readable
from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential([
    layers.Dense(64, activation='relu', input_shape=(10,)),
    layers.Dropout(0.2),
    layers.Dense(1, activation='sigmoid')
])

model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy', 'AUC'])
model.fit(X_train, y_train, epochs=50)  # X_train, y_train assumed to be defined

PyTorch requires more boilerplate (define class, forward method, training loop). For beginners, Keras’ high-level API reduces cognitive load.
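
For comparison, a rough PyTorch sketch of the same small network illustrates that boilerplate (layer sizes mirror the Keras example; X_train_tensor and y_train_tensor are assumed to be float tensors, with targets shaped (N, 1)):

import torch
import torch.nn as nn

class Net(nn.Module):
    def __init__(self):
        super().__init__()
        self.layers = nn.Sequential(
            nn.Linear(10, 64), nn.ReLU(), nn.Dropout(0.2),
            nn.Linear(64, 1), nn.Sigmoid(),
        )

    def forward(self, x):
        return self.layers(x)

model = Net()
criterion = nn.BCELoss()
optimizer = torch.optim.Adam(model.parameters())

for epoch in range(50):  # training loop written by hand
    optimizer.zero_grad()
    loss = criterion(model(X_train_tensor), y_train_tensor)
    loss.backward()
    optimizer.step()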

2. Edge device deployment (low-power tablets): The chapter explicitly states “TensorFlow Lite for mobile/edge.” This is a critical capability. TensorFlow Lite:

- Optimizes models for mobile/edge devices
- Reduces model size and inference latency
- Supports quantization for lower precision (faster, smaller)
- Has extensive mobile deployment tooling
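
A minimal conversion sketch, assuming a trained Keras model named model (the output filename is illustrative):

import tensorflow as tf

converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]  # enables post-training quantization
tflite_model = converter.convert()

with open("tb_screening_model.tflite", "wb") as f:  # illustrative filename
    f.write(tflite_model)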

PyTorch has mobile deployment (PyTorch Mobile) but TensorFlow Lite is more mature and widely adopted for healthcare applications.

3. Hospital production deployment within 6 months: The chapter emphasizes “TensorFlow Serving for deployment” and calls TensorFlow the “industry standard for production.” TensorFlow Serving:

- Production-grade model serving system
- Handles versioning, A/B testing, monitoring
- Integrates with enterprise IT systems
- Well-documented for healthcare deployments

The 6-month timeline is tight. Learning a framework AND building production infrastructure is challenging. TensorFlow’s ecosystem provides more out-of-the-box production tooling.

4. Image-based TB screening: Both frameworks excel at computer vision. This constraint doesn’t differentiate them. Both have excellent pre-trained models (TensorFlow Hub, PyTorch Hub), transfer learning capabilities, and computer vision libraries.

The decision matrix:

| Factor | PyTorch | TensorFlow/Keras |
|---|---|---|
| Learning curve | Moderate | Easy (Keras) |
| Research flexibility | Excellent | Good |
| Production tooling | Adequate | Excellent |
| Edge deployment | PyTorch Mobile | TensorFlow Lite ✓ |
| Enterprise support | Growing | Mature ✓ |
| Time to production | Longer | Shorter ✓ |

Given constraints (beginner team, edge deployment, tight timeline), TensorFlow/Keras matches better.

Option (a) prioritizes research popularity over project needs. “Cutting-edge features” matter for research, not for deploying a TB screening tool to hospitals. The chapter distinguishes research use (PyTorch strengths) from production use (TensorFlow strengths). This project is production-focused.

Option (c) dismisses deep learning prematurely. The chapter discusses deep learning specifically for image analysis. TB screening from chest X-rays is a canonical deep learning application where CNNs outperform traditional computer vision. Scikit-learn lacks the image-specific capabilities that deep learning frameworks provide.

Option (d) suggests using both frameworks—doubling learning curve, maintaining two implementations, and introducing conversion complexity. The chapter doesn’t recommend this approach. Framework conversion (PyTorch → TensorFlow) is non-trivial, error-prone, and time-consuming. With a 6-month deadline, focusing on one framework makes sense.

Additional considerations from the chapter:

Pre-trained models: The chapter discusses both TensorFlow Hub and PyTorch Hub. For TB screening, transfer learning from ImageNet models is common. Both frameworks support this, though specific TB screening models may be available for one or the other—worth checking model zoos before deciding.
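
A hedged transfer-learning sketch in Keras, for illustration only (the backbone choice, input size, and single-output head are assumptions, not a clinical recommendation):

from tensorflow import keras
from tensorflow.keras import layers

# Frozen ImageNet backbone plus a small classification head
base = keras.applications.MobileNetV2(
    input_shape=(224, 224, 3), include_top=False, weights="imagenet"
)
base.trainable = False

model = keras.Sequential([
    base,
    layers.GlobalAveragePooling2D(),
    layers.Dense(1, activation="sigmoid"),  # binary output: TB vs. no TB
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["AUC"])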

Healthcare AI examples: The handbook’s earlier chapters (clinical AI, imaging) likely reference TensorFlow implementations because of its production maturity in healthcare settings.

Team trajectory: If the team later shifts to research (exploring novel architectures), they could learn PyTorch then. For initial production deployment, TensorFlow’s tooling provides scaffolding that accelerates delivery.

Real-world context for public health AI:

FDA clearance: If the TB screening tool needs FDA clearance, TensorFlow’s maturity and documentation for medical devices provides advantages. Regulatory submissions benefit from using widely-validated tooling.

Hospital IT integration: Hospital IT departments are conservative and prefer mature, supported technology. TensorFlow’s enterprise backing (Google) and widespread deployment precedents reduce institutional friction.

Resource constraints: Public health settings often have limited computational resources. TensorFlow Lite’s optimization for low-power devices directly addresses this reality.

Maintenance: After initial deployment, who maintains the model? If the team is small or experiences turnover, TensorFlow’s larger talent pool (due to industry adoption) makes hiring easier.

The chapter’s philosophy:

The chapter presents multiple tools not to create confusion but to equip readers for context-appropriate decisions. PyTorch vs. TensorFlow isn’t “which is better?” but “which matches our needs?”

For this scenario: beginner team + edge deployment + tight timeline = TensorFlow/Keras.

Different scenario (experienced team, research project, no deployment constraints) might favor PyTorch.

For public health practitioners:

When choosing frameworks:

1. List project constraints (team experience, deployment target, timeline, requirements)
2. Map them to framework capabilities (the chapter provides this mapping)
3. Choose the framework that matches constraints, not the most popular or newest
4. Start learning/prototyping quickly to validate the choice
5. Commit to one framework rather than hedging across multiple

The chapter provides information for informed decisions; practitioners must apply decision frameworks that prioritize project success over tool fashion. This question tests that practical decision-making skill—understanding not just what each tool does but when to use which tool.

NoteQuestion 6

A graduate student is working on epidemic forecasting for their dissertation. They maintain all their code and data on their laptop’s desktop folder with names like “model_v2_final_FINAL.py” and “data_cleaned.csv.” After their laptop crashes and they discover their backups are outdated, they lose three months of work. Which combination of practices from the chapter would have MOST effectively prevented this catastrophe?

  1. Cloud storage (Google Drive, Dropbox) for automatic backup
  2. Git for code version control + GitHub for remote backup + DVC for data versioning + remote storage (S3/GCS) for data backup
  3. More frequent manual backups to external hard drives
  4. Working exclusively on cloud platforms (Google Colab, AWS SageMaker) instead of local machines

Correct Answer: b) Git for code version control + GitHub for remote backup + DVC for data versioning + remote storage (S3/GCS) for data backup

This question synthesizes multiple practices from the chapter’s reproducibility and collaboration section. The scenario illustrates common disasters that proper version control and backup infrastructure prevent.

The problem breakdown:

1. Code loss (“model_v2_final_FINAL.py”):

- No version control
- Manual versioning through filenames (error-prone, unmaintainable)
- Single point of failure (laptop)
- Can’t recover previous versions
- No collaboration capability

2. Data loss (“data_cleaned.csv”):

- No data versioning
- No tracking of data transformations
- No remote backup
- Can’t reproduce data cleaning steps

3. Lack of remote backup:

- All work on a single device
- Backups not systematic/automated
- No ability to recover from device failure

The chapter’s solution:

Git for code version control: The chapter emphasizes Git throughout, providing concrete examples:

git init                          # Initialize repository
git add .                         # Stage files
git commit -m "Initial commit"    # Create checkpoint

Benefits:

- Every commit is a checkpoint: Can recover any previous version
- No more “final_FINAL” naming: Git tracks versions automatically
- Branching: Experiment safely without destroying working code
- History: See what changed, when, and why
- Merging: Combine different development paths

With Git, the student never loses code. Even if laptop crashes, the commit history is preserved (especially when combined with remote backup).

GitHub for remote backup: The chapter’s Git section shows:

git remote add origin https://github.com/username/repo.git
git push -u origin main

Benefits:

- Automatic offsite backup: Every git push backs up to the cloud
- Free for public repos: Academic work benefits from open science
- Collaboration ready: Advisor/committee can review code
- Institutional compliance: Many universities require code archiving

When laptop crashes, student clones from GitHub on new machine. No code lost.

DVC for data versioning: The chapter introduces DVC specifically for versioning large datasets (which Git doesn’t handle well):

dvc init                          # Initialize DVC in Git repo
dvc add data/raw/epidemic_data.csv  # Track data file (creates a small .dvc pointer)
dvc remote add -d storage s3://mybucket/dvc-store
git add data/raw/epidemic_data.csv.dvc data/raw/.gitignore .dvc/config
git commit -m "Track epidemic data with DVC"   # Commit the pointer and remote config, not the data
dvc push                          # Upload data to remote storage

# After crash, on new machine:
git clone <repo-url>              # Get code (including the .dvc pointer)
dvc pull                          # Get data from remote storage

Benefits:

- Version large datasets: Git struggles with >100MB files, DVC handles terabytes
- Track data transformations: Record preprocessing steps
- Remote storage: Data backed up to S3/GCS/Azure
- Reproduce pipelines: DVC tracks data processing pipelines

With DVC, the student’s “data_cleaned.csv” is versioned, backed up remotely, and accompanied by code that documents the cleaning process. Data loss is prevented, and the cleaning process is reproducible.

Why this combination is essential:

Git alone doesn’t solve the data problem (large files, binary data). DVC alone doesn’t version code. Together, they provide complete protection for computational research.

Analyzing alternatives:

Option (a)—Cloud storage (Drive/Dropbox):

- Pros: Easy, automatic backup
- Cons: Not version control (overwrites files, limited history), no meaningful history/diffs, doesn’t track relationships between code and data, not designed for code workflows, poor collaboration features for code

Option (a) prevents total loss but doesn’t provide versioning, reproducibility, or proper collaboration. The chapter presents it as insufficient for serious development work.

Option (c)—Manual backups to external drives:

- Pros: Physical control of data
- Cons: Requires discipline (people forget), no automation, drives can fail too, versioning still manual, doesn’t enable collaboration

Manual processes fail. The chapter emphasizes automated systems precisely because humans are unreliable about backups.

Option (d)—Cloud platforms exclusively:

- Pros: Built-in backup, compute resources
- Cons: Vendor lock-in, requires internet, limited customization, can be expensive, doesn’t address version control (still need Git)

Cloud platforms are tools, not replacements for version control. The chapter discusses Colab for compute, not as a version control solution. Colab + Git + GitHub is appropriate; Colab alone is not.

The chapter’s complete reproducibility stack:

- Code versioning: Git
- Code backup: GitHub/GitLab
- Data versioning: DVC
- Data backup: S3/GCS/Azure
- Environment: Docker + requirements.txt
- Experiments: MLflow/W&B
- Project structure: Standard directories

This stack provides:

- Recoverability: Device failure doesn’t cause data loss
- Reproducibility: Anyone can recreate the exact environment and results
- Collaboration: Team members work together efficiently
- Auditability: Complete history of all changes
- Correctness: Version control enables review and validation

Real-world implications for dissertation work:

1. Advisor review: “Send me your code” → “Here’s the GitHub link.” Much cleaner than emailing zip files.

2. Committee examination: “How did you clean the data?” → “See commit 3a7f9b2 and the DVC pipeline.” Full traceability of methods.

3. Publication: “Make code available” → Already on GitHub with a DOI. Meets open science requirements.

4. Post-graduation: Future researchers can build on the work because code + data + environment are documented and preserved.

The “three months of work” loss:

Without version control, this is catastrophic. With Git + GitHub:

- Latest commit is from yesterday → lose one day at most
- DVC + remote storage → data fully backed up
- git clone + dvc pull on a new machine → back to work in hours

The chapter’s project setup exercise (Section 12):

The hands-on exercise walks through exactly this setup:

1. Create environment (conda/venv)
2. Create project structure (directories)
3. Initialize Git
4. Create GitHub repo
5. Set up DVC
6. Build pipeline
7. Export environment
8. Create Dockerfile

This is the chapter’s recommended starting point for ANY project. Following it would have completely prevented the student’s disaster.

For public health practitioners:

The chapter presents this infrastructure not as optional nice-to-have but as essential professional practice. Just as laboratory researchers maintain lab notebooks and document procedures, computational researchers must version control code and data.

Start every project with:

mkdir my-project
cd my-project
git init                          # Version control from day one
dvc init                          # Data versioning from day one
git remote add origin <github-url>  # Remote backup from day one
dvc remote add -d storage <s3-url>  # Data backup from day one

The chapter provides all these tools and explains their use. The practitioner’s responsibility is establishing habits: commit regularly, push frequently, version data systematically. These practices prevent catastrophes and enable reproducible science.

The scenario’s disaster—three months of work lost—would be impossible with proper version control. This isn’t hypothetical; this happens to real students/researchers regularly. The chapter equips readers with tools to prevent it. The question tests whether readers understand not just what the tools do but why they’re essential and how they work together to provide comprehensive protection.


14.13 Discussion Questions

  1. Environment choice: When would you choose Jupyter notebooks over VS Code for development? What are the trade-offs?

  2. Library selection: For a new tabular data classification project, would you start with scikit-learn, XGBoost, or deep learning? Why?

  3. Cloud vs. local: When does it make sense to move from local development to cloud computing? What factors influence this decision?

  4. Reproducibility: How would you ensure someone else can exactly reproduce your analysis? What tools and practices are essential?

  5. Tool overload: The AI ecosystem has hundreds of tools. How do you decide what to learn? What’s essential vs. nice-to-have?

  6. Version control for data: Git works well for code, but large datasets pose challenges. How would you version control a 100GB dataset?


14.14 Further Resources

14.14.1 📚 Books

14.14.2 📄 Documentation

14.14.3 💻 Interactive Learning

14.14.4 🎓 Online Courses

14.14.5 🎯 Cheat Sheets


Next: Chapter 15: Building Your First Project →