14 Your AI Toolkit
Estimated time: 60-90 minutes
Prerequisites: - Chapter 2: Just Enough AI to Be Dangerous - Basic ML concepts - Chapter 3: The Data Problem - Data fundamentals
14.1 What You’ll Build
By working through this chapter, you will set up:
- Complete Python Development Environment - Anaconda/Miniconda, Jupyter, VS Code
- Project Template - Structured directory with configuration files and documentation
- Model Training Pipeline - End-to-end workflow from data to trained model
- MLOps Stack - Experiment tracking, version control, and model registry
- Deployment-Ready Container - Docker configuration for reproducible environments
14.2 1. Introduction: Choosing the Right Tools
14.2.1 The Modern AI Stack
The AI ecosystem is vast and evolving rapidly. This chapter focuses on:
- Open-source tools (accessible, transparent, community-supported)
- Python-first (dominant language for data science and ML)
- Production-ready (tools that scale from prototype to deployment)
- Public health-relevant (applicable to epidemiology and population health)
14.2.2 Tool Categories
┌────────────────────────────────────────────────────────────┐
│ AI Development Stack │
├────────────────────────────────────────────────────────────┤
│ │
│ Development Environment │
│ ├─ IDE/Editor (VS Code, PyCharm, Jupyter) │
│ ├─ Package Manager (conda, pip) │
│ └─ Version Control (Git, GitHub) │
│ │
│ Core Libraries │
│ ├─ Data (pandas, numpy, polars) │
│ ├─ ML (scikit-learn, XGBoost, LightGBM) │
│ ├─ Deep Learning (PyTorch, TensorFlow, JAX) │
│ └─ Visualization (matplotlib, seaborn, plotly) │
│ │
│ MLOps & Workflow │
│ ├─ Experiment Tracking (MLflow, Weights & Biases) │
│ ├─ Data Versioning (DVC, LakeFS) │
│ ├─ Pipeline Orchestration (Airflow, Prefect) │
│ └─ Model Serving (FastAPI, TorchServe, TFServing) │
│ │
│ Infrastructure │
│ ├─ Containerization (Docker, Kubernetes) │
│ ├─ Cloud Platforms (AWS, GCP, Azure) │
│ ├─ Compute (GPUs, TPUs, serverless) │
│ └─ Databases (PostgreSQL, MongoDB, InfluxDB) │
│ │
│ Domain-Specific │
│ ├─ Epidemiology (EpiModel, surveillance-py) │
│ ├─ Genomics (BioPython, scikit-bio) │
│ ├─ GIS/Spatial (GeoPandas, folium) │
│ └─ NLP/LLMs (Hugging Face, spaCy, LangChain) │
│ │
└────────────────────────────────────────────────────────────┘
14.3 2. Setting Up Your Development Environment
14.3.1 Python Distribution: Anaconda vs. Miniconda
Anaconda (Recommended for beginners)
Pros: - Pre-installed with 250+ popular data science packages - Graphical installer and Navigator UI - Includes Jupyter, Spyder, VS Code - No compilation needed for scientific libraries
Cons: - Large download (~3 GB) - Takes significant disk space (~5 GB) - May include packages you don’t need
Installation:
# Download from https://www.anaconda.com/download
# Follow graphical installer
# Verify installation
conda --version
python --version
Miniconda (Recommended for experienced users)
Pros: - Minimal installer (~50 MB) - Install only what you need - Faster environment creation - Same conda functionality
Cons: - Requires manual package installation - Command-line focused - More setup steps
Installation:
# Download from https://docs.conda.io/en/latest/miniconda.html
# Linux/macOS
bash Miniconda3-latest-Linux-x86_64.sh
# Windows: Run .exe installer
# Verify
conda --version
14.3.2 Creating Virtual Environments
Why virtual environments?
- Isolate project dependencies
- Avoid version conflicts
- Reproducible environments
- Easy to share and recreate
Creating environments with conda:
# Create new environment with Python 3.10
conda create -n publichealth-ai python=3.10
# Activate environment
conda activate publichealth-ai
# Install core packages
conda install numpy pandas scikit-learn matplotlib jupyter
# Deactivate when done
conda deactivate
# List all environments
conda env list
# Remove environment
conda env remove -n publichealth-ai
Alternative: venv (Python built-in)
# Create virtual environment
python -m venv venv
# Activate
# Windows
venv\Scripts\activate
# Linux/macOS
source venv/bin/activate
# Install packages
pip install numpy pandas scikit-learn matplotlib jupyter
# Deactivate
deactivate
14.3.3 IDE Options
14.3.3.1 Jupyter Notebook/Lab (Interactive exploration)
Best for: - Data exploration and visualization - Iterative analysis - Teaching and presentations - Documenting analysis with narrative
Installation:
conda install jupyter jupyterlab
# Launch Jupyter Lab
jupyter lab
# Launch classic Jupyter Notebook
jupyter notebook
Key features: - Cell-based execution - Inline plots and visualizations - Rich markdown support - Easy sharing (.ipynb files)
Extensions to install:
# Variable inspector
pip install jupyterlab-variableinspector
# Table of contents
pip install jupyterlab-toc
# Code formatter
pip install jupyterlab-code-formatter black isort
14.3.3.2 VS Code (General-purpose development)
Best for: - Writing production code - Debugging complex issues - Working with multiple files - Git integration - Remote development (SSH, WSL, containers)
Installation: Download from https://code.visualstudio.com/
Essential extensions for AI/ML:
{
"recommendations": [
"ms-python.python", // Python IntelliSense
"ms-toolsai.jupyter", // Jupyter notebooks
"ms-python.vscode-pylance", // Type checking
"ms-python.black-formatter", // Code formatting
"ms-python.debugpy", // Debugging
"ms-azuretools.vscode-docker", // Docker support
"github.copilot", // AI code suggestions
"visualstudioexptteam.vscodeintellicode" // ML-powered suggestions
]
}
Useful settings (.vscode/settings.json):
{
"python.defaultInterpreterPath": "/path/to/conda/env/python",
"python.formatting.provider": "black",
"python.linting.enabled": true,
"python.linting.pylintEnabled": true,
"editor.formatOnSave": true,
"files.autoSave": "afterDelay",
"jupyter.askForKernelRestart": false
}
14.3.3.3 PyCharm (Professional Python IDE)
Best for: - Large-scale projects - Professional development - Advanced debugging - Refactoring tools
Editions: - Community - Free, open-source, sufficient for most projects - Professional - Paid, adds scientific tools, web frameworks, databases
Download: https://www.jetbrains.com/pycharm/
14.3.3.4 Google Colab (Cloud-based, free GPUs)
Best for: - Learning and experimentation - GPU access without local hardware - Quick prototyping - Sharing notebooks
Features: - Free GPU/TPU access (limited hours) - Pre-installed ML libraries - Google Drive integration - No setup required
Access: https://colab.research.google.com/
Example: Enabling GPU:
# Check GPU availability
import torch
print(f"GPU available: {torch.cuda.is_available()}")
print(f"GPU name: {torch.cuda.get_device_name(0) if torch.cuda.is_available() else 'None'}")
14.4 3. Essential Python Libraries
14.4.1 Data Manipulation
14.4.1.1 pandas - Data structures and analysis
Installation:
conda install pandas
# or
pip install pandas
Core functionality: - DataFrames (2D tables) - Series (1D arrays) - Reading/writing CSV, Excel, SQL, JSON - Data cleaning, filtering, aggregation - Time series analysis
Quick start:
import pandas as pd

# Read data
df = pd.read_csv('covid_cases.csv')

# Basic exploration
print(df.head())
print(df.info())
print(df.describe())

# Filtering
recent = df[df['date'] > '2023-01-01']

# Grouping
by_region = df.groupby('region')['cases'].sum()

# Time series
df['date'] = pd.to_datetime(df['date'])
df.set_index('date', inplace=True)
weekly = df.resample('W').sum()
Resources: - 📚 pandas documentation - 📄 10 minutes to pandas
14.4.1.2 NumPy - Numerical computing
Installation:
conda install numpy
Core functionality: - Multi-dimensional arrays - Mathematical operations - Linear algebra - Random number generation - Broadcasting
Quick start:
import numpy as np

# Create arrays
arr = np.array([1, 2, 3, 4, 5])
matrix = np.array([[1, 2], [3, 4]])

# Operations
print(arr.mean(), arr.std())
print(matrix @ matrix.T)  # Matrix multiplication

# Random numbers (for simulations)
np.random.seed(42)
samples = np.random.normal(loc=100, scale=15, size=1000)
14.4.1.3 Polars - Fast DataFrame library (Alternative to pandas)
Installation:
pip install polars
Why Polars? - 10-100x faster than pandas for large datasets - Lower memory usage - Better query optimization - Lazy evaluation
Quick comparison:
import polars as pl

# Polars syntax
df_polars = pl.read_csv('large_dataset.csv')
result = (
    df_polars
    .filter(pl.col('age') > 18)
    .group_by('region')
    .agg([
        pl.col('cases').sum().alias('total_cases'),
        pl.col('deaths').sum().alias('total_deaths')
    ])
)

# pandas equivalent
import pandas as pd

df_pandas = pd.read_csv('large_dataset.csv')
result = (
    df_pandas[df_pandas['age'] > 18]
    .groupby('region')
    .agg({'cases': 'sum', 'deaths': 'sum'})
)
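The comparison above uses eager execution. Polars' lazy evaluation, listed among its advantages, defers work until collect() is called so the query optimizer can combine and prune operations. A minimal sketch under the same assumptions (the hypothetical large_dataset.csv above, a recent Polars release):

import polars as pl

# scan_csv builds a query plan without reading the file;
# collect() runs the optimized plan in a single pass
lazy_result = (
    pl.scan_csv('large_dataset.csv')
    .filter(pl.col('age') > 18)
    .group_by('region')
    .agg(pl.col('cases').sum().alias('total_cases'))
    .collect()
)
print(lazy_result)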
14.4.2 Machine Learning
14.4.2.1 scikit-learn - Classical ML algorithms
Installation:
conda install scikit-learn
Coverage: - Classification, regression, clustering - Preprocessing and feature engineering - Model selection and evaluation - Pipeline construction
Core modules:
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score, classification_report
# Typical workflow
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Preprocessing
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Training
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train_scaled, y_train)

# Evaluation
y_pred = model.predict(X_test_scaled)
y_pred_proba = model.predict_proba(X_test_scaled)[:, 1]
print(f"AUC-ROC: {roc_auc_score(y_test, y_pred_proba):.3f}")
print(classification_report(y_test, y_pred))
Resources: - 📚 scikit-learn documentation - 📄 Choosing the right estimator
14.4.2.2 XGBoost - Gradient boosting
Installation:
conda install -c conda-forge xgboost
# or
pip install xgboost
Why XGBoost? - State-of-the-art performance on tabular data - Handles missing values - Built-in regularization - Feature importance
Quick start:
import xgboost as xgb
from sklearn.metrics import roc_auc_score
# Create DMatrix (XGBoost's data structure)
dtrain = xgb.DMatrix(X_train, label=y_train)
dtest = xgb.DMatrix(X_test, label=y_test)

# Parameters
params = {
    'objective': 'binary:logistic',
    'eval_metric': 'auc',
    'max_depth': 6,
    'eta': 0.1,
    'subsample': 0.8,
    'colsample_bytree': 0.8
}

# Train
model = xgb.train(
    params,
    dtrain,
    num_boost_round=100,
    evals=[(dtrain, 'train'), (dtest, 'test')],
    early_stopping_rounds=10,
    verbose_eval=10
)

# Predict
y_pred_proba = model.predict(dtest)
print(f"AUC-ROC: {roc_auc_score(y_test, y_pred_proba):.3f}")
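The feature-importance capability noted above can be read directly from the trained Booster. A brief sketch continuing from the model trained in the example ('gain' is one of several importance types XGBoost supports):

# Gain-based feature importance from the trained Booster
importance = model.get_score(importance_type='gain')
for feature, score in sorted(importance.items(), key=lambda kv: kv[1], reverse=True)[:10]:
    print(f"{feature}: {score:.2f}")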
14.4.2.3 LightGBM - Fast gradient boosting
Installation:
conda install -c conda-forge lightgbm
Advantages over XGBoost: - Faster training on large datasets - Lower memory usage - Better handling of categorical features - Leaf-wise tree growth (vs. level-wise)
Quick start:
import lightgbm as lgb
# Create dataset
train_data = lgb.Dataset(X_train, label=y_train)
test_data = lgb.Dataset(X_test, label=y_test, reference=train_data)

# Parameters
params = {
    'objective': 'binary',
    'metric': 'auc',
    'boosting_type': 'gbdt',
    'num_leaves': 31,
    'learning_rate': 0.05
}

# Train
model = lgb.train(
    params,
    train_data,
    num_boost_round=100,
    valid_sets=[test_data],
    callbacks=[lgb.early_stopping(10)]
)
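One advantage listed above is native categorical handling: columns can be passed as the pandas category dtype rather than one-hot encoded. A minimal sketch, assuming X_train is a pandas DataFrame with a hypothetical 'region' column:

import lightgbm as lgb

# Mark the column as categorical; LightGBM splits on it natively
X_train_cat = X_train.copy()
X_train_cat['region'] = X_train_cat['region'].astype('category')

train_data = lgb.Dataset(
    X_train_cat,
    label=y_train,
    categorical_feature=['region']
)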
14.4.3 Deep Learning
14.4.3.1 PyTorch - Dynamic neural networks
Installation:
# CPU only
conda install pytorch torchvision torchaudio cpuonly -c pytorch
# GPU (CUDA 11.8)
conda install pytorch torchvision torchaudio pytorch-cuda=11.8 -c pytorch -c nvidia
Why PyTorch? - Pythonic, intuitive API - Dynamic computation graphs - Strong research community - Excellent for NLP and computer vision
Simple example:
import torch
import torch.nn as nn
import torch.optim as optim
# Define model
class SimpleNN(nn.Module):
    def __init__(self, input_dim, hidden_dim, output_dim):
        super(SimpleNN, self).__init__()
        self.fc1 = nn.Linear(input_dim, hidden_dim)
        self.relu = nn.ReLU()
        self.fc2 = nn.Linear(hidden_dim, output_dim)
        self.sigmoid = nn.Sigmoid()

    def forward(self, x):
        x = self.fc1(x)
        x = self.relu(x)
        x = self.fc2(x)
        x = self.sigmoid(x)
        return x

# Initialize
model = SimpleNN(input_dim=10, hidden_dim=64, output_dim=1)
criterion = nn.BCELoss()
optimizer = optim.Adam(model.parameters(), lr=0.001)

# Training loop
for epoch in range(100):
    # Forward pass
    outputs = model(X_train_tensor)
    loss = criterion(outputs, y_train_tensor)

    # Backward pass
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    if (epoch + 1) % 10 == 0:
        print(f'Epoch [{epoch+1}/100], Loss: {loss.item():.4f}')
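The loop above assumes X_train_tensor and y_train_tensor already exist. A minimal conversion sketch, assuming X_train_scaled and y_train come from the scikit-learn preprocessing example earlier in this chapter:

import numpy as np
import torch

# Convert numpy arrays to float tensors; BCELoss expects targets shaped like the outputs
X_train_tensor = torch.tensor(np.asarray(X_train_scaled), dtype=torch.float32)
y_train_tensor = torch.tensor(np.asarray(y_train), dtype=torch.float32).reshape(-1, 1)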
Resources: - 📚 PyTorch documentation - 🎓 PyTorch tutorials
14.4.3.2 TensorFlow/Keras - Production-ready deep learning
Installation:
pip install tensorflow
Why TensorFlow? - Industry standard for production - TensorFlow Serving for deployment - TensorFlow Lite for mobile/edge - Keras API for ease of use
Keras example:
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers
# Define model
model = keras.Sequential([
    layers.Dense(64, activation='relu', input_shape=(10,)),
    layers.Dropout(0.2),
    layers.Dense(32, activation='relu'),
    layers.Dropout(0.2),
    layers.Dense(1, activation='sigmoid')
])

# Compile
model.compile(
    optimizer='adam',
    loss='binary_crossentropy',
    metrics=['accuracy', 'AUC']
)

# Train
history = model.fit(
    X_train, y_train,
    epochs=50,
    batch_size=32,
    validation_split=0.2,
    callbacks=[
        keras.callbacks.EarlyStopping(patience=5, restore_best_weights=True)
    ]
)

# Evaluate
test_loss, test_acc, test_auc = model.evaluate(X_test, y_test)
print(f'Test AUC: {test_auc:.3f}')
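TensorFlow Lite, mentioned above for mobile and edge deployment, converts a trained Keras model into a compact on-device format. A minimal sketch converting the model defined above; enabling Optimize.DEFAULT for post-training quantization is an optional choice here:

import tensorflow as tf

# Convert the trained Keras model for edge devices
converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]  # post-training quantization
tflite_model = converter.convert()

with open('model.tflite', 'wb') as f:
    f.write(tflite_model)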
14.4.4 Visualization
14.4.4.1 Matplotlib - Publication-quality plots
Installation:
conda install matplotlib
Core plotting:
import matplotlib.pyplot as plt
import numpy as np
# Line plot
plt.figure(figsize=(10, 6))
plt.plot(dates, cases, label='Daily Cases', linewidth=2)
plt.xlabel('Date')
plt.ylabel('Number of Cases')
plt.title('COVID-19 Daily Cases')
plt.legend()
plt.grid(alpha=0.3)
plt.tight_layout()
plt.show()

# Multiple subplots
fig, axes = plt.subplots(2, 2, figsize=(12, 10))

axes[0, 0].plot(data['cases'])
axes[0, 0].set_title('Cases')

axes[0, 1].plot(data['deaths'])
axes[0, 1].set_title('Deaths')

axes[1, 0].scatter(data['age'], data['severity'])
axes[1, 0].set_title('Age vs. Severity')

axes[1, 1].hist(data['recovery_time'], bins=30)
axes[1, 1].set_title('Recovery Time Distribution')

plt.tight_layout()
plt.show()
14.4.4.2 Seaborn - Statistical visualization
Installation:
conda install seaborn
Why Seaborn? - Built on matplotlib - Beautiful default styles - Statistical plots out-of-the-box - Great for exploratory analysis
Examples:
import seaborn as sns
# Set style
sns.set_style('whitegrid')
sns.set_palette('husl')

# Distribution plot
sns.histplot(data=df, x='age', hue='outcome', kde=True)
plt.title('Age Distribution by Outcome')
plt.show()

# Correlation heatmap
plt.figure(figsize=(10, 8))
sns.heatmap(df.corr(), annot=True, cmap='coolwarm', center=0)
plt.title('Feature Correlations')
plt.show()

# Pairplot
sns.pairplot(df, hue='disease_status', vars=['age', 'bmi', 'blood_pressure'])
plt.show()

# Box plot
sns.boxplot(data=df, x='region', y='incidence_rate')
plt.xticks(rotation=45)
plt.title('Disease Incidence by Region')
plt.show()
14.4.4.3 Plotly - Interactive visualizations
Installation:
conda install -c plotly plotly
Why Plotly? - Interactive plots (zoom, pan, hover) - Web-based (works in Jupyter, dashboards) - 3D plots - Maps and geospatial visualization
Examples:
import plotly.express as px
import plotly.graph_objects as go
# Interactive line plot
fig = px.line(df, x='date', y='cases', color='region',
              title='COVID-19 Cases by Region')
fig.update_layout(hovermode='x unified')
fig.show()

# Choropleth map
fig = px.choropleth(df,
                    locations='country_code',
                    color='incidence_rate',
                    hover_name='country',
                    color_continuous_scale='Reds',
                    title='Disease Incidence by Country')
fig.show()

# 3D scatter
fig = px.scatter_3d(df, x='age', y='bmi', z='risk_score',
                    color='outcome', size='severity',
                    title='Risk Factors 3D Visualization')
fig.show()
14.5 4. MLOps and Experiment Tracking
14.5.1 MLflow - Experiment tracking and model registry
Installation:
pip install mlflow
Core features: - Experiment tracking (parameters, metrics, artifacts) - Model registry (versioning, staging, production) - Model serving - Project packaging
Example workflow:
import mlflow
import mlflow.sklearn
# Set experiment
mlflow.set_experiment("disease-prediction")

# Start run
with mlflow.start_run(run_name="random-forest-v1"):
    # Log parameters
    mlflow.log_param("n_estimators", 100)
    mlflow.log_param("max_depth", 10)

    # Train model
    model = RandomForestClassifier(n_estimators=100, max_depth=10)
    model.fit(X_train, y_train)

    # Evaluate
    y_pred_proba = model.predict_proba(X_test)[:, 1]
    auc = roc_auc_score(y_test, y_pred_proba)

    # Log metrics
    mlflow.log_metric("auc", auc)
    mlflow.log_metric("accuracy", accuracy_score(y_test, model.predict(X_test)))

    # Log model
    mlflow.sklearn.log_model(model, "model")

    # Log artifacts
    plt.figure()
    plt.plot(fpr, tpr)
    plt.savefig("roc_curve.png")
    mlflow.log_artifact("roc_curve.png")

# View UI
# mlflow ui
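A logged model can be reloaded later for scoring via the runs:/ URI scheme. A brief sketch; the run ID is a placeholder copied from the MLflow UI:

import mlflow.sklearn

run_id = "<run-id-from-mlflow-ui>"  # placeholder
loaded_model = mlflow.sklearn.load_model(f"runs:/{run_id}/model")
predictions = loaded_model.predict_proba(X_test)[:, 1]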
Resources: - 📚 MLflow documentation
14.5.2 Weights & Biases - Experiment tracking and collaboration
Installation:
pip install wandb
Why W&B? - Beautiful visualizations - Team collaboration features - Model versioning - Hyperparameter sweeps - Free for individuals and academics
Quick start:
import wandb
# Initialize
wandb.init(project="public-health-ai", name="experiment-1")

# Log config
wandb.config.update({
    "learning_rate": 0.001,
    "batch_size": 32,
    "epochs": 50
})

# Training loop
for epoch in range(50):
    # ... training code ...

    # Log metrics
    wandb.log({
        "epoch": epoch,
        "train_loss": train_loss,
        "val_loss": val_loss,
        "val_auc": val_auc
    })

# Log model
wandb.log_model(path="model.pth", name="disease-predictor")

# Finish run
wandb.finish()
14.5.3 DVC - Data version control
Installation:
pip install dvc
Why DVC? - Version large datasets (like Git for data) - Track data pipelines - Remote storage (S3, GCS, Azure, SSH) - Reproducible experiments
Setup:
# Initialize DVC in Git repo
git init
dvc init
# Add data to DVC
dvc add data/raw/covid_data.csv
# Add to git
git add data/raw/covid_data.csv.dvc .gitignore
git commit -m "Add COVID data"
# Configure remote storage
dvc remote add -d storage s3://mybucket/dvc-store
dvc push
# On another machine
git clone <repo-url>
dvc pull
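DVC's pipeline tracking, mentioned above, is configured in a dvc.yaml file that records each stage's command, dependencies, and outputs so dvc repro reruns only what changed. A minimal sketch with illustrative paths:

# dvc.yaml (illustrative stage)
stages:
  train:
    cmd: python src/models/train.py
    deps:
      - data/raw/covid_data.csv
      - src/models/train.py
    outs:
      - models/model.pkl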
14.6 5. Domain-Specific Tools
14.6.1 Epidemiology and Public Health
14.6.1.1 EpiEstim - Estimating reproduction number
Installation:
# R package
install.packages("EpiEstim")
Alternative Python: EpiNow2-py
14.6.1.2 Epiweeks - Epidemiological week calculations
Installation:
pip install epiweeks
Usage:
from epiweeks import Week, Year
# Get current epi week
current_week = Week.thisweek()
print(f"Current epi week: {current_week}")

# Convert date to epi week
from datetime import date
week = Week.fromdate(date(2024, 3, 15))
print(f"Epi week for 2024-03-15: {week}")
14.6.1.3 Geospatial Analysis - GeoPandas, Folium
Installation:
conda install geopandas folium
Example:
import geopandas as gpd
import folium
# Read shapefile
gdf = gpd.read_file('countries.shp')

# Merge with disease data
gdf = gdf.merge(disease_data, on='country_code')

# Create interactive map
m = folium.Map(location=[0, 0], zoom_start=2)

folium.Choropleth(
    geo_data=gdf,
    data=gdf,
    columns=['country_code', 'incidence_rate'],
    key_on='feature.properties.country_code',
    fill_color='YlOrRd',
    legend_name='Incidence Rate per 100,000'
).add_to(m)

m.save('disease_map.html')
14.6.2 Natural Language Processing
14.6.2.1 Hugging Face Transformers - Pre-trained language models
Installation:
pip install transformers
Quick start:
from transformers import pipeline
# Sentiment analysis
classifier = pipeline("sentiment-analysis")
result = classifier("The patient reported feeling much better after treatment.")
print(result)

# Named entity recognition (medical entities)
ner = pipeline("ner", model="d4data/biomedical-ner-all")
entities = ner("Patient has diabetes and hypertension.")
print(entities)

# Text generation
generator = pipeline("text-generation", model="gpt2")
text = generator("Symptoms of COVID-19 include", max_length=50)
print(text)
14.6.2.2 spaCy - Industrial-strength NLP
Installation:
pip install spacy
python -m spacy download en_core_web_sm
Medical NLP:
import spacy
# Load model
nlp = spacy.load("en_core_web_sm")

# Process text
text = "Patient presents with fever, cough, and shortness of breath."
doc = nlp(text)

# Extract entities
for ent in doc.ents:
    print(f"{ent.text}: {ent.label_}")

# POS tagging
for token in doc:
    print(f"{token.text}: {token.pos_}")
14.6.3 Genomics and Bioinformatics
14.6.3.1 Biopython - Computational biology
Installation:
conda install -c conda-forge biopython
Quick example:
from Bio import SeqIO
from Bio.Seq import Seq
from Bio import Align
# Read FASTA file
for record in SeqIO.parse("sequences.fasta", "fasta"):
print(f"ID: {record.id}")
print(f"Sequence: {record.seq}")
print(f"Length: {len(record)}")
# Sequence alignment
aligner = Align.PairwiseAligner()
seq1 = Seq("ACCGT")
seq2 = Seq("ACGT")
alignments = aligner.align(seq1, seq2)
print(alignments[0])
14.7 6. Cloud and Compute Resources
14.7.1 Cloud Platforms
14.7.1.1 AWS (Amazon Web Services)
Relevant services: - EC2 - Virtual machines with GPU support - SageMaker - Managed ML platform - S3 - Object storage - Lambda - Serverless computing
Getting started:
# Install AWS CLI
pip install awscli
# Configure credentials
aws configure
# Launch GPU instance
aws ec2 run-instances --image-id ami-xxx --instance-type p3.2xlarge
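Beyond the CLI, S3 is usually accessed from Python through boto3, the AWS SDK (not shown elsewhere in this chapter). A minimal upload sketch; the bucket name and object key are placeholders:

import boto3

# Upload a trained model artifact to S3 (placeholder bucket/key)
s3 = boto3.client('s3')
s3.upload_file('models/model.pkl', 'my-health-ai-bucket', 'models/model.pkl')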
14.7.1.2 Google Cloud Platform (GCP)
Relevant services: - Compute Engine - VMs with TPU support - Vertex AI - ML platform - Cloud Storage - Object storage - BigQuery - Data warehouse
Getting started:
# Install gcloud CLI
# Download from https://cloud.google.com/sdk/docs/install
# Initialize
gcloud init
# Create VM with GPU
gcloud compute instances create ml-instance \
--machine-type=n1-highmem-8 \
--accelerator=type=nvidia-tesla-t4,count=1
14.7.1.3 Microsoft Azure
Relevant services: - Azure ML - ML platform - Azure Databricks - Apache Spark - Blob Storage - Object storage
14.7.2 Free GPU Resources
14.7.2.1 Google Colab (Free tier)
Pros: - Free GPU access (T4) - No setup required - Pre-installed libraries
Cons: - Limited to 12-hour sessions - Can’t customize environment fully - No persistent storage (use Google Drive)
14.7.2.2 Kaggle Kernels (Free GPU)
Pros: - Free GPU (P100) for 30 hours/week - Access to Kaggle datasets - Community notebooks
Access: https://www.kaggle.com/code
14.8 7. Reproducibility and Collaboration
14.8.1 Version Control with Git
Essential Git commands:
# Initialize repository
git init
# Add files
git add .
git commit -m "Initial commit"
# Create branch
git checkout -b feature/new-model
# Push to remote
git remote add origin https://github.com/username/repo.git
git push -u origin main
# Pull updates
git pull origin main
# Merge branch
git checkout main
git merge feature/new-model
.gitignore for data science:
# Data
data/
*.csv
*.h5
*.pkl
# Models
models/
*.pth
*.h5
*.pkl
# Notebooks
.ipynb_checkpoints/
*-checkpoint.ipynb
# Environment
venv/
.venv/
__pycache__/
*.pyc
# IDE
.vscode/
.idea/
# OS
.DS_Store
Thumbs.db
14.8.2 Containerization with Docker
Why Docker? - Reproducible environments - Works anywhere (local, cloud, clusters) - Version-controlled infrastructure - Easy collaboration
Example Dockerfile:
FROM python:3.10-slim
WORKDIR /app
# Install system dependencies
RUN apt-get update && apt-get install -y \
    gcc \
    && rm -rf /var/lib/apt/lists/*
# Copy requirements
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
# Copy application
COPY . .
# Run
CMD ["python", "train.py"]
Building and running:
# Build image
docker build -t disease-predictor:latest .
# Run container
docker run -v $(pwd)/data:/app/data disease-predictor:latest
# Interactive shell
docker run -it disease-predictor:latest /bin/bash
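The Dockerfile above copies a requirements.txt that pins dependencies. A minimal illustrative example; the version numbers are placeholders to adapt to your own project:

# requirements.txt (illustrative pins)
pandas==2.1.4
numpy==1.26.2
scikit-learn==1.3.2
mlflow==2.9.2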
14.8.3 Project Structure
Recommended directory layout:
project-name/
├── README.md # Project overview
├── requirements.txt # Python dependencies
├── environment.yml # Conda environment
├── .gitignore # Git ignore rules
├── Dockerfile # Container definition
├── setup.py # Package installation
│
├── data/
│ ├── raw/ # Original, immutable data
│ ├── processed/ # Cleaned, transformed data
│ └── external/ # Third-party data
│
├── notebooks/ # Jupyter notebooks
│ ├── 01-exploration.ipynb
│ ├── 02-preprocessing.ipynb
│ └── 03-modeling.ipynb
│
├── src/ # Source code
│ ├── __init__.py
│ ├── data/ # Data processing
│ │ ├── __init__.py
│ │ └── make_dataset.py
│ ├── features/ # Feature engineering
│ │ ├── __init__.py
│ │ └── build_features.py
│ ├── models/ # Model definitions
│ │ ├── __init__.py
│ │ ├── train.py
│ │ └── predict.py
│ └── visualization/ # Plotting code
│ ├── __init__.py
│ └── visualize.py
│
├── models/ # Trained models
│ └── .gitkeep
│
├── reports/ # Analysis reports
│ ├── figures/ # Plots and images
│ └── results.md
│
└── tests/ # Unit tests
├── __init__.py
└── test_features.py
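The layout lists an environment.yml for the conda environment. A minimal sketch matching the packages used in this chapter; the environment name and channel choice are illustrative:

# environment.yml
name: publichealth-ai
channels:
  - conda-forge
dependencies:
  - python=3.10
  - pandas
  - numpy
  - scikit-learn
  - matplotlib
  - jupyter
  - pip
  - pip:
      - mlflow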
14.9 8. Pre-trained Models and Datasets
14.9.1 Model Hubs
14.9.1.1 Hugging Face Hub
Access thousands of models:
from transformers import AutoModel, AutoTokenizer
# Load pre-trained model
model_name = "microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract"
model = AutoModel.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Use for downstream task
text = "Patient diagnosed with pneumonia."
inputs = tokenizer(text, return_tensors="pt")
outputs = model(**inputs)
Browse models: https://huggingface.co/models
14.9.1.2 PyTorch Hub
Access:
import torch
# Load pre-trained model
model = torch.hub.load('pytorch/vision', 'resnet50', pretrained=True)
model.eval()
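Pre-trained vision models are typically adapted to a new task by swapping the final classification layer before fine-tuning. A minimal sketch for the ResNet-50 loaded above; the single-logit binary head is an illustrative choice, not part of the original example:

import torch.nn as nn

# Replace the 1000-class ImageNet head with a single-logit binary output
num_features = model.fc.in_features
model.fc = nn.Linear(num_features, 1)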
14.9.2 Public Health Datasets
14.9.2.1 Johns Hopkins COVID-19 Data
import pandas as pd
= "https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_confirmed_global.csv"
url = pd.read_csv(url) covid_data
14.9.2.2 WHO Global Health Observatory
Access: https://www.who.int/data/gho
14.9.2.3 CDC Data
WONDER Database: https://wonder.cdc.gov/
14.9.2.4 Kaggle Datasets
Public health datasets: - COVID-19 forecasting - Disease surveillance - Health indicators
Access: https://www.kaggle.com/datasets
14.10 9. Building Your First Pipeline
14.10.1 Complete Example: Disease Prediction Pipeline
1. Project setup:
# Create project
mkdir disease-predictor
cd disease-predictor
# Create environment
conda create -n disease-pred python=3.10
conda activate disease-pred
# Install packages
conda install pandas numpy scikit-learn matplotlib jupyter mlflow
# Initialize git
git init
echo "data/\nmodels/\n*.pyc" > .gitignore
git add .gitignore
git commit -m "Initial commit"
2. Data loading (src/data/load_data.py):
import pandas as pd
from sklearn.model_selection import train_test_split
def load_and_split_data(filepath, test_size=0.2, random_state=42):
    """Load data and split into train/test sets"""
    # Load data
    df = pd.read_csv(filepath)

    # Separate features and target
    X = df.drop('outcome', axis=1)
    y = df['outcome']

    # Split
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=test_size, random_state=random_state, stratify=y
    )

    return X_train, X_test, y_train, y_test

if __name__ == "__main__":
    X_train, X_test, y_train, y_test = load_and_split_data('data/raw/disease_data.csv')
    print(f"Training samples: {len(X_train)}")
    print(f"Test samples: {len(X_test)}")
3. Feature engineering (src/features/build_features.py):
from sklearn.preprocessing import StandardScaler
import pandas as pd
def build_features(X_train, X_test):
    """Engineer features and scale data"""
    # Create age groups
    X_train['age_group'] = pd.cut(X_train['age'], bins=[0, 18, 65, 100], labels=['child', 'adult', 'senior'])
    X_test['age_group'] = pd.cut(X_test['age'], bins=[0, 18, 65, 100], labels=['child', 'adult', 'senior'])

    # One-hot encode categorical
    X_train = pd.get_dummies(X_train, columns=['age_group', 'region'])
    X_test = pd.get_dummies(X_test, columns=['age_group', 'region'])

    # Align columns
    X_train, X_test = X_train.align(X_test, join='left', axis=1, fill_value=0)

    # Scale numerical features
    scaler = StandardScaler()
    numerical_cols = ['age', 'bmi', 'blood_pressure']

    X_train[numerical_cols] = scaler.fit_transform(X_train[numerical_cols])
    X_test[numerical_cols] = scaler.transform(X_test[numerical_cols])

    return X_train, X_test, scaler
4. Model training (src/models/train.py):
import mlflow
import mlflow.sklearn
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score, accuracy_score, classification_report
def train_model(X_train, y_train, X_test, y_test):
    """Train and log model with MLflow"""
    mlflow.set_experiment("disease-prediction")

    with mlflow.start_run():
        # Parameters
        params = {
            'n_estimators': 100,
            'max_depth': 10,
            'min_samples_split': 5,
            'random_state': 42
        }
        mlflow.log_params(params)

        # Train
        model = RandomForestClassifier(**params)
        model.fit(X_train, y_train)

        # Evaluate
        y_pred = model.predict(X_test)
        y_pred_proba = model.predict_proba(X_test)[:, 1]

        auc = roc_auc_score(y_test, y_pred_proba)
        acc = accuracy_score(y_test, y_pred)

        mlflow.log_metrics({
            'auc': auc,
            'accuracy': acc
        })

        # Log model
        mlflow.sklearn.log_model(model, "model")

        print(f"AUC: {auc:.3f}, Accuracy: {acc:.3f}")
        print(classification_report(y_test, y_pred))

    return model

if __name__ == "__main__":
    from src.data.load_data import load_and_split_data
    from src.features.build_features import build_features

    # Load data
    X_train, X_test, y_train, y_test = load_and_split_data('data/raw/disease_data.csv')

    # Build features
    X_train, X_test, scaler = build_features(X_train, X_test)

    # Train model
    model = train_model(X_train, y_train, X_test, y_test)
5. Run pipeline:
# Train model
python src/models/train.py
# View MLflow UI
mlflow ui
# Open browser: http://localhost:5000
14.11 10. Key Takeaways
Development Environment: - Python distribution: Anaconda or Miniconda - IDE: Jupyter Lab for exploration, VS Code for production code - Virtual environments: conda or venv for dependency isolation
Core Libraries: - Data: pandas (tabular), NumPy (arrays), Polars (fast alternative) - ML: scikit-learn (classical), XGBoost/LightGBM (boosting) - Deep Learning: PyTorch (research), TensorFlow/Keras (production) - Visualization: Matplotlib (static), Seaborn (statistical), Plotly (interactive)
MLOps: - Experiment tracking: MLflow or Weights & Biases - Data versioning: DVC - Containerization: Docker
Best Practices: - Use virtual environments for every project - Version control code with Git - Track experiments systematically - Structure projects consistently - Document dependencies (requirements.txt, environment.yml) - Containerize for reproducibility
14.12 Hands-On Exercise
14.12.1 Exercise: Set Up Your AI Development Environment
Objective: Build a complete, reproducible development environment for AI/ML projects.
Tasks:
Install Miniconda
- Download and install from official website
- Verify installation with conda --version
Create project environment
conda create -n my-ml-project python=3.10
conda activate my-ml-project
conda install pandas numpy scikit-learn matplotlib jupyter mlflow
Set up VS Code
- Install VS Code
- Install Python and Jupyter extensions
- Configure to use your conda environment
Create project structure
mkdir my-ml-project
cd my-ml-project
mkdir data models notebooks src tests
touch README.md requirements.txt .gitignore
Initialize Git repository
git init
git add .
git commit -m "Initial project structure"
Create simple ML pipeline
- Load sample dataset (e.g., from scikit-learn)
- Train simple model
- Log experiment with MLflow
- Save model
Export environment
conda env export > environment.yml
pip freeze > requirements.txt
Create Dockerfile
- Write Dockerfile for your project
- Build image
- Test running training script in container
Bonus: - Set up Weights & Biases account and log experiment - Create GitHub repository and push code - Set up DVC for data versioning
Check Your Understanding
Test your knowledge of AI development tools and best practices. Each question builds on the key concepts from this chapter.
A data scientist starts a new project analyzing hospital readmission data. They install packages globally on their system using pip install
without creating a virtual environment. Six months later, they need to share the project with a colleague, who reports numerous package version conflicts and cannot run the code. What is the PRIMARY lesson this scenario illustrates about development environments?
- pip is unreliable and conda should always be used instead
- Virtual environments are essential for dependency isolation, reproducibility, and collaboration
- Python packages change too rapidly to maintain long-term projects
- Sharing code is inherently difficult and requires containerization from day one
Correct Answer: b) Virtual environments are essential for dependency isolation, reproducibility, and collaboration
This scenario illustrates a fundamental best practice in software development: dependency isolation through virtual environments. The chapter emphasizes this throughout the environment setup section, listing virtual environments as critical for avoiding version conflicts, ensuring reproducible environments, and enabling easy sharing.
The problem: Installing packages globally causes several issues: - Dependency conflicts: Different projects need different versions of the same package (project A needs pandas 1.3, project B needs pandas 2.0) - System pollution: Global installation affects all Python projects on the system - Irreproducibility: The colleague doesn’t know which package versions were used - Fragility: Upgrading a package for one project breaks another project
The chapter explains that virtual environments solve these problems by creating an isolated Python environment per project, each with its own package versions. When the data scientist eventually tries to share, they cannot easily communicate “install these exact versions” because they never tracked them.
Correct approach:
# Create isolated environment
conda create -n readmission-analysis python=3.10
conda activate readmission-analysis
# Install packages
conda install pandas numpy scikit-learn
# Export for sharing
conda env export > environment.yml
pip freeze > requirements.txt
Now colleagues can recreate the exact environment:
conda env create -f environment.yml
Option (a) misses the point—both pip and conda support virtual environments (venv/virtualenv for pip, conda environments for conda). The tool isn’t the issue; the practice of isolation is. Option (c) is defeatist—yes, packages evolve, which is precisely why version pinning and environment management are necessary. The solution isn’t to avoid Python but to use proper tooling. Option (d) overstates the requirement—while Docker is valuable (discussed in the chapter), virtual environments + requirements files handle most sharing scenarios. Jumping to containers “from day one” for every project is overkill.
The chapter’s section on virtual environments lists clear benefits: - Isolate project dependencies - Avoid version conflicts - Reproducible environments - Easy to share and recreate
Real-world implications: This scenario mirrors the famous “works on my machine” problem that plagues software development. The healthcare AI context makes it worse—if the model can’t be reproduced, research findings can’t be validated, clinical deployments become risky, and regulatory compliance (FDA requirements for reproducibility) may be violated.
The chapter provides specific commands for both conda and venv, emphasizing that the choice of tool matters less than consistent use of isolation. For public health practitioners: start every project with environment creation, export dependencies regularly (don’t wait until sharing time), document environment setup in README, and consider it part of “scientific method” for computational work—others must be able to reproduce your results.
The broader principle: computational reproducibility requires infrastructure. Just as lab scientists document reagent lot numbers and equipment settings, data scientists must document software environments. Virtual environments are the foundational tool for this documentation.
A research team is building a COVID-19 forecasting model. They need to decide between using pandas (which they know well) versus Polars (which they’ve heard is much faster). Their current dataset is 5 million rows and model training takes 30 minutes with pandas. Which factors should MOST heavily influence their decision?
- Always choose Polars because speed improvements are always valuable
- Evaluate the actual bottleneck: if data processing is <5% of runtime, pandas is fine; if it’s >50%, consider Polars; also weigh team familiarity and deadline pressure
- Stick with pandas because learning new tools wastes time that could be spent on modeling
- Use both—pandas for development and Polars for production
Correct Answer: b) Evaluate the actual bottleneck: if data processing is <5% of runtime, pandas is fine; if it’s >50%, consider Polars; also weigh team familiarity and deadline pressure
This question tests understanding of pragmatic tool selection—a key theme throughout the chapter. The scenario requires evaluating trade-offs between performance, learning curve, team capability, and project constraints rather than defaulting to “fastest tool wins” or “familiar tool wins.”
The chapter discusses Polars as “10-100x faster than pandas for large datasets” but presents it as an alternative, not a replacement for all cases. The key is understanding when speed matters enough to justify switching costs.
Decision framework:
1. Profile to find bottlenecks: If model training takes 30 minutes total: - Data loading + preprocessing: 2 minutes (7%) → pandas is fine - Model training: 28 minutes (93%) → optimize model, not data pipeline - Data loading + preprocessing: 20 minutes (67%) → Polars might help significantly
The chapter’s philosophy emphasizes: optimize what matters. Premature optimization wastes time.
2. Calculate switching costs: - Learning curve: How long to become proficient with Polars? - Code rewrite: How much existing code needs translation? - Testing: How much validation is needed after switching? - Documentation: Does switching affect reproducibility or team knowledge transfer?
3. Consider team and project context: - Team familiarity: If everyone knows pandas, collective productivity matters more than individual script speed - Deadline pressure: If launch is imminent, don’t introduce new tools mid-project - Long-term maintenance: If Polars reduces training time from 30 min to 3 min and you’ll run thousands of experiments, the investment pays off
4. Evaluate alternative optimizations: Before switching tools, consider: - Optimize pandas code (vectorization, efficient data types, chunking) - Sample data for development, full data for final training - Parallel processing with Dask (pandas-compatible API) - Cloud resources for faster compute
Option (a)—“always choose faster”—ignores switching costs and assumes speed is the only bottleneck. This violates the chapter’s emphasis on pragmatic tool selection. The chapter presents multiple tools precisely because different situations call for different solutions. Option (c)—“never learn new tools”—is the opposite error. Technical debt accumulates when teams refuse to adopt better tools. If Polars genuinely solves a major bottleneck and the project has longevity, learning it is worthwhile. However, context matters. Option (d)—“use both”—introduces unnecessary complexity. Maintaining two implementations (one for dev, one for prod) creates version skew risks, doubles testing burden, and complicates debugging. The chapter’s emphasis on reproducibility argues against dev/prod environment divergence.
The chapter’s guidance on Polars: - When to consider: Large datasets (>1GB), complex aggregations, memory constraints - When to skip: Small datasets, simple operations, team unfamiliar and no time to learn - Alternative: Polars has pandas-like API, lowering learning curve
Real-world considerations for public health AI: - Regulatory: If the model is for FDA-cleared device, environment consistency matters—don’t switch tools casually - Collaboration: If working with epidemiologists unfamiliar with data science, pandas’ widespread adoption and documentation base creates lower barriers - Iteration speed: In outbreak response, getting a working model fast may matter more than optimal performance
The chapter’s broader theme is pragmatic tool selection: choose tools that match your problem, team, and constraints. The AI ecosystem offers hundreds of options precisely because no single tool is best for all situations. The chapter provides guidance on multiple alternatives (pandas vs. Polars, PyTorch vs. TensorFlow, Jupyter vs. VS Code) to equip readers for context-appropriate decisions.
For public health practitioners: resist both technology dogmatism (“always use X”) and technology conservatism (“never change from Y”). Profile your code to find actual bottlenecks, evaluate the cost-benefit of switching tools, involve the team in decisions, and make trade-offs explicit. Sometimes the best choice is the tool your team already knows; sometimes it’s worth investing in something better. Data and context should drive the decision.
A hospital AI team has been using Jupyter notebooks for all development, from initial exploration through model deployment. They experience several problems: notebooks with 200+ cells becoming unwieldy, difficulty tracking which cells were run in what order, merge conflicts when multiple team members edit notebooks, and challenges deploying notebook code to production. What does this scenario suggest about development tool selection?
- Jupyter notebooks are inappropriate for ML development and should be avoided
- The team needs better notebook organization through naming conventions and documentation
- Different development stages require different tools: notebooks for exploration, Python scripts/IDEs for production code, with clear transitions between phases
- The team should switch entirely to VS Code for all development activities
Correct Answer: c) Different development stages require different tools: notebooks for exploration, Python scripts/IDEs for production code, with clear transitions between phases
This question tests understanding of the chapter’s nuanced guidance on IDE selection and workflow design. The chapter doesn’t advocate for one tool over another universally but rather explains each tool’s strengths and appropriate use cases.
The chapter’s tool guidance:
Jupyter Notebooks: - Best for: Data exploration, iterative analysis, teaching, documenting analysis with narrative - Strengths: Cell-based execution, inline visualizations, rich markdown, easy sharing - Limitations: Not mentioned explicitly but implied by VS Code’s positioning
VS Code: - Best for: Production code, debugging complex issues, multiple files, Git integration - Strengths: Refactoring, testing, deployment, version control
The scenario describes classic problems that arise when using exploration tools for production workflows:
1. Unwieldy 200+ cell notebooks: Notebooks become unmaintainable at scale. They lack the modular structure of properly organized Python packages with functions, classes, and modules. The chapter’s project structure section shows how production code should be organized:
src/
├── data/make_dataset.py
├── features/build_features.py
├── models/train.py
└── visualization/visualize.py
This modular structure enables: - Testing individual components - Reusing code across projects - Clear dependencies and interfaces - Team division of labor
2. Execution order confusion: Notebooks allow non-linear execution. Cell 5 might depend on Cell 10, but this dependency is invisible. Production scripts have clear top-to-bottom execution, making behavior predictable. The chapter’s example training script (train.py) demonstrates linear, predictable flow.
3. Git merge conflicts: Notebooks are JSON files with embedded metadata (execution counts, outputs, cell IDs). When two people edit a notebook, Git struggles to merge. Python scripts are plain text, merging cleanly. The chapter’s emphasis on Git for version control implicitly assumes code structured for version control.
4. Deployment challenges: Deploying a notebook to production is awkward. You need to strip interactive elements, convert to script, handle cell dependencies, and remove exploration code. Starting with production-structured code avoids this conversion.
The right workflow (implied by the chapter):
Phase 1: Exploration (Jupyter) - Load data, visualize distributions - Try different features, models - Document findings with markdown - Iterate quickly
Phase 2: Production (VS Code + scripts) - Extract working code from notebooks - Refactor into functions and modules - Add tests and documentation - Structure according to chapter’s project template - Commit to Git with clean history
Phase 3: Deployment - Production scripts are deployment-ready - Clear entry points (train.py, predict.py) - Containerization (Dockerfile) - CI/CD integration
Option (a) dismisses notebooks entirely, contradicting the chapter’s guidance that notebooks excel for exploration. Many critical data science insights come from exploratory work best done interactively. Option (b) treats symptoms rather than causes. Better organization helps, but fundamental limitations remain—notebooks aren’t designed for production code, and trying to force that use case creates friction. Option (d) makes the opposite error of (a)—VS Code isn’t ideal for initial exploration. The chapter presents VS Code as powerful for production development, not for replacing notebooks’ exploratory strengths.
The chapter’s project structure provides the solution: - notebooks/ directory for exploration (01-exploration.ipynb, 02-preprocessing.ipynb) - src/ directory for production code (modular Python scripts) - Clear workflow: explore in notebooks, productionize in src/
Real-world implications for public health AI:
Regulatory compliance: FDA-cleared medical devices require validated, version-controlled code. Notebooks don’t meet these standards; properly structured Python packages do.
Reproducibility: Research publications require reproducible methods. The chapter emphasizes containers and dependency management, which work much better with scripts than notebooks.
Team collaboration: Hospital AI teams include epidemiologists, clinicians, data scientists, ML engineers. Notebooks work for sharing analysis with stakeholders; scripts work for engineers building production systems.
Operational reliability: When the model runs in production serving patient predictions, it must be reliable, tested, monitored. The chapter’s MLOps section discusses logging, error handling, monitoring—all easier with production-structured code.
The chapter’s toolkit philosophy is use the right tool for the job. Notebooks and scripts aren’t competitors but collaborators in a complete workflow. Mature ML practice involves knowing when to use each and how to transition between them. The team’s problems stem not from choosing notebooks but from failing to transition to production-appropriate tools when development matured beyond exploration.
For public health practitioners: embrace notebooks for exploration and communication, transition to scripts for production, maintain both in your project (chapter’s directory structure includes both notebooks/ and src/), document the workflow, and train team members on when to use each tool. The chapter provides all these pieces; the practitioner’s job is assembling them into appropriate workflows.
A team building a disease outbreak prediction model tracks experiments inconsistently: some parameters are in Excel spreadsheets, some in paper notebooks, some in code comments. When they need to reproduce their best model six months later for a regulatory submission, they cannot determine which hyperparameters, data version, or random seed were used. What MLOps practice would have MOST directly prevented this problem?
- Better documentation practices and standardized note-taking
- Systematic experiment tracking using MLflow or Weights & Biases to automatically log parameters, metrics, code versions, and artifacts
- More frequent code reviews and team meetings to discuss experiments
- Requiring all experiments to be approved by a senior scientist before running
Correct Answer: b) Systematic experiment tracking using MLflow or Weights & Biases to automatically log parameters, metrics, code versions, and artifacts
This question tests understanding of MLOps experiment tracking—a core component of the chapter’s toolkit section. The scenario describes a classic reproducibility crisis that experiment tracking tools are specifically designed to prevent.
The problem: The team has no systematic record of: - Hyperparameters: learning rate, model architecture, regularization - Data version: which preprocessing, which data split - Code version: which model code, which feature engineering - Random seeds: critical for reproducibility in ML - Environment: library versions, hardware (GPU vs CPU) - Results: metrics, artifacts, model files
Six months later, they can’t reproduce the “best model” because this information is scattered, incomplete, or lost.
The chapter’s solution: MLflow
The chapter provides a concrete example showing exactly how MLflow solves this:
import mlflow
import mlflow.sklearn

mlflow.set_experiment("disease-prediction")

with mlflow.start_run(run_name="random-forest-v1"):
    # Log parameters - AUTOMATICALLY TRACKED
    mlflow.log_param("n_estimators", 100)
    mlflow.log_param("max_depth", 10)
    mlflow.log_param("random_state", 42)  # CRITICAL FOR REPRODUCIBILITY

    # Train model
    model = RandomForestClassifier(n_estimators=100, max_depth=10, random_state=42)
    model.fit(X_train, y_train)

    # Log metrics - AUTOMATICALLY TRACKED
    auc = roc_auc_score(y_test, y_pred_proba)
    mlflow.log_metric("auc", auc)

    # Log model - ARTIFACT STORED
    mlflow.sklearn.log_model(model, "model")

    # Log plots - ARTIFACTS STORED
    plt.savefig("roc_curve.png")
    mlflow.log_artifact("roc_curve.png")
What MLflow captures automatically: 1. Parameters: All hyperparameters logged explicitly 2. Metrics: Performance metrics at training and validation 3. Artifacts: Models, plots, data files 4. Code version: Git commit hash if repo is initialized 5. Environment: Library versions if captured 6. Timestamp: When experiment ran 7. User: Who ran the experiment 8. Hardware: System information
Six months later, the team can: - Query MLflow: “Show me all runs with AUC > 0.9” - Find the best run’s ID - Retrieve exact parameters, code version, model file - Reproduce by running with those exact settings
The chapter also discusses Weights & Biases as an alternative with similar capabilities plus: - Better visualizations - Team collaboration features - Hyperparameter sweep automation - Model versioning
Why other options are insufficient:
Option (a)—better documentation—is necessary but insufficient. Manual documentation is: - Error-prone: People forget to record things - Inconsistent: Different team members document differently - Time-consuming: Creates friction (“I’ll document it later”) - Incomplete: Hard to capture everything manually - Not programmatic: Can’t query “find all experiments with learning_rate < 0.01”
The chapter emphasizes systematic, automated tracking precisely because manual processes fail.
Option (c)—code reviews and meetings—helps team coordination but doesn’t solve the tracking problem. Discussions don’t create machine-readable records. When the regulatory submission happens months later, meeting notes won’t suffice.
Option (d)—approval requirements—adds bureaucracy without solving the core issue. Even approved experiments need tracking. This option slows the team without improving reproducibility.
The regulatory angle (important for public health AI):
The scenario mentions “regulatory submission”—likely FDA for a medical device. FDA 21 CFR Part 11 and related guidance require: - Traceability: Ability to trace model provenance - Reproducibility: Ability to recreate exact models - Auditability: Records of all development decisions - Validation: Evidence of systematic testing
MLflow/W&B provide exactly this audit trail. The chapter’s emphasis on these tools reflects not just development convenience but regulatory necessity.
The chapter’s complete MLOps stack:
Experiment tracking: MLflow or W&B (this problem) Data versioning: DVC (tracks which data version) Code versioning: Git (tracks which code version) Environment: Docker + requirements.txt (tracks which libraries) Project structure: Standard directories (organizes everything)
Together, these provide complete reproducibility—the ability to recreate any experiment exactly.
Real-world workflow:
The chapter’s example training script shows integration: 1. Initialize MLflow experiment 2. Log all parameters at start 3. Train model 4. Log all metrics 5. Save model as artifact 6. Query MLflow UI to compare runs
This becomes habitual: every experiment automatically tracked, every model reproducible.
For public health practitioners:
The chapter makes MLflow/W&B setup straightforward: - Install: pip install mlflow or pip install wandb - Wrap training code with tracking calls - View UI: mlflow ui provides a web interface - Query programmatically or via the UI
Cost: minimal (W&B free for individuals/academics, MLflow open-source) Benefit: enormous (complete reproducibility, regulatory compliance, team coordination)
The reproducibility crisis in ML is well-documented. The chapter addresses it directly by presenting experiment tracking as essential infrastructure, not optional. The scenario’s regulatory submission failure illustrates precisely why: without systematic tracking, even successful models may be unusable if they can’t be reproduced.
The key principle: automate tracking, don’t rely on human memory. The chapter provides the tools; the practitioner must use them consistently. Start using experiment tracking on day one of a project, not when reproducibility problems emerge.
A data science team is deciding between TensorFlow/Keras and PyTorch for a new image-based tuberculosis screening project. The team has limited deep learning experience, needs to deploy to hospital systems within 6 months, and the model must run on edge devices (low-power tablets). Which factors should MOST heavily influence their framework choice?
- PyTorch because it’s more popular in research and has more cutting-edge features
- TensorFlow because Keras provides easier learning curve, TensorFlow Lite enables edge deployment, and TensorFlow Serving supports production deployment—matching their constraints
- Neither—they should use scikit-learn since deep learning is overkill for this problem
- Both—prototype in PyTorch for flexibility, then convert to TensorFlow for deployment
Correct Answer: b) TensorFlow because Keras provides easier learning curve, TensorFlow Lite enables edge deployment, and TensorFlow Serving supports production deployment—matching their constraints
This question tests understanding of the chapter’s framework comparison and requires matching tool capabilities to project requirements. The scenario deliberately provides specific constraints that favor one framework over the other.
The chapter’s framework comparison:
PyTorch: - “Pythonic, intuitive API” - “Dynamic computation graphs” - “Strong research community” - “Excellent for NLP and computer vision”
TensorFlow/Keras: - “Industry standard for production” - “TensorFlow Serving for deployment” - “TensorFlow Lite for mobile/edge” - “Keras API for ease of use”
Analyzing project constraints:
1. Limited deep learning experience: The chapter positions Keras as having an easier learning curve. The example code shows:
# Keras: Very concise, readable
model = keras.Sequential([
    layers.Dense(64, activation='relu', input_shape=(10,)),
    layers.Dropout(0.2),
    layers.Dense(1, activation='sigmoid')
])
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy', 'AUC'])
model.fit(X_train, y_train, epochs=50)
PyTorch requires more boilerplate (define class, forward method, training loop). For beginners, Keras’ high-level API reduces cognitive load.
2. Edge device deployment (low-power tablets): The chapter explicitly states “TensorFlow Lite for mobile/edge.” This is a critical capability. TensorFlow Lite: - Optimizes models for mobile/edge devices - Reduces model size and inference latency - Supports quantization for lower precision (faster, smaller) - Has extensive mobile deployment tooling
PyTorch has mobile deployment (PyTorch Mobile) but TensorFlow Lite is more mature and widely adopted for healthcare applications.
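As a hedged sketch of that edge-deployment path, converting a trained Keras model to TensorFlow Lite with default post-training optimization looks roughly like this (the output filename is illustrative):

import tensorflow as tf

# Convert the trained Keras model into a compact TFLite file for on-device inference
converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]   # enables post-training quantization
tflite_model = converter.convert()

with open("tb_screening.tflite", "wb") as f:
    f.write(tflite_model)   # deploy this file to the tablets via the TFLite runtime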
3. Hospital production deployment within 6 months: The chapter emphasizes “TensorFlow Serving for deployment” and calls TensorFlow “industry standard for production.” TensorFlow Serving: - Production-grade model serving system - Handles versioning, A/B testing, monitoring - Integrates with enterprise IT systems - Well-documented for healthcare deployments
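As an illustration of what that serving layer looks like from the client side, the sketch below posts one input row to a TensorFlow Serving REST endpoint; the model name, port, and 10-feature input (mirroring the toy Keras model above, not the imaging model) are assumptions.

import json
import requests

# TensorFlow Serving exposes REST endpoints of the form
#   http://<host>:8501/v1/models/<model_name>:predict
payload = {"instances": [[0.1] * 10]}   # one illustrative input row
response = requests.post(
    "http://localhost:8501/v1/models/tb_screening:predict",
    data=json.dumps(payload),
)
print(response.json())   # e.g. {"predictions": [[0.73]]}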
The 6-month timeline is tight. Learning a framework AND building production infrastructure is challenging. TensorFlow’s ecosystem provides more out-of-the-box production tooling.
4. Image-based TB screening: Both frameworks excel at computer vision. This constraint doesn’t differentiate them. Both have excellent pre-trained models (TensorFlow Hub, PyTorch Hub), transfer learning capabilities, and computer vision libraries.
The decision matrix:
| Factor | PyTorch | TensorFlow/Keras |
|---|---|---|
| Learning curve | Moderate | Easy (Keras) |
| Research flexibility | Excellent | Good |
| Production tooling | Adequate | Excellent |
| Edge deployment | PyTorch Mobile | TensorFlow Lite ✓ |
| Enterprise support | Growing | Mature ✓ |
| Time to production | Longer | Shorter ✓ |
Given constraints (beginner team, edge deployment, tight timeline), TensorFlow/Keras matches better.
Option (a) prioritizes research popularity over project needs. “Cutting-edge features” matter for research, not for deploying a TB screening tool to hospitals. The chapter distinguishes research use (PyTorch strengths) from production use (TensorFlow strengths). This project is production-focused.
Option (c) dismisses deep learning prematurely. The chapter discusses deep learning specifically for image analysis. TB screening from chest X-rays is a canonical deep learning application where CNNs outperform traditional computer vision. Scikit-learn lacks the image-specific capabilities that deep learning frameworks provide.
Option (d) suggests using both frameworks—doubling learning curve, maintaining two implementations, and introducing conversion complexity. The chapter doesn’t recommend this approach. Framework conversion (PyTorch → TensorFlow) is non-trivial, error-prone, and time-consuming. With a 6-month deadline, focusing on one framework makes sense.
Additional considerations from the chapter:
Pre-trained models: The chapter discusses both TensorFlow Hub and PyTorch Hub. For TB screening, transfer learning from ImageNet models is common. Both frameworks support this, though specific TB screening models may be available for one or the other—worth checking model zoos before deciding.
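As a sketch of that transfer-learning pattern on the TensorFlow side, the snippet below attaches a small classification head to an ImageNet-pretrained backbone; the MobileNetV2 choice, image size, and single-output head are illustrative assumptions rather than chapter recommendations.

import tensorflow as tf
from tensorflow.keras import layers

# Frozen ImageNet-pretrained backbone; only the small head is trained
backbone = tf.keras.applications.MobileNetV2(
    weights="imagenet", include_top=False, input_shape=(224, 224, 3)
)
backbone.trainable = False

model = tf.keras.Sequential([
    backbone,                               # grayscale X-rays would be replicated to 3 channels
    layers.GlobalAveragePooling2D(),
    layers.Dense(1, activation="sigmoid"),  # probability of suspected TB
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["AUC"])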
Healthcare AI examples: The handbook’s earlier chapters (clinical AI, imaging) likely reference TensorFlow implementations because of its production maturity in healthcare settings.
Team trajectory: If the team later shifts to research (exploring novel architectures), they could learn PyTorch then. For initial production deployment, TensorFlow’s tooling provides scaffolding that accelerates delivery.
Real-world context for public health AI:
FDA clearance: If the TB screening tool needs FDA clearance, TensorFlow’s maturity and documentation for medical devices provides advantages. Regulatory submissions benefit from using widely-validated tooling.
Hospital IT integration: Hospital IT departments are conservative and prefer mature, supported technology. TensorFlow’s enterprise backing (Google) and widespread deployment precedents reduce institutional friction.
Resource constraints: Public health settings often have limited computational resources. TensorFlow Lite’s optimization for low-power devices directly addresses this reality.
Maintenance: After initial deployment, who maintains the model? If the team is small or experiences turnover, TensorFlow’s larger talent pool (due to industry adoption) makes hiring easier.
The chapter’s philosophy:
The chapter presents multiple tools not to create confusion but to equip readers for context-appropriate decisions. PyTorch vs. TensorFlow isn’t “which is better?” but “which matches our needs?”
For this scenario: beginner team + edge deployment + tight timeline = TensorFlow/Keras.
Different scenario (experienced team, research project, no deployment constraints) might favor PyTorch.
For public health practitioners:
When choosing frameworks:
1. List project constraints (team experience, deployment target, timeline, requirements)
2. Map them to framework capabilities (the chapter provides this mapping)
3. Choose the framework that matches the constraints, not the most popular or newest one
4. Start learning/prototyping quickly to validate the choice
5. Commit to one framework rather than hedging across multiple
The chapter provides information for informed decisions; practitioners must apply decision frameworks that prioritize project success over tool fashion. This question tests that practical decision-making skill—understanding not just what each tool does but when to use which tool.
A graduate student is working on epidemic forecasting for their dissertation. They maintain all their code and data on their laptop’s desktop folder with names like “model_v2_final_FINAL.py” and “data_cleaned.csv.” After their laptop crashes and they discover their backups are outdated, they lose three months of work. Which combination of practices from the chapter would have MOST effectively prevented this catastrophe?
- Cloud storage (Google Drive, Dropbox) for automatic backup
- Git for code version control + GitHub for remote backup + DVC for data versioning + remote storage (S3/GCS) for data backup
- More frequent manual backups to external hard drives
- Working exclusively on cloud platforms (Google Colab, AWS SageMaker) instead of local machines
Correct Answer: b) Git for code version control + GitHub for remote backup + DVC for data versioning + remote storage (S3/GCS) for data backup
This question synthesizes multiple practices from the chapter’s reproducibility and collaboration section. The scenario illustrates common disasters that proper version control and backup infrastructure prevent.
The problem breakdown:
1. Code loss (“model_v2_final_FINAL.py”): - No version control - Manual versioning through filenames (error-prone, unmaintainable) - Single point of failure (laptop) - Can’t recover previous versions - No collaboration capability
2. Data loss (“data_cleaned.csv”): - No data versioning - No tracking of data transformations - No remote backup - Can’t reproduce data cleaning steps
3. Lack of remote backup: - All work on single device - Backups not systematic/automated - No ability to recover from device failure
The chapter’s solution:
Git for code version control: The chapter emphasizes Git throughout, providing concrete examples:
git init # Initialize repository
git add . # Stage files
git commit -m "Initial commit" # Create checkpoint
Benefits: - Every commit is a checkpoint: Can recover any previous version - No more “final_FINAL” naming: Git tracks versions automatically - Branching: Experiment safely without destroying working code - History: See what changed, when, and why - Merging: Combine different development paths
With Git, the student never loses code. Even if laptop crashes, the commit history is preserved (especially when combined with remote backup).
GitHub for remote backup: The chapter’s Git section shows:
git remote add origin https://github.com/username/repo.git
git push -u origin main
Benefits:
- Automatic offsite backup: Every git push backs up to the cloud
- Free for public repos: Academic work benefits from open science
- Collaboration ready: Advisor/committee can review code
- Institutional compliance: Many universities require code archiving
When laptop crashes, student clones from GitHub on new machine. No code lost.
DVC for data versioning: The chapter introduces DVC specifically for versioning large datasets (which Git doesn’t handle well):
dvc init # Initialize DVC in Git repo
dvc add data/raw/epidemic_data.csv # Track data file
dvc remote add -d storage s3://mybucket/dvc-store
dvc push # Upload data to remote storage
# After crash, on new machine:
git clone <repo-url> # Get code
dvc pull # Get data
Benefits: - Version large datasets: Git struggles with >100MB files, DVC handles terabytes - Track data transformations: Record preprocessing steps - Remote storage: Data backed up to S3/GCS/Azure - Reproduce pipelines: DVC tracks data processing pipelines
With DVC, the student’s “data_cleaned.csv” is versioned, backed up remotely, and accompanied by code that documents the cleaning process. Data loss is prevented, and the cleaning process is reproducible.
Why this combination is essential:
Git alone doesn’t solve the data problem (large files, binary data). DVC alone doesn’t version code. Together, they provide complete protection for computational research.
Analyzing alternatives:
Option (a)—Cloud storage (Drive/Dropbox): - Pros: Easy, automatic backup - Cons: Not version control (overwrites files, limited history), no meaningful history/diffs, doesn’t track relationships between code and data, not designed for code workflows, poor collaboration features for code
Option (a) prevents total loss but doesn’t provide versioning, reproducibility, or proper collaboration. The chapter presents it as insufficient for serious development work.
Option (c)—Manual backups to external drives: - Pros: Physical control of data - Cons: Requires discipline (people forget), no automation, drives can fail too, versioning still manual, doesn’t enable collaboration
Manual processes fail. The chapter emphasizes automated systems precisely because humans are unreliable about backups.
Option (d)—Cloud platforms exclusively: - Pros: Built-in backup, compute resources - Cons: Vendor lock-in, requires internet, limited customization, can be expensive, doesn’t address version control (still need Git)
Cloud platforms are tools, not replacements for version control. The chapter discusses Colab for compute, not as a version control solution. Colab + Git + GitHub is appropriate; Colab alone is not.
The chapter’s complete reproducibility stack:
- Code versioning: Git
- Code backup: GitHub/GitLab
- Data versioning: DVC
- Data backup: S3/GCS/Azure
- Environment: Docker + requirements.txt
- Experiments: MLflow/W&B
- Project structure: Standard directories
This stack provides: - Recoverability: Device failure doesn’t cause data loss - Reproducibility: Anyone can recreate exact environment and results - Collaboration: Team members work together efficiently - Auditability: Complete history of all changes - Correctness: Version control enables review and validation
Real-world implications for dissertation work:
1. Advisor review: “Send me your code” → “Here’s the GitHub link.” Much cleaner than emailing zip files.
2. Committee examination: “How did you clean the data?” → “See commit 3a7f9b2 and the DVC pipeline.” Full traceability of methods.
3. Publication: “Make code available” → Already on GitHub with a DOI. Meets open science requirements.
4. Post-graduation: Future researchers can build on the work because code + data + environment are documented and preserved.
The “three months of work” loss:
Without version control, this is catastrophic. With Git + GitHub:
- Latest commit is from yesterday → lose one day maximum
- DVC + remote storage → data fully backed up
- git clone + dvc pull on a new machine → back to work in hours
The chapter’s project setup exercise (Section 12):
The hands-on exercise walks through exactly this setup:
1. Create environment (conda/venv)
2. Create project structure (directories)
3. Initialize Git
4. Create GitHub repo
5. Set up DVC
6. Build pipeline
7. Export environment
8. Create Dockerfile
This is the chapter’s recommended starting point for ANY project. Following it would have completely prevented the student’s disaster.
For public health practitioners:
The chapter presents this infrastructure not as optional nice-to-have but as essential professional practice. Just as laboratory researchers maintain lab notebooks and document procedures, computational researchers must version control code and data.
Start every project with:
mkdir my-project
cd my-project
git init # Version control from day one
dvc init # Data versioning from day one
git remote add origin <github-url> # Remote backup from day one
dvc remote add -d storage <s3-url> # Data backup from day one
The chapter provides all these tools and explains their use. The practitioner’s responsibility is establishing habits: commit regularly, push frequently, version data systematically. These practices prevent catastrophes and enable reproducible science.
The scenario’s disaster—three months of work lost—would be impossible with proper version control. This isn’t hypothetical; this happens to real students/researchers regularly. The chapter equips readers with tools to prevent it. The question tests whether readers understand not just what the tools do but why they’re essential and how they work together to provide comprehensive protection.
14.13 Discussion Questions
Environment choice: When would you choose Jupyter notebooks over VS Code for development? What are the trade-offs?
Library selection: For a new tabular data classification project, would you start with scikit-learn, XGBoost, or deep learning? Why?
Cloud vs. local: When does it make sense to move from local development to cloud computing? What factors influence this decision?
Reproducibility: How would you ensure someone else can exactly reproduce your analysis? What tools and practices are essential?
Tool overload: The AI ecosystem has hundreds of tools. How do you decide what to learn? What’s essential vs. nice-to-have?
Version control for data: Git works well for code, but large datasets pose challenges. How would you version control a 100GB dataset?
14.14 Further Resources
14.14.1 📚 Books
- Python Data Science Handbook by Jake VanderPlas - Free online
- Hands-On Machine Learning by Aurélien Géron - Practical guide
- Deep Learning with Python by François Chollet - Keras creator
14.14.2 📄 Documentation
- scikit-learn - Excellent tutorials and examples
- PyTorch - Comprehensive documentation
- pandas - User guide and API reference
- MLflow - MLOps platform
14.14.3 💻 Interactive Learning
- Kaggle Learn - Free micro-courses
- Fast.ai - Practical deep learning
- Google Colab Tutorials - Notebooks
14.14.4 🎓 Online Courses
- Andrew Ng’s ML Specialization - Coursera
- Deep Learning Specialization - Coursera
- Full Stack Deep Learning - Production ML