# Your AI Toolkit - Interactive Notebook

**From Chapter 13 of The Public Health AI Handbook**

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/BryanTegomoh/publichealth-ai-handbook/blob/main/notebooks/chapter13-toolkit-interactive.ipynb)

---

## Learning Objectives

In this interactive notebook, you will:

1. Set up your Python environment with essential packages
2. Create a reproducible project structure
3. Work with core libraries (pandas, scikit-learn, matplotlib)
4. Build an end-to-end ML pipeline for public health data
5. Track experiments with MLflow
6. Implement best practices for reproducibility

---

## 🚀 Setup: Installing Essential Packages

First, let's install the core packages we'll need. If running locally, you should create a virtual environment first:

```bash
# Local setup (not needed in Colab)
conda create -n pubhealth-ai python=3.10
conda activate pubhealth-ai
```

In [None]:
# Install essential packages (uncomment if running in Colab or fresh environment)
# !pip install pandas numpy scikit-learn matplotlib seaborn plotly mlflow imbalanced-learn xgboost

In [None]:
# Import core libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import sklearn  # imported so sklearn.__version__ is available below
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score, roc_curve
import warnings

# Configure plotting
plt.style.use('seaborn-v0_8-darkgrid')
sns.set_palette("husl")
warnings.filterwarnings('ignore')

print("✅ All packages imported successfully!")
print(f"pandas version: {pd.__version__}")
print(f"numpy version: {np.__version__}")
print(f"scikit-learn version: {sklearn.__version__}")

---

## 📁 Part 1: Creating a Reproducible Project Structure

Every ML project should follow a standard directory structure. Let's create one:

In [None]:
import os
from pathlib import Path

# Create project directory structure
project_dirs = [
    'data/raw',
    'data/processed',
    'data/external',
    'notebooks',
    'src',
    'models',
    'reports/figures',
    'references'
]

for dir_path in project_dirs:
    Path(dir_path).mkdir(parents=True, exist_ok=True)

print("✅ Project structure created!")
print("\nDirectory tree:")
for dir_path in sorted(project_dirs):
    print(f"  {dir_path}/")

---

## 📊 Part 2: Working with Public Health Data

Let's create a synthetic public health dataset to demonstrate the toolkit. We'll build a disease outbreak prediction model.

**Scenario:** Predicting whether a region will experience a disease outbreak based on epidemiological indicators.

In [None]:
# Generate synthetic public health data
np.random.seed(42)

n_samples = 1000

# Features: epidemiological indicators
data = {
    'population': np.random.randint(10000, 1000000, n_samples),
    'population_density': np.random.uniform(10, 10000, n_samples),
    'vaccination_rate': np.random.uniform(0.3, 0.95, n_samples),
    'healthcare_access_index': np.random.uniform(0.2, 1.0, n_samples),
    'median_age': np.random.uniform(25, 50, n_samples),
    'poverty_rate': np.random.uniform(0.05, 0.40, n_samples),
    'case_count_previous_week': np.random.poisson(lam=20, size=n_samples),
    'temperature_celsius': np.random.uniform(-10, 35, n_samples),
    'humidity_percent': np.random.uniform(30, 90, n_samples)
}

df = pd.DataFrame(data)

# Target variable: outbreak (1) or no outbreak (0)
# Outbreak probability increases with:
# - Higher population density
# - Lower vaccination rate
# - Lower healthcare access
# - Higher previous case count
# - Higher poverty rate

outbreak_prob = (
    0.1 * (df['population_density'] / 10000) +
    0.3 * (1 - df['vaccination_rate']) +
    0.2 * (1 - df['healthcare_access_index']) +
    0.2 * (df['case_count_previous_week'] / 100) +
    0.2 * (df['poverty_rate'])
)

df['outbreak'] = (outbreak_prob + np.random.normal(0, 0.1, n_samples) > 0.5).astype(int)

print("✅ Synthetic public health data generated!")
print(f"\nDataset shape: {df.shape}")
print(f"Outbreak prevalence: {df['outbreak'].mean():.1%}")
print(f"\nFirst few rows:")
df.head()
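
Written out, the data-generating rule in the cell above is a weighted risk score plus Gaussian noise, thresholded at 0.5. The symbols below are shorthand introduced here, not names from the notebook: $d$ is `population_density`, $v$ is `vaccination_rate`, $h$ is `healthcare_access_index`, $c$ is `case_count_previous_week`, and $r$ is `poverty_rate`.

$$
s = 0.1\,\frac{d}{10000} + 0.3\,(1 - v) + 0.2\,(1 - h) + 0.2\,\frac{c}{100} + 0.2\,r,
\qquad
\text{outbreak} = \mathbb{1}\left[\, s + \varepsilon > 0.5 \,\right],
\quad \varepsilon \sim \mathcal{N}(0,\ 0.1^2)
$$

Plugging in the mean of each feature gives an average score of roughly $0.05 + 0.11 + 0.08 + 0.04 + 0.05 \approx 0.33$, well below the threshold, so outbreaks should come out as the minority class. The remaining columns (`population`, `median_age`, `temperature_celsius`, `humidity_percent`) do not enter the score, so they carry no signal by construction.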

### Exploratory Data Analysis (EDA)

In [None]:
# Summary statistics
print("📊 Summary Statistics:")
df.describe()

In [None]:
# Check for missing values
print("🔍 Missing Values:")
print(df.isnull().sum())
print(f"\nTotal missing: {df.isnull().sum().sum()}")

In [None]:
# Visualize target distribution
fig, axes = plt.subplots(1, 2, figsize=(12, 4))

# Count plot
# sort_index() orders the bars as 0, 1 so they match the tick labels set below
df['outbreak'].value_counts().sort_index().plot(kind='bar', ax=axes[0])
axes[0].set_title('Outbreak Distribution')
axes[0].set_xlabel('Outbreak (0=No, 1=Yes)')
axes[0].set_ylabel('Count')
axes[0].set_xticklabels(['No Outbreak', 'Outbreak'], rotation=0)

# Pie chart
df['outbreak'].value_counts().plot(kind='pie', ax=axes[1], autopct='%1.1f%%')
axes[1].set_title('Outbreak Proportion')
axes[1].set_ylabel('')

plt.tight_layout()
plt.savefig('reports/figures/outbreak_distribution.png', dpi=300, bbox_inches='tight')
plt.show()

print("✅ Target distribution visualized")

In [None]:
# Correlation heatmap
plt.figure(figsize=(10, 8))
correlation = df.corr()
sns.heatmap(correlation, annot=True, fmt='.2f', cmap='coolwarm', center=0,
            square=True, linewidths=1)
plt.title('Feature Correlation Heatmap')
plt.tight_layout()
plt.savefig('reports/figures/correlation_heatmap.png', dpi=300, bbox_inches='tight')
plt.show()

print("✅ Correlation analysis complete")

In [None]:
# Feature distributions by outbreak status
key_features = ['vaccination_rate', 'healthcare_access_index', 
                'case_count_previous_week', 'poverty_rate']

fig, axes = plt.subplots(2, 2, figsize=(14, 10))
axes = axes.ravel()

for idx, feature in enumerate(key_features):
    df[df['outbreak']==0][feature].hist(ax=axes[idx], alpha=0.6, label='No Outbreak', bins=30)
    df[df['outbreak']==1][feature].hist(ax=axes[idx], alpha=0.6, label='Outbreak', bins=30)
    axes[idx].set_title(f'{feature.replace("_", " ").title()}')
    axes[idx].set_xlabel('Value')
    axes[idx].set_ylabel('Frequency')
    axes[idx].legend()

plt.tight_layout()
plt.savefig('reports/figures/feature_distributions.png', dpi=300, bbox_inches='tight')
plt.show()

print("✅ Feature distributions analyzed")

---

## 🔬 Part 3: Building an ML Pipeline

Let's build a complete machine learning pipeline from data preprocessing to model evaluation.

### Step 1: Train/Test Split

In [None]:
# Separate features and target
X = df.drop('outbreak', axis=1)
y = df['outbreak']

# Train/test split (80/20)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

print("✅ Data split complete")
print(f"Training set: {X_train.shape}")
print(f"Test set: {X_test.shape}")
print(f"\nTraining set outbreak rate: {y_train.mean():.1%}")
print(f"Test set outbreak rate: {y_test.mean():.1%}")

### Step 2: Feature Scaling

In [None]:
# Standardize features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Convert back to DataFrame for readability
X_train_scaled = pd.DataFrame(X_train_scaled, columns=X_train.columns)
X_test_scaled = pd.DataFrame(X_test_scaled, columns=X_test.columns)

print("✅ Features scaled")
print("\nScaled feature statistics (training set):")
print(f"Mean: {X_train_scaled.mean().mean():.2f} (should be ~0)")
print(f"Std: {X_train_scaled.std().mean():.2f} (should be ~1)")
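
One way to keep the scaling step tied to the model is scikit-learn's `Pipeline`, which refits the scaler inside every cross-validation fold so that held-out rows never influence the scaling statistics. The sketch below is an illustration rather than part of the notebook's main flow: it uses `make_classification` as stand-in data, and the names `X_demo`, `y_demo`, `demo_pipeline`, and `cv_auc` are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Stand-in data (assumption): 300 rows and 9 features, like the notebook's table
X_demo, y_demo = make_classification(n_samples=300, n_features=9, random_state=42)

# Scaler and model travel together as one estimator
demo_pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("model", LogisticRegression(max_iter=1000, random_state=42)),
])

# 5-fold cross-validation; the scaler is fit on the training part of each fold only
cv_auc = cross_val_score(demo_pipeline, X_demo, y_demo, cv=5, scoring="roc_auc")
print(f"Cross-validated AUC: {cv_auc.mean():.3f} +/- {cv_auc.std():.3f}")
```

A fitted pipeline can also be saved as a single object, which removes the need to store the scaler and the model separately.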

### Step 3: Train Multiple Models

In [None]:
# Train Logistic Regression
print("🔄 Training Logistic Regression...")
lr_model = LogisticRegression(random_state=42, max_iter=1000)
lr_model.fit(X_train_scaled, y_train)
print("✅ Logistic Regression trained")

# Train Random Forest
print("\n🔄 Training Random Forest...")
rf_model = RandomForestClassifier(n_estimators=100, random_state=42, n_jobs=-1)
rf_model.fit(X_train_scaled, y_train)
print("✅ Random Forest trained")

### Step 4: Model Evaluation

In [None]:
# Make predictions
lr_pred = lr_model.predict(X_test_scaled)
lr_pred_proba = lr_model.predict_proba(X_test_scaled)[:, 1]

rf_pred = rf_model.predict(X_test_scaled)
rf_pred_proba = rf_model.predict_proba(X_test_scaled)[:, 1]

# Calculate metrics
def evaluate_model(y_true, y_pred, y_pred_proba, model_name):
    print(f"\n{'='*60}")
    print(f"{model_name} Performance")
    print(f"{'='*60}")
    print("\nClassification Report:")
    print(classification_report(y_true, y_pred, target_names=['No Outbreak', 'Outbreak']))
    
    auc = roc_auc_score(y_true, y_pred_proba)
    print(f"\nAUC-ROC: {auc:.3f}")
    
    return auc

lr_auc = evaluate_model(y_test, lr_pred, lr_pred_proba, "Logistic Regression")
rf_auc = evaluate_model(y_test, rf_pred, rf_pred_proba, "Random Forest")

In [None]:
# Confusion matrices
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Logistic Regression
cm_lr = confusion_matrix(y_test, lr_pred)
sns.heatmap(cm_lr, annot=True, fmt='d', cmap='Blues', ax=axes[0])
axes[0].set_title(f'Logistic Regression\nAUC = {lr_auc:.3f}')
axes[0].set_xlabel('Predicted')
axes[0].set_ylabel('Actual')
axes[0].set_xticklabels(['No Outbreak', 'Outbreak'])
axes[0].set_yticklabels(['No Outbreak', 'Outbreak'])

# Random Forest
cm_rf = confusion_matrix(y_test, rf_pred)
sns.heatmap(cm_rf, annot=True, fmt='d', cmap='Greens', ax=axes[1])
axes[1].set_title(f'Random Forest\nAUC = {rf_auc:.3f}')
axes[1].set_xlabel('Predicted')
axes[1].set_ylabel('Actual')
axes[1].set_xticklabels(['No Outbreak', 'Outbreak'])
axes[1].set_yticklabels(['No Outbreak', 'Outbreak'])

plt.tight_layout()
plt.savefig('reports/figures/confusion_matrices.png', dpi=300, bbox_inches='tight')
plt.show()

print("✅ Confusion matrices generated")

In [None]:
# ROC curves
plt.figure(figsize=(10, 7))

# Logistic Regression ROC
lr_fpr, lr_tpr, _ = roc_curve(y_test, lr_pred_proba)
plt.plot(lr_fpr, lr_tpr, label=f'Logistic Regression (AUC = {lr_auc:.3f})', linewidth=2)

# Random Forest ROC
rf_fpr, rf_tpr, _ = roc_curve(y_test, rf_pred_proba)
plt.plot(rf_fpr, rf_tpr, label=f'Random Forest (AUC = {rf_auc:.3f})', linewidth=2)

# Diagonal reference line
plt.plot([0, 1], [0, 1], 'k--', label='Random Classifier (AUC = 0.50)', linewidth=1)

plt.xlabel('False Positive Rate', fontsize=12)
plt.ylabel('True Positive Rate', fontsize=12)
plt.title('ROC Curves: Outbreak Prediction Models', fontsize=14, fontweight='bold')
plt.legend(fontsize=11)
plt.grid(alpha=0.3)
plt.tight_layout()
plt.savefig('reports/figures/roc_curves.png', dpi=300, bbox_inches='tight')
plt.show()

print("✅ ROC curves generated")
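
Each point on a ROC curve corresponds to a decision threshold, and `predict` simply uses 0.5. As a rough sketch of choosing a different operating point, the example below picks the threshold that maximizes Youden's J (TPR minus FPR). The labels and scores are toy values invented for illustration, and Youden's J is only one possible criterion; an outbreak surveillance setting might reasonably weight sensitivity more heavily.

```python
import numpy as np
from sklearn.metrics import roc_curve

# Toy labels and predicted probabilities (assumption: illustrative values only)
y_true_demo = np.array([0, 0, 0, 0, 1, 0, 1, 1])
y_score_demo = np.array([0.10, 0.20, 0.30, 0.35, 0.40, 0.60, 0.70, 0.90])

fpr, tpr, thresholds = roc_curve(y_true_demo, y_score_demo)

# Youden's J = sensitivity + specificity - 1 = TPR - FPR
j_scores = tpr - fpr
best_idx = int(np.argmax(j_scores))
best_threshold = thresholds[best_idx]

print(f"Best threshold: {best_threshold:.2f}")
print(f"Sensitivity: {tpr[best_idx]:.2f}, Specificity: {1 - fpr[best_idx]:.2f}")

# Apply the chosen threshold instead of the default 0.5
y_pred_demo = (y_score_demo >= best_threshold).astype(int)
```

On these toy values the selected threshold is 0.40, which catches all three positives at the cost of one false positive. With the notebook's models, the same steps would apply to `y_test` and `rf_pred_proba`, ideally with the threshold chosen on a validation split rather than the test set.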

### Step 5: Feature Importance Analysis

In [None]:
# Random Forest feature importance
feature_importance = pd.DataFrame({
    'feature': X_train.columns,
    'importance': rf_model.feature_importances_
}).sort_values('importance', ascending=False)

plt.figure(figsize=(10, 6))
sns.barplot(data=feature_importance, x='importance', y='feature', palette='viridis')
plt.title('Feature Importance (Random Forest)', fontsize=14, fontweight='bold')
plt.xlabel('Importance', fontsize=12)
plt.ylabel('Feature', fontsize=12)
plt.tight_layout()
plt.savefig('reports/figures/feature_importance.png', dpi=300, bbox_inches='tight')
plt.show()

print("✅ Feature importance analysis complete")
print("\nTop 5 Most Important Features:")
print(feature_importance.head())
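
The `feature_importances_` attribute used above is impurity-based and computed from the training data, so it can overstate features the forest happened to split on often. A complementary check is permutation importance on held-out data, sketched below with scikit-learn's `permutation_importance`. The data comes from `make_classification` and names such as `demo_rf` and `perm` are hypothetical; this is an illustration, not the notebook's own analysis.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Stand-in data (assumption): 6 features, 3 of them informative
X_demo, y_demo = make_classification(
    n_samples=400, n_features=6, n_informative=3, random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=42, stratify=y_demo
)

demo_rf = RandomForestClassifier(n_estimators=50, random_state=42).fit(X_tr, y_tr)

# Shuffle one column at a time on the held-out set and measure the drop in AUC
perm = permutation_importance(
    demo_rf, X_te, y_te, scoring="roc_auc", n_repeats=10, random_state=42
)

# Report features from most to least important
for idx in perm.importances_mean.argsort()[::-1]:
    print(f"feature_{idx}: {perm.importances_mean[idx]:.3f} "
          f"+/- {perm.importances_std[idx]:.3f}")
```

Features whose permutation importance is near zero contribute little to held-out performance, whatever their impurity-based ranking says.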

---

## 📈 Part 4: Experiment Tracking with MLflow

Let's track our experiments using MLflow for reproducibility.

In [None]:
# !pip install mlflow  # Uncomment if not installed

import mlflow
import mlflow.sklearn

# Set up MLflow
mlflow.set_tracking_uri("file:./mlruns")
mlflow.set_experiment("outbreak_prediction")

print("✅ MLflow configured")
print(f"Tracking URI: {mlflow.get_tracking_uri()}")
print(f"Experiment: {mlflow.get_experiment_by_name('outbreak_prediction')}")

In [None]:
# Log Logistic Regression experiment
with mlflow.start_run(run_name="logistic_regression"):
    # Log parameters
    mlflow.log_param("model_type", "LogisticRegression")
    mlflow.log_param("max_iter", 1000)
    mlflow.log_param("random_state", 42)
    mlflow.log_param("scaler", "StandardScaler")
    
    # Log metrics
    mlflow.log_metric("auc_roc", lr_auc)
    mlflow.log_metric("accuracy", (lr_pred == y_test).mean())
    
    # Log model
    mlflow.sklearn.log_model(lr_model, "model")
    
    # Log artifacts
    mlflow.log_artifact("reports/figures/confusion_matrices.png")
    mlflow.log_artifact("reports/figures/roc_curves.png")
    
    print("✅ Logistic Regression experiment logged")

In [None]:
# Log Random Forest experiment
with mlflow.start_run(run_name="random_forest"):
    # Log parameters
    mlflow.log_param("model_type", "RandomForest")
    mlflow.log_param("n_estimators", 100)
    mlflow.log_param("random_state", 42)
    mlflow.log_param("scaler", "StandardScaler")
    
    # Log metrics
    mlflow.log_metric("auc_roc", rf_auc)
    mlflow.log_metric("accuracy", (rf_pred == y_test).mean())
    
    # Log model
    mlflow.sklearn.log_model(rf_model, "model")
    
    # Log feature importance
    feature_importance.to_csv('models/feature_importance.csv', index=False)
    mlflow.log_artifact('models/feature_importance.csv')
    mlflow.log_artifact("reports/figures/feature_importance.png")
    
    print("✅ Random Forest experiment logged")

### View MLflow UI

To view your experiments in the MLflow UI, run this command in your terminal:

```bash
mlflow ui --backend-store-uri file:./mlruns
```

Then open http://localhost:5000 in your browser.

---

## 💾 Part 5: Saving and Loading Models

Let's save our models for future use.

In [None]:
import joblib
from datetime import datetime

# Save models with timestamp
timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")

model_paths = {
    'lr_model': f'models/logistic_regression_{timestamp}.joblib',
    'rf_model': f'models/random_forest_{timestamp}.joblib',
    'scaler': f'models/scaler_{timestamp}.joblib'
}

joblib.dump(lr_model, model_paths['lr_model'])
joblib.dump(rf_model, model_paths['rf_model'])
joblib.dump(scaler, model_paths['scaler'])

print("✅ Models saved:")
for name, path in model_paths.items():
    print(f"  {name}: {path}")

In [None]:
# Load models
loaded_rf = joblib.load(model_paths['rf_model'])
loaded_scaler = joblib.load(model_paths['scaler'])

# Test loaded model
test_predictions = loaded_rf.predict(X_test_scaled)
test_auc = roc_auc_score(y_test, loaded_rf.predict_proba(X_test_scaled)[:, 1])

print("✅ Models loaded successfully")
print(f"Loaded Random Forest AUC: {test_auc:.3f}")
print(f"Original Random Forest AUC: {rf_auc:.3f}")
# Compare floats with a tolerance rather than exact equality
print(f"Match: {np.isclose(test_auc, rf_auc)}")

---

## 🎯 Part 6: Making Predictions on New Data

Let's use our trained model to make predictions on new, unseen data.

In [None]:
# Create new data point (hypothetical region)
new_region = pd.DataFrame([{
    'population': 500000,
    'population_density': 2500,
    'vaccination_rate': 0.65,
    'healthcare_access_index': 0.70,
    'median_age': 38,
    'poverty_rate': 0.18,
    'case_count_previous_week': 45,
    'temperature_celsius': 22,
    'humidity_percent': 65
}])

print("📍 New Region Characteristics:")
print(new_region.T)

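# Scale features, keeping the column names the model saw during training
new_region_scaled = pd.DataFrame(
    loaded_scaler.transform(new_region), columns=new_region.columns
)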

# Make prediction
prediction = loaded_rf.predict(new_region_scaled)[0]
prediction_proba = loaded_rf.predict_proba(new_region_scaled)[0]

print("\n" + "="*60)
print("🔮 Outbreak Prediction")
print("="*60)
print(f"Prediction: {'OUTBREAK LIKELY' if prediction == 1 else 'NO OUTBREAK EXPECTED'}")
print(f"Confidence: {prediction_proba[prediction]:.1%}")
print(f"\nProbabilities:")
print(f"  No Outbreak: {prediction_proba[0]:.1%}")
print(f"  Outbreak: {prediction_proba[1]:.1%}")

---

## 📝 Part 7: Creating a Model Card

Document your model for transparency and reproducibility.

In [None]:
# Create model card
model_card = f"""
# Model Card: Outbreak Prediction Model

## Model Details
- **Model Type:** Random Forest Classifier
- **Version:** 1.0
- **Date:** {datetime.now().strftime('%Y-%m-%d')}
- **Developer:** Public Health AI Handbook Tutorial

## Intended Use
- **Primary Use:** Predict outbreak risk in regions based on epidemiological indicators
- **Target Users:** Public health practitioners, epidemiologists
- **Out-of-Scope Uses:** Real-time clinical decisions, individual patient diagnosis

## Training Data
- **Dataset:** Synthetic public health data (1000 samples)
- **Features:** 9 epidemiological indicators
- **Target:** Binary outbreak classification
- **Split:** 80% train, 20% test

## Performance
- **AUC-ROC:** {rf_auc:.3f}
- **Accuracy:** {(rf_pred == y_test).mean():.3f}
- **Evaluation:** Test set (200 samples)

## Feature Importance
Top 3 features:
{feature_importance.head(3).to_string(index=False)}

## Limitations
- Trained on synthetic data (not real-world outbreaks)
- May not generalize to all disease types
- Requires feature scaling for predictions
- Performance may degrade over time (concept drift)

## Ethical Considerations
- Fairness: Performance should be monitored across demographic groups
- Privacy: Input data may contain sensitive population information
- Accountability: Predictions should inform, not replace, expert judgment

## Model Files
- Model: {model_paths['rf_model']}
- Scaler: {model_paths['scaler']}

## Contact
For questions or issues, refer to: The Public Health AI Handbook (Chapter 13)
"""

# Save model card
with open('models/MODEL_CARD.md', 'w') as f:
    f.write(model_card)

print("✅ Model card created: models/MODEL_CARD.md")
print("\nModel Card Preview:")
print(model_card)

---

## 📦 Part 8: Creating a requirements.txt for Reproducibility

In [None]:
# Generate requirements.txt
requirements = """
# Core data science
pandas==2.0.3
numpy==1.24.3
matplotlib==3.7.2
seaborn==0.12.2

# Machine learning
scikit-learn==1.3.0
xgboost==1.7.6
imbalanced-learn==0.11.0

# Experiment tracking
mlflow==2.5.0

# Utilities
joblib==1.3.1
jupyter==1.0.0
""".strip()

with open('requirements.txt', 'w') as f:
    f.write(requirements)

print("✅ requirements.txt created")
print("\nContents:")
print(requirements)
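
The version numbers in the cell above are hard-coded, so they may not match what is actually installed in your environment. As a hedged alternative, the sketch below reads installed versions with the standard library's `importlib.metadata`. The helper `pin_installed` and the `wanted` list are hypothetical names introduced here for illustration.

```python
from importlib.metadata import PackageNotFoundError, version


def pin_installed(packages):
    """Return 'name==version' lines for packages installed in this environment."""
    lines = []
    for name in packages:
        try:
            lines.append(f"{name}=={version(name)}")
        except PackageNotFoundError:
            # Skip anything that is not installed rather than guessing a version
            print(f"Skipping {name}: not installed")
    return lines


# Distribution names as they appear on PyPI (assumption: adjust to your project)
wanted = ["pandas", "numpy", "scikit-learn", "matplotlib", "seaborn", "mlflow", "joblib"]
pinned = pin_installed(wanted)
print("\n".join(pinned))
```

Writing `"\n".join(pinned)` to `requirements.txt` would then record the versions the notebook actually ran with.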

---

## 🎓 Summary & Next Steps

### What You've Accomplished

✅ Set up a reproducible project structure  
✅ Created and analyzed synthetic public health data  
✅ Built a complete ML pipeline (preprocessing → training → evaluation)  
✅ Compared multiple models (Logistic Regression vs Random Forest)  
✅ Tracked experiments with MLflow  
✅ Saved and loaded models for deployment  
✅ Made predictions on new data  
✅ Created model documentation (model card)  
✅ Generated requirements.txt for reproducibility  

### Key Takeaways

1. **Start Simple:** We used scikit-learn for everything. No need for deep learning here.
2. **Track Everything:** MLflow helps you compare experiments and reproduce results.
3. **Document Thoroughly:** Model cards make your work transparent and trustworthy.
4. **Think About Deployment:** Save models, scalers, and document dependencies.

### Next Steps

1. **Try with Real Data:** Apply this pipeline to actual public health datasets
2. **Add More Models:** Try XGBoost, LightGBM, or neural networks
3. **Hyperparameter Tuning:** Use GridSearchCV or Optuna to optimize models
4. **Build a Web App:** Deploy your model with Streamlit or Flask
5. **Read Chapter 14:** Learn to build your first complete project
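
For the hyperparameter tuning step, a minimal `GridSearchCV` sketch might look like the following. The grid values are arbitrary starting points chosen to keep the example fast, the data is a `make_classification` stand-in, and `search` and `param_grid` are hypothetical names; none of this is a recommended configuration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Stand-in data (assumption): replace with X_train / y_train from your own project
X_demo, y_demo = make_classification(n_samples=300, n_features=9, random_state=42)

# Small illustrative grid: 2 x 2 = 4 candidate settings
param_grid = {
    "n_estimators": [25, 50],
    "max_depth": [3, None],
}

# 3-fold cross-validation over every combination, scored by AUC
search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid,
    scoring="roc_auc",
    cv=3,
)
search.fit(X_demo, y_demo)

print(f"Best parameters: {search.best_params_}")
print(f"Best cross-validated AUC: {search.best_score_:.3f}")
```

After fitting, `search.best_estimator_` holds a model refit on all of the supplied data with the winning settings.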

---

## 📚 Additional Resources

- **The Public Health AI Handbook:** Full book at [link]
- **scikit-learn Documentation:** https://scikit-learn.org/
- **MLflow Documentation:** https://mlflow.org/docs/latest/index.html
- **pandas Documentation:** https://pandas.pydata.org/docs/

---

**Questions or Issues?**  
Open an issue at: https://github.com/BryanTegomoh/publichealth-ai-handbook/issues