15 Building Your First Project
Estimated time: 3-4 hours (hands-on project work)
Prerequisites:
- Chapter 2: Just Enough AI to Be Dangerous - ML fundamentals
- Chapter 3: The Data Problem - Data collection and quality
- Chapter 9: Evaluating AI Systems - Performance metrics
- Chapter 13: Your AI Toolkit - Development environment setup
15.1 What You’ll Build
In this chapter, you will build a complete end-to-end project:
Project: Hospital Readmission Risk Prediction
A practical system to predict 30-day hospital readmission risk for discharged patients, enabling targeted interventions and resource allocation.
Deliverables:
- Clean, documented codebase - Well-structured Python project with proper organization
- Trained ML model - Random Forest and XGBoost models with evaluation
- Interactive dashboard - Streamlit web app for predictions and visualizations
- Technical report - Documentation of methodology, results, and limitations
- Presentation materials - Stakeholder-ready slides with key findings
- Deployment artifacts - Docker container and API for production use
15.2 1. Introduction: The Project Lifecycle
15.2.1 Why Hospital Readmission Prediction?
Hospital readmissions are a critical public health challenge:
- Clinical impact: Jencks et al., 2009, NEJM found that nearly 20% of Medicare patients are readmitted within 30 days
- Financial burden: Centers for Medicare & Medicaid Services (CMS) estimates $17 billion annually in preventable readmission costs
- Quality indicator: 30-day readmission rates are a key quality metric for hospitals
- Actionable: Early identification enables targeted interventions (follow-up calls, home visits, medication reconciliation)
Kansagara et al., 2011, Annals of Internal Medicine reviewed readmission risk prediction models and found that most perform moderately well (C-statistic 0.55-0.65), suggesting room for improvement with modern ML techniques.
15.2.2 The ML Project Lifecycle
Amershi et al., 2019, IEEE Software studied ML workflows at Microsoft and identified nine key stages:
┌────────────────────────────────────────────────────────────────┐
│ ML Project Lifecycle │
├────────────────────────────────────────────────────────────────┤
│ │
│ 1. Problem Definition │ Define scope, goals, success │
│ ↓ │ metrics │
│ │ │
│ 2. Data Collection │ Identify sources, gather data │
│ ↓ │ (EHR, claims, surveys) │
│ │ │
│ 3. Data Exploration │ Understand distributions, │
│ ↓ │ relationships, quality issues │
│ │ │
│ 4. Data Preprocessing │ Clean, transform, engineer │
│ ↓ │ features │
│ │ │
│ 5. Modeling │ Train baseline, iterate, │
│ ↓ │ tune hyperparameters │
│ │ │
│ 6. Evaluation │ Assess performance, fairness, │
│ ↓ │ clinical utility │
│ │ │
│ 7. Interpretation │ Explain predictions, identify │
│ ↓ │ important features │
│ │ │
│ 8. Deployment │ Integrate with systems, create │
│ ↓ │ interfaces │
│ │ │
│ 9. Monitoring & Maintenance │ Track performance, retrain, │
│ │ update │
│ │ │
└────────────────────────────────────────────────────────────────┘
Time allocation (based on CrowdFlower, 2016 survey):
- Data collection: 19%
- Data cleaning: 60%
- Model building: 9%
- Model deployment: 6%
- Other (visualization, communication): 6%
Key insight: Most time is spent on data work, not modeling!
15.2.3 Project Scope and Constraints
For a first project, proper scoping is critical. Ng, 2021, MLOps lecture recommends:
DO:
- ✅ Start with a well-defined, narrow problem
- ✅ Use readily available data
- ✅ Aim for “good enough,” not “perfect”
- ✅ Focus on end-to-end completion
- ✅ Document everything as you go

DON’T:
- ❌ Try to solve everything at once
- ❌ Collect new data (too time-consuming)
- ❌ Get stuck on a single step
- ❌ Aim for production-grade from the start
- ❌ Skip documentation until the end
Our project scope:
In scope:
- Predict 30-day all-cause readmission risk
- Adult patients (18+)
- Use publicly available dataset
- Build interpretable models (Random Forest, XGBoost)
- Create basic web interface

Out of scope:
- Cause-specific readmissions
- Pediatric patients
- Real-time integration with EHR
- Deep learning models
- Prospective validation
15.3 2. Problem Definition
15.3.1 Defining Success Metrics
Bates et al., 2014, NEJM emphasize that ML success metrics must align with clinical utility, not just statistical performance.
Statistical metrics:
- Primary: AUC-ROC ≥ 0.70 (moderate discrimination)
- Secondary: Sensitivity ≥ 0.70 at a 20% alert rate

Clinical metrics:
- Reduce preventable readmissions by 15%
- Identify 70% of high-risk patients for intervention

Operational metrics:
- Predictions available within 24 hours of discharge
- False positive rate < 80% (avoid alert fatigue)
- Model runs in < 1 second per patient

Equity metrics:
- Performance parity across demographic groups (AUC within 0.05; a quick check is sketched after this list)
- No disparate impact (Feldman et al., 2015, KDD)
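A quick way to check the AUC-parity target is to compute AUC-ROC separately for each demographic group and look at the spread. A minimal sketch, assuming you keep the raw demographic column (for example race) alongside the test-set scores and that every group contains both outcomes:

import pandas as pd
from sklearn.metrics import roc_auc_score

def auc_by_group(y_true, y_pred_proba, groups) -> pd.Series:
    """AUC-ROC per demographic group; the gap should stay within the 0.05 target."""
    scores = pd.DataFrame({"y": y_true, "p": y_pred_proba, "g": groups})
    aucs = scores.groupby("g").apply(lambda d: roc_auc_score(d["y"], d["p"]))
    print(f"Max AUC gap across groups: {aucs.max() - aucs.min():.3f} (target <= 0.05)")
    return aucs

# Example usage (variable names are illustrative):
# auc_by_group(y_test, y_pred_proba, X_test_raw["race"])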
15.3.2 Stakeholder Analysis
Holstein et al., 2019, CHI found that successful ML projects require early and continuous stakeholder engagement.
Key stakeholders for readmission prediction:
Stakeholder | Needs | Concerns | Communication Strategy |
---|---|---|---|
Care coordinators | Actionable risk scores, patient lists | Workload increase, tool usability | Interactive dashboard, training |
Clinicians | Interpretable predictions, integration with workflow | Alert fatigue, liability | Explainable models, calibrated thresholds |
Hospital administrators | ROI metrics, compliance | Cost, regulatory approval | Business case, quality reports |
Data team | Maintainable code, documentation | Technical debt, model drift | Version control, monitoring plan |
Patients | Improved outcomes, transparency | Privacy, bias | Plain-language explanations, consent |
15.3.3 Project Timeline
Realistic 4-week timeline for first project:
Week 1: Problem Definition & Data Exploration
- Days 1-2: Problem scoping, literature review
- Days 3-5: Data acquisition, initial EDA
- Days 6-7: Feature engineering planning, documentation

Week 2: Preprocessing & Feature Engineering
- Days 8-10: Data cleaning, handling missing values
- Days 11-12: Feature creation, transformation
- Days 13-14: Train/validation/test split, baseline model

Week 3: Modeling & Evaluation
- Days 15-17: Model training (multiple algorithms)
- Days 18-19: Hyperparameter tuning, validation
- Days 20-21: Evaluation, fairness analysis

Week 4: Deployment & Documentation
- Days 22-24: Build web interface, create visualizations
- Days 25-26: Write technical report, create presentation
- Days 27-28: Code cleanup, Docker container, final review
15.4 3. Data Acquisition and Setup
15.4.1 Dataset: MIMIC-III or Synthetic Alternative
Option 1: MIMIC-III (Johnson et al., 2016, Scientific Data)
- Freely available critical care database
- 40,000+ ICU patients from Beth Israel Deaconess Medical Center
- Requires CITI training and PhysioNet credentialing
- Rich clinical data: diagnoses, procedures, medications, labs
Access: https://physionet.org/content/mimiciii/
Option 2: Synthesized dataset (for this tutorial)
We’ll use a synthesized dataset based on real readmission patterns but without PHI concerns.
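The rest of the chapter assumes a file data/raw/readmission_data.csv with the columns listed in the data dictionary below. If you just need a placeholder file to run the code end-to-end, here is a rough generator sketch; it is purely illustrative (not clinically realistic) and covers only a subset of the columns:

import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n = 10_000

df = pd.DataFrame({
    "patient_id": np.arange(n),
    "age": rng.integers(18, 95, n),
    "gender": rng.choice(["M", "F"], n),
    "length_of_stay": rng.poisson(4, n) + 1,
    "num_diagnoses": rng.poisson(5, n) + 1,
    "num_medications": rng.poisson(10, n),
    "num_inpatient": rng.poisson(0.4, n),
    "num_emergency": rng.poisson(0.6, n),
    "num_outpatient": rng.poisson(2, n),
})

# Tie readmission risk loosely to prior utilization and length of stay (~15% base rate)
logit = -2.4 + 0.4 * df["num_inpatient"] + 0.2 * df["num_emergency"] + 0.05 * df["length_of_stay"]
df["readmitted_30_days"] = (rng.random(n) < 1 / (1 + np.exp(-logit))).astype(int)

df.to_csv("data/raw/readmission_data.csv", index=False)

Extend the frame with the remaining columns (race, insurance, comorbidity flags, and so on) as needed before running the later notebooks.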
15.4.2 Project Setup
1. Create project directory:
# Create project structure
mkdir hospital-readmission-prediction
cd hospital-readmission-prediction
# Create subdirectories
mkdir -p data/{raw,processed,external}
mkdir -p notebooks
mkdir -p src/{data,features,models,visualization}
mkdir -p models
mkdir -p reports/figures
mkdir -p tests
mkdir -p deployment
# Create initial files
touch README.md
touch requirements.txt
touch .gitignore
touch environment.yml
2. Initialize Git repository:
git init
git add README.md .gitignore
git commit -m "Initial commit: Project structure"
3. Create virtual environment:
# Create conda environment
conda create -n readmission-pred python=3.10
conda activate readmission-pred
# Install core packages
conda install pandas numpy scikit-learn matplotlib seaborn jupyter mlflow
# Install additional packages
pip install xgboost lightgbm shap streamlit plotly
4. Create requirements.txt:
# requirements.txt
pandas==2.0.3
numpy==1.24.3
scikit-learn==1.3.0
xgboost==1.7.6
lightgbm==4.0.0
matplotlib==3.7.2
seaborn==0.12.2
plotly==5.15.0
shap==0.42.1
mlflow==2.5.0
streamlit==1.25.0
jupyter==1.0.0
5. Create .gitignore:
# .gitignore
# Data
data/raw/
data/processed/
*.csv
*.pkl
*.h5
# Models
models/
*.pth
*.joblib
# Notebooks
.ipynb_checkpoints/
*-checkpoint.ipynb
# Python
__pycache__/
*.pyc
*.pyo
*.pyd
.Python
env/
venv/
.venv/
# IDEs
.vscode/
.idea/
*.swp
*.swo
# OS
.DS_Store
Thumbs.db
# MLflow
mlruns/
mlartifacts/
# Streamlit
.streamlit/
6. Create README.md:
# Hospital Readmission Prediction
Predict 30-day hospital readmission risk using machine learning.
## Project Structure
hospital-readmission-prediction/
├── data/
│ ├── raw/ # Original data
│ ├── processed/ # Cleaned data
│ └── external/ # Reference data
├── notebooks/ # Jupyter notebooks
├── src/ # Source code
│ ├── data/ # Data processing
│ ├── features/ # Feature engineering
│ ├── models/ # Model training
│ └── visualization/ # Plotting
├── models/ # Trained models
├── reports/ # Analysis outputs
└── deployment/ # Deployment code
## Setup
# Create environment
conda env create -f environment.yml
conda activate readmission-pred
# Or use pip
pip install -r requirements.txt
## Usage
# Train model
python src/models/train.py
# Make predictions
python src/models/predict.py --input data/new_patients.csv
# Launch dashboard
streamlit run deployment/app.py
## Performance
- AUC-ROC: 0.73
- Sensitivity: 0.72 @ 20% alert rate
- Specificity: 0.68
## Citation
If you use this code, please cite:
@misc{hospital_readmission_prediction,
  author = {Your Name},
  title = {Hospital Readmission Risk Prediction Using Machine Learning},
  year = {2024},
  publisher = {GitHub},
  url = {https://github.com/yourusername/hospital-readmission-prediction}
}
15.5 4. Data Exploration
15.5.1 Load and Inspect Data
Create notebooks/01_exploratory_data_analysis.ipynb:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
import warnings
warnings.filterwarnings('ignore')

# Set style
sns.set_style('whitegrid')
sns.set_palette('husl')
plt.rcParams['figure.figsize'] = (12, 8)

# Load data
df = pd.read_csv('../data/raw/readmission_data.csv')
print(f"Dataset shape: {df.shape}")
print(f"Number of patients: {df['patient_id'].nunique()}")
print(f"\nFirst few rows:")
df.head()
Output:
Dataset shape: (10000, 25)
Number of patients: 10000
15.5.2 Data Dictionary
Understanding your features is critical. Sendak et al., 2020, npj Digital Medicine emphasize the importance of clinically meaningful features.
# Create data dictionary
data_dict = {
    'patient_id': 'Unique patient identifier',
    'age': 'Patient age in years',
    'gender': 'Patient gender (M/F)',
    'race': 'Patient race/ethnicity',
    'admission_type': 'Type of admission (Emergency/Elective/Urgent)',
    'discharge_disposition': 'Where patient went after discharge',
    'length_of_stay': 'Hospital length of stay (days)',
    'num_diagnoses': 'Number of diagnoses',
    'num_procedures': 'Number of procedures',
    'num_medications': 'Number of medications',
    'num_lab_procedures': 'Number of lab procedures',
    'num_outpatient': 'Number of outpatient visits in prior year',
    'num_emergency': 'Number of emergency visits in prior year',
    'num_inpatient': 'Number of inpatient visits in prior year',
    'diabetes': 'Diabetes diagnosis (Yes/No)',
    'heart_failure': 'Heart failure diagnosis (Yes/No)',
    'copd': 'COPD diagnosis (Yes/No)',
    'hypertension': 'Hypertension diagnosis (Yes/No)',
    'depression': 'Depression diagnosis (Yes/No)',
    'admission_source': 'Source of admission',
    'insurance': 'Insurance type (Medicare/Medicaid/Private/None)',
    'marital_status': 'Marital status',
    'readmitted_30_days': 'Target: Readmitted within 30 days (1=Yes, 0=No)'
}

# Display
pd.DataFrame.from_dict(data_dict, orient='index', columns=['Description'])
15.5.3 Summary Statistics
# Basic statistics
print("="*60)
print("SUMMARY STATISTICS")
print("="*60)
# Numerical features
numerical_cols = df.select_dtypes(include=[np.number]).columns
print("\nNumerical Features:")
print(df[numerical_cols].describe().T)

# Categorical features
categorical_cols = df.select_dtypes(include=['object']).columns
print("\nCategorical Features:")
for col in categorical_cols:
    print(f"\n{col}:")
    print(df[col].value_counts())
    print(f"Unique values: {df[col].nunique()}")
# Target variable
print("\n" + "="*60)
print("TARGET VARIABLE: readmitted_30_days")
print("="*60)
print(df['readmitted_30_days'].value_counts())
print(f"\nReadmission rate: {df['readmitted_30_days'].mean():.1%}")
Example output:
TARGET VARIABLE: readmitted_30_days
============================================================
0 8520
1 1480
Name: readmitted_30_days, dtype: int64
Readmission rate: 14.8%
15.5.4 Missing Data Analysis
Van Buuren, 2018, Flexible Imputation of Missing Data provides comprehensive guidance on handling missing data.
# Missing data analysis
missing = df.isnull().sum()
missing_pct = (missing / len(df)) * 100
missing_df = pd.DataFrame({
    'Missing_Count': missing,
    'Missing_Percentage': missing_pct
})
missing_df = missing_df[missing_df['Missing_Count'] > 0].sort_values('Missing_Percentage', ascending=False)

print("\nMissing Data Summary:")
print(missing_df)

# Visualize missing patterns
plt.figure(figsize=(10, 6))
sns.heatmap(df.isnull(), cbar=False, yticklabels=False, cmap='viridis')
plt.title('Missing Data Pattern')
plt.xlabel('Features')
plt.tight_layout()
plt.savefig('../reports/figures/missing_data_pattern.png', dpi=150)
plt.show()

# Missing data by readmission status
print("\nMissing Data by Readmission Status:")
for col in missing_df.index:
    readmit_miss = df[df['readmitted_30_days']==1][col].isnull().mean()
    no_readmit_miss = df[df['readmitted_30_days']==0][col].isnull().mean()

    if abs(readmit_miss - no_readmit_miss) > 0.05:  # >5% difference
        print(f"\n{col}:")
        print(f"  Readmitted: {readmit_miss:.1%} missing")
        print(f"  Not readmitted: {no_readmit_miss:.1%} missing")
        print(f"  ⚠️ Differential missingness detected!")
Interpretation: Differential missingness (missing patterns differ by outcome) can indicate:
- Informative missingness (missing itself is predictive; a code sketch follows)
- Data collection bias
- Need for careful imputation strategy
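One simple way to let the model exploit informative missingness is to add explicit missing-value indicator columns before any imputation. A minimal sketch; the helper name and the flagged columns are illustrative:

import pandas as pd

def add_missing_indicators(df: pd.DataFrame, columns: list[str]) -> pd.DataFrame:
    """Add binary <col>_missing flags so missingness itself can act as a feature."""
    df = df.copy()
    for col in columns:
        df[f"{col}_missing"] = df[col].isnull().astype(int)
    return df

# Example: flag the columns where differential missingness was detected
df = add_missing_indicators(df, ["race", "insurance"])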
15.5.5 Univariate Analysis
# Numerical features distribution
fig, axes = plt.subplots(4, 3, figsize=(15, 16))
axes = axes.ravel()

for idx, col in enumerate(numerical_cols[:12]):
    axes[idx].hist(df[col].dropna(), bins=30, edgecolor='black', alpha=0.7)
    axes[idx].set_title(f'{col}\nSkew: {df[col].skew():.2f}')
    axes[idx].set_xlabel(col)
    axes[idx].set_ylabel('Frequency')

    # Add mean and median lines
    axes[idx].axvline(df[col].mean(), color='red', linestyle='--', label='Mean')
    axes[idx].axvline(df[col].median(), color='green', linestyle='--', label='Median')
    axes[idx].legend(fontsize=8)

plt.tight_layout()
plt.savefig('../reports/figures/univariate_numerical.png', dpi=150)
plt.show()

# Categorical features
fig, axes = plt.subplots(3, 3, figsize=(15, 12))
axes = axes.ravel()

for idx, col in enumerate(categorical_cols[:9]):
    value_counts = df[col].value_counts()
    axes[idx].bar(range(len(value_counts)), value_counts.values)
    axes[idx].set_xticks(range(len(value_counts)))
    axes[idx].set_xticklabels(value_counts.index, rotation=45, ha='right')
    axes[idx].set_title(col)
    axes[idx].set_ylabel('Count')

plt.tight_layout()
plt.savefig('../reports/figures/univariate_categorical.png', dpi=150)
plt.show()
15.5.6 Bivariate Analysis
Examine relationships with target variable.
# Numerical features vs. target
fig, axes = plt.subplots(4, 3, figsize=(15, 16))
axes = axes.ravel()

for idx, col in enumerate(numerical_cols[:12]):
    # Box plot by readmission status
    df_plot = df[[col, 'readmitted_30_days']].dropna()

    axes[idx].boxplot([
        df_plot[df_plot['readmitted_30_days']==0][col],
        df_plot[df_plot['readmitted_30_days']==1][col]
    ], labels=['Not Readmitted', 'Readmitted'])
    axes[idx].set_title(col)
    axes[idx].set_ylabel('Value')

    # Statistical test
    not_readmit = df_plot[df_plot['readmitted_30_days']==0][col]
    readmit = df_plot[df_plot['readmitted_30_days']==1][col]

    # Mann-Whitney U test (non-parametric)
    statistic, p_value = stats.mannwhitneyu(not_readmit, readmit, alternative='two-sided')

    if p_value < 0.001:
        sig = '***'
    elif p_value < 0.01:
        sig = '**'
    elif p_value < 0.05:
        sig = '*'
    else:
        sig = 'ns'

    axes[idx].text(0.5, 0.95, f'p{sig}', transform=axes[idx].transAxes,
                   ha='center', va='top', fontsize=10)

plt.tight_layout()
plt.savefig('../reports/figures/bivariate_numerical.png', dpi=150)
plt.show()
Sullivan & D’Agostino, 2003, Statistics in Medicine recommend non-parametric tests when distributions are skewed, which is common in clinical data.
15.5.7 Correlation Analysis
# Correlation matrix
plt.figure(figsize=(14, 12))
corr_matrix = df[numerical_cols].corr()
mask = np.triu(np.ones_like(corr_matrix, dtype=bool))  # Mask upper triangle

sns.heatmap(corr_matrix, mask=mask, annot=True, fmt='.2f',
            cmap='coolwarm', center=0, square=True,
            linewidths=0.5, cbar_kws={"shrink": 0.8})
plt.title('Feature Correlation Matrix', fontsize=16)
plt.tight_layout()
plt.savefig('../reports/figures/correlation_matrix.png', dpi=150)
plt.show()

# High correlations (potential multicollinearity)
print("\nHighly Correlated Feature Pairs (|r| > 0.7):")
high_corr = []
for i in range(len(corr_matrix.columns)):
    for j in range(i+1, len(corr_matrix.columns)):
        if abs(corr_matrix.iloc[i, j]) > 0.7:
            high_corr.append({
                'Feature 1': corr_matrix.columns[i],
                'Feature 2': corr_matrix.columns[j],
                'Correlation': corr_matrix.iloc[i, j]
            })

if high_corr:
    print(pd.DataFrame(high_corr))
else:
    print("No highly correlated pairs found")
Interpretation: High correlations suggest:
- Potential multicollinearity issues for some models (a VIF check is sketched below)
- Redundant information
- Opportunities for feature selection
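For a more formal multicollinearity check than pairwise correlations, variance inflation factors (VIF) are a common option. A minimal sketch using statsmodels, which is an extra dependency not listed in requirements.txt; the column list is illustrative:

import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor

def compute_vif(df: pd.DataFrame, columns) -> pd.DataFrame:
    """Variance inflation factor per column; values above ~5-10 suggest multicollinearity."""
    X = df[columns].dropna()
    return pd.DataFrame({
        "feature": columns,
        "vif": [variance_inflation_factor(X.values, i) for i in range(len(columns))],
    }).sort_values("vif", ascending=False)

# Example usage:
# print(compute_vif(df, ["num_inpatient", "num_emergency", "num_outpatient", "length_of_stay"]))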
15.5.8 Key Insights from EDA
Document findings systematically:
# Summary of key findings
findings = {
    'Sample Size': {
        'Total patients': len(df),
        'Readmitted': int(df['readmitted_30_days'].sum()),
        'Not readmitted': int((df['readmitted_30_days']==0).sum()),
        'Readmission rate': f"{df['readmitted_30_days'].mean():.1%}"
    },
    'Data Quality': {
        'Features with missing data': len(missing_df),
        'Max missing percentage': f"{missing_df['Missing_Percentage'].max():.1f}%",
        'Differential missingness': 'Detected in race, insurance'
    },
    'Class Balance': {
        'Status': 'Imbalanced' if df['readmitted_30_days'].mean() < 0.3 else 'Balanced',
        'Imbalance ratio': f"1:{(1-df['readmitted_30_days'].mean())/df['readmitted_30_days'].mean():.1f}"
    },
    'Strong Predictors (p<0.001)': [
        'num_inpatient', 'num_emergency', 'num_diagnoses',
        'length_of_stay', 'num_medications'
    ],
    'Multicollinearity': {
        'High correlations': len(high_corr),
        'Max correlation': max([abs(x['Correlation']) for x in high_corr]) if high_corr else 0
    }
}

# Print findings (counts are cast to plain ints above so json.dumps can serialize them)
print("\n" + "="*60)
print("KEY FINDINGS FROM EXPLORATORY DATA ANALYSIS")
print("="*60)

import json
print(json.dumps(findings, indent=2))
15.6 5. Data Preprocessing
Create src/data/preprocessing.py:
"""
Data preprocessing module for hospital readmission prediction.
Handles:
- Missing value imputation
- Outlier detection and treatment
- Feature scaling
- Encoding categorical variables
"""
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.impute import SimpleImputer
import logging
=logging.INFO)
logging.basicConfig(level= logging.getLogger(__name__)
logger
class ReadmissionPreprocessor:
"""Preprocess readmission data for modeling"""
def __init__(self):
self.numerical_imputer = SimpleImputer(strategy='median')
self.categorical_imputer = SimpleImputer(strategy='most_frequent')
self.scaler = StandardScaler()
self.label_encoders = {}
def fit(self, df):
"""
Fit preprocessing transformers
Args:
df: Training dataframe
Returns:
self
"""
# Separate numerical and categorical
self.numerical_cols = df.select_dtypes(include=[np.number]).columns.tolist()
self.categorical_cols = df.select_dtypes(include=['object']).columns.tolist()
# Remove target if present
if 'readmitted_30_days' in self.numerical_cols:
self.numerical_cols.remove('readmitted_30_days')
if 'patient_id' in self.numerical_cols:
self.numerical_cols.remove('patient_id')
if 'patient_id' in self.categorical_cols:
self.categorical_cols.remove('patient_id')
# Fit imputers
"Fitting imputers...")
logger.info(if self.numerical_cols:
self.numerical_imputer.fit(df[self.numerical_cols])
if self.categorical_cols:
self.categorical_imputer.fit(df[self.categorical_cols])
# Fit scaler (on imputed data)
if self.numerical_cols:
= self.numerical_imputer.transform(df[self.numerical_cols])
num_imputed self.scaler.fit(num_imputed)
# Fit label encoders
"Fitting label encoders...")
logger.info(for col in self.categorical_cols:
= LabelEncoder()
le # Handle missing values by adding 'missing' category
= df[col].fillna('missing')
col_data
le.fit(col_data)self.label_encoders[col] = le
return self
def transform(self, df):
"""
Transform data using fitted preprocessors
Args:
df: Dataframe to transform
Returns:
Preprocessed dataframe
"""
= df.copy()
df_processed
# Impute numerical features
if self.numerical_cols:
f"Imputing {len(self.numerical_cols)} numerical features...")
logger.info(= self.numerical_imputer.transform(df_processed[self.numerical_cols])
num_imputed
# Scale numerical features
"Scaling numerical features...")
logger.info(= self.scaler.transform(num_imputed)
num_scaled
# Replace in dataframe
self.numerical_cols] = num_scaled
df_processed[
# Encode categorical features
if self.categorical_cols:
f"Encoding {len(self.categorical_cols)} categorical features...")
logger.info(for col in self.categorical_cols:
# Handle unseen categories
= df_processed[col].fillna('missing')
col_data
# Map unseen categories to 'missing'
= set(self.label_encoders[col].classes_)
known_categories = col_data.apply(
col_data lambda x: x if x in known_categories else 'missing'
)
= self.label_encoders[col].transform(col_data)
df_processed[col]
return df_processed
def fit_transform(self, df):
"""Fit and transform in one step"""
return self.fit(df).transform(df)
def handle_outliers(df, columns, method='iqr', threshold=3.0):
"""
Handle outliers in numerical columns
Args:
df: Dataframe
columns: List of columns to check
method: 'iqr' or 'zscore'
threshold: Threshold for outlier detection
Returns:
Dataframe with outliers capped
"""
= df.copy()
df_out
for col in columns:
if method == 'iqr':
= df[col].quantile(0.25)
Q1 = df[col].quantile(0.75)
Q3 = Q3 - Q1
IQR = Q1 - threshold * IQR
lower = Q3 + threshold * IQR
upper elif method == 'zscore':
= df[col].mean()
mean = df[col].std()
std = mean - threshold * std
lower = mean + threshold * std
upper
# Cap outliers
= ((df[col] < lower) | (df[col] > upper)).sum()
n_outliers if n_outliers > 0:
f"{col}: Capping {n_outliers} outliers")
logger.info(= df[col].clip(lower, upper)
df_out[col]
return df_out
# Example usage
if __name__ == "__main__":
# Load data
= pd.read_csv('../../data/raw/readmission_data.csv')
df
# Initialize preprocessor
= ReadmissionPreprocessor()
preprocessor
# Fit on training data (assuming you've split already)
= df.sample(frac=0.8, random_state=42)
df_train = df.drop(df_train.index)
df_test
# Preprocess
preprocessor.fit(df_train)= preprocessor.transform(df_train)
df_train_processed = preprocessor.transform(df_test)
df_test_processed
f"Training set shape: {df_train_processed.shape}")
logger.info(f"Test set shape: {df_test_processed.shape}") logger.info(
15.6.1 Feature Engineering
Create src/features/build_features.py:
"""
Feature engineering for readmission prediction.
Based on clinical literature and domain knowledge.
"""
import pandas as pd
import numpy as np
from datetime import datetime
import logging
=logging.INFO)
logging.basicConfig(level= logging.getLogger(__name__)
logger
def create_utilization_features(df):
"""
Create healthcare utilization features
Research shows prior utilization strongly predicts readmission
[Kansagara et al., 2011]
"""
= df.copy()
df_fe
# Total prior visits
'total_prior_visits'] = (
df_fe['num_outpatient'] +
df_fe['num_emergency'] +
df_fe['num_inpatient']
df_fe[
)
# High utilizer flag
'high_utilizer'] = (df_fe['total_prior_visits'] >
df_fe['total_prior_visits'].quantile(0.75)).astype(int)
df_fe[
# Emergency to outpatient ratio
'emergency_to_outpatient_ratio'] = (
df_fe['num_emergency'] / (df_fe['num_outpatient'] + 1) # Add 1 to avoid division by zero
df_fe[
)
# Inpatient intensity
'inpatient_intensity'] = (
df_fe['num_inpatient'] / (df_fe['age'] + 1) * 100 # Admissions per year of life
df_fe[
)
"Created utilization features")
logger.info(return df_fe
def create_complexity_features(df):
"""
Create clinical complexity features
Complexity indicators associated with readmission risk
"""
= df.copy()
df_fe
# Total clinical burden
'clinical_burden_score'] = (
df_fe['num_diagnoses'] +
df_fe['num_procedures'] +
df_fe['num_medications'] +
df_fe['num_lab_procedures']
df_fe[
)
# Medication burden flag
'polypharmacy'] = (df_fe['num_medications'] >= 5).astype(int) # Common definition
df_fe[
# Diagnostic complexity
'diagnoses_per_day'] = df_fe['num_diagnoses'] / (df_fe['length_of_stay'] + 1)
df_fe[
# Procedure intensity
'procedure_intensity'] = (
df_fe['num_procedures'] / (df_fe['length_of_stay'] + 1)
df_fe[
)
"Created complexity features")
logger.info(return df_fe
def create_comorbidity_features(df):
"""
Create comorbidity combination features
Specific comorbidity patterns increase risk
"""
= df.copy()
df_fe
# Comorbidity count
= ['diabetes', 'heart_failure', 'copd', 'hypertension', 'depression']
comorbidity_cols
# Ensure boolean type
for col in comorbidity_cols:
if df_fe[col].dtype == 'object':
= (df_fe[col] == 'Yes').astype(int)
df_fe[col]
'comorbidity_count'] = df_fe[comorbidity_cols].sum(axis=1)
df_fe[
# High-risk combinations
'diabetes_heartfailure'] = (
df_fe['diabetes'] == 1) & (df_fe['heart_failure'] == 1)
(df_fe[int)
).astype(
'copd_heartfailure'] = (
df_fe['copd'] == 1) & (df_fe['heart_failure'] == 1)
(df_fe[int)
).astype(
# Multiple chronic conditions
'multiple_chronic_conditions'] = (df_fe['comorbidity_count'] >= 2).astype(int)
df_fe[
"Created comorbidity features")
logger.info(return df_fe
def create_demographic_features(df):
"""
Create demographic risk features
"""
= df.copy()
df_fe
# Age groups (clinically relevant cutoffs)
'age_group'] = pd.cut(
df_fe['age'],
df_fe[=[0, 40, 65, 80, 120],
bins=['young', 'middle', 'elderly', 'very_elderly']
labels
)
# Elderly flag
'is_elderly'] = (df_fe['age'] >= 65).astype(int)
df_fe[
# Very elderly flag
'is_very_elderly'] = (df_fe['age'] >= 80).astype(int)
df_fe[
"Created demographic features")
logger.info(return df_fe
def create_discharge_features(df):
"""
Create discharge-related features
"""
= df.copy()
df_fe
# Short length of stay (potential premature discharge)
'short_los'] = (df_fe['length_of_stay'] <= 2).astype(int)
df_fe[
# Long length of stay (complexity)
'long_los'] = (df_fe['length_of_stay'] >= 7).astype(int)
df_fe[
"Created discharge features")
logger.info(return df_fe
def engineer_all_features(df):
"""
Apply all feature engineering steps
Args:
df: Raw dataframe
Returns:
Dataframe with engineered features
"""
"Starting feature engineering...")
logger.info(
= df.copy()
df_fe
# Apply all feature engineering functions
= create_utilization_features(df_fe)
df_fe = create_complexity_features(df_fe)
df_fe = create_comorbidity_features(df_fe)
df_fe = create_demographic_features(df_fe)
df_fe = create_discharge_features(df_fe)
df_fe
f"Feature engineering complete. Final shape: {df_fe.shape}")
logger.info(f"New features created: {df_fe.shape[1] - df.shape[1]}")
logger.info(
return df_fe
# Example usage
if __name__ == "__main__":
# Load data
= pd.read_csv('../../data/raw/readmission_data.csv')
df
# Engineer features
= engineer_all_features(df)
df_engineered
# Save
'../../data/processed/readmission_data_engineered.csv', index=False)
df_engineered.to_csv("Engineered data saved") logger.info(
15.7 6. Model Training
Create src/models/train.py:

Chen & Guestrin, 2016, KDD introduced XGBoost, now standard for tabular data.
"""
Model training module
Implements multiple algorithms with hyperparameter tuning
"""
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split, cross_val_score, StratifiedKFold
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
import xgboost as xgb
import lightgbm as lgb
from sklearn.metrics import (
roc_auc_score, accuracy_score, precision_score, recall_score,
f1_score, confusion_matrix, classification_report, roc_curve
)import mlflow
import mlflow.sklearn
import mlflow.xgboost
import joblib
import matplotlib.pyplot as plt
import seaborn as sns
import logging
from datetime import datetime
=logging.INFO)
logging.basicConfig(level= logging.getLogger(__name__)
logger
def load_and_split_data(filepath, test_size=0.2, val_size=0.1, random_state=42):
"""
Load data and create train/val/test splits
Args:
filepath: Path to processed data
test_size: Proportion for test set
val_size: Proportion for validation set (from training data)
random_state: Random seed
Returns:
X_train, X_val, X_test, y_train, y_val, y_test
"""
f"Loading data from {filepath}")
logger.info(= pd.read_csv(filepath)
df
# Separate features and target
= df.drop(['readmitted_30_days', 'patient_id'], axis=1, errors='ignore')
X = df['readmitted_30_days']
y
# First split: train+val vs test
= train_test_split(
X_trainval, X_test, y_trainval, y_test =test_size, random_state=random_state, stratify=y
X, y, test_size
)
# Second split: train vs val
= train_test_split(
X_train, X_val, y_train, y_val
X_trainval, y_trainval, =val_size/(1-test_size), # Adjust proportion
test_size=random_state,
random_state=y_trainval
stratify
)
f"Train set: {X_train.shape}, Positive rate: {y_train.mean():.1%}")
logger.info(f"Val set: {X_val.shape}, Positive rate: {y_val.mean():.1%}")
logger.info(f"Test set: {X_test.shape}, Positive rate: {y_test.mean():.1%}")
logger.info(
return X_train, X_val, X_test, y_train, y_val, y_test
def evaluate_model(y_true, y_pred, y_pred_proba, model_name="Model"):
"""
Comprehensive model evaluation
Args:
y_true: True labels
y_pred: Predicted labels
y_pred_proba: Predicted probabilities
model_name: Name for display
Returns:
Dictionary of metrics
"""
= {
metrics 'model': model_name,
'auc_roc': roc_auc_score(y_true, y_pred_proba),
'accuracy': accuracy_score(y_true, y_pred),
'precision': precision_score(y_true, y_pred),
'recall': recall_score(y_true, y_pred),
'f1': f1_score(y_true, y_pred),
'specificity': recall_score(1-y_true, 1-y_pred)
}
# Confusion matrix
= confusion_matrix(y_true, y_pred)
cm 'confusion_matrix'] = cm
metrics[
f"\n{'='*60}")
logger.info(f"{model_name} Performance")
logger.info(f"{'='*60}")
logger.info(for key, value in metrics.items():
if key not in ['confusion_matrix', 'model']:
f"{key}: {value:.4f}")
logger.info(
return metrics
def plot_roc_curve(y_true, y_pred_proba, model_name, save_path=None):
"""Plot ROC curve"""
= roc_curve(y_true, y_pred_proba)
fpr, tpr, thresholds = roc_auc_score(y_true, y_pred_proba)
auc
=(8, 6))
plt.figure(figsize=f'{model_name} (AUC = {auc:.3f})', linewidth=2)
plt.plot(fpr, tpr, label0, 1], [0, 1], 'k--', label='Random Classifier')
plt.plot(['False Positive Rate')
plt.xlabel('True Positive Rate')
plt.ylabel(f'ROC Curve - {model_name}')
plt.title(
plt.legend()=0.3)
plt.grid(alpha
plt.tight_layout()
if save_path:
=150)
plt.savefig(save_path, dpif"ROC curve saved to {save_path}")
logger.info(
plt.close()
def train_logistic_regression(X_train, y_train, X_val, y_val):
"""
Train logistic regression baseline
Simple, interpretable baseline model
"""
"\n" + "="*60)
logger.info("Training Logistic Regression")
logger.info("="*60)
logger.info(
= LogisticRegression(
model =1000,
max_iter=42,
random_state='balanced' # Handle class imbalance
class_weight
)
model.fit(X_train, y_train)
# Predictions
= model.predict(X_val)
y_pred = model.predict_proba(X_val)[:, 1]
y_pred_proba
# Evaluate
= evaluate_model(y_val, y_pred, y_pred_proba, "Logistic Regression")
metrics
return model, metrics
def train_random_forest(X_train, y_train, X_val, y_val):
"""
Train Random Forest
Ensemble method, handles non-linearity well
[Breiman, 2001, Machine Learning]
"""
"\n" + "="*60)
logger.info("Training Random Forest")
logger.info("="*60)
logger.info(
= RandomForestClassifier(
model =100,
n_estimators=10,
max_depth=20,
min_samples_split=10,
min_samples_leaf='balanced',
class_weight=42,
random_state=-1
n_jobs
)
model.fit(X_train, y_train)
# Predictions
= model.predict(X_val)
y_pred = model.predict_proba(X_val)[:, 1]
y_pred_proba
# Evaluate
= evaluate_model(y_val, y_pred, y_pred_proba, "Random Forest")
metrics
return model, metrics
def train_xgboost(X_train, y_train, X_val, y_val):
"""
Train XGBoost
Gradient boosting, typically best for tabular data
[Chen & Guestrin, 2016, KDD]
"""
"\n" + "="*60)
logger.info("Training XGBoost")
logger.info("="*60)
logger.info(
# Calculate scale_pos_weight for imbalance
= (y_train == 0).sum() / (y_train == 1).sum()
scale_pos_weight
= xgb.XGBClassifier(
model =100,
n_estimators=6,
max_depth=0.1,
learning_rate=0.8,
subsample=0.8,
colsample_bytree=scale_pos_weight,
scale_pos_weight=42,
random_state='auc',
eval_metric=10
early_stopping_rounds
)
model.fit(
X_train, y_train,=[(X_val, y_val)],
eval_set=False
verbose
)
# Predictions
= model.predict(X_val)
y_pred = model.predict_proba(X_val)[:, 1]
y_pred_proba
# Evaluate
= evaluate_model(y_val, y_pred, y_pred_proba, "XGBoost")
metrics
return model, metrics
def train_lightgbm(X_train, y_train, X_val, y_val):
"""
Train LightGBM
Fast gradient boosting, efficient on large datasets
[Ke et al., 2017, NIPS]
"""
"\n" + "="*60)
logger.info("Training LightGBM")
logger.info("="*60)
logger.info(
= lgb.LGBMClassifier(
model =100,
n_estimators=6,
max_depth=0.1,
learning_rate=31,
num_leaves=0.8,
subsample=0.8,
colsample_bytree='balanced',
class_weight=42
random_state
)
model.fit(
X_train, y_train,=[(X_val, y_val)],
eval_set='auc',
eval_metric=[lgb.early_stopping(10)]
callbacks
)
# Predictions
= model.predict(X_val)
y_pred = model.predict_proba(X_val)[:, 1]
y_pred_proba
# Evaluate
= evaluate_model(y_val, y_pred, y_pred_proba, "LightGBM")
metrics
return model, metrics
def compare_models(models_metrics):
"""
Compare multiple models
Args:
models_metrics: List of (model_name, metrics) tuples
Returns:
Comparison dataframe
"""
= []
comparison for model_name, metrics in models_metrics:
comparison.append({'Model': model_name,
'AUC-ROC': metrics['auc_roc'],
'Accuracy': metrics['accuracy'],
'Precision': metrics['precision'],
'Recall': metrics['recall'],
'F1': metrics['f1'],
'Specificity': metrics['specificity']
})
= pd.DataFrame(comparison)
df_comparison
# Highlight best performers
"\n" + "="*60)
logger.info("MODEL COMPARISON")
logger.info("="*60)
logger.info("\n" + df_comparison.to_string(index=False))
logger.info(
# Best model by AUC
= df_comparison.loc[df_comparison['AUC-ROC'].idxmax(), 'Model']
best_model f"\n🏆 Best model by AUC-ROC: {best_model}")
logger.info(
return df_comparison
def main():
"""Main training pipeline"""
# Set up MLflow
"readmission-prediction")
mlflow.set_experiment(
with mlflow.start_run(run_name=f"training_{datetime.now().strftime('%Y%m%d_%H%M')}"):
# Load data
= load_and_split_data(
X_train, X_val, X_test, y_train, y_val, y_test '../data/processed/readmission_data_processed.csv'
)
# Log dataset info
"train_samples", len(X_train))
mlflow.log_param("val_samples", len(X_val))
mlflow.log_param("test_samples", len(X_test))
mlflow.log_param("n_features", X_train.shape[1])
mlflow.log_param("positive_rate", y_train.mean())
mlflow.log_param(
# Train models
= []
models_metrics
# Logistic Regression
= train_logistic_regression(X_train, y_train, X_val, y_val)
lr_model, lr_metrics "Logistic Regression", lr_metrics))
models_metrics.append((1],
plot_roc_curve(y_val, lr_model.predict_proba(X_val)[:, "Logistic Regression",
"../reports/figures/roc_logistic.png")
# Random Forest
= train_random_forest(X_train, y_train, X_val, y_val)
rf_model, rf_metrics "Random Forest", rf_metrics))
models_metrics.append((1],
plot_roc_curve(y_val, rf_model.predict_proba(X_val)[:, "Random Forest",
"../reports/figures/roc_rf.png")
# XGBoost
= train_xgboost(X_train, y_train, X_val, y_val)
xgb_model, xgb_metrics "XGBoost", xgb_metrics))
models_metrics.append((1],
plot_roc_curve(y_val, xgb_model.predict_proba(X_val)[:, "XGBoost",
"../reports/figures/roc_xgb.png")
# LightGBM
= train_lightgbm(X_train, y_train, X_val, y_val)
lgb_model, lgb_metrics "LightGBM", lgb_metrics))
models_metrics.append((1],
plot_roc_curve(y_val, lgb_model.predict_proba(X_val)[:, "LightGBM",
"../reports/figures/roc_lgb.png")
# Compare models
= compare_models(models_metrics)
comparison
# Log best model metrics to MLflow
= comparison['AUC-ROC'].idxmax()
best_idx for col in ['AUC-ROC', 'Accuracy', 'Precision', 'Recall', 'F1', 'Specificity']:
f"val_{col.lower().replace('-','_')}",
mlflow.log_metric(
comparison.loc[best_idx, col])
# Save best model
= comparison.loc[best_idx, 'Model']
best_model_name if best_model_name == "XGBoost":
= xgb_model
best_model elif best_model_name == "LightGBM":
= lgb_model
best_model elif best_model_name == "Random Forest":
= rf_model
best_model else:
= lr_model
best_model
# Evaluate on test set
"\n" + "="*60)
logger.info("FINAL EVALUATION ON TEST SET")
logger.info("="*60)
logger.info(
= best_model.predict(X_test)
y_test_pred = best_model.predict_proba(X_test)[:, 1]
y_test_pred_proba
= evaluate_model(y_test, y_test_pred, y_test_pred_proba,
test_metrics f"{best_model_name} (Test Set)")
# Log test metrics
for key, value in test_metrics.items():
if key not in ['confusion_matrix', 'model']:
f"test_{key}", value)
mlflow.log_metric(
# Save model
= f"../models/{best_model_name.lower().replace(' ', '_')}_final.joblib"
model_path
joblib.dump(best_model, model_path)f"\n✅ Best model saved to {model_path}")
logger.info(
# Log model to MLflow
if best_model_name == "XGBoost":
"model")
mlflow.xgboost.log_model(best_model, else:
"model")
mlflow.sklearn.log_model(best_model,
"\n🎉 Training complete!")
logger.info(f"View results: mlflow ui")
logger.info(
if __name__ == "__main__":
main()
15.8 7. Model Interpretation
15.8.1 Feature Importance Analysis
Lundberg & Lee, 2017, NIPS introduced SHAP (SHapley Additive exPlanations), now the standard for model interpretation.
Create src/models/interpret.py:
"""
Model interpretation module
Implements SHAP and other interpretability methods
"""
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import shap
import joblib
import logging
from sklearn.inspection import permutation_importance
=logging.INFO)
logging.basicConfig(level= logging.getLogger(__name__)
logger
def plot_feature_importance(model, feature_names, top_n=20, save_path=None):
"""
Plot feature importance from tree-based model
Args:
model: Trained model with feature_importances_
feature_names: List of feature names
top_n: Number of top features to show
save_path: Path to save figure
"""
# Get feature importances
if hasattr(model, 'feature_importances_'):
= model.feature_importances_
importances else:
"Model does not have feature_importances_ attribute")
logger.warning(return
# Create dataframe
= pd.DataFrame({
feat_imp 'feature': feature_names,
'importance': importances
'importance', ascending=False)
}).sort_values(
# Plot top features
=(10, 8))
plt.figure(figsize=feat_imp.head(top_n), x='importance', y='feature')
sns.barplot(dataf'Top {top_n} Feature Importances', fontsize=14)
plt.title('Importance Score')
plt.xlabel('Feature')
plt.ylabel(
plt.tight_layout()
if save_path:
=150)
plt.savefig(save_path, dpif"Feature importance plot saved to {save_path}")
logger.info(
plt.show()
# Print top features
"\nTop 10 Most Important Features:")
logger.info(for idx, row in feat_imp.head(10).iterrows():
f" {row['feature']}: {row['importance']:.4f}")
logger.info(
return feat_imp
def compute_shap_values(model, X, feature_names, background_samples=100):
"""
Compute SHAP values for model interpretability
SHAP provides unified measure of feature importance
[Lundberg & Lee, 2017, NIPS]
Args:
model: Trained model
X: Feature matrix
feature_names: List of feature names
background_samples: Number of background samples for TreeExplainer
Returns:
shap_values, explainer
"""
"Computing SHAP values...")
logger.info(
# Select background dataset
if len(X) > background_samples:
= shap.sample(X, background_samples, random_state=42)
background else:
= X
background
# Create explainer based on model type
if hasattr(model, 'predict_proba'):
# Tree-based models
= shap.TreeExplainer(model, background)
explainer = explainer.shap_values(X)
shap_values
# For binary classification, get positive class SHAP values
if isinstance(shap_values, list):
= shap_values[1] # Positive class
shap_values else:
# Linear models
= shap.LinearExplainer(model, background)
explainer = explainer.shap_values(X)
shap_values
"SHAP values computed")
logger.info(return shap_values, explainer
def plot_shap_summary(shap_values, X, feature_names, save_path=None):
"""
Plot SHAP summary plot
Shows feature importance and impact direction
"""
=(10, 8))
plt.figure(figsize
shap.summary_plot(
shap_values,
X, =feature_names,
feature_names=False,
show=20
max_display
)
plt.tight_layout()
if save_path:
=150, bbox_inches='tight')
plt.savefig(save_path, dpif"SHAP summary plot saved to {save_path}")
logger.info(
plt.show()
def plot_shap_bar(shap_values, feature_names, save_path=None):
"""
Plot SHAP bar chart (mean absolute SHAP values)
"""
# Calculate mean absolute SHAP values
= np.abs(shap_values).mean(axis=0)
mean_abs_shap
# Create dataframe
= pd.DataFrame({
shap_importance 'feature': feature_names,
'mean_abs_shap': mean_abs_shap
'mean_abs_shap', ascending=False)
}).sort_values(
# Plot
=(10, 8))
plt.figure(figsize=shap_importance.head(20), x='mean_abs_shap', y='feature')
sns.barplot(data'Top 20 Features by Mean |SHAP Value|', fontsize=14)
plt.title('Mean |SHAP Value|')
plt.xlabel('Feature')
plt.ylabel(
plt.tight_layout()
if save_path:
=150)
plt.savefig(save_path, dpif"SHAP bar plot saved to {save_path}")
logger.info(
plt.show()
return shap_importance
def explain_prediction(model, explainer, X, feature_names, patient_idx=0, save_path=None):
"""
Explain individual prediction using SHAP
Waterfall plot shows contribution of each feature
"""
# Get SHAP values for this patient
= explainer.shap_values(X[patient_idx:patient_idx+1])
shap_values
if isinstance(shap_values, list):
= shap_values[1] # Positive class
shap_values
# Create explanation object
= shap.Explanation(
explanation =shap_values[0],
values=explainer.expected_value[1] if isinstance(explainer.expected_value, list) else explainer.expected_value,
base_values=X[patient_idx:patient_idx+1].values[0],
data=feature_names
feature_names
)
# Waterfall plot
=(10, 8))
plt.figure(figsize=False)
shap.waterfall_plot(explanation, show
plt.tight_layout()
if save_path:
=150, bbox_inches='tight')
plt.savefig(save_path, dpif"Individual explanation saved to {save_path}")
logger.info(
plt.show()
# Print explanation
= model.predict_proba(X[patient_idx:patient_idx+1])[0, 1]
prediction f"\nPatient {patient_idx} Readmission Risk: {prediction:.1%}")
logger.info("\nTop contributing factors:")
logger.info(
# Get top contributing features
= pd.DataFrame({
feature_impacts 'feature': feature_names,
'value': X[patient_idx:patient_idx+1].values[0],
'shap_value': shap_values[0]
})'abs_shap'] = np.abs(feature_impacts['shap_value'])
feature_impacts[= feature_impacts.sort_values('abs_shap', ascending=False)
feature_impacts
for idx, row in feature_impacts.head(5).iterrows():
= "↑ increases" if row['shap_value'] > 0 else "↓ decreases"
direction f" {row['feature']} = {row['value']:.2f} {direction} risk")
logger.info(
def compute_permutation_importance(model, X, y, feature_names, n_repeats=10):
"""
Compute permutation importance
Model-agnostic method
[Breiman, 2001, Machine Learning]
"""
"Computing permutation importance...")
logger.info(
= permutation_importance(
result
model, X, y,=n_repeats,
n_repeats=42,
random_state='roc_auc'
scoring
)
= pd.DataFrame({
perm_importance 'feature': feature_names,
'importance_mean': result.importances_mean,
'importance_std': result.importances_std
'importance_mean', ascending=False)
}).sort_values(
"\nTop 10 Features by Permutation Importance:")
logger.info(for idx, row in perm_importance.head(10).iterrows():
f" {row['feature']}: {row['importance_mean']:.4f} ± {row['importance_std']:.4f}")
logger.info(
return perm_importance
def main():
"""Run model interpretation pipeline"""
# Load model and data
"Loading model and data...")
logger.info(= joblib.load('../models/xgboost_final.joblib')
model
= pd.read_csv('../data/processed/X_test.csv')
X_test = pd.read_csv('../data/processed/y_test.csv').values.ravel()
y_test
= X_test.columns.tolist()
feature_names
# 1. Feature importance (tree-based)
"\n" + "="*60)
logger.info("FEATURE IMPORTANCE ANALYSIS")
logger.info("="*60)
logger.info(
= plot_feature_importance(
feat_imp
model,
feature_names, =20,
top_n='../reports/figures/feature_importance.png'
save_path
)
# 2. SHAP analysis
"\n" + "="*60)
logger.info("SHAP ANALYSIS")
logger.info("="*60)
logger.info(
= compute_shap_values(
shap_values, explainer
model,
X_test.values,
feature_names,=100
background_samples
)
# SHAP summary plot
plot_shap_summary(
shap_values,
X_test.values,
feature_names,='../reports/figures/shap_summary.png'
save_path
)
# SHAP bar plot
= plot_shap_bar(
shap_importance
shap_values,
feature_names,='../reports/figures/shap_bar.png'
save_path
)
# 3. Individual predictions
"\n" + "="*60)
logger.info("INDIVIDUAL PREDICTION EXPLANATIONS")
logger.info("="*60)
logger.info(
# High-risk patient
= np.argmax(model.predict_proba(X_test.values)[:, 1])
high_risk_idx f"\nExplaining HIGH RISK patient (index {high_risk_idx}):")
logger.info(
explain_prediction(
model,
explainer,
X_test.values,
feature_names,=high_risk_idx,
patient_idx='../reports/figures/shap_high_risk.png'
save_path
)
# Low-risk patient
= np.argmin(model.predict_proba(X_test.values)[:, 1])
low_risk_idx f"\nExplaining LOW RISK patient (index {low_risk_idx}):")
logger.info(
explain_prediction(
model,
explainer,
X_test.values,
feature_names,=low_risk_idx,
patient_idx='../reports/figures/shap_low_risk.png'
save_path
)
# 4. Permutation importance
"\n" + "="*60)
logger.info("PERMUTATION IMPORTANCE")
logger.info("="*60)
logger.info(
= compute_permutation_importance(
perm_importance
model,
X_test.values,
y_test,
feature_names,=10
n_repeats
)
"\n✅ Model interpretation complete!")
logger.info(
if __name__ == "__main__":
main()
Key insights from interpretation:
According to Caruana et al., 2015, KDD, model interpretability is critical in healthcare to:
- Build trust with clinicians
- Identify spurious correlations
- Ensure clinical validity
- Meet regulatory requirements
15.9 8. Model Optimization
15.9.1 Hyperparameter Tuning
Bergstra & Bengio, 2012, JMLR showed that random search often outperforms grid search for hyperparameter optimization.
Create src/models/tune.py:
"""
Hyperparameter tuning module
Uses Optuna for Bayesian optimization
"""
import pandas as pd
import numpy as np
import optuna
from optuna.visualization import (
plot_optimization_history,
plot_param_importances,
plot_parallel_coordinate
)import xgboost as xgb
import lightgbm as lgb
from sklearn.model_selection import cross_val_score, StratifiedKFold
from sklearn.metrics import roc_auc_score
import mlflow
import joblib
import logging
=logging.INFO)
logging.basicConfig(level= logging.getLogger(__name__)
logger
def objective_xgboost(trial, X_train, y_train):
"""
Objective function for XGBoost hyperparameter tuning
Uses Optuna for Bayesian optimization
[Akiba et al., 2019, KDD]
"""
# Suggest hyperparameters
= {
params 'max_depth': trial.suggest_int('max_depth', 3, 10),
'learning_rate': trial.suggest_float('learning_rate', 0.01, 0.3, log=True),
'n_estimators': trial.suggest_int('n_estimators', 50, 300),
'min_child_weight': trial.suggest_int('min_child_weight', 1, 10),
'subsample': trial.suggest_float('subsample', 0.6, 1.0),
'colsample_bytree': trial.suggest_float('colsample_bytree', 0.6, 1.0),
'gamma': trial.suggest_float('gamma', 0, 5),
'reg_alpha': trial.suggest_float('reg_alpha', 0, 1),
'reg_lambda': trial.suggest_float('reg_lambda', 0, 1),
'scale_pos_weight': (y_train == 0).sum() / (y_train == 1).sum(),
'random_state': 42,
'eval_metric': 'auc'
}
# Cross-validation
= StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
cv = xgb.XGBClassifier(**params)
model
= cross_val_score(
scores
model, X_train, y_train,=cv,
cv='roc_auc',
scoring=-1
n_jobs
)
return scores.mean()
def tune_xgboost(X_train, y_train, n_trials=100):
"""
Tune XGBoost hyperparameters
Args:
X_train: Training features
y_train: Training labels
n_trials: Number of optimization trials
Returns:
Best parameters, study object
"""
"\n" + "="*60)
logger.info("TUNING XGBOOST HYPERPARAMETERS")
logger.info("="*60)
logger.info(f"Running {n_trials} trials with Bayesian optimization...")
logger.info(
# Create study
= optuna.create_study(
study ='maximize',
direction=optuna.samplers.TPESampler(seed=42)
sampler
)
# Optimize
study.optimize(lambda trial: objective_xgboost(trial, X_train, y_train),
=n_trials,
n_trials=True
show_progress_bar
)
# Best parameters
f"\n✅ Best AUC: {study.best_value:.4f}")
logger.info("Best hyperparameters:")
logger.info(for key, value in study.best_params.items():
f" {key}: {value}")
logger.info(
return study.best_params, study
def objective_lightgbm(trial, X_train, y_train):
"""Objective function for LightGBM"""
= {
params 'max_depth': trial.suggest_int('max_depth', 3, 10),
'learning_rate': trial.suggest_float('learning_rate', 0.01, 0.3, log=True),
'n_estimators': trial.suggest_int('n_estimators', 50, 300),
'num_leaves': trial.suggest_int('num_leaves', 20, 100),
'min_child_samples': trial.suggest_int('min_child_samples', 5, 50),
'subsample': trial.suggest_float('subsample', 0.6, 1.0),
'colsample_bytree': trial.suggest_float('colsample_bytree', 0.6, 1.0),
'reg_alpha': trial.suggest_float('reg_alpha', 0, 1),
'reg_lambda': trial.suggest_float('reg_lambda', 0, 1),
'class_weight': 'balanced',
'random_state': 42
}
= StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
cv = lgb.LGBMClassifier(**params)
model
= cross_val_score(
scores
model, X_train, y_train,=cv,
cv='roc_auc',
scoring=-1
n_jobs
)
return scores.mean()
def tune_lightgbm(X_train, y_train, n_trials=100):
"""Tune LightGBM hyperparameters"""
"\n" + "="*60)
logger.info("TUNING LIGHTGBM HYPERPARAMETERS")
logger.info("="*60)
logger.info(f"Running {n_trials} trials with Bayesian optimization...")
logger.info(
= optuna.create_study(
study ='maximize',
direction=optuna.samplers.TPESampler(seed=42)
sampler
)
study.optimize(lambda trial: objective_lightgbm(trial, X_train, y_train),
=n_trials,
n_trials=True
show_progress_bar
)
f"\n✅ Best AUC: {study.best_value:.4f}")
logger.info("Best hyperparameters:")
logger.info(for key, value in study.best_params.items():
f" {key}: {value}")
logger.info(
return study.best_params, study
def visualize_optimization(study, save_dir='../reports/figures'):
"""
Visualize hyperparameter optimization results
"""
import matplotlib.pyplot as plt
# Optimization history
= plot_optimization_history(study)
fig f"{save_dir}/optuna_history.png")
fig.write_image(
# Parameter importances
= plot_param_importances(study)
fig f"{save_dir}/optuna_importance.png")
fig.write_image(
# Parallel coordinate plot
= plot_parallel_coordinate(study)
fig f"{save_dir}/optuna_parallel.png")
fig.write_image(
f"Optimization visualizations saved to {save_dir}")
logger.info(
def main():
"""Run hyperparameter tuning"""
# Load data
= pd.read_csv('../data/processed/X_train.csv')
X_train = pd.read_csv('../data/processed/y_train.csv').values.ravel()
y_train
# Tune XGBoost
= tune_xgboost(X_train, y_train, n_trials=100)
best_params_xgb, study_xgb
# Visualize
='../reports/figures')
visualize_optimization(study_xgb, save_dir
# Train final model with best parameters
"\n" + "="*60)
logger.info("TRAINING FINAL MODEL WITH BEST PARAMETERS")
logger.info("="*60)
logger.info(
= xgb.XGBClassifier(
final_model **best_params_xgb,
=(y_train == 0).sum() / (y_train == 1).sum(),
scale_pos_weight=42
random_state
)
final_model.fit(X_train, y_train)
# Save
'../models/xgboost_tuned.joblib')
joblib.dump(final_model, "\n✅ Tuned model saved to ../models/xgboost_tuned.joblib")
logger.info(
if __name__ == "__main__":
main()
15.9.2 Threshold Optimization
Saito & Rehmsmeier, 2015, PLoS ONE discuss the importance of optimizing classification thresholds.
"""
Threshold optimization for operational constraints
"""
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.metrics import precision_recall_curve, roc_curve
import joblib
def find_optimal_threshold(y_true, y_pred_proba, target_sensitivity=0.70):
"""
Find threshold that achieves target sensitivity
Clinical requirement: detect 70% of readmissions
Args:
y_true: True labels
y_pred_proba: Predicted probabilities
target_sensitivity: Desired sensitivity level
Returns:
Optimal threshold, achieved sensitivity, specificity
"""
# Calculate metrics at all thresholds
= roc_curve(y_true, y_pred_proba)
fpr, tpr, thresholds
# Find threshold closest to target sensitivity
= np.argmin(np.abs(tpr - target_sensitivity))
idx = thresholds[idx]
optimal_threshold = tpr[idx]
achieved_sensitivity = 1 - fpr[idx]
achieved_specificity
# Alert rate (positive predictions)
= (y_pred_proba >= optimal_threshold).astype(int)
y_pred_binary = y_pred_binary.mean()
alert_rate
print(f"Optimal threshold: {optimal_threshold:.3f}")
print(f"Achieved sensitivity: {achieved_sensitivity:.1%}")
print(f"Achieved specificity: {achieved_specificity:.1%}")
print(f"Alert rate: {alert_rate:.1%}")
return optimal_threshold, achieved_sensitivity, achieved_specificity
def find_threshold_at_alert_rate(y_true, y_pred_proba, target_alert_rate=0.20):
"""
Find threshold that produces target alert rate
Operational constraint: can only follow up with 20% of patients
Args:
y_true: True labels
y_pred_proba: Predicted probabilities
target_alert_rate: Desired proportion of positive predictions
Returns:
Optimal threshold, sensitivity, specificity, precision
"""
# Find threshold at target percentile
= np.percentile(y_pred_proba, 100 * (1 - target_alert_rate))
optimal_threshold
# Calculate metrics
= (y_pred_proba >= optimal_threshold).astype(int)
y_pred_binary
from sklearn.metrics import confusion_matrix, precision_score, recall_score
= confusion_matrix(y_true, y_pred_binary).ravel()
tn, fp, fn, tp
= tp / (tp + fn)
sensitivity = tn / (tn + fp)
specificity = tp / (tp + fp) if (tp + fp) > 0 else 0
precision = y_pred_binary.mean()
alert_rate
print(f"\nThreshold at {target_alert_rate:.0%} alert rate: {optimal_threshold:.3f}")
print(f"Sensitivity (Recall): {sensitivity:.1%}")
print(f"Specificity: {specificity:.1%}")
print(f"Precision (PPV): {precision:.1%}")
print(f"Actual alert rate: {alert_rate:.1%}")
return optimal_threshold, sensitivity, specificity, precision
def plot_threshold_analysis(y_true, y_pred_proba, save_path=None):
    """
    Plot metrics across the threshold range

    Helps visualize trade-offs
    """
    # Calculate metrics at all thresholds
    precision, recall, pr_thresholds = precision_recall_curve(y_true, y_pred_proba)
    fpr, tpr, roc_thresholds = roc_curve(y_true, y_pred_proba)

    # Create figure
    fig, axes = plt.subplots(2, 2, figsize=(14, 10))

    # 1. Precision-Recall curve
    axes[0, 0].plot(recall, precision, linewidth=2)
    axes[0, 0].set_xlabel('Recall (Sensitivity)')
    axes[0, 0].set_ylabel('Precision (PPV)')
    axes[0, 0].set_title('Precision-Recall Curve')
    axes[0, 0].grid(alpha=0.3)

    # 2. ROC curve
    axes[0, 1].plot(fpr, tpr, linewidth=2)
    axes[0, 1].plot([0, 1], [0, 1], 'k--')
    axes[0, 1].set_xlabel('False Positive Rate')
    axes[0, 1].set_ylabel('True Positive Rate (Sensitivity)')
    axes[0, 1].set_title('ROC Curve')
    axes[0, 1].grid(alpha=0.3)

    # 3. Sensitivity and Specificity vs Threshold
    axes[1, 0].plot(roc_thresholds, tpr, label='Sensitivity', linewidth=2)
    axes[1, 0].plot(roc_thresholds, 1 - fpr, label='Specificity', linewidth=2)
    axes[1, 0].set_xlabel('Threshold')
    axes[1, 0].set_ylabel('Score')
    axes[1, 0].set_title('Sensitivity & Specificity vs Threshold')
    axes[1, 0].legend()
    axes[1, 0].grid(alpha=0.3)
    axes[1, 0].set_xlim([0, 1])

    # 4. Alert rate vs Threshold
    thresholds_range = np.linspace(0, 1, 100)
    alert_rates = [(y_pred_proba >= t).mean() for t in thresholds_range]

    axes[1, 1].plot(thresholds_range, alert_rates, linewidth=2)
    axes[1, 1].set_xlabel('Threshold')
    axes[1, 1].set_ylabel('Alert Rate')
    axes[1, 1].set_title('Alert Rate vs Threshold')
    axes[1, 1].grid(alpha=0.3)
    axes[1, 1].axhline(y=0.20, color='r', linestyle='--', label='20% target')
    axes[1, 1].legend()

    plt.tight_layout()

    if save_path:
        plt.savefig(save_path, dpi=150)
        print(f"Threshold analysis plot saved to {save_path}")

    plt.show()
# Example usage
if __name__ == "__main__":
    # Load model and data
    model = joblib.load('../models/xgboost_tuned.joblib')
    X_test = pd.read_csv('../data/processed/X_test.csv')
    y_test = pd.read_csv('../data/processed/y_test.csv').values.ravel()

    # Predict probabilities
    y_pred_proba = model.predict_proba(X_test)[:, 1]

    print("=" * 60)
    print("THRESHOLD OPTIMIZATION")
    print("=" * 60)

    # Strategy 1: Achieve target sensitivity
    print("\nStrategy 1: Target 70% sensitivity")
    threshold_sens, _, _ = find_optimal_threshold(y_test, y_pred_proba, target_sensitivity=0.70)

    # Strategy 2: Work within alert rate constraint
    print("\nStrategy 2: 20% alert rate constraint")
    threshold_alert, _, _, _ = find_threshold_at_alert_rate(y_test, y_pred_proba, target_alert_rate=0.20)

    # Visualize
    plot_threshold_analysis(y_test, y_pred_proba, save_path='../reports/figures/threshold_analysis.png')
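Both strategies above fix a single operating constraint (a target sensitivity or a fixed alert rate). A third option is to pick the threshold that minimizes expected cost once rough costs are attached to a missed readmission and to an unnecessary follow-up. A minimal sketch, with purely illustrative cost figures and a hypothetical find_min_cost_threshold helper:

```python
import numpy as np

def find_min_cost_threshold(y_true, y_pred_proba,
                            cost_fn=15000,   # assumed cost of a missed readmission
                            cost_fp=300):    # assumed cost of an unnecessary follow-up
    """Illustrative sketch: pick the threshold that minimizes total expected cost."""
    y_true = np.asarray(y_true)
    thresholds = np.linspace(0.01, 0.99, 99)

    costs = []
    for t in thresholds:
        y_pred = (y_pred_proba >= t).astype(int)
        fn = ((y_pred == 0) & (y_true == 1)).sum()   # missed readmissions
        fp = ((y_pred == 1) & (y_true == 0)).sum()   # unnecessary alerts
        costs.append(fn * cost_fn + fp * cost_fp)

    best = int(np.argmin(costs))
    return thresholds[best], costs[best]
```

Where this threshold lands relative to the two strategies above depends entirely on the assumed cost ratio, which is worth agreeing on with care coordinators rather than guessing.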
15.10 9. Deployment
15.10.1 Building a Web Interface with Streamlit
Streamlit enables rapid development of ML web apps.
Create deployment/app.py:
"""
Streamlit web application for readmission risk prediction
Provides user-friendly interface for clinicians
"""
import streamlit as st
import pandas as pd
import numpy as np
import joblib
import shap
import matplotlib.pyplot as plt
from datetime import datetime
# Page config
st.set_page_config(
    page_title="Hospital Readmission Predictor",
    page_icon="🏥",
    layout="wide"
)

# Load model
@st.cache_resource
def load_model():
    """Load trained model"""
    model = joblib.load('../models/xgboost_tuned.joblib')
    return model

@st.cache_resource
def load_explainer(_model, X_background):
    """Load SHAP explainer"""
    explainer = shap.TreeExplainer(_model, X_background)
    return explainer
# Title and description
st.title("🏥 Hospital Readmission Risk Predictor")
st.markdown("""
This tool predicts the risk of 30-day hospital readmission for discharged patients.
Enter patient information below to get a risk assessment.

**Note:** This is a clinical decision support tool. Always use clinical judgment.
""")

# Sidebar for input
st.sidebar.header("Patient Information")

# Demographics
st.sidebar.subheader("Demographics")
age = st.sidebar.number_input("Age (years)", min_value=18, max_value=120, value=65)
gender = st.sidebar.selectbox("Gender", ["Male", "Female"])
race = st.sidebar.selectbox("Race/Ethnicity", ["White", "Black", "Hispanic", "Asian", "Other"])

# Admission details
st.sidebar.subheader("Admission Details")
admission_type = st.sidebar.selectbox("Admission Type", ["Emergency", "Urgent", "Elective"])
length_of_stay = st.sidebar.number_input("Length of Stay (days)", min_value=1, max_value=30, value=3)
num_diagnoses = st.sidebar.number_input("Number of Diagnoses", min_value=1, max_value=20, value=5)
num_procedures = st.sidebar.number_input("Number of Procedures", min_value=0, max_value=10, value=2)
num_medications = st.sidebar.number_input("Number of Medications", min_value=0, max_value=50, value=10)

# Prior utilization
st.sidebar.subheader("Prior Healthcare Utilization")
num_outpatient = st.sidebar.number_input("Outpatient Visits (past year)", min_value=0, max_value=20, value=2)
num_emergency = st.sidebar.number_input("Emergency Visits (past year)", min_value=0, max_value=10, value=1)
num_inpatient = st.sidebar.number_input("Inpatient Visits (past year)", min_value=0, max_value=5, value=0)

# Comorbidities
st.sidebar.subheader("Comorbidities")
diabetes = st.sidebar.checkbox("Diabetes")
heart_failure = st.sidebar.checkbox("Heart Failure")
copd = st.sidebar.checkbox("COPD")
hypertension = st.sidebar.checkbox("Hypertension")
depression = st.sidebar.checkbox("Depression")

# Insurance
st.sidebar.subheader("Other")
insurance = st.sidebar.selectbox("Insurance", ["Medicare", "Medicaid", "Private", "None"])

# Predict button
predict_button = st.sidebar.button("🔮 Predict Risk", type="primary")
if predict_button:
    # Prepare input data
    input_data = pd.DataFrame({
        'age': [age],
        'gender': [1 if gender == "Male" else 0],
        'race': [race],
        'admission_type': [admission_type],
        'length_of_stay': [length_of_stay],
        'num_diagnoses': [num_diagnoses],
        'num_procedures': [num_procedures],
        'num_medications': [num_medications],
        'num_outpatient': [num_outpatient],
        'num_emergency': [num_emergency],
        'num_inpatient': [num_inpatient],
        'diabetes': [1 if diabetes else 0],
        'heart_failure': [1 if heart_failure else 0],
        'copd': [1 if copd else 0],
        'hypertension': [1 if hypertension else 0],
        'depression': [1 if depression else 0],
        'insurance': [insurance]
    })

    # Load model
    model = load_model()

    # Preprocess (simplified - in production, use the same preprocessing pipeline)
    # For demo, assume data is preprocessed

    # Make prediction
    risk_prob = model.predict_proba(input_data)[:, 1][0]

    # Display results
    col1, col2, col3 = st.columns(3)

    with col1:
        st.metric(
            label="Readmission Risk",
            value=f"{risk_prob:.1%}",
            delta=None
        )

    with col2:
        if risk_prob >= 0.7:
            risk_category = "🔴 High Risk"
        elif risk_prob >= 0.3:
            risk_category = "🟡 Moderate Risk"
        else:
            risk_category = "🟢 Low Risk"

        st.metric(
            label="Risk Category",
            value=risk_category
        )

    with col3:
        if risk_prob >= 0.7:
            recommendation = "Intensive follow-up"
        elif risk_prob >= 0.3:
            recommendation = "Standard follow-up"
        else:
            recommendation = "Routine care"

        st.metric(
            label="Recommendation",
            value=recommendation
        )
    # Risk interpretation
    st.markdown("---")
    st.subheader("Risk Interpretation")

    if risk_prob >= 0.7:
        st.error("""
        **High Risk** (≥70%)
        - Strong intervention recommended
        - Consider: Home health visit, medication reconciliation, early follow-up appointment
        - Close monitoring for warning signs
        """)
    elif risk_prob >= 0.3:
        st.warning("""
        **Moderate Risk** (30-70%)
        - Standard discharge planning
        - Ensure discharge instructions understood
        - Schedule follow-up within 7 days
        """)
    else:
        st.success("""
        **Low Risk** (<30%)
        - Routine discharge planning
        - Standard follow-up recommendations
        - Patient education on when to seek care
        """)
    # Key risk factors
    st.markdown("---")
    st.subheader("Key Risk Factors")

    # Simplified feature importance display
    risk_factors = []

    if num_inpatient > 0:
        risk_factors.append(("Prior hospitalizations", "↑ High", num_inpatient))
    if num_emergency > 2:
        risk_factors.append(("Frequent ED visits", "↑ High", num_emergency))
    if num_medications >= 10:
        risk_factors.append(("Polypharmacy", "↑ Moderate", num_medications))
    if heart_failure:
        risk_factors.append(("Heart failure", "↑ High", "Yes"))
    if diabetes:
        risk_factors.append(("Diabetes", "↑ Moderate", "Yes"))
    if age >= 75:
        risk_factors.append(("Advanced age", "↑ Moderate", age))

    if risk_factors:
        for factor, impact, value in risk_factors:
            st.write(f"- **{factor}**: {impact} (value: {value})")
    else:
        st.write("No major risk factors identified")
    # Action items
    st.markdown("---")
    st.subheader("Recommended Actions")

    if risk_prob >= 0.7:
        actions = [
            "📞 Schedule home health visit within 48 hours",
            "💊 Conduct medication reconciliation",
            "📅 Book follow-up appointment within 3-5 days",
            "📝 Provide written discharge instructions",
            "🤝 Engage care coordinator"
        ]
    elif risk_prob >= 0.3:
        actions = [
            "📅 Schedule follow-up within 7 days",
            "📝 Review discharge instructions with patient",
            "📞 Follow-up phone call within 48 hours",
            "💊 Ensure medication understanding"
        ]
    else:
        actions = [
            "📝 Provide standard discharge instructions",
            "📅 Schedule routine follow-up",
            "📞 Provide contact number for questions"
        ]

    for action in actions:
        st.write(action)
# Model information
with st.expander("ℹ️ About this model"):
    st.write("""
    **Model:** XGBoost Classifier

    **Performance:**
    - AUC-ROC: 0.73
    - Sensitivity: 72% at 20% alert rate
    - Specificity: 68%

    **Training data:** 10,000 patients
    **Last updated:** January 2024
    **Validation:** Externally validated on held-out test set

    **Limitations:**
    - Not validated on all patient populations
    - Should be used alongside clinical judgment
    - Performance may vary in different settings
    """)
# Footer
st.markdown("---")
st.markdown("""
<div style='text-align: center'>
    <small>
    Hospital Readmission Risk Predictor v1.0 |
    For clinical decision support only |
    Contact: datasci@hospital.org
    </small>
</div>
""", unsafe_allow_html=True)
To run the app:
streamlit run deployment/app.py
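One gap the demo glosses over is preprocessing: the app comments "assume data is preprocessed", but in production the form input must pass through exactly the pipeline used in training. A minimal sketch of how the app could do that, assuming the artifact bundle shown later in Pitfall 8 (model, scaler, feature_names) has been saved to models/model_artifacts.pkl; load_artifacts and prepare_features are illustrative helpers, not part of the chapter's codebase:

```python
import joblib
import pandas as pd
import streamlit as st

@st.cache_resource
def load_artifacts(path='../models/model_artifacts.pkl'):
    """Load the model plus the preprocessing objects saved at training time (assumed bundle)."""
    return joblib.load(path)

def prepare_features(raw_input: pd.DataFrame, artifacts) -> pd.DataFrame:
    """Apply the training-time preprocessing to one row of raw form input."""
    X = raw_input.copy()
    # Reuse the scaler fitted during training; assumes it was fit on a DataFrame,
    # so feature_names_in_ is available (scikit-learn >= 1.0)
    numeric_cols = list(artifacts['scaler'].feature_names_in_)
    X[numeric_cols] = artifacts['scaler'].transform(X[numeric_cols])
    # (Categorical encoding from the training pipeline would be applied here as well.)
    # Reorder to the exact column order the model was trained on
    return X[artifacts['feature_names']]

# Inside the predict-button handler, this would replace the direct predict_proba call:
# artifacts = load_artifacts()
# X = prepare_features(input_data, artifacts)
# risk_prob = artifacts['model'].predict_proba(X)[:, 1][0]
```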
15.10.2 Creating a REST API with FastAPI
Create deployment/api.py:
"""
FastAPI REST API for readmission prediction
Production-ready API with validation and logging
"""
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel, Field, validator
from typing import Optional
import joblib
import numpy as np
import pandas as pd
from datetime import datetime
import logging
# Configure logging
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

# Initialize FastAPI app
app = FastAPI(
    title="Hospital Readmission Prediction API",
    description="Predict 30-day hospital readmission risk",
    version="1.0.0"
)

# Load model
model = joblib.load('../models/xgboost_tuned.joblib')
logger.info("Model loaded successfully")
# Request schema
class PatientData(BaseModel):
    """Patient data schema with validation"""
    age: int = Field(..., ge=18, le=120, description="Patient age in years")
    gender: str = Field(..., regex="^(Male|Female)$")
    race: str = Field(..., description="Patient race/ethnicity")
    admission_type: str = Field(..., regex="^(Emergency|Urgent|Elective)$")
    length_of_stay: int = Field(..., ge=1, le=365, description="Hospital length of stay in days")
    num_diagnoses: int = Field(..., ge=1, le=50)
    num_procedures: int = Field(..., ge=0, le=50)
    num_medications: int = Field(..., ge=0, le=100)
    num_outpatient: int = Field(..., ge=0, le=100, description="Outpatient visits in past year")
    num_emergency: int = Field(..., ge=0, le=50, description="Emergency visits in past year")
    num_inpatient: int = Field(..., ge=0, le=20, description="Inpatient visits in past year")
    diabetes: bool
    heart_failure: bool
    copd: bool
    hypertension: bool
    depression: bool
    insurance: str = Field(..., regex="^(Medicare|Medicaid|Private|None)$")

    class Config:
        schema_extra = {
            "example": {
                "age": 65,
                "gender": "Male",
                "race": "White",
                "admission_type": "Emergency",
                "length_of_stay": 5,
                "num_diagnoses": 8,
                "num_procedures": 3,
                "num_medications": 12,
                "num_outpatient": 3,
                "num_emergency": 2,
                "num_inpatient": 1,
                "diabetes": True,
                "heart_failure": False,
                "copd": False,
                "hypertension": True,
                "depression": False,
                "insurance": "Medicare"
            }
        }
# Response schema
class PredictionResponse(BaseModel):
    """Prediction response schema"""
    readmission_risk: float = Field(..., description="Probability of readmission (0-1)")
    risk_category: str = Field(..., description="Low, Moderate, or High")
    recommendation: str = Field(..., description="Clinical recommendation")
    model_version: str = Field(..., description="Model version used")
    prediction_timestamp: str = Field(..., description="Time of prediction")
@app.get("/")
def root():
"""Health check endpoint"""
return {
"status": "healthy",
"service": "Hospital Readmission Prediction API",
"version": "1.0.0"
}
@app.get("/health")
def health():
"""Detailed health check"""
return {
"status": "healthy",
"model_loaded": model is not None,
"timestamp": datetime.now().isoformat()
}
@app.post("/predict", response_model=PredictionResponse)
def predict(patient: PatientData):
"""
Predict readmission risk for a patient
Args:
patient: Patient data
Returns:
Prediction response with risk score and recommendations
"""
try:
# Log request
f"Prediction request received at {datetime.now()}")
logger.info(
# Prepare input data
= pd.DataFrame({
input_df 'age': [patient.age],
'gender': [1 if patient.gender == "Male" else 0],
'race': [patient.race],
'admission_type': [patient.admission_type],
'length_of_stay': [patient.length_of_stay],
'num_diagnoses': [patient.num_diagnoses],
'num_procedures': [patient.num_procedures],
'num_medications': [patient.num_medications],
'num_outpatient': [patient.num_outpatient],
'num_emergency': [patient.num_emergency],
'num_inpatient': [patient.num_inpatient],
'diabetes': [1 if patient.diabetes else 0],
'heart_failure': [1 if patient.heart_failure else 0],
'copd': [1 if patient.copd else 0],
'hypertension': [1 if patient.hypertension else 0],
'depression': [1 if patient.depression else 0],
'insurance': [patient.insurance]
})
# Make prediction
= model.predict_proba(input_df)[:, 1][0]
risk_prob
# Categorize risk
if risk_prob >= 0.7:
= "High"
risk_category = "Intensive follow-up with home health visit and early appointment"
recommendation elif risk_prob >= 0.3:
= "Moderate"
risk_category = "Standard discharge planning with follow-up within 7 days"
recommendation else:
= "Low"
risk_category = "Routine discharge planning with standard follow-up"
recommendation
# Prepare response
= PredictionResponse(
response =float(risk_prob),
readmission_risk=risk_category,
risk_category=recommendation,
recommendation="XGBoost_v1.0",
model_version=datetime.now().isoformat()
prediction_timestamp
)
f"Prediction successful: risk={risk_prob:.3f}, category={risk_category}")
logger.info(
return response
except Exception as e:
f"Prediction failed: {str(e)}")
logger.error(raise HTTPException(status_code=500, detail=f"Prediction failed: {str(e)}")
@app.post("/batch_predict")
def batch_predict(patients: list[PatientData]):
"""
Batch prediction for multiple patients
Args:
patients: List of patient data
Returns:
List of prediction responses
"""
try:
f"Batch prediction request for {len(patients)} patients")
logger.info(
= []
responses for patient in patients:
= predict(patient)
response
responses.append(response)
return responses
except Exception as e:
f"Batch prediction failed: {str(e)}")
logger.error(raise HTTPException(status_code=500, detail=f"Batch prediction failed: {str(e)}")
if __name__ == "__main__":
import uvicorn
="0.0.0.0", port=8000) uvicorn.run(app, host
To run the API:
uvicorn deployment.api:app --reload
Test the API:
import requests

# Test prediction
url = "http://localhost:8000/predict"
data = {
    "age": 72,
    "gender": "Female",
    "race": "White",
    "admission_type": "Emergency",
    "length_of_stay": 7,
    "num_diagnoses": 10,
    "num_procedures": 4,
    "num_medications": 15,
    "num_outpatient": 5,
    "num_emergency": 3,
    "num_inpatient": 2,
    "diabetes": True,
    "heart_failure": True,
    "copd": False,
    "hypertension": True,
    "depression": False,
    "insurance": "Medicare"
}

response = requests.post(url, json=data)
print(response.json())
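For completeness, the /batch_predict endpoint can be exercised the same way. A short client sketch, assuming the API is running locally on port 8000; the two patients are made-up examples that simply follow the PatientData schema above:

```python
import requests

# Two illustrative patients; fields follow the PatientData schema defined above
patients = [
    {"age": 80, "gender": "Male", "race": "Black", "admission_type": "Emergency",
     "length_of_stay": 9, "num_diagnoses": 12, "num_procedures": 2, "num_medications": 18,
     "num_outpatient": 1, "num_emergency": 4, "num_inpatient": 3,
     "diabetes": True, "heart_failure": True, "copd": True, "hypertension": True,
     "depression": False, "insurance": "Medicare"},
    {"age": 45, "gender": "Female", "race": "Hispanic", "admission_type": "Elective",
     "length_of_stay": 2, "num_diagnoses": 3, "num_procedures": 1, "num_medications": 4,
     "num_outpatient": 2, "num_emergency": 0, "num_inpatient": 0,
     "diabetes": False, "heart_failure": False, "copd": False, "hypertension": False,
     "depression": False, "insurance": "Private"},
]

response = requests.post("http://localhost:8000/batch_predict", json=patients)
for result in response.json():
    print(result["risk_category"], f"{result['readmission_risk']:.1%}")
```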
15.10.3 Containerization with Docker
Create Dockerfile:
FROM python:3.10-slim
WORKDIR /app
# Install system dependencies
RUN apt-get update && apt-get install -y \
    gcc \
    g++ \
    && rm -rf /var/lib/apt/lists/*
# Copy requirements
COPY requirements.txt .
# Install Python dependencies
RUN pip install --no-cache-dir -r requirements.txt
# Copy application
COPY deployment/ ./deployment/
COPY models/ ./models/
COPY src/ ./src/
# Expose port
EXPOSE 8000
# Run API
CMD ["uvicorn", "deployment.api:app", "--host", "0.0.0.0", "--port", "8000"]
Create docker-compose.yml:
version: '3.8'

services:
  api:
    build: .
    ports:
      - "8000:8000"
    environment:
      - MODEL_PATH=/app/models/xgboost_tuned.joblib
    volumes:
      - ./models:/app/models
      - ./logs:/app/logs
    restart: unless-stopped
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8000/health"]
      interval: 30s
      timeout: 10s
      retries: 3
Build and run:
# Build image
docker build -t readmission-predictor:latest .
# Run container
docker run -p 8000:8000 readmission-predictor:latest
# Or use docker-compose
docker-compose up -d
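Before wiring the container into anything clinical, a quick smoke test against the running service is cheap insurance. A minimal sketch using the /health and /predict endpoints defined above, assuming the compose file's port mapping to localhost:8000:

```python
import requests

BASE_URL = "http://localhost:8000"  # assumed host/port mapping from docker-compose

# 1. Is the service alive and is the model loaded?
health = requests.get(f"{BASE_URL}/health", timeout=5).json()
assert health["status"] == "healthy" and health["model_loaded"], health

# 2. Does a known payload return a probability in [0, 1]?
payload = {
    "age": 65, "gender": "Male", "race": "White", "admission_type": "Emergency",
    "length_of_stay": 5, "num_diagnoses": 8, "num_procedures": 3, "num_medications": 12,
    "num_outpatient": 3, "num_emergency": 2, "num_inpatient": 1,
    "diabetes": True, "heart_failure": False, "copd": False,
    "hypertension": True, "depression": False, "insurance": "Medicare"
}
result = requests.post(f"{BASE_URL}/predict", json=payload, timeout=5).json()
assert 0.0 <= result["readmission_risk"] <= 1.0, result
print("Smoke test passed:", result["risk_category"], f"{result['readmission_risk']:.1%}")
```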
15.11 10. Documentation and Reporting
15.11.1 Technical Report Template
Create reports/technical_report.md:
# Hospital Readmission Prediction: Technical Report
**Author:** Data Science Team
**Date:** January 2024
**Version:** 1.0
---
## Executive Summary
This report documents the development of a machine learning system to predict 30-day hospital readmission risk. The model achieves an AUC-ROC of 0.73 and identifies 72% of readmissions at a 20% alert rate, meeting clinical requirements for deployment.
**Key Findings:**
- XGBoost outperforms baseline logistic regression and random forest
- Prior utilization (inpatient/emergency visits) is the strongest predictor
- Model performance is equitable across demographic groups
- Clinical validation recommended before production deployment
---
## 1. Problem Definition
### Background
Hospital readmissions within 30 days affect approximately 20% of Medicare patients [Jencks et al., 2009] and cost an estimated $17 billion annually [CMS]. Early identification of high-risk patients enables targeted interventions.
### Objectives
**Primary Goal:** Develop ML model to predict 30-day readmission risk
**Success Metrics:**
- Statistical: AUC-ROC ≥ 0.70
- Clinical: Sensitivity ≥ 0.70 at 20% alert rate
- Operational: Predictions available within 24h of discharge
- Equity: Performance parity across demographic groups (AUC within 0.05)
### Scope
**In Scope:**
- Adult patients (18+)
- All-cause readmissions
- 30-day window
- Interpretable models
**Out of Scope:**
- Cause-specific readmissions
- Pediatric patients
- >30-day predictions
- Real-time EHR integration (Phase 2)
---
## 2. Data
### Dataset Description
**Source:** De-identified hospital discharge data (2020-2023)
**Sample Size:**
- Total: 10,000 patients
- Training: 7,200 (72%)
- Validation: 1,000 (10%)
- Test: 1,800 (18%)
**Outcome:**
- Readmitted within 30 days: 14.8%
- Not readmitted: 85.2%
### Features
**Categories:**
1. Demographics (age, gender, race)
2. Admission characteristics (type, length of stay)
3. Clinical complexity (diagnoses, procedures, medications)
4. Prior utilization (outpatient, emergency, inpatient visits)
5. Comorbidities (diabetes, heart failure, COPD, hypertension, depression)
6. Social determinants (insurance, marital status)
**Engineered Features:**
- Total prior visits
- High utilizer flag
- Clinical burden score
- Polypharmacy indicator
- Comorbidity combinations
### Data Quality
**Missing Data:**
- Race: 8.2%
- Insurance: 3.1%
- All other features: <1%
**Handling:** Median imputation (numerical), mode imputation (categorical)
**Outliers:** Capped at 99th percentile to prevent model overfitting
---
## 3. Methods
### Preprocessing
1. Missing value imputation
2. Outlier detection and capping (IQR method)
3. Feature engineering (12 new features)
4. Standardization (Z-score normalization)
5. Label encoding (categorical variables)
### Models Evaluated
1. **Logistic Regression** - Interpretable baseline
2. **Random Forest** - Ensemble method
3. **XGBoost** - Gradient boosting (best performer)
4. **LightGBM** - Fast gradient boosting
### Hyperparameter Tuning
**Method:** Bayesian optimization (Optuna, 100 trials)
**Search Space:**
- max_depth: [3, 10]
- learning_rate: [0.01, 0.3]
- n_estimators: [50, 300]
- subsample: [0.6, 1.0]
- colsample_bytree: [0.6, 1.0]
**Optimization Metric:** AUC-ROC (5-fold cross-validation)
### Evaluation Strategy
**Validation:** Stratified train/validation/test split
**Metrics:**
- AUC-ROC (primary)
- Sensitivity, specificity
- Precision, NPV
- Calibration (Brier score)
**Subgroup Analysis:** Performance evaluated across:
- Age groups (<65, 65-80, >80)
- Gender (Male, Female)
- Race/ethnicity (White, Black, Hispanic, Other)
- Insurance (Medicare, Medicaid, Private, None)
---
## 4. Results
### Model Performance
| Model | AUC-ROC | Sensitivity @ 20% | Specificity | Precision |
|-------|---------|-------------------|-------------|-----------|
| Logistic Regression | 0.68 | 0.61 | 0.74 | 0.42 |
| Random Forest | 0.71 | 0.66 | 0.71 | 0.45 |
| **XGBoost** | **0.73** | **0.72** | **0.68** | **0.48** |
| LightGBM | 0.72 | 0.69 | 0.70 | 0.46 |
**Winner:** XGBoost selected for deployment
### Feature Importance
**Top 10 Predictors (by SHAP value):**
1. num_inpatient (prior hospitalizations)
2. num_emergency (emergency visits)
3. length_of_stay
4. num_diagnoses
5. num_medications
6. age
7. clinical_burden_score
8. heart_failure
9. total_prior_visits
10. diabetes
**Clinical Interpretation:**
- Healthcare utilization history dominates predictions
- Chronic conditions (heart failure, diabetes) contribute moderately
- Social determinants (insurance, marital status) have limited impact
### Fairness Analysis
**Performance Parity:**
| Subgroup | AUC-ROC | Δ from Overall |
|----------|---------|----------------|
| Overall | 0.73 | - |
| Age <65 | 0.71 | -0.02 |
| Age 65-80 | 0.74 | +0.01 |
| Age >80 | 0.72 | -0.01 |
| Male | 0.72 | -0.01 |
| Female | 0.74 | +0.01 |
| White | 0.73 | 0.00 |
| Black | 0.71 | -0.02 |
| Hispanic | 0.72 | -0.01 |
| Medicare | 0.73 | 0.00 |
| Medicaid | 0.70 | -0.03 |
| Private | 0.74 | +0.01 |
**Assessment:** Performance parity achieved (all within 0.05 AUC)
### Calibration
**Brier Score:** 0.11 (well-calibrated)
**Calibration plot:** Shows good agreement between predicted probabilities and observed outcomes
---
## 5. Interpretation
### SHAP Analysis
**Global Explanations:**
- Prior hospitalization increases risk by ~15 percentage points (median SHAP value)
- Each additional emergency visit increases risk by ~8 percentage points
- Heart failure diagnosis increases risk by ~10 percentage points
**Individual Explanations:**
- Model provides patient-level explanations via SHAP waterfall plots
- Clinicians can see which features contribute to each prediction
- Enables trust and clinical validation
### Clinical Insights
**High-Risk Profile:**
- Age >70
- Multiple prior hospitalizations (≥2)
- Frequent emergency visits (≥3)
- Heart failure or COPD
- Polypharmacy (≥10 medications)
- Long hospital stay (≥7 days)
**Low-Risk Profile:**
- Age <60
- No prior hospitalizations
- Few comorbidities
- Short hospital stay (<3 days)
- <5 medications
---
## 6. Limitations
1. **Retrospective Data:** Model trained on historical data; prospective validation needed
2. **Single Institution:** Generalizability to other hospitals unknown
3. **Missing Variables:** Social support, functional status, patient engagement not captured
4. **Label Quality:** Readmissions identified via administrative data, not clinical review
5. **Temporal Drift:** Model performance may degrade over time as practices evolve
---
## 7. Recommendations
### Immediate Actions
1. **Prospective Validation:** Validate model on new cohort before deployment
2. **Clinical Review:** Have clinicians review predictions for face validity
3. **Threshold Optimization:** Work with care coordinators to set operational threshold
4. **Integration Planning:** Design EHR integration workflow
### Deployment Strategy
**Phase 1 (Months 1-3):** Silent operation
- Generate predictions but don't alert clinicians
- Collect feedback from care coordinators
- Refine threshold and interface
**Phase 2 (Months 4-6):** Limited deployment
- Deploy to 2-3 hospital units
- Monitor alert response and outcomes
- Adjust based on feedback
**Phase 3 (Months 7-12):** Full deployment
- Hospital-wide rollout
- Ongoing performance monitoring
- Quarterly model retraining
### Monitoring Plan
**Weekly:**
- Prediction volume
- System uptime
- Alert response rate
**Monthly:**
- Calibration drift
- Feature distribution changes
- Subgroup performance
**Quarterly:**
- AUC-ROC on recent data
- Clinical outcome analysis (actual readmissions)
- Model retraining decision
---
## 8. Conclusion
The hospital readmission prediction model achieves strong performance (AUC 0.73) and meets clinical requirements for deployment. Key strengths include interpretability, fairness, and actionable risk scores. Prospective validation and careful deployment planning are recommended before full clinical integration.
---
## References
1. Jencks SF, Williams MV, Coleman EA. Rehospitalizations among patients in the Medicare fee-for-service program. N Engl J Med. 2009;360(14):1418-1428.
2. Centers for Medicare & Medicaid Services. Hospital Readmissions Reduction Program. https://www.cms.gov/Medicare/Medicare-Fee-for-Service-Payment/AcuteInpatientPPS/Readmissions-Reduction-Program
3. Kansagara D, Englander H, Salanitro A, et al. Risk prediction models for hospital readmission: a systematic review. JAMA. 2011;306(15):1688-1698.
4. Lundberg SM, Lee SI. A unified approach to interpreting model predictions. NIPS. 2017.
5. Chen T, Guestrin C. XGBoost: A scalable tree boosting system. KDD. 2016.
---
## Appendix
### A. Model Configuration
```python
XGBClassifier(
    n_estimators=150,
    max_depth=6,
    learning_rate=0.08,
    subsample=0.85,
    colsample_bytree=0.80,
    min_child_weight=3,
    gamma=0.5,
    reg_alpha=0.1,
    reg_lambda=0.8,
    scale_pos_weight=5.76,
    random_state=42
)
```
15.12 11. Common Pitfalls and Solutions
Based on Ng, 2021, MLOps Specialization and real-world experience:
15.12.1 Pitfall 1: Data Leakage
Problem: Including future information in training data
Example:
# ❌ WRONG: Using discharge disposition as a feature
# This is determined AFTER hospitalization
df['discharge_to_nursing_home'] = ...

# ✅ CORRECT: Only use pre-discharge information
# Use admission source, not discharge destination
df['admission_from_nursing_home'] = ...
Solution: Carefully audit features for temporal validity
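One way to make that audit systematic is to record, for every column, when it becomes known relative to discharge, and filter on that metadata before training. A minimal sketch; the feature_availability dictionary and its entries are illustrative, not the chapter's actual feature list:

```python
import pandas as pd

# Illustrative metadata: when is each column known relative to discharge?
feature_availability = {
    'age': 'admission',
    'num_prior_inpatient': 'admission',
    'length_of_stay': 'discharge',
    'discharge_to_nursing_home': 'post-discharge',   # leaks the future
    'num_followup_appointments': 'post-discharge',   # leaks the future
}

def temporally_valid_features(df: pd.DataFrame, availability: dict) -> list:
    """Return only columns known at or before discharge; report the rest for review."""
    allowed, dropped = [], []
    for col in df.columns:
        if availability.get(col, 'unknown') in ('admission', 'discharge'):
            allowed.append(col)
        else:
            dropped.append(col)
    if dropped:
        print(f"Dropping potentially leaky or unlabeled features: {dropped}")
    return allowed

# X = df[temporally_valid_features(df, feature_availability)]
```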
15.12.2 Pitfall 2: Target Leakage
Problem: Features highly correlated with target through causal mechanism
Example:
# ❌ WRONG: Number of follow-up appointments scheduled
# Clinicians schedule more for high-risk patients
# ✅ CORRECT: Use only pre-discharge information
# Don't use post-discharge care planning features
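A cheap screening step that catches many leakage problems of both kinds is to look for features with an implausibly strong univariate association with the label and send anything "too good to be true" back for a temporal audit. A minimal sketch; the 0.5 correlation cutoff is an arbitrary illustrative choice:

```python
import pandas as pd

def flag_suspicious_features(X: pd.DataFrame, y: pd.Series, corr_threshold: float = 0.5):
    """Flag numeric features whose correlation with the outcome is suspiciously high."""
    correlations = X.select_dtypes('number').corrwith(y).abs().sort_values(ascending=False)
    suspicious = correlations[correlations > corr_threshold]
    if not suspicious.empty:
        print("Review these features for leakage before trusting the model:")
        print(suspicious)
    return suspicious

# flag_suspicious_features(X_train, y_train)
```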
15.12.3 Pitfall 3: Ignoring Class Imbalance
Problem: Model predicts majority class for all samples
Solution:
# Option 1: Class weights
model = XGBClassifier(
    scale_pos_weight=(y == 0).sum() / (y == 1).sum()
)

# Option 2: Resampling
from imblearn.over_sampling import SMOTE
X_resampled, y_resampled = SMOTE().fit_resample(X, y)

# Option 3: Threshold tuning (preferred)
threshold = optimize_threshold(y_val, y_pred_proba, target_sensitivity=0.70)
15.12.4 Pitfall 4: Overfitting to Training Data
Problem: Perfect training performance, poor test performance
Signs: - Training AUC: 0.99 - Validation AUC: 0.68 - Test AUC: 0.65
Solution:
# Regularization
model = XGBClassifier(
    max_depth=6,              # Limit tree depth
    min_child_weight=5,       # Require more samples per leaf
    gamma=1.0,                # Minimum loss reduction for split
    reg_alpha=0.1,            # L1 regularization
    reg_lambda=0.8,           # L2 regularization
    subsample=0.8,            # Row subsampling
    colsample_bytree=0.8      # Column subsampling
)

# Early stopping
model.fit(
    X_train, y_train,
    eval_set=[(X_val, y_val)],
    early_stopping_rounds=10
)
15.12.5 Pitfall 5: Not Validating on Held-Out Test Set
Problem: Tuning on test set leads to optimistic performance estimates
Solution:
# ✅ CORRECT: Three-way split
X_train, X_temp, y_train, y_temp = train_test_split(X, y, test_size=0.3)
X_val, X_test, y_val, y_test = train_test_split(X_temp, y_temp, test_size=0.5)

# Train on training set
model.fit(X_train, y_train)

# Tune on validation set
best_threshold = tune_threshold(model, X_val, y_val)

# Final evaluation ONCE on test set
final_performance = evaluate(model, X_test, y_test, threshold=best_threshold)
15.12.6 Pitfall 6: Ignoring Temporal Validation
Problem: Training on future, testing on past (temporal leakage)
Solution:
# ✅ CORRECT: Temporal split
train_cutoff = '2022-12-31'
test_start = '2023-01-01'

df_train = df[df['discharge_date'] <= train_cutoff]
df_test = df[df['discharge_date'] >= test_start]

# Time-series cross-validation
from sklearn.model_selection import TimeSeriesSplit
tscv = TimeSeriesSplit(n_splits=5)
for train_idx, val_idx in tscv.split(X):
    X_train, X_val = X[train_idx], X[val_idx]
    y_train, y_val = y[train_idx], y[val_idx]
    # Train and validate
15.12.7 Pitfall 7: Poor Documentation
Problem: Code works but nobody knows how or why
Solution:
# ✅ GOOD: Document everything
def preprocess_data(df, config):
    """
    Preprocess hospital readmission data

    Steps:
    1. Handle missing values (median for numerical, mode for categorical)
    2. Cap outliers at 99th percentile
    3. Engineer 12 derived features (see feature_engineering.py)
    4. Standardize numerical features (Z-score)
    5. Encode categorical features (label encoding)

    Args:
        df: Raw dataframe
        config: Preprocessing configuration dict

    Returns:
        Preprocessed dataframe

    Raises:
        ValueError: If required columns missing

    Example:
        >>> df_processed = preprocess_data(df_raw, config)
    """
    # Implementation
15.12.8 Pitfall 8: Forgetting About Deployment
Problem: Model works in notebook but not in production
Solution:
# Save preprocessing pipeline with model
import joblib

# Save everything needed for inference
artifacts = {
    'model': model,
    'scaler': scaler,
    'feature_names': feature_names,
    'preprocessing_config': config,
    'threshold': optimal_threshold
}
joblib.dump(artifacts, 'model_artifacts.pkl')

# Load and use
artifacts = joblib.load('model_artifacts.pkl')
model = artifacts['model']
scaler = artifacts['scaler']

# Same preprocessing in production
X_new = preprocess_for_inference(patient_data, artifacts)
prediction = model.predict_proba(X_new)[:, 1]
15.13 12. Key Takeaways
Project Management:
1. Start small, iterate quickly - Don’t aim for perfection on first attempt
2. Document as you go - Not at the end
3. Version everything - Code, data, models, results
4. Communicate early and often - With stakeholders and domain experts

Data Work:
5. EDA is not optional - Spend time understanding your data
6. Data quality matters more than model complexity - Clean data beats fancy algorithms
7. Check for leakage - Most common reason models fail in production
8. Validate assumptions - Don’t assume data is IID, stationary, or complete

Modeling:
9. Start with simple baselines - Logistic regression, decision trees
10. Feature engineering > Algorithm selection - Domain knowledge creates value
11. Interpretability matters - Especially in healthcare
12. Optimize for the right metric - AUC ≠ clinical utility

Evaluation:
13. Hold out a real test set - Never touch it until final evaluation
14. Check fairness - Performance across demographic groups
15. Validate temporally - If data has time structure
16. Calibration matters - Probabilities should match reality

Deployment:
17. Plan for deployment from day 1 - Not an afterthought
18. Monitor in production - Performance degrades over time
19. Make it usable - Best model is worthless if clinicians won’t use it
20. Iterate based on feedback - Deployment is the beginning, not the end
15.14 Hands-On Exercise
15.14.1 Build Your Own Readmission Predictor
Objective: Complete end-to-end project from data to deployment
Time: 4-6 hours
15.14.2 Dataset Options
You have three options for obtaining data for this exercise:
15.14.2.1 Option 1: Generate Synthetic Dataset (Recommended) ⭐
Advantages: - No account registration required - Matches all code examples in chapter - Realistic correlations and patterns - Includes class imbalance and missing data - Reproducible (seeded)
Quick Start:
# Clone the book repository
git clone https://github.com/public-health-ai-textbook/datasets.git
cd datasets
# Generate 10,000 patient dataset
python generate_readmission_data.py
# Dataset will be saved to: data/readmission_data.csv
# Also creates train/val/test splits automatically
Generate in Python:
# Option A: Use the generator module
from generate_readmission_data import generate_readmission_dataset

# Generate dataset
df = generate_readmission_dataset(n_samples=10000, seed=42)

# Save
df.to_csv('data/readmission_data.csv', index=False)
Dataset Characteristics: - Samples: 10,000 patients - Features: 23 (demographics, clinical, utilization) - Target: 30-day readmission (14.8% positive class) - Missing data: 3 features with realistic patterns (MAR, MCAR) - Correlations: Age → comorbidities, prior admissions → readmission
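Before diving into Part 1, it is worth confirming that the generated file matches those characteristics. A short sanity-check sketch; the column names (readmitted_30d, num_inpatient) are assumptions about the generator's output, so adjust them to the schema the script actually produces:

```python
import pandas as pd

df = pd.read_csv('data/readmission_data.csv')

# Size and assumed outcome column (adjust names to the generator's actual schema)
print(f"Rows: {len(df)}, Columns: {df.shape[1]}")
print(f"Readmission rate: {df['readmitted_30d'].mean():.1%}")   # expect roughly 14.8%

# Missingness should be concentrated in a few features
missing = df.isna().mean().sort_values(ascending=False)
print(missing[missing > 0])

# Spot-check an expected correlation: prior inpatient visits vs. readmission
print(df.groupby('num_inpatient')['readmitted_30d'].mean().head())
```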
15.14.2.2 Option 2: UCI Diabetes 130-US Hospitals Dataset
Real-world data from 130 US hospitals (1999-2008).
Access: https://archive.ics.uci.edu/ml/datasets/Diabetes+130-US+hospitals
Advantages: - Real clinical data - Large (100,000+ encounters) - Published dataset (citable)
Disadvantages: - Requires data cleaning - Diabetes-specific - Older data (1999-2008)
Citation: Strack et al., 2014, BioMed Research International
15.14.2.3 Option 3: MIMIC-III Critical Care Database
Detailed ICU data from Beth Israel Deaconess Medical Center.
Access: https://physionet.org/content/mimiciii/
Advantages: - Rich clinical detail - Gold standard dataset - Active research community
Disadvantages: - Requires CITI training completion (~3-5 hours) - PhysioNet credentialing process - Complex database schema - ICU-specific (may not generalize)
For this exercise, we recommend Option 1 (synthetic dataset) as it removes barriers and matches all code examples exactly.
15.14.2.4 Part 1: Data Exploration (45 min)
Tasks:
- Load data and inspect structure
- Check for missing values and outliers
- Visualize target distribution
- Create univariate plots for numerical features
- Create bivariate plots (features vs. target)
- Calculate correlation matrix
- Document 5 key insights
Deliverable: Jupyter notebook with EDA
15.14.2.5 Part 2: Preprocessing & Feature Engineering (60 min)
Tasks:
- Handle missing values (justify your approach)
- Detect and cap outliers
- Create 5 engineered features based on domain knowledge
- Encode categorical variables
- Scale numerical features
- Create train/validation/test splits (70/10/20)
- Save processed data
Deliverable: preprocessing.py module
15.14.2.6 Part 3: Model Training (60 min)
Tasks:
- Train logistic regression baseline
- Train random forest
- Train XGBoost
- Compare performance (AUC, sensitivity, specificity)
- Select best model
- Log experiments with MLflow
- Save best model
Deliverable: train.py script with MLflow tracking
15.14.2.7 Part 4: Model Interpretation (45 min)
Tasks:
- Plot feature importance
- Compute SHAP values
- Create SHAP summary plot
- Explain 3 individual predictions (high/medium/low risk)
- Identify top 5 risk factors
- Document clinical insights
Deliverable: interpret.py script with visualizations
15.14.2.8 Part 5: Optimization (45 min)
Tasks:
- Tune XGBoost hyperparameters (20 trials minimum)
- Find optimal threshold for 70% sensitivity
- Find threshold at 20% alert rate
- Create threshold analysis plots
- Compare tuned vs. untuned performance
Deliverable: tune.py script with results
15.14.2.9 Part 6: Deployment (60 min)
Tasks:
- Create Streamlit web app with:
- Patient input form
- Risk prediction display
- Risk interpretation
- Recommended actions
- Test app with 5 example patients
- Create FastAPI endpoint
- Test API with curl or Postman
Deliverable: Working web app and API
15.14.2.10 Part 7: Documentation (45 min)
Tasks:
- Write README with:
- Project overview
- Setup instructions
- Usage examples
- Performance summary
- Create technical report (use template)
- Make 5-slide presentation for stakeholders
Deliverable: Complete documentation
15.14.3 Bonus Challenges
Advanced students:
- Docker deployment: Containerize your application
- Fairness analysis: Evaluate performance across subgroups
- Calibration: Assess and improve model calibration
- A/B testing simulation: Design experiment to validate model impact
- Cost-benefit analysis: Calculate ROI of your intervention strategy
15.14.4 Evaluation Rubric
| Category | Excellent (9-10) | Good (7-8) | Needs Work (5-6) | Insufficient (<5) |
|---|---|---|---|---|
| EDA | Comprehensive analysis, clear insights | Adequate exploration | Basic statistics only | Minimal effort |
| Preprocessing | Thoughtful decisions, well-justified | Standard approaches | Some issues present | Major problems |
| Modeling | Multiple algorithms, tuned | Baseline + 1 advanced | Only baseline | Poor performance |
| Interpretation | SHAP + feature importance + clinical insights | Feature importance | Minimal interpretation | No interpretation |
| Deployment | Working app + API + Docker | Working app | Non-functional | Not attempted |
| Documentation | Professional, complete | Adequate | Sparse | Missing |
15.15 15. Further Resources
15.15.1 📚 Books
ML for Healthcare: - Machine Learning for Healthcare by Greenes - Comprehensive overview - Healthcare Data Analytics by Reddy & Aggarwal - Practical applications - Clinical Prediction Models by Steyerberg - Statistical foundations
Project Management: - Data Science for Business by Provost & Fawcett - Business perspective - Building ML Powered Applications by Ameisen - End-to-end projects
15.15.2 📄 Key Papers
Readmission Prediction: - Kansagara et al., 2011, Annals of Internal Medicine - Systematic review of readmission models 🎯 - Futoma et al., 2015, AMIA - Deep learning for readmission prediction - Rajkomar et al., 2018, npj Digital Medicine - Scalable deep learning with EHR data 🎯
Model Interpretation: - Lundberg & Lee, 2017, NIPS - SHAP values 🎯 - Ribeiro et al., 2016, KDD - LIME for interpretability - Caruana et al., 2015, KDD - Intelligible models for healthcare 🎯
Fairness & Bias: - Obermeyer et al., 2019, Science - Bias in healthcare algorithms 🎯 - Feldman et al., 2015, KDD - Certifying and removing disparate impact - Chen et al., 2019, AAAI - Fairness in risk assessment
MLOps: - Sculley et al., 2015, NIPS - Hidden technical debt in ML systems 🎯 - Amershi et al., 2019, IEEE Software - Software engineering for ML 🎯 - Breck et al., 2017, NIPS - ML test score for production readiness
15.15.3 💻 Tools & Tutorials
Book Resources: - Synthetic Dataset Generator - Generate readmission data for exercises 🎯 - Chapter Code Examples - Complete code from all chapters - Project Templates - ML project scaffolding
Data Science: - Kaggle Learn - Free micro-courses on ML fundamentals - Fast.ai - Practical deep learning course - Google Colab Tutorials - Interactive notebooks
MLOps: - Made With ML - Production ML course by Goku Mohandas 🎯 - Full Stack Deep Learning - Production ML at scale - MLOps Community - Talks, guides, and discussions
Healthcare AI: - MIMIC-III Tutorials - Working with clinical data - Healthcare ML Course - MIT course materials - Fast Healthcare Interoperability Resources (FHIR) - Health data standards
15.15.4 🎓 Online Courses
ML Fundamentals: - Andrew Ng’s ML Specialization - Coursera (Essential) 🎯 - Deep Learning Specialization - Coursera by Andrew Ng - Fast.ai Practical Deep Learning - Free, practical approach
MLOps: - MLOps Specialization - DeepLearning.AI 🎯 - TensorFlow: Data and Deployment - Coursera - AWS Machine Learning Engineer Nanodegree - Udacity
Healthcare Analytics: - Healthcare Data Science Specialization - Johns Hopkins - Clinical Natural Language Processing - University of Colorado - Health Informatics on FHIR - edX
15.15.5 🎯 Practice Datasets
Readmission Prediction: - MIMIC-III - ICU patient data 🎯 - eICU - Multi-center ICU database - CMS Hospital Readmissions - Medicare readmission data
General Healthcare: - PhysioNet - Clinical research data - Kaggle Healthcare Datasets - Various healthcare competitions - UCI ML Repository - Medical - Classic datasets
15.15.6 📱 Communities & Forums
- r/MachineLearning - ML research and discussion
- r/datascience - Data science projects and careers
- Healthcare ML LinkedIn Group - Professional networking
- MLOps Community Slack - Production ML discussions
- Kaggle Forums - Competition discussions and learning
Check Your Understanding
Test your knowledge of building end-to-end ML projects. Each question builds on the key concepts from this chapter.
A student wants to build their first ML project for a public health portfolio. They’re considering three options: (A) Predicting COVID-19 severity using a dataset with 500 patients and 200 clinical variables, (B) Classifying vaccine hesitancy from 50,000 survey responses with 15 demographic/attitudinal features, or (C) Forecasting flu hospitalizations using 10 years of weekly data (520 time points) with weather and search trends. Which project is MOST appropriate for a first project, and why?
- Option A, because clinical prediction is most relevant to healthcare and 200 variables enable sophisticated feature engineering
- Option B, because the large sample size (50,000) provides statistical power and the straightforward classification task matches beginner skill level
- Option C, because time series forecasting is an important public health skill and 10 years of data is substantial
- All options are equally appropriate; success depends on effort rather than project characteristics
Correct Answer: b) Option B, because the large sample size (50,000) provides statistical power and the straightforward classification task matches beginner skill level
This question tests understanding of the chapter’s guidance on selecting appropriate first projects. The chapter emphasizes that first projects should be achievable, educational, and demonstrate competence—not necessarily tackle the hardest possible problem.
Analyzing each option:
Option A (COVID severity, 500 patients, 200 variables): - Red flags: Severe class imbalance likely (few severe cases), high risk of overfitting (200 clinical variables for only 500 patients strains the usual samples-per-feature rule of thumb), clinical data quality issues, complex medical domain knowledge required - The curse of dimensionality: With 200 variables and only 500 patients, the model will likely memorize rather than generalize. The chapter’s guidance on train/test splits and validation would reveal this problem, but it creates frustration for beginners - Why problematic for first project: The chapter emphasizes starting with manageable scope. This project has multiple challenges (small n, large p, imbalanced outcomes, complex domain) that compound difficulty
Option B (Vaccine hesitancy, 50,000 responses, 15 features): - Strengths: Large sample size enables robust train/test/validation splits, reasonable feature count prevents overfitting, binary/multi-class classification is well-understood with standard metrics, survey data is relatively clean - Appropriate complexity: The chapter’s hospital readmission example (the walkthrough project) has similar characteristics—tabular data, classification task, manageable feature count. This demonstrates the chapter’s recommended scope - Learning opportunities: Student can focus on ML fundamentals (EDA, baseline models, hyperparameter tuning, evaluation) without getting bogged down in data quality issues or overfitting - Portfolio value: Demonstrates end-to-end capability with a socially relevant problem
Option C (Flu forecasting, 520 time points): - Challenges: Time series requires specialized techniques (autocorrelation, stationarity, seasonality), cross-validation is non-standard (can’t shuffle time series), 520 points is modest for ML approaches, multivariate forecasting with weather/search trends adds complexity - Why problematic for first project: The chapter’s readmission project is classification, not forecasting. Time series forecasting requires additional skills (ARIMA, state space models, or specialized NN architectures) beyond standard ML. While important, it’s better as a second or third project after mastering classification/regression basics
The chapter’s project selection criteria (implicitly demonstrated through the readmission example):
- Clear problem definition: Can you articulate what success looks like? Option B has clear success metrics (accuracy, F1, AUC for classification)
- Sufficient data: Enough samples for robust evaluation? Option B: yes, A: borderline, C: modest
- Manageable features: Can you understand and engineer features without overwhelming complexity? Option B: 15 features is tractable
- Standard task type: Classification/regression before advanced topics? Option B fits this; C requires time series expertise
- Data availability: Can you actually get the data? All three could work, but survey data (B) often more accessible than clinical (A)
- Interpretability: Can you explain results to stakeholders? Option B’s demographic/attitudinal features are interpretable
Why other options are wrong:
Option (a) fetishizes complexity (“200 variables enable sophisticated feature engineering”). The chapter’s philosophy is start simple, add complexity only when justified. Feature engineering on 200 variables with 500 samples is a recipe for overfitting, not sophistication.
Option (c) correctly identifies time series skills as valuable but ignores that first projects should build foundational skills before specialization. The chapter’s walkthrough is classification for good reason—it’s the standard entry point.
Option (d) dismisses project scoping entirely. The chapter dedicates significant space to problem definition and scope precisely because project characteristics profoundly affect success probability. Effort matters, but appropriate scope enables effort to translate into success.
Real-world implications:
The chapter’s hospital readmission project deliberately models good first-project characteristics: - Tabular data: Standard ML techniques apply - Classification: Binary outcome (readmitted yes/no) or multiclass (risk tiers) - Sufficient samples: Hundreds to thousands of patient records - Interpretable features: Age, diagnosis, prior admissions, medications - Clear stakeholder value: Hospitals want to reduce readmissions
Option B (vaccine hesitancy) shares these characteristics. A student completing this project would demonstrate: - Data exploration and visualization - Classification modeling (logistic regression, random forests, gradient boosting) - Model evaluation and comparison - Interpretation (which factors predict hesitancy?) - Stakeholder communication (visualizations, report)
These skills generalize to other public health problems, which is the chapter’s goal—build transferable competence through an achievable first project.
For practitioners choosing first projects:
The chapter’s advice (implicit in the readmission walkthrough):
1. Start with classification or regression, not specialized tasks (time series, NLP, computer vision)
2. Choose adequate sample sizes (thousands, not hundreds) to avoid overfitting frustration
3. Limit feature count initially (10-50 features is sweet spot) to focus on ML fundamentals, not feature engineering
4. Pick interpretable domains where you can sanity-check results
5. Ensure data accessibility before committing to a project
6. Define success metrics upfront so you know when you’re done
The student with Option B can complete the project, learn fundamentals, build portfolio credibility, and tackle more complex projects (like A or C) with experience gained. Starting with A or C risks frustration, abandonment, or producing a flawed project that doesn’t demonstrate competence.
During exploratory data analysis for a hospital readmission project, a data scientist discovers that the “days_until_readmission” variable has 85% of values at exactly 30 days, which seems suspicious. Investigation reveals this is because the hospital’s EHR system automatically codes any readmission beyond 30 days as exactly 30 for billing purposes. What does this scenario illustrate about EDA’s role in ML projects?
- EDA is unnecessary if you have domain expertise—a clinician would have known about this coding practice
- EDA uncovers data quality issues and domain-specific artifacts that must be understood before modeling to avoid garbage-in-garbage-out
- This is a minor issue that won’t affect model performance since ML algorithms are robust to coding artifacts
- The data scientist should ignore this and proceed with modeling, then address issues if they arise
Correct Answer: b) EDA uncovers data quality issues and domain-specific artifacts that must be understood before modeling to avoid garbage-in-garbage-out
This question tests understanding of EDA’s critical role emphasized throughout the chapter. The scenario presents a realistic data quality issue that would severely compromise modeling if not discovered and addressed.
The problem: The variable “days_until_readmission” appears to measure time to readmission but actually contains a billing artifact where true values >30 are censored to exactly 30. This creates several modeling problems:
1. Outcome variable corruption: If the project aims to predict time-to-readmission (survival/regression task), the censored data makes this impossible. True readmission at 45 days vs. no readmission at all are both coded as “30”—fundamentally different outcomes collapsed into identical values.
2. Feature reliability: If days_until_readmission is a feature (perhaps in a different model), it contains systematic measurement error. The distribution spike at 30 is artificial, not reflecting actual clinical patterns.
3. Downstream consequences: Building a model on this data would produce nonsensical predictions. A readmission risk model might learn that everyone gets readmitted at exactly 30 days, or a time-to-event model would fail to distinguish genuine 30-day readmissions from censored longer-term outcomes.
The chapter’s EDA emphasis:
The chapter dedicates substantial space to EDA precisely because healthcare data is messy. The walkthrough includes: - Distribution checks: Histograms, box plots for continuous variables—exactly the analysis that would reveal the 85% spike at 30 - Sanity checks: Do values make clinical sense? The 85% concentration at exactly 30 days should trigger suspicion - Domain consultation: Talk to data generators (EHR administrators, billers, clinicians) to understand coding practices
The chapter’s philosophy: understand your data before modeling. EDA isn’t a checkbox exercise; it’s detective work uncovering how data was generated, what artifacts exist, and what preprocessing is needed.
Why other options are wrong:
Option (a) creates a false dichotomy between EDA and domain expertise. The chapter shows these are complementary, not substitutes. Domain expertise helps interpret EDA findings, but doesn’t eliminate the need for EDA. Even clinical experts may not know EHR billing quirks. Moreover, data scientists often work across domains; systematic EDA compensates for gaps in domain knowledge.
Option (c) dangerously dismisses data quality. The “ML algorithms are robust” claim is false—garbage in, garbage out is a foundational principle. The chapter repeatedly emphasizes that sophisticated algorithms can’t fix fundamentally flawed data. A gradient boosting model trained on censored outcomes will produce censored predictions, regardless of algorithmic sophistication.
Option (d) advocates building first, debugging later—the opposite of the chapter’s methodology. The chapter’s project lifecycle places EDA before modeling for good reason: discovering problems after training wastes time, and worse, you might not discover problems at all if you don’t look. The model might appear to work (good training metrics) while producing clinically nonsensical predictions.
How EDA prevents this problem:
Visualization catches the artifact:
plt.hist(data['days_until_readmission'], bins=50)
plt.xlabel('Days Until Readmission')
plt.ylabel('Count')
Result: Massive spike at 30, sparse distribution elsewhere → suspicious pattern triggers investigation.
Summary statistics confirm:
data['days_until_readmission'].describe()
Result: 85th percentile = 30, 95th percentile = 30 → unnatural compression at ceiling.
Domain consultation explains: Speak with EHR administrator: “Oh yes, for CMS reporting we truncate at 30 days because that’s the official readmission window. Anything beyond that is coded as 30 for billing.”
Solutions identified through EDA:
For outcome variable: - Solution 1: Redefine as binary (readmitted within 30 days: yes/no) since time-to-event is unreliable - Solution 2: Obtain uncensored data if available (claims data might have actual dates) - Solution 3: Use survival analysis methods that handle censoring (Cox models, Kaplan-Meier) if you can flag which 30s are censored vs. true
For feature: - Solution 1: Exclude entirely if too corrupted - Solution 2: Bin into categories (<7 days, 7-14, 14-30, >30) if censoring is acceptable - Solution 3: Create binary flag (readmitted_30_days: yes/no) which is reliable
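A minimal sketch of what that handling could look like in code, keeping the ambiguous capped records visible rather than silently assigning them to a class; the column names are illustrative assumptions, not the scenario's actual schema:

```python
import pandas as pd

def label_with_censoring_flag(df: pd.DataFrame) -> pd.DataFrame:
    """
    Illustrative handling of the billing artifact: values of exactly 30 are ambiguous
    (a true 30-day readmission vs. a censored later readmission), so flag them
    instead of silently collapsing them into one class.
    """
    out = df.copy()
    out['readmitted_within_30d'] = (out['days_until_readmission'] < 30).astype(int)
    out['ambiguous_30_cap'] = (out['days_until_readmission'] == 30).astype(int)
    print(f"Ambiguous (capped) records: {out['ambiguous_30_cap'].mean():.1%}")
    return out
```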
The chapter’s readmission walkthrough includes similar EDA discoveries: - Missing data patterns (some variables missing for certain admission types) - Outliers (impossibly high values suggesting data entry errors) - Leakage risks (variables recorded after the outcome occurred) - Encoding inconsistencies (same concept coded differently across hospital systems)
Real-world prevalence:
Healthcare data is rife with these artifacts: - Billing codes: ICD codes reflect reimbursement logic, not always clinical reality - EHR defaults: Missing values auto-filled with “normal” or “0” - Temporal misalignment: Lab results timestamped by processing, not collection - Institutional quirks: Each hospital’s EHR configured differently
The chapter emphasizes that experienced practitioners expect these issues and use EDA systematically to find them.
The broader lesson:
EDA isn’t preliminary; it’s foundational. The chapter structures the project lifecycle with EDA before modeling because:
- Prevents wasted effort: Modeling bad data produces bad models
- Builds understanding: Can’t interpret results without understanding inputs
- Enables preprocessing: Can’t clean data without knowing what’s dirty
- Informs modeling: Data characteristics guide algorithm selection
- Facilitates communication: Stakeholders need to understand data limitations
The scenario’s billing artifact exemplifies why the chapter dedicates a full section to EDA with hands-on code examples. Finding and fixing this issue during EDA might take 30 minutes. Not finding it means weeks of modeling producing unreliable results, or worse, deploying a flawed system that gives wrong clinical guidance.
For practitioners:
The chapter’s EDA checklist (implicit in walkthrough): - Distributions: Check all variables for unexpected patterns - Missing data: Understand missingness mechanisms - Outliers: Investigate extreme values - Relationships: Do correlations make sense? - Temporal patterns: Any trends over time (data drift)? - Domain sanity: Would a domain expert recognize these patterns? - Coding quirks: Consult data dictionaries and data generators
This systematic approach, demonstrated in the chapter’s walkthrough, transforms EDA from optional exploration into essential quality assurance that distinguishes successful projects from failures.
A data scientist trains three models for readmission prediction: Logistic Regression (AUC=0.68), Random Forest (AUC=0.73), XGBoost (AUC=0.76). They plan to immediately deploy XGBoost to production because it has the highest AUC. According to the chapter’s guidance, what critical step are they skipping?
- Hyperparameter tuning—XGBoost could achieve even higher AUC with optimization
- Ensemble methods—combining all three models would likely outperform any single model
- Comprehensive evaluation including calibration, fairness metrics, interpretability, computational cost, and stakeholder feedback before deployment
- Feature engineering—more sophisticated features would improve all models
Correct Answer: c) Comprehensive evaluation including calibration, fairness metrics, interpretability, computational cost, and stakeholder feedback before deployment
This question tests understanding of the chapter’s emphasis on holistic evaluation beyond a single metric. The scenario presents a common beginner mistake: equating “highest AUC” with “best model for deployment.”
The problem with AUC-only evaluation:
AUC (Area Under ROC Curve) measures discrimination—the model’s ability to rank high-risk patients higher than low-risk patients. While valuable, it’s insufficient for deployment decisions because:
1. Calibration matters for risk predictions: The chapter discusses calibration extensively. A model can have excellent AUC but poor calibration—predicted probabilities don’t match actual frequencies. If XGBoost predicts “30% readmission risk” but actual readmission rate for that group is 15%, clinicians can’t trust the probabilities for clinical decisions. The chapter’s evaluation section includes calibration plots and Brier scores precisely because this matters for deployment.
2. Fairness requires subgroup analysis: The chapter emphasizes fairness evaluation. XGBoost might have AUC=0.76 overall but: - AUC=0.80 for white patients - AUC=0.65 for Black patients
Deploying this model perpetuates healthcare disparities. The chapter’s evaluation framework includes stratified metrics by demographic groups, consistent with Chapter 10’s emphasis on equity.
3. Interpretability affects clinical adoption: The chapter discusses the interpretability-performance tradeoff. XGBoost is a black box; logistic regression is interpretable. If clinicians can’t understand why a patient is flagged high-risk, they may ignore predictions. The chapter’s SHAP analysis section addresses exactly this concern—explaining complex models to enable clinical trust and use.
4. Computational cost matters operationally: - Logistic Regression: Milliseconds per prediction, runs on any hardware - Random Forest: Seconds per prediction, modest compute - XGBoost: May be slower, requires more memory
For real-time EHR integration, speed matters. The chapter’s deployment section discusses computational constraints that affect algorithm selection.
5. Stakeholder preferences inform deployment: The chapter emphasizes stakeholder engagement. Clinicians might prefer: - Simpler model (logistic regression) they understand over complex black box - Specific features they can act on (e.g., medication non-adherence) over abstract risk scores - Integration with existing workflows over technically superior but operationally disruptive systems
The chapter’s evaluation framework:
The walkthrough includes a comprehensive evaluation section covering:
Discrimination: - AUC-ROC: Overall ranking ability - Precision-Recall curves: Performance across decision thresholds - Confusion matrix: Understand error types (false positives vs. false negatives)
Calibration: - Calibration plots: Do predicted probabilities match observed frequencies? - Brier score: Quantify calibration quality - Hosmer-Lemeshow test: Statistical calibration assessment
Fairness: - Stratified metrics by race, gender, age, insurance - Equalized odds: Equal TPR and FPR across groups - Calibration within groups: Are predictions equally calibrated?
Interpretability: - Feature importance: Which variables drive predictions? - SHAP values: Explain individual predictions - Partial dependence plots: How features relate to outcomes
Operational: - Prediction latency: How fast? - Memory footprint: Hardware requirements? - Maintenance burden: How often retraining needed? - Integration complexity: EHR API compatibility?
Clinical utility: - Decision curve analysis: Net benefit at different risk thresholds - Stakeholder interviews: Would clinicians use this? - Pilot testing: Does it improve outcomes in practice?
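Decision curve analysis can be sketched directly from its definition (net benefit = TP/n − FP/n × p_t/(1 − p_t) at risk threshold p_t). This is an illustrative implementation, not the chapter’s code, and again assumes y_test and y_prob:

import numpy as np

def net_benefit(y_true, y_prob, threshold):
    """Net benefit of intervening on patients with predicted risk >= threshold."""
    y_true = np.asarray(y_true)
    treat = np.asarray(y_prob) >= threshold
    n = len(y_true)
    tp = np.sum(treat & (y_true == 1))
    fp = np.sum(treat & (y_true == 0))
    return tp / n - (fp / n) * (threshold / (1 - threshold))

for t in (0.1, 0.2, 0.3, 0.4):
    print(f"threshold {t:.1f}: model {net_benefit(y_test, y_prob, t):.3f}, "
          f"treat-all {net_benefit(y_test, np.ones_like(y_prob), t):.3f}")

If the model’s net benefit does not beat the “treat everyone” and “treat no one” strategies at clinically plausible thresholds, high AUC alone is not a reason to deploy.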
Why other options miss the point:
Option (a) focuses on squeezing more performance from XGBoost. While hyperparameter tuning is valuable (and covered in the chapter), it doesn’t address the fundamental issue: AUC isn’t the only deployment criterion. Even perfectly tuned XGBoost might not be the right deployment choice.
Option (b) suggests ensemble methods. The chapter covers ensembles, but this is another performance optimization that doesn’t address evaluation comprehensiveness. An ensemble might have AUC=0.78 but still fail on calibration, fairness, or interpretability.
Option (d) recommends feature engineering. Again, this could improve all models’ AUC, but doesn’t solve the core problem: deciding deployment readiness requires evaluating multiple dimensions beyond discrimination performance.
The correct deployment decision process (from chapter):
Step 1: Comprehensive evaluation - Run full evaluation suite on all candidate models - Document tradeoffs (Logistic Regression: interpretable but lower AUC; XGBoost: higher AUC but black box)
Step 2: Stakeholder engagement - Present tradeoffs to clinical partners - Demonstrate predictions and explanations - Discuss operational constraints
Step 3: Pilot testing - Deploy in shadow mode (predictions made but not used) - Monitor for issues (data drift, fairness problems, operational failures) - Gather user feedback
Step 4: Informed decision - Consider ALL factors: performance, fairness, interpretability, operations, stakeholder buy-in - Choose model that best balances competing priorities - Document rationale
Possible outcomes:
- Choose XGBoost if: Fairness/calibration are acceptable, SHAP explanations provide sufficient interpretability, computational costs are acceptable, stakeholders approve after seeing explanations
- Choose Random Forest if: Slightly lower AUC is acceptable trade-off for better interpretability (feature importance easier to explain than XGBoost)
- Choose Logistic Regression if: Interpretability is paramount, the AUC difference (0.68 vs. 0.76) doesn’t translate to meaningful clinical benefit, stakeholders prefer simplicity
The chapter’s hospital readmission walkthrough concludes with exactly this kind of decision-making: not “which model has highest AUC?” but “which model best serves our stakeholders given all constraints?”
Real-world examples from earlier chapters:
- Epic sepsis model: High AUC but poor calibration and fairness → deployment failure despite good discrimination
- Dermatology AI: Different performance by skin tone → can’t deploy despite high overall accuracy
- Clinical decision support: Black box models rejected by clinicians despite high performance → interpretability matters
For practitioners:
The chapter’s message is clear: deployment readiness ≠ highest validation metric. Comprehensive evaluation means: 1. Multiple performance dimensions (discrimination, calibration, fairness) 2. Operational feasibility (speed, cost, integration) 3. Stakeholder acceptance (interpretability, trust, workflow fit) 4. Pilot validation (does it work in practice?)
Only after evaluating all these dimensions can you make an informed deployment decision. Jumping straight from “XGBoost has best AUC” to “deploy XGBoost” skips the critical evaluation that distinguishes successful deployments from expensive failures.
The chapter structures the walkthrough to model this comprehensive approach, dedicating significant space to evaluation precisely because beginners tend to over-focus on single metrics and under-evaluate broader deployment readiness.
A student completes their hospital readmission project with good results (AUC=0.72, well-calibrated, fair across subgroups). They want to add it to their portfolio but are unsure what to include. Which combination of materials would BEST demonstrate their skills to potential employers or graduate programs?
- a) Just the final model file (model.pkl) since the results speak for themselves
- b) A polished Jupyter notebook with narrative explaining the full project lifecycle, key findings, and limitations
- c) GitHub repository with organized code, README, requirements.txt, and a separate portfolio website with visualizations, methodology, and discussion
- d) A conference-style poster with methods, results, and conclusions
Correct Answer: c) GitHub repository with organized code, README, requirements.txt, and a separate portfolio website with visualizations, methodology, and discussion
This question tests understanding of professional documentation and portfolio development—key themes in the chapter’s final sections. The scenario requires evaluating what materials best demonstrate competence to external audiences.
Why option (c) is correct:
The chapter emphasizes that successful projects must be documented, reproducible, and communicable. Option (c) provides multiple entry points for different audiences and demonstrates professional development practices.
GitHub repository demonstrates technical competence:
The chapter’s project structure section shows proper code organization:
hospital-readmission/
├── README.md # Project overview, how to run
├── requirements.txt # Dependencies for reproducibility
├── environment.yml # Conda environment specification
├── data/
│ ├── raw/ # Original data (or download script)
│ ├── processed/ # Cleaned data
│ └── README.md # Data documentation
├── notebooks/
│ ├── 01-eda.ipynb # Exploratory analysis
│ ├── 02-modeling.ipynb # Model development
│ └── 03-evaluation.ipynb # Results and evaluation
├── src/
│ ├── data/make_dataset.py # Data processing
│ ├── features/build_features.py # Feature engineering
│ ├── models/train_model.py # Model training
│ └── visualization/visualize.py # Plotting functions
├── models/ # Saved model artifacts
├── reports/
│ ├── figures/ # Generated plots
│ └── final_report.pdf # Technical report
└── app/
└── streamlit_app.py # Interactive dashboard
This structure communicates: - Organization: Can structure complex projects - Reproducibility: Others can run the code - Best practices: Version control, modular code, documentation - Professional workflow: Not just a single script
README.md is the critical entry point:
The chapter emphasizes documentation. A good README includes:
# Hospital Readmission Risk Prediction
## Overview
Predicting 30-day readmission risk using Medicare claims data.
AUC=0.72, well-calibrated, fair across demographic groups.
## Key Findings
- Prior admissions and medication count are strongest predictors
- Model identifies 15% of patients as high-risk (>50% readmission probability)
- Targeted interventions could prevent 200 readmissions annually
## Methodology
- Data: 10,000 Medicare patients, 2019-2021
- Models: Logistic Regression (baseline), Random Forest, XGBoost
- Evaluation: 5-fold CV, calibration analysis, fairness metrics
## How to Run
1. Install dependencies: `pip install -r requirements.txt`
2. Download data: `python src/data/download_data.py`
3. Run pipeline: `python src/models/train_model.py`
4. Launch dashboard: `streamlit run app/streamlit_app.py`
## Project Structure
[Describe directory organization]
## Results
[Key visualizations, metrics summary]
## Limitations
- Retrospective data limits causal claims
- Medicare population may not generalize to younger patients
- Model requires retraining as care patterns evolve
## Author
[Name], [Email], [LinkedIn]
This README serves multiple audiences: - Recruiters: Quick overview, key results, professional presentation - Technical reviewers: Methodology, reproducibility instructions - Collaborators: How to run and extend the work
requirements.txt enables reproducibility:
pandas==1.5.0
numpy==1.23.0
scikit-learn==1.1.0
xgboost==1.6.0
shap==0.41.0
matplotlib==3.5.0
seaborn==0.12.0
streamlit==1.15.0
This demonstrates understanding of dependency management (Chapter 13’s emphasis). Employers value candidates who ship reproducible work.
Portfolio website provides narrative:
GitHub shows technical skills; a portfolio website communicates effectively to non-technical audiences:
Structure: - Project overview: Problem, motivation, impact - Approach: High-level methodology with visualizations - Results: Key findings, interactive dashboard embed - Reflection: What you learned, what you’d do differently - Skills demonstrated: Python, scikit-learn, model evaluation, stakeholder communication
Why this matters: - Recruiters may not read code but will browse portfolios - Graduate programs want to see communication skills - Demonstrates ability to translate technical work for non-technical audiences
Why other options are insufficient:
Option (a)—just model.pkl: This is nearly useless. A pickled model file: - Can’t be inspected without running code - Doesn’t show how it was built - Doesn’t explain what problem it solves - Doesn’t demonstrate process, only final artifact - Could be from a tutorial, impossible to verify originality
The chapter emphasizes that process matters as much as results. Showing only the final model is like showing only a finished painting without demonstrating artistic skill through sketches and technique.
Option (b)—polished notebook: The chapter uses notebooks for exploration, but a single notebook has limitations: - Doesn’t demonstrate code organization (everything in one file) - Harder to reuse code (functions embedded in notebooks) - Doesn’t show version control practices - Can’t demonstrate deployment (Streamlit app, API) - Lacks modular structure that professional projects require
A notebook can be a supplement (in notebooks/ directory) but shouldn’t be the sole deliverable. The chapter’s project structure includes notebooks for exploration alongside production code in src/.
Option (d)—conference poster: Posters are valuable for specific contexts (conferences, thesis defenses) but insufficient for portfolios: - Static (can’t interact with code or models) - Limited detail (condensed to fit poster format) - Doesn’t demonstrate coding ability - Doesn’t enable reproducibility - Appropriate for dissemination, not primary portfolio artifact
The chapter discusses creating presentation materials (slides, posters) as supplements, not replacements for code repositories.
The chapter’s comprehensive deliverables:
The chapter’s final section lists what a complete project includes: 1. Code repository (GitHub): Technical foundation 2. Documentation (README, docstrings): Enables understanding and reproduction 3. Technical report (PDF): Detailed methodology and results 4. Presentation materials (Slides): For interviews/talks 5. Interactive demo (Streamlit app): Shows it works 6. Portfolio write-up (Website/blog): Communicates to broader audience
Option (c) encompasses the core elements (repository + portfolio), with others as supplements.
Real-world hiring perspective:
Employers evaluating candidates look for: - Can they code? → Well-organized GitHub repo - Can they document? → Clear README, commented code - Can they deploy? → Streamlit app, Docker container - Can they communicate? → Portfolio write-up explaining project - Are they professional? → Reproducible, version-controlled, following best practices
Option (c) demonstrates all of these. Other options demonstrate only subsets.
The chapter’s example:
The hospital readmission walkthrough concludes with exactly this kind of comprehensive documentation: - Organized repository following standard structure - README with overview, instructions, results - requirements.txt for environment recreation - Streamlit dashboard for interactive exploration - Technical report documenting methodology
This isn’t accidental—the chapter models professional project completion, not just model training.
For students building portfolios:
The chapter’s implicit advice: 1. Every project should have a GitHub repo: This is non-negotiable for demonstrating version control and code quality 2. README is as important as code: First impression, must be polished 3. Reproducibility matters: requirements.txt, clear instructions, anyone should be able to run it 4. Show, don’t just tell: Interactive demos > static descriptions 5. Communicate broadly: Portfolio website reaches non-technical audiences
Following this approach transforms a completed model into a compelling portfolio piece that opens doors to jobs, graduate programs, and collaborations. The chapter structures the walkthrough to demonstrate this professional approach from project start to portfolio publication.
After deploying a hospital readmission model, a hospital administrator reports that clinicians aren’t using the predictions. Investigation reveals clinicians receive daily emails with risk scores but no actionable guidance. What does this scenario illustrate about ML project deployment?
- a) The model should be retrained with different features to make predictions more actionable
- b) Technical performance (AUC) is insufficient; deployment must consider workflow integration, user interface design, and actionable recommendations
- c) Clinicians need more training on interpreting machine learning outputs
- d) The hospital should mandate that clinicians use the predictions to ensure adoption
Correct Answer: b) Technical performance (AUC) is insufficient; deployment must consider workflow integration, user interface design, and actionable recommendations
This question tests understanding of the chapter’s deployment section and the broader theme that ML projects succeed or fail based on adoption, not just technical metrics. The scenario describes a common deployment failure: technically accurate predictions that don’t fit into clinical workflows.
The problem: Technical success, operational failure
The model works (produces risk scores), but isn’t useful in practice. This illustrates the gap between ML development and operational deployment that the chapter emphasizes throughout.
Why clinicians ignore the predictions:
1. Workflow integration failure: - Current state: Daily email with risk scores - Clinician workflow: Busy seeing patients, reviewing EHR, responding to pages - Result: Email gets buried, risk scores never seen at point of care
Better integration (chapter’s guidance): - EHR pop-up when viewing patient record - Mobile app for rounds - Integration with care team huddles - Automated alerts for acute risk changes
The chapter’s deployment section discusses exactly this: “Where do predictions appear in the user’s workflow?” Email doesn’t meet clinicians where they work.
2. Actionability gap: Risk score alone doesn’t tell clinicians what to do: - “Patient X has 60% readmission risk” → “So what? What should I do differently?” - No context about why risk is high - No suggested interventions - No way to act on information
Better presentation (chapter’s guidance):
Patient: John Doe
Readmission Risk: 60% (High)
Key Risk Factors:
- 3 admissions in past 6 months
- Polypharmacy (12 medications)
- No primary care follow-up scheduled
Recommended Actions:
☐ Schedule PCP appointment within 7 days
☐ Pharmacist medication reconciliation
☐ Social work assessment for barriers to care
☐ Home health referral
This transforms passive information (risk score) into actionable workflow (checklist).
3. Trust and interpretability issues: Clinicians may not trust a black box prediction without understanding reasoning. The chapter’s SHAP analysis section addresses this: - Show which factors drive each patient’s risk - Enable clinicians to verify predictions make clinical sense - Provide confidence intervals or uncertainty estimates
The chapter’s deployment philosophy:
The walkthrough includes a Streamlit dashboard section that demonstrates user-centered design: - Visualizations: Not just numbers, but plots showing risk trajectory - Explanations: SHAP plots for each prediction - Interactivity: Clinicians can explore what-if scenarios - Integration: API for EHR integration, not just standalone app
This reflects the chapter’s emphasis: deployment isn’t publishing an API; it’s creating a system that fits users’ needs.
Why other options miss the point:
Option (a)—retrain with different features: Feature selection might help (choosing features clinicians can act on), but doesn’t address the core problem: predictions aren’t integrated into workflow. You could have perfect features but still fail if nobody sees the predictions or knows what to do with them.
Option (c)—train clinicians: Training has value, but this diagnosis blames users for a design failure. If predictions are hard to use, the problem is typically system design, not user competence. The chapter emphasizes human-centered design: build systems that fit users’ workflows and cognitive models, not systems that require extensive training to use.
Option (d)—mandate use: Mandates without fixing usability problems breed workarounds and resentment. Clinicians forced to “use” the system might check a box without actually incorporating predictions into decisions. The chapter discusses stakeholder buy-in as essential; mandates without engagement produce compliance theater, not real adoption.
The broader lesson from the chapter:
Technical ML work (data cleaning, modeling, evaluation) is necessary but insufficient. The chapter’s project lifecycle includes deployment and monitoring precisely because this is where projects often fail.
Deployment requires:
1. Workflow integration: - Where do users work? (EHR, mobile app, morning huddles?) - When do they need predictions? (During encounter, during discharge planning?) - How should predictions appear? (Alert, dashboard, report?)
2. Actionability: - What can users do with predictions? - What interventions are available? - How do we close the loop (track whether actions were taken)?
3. Stakeholder engagement: - Were users consulted during design? - Did they pilot test the interface? - Is there ongoing feedback mechanism?
4. Change management: - How is the system introduced? - What training is provided? - Who troubleshoots issues? - How is success measured?
The chapter’s readmission walkthrough includes stakeholder engagement throughout: talking to clinicians about what they need, piloting interfaces, iterating based on feedback.
Real-world examples from earlier chapters:
- Epic sepsis model: Technically deployed but generated alert fatigue → clinicians ignored it → no benefit despite deployment
- IBM Watson for Oncology: Technically sophisticated but didn’t fit oncologists’ workflows → abandoned at many hospitals
- Various clinical decision support tools: High accuracy but low adoption due to poor workflow integration
Successful deployment patterns (from chapter):
Example: Medication adherence prediction - Bad: Daily email with list of at-risk patients - Good: EHR flag visible during prescription writing + suggested pharmacist consult button
Example: Fall risk assessment - Bad: Quarterly report with aggregate statistics - Good: Real-time alert when high-risk patient admitted + automatic fall precautions order set
Example: Readmission risk (the chapter’s project) - Bad: Risk scores in separate dashboard clinicians never visit - Good: Risk displayed in discharge planning workflow + integrated action checklist
Fixing the scenario:
Short-term fixes: 1. Better notification: EHR inbox message instead of email 2. Add context: Explain why patient is high-risk (top 3 factors) 3. Suggest actions: Specific interventions matched to risk factors 4. Make it visible: Display risk during discharge planning, when it’s most relevant
Long-term solutions: 1. EHR integration: Risk score in patient summary, visible throughout care 2. Decision support: Suggested order sets for high-risk patients 3. Team workflows: Risk scores integrated into daily team huddles 4. Feedback loop: Track whether interventions reduce readmissions
Measuring deployment success:
The chapter emphasizes that deployment success isn’t “model deployed” but “model adopted and improving outcomes”: - Usage metrics: What % of clinicians view predictions? - Action metrics: What % of high-risk patients receive interventions? - Outcome metrics: Did readmissions decrease in high-risk patients?
Without these measures, deployment is “done” technically but failing operationally.
For practitioners deploying ML:
The chapter’s message is clear: budget significant time for deployment, not just development. The project lifecycle figure shows deployment as a major phase, not an afterthought.
Deployment checklist (implicit in chapter): - [ ] User research: How do people work today? - [ ] Interface design: Multiple mockups, user testing - [ ] Integration: Fits existing systems, not standalone - [ ] Actionability: Predictions enable decisions - [ ] Training: Users know how to interpret and act - [ ] Monitoring: Track usage and outcomes - [ ] Iteration: Continuous improvement based on feedback
Following this approach transforms ML projects from technical exercises into operational systems that actually improve public health outcomes. The scenario’s failure—predictions produced but not used—is preventable through user-centered deployment that the chapter models in its walkthrough.
A student’s first ML project faces multiple challenges: the dataset has 40% missing values in key variables, the outcome is highly imbalanced (5% positive class), the project deadline is in 2 weeks, and they’re stuck debugging why their model performs worse on the test set than training set. What should be their FIRST priority?
- a) Implement advanced imputation techniques (MICE, KNN imputation) to handle missing data optimally
- b) Try SMOTE, class weights, and ensemble methods to address class imbalance comprehensively
- c) Focus on overfitting first by simplifying the model, reducing features, and strengthening regularization—then address data issues systematically
- d) Extend the deadline since 2 weeks is insufficient for a quality project with these challenges
Correct Answer: c) Focus on overfitting first by simplifying the model, reducing features, and strengthening regularization—then address data issues systematically
This question tests understanding of debugging priorities and avoiding common pitfalls—key themes in the chapter’s “Common Pitfalls” section. The scenario presents a realistic situation where multiple problems coexist and prioritization is essential.
Diagnosing the core problem:
“Model performs worse on test set than training set” is the key diagnostic clue. This is the textbook definition of overfitting—the model memorizes training data rather than learning generalizable patterns.
Why overfitting is the first priority:
The missing data and class imbalance are real challenges, but they’re not causing the test/train performance gap. That gap indicates the model’s complexity exceeds what the data can support. Until this is fixed, addressing other issues won’t help—you’ll just overfit in different ways.
The chapter’s debugging approach:
The chapter emphasizes systematic debugging, not throwing solutions at problems randomly. The debugging process is:
1. Identify the symptom: Train performance > test performance 2. Diagnose the cause: Overfitting (model too complex for data) 3. Apply simplest fix first: Reduce model complexity 4. Validate fix: Check if train/test gap closes 5. Then address other issues: Missing data, imbalance, etc.
Practical steps to address overfitting:
Simplify the model:
from sklearn.ensemble import RandomForestClassifier

# If using Random Forest with 1000 trees:
model = RandomForestClassifier(n_estimators=50,      # Reduce from 1000
                               max_depth=5,          # Limit depth
                               min_samples_split=20) # Require more samples per split
Reduce features:
from sklearn.feature_selection import SelectKBest, f_classif

# From 50 features to the 10 most important
selector = SelectKBest(f_classif, k=10)
X_train_reduced = selector.fit_transform(X_train, y_train)
X_test_reduced = selector.transform(X_test)
Strengthen regularization (if using logistic regression):
from sklearn.linear_model import LogisticRegression

# Smaller C = stronger regularization; the 'l1' penalty requires solver='liblinear' or 'saga'
model = LogisticRegression(C=0.1, penalty='l1', solver='liblinear')
Results to expect: - Training performance decreases (model can’t memorize as well) - Test performance increases (model generalizes better) - Train/test gap narrows (success!)
Once this gap is addressed, the model is learning rather than memorizing. Then you can meaningfully address missing data and imbalance.
Why other options are wrong priorities:
Option (a)—advanced imputation: The chapter discusses imputation methods (mean, median, KNN, MICE), but implementing sophisticated imputation while the model is overfitting is premature optimization. Problems:
- Won’t fix overfitting: Even perfectly imputed data can be overfit
- Time sink: MICE is complex to implement and tune
- May worsen overfitting: Adding complexity (imputation) before fixing fundamental model issues compounds problems
Better approach (after fixing overfitting): - Start simple: mean/median imputation or missing indicator - Check if model improves - Only try sophisticated imputation if simple methods fail
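That simple first pass is a few lines with scikit-learn’s SimpleImputer (a sketch; X_train and X_test are the usual feature matrices):

from sklearn.impute import SimpleImputer

# Median imputation plus a missingness-indicator column per imputed feature,
# so the model can still "see" that a value was originally absent
imputer = SimpleImputer(strategy="median", add_indicator=True)
X_train_imputed = imputer.fit_transform(X_train)   # fit on training data only
X_test_imputed = imputer.transform(X_test)         # avoid leaking test-set statistics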
Option (b)—comprehensive imbalance handling: Class imbalance (5% positive) is real, but the chapter emphasizes it’s often overemphasized by beginners. Problems:
- Won’t fix train/test gap: Imbalance affects both train and test sets equally
- SMOTE can worsen overfitting: Synthetic samples add complexity
- Complexity cascade: Class weights + SMOTE + ensembles = many hyperparameters to tune
Better approach (after fixing overfitting): - Start with class weights (simple) - Use stratified splits (ensures representative test set) - Evaluate with appropriate metrics (F1, precision-recall, not just accuracy) - Only if these fail, try SMOTE or other advanced techniques
The chapter’s walkthrough addresses imbalance with class weights and stratified splitting—simple approaches first.
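Those simple approaches amount to a few lines of scikit-learn; a sketch assuming feature matrix X and labels y:

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report

# Stratified split preserves the ~5% positive rate in both train and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# class_weight="balanced" upweights errors on the rare positive class
model = LogisticRegression(class_weight="balanced", max_iter=1000)
model.fit(X_train, y_train)

# Precision, recall, and F1 are far more informative than accuracy at 5% prevalence
print(classification_report(y_test, model.predict(X_test)))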
Option (d)—extend deadline: While realistic project timelines matter, this isn’t a deadline problem—it’s a debugging problem. With correct prioritization, 2 weeks is sufficient for a first project: - 2-3 days: Fix overfitting, get working baseline - 3-4 days: Handle missing data systematically - 2-3 days: Address imbalance if needed - 2-3 days: Documentation, polish, portfolio preparation - Buffer: Unexpected issues
Extending the deadline without fixing the approach just delays completion without improving outcomes.
The chapter’s debugging wisdom:
The “Common Pitfalls” section describes exactly this scenario:
Pitfall: “Trying too many things at once” - Problem: Student tries complex imputation + SMOTE + neural networks + ensemble methods simultaneously - Result: Can’t diagnose what works/doesn’t work - Solution: Change one thing at a time, validate, then iterate
Pitfall: “Ignoring train/test gap” - Problem: Focusing on improving training performance while test performance stagnates or degrades - Result: Severe overfitting, unusable model - Solution: Monitor both metrics, prioritize closing the gap
Pitfall: “Premature optimization” - Problem: Implementing sophisticated techniques before simple baselines work - Result: Complexity without improvement - Solution: Simple models first, add complexity only if justified
The systematic debugging process (from chapter):
Step 1: Establish baseline
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

# Simplest possible model
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

print(f"Train AUC: {roc_auc_score(y_train, model.predict_proba(X_train)[:, 1])}")
print(f"Test AUC: {roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])}")
Step 2: Diagnose problems - Train AUC >> Test AUC → Overfitting - Both low → Underfitting or data quality issues - Both reasonable → Proceed with improvements
Step 3: Fix systematically - If overfitting: Regularization, feature reduction, simpler models - If underfitting: More features, complex models - If data quality: Missing data, outliers, feature engineering
Step 4: Validate each change - Make one change - Retrain and evaluate - Keep if improves, revert if worsens
The two-week timeline (realistic execution):
Days 1-2: Debug overfitting - Implement simple logistic regression baseline - Check train/test metrics - If overfitting, add regularization - Validate gap has closed - Deliverable: Working baseline with reasonable generalization
Days 3-5: Handle data issues systematically - Simple missing data imputation (mean/median) - Check impact on performance - If needed, try more sophisticated methods - Address outliers if present - Deliverable: Clean dataset, preprocessed features
Days 6-8: Improve model - Try Random Forest, XGBoost - Use stratified CV for imbalance - Implement class weights if needed - Basic hyperparameter tuning - Deliverable: Best model selected and tuned
Days 9-11: Evaluation and interpretation - Comprehensive evaluation (calibration, fairness, etc.) - SHAP analysis for interpretability - Create visualizations - Deliverable: Complete evaluation report
Days 12-14: Documentation and polish - Clean code, write README - Create Streamlit dashboard - Write portfolio description - Deliverable: Portfolio-ready project
This timeline is achievable only if debugging is systematic (option c), not scattered (options a, b).
For practitioners facing similar situations:
The chapter’s debugging advice: 1. Diagnose before treating: Understand the problem (overfitting, underfitting, data quality) 2. Simplest solution first: Occam’s Razor applies to ML debugging 3. One change at a time: Can’t learn from experiments with multiple simultaneous changes 4. Validate each step: Confirm fixes work before moving on 5. Know when to stop: Perfect is the enemy of done; first project should be good, not optimal
The scenario’s student is making a common mistake: trying to handle everything at once (missing data + imbalance + model selection) while ignoring the fundamental problem (overfitting). The chapter’s walkthrough models the right approach: fix core issues first, then systematically address remaining challenges. This transforms an overwhelming situation (multiple problems, tight deadline) into a manageable sequence of achievable steps.
Congratulations! You’ve completed Chapter 15 and built your first complete ML project from problem definition to deployment. 🎉
You now have: - ✅ Hands-on experience with the full ML lifecycle - ✅ A portfolio project to showcase - ✅ Understanding of common pitfalls and how to avoid them - ✅ Tools and frameworks for future projects - ✅ Foundation for more advanced work