# Building Your First Project - Interactive Notebook

**From Chapter 14 of The Public Health AI Handbook**

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/BryanTegomoh/publichealth-ai-handbook/blob/main/notebooks/chapter14-first-project-interactive.ipynb)

---

## Project: Hospital Readmission Prediction

**Goal:** Build a complete ML pipeline to predict 30-day hospital readmission for heart failure patients.

**Learning Objectives:**
1. Define a well-scoped problem with clear success criteria
2. Conduct systematic exploratory data analysis
3. Establish baseline models before complex algorithms
4. Implement proper validation strategies
5. Create effective visualizations for stakeholders

---

## Phase 1: Problem Definition

### Bad Scoping vs Good Scoping

❌ **Bad:** "Build AI to predict all patient outcomes"  
✅ **Good:** "Predict 30-day readmission for heart failure patients"

❌ **Bad:** "Achieve 99% accuracy"  
✅ **Good:** "Achieve AUC ≥ 0.75 (clinically useful threshold)"

❌ **Bad:** "Deploy in 2 weeks"  
✅ **Good:** "Prototype in 3 months, deployment in 6 months"

### Our Project Scope

- **Population:** Heart failure patients discharged from hospital
- **Outcome:** 30-day readmission (binary: yes/no)
- **Data Source:** EHR + claims data
- **Success Metric:** AUC ≥ 0.75
- **Timeline:** 3-month prototype
- **Deployment:** Risk dashboard for case managers
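
Because the success metric is an AUC, it helps to be concrete about what that number measures: the probability that a randomly chosen readmitted patient receives a higher risk score than a randomly chosen non-readmitted patient. The sketch below checks this on made-up labels and scores; the arrays and the helper name `pairwise_auc` are illustrative assumptions, not part of the project data.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def pairwise_auc(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count half."""
    y_true = np.asarray(y_true)
    scores = np.asarray(scores, dtype=float)
    pos = scores[y_true == 1]
    neg = scores[y_true == 0]
    diff = pos[:, None] - neg[None, :]  # every positive score minus every negative score
    return float((diff > 0).mean() + 0.5 * (diff == 0).mean())

# Toy data (made up): 3 readmitted patients (1) and 5 not readmitted (0)
y_toy = np.array([1, 1, 1, 0, 0, 0, 0, 0])
s_toy = np.array([0.9, 0.6, 0.3, 0.7, 0.4, 0.2, 0.2, 0.1])

print(f"Pairwise AUC: {pairwise_auc(y_toy, s_toy):.3f}")
print(f"sklearn AUC:  {roc_auc_score(y_toy, s_toy):.3f}")
```

Here 12 of the 15 positive-negative pairs are ordered correctly, so both lines print 0.800. An AUC of 0.75 therefore means roughly three in four such pairs are ranked correctly, which is a statement about ranking, not about accuracy.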

In [None]:
# Setup
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score, roc_curve, classification_report, confusion_matrix
import warnings
warnings.filterwarnings('ignore')

plt.style.use('seaborn-v0_8')
print("✅ Setup complete")

---

## Phase 2: Data Collection & Initial Exploration

In [None]:
# Generate synthetic heart failure readmission data
np.random.seed(42)
n = 2000

data = {
    'age': np.random.normal(72, 12, n).clip(40, 95),
    'length_of_stay': np.random.poisson(5, n).clip(1, 30),
    'num_prior_admits': np.random.poisson(2, n),
    'ejection_fraction': np.random.normal(35, 10, n).clip(15, 70),
    'sodium': np.random.normal(138, 4, n),
    'creatinine': np.random.exponential(1.2, n).clip(0.5, 10),
    'num_medications': np.random.poisson(8, n),
    'has_diabetes': np.random.binomial(1, 0.4, n),
    'has_copd': np.random.binomial(1, 0.3, n),
    'has_hypertension': np.random.binomial(1, 0.7, n),
    'discharge_to_home': np.random.binomial(1, 0.6, n)
}

df = pd.DataFrame(data)

# Generate readmission based on risk factors
readmit_risk = (
    0.1 * (df['age'] / 100) +
    0.15 * (df['num_prior_admits'] / 10) +
    0.2 * (1 - df['ejection_fraction'] / 70) +
    0.1 * (df['creatinine'] / 10) +
    0.15 * df['has_diabetes'] +
    0.1 * df['has_copd'] +
    0.1 * (1 - df['discharge_to_home'])
)

df['readmitted_30d'] = (readmit_risk + np.random.normal(0, 0.1, n) > 0.5).astype(int)

# Add some missing data (realistic)
missing_mask = np.random.random(n) < 0.05
df.loc[missing_mask, 'ejection_fraction'] = np.nan

print(f"✅ Data generated: {df.shape}")
print(f"Readmission rate: {df['readmitted_30d'].mean():.1%}")
df.head()

---

## Phase 3: Exploratory Data Analysis (EDA)

**EDA Checklist:**
1. ✅ Data completeness (missing values)
2. ✅ Distributions (histograms, outliers)
3. ✅ Relationships (correlations)
4. ✅ Quality issues (impossible values)
5. ✅ Class balance

In [None]:
# 1. Check completeness
print("📊 Missing Data:")
missing = df.isnull().sum()
missing_pct = (missing / len(df) * 100).round(2)
missing_df = pd.DataFrame({'Missing': missing, 'Percent': missing_pct})
print(missing_df[missing_df['Missing'] > 0])

# 2. Summary statistics
print("\n📈 Summary Statistics:")
df.describe().T

In [None]:
# 3. Target distribution
fig, ax = plt.subplots(1, 2, figsize=(12, 4))
# Sort by class label so the 'No'/'Yes' tick labels always match 0/1
counts = df['readmitted_30d'].value_counts().sort_index()
counts.plot(kind='bar', ax=ax[0])
ax[0].set_title('30-Day Readmission Distribution')
ax[0].set_xlabel('Readmitted')
ax[0].set_xticklabels(['No', 'Yes'], rotation=0)

counts.plot(kind='pie', ax=ax[1], autopct='%1.1f%%', labels=['No', 'Yes'])
ax[1].set_title('Readmission Rate')
ax[1].set_ylabel('')
plt.tight_layout()
plt.show()

In [None]:
# 4. Feature distributions
fig, axes = plt.subplots(2, 3, figsize=(15, 8))
axes = axes.ravel()

numeric_features = ['age', 'length_of_stay', 'num_prior_admits', 
                   'ejection_fraction', 'sodium', 'creatinine']

for idx, feature in enumerate(numeric_features):
    df[feature].hist(bins=30, ax=axes[idx], edgecolor='black')
    axes[idx].set_title(feature.replace('_', ' ').title())
    axes[idx].set_xlabel('Value')
    axes[idx].set_ylabel('Frequency')

plt.tight_layout()
plt.show()

In [None]:
# 5. Feature comparison by readmission status
key_features = ['age', 'num_prior_admits', 'ejection_fraction', 'creatinine']

fig, axes = plt.subplots(2, 2, figsize=(14, 10))
axes = axes.ravel()

for idx, feature in enumerate(key_features):
    df[df['readmitted_30d']==0][feature].hist(ax=axes[idx], alpha=0.6, 
                                               label='Not Readmitted', bins=30, color='green')
    df[df['readmitted_30d']==1][feature].hist(ax=axes[idx], alpha=0.6, 
                                               label='Readmitted', bins=30, color='red')
    axes[idx].set_title(f'{feature.replace("_", " ").title()} by Readmission')
    axes[idx].set_xlabel('Value')
    axes[idx].set_ylabel('Frequency')
    axes[idx].legend()

plt.tight_layout()
plt.show()

print("✅ EDA complete")
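
The checklist also names impossible values and correlations, which the plotting cells do not check directly. One way to cover both is sketched below on a small made-up frame. The limits in `PLAUSIBLE_RANGES` and the helper `flag_out_of_range` are hypothetical; real limits should come from clinical reviewers.

```python
import numpy as np
import pandas as pd

# Hypothetical plausibility limits (assumptions for illustration only)
PLAUSIBLE_RANGES = {
    'age': (18, 110),
    'ejection_fraction': (5, 85),
    'sodium': (110, 170),
    'creatinine': (0.1, 20),
}

def flag_out_of_range(frame, ranges):
    """Count non-missing values outside [low, high] for each listed column."""
    counts = {}
    for col, (low, high) in ranges.items():
        values = frame[col].dropna()
        counts[col] = int(((values < low) | (values > high)).sum())
    return pd.Series(counts, name='n_out_of_range')

# Made-up rows with two deliberately impossible entries (age 130, creatinine -0.4)
demo = pd.DataFrame({
    'age': [72, 65, 130, 81],
    'ejection_fraction': [35, np.nan, 40, 25],
    'sodium': [138, 141, 135, 139],
    'creatinine': [1.1, 0.9, -0.4, 2.3],
    'readmitted_30d': [0, 0, 1, 1],
})

out_of_range = flag_out_of_range(demo, PLAUSIBLE_RANGES)
print(out_of_range)

# Correlation of each feature with the outcome (missing values are skipped pairwise)
target_corr = demo.corr()['readmitted_30d'].drop('readmitted_30d')
print(target_corr.round(2))
```

On the notebook's data the same two calls could be applied to `df`. The synthetic generator clips most columns, so few violations are expected there; with real EHR extracts this check tends to matter more.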

---

## Phase 4: Data Preprocessing

In [None]:
# Split features and target
X = df.drop('readmitted_30d', axis=1)
y = df['readmitted_30d']

# Train/validation/test split (60/20/20)
X_temp, X_test, y_temp, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)
X_train, X_val, y_train, y_val = train_test_split(X_temp, y_temp, test_size=0.25, random_state=42, stratify=y_temp)

# Handle missing values: impute with the TRAINING median only, so nothing
# from the validation or test sets leaks into preprocessing
X_train, X_val, X_test = X_train.copy(), X_val.copy(), X_test.copy()
ef_median = X_train['ejection_fraction'].median()
for split in (X_train, X_val, X_test):
    split['ejection_fraction'] = split['ejection_fraction'].fillna(ef_median)
remaining = sum(int(s.isnull().sum().sum()) for s in (X_train, X_val, X_test))
print(f"✅ Missing values imputed: {remaining} remaining")

print("\n📊 Data Split:")
print(f"Training: {X_train.shape[0]} samples ({y_train.mean():.1%} readmission)")
print(f"Validation: {X_val.shape[0]} samples ({y_val.mean():.1%} readmission)")
print(f"Test: {X_test.shape[0]} samples ({y_test.mean():.1%} readmission)")

In [None]:
# Scale features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_val_scaled = scaler.transform(X_val)
X_test_scaled = scaler.transform(X_test)

print("✅ Features scaled")
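
Fitting the scaler on the training split and only transforming the other splits is what keeps preprocessing free of leakage. A scikit-learn `Pipeline` can enforce the same rule for every preprocessing step at once. The sketch below is one possible arrangement on made-up data, not the notebook's own code; the step names and the demo arrays are assumptions.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Made-up data: 3 features, outcome driven by the first, ~5% missing in the second
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(400, 3))
y_demo = (X_demo[:, 0] + rng.normal(scale=0.5, size=400) > 0).astype(int)
X_demo[rng.random(400) < 0.05, 1] = np.nan

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=0, stratify=y_demo
)

# Every step is fitted on the training data only when pipe.fit is called
pipe = Pipeline([
    ('impute', SimpleImputer(strategy='median')),
    ('scale', StandardScaler()),
    ('model', LogisticRegression(max_iter=1000)),
])
pipe.fit(X_tr, y_tr)

proba = pipe.predict_proba(X_te)[:, 1]  # test rows reuse the training median and scale
train_median = pipe.named_steps['impute'].statistics_
print("Medians learned from training data:", np.round(train_median, 3))
```

A pipeline also makes cross-validation safer, because the imputer and scaler are refitted inside each fold instead of once on all rows.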

---

## Phase 5: Baseline Model

**Always establish a baseline before trying complex models!**

Baseline options:
1. Random guessing (AUC ≈ 0.50)
2. Majority class (predict "no readmission" for everyone)
3. Simple logistic regression

In [None]:
# Baseline 1: Random guessing
random_pred_proba = np.random.random(len(y_val))
baseline_random_auc = roc_auc_score(y_val, random_pred_proba)

# Baseline 2: Majority class
majority_pred = np.zeros(len(y_val))  # Predict "no readmission" for everyone
majority_acc = (majority_pred == y_val).mean()
# Constant predictions cannot rank patients, so their AUC is 0.5 by definition.
# Accuracy is shown instead: it looks high only because readmissions are rare.

# Baseline 3: Simple logistic regression
lr_baseline = LogisticRegression(random_state=42, max_iter=1000)
lr_baseline.fit(X_train_scaled, y_train)
lr_pred_proba = lr_baseline.predict_proba(X_val_scaled)[:, 1]
baseline_lr_auc = roc_auc_score(y_val, lr_pred_proba)

print("📊 Baseline Performance:")
print(f"Random Guessing AUC: {baseline_random_auc:.3f}")
print(f"Majority Class Accuracy: {majority_acc:.1%} (AUC = 0.500)")
print(f"Logistic Regression AUC: {baseline_lr_auc:.3f}")
print(f"\n🎯 Target: AUC ≥ 0.75")

if baseline_lr_auc >= 0.75:
    print("✅ Baseline already meets target! May not need complex models.")
else:
    print("⚠️ Baseline below target. Will try more complex models.")
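
The majority-class baseline can also be expressed with scikit-learn's `DummyClassifier`, which makes it easy to score with the same metrics as real models. The sketch below uses made-up data with roughly 15% positives (an assumed rate, chosen only for illustration) to show why accuracy flatters this baseline while AUC does not.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, roc_auc_score

# Made-up data: features are pure noise, about 15% of labels are positive
rng = np.random.default_rng(1)
X_demo = rng.normal(size=(500, 2))
y_demo = (rng.random(500) < 0.15).astype(int)

# Always predicts the most frequent class ("no readmission")
majority = DummyClassifier(strategy='most_frequent').fit(X_demo, y_demo)

maj_acc = accuracy_score(y_demo, majority.predict(X_demo))
maj_auc = roc_auc_score(y_demo, majority.predict_proba(X_demo)[:, 1])

print(f"Majority-class accuracy: {maj_acc:.3f}")
print(f"Majority-class AUC:      {maj_auc:.3f}")
```

Accuracy equals the share of negatives, while the AUC is exactly 0.5 because a constant score ranks no patient above another. This is one reason the project tracks AUC instead of accuracy.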

---

## Phase 6: Model Development

Try progressively more complex models only if the baseline is insufficient.

In [None]:
# Random Forest
rf_model = RandomForestClassifier(n_estimators=100, random_state=42, n_jobs=-1)
rf_model.fit(X_train_scaled, y_train)
rf_pred_proba = rf_model.predict_proba(X_val_scaled)[:, 1]
rf_auc = roc_auc_score(y_val, rf_pred_proba)

print(f"\n🌲 Random Forest AUC: {rf_auc:.3f}")
print(f"Improvement over baseline: {rf_auc - baseline_lr_auc:.3f}")
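
A single validation split gives one AUC estimate per model, and a small gap between two models may be noise. Stratified k-fold cross-validation is a common way to see how much the estimate varies. The sketch below applies `cross_val_score` to made-up data from `make_classification`; the dataset parameters and the choice of five folds are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Made-up imbalanced dataset (about 15% positives)
X_demo, y_demo = make_classification(
    n_samples=600, n_features=8, n_informative=4,
    weights=[0.85, 0.15], random_state=0,
)

# Stratification keeps the positive rate similar in every fold
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# Scaling sits inside the pipeline so it is refitted within each fold
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

cv_auc = cross_val_score(model, X_demo, y_demo, cv=cv, scoring='roc_auc')
print("Fold AUCs:", np.round(cv_auc, 3))
print(f"Mean AUC: {cv_auc.mean():.3f} (std {cv_auc.std():.3f})")
```

In the notebook this would be run on the training portion only, leaving the test set untouched until the final evaluation.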

---

## Phase 7: Model Evaluation & Selection

In [None]:
# Compare models on validation set
models = {
    'Logistic Regression': (lr_baseline, lr_pred_proba, baseline_lr_auc),
    'Random Forest': (rf_model, rf_pred_proba, rf_auc)
}

# ROC curves
plt.figure(figsize=(10, 7))
for name, (model, pred_proba, auc) in models.items():
    fpr, tpr, _ = roc_curve(y_val, pred_proba)
    plt.plot(fpr, tpr, label=f'{name} (AUC = {auc:.3f})', linewidth=2)

plt.plot([0, 1], [0, 1], 'k--', label='Random (AUC = 0.50)', linewidth=1)
plt.xlabel('False Positive Rate', fontsize=12)
plt.ylabel('True Positive Rate', fontsize=12)
plt.title('ROC Curves: 30-Day Readmission Prediction', fontsize=14, fontweight='bold')
plt.legend(fontsize=11)
plt.grid(alpha=0.3)
plt.tight_layout()
plt.show()

# Select best model
best_model_name = max(models.items(), key=lambda x: x[1][2])[0]
best_model = models[best_model_name][0]
print(f"\n🏆 Best Model: {best_model_name}")

---

## Phase 8: Final Evaluation on Test Set

**Important:** Only evaluate on test set ONCE, after model selection!

In [None]:
# Final test set evaluation
test_pred = best_model.predict(X_test_scaled)
test_pred_proba = best_model.predict_proba(X_test_scaled)[:, 1]
test_auc = roc_auc_score(y_test, test_pred_proba)

print("\n" + "="*60)
print("🎯 FINAL TEST SET PERFORMANCE")
print("="*60)
print(f"\nModel: {best_model_name}")
print(f"AUC: {test_auc:.3f}")
print(f"Target: 0.75")
print(f"Status: {'✅ MET' if test_auc >= 0.75 else '❌ NOT MET'}")

print("\n" + classification_report(y_test, test_pred, 
                                    target_names=['Not Readmitted', 'Readmitted']))

# Confusion matrix
cm = confusion_matrix(y_test, test_pred)
plt.figure(figsize=(8, 6))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', 
            xticklabels=['Not Readmitted', 'Readmitted'],
            yticklabels=['Not Readmitted', 'Readmitted'])
plt.title(f'Confusion Matrix: {best_model_name}\nAUC = {test_auc:.3f}', fontweight='bold')
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.tight_layout()
plt.show()
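
With a test set of a few hundred patients, a single AUC value carries real uncertainty, so comparing it to the 0.75 target is more informative with an interval. A percentile bootstrap is one simple option. The sketch below uses made-up labels and scores; the helper name `bootstrap_auc_ci` and its defaults are assumptions, not part of the original notebook.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def bootstrap_auc_ci(y_true, y_score, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the AUC."""
    y_true = np.asarray(y_true)
    y_score = np.asarray(y_score)
    rng = np.random.default_rng(seed)
    n = len(y_true)
    aucs = []
    for _ in range(n_boot):
        idx = rng.integers(0, n, n)  # resample patients with replacement
        if y_true[idx].min() == y_true[idx].max():
            continue  # AUC is undefined when a resample has only one class
        aucs.append(roc_auc_score(y_true[idx], y_score[idx]))
    lower, upper = np.percentile(aucs, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return float(lower), float(upper)

# Made-up test set: ~15% positives, scores shifted upward for positives
rng = np.random.default_rng(42)
y_demo = (rng.random(400) < 0.15).astype(int)
score_demo = 0.3 * y_demo + rng.normal(0.3, 0.2, 400)

point = roc_auc_score(y_demo, score_demo)
low, high = bootstrap_auc_ci(y_demo, score_demo)
print(f"AUC = {point:.3f}, 95% CI [{low:.3f}, {high:.3f}]")
```

If the lower bound of the interval sits below 0.75, the claim that the target was met deserves a caveat in the stakeholder summary.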

---

## Phase 9: Feature Importance & Interpretation

In [None]:
# Feature importance (if Random Forest won)
if best_model_name == 'Random Forest':
    importance = pd.DataFrame({
        'feature': X_train.columns,
        'importance': best_model.feature_importances_
    }).sort_values('importance', ascending=False)
    
    plt.figure(figsize=(10, 6))
    sns.barplot(data=importance, x='importance', y='feature', palette='viridis')
    plt.title('Feature Importance: 30-Day Readmission', fontsize=14, fontweight='bold')
    plt.xlabel('Importance')
    plt.ylabel('Feature')
    plt.tight_layout()
    plt.show()
    
    print("📊 Top 5 Risk Factors for Readmission:")
    print(importance.head().to_string(index=False))
elif best_model_name == 'Logistic Regression':
    # Coefficients
    coefs = pd.DataFrame({
        'feature': X_train.columns,
        'coefficient': best_model.coef_[0]
    }).sort_values('coefficient', ascending=False)
    
    plt.figure(figsize=(10, 6))
    sns.barplot(data=coefs, x='coefficient', y='feature', palette='coolwarm')
    plt.title('Logistic Regression Coefficients', fontsize=14, fontweight='bold')
    plt.xlabel('Coefficient')
    plt.ylabel('Feature')
    plt.tight_layout()
    plt.show()
    
    print("📊 Top Risk Factors (positive coefficients increase readmission risk):")
    print(coefs.head().to_string(index=False))
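
Random forest importances and logistic regression coefficients are measured on different scales, so they are hard to compare across models. Permutation importance is a model-agnostic alternative: shuffle one feature on held-out data and measure how much the AUC drops. The sketch below uses made-up data with one informative feature and two noise features; the column names are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Made-up data: only 'signal' is related to the outcome
rng = np.random.default_rng(0)
n = 800
X_demo = pd.DataFrame({
    'signal': rng.normal(size=n),
    'noise_a': rng.normal(size=n),
    'noise_b': rng.normal(size=n),
})
y_demo = (X_demo['signal'] + rng.normal(scale=0.5, size=n) > 0).astype(int)

X_tr, X_va, y_tr, y_va = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=0, stratify=y_demo
)
model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)

# Mean drop in validation AUC when each column is shuffled 10 times
result = permutation_importance(
    model, X_va, y_va, scoring='roc_auc', n_repeats=10, random_state=0
)
perm = pd.Series(result.importances_mean, index=X_demo.columns).sort_values(ascending=False)
print(perm.round(3))
```

Neither kind of importance is causal: a highly ranked feature predicts readmission in this data, which does not by itself mean that changing it would change the outcome.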

---

## Phase 10: Stakeholder Communication

**Create a simple dashboard-style summary for non-technical stakeholders.**

In [None]:
# Create stakeholder summary

# Rank features from the selected model so the summary reports what the
# model actually learned instead of a hard-coded list
if hasattr(best_model, 'feature_importances_'):
    factor_scores = best_model.feature_importances_
else:
    factor_scores = np.abs(best_model.coef_[0])
top_features = (pd.Series(factor_scores, index=X_train.columns)
                .sort_values(ascending=False).head(5))
top_factors = "\n".join(
    f"{rank}. {name.replace('_', ' ').capitalize()}"
    for rank, name in enumerate(top_features.index, 1)
)

summary = f"""
{'='*70}
30-DAY HOSPITAL READMISSION PREDICTION MODEL
Project Summary for Stakeholders
{'='*70}

OBJECTIVE
Predict which heart failure patients are at high risk of being readmitted 
within 30 days of discharge.

MODEL PERFORMANCE
- Model Type: {best_model_name}
- AUC Score: {test_auc:.3f} (Target: ≥0.75)
- Status: {'✅ Target Met' if test_auc >= 0.75 else '❌ Below Target'}

WHAT THIS MEANS
Given one patient who was readmitted and one who was not, the model gives
the higher risk score to the readmitted patient about {test_auc:.0%} of the time.
Random guessing would do so 50% of the time.

TOP RISK FACTORS
Based on the model, the strongest predictors of readmission are:
{top_factors}

CLINICAL USE CASE
- Target Users: Case managers, care coordinators
- Deployment: Risk dashboard showing high-risk patients at discharge
- Action: Intensive follow-up for high-risk patients (home visits, early 
          appointments, medication reconciliation)

LIMITATIONS
- Model trained on historical data (may not capture recent changes)
- Performance may vary for subpopulations not well-represented in training
- Should complement, not replace, clinical judgment

NEXT STEPS
1. Clinical validation with case managers
2. Pilot deployment in 2 units (3 months)
3. Measure impact on readmission rates
4. Refine model based on feedback
5. Full deployment if pilot successful

{'='*70}
"""

print(summary)

# Save summary
with open('project_summary.txt', 'w') as f:
    f.write(summary)

print("\n✅ Project summary saved to: project_summary.txt")
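
A risk dashboard for case managers usually needs a short list of patients, not raw probabilities. One simple policy is to flag the top fraction of patients by predicted risk, sized to follow-up capacity. The sketch below uses made-up scores and outcomes; the helper `flag_top_fraction` and the 30% capacity are hypothetical choices for illustration.

```python
import numpy as np

def flag_top_fraction(risk_scores, fraction=0.2):
    """Return a boolean mask flagging the highest-risk `fraction` of patients."""
    risk_scores = np.asarray(risk_scores, dtype=float)
    n_flag = max(1, int(round(len(risk_scores) * fraction)))
    order = np.argsort(-risk_scores, kind='stable')  # highest risk first
    mask = np.zeros(len(risk_scores), dtype=bool)
    mask[order[:n_flag]] = True
    return mask

# Made-up predicted risks and observed 30-day readmissions for 10 patients
risk = np.array([0.05, 0.40, 0.10, 0.80, 0.15, 0.30, 0.65, 0.08, 0.20, 0.12])
outcome = np.array([0, 1, 0, 1, 0, 0, 0, 0, 1, 0])

flagged = flag_top_fraction(risk, fraction=0.3)
captured = outcome[flagged].sum() / outcome.sum()  # share of readmissions flagged
precision = outcome[flagged].mean()                # share of flagged who were readmitted

print(f"Patients flagged: {flagged.sum()} of {len(risk)}")
print(f"Readmissions captured: {captured:.0%}")
print(f"Flagged patients readmitted: {precision:.0%}")
```

In this toy example, flagging three patients captures two of the three readmissions. Reporting the result in these terms ("we follow up with N patients and reach X% of readmissions") tends to be easier for stakeholders to act on than an AUC.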

---

## Key Lessons from This Project

### What We Did Right ✅

1. **Started with clear scope:** Specific population, outcome, and success criteria
2. **Conducted thorough EDA:** Understood data before modeling
3. **Established baseline:** Simple model first, then increased complexity
4. **Proper validation:** Separate train/val/test sets, only used test set once
5. **Feature importance:** Interpreted model to identify key risk factors
6. **Stakeholder communication:** Non-technical summary for decision-makers

### Common Pitfalls We Avoided ❌

1. **Data leakage:** Never looked at test set until final evaluation
2. **Premature optimization:** Started simple, only added complexity if needed
3. **Ignoring class imbalance:** Monitored readmission rates across splits
4. **Black box models:** Analyzed feature importance for interpretability
5. **Perfect accuracy obsession:** Focused on clinically useful AUC threshold

### Time Allocation Reality

If this were a 3-month project:
- **Week 1-2:** Problem definition, stakeholder interviews
- **Week 3-6:** Data collection, cleaning, EDA
- **Week 7-8:** Feature engineering
- **Week 9:** Baseline model
- **Week 10-11:** Model development and tuning
- **Week 12:** Final evaluation, documentation, presentation

**Most time goes to data work (weeks 3-8, about 50% of the schedule), not modeling (weeks 9-11, about 25%).**
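
Tallying the weeks in the schedule above gives the split directly. The grouping below is an assumption: feature engineering is counted as data work, and the baseline is counted as modeling.

```python
# (start_week, end_week) for each phase, copied from the schedule above
schedule = {
    'Problem definition': (1, 2),
    'Data collection, cleaning, EDA': (3, 6),
    'Feature engineering': (7, 8),
    'Baseline model': (9, 9),
    'Model development and tuning': (10, 11),
    'Final evaluation and documentation': (12, 12),
}

weeks = {phase: end - start + 1 for phase, (start, end) in schedule.items()}
total_weeks = sum(weeks.values())

# Assumed grouping of phases into "data work" and "modeling"
data_work = weeks['Data collection, cleaning, EDA'] + weeks['Feature engineering']
modeling = weeks['Baseline model'] + weeks['Model development and tuning']

for phase, n_weeks in weeks.items():
    print(f"{phase:<36} {n_weeks:>2} weeks ({n_weeks / total_weeks:.0%})")
print(f"Data work: {data_work / total_weeks:.0%}, modeling: {modeling / total_weeks:.0%}")
```

Under this grouping, data work takes 6 of 12 weeks (50%) and modeling takes 3 of 12 (25%).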

---

## Next Steps

1. **Read Chapter 8:** Learn proper model evaluation techniques
2. **Read Chapter 11:** Understand AI safety for deployment
3. **Read Chapter 12:** Learn deployment, monitoring, and maintenance
4. **Try with real data:** Apply this pipeline to actual EHR data
5. **Build a dashboard:** Use Streamlit to create interactive risk dashboard

---

**Questions or Issues?**  
GitHub: https://github.com/BryanTegomoh/publichealth-ai-handbook/issues