16  AI-Assisted Coding and Development Tools

Learning Objectives

By the end of this chapter, you will be able to:

  1. Set up modern development environments appropriate for public health data analysis
  2. Use AI coding assistants effectively while understanding their limitations
  3. Implement version control workflows for reproducible analysis
  4. Apply practical coding workflows to common public health tasks
  5. Evaluate when AI coding assistance is appropriate versus when human expertise is required
  6. Choose between learning to code, using low-code tools, or hiring developers
Time Estimate

Reading and exercises: 90-120 minutes
Hands-on practice: 180-240 minutes
Total: 4.5-6 hours

Prerequisites

This chapter builds on:

  • Chapter 13: Your AI Toolkit
  • Chapter 14: Building Your First Project

You should be familiar with basic computing concepts and have an interest in building technical skills for data analysis.

16.1 What You’ll Learn

This chapter provides practical guidance for public health practitioners seeking to leverage AI-assisted coding tools. We focus on accessible, widely-used tools rather than cutting-edge research systems. Our emphasis is on enabling effective, safe use while building genuine technical competency.


16.2 Introduction: The Democratization of Coding

The public health workforce increasingly requires computational skills. Modern epidemiology involves wrangling messy surveillance data, automating routine reports, building interactive dashboards, and conducting complex statistical analyses. Historically, these tasks required years of programming expertise. Today, AI-assisted coding tools are lowering barriers, enabling public health practitioners with minimal programming background to accomplish sophisticated technical work.

A 2023 survey found that 87% of data scientists and analysts reported using AI coding assistants, with 68% saying these tools made them significantly more productive (Kalliamvakou et al., 2023, GitHub). GitHub Copilot users completed tasks 55% faster than those without AI assistance, and reported higher job satisfaction (Ziegler et al., 2022, arXiv). These tools are particularly transformative for learners: novice programmers using AI assistance completed coding tasks with quality comparable to experienced programmers working without AI (Prather et al., 2023, ACM SIGCSE).

However, AI coding assistants have significant limitations. They generate syntactically correct code that may be logically flawed, introduce security vulnerabilities in 40% of cases in security-sensitive contexts (Pearce et al., 2022, IEEE Security & Privacy), and can impede deep learning if used as a crutch rather than a teaching aid (Prather et al., 2024, ACM SIGCSE). For public health applications handling sensitive data and informing consequential decisions, understanding when and how to use AI coding tools responsibly is critical.


16.3 Modern Development Environments

16.3.1 Visual Studio Code (VS Code): The Standard

What it is: Visual Studio Code (VS Code) is a free, open-source code editor developed by Microsoft, now the most popular development environment worldwide, used by over 73% of developers (Stack Overflow, 2023, Developer Survey). It’s lightweight yet powerful, extensible through thousands of plugins, and supports virtually every programming language.

Why it matters for public health:

Traditional approaches require separate tools for each task—R code in RStudio, Python code in Jupyter, SQL queries in database clients, and documentation in Word. This creates inefficient workflows with constant context switching.

VS Code provides one unified environment where you can work with R, Python, SQL, and Markdown all in one editor, with integrated terminals, Git version control built-in, and extensions for AI assistance, data visualization, and debugging. This creates a seamless workflow with fewer tools to learn.

Key features for public health:

1. Multi-language support

You can switch between languages in the same project:

# Python for data processing
import pandas as pd
data = pd.read_csv('surveillance_data.csv')

# R for statistical analysis
library(tidyverse)
data %>% group_by(region) %>% summarize(cases = sum(case_count))

-- SQL for database queries
SELECT region, COUNT(*) as cases
FROM surveillance_data
WHERE date >= '2024-01-01'
GROUP BY region;

2. Integrated terminal

Run code without leaving the editor, with no need to switch between editor and command line:

  • Python scripts: python analysis.py
  • R scripts: Rscript analysis.R
  • Git commands: git commit -m "message"
  • Package installation: pip install pandas

3. Extensions ecosystem

Essential extensions for public health data work:

Data Science:
  • Python: Full Python support (Microsoft)
  • R: R language support and debugging
  • Jupyter: Run notebooks directly in VS Code
  • Rainbow CSV: Syntax highlighting for CSV files
  • Data Wrangler: Visual data exploration

AI Assistance:
  • GitHub Copilot: AI code completion ($10/month)
  • Tabnine: Free AI completions
  • Cody: AI chat and code search

Productivity:
  • GitLens: Enhanced Git integration
  • Todo Tree: Track TODOs in codebase
  • Code Spell Checker: Catch typos in code
  • Markdown Preview: Preview documentation

Getting started:

Step 1: Download and install. Visit https://code.visualstudio.com/, download for your OS (Windows, Mac, Linux), and install—it’s a 5-minute process with no configuration needed.

Step 2: Install language support. Open VS Code → Extensions (Ctrl+Shift+X). Search and install “Python” (by Microsoft) and “R” (by REditorSupport). Install Python/R on your computer if not already installed.

Step 3: Open your first project. File → Open Folder → Select project directory. Create new file: analysis.py or analysis.R. Start coding with autocomplete and syntax highlighting.

Step 4: Add AI assistance (optional). Extensions → Search “GitHub Copilot”. Install and sign in (requires GitHub account + subscription). Begin getting AI code suggestions as you type.

16.3.2 Cursor: AI-First Coding Environment

What it is: Cursor is a fork of VS Code with integrated AI capabilities, launched in 2023 as a code editor built specifically for AI-assisted development (Cursor, 2024). It includes GPT-4 and Claude built directly into the editor, enabling natural language code generation, explanation, and debugging.

Key differences from VS Code + Copilot:

VS Code with Copilot provides AI suggestions as you type (autocomplete on steroids), requires explicit prompting for explanations, and draws its AI context mainly from the file you are editing and other open tabs.

Cursor understands your entire codebase across multiple files, converts natural language to code (e.g., “Create function to calculate disease incidence”), includes built-in chat with code context, allows editing code directly through AI commands, and provides automatic bug detection and fixing.

The trade-off: More powerful AI, but $20/month subscription after trial versus VS Code’s free option with $10/month Copilot.

Best use cases for public health:

Scenario 1: Starting from scratch. You ask: “Create a Python script to read surveillance data from CSV, calculate weekly disease incidence rates by region, and generate a bar chart.”

Cursor generates a complete working script with imports, data loading, calculations, and visualization—saving 30-60 minutes for a novice or 10-15 minutes for an expert.

Scenario 2: Understanding existing code. Select a complex function and ask Cursor “What does this code do?” It explains: “This function performs age-standardization using the direct method: takes age-specific rates and standard population, applies weights, and returns age-adjusted rates. This allows comparing disease rates across populations with different age structures.”
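For reference, a minimal Python sketch of the kind of function being described might look like this (column names are hypothetical; this is an illustration, not Cursor output):

import pandas as pd

def age_standardize(rates: pd.DataFrame, standard_pop: pd.DataFrame) -> float:
    """Direct age-standardization: weight age-specific rates by a standard population.

    Assumes both inputs share an 'age_group' column; `rates` carries 'rate'
    (cases per 100,000) and `standard_pop` carries 'population'.
    """
    merged = rates.merge(standard_pop, on="age_group")
    weights = merged["population"] / merged["population"].sum()
    return float((merged["rate"] * weights).sum())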

Scenario 3: Debugging. When code throws an error, Cursor automatically suggests fixes, like: “The error ‘KeyError: age_group’ occurs because the CSV column is named ‘age_category’ not ‘age_group’. Change line 42 to: df['age_category'].”

Limitations:

  • Cost: $20/month after free trial (vs free VS Code + $10/month Copilot)
  • Privacy: Code sent to AI providers—don’t use with sensitive code without enterprise agreement
  • Learning: May become crutch preventing deep understanding
  • Stability: Newer tool, less mature than VS Code

When to choose Cursor over VS Code: Choose Cursor if you have heavy AI-assisted coding workflows, frequently work with unfamiliar codebases, value integrated AI over plugin ecosystem, and budget allows ($240/year).

When to stick with VS Code: Choose VS Code if you’re budget-conscious ($0-120/year with Copilot), need maximum control and customization, prefer established mature tools, or work with sensitive code requiring privacy.

16.3.3 Jupyter Notebooks: Interactive Data Analysis

What it is: Jupyter Notebooks are web-based interactive computing environments where code, visualizations, and narrative text coexist in a single document (Kluyver et al., 2016, IOS Press). They’ve become the standard for exploratory data analysis, with over 10 million public notebooks on GitHub (Rule et al., 2018, PLOS Computational Biology).

Why they matter for public health:

Traditional scripts (analysis.R or analysis.py) run top to bottom with output to console or files, and narrative is kept separate (e.g., in Word documents). This makes it hard to share interactive analysis.

Jupyter notebooks run in “cells” that can execute out of order, display visualizations inline, include Markdown text between code cells, and allow sharing complete analysis (code + output + narrative). The result is executable documents perfect for exploration and communication.

Structure of a Jupyter notebook:

Cell 1 [Markdown]:
# COVID-19 Vaccine Coverage Analysis
## Data Source: State Immunization Registry
Analysis date: 2024-10-15

Cell 2 [Code]:
import pandas as pd
import matplotlib.pyplot as plt

data = pd.read_csv('vaccine_data.csv')
print(f"Total records: {len(data)}")

[Output]: Total records: 15,247

Cell 3 [Markdown]:
The dataset contains vaccination records for 15,247 individuals.
We'll analyze coverage by age group and geography.

Cell 4 [Code]:
coverage_by_age = data.groupby('age_group')['vaccinated'].mean()
coverage_by_age.plot(kind='bar')
plt.title('Vaccination Coverage by Age Group')
plt.ylabel('Proportion Vaccinated')

[Output]: [Bar chart displayed inline]

Cell 5 [Markdown]:
Coverage is highest in the 65+ age group (87%) and lowest in
18-29 age group (62%). This suggests targeted outreach needed for
younger adults.

Best practices for public health notebooks:

Structure clearly:
  1. Introduction cell: What, why, when
  2. Setup cells: Imports, file paths, parameters
  3. Analysis cells: One logical step per cell
  4. Interpretation cells: Markdown explaining findings
  5. Conclusion cell: Summary and recommendations

Document thoroughly:
  • Markdown cells explain purpose of code
  • Code comments explain implementation
  • Assumptions stated explicitly
  • Data sources and dates noted

Make reproducible (see the setup sketch after this list):
  • Clear file paths or URLs for data
  • Specify package versions (requirements.txt)
  • Set random seeds for stochastic processes
  • Include environment information

Avoid pitfalls:
  • Running cells out of order (creates confusion)
  • Hidden state (variables from deleted cells)
  • Massive outputs (truncate long outputs)
  • Lack of narrative (pure code without explanation)
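As one illustration of the reproducibility items above, the setup cell of a notebook might pin random seeds and record the environment (a minimal sketch; the packages shown are only examples):

import random
import sys

import numpy as np
import pandas as pd

# Fix seeds so any stochastic steps (sampling, simulation) are repeatable
random.seed(42)
np.random.seed(42)

# Record the environment alongside the analysis
print(f"Python {sys.version}")
print(f"pandas {pd.__version__}, numpy {np.__version__}")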

AI assistance in Jupyter:

Approach 1: Copilot in VS Code. Open .ipynb files in VS Code. GitHub Copilot suggests code as you type. Execute cells directly in VS Code.

Approach 2: JupyterLab AI extensions. Use extensions like jupyterlab-ai (ChatGPT integration) or jupyter-ai (multi-model AI assistance) to generate entire cells from natural language.

Approach 3: External LLM → Copy code. Describe analysis in ChatGPT/Claude, generate code in chat interface, copy into notebook cells, then test and validate.

When to use notebooks vs scripts:

Use Jupyter notebooks for:
  • Exploratory data analysis
  • Teaching and learning
  • Communicating findings (code + narrative + visualizations)
  • One-off analyses
  • Interactive demonstrations

Use scripts (.py, .R) for:
  • Production pipelines
  • Automated reports (scheduled runs)
  • Functions and packages
  • Performance-critical code
  • Version control-friendly code

Many workflows start by exploring in a notebook, then productionize as a script.

16.3.4 RStudio: The R Environment

What it is: RStudio is an integrated development environment (IDE) specifically designed for R, the statistical programming language dominant in epidemiology and biostatistics (Racine, 2012, Journal of Applied Econometrics). While VS Code and Jupyter support R, RStudio provides the most polished R experience.

Advantages for R users:

RStudio offers native R integration: a best-in-class console with command history, an environment pane showing all objects (data, functions, variables), a plots pane for easy visualization management, a package management interface, built-in help and documentation, R Markdown support (like Jupyter but R-native), and debugging tools specific to R.

If R is your primary language, RStudio is hard to beat.

AI assistance in RStudio:

RStudio doesn’t have built-in AI like Cursor, but can be enhanced:

Option 1: GitHub Copilot (via VS Code). Work in VS Code for R development and use Copilot for code suggestions. Trade-off: Lose RStudio-specific features.

Option 2: R packages for AI assistance

The {gptstudio} package adds ChatGPT integration to RStudio:

install.packages("gptstudio")
# Adds "ChatGPT" menu to RStudio
# Can ask for code, explanations, debugging help

The {chattr} package provides a chat interface for multiple LLMs:

install.packages("chattr")
# Chat with GPT-4, Claude, local models
# Generate code, explain concepts

Option 3: External LLM → Copy code. Use ChatGPT/Claude in browser, generate R code, copy into RStudio. This is the standard workflow for many users.

R Markdown for reproducible reports:

R Markdown documents (like Jupyter notebooks) combine R code chunks, Markdown narrative, inline code results, and can output to PDF, HTML, Word, etc.

Example: Automated weekly surveillance report

An R Markdown file combines a YAML header, code chunks that execute when rendered, and narrative text with inline R results. A minimal sketch (column names are illustrative):

---
title: "Weekly Disease Surveillance Report"
date: "`r Sys.Date()`"
output: html_document
---

```{r setup, include=FALSE}
library(tidyverse)
library(knitr)
data <- read_csv("weekly_surveillance.csv")
current_cases  <- sum(data$cases[data$week == max(data$week)])
previous_cases <- sum(data$cases[data$week == max(data$week) - 1])
pct_change <- round((current_cases - previous_cases) / previous_cases * 100, 1)
```

## Executive Summary

This week, `r current_cases` cases were reported, representing a
`r pct_change`% change from last week.

```{r trend-plot, echo=FALSE}
data %>%
  group_by(week) %>%
  summarize(cases = sum(cases)) %>%
  ggplot(aes(x = week, y = cases)) +
  geom_line() +
  labs(title = "Reported cases by week", x = "Week", y = "Cases")
```

## Recommendations

Based on current trends, we recommend:

1. Enhanced surveillance in Region X
2. Targeted outreach in age group Y

When to choose RStudio:

Choose RStudio if:
  • R is your primary language (>80% of coding time)
  • Doing statistical analysis (traditional strengths of R)
  • Working in epidemiology/biostatistics (R-dominant fields)
  • Need advanced R debugging
  • Prefer specialized, purpose-built tool

Choose VS Code if:
  • Working in multiple languages (R + Python + SQL)
  • Need AI assistance (better Copilot integration)
  • Want unified environment for all projects
  • Prefer customizable, extensible editor


16.4 AI Coding Assistants

16.4.1 GitHub Copilot: The Market Leader

What it is: GitHub Copilot, launched in 2021, was the first widely-available AI coding assistant, powered by OpenAI Codex (a model derived from GPT) (Chen et al., 2021, arXiv). It suggests code completions as you type, learns from context, and can generate entire functions from comments.

How it works:

When you type a comment like # Calculate disease incidence, Copilot suggests a complete function:

def calculate_incidence(cases, population, time_period):
    """
    Calculate disease incidence rate

    Parameters:
    cases: Number of new cases
    population: Population at risk
    time_period: Time period in years

    Returns:
    Incidence rate per 100,000 person-years
    """
    rate = (cases / population) * 100000 / time_period
    return rate

Mechanism:
  1. Analyzes your code context (files open, cursor position)
  2. Generates multiple completions using language model
  3. Ranks by likelihood and relevance
  4. Suggests most probable completion
  5. You accept (Tab), reject (Esc), or modify

Suggestions appear in less than 1 second as you type.

Key features:

1. Context-aware suggestions

Copilot learns from your current file, related open files, file type and language, variable names, function names, and comments/docstrings.

Example: If you have a dataframe called surveillance_data with columns region, date, cases, Copilot will suggest code using those exact names and structure.

2. Multi-line completions

Copilot can complete single lines like total = sum(case_counts) or generate entire multi-line functions:

def clean_surveillance_data(df):
    # Remove duplicate records
    df = df.drop_duplicates(subset=['id'])

    # Handle missing values
    df = df.dropna(subset=['date', 'region'])

    # Standardize date format
    df['date'] = pd.to_datetime(df['date'])

    return df

3. Natural language to code

Comment-driven development works well with Copilot:

# Load surveillance data from CSV, filter to 2024, calculate weekly totals by region
# [Copilot suggests complete implementation]

# Create a choropleth map showing disease incidence by county
# [Copilot suggests map code with appropriate library]
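For the first comment above, a completion in this spirit might look like the snippet below (a hand-written illustration; actual Copilot output varies, and the column names are assumptions):

import pandas as pd

# Load surveillance data from CSV, filter to 2024, calculate weekly totals by region
df = pd.read_csv("surveillance_data.csv", parse_dates=["date"])
df_2024 = df[df["date"].dt.year == 2024]
weekly_totals = (
    df_2024
    .groupby(["region", pd.Grouper(key="date", freq="W")])["cases"]
    .sum()
    .reset_index()
)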

Pricing and access:

  • Individual: $10/month or $100/year with unlimited completions and chat interface
  • Business: $19/user/month with team management, organization-wide policies, and audit logs
  • Free for: Verified students (education), open source maintainers, and some enterprise plans
  • Trial: 30 days free for all new users

Effectiveness evidence:

Research on Copilot productivity shows developers completed tasks 55% faster with Copilot (Ziegler et al., 2022, arXiv). On average, about 26% of suggestions are accepted, varying by language. Copilot shows highest utility for repetitive tasks, boilerplate code, and test writing, with a learning curve of 2-4 weeks to develop effective use patterns (Kalliamvakou et al., 2023, GitHub).

Limitations and risks:

Code correctness not guaranteed:
  • Syntactically correct ≠ logically correct
  • May suggest plausible but wrong algorithms
  • Can introduce subtle bugs
  • Always test generated code

Security vulnerabilities:
  • 40% of Copilot suggestions contained security issues in one study focused on security-sensitive contexts (Pearce et al., 2022, IEEE Security & Privacy)
  • May suggest deprecated functions or unsafe practices
  • Requires security code review

License and copyright concerns:
  • Trained on public GitHub code (various licenses)
  • Generated code may closely match training examples
  • Legal gray area for code ownership
  • Organizations may prohibit for proprietary code

Privacy considerations:
  • Code context sent to Microsoft/OpenAI servers
  • May expose sensitive information in variable names, comments
  • Not appropriate for classified or highly sensitive code
  • Enterprise version offers more privacy controls

16.4.2 Other AI Coding Assistants

Amazon CodeWhisperer (now “Amazon Q Developer”)

Amazon’s AI coding assistant is free for individuals (as of 2023), or $19/month for professional features. It supports Python, Java, JavaScript, TypeScript, C#, Go, Rust, PHP, Ruby, and other languages.

Advantages:
  • Generous free tier (unlike Copilot)
  • Integrated with AWS services
  • Security scanning included
  • Bias detection

Disadvantages:
  • Less polished than Copilot
  • Smaller training dataset
  • Fewer integrations

Best for: AWS users, budget-conscious developers, security-focused teams

Tabnine

Tabnine is a privacy-focused AI code completion tool with a free basic tier and $12/month Pro version. It can be deployed in the cloud or on-premises.

Advantages:
  • Privacy emphasis (can run locally)
  • Team training on private codebase
  • No code leaves organization (self-hosted)
  • Compliance-friendly (HIPAA, SOC 2)

Disadvantages:
  • Less capable than Copilot/Claude
  • Local models require GPU
  • Fewer languages supported

Best for: Organizations with strict privacy requirements, healthcare/finance sectors

Replit Ghostwriter

Replit Ghostwriter is an AI assistant in Replit (browser-based IDE) for $25/month (includes Replit compute). It’s browser-only with no installation required.

Advantages:
  • Zero setup (fully browser-based)
  • Great for learners
  • Integrated execution environment
  • Collaboration features

Disadvantages:
  • Requires internet
  • Limited to Replit ecosystem
  • Less powerful than Copilot

Best for: Students, teaching, quick prototypes, no-install scenarios

Cody (by Sourcegraph)

Cody combines code search with AI assistance. It’s free for individuals, or $9-19/month Pro. It can index large codebases.

Advantages:
  • Understands entire codebase context
  • Code search alongside AI
  • Multiple LLM backends (GPT-4, Claude, etc.)
  • Good for large projects

Disadvantages:
  • Requires initial indexing
  • More complex setup

Best for: Large projects, teams, codebases with extensive history

Selection guide:

  • Choose Copilot if: Standard choice, best autocomplete, wide adoption
  • Choose Claude Code if: Prefer conversational, long context, privacy-conscious
  • Choose CodeWhisperer if: Budget-constrained, AWS user, security focus
  • Choose Tabnine if: Privacy paramount, can’t send code externally
  • Choose Replit if: Teaching/learning, want zero setup
  • Choose Cody if: Large codebase, need semantic code search

16.5 Version Control with Git and GitHub

16.5.1 Why Version Control Matters for Public Health

The problem without version control:

Your desktop might look like this:

analysis_v1.R
analysis_v2.R
analysis_v2_final.R
analysis_v2_final_REVISED.R
analysis_v2_final_REVISED_JAN15.R
analysis_FINAL_USE_THIS_ONE.R

Disaster scenarios include:
  • Which version was used for the report?
  • What changed between versions?
  • How to merge colleague’s edits?
  • Accidentally deleted file, no backup?
  • Need to revert to version from 2 months ago?

Version control solution:

Git tracks changes to files over time, providing:
  • Complete history (who changed what, when, why)
  • Ability to revert to any previous version
  • Parallel work (branches)
  • Merging changes from multiple people
  • Backup on remote server (GitHub)
  • Collaboration without emailing files

Result: Professional, reproducible, collaborative workflow

Why it’s critical for public health:

  1. Reproducibility: Scientific requirement to recreate analyses
  2. Compliance: Regulatory requirements for audit trails (FDA, CDC)
  3. Collaboration: Multiple analysts working on same project
  4. Safety: Backup against data loss
  5. Documentation: Clear record of analytical decisions

16.5.2 Git Basics

Core concepts:

  • Repository (repo): Project folder tracked by Git
  • Commit: Snapshot of files at a point in time
  • Branch: Parallel version of code
  • Remote: Server hosting repository (GitHub, GitLab)
  • Clone: Copy repository to local computer
  • Pull: Download changes from remote
  • Push: Upload changes to remote
  • Merge: Combine branches

Essential Git workflow:

# One-time setup
git config --global user.name "Your Name"
git config --global user.email "your.email@health.gov"

# Starting a new project
cd my-analysis-project
git init                           # Initialize Git repo
git add .                          # Stage all files
git commit -m "Initial commit"     # Create first snapshot

# Making changes
# ... edit files ...
git status                         # See what changed
git add analysis.R                 # Stage specific file
git commit -m "Add incidence calculation"  # Save snapshot

# Reviewing history
git log                            # See all commits
git diff                           # See exact changes

# Connecting to GitHub
git remote add origin https://github.com/yourusername/project.git
git push -u origin main            # Upload to GitHub

# Daily workflow
git pull                           # Download latest changes
# ... work on files ...
git add modified_files.R
git commit -m "Descriptive message"
git push                           # Upload your changes

Commit message best practices:

Bad messages:

git commit -m "fix"
git commit -m "changes"
git commit -m "update"

Good messages:

git commit -m "Fix calculation error in age-standardization function"
git commit -m "Add missing values handling in data cleaning script"
git commit -m "Update visualization colors for colorblind accessibility"

Guidelines:
  • Start with verb (Add, Fix, Update, Remove)
  • Be specific about what changed
  • Explain why if not obvious
  • Keep under 50 characters
  • Use present tense

16.5.3 GitHub: Collaboration and Hosting

What GitHub adds to Git:

Git is local version control on your computer. GitHub is remote cloud hosting that adds:
  • Backup on cloud servers
  • Collaboration features
  • Web interface for browsing code
  • Issue tracking
  • Project management
  • Automated workflows (GitHub Actions)
  • Team permissions

Key collaboration features:

1. Pull requests (code review)

Workflow:
  1. Create a branch: git checkout -b feature-new-analysis
  2. Make changes, commit
  3. Push branch to GitHub
  4. Open pull request: “Please review and merge my changes”
  5. Team reviews code, suggests improvements
  6. Address feedback, update code
  7. Approved → Merge into main branch

Benefits: Code review catches errors, enables knowledge sharing, provides discussion and documentation, and ensures quality control before production.

2. Issues (task tracking)

Create issues for:
  • Bugs: “Incidence calculation returns negative values for Region X”
  • Features: “Add confidence intervals to disease rate calculations”
  • Questions: “Which statistical test is appropriate for this comparison?”
  • Documentation: “Add comments to data cleaning script”

Track with labels (bug, enhancement, documentation), assignments (who’s working on it), milestones (group issues for v1.0 release), and projects (kanban board view).

3. GitHub Actions (automation)

# .github/workflows/run-analysis.yml
name: Weekly Analysis
on:
  schedule:
    - cron: '0 9 * * 1'  # Every Monday at 9 AM
jobs:
  run-analysis:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v2
      - name: Set up R
        uses: r-lib/actions/setup-r@v2
      - name: Install packages
        run: Rscript -e 'install.packages(c("tidyverse", "knitr"))'
      - name: Run analysis
        run: Rscript weekly_analysis.R
      - name: Commit report
        run: |
          git config --local user.email "action@github.com"
          git config --local user.name "GitHub Action"
          git add weekly_report.html
          git commit -m "Automated weekly analysis $(date)"
          git push

Public vs private repositories:

Public repositories:
  • Free unlimited public repos
  • Anyone can view code
  • Good for: Open source, teaching, public health surveillance methods
  • Don’t use for: Proprietary methods, unreleased findings, sensitive data

Private repositories:
  • Free unlimited private repos (GitHub changed policy in 2019)
  • Access control (specific collaborators only)
  • Good for: Organizational projects, unpublished research, draft analyses
  • Not backed up on personal computer (need to clone)

CRITICAL: NEVER commit sensitive data (PHI, PII) to GitHub, public or private.

16.5.4 Practical Git Workflows for Public Health

Workflow 1: Solo analyst

# Setup (once)
mkdir covid-analysis
cd covid-analysis
git init
echo "# COVID-19 analysis" > README.md   # need at least one file before the first commit
git add .
git commit -m "Initial commit with project README"

# Create GitHub repo (via website)
# Connect local to remote
git remote add origin https://github.com/yourname/covid-analysis.git
git push -u origin main

# Daily work
# Morning: Get latest
git pull

# Work: Edit, save frequently
git add updated_files
git commit -m "Add demographic analysis"

# End of day: Backup
git push

Benefits: Complete history of work, backed up on GitHub, can revert mistakes, professional portfolio (if public).

Workflow 2: Team collaboration

# Team member A: Creates base analysis
git clone https://github.com/orgname/surveillance-project.git
# ... develops initial analysis ...
git add analysis.R
git commit -m "Initial surveillance analysis"
git push

# Team member B: Adds visualizations
git pull  # Get A's work
git checkout -b add-visualizations  # New branch
# ... adds visualization code ...
git add plotting.R
git commit -m "Add regional disease map visualization"
git push origin add-visualizations

# Create pull request on GitHub
# Team reviews, discusses
# After approval, merge to main

Benefits: Parallel work without conflicts, code review improves quality, clear communication via PRs, audit trail of decisions.

Workflow 3: Reproducible analysis

Project structure:
surveillance-analysis/
├── README.md              # Project description
├── data/
│   ├── raw/              # Original data (never edit)
│   └── processed/        # Cleaned data
├── scripts/
│   ├── 01_clean_data.R   # Numbered for order
│   ├── 02_analyze.R
│   └── 03_visualize.R
├── output/
│   ├── figures/
│   └── tables/
├── reports/
│   └── final_report.Rmd
└── .gitignore            # Files to exclude from Git

.gitignore contents:

# Don't commit data (privacy + size)
data/raw/*
# Exclude all CSV files
*.csv
# Exclude R history
.Rhistory
# Exclude system files
.DS_Store
# Generated outputs (can recreate)
output/

README.md contents:

# Surveillance Analysis Project

## Purpose
Weekly analysis of disease surveillance data for County X.

## Requirements
- R 4.2+
- Packages: tidyverse, sf, viridis

## Usage
1. Place data file in data/raw/
2. Run scripts in order: 01, 02, 03
3. View output in output/ directory

## Contact
[Your name] - your.email@health.gov

Benefits: Anyone can reproduce analysis, clear documentation, organized structure, privacy protected (data not in Git).


16.6 Practical Workflows

16.6.1 Workflow 1: Analyzing CSV Surveillance Data

Scenario: You receive weekly surveillance data as CSV. Need to calculate disease rates, create visualizations, generate report.

Step 1: Set up project

# Create project structure
mkdir weekly-surveillance
cd weekly-surveillance
mkdir data scripts output
git init

# Create README
echo "# Weekly Surveillance Analysis" > README.md
echo "Automated weekly analysis of disease surveillance data" >> README.md

git add .
git commit -m "Initial project structure"

Step 2: Generate data cleaning script with AI

Prompt to ChatGPT/Claude/Copilot:

"Create a Python script to:
1. Read CSV file 'surveillance_data.csv' with columns: date, region, age_group, cases, population
2. Convert date to datetime
3. Remove rows with missing cases or population
4. Standardize region names (title case)
5. Add calculated field: incidence_rate = (cases/population) * 100000
6. Export to 'cleaned_data.csv'
Include error handling and logging."

AI generates a complete script that you can review, test, and commit.
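A script along these lines might look like the following sketch (a hand-written illustration of what to expect, not verbatim AI output; file and column names follow the prompt):

import logging

import pandas as pd

logging.basicConfig(level=logging.INFO)
log = logging.getLogger(__name__)


def clean_surveillance_data(infile="surveillance_data.csv", outfile="cleaned_data.csv"):
    try:
        df = pd.read_csv(infile)
    except FileNotFoundError:
        log.error("Input file not found: %s", infile)
        raise

    df["date"] = pd.to_datetime(df["date"], errors="coerce")   # convert dates
    df = df.dropna(subset=["date", "cases", "population"])     # drop incomplete rows
    df["region"] = df["region"].str.title()                    # standardize region names
    df["incidence_rate"] = df["cases"] / df["population"] * 100_000

    df.to_csv(outfile, index=False)
    log.info("Wrote %d cleaned rows to %s", len(df), outfile)
    return df


if __name__ == "__main__":
    clean_surveillance_data()

Whatever the assistant actually produces, run it on a small sample file and confirm the incidence calculation by hand before trusting it.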

Step 3: Generate analysis script

Prompt: “Create Python script to analyze cleaned surveillance data: Calculate total cases and incidence rate by region, calculate 7-day moving average, identify regions with significant increases (>20% week-over-week), generate summary statistics table, export results to CSV and print summary.”
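Again as a hedged sketch rather than verbatim AI output, the analysis step might center on a few grouped summaries like these, assuming the cleaned file from Step 2:

import pandas as pd

df = pd.read_csv("cleaned_data.csv", parse_dates=["date"])

# Total cases and incidence rate by region (population assumed constant per region)
by_region = df.groupby("region").agg(
    total_cases=("cases", "sum"),
    population=("population", "max"),
)
by_region["incidence_rate"] = by_region["total_cases"] / by_region["population"] * 100_000

# 7-day moving average of daily case counts
daily = df.groupby("date")["cases"].sum().sort_index()
moving_avg = daily.rolling(window=7).mean()

# Flag regions with a >20% week-over-week increase
weekly = df.groupby(["region", pd.Grouper(key="date", freq="W")])["cases"].sum()
week_over_week = weekly.groupby(level="region").pct_change()
flagged = week_over_week[week_over_week > 0.20].reset_index()

by_region.to_csv("output/regional_summary.csv")
print(by_region)
print("Regions with >20% weekly increase:")
print(flagged)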

Step 4: Generate visualization script

Prompt: “Create Python script to visualize surveillance data: Line plot showing cases over time by region, bar plot of current week incidence rates by region, heatmap of weekly cases by region (calendar heatmap style). Save as high-resolution PNG files for reports. Use colorblind-friendly palette.”
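A corresponding visualization script can be equally compact; the sketch below covers two of the three requested plots and assumes the output/ folder created in Step 1 exists:

import matplotlib.pyplot as plt
import pandas as pd

df = pd.read_csv("cleaned_data.csv", parse_dates=["date"])

# Line plot: cases over time by region
fig, ax = plt.subplots(figsize=(8, 5))
for region, grp in df.groupby("region"):
    ax.plot(grp["date"], grp["cases"], label=region)
ax.set_title("Cases over time by region")
ax.set_xlabel("Date")
ax.set_ylabel("Cases")
ax.legend()
fig.savefig("output/cases_over_time.png", dpi=300)

# Bar plot: incidence rate by region for the most recent week
latest_week = df[df["date"] >= df["date"].max() - pd.Timedelta(days=6)]
rates = latest_week.groupby("region")["incidence_rate"].mean().sort_values()
fig, ax = plt.subplots(figsize=(8, 5))
rates.plot(kind="bar", ax=ax)
ax.set_title("Current-week incidence rate by region")
ax.set_ylabel("Cases per 100,000")
fig.savefig("output/incidence_by_region.png", dpi=300)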

Step 5: Automate with shell script

#!/bin/bash
echo "Starting weekly surveillance analysis..."

python scripts/01_clean_data.py
python scripts/02_analyze.py
python scripts/03_visualize.py

echo "Analysis complete. Outputs in output/ directory."

Time saved: Manual analysis takes 3-4 hours/week. With AI-generated scripts: 30 minutes initial setup, 5 minutes/week ongoing = 150+ hours/year saved.

16.6.2 Workflow 2: Automating Reports with R Markdown

Scenario: Manual Word reports take 2 hours weekly. Want automated generation.

Create an R Markdown template with AI assistance:

Prompt: “Create R Markdown template for automated disease surveillance report with title and auto-generated date, executive summary with key metrics, tables for cases by region and demographic breakdown, figures showing trend plot and choropleth map, and recommendations section. Use parameters for: data_file, report_date, alert_threshold.”

AI generates a parameterized template that automatically generates reports from your data, complete with visualizations, summary statistics, and data-driven recommendations.

You can then automate distribution:

library(blastula)

email <- compose_email(
  body = md("Attached is this week's surveillance report..."),
  footer = md("*Automated report*")
)

# Note: smtp_send() also needs SMTP credentials (see blastula's credential helpers)
smtp_send(email,
  from = "surveillance@health.gov",
  to = c("director@health.gov", "epi-team@health.gov"),
  subject = paste("Weekly Surveillance Report", Sys.Date())
)

Result: 2-hour manual task → 5-minute automated process.

16.6.3 Workflow 3: Building Interactive Dashboards

Scenario: Need interactive dashboard for leadership to explore surveillance data.

Use R Shiny or Python Dash. Prompt AI to create dashboard skeleton:

“Create R Shiny dashboard for disease surveillance with sidebar containing date range selector, region filter, and metric dropdown. Main panel should show value boxes with key metrics, interactive line plot of cases over time, filterable data table, and map visualization. Data source: surveillance_data.csv with date, region, cases, population columns.”

AI generates complete dashboard code that you can customize, test locally, and deploy to hosting services like shinyapps.io or Heroku.
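For orientation, a stripped-down version of such a dashboard in Python Dash might look like this (a sketch, not the full feature set described in the prompt; column names follow the prompt):

import pandas as pd
import plotly.express as px
from dash import Dash, dcc, html, Input, Output

data = pd.read_csv("surveillance_data.csv", parse_dates=["date"])

app = Dash(__name__)
app.layout = html.Div([
    html.H1("Disease Surveillance Dashboard"),
    dcc.Dropdown(
        id="region-filter",
        options=[{"label": r, "value": r} for r in sorted(data["region"].unique())],
        placeholder="All regions",
    ),
    dcc.Graph(id="cases-over-time"),
])

@app.callback(Output("cases-over-time", "figure"), Input("region-filter", "value"))
def update_plot(region):
    # Filter to the selected region (or keep everything), then aggregate by week
    df = data if region is None else data[data["region"] == region]
    weekly = df.groupby(pd.Grouper(key="date", freq="W"))["cases"].sum().reset_index()
    return px.line(weekly, x="date", y="cases", title="Weekly reported cases")

if __name__ == "__main__":
    app.run(debug=True)  # use app.run_server() on older Dash releases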

16.6.4 Workflow 4: AI-Assisted Debugging

Scenario: Code produces cryptic error, can’t figure out the problem.

Step 1: Gather information

Your code throws an error. For example:

data <- read_csv("surveillance.csv")
result <- data %>% group_by(region) %>% summarize(avg_cases = mean(cases))

# Error: argument is not numeric or logical: returning NA

Step 2: Use AI to diagnose

Prompt to AI:

"I'm getting this error in R:
'Error in mean.default(cases) : argument is not numeric or logical: returning NA'

My code:
data <- read_csv('surveillance.csv')
result <- data %>% group_by(region) %>% summarize(avg_cases = mean(cases))

The 'cases' column appears as character (chr) instead of numeric.
How do I fix this?"

AI explains the issue (cases stored as character due to non-numeric values in data) and provides solutions: convert to numeric, specify column types when reading, or handle missing values explicitly.

Step 3: Implement fix, verify

# Apply fix
data <- read_csv('surveillance.csv') %>%
  mutate(cases = na_if(cases, '-')) %>%
  mutate(cases = as.numeric(cases))

# Verify
str(data$cases)  # Should show 'num' not 'chr'

16.7 Responsible Use of AI Coding Tools

16.7.1 When AI Assistance Is Appropriate

Good use cases:

Boilerplate and repetitive code: Data loading, error handling, logging, configuration files. High accuracy, low risk.

Exploratory prototyping: Quickly test approaches, generate multiple alternatives, learn new libraries or frameworks.

Code explanation: Understanding inherited code, learning new concepts, identifying potential issues in complex code.

Syntax and API reference: “How do I filter pandas DataFrame?”, “What’s the argument order for this function?”, “Remind me of the syntax for list comprehension.”

Test generation: Creating unit tests, generating test data, edge case identification.
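For instance, the pandas filtering question mentioned above usually comes down to a boolean-mask one-liner (file and column names hypothetical):

import pandas as pd

df = pd.read_csv("surveillance_data.csv", parse_dates=["date"])

# Keep rows from 2024 onward for a single region
subset = df[(df["date"] >= "2024-01-01") & (df["region"] == "North")]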

16.7.2 When to Avoid AI Assistance

Inappropriate use cases:

Critical calculations: Disease rate calculations, statistical analyses, risk assessments. Errors have serious consequences—verify manually.

Security-sensitive code: Authentication, data encryption, privacy protections. AI-generated code may have vulnerabilities.

Novel algorithms: Cutting-edge methods, research code, domain-specific logic. AI may not understand specialized requirements.

Learning fundamentals: If you’re trying to learn programming, over-reliance prevents understanding.

Proprietary or sensitive contexts: Code with business logic, classified information, or PHI. May violate privacy/IP policies.

16.7.3 Validation Strategies

Always validate AI-generated code:

1. Read and understand: Don’t accept code you don’t understand. If it’s not clear, ask AI to explain or simplify.

2. Test with known cases: Run code on data with known outcomes. Compare results to manual calculations.

3. Edge case testing: Test with missing values, zero values, extreme values, empty datasets, malformed inputs.

4. Code review: Have colleague review AI-generated code, especially for production use. Fresh eyes catch errors.

5. Documentation: Add comments explaining logic. If you can’t explain it, you don’t understand it well enough.

6. Version control: Commit AI-generated code separately with clear messages noting AI assistance.
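To make strategies 2 and 3 concrete, a few quick tests against the calculate_incidence function shown earlier in this chapter might look like this (a sketch using pytest; the incidence module name is hypothetical):

import pytest

from incidence import calculate_incidence  # hypothetical module holding the function


def test_known_case():
    # 50 cases in a population of 100,000 over one year -> 50 per 100,000
    assert calculate_incidence(50, 100_000, 1) == pytest.approx(50)


def test_zero_cases():
    assert calculate_incidence(0, 100_000, 1) == 0


def test_zero_population_raises():
    # Decide explicitly how the function should behave on a zero denominator
    with pytest.raises(ZeroDivisionError):
        calculate_incidence(5, 0, 1)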

16.7.4 Building Technical Competency

AI as teaching tool vs crutch:

Use AI to accelerate learning:
  • Ask for explanations alongside code
  • Request variations to understand patterns
  • Generate exercises and practice problems
  • Get feedback on your own code

Don’t let AI prevent learning:
  • Type code yourself rather than copy-pasting blindly
  • Modify AI suggestions to fit your specific needs
  • Attempt problems before asking AI
  • Use AI for hints, not complete solutions

Progressive skill building:

  • Beginner: AI generates most code, you modify and test
  • Intermediate: You write code, AI assists with specific tasks
  • Advanced: AI helps with unfamiliar libraries, you write logic
  • Expert: AI used for speed/exploration, you validate and architect

The goal is moving up this ladder, not staying at the bottom.


16.8 Choosing Your Coding Path

16.8.1 Should You Learn to Code?

Consider learning to code if:

  • You frequently work with data requiring custom analysis
  • You want independence from IT/developer teams
  • You enjoy problem-solving and logical thinking
  • You have time to invest in skill development (3-6 months to basic competency)
  • Your organization supports learning and practice

Consider alternatives if:

  • Your needs are one-time or very infrequent
  • Low-code tools meet your requirements
  • You have access to dedicated analyst/developer support
  • Time constraints prevent sustained learning
  • Your role focuses on other critical skills

16.8.2 Low-Code Alternatives

When low-code tools are sufficient:

Data visualization: Tableau, Power BI, Looker—create sophisticated dashboards without coding

Data analysis: Excel with Power Query, Google Sheets with Apps Script—handle many analyses with formulas and built-in tools

Workflow automation: Zapier, Make (formerly Integromat)—connect systems and automate tasks with visual workflows

Web apps: Retool, Bubble, Webflow—build interactive applications without traditional programming

Limitations: Less flexible than code, may hit capability ceiling, vendor lock-in, cost for advanced features, limited customization.

16.8.3 When to Hire Developers

Hire/consult developers for:

  • Complex systems: Multi-component applications, databases, APIs
  • Production deployment: User-facing tools requiring reliability, security, scalability
  • Specialized expertise: Machine learning model development, advanced statistical methods
  • Long-term maintenance: Systems requiring ongoing support and updates
  • Compliance requirements: HIPAA-compliant systems, validated software for regulatory submissions

Working effectively with developers:

Communicate requirements clearly: What problem are you solving, who are the users, what are the key features, what data is involved, what are success criteria?

Provide domain expertise: Developers know coding, you know public health. Collaboration produces best results.

Iterate and give feedback: Start with minimum viable product (MVP), test with real workflows, provide specific feedback, refine iteratively.

Learn enough to communicate: Basic understanding of technical concepts helps discussions. You don’t need to code, but understanding possibilities and constraints helps.


16.9 Summary

AI-assisted coding tools are transforming software development, making programming more accessible to public health practitioners. Modern development environments like VS Code, Jupyter, and RStudio, combined with AI assistants like GitHub Copilot and Cursor, enable analysts to accomplish sophisticated technical work with less training.

However, these tools require responsible use:
  • Validate AI-generated code thoroughly
  • Understand code before deploying
  • Use AI to accelerate learning, not replace it
  • Apply appropriate security for sensitive contexts
  • Choose the right tool for your use case

Version control with Git and GitHub provides professional workflows for reproducible analysis, enabling collaboration, backup, and audit trails critical for public health work.

Whether you choose to learn coding, use low-code tools, or hire developers depends on your specific needs, resources, and constraints. AI tools lower the barrier to entry, but thoughtful application and continued learning remain essential.


Check Your Understanding

Question 1: Development Environment Selection

You’re a public health analyst who primarily works with R for statistical analysis but occasionally needs Python for specific packages. You want AI coding assistance and don’t have a large budget. Which development environment would be most appropriate?

  1. Cursor ($20/month) for best AI integration
  2. RStudio + external ChatGPT (free)
  3. VS Code + GitHub Copilot ($10/month)
  4. Replit Ghostwriter ($25/month)

Answer: C) VS Code + GitHub Copilot ($10/month)

Rationale:
  • VS Code provides excellent multi-language support (both R and Python)
  • GitHub Copilot offers strong AI assistance at reasonable cost ($10/month vs $20 for Cursor or $25 for Replit)
  • More flexible than RStudio alone (which is R-specific)
  • Cursor would work but costs more
  • Replit is browser-only and costs more
  • RStudio + ChatGPT works but provides less integrated AI assistance

VS Code + Copilot offers the best balance of multi-language support, AI assistance, and cost for this use case.
Question 2: Git Workflow Safety

You’re working on a surveillance data analysis project in Git. Which of these practices is MOST important for safety and reproducibility?

  1. Committing data files to Git for backup
  2. Using descriptive commit messages like “Fix age-standardization calculation”
  3. Making commits at the end of each week
  4. Working directly on the main branch for simplicity

Answer: B) Using descriptive commit messages like “Fix age-standardization calculation”

Rationale:
  • A is WRONG: Never commit sensitive data (PHI/PII) to Git. Use .gitignore to exclude data files
  • B is CORRECT: Descriptive commit messages create an audit trail and enable understanding changes months later
  • C is WRONG: Should commit frequently (daily or after completing logical units of work), not weekly
  • D is WRONG: Should use branches for new features/analyses, protecting main branch from experimental changes

Clear commit messages are critical for reproducibility, collaboration, and compliance. They document what changed and why, enabling future understanding and potential rollback.

Additional best practices include:
  • Commit frequently (not just weekly)
  • Use branches for experimental work
  • Never commit sensitive data
  • Review changes before committing
Question 3: Appropriate AI Assistant Use

Which of these scenarios is MOST appropriate for relying heavily on AI-generated code without extensive manual review?

  1. Calculating age-adjusted disease rates for published report
  2. Creating data loading script to read CSV files
  3. Implementing patient data encryption system
  4. Developing novel statistical method for outbreak detection

Answer: B) Creating data loading script to read CSV files

Rationale:
  • A is INAPPROPRIATE: Disease rate calculations are critical for public health decisions. Errors could lead to wrong conclusions. Requires careful validation.
  • B is APPROPRIATE: Data loading is a routine, well-defined, low-risk task. Easy to test with known data. Errors are typically obvious and non-consequential.
  • C is INAPPROPRIATE: Security-sensitive code requires expert review. AI tools introduced vulnerabilities in roughly 40% of security-sensitive scenarios studied. Critical for HIPAA compliance.
  • D is INAPPROPRIATE: Novel methods require deep understanding and validation. AI trained on existing methods may not correctly implement new approaches.

Safe AI use principles:
  • High trust for routine/boilerplate tasks (data loading, plotting, file I/O)
  • Medium trust for standard analyses (test, validate with known cases)
  • Low trust for critical calculations (verify manually, peer review)
  • No trust for security or novel algorithms (expert review required)

Always validate AI code, but level of scrutiny should match consequences of errors.

16.10 Further Resources

16.10.1 Tools and Documentation

16.10.2 Learning Platforms

16.10.3 Public Health Specific

16.10.4 AI and Coding