16  AI-Assisted Coding and Development Tools

Learning Objectives

By the end of this chapter, you will be able to:

  1. Set up modern development environments appropriate for public health data analysis
  2. Use AI coding assistants effectively while understanding their limitations
  3. Implement version control workflows for reproducible analysis
  4. Apply practical coding workflows to common public health tasks
  5. Evaluate when AI coding assistance is appropriate versus when human expertise is required
  6. Choose between learning to code, using low-code tools, or hiring developers
Time Estimate

Reading and exercises: 90-120 minutes
Hands-on practice: 180-240 minutes
Total: 4.5-6 hours

Prerequisites

This chapter builds on:

  • Chapter 13: Your AI Toolkit
  • Chapter 14: Building Your First Project

You should be familiar with basic computing concepts and have an interest in building technical skills for data analysis.

16.1 What You’ll Learn

This chapter provides practical guidance for public health practitioners seeking to leverage AI-assisted coding tools. We focus on accessible, widely-used tools rather than cutting-edge research systems. Our emphasis is on enabling effective, safe use while building genuine technical competency.


16.2 Introduction: The Democratization of Coding

The public health workforce increasingly requires computational skills. Modern epidemiology involves wrangling messy surveillance data, automating routine reports, building interactive dashboards, and conducting complex statistical analyses. Historically, these tasks required years of programming expertise. Today, AI-assisted coding tools are lowering barriers, enabling public health practitioners with minimal programming background to accomplish sophisticated technical work.

A 2023 survey found that 87% of data scientists and analysts reported using AI coding assistants, with 68% saying these tools made them significantly more productive (Kalliamvakou et al., 2023, GitHub). GitHub Copilot users completed tasks 55% faster than those without AI assistance, and reported higher job satisfaction (Ziegler et al., 2022, arXiv). These tools are particularly transformative for learners: novice programmers using AI assistance completed coding tasks with quality comparable to experienced programmers working without AI (Prather et al., 2023, ACM SIGCSE).

However, AI coding assistants have significant limitations. They generate syntactically correct code that may be logically flawed, introduce security vulnerabilities in 40% of cases in security-sensitive contexts (Pearce et al., 2022, IEEE Security & Privacy), and can impede deep learning if used as a crutch rather than a teaching aid (Prather et al., 2024, ACM SIGCSE). For public health applications handling sensitive data and informing consequential decisions, understanding when and how to use AI coding tools responsibly is critical.


16.3 Modern Development Environments

16.3.1 Visual Studio Code (VS Code): The Standard

What it is: Visual Studio Code (VS Code) is a free, open-source code editor developed by Microsoft, now the most popular development environment worldwide, used by over 73% of developers (Stack Overflow, 2023, Developer Survey). It’s lightweight yet powerful, extensible through thousands of plugins, and supports virtually every programming language.

Why it matters for public health:

Traditional approaches require separate tools for each task—R code in RStudio, Python code in Jupyter, SQL queries in database clients, and documentation in Word. This creates inefficient workflows with constant context switching.

VS Code provides one unified environment where you can work with R, Python, SQL, and Markdown all in one editor, with integrated terminals, Git version control built-in, and extensions for AI assistance, data visualization, and debugging. This creates a seamless workflow with fewer tools to learn.

Key features for public health:

1. Multi-language support

You can switch between languages in the same project:

# Python for data processing
import pandas as pd
data = pd.read_csv('surveillance_data.csv')

# R for statistical analysis
library(tidyverse)
data %>% group_by(region) %>% summarize(cases = sum(case_count))

-- SQL for database queries
SELECT region, COUNT(*) as cases
FROM surveillance_data
WHERE date >= '2024-01-01'
GROUP BY region;

2. Integrated terminal

Run code without leaving the editor, with no need to switch between editor and command line:

  • Python scripts: python analysis.py
  • R scripts: Rscript analysis.R
  • Git commands: git commit -m "message"
  • Package installation: pip install pandas

3. Extensions ecosystem

Essential extensions for public health data work:

Data Science:
  • Python: Full Python support (Microsoft)
  • R: R language support and debugging
  • Jupyter: Run notebooks directly in VS Code
  • Rainbow CSV: Syntax highlighting for CSV files
  • Data Wrangler: Visual data exploration

AI Assistance:
  • GitHub Copilot: AI code completion ($10/month)
  • Tabnine: Free AI completions
  • Cody: AI chat and code search

Productivity:
  • GitLens: Enhanced Git integration
  • Todo Tree: Track TODOs in codebase
  • Code Spell Checker: Catch typos in code
  • Markdown Preview: Preview documentation

Getting started:

Step 1: Download and install. Visit https://code.visualstudio.com/, download for your OS (Windows, Mac, Linux), and install—it’s a 5-minute process with no configuration needed.

Step 2: Install language support. Open VS Code → Extensions (Ctrl+Shift+X). Search and install “Python” (by Microsoft) and “R” (by REditorSupport). Install Python/R on your computer if not already installed.

Step 3: Open your first project. File → Open Folder → Select project directory. Create new file: analysis.py or analysis.R. Start coding with autocomplete and syntax highlighting.

Step 4: Add AI assistance (optional). Extensions → Search “GitHub Copilot”. Install and sign in (requires GitHub account + subscription). Begin getting AI code suggestions as you type.

16.3.2 Cursor: AI-First Coding Environment

What it is: Cursor is a fork of VS Code with integrated AI capabilities, launched in 2023 as a code editor built specifically for AI-assisted development (Cursor, 2024). It includes GPT-4 and Claude built directly into the editor, enabling natural language code generation, explanation, and debugging.

Key differences from VS Code + Copilot:

VS Code with Copilot provides AI suggestions as you type (autocomplete on steroids), requires explicit prompting for explanations, and draws its AI context mainly from the file you are editing and other open tabs.

Cursor understands your entire codebase across multiple files, converts natural language to code (e.g., “Create function to calculate disease incidence”), includes built-in chat with code context, allows editing code directly through AI commands, and provides automatic bug detection and fixing.

The trade-off: More powerful AI, but $20/month subscription after trial versus VS Code’s free option with $10/month Copilot.

Best use cases for public health:

Scenario 1: Starting from scratch. You ask: “Create a Python script to read surveillance data from CSV, calculate weekly disease incidence rates by region, and generate a bar chart.”

Cursor generates a complete working script with imports, data loading, calculations, and visualization—saving 30-60 minutes for a novice or 10-15 minutes for an expert.

Scenario 2: Understanding existing code. Select a complex function and ask Cursor “What does this code do?” It explains: “This function performs age-standardization using the direct method: takes age-specific rates and standard population, applies weights, and returns age-adjusted rates. This allows comparing disease rates across populations with different age structures.”
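For reference, a minimal Python sketch of the kind of function being described might look like this (column names are hypothetical; this is an illustration, not Cursor output):

import pandas as pd

def age_standardize(rates: pd.DataFrame, standard_pop: pd.DataFrame) -> float:
    """Direct age-standardization: weight age-specific rates by a standard population.

    Assumes both inputs share an 'age_group' column; `rates` carries 'rate'
    (cases per 100,000) and `standard_pop` carries 'population'.
    """
    merged = rates.merge(standard_pop, on="age_group")
    weights = merged["population"] / merged["population"].sum()
    return float((merged["rate"] * weights).sum())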

Scenario 3: Debugging. When code throws an error, Cursor automatically suggests fixes, like: “The error ‘KeyError: age_group’ occurs because the CSV column is named ‘age_category’ not ‘age_group’. Change line 42 to: df['age_category'].”

Limitations:

  • Cost: $20/month after free trial (vs free VS Code + $10/month Copilot)
  • Privacy: Code sent to AI providers—don’t use with sensitive code without enterprise agreement
  • Learning: May become crutch preventing deep understanding
  • Stability: Newer tool, less mature than VS Code

When to choose Cursor over VS Code: Choose Cursor if you have heavy AI-assisted coding workflows, frequently work with unfamiliar codebases, value integrated AI over plugin ecosystem, and budget allows ($240/year).

When to stick with VS Code: Choose VS Code if you’re budget-conscious ($0-120/year with Copilot), need maximum control and customization, prefer established mature tools, or work with sensitive code requiring privacy.

16.3.3 Jupyter Notebooks: Interactive Data Analysis

What it is: Jupyter Notebooks are web-based interactive computing environments where code, visualizations, and narrative text coexist in a single document (Kluyver et al., 2016, IOS Press). They’ve become the standard for exploratory data analysis, with over 10 million public notebooks on GitHub (Rule et al., 2018, PLOS Computational Biology).

Why they matter for public health:

Traditional scripts (analysis.R or analysis.py) run top to bottom with output to console or files, and narrative is kept separate (e.g., in Word documents). This makes it hard to share interactive analysis.

Jupyter notebooks run in “cells” that can execute out of order, display visualizations inline, include Markdown text between code cells, and allow sharing complete analysis (code + output + narrative). The result is executable documents perfect for exploration and communication.

Structure of a Jupyter notebook:

Cell 1 [Markdown]:
# COVID-19 Vaccine Coverage Analysis
## Data Source: State Immunization Registry
Analysis date: 2024-10-15

Cell 2 [Code]:
import pandas as pd
import matplotlib.pyplot as plt

data = pd.read_csv('vaccine_data.csv')
print(f"Total records: {len(data)}")

[Output]: Total records: 15,247

Cell 3 [Markdown]:
The dataset contains vaccination records for 15,247 individuals.
We'll analyze coverage by age group and geography.

Cell 4 [Code]:
coverage_by_age = data.groupby('age_group')['vaccinated'].mean()
coverage_by_age.plot(kind='bar')
plt.title('Vaccination Coverage by Age Group')
plt.ylabel('Proportion Vaccinated')

[Output]: [Bar chart displayed inline]

Cell 5 [Markdown]:
Coverage is highest in the 65+ age group (87%) and lowest in
18-29 age group (62%). This suggests targeted outreach needed for
younger adults.

Best practices for public health notebooks:

Structure clearly:
  1. Introduction cell: What, why, when
  2. Setup cells: Imports, file paths, parameters
  3. Analysis cells: One logical step per cell
  4. Interpretation cells: Markdown explaining findings
  5. Conclusion cell: Summary and recommendations

Document thoroughly:
  • Markdown cells explain purpose of code
  • Code comments explain implementation
  • Assumptions stated explicitly
  • Data sources and dates noted

Make reproducible (see the setup sketch after this list):
  • Clear file paths or URLs for data
  • Specify package versions (requirements.txt)
  • Set random seeds for stochastic processes
  • Include environment information

Avoid pitfalls:
  • Running cells out of order (creates confusion)
  • Hidden state (variables from deleted cells)
  • Massive outputs (truncate long outputs)
  • Lack of narrative (pure code without explanation)
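As one illustration of the reproducibility items above, the setup cell of a notebook might pin random seeds and record the environment (a minimal sketch; the packages shown are only examples):

import random
import sys

import numpy as np
import pandas as pd

# Fix seeds so any stochastic steps (sampling, simulation) are repeatable
random.seed(42)
np.random.seed(42)

# Record the environment alongside the analysis
print(f"Python {sys.version}")
print(f"pandas {pd.__version__}, numpy {np.__version__}")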

AI assistance in Jupyter:

Approach 1: Copilot in VS Code. Open .ipynb files in VS Code. GitHub Copilot suggests code as you type. Execute cells directly in VS Code.

Approach 2: JupyterLab AI extensions. Use extensions like jupyterlab-ai (ChatGPT integration) or jupyter-ai (multi-model AI assistance) to generate entire cells from natural language.

Approach 3: External LLM → Copy code. Describe analysis in ChatGPT/Claude, generate code in chat interface, copy into notebook cells, then test and validate.

When to use notebooks vs scripts:

Use Jupyter notebooks for:
  • Exploratory data analysis
  • Teaching and learning
  • Communicating findings (code + narrative + visualizations)
  • One-off analyses
  • Interactive demonstrations

Use scripts (.py, .R) for:
  • Production pipelines
  • Automated reports (scheduled runs)
  • Functions and packages
  • Performance-critical code
  • Version control-friendly code

Many workflows start by exploring in a notebook, then productionize as a script.

16.3.4 RStudio: The R Environment

What it is: RStudio is an integrated development environment (IDE) specifically designed for R, the statistical programming language dominant in epidemiology and biostatistics (Racine, 2012, Journal of Applied Econometrics). While VS Code and Jupyter support R, RStudio provides the most polished R experience.

Advantages for R users:

RStudio offers native R integration: a best-in-class console with command history, an environment pane showing all objects (data, functions, variables), a plots pane for easy visualization management, a package management interface, built-in help and documentation, R Markdown support (like Jupyter but R-native), and debugging tools specific to R.

If R is your primary language, RStudio is hard to beat.

AI assistance in RStudio:

RStudio doesn’t have built-in AI like Cursor, but can be enhanced:

Option 1: GitHub Copilot (via VS Code). Work in VS Code for R development and use Copilot for code suggestions. Trade-off: Lose RStudio-specific features.

Option 2: R packages for AI assistance

The {gptstudio} package adds ChatGPT integration to RStudio:

install.packages("gptstudio")
# Adds "ChatGPT" menu to RStudio
# Can ask for code, explanations, debugging help

The {chattr} package provides a chat interface for multiple LLMs:

install.packages("chattr")
# Chat with GPT-4, Claude, local models
# Generate code, explain concepts

Option 3: External LLM → Copy code. Use ChatGPT/Claude in browser, generate R code, copy into RStudio. This is the standard workflow for many users.

R Markdown for reproducible reports:

R Markdown documents (like Jupyter notebooks) combine R code chunks, Markdown narrative, inline code results, and can output to PDF, HTML, Word, etc.

Example: Automated weekly surveillance report

An R Markdown file combines a YAML header, code chunks that execute when rendered, and narrative text with inline R results. A minimal sketch (column names are illustrative):

---
title: "Weekly Disease Surveillance Report"
date: "`r Sys.Date()`"
output: html_document
---

```{r setup, include=FALSE}
library(tidyverse)
library(knitr)
data <- read_csv("weekly_surveillance.csv")
current_cases  <- sum(data$cases[data$week == max(data$week)])
previous_cases <- sum(data$cases[data$week == max(data$week) - 1])
pct_change <- round((current_cases - previous_cases) / previous_cases * 100, 1)
```

## Executive Summary

This week, `r current_cases` cases were reported, representing a
`r pct_change`% change from last week.

```{r trend-plot, echo=FALSE}
data %>%
  group_by(week) %>%
  summarize(cases = sum(cases)) %>%
  ggplot(aes(x = week, y = cases)) +
  geom_line() +
  labs(title = "Reported cases by week", x = "Week", y = "Cases")
```

## Recommendations

Based on current trends, we recommend:

1. Enhanced surveillance in Region X
2. Targeted outreach in age group Y

When to choose RStudio:

Choose RStudio if:
  • R is your primary language (>80% of coding time)
  • Doing statistical analysis (traditional strengths of R)
  • Working in epidemiology/biostatistics (R-dominant fields)
  • Need advanced R debugging
  • Prefer specialized, purpose-built tool

Choose VS Code if:
  • Working in multiple languages (R + Python + SQL)
  • Need AI assistance (better Copilot integration)
  • Want unified environment for all projects
  • Prefer customizable, extensible editor


16.4 AI Coding Assistants

16.4.1 GitHub Copilot: The Market Leader

What it is: GitHub Copilot, launched in 2021, was the first widely-available AI coding assistant, powered by OpenAI Codex (a model derived from GPT) (Chen et al., 2021, arXiv). It suggests code completions as you type, learns from context, and can generate entire functions from comments.

How it works:

When you type a comment like # Calculate disease incidence, Copilot suggests a complete function:

def calculate_incidence(cases, population, time_period):
    """
    Calculate disease incidence rate

    Parameters:
    cases: Number of new cases
    population: Population at risk
    time_period: Time period in years

    Returns:
    Incidence rate per 100,000 person-years
    """
    rate = (cases / population) * 100000 / time_period
    return rate

Mechanism:
  1. Analyzes your code context (files open, cursor position)
  2. Generates multiple completions using language model
  3. Ranks by likelihood and relevance
  4. Suggests most probable completion
  5. You accept (Tab), reject (Esc), or modify

Suggestions appear in less than 1 second as you type.

Key features:

1. Context-aware suggestions

Copilot learns from your current file, related open files, file type and language, variable names, function names, and comments/docstrings.

Example: If you have a dataframe called surveillance_data with columns region, date, cases, Copilot will suggest code using those exact names and structure.

2. Multi-line completions

Copilot can complete single lines like total = sum(case_counts) or generate entire multi-line functions:

def clean_surveillance_data(df):
    # Remove duplicate records
    df = df.drop_duplicates(subset=['id'])

    # Handle missing values
    df = df.dropna(subset=['date', 'region'])

    # Standardize date format
    df['date'] = pd.to_datetime(df['date'])

    return df

3. Natural language to code

Comment-driven development works well with Copilot:

# Load surveillance data from CSV, filter to 2024, calculate weekly totals by region
# [Copilot suggests complete implementation]

# Create a choropleth map showing disease incidence by county
# [Copilot suggests map code with appropriate library]
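For the first comment above, a completion in this spirit might look like the snippet below (a hand-written illustration; actual Copilot output varies, and the column names are assumptions):

import pandas as pd

# Load surveillance data from CSV, filter to 2024, calculate weekly totals by region
df = pd.read_csv("surveillance_data.csv", parse_dates=["date"])
df_2024 = df[df["date"].dt.year == 2024]
weekly_totals = (
    df_2024
    .groupby(["region", pd.Grouper(key="date", freq="W")])["cases"]
    .sum()
    .reset_index()
)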

Pricing and access:

  • Individual: $10/month or $100/year with unlimited completions and chat interface
  • Business: $19/user/month with team management, organization-wide policies, and audit logs
  • Free for: Verified students (education), open source maintainers, and some enterprise plans
  • Trial: 30 days free for all new users

Effectiveness evidence:

Research on Copilot productivity shows developers completed tasks 55% faster with Copilot (Ziegler et al., 2022, arXiv). On average, about 26% of suggestions are accepted, varying by language. Copilot shows highest utility for repetitive tasks, boilerplate code, and test writing, with a learning curve of 2-4 weeks to develop effective use patterns (Kalliamvakou et al., 2023, GitHub).

Limitations and risks:

Code correctness not guaranteed:
  • Syntactically correct ≠ logically correct
  • May suggest plausible but wrong algorithms
  • Can introduce subtle bugs
  • Always test generated code

Security vulnerabilities:
  • 40% of Copilot suggestions contained security issues in one study focused on security-sensitive contexts (Pearce et al., 2022, IEEE Security & Privacy)
  • May suggest deprecated functions or unsafe practices
  • Requires security code review

License and copyright concerns:
  • Trained on public GitHub code (various licenses)
  • Generated code may closely match training examples
  • Legal gray area for code ownership
  • Organizations may prohibit for proprietary code

Privacy considerations:
  • Code context sent to Microsoft/OpenAI servers
  • May expose sensitive information in variable names, comments
  • Not appropriate for classified or highly sensitive code
  • Enterprise version offers more privacy controls

16.4.2 Other AI Coding Assistants

Amazon CodeWhisperer (now “Amazon Q Developer”)

Amazon’s AI coding assistant is free for individuals (as of 2023), or $19/month for professional features. It supports Python, Java, JavaScript, TypeScript, C#, Go, Rust, PHP, Ruby, and other languages.

Advantages:
  • Generous free tier (unlike Copilot)
  • Integrated with AWS services
  • Security scanning included
  • Bias detection

Disadvantages:
  • Less polished than Copilot
  • Smaller training dataset
  • Fewer integrations

Best for: AWS users, budget-conscious developers, security-focused teams

Tabnine

Tabnine is a privacy-focused AI code completion tool with a free basic tier and $12/month Pro version. It can be deployed in the cloud or on-premises.

Advantages:
  • Privacy emphasis (can run locally)
  • Team training on private codebase
  • No code leaves organization (self-hosted)
  • Compliance-friendly (HIPAA, SOC 2)

Disadvantages:
  • Less capable than Copilot/Claude
  • Local models require GPU
  • Fewer languages supported

Best for: Organizations with strict privacy requirements, healthcare/finance sectors

Replit Ghostwriter

Replit Ghostwriter is an AI assistant in Replit (browser-based IDE) for $25/month (includes Replit compute). It’s browser-only with no installation required.

Advantages:
  • Zero setup (fully browser-based)
  • Great for learners
  • Integrated execution environment
  • Collaboration features

Disadvantages:
  • Requires internet
  • Limited to Replit ecosystem
  • Less powerful than Copilot

Best for: Students, teaching, quick prototypes, no-install scenarios

Cody (by Sourcegraph)

Cody combines code search with AI assistance. It’s free for individuals, or $9-19/month Pro. It can index large codebases.

Advantages:
  • Understands entire codebase context
  • Code search alongside AI
  • Multiple LLM backends (GPT-4, Claude, etc.)
  • Good for large projects

Disadvantages:
  • Requires initial indexing
  • More complex setup

Best for: Large projects, teams, codebases with extensive history

Selection guide:

  • Choose Copilot if: Standard choice, best autocomplete, wide adoption
  • Choose Claude Code if: Prefer conversational, long context, privacy-conscious
  • Choose CodeWhisperer if: Budget-constrained, AWS user, security focus
  • Choose Tabnine if: Privacy paramount, can’t send code externally
  • Choose Replit if: Teaching/learning, want zero setup
  • Choose Cody if: Large codebase, need semantic code search

16.5 Version Control with Git and GitHub

16.5.1 Why Version Control Matters for Public Health

The problem without version control:

Your desktop might look like this:

analysis_v1.R
analysis_v2.R
analysis_v2_final.R
analysis_v2_final_REVISED.R
analysis_v2_final_REVISED_JAN15.R
analysis_FINAL_USE_THIS_ONE.R

Disaster scenarios include:
  • Which version was used for the report?
  • What changed between versions?
  • How to merge colleague’s edits?
  • Accidentally deleted file, no backup?
  • Need to revert to version from 2 months ago?

Version control solution:

Git tracks changes to files over time, providing:
  • Complete history (who changed what, when, why)
  • Ability to revert to any previous version
  • Parallel work (branches)
  • Merging changes from multiple people
  • Backup on remote server (GitHub)
  • Collaboration without emailing files

Result: Professional, reproducible, collaborative workflow

Why it’s critical for public health:

  1. Reproducibility: Scientific requirement to recreate analyses
  2. Compliance: Regulatory requirements for audit trails (FDA, CDC)
  3. Collaboration: Multiple analysts working on same project
  4. Safety: Backup against data loss
  5. Documentation: Clear record of analytical decisions

16.5.2 Git Basics

Core concepts:

  • Repository (repo): Project folder tracked by Git
  • Commit: Snapshot of files at a point in time
  • Branch: Parallel version of code
  • Remote: Server hosting repository (GitHub, GitLab)
  • Clone: Copy repository to local computer
  • Pull: Download changes from remote
  • Push: Upload changes to remote
  • Merge: Combine branches

Essential Git workflow:

# One-time setup
git config --global user.name "Your Name"
git config --global user.email "your.email@health.gov"

# Starting a new project
cd my-analysis-project
git init                           # Initialize Git repo
git add .                          # Stage all files
git commit -m "Initial commit"     # Create first snapshot

# Making changes
# ... edit files ...
git status                         # See what changed
git add analysis.R                 # Stage specific file
git commit -m "Add incidence calculation"  # Save snapshot

# Reviewing history
git log                            # See all commits
git diff                           # See exact changes

# Connecting to GitHub
git remote add origin https://github.com/yourusername/project.git
git push -u origin main            # Upload to GitHub

# Daily workflow
git pull                           # Download latest changes
# ... work on files ...
git add modified_files.R
git commit -m "Descriptive message"
git push                           # Upload your changes

Commit message best practices:

Bad messages:

git commit -m "fix"
git commit -m "changes"
git commit -m "update"

Good messages:

git commit -m "Fix calculation error in age-standardization function"
git commit -m "Add missing values handling in data cleaning script"
git commit -m "Update visualization colors for colorblind accessibility"

Guidelines:
  • Start with verb (Add, Fix, Update, Remove)
  • Be specific about what changed
  • Explain why if not obvious
  • Keep under 50 characters
  • Use present tense

16.5.3 GitHub: Collaboration and Hosting

What GitHub adds to Git:

Git is local version control on your computer. GitHub is remote cloud hosting that adds:
  • Backup on cloud servers
  • Collaboration features
  • Web interface for browsing code
  • Issue tracking
  • Project management
  • Automated workflows (GitHub Actions)
  • Team permissions

Key collaboration features:

1. Pull requests (code review)

Workflow:
  1. Create a branch: git checkout -b feature-new-analysis
  2. Make changes, commit
  3. Push branch to GitHub
  4. Open pull request: “Please review and merge my changes”
  5. Team reviews code, suggests improvements
  6. Address feedback, update code
  7. Approved → Merge into main branch

Benefits: Code review catches errors, enables knowledge sharing, provides discussion and documentation, and ensures quality control before production.

2. Issues (task tracking)

Create issues for:
  • Bugs: “Incidence calculation returns negative values for Region X”
  • Features: “Add confidence intervals to disease rate calculations”
  • Questions: “Which statistical test is appropriate for this comparison?”
  • Documentation: “Add comments to data cleaning script”

Track with labels (bug, enhancement, documentation), assignments (who’s working on it), milestones (group issues for v1.0 release), and projects (kanban board view).

3. GitHub Actions (automation)

# .github/workflows/run-analysis.yml
name: Weekly Analysis
on:
  schedule:
    - cron: '0 9 * * 1'  # Every Monday at 9 AM
jobs:
  run-analysis:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v2
      - name: Set up R
        uses: r-lib/actions/setup-r@v2
      - name: Install packages
        run: Rscript -e 'install.packages(c("tidyverse", "knitr"))'
      - name: Run analysis
        run: Rscript weekly_analysis.R
      - name: Commit report
        run: |
          git config --local user.email "action@github.com"
          git config --local user.name "GitHub Action"
          git add weekly_report.html
          git commit -m "Automated weekly analysis $(date)"
          git push

Public vs private repositories:

Public repositories:
  • Free unlimited public repos
  • Anyone can view code
  • Good for: Open source, teaching, public health surveillance methods
  • Don’t use for: Proprietary methods, unreleased findings, sensitive data

Private repositories:
  • Free unlimited private repos (GitHub changed policy in 2019)
  • Access control (specific collaborators only)
  • Good for: Organizational projects, unpublished research, draft analyses
  • Not backed up on personal computer (need to clone)

CRITICAL: NEVER commit sensitive data (PHI, PII) to GitHub, public or private.

16.5.4 Practical Git Workflows for Public Health

Workflow 1: Solo analyst

# Setup (once)
mkdir covid-analysis
cd covid-analysis
git init
echo "# COVID-19 analysis" > README.md   # need at least one file before the first commit
git add .
git commit -m "Initial commit with project README"

# Create GitHub repo (via website)
# Connect local to remote
git remote add origin https://github.com/yourname/covid-analysis.git
git push -u origin main

# Daily work
# Morning: Get latest
git pull

# Work: Edit, save frequently
git add updated_files
git commit -m "Add demographic analysis"

# End of day: Backup
git push

Benefits: Complete history of work, backed up on GitHub, can revert mistakes, professional portfolio (if public).

Workflow 2: Team collaboration

# Team member A: Creates base analysis
git clone https://github.com/orgname/surveillance-project.git
# ... develops initial analysis ...
git add analysis.R
git commit -m "Initial surveillance analysis"
git push

# Team member B: Adds visualizations
git pull  # Get A's work
git checkout -b add-visualizations  # New branch
# ... adds visualization code ...
git add plotting.R
git commit -m "Add regional disease map visualization"
git push origin add-visualizations

# Create pull request on GitHub
# Team reviews, discusses
# After approval, merge to main

Benefits: Parallel work without conflicts, code review improves quality, clear communication via PRs, audit trail of decisions.

Workflow 3: Reproducible analysis

Project structure:
surveillance-analysis/
├── README.md              # Project description
├── data/
│   ├── raw/              # Original data (never edit)
│   └── processed/        # Cleaned data
├── scripts/
│   ├── 01_clean_data.R   # Numbered for order
│   ├── 02_analyze.R
│   └── 03_visualize.R
├── output/
│   ├── figures/
│   └── tables/
├── reports/
│   └── final_report.Rmd
└── .gitignore            # Files to exclude from Git

.gitignore contents:

# Don't commit data (privacy + size)
data/raw/*
# Exclude all CSV files
*.csv
# Exclude R history
.Rhistory
# Exclude system files
.DS_Store
# Generated outputs (can recreate)
output/

README.md contents:

# Surveillance Analysis Project

## Purpose
Weekly analysis of disease surveillance data for County X.

## Requirements
- R 4.2+
- Packages: tidyverse, sf, viridis

## Usage
1. Place data file in data/raw/
2. Run scripts in order: 01, 02, 03
3. View output in output/ directory

## Contact
[Your name] - your.email@health.gov

Benefits: Anyone can reproduce analysis, clear documentation, organized structure, privacy protected (data not in Git).


16.6 Practical Workflows

16.6.1 Workflow 1: Analyzing CSV Surveillance Data

Scenario: You receive weekly surveillance data as CSV. Need to calculate disease rates, create visualizations, generate report.

Step 1: Set up project

# Create project structure
mkdir weekly-surveillance
cd weekly-surveillance
mkdir data scripts output
git init

# Create README
echo "# Weekly Surveillance Analysis" > README.md
echo "Automated weekly analysis of disease surveillance data" >> README.md

git add .
git commit -m "Initial project structure"

Step 2: Generate data cleaning script with AI

Prompt to ChatGPT/Claude/Copilot:

"Create a Python script to:
1. Read CSV file 'surveillance_data.csv' with columns: date, region, age_group, cases, population
2. Convert date to datetime
3. Remove rows with missing cases or population
4. Standardize region names (title case)
5. Add calculated field: incidence_rate = (cases/population) * 100000
6. Export to 'cleaned_data.csv'
Include error handling and logging."

AI generates a complete script that you can review, test, and commit.
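A script along these lines might look like the following sketch (a hand-written illustration of what to expect, not verbatim AI output; file and column names follow the prompt):

import logging

import pandas as pd

logging.basicConfig(level=logging.INFO)
log = logging.getLogger(__name__)


def clean_surveillance_data(infile="surveillance_data.csv", outfile="cleaned_data.csv"):
    try:
        df = pd.read_csv(infile)
    except FileNotFoundError:
        log.error("Input file not found: %s", infile)
        raise

    df["date"] = pd.to_datetime(df["date"], errors="coerce")   # convert dates
    df = df.dropna(subset=["date", "cases", "population"])     # drop incomplete rows
    df["region"] = df["region"].str.title()                    # standardize region names
    df["incidence_rate"] = df["cases"] / df["population"] * 100_000

    df.to_csv(outfile, index=False)
    log.info("Wrote %d cleaned rows to %s", len(df), outfile)
    return df


if __name__ == "__main__":
    clean_surveillance_data()

Whatever the assistant actually produces, run it on a small sample file and confirm the incidence calculation by hand before trusting it.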

Step 3: Generate analysis script

Prompt: “Create Python script to analyze cleaned surveillance data: Calculate total cases and incidence rate by region, calculate 7-day moving average, identify regions with significant increases (>20% week-over-week), generate summary statistics table, export results to CSV and print summary.”
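Again as a hedged sketch rather than verbatim AI output, the analysis step might center on a few grouped summaries like these, assuming the cleaned file from Step 2:

import pandas as pd

df = pd.read_csv("cleaned_data.csv", parse_dates=["date"])

# Total cases and incidence rate by region (population assumed constant per region)
by_region = df.groupby("region").agg(
    total_cases=("cases", "sum"),
    population=("population", "max"),
)
by_region["incidence_rate"] = by_region["total_cases"] / by_region["population"] * 100_000

# 7-day moving average of daily case counts
daily = df.groupby("date")["cases"].sum().sort_index()
moving_avg = daily.rolling(window=7).mean()

# Flag regions with a >20% week-over-week increase
weekly = df.groupby(["region", pd.Grouper(key="date", freq="W")])["cases"].sum()
week_over_week = weekly.groupby(level="region").pct_change()
flagged = week_over_week[week_over_week > 0.20].reset_index()

by_region.to_csv("output/regional_summary.csv")
print(by_region)
print("Regions with >20% weekly increase:")
print(flagged)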

Step 4: Generate visualization script

Prompt: “Create Python script to visualize surveillance data: Line plot showing cases over time by region, bar plot of current week incidence rates by region, heatmap of weekly cases by region (calendar heatmap style). Save as high-resolution PNG files for reports. Use colorblind-friendly palette.”
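A corresponding visualization script can be equally compact; the sketch below covers two of the three requested plots and assumes the output/ folder created in Step 1 exists:

import matplotlib.pyplot as plt
import pandas as pd

df = pd.read_csv("cleaned_data.csv", parse_dates=["date"])

# Line plot: cases over time by region
fig, ax = plt.subplots(figsize=(8, 5))
for region, grp in df.groupby("region"):
    ax.plot(grp["date"], grp["cases"], label=region)
ax.set_title("Cases over time by region")
ax.set_xlabel("Date")
ax.set_ylabel("Cases")
ax.legend()
fig.savefig("output/cases_over_time.png", dpi=300)

# Bar plot: incidence rate by region for the most recent week
latest_week = df[df["date"] >= df["date"].max() - pd.Timedelta(days=6)]
rates = latest_week.groupby("region")["incidence_rate"].mean().sort_values()
fig, ax = plt.subplots(figsize=(8, 5))
rates.plot(kind="bar", ax=ax)
ax.set_title("Current-week incidence rate by region")
ax.set_ylabel("Cases per 100,000")
fig.savefig("output/incidence_by_region.png", dpi=300)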

Step 5: Automate with shell script

#!/bin/bash
echo "Starting weekly surveillance analysis..."

python scripts/01_clean_data.py
python scripts/02_analyze.py
python scripts/03_visualize.py

echo "Analysis complete. Outputs in output/ directory."

Time saved: Manual analysis takes 3-4 hours/week. With AI-generated scripts: 30 minutes initial setup, 5 minutes/week ongoing = 150+ hours/year saved.

16.6.2 Workflow 2: Automating Reports with R Markdown

Scenario: Manual Word reports take 2 hours weekly. Want automated generation.

Create an R Markdown template with AI assistance:

Prompt: “Create R Markdown template for automated disease surveillance report with title and auto-generated date, executive summary with key metrics, tables for cases by region and demographic breakdown, figures showing trend plot and choropleth map, and recommendations section. Use parameters for: data_file, report_date, alert_threshold.”

AI generates a parameterized template that automatically generates reports from your data, complete with visualizations, summary statistics, and data-driven recommendations.

You can then automate distribution:

library(blastula)

email <- compose_email(
  body = md("Attached is this week's surveillance report..."),
  footer = md("*Automated report*")
)

# Note: smtp_send() also needs SMTP credentials (see blastula's credential helpers)
smtp_send(email,
  from = "surveillance@health.gov",
  to = c("director@health.gov", "epi-team@health.gov"),
  subject = paste("Weekly Surveillance Report", Sys.Date())
)

Result: 2-hour manual task → 5-minute automated process.

16.6.3 Workflow 3: Building Interactive Dashboards

Scenario: Need interactive dashboard for leadership to explore surveillance data.

Use R Shiny or Python Dash. Prompt AI to create dashboard skeleton:

“Create R Shiny dashboard for disease surveillance with sidebar containing date range selector, region filter, and metric dropdown. Main panel should show value boxes with key metrics, interactive line plot of cases over time, filterable data table, and map visualization. Data source: surveillance_data.csv with date, region, cases, population columns.”

AI generates complete dashboard code that you can customize, test locally, and deploy to hosting services like shinyapps.io or Heroku.
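For orientation, a stripped-down version of such a dashboard in Python Dash might look like this (a sketch, not the full feature set described in the prompt; column names follow the prompt):

import pandas as pd
import plotly.express as px
from dash import Dash, dcc, html, Input, Output

data = pd.read_csv("surveillance_data.csv", parse_dates=["date"])

app = Dash(__name__)
app.layout = html.Div([
    html.H1("Disease Surveillance Dashboard"),
    dcc.Dropdown(
        id="region-filter",
        options=[{"label": r, "value": r} for r in sorted(data["region"].unique())],
        placeholder="All regions",
    ),
    dcc.Graph(id="cases-over-time"),
])

@app.callback(Output("cases-over-time", "figure"), Input("region-filter", "value"))
def update_plot(region):
    # Filter to the selected region (or keep everything), then aggregate by week
    df = data if region is None else data[data["region"] == region]
    weekly = df.groupby(pd.Grouper(key="date", freq="W"))["cases"].sum().reset_index()
    return px.line(weekly, x="date", y="cases", title="Weekly reported cases")

if __name__ == "__main__":
    app.run(debug=True)  # use app.run_server() on older Dash releases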

16.6.4 Workflow 4: AI-Assisted Debugging

Scenario: Code produces cryptic error, can’t figure out the problem.

Step 1: Gather information

Your code throws an error. For example:

data <- read_csv("surveillance.csv")
result <- data %>% group_by(region) %>% summarize(avg_cases = mean(cases))

# Error: argument is not numeric or logical: returning NA

Step 2: Use AI to diagnose

Prompt to AI:

"I'm getting this error in R:
'Error in mean.default(cases) : argument is not numeric or logical: returning NA'

My code:
data <- read_csv('surveillance.csv')
result <- data %>% group_by(region) %>% summarize(avg_cases = mean(cases))

The 'cases' column appears as character (chr) instead of numeric.
How do I fix this?"

AI explains the issue (cases stored as character due to non-numeric values in data) and provides solutions: convert to numeric, specify column types when reading, or handle missing values explicitly.

Step 3: Implement fix, verify

# Apply fix
data <- read_csv('surveillance.csv') %>%
  mutate(cases = na_if(cases, '-')) %>%
  mutate(cases = as.numeric(cases))

# Verify
str(data$cases)  # Should show 'num' not 'chr'

16.7 Responsible Use of AI Coding Tools

16.7.1 When AI Assistance Is Appropriate

Good use cases:

Boilerplate and repetitive code: Data loading, error handling, logging, configuration files. High accuracy, low risk.

Exploratory prototyping: Quickly test approaches, generate multiple alternatives, learn new libraries or frameworks.

Code explanation: Understanding inherited code, learning new concepts, identifying potential issues in complex code.

Syntax and API reference: “How do I filter pandas DataFrame?”, “What’s the argument order for this function?”, “Remind me of the syntax for list comprehension.”

Test generation: Creating unit tests, generating test data, edge case identification.
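For instance, the pandas filtering question mentioned above usually comes down to a boolean-mask one-liner (file and column names hypothetical):

import pandas as pd

df = pd.read_csv("surveillance_data.csv", parse_dates=["date"])

# Keep rows from 2024 onward for a single region
subset = df[(df["date"] >= "2024-01-01") & (df["region"] == "North")]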

16.7.2 When to Avoid AI Assistance

Inappropriate use cases:

Critical calculations: Disease rate calculations, statistical analyses, risk assessments. Errors have serious consequences—verify manually.

Security-sensitive code: Authentication, data encryption, privacy protections. AI-generated code may have vulnerabilities.

Novel algorithms: Cutting-edge methods, research code, domain-specific logic. AI may not understand specialized requirements.

Learning fundamentals: If you’re trying to learn programming, over-reliance prevents understanding.

Proprietary or sensitive contexts: Code with business logic, classified information, or PHI. May violate privacy/IP policies.

16.7.3 Validation Strategies

Always validate AI-generated code:

1. Read and understand: Don’t accept code you don’t understand. If it’s not clear, ask AI to explain or simplify.

2. Test with known cases: Run code on data with known outcomes. Compare results to manual calculations.

3. Edge case testing: Test with missing values, zero values, extreme values, empty datasets, malformed inputs.

4. Code review: Have colleague review AI-generated code, especially for production use. Fresh eyes catch errors.

5. Documentation: Add comments explaining logic. If you can’t explain it, you don’t understand it well enough.

6. Version control: Commit AI-generated code separately with clear messages noting AI assistance.
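To make strategies 2 and 3 concrete, a few quick tests against the calculate_incidence function shown earlier in this chapter might look like this (a sketch using pytest; the incidence module name is hypothetical):

import pytest

from incidence import calculate_incidence  # hypothetical module holding the function


def test_known_case():
    # 50 cases in a population of 100,000 over one year -> 50 per 100,000
    assert calculate_incidence(50, 100_000, 1) == pytest.approx(50)


def test_zero_cases():
    assert calculate_incidence(0, 100_000, 1) == 0


def test_zero_population_raises():
    # Decide explicitly how the function should behave on a zero denominator
    with pytest.raises(ZeroDivisionError):
        calculate_incidence(5, 0, 1)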

16.7.4 Building Technical Competency

AI as teaching tool vs crutch:

Use AI to accelerate learning:
  • Ask for explanations alongside code
  • Request variations to understand patterns
  • Generate exercises and practice problems
  • Get feedback on your own code

Don’t let AI prevent learning:
  • Type code yourself rather than copy-pasting blindly
  • Modify AI suggestions to fit your specific needs
  • Attempt problems before asking AI
  • Use AI for hints, not complete solutions

Progressive skill building:

  • Beginner: AI generates most code, you modify and test
  • Intermediate: You write code, AI assists with specific tasks
  • Advanced: AI helps with unfamiliar libraries, you write logic
  • Expert: AI used for speed/exploration, you validate and architect

The goal is moving up this ladder, not staying at the bottom.


16.8 Choosing Your Coding Path

16.8.1 Should You Learn to Code?

Consider learning to code if:

  • You frequently work with data requiring custom analysis
  • You want independence from IT/developer teams
  • You enjoy problem-solving and logical thinking
  • You have time to invest in skill development (3-6 months to basic competency)
  • Your organization supports learning and practice

Consider alternatives if:

  • Your needs are one-time or very infrequent
  • Low-code tools meet your requirements
  • You have access to dedicated analyst/developer support
  • Time constraints prevent sustained learning
  • Your role focuses on other critical skills

16.8.2 Low-Code Alternatives

When low-code tools are sufficient:

Data visualization: Tableau, Power BI, Looker—create sophisticated dashboards without coding

Data analysis: Excel with Power Query, Google Sheets with Apps Script—handle many analyses with formulas and built-in tools

Workflow automation: Zapier, Make (formerly Integromat)—connect systems and automate tasks with visual workflows

Web apps: Retool, Bubble, Webflow—build interactive applications without traditional programming

Limitations: Less flexible than code, may hit capability ceiling, vendor lock-in, cost for advanced features, limited customization.

16.8.3 When to Hire Developers

Hire/consult developers for:

  • Complex systems: Multi-component applications, databases, APIs
  • Production deployment: User-facing tools requiring reliability, security, scalability
  • Specialized expertise: Machine learning model development, advanced statistical methods
  • Long-term maintenance: Systems requiring ongoing support and updates
  • Compliance requirements: HIPAA-compliant systems, validated software for regulatory submissions

Working effectively with developers:

Communicate requirements clearly: What problem are you solving, who are the users, what are the key features, what data is involved, what are success criteria?

Provide domain expertise: Developers know coding, you know public health. Collaboration produces best results.

Iterate and give feedback: Start with minimum viable product (MVP), test with real workflows, provide specific feedback, refine iteratively.

Learn enough to communicate: Basic understanding of technical concepts helps discussions. You don’t need to code, but understanding possibilities and constraints helps.


16.9 Summary

AI-assisted coding tools are transforming software development, making programming more accessible to public health practitioners. Modern development environments like VS Code, Jupyter, and RStudio, combined with AI assistants like GitHub Copilot and Cursor, enable analysts to accomplish sophisticated technical work with less training.

However, these tools require responsible use:
  • Validate AI-generated code thoroughly
  • Understand code before deploying
  • Use AI to accelerate learning, not replace it
  • Apply appropriate security for sensitive contexts
  • Choose the right tool for your use case

Version control with Git and GitHub provides professional workflows for reproducible analysis, enabling collaboration, backup, and audit trails critical for public health work.

Whether you choose to learn coding, use low-code tools, or hire developers depends on your specific needs, resources, and constraints. AI tools lower the barrier to entry, but thoughtful application and continued learning remain essential.


Check Your Understanding

Question 1: Development Environment Selection

You’re a public health analyst who primarily works with R for statistical analysis but occasionally needs Python for specific packages. You want AI coding assistance and don’t have a large budget. Which development environment would be most appropriate?

  1. Cursor ($20/month) for best AI integration
  2. RStudio + external ChatGPT (free)
  3. VS Code + GitHub Copilot ($10/month)
  4. Replit Ghostwriter ($25/month)

Answer: C) VS Code + GitHub Copilot ($10/month)

Rationale:
  • VS Code provides excellent multi-language support (both R and Python)
  • GitHub Copilot offers strong AI assistance at reasonable cost ($10/month vs $20 for Cursor or $25 for Replit)
  • More flexible than RStudio alone (which is R-specific)
  • Cursor would work but costs more
  • Replit is browser-only and costs more
  • RStudio + ChatGPT works but provides less integrated AI assistance

VS Code + Copilot offers the best balance of multi-language support, AI assistance, and cost for this use case.
Question 2: Git Workflow Safety

You’re working on a surveillance data analysis project in Git. Which of these practices is MOST important for safety and reproducibility?

  1. Committing data files to Git for backup
  2. Using descriptive commit messages like “Fix age-standardization calculation”
  3. Making commits at the end of each week
  4. Working directly on the main branch for simplicity

Answer: B) Using descriptive commit messages like “Fix age-standardization calculation”

Rationale:
  • A is WRONG: Never commit sensitive data (PHI/PII) to Git. Use .gitignore to exclude data files
  • B is CORRECT: Descriptive commit messages create an audit trail and enable understanding changes months later
  • C is WRONG: Should commit frequently (daily or after completing logical units of work), not weekly
  • D is WRONG: Should use branches for new features/analyses, protecting main branch from experimental changes

Clear commit messages are critical for reproducibility, collaboration, and compliance. They document what changed and why, enabling future understanding and potential rollback.

Additional best practices include:
  • Commit frequently (not just weekly)
  • Use branches for experimental work
  • Never commit sensitive data
  • Review changes before committing
Question 3: Appropriate AI Assistant Use

Which of these scenarios is MOST appropriate for relying heavily on AI-generated code without extensive manual review?

  1. Calculating age-adjusted disease rates for published report
  2. Creating data loading script to read CSV files
  3. Implementing patient data encryption system
  4. Developing novel statistical method for outbreak detection

Answer: B) Creating data loading script to read CSV files

Rationale:
  • A is INAPPROPRIATE: Disease rate calculations are critical for public health decisions. Errors could lead to wrong conclusions. Requires careful validation.
  • B is APPROPRIATE: Data loading is a routine, well-defined, low-risk task. Easy to test with known data. Errors are typically obvious and non-consequential.
  • C is INAPPROPRIATE: Security-sensitive code requires expert review. AI tools introduced vulnerabilities in roughly 40% of security-sensitive scenarios studied. Critical for HIPAA compliance.
  • D is INAPPROPRIATE: Novel methods require deep understanding and validation. AI trained on existing methods may not correctly implement new approaches.

Safe AI use principles:
  • High trust for routine/boilerplate tasks (data loading, plotting, file I/O)
  • Medium trust for standard analyses (test, validate with known cases)
  • Low trust for critical calculations (verify manually, peer review)
  • No trust for security or novel algorithms (expert review required)

Always validate AI code, but level of scrutiny should match consequences of errors.

16.10 Further Resources

16.10.1 Tools and Documentation

16.10.2 Learning Platforms

16.10.3 Public Health Specific

16.10.4 AI and Coding