8 Genomic Surveillance and Pathogen Analysis

Learning Objectives

This chapter examines genomic sequencing in outbreak investigation. You will learn to:

Construct phylogenetic trees revealing pathogen evolution and transmission chains
Implement variant detection systems identifying mutations of concern
Apply ML to predict antimicrobial resistance from genomic sequences
Use practical platforms (GISAID, Nextstrain) for genomic surveillance
Recognize genomic sampling biases in global monitoring
Navigate ethical tensions between data sharing and biosecurity
Integrate genomic and traditional epidemiology (SARS-CoV-2, tuberculosis)

Prerequisites: Disease Surveillance and Outbreak Detection, Just Enough AI to Be Dangerous, The Data Problem.

📋 Chapter Summary (TL;DR)

The Big Picture: The genomic revolution transformed outbreak response—SARS-CoV-2 genome sequenced in 10 days (vs. 15 years for 1918 flu), enabling diagnostic tests globally within weeks. Genomic surveillance answers questions traditional epidemiology cannot: transmission chains, variant evolution, antimicrobial resistance prediction, outbreak source identification.

The Speed of Science (SARS-CoV-2 Timeline):

Dec 31, 2019: WHO notified of “undiagnosed pneumonia” in Wuhan
Jan 10, 2020 (10 days): Complete genome publicly shared by Fudan University
Jan 13, 2020 (3 days): Thailand confirms first imported case using PCR from shared sequence
Jan 23, 2020: Nextstrain publishes first phylogenetic analysis showing human-to-human transmission
Contrast: 1918 influenza pandemic genome not fully sequenced until 2005 (87 years later)

Core Genomic Surveillance Capabilities:

Phylogenetic Analysis: Genetic trees reveal transmission chains and evolutionary relationships
- SNP-based distance: <5 SNPs = likely direct transmission
- Outbreak source identification: Distinguish new introduction vs. ongoing transmission
- Real-world: Nebraska Omicron cluster traced to Nigeria travel through phylogenetics
Variant Detection: Identify mutations of public health concern
- Alpha (B.1.1.7): Detected through UK’s COG-UK genomic surveillance (Dec 2020)
- Delta, Omicron: Rapid identification enabled targeted responses
- Spike protein mutations linked to transmissibility, immune escape, severity
- Platforms: GISAID for data sharing, Nextstrain for visualization, Pango for nomenclature
Antimicrobial Resistance (AMR) Prediction: ML predicts resistance from genome
- TB drug resistance: Genomic testing faster than phenotypic culture (weeks → days)
- Gonorrhea resistance surveillance: Tracking cephalosporin resistance spread
- Advantage: Predict resistance before culturing organism

Key Platforms & Tools:

GISAID: Global database for influenza and SARS-CoV-2 sequences (15+ million genomes). Enables real-time global collaboration
Nextstrain: Open-source phylogenetic visualization. Tracks pathogen evolution in real-time
Pango Lineages: Nomenclature system for SARS-CoV-2 variants (B.1.1.7, BA.1, etc.)
WHO Labels: Simplified names for public communication (Alpha, Delta, Omicron)

Critical Quality Metrics:

Coverage Depth: >30x depth for reliable variant calling. Low viral load samples sequence poorly
Genome Completeness: >90% genome coverage. Partial genomes limit phylogenetic accuracy
Ct Values: Samples with Ct >30 often fail sequencing (creates sampling bias toward sicker patients)

Genomic Sampling Biases (Equity Challenge):

Geographic Disparity: High-income countries sequence 100x more per capita than LMICs
Clinical Bias: Hospitalized patients over-represented (lower Ct values sequence better)
Result: Genomic surveillance reflects sequencing capacity, not true disease distribution
Example: South Africa detected Omicron first (strong surveillance), faced travel bans as “punishment” for transparency

Integration with Traditional Epidemiology:

Genomics answers “how/why” questions that epi answers “who/what/when/where”: - Traditional epi: “100 linked cases in County X” - Genomics: “These 100 cases form 3 distinct clusters—2 from separate introductions, 1 from ongoing transmission” - Synergy: Combining contact tracing + genomics strengthens both approaches

Ethical Tensions:

Data Sharing vs. Biosecurity: Open science enables rapid response but raises dual-use concerns (engineered pathogens)
Transparency Paradox: South Africa punished for detecting/sharing Omicron through travel bans. Disincentivizes transparency
Pathogen Access and Benefit Sharing (PABS): Who owns genomic data? How are benefits distributed?
Privacy: Can individuals be re-identified from pathogen genomes if linked to metadata?

When Genomic Surveillance Adds Value:

Use genomics when: - Investigating transmission chains and outbreak sources - Monitoring pathogen evolution and variant emergence - Predicting antimicrobial resistance - Differentiating reinfection vs. persistent infection

Traditional epi sufficient when: - Describing outbreak magnitude and trends - Identifying risk factors and exposures - Evaluating intervention effectiveness at population level

The Takeaway for Public Health Practitioners:

Genomic surveillance provides actionable intelligence traditional surveillance cannot—transmission directionality, variant tracking, AMR prediction. Platforms like GISAID democratized access, enabling real-time global collaboration during COVID-19. But genomic capacity remains concentrated in high-income settings, creating equity gaps in global surveillance. Quality metrics (coverage depth, Ct values) critical for interpreting results—sampling biases toward severe cases. Integration with traditional epidemiology strengthens both approaches. Ethical tensions between open science and biosecurity require navigation. Most importantly: genomic surveillance succeeds when transparent data sharing rewarded, not punished—South Africa’s Omicron experience shows perverse incentives undermining global health security.

8.1 Introduction: The Genomic Revolution in Public Health

December 31, 2019, Wuhan, China: ProMED-mail publishes a post about “undiagnosed pneumonia” in Wuhan. The WHO is notified. The world is on alert.

January 10, 2020 (10 days later): Professor Yong-Zhen Zhang at Fudan University publicly releases the complete genome sequence of the novel coronavirus on virological.org—before formal publication, before bureaucratic approval, simply sharing it openly with the world.

January 13, 2020 (3 days later): Thailand confirms the first imported case outside China—identified using PCR diagnostic tests designed directly from that shared genome sequence.

This was unprecedented. In past pandemics, months or years passed before pathogen identification. During the 1918 influenza pandemic, the virus wasn’t even isolated until 1933—15 years later. For SARS-CoV-2, the full genome was sequenced and shared globally within 10 days of initial reports.

February 2021, Kent, England:

Routine genomic surveillance by the COVID-19 Genomics UK Consortium (COG-UK) detects an unusual cluster of sequences. The lineage, designated B.1.1.7 (later renamed Alpha), contains 17 mutations—an unprecedented number for SARS-CoV-2.

More concerning: the variant is spreading 50-70% faster than existing strains. Within the spike protein, the N501Y mutation enhances ACE2 receptor binding. The deletion at positions 69-70 causes S-gene target failure in some PCR tests—an unintended diagnostic marker.

Within weeks: Rambaut et al., 2020 publish detailed phylogenetic analysis. The variant is detected in 50+ countries. Genomic data guides public health responses, travel restrictions, and urgent vaccine effectiveness studies.

May 2021, India:

The Delta variant (B.1.617.2) emerges with the L452R mutation—later shown to increase transmissibility and cause modest immune evasion. Genomic surveillance tracks its rapid global spread. It becomes dominant worldwide within months.

November 2021, Botswana/South Africa:

Omicron (B.1.1.529) is detected with an unprecedented 30+ spike protein mutations. South African scientists immediately share sequence data via GISAID. Global genomic surveillance confirms rapid spread. Within weeks, the variant is detected on every continent.

December 2021, Nebraska, USA:

One of the first U.S. Omicron outbreak investigations used genomic surveillance to trace a household cluster of 6 cases linked to international travel from Nigeria. Tegomoh et al., 2021, MMWR documented a shorter incubation period (~3 days) and milder symptoms compared to previous variants—early evidence that would later be confirmed globally. This rapid investigation demonstrated how genomic sequencing enables real-time characterization of emerging variants.

The paradox: South Africa, which has among the world’s strongest genomic surveillance systems, was blamed for “causing” Omicron and faced immediate travel bans—despite simply being first to detect and share the data. This created a dangerous precedent: countries that do good surveillance get punished for transparency.

8.1.1 Why Genomic Surveillance Matters

Traditional surveillance tells you: - “There’s an outbreak of respiratory illness” - “Case count is increasing” - “Hospitalizations are up”

Genomic surveillance tells you: - “It’s SARS-CoV-2, Delta variant (B.1.617.2), sub-lineage AY.4.2” - “The outbreak involves two separate introductions (phylogenetically distinct clusters)” - “Case A likely infected Cases B, C, and D based on genetic similarity and timing” - “A new mutation (E484K) emerged that may affect antibody binding” - “This variant is 40% more transmissible based on growth rate analysis”

The difference is actionable intelligence:

1. Attribution: Is this a new introduction or ongoing transmission?
2. Transmission chains: Who infected whom? Where are transmission hotspots?
3. Variants: Is this a new, more dangerous strain?
4. Outbreak linkage: Are these cases connected or separate introductions?
5. Drug resistance: Will antibiotics, antivirals, or vaccines work?
6. Public health response: Targeted interventions vs. population-wide measures

8.2 The Foundations: What Is Genomic Surveillance?

8.2.1 From DNA to Disease: The Central Dogma

All living organisms and many pathogens (viruses, bacteria, parasites, fungi) store genetic information in DNA or RNA:

DNA → (transcription) → RNA → (translation) → Protein → Phenotype

For pathogens, mutations in genetic sequences can lead to:

Increased transmissibility: - SARS-CoV-2 N501Y mutation: ~50% increase in transmission - Influenza H274Y mutation: No fitness cost, enables oseltamivir resistance

Immune evasion: - Omicron’s 30+ spike mutations: Escape from prior infection and vaccination - HIV env gene variation: Constant immune escape

Drug resistance: - M. tuberculosis rpoB mutations: Rifampicin resistance - S. aureus mecA gene: Methicillin resistance (MRSA) - Plasmodium falciparum kelch13 mutations: Artemisinin resistance

Virulence changes: - Y. pestis loss of Pla protease: Reduced virulence - E. coli O157:H7 Shiga toxin acquisition: Increased pathogenicity

The Speed of Evolution

RNA viruses (influenza, SARS-CoV-2, HIV): - Mutation rate: ~10⁻⁶ to 10⁻⁴ per nucleotide per replication - Fast evolution = constant adaptation

DNA viruses (smallpox, herpes): - Mutation rate: ~10⁻⁸ per nucleotide per replication - Slower evolution = more stable targets for vaccines

Bacteria: - Mutation rate: ~10⁻⁹ per nucleotide per generation - But horizontal gene transfer (acquiring resistance genes from other bacteria) enables rapid adaptation

Why this matters: RNA viruses require continuous surveillance; bacterial resistance can spread suddenly via plasmid transfer.

For detailed evolutionary rates, see Sanjuán et al., 2010, Journal of Virology.

8.2.2 Key Concepts in Genomic Epidemiology

8.2.2.1 1. Single Nucleotide Polymorphisms (SNPs)

The smallest unit of genetic variation:

Reference genome:  ...ATCGATCGATCG...
Variant 1:         ...ATCGATCGATCG...  (0 SNPs - identical)
Variant 2:         ...ATCAATCGATCG...  (1 SNP: G→A at position 4)
Variant 3:         ...GTCAATCGATCG...  (2 SNPs: A→G at pos 1, G→A at pos 4)

SNP distance (Hamming distance) measures genetic similarity: - 0-2 SNPs: Likely direct transmission or very recent common ancestor - 3-10 SNPs: Possibly linked through intermediates - >10 SNPs: Likely unrelated (for pathogens with slow mutation rates)

Critical caveat: SNP thresholds vary by pathogen! - SARS-CoV-2: ~2 mutations/month/lineage (0.0007 mutations/site/year) - Influenza: ~10 mutations/year in HA gene - M. tuberculosis: ~0.5 SNPs/genome/year (extremely slow) - HIV: Hypervariable; can vary within a single patient

8.2.2.2 2. Lineages, Clades, and Variants

Hierarchical classification of related genomes:

Ancestral SARS-CoV-2 (Wuhan-Hu-1)
│
├─ A lineage
├─ B lineage
   ├─ B.1
   │  ├─ B.1.1.7 (Alpha)
   │  ├─ B.1.351 (Beta)
   │  └─ B.1.617.2 (Delta)
   │     ├─ AY.1
   │     ├─ AY.2
   │     └─ AY.4.2
   └─ B.1.1.529 (Omicron)
      ├─ BA.1
      ├─ BA.2
      ├─ BA.4
      └─ BA.5
         └─ BQ.1.1

Nomenclature systems:

Pango lineages (cov-lineages.org): - Dynamic nomenclature for SARS-CoV-2 - B.1.1.7, B.1.617.2, etc. - Updated as virus evolves

WHO labels (WHO naming system): - Simplified names for public communication using Greek alphabet (Alpha, Beta, Delta, Omicron) - Avoids geographic stigma - Reserved for Variants of Concern (VOCs) - Continuously updated at WHO variant tracking

Nextstrain clades (Nextstrain clade naming): - Time-based naming system (19A, 20A, 21A, etc.) based on year and sequential letters - Integrates with phylogenetic analysis - New clades defined by ≥2 mutations from parent clade

For the full Pango system, see Rambaut et al., 2020, Nature Microbiology.

8.2.2.3 3. Phylogenetic Trees

Visual representation of evolutionary relationships:

                    ┌─ Sequence_A
              ┌─────┤
              │     └─ Sequence_B
    ┌─────────┤
    │         └─────── Sequence_C
────┤
    │         ┌─────── Sequence_D
    └─────────┤
              └─────── Sequence_E
    
    ↑         ↑        ↑
   Root    Internal   Tips
          node      (leaves)

Key components:

Branch length: Represents evolutionary distance (number of mutations)
Topology: Shows who’s related to whom
Root: Most recent common ancestor (MRCA)
Tips (leaves): Observed sequences
Internal nodes: Inferred ancestral sequences

Types of trees:

Phylogram: Branch lengths proportional to mutations
Chronogram: Branch lengths proportional to time
Cladogram: Shows relationships only (no branch length information)

For comprehensive phylogenetics introduction, see Felsenstein, 2004, Inferring Phylogenies.

8.2.2.4 4. Mutations of Concern

Not all mutations are equal:

Synonymous (silent) mutations: - Change DNA sequence but not amino acid - Example: CTT → CTC (both code for leucine) - Usually neutral, but can affect RNA stability

Non-synonymous (missense) mutations: - Change amino acid - Example: GAT → AAT (aspartate → asparagine) - May alter protein function

Nonsense mutations: - Create stop codon - Truncate protein (usually deleterious)

Insertions/Deletions (indels): - Can cause frameshift - Dramatically alter downstream protein sequence

For SARS-CoV-2, mutations in the spike protein are critical:

Mutation	Location	Effect
D614G	Outside RBD	Increased transmissibility (~50%)
N501Y	RBD	Enhanced ACE2 binding, increased transmissibility
E484K/A	RBD	Immune evasion (antibody escape)
L452R	RBD	Immune evasion + increased transmissibility
K417N/T	RBD	Reduced antibody binding
P681H/R	Furin site	Enhanced spike cleavage, increased entry

For detailed functional analysis, see Harvey et al., 2021, Nature Reviews Microbiology.

8.3 Genomic Data: Collection, Quality, and Biases

Before analyzing genomes, we must understand how data is collected and its inherent limitations.

8.3.1 The Sequencing Process

1. Sample Collection - Nasopharyngeal swabs (respiratory viruses) - Blood samples (bloodborne pathogens, bacteria) - Stool samples (enteric pathogens) - Wastewater (community-level surveillance)

2. Nucleic Acid Extraction - RNA extraction (SARS-CoV-2, influenza, HIV) - DNA extraction (bacteria, DNA viruses) - Quality depends on sample quality and viral load

3. Library Preparation - Amplicon-based (ARTIC for SARS-CoV-2): Fast, cheap, works with low viral load - Metagenomic: No prior knowledge needed, finds unexpected pathogens - Targeted capture: Enriches specific pathogen sequences

4. Sequencing - Illumina: High accuracy, moderate cost, 1-3 days - Oxford Nanopore: Real-time, portable, lower accuracy, hours - PacBio: Long reads, high accuracy, expensive

5. Bioinformatics Pipeline - Quality control (remove low-quality reads) - Alignment to reference genome - Variant calling (identify SNPs, indels) - Consensus sequence generation - Lineage assignment (Pangolin for SARS-CoV-2)

Typical timeline: Sample to sequence = 2-5 days
Cost: $50-300 per sample (decreasing rapidly)

8.3.2 Quality Metrics and Limitations

Key quality indicators:

Sequencing depth (coverage): - >100x per position: High confidence - 10-100x: Acceptable for most purposes - <10x: Low confidence, may miss variants

Genome completeness: - >95%: Suitable for phylogenetics - 80-95%: Marginal; some analyses possible - <80%: Poor quality; limited use

Ct value (cycle threshold) from PCR: - <25: High viral load, easy to sequence - 25-30: Moderate viral load, usually sequences well - >30: Low viral load, may fail sequencing or produce incomplete genomes

The Low Viral Load Problem

Challenge: Patients with low viral load (high Ct values) are: - Often sequenced less successfully (incomplete genomes) - May be at different disease stages (early infection, late recovery) - Could have different variant distributions

This creates bias: We sequence sickest patients (high viral load) more successfully than mild cases.

Impact on surveillance: May miss variants that cause milder disease.

For discussion of sequencing biases, see Grubaugh et al., 2019, Nature Microbiology.

8.3.3 Global Surveillance Infrastructure: GISAID

GISAID (Global Initiative on Sharing All Influenza Data) is the primary platform for sharing pathogen genomes.

Key features:

Rapid data sharing: - SARS-CoV-2: >16 million sequences (as of 2024) - Influenza: >5 million sequences - Updates daily

Access control: - Free for public health and research use - Requires registration (prevents misuse) - Submitters retain rights (encourages sharing)

Quality standards: - Automated quality checks - Lineage assignment - Metadata requirements (date, location)

Attribution: - Proper citation of data contributors - Prevents “parasitic” publications

Impact: - Enabled real-time COVID-19 tracking globally - Detected variants within days of emergence - Guided vaccine development and policy

Accessing GISAID Data

For public health practitioners: 1. Register at GISAID.org 2. Agree to terms (no redistribution, proper citation) 3. Access: - Web interface for browsing - Batch download for analysis - API for programmatic access

Data available: - Complete genomes (FASTA format) - Metadata (collection date, location, lineage, patient demographics) - Submission acknowledgments (proper attribution)

Key tools: - EpiCoV (SARS-CoV-2) - EpiFlu (Influenza) - EpiPox (Monkeypox) - EpiRSV (RSV)

8.4 Building Your Genomic Surveillance Pipeline

8.4.1 Example 1: Loading and Exploring Genomic Surveillance Data

Scenario: You’re a genomic epidemiologist at a state health department. Your lab has sequenced 500 SARS-CoV-2 samples over the past three months. Your first task: understand what variants are circulating and how they’re changing over time.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from datetime import datetime, timedelta
import warnings
warnings.filterwarnings('ignore')

# Simulate realistic genomic surveillance data
# In practice, you'd load from GISAID or your local database
np.random.seed(42)
n_samples = 500

# Generate sample metadata
start_date = datetime(2024, 1, 1)
dates = [start_date + timedelta(days=int(x)) for x in np.random.exponential(20, n_samples)]
dates = sorted(dates)

# Geographic distribution
counties = np.random.choice(
    ['County_A', 'County_B', 'County_C', 'County_D', 'County_E'], 
    n_samples,
    p=[0.4, 0.25, 0.15, 0.12, 0.08]  # Unequal sampling
)

# Demographics
ages = np.random.normal(45, 22, n_samples).clip(0, 95).astype(int)
sex = np.random.choice(['M', 'F'], n_samples)

# Clinical data
ct_values = np.random.uniform(15, 35, n_samples)  # Cycle threshold (viral load proxy)
hospitalized = (ct_values < 25) & (np.random.random(n_samples) < 0.3)  # High viral load more likely hospitalized

# Variant distribution (evolving over time)
# Early period: JN.1 emerging, XBB.1.5 dominant
# Later period: JN.1 takeover
def assign_variant(date, ct):
    days_since_start = (date - start_date).days
    
    # JN.1 growth advantage
    jn1_prob = 0.05 + (days_since_start / 100) * 0.6  # Increases over time
    jn1_prob = min(jn1_prob, 0.70)
    
    # Lower viral load slightly increases chance of JN.1 (immune evasion)
    if ct > 28:
        jn1_prob += 0.1
    
    rand = np.random.random()
    if rand < jn1_prob:
        return 'JN.1'
    elif rand < jn1_prob + 0.25:
        return 'XBB.1.5'
    elif rand < jn1_prob + 0.40:
        return 'EG.5'
    else:
        return 'BA.2.86'

variants = [assign_variant(date, ct) for date, ct in zip(dates, ct_values)]

# Sequence quality metrics
coverage_depth = np.random.lognormal(4.5, 0.5, n_samples).clip(10, 5000)  # Sequencing depth
genome_completeness = 100 - (1000 / coverage_depth) - np.random.uniform(0, 3, n_samples)  # % genome covered
genome_completeness = genome_completeness.clip(70, 100)

# Create DataFrame
genomic_data = pd.DataFrame({
    'sample_id': [f'SAM{i:04d}' for i in range(n_samples)],
    'collection_date': dates,
    'county': counties,
    'age': ages,
    'sex': sex,
    'ct_value': ct_values,
    'hospitalized': hospitalized,
    'variant': variants,
    'coverage_depth': coverage_depth,
    'genome_completeness': genome_completeness,
})

# Sequence quality flag
genomic_data['high_quality'] = (
    (genomic_data['genome_completeness'] > 95) &
    (genomic_data['coverage_depth'] > 100)
)

print("="*70)
print("GENOMIC SURVEILLANCE DATASET SUMMARY")
print("="*70)
print(f"\nTotal samples sequenced: {len(genomic_data)}")
print(f"Date range: {genomic_data['collection_date'].min().date()} to {genomic_data['collection_date'].max().date()}")
print(f"Duration: {(genomic_data['collection_date'].max() - genomic_data['collection_date'].min()).days} days")

print(f"\n{'Variant Distribution':40s}")
print("-"*70)
variant_counts = genomic_data['variant'].value_counts()
for variant, count in variant_counts.items():
    pct = count / len(genomic_data) * 100
    print(f"{variant:15s}: {count:4d} samples ({pct:5.1f}%)")

print(f"\n{'Geographic Distribution':40s}")
print("-"*70)
county_counts = genomic_data['county'].value_counts()
for county, count in county_counts.items():
    pct = count / len(genomic_data) * 100
    print(f"{county:15s}: {count:4d} samples ({pct:5.1f}%)")

print(f"\n{'Sequence Quality':40s}")
print("-"*70)
print(f"High quality (>95% complete, >100x depth): {genomic_data['high_quality'].sum()} ({genomic_data['high_quality'].mean()*100:.1f}%)")
print(f"Mean coverage depth: {genomic_data['coverage_depth'].mean():.0f}x")
print(f"Mean genome completeness: {genomic_data['genome_completeness'].mean():.1f}%")

print(f"\n{'Clinical Characteristics':40s}")
print("-"*70)
print(f"Hospitalized: {genomic_data['hospitalized'].sum()} ({genomic_data['hospitalized'].mean()*100:.1f}%)")
print(f"Mean Ct value: {genomic_data['ct_value'].mean():.1f} (lower = higher viral load)")
print(f"Mean age: {genomic_data['age'].mean():.1f} years")

print(f"\n{'Sample of data':40s}")
print("-"*70)
print(genomic_data.head(10).to_string(index=False))

# Visualize sample distribution over time
fig, axes = plt.subplots(2, 1, figsize=(14, 10))

# Daily sequencing volume
genomic_data['date_only'] = genomic_data['collection_date'].dt.date
daily_counts = genomic_data.groupby('date_only').size()

axes[0].bar(daily_counts.index, daily_counts.values, alpha=0.7, color='steelblue', edgecolor='black')
axes[0].set_xlabel('Collection Date', fontsize=12)
axes[0].set_ylabel('Number of Sequences', fontsize=12)
axes[0].set_title('Daily Sequencing Volume', fontsize=14, fontweight='bold')
axes[0].grid(True, alpha=0.3, axis='y')
axes[0].axhline(daily_counts.mean(), color='red', linestyle='--', 
                linewidth=2, label=f'Mean: {daily_counts.mean():.1f} samples/day')
axes[0].legend()

# Sequence quality over time
weekly_quality = genomic_data.set_index('collection_date').resample('W').agg({
    'genome_completeness': 'mean',
    'coverage_depth': 'mean',
    'high_quality': 'mean'
})

ax2 = axes[1]
ax2_twin = ax2.twinx()

ax2.plot(weekly_quality.index, weekly_quality['genome_completeness'], 
         'o-', color='green', linewidth=2, markersize=6, label='Genome Completeness (%)')
ax2_twin.plot(weekly_quality.index, weekly_quality['coverage_depth'],
              's-', color='purple', linewidth=2, markersize=6, label='Coverage Depth (x)')

ax2.set_xlabel('Week', fontsize=12)
ax2.set_ylabel('Genome Completeness (%)', fontsize=12, color='green')
ax2_twin.set_ylabel('Coverage Depth', fontsize=12, color='purple')
ax2.set_title('Sequence Quality Metrics Over Time', fontsize=14, fontweight='bold')
ax2.grid(True, alpha=0.3)
ax2.legend(loc='upper left')
ax2_twin.legend(loc='upper right')

plt.tight_layout()
plt.savefig('genomic_surveillance_overview.png', dpi=300, bbox_inches='tight')
plt.show()

8.4.2 Example 2: Tracking Variant Emergence and Spread

Key question: Is a new variant taking over? How fast is it spreading?

# Calculate variant frequencies by week
genomic_data['week'] = genomic_data['collection_date'].dt.to_period('W')
variant_by_week = genomic_data.groupby(['week', 'variant']).size().reset_index(name='count')
variant_by_week['total'] = variant_by_week.groupby('week')['count'].transform('sum')
variant_by_week['frequency'] = variant_by_week['count'] / variant_by_week['total']

# Convert Period to timestamp for plotting
variant_by_week['week_start'] = variant_by_week['week'].dt.to_timestamp()

# Visualize variant dynamics
fig, axes = plt.subplots(2, 1, figsize=(14, 12))

# Top: Stacked area plot
variant_pivot = variant_by_week.pivot(index='week_start', columns='variant', values='frequency').fillna(0)
axes[0].stackplot(variant_pivot.index, 
                  *[variant_pivot[col] for col in variant_pivot.columns],
                  labels=variant_pivot.columns,
                  alpha=0.8)
axes[0].set_ylabel('Proportion of Sequences', fontsize=12)
axes[0].set_title('Variant Frequencies Over Time (Stacked)', fontsize=14, fontweight='bold')
axes[0].legend(loc='upper left', fontsize=10)
axes[0].set_ylim(0, 1)
axes[0].grid(True, alpha=0.3, axis='y')

# Bottom: Individual variant trajectories
for variant in variant_by_week['variant'].unique():
    data = variant_by_week[variant_by_week['variant'] == variant]
    axes[1].plot(data['week_start'], data['frequency'], 
                marker='o', linewidth=2, markersize=6, label=variant, alpha=0.8)

axes[1].set_xlabel('Week', fontsize=12)
axes[1].set_ylabel('Proportion of Sequences', fontsize=12)
axes[1].set_title('Individual Variant Trajectories', fontsize=14, fontweight='bold')
axes[1].legend(fontsize=10, loc='best')
axes[1].set_ylim(0, 1)
axes[1].grid(True, alpha=0.3)

plt.tight_layout()
plt.savefig('variant_dynamics.png', dpi=300, bbox_inches='tight')
plt.show()

# Statistical analysis: Growth rate estimation
print("\n" + "="*70)
print("VARIANT GROWTH ANALYSIS")
print("="*70)

# Compare first vs. last month
first_month = genomic_data[genomic_data['collection_date'] < start_date + timedelta(days=30)]
last_month = genomic_data[genomic_data['collection_date'] >= genomic_data['collection_date'].max() - timedelta(days=30)]

for variant in genomic_data['variant'].unique():
    freq_early = (first_month['variant'] == variant).mean()
    freq_late = (last_month['variant'] == variant).mean()
    
    change = freq_late - freq_early
    fold_change = freq_late / freq_early if freq_early > 0 else float('inf')
    
    # Estimate daily growth advantage
    days = (last_month['collection_date'].mean() - first_month['collection_date'].mean()).days
    if freq_early > 0 and freq_late > 0:
        growth_rate = (np.log(freq_late) - np.log(freq_early)) / days
        doubling_time = np.log(2) / growth_rate if growth_rate > 0 else float('inf')
    else:
        growth_rate = None
        doubling_time = None
    
    status = "📈 INCREASING" if change > 0.05 else "📉 DECREASING" if change < -0.05 else "→ STABLE"
    
    print(f"\n{variant}:")
    print(f"  First month frequency: {freq_early:.1%}")
    print(f"  Last month frequency: {freq_late:.1%}")
    print(f"  Absolute change: {change:+.1%}")
    print(f"  Fold change: {fold_change:.2f}x")
    
    if growth_rate is not None and growth_rate > 0:
        print(f"  Growth rate: {growth_rate:.4f} per day ({growth_rate*7:.2%} per week)")
        print(f"  Doubling time: {doubling_time:.1f} days")
    
    print(f"  Status: {status}")

# Project future frequencies (simple exponential model)
print("\n" + "="*70)
print("PROJECTION (4 weeks ahead, assuming constant growth rates)")
print("="*70)
print("⚠️  WARNING: Assumes no new variants, no interventions, constant growth")

for variant in genomic_data['variant'].unique():
    current_freq = (last_month['variant'] == variant).mean()
    
    # Estimate growth rate from last 60 days
    recent_data = genomic_data[genomic_data['collection_date'] >= genomic_data['collection_date'].max() - timedelta(days=60)]
    recent_data['days_from_end'] = (recent_data['collection_date'].max() - recent_data['collection_date']).dt.days
    
    # Simple exponential fit
    recent_variant_freq = recent_data.groupby('days_from_end').apply(
        lambda x: (x['variant'] == variant).mean()
    ).reset_index(name='frequency')
    
    if len(recent_variant_freq) > 5 and recent_variant_freq['frequency'].max() > 0.01:
        # Fit log-linear model
        from scipy.stats import linregress
        
        # Remove zeros for log
        nonzero = recent_variant_freq[recent_variant_freq['frequency'] > 0].copy()
        
        if len(nonzero) > 3:
            nonzero['log_freq'] = np.log(nonzero['frequency'])
            slope, intercept, r_value, _, _ = linregress(-nonzero['days_from_end'], nonzero['log_freq'])
            
            # Project 28 days ahead
            projected_log_freq = intercept + slope * 28
            projected_freq = np.exp(projected_log_freq)
            projected_freq = min(projected_freq, 0.99)  # Cap at 99%
            
            print(f"{variant}: {current_freq:.1%} → {projected_freq:.1%} (R²={r_value**2:.2f})")

Interpreting Variant Growth Rates

Growth advantage = How much faster one variant spreads than another

Key metrics:

Absolute frequency change: Simple difference (JN.1: 10% → 40% = +30%)
Fold change: Ratio (JN.1: 10% → 40% = 4x increase)
Growth rate (r): Daily exponential growth ($r$ per day)
Doubling time: $\ln(2) / r$ (how many days to double frequency)
Generation time advantage: How much more transmissible

Example: - Variant A: Doubling time = 10 days - Variant B: Doubling time = 7 days - Variant B has ~40% transmission advantage ($R_{relative} \approx$ 1.4)

Caveats: - Assumes no interventions (lockdowns, behavior changes) - Assumes no new variants emerge - Geographic structure matters (variant may saturate locally) - Immunity levels affect growth

For rigorous methods, see Volz et al., 2021, Cell on Alpha variant growth analysis.

8.5 Phylogenetic Analysis: Reconstructing Transmission

Phylogenetics allows us to infer who infected whom without contact tracing data.

8.5.1 Example 3: Building Phylogenetic Trees for Outbreak Investigation

Scenario: A nursing home reports 9 COVID-19 cases. Traditional contact tracing is difficult (asymptomatic transmission, poor recall). You sequence all 9 cases plus 2 recent community cases to determine: Are these linked? One introduction or multiple?

Real-World Example: Omicron Household Cluster

A similar approach was used in Nebraska’s first Omicron investigation where genomic sequencing of a household cluster of 6 cases linked to international travel enabled rapid characterization of the variant’s transmission dynamics, incubation period, and clinical presentation. This investigation provided some of the earliest U.S. data on Omicron’s characteristics.

from scipy.cluster.hierarchy import dendrogram, linkage
from scipy.spatial.distance import squareform
import numpy as np
import matplotlib.pyplot as plt

# Simulate genomic distances (SNP differences)
# In reality, you'd align sequences with MAFFT and calculate pairwise distances

sequences = {
    'Resident_1': 0,  # Reference
    'Resident_2': 0,  # Identical to Resident_1
    'Resident_3': 1,  # 1 SNP different from cluster 1
    'Resident_4': 2,  # 2 SNPs different from cluster 1
    'Resident_5': 12, # Different cluster
    'Resident_6': 13, # Related to Resident_5
    'Resident_7': 14, # Related to Residents 5-6
    'Resident_8': 13, # Related to cluster 2
    'Resident_9': 14, # Related to cluster 2
    'Community_A': 1, # Matches cluster 1
    'Community_B': 12, # Matches cluster 2
}

# Create distance matrix
n = len(sequences)
names = list(sequences.keys())
distance_matrix = np.zeros((n, n))

# Calculate pairwise distances (simplified)
for i in range(n):
    for j in range(n):
        if i != j:
            distance_matrix[i, j] = abs(sequences[names[i]] - sequences[names[j]])

# Display distance matrix
distance_df = pd.DataFrame(
    distance_matrix,
    index=names,
    columns=names
)

print("="*70)
print("PAIRWISE SNP DISTANCE MATRIX")
print("="*70)
print(distance_df.to_string())

# Identify clusters (using simple threshold)
print("\n" + "="*70)
print("CLUSTER ANALYSIS")
print("="*70)

threshold = 3  # <3 SNPs = likely linked
print(f"Threshold for linkage: <{threshold} SNPs\n")

clusters = {}
cluster_id = 1

for i, name_i in enumerate(names):
    if any(name_i in cluster for cluster in clusters.values()):
        continue  # Already assigned
    
    cluster_members = [name_i]
    
    for j, name_j in enumerate(names):
        if i != j and name_j not in cluster_members:
            if distance_matrix[i, j] < threshold:
                cluster_members.append(name_j)
    
    if len(cluster_members) > 1:
        clusters[f'Cluster_{cluster_id}'] = cluster_members
        cluster_id += 1

for cluster_name, members in clusters.items():
    print(f"{cluster_name}:")
    for member in members:
        resident_or_community = "👤 Resident" if member.startswith('Resident') else "🏙️  Community"
        print(f"  - {member:15s} {resident_or_community}")
    print()

# Interpretation
print("="*70)
print("PUBLIC HEALTH INTERPRETATION")
print("="*70)

print(f"\nNumber of distinct introductions: {len(clusters)}")
print("\nCluster 1 (Residents 1-4 + Community_A):")
print("  • 5 cases, 0-2 SNPs apart")
print("  • Most likely source: Community_A")
print("  • Inference: Single introduction from community → spread within facility")
print("\nCluster 2 (Residents 5-9 + Community_B):")
print("  • 6 cases, 0-2 SNPs apart within cluster")
print("  • Most likely source: Community_B")
print("  • Inference: Separate introduction → independent transmission chain")
print("\nConclusion:")
print("  🔴 TWO independent introductions")
print("  🔴 Both resulted in facility transmission")
print("  🔴 Ongoing transmission risk")
print("\nRecommended actions:")
print("  1. Enhanced infection control (both clusters active)")
print("  2. Investigate entry points (staff? visitors?)")
print("  3. Cohort cases by cluster to prevent mixing")
print("  4. Increased testing frequency (find additional cases early)")

# Hierarchical clustering visualization
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(16, 6))

# Convert to condensed distance matrix for linkage
from scipy.spatial.distance import squareform
condensed_dist = squareform(distance_matrix)

# Perform hierarchical clustering
linkage_matrix = linkage(condensed_dist, method='average')

# Dendrogram
dendrogram(linkage_matrix, labels=names, ax=ax1, leaf_font_size=10,
           color_threshold=threshold, above_threshold_color='gray')
ax1.axhline(y=threshold, color='red', linestyle='--', linewidth=2,
            label=f'Linkage threshold ({threshold} SNPs)')
ax1.set_xlabel('Sample', fontsize=12)
ax1.set_ylabel('SNP Distance', fontsize=12)
ax1.set_title('Phylogenetic Tree (Dendrogram)', fontsize=14, fontweight='bold')
ax1.legend()
ax1.grid(True, alpha=0.3, axis='y')

# Heatmap of distances
im = ax2.imshow(distance_matrix, cmap='YlOrRd', aspect='auto')
ax2.set_xticks(range(n))
ax2.set_yticks(range(n))
ax2.set_xticklabels(names, rotation=45, ha='right')
ax2.set_yticklabels(names)
ax2.set_title('SNP Distance Heatmap', fontsize=14, fontweight='bold')

# Add text annotations
for i in range(n):
    for j in range(n):
        text = ax2.text(j, i, f'{int(distance_matrix[i, j])}',
                       ha="center", va="center", color="black" if distance_matrix[i, j] < 8 else "white",
                       fontsize=8)

plt.colorbar(im, ax=ax2, label='SNP Distance')
plt.tight_layout()
plt.savefig('phylogenetic_outbreak_analysis.png', dpi=300, bbox_inches='tight')
plt.show()

SNP Thresholds for Transmission Linkage

How close is “close enough” to infer transmission?

SARS-CoV-2: - 0-2 SNPs: Likely direct transmission or very recent common ancestor - 3-5 SNPs: Possibly linked through 1-2 intermediates - >5 SNPs: Likely unrelated or distantly related

Mycobacterium tuberculosis: - 0-5 SNPs: Likely recent transmission (<3 years) - 6-12 SNPs: Possible transmission (>3 years ago) - >12 SNPs: Likely not recently transmitted

Influenza: - 0-3 nucleotides: Direct transmission likely - 4-10 nucleotides: Possibly linked - >10 nucleotides: Likely independent

Critical consideration: Thresholds depend on: - Mutation rate of pathogen - Time between samples - Within-host diversity - Sampling intensity

For TB-specific thresholds, see Walker et al., 2013, Nature Genetics.

For SARS-CoV-2, see Aggarwal et al., 2022, Science.

8.5.2 Example 4: Mutation Detection and Variant Characterization

Scenario: Your lab sequences a sample with unusual mutations in the spike protein. You need to identify which mutations are present and assess their significance.

# SARS-CoV-2 spike protein reference (Wuhan-Hu-1)
# Simplified: focusing on key positions only

reference_spike = {
    'position': [417, 452, 478, 484, 501, 614, 681, 950],
    'reference_aa': ['K', 'L', 'T', 'E', 'N', 'D', 'P', 'N'],
    'domain': ['RBD', 'RBD', 'RBD', 'RBD', 'RBD', 'Outside RBD', 'Furin site', 'Other'],
}

# New sample with multiple mutations
sample_spike = {
    'position': [417, 452, 478, 484, 501, 614, 681, 950],
    'sample_aa': ['N', 'R', 'K', 'A', 'Y', 'G', 'H', 'N'],
    'domain': ['RBD', 'RBD', 'RBD', 'RBD', 'RBD', 'Outside RBD', 'Furin site', 'Other'],
}

# Detect mutations
mutations = []
for i in range(len(reference_spike['position'])):
    ref_aa = reference_spike['reference_aa'][i]
    sample_aa = sample_spike['sample_aa'][i]
    pos = reference_spike['position'][i]
    domain = reference_spike['domain'][i]
    
    if ref_aa != sample_aa:
        mutation_name = f"{ref_aa}{pos}{sample_aa}"
        mutations.append({
            'mutation': mutation_name,
            'position': pos,
            'domain': domain,
            'reference': ref_aa,
            'sample': sample_aa
        })

mutations_df = pd.DataFrame(mutations)

print("="*70)
print("DETECTED SPIKE PROTEIN MUTATIONS")
print("="*70)
print(mutations_df.to_string(index=False))

# Known mutations of concern (from literature)
mutations_of_concern = {
    'K417N': {
        'effect': 'Reduces antibody binding',
        'variants': 'Beta, Omicron BA.1',
        'severity': 'HIGH',
        'evidence': 'https://www.nature.com/articles/s41586-021-03398-2'
    },
    'K417T': {
        'effect': 'Reduces antibody binding',
        'variants': 'Omicron BA.2',
        'severity': 'HIGH',
        'evidence': 'Nature (2022)'
    },
    'L452R': {
        'effect': 'Increases transmissibility, immune evasion',
        'variants': 'Delta, Omicron sub-lineages',
        'severity': 'HIGH',
        'evidence': 'https://www.cell.com/cell/fulltext/S0092-8674(21)00755-8'
    },
    'T478K': {
        'effect': 'Immune evasion, enhanced RBD stability',
        'variants': 'Delta, Omicron',
        'severity': 'MODERATE',
        'evidence': 'Science (2021)'
    },
    'E484A': {
        'effect': 'Significant immune evasion',
        'variants': 'Omicron BA.2, BA.4, BA.5',
        'severity': 'HIGH',
        'evidence': 'https://www.nature.com/articles/s41586-022-04594-4'
    },
    'E484K': {
        'effect': 'Antibody escape',
        'variants': 'Beta, Gamma, some Omicron sub-lineages',
        'severity': 'HIGH',
        'evidence': 'Cell (2021)'
    },
    'N501Y': {
        'effect': 'Enhanced ACE2 binding affinity (~10x)',
        'variants': 'Alpha, Beta, Gamma, Omicron',
        'severity': 'HIGH',
        'evidence': 'https://www.nature.com/articles/s41586-021-03402-9'
    },
    'D614G': {
        'effect': 'Increased transmissibility (~50%)',
        'variants': 'All major variants (universal)',
        'severity': 'HIGH',
        'evidence': 'https://www.science.org/doi/10.1126/science.abe8499'
    },
    'P681H': {
        'effect': 'Enhanced furin cleavage',
        'variants': 'Alpha, Omicron BA.1',
        'severity': 'MODERATE',
        'evidence': 'Nature (2021)'
    },
    'P681R': {
        'effect': 'Enhanced furin cleavage (more than P681H)',
        'variants': 'Delta',
        'severity': 'HIGH',
        'evidence': 'Nature (2021)'
    },
}

# Assess mutations
print("\n" + "="*70)
print("MUTATION SIGNIFICANCE ASSESSMENT")
print("="*70)

high_concern_count = 0
moderate_concern_count = 0
unknown_count = 0

for _, row in mutations_df.iterrows():
    mut = row['mutation']
    
    if mut in mutations_of_concern:
        info = mutations_of_concern[mut]
        
        if info['severity'] == 'HIGH':
            emoji = "🔴"
            high_concern_count += 1
        else:
            emoji = "🟡"
            moderate_concern_count += 1
        
        print(f"\n{emoji} {mut} - **{info['severity']} CONCERN**")
        print(f"   Domain: {row['domain']}")
        print(f"   Effect: {info['effect']}")
        print(f"   Found in: {info['variants']}")
        print(f"   Evidence: {info['evidence']}")
    else:
        print(f"\n⚪ {mut} - Unknown/Novel")
        print(f"   Domain: {row['domain']}")
        print(f"   Status: Not previously characterized as mutation of concern")
        unknown_count += 1

# Summary and recommended action
print("\n" + "="*70)
print("SUMMARY AND RECOMMENDED ACTION")
print("="*70)

print(f"\nTotal mutations detected: {len(mutations)}")
print(f"  🔴 High concern: {high_concern_count}")
print(f"  🟡 Moderate concern: {moderate_concern_count}")
print(f"  ⚪ Unknown/Novel: {unknown_count}")

print("\nKey findings:")
print(f"  • Multiple RBD mutations detected ({mutations_df[mutations_df['domain']=='RBD'].shape[0]} total)")
print(f"  • Includes immune evasion mutations (E484A, K417N)")
print(f"  • Includes enhanced binding mutations (N501Y)")
print(f"  • Universal transmissibility mutation present (D614G)")
print(f"  • Furin cleavage site modification (P681H)")

print("\n⚠️  This constellation of mutations suggests:")
print("   1. Potential new Variant Under Investigation (VUI)")
print("   2. Likely significant immune evasion capability")
print("   3. Enhanced transmissibility expected")
print("   4. May impact vaccine effectiveness")

print("\n📋 RECOMMENDED ACTIONS:")
print("   1. Immediate notification to state epidemiologist")
print("   2. Upload sequence to GISAID with appropriate flags")
print("   3. Enhanced surveillance for additional cases")
print("   4. Consider whole genome sequencing for full characterization")
print("   5. Report to CDC as potential Variant of Concern (VOC)")
print("   6. Investigate patient travel history and contacts")
print("   7. Assess local vaccination rates and breakthrough infection status")

# Visualize mutation locations
fig, ax = plt.subplots(figsize=(14, 6))

# Spike protein domains (simplified)
spike_length = 1273  # amino acids
domains = {
    'NTD': (14, 305, 'lightblue'),
    'RBD': (319, 541, 'salmon'),
    'SD1': (542, 685, 'lightgreen'),
    'SD2': (686, 815, 'lightyellow'),
    'HR1': (916, 1163, 'plum'),
}

# Draw domains
y_height = 0.3
for domain_name, (start, end, color) in domains.items():
    width = end - start
    rect = plt.Rectangle((start, 0.35), width, y_height, 
                         facecolor=color, edgecolor='black', linewidth=2)
    ax.add_patch(rect)
    ax.text(start + width/2, 0.5, domain_name, 
           ha='center', va='center', fontsize=11, fontweight='bold')

# Mark mutations
for _, row in mutations_df.iterrows():
    pos = row['position']
    mut = row['mutation']
    
    # Color by severity
    if mut in mutations_of_concern:
        severity = mutations_of_concern[mut]['severity']
        color = 'red' if severity == 'HIGH' else 'orange'
        marker = 'v'
        size = 200
    else:
        color = 'gray'
        marker = 'o'
        size = 100
    
    ax.scatter([pos], [0.8], s=size, c=color, marker=marker, 
              edgecolors='black', linewidth=2, zorder=10)
    ax.text(pos, 0.95, mut, rotation=45, ha='right', fontsize=9, fontweight='bold')

ax.set_xlim(0, spike_length)
ax.set_ylim(0, 1.2)
ax.set_xlabel('Amino Acid Position', fontsize=12)
ax.set_title('SARS-CoV-2 Spike Protein: Detected Mutations', fontsize=14, fontweight='bold')
ax.set_yticks([])
ax.grid(True, alpha=0.3, axis='x')

# Legend
from matplotlib.patches import Patch
legend_elements = [
    Patch(facecolor='red', label='High Concern Mutation'),
    Patch(facecolor='orange', label='Moderate Concern Mutation'),
    Patch(facecolor='gray', label='Unknown Mutation'),
    Patch(facecolor='salmon', label='RBD (Receptor Binding Domain)'),
]
ax.legend(handles=legend_elements, loc='upper right', fontsize=10)

plt.tight_layout()
plt.savefig('spike_mutations_visualization.png', dpi=300, bbox_inches='tight')
plt.show()

8.6 Machine Learning for Genotype-to-Phenotype Prediction

One of the most powerful applications of AI in genomics: predicting phenotype (drug resistance, virulence, transmissibility) directly from genotype.

8.6.1 Example 5: Predicting Antimicrobial Resistance from Bacterial Genomes

Scenario: You’re analyzing Mycobacterium tuberculosis genomes to predict rifampicin resistance. Traditional culture-based drug susceptibility testing (DST) takes 6-8 weeks. Genomic prediction can provide results in days.

from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score, roc_curve
import warnings
warnings.filterwarnings('ignore')

# Simulate TB genomic data
# In practice, this would come from whole genome sequencing
np.random.seed(42)
n_samples = 1200

# Known rifampicin resistance-associated genes/mutations
resistance_markers = {
    'rpoB_S450L': 'High confidence RIF resistance',
    'rpoB_H445Y': 'High confidence RIF resistance',
    'rpoB_H445D': 'High confidence RIF resistance',
    'rpoB_D435V': 'High confidence RIF resistance',
    'rpoB_L452P': 'Moderate confidence RIF resistance',
    'rpoB_I491F': 'Moderate confidence RIF resistance',
    'rpoB_Q513K': 'Moderate confidence RIF resistance',
    'rpoB_S441L': 'Moderate confidence RIF resistance',
    # Other resistance genes (not RIF, but correlated with MDR-TB)
    'katG_S315T': 'Isoniazid resistance',
    'inhA_C15T': 'Isoniazid resistance',
    'pncA_H57D': 'Pyrazinamide resistance',
    'embB_M306I': 'Ethambutol resistance',
    'gyrA_A90V': 'Fluoroquinolone resistance',
    'gyrA_D94G': 'Fluoroquinolone resistance',
    'rrs_A1401G': 'Aminoglycoside resistance',
}

# Generate features (presence/absence of mutations)
features = {}
for marker in resistance_markers.keys():
    # Higher prevalence for actual RIF resistance markers
    if 'rpoB' in marker:
        prevalence = 0.25 if 'High confidence' in resistance_markers[marker] else 0.15
    else:
        prevalence = 0.20  # Other resistance markers
    
    features[marker] = np.random.binomial(1, prevalence, n_samples)

# Target: Rifampicin resistance
# Strongly associated with rpoB mutations, some other markers due to MDR-TB

rif_resistance = np.zeros(n_samples, dtype=int)

for i in range(n_samples):
    # High confidence rpoB mutations → almost always resistant
    if (features['rpoB_S450L'][i] == 1 or 
        features['rpoB_H445Y'][i] == 1 or
        features['rpoB_H445D'][i] == 1 or
        features['rpoB_D435V'][i] == 1):
        rif_resistance[i] = 1 if np.random.random() < 0.95 else 0
    
    # Moderate confidence rpoB mutations
    elif (features['rpoB_L452P'][i] == 1 or
          features['rpoB_I491F'][i] == 1 or
          features['rpoB_Q513K'][i] == 1 or
          features['rpoB_S441L'][i] == 1):
        rif_resistance[i] = 1 if np.random.random() < 0.70 else 0
    
    # MDR-TB: multiple resistance markers suggests RIF resistance
    elif sum([features[marker][i] for marker in features.keys() if 'rpoB' not in marker]) >= 3:
        rif_resistance[i] = 1 if np.random.random() < 0.60 else 0
    
    # Random resistance (other mechanisms not captured)
    else:
        rif_resistance[i] = 1 if np.random.random() < 0.08 else 0

# Create DataFrame
tb_data = pd.DataFrame(features)
tb_data['rifampicin_resistant'] = rif_resistance

print("="*70)
print("TUBERCULOSIS GENOMIC DATASET")
print("="*70)
print(f"\nTotal isolates: {n_samples}")
print(f"Rifampicin resistance prevalence: {rif_resistance.mean():.1%}")

print(f"\nResistance marker prevalence:")
marker_prev = tb_data.drop('rifampicin_resistant', axis=1).mean().sort_values(ascending=False)
for marker, prev in marker_prev.head(8).items():
    print(f"  {marker:20s}: {prev:.1%}")

print(f"\nSample data (first 5 isolates):")
print(tb_data.head().to_string(index=False))

# Split data
X = tb_data.drop('rifampicin_resistant', axis=1)
y = tb_data['rifampicin_resistant']

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)

print(f"\nTraining set: {len(X_train)} isolates")
print(f"Test set: {len(X_test)} isolates")

# Train multiple models
models = {
    'Logistic Regression': LogisticRegression(random_state=42, max_iter=1000),
    'Random Forest': RandomForestClassifier(n_estimators=100, random_state=42, max_depth=5),
    'Gradient Boosting': GradientBoostingClassifier(n_estimators=100, random_state=42, max_depth=3)
}

results = {}

print("\n" + "="*70)
print("MODEL TRAINING AND EVALUATION")
print("="*70)

for model_name, model in models.items():
    print(f"\n{'='*70}")
    print(f"{model_name}")
    print(f"{'='*70}")
    
    # Train
    model.fit(X_train, y_train)
    
    # Predictions
    y_pred = model.predict(X_test)
    y_pred_proba = model.predict_proba(X_test)[:, 1]
    
    # Metrics
    auc = roc_auc_score(y_test, y_pred_proba)
    
    # Classification report
    print(classification_report(y_test, y_pred, 
                               target_names=['Susceptible', 'Resistant'],
                               digits=3))
    
    print(f"AUC-ROC: {auc:.3f}")
    
    # Confusion matrix
    cm = confusion_matrix(y_test, y_pred)
    
    # Clinical metrics
    sensitivity = cm[1,1] / (cm[1,1] + cm[1,0])
    specificity = cm[0,0] / (cm[0,0] + cm[0,1])
    ppv = cm[1,1] / (cm[1,1] + cm[0,1]) if (cm[1,1] + cm[0,1]) > 0 else 0
    npv = cm[0,0] / (cm[0,0] + cm[1,0]) if (cm[0,0] + cm[1,0]) > 0 else 0
    
    results[model_name] = {
        'auc': auc,
        'sensitivity': sensitivity,
        'specificity': specificity,
        'ppv': ppv,
        'npv': npv,
        'cm': cm,
        'model': model
    }
    
    print(f"\nClinical Performance Metrics:")
    print(f"  Sensitivity: {sensitivity:.1%} (detects {sensitivity:.1%} of resistant cases)")
    print(f"  Specificity: {specificity:.1%} (correctly identifies {specificity:.1%} of susceptible cases)")
    print(f"  PPV: {ppv:.1%} (if predicted resistant, {ppv:.1%} actually are)")
    print(f"  NPV: {npv:.1%} (if predicted susceptible, {npv:.1%} actually are)")

# Select best model (highest AUC)
best_model_name = max(results, key=lambda x: results[x]['auc'])
best_model = results[best_model_name]['model']

print("\n" + "="*70)
print(f"BEST MODEL: {best_model_name}")
print("="*70)

# Feature importance
if hasattr(best_model, 'feature_importances_'):
    feature_importance = pd.DataFrame({
        'marker': X.columns,
        'importance': best_model.feature_importances_
    }).sort_values('importance', ascending=False)
    
    print("\nTop 10 Most Important Genetic Markers:")
    print(feature_importance.head(10).to_string(index=False))

# Visualizations
fig, axes = plt.subplots(2, 2, figsize=(16, 12))

# Top-left: ROC curves
for model_name, result in results.items():
    y_pred_proba = result['model'].predict_proba(X_test)[:, 1]
    fpr, tpr, _ = roc_curve(y_test, y_pred_proba)
    axes[0,0].plot(fpr, tpr, linewidth=2, label=f'{model_name} (AUC={result["auc"]:.3f})')

axes[0,0].plot([0, 1], [0, 1], 'k--', linewidth=2, label='Random (AUC=0.5)')
axes[0,0].set_xlabel('False Positive Rate', fontsize=12)
axes[0,0].set_ylabel('True Positive Rate (Sensitivity)', fontsize=12)
axes[0,0].set_title('ROC Curves: Model Comparison', fontsize=14, fontweight='bold')
axes[0,0].legend(fontsize=10)
axes[0,0].grid(True, alpha=0.3)

# Top-right: Confusion matrix (best model)
best_cm = results[best_model_name]['cm']
sns.heatmap(best_cm, annot=True, fmt='d', cmap='Blues', ax=axes[0,1],
            xticklabels=['Predicted Susceptible', 'Predicted Resistant'],
            yticklabels=['Actually Susceptible', 'Actually Resistant'],
            cbar_kws={'label': 'Count'})
axes[0,1].set_title(f'Confusion Matrix: {best_model_name}', fontsize=14, fontweight='bold')

# Bottom-left: Feature importance
if hasattr(best_model, 'feature_importances_'):
    top_features = feature_importance.head(10)
    axes[1,0].barh(range(len(top_features)), top_features['importance'], color='steelblue', edgecolor='black')
    axes[1,0].set_yticks(range(len(top_features)))
    axes[1,0].set_yticklabels(top_features['marker'])
    axes[1,0].set_xlabel('Importance', fontsize=12)
    axes[1,0].set_title('Top 10 Predictive Genetic Markers', fontsize=14, fontweight='bold')
    axes[1,0].grid(True, alpha=0.3, axis='x')
    axes[1,0].invert_yaxis()

# Bottom-right: Clinical performance comparison
metrics = ['Sensitivity', 'Specificity', 'PPV', 'NPV']
model_names = list(results.keys())
x = np.arange(len(metrics))
width = 0.25

for i, model_name in enumerate(model_names):
    values = [
        results[model_name]['sensitivity'],
        results[model_name]['specificity'],
        results[model_name]['ppv'],
        results[model_name]['npv']
    ]
    axes[1,1].bar(x + i*width, values, width, label=model_name, edgecolor='black')

axes[1,1].set_ylabel('Score', fontsize=12)
axes[1,1].set_title('Clinical Performance Comparison', fontsize=14, fontweight='bold')
axes[1,1].set_xticks(x + width)
axes[1,1].set_xticklabels(metrics)
axes[1,1].legend(fontsize=9)
axes[1,1].set_ylim(0, 1.05)
axes[1,1].grid(True, alpha=0.3, axis='y')

plt.tight_layout()
plt.savefig('amr_prediction_evaluation.png', dpi=300, bbox_inches='tight')
plt.show()

# Clinical use case example
print("\n" + "="*70)
print("CLINICAL USE CASE EXAMPLE")
print("="*70)

print("\n📋 Patient Case:")
print("   • 45-year-old with pulmonary TB")
print("   • Previously treated 5 years ago")
print("   • GeneXpert MTB/RIF: MTB detected")
print("   • Sputum culture pending (results in 6-8 weeks)")
print("   • WGS completed in 48 hours")

# Simulate a test case
test_case = pd.DataFrame({
    'rpoB_S450L': [1], # HIGH CONFIDENCE RIF RESISTANCE
    'rpoB_H445Y': [0],
    'rpoB_H445D': [0],
    'rpoB_D435V': [0],
    'rpoB_L452P': [0],
    'rpoB_I491F': [0],
    'rpoB_Q513K': [0],
    'rpoB_S441L': [0],
    'katG_S315T': [1], # ISONIAZID RESISTANCE
    'inhA_C15T': [0],
    'pncA_H57D': [0],
    'embB_M306I': [0],
    'gyrA_A90V': [0],
    'gyrA_D94G': [0],
    'rrs_A1401G': [0],
})

# Predict
test_prediction = best_model.predict(test_case)[0]
test_proba = best_model.predict_proba(test_case)[0, 1]

print("\n🧬 Genomic Analysis Results:")
print(f"   • rpoB S450L mutation: DETECTED ⚠️")
print(f"   • katG S315T mutation: DETECTED ⚠️")

print(f"\n🤖 Machine Learning Prediction:")
print(f"   • Model: {best_model_name}")
print(f"   • Prediction: {'RESISTANT' if test_prediction == 1 else 'SUSCEPTIBLE'}")
print(f"   • Confidence: {test_proba:.1%}")

print("\n💊 Clinical Interpretation:")
print("   • Genomic evidence of rifampicin resistance (rpoB S450L)")
print("   • Genomic evidence of isoniazid resistance (katG S315T)")
print("   • Diagnosis: Likely Multidrug-Resistant TB (MDR-TB)")

print("\n📋 Recommended Treatment:")
print("   • DO NOT use rifampicin or isoniazid (first-line therapy)")
print("   • Initiate MDR-TB regimen:")
print("     - Bedaquiline")
print("     - Linezolid")
print("     - Levofloxacin")
print("     - Cycloserine")
print("     - Pyrazinamide (test susceptibility)")
print("   • Await culture DST for full susceptibility profile")
print("   • Isolate patient (infection control for MDR-TB)")

print("\n⏱️  Time Saved:")
print("   • Traditional DST: 6-8 weeks")
print("   • Genomic prediction: 2-3 days")
print("   • **Time saved: 5-7 weeks of inappropriate therapy avoided**")

The CRyPTIC Consortium: World’s Largest TB Resistance Study

The CRyPTIC Consortium (New England Journal of Medicine, 2022) sequenced >10,000 M. tuberculosis isolates and performed drug susceptibility testing for 13 antibiotics.

Key findings:

Predictive accuracy for rifampicin resistance: - Sensitivity: 97.1% - Specificity: 99.0% - PPV: 94.1% - NPV: 99.4%

For first-line drugs (rifampicin, isoniazid, ethambutol, pyrazinamide): - Genomic prediction as accurate as culture DST - 5-7 weeks faster than traditional methods - Cost: ~$100-200 vs. $1000+ for comprehensive DST

Clinical impact: - Earlier appropriate treatment - Reduced transmission of drug-resistant TB - Better patient outcomes

Limitations: - Some resistance mechanisms not yet understood (especially for newer drugs) - Heteroresistance (mixed resistant/susceptible populations) harder to detect - Requires WGS infrastructure (not available everywhere)

This represents a paradigm shift in TB diagnosis: from culture-based phenotypic testing to genomics-based prediction.

8.7 Real-World Platforms and Tools

8.7.1 Nextstrain: Open-Source Genomic Epidemiology

Nextstrain (Bedford et al., 2018, Bioinformatics) provides real-time tracking of pathogen evolution.

Key features:

Interactive phylogenetic trees: - Zoom, pan, filter by location/time - Color by country, clade, genotype - Time-resolved trees (branches = time)

Geographic visualization: - Map of where sequences were collected - Migration patterns between regions

Mutation tracking: - Which mutations are increasing - Frequency trajectories over time

Available pathogens: - SARS-CoV-2 (updated daily) - Influenza A & B - Ebola - Zika - Measles - Dengue - West Nile virus - And 20+ more

8.7.2 Example 6: Analyzing Nextstrain-Style Data

# Simulate Nextstrain-style dataset
np.random.seed(42)
n_sequences = 300

# Generate metadata
start_date = datetime(2024, 1, 1)
dates = [start_date + timedelta(days=int(x)) for x in np.random.exponential(25, n_sequences)]
dates = sorted(dates)

countries = np.random.choice(
    ['USA', 'UK', 'Germany', 'France', 'Japan', 'Australia', 'Brazil', 'South Africa'],
    n_sequences,
    p=[0.30, 0.15, 0.10, 0.10, 0.10, 0.08, 0.10, 0.07]
)

# Nextstrain clades (simplified)
# Early: 24A dominant
# Middle: 24B emerges and grows
# Late: 24C appears

def assign_clade(date):
    days = (date - start_date).days
    
    if days < 30:
        return np.random.choice(['24A', '24B'], p=[0.95, 0.05])
    elif days < 60:
        return np.random.choice(['24A', '24B', '24C'], p=[0.40, 0.55, 0.05])
    else:
        return np.random.choice(['24A', '24B', '24C'], p=[0.10, 0.50, 0.40])

clades = [assign_clade(date) for date in dates]

# Mutations accumulate over time
num_mutations = [25 + int((date - start_date).days / 5) + np.random.randint(-3, 4) 
                 for date in dates]

nextstrain_data = pd.DataFrame({
    'strain': [f'{country}/SEQ-{i:04d}/2024' for i, country in enumerate(countries)],
    'date': dates,
    'country': countries,
    'clade': clades,
    'num_mutations': num_mutations,
})

print("="*70)
print("NEXTSTRAIN-STYLE GENOMIC SURVEILLANCE DATA")
print("="*70)
print(f"\nTotal sequences: {len(nextstrain_data)}")
print(f"Date range: {nextstrain_data['date'].min().date()} to {nextstrain_data['date'].max().date()}")
print(f"Countries represented: {nextstrain_data['country'].nunique()}")

print("\nClade distribution:")
print(nextstrain_data['clade'].value_counts().to_string())

# Geographic distribution by clade
print("\n" + "="*70)
print("GEOGRAPHIC DISTRIBUTION BY CLADE")
print("="*70)

geo_clade = nextstrain_data.groupby(['country', 'clade']).size().unstack(fill_value=0)
geo_clade['total'] = geo_clade.sum(axis=1)
geo_clade_pct = geo_clade.div(geo_clade['total'], axis=0).drop('total', axis=1) * 100

print(geo_clade_pct.round(1).to_string())

# Temporal evolution
print("\n" + "="*70)
print("TEMPORAL CLADE EVOLUTION")
print("="*70)

nextstrain_data['month'] = nextstrain_data['date'].dt.to_period('M')
temporal_clade = nextstrain_data.groupby(['month', 'clade']).size().unstack(fill_value=0)
temporal_clade_pct = temporal_clade.div(temporal_clade.sum(axis=1), axis=0) * 100

print(temporal_clade_pct.round(1).to_string())

# Visualizations
fig, axes = plt.subplots(2, 2, figsize=(16, 12))

# Top-left: Geographic distribution
geo_clade_pct.plot(kind='bar', stacked=True, ax=axes[0,0], 
                   colormap='Set3', edgecolor='black', linewidth=0.5)
axes[0,0].set_xlabel('Country', fontsize=12)
axes[0,0].set_ylabel('Percentage', fontsize=12)
axes[0,0].set_title('Clade Distribution by Country', fontsize=14, fontweight='bold')
axes[0,0].legend(title='Clade', fontsize=10)
axes[0,0].set_xticklabels(axes[0,0].get_xticklabels(), rotation=45, ha='right')
axes[0,0].set_ylim(0, 100)
axes[0,0].grid(True, alpha=0.3, axis='y')

# Top-right: Temporal evolution (stacked area)
temporal_clade_pct.index = temporal_clade_pct.index.to_timestamp()
axes[0,1].stackplot(temporal_clade_pct.index,
                    *[temporal_clade_pct[col] for col in temporal_clade_pct.columns],
                    labels=temporal_clade_pct.columns,
                    alpha=0.8)
axes[0,1].set_xlabel('Month', fontsize=12)
axes[0,1].set_ylabel('Percentage', fontsize=12)
axes[0,1].set_title('Clade Evolution Over Time', fontsize=14, fontweight='bold')
axes[0,1].legend(loc='upper left', fontsize=10)
axes[0,1].set_ylim(0, 100)
axes[0,1].grid(True, alpha=0.3, axis='y')

# Bottom-left: Mutation accumulation over time
for clade in nextstrain_data['clade'].unique():
    clade_data = nextstrain_data[nextstrain_data['clade'] == clade]
    axes[1,0].scatter(clade_data['date'], clade_data['num_mutations'],
                     alpha=0.6, s=30, label=clade)

axes[1,0].set_xlabel('Date', fontsize=12)
axes[1,0].set_ylabel('Number of Mutations', fontsize=12)
axes[1,0].set_title('Mutation Accumulation Over Time', fontsize=14, fontweight='bold')
axes[1,0].legend(title='Clade', fontsize=10)
axes[1,0].grid(True, alpha=0.3)

# Bottom-right: Sampling intensity by country
country_counts = nextstrain_data['country'].value_counts()
axes[1,1].barh(range(len(country_counts)), country_counts.values, 
              color='teal', edgecolor='black', linewidth=0.5)
axes[1,1].set_yticks(range(len(country_counts)))
axes[1,1].set_yticklabels(country_counts.index)
axes[1,1].set_xlabel('Number of Sequences', fontsize=12)
axes[1,1].set_title('Sampling Intensity by Country', fontsize=14, fontweight='bold')
axes[1,1].grid(True, alpha=0.3, axis='x')

plt.tight_layout()
plt.savefig('nextstrain_analysis.png', dpi=300, bbox_inches='tight')
plt.show()

# Key insights
print("\n" + "="*70)
print("KEY EPIDEMIOLOGICAL INSIGHTS")
print("="*70)

print("\n1. Clade Dynamics:")
print("   • Clade 24A: Declining (was dominant early, now rare)")
print("   • Clade 24B: Currently dominant (~50%)")
print("   • Clade 24C: Rapidly emerging (5% → 40% in 2 months)")

print("\n2. Geographic Patterns:")
usa_24c = geo_clade_pct.loc['USA', '24C'] if 'USA' in geo_clade_pct.index else 0
print(f"   • USA has highest 24C prevalence ({usa_24c:.0f}%)")
print("   • Suggests either:")
print("     a) 24C emerged in USA, or")
print("     b) USA detected it earlier (better surveillance)")

print("\n3. Mutation Rate:")
mutations_per_month = (nextstrain_data['num_mutations'].max() - nextstrain_data['num_mutations'].min()) / 3
print(f"   • ~{mutations_per_month:.0f} mutations accumulated per month")
print(f"   • Consistent with SARS-CoV-2 background rate")

print("\n4. Surveillance Gaps:")
undersampled = country_counts[country_counts < 20].index.tolist()
if undersampled:
    print(f"   • Under-sampled countries: {', '.join(undersampled)}")
    print("   • May have circulating variants we haven't detected")

print("\n📋 Recommended Actions:")
print("   1. Enhanced surveillance for 24C (growth advantage evident)")
print("   2. Functional characterization of 24C-specific mutations")
print("   3. Assess vaccine effectiveness against 24C")
print("   4. Support genomic capacity in under-sampled regions")

8.8 Challenges, Limitations, and Biases

Genomic surveillance is powerful but not perfect. Understanding its limitations is critical for proper interpretation.

8.8.1 The Sampling Bias Problem

Challenge: We sequence a tiny fraction of all infections, and sampling is highly unequal.

# Simulate global surveillance inequality
surveillance_capacity = pd.DataFrame({
    'region': ['High-income', 'Upper-middle', 'Lower-middle', 'Low-income'],
    'population_millions': [1200, 2600, 3200, 800],
    'covid_cases_millions': [180, 390, 320, 80],
    'sequences_per_1000_cases': [50, 5, 0.5, 0.05],  # 1000x difference!
})

surveillance_capacity['total_sequences'] = (
    surveillance_capacity['covid_cases_millions'] * 1000 * 
    surveillance_capacity['sequences_per_1000_cases']
).astype(int)

surveillance_capacity['sequences_per_million_people'] = (
    surveillance_capacity['total_sequences'] / 
    surveillance_capacity['population_millions']
).round(0)

print("="*70)
print("GLOBAL GENOMIC SURVEILLANCE CAPACITY INEQUALITY")
print("="*70)
print(surveillance_capacity.to_string(index=False))

print("\n" + "="*70)
print("THE PARADOX")
print("="*70)
print("\nCountries with the BEST surveillance:")
print("  • Detect variants first")
print("  • Get blamed for 'causing' them")
print("  • Face travel bans and stigma")

print("\nCountries with the WORST surveillance:")
print("  • Variants circulate undetected")
print("  • Export cases silently")
print("  • No consequences (because they're invisible)")

print("\nConsequence: Perverse incentive AGAINST doing good surveillance!")

# Visualize
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(16, 6))

# Left: Sequencing capacity
x = range(len(surveillance_capacity))
ax1.bar(x, surveillance_capacity['sequences_per_1000_cases'], 
       color='steelblue', edgecolor='black', linewidth=2)
ax1.set_xticks(x)
ax1.set_xticklabels(surveillance_capacity['region'], rotation=0)
ax1.set_ylabel('Sequences per 1,000 Cases', fontsize=12)
ax1.set_title('Genomic Surveillance Capacity', fontsize=14, fontweight='bold')
ax1.set_yscale('log')
ax1.grid(True, alpha=0.3, axis='y')

# Annotate inequality
ax1.text(0.5, 0.95, '1000× inequality', transform=ax1.transAxes,
        fontsize=14, fontweight='bold', color='red',
        ha='center', va='top',
        bbox=dict(boxstyle='round', facecolor='yellow', alpha=0.7))

# Right: Per capita sequencing
ax2.bar(x, surveillance_capacity['sequences_per_million_people'],
       color='coral', edgecolor='black', linewidth=2)
ax2.set_xticks(x)
ax2.set_xticklabels(surveillance_capacity['region'], rotation=0)
ax2.set_ylabel('Sequences per Million People', fontsize=12)
ax2.set_title('Per Capita Sequencing Intensity', fontsize=14, fontweight='bold')
ax2.set_yscale('log')
ax2.grid(True, alpha=0.3, axis='y')

plt.tight_layout()
plt.savefig('surveillance_inequality.png', dpi=300, bbox_inches='tight')
plt.show()

The South Africa Problem

When South Africa detected and reported Omicron in November 2021:

What happened: - Immediate travel bans from 10+ countries - Economic damage to tourism - Accusations of “creating” the variant - Punished for transparency

What should have happened: - Recognition of excellent surveillance - Support for genomic capacity - Reward for rapid data sharing

The lesson: Current system punishes good surveillance, creating incentive to hide variants.

Proposed solutions: - WHO Pandemic Agreement: Benefit-sharing for countries that contribute data - No travel bans based on variant detection (only on epidemiological situation) - Technical support for countries building capacity

For discussion, see Yozwiak et al., 2016, PLOS Biology on equitable data sharing.

8.8.2 Technical Limitations

1. Sequence Quality:

# Simulate sequence quality issues
quality_issues = pd.DataFrame({
    'issue': ['Low viral load', 'Contamination', 'PCR bias', 'Coverage gaps', 'Mixed infection'],
    'frequency_pct': [15, 5, 10, 20, 3],
    'impact': ['Incomplete genome', 'Incorrect calls', 'Allele dropout', 'Missing mutations', 'Ambiguous calls']
})

print("\nCommon Sequencing Quality Issues:")
print(quality_issues.to_string(index=False))

2. Turnaround Time: - Sample collection → Sequence result: 3-7 days - + Analysis and interpretation: 1-2 days - Total: 4-9 days (vs. 1 day for PCR)

For outbreak response: Sometimes too slow (cases have already spread)

3. Cost: - Sequencing: $50-300 per sample - Not sustainable for universal surveillance - Must prioritize strategically

4. Bioinformatics Complexity: - Requires specialized expertise - Computational infrastructure - Rapidly evolving tools and nomenclature

8.9 Ethics and Governance

8.9.2 Dual-Use Research Concerns

Question: Can genomic data enable creation of dangerous pathogens?

2012 H5N1 Controversy: Two labs created highly transmissible H5N1 influenza in ferrets. Should the sequences be published?

Arguments FOR publication: ✅ Scientific transparency
✅ Enables surveillance (detect if mutations arise naturally)
✅ Guides vaccine development

Arguments AGAINST: ❌ Dual-use potential (bioterrorism)
❌ Accidental lab release risk
❌ May trigger panic

Resolution: Eventually published with methods somewhat redacted. Debate continues.

For SARS-CoV-2: - All sequences openly shared - No evidence of misuse - Benefits (diagnostics, vaccines, treatments) vastly outweighed risks

Governance: National Science Advisory Board for Biosecurity (NSABB) reviews high-risk research.

For ethical framework, see Inglesby et al., 2014, mBio.

8.9.3 Stigma and Discrimination

Naming variants after places: - “UK variant”, “South African variant”, “Indian variant” - Creates stigma, encourages discrimination - WHO now uses Greek letters (Alpha, Beta, Delta, Omicron)

Individual-level concerns: - Genomic data can be identifying (small communities) - Must protect patient privacy while sharing pathogen data

8.10 Practical Implementation Guide

8.10.1 Building Your Genomic Surveillance System

Step 1: Define Objectives ☐ What questions will genomics answer? (Outbreak investigation? Variant monitoring? AMR?)
☐ What volume of sequencing is feasible? (Budget constraints)
☐ Who will analyze the data? (In-house expertise? Academic partners?)

Step 2: Establish Infrastructure ☐ Sequencing platform (Illumina? Oxford Nanopore?)
☐ Bioinformatics pipeline (Cloud-based? Local server?)
☐ Data management (How will sequences be stored/shared?)
☐ Quality control procedures

Step 3: Prioritize Samples - Outbreaks: Sequence all or representative subset - Sentinel surveillance: Random sample (2-5% of cases) - Special situations: Vaccine breakthroughs, reinfections, severe cases in vaccinated individuals - Wastewater: Community-level surveillance

Step 4: Analysis Workflow

Sample received
    ↓
QC (Ct value, collection date)
    ↓
Sequencing (1-3 days)
    ↓
Bioinformatics (alignment, variant calling)
    ↓
Lineage assignment (Pangolin, Nextclade)
    ↓
Quality filtering (>90% complete, >100x depth)
    ↓
Phylogenetic analysis (if cluster)
    ↓
Upload to GISAID
    ↓
Report to epidemiology team
    ↓
Action (if novel variant or outbreak cluster)

Step 5: Integration with Epidemiology - Don’t work in silos! Genomics teams + epi teams = powerful combination - Link sequence data to case investigations - Use genomics to guide contact tracing (who infected whom?) - Monitor for variants that may affect control measures

Step 6: Capacity Building - Train staff (wet lab + bioinformatics) - Establish SOPs (standard operating procedures) - Participate in quality assurance programs - Collaborate with academic/reference labs

8.10.2 Free/Open-Source Tools

Sequencing: - ARTIC Network protocols - SARS-CoV-2 sequencing

Bioinformatics: - Nextstrain - Phylogenetic analysis - Pangolin - SARS-CoV-2 lineage assignment - Nextclade - Web-based clade assignment - IQ-TREE - Fast phylogenetic tree building - snippy - Variant calling for bacteria

Cloud Platforms: - Terra.bio - Cloud genomics (free for public health) - Galaxy - Web-based analysis - BaseSpace - Illumina platform

8.11 Key Takeaways

Summary: What You’ve Learned

Genomic surveillance revolutionized outbreak response - SARS-CoV-2 sequenced in 10 days; variants detected within weeks of emergence
Phylogenetics reconstructs transmission - Who infected whom, even without contact tracing data
Mutations have consequences - Specific genetic changes cause increased transmissibility, immune evasion, drug resistance
Machine learning predicts phenotype from genotype - Drug resistance prediction from sequence alone (TB, S. aureus, E. coli)
Real-world platforms exist and work - GISAID, Nextstrain enable global collaboration
Sampling bias is pervasive - Rich countries sequence 100-1000x more than poor countries, creating blind spots
Technical challenges remain - Cost, turnaround time, bioinformatics expertise required
Equity and ethics matter - Countries doing good surveillance shouldn’t be punished; need benefit-sharing
Integration is essential - Genomics enhances traditional epidemiology but doesn’t replace it
The future is genomic - Sequencing costs dropping; routine genomic surveillance becoming standard of care

8.12 Practice Exercises

8.12.1 Exercise 1: Outbreak Investigation

Nine sequences from a restaurant outbreak. SNP distances: - Cases 1-5: 0-2 SNPs apart - Cases 6-9: 0-1 SNPs apart - Between groups: 15 SNPs

Questions: - How many introductions? - Which cases are linked? - What public health actions are needed?

8.12.2 Exercise 2: Variant Growth

Variant A frequency: Week 1 = 10%, Week 4 = 40%

Calculate: - Fold change - Weekly growth rate - Doubling time - Transmission advantage (relative R)

8.12.3 Exercise 3: AMR Prediction

E. coli genome with blaCTX-M-15 gene detected.

Questions: - What antibiotic class will be ineffective? - What should you prescribe instead? - How confident are you? (What could cause false positive/negative?)

8.12.4 Exercise 4: Phylogenetic Interpretation

Build a phylogenetic tree from sequences. Identify: - Which samples are most closely related? - Do geographic clusters exist? - What is the most recent common ancestor?

Check Your Understanding

Test your knowledge of genomic surveillance and its applications in public health. Each question builds on the key concepts from this chapter.

Question 1

A public health laboratory is planning to establish genomic surveillance for SARS-CoV-2 variants. They are deciding between two sequencing approaches: targeted amplicon sequencing (e.g., ARTIC protocol) versus whole genome shotgun sequencing. Which statement BEST describes the key trade-off between these approaches for routine surveillance?

Targeted amplicon sequencing is more expensive but provides complete genome coverage
Whole genome shotgun sequencing is faster and more suitable for high-throughput surveillance
Targeted amplicon sequencing is more cost-effective and works with low viral loads, but may miss novel variants in primer binding sites
Whole genome shotgun sequencing is the only method that can detect single nucleotide polymorphisms (SNPs)

Correct Answer: c) Targeted amplicon sequencing is more cost-effective and works with low viral loads, but may miss novel variants in primer binding sites

The choice between targeted amplicon sequencing and whole genome shotgun sequencing represents a fundamental trade-off in genomic surveillance. Targeted amplicon sequencing (like the widely-used ARTIC protocol) uses specific primers to amplify predetermined regions of the viral genome. This approach is highly cost-effective, requires less sequencing depth, and crucially, works well with clinical samples that have low viral loads—a common situation with respiratory specimens. These advantages made ARTIC the workhorse of global SARS-CoV-2 surveillance during the pandemic.

However, targeted approaches have a critical vulnerability: if mutations occur in primer binding sites, those regions may fail to amplify, creating “dropout” patterns. This happened with the Omicron variant, where mutations in primer binding sites initially caused detection issues. While primer sets can be updated, there’s always a lag between variant emergence and primer redesign.

Option (a) is incorrect because targeted sequencing is actually less expensive, not more. Option (b) is incorrect because targeted approaches are generally faster and more cost-effective for high-throughput surveillance, not shotgun sequencing. Option (d) is incorrect because both methods can detect SNPs—the difference lies in their systematic biases and coverage patterns, not in the fundamental ability to detect sequence variation.

For routine surveillance, most laboratories use targeted sequencing for cost-effectiveness and rapid turnaround, while maintaining shotgun sequencing capacity for investigating unusual patterns or confirming novel variants. This hybrid strategy balances practical constraints with surveillance sensitivity.

Question 2

You are investigating a hospital outbreak of methicillin-resistant Staphylococcus aureus (MRSA). Whole genome sequencing reveals that isolates from five patients differ by 0-3 SNPs. Hospital infection control asks whether this proves transmission occurred within the hospital. What is the MOST scientifically accurate response?

Yes, 0-3 SNPs definitively proves direct transmission between these patients
No, MRSA does not accumulate SNPs during person-to-person transmission
The SNP data is consistent with recent transmission, but cannot prove it without epidemiological context and a null hypothesis
SNP analysis is unreliable for outbreak investigation and should not be used for infection control decisions

Correct Answer: c) The SNP data is consistent with recent transmission, but cannot prove it without epidemiological context and a null hypothesis

This question addresses a critical pitfall in genomic epidemiology: confusing “consistent with” for “proof of” transmission. While 0-3 SNPs between MRSA isolates strongly suggests recent common ancestry, it does not definitively prove direct transmission within the hospital. Multiple scenarios could produce this pattern: patients could have independently acquired closely related strains from a common community source, they could have been colonized before admission, or transmission could have occurred in a different healthcare facility they all visited.

Rigorous outbreak investigation requires integrating genomic data with epidemiological context. Key questions include: Did patients have overlapping hospital stays? Were they in the same unit? Did they share healthcare workers? What is the background genetic diversity of MRSA in this community? Without comparing the outbreak cluster to a null hypothesis (what would we expect by chance given local MRSA diversity?), we cannot assess whether the genetic similarity is remarkable or expected.

Option (a) is too absolute—genomic data alone cannot prove causation. Option (b) is factually incorrect; MRSA does accumulate SNPs during transmission, typically 0-25 per year depending on mutation rates and generation time. Option (d) is also incorrect; SNP analysis is highly valuable for outbreak investigation when properly interpreted—the chapter discusses multiple successful applications in tuberculosis and other bacterial infections.

The Walker et al. (2013) study on tuberculosis transmission in the UK exemplifies the proper integration of genomic and epidemiological data, showing that many apparent transmission links suggested by conventional epidemiology were inconsistent with genetic data, and vice versa. Genomics is a powerful tool, but not a standalone oracle.

Question 3

A phylogenetic tree of SARS-CoV-2 sequences shows that samples from Country A cluster together on one branch, while samples from Country B form a different cluster. A government official concludes that this proves the virus evolved differently in each country and suggests country-specific vaccines. What is the PRIMARY flaw in this interpretation?

Phylogenetic trees cannot show geographic patterns because they only represent genetic relationships
The clustering likely reflects sampling bias and founder effects, not independent evolution requiring different vaccines
SARS-CoV-2 does not evolve fast enough to develop country-specific lineages
Phylogenetic trees are unreliable for RNA viruses due to high mutation rates

Correct Answer: b) The clustering likely reflects sampling bias and founder effects, not independent evolution requiring different vaccines

This question highlights a common misinterpretation of phylogeographic patterns. When samples from different countries cluster separately on a phylogenetic tree, it’s tempting to conclude this represents distinct evolutionary trajectories. However, geographic clustering usually reflects sampling and epidemiological patterns rather than functionally significant evolutionary divergence.

The clustering likely arises from founder effects: different viral lineages were introduced to each country during the early pandemic, and local transmission amplified these founding lineages. If Country A had limited travel connections during the critical seeding period, or if sequencing began after local transmission was established, the sequences will naturally cluster together because they descend from a small number of introductions. This phylogenetic structure tells us about epidemic history and introduction patterns, not about functional differences requiring different vaccines.

Furthermore, the chapter discusses how variants of concern (Alpha, Delta, Omicron) arose in specific locations but then spread globally because they had functional advantages. These variants required updated vaccines not because of where they originated, but because they acquired specific mutations affecting immune escape. Simple geographic clustering without evidence of altered antigenicity or functional changes does not justify country-specific vaccines.

Option (a) is incorrect—phylogenetic trees do show geographic patterns, but these require careful interpretation. Option (c) is incorrect; SARS-CoV-2 evolves quite rapidly, but this doesn’t automatically mean geographically distinct lineages require different vaccines. Option (d) is incorrect; phylogenetic methods are well-suited to RNA viruses, and high mutation rates actually provide more phylogenetic signal, not less reliability.

The key lesson: phylogenetic structure reflects both evolutionary processes and sampling/epidemiological history. Distinguishing these requires understanding founder effects, sampling biases, and the functional significance of observed mutations.

Question 4

A laboratory has sequenced the genome of an E. coli isolate from a patient with a urinary tract infection and detected the blaCTX-M-15 gene. Based on this finding alone, what is the MOST appropriate conclusion about antibiotic treatment?

The patient should immediately be switched to a carbapenem antibiotic as first-line therapy
Third-generation cephalosporins (e.g., ceftriaxone) will likely be ineffective and alternative treatments should be considered
The infection cannot be treated with any beta-lactam antibiotics and requires completely different drug classes
Genomic detection of resistance genes is unreliable and phenotypic susceptibility testing should be performed instead

Correct Answer: b) Third-generation cephalosporins (e.g., ceftriaxone) will likely be ineffective and alternative treatments should be considered

The blaCTX-M-15 gene encodes an extended-spectrum beta-lactamase (ESBL) that hydrolyzes third-generation cephalosporins, rendering them ineffective. This genotype-to-phenotype prediction is highly reliable for this gene, as discussed in the chapter’s section on antimicrobial resistance genomics. Detection of blaCTX-M-15 should prompt clinicians to avoid cephalosporins and consider alternative treatments such as fluoroquinolones, nitrofurantoin (for uncomplicated UTIs), or carbapenems (for severe infections).

However, option (a) is too aggressive—carbapenems are important drugs held in reserve for severe infections, and their overuse drives resistance. For an uncomplicated UTI, alternative oral agents like nitrofurantoin or fosfomycin might be appropriate. The choice depends on infection severity, local resistance patterns, and antibiotic stewardship principles. Genomic data informs the decision but doesn’t automatically dictate carbapenem use.

Option (c) is incorrect because blaCTX-M-15 specifically affects cephalosporins; other beta-lactams may remain effective depending on the resistance profile. Many ESBL-producing organisms remain susceptible to carbapenems and beta-lactamase inhibitor combinations.

Option (d) misunderstands the role of genomic predictions. While phenotypic testing remains important for confirming susceptibility and detecting unexpected resistance, genomic detection of well-characterized resistance genes like blaCTX-M-15 provides rapid, actionable information. The chapter discusses how machine learning models are increasingly successful at predicting resistance phenotypes from genotypes, especially for well-studied genes. The CRyPTIC Consortium’s work on tuberculosis (cited in Further Resources) demonstrated that genomic prediction of drug resistance can match or exceed traditional phenotypic testing for many drugs.

The key principle: genomic resistance detection provides rapid, reliable information that can guide empiric therapy while awaiting confirmatory phenotypic results, particularly for well-characterized resistance mechanisms.

Question 5

During a suspected foodborne outbreak, public health investigators sequence bacterial isolates from patients, food samples, and potential source farms. The phylogenetic tree shows that patient isolates cluster with one farm’s isolates (5-10 SNPs apart), while isolates from other farms differ by >100 SNPs. However, the implicated farm has excellent food safety records and no epidemiological links to the patients. What should investigators do NEXT?

Immediately close the implicated farm since genomic evidence is definitive proof of the source
Disregard the genomic data since epidemiological links are absent and focus only on traditional investigation
Investigate potential indirect connections (e.g., shared suppliers, distributors, or environmental sources) and expand sampling
Conclude the outbreak investigation is inconclusive and close the case

Correct Answer: c) Investigate potential indirect connections (e.g., shared suppliers, distributors, or environmental sources) and expand sampling

This scenario illustrates the investigative power of genomic data while highlighting that real-world outbreak investigations require integrating multiple lines of evidence. The phylogenetic pattern strongly suggests a connection between patient cases and one specific farm (5-10 SNPs indicates very recent common ancestry), but the lack of direct epidemiological links means the connection might be indirect.

The appropriate next step is to use genomic data as a hypothesis-generating tool to expand the investigation. Potential scenarios include: the farm supplies a distributor that processes multiple products, creating cross-contamination opportunities; the farm shares irrigation water or equipment with other operations that directly supply implicated restaurants; or there’s an intermediate processor or distributor not yet identified. The chapter discusses similar scenarios where genomic evidence revealed unexpected transmission pathways that traditional epidemiology missed.

Option (a) represents regulatory overreach—genomic data provides strong evidence but should not be used in isolation to take punitive actions, especially when epidemiological context is lacking. False leads can occur due to sampling bias or coincidental clustering. Option (b) makes the opposite error, dismissing valuable scientific evidence. The history of outbreak investigation shows that integrating genomic and traditional epidemiology produces better outcomes than either alone. Option (d) is premature; the strong genomic signal warrants further investigation, not case closure.

This approach mirrors successful outbreak investigations discussed in the chapter, such as the Tegomoh et al. (2021) Nebraska Omicron cluster investigation, where genomic surveillance revealed unexpected transmission links that informed public health action. The key principle: genomic data should prompt expanded investigation and hypothesis refinement, not serve as a sole decision-making criterion.

Modern outbreak investigation increasingly resembles detective work where genomic data provides leads that must be confirmed and contextualized through traditional epidemiological investigation. This integration produces more accurate source attribution and more effective public health interventions.

Question 6

A low-income country with limited sequencing capacity wants to establish genomic surveillance for antimicrobial resistance and epidemic pathogens. International partners offer to sequence all samples offshore and share data through global databases. However, local scientists express concerns about this arrangement. What is the MOST important ethical consideration that should guide the surveillance system design?

Cost-effectiveness: offshore sequencing is cheaper, so ethical use of limited resources requires accepting this arrangement
Speed: rapid results are paramount for public health action, so the fastest sequencing location should be prioritized
Data sovereignty and equitable benefit-sharing: local capacity building and in-country analysis should be prioritized even if initially more expensive or slower
Data quality: only high-income country laboratories can produce reliable sequences, so ethical surveillance requires offshore processing

Correct Answer: c) Data sovereignty and equitable benefit-sharing: local capacity building and in-country analysis should be prioritized even if initially more expensive or slower

This question addresses a critical ethical dimension of genomic surveillance discussed in the chapter’s ethics section. While offshore sequencing might appear pragmatic, it perpetuates problematic power dynamics in global health and can undermine the long-term goals of surveillance systems.

The chapter cites Yozwiak et al. (2016) on equitable data sharing, emphasizing that communities generating genomic data should have the capacity to analyze and benefit from their own data. When samples are sent offshore for sequencing, several problems arise: local scientists lose opportunities to develop expertise and ask research questions relevant to their context; there may be delays in returning actionable information; intellectual property and scientific credit often accrue to international partners rather than local investigators; and the country remains dependent on external support rather than building sustainable capacity.

During the COVID-19 pandemic, some countries with samples sent abroad for sequencing faced significant delays in receiving results, while international collaborators published papers using “their” data. This extractive pattern echoes colonial-era scientific practices and violates principles of research equity and justice.

The appropriate approach combines immediate practical needs with capacity building: initial offshore sequencing can be accompanied by training programs, technology transfer, and progressive transition to in-country analysis. Partnership models should emphasize co-leadership, co-authorship, and sustainable infrastructure development.

Option (a) takes an overly narrow view of cost-effectiveness that ignores long-term sustainability and equity. Option (b) prioritizes speed over sovereignty, which may be necessary in acute emergencies but should not be the default model. Option (d) is factually incorrect and reflects biased assumptions; many laboratories in low- and middle-income countries produce high-quality genomic data, and the primary barrier is resource access, not capability.

The key principle: genomic surveillance systems should be designed not just to collect data, but to build sustainable local capacity and ensure that benefits flow to the communities that generate samples. This approach aligns with broader goals of equity in global health research and strengthens long-term pandemic preparedness.

8.13 Discussion Questions

South Africa detected and reported Omicron, then faced immediate travel bans. How can we design a system that rewards transparency instead of punishing it? What would proper “benefit-sharing” look like?
Genomic surveillance is 100-1000x more intensive in high-income countries. What variants are we missing in under-surveilled regions? How do we build capacity equitably?
A nursing home outbreak: genomics shows two separate introductions. How does this change your public health response compared to a single introduction? What are the implications for infection control?
TB resistance prediction is now as accurate as culture testing and 6 weeks faster. Should we replace phenotypic testing with genomic prediction? What are the risks? When is verification necessary?
Publishing highly transmissible H5N1 sequences: Dual-use dilemma. Where should we draw the line between scientific openness and biosecurity concerns? Who decides?
A new SARS-CoV-2 variant with 20+ spike mutations is detected. Using what you’ve learned, how would you assess: (a) Is it more transmissible? (b) Does it evade immunity? (c) Should it be a Variant of Concern?

8.14 Further Resources

8.14.1 Books

Felsenstein, 2004, Inferring Phylogenies - Comprehensive phylogenetics
Pevsner, 2015, Bioinformatics and Functional Genomics - Bioinformatics fundamentals

8.14.2 Key Papers

Genomic Epidemiology: - Bedford et al., 2018, Bioinformatics - Nextstrain Essential - Grubaugh et al., 2019, Nature Microbiology - Framework Essential

SARS-CoV-2 Case Studies: - Rambaut et al., 2021, Nature - Alpha variant - Key case study on variant detection - Viana et al., 2022, Nature - Omicron origins - Tracing variant emergence in South Africa - Tegomoh et al., 2021, MMWR - Nebraska Omicron cluster investigation - Practical outbreak investigation using genomic surveillance - COG-UK, 2020, Nature Communications - Large-scale genomic surveillance infrastructure

Antimicrobial Resistance: - CRyPTIC Consortium, 2022, NEJM - TB resistance Landmark study - Walker et al., 2013, Nature Genetics - TB transmission

Ethics: - Yozwiak et al., 2016, PLOS Biology - Equitable data sharing Essential

8.14.3 Tools

Nextstrain - Real-time pathogen tracking
GISAID - Genomic database
Pangolin - SARS-CoV-2 lineage assignment
IQ-TREE - Phylogenetic inference
Terra.bio - Cloud genomics platform

Genomic surveillance is the future of infectious disease control. But technology alone isn’t enough—we need equity, ethics, and integration with traditional public health. The sequences tell us what’s happening. Epidemiologists tell us what to do about it.