8 Genomic Surveillance and Pathogen Analysis
Learning Objectives
By the end of this chapter, you will:
- Understand how genomic sequencing transformed outbreak investigation
- Build phylogenetic trees to trace pathogen transmission chains
- Detect emerging variants and mutations of concern using sequence analysis
- Apply machine learning to predict antimicrobial resistance from genomic data
- Use GISAID, Nextstrain, and other genomic surveillance platforms
- Implement automated pipelines for variant detection and classification
- Recognize the limitations and biases in genomic surveillance data
- Navigate the ethical considerations of publishing pathogen genomes
- Integrate genomic data with traditional epidemiology
Time to complete: 120-150 minutes
Prerequisites: Chapter 4: Surveillance, Chapter 2: AI Basics, Chapter 3: Data
What you’ll build: 💻 Complete genomic surveillance pipeline including SARS-CoV-2 variant detection, phylogenetic tree construction, transmission chain analysis, antimicrobial resistance prediction, and interactive visualization
8.1 Introduction: The Genomic Revolution in Public Health
December 31, 2019, Wuhan, China: ProMED-mail publishes a post about “undiagnosed pneumonia” in Wuhan. The WHO is notified. The world is on alert.
January 10, 2020 (10 days later): Professor Yong-Zhen Zhang at Fudan University publicly releases the complete genome sequence of the novel coronavirus on virological.org—before formal publication, before bureaucratic approval, simply sharing it openly with the world.
January 13, 2020 (3 days later): Thailand confirms the first imported case outside China—identified using PCR diagnostic tests designed directly from that shared genome sequence.
This was unprecedented. In past pandemics, months or years passed before pathogen identification. During the 1918 influenza pandemic, the virus wasn’t even isolated until 1933—15 years later. For SARS-CoV-2, the full genome was sequenced and shared globally within 10 days of initial reports.
February 2021, Kent, England:
Routine genomic surveillance by the COVID-19 Genomics UK Consortium (COG-UK) detects an unusual cluster of sequences. The lineage, designated B.1.1.7 (later renamed Alpha), contains 17 mutations—an unprecedented number for SARS-CoV-2.
More concerning: the variant is spreading 50-70% faster than existing strains. Within the spike protein, the N501Y mutation enhances ACE2 receptor binding. The deletion at positions 69-70 causes S-gene target failure in some PCR tests—an unintended diagnostic marker.
Within weeks: Rambaut et al., 2020 publish detailed phylogenetic analysis. The variant is detected in 50+ countries. Genomic data guides public health responses, travel restrictions, and urgent vaccine effectiveness studies.
May 2021, India:
The Delta variant (B.1.617.2) emerges with the L452R mutation—later shown to increase transmissibility and cause modest immune evasion. Genomic surveillance tracks its rapid global spread. It becomes dominant worldwide within months.
November 2021, Botswana/South Africa:
Omicron (B.1.1.529) is detected with an unprecedented 30+ spike protein mutations. South African scientists immediately share sequence data via GISAID. Global genomic surveillance confirms rapid spread. Within weeks, the variant is detected on every continent.
December 2021, Nebraska, USA:
One of the first U.S. Omicron outbreak investigations used genomic surveillance to trace a household cluster of 6 cases linked to international travel from Nigeria. Tegomoh et al., 2021, MMWR documented a shorter incubation period (~3 days) and milder symptoms compared to previous variants—early evidence that would later be confirmed globally. This rapid investigation demonstrated how genomic sequencing enables real-time characterization of emerging variants.
The paradox: South Africa, which has among the world’s strongest genomic surveillance systems, was blamed for “causing” Omicron and faced immediate travel bans—despite simply being first to detect and share the data. This created a dangerous precedent: countries that do good surveillance get punished for transparency.
8.1.1 Why Genomic Surveillance Matters
Traditional surveillance tells you: - “There’s an outbreak of respiratory illness” - “Case count is increasing” - “Hospitalizations are up”
Genomic surveillance tells you: - “It’s SARS-CoV-2, Delta variant (B.1.617.2), sub-lineage AY.4.2” - “The outbreak involves two separate introductions (phylogenetically distinct clusters)” - “Case A likely infected Cases B, C, and D based on genetic similarity and timing” - “A new mutation (E484K) emerged that may affect antibody binding” - “This variant is 40% more transmissible based on growth rate analysis”
The difference is actionable intelligence:
1. Attribution: Is this a new introduction or ongoing transmission?
2. Transmission chains: Who infected whom? Where are transmission hotspots?
3. Variants: Is this a new, more dangerous strain?
4. Outbreak linkage: Are these cases connected or separate introductions?
5. Drug resistance: Will antibiotics, antivirals, or vaccines work?
6. Public health response: Targeted interventions vs. population-wide measures
8.2 The Foundations: What Is Genomic Surveillance?
8.2.1 From DNA to Disease: The Central Dogma
All living organisms and many pathogens (viruses, bacteria, parasites, fungi) store genetic information in DNA or RNA:
DNA → (transcription) → RNA → (translation) → Protein → Phenotype
For pathogens, mutations in genetic sequences can lead to:
Increased transmissibility: - SARS-CoV-2 N501Y mutation: ~50% increase in transmission - Influenza H274Y mutation: No fitness cost, enables oseltamivir resistance
Immune evasion: - Omicron’s 30+ spike mutations: Escape from prior infection and vaccination - HIV env gene variation: Constant immune escape
Drug resistance: - M. tuberculosis rpoB mutations: Rifampicin resistance - S. aureus mecA gene: Methicillin resistance (MRSA) - Plasmodium falciparum kelch13 mutations: Artemisinin resistance
Virulence changes: - Y. pestis loss of Pla protease: Reduced virulence - E. coli O157:H7 Shiga toxin acquisition: Increased pathogenicity
RNA viruses (influenza, SARS-CoV-2, HIV): - Mutation rate: ~10⁻⁶ to 10⁻⁴ per nucleotide per replication - Fast evolution = constant adaptation
DNA viruses (smallpox, herpes): - Mutation rate: ~10⁻⁸ per nucleotide per replication - Slower evolution = more stable targets for vaccines
Bacteria: - Mutation rate: ~10⁻⁹ per nucleotide per generation - But horizontal gene transfer (acquiring resistance genes from other bacteria) enables rapid adaptation
Why this matters: RNA viruses require continuous surveillance; bacterial resistance can spread suddenly via plasmid transfer.
For detailed evolutionary rates, see Sanjuán et al., 2010, Journal of Virology.
8.2.2 Key Concepts in Genomic Epidemiology
8.2.2.1 1. Single Nucleotide Polymorphisms (SNPs)
The smallest unit of genetic variation:
Reference genome: ...ATCGATCGATCG...
Variant 1: ...ATCGATCGATCG... (0 SNPs - identical)
Variant 2: ...ATCAATCGATCG... (1 SNP: G→A at position 4)
Variant 3: ...GTCAATCGATCG... (2 SNPs: A→G at pos 1, G→A at pos 4)
SNP distance (Hamming distance) measures genetic similarity: - 0-2 SNPs: Likely direct transmission or very recent common ancestor - 3-10 SNPs: Possibly linked through intermediates - >10 SNPs: Likely unrelated (for pathogens with slow mutation rates)
Critical caveat: SNP thresholds vary by pathogen! - SARS-CoV-2: ~2 mutations/month/lineage (0.0007 mutations/site/year) - Influenza: ~10 mutations/year in HA gene - M. tuberculosis: ~0.5 SNPs/genome/year (extremely slow) - HIV: Hypervariable; can vary within a single patient
8.2.2.2 2. Lineages, Clades, and Variants
Hierarchical classification of related genomes:
Ancestral SARS-CoV-2 (Wuhan-Hu-1)
│
├─ A lineage
├─ B lineage
├─ B.1
│ ├─ B.1.1.7 (Alpha)
│ ├─ B.1.351 (Beta)
│ └─ B.1.617.2 (Delta)
│ ├─ AY.1
│ ├─ AY.2
│ └─ AY.4.2
└─ B.1.1.529 (Omicron)
├─ BA.1
├─ BA.2
├─ BA.4
└─ BA.5
└─ BQ.1.1
Nomenclature systems:
Pango lineages (cov-lineages.org): - Dynamic nomenclature for SARS-CoV-2 - B.1.1.7, B.1.617.2, etc. - Updated as virus evolves
WHO labels (Alpha, Beta, Delta, Omicron): - Simplified names for public communication - Avoids geographic stigma - Reserved for Variants of Concern (VOCs)
Nextstrain clades (19A, 20A, 21A, etc.): - Time-based naming system - Integrates with phylogenetic analysis
For the full Pango system, see Rambaut et al., 2020, Nature Microbiology.
8.2.2.3 3. Phylogenetic Trees
Visual representation of evolutionary relationships:
┌─ Sequence_A
┌─────┤
│ └─ Sequence_B
┌─────────┤
│ └─────── Sequence_C
────┤
│ ┌─────── Sequence_D
└─────────┤
└─────── Sequence_E
↑ ↑ ↑
Root Internal Tips
node (leaves)
Key components:
Branch length: Represents evolutionary distance (number of mutations)
Topology: Shows who’s related to whom
Root: Most recent common ancestor (MRCA)
Tips (leaves): Observed sequences
Internal nodes: Inferred ancestral sequences
Types of trees:
Phylogram: Branch lengths proportional to mutations
Chronogram: Branch lengths proportional to time
Cladogram: Shows relationships only (no branch length information)
For comprehensive phylogenetics introduction, see Felsenstein, 2004, Inferring Phylogenies.
8.2.2.4 4. Mutations of Concern
Not all mutations are equal:
Synonymous (silent) mutations: - Change DNA sequence but not amino acid - Example: CTT → CTC (both code for leucine) - Usually neutral, but can affect RNA stability
Non-synonymous (missense) mutations: - Change amino acid - Example: GAT → AAT (aspartate → asparagine) - May alter protein function
Nonsense mutations: - Create stop codon - Truncate protein (usually deleterious)
Insertions/Deletions (indels): - Can cause frameshift - Dramatically alter downstream protein sequence
For SARS-CoV-2, mutations in the spike protein are critical:
Mutation | Location | Effect |
---|---|---|
D614G | Outside RBD | Increased transmissibility (~50%) |
N501Y | RBD | Enhanced ACE2 binding, increased transmissibility |
E484K/A | RBD | Immune evasion (antibody escape) |
L452R | RBD | Immune evasion + increased transmissibility |
K417N/T | RBD | Reduced antibody binding |
P681H/R | Furin site | Enhanced spike cleavage, increased entry |
For detailed functional analysis, see Harvey et al., 2021, Nature Reviews Microbiology.
8.3 Genomic Data: Collection, Quality, and Biases
Before analyzing genomes, we must understand how data is collected and its inherent limitations.
8.3.1 The Sequencing Process
1. Sample Collection - Nasopharyngeal swabs (respiratory viruses) - Blood samples (bloodborne pathogens, bacteria) - Stool samples (enteric pathogens) - Wastewater (community-level surveillance)
2. Nucleic Acid Extraction - RNA extraction (SARS-CoV-2, influenza, HIV) - DNA extraction (bacteria, DNA viruses) - Quality depends on sample quality and viral load
3. Library Preparation - Amplicon-based (ARTIC for SARS-CoV-2): Fast, cheap, works with low viral load - Metagenomic: No prior knowledge needed, finds unexpected pathogens - Targeted capture: Enriches specific pathogen sequences
4. Sequencing - Illumina: High accuracy, moderate cost, 1-3 days - Oxford Nanopore: Real-time, portable, lower accuracy, hours - PacBio: Long reads, high accuracy, expensive
5. Bioinformatics Pipeline - Quality control (remove low-quality reads) - Alignment to reference genome - Variant calling (identify SNPs, indels) - Consensus sequence generation - Lineage assignment (Pangolin for SARS-CoV-2)
Typical timeline: Sample to sequence = 2-5 days
Cost: $50-300 per sample (decreasing rapidly)
8.3.2 Quality Metrics and Limitations
Key quality indicators:
Sequencing depth (coverage): - >100x per position: High confidence - 10-100x: Acceptable for most purposes - <10x: Low confidence, may miss variants
Genome completeness: - >95%: Suitable for phylogenetics - 80-95%: Marginal; some analyses possible - <80%: Poor quality; limited use
Ct value (cycle threshold) from PCR: - <25: High viral load, easy to sequence - 25-30: Moderate viral load, usually sequences well - >30: Low viral load, may fail sequencing or produce incomplete genomes
Challenge: Patients with low viral load (high Ct values) are: - Often sequenced less successfully (incomplete genomes) - May be at different disease stages (early infection, late recovery) - Could have different variant distributions
This creates bias: We sequence sickest patients (high viral load) more successfully than mild cases.
Impact on surveillance: May miss variants that cause milder disease.
For discussion of sequencing biases, see Grubaugh et al., 2019, Nature Microbiology.
8.3.3 Global Surveillance Infrastructure: GISAID
GISAID (Global Initiative on Sharing All Influenza Data) is the primary platform for sharing pathogen genomes.
Key features:
Rapid data sharing: - SARS-CoV-2: >16 million sequences (as of 2024) - Influenza: >5 million sequences - Updates daily
Access control: - Free for public health and research use - Requires registration (prevents misuse) - Submitters retain rights (encourages sharing)
Quality standards: - Automated quality checks - Lineage assignment - Metadata requirements (date, location)
Attribution: - Proper citation of data contributors - Prevents “parasitic” publications
Impact: - Enabled real-time COVID-19 tracking globally - Detected variants within days of emergence - Guided vaccine development and policy
For public health practitioners: 1. Register at GISAID.org 2. Agree to terms (no redistribution, proper citation) 3. Access: - Web interface for browsing - Batch download for analysis - API for programmatic access
Data available: - Complete genomes (FASTA format) - Metadata (collection date, location, lineage, patient demographics) - Submission acknowledgments (proper attribution)
Key tools: - EpiCoV (SARS-CoV-2) - EpiFlu (Influenza) - EpiPox (Monkeypox) - EpiRSV (RSV)
8.4 Building Your Genomic Surveillance Pipeline
8.4.1 Example 1: Loading and Exploring Genomic Surveillance Data
Scenario: You’re a genomic epidemiologist at a state health department. Your lab has sequenced 500 SARS-CoV-2 samples over the past three months. Your first task: understand what variants are circulating and how they’re changing over time.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from datetime import datetime, timedelta
import warnings
'ignore')
warnings.filterwarnings(
# Simulate realistic genomic surveillance data
# In practice, you'd load from GISAID or your local database
42)
np.random.seed(= 500
n_samples
# Generate sample metadata
= datetime(2024, 1, 1)
start_date = [start_date + timedelta(days=int(x)) for x in np.random.exponential(20, n_samples)]
dates = sorted(dates)
dates
# Geographic distribution
= np.random.choice(
counties 'County_A', 'County_B', 'County_C', 'County_D', 'County_E'],
[
n_samples,=[0.4, 0.25, 0.15, 0.12, 0.08] # Unequal sampling
p
)
# Demographics
= np.random.normal(45, 22, n_samples).clip(0, 95).astype(int)
ages = np.random.choice(['M', 'F'], n_samples)
sex
# Clinical data
= np.random.uniform(15, 35, n_samples) # Cycle threshold (viral load proxy)
ct_values = (ct_values < 25) & (np.random.random(n_samples) < 0.3) # High viral load more likely hospitalized
hospitalized
# Variant distribution (evolving over time)
# Early period: JN.1 emerging, XBB.1.5 dominant
# Later period: JN.1 takeover
def assign_variant(date, ct):
= (date - start_date).days
days_since_start
# JN.1 growth advantage
= 0.05 + (days_since_start / 100) * 0.6 # Increases over time
jn1_prob = min(jn1_prob, 0.70)
jn1_prob
# Lower viral load slightly increases chance of JN.1 (immune evasion)
if ct > 28:
+= 0.1
jn1_prob
= np.random.random()
rand if rand < jn1_prob:
return 'JN.1'
elif rand < jn1_prob + 0.25:
return 'XBB.1.5'
elif rand < jn1_prob + 0.40:
return 'EG.5'
else:
return 'BA.2.86'
= [assign_variant(date, ct) for date, ct in zip(dates, ct_values)]
variants
# Sequence quality metrics
= np.random.lognormal(4.5, 0.5, n_samples).clip(10, 5000) # Sequencing depth
coverage_depth = 100 - (1000 / coverage_depth) - np.random.uniform(0, 3, n_samples) # % genome covered
genome_completeness = genome_completeness.clip(70, 100)
genome_completeness
# Create DataFrame
= pd.DataFrame({
genomic_data 'sample_id': [f'SAM{i:04d}' for i in range(n_samples)],
'collection_date': dates,
'county': counties,
'age': ages,
'sex': sex,
'ct_value': ct_values,
'hospitalized': hospitalized,
'variant': variants,
'coverage_depth': coverage_depth,
'genome_completeness': genome_completeness,
})
# Sequence quality flag
'high_quality'] = (
genomic_data['genome_completeness'] > 95) &
(genomic_data['coverage_depth'] > 100)
(genomic_data[
)
print("="*70)
print("GENOMIC SURVEILLANCE DATASET SUMMARY")
print("="*70)
print(f"\nTotal samples sequenced: {len(genomic_data)}")
print(f"Date range: {genomic_data['collection_date'].min().date()} to {genomic_data['collection_date'].max().date()}")
print(f"Duration: {(genomic_data['collection_date'].max() - genomic_data['collection_date'].min()).days} days")
print(f"\n{'Variant Distribution':40s}")
print("-"*70)
= genomic_data['variant'].value_counts()
variant_counts for variant, count in variant_counts.items():
= count / len(genomic_data) * 100
pct print(f"{variant:15s}: {count:4d} samples ({pct:5.1f}%)")
print(f"\n{'Geographic Distribution':40s}")
print("-"*70)
= genomic_data['county'].value_counts()
county_counts for county, count in county_counts.items():
= count / len(genomic_data) * 100
pct print(f"{county:15s}: {count:4d} samples ({pct:5.1f}%)")
print(f"\n{'Sequence Quality':40s}")
print("-"*70)
print(f"High quality (>95% complete, >100x depth): {genomic_data['high_quality'].sum()} ({genomic_data['high_quality'].mean()*100:.1f}%)")
print(f"Mean coverage depth: {genomic_data['coverage_depth'].mean():.0f}x")
print(f"Mean genome completeness: {genomic_data['genome_completeness'].mean():.1f}%")
print(f"\n{'Clinical Characteristics':40s}")
print("-"*70)
print(f"Hospitalized: {genomic_data['hospitalized'].sum()} ({genomic_data['hospitalized'].mean()*100:.1f}%)")
print(f"Mean Ct value: {genomic_data['ct_value'].mean():.1f} (lower = higher viral load)")
print(f"Mean age: {genomic_data['age'].mean():.1f} years")
print(f"\n{'Sample of data':40s}")
print("-"*70)
print(genomic_data.head(10).to_string(index=False))
# Visualize sample distribution over time
= plt.subplots(2, 1, figsize=(14, 10))
fig, axes
# Daily sequencing volume
'date_only'] = genomic_data['collection_date'].dt.date
genomic_data[= genomic_data.groupby('date_only').size()
daily_counts
0].bar(daily_counts.index, daily_counts.values, alpha=0.7, color='steelblue', edgecolor='black')
axes[0].set_xlabel('Collection Date', fontsize=12)
axes[0].set_ylabel('Number of Sequences', fontsize=12)
axes[0].set_title('Daily Sequencing Volume', fontsize=14, fontweight='bold')
axes[0].grid(True, alpha=0.3, axis='y')
axes[0].axhline(daily_counts.mean(), color='red', linestyle='--',
axes[=2, label=f'Mean: {daily_counts.mean():.1f} samples/day')
linewidth0].legend()
axes[
# Sequence quality over time
= genomic_data.set_index('collection_date').resample('W').agg({
weekly_quality 'genome_completeness': 'mean',
'coverage_depth': 'mean',
'high_quality': 'mean'
})
= axes[1]
ax2 = ax2.twinx()
ax2_twin
'genome_completeness'],
ax2.plot(weekly_quality.index, weekly_quality['o-', color='green', linewidth=2, markersize=6, label='Genome Completeness (%)')
'coverage_depth'],
ax2_twin.plot(weekly_quality.index, weekly_quality['s-', color='purple', linewidth=2, markersize=6, label='Coverage Depth (x)')
'Week', fontsize=12)
ax2.set_xlabel('Genome Completeness (%)', fontsize=12, color='green')
ax2.set_ylabel('Coverage Depth', fontsize=12, color='purple')
ax2_twin.set_ylabel('Sequence Quality Metrics Over Time', fontsize=14, fontweight='bold')
ax2.set_title(True, alpha=0.3)
ax2.grid(='upper left')
ax2.legend(loc='upper right')
ax2_twin.legend(loc
plt.tight_layout()'genomic_surveillance_overview.png', dpi=300, bbox_inches='tight')
plt.savefig( plt.show()
8.4.2 Example 2: Tracking Variant Emergence and Spread
Key question: Is a new variant taking over? How fast is it spreading?
# Calculate variant frequencies by week
'week'] = genomic_data['collection_date'].dt.to_period('W')
genomic_data[= genomic_data.groupby(['week', 'variant']).size().reset_index(name='count')
variant_by_week 'total'] = variant_by_week.groupby('week')['count'].transform('sum')
variant_by_week['frequency'] = variant_by_week['count'] / variant_by_week['total']
variant_by_week[
# Convert Period to timestamp for plotting
'week_start'] = variant_by_week['week'].dt.to_timestamp()
variant_by_week[
# Visualize variant dynamics
= plt.subplots(2, 1, figsize=(14, 12))
fig, axes
# Top: Stacked area plot
= variant_by_week.pivot(index='week_start', columns='variant', values='frequency').fillna(0)
variant_pivot 0].stackplot(variant_pivot.index,
axes[*[variant_pivot[col] for col in variant_pivot.columns],
=variant_pivot.columns,
labels=0.8)
alpha0].set_ylabel('Proportion of Sequences', fontsize=12)
axes[0].set_title('Variant Frequencies Over Time (Stacked)', fontsize=14, fontweight='bold')
axes[0].legend(loc='upper left', fontsize=10)
axes[0].set_ylim(0, 1)
axes[0].grid(True, alpha=0.3, axis='y')
axes[
# Bottom: Individual variant trajectories
for variant in variant_by_week['variant'].unique():
= variant_by_week[variant_by_week['variant'] == variant]
data 1].plot(data['week_start'], data['frequency'],
axes[='o', linewidth=2, markersize=6, label=variant, alpha=0.8)
marker
1].set_xlabel('Week', fontsize=12)
axes[1].set_ylabel('Proportion of Sequences', fontsize=12)
axes[1].set_title('Individual Variant Trajectories', fontsize=14, fontweight='bold')
axes[1].legend(fontsize=10, loc='best')
axes[1].set_ylim(0, 1)
axes[1].grid(True, alpha=0.3)
axes[
plt.tight_layout()'variant_dynamics.png', dpi=300, bbox_inches='tight')
plt.savefig(
plt.show()
# Statistical analysis: Growth rate estimation
print("\n" + "="*70)
print("VARIANT GROWTH ANALYSIS")
print("="*70)
# Compare first vs. last month
= genomic_data[genomic_data['collection_date'] < start_date + timedelta(days=30)]
first_month = genomic_data[genomic_data['collection_date'] >= genomic_data['collection_date'].max() - timedelta(days=30)]
last_month
for variant in genomic_data['variant'].unique():
= (first_month['variant'] == variant).mean()
freq_early = (last_month['variant'] == variant).mean()
freq_late
= freq_late - freq_early
change = freq_late / freq_early if freq_early > 0 else float('inf')
fold_change
# Estimate daily growth advantage
= (last_month['collection_date'].mean() - first_month['collection_date'].mean()).days
days if freq_early > 0 and freq_late > 0:
= (np.log(freq_late) - np.log(freq_early)) / days
growth_rate = np.log(2) / growth_rate if growth_rate > 0 else float('inf')
doubling_time else:
= None
growth_rate = None
doubling_time
= "📈 INCREASING" if change > 0.05 else "📉 DECREASING" if change < -0.05 else "→ STABLE"
status
print(f"\n{variant}:")
print(f" First month frequency: {freq_early:.1%}")
print(f" Last month frequency: {freq_late:.1%}")
print(f" Absolute change: {change:+.1%}")
print(f" Fold change: {fold_change:.2f}x")
if growth_rate is not None and growth_rate > 0:
print(f" Growth rate: {growth_rate:.4f} per day ({growth_rate*7:.2%} per week)")
print(f" Doubling time: {doubling_time:.1f} days")
print(f" Status: {status}")
# Project future frequencies (simple exponential model)
print("\n" + "="*70)
print("PROJECTION (4 weeks ahead, assuming constant growth rates)")
print("="*70)
print("⚠️ WARNING: Assumes no new variants, no interventions, constant growth")
for variant in genomic_data['variant'].unique():
= (last_month['variant'] == variant).mean()
current_freq
# Estimate growth rate from last 60 days
= genomic_data[genomic_data['collection_date'] >= genomic_data['collection_date'].max() - timedelta(days=60)]
recent_data 'days_from_end'] = (recent_data['collection_date'].max() - recent_data['collection_date']).dt.days
recent_data[
# Simple exponential fit
= recent_data.groupby('days_from_end').apply(
recent_variant_freq lambda x: (x['variant'] == variant).mean()
='frequency')
).reset_index(name
if len(recent_variant_freq) > 5 and recent_variant_freq['frequency'].max() > 0.01:
# Fit log-linear model
from scipy.stats import linregress
# Remove zeros for log
= recent_variant_freq[recent_variant_freq['frequency'] > 0].copy()
nonzero
if len(nonzero) > 3:
'log_freq'] = np.log(nonzero['frequency'])
nonzero[= linregress(-nonzero['days_from_end'], nonzero['log_freq'])
slope, intercept, r_value, _, _
# Project 28 days ahead
= intercept + slope * 28
projected_log_freq = np.exp(projected_log_freq)
projected_freq = min(projected_freq, 0.99) # Cap at 99%
projected_freq
print(f"{variant}: {current_freq:.1%} → {projected_freq:.1%} (R²={r_value**2:.2f})")
Growth advantage = How much faster one variant spreads than another
Key metrics:
Absolute frequency change: Simple difference (JN.1: 10% → 40% = +30%)
Fold change: Ratio (JN.1: 10% → 40% = 4x increase)
Growth rate (r): Daily exponential growth (\(r\) per day)
Doubling time: \(\ln(2) / r\) (how many days to double frequency)
Generation time advantage: How much more transmissible
Example: - Variant A: Doubling time = 10 days - Variant B: Doubling time = 7 days - Variant B has ~40% transmission advantage (\(R_{relative} \approx\) 1.4)
Caveats: - Assumes no interventions (lockdowns, behavior changes) - Assumes no new variants emerge - Geographic structure matters (variant may saturate locally) - Immunity levels affect growth
For rigorous methods, see Volz et al., 2021, Cell on Alpha variant growth analysis.
8.5 Phylogenetic Analysis: Reconstructing Transmission
Phylogenetics allows us to infer who infected whom without contact tracing data.
8.5.1 Example 3: Building Phylogenetic Trees for Outbreak Investigation
Scenario: A nursing home reports 9 COVID-19 cases. Traditional contact tracing is difficult (asymptomatic transmission, poor recall). You sequence all 9 cases plus 2 recent community cases to determine: Are these linked? One introduction or multiple?
A similar approach was used in Nebraska’s first Omicron investigation where genomic sequencing of a household cluster of 6 cases linked to international travel enabled rapid characterization of the variant’s transmission dynamics, incubation period, and clinical presentation. This investigation provided some of the earliest U.S. data on Omicron’s characteristics.
from scipy.cluster.hierarchy import dendrogram, linkage
from scipy.spatial.distance import squareform
import numpy as np
import matplotlib.pyplot as plt
# Simulate genomic distances (SNP differences)
# In reality, you'd align sequences with MAFFT and calculate pairwise distances
= {
sequences 'Resident_1': 0, # Reference
'Resident_2': 0, # Identical to Resident_1
'Resident_3': 1, # 1 SNP different from cluster 1
'Resident_4': 2, # 2 SNPs different from cluster 1
'Resident_5': 12, # Different cluster
'Resident_6': 13, # Related to Resident_5
'Resident_7': 14, # Related to Residents 5-6
'Resident_8': 13, # Related to cluster 2
'Resident_9': 14, # Related to cluster 2
'Community_A': 1, # Matches cluster 1
'Community_B': 12, # Matches cluster 2
}
# Create distance matrix
= len(sequences)
n = list(sequences.keys())
names = np.zeros((n, n))
distance_matrix
# Calculate pairwise distances (simplified)
for i in range(n):
for j in range(n):
if i != j:
= abs(sequences[names[i]] - sequences[names[j]])
distance_matrix[i, j]
# Display distance matrix
= pd.DataFrame(
distance_df
distance_matrix,=names,
index=names
columns
)
print("="*70)
print("PAIRWISE SNP DISTANCE MATRIX")
print("="*70)
print(distance_df.to_string())
# Identify clusters (using simple threshold)
print("\n" + "="*70)
print("CLUSTER ANALYSIS")
print("="*70)
= 3 # <3 SNPs = likely linked
threshold print(f"Threshold for linkage: <{threshold} SNPs\n")
= {}
clusters = 1
cluster_id
for i, name_i in enumerate(names):
if any(name_i in cluster for cluster in clusters.values()):
continue # Already assigned
= [name_i]
cluster_members
for j, name_j in enumerate(names):
if i != j and name_j not in cluster_members:
if distance_matrix[i, j] < threshold:
cluster_members.append(name_j)
if len(cluster_members) > 1:
f'Cluster_{cluster_id}'] = cluster_members
clusters[+= 1
cluster_id
for cluster_name, members in clusters.items():
print(f"{cluster_name}:")
for member in members:
= "👤 Resident" if member.startswith('Resident') else "🏙️ Community"
resident_or_community print(f" - {member:15s} {resident_or_community}")
print()
# Interpretation
print("="*70)
print("PUBLIC HEALTH INTERPRETATION")
print("="*70)
print(f"\nNumber of distinct introductions: {len(clusters)}")
print("\nCluster 1 (Residents 1-4 + Community_A):")
print(" • 5 cases, 0-2 SNPs apart")
print(" • Most likely source: Community_A")
print(" • Inference: Single introduction from community → spread within facility")
print("\nCluster 2 (Residents 5-9 + Community_B):")
print(" • 6 cases, 0-2 SNPs apart within cluster")
print(" • Most likely source: Community_B")
print(" • Inference: Separate introduction → independent transmission chain")
print("\nConclusion:")
print(" 🔴 TWO independent introductions")
print(" 🔴 Both resulted in facility transmission")
print(" 🔴 Ongoing transmission risk")
print("\nRecommended actions:")
print(" 1. Enhanced infection control (both clusters active)")
print(" 2. Investigate entry points (staff? visitors?)")
print(" 3. Cohort cases by cluster to prevent mixing")
print(" 4. Increased testing frequency (find additional cases early)")
# Hierarchical clustering visualization
= plt.subplots(1, 2, figsize=(16, 6))
fig, (ax1, ax2)
# Convert to condensed distance matrix for linkage
from scipy.spatial.distance import squareform
= squareform(distance_matrix)
condensed_dist
# Perform hierarchical clustering
= linkage(condensed_dist, method='average')
linkage_matrix
# Dendrogram
=names, ax=ax1, leaf_font_size=10,
dendrogram(linkage_matrix, labels=threshold, above_threshold_color='gray')
color_threshold=threshold, color='red', linestyle='--', linewidth=2,
ax1.axhline(y=f'Linkage threshold ({threshold} SNPs)')
label'Sample', fontsize=12)
ax1.set_xlabel('SNP Distance', fontsize=12)
ax1.set_ylabel('Phylogenetic Tree (Dendrogram)', fontsize=14, fontweight='bold')
ax1.set_title(
ax1.legend()True, alpha=0.3, axis='y')
ax1.grid(
# Heatmap of distances
= ax2.imshow(distance_matrix, cmap='YlOrRd', aspect='auto')
im range(n))
ax2.set_xticks(range(n))
ax2.set_yticks(=45, ha='right')
ax2.set_xticklabels(names, rotation
ax2.set_yticklabels(names)'SNP Distance Heatmap', fontsize=14, fontweight='bold')
ax2.set_title(
# Add text annotations
for i in range(n):
for j in range(n):
= ax2.text(j, i, f'{int(distance_matrix[i, j])}',
text ="center", va="center", color="black" if distance_matrix[i, j] < 8 else "white",
ha=8)
fontsize
=ax2, label='SNP Distance')
plt.colorbar(im, ax
plt.tight_layout()'phylogenetic_outbreak_analysis.png', dpi=300, bbox_inches='tight')
plt.savefig( plt.show()
How close is “close enough” to infer transmission?
SARS-CoV-2: - 0-2 SNPs: Likely direct transmission or very recent common ancestor - 3-5 SNPs: Possibly linked through 1-2 intermediates - >5 SNPs: Likely unrelated or distantly related
Mycobacterium tuberculosis: - 0-5 SNPs: Likely recent transmission (<3 years) - 6-12 SNPs: Possible transmission (>3 years ago) - >12 SNPs: Likely not recently transmitted
Influenza: - 0-3 nucleotides: Direct transmission likely - 4-10 nucleotides: Possibly linked - >10 nucleotides: Likely independent
Critical consideration: Thresholds depend on: - Mutation rate of pathogen - Time between samples - Within-host diversity - Sampling intensity
For TB-specific thresholds, see Walker et al., 2013, Nature Genetics.
For SARS-CoV-2, see Aggarwal et al., 2022, Science.
8.5.2 Example 4: Mutation Detection and Variant Characterization
Scenario: Your lab sequences a sample with unusual mutations in the spike protein. You need to identify which mutations are present and assess their significance.
# SARS-CoV-2 spike protein reference (Wuhan-Hu-1)
# Simplified: focusing on key positions only
= {
reference_spike 'position': [417, 452, 478, 484, 501, 614, 681, 950],
'reference_aa': ['K', 'L', 'T', 'E', 'N', 'D', 'P', 'N'],
'domain': ['RBD', 'RBD', 'RBD', 'RBD', 'RBD', 'Outside RBD', 'Furin site', 'Other'],
}
# New sample with multiple mutations
= {
sample_spike 'position': [417, 452, 478, 484, 501, 614, 681, 950],
'sample_aa': ['N', 'R', 'K', 'A', 'Y', 'G', 'H', 'N'],
'domain': ['RBD', 'RBD', 'RBD', 'RBD', 'RBD', 'Outside RBD', 'Furin site', 'Other'],
}
# Detect mutations
= []
mutations for i in range(len(reference_spike['position'])):
= reference_spike['reference_aa'][i]
ref_aa = sample_spike['sample_aa'][i]
sample_aa = reference_spike['position'][i]
pos = reference_spike['domain'][i]
domain
if ref_aa != sample_aa:
= f"{ref_aa}{pos}{sample_aa}"
mutation_name
mutations.append({'mutation': mutation_name,
'position': pos,
'domain': domain,
'reference': ref_aa,
'sample': sample_aa
})
= pd.DataFrame(mutations)
mutations_df
print("="*70)
print("DETECTED SPIKE PROTEIN MUTATIONS")
print("="*70)
print(mutations_df.to_string(index=False))
# Known mutations of concern (from literature)
= {
mutations_of_concern 'K417N': {
'effect': 'Reduces antibody binding',
'variants': 'Beta, Omicron BA.1',
'severity': 'HIGH',
'evidence': 'https://www.nature.com/articles/s41586-021-03398-2'
},'K417T': {
'effect': 'Reduces antibody binding',
'variants': 'Omicron BA.2',
'severity': 'HIGH',
'evidence': 'Nature (2022)'
},'L452R': {
'effect': 'Increases transmissibility, immune evasion',
'variants': 'Delta, Omicron sub-lineages',
'severity': 'HIGH',
'evidence': 'https://www.cell.com/cell/fulltext/S0092-8674(21)00755-8'
},'T478K': {
'effect': 'Immune evasion, enhanced RBD stability',
'variants': 'Delta, Omicron',
'severity': 'MODERATE',
'evidence': 'Science (2021)'
},'E484A': {
'effect': 'Significant immune evasion',
'variants': 'Omicron BA.2, BA.4, BA.5',
'severity': 'HIGH',
'evidence': 'https://www.nature.com/articles/s41586-022-04594-4'
},'E484K': {
'effect': 'Antibody escape',
'variants': 'Beta, Gamma, some Omicron sub-lineages',
'severity': 'HIGH',
'evidence': 'Cell (2021)'
},'N501Y': {
'effect': 'Enhanced ACE2 binding affinity (~10x)',
'variants': 'Alpha, Beta, Gamma, Omicron',
'severity': 'HIGH',
'evidence': 'https://www.nature.com/articles/s41586-021-03402-9'
},'D614G': {
'effect': 'Increased transmissibility (~50%)',
'variants': 'All major variants (universal)',
'severity': 'HIGH',
'evidence': 'https://www.science.org/doi/10.1126/science.abe8499'
},'P681H': {
'effect': 'Enhanced furin cleavage',
'variants': 'Alpha, Omicron BA.1',
'severity': 'MODERATE',
'evidence': 'Nature (2021)'
},'P681R': {
'effect': 'Enhanced furin cleavage (more than P681H)',
'variants': 'Delta',
'severity': 'HIGH',
'evidence': 'Nature (2021)'
},
}
# Assess mutations
print("\n" + "="*70)
print("MUTATION SIGNIFICANCE ASSESSMENT")
print("="*70)
= 0
high_concern_count = 0
moderate_concern_count = 0
unknown_count
for _, row in mutations_df.iterrows():
= row['mutation']
mut
if mut in mutations_of_concern:
= mutations_of_concern[mut]
info
if info['severity'] == 'HIGH':
= "🔴"
emoji += 1
high_concern_count else:
= "🟡"
emoji += 1
moderate_concern_count
print(f"\n{emoji} {mut} - **{info['severity']} CONCERN**")
print(f" Domain: {row['domain']}")
print(f" Effect: {info['effect']}")
print(f" Found in: {info['variants']}")
print(f" Evidence: {info['evidence']}")
else:
print(f"\n⚪ {mut} - Unknown/Novel")
print(f" Domain: {row['domain']}")
print(f" Status: Not previously characterized as mutation of concern")
+= 1
unknown_count
# Summary and recommended action
print("\n" + "="*70)
print("SUMMARY AND RECOMMENDED ACTION")
print("="*70)
print(f"\nTotal mutations detected: {len(mutations)}")
print(f" 🔴 High concern: {high_concern_count}")
print(f" 🟡 Moderate concern: {moderate_concern_count}")
print(f" ⚪ Unknown/Novel: {unknown_count}")
print("\nKey findings:")
print(f" • Multiple RBD mutations detected ({mutations_df[mutations_df['domain']=='RBD'].shape[0]} total)")
print(f" • Includes immune evasion mutations (E484A, K417N)")
print(f" • Includes enhanced binding mutations (N501Y)")
print(f" • Universal transmissibility mutation present (D614G)")
print(f" • Furin cleavage site modification (P681H)")
print("\n⚠️ This constellation of mutations suggests:")
print(" 1. Potential new Variant Under Investigation (VUI)")
print(" 2. Likely significant immune evasion capability")
print(" 3. Enhanced transmissibility expected")
print(" 4. May impact vaccine effectiveness")
print("\n📋 RECOMMENDED ACTIONS:")
print(" 1. Immediate notification to state epidemiologist")
print(" 2. Upload sequence to GISAID with appropriate flags")
print(" 3. Enhanced surveillance for additional cases")
print(" 4. Consider whole genome sequencing for full characterization")
print(" 5. Report to CDC as potential Variant of Concern (VOC)")
print(" 6. Investigate patient travel history and contacts")
print(" 7. Assess local vaccination rates and breakthrough infection status")
# Visualize mutation locations
= plt.subplots(figsize=(14, 6))
fig, ax
# Spike protein domains (simplified)
= 1273 # amino acids
spike_length = {
domains 'NTD': (14, 305, 'lightblue'),
'RBD': (319, 541, 'salmon'),
'SD1': (542, 685, 'lightgreen'),
'SD2': (686, 815, 'lightyellow'),
'HR1': (916, 1163, 'plum'),
}
# Draw domains
= 0.3
y_height for domain_name, (start, end, color) in domains.items():
= end - start
width = plt.Rectangle((start, 0.35), width, y_height,
rect =color, edgecolor='black', linewidth=2)
facecolor
ax.add_patch(rect)+ width/2, 0.5, domain_name,
ax.text(start ='center', va='center', fontsize=11, fontweight='bold')
ha
# Mark mutations
for _, row in mutations_df.iterrows():
= row['position']
pos = row['mutation']
mut
# Color by severity
if mut in mutations_of_concern:
= mutations_of_concern[mut]['severity']
severity = 'red' if severity == 'HIGH' else 'orange'
color = 'v'
marker = 200
size else:
= 'gray'
color = 'o'
marker = 100
size
0.8], s=size, c=color, marker=marker,
ax.scatter([pos], [='black', linewidth=2, zorder=10)
edgecolors0.95, mut, rotation=45, ha='right', fontsize=9, fontweight='bold')
ax.text(pos,
0, spike_length)
ax.set_xlim(0, 1.2)
ax.set_ylim('Amino Acid Position', fontsize=12)
ax.set_xlabel('SARS-CoV-2 Spike Protein: Detected Mutations', fontsize=14, fontweight='bold')
ax.set_title(
ax.set_yticks([])True, alpha=0.3, axis='x')
ax.grid(
# Legend
from matplotlib.patches import Patch
= [
legend_elements ='red', label='High Concern Mutation'),
Patch(facecolor='orange', label='Moderate Concern Mutation'),
Patch(facecolor='gray', label='Unknown Mutation'),
Patch(facecolor='salmon', label='RBD (Receptor Binding Domain)'),
Patch(facecolor
]=legend_elements, loc='upper right', fontsize=10)
ax.legend(handles
plt.tight_layout()'spike_mutations_visualization.png', dpi=300, bbox_inches='tight')
plt.savefig( plt.show()
8.6 Machine Learning for Genotype-to-Phenotype Prediction
One of the most powerful applications of AI in genomics: predicting phenotype (drug resistance, virulence, transmissibility) directly from genotype.
8.6.1 Example 5: Predicting Antimicrobial Resistance from Bacterial Genomes
Scenario: You’re analyzing Mycobacterium tuberculosis genomes to predict rifampicin resistance. Traditional culture-based drug susceptibility testing (DST) takes 6-8 weeks. Genomic prediction can provide results in days.
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score, roc_curve
import warnings
'ignore')
warnings.filterwarnings(
# Simulate TB genomic data
# In practice, this would come from whole genome sequencing
42)
np.random.seed(= 1200
n_samples
# Known rifampicin resistance-associated genes/mutations
= {
resistance_markers 'rpoB_S450L': 'High confidence RIF resistance',
'rpoB_H445Y': 'High confidence RIF resistance',
'rpoB_H445D': 'High confidence RIF resistance',
'rpoB_D435V': 'High confidence RIF resistance',
'rpoB_L452P': 'Moderate confidence RIF resistance',
'rpoB_I491F': 'Moderate confidence RIF resistance',
'rpoB_Q513K': 'Moderate confidence RIF resistance',
'rpoB_S441L': 'Moderate confidence RIF resistance',
# Other resistance genes (not RIF, but correlated with MDR-TB)
'katG_S315T': 'Isoniazid resistance',
'inhA_C15T': 'Isoniazid resistance',
'pncA_H57D': 'Pyrazinamide resistance',
'embB_M306I': 'Ethambutol resistance',
'gyrA_A90V': 'Fluoroquinolone resistance',
'gyrA_D94G': 'Fluoroquinolone resistance',
'rrs_A1401G': 'Aminoglycoside resistance',
}
# Generate features (presence/absence of mutations)
= {}
features for marker in resistance_markers.keys():
# Higher prevalence for actual RIF resistance markers
if 'rpoB' in marker:
= 0.25 if 'High confidence' in resistance_markers[marker] else 0.15
prevalence else:
= 0.20 # Other resistance markers
prevalence
= np.random.binomial(1, prevalence, n_samples)
features[marker]
# Target: Rifampicin resistance
# Strongly associated with rpoB mutations, some other markers due to MDR-TB
= np.zeros(n_samples, dtype=int)
rif_resistance
for i in range(n_samples):
# High confidence rpoB mutations → almost always resistant
if (features['rpoB_S450L'][i] == 1 or
'rpoB_H445Y'][i] == 1 or
features['rpoB_H445D'][i] == 1 or
features['rpoB_D435V'][i] == 1):
features[= 1 if np.random.random() < 0.95 else 0
rif_resistance[i]
# Moderate confidence rpoB mutations
elif (features['rpoB_L452P'][i] == 1 or
'rpoB_I491F'][i] == 1 or
features['rpoB_Q513K'][i] == 1 or
features['rpoB_S441L'][i] == 1):
features[= 1 if np.random.random() < 0.70 else 0
rif_resistance[i]
# MDR-TB: multiple resistance markers suggests RIF resistance
elif sum([features[marker][i] for marker in features.keys() if 'rpoB' not in marker]) >= 3:
= 1 if np.random.random() < 0.60 else 0
rif_resistance[i]
# Random resistance (other mechanisms not captured)
else:
= 1 if np.random.random() < 0.08 else 0
rif_resistance[i]
# Create DataFrame
= pd.DataFrame(features)
tb_data 'rifampicin_resistant'] = rif_resistance
tb_data[
print("="*70)
print("TUBERCULOSIS GENOMIC DATASET")
print("="*70)
print(f"\nTotal isolates: {n_samples}")
print(f"Rifampicin resistance prevalence: {rif_resistance.mean():.1%}")
print(f"\nResistance marker prevalence:")
= tb_data.drop('rifampicin_resistant', axis=1).mean().sort_values(ascending=False)
marker_prev for marker, prev in marker_prev.head(8).items():
print(f" {marker:20s}: {prev:.1%}")
print(f"\nSample data (first 5 isolates):")
print(tb_data.head().to_string(index=False))
# Split data
= tb_data.drop('rifampicin_resistant', axis=1)
X = tb_data['rifampicin_resistant']
y
= train_test_split(
X_train, X_test, y_train, y_test =0.25, random_state=42, stratify=y
X, y, test_size
)
print(f"\nTraining set: {len(X_train)} isolates")
print(f"Test set: {len(X_test)} isolates")
# Train multiple models
= {
models 'Logistic Regression': LogisticRegression(random_state=42, max_iter=1000),
'Random Forest': RandomForestClassifier(n_estimators=100, random_state=42, max_depth=5),
'Gradient Boosting': GradientBoostingClassifier(n_estimators=100, random_state=42, max_depth=3)
}
= {}
results
print("\n" + "="*70)
print("MODEL TRAINING AND EVALUATION")
print("="*70)
for model_name, model in models.items():
print(f"\n{'='*70}")
print(f"{model_name}")
print(f"{'='*70}")
# Train
model.fit(X_train, y_train)
# Predictions
= model.predict(X_test)
y_pred = model.predict_proba(X_test)[:, 1]
y_pred_proba
# Metrics
= roc_auc_score(y_test, y_pred_proba)
auc
# Classification report
print(classification_report(y_test, y_pred,
=['Susceptible', 'Resistant'],
target_names=3))
digits
print(f"AUC-ROC: {auc:.3f}")
# Confusion matrix
= confusion_matrix(y_test, y_pred)
cm
# Clinical metrics
= cm[1,1] / (cm[1,1] + cm[1,0])
sensitivity = cm[0,0] / (cm[0,0] + cm[0,1])
specificity = cm[1,1] / (cm[1,1] + cm[0,1]) if (cm[1,1] + cm[0,1]) > 0 else 0
ppv = cm[0,0] / (cm[0,0] + cm[1,0]) if (cm[0,0] + cm[1,0]) > 0 else 0
npv
= {
results[model_name] 'auc': auc,
'sensitivity': sensitivity,
'specificity': specificity,
'ppv': ppv,
'npv': npv,
'cm': cm,
'model': model
}
print(f"\nClinical Performance Metrics:")
print(f" Sensitivity: {sensitivity:.1%} (detects {sensitivity:.1%} of resistant cases)")
print(f" Specificity: {specificity:.1%} (correctly identifies {specificity:.1%} of susceptible cases)")
print(f" PPV: {ppv:.1%} (if predicted resistant, {ppv:.1%} actually are)")
print(f" NPV: {npv:.1%} (if predicted susceptible, {npv:.1%} actually are)")
# Select best model (highest AUC)
= max(results, key=lambda x: results[x]['auc'])
best_model_name = results[best_model_name]['model']
best_model
print("\n" + "="*70)
print(f"BEST MODEL: {best_model_name}")
print("="*70)
# Feature importance
if hasattr(best_model, 'feature_importances_'):
= pd.DataFrame({
feature_importance 'marker': X.columns,
'importance': best_model.feature_importances_
'importance', ascending=False)
}).sort_values(
print("\nTop 10 Most Important Genetic Markers:")
print(feature_importance.head(10).to_string(index=False))
# Visualizations
= plt.subplots(2, 2, figsize=(16, 12))
fig, axes
# Top-left: ROC curves
for model_name, result in results.items():
= result['model'].predict_proba(X_test)[:, 1]
y_pred_proba = roc_curve(y_test, y_pred_proba)
fpr, tpr, _ 0,0].plot(fpr, tpr, linewidth=2, label=f'{model_name} (AUC={result["auc"]:.3f})')
axes[
0,0].plot([0, 1], [0, 1], 'k--', linewidth=2, label='Random (AUC=0.5)')
axes[0,0].set_xlabel('False Positive Rate', fontsize=12)
axes[0,0].set_ylabel('True Positive Rate (Sensitivity)', fontsize=12)
axes[0,0].set_title('ROC Curves: Model Comparison', fontsize=14, fontweight='bold')
axes[0,0].legend(fontsize=10)
axes[0,0].grid(True, alpha=0.3)
axes[
# Top-right: Confusion matrix (best model)
= results[best_model_name]['cm']
best_cm =True, fmt='d', cmap='Blues', ax=axes[0,1],
sns.heatmap(best_cm, annot=['Predicted Susceptible', 'Predicted Resistant'],
xticklabels=['Actually Susceptible', 'Actually Resistant'],
yticklabels={'label': 'Count'})
cbar_kws0,1].set_title(f'Confusion Matrix: {best_model_name}', fontsize=14, fontweight='bold')
axes[
# Bottom-left: Feature importance
if hasattr(best_model, 'feature_importances_'):
= feature_importance.head(10)
top_features 1,0].barh(range(len(top_features)), top_features['importance'], color='steelblue', edgecolor='black')
axes[1,0].set_yticks(range(len(top_features)))
axes[1,0].set_yticklabels(top_features['marker'])
axes[1,0].set_xlabel('Importance', fontsize=12)
axes[1,0].set_title('Top 10 Predictive Genetic Markers', fontsize=14, fontweight='bold')
axes[1,0].grid(True, alpha=0.3, axis='x')
axes[1,0].invert_yaxis()
axes[
# Bottom-right: Clinical performance comparison
= ['Sensitivity', 'Specificity', 'PPV', 'NPV']
metrics = list(results.keys())
model_names = np.arange(len(metrics))
x = 0.25
width
for i, model_name in enumerate(model_names):
= [
values 'sensitivity'],
results[model_name]['specificity'],
results[model_name]['ppv'],
results[model_name]['npv']
results[model_name][
]1,1].bar(x + i*width, values, width, label=model_name, edgecolor='black')
axes[
1,1].set_ylabel('Score', fontsize=12)
axes[1,1].set_title('Clinical Performance Comparison', fontsize=14, fontweight='bold')
axes[1,1].set_xticks(x + width)
axes[1,1].set_xticklabels(metrics)
axes[1,1].legend(fontsize=9)
axes[1,1].set_ylim(0, 1.05)
axes[1,1].grid(True, alpha=0.3, axis='y')
axes[
plt.tight_layout()'amr_prediction_evaluation.png', dpi=300, bbox_inches='tight')
plt.savefig(
plt.show()
# Clinical use case example
print("\n" + "="*70)
print("CLINICAL USE CASE EXAMPLE")
print("="*70)
print("\n📋 Patient Case:")
print(" • 45-year-old with pulmonary TB")
print(" • Previously treated 5 years ago")
print(" • GeneXpert MTB/RIF: MTB detected")
print(" • Sputum culture pending (results in 6-8 weeks)")
print(" • WGS completed in 48 hours")
# Simulate a test case
= pd.DataFrame({
test_case 'rpoB_S450L': [1], # HIGH CONFIDENCE RIF RESISTANCE
'rpoB_H445Y': [0],
'rpoB_H445D': [0],
'rpoB_D435V': [0],
'rpoB_L452P': [0],
'rpoB_I491F': [0],
'rpoB_Q513K': [0],
'rpoB_S441L': [0],
'katG_S315T': [1], # ISONIAZID RESISTANCE
'inhA_C15T': [0],
'pncA_H57D': [0],
'embB_M306I': [0],
'gyrA_A90V': [0],
'gyrA_D94G': [0],
'rrs_A1401G': [0],
})
# Predict
= best_model.predict(test_case)[0]
test_prediction = best_model.predict_proba(test_case)[0, 1]
test_proba
print("\n🧬 Genomic Analysis Results:")
print(f" • rpoB S450L mutation: DETECTED ⚠️")
print(f" • katG S315T mutation: DETECTED ⚠️")
print(f"\n🤖 Machine Learning Prediction:")
print(f" • Model: {best_model_name}")
print(f" • Prediction: {'RESISTANT' if test_prediction == 1 else 'SUSCEPTIBLE'}")
print(f" • Confidence: {test_proba:.1%}")
print("\n💊 Clinical Interpretation:")
print(" • Genomic evidence of rifampicin resistance (rpoB S450L)")
print(" • Genomic evidence of isoniazid resistance (katG S315T)")
print(" • Diagnosis: Likely Multidrug-Resistant TB (MDR-TB)")
print("\n📋 Recommended Treatment:")
print(" • DO NOT use rifampicin or isoniazid (first-line therapy)")
print(" • Initiate MDR-TB regimen:")
print(" - Bedaquiline")
print(" - Linezolid")
print(" - Levofloxacin")
print(" - Cycloserine")
print(" - Pyrazinamide (test susceptibility)")
print(" • Await culture DST for full susceptibility profile")
print(" • Isolate patient (infection control for MDR-TB)")
print("\n⏱️ Time Saved:")
print(" • Traditional DST: 6-8 weeks")
print(" • Genomic prediction: 2-3 days")
print(" • **Time saved: 5-7 weeks of inappropriate therapy avoided**")
The CRyPTIC Consortium (New England Journal of Medicine, 2022) sequenced >10,000 M. tuberculosis isolates and performed drug susceptibility testing for 13 antibiotics.
Key findings:
Predictive accuracy for rifampicin resistance: - Sensitivity: 97.1% - Specificity: 99.0% - PPV: 94.1% - NPV: 99.4%
For first-line drugs (rifampicin, isoniazid, ethambutol, pyrazinamide): - Genomic prediction as accurate as culture DST - 5-7 weeks faster than traditional methods - Cost: ~$100-200 vs. $1000+ for comprehensive DST
Clinical impact: - Earlier appropriate treatment - Reduced transmission of drug-resistant TB - Better patient outcomes
Limitations: - Some resistance mechanisms not yet understood (especially for newer drugs) - Heteroresistance (mixed resistant/susceptible populations) harder to detect - Requires WGS infrastructure (not available everywhere)
This represents a paradigm shift in TB diagnosis: from culture-based phenotypic testing to genomics-based prediction.
8.7 Real-World Platforms and Tools
8.7.1 Nextstrain: Open-Source Genomic Epidemiology
Nextstrain (Bedford et al., 2018, Bioinformatics) provides real-time tracking of pathogen evolution.
Key features:
Interactive phylogenetic trees: - Zoom, pan, filter by location/time - Color by country, clade, genotype - Time-resolved trees (branches = time)
Geographic visualization: - Map of where sequences were collected - Migration patterns between regions
Mutation tracking: - Which mutations are increasing - Frequency trajectories over time
Available pathogens: - SARS-CoV-2 (updated daily) - Influenza A & B - Ebola - Zika - Measles - Dengue - West Nile virus - And 20+ more
8.7.2 Example 6: Analyzing Nextstrain-Style Data
# Simulate Nextstrain-style dataset
42)
np.random.seed(= 300
n_sequences
# Generate metadata
= datetime(2024, 1, 1)
start_date = [start_date + timedelta(days=int(x)) for x in np.random.exponential(25, n_sequences)]
dates = sorted(dates)
dates
= np.random.choice(
countries 'USA', 'UK', 'Germany', 'France', 'Japan', 'Australia', 'Brazil', 'South Africa'],
[
n_sequences,=[0.30, 0.15, 0.10, 0.10, 0.10, 0.08, 0.10, 0.07]
p
)
# Nextstrain clades (simplified)
# Early: 24A dominant
# Middle: 24B emerges and grows
# Late: 24C appears
def assign_clade(date):
= (date - start_date).days
days
if days < 30:
return np.random.choice(['24A', '24B'], p=[0.95, 0.05])
elif days < 60:
return np.random.choice(['24A', '24B', '24C'], p=[0.40, 0.55, 0.05])
else:
return np.random.choice(['24A', '24B', '24C'], p=[0.10, 0.50, 0.40])
= [assign_clade(date) for date in dates]
clades
# Mutations accumulate over time
= [25 + int((date - start_date).days / 5) + np.random.randint(-3, 4)
num_mutations for date in dates]
= pd.DataFrame({
nextstrain_data 'strain': [f'{country}/SEQ-{i:04d}/2024' for i, country in enumerate(countries)],
'date': dates,
'country': countries,
'clade': clades,
'num_mutations': num_mutations,
})
print("="*70)
print("NEXTSTRAIN-STYLE GENOMIC SURVEILLANCE DATA")
print("="*70)
print(f"\nTotal sequences: {len(nextstrain_data)}")
print(f"Date range: {nextstrain_data['date'].min().date()} to {nextstrain_data['date'].max().date()}")
print(f"Countries represented: {nextstrain_data['country'].nunique()}")
print("\nClade distribution:")
print(nextstrain_data['clade'].value_counts().to_string())
# Geographic distribution by clade
print("\n" + "="*70)
print("GEOGRAPHIC DISTRIBUTION BY CLADE")
print("="*70)
= nextstrain_data.groupby(['country', 'clade']).size().unstack(fill_value=0)
geo_clade 'total'] = geo_clade.sum(axis=1)
geo_clade[= geo_clade.div(geo_clade['total'], axis=0).drop('total', axis=1) * 100
geo_clade_pct
print(geo_clade_pct.round(1).to_string())
# Temporal evolution
print("\n" + "="*70)
print("TEMPORAL CLADE EVOLUTION")
print("="*70)
'month'] = nextstrain_data['date'].dt.to_period('M')
nextstrain_data[= nextstrain_data.groupby(['month', 'clade']).size().unstack(fill_value=0)
temporal_clade = temporal_clade.div(temporal_clade.sum(axis=1), axis=0) * 100
temporal_clade_pct
print(temporal_clade_pct.round(1).to_string())
# Visualizations
= plt.subplots(2, 2, figsize=(16, 12))
fig, axes
# Top-left: Geographic distribution
='bar', stacked=True, ax=axes[0,0],
geo_clade_pct.plot(kind='Set3', edgecolor='black', linewidth=0.5)
colormap0,0].set_xlabel('Country', fontsize=12)
axes[0,0].set_ylabel('Percentage', fontsize=12)
axes[0,0].set_title('Clade Distribution by Country', fontsize=14, fontweight='bold')
axes[0,0].legend(title='Clade', fontsize=10)
axes[0,0].set_xticklabels(axes[0,0].get_xticklabels(), rotation=45, ha='right')
axes[0,0].set_ylim(0, 100)
axes[0,0].grid(True, alpha=0.3, axis='y')
axes[
# Top-right: Temporal evolution (stacked area)
= temporal_clade_pct.index.to_timestamp()
temporal_clade_pct.index 0,1].stackplot(temporal_clade_pct.index,
axes[*[temporal_clade_pct[col] for col in temporal_clade_pct.columns],
=temporal_clade_pct.columns,
labels=0.8)
alpha0,1].set_xlabel('Month', fontsize=12)
axes[0,1].set_ylabel('Percentage', fontsize=12)
axes[0,1].set_title('Clade Evolution Over Time', fontsize=14, fontweight='bold')
axes[0,1].legend(loc='upper left', fontsize=10)
axes[0,1].set_ylim(0, 100)
axes[0,1].grid(True, alpha=0.3, axis='y')
axes[
# Bottom-left: Mutation accumulation over time
for clade in nextstrain_data['clade'].unique():
= nextstrain_data[nextstrain_data['clade'] == clade]
clade_data 1,0].scatter(clade_data['date'], clade_data['num_mutations'],
axes[=0.6, s=30, label=clade)
alpha
1,0].set_xlabel('Date', fontsize=12)
axes[1,0].set_ylabel('Number of Mutations', fontsize=12)
axes[1,0].set_title('Mutation Accumulation Over Time', fontsize=14, fontweight='bold')
axes[1,0].legend(title='Clade', fontsize=10)
axes[1,0].grid(True, alpha=0.3)
axes[
# Bottom-right: Sampling intensity by country
= nextstrain_data['country'].value_counts()
country_counts 1,1].barh(range(len(country_counts)), country_counts.values,
axes[='teal', edgecolor='black', linewidth=0.5)
color1,1].set_yticks(range(len(country_counts)))
axes[1,1].set_yticklabels(country_counts.index)
axes[1,1].set_xlabel('Number of Sequences', fontsize=12)
axes[1,1].set_title('Sampling Intensity by Country', fontsize=14, fontweight='bold')
axes[1,1].grid(True, alpha=0.3, axis='x')
axes[
plt.tight_layout()'nextstrain_analysis.png', dpi=300, bbox_inches='tight')
plt.savefig(
plt.show()
# Key insights
print("\n" + "="*70)
print("KEY EPIDEMIOLOGICAL INSIGHTS")
print("="*70)
print("\n1. Clade Dynamics:")
print(" • Clade 24A: Declining (was dominant early, now rare)")
print(" • Clade 24B: Currently dominant (~50%)")
print(" • Clade 24C: Rapidly emerging (5% → 40% in 2 months)")
print("\n2. Geographic Patterns:")
= geo_clade_pct.loc['USA', '24C'] if 'USA' in geo_clade_pct.index else 0
usa_24c print(f" • USA has highest 24C prevalence ({usa_24c:.0f}%)")
print(" • Suggests either:")
print(" a) 24C emerged in USA, or")
print(" b) USA detected it earlier (better surveillance)")
print("\n3. Mutation Rate:")
= (nextstrain_data['num_mutations'].max() - nextstrain_data['num_mutations'].min()) / 3
mutations_per_month print(f" • ~{mutations_per_month:.0f} mutations accumulated per month")
print(f" • Consistent with SARS-CoV-2 background rate")
print("\n4. Surveillance Gaps:")
= country_counts[country_counts < 20].index.tolist()
undersampled if undersampled:
print(f" • Under-sampled countries: {', '.join(undersampled)}")
print(" • May have circulating variants we haven't detected")
print("\n📋 Recommended Actions:")
print(" 1. Enhanced surveillance for 24C (growth advantage evident)")
print(" 2. Functional characterization of 24C-specific mutations")
print(" 3. Assess vaccine effectiveness against 24C")
print(" 4. Support genomic capacity in under-sampled regions")
8.8 Challenges, Limitations, and Biases
Genomic surveillance is powerful but not perfect. Understanding its limitations is critical for proper interpretation.
8.8.1 The Sampling Bias Problem
Challenge: We sequence a tiny fraction of all infections, and sampling is highly unequal.
# Simulate global surveillance inequality
= pd.DataFrame({
surveillance_capacity 'region': ['High-income', 'Upper-middle', 'Lower-middle', 'Low-income'],
'population_millions': [1200, 2600, 3200, 800],
'covid_cases_millions': [180, 390, 320, 80],
'sequences_per_1000_cases': [50, 5, 0.5, 0.05], # 1000x difference!
})
'total_sequences'] = (
surveillance_capacity['covid_cases_millions'] * 1000 *
surveillance_capacity['sequences_per_1000_cases']
surveillance_capacity[int)
).astype(
'sequences_per_million_people'] = (
surveillance_capacity['total_sequences'] /
surveillance_capacity['population_millions']
surveillance_capacity[round(0)
).
print("="*70)
print("GLOBAL GENOMIC SURVEILLANCE CAPACITY INEQUALITY")
print("="*70)
print(surveillance_capacity.to_string(index=False))
print("\n" + "="*70)
print("THE PARADOX")
print("="*70)
print("\nCountries with the BEST surveillance:")
print(" • Detect variants first")
print(" • Get blamed for 'causing' them")
print(" • Face travel bans and stigma")
print("\nCountries with the WORST surveillance:")
print(" • Variants circulate undetected")
print(" • Export cases silently")
print(" • No consequences (because they're invisible)")
print("\nConsequence: Perverse incentive AGAINST doing good surveillance!")
# Visualize
= plt.subplots(1, 2, figsize=(16, 6))
fig, (ax1, ax2)
# Left: Sequencing capacity
= range(len(surveillance_capacity))
x 'sequences_per_1000_cases'],
ax1.bar(x, surveillance_capacity[='steelblue', edgecolor='black', linewidth=2)
color
ax1.set_xticks(x)'region'], rotation=0)
ax1.set_xticklabels(surveillance_capacity['Sequences per 1,000 Cases', fontsize=12)
ax1.set_ylabel('Genomic Surveillance Capacity', fontsize=14, fontweight='bold')
ax1.set_title('log')
ax1.set_yscale(True, alpha=0.3, axis='y')
ax1.grid(
# Annotate inequality
0.5, 0.95, '1000× inequality', transform=ax1.transAxes,
ax1.text(=14, fontweight='bold', color='red',
fontsize='center', va='top',
ha=dict(boxstyle='round', facecolor='yellow', alpha=0.7))
bbox
# Right: Per capita sequencing
'sequences_per_million_people'],
ax2.bar(x, surveillance_capacity[='coral', edgecolor='black', linewidth=2)
color
ax2.set_xticks(x)'region'], rotation=0)
ax2.set_xticklabels(surveillance_capacity['Sequences per Million People', fontsize=12)
ax2.set_ylabel('Per Capita Sequencing Intensity', fontsize=14, fontweight='bold')
ax2.set_title('log')
ax2.set_yscale(True, alpha=0.3, axis='y')
ax2.grid(
plt.tight_layout()'surveillance_inequality.png', dpi=300, bbox_inches='tight')
plt.savefig( plt.show()
When South Africa detected and reported Omicron in November 2021:
What happened: - Immediate travel bans from 10+ countries - Economic damage to tourism - Accusations of “creating” the variant - Punished for transparency
What should have happened: - Recognition of excellent surveillance - Support for genomic capacity - Reward for rapid data sharing
The lesson: Current system punishes good surveillance, creating incentive to hide variants.
Proposed solutions: - WHO Pandemic Agreement: Benefit-sharing for countries that contribute data - No travel bans based on variant detection (only on epidemiological situation) - Technical support for countries building capacity
For discussion, see Yozwiak et al., 2016, PLOS Biology on equitable data sharing.
8.8.2 Technical Limitations
1. Sequence Quality:
# Simulate sequence quality issues
= pd.DataFrame({
quality_issues 'issue': ['Low viral load', 'Contamination', 'PCR bias', 'Coverage gaps', 'Mixed infection'],
'frequency_pct': [15, 5, 10, 20, 3],
'impact': ['Incomplete genome', 'Incorrect calls', 'Allele dropout', 'Missing mutations', 'Ambiguous calls']
})
print("\nCommon Sequencing Quality Issues:")
print(quality_issues.to_string(index=False))
2. Turnaround Time: - Sample collection → Sequence result: 3-7 days - + Analysis and interpretation: 1-2 days - Total: 4-9 days (vs. 1 day for PCR)
For outbreak response: Sometimes too slow (cases have already spread)
3. Cost: - Sequencing: $50-300 per sample - Not sustainable for universal surveillance - Must prioritize strategically
4. Bioinformatics Complexity: - Requires specialized expertise - Computational infrastructure - Rapidly evolving tools and nomenclature
8.9 Ethics and Governance
8.9.1 Data Sharing and Attribution
The tension: - Public health needs: Rapid, open sharing of sequence data - Researcher incentives: Credit, publications, career advancement
GISAID’s model: - Free access for public health/research - Mandatory attribution of data contributors - Prevents “parasitic” publications (using data without acknowledging sources)
Has it worked? Yes. GISAID enabled unprecedented COVID-19 response.
Remaining challenges: - Low-income countries contribute samples but lack capacity to analyze - “Helicopter research”: Samples leave country, papers published elsewhere without local authorship - Need: Capacity building, equitable partnerships
8.9.2 Dual-Use Research Concerns
Question: Can genomic data enable creation of dangerous pathogens?
2012 H5N1 Controversy: Two labs created highly transmissible H5N1 influenza in ferrets. Should the sequences be published?
Arguments FOR publication: ✅ Scientific transparency
✅ Enables surveillance (detect if mutations arise naturally)
✅ Guides vaccine development
Arguments AGAINST: ❌ Dual-use potential (bioterrorism)
❌ Accidental lab release risk
❌ May trigger panic
Resolution: Eventually published with methods somewhat redacted. Debate continues.
For SARS-CoV-2: - All sequences openly shared - No evidence of misuse - Benefits (diagnostics, vaccines, treatments) vastly outweighed risks
Governance: National Science Advisory Board for Biosecurity (NSABB) reviews high-risk research.
For ethical framework, see Inglesby et al., 2014, mBio.
8.9.3 Stigma and Discrimination
Naming variants after places: - “UK variant”, “South African variant”, “Indian variant” - Creates stigma, encourages discrimination - WHO now uses Greek letters (Alpha, Beta, Delta, Omicron)
Individual-level concerns: - Genomic data can be identifying (small communities) - Must protect patient privacy while sharing pathogen data
8.10 Practical Implementation Guide
8.10.1 Building Your Genomic Surveillance System
Step 1: Define Objectives ☐ What questions will genomics answer? (Outbreak investigation? Variant monitoring? AMR?)
☐ What volume of sequencing is feasible? (Budget constraints)
☐ Who will analyze the data? (In-house expertise? Academic partners?)
Step 2: Establish Infrastructure ☐ Sequencing platform (Illumina? Oxford Nanopore?)
☐ Bioinformatics pipeline (Cloud-based? Local server?)
☐ Data management (How will sequences be stored/shared?)
☐ Quality control procedures
Step 3: Prioritize Samples - Outbreaks: Sequence all or representative subset - Sentinel surveillance: Random sample (2-5% of cases) - Special situations: Vaccine breakthroughs, reinfections, severe cases in vaccinated individuals - Wastewater: Community-level surveillance
Step 4: Analysis Workflow
Sample received
↓
QC (Ct value, collection date)
↓
Sequencing (1-3 days)
↓
Bioinformatics (alignment, variant calling)
↓
Lineage assignment (Pangolin, Nextclade)
↓
Quality filtering (>90% complete, >100x depth)
↓
Phylogenetic analysis (if cluster)
↓
Upload to GISAID
↓
Report to epidemiology team
↓
Action (if novel variant or outbreak cluster)
Step 5: Integration with Epidemiology - Don’t work in silos! Genomics teams + epi teams = powerful combination - Link sequence data to case investigations - Use genomics to guide contact tracing (who infected whom?) - Monitor for variants that may affect control measures
Step 6: Capacity Building - Train staff (wet lab + bioinformatics) - Establish SOPs (standard operating procedures) - Participate in quality assurance programs - Collaborate with academic/reference labs
8.10.2 Free/Open-Source Tools
Sequencing: - ARTIC Network protocols - SARS-CoV-2 sequencing
Bioinformatics: - Nextstrain - Phylogenetic analysis - Pangolin - SARS-CoV-2 lineage assignment - Nextclade - Web-based clade assignment - IQ-TREE - Fast phylogenetic tree building - snippy - Variant calling for bacteria
Cloud Platforms: - Terra.bio - Cloud genomics (free for public health) - Galaxy - Web-based analysis - BaseSpace - Illumina platform
8.11 Key Takeaways
Genomic surveillance revolutionized outbreak response - SARS-CoV-2 sequenced in 10 days; variants detected within weeks of emergence
Phylogenetics reconstructs transmission - Who infected whom, even without contact tracing data
Mutations have consequences - Specific genetic changes cause increased transmissibility, immune evasion, drug resistance
Machine learning predicts phenotype from genotype - Drug resistance prediction from sequence alone (TB, S. aureus, E. coli)
Real-world platforms exist and work - GISAID, Nextstrain enable global collaboration
Sampling bias is pervasive - Rich countries sequence 100-1000x more than poor countries, creating blind spots
Technical challenges remain - Cost, turnaround time, bioinformatics expertise required
Equity and ethics matter - Countries doing good surveillance shouldn’t be punished; need benefit-sharing
Integration is essential - Genomics enhances traditional epidemiology but doesn’t replace it
The future is genomic - Sequencing costs dropping; routine genomic surveillance becoming standard of care
8.12 Practice Exercises
8.12.1 Exercise 1: Outbreak Investigation
Nine sequences from a restaurant outbreak. SNP distances: - Cases 1-5: 0-2 SNPs apart - Cases 6-9: 0-1 SNPs apart - Between groups: 15 SNPs
Questions: - How many introductions? - Which cases are linked? - What public health actions are needed?
8.12.2 Exercise 2: Variant Growth
Variant A frequency: Week 1 = 10%, Week 4 = 40%
Calculate: - Fold change - Weekly growth rate - Doubling time - Transmission advantage (relative R)
8.12.3 Exercise 3: AMR Prediction
E. coli genome with blaCTX-M-15 gene detected.
Questions: - What antibiotic class will be ineffective? - What should you prescribe instead? - How confident are you? (What could cause false positive/negative?)
8.12.4 Exercise 4: Phylogenetic Interpretation
Build a phylogenetic tree from sequences. Identify: - Which samples are most closely related? - Do geographic clusters exist? - What is the most recent common ancestor?
Check Your Understanding
Test your knowledge of genomic surveillance and its applications in public health. Each question builds on the key concepts from this chapter.
A public health laboratory is planning to establish genomic surveillance for SARS-CoV-2 variants. They are deciding between two sequencing approaches: targeted amplicon sequencing (e.g., ARTIC protocol) versus whole genome shotgun sequencing. Which statement BEST describes the key trade-off between these approaches for routine surveillance?
- Targeted amplicon sequencing is more expensive but provides complete genome coverage
- Whole genome shotgun sequencing is faster and more suitable for high-throughput surveillance
- Targeted amplicon sequencing is more cost-effective and works with low viral loads, but may miss novel variants in primer binding sites
- Whole genome shotgun sequencing is the only method that can detect single nucleotide polymorphisms (SNPs)
Correct Answer: c) Targeted amplicon sequencing is more cost-effective and works with low viral loads, but may miss novel variants in primer binding sites
The choice between targeted amplicon sequencing and whole genome shotgun sequencing represents a fundamental trade-off in genomic surveillance. Targeted amplicon sequencing (like the widely-used ARTIC protocol) uses specific primers to amplify predetermined regions of the viral genome. This approach is highly cost-effective, requires less sequencing depth, and crucially, works well with clinical samples that have low viral loads—a common situation with respiratory specimens. These advantages made ARTIC the workhorse of global SARS-CoV-2 surveillance during the pandemic.
However, targeted approaches have a critical vulnerability: if mutations occur in primer binding sites, those regions may fail to amplify, creating “dropout” patterns. This happened with the Omicron variant, where mutations in primer binding sites initially caused detection issues. While primer sets can be updated, there’s always a lag between variant emergence and primer redesign.
Option (a) is incorrect because targeted sequencing is actually less expensive, not more. Option (b) is incorrect because targeted approaches are generally faster and more cost-effective for high-throughput surveillance, not shotgun sequencing. Option (d) is incorrect because both methods can detect SNPs—the difference lies in their systematic biases and coverage patterns, not in the fundamental ability to detect sequence variation.
For routine surveillance, most laboratories use targeted sequencing for cost-effectiveness and rapid turnaround, while maintaining shotgun sequencing capacity for investigating unusual patterns or confirming novel variants. This hybrid strategy balances practical constraints with surveillance sensitivity.
You are investigating a hospital outbreak of methicillin-resistant Staphylococcus aureus (MRSA). Whole genome sequencing reveals that isolates from five patients differ by 0-3 SNPs. Hospital infection control asks whether this proves transmission occurred within the hospital. What is the MOST scientifically accurate response?
- Yes, 0-3 SNPs definitively proves direct transmission between these patients
- No, MRSA does not accumulate SNPs during person-to-person transmission
- The SNP data is consistent with recent transmission, but cannot prove it without epidemiological context and a null hypothesis
- SNP analysis is unreliable for outbreak investigation and should not be used for infection control decisions
Correct Answer: c) The SNP data is consistent with recent transmission, but cannot prove it without epidemiological context and a null hypothesis
This question addresses a critical pitfall in genomic epidemiology: confusing “consistent with” for “proof of” transmission. While 0-3 SNPs between MRSA isolates strongly suggests recent common ancestry, it does not definitively prove direct transmission within the hospital. Multiple scenarios could produce this pattern: patients could have independently acquired closely related strains from a common community source, they could have been colonized before admission, or transmission could have occurred in a different healthcare facility they all visited.
Rigorous outbreak investigation requires integrating genomic data with epidemiological context. Key questions include: Did patients have overlapping hospital stays? Were they in the same unit? Did they share healthcare workers? What is the background genetic diversity of MRSA in this community? Without comparing the outbreak cluster to a null hypothesis (what would we expect by chance given local MRSA diversity?), we cannot assess whether the genetic similarity is remarkable or expected.
Option (a) is too absolute—genomic data alone cannot prove causation. Option (b) is factually incorrect; MRSA does accumulate SNPs during transmission, typically 0-25 per year depending on mutation rates and generation time. Option (d) is also incorrect; SNP analysis is highly valuable for outbreak investigation when properly interpreted—the chapter discusses multiple successful applications in tuberculosis and other bacterial infections.
The Walker et al. (2013) study on tuberculosis transmission in the UK exemplifies the proper integration of genomic and epidemiological data, showing that many apparent transmission links suggested by conventional epidemiology were inconsistent with genetic data, and vice versa. Genomics is a powerful tool, but not a standalone oracle.
A phylogenetic tree of SARS-CoV-2 sequences shows that samples from Country A cluster together on one branch, while samples from Country B form a different cluster. A government official concludes that this proves the virus evolved differently in each country and suggests country-specific vaccines. What is the PRIMARY flaw in this interpretation?
- Phylogenetic trees cannot show geographic patterns because they only represent genetic relationships
- The clustering likely reflects sampling bias and founder effects, not independent evolution requiring different vaccines
- SARS-CoV-2 does not evolve fast enough to develop country-specific lineages
- Phylogenetic trees are unreliable for RNA viruses due to high mutation rates
Correct Answer: b) The clustering likely reflects sampling bias and founder effects, not independent evolution requiring different vaccines
This question highlights a common misinterpretation of phylogeographic patterns. When samples from different countries cluster separately on a phylogenetic tree, it’s tempting to conclude this represents distinct evolutionary trajectories. However, geographic clustering usually reflects sampling and epidemiological patterns rather than functionally significant evolutionary divergence.
The clustering likely arises from founder effects: different viral lineages were introduced to each country during the early pandemic, and local transmission amplified these founding lineages. If Country A had limited travel connections during the critical seeding period, or if sequencing began after local transmission was established, the sequences will naturally cluster together because they descend from a small number of introductions. This phylogenetic structure tells us about epidemic history and introduction patterns, not about functional differences requiring different vaccines.
Furthermore, the chapter discusses how variants of concern (Alpha, Delta, Omicron) arose in specific locations but then spread globally because they had functional advantages. These variants required updated vaccines not because of where they originated, but because they acquired specific mutations affecting immune escape. Simple geographic clustering without evidence of altered antigenicity or functional changes does not justify country-specific vaccines.
Option (a) is incorrect—phylogenetic trees do show geographic patterns, but these require careful interpretation. Option (c) is incorrect; SARS-CoV-2 evolves quite rapidly, but this doesn’t automatically mean geographically distinct lineages require different vaccines. Option (d) is incorrect; phylogenetic methods are well-suited to RNA viruses, and high mutation rates actually provide more phylogenetic signal, not less reliability.
The key lesson: phylogenetic structure reflects both evolutionary processes and sampling/epidemiological history. Distinguishing these requires understanding founder effects, sampling biases, and the functional significance of observed mutations.
A laboratory has sequenced the genome of an E. coli isolate from a patient with a urinary tract infection and detected the blaCTX-M-15 gene. Based on this finding alone, what is the MOST appropriate conclusion about antibiotic treatment?
- The patient should immediately be switched to a carbapenem antibiotic as first-line therapy
- Third-generation cephalosporins (e.g., ceftriaxone) will likely be ineffective and alternative treatments should be considered
- The infection cannot be treated with any beta-lactam antibiotics and requires completely different drug classes
- Genomic detection of resistance genes is unreliable and phenotypic susceptibility testing should be performed instead
Correct Answer: b) Third-generation cephalosporins (e.g., ceftriaxone) will likely be ineffective and alternative treatments should be considered
The blaCTX-M-15 gene encodes an extended-spectrum beta-lactamase (ESBL) that hydrolyzes third-generation cephalosporins, rendering them ineffective. This genotype-to-phenotype prediction is highly reliable for this gene, as discussed in the chapter’s section on antimicrobial resistance genomics. Detection of blaCTX-M-15 should prompt clinicians to avoid cephalosporins and consider alternative treatments such as fluoroquinolones, nitrofurantoin (for uncomplicated UTIs), or carbapenems (for severe infections).
However, option (a) is too aggressive—carbapenems are important drugs held in reserve for severe infections, and their overuse drives resistance. For an uncomplicated UTI, alternative oral agents like nitrofurantoin or fosfomycin might be appropriate. The choice depends on infection severity, local resistance patterns, and antibiotic stewardship principles. Genomic data informs the decision but doesn’t automatically dictate carbapenem use.
Option (c) is incorrect because blaCTX-M-15 specifically affects cephalosporins; other beta-lactams may remain effective depending on the resistance profile. Many ESBL-producing organisms remain susceptible to carbapenems and beta-lactamase inhibitor combinations.
Option (d) misunderstands the role of genomic predictions. While phenotypic testing remains important for confirming susceptibility and detecting unexpected resistance, genomic detection of well-characterized resistance genes like blaCTX-M-15 provides rapid, actionable information. The chapter discusses how machine learning models are increasingly successful at predicting resistance phenotypes from genotypes, especially for well-studied genes. The CRyPTIC Consortium’s work on tuberculosis (cited in Further Resources) demonstrated that genomic prediction of drug resistance can match or exceed traditional phenotypic testing for many drugs.
The key principle: genomic resistance detection provides rapid, reliable information that can guide empiric therapy while awaiting confirmatory phenotypic results, particularly for well-characterized resistance mechanisms.
During a suspected foodborne outbreak, public health investigators sequence bacterial isolates from patients, food samples, and potential source farms. The phylogenetic tree shows that patient isolates cluster with one farm’s isolates (5-10 SNPs apart), while isolates from other farms differ by >100 SNPs. However, the implicated farm has excellent food safety records and no epidemiological links to the patients. What should investigators do NEXT?
- Immediately close the implicated farm since genomic evidence is definitive proof of the source
- Disregard the genomic data since epidemiological links are absent and focus only on traditional investigation
- Investigate potential indirect connections (e.g., shared suppliers, distributors, or environmental sources) and expand sampling
- Conclude the outbreak investigation is inconclusive and close the case
Correct Answer: c) Investigate potential indirect connections (e.g., shared suppliers, distributors, or environmental sources) and expand sampling
This scenario illustrates the investigative power of genomic data while highlighting that real-world outbreak investigations require integrating multiple lines of evidence. The phylogenetic pattern strongly suggests a connection between patient cases and one specific farm (5-10 SNPs indicates very recent common ancestry), but the lack of direct epidemiological links means the connection might be indirect.
The appropriate next step is to use genomic data as a hypothesis-generating tool to expand the investigation. Potential scenarios include: the farm supplies a distributor that processes multiple products, creating cross-contamination opportunities; the farm shares irrigation water or equipment with other operations that directly supply implicated restaurants; or there’s an intermediate processor or distributor not yet identified. The chapter discusses similar scenarios where genomic evidence revealed unexpected transmission pathways that traditional epidemiology missed.
Option (a) represents regulatory overreach—genomic data provides strong evidence but should not be used in isolation to take punitive actions, especially when epidemiological context is lacking. False leads can occur due to sampling bias or coincidental clustering. Option (b) makes the opposite error, dismissing valuable scientific evidence. The history of outbreak investigation shows that integrating genomic and traditional epidemiology produces better outcomes than either alone. Option (d) is premature; the strong genomic signal warrants further investigation, not case closure.
This approach mirrors successful outbreak investigations discussed in the chapter, such as the Tegomoh et al. (2021) Nebraska Omicron cluster investigation, where genomic surveillance revealed unexpected transmission links that informed public health action. The key principle: genomic data should prompt expanded investigation and hypothesis refinement, not serve as a sole decision-making criterion.
Modern outbreak investigation increasingly resembles detective work where genomic data provides leads that must be confirmed and contextualized through traditional epidemiological investigation. This integration produces more accurate source attribution and more effective public health interventions.
A low-income country with limited sequencing capacity wants to establish genomic surveillance for antimicrobial resistance and epidemic pathogens. International partners offer to sequence all samples offshore and share data through global databases. However, local scientists express concerns about this arrangement. What is the MOST important ethical consideration that should guide the surveillance system design?
- Cost-effectiveness: offshore sequencing is cheaper, so ethical use of limited resources requires accepting this arrangement
- Speed: rapid results are paramount for public health action, so the fastest sequencing location should be prioritized
- Data sovereignty and equitable benefit-sharing: local capacity building and in-country analysis should be prioritized even if initially more expensive or slower
- Data quality: only high-income country laboratories can produce reliable sequences, so ethical surveillance requires offshore processing
Correct Answer: c) Data sovereignty and equitable benefit-sharing: local capacity building and in-country analysis should be prioritized even if initially more expensive or slower
This question addresses a critical ethical dimension of genomic surveillance discussed in the chapter’s ethics section. While offshore sequencing might appear pragmatic, it perpetuates problematic power dynamics in global health and can undermine the long-term goals of surveillance systems.
The chapter cites Yozwiak et al. (2016) on equitable data sharing, emphasizing that communities generating genomic data should have the capacity to analyze and benefit from their own data. When samples are sent offshore for sequencing, several problems arise: local scientists lose opportunities to develop expertise and ask research questions relevant to their context; there may be delays in returning actionable information; intellectual property and scientific credit often accrue to international partners rather than local investigators; and the country remains dependent on external support rather than building sustainable capacity.
During the COVID-19 pandemic, some countries with samples sent abroad for sequencing faced significant delays in receiving results, while international collaborators published papers using “their” data. This extractive pattern echoes colonial-era scientific practices and violates principles of research equity and justice.
The appropriate approach combines immediate practical needs with capacity building: initial offshore sequencing can be accompanied by training programs, technology transfer, and progressive transition to in-country analysis. Partnership models should emphasize co-leadership, co-authorship, and sustainable infrastructure development.
Option (a) takes an overly narrow view of cost-effectiveness that ignores long-term sustainability and equity. Option (b) prioritizes speed over sovereignty, which may be necessary in acute emergencies but should not be the default model. Option (d) is factually incorrect and reflects biased assumptions; many laboratories in low- and middle-income countries produce high-quality genomic data, and the primary barrier is resource access, not capability.
The key principle: genomic surveillance systems should be designed not just to collect data, but to build sustainable local capacity and ensure that benefits flow to the communities that generate samples. This approach aligns with broader goals of equity in global health research and strengthens long-term pandemic preparedness.
8.13 Discussion Questions
South Africa detected and reported Omicron, then faced immediate travel bans. How can we design a system that rewards transparency instead of punishing it? What would proper “benefit-sharing” look like?
Genomic surveillance is 100-1000x more intensive in high-income countries. What variants are we missing in under-surveilled regions? How do we build capacity equitably?
A nursing home outbreak: genomics shows two separate introductions. How does this change your public health response compared to a single introduction? What are the implications for infection control?
TB resistance prediction is now as accurate as culture testing and 6 weeks faster. Should we replace phenotypic testing with genomic prediction? What are the risks? When is verification necessary?
Publishing highly transmissible H5N1 sequences: Dual-use dilemma. Where should we draw the line between scientific openness and biosecurity concerns? Who decides?
A new SARS-CoV-2 variant with 20+ spike mutations is detected. Using what you’ve learned, how would you assess: (a) Is it more transmissible? (b) Does it evade immunity? (c) Should it be a Variant of Concern?
8.14 Further Resources
8.14.1 📚 Books
- Felsenstein, 2004, Inferring Phylogenies - Comprehensive phylogenetics
- Pevsner, 2015, Bioinformatics and Functional Genomics - Bioinformatics fundamentals
8.14.2 📄 Key Papers
Genomic Epidemiology: - Bedford et al., 2018, Bioinformatics - Nextstrain 🎯 Essential - Grubaugh et al., 2019, Nature Microbiology - Framework 🎯 Essential
SARS-CoV-2 Case Studies: - Rambaut et al., 2021, Nature - Alpha variant - 🎯 Key case study on variant detection - Viana et al., 2022, Nature - Omicron origins - Tracing variant emergence in South Africa - Tegomoh et al., 2021, MMWR - Nebraska Omicron cluster investigation - 🎯 Practical outbreak investigation using genomic surveillance - COG-UK, 2020, Nature Communications - Large-scale genomic surveillance infrastructure
Antimicrobial Resistance: - CRyPTIC Consortium, 2022, NEJM - TB resistance 🎯 Landmark study - Walker et al., 2013, Nature Genetics - TB transmission
Ethics: - Yozwiak et al., 2016, PLOS Biology - Equitable data sharing 🎯 Essential
8.14.3 💻 Tools
- Nextstrain - Real-time pathogen tracking
- GISAID - Genomic database
- Pangolin - SARS-CoV-2 lineage assignment
- IQ-TREE - Phylogenetic inference
- Terra.bio - Cloud genomics platform
Genomic surveillance is the future of infectious disease control. But technology alone isn’t enough—we need equity, ethics, and integration with traditional public health. The sequences tell us what’s happening. Epidemiologists tell us what to do about it.