PHORAGER: Engineering the Future of Phage Therapy Through AI-Driven Host Retargeting

Abstract

Phage therapy offers a promising alternative to antibiotics, but clinical implementation is limited by the extensive screening required to identify suitable therapeutic phages. Here we present PHORAGER (Phage Host-Optimized Receptor-Activated Generative Engineering Repository), a computational pipeline that redesigns bacteriophage receptor-binding proteins to reprogram viral host specificity. PHORAGER integrates state-of-the-art protein and structural foundation models, statistical sampling-based optimization to navigate fitness landscapes and identify optimal variants. We experimentally validated PHORAGER by engineering four E. coli phages with altered host specificity. Two were designed to recognize truncated R1 lipopolysaccharide glycans instead of K12, and two were engineered to bind OmpC instead of LamB protein receptors. We further establish the first Generative Phage Bank with an intelligent routing strategy to streamline phage selection. The system retrieves pre-validated phages when bacterial receptors share high sequence identity with database entries. When no close match exists, the generative model designs novel receptor-binding proteins for the target. By conducting thousands of in silico binding simulations before experimental synthesis, PHORAGER reduces phage screening time from months to days, unlocking patient-specific antimicrobial design at clinically relevant timescales.

Background

Antibiotic resistance is one of humanity’s largest global health issues. It is projected to cause approximately 39 million cumulative deaths between now and 2050^[1]. Phage therapy is a promising solution and has had success in multiple high-profile cases^{[2, 3]}. However, the process of finding a phage with the correct receptor-binding protein (RBP) for its target bacterial receptor, a process called phage matching, is highly time consuming^[4]. Studies have shown that slow phage matching causes months-long wait times^[5] and single-digit fulfillment rates^[6] for phage therapies. This prevents this treatment from effectively scaling and meeting the demands of millions of future cases where antibiotics might fail.

Phage Therapy: The Problems It Solves and the Problems It Poses

Phages infect bacterial cells by first binding to either a glycan or protein structure on the outer membrane of the bacteria using structures called tail fibers. These tail fibers contain receptor-binding proteins (RBPs) that recognize and attach to specific bacterial surface receptors with exquisite molecular specificity. These glycan or protein receptors are unique between bacterial strains, limiting the host range of the phage. Following initial recognition, the tail fibers use their enzymatic activity to degrade bacterial capsules and facilitate tighter binding. The phage then injects its genome into the bacterial cell, which eventually leads to the production of new phage particles that are assembled within the cell. The bacterial cell is subsequently lysed and the new phages spread to new bacterial cells^[7].

Phage therapy harnesses this biological process to fight bacterial infections, but its clinical use faces multiple limitations. The use of phage therapy has been mainly limited by phage screening, the process of identifying specific bacteriophages that can effectively target and destroy a patient’s bacterial infection. This challenge stems from the fundamental biology of phage-host recognition: the molecular compatibility between a phage’s RBP and the target bacterium’s surface receptor determines whether infection can occur.

The phage screening process often involves searching through thousands of phages for a match to a specific bacterial target, making it a long and extensive process^[8]. One method used currently to address these problems is the use of phage cocktails, which are mixtures of several different phages given at once in the hope that at least one will infect the target bacteria. The phage cocktail approach, however, risks increasing resistance and reducing phage selectivity, undermining their justification as a good solution to begin with^[9]. Studies have shown that the most critical metric for determining whether a phage can bind to a bacterium is the RBP of the phage^{[10, 11]}.

The development and use of recombinant phages that are engineered to bind to a custom bacterial target by genetically altering the RBP is a relatively novel concept. However, computational design of RBPs to retarget phage host specificity faces three fundamental challenges. First, bacterial receptors are incredibly diverse. Since each RBP is highly specific to its cognate receptor, generative models would need to encompass a vast range of sequence and structural diversity to generalize across necessary receptor classes. This makes de novo generation very difficult. Second, many bacterial receptors, such as flexible glycans like lipopolysaccharides and capsular polysaccharides, are extremely challenging to represent computationally, and current generative models struggle to design binders for them. Third, many receptors lack solved or annotated RBP-receptor complexes, leaving generative models with little data to train on or fine-tune. These challenges, combined with insufficient data surrounding phage RBP-host interactions, represent significant obstacles to the practical use of engineered phage therapy^[12]. Overcoming these major obstacles therefore requires the development of powerful AI-based computational models.

AI-Driven Protein Engineering and Structural Models

The protein design challenge is like searching for a single golden key in an ocean of near-identical fakes. The ocean contains 20^300 possible sequences for a typical 300-residue protein, a combinatorial explosion that dwarfs the number of atoms in the observable universe^[13].

Recent AI-driven tools like ESM, AlphaFold, and RoseTTAFold have revolutionized protein modeling and engineering. Despite their power, these methods have not been applied to the targeted design of synthetic receptor-binding proteins (RBPs) for redirecting phage host specificity. This represents a critical gap in phage therapy, where bacterial resistance patterns shift faster than traditional phage isolation methods can track^[4].

The challenge is compounded by the extreme sequence and structural variability of phage RBPs across different families. Even functionally analogous proteins binding similar bacterial receptors often share no detectable homology^[14]. This represents molecular convergent evolution at its most dramatic: nature has discovered thousands of distinct solutions to the same recognition problem, leaving AI models trained on homology-rich protein families unable to extract generalizable design principles.

AlphaFold3 represents the current state-of-the-art of structure prediction, visualizing three-dimensional protein conformations with near-experimental accuracy from amino acid sequences alone. This capability has transformed protein engineering from blind exploration of sequence space to informed navigation^[15]. However, AlphaFold3’s limitation emerges precisely where phage RBPs operate: it lacks dedicated modules for modeling protein-ligand and protein-glycan interactions^[16]. These binding modes are critical for phage RBPs, which must recognize highly variable bacterial surface receptors such as lipopolysaccharides, teichoic acids, and capsular polysaccharides. Their conformational flexibility and chemical diversity fall outside AlphaFold’s protein-centric training data.

Boltz-2 advances protein modeling by coupling AlphaFold-like structural prediction with an affinity prediction module that evaluates binding strength between candidate proteins and their targets, including glycans and membrane proteins^[17]. This dual capability is particularly valuable for phage research, as it links structure and function within a single framework. It enables prediction of how well a designed RBP will bind to its bacterial receptor target.

Where We Come In

Phage therapy, synthetic biology, and AI-driven protein design are converging to create unprecedented opportunities for fighting ABR. PHORAGER harnesses these advances to pursue the following objectives:

Curate all known Escherichia phages from GenBank, annotate their receptor-binding domains (RBDs), and pre-validate these sequences for inclusion in the Generative Phage Bank.
Develop a generative model for novel receptor-binding protein (RBP) domains using state-of-the-art biomolecular foundation models, specifically ESM3 for protein design and Boltz-2 for structure prediction.
Engineer novel RBPs for four Escherichia phages: Mu, P2, Lambda, and HK97, to reprogram their host specificity.

Batch 1: Sequences generated using ESM3 and Boltz-2 prototype workflow
Batch 2: Sequences generated using full MCMC with simulated annealing optimization

Establish a Generative Phage Bank framework with intelligent routing that assesses bacterial receptor sequence similarity and either retrieves pre-validated phages or initiates de novo RBP design for novel bacterial targets, bypassing exhaustive phage screening.

Design/Methods

fig1 workflow — Figure 1. PHORAGER generative phage bank workflow.

Overview of the computational pipeline for generating and validating novel receptor-binding proteins. The workflow integrates a user-friendly interface for clinical applications, computational sequence generation through iterative MCMC refinement, and experimental validation of designed sequences

Overview

PHORAGER (Phage Host-Optimized Receptor-Activated Generative Engineering Repository) is a fully automated pipeline that engineers synthetic bacteriophage receptor-binding proteins (RBPs) to reprogram phage host specificity, bypassing the traditional months-long phage screening process. The pipeline comprises three main components: data curation, phage generation, and iterative optimization through Markov Chain Monte Carlo (MCMC) sampling.

The workflow begins with ESM3, a state-of-the-art protein language model, which generates novel receptor-binding domain (RBD) candidates by masking wild-type RBD regions. Each candidate sequence is evaluated using Boltz-2, a biomolecular foundation model that co-folds the generated RBD with the target receptor and assesses binding interactions with bacterial surface glycans and proteins. Boltz-2 produces a combined “energy” score that quantifies binding affinity for each sequence.

MCMC optimization with simulated annealing uses these energy scores to iteratively refine the sequence population. After each iteration, the algorithm updates its sampling parameters, enabling navigation of the protein fitness landscape while escaping local minima. This dual approach balances exploration of sequence space with exploitation of high-fitness variants, ultimately converging on RBPs predicted to bind strongly to target bacterial receptors.

Top-ranked candidate sequences undergo rigorous computational validation through multiple filters, including sequence clustering, BLAST homology searches, and AlphaFold structural verification. Molecular docking serves as the final in silico validation step, performing detailed simulations to verify interactions between the bacterial receptor and novel RBD sequences. The highest-confidence sequences that pass all validation steps are experimentally tested by engineering them into E. coli phages and assessing their altered host specificity.

PHORAGER implements a framework where experimental results feed back into the system. Sequences demonstrating strong binding are promoted in the Generative Phage Bank, while ineffective candidates are penalized. This iterative refinement improves predictive accuracy with each cycle, building an evolving repository tailored to diverse bacterial targets. Together, this generative-predictive loop enables rapid, accurate engineering of personalized therapeutic phages against antibiotic-resistant bacterial strains.

Data Curation

Collecting the Diversity of Escherichia RBP Sequences

To build a comprehensive database of receptor-binding proteins, we systematically collected and curated RBP sequences from Escherichia phages. We performed BLASTp searches (identity > 0.4, coverage > 0.8) against NCBI RefSeq using P2 and Mu RBPs as queries. From the initial hits, we constructed Hidden Markov Models (HMMs) for each receptor-binding domain (RBD) based on multiple sequence alignments and performed iterative searches with sequence-level e-value cutoff of 1e-5 and domain-level cutoff of 1e-3.

To reduce redundancy while maintaining sequence diversity, we clustered the RBD sequences at multiple similarity thresholds using CD-HIT: 95% identity/95% coverage, 90% identity/90% coverage, and 80% identity/80% coverage. Full-length RBPs, isolated RBD regions, and corresponding multiple sequence alignments were retained to form the foundation of the phage database.

Annotating Escherichia Genomes with Receptor Types

Accurate identification of bacterial surface receptors is critical for the Generative Phage Bank routing system. When a novel bacterial sequence is submitted, receptor annotation is the first computational step. It determines whether pre-validated phages can be retrieved or if de novo RBP design must be initiated. We developed comprehensive annotation pipelines for both glycan and protein receptors.

Glycan Receptors (Lipopolysaccharide Outer Core Types)

E. coli lipopolysaccharide (LPS) outer cores fall into five major types: R1, R2, R3, R4, and K12^[18]. Each type is encoded by highly conserved genomic regions containing 10-15 genes. We implemented a hybrid pipeline combining LPSTyper, a PCR-based in silico typing tool, with custom BLAST-based verification to handle genomes with poor sequencing quality in outer-core-encoding regions.

The workflow first applies LPSTyper, which simulates PCR amplification of signature genes by identifying primer binding sites and comparing product lengths. For genomes where LPSTyper fails to assign a type, we perform BLASTn and tBLASTn searches using curated representative sequences for each outer core type. Full-length genomic DNA sequences from untyped genomes are searched against representative outer core nucleotide sequences and signature protein sequences. Hits are filtered based on sequence similarity and coverage to assign outer core types. Results from both methods are combined to produce comprehensive LPS annotations.

Protein Receptors

Fourteen major protein receptors are known to serve as phage binding targets in E. coli: BamA, BtuB, FadL, FepA, FhuA, LamB, LptD, NfrA, OmpA, OmpC, OmpF, TolC, Tsx, and YncD^[19]. We curated one to two representative amino acid sequences for each receptor and used DIAMOND, an accelerated BLAST implementation, to identify homologs across all genomes.

Protein sequences from all genomes were extracted and concatenated into a unified database, with genome identifiers prefixed to each protein ID for downstream tracking. DIAMOND BLASTp searches were performed for each protein receptor query using stringent thresholds (e-value < 1e-5, identity > 60%, query coverage > 70%, subject coverage > 70%) in ultra-sensitive mode. Identified protein receptors were extracted and catalogued for each genome.

Glycan Structure Generation

To enable Boltz-2 structural co-folding and molecular docking simulations with LPS glycan receptors, we generated three-dimensional structural models for each outer core type. Glycan structures were constructed using CHARMM-GUI based on experimentally determined architectures (Bertani and Ruiz, 2018). Five glycan files corresponding to the R1, R2, R3, R4, and K12 outer core types were built using CHARMM force field parameters. These structure files serve as binding targets for evaluating RBP-glycan interactions during computational validation.

Commands and Scripts

RBP Collection and Clustering

# Initial BLASTp search against NCBI RefSeq
blastp -query P2_Mu_RBPs.fasta -db refseq_protein -evalue 1e-5 \
-outfmt 6 -out initial_hits.tab

# Build HMM from MSA and search
hmmbuild RBD.hmm RBD_msa.sto
hmmsearch --domE 1e-3 -E 1e-5 RBD.hmm refseq_protein.fasta > hmm_results.out

# Cluster RBDs at different thresholds
cd-hit -i RBD_sequences.fasta -o RBD_95.fasta -c 0.95 -aS 0.95
cd-hit -i RBD_sequences.fasta -o RBD_90.fasta -c 0.90 -aS 0.90
cd-hit -i RBD_sequences.fasta -o RBD_80.fasta -c 0.80 -aS 0.80

Glycan Receptor Annotation

# Convert .gbff to .fna for LPSTyper 
python gbff_to_fna.py

# Run LPSTyper in batches
python batch_LPSTyper.py

# Extract unidentified genomes
python extract_DNA_list.py unidentified_genomes.txt

# Create BLAST database for unidentified genomes
makeblastdb -in unidentified_combined.fasta -dbtype nucl -out unidentified_combined_db

# BLASTn search with representative outer core sequences
blastn -query outer_core_nt.fasta -db unidentified_combined_db \
-out unidentified_genome_BLASTn_results.tab -evalue 1e-5 \
-outfmt 7 -max_target_seqs 1000000

# tBLASTn search with signature proteins
tblastn -query outer_core_key_proteins.fasta -db unidentified_combined_db \
-out key_proteins_tblastn_results.tab -evalue 1e-5 \
-outfmt 7 -max_target_seqs 1000000

# Analyze BLAST results and assign types
python BLASTn_LPS_typing.py

# Combine LPSTyper and BLAST results
Rscript combine_outercore_typing_results.Rmd

Protein Receptor Annotation

# Convert .gbff to .faa
python gbff_to_faa.py

# Concatenate all protein sequences with genome ID prefixes
COMBINED=all_escherichia_proteins.faa
:> "$COMBINED"

for f in *.faa; do
GENOME=$(basename "$f" .faa | cut -d"_" -f1-2)
awk -v pfx="$GENOME" '/^>/{sub(/^>/,">"pfx":")}1' "$f" >> "$COMBINED"
rm "$f"
done

# Create DIAMOND database
diamond makedb --in all_escherichia_proteins.faa -d escherichia_db_prot

# Run DIAMOND BLASTp for each protein receptor
diamond blastp \
--db escherichia_db_prot.dmnd \
--query OMP/OmpC.fasta \
--out OMP_results/OmpC_diamond_results.tab \
--evalue 1e-5 \
--id 60 \
--query-cover 70 \
--subject-cover 70 \
--max-target-seqs 0 \
--ultra-sensitive \
--outfmt 6

# Extract identified receptor sequences
python receptor_extract.py

Phage Generation: ESM3 and Boltz-2 Workflow

Experimental Design

We initially validated the PHORAGER pipeline using two Escherichia phages available in our laboratory: Mu and P2. Both phages naturally target K12 lipopolysaccharide glycans on E. coli. Our objective was to reprogram their receptor-binding domains to recognize truncated R1 glycans instead, thereby expanding their host range to bacterial strains inaccessible to the wild-type phages.

Generative Protein Design with ESM3

Receptor-binding domains were expertly annotated in both phages (Data curation section) and masked for regeneration. We employed ESM3, a state-of-the-art generative protein language model, to generate novel RBD sequences. ESM3 is trained as a masked language model across the three data modalities of sequence, structure, and function. This training paradigm enables the model to generate biologically plausible protein sequences while maintaining structural and functional constraints.

We systematically evaluated different ESM3 model configurations. Initial experiments using complete RBD regeneration with the ESM3-small (1.4B parameters) model yielded approximately 10% of sequences passing subsequent filtering criteria. Scaling to ESM3-large (98B parameters, accessed via the Forge API) improved the success rate to 20-25%. However, we found that ESM3-small with guided generation, an iterative resampling approach that optimizes toward a target metric, achieved comparable performance to ESM3-large. Since guided generation uses pTM (predicted Template Modeling score) as the optimization metric, sequences converge toward higher structural confidence. Given that ESM3-small is open-source and available under academic license without the generation limits imposed by the Forge API, we adopted ESM3-small with guided generation for all subsequent experiments.

Structure Prediction and Binding Evaluation with Boltz-2

Generated RBD sequences were evaluated using Boltz-2, a biomolecular foundation model with performance comparable to AlphaFold3. Critically, Boltz-2 incorporates a novel binding affinity module that not only predicts protein structure but also estimates binding likelihood and affinity between molecular entities. This capability enables direct assessment of RBD-glycan interactions.

For each generated sequence, Boltz-2 co-folded the RBD with both K12 (wild-type) and truncated R1 (target) glycan structures. The resulting binding affinity scores, along with confidence metrics including pTM and iPTM, served as quality filters for candidate sequences. We optimized Boltz-2 prediction parameters to balance accuracy and computational cost. After comprehensive testing, we matched AlphaFold3’s settings with 10 recycling steps and 25 diffusion samples. While these parameters increased computational overhead, they provided substantially improved prediction reliability for downstream experimental validation.

Candidate Selection

Candidate RBDs were prioritized based on three criteria: (1) increased predicted binding affinity to truncated R1 glycans, (2) reduced affinity to K12 glycans relative to wild-type RBDs, and (3) high structural confidence scores. Selected sequences were additionally screened via NCBI BLASTp against nr to ensure novelty relative to known phage RBDs. Top-ranked candidates meeting all criteria were synthesized for experimental validation through phage spotting assays.

Iterative Optimization with MCMC and Simulated Annealing

Transition from Manual to Automated Sampling

The initial ESM3-Boltz-2 workflow (Version 1) relied on manual inspection of generated structures and binding scores to select candidates. To systematically explore sequence space and identify optimal RBD variants, we developed an automated iterative pipeline (Version 2) integrating Markov Chain Monte Carlo (MCMC) sampling with simulated annealing. This approach uses Boltz-2 predictions to directly guide ESM3 sequence generation, enabling principled navigation of the protein fitness landscape.

Fitness Landscape and Energy Function

The RBP sequence optimization problem can be conceptualized as traversing a high-dimensional fitness landscape where each point represents a unique RBD sequence and its associated fitness or “energy” score. MCMC, a widely adopted sampling technique in protein engineering, enables systematic exploration of this vast landscape to identify sequences predicted to bind strongly to target receptors.

We defined a composite energy function for each candidate sequence as a linear combination of four metrics: (1) predicted binding affinity to the target receptor (positive contribution), (2) predicted binding affinity to the wild-type receptor (negative contribution), (3) structural confidence scores (pTM and iPTM from Boltz-2), and (4) sequence novelty assessed by percent identity to known phages via BLASTp against 15,000 E. coli phage sequences from GenBank. This energy function balances binding specificity, structural plausibility, and sequence diversity.

MCMC Sampling Protocol

The sampling process begins at iteration 0 with the wild-type RBD sequence. The energy for this initial sequence is calculated and accepted into the Markov chain. At each subsequent iteration, a subset of residues is randomly masked from the previous sequence in the chain and regenerated using ESM3 with guided generation. The number of residues to mask is drawn from a Gaussian distribution with an initial mean of 40 residues.

The newly generated sequence is evaluated with Boltz-2, and its energy is calculated. Acceptance into the Markov chain follows the Metropolis criterion, which compares the energy difference (ΔE) between the current and previous sequences. Sequences with improved energy are deterministically accepted, while sequences with worse energy are accepted probabilistically based on exp(−ΔE/T), where T is the temperature parameter. This stochastic acceptance enables exploration of the fitness landscape and prevents premature convergence to local maxima. This is a critical advantage over greedy best-sequence selection strategies.

Simulated Annealing for Convergence

To balance exploration of sequence space with exploitation of high-fitness regions, we implemented simulated annealing. As iterations progress, both the mean number of masked residues and the acceptance threshold (controlled by the temperature parameter T) gradually decrease. This cooling schedule encourages broad exploration in early iterations and progressive refinement toward high-energy sequences in later iterations, facilitating convergence on optimal RBD variants.

Post-MCMC Validation

Sequences sampled throughout the MCMC trajectory undergo additional computational validation. We perform sequence clustering to identify dominant sequence families, analyze chain convergence diagnostics to ensure adequate sampling, and verify novelty through comprehensive BLASTp searches. Top-ranked sequences that the Markov chain converges upon are selected for final validation through molecular docking simulations before experimental synthesis.

Physics-Based Docking as Orthogonal Validation for Machine Learning Predictions

Physics-based molecular docking provides essential orthogonal validation for machine learning-based binding affinity predictions in de novo protein design pipelines. The consensus across recent literature is that ML and physics-based approaches are complementary rather than competitive. Hybrid pipelines combining the speed of ML with the rigor of physics-based methods achieve the best results. ML models like AlphaFold3 and Boltz-2 excel at rapid initial predictions but exhibit critical limitations including stereochemistry violations, inability to model conformational changes, and susceptibility to generating adversarial sequences that appear valid computationally but fail experimentally. Physics-based docking tools like HADDOCK3 address these limitations through explicit physical modeling, providing confidence in predictions that pure ML approaches cannot guarantee^[20].

Recent assessments of AlphaFold3 and similar ML models reveal fundamental limitations that necessitate physics-based validation. Abramson et al. demonstrated in their Nature 2024 paper that AlphaFold3 achieves only 95.6% stereochemical validity, exhibiting a 4.4% chirality violation rate and occasional production of overlapping atoms that violate basic physical chemistry^[15]. These artifacts arise because ML models learn statistical patterns from training data rather than enforcing physical constraints. Wee and Wei (2024) found that while AlphaFold3 achieves good correlation (R=0.86) for predicting binding free energy changes upon mutation, some complex structures contain large errors not captured by internal confidence metrics like ipTM. They concluded that independent validation of AlphaFold3 predictions is necessary^[21].

The challenge with ML models fundamentally stems from their optimization objectives. Notin et al. noted in their Nature Biotechnology 2024 review that in silico metrics such as native sequence recovery can be misleading^[22]. ML models are optimized for structure prediction accuracy on natural sequences, not for distinguishing viable from non-viable designed sequences. This creates a critical gap that physics-based validation fills.

HADDOCK3 Provides Modular Physics-Based Validation with Specialized Capabilities

HADDOCK3 (Giulini et al., 2025) represents evolutionary thinking in molecular docking. It breaks the original rigid pipeline into modular components that adapt to diverse validation scenarios. This architectural flexibility positions HADDOCK3 as the ideal physics-based complement to ML predictions in the post-AlphaFold era.

The HADDOCK scoring function combines physics-based energy terms in stage-dependent weighted formulations:

HADDOCK Score =

w_{vdw}E_{vdw} + w_{elec}E_{elec} + w_{desolv}E_{desolv} - w_{BSA} \cdot BSA + w_{AIR}E_{AIR}

Here, E_vdw and E_elec capture van der Waals and Coulombic interactions through molecular mechanics force fields. E_desolv approximates desolvation via empirical atomic solvation parameters. BSA (buried surface area) captures hydrophobic effects, and E_AIR penalizes violations of ambiguous interaction restraints. The weights shift dramatically across refinement stages: rigid-body docking uses w_vdw=0.01 to allow initial penetration, semi-flexible refinement increases to w_vdw=1.0 for realistic contacts, while explicit solvent refinement damps electrostatics to w_elec=0.2. This empirical dampening implicitly accounts for polarization and local rearrangement that simplified models miss. It encodes decades of validation against experimental binding thermodynamics.

These physics-based terms approximate the theoretical Gibbs free energy decomposition:

\Delta G_{bind} = \Delta G_{solvent} + \Delta G_{conf} + \Delta G_{int} + \Delta G_{motion}

This captures solvation changes, conformational penalties, direct intermolecular interactions, and entropy losses upon binding. While approximations (implicit solvation, fixed force fields, crude entropy estimates) limit absolute accuracy, the critical advantage is that these approximations encode known physical principles with predictable failure modes, unlike ML’s opaque pattern-matching.

Glycan-Protein Docking: Where ML Training Data Runs Dry

HADDOCK3 demonstrates unique advantages for protein-glycan interactions, precisely where ML models trained predominantly on protein-protein interfaces stumble. Ranaudo et al. (2024)^[23] benchmarked HADDOCK across 89 high-resolution glycan complexes, achieving 70% success for bound docking and 40% for unbound when binding sites are known. The protocol handles diverse systems (antibodies, lectins, viral glycan-binding proteins, enzymes) that span the functional space of carbohydrate recognition.

Glycans’ conformational promiscuity, branched structures with 2-7 monosaccharides exploring vast configurational space, demands the flexible refinement and MD sampling that HADDOCK’s modular architecture enables. Clustering between rigid-body and refinement stages increases diversity, crucial for exploring branched oligosaccharides’ high-dimensional landscapes. The information-driven approach using ambiguous restraints to target known interface regions while permitting flexibility proves more effective than template-free methods where ML models lack carbohydrate training data.

Parameter Sweeps with Latin Hypercube Sampling: Preliminary Benchmarks

Systematic parameter exploration addresses a subtle but crucial weakness in docking: sensitivity to arbitrary parameter choices in rugged, non-convex energy landscapes. Bender et al.’s (2021)^[24] practical guide on large-scale docking emphasizes parameter evaluation before screening.

Latin hypercube sampling (LHS) provides stratified coverage of parameter space by dividing each parameter’s range into N equiprobable intervals and sampling once per interval. This ensures no two samples share the same interval for any parameter. For k parameters, this yields N samples with guaranteed coverage across all dimensions, versus N^k samples required for equivalent full factorial coverage^[25]. This efficiency proves essential when each docking simulation costs substantial computational resources^{[26, 27]}.

Staged Validation Architecture: Marrying Speed with Rigor

Successful pipelines adopt a cascade strategy. Rapid ML-based exploration generates diverse candidates, physics-based filtering eliminates physically implausible designs, and parameter sweeps quantify confidence. Dauparas et al. (2022)^[28] demonstrated this with ProteinMPNN, where ML sequence design rescued 73 of 96 previously failed attempts, with 50 maintaining desired oligomeric states. The authors emphasized that structure prediction evaluates purely in silico while design methods cannot, as computational metrics prove sensitive to architecture without correlating reliably with experimental success.

The most sophisticated implementations employ parameter sweeps during physics-based validation to quantify robustness. Rather than accepting scores at face value, varying weights, solvation models, and sampling protocols through LHS identifies predictions robust across parameter space versus those sensitive to arbitrary choices. Wu et al.’s (2025)^[29] FDA (Folding-Docking-Affinity) framework found ColabFold-generated apo-structures surprisingly improved affinity predictions versus crystallized holo-structures. This suggests physics-based docking captures induced-fit effects ML structure prediction misses.

In our MCMC-driven RBP design pipeline, HADDOCK3 serves this critical validation role. After iterative ESM3 generation guided by Boltz-2’s affinity oracle, final candidates undergo full physics-based docking simulations with both wild-type and target glycan receptors. The modular flexref protocol for glycan docking, combined with LHS parameter sweeps across scoring weights and solvation models, identifies designs that satisfy both ML pattern-matching and explicit physical constraints. These are candidates robust enough to warrant experimental synthesis.

Codon Optimization

Following protein sequence generation and validation, candidate RBD sequences were codon-optimized for expression in Escherichia coli K-12 using CodonTransformer, a context-dependent codon optimizer. Each protein sequence was tokenized and processed through the model to generate codon-optimized DNA sequences that maximize translation efficiency while maintaining the encoded amino acid sequence. Optimized sequences were evaluated across multiple metrics: Codon Adaptation Index (CAI) to assess usage of preferred codons, GC content percentage to ensure sequence stability, Codon Frequency Distribution (CFD) with a threshold of 0.3 to evaluate overall codon bias, percentage of codons within minimum-maximum frequency bounds (PctMinMax, window size 18), and sequence complexity to avoid repetitive or problematic motifs.

Generative Phage Bank: Intelligent Routing and Workflow

The Generative Phage Bank serves as the central orchestration system for PHORAGER. It integrates receptor annotation, similarity-based routing, and phage recommendation workflows. The system accepts bacterial receptor information from clinicians or researchers and automatically determines whether pre-validated phages can be retrieved from the database or if de novo RBP design is required.

Input Specifications

The pipeline accepts input in single-query or batch mode via a text file containing three required fields for each bacterial sample: (1) O-antigen type (string), (2) outer core type (string), and (3) amino acid sequence of the target protein receptor (string). In batch mode, multiple queries are processed sequentially, with each line representing a distinct bacterial sample and its associated receptor information.

Output and Logging

For each query, the system generates comprehensive results including the closest matching bacterial strain or phage from the reference database with corresponding percent identity, the downstream pipeline invoked (receptor typing for known receptors or generative phage design for novel targets), and file paths to all generated outputs. All pipeline operations are recorded in a detailed log file (routing_pipeline.log), and a summary table (summary.tsv) consolidates results across all processed queries.

Routing Workflow

The intelligent routing process begins by parsing and validating all input queries to ensure O-antigen, outer core type, and protein receptor sequences are present and properly formatted. For each validated query, the system writes the protein receptor sequence to a temporary FASTA file and performs sequence alignment against the reference database using DIAMOND, an accelerated BLAST implementation optimized for large-scale searches.

DIAMOND hits are ranked by percent identity to identify the closest match in the database. If the top hit exceeds a predefined similarity threshold (typically 85%), the system routes the query to the receptor typing pipeline, which retrieves pre-validated phages known to bind receptors with high sequence similarity. The system confirms successful output generation and logs the result.

When no database entry exceeds the similarity threshold, the query is classified as novel and routed to the MCMC generative design pipeline. The system identifies the closest phage or bacterial strain as a template and initiates de novo RBD design targeting the novel receptor. Output file locations are stored and logged for downstream experimental validation.

Summary and Reporting

Upon completion of all queries, the system compiles results into a structured summary table capturing sample identifiers, matched database entries, percent identities, pipeline routing decisions, and output file paths. This automated routing ensures that the Generative Phage Bank delivers either immediate phage recommendations for characterized receptors or optimized de novo designs for novel bacterial targets, enabling rapid response to diverse clinical scenarios.

Results

Comprehensive Database of Escherichia Phage RBPs

Our systematic collection and curation pipeline identified 5,550 proteins containing receptor-binding domains similar to P2 and Mu phages from NCBI RefSeq. Clustering analysis at multiple similarity thresholds revealed the sequence diversity within this dataset: 95% identity/coverage clustering yielded 191 representative sequences, 90% thresholds produced 88 clusters, and 80% thresholds resulted in 45 clusters. This hierarchical clustering demonstrates substantial sequence variation among Escherichia phage RBPs while identifying conserved sequence families suitable for template-based design.

Generative Phage Bank Architecture and Workflow

We developed PHORAGER, a computational pipeline for de novo design of bacteriophage receptor-binding proteins with retargeted specificity. The workflow integrates generative protein design, structure prediction, and iterative optimization through Markov Chain Monte Carlo sampling to generate novel phage variants capable of binding user-specified bacterial surface receptors. The complete system encompasses both computational sequence generation and experimental validation pathways, enabling rapid iteration between in silico design and wet-lab characterization. (Figure 1)

Batch 1: Direct Generation of Glycan-Targeting Receptor-Binding Proteins

Our initial design strategy employed ESM3 for sequence generation followed by Boltz-2 structure prediction to create receptor-binding proteins targeting truncated R1 lipopolysaccharide glycans. We generated and evaluated the top ten candidate sequences derived from the P2 phage gpH tail fiber protein, each redesigned to recognize the truncated R1 glycan while maintaining structural compatibility with the TFA chaperone protein required for proper folding and assembly.

BLASTp analysis against the NCBI nr database confirmed that all ten generated sequences exhibited minimal homology to known proteins, with sequence identities below 40% to any characterized protein. This substantial divergence from natural sequences demonstrates the generative capacity of our approach to explore novel regions of sequence space beyond the boundaries of existing phage diversity. Among the ten candidates, two sequences (P2_gpH_2 and P2_gpH_8) exhibited favorable structural properties upon modeling with Boltz-2, including stable receptor-binding domain conformations and proper positioning of the truncated R1 glycan within the binding pocket. Both designs maintained the critical interface with the TFA chaperone protein, suggesting preserved assembly competence. These two candidates were selected for experimental synthesis and downstream validation.

Figure 2. Validation and structural characterization of glycan receptor-binding designs (Batch 1)

**(A)** BLASTp analysis of top 10 generated sequences against the NCBI nr database confirms sequence novelty. **(B)** Structure of P2_gpH_2 RBD (blue) in complex with TFA chaperone (red) and truncated R1 glycan (black). **(C)** Structure of P2_gpH_8 RBD (blue) with TFA chaperone (red) and truncated R1 glycan (black). Both sequences passed all validation criteria and were experimentally synthesized.

Batch 1: Protein Receptor-Targeting Designs

Extending our generative approach to protein receptors, we designed bacteriophage tail fibers to target the outer membrane porin OmpC instead of wild-type LamB. This second batch utilized HK97 and Lambda phage receptor-binding domains as structural scaffolds, generating 60 sequences, out of which the top eight novel sequences were chosen. The trimeric architecture of these designs, essential for functional phage tail fiber assembly, was preserved throughout the generation process.

Structural modeling revealed that all eight candidates formed stable trimeric assemblies with appropriate geometry for engaging the OmpC trimer. The binding interfaces incorporated coordinated metal ions, consistent with known mechanisms of phage-host recognition involving divalent cations. BLASTp validation confirmed sequence novelty across all eight designs, with no significant matches to characterized proteins in public databases. The successful generation of multiple structurally plausible candidates targeting a protein receptor demonstrated the generalizability of our approach across different receptor types and phage scaffolds.

Figure 3. Protein receptor-binding designs and structural models (Batch 1).

**(A)** BLASTp validation of top 8 generated sequences against the NCBI nr database. **(B)** Structural models showing trimeric assemblies of HK97 and Lambda RBDs (blue) bound to trimeric OmpC protein receptor (orange), with coordinated ions (green).

Batch 2: MCMC-Based Iterative Optimization for Enhanced Glycan Binding

While Batch 1 demonstrated proof-of-concept for generative phage design, we hypothesized that iterative optimization through MCMC sampling would yield sequences with superior binding energetics. We applied this approach to three distinct receptor-binding proteins (Mu gp49, Mu gp52, and P2 gpH) initially evolved to bind E. coli K12 lipopolysaccharide, with the objective of retargeting them toward truncated R1 glycans. For each protein, we conducted three independent MCMC chains of 10,000 iterations each, using simulated annealing to balance exploration and exploitation of sequence space.

The MCMC sampling dynamics revealed characteristic convergence behavior across all three proteins. Acceptance rates declined from initial values of 50-60% to stabilized rates of 20-30% by the end of sampling, consistent with progressive refinement toward local optima in the energy landscape. Mu gp49 exhibited a mean acceptance rate of 34.9% across the three chains, while Mu gp52 showed slightly higher acceptance at 40.3%, and P2 gpH demonstrated 33.9% acceptance. The gradual reduction in acceptance rates indicates that the algorithm successfully transitioned from broad exploration of sequence space to focused optimization around high-affinity binding solutions.

Figure 4. MCMC sampling dynamics for Mu gp49 RBP optimization.

**(A)** Acceptance rate decay across three independent MCMC chains over 10,000 iterations, indicating convergence toward optimal sequence space. **(B)** Overall acceptance statistics across chains, with mean acceptance rate of 34.9%.

Energy trajectory analysis for Mu gp49 revealed convergence to stable, high-energy states representing enhanced binding to the truncated R1 target. Critically, all three independent chains achieved final binding energies to R1 that exceeded the initial wild-type binding energies to K12, demonstrating successful retargeting. The convergence of independent chains to similar energy ranges provides confidence that the optimization identified robust binding solutions rather than spurious local minima. This inversion of binding preferences, from K12 to R1, validates the capacity of MCMC-based optimization to fundamentally reshape receptor specificity.

Figure 5. Energy landscape analysis for Mu gp49 target retargeting.

**(A)** Energy trajectories for three independent MCMC chains showing convergence to stable high-affinity states. **(B)** Comparison of binding energies between wild-type K12 receptor and truncated R1 target receptor. All chains successfully optimized for target binding, achieving higher final energies for R1 compared to initial K12 binding energies.

Similar patterns emerged for Mu gp52, with all three chains converging to high-energy states and achieving R1 binding energies that surpassed wild-type K12 values. The higher overall acceptance rate for Mu gp52 compared to Mu gp49 suggests a smoother energy landscape or greater mutational tolerance, potentially reflecting structural differences between the two receptor-binding domains. Nevertheless, the consistent retargeting across both proteins indicates that our MCMC approach is robust to variations in starting scaffold architecture.

Figure 6. MCMC sampling dynamics for Mu gp52 RBP optimization.

**(A)** Acceptance rate trajectories for three independent chains demonstrating convergence behavior. **(B)** Cumulative acceptance statistics showing mean acceptance rate of 40.3% across all chains.

Figure 7. Energy landscape analysis for Mu gp52 target retargeting.

**(A)** Energy convergence profiles for three MCMC chains. **(B)** Wild-type K12 versus truncated R1 target binding energy comparison. Optimization successfully drives all chains toward enhanced R1 binding, with final target energies exceeding wild-type energies.

The P2 gpH optimization recapitulated the successful retargeting observed for the Mu proteins, with acceptance dynamics and energy convergence profiles consistent with effective exploration and optimization. Across all three proteins (Mu gp49, Mu gp52, and P2 gpH), the MCMC sampling achieved the dual objectives of generating novel sequences with minimal homology to natural proteins while simultaneously enhancing binding affinity for the target glycan receptor.

Figure 8. MCMC sampling dynamics for P2 gpH RBP optimization.

**(A)** Acceptance rate evolution across three independent chains. **(B)** Overall acceptance rate of 33.9% across all sampling chains.

Figure 9. Energy landscape analysis for P2 gpH target retargeting.

**(A)** Energy trajectories showing convergence across three independent chains. **(B)** Binding energy comparison between wild-type K12 and truncated R1 target. Consistent optimization toward enhanced R1 binding across all chains, with target energies surpassing wild-type values.

Clustering Analysis Reveals Progressive Energy Optimization

To identify the most promising candidate sequences from the MCMC trajectories, we performed hierarchical clustering on the sampled sequences. Analyzing clusters on a per-chain basis revealed a clear positive correlation between mean cluster iteration number and mean cluster energy. Clusters formed from later-sampled sequences consistently exhibited higher binding energies compared to early-iteration clusters, confirming that the MCMC algorithm progressively discovered improved binding solutions throughout the sampling process. This trend held across all receptor-binding proteins and all independent chains, demonstrating the systematic optimization behavior of the algorithm.

Figure 10. Per-chain cluster analysis reveals energy-iteration relationship.

**(A)** Mean cluster energy versus mean cluster iteration for individual chains across all RBPs. Later-sampled clusters exhibit higher binding energies, indicating progressive optimization. **(B)** Aggregate analysis across all chains and RBPs confirms positive correlation between iteration number and cluster energy.

We extended this analysis by pooling all three chains for each protein and performing combined clustering. This approach increased statistical power and enabled identification of consensus binding solutions that emerged across independent sampling runs. The combined clustering analysis reinforced the positive relationship between iteration number and binding energy, with later clusters representing the most optimized sequences. The consistency of this trend across both per-chain and combined analyses validates the robustness of our clustering approach for candidate selection.

Figure 11. Combined clustering analysis across MCMC chains.

**(A)** Mean cluster energy versus iteration for pooled chains within each RBP. Later clusters correspond to higher-energy local maxima. **(B)** Global trend across all RBPs demonstrating that sampling progression correlates with improved binding energies.

From the combined clustering, we selected the top nine clusters based on mean binding energy and cluster size. Examination of the energy distributions within each cluster revealed varying degrees of sequence diversity, with some clusters tightly focused around a single binding solution and others encompassing a broader ensemble of related sequences. Importantly, all nine clusters preferentially contained high-energy sequences, confirming that our selection criteria successfully enriched for optimized binding candidates. The diversity of solutions across clusters suggests multiple distinct mechanisms for achieving enhanced R1 binding, providing a rich pool of sequences for experimental validation.

Figure 12. Energy distributions of top candidate sequence clusters.

Energy profiles for sequences within the top 9 clusters selected from combined clustering analysis. Each panel represents a distinct cluster, showing variation in sequence diversity and energy distributions. Higher-energy sequences are preferentially represented across clusters.

Validation of Sequence Novelty and Structural Diversity

To confirm that the optimized sequences represent genuinely novel phage designs rather than convergence toward known natural variants, we performed BLASTp analysis of the representative sequences from each of the top nine clusters. All representatives exhibited minimal homology to characterized proteins in the NCBI nr database, with no matches exceeding 60% sequence identity. This stark contrast to the control BLASTp analysis of wild-type template sequences, which showed high homology to known phage proteins as expected, demonstrates that our optimization process successfully generated novel sequence space while maintaining the structural constraints necessary for receptor binding.

Figure 13. Sequence novelty validation via homology search.

**(A)** BLASTp analysis of representative sequences from top 9 clusters against the NCBI nr database reveals minimal homology to known proteins, confirming sequence novelty. **(B)** Control BLASTp analysis of wild-type template sequences confirms expected homology to characterized proteins.

Structural modeling of the nine representative sequences revealed substantial conformational diversity despite the shared objective of binding the truncated R1 glycan. The designs exhibited variations in loop conformations, side chain orientations, and overall binding pocket architecture, suggesting multiple structural strategies for glycan recognition. This structural diversity, combined with sequence novelty, positions these candidates as valuable starting points for experimental characterization and further optimization. The successful generation of multiple distinct binding solutions increases the likelihood that at least several candidates will exhibit functional activity in vitro.

Figure 14. Structural models of top candidate sequences.

Three-dimensional structures of representative sequences from each of the top 9 clusters, illustrating structural diversity of designed receptor-binding proteins.

Summary of Results

Our two-batch approach to generative phage design yielded complementary insights into computational retargeting strategies. Batch 1 demonstrated that direct generation using ESM3 can produce structurally plausible candidates for both glycan and protein receptors, with successful designs advancing to experimental synthesis. Batch 2 extended this foundation through MCMC-based iterative optimization, systematically enhancing binding energies while maintaining sequence novelty. The convergence of independent MCMC chains, progressive energy optimization revealed through clustering analysis, and structural diversity of final candidates collectively establish PHORAGER as a viable platform for de novo bacteriophage engineering. These computational designs now await experimental validation to assess binding affinity, specificity, and ultimately, the capacity to redirect phage infection toward designer-specified bacterial hosts.

Docking: Control-Driven Parameter Optimization

As previously described, we began with some benchmarking through positive and negative controls, the computational equivalent of calibrating a mass spectrometer before running precious samples. We deployed an automated sweep across the combinatorial space, testing restraint strategies (cmrest vs. surfrest), symmetry force constants (ksym: 50 vs. 100), hexamer body restraint scaling (unambig_scale: 100 vs. 200), and clustering thresholds (clust_cutoff: 2.0 vs. 2.5 Å).

Wild-type Mu and P2 RBDs docked against their cognate K12 glycans delivered the expected overall favourable scores: HADDOCK scores ranging from -127.3 to -79.41 arbitrary units respectively. The physics worked, but could it discriminate?

The negative control delivered validation and a warning. Docking Mu gp49 against mismatched R1 glycans produced mean scores of -68.4 ± 21.7 a.u., clearly worse than positive controls. Yet the distributions overlapped: the best negative runs encroached upon the worst positive runs. This twilight zone, where the scoring function wavered between “binder” and “non-binder,” made any single parameter set dangerously ambiguous.

This overlap transformed parameter selection from perfunctory validation into active optimization. The optimal regime emerged at the intersection of competing constraints: cmrest=true, cmtight=false, ksym=50, unambig_scale=100 for ab-initio positioning, combined with w_vdw=1.0, w_elec=0.2, w_desolv=1.0, w_bsa=-0.05 for scoring.

The calibrated docking pipeline was then used to evaluate the top structures from MCMC and decide which ones would be tested in Wet Lab.

Discussion

PHORAGER stands at the boundary between computational prediction and biological realization. It is a system built not only to model phage receptor-binding proteins (RBPs) but to engineer them. By combining structural prediction, generative design, and statistical optimization, PHORAGER reduces the timeline for phage retargeting from months to days. It demonstrates that machine learning and molecular biology need not operate in parallel lanes. When tightly coupled, they can navigate the staggering combinatorial complexity of protein sequence space with unprecedented purpose. Yet, in this synthesis of algorithms and evolution, PHORAGER also exposes the widening gap between computational potential and the biochemical subtleties of viral infection.

Bridging Computational Design and Biological Reality

PHORAGER leverages foundation models for sequence and structure (ESM, AlphaFold3 and Boltz-2) coupled with Markov-chain Monte Carlo (MCMC) sampling to traverse the enormous combinatorial landscape of receptor-binding proteins (RBPs). These RBPs, typically tail fibers or tailspike proteins, dictate phage host range by recognizing bacterial glycan or protein receptors.

Our pipeline demonstrates that combining sequence-level generative models with structural prediction and in silico affinity estimation can identify plausible RBPs for new receptors. The use of AlphaFold3 and Boltz-2 enables rapid three-dimensional prediction of designed sequences, and Boltz-2’s affinity module offers a thousand-fold speedup relative to free-energy perturbation (FEP) while approaching FEP accuracy. However, there are fundamental limitations that our current implementation cannot overcome.

Limitations of Current Models and Workflows

The performed operations and trial runs encompass a robust implementation of analytical and predictive techniques towards the realm of phages. We do acknowledge limitations that help inform the development and augmentation process for these steps.

Data scarcity and RBP diversity. RBPs diverge dramatically across phage families. Even analogues binding similar receptors may lack detectable sequence homology. Protein language models trained on homology-rich datasets (ESM or other transformers) thus struggle to generalize across these families, limiting their ability to explore truly novel RBP designs. Existing phage matching models mainly infer host range at the genus level, often failing for major ESKAPE pathogens. Resistance and receptor divergence also confound genome-based predictions.

Structural prediction of glycans and membrane interactions. AlphaFold3 treats ligands via a RDKit-based pipeline and often misassigns glycan conformations. It lacks meaningful metrics to penalize incorrect glycan torsions or linkage stereochemistry, and high-confidence scores can mask false glycan conformers. This necessitates evaluation of both high- and low-ranked models. Thus, the structural models we rely on provide accurate protein folds but poor glycan recognition, which is critical for RBPs that bind lipopolysaccharide or capsular polysaccharide receptors.

Computational complexity and throughput. Iterative MCMC sampling coupled with Boltz-2 scoring is computationally intensive. HADDOCK and similar physics-based docking tools provide valuable validation but are orders of magnitude slower and sensitive to starting poses, limiting high-throughput use. This slows the active-learning loop needed to refine the generative model with experimental feedback.

Limited host-range validation and resistance. Our experimental validation focused on E. coli targets. Yet clinical infections involve a broad spectrum of Gram-positive and Gram-negative bacteria. Furthermore, repeated use of engineered phages could select for resistance. Existing generative pipelines do not optimize for resistance avoidance, immunogenicity or fitness costs in bacteria, which could undermine long-term efficacy.

Resistance concerns. Overuse or repeated cycling of engineered phages risks bacterial resistance evolution, and current generative pipelines do not explicitly account for resistance avoidance.

As PHORAGER advances, a deliberate choice was made not to incorporate RFdiffusion into its generative loop. While RFdiffusion has emerged as a state-of-the-art framework for unconditional protein backbone generation, its design paradigm is misaligned with the requirements of receptor-binding protein (RBP) retargeting. The model’s diffusion prior assumes a smooth, low-dimensional energy landscape. This assumption breaks down in the highly discontinuous and chemically diverse interfaces that define phage-host recognition, particularly in glycan-binding RBPs. Unlike enzyme active sites or antibody paratopes, RBP interfaces often arise from convergent evolutionary solutions that lack conserved geometric motifs, making RFdiffusion’s motif-conditioned generation poorly transferable.

Moreover, RFdiffusion requires explicit specification of binding sites or template backbones for conditioning, resources that are unavailable for most phage RBPs due to their extreme sequence and structural diversity. By contrast, PHORAGER’s design architecture, anchored in Boltz-2’s co-learning of structure and binding affinity, permits de novo interface exploration without pre-defined templates. This distinction is central: PHORAGER seeks not to reproduce known motifs but to invent plausible binding geometries compatible with evolving bacterial receptors.

From a computational perspective, RFdiffusion’s diffusion-based generative process entails thousands of forward-reverse inference steps, producing high GPU memory overhead and slow convergence that hinder large-scale iterative sampling. In contrast, the probabilistic MCMC sampling employed by PHORAGER can efficiently traverse vast sequence spaces, integrating feedback from Boltz-2 affinity scores and ESM embeddings with tractable computational cost.

Finally, preliminary trials revealed that RFdiffusion-generated backbones frequently failed to produce stable folds when adapted to multi-domain or glycan-bound RBP architectures. This reflects priors optimized for soluble monomeric proteins rather than flexible, surface-tethered viral attachment systems. For these reasons, PHORAGER adopts a sampling-centric framework that combines structure-aware scoring, adaptive refinement, and experimental feedback loops. It trades unconditional generativity for biologically grounded precision.

Future Directions and Broader Implications

PHORAGER’s ultimate promise lies in its adaptability. By iteratively integrating experimental data, expanding training datasets, and refining objective functions, it could generalize beyond E. coli and deliver truly personalized phage therapies. We outline several key directions:

Expand and diversify datasets. Building a comprehensive database of RBP sequences, glycan receptors and measured affinities, including rare pathogens and multidrug-resistant strains, will reduce the out-of-distribution gap for language and structural models. Coupling synthetic data augmentation with curated experimental results will help models learn RBP-receptor determinants that are invisible to current training sets.

Enhance the active-learning loop. We plan to implement a Cross-Entropy Method (CEM) optimization strategy. CEM iteratively samples a batch of sequences, selects the top-performing “elite” candidates based on experimental or predicted scores, and updates the generative model to increase the probability of sampling high-function variants. By using multiple performance metrics, including wet-lab growth curves, binding assays, and Boltz-2 scores, we can guide the model toward near-optimal solutions. A continuously growing database of positive and negative experimental outcomes will allow us to perform multiple sequence alignments on successful RBPs to identify conserved residues for freezing, while BLAST can remove designs similar to known failures.

This method is useful when “the probability of locating an optimal or near-optimal solution using naive random search is a rare-event probability,” which applies to AI-driven protein design given the very large sample space. For the standard 20 amino acids and up to n masked residues, the sample space includes 20ⁿ sequences. Since the sample space is finite, this is a combinatorial optimization problem, where we have a tunable mechanism (ESM) that generates sequences, an objective function (Wet-Lab Validation) that outputs whether the sequence is successful, and a noisy function (Boltz-2) that predicts whether the sequence will be successful. CEM is an iterative process that involves generating a batch of sequences, and then updating the generative mechanism to reduce the sample space, essentially guiding our model towards a near-optimal solution. To update the generative mechanism, we must determine which generated sequences are “elite samples,” which we can determine from both Wet-Lab Validation and Boltz-2. Then, we can apply MSA to determine which residues were most common among these selected sequences, and update our generative function accordingly. The challenge with this method is that for MSA to return any useful information, it needs to be provided with a large number of sequences.

To overcome this challenge, we plan to implement a continuously growing database of wet-lab tested sequences. Sequences generated by the model are sent to the wet lab for validation, and once results are finalized, for each sequence we get a binary value: 0 if the sequence was unsuccessful, and 1 if the sequence was successful (phage spotting efficiency threshold to determine if it goes in positive or negative database). We store these results in a database. On the next iteration of generations, we will first apply MSA to successful sequences to obtain the most agreed upon residue values at each masked position, assigning a probability of freezing the residue based on how strong the agreeance was (high agreeance leads to high probability of freezing the residue). Then, based on the probabilities at each position, certain residues will be accepted into the MCMC. Next, proteins generated by this more constrained MCMC will be compared to the unsuccessful proteins from the wet lab database using BLAST, and if they have a high similarity they will be discarded. This effectively implements CEM, as each subsequent batch is generated from a smaller sample space that is more likely to approach a near-optimal solution.

Optimize sampling and multi-objective scoring. Future iterations should adopt adaptive MCMC or reinforcement-learning strategies to navigate sequence space more efficiently, reducing computational cost. Fitness functions must evolve from simple binding affinity metrics to multi-objective optimization that balances binding strength, folding stability, immunogenicity and resistance potential. Fast stability predictors such as FoldX or Rosetta can pre-screen candidates before Boltz-2 evaluation, and machine-learning-accelerated docking surrogates may allow high-throughput estimation of binding modes.

Expand host validation and assess resistance. Engineered RBPs must be tested in phages infecting diverse clinical isolates. Longitudinal studies should monitor whether repeated exposure to retargeted phages leads to receptor modifications or immunological responses. Combining PHORAGER designs with anti-resistance strategies, such as targeting essential receptors or alternating receptor targets, could mitigate resistance development.

Toward ethical and equitable deployment. Computational phage design raises questions about access and oversight. Phage therapies, especially those designed via AI, must be accompanied by regulatory frameworks to ensure safety, prevent misuse, and make treatments available in regions most burdened by antibacterial resistance (ABR). A global generative phage bank, shared openly with healthcare providers, could democratize access while enabling rapid response to emergent pathogens.

In sum, PHORAGER does not simply predict. It proposes. It charts a map across the vast unknowns of viral-host recognition, where every designed sequence is both a hypothesis and a candidate therapy. By coupling generative design with continuous experimental feedback, PHORAGER moves phage engineering from empirical discovery toward predictive control. This is an essential leap as we confront a post-antibiotic world.

References

[1]

Naghavi, M., Vollset, S. E., Ikuta, K. S., Swetschinski, L. R., Gray, A. P., Wool, E. E., Aguilar, G. R., Mestrovic, T., Smith, G., Han, C., & others. (2024). Global burden of bacterial antimicrobial resistance 1990–2021: a systematic analysis with forecasts to 2050. The Lancet, 404(10459), 1199–1226.

[2]

Schooley, R. T., Biswas, B., Gill, J. J., Hernandez-Morales, A., Lancaster, J., Lessor, L., Barr, J. J., Reed, S. L., Rohwer, F., Benler, S., & others. (2017). Development and use of personalized bacteriophage-based therapeutic cocktails to treat a patient with a disseminated resistant Acinetobacter baumannii infection. Antimicrobial Agents and Chemotherapy, 61(10), 10–1128.

[3]

Olawade, D. B., Fapohunda, O., Egbon, E., Ebiesuwa, O. A., Usman, S. O., Faronbi, A. O., & Fidelis, S. C. (2024). Phage therapy: A targeted approach to overcoming antibiotic resistance. Microbial Pathogenesis, 197, 107088.

[4]

Doud, M. B., Robertson, J. M., & Strathdee, S. A. (2025). Optimizing phage therapy with artificial intelligence: a perspective. Frontiers in Cellular and Infection Microbiology, 15, 1611857.

[5]

Aslam, S., Lampley, E., Wooten, D., Karris, M., Benson, C., Strathdee, S., & Schooley, R. T. (2020). Lessons learned from the first 10 consecutive cases of intravenous bacteriophage therapy to treat multidrug-resistant bacterial infections at a single center in the United States. Open Forum Infectious Diseases, 7(9), ofaa389.

[6]

Pirnay, J.-P., Djebara, S., Steurs, G., Griselain, J., Cochez, C., De Soir, S., Glonti, T., Spiessens, A., Vanden Berghe, E., Green, S., & others. (2024). Personalized bacteriophage therapy outcomes for 100 consecutive cases: a multicentre, multinational, retrospective observational study. Nature Microbiology, 9(6), 1434–1453.

[7]

Ofir, G., & Sorek, R. (2018). Contemporary phage biology: from classic models to new insights. Cell, 172(6), 1260–1270.

[8]

Ng, R. N., Grey, L. J., Vaitekenas, A., McLean, S. A., Rudrum, J. D., Laucirica, D. R., Poh, M. W.-P., Hillas, J., Winslow, S. G., Iszatt, J. J., & others. (2021). Development and validation of a miniaturized bacteriophage host range screening assay against antibiotic resistant Pseudomonas aeruginosa. Journal of Microbiological Methods, 190, 106346.

[9]

Molina, F., Menor-Flores, M., Fernández, L., Vega-Rodrı́guez, M. A., & Garcı́a, P. (2022). Systematic analysis of putative phage-phage interactions on minimum-sized phage cocktails. Scientific Reports, 12(1), 2458.

[10]

Klumpp, J., Dunne, M., & Loessner, M. J. (2023). A perfect fit: Bacteriophage receptor-binding proteins for diagnostic and therapeutic applications. Current Opinion in Microbiology, 71, 102240.

[11]

Dowah, A. S., & Clokie, M. R. (2018). Review of the nature, diversity and structure of bacteriophage receptor binding proteins that target Gram-positive bacteria. Biophysical Reviews, 10(2), 535–542.

[12]

Gaborieau, B., Vaysset, H., Tesson, F., Charachon, I., Dib, N., Bernier, J., Dequidt, T., Georjon, H., Clermont, O., Hersen, P., & others. (2024). Prediction of strain level phage–host interactions across the Escherichia genus using only genomic information. Nature Microbiology, 9(11), 2847–2861.

[13]

Currin, A., Swainston, N., Day, P. J., & Kell, D. B. (2015). Synthetic biology for the directed evolution of protein biocatalysts: navigating sequence space intelligently. Chemical Society Reviews, 44(5), 1172–1239.

[14]

Dion, M. B., Oechslin, F., & Moineau, S. (2020). Phage diversity, genomics and phylogeny. Nature Reviews Microbiology, 18(3), 125–138.

[15]

Abramson, J., Adler, J., Dunger, J., Evans, R., Green, T., Pritzel, A., Ronneberger, O., Willmore, L., Ballard, A. J., Bambrick, J., & others. (2024). Accurate structure prediction of biomolecular interactions with AlphaFold 3. Nature, 630(8016), 493–500.

[16]

Google DeepMind AlphaFold Team, & Isomorphic Labs Team. (2023). Performance and structural coverage of the latest, in-development AlphaFold model [Techreport]. DeepMind. https://storage.googleapis.com/deepmind-media/DeepMind.com/Blog/a-glimpse-of-the-next-generation-of-alphafold/alphafold_latest_oct2023.pdf

[17]

Passaro, S., & others. (2025). Boltz-2: Towards Accurate and Efficient Binding Affinity Prediction. bioRxiv. https://doi.org/10.1101/2025.06.14.659707

[18]

Dissanayake, D., Wijewardana, T., Gunawardena, G., & Poxton, I. (2008). Distribution of lipopolysaccharide core types among avian pathogenic Escherichia coli in relation to the major phylogenetic groups. Veterinary Microbiology, 132(3–4), 355–363.

[19]

Hantke, K. (2020). Compilation of Escherichia coli K-12 outer membrane phage receptors–their function and some historical remarks. FEMS Microbiology Letters, 367(2), fnaa013.

[20]

Krokidis, M. G., Koumadorakis, D. E., Lazaros, K., Ivantsik, O., Exarchos, T. P., Vrahatis, A. G., Kotsiantis, S., & Vlamos, P. (2025). AlphaFold3: an overview of applications and performance insights. International Journal of Molecular Sciences, 26(8), 3671.

[21]

Wee, J., & Wei, G.-W. (2024). Benchmarking AlphaFold3’s protein-protein complex accuracy and machine learning prediction reliability for binding free energy changes upon mutation. ArXiv, arXiv-2406.

[22]

Notin, P., Rollins, N., Gal, Y., Sander, C., & Marks, D. (2024). Machine learning for functional protein design. Nature Biotechnology, 42(2), 216–228.

[23]

Ranaudo, A., Giulini, M., Pelissou Ayuso, A., & Bonvin, A. M. (2024). Modeling protein–glycan interactions with haddock. Journal of Chemical Information and Modeling, 64(19), 7816–7825.

[24]

Bender, B. J., Gahbauer, S., Luttens, A., Lyu, J., Webb, C. M., Stein, R. M., Fink, E. A., Balius, T. E., Carlsson, J., Irwin, J. J., & others. (2021). A practical guide to large-scale docking. Nature Protocols, 16(10), 4799–4832.

[25]

Renardy, M., Joslyn, L. R., Millar, J. A., & Kirschner, D. E. (2021). To Sobol or not to Sobol? The effects of sampling schemes in systems biology applications. Mathematical Biosciences, 337, 108593.

[26]

Read, M. N., Alden, K., Timmis, J., & Andrews, P. S. (2020). Strategies for calibrating models of biology. Briefings in Bioinformatics, 21(1), 24–35.

[27]

Sutthibutpong, T., Posansee, K., Liangruksa, M., Termsaithong, T., Piyayotai, S., Phitsuwan, P., Saparpakorn, P., Hannongbua, S., & Laomettachit, T. (2024). Combining Deep Learning and Structural Modeling to Identify Potential Acetylcholinesterase Inhibitors from Hericium erinaceus. ACS Omega, 9(14), 16311–16321.

[28]

Dauparas, J., Anishchenko, I., Bennett, N., Bai, H., Ragotte, R. J., Milles, L. F., Wicky, B. I., Courbet, A., de Haas, R. J., Bethel, N., & others. (2022). Robust deep learning–based protein sequence design using ProteinMPNN. Science, 378(6615), 49–56.

[29]

Wu, M.-H., Xie, Z., & Zhi, D. (2025). A Folding-Docking-Affinity framework for protein-ligand binding affinity prediction. Communications Chemistry, 8(1), 1–9.