Software
PHORAGER is the first open-source computational platform for engineering bacteriophage receptor-binding proteins through AI-driven design and physics-based validation. Created by the iGEM Toronto team, the project aims to accelerate phage therapy by using machine learning and computational biology to reprogram phage host specificity, with the goal of reducing screening time from months to days.
Overview
PHORAGER provides tools and models for the computational design, validation, and optimization of phage receptor-binding proteins (RBPs). By combining foundation models (ESM3, Boltz-2), statistical optimization (MCMC with simulated annealing), and physics-based docking (HADDOCK3), the pipeline lets researchers engineer personalized therapeutic phages against antibiotic-resistant bacterial strains. The platform covers data curation, generative design, structural prediction, and molecular docking.
Installation
To install PHORAGER for development, clone the repository from either GitHub or the iGEM GitLab.
From GitHub:
git clone https://github.com/igem-toronto/protein-design-sandbox.git
cd protein-design-sandbox
pip install -r requirements.txt
pip install -e .
From iGEM GitLab (for submission purposes):
git clone https://gitlab.igem.org/2025/software-tools/toronto.git
cd toronto
pip install -r requirements.txt
pip install -e .
CUDA-accelerated Installation (Linux):
pip install -r requirements.txt
# Quote the extras so shells such as zsh do not expand the square brackets
pip install -e ".[cuda]"
Repository Structure
The PHORAGER project is organized into several key components, as outlined below:
scripts/                                Main orchestration and analysis pipelines
    main.py                             Complete ESM3 → Boltz2 → Analysis workflow
    esm3_hallucination*.py              ESM3-guided sequence generation
    boltz2_generation.py                Boltz2 structure prediction
    post_boltz_processing.py            Results processing and summary extraction
    process_rbp_receptor_complexes.py   Complex characterization
    routing_script.py                   Intelligent phage bank query interface
    BLASTn_LPS_typing.py                LPS typing and bacterial characterization
    controls*.py                        Control experiments and validation
mcmc/                                   Iterative optimization and sampling protocols
    mcmc_rbp_phages_wild_esm3.py        Main MCMC optimization pipeline
    boltz2_generation.py                Boltz2 integration for MCMC
    extract_affinity_data.py            Affinity data extraction
    MCMC_Analysis.ipynb                 Interactive analysis notebooks
docking/                                HADDOCK3 parameter sweeps and validation runs
    haddock3_docking_prep_protein.py    Protein preparation
    haddock3_docking_prep_glycan.py     Glycan preparation
    parameter_optimization/             Parameter sweep utilities
data/                                   Glycan structures, RBP complexes, processed datasets
    rbp_complexes/                      Reference protein complex structures
    glycans/                            Glycan structure libraries (R1, R2, R3, R4, K12)
    decomplexed/                        Separated protein and glycan components
    docking/                            Docking input files and results
notebooks/                              Interactive analysis and visualization tools
    ESM3 experiments                    Parameter exploration
    Boltz2 analysis                     Structure visualization
    Wiki figures                        Documentation generation
Hardware/                               Custom PCB and CAD files for experimental validation
    pcbGerberFiles.zip                  Custom PCB layouts for protein detection
    cadFiles.zip                        3D models for laboratory equipment
docs/                                   Detailed methodology and scientific rationale
    README_RBP_receptor_complexes.md    Complex analysis methods
results/                                Generated sequences, structural predictions, docking scores
This layout mirrors the development pipeline: data curation, generative design, physics-based validation, and experimental synthesis of engineered phages.
Repository Setup
The PHORAGER repository is hosted on GitHub and mirrored on the iGEM GitLab for the purposes of our submission.
To clone the repository:
git clone https://github.com/igem-toronto/protein-design-sandbox.git
Or from iGEM GitLab:
git clone https://gitlab.igem.org/2025/software-tools/toronto.git
To set up the environment, install dependencies with:
# Using conda
conda env create -f envs/environment.yaml
conda activate protein-design-sandbox
# Or using pip
pip install -r requirements.txt
pip install -e .
Required Environment Variables
# ESM3 API access (required)
export ESM3_TOKEN="your_esm3_token_here"
# Optional: AWS credentials for cloud processing
export AWS_ACCESS_KEY_ID="your_key"
export AWS_SECRET_ACCESS_KEY="your_secret"
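A missing token is easier to diagnose before a long run starts than partway through it. The following is a minimal sketch of a pre-flight check, not part of the PHORAGER codebase: the function names `missing_variables` and `check_environment` are hypothetical, and only the variable names come from the list above.

```python
import os

# Variable names taken from the list above; the grouping is this sketch's own.
REQUIRED_VARS = ("ESM3_TOKEN",)
OPTIONAL_VARS = ("AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY")


def missing_variables(env, names):
    """Return the names that are unset or empty in the given mapping."""
    return [name for name in names if not env.get(name)]


def check_environment(env=None):
    """Raise if a required variable is missing; return the missing optional ones."""
    env = os.environ if env is None else env
    missing = missing_variables(env, REQUIRED_VARS)
    if missing:
        raise RuntimeError(
            "Missing required environment variables: " + ", ".join(missing)
        )
    return missing_variables(env, OPTIONAL_VARS)
```

Calling `check_environment()` with no arguments inspects the current process environment; passing a dictionary makes the check easy to exercise in tests.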
External Tool Dependencies
PHORAGER requires the following external bioinformatics tools:
- BLAST+ (v2.12.0+) - Sequence alignment and homology searches
- DIAMOND (v2.0.15+) - High-throughput sequence alignment
- CD-HIT (v4.8.1+) - Sequence clustering
- HMMER (v3.3+) - HMM-based sequence searches
- HADDOCK3 (v3.0.0+) - Molecular docking (requires Docker)
Install via conda:
# The Bioconda package for DIAMOND is named "diamond"
conda install -c conda-forge -c bioconda blast diamond cd-hit hmmer
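After installation, it can help to confirm that the tools are actually reachable on `PATH`. The sketch below uses `shutil.which` from the Python standard library. The executable names (`blastn`, `diamond`, `cd-hit`, `hmmsearch`, `docker`) are assumptions based on each tool's usual command-line entry point, and `find_missing_tools` is a hypothetical helper rather than something shipped with PHORAGER. Version checks are left out.

```python
import shutil

# Assumed executable names; adjust if your installation uses different ones.
TOOL_EXECUTABLES = {
    "BLAST+": "blastn",
    "DIAMOND": "diamond",
    "CD-HIT": "cd-hit",
    "HMMER": "hmmsearch",
    "Docker": "docker",
}


def find_missing_tools(which=shutil.which, tools=TOOL_EXECUTABLES):
    """Return the tool names whose executable is not found on PATH.

    The lookup function is injectable so the check can be tested
    without the tools being installed.
    """
    return sorted(name for name, exe in tools.items() if which(exe) is None)


if __name__ == "__main__":
    missing = find_missing_tools()
    print("Missing tools:", ", ".join(missing) if missing else "none")
```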
Quick Start
1. Main Protein Design Pipeline
Run the complete ESM3 → Boltz2 → Analysis workflow:
python -m scripts.main --csv-path data/esm-boltz-input.csv \
--structure-dir data/rbp_complexes \
--glycan-dir data/glycans \
--output-dir runs/design_run_1 \
--esm3-model esm3-large-2024-03 \
--num-sequences 10 \
--temperature 1.0 \
--boltz2-cache ~/.cache/boltz2
2. MCMC Parameter Optimization
Run Markov Chain Monte Carlo sampling with simulated annealing:
python mcmc/mcmc_rbp_phages_wild_esm3.py \
--config configs/mcmc_config.yaml \
--output-dir results/mcmc_optimization
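For readers unfamiliar with the technique, the sketch below illustrates Metropolis sampling with a simulated-annealing temperature schedule on a toy sequence problem. It is not the implementation in `mcmc_rbp_phages_wild_esm3.py`: the objective `toy_score`, the target motif, the single-residue proposal, and the geometric cooling schedule are all assumptions chosen to keep the example self-contained. A real run would score candidates with a structure or affinity predictor instead.

```python
import math
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def toy_score(sequence, target="ACDEFGHIKL"):
    """Placeholder objective: number of positions matching a fixed target."""
    return sum(a == b for a, b in zip(sequence, target))


def anneal(start, score=toy_score, steps=2000, t_start=1.0, t_end=0.01, seed=0):
    """Maximize `score` over sequences with Metropolis moves and cooling."""
    rng = random.Random(seed)
    current, current_score = start, score(start)
    best, best_score = current, current_score
    for step in range(steps):
        # Geometric cooling from t_start down to t_end.
        temperature = t_start * (t_end / t_start) ** (step / max(steps - 1, 1))
        # Proposal: replace one random position with a random residue.
        pos = rng.randrange(len(current))
        candidate = current[:pos] + rng.choice(AMINO_ACIDS) + current[pos + 1:]
        candidate_score = score(candidate)
        delta = candidate_score - current_score
        # Metropolis criterion: always accept improvements, and accept
        # worse moves with probability exp(delta / temperature).
        if delta >= 0 or rng.random() < math.exp(delta / temperature):
            current, current_score = candidate, candidate_score
            if current_score > best_score:
                best, best_score = current, current_score
    return best, best_score
```

Early in the run the high temperature lets the chain accept worse sequences and escape local optima; as the temperature falls, the sampler behaves increasingly like greedy hill climbing.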
3. RBP-Receptor Complex Analysis
Process and characterize protein complexes:
python scripts/process_rbp_receptor_complexes.py \
--input-dir data/rbp_complexes \
--output-dir results/complex_analysis
4. HADDOCK3 Docking Preparation
Prepare structures for molecular docking:
# Protein preparation
python docking/haddock3_docking_prep_protein.py \
--input protein.pdb --output prepared_protein.pdb
# Glycan preparation
python docking/haddock3_docking_prep_glycan.py \
--input glycan.pdb --output prepared_glycan.pdb
5. Phage Bank Routing
Query the intelligent phage bank for receptor matching:
python scripts/routing_script.py \
--query bacterial_sample.txt \
--database data/rbp_database/ \
--threshold 0.85 \
--output results/routing_output/
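The exact meaning of `--threshold` is defined by `routing_script.py`. As a rough illustration only, the sketch below assumes it is a minimum similarity score between 0 and 1 and shows how such a cutoff would filter and rank candidate matches. The `Hit` record, its fields, and the `route` function are hypothetical and do not reflect the script's actual data model.

```python
from typing import NamedTuple


class Hit(NamedTuple):
    phage_id: str  # hypothetical identifier of a phage bank entry
    score: float   # hypothetical similarity score in [0, 1]


def route(hits, threshold=0.85):
    """Keep hits scoring at or above the threshold, best first."""
    kept = [hit for hit in hits if hit.score >= threshold]
    return sorted(kept, key=lambda hit: hit.score, reverse=True)
```

Under this assumption, raising the threshold trades recall for precision: fewer candidate phages are returned, but each is a closer match to the query.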
Advanced Usage
Batch Processing
Process multiple protein families in parallel:
python scripts/main_protein_receptors.py \
--config configs/batch_config.yaml \
--parallel-workers 4
HADDOCK3 Parameter Sweep
Systematic optimization of docking parameters:
cd docking/parameter_optimization/
python full_sweep.py --lhs 100 --workers 4 --ncores 8
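Assuming `--lhs 100` requests 100 Latin hypercube samples (an interpretation of the flag name, not confirmed here), the sketch below shows how such a design can be generated in pure Python. The function `latin_hypercube` and the parameter names in the usage note are hypothetical and are not taken from `full_sweep.py` or HADDOCK3.

```python
import random


def latin_hypercube(n_samples, bounds, seed=0):
    """Draw n_samples points from the box described by `bounds`.

    Each parameter range is split into n_samples equal strata, and every
    stratum is used exactly once per parameter.
    """
    rng = random.Random(seed)
    columns = {}
    for name, (low, high) in bounds.items():
        width = (high - low) / n_samples
        # One uniform draw inside each stratum, then shuffle the order so
        # strata of different parameters are paired at random.
        values = [low + (i + rng.random()) * width for i in range(n_samples)]
        rng.shuffle(values)
        columns[name] = values
    return [{name: columns[name][i] for name in bounds} for i in range(n_samples)]
```

For example, `latin_hypercube(100, {"param_a": (0.0, 1.0), "param_b": (10.0, 20.0)})` returns 100 dictionaries, one per docking run. Compared with a full grid, this covers each parameter's range evenly with far fewer runs.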
Cloud Deployment
AWS batch processing for large-scale runs:
python scripts/aws_batch_processing.py \
--s3-bucket your-bucket \
--instance-type g4dn.xlarge
Core Dependencies
Python Packages
- Bioinformatics: biopython, numpy, pandas, pyyaml
- ML/AI: esm (3.x), boltz, torch (CPU/CUDA)
- Structure: rdkit, parmed, pdb-tools, pyparsing
- Utilities: python-dotenv, tqdm, psutil, filelock, boto3
- Visualization: matplotlib, seaborn
Platform-Specific Notes
- RDKit: Use conda-forge for better compatibility across platforms
- CUDA: Install appropriate PyTorch CUDA builds for GPU acceleration
- DIAMOND: Required for routing_script.py (fast sequence alignment)
- Docker: Required for HADDOCK3 docking workflows
Note: PyTorch is not pinned in the requirements. Use the official PyTorch install selector to choose the build that matches your CUDA or CPU configuration.
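Because the PyTorch build is chosen by the user, it is worth confirming which one ended up in the environment. The sketch below is a small diagnostic that relies only on `torch.__version__`, `torch.cuda.is_available()`, and `torch.version.cuda`. The helper name `describe_torch` is hypothetical.

```python
def describe_torch():
    """Report the installed PyTorch build, or say that it is missing."""
    try:
        import torch
    except ImportError:
        return "PyTorch is not installed"
    if torch.cuda.is_available():
        return f"PyTorch {torch.__version__} with CUDA {torch.version.cuda}"
    return f"PyTorch {torch.__version__} (CPU build or no visible GPU)"


if __name__ == "__main__":
    print(describe_torch())
```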
Contributing
We welcome contributions from the community! If you would like to contribute, please check out the repository on GitHub and follow the contributing guidelines. Areas of particular interest include:
- Expansion to additional bacterial species and receptor types
- Integration of additional foundation models
- Optimization of MCMC sampling strategies
- Development of experimental validation protocols
To contribute:
- Fork the repository
- Create a feature branch
- Add tests for new functionality
- Submit a pull request with a detailed description
Citation
If you use this platform in your research, please cite:
iGEM Toronto 2025: PHORAGER - Computational Platform for
Bacteriophage Receptor-Binding Protein Design
License
PHORAGER is licensed under the MIT License.