Engineering

The engineering cycle is an essential part of our team’s work towards developing PHORAGER, EVADE, and other parts of our project (such as our business plan). Multiple iGEM Toronto subteams worked in feedback loops to reach project goals and iterate on designs. Crucial to this process was constant learning from each iteration.

We detail the design-build-test-learn (DBTL) cycles for the dry lab team, the wet lab team, the hardware team, and the entrepreneurship team. An overall diagram of the combined DBTL cycles and their relationships to each other is showcased below (see Fig. 1).

Fig. 1: Combined DBTL workflow for dry lab, wet lab, hardware, and entrepreneurship work.

Dry lab — PHORAGER Pipeline Creation and Validation

Iteration 1

Design

The initial experimental design aimed to establish a generalized pipeline for rapidly screening and selecting receptor-binding proteins (RBPs), phages, and glycans through successive iterations. The first iteration focused on creating a data curation methodology, resulting in a databank of known RBP sequences along with their corresponding binding targets. Major sources included GenBank and various phage banks [1, 2, 3, 4, 5]. Additional data incorporated structural information related to bacterial surface receptors.

A key objective of this phase was validation: ensuring the system could generate viable candidates across defined parameters for RBPs and glycans. To this end, initial trials explored different configurations to assess the resource requirements, timelines, and quality of results needed to confirm the pipeline’s reliability. Given their strong relevance to binding affinity prediction, ESM3 and Boltz-2 were selected to establish a baseline for evaluation. Candidate sequences were first generated using ESM3 and subsequently tested in silico with Boltz-2 to assess structural stability and binding affinity.

Build

The curated sequences were preprocessed by masking the C-terminal receptor-binding domain (RBD) regions to focus generative modeling on these functional segments while preserving surrounding sequence context. Using this approach, ESM3, a transformer-based protein language model, was applied to regenerate and optimize masked RBDs. By sampling masked amino acids based on learned sequence–structure–function relationships, ESM3 explores diverse yet plausible RBD variants guided by the context of the full protein.
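The masking step described above can be sketched in a few lines of Python. This is a minimal illustration, not the team's preprocessing code: the function name `mask_c_terminal`, the `rbd_length` parameter, and the `_` mask character are all assumptions, and the sequence is a toy string rather than a real RBP. Substitute whatever mask token the generative model expects.

```python
def mask_c_terminal(sequence: str, rbd_length: int, mask_token: str = "_") -> str:
    """Replace the last `rbd_length` residues with mask tokens.

    The N-terminal context is left untouched so that a masked language
    model can condition on it when filling in the masked region.
    """
    if not 0 <= rbd_length <= len(sequence):
        raise ValueError("rbd_length must be between 0 and len(sequence)")
    keep = len(sequence) - rbd_length
    return sequence[:keep] + mask_token * rbd_length


# Toy sequence for illustration only (hypothetical, not a curated RBP).
example = "MKTAYIAKQRQISFVKSHFSRQ"
masked = mask_c_terminal(example, rbd_length=8)
print(masked)
```

In practice the RBD boundary would come from domain annotation for each protein rather than a fixed length, so `rbd_length` would be computed per sequence.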

Generated sequences were then evaluated with Boltz-2, which jointly predicts 3D protein structure and binding affinity against both wild-type and target receptors. This dual capability provides a ranked list of variants using folding and interface metrics such as ipTM and pTM, offering objective measures of structural plausibility and binding potential. Top outputs were further reviewed for biological relevance and experimental feasibility to ensure suitability for downstream validation.
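As a rough illustration of ranking variants by interface and fold confidence, the sketch below sorts predictions by a weighted sum of ipTM and pTM. The `Prediction` record, the `rank_variants` function, the scores, and the 0.8/0.2 weighting are assumptions made for this example (the weighting mirrors a convention commonly used for ranking multimer predictions); the actual ranking criteria used in the pipeline may differ.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Prediction:
    """Hypothetical record of confidence metrics for one variant."""
    variant_id: str
    iptm: float  # interface predicted TM-score, in [0, 1]
    ptm: float   # predicted TM-score for the whole complex, in [0, 1]


def rank_variants(predictions, iptm_weight=0.8):
    """Sort predictions from best to worst by a weighted ipTM/pTM score."""
    def combined(p):
        return iptm_weight * p.iptm + (1.0 - iptm_weight) * p.ptm
    return sorted(predictions, key=combined, reverse=True)


# Made-up values for illustration only.
candidates = [
    Prediction("variant_a", iptm=0.90, ptm=0.50),
    Prediction("variant_b", iptm=0.60, ptm=0.90),
    Prediction("variant_c", iptm=0.75, ptm=0.80),
]
for p in rank_variants(candidates):
    print(p.variant_id, p.iptm, p.ptm)
```

Weighting ipTM more heavily favours variants with a confident binding interface over those that merely fold well on their own.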

Together, ESM3 and Boltz-2 form a complementary pipeline: ESM3 generates context-aware sequence variants by capturing evolutionary and structural signals from large protein datasets, while Boltz-2 delivers fast structure and affinity predictions that are reported to approach the accuracy of free energy perturbation methods while running over 1000× faster. Their integration enables scalable, efficient virtual screening and protein engineering, accelerating discovery and providing robust candidates for experimental testing.

Test

Initial assessments of the generated sequences prompted a deeper investigation into how their scores were produced and how valid they were in the broader context of binding affinities. Multiple iterations using non-uniform inputs revealed a general trend toward higher predicted scores and affinities. At this stage, external laboratory validation was not ideal, so comparisons with preliminary docking studies were conducted instead. These comparisons confirmed the initial assumptions within a limited set of parameter ranges.

Learn

From this iteration, we confirmed that a single-pass generation-to-evaluation pipeline could yield a small set of viable results but was insufficient for efficiently producing high-quality RBPs. While ESM3 and Boltz-2 each performed well in providing scoring context and updating sequences with new variants, the absence of feedback between them slowed optimization, and some sequences failed to achieve stable predicted folds. This limitation motivated the development of Iteration 2, which directly integrated Boltz-2 feedback into the generative loop through an iterative Markov chain Monte Carlo (MCMC) and simulated annealing approach.

Iteration 2

Design

The second iteration aimed to generate more comprehensive candidates using a standard Markov chain Monte Carlo (MCMC) pipeline with a Metropolis acceptance criterion and additional simulated annealing steps. Experimental parameters such as temperature, masking positions, and iteration counts were varied across repeated runs.
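For readers unfamiliar with the Metropolis criterion, the following is a minimal sketch of the acceptance rule combined with a cooling schedule. It assumes the score is being maximized and uses a geometric temperature decay; the function names, the schedule, and the numeric values are illustrative assumptions, not the settings used in our runs.

```python
import math
import random


def metropolis_accept(current_score, proposed_score, temperature, rng):
    """Metropolis rule for a score that is being maximized.

    Improvements are always accepted. A worse proposal is accepted with
    probability exp(delta / temperature), where delta is negative.
    """
    delta = proposed_score - current_score
    if delta >= 0:
        return True
    if temperature <= 0:
        return False  # zero temperature reduces to greedy hill climbing
    return rng.random() < math.exp(delta / temperature)


def geometric_schedule(t_start, t_end, n_steps):
    """Temperatures decaying geometrically from t_start down to t_end."""
    if n_steps < 2:
        return [t_start]
    ratio = (t_end / t_start) ** (1.0 / (n_steps - 1))
    return [t_start * ratio**i for i in range(n_steps)]


rng = random.Random(0)  # seeded so repeated runs are reproducible
for temperature in geometric_schedule(t_start=1.0, t_end=0.01, n_steps=5):
    accepted = metropolis_accept(0.60, 0.55, temperature, rng)
    print(f"T={temperature:.3f} accept worse proposal: {accepted}")
```

At high temperature the chain accepts many worse proposals and explores broadly; as the temperature falls, it behaves more and more like a greedy search around the best region found so far.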

To account for the extended number of epochs, Boltz-2 was integrated into the optimization loop, supplying ipTM and binding affinity scores at each step. Each configuration was executed across 1,000 runs, producing updated RBP chains against glycans. In parallel, AlphaFold testing confirmed the distribution of predicted scores. The resulting outputs informed the next selection round, which incorporated ESM3-generated sequences into the feedback loop. With these hallucinated variants added as new inputs, the algorithm was able to iteratively refine towards RBPs and phages with higher predicted binding affinities.

Build

As in the first iteration, sequences were preprocessed by masking C-terminal receptor-binding domain (RBD) regions to prepare them for generative modeling. Additional parameters, such as temperature, were incorporated into the execution build to expand the input space. The Markov chain was designed to optimize scores and ipTM outputs, with Monte Carlo runs set to terminate early if no major improvements occurred, before moving to the next round.
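The early-termination rule can be expressed as a small helper that tracks the best score seen and counts steps without a meaningful improvement. This is a sketch under assumptions: the class name `EarlyStopper`, the `patience` and `min_delta` parameters, and the example scores are hypothetical and do not reflect the thresholds used in our runs.

```python
class EarlyStopper:
    """Signal a stop after `patience` consecutive steps without improvement."""

    def __init__(self, patience, min_delta=0.0):
        if patience < 1:
            raise ValueError("patience must be at least 1")
        self.patience = patience
        self.min_delta = min_delta  # smallest gain that counts as "major"
        self.best = float("-inf")
        self.stale_steps = 0

    def update(self, score):
        """Record a score and return True when the run should terminate."""
        if score > self.best + self.min_delta:
            self.best = score
            self.stale_steps = 0
        else:
            self.stale_steps += 1
        return self.stale_steps >= self.patience


# Made-up score trace for illustration only.
scores = [0.50, 0.60, 0.605, 0.60, 0.609, 0.70]
stopper = EarlyStopper(patience=3, min_delta=0.01)
for step, score in enumerate(scores):
    if stopper.update(score):
        print(f"stopping at step {step}, best score {stopper.best}")
        break
```

Note the trade-off in this toy trace: the run stops before reaching the final, higher score, which is why `patience` should be chosen with the expected plateau length in mind.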

ESM3 was again configured for masked sequence generation over curated RBD regions, combining fixed and random masking. The resulting variants were evaluated for structure and binding affinity against both wild-type and target receptors. Final outputs consisted of large collections of conditional runs with corresponding best scores, RBP and glycan identifiers, and server configuration records. These results were filtered to select top batches, which were then transferred for validation by other team domains.
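One way to combine fixed and random masking is to always mask a chosen set of positions and then sample further positions at random from within the RBD region. The sketch below assumes this interpretation; `build_mask`, its parameters, the `_` mask character, and the toy sequence are illustrative assumptions rather than the configuration used in the pipeline.

```python
import random


def build_mask(sequence, fixed_positions, region, n_random, seed=None, mask_token="_"):
    """Mask fixed positions plus `n_random` positions sampled from `region`.

    `region` is a half-open (start, end) index range, for example the RBD.
    Returns the masked sequence and the sorted list of masked indices.
    """
    start, end = region
    if not 0 <= start <= end <= len(sequence):
        raise ValueError("region must lie within the sequence")
    fixed = set(fixed_positions)
    if any(i < 0 or i >= len(sequence) for i in fixed):
        raise ValueError("fixed positions must lie within the sequence")

    rng = random.Random(seed)
    candidates = [i for i in range(start, end) if i not in fixed]
    sampled = rng.sample(candidates, min(n_random, len(candidates)))
    chosen = fixed | set(sampled)

    masked = "".join(mask_token if i in chosen else aa for i, aa in enumerate(sequence))
    return masked, sorted(chosen)


# Toy sequence for illustration only (hypothetical, not a curated RBP).
seq = "MKTAYIAKQRQISFVKSHFSRQ"
masked_seq, positions = build_mask(seq, fixed_positions=[14, 15], region=(12, 22), n_random=3, seed=1)
print(masked_seq)
print(positions)
```

Varying `seed` and `n_random` between runs gives different masking patterns over the same region, which is one way to diversify the generated variants.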

Test

Individual tests of screened outputs showed several candidates achieving scores above 0.7 across multiple iterations. The top-performing sequences, incorporating updated chains from the Markov process, were further validated through localized simulations. Experimental validation within a laboratory framework is ongoing.

Learn

The primary goal of incorporating MCMC simulations was to efficiently identify parameter combinations that yield the best scores. This addition proved essential for exploring a broader space of RBP sequences without expending excessive resources on repeated low-scoring runs. The next phase focuses on integrating an in silico–in vitro feedback loop, where laboratory-derived outputs are compared with simulation results. This active learning step allows wet-lab data to be fed back into the generative pipeline, guiding the algorithms toward stronger candidates more quickly. Updates will include refinements to coefficient weights, model parameters, and sequence-level attributes, improving the likelihood that generated sequences perform well under real biological conditions.

These spacers, along with both positive controls, did not successfully target the phages. We believe this could be due to issues with our experimental setup itself. In particular, we suspect that our K12 strain might be harbouring chloramphenicol resistance, which would interfere with the CmR selectable marker in pCas9. As such, we intend to revise our experimental design to overcome these issues.

Wet lab — Wet lab validation of batch 1 and batch 2 RBDs

Iteration 1 (anticipation)

Design

Generated RBDs with 500 bp flanking homology arms will be cloned into pETDuet, a plasmid compatible with CRISPR constructs. These recombination and CRISPR plasmids will be co-transformed into the corresponding lysogens. The CRISPR system will eliminate cells that fail to recombine, enriching for successful recombinants. To identify recombinants, colony PCR will be performed using a forward primer upstream of the homology arm and a reverse primer within the inserted RBD, ensuring that only chimeric lysogens yield PCR products.
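The logic of this colony PCR screen can be checked in silico before ordering primers. The sketch below is a simplified model that assumes exact primer matches on a single strand; the function names and all sequences are short made-up examples, not our actual homology arms, RBDs, or primers.

```python
def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA sequence."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def amplicon_length(template, forward_primer, reverse_primer):
    """Return the expected product length, or None if no product forms.

    Simplified model: both primers must match exactly, and the reverse
    primer must bind downstream of the forward primer.
    """
    template = template.upper()
    forward_primer = forward_primer.upper()
    fwd_start = template.find(forward_primer)
    if fwd_start == -1:
        return None
    rev_site = reverse_complement(reverse_primer.upper())
    rev_start = template.find(rev_site, fwd_start + len(forward_primer))
    if rev_start == -1:
        return None
    return rev_start + len(rev_site) - fwd_start


# Toy sequences for illustration only (hypothetical).
upstream = "ATGCGTACGTTAGC"
homology_arm = "GGATCCGGATCC"
inserted_rbd = "TTGACCTGAAGCTT"
wild_type_rbd = "CCATGGCATATGCA"

forward = "GCGTACGT"                        # binds upstream of the homology arm
reverse = reverse_complement("CCTGAAGC")    # binds only within the inserted RBD

recombinant = upstream + homology_arm + inserted_rbd
wild_type = upstream + homology_arm + wild_type_rbd

print("recombinant product:", amplicon_length(recombinant, forward, reverse))
print("wild-type product:", amplicon_length(wild_type, forward, reverse))
```

Because the forward primer sits outside the homology arm, a product also cannot arise from unrecombined plasmid alone, which carries the insert but not the upstream genomic site.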

Build

Two P2 RBD constructs have been successfully cloned using Gibson Assembly. Additional RBD constructs will be cloned as they arrive.

Test

Recombination and CRISPR plasmids will be co-transformed into the corresponding lysogens. Successful recombinants identified by colony PCR will be induced to produce phages, which will then be tested on E. coli strains expressing the relevant receptors.

Learn

We anticipate obtaining recombinant lysogens that survive CRISPR selection. If no surviving colonies are observed, we will modify the workflow by transforming the recombination construct first, allowing it to replicate in the lysogen for several generations before introducing the CRISPR plasmid. This should increase the likelihood of successful recombination events.

Hardware — Pill Release Mechanism Design

During the design process, we explored several mechanisms for on-demand phage release. Our first concept, inspired by published work, used a nichrome wire heating element to melt a fusible PCL thread that restrained a spring-loaded drug compartment. The system included a nichrome coil, a tensile PCL filament, a PDMS spring, an elastic band, and a 3.7 V Li-Po battery. When triggered, the wire heated rapidly, melting the PCL (~60 °C) and releasing the stored tension to open the compartment for phage delivery. Ultimately, we shifted to an electromagnet-based release, which proved easier to prototype and more reliable over time, since magnetic force does not degrade the way mechanical tension does.

From sketches to CAD models, we refined the capsule over successive DBTL cycles, repeatedly redesigning the release mechanism, component placement, and capsule size to shrink the device as much as possible. Our current prototype remains too large to be swallowed due to manufacturing constraints; however, with further iterations, we intend to test a scaled-down version in the wet lab. We also intend to test our pill in a bacterial culture to observe whether it can sense the H2 generated by E. coli.

Entrepreneurship — Developing pitches and business plan

Over the course of four months, the entrepreneurship team worked with the University of Toronto’s NEST Hatchery to develop successive iterations of our business pitch and plan. Drawing on the whole team’s efforts and the expertise of our human practices division, we evaluated many different approaches to commercializing Mystiphage. We presented each iteration of the pitch and business plan to investors and advisors, who provided weekly feedback on the feasibility of our proposals. After four months of working and reworking multiple ideas through numerous DBTL cycles (at times one per week), we produced the business plan, pitch, and cash flow projections that will allow Mystiphage to continue as a viable startup after the iGEM Jamboree.

References

[1] Li, X., Xu, Z., Hong, X., Zhang, Y., & Zou, X. (2020). Databases and Bioinformatic Tools for Glycobiology and Glycoproteomics. International Journal of Molecular Sciences, 21(18), 6727. https://doi.org/10.3390/ijms21186727

[2] RCSB PDB. (n.d.). RCSB Protein Data Bank. https://www.rcsb.org/

[3] Varadi, M., Bertoni, D., Magana, P., Paramval, U., Pidruchna, I., Radhakrishnan, M., Tsenkov, M., Nair, S., Mirdita, M., Yeo, J., Kovalevskiy, O., Tunyasuvunakool, K., Laydon, A., Žídek, A., Tomlinson, H., Hariharan, D., Abrahamson, J., Green, T., Jumper, J., … Velankar, S. (2024). AlphaFold Protein Structure Database in 2024: providing structure coverage for over 214 million protein sequences. Nucleic Acids Research, 52(D1), D368–D375. https://doi.org/10.1093/nar/gkad1011

[4] BindingDB. (n.d.). BindingDB: A Public, Web-Accessible Binding Database. https://www.bindingdb.org/rwd/bind/index.jsp

[5] Liu, H., Chen, P., Zhai, X., Huo, K.-G., Zhou, S., Han, L., Fan, G., & others. (2024). PPB-Affinity: Protein-Protein Binding Affinity dataset for AI-based protein drug discovery. Scientific Data, 11(1), 1316. https://doi.org/10.1038/s41597-024-03997-4