Description

Protein engineering is one of the most powerful tools in modern biotechnology. From enzyme therapies to next-generation antibodies, proteins drive the future of medicine. However, despite advances like AlphaFold2 in predicting protein structure, the design of functional proteins remains inefficient, costly, and error-prone.

Abstract

Protein engineering underpins modern biotechnology, yet turning designed sequences into robustly functional biomolecules remains inefficient because datasets are small, screening is costly, and sequence–function mappings are uncertain. While the development of large protein models such as AlphaFold has advanced structure prediction, consistent performance across diverse conditions is still difficult to achieve. We built an evolution platform that couples a data-free, generative protein language framework with active learning refinement and a directed-evolution pipeline spanning both in vivo OrthoRep systems and in vitro error-prone mutagenesis. Using the SpyTag/SpyCatcher pair as a testbed, we targeted improved association rates and iteratively validated candidates with split-luciferase assays. By uniting computation with biological selection, our framework accelerates discovery of adaptable, high-performing protein variants.

Introduction

Modern protein engineering promises tailor-made therapeutics, but the path from sequence to validated function is still slow, expensive, and failure-prone. Large structural-genomics programs document high attrition across expression, purification, crystallization and structure determination—often stretching timelines from months to years.^1,2

The problem landscape

1) Little functional data → many mutations & cycles. Early R&D rarely starts with condition-matched activity data; reviews of ML-assisted protein design show that models trained on narrow distributions struggle to generalize across rugged fitness landscapes, so iterative mutagenesis and screening remain unavoidable to find robust sequences.^3

2) "DL gives many sequences" → wet-lab validation is resource-intensive. Even with better docking/generative models, HTS hit rates are typically 0.01–0.14%, and prospective virtual screening often yields only single- to low-double-digit enrichment, meaning large candidate sets still demand costly triage and assays.^4,5

3) Real environments break in-silico assumptions. Expression and activity are context-dependent: membrane-protein overexpression is a classic bottleneck and even soluble targets frequently fail at the bench.^1,6,7 Meanwhile, many tumors present acidic extracellular pH (~6.5–7.0) with near-neutral intracellular pH, shifting stability and binding away from "ideal" design conditions.^6,7

4) Designed proteins still misfold or aggregate. Across biotherapeutics, aggregation and misfolding undermine yield, potency, and safety, and they remain one of the most persistent gaps between attractive in-silico designs and deployable drug leads.^8,9

Why this project & What inspired us

Specific problem fit. The biggest blockers we face are data scarcity, validation cost, and environment mismatch (acidic extracellular pH, pHe, in tumors). A design → rank → error-prone PCR (epPCR) → assay → periodic-retrain pipeline tackles all three without requiring massive datasets or a fragile real-time loop.

Model system with clean kinetics. SpyCatcher/SpyTag offers a wide, well-characterized kinetic range and a simple k_on readout, making cross-buffer comparisons straightforward and publishable.

Evidence-backed tooling. RFdiffusion and ProteinMPNN have shown strong prospective performance, while active-learning-guided selection reduces the number of variants that must be screened when data are limited.

Today's pipelines either (a) lack condition-matched data and must mutate-and-screen extensively, or (b) produce many in-silico candidates that still demand heavy wet-lab validation—all while environment and expression constraints derail otherwise promising designs.

Here's a story of our background.

Our Solutions: rapidly closing the gap between in-silico promise and bench-top reality

We propose a closed-loop framework that couples computation with biology. Upstream, our in-silico seeding and AI-guided seeking model, Seed & Seek, proposes diverse, fold-plausible variants and ranks them by predicted performance. Downstream, we run directed evolution and functional assays (in vitro error-prone PCR; in vivo OrthoRep/EcORep). Each round outputs an experiment-ready shortlist with constraints and rationales.

Sample to Data

Starting from a protected scaffold with mutation masks, we generate variants using RFdiffusion (interfaces/backbone) and ProteinMPNN (sequence fill). Every candidate is virtually labeled by a standard pipeline: AlphaFold provides structure hypotheses and Brownian-dynamics simulation estimates diffusion-controlled k_on.
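To give a feel for what "diffusion-controlled k_on" means in the labeling step, the sketch below computes the classical Smoluchowski rate for two uniformly reactive spheres, with diffusion coefficients from the Stokes-Einstein relation. This is a textbook simplification and not the Brownian-dynamics simulation itself: it ignores orientation constraints and electrostatics, so it only gives an upper bound that simulated or measured rates usually fall well below. The function names, the 2 nm radii, and the water viscosity at 25 °C are illustrative assumptions.

```python
import math

BOLTZMANN = 1.380649e-23   # J/K
AVOGADRO = 6.02214076e23   # 1/mol


def stokes_einstein_diffusion(radius_m, temperature_k=298.15, viscosity_pa_s=8.9e-4):
    """Translational diffusion coefficient (m^2/s) of a sphere in a viscous fluid."""
    return BOLTZMANN * temperature_k / (6 * math.pi * viscosity_pa_s * radius_m)


def smoluchowski_k_on(radius_a_m, radius_b_m, temperature_k=298.15, viscosity_pa_s=8.9e-4):
    """Diffusion-limited association rate (M^-1 s^-1) for two reactive spheres.

    k = 4 * pi * (D_A + D_B) * (R_A + R_B) * N_A, converted from m^3 to litres.
    """
    d_sum = (
        stokes_einstein_diffusion(radius_a_m, temperature_k, viscosity_pa_s)
        + stokes_einstein_diffusion(radius_b_m, temperature_k, viscosity_pa_s)
    )
    contact_distance = radius_a_m + radius_b_m
    k_m3_per_mol_s = 4 * math.pi * d_sum * contact_distance * AVOGADRO
    return k_m3_per_mol_s * 1000.0  # 1 m^3 = 1000 L


if __name__ == "__main__":
    # Hypothetical hydrodynamic radii of 2 nm each; not measured values.
    k_limit = smoluchowski_k_on(2e-9, 2e-9)
    print(f"Diffusion-limited upper bound: {k_limit:.2e} M^-1 s^-1")
```

For two equal spheres the radius cancels and the bound reduces to 8RT/(3η), which is why the result depends only on temperature and viscosity in that case.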

Data to Solution

Seed & Seek recursively learns a shared latent space from sequence features and residue-graph structural embeddings, then uses an edit-only decoder to propose on-manifold mutations. A calibrated scorer and a latent space optimizer (spread/temperature/top-p) balance exploration and exploitation to produce a clear, experiment-ready shortlist.
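The temperature and top-p controls mentioned above can be illustrated with a generic sampling sketch. It is not the Seed & Seek implementation: the function names, the example scores, and the mutation labels are hypothetical, and the "spread" control is not modeled here. Lower temperature or smaller top-p concentrates choices on the best-scored edits (exploitation); higher values admit lower-scored edits (exploration).

```python
import math
import random


def softmax_with_temperature(scores, temperature=1.0):
    """Turn raw scores into probabilities; higher temperature flattens them."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    peak = max(scores)  # subtract the maximum for numerical stability
    weights = [math.exp((s - peak) / temperature) for s in scores]
    total = sum(weights)
    return [w / total for w in weights]


def top_p_filter(probs, top_p):
    """Keep the smallest set of top candidates whose mass reaches top_p, renormalized."""
    if not 0 < top_p <= 1:
        raise ValueError("top_p must be in (0, 1]")
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    mass = sum(probs[i] for i in kept)
    return {i: probs[i] / mass for i in kept}


def sample_edit(candidates, scores, temperature=1.0, top_p=0.9, rng=None):
    """Draw one candidate edit from the temperature-scaled, top-p-truncated distribution."""
    rng = rng or random.Random()
    nucleus = top_p_filter(softmax_with_temperature(scores, temperature), top_p)
    indices = list(nucleus)
    chosen = rng.choices(indices, weights=[nucleus[i] for i in indices], k=1)[0]
    return candidates[chosen]


if __name__ == "__main__":
    # Hypothetical edits and predicted scores, for illustration only.
    edits = ["A12V", "G45S", "T7K"]
    predicted = [2.0, 1.0, 0.0]
    print(sample_edit(edits, predicted, temperature=0.5, top_p=0.9, rng=random.Random(0)))
```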

Solution to Reality

To bridge computational design with experimental reality, we employ directed evolution to validate and optimize model-generated sequences through a coarse-to-fine tuning strategy.

In the coarse phase, a fragmented error-prone PCR approach generates multiple localized mutant pools, allowing mutations to focus on functionally important regions or distribute evenly when structural insights are limited. For example, the SpyCatcher gene was divided into five sections, with two function-based regions selected for mutagenesis to build the initial library. The fine-tuning phase refines these variants through in vivo and in vitro optimization. Selected mutants evolve continuously in the EcORep system, or undergo iterative error-prone PCR and screening. In our case, the split luciferase assay was used for in vitro screening to measure the binding kinetics (k_on) between SpyCatcher and SpyTag under different conditions. This coarse-to-fine evolution pipeline balances exploration and precision, turning AI-designed sequences into real, functional performance.
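The fragmentation step of the coarse phase can be sketched in a few lines: split a gene into a fixed number of contiguous sections and pull out the ones chosen for mutagenesis. This is a minimal illustration under stated assumptions, not our cloning protocol. The placeholder sequence is not the SpyCatcher gene, the selected section indices are arbitrary, and real fragment boundaries would be set by primer design and codon frame rather than by equal lengths.

```python
def split_into_segments(sequence, n_segments):
    """Return half-open (start, end) index pairs that tile the sequence in order.

    Segment lengths differ by at most one; earlier segments absorb the remainder.
    """
    if n_segments < 1 or n_segments > len(sequence):
        raise ValueError("n_segments must be between 1 and len(sequence)")
    base, extra = divmod(len(sequence), n_segments)
    bounds, start = [], 0
    for i in range(n_segments):
        end = start + base + (1 if i < extra else 0)
        bounds.append((start, end))
        start = end
    return bounds


def mutagenesis_pools(sequence, n_segments, selected):
    """Map each selected section index to the subsequence targeted for error-prone PCR."""
    bounds = split_into_segments(sequence, n_segments)
    return {i: sequence[s:e] for i, (s, e) in enumerate(bounds) if i in selected}


if __name__ == "__main__":
    placeholder_gene = "ATGGCTAGCTTAGGCACCGATTAA" * 3  # placeholder, not a real gene
    # Five sections, two of them chosen for mutagenesis (indices are hypothetical).
    pools = mutagenesis_pools(placeholder_gene, n_segments=5, selected={1, 3})
    for index, fragment in pools.items():
        print(index, fragment)
```

When structural insight is limited, passing every section index in `selected` corresponds to distributing mutations evenly across the gene.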

References


1. Joachimiak, A. (2009). High-throughput crystallography for structural genomics. Current opinion in structural biology, 19(5), 573–584. https://doi.org/10.1016/j.sbi.2009.08.002

2. Kim, Y., Babnigg, G., Jedrzejczak, R., Eschenfeldt, W. H., Li, H., Maltseva, N., Hatzos-Skintges, C., Gu, M., Makowska-Grzyska, M., Wu, R., An, H., Chhor, G., & Joachimiak, A. (2011). High-throughput protein purification and quality assessment for crystallization. Methods (San Diego, Calif.), 55(1), 12–28. https://doi.org/10.1016/j.ymeth.2011.07.010

3. Yang, K. K., Wu, Z., & Arnold, F. H. (2019). Machine-learning-guided directed evolution for protein engineering. Nature methods, 16(8), 687–694. https://doi.org/10.1038/s41592-019-0496-6

4. Zhu, T., Cao, S., Su, P. C., Patel, R., Shah, D., Chokshi, H. B., Szukala, R., Johnson, M. E., & Hevener, K. E. (2013). Hit identification and optimization in virtual screening: practical recommendations based on a critical literature analysis. Journal of medicinal chemistry, 56(17), 6560–6572. https://doi.org/10.1021/jm301916b

5. Zhu, H., Zhang, Y., Li, W., & Huang, N. (2022). A Comprehensive Survey of Prospective Structure-Based Virtual Screening for Early Drug Discovery in the Past Fifteen Years. International journal of molecular sciences, 23(24), 15961. https://doi.org/10.3390/ijms232415961

6. Pérez-Herrero, E., & Fernández-Medarde, A. (2021). The reversed intra- and extracellular pH in tumors as a unified strategy to chemotherapeutic delivery using targeted nanocarriers. Acta pharmaceutica Sinica. B, 11(8), 2243–2264. https://doi.org/10.1016/j.apsb.2021.01.012

7. Rahman, M. A., Yadab, M. K., & Ali, M. M. (2024). Emerging Role of Extracellular pH in Tumor Microenvironment as a Therapeutic Target for Cancer Immunotherapy. Cells, 13(22), 1924. https://doi.org/10.3390/cells13221924

8. Roberts, C. J. (2014). Therapeutic protein aggregation: mechanisms, design, and control. Trends in biotechnology, 32(7), 372–380. https://doi.org/10.1016/j.tibtech.2014.05.005

9. Roberts, C. J. (2014). Protein aggregation and its impact on product quality. Current opinion in biotechnology, 30, 211–217. https://doi.org/10.1016/j.copbio.2014.08.001

10. Watson, J. L., Juergens, D., Bennett, N. R., Trippe, B. L., Yim, J., Eisenach, H. E., Ahern, W., Borst, A. J., Ragotte, R. J., Milles, L. F., Wicky, B. I. M., Hanikel, N., Pellock, S. J., Courbet, A., Sheffler, W., Wang, J., Venkatesh, P., Sappington, I., Torres, S. V., Lauko, A., … Baker, D. (2023). De novo design of protein structure and function with RFdiffusion. Nature, 620(7976), 1089–1100. https://doi.org/10.1038/s41586-023-06415-8

11. Dauparas, J., Anishchenko, I., Bennett, N., Bai, H., Ragotte, R. J., Milles, L. F., Wicky, B. I. M., Courbet, A., de Haas, R. J., Bethel, N., Leung, P. J. Y., Huddy, T. F., Pellock, S., Tischer, D., Chan, F., Koepnick, B., Nguyen, H., Kang, A., Sankaran, B., Bera, A. K., … Baker, D. (2022). Robust deep learning-based protein sequence design using ProteinMPNN. Science (New York, N.Y.), 378(6615), 49–56. https://doi.org/10.1126/science.add2187

12. Keeble, A. H., Turkki, P., Stokes, S., Khairil Anuar, I. N. A., Rahikainen, R., Hytönen, V. P., & Howarth, M. (2019). Approaching infinite affinity through engineering of peptide-protein interaction. Proceedings of the National Academy of Sciences of the United States of America, 116(52), 26523–26533. https://doi.org/10.1073/pnas.1909653116

13. Keeble, A. H., & Howarth, M. (2020). Power to the protein: enhancing and combining activities using the Spy toolbox. Chemical science, 11(28), 7281–7291. https://doi.org/10.1039/d0sc01878c

14. Ravikumar, A., Arzumanyan, G. A., Obadi, M. K. A., Javanpour, A. A., & Liu, C. C. (2018). Scalable, Continuous Evolution of Genes at Mutation Rates above Genomic Error Thresholds. Cell, 175(7), 1946–1957.e13. https://doi.org/10.1016/j.cell.2018.10.021

15. Molina, R. S., Rix, G., Mengiste, A. A., Alvarez, B., Seo, D., Chen, H., Hurtado, J., Zhang, Q., Donato García-García, J., Heins, Z. J., Almhjell, P. J., Arnold, F. H., Khalil, A. S., Hanson, A. D., Dueber, J. E., Schaffer, D. V., Chen, F., Kim, S., Ángel Fernández, L., Shoulders, M. D., … Liu, C. C. (2022). In vivo hypermutation and continuous evolution. Nature reviews. Methods primers, 2, 37. https://doi.org/10.1038/s43586-022-00130-w

16. Lang, Y., Li, Z., & Li, H. (2019). Analysis of Protein-Protein Interactions by Split Luciferase Complementation Assay. Current protocols in toxicology, 82(1), e90. https://doi.org/10.1002/cptx.90
