Description
Protein engineering is one of the most powerful tools in modern biotechnology. From enzyme therapies to next-generation antibodies, proteins drive the future of medicine. However, despite advances like AlphaFold2 in predicting protein structure, the design of functional proteins remains inefficient, costly, and error-prone.
Abstract
Protein engineering underpins modern biotechnology, yet turning designed sequences into robustly functional biomolecules remains inefficient because datasets are small, screening is costly, and sequence–function mappings are uncertain. While the development of large protein models such as AlphaFold has advanced structure prediction, consistent performance across diverse conditions is still difficult to achieve. We built an evolution platform that couples a data-free, generative protein language framework with active learning refinement and a directed-evolution pipeline spanning both in vivo OrthoRep systems and in vitro error-prone mutagenesis. Using the SpyTag/SpyCatcher pair as a testbed, we targeted improved association rates and iteratively validated candidates with split-luciferase assays. By uniting computation with biological selection, our framework accelerates discovery of adaptable, high-performing protein variants.
Introduction
Modern protein engineering promises tailor-made therapeutics, but the path from sequence to validated function is still slow, expensive, and failure-prone. Large structural-genomics programs document high attrition across expression, purification, crystallization and structure determination—often stretching timelines from months to years.1,2
The problem landscape
1) Little functional data → many mutations & cycles. Early R&D rarely starts with condition-matched activity data; reviews of ML-assisted protein design show that models trained on narrow distributions struggle to generalize across rugged fitness landscapes, so iterative mutagenesis and screening remain unavoidable to find robust sequences.3
2) "DL gives many sequences" → wet-lab validation is resource-intensive. Even with better docking/generative models, HTS hit rates are typically 0.01–0.14%, and prospective virtual screening often yields only single- to low-double-digit enrichment—meaning large candidate sets still demand costly triage and assays.4,5
3) Real environments break in-silico assumptions. Expression and activity are context-dependent: membrane-protein overexpression is a classic bottleneck and even soluble targets frequently fail at the bench.1,6,7 Meanwhile, many tumors present acidic extracellular pH (~6.5–7.0) with near-neutral intracellular pH, shifting stability and binding away from "ideal" design conditions.6,7
4) Designed proteins still misfold or aggregate. Across biotherapeutics, aggregation and misfolding undermine yield, potency, and safety—one of the most persistent gaps between attractive in-silico designs and deployable drug leads.8,9
Why this project & What inspired us
Specific problem fit. The biggest blockers we face are data scarcity, validation cost, and environment mismatch (for example, the acidic extracellular pH, pHe, of tumors). A design→rank→epPCR→assay→periodic-retrain pipeline tackles all three without requiring massive datasets or a fragile real-time loop.
Model system with clean kinetics. SpyCatcher/SpyTag offers a wide, well-characterized kinetic range and a simple kon readout, making cross-buffer comparisons straightforward and publishable.
Evidence-backed tooling. Recent advances in RFdiffusion and ProteinMPNN show strong prospective performance, while active-learning-guided selection reduces how much screening is needed when data are limited.
Today's pipelines either (a) lack condition-matched data and must mutate-and-screen extensively, or (b) produce many in-silico candidates that still demand heavy wet-lab validation—all while environment and expression constraints derail otherwise promising designs.
Our Solutions: closing the gap between in-silico promise and bench-top reality
We propose a closed-loop framework that couples computation with biology. Upstream, Seed & Seek, our in-silico seeding and AI-guided seeking model, proposes diverse, fold-plausible variants and ranks them by predicted performance. Downstream, we run directed evolution and functional assays (in vitro error-prone PCR; in vivo OrthoRep/EcORep). Each round outputs an experiment-ready shortlist with constraints and rationales.
Sample to Data
Starting from a protected scaffold with mutation masks, we generate variants using RFdiffusion (interfaces/backbone) and ProteinMPNN (sequence fill). Every candidate is virtually labeled by a standard pipeline: AlphaFold provides structure hypotheses and Brownian-dynamics simulation estimates diffusion-controlled kon.
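As an illustration of how a Brownian-dynamics run can be turned into a kon estimate, the sketch below applies the Northrup-Allison-McCammon relation, which converts the fraction of simulated trajectories that react (beta) into a bimolecular rate constant. It is a minimal sketch and not our production pipeline: it assumes interactions are negligible beyond the start radius b, so the diffusive rates at b and at the escape radius q reduce to the Smoluchowski form 4πDr. The function names and all numerical values are hypothetical.

```python
import math

AVOGADRO = 6.02214076e23  # particles per mole


def smoluchowski_rate(diffusion_m2_s, radius_m):
    """Force-free diffusive arrival rate 4*pi*D*r, converted to M^-1 s^-1."""
    per_pair_m3_s = 4.0 * math.pi * diffusion_m2_s * radius_m
    return per_pair_m3_s * AVOGADRO * 1000.0  # 1 m^3 = 1000 litres


def nam_association_rate(beta, diffusion_m2_s, b_m, q_m):
    """Northrup-Allison-McCammon estimate of k_on from BD trajectory statistics.

    beta           : fraction of trajectories started at radius b that react
                     before escaping to radius q
    diffusion_m2_s : relative translational diffusion coefficient (m^2/s)
    b_m, q_m       : start and escape radii in metres (q must exceed b)
    """
    if not 0.0 <= beta <= 1.0:
        raise ValueError("beta must lie in [0, 1]")
    if q_m <= b_m:
        raise ValueError("escape radius q must be larger than start radius b")
    k_b = smoluchowski_rate(diffusion_m2_s, b_m)
    k_q = smoluchowski_rate(diffusion_m2_s, q_m)
    # Correction for trajectories that would return from q instead of escaping.
    return k_b * beta / (1.0 - (1.0 - beta) * k_b / k_q)


if __name__ == "__main__":
    # Hypothetical run: 120 reactive trajectories out of 10,000.
    beta = 120 / 10_000
    k_on = nam_association_rate(beta, diffusion_m2_s=2e-10, b_m=1e-8, q_m=5e-8)
    print(f"estimated k_on = {k_on:.3e} M^-1 s^-1")
```

When every trajectory reacts (beta = 1) the estimate reduces to the diffusion limit at b, which is a convenient sanity check for any implementation.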
Data to Solution
Seed & Seek recursively learns a shared latent space from sequence features and residue-graph structural embeddings, then uses an edit-only decoder to propose on-manifold mutations. A calibrated scorer and a latent space optimizer (spread/temperature/top-p) balance exploration and exploitation to produce a clear, experiment-ready shortlist.
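To make the temperature and top-p knobs concrete, the following sketch samples one candidate edit from a table of predicted scores using temperature-scaled softmax followed by nucleus (top-p) truncation. It illustrates the general technique only; the actual Seed & Seek optimizer operates in latent space and is not reproduced here. The function name, the edit labels, and the scores are hypothetical.

```python
import math
import random


def top_p_sample(scores, temperature=1.0, top_p=0.9, rng=None):
    """Pick one edit from `scores` (edit label -> predicted score).

    Lower temperature sharpens the distribution (exploitation); a larger
    top_p keeps more of the low-probability tail (exploration).
    """
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    if not 0.0 < top_p <= 1.0:
        raise ValueError("top_p must lie in (0, 1]")
    rng = rng or random.Random()

    edits = list(scores)
    scaled = [scores[e] / temperature for e in edits]
    peak = max(scaled)  # subtract the maximum for numerical stability
    weights = [math.exp(s - peak) for s in scaled]
    total = sum(weights)

    # Rank edits by probability and keep the smallest prefix whose
    # cumulative probability reaches top_p (the "nucleus").
    ranked = sorted(
        zip(edits, (w / total for w in weights)),
        key=lambda pair: pair[1],
        reverse=True,
    )
    nucleus, cumulative = [], 0.0
    for edit, prob in ranked:
        nucleus.append((edit, prob))
        cumulative += prob
        if cumulative >= top_p:
            break

    labels = [edit for edit, _ in nucleus]
    probs = [prob for _, prob in nucleus]
    return rng.choices(labels, weights=probs, k=1)[0]


if __name__ == "__main__":
    # Hypothetical edit labels and scores, for illustration only.
    predicted = {"A10V": 2.0, "K31E": 1.0, "D35N": -3.0}
    print(top_p_sample(predicted, temperature=0.7, top_p=0.9, rng=random.Random(0)))
```

With a small top_p the sampler always returns the top-scoring edit, while raising the temperature or top_p admits progressively less favored edits into the shortlist.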
Solution to Reality
To bridge computational design with experimental reality, we employ directed evolution to validate and optimize model-generated sequences through a coarse-to-fine tuning strategy.
In the coarse phase, a fragmented error-prone PCR approach generates multiple localized mutant pools, allowing mutations to focus on functionally important regions or distribute evenly when structural insights are limited. For example, the SpyCatcher gene was divided into five sections, with two function-based regions selected for mutagenesis to build the initial library. The fine-tuning phase refines these variants through in vivo and in vitro optimization. Selected mutants evolve continuously in the EcORep system, or undergo iterative error-prone PCR and screening. In our case, the split luciferase assay was used for in vitro screening to measure the binding kinetics (kon) between SpyCatcher and SpyTag under different conditions. This coarse-to-fine evolution pipeline balances exploration and precision, turning AI-designed sequences into real, functional performance.
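One simple way to extract kon from a time course like the one the split-luciferase assay produces is sketched below. It is a minimal sketch under stated assumptions, not a description of our exact analysis: the luminescence is assumed to be already normalized to the fraction of the plateau signal, one partner is assumed to be in large excess (pseudo-first-order conditions), and the reaction is treated as irreversible, which is reasonable for the covalent SpyTag/SpyCatcher bond. Under those assumptions the bound fraction follows f(t) = 1 - exp(-kon·C·t), so -ln(1 - f) is linear in t with slope kon·C. The function name and the example numbers are hypothetical.

```python
import math


def estimate_kon(times_s, fraction_bound, excess_conc_molar):
    """Estimate k_on (M^-1 s^-1) from a pseudo-first-order time course.

    times_s           : time points in seconds
    fraction_bound    : normalized signal at each time point, in [0, 1)
    excess_conc_molar : concentration of the partner in excess (mol/L)
    """
    if len(times_s) != len(fraction_bound):
        raise ValueError("times and fractions must have the same length")
    if excess_conc_molar <= 0:
        raise ValueError("excess concentration must be positive")

    numerator = denominator = 0.0
    for t, f in zip(times_s, fraction_bound):
        if not 0.0 <= f < 1.0:
            raise ValueError("fractions must lie in [0, 1)")
        y = -math.log(1.0 - f)  # linearized signal: y = k_obs * t
        numerator += t * y
        denominator += t * t
    if denominator == 0.0:
        raise ValueError("need at least one non-zero time point")

    k_obs = numerator / denominator  # least-squares slope through the origin
    return k_obs / excess_conc_molar


if __name__ == "__main__":
    # Synthetic, noise-free data generated with a hypothetical k_on.
    true_kon, conc = 1.0e3, 10e-6
    times = [60.0, 120.0, 300.0, 600.0]
    fractions = [1.0 - math.exp(-true_kon * conc * t) for t in times]
    print(f"recovered k_on = {estimate_kon(times, fractions, conc):.3e} M^-1 s^-1")
```

Running the same fit on time courses collected in different buffers gives directly comparable kon values, provided the normalization and the excess concentration are handled consistently across conditions.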