Model

Protein engineering is one of the most powerful tools in modern biotechnology. From enzyme therapies to next-generation antibodies, proteins drive the future of medicine. However, despite advances like AlphaFold2 in predicting protein structure, the design of functional proteins remains inefficient, costly, and error-prone.

Overview

Our Model: Seed & Seek

Seed & Seek is a data-free in silico modeling "pipeline" for customized directed evolution of general protein functional optimization. From a single exemplar sequence, the system seeds a diverse neighborhood with modern AI mutators (e.g., RFdiffusion, ProteinMPNN) and immediately assigns virtual-fitness labels—no historical measurements required. Those labels are then used to recursively and actively refocus the generator on the most promising frontier, so each round sharpens proposals toward the objective while preserving diversity and biochemical plausibility. The loop is explicitly iterative—propose, label, select, update—and because the model learns from its own feedback rather than a fixed dataset, it adapts rapidly to new targets and sustains momentum in data-scarce settings.

Designed for rapid prototyping, Seed&Seek functions as a model-guided design engine. It keeps constraints transparent, returns ranked candidates with uncertainty and traceable mutation histories, and compresses the time from hypothesis to testable design. While the pipeline can optionally ingest small batches of later measurements, its defining feature is that it begins and advances data-free, driven by strong priors and a recursive active learning loop. Our current protein-binder demo—starting from SpyCatcher-002 as the wild-type seed and optimizing for binding-rate performance with SpyTag002—illustrates this workflow end-to-end, showing how Seed&Seek concentrates search efficiently and delivers experiment-ready hypotheses for subsequent wet-lab validation.

Research

The advances of AI-guided directed evolution

The first move in model-guided directed evolution is no longer blind mutagenesis but in-silico seeding with strong generative priors. Using backbone-aware designers (e.g., MPNN-style sequence builders) and diffusion models for interfaces and scaffolds, we generate multiple, fold-consistent starting variants around a prototype and rank them with fast, physics-informed signals. This turns the "first look" from intuition into a reproducible shortlist of candidates whose constraints and assumptions are explicit.

A shortlist, however, is only the beginning. Detailed evaluators—whether stability proxies, docking-style scoring, or brief relaxations—behave like expensive black-box objectives, so brute-force screening at scale quickly becomes the bottleneck. We therefore couple seeding with a recursive active learning loop: propose a small batch, label it virtually, update the generator, then repeat. Each round concentrates computation on the most promising regions while preserving diversity and biochemical plausibility. In practice, this makes the model a decision engine that learns where to look next, improving sample-efficiency without claiming to perfectly simulate molecular reality.

Crucially, this workflow remains entirely in silico through the Seed&Seek phase, producing a ranked, constraint-audited set of designs with uncertainty and traceable mutation histories. Only then do we hand off to wet-lab directed evolution, which starts from this strengthened seed rather than from a wild type. The result is a smaller, sharper experimental campaign focused on biological fit and environmental adaptability, where lab selection can probe context-specific behaviors that computation cannot yet capture at scale. In short, Seed&Seek uses modern generators and a recursive active loop to spend simulation time wisely, deliver a scientifically defensible starting point, and make downstream evolution faster, cheaper, and more targeted.

Core Designs

1. RF Diffusion+ Protein MPNN for Mutation ▼

2. Molecular Simulation ▼

3. Demo: K_on of Protein Binders ▼

4. Implementation of Graph (Structural) Information ▼

5. Generational Deep Learning ▼

Development

Mutate from the Sample

Model:

RFdiffusion, developed at the University of Washington by Baker Lab, enabled us to explore new backbone geometries either from complete Gaussian noise (while fixing SpyTag at the interface) or by partially perturbing the original SpyCatcher structure. ProteinMPNN, also developed by Baker Lab, was then used to design multiple candidate sequences for each backbone generated from our RFdiffusion runs, while taking into account the fixed SpyTag and surrounding residues. These sequences were subsequently folded into 3D structures using AlphaFold2, an AI-based program created by Google DeepMind which predicts protein structures in the context of other molecules. Finally, the resulting structures were passed into our pipeline, which focused on estimating the association rate constant (kₒₙ).

Methods:

We start from the SpyCatcher-002 PDB. In RFdiffusion, we raise the sampling temperature and promote backbone diversity, setting num_designs to generate many distinct backbones. Each backbone is then fed to ProteinMPNN, where a higher sampling temperature draws num_seq sequences per backbone. Thus, total variants = num_designs × num_seq. For every design, we keep the RFdiffusion backbone together with its MPNN-designed sequence, yielding paired (structure, sequence) candidates. MPNN outputs both sequence and PDB structure files, enabling us to filter the variants by stability with Alphafold2. The result is a broad, traceable starting library that spans plausible conformational alternatives while remaining fold-consistent—ready for virtual labeling (e.g., association-rate estimates) and for iteration in our recursive, model-guided optimization loop.

Results:

We begin with SpyCatcher-002 (wild type) on the left. The red residues mark loop and surface patches that Keeble et al. showed to modulate encounter electrostatics and local flexibility—prime real estate for safely accelerating k_on without touching the covalent capture motif. Passing this structure through RFdiffusion (center) gives a handful of backbone variants that preserve the β-sheet core but subtly reshape loops and charge presentation. Each backbone then enters ProteinMPNN, which proposes fold-consistent sequences (right). The highlighted substitutions (red) concentrate on those loop/solvent-exposed regions: we see families of designs that introduce basic residues along the approach face, trim acidic clusters that slow association, or stiffen flexible segments by swapping gly/ser for threonine or aromatic anchors. Across the three examples, the narrative is the same: diffusion widens our geometric options; MPNN fills them with sequences that respect the scaffold; together they yield paired (structure, sequence) candidates that embody the acceleration logic from the SpyTag/SpyCatcher literature—ready for virtual k_on labeling and the next round of our seek-and-seed loop.

Evaluate the Variants

Model:

Browndye is a Brownian dynamics software suite for simulating diffusional encounters between biomolecules and estimating second order association rate constants kₒₙ. It advances two rigid reactants in a continuum solvent under random thermal forces and deterministic interactions from electrostatics and short range exclusion or van der Waals terms. By precomputing electrostatic fields on APBS grids and sampling large ensembles of short trajectories, the package captures the dominant physics of encounter while remaining far more efficient than explicit solvent molecular dynamics for diffusion controlled steps.

The workflow is modular and reproducible. Each reactant is provided in PQR format with positions, radius and charges. Long range electrostatics are supplied as DX maps computed by APBS. Productive encounter is defined in a compact XML reaction file that names interfacial atom or residue pairs and the distance thresholds and counts required to declare reaction. The simulator launches many trajectories from a standard separation and records whether a path reacts or escapes. The final kₒₙ is obtained from reacted versus escaped counts with confidence intervals, and a weighted ensemble option is available when reactions are rare.

Methods:

We integrated Browndye as a fast, physics-based labeller for association kinetics. Structures for SpyCatcher-002 and SpyTag-002 (and their RFdiffusion/MPNN variants) were protonated and converted to PQR with radii/charges; long-range fields were computed on APBS grids. A compact reaction XML specified interfacial atoms/residues and distance/count thresholds corresponding to a near-productive approach geometry. Browndye then launched large ensembles of short Brownian trajectories from a standard separation in a continuum solvent, advancing two rigid reactants under random thermal forces plus electrostatics and steric/van-der-Waals interactions. For each candidate, we recorded react vs escape paths, estimated k_on from those counts with confidence intervals, and used weighted-ensemble sampling when encounters were rare. This modular setup (Figure 2) lets us apply the exact same protocol to thousands of (structure, sequence) pairs while keeping assumptions—rigid bodies, diffusion-controlled step, continuum solvent—explicit. A substantial tail lies above wild type, reflecting designs that strengthen electrostatic steering or reduce loop drag on the approach face; a smaller tail falls below, often corresponding to charge-neutralizing edits or flexible-loop insertions. The molecular snapshot illustrates a typical simulated SpyTag–SpyCatcher encounter configuration used for labeling.

Results:

First, we validated the labeler against literature values: SpyCatcher-002's k_on ≈ 20,000 M⁻¹·s⁻¹ is reproduced by our pipeline (19,973 M⁻¹·s⁻¹, CI from trajectory counts), indicating that Brownian dynamics captures the dominant encounter physics for this system. We then processed 2,000 RFdiffusion/MPNN variants(50 backbones * 64 num_sequence then filtered by stability from Alphafold). The histogram in Figure 1 shows a broad, roughly unimodal distribution centered near the wild-type dashed line, spanning ~1.6×10⁴–2.45×10⁴ M⁻¹·s⁻¹.

Experiments: Baseline Directed Evolution

Design:

We can form a directed evolution loop!!!

With the approaches developed combined, we are capable of reproducing a classic directed-evolution cycle entirely in silico. In each generation, we (i) generate variants with a diffusion-based mutator (analogous to error-prone mutagenesis but structure-aware), (ii) estimate association kinetics (k_on) by Brownian-dynamics labeling, and (iii) select the top 10 variants to seed the next round. We tested batch sizes of 50, 100, 200, and 500 variants per round. For example, in Batch-100: RFdiffusion proposes 100 variants → simulate → keep the best 10 → each seed ~10 new variants → 100 candidates for the next round. This "mutate → simulate → select" loop lets us observe how performance evolves under pure in-silico pressure, without a learned model.

Results:

Performance converges!!!

Across 14 generations, the best k_on in each round rises rapidly and then converges. Larger batches consistently reach higher plateaus (Batch-500 > 200 > 100 > 50), confirming that more exploration per generation improves the chance of finding superior variants. Nevertheless, even with the most generous setting, gains level off after ~6–7 rounds—a hallmark of local-optimal saturation when mutations stop discovering genuinely new, viable configurations. The optimized sequence explored for this loop yields k_on of 41022.1 M⁻¹s⁻¹ with 3500 sets (batch500, gen7) of candidates evaluated.

The full sample distribution for Batch-500 (500 variants × 14 rounds = 7,000 designs) shows that sequences exceeding the SpyCatcher-002 baseline (~20,000 M⁻¹·s⁻¹, red line) are present each round but remain a minority. Most proposals cluster below the frontier, indicating that naive mutation + screening expends many evaluations on neutral or deleterious edits.

Analysis:

We have to preserve the hidden knowledge!!!

This experiment intentionally does not include any learned model; it mirrors traditional directed evolution with a computational assay. The observed plateaus have two causes: (i) random (or diffusion-biased) mutations eventually erase earlier favorable patterns, and (ii) evaluation is expensive, so we cannot brute-force our way past local optima. The remedy is to capture the information hidden in the elites each generation—electrostatic arrangements, loop stiffening, approach-face chemistry—and reuse it to guide proposals. In other words, instead of repeatedly "mutating," we should train a structure-aware generator on elite/frontier variants, optimize with active learning, and spend simulation budget only where it moves the frontier. This is precisely what our Seed&Seek pipeline adds on top of the baseline loop: it keeps the hard-won knowledge, proposes targeted edits, and breaks plateaus more efficiently than blind exploration.

Build and Train a Deep Learning Model

Model:

To turn the "mutation → simulate → select" loop into a knowledge-preserving generator, we trained a multimodal latent autoencoder on RFdiffusion+ProteinMPNN variants labeled in silico by Brownian dynamics (BD). The encoder has two synchronized inputs:

Sequence stream: Amino-acid tokens are embedded with ESM-2 (650M) and passed through a lightweight Transformer encoder to produce a context-aware sequence embedding.
Structure stream: The same variant is represented as a residue-level contact graph; node features encode residue identity/secondary-structure hints and edge features are RBFs of inter-residue distances. A GINEConv stack with Set2Set readout yields a structural embedding. During pretraining, we add three self-supervised signals on the graph branch—contrastive learning with a memory bank of hard negatives, prototype consistency, and masked-node recovery—so the structural representation becomes geometry-aware yet robust to small contact perturbations.

Both streams are L2-normalized and fused into a single latent vector z. A small performance head maps z to the normalized kon, shaping the manifold around what the BD labeler deems favorable. A compact Transformer decoder then reconstructs only the mutated positions (edit-only reconstruction): we supply a mutation mask derived from the RFdiffusion/MPNN delta to (i) copy untouched residues straight through via a skip connection, and (ii) apply the reconstruction loss exclusively on mutated sites. Non-mutated tokens incur no loss and are passed through unchanged. This design has three payoffs crucial for Seed&Seek:

Protects conserved motifs. Catalytic/structural residues are preserved by construction, preventing the decoder from "improving" what must not change.
Focuses capacity on edits. The latent is forced to encode how and why successful edits work (electrostatic tuning, loop stiffening), not to memorize the entire sequence.
Enables constrained generation. At proposal time we keep the mask fixed (or tighten it), letting the model suggest targeted edits while guaranteeing global fold-stabilizing regions remain intact.

We tie input/output embeddings for stability, keep the decoder shallow to avoid rote memorization, and regularize with label smoothing, token/edge dropout, and occasional encoder-drop (decode from z alone). The result is a fitness-aware, geometry-aware latent that captures the semantics of beneficial edits and can be searched directly in the recursive seek phase.

Methods:

# ====== Data ======
# batch = (seq_tokens, structure, mut_mask, y_target)  # y_target = normalized k_on (BD)

# ====== Modules ======
# --- Sequence side (ESM + adapters + small encoder) ---
ESM = ESM2_650M()                         # frozen weights by default
ESM_Adapters = LoRA_Adapters(ESM, rank=r) # trainable; attach to last L layers
SeqEncoder   = TransformerEncoder_small() # light encoder on top of ESM embeddings

# --- Graph side (residue-level contact graph) ---
GraphEncoder = GINE_Stack_With_Set2Set(   # GINEConv x L, Set2Set readout
    node_in_dim=20 + ss_hints, rbf_k=32, hidden=256, set2set_steps=3
)

# --- Fusion / heads / decoder ---
Fuse       = LinearFuse()                 # concat → linear → L2
PerfAdapter= MLP()                        # Z → Z_v (performance-adapted)
PerfHead   = Linear()                     # predicts performance scalar (normalized k_on)
Decoder    = TransformerDecoder_tiedIO()  # compact; edit-only reconstruction

# --- Graph SSL heads (for pretraining the GNN) ---
ContrastHead = MoCoQueue(temp=0.07, queue=2048)   # hard negatives
ProtoHead    = PrototypeConsistency(K=100, temp=0.05)
MaskNodeHead = LinearClassifier(20)               # AA category at masked nodes

# ====== Utilities ======
function BuildResidueGraph(structure):
    nodes = residues(structure)
    edges = kNN_or_distance_edges(nodes)          # e.g., k=16 or d<8Å
    E_attr = RBF(distance(edges), k=32, dmin=0, dmax=10)
    X_node = onehotAA(nodes) + secStructHint(nodes)
    return Graph(X_node, edges, E_attr)

function EditOnlyCE(logits, targets, mut_mask):
    idx = [i | mut_mask[i] == 1]
    return CrossEntropy(logits[idx], targets[idx])

L2(x) = x / max(ε, ||x||)

# ====== Forward (joint AE) ======
function FORWARD(seq_tokens, graph, mut_mask):
    # SEQUENCE: ESM embeddings (with optional adapter fine-tuning)
    esm_tok, esm_pooled = ESM.embed(seq_tokens)        # gradients allowed only through adapters
    h_seq  = SeqEncoder(esm_tok)                       # small Transformer on top
    Z_seq  = L2(pool(h_seq))

    # GRAPH: GINE + Set2Set
    Z_graph = L2( GraphEncoder(graph) )                # Set2Set readout → projection

    # FUSE → Z → Z_v → Z_latent
    Z       = L2( Fuse(CONCAT(Z_seq, Z_graph)) )
    Z_v     = L2( PerfAdapter(Z) )
    Z_lat   = L2( CONCAT(Z, Z_v) )

    # HEADS
    y_hat   = PerfHead(Z_lat)
    logits  = Decoder(Z_lat, context=seq_tokens)       # logits per position

    # EDIT-ONLY reconstruction
    rec_seq = seq_tokens
    for i in range(len(seq_tokens)):
        if mut_mask[i] == 1:
            rec_seq[i] = ARGMAX(logits[i])

    return Z_seq, Z_graph, Z, Z_v, Z_lat, y_hat, logits, rec_seq

# ====== Sequence-side adaptation (ESM + SeqEncoder) ======
# Masked-LM on mutated or high-entropy positions; keeps ESM stable via adapters.
function SEQ_PRETRAIN_STEP(seq_tokens, mut_mask):
    mask_idx = sample_positions(mut_mask, rate=0.15)          # prefer mutated sites
    seq_in, seq_tgt = apply_MLM_mask(seq_tokens, mask_idx)

    esm_tok, _ = ESM.embed(seq_in)                            # adapters get gradients
    h_seq  = SeqEncoder(esm_tok)
    logits = LM_head(h_seq)                                   # tied with decoder embeddings

    L_mlm  = CrossEntropy(logits[mask_idx], seq_tgt[mask_idx])
    OPT_SEQ.step(L_mlm)                                       # optimize ESM_Adapters + SeqEncoder

# ====== Graph-side pretraining (GINE + Set2Set) ======
function GRAPH_SSL_STEP(graph_batch):
    g1 = augment_graph(graph_batch)                           # edge-drop, rbf noise, node-drop
    g2 = augment_graph(graph_batch)

    z1 = L2(GraphEncoder(g1))
    z2 = L2(GraphEncoder(g2))

    L_contrast         = ContrastHead(z1, z2)                 # InfoNCE + memory bank
    L_proto, L_entreg  = ProtoHead(z1, z2)                    # prototype consistency + entropy reg
    L_masknode         = CE(MaskNodeHead(mask_nodes(g1)), true_labels_of_masked_nodes(g1))

    L_ssl = L_contrast + λp*L_proto + λe*L_entreg + λm*L_masknode
    OPT_GNN.step(L_ssl)                                       # optimize GraphEncoder + SSL heads

# ====== Joint AE training step (supervised by BD labels) ======
function JOINT_AE_STEP(seq_tokens, structure, mut_mask, y_target):
    graph = BuildResidueGraph(structure)
    _, _, Z, Z_v, Z_lat, y_hat, logits, _ = FORWARD(seq_tokens, graph, mut_mask)

    L_pred = MSE(y_hat, y_target)                             # performance shaping (k_on)
    L_rec  = EditOnlyCE(logits, seq_tokens, mut_mask)         # only mutated positions
    L_total = L_pred + α*L_rec + REG()                        # label smoothing, token/edge dropout

    OPT_AE.step(L_total)                                      # optimize SeqEncoder, GraphEncoder,
                                                              # Fuse, PerfAdapter, PerfHead, Decoder
# NOTE: when fine-tuning, freeze ESM backbone; unfreeze only LoRA adapters.

# ====== Optimizers / parameter groups ======
OPT_SEQ = Adam(params={ESM_Adapters, SeqEncoder, LM_head},        lr=1e-4, wd=1e-5)
OPT_GNN = Adam(params={GraphEncoder, ContrastHead, ProtoHead,
                       MaskNodeHead},                              lr=1e-3, wd=1e-5)
OPT_AE  = Adam(params={SeqEncoder, ESM_Adapters, GraphEncoder,    # small LR for adapters
                       Fuse, PerfAdapter, PerfHead, Decoder},      lr=5e-4, wd=1e-5)

# ====== Training loop ======
for epoch in 1..EPOCHS:
    for batch in DataLoader(TRAIN, B):
        seq_tokens, structure, mut_mask, y = batch

        # 1) sequence-side MLM adaptation (few steps per epoch)
        if rand() < p_seq_ssl:
            SEQ_PRETRAIN_STEP(seq_tokens, mut_mask)

        # 2) graph-side SSL (few steps per epoch)
        if rand() < p_graph_ssl:
            GRAPH_SSL_STEP(BuildResidueGraph(structure))

        # 3) supervised joint AE step (always)
        JOINT_AE_STEP(seq_tokens, structure, mut_mask, y)

    # validation (token accuracy on mutated sites + R^2 on k_on)
    val_tok_acc = evaluate_edit_only_accuracy(VALID)
    val_R2      = evaluate_R2(VALID)
    early_stop_if_needed(val_tok_acc, val_R2)

Data regime. Our training signals come from synthetic variants plus noisy BD labels, so we lean on large, frozen priors (ESM) and a small trainable head to avoid overfitting. The graph stream encodes what sequence cannot tell us—spatial neighborhood chemistry, orientation, and long-range couplings that are local in 3D—so mutations that preserve the fold but shift approach-face electrostatics can be separated in latent space.

Noise and regularization. BD labels have counting noise (react vs escape paths), so we use label-smoothing for the performance head, stochastic token/edge drop, and encoder-drop (occasionally decoded from z only) to prevent the model from attributing noise to specific residues. Temperatures in the contrastive/prototype heads are clamped to keep gradients well-scaled; a small EMA of weights stabilizes training.

Active-learning readiness. We deliberately keep z low-dimensional and normalized. In later "seeking" rounds, an acquisition function can move z toward better regions under compute budgets, then the decoder proposes concrete sequences—exactly the operation our mutation-only experiment lacked.

Results:

Reconstruction and prediction trade-off.

Training on RFdiffusion+MPNN variants labeled by BD produced a smooth, fitness-aware latent. Quantitatively, increasing the latent dimensionality from 96 → 128 raises token reconstruction accuracy from 0.942 → 0.987, confirming that the decoder can nearly perfectly reproduce mutants. However, the predictive alignment between latent and kinetics shows the opposite trend: the performance head's R² on held-out kon drops from 0.855 → 0.793. This is the classic capacity trade-off—larger latents allocate more degrees of freedom to memorize sequence detail, while slightly weakening the pressure to organize z along fitness-relevant directions. For downstream seeking, we therefore favor d=96 as the default (better R² while maintaining high reconstruction), and reserve d=128 when near-lossless reconstruction is essential (e.g., strict motif preservation).

Latent–fitness geometry.

Projections of z by PCA and UMAP retain clear fitness gradients: warmer points (higher normalized kon) form contiguous neighborhoods, indicating that small moves in latent space correspond to meaningful, chemistry-preserving edits. The t-SNE fitness surface exhibits broad ascents rather than spiky ridges, and "weight-climbing" trajectories converge toward basins—both signals that the latent is navigable for active learning.

Interpretation: for Seed&Seek's recursive search, d=96 offers the best balance—accurate reconstruction (accuracy > 0.9 shows highly efficient sequence reconstruction) with stronger fitness shaping—while d=128 is a high-fidelity option when exact sequence recovery is prioritized.

Explore on Latent and Generate

Model:

At inference we search directly inside the autoencoder's latent space. You give the system a target performance (for example, a high normalized k_on). A small latent optimizer then moves to a point in the latent space where the model's performance head predicts that target. From that optimized point, we create a small cloud of nearby latent vectors to promote diversity. Finally, each latent vector is decoded by the Transformer decoder to produce a sequence candidate. Because decoding is driven only by the latent vector, there is no need to rebuild graphs or run encoders at this stage; it is fast and easy to batch. Two simple controls shape exploration: the regularization that keeps the latent near regions seen during training, and the "spread" rule that determines how widely we sample around the optimized point (either isotropic Gaussian noise or covariance‐aware noise estimated from nearest neighbors in the training latents).

Methods:

############################################################
# EXPLORE ON LATENT AND GENERATE — PSEUDOCODE
# Inputs:
#   - target_perf: desired normalized performance (e.g., k_on)
#   - N: number of sequences to generate
#   - cfg: knobs for optimization and diversity (below)
# Outputs:
#   - candidates: list of {sequence, pred_perf, latent}

PREP:
  model = load_trained_AE()            # exposes perf_head(z) and z-only decoder
  Z_train = load_training_latents()    # (N, D) latent database

function KNN_NEIGHBORS(z):
  d = pairwise_distance(z, Z_train)
  idx = argsort(d)[:knn_k]
  return Z_train[idx]                  # (k, D)

function KNN_MEAN(z):
  return mean(KNN_NEIGHBORS(z), axis=0, keepdims=True)

function OPTIMIZE_LATENT(target_perf):
  best = None
  repeat restarts times:
    z = normal_init(D, std=0.5)
    for t in 1..steps:
      y_hat = model.perf_head(z)
      loss = (y_hat - target_perf)^2 + l2_weight*norm2(z)
      z_knn = KNN_MEAN(z)
      loss += knn_weight * norm2(z - z_knn)
      z = z - lr * grad(loss, z)
    score = -abs(model.perf_head(z) - target_perf)
    if best is None or score > best.score:
      best = {z: copy(z), pred: model.perf_head(z), score: score}
  return best.z, best.pred

function SAMPLE_KNN_COV(z_opt, m):
  nbr = KNN_NEIGHBORS(z_opt)             # (k, D)
  X = nbr - mean(nbr, axis=0)
  V, S = top_svd(X)                      # low-rank basis and scales
  list = []
  for i in 1..m:
    eps = normal_r_like(S)               # draw in r-dim principal space
    jitter = (eps * S) @ V^T
    list.append(z_opt + spread_scale * jitter)
  return list

function DECODE_LATENTS(z_list):
  out = []
  for z in z_list:
    seq = decode_with_z(model, z,
                        temperature=decode_temperature,
                        top_p=decode_top_p,
                        no_repeat_ngram=no_repeat_ngram)
    pred = model.perf_head(z)
    out.append({sequence: seq, pred_perf: pred, latent: z})
  return out

# main
z_opt, pred_at_opt = OPTIMIZE_LATENT(target_perf)
z_pool = SAMPLE_KNN_COV(z_opt, num_candidates)
candidates = DECODE_LATENTS(z_pool)
save_PCA_background_and_candidates(Z_train, z_pool, candidates.pred_perf)
return candidates

We search the autoencoder's latent space for a point whose predicted performance matches a target while staying on the training manifold. Let f(z) be the performance head (normalized k_on). The loss we minimize is:

$$\mathcal{L}(z) \;=\; \bigl(f(z)-y^{\star}\bigr)^{2} \;+\; \lambda \,\lVert z\rVert_{2}^{2} \;+\; \lambda_{\mathrm{knn}}\,\bigl\lVert z-\bar{z}_{\mathrm{knn}}(z)\bigr\rVert_{2}^{2}.$$

where z_knn(z) is the mean of the K nearest training latents to z. After optimization yields z_opt, we generate diversity by sampling around it using the local KNN covariance: compute the SVD of centered KNN latents to get principal directions v_r and scales σ_r, then get:

$$z_i \;=\; z_{\mathrm{opt}} \;+\; \mathrm{scale}\cdot \sum_{r} \sigma_{r}\,\xi_{r}\,v_{r}, \qquad \xi_{r}\sim\mathcal{N}(0,1).$$

Each z_i is decoded autoregressively (temperature and top-p) to a sequence.

Results:

We apply the following defaults: target_k_norm=2.4, num=600, lambda_knn=0.05, per_candidate_latent=True, spread_mode="knn_cov", knn_k=128, spread_scale=1.5, temperature=1.0, top_p=0.9.

Observations from the PCA plots (KNN / spread = 2.5 vs 4.5).

Both panels place candidates along warm (high-performance) bands of the latent manifold—evidence that the KNN prior keeps samples on realistic directions.

Spread = 2.5: candidates form a compact cloud around the optimized point; the median predicted score (white tag) is slightly above the surrounding background and variance is modest—good exploitation.
Spread = 4.5: candidates occupy a wider region following manifold ridges; the median predicted score remains comparable, but the tail covers more diverse zones—better exploration with a reasonable quality floor because the local covariance steers away from empty space.

How to tune during active learning

Early cycles (explore): choose a larger spread (e.g., 3.5–4.5) and keep knn_k moderately large (64–128). This yields diverse proposals without falling off-manifold. Keep temperature near 1.0 and top_p≈0.9 to let decoding express the latent diversity.
Mid cycles (balance): reduce spread toward 2.0–3.0 while keeping lambda_knn=0.05 to maintain realism. If simulation budget is limited, drop num but increase restarts in optimization to refine z_opt.
Late cycles (exploit): tighten spread to 1.0–2.0 and optionally lower temperature (0.8–0.9) for higher per-candidate quality. Keep per_candidate_latent=True so proposals are not identical.

Safety rails: if candidates begin to drift into cooler regions, increase lambda_knn or knn_k; if all candidates cluster too tightly, raise spread_scale or slightly relax lambda_knn.

Practical recipe. Start with the defaults above; visualize each batch on PCA colored by predicted performance. If the candidate cloud is too concentrated, raise spread_scale toward 4.0 - 4.5; if quality drops noticeably, nudge lambda_knn to 0.075 – 0.1 or increase knn_k to 160 to pull proposals back onto the warm manifold. This schedule gives a smooth transition from exploration to exploitation while keeping generation fast and well-behaved.

Experiment: AI-guided Directed Evolution

Design: Active training

The full loop (diagram) runs as follows. We start from the pre-trained Seed&Seek autoencoder and its latent optimizer. Each active learning round proceeds in five steps:

Seek in latent. We optimize a latent point toward a target performance (normalized kon), then sample a KNN-covariance cloud around the optimum to get 600 candidates.
Decode & filter. We decode each latent to a sequence (edit-aware decoder).
Structure → physics. New sequences are sent to AlphaFold for structural hypotheses and then to Brownian dynamics (Browndye) to label kon.
Curate for fine-tuning. To avoid forgetting and collapsing, we carry over the top 300 from all previous rounds (diversity-aware sampling) and merge them with 300 newly ranked candidates → 600 total for that round's training set.
Fine-tune the model. We fine-tune the autoencoder on this 600-sample set:

The performance head regresses the new kon.
Decoder is trained in edit-only reconstruction so conserved positions pass through untouched; this keeps scaffolds stable.
Encoders receive a small learning rate with a KNN-prior and contrastive push to reshape the latent around new chemistry without drifting off-manifold.

Preventing "recursive collapse"

Two mechanisms make the loop robust:

Latent repulsion during re-encode. When new sequences are encoded back to z, we add a repulsive penalty that discourages assigning them to the exact same mode occupied by the previous round's proposals. Concretely, we compute KNNs between "old" and "new" latents and add a margin loss that pushes the new latents away from the old centroid while keeping them on-manifold (via the KNN prior). This forces both the decoder and the performance head to adapt to genuinely new directions rather than over-fitting one pocket.
Replay + diversity quota. We always replay the top-300 historical points and enforce a quota for diverse structural neighborhoods (based on AlphaFold contact maps). The fine-tune therefore sees both the frontier and the backbone, preventing drift and catastrophic forgetting.

Each new round re-runs this cycle, updating the model parameters and the latent optimizer's landscape.

Results: Seeking for optimized solution

Latent PCA

Left panel (colored by dataset). Pre-train points (gray) occupy the right-hand cluster. As we go through loop1→loop4 (brown→pink), the cloud shifts left and spreads, indicating the model is learning new latent directions rather than collapsing back to the original mode.

Right panel (colored by kon). Warmer colors (higher kon) concentrate in the shifted region, showing that the discovered latent territory corresponds to better association rates.

Kinetic distributions across rounds.

The density plot shows clear rightward shifts: pre-train ≈20k, loop1 ≈25–33k, loop2 ≈30–49k, loop3 ≈37–67k, loop4 ≈42–66k. The bulk improves and the high-end tail extends each round.

Mean and best kon.

The summary line plot quantifies the trend: mean increases nearly linearly across loops, while the best candidate improves even faster, consistent with KNN-covariance exploration plus targeted fine-tuning

Head-to-head efficiency.

With ~2,900 BD evaluations, Seed&Seek active training reaches a best simulated kon = 67,313 M⁻¹ s⁻¹ (+237% vs SpyCatcher002). A baseline "mutate-simulate-select" (RFdiffusion+MPNN only) needs ~3,500 evaluations to reach 41,022 M⁻¹ s⁻¹ (+105%). Active training achieves higher best performance with fewer simulations than the baseline, indicating better sample efficiency.

Analysis: why does the loop work and how do the knobs interact

Sample-efficiency and effect size.

Across ~2.9k BD evaluations, Seed&Seek's best simulated kon reaches 67.3k M⁻¹s⁻¹, a +237% lift over SpyCatcher002 and +64% over the baseline's best (41.0k) despite 17% fewer simulations (2,900 vs 3,500). The density curves shift right each loop and broaden asymmetrically—evidence that we're not just nudging the mean but expanding the high-performance tail. The mean/best lines rise in tandem, which usually indicates the model is re-shaping the latent rather than cherry-picking rare outliers.

Exploration–exploitation schedule.

Early rounds benefit from a larger KNN spread (2.5–3.5) and higher decoding temperature; this seeds multiple pockets that later rounds can exploit. The steady rise of the mean shows exploitation is working; the sustained growth of the best shows exploration hasn't shut down. Practically, we switched to a tighter spread (~1.5–2.0) once the density mode crossed ~35–40k to consolidate gains, which is reflected by the shrinking variance but rising mean.

Limits and next steps.

Plateauing will eventually occur as the frontier meets the model's prior or physics constraints. When the mean curve's slope begins to flatten, two actions usually help: (i) increase repulsion margin or decrease knn_k slightly to discover new pockets; (ii) inject small wet-lab batches (if available) to recalibrate BD biases. For deployment, we'd also add uncertainty-aware selection (e.g., ensemble perf heads) so each round mixes high-score and high-uncertainty candidates.

Discussion

Model Bias of RF Diffusion + MPNN

Theory

RFdiffusion proposes backbones by following a score learned from PDB-like structures; ProteinMPNN then “colors” those backbones with sequences drawn from residue–context statistics. Together they strongly favor (i) common secondary-structure motifs, (ii) conservative rotamer/packing patterns, and (iii) co-evolutionary residue pairings that keep you close to known folds.

What we observed.

In our baseline RFdiffusion+MPNN loop (no learning, just generate–score–keep-best), early rounds improved quickly, but by round 3–4 the new winners differed at only a handful of positions from earlier winners; backbone drift was minimal and sequence diversity collapsed around a few hotspots. In other words, the loop kept re-drawing from the same RF/MPNN comfort zone.

Why active training helps (but doesn’t eliminate model bias).

Our active loop adds a small model that learns from the labels we create: a multimodal latent encoder (sequence + residue-graph), an edit-only decoder, and a z-space optimizer. Three design choices slow the collapse:

1. Edit-only reconstruction preserves conserved motifs and forces capacity onto changes, not full-sequence memorization.
2. KNN-covariance jitter proposes candidates around the latent optimum, but along directions supported by real data — more diverse than greedy ascent, less random than noise.
3. Replay + repulsive re-encoding (keep top historical examples; penalize mapping new encodings onto last round’s centroid) pushes each round into a neighboring latent pocket instead of the same one.

This does keep diversity alive longer: after two rounds we still saw non-trivial sequence movement and fresh high-scorers. However, the active learner is pre-trained and fine-tuned on RF/MPNN-generated variants. Its representation ultimately reflects that proposal manifold, so the loop still converges by ~round 3 in our current setup. In short: active learning extends exploration and makes it more sample-efficient, but if the seeds come from RF/MPNN only, the ceiling you hit is still largely defined by those models.

Implication.

To push further, future versions should (i) mix additional generators (loop/Interface-specialized diffusion, hallucination from large PLMs), (ii) inject orthogonal perturbations (electrostatics-targeted edits; mini-MD stability filters), or (iii) bring in even a small amount of wet-lab feedback to break the RF/MPNN prior.

Fully in silico?

Our development phase is entirely computational: candidates are generated, ranked, and advanced using modern designers, AlphaFold structure hypotheses, and Browndye kon estimates. This does not claim to replace experiments; it changes when we spend them. By delivering a strengthened starting point—a ranked set with uncertainty and mutation provenance—we reduce the size of the wet-lab campaign and focus it on biological fitness, stability, and environmental adaptability that current simulation cannot robustly measure at scale.

Accuracy therefore depends on two fronts: (1) the realism of structural hypotheses and (2) the fidelity of the fast kinetic surrogate. AlphaFold gives plausible backbones/interfaces for many designs, and Brownian dynamics captures the diffusion-controlled encounter regime efficiently. For systems dominated by induced fit or large conformational selection, we expect weaker correlation; in those cases you can swap in a slower evaluator (e.g., restrained MD or MSM-based scoring) for a subset of candidates without changing the rest of the loop.

Practical guidance.Use Seed&Seek to compress the exploration phase—then hand off to wet-lab evolution from a boosted seed. As simulation tools improve, this hand-off simply moves later, making the loop even more sample-efficient.

Patent Provision

We submitted the full Seed & Seek pipeline to a professional patent attorney for a novelty / freedom-to-operate (FTO) search and patentability assessment. The review concluded that our integrated approach—multimodal latent learning + z-space optimization + edit-only decoding (reconstructing only specified mutation sites) + Brownian-dynamics labeling inside the active loop—is clearly distinguishable from prior art and supports assertable novelty and non-obviousness. We therefore received a high recommendation to proceed with a formal patent filing, and we have begun drafting claims and planning the priority filing strategy.

Future works

What's new in Seed & Seek (practical view).Our contribution is a closed-loop, data-free modeling pipeline that:

1. Seeds diversity with structure-aware generators (RFdiffusion + ProteinMPNN),
2. Labels candidates entirely in silico via AlphaFold→Brownian dynamics (fast k_on surrogate), and
3. Seeks by (optimizing a multimodal latent space) (sequence + residue-graph encoders) while preventing collapse with KNN-covariance exploration, replay, and repulsive re-encoding; the decoder edits only mutated sites, preserving conserved motifs.

This combination turns the classic “mutate → simulate → select” loop into a knowledge-preserving generator that keeps improving with every round and hands wet-lab teams a strengthened starting point. In a patent positioning sense, the protectable thrust is the integrated method (multimodal latent AE + z-space optimizer + edit-only decoding + physics-in-the-loop active learning) rather than any single off-the-shelf component. For filing, we would frame claims around: (i) learning a geometry- and performance-aware latent from sequence + residue-level graphs; (ii) z-space optimization coupled to edit-only decoding under explicit mutation masks; (iii) collapse-avoidance via KNN-cov sampling and repulsive allocation between rounds; and (iv) the specific use of a Brownian dynamics-based association-rate labeler to supervise the latent and guide proposals. (Summary adapted to align with your prior consultation notes and scope.)

Implementation

Seed & Seek as a Protein Engineering Platform

Setup. After completing several active loops, we froze the autoencoder encoders and the performance head and evaluated them on a held-out set of RFdiffusion+MPNN variants that were not used for training in the last loop. For each variant we computed two numbers: 1. Seeds the "ground-truth" kon from our AF→BD pipeline, and 2. the model's k^on obtained by encoding the sequence+structure graph and reading out the performance head (no decoding, no BD).

Results. The scatter on the right compares BD kon (x-axis) to the simulator's kon (y-axis). We observe a tight linear relationship with R^2 = 0.991, a best-fit slope close to 1 and a small intercept, indicating good calibration across the range tested (≈2×10⁴–7×10⁴ M⁻¹s⁻¹). Residuals are roughly homoscedastic with a slight widening at the extremes—consistent with higher BD variance near diffusion-limited or geometry-limited regimes.

Adaptability and Contribution

Where we are now. We have the Seed & Seek training pipeline running locally: RFdiffusion + ProteinMPNN for seeding, AlphaFold→Browndye (or another fast surrogate) for labeling, and a small active learner (multimodal latent AE + z-optimizer + edit-only decoder) that proposes the next round. We do not yet have a production cloud or a full UI.

How we intend teams to use it once wrapped (as in the figure).
A user would upload a prototype sequence/structure, optionally specify constraints (editable residues, tags, pH/temperature window), and choose an objective. The service would then run three automated blocks:

Seeding (< 1 h): RFdiffusion + MPNN generates a diverse, constraint-aware starting library.
2. Simulation (≈2–20 h, task-dependent): structures predicted and in-silico metrics scored (e.g., konk_\text{on}kon via Browndye for binders).
3. AI active training (≈1.5 h): the latent model fine-tunes on the new labels and performs z-space search to emit ranked candidates with uncertainty and mutation provenance.

Intended outputs. An “optimized protein structure/sequence” bundle: top-N sequences (FASTA), predicted structures (PDB) for the shortlist, and a reproducible report (inputs, masks, scores, why-this-design). The “pretrained customized AI” behind the button handles (i) functional analysis (motif/charge checks), (ii) sequence generation that respects locks, and (iii) rapid fine-tuning on the fly.

Realistic caveat. Until we finish the hosted UI and job orchestration, this remains a planned user flow. Today, teams can still run one or more local rounds with our scripts to pre-enrich libraries before the wet lab.

Where it likely helps (and where it may not) Helps (near-term): We have the Seed & Seek training pipeline running locally: RFdiffusion + ProteinMPNN for seeding, AlphaFold→Browndye (or another fast surrogate) for labeling, and a small active learner (multimodal latent AE + z-optimizer + edit-only decoder) that proposes the next round. We do not yet have a production cloud or a full UI.

● Binders / sensors where diffusion-limited on-rate is important; we can pre-enrich variants before a small wet-lab screen.

● Single-domain enzymes needing modest stability/charge refactoring while locking catalytic residues (edit-only decoding).

● Interface polishing for heterodimers or tag–binder systems when you want to protect epitopes or secretion tags.

Throughput. On our hardware, a forward pass of the simulator takes milliseconds per design, whereas AF→BD takes minutes to hours, yielding ~100–300× more hypotheses explored per unit compute. We still validate top designs with physics and (eventually) experiments—but the simulator lets Seed & Seek search far more intelligently before paying the full AF→BD cost.

May not (near-term):

● Designs dominated by large conformational changes, allostery, or complex oligomerization equilibria.

● Membrane proteins or strongly condition-dependent systems unless the surrogate is re-tooled for those regimes.

● The active learner is pretrained and fine-tuned on RF/MPNN-generated variants; without external data it typically converges by ~3 rounds. It improves sample-efficiency and diversity vs. baseline, but it doesn’t escape the RF/MPNN manifold indefinitely.

● Physics labels are approximations (AlphaFold structures, Browndye on-rates, fast ΔΔG, etc.). They are useful for ranking under the stated conditions, not guarantees of wet-lab outcomes.

● Cloud UI, job queueing, and one-click reports are planned, not shipped. Running the pipeline currently requires our scripts and a GPU box.

1. Define an edit mask: freeze catalytic/epitope/structural residues; allow edits in permissive loops or surfaces.
2. Run one local round (≈600 designs): RFdiffusion + MPNN seeding → AlphaFold→Browndye scoring → active learner proposes a ranked shortlist (e.g., 20–50).
3. Order a compact panel from the shortlist for screening.
4. If you obtain even 10–20 measurements, add them as gold labels and run one more round (the head re-weights toward real data; exploration uses KNN-cov jitter + uncertainty).
5. Move to the wet lab with a stronger starting point, and keep the edit mask to maintain safety/feasibility.

References

Show References ▼

Watson, J. L., Juergens, D., Bennett, N. R., … Baker, D. (2023). Broadly applicable and accurate protein design by RFdiffusion. arXiv:2303.04135. arXiv

Dauparas, J., Anishchenko, I., Bennett, N. R., … Baker, D. (2022). Robust deep learning based protein sequence design using ProteinMPNN. Science, 378(6615), 49–56. https://doi.org/10.1126/science.add2187

Jumper, J., Evans, R., Pritzel, A., … Hassabis, D. (2021). Highly accurate protein structure prediction with AlphaFold. Nature, 596, 583–589. https://doi.org/10.1038/s41586-021-03819-2

Abramson, J., Adler, J., Dunger, J., … Jumper, J. M. (2024). Accurate structure prediction of biomolecular interactions with AlphaFold 3. Nature, 630, 144–154. https://doi.org/10.1038/s41586-024-07487-w

Evans, R., O'Neill, M., Pritzel, A., … Senior, A. (2022). Protein complex prediction with AlphaFold-Multimer. bioRxiv. https://doi.org/10.1101/2021.10.04.463034

Lin, Z., Akin, H., Rao, R., … Rives, A. (2023). Evolutionary-scale prediction of atomic-level protein structure with a language model (ESM-2/ESMFold). Nature, 620, 824–833. https://doi.org/10.1038/s41586-023-06353-0

Baker, N. A., Sept, D., Joseph, S., Holst, M. J., & McCammon, J. A. (2001). Electrostatics of nanosystems: Application to microtubules and the ribosome. PNAS, 98(18), 10037–10041. (APBS) https://doi.org/10.1073/pnas.181342398

Ermak, D. L., & McCammon, J. A. (1978). Brownian dynamics with hydrodynamic interactions. The Journal of Chemical Physics, 69(4), 1352–1360. https://doi.org/10.1063/1.436761

Northrup, S. H., Allison, S. A., & McCammon, J. A. (1984). Brownian dynamics simulation of diffusion-influenced bimolecular reactions. The Journal of Chemical Physics, 80(4), 1517–1526. https://doi.org/10.1063/1.446900

10.

Xu, K., Hu, W., Leskovec, J., & Jegelka, S. (2019). How powerful are graph neural networks? In ICLR 2019. (GIN/GINE)

11.

Vinyals, O., Bengio, S., & Kudlur, M. (2015). Order matters: Sequence to sequence for sets. In NeurIPS 2015, 29. (Set2Set)

12.

He, K., Fan, H., Wu, Y., Xie, S., & Girshick, R. (2020). Momentum Contrast for Unsupervised Visual Representation Learning. In CVPR 2020, 9729–9738. (MoCo) https://doi.org/10.1109/CVPR42600.2020.00975

13.

Caron, M., Misra, I., Mairal, J., Goyal, P., Bojanowski, P., & Joulin, A. (2020). Unsupervised learning of visual features by contrasting cluster assignments. In NeurIPS 2020, 33, 9912–9924. (SwAV / prototype consistency)

14.

Ba, J. L., Kiros, J. R., & Hinton, G. E. (2016). Layer Normalization. arXiv:1607.06450. arXiv

15.

Hendrycks, D., & Gimpel, K. (2016). Gaussian Error Linear Units (GELUs). arXiv:1606.08415. arXiv

16.

Jolliffe, I. T., & Cadima, J. (2016). Principal component analysis: A review and recent developments. Philosophical Transactions of the Royal Society A, 374, 20150202. https://doi.org/10.1098/rsta.2015.0202

17.

McInnes, L., Healy, J., & Melville, J. (2018). UMAP: Uniform Manifold Approximation and Projection for Dimension Reduction. arXiv:1802.03426. arXiv

18.

Abraham, M. J., Murtola, T., Schulz, R., … Lindahl, E. (2015). GROMACS: High performance molecular simulations through multi-level parallelism from laptops to supercomputers. SoftwareX, 1–2, 19–25. https://doi.org/10.1016/j.softx.2015.06.001

19.

Michaud-Agrawal, N., Denning, E. J., Woolf, T. B., & Beckstein, O. (2011). MDAnalysis: A toolkit for the analysis of molecular dynamics simulations. Journal of Computational Chemistry, 32(10), 2319–2327. https://doi.org/10.1002/jcc.21787

20.

Keeble, A. H., Turkki, P., Stokes, S., … Howarth, M. (2017). Approaching infinite affinity through engineering of peptide–protein interaction: SpyTag002/SpyCatcher002. Angewandte Chemie International Edition, 56(52), 16521–16525.