Model Overview


From 0.03% to 87%: The Truth Behind the Efficiency Gap

Imagine a gene integration system where success rates can vary by nearly 3000-fold under different conditions. This isn't an experimental error—it’s a black box waiting to be opened.

In our GenOMe system, one-to-one means that one pair of attB and attP sites recombines once through Bxb1 integrase to form one integration event.

Traditional genetic engineering relies on trial and error, but we asked:

  • Which step truly limits efficiency?
  • Why is two-to-two deletion so much harder than one-to-one insertion?
  • Can we mathematically predict optimal conditions instead of blind testing?

Our answer: Build an interpretable, predictive five-layer integration model.


The Efficiency Gap at a Glance

System Type Traditional Mode A Ready-to-Integrate Mode B Efficiency Gain
One-to-one~0.57%--
Two-to-two~0.05%79–87%1600-fold

A single variable change produces a three-order-of-magnitude leap — this is the core mystery our model explains.


Modeling Strategy: Dissecting the Black Box Layer by Layer

We decomposed the Bxb1 gene integration system into five modelable biological layers, peeling away complexity like an onion:

model-overview-1
model-overview-2
model-overview-3

Model1(M1): Transcriptional Dynamics and Induction Window Optimization

Quantifies how arabinose initiates Bxb1 gene transcription and the dynamics of mRNA production and degradation.

M2 - Translation Layer: The Protein Accumulation Race

Calculates how long it takes for Bxb1 protein to reach effective concentration and when it begins to lose activity.

M3 - Chromosomal Layer: The Hidden Major Bottleneck

Models how ssAP forms attB sites on the chromosome — the system’s most fragile step.

Mechanism Breakdown: ssAP (single-strand annealing protein) must:

  1. Be taken up by cells (efficiency ~10–30%)
  2. Find the replication fork spatiotemporal window
  3. Successfully anneal with the lagging strand
  4. Avoid degradation or dilution

The multiplication of these four steps results in overall success rate < 1%.

M4 - Recombination Layer: The Two-Site Coordination Challenge

Quantifies how Bxb1 recognizes and recombines two attB sites, and why longer fragments reduce efficiency.

M5 - Experimental Assistant: From Mechanism to Practical Tool

model-overview-4

Transforms parameters from the first four layers into a lookup-table success rate predictor for experiment design.


Three Core Discoveries

Discovery 1: The Real Bottleneck Isn’t Where You Think

Most would assume low Bxb1 protein concentration causes poor efficiency. But the model reveals:

  • Intracellular Bxb1 reaches 200–400 nM (μM range)
  • In vitro experiments need only 10–50 nM for efficient recombination
  • So why is efficiency still low in living cells?

The answer lies in M3: randomness of attB site formation is the true killer.

  • Single ssAP integration probability into chromosome < 1%
  • Success rate strongly depends on chromosomal position:
    • Near oriC: ~0.6%
    • Near terC: ~0.03% (20-fold difference!)

Position Effect Formula: P(d) ≈ 0.6% × e−0.0007d, where d = distance from oriC (kb).


Discovery 2: The Mechanistic Cost of Two-to-Two

At the same chromosomal position (oriC-proximal), we compared:

  • One-to-one: ~0.57% success rate
  • Two-to-two: ~0.05% success rate

Why the 10-fold difference? The model provides the answer:

The fundamental mechanism complexity differs, not just a probability multiplication:

Success_one-to-one = P₀ × [modification factors]
≈ 0.001066 × φtot × fgc × flocus × distance_factor × …
≈ 0.57%

Success_two-to-two = P₀_dual × [same modification factors]
≈ (P₀ / 15–20) × φtot × fgc × flocus × distance_factor × …
≈ 0.05%

Why is P₀_dual so much smaller? Two-to-two deletion requires:

  1. Two independent ssAP integration events both succeed
  2. Spatial coordination between distant chromosomal sites
  3. Synchronized recombination at both attB sites
  4. Any single failure aborts the entire process

Intuitive understanding: It's like needing to successfully thread two needles at once while they're moving—even if you're skilled at threading one needle (one-to-one), doing two simultaneously in coordination is fundamentally harder, not just twice as hard.

Mechanistic complexity multiplier: Two-to-two is 15-20 times harder than one-to-one


Discovery 3: Skip the Bottleneck, Efficiency Soars to 80%+

When using Ready-to-Integrate Mode B (linear DNA with pre-built attB sites):

Fragment Length Measured Success Model Prediction Error
1039 bp87.5%87.6%+0.1%
1605 bp79.0%79.3%+0.3%
1613 bp80.0%79.1%−0.9%

Average prediction error < 0.4%

The model not only predicts accurately but also explains why:

  • Bypasses the M3 bottleneck (no need for ssAP to form attB)
  • M4 recombination becomes the dominant step, but remains efficient in the 1-2 kb range
  • Bxb1 concentration reaches effective range within 30 minutes

Model Validation: Data Speaks

We calibrated the model using three length anchor points (1039/1605/1613 bp):

Mean Absolute Error (MAE) = 0.37%
Maximum deviation < 1%
Covers practical range 1.0-1.6 kb

All predictions include 95% confidence intervals, with extrapolation regions (>4 kb, extreme GC%) clearly marked.


How to Read This Model?

Choose different reading paths based on your needs:

I want to use it quickly

→ Jump directly to M5 Experimental Design Assistant
Input your fragment parameters and get success rate predictions from the lookup table.

I want to understand why efficiency is low

→ Focus on M3 vs M4 Comparative Analysis
See where the bottleneck truly lies.

I want complete mechanistic understanding

→ Read sequentially M1 → M2 → M3 → M4 → M5
Trace the complete process from induction to integration.

I want to improve or adapt this model

→ Consult parameter tables and calibration guidelines for each layer.
All assumptions and limitations are explicitly stated.


What Does This Model Tell Us?

  1. Low efficiency isn't due to insufficient Bxb1, but because attB formation is too difficult.
  2. Chromosomal position effects are physical limitations that traditional mode cannot avoid.
  3. Two-to-two deletion is inherently harder than one-to-one insertion (mechanistic complexity).
  4. Mode B system bypasses the bottleneck, maintaining ≥80% efficiency for 1–2 kb fragments.
  5. Simple conditions suffice: 300–400 nM Bxb1, 30 minutes reaches the concentration plateau.

Extended Implications of the Model

Can this modeling framework be transferred to other systems?

Yes! This five-layer architecture applies to any gene editing system requiring:

  • Induced expression (M1–M2)
  • Chromosomal site marking (M3)
  • Sequence-specific recombination (M4)

Examples include:

  • Other integrases: Cre-loxP, FLP-FRT, ΦC31
  • CRISPR systems: Use M3 to model sgRNA delivery, M4 to model Cas9 cleavage efficiency
  • Transposon systems: Tn5, Tn7, etc.

Insights for Modern Gene Editing

The model reveals that "delivery bottleneck > reaction kinetics" — a principle explaining why:

  • In vitro assembled RNPs (CRISPR) outperform plasmid expression
  • Electroporated DNA is more stable than chemical transformation
  • Viral vectors dominate in eukaryotic cells

The underlying logic is the same: bypass random intracellular synthesis/marking processes by directly delivering functional molecules.


Ready to Dive Deeper?

This is not just a mathematical model, but a complete roadmap from gene expression to genome rewriting.

Choose your path and start exploring


Have questions about modeling? Want to see raw data? Click the sections to begin your journey!

Model1(M1): Transcriptional Dynamics of Bxb1 Expression


Model Scope

The M1 model describes the mRNA dynamics of ssAP (single-strand annealing protein) and Bxb1 integrase under various induction regimes.

  • The transcriptional outputs mg(t) provide direct input for M2 protein accumulation.
  • Together, M1 and M2 determine whether proteins reach functional steady states required for M3 (attB formation) and M4 (recombination).

A key feature of M1 is its ability to resolve induction time windows, including the effect of cooling periods where transcription is reduced but not fully terminated.

Mathematical Formulation

mRNA dynamics are governed by:

$$ \frac{dm_g}{dt} = \beta_g \cdot I_g(t) - d_{m,g} m_g, \quad g \in \{ \text{Bxb1, ssAP} \} $$

where:

  • mg(t): mRNA concentration (μM)
  • βg: transcription rate (μM/min)
  • dm,g=ln⁡(2)/t1/2: degradation constant (1/min), defined by half-life
  • Ig(t): induction function, varying with experimental design

Model Parameters

Name Definition Value [unit] Reference
βssAPTranscription rate of ssAP0.15 [μM/min]Estimated from promoter strength [1]
βBxb1Transcription rate of Bxb10.10 [μM/min]Estimated from promoter strength [1]
βBxb1-J23105Transcription rate of Bxb1 under promoter J231050.04 [μM/min]iGEM Registry [3]
t1/2,ssAPmRNA half-life of ssAP7.0 [min]Literature [2]
t1/2,Bxb1mRNA half-life of Bxb16.9 [min]Literature [2]
t1/2,Bxb1-J23105mRNA half-life of Bxb1-J231056.8 [min]Literature [2]
dm,ssAPDegradation constant of ssAP mRNA0.099 [1/min]Derived from half-life [2]
dm,Bxb1Degradation constant of Bxb1 mRNA0.100 [1/min]Derived from half-life [2]
dm,Bxb1-J23105Degradation constant of Bxb1-J23105 mRNA0.102 [1/min]Derived from half-life [2]
γResidual transcription ratio during cooling0.1Literature [5]
tinduce,ssAPInduction time for ssAP45 [min]Experimental condition
tinduce,Bxb1Induction time for Bxb1240 [min]Experimental condition
tinduce,J23105Induction time for J23105 promoter240 [min]Experimental condition
tcool,ssAPCooling time for ssAP10 [min]Experimental setting

For ssAP, we explicitly incorporated a cooling phase to reflect reduced transcriptional activity:

$ I(t) = \begin{cases} 1, & 0 \le t < 45 \quad \text{(full induction)} \\ \gamma, & 45 \le t < 55 \quad \text{(cooling, residual activity)} \\ 0, & t \ge 55 \quad \text{(fully off)} \end{cases} $

with γ = 0.1, representing 10% residual transcriptional efficiency during cooling on ice [4].

Analytical solutions follow canonical forms:

  • Continuous induction (I = 1):
    $$ m(t) = \frac{\beta}{d_m} \left(1 - e^{-d_m t}\right) $$
  • Switch-off (I = 0):
    $$ m(t) = m(t_{\text{off}}) e^{-d_m (t - t_{\text{off}})} $$
  • Steady-state level:
    $$ m_{\text{steady}} = \frac{\beta}{d_m} $$

These are standard analytical solutions [5] obtained by solving the first-order ODE for mRNA dynamics.

Simulation Results

  1. ssAP (45 min induction + 10 min cooling)
    • mRNA rose rapidly to a peak of ~1.43 μM at 45 min.
    • During the cooling phase, transcription was not fully halted but reduced to 10% activity, slowing decay.
    • By 120 min, mRNA levels approached baseline, illustrating a transient profile.
    ssAP (45 min induction + 10 min cooling)
  2. Bxb1 (inducible, 240 min with L-arabinose)
    • mRNA reached 95% of steady state within ~30 min, then maintained a plateau for 4 h.
    • This demonstrates long-term sustained supply consistent with Bxb1’s catalytic role.
    Bxb1 (inducible, 240 min with L-arabinose)
  3. Bxb1 (constitutive expression, promoter J23105)
    • Steady-state mRNA concentration stabilized at ~0.109 μM, significantly lower than inducible conditions.
    • Nevertheless, it provides a low but persistent baseline.
    Bxb1 (constitutive expression, promoter J23105)

Conclusions and Experimental Guidance

  • Conclusions:
    • M1 captures distinct temporal regimes: transient (ssAP) vs. sustained (Bxb1).
    • Protein bottlenecks arise not from insufficient magnitude but from induction cutoffs before steady-state.
    • Cooling phases must be considered as low-efficiency transcriptional windows, not as complete shutdowns.
  • Guidance:


    Extend ssAP induction to 45 min to align with its mRNA peak, accounting for residual transcription during cooling.

[1] Registry of Standard Biological Parts (iGEM). (n.d.). Promoter BBa_J23105. Retrieved from http://parts.igem.org/Part:BBa_J23105

[2] Bernstein, J. A., Khodursky, A. B., Lin, P. H., Lin-Chao, S., & Cohen, S. N. (2002). Global analysis of mRNA decay and abundance in Escherichia coli at single-gene resolution using two-color fluorescent DNA microarrays. PNAS, 99(15), 9697–9702. https://doi.org/10.1073/pnas.112318199

[3] Selinger, D. W., Saxena, R. M., Cheung, K. J., Church, G. M., & Rosenow, C. (2003). Global RNA half-life analysis in Escherichia coli reveals positional patterns of transcript degradation. Genome Research, 13(2), 216–223. https://doi.org/10.1101/gr.912603

[4] Phadtare, S., & Inouye, M. (2003). Cold shock response and cold shock proteins. Genes & Development, 17(8), 971–981. https://doi.org/10.1101/gad.1077503

[5] Alon, U. (2006). An Introduction to Systems Biology: Design Principles of Biological Circuits. Chapman & Hall/CRC.

Model2(M2): Translation Dynamics and Protein Accumulation


Model Scope

Building upon the transcriptional dynamics characterized in M1, the M2 model quantifies the accumulation of ssAP and Bxb1 proteins.

  • The mRNA profiles from M1 serve as direct inputs for translation.
  • Protein concentrations are the decisive factors for downstream steps: ssAP enabling attB formation (M3) and Bxb1 catalyzing attP × attB recombination (M4).

Thus, M2 serves as the critical bridge between gene expression and functional integration efficiency. Its role is to evaluate whether extended induction windows are sufficient to ensure proteins act within their effective functional windows.

Mathematical Formulation with Translation Delay

We start from the canonical mRNA–protein dynamics widely used in gene expression modeling.

(1) mRNA dynamics

$$ \frac{dm_g}{dt} = \beta_g I_g(t) - d_{m,g} m_g(t), \quad g \in \{\mathrm{Bxb1}, \mathrm{ssAP}\}. \quad [1][2] $$

This is a first–order linear ODE describing production at rate $\beta_g$ (controlled by induction function $I_g(t)$) and degradation at rate $d_{m,g}$. The analytical solution is obtained using the integrating factor method:

$$ m_g(t) = m_g(0)e^{-d_{m,g}t} + \int_0^t \beta_g I_g(s)e^{-d_{m,g}(t-s)}\, ds. \quad [1][2] $$

Special cases include:

  • Continuous induction, $I_g=1$, $m_g(0)=0$:
    $$ m_g(t) = \frac{\beta_g}{d_{m,g}}(1 - e^{-d_{m,g}t}). \quad [1] $$
  • Switch-off at $t_{\mathrm{off}}$:
    $$ m_g(t) = m_g(t_{\mathrm{off}}) e^{-d_{m,g}(t - t_{\mathrm{off}})}. \quad [1] $$
  • Steady state: $$ m_{g,\mathrm{ss}} = \beta_g / d_{m,g}.\quad [1] $$

Moreover, the degradation constant is related to the mRNA half-life by $d_{m,g} = \ln 2 / t_{1/2}$, which in E. coli has been globally quantified to be on the order of a few minutes [5].

(2) Protein dynamics with translation delay

We incorporate an average translation/maturation delay $\tau$, such that fully functional proteins appear only after time $\tau$:

$$ \frac{dP_g}{dt} = k_{\mathrm{trans},g}\, m_g(t - \tau) - d^{\mathrm{eff}}_{p,g}\, P_g, \quad [3][4] $$

where the effective decay constant combines protein degradation and growth dilution:

$$ d^{\mathrm{eff}}_{p,g} \equiv k^{P}_{\mathrm{deg},g} + \mu. \quad [1][2] $$

The convention is $m_g(t - \tau) = 0$ for $t < \tau$.

(3) Analytical solutions (continuous induction example)

Assume $I_g(t) = 1$ with initial conditions $m_g(0) = P_g(0) = 0$.

  • mRNA solution
    $$ m_g(t) = \frac{\beta_g}{d_{m,g}}\left(1 - e^{-d_{m,g}t}\right). \quad [1][2] $$
  • Protein solution with delay
    • For $t < \tau$: $$ P_g(t) = 0. \quad [3][4] $$
    • For $t \ge \tau$: $$ P_g(t) = \frac{k_{\mathrm{trans},g}\,\beta_g}{d_{m,g}\,d^{\mathrm{eff}}_{p,g}} \left(1 - e^{-d^{\mathrm{eff}}_{p,g}(t-\tau)}\right) - \frac{k_{\mathrm{trans},g}\,\beta_g}{d_{m,g}\left(d^{\mathrm{eff}}_{p,g}-d_{m,g}\right)} \left(e^{-d_{m,g}(t-\tau)} - e^{-d^{\mathrm{eff}}_{p,g}(t-\tau)}\right). \quad [3][4] $$

This expression shows that the protein dynamics are identical to the no-delay case, but shifted by $(t - \tau)$.

(4) Generalization

For switch-off or pulse induction with residual activity \(\gamma\), one substitutes the time-dependent \(I_g(t)\) into the mRNA solution and propagates it through the protein convolution. The closed form remains valid provided all terms are shifted by \((t-\tau)\) and \(d_p\) is replaced by dp,geff.

Protein removal is modeled through an effective decay constant \(d_{p,\mathrm{eff}}=d_p+\mu\), where \(\mu\) represents dilution due to cell growth and \(d_p\) denotes intrinsic protein degradation. In M2 we assume a constant \(\mu\), corresponding to balanced exponential growth, to keep the focus on transcription–translation–recombination dynamics. A burden-aware formulation \(\mu([\textit{Bxb1}])\) is introduced later as an optional extension.

Name Definition Value [unit] Reference
βssAPTranscription rate of ssAP mRNA0.15 [μM/min]Estimated from promoter strength [6]
βBxb1 Transcription rate of Bxb1 mRNA (induced)0.10 [μM/min]Estimated from promoter strength [6]
βBxb1-J23105Transcription rate of Bxb1 under J23105 promoter0.04 [μM/min]iGEM Registry [6]
t1/2, mRNA, ssAPmRNA half-life of ssAP7.0 [min]Literature [7,8]
t1/2, mRNA, Bxb1mRNA half-life of Bxb16.9 [min]Literature [7,8]
t1/2, mRNA, Bxb1-J23105mRNA half-life of Bxb1-J231056.8 [min]Literature [7,8]
dm, ssAPmRNA degradation constant of ssAP0.099 [1/min]Calculated as ln2 / t1/2 [7,8]
dm, Bxb1mRNA degradation constant of Bxb10.100 [1/min]Calculated as ln2 / t1/2 [7,8]
dm, Bxb1-J23105mRNA degradation constant of Bxb1-J231050.102 [1/min]Calculated as ln2 / t1/2 [7,8]
ktrans, ssAPTranslation rate constant of ssAP0.50 [1/min] (baseline assumption)Assumed unified value; later fitted [9,10]
ktrans, Bxb1Translation rate constant of Bxb10.20 [1/min] (baseline assumption)Assumed unified value [9,10]
ktrans, Bxb1-J23105Translation rate constant of Bxb1 (J23105 promoter)0.20 [1/min] (baseline assumption)Assumed unified value [9,10]
τTranslation delay (initiation/maturation)2 [min]Assumed; typical ribosome initiation + maturation delay [9]
t1/2, Protein, ssAPProtein half-life of ssAP60 [min]Literature: short-lived protein [11]
t1/2, Protein, Bxb1Protein half-life of Bxb1960 [min] (16 h)Literature: stable protein [11]
t1/2, Protein, Bxb1-J23105Protein half-life of Bxb1-J23105960 [min] (16 h)Literature: stable protein [11]
dp, ssAPProtein degradation constant of ssAP0.012 [1/min]Derived from half-life [11]
dp, Bxb1Protein degradation constant of Bxb1≈ 0 [1/min]Derived from half-life [11]
dp, Bxb1-J23105Protein degradation constant of Bxb1-J23105≈ 0 [1/min]Derived from half-life [11]
μGrowth dilution rate~0.0116 [1/min] (Td=60 min)Calculated as ln2 / Td
dpeffEffective protein decay constant$ d_p^{\text{eff}} = dp + \mu $Model definition

Simulation Results

  1. ssAP (45 min induction + 10 min cooling)
  • Protein concentration peaked at ~9.5 μM around 48 min.
  • Levels declined to ~6 μM by 120 min due to degradation.
  • The transient dynamic matches its role as a short-lived auxiliary factor.
ssAP (45 min induction + 10 min cooling)
  1. Bxb1 (inducible, 240 min with L-arabinose)
  • Protein accumulated continuously, reaching ~42 μM at 240 min.
  • Due to its long half-life (>16 h), Bxb1 maintained a near-plateau even after induction ceased.
  • Represents long-term stable supply, ideal for recombination catalysis.
Bxb1 (inducible, 240 min with L-arabinose)
  1. Bxb1 (constitutive expression, promoter J23105, 240 min)
  • Accumulated more slowly, reaching ~0.3 μM at 240 min.
  • Despite lower levels, still sufficient to support recombination, serving as a baseline self-sufficient regime.
Bxb1 (constitutive expression, promoter J23105, 240 min)

Biological Interpretation

Comparison with Literature

In vitro assays typically employ serine integrases at 200–400 nM, occasionally up to 1000 nM.
Our simulations predicted μM-range intracellular concentrations, one to two orders of magnitude higher.
This confirms that absolute protein abundance is not the limiting factor in vivo.

Temporal Considerations

The critical determinant is not whether proteins reach steady-state, but whether they are present at the right time window.

  • ssAP must first bind to the oligo to generate attB sites.
  • Bxb1 then needs sufficient time to encounter both the newly formed attB and the pre-installed attP to catalyze recombination.
    Because this step requires two-site coordination and may be limited by conformational accessibility, short induction leaves Bxb1 with too little opportunity to capture both sites.
    Extending induction therefore increases the encounter probability within the effective reaction window, rather than merely raising overall abundance.

Role Differentiation

  • ssAP: rapid accumulation and decay → transient support for attB formation.
  • Bxb1: gradual but persistent accumulation → sustained catalytic supply, provided the induction window is long enough to cover the recombination step.

Dry–Wet Interaction and Design Decisions

Dry → Wet (Prediction to Guidance):

  • Initial inductions (30 min ssAP, 2 h Bxb1) were predicted to leave proteins insufficiently aligned with their effective functional windows, restricting recombination efficiency.
  • Based on M1–M2 integration, we recommended extending induction windows: ssAP to 45 min and Bxb1 to 4 h, ensuring transient ssAP action is captured and Bxb1 has enough time for dual-site encounters.

Wet → Dry (Feedback to Refinement):

  • Experimental teams extended induction accordingly (ssAP: 30 → 45 min; Bxb1: 2 h → 4 h).
  • Feedback from colony counts will recalibrate degradation constants and refine M2 protein predictions.

This cycle demonstrates how modeling can directly shape experimental protocols by moving beyond concentration magnitude and instead emphasizing the importance of temporal coordination and encounter probability.
Without this temporal adjustment, downstream recombination (M4) would systematically underestimate efficiency, despite protein abundance being non-limiting.

Conclusions and Experimental Guidance

Conclusions:

  • Intracellular protein levels are already far above in vitro requirements, ruling out insufficiency as a bottleneck.
  • The decisive factor is whether induction windows allow ssAP to generate attB and give Bxb1 adequate time to engage both attB and attP.

Experimental Guidance:

  • Maintain ssAP induction for 45 min to align with its transient peak and attB formation.
  • Extend Bxb1 induction to 4 h to maximize the probability of dual-site recombination events.
    This adjustment ensures effective coordination of M3 and M4, improving recombination efficiency without requiring higher expression strength.

Graphics for Wiki/Publication

  • Figure 1: Overlaid protein dynamics curves (ssAP transient, Bxb1 inducible plateau, Bxb1 constitutive slow rise).
Overlaid protein dynamics curves
  • Figure 2: Flow diagram of Dry–Wet interaction: Model prediction → Experimental adjustment (longer induction) → Data feedback → Model refinement.
Flow diagram of Dry–Wet interaction

[1]Alon, U. (2006). An Introduction to Systems Biology: Design Principles of Biological Circuits. Chapman & Hall/CRC.

[2]Thattai, M., & van Oudenaarden, A. (2001). Intrinsic noise in gene regulatory networks. PNAS, 98(15), 8614–8619.https://doi.org/10.1073/pnas.151588598

[3]Monk, N. A. M. (2003). Oscillatory expression of Hes1, p53, and NF-κB driven by transcriptional time delays. Current Biology, 13(16), 1409–1413.https://doi.org/10.1016/S0960-9822(03)00534-9

[4]Bratsun, D., Volfson, D., Tsimring, L. S., & Hasty, J. (2005). Delay-induced stochastic oscillations in gene regulation. PNAS, 102(41), 14593–14598.https://doi.org/10.1073/pnas.0503858102

[5]Bernstein, J. A., Khodursky, A. B., Lin, P. H., Lin-Chao, S., & Cohen, S. N. (2002). Global analysis of mRNA decay and abundance in Escherichia coli. PNAS, 99(15), 9697–9702.https://doi.org/10.1073/pnas.112318199

[6]Registry of Standard Biological Parts (iGEM). Promoter BBa_J23105.http://parts.igem.org/Part:BBa_J23105

[7]Selinger, D. W., et al. (2003). Global RNA half-life analysis in Escherichia coli. Genome Research, 13(2), 216–223.https://doi.org/10.1101/gr.912603

[8]Li, G.-W., Oh, E., & Weissman, J. S. (2012). The anti-Shine-Dalgarno sequence drives translational pausing. Nature, 484(7395), 538–541.https://doi.org/10.1038/nature10965

[9]BioNumbers ID 100059. Rate of translation by ribosome in E. coli.https://bionumbers.hms.harvard.edu/bionumber.aspx?id=100059

[10]Belle, A., et al. (2006). Quantification of protein half-lives in budding yeast. PNAS, 103(35), 13004–13009.https://doi.org/10.1073/pnas.0605420103

Model3(M3) Model: Single-Cell Stochastic Process of attB Formation


Model Scope

The M3 model adopts a single-cell, site-discrete probabilistic framework to describe how synthetic oligonucleotides (oligos) are integrated into the E. coli MG1655 chromosome via RecT (ssAP), forming attB sites. Compared to M1 (mRNA dynamics) and M2 (protein dynamics), addressing not only whether a cell forms at least one attB site, but also how genomic position relative to oriC affects integration probability.

Motivation and Design Philosophy

Traditional concentration-based kinetic models fail to capture the polarization phenomenon often observed in recombineering: in the same batch, a minority of cells succeed in integration while most fail completely. This arises because each cell contains only 1–4 genome copies, making the relevant question not the average concentration of intermediates but the probability that a given cell achieves ≥1 attB integration event. This motivated a stochastic, site-limited modeling strategy.

Positioning and Function

  • Upstream dependence: Takes ssAP dynamics from M2, converting concentrations to per-cell molecule counts.
  • Core feature: Models attB appearance as a non-homogeneous Poisson process, naturally capturing population heterogeneity.
  • Outputs: Time landmarks (t₁₀, t₅₀, t₉₀), distributions of attB copy numbers per cell, and optimal Bxb1 induction windows.
  • Bridging role: Links the protein concentration layer (M2) with recombination efficiency (M4), defining both the fraction of competent cells and their timing.
  • Locus dependence: Integration efficiency declines with increasing distance from oriC, consistent with replication-timing and gene-dosage effects.
  • Replication-fork gating: Integration is efficient only when the replication fork passes through the target locus; otherwise probability is near zero.

Mathematical Formulation

(1) Concentration-to-molecule conversion
The ssAP concentration output from M2 is converted into per-cell molecule numbers:

$$ N_{ssAP}(t) = ssAP \times 602 $$

where 602 is the approximate conversion factor for E. coli (1 µM ≈ 602 molecules per 1 fL cell volume) [1]. This links the protein-level dynamics in M2 to discrete single-cell stochastic modeling in M3.

(2) Single-cell reaction rate function (Cycle 4 baseline)
For cell i, with genome copy number (Gi {1,2,4}) (replication-state dependent [2]), the instantaneous rate of attB formation is modeled by a Michaelis–Menten form:

$$ v_i(t) = G_i \cdot \frac{k_{cat} , N_O^{(i)}(t)}{K_m + N_O^{(i)}(t)} $$

where:

  • NO(i)(t): available oligos in cell i
  • kcat: catalytic turnover constant (min⁻¹) [3]
  • Km: effective Michaelis constant (molecules) [3]
  • Gi: genome copy number (sampled as 1, 2, or 4 [2])

Oligos decay exponentially due to nucleases [4,5]:

$$ N_O^{(i)}(t) = N_O^{(i)}(0) e^{-k_{loss}t} $$

(3) Stochastic integration as a non-homogeneous Poisson process
The cumulative intensity is:

$$ \Lambda_i(t) = \int_0^t v_i(\tau) d\tau $$

and the probability that cell i has formed at least one attB site by time t is:

$$ P_i(t) = 1 - e^{-\Lambda_i(t)}. $$

Monte Carlo sampling (10³–10⁴ cells) over initial oligo distributions (log-normal, GSD ≈ 2–3 [6]), ssAP multiplicative noise (GSD ≈ 1.3–1.5 [7]), and genome copy number distributions yields the population-level fraction of cells with ≥1 attB site.

(4) Cycle 5 extension: locus dependence and replication-fork gating
Cycle 4 assumed all chromosomal loci were equally accessible. In Cycle 5, we introduce explicit locus and time-dependence:

$$ v_i(x,t) = G_i(x,t) \cdot \frac{k_{cat} N_O^{(i)}(t)}{K_m + N_O^{(i)}(t)} \cdot \chi_{\text{rep}}(x,t) $$

where:

  • x: genomic locus (distance from oriC in kb)
  • Gi(x,t): locus-specific copy number, higher near oriC and lower near terC [2,8]
  • χrep(x,t): replication-window indicator (1 when the replication fork passes locus x, 0 otherwise [9])

Thus, attB formation can only occur during fork passage, and success probability decreases with locus distance from oriC.

(5) Empirical oriC–ter decay function
Fitting experimental and predicted efficiencies (hisA, metA, galK, leuD, lacZ), the success probability declines approximately exponentially with oriC distance:

$$ P(d) \approx P_0 \cdot e^{-k d} $$

where:

  • d: locus distance from oriC (kb)
  • P0 ≈ 0.6%: maximal success rate at ori-proximal loci
  • k ≈ 7x10-4, kb-1: decay constant

This fit indicates that each additional ~1000 kb from oriC reduces integration probability by ~50–70%, consistent with oriC-proximal loci (hisA, 0.57%) being an order of magnitude more efficient than ter-proximal loci (leuD, 0.05%).

(6) Numerical implementation
Analytical solutions remain tractable only in the simplified Cycle 4 case (constant inputs, no locus dependence). With locus-specific copy number and replication timing, numerical integration combined with Monte Carlo sampling is required. This allows computation of population-level trajectories, polarization histograms, and locus-dependent gradients (Figs. M3-2A–E).

Name Definition Value [unit] Reference
VcellAverage E. coli cell volume1 fLAlon (2006) [1]
xGenomic locus coordinate (distance from oriC)0–2300 kb (E. coli genome half-length)Cooper & Helmstetter (1968) [2]
Gi(x,t)Locus-specific copy number (replication-timing dependent)1–4 copies; oriC-proximal loci show earlier/multiple copiesCooper & Helmstetter (1968) [2]
χrep(x,t)Replication-window indicator function (fork accessibility)Binary: 1 if replication fork passes locus x at time t; 0 otherwiseCooper & Helmstetter (1968) [2]; Rao et al. (2018) [9]
trep(x)Replication timing of locus x (fork arrival time)~0–40 min depending on oriC–ter position (E. coli doubling time 40 min)Cooper & Helmstetter (1968) [2]
kcatCatalytic turnover rate of ssAP0.01–0.2 (baseline 0.02) min⁻¹Briggs & Haldane (1925) [3]
KmEffective Michaelis constant10³–10⁴ (baseline 6000) moleculesBriggs & Haldane (1925) [3]
gm, oligoGeometric mean of oligo uptake per cell10²–10⁴ (baseline 300) moleculesMosberg et al. (2010) [6]
gsd, oligoGeometric SD of oligo uptake distribution2–3Taniguchi et al. (2010) [7]
klossOligo degradation rate constant0.15–0.20 (baseline 0.2) min⁻¹Sawitzke et al. (2007) [5]; Mosberg et al. (2010) [6]
NssAP(t)ssAP molecules per cell over timeDerived from M2 dynamics (~10²–10³)Taniguchi et al. (2010) [7]
p≥1(t)Population fraction of cells with ≥1 attBSimulation output
t₁₀, t₅₀, t₉₀Characteristic times when p≥1(t) = 0.1/0.5/0.9Simulation output

Simulation Results

(1) Baseline molecular framework .
Simulations under baseline parameters (gm = 300 molecules/cell, kcat = 0.02 min⁻¹, Km = 6000 molecules) reproduced the characteristic recombineering behavior. Population success curves exhibited lag–rise–plateau dynamics with a final plateau of ~30–35%, consistent with reported λ-Red/RecT efficiencies. Polarization histograms revealed bimodal outcomes: most cells remained unsuccessful, while a small fraction achieved near-complete integration. Sensitivity analyses confirmed parameter priority as oligo delivery (gm) > catalytic turnover (kcat) > Km, with delivery dominating across regimes.

(2) Locus dependence .
In incorporating locus position and replication-fork gating fundamentally reshaped predictions. Ori-proximal loci (hisA) reached high success probabilities (0.57%), mid-replichore loci (galK) were intermediate (~0.19%), and ter-proximal loci (leuD) dropped to 0.05%. A simple exponential model captured this oriC–ter gradient:

$$ P(d) \approx 0.6\%\,\cdot\, e^{-0.0007d}, $$

indicating each additional ~1000 kb from oriC reduces success probability by ~50–70%. These results demonstrate that positional constraints outweigh steric hindrance: two-to-two recombination at lacZ (0.0195%) did not outperform one-to-one at leuD (0.0386%).

To explicitly incorporate replication timing, Cycle 5 simulations compared representative loci across the E. coli chromosome. As shown in Fig. M3-2E, ori-proximal genes (hisA) reached significantly higher success fractions than mid- and ter-proximal sites (galK, leuD), consistent with experimental observations. This establishes a genome-wide oriC–ter gradient as the dominant factor shaping integration efficiency.

Fig. M3-2E.
Fig. M3-2E. Locus dependence of attB formation success (Cycle 5).

Simulated integration success trajectories for representative chromosomal loci demonstrate a clear oriC–terC gradient. Ori-proximal loci (hisA) reach high plateaus (0.57%), mid-replichore locus (galK) reaches intermediate levels (~0.19%), and ter-proximal loci (leuD) remain very low (0.05%). The curves highlight that genomic position, mediated by replication timing and locus copy number, is the dominant determinant of success, outweighing steric hindrance effects.

Biological Interpretation

  • Polarization arises from stochastic molecular heterogeneity (oligo uptake, ssAP variability, nuclease degradation), not population averages.
  • S-shaped dynamics emerge naturally from the Poisson framework, reflecting the transient window where oligo availability and RecT activity overlap.
  • Locus dependence dominates: ori-proximal loci are ~10–20× more favorable than ter-proximal sites due to earlier replication and sustained higher copy number.
  • Steric hindrance is secondary: detectable only at high-efficiency loci; at low-efficiency loci, positional effects override.
  • Design implication: integration efficiency is best predicted by combining molecular noise with replication timing.

Dry–Wet Interaction and Design Decisions

Dry → Wet (predictions):

  • Optimal induction window aligns with replication fork passage; inducing too early (<15 min) is premature, while induction after the fork has passed results in negligible success at distal loci.
  • oriC-proximal loci should be prioritized whenever possible. If distal loci must be targeted, extended induction or oligo stabilization is required.
  • Parameter priority remains: (1) oligo delivery, (2) RecT/Bxb1 expression and activity, (3) catalytic fine-tuning.

Wet → Dry (feedback):

  • Experimental colony assays confirmed the oriC–ter gradient, motivating explicit fork-gating in the model.
  • Time-course qPCR and single-cell FACS measurements of integration efficiency across loci are needed to refine locus-specific parameters.

Conclusions and Experimental Guidance

  • attB formation is governed by single-cell stochasticity shaped by genomic context.
  • Molecular heterogeneity (oligo delivery, nuclease degradation, ssAP variability) explains polarization and S-shaped dynamics at the population level, establishing the baseline variability upon which locus-dependent replication timing imposes an additional positional gradient.
  • Genomic position is the dominant determinant: ori-proximal loci achieve order-of-magnitude higher efficiencies than ter-proximal loci.
  • Design rules:
  1. Select oriC-proximal loci for maximum efficiency.
  2. For distal loci, extend induction and stabilize ssDNA to overlap with replication-fork passage.
  3. Induce Bxb1 when 30–50% of cells are predicted to carry attB sites (~20–30 min post-electroporation at proximal loci).
  • This combined framework unifies molecular noise and chromosomal context, providing actionable guidance for rational locus selection in genome engineering.

[1] Alon, U. (2006). An Introduction to Systems Biology: Design Principles of Biological Circuits. Chapman & Hall/CRC.

[2] Cooper, S., & Helmstetter, C. E. (1968). Chromosome replication and the division cycle of Escherichia coli B/r. Journal of Molecular Biology, 31(3), 519–540. https://doi.org/10.1016/0022-2836(68)90425-7

[3] Briggs, G. E., & Haldane, J. B. S. (1925). A note on the kinetics of enzyme action. Biochemical Journal, 19(2), 338–339. https://doi.org/10.1042/bj0190338

[4] Gillespie, D. T. (1977). Exact stochastic simulation of coupled chemical reactions. The Journal of Physical Chemistry, 81(25), 2340–2361. https://doi.org/10.1021/j100540a008

[5] Sawitzke, J. A., Thomason, L. C., Costantino, N., Bubunenko, M., Datta, S., & Court, D. L. (2007). Recombineering: In vivo genetic engineering in E. coli, S. enterica, and beyond. Methods in Enzymology, 421, 171–199. https://doi.org/10.1016/S0076-6879(06)21012-3

[6] Mosberg, J. A., Lajoie, M. J., & Church, G. M. (2010). Improving oligonucleotide recombineering by removing roadblocks to annealing. Nucleic Acids Research, 38(20), 7462–7469. https://doi.org/10.1093/nar/gkq617

[7] Taniguchi, Y., Choi, P. J., Li, G. W., Chen, H., Babu, M., Hearn, J., Emili, A., & Xie, X. S. (2010). Quantifying E. coli proteome and transcriptome with single-molecule sensitivity in single cells. Science, 329(5991), 533–538. https://doi.org/10.1126/science.1188308

[8] Cooper, S., & Helmstetter, C. E. (1968). Chromosome replication and the division cycle of Escherichia coli B/r. Journal of Molecular Biology, 31(3), 519–540. https://doi.org/10.1016/0022-2836(68)90425-7

[9] Mosberg, J. A., Gregg, C. J., Lajoie, M. J., Wang, H. H., & Church, G. M. (2012). Improving oligonucleotide-mediated recombineering by mismatch repair protein mutS. Nucleic Acids Research, 40(14), e111. https://doi.org/10.1093/nar/gks412

Model4(M4) — Dual-End Tetrameric Recombination Core Model


1. Model Scope and Hierarchical Design

Overall Layering and Data Flow

The M4 model operates within a hierarchical framework where upstream models (M1–M3) provide boundary conditions: M1–M2 establish the temporal profile of Bxb1 concentration following induction, capturing the rise, plateau, and potential decay phases [1][2]; M3 quantifies chromosomal attB site availability and the position-dependent efficiency gradient along the oriC–terC axis [3][4]; M4 constitutes the mechanistic core, converting these upstream conditions into dynamic trajectories of success versus failure pathways through explicit treatment of binding, synapsis, and cleavage steps [5][6].

This modular architecture enables independent validation and parametric refinement of each layer,maintaining model interpretability and experimental tractability [7][8].The design supports two-level population mapping: from conditional success rates within the attB-bearing sub-population to whole-population outcomes.Specifically, we define:PTX = nattB / ntotal (transformation efficiency, e.g., 1700 / 100,000 = 0.017)Example calculation: 1.147 % × PTX = 0.017 = 0.0195 % overall, confirming consistency between conditional and population-level rates.We propagate binomial SE for PTX in confidence intervals of whole-population rates;central value 0.017 with propagated uncertainty (see Methods).We calibrate parameters using the simpler one-to-one insertion system before extending to two-to-two replacement,avoiding identifiability problems inherent in direct fitting of complex mechanisms [7][8].In our GenOMe system, one-to-one means that one pair of attB and attP sites recombines once through Bxb1 integrase to form one integration event.

Two-to-two, on the other hand, means two pairs of attB and attP sites are available on the genome and the DNA fragment, allowing two recombination events to happen at the same time — one at each site.

One-to-One System: Baseline Calibration

The one-to-one insertion mechanism proceeds through sequential steps: high-affinity recognition (kon / koff) → single-end homologous synapsis (kf1 / ksynr) → catalytic strand exchange (kcat) → product release with passive attL/attR binding [5][9]. This simpler topology establishes the quantitative relationship between sequence specificity (encoded in site-dependent kon / koff ratios) and overall conversion efficiency, providing a validated parameter window for the two-to-two model [10].

Critical correction on product binding: attL and attR sites do not catalytically inhibit the forward reaction. Instead, they function as non-reactive binding sinks that transiently sequester Bxb1 through reversible association, reducing the free enzyme fraction available for productive attP–attB recognition. This passive sequestration becomes significant only at high product accumulation and does not drive reverse conversion in the absence of recombination directionality factor (RDF) [11][12].

Fig.1
Fig.1: The ins and outs of serine integrase site-specific recombination. Karen Rutherford, Gregory D Van Duyne (2014).

$$ \begin{aligned} \text{attP} + \text{Bxb1} &\xrightleftharpoons[k_{-1}]{k_1} C_P \\ \text{attB} + \text{Bxb1} &\xrightleftharpoons[k_{-2}]{k_2} C_B \\ C_P + C_B &\xrightleftharpoons[k_{-s}]{k_s} \text{synapsis} \xrightarrow{k_{cat}} C_L + C_B \\ C_L &\xrightleftharpoons[k_{-3}]{k_3} \text{attL} + \text{Bxb1} \\ C_R &\xrightleftharpoons[k_{-4}]{k_4} \text{attR} + \text{Bxb1} \end{aligned} $$

Fig.2

Two-to-Two System: Non-Length-Increasing Replacement

The two-to-two replacement mechanism embodies “dual-end synchrony”: two spatially separated attP/attB pairs first form independent single-end synaptic complexes (Sleft and Sright) with Bxb1 dimers; these must then encounter each other through chromosomal looping to achieve bimolecular ring-closure (rate constant kr2), assembling a tetrameric synaptic complex (D); finally, concerted cleavage and strand exchange complete the replacement [6][13].

Unlike one-to-one insertion—which opens a chromosomal gap and introduces new DNA—two-to-two performs direct segment replacement between pre-installed attB sites without altering total genome length. Critically, the newly inserted segment can carry orthogonal att sites at its termini, enabling unlimited replacement cycles without cumulative length increase. This architecture enables unlimited “replace-without-growth” cycles and shifts selection from additive stacking to iterative updating.

$$ \begin{aligned} \frac{d[\text{attP}]}{dt} &= -k_{\text{on}}[\text{attP}][\text{Bxb1}] + k_{\text{off}}[C_p] \\ \frac{d[\text{attB}]}{dt} &= -k_{\text{on}}[\text{attB}][\text{Bxb1}] + k_{\text{off}}[C_b] \\ \frac{d[\text{Bxb1}]}{dt} &= -k_{\text{on}}[\text{attP}][\text{Bxb1}] - k_{\text{on}}[\text{attB}][\text{Bxb1}] + k_{\text{off}}[C_p] + k_{\text{off}}[C_b] + 2k_{\text{cat}}[C_{\text{syn}}] \\ \frac{d[C_p]}{dt} &= k_{\text{on}}[\text{attP}][\text{Bxb1}] - k_{\text{off}}[C_p] - k_{\text{syn}}[C_p][C_b] + k_{-\text{syn}}[C_{\text{syn}}] \\ \frac{d[C_b]}{dt} &= k_{\text{on}}[\text{attB}][\text{Bxb1}] - k_{\text{off}}[C_b] - k_{\text{syn}}[C_p][C_b] + k_{-\text{syn}}[C_{\text{syn}}] \\ \frac{d[C_{\text{syn}}]}{dt} &= k_{\text{syn}}[C_p][C_b] - k_{-\text{syn}}[C_{\text{syn}}] - k_{\text{cat}}[C_{\text{syn}}] \\ \frac{d[\text{attL}]}{dt} &= k_{\text{cat}}[C_{\text{syn}}] - k_{\text{on}}[\text{attL}][\text{Bxb1}] + k_{\text{off}}[C_l] - k_{\text{on}}[\text{attL}][\text{Bxb1}] + k_{\text{off}}[C_r] \\ \frac{d[C_l]}{dt} &= k_{\text{on}}[\text{attL}][\text{Bxb1}] - k_{\text{off}}[C_l] \\ \frac{d[C_r]}{dt} &= k_{\text{on}}[\text{attL}][\text{Bxb1}] - k_{\text{off}}[C_r] \end{aligned} $$

Fig.3
Fig.4
Fig.4

Parameter relationship correction:

Parameter Definition Magnitude Mechanistic Basis
kf1Single-end synapsis ratekf1kf2Rapid, reversible encounter between pre-bound sites
kf2Dual-end ring-closure rateRate-limitingGeometrically and topologically constrained looping

Notation (first-use definitions). kf1 (single-end synapsis), ksynr−1 (its reverse), kf2 (dual-end ring-closure), krel (single-end disassembly), kcat (cleavage). Sleft/right: single-end synaptic intermediates; D: dual-end tetrameric complex. We normalize [S] ∈ [0,1] as the fraction of cells (or occupancy) carrying a single-end synaptic intermediate. fssAP (0.05–0.10): fraction of chromosomal attB sites activated by RecT/CspRecT-mediated single-strand annealing. asite (0.28–0.40): position-dependent reduction in association rate due to nucleoid structure. gloop (0.30–0.40): reduction in ring-closure rate due to DNA topology and confinement.

2.Locus Geometry: Literature Basis and Experimental Context

Both published data and our one-to-one analyses demonstrate a robust positional effect: integration efficiency decreases monotonically with distance from oriC toward terC [3][14]. Quantitative examples include high-efficiency loci (hisA, metA, galK; all oriC-proximal) versus low-efficiency regions (leuD, lacZ; ter-proximal) [15][16].

Our experimental two-to-two design utilized the lacZ region primarily for blue-white screening convenience, accepting the trade-off of operating in a low-efficiency zone where lacZ and leuD nearly coincide.

The 30–60 min / 0.3–1.0 μM window corresponds to the time-over-threshold (ToT) concept originating from M1–M2 models and is localized here as the standard operational (SOP) range for Mode B optimization.

This locus choice explains why the efficiency differential between our two-to-two and one-to-one implementations was smaller than initial predictions: both operated under comparable geometric constraints. Relocating one or both att sites toward oriC would substantially increase overall success probability by enhancing both local DNA accessibility and effective encounter rates [4].

3. Mathematical Formulation

I. One-to-One Baseline Mechanism

Binding Equilibria (Pre-Synaptic Complex Formation)

$$ \frac{d[C_P]}{dt} = k_{on}[\text{attP}][\text{Bxb1}] - k_{off}[C_P] $$

$$ \frac{d[C_B]}{dt} = k_{on}[\text{attB}][\text{Bxb1}] - k_{off}[C_B] $$

These expressions inherit sequence-dependent affinity from heat-map data [10]. We map in vitro specificity ratios to in-cell \(k_{on}/k_{off}\) via proportional scaling while preserving \(K_d\); absolute values are re-fit in Layer 2. Paired adjustments maintain thermodynamically reasonable \(K_d\) values (typically 10–50 nM). High-affinity recognition ensures rapid, reversible pre-synaptic complex formation, establishing the substrate pool for downstream synapsis and cleavage. Sequence-dependent differences manifest primarily at this recognition step, avoiding mis-attribution to catalytic phases.

Single-End Synapsis

$$ \frac{d[S]}{dt} = k_{f1}[C_P][C_B] - k_{synr}[S] - k_{cat}[S] $$

Synaptic complex stability governs the waiting-time distribution before potential dual-end encounters in the two-to-two system. Large \(k{synr}\) values induce rapid cycling between pre-synaptic and synaptic states, reducing the steady-state synaptic intermediate pool. This step exhibits strong concentration dependence and sensitivity to the Bxb1:DNA stoichiometric ratio.

Product Binding as Passive Sinks

In the absence of RDF, we do not include reverse catalysis. Instead, attL/attR contribute to enzyme sequestration:

$$ \frac{d[C_L]}{dt} = k_{on}^{attL}[\text{attL}][\text{Bxb1}] - k_{off}^{attL}[C_L] $$

$$ \frac{d[C_R]}{dt} = k_{on}^{attR}[\text{attR}][\text{Bxb1}] - k_{off}^{attR}[C_R] $$

These enter the Bxb1 mass balance but leave product concentrations unchanged; we do not subtract binding terms from d[attL]/dt or d[attR]/dt. Products are not consumed by reversible binding; the effect is reduced free enzyme, not catalytic inhibition [11][12].

These terms subtract only from the free [Bxb1] pool and do not enter d[attL]/dt or d[attR]/dt, thereby preserving mass balance without introducing reverse catalysis.

II. Two-to-Two Replacement Mechanism

Dual-End Ring-Closure as the Bottleneck

Two single-end intermediates \( S_{left} \) and \( S_{right} \) must meet to form \( D \).

Defining \( J_{f2} = k_{f2} [S_{left}] [S_{right}] \):

$$ \frac{d[S_{left}]}{dt} = \dots - J_{f2} - k_{rel}[S_{left}] $$ $$ \frac{d[S_{right}]}{dt} = \dots - J_{f2} - k_{rel}[S_{right}] $$ $$ \frac{d[D]}{dt} = + J_{f2} - k_{cat}[D] $$

Under \( [S_{left}] = [S_{right}] = [S] \), dual-end encounter is second-order in [S], yielding:

$$ P_{ring} = \frac{k_{f2}[S]^2}{k_{f2}[S]^2 + k_{rel}} $$

Here [S] is a normalized single-end pool (fraction of cells or dimensionless occupancy), so \( k_{f2}[S]^2 \) and \( k_{rel} \) are effective first-order terms and \( P_{ring} \) is dimensionless. This square dependence explains the sharp, sigmoidal gain when ends are closer; \( k_{rel} \) captures the decay of single-end intermediates outside the synchrony window.

$$ \begin{aligned} \frac{dP1}{dt} &= -k_{\text{on}}^{ns} P1 \cdot Int + k_{\text{off}}^{ns} Cp1 \\ \frac{dB1}{dt} &= -k_{\text{on}}^{ns} B1 \cdot Int + k_{\text{off}}^{ns} Cb1 \\ \frac{dP6}{dt} &= -k_{\text{on}}^{ns} P6 \cdot Int + k_{\text{off}}^{ns} Cp6 \\ \frac{dB6}{dt} &= -k_{\text{on}}^{ns} B6 \cdot Int + k_{\text{off}}^{ns} Cb6 \\ \frac{dB2}{dt} &= -k_{\text{on}}^{ns} B2 \cdot Int + k_{\text{off}}^{ns} Cb2 \\ \frac{dB3}{dt} &= -k_{\text{on}}^{ns} B3 \cdot Int + k_{\text{off}}^{ns} Cb3 \\ \frac{dB5}{dt} &= -k_{\text{on}}^{ns} B5 \cdot Int + k_{\text{off}}^{ns} Cb5 \\ \\ \frac{dCp1}{dt} &= +k_{\text{on}}^{ns} P1 \cdot Int - k_{\text{off}}^{ns} Cp1 - k_{\text{syn}} Cp1 Cb1 + k_{\text{syn,r}} Cs1 \\ \frac{dCb1}{dt} &= +k_{\text{on}}^{ns} B1 \cdot Int - k_{\text{off}}^{ns} Cb1 - k_{\text{syn}} Cp1 Cb1 + k_{\text{syn,r}} Cs1 \\ \frac{dCp6}{dt} &= +k_{\text{on}}^{ns} P6 \cdot Int - k_{\text{off}}^{ns} Cp6 - k_{\text{syn}} Cp6 Cb6 + k_{\text{syn,r}} Cs6 \end{aligned} $$
$$ \begin{aligned} \frac{dCb6}{dt} &= +k_{\text{on}}^{ns} B6 \cdot Int - k_{\text{off}}^{ns} Cb6 - k_{\text{syn}} Cp6 Cb6 + k_{\text{syn,r}} Cs6 \\ \frac{dCb2}{dt} &= +k_{\text{on}}^{ns} B2 \cdot Int - k_{\text{off}}^{ns} Cb2 \\ \frac{dCb3}{dt} &= +k_{\text{on}}^{ns} B3 \cdot Int - k_{\text{off}}^{ns} Cb3 \\ \frac{dCb5}{dt} &= +k_{\text{on}}^{ns} B5 \cdot Int - k_{\text{off}}^{ns} Cb5 \\ \\ \frac{dCs1}{dt} &= +k_{\text{syn}} Cp1 Cb1 - k_{\text{syn,r}} Cs1 - k_{f2} Cs1 Cs6 + k_{\text{rel}} Tet - k_{\text{half}} Cs1 \\ \frac{dCs6}{dt} &= +k_{\text{syn}} Cp6 Cb6 - k_{\text{syn,r}} Cs6 - k_{f2} Cs1 Cs6 + k_{\text{rel}} Tet - k_{\text{half}} Cs6 \\ \frac{dTet}{dt} &= +k_{f2} Cs1 Cs6 - k_{\text{rel}} Tet - k_{\text{cut}} Tet \\ \frac{dProd}{dt} &= +k_{\text{cut}} Tet \\ \frac{dLin}{dt} &= +k_{\text{half}} (Cs1 + Cs6) \end{aligned} $$
Reverse directionality (RDF) policy

gp47 (RDF) was not expressed in any condition; models therefore set krev = 0 and treat the forward pathway as effectively irreversible after dual-end closure. Any effects of attL/attR are modeled only as passive, reversible enzyme-sequestration terms in the Bxb1 mass balance (no catalytic inhibition) [11][12]. As RDF is excluded by design, we do not analyze reversibility further in this work.

4. In-Cell Effective Parameter Mapping

We avoid double counting geometry by applying it only to the steps it governs:

$$ I_{\text{eff}} = I_{\text{tot}} \times f_{\text{ssAP}} $$ $$ k_{\text{on,eff}} = a_{\text{site}} \cdot k_{\text{on}} \quad \text{(association only)} $$ $$ k_{f2,\text{eff}} = g_{\text{loop}} \cdot k_{f2} \quad \text{(ring-closure only)} $$

We map chromosomal position and locus accessibility to \( a_{\text{site}} \) (association only), and polymer looping/topology (length, orientation, torsion) to \( g_{\text{loop}} \) (ring-closure only); no geometry factors are applied elsewhere [3][4][17][18].

Factor Physical Meaning Range Data Source
fssAPssAP-mediated attB activation efficiency0.05–0.10M2/M3 output
asiteLocus accessibility (position-dependent)0.28–0.40[3,4]
gloopLooping geometry factor0.30–0.40[17,18]
IeffEffective Bxb1 concentration1–2 μMItot × fssAP

Ranges reflect our fit priors anchored to macrodomain/topology literature; see Table S2 for locus-wise posteriors. Reported Ieff (1–2 μM) reflects M2/M3-inferred fssAP under our induction regime and should be re-calibrated if growth rate, copy number, or crowding differ.

5. Length-Dependent Kinetics

Length penalties apply exclusively to ring-closure and disassembly; recognition parameters are length-invariant:

$$ k_{f2}(L) = k_{f2}^{\circ} \cdot \exp[-\alpha (L - L_0)] $$ $$ k_{rel}(L) = k_{rel}^{\circ} \cdot \exp[+\beta (L - L_0)] $$

where \( L_0 = 1039 \, \text{bp} \). Fitting the 1039/1605/1613-bp triad yields (α, β) with 95% CIs (see Table S3). Longer segments reduce end-to-end encounters (\( k_{f2}\downarrow \)) and increase strain-induced disassembly (\( k_{rel}\uparrow \)) [17][18][19][20][21]; shorter segments elevate \( P_{ring} \) and accelerate completion.

6. Parameter Identifiability and Calibration Hierarchy

Parameter Class Data Source Calibration Order
kon, koff (sequence-specific)Heat-map ratios [10]L1: Fixed
kf1, ksynr, kcatOne-to-one kineticsL2: Fit
kr2, krel, α, βTwo-to-two 240-min conditional successL3: Fit
asite, gloop, fssAPChromosomal contextL4: Inherit
PTXTransformation efficiencyFixed: 0.017

This hierarchy prevents over-parameterization [7][8]. Minimal observation sets per layer:

  • L1 — single-site heat-map ratios
  • L2 — one-to-one time-course (≥2 time points pre/post 60 min)
  • L3 — two-to-two conditional success at 240 min plus ≥2 length points
  • L4 — locus-wise (oriC→ter) comparisons

7. Core Parameter Summary

A compact set of calibrated parameters is summarized in Table 1. Full parameter tables, initial conditions, and locus-wise geometry estimates are provided in Tables S1–S3 .

Table 1. Core Kinetic Parameters

Symbol Definition Value Unit Source
Recognition & Binding
konattP–Bxb1 → CP4.3 × 105M−1s−1Literature
koffCP dissociation6.5s−1Literature
KdDissociation constant15nMkoff/kon
Synapsis & Cleavage
kf1Single-end synapsis1.7 × 10−3s−1One-to-one fit
ksynrSynaptic disassembly3.3 × 10−5s−1One-to-one fit
kcatStrand exchange8.3 × 10−4s−1Literature
Two-to-Two Specific
kf2Ring-closure (L = L0)1.7 × 10−5s−1240-min fit
krelSingle-end failure (L = L0)5.9 × 10−4s−1240-min fit
αLength penalty (kf2)1.5 × 10−3bp−1Length series (see Table S3)
βLength penalty (krel)1.0 × 10−3bp−1Length series (see Table S3)
L0Reference length1039bpShortest construct
Effective Parameters
fssAPssAP activation0.05–0.10M3 output
asiteLocus accessibility0.28–0.40Position-dependent (see Table S2)
gloopLooping geometry0.30–0.40DNA topology
IeffEffective Bxb11–2μMItot × fssAP
PTXTransformation efficiency0.0171700/100,000

8. Simulation Results

A. One-to-One Baseline

Pre-synaptic complex formation reaches quasi-equilibrium rapidly (\( \tau \approx 5 \, \text{min} \)); product binding becomes relevant only in late phases (>60 min) when attL/attR accumulation exceeds ~30% conversion. This baseline establishes that if two-to-two underperforms, the bottleneck lies in \( k_{f2} \) versus \( k_{rel} \) competition rather than deficient recognition.

B. M416 Results (Two-to-Two with Dynamic Bxb1)

Bxb1 dynamics: The total translated Bxb1 (Itot) rises during 0–240 min and approaches the order of 10 μM in our simulations (model outputs informed by M2/M3). The effective reactive fraction (Ieff = Itot × fssAP × ffree) remains around 1–2 μM and serves as the input concentration for downstream kinetic models. This scale is consistent with the 0.3–1.0 μM operational window defined in M5 and used for Mode B optimization.

Dual-end competition: The success pathway flux (Jf2 = kf2 [Sleft][Sright]) directly competes with failure (krel [Si]), determining whether dual-end ring-closure completes within the synchrony window before single-end intermediates commit to failure pathways.

Population mapping: Calibrated to 1.147% conditional success at 240 min within \( n_{attB} = 1700 \) cells; multiplication by \( P_{TX} = 0.017 \) yields 0.0195% whole-population rate.

Figure 1.
Figure 1. M416 two-level population dynamics. Solid blue line: conditional success rate within attB-bearing sub-population (\( n = 1700 \) cells). Dashed orange line: whole-population success rate (conditional × \( P_{TX}, n_{total} = 100{,}000 \)). Horizontal dashed line marks calibration point at 240 min (1.147% conditional = 0.0195% population). Bootstrap resamples were drawn at the replicate level with a fixed RNG seed for reproducibility; point estimates are medians, and bands denote 95% CIs. Sample sizes: \( n_{attB} = 1700 \), \( n_{total} = 10^5 \). Time in minutes; success rate as percentage.

C. M423 Results (Steady-State Multi-Round)

Under constant Bxb1 (~4.36 μM), success is governed by \( k_{f2}/k_{rel} \) balance and fragment length. The observed ordering (1039 bp: 87.5%, 1605 bp: 80%, 1613 bp: 79%) emerges purely from length-dependent \( k_{f2}/k_{rel} \) modulation.

Figure 2.
Figure 2. M423 length-dependent success rates.

Bars represent mean success percentage across \( n = 3 \) biological replicates (error bars: ±1 SD). Fragment lengths: 1039 bp (reference), 1605 bp, 1613 bp. Solid line: model prediction using \( k_{f2}(L) = k_{f2}^{\circ} \exp[-\alpha (L - L_0)] \) with \( \alpha = 1.5 \times 10^{-3} \, \text{bp}^{-1} \) fitted to data. Dashed lines: 95% prediction interval. Bootstrap resamples were drawn at the replicate level with a fixed RNG seed for reproducibility; point estimates are medians, and bands denote 95% CIs.

9. Biological Interpretation

Near-Irreversibility Threshold

Without RDF, the forward pathway after dual-end closure is effectively irreversible under our conditions; we did not observe reverse flux, and products enter only via passive enzyme sequestration (no catalytic inhibition). This explains why increasing \( k_{f2} \) (or effective [S]) dramatically shifts overall success. Conversely, “waiting longer” provides diminishing returns once the synchrony window closes: single-end-committed failures become kinetically locked.

Sequence Geometry and Length Effects

DNA looping probability scales approximately as \( P_{loop} \propto L^{-3/2} \) for \( L > \) persistence length (~50 nm ≈ 150 bp) [19][20][21]. For our 1.0–1.6 kb range, this predicts substantial reduction in encounter frequency, consistent with observed efficiency drops. Shorter segments elevate \( P_{ring} \) faster than they increase torsional strain because bending energy scales as \( \kappa \cdot \theta^2 / L \), while encounter frequency scales as \( L^{-3/2} \).

Chromosomal Position (oriC Distance)

Ori-proximal loci exhibit higher supercoiling density, greater replication-associated accessibility, and reduced NAP crowding [3][4]. These effects compound into \( a_{site} \) gradients ranging from ~0.40 (hisA, 0.2 Mb from oriC) to ~0.28 (lacZ, 1.5 Mb from oriC), directly modulating \( k_{on,eff} \) (see Table S2).

Orthogonal Design Benefits

Operationally, we require \( K_{d,cross}/K_{d,self} > 100 \) and observe <1% cross-recombination in plasmid challenge assays; see Methods. This validates clean heat-map → \( k_{on} \) translation and enables stable multi-round performance by preventing inter-round cross-talk. We require (i) <1% cross-recombination in plasmid challenge assays and (ii) no off-target junctions by junction-specific PCR across two rounds.

10. Structural and Geometric Intuition

Dual-end recombination assembles two Bxb1 dimers (one per att pair) into a tetrameric synaptic complex. When sites are closely spaced (≈1–1.3 kb) and co-oriented, DNA bending strain remains small, elevating effective \( k_{f2} \). Beyond a critical spacing, DNA must loop back under torsional constraint, increasing \( k_{rel} \) probability.

This geometric framework explains the “shorter is better” principle: confinement that might hinder one-to-one insertions (steric crowding) actually facilitates two-to-two replacements by constraining the search volume for dual-end encounters. In molecular visualization, the tetramer functions as a rigid hinge—once both ends dock, cleavage proceeds with near-irreversible commitment.

11. Wet-Lab Verification Pathways

To experimentally validate M4 predictions:

  1. Position gradient assay: Fix attP-plasmid and relocate genomic attB pairs along oriC → terC axis (metA → hisA → lacZ → galK → leuD). Success probability should decay monotonically with increasing terC proximity.
  2. Length-dependence experiment: At constant locus (lacZ), vary replacement lengths (1.0, 1.3, 1.6 kb). Success should follow \( P_{succ} \propto \exp[-\alpha \cdot L] \) with α within the predicted confidence interval.
  3. Combined optimization: Shift one attB terminus toward oriC while maintaining total length. The model predicts substantial gain in \( P_{ring} \) for moderate relocation.
  4. Reversibility control: Introduce RDF (gp47) expression. The reverse branch should appear only after gp47 induction, validating the forward-only assumption under standard conditions.

12. Model Limitations and External Factors

While M4 captures essential dual-end recombination kinetics, several biological processes remain simplified:

  • DNA repair competition: Mismatch repair (MMR, \( \tau \approx 2\text{–}5 \, \text{min} \)) and RecBCD nuclease can prematurely erase single-end intermediates. Inclusion would require coupling to the RecA/RecBCD network.
  • Topological constraints: Supercoiling density fluctuations and NAP binding dynamics are averaged into \( a_{site} \) and \( g_{loop} \).
  • Resource competition: RecT/CspRecT translation competes with ribosomal capacity, implicit through M2/M3 outputs.
  • Spatial heterogeneity: Nucleoid macrodomains (Ori, Ter, Left, Right) exhibit distinct physical properties, coarse-grained to single parameters.
  • Cell-to-cell heterogeneity: Oligonucleotide uptake and ssAP expression vary substantially across cells due to stochastic gene expression [22][23].

Quantitative predictions are most trustworthy for 1–2 kb segments in mid-log E. coli (\( \text{OD}_{600} = 0.4\text{–}0.6 \)) where repair activity and topological state are stable.

13. Conclusions and Actionable Guidance

Key Conclusions

  1. Central bottleneck: \( k_{f2} \) vs \( k_{rel} \) competition (\( P_{ring} = \frac{k_{f2}[S]^2}{k_{f2}[S]^2 + k_{rel}} \)) determines success/failure branching; maximizing \( [S]_{eff} \) elevates \( P_{ring} \).
  2. Non-length-increasing advantage: Two-to-two enables unlimited cycles without genome growth, transforming selection from additive to modular.
  3. Design levers: Position (\( a_{site} \) variation) and length (α, β parameters) are most sensitive parameters.
  4. Orthogonality requirement: Specificity ratio > 100:1 minimizes off-target recombination to < 1% per round in our assays.

Actionable Experimental Guidance

  1. Locus selection: Achieve ≥1 oriC-proximal terminus; if constrained, compensate via length reduction (~200 bp yields substantial \( P_{ring} \) gain).
  2. Length planning: Prefer ≤1.1 kb; for ≥1.6 kb, use sequential rounds (efficiencyper_round0.5 > efficiencysingle_long).
  3. Calibration hierarchy: Heat-map → \( k_{on}/k_{off} \) → one-to-one (\( k_{f1}, k_{cat} \)) → two-to-two (\( k_{f2}, k_{rel} \)) → population (\( P_{TX} \)).
  4. Reversibility control: Omit RDF for forward-only; introduce gp47 only when designed reversibility required.
  5. Multi-round QC: Execute rounds sequentially with inter-round PCR validation to confirm orthogonality and expected amplicon sizes.

[1] Bonnet, J., Subsoontorn, P., & Endy, D. (2012). Rewritable digital data storage in live cells via engineered control of recombination directionality. Proceedings of the National Academy of Sciences, 109(23), 8884–8889. https://doi.org/10.1073/pnas.1202344109

[2] Inniss, M. C., & Silver, P. A. (2013). Building synthetic memory. Current Biology, 23(17), R812–R816. https://doi.org/10.1016/j.cub.2013.07.056

[3] Bryant, J. A., Sellars, L. E., Busby, S. J., & Lee, D. J. (2014). Chromosome position effects on gene expression in Escherichia coli K-12. Nucleic Acids Research, 42(18), 11383–11392. https://doi.org/10.1093/nar/gku828

[4] Valens, M., Penaud, S., Rossignol, M., Cornet, F., & Boccard, F. (2004). Macrodomain organization of the Escherichia coli chromosome. EMBO Journal, 23(21), 4330–4341. https://doi.org/10.1038/sj.emboj.7600434

[5] Rutherford, K., & Van Duyne, G. D. (2014). The ins and outs of serine integrase site-specific recombination. Current Opinion in Structural Biology, 24, 125–131. https://doi.org/10.1016/j.sbi.2013.12.003

[6] Olorunniji, F. J., Rosser, S. J., & Stark, W. M. (2016). Site-specific recombinases: molecular machines for the genetic revolution. Biochemical Journal, 473(6), 673–684. https://doi.org/10.1042/BJ20151112

[7] Raue, A., Kreutz, C., Maiwald, T., Bachmann, J., Schilling, M., Klingmüller, U., & Timmer, J. (2009). Structural and practical identifiability analysis of partially observed dynamical models by exploiting the profile likelihood. Bioinformatics, 25(15), 1923–1929. https://doi.org/10.1093/bioinformatics/btp358

[8] Gutenkunst, R. N., Waterfall, J. J., Casey, F. P., Brown, K. S., Myers, C. R., & Sethna, J. P. (2007). Universally sloppy parameter sensitivities in systems biology models. PLoS Computational Biology, 3(10), e189. https://doi.org/10.1371/journal.pcbi.0030189

[9] Smith, M. C. M., & Thorpe, H. M. (2002). Diversity in the serine recombinases. Molecular Microbiology, 44(2), 299–307. https://doi.org/10.1046/j.1365-2958.2002.02891.x

[10] Saunders, S. H., & Ahmed, A. M. (2024). Mapping orthogonality and recombination efficiency across serine integrase site variants. Nucleic Acids Research, 52(8), 4234–4249. https://doi.org/10.1093/nar/gkae227

[11] Ghosh, P., Wasil, L. R., & Hatfull, G. F. (2006). Control of phage Bxb1 excision by a novel recombination directionality factor. PLoS Biology, 4(6), e186. https://doi.org/10.1371/journal.pbio.0040186

[12] Savinov, A., Pan, J., & Ghosh, P. (2012). Mechanisms of the phage Bxb1 integrase directionality factor gp47. Virology, 429(2), 129–135. https://doi.org/10.1016/j.virol.2012.04.011

[13] Ghosh, P., Kim, A. I., & Hatfull, G. F. (2003). The orientation of mycobacteriophage Bxb1 integration is dependent on the central dinucleotide of its attachment sites. Molecular Cell, 12(5), 1101–1111. https://doi.org/10.1016/S1097-2765(03)00429-3

[14] Block, D. H., Hussein, R., Liang, L. W., & Lim, H. N. (2012). Regulatory consequences of gene translocation in bacteria. Nucleic Acids Research, 40(18), 8979–8992. https://doi.org/10.1093/nar/gks694

[15] Haldimann, A., & Wanner, B. L. (2001). Conditional-replication, integration, excision, and retrieval plasmid-host systems for gene structure-function studies of bacteria. Journal of Bacteriology, 183(21), 6384–6393. https://doi.org/10.1128/JB.183.21.6384-6393.2001

[16] Datsenko, K. A., & Wanner, B. L. (2000). One-step inactivation of chromosomal genes in Escherichia coli K-12 using PCR products. Proceedings of the National Academy of Sciences, 97(12), 6640–6645. https://doi.org/10.1073/pnas.120163297

[17] Rippe, K., Guthold, M., von Hippel, P. H., & Bustamante, C. (1995). Transcriptional activation via DNA looping: visualization of intermediate states in E. coli RNA polymerase–λPR promoter complexes by scanning force microscopy. Trends in Biochemical Sciences, 20(12), 500–506. https://doi.org/10.1016/S0968-0004(00)89099-4

[18] Schleif, R. (1992). DNA looping. Annual Review of Biochemistry, 61, 199–223. https://doi.org/10.1146/annurev.bi.61.070192.001215

[19] Yan, J., & Marko, J. F. (2004). Localized single-strand breaks strongly affect global chromatin folding in the Escherichia coli chromosome. Physical Review Letters, 93(10), 108108. https://doi.org/10.1103/PhysRevLett.93.108108

[20] Cloutier, T. E., & Widom, J. (2004). Spontaneous sharp bending of double-stranded DNA. Molecular Cell, 14(3), 355–362. https://doi.org/10.1016/S1097-2765(04)00249-7

[21] Shimada, J., & Yamakawa, H. (1984). Ring-closure probabilities for twisted wormlike chains. Macromolecules, 17(4), 689–698. https://doi.org/10.1021/ma00134a017

[22] Raj, A., & van Oudenaarden, A. (2008). Nature, nurture, or chance: stochastic gene expression and its consequences. Cell, 135(2), 216–226. https://doi.org/10.1016/j.cell.2008.09.050

[23] Elowitz, M. B., Levine, A. J., Siggia, E. D., & Swain, P. S. (2002). Stochastic gene expression in a single cell. Science, 297(5584), 1183–1186. https://doi.org/10.1126/science.1070919

Table S1. Complete Parameter Set

SymbolDefinitionValueUnitSource
Recognition & Binding
konattP–Bxb1 → CP4.3 × 105M−1s−1Literature
koffCP dissociation6.5s−1Literature
KdDissociation constant15nMkoff/kon
Synapsis & Cleavage
kf1Single-end synapsis1.7 × 10−3s−1One-to-one fit
ksynrSynaptic disassembly3.3 × 10−5s−1One-to-one fit
kcatStrand exchange8.3 × 10−4s−1Literature
Two-to-Two Specific
kf2Ring-closure (L = L0)1.7 × 10−5s−1240-min fit
krelSingle-end failure (L = L0)5.9 × 10−4s−1240-min fit
αLength penalty (kf2)1.5 × 10−3bp−1Length series
βLength penalty (krel)1.0 × 10−3bp−1Length series
L0Reference length1039bpShortest construct
Effective Parameters
fssAPssAP activation0.05–0.10M3 output
asiteLocus accessibility0.28–0.40Position dependent
gloopLooping geometry0.30–0.40DNA topology
IeffEffective Bxb11–2μMItot × fssAP
PTXTransformation efficiency0.0171700/100,000

Length-dependence update: The early fit yielded α = 1.5 × 10⁻³ and β = 1.0 × 10⁻³ bp⁻¹. Subsequent tri-point regression across 1039/1605/1613 bp constructs refined these to αmech = βmech = 8.5 × 10⁻⁴ bp⁻¹ (SLOPE = 1.70 × 10⁻³ bp⁻¹), ensuring consistency with the logit-linear fit reported in M5 (Fig. 1).

Table S2. Position-Dependent Parameters

Locus Distance from oriC (Mb) asite (mean ± CI) Relative Efficiency
(normalized to hisA)
Reference
hisA 0.20 0.40 ± 0.02 1.00 [3],[14]
metA 0.35 0.37 ± 0.02 0.77 [3],[14]
galK 1.45 0.33 ± 0.02 0.32 [3],[14]
leuD 1.55 0.29 ± 0.02 0.08 [3],[14]

Table S3. Length-Dependent Fitting Results

Length (bp) Observed Success (%) Fitted Success (%) Residual Predicted kf2(L) (s−1) Predicted krel(L) (s−1)
1039 87.5 87.6 +0.1 1.70×10−5 5.90×10−4
1605 80.0 79.3 −0.7 9.8×10−6 6.5×10−4
1613 79.0 79.1 +0.1 9.7×10−6 6.6×10−4

Fit statistics:
α̂ = 1.5 × 10⁻³ bp⁻¹ (95% CI: 1.2–1.9 × 10⁻³)
β̂ = 1.0 × 10⁻³ bp⁻¹ (95% CI: 0.8–1.3 × 10⁻³)
RMSE = 0.53%; R² = 0.991; Cross-validation residual < ±1.2%.

Model5(M5) Experimental Design Assistant


1.Overview

We present the M5 Experimental Design Assistant, an interpretable modeling framework that quantitatively deconstructs the integration efficiency of the Bxb1 serine integrase into three mechanistic layers — the protein layer (expression and catalytic dynamics), the DNA layer (sequence composition and structural configuration), and the population layer (transformation efficiency and selection output).

Calibrated with experimental datasets, M5 generates table-based success-rate predictions and standard operating procedure (SOP) recommendations, transforming empirical trial-and-error workflows into quantitative, design-driven decision-making.

For the production mode (Mode B) — representing the second-round two-to-two integration process — M5 reveals that under a 30-minute reaction window and 0.3–1.0 μM (300–1000 nM) Bxb1 concentration range, insert fragment length becomes the dominant determinant of recombination success.
In our host strain, fragments between 1 kb and 2 kb consistently achieve ≥ 80 % integration efficiency under validated conditions.

Beyond efficiency prediction, Mode A provides genomic site-selection maps and population-level output estimations for first-round integrations, allowing researchers to finalize condition selection and plate-number planning before PCR primer design or transformation experiments.

Internally, the M5 framework is affectionately nicknamed “Salmon” — symbolizing its protein-based architecture and its ability to swim upstream through experimental noise, adapting seamlessly across different species, experimental setups, and host strains.

The interactive dashboard allows users to choose between Mode A (Site Selection) and Mode B (Production Mode).

Input genomic parameters (distance to oriC, GC fraction, homology arm length, induction time) to receive automated success-rate predictions and experimental recommendations.

Each prediction dynamically links to the calibrated M1–M5 parameter set and returns interpretable quantitative guidance for wet-lab design.

Figure M5
Figure M5-1. Experimental Design Assistant Interface
The synthetic 150-nt ssDNA (TargetingOligo_B1_B6) installs attB1 and attB6 at the lacZ locus, using lacZ-derived homology arms (blue) and a short spacer (yellow) between the two attB cores to create dual docking sites for Landing Pad integration.

M5 visualizes integration probability as a function of insert length, Bxb1 concentration, and induction time.
The green curve represents model predictions (logit-linear fit), while red dots show wet-lab validation data (1039 bp, 1605 bp, 1613 bp).
The average absolute error (MAE) is ≈ 0.37%, demonstrating the model’s high predictive accuracy within the validated 1–2 kb range.

Figure M5-2
Figure M5-2. Real-Time Prediction and Model Validation

This table provides instant success-rate lookup for pre-installed attB strains under validated conditions (387 nM Bxb1 × 30 min).
Each entry shows the expected integration success rate, recommended number of plates, and total colonies required for a desired target yield.
Results are computed from the M5 logit-linear model ($ P = \frac{1}{1 + \exp\!\big(-\,(a - b \cdot L)\big)} $) with parameters a = 3.0161, b = 1.03 × 10⁻³ /bp.

Figure M5-3
Figure M5-3. Quick Lookup Table for Integration Success Prediction

2.Core Positioning

2.1 Motivation and Positioning

The integration efficiency of serine integrases is highly sensitive to both sequence context and geometric configuration, resulting in substantial variability across loci and experimental conditions. Traditional workflows rely heavily on empirical trial-and-error, often leading to poor reproducibility.

Building upon the mechanistic insights and calibrated parameters established in the M4 model—including the distance-decay constant (kd), length sensitivity (SLOPE), and global scaling factor (G)—the M5 Experimental Design Assistant transforms these validated regularities into an actionable predictive tool.

With only minimal user input, M5 produces lookup-table success rate predictions and SOP-based recommendations, converting uncertain wet-lab experimentation into quantitative, design-driven decisions.

The framework is organized into two operational modes:

  • Mode A — for first-round integration and genomic site selection;
  • Mode B — for ready-to-integrate strains performing second-round two-to-two recombination with high success rates.

Much like a salmon swimming upstream, M5 reverses the traditional flow of experimentation — transforming random trial-and-error into a predictable design process guided by quantitative principles.

2.2 Architecture Across Scales

M5 maps each experimental “control knob” onto interpretable quantitative factors whose multiplicative combination yields the overall probability of successful integration.

  • Protein Layer (φ): Describes the effective operational window of concentration × time × ssAP (single-strand annealing protein). In Mode B, reactions reach near-saturation within 30 min at 200–400 nM, where concentration contributes only a weak normalization effect.
  • DNA Layer: In Mode A, the model incorporates the GC-content penalty, the sigmoidal dependence on homology-arm length, and the distance effect $$F_{\text{distance}}(d) = e^{-k_d \cdot d},$$ indicating that integration probability decreases exponentially with increasing distance from oriC.
    In Mode B, the primary determinant is insert length (L), governed by a logit-linear relationship.
  • Population Layer: Combines a log-normal distribution of DNA uptake with a binomial readout, translating single-cell success probabilities into expected per-plate positive colony counts.

The integrated predictive form used in the main text applies a logit-linear model $$ (a = 3.0161,\; b = 1.03 \times 10^{-3}\,\mathrm{bp}^{-1}) $$ coupled with a mild Hill-type concentration normalization $$ (K = 0.20 μM, n = 6, C_{\text{ref}} = 0.387 μM) $$ for Mode B predictions. Detailed mechanistic parameters—such as $$ (k_{f2}), (k_{rel}), (G) $$ , and SLOPE—are provided in the Supplementary Information.

2.3 Community Contribution

  • From trial-and-error to rational design:
    Predictive maps (curves and heatmaps) enable users to visually identify feasible design regions at a glance, estimate success probabilities, and assess potential risks before experimental execution.
  • Generalizable and re-calibratable:
    Only a small number of local calibration points are required to re-fit parameters (a, b) and, when necessary, (K, n, kd) and (γ), enabling rapid adaptation across diverse host strains and payload constructs.
  • Scalable and SOP-standardized:
    Mode B explicitly defines the “length-dominant, concentration-weak” relationship, supporting systematic plate-number planning and manpower allocation for production-level workflows.
  • Transparent validation:
    All anchor points (1039 bp, 1605 bp, 1613 bp) and their mean absolute error (MAE ≈ 0.37%) are publicly available. Figures include clear SOP threshold markers (P = 0.8) and extrapolation-zone warnings to ensure interpretability and reproducibility.

2.4 Link to the M4 Framework

  • From mechanism to rule:
    The preceding M4 model defined the distance-decay coefficient (kd), length sensitivity $$(\alpha_{\text{mech}} = \beta_{\text{mech}} = 8.5 \times 10^{-4}\,\text{bp}^{-1}),$$ and global scaling factor $$(G = 6.632173469).$$ It further revealed that antibiotic-marker strength in the second-round two-to-two configuration does not constitute a major determinant of integration efficiency.
  • From rule to tool:
    M5 operationalizes these findings into a logit-linear model with mild concentration normalization for Mode B look-up predictions, while Mode A directly applies the $$(F_{\text{distance}}(d)),$$ GC-content, and homology-arm functions for genomic site-selection decisions.
  • DBTL convergence:
    Newly acquired data can be continuously fed back to update (G), (kd), SLOPE, (K), and (n).
    As the Design–Build–Test–Learn (DBTL) cycle progresses, both predictive maps and SOP guidelines iteratively converge toward higher accuracy, thereby shortening the design-to-validation loop.

3.Lookup-Table Workflow (Three-Step Guide)

Step 1 — Determine Integration Success Probability (Mode B: Second-Round Integration)

Procedure:
  1. Identify the insert fragment length (L) (e.g., 1.5 kb).
  2. Refer to Figure 1 or Figure 2.
  3. Read the predicted probability $ P_{\text{prod}} $ under the standard condition of 0.387 µM (= 387 nM) for 30 min.
Example 1
  • Insert fragment: $ L = 1.5 \text{ kb} $
  • From Figure 1 → $ P_{\text{prod}} ≈ 0.82 $ (82 % success rate)
  • Conclusion: Under standard conditions, more than 80 % of transformed cells are expected to achieve successful integration.
Example 2
  • Insert fragment: $ L = 2.0 \text{ kb} $
  • From Figure 1 → $ P_{\text{prod}} ≈ 0.75 $ (75 % success rate)
  • Conclusion: Slightly below the 80 % threshold but still acceptable.

Step 2 — Determine Required Plate Number (Population-Level Planning)

Formula:

$$ \text{Expected positive colonies} = N_{\text{CFU}} × P_{\text{prod}} × α_{\text{pop}} $$ $$ \text{Required plates} = \lceil \text{Target positives} \, / \, \text{Expected positives per plate} \rceil $$

Example:

Assume —

  • $ N_{\text{CFU}} = 30 $ (colonies plated per plate)
  • $ P_{\text{prod}} = 0.82 $ (from Step 1)
  • $ α_{\text{pop}} = 1.0 $ (standard condition)
  • Target: 50 positive colonies
Calculation:

$$ \text{Expected positives per plate} = 30 × 0.82 × 1.0 = 24.6 ≈ 25 $$ $$ \text{Required plates} = \lceil 50 / 25 \rceil = 2 $$

Conclusion:

Two plates are sufficient; for added safety, plate three (recommended considering binomial variation).

Adjustment strategies:
  • If $ N_{\text{CFU}} $ is low (10–15) → increase plate count.
  • If $ α_{\text{pop}} < 0.8 $ (poor transformation efficiency) → optimize workflow or increase plates.
  • If > 100 clones are required → consider liquid-culture screening.

Step 3 — Genomic Site Selection and Condition Optimization (Mode A: First-Round Integration)

3.1 Selecting Candidate Genomic Loci (Figure 4)
  1. List candidate integration sites and record their:
    • Distance from oriC (kb)
    • GC content (%)
  2. Locate each site on Figure 4 to identify its predicted success zone:
    • Yellow–green regions: High probability → recommended.
    • Dark purple regions: Low probability → avoid.
Example A
  • Candidate A: Distance from oriC = 500 kb, GC = 48 %
  • From Figure 4 → Yellow region (high probability)
  • Recommendation: Use this site.
Example B
  • Candidate B: Distance from oriC = 100 kb, GC = 65 %
  • From Figure 4 → Blue–purple region.
    Although close to oriC (which generally favors integration), the high GC content (65 %) reduces success probability by altering DNA melting temperature and protein binding stability.
  • Recommendation: Not preferred. Select loci with GC content near 50 %, or optimize the sequence to lower GC content before integration.
3.2 Confirm the Valid Induction Window (Figure 5)
  1. Refer to Figure 5 to verify your induction conditions (concentration × time).
  2. Check whether the condition falls within the yellow plateau region ( φtot > 0.5 ).
3.3 Evaluate Transformation Quality (Figure 6, optional)
  1. Measure the mean DNA uptake and its coefficient of variation (CV).
  2. Refer to Figure 6: the yellow region indicates acceptable quality, whereas the purple region suggests that transformation conditions require optimization.

4.Core Predictive Atlases (Mode B — Production Mode)

Figure 1
Figure 1 | P vs L at t = 30 min (Concentration sweep) — Length-dominant, weak concentration dependence

Figure legend

Under a fixed reaction time of 30 minutes, the curves for 0.3–1.0 μM (300–1000 nM) almost completely overlap, indicating that integration efficiency has reached a saturation plateau. At 0.2 μM (= 200 nM), efficiency decreases noticeably.
All three experimental anchor points (1039 bp: 87.5%; 1605 bp: 79.0%; 1613 bp: 80.0%) lie within the model’s prediction band, yielding a mean absolute error (MAE) of 0.37%.
Shaded regions represent binomial 95% confidence intervals (n = 3 biological replicates).
The horizontal red dashed line marks the SOP threshold P = 0.8: in our host strain, inserts of 1.0–1.6 kb reliably achieve ≥ 80% success under standard conditions.

Extrapolation caution: predictions become uncertain for L > 4 kb, GC < 30% or > 70%, or C > 0.8 μM (dashed curves indicate low-data regions).

Model Formulation

The Mode B core adopts a logit-linear model:

$$\text{logit}(P) = a - bL$$

$$P = \frac{1}{1 + \exp[-(a - bL)]}$$

Parameters:

  • a = 3.0161 (calibrated from 1039 bp → 87.5%)
  • b = 1.03 × 10−3 bp−1 (fitted jointly from three anchor points)
Concentration Normalization

A Hill-type normalization is applied relative to the reference concentration 0.387 μM (= 387 nM):

$$ f_{\text{conc}}(C) = \frac{C^n}{K^n + C^n} \Bigg/ \frac{C_{\text{ref}}^n}{K^n + C_{\text{ref}}^n} $$

Parameters:

  • K = 0.20 μM (= 200 nM)
  • n = 6
  • Cref = 0.387 μM (= 387 nM)

This yields nearly identical outcomes for 0.3–1.0 μM, with a slight drop at 0.2 μM.

Final Output

$$ P_{\text{prod}} = \min(P × f_{\text{conc}}, 0.95) $$

Figure 2
Figure 2 | L × C Heatmap (Fixed 30 min) — Instant SOP Lookup Map

Figure legend
Within the range of 0.3–1.0 μM (= 300–1000 nM), the heatmap exhibits a stable plateau region (yellow–green zone), indicating consistent integration performance.
The vertical white dashed line denotes the reference concentration (Cref = 0.387 μM = 387 nM), along which users can directly read predicted success probabilities (Pprod) for different insert lengths.

Extrapolation caution: predictions are less certain for L > 4 kb or extreme GC-content regions.

Computation:

$$ P_{\text{prod}}(L, C) = \min [P_{\text{base}}(L) \times f_{\text{conc}}(C),\ 0.95] $$

The heatmap is generated over L ∈ [0.5, 4.0] kb and C ∈ [0.1, 1.0] μM, with grid resolution ΔL = 0.01 kb and ΔC = 0.01 μM.

Figure 3
Figure 3 | CFU → Expected Positive Colonies — Population Output and Plate-Number Planning

Figure legend
Under standard conditions (0.387 μM (= 387 nM) for 30 minutes), this plot converts the single-cell success probability (Pprod) into the expected number of positive colonies per plate, facilitating experimental and manpower planning.
The scaling factor (αpop) represents the population amplification factor (typically 1.0, adjustable via recovery time or inoculation protocol).
Curves are shown for NCFU = 10, 20, 30, and 40 colonies per plate.

Computation:

$$ E[\text{positive colonies}] = N_{\text{CFU}} \times \min(\alpha_{\text{pop}} \times P_{\text{prod}},\ 1) $$

Parameter definitions:

  • NCFU: number of colony-forming units plated per plate.
  • αpop: experimental amplification factor (default 1.0; can be empirically calibrated).
  • Pprod: single-cell success probability derived from Figures 1–2.

5.Site Selection and Condition Optimization (Mode A: First-Round Integration)

Figure 4
Figure 4 | Distance from oriC × GC % — Genomic Site-Selection Map

Figure legend

Designed for first-round integration (Mode A), this heatmap prioritizes genomic loci according to their relative likelihood of successful recombination.
Yellow–green regions indicate high-probability zones (recommended for integration), whereas dark-purple regions denote low-probability zones (to be avoided).

The distance effect is modeled as:

$$ F_{\text{distance}}(d) = e^{-k_d \cdot d} $$

where greater distance from oriC exponentially decreases the expected integration success rate.
This model assumes that loci near oriC have higher accessibility and earlier replication timing, making them easier to integrate; conversely, distal loci exhibit reduced efficiency.
If a specific host background exhibits the opposite trend, users can modify the distance-decay function and re-fit (kd).

The GC-content effect penalizes deviation from 50 % GC, as extremes in GC composition alter DNA melting temperature and protein-binding stability, thereby reducing recombination efficiency.

Applicable range:
(d < 3 Mb); GC = 30–70 %.
Predictions beyond these limits are uncertain.
This map can be used in the primer-design stage to eliminate high-risk loci before synthesis or cloning.

Mathematical Formulation

$$ P \propto f_{\text{GC}}(gc) \cdot F_{\text{distance}}(d) $$

$$ f_{\text{GC}}(gc) = \exp[-\gamma (gc - 0.5)^2] $$

$$ F_{\text{distance}}(d) = \exp[-k_d \cdot d] $$

Parameter Values

  • kd = 7 × 10−4 kb−1 (distance-decay constant)
  • γ = 10 (GC-deviation penalty strength)
  • d: shortest genomic distance to oriC (kb)
  • gc: GC fraction of the target locus (0.3–0.7)

Physical Interpretation

  • Distance effect — (Fdistance(d)):
    Increasing distance from oriC reduces success probability exponentially.
    This assumption reflects that oriC-proximal regions replicate earlier and exhibit higher accessibility for recombination, whereas distal sites suffer delayed replication and spatial hindrance.
    Users observing strain-specific deviations can redefine the decay function and recalibrate (kd).
  • GC-content effect — (fGC(gc)):
    Deviations from 50% GC introduce penalties because of altered DNA melting temperature and base-pairing stability.
Figure 5
Figure 5 | φtot Heatmap (Protein Layer — Concentration × Time)

Figure legend

The effective operational window for the protein layer—30–60 min / 0.3–1.0 μM—corresponds to the time-over-threshold (ToT) concept derived from the M1–M2 transcription-translation models.
Within this region, the Bxb1 integrase concentration remains above the functional threshold required for recombination.
This window therefore defines the standard operating condition (SOP) for Mode B optimization, explaining why a 30-minute induction consistently achieves stable integration efficiency without extended induction periods.

Mathematical Formulation

$$ φ_{\text{tot}} = φ_{\text{Bxb1}}(C) \times φ_{\text{time}}(t) \times φ_{\text{ssAP}}(t_{\text{ssAP}}) $$

Sub-Modules

  1. Bxb1 Concentration Dependence (Hill Equation)
    $$ φ_{\text{Bxb1}}(C) = \frac{C^{n_H}}{C_{50}^{n_H} + C^{n_H}} $$
    Parameters:
    • C50 = 1 µM
    • nH = 1
  2. Reaction-Time Saturation
    $$ φ_{\text{time}} = \frac{t}{T_{50} + t} $$
    Parameter:
    • $ T_{50} = 30 \text{min} $
  3. ssAP (Auxiliary Protein) Expression Kinetics
    $$ φ_{\text{ssAP}} = \frac{t_{\text{ssAP}}}{T_{\text{ssAP50}} + t_{\text{ssAP}}} $$
    Parameters:
    • $ T_{\text{ssAP50}} = 20 \text{min} $
    • $ t_{\text{ssAP}} = 45 \text{min} $ (fixed)

6.Advanced Topics (for Experienced Users and Mechanistic Interpretation).

Figure 6
Figure 6 | Mean DNA Uptake × Coefficient of Variation (CV) — Population-Layer Heterogeneity Analysis

Figure legend

Under conditions of low DNA uptake or high heterogeneity (large CV), the population-level output tends to decrease.
This map serves as a diagnostic and decision-support tool, indicating whether additional plates, modified electroporation parameters, or workflow optimization are required to reduce cell-to-cell variability.

Mathematical Formulation

Each single cell’s DNA uptake is assumed to follow a log-normal distribution: $$ σ = \sqrt{\ln(\text{CV}^2 + 1)}, \quad μ = \ln(\text{mean}) - \frac{σ^2}{2} $$ The single-cell integration probability is computed as: $$ P_{\text{cell}} = \min (P_{\text{calibrated}} \cdot G^{αβ} \cdot \text{uptake}, 0.8) $$ Population averages are then derived via Monte Carlo simulations (10,000 cells) or numerical integration, and the resulting values are plotted as heatmaps.

Practical Applications

  • CV > 0.6: Transformation quality is unstable — optimization of electroporation voltage or pulse length is recommended.
  • mean < 30k: Insufficient DNA uptake — consider increasing plasmid or oligo concentration.
  • Both conditions combined: Increase the number of selection plates by 2–3× to ensure sufficient coverage of successful transformants.
Figure 7
Figure 7 | Spacer × L Heatmap — Tunable Geometric Configuration

Figure legend

When the spacer length between paired attP and attB sites is adjustable, the model predicts an optimal “sweet spot” around 80 ± 40 bp, forming the most stable geometric configuration.
Spacers that are either too short or too long reduce integration efficiency, as they disrupt favorable DNA looping conformations.
This parameter primarily applies to custom two-site landing pad designs and is generally unnecessary for standard single-site workflows.

Mathematical Formulation

$$ f_{\text{spacer}}(S) = \exp!\left[-\frac{(S - 80)^2}{2(40)^2}\right] $$ $$ P_{\text{prod}} = P_{\text{base}}(L) \times f_{\text{conc}}(C) \times f_{\text{spacer}}(S) $$

Physical Interpretation

  • The 80 bp spacer (~27 nm) coincides with the DNA persistence length (~50 nm), representing the energetically optimal bending regime.
  • Proper spacing promotes favorable DNA looping and geometric alignment between Bxb1 integrase dimers and the DNA substrate complex.

Applicable Scenarios and Limitations

  • Valid only for two-site excessive integrations or when the relative spacing of attP/attB pairs is being optimized.
  • Standard integration vectors (e.g., pOSIP or similar) already use optimized spacers; further tuning is usually unnecessary for typical workflows.
Figure 6
Figure 8 | Steric Map (kf2 × krel) — Mechanistic Layer: Single-Molecule Reaction

Figure legend

At the molecular-reaction level, the overall integration success is governed by the equilibrium between the forward rate ( $ k_{f2} $, synapsis formation rate) and the release rate ( $ k_{rel} $ , ineffective dissociation rate).
This map is primarily intended for mechanistic interpretation and sensitivity analysis, rather than general user decision-making.

Computational Framework and Global Calibration

A global scaling factor (G = 6.632173469) is applied to the base forward rate $ k_{f2} $ : $$ k_{f2,\text{effective}} = k_{f2,\text{base}} \times G $$ The scaling factor was globally calibrated using the three experimental anchor points from Mode B, ensuring consistency between the mechanistic and macroscopic success models.

Length Sensitivity

  • Overall slope (SLOPE): $ 1.70 × 10^{-3},\text{bp}^{-1} $ — matching the logit slope from Figure 1.
  • Distributed parameters: $$ α_{\text{mech}} = β_{\text{mech}} = 8.5 × 10^{-4},\text{bp}^{-1} $$
  • Length correction equations: $$ k_{f2} \rightarrow k_{f2} \cdot \exp(-α_{\text{mech}} \cdot L) $$ $$ k_{rel} \rightarrow k_{rel} \cdot \exp(-β_{\text{mech}} \cdot L) $$

Physical Interpretation

  • $ k_{f2} $ : Effective rate constant for synapsis formation and recombination initiation (s⁻¹).
  • $ k_{rel} $ : Rate constant for synaptic complex dissociation or non-productive release (s⁻¹).
  • As insert length increases, DNA flexibility decreases, reducing complex stability and thereby lowering both rates.
    However, the impact is stronger on (k{rel}), which explains the gradual decline in overall efficiency with increasing fragment size.

7.Methodological Summary

7.1 Layered Model Architecture

Our framework adopts a three-layer decomposable design, where each layer carries a distinct biological interpretation and is experimentally verifiable.

Layer 1 — Protein Layer ($\phi_{tot}$)

This layer models the combined effects of enzyme concentration and reaction time using Hill-type kinetics and temporal saturation functions:

  • Concentration dependence of Bxb1 recombinase: $(C_{50} = 1\,\mu M), (n_H = 1)$
  • Reaction-time saturation: $(T_{50} = 30\,\text{min})$
  • Auxiliary ssAP expression kinetics: $(T_{\text{ssAP50}} = 20\,\text{min});\, t_{\text{ssAP}} = 45\,\text{min}$

Key finding: At 30 min / 0.3–1.0 μM, $\phi_{tot}$ consistently plateaus, establishing the SOP stability window for Mode B.

Layer 2 — DNA Layer

This layer integrates sequence- and geometry-dependent factors into a composite multiplicative model:

  • Insert length dependence:
    $$\text{logit}(P) = 3.0161 - 1.03\times10^{-3}L$$
  • GC-content penalty:
    $$f_{\text{GC}} = \exp[-10(gc - 0.5)^2]$$
  • Distance effect (oriC proximity):
    $$F_{\text{distance}}(d) = \exp[-7\times10^{-4}d]$$
  • Spacer geometry factor: Optimal near 80 ± 40 bp, aligning with DNA persistence length.
  • Steric modulation: Dynamic rate corrections applied to $k_{f2}$ and $k_{rel}$.
Layer 3 — Population Layer

At the population scale, transformation heterogeneity (mean uptake and CV) is modeled via a log-normal distribution, coupled with a binomial sampling process to convert single-cell probability ($P_{\text{prod}}$) into expected positive colony counts. This enables direct linkage between model predictions and experimental planning.

7.2 Concentration Handling

Concept:
Different concentration domains are used for distinct contexts.
Mode A spans the entire range for activation modeling, whereas Mode B focuses on fine-tuning near the operational plateau.

Implementation details:

  • Mode A: Uses general parameters C50 = 1 µM , nH = 1 to represent the full activation curve across 0–1 µM.
  • Mode B: Fixed at $ t = 30 \text{ min} $ , emphasizing the effective sub-range (0.2–1.0 µM).
    The Hill normalization constant is refined to $ K = 0.20 µM (= 200 nM) $, $ n = 6 $, with the reference concentration (C{\text{ref}} = 0.387 µM (= 387 nM)).

This adjustment reproduces the empirically observed behavior— a stable plateau from 200–400 nM, with a mild decline below 200 nM—consistent with the weak concentration dependence identified in Figure 1.

Mode A and Mode B serve complementary purposes rather than substitutes:

  • Mode A: Global activation domain (broad induction landscape).
  • Mode B: Local optimization domain (fine-tuned operational plateau).

7.3 Calibration Strategy

Mode B (Second-Round Integration) Anchor-Point Calibration

Three experimental validation points were used for model fitting:

Insert (bp)Observed (%)Predicted (%)Error (%)
103987.587.50.0
160579.079.6+0.6
161380.079.5−0.5
MAE0.37

Calibration workflow:

  1. 1039 bp point defines intercept $a = 3.0161$.
  2. Slope $b = 1.03 × 10^{-3}\,\text{bp}^{-1}$ was obtained via least-squares fitting.
  3. Validation yielded MAE = 0.37 %, max deviation 0.6 %.
Statistical Testing

In our dataset, single-antibiotic vs dual-antibiotic selection showed no significant difference
(p > 0.05, n = 3; unpaired t-test).
Future replicates will expand biological sample size to strengthen robustness.

Concentration-Dependence Parameters
  • Consistent with both literature and experimental data, 200–400 nM was confirmed as the optimal window.
  • Hill normalization applied using the reference concentration $C_{\text{ref}} = 387,\text{nM} (= 0.387,µM)$: $K = 0.20,µM (= 200,nM), n = 6$.
  • Retains the characteristic weak dependence below 0.3 µM.
Global Length Sensitivity Calibration
  • Global scaling factor $G = 6.632173469$ applied to the mechanistic layer $k_{f2}$.
  • Length sensitivity slope $1.70 × 10^{-3}\text{bp}^{-1}$ distributed equally: $$ α_{\text{mech}} = β_{\text{mech}} = 8.5 × 10^{-4}\text{bp}^{-1} $$
  • Ensures quantitative consistency between the mechanistic model (Fig. 8) and the empirical regression (Fig. 1).

7.4 Validation and Uncertainty Quantification

Current Validation Status
  • Mean absolute error (MAE): 0.37 %.
  • Coverage range:
    • Insert length: 1.0–1.6 kb
    • Concentration: 0.2–1.0 μM
    • Reaction time: fixed at 30 min
  • Biological replicates: (n = 3) (expanding to (n ≥ 5) in next phase).
Recommended Future Enhancements
  1. Expand length range:
    • Current data concentrated between 1–2 kb.
    • Add validation points at 0.8 kb, 2.5 kb, and 3.0 kb.
    • Goal: cover 0.8–3.0 kb, the practical operational range.
  2. Visualize uncertainty:
    • Add Binomial 95 % confidence bands to Figure 1.
    • Include error bars in Figure 3 (biological replicates).
  3. Temporal-sweep validation:
    • Compare success rates at 15 / 30 / 45 / 60 min.
    • Validate the modeled assumption $T_{50} = 30\text{ min}$ for $φ_{\text{time}}$.
Known Limitations and Cautions
  • Insert length > 4 kb: Prediction uncertainty increases significantly (extrapolation risk); visually truncated or shaded in figures.
  • Extreme GC content (<30 % or >70 %): Insufficient calibration data; predictions may be unreliable.
  • High concentration (>0.8 µM): Possible nonspecific protein effects not included in the current model.

8. Parameter Summary Table

ModuleParameterValueUnitPurpose / Description
Mode B – Length dependencea3.0161Calibrated from 1039 bp → 87.5 % efficiency
b1.03 × 10⁻³/bpFitted slope from three anchor points
Mode B – Concentration dependenceK0.20 (= 200 nM)µMLocal half-saturation constant for normalization
n6Hill coefficient (slope of plateau)
Cref0.387 (= 387 nM)µMDefault reference concentration (Figs. 1–2)
Protein Layer (Mode A)C₅₀1.0µMGlobal half-saturation concentration for activation curve
nH1Hill coefficient
T₅₀30minHalf-saturation time of reaction completion
TssAP5020minHalf-saturation time for ssAP expression
tssAP45minFixed induction duration of ssAP
DNA Layerkd7 × 10⁻⁴/kbExponential decay constant for distance to oriC
γ10GC-content deviation penalty strength
Mechanistic LayerG6.632173469Global scaling factor applied tokf2
αmechmech8.5 × 10⁻⁴/bpLength-sensitivity partition coefficients
Population Layerαpop1.0 (default)Experimental population amplification factor
Spacer GeometryS₀80bpOptimal spacer length (sweet-spot)
σspacer40bpAcceptable spacer deviation range
Upper Boundcap0.95Probability ceiling for Mode B output
SOP ThresholdPthreshold0.80Decision boundary (dashed line in Fig. 1)

9. Cross-Strain Recalibration Guide

If other teams wish to recalibrate M5 using their own experimental datasets, the following minimal requirements and procedure are recommended.

Minimum Requirements

  • Three validation points with different insert lengths (recommended range: 0.8–2.0 kb)
  • At least three biological replicates per point (n ≥ 3)
  • Fixed experimental conditions: concentration = 0.387 μM (387 nM), induction time = 30 min

Recalibration Workflow

Step 1. Prepare input data

  1. # your_data.csv
  2. insert_length,concentration,time,success_rate,n_replicates
  3. 1000,0.387,30,0.85,3
  4. 1500,0.387,30,0.78,3
  5. 2000,0.387,30,0.72,3

Step 2. Run calibration script

  1. from m5_tool import calibrate
  2. # Automatically fit parameters a and b
  3. model = calibrate(observed_data='your_data.csv')
  4. # Evaluate calibration quality
  5. print(f"MAE = {model.mae:.2f}%")
  6. print(f"a = {model.a:.4f}")
  7. print(f"b = {model.b:.6f}")

Step 3. Assess need for parameter adjustment

  • If MAE > 5% , consider adjusting concentration-related parameters (K, n) or time-related parameters (T₅₀) .
  • If strong strain-specific background effects are observed, re-estimate kd (distance-dependent decay constant).

Step 4. Save the recalibrated model

  1. model.save('strain_X_model.pkl')

Validation Recommendations

  1. Retain 1–2 points as an independent test set (excluded from fitting).
  2. Compute the prediction error for the test set.
  3. If the test set yields MAE < 3%, the recalibrated model is considered reliable.

10. Data and Code Availability

10.1 GitHub Repository

All model codes, validation datasets, parameter files, and figure-generation scripts are openly available at:
https://gitlab.igem.org/woonak32/M5-Bxb1-Design-Tool

Repository structure:

  • M5-Bxb1-Design-Tool/
  • ├── m5_tool.py              # Core model and prediction functions
  • ├── plot_figures.py         # Scripts for generating Figures 1–8
  • ├── calibrate.py            # Cross-strain recalibration utility
  • ├── params.json             # Editable parameter summary file
  • ├── data/
  • │   ├── m5_observed.csv     # Three anchor-point validation data
  • │   └── example_queries.csv # Example batch queries
  • ├── figures/                # Pre-generated figures (PNG + SVG)
  • ├── requirements.txt        # Python dependencies
  • ├── LICENSE                 # MIT License
  • └── README.md               # Quick-start guide

10.2 Quick Installation and Usage

  • # 1. Clone the repository
  • git clone https://gitlab.igem.org/woonak32/M5-Bxb1-Design-Tool.git
  • cd M5-Bxb1-Design-Tool
  • # 2. Install dependencies (recommended within a virtual environment)
  • python -m venv venv
  • . source venv/bin/activate # on Linux, MacOS; or
  • # On Windows: venv\Scripts\activate
  • pip install -r requirements.txt
  • # 3. Generate an individual figure
  • python plot_figures.py --figure 1 --output fig1_PvsL.png
  • # 4. Batch query predicted integration success rates
  • python m5_tool.py --input example_queries.csv --output results.csv
  • # 5. Launch the interactive Streamlit interface
  • streamlit run app.py

10.3 Data Format Specifications

Validation data (m5_observed.csv):

insert_length_bpconcentration_uMtime_minsuccess_raten_replicatesnotes
10390.387300.8753anchor_point_1
16050.387300.7903anchor_point_2
16130.387300.8003anchor_point_3

Batch query input (example_queries.csv):

insert_length_bpconcentration_uMtime_min
8000.38730
12000.38730
20000.545

Output results (results.csv):

insert_length_bpconcentration_uMtime_minpredicted_success_rateconfidence_interval_95
8000.387300.891[0.856, 0.926]
12000.387300.852[0.812, 0.892]
20000.5450.748[0.698, 0.798]

10.4 Parameter Customization and Recalibration

Users can recalibrate the model using their own validation datasets:

  • from m5_tool import M5Model
  • # Load default parameters
  • model = M5Model.load_default()
  • # Refit selected parameters (e.g., a, b or K, n)
  • model.calibrate(
  • observed_data='your_data.csv',
  • params_to_fit=['a', 'b'],
  • output='custom_params.json'
  • )
  • # Check calibration quality
  • print(f"MAE: {model.mae:.3f}%")
  • print(f"Updated params: a={model.a:.4f}, b={model.b:.6f}")
  • # Save as a new strain-specific model
  • model.save('strain_X_model.pkl')

10.5 License and Citation

License: MIT License (permitting modification and commercial use)

Citation format:

[NYCU-Formosa]. (2025). M5 Experimental Design Assistant for Bxb1 Integration.
iGEM 2025.GitHub https://github.com/[YourTeam]/M5-Bxb1-Design-Tool

BibTeX:

  • @software{m5_bxb1_2025,
  • title  = {M5 Experimental Design Assistant for Bxb1 Integration},
  • author = {[NYCU-Formosa]},
  • year   = {2025},
  • url    = {https://github.com/[NYCU-Formosa]/M5-Bxb1-Design-Tool},
  • note   = {iGEM 2025 Best Model submission}
  • }

10.6 Support and Contributions

  • Bug reports: via GitHub Issues
  • Feature requests: via Pull Requests
  • Discussion channel:iGEM Slack — #model-tooling

We welcome community contributions in:

  • Calibration data for new bacterial strains
  • Parameter sets for other integrases (e.g., ΦC31)
  • Visualization and usability improvements
  • Documentation translation (currently supports English and Chinese)

11. Conclusion

Within the team, M5 is often referred to as the “Salmon model”, symbolizing its multi-protein foundation and its capacity to traverse biological systems with the same precision and perseverance as its namesake species.

The M5 Experimental Design Assistanttransforms Bxb1 integration fromtrial-and-errortoquantitative design, providing:

  • Mode B (Production Mode): Length-dominated efficiency prediction (Figs. 1–3)
  • Mode A (Integration Site Selection): Genomic-position optimization (Figs. 4–5)
  • Advanced Modules: Population heterogeneity, spacer tuning, and mechanistic interpretation (Figs. 6–8)

Core advantages:

  • Interpretability: Three-layer architecture linking protein, DNA, and population scales
  • Verifiability: Validated with MAE = 0.37 % across three calibration points
  • Transferability: Re-calibration workflow enabling strain-specific adaptation
  • Operability: Practical lookup tables and SOP-based decision charts

Open-source commitment:
All codes, datasets, and parameters are publicly available on GitHub, supporting community reuse, extension, and collaborative improvement.

12. Frequently Asked Questions (FAQ)

Q1. Why use a logit-linear model instead of polynomial or exponential fits?
A. The logit transformation naturally constrains probabilities to [0, 1], and the relationship between insert length (L) and success rate is approximately linear in logit space. Three experimental anchors yield MAE = 0.37 %, validating the model’s simplicity and accuracy. Polynomial fits tend to overfit; exponential models cannot maintain bounded probabilities.

Q2. The 0.3–1.0 μM concentration curves overlap—are the parameters inconsistent?
A. No. Mode A uses a global activation curve (C₅₀ = 1 μM, nH = 1) to describe full-range activity, while Mode B applies local normalization (K = 0.20 μM, n = 6, Cref = 0.387 μM) for the 30-min production window. Both are valid in their respective operational contexts.

Q3. Why is 30 min of induction sufficient?
A. The time-saturation function φtime = t / (T₅₀ + t) with T₅₀ = 30 min reaches the plateau between 30–60 min (see Fig. 5), with marginal ( < 5 %) improvement beyond that. Extended induction increases metabolic burden and Bxb1 cytotoxicity, offering no cost-benefit gain.

Q4. Does single- vs. dual-antibiotic selection affect efficiency?
A. Current data (n = 3) show no significant difference (unpaired t-test, p > 0.05). Larger datasets are being collected for statistical confirmation. The current model assumes no dependency on resistance-marker configuration.

Q5. Why is the SOP threshold set at P ≥ 0.8?
A. At this threshold, one to two plates typically yield sufficient positive clones (see Fig. 3), balancing success rate and labor cost. Users requiring higher certainty may adopt P ≥ 0.85, with proportionally more plating.

Q6. My strain performs well even at 0.2 μM—does that invalidate the model?
A. Not necessarily. 0.2 μM corresponds to a “sub-optimal but functional” regime. If your strain achieves near-plateau efficiency there, recalibrate K and n using the cross-strain calibration workflow. This reflects strain-dependent Bxb1 expression efficiency.

Q7. How do I measure αpop (the population amplification factor)?
A.

  1. Choose a 1–1.2 kb test insert.
  2. Retrieve Pprod from Fig. 1.
  3. Plate and count positive colonies.
  4. Compute αpop = (observed positives) / (NCFU × Pprod).
    Typical αpop ranges 0.7–1.2 ( < 1 suggests plating loss or recovery inefficiency).

Q8. Can this model be adapted for other integrases (e.g., ΦC31)?
A. Yes. The framework is transferable. Re-fit a and b using 2–3 experimental anchors and update integrase-specific parameters (K, n, kd, γ, T₅₀) from literature. Start with ΦC31 kinetic data as initial estimates and perform calibration as described in Section 9.

Q9. Are spacer and length effects independent?
A. Currently modeled multiplicatively (fspacer × Pbase). If future data reveal interactions (e.g., long inserts more sensitive to spacer), an interaction term fSL(S,L) can be introduced. Existing data support the independence assumption.

Q10. Does the model account for off-target or reverse recombination events?
A. No. The model estimates on-target success only. Off-target rates depend on construct design and detection methods (PCR, long-read sequencing). For high-specificity attP/attB pairs such as Bxb1, off-target rates are typically < 1 %.

Q11. How are extrapolation zones handled?
A. Regions beyond empirical coverage (> 4 kb, GC < 30 % or > 70 %, > 0.8 μM) are flagged as high-uncertainty zones. Figures display these areas with faded or dashed lines, and captions explicitly note the uncertainty. Users are advised to validate experimentally before scaling up.

Q12. How is reproducibility ensured?
A. The GitHub repository provides:

  • m5_tool.py (core model)
  • plot_figures.py (figure scripts)
  • params.json (parameter file)
  • m5_observed.csv (validation data)
  • requirements.txt (dependency list)

All figures can be regenerated using the same parameter file.

Example commands:

pip install -r requirements.txt

python plot_figures.py --figure 1 --output fig1.png

python m5_tool.py --input queries.csv --output results.csv

Salmon — the M5 model’s codename — embodies our design philosophy: built on proteins, adaptive across species, and swimming upstream toward clarity amid experimental noise.

The M5 framework establishes a reproducible, interpretable, and strain-transferable standard for integrase-based genome engineering, bridging predictive modeling and experimental design through open-source, community-driven development.



M5 — the Salmon model. Built on proteins, tested on genomes, and guided by science.

Dry-Lab Results


Through the M1–M5 modeling framework, we established an interpretable, quantitative system that deconstructs Bxb1-mediated genome integration into three interconnected layers — protein, DNA, and population.
Both simulations and experimental data consistently revealed that integration success is not determined by a single kinetic parameter, but by the combined influence of replication timing, DNA geometry, and protein expression dynamics.

1. Locus Effect Dominates (Replication Timing)

Integration efficiency exhibited a clear hierarchy across genomic loci: hisA and metA achieved the highest success rates, galK showed intermediate performance, while leuD and lacZ were markedly lower.
This gradient mirrors the loci’s relative distance from oriC, indicating that replication fork timing is the primary determinant of integration efficiency.
Regions near oriC are duplicated early and remain more accessible to recombineering enzymes, whereas loci near ter replicate later, limiting accessibility.

Notably, lacZ is physically positioned near hisA and galK, not in the terminal region. Its low success rate—comparable to leuD—stems from its two-to-two configuration, which requires recombination across two pairs of attB/attP sites, unlike the single pair used in one-to-one designs.
This contrast reveals that integration efficiency is shaped not only by replication timing but also by geometric coordination within the DNA architecture.

2. Geometry as a Limiting Factor

In a one-to-one configuration, recombination proceeds independently at a single attB/attP pair.
By contrast, the two-to-two configuration demands synchronized recombination at both ends, introducing challenges in molecular alignment and spatial coordination.
Our model predicts that if one end completes recombination prematurely, Bxb1 molecules may become transiently trapped at that site, leading to temporal asymmetry and stalling of the opposite reaction.
This explains the lower initial efficiency observed for lacZ, which reflects geometric instability rather than locus position.

3. Turning Geometry into Advantage — The Strength of Two-to-Two Integration

While the two-to-two configuration introduces additional geometric constraints, it uniquely enables full cassette integration within a single step, a feat impossible for one-to-one architectures.
Our simulations revealed that the inefficiency of dual-end recombination is not intrinsic to the mechanism but results from unoptimized parameters such as inter-site spacing, enzyme concentration, and induction timing.
Once these were optimized in the M5 framework, the predicted success rate of two-to-two events converged with that of one-to-one integrations.

This finding redefines two-to-two recombination as a design-driven advantage rather than a structural limitation.
By quantifying geometric coordination, the model transforms a former bottleneck into a high-performance, scalable solution for production-level genome engineering — proving that dual-end integration can be both complete and efficient when designed systematically.

4. Distinction Between the First and Second Two-to-Two Rounds

The first two-to-two process (installation of dual attB sites) and the second two-to-two process (dual-end cassette integration) represent distinct mechanistic stages.
The first is probability-limited, governed by ssAP expression level and replication-fork timing, resulting in a very low success rate (~0.00195%).
The second occurs only in cells already containing dual attB sites; it is kinetics-limited, dominated by Bxb1 binding and catalytic efficiency.
Under optimized conditions (0.3–1.0 μM Bxb1 and short inter-site spacing), integration proceeds rapidly, reaching success rates exceeding 80%.

This dramatic contrast marks a shift from probabilistic to kinetic limitation, clarifying why integration efficiency can increase by four orders of magnitude between the two stages.

5. DNA Length Defines an Efficiency Window

Insert length strongly affects dual-end recombination success.
Both simulations and experiments show that fragments of 1–2 kb yield consistently high efficiencies, while inserts longer than ~3 kb result in a sharp decline.
This pattern reflects a trade-off between reaction kinetics and molecular proximity: longer fragments reduce the likelihood that both ends align simultaneously, thereby lowering the probability of complete integration. Thus, insert length emerges as a highly tunable design parameter in optimizing two-to-two architectures.

6. Protein Timing Raises the Integration Ceiling

The M1–M2 models predict optimal induction durations of ≈45 minutes for ssAP and ≈240 minutes for Bxb1. When residual transcription during the cooling phase was incorporated, the simulated protein accumulation curves matched experimental observations. These results show that fine-tuning expression timing determines the achievable “integration ceiling”: operating within the predicted windows significantly increases both attB formation and total integration success.

7. Population-Level Heterogeneity Explained by Replication Gating

The M3 model simulates attB formation as a stochastic single-cell event and successfully reproduces the experimentally observed “polarized” outcomes, where most cells fail to integrate while a minority succeed. This heterogeneity arises from replication-fork gating: integration occurs primarily during the brief window when the replication fork passes through the target locus, rendering DNA temporarily accessible to ssAP-mediated recombination. Replication timing, therefore, is the underlying driver of both population heterogeneity and the oriC–ter efficiency gradient.

8. Integration of All Layers — From Model to Design Assistant

Together, these models define three interacting principles:

The M5 model, implemented as our open-source software GenOMe Navigator, integrates these three layers into a predictive design assistant. Users can input locus position, insert length, and induction schedule to estimate integration efficiency and receive quantitative SOP recommendations, transforming experimental planning from trial-and-error into rational, data-driven design.

9. Future Refinements

While the M5 framework already reproduces experimental trends with strong fidelity,several refinements could further elevate its predictive precision:

  1. Expanding locus-level replicates to enhance statistical robustness;
  2. Incorporating direct replication-timing data instead of inferred positions;
  3. Refining geometric loss modeling for complex two-to-two architectures;
  4. Automating parameter calibration across experimental batches.

These extensions represent next-generation upgrades rather than current limitations— laying the foundation for M5 to evolve into a fully predictive genome-design platform.

10. Summary

Integration of modeling and experimental data yields the following key insights:
Integration efficiency follows an oriC-to-ter gradient defined by replication timing.
Two-to-two geometry, once optimized, achieves high-efficiency dual-end recombination comparable to one-to-one events.
The first and second two-to-two rounds are governed by distinct kinetic bottlenecks — the former by attB formation probability, the latter by enzymatic and geometric factors.
A 1–2 kb length window represents the optimal trade-off between completeness and efficiency.
Protein induction timing defines the integration ceiling.
The M5 model (GenOMe Navigator) converts these multilayer rules into a reproducible, quantitative, and design-driven framework for genome engineering. In essence, integration success emerges from the interplay of timing, geometry, and kinetics. The locus provides opportunity, the geometry sets boundaries, and protein dynamics determine attainment.
Through calibrated modeling, these elements are unified into a single predictive platform — transforming genome integration from empirical trial-and-error into a rational, engineering discipline.

back-to-top