Imagine a gene integration system where success rates can vary by nearly 3000-fold under different conditions. This isn't an experimental error—it’s a black box waiting to be opened.
In our GenOMe system, one-to-one means that one pair of attB and attP sites recombines once through Bxb1 integrase to form one integration event.
Traditional genetic engineering relies on trial and error, but we asked:
Our answer: Build an interpretable, predictive five-layer integration model.
| System Type | Traditional Mode A | Ready-to-Integrate Mode B | Efficiency Gain |
|---|---|---|---|
| One-to-one | ~0.57% | - | - |
| Two-to-two | ~0.05% | 79–87% | 1600-fold |
A single variable change produces a three-order-of-magnitude leap — this is the core mystery our model explains.
We decomposed the Bxb1 gene integration system into five modelable biological layers, peeling away complexity like an onion:



Quantifies how arabinose initiates Bxb1 gene transcription and the dynamics of mRNA production and degradation.
Calculates how long it takes for Bxb1 protein to reach effective concentration and when it begins to lose activity.
Models how ssAP forms attB sites on the chromosome — the system’s most fragile step.
Mechanism Breakdown: ssAP (single-strand annealing protein) must:
The multiplication of these four steps results in overall success rate < 1%.
Quantifies how Bxb1 recognizes and recombines two attB sites, and why longer fragments reduce efficiency.
Transforms parameters from the first four layers into a lookup-table success rate predictor for experiment design.
Most would assume low Bxb1 protein concentration causes poor efficiency. But the model reveals:
The answer lies in M3: randomness of attB site formation is the true killer.
Position Effect Formula: P(d) ≈ 0.6% × e−0.0007d, where d = distance from oriC (kb).
At the same chromosomal position (oriC-proximal), we compared:
Why the 10-fold difference? The model provides the answer:
The fundamental mechanism complexity differs, not just a probability multiplication:
Success_one-to-one = P₀ × [modification factors]
≈ 0.001066 × φtot × fgc × flocus × distance_factor × …
≈ 0.57%
Success_two-to-two = P₀_dual × [same modification factors]
≈ (P₀ / 15–20) × φtot × fgc × flocus × distance_factor × …
≈ 0.05%
Why is P₀_dual so much smaller? Two-to-two deletion requires:
Intuitive understanding: It's like needing to successfully thread two needles at once while they're moving—even if you're skilled at threading one needle (one-to-one), doing two simultaneously in coordination is fundamentally harder, not just twice as hard.
Mechanistic complexity multiplier: Two-to-two is 15-20 times harder than one-to-one
When using Ready-to-Integrate Mode B (linear DNA with pre-built attB sites):
| Fragment Length | Measured Success | Model Prediction | Error |
|---|---|---|---|
| 1039 bp | 87.5% | 87.6% | +0.1% |
| 1605 bp | 79.0% | 79.3% | +0.3% |
| 1613 bp | 80.0% | 79.1% | −0.9% |
Average prediction error < 0.4%
The model not only predicts accurately but also explains why:
We calibrated the model using three length anchor points (1039/1605/1613 bp):
Mean Absolute Error (MAE) = 0.37%
Maximum deviation < 1%
Covers practical range 1.0-1.6 kb
All predictions include 95% confidence intervals, with extrapolation regions (>4 kb, extreme GC%) clearly marked.
Choose different reading paths based on your needs:
→ Jump directly to M5 Experimental Design Assistant
Input your fragment parameters and get success rate predictions from the lookup table.
→ Focus on M3 vs M4 Comparative Analysis
See where the bottleneck truly lies.
→ Read sequentially M1 → M2 → M3 → M4 → M5
Trace the complete process from induction to integration.
→ Consult parameter tables and calibration guidelines for each layer.
All assumptions and limitations are explicitly stated.
Yes! This five-layer architecture applies to any gene editing system requiring:
Examples include:
The model reveals that "delivery bottleneck > reaction kinetics" — a principle explaining why:
The underlying logic is the same: bypass random intracellular synthesis/marking processes by directly delivering functional molecules.
This is not just a mathematical model, but a complete roadmap from gene expression to genome rewriting.
Choose your path and start exploring
Have questions about modeling? Want to see raw data? Click the sections to begin your journey!
The M1 model describes the mRNA dynamics of ssAP (single-strand annealing protein) and Bxb1 integrase under various induction regimes.
A key feature of M1 is its ability to resolve induction time windows, including the effect of cooling periods where transcription is reduced but not fully terminated.
mRNA dynamics are governed by:
$$ \frac{dm_g}{dt} = \beta_g \cdot I_g(t) - d_{m,g} m_g, \quad g \in \{ \text{Bxb1, ssAP} \} $$
where:
| Name | Definition | Value [unit] | Reference |
|---|---|---|---|
| βssAP | Transcription rate of ssAP | 0.15 [μM/min] | Estimated from promoter strength [1] |
| βBxb1 | Transcription rate of Bxb1 | 0.10 [μM/min] | Estimated from promoter strength [1] |
| βBxb1-J23105 | Transcription rate of Bxb1 under promoter J23105 | 0.04 [μM/min] | iGEM Registry [3] |
| t1/2,ssAP | mRNA half-life of ssAP | 7.0 [min] | Literature [2] |
| t1/2,Bxb1 | mRNA half-life of Bxb1 | 6.9 [min] | Literature [2] |
| t1/2,Bxb1-J23105 | mRNA half-life of Bxb1-J23105 | 6.8 [min] | Literature [2] |
| dm,ssAP | Degradation constant of ssAP mRNA | 0.099 [1/min] | Derived from half-life [2] |
| dm,Bxb1 | Degradation constant of Bxb1 mRNA | 0.100 [1/min] | Derived from half-life [2] |
| dm,Bxb1-J23105 | Degradation constant of Bxb1-J23105 mRNA | 0.102 [1/min] | Derived from half-life [2] |
| γ | Residual transcription ratio during cooling | 0.1 | Literature [5] |
| tinduce,ssAP | Induction time for ssAP | 45 [min] | Experimental condition |
| tinduce,Bxb1 | Induction time for Bxb1 | 240 [min] | Experimental condition |
| tinduce,J23105 | Induction time for J23105 promoter | 240 [min] | Experimental condition |
| tcool,ssAP | Cooling time for ssAP | 10 [min] | Experimental setting |
For ssAP, we explicitly incorporated a cooling phase to reflect reduced transcriptional activity:
$ I(t) = \begin{cases} 1, & 0 \le t < 45 \quad \text{(full induction)} \\ \gamma, & 45 \le t < 55 \quad \text{(cooling, residual activity)} \\ 0, & t \ge 55 \quad \text{(fully off)} \end{cases} $
with γ = 0.1, representing 10% residual transcriptional efficiency during cooling on ice [4].
Analytical solutions follow canonical forms:
These are standard analytical solutions [5] obtained by solving the first-order ODE for mRNA dynamics.
Extend ssAP induction to 45 min to align with its mRNA peak, accounting for residual transcription during cooling.
[1] Registry of Standard Biological Parts (iGEM). (n.d.). Promoter BBa_J23105. Retrieved from http://parts.igem.org/Part:BBa_J23105
[2] Bernstein, J. A., Khodursky, A. B., Lin, P. H., Lin-Chao, S., & Cohen, S. N. (2002). Global analysis of mRNA decay and abundance in Escherichia coli at single-gene resolution using two-color fluorescent DNA microarrays. PNAS, 99(15), 9697–9702. https://doi.org/10.1073/pnas.112318199
[3] Selinger, D. W., Saxena, R. M., Cheung, K. J., Church, G. M., & Rosenow, C. (2003). Global RNA half-life analysis in Escherichia coli reveals positional patterns of transcript degradation. Genome Research, 13(2), 216–223. https://doi.org/10.1101/gr.912603
[4] Phadtare, S., & Inouye, M. (2003). Cold shock response and cold shock proteins. Genes & Development, 17(8), 971–981. https://doi.org/10.1101/gad.1077503
[5] Alon, U. (2006). An Introduction to Systems Biology: Design Principles of Biological Circuits. Chapman & Hall/CRC.
Building upon the transcriptional dynamics characterized in M1, the M2 model quantifies the accumulation of ssAP and Bxb1 proteins.
Thus, M2 serves as the critical bridge between gene expression and functional integration efficiency. Its role is to evaluate whether extended induction windows are sufficient to ensure proteins act within their effective functional windows.
We start from the canonical mRNA–protein dynamics widely used in gene expression modeling.
$$ \frac{dm_g}{dt} = \beta_g I_g(t) - d_{m,g} m_g(t), \quad g \in \{\mathrm{Bxb1}, \mathrm{ssAP}\}. \quad [1][2] $$
This is a first–order linear ODE describing production at rate $\beta_g$ (controlled by induction function $I_g(t)$) and degradation at rate $d_{m,g}$. The analytical solution is obtained using the integrating factor method:
$$ m_g(t) = m_g(0)e^{-d_{m,g}t} + \int_0^t \beta_g I_g(s)e^{-d_{m,g}(t-s)}\, ds. \quad [1][2] $$
Special cases include:
Moreover, the degradation constant is related to the mRNA half-life by $d_{m,g} = \ln 2 / t_{1/2}$, which in E. coli has been globally quantified to be on the order of a few minutes [5].
We incorporate an average translation/maturation delay $\tau$, such that fully functional proteins appear only after time $\tau$:
$$ \frac{dP_g}{dt} = k_{\mathrm{trans},g}\, m_g(t - \tau) - d^{\mathrm{eff}}_{p,g}\, P_g, \quad [3][4] $$
where the effective decay constant combines protein degradation and growth dilution:
$$ d^{\mathrm{eff}}_{p,g} \equiv k^{P}_{\mathrm{deg},g} + \mu. \quad [1][2] $$
The convention is $m_g(t - \tau) = 0$ for $t < \tau$.
Assume $I_g(t) = 1$ with initial conditions $m_g(0) = P_g(0) = 0$.
This expression shows that the protein dynamics are identical to the no-delay case, but shifted by $(t - \tau)$.
For switch-off or pulse induction with residual activity \(\gamma\), one substitutes the time-dependent \(I_g(t)\) into the mRNA solution and propagates it through the protein convolution. The closed form remains valid provided all terms are shifted by \((t-\tau)\) and \(d_p\) is replaced by dp,geff.
Protein removal is modeled through an effective decay constant \(d_{p,\mathrm{eff}}=d_p+\mu\), where \(\mu\) represents dilution due to cell growth and \(d_p\) denotes intrinsic protein degradation. In M2 we assume a constant \(\mu\), corresponding to balanced exponential growth, to keep the focus on transcription–translation–recombination dynamics. A burden-aware formulation \(\mu([\textit{Bxb1}])\) is introduced later as an optional extension.
| Name | Definition | Value [unit] | Reference |
|---|---|---|---|
| βssAP | Transcription rate of ssAP mRNA | 0.15 [μM/min] | Estimated from promoter strength [6] |
| βBxb1 | Transcription rate of Bxb1 mRNA (induced) | 0.10 [μM/min] | Estimated from promoter strength [6] |
| βBxb1-J23105 | Transcription rate of Bxb1 under J23105 promoter | 0.04 [μM/min] | iGEM Registry [6] |
| t1/2, mRNA, ssAP | mRNA half-life of ssAP | 7.0 [min] | Literature [7,8] |
| t1/2, mRNA, Bxb1 | mRNA half-life of Bxb1 | 6.9 [min] | Literature [7,8] |
| t1/2, mRNA, Bxb1-J23105 | mRNA half-life of Bxb1-J23105 | 6.8 [min] | Literature [7,8] |
| dm, ssAP | mRNA degradation constant of ssAP | 0.099 [1/min] | Calculated as ln2 / t1/2 [7,8] |
| dm, Bxb1 | mRNA degradation constant of Bxb1 | 0.100 [1/min] | Calculated as ln2 / t1/2 [7,8] |
| dm, Bxb1-J23105 | mRNA degradation constant of Bxb1-J23105 | 0.102 [1/min] | Calculated as ln2 / t1/2 [7,8] |
| ktrans, ssAP | Translation rate constant of ssAP | 0.50 [1/min] (baseline assumption) | Assumed unified value; later fitted [9,10] |
| ktrans, Bxb1 | Translation rate constant of Bxb1 | 0.20 [1/min] (baseline assumption) | Assumed unified value [9,10] |
| ktrans, Bxb1-J23105 | Translation rate constant of Bxb1 (J23105 promoter) | 0.20 [1/min] (baseline assumption) | Assumed unified value [9,10] |
| τ | Translation delay (initiation/maturation) | 2 [min] | Assumed; typical ribosome initiation + maturation delay [9] |
| t1/2, Protein, ssAP | Protein half-life of ssAP | 60 [min] | Literature: short-lived protein [11] |
| t1/2, Protein, Bxb1 | Protein half-life of Bxb1 | 960 [min] (16 h) | Literature: stable protein [11] |
| t1/2, Protein, Bxb1-J23105 | Protein half-life of Bxb1-J23105 | 960 [min] (16 h) | Literature: stable protein [11] |
| dp, ssAP | Protein degradation constant of ssAP | 0.012 [1/min] | Derived from half-life [11] |
| dp, Bxb1 | Protein degradation constant of Bxb1 | ≈ 0 [1/min] | Derived from half-life [11] |
| dp, Bxb1-J23105 | Protein degradation constant of Bxb1-J23105 | ≈ 0 [1/min] | Derived from half-life [11] |
| μ | Growth dilution rate | ~0.0116 [1/min] (Td=60 min) | Calculated as ln2 / Td |
| dpeff | Effective protein decay constant | $ d_p^{\text{eff}} = dp + \mu $ | Model definition |
In vitro assays typically employ serine integrases at 200–400 nM, occasionally up to 1000 nM.
Our simulations predicted μM-range intracellular concentrations, one to two orders of magnitude higher.
This confirms that absolute protein abundance is not the limiting factor in vivo.
The critical determinant is not whether proteins reach steady-state, but whether they are present at the right time window.
Role Differentiation
Dry → Wet (Prediction to Guidance):
Wet → Dry (Feedback to Refinement):
This cycle demonstrates how modeling can directly shape experimental protocols by moving beyond concentration magnitude and instead emphasizing the importance of temporal coordination and encounter probability.
Without this temporal adjustment, downstream recombination (M4) would systematically underestimate efficiency, despite protein abundance being non-limiting.
Conclusions:
Experimental Guidance:
[1]Alon, U. (2006). An Introduction to Systems Biology: Design Principles of Biological Circuits. Chapman & Hall/CRC.
[2]Thattai, M., & van Oudenaarden, A. (2001). Intrinsic noise in gene regulatory networks. PNAS, 98(15), 8614–8619.https://doi.org/10.1073/pnas.151588598
[3]Monk, N. A. M. (2003). Oscillatory expression of Hes1, p53, and NF-κB driven by transcriptional time delays. Current Biology, 13(16), 1409–1413.https://doi.org/10.1016/S0960-9822(03)00534-9
[4]Bratsun, D., Volfson, D., Tsimring, L. S., & Hasty, J. (2005). Delay-induced stochastic oscillations in gene regulation. PNAS, 102(41), 14593–14598.https://doi.org/10.1073/pnas.0503858102
[5]Bernstein, J. A., Khodursky, A. B., Lin, P. H., Lin-Chao, S., & Cohen, S. N. (2002). Global analysis of mRNA decay and abundance in Escherichia coli. PNAS, 99(15), 9697–9702.https://doi.org/10.1073/pnas.112318199
[6]Registry of Standard Biological Parts (iGEM). Promoter BBa_J23105.http://parts.igem.org/Part:BBa_J23105
[7]Selinger, D. W., et al. (2003). Global RNA half-life analysis in Escherichia coli. Genome Research, 13(2), 216–223.https://doi.org/10.1101/gr.912603
[8]Li, G.-W., Oh, E., & Weissman, J. S. (2012). The anti-Shine-Dalgarno sequence drives translational pausing. Nature, 484(7395), 538–541.https://doi.org/10.1038/nature10965
[9]BioNumbers ID 100059. Rate of translation by ribosome in E. coli.https://bionumbers.hms.harvard.edu/bionumber.aspx?id=100059
[10]Belle, A., et al. (2006). Quantification of protein half-lives in budding yeast. PNAS, 103(35), 13004–13009.https://doi.org/10.1073/pnas.0605420103
The M3 model adopts a single-cell, site-discrete probabilistic framework to describe how synthetic oligonucleotides (oligos) are integrated into the E. coli MG1655 chromosome via RecT (ssAP), forming attB sites. Compared to M1 (mRNA dynamics) and M2 (protein dynamics), addressing not only whether a cell forms at least one attB site, but also how genomic position relative to oriC affects integration probability.
Traditional concentration-based kinetic models fail to capture the polarization phenomenon often observed in recombineering: in the same batch, a minority of cells succeed in integration while most fail completely. This arises because each cell contains only 1–4 genome copies, making the relevant question not the average concentration of intermediates but the probability that a given cell achieves ≥1 attB integration event. This motivated a stochastic, site-limited modeling strategy.
(1) Concentration-to-molecule conversion
The ssAP concentration output from M2 is converted into per-cell molecule numbers:
$$ N_{ssAP}(t) = ssAP \times 602 $$
where 602 is the approximate conversion factor for E. coli (1 µM ≈ 602 molecules per 1 fL cell volume) [1]. This links the protein-level dynamics in M2 to discrete single-cell stochastic modeling in M3.
(2) Single-cell reaction rate function (Cycle 4 baseline)
For cell i, with genome copy number (Gi {1,2,4}) (replication-state dependent [2]), the instantaneous rate of attB formation is modeled by a Michaelis–Menten form:
$$ v_i(t) = G_i \cdot \frac{k_{cat} , N_O^{(i)}(t)}{K_m + N_O^{(i)}(t)} $$
where:
Oligos decay exponentially due to nucleases [4,5]:
$$ N_O^{(i)}(t) = N_O^{(i)}(0) e^{-k_{loss}t} $$
(3) Stochastic integration as a non-homogeneous Poisson process
The cumulative intensity is:
$$ \Lambda_i(t) = \int_0^t v_i(\tau) d\tau $$
and the probability that cell i has formed at least one attB site by time t is:
$$ P_i(t) = 1 - e^{-\Lambda_i(t)}. $$
Monte Carlo sampling (10³–10⁴ cells) over initial oligo distributions (log-normal, GSD ≈ 2–3 [6]), ssAP multiplicative noise (GSD ≈ 1.3–1.5 [7]), and genome copy number distributions yields the population-level fraction of cells with ≥1 attB site.
(4) Cycle 5 extension: locus dependence and replication-fork gating
Cycle 4 assumed all chromosomal loci were equally accessible. In Cycle 5, we introduce explicit locus and time-dependence:
$$ v_i(x,t) = G_i(x,t) \cdot \frac{k_{cat} N_O^{(i)}(t)}{K_m + N_O^{(i)}(t)} \cdot \chi_{\text{rep}}(x,t) $$
where:
Thus, attB formation can only occur during fork passage, and success probability decreases with locus distance from oriC.
(5) Empirical oriC–ter decay function
Fitting experimental and predicted efficiencies (hisA, metA, galK, leuD, lacZ), the success probability declines approximately exponentially with oriC distance:
$$ P(d) \approx P_0 \cdot e^{-k d} $$
where:
This fit indicates that each additional ~1000 kb from oriC reduces integration probability by ~50–70%, consistent with oriC-proximal loci (hisA, 0.57%) being an order of magnitude more efficient than ter-proximal loci (leuD, 0.05%).
(6) Numerical implementation
Analytical solutions remain tractable only in the simplified Cycle 4 case (constant inputs, no locus dependence). With locus-specific copy number and replication timing, numerical integration combined with Monte Carlo sampling is required. This allows computation of population-level trajectories, polarization histograms, and locus-dependent gradients (Figs. M3-2A–E).
| Name | Definition | Value [unit] | Reference |
|---|---|---|---|
| Vcell | Average E. coli cell volume | 1 fL | Alon (2006) [1] |
| x | Genomic locus coordinate (distance from oriC) | 0–2300 kb (E. coli genome half-length) | Cooper & Helmstetter (1968) [2] |
| Gi(x,t) | Locus-specific copy number (replication-timing dependent) | 1–4 copies; oriC-proximal loci show earlier/multiple copies | Cooper & Helmstetter (1968) [2] |
| χrep(x,t) | Replication-window indicator function (fork accessibility) | Binary: 1 if replication fork passes locus x at time t; 0 otherwise | Cooper & Helmstetter (1968) [2]; Rao et al. (2018) [9] |
| trep(x) | Replication timing of locus x (fork arrival time) | ~0–40 min depending on oriC–ter position (E. coli doubling time 40 min) | Cooper & Helmstetter (1968) [2] |
| kcat | Catalytic turnover rate of ssAP | 0.01–0.2 (baseline 0.02) min⁻¹ | Briggs & Haldane (1925) [3] |
| Km | Effective Michaelis constant | 10³–10⁴ (baseline 6000) molecules | Briggs & Haldane (1925) [3] |
| gm, oligo | Geometric mean of oligo uptake per cell | 10²–10⁴ (baseline 300) molecules | Mosberg et al. (2010) [6] |
| gsd, oligo | Geometric SD of oligo uptake distribution | 2–3 | Taniguchi et al. (2010) [7] |
| kloss | Oligo degradation rate constant | 0.15–0.20 (baseline 0.2) min⁻¹ | Sawitzke et al. (2007) [5]; Mosberg et al. (2010) [6] |
| NssAP(t) | ssAP molecules per cell over time | Derived from M2 dynamics (~10²–10³) | Taniguchi et al. (2010) [7] |
| p≥1(t) | Population fraction of cells with ≥1 attB | Simulation output | – |
| t₁₀, t₅₀, t₉₀ | Characteristic times when p≥1(t) = 0.1/0.5/0.9 | Simulation output | – |
(1) Baseline molecular framework .
Simulations under baseline parameters (gm = 300 molecules/cell, kcat = 0.02 min⁻¹, Km = 6000 molecules) reproduced the characteristic recombineering behavior. Population success curves exhibited lag–rise–plateau dynamics with a final plateau of ~30–35%, consistent with reported λ-Red/RecT efficiencies. Polarization histograms revealed bimodal outcomes: most cells remained unsuccessful, while a small fraction achieved near-complete integration. Sensitivity analyses confirmed parameter priority as oligo delivery (gm) > catalytic turnover (kcat) > Km, with delivery dominating across regimes.
(2) Locus dependence .
In incorporating locus position and replication-fork gating fundamentally reshaped predictions. Ori-proximal loci (hisA) reached high success probabilities (0.57%), mid-replichore loci (galK) were intermediate (~0.19%), and ter-proximal loci (leuD) dropped to 0.05%. A simple exponential model captured this oriC–ter gradient:
$$ P(d) \approx 0.6\%\,\cdot\, e^{-0.0007d}, $$
indicating each additional ~1000 kb from oriC reduces success probability by ~50–70%. These results demonstrate that positional constraints outweigh steric hindrance: two-to-two recombination at lacZ (0.0195%) did not outperform one-to-one at leuD (0.0386%).
To explicitly incorporate replication timing, Cycle 5 simulations compared representative loci across the E. coli chromosome. As shown in Fig. M3-2E, ori-proximal genes (hisA) reached significantly higher success fractions than mid- and ter-proximal sites (galK, leuD), consistent with experimental observations. This establishes a genome-wide oriC–ter gradient as the dominant factor shaping integration efficiency.
Simulated integration success trajectories for representative chromosomal loci demonstrate a clear oriC–terC gradient. Ori-proximal loci (hisA) reach high plateaus (0.57%), mid-replichore locus (galK) reaches intermediate levels (~0.19%), and ter-proximal loci (leuD) remain very low (0.05%). The curves highlight that genomic position, mediated by replication timing and locus copy number, is the dominant determinant of success, outweighing steric hindrance effects.
Dry → Wet (predictions):
Wet → Dry (feedback):
[1] Alon, U. (2006). An Introduction to Systems Biology: Design Principles of Biological Circuits. Chapman & Hall/CRC.
[2] Cooper, S., & Helmstetter, C. E. (1968). Chromosome replication and the division cycle of Escherichia coli B/r. Journal of Molecular Biology, 31(3), 519–540. https://doi.org/10.1016/0022-2836(68)90425-7
[3] Briggs, G. E., & Haldane, J. B. S. (1925). A note on the kinetics of enzyme action. Biochemical Journal, 19(2), 338–339. https://doi.org/10.1042/bj0190338
[4] Gillespie, D. T. (1977). Exact stochastic simulation of coupled chemical reactions. The Journal of Physical Chemistry, 81(25), 2340–2361. https://doi.org/10.1021/j100540a008
[5] Sawitzke, J. A., Thomason, L. C., Costantino, N., Bubunenko, M., Datta, S., & Court, D. L. (2007). Recombineering: In vivo genetic engineering in E. coli, S. enterica, and beyond. Methods in Enzymology, 421, 171–199. https://doi.org/10.1016/S0076-6879(06)21012-3
[6] Mosberg, J. A., Lajoie, M. J., & Church, G. M. (2010). Improving oligonucleotide recombineering by removing roadblocks to annealing. Nucleic Acids Research, 38(20), 7462–7469. https://doi.org/10.1093/nar/gkq617
[7] Taniguchi, Y., Choi, P. J., Li, G. W., Chen, H., Babu, M., Hearn, J., Emili, A., & Xie, X. S. (2010). Quantifying E. coli proteome and transcriptome with single-molecule sensitivity in single cells. Science, 329(5991), 533–538. https://doi.org/10.1126/science.1188308
[8] Cooper, S., & Helmstetter, C. E. (1968). Chromosome replication and the division cycle of Escherichia coli B/r. Journal of Molecular Biology, 31(3), 519–540. https://doi.org/10.1016/0022-2836(68)90425-7
[9] Mosberg, J. A., Gregg, C. J., Lajoie, M. J., Wang, H. H., & Church, G. M. (2012). Improving oligonucleotide-mediated recombineering by mismatch repair protein mutS. Nucleic Acids Research, 40(14), e111. https://doi.org/10.1093/nar/gks412
The M4 model operates within a hierarchical framework where upstream models (M1–M3) provide boundary conditions: M1–M2 establish the temporal profile of Bxb1 concentration following induction, capturing the rise, plateau, and potential decay phases [1][2]; M3 quantifies chromosomal attB site availability and the position-dependent efficiency gradient along the oriC–terC axis [3][4]; M4 constitutes the mechanistic core, converting these upstream conditions into dynamic trajectories of success versus failure pathways through explicit treatment of binding, synapsis, and cleavage steps [5][6].
This modular architecture enables independent validation and parametric refinement of each layer,maintaining model interpretability and experimental tractability [7][8].The design supports two-level population mapping: from conditional success rates within the attB-bearing sub-population to whole-population outcomes.Specifically, we define:PTX = nattB / ntotal (transformation efficiency, e.g., 1700 / 100,000 = 0.017)Example calculation: 1.147 % × PTX = 0.017 = 0.0195 % overall, confirming consistency between conditional and population-level rates.We propagate binomial SE for PTX in confidence intervals of whole-population rates;central value 0.017 with propagated uncertainty (see Methods).We calibrate parameters using the simpler one-to-one insertion system before extending to two-to-two replacement,avoiding identifiability problems inherent in direct fitting of complex mechanisms [7][8].In our GenOMe system, one-to-one means that one pair of attB and attP sites recombines once through Bxb1 integrase to form one integration event.
Two-to-two, on the other hand, means two pairs of attB and attP sites are available on the genome and the DNA fragment, allowing two recombination events to happen at the same time — one at each site.
The one-to-one insertion mechanism proceeds through sequential steps: high-affinity recognition (kon / koff) → single-end homologous synapsis (kf1 / ksynr) → catalytic strand exchange (kcat) → product release with passive attL/attR binding [5][9]. This simpler topology establishes the quantitative relationship between sequence specificity (encoded in site-dependent kon / koff ratios) and overall conversion efficiency, providing a validated parameter window for the two-to-two model [10].
Critical correction on product binding: attL and attR sites do not catalytically inhibit the forward reaction. Instead, they function as non-reactive binding sinks that transiently sequester Bxb1 through reversible association, reducing the free enzyme fraction available for productive attP–attB recognition. This passive sequestration becomes significant only at high product accumulation and does not drive reverse conversion in the absence of recombination directionality factor (RDF) [11][12].
$$ \begin{aligned} \text{attP} + \text{Bxb1} &\xrightleftharpoons[k_{-1}]{k_1} C_P \\ \text{attB} + \text{Bxb1} &\xrightleftharpoons[k_{-2}]{k_2} C_B \\ C_P + C_B &\xrightleftharpoons[k_{-s}]{k_s} \text{synapsis} \xrightarrow{k_{cat}} C_L + C_B \\ C_L &\xrightleftharpoons[k_{-3}]{k_3} \text{attL} + \text{Bxb1} \\ C_R &\xrightleftharpoons[k_{-4}]{k_4} \text{attR} + \text{Bxb1} \end{aligned} $$
The two-to-two replacement mechanism embodies “dual-end synchrony”: two spatially separated attP/attB pairs first form independent single-end synaptic complexes (Sleft and Sright) with Bxb1 dimers; these must then encounter each other through chromosomal looping to achieve bimolecular ring-closure (rate constant kr2), assembling a tetrameric synaptic complex (D); finally, concerted cleavage and strand exchange complete the replacement [6][13].
Unlike one-to-one insertion—which opens a chromosomal gap and introduces new DNA—two-to-two performs direct segment replacement between pre-installed attB sites without altering total genome length. Critically, the newly inserted segment can carry orthogonal att sites at its termini, enabling unlimited replacement cycles without cumulative length increase. This architecture enables unlimited “replace-without-growth” cycles and shifts selection from additive stacking to iterative updating.
$$ \begin{aligned} \frac{d[\text{attP}]}{dt} &= -k_{\text{on}}[\text{attP}][\text{Bxb1}] + k_{\text{off}}[C_p] \\ \frac{d[\text{attB}]}{dt} &= -k_{\text{on}}[\text{attB}][\text{Bxb1}] + k_{\text{off}}[C_b] \\ \frac{d[\text{Bxb1}]}{dt} &= -k_{\text{on}}[\text{attP}][\text{Bxb1}] - k_{\text{on}}[\text{attB}][\text{Bxb1}] + k_{\text{off}}[C_p] + k_{\text{off}}[C_b] + 2k_{\text{cat}}[C_{\text{syn}}] \\ \frac{d[C_p]}{dt} &= k_{\text{on}}[\text{attP}][\text{Bxb1}] - k_{\text{off}}[C_p] - k_{\text{syn}}[C_p][C_b] + k_{-\text{syn}}[C_{\text{syn}}] \\ \frac{d[C_b]}{dt} &= k_{\text{on}}[\text{attB}][\text{Bxb1}] - k_{\text{off}}[C_b] - k_{\text{syn}}[C_p][C_b] + k_{-\text{syn}}[C_{\text{syn}}] \\ \frac{d[C_{\text{syn}}]}{dt} &= k_{\text{syn}}[C_p][C_b] - k_{-\text{syn}}[C_{\text{syn}}] - k_{\text{cat}}[C_{\text{syn}}] \\ \frac{d[\text{attL}]}{dt} &= k_{\text{cat}}[C_{\text{syn}}] - k_{\text{on}}[\text{attL}][\text{Bxb1}] + k_{\text{off}}[C_l] - k_{\text{on}}[\text{attL}][\text{Bxb1}] + k_{\text{off}}[C_r] \\ \frac{d[C_l]}{dt} &= k_{\text{on}}[\text{attL}][\text{Bxb1}] - k_{\text{off}}[C_l] \\ \frac{d[C_r]}{dt} &= k_{\text{on}}[\text{attL}][\text{Bxb1}] - k_{\text{off}}[C_r] \end{aligned} $$
| Parameter | Definition | Magnitude | Mechanistic Basis |
|---|---|---|---|
| kf1 | Single-end synapsis rate | kf1 ≫ kf2 | Rapid, reversible encounter between pre-bound sites |
| kf2 | Dual-end ring-closure rate | Rate-limiting | Geometrically and topologically constrained looping |
Notation (first-use definitions). kf1 (single-end synapsis), ksynr−1 (its reverse), kf2 (dual-end ring-closure), krel (single-end disassembly), kcat (cleavage). Sleft/right: single-end synaptic intermediates; D: dual-end tetrameric complex. We normalize [S] ∈ [0,1] as the fraction of cells (or occupancy) carrying a single-end synaptic intermediate. fssAP (0.05–0.10): fraction of chromosomal attB sites activated by RecT/CspRecT-mediated single-strand annealing. asite (0.28–0.40): position-dependent reduction in association rate due to nucleoid structure. gloop (0.30–0.40): reduction in ring-closure rate due to DNA topology and confinement.
Both published data and our one-to-one analyses demonstrate a robust positional effect: integration efficiency decreases monotonically with distance from oriC toward terC [3][14]. Quantitative examples include high-efficiency loci (hisA, metA, galK; all oriC-proximal) versus low-efficiency regions (leuD, lacZ; ter-proximal) [15][16].
Our experimental two-to-two design utilized the lacZ region primarily for blue-white screening convenience, accepting the trade-off of operating in a low-efficiency zone where lacZ and leuD nearly coincide.
The 30–60 min / 0.3–1.0 μM window corresponds to the time-over-threshold (ToT) concept originating from M1–M2 models and is localized here as the standard operational (SOP) range for Mode B optimization.
This locus choice explains why the efficiency differential between our two-to-two and one-to-one implementations was smaller than initial predictions: both operated under comparable geometric constraints. Relocating one or both att sites toward oriC would substantially increase overall success probability by enhancing both local DNA accessibility and effective encounter rates [4].
$$ \frac{d[C_P]}{dt} = k_{on}[\text{attP}][\text{Bxb1}] - k_{off}[C_P] $$
$$ \frac{d[C_B]}{dt} = k_{on}[\text{attB}][\text{Bxb1}] - k_{off}[C_B] $$
These expressions inherit sequence-dependent affinity from heat-map data [10]. We map in vitro specificity ratios to in-cell \(k_{on}/k_{off}\) via proportional scaling while preserving \(K_d\); absolute values are re-fit in Layer 2. Paired adjustments maintain thermodynamically reasonable \(K_d\) values (typically 10–50 nM). High-affinity recognition ensures rapid, reversible pre-synaptic complex formation, establishing the substrate pool for downstream synapsis and cleavage. Sequence-dependent differences manifest primarily at this recognition step, avoiding mis-attribution to catalytic phases.
$$ \frac{d[S]}{dt} = k_{f1}[C_P][C_B] - k_{synr}[S] - k_{cat}[S] $$
Synaptic complex stability governs the waiting-time distribution before potential dual-end encounters in the two-to-two system. Large \(k{synr}\) values induce rapid cycling between pre-synaptic and synaptic states, reducing the steady-state synaptic intermediate pool. This step exhibits strong concentration dependence and sensitivity to the Bxb1:DNA stoichiometric ratio.
In the absence of RDF, we do not include reverse catalysis. Instead, attL/attR contribute to enzyme sequestration:
$$ \frac{d[C_L]}{dt} = k_{on}^{attL}[\text{attL}][\text{Bxb1}] - k_{off}^{attL}[C_L] $$
$$ \frac{d[C_R]}{dt} = k_{on}^{attR}[\text{attR}][\text{Bxb1}] - k_{off}^{attR}[C_R] $$
These enter the Bxb1 mass balance but leave product concentrations unchanged; we do not subtract binding terms from d[attL]/dt or d[attR]/dt. Products are not consumed by reversible binding; the effect is reduced free enzyme, not catalytic inhibition [11][12].
These terms subtract only from the free [Bxb1] pool and do not enter d[attL]/dt or d[attR]/dt, thereby preserving mass balance without introducing reverse catalysis.
Two single-end intermediates \( S_{left} \) and \( S_{right} \) must meet to form \( D \).
Defining \( J_{f2} = k_{f2} [S_{left}] [S_{right}] \):
$$ \frac{d[S_{left}]}{dt} = \dots - J_{f2} - k_{rel}[S_{left}] $$ $$ \frac{d[S_{right}]}{dt} = \dots - J_{f2} - k_{rel}[S_{right}] $$ $$ \frac{d[D]}{dt} = + J_{f2} - k_{cat}[D] $$
Under \( [S_{left}] = [S_{right}] = [S] \), dual-end encounter is second-order in [S], yielding:
$$ P_{ring} = \frac{k_{f2}[S]^2}{k_{f2}[S]^2 + k_{rel}} $$
Here [S] is a normalized single-end pool (fraction of cells or dimensionless occupancy), so \( k_{f2}[S]^2 \) and \( k_{rel} \) are effective first-order terms and \( P_{ring} \) is dimensionless. This square dependence explains the sharp, sigmoidal gain when ends are closer; \( k_{rel} \) captures the decay of single-end intermediates outside the synchrony window.
gp47 (RDF) was not expressed in any condition; models therefore set krev = 0 and treat the forward pathway as effectively irreversible after dual-end closure. Any effects of attL/attR are modeled only as passive, reversible enzyme-sequestration terms in the Bxb1 mass balance (no catalytic inhibition) [11][12]. As RDF is excluded by design, we do not analyze reversibility further in this work.
We avoid double counting geometry by applying it only to the steps it governs:
$$ I_{\text{eff}} = I_{\text{tot}} \times f_{\text{ssAP}} $$ $$ k_{\text{on,eff}} = a_{\text{site}} \cdot k_{\text{on}} \quad \text{(association only)} $$ $$ k_{f2,\text{eff}} = g_{\text{loop}} \cdot k_{f2} \quad \text{(ring-closure only)} $$
We map chromosomal position and locus accessibility to \( a_{\text{site}} \) (association only), and polymer looping/topology (length, orientation, torsion) to \( g_{\text{loop}} \) (ring-closure only); no geometry factors are applied elsewhere [3][4][17][18].
| Factor | Physical Meaning | Range | Data Source |
|---|---|---|---|
| fssAP | ssAP-mediated attB activation efficiency | 0.05–0.10 | M2/M3 output |
| asite | Locus accessibility (position-dependent) | 0.28–0.40 | [3,4] |
| gloop | Looping geometry factor | 0.30–0.40 | [17,18] |
| Ieff | Effective Bxb1 concentration | 1–2 μM | Itot × fssAP |
Ranges reflect our fit priors anchored to macrodomain/topology literature; see Table S2 for locus-wise posteriors. Reported Ieff (1–2 μM) reflects M2/M3-inferred fssAP under our induction regime and should be re-calibrated if growth rate, copy number, or crowding differ.
Length penalties apply exclusively to ring-closure and disassembly; recognition parameters are length-invariant:
$$ k_{f2}(L) = k_{f2}^{\circ} \cdot \exp[-\alpha (L - L_0)] $$ $$ k_{rel}(L) = k_{rel}^{\circ} \cdot \exp[+\beta (L - L_0)] $$
where \( L_0 = 1039 \, \text{bp} \). Fitting the 1039/1605/1613-bp triad yields (α, β) with 95% CIs (see Table S3). Longer segments reduce end-to-end encounters (\( k_{f2}\downarrow \)) and increase strain-induced disassembly (\( k_{rel}\uparrow \)) [17][18][19][20][21]; shorter segments elevate \( P_{ring} \) and accelerate completion.
| Parameter Class | Data Source | Calibration Order |
|---|---|---|
| kon, koff (sequence-specific) | Heat-map ratios [10] | L1: Fixed |
| kf1, ksynr, kcat | One-to-one kinetics | L2: Fit |
| kr2, krel, α, β | Two-to-two 240-min conditional success | L3: Fit |
| asite, gloop, fssAP | Chromosomal context | L4: Inherit |
| PTX | Transformation efficiency | Fixed: 0.017 |
This hierarchy prevents over-parameterization [7][8]. Minimal observation sets per layer:
A compact set of calibrated parameters is summarized in Table 1. Full parameter tables, initial conditions, and locus-wise geometry estimates are provided in Tables S1–S3 .
Table 1. Core Kinetic Parameters
| Symbol | Definition | Value | Unit | Source |
|---|---|---|---|---|
| Recognition & Binding | ||||
| kon | attP–Bxb1 → CP | 4.3 × 105 | M−1s−1 | Literature |
| koff | CP dissociation | 6.5 | s−1 | Literature |
| Kd | Dissociation constant | 15 | nM | koff/kon |
| Synapsis & Cleavage | ||||
| kf1 | Single-end synapsis | 1.7 × 10−3 | s−1 | One-to-one fit |
| ksynr | Synaptic disassembly | 3.3 × 10−5 | s−1 | One-to-one fit |
| kcat | Strand exchange | 8.3 × 10−4 | s−1 | Literature |
| Two-to-Two Specific | ||||
| kf2 | Ring-closure (L = L0) | 1.7 × 10−5 | s−1 | 240-min fit |
| krel | Single-end failure (L = L0) | 5.9 × 10−4 | s−1 | 240-min fit |
| α | Length penalty (kf2) | 1.5 × 10−3 | bp−1 | Length series (see Table S3) |
| β | Length penalty (krel) | 1.0 × 10−3 | bp−1 | Length series (see Table S3) |
| L0 | Reference length | 1039 | bp | Shortest construct |
| Effective Parameters | ||||
| fssAP | ssAP activation | 0.05–0.10 | — | M3 output |
| asite | Locus accessibility | 0.28–0.40 | — | Position-dependent (see Table S2) |
| gloop | Looping geometry | 0.30–0.40 | — | DNA topology |
| Ieff | Effective Bxb1 | 1–2 | μM | Itot × fssAP |
| PTX | Transformation efficiency | 0.017 | — | 1700/100,000 |
Pre-synaptic complex formation reaches quasi-equilibrium rapidly (\( \tau \approx 5 \, \text{min} \)); product binding becomes relevant only in late phases (>60 min) when attL/attR accumulation exceeds ~30% conversion. This baseline establishes that if two-to-two underperforms, the bottleneck lies in \( k_{f2} \) versus \( k_{rel} \) competition rather than deficient recognition.
Bxb1 dynamics: The total translated Bxb1 (Itot) rises during 0–240 min and approaches the order of 10 μM in our simulations (model outputs informed by M2/M3). The effective reactive fraction (Ieff = Itot × fssAP × ffree) remains around 1–2 μM and serves as the input concentration for downstream kinetic models. This scale is consistent with the 0.3–1.0 μM operational window defined in M5 and used for Mode B optimization.
Dual-end competition: The success pathway flux (Jf2 = kf2 [Sleft][Sright]) directly competes with failure (krel [Si]), determining whether dual-end ring-closure completes within the synchrony window before single-end intermediates commit to failure pathways.
Population mapping: Calibrated to 1.147% conditional success at 240 min within \( n_{attB} = 1700 \) cells; multiplication by \( P_{TX} = 0.017 \) yields 0.0195% whole-population rate.
Under constant Bxb1 (~4.36 μM), success is governed by \( k_{f2}/k_{rel} \) balance and fragment length. The observed ordering (1039 bp: 87.5%, 1605 bp: 80%, 1613 bp: 79%) emerges purely from length-dependent \( k_{f2}/k_{rel} \) modulation.
Bars represent mean success percentage across \( n = 3 \) biological replicates (error bars: ±1 SD). Fragment lengths: 1039 bp (reference), 1605 bp, 1613 bp. Solid line: model prediction using \( k_{f2}(L) = k_{f2}^{\circ} \exp[-\alpha (L - L_0)] \) with \( \alpha = 1.5 \times 10^{-3} \, \text{bp}^{-1} \) fitted to data. Dashed lines: 95% prediction interval. Bootstrap resamples were drawn at the replicate level with a fixed RNG seed for reproducibility; point estimates are medians, and bands denote 95% CIs.
Without RDF, the forward pathway after dual-end closure is effectively irreversible under our conditions; we did not observe reverse flux, and products enter only via passive enzyme sequestration (no catalytic inhibition). This explains why increasing \( k_{f2} \) (or effective [S]) dramatically shifts overall success. Conversely, “waiting longer” provides diminishing returns once the synchrony window closes: single-end-committed failures become kinetically locked.
DNA looping probability scales approximately as \( P_{loop} \propto L^{-3/2} \) for \( L > \) persistence length (~50 nm ≈ 150 bp) [19][20][21]. For our 1.0–1.6 kb range, this predicts substantial reduction in encounter frequency, consistent with observed efficiency drops. Shorter segments elevate \( P_{ring} \) faster than they increase torsional strain because bending energy scales as \( \kappa \cdot \theta^2 / L \), while encounter frequency scales as \( L^{-3/2} \).
Ori-proximal loci exhibit higher supercoiling density, greater replication-associated accessibility, and reduced NAP crowding [3][4]. These effects compound into \( a_{site} \) gradients ranging from ~0.40 (hisA, 0.2 Mb from oriC) to ~0.28 (lacZ, 1.5 Mb from oriC), directly modulating \( k_{on,eff} \) (see Table S2).
Operationally, we require \( K_{d,cross}/K_{d,self} > 100 \) and observe <1% cross-recombination in plasmid challenge assays; see Methods. This validates clean heat-map → \( k_{on} \) translation and enables stable multi-round performance by preventing inter-round cross-talk. We require (i) <1% cross-recombination in plasmid challenge assays and (ii) no off-target junctions by junction-specific PCR across two rounds.
Dual-end recombination assembles two Bxb1 dimers (one per att pair) into a tetrameric synaptic complex. When sites are closely spaced (≈1–1.3 kb) and co-oriented, DNA bending strain remains small, elevating effective \( k_{f2} \). Beyond a critical spacing, DNA must loop back under torsional constraint, increasing \( k_{rel} \) probability.
This geometric framework explains the “shorter is better” principle: confinement that might hinder one-to-one insertions (steric crowding) actually facilitates two-to-two replacements by constraining the search volume for dual-end encounters. In molecular visualization, the tetramer functions as a rigid hinge—once both ends dock, cleavage proceeds with near-irreversible commitment.
To experimentally validate M4 predictions:
While M4 captures essential dual-end recombination kinetics, several biological processes remain simplified:
Quantitative predictions are most trustworthy for 1–2 kb segments in mid-log E. coli (\( \text{OD}_{600} = 0.4\text{–}0.6 \)) where repair activity and topological state are stable.
[1] Bonnet, J., Subsoontorn, P., & Endy, D. (2012). Rewritable digital data storage in live cells via engineered control of recombination directionality. Proceedings of the National Academy of Sciences, 109(23), 8884–8889. https://doi.org/10.1073/pnas.1202344109
[2] Inniss, M. C., & Silver, P. A. (2013). Building synthetic memory. Current Biology, 23(17), R812–R816. https://doi.org/10.1016/j.cub.2013.07.056
[3] Bryant, J. A., Sellars, L. E., Busby, S. J., & Lee, D. J. (2014). Chromosome position effects on gene expression in Escherichia coli K-12. Nucleic Acids Research, 42(18), 11383–11392. https://doi.org/10.1093/nar/gku828
[4] Valens, M., Penaud, S., Rossignol, M., Cornet, F., & Boccard, F. (2004). Macrodomain organization of the Escherichia coli chromosome. EMBO Journal, 23(21), 4330–4341. https://doi.org/10.1038/sj.emboj.7600434
[5] Rutherford, K., & Van Duyne, G. D. (2014). The ins and outs of serine integrase site-specific recombination. Current Opinion in Structural Biology, 24, 125–131. https://doi.org/10.1016/j.sbi.2013.12.003
[6] Olorunniji, F. J., Rosser, S. J., & Stark, W. M. (2016). Site-specific recombinases: molecular machines for the genetic revolution. Biochemical Journal, 473(6), 673–684. https://doi.org/10.1042/BJ20151112
[7] Raue, A., Kreutz, C., Maiwald, T., Bachmann, J., Schilling, M., Klingmüller, U., & Timmer, J. (2009). Structural and practical identifiability analysis of partially observed dynamical models by exploiting the profile likelihood. Bioinformatics, 25(15), 1923–1929. https://doi.org/10.1093/bioinformatics/btp358
[8] Gutenkunst, R. N., Waterfall, J. J., Casey, F. P., Brown, K. S., Myers, C. R., & Sethna, J. P. (2007). Universally sloppy parameter sensitivities in systems biology models. PLoS Computational Biology, 3(10), e189. https://doi.org/10.1371/journal.pcbi.0030189
[9] Smith, M. C. M., & Thorpe, H. M. (2002). Diversity in the serine recombinases. Molecular Microbiology, 44(2), 299–307. https://doi.org/10.1046/j.1365-2958.2002.02891.x
[10] Saunders, S. H., & Ahmed, A. M. (2024). Mapping orthogonality and recombination efficiency across serine integrase site variants. Nucleic Acids Research, 52(8), 4234–4249. https://doi.org/10.1093/nar/gkae227
[11] Ghosh, P., Wasil, L. R., & Hatfull, G. F. (2006). Control of phage Bxb1 excision by a novel recombination directionality factor. PLoS Biology, 4(6), e186. https://doi.org/10.1371/journal.pbio.0040186
[12] Savinov, A., Pan, J., & Ghosh, P. (2012). Mechanisms of the phage Bxb1 integrase directionality factor gp47. Virology, 429(2), 129–135. https://doi.org/10.1016/j.virol.2012.04.011
[13] Ghosh, P., Kim, A. I., & Hatfull, G. F. (2003). The orientation of mycobacteriophage Bxb1 integration is dependent on the central dinucleotide of its attachment sites. Molecular Cell, 12(5), 1101–1111. https://doi.org/10.1016/S1097-2765(03)00429-3
[14] Block, D. H., Hussein, R., Liang, L. W., & Lim, H. N. (2012). Regulatory consequences of gene translocation in bacteria. Nucleic Acids Research, 40(18), 8979–8992. https://doi.org/10.1093/nar/gks694
[15] Haldimann, A., & Wanner, B. L. (2001). Conditional-replication, integration, excision, and retrieval plasmid-host systems for gene structure-function studies of bacteria. Journal of Bacteriology, 183(21), 6384–6393. https://doi.org/10.1128/JB.183.21.6384-6393.2001
[16] Datsenko, K. A., & Wanner, B. L. (2000). One-step inactivation of chromosomal genes in Escherichia coli K-12 using PCR products. Proceedings of the National Academy of Sciences, 97(12), 6640–6645. https://doi.org/10.1073/pnas.120163297
[17] Rippe, K., Guthold, M., von Hippel, P. H., & Bustamante, C. (1995). Transcriptional activation via DNA looping: visualization of intermediate states in E. coli RNA polymerase–λPR promoter complexes by scanning force microscopy. Trends in Biochemical Sciences, 20(12), 500–506. https://doi.org/10.1016/S0968-0004(00)89099-4
[18] Schleif, R. (1992). DNA looping. Annual Review of Biochemistry, 61, 199–223. https://doi.org/10.1146/annurev.bi.61.070192.001215
[19] Yan, J., & Marko, J. F. (2004). Localized single-strand breaks strongly affect global chromatin folding in the Escherichia coli chromosome. Physical Review Letters, 93(10), 108108. https://doi.org/10.1103/PhysRevLett.93.108108
[20] Cloutier, T. E., & Widom, J. (2004). Spontaneous sharp bending of double-stranded DNA. Molecular Cell, 14(3), 355–362. https://doi.org/10.1016/S1097-2765(04)00249-7
[21] Shimada, J., & Yamakawa, H. (1984). Ring-closure probabilities for twisted wormlike chains. Macromolecules, 17(4), 689–698. https://doi.org/10.1021/ma00134a017
[22] Raj, A., & van Oudenaarden, A. (2008). Nature, nurture, or chance: stochastic gene expression and its consequences. Cell, 135(2), 216–226. https://doi.org/10.1016/j.cell.2008.09.050
[23] Elowitz, M. B., Levine, A. J., Siggia, E. D., & Swain, P. S. (2002). Stochastic gene expression in a single cell. Science, 297(5584), 1183–1186. https://doi.org/10.1126/science.1070919
| Symbol | Definition | Value | Unit | Source |
|---|---|---|---|---|
| Recognition & Binding | ||||
| kon | attP–Bxb1 → CP | 4.3 × 105 | M−1s−1 | Literature |
| koff | CP dissociation | 6.5 | s−1 | Literature |
| Kd | Dissociation constant | 15 | nM | koff/kon |
| Synapsis & Cleavage | ||||
| kf1 | Single-end synapsis | 1.7 × 10−3 | s−1 | One-to-one fit |
| ksynr | Synaptic disassembly | 3.3 × 10−5 | s−1 | One-to-one fit |
| kcat | Strand exchange | 8.3 × 10−4 | s−1 | Literature |
| Two-to-Two Specific | ||||
| kf2 | Ring-closure (L = L0) | 1.7 × 10−5 | s−1 | 240-min fit |
| krel | Single-end failure (L = L0) | 5.9 × 10−4 | s−1 | 240-min fit |
| α | Length penalty (kf2) | 1.5 × 10−3 | bp−1 | Length series |
| β | Length penalty (krel) | 1.0 × 10−3 | bp−1 | Length series |
| L0 | Reference length | 1039 | bp | Shortest construct |
| Effective Parameters | ||||
| fssAP | ssAP activation | 0.05–0.10 | — | M3 output |
| asite | Locus accessibility | 0.28–0.40 | — | Position dependent |
| gloop | Looping geometry | 0.30–0.40 | — | DNA topology |
| Ieff | Effective Bxb1 | 1–2 | μM | Itot × fssAP |
| PTX | Transformation efficiency | 0.017 | — | 1700/100,000 |
Length-dependence update: The early fit yielded α = 1.5 × 10⁻³ and β = 1.0 × 10⁻³ bp⁻¹. Subsequent tri-point regression across 1039/1605/1613 bp constructs refined these to αmech = βmech = 8.5 × 10⁻⁴ bp⁻¹ (SLOPE = 1.70 × 10⁻³ bp⁻¹), ensuring consistency with the logit-linear fit reported in M5 (Fig. 1).
| Locus | Distance from oriC (Mb) | asite (mean ± CI) | Relative Efficiency (normalized to hisA) |
Reference |
|---|---|---|---|---|
| hisA | 0.20 | 0.40 ± 0.02 | 1.00 | [3],[14] |
| metA | 0.35 | 0.37 ± 0.02 | 0.77 | [3],[14] |
| galK | 1.45 | 0.33 ± 0.02 | 0.32 | [3],[14] |
| leuD | 1.55 | 0.29 ± 0.02 | 0.08 | [3],[14] |
| Length (bp) | Observed Success (%) | Fitted Success (%) | Residual | Predicted kf2(L) (s−1) | Predicted krel(L) (s−1) |
|---|---|---|---|---|---|
| 1039 | 87.5 | 87.6 | +0.1 | 1.70×10−5 | 5.90×10−4 |
| 1605 | 80.0 | 79.3 | −0.7 | 9.8×10−6 | 6.5×10−4 |
| 1613 | 79.0 | 79.1 | +0.1 | 9.7×10−6 | 6.6×10−4 |
Fit statistics:
α̂ = 1.5 × 10⁻³ bp⁻¹ (95% CI: 1.2–1.9 × 10⁻³)
β̂ = 1.0 × 10⁻³ bp⁻¹ (95% CI: 0.8–1.3 × 10⁻³)
RMSE = 0.53%; R² = 0.991; Cross-validation residual < ±1.2%.
We present the M5 Experimental Design Assistant, an interpretable modeling framework that quantitatively deconstructs the integration efficiency of the Bxb1 serine integrase into three mechanistic layers —
the protein layer (expression and catalytic dynamics),
the DNA layer (sequence composition and structural configuration), and
the population layer (transformation efficiency and selection output).
Calibrated with experimental datasets, M5 generates table-based success-rate predictions and standard operating procedure (SOP) recommendations, transforming empirical trial-and-error workflows into quantitative, design-driven decision-making.
For the production mode (Mode B) — representing the second-round two-to-two integration process — M5 reveals that under a 30-minute reaction window and 0.3–1.0 μM (300–1000 nM) Bxb1 concentration range, insert fragment length becomes the dominant determinant of recombination success.
In our host strain, fragments between 1 kb and 2 kb consistently achieve ≥ 80 % integration efficiency under validated conditions.
Beyond efficiency prediction, Mode A provides genomic site-selection maps and population-level output estimations for first-round integrations, allowing researchers to finalize condition selection and plate-number planning before PCR primer design or transformation experiments.
Internally, the M5 framework is affectionately nicknamed “Salmon” — symbolizing its protein-based architecture and its ability to swim upstream through experimental noise, adapting seamlessly across different species, experimental setups, and host strains.
The interactive dashboard allows users to choose between Mode A (Site Selection) and Mode B (Production Mode).
Input genomic parameters (distance to oriC, GC fraction, homology arm length, induction time) to receive automated success-rate predictions and experimental recommendations.
Each prediction dynamically links to the calibrated M1–M5 parameter set and returns interpretable quantitative guidance for wet-lab design.
M5 visualizes integration probability as a function of insert length, Bxb1 concentration, and induction time.
The green curve represents model predictions (logit-linear fit), while red dots show wet-lab validation data (1039 bp, 1605 bp, 1613 bp).
The average absolute error (MAE) is ≈ 0.37%, demonstrating the model’s high predictive accuracy within the validated 1–2 kb range.
This table provides instant success-rate lookup for pre-installed attB strains under validated conditions (387 nM Bxb1 × 30 min).
Each entry shows the expected integration success rate, recommended number of plates, and total colonies required for a desired target yield.
Results are computed from the M5 logit-linear model ($ P = \frac{1}{1 + \exp\!\big(-\,(a - b \cdot L)\big)} $) with parameters a = 3.0161, b = 1.03 × 10⁻³ /bp.
The integration efficiency of serine integrases is highly sensitive to both sequence context and geometric configuration, resulting in substantial variability across loci and experimental conditions. Traditional workflows rely heavily on empirical trial-and-error, often leading to poor reproducibility.
Building upon the mechanistic insights and calibrated parameters established in the M4 model—including the distance-decay constant (kd), length sensitivity (SLOPE), and global scaling factor (G)—the M5 Experimental Design Assistant transforms these validated regularities into an actionable predictive tool.
With only minimal user input, M5 produces lookup-table success rate predictions and SOP-based recommendations, converting uncertain wet-lab experimentation into quantitative, design-driven decisions.
The framework is organized into two operational modes:
Much like a salmon swimming upstream, M5 reverses the traditional flow of experimentation — transforming random trial-and-error into a predictable design process guided by quantitative principles.
M5 maps each experimental “control knob” onto interpretable quantitative factors whose multiplicative combination yields the overall probability of successful integration.
The integrated predictive form used in the main text applies a logit-linear model $$ (a = 3.0161,\; b = 1.03 \times 10^{-3}\,\mathrm{bp}^{-1}) $$ coupled with a mild Hill-type concentration normalization $$ (K = 0.20 μM, n = 6, C_{\text{ref}} = 0.387 μM) $$ for Mode B predictions. Detailed mechanistic parameters—such as $$ (k_{f2}), (k_{rel}), (G) $$ , and SLOPE—are provided in the Supplementary Information.
$$ \text{Expected positive colonies} = N_{\text{CFU}} × P_{\text{prod}} × α_{\text{pop}} $$ $$ \text{Required plates} = \lceil \text{Target positives} \, / \, \text{Expected positives per plate} \rceil $$
Assume —
$$ \text{Expected positives per plate} = 30 × 0.82 × 1.0 = 24.6 ≈ 25 $$ $$ \text{Required plates} = \lceil 50 / 25 \rceil = 2 $$
Two plates are sufficient; for added safety, plate three (recommended considering binomial variation).
Under a fixed reaction time of 30 minutes, the curves for 0.3–1.0 μM (300–1000 nM) almost completely overlap, indicating that integration efficiency has reached a saturation plateau. At 0.2 μM (= 200 nM), efficiency decreases noticeably.
All three experimental anchor points (1039 bp: 87.5%; 1605 bp: 79.0%; 1613 bp: 80.0%) lie within the model’s prediction band, yielding a mean absolute error (MAE) of 0.37%.
Shaded regions represent binomial 95% confidence intervals (n = 3 biological replicates).
The horizontal red dashed line marks the SOP threshold P = 0.8: in our host strain, inserts of 1.0–1.6 kb reliably achieve ≥ 80% success under standard conditions.
Extrapolation caution: predictions become uncertain for L > 4 kb, GC < 30% or > 70%, or C > 0.8 μM (dashed curves indicate low-data regions).
The Mode B core adopts a logit-linear model:
$$\text{logit}(P) = a - bL$$
$$P = \frac{1}{1 + \exp[-(a - bL)]}$$
Parameters:
A Hill-type normalization is applied relative to the reference concentration 0.387 μM (= 387 nM):
$$ f_{\text{conc}}(C) = \frac{C^n}{K^n + C^n} \Bigg/ \frac{C_{\text{ref}}^n}{K^n + C_{\text{ref}}^n} $$
Parameters:
This yields nearly identical outcomes for 0.3–1.0 μM, with a slight drop at 0.2 μM.
$$ P_{\text{prod}} = \min(P × f_{\text{conc}}, 0.95) $$
Figure legend
Within the range of 0.3–1.0 μM (= 300–1000 nM), the heatmap exhibits a stable plateau region (yellow–green zone), indicating consistent integration performance.
The vertical white dashed line denotes the reference concentration (Cref = 0.387 μM = 387 nM), along which users can directly read predicted success probabilities (Pprod) for different insert lengths.
Extrapolation caution: predictions are less certain for L > 4 kb or extreme GC-content regions.
Computation:
$$ P_{\text{prod}}(L, C) = \min [P_{\text{base}}(L) \times f_{\text{conc}}(C),\ 0.95] $$
The heatmap is generated over L ∈ [0.5, 4.0] kb and C ∈ [0.1, 1.0] μM, with grid resolution ΔL = 0.01 kb and ΔC = 0.01 μM.
Figure legend
Under standard conditions (0.387 μM (= 387 nM) for 30 minutes), this plot converts the single-cell success probability (Pprod) into the expected number of positive colonies per plate, facilitating experimental and manpower planning.
The scaling factor (αpop) represents the population amplification factor (typically 1.0, adjustable via recovery time or inoculation protocol).
Curves are shown for NCFU = 10, 20, 30, and 40 colonies per plate.
Computation:
$$ E[\text{positive colonies}] = N_{\text{CFU}} \times \min(\alpha_{\text{pop}} \times P_{\text{prod}},\ 1) $$
Parameter definitions:
Designed for first-round integration (Mode A), this heatmap prioritizes genomic loci according to their relative likelihood of successful recombination.
Yellow–green regions indicate high-probability zones (recommended for integration), whereas dark-purple regions denote low-probability zones (to be avoided).
The distance effect is modeled as:
$$ F_{\text{distance}}(d) = e^{-k_d \cdot d} $$
where greater distance from oriC exponentially decreases the expected integration success rate.
This model assumes that loci near oriC have higher accessibility and earlier replication timing, making them easier to integrate; conversely, distal loci exhibit reduced efficiency.
If a specific host background exhibits the opposite trend, users can modify the distance-decay function and re-fit (kd).
The GC-content effect penalizes deviation from 50 % GC, as extremes in GC composition alter DNA melting temperature and protein-binding stability, thereby reducing recombination efficiency.
Applicable range:
(d < 3 Mb); GC = 30–70 %.
Predictions beyond these limits are uncertain.
This map can be used in the primer-design stage to eliminate high-risk loci before synthesis or cloning.
$$ P \propto f_{\text{GC}}(gc) \cdot F_{\text{distance}}(d) $$
$$ f_{\text{GC}}(gc) = \exp[-\gamma (gc - 0.5)^2] $$
$$ F_{\text{distance}}(d) = \exp[-k_d \cdot d] $$
The effective operational window for the protein layer—30–60 min / 0.3–1.0 μM—corresponds to the time-over-threshold (ToT) concept derived from the M1–M2 transcription-translation models.
Within this region, the Bxb1 integrase concentration remains above the functional threshold required for recombination.
This window therefore defines the standard operating condition (SOP) for Mode B optimization, explaining why a 30-minute induction consistently achieves stable integration efficiency without extended induction periods.
$$ φ_{\text{tot}} = φ_{\text{Bxb1}}(C) \times φ_{\text{time}}(t) \times φ_{\text{ssAP}}(t_{\text{ssAP}}) $$
Under conditions of low DNA uptake or high heterogeneity (large CV), the population-level output tends to decrease.
This map serves as a diagnostic and decision-support tool, indicating whether additional plates, modified electroporation parameters, or workflow optimization are required to reduce cell-to-cell variability.
Each single cell’s DNA uptake is assumed to follow a log-normal distribution: $$ σ = \sqrt{\ln(\text{CV}^2 + 1)}, \quad μ = \ln(\text{mean}) - \frac{σ^2}{2} $$ The single-cell integration probability is computed as: $$ P_{\text{cell}} = \min (P_{\text{calibrated}} \cdot G^{αβ} \cdot \text{uptake}, 0.8) $$ Population averages are then derived via Monte Carlo simulations (10,000 cells) or numerical integration, and the resulting values are plotted as heatmaps.
When the spacer length between paired attP and attB sites is adjustable, the model predicts an optimal “sweet spot” around 80 ± 40 bp, forming the most stable geometric configuration.
Spacers that are either too short or too long reduce integration efficiency, as they disrupt favorable DNA looping conformations.
This parameter primarily applies to custom two-site landing pad designs and is generally unnecessary for standard single-site workflows.
At the molecular-reaction level, the overall integration success is governed by the equilibrium between the forward rate ( $ k_{f2} $, synapsis formation rate) and the release rate ( $ k_{rel} $ , ineffective dissociation rate).
This map is primarily intended for mechanistic interpretation and sensitivity analysis, rather than general user decision-making.
A global scaling factor (G = 6.632173469) is applied to the base forward rate $ k_{f2} $ : $$ k_{f2,\text{effective}} = k_{f2,\text{base}} \times G $$ The scaling factor was globally calibrated using the three experimental anchor points from Mode B, ensuring consistency between the mechanistic and macroscopic success models.
Our framework adopts a three-layer decomposable design, where each layer carries a distinct biological interpretation and is experimentally verifiable.
This layer models the combined effects of enzyme concentration and reaction time using Hill-type kinetics and temporal saturation functions:
Key finding: At 30 min / 0.3–1.0 μM, $\phi_{tot}$ consistently plateaus, establishing the SOP stability window for Mode B.
This layer integrates sequence- and geometry-dependent factors into a composite multiplicative model:
At the population scale, transformation heterogeneity (mean uptake and CV) is modeled via a log-normal distribution, coupled with a binomial sampling process to convert single-cell probability ($P_{\text{prod}}$) into expected positive colony counts. This enables direct linkage between model predictions and experimental planning.
Concept:
Different concentration domains are used for distinct contexts.
Mode A spans the entire range for activation modeling, whereas Mode B focuses on fine-tuning near the operational plateau.
Implementation details:
This adjustment reproduces the empirically observed behavior— a stable plateau from 200–400 nM, with a mild decline below 200 nM—consistent with the weak concentration dependence identified in Figure 1.
Mode A and Mode B serve complementary purposes rather than substitutes:
Three experimental validation points were used for model fitting:
| Insert (bp) | Observed (%) | Predicted (%) | Error (%) |
|---|---|---|---|
| 1039 | 87.5 | 87.5 | 0.0 |
| 1605 | 79.0 | 79.6 | +0.6 |
| 1613 | 80.0 | 79.5 | −0.5 |
| MAE | – | – | 0.37 |
Calibration workflow:
In our dataset, single-antibiotic vs dual-antibiotic selection showed no significant difference
(p > 0.05, n = 3; unpaired t-test).
Future replicates will expand biological sample size to strengthen robustness.
| Module | Parameter | Value | Unit | Purpose / Description |
|---|---|---|---|---|
| Mode B – Length dependence | a | 3.0161 | — | Calibrated from 1039 bp → 87.5 % efficiency |
| b | 1.03 × 10⁻³ | /bp | Fitted slope from three anchor points | |
| Mode B – Concentration dependence | K | 0.20 (= 200 nM) | µM | Local half-saturation constant for normalization |
| n | 6 | — | Hill coefficient (slope of plateau) | |
| Cref | 0.387 (= 387 nM) | µM | Default reference concentration (Figs. 1–2) | |
| Protein Layer (Mode A) | C₅₀ | 1.0 | µM | Global half-saturation concentration for activation curve |
| nH | 1 | — | Hill coefficient | |
| T₅₀ | 30 | min | Half-saturation time of reaction completion | |
| TssAP50 | 20 | min | Half-saturation time for ssAP expression | |
| tssAP | 45 | min | Fixed induction duration of ssAP | |
| DNA Layer | kd | 7 × 10⁻⁴ | /kb | Exponential decay constant for distance to oriC |
| γ | 10 | — | GC-content deviation penalty strength | |
| Mechanistic Layer | G | 6.632173469 | — | Global scaling factor applied tokf2 |
| αmech,βmech | 8.5 × 10⁻⁴ | /bp | Length-sensitivity partition coefficients | |
| Population Layer | αpop | 1.0 (default) | — | Experimental population amplification factor |
| Spacer Geometry | S₀ | 80 | bp | Optimal spacer length (sweet-spot) |
| σspacer | 40 | bp | Acceptable spacer deviation range | |
| Upper Bound | cap | 0.95 | — | Probability ceiling for Mode B output |
| SOP Threshold | Pthreshold | 0.80 | — | Decision boundary (dashed line in Fig. 1) |
If other teams wish to recalibrate M5 using their own experimental datasets, the following minimal requirements and procedure are recommended.
All model codes, validation datasets, parameter files, and figure-generation scripts are openly available at:
https://gitlab.igem.org/woonak32/M5-Bxb1-Design-Tool
Repository structure:
Validation data (m5_observed.csv):
| insert_length_bp | concentration_uM | time_min | success_rate | n_replicates | notes |
|---|---|---|---|---|---|
| 1039 | 0.387 | 30 | 0.875 | 3 | anchor_point_1 |
| 1605 | 0.387 | 30 | 0.790 | 3 | anchor_point_2 |
| 1613 | 0.387 | 30 | 0.800 | 3 | anchor_point_3 |
Batch query input (example_queries.csv):
| insert_length_bp | concentration_uM | time_min |
|---|---|---|
| 800 | 0.387 | 30 |
| 1200 | 0.387 | 30 |
| 2000 | 0.5 | 45 |
Output results (results.csv):
| insert_length_bp | concentration_uM | time_min | predicted_success_rate | confidence_interval_95 |
|---|---|---|---|---|
| 800 | 0.387 | 30 | 0.891 | [0.856, 0.926] |
| 1200 | 0.387 | 30 | 0.852 | [0.812, 0.892] |
| 2000 | 0.5 | 45 | 0.748 | [0.698, 0.798] |
Users can recalibrate the model using their own validation datasets:
License: MIT License (permitting modification and commercial use)
Citation format:
[NYCU-Formosa]. (2025). M5 Experimental Design Assistant for Bxb1 Integration.
iGEM 2025.GitHub
https://github.com/[YourTeam]/M5-Bxb1-Design-Tool
BibTeX:
We welcome community contributions in:
Within the team, M5 is often referred to as the “Salmon model”, symbolizing its multi-protein foundation and its capacity to traverse biological systems with the same precision and perseverance as its namesake species.
The M5 Experimental Design Assistanttransforms Bxb1 integration fromtrial-and-errortoquantitative design, providing:
Core advantages:
Open-source commitment:
All codes, datasets, and parameters are publicly available on GitHub, supporting community reuse, extension, and collaborative improvement.
Q1. Why use a logit-linear model instead of polynomial or exponential fits?
A. The logit transformation naturally constrains probabilities to [0, 1], and the relationship between insert length (L) and success rate is approximately linear in logit space. Three experimental anchors yield MAE = 0.37 %, validating the model’s simplicity and accuracy. Polynomial fits tend to overfit; exponential models cannot maintain bounded probabilities.
Q2. The 0.3–1.0 μM concentration curves overlap—are the parameters inconsistent?
A. No. Mode A uses a global activation curve (C₅₀ = 1 μM, nH = 1) to describe full-range activity, while Mode B applies local normalization (K = 0.20 μM, n = 6, Cref = 0.387 μM) for the 30-min production window. Both are valid in their respective operational contexts.
Q3. Why is 30 min of induction sufficient?
A. The time-saturation function φtime = t / (T₅₀ + t) with T₅₀ = 30 min reaches the plateau between 30–60 min (see Fig. 5), with marginal ( < 5 %) improvement beyond that. Extended induction increases metabolic burden and Bxb1 cytotoxicity, offering no cost-benefit gain.
Q4. Does single- vs. dual-antibiotic selection affect efficiency?
A. Current data (n = 3) show no significant difference (unpaired t-test, p > 0.05). Larger datasets are being collected for statistical confirmation. The current model assumes no dependency on resistance-marker configuration.
Q5. Why is the SOP threshold set at P ≥ 0.8?
A. At this threshold, one to two plates typically yield sufficient positive clones (see Fig. 3), balancing success rate and labor cost. Users requiring higher certainty may adopt P ≥ 0.85, with proportionally more plating.
Q6. My strain performs well even at 0.2 μM—does that invalidate the model?
A. Not necessarily. 0.2 μM corresponds to a “sub-optimal but functional” regime. If your strain achieves near-plateau efficiency there, recalibrate K and n using the cross-strain calibration workflow. This reflects strain-dependent Bxb1 expression efficiency.
Q7. How do I measure αpop (the population amplification factor)?
A.
Q8. Can this model be adapted for other integrases (e.g., ΦC31)?
A. Yes. The framework is transferable. Re-fit a and b using 2–3 experimental anchors and update integrase-specific parameters (K, n, kd, γ, T₅₀) from literature. Start with ΦC31 kinetic data as initial estimates and perform calibration as described in Section 9.
Q9. Are spacer and length effects independent?
A. Currently modeled multiplicatively (fspacer × Pbase). If future data reveal interactions (e.g., long inserts more sensitive to spacer), an interaction term fSL(S,L) can be introduced. Existing data support the independence assumption.
Q10. Does the model account for off-target or reverse recombination events?
A. No. The model estimates on-target success only. Off-target rates depend on construct design and detection methods (PCR, long-read sequencing). For high-specificity attP/attB pairs such as Bxb1, off-target rates are typically < 1 %.
Q11. How are extrapolation zones handled?
A. Regions beyond empirical coverage (> 4 kb, GC < 30 % or > 70 %, > 0.8 μM) are flagged as high-uncertainty zones. Figures display these areas with faded or dashed lines, and captions explicitly note the uncertainty. Users are advised to validate experimentally before scaling up.
Q12. How is reproducibility ensured?
A. The GitHub repository provides:
All figures can be regenerated using the same parameter file.
Example commands:
pip install -r requirements.txt
python plot_figures.py --figure 1 --output fig1.png
python m5_tool.py --input queries.csv --output results.csv
Salmon — the M5 model’s codename — embodies our design philosophy: built on proteins, adaptive across species, and swimming upstream toward clarity amid experimental noise.
The M5 framework establishes a reproducible, interpretable, and strain-transferable standard for integrase-based genome engineering, bridging predictive modeling and experimental design through open-source, community-driven development.
M5 — the Salmon model. Built on proteins, tested on genomes, and guided by science.
Through the M1–M5 modeling framework, we established an interpretable, quantitative system that deconstructs Bxb1-mediated genome integration into three interconnected layers — protein, DNA, and population.
Both simulations and experimental data consistently revealed that integration success is not determined by a single kinetic parameter, but by the combined influence of replication timing, DNA geometry, and protein expression dynamics.
Integration efficiency exhibited a clear hierarchy across genomic loci: hisA and metA achieved the highest success rates, galK showed intermediate performance, while leuD and lacZ were markedly lower.
This gradient mirrors the loci’s relative distance from oriC, indicating that replication fork timing is the primary determinant of integration efficiency.
Regions near oriC are duplicated early and remain more accessible to recombineering enzymes, whereas loci near ter replicate later, limiting accessibility.
Notably, lacZ is physically positioned near hisA and galK, not in the terminal region. Its low success rate—comparable to leuD—stems from its two-to-two configuration, which requires recombination across two pairs of attB/attP sites, unlike the single pair used in one-to-one designs.
This contrast reveals that integration efficiency is shaped not only by replication timing but also by geometric coordination within the DNA architecture.
In a one-to-one configuration, recombination proceeds independently at a single attB/attP pair.
By contrast, the two-to-two configuration demands synchronized recombination at both ends, introducing challenges in molecular alignment and spatial coordination.
Our model predicts that if one end completes recombination prematurely, Bxb1 molecules may become transiently trapped at that site, leading to temporal asymmetry and stalling of the opposite reaction.
This explains the lower initial efficiency observed for lacZ, which reflects geometric instability rather than locus position.
While the two-to-two configuration introduces additional geometric constraints, it uniquely enables full cassette integration within a single step, a feat impossible for one-to-one architectures.
Our simulations revealed that the inefficiency of dual-end recombination is not intrinsic to the mechanism but results from unoptimized parameters such as inter-site spacing, enzyme concentration, and induction timing.
Once these were optimized in the M5 framework, the predicted success rate of two-to-two events converged with that of one-to-one integrations.
This finding redefines two-to-two recombination as a design-driven advantage rather than a structural limitation.
By quantifying geometric coordination, the model transforms a former bottleneck into a high-performance, scalable solution for production-level genome engineering — proving that dual-end integration can be both complete and efficient when designed systematically.
The first two-to-two process (installation of dual attB sites) and the second two-to-two process (dual-end cassette integration) represent distinct mechanistic stages.
The first is probability-limited, governed by ssAP expression level and replication-fork timing, resulting in a very low success rate (~0.00195%).
The second occurs only in cells already containing dual attB sites; it is kinetics-limited, dominated by Bxb1 binding and catalytic efficiency.
Under optimized conditions (0.3–1.0 μM Bxb1 and short inter-site spacing), integration proceeds rapidly, reaching success rates exceeding 80%.
This dramatic contrast marks a shift from probabilistic to kinetic limitation, clarifying why integration efficiency can increase by four orders of magnitude between the two stages.
Insert length strongly affects dual-end recombination success.
Both simulations and experiments show that fragments of 1–2 kb yield consistently high efficiencies, while inserts longer than ~3 kb result in a sharp decline.
This pattern reflects a trade-off between reaction kinetics and molecular proximity: longer fragments reduce the likelihood that both ends align simultaneously, thereby lowering the probability of complete integration.
Thus, insert length emerges as a highly tunable design parameter in optimizing two-to-two architectures.
The M1–M2 models predict optimal induction durations of ≈45 minutes for ssAP and ≈240 minutes for Bxb1. When residual transcription during the cooling phase was incorporated, the simulated protein accumulation curves matched experimental observations. These results show that fine-tuning expression timing determines the achievable “integration ceiling”: operating within the predicted windows significantly increases both attB formation and total integration success.
The M3 model simulates attB formation as a stochastic single-cell event and successfully reproduces the experimentally observed “polarized” outcomes, where most cells fail to integrate while a minority succeed. This heterogeneity arises from replication-fork gating: integration occurs primarily during the brief window when the replication fork passes through the target locus, rendering DNA temporarily accessible to ssAP-mediated recombination. Replication timing, therefore, is the underlying driver of both population heterogeneity and the oriC–ter efficiency gradient.
Together, these models define three interacting principles:
The M5 model, implemented as our open-source software GenOMe Navigator, integrates these three layers into a predictive design assistant. Users can input locus position, insert length, and induction schedule to estimate integration efficiency and receive quantitative SOP recommendations, transforming experimental planning from trial-and-error into rational, data-driven design.
While the M5 framework already reproduces experimental trends with strong fidelity,several refinements could further elevate its predictive precision:
These extensions represent next-generation upgrades rather than current limitations— laying the foundation for M5 to evolve into a fully predictive genome-design platform.
Integration of modeling and experimental data yields the following key insights:
Integration efficiency follows an oriC-to-ter gradient defined by replication timing.
Two-to-two geometry, once optimized, achieves high-efficiency dual-end recombination comparable to one-to-one events.
The first and second two-to-two rounds are governed by distinct kinetic bottlenecks — the former by attB formation probability, the latter by enzymatic and geometric factors.
A 1–2 kb length window represents the optimal trade-off between completeness and efficiency.
Protein induction timing defines the integration ceiling.
The M5 model (GenOMe Navigator) converts these multilayer rules into a reproducible, quantitative, and design-driven framework for genome engineering.
In essence, integration success emerges from the interplay of timing, geometry, and kinetics.
The locus provides opportunity, the geometry sets boundaries, and protein dynamics determine attainment.
Through calibrated modeling, these elements are unified into a single predictive platform — transforming genome integration from empirical trial-and-error into a rational, engineering discipline.