Model Computer

To rationally construct a "one strain-dual production" microbial cell factory, we developed a multiscale computational framework that systematically guides experimental design through three core modules:

System Dynamics Simulation: A dynamic model based on ordinary differential equations (ODEs) was established to quantitatively guide the temporal switching of metabolic pathways between aerobic accumulation and anaerobic production.

Enzyme Engineering Design: By integrating AI-assisted prediction and structure-based design, we rationally designed enzyme variants with improved catalytic activity.

Metabolic Network Optimization: A genome-scale model was employed to identify key metabolic bottlenecks (e.g., the GapA node) and propose combinatorial strategies—such as reinforcing gapA and zwf—to enhance metabolic flux.

Together, these three modules form a closed-loop design framework spanning system regulation, component validation, and bottleneck identification, providing a predictive computational foundation for constructing efficient microbial cell factories.

All engineering code and data can be accessed at the following website:

https://github.com/Frozen-Zephyr/2025-iGEM-NJTech-China-Model

Simulating and Predicting System Dynamics via ODEs
Background

In traditional metabolic engineering, continuous enzyme expression often causes two major problems: metabolic burden and the loss of precursor homeostasis. To solve this, we designed an environmentally responsive enzyme-controlled release system. Self-assembling intrinsically disordered proteins (A-IDPs) were employed as molecular packaging materials to encapsulate lysine decarboxylase (CadA) immediately after synthesis, thereby maintaining it in a catalytically inactive state. A protease recognition site acts as a molecular switch. Upon transition from aerobic to anaerobic conditions, a specific protease is induced and cleaves the site and rapidly releases active CadA. This on-demand activation ensures lysine accumulation during the aerobic phase and is efficiently converted into cadaverine under anaerobic conditions.

Objective

We developed a mathematical model to describe our CadA-controlled release system. The model links signal perception → transcription → protein accumulation/release → product formation. Using ordinary differential equations (ODEs), we predicted the timing of enzyme release, cadaverine production rates, and optimal switching conditions.

Figure 1. Model Workflow.
Figure 1. Model Workflow.
Ordinary Differential Equations (ODEs) Model
1. Signal Perception and Integration Module

This module quantitatively models the CadC transcription factor as an "intelligent sensor" that integrates three environmental signals: pH, lysine, and cadaverine. We described CadC activity as a multiplicative integration of three Hill functions (pH, lysine, cadaverine), ensuring activation only when all signals meet threshold levels.. This design ensures efficient system activation only when all critical conditions—an acidic environment, sufficient substrate (lysine), and non-saturating product levels—are simultaneously met.

2. Gene Transcription Regulation Module

This module simulates how CadC activity drive the cadBA operon to produce mRNA. It employs a thermodynamic model-based equation, accurately reflecting the experimental observation that CadC binds the promoter as a dimer at two sites. This formulation predicts the dynamic of mRNA accumulation, including its steady-state level.

3. Protein Expression and Engineered Controlled-Release Module

This core module simulates our A-IDP/TEV protease system, which temporally decouples enzyme "synthesis" from "activity."

Aerobic Phase: The model describes how newly synthesized CadA is packaged and stored as an inactive "encapsulated enzyme".

Signal Triggering: At a predefined time point, the model simulates an instantaneous cleavage event, converting all accumulated "encapsulated enzyme" into active "free enzyme."

Anaerobic Phase: Post- trigger, any newly synthesized CadA exists directly in the active "free enzyme" state, enabling continuous catalysis.

4. Metabolic Conversion and Product Formation Module

This module uses Michaelis-Menten kinetics to describe the enzymatic reaction. It is directly coupled with the upstream module, as the instantaneous concentration of "free enzyme" sets the reaction rate. This integration allows for precise prediction of key production metrics: cadaverine titer, yield, and the maximum production rate.

Kinetic simulation and key predictive results
1. Signal Perception and Integration

To elucidate how pH, lysine, and cadaverine collectively regulate CadA expression, we developed a quantitative model for the integration of these environment signals by the transcription factor CadC.

1)pH sensing: CadC undergoes a conformational change upon proton binding. The effect of pH on CadC activity was modeled using a sigmoidal function:

namely:$$\begin{array}{rl}f(\mathrm{pH})&=\frac{1}{1+10^{\frac{\mathrm{pH-pH_0}}{\Delta\mathrm{pH}}}}\end{array} \tag{Eq.1}$$

2)Lysine activation: The substrate lysine exerts a positive cooperative effect on CadC. This was modeled by a standard Hill equation: $$g(l)\quad=\frac{(l/K_l)^{n_l}}{1+(l/K_l)^{n_l}} \tag{Eq.2}$$

3)Cadaverine inhibition: As a structural analog of lysine, cadaverine acts as a competitive inhibitor, implementing a product feedback inhibition mechanism. This was modeled using the reciprocal form of the Hill function:$$h(c)\quad=\frac{1}{1+(c(t)/K_c)^{n_c}} \tag{Eq.3}$$

4)Signal integration: We assumed that the effects of these three signals are independent. Therefore, the net activity of constitutively expressed CadC, with a total constant concentration C₀, is given by a multiplicative model that calculates the fraction of active protein:$$C_{\mathrm{active}}(t)=f(\mathrm{pH})\cdot g(l(t))\cdot h(c(t)) \tag{Eq.4}$$

Table 1. Parameters for the Signal Perception and Integration Equations[1]
Parameters Definition
\(C_{\mathrm{active}}(t)\) Concentration of active CadC at time t, reflecting the functional state of the transcription factor.
\(l(t)\) Extracellular lysine concentration at time t,unit: mM.
\(c(t)\) concentration of active CadC at time t,unit:mM.
\(\text{pH}\) Extracellular cadaverine concentration at time t, unit: mM.
\(\mathrm{pH}_0\) Half-maximal activation pH for CadC, set to 6.2 based on literature.
\(\Delta pH\) Width parameter of the pH response curve, controlling the steepness of activation, set to 0.5
\(K_{l}\) Half-maximal activation constant for lysine, reflecting the affinity of CadC for lysine, set to 3.6 mM based on literature.
\(n_{l}\) Hill coefficient for lysine binding, characterizing binding cooperativity, set to 1.1.
\(K_{c}\) Half-maximal inhibition constant for cadaverine, reflecting the strength of inhibition, set to 0.235 mM.
\(n_{c}\) Hill coefficient for cadaverine binding, characterizing inhibitory cooperativity, set to 2.8.
2. Transcriptional Regulation Level

The cadBA promoter contains two adjacent binding sites for the CadC, exhibiting positive cooperativity. We employed a thermodynamic model of transcription to describe this process, based on the following core assumptions:

①. Transcription factor-DNA binding reaches rapid equilibrium.

②. Different promoters states have distinct transcriptional activities.

③. The system maintains a quasi-steady state.

The probability of each promoter state is determined by its statistical weight, which is related to the free energy of the system:$$W_i=e^{-\frac{\Delta G_i}{RT}}\tag{Eq.5}$$

where \(\Delta G_{i}\) is the free energy of state \({i}\),\({R}\) is the gas constant, and \({T}\) is the absolute temperature.

We considered three possible promoter states:

State 0: Empty Promoter

Statistical Weight :\(W_{0}=1\) (reference state)

Transcriptional activity:\(A_{0}=1\) (basal level)

State 1: Single CadC dimer bound

Activated CadC forms dimers in solution. The dimerization equilibrium is: $$2CadC\rightleftharpoons CadC_2$$

Let CadC monomer concentration be denoted as \(C\). The dimerization equilibrium expression is:$$K_d=\frac{[CadC_2]}{[C]^2}$$

This leads to:$$[\mathrm{CadC}_2]\propto C^2$$

The binding equilibrium constant for the first site is:$$K_1=\frac{[\mathrm{P}_1]}{[\mathrm{P}_0][\mathrm{CadC}_2]}\tag{Eq.6}$$

Where \([\mathrm{P}_0]\) is the concentration of unbound promoters and \([\mathrm{P}_1]\) is the concentration of promoters with a single site occupied.

Integrating the above, we obtain the statistical weight expression:$$\frac{[\mathrm{P}_1]}{[\mathrm{P}_0]}=K_1\cdot[\mathrm{CadC}_2]\propto K_1\cdot C^2\tag{Eq.7}$$

Subsequently, we define \(K_{\mathcal{C}}\) as the half-saturation concentration for CadC dimer binding to DNA. Consolidating the proportionality into this defined constant yields:$$W_1=\left(\frac{C}{K_C}\right)^2\tag{Eq.8}$$

The transcriptional activity is assumed to be \(A_1=1\)(assuming binding to a single site does not significantly alter transcription).

The situation of the second binding site is the same as that of the first:\(W_2=\left(\frac{C}{K_C}\right)^2\)

The transcriptional activity is similarly assumed to be \(A_2=1\).

State 2: Two CadC dimers bound

If the two sites were completely independent, the statistical weight would be:$$W_3^{\mathrm{indep}}=\left(\frac{C}{K_C}\right)^2\cdot\left(\frac{C}{K_C}\right)^2=\left(\frac{C}{K_C}\right)^4\tag{Eq.9}$$

On this basis, we correlate it with the free energy of cooperative binding. First, the free energy for independent binding is:$$\Delta G_{\mathrm{indep}}=\Delta G_1+\Delta G_2$$

Where \(\Delta G_{1}\) and \(\Delta G_{2}\) are the free energy changes for binding to the two individual sites separately.

When cooperative effects exist, the free energy change for simultaneous binding at both sites is:$$\Delta G_{\mathrm{coop}}=\Delta G_1+\Delta G_2+\Delta G_c$$

Where \(\Delta G_{c}\) is the free energy change associated with cooperative binding.

According to statistical mechanics, the relationship between the ratio of statistical weights and the free energy change is:$$\frac{W_{\mathrm{coop}}}{W_{\mathrm{indep}}}=e^{-\frac{\Delta G_{\mathrm{coop}}-\Delta G_{\mathrm{indep}}}{RT}}=e^{-\frac{\Delta G_c}{RT}}\tag{Eq.10}$$

Thus, we define the cooperativity factor:$$f=e^{-\frac{\Delta G_c}{RT}}\tag{Eq.11}$$

This yields:$$W_\mathrm{coop}=f\cdot W_\mathrm{indep}=f\cdot\left(\frac{C}{K_C}\right)^4$$

If \(\Delta G_{c} < 0\): Positive cooperativity, \(f > 1\), indicating that the binding of one CadC dimer facilitates the binding of the other.

If \(\Delta G_{c} = 0\): No cooperativity, \(f = 1\), binding is completely independent.

If \(\Delta G_{c} > 0\): Negative cooperativity, \(f < 1\), indicating that the binding of one CadC dimer inhibits the binding of the other.

Transcriptional Activity: \(A_3=f\) (maximum activation)

We can now formulate the complete partition function:$$Z=W_0+2W_1+W_3=1+2\left(\frac{C}{K_C}\right)^2+f\left(\frac{C}{K_C}\right)^4\tag{Eq.12}$$

The transcriptional activity is similarly assumed to be \(A_3=f\).

The total transcriptional activity is:$$A=A_0+A_1+A_2+A_3=1+2\left(\frac{C}{K_C}\right)^2+f\left(\frac{C}{K_C}\right)^4$$

The probability of each state is:$$P_i=\frac{W_i}{Z}$$

Therefore, the average transcriptional activity is:$$\begin{aligned}\left\langle A\right\rangle&=\frac{\sum A_iW_i}{\sum W_i}\\&=\frac{A_0W_0+A_1W_1+A_2W_2+A_3W_3}{Z}\\&=\frac{1\cdot1+a\cdot\left(\frac{C}{K_C}\right)^2+a\cdot\left(\frac{C}{K_C}\right)^2+f\cdot f\left(\frac{C}{K_C}\right)^4}{1+2\left(\frac{C}{K_C}\right)^2+f\left(\frac{C}{K_C}\right)^4}\\&=\frac{1+2a\left(\frac{C}{K_C}\right)^2+f^2\left(\frac{C}{K_C}\right)^4}{1+2\left(\frac{C}{K_C}\right)^2+f\left(\frac{C}{K_C}\right)^4}\end{aligned}\tag{Eq.13}$$

Considering the symmetry of the two binding sites and the cooperative effect, we assume \(a=1\) to further simplify the expression:$$\langle A\rangle=[\frac{1+(\frac{C}{K_C})^2f}{1+(\frac{C}{K_C})^2}]^2$$

Finally, incorporating mRNA synthesis and degradation processes, we obtain the complete kinetic equation:$$\frac{dm}{dt}=\alpha_m\cdot\left[\frac{1+\left(\frac{C_{\mathrm{active}}}{K_C}\right)^2\cdot f}{1+\left(\frac{C_{\mathrm{active}}}{K_C}\right)^2}\right]^2-\beta_m\cdot m\tag{Eq.14}$$

This complex form originates from the cooperative interaction of CadC dimers with the two binding sites in the cadBA promoter. The squared terms reflect the characteristic binding of dimers, while the fractional structure embodies the nonlinear response of transcriptional activation.

Table 2. Parameters for the Transcriptional Regulation Equations[2]
Parameters Definition
\({m}\) Concentration of CadA mRNA
\(C_{\mathrm{active}}\) Concentration of active CadC monomers following environmental activation.
\({C}\) Concentration of active CadC monomers
\(\alpha_{m}\) Basal transcription rate constant (min⁻¹), representing the rate of mRNA synthesis in the absence of CadC activation. Literature value: 0.0043 min⁻¹.
\(\beta_{m}\) mRNA degradation rate constant (min⁻¹), representing the first-order degradation rate constant for mRNA. Literature value: 0.0502 min⁻¹.
\({f}\) Ratio of maximum transcription rate to basal transcription rate. Value set to 698.
\(K_{\mathcal{C}}\) CadC-DNA binding constant (half-saturation concentration for CadC dimer binding to DNA). Literature value: 1 (dimensionless or in appropriate concentration units).
\(\Delta G_{i}\) Free energy of state
\({R}\) Gas constant
\({T}\) Absolute temperature
\(\mathrm{P}_{0}\) Concentration of unbound promoters
\(\mathrm{P}_{1}\) Concentration of promoters with a single binding site occupied
3. Protein Expression and Engineered Controlled-Release Module

This module is the engineering core of our system, designed to temporally decouple enzyme synthesis from activity via the A-IDP/TEV protease control system. The dynamics are divided into three distinct phases."

Aerobic Phase: Synthesis and Encapsulation

During the initial aerobic phase, all newly synthesized CadA is immediately encapsulated by A-IDP, forming an inactive complex. The synthesis rate is proportional to mRNA concentration, and degradation follows first-order kinetics. No free enzyme is produced in this phase.$$\begin{aligned}\frac{dA_{\mathrm{wrapped}}}{dt}&=\alpha_p\cdot m-\beta_p\cdot A_{\mathrm{wrapped}}\\\frac{dA_{\mathrm{free}}}{dt}&=-\beta_p\cdot A_{\mathrm{free}}\end{aligned}\tag{Eq.15}$$

Signal Triggering: Instantaneous Enzyme Release

At a defined time \(\mathfrak{t}_{\mathrm{signal}}\), an environmental cue triggers TEV protease expression. The protease cleaves the A-IDP linker, instantaneously converting all encapsulated enzyme into its active form. Total enzyme mass is conserved during this event:$$A_{\mathrm{free}}(t_{\mathrm{signal}}^+)\quad=A_{\mathrm{free}}(t_{\mathrm{signal}}^-)+A_{\mathrm{wrapped}}(t_{\mathrm{signal}}^-)$$$$A_{\mathrm{wrapped}}(t_{\mathrm{signal}}^+)\quad=0\tag{Eq.16&17}$$

Anaerobic Phase: Direct Production of Active Enzyme

Post-induction, the system enters the anaerobic phase. TEV protease remains active, ensuring all newly synthesized CadA exists directly in the free, active state. Encapsulation ceases:$$\frac{dA_{\mathrm{wrapped}}}{dt}\quad=0$$$$\frac{dA_\mathrm{free}}{dt}\quad=\alpha_p\cdot m-\beta_p\cdot A_\mathrm{free}\tag{Eq.18}$$

Table 3. Parameters for the Signal Trigger Equations[3]
Parameters Definition
\(\alpha_{p}\) Translation rate constant, literature value:\(4.2\times10^{-3}min^{-1}\)
\({m}\) mRNA concentration
\(\beta_{p}\) Protein degradation rate constant, literature value:\(\ln2/1740\approx0.000398\mathrm{тіп}^{-1}\)
\(A_{\mathrm{wrapped}}\) Concentration of the encapsulated enzyme
\(A_{\mathrm{free}}\) Concentration of the free enzyme
\(\mathrm{t~signal}\) Signal trigger time point
4. Metabolic Conversion and Product Formation Module

This module describes the catalytic production of cadaverine from lysine, driven by the concentration of free CadA enzyme. The kinetics are modeled using the classic Michaelis-Menten framework.

The enzymatic reaction typically follows this fundamental mechanism:$$E+S\underset{k_{-1}}{\operatorname*{\overset{k_1}{\operatorname*{\operatorname*{=}}}}}ES\overset{k_2}{\operatorname*{\operatorname*{\to}}}E+P$$

The corresponding Michaelis-Menten equation defines the reaction rate v:$$v=\frac{v_{\max}[S]}{K_m+[S]}\tag{Eq.19}$$

We integrate the Michaelis-Menten equation into our model. Assuming the substrate concentration is much higher than the enzyme concentration and the reaction is irreversible, the rate is determined by the instantaneous concentrations of free enzyme Afree and substrate lysine l. Given a 1:1 stoichiometry, the dynamics of substrate consumption and product formation are:$$\frac{dl}{dt}\quad=-v_{\max}\cdot A_{\mathrm{free}}\cdot\frac{l}{K_m+l}\tag{Eq.20}$$

$$\frac{dc}{dt}\quad=v_{\max}\cdot A_{\mathrm{free}}\cdot\frac{l}{K_m+l}\tag{Eq.21}$$

Table 4. Parameters for the Metabolic Equations[4]
Parameters Definition
\({E}\) Free enzyme
\({S}\) Substrate (Lysine)
\({ES}\) Enzyme-substrate complex
\({P}\) Product (Cadaverine)
\([E]_0\) Total enzyme concentration
\(A_{\mathrm{free}}\) Free enzyme concentration
\({l}\) Lysine concentration
\({c}\) Extracellular cadaverine concentration
\(K_{m}\) Substrate concentration at which the reaction rate is half of Vmax, Literature value: 26 mM
\(v_{max}\) Maximum reaction rate when the enzyme is fully saturated by substrate. Literature value: \(1.3\times10^{-3}\)mM·min⁻¹

These equations are directly coupled with the upstream controlled-release module, as the concentration of active enzyme \(\mathrm{A}_{\mathrm{free}}\) is its output. This coupling allows the model to predict key performance metrics, including the final cadaverine titer, yield, and maximum production rate.

Theoretical significance and experimental guidance

We set \(t_{\mathrm{signal}}\) to be 60 min. Based on literature, we obtained the signal response thresholds (pH₀ = 6.2, Kₗ = 3.6 mM, K꜀ = 0.235 mM), Hill coefficients (nₗ = 1.1, n꜀ = 2.8), expression regulation parameters (αₘ = 0.0043 min⁻¹, βₘ = 0.0502 min⁻¹, f = 698), and the enzyme kinetic constants (\(\mathrm{V_{max}}\) = 0.0013 mM/min, Kₘ = 26 mM). We performed a least squares fitting to refine the model against experimental data. The subsequent simulations generated the following predictive results:

Figure 2.Temporal dynamics of mRNA, total protein, encapsulated enzyme, and free enzyme
Figure 2.Temporal dynamics of mRNA, total protein, encapsulated enzyme, and free enzyme

By numerically solving the ODE system for the protein expression and controlled-release module, we simulated the dynamic interplay between the encapsulated and free forms of CadA (Figure 2). The simulation used the mRNA profile from the upstream transcriptional module as its input. The results clearly capture the designed two-phase behavior: during the aerobic phase, the enzyme accumulates almost entirely in its encapsulated, inactive state. Upon the programmed signal at 60 minutes, the TEV protease is triggered, causing an instantaneous cleavage event that converts the pre-accumulated pool of encapsulated enzyme into the active, free form, resulting in a sharp surge in free CadA. This model provides a quantitative tool to predict the necessary aerobic cultivation time required to build up a sufficient reservoir of the enzyme for optimal production in the subsequent anaerobic phase.

Figure 3.Temporal dynamics of lysine consumption and cadaverine accumulation
Figure 3.Temporal dynamics of lysine consumption and cadaverine accumulation

The kinetic profiles of lysine consumption and cadaverine accumulation, as predicted by the model and shown in Figure 3, demonstrate a complementary relationship that validates mass conservation. This simulated trend successfully matches the direct experimental measurements of substrate depletion, thereby verifying that our integrated model accurately captures the overall reaction dynamics.

Figure 4.Temporal Changes in Cadaverine Production Rate
Figure 4.Temporal Changes in Cadaverine Production Rate

The cadaverine production rate profile was generated by numerically differentiating the accumulation curve from Figure 4. The peak value corresponds to the maximum production rate, represents the upper catalytic capacity of the system. The timing and magnitude of the peak reflect the synergy between enzyme activation and substrate utilization. Notably, the significant delay between the signal triger and the production peak suggests kinetics limited by the rate of enzyme accumulation or activation. This indicates that delaying the trigger time could be beneficial. Furthermore, if a substrate feeding strategy is employed, initiating the feed near this peak would help sustain high productivity.

Figure 5.Effect of Different Substrate Concentrations on Product Cadaverine Yield
Figure 5.Effect of Different Substrate Concentrations on Product Cadaverine Yield

To identify the optimal initial feeding strategy, we performed a parameter scan by systematically varying the initial lysine concentration (\(\mathrm{[L]}\)) in our model while keeping all other parameters constant. The resulting simulations, shown in Figure 5, predict how different starting substrate levels influence the final cadaverine titer. This analysis allows us to suggest an optimal initial lysine concentration for maximizing yield. A comparison with the corresponding experimental data in Figure 6 shows that the model's predictions are largely consistent with the observed trends, validating its utility for guiding process optimization.

Figure 6.Experimental Data
Figure 6.Experimental Data
Figure 7.Effect of Signal Trigger Time on Product Formation
Figure 7.Effect of Signal Trigger Time on Product Formation

To determine the optimal timing for the metabolic switch, we performed a parameter scan by systematically varying the signal trigger time \(\mathfrak{t}_{\mathrm{signal}}\), while holding all other parameters constant. The simulation results in Figure 7 reveals a clear optimal, representing the best trade-off between the aerobic phase for enzyme accumulation and the anaerobic phase for product synthesis. The model pinpoints the ideal trigger time to maximize the final cadaverine titer.

These predictions confirm that our controlled-release system effectively manages the transition between the two phases and provides a reliable theoretical basis for implementing the switch in experiments.

References

[1] Fritz G, Koller C, Burdack K, Tetsch L, Haneburger I, Jung K, Gerland U. Induction kinetics of a conditional pH stress response system in Escherichia coli. J Mol Biol. 2009 Oct 23;393(2):272-86

[2] Kim HJ, Kim YH, Shin JH, Bhatia SK, Sathiyanarayanan G, Seo HM, Choi KY, Yang YH, Park K. Optimization of Direct Lysine Decarboxylase Biotransformation for Cadaverine Production with Whole-Cell Biocatalysts at High Lysine Concentration. J Microbiol Biotechnol. 2015 Jul;25(7):1108-13.

[3] Tanaka Y, Kimura B, Takahashi H, Watanabe T, Obata H, Kai A, Morozumi S, Fujii T. Lysine decarboxylase of Vibrio parahaemolyticus: kinetics of transcription and role in acid resistance. J Appl Microbiol. 2008 May;104(5):1283-93.

[4] Xi Y, Ye L, Yu H. Enhanced thermal and alkaline stability of L-lysine decarboxylase CadA by combining directed evolution and computation-guided virtual screening. Bioresour Bioprocess. 2022 Mar 21;9(1):24.

[5] Kuper C, Jung K. CadC-mediated activation of the cadBA promoter in Escherichia coli. J Mol Microbiol Biotechnol. 2005;10(1):26-39.

MPNN-GNN-based generative model for lysine decarboxylase
Model Overview

To further enhance the performance of CadA, we employed a computational pipeline for protein engineering:

1.Key amino acid residues interacting with the substrate were identified using molecular docking methods, providing targets for input into the model for directed mutation.

2.A deep learning model based on a Message Passing Neural Network-Graph Neural Network (MPNN-GNN) was developed to engineer CadA for improved catalytic acticity.

3.A site-directed mutagenesis strategy targeting the key residues was designed, leading to the screening for mutants with improved activity.

Method
1.Key Site Selection

To computationally design experimentally testable lysine decarboxylase (CadA) variants with enhanced catalytic efficiency, we initiated a structure-based rational design pipeline. Focusing on the functional decameric assembly, we retrieved the CadA (PDB No. 6YN5) structure. We used a representative dimer extracted from the decamer to focus docking on the active pocket while preserving key interfacial residues. Molecular docking was performed using Schrödinger Maestro. A docking box of 12 Å was centered on the pyridoxal-5′-phosphate (PLP) cofactor, and lysine was docked into the active site.

Analysis of the resulting poses across both protomers of the dimer revealed key residues forming hydrophobic contacts, hydrogen bonds, and a salt bridge with the substrate. In chain G, lysine hydrophobically interacts with W333, and forms hydrogen bonds with Glu525, Lys526 and Gly654. Additionally, lysine forms electrostatic interaction with His-245. In chain B, lysine interacts hydrophobically with Ile182 and hydrogen bound with H245, K246, K527, Y652, G655.(Figure 1)

Figure 1. 2D diagram of interactions between Lys with CadA
Figure 1. 2D diagram of interactions between Lys with CadA

Based on the docking analysis, we identified seven residues (I182, H245, K246, W333, E526, K527, Y652) as high-priority targets for mutagenesis. These positions, which directly contact the substrate or cooperate with the PLP cofactor, are predicted to critically modulate substrate binding affinity and catalytic turnover. This candidate list provides a focused and experimentally testable guide for subsequent site-saturation mutagenesis and directed evolution.

2.Model Construction

Our deep learning framework, illustrated in Figure 3, integrates four components: a target protein feature extraction model, a compound feature extraction model, a protein-compound feature fusion model, and an activity prediction model. By innovatively combining the ESM2 protein model and a DGL-based MPNN-GNN compound feature extraction model, the compound-protein feature fusion model to predict the catalytic activity of ligand compounds based on the target protein.

Figure 2. Flowchart of the model
Figure 2. Flowchart of the model
2.1 Model Construction
2.1.1. Preprocessing

Compound Data processing

The complete dataset was partitioned into training, validation, and test sets in an 8:1:1 ratio. Molecular structures were converted into graph representations using the DGL-LifeSci toolkit, from which atom-level and bond-level features were extracted. The atom feature matrix has dimensions \(L \times C\), where \( L \) is the number of atoms in the molecule and \( C = 74 \) denotes the feature dimension per atom. The bond feature matrix has dimensions \(L \times C\), with \(C = 12\) representing bond feature dimensions. The adjacency matrix (\(L \times L \times L\)), computed using RDKit, represents atomic connectivity. This preprocessing pipeline produces graph-structured representations and corresponding adjacency matrices as inputs for subsequent feature extraction and learning.

Protein Mutation Encoding

To encode protein mutation information, a mutation-specific feature encoding strategy was designed. Each amino acid is represented by its single-letter code. The original and mutated residues are separately one-hot encoded (20 dimensions each) and combined with the mutation position index, forming a high-dimensional vector that provide input features for the deep learning model alongside the compound graph structures (see Figure 3).

2.1.2.Target Protein Feature Extraction Module

We employed the pre-trained ESM-2 model (esm2_t33_650M_UR50D) from Meta AI. The amino acid sequence was input into the model, and the final hidden representations were averaged across residues to obtain a global 640-dimensional protein embedding. To align with compound feature dimensions for downstream fusion, the embedding was passed through a projection network: a linear layer (640→256) with ReLU and Dropout, followed by another linear layer (256→64), preserving essential semantic information for protein–compound matching (Figure 3).

Figure 3. Data Processing Workflow
Figure 3. Data Processing Workflow
2.1.3. DataLoader

To improve training efficiency and data handling, all preprocessed protein features and compound graph structures were fed into PyTorch’s DataLoader module. A custom collate_fn function was implemented to accommodate the complex data structures (see Figure 6). The processing workflow is as follows:

Compound graph batching: Using the DGL library, nodes and edges of molecular graphs were batched. The dgl.batch method merges multiple molecular graphs into a single batched_graph, automatically remapping node and edge indices to maintain graph consistency and identifiability.

Feature batching: Node features, edge features, and protein features were concatenated along the batch dimension using torch.cat, forming unified tensors. The resulting feature dimensions are:$$\left( \sum_{i=1}^{\mathit{batchsize}} L_i, C \right)$$

where \(L_i\) is the sequence length of the i-th sample, and \(C\) is the hidden dimension.

Adjacency matrix batching: To handle molecules of varying sizes, adjacency matrices were zero-padded to the maximum node count \(L_{max}\).

within each batch. The padded matrices were then stacked into a unified tensor of shape:$$\left(L_{\mathrm{max}} \times \mathit{batchsize}, L_{\mathrm{max}}\right)$$

This ensures dimensional consistency across the batch, enabling efficient downstream computation.

2.1.4. Compound Feature Extraction Model: MPNN-GNN

We utilized a Message Passing Neural Network (MPNN) implemented in the Deep Graph Library (DGL) to extract representations from molecular graphs. The model comprises two stages:

Message Passing Phase - In each graph convolution layer, nodes aggregate information from their neighbors through a message function that encodes and transmits local chemical context.

Node Update Phase - Each node updates its state by integrating its current features with the aggregated messages using an update function. Node features are first linearly transformed and activated via a non-linear function to enhance representational capacity. Six message-passing iterations are performed to capture extended graph dependencies and refine molecular representations.

The node update in each graph convolution layer is formalized as:$$h_{i}^{l+1} = h_{i}^{l} + \sum_{j \in N(i)} (\{ f_{\Theta}(e_{ij}) \cdot h_{j}^{l}, j \in \mathcal{N}(i) \})$$

where:

\(h_i^l\):feature vector of node i at layer l

\(h_j^l\):feature vector of neighbor node j at layer l

\(e_{ij}\):edge feature between nodes i and j (chemical bond)

\(f_\Theta(e_{ij})\):learnable function that weights neighbor information according to edge features

\(\mathcal{N}(i)\): set of neighbors of node i

After processing through the MPNN layers, a readout function generates a fixed-size 64-dimensional graph-level embedding for each compound. These embeddings serve as input for downstream protein–compound feature fusion and activity prediction tasks.

Figure 4. Feature Extraction and Data Processing Pipeline
Figure 4. Feature Extraction and Data Processing Pipeline
2.2 Model Training and Optimizatio
2.2.1. Feature Fusion Model

To effectively integrate compound and protein features for enhanced predictive performance, we implemented a hybrid-GNN-based structural feature extractor approach proposed by Yang Hua et al. This method employed a one-dimensional deep convolutional (Deep-Wise Convolution, DWC) module to jointly model protein and compound representations at the feature level. Compared with simple concatenation operations, this architecture better captures local correlations and complementarities between the two types of features, generating more discriminative fused representations to support downstream tasks.

Given the extracted feature matrices:

Target protein feature matrix: \(F_p \in {R^{L_p \times C_p^s}}\)

Compound graph feature matrix: \(F_d \in {R^{L_d \times Q_d^s}}\)

Compound adjacency matrix: \(A_d \in {R^{L_d \times L_d}}\)

We extract structural drug-target interaction (DTI) features. The protein feature matrix is represented as amino acid nodes, with adjacency defined by sequential neighbors. A 1D deep convolution with kernel size 3 is applied to capture structural relationships among adjacent amino acids.

An interaction matrix \(C \in R^{L_p \times L_d}\) between protein residues and compound atoms is constructed as:$$C = \mathrm{Softmax}(F_{p} F_{d}^{T})$$

The Softmax function normalizes each row into a probability distribution:$$\sigma(z_i)=\frac{e^{z_i}}{\sum_{j=1}^ne^{z_j}}$$

Here, \(e^{z_i}\) represents the exponential operation applied to each element \(z_i\), ensuring that all values are positive. The term \(\sum_{j \mathrm{=1}}^{n}{}{e^{z_j}}\) is the sum of all elements after exponentiation, used for normalization to guarantee that the elements of the output vector sum to 1. The resulting interaction matrix \({C}\) represents the potential interactions between amino acid residues of the protein and atoms of the compound.

By combining the protein adjacency matrix \(A_p\), compound adjacency matrix \(A_d\), and interaction matrix \(C\), a complete GNN computation is formulated:$$F_s=[\begin{array}{cc}A_p&C\\C^T&A_d\end{array}][\begin{array}{c}F_p\\F_d\end{array}]$$

This operation integrates drug-target interaction information as a bridge connecting protein and compound graphs. Since protein sequences cannot form atom-level graphs directly, the 1D DWC extracts sequential relationships:$$A_pF_p=DWC(F_p)$$

Where \(A_p\) is a hypothetical adjacency among residues, and DWC uses a \(3 \times 1\) convolution to capture neighboring residue information. The updated protein and compound features are expressed as:$$\delta(F_p)=DWC(\omega(F_p))+C\omega(F_d)+F_p$$$$\delta(F_d)=C^T\omega(F_p)+A_d\omega(F_d)+F_d$$

Here, \(\omega\) denotes a linear transformation followed by a non-linear activation, while \(C\) and \(C^T\) propagate interaction information. After N rounds of GNN iteration, the final fused structural features are obtained via concatenation:$$\mathrm{F_s=Concatenate}\left(\delta(\mathrm{F_p}),\delta(\mathrm{F_d})\right)$$

2.2.2. Activity Prediction Module

After obtaining the fused substrate-target representations, a fully connected neural network (FCNN) was employed as the prediction module to estimate molecular activity. The network architecture is structured as follows:

Input layer: A linear transformation maps the 64-dimensional fused feature vector to 128 dimensions, followed by a ReLU activation function to introduce nonlinearity.

Hidden layer 1: The transformed features are further expanded to 256 dimensions and processed through an additional ReLU activation.

Hidden layer 2: A subsequent linear transformation reduces the feature dimensionality to 64, followed by ReLU activation and a Dropout layer to prevent overfitting.

Output layer: The final linear layer projects the feature representation onto a single-dimensional output space, suitable for either regression or classification tasks.

This hierarchical structure facilitates nonlinear feature interactions between the substrate and protein representations, thereby enhancing the model’s ability to capture complex biochemical relationships and generate task-specific activity predictions effectively.

2.2.3. Loss Function, Bayesian Hyperparameter Optimization, and Early Stopping

Optimizer and Loss Function:

The model was implemented and trained using the PyTorch framework. We adopted the AdamW optimizer, which improves upon standard Adam by decoupling weight decay from the gradient-based update steps. This separation explicitly applies L2 regularization directly to the model parameters, thereby enhancing training stability and generalization performance.

For regression tasks, we used the Mean Squared Error (MSE) loss, defined as:$$\mathrm{MSELoss}(y_{\mathrm{true}},y_{\mathrm{pred}})=\frac{1}{N}\sum_{i=1}^N(y_{\mathrm{true}}^{(i)}-y_{\mathrm{pred}}^{(i)})^2$$

Here, \(y_\text{true}\) denotes the true labels, and \(y_\text{pred}\) represents the model’s predicted values. MSELoss effectively captures the average squared deviation between predicted and true values, making it more sensitive to outliers.

Early Stopping:

During model training, the validation loss was continuously monitored, and early stopping was applied when no improvement was observed over 50 consecutive epochs. This strategy effectively mitigates overfitting and ensures that the final model corresponds to the checkpoint exhibiting the optimal validation performance.

Bayesian Hyperparameter Optimization:

Gaussian process-based Bayesian optimization was employed to automate the tuning of hyperparameters, with validation accuracy designated as the objective function. Critical parameters such as learning rate, weight decay, and the number of training epochs were optimized. To improve search efficiency, a two-stage optimization strategy was adopted: an initial coarse-grained global search was conducted to explore the parameter space broadly, followed by a fine-grained local search focused on the most promising regions. This hierarchical optimization procedure facilitated the efficient identification of an optimal hyperparameter configuration, enhancing model generalization and predictive robustness.

2.3. Data Sources

To facilitate model training, distinct datasets were constructed for different tasks. Experimental validation samples of kcat values were extracted from the BRENDA and SABIO-RK databases. Each record included the enzyme amino acid sequence, substrate name, and corresponding kinetic parameters. Whenever possible, enzyme sequences were mapped to a unique UniProt ID. For each substrate, the molecular name was also used to retrieve the corresponding SMILES representation from the KEGG and PubChem databases. Samples that could not be unambiguously mapped to a valid SMILES string were excluded.

Due to the limitation of the ESM2 model in handling sequences longer than 1,022 amino acids, samples exceeding this threshold were removed. Variant-containing samples were retained, while the top and bottom 10% of extreme outlier values were filtered out. Ultimately, a total of 29,102 samples were collected for kcat, including 744 entries belonging to EC 4.1.1.x (decarboxylases). The dataset was randomly partitioned into training, validation, and test sets at ratios of 80%, 10%, and 10%, respectively.

Result
1. Model Training

During model construction, Bayesian optimization was employed for hyperparameter search. Based on prior training experience, the search space was constrained as follows: learning_rate within the range \([1 \times 10^{-4}\), \(1 \times 10^{-3}\)] (linear space), weight_decay within the range \([1 \times 10^{-4}\), \(1 \times 10^{-3}\)] (linear space), batch_size selected from \(\{64, 128\}\), and epochs between 50 and 80. Training yielded the optimal hyperparameter combination: learning_rate = 2.275 × 10⁻⁴, weight_decay = 1.036 × 10⁻⁴, batch_size = 128, epochs = 54.

After training, model performance on the test set was evaluated using multiple regression metrics, including mean squared error (MSE), root mean squared error (RMSE), mean absolute error (MAE), and adjusted coefficient of determination (Adjusted \(R^{2}\)). Their definitions are as follows:$$\mathrm{MSE}=\frac{1}{n}\sum_{i=1}^n(y_i-\widehat{y}_i)^2$$$$\mathrm{RMSE}=\sqrt{\frac{1}{n}\sum_{i=1}^n(y_i-\widehat{y}_i)^2}$$$$\mathrm{MAE}=\frac{1}{n}\sum_{i=1}^{n}|y_{i}-\widehat{y}_{i}|$$$$\mathrm{Adjusted~}R^2=1-(1-R^2)\frac{n-1}{n-p-1},\quad R^2=1-\frac{\sum_{i=1}^n(y_i-\widehat{y}i)^2}{\sum i=1^n(y_i-\bar{y})^2}$$

Where n denotes the number of samples, p represents the number of model input features, \(y_i\) and \(\hat{y}_i\) correspond to the true and predicted values, respectively, and \(\bar{y}\) is the mean of the true values.

After completing the general pretraining phase, the model was fine-tuned using lysine decarboxylase (LDC)-specific data to enhance its predictive capability within this enzyme family. As shown in Table 1 fine-tuning systematically improved the key performance metrics on the LDC test set, with the adjusted \(R^{2}\) increasing from 0.3253 to 0.4927. This indicates that domain knowledge transfer was effective, and the model has begun to capture functional features specific to the LDC family.

Table 1. Performance of the model on the general test set and the lysine decarboxylase (LDC) test set
MSE RMSE MAE Adjusted R²
General-purpose dataset 0.3747 0.4101 0.3110 0.3253
LDC dataset 0.3069 0.3654 0.2395 0.4927
2. Experimental Validation Reveals Model-Experiment Discrepancy

Despite the promising computational performance of our fine-tuned model, This underscores the challenge of translating in silico predictions to wet-lab outcomes. We proceeded to experimentally characterize the top-predicted CadA mutants for lysine decarboxylase activity. Unexpectedly, none of these in silico-designed variants showed a significant improvement over the wild-type enzyme. This critical finding highlighted a fundamental gap between our static structure-based predictions and the complex reality of enzymatic function in a cellular environment.

We attribute this discrepancy to two primary factors: 1) The limited amount of high-quality, enzyme-specific training data, which restricts the model's precision; and 2) The model's inherent inability to account for dynamic effects such as conformational flexibility and solvation, which are crucial for catalysis.

Designing Interfacial Disulfide Bonds to Improve CadA activity
Methods

The divergence between our model's predictions and experimental results underscored the challenges of direct activity prediction. We therefore adopted an alternative, structure-guided strategy, strategy based on a well-established protein engineering principle: enhancing enzyme stability often leads to improved functional performance.

To implement this, we introduced engineered disulfide bonds at the dimer interface of CadA to increase stability. Using the Maestro-BioLuminate platform, we adopted a three-step strategy consisting of disulfide scanning - mechanics scoring - literature-based-validation.

A systematic disulfide bond scan was carried out. Candidate residue pairs at the dimer interface were screened according to geometric and accessibility criteria: interatomic distances between 4-7 Šand solvent-accessible surface areas (SASA) < 80 Ų. Simulated double-cysteine mutations generated 80 candidate residue pairs, which were subsequently filtered using weighted scoring. Pairs with weighted scores (WS) > 10,000 were excluded, yielding nine potential disulfide-bonding sites.

Finally, we cross-referenced these candidates with experimental reports of thermostability-enhancing mutations published between 2018 and 2024. Only those pairs that satisfied interfacial positioning, disulfide bond geometry, and literature evidence were retained. Based on this integrative analysis, six new mutants were designed: F14C/K44C, V12C/K41C, F14C/L45C, L93C/E445C, Y13C/P36C. Additionally, the mutation E104K, located at the dimer interface, was additionally identified by virtual screening as potentially stabilizing the enzyme.

Results

Wet-lab experiments demonstrated that the engineered disulfide bond mutants F14C/K44C, V12C/K41C, F14C/L45C, L93C/E445C and Y13C/P36C along with literature-supported variants V12C/K44C, F14C/D41C, and L89C/E445C, exhibited significantly enhanced enzymatic activity. Among these, the Y13C/P36C mutant diaplayed the most pronounced improvement in catalytic activity.

To investigate the structural basis of this enhancement, molecular dynamics (MD) simulations were performed for both the wild-type enzyme and the Y13C/P36C mutant. The results revealed a substantial reduction in Root-Mean-Square Fluctuations (RMSF) in the mutant, indicating enhanced structural stability that likely underlies its higher activity.

Discussion

Limitations of Model Predictions and Discrepancies with Experimental Validation

Our initial in silico MPNN-GNN–based model showed moderate predictive power for LDC variants (Adjusted R² = 0.4927), but experimental validation revealed that predicted high-activity mutants did not reach expected performance. These discrepancies likely stem from the model's reliance on generalized training data, static structural assumptions, and the incomplete representation of long-range dynamic effects that influence enzyme catalysis.

To overcome these gaps, we integrated disulfide bond scanning with literature based-validation, identifying key interfacial residues that contribute to enhanced stability and activity. Future work will combine site-directed mutagenesis, molecular dynamics, and high-throughput experimentalvalidation into an iterative design loop. This approach will continuously refine the predictive model and further improve both the stability and catalytic efficiency of CadA variants.

References

[1] Potashman MH, Duggan ME. Covalent modifiers: An orthogonal approach to drug design. J Med Chem (2009) 52:1231-46. doi: 10.1021/jm8008597

[2] Noor, F.; Noor, A.; Ishaq, A.R.; Farzeen, I.; Saleem, M.H.; Ghaffar, K.; Aslam, M.F.; Aslam, S.; Chen, J.-T. Recent advances in diagnostic and therapeutic approaches for breast cancer: A comprehensive review. Curr. Pharm. Des. 2021, 27, 2344-2365.

[3] Noor, F.; Tahir ul Qamar, M.; Ashfaq, U.A.; Albutti, A.;Alwashmi, A.S.; Aljasir, M.A. Network pharmacology approach for medicinal plants: Review and assessment. Pharmaceuticals 2022, 15, 572.

[4] Floresta, G.; Zagni, C.; Gentile, D.; Patamia, V.; Rescifina, A. Artificial intelligence technologies for COVID-19 de novo drug design. Int. J. Mol. Sci. 2022, 23, 3261.

[5] Sadaqat, M.; Qasim, M.; ul Qamar, M.T.; Masoud, M.S.; Ashfaq, U.A.; Noor, F.; Fatima, K.; Allemailem, K.S.; Alrumaihi,F.;Almatroudi, A. Advanced network pharmacology study reveals multi-pathway and multi-gene regulatory molecular mechanism of Bacopa monnieri in liver cancer based on data mining, molecular modeling, and microarray data analysis. Comput. Biol. Med.2023, 161, 107059.

[6] Yang, J.; Cai, Y.; Zhao, K.; Xie, H.; Chen, X. Concepts and applications of chemical fingerprint for hit and lead screening. Drug Discov. Today 2022, 27, 103356.

[7] Vetter, I. R., & Wittinghofer, A. (2001). The guanine nucleotide-binding switch in three dimensions. Science, 294(5545), 1299-1304.

[8] Liceras-Boillos, P.; García-Navas, R.; Ginel-Picardo, A.; Anta, B.; Perez-Andres, M.; Lillo, C.; Gómez, C.; Jimeno, D.; Fernández- Medarde, A.; Baltanas, F.C.; et al. Sos1 disruption impairs cellular proliferation and viability through an increase in mitochondrial oxidative stress in primary MEFs. Oncogene 2016, 35, 6389–6402.

[9] M. M. Mysinger, M. Carchia, J. J. Irwin, and B. K. Shoichet, "Directory of useful decoys,enhanced (dud-e): Better ligands and decoys for better benchmarking," Journal of Medicinal Chemistry, vol. 55, no. 14, pp. 6582–6594, 2012. PMID: 22716043.

[10] I. Wallach and R. Lilien, "Virtual Decoy Sets for Molecular Docking Benchmarks," J Chem.Inf. and Model., vol. 51, no. 2, pp. 196–202, 2011.

[11] D. Rogers and M. Hahn, "Extended-connectivity fingerprints," Journal of Chemical Information and Modeling, vol. 50, no. 5, pp. 742–754, 2010.

[12] Lin, Z., Akin, H., Rao, R., Hie, B., Zhu, Z., Lu, W., Smetanin, N., dos Santos Costa, A., Fazel-Zarandi, M., Sercu, T., Candido, S., et al. (2022). Language models of protein sequences at the scale of evolution enable accurate structure prediction. bioRxiv. https://doi.org/10.1101/2022.07.20.500902

Rational Design and Optimization of a Cadaverine and Succinate Co-production System Using a Genome-Scale Metabolic Model
1. Overview

Bio-based polyamide PA54 requires cadaverine and succinate. Conventional stepwise fermentation involves more than 35 steps, leading to high cost and low efficiency. To address this, we developed a single-strain co-production system to synthesize both monomers, reducing core reaction steps to approximately 17 and eliminating waste salt generation. The system employs a temperature-sensitive switch (TlpA39) to dynamically regulate metabolism: at 33°C under aerobic conditions,metabolism is optimized for lysine accumulation; at 39°C under anaerobic conditions, succinate production is activated and CadA is expressed for lysine-to-cadaverine conversion. However, three key challenges remain, including low CadA activity and stability[1], redox imbalance between phases. and incomplete metabolic switching .

To overcome these challenges systematically, we applied genome-scale metabolic modeling to identify optimization targets. By constructing condition-specific models for both aerobic and anaerobic stages and employing multi-level computational analyses, we pinpointed stage-specific bottlenecks, providing a prioritized modification list to guide experimental strain engineering and minimize trial-and-error.

2. Metabolic Network Model Construction
2.1 Model Foundation and Customization

The genome-scale metabolic model iML1515 of Escherichia coli was used as the foundation,(Figure 1). To simulate the two-phase co-production process, we constructed two condition-specific models[2].

Figure 1. The contents of the iML1515 chassis model.
Figure 1. The contents of the iML1515 chassis model.

Aerobic model (lysine accumulation stage): We introduced the intracellular synthesis of cadaverine. The reaction is catalyzed by lysine decarboxylase CadA (reaction formula: LYS_L_c + H_c → 15dap_c + CO2_c).

To ensure that cadaverine can be secreted into the extracellular, we further added its transmembrane transport reaction (CADAttp:15dap_c → 15dap_p) and extracellular exchange reaction (EX_15dap_e:15dap_e→)(Figure 2), thereby fully describing the metabolic pathway from lysine to extracellular cadaverine.

Figure 2. Added reactions in the aerobic model.
Figure 2. Added reactions in the aerobic model.

Aerobic model (succinate and cadaverine co-production stage): This model was modified based on the aerobic model, with the culture conditions set to anaerobic, that is, all reactions related to oxygen intake and utilization are turned off. Meanwhile, we strengthened the secretion pathway of succinate by explicitly adding the transport reaction (CADAtpp1:succ_c→succ_p) and extracellular exchange reaction (EX_succ_e→) of succinate (metabolite ID: succ_c) to simulate its extracellular accumulation (Figure 3).

Figure 3. Addition reactions in the anaerobic model.
Figure 3. Addition reactions in the anaerobic model.
2.2. Analytical methods and algorithms

We implemented a multi-tiered computational workflow to interrogate the metabolic models.

(1) Flux Balance Analysis (FBA) :

FBA is a core constraint-based modeling approach that assumes metabolic networks operate at steady-state,where intracellular metabolite concentrations remain constant. This is formulated as a linear program:$$\max Z=c^T\cdot\nu$$

The constraints are:$$S.\nu=0$$$$\nu_{\min}\leq\nu\leq\nu_{\max}$$

Where, \(S\) is the stoichiometric matrix, \(\nu\) is the reaction flux vector, \(c\) is the objective vector (such as biomass synthesis), \(Z\) is the objective function value.

This study first takes biomass synthesis as the optimization objective, assesses the maximum metabolic capacity of the network under given conditions, and establishes a theoretical reference benchmark.

(2) Parsimony flux balance analysis (pFBA) :

pFBA introduces biological rationality constraints atop standard FBA. It operates on the principle that among all flux distributions achieving optimal growth (as determined by FBA), cells naturally select the configuration minimizing total enzyme burden—consistent with evolutionary resource optimization. Mathematically, this is formulated as:$$\min\sum|\nu_i|$$

The constraints are:$$S.\nu=0$$$$\nu_{\min}\leq\nu\leq\nu_{\max}$$$$c^T\cdot\nu=Z_{\mathrm{opt}}$$

Where \(Z_{\mathrm{opt}}\) is the maximum growth rate calculated by FBA. The flux distribution predicted by pFBA is considered closer to the real physiological state.

(3) Flux Variability Analysis (FVA) :

FVA is a powerful tool for assessing metabolic network redundancy, robustness, and identifying key rigid nodes. It calculates the minimum and maximum values achievable for each reaction flux given constraints on the objective function (e.g., a maximum growth rate of no less than 95%) and possible product synthesis.

For each reaction \(\nu_{i}\):$$\mathsf{Maximize}/\mathsf{Minimize}\nu_i$$

The constraint is:$$S\cdot\nu=0$$$$\nu_{\min}\leq\nu\leq\nu_{\max}$$$$c^T\cdot\nu=\mu\cdot Z_{\mathrm{opt}}$$$$\nu_{product}\geq\gamma\cdot\nu_{product,\max}$$

Where, \(\mu\) is the growth rate threshold coefficient, \(\gamma\) is the product synthesis force coefficient.Reactions with a very small range of flux variation are regarded as rigid nodes of the network.

(4) Flux Scanning (FSEOF) :

FSEOF is the core analytical method of this study, specifically designed for systematically identifying overexpression targets positively correlated with the synthesis of the target product. The method observed the dynamic response trend of all reaction fluxes by gradually increasing the synthetic flux requirements of the target product while optimizing cell growth at each step:

From \(k=0\) to \(k_{\max}\):$$\max\nu_{\mathrm{biomass}}$$

The constraint conditions are:$$S.\nu=0$$$$\nu_{\min}\leq\nu\leq\nu_{\max}$$$$\nu_{\mathrm{product}}=k\cdot\nu_{\mathrm{product,max}}$$

Where, \(\text{k}\) is the enforcement coefficient. Record the changes in all reaction fluxes \(V_{i}\) as the value of k (i.e.,product synthesis requirement) increases. Reactions whose fluxes increase monotonically with \(\nu_{\mathrm{product}}\) (i.e.,\(\frac{\mathrm{d}\nu_i}{\mathrm{d}\nu_{\mathrm{product}}}>0\) ) are identified as potential overexpression targets. The power of FSEOF lies in its ability to concurrently identify reactions directly involved in product synthesis, those supplying precursors, and those generating essential cofactors, thus generating a comprehensive list of potential engineering targets.

All model construction and simulation were implemented in Python using the COBRApy package, with Gurobi as the linear programming solver. Advanced flux analysis, including FSEOF, was performed using the Cameo library.

3. Model Validation and Analysis
3.1 Basic Function Validation and Flux Balance Analysis

The customized models were first validated for basic functionality. Under simulated glucose minimal medium conditions, both aerobic and anaerobic models achieved expected cell growth, confirming network connectivity and functional completeness. Subsequent flux balance analysis with biomass maximization quantified baseline growth rates, establishing a quantitative baseline for subsequent analyses.

3.2 Evaluation of network rigidity based on Flux variability Analysis (FVA)

Flux variability analysis was employed to assess the resilience of the metabolic network and to identify critical bottlenecks. Table 1 shows examples of flux ranges for some key reactions in the model given growth and product synthesis constraints.

Table 1. Reactions and fluxes involved in aerobic and anaerobic process
Aerobic Anaerobic
Reaction ID Flux Reaction ID Flux
EX_h2o_e 28.64481203 EX_h_e 34.27490909
EX_co2_e 20.80601504 ENO 20
ENO 15.67759398 GAP 20
GAP 15.67759398 CADAtpp1 17.13745455
NH4tpp 15.67759398 CADAtex1 17.13745455
NH4tex 15.67759398 FRD2 14.27490909
ATPS4rpp 14.69879699 PPC 14.27490909
PGL 12.96721805 NADH17pp 14.27490909
GND 12.96721805 FBA 10
G6PDH2r 12.96721805 PGI 10
CYTBO3_4pp 10.25684211 PFK 10
NADH16pp 10.25684211 TPI 10
GLCtex_copy2 10 GLCtex_copy2 10
RPE 8.64481203 EX_h2o_e 8.587636364
PPC 7.838796992 CO2tpp 8.549818182
SDPDS 7.838796992 CO2tex 8.549818182
DAPE 7.838796992 ATPM 6.86
DHDPRy 7.838796992 GLCptspp 5.725090909
DHDPS ASPK 7.838796992 PDH 5.725090909
DAPDC 7.838796992 HEX1 4.274909091
LYSDC 7.838796992 GLCt2pp 4.274909091
THDPS 7.838796992 CS 2.862545455
SUCOAS 7.838796992 ICL 2.862545455
CADAtpp 7.838796992 MALS 2.862545455
CADAtex 7.838796992 ACONTb 2.862545455
GLCptspp 7.838796992 ACONTa 2.862545455
ATPM 6.86 ATPS4rpp 1.134909091
TPI 5.677593985 CYTBO3_4pp 0.037818182
THD2pp 5.42075188 NADH16pp 0.037818182
O2tpp 5.128421053 O2tpp 0.018909091
O2tex 5.128421053 O2tex 0.018909091
PFK_3 4.322406015 \ \
TKT1 4.322406015 \ \
FBA3 4.322406015 \ \
TKT2 4.322406015 \ \
HEX1 2.161203008 \ \
GLCt2pp 2.161203008 \ \
FBA 1.35518797 \ \
PFK 1.35518797 \ \

While FVA was performed on separate models, making direct comparison of absolute flux values challenging, the identified key targets remain valid and instructive for the integrated co-production process.

A detailed analysis of FVA results revealed a crucial phenomenon: metabolic network rigidity significantly increases under anaerobic conditions. As shown in Table 1, numerous central carbon metabolic reactions exhibit substantially narrowed flux variability ranges. This enhanced rigidity primarily stems from the absence of oxygen as a terminal electron acceptor, forcing cells to rely on limited endogenous reactions for NAD⁺ regeneration to sustain basic glycolytic flux and viability.

This inherent constraint in energy metabolism severely reduces the flexibility of carbon flux redistribution, providing a systematic explanation for the "incomplete metabolic switching" observed in our wet-lab experiments. The computational findings thus offer fundamental insights into the metabolic limitations of the anaerobic phase and guide targeted engineering strategies to overcome these constraints.

3.3 Target system identification based on Flux ScanningTechnology(FSEOF)

We employed Flux Scanning based on Enforced Objective Flux (FSEOF) to systematically identify reactions positively correlated with the synthesis of cadaverine and succinate. This method monitors flux response trends across the metabolic network while progressively increasing the target product synthesis constraint. Reactions exhibiting monotonically increasing fluxes represent potential overexpression targets, while those with decreasing fluxes indicate competing pathways.

Aerobic phase (cadaverine): Analysis revealed several key targets beyond direct cadaverine synthesis and transport reactions (CADAtpp/CADAtex): Aspartate aminotransferase (ASPTA) flux showed strong dependence on cadaverine synthesis levels, indicating enhanced conversion of oxaloacetate to aspartate is crucial for driving lysine precursor supply. Pentose phosphate pathway enzymes, particularly glucose-6-phosphate dehydrogenase (Zwf) and 6-phosphogluconate dehydrogenase (Gnd) , demonstrated significant flux increases with rising cadaverine demand, clearly identifying NADPH supply as a core limitation for lysine/cadaverine biosynthesis.

Anaerobic phase (succinate co-production): Under these more rigid-constrained conditions, FSEOF identified critical targets for addressing energy bottlenecks Multiple glycolytic reactions showed pronounced flux increases when succinate synthesis was enforced. Gyceraldehyde-3-phosphate dehydrogenase (gap), showed the most consistent and significant upward trend, representing a crucial hub for simultaneous NADH and ATP generation. Enhancing gap flux presents a strategic approach to alleviate the dual demands of succinate synthesis (NADH) and cadaverine precursor synthesis (ATP) under anaerobic conditions.

These FSEOF results provide a prioritized target list for metabolic engineering to optimize both production phases of our co-production system.

4. Model-Guided Experimental Improvements

Based on model predictions, we implemented targeted engineering strategies to improve system performance.

4.1 Optimization of the TlpA39 switch

The regulatory precision of the temperature-sensitive TlpA39 switch was optimized to ensure complete metabolic mode transition between aerobic and anaerobic phases, effectively minimizing carbon flux leakage. Experimental validated that the modified TlpA39 switch significantly improved cadaverine production, achieving a maximum titer of 7-8 g/L.

4.2 Overexpression of Glyceraldehyde-3-Phosphate Dehydrogenase (gap)

As a hub gene identified in both fermentation phases, gap overexpression enhanced glycolytic flux toward lysine precursors under aerobic conditions and boosted NADH/ATP yield per glucose under anaerobic conditions. After its overexpression, succinate production reached 6.5 g/L, validating the predictive power of the model and establishing a data-driven DBTL framework for further optimization.

5. Conclusions

By integrating multi-level computational modeling—including FBA, pFBA, FVA, and FSEOF—we identified key metabolic bottlenecks and engineering targets for a co-production system of cadaverine and succinate. Our models revealed increased metabolic network rigidity under anaerobic conditions and pinpointed Gap as a global hub gene. Experimental validation confirmed that model-guided strategies, including gap overexpression and TlpA39 switch optimization, significantly enhanced both product titers. This work establishes a predictive DBTL cycle for metabolic engineering and provides a scalable framework for sustainable bio-monomer production.

6. References

[1].Dun Y, Pu ZJ, Kang H, et al. Zwitterionic peptides encircling-assisted enhanced catalytic performance of lysine decarboxylase for cadaverine biotransformation and mechanism analyses[J]. Chemical Engineering Science, 2022, 251: 117447.

[2].Amin, S.A., Chavez, E., Porokhin, V, et al. Towards creating an extended metabolic model (EMM) for E. coli using enzyme promiscuity prediction and metabolomics data[J]. Microb Cell Fact, 2019, 18: 109.