Methods
1. Key Site Selection
To computationally design experimentally testable lysine decarboxylase (CadA) variants with enhanced catalytic efficiency, we established a structure-based rational design pipeline. Focusing on the functional decameric assembly, we retrieved the CadA structure (PDB ID: 6YN5). A representative dimer extracted from the decamer was used to focus docking on the active pocket while preserving key interfacial residues. Molecular docking was performed in Schrödinger Maestro: a 12 Å docking box was centered on the pyridoxal-5′-phosphate (PLP) cofactor, and lysine was docked into the active site.
Analysis of the resulting poses across both protomers of the dimer revealed key residues forming hydrophobic contacts, hydrogen bonds, and a salt bridge with the substrate. In chain G, lysine interacts hydrophobically with W333, forms hydrogen bonds with E525, K526, and G654, and forms an electrostatic interaction with H245. In chain B, lysine interacts hydrophobically with I182 and forms hydrogen bonds with H245, K246, K527, Y652, and G655 (Figure 1).
Figure 1. 2D diagram of interactions between lysine and CadA
Based on the docking analysis, we identified seven residues (I182, H245, K246, W333, E526, K527, Y652) as high-priority targets for mutagenesis. These positions, which directly contact the substrate or cooperate with the PLP cofactor, are predicted to critically modulate substrate binding affinity and catalytic turnover. This candidate list provides a focused and experimentally testable guide for subsequent site-saturation mutagenesis and directed evolution.
2. Model Construction
Our deep learning framework, illustrated in Figure 2, integrates four components: a target protein feature extraction model, a compound feature extraction model, a protein–compound feature fusion model, and an activity prediction model. By combining the pre-trained ESM2 protein language model with a DGL-based MPNN compound feature extractor, the protein–compound fusion model predicts the catalytic activity of ligand compounds against the target protein.
Figure 2. Flowchart of the model
2.1 Model Architecture
2.1.1. Preprocessing
Compound Data Processing
The complete dataset was partitioned into training, validation, and test sets in an 8:1:1 ratio. Molecular structures were converted into graph representations using the DGL-LifeSci toolkit, from which atom-level and bond-level features were extracted. The atom feature matrix has dimensions \(L \times C\), where \(L\) is the number of atoms in the molecule and \(C = 74\) is the feature dimension per atom. The bond feature matrix has dimensions \(E \times C\), where \(E\) is the number of bonds and \(C = 12\) is the feature dimension per bond. The adjacency matrix (\(L \times L\)), computed using RDKit, encodes atomic connectivity. This preprocessing pipeline produces graph-structured representations and corresponding adjacency matrices as inputs for subsequent feature extraction and learning.
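A minimal sketch of this preprocessing step, assuming DGL-LifeSci's canonical featurizers (which yield exactly the 74-dimensional atom and 12-dimensional bond features described above); the feature field names 'h' and 'e' are our own choices.

```python
# Compound preprocessing sketch: SMILES -> DGL graph + adjacency matrix.
from rdkit import Chem
from rdkit.Chem import rdmolops
from dgllife.utils import (smiles_to_bigraph, CanonicalAtomFeaturizer,
                           CanonicalBondFeaturizer)

def featurize_compound(smiles):
    """Convert a SMILES string into a DGL graph and an adjacency matrix."""
    g = smiles_to_bigraph(
        smiles,
        node_featurizer=CanonicalAtomFeaturizer(atom_data_field='h'),  # L x 74
        edge_featurizer=CanonicalBondFeaturizer(bond_data_field='e'),  # 12-dim per edge
    )
    mol = Chem.MolFromSmiles(smiles)
    adj = rdmolops.GetAdjacencyMatrix(mol)                             # L x L
    return g, adj

g, adj = featurize_compound('NCCCCC(N)C(=O)O')  # L-lysine
print(g.ndata['h'].shape, g.edata['e'].shape, adj.shape)
```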
Protein Mutation Encoding
To encode protein mutation information, a mutation-specific feature encoding strategy was designed. Each amino acid is represented by its single-letter code. The original and mutated residues are separately one-hot encoded (20 dimensions each) and combined with the mutation position index, forming a vector that provides input features for the deep learning model alongside the compound graph structures (see Figure 3).
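A small illustration of this encoding, assuming the 20-letter alphabet in alphabetical order and a 20 + 20 + 1 layout; the exact layout in our implementation may differ.

```python
# Mutation encoding sketch: one-hot wild-type residue, one-hot mutant
# residue, plus the mutation position index.
import numpy as np

AMINO_ACIDS = 'ACDEFGHIKLMNPQRSTVWY'
AA_INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}

def encode_mutation(wt_aa, mut_aa, position):
    """Return a 41-dim vector: [wt one-hot | mutant one-hot | position]."""
    wt_vec = np.zeros(20)
    wt_vec[AA_INDEX[wt_aa]] = 1.0
    mut_vec = np.zeros(20)
    mut_vec[AA_INDEX[mut_aa]] = 1.0
    return np.concatenate([wt_vec, mut_vec, [float(position)]])

feat = encode_mutation('Y', 'C', 13)   # e.g. the Y13C substitution
print(feat.shape)                      # (41,)
```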
2.1.2. Target Protein Feature Extraction Module
We employed the pre-trained ESM-2 model (esm2_t33_650M_UR50D) from Meta AI. The amino acid sequence was input into the model, and the final hidden representations were averaged across residues to obtain a global 1280-dimensional protein embedding (the output width of this checkpoint). To align with the compound feature dimension for downstream fusion, the embedding was passed through a projection network: a linear layer (1280→256) with ReLU and Dropout, followed by another linear layer (256→64), preserving the semantic information essential for protein–compound matching (Figure 3).
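The extraction can be sketched as follows with the fair-esm package; the sequence shown is a short placeholder fragment, not the full CadA sequence, and the dropout rate is an assumption.

```python
# Global protein embedding sketch: ESM-2 residue-averaged representation
# followed by the projection head described above.
import torch
import esm

model, alphabet = esm.pretrained.esm2_t33_650M_UR50D()
model.eval()
batch_converter = alphabet.get_batch_converter()

_, _, tokens = batch_converter([("CadA_fragment", "MNVIAILNHMGVYFKEE")])
with torch.no_grad():
    out = model(tokens, repr_layers=[33])
reps = out["representations"][33]       # (1, seq_len + 2, 1280) incl. BOS/EOS
emb = reps[0, 1:-1].mean(dim=0)         # residue-averaged global embedding

# Projection network aligning the embedding with the compound feature width
proj = torch.nn.Sequential(
    torch.nn.Linear(1280, 256), torch.nn.ReLU(), torch.nn.Dropout(0.1),
    torch.nn.Linear(256, 64),
)
protein_feat = proj(emb)                # 64-dim protein representation
```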
Figure 3. Data Processing Workflow
2.1.3. DataLoader
To improve training efficiency and data handling, all preprocessed protein features and compound graph structures were fed into PyTorch’s DataLoader module. A custom collate_fn function was implemented to accommodate the complex data structures (see Figure 4). The processing workflow is as follows:
Compound graph batching: Using the DGL library, nodes and edges of molecular graphs were batched. The dgl.batch method merges multiple molecular graphs into a single batched_graph, automatically remapping node and edge indices to maintain graph consistency and identifiability.
Feature batching: Node features, edge features, and protein features were concatenated along the batch dimension using torch.cat, forming unified tensors. The resulting feature dimensions are:$$\left( \sum_{i=1}^{\mathit{batchsize}} L_i, C \right)$$
where \(L_i\) is the sequence length of the i-th sample, and \(C\) is the hidden dimension.
Adjacency matrix batching: To handle molecules of varying sizes, adjacency matrices were zero-padded to the maximum node count \(L_{\mathrm{max}}\) within each batch. The padded matrices were then concatenated into a unified tensor of shape:$$\left(L_{\mathrm{max}} \times \mathit{batchsize},\; L_{\mathrm{max}}\right)$$
This ensures dimensional consistency across the batch, enabling efficient downstream computation.
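A condensed sketch of such a collate_fn; each sample is assumed to be a (graph, adjacency, protein_feature, label) tuple, and the tuple layout is illustrative.

```python
# Custom collate_fn sketch: batches graphs, padded adjacency matrices,
# protein features, and labels.
import dgl
import torch

def collate_fn(samples):
    graphs, adjs, prot_feats, labels = map(list, zip(*samples))

    # dgl.batch merges the molecular graphs and remaps node/edge indices
    batched_graph = dgl.batch(graphs)

    # Zero-pad each adjacency matrix to the largest molecule in the batch,
    # then concatenate along dim 0 -> (L_max * batch_size, L_max)
    l_max = max(a.shape[0] for a in adjs)
    padded = []
    for a in adjs:
        p = torch.zeros(l_max, l_max)
        n = a.shape[0]
        p[:n, :n] = torch.as_tensor(a, dtype=torch.float)
        padded.append(p)
    batched_adj = torch.cat(padded, dim=0)

    prot_batch = torch.stack(prot_feats)                  # (batch_size, C_p)
    label_batch = torch.tensor(labels, dtype=torch.float)
    return batched_graph, batched_adj, prot_batch, label_batch
```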
2.1.4. Compound Feature Extraction Model: MPNN-GNN
We utilized a Message Passing Neural Network (MPNN) implemented in the Deep Graph Library (DGL) to extract representations from molecular graphs. The model comprises two stages:
Message Passing Phase - In each graph convolution layer, nodes aggregate information from their neighbors through a message function that encodes and transmits local chemical context.
Node Update Phase - Each node updates its state by integrating its current features with the aggregated messages using an update function. Node features are first linearly transformed and activated via a non-linear function to enhance representational capacity. Six message-passing iterations are performed to capture extended graph dependencies and refine molecular representations.
The node update in each graph convolution layer is formalized as:$$h_i^{l+1} = h_i^l + \sum_{j \in \mathcal{N}(i)} f_{\Theta}(e_{ij}) \cdot h_j^l$$
where:
\(h_i^l\): feature vector of node i at layer l
\(h_j^l\): feature vector of neighbor node j at layer l
\(e_{ij}\): edge feature between nodes i and j (chemical bond)
\(f_\Theta(e_{ij})\): learnable function that weights neighbor information according to edge features
\(\mathcal{N}(i)\): set of neighbors of node i
After processing through the MPNN layers, a readout function generates a fixed-size 64-dimensional graph-level embedding for each compound. These embeddings serve as input for downstream protein–compound feature fusion and activity prediction tasks.
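DGL-LifeSci ships an MPNNGNN module whose defaults (64-dimensional node states, six message-passing steps) match the configuration above; the sketch below pairs it with a mean-pooling readout, which is our assumption (other readouts such as Set2Set are equally plausible).

```python
# Compound encoder sketch: DGL-LifeSci MPNNGNN + mean-pooling readout.
import dgl
import torch
from dgllife.model.gnn import MPNNGNN

class CompoundEncoder(torch.nn.Module):
    def __init__(self, node_in_feats=74, edge_in_feats=12):
        super().__init__()
        # Six message-passing iterations over 64-dim node states
        self.gnn = MPNNGNN(node_in_feats=node_in_feats,
                           edge_in_feats=edge_in_feats,
                           node_out_feats=64,
                           num_step_message_passing=6)

    def forward(self, g):
        h = self.gnn(g, g.ndata['h'], g.edata['e'])  # refined per-atom features
        g.ndata['h_out'] = h
        return dgl.mean_nodes(g, 'h_out')            # 64-dim graph embedding
```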
Figure 4. Feature Extraction and Data Processing Pipeline
2.2 Model Training and Optimization
2.2.1. Feature Fusion Model
To effectively integrate compound and protein features for enhanced predictive performance, we implemented the hybrid-GNN structural feature extraction approach proposed by Yang Hua et al. This method employs a one-dimensional depthwise convolution (DWC) module to jointly model protein and compound representations at the feature level. Compared with simple concatenation, this architecture better captures local correlations and complementarities between the two feature types, generating more discriminative fused representations for downstream tasks.
Given the extracted feature matrices:
Target protein feature matrix: \(F_p \in \mathbb{R}^{L_p \times C}\)
Compound graph feature matrix: \(F_d \in \mathbb{R}^{L_d \times C}\)
Compound adjacency matrix: \(A_d \in \mathbb{R}^{L_d \times L_d}\)
where \(C\) is the shared feature dimension after projection (64).
We extract structural drug–target interaction (DTI) features. The protein feature matrix is treated as a chain of amino acid nodes, with adjacency defined by sequential neighbors. A 1D depthwise convolution with kernel size 3 is applied to capture structural relationships among adjacent amino acids.
An interaction matrix \(C \in R^{L_p \times L_d}\) between protein residues and compound atoms is constructed as:$$C = \mathrm{Softmax}(F_{p} F_{d}^{T})$$
The Softmax function normalizes each row into a probability distribution:$$\sigma(z_i)=\frac{e^{z_i}}{\sum_{j=1}^ne^{z_j}}$$
Here, \(e^{z_i}\) denotes the exponential applied to each element \(z_i\), ensuring that all values are positive, and \(\sum_{j=1}^{n} e^{z_j}\) is the normalizing sum that guarantees the elements of the output vector sum to 1. The resulting interaction matrix \(C\) represents the potential interactions between amino acid residues of the protein and atoms of the compound.
By combining the protein adjacency matrix \(A_p\), compound adjacency matrix \(A_d\), and interaction matrix \(C\), a complete GNN computation is formulated:$$F_s=\begin{bmatrix}A_p&C\\C^T&A_d\end{bmatrix}\begin{bmatrix}F_p\\F_d\end{bmatrix}$$
This operation integrates drug-target interaction information as a bridge connecting protein and compound graphs. Since protein sequences cannot form atom-level graphs directly, the 1D DWC extracts sequential relationships:$$A_pF_p=DWC(F_p)$$
Where \(A_p\) is a hypothetical adjacency among residues, and DWC uses a \(3 \times 1\) convolution to capture neighboring residue information. The updated protein and compound features are expressed as:$$\delta(F_p)=DWC(\omega(F_p))+C\omega(F_d)+F_p$$$$\delta(F_d)=C^T\omega(F_p)+A_d\omega(F_d)+F_d$$
Here, \(\omega\) denotes a linear transformation followed by a non-linear activation, while \(C\) and \(C^T\) propagate interaction information. After N rounds of GNN iteration, the final fused structural features are obtained via concatenation:$$\mathrm{F_s=Concatenate}\left(\delta(\mathrm{F_p}),\delta(\mathrm{F_d})\right)$$
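One fusion round from these equations can be sketched as follows; the 64-dimensional width, the ReLU activation for \(\omega\), and the use of a grouped Conv1d as the depthwise convolution are assumptions.

```python
# One round of the fusion GNN: implements
#   delta(F_p) = DWC(w(F_p)) + C * w(F_d) + F_p
#   delta(F_d) = C^T * w(F_p) + A_d * w(F_d) + F_d
import torch
import torch.nn as nn
import torch.nn.functional as F

class FusionLayer(nn.Module):
    def __init__(self, c=64):
        super().__init__()
        self.omega_p = nn.Linear(c, c)   # linear + non-linear transform (omega)
        self.omega_d = nn.Linear(c, c)
        # Depthwise 1D convolution, kernel 3: each channel mixes only
        # neighboring residues, standing in for A_p F_p
        self.dwc = nn.Conv1d(c, c, kernel_size=3, padding=1, groups=c)

    def forward(self, f_p, f_d, a_d):
        # f_p: (L_p, C) residue features; f_d: (L_d, C) atom features
        inter = torch.softmax(f_p @ f_d.T, dim=-1)        # C = Softmax(F_p F_d^T)
        wp = F.relu(self.omega_p(f_p))
        wd = F.relu(self.omega_d(f_d))
        f_p_new = self.dwc(wp.T.unsqueeze(0)).squeeze(0).T + inter @ wd + f_p
        f_d_new = inter.T @ wp + a_d @ wd + f_d
        return f_p_new, f_d_new
```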
2.2.2. Activity Prediction Module
After obtaining the fused substrate-target representations, a fully connected neural network (FCNN) was employed as the prediction module to estimate molecular activity. The network architecture is structured as follows:
Input layer: A linear transformation maps the 64-dimensional fused feature vector to 128 dimensions, followed by a ReLU activation function to introduce nonlinearity.
Hidden layer 1: The transformed features are further expanded to 256 dimensions and processed through an additional ReLU activation.
Hidden layer 2: A subsequent linear transformation reduces the feature dimensionality to 64, followed by ReLU activation and a Dropout layer to prevent overfitting.
Output layer: The final linear layer projects the feature representation onto a single-dimensional output space, suitable for either regression or classification tasks.
This hierarchical structure facilitates nonlinear feature interactions between the substrate and protein representations, thereby enhancing the model’s ability to capture complex biochemical relationships and generate task-specific activity predictions effectively.
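A minimal PyTorch sketch of this head as described (64 → 128 → 256 → 64 → 1); the dropout rate is an assumption.

```python
# Activity prediction head: fully connected network over fused features.
import torch.nn as nn

activity_head = nn.Sequential(
    nn.Linear(64, 128), nn.ReLU(),                  # input layer
    nn.Linear(128, 256), nn.ReLU(),                 # hidden layer 1
    nn.Linear(256, 64), nn.ReLU(), nn.Dropout(0.1), # hidden layer 2
    nn.Linear(64, 1),                               # scalar activity output
)
```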
2.2.3. Loss Function, Bayesian Hyperparameter Optimization, and Early Stopping
Optimizer and Loss Function:
The model was implemented and trained using the PyTorch framework. We adopted the AdamW optimizer, which improves upon standard Adam by decoupling weight decay from the gradient-based update step, applying the decay directly to the model parameters; this enhances training stability and generalization performance.
For regression tasks, we used the Mean Squared Error (MSE) loss, defined as:$$\mathrm{MSELoss}(y_{\mathrm{true}},y_{\mathrm{pred}})=\frac{1}{N}\sum_{i=1}^N(y_{\mathrm{true}}^{(i)}-y_{\mathrm{pred}}^{(i)})^2$$
Here, \(y_\text{true}\) denotes the true labels, and \(y_\text{pred}\) represents the model’s predicted values. MSELoss captures the average squared deviation between predicted and true values, which also makes it sensitive to outliers.
Early Stopping:
During model training, the validation loss was continuously monitored, and early stopping was applied when no improvement was observed over 50 consecutive epochs. This strategy effectively mitigates overfitting and ensures that the final model corresponds to the checkpoint exhibiting the optimal validation performance.
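The combination of AdamW, MSE loss, and patience-based early stopping can be sketched as follows; train_one_epoch and evaluate are hypothetical helpers standing in for the usual batch iteration, and the learning rate and weight decay shown are the optimized values reported in the Results.

```python
# Training loop sketch with early stopping (patience = 50 epochs on val loss).
import copy
import torch

def fit(model, train_loader, val_loader, train_one_epoch, evaluate,
        max_epochs=500, patience=50, lr=2.275e-4, weight_decay=1.036e-4):
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr,
                                  weight_decay=weight_decay)
    loss_fn = torch.nn.MSELoss()
    best_val, best_state, wait = float('inf'), None, 0
    for epoch in range(max_epochs):
        train_one_epoch(model, train_loader, optimizer, loss_fn)
        val_loss = evaluate(model, val_loader, loss_fn)
        if val_loss < best_val:                      # improvement: reset counter
            best_val, wait = val_loss, 0
            best_state = copy.deepcopy(model.state_dict())
        else:
            wait += 1
            if wait >= patience:                     # 50 epochs w/o improvement
                break
    model.load_state_dict(best_state)                # restore best checkpoint
    return model, best_val
```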
Bayesian Hyperparameter Optimization:
Gaussian process-based Bayesian optimization was employed to automate the tuning of hyperparameters, with validation loss designated as the objective function. Critical parameters such as the learning rate, weight decay, and number of training epochs were optimized. To improve search efficiency, a two-stage strategy was adopted: an initial coarse-grained global search explored the parameter space broadly, followed by a fine-grained local search focused on the most promising regions. This hierarchical procedure efficiently identified an optimal hyperparameter configuration, enhancing model generalization and predictive robustness.
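A sketch of such a search using scikit-optimize's gp_minimize over the bounds reported in the Results; train_and_validate is a hypothetical helper returning the validation loss for a given configuration.

```python
# Gaussian-process Bayesian hyperparameter search sketch.
from skopt import gp_minimize
from skopt.space import Real, Integer, Categorical

space = [
    Real(1e-4, 1e-3, name='learning_rate'),
    Real(1e-4, 1e-3, name='weight_decay'),
    Categorical([64, 128], name='batch_size'),
    Integer(50, 80, name='epochs'),
]

def objective(params):
    lr, wd, bs, n_epochs = params
    # Hypothetical helper: trains the model and returns validation loss
    return train_and_validate(lr=lr, weight_decay=wd,
                              batch_size=bs, epochs=n_epochs)

result = gp_minimize(objective, space, n_calls=40, random_state=0)
print('best params:', result.x, 'best val loss:', result.fun)
```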
2.3. Data Sources
To facilitate model training, distinct datasets were constructed for different tasks. Experimentally measured kcat values were extracted from the BRENDA and SABIO-RK databases. Each record included the enzyme amino acid sequence, substrate name, and corresponding kinetic parameters. Whenever possible, enzyme sequences were mapped to a unique UniProt ID. For each substrate, the molecular name was used to retrieve the corresponding SMILES representation from the KEGG and PubChem databases. Samples that could not be unambiguously mapped to a valid SMILES string were excluded.
Due to the limitation of the ESM2 model in handling sequences longer than 1,022 amino acids, samples exceeding this threshold were removed. Variant-containing samples were retained, while the top and bottom 10% of extreme outlier values were filtered out. Ultimately, a total of 29,102 samples were collected for kcat, including 744 entries belonging to EC 4.1.1.x (decarboxylases). The dataset was randomly partitioned into training, validation, and test sets at ratios of 80%, 10%, and 10%, respectively.
Results
1. Model Training
During model construction, Bayesian optimization was employed for hyperparameter search. Based on prior training experience, the search space was constrained as follows: learning_rate within the range \([1 \times 10^{-4}\), \(1 \times 10^{-3}\)] (linear space), weight_decay within the range \([1 \times 10^{-4}\), \(1 \times 10^{-3}\)] (linear space), batch_size selected from \(\{64, 128\}\), and epochs between 50 and 80. Training yielded the optimal hyperparameter combination: learning_rate = 2.275 × 10⁻⁴, weight_decay = 1.036 × 10⁻⁴, batch_size = 128, epochs = 54.
After training, model performance on the test set was evaluated using multiple regression metrics, including mean squared error (MSE), root mean squared error (RMSE), mean absolute error (MAE), and adjusted coefficient of determination (Adjusted \(R^{2}\)). Their definitions are as follows:$$\mathrm{MSE}=\frac{1}{n}\sum_{i=1}^n(y_i-\widehat{y}_i)^2$$$$\mathrm{RMSE}=\sqrt{\frac{1}{n}\sum_{i=1}^n(y_i-\widehat{y}_i)^2}$$$$\mathrm{MAE}=\frac{1}{n}\sum_{i=1}^{n}|y_{i}-\widehat{y}_{i}|$$$$\mathrm{Adjusted~}R^2=1-(1-R^2)\frac{n-1}{n-p-1},\quad R^2=1-\frac{\sum_{i=1}^n(y_i-\widehat{y}_i)^2}{\sum_{i=1}^n(y_i-\bar{y})^2}$$
Where n denotes the number of samples, p represents the number of model input features, \(y_i\) and \(\hat{y}_i\) correspond to the true and predicted values, respectively, and \(\bar{y}\) is the mean of the true values.
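These metrics can be computed directly, for example with scikit-learn; p is the number of model input features, as above.

```python
# Test-set evaluation sketch for the four regression metrics.
import numpy as np
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

def evaluate_regression(y_true, y_pred, p):
    mse = mean_squared_error(y_true, y_pred)
    rmse = np.sqrt(mse)
    mae = mean_absolute_error(y_true, y_pred)
    r2 = r2_score(y_true, y_pred)
    n = len(y_true)
    adj_r2 = 1 - (1 - r2) * (n - 1) / (n - p - 1)  # adjusted R^2
    return {'MSE': mse, 'RMSE': rmse, 'MAE': mae, 'Adjusted R2': adj_r2}
```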
After completing the general pretraining phase, the model was fine-tuned using lysine decarboxylase (LDC)-specific data to enhance its predictive capability within this enzyme family. As shown in Table 1, fine-tuning systematically improved the key performance metrics on the LDC test set, with the adjusted \(R^{2}\) increasing from 0.3253 to 0.4927. This indicates that domain knowledge transfer was effective and that the model has begun to capture functional features specific to the LDC family.
Table 1. Performance of the model on the general test set and the lysine decarboxylase (LDC) test set
| | MSE | RMSE | MAE | Adjusted R² |
| --- | --- | --- | --- | --- |
| General-purpose dataset | 0.3747 | 0.4101 | 0.3110 | 0.3253 |
| LDC dataset | 0.3069 | 0.3654 | 0.2395 | 0.4927 |
2. Experimental Validation Reveals Model-Experiment Discrepancy
Despite the promising computational performance of our fine-tuned model, we proceeded to experimentally characterize the top-predicted CadA mutants for lysine decarboxylase activity. Unexpectedly, none of these in silico-designed variants showed a significant improvement over the wild-type enzyme. This critical finding underscores the challenge of translating in silico predictions into wet-lab outcomes and highlights a fundamental gap between static structure-based predictions and the complex reality of enzymatic function in a cellular environment.
We attribute this discrepancy to two primary factors: 1) The limited amount of high-quality, enzyme-specific training data, which restricts the model's precision; and 2) The model's inherent inability to account for dynamic effects such as conformational flexibility and solvation, which are crucial for catalysis.
Designing Interfacial Disulfide Bonds to Improve CadA Activity
Methods
The divergence between our model's predictions and experimental results underscored the challenges of direct activity prediction. We therefore adopted an alternative, structure-guided strategy based on a well-established protein engineering principle: enhancing enzyme stability often improves functional performance.
To implement this, we introduced engineered disulfide bonds at the dimer interface of CadA to increase stability. Using the Maestro-BioLuminate platform, we adopted a three-step strategy: disulfide scanning, mechanics-based scoring, and literature-based validation.
A systematic disulfide bond scan was carried out. Candidate residue pairs at the dimer interface were screened according to geometric and accessibility criteria: interatomic distances between 4 and 7 Å and solvent-accessible surface areas (SASA) < 80 Ų. Simulated double-cysteine mutations generated 80 candidate residue pairs, which were subsequently filtered by weighted scoring. Pairs with weighted scores (WS) > 10,000 were excluded, yielding nine potential disulfide-bonding sites.
Finally, we cross-referenced these candidates with experimental reports of thermostability-enhancing mutations published between 2018 and 2024. Only pairs that satisfied interfacial positioning, disulfide bond geometry, and literature evidence were retained. Based on this integrative analysis, five disulfide mutants were designed: F14C/K44C, V12C/K41C, F14C/L45C, L93C/E445C, and Y13C/P36C. In addition, the interfacial point mutation E104K was identified by virtual screening as potentially stabilizing the enzyme.
Results
Wet-lab experiments demonstrated that the engineered disulfide bond mutants F14C/K44C, V12C/K41C, F14C/L45C, L93C/E445C, and Y13C/P36C, along with the literature-supported variants V12C/K44C, F14C/D41C, and L89C/E445C, exhibited significantly enhanced enzymatic activity. Among these, the Y13C/P36C mutant displayed the most pronounced improvement in catalytic activity.
To investigate the structural basis of this enhancement, molecular dynamics (MD) simulations were performed for both the wild-type enzyme and the Y13C/P36C mutant. The results revealed a substantial reduction in Root-Mean-Square Fluctuations (RMSF) in the mutant, indicating enhanced structural stability that likely underlies its higher activity.
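For reference, a per-residue RMSF profile of this kind can be computed with MDAnalysis as sketched below; the topology and trajectory file names are hypothetical.

```python
# Per-residue RMSF sketch: align trajectory, then measure C-alpha fluctuations.
import MDAnalysis as mda
from MDAnalysis.analysis import align, rms

u = mda.Universe('y13c_p36c.pdb', 'y13c_p36c.xtc')  # hypothetical mutant files

# Align the trajectory to the average structure before measuring fluctuations
avg = align.AverageStructure(u, select='name CA').run()
align.AlignTraj(u, avg.results.universe, select='name CA', in_memory=True).run()

calphas = u.select_atoms('name CA')
rmsf = rms.RMSF(calphas).run()
print(rmsf.results.rmsf)   # per-residue fluctuation profile (Å)
# Repeat for the wild-type trajectory and compare the two profiles.
```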
Discussion
Limitations of Model Predictions and Discrepancies with Experimental Validation
Our initial in silico MPNN-GNN–based model showed moderate predictive power for LDC variants (Adjusted R² = 0.4927), but experimental validation revealed that predicted high-activity mutants did not reach expected performance. These discrepancies likely stem from the model's reliance on generalized training data, static structural assumptions, and the incomplete representation of long-range dynamic effects that influence enzyme catalysis.
To overcome these gaps, we integrated disulfide bond scanning with literature-based validation, identifying key interfacial residues that contribute to enhanced stability and activity. Future work will combine site-directed mutagenesis, molecular dynamics, and high-throughput experimental validation into an iterative design loop, continuously refining the predictive model and further improving both the stability and catalytic efficiency of CadA variants.