Methods
1. Key Site Selection
To computationally design experimentally testable lysine decarboxylase (CadA) variants with enhanced catalytic efficiency, we established a structure-based rational design pipeline. Focusing on the functional decameric assembly, we retrieved the CadA structure (PDB ID: 6YN5). A representative dimer extracted from the decamer was used to focus docking on the active pocket while preserving key interfacial residues. Molecular docking was performed in Schrödinger Maestro: a 12 Å docking box was centered on the pyridoxal-5′-phosphate (PLP) cofactor, and lysine was docked into the active site.
Analysis of the resulting poses across both protomers of the dimer revealed key residues forming hydrophobic contacts, hydrogen bonds, and a salt bridge with the substrate. In chain G, lysine interacts hydrophobically with W333, forms hydrogen bonds with E525, K526, and G654, and forms an electrostatic interaction with H245. In chain B, lysine interacts hydrophobically with I182 and forms hydrogen bonds with H245, K246, K527, Y652, and G655 (Figure 1).
Figure 1. 2D diagram of interactions between lysine and CadA
Based on the docking analysis, we identified seven residues (I182, H245, K246, W333, E526, K527, Y652) as high-priority targets for mutagenesis. These positions, which directly contact the substrate or cooperate with the PLP cofactor, are predicted to critically modulate substrate binding affinity and catalytic turnover. This candidate list provides a focused and experimentally testable guide for subsequent site-saturation mutagenesis and directed evolution.
2. Model Construction
Our deep learning framework, illustrated in Figure 2, integrates four components: a target protein feature extraction model, a compound feature extraction model, a protein–compound feature fusion model, and an activity prediction model. By combining the pre-trained ESM2 protein language model with a DGL-based MPNN compound feature extractor, the protein–compound fusion model predicts the catalytic activity of ligand compounds against the target protein.
Figure 2. Flowchart of the model
2.1 Model Architecture
2.1.1. Preprocessing
Compound Data Processing
The complete dataset was partitioned into training, validation, and test sets in an 8:1:1 ratio. Molecular structures were converted into graph representations using the DGL-LifeSci toolkit, from which atom-level and bond-level features were extracted. The atom feature matrix has dimensions \(L \times C\), where \(L\) is the number of atoms in the molecule and \(C = 74\) is the feature dimension per atom. The bond feature matrix has dimensions \(E \times C\), where \(E\) is the number of bonds and \(C = 12\) is the feature dimension per bond. The adjacency matrix (\(L \times L\)), computed using RDKit, encodes atomic connectivity. This preprocessing pipeline produces graph-structured representations and corresponding adjacency matrices as inputs for subsequent feature extraction and learning.
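A minimal sketch of this preprocessing step, assuming DGL-LifeSci's canonical featurizers (which yield exactly the 74-dimensional atom and 12-dimensional bond features described above); the feature field names 'h' and 'e' are our own choices.

```python
# Compound preprocessing sketch: SMILES -> DGL graph + adjacency matrix.
from rdkit import Chem
from rdkit.Chem import rdmolops
from dgllife.utils import (smiles_to_bigraph, CanonicalAtomFeaturizer,
                           CanonicalBondFeaturizer)

def featurize_compound(smiles):
    """Convert a SMILES string into a DGL graph and an adjacency matrix."""
    g = smiles_to_bigraph(
        smiles,
        node_featurizer=CanonicalAtomFeaturizer(atom_data_field='h'),  # L x 74
        edge_featurizer=CanonicalBondFeaturizer(bond_data_field='e'),  # 12-dim per edge
    )
    mol = Chem.MolFromSmiles(smiles)
    adj = rdmolops.GetAdjacencyMatrix(mol)                             # L x L
    return g, adj

g, adj = featurize_compound('NCCCCC(N)C(=O)O')  # L-lysine
print(g.ndata['h'].shape, g.edata['e'].shape, adj.shape)
```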
Protein Mutation Encoding
To encode protein mutation information, a mutation-specific feature encoding strategy was designed. Each amino acid is represented by its single-letter code. The original and mutated residues are separately one-hot encoded (20 dimensions each) and combined with the mutation position index, forming a vector that provides input features for the deep learning model alongside the compound graph structures (see Figure 3).
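A small illustration of this encoding, assuming the 20-letter alphabet in alphabetical order and a 20 + 20 + 1 layout; the exact layout in our implementation may differ.

```python
# Mutation encoding sketch: one-hot wild-type residue, one-hot mutant
# residue, plus the mutation position index.
import numpy as np

AMINO_ACIDS = 'ACDEFGHIKLMNPQRSTVWY'
AA_INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}

def encode_mutation(wt_aa, mut_aa, position):
    """Return a 41-dim vector: [wt one-hot | mutant one-hot | position]."""
    wt_vec = np.zeros(20)
    wt_vec[AA_INDEX[wt_aa]] = 1.0
    mut_vec = np.zeros(20)
    mut_vec[AA_INDEX[mut_aa]] = 1.0
    return np.concatenate([wt_vec, mut_vec, [float(position)]])

feat = encode_mutation('Y', 'C', 13)   # e.g. the Y13C substitution
print(feat.shape)                      # (41,)
```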
2.1.2. Target Protein Feature Extraction Module
We employed the pre-trained ESM-2 model (esm2_t33_650M_UR50D) from Meta AI. The amino acid sequence was input into the model, and the final hidden representations were averaged across residues to obtain a global 1280-dimensional protein embedding (the output width of this checkpoint). To align with the compound feature dimension for downstream fusion, the embedding was passed through a projection network: a linear layer (1280→256) with ReLU and Dropout, followed by another linear layer (256→64), preserving the semantic information essential for protein–compound matching (Figure 3).
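The extraction can be sketched as follows with the fair-esm package; the sequence shown is a short placeholder fragment, not the full CadA sequence, and the dropout rate is an assumption.

```python
# Global protein embedding sketch: ESM-2 residue-averaged representation
# followed by the projection head described above.
import torch
import esm

model, alphabet = esm.pretrained.esm2_t33_650M_UR50D()
model.eval()
batch_converter = alphabet.get_batch_converter()

_, _, tokens = batch_converter([("CadA_fragment", "MNVIAILNHMGVYFKEE")])
with torch.no_grad():
    out = model(tokens, repr_layers=[33])
reps = out["representations"][33]       # (1, seq_len + 2, 1280) incl. BOS/EOS
emb = reps[0, 1:-1].mean(dim=0)         # residue-averaged global embedding

# Projection network aligning the embedding with the compound feature width
proj = torch.nn.Sequential(
    torch.nn.Linear(1280, 256), torch.nn.ReLU(), torch.nn.Dropout(0.1),
    torch.nn.Linear(256, 64),
)
protein_feat = proj(emb)                # 64-dim protein representation
```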
Figure 3. Data Processing Workflow
2.1.3. DataLoader
To improve training efficiency and data handling, all preprocessed protein features and compound graph structures were fed into PyTorch’s DataLoader module. A custom collate_fn function was implemented to accommodate the complex data structures (see Figure 4). The processing workflow is as follows:
Compound graph batching: Using the DGL library, nodes and edges of molecular graphs were batched. The dgl.batch method merges multiple molecular graphs into a single batched_graph, automatically remapping node and edge indices to maintain graph consistency and identifiability.
Feature batching: Node features, edge features, and protein features were concatenated along the batch dimension using torch.cat, forming unified tensors. The resulting feature dimensions are:$$\left( \sum_{i=1}^{\mathit{batchsize}} L_i, C \right)$$
where \(L_i\) is the sequence length of the i-th sample, and \(C\) is the hidden dimension.
Adjacency matrix batching: To handle molecules of varying sizes, adjacency matrices were zero-padded to the maximum node count \(L_{\mathrm{max}}\) within each batch. The padded matrices were then concatenated into a unified tensor of shape:$$\left(L_{\mathrm{max}} \times \mathit{batchsize},\; L_{\mathrm{max}}\right)$$
This ensures dimensional consistency across the batch, enabling efficient downstream computation.
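A condensed sketch of such a collate_fn; each sample is assumed to be a (graph, adjacency, protein_feature, label) tuple, and the tuple layout is illustrative.

```python
# Custom collate_fn sketch: batches graphs, padded adjacency matrices,
# protein features, and labels.
import dgl
import torch

def collate_fn(samples):
    graphs, adjs, prot_feats, labels = map(list, zip(*samples))

    # dgl.batch merges the molecular graphs and remaps node/edge indices
    batched_graph = dgl.batch(graphs)

    # Zero-pad each adjacency matrix to the largest molecule in the batch,
    # then concatenate along dim 0 -> (L_max * batch_size, L_max)
    l_max = max(a.shape[0] for a in adjs)
    padded = []
    for a in adjs:
        p = torch.zeros(l_max, l_max)
        n = a.shape[0]
        p[:n, :n] = torch.as_tensor(a, dtype=torch.float)
        padded.append(p)
    batched_adj = torch.cat(padded, dim=0)

    prot_batch = torch.stack(prot_feats)                  # (batch_size, C_p)
    label_batch = torch.tensor(labels, dtype=torch.float)
    return batched_graph, batched_adj, prot_batch, label_batch
```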
2.1.4. Compound Feature Extraction Model: MPNN-GNN
We utilized a Message Passing Neural Network (MPNN) implemented in the Deep Graph Library (DGL) to extract representations from molecular graphs. The model comprises two stages:
Message Passing Phase - In each graph convolution layer, nodes aggregate information from their neighbors through a message function that encodes and transmits local chemical context.
Node Update Phase - Each node updates its state by integrating its current features with the aggregated messages using an update function. Node features are first linearly transformed and activated via a non-linear function to enhance representational capacity. Six message-passing iterations are performed to capture extended graph dependencies and refine molecular representations.
The node update in each graph convolution layer is formalized as:$$h_i^{l+1} = h_i^l + \sum_{j \in \mathcal{N}(i)} f_{\Theta}(e_{ij}) \cdot h_j^l$$
where:
\(h_i^l\): feature vector of node i at layer l
\(h_j^l\): feature vector of neighbor node j at layer l
\(e_{ij}\): edge feature between nodes i and j (chemical bond)
\(f_\Theta(e_{ij})\): learnable function that weights neighbor information according to edge features
\(\mathcal{N}(i)\): set of neighbors of node i
After processing through the MPNN layers, a readout function generates a fixed-size 64-dimensional graph-level embedding for each compound. These embeddings serve as input for downstream protein–compound feature fusion and activity prediction tasks.
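DGL-LifeSci ships an MPNNGNN module whose defaults (64-dimensional node states, six message-passing steps) match the configuration above; the sketch below pairs it with a mean-pooling readout, which is our assumption (other readouts such as Set2Set are equally plausible).

```python
# Compound encoder sketch: DGL-LifeSci MPNNGNN + mean-pooling readout.
import dgl
import torch
from dgllife.model.gnn import MPNNGNN

class CompoundEncoder(torch.nn.Module):
    def __init__(self, node_in_feats=74, edge_in_feats=12):
        super().__init__()
        # Six message-passing iterations over 64-dim node states
        self.gnn = MPNNGNN(node_in_feats=node_in_feats,
                           edge_in_feats=edge_in_feats,
                           node_out_feats=64,
                           num_step_message_passing=6)

    def forward(self, g):
        h = self.gnn(g, g.ndata['h'], g.edata['e'])  # refined per-atom features
        g.ndata['h_out'] = h
        return dgl.mean_nodes(g, 'h_out')            # 64-dim graph embedding
```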
Figure 4. Feature Extraction and Data Processing Pipeline
2.2 Model Training and Optimization
2.2.1. Feature Fusion Model
To effectively integrate compound and protein features for enhanced predictive performance, we implemented the hybrid-GNN structural feature extraction approach proposed by Yang Hua et al. This method employs a one-dimensional depthwise convolution (DWC) module to jointly model protein and compound representations at the feature level. Compared with simple concatenation, this architecture better captures local correlations and complementarities between the two feature types, generating more discriminative fused representations for downstream tasks.
Given the extracted feature matrices:
Target protein feature matrix: \(F_p \in \mathbb{R}^{L_p \times C}\)
Compound graph feature matrix: \(F_d \in \mathbb{R}^{L_d \times C}\)
Compound adjacency matrix: \(A_d \in \mathbb{R}^{L_d \times L_d}\)
where \(C\) is the shared feature dimension after projection (64).
We extract structural drug–target interaction (DTI) features. The protein feature matrix is treated as a chain of amino acid nodes, with adjacency defined by sequential neighbors. A 1D depthwise convolution with kernel size 3 is applied to capture structural relationships among adjacent amino acids.
An interaction matrix \(C \in R^{L_p \times L_d}\) between protein residues and compound atoms is constructed as:$$C = \mathrm{Softmax}(F_{p} F_{d}^{T})$$
The Softmax function normalizes each row into a probability distribution:$$\sigma(z_i)=\frac{e^{z_i}}{\sum_{j=1}^ne^{z_j}}$$
Here, \(e^{z_i}\) denotes the exponential applied to each element \(z_i\), ensuring that all values are positive, and \(\sum_{j=1}^{n} e^{z_j}\) is the normalizing sum that guarantees the elements of the output vector sum to 1. The resulting interaction matrix \(C\) represents the potential interactions between amino acid residues of the protein and atoms of the compound.
By combining the protein adjacency matrix \(A_p\), compound adjacency matrix \(A_d\), and interaction matrix \(C\), a complete GNN computation is formulated:$$F_s=\begin{bmatrix}A_p&C\\C^T&A_d\end{bmatrix}\begin{bmatrix}F_p\\F_d\end{bmatrix}$$
This operation integrates drug-target interaction information as a bridge connecting protein and compound graphs. Since protein sequences cannot form atom-level graphs directly, the 1D DWC extracts sequential relationships:$$A_pF_p=DWC(F_p)$$
Where \(A_p\) is a hypothetical adjacency among residues, and DWC uses a \(3 \times 1\) convolution to capture neighboring residue information. The updated protein and compound features are expressed as:$$\delta(F_p)=DWC(\omega(F_p))+C\omega(F_d)+F_p$$$$\delta(F_d)=C^T\omega(F_p)+A_d\omega(F_d)+F_d$$
Here, \(\omega\) denotes a linear transformation followed by a non-linear activation, while \(C\) and \(C^T\) propagate interaction information. After N rounds of GNN iteration, the final fused structural features are obtained via concatenation:$$\mathrm{F_s=Concatenate}\left(\delta(\mathrm{F_p}),\delta(\mathrm{F_d})\right)$$
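One fusion round from these equations can be sketched as follows; the 64-dimensional width, the ReLU activation for \(\omega\), and the use of a grouped Conv1d as the depthwise convolution are assumptions.

```python
# One round of the fusion GNN: implements
#   delta(F_p) = DWC(w(F_p)) + C * w(F_d) + F_p
#   delta(F_d) = C^T * w(F_p) + A_d * w(F_d) + F_d
import torch
import torch.nn as nn
import torch.nn.functional as F

class FusionLayer(nn.Module):
    def __init__(self, c=64):
        super().__init__()
        self.omega_p = nn.Linear(c, c)   # linear + non-linear transform (omega)
        self.omega_d = nn.Linear(c, c)
        # Depthwise 1D convolution, kernel 3: each channel mixes only
        # neighboring residues, standing in for A_p F_p
        self.dwc = nn.Conv1d(c, c, kernel_size=3, padding=1, groups=c)

    def forward(self, f_p, f_d, a_d):
        # f_p: (L_p, C) residue features; f_d: (L_d, C) atom features
        inter = torch.softmax(f_p @ f_d.T, dim=-1)        # C = Softmax(F_p F_d^T)
        wp = F.relu(self.omega_p(f_p))
        wd = F.relu(self.omega_d(f_d))
        f_p_new = self.dwc(wp.T.unsqueeze(0)).squeeze(0).T + inter @ wd + f_p
        f_d_new = inter.T @ wp + a_d @ wd + f_d
        return f_p_new, f_d_new
```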
2.2.2. Activity Prediction Module
After obtaining the fused substrate-target representations, a fully connected neural network (FCNN) was employed as the prediction module to estimate molecular activity. The network architecture is structured as follows:
Input layer: A linear transformation maps the 64-dimensional fused feature vector to 128 dimensions, followed by a ReLU activation function to introduce nonlinearity.
Hidden layer 1: The transformed features are further expanded to 256 dimensions and processed through an additional ReLU activation.
Hidden layer 2: A subsequent linear transformation reduces the feature dimensionality to 64, followed by ReLU activation and a Dropout layer to prevent overfitting.
Output layer: The final linear layer projects the feature representation onto a single-dimensional output space, suitable for either regression or classification tasks.
This hierarchical structure facilitates nonlinear feature interactions between the substrate and protein representations, thereby enhancing the model’s ability to capture complex biochemical relationships and generate task-specific activity predictions effectively.
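A minimal PyTorch sketch of this head as described (64 → 128 → 256 → 64 → 1); the dropout rate is an assumption.

```python
# Activity prediction head: fully connected network over fused features.
import torch.nn as nn

activity_head = nn.Sequential(
    nn.Linear(64, 128), nn.ReLU(),                  # input layer
    nn.Linear(128, 256), nn.ReLU(),                 # hidden layer 1
    nn.Linear(256, 64), nn.ReLU(), nn.Dropout(0.1), # hidden layer 2
    nn.Linear(64, 1),                               # scalar activity output
)
```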
2.2.3. Loss Function, Bayesian Hyperparameter Optimization, and Early Stopping
Optimizer and Loss Function:
The model was implemented and trained using the PyTorch framework. We adopted the AdamW optimizer, which improves upon standard Adam by decoupling weight decay from the gradient-based update step, applying the decay directly to the model parameters; this enhances training stability and generalization performance.
For regression tasks, we used the Mean Squared Error (MSE) loss, defined as:$$\mathrm{MSELoss}(y_{\mathrm{true}},y_{\mathrm{pred}})=\frac{1}{N}\sum_{i=1}^N(y_{\mathrm{true}}^{(i)}-y_{\mathrm{pred}}^{(i)})^2$$
Here, \(y_\text{true}\) denotes the true labels, and \(y_\text{pred}\) represents the model’s predicted values. MSELoss captures the average squared deviation between predicted and true values, which also makes it sensitive to outliers.
Early Stopping:
During model training, the validation loss was continuously monitored, and early stopping was applied when no improvement was observed over 50 consecutive epochs. This strategy effectively mitigates overfitting and ensures that the final model corresponds to the checkpoint exhibiting the optimal validation performance.
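The combination of AdamW, MSE loss, and patience-based early stopping can be sketched as follows; train_one_epoch and evaluate are hypothetical helpers standing in for the usual batch iteration, and the learning rate and weight decay shown are the optimized values reported in the Results.

```python
# Training loop sketch with early stopping (patience = 50 epochs on val loss).
import copy
import torch

def fit(model, train_loader, val_loader, train_one_epoch, evaluate,
        max_epochs=500, patience=50, lr=2.275e-4, weight_decay=1.036e-4):
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr,
                                  weight_decay=weight_decay)
    loss_fn = torch.nn.MSELoss()
    best_val, best_state, wait = float('inf'), None, 0
    for epoch in range(max_epochs):
        train_one_epoch(model, train_loader, optimizer, loss_fn)
        val_loss = evaluate(model, val_loader, loss_fn)
        if val_loss < best_val:                      # improvement: reset counter
            best_val, wait = val_loss, 0
            best_state = copy.deepcopy(model.state_dict())
        else:
            wait += 1
            if wait >= patience:                     # 50 epochs w/o improvement
                break
    model.load_state_dict(best_state)                # restore best checkpoint
    return model, best_val
```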
Bayesian Hyperparameter Optimization:
Gaussian process-based Bayesian optimization was employed to automate the tuning of hyperparameters, with validation loss designated as the objective function. Critical parameters such as the learning rate, weight decay, and number of training epochs were optimized. To improve search efficiency, a two-stage strategy was adopted: an initial coarse-grained global search explored the parameter space broadly, followed by a fine-grained local search focused on the most promising regions. This hierarchical procedure efficiently identified an optimal hyperparameter configuration, enhancing model generalization and predictive robustness.
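A sketch of such a search using scikit-optimize's gp_minimize over the bounds reported in the Results; train_and_validate is a hypothetical helper returning the validation loss for a given configuration.

```python
# Gaussian-process Bayesian hyperparameter search sketch.
from skopt import gp_minimize
from skopt.space import Real, Integer, Categorical

space = [
    Real(1e-4, 1e-3, name='learning_rate'),
    Real(1e-4, 1e-3, name='weight_decay'),
    Categorical([64, 128], name='batch_size'),
    Integer(50, 80, name='epochs'),
]

def objective(params):
    lr, wd, bs, n_epochs = params
    # Hypothetical helper: trains the model and returns validation loss
    return train_and_validate(lr=lr, weight_decay=wd,
                              batch_size=bs, epochs=n_epochs)

result = gp_minimize(objective, space, n_calls=40, random_state=0)
print('best params:', result.x, 'best val loss:', result.fun)
```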
2.3. Data Sources
To facilitate model training, distinct datasets were constructed for different tasks. Experimentally measured kcat values were extracted from the BRENDA and SABIO-RK databases. Each record included the enzyme amino acid sequence, substrate name, and corresponding kinetic parameters. Whenever possible, enzyme sequences were mapped to a unique UniProt ID. For each substrate, the molecular name was used to retrieve the corresponding SMILES representation from the KEGG and PubChem databases. Samples that could not be unambiguously mapped to a valid SMILES string were excluded.
Due to the limitation of the ESM2 model in handling sequences longer than 1,022 amino acids, samples exceeding this threshold were removed. Variant-containing samples were retained, while the top and bottom 10% of extreme outlier values were filtered out. Ultimately, a total of 29,102 samples were collected for kcat, including 744 entries belonging to EC 4.1.1.x (decarboxylases). The dataset was randomly partitioned into training, validation, and test sets at ratios of 80%, 10%, and 10%, respectively.
Results
1. Model Training
During model construction, Bayesian optimization was employed for hyperparameter search. Based on prior training experience, the search space was constrained as follows: learning_rate within the range \([1 \times 10^{-4}\), \(1 \times 10^{-3}\)] (linear space), weight_decay within the range \([1 \times 10^{-4}\), \(1 \times 10^{-3}\)] (linear space), batch_size selected from \(\{64, 128\}\), and epochs between 50 and 80. Training yielded the optimal hyperparameter combination: learning_rate = 2.275 × 10⁻⁴, weight_decay = 1.036 × 10⁻⁴, batch_size = 128, epochs = 54.
After training, model performance on the test set was evaluated using multiple regression metrics, including mean squared error (MSE), root mean squared error (RMSE), mean absolute error (MAE), and adjusted coefficient of determination (Adjusted \(R^{2}\)). Their definitions are as follows:$$\mathrm{MSE}=\frac{1}{n}\sum_{i=1}^n(y_i-\widehat{y}_i)^2$$$$\mathrm{RMSE}=\sqrt{\frac{1}{n}\sum_{i=1}^n(y_i-\widehat{y}_i)^2}$$$$\mathrm{MAE}=\frac{1}{n}\sum_{i=1}^{n}|y_{i}-\widehat{y}_{i}|$$$$\mathrm{Adjusted~}R^2=1-(1-R^2)\frac{n-1}{n-p-1},\quad R^2=1-\frac{\sum_{i=1}^n(y_i-\widehat{y}_i)^2}{\sum_{i=1}^n(y_i-\bar{y})^2}$$
Where n denotes the number of samples, p represents the number of model input features, \(y_i\) and \(\hat{y}_i\) correspond to the true and predicted values, respectively, and \(\bar{y}\) is the mean of the true values.
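These metrics can be computed directly, for example with scikit-learn; p is the number of model input features, as above.

```python
# Test-set evaluation sketch for the four regression metrics.
import numpy as np
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

def evaluate_regression(y_true, y_pred, p):
    mse = mean_squared_error(y_true, y_pred)
    rmse = np.sqrt(mse)
    mae = mean_absolute_error(y_true, y_pred)
    r2 = r2_score(y_true, y_pred)
    n = len(y_true)
    adj_r2 = 1 - (1 - r2) * (n - 1) / (n - p - 1)  # adjusted R^2
    return {'MSE': mse, 'RMSE': rmse, 'MAE': mae, 'Adjusted R2': adj_r2}
```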
After completing the general pretraining phase, the model was fine-tuned using lysine decarboxylase (LDC)-specific data to enhance its predictive capability within this enzyme family. As shown in Table 1, fine-tuning systematically improved the key performance metrics on the LDC test set, with the adjusted \(R^{2}\) increasing from 0.3253 to 0.4927. This indicates that domain knowledge transfer was effective and that the model has begun to capture functional features specific to the LDC family.
Table 1. Performance of the model on the general test set and the lysine decarboxylase (LDC) test set
| | MSE | RMSE | MAE | Adjusted R² |
| --- | --- | --- | --- | --- |
| General-purpose dataset | 0.3747 | 0.4101 | 0.3110 | 0.3253 |
| LDC dataset | 0.3069 | 0.3654 | 0.2395 | 0.4927 |
2. Experimental Validation Reveals Model-Experiment Discrepancy
Despite the promising computational performance of our fine-tuned model, we proceeded to experimentally characterize the top-predicted CadA mutants for lysine decarboxylase activity. Unexpectedly, none of these in silico-designed variants showed a significant improvement over the wild-type enzyme. This critical finding underscores the challenge of translating in silico predictions into wet-lab outcomes and highlights a fundamental gap between static structure-based predictions and the complex reality of enzymatic function in a cellular environment.
We attribute this discrepancy to two primary factors: 1) The limited amount of high-quality, enzyme-specific training data, which restricts the model's precision; and 2) The model's inherent inability to account for dynamic effects such as conformational flexibility and solvation, which are crucial for catalysis.
Designing Interfacial Disulfide Bonds to Improve CadA Activity
Methods
The divergence between our model's predictions and experimental results underscored the challenges of direct activity prediction. We therefore adopted an alternative, structure-guided strategy based on a well-established protein engineering principle: enhancing enzyme stability often improves functional performance.
To implement this, we introduced engineered disulfide bonds at the dimer interface of CadA to increase stability. Using the Maestro-BioLuminate platform, we adopted a three-step strategy: disulfide scanning, mechanics-based scoring, and literature-based validation.
A systematic disulfide bond scan was carried out. Candidate residue pairs at the dimer interface were screened according to geometric and accessibility criteria: interatomic distances between 4 and 7 Å and solvent-accessible surface areas (SASA) < 80 Ų. Simulated double-cysteine mutations generated 80 candidate residue pairs, which were subsequently filtered by weighted scoring. Pairs with weighted scores (WS) > 10,000 were excluded, yielding nine potential disulfide-bonding sites.
Finally, we cross-referenced these candidates with experimental reports of thermostability-enhancing mutations published between 2018 and 2024. Only pairs that satisfied interfacial positioning, disulfide bond geometry, and literature evidence were retained. Based on this integrative analysis, five disulfide mutants were designed: F14C/K44C, V12C/K41C, F14C/L45C, L93C/E445C, and Y13C/P36C. In addition, the interfacial point mutation E104K was identified by virtual screening as potentially stabilizing the enzyme.
Results
Wet-lab experiments demonstrated that the engineered disulfide bond mutants F14C/K44C, V12C/K41C, F14C/L45C, L93C/E445C, and Y13C/P36C, along with the literature-supported variants V12C/K44C, F14C/D41C, and L89C/E445C, exhibited significantly enhanced enzymatic activity. Among these, the Y13C/P36C mutant displayed the most pronounced improvement in catalytic activity.
To investigate the structural basis of this enhancement, molecular dynamics (MD) simulations were performed for both the wild-type enzyme and the Y13C/P36C mutant. The results revealed a substantial reduction in Root-Mean-Square Fluctuations (RMSF) in the mutant, indicating enhanced structural stability that likely underlies its higher activity.
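For reference, a per-residue RMSF profile of this kind can be computed with MDAnalysis as sketched below; the topology and trajectory file names are hypothetical.

```python
# Per-residue RMSF sketch: align trajectory, then measure C-alpha fluctuations.
import MDAnalysis as mda
from MDAnalysis.analysis import align, rms

u = mda.Universe('y13c_p36c.pdb', 'y13c_p36c.xtc')  # hypothetical mutant files

# Align the trajectory to the average structure before measuring fluctuations
avg = align.AverageStructure(u, select='name CA').run()
align.AlignTraj(u, avg.results.universe, select='name CA', in_memory=True).run()

calphas = u.select_atoms('name CA')
rmsf = rms.RMSF(calphas).run()
print(rmsf.results.rmsf)   # per-residue fluctuation profile (Å)
# Repeat for the wild-type trajectory and compare the two profiles.
```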
Discussion
Limitations of Model Predictions and Discrepancies with Experimental Validation
Our initial in silico MPNN-GNN–based model showed moderate predictive power for LDC variants (Adjusted R² = 0.4927), but experimental validation revealed that predicted high-activity mutants did not reach expected performance. These discrepancies likely stem from the model's reliance on generalized training data, static structural assumptions, and the incomplete representation of long-range dynamic effects that influence enzyme catalysis.
To overcome these gaps, we integrated disulfide bond scanning with literature-based validation, identifying key interfacial residues that contribute to enhanced stability and activity. Future work will combine site-directed mutagenesis, molecular dynamics, and high-throughput experimental validation into an iterative design loop, continuously refining the predictive model and further improving both the stability and catalytic efficiency of CadA variants.