Introduction
Triterpenoid saponins such as ginsenoside Ro are of great pharmacological value, yet their natural abundance is extremely low, posing significant challenges for large-scale production. Synthetic biology provides a promising strategy to address this issue by reconstructing and optimizing biosynthetic pathways in heterologous hosts. However, achieving efficient Ro biosynthesis requires the systematic identification of suitable enzymes, careful optimization of catalytic modules, and rational strategies for improving enzyme activity.
Question 1: Which enzymes transform oleanolic acid (OA) to Ro precursors?
Question 2: Which suitable glycosyltransferases be chosen?
Question 3: How should we selectively optimize the most promising enzymes?
Question 4: How can enzyme activity be further improved at the sequence level?
To tackle these challenges, our modeling work focused on four key questions.
Question
1: Which enzymes should be
selected to initiate the glycosylation process from oleanolic acid (OA) to Ro
precursors? To address this, we conducted
cross-species screening of cellulose synthase-like (Csl) enzymes from ginseng, notoginseng,
and soybean, using BLAST and
HMM analysis followed by phylogenetic classification, domain and motif analysis, and
molecular docking. This allowed us
to identify candidate enzymes with high catalytic potential toward OA.
Question 2: How
can suitable glycosyltransferases
be chosen to complete the Ro biosynthetic pathway? For this, we applied enzyme
kinetic modeling based on
Michaelis-Menten equations, integrating experimental kinetic parameters to evaluate
different members of the UGT73
family. Genetic algorithms were then employed to optimize enzyme combinations for maximizing
Ro yield.
Question 3: How
should we selectively optimize the most promising enzymes? We focused on UGT73F3
and UGT73P40, constructing a metabolic
network model to simulate their dynamic contributions, applying metabolic control analysis
to quantify their control
coefficients, and comparing different enzyme modification strategies to enhance Ro
synthesis.
Question 4: How can enzyme
activity be further improved at the sequence level? To this end, we developed a
machine learning-based framework that
integrates protein language models and graph neural networks to predict the functional
impact of single-site mutations.
From a library of 9,443 mutants, we identified high-probability variants, validated their
binding affinities through
docking simulations, and highlighted S175P as a potential top-hit mutant.
In summary, our modeling framework established a multi-layered strategy that combines enzyme discovery, kinetic optimization, metabolic network analysis, and computational protein engineering. These efforts laid the groundwork for the wet-lab synthesis of ginsenoside Ro and provided a robust computational pipeline for future studies in synthetic biology-driven natural product biosynthesis.
Question 1: Screening suitable Csl enzymes for converting OA to CE
In the complete biosynthetic pathway of Ro, the conversion from OA to CE represents the first glycosylation step on the triterpenoid backbone. This reaction requires the participation of UDP-glucose dehydrogenase (UGDH) and cellulose synthase-like enzymes (Csl). UGDH catalyzes the transformation of UDP-Glc into UDP-GlcA, while Csl is responsible for the critical task of attaching a glucuronic acid moiety to the C-3 position of OA [1,2]. Therefore, selecting highly efficient Csl enzymes is essential for the conversion of OA to CE, and ultimately for the successful biosynthesis of Ro.
Csl enzymes constitute a major branch of the cellulose synthase (CesA) superfamily and are responsible for catalyzing polysaccharide backbone synthesis across diverse organisms. Their central function lies in mediating the biosynthesis of hemicellulose, a key structural component of plant and algal cell walls [3,4]. In this study, we analyzed the genomes of soybean (Glycine max), ginseng (Panax ginseng), and notoginseng (Panax notoginseng), all of which possess the capacity to synthesize triterpenoid backbones, in order to identify Csl enzymes capable of attaching monosaccharides to the triterpenoid scaffold and thereby facilitating the efficient conversion of OA to CE. Using the Csl enzyme pn022859 from P. notoginseng as the reference sequence, we performed sequence similarity searches with the BLAST algorithm and Hidden Markov Models (HMM). Candidate homologs of pn022859 were subjected to domain and motif analyses. Based on these results, we merged the candidate sequences from the three species, removed those with low similarity, and constructed a phylogenetic tree to classify sequences into families and subfamilies. Finally, Csl enzymes within the subfamily containing pn022859 were subjected to molecular docking and molecular dynamics simulations to evaluate their potential to enhance the catalytic efficiency toward OA (Figure 1).
Figure 1. Workflow for cross-species Csl enzyme screening.Abbreviations: Csl, Cellulose synthase-like; CD-search, Conserved domain search; MEME, Multiple EM for Motif Elicitation; NJ, Neighbor-Joining. Image created with BioRender.com, with permission.
1.1 Discover Csl enzymes in ginseng, notoginseng, and soybean through BLAST and HMM analysis
First, we obtained the genomic data of Panax ginseng, Panax notoginseng, and Glycine max, annotated them using GFF files, and retrieved the complete protein-coding sequences in FASTA format. Using the protein sequence of pn022859 as a reference, we performed candidate sequence searches with both BLAST and HMM algorithms. For each of the three species, the intersection of the BLAST and HMM results was selected, and duplicate sequences were removed before inclusion in subsequent analyses. As shown in Figure 2, the combined BLAST and HMM analyses identified 82 candidate sequences in P. ginseng, 42 in P. notoginseng, and 73 in G. max (after redundancy elimination).
Figure 2. Identification of Csl enzymes in Panax ginseng, Panax notoginseng, and Glycine max using BLAST and HMM algorithms.
1.2 Perform domain and motif analysis of Csl enzymes
After obtaining the candidate protein sequences of Csl enzymes from the three species, we further analyzed their domain architectures and motif distributions to facilitate subsequent classification into families and subfamilies. The FASTA files of candidate Csl protein sequences from each species were retrieved and subjected to conserved domain searches using the NCBI Batch CD-Search tool (https://www.ncbi.nlm.nih.gov/Structure/bwrpsb/bwrpsb.cgi)[5] with the Conserved Domain Database and an expect value threshold of 0.01. The CD-search results indicated that most Csl enzymes are single-domain proteins, providing insufficient resolution for detailed family classification.
To overcome this limitation, we conducted motif analysis using MEME Suite (https://meme-suite.org/meme/),[6] setting the minimum number of motifs to be identified at 10. This analysis revealed the motif distributions of Csl enzymes across the three species. Figures 3-5 illustrate the top 10 motifs and their distributions in P. ginseng, P. notoginseng, and G. max, respectively. Interestingly, aside from a few sequences with limited motifs and potentially incomplete functions, the motif distributions across all three species exhibited a consistent pattern, clustering into two major groups. Based on these findings, we hypothesize that Csl enzymes in these three species can be classified into six families. As the next step, we will merge the candidate Csl sequences from the three species, construct a phylogenetic tree, and perform a more detailed classification at the family and subfamily levels.
Figure 3. Schematic of the conserved motif organization in Csl proteins of Panax ginseng.
Figure 4. Schematic of the conserved motif organization in Csl proteins of Panax notoginseng.
Figure 5. Schematic of the conserved motif organization in Csl proteins of Glycine Max.
1.3 Construct phylogenetic trees for Csl enzymes in ginseng, notoginseng, and soybean
We combined the candidate Csl enzymes from Panax ginseng, Panax notoginseng, and Glycine max, and performed protein sequence alignment[5][6] using the MUSCLE algorithm implemented in MEGA 12. After removing sequences with low alignment quality or excessively short lengths, we constructed a phylogenetic tree using the Neighbor-Joining (NJ) method, with 5,000 bootstrap replicates and the p-distance model. This analysis yielded the phylogenetic tree of candidate Csl sequences across the three species.
1.4 Divide the phylogenetic tree of Csl enzymes into families and subfamilies
The Newick file generated from the NJ phylogenetic tree was uploaded into iTOL[7], where bootstrap values were standardized to a 0-100 scale. Using a threshold of bootstrap >70 as the criterion for subdivision, branches with bootstrap values < 70 were defined as independent families. Based on this standard, the candidate Csl enzymes were classified into six clusters (Figure 6). Our reference sequence, PN022859, was located within Cluster E, which comprised 24 homologous proteins. To further resolve the subfamily structure, we extracted the FASTA sequences of these 24 homologs and performed additional analyses to delineate subfamilies.
Figure 6. Phylogenetic analysis of Csl proteins from Panax ginseng, Panax notoginseng, and Glycine max. The circular phylogenetic tree was constructed using the NJ method based on a multiple sequence alignment of full-length Csl protein sequences. The numbers at the nodes indicate bootstrap support values from 5,000 replicates. The Csl proteins are classified into six distinct clusters, designated as Cluster A through Cluster F and highlighted with different colored backgrounds. Protein sequences are labeled with their database identifiers. Based on common prefixes, sequences from P. ginseng are denoted by 'EVM', P. notoginseng by 'PN', and G. max by 'XM/NM'. The P. notoginseng protein PN022859 is highlighted in blue for specific emphasis.
The Csl candidate sequences belonging to the same family as PN022859 were extracted and subjected to phylogenetic analysis using the NJ method following the procedure described above. Applying the bootstrap >70 threshold as the classification criterion, 21 sequences were divided into seven subclusters. Among these, the sequences most closely related to PN022859 were EVM0017102 and EVM0010403, both of which are Csl sequences from Panax ginseng (Figure 7).
Figure 7. Detailed phylogenetic analysis of the Csl subfamily. The circular phylogenetic tree illustrates the evolutionary relationships among Csl proteins from P. ginseng (sequences with 'EVM' prefixes), P. notoginseng ('PN'), and G. max ('XM'/'NM'). The tree was constructed using the NJ method based on a full-length protein sequence alignment. The proteins are grouped into seven distinct subclusters (1-7), denoted by colored branches and corresponding arcs as indicated in the legend. The P. notoginseng protein PN022859 is highlighted in blue for specific emphasis.
1.5 Perform in silico validation using molecular docking and molecular dynamics analyses
To further validate the binding capacity of the Csl enzymes homologous to PN022859 with the substrate OA, we employed AlphaFold 3 to construct 3D structural models of the 21 Csl sequences belonging to the same family as PN022859. The 3D structure of OA was obtained from PubChem, and molecular docking was performed using AutoDock Vina. The docking results were evaluated based on the binding energies of the best conformations, with lower binding energies indicating stronger binding affinities. The results demonstrated that PN022859 exhibited favorable binding to OA (Best Affinity = -12.6 kcal/mol). Moreover, four Csl enzymes displayed even stronger binding affinities than PN022859, namely EVM0004671 from P. ginseng (-12.6 kcal/mol), and XM_041006423 (-12.7 kcal/mol), XM_003536208 (-13.0 kcal/mol), and NM_001365113 (-13.2 kcal/mol) from G. max.
These findings suggest that there remains potential to improve the catalytic efficiency of Csl enzymes in the OA-to-CE conversion step. However, molecular docking can only reveal whether and how the OA substrate can bind to the enzyme active site, as well as provide an approximate estimation of binding strength. It does not clarify how the overall system responds upon OA binding or how the conformation of the Csl enzyme adapts to facilitate OA catalysis. Therefore, molecular dynamics simulations are required to further elucidate these mechanistic changes.
Figure 8. In silico binding affinity analysis of a putative substrate with Csl proteins. The bar chart compares the calculated binding affinities (in kcal/mol), derived from molecular docking, between a common ligand OA and various Csl proteins. The proteins, identified by their accession numbers, are from P. ginseng (sequences with 'EVM' prefixes), P. notoginseng ('PN'), and G. max ('XM'/'NM'). Each bar represents the docking score for the most favorable binding pose. Lower, more negative scores indicate a stronger predicted binding interaction. The protein of interest from P. notoginseng, PN022859, is highlighted in red for emphasis.
1.6 Perform in silico validation using molecular dynamics simulation
To benchmark PN022859 against close homologs, we analyzed binding free energy (bar chart) and MD-derived compactness (radius of gyration, Rg) across four receptors (with receptor15 = PN022859).
Binding free energy (kJ/mol). More negative values indicate stronger binding. Among the four, receptor15 (PN022859) shows the most favorable binding (≈ -21 kJ/mol), followed by receptor11 (≈ -12 kJ/mol), receptor21 (≈ -10 kJ/mol), and receptor16 (≈ -8-9 kJ/mol). Thus, PN022859 ranks first in predicted affinity, consistent with a tighter OA-enzyme complex.
Figure 9. Binding free energy of OA with four Csl receptors. Bar chart of docking-predicted binding free energies (kJ/mol); more negative indicates stronger affinity. receptor15 = PN022859 shows the most favorable energy (≈ -21 kJ/mol), followed by receptor11 (≈ -12), receptor21 (≈ -10), and receptor16 (≈ -9).
Conformational stability (Rg, nm). Lower and less variable Rg suggests a more compact and stable enzyme-ligand ensemble during the trajectory. PN022859 (receptor15) exhibits the lowest and tightest Rg distribution (~ 2.79-2.82 nm), indicating a compact, time-stable conformation upon binding. Receptor11 remains moderately compact (~ 2.86-2.89 nm) with limited dispersion. Receptor21 shows broader spread (~ 2.86-2.92 nm), suggesting greater breathing motions. Receptor16 displays the largest Rg and widest dispersion (~ 2.93-2.97 nm), consistent with a more expanded and fluctuating fold. Time-colored scatter points further suggest that PN022859 maintains its compact state throughout the simulation, whereas receptor16/21 trend toward slight expansion at later times.
Figure 10. Conformational stability assessed by radius of gyration (Rg). Violin plots with time-colored scatter show Rg (nm) distributions over the MD trajectory; lower and tighter Rg implies a more compact, stable complex. receptor15 (PN022859) is the most compact with lowest dispersion; receptor11 is moderate; receptor21 and especially receptor16 display larger Rg and variability. Color encodes simulation time.
Implications. The concordance between strongest binding energy and highest conformational compactness/stability positions PN022859 (receptor15) as the leading catalyst candidate for OA engagement. Receptor11 appears secondary, while receptor21 and especially receptor16 are less favorable due to weaker affinity and/or greater structural lability.
Question2: Screening suitable glycosyltransferase for converting CE to Ro
Figure 11.
In the protein glycosylation module, the compound CE ultimately converts to the target product Ro, forming a metabolic network: the main pathway first generates IVa, then Ro; the bypass pathway first generates R1, then Ro. The reactions involved in these four steps are all classic glycosylation transfer reactions. By consulting the literature, we found that the enzymes required for each step of the transformation can all be selected from the UGT73 family.
To maximize the yield of the product Ro, we need to select appropriate enzymes from the enzyme library and place them at positions E1, E2, E3, and E4. (Figure11)
- E1: Side-path enzyme, catalyzing CE → R1
- E2: Main path enzyme, catalyzing CE → IVa
- E3: Side-path enzyme, catalyzing R1 → Ro
- E4: Main path enzyme, catalyzing IVa → Ro
2.1 Construct the enzyme kinetics equation
Simulate the reaction process using the Michaelis-Menten kinetic model
Here, vi represents the rate of reaction i, kcati is the conversion
number of enzyme i, Kmi is the Michaelis constant of enzyme i,
[E] is the enzyme concentration, and [S] is the substrate concentration. The Michaelis
equation was proposed by the
German biochemist Leonor Michaelis and the Danish physiologist Maud Menten in
1913[8], and it is a mathematical model
in enzymology that describes the relationship between the rate of enzymatic reaction and the
substrate concentration.
The differential equation system is as follows
The reaction rate
Assuming that the enzyme concentration [E] at each enzyme position is constant and set to 1 μM, the initial condition is [CE]0=C0, and the concentrations of other substances are 0. The final Ro yield is the Ro concentration at a sufficiently long reaction time.
2.2 Obtain the kinetic parameters of enzymes in the UGT library for three compounds using Catapro
We obtained the enzyme kinetic parameters of UGT enzymes for three compounds (CE, IVa, R1) from the Catapro tool [9] and conducted a multi-dimensional analysis on them.
Firstly, in order to reveal the overall distribution characteristics, we plotted box plots of log10(kcat), log10(Km), and log10(kcat/Km) (Figure 12). This graph visually presents the median, quartile intervals, and potential outliers of each parameter. The results show that most enzymes have a concentration of conversions in the log10(kcat) dimension that is concentrated in the lower range, but there are still a few enzymes whose activity is significantly higher than the average level, suggesting that they may possess potential for efficient catalysis. In contrast, the distribution of log10(Km) is relatively concentrated, indicating that the overall difference in the affinity between the enzymes and the substrates is not significant. However, there are still some enzymes with extremely low Km values, representing candidates with stronger binding capabilities. As for log10(kcat/Km), as a comprehensive indicator of catalytic efficiency, its distribution is relatively concentrated, but there are still a few outliers at the upper end, suggesting the presence of special enzymes with extremely high catalytic efficiency, which is worthy of further attention in subsequent studies. Overall, this analysis indicates that the majority of enzymes have moderate performance, while a few outstanding enzymes form the tail of the distribution, providing important candidates for subsequent enzyme screening.
Figure 12.
To further explore the relationships among different kinetic parameters, we plotted a scatter plot matrix of log10(kcat), log10(Km), and log10(kcat/Km) (Figure 13). This image not only shows the pairwise relationships among the parameters, but also provides a kernel density estimation on the diagonal for the distribution. The results indicate that there is a significant positive correlation between log10(kcat) and log10(kcat/Km), which is consistent with theoretical expectations, as the catalytic efficiency is numerically driven by kcat. In contrast, log10(Km) and log10(kcat/Km) show a certain negative correlation, indicating that a lower Km (i.e., higher substrate affinity) usually helps to improve the overall efficiency. The correlation between log10(kcat) and log10(Km) is relatively weak, suggesting that the conversion number and the change in affinity are relatively independent in most cases. Moreover, the scatter points show an aggregation trend under different parameter combinations, suggesting that some enzymes may share similar kinetic characteristics. This observation is consistent with the subsequent clustering results, providing further support for identifying functionally similar enzyme groups.
Figure 13.
To reveal the similarities in the kinetic characteristics of different enzymes, we further plotted a clustering heatmap of standardized parameters and grouped them through hierarchical clustering (Figure 14). The longitudinal clustering shows the overall patterns of different UGT73 enzymes in terms of Km, kcat, and kcat/Km, while the lateral clustering reveals the intrinsic connections between the parameters. The results indicate that the distributions of the three types of parameters are not completely consistent. For instance, some enzymes exhibit extremely high values in the Km dimension (i.e., low affinity), but are lower in the kcat dimension, indicating that they have a higher demand for substrates but fail to convert into faster catalytic rates. Conversely, a few enzymes excel simultaneously in both kcat and kcat/Km indicators, suggesting that they possess both a higher conversion number and overall catalytic efficiency, and are expected to be preferred candidates in Ro synthesis. The clustering results also revealed the similarities between different enzyme groups: enzymes within certain branches exhibit similar patterns in the three parameters, suggesting that they may have common characteristics in substrate utilization and catalytic mechanisms. This not only provides a basis for narrowing the candidate range but also lays the foundation for subsequent protein engineering modifications. It is worth noting that some enzymes exhibit extreme performance in a single dimension, such as the outstanding performance of UGT73VN1 in kcat and the highly efficient catalysis of UGT73DF112 in kcat/Km. These are clearly highlighted in the figure, suggesting that they may play a key role in metabolic network optimization.
Figure 14.
To compare the characteristics of different substrates, we plotted the distribution histograms of the three compounds (CE, IVa, and R1) in three dimensions: log10(kcat), log10(Km), and log10(kcat/Km) (Figure 15). The results showed that the distribution patterns of the three substrates in log10(kcat) were roughly similar, but there were differences in the central positions: the peaks of CE and R1 were slightly lower, while the distribution center of IVa was higher, suggesting that it might have a faster conversion rate in some enzymes. In the distribution of log10(Km), the three substrates did not show significant differences, but the distribution of CE was slightly biased towards higher Km values, indicating that its affinity might be overall weaker than that of IVa and R1. As for log10(kcat/Km), the distributions of the three were all concentrated, and the central position of IVa was slightly higher than that of CE and R1, suggesting that IVa had a greater overall catalytic efficiency. In summary, the distribution patterns of the three substrates were highly consistent, but IVa performed slightly better in the efficiency indicators. This result suggests that in the optimization design of metabolic pathways, reactions related to IVa may have greater potential, while the deficiency in affinity of CE needs to be improved through enzyme engineering methods.
Figure 15.
2.3 Solve the model using Genetic Algorithms
Due to the large size of the enzyme library (27,834 enzymes), the computational cost of using the exhaustive method (traversing all combinations) is extremely high, amounting to approximately 27,834^4 = 6e+17 combinations, which is unfeasible. Therefore, an optimization algorithm is adopted to find an approximate optimal solution. Here, we use the Genetic Algorithm (GA) [10] to search for the optimal enzyme combination.
- Genetic algorithm simulates natural selection and genetic principles:
- 1. Population initialization: randomly generate a set of candidate solutions (enzyme combinations)
- 2. Fitness evaluation: calculate the quality (Ro yield) of each candidate solution
- 3. Selection: select the superior individuals based on fitness
- 4. Crossover: combine the characteristics of the superior individuals to create new individuals
- 5. Mutation: randomly alter certain characteristics of the individuals
- 6. Iteration: repeat steps 2-5 until convergence
Here, the fitness function simulates the reaction system and calculates the final Ro yield:
Figure 16.
This graph depicts the changes in the best fitness (maximum Ro yield) and the
average fitness for each generation. If
the curve steadily rises in the later generations and reaches a plateau, it indicates that
the algorithm has found the
global optimal solution.
The result after running is E1 = E4 = UGT73P40, E2 = E3 = UGT73F3
Question3: Selective optimization for UGT73F3 and UGT73P40
In the metabolic network formed by converting compound CE to obtain the target product Ro, the optimal enzyme combination of UGT73F3 and UGT73P40 has been obtained through the combined optimization model. To increase the yield of Ro, we consider modifying one of these enzymes. Therefore, the following model is established to determine which of the two enzymes to modify in order to bring greater benefits.
3.1 Apply mathematical modeling to select the appropriate glycosyltransferase
First, we define the key variables. The concentrations of the four compounds involved are [CE], [IVa], [R1], and [Ro], while the concentrations of the two enzymes are [EP40] and [EF3]. The three enzyme kinetic parameters are the turnover number kcat, the Michaelis constant Km, and the catalytic efficiency kcat/Km. We consider modeling each enzyme-substrate pair separately. UGT73P40 acts on the two processes of hydrolyzing IVa and CE, with parameters being kcat,P40,IVa 、KM,P40,Iva, and kcat,P40,CE 、KM,P40,CE acts on the two processes of converting R1 to Ro and hydrolyzing CE, with parameters being kcat,F3,R1、KM,F3,R1, and kcat,F3,CE、KM,F3,CE.
Since enzymes alternate their actions in multiple pathways and multiple substrates of the same enzyme compete for the same active site, the competitive substrate inhibition model and the pathway diversion equation are adopted. For UGT73P40, the rate equation is:
The cross-terms in the denominator (containing the concentration of another substrate) represent competitive inhibition: when the concentration of one substrate is high, it will occupy the active site of the enzyme, thereby inhibiting the conversion of the other substrate. Here, [EP40] is the concentration of UGT73P40 (assuming it is constant because the enzyme concentration does not change within a short period of time); kcat,P40,IVa and kcat,P40,CE are the catalytic constants of UGT73P40 for IVa and CE (the number of substrate molecules converted per enzyme molecule per unit time); KM,P40,Iva and KM,P40,CE are the Michaelis constants of UGT73P40 for IVa and CE (the concentration of the substrate when it reaches half of the maximum rate, reflecting the affinity of the enzyme for the substrate). For UGT73F3, the rate equation is:
Here, [EF3] represents the concentration of UGT73F3 (assumed to be constant); kcat,F3,CE and kcat,F3,R1 are the catalytic constants of UGT73F3 for CE and R1 respectively; KM,F3,CE and KM,F3,R1 are the Michaelis constants of UGT73F3 for CE and R1 respectively.
3.2 Construct a system of ordinary differential equations in a dynamic model of a metabolic network
The main consideration is the glycosylation module shown in Figure 11. Assuming that the generation of CE is stable, let vin be the generation rate of CE entering the glycosylation module (determined by the previous reaction and serving as input), and there is an ODE system
Among them, vbypathrepresents the rate at which R1 is generated through other bypass pathways other than the direct hydrolysis of CE and IVa by UGT73P40, and it can be set as a constant. kdeg,CE, kdeg,IVa, kdeg,R1, kdeg,Ro respectively represent the degradation rate constants of these four compounds. Under steady-state conditions, the concentration of the metabolites no longer changes with time, that is
By solving the steady-state equations, we can obtain the yield YRo of Ro. The definition of YRo is
3.3 Design enzyme modification strategies
We propose two modification strategies. The first one is to modify UGT73P40 so that its selectivity for IVa is much higher than that for CE, thereby reducing the generation of by-products through the bypass pathway. In the numerical simulation, increase kcat,P40,IVa/KM,P40,IVa * kcat,P40,IVa/KM,P40,IVa relative to kcat,P40,CE/KM,P40,CE , or decrease kcat,P40,CE. The second strategy is to modify UGT73F3 so that its catalytic efficiency for R1 is improved, avoiding the accumulation of R1. In the numerical simulation, increase .
To quantify the effect of enzyme modification, we employed Metabolic Control Analysis (MCA), which is a mathematical framework in biochemistry and systems biology used to quantify how changes in enzyme activity affect the flux and metabolite concentrations in metabolic pathways [11]. MCA aims to analyze the control strength of each enzyme in the metabolic network over system variables (such as flux, metabolite concentration). It is not based on the intuitive concept of "rate-limiting steps", but rather quantifies the control contribution of each enzyme through strict mathematical definitions.
This coefficient represents the percentage change in steady-state flux when the enzyme activity changes by 1%. According to the summation theorem (Summation Theorem) - the sum of the control coefficients for all enzymes that regulate the same flux is 1, that is
This means that the control is distributed throughout the entire pathway, rather than being solely controlled by a single "limiting enzyme". The value of FCC is greater than 0, and the sum of all enzymes' FCCs for the same flux is equal to 1. The larger the FCC, the stronger the control of the enzyme over the flux, and the more significant the effect of modifying this enzyme.
For UGT73P40, there is
For UGT73F3, there is
In numerical simulation, the numerical approximation of FCC can be obtained by making a small perturbation around the current enzyme concentration (for example, increasing the enzyme concentration by 1%), and then recalculating the steady-state Ro flux.
3.4 Solve the mathematical model
3.4.1 Dynamic graph of metabolite concentration change
Figure 17. Dynamic Changes of Metabolite Concentrations
With time on the horizontal axis and concentration on the vertical axis, the changing trends of the four metabolites (CE, IVa, R1, Ro) over time are presented. It is possible to visually observe the process of each metabolite gradually reaching a steady state starting from the initial concentration. The dynamic correlations among them can be clearly seen, such as which metabolite has a faster increase rate and which one stabilizes earlier.
3.4.2 Enzyme modification strategy
Figure 18. Comparison Chart of Production Output after Enzyme Modification Strategy
The concentration of ginsenoside Ro was compared through bar charts under three conditions: the original system, the modified UGT73F3, and the modified UGT73P40.
Figure 19. Comparative diagram of the time course of enzyme modification
It presents the trend of Ro concentration changes over time under different enzyme modification strategies. The three curves in the figure represent the original system (black), the modified UGT73P40 (red), and the modified UGT73F3 (blue). As time progresses, the Ro concentrations of all three systems increase, and the modified enzyme systems show significantly faster growth rates compared to the original system. Among them, the modification of UGT73F3 has the most prominent improvement, increasing by 48.1% compared to the original system; the modification of UGT73P40 also shows some improvement, increasing by 12.8%. The circles in the figure are marked as steady-state points, and the Ro concentration then tends to stabilize.
3.4.3 Metabolic control analysis
Figure 20. Graph showing the extent to which enzymes control the Ro yield
The bar chart shows the metabolic control coefficients of UGT73F3 and UGT73P40 enzymes on the Ro yield. The size of the coefficient reflects the enzyme's ability to control the Ro yield. The larger the value, the more significant the influence of the enzyme on the Ro yield. This provides a basis for further improving the Ro yield by regulating the enzymes.
Figure 21. Steady-state Metabolic Flux Analysis Diagram
The rates of five reactions - vF3,CE、vF3,R1、vP40,CE、vP40,IV a、vin - are presented in the form of bar charts. This enables us to understand the activity levels of each reaction at steady state, analyze the main pathways and key nodes of substance flow in the metabolic network, and thereby gain a deeper understanding of the metabolic process of Ro synthesis.
3.4.4 Analysis of the accumulation problem of R1
Figure 22. Graph of Concentration Changes of R1 and Ro
This graph illustrates the changing relationship between the accumulation of R1 and the output of Ro over time. For the accumulation of R1, the original system and the modified P40 system increase over time, with the original system being slightly higher, while the modified F3 system remains almost unchanged and is much lower than the former two; for the output of Ro, all three increase over time. Among them, the modified F3 system grows the fastest, the modified P40 system follows, and the original system grows relatively slowly. Overall, the modified F3 system performs best in terms of Ro output, while the worst is the original system in terms of R1 accumulation. The modified P40 system shows improvements in both R1 accumulation and Ro output, but the improvement in Ro output is not as significant as that of the modified F3 system.
3.4.5 Sensitivity analysis
Figure 23. Sensitivity Analysis Diagram of Enzyme Kinetic Parameters
The x-axis of this graph represents the relative values of parameters such as kcat (catalytic constant) and KM (Michaelis constant), while the y-axis shows the percentage of Ro output relative to the baseline. The color ranges from blue to red, with blue indicating low output and red indicating high output. Analysis reveals that kcat has a significant impact on Ro output. High kcat values (such as kcat,F3,R1、kcat,P40,IVa) result in high output, reaching up to 200% and exceeding 130% when the output is 200%. KM has a minor effect, especially KM,F3,CE, and KM,P40,CE. Optimizing the reaction to increase Ro output should prioritize increasing kcat values, particularly parameters such as kcat,F3,R1M, and kcat,P40,IV a.
Question4: Optimization for UGT73F3
Figure 24. Technical approach for engineering UGT73F3 through single-site saturation mutagenesis of the full sequence
In this study, advanced computational approaches were applied to investigate the functional effects of enzyme mutations on enzymatic activity and pathogenicity. The core datasets consisted of a cross-validation set, S10988[12], which included 10,510 mutants (5,255 positives and 5,255 negatives), and an independent test set, S2814, which included 5,978 mutants (1,118 positives and 4,860 negatives). For the SPM system, five-fold cross-validation was conducted on the training dataset, and model performance was subsequently evaluated on the independent test set.
4.1 Method
4.1.1 Feature Generation and Representation
Sequence embeddings are generated using the state-of-the-art protein language models ESM1v[13] and ESM2b[14], producing 1,280-dimensional vectors that encode evolutionary features. For each mutation, the embeddings of the wild-type (WT) and mutant (Mut) sequences are computed, and their differences are used to represent mutation-specific features (dESM1v/dESM2b). These features are then used as input for models such as GNN, MLP, SVM, and RF. For the GNN, we use the protein 3D structures predicted by AlphaFold3[15], and construct a graph by integrating residues within a 10 Å radius around the mutation site. Each residue's dESM feature is used as the node feature in the graph.
4.1.2 Model Development
GNN architectures to analyze integrated features for protein mutation prediction. Specifically, GCNs[16] and GATs were used to model the relational and spatial dependencies inherent in protein structures[17]. In these models, amino acid residues were represented as nodes, with embeddings derived from sequence data, while edges, constructed from adjacency matrices, encoded spatial interactions between residues. GCNs aggregated information from neighboring nodes to capture short-range residue interactions, while GATs applied attention mechanisms to dynamically weight the contributions of neighboring nodes, focusing on long-range dependencies and key mutation effects. For comparison, simpler Multilayer Perceptron (MLP) models using only sequence embeddings, as well as traditional classifiers such as Random Forest (RF) [18]and Support Vector Machines (SVM), were implemented to benchmark the performance of graph-based approaches.
The GNN models were implemented using the Deep Graph Library (DGL) framework [19,20]. The GCN architecture consisted of two layers of graph convolution (GraphConv), where each layer aggregated features from neighboring nodes based on both the node features and the graph topology, thereby capturing local structural information. The output from the convolution layers was passed through a ReLU activation function, followed by global average pooling (mean_nodes) to generate graph-level embeddings, which were then mapped to logits for classification. For the GAT model, two layers of multi-head attention convolutions (GATConv) were employed. The first layer generated multi-head embeddings and applied the ELU activation function for non-linear transformation, while the second layer extracted deeper, more abstract features. After each layer, the multi-head features were concatenated and subjected to global average pooling to obtain the final graph embeddings, which were subsequently mapped to the classification space.
To capture the effects of specific mutations, ESM embeddings were used for each amino acid, and the difference vector between the wild-type and mutant forms was calculated for each amino acid. Specifically, given the mutant and wild-type forms, ESM was used to generate embeddings for each amino acid, and the difference between the wild-type and mutant was calculated to obtain a difference vector for each amino acid. A graph structure was then constructed by selecting residues within a 10-angstrom range of each mutation, with edges assigned between residues that were within this distance. This method allowed for a comprehensive and rigorous evaluation of graph-based approaches in predicting the functional impact of mutations on protein structures.
This expression represents the difference in ESM features between the mutated amino acid
aaiwt and the wild-type amino acid (aaimut) ,
under the condition that the distance between (aaiwt) and the mutation
site (aaimut) is less than or equal to 10 Å.
Once the entire graph G is obtained, we use GAT and GCN to perform convolution operations to
update the node
representations:
h(l)represents the node features in the current layer. Conv(h(l-1)) applies the convolution operation to the features from the previous layer.
The MLP is implemented using PyTorch and designed for a binary classification task. Its input dimension is 1280, corresponding to the dimension of the difference vector between the wild-type and mutant forms. The model consists of three fully connected layers, progressively mapping the input to logits for the two output classes. The hidden layers contain 512 and 1024 neurons, respectively, and use the ReLU activation function to enhance the model's non-linear expressiveness. After forward propagation, the input features generate the classification results. The model is simple and flexible, making it suitable for handling high-dimensional feature classification problems.
Both GNN and MLP use binary cross-entropy loss functions and the Adam optimizer with a learning rate of 0.0001 for loss calculation and gradient updates. Early stopping is employed to prevent overfitting.
SVM and RF are implemented using Scikit-learn, with hyperparameters determined through grid search. The hyperparameters for the RF model are as follows: {'n_estimators': [50, 100, 200], 'max_depth': [None, 10, 20, 30], 'min_samples_split': [2, 5, 10], 'min_samples_leaf': [1, 2, 4]}.The parameters of SVM are:{'C': [0.1, 1, 10], 'kernel': ['linear', 'rbf'], 'gamma': ['scale', 'auto']}.
Figure 25. Workflow for predicting mutations by dESM
4.1.3 Model Evaluation
The evaluation framework relied on stratified K-fold cross-validation to ensure robust performance assessments across validation and test datasets. Separate analyses of SPM and MPM datasets were conducted to evaluate model generalizability across distinct mutation categories.[21] Model performance was measured using metrics such as Precision, Recall, Accuracy, F1 Score, Matthews Correlation Coefficient (MCC), Area Under the Curve (AUC), and Average Precision (AP). To further interpret model predictions, visualizations such as Receiver Operating Characteristic (ROC) curves, precision-recall curves, and confusion matrices were employed.
Molecular docking was conducted using AutoDock Vina (version 1.2.x) controlled by a custom Python script. For each ligand-receptor pair, grid box parameters (center coordinates and box sizes) were specified in a task file, and the script automatically created subdirectories, invoked Vina five times, and recorded the output. From each run, the docking log and output PDBQT file were parsed to extract the pose with the lowest binding energy (kcal·mol⁻¹). The best-scoring poses were saved both in PDBQT and converted to MOL2 format using Open Babel (version 3.x)[22]. Binding energies from all repeats were summarized in a text file to evaluate the reproducibility of docking results.
4.2 Result
4.2.1 Single-Point Mutation Prediction
For single-point mutation prediction, we performed cross-validation on the S10998 dataset and then evaluated the trained five models on the independent test set S2814 (Figure 26, Figure 27, Table 1 and Table 2). In cross-validation, the graph-based methods (dESM1v-AF3-GCN and dESM1v-AF3-GAT) achieved the best results, with AUC values of 0.99 and 0.98, respectively. These models also demonstrated high accuracy (0.95) and F1 scores (0.95), indicating their effectiveness in integrating sequence and structural features. However, their performance declined significantly on the test set, with a sharp drop in sensitivity (TPR = 0.24), although they maintained high specificity (TNR > 0.94). This led to AUC values of 0.70 and 0.69, suggesting challenges in generalizing to unseen data, possibly due to overreliance on structural or contextual patterns in the training data.
The MLP-based models, dESM1v-MLP and dESM2b-MLP, both achieved AUC values of 0.98 in cross-validation. On the test set, dESM2b-MLP outperformed dESM1v-MLP with AUC values of 0.65 and 0.64, respectively. However, sensitivity remained moderate, highlighting the difficulty of detecting true positive mutations in real-world scenarios.
RF and SVM showed consistent but slightly lower performance. Their cross-validation AUC values ranged from 0.96 to 0.97, and test set AUC values ranged from 0.64 to 0.65, reflecting their limitations in capturing complex mutation effects.
Figure 26. Comparison of models (AF3-GCN, AF3-GAT, MLP, RF, SVM) with ESM-1v embedding on single-point mutations dataset using ROC curves and confusion matrices.
Figure 27. ROC curves and confusion matrices for models (MLP, RF, SVM) with ESM-2b embedding on single-point mutations dataset
Table 1. Multi-model 5-fold Cross-validation Results for Single-point Mutations
|
Model |
TN |
FP |
FN |
TP |
TPR |
TNR |
Precision |
Accuracy |
AP |
F1 |
MCC |
AUC |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| dESM1v-AF3-GCN | 4998 | 257 | 230 | 5025 | 0.96 | 0.95 | 0.95 | 0.95 | 0.93 | 0.95 | 0.91 | 0.99 |
| dESM1v-AF3-GAT | 4944 | 311 | 270 | 4985 | 0.95 | 0.94 | 0.94 | 0.94 | 0.92 | 0.94 | 0.89 | 0.98 |
| dESM1v-MLP | 4960 | 295 | 323 | 4932 | 0.94 | 0.94 | 0.94 | 0.94 | 0.92 | 0.94 | 0.88 | 0.98 |
| dESM1v-RF | 4834 | 421 | 412 | 4843 | 0.92 | 0.92 | 0.92 | 0.92 | 0.89 | 0.92 | 0.84 | 0.96 |
| dESM1v-SVM | 4937 | 318 | 322 | 4933 | 0.94 | 0.94 | 0.94 | 0.94 | 0.91 | 0.94 | 0.88 | 0.97 |
| dESM2b-MLP | 4974 | 281 | 281 | 4974 | 0.95 | 0.95 | 0.95 | 0.95 | 0.92 | 0.95 | 0.89 | 0.98 |
| dESM2b-RF | 4877 | 378 | 389 | 4866 | 0.93 | 0.93 | 0.93 | 0.93 | 0.90 | 0.93 | 0.85 | 0.97 |
| dESM2b-SVM | 4931 | 324 | 342 | 4913 | 0.94 | 0.94 | 0.94 | 0.94 | 0.91 | 0.94 | 0.87 | 0.97 |
| EnzyAct | 5037 | 218 | 277 | 4978 | 0.95 | 0.96 | 0.96 | 0.95 | 0.93 | 0.95 | 0.91 | 0.98 |
Table 2. Multi-model Test Set Results for Single-point Mutations
|
Model |
TN |
FP |
FN |
TP |
TPR |
TNR |
Precision |
Accuracy |
AP |
F1 |
MCC |
AUC |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| dESM1v-AF3-GCN | 4593 | 267 | 844 | 274 | 0.24 | 0.94 | 0.51 | 0.81 | 0.27 | 0.33 | 0.26 | 0.70 |
| dESM1v-AF3-GAT | 4596 | 264 | 854 | 264 | 0.24 | 0.95 | 0.5 | 0.81 | 0.26 | 0.32 | 0.25 | 0.69 |
| dESM1v-MLP | 4524 | 336 | 886 | 232 | 0.21 | 0.93 | 0.41 | 0.80 | 0.23 | 0.28 | 0.18 | 0.64 |
| dESM1v-RF | 4459 | 401 | 836 | 282 | 0.25 | 0.92 | 0.41 | 0.79 | 0.24 | 0.31 | 0.21 | 0.64 |
| dESM1v-SVM | 4451 | 409 | 844 | 274 | 0.24 | 0.92 | 0.40 | 0.79 | 0.24 | 0.30 | 0.20 | 0.62 |
| dESM2b-MLP | 4185 | 675 | 724 | 394 | 0.35 | 0.86 | 0.37 | 0.77 | 0.25 | 0.36 | 0.22 | 0.65 |
| dESM2b-RF | 4336 | 524 | 794 | 324 | 0.29 | 0.89 | 0.38 | 0.78 | 0.24 | 0.33 | 0.2 | 0.65 |
| dESM2b-SVM | 4455 | 405 | 823 | 295 | 0.26 | 0.92 | 0.42 | 0.80 | 0.25 | 0.32 | 0.22 | 0.65 |
| EnzyAct | 4817 | 43 | 981 | 137 | 0.12 | 0.99 | 0.76 | 0.83 | 0.26 | 0.21 | 0.26 | 0.73 |
| Scanner | 2802 | 20 | 530 | 60 | 0.10 | 0.99 | 0.99 | 0.84 | 0.23 | 0.18 | 0.24 | 0.68 |
4.2.2 Predicted probabilities of enhanced enzymatic activity
A total of 9,443 mutants were generated by substituting each of the 497 amino acids in the UGT73F3 sequence with the other 19 amino acids. These mutants were subsequently evaluated using a machine learning model to estimate the probability of enhanced enzymatic activity (Figure 28). From this set, the top 100 mutants with the highest predicted probabilities were selected, and molecular docking was performed for each mutant in five independent runs.
Figure 28. Predicted probabilities of enhanced enzymatic activity (gray indicates WT).
4.2.3 Molecular docking results
The results indicated that, after five independent docking runs were performed for each of the top 100 mutants, the binding energies remained relatively stable across replicates. Among these mutants, S175P showed the most stable performance, with an average binding energy of -8.64 kcal·mol⁻¹[23], suggesting the highest potential enzymatic activity[24]. Based on these findings, S175P was preliminarily identified as the optimal mutant, and further validation will be conducted through molecular dynamics simulations.
Figure 29. Docking result of the S175P mutant as rendered with PyMOL.
Figure 30. Docking result of the WT mutant as rendered with PyMOL.
4.3 Molecular dynamics simulation verification
The molecular docking results of the modified S175P enzyme can indicate that the lowest energy state is more stable in a static situation. However, the transformation from R1 to Ro is a dynamic spatiotemporal change process, and this result cannot explain the overall stability change of the reaction system in the spatiotemporal dimension.
To further explore the influence of the S175P enzyme on the reaction process and ensure low cost and repeatability, we used GROMACS[25], a supercomputing-verified, open-source and auditable platform, to conduct molecular dynamics simulations on the docking results of wild-R1 and S175P-R1[26], and further investigate the effect of this enzyme on the transformation of ginsenoside R1 to Ro.
In this study, we selected the CHARMM36 force field, a unified parameter-level force field that is currently the only public force field to pass all NIH-FDA benchmark criteria, as the force field parameters for the computational system, and used the TIP3P water model for solvation, adding ions to the solvent to neutralize the reaction system. Under the condition of system energy minimization, the system was heated from 0K to 300K within 500ps in the canonical ensemble, and then pre-equilibrated at 310.15K. Finally, a 100ns molecular simulation was conducted in the isothermal-isobaric ensemble, maintaining periodic boundary conditions. All covalent bonds involving hydrogen were restricted by the SHAKE method.
Figure 31. Overall wild-type movement trajectory.
Figure 32. Trajectory of Local Wild Type.
Figure 33. Overall wild-type movement trajectory.
Figure 34. Trajectory of Local Mutation Type.
Analyses included root-mean-square deviation (RMSD), root-mean-square fluctuation (RMSF), radius of gyration (Rg), hydrogen bonds (H-bonds), and binding free energy (MM/PBSA). In addition, residue-level free-energy decomposition and free-energy landscape (“energy topography”) analyses were performed to localize energetic determinants and visualize conformational basins.
4.3.1 Analysis of atomic root mean square deviation (RMSD) and atomic fluctuation root mean square (RMSF)
The RMSD measures the root mean square deviation of the target structure from the reference frame;RMSD = 0 indicates perfect overlap.
RMSF measures the time-averaged positional deviation of atom/residue i relative to its reference position (computed per residue); RMSF = 0 indicates no fluctuation.
4.3.2 Hydrogen bond analysis
A hydrogen atom covalently bonded to an electronegative atom X can interact with another electronegative, small-radius atom Y to form an X-H···Y hydrogen bond. Criteria: (1) distance X-Y < 3.0-3.5 Å; (2) donor angle ∠X-H···Y < 30°.
4.3.3 Binding free energy analysis
The binding free energy of the interaction is derived as follows:
In last equation, the free energy G is calculated according to following equation:
Here, Δ EGasrepresents the gas-phase molecular mechanical energy, Δ GSolvationrepresents the solvation free energy, and TΔS represents the entropy term.
Δ EGascan be further divided into three components: van der Waals energy (Δ EvdW), electrostatic energy (Δ EELE), and gas-phase internal energy (Δ EINT).
The solvation free energy Δ GSolvation is calculated using a continuum solvent model and consists of polar contributions Δ GPB and nonpolar contributions Δ Gnonpolar.
The conformational entropy -TΔS is estimated through quasi-harmonic analysis of the simulation trajectory. Since its contribution to the results is minimal, it can be neglected.
4.3.4 Contribution of free energy from residue decomposition
To verify the effectiveness of the S175P mutation site, decomposition free energy contribution analysis was conducted on residues within a distance of 4.0 a from the ligand.
4.3.5 Free energy topography diagram
In order to understand the free energy distribution of the system under different conformations, the gmx sham in the gromacs subcommand was used to generate the free energy map, which can show the free energy distribution of the system under different conformations, just like a "map" of the energy state. Through this "map", we can clearly see how the energy of the system is distributed in different states, thereby better understanding the energy state of the system.
4.3.6 Root mean square deviation (RMSD) analysis and root mean square fluctuation (RMSF) analysis results
Figure 35. Trajectory of Local Wild-Type Movement.
RMSD represents the distance between the same atom in different structures. The RMSD of a protein can reveal the positional changes between the protein's conformations during the simulation and its initial conformation. The trend in the RMSD of both the protein and the ligand is an important indicator of whether the simulation has reached stability. If the ligand may have a destabilizing effect on the protein, the protein's RMSD can not only reflect the stability of the protein structure but also further indicate whether the ligand has a destabilizing effect on the protein.
A lower RMSD value indicates high stability of the protein, while a higher RMSD suggests that the backbone has undergone conformational changes during the simulation.
As shown in RMSD result, the RMSD of S175P reached a stable state within a certain period of time, while the RMSD of Wild did not reach stability. This indicates that the structure after the combination was not in a balanced state.
Figure 36. Comparison results of RMSF.
RMSF represents the average atomic positional changes over time and can characterize the flexibility and degree of motion of individual amino acids throughout the simulation.
The RMSF of a protein is used to determine the fluctuation of each residue during the simulation. Higher RMSF values are associated with regions of greater flexibility, such as loop regions (and intrinsically disordered regions). Additionally, amino acid domains distant from the binding site may also exhibit higher RMSF values.
As can be seen from RMSF results the flexibility or the degree of movement in the "wild" area is higher, indicating greater instability.
4.3.7 Results of hydrogen bond analysis
Figure 37. Hydrogen bond analysis results.
Hydrogen bonds and hydrophobic interactions play a significant role in maintaining protein conformation. In addition, hydrogen bonds between amino acids can alter the internal geometric structure of proteins. The following figure shows the changes in the number of hydrogen bonds in the ligand-protein complex. As shown in Figure 37, the combination of S175P and R1 can form more hydrogen bonds, which means a more stable combination.
4.3.8 Results of combined with free energy analysis
Figure 38.combines the free energy results The results showed that the mutant had a lower free energy, which was conducive to the positive progress of the reaction.
4.3.9 Results of free energy from residue decomposition
Figure 39. shows the free energy contribution diagram of residue decomposition.
4.3.10 Results of free energy topography diagram
Figure 40. 3D free energy topography of wild-type.
Results of free energy topography diagram
Figure 41. 3D free energy topography of mutant type.
Results of free energy topography diagram
Figure 42. 2D free energy topography of wild-type.
Results of free energy topography diagram
Figure 43. Mutant 2D free energy topography.
4.3.11 Conclusion
Based on the results of comprehensive structural simulation, molecular docking and molecular dynamics simulation, the improved S175P has better binding effect and catalytic activity.
References
[1] Hao, L., et al. (2025). Multi-strategy UGT mining, modification, and glycosyl donor synthesis facilitate the production of triterpenoid saponins. Frontiers in Plant Science, 16, 1586295.
[2] Ren, S. C., et al. (2022). Sustainable production of rare oleanane-type ginsenoside Ro with an artificial glycosylation pathway. Green Chemistry, 24(22), 8302-8313.
[3] Carpita, N. C. (2011). Update on mechanisms of plant cell wall biosynthesis: How plants make cellulose and other (1→4)-β-D-glycans. Plant Physiology, 155(1), 171-184.
[4] Yin, Y. B., Johns, M. A., Cao, H. S., & Rupani, M. (2014). A survey of plant and algal genomes and transcriptomes reveals new insights into the evolution and function of the cellulose synthase superfamily. BMC Genomics, 15(1), 260.
[5] Kumar, S., et al. (2024). MEGA12: Molecular evolutionary genetic analysis version 12 for adaptive and green computing. Molecular Biology and Evolution, 41(5), msae263.
[6] Edgar, R. C. (2004). MUSCLE: Multiple sequence alignment with high accuracy and high throughput. Nucleic Acids Research, 32(5), 1792-1797.
[7] Letunic, I., & Bork, P. (2021). Interactive Tree of Life (iTOL) v5: An online tool for phylogenetic tree display and annotation. Nucleic Acids Research, 49(W1), W293-W296.
[8] Michaelis, L., & Menten, M. L. (1913). Die kinetik der invertinwirkung. Biochemische Zeitschrift, 49, 333-369.
[9] Wang, Z., Xie, D., Wu, D., et al. (2025). Robust enzyme discovery and engineering with deep learning using CataPro. Nature Communications, 16(1), 2736.
[10] Holland, J. H. (1975). Adaptation in natural and artificial systems. University of Michigan Press.
[11] Li, L., Chen, G., & Zhou, H. (1999). Theory and application of metabolic control analysis. Guangxi Science, 6(2), 115-119.
[12] Li, G., Zhang, N., Dai, X., & Fan, L. (2024). EnzyACT: A novel deep learning method to predict the impacts of single and multiple mutations on enzyme activity. Journal of Chemical Information and Modeling, 64(12), 5912-5921.
[13] Meier, J., et al. (2021). Language models enable zero-shot prediction of the effects of mutations on protein function. In Proceedings of the 35th International Conference on Neural Information Processing Systems (Article 2243). Curran Associates Inc.
[14] Lin, Z., et al. (2023). Evolutionary-scale prediction of atomic-level protein structure with a language model. Science, 379(6637), 1123-1130.
[15] Abramson, J., Adler, J., Dunger, J., Evans, R., Green, T., Pritzel, A., Ronneberger, O., Willmore, L., Ballard, A. J., Bambrick, J., Bodenstein, S. W., Evans, D. A., Hung, C. C., O'Neill, M., Reiman, D., Tunyasuvunakool, K., Wu, Z., Žemgulytė, A., Arvaniti, E., … Jumper, J. M. (2024). Accurate structure prediction of biomolecular interactions with AlphaFold 3. Nature, 630(8016), 493-500.
[16] Lin, Z., et al. (2022). Language models of protein sequences at the scale of evolution enable accurate structure prediction. bioRxiv, 2022.07.20.500902.
[17] Vrahatis, A., Lazaros, K., & Kotsiantis, S. B. (2024). Graph attention networks: A comprehensive review of methods and applications. Future Internet, 16(8), 318.
[18] Denisko, D., & Hoffman, M. M. (2018). Classification and interaction in random forests. Proceedings of the National Academy of Sciences of the United States of America, 115(8), 1690-1692.
[19] Chen, C., Geng, H., Yang, N., Yang, X., & Yan, J. (2024). EasyDGL: Encode, train and interpret for continuous-time dynamic graph learning. IEEE Transactions on Pattern Analysis and Machine Intelligence, 46(9), 10845-10862.
[20] Wang, M., et al. (2019). Deep graph library: A graph-centric, highly-performant package for graph neural networks. arXiv preprint arXiv:1909.01315.
[21] Fang, J. (2020). A critical review of five machine learning-based algorithms for predicting protein stability changes upon mutation. Briefings in Bioinformatics, 21(4), 1285-1292.
[22] Eberhardt, J., Santos-Martins, D., Tillack, A. F., & Forli, S. (2021). AutoDock Vina 1.2.0: New docking methods, expanded force field, and Python bindings. Journal of Chemical Information and Modeling, 61(8), 3891-3898.
[23] Trott, O., & Olson, A. J. (2010). AutoDock Vina: Improving the speed and accuracy of docking with a new scoring function, efficient optimization, and multithreading. Journal of Computational Chemistry, 31(2), 455-461.
[24] García-Ortegón, M., Simm, G. N. C., Tripp, A. J., Hernández-Lobato, J. M., Bender, A., & Bacallado, S. (2022). DOCKSTRING: Easy molecular docking yields better benchmarks for ligand design. Journal of Chemical Information and Modeling, 62(15), 3486-3502.
[25] Abraham, M. J., Murtola, T., Schulz, R., Páll, S., Smith, J. C., Hess, B., & Lindahl, E. (2015). GROMACS: High performance molecular simulations through multi-level parallelism from laptops to supercomputers. SoftwareX, 1-2, 19-25.
[26] Valdés-Tresanco, M. S., Valdés-Tresanco, M. E., Valiente, P. A., & Moreno, E. (2021). gmx_MMPBSA: A new tool to perform end-state free energy calculations with GROMACS. Journal of Chemical Theory and Computation, 17(10), 6281-6291.
Gratefully acknowledge the computing resources provided by the Yangtze River 3D Scientific Computing Center for molecular dynamics simulations.

