Plant natural products have been exploited for numerous applications ranging from pharmaceutical and cosmetic to agricultural and food. Engineering microbes for the production of these valuable compounds provides a promising alternative to traditional plant isolation, yet heavily relies on various biological modeling technologies and methods. Models enable to make experiment design and data analysis more efficiently. Here, we developed REvoDesign, a semi-rational protein design modeling pipeline, which aimed at designing functional mutants with high stability and activity of enzyme used for natural product biosynthesis in microbial hosts. REvoDesign integrates cutting-edge model tools into a user-friendly package for the enzyme design community, including structure modeling, co-evolution information analysis, semi-rational design model, cross-model filtering, and sequence-based clustering model. Use these models and approaches, we design and engineered taxadiene-5-hydroxylase (T5αH), a critical P450 enzyme in paclitaxel biosynthesis, with synergistical improvement of enzyme stability and activity. It greatly increased enzyme design accuracy and reduced experimental testing burden, providing a promising alternative to tailor enzyme properties towards desired function for microbial production of natural products.
These structurally and functionally diverse plant natural products and derivatives consist of a huge resource being mined for pharmaceutical, cosmetic, and food industries (1,2). Currently, their commercial supply largely relies on plant isolation, which is suffering from low yields and limited natural resources. Alternatively, engineering microorganisms provide a promising route for these compound production due to environmental friendliness and practical feasibility (3-5). However, achieving economically visible production is often limited by poor stability and low activity of plant-derived enzymes. This is because plant enzymes are highly evolved to adapt and function in the micro-environment of plant cells, often making them suboptimal physical properties when expressed in heterologous systems. An alternative solution would be to design more fitness enzymes with high stability and activity.
With rent vast advances in computational tools including deep learning-based protein modeling and de novo protein design, both physics and data-based calculations have accelerated the profound understanding of the intricate interrelations between enzyme sequence, structure, and function (6). Meanwhile, the rapid growth in protein databases have also promoted the investigations into the evolutionary trajectories of enzymes, providing a holistic perspective on guiding experimental protein designs (7-10). In the past years, several models and methods for protein engineering have been developed by incorporating the evolutionary insights of enzymes with computational power, including structure-, dynamics-, and quantum mechanics (QM)-based methods, to evaluate the numbers of residue mutations that alleviates the sampling bias, and to identify amino acid substitutions that lead to desired catalytic property, such as PROSS (11,12), FuncLib (13), FireProt (14) , and HotSpot (15).
Despite these promising computational models for protein engineering, most cases are human or bacteria proteins designed to improve thermostabilities against industrial production-line conditions, which are different from the plant enzyme adaptation problems frequently met in microbial production for plant natural product. In addition, current computational models, including PROSS, FuncLib and ProteinMPNN, highly relies on atomic-precise protein structure data for high-accuracy of conformational dynamics calculation in enzyme design (11,13,14), and multipoint mutations often have to be introduced to stabilize native proteins to meet the high industrial criteria, which is a time-consuming and labor-intensive process. Due to very limited three-dimensional structures of plant proteins in the PDB database, it further compromise the application of current computational models in addressing plant protein-optimization problems (9). Therefore, it is imperative to develop new models/methods that allow us to design plant enzymes with high stability and activity, as well as less experimental testing efforts. As such, it would be beneficial in constructing robust strains with high productivity for the production of plant natural products.
To solve this problem, we develop a rational evolutionary-involved enzyme redesign (REvoDesign) modeling pipeline to guide the rapid evolution of plant enzymes and improve their engineering efficiency for natural product biosynthesis in microbial hosts. REvoDesign alternatively integrates computational models with conserved and co-evolutionary data to select very limited residues from surface and active center of the enzymes. Subsequently, using rational design principle, it creates minimal variant library followed by cross-model filtering and sequence-based clustering. This powerful modeling pipeline greatly reduces experimental tests, and enables scientists to effectively engineer enzymes with synergistical improvement of the stability and activity based on accurate model predictions.
REvoDesign is defined as a semi-rational design modeling pipeline aimed at designing functional mutants with high stability and activity for the microbial production of natural products. The core of this pipeline is to effectively integrate molecular structure models with co-evolution information (Fig. 1), thereby greatly increasing design accuracy and reducing experimental testing. This approach is general and can be applied, in principle, to any natural enzyme, particularly plant-sourced. REvoDesign modeling pipeline mainly contains below models and steps. In brief, REvoDesign starts with a molecular structural model with high accuracy to identify the hotspot regions through modeling/docking objectives. In parallel, conservative and co-evolutionary analysis are carried out to determine structurally and functionally coupled residue pairs for potential combined mutations. Subsequently, it performs semi-rational design by combining data-driven and rational guided strategies for residue substitution and mutant screening. Finally, cross-model filtering and sequence-based clustering are employed to reduce the scale of the experimental library (Fig. 1). This modeling pipeline follows iterative framework in which enzyme performance could be continuously optimized through the experimental verification and feedback mechanism in each iteration.
Fig.1 Overview of REvoDesign pipeline.
REvoDesign starts with the input requirements including protein structural model (a), conservative (b) and co-evolutionary information (c). The constructed structure was used to identify the hotspot regions for mutant in active center (d) and protein surface (e) for activity and stability improvement, respectively. The conservative and co-evolutionary data, combining rational guided strategies, was used to identify the potential residue substitution and screen mutants. Subsequently, cross-model filtering (f) and sequence clustering (g) were employed to minimize the library scale for experimental testing (h) and iterative optimization (i). Finally, combined mutations by crossing-region in active center and surface achieve the balance of activity and stability of enzyme without triggering negative epistatic effect (j).
Recent deep learning-powered computational modeling tools, such as AlphaFold3 (15), RoseTTAFold-AA (16), and HelixFold3 (17), offer shortcuts for structure prediction, refinement, and simulation. This provides valuable structural insights for rational design and expands the capabilities of plant enzyme engineering. Additionally, based on high-precision structures, protein-ligand molecular docking facilitates understanding of catalytic mechanisms and contributes to rational/semi-rational enzyme design. Advanced docking protocols, like DiffDock (18), RosettaLigand (19), AutoDock Vina (20), and AlphaFill (21), provide a promising approach to perform induced fitness and molecular dynamics simulations.
REvoDesign modeling pipeline integrates these cutting-edge tools into a user-friendly package for the enzyme design community. It accepts as input either from experimentally determined (e.g., X-ray, NMR, Cryo-EM) or computationally predicted enzyme-ligand complex structures, and docked using the aforementioned methods (Fig. 2). To mitigate structural artifacts from experimental conditions or coarse-grained calculations that results in a structure that may not be in its lowest energy state or most physiologically relevant conformation (e.g., due to extreme crystallization conditions), REvoDesign employs an iterative relaxation function—a customized Rosetta protocol termed "Relax with C-alpha Constraints"—to refine input models.
Fig. 2 Molecular structure modeling/docking.
Construction of protein-ligand complex. It starts with ligand preparation and protein structure prediction that is performed by computational modeling tools. Protein-ligand complex can then be constructed either by atom pair-fit approach based on the known crystal structure with similar ligand, or by protein-ligand blind docking. The latter is then refined by energy-based Relax with C-alpha Constraints approach to obtain the matured complex model.
Using above molecular structure modeling/docking pipeline, we created the taxadiene-5-hydroxylase (T5αH) structure. T5αH is a key P450 enzyme involved in anti-cancer drug paclitaxel biosynthesis. Except for a small part of the N-terminal coiled region, the predicted T5αH structure model shows high credibility (Fig. 3), which suggested that this structure model could be used for next structure analysis.
Fig. 3 T5αH structure model.
Based on this high-precision structure of T5αH, DiffDock was employed to dock the coenzyme HEME and the substrate taxadiene. The complex structure of top 20 candidates was analyzed to determine whether its catalytic site C5 was facing the heme (Fig. 4), and the most reasonable and lowest energy conformation was selected as the optimal complex (Fig. 5).
Fig. 4 The top 20 structures of taxadiene docked into T5αH-HEME by DiffDock.
Fig. 5 T5αH complex structure.
(A) and the conformation of HEME and taxadiene (B)
Besides above structural insights, consensus-based features and co-evolutionary networks driven by evolutionary pressure also offer invaluable statistical evidence and clues for designing new enzyme variants. In REvoDesign modeling pipeline, the phylogenetic profiles for each enzyme are constructed by the protein sequence as a query using UniRef databases (Fig. 6). Consensus analysis performed by position-specific substitution matrix (PSSM) can describe the evolutionary probabilities of amino acids at each residue position, in which the evolutionarily conserved residues are highlighted. REvoDesign can combine PSSM with multiple methods, such as the simplified surface entropy reduction (simSER) substitution strategy, disulfide bridging, surface electrostatic recharging, and hydrogen bonding, to search for enzyme variants with desired properties in heterologous expression systems.
On the other hand, co-evolutionary information can show how residues in a protein evolve together, reflecting structural or functional interdependencies and potential interactions between residues and across proteins. Potts-model-based algorithms like GREMLIN were originally developed for calculating residue contact maps in template-free structural modeling (22-27) and protein-protein interaction modeling (28,29), which are essentially direct coupling analysis (DCA) problems. We found that GREMLIN can effectively scan coupling matrices with a fixed known residue. This process reflects the co-evolution between positions to identify strongly correlated sites (Fig. 6). The single-site scanning approach ranks potential interaction partners, guiding single or multi-residue mutations (30). By analyzing these matrices, we can find positions that co-evolve with a specified residue, pinpointing potential sites for rational enzyme design (31). Analyzing correlated mutations with GREMLIN would be beneficial to identify potential residue pairs that are crucial for enzyme stability and activity. These findings guide to target mutations, preserving key interactions while introducing beneficial changes. Thus, coevolution-based strategies could enhance enzyme engineering capabilities, enabling precise construction of mutant libraries and more efficient optimization of protein function and performance.
Fig. 6 Evolutionary analysis.
Hotspot regions are selected for design, and at each position, sequence space is constrained by evolutionary conservation analysis (PSSM) and co-evolutionary analysis of residue pairs (GREMLIN).
Using above approaches, we performed the conserved and co-evolution analysis of T5αH hotspot region (catalytic active center). The amino acid sequence was used as input and the position-specific iterative basic local alignment search program PSI-BLAST was run to search the UniRef90 database to obtain the PSSM scoring matrix (Fig. 7A). In parallel, the multiple sequence alignment tool HH-suite was used to search the UniClust30 protein sequence database to obtain multiple sequence alignment data, and the co-evolutionary association probability distribution of residue pairs was calculated using the GREMLIN algorithm to reveal the variation trend of associated co-evolutionary residue pairs in protein sequences (Fig. 7B). For example, we here show the top 10 candidate residue pairs co-evolved with T310 residue (Fig. 7C).
Fig. 7 T5αH conservative analysis.
(A) and co-evolutionary analysis (B). Among them, the top 10 candidate residue pairs co-evolved with T310 residue was shown (C).
REvoDesign modeling pipeline could effectively identify the key designable residues in binding pockets or channels through a simplified spatial distance-based search mechanism (based on heterologous residues). Based on the structural information, amino acids within 6 Å of the T5αH substrate, 6 Å of the coenzyme HEME, and the substrate channel were selected as design hotspots. There are 42 design sites near the substrate taxadiene (Fig. 8A), 26 design hotspots near HEME (Fig. 8B), and 9 design hotspots in the substrate import and export channels (Fig. 8C).
Fig. 8 Identification of the T5αH design hotspots near substrate taxadiene.
(A), HEME (B), and substrate import and export channels (C).
In addition, we developed the quick reference tables which enable to provide rational design strategies for guiding residue substitutions in catalytic centers and improving biophysical properties (Tab 1). Meanwhile, built-in libraries (such as the Dunbrack rotamer library (32) or DLPacker (33) are utilized to calculate side-chain conformations of candidate mutants.
Tab. 1 Rational design strategies for guiding activity engineering
Due to design of the catalytic pocket allowing for relatively high potential energy, a more flexible substitution constraint strategy is employed by integrating evolutionary information in silico saturation mutagenesis. The conservation value PSSM can be set between -10 and 10, and the higher value indicates the stronger conservation. The extremely conserved sites in the catalytic pocket usually play critical roles in catalysis, and replacements should be avoided. Using this information, we propose a reverse-conservation mutagenesis strategy to identify the potential residues for engineering. Amino acid residues with relatively low PSSM values are generally selected to mutate to residues with relatively high values to enhance enzyme catalytic activity. Conversely, amino acids with comparatively high PSSM values are mutated to residues with relatively low values to engineer enzyme catalytic specificity (Fig. 9). Through this process, together with rational design principles, including reshape catalytic pocket and substrate channel, and adjust interactions in catalytic region, 32 T5αH variants were finally generated.
Fig. 9 Semi-rational design of enzymes guided by PSSM score.
In REvoDesign, the improvement in enzyme stability and solubility are achieved by modifying the residues on protein surface. To this end, REvoDesign firstly identifies the solvent-accessible surface area (SASA) residue hotspots (Fig. 10) through an enhanced and simplified surface residue exposure search mechanism based on the PyMOL interface, derived from the FindSurfaceResidues algorithm. For example, When the exposed area is greater than or equal to 15 Å (2), it is considered an exposed site. In addition, if the surface exposed area is close to the substrate pocket, the substrate pocket area can also be excluded.
Fig. 10 Protein surface residue exposure analysis
Subsequently, we developed a simplified surface entropy reduction (simSER) to perform surface residue substitutions. Together with rational design strategies (Tab 2) like conformational change, hydrophobic engineering, electrostatic energy optimization, and hydrogen-bond modification, T5αH variants were generated, in which four amino acids E/K/N/Q with higher surface entropy were replaced by amino acids with lower entropy (E-D/A/T/Y, K-R/A/T/Y, N-A/T/Y, Q-E/D/A/T/Y) (Fig. 11).
Tab 2 Rational design strategies for guiding stability engineering.
Fig. 11 Amino acids with higher surface entropy in T5αH.
(A) 12 Asn; (B) 15 Gln; (C) 25 Glu; (D) 34 Lys.
After generating and preliminarily inspecting the mutant candidate pools for the improvement of activity and stability of enzymes, respectively, cross-tool computational filtering experiments are next performed to exclude mutants that are unreasonable in certain aspects. Evaluation dimensions include thermostability scoring, sequence conservation, dynamic behavior analysis, catalytic potential prediction, and evolutionary consistency checks. Using molecular modeling and design tools, the theoretically fully or nearly fully optimized variants can be identified through multi-dimensional screening strategies, such as free energy change (ΔΔG) calculations, molecular dynamic (MD) simulation-based energy analysis, binding energy and affinity calculations, subregion dynamic property analysis, ligand channel analysis, electron transfer modeling, surface charge analysis, intermolecular interaction analysis, and machine learning-assisted prediction. For example, catalytic activity design makes use of language models UniKP 34 and ESM-1v 35 to evaluate ligand binding affinity and predict functional adaptability, respectively. Using these approaches, T5αH variants for activity optimization were further evaluated and filtered by ESM-1v, and ultimately 19 variants were selected to perform experimental testing (Tab. 3). Among them, two T5αH variants, L72M and V226E displayed much higher activity over wild type (Fig. 12).
Tab 3 T5αH variants for activity optimization.
Fig. 12 Experimental test for T5αH variants with desired activity.
Stability optimization can be assessed by PSSM conservation scores, ddG, and MD simulation-based dynamic property changes. For example, the PSSM score was set in the range of -2 to 100 (positive infinity), together with simSER, to screen the candidate substitutions with more conservative yet lower solvent-accessible entropy. On the other hand, we can also employ pre-calculated ddG scanning dataset and set ΔddG in the range of -100 to 0.1 (lower value indicates greater stability) to design the low-surface-energy mutant library. Pythia ddG was used to calculate the free energy changes of T5αH variants generated by surface entropy reduction approach, and 26 mutants with reduced ddG were selected (Tab. 4). On this basis, mutants with increased or not significantly decreased PSSM values were further screened, and finally 8 variants were selected for next experiment test (Tab. 5). Among them, two T5αH variants, L72M and V226E displayed much higher production over wild type (Fig. 13).
Tab. 4 Mutants with reduced ddG.
Tab. 5 Mutants screened by PSSM.
Fig. 13 Experimental test for T5αH variants with surface engineering.
Through virtual screen and cross-tool filtration, a compact and highly credible variant library can be refined from the initial design pool, which is beneficial for reducing experimental testing and improving the success rate. Our experience shows that combining evaluations of conservation, stability, and mechanistic rationality can effectively screen out optimized mutants.
To further minimize the size of variant library for experimental testing, REvoDesign includes an optional sequence-based clustering module, which groups similar mutants based on their sequence changes and optionally Rosetta scoring results. For each cluster, the representative variants (i.e., ‘leave nodes’) are selected, thereby ensuring that diverse functional hypotheses are retained with minimal experiment workload. Furthermore, Users can set constraints, such as the minimum and maximum numbers of mutations per design (defaults are 1 and 1 for no combinations) and the desired final library size. This helps to balance the exploration depth and wet-lab feasibility. Here, we conducted a clustering analysis on the mutants obtained by the PSSM and Gremlin algorithms for T5αH. After clustering (Fig. 14), REvoDesign used Rosetta to generate rapid point mutation structures and evaluated energy (Tab. 6). The mutant with the lowest energy in each cluster was selected as the representative (Fig. 14).
Fig. 14 Clustering of T5αH mutants.
(A) catalytic center; (B) Cofactor pocket; (C) Surface.
Tab. 6 The mutants with the lowest energy of each cluster.