Enzyme Mining
Overview
To address the global challenge of polyester pollution in marine environments, we focused on discovering novel ester hydrolases capable of efficiently degrading marine-derived polyester plastics such as PET, PBAT, and PLA. Our enzyme mining strategy integrated bioinformatics screening, sequence clustering, and structure-based evaluation, establishing a systematic workflow to identify and prioritize candidate enzymes from marine metagenomic resources.
1. Protein Sequence Retrieval
In the enzyme mining stage, we selected Thermobifida fusca cutinase (TfCut) as our reference template. TfCut is one of the earliest and most extensively studied polyester hydrolases, with well-documented high activity toward aliphatic–aromatic polyesters such as PBAT and PET. In addition, its high-quality structural information provides a reliable foundation for sequence comparison and modeling. These features make TfCut an ideal benchmark enzyme to guide our mining and modeling efforts.
Based on these advantages, we retrieved the amino acid sequence of TfCut from the NCBI Protein database in FASTA format. The sequence is shown below:
This sequence was used as the query input to the BLAST platform to search for homologous candidate enzymes.
Figure 1. Workflow of candidate enzyme selection and refinement.
2. BLAST Search
To identify potential polyester hydrolases, we performed sequence similarity searches using NCBI blastp, with the amino acid sequence of Thermobifida fusca cutinase (TfCut) as the query.
To ensure reliability, we applied stringent thresholds:
sequence identity ≥35%
query coverage ≥50%
E-value ≤1.0E-58
These parameters eliminated spurious matches and retained only homologs with high confidence of structural and functional relevance. To further refine the pool, we focused on sequences with identity ≥40% but only moderate coverage (35–60%), which allowed us to exclude incomplete fragments or multidomain proteins while preserving candidates likely to share TfCut’s α/β-hydrolase fold and catalytic motifs.
This search strategy balanced breadth and precision: while stringent cutoffs guaranteed statistical robustness, the moderate-coverage refinement enriched for sequences structurally comparable to cutinases. As a result, the initial BLAST hits were reduced to a manageable set of dozens of candidates, forming the foundation for subsequent multiple sequence alignment (MSA), phylogenetic analysis, and structural evaluation.
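As a rough illustration of the threshold filter described above, the following minimal sketch applies the three stated cutoffs (identity ≥35%, query coverage ≥50%, E-value ≤1.0E-58) to tab-separated hit records. The column order (subject ID, percent identity, query coverage, E-value), the hit names, and the helper names are assumptions made for this example; they are not taken from the actual search output.

```python
from typing import NamedTuple


class Hit(NamedTuple):
    subject_id: str
    identity: float   # percent identity to the query
    coverage: float   # query coverage in percent
    evalue: float


def parse_hits(tabular_text: str) -> list[Hit]:
    """Parse tab-separated lines: subject_id, identity, coverage, evalue (assumed order)."""
    hits = []
    for line in tabular_text.strip().splitlines():
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        subject_id, identity, coverage, evalue = line.split("\t")
        hits.append(Hit(subject_id, float(identity), float(coverage), float(evalue)))
    return hits


def passes_thresholds(hit: Hit,
                      min_identity: float = 35.0,
                      min_coverage: float = 50.0,
                      max_evalue: float = 1.0e-58) -> bool:
    """Return True only if the hit satisfies all three cutoffs from the text."""
    return (hit.identity >= min_identity
            and hit.coverage >= min_coverage
            and hit.evalue <= max_evalue)


# Hypothetical hits, for illustration only.
example = (
    "hit_A\t62.1\t88\t3e-95\n"
    "hit_B\t41.0\t52\t1e-60\n"
    "hit_C\t33.5\t90\t2e-70\n"   # fails identity
    "hit_D\t55.0\t45\t1e-80\n"   # fails coverage
    "hit_E\t48.0\t70\t1e-40\n"   # fails E-value
)
kept = [h.subject_id for h in parse_hits(example) if passes_thresholds(h)]
print(kept)
```

Applying all three cutoffs jointly, rather than sequentially, makes it easy to see which criterion removes a given hit.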
Figure 2. Amino acid sequence of Thermobifida fusca cutinase (TfCut) retrieved from NCBI Protein database.
Figure 3. BLAST search workflow and filtering conditions applied to identify homologous enzymes.
3. Multiple Sequence Alignment
After the BLAST-based filtering, we performed a multiple sequence alignment (MSA) to verify the structural and functional relevance of the candidate enzymes. This step was critical to confirm whether the sequences retained the conserved motifs and catalytic residues characteristic of cutinase-type hydrolases.
The alignment was carried out using MAFFT v7 (L-INS-i mode). To balance accuracy and computational efficiency, we selected the auto mode option, which ensures high accuracy for sequences with moderate similarity. The BLOSUM62 substitution matrix was chosen, as it is well suited for homologous proteins in the 20–80% identity range.
The resulting MSA provided a detailed view of sequence similarities and differences among the candidates. The MSA revealed strong conservation of the Ser–His–Asp catalytic triad and the Gly-x-Ser-x-Gly motif, both of which are hallmarks of the α/β-hydrolase fold. These findings confirmed that the filtered candidates not only passed statistical thresholds but also preserved essential functional elements required for polyester hydrolysis.
In summary, the MSA step served as a functional validation filter, ensuring that the candidate sequences were bona fide cutinase-like hydrolases and providing a reliable foundation for downstream structural modeling and docking analyses.
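The motif check described above can be approximated with a simple pattern search. The sketch below scans a (possibly gapped) sequence for the Gly-x-Ser-x-Gly pentapeptide. The example sequences are invented for illustration, and a motif hit alone does not confirm the catalytic triad, which depends on residues far apart in sequence.

```python
import re

# G-x-S-x-G: Gly, any residue, Ser, any residue, Gly.
# The lookahead allows overlapping occurrences to be reported.
GXSXG = re.compile(r"(?=(G[A-Z]S[A-Z]G))")


def find_gxsxg(sequence: str) -> list[tuple[int, str]]:
    """Return (1-based position in the ungapped sequence, pentapeptide) per motif hit."""
    ungapped = sequence.replace("-", "").upper()
    return [(m.start() + 1, m.group(1)) for m in GXSXG.finditer(ungapped)]


# Hypothetical aligned fragments (gaps shown as '-').
with_motif = "MKTA-YGHS--MGGAL"
without_motif = "MKTAYAHAMAGAL"

print(find_gxsxg(with_motif))
print(find_gxsxg(without_motif))
```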
Figure 4. Multiple sequence alignment (MSA) of selected candidate enzymes using MAFFT.
4. Phylogenetic Analysis
Following the filtering of BLAST hits and validation of sequence consistency through multiple sequence alignment (MSA), we further investigated evolutionary relationships among candidate sequences by constructing a phylogenetic tree. This analysis provided an evolutionary perspective, allowing us to correlate sequence similarity with organismal origins and corresponding ecological niches. To reduce redundancy in the candidate set, we applied CD-HIT to cluster sequences based on similarity thresholds prior to tree construction.
By integrating phylogenetic topology with metadata, we identified distinct clades associated with specific environments. We paid particular attention to marine-derived branches, as enzymes originating from marine microorganisms are often naturally adapted to saline conditions and low temperatures—properties highly relevant to our intended application, where polyester degradation is expected to occur under similar cold and saline settings. Focusing on these clades enabled us to prioritize enzymes with inherent structural and functional adaptations to such conditions.
This phylogeny-guided filtering, combined with previous criteria—sequence identity, coverage, and conservation of catalytic residues—added a robust layer of confidence in selecting the most promising candidates.
Figure 5. Phylogenetic tree of candidate enzymes with environmental origin annotation.
5. Solubility Prediction
Following the phylogenetic identification of marine-adapted clades, we assessed the recombinant expression potential of candidate enzymes through in silico solubility prediction. We utilized two independent algorithms, DeepSoluE and Protein-sol, to cross-validate our results. The screening thresholds were defined as a DeepSoluE score > 0.48 and a Protein-sol score > 55.00. Adopting a conservative approach, we selected only those candidates that concurrently passed the solubility threshold on both platforms. This consensus strategy minimized false positives and increased our confidence in the downstream experimental viability of the selected candidates. Detailed prediction results are shown in Table 1, Table 2, and Figure 7.
Figure 6. DeepSoluE prediction framework. A: feature extraction of protein sequences; B: deep learning classification model.
Table 1. Solubility prediction of candidate enzymes using Protein-sol.
| NUMBER | Sequence ID | Probability | Result |
|---|---|---|---|
| 1 | Amycolatopsis marina | 0.404 | Insoluble |
| 2 | Halopseudomonas formosensis | 0.24 | Insoluble |
| 3 | Halopseudomonas nanhaiensis | 0.229 | Insoluble |
| 4 | Halopseudomonas pertucinogena | 0.321 | Insoluble |
| 5 | Halopseudomonas xiamenensis | 0.344 | Insoluble |
| 6 | Marinobacter daqiaonensis | 0.355 | Insoluble |
| 7 | Marinobacter sp. OP 3.4 | 0.381 | Insoluble |
| 8 | Cellulomonas marina | 0.434 | Insoluble |
| 9 | Cellvibrionaceae bacterium NBRC 116181 | 0.428 | Insoluble |
| 10 | Marinobacter | 0.402 | Insoluble |
| 11 | Thermocatellispora tengchongensis | 0.37 | Insoluble |
| 12 | Thermocatellispora tengchongensis | 0.379 | Insoluble |
| 13 | Alteromonadaceae bacterium M269 | 0.319 | Insoluble |
| 14 | Glaciecola sp. MH2013 | 0.341 | Insoluble |
| 15 | Oceaniserpentilla sp. 4NH20-0058 | 0.489 | Soluble |
| 16 | Oceaniserpentilla sp. 4NH20-0058 | 0.574 | Soluble |
| 17 | Oceaniserpentilla sp. 4NH20-0058 | 0.544 | Soluble |
| 18 | Shewanella surugensis | 0.419 | Insoluble |
| 19 | Bermanella sp. WJH001 | 0.421 | Insoluble |
| 20 | Aliiglaciecola sp. | 0.272 | Insoluble |
| 21 | Alteromonas sp. F394 | 0.29 | Insoluble |
| 22 | Spongiibacteraceae bacterium | 0.387 | Insoluble |
| 23 | Halomonadaceae bacterium | 0.427 | Insoluble |
| 24 | Kangiellaceae bacterium | 0.378 | Insoluble |
| 25 | Oleispira sp. | 0.334 | Insoluble |
| 26 | Bermanella sp. | 0.422 | Insoluble |
Table 2. Solubility prediction of candidate enzymes using DeepSoluE.
| NUMBER | Sequence ID | Scaled Solubility | Population Solubility |
|---|---|---|---|
| 1 | Amycolatopsis marina | 0.404 | 0.446 |
| 2 | Halopseudomonas formosensis | 0.24 | 0.446 |
| 3 | Halopseudomonas nanhaiensis | 0.229 | 0.446 |
| 4 | Halopseudomonas pertucinogena | 0.321 | 0.446 |
| 5 | Halopseudomonas xiamenensis | 0.344 | 0.446 |
| 6 | Marinobacter daqiaonensis | 0.355 | 0.446 |
| 7 | Marinobacter sp. OP 3.4 | 0.381 | 0.446 |
| 8 | Cellulomonas marina | 0.434 | 0.446 |
| 9 | Cellvibrionaceae bacterium NBRC 116181 | 0.428 | 0.446 |
| 10 | Marinobacter | 0.402 | 0.446 |
| 11 | Thermocatellispora tengchongensis | 0.37 | 0.446 |
| 12 | Thermocatellispora tengchongensis | 0.379 | 0.446 |
| 13 | Alteromonadaceae bacterium M269 | 0.319 | 0.446 |
| 14 | Glaciecola sp. MH2013 | 0.341 | 0.446 |
| 15 | Oceaniserpentilla sp. 4NH20-0058 | 0.489 | 0.446 |
| 16 | Oceaniserpentilla sp. 4NH20-0058 | 0.574 | 0.446 |
| 17 | Oceaniserpentilla sp. 4NH20-0058 | 0.544 | 0.446 |
| 18 | Shewanella surugensis | 0.419 | 0.446 |
| 19 | Bermanella sp. WJH001 | 0.421 | 0.446 |
| 20 | Aliiglaciecola sp. | 0.272 | 0.446 |
| 21 | Alteromonas sp. F394 | 0.29 | 0.446 |
| 22 | Spongiibacteraceae bacterium | 0.387 | 0.446 |
| 23 | Halomonadaceae bacterium | 0.427 | 0.446 |
| 24 | Kangiellaceae bacterium | 0.378 | 0.446 |
| 25 | Oleispira sp. | 0.334 | 0.446 |
| 26 | Bermanella sp. | 0.422 | 0.446 |
Figure 7. Intersection of solubility predictions for the mined candidate enzymes.
The solubility prediction results allowed us to compare the relative feasibility of candidate enzymes in expression systems more directly. By integrating phylogenetic information, conservation of catalytic residues, and solubility prediction, we ultimately retained four promising candidates as the focus for downstream structural modeling and experimental validation.
The four final candidate polyester hydrolase sources are:
• Amycolatopsis marina
• Glaciecola sp. MH2013
• Oceaniserpentilla sp. 4NH20-0058 (1)
• Oceaniserpentilla sp. 4NH20-0058 (2)
6. Enzyme Selection Report
In order to identify the most suitable candidate enzyme for our project, we performed a comprehensive computational analysis of four enzymes mined from different microbial sources:
• Amycolatopsis marina
• Glaciecola sp. MH2013
• Oceaniserpentilla sp. 4NH20-0058 (1)
• Oceaniserpentilla sp. 4NH20-0058 (2)
Each enzyme was evaluated through a multi-step computational pipeline that included structure prediction, molecular docking, and molecular dynamics (MD) simulations. These computational tools allowed us to assess substrate binding affinity, active-site stability, and dynamic conformational behavior of each candidate enzyme.
Molecular Docking Methodology
1) Protein Structure Preparation
The 3D structures of all four enzymes were predicted using AlphaFold 3 (Abramson et al., 2024), one of the most advanced AI-based protein structure prediction tools. The predicted structures were inspected for completeness and correctness. Missing side chains were rebuilt, and bond geometries and stereochemistry were corrected automatically.
To prepare proteins for docking:
a) Protonation: Protonation states of residues were assigned using the Protein Preparation Wizard in Schrödinger Maestro at physiological pH (7.0). Correct protonation is essential for ensuring accurate electrostatic interactions during docking.
b) Hydrogen addition: Missing hydrogens were added, particularly to polar residues and catalytic residues.
c) Charge assignment: Force-field derived charges were applied, ensuring realistic electrostatics for the docking process.
d) Energy minimization: A restrained minimization was performed to relieve steric clashes, while keeping backbone atoms fixed. This ensures the protein adopts a low-energy conformation close to its native fold.
2) Ligand Preparation
The two substrates of interest — PLA and PBAT oligomers — were prepared using LigPrep (Schrödinger). Geometry optimization and energy minimization were carried out using the OPLS4 force field to ensure accurate conformations. Tautomers and ionization states at physiological pH were generated to account for realistic binding states.
3) Docking Procedure
Docking was performed using Glide (Schrödinger suite) in Standard Precision (SP) mode for initial screening, followed by Extra Precision (XP) docking for the best-scoring candidates. Binding poses were clustered based on RMSD, and the top cluster with the lowest docking score was selected for analysis.
Binding interactions (hydrogen bonding, hydrophobic contacts, π–π stacking, and salt bridges) were analyzed using PyMOL and Maestro visualization tools (DeLano, 2002).
Purpose of docking:
Docking was used to estimate binding affinity and orientation of PLA/PBAT substrates in the enzyme’s active site. However, docking provides only a static snapshot; therefore, MD simulations were conducted to further validate binding stability under dynamic conditions.
Table 3. Docking results summary
Molecular Dynamics (MD) Simulation Methodology
1) System Preparation
a) The top docking pose for each enzyme-substrate complex was selected.
b) Using tleap (AMBER22), each complex was solvated in an octahedral OPC water box with a 12 Å buffer distance to avoid boundary artifacts.
c) Neutralization: Na⁺ and Cl⁻ counter ions were added to neutralize overall charge.
d) Force field: Protein interactions were described using ff19SB force field (Case et al., 2022), and substrates were parameterized using GAFF2 (general Amber force field for small molecules).
2) Energy Minimization
A two-step minimization was performed:
a) Restrained minimization: Solvent and ions were minimized while restraining protein heavy atoms (2000 steepest descent steps + 5000 conjugate gradient steps).
b) Unrestrained minimization: Full system minimization to remove steric clashes.
This step ensures the system starts from a physically realistic conformation.
3) Heating
The system was gradually heated from 0 K to 300 K under the NVT ensemble using the Langevin thermostat. This controlled heating avoids sudden energy jumps that may distort protein structure.
4) Equilibration
a) Stage 1 (NVT, 1 ns): The system volume was kept fixed while stabilizing temperature.
b) Stage 2 (NPT, 1 ns): Pressure was stabilized at 1 atm using the Berendsen barostat (Lin et al., 2017).
c) SHAKE constraints were applied to all bonds involving hydrogen atoms.
5) Production Run
a) A 100 ns production run was carried out for each enzyme under periodic boundary conditions.
b) Simulations were performed in the apo form (without substrate) to establish the baseline structural stability of each protein.
c) Time step: 1 fs.
d) Trajectories were saved every 1 ps for downstream analysis.
e) This allowed us to monitor overall protein folding stability, active-site rigidity, and global conformational dynamics before substrate docking studies.
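To make the scale of the production run concrete, the short calculation below converts the stated settings (100 ns, 1 fs time step, frames saved every 1 ps) into step and frame counts. The function name is hypothetical. In AMBER input files these quantities would typically be set through keywords such as `nstlim` and `ntwx`, but the actual input files are not shown here, so that mapping is an assumption.

```python
def md_run_counts(length_ns: float, timestep_fs: float,
                  save_interval_ps: float) -> tuple[int, int, int]:
    """Return (total integration steps, steps between saved frames, saved frames)."""
    fs_per_ns = 1_000_000
    fs_per_ps = 1_000
    total_steps = round(length_ns * fs_per_ns / timestep_fs)
    steps_per_frame = round(save_interval_ps * fs_per_ps / timestep_fs)
    n_frames = total_steps // steps_per_frame
    return total_steps, steps_per_frame, n_frames


total_steps, steps_per_frame, n_frames = md_run_counts(
    length_ns=100, timestep_fs=1, save_interval_ps=1
)
print(total_steps, steps_per_frame, n_frames)
```

With these settings each trajectory contains 100,000 frames from 100 million integration steps, which is worth keeping in mind when planning storage and analysis time.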
6) Trajectory Analysis
Post-simulation analyses were carried out using CPPTRAJ:
a) RMSD: Used to monitor the backbone stability of the enzymes in the absence of substrate.
b) RMSF: Evaluated residue-level flexibility, focusing on catalytic residues and loop regions to determine intrinsic stability.
c) Hydrogen bonds: Average number and lifetime of hydrogen bonds between enzyme and substrate were calculated.
d) Visualization: Trajectories were visualized in VMD and PyMOL. Plots were generated using OriginLab 2022.
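For readers unfamiliar with the two metrics, the sketch below shows how RMSD and RMSF are defined, using plain Python on a toy trajectory. It assumes the frames have already been superposed on the reference, which CPPTRAJ normally handles through a fitting step, and it is not a substitute for the CPPTRAJ analysis itself. All coordinates are invented.

```python
import math

Frame = list[tuple[float, float, float]]  # one (x, y, z) tuple per atom


def rmsd(frame: Frame, reference: Frame) -> float:
    """Root-mean-square deviation of a frame from the reference (no fitting)."""
    squared = sum((a - b) ** 2
                  for atom, ref_atom in zip(frame, reference)
                  for a, b in zip(atom, ref_atom))
    return math.sqrt(squared / len(reference))


def rmsf(frames: list[Frame]) -> list[float]:
    """Per-atom root-mean-square fluctuation around the time-averaged position."""
    n_frames = len(frames)
    result = []
    for i in range(len(frames[0])):
        mean = [sum(f[i][k] for f in frames) / n_frames for k in range(3)]
        msd = sum(sum((f[i][k] - mean[k]) ** 2 for k in range(3))
                  for f in frames) / n_frames
        result.append(math.sqrt(msd))
    return result


# Toy trajectory: atom 0 is rigid, atom 1 drifts along x.
trajectory = [
    [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)],
    [(0.0, 0.0, 0.0), (2.0, 0.0, 0.0)],
    [(0.0, 0.0, 0.0), (3.0, 0.0, 0.0)],
]
rmsd_values = [rmsd(frame, trajectory[0]) for frame in trajectory]
rmsf_values = rmsf(trajectory)
print(rmsd_values, rmsf_values)
```

RMSD summarizes how far the whole structure has moved from its starting point at each time, whereas RMSF summarizes how much each atom or residue fluctuates over the whole run.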
Table 4. RMSD and RMSF profiles of four candidate cutinases from different microorganisms
Based on the analysis of RMSD and RMSF values, either the Amycolatopsis marina or the Glaciecola sp. MH2013 enzyme can be considered for further experimental work.
Enzyme Optimization
Overview
In previous wet-lab validations, the wild-type Glaciecola sp. MH2013 enzyme showed a significant expression bottleneck: its soluble expression level was notably low, which severely constrained enzyme purification and functional studies. To overcome this limitation, we built a multi-stage computational optimization strategy that combines a deep learning-based sequence redesign algorithm with protein stability prediction tools, aiming to systematically enhance the enzyme’s soluble expression level and thermal stability.
1. Solubility Optimization: ProteinMPNN-Driven Sequence Redesign and Multi-Stage Screening
To improve the solubility of the target enzyme, we employed ProteinMPNN (v2.1), a sequence design tool that uses a message-passing graph neural network (GNN) to predict amino acid sequences from the three-dimensional backbone structure of a protein. With the backbone structure held fixed, the amino acid sequence was optimized globally to enhance folding stability and solubility. The specific steps were as follows:
1) Input the optimized full-length initial protein PDB structure file.
2) Enter the SETTING interface:
a) Set the sampling temperature to 0.15 for conservative sampling under probability distribution;
b) Select the model vanilla-v_48_020;
c) Set backbone noise to 0.2 to introduce random perturbations, thereby exploring different conformational designs.
3) Fixed positions: Input the union of active sites, binding sites, and conserved sites. Conserved sites were determined using ConSurf with default parameters, retaining sites with a conservation score >5 to avoid mutations in key functional regions.
4) Retain the top ten sequences with a ProteinMPNN score < 1.500 and predict their structures using AlphaFold. Calculate the RMSD and average pLDDT values relative to the input structure, and select sequences with high structural consistency (low RMSD) and high average confidence (high pLDDT).
Following this process, we initially generated 400 new sequences and selected 80 low-scoring sequences based on ProteinMPNN's built-in score (an average negative log-likelihood, where lower values indicate a better fit to the backbone) for the next round of evaluation.
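The effect of the low sampling temperature in step 2a can be illustrated with a temperature-scaled softmax, which is the general mechanism by which sampling temperature sharpens or flattens a probability distribution. The logits below are hypothetical values for three residues at a single position, not ProteinMPNN output, and the function name is an assumption for this sketch.

```python
import math


def temperature_softmax(logits: dict[str, float],
                        temperature: float) -> dict[str, float]:
    """Convert per-residue logits to probabilities after dividing by the temperature."""
    scaled = {aa: value / temperature for aa, value in logits.items()}
    largest = max(scaled.values())  # subtract the maximum for numerical stability
    exps = {aa: math.exp(value - largest) for aa, value in scaled.items()}
    total = sum(exps.values())
    return {aa: e / total for aa, e in exps.items()}


# Hypothetical logits for one design position.
logits = {"A": 2.0, "S": 1.5, "G": 1.0}

p_default = temperature_softmax(logits, temperature=1.0)
p_conservative = temperature_softmax(logits, temperature=0.15)

for name, probs in (("T=1.0", p_default), ("T=0.15", p_conservative)):
    print(name, {aa: round(p, 3) for aa, p in probs.items()})
```

At a temperature of 0.15 almost all probability mass falls on the top-ranked residue, so sampled sequences stay close to the most likely design; higher temperatures yield more diverse but riskier sequences.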
These 80 sequences were individually subjected to structural alignment and visual analysis using PyMOL, aligned to the reference structure TfCut. Eleven sequences with significant conformational distortions in the active pocket were eliminated. The remaining sequences exhibited RMSD values of approximately 0.4–0.6 Å relative to the original structure (typically, RMSD < 1.5 Å is considered a high structural fit), and all had structural match alignment scores above 400, indicating high structural reliability across all candidate sequences. Representative alignment results are shown in Table 1 below.
Table 1. Partial Align Results
| seq | Score | RMSD |
|---|---|---|
| 1 | 448.000 | 0.561 |
| 2 | 448.000 | 0.554 |
| 3 | 482.000 | 0.515 |
| 4 | 583.500 | 0.466 |
| 5 | 471.000 | 0.450 |
| 6 | 540.000 | 0.530 |
| 7 | 477.500 | 0.504 |
| 8 | 462.000 | 0.533 |
| 9 | 448.500 | 0.577 |
| 10 | 460.000 | 0.532 |
| 11 | 492.000 | 0.574 |
| 12 | 582.500 | 0.506 |
| 13 | 580.500 | 0.603 |
| 14 | 590.000 | 0.528 |
| 15 | 487.000 | 0.553 |
| 16 | 572.000 | 0.526 |
| 17 | 585.000 | 0.558 |
Figure 1. Alignment Diagram of seq4/seq7/seq11/seq17
Subsequently, the above sequences were systematically evaluated using two complementary solubility prediction platforms, DeepSoluE and Protein-sol, with screening thresholds of DeepSoluE score > 0.48 and Protein-sol solubility > 55.00. For example, mutant seq8 achieved a DeepSoluE score of 0.5599 and a Protein-sol solubility of 59.769. To be conservative, only mutants predicted to be soluble by both independent platforms were selected. Detailed prediction results are shown in Table 2, Table 3, Figure 3, and Figure 4. Ultimately, 13 candidate sequences with significantly higher predicted solubility than the wild-type were retained for subsequent wet-lab validation and stability optimization.
Table 2. Partial DeepSoluE Prediction Results
| SEQUENCE ID | Probability | Result |
|---|---|---|
| SEQ1 | 0.2452 | insoluble |
| SEQ2 | 0.3473 | insoluble |
| SEQ3 | 0.286 | insoluble |
| SEQ4 | 0.5059 | soluble |
| SEQ5 | 0.2809 | insoluble |
| SEQ6 | 0.2199 | insoluble |
| SEQ7 | 0.2605 | insoluble |
| SEQ8 | 0.5599 | soluble |
| SEQ9 | 0.1717 | insoluble |
| SEQ10 | 0.3749 | insoluble |
| SEQ11 | 0.5191 | soluble |
| SEQ12 | 0.6083 | soluble |
| SEQ13 | 0.5758 | soluble |
| SEQ14 | 0.2875 | insoluble |
| SEQ15 | 0.3704 | insoluble |
| SEQ16 | 0.5786 | soluble |
| SEQ17 | 0.3543 | insoluble |
| SEQ18 | 0.1525 | insoluble |
| SEQ19 | 0.4027 | insoluble |
| SEQ20 | 0.3114 | insoluble |
| SEQ21 | 0.3702 | insoluble |
| SEQ22 | 0.3923 | insoluble |
| SEQ23 | 0.15 | insoluble |
| SEQ24 | 0.1432 | insoluble |
| SEQ25 | 0.1129 | insoluble |
| SEQ26 | 0.1072 | insoluble |
| SEQ27 | 0.0651 | insoluble |
| SEQ28 | 0.0899 | insoluble |
| SEQ29 | 0.4853 | soluble |
| SEQ30 | 0.1678 | insoluble |
| SEQ31 | 0.0544 | insoluble |
| SEQ32 | 0.0599 | insoluble |
| SEQ33 | 0.1379 | insoluble |
Table 3. Partial Protein-sol Prediction Results
| Sequence ID | Solubility (%) | Scaled Solubility | Population Solubility | Isoelectric Point (pI) |
|---|---|---|---|---|
| seq1 | 44.197 | 0.361 | 0.446 | 9.950 |
| seq2 | 41.218 | 0.333 | 0.446 | 10.250 |
| seq3 | 51.096 | 0.425 | 0.446 | 10.240 |
| seq4 | 56.676 | 0.476 | 0.446 | 10.310 |
| seq5 | 45.863 | 0.376 | 0.446 | 9.850 |
| seq6 | 46.100 | 0.379 | 0.446 | 9.680 |
| seq7 | 49.191 | 0.407 | 0.446 | 10.160 |
| seq8 | 59.769 | 0.505 | 0.446 | 10.100 |
| seq9 | 49.438 | 0.409 | 0.446 | 10.020 |
| seq10 | 52.052 | 0.434 | 0.446 | 10.150 |
| SEQ11 | 57.327 | 0.489 | 0.446 | 10.184 |
| SEQ12 | 42.619 | 0.308 | 0.446 | 9.276 |
| SEQ13 | 55.742 | 0.455 | 0.446 | 10.392 |
| SEQ14 | 39.815 | 0.337 | 0.446 | 9.057 |
| SEQ15 | 41.093 | 0.370 | 0.446 | 9.683 |
| SEQ16 | 46.278 | 0.408 | 0.446 | 10.319 |
| SEQ17 | 43.761 | 0.354 | 0.446 | 10.125 |
| SEQ18 | 57.496 | 0.452 | 0.446 | 10.401 |
| SEQ19 | 44.932 | 0.402 | 0.446 | 10.238 |
| SEQ20 | 52.674 | 0.311 | 0.446 | 10.346 |
| SEQ21 | 47.185 | 0.370 | 0.446 | 10.192 |
| SEQ22 | 41.839 | 0.392 | 0.446 | 10.074 |
| SEQ23 | 59.127 | 0.510 | 0.446 | 10.463 |
| SEQ24 | 38.954 | 0.343 | 0.446 | 9.927 |
| SEQ25 | 49.816 | 0.412 | 0.446 | 10.378 |
| SEQ26 | 45.672 | 0.367 | 0.446 | 10.209 |
| SEQ27 | 50.439 | 0.465 | 0.446 | 10.332 |
| SEQ28 | 42.758 | 0.389 | 0.446 | 10.082 |
| SEQ29 | 56.291 | 0.505 | 0.446 | 10.419 |
| SEQ30 | 48.963 | 0.467 | 0.446 | 10.287 |
| SEQ31 | 44.125 | 0.454 | 0.446 | 10.176 |
| SEQ32 | 39.647 | 0.399 | 0.446 | 9.962 |
| SEQ33 | 51.884 | 0.457 | 0.446 | 10.352 |
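The two-platform consensus rule can be sketched as a simple intersection. The scores below are copied from a subset of the rows in Table 2 and Table 3; because both tables are partial, the result covers only the listed sequences and is not the full set of 13 retained candidates. Variable and function names are assumptions for this example.

```python
DEEPSOLUE_CUTOFF = 0.48      # DeepSoluE probability threshold from the text
PROTEIN_SOL_CUTOFF = 55.0    # Protein-sol solubility (%) threshold from the text

# Subset of Table 2 (DeepSoluE probability).
deepsolue = {
    "seq1": 0.2452, "seq4": 0.5059, "seq8": 0.5599, "seq11": 0.5191,
    "seq12": 0.6083, "seq13": 0.5758, "seq16": 0.5786, "seq18": 0.1525,
    "seq23": 0.15, "seq29": 0.4853,
}

# Same sequences from Table 3 (Protein-sol solubility, %).
protein_sol = {
    "seq1": 44.197, "seq4": 56.676, "seq8": 59.769, "seq11": 57.327,
    "seq12": 42.619, "seq13": 55.742, "seq16": 46.278, "seq18": 57.496,
    "seq23": 59.127, "seq29": 56.291,
}


def consensus_soluble(ds_scores: dict[str, float],
                      ps_scores: dict[str, float]) -> list[str]:
    """Return IDs that exceed both cutoffs, ordered by sequence number."""
    passed_ds = {sid for sid, score in ds_scores.items() if score > DEEPSOLUE_CUTOFF}
    passed_ps = {sid for sid, score in ps_scores.items() if score > PROTEIN_SOL_CUTOFF}
    return sorted(passed_ds & passed_ps, key=lambda sid: int(sid[3:]))


selected = consensus_soluble(deepsolue, protein_sol)
print(selected)
```

Sequences such as seq12 and seq16 pass only DeepSoluE, and seq18 and seq23 pass only Protein-sol, so the intersection removes them.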
Figure 3. Solubility Prediction of seq4
Figure 4. Intersection of Solubility Predictions
2. Stability Enhancement: Multi-Tool Consensus Mutation Screening
During the wet-lab validation phase, small-scale prokaryotic expression and purification were performed on the 13 initially screened candidate sequences. Among them, three mutants—seq4, seq11, and seq59—successfully achieved soluble expression and were purified to high purity via nickel-column affinity chromatography. SDS-PAGE results showed that the protein band for seq4 was significantly darker than the others, suggesting higher expression levels or solubility. Thus, seq4 was selected as the template sequence for subsequent stability engineering.
To enhance the enzyme's thermal stability, systematic mutation design was conducted for seq4. Stability mutations were predicted using three mature protein stability prediction tools:
a) PROSS (based on global backbone optimization and side-chain repacking),
b) FireProt (integrating evolutionary conservation and energy calculations), and
c) PoPMuSiC (based on machine learning and force field evaluations).
By integrating results from these computational tools, we aimed to eliminate individual biases and comprehensively evaluate potential stability mutations, incorporating physical modeling, statistical potentials, and evolutionary information.
To avoid impacting enzyme activity, a strict conserved-site filtering strategy was implemented:
1. Active center protection: The crystal structure was analyzed in PyMOL, and all residues within 8 Å of the active site (the binding pocket of the substrate analog dodecaethylene glycol (12P)) were fixed so that the catalytic center remained intact.
2. Evolutionary conservation exclusion: Based on ConSurf analysis, sites with conservation scores >5 (highly conserved) were excluded, so that design focused on sites that are evolutionarily more tolerant to mutation.
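The two protection rules amount to building a set of residue positions that must not be mutated. The sketch below selects residues with any atom within 8 Å of any ligand atom and merges them with a set of conserved positions. All coordinates, residue numbers, and the conserved set are invented for illustration; in practice the selection was made in PyMOL and ConSurf as described.

```python
import math

Point = tuple[float, float, float]


def residues_within(residue_atoms: dict[int, list[Point]],
                    ligand_atoms: list[Point],
                    cutoff: float = 8.0) -> list[int]:
    """Return residue numbers with at least one atom within `cutoff` Å of the ligand."""
    close = set()
    for resnum, atoms in residue_atoms.items():
        if any(math.dist(atom, lig) <= cutoff
               for atom in atoms for lig in ligand_atoms):
            close.add(resnum)
    return sorted(close)


# Hypothetical coordinates (Å).
ligand = [(0.0, 0.0, 0.0)]
protein = {
    10: [(3.0, 4.0, 0.0)],                    # 5 Å from the ligand
    50: [(10.0, 0.0, 0.0)],                   # 10 Å away, outside the cutoff
    77: [(20.0, 0.0, 0.0), (0.0, 8.0, 0.0)],  # one atom exactly 8 Å away
}
conserved = {10, 33}  # hypothetical positions with conservation score > 5

near_active_site = residues_within(protein, ligand)
protected = sorted(set(near_active_site) | conserved)
print(near_active_site, protected)
```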
These tools identified 41 (PROSS), 41 (FireProt), and 42 (PoPMuSiC) potential stability-enhancing mutation sites, respectively. These potential stability mutations are detailed in Table 4. We then compared all single-point mutations across the three mutation sets and selected those identified by at least two tools (Figure 5). The mutation results are visualized in Figure 7.
Ultimately, 17 high-frequency, high-confidence mutation sites (e.g., W4Y, A19R, V40I, K67A, Q154M, S210Y, T227I) were identified for subsequent experimental validation. This consensus strategy significantly enhanced the reliability of the mutants and minimized the risk of compromising protein function.
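One way to read the "at least two tools" rule is that two or more tools must propose the same substitution at the same position. The sketch below applies that reading to a few rows copied from Table 4. This interpretation is an assumption, and the selection actually used may have included additional criteria.

```python
from collections import Counter

# (position, wild type, FireProt, PROSS, PoPMuSiC); "-" means no prediction.
# Rows copied from Table 4.
rows = [
    (4, "W", "Y", "Y", "-"),
    (5, "F", "Q", "-", "-"),
    (19, "A", "R", "R", "-"),
    (49, "T", "P", "-", "K"),
    (91, "K", "Q", "D", "P"),
    (113, "S", "-", "P", "P"),
    (227, "T", "I", "I", "I"),
]


def consensus_mutations(table_rows, min_tools: int = 2) -> list[str]:
    """Return mutations (e.g. 'W4Y') proposed identically by at least `min_tools` tools."""
    selected = []
    for position, wild_type, *predictions in table_rows:
        counts = Counter(p for p in predictions if p != "-")
        for residue, n_tools in counts.items():
            if n_tools >= min_tools:
                selected.append(f"{wild_type}{position}{residue}")
    return selected


consensus = consensus_mutations(rows)
print(consensus)
```

Positions such as 49 and 91 are flagged by several tools but with different substitutions, so they do not pass under this reading.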
Figure 5. Intersection of Mutation Sets
Figure 6. Workflow of the FireProt Strategy
Figure 7. Overview of Mutations
Table 4. Mutation Results
| position | WT | FireProt | PROSS | PoPMuSiC |
|---|---|---|---|---|
| 4 | W | Y | Y | - |
| 5 | F | Q | - | - |
| 9 | M | D | - | - |
| 19 | A | R | R | - |
| 23 | T | - | A | - |
| 30 | S | P | - | - |
| 35 | Y | - | - | F |
| 36 | G | - | - | P |
| 37 | A | - | G | - |
| 38 | G | - | - | F |
| 40 | V | I | I | - |
| 49 | T | P | - | K |
| 62 | R | T | - | - |
| 67 | K | A | A | - |
| 86 | N | - | - | P |
| 87 | S | - | T | - |
| 88 | T | P | - | - |
| 89 | L | - | Y | - |
| 91 | K | Q | D | P |
| 95 | R | - | - | V |
| 96 | S | A | A | - |
| 97 | S | R | R | - |
| 99 | Q | - | M | W |
| 100 | M | - | L | - |
| 104 | R | D | - | - |
| 106 | V | L | L | - |
| 108 | S | A | - | - |
| 110 | N | - | - | L |
| 111 | G | N | - | - |
| 113 | S | - | P | P |
| 118 | Y | - | R | - |
| 120 | K | - | R | W |
| 121 | V | - | I | - |
| 124 | A | T | - | T |
| 126 | M | L | L | - |
| 127 | G | - | - | W |
| 130 | G | - | - | W |
| 134 | G | - | - | W |
| 136 | G | - | - | F |
| 137 | G | - | - | W |
| 138 | S | - | - | W |
| 141 | S | A | V | - |
| 143 | A | - | M | - |
| 144 | N | D | - | - |
| 145 | N | - | R | - |
| 150 | A | - | - | V |
| 151 | A | - | L | V |
| 153 | P | - | - | W |
| 154 | Q | L | M | W |
| 156 | P | - | - | W |
| 161 | T | - | - | K |
| 165 | S | - | R | - |
| 166 | V | - | I | - |
| 170 | T | - | - | I |
| 179 | R | S | - | - |
| 180 | I | V | - | - |
| 181 | A | - | - | W |
| 183 | V | - | P | - |
| 185 | S | T | - | - |
| 186 | S | - | Y | H |
| 189 | P | - | - | R |
| 192 | D | - | - | K |
| 200 | Q | - | - | W |
| 203 | E | - | - | W |
| 204 | I | - | L | - |
| 205 | K | - | N | - |
| 207 | G | - | - | W |
| 208 | S | D | - | - |
| 210 | S | - | Y | Y |
| 212 | G | - | C | F |
| 213 | G | T | - | - |
| 217 | I | T | N | - |
| 227 | T | I | I | I |
| 228 | S | A | - | W |
| 231 | R | - | K | - |
| 233 | H | F | F | - |
| 235 | C | D | - | - |
| 238 | K | T | - | - |
| 239 | A | R | R | - |
| 240 | H | - | Y | Y |
| 242 | T | Q | Q | - |
| 246 | E | S | - | - |
| 253 | L | V | - | - |
| 254 | G | - | - | P |
References
1. Hebditch M., Carballo-Amador M.A., Charonis S., Curtis R., Warwicker J. (2017). Protein-Sol: a web tool for predicting protein solubility from sequence. Bioinformatics, 33(19): 3098–3100.
2. Dehouck Y., Kwasigroch J.M., Gilis D., Rooman M. (2011). PoPMuSiC 2.1: a web server for the estimation of protein stability changes upon mutation and sequence optimality. BMC Bioinformatics, 12: 151.
3. Herrero Acero E., Ribitsch D., Steinkellner G., Gruber K., Guebitz G.M. (2011). Enzymatic surface hydrolysis of PET: effect of structural diversity on kinetic properties of cutinases from Thermobifida. Macromolecules, 44(12): 4632–4640.
4. Goldenzweig A., Goldsmith M., Hill S.E., Gertman O., Laurino P., Ashani Y., Dym O., Unger T., Albeck S., Prilusky J., Lieberman R.L., Aharoni A., Silman I., Sussman J.L., Tawfik D.S., Fleishman S.J. (2016). Automated structure- and sequence-based design of proteins for high bacterial expression and stability. Molecular Cell, 63(2): 337–346.
5. Tournier V., Duquesne S., Guillamot F., Colin J., Marty A. (2023). Enzymes’ power for plastics degradation. Chemical Reviews, 123(9): 5612–5701.
6. Wang C., Zou Q. (2023). Prediction of protein solubility based on sequence physicochemical patterns and distributed representation information with DeepSoluE. BMC Biology, 21(1): 12.
7. Yang Y., Min J., Xue T., Zhang X., et al. (2023). Complete biodegradation of poly(butylene adipate-co-terephthalate) via engineered cutinases. Nature Communications, 14(1): 1645.
8. Musil M., Ježík A., Horáčková J., et al. (2023). FireProt 2.0: web-based platform for the fully automated design of thermostable proteins. Briefings in Bioinformatics, 25(1).
9. Guo R.-T., Li X., Yang Y., et al. (2024). Natural and engineered enzymes for polyester degradation: a review. Environmental Chemistry Letters, 22(3): 1275–1296.
10. Dauparas J., Anishchenko I., Bennett N., et al. (2022). Robust deep learning-based protein sequence design using ProteinMPNN. Science, 378(6615): 49–56.
11. Weinstein J.J., Goldenzweig A., Hoch S., Fleishman S.J. (2021). PROSS 2: a new server for the design of stable and highly expressed protein variants. Bioinformatics, 37(1): 123–125.