Cycle 1 — Data Integration & Benchmark Preparation
Status Quo
Existing plastic-degrading enzyme databases were fragmented, inconsistent, and often incomplete. Many entries lacked standardization, with missing sequences, structures, or annotations, making them difficult to use for systematic modeling.
Goal
To construct a standardized, unified, and large-scale dataset for enzyme–plastic interactions, with rigorous train/test isolation, difficulty-stratified benchmarks, and biological validation, so that downstream models can be developed and evaluated objectively.
Sub-Cycle 1A — Data Standardization & Cleaning
Design
Before dataset expansion, we first needed to define what counts as a valid entry. Many existing databases contained fragmented or inconsistent records, with missing EC annotations, structures, or partial sequences. We asked: what is the minimal requirement for inclusion?
Build
We set enzyme sequence as the minimal requirement, since it carries sufficient biological information and is the most reliably retrievable field. Starting from a specialized database of plastic-degrading enzymes as the primary reference, we developed multiple retrieval scripts to query large-scale resources such as UniProt and NCBI, supplementing missing information (e.g., structures, metadata). All retrieved data were then cross-validated against the original annotations to detect inconsistencies and remove redundancies. This systematic approach allowed us to reconstruct a coherent and enriched dataset.
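For concreteness, below is a minimal sketch of the metadata back-fill step, assuming each raw entry carries a UniProt accession; the field names (uniprot_acc, sequence, flag_mismatch) are illustrative, and the real pipeline also queried NCBI and handled structure files.

```python
import requests

UNIPROT_URL = "https://rest.uniprot.org/uniprotkb/{acc}.json"

def fetch_uniprot_record(acc):
    """Fetch one UniProt entry as JSON; return None if the accession fails."""
    resp = requests.get(UNIPROT_URL.format(acc=acc), timeout=30)
    return resp.json() if resp.status_code == 200 else None

def supplement_entry(entry):
    """Back-fill a missing sequence from UniProt and flag mismatches."""
    record = fetch_uniprot_record(entry["uniprot_acc"])
    if record is None:
        return entry  # leave the entry untouched if retrieval fails
    remote_seq = record["sequence"]["value"]
    entry.setdefault("sequence", remote_seq)
    # Cross-validation: flag entries whose stored sequence disagrees with UniProt.
    entry["flag_mismatch"] = entry["sequence"] != remote_seq
    return entry
```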
Test
We ensured that no duplicate sequences remained and that every plastic-degradation entry contained at least its corresponding enzyme sequence. Sequence length distributions became more coherent, and all retrievable metadata (e.g., structures, annotations) were systematically added back.
Learn
We learned that defining a minimal entry standard (sequence) early not only prevented data collapse but also ensured feasibility for downstream model training. Through automated cross-database cleaning combined with manual review, we identified and corrected multiple issues in raw resources, including mismatched IDs, misannotated plastic types, and uneven classification granularity. This process improved dataset quality and reproducibility, laying a robust foundation for future cycles.
Sub-Cycle 1B — Benchmark Design (Isolation + Stratification)
Design
Since our ultimate goal is to build a predictive model for plastic-degrading enzymes, the first step must be to fix an independent test set from the very beginning. This is a necessary step for engineering-grade projects, ensuring that evaluation metrics remain objective and not artificially inflated by models indirectly "seeing" similar data during training. Once the test set is fixed, the next challenge is how to split it more meaningfully: a purely random hold-out cannot capture the true biological challenge of enzyme prediction. To address this, we proposed using sequence similarity as the guiding metric, since similarity reflects evolutionary and functional relationships between enzymes. This allows the test set to represent real biological diversity and stratified difficulty levels, rather than being just a leftover subset.
Build
Following the design principle, we developed an algorithmic pipeline for test set stratification based on sequence similarity. For each candidate enzyme, we computed its nearest sequence identity to the training set using global alignment. This ensured that classification into difficulty tiers reflected the hardest comparison it would face in practice. Using iterative calibration, we finalized two thresholds—0.9 and 0.7—which divided the benchmark into three levels:
- Easy: identity ≥ 0.9 (high similarity to training sequences)
- Medium: 0.7 ≤ identity < 0.9 (moderate similarity)
- Hard: identity < 0.7 (low similarity, distant from training data)
In addition, the algorithm included balancing constraints, ensuring that the number of samples across the three tiers was as even as possible. The entire process was automated and reproducible, generating stratified FASTA files and metadata tables for downstream modeling.
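A simplified sketch of the tier-assignment logic is shown below, assuming Biopython for global alignment; the scoring parameters and the normalize-by-longer-sequence convention are illustrative, and at scale a fast prefilter (e.g., MMseqs2) would precede exact alignment.

```python
from Bio import Align

aligner = Align.PairwiseAligner()
aligner.mode = "global"
aligner.match_score = 1.0
aligner.mismatch_score = 0.0
aligner.open_gap_score = -1.0
aligner.extend_gap_score = -0.5

def global_identity(a, b):
    """Fraction of identical aligned positions, normalized by the longer sequence."""
    aln = aligner.align(a, b)[0]
    matches = sum(x == y for x, y in zip(aln[0], aln[1]) if x != "-")
    return matches / max(len(a), len(b))

def assign_tier(test_seq, train_seqs):
    """Tier by the nearest training sequence, i.e. the hardest comparison."""
    best = max(global_identity(test_seq, t) for t in train_seqs)
    if best >= 0.9:
        return "easy"
    if best >= 0.7:
        return "medium"
    return "hard"
```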
Test
After constructing the benchmark, we carried out additional validation to ensure its correctness. Using independent scripts, we confirmed that no duplicate entries existed between training and test sets, eliminating any risk of data leakage. We also re-calculated the sequence similarity values to cross-check the automated pipeline, ensuring that each sample was assigned to the correct difficulty tier. Finally, analysis of the distribution across tiers showed that the benchmark achieved a nearly balanced ratio (~1:1:1) among easy, medium, and hard levels. Together, these results indicate that the benchmark not only preserves biological meaning but also provides strong evaluation capacity, capable of reliably distinguishing true performance differences between models.
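The leakage check itself reduces to a set intersection; a minimal sketch (exact-sequence duplicates only, since near-duplicates are already captured by the easy tier):

```python
def check_no_leakage(train_seqs, test_seqs):
    """Fail loudly if any exact sequence appears in both splits."""
    overlap = set(train_seqs) & set(test_seqs)
    assert not overlap, f"{len(overlap)} sequences appear in both splits"
```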
Learn
From this sub-cycle, we learned that designing a benchmark is not just about splitting data into training and test sets, but about embedding biological logic into evaluation. By using sequence similarity as the splitting criterion, we ensured that the test set represents realistic levels of biological diversity. The near-balanced distribution across easy, medium, and hard tiers confirmed the robustness of our pipeline. Importantly, this setup enables tier-specific evaluation: by separately analyzing performance on each difficulty level, we can better judge a model's generalization ability and its biological interpretability, rather than relying on a single aggregated score.
Sub-Cycle 1C — Biological Validation of Dataset
Design
To ensure that our dataset is not only computationally consistent but also biologically reliable, we designed a two-pronged validation strategy: (1) wet-lab experiments to confirm that selected enzymes truly exhibit degradation activity, and (2) bioinformatic analyses to examine family composition and evolutionary structure.
Build
Wet-lab:
We selected five representative enzyme sequences using a semi-random stratified sampling strategy: on the one hand, well-documented entries from the literature to cover established representatives; on the other, underexplored candidates to ensure representativeness and diversity. The chosen genes were cloned into expression plasmids and heterologously expressed in a standard E. coli host via the pelB signal peptide pathway, which is widely used and previously reported in the literature for periplasmic secretion. The expressed proteins were subsequently purified for downstream validation.

Bioinformatics:
All entries were standardized and redundancy-removed before analysis. Family assignments were performed using Pfam/InterPro domain and HMM-based annotations. Representative sequences were then aligned with MAFFT/Clustal Omega, and a phylogenetic tree was constructed using the maximum-likelihood method with bootstrap support, to assess the consistency of family composition and evolutionary structure.
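A condensed sketch of this pipeline is given below, assuming HMMER, MAFFT, and IQ-TREE 2 are installed (the text above also mentions Clustal Omega as an alternative aligner; exact flags vary by tool version).

```python
import subprocess

def annotate_and_build_tree(fasta, pfam_hmm):
    # 1) Pfam domain annotation with HMMER's hmmscan
    #    (Pfam-A.hmm must be hmmpress'ed beforehand).
    subprocess.run(["hmmscan", "--domtblout", "pfam_hits.domtbl", pfam_hmm, fasta],
                   check=True)
    # 2) Multiple sequence alignment with MAFFT (alignment written to stdout).
    with open("aln.fasta", "w") as out:
        subprocess.run(["mafft", "--auto", fasta], stdout=out, check=True)
    # 3) Maximum-likelihood tree with ultrafast bootstrap (IQ-TREE 2 syntax).
    subprocess.run(["iqtree2", "-s", "aln.fasta", "-B", "1000"], check=True)
```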
Test
Wet-lab:
Soluble expression of the target proteins was confirmed by SDS-PAGE, which showed clear bands at the expected molecular weights, and Western blot analysis further verified the identity of the expressed enzymes. The purified proteins were subjected to activity assays using substrates relevant to ester-bond and polyester hydrolysis. All five samples exhibited significant enzymatic activity compared with negative controls, with distinct differences in substrate degradation patterns, confirming that the selected dataset entries correspond to biologically functional enzymes.
Bioinformatics:
Statistical profiling confirmed that the dataset is consistent and biologically meaningful: sequence length distributions, taxonomic coverage, and plastic-category coverage were coherent, and redundancies were successfully removed. Family assignment revealed that α/β-hydrolases constitute the majority, accompanied by cutinase-like and lipase-like families known for polyester degradation. Phylogenetic analysis further supported this: representative sequences clustered into clades consistent with annotated families, and the overall evolutionary structure was in line with prior literature on plastic-degrading enzymes.
Learn
This validation confirmed that our curated entries are biologically meaningful: the selected enzymes showed real activity, and bioinformatic analysis revealed family and evolutionary structures consistent with literature. Importantly, we observed that enzyme sequences are often dominated by conserved, structure-maintaining regions, reflecting evolutionary constraints. At the same time, we recognized that plastic-degrading enzymes are not products of purposeful evolution but rather cases of fortuitous adaptive gain, where the natural substrates of hydrolases share structural similarity with synthetic plastics. This helps explain why small mutations in catalytic regions can sometimes unlock unexpected plastic-degrading activity.
Conclusion
Through this cycle, we successfully constructed a standardized, unified, and large-scale dataset for enzyme–plastic interactions. The dataset was built with rigorous train/test isolation, equipped with difficulty-stratified benchmarks based on sequence similarity, and validated through both wet-lab experiments and bioinformatic analyses. As a result, we achieved a dataset of high completeness, reliability, and biological grounding, which can now be packaged and shared as a milestone deliverable. This foundation ensures that downstream models can be developed and evaluated in an objective, reproducible, and biologically meaningful manner.
Cycle 2 — Baseline ML with ESM Embeddings
Goal
To design and implement the first predictive baseline for enzyme–plastic interactions by framing the problem as a supervised classification task (enzyme sequence → plastic type). This cycle focused on exploring different enzyme sequence embeddings within a fixed ML framework and on testing strategies to handle class imbalance between common and rare plastics. The objective was to assess the feasibility of prediction, establish an initial baseline, and prepare insights for further model development.
Sub-Cycle 2A — Embedding Comparison
Design
In our first modeling attempt, we adopted the most straightforward design: treating plastics as labels and using enzyme sequences as inputs. The central question was whether sequence information alone is sufficient to predict the corresponding plastic type. To make the model truly "understand" sequences, we needed to transform the amino acid strings into numerical embeddings. Therefore, we tested multiple embedding approaches to explore which representation would be most suitable as the starting point for baseline prediction.
Build
We compared three sequence representations:
- One-hot encoding: each amino acid is treated as an isolated symbol, without any notion of biochemical similarity.
- Physicochemical descriptors: each amino acid is described by properties such as charge, hydropathy, or secondary structure tendencies, giving more biological meaning.
- ESM embeddings: large protein language models (ESM) are trained on millions of sequences, learning evolutionary context — e.g., conserved regions, motifs, and functional patterns — and output rich numerical vectors for each sequence.
All three representations were fed to the same fixed Histogram-based Gradient Boosting (HistGB) classifier, so that performance differences could be attributed to the representation alone (the strongest variant is sketched below).
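A minimal sketch, assuming the fair-esm package and scikit-learn; the specific checkpoint and the load_benchmark helper are illustrative, not the project's actual code:

```python
import torch
import esm  # fair-esm package
from sklearn.ensemble import HistGradientBoostingClassifier

model, alphabet = esm.pretrained.esm2_t33_650M_UR50D()  # checkpoint is illustrative
model.eval()
batch_converter = alphabet.get_batch_converter()

def embed(seqs):
    """Mean-pooled per-sequence ESM embeddings (residue tokens only)."""
    data = [(f"seq{i}", s) for i, s in enumerate(seqs)]
    _, _, tokens = batch_converter(data)
    with torch.no_grad():
        reps = model(tokens, repr_layers=[33])["representations"][33]
    # Skip BOS/EOS and padding positions before averaging.
    return torch.stack([reps[i, 1:len(s) + 1].mean(0) for i, s in enumerate(seqs)])

# Hypothetical loader for the Cycle 1 benchmark splits.
train_seqs, train_labels = load_benchmark("train")
clf = HistGradientBoostingClassifier().fit(embed(train_seqs).numpy(), train_labels)
```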
Test
The comparison showed a consistent ranking: ESM embeddings > physicochemical features > one-hot. More complex representations like ESM contained richer hidden information, which allowed the classifier to achieve higher F1 and AUROC scores on the same dataset.

Learn
We learned that advanced embeddings such as ESM capture hidden evolutionary and structural signals that simpler encodings miss, leading to stronger performance. This highlights that in biological prediction tasks, the choice of embedding method can substantially influence model outcomes. At the same time, the experiment validated that our dataset (from Cycle 1) supports meaningful learning, providing a solid baseline for future model iterations.
Sub-Cycle 2B — Handling Class Imbalance (SMOTE)
Design
In plastic-degradation research, a few plastics such as PET or PLA have been extensively studied, resulting in many training samples, while most plastics remain rarely represented. This long-tail distribution caused the classifier to overfit frequent plastics and underperform on rare ones. Our goal in this sub-cycle was to reduce imbalance so that rare plastics could be predicted more fairly.
Build
We applied SMOTE (Synthetic Minority Oversampling Technique) on the training set. SMOTE generates synthetic samples by interpolating between existing minority-class samples in feature space. Conceptually, this is like imagining plausible intermediate enzymes between rare sequences. Importantly, the validation and test sets remained untouched to preserve a realistic distribution for evaluation.
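A minimal sketch, assuming imbalanced-learn and the embeddings from Sub-Cycle 2A; note that k_neighbors must stay below the smallest class size, a real constraint with very rare plastics:

```python
from imblearn.over_sampling import SMOTE
from sklearn.ensemble import HistGradientBoostingClassifier

# X_train, y_train: ESM embeddings and plastic labels from Sub-Cycle 2A.
# Oversample the training split only; validation/test keep the real distribution.
smote = SMOTE(k_neighbors=5, random_state=0)
X_resampled, y_resampled = smote.fit_resample(X_train, y_train)
clf = HistGradientBoostingClassifier().fit(X_resampled, y_resampled)
```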
Test
After applying SMOTE, rare plastics achieved clear improvements in F1 scores, while frequent plastics remained stable. Confusion matrices showed fewer cases of rare labels being systematically misclassified as common ones.
Learn
We confirmed that class imbalance strongly reduces rare-class performance, and that SMOTE can partially compensate for this effect. However, treating plastics solely as labels still limits biological interpretability, especially given the large number of small classes.
Conclusion
In Cycle 2, we completed the first end-to-end predictive baseline for enzyme–plastic interactions. By testing different embeddings under a fixed ML framework and applying SMOTE to handle imbalance, we achieved solid predictive performance and confirmed the feasibility of our dataset for modeling. This established a practical baseline benchmark. Building on this success, our next step is to move beyond the "labels-only" framing and design models that incorporate biochemical principles and interpretable structures, enabling deeper biological insight.
Cycle 3 — Dual-Tower with Plastic Features
Goal
In earlier attempts, plastics were treated only as labels, which caused imbalance and limited interpretability. In reality, degradability arises from the interaction between enzyme properties and plastic properties. The goal of this cycle is to design a model framework that can explicitly incorporate plastic information, so predictions move beyond label-only learning and become more biochemically grounded.
Sub-Cycle 3A — Plastic Embedding Representation
Design
Molecular descriptors, numerical summaries of a molecule's properties (e.g., charge, hydrophobicity, or atom counts), are widely used in chemical modeling. However, most existing descriptors (e.g., RDKit features) were originally designed with small molecules in mind. When applied directly to polymers, they correlate strongly with molecular size, so dimensionality-reduction maps cluster by chain length rather than by true chemical similarity between plastics. The central challenge became: how do we design embeddings that reflect the intrinsic chemical properties of plastics instead of trivial size bias?
Build

To address this, we introduced density normalization, dividing descriptors by the number of heavy atoms, which reduced the dominance of molecular weight. Intuitively, this means turning descriptors into an "average per atom" representation, so that embeddings focus on chemical composition rather than sheer size. Additionally, we leveraged co-degradation relationships as a weak supervision signal: plastics that are reported to degrade together were encouraged to form closer embeddings. This was implemented by pre-training a plastic encoder with contrastive-style objectives, allowing embeddings to reflect both chemical logic and biological relevance.
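The normalization step itself is small; a sketch assuming RDKit and one repeat-unit SMILES per plastic (the descriptor list here is a tiny illustrative subset):

```python
import numpy as np
from rdkit import Chem
from rdkit.Chem import Descriptors

DESCRIPTOR_FNS = [Descriptors.MolWt, Descriptors.TPSA, Descriptors.MolLogP]

def normalized_descriptors(repeat_unit_smiles):
    """Descriptors divided by heavy-atom count: an 'average per atom' view."""
    mol = Chem.MolFromSmiles(repeat_unit_smiles)
    raw = np.array([fn(mol) for fn in DESCRIPTOR_FNS])
    return raw / max(mol.GetNumHeavyAtoms(), 1)
```

The contrastive pre-training then starts from these normalized vectors and pulls co-degrading plastics closer together in embedding space.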
Test

Dimensionality reduction with PCA and UMAP revealed that normalized embeddings produced more chemically coherent clusters. Polyesters grouped closely, polyolefins formed a separate cluster, and importantly, members of the PHA (polyhydroxyalkanoate) family began to cluster together in line with their shared biochemical properties. Unlike raw RDKit descriptors, the embeddings no longer simply followed molecular weight gradients.
Learn
We learned that polymer-specific adjustments are essential for meaningful representations. Density normalization effectively reduced molecular-weight artifacts, while co-degradation pre-training aligned embeddings with known biological behavior. Together, these strategies provided a chemically interpretable and biologically relevant representation of plastics.
Sub-Cycle 3B — Interactive Model Architecture
Design
In previous cycles, plastics were treated purely as labels, which meant the model learned only from enzyme features. However, degradability is the result of an interaction between enzyme and plastic properties. This raised a key design question: instead of predicting labels, can we build a framework where both enzymes and plastics are explicitly represented, and their compatibility is learned through interaction?
Build
Inspired by recommendation systems (e.g., user–item models), we constructed a dual-tower architecture. The enzyme tower encodes proteins with the sequence-embedding backbones established earlier, while the plastic tower encodes polymers using the newly developed embeddings (density-normalized + co-degradation pre-trained). An interaction head then combines the two towers; we tested several strategies, sketched after this list:
- Dot product (measuring similarity directly)
- Bilinear layer (capturing feature-specific interactions)
- MLP fusion (nonlinear combination of features)
This design allowed enzyme–plastic compatibility scores to be learned in a structured and interpretable manner.
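A minimal PyTorch sketch of the three heads, assuming pre-computed enzyme and plastic embeddings of dimensions d_e and d_p; layer sizes are illustrative:

```python
import torch
import torch.nn as nn

class InteractionHead(nn.Module):
    """Scores enzyme-plastic compatibility from the two tower outputs."""
    def __init__(self, d_e, d_p, mode="bilinear"):
        super().__init__()
        self.mode = mode
        self.proj = nn.Linear(d_p, d_e)            # align dims for the dot product
        self.bilinear = nn.Bilinear(d_e, d_p, 1)   # feature-specific interactions
        self.mlp = nn.Sequential(                  # nonlinear fusion
            nn.Linear(d_e + d_p, 128), nn.ReLU(), nn.Linear(128, 1))

    def forward(self, e, p):
        if self.mode == "dot":
            return (e * self.proj(p)).sum(-1, keepdim=True)
        if self.mode == "bilinear":
            return self.bilinear(e, p)
        return self.mlp(torch.cat([e, p], dim=-1))
```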
Test
Evaluation showed that the dual-tower framework achieved more balanced performance across easy, medium, and hard tiers, compared with the label-only baseline. This indicates stronger generalization: the model was not only fitting frequent cases but also extending to more challenging, less-represented plastics.
Learn
We learned that enzyme–plastic interactions are best modeled as a two-sided problem, where both enzymes and plastics contribute features. This interactive framework moved beyond simple label classification and laid the foundation for biochemically interpretable predictions, bridging molecular representation with functional outcomes.
Conclusion
In this cycle, we moved beyond treating plastics as mere labels and introduced dedicated plastic embeddings. With density-normalized descriptors and co-degradation signals, the embeddings captured chemical and biological relevance more faithfully. Integrated into a dual-tower framework, enzymes and plastics were encoded separately and combined through interaction heads, which improved interpretability and yielded more balanced performance across difficulty tiers. However, the increased complexity of the architecture also highlighted the limitations of current backbones, as overall metrics still showed room for improvement.
Cycle 4 — Structure-Aware Backbones
Goal
In the previous cycle, we improved interpretability by introducing plastic features. However, the enzyme side was still represented only by sequences, which inevitably lose critical information. Since enzyme activity is fundamentally determined by its 3D structure—including folds, catalytic pockets, and binding sites—this cycle shifts focus to building structure-aware backbones, aiming to capture spatial and geometric features essential for plastic degradation.
Sub-Cycle 4A — Structure Sources & Reliability
Design
In the previous cycle, we realized that relying on sequences alone leaves out critical information. Enzyme activity is fundamentally tied to 3D structure, such as folding, catalytic pockets, and binding residues. To incorporate structural knowledge into modeling, we first had to decide where these structures would come from. Protein structures can be obtained either from experiments (e.g., X-ray crystallography, cryo-EM) or from computational prediction (e.g., AlphaFold). For plastic-degrading enzymes, which mostly belong to well-studied hydrolase families, AlphaFold provides high-confidence predictions. Because its predictions are grounded in residue–residue contact patterns, which align naturally with our graph construction, we asked: can AlphaFold-predicted structures serve as a reliable default for introducing structural backbones into our model?
Build
We systematically retrieved AlphaFold models for enzymes in our dataset. Confidence scores (pLDDT) were inspected, showing consistently high values for plastic-degrading families. Where available, experimental PDBs were retained as gold standards, but the majority of entries were supplemented with AlphaFold structures. All structures were processed into residue contact graphs, ensuring consistency regardless of origin.
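A sketch of the retrieval and confidence-inspection steps, assuming entries carry UniProt accessions; the URL pattern below matches AlphaFold DB version 4:

```python
import requests

AF_URL = "https://alphafold.ebi.ac.uk/files/AF-{acc}-F1-model_v4.pdb"

def fetch_alphafold_pdb(acc, path):
    """Download the predicted model; return False if none exists."""
    resp = requests.get(AF_URL.format(acc=acc), timeout=60)
    if resp.status_code != 200:
        return False
    with open(path, "w") as fh:
        fh.write(resp.text)
    return True

def mean_plddt(path):
    """AlphaFold stores per-residue pLDDT in the B-factor column; read CA atoms."""
    scores = [float(line[60:66])
              for line in open(path)
              if line.startswith("ATOM") and line[12:16].strip() == "CA"]
    return sum(scores) / len(scores)
```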
Test

Figure: pLDDT-colored predicted enzyme structure. The predominance of deep-blue regions (mean pLDDT = 97.7) indicates a highly confident and accurate structural prediction.
Benchmark comparisons showed that models trained with AlphaFold-predicted structures performed on par with those using experimental PDBs. This demonstrated that AlphaFold is reliable for our dataset. Moreover, the graph construction pipeline worked seamlessly across both sources, producing consistent node/edge features.
Learn
We confirmed that AlphaFold can serve as a robust default source of structures, freeing the project from the bottleneck of limited experimental PDB coverage. This not only increased the dataset's completeness but also ensured scalability, as any new enzyme sequence can directly be assigned a predicted structure. Thus, structure-aware modeling becomes practical at scale, without being constrained by experimental availability.
Sub-Cycle 4B — Graph Representation of Enzyme Structures
Design
To move beyond sequence-only inputs, we needed a way to represent enzyme 3D structures in a machine-readable format. Proteins can be naturally modeled as graphs, where residues are nodes and contacts define edges. This raised the central question: how can we construct residue-level contact graphs that preserve both biochemical and spatial information?
Build
Instead of discarding sequence embeddings, we reorganized them into a graph representation. Each residue's initial embedding was obtained from ESM, preserving evolutionary and biochemical context. These embeddings were then assigned to graph nodes, while edges encoded spatial proximity (e.g., Cα–Cα distances within 10 Å). In this way, the model could jointly leverage sequence-derived residue embeddings and their 3D spatial relationships, making the enzyme backbone structure-aware.
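A minimal sketch of the graph construction, assuming per-residue ESM embeddings (L x D) and Cα coordinates (L x 3) extracted beforehand, and PyTorch Geometric as the graph container; the 10 Å cutoff follows the text above:

```python
import torch
from torch_geometric.data import Data

def build_residue_graph(esm_reps, ca_coords, cutoff=10.0):
    """Nodes = residues (ESM features); edges = Cα pairs closer than the cutoff."""
    dist = torch.cdist(ca_coords, ca_coords)              # pairwise Cα-Cα distances
    src, dst = torch.where((dist < cutoff) & (dist > 0))  # contacts, no self-loops
    edge_index = torch.stack([src, dst])
    edge_attr = dist[src, dst].unsqueeze(-1)              # distance as edge feature
    return Data(x=esm_reps, edge_index=edge_index,
                edge_attr=edge_attr, pos=ca_coords)
```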

Test
Initial experiments showed that models with structural graphs achieved more balanced performance across difficulty tiers compared to sequence-only baselines. Notably, enzymes with low sequence similarity but similar folds were better recognized, suggesting that structural context improved generalization.
Learn
From an engineering synthetic biology perspective, this step confirmed that proteins can be represented not just as linear sequences but as residue contact graphs, where biochemical properties are grounded in a physical 3D scaffold. By embedding structural constraints into the model, we move closer to capturing the true determinants of enzyme activity — pockets, binding sites, and catalytic environments — which sequences alone cannot fully describe. This validates that structural embeddings add essential information missing from pure sequence models, and provides a foundation for building interpretable, design-oriented predictive frameworks in future cycles.
Sub-Cycle 4C — GNN vs. GVP Backbones
Design
After confirming that residue contact graphs provide structural value, the next question was: is it enough to only know which residues are in contact, or do we also need to encode their 3D orientation? Standard GNNs (GCN, GraphSAGE, GAT) propagate scalar features across graph edges, but they cannot directly represent vector information such as bond directions or spatial orientation. Geometric Vector Perceptrons (GVPs) extend this idea by introducing both scalar and vector channels, enabling models to capture structural geometry.
Build
We implemented two enzyme backbones:
- GNN-based: residue graphs with scalar node features (AA identity, charge, hydrophobicity) and distance-based edges.
- GVP-based: same residue graphs, but augmented with 3D vector channels (residue coordinates, orientation vectors) to encode spatial geometry.
Both were trained within the dual-tower framework, ensuring a fair comparison; a minimal sketch of the scalar variant follows.
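For reference, a sketch of the scalar (GNN) tower over the graphs built in Sub-Cycle 4B, assuming PyTorch Geometric and batches from its DataLoader; the GVP variant additionally carries 3D vector channels per node and edge and is omitted here for brevity:

```python
import torch
import torch.nn as nn
from torch_geometric.nn import GCNConv, global_mean_pool

class GNNTower(nn.Module):
    """Two GCN layers over residue contact graphs, pooled to one enzyme vector."""
    def __init__(self, d_in, d_hidden=128):
        super().__init__()
        self.conv1 = GCNConv(d_in, d_hidden)
        self.conv2 = GCNConv(d_hidden, d_hidden)

    def forward(self, data):  # data: a batched torch_geometric Data object
        h = torch.relu(self.conv1(data.x, data.edge_index))
        h = torch.relu(self.conv2(h, data.edge_index))
        return global_mean_pool(h, data.batch)  # residues -> enzyme embedding
```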
Test
Both GNN and GVP backbones outperformed sequence-only baselines, especially on medium/hard tiers. Among them, GVP showed the most consistent improvements, yielding higher F1 and AUROC, and more stable predictions across rare plastics.
Learn
We learned that while GNNs capture who contacts whom, GVPs go further by encoding how residues are oriented in 3D. This geometric sensitivity not only improved accuracy but also enhanced interpretability, as important catalytic residues were better highlighted. From an engineering perspective, this confirmed that integrating geometric structure is essential for biochemically meaningful models.
Conclusion
By integrating structure-aware backbones—retrieving reliable 3D structures (with AlphaFold as default), representing them as residue contact graphs, and adopting geometry-aware GVP networks—our model achieved its best F1 scores to date, with more stable performance across difficulty tiers. More importantly, it gained biochemical interpretability, as predictions naturally highlighted catalytic residues and structural motifs. At this stage, our framework has matured into its most complete form: accurate, robust, and biologically consistent. However, the project remains highly specialized, and the current technical complexity still sets a high barrier for many intended users in the biological community.
Cycle 5 — User-Friendly Web Application
Goal
To evolve the specialized research framework into a more accessible and user-friendly "one-stop" online platform, making it easier for researchers across diverse backgrounds — including those primarily engaged in wet-lab or experimental work — to explore enzyme–plastic interactions. The goal was to create an intuitive web application that integrates both model prediction and database exploration, offering a seamless, installation-free experience for users worldwide.
Design
The platform's design process was grounded in extensive feedback from its target users — including researchers in microbiology, environmental science, synthetic biology, and chemistry. These discussions revealed a shared need for an integrated "one-stop" online environment that combines predictive modeling with a comprehensive enzyme–plastic database. The guiding design principle was to achieve usability and efficiency without compromising scientific rigor. The platform had to be approachable for users unfamiliar with computational tools, yet powerful enough to meet the standards of professional research.
Build

The platform is deployed on AWS cloud servers to ensure global accessibility and stable performance. It consists of two primary components: the Plaszyme model platform and the PlaszymeDB online database.
The model platform integrates two complementary predictors: PlaszymeAlpha, a sequence-based model for rapid predictions directly from FASTA files, and PlaszymeX, a structure-informed model designed for higher interpretability.
To simplify user interaction, we built an automated backend workflow that integrates colabfold_batch for 3D structure prediction. This design follows the "less is more" principle — users only need to provide sequences once, and the system automatically handles structure generation and model inference. It also supports direct prediction from user-uploaded PDB files when available.
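Under the hood, this step reduces to a queued call of the ColabFold CLI; a sketch assuming colabfold_batch is installed on the server with default flags (output naming varies by ColabFold version):

```python
import subprocess
from pathlib import Path

def predict_structures(fasta_path, out_dir):
    """Run ColabFold on user sequences and return the rank-1 model per query."""
    subprocess.run(["colabfold_batch", fasta_path, out_dir], check=True)
    return sorted(Path(out_dir).glob("*rank_001*.pdb"))
```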
To meet the needs of large-scale research scenarios, the platform also includes a metagenomic analysis module, which employs a custom HMM pipeline to discover potential enzyme candidates from environmental datasets for prediction. All model outputs follow a standardized format, providing degradation probabilities, confidence scores, an interactive 3D molecular viewer (Mol* API), and CSV export options for further analysis.
The PlaszymeDB online database complements the model platform by serving as a centralized, searchable knowledge base. It offers a high-performance search algorithm based on traversal logic, interactive data visualizations, and a locally developed BLAST alignment tool tailored to our curated database. Phylogenetic relationships are dynamically displayed via an embedded iTOL tree, while detailed enzyme pages integrate multiple interactive modules through iframes — including a 3D structure viewer (Mol* API), a chemical formula renderer (Ketcher API), full amino acid sequences, literature citations, and EC annotations derived from both curated references and our DeepEC-based functional predictions.
Test
We conducted a two-stage validation process to ensure both technical correctness and user usability.
(1) Internal functional testing: Using established benchmark datasets, we performed end-to-end predictions directly on the live platform. The system successfully reproduced expected outcomes, verifying the model's accuracy and confirming that the backend workflow operated as intended. Additional simulations were conducted for the metagenomic input pipeline, confirming that large-scale sequence inputs were correctly processed from structure prediction to output generation.
(2) User acceptance testing: We invited several researchers from our earlier user interviews—representing the microbiology, environmental, synthetic biology, and chemistry communities—to test the public platforms (http://plaszyme.org/plaszyme and http://plaszyme.org/plaszymedb). They completed full prediction and database exploration workflows without prior instruction, providing direct usability feedback. Their responses consistently highlighted that the system was intuitive, efficient, and practical for real laboratory workflows, validating that our design successfully met its accessibility goals.
Learn
Throughout this process, we realized that developing an effective scientific tool must begin with understanding the real-world research contexts it is meant to serve—such as the metagenomic analysis of environmental samples and microbial communities.
Continuous dialogue with microbiology, environmental, chemistry, and synthetic biology researchers helped us recognize that accessibility and professionalism must coexist. Their feedback shaped our guiding principle: a platform should be easy to use without oversimplifying the science behind it.
The consistently positive responses to the platform's clarity and smooth workflow validated this approach. Ultimately, we learned that the real value of computational frameworks lies not only in prediction accuracy, but in bridging advanced algorithms with practical usability, enabling researchers to focus on discovery rather than deployment. This experience also revealed the platform's potential to expand toward more specialized applications in the future.