Dataset Preparation and Splitting
Dataset Source and Composition
The predictive dataset was constructed from a curated collection of experimentally verified or literature-supported enzyme–plastic pairs, reflecting the biochemical relationships between hydrolases and their polymer substrates.
Each entry includes a complete amino acid sequence and at least one validated degradation record, ensuring biological completeness and experimental reliability.
Enzyme ID | Sequence | PET | PE | PP | PCL | PHB | PU | PLA | ... |
---|---|---|---|---|---|---|---|---|---|
X001 | AANPYERGPNPTDALLEAR... | 1 | 0 | 0 | 0 | 0 | 0 | 0 | ... |
X002 | AANPYQRGPDPTESLLRAA... | 0 | 0 | 0 | 0 | 0 | 0 | 1 | ... |
X003 | AANPYQRGPNPTEASITAA... | 0 | 0 | 0 | 0 | 1 | 0 | 0 | ... |
X004 | AAQTNAPWGLARISSTSPG... | 0 | 1 | 0 | 0 | 0 | 0 | 0 | ... |
X005 | AYLTPGQSGEFTVKKVADT... | 0 | 0 | 1 | 0 | 0 | 0 | 0 | ... |
Choosing full-length protein sequences as the primary input is both rational and essential. Sequence information inherently encodes the enzyme’s evolutionary constraints and functional determinants, enabling a generalizable biological representation space even in the absence of 3D structural data. This design preserves the evolutionary and functional information required for downstream representation learning (Rives et al., 2021) and provides a foundation for capturing catalytic motifs and co-evolutionary signals relevant to substrate recognition.
The resulting dataset spans a diverse range of hydrolase families and covers more than 30 polymer types (e.g., PET, PE, PP, PCL), providing a biologically interpretable foundation for model training.
Sequence Similarity Guided Dataset Partitioning
To ensure objective evaluation, the dataset was divided into independent training and test subsets guided by global sequence similarity rather than random partitioning.
Five experimentally validated enzymes were first fixed as reference anchors in the test set, providing confirmed ground truth for evaluation.
For the remaining samples, pairwise global alignments were computed using the BLOSUM62 substitution matrix (gap open = –10.0, gap extend = –0.5) to quantify evolutionary similarity. Each sequence’s nearest-neighbor identity to the training pool was then calculated, allowing the algorithm to automatically select test samples under strict constraints that prevent data leakage and overrepresentation:
- (i) No test enzyme shared its closest homolog with the training set;
- (ii) Internal redundancy among test sequences was capped at 0.95 identity;
- (iii) The final subsets maintained approximate balance across enzyme families.
This produced 87 test sequences and 387 training sequences, corresponding to a split ratio of 18.4 : 81.6. Although this partition slightly reduces training data, it ensures statistical independence and biological realism, consistent with benchmark standards such as UniRef50 (Suzek et al., 2015) and TAPE (Rao et al., 2019).
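To make the partitioning procedure concrete, the sketch below shows how nearest-neighbor identity against the training pool could be computed with Biopython's PairwiseAligner, configured with the alignment parameters above; the helper name, the plain-list inputs, and the identity definition (matched columns over alignment length) are illustrative assumptions rather than the exact project code.

from Bio import Align
from Bio.Align import substitution_matrices

aligner = Align.PairwiseAligner()
aligner.mode = "global"
aligner.substitution_matrix = substitution_matrices.load("BLOSUM62")
aligner.open_gap_score = -10.0
aligner.extend_gap_score = -0.5

def nearest_neighbor_identity(query, training_pool):
    """Highest global-alignment identity between a query and any training sequence (sketch)."""
    best = 0.0
    for ref in training_pool:
        aln = aligner.align(query, ref)[0]      # best-scoring global alignment
        rows = str(aln[0]), str(aln[1])         # aligned rows, gaps included
        ident = sum(a == b for a, b in zip(*rows)) / len(rows[0])
        best = max(best, ident)
    return best

Test candidates are then admitted only while their nearest-neighbor identity stays below the leakage threshold and intra-test redundancy remains under 0.95.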
Benchmark Release and Accessibility
All dataset splits used in model training and evaluation are publicly available under the project's Zenodo record.
Within the repository, the splits/ directory contains the finalized training and test partitions, along with metadata tables specifying sequence identifiers, polymer associations, and difficulty tiers (Easy/Medium/Hard).
This release follows the FAIR data principles, ensuring that the benchmark is Findable, Accessible, Interoperable, and Reusable, and allowing other researchers to fully reproduce or extend the Plaszyme predictor pipeline.
Sequence-based Feature Extraction
After curating and standardizing enzyme–polymer pairs, the first step in predictive modeling was to encode enzyme sequences into quantitative representations suitable for machine learning.
Because amino acid sequences inherently contain rich biological and evolutionary information, we explored three complementary embedding strategies — ranging from symbolic encodings to deep contextual embeddings — to evaluate how representation depth affects downstream model performance.

One-Hot Encoding (Baseline Representation)
As the most interpretable and architecture-independent baseline, each enzyme sequence was represented by a 21-dimensional one-hot vector per residue, corresponding to the 20 canonical amino acids plus one token for unknown residues (“X”).
This discrete representation preserves exact sequence identity but lacks biochemical interpretability or evolutionary context. Nevertheless, one-hot encoding provides a strong non-parametric control, allowing fair comparison with more sophisticated embeddings (Hinton, 1984; Mikolov et al., 2013).
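A minimal sketch of this per-residue encoding (the alphabet ordering and helper name are illustrative choices):

import numpy as np

AA = "ACDEFGHIKLMNPQRSTVWY"                 # 20 canonical residues
IDX = {aa: i for i, aa in enumerate(AA)}

def one_hot_encode(seq):
    """Encode a protein sequence as an (L, 21) one-hot matrix; column 20 is the unknown token "X"."""
    mat = np.zeros((len(seq), 21), dtype=np.float32)
    for i, aa in enumerate(seq):
        mat[i, IDX.get(aa, 20)] = 1.0
    return mat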
Physicochemical Embedding (Feature-based Representation)
To incorporate biochemical meaning, we extracted physicochemical property vectors using the Peptides.py library (Osorio, 2020).
This package computes a diverse set of quantitative structure–activity relationship (QSAR) descriptors, including Kidera factors, Atchley factors, VHSE, and Z-scales, each summarizing amino acid properties such as hydrophobicity, polarity, charge, and molecular volume. For each enzyme, residue-level descriptors were averaged to form a global feature vector:

$$\bar{\mathbf{f}} = \frac{1}{L} \sum_{i=1}^{L} \mathbf{f}_i$$

where $\mathbf{f}_i$ denotes the descriptor vector for residue $i$ and $L$ is the sequence length.
This representation preserves biochemical interpretability and facilitates direct integration with traditional machine learning models, such as Random Forest and XGBoost.
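As a sketch, a per-enzyme descriptor vector can be assembled directly with Peptides.py, assuming the library's descriptors() method returns the residue-averaged QSAR scales described above:

import numpy as np
import peptides

def physicochemical_vector(sequence):
    """Global physicochemical feature vector for one enzyme (sketch)."""
    desc = peptides.Peptide(sequence).descriptors()   # Kidera factors, VHSE, Z-scales, ...
    return np.array(list(desc.values()), dtype=np.float32)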
Protein Language Model Embedding (ESM-1b Representation)
Finally, we adopted Evolutionary Scale Modeling (ESM-1b) — a large-scale protein language model with 650M parameters (Rives et al., 2021) — to derive contextualized embeddings that capture residue-level co-evolution patterns learned from >250 million UniProt sequences. Each enzyme sequence was tokenized using the ESM alphabet and passed through the 33-layer Transformer.
We extracted the final hidden representation for each residue and performed mean pooling to obtain a fixed-size embedding vector:
import torch, esm
model, alphabet = esm.pretrained.esm1b_t33_650M_UR50D()   # 33-layer, 650M-parameter ESM-1b
_, _, toks = alphabet.get_batch_converter()([("enzyme", sequence)])   # sequence: amino acid string
out = model(toks, repr_layers=[33])
rep = out["representations"][33]
embedding = rep.mean(dim=1)   # mean pooling over residues -> (1, 1280)
This procedure yields a 1280-dimensional embedding per protein, encoding both local sequence context and global evolutionary constraints. Compared with one-hot and physicochemical features, ESM embeddings significantly enhance representation richness and transferability, providing the foundation for downstream structure-aware learning.
Together, these three embedding schemes form a hierarchical feature pipeline — from symbolic to physicochemical to deep contextual levels — enabling systematic evaluation of how representation granularity influences model generalization and interpretability.
Machine Learning Multi-Classifier Backbone
After obtaining numerical representations for enzyme sequences through diverse embedding strategies — from one-hot encodings to deep contextual embeddings — the next logical step was to construct a predictive framework that directly connects sequence features to plastic degradability outcomes.
The most straightforward formulation of this task is to treat plastic type as the target label, thereby enabling supervised classification: each enzyme sequence corresponds to one or multiple degradable polymers. This approach provides an interpretable baseline for mapping protein sequence information to biochemical function and forms the foundation for subsequent structure-aware learning stages.
Model Architecture and Data Flow

After generating sequence embeddings, we implemented a multi-class classification framework to predict enzyme–polymer degradation relationships directly from sequence-derived representations. This framework bridges protein language modeling and classical machine learning, providing an interpretable yet high-performance baseline for subsequent structure-aware models.
Each enzyme sequence was transformed into a 1280-dimensional embedding using the ESM-1b protein language model (Rives et al., 2021). A mean pooling operation was applied to aggregate residue-level embeddings into a fixed-length representation suitable for supervised learning. These embeddings were used to predict the degradability profile across multiple polymer types (e.g., PET, PCL, PHB, PLA).

Effect of SMOTE Oversampling on Class Balance.
Comparison of confusion matrices before and after applying SMOTE. After oversampling, class distribution becomes more uniform, reducing the dominance of abundant categories such as PET and improving the overall balance of predictions.
To address class imbalance commonly observed in biological datasets, we employed the Synthetic Minority Oversampling Technique (SMOTE) (Chawla et al., 2002), which generates synthetic samples by interpolating between neighboring points of minority classes in the embedding space. This strategy expands under-represented categories without simple duplication, allowing the model to better capture rare degradation functions.
The processed vectors were then fed into a Histogram-based Gradient Boosting (HGB) classifier, which captures non-linear interactions via additive ensemble learning:

$$\hat{y} = \arg\max_{c \in \{1, \dots, C\}} F_c(\mathbf{z})$$

where $C$ denotes the number of polymer classes and $\mathbf{z}$ is the mean-pooled embedding vector (Ke et al., 2017).
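A condensed sketch of this classification pipeline with scikit-learn and imbalanced-learn (X_train, y_train, and X_test are placeholder arrays of mean-pooled embeddings and polymer labels):

from imblearn.over_sampling import SMOTE
from sklearn.ensemble import HistGradientBoostingClassifier

# Oversample minority polymer classes in embedding space, then fit the HGB classifier
X_res, y_res = SMOTE(random_state=42).fit_resample(X_train, y_train)
clf = HistGradientBoostingClassifier(random_state=42)
clf.fit(X_res, y_res)
probs = clf.predict_proba(X_test)   # per-class scores, usable for Top-k ranking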
This supervised multi-class baseline is referred to as PlaszymeAlpha, our sequence-based machine learning model designed for enzyme–plastic classification:
Model Repository: GitLab – PlaszymeAlpha
Structural Graph Construction
Understanding enzymatic plastic degradation requires not only sequence-level information but also a representation that captures spatial organization and physicochemical interactions. While sequence embeddings encode residue identity and contextual dependencies, they lack the geometric topology that governs catalysis and substrate binding. To overcome this limitation, we adopted a graph-based structural representation — a mathematical abstraction well-suited for molecular systems and widely used in protein representation learning (Gainza et al., 2020; Gligorijević et al., 2021).
Graph

For illustration, a 4-node undirected graph can be represented by an adjacency matrix $A$; a sparse chain and a fully connected graph yield visibly different adjacency patterns.
A graph is a mathematical structure defined as:

$$G = (V, E)$$

where $V$ is the set of nodes and $E$ is the set of edges representing relationships between nodes.
Each node $v_i \in V$ is described by a feature vector $\mathbf{x}_i$, which encodes local or global properties, and all pairwise connections can be summarized by an adjacency matrix $A$, with $A_{ij} = 1$ if $(v_i, v_j) \in E$ and $A_{ij} = 0$ otherwise.
This formulation provides a generic, domain-independent framework for representing systems with interacting elements—ranging from molecular residues to social networks.
In the context of proteins, nodes correspond to amino acid residues, while edges denote their spatial or functional relationships.
Such a representation enables non-Euclidean learning, allowing neural networks to capture topological and relational structure beyond fixed sequence order or Cartesian coordinates.
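As a toy example, the adjacency matrices of the 4-node chain and the 4-node fully connected graph mentioned above can be written out directly:

import numpy as np

# Sparse chain: edges (0-1), (1-2), (2-3)
A_chain = np.zeros((4, 4), dtype=int)
for i, j in [(0, 1), (1, 2), (2, 3)]:
    A_chain[i, j] = A_chain[j, i] = 1   # undirected, so symmetric

# Fully connected: every off-diagonal entry is 1
A_full = np.ones((4, 4), dtype=int) - np.eye(4, dtype=int)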
Structure Source and Reliability
Since most enzymes in our dataset lack experimentally resolved structures, we adopted a structure prediction approach to bridge the gap between sequence and spatial representation.
Among available structure prediction methods, we selected AlphaFold 2 — currently the most reliable and widely validated protein structure prediction system (Jumper et al., 2021).
All protein structures were either obtained or predicted via AlphaFold 2, executed through ColabFold (Mirdita et al., 2022).
AlphaFold predicts atomic coordinates by reconstructing inter-residue distance and orientation matrices learned from large-scale multiple sequence alignments (MSAs).

pLDDT-colored predicted enzyme structure.
The predominance of deep blue regions (mean pLDDT = 97.7) indicates a highly confident and accurate structural prediction.
Because these distance maps directly reflect physical residue contacts, they can be confidently converted into adjacency matrices for graph construction — a principle supported by multiple benchmark studies (Senior et al., 2020; Baek et al., 2021).
Protein Graph Representation

Protein Graph Representation.
A local region of the enzyme is magnified to illustrate the graph construction principle: residues are represented as nodes (Cα atoms), and edges are formed when Cα–Cα distances are below 10 Å.
To incorporate structural information into the learning process, each enzyme was represented as a graph (G) that jointly encodes spatial topology and sequence-derived features.
Formally, a protein graph can be expressed as:

$$G = (V, X, A)$$

where $V$ denotes the set of nodes (amino acid residues), each associated with a feature vector $\mathbf{x}_i \in X$, and $A$ is the adjacency matrix describing pairwise residue interactions.
An edge is defined between residues $i$ and $j$ if the distance between their Cα atoms is less than 10 Å:

$$A_{ij} = \begin{cases} 1, & \lVert \mathbf{r}_i - \mathbf{r}_j \rVert_2 < 10\,\text{Å} \\ 0, & \text{otherwise} \end{cases}$$

where $\mathbf{r}_i$ is the Cα coordinate of residue $i$.
The Cα atom was selected as the structural reference because it provides a stable and universal representation of the protein backbone, present in all amino acids (including glycine, which still defines a Cα position).
Using Cα–Cα distances avoids side-chain noise and yields a coarse-grained yet faithful approximation of the protein’s overall fold — a convention widely adopted in protein graph construction and structural learning frameworks (Gainza et al., 2020; Gligorijević et al., 2021). The 10 Å cutoff reflects the average spatial range of non-covalent interactions such as hydrogen bonding and van der Waals contacts, balancing graph connectivity and biological realism.
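A minimal sketch of this construction, assuming the Cα coordinates have already been parsed from an AlphaFold-predicted PDB file into an (N, 3) NumPy array:

import numpy as np

def contact_edges(ca_coords, cutoff=10.0):
    """Edge list from Cα coordinates: connect residue pairs closer than the cutoff."""
    dist = np.linalg.norm(ca_coords[:, None, :] - ca_coords[None, :, :], axis=-1)
    adj = (dist < cutoff) & ~np.eye(len(ca_coords), dtype=bool)   # drop self-loops
    src, dst = np.nonzero(adj)
    return np.stack([src, dst])   # (2, E), the edge_index convention used by PyG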
Importantly, the graph formulation aligns naturally with protein structure, as residues interact locally yet form globally connected networks — making graph neural networks (GNNs) an ideal framework for capturing both local biochemical environments and long-range structural dependencies within enzymes.
The node features were derived from the protein language model ESM-2 (Lin et al., 2023), which encodes residue-level sequence context, evolutionary constraints, and implicit structural priors, forming a feature matrix $X \in \mathbb{R}^{N \times d}$.
By combining AlphaFold-predicted 3D coordinates (Jumper et al., 2021; Mirdita et al., 2022) with ESM embeddings, we constructed a structure-aware protein graph that integrates sequence semantics and spatial geometry.
Such graph representations have proven effective in modeling protein function, binding sites, and enzyme activity (Senior et al., 2020; Gligorijević et al., 2021; Gainza et al., 2020), providing a biologically grounded basis for downstream GNN learning.
GNN-based Protein Backbone
Having represented each enzyme as a residue-level contact graph derived from its 3D structure, the next step is to learn from this representation effectively.
Graph neural networks (GNNs) naturally fit this paradigm — they operate directly on non-Euclidean structures, aggregating local chemical environments while propagating signals across long-range residue contacts, both of which are crucial for understanding catalysis and substrate recognition.
GNN-based Backbone
To implement this framework, we employed the PyTorch Geometric (PyG) library, which provides well-established molecular operators such as Graph Convolutional Networks (GCN) and Graph Attention Networks (GAT).
Our backbone supports multiple architectures (GCN, GAT, GraphSAGE, GIN, GINE), enabling systematic comparison of how different graph propagation mechanisms affect model performance and biological interpretability.
A minimal PyG implementation for our protein graph backbone is shown below:
import torch
from torch_geometric.nn import GCNConv, global_mean_pool

class ProteinGNN(torch.nn.Module):
    def __init__(self, in_dim=1280, hidden=256, out_dim=64):
        super().__init__()
        self.conv1 = GCNConv(in_dim, hidden)    # message passing over residue contact edges
        self.conv2 = GCNConv(hidden, out_dim)
        self.act = torch.nn.ReLU()

    def forward(self, x, edge_index, batch):
        x = self.act(self.conv1(x, edge_index))
        x = self.act(self.conv2(x, edge_index))
        return global_mean_pool(x, batch)       # mean-pool nodes -> graph-level embedding
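For illustration, a single protein graph can be assembled and embedded as follows; the feature tensor and coordinate array are placeholders, and contact_edges refers to the sketch above:

import torch

x = torch.randn(120, 1280)                        # placeholder: 120 residues with ESM features
edge_index = torch.as_tensor(contact_edges(ca_coords), dtype=torch.long)
batch = torch.zeros(x.size(0), dtype=torch.long)  # single graph: all nodes in graph 0
model = ProteinGNN()
protein_vec = model(x, edge_index, batch)         # (1, 64) protein-level embedding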
GVP-enhanced Structural Encoding
While standard GNNs effectively capture the topology of residue connections, they treat edges as simple links and may overlook geometric directionality.
To better encode 3D orientation and vectorial relationships, we also experimented with the Geometric Vector Perceptron (GVP) framework (Jing et al., 2021), an extension of GNNs specifically designed for protein structures.
GVP introduces both scalar features (e.g., residue identity, hydrophobicity) and vector features that describe 3D orientations between residues, allowing the model to directly reason over spatial geometry.
In practical terms, GVP enables the network to “see” not only which residues interact but also how they are positioned in space — a key aspect for modeling active sites, binding pockets, and structural flexibility.
This enriched representation is particularly suitable for enzyme–substrate systems, where both chemistry and conformation jointly determine catalytic performance.
Plastic Feature Extraction and Polymer Optimization
In previous sections, plastic type was treated merely as a categorical label in enzyme classification. However, this approach ignores the chemical diversity and structural complexity among polymers that determine their degradability.
To achieve a more interpretable and biologically grounded prediction framework, polymer information itself was incorporated into the model. Instead of classifying plastics as abstract labels, their molecular properties were explicitly quantified, allowing the system to learn how enzyme behavior varies with substrate chemistry.
Molecular Descriptors
Plastic polymers were numerically characterized using the RDKit cheminformatics toolkit (Landrum, 2016), extended with a custom feature extraction module plastic_featurizer.py.
The extractor computes over 200 physicochemical descriptors, covering multiple categories of chemical information that capture both global and local molecular properties.

- Standard Descriptors: include molecular weight, polar surface area (PSA), number of rotatable bonds, and lipophilicity (logP), describing general physicochemical properties of polymers.
- Fragment-based Descriptors: count the occurrence of functional groups such as esters, amides, ethers, and aromatic rings, which are chemically relevant to polymer hydrolytic reactivity and degradation susceptibility.
- Charge Descriptors: derived from Gasteiger charge calculations, including maximum, minimum, and absolute charge values, reflecting electron distribution and polarization along the polymer backbone.
This descriptor-based approach provides a chemically interpretable representation, bridging machine learning with polymer chemistry. It allows downstream models to infer polymer degradability from quantitative structural patterns, rather than relying on abstract categorical labels.
Density Normalization

Directly applying raw molecular descriptors to polymers introduces scale bias, since extensive properties such as molecular weight, surface area, or atom count dominate the representation.
To mitigate this effect, all extensive features were normalized by molecular size, generating density descriptors:

$$f_k^{\text{density}} = \frac{f_k}{N_{\text{heavy}}}$$

where $N_{\text{heavy}}$ denotes the number of non-hydrogen atoms in the molecule.
This normalization strategy was deliberately chosen instead of molecular weight normalization because heavy atom count provides a more chemically meaningful and size-independent basis.
Unlike molecular weight, which fluctuates with hydrogen content and polymer chain length, heavy atom count reflects the true backbone complexity—the number of atoms contributing to bonding topology and potential reaction sites (C, N, O, S, etc.).
As a result, the normalization removes chain-length bias while preserving the structural essence of the repeating unit.
Features that are already ratio-based (e.g., FractionCsp³, logP) or intrinsically intensive were excluded from normalization.
After normalization, descriptor distributions became more compact and comparable across polymers, highlighting the chemical essence of repeating units that governs polymer reactivity and degradation kinetics.
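The sketch below reproduces a small, illustrative subset of these descriptor families with plain RDKit calls, together with the heavy-atom density normalization; the helper mirrors, but is not, the project's plastic_featurizer.py:

from rdkit import Chem
from rdkit.Chem import Descriptors, Fragments
from rdkit.Chem.rdPartialCharges import ComputeGasteigerCharges

def featurize_repeat_unit(smiles):
    """Illustrative descriptor subset for one polymer repeat unit (sketch)."""
    mol = Chem.MolFromSmiles(smiles)
    ComputeGasteigerCharges(mol)
    charges = [a.GetDoubleProp("_GasteigerCharge") for a in mol.GetAtoms()]
    heavy = mol.GetNumHeavyAtoms()                  # non-hydrogen atom count
    return {
        "MolWt": Descriptors.MolWt(mol),
        "TPSA": Descriptors.TPSA(mol),
        "MolLogP": Descriptors.MolLogP(mol),        # intensive: left unnormalized
        "NumRotatableBonds": Descriptors.NumRotatableBonds(mol),
        "fr_ester": Fragments.fr_ester(mol),
        "MaxGasteigerCharge": max(charges),
        "fr_esterDensity": Fragments.fr_ester(mol) / heavy,   # density-normalized count
    }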
Feature Optimization and Validation

To assess feature redundancy and intrinsic structure, Principal Component Analysis (PCA) was applied to the normalized descriptor matrix (Jolliffe & Cadima, 2016).
The first few principal components (PCs) explained over 85% of the total variance, indicating that the extracted features effectively captured key chemical dimensions such as hydrophobicity, polarity, and electronic distribution (Todeschini & Consonni, 2009).
Notably, polymers with similar chemical backbones exhibited consistent spatial clustering in the reduced feature space:
the PHA family (e.g., PHB, PHBV, PHA) formed a tight cluster,
PCL, PEG, and PHO grouped closely due to their flexible aliphatic chains,
while PET and PLA appeared adjacent — consistent with their shared ester linkages and partially crystalline nature (Iannace & Nicolais, 1997).
These clustering patterns confirm that the density-normalized descriptor representation preserves chemically meaningful relationships, providing a robust foundation for downstream learning.
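A minimal sketch of this check with scikit-learn, assuming X holds the density-normalized descriptor matrix (rows = polymers, columns = descriptors):

import numpy as np
from sklearn.decomposition import PCA

X_std = (X - X.mean(axis=0)) / X.std(axis=0)      # z-score each descriptor first
pca = PCA(n_components=5)
coords = pca.fit_transform(X_std)                 # 2D/3D slices of coords give the cluster maps
print(np.cumsum(pca.explained_variance_ratio_))   # first few PCs should exceed 0.85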
Dual-tower Architecture and Interaction Head
To jointly model enzymes and plastics, we designed a dual-tower framework that learns from both biological (protein) and chemical (polymer) representations in a unified latent space.
This design draws on the Siamese network paradigm originally proposed by Bromley et al. (1993) and later extended in contrastive representation learning (Hadsell et al., 2006), enabling comparable embeddings from distinct molecular modalities.
This architecture enables flexible interaction modeling and interpretable prediction of enzyme–polymer degradation compatibility.
Architecture Overview

Each tower encodes one molecular modality (protein or polymer):
- The enzyme tower transforms 3D structural graphs (GNN / GVP / ESM embeddings) into vector embeddings $\mathbf{h}_E$.
- The polymer tower projects RDKit-based descriptor vectors into embeddings $\mathbf{h}_P$.
This cross-domain dual-tower framework, integrating sequence and structure representations, is referred to as PlaszymeX:
Model Repository: GitLab – PlaszymeX
Shared Embedding Space
Both are projected into a shared embedding space ($\mathbb{R}^d$) via the TwinProjector module:

$$\mathbf{z}_E = f_E(\mathbf{h}_E), \qquad \mathbf{z}_P = f_P(\mathbf{h}_P)$$

where $\mathbf{z}_E, \mathbf{z}_P \in \mathbb{R}^d$ lie in the shared embedding space.
Interaction Heads
After projecting enzyme and plastic embeddings into the same latent space $\mathbb{R}^d$, their interaction is computed through a scoring function $s(\mathbf{z}_E, \mathbf{z}_P)$, which measures how likely an enzyme–polymer pair represents a degradative relationship.
Cosine Interaction
This baseline measures directional similarity, assuming enzymatic affinity correlates with vector alignment:

$$s_{\cos} = \frac{\mathbf{z}_E^\top \mathbf{z}_P}{\lVert \mathbf{z}_E \rVert \, \lVert \mathbf{z}_P \rVert}$$

Bilinear Interaction
A learnable bilinear matrix $W$ captures pairwise dependencies between feature dimensions, $s_{\text{bil}} = \mathbf{z}_E^\top W \mathbf{z}_P$, allowing richer cross-feature coupling than cosine similarity (Gao et al., 2019).
Factorized Bilinear Interaction
The element-wise (Hadamard) product fuses local feature correspondences, $s_{\text{fac}} = \mathrm{MLP}(\mathbf{z}_E \odot \mathbf{z}_P)$, while the multilayer perceptron (MLP) introduces nonlinear cross-feature interactions.
Gated Interaction
A gating vector dynamically controls the contribution of each feature channel, enabling conditional modulation depending on enzyme–polymer context (Gao et al., 2019).
These interaction mechanisms range from interpretable geometric similarity to nonlinear adaptive gating, allowing the system to balance explainability and expressiveness.
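Minimal PyTorch sketches of three of these heads; dimensions, initialization, and module names are illustrative rather than the exact PlaszymeX implementations:

import torch
import torch.nn as nn
import torch.nn.functional as F

def cosine_head(z_e, z_p):
    """Directional-similarity baseline."""
    return F.cosine_similarity(z_e, z_p, dim=-1)

class BilinearHead(nn.Module):
    """s = z_e^T W z_p with a learnable coupling matrix W."""
    def __init__(self, dim=128):
        super().__init__()
        self.W = nn.Parameter(torch.randn(dim, dim) * dim ** -0.5)
    def forward(self, z_e, z_p):
        return torch.einsum("bd,de,be->b", z_e, self.W, z_p)

class FactorizedBilinearHead(nn.Module):
    """MLP over the Hadamard product of the two embeddings."""
    def __init__(self, dim=128, hidden=64):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(dim, hidden), nn.ReLU(), nn.Linear(hidden, 1))
    def forward(self, z_e, z_p):
        return self.mlp(z_e * z_p).squeeze(-1)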
Loss Computation
Once the enzyme and polymer embeddings are mapped into the shared space and their interaction scores are computed, the model is trained to distinguish degradable from non-degradable pairs.
To optimize the interaction between enzymes and polymers, the model employs a listwise ranking objective rather than a simple pairwise contrastive loss.
For each enzyme, a list of candidate polymers is evaluated simultaneously, producing a set of compatibility scores $\{s_1, s_2, \dots, s_L\}$.
Listwise InfoNCE Objective
The model is trained with a multi-positive InfoNCE loss, which encourages the embeddings of degradable pairs to cluster closely while pushing non-degradable ones apart.
Formally, for a list of $L$ candidate polymers with positive (degradable) set $P$, the loss is defined as:

$$\mathcal{L} = -\frac{1}{|P|} \sum_{i \in P} \log \frac{\exp(s_i / \tau)}{\sum_{j=1}^{L} \exp(s_j / \tau)}$$

where $\tau$ is a temperature coefficient controlling the sharpness of the ranking distribution.
This formulation extends the traditional InfoNCE loss to multi-positive ranking, aligning with listwise learning objectives widely used in retrieval and recommender systems (Cao et al., 2007; Xie et al., 2022).
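A compact sketch of this objective for a single enzyme's candidate list (tensor shapes are assumptions):

import torch
import torch.nn.functional as F

def listwise_infonce(scores, pos_mask, tau=0.1):
    """Multi-positive InfoNCE: scores (L,), pos_mask (L,) True at degradable polymers."""
    log_probs = F.log_softmax(scores / tau, dim=-1)
    return -log_probs[pos_mask].mean()   # average over the positive set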
Intuitive Interpretation
Intuitively, the objective encourages the model to assign higher scores to degradable substrates and lower scores to non-degradable ones, thus maximizing the separation between positive and negative samples within each enzyme’s local list.
This listwise formulation reflects real biochemical scenarios — enzymes often act on a set of structurally similar polymers, and the model must learn relative preferences rather than absolute binary outcomes.
Evaluation and Inference
At inference, the trained model computes a compatibility score matrix $S \in \mathbb{R}^{N_E \times N_P}$ over all enzyme–polymer pairs, enabling:
- Top-K retrieval: ranking candidate polymers for each enzyme.
- Compatibility thresholding: classifying degradable vs. non-degradable pairs.
- Cross-domain analysis: visualizing enzyme–plastic clustering in shared space.
This provides not only predictive capability but also interpretable cross-domain embeddings useful for mechanistic analysis and enzyme discovery.
Model Training and Evaluation
ML Training and Evaluation
To systematically improve enzyme–plastic degradability prediction, the model underwent three progressive training stages, each introducing more biologically informative representations.
All experiments were conducted under consistent evaluation settings using Top-k Hit and Micro-F1 metrics across three difficulty buckets (Easy, Medium, Hard).
Training stability, convergence patterns, and generalization were continuously monitored during optimization.
Training Configuration
All models were trained on the same dataset partition, consisting of 387 training and 87 testing enzyme samples.
The test set was stratified into Easy, Medium, and Hard buckets according to sequence identity thresholds of 0.9 and 0.7.
Plastic type served as the supervision signal for multi-class classification.
Training employed Histogram-based Gradient Boosting (HGB) classifiers implemented in scikit-learn, with SMOTE oversampling to alleviate class imbalance.
Each model used a fixed random seed (42) to ensure reproducibility.
Performance was evaluated using Top-k Hit Rate (Hit@k) and Micro-F1, both reported per difficulty bucket and overall.
One-hot Encoding Model
Training utilized the same HGB framework with SMOTE balancing. Dropout (0.1) was applied to the upstream embeddings to enhance generalization. The model reached stable convergence after approximately 25 rounds.


Peptides Descriptor Embeddings
To incorporate biochemical semantics, the second stage utilized 102-dimensional physicochemical descriptors derived from the Peptides library.
These descriptors included aliphatic index, Boman index, hydrophobicity scales, and isoelectric properties, providing a richer and more interpretable biological representation.
Input features were standardized via z-score normalization.
The model adopted an early-stopping strategy with a reduced learning rate, maintaining training stability while preventing overfitting.


Performance improved across all buckets, especially on simpler data.
The Easy bucket achieved Hit@1 = 0.81 and Micro-F1 = 0.61, while the overall Micro-F1 rose to 0.43.
Although high-complexity samples remained challenging (Hard F1 = 0.20), the inclusion of biochemical descriptors enhanced robustness and interpretability compared to one-hot encoding.
ESM Contextual Representations
In the third training stage, handcrafted physicochemical descriptors were replaced with deep contextual embeddings from the ESM family of protein language models (Rives et al., 2021; Lin et al., 2023). This transition marks a shift from static, manually designed features toward context-aware representations learned directly from raw sequences.
Each amino acid residue was encoded as a 1280-dimensional embedding using the Transformer-based ESM architecture pretrained on the UR50D database. Residue-level embeddings were aggregated by mean pooling to produce a fixed-size vector per sequence, effectively capturing both evolutionary and structural-context dependencies across residues.
The ESM embeddings were extracted using the following configuration:
import torch, esm

def get_embeddings(sequences, batch_size=4):
    """ESM contextual embedding (mean pooling per sequence)"""
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    model, alphabet = esm.pretrained.esm2_t33_650M_UR50D()
    model.eval().to(device)
    batch_converter = alphabet.get_batch_converter()
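    # Completion sketch (assumed): batch the sequences, run layer-33 inference,
    # mean-pool residue representations, and return one vector per sequence.
    reps = []
    for i in range(0, len(sequences), batch_size):
        batch = [(str(j), s) for j, s in enumerate(sequences[i:i + batch_size])]
        _, _, toks = batch_converter(batch)
        with torch.no_grad():
            out = model(toks.to(device), repr_layers=[33])
        reps.append(out["representations"][33].mean(dim=1).cpu())
    return torch.cat(reps)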
Here, the pretrained weights are automatically loaded from the esm2_t33_650M_UR50D checkpoint, which contains 33 Transformer layers and 650 million parameters, offering high expressiveness while maintaining reasonable computational cost. The extracted mean-pooled embeddings were subsequently passed to the downstream classifier for degradability prediction.
Training followed the same Histogram-based Gradient Boosting (HGB) pipeline used in previous stages, with SMOTE oversampling to balance label distribution. A dropout rate of 0.1 was applied to upstream embeddings to prevent overfitting and enhance generalization. The model achieved stable convergence after approximately 25 iterations, with smooth validation curves and minimal variance across folds.


The ESM-based model exhibited the strongest overall performance among all variants. In the Easy bucket, it achieved Hit@1 = 0.89 and Micro-F1 = 0.65, surpassing both one-hot and descriptor-based models. Performance on Medium and All samples also improved markedly, confirming the superior generalization and expressiveness of contextual embeddings.
Although prediction accuracy on the Hard bucket remained modest, the consistent upward trend across all difficulty levels demonstrated that contextual embeddings effectively bridge sequence variation and functional similarity, enabling the model to infer degradation potential even from distantly related enzymes.
Overall Trends
Across all stages, performance improvements followed the hierarchy:
One-hot < Physicochemical descriptors < ESM embeddings
This progression highlights the growing representational richness—from raw residue identity to biologically informed properties, and finally to contextualized sequence semantics. Models trained with SMOTE achieved better recall for underrepresented plastics without severe overfitting. Performance differences across Easy, Medium, and Hard buckets mirrored biological expectations: enzymes sharing higher sequence identity were more predictable, while distant homologs required deeper contextual understanding.
Dual-tower Training and Evaluation
To systematically evaluate the dual-tower framework, a series of controlled experiments were conducted focusing on interaction mechanisms, listwise sampling strategies, and backbone architectures. The goal was to determine the optimal configuration that balances performance, training stability, and biological interpretability.
Indicator Interpretation
Loss Curve
Indicates overall optimization stability. A smooth and monotonically decreasing curve suggests steady convergence, while oscillations or plateaus may imply overfitting or gradient instability. Models with faster and more stable loss decay generally exhibit stronger learning efficiency.
Hit@k Curve
Measures ranking quality: the proportion of true degradable plastics appearing in the top k predictions. A higher Hit@1 or Hit@3 reflects better ranking precision and retrieval reliability, especially relevant for prioritizing enzyme–substrate pairs in experimental screening.
Score Separation
Plots the mean predicted scores for positive (degradable) versus negative (non-degradable) samples. A larger positive–negative gap indicates that the model successfully separates degradable interactions from irrelevant pairs, reflecting discriminative embedding quality.
Interaction Head Comparison




Across all formulations, convergence was achieved within approximately 60 epochs, as shown by the synchronized stabilization of loss and score curves.
The cosine interaction, though the simplest mathematically, exhibited the most stable convergence behavior and consistent performance across all metrics, with minimal fluctuations in both loss and Hit@k.
The bilinear and Hadamard-MLP variants displayed faster early convergence but suffered from greater late-stage variance, suggesting sensitivity to parameter scaling and data imbalance.
The factorized bilinear approach provided moderate performance, balancing model size and feature coupling but lacking robustness under small-sample conditions.
The gated interaction achieved the best overall ranking accuracy, maintaining smooth loss decay and the widest score separation, demonstrating its advantage in adaptively modulating enzyme–polymer feature channels.
Backbone Comparison
To evaluate the impact of structural encoding strategies, two protein backbones — Graph Convolutional Network (GCN) and Geometric Vector Perceptron (GVP) — were compared under identical data and training conditions.
Both models processed residue-level protein graphs derived from Cα–Cα contacts (<10 Å) and shared the same polymer tower, projection layers, and gated interaction head, ensuring that observed performance differences originated solely from the backbone design.


Training stability and convergence were tracked through loss, Hit@k, and score curves, analogous to the interaction head evaluation.
However, the GVP-based model, despite its richer geometric parameterization and ability to encode vectorial features, did not outperform the simpler GCN-based backbone in this specific enzyme–polymer task.
The additional geometric channels introduced by GVP increased the model complexity and parameter count, which, under a relatively limited dataset size, led to slower convergence and slight overfitting tendencies in later epochs.
In contrast, the GCN backbone maintained lower variance and faster stabilization, producing more consistent ranking scores across validation folds.
Quantitatively, both architectures achieved comparable Top-1 Hit and Micro-F1 metrics, with GCN slightly ahead in stability and generalization.
This result suggests that for moderate-scale biochemical datasets, the simpler scalar-based message passing of GCN suffices to capture residue connectivity and chemical context, whereas vector-aware extensions like GVP may yield marginal benefits only when trained on larger and more structurally diverse datasets.
Effect of Listwise Sampling
The listwise sampling length (L) — representing the number of candidate polymers presented per enzyme during training — was a key hyperparameter affecting convergence and generalization.
Empirical testing across L ∈ {5, 10, 20, 30} revealed that a moderate list size (≈10) achieved the best overall performance:
- Shorter lists (L < 5) limited the model’s ability to learn robust ranking relationships, reducing contrast diversity.
- Longer lists (L > 20) increased gradient noise and slowed convergence without clear gains in retrieval precision.
Thus, maintaining a compact but diverse candidate set allowed the model to efficiently capture relative preferences while preserving training stability.
Summary
The progressive training strategy — from classical machine learning (ML) baselines to structure-aware dual-tower models — collectively illustrates the evolution of enzyme–plastic degradability prediction from sequence-level classification to representation-level interaction modeling.
In the ML phase, performance steadily improved as feature representations became more biologically grounded: simple one-hot encodings provided basic residue identity, physicochemical descriptors introduced interpretable biochemical properties, and ESM contextual embeddings captured deeper evolutionary and structural semantics. This progression reflects the growing integration of biological knowledge into data-driven learning, ultimately enhancing both predictive accuracy and generalization.
In the dual-tower stage, the model transitioned from learning independent enzyme features to understanding cross-domain relationships between enzymes and polymers. Extensive experiments on interaction heads, backbone architectures, and listwise ranking strategies demonstrated that architectural complexity does not guarantee superior performance. While advanced modules such as GVP and high-rank interaction heads offer theoretical expressiveness, the GCN + gated head configuration achieved the best balance between training stability, interpretability, and ranking precision.
Overall, these findings underscore two key insights for biochemical deep learning:
- Representation quality—the ability to encode biologically meaningful structure and context—is more critical than model depth or parameter count;
- Architectural parsimony, when aligned with biological constraints and dataset scale, can lead to more stable, interpretable, and generalizable models for enzyme–polymer prediction.
Interpretability and Feature Analysis
Understanding why a model makes a certain prediction is essential for validating its biological relevance, improving scientific transparency, and guiding the rational design of enzymes and polymers.
Interpretability analysis aims to trace back the model’s internal decision process, identifying which features, representations, or interactions most influence its final output.
Rather than treating the model as a black box, this approach allows a mechanistic understanding of what the model has truly learned — whether it aligns with known biochemical mechanisms or reveals new, data-driven insights.
Descriptor Importance in Polymer Representation

High-importance features such as TPSA, hydrogen-bond donors/acceptors, and ester, amide, and ether groups showed strong positive contributions, indicating their key role in hydrolytic reactivity and enzyme accessibility.
The high weight of aromatic density (fr_benzeneDensity) highlights the resistance of rigid aromatic polymers like PET and PS, consistent with known barriers caused by π–π conjugation and crystallinity (Wei & Zimmermann, 2017; Tournier et al., 2020).
Interestingly, features related to molecular flexibility (e.g., NumRotatableBonds, fr_etherDensity) correlated positively with degradability, suggesting that flexible chains are more easily accommodated in enzyme pockets, enhancing catalytic access (Joo et al., 2018).
In contrast, larger or more rigid molecules showed reduced susceptibility, implying that spatial accessibility—not just bond type—affects degradation.
Hydrophobicity and polar surface distribution (MolLogP, TPSADensity) also ranked highly, emphasizing that surface polarity modulates enzyme–polymer interactions by affecting binding affinity and pocket compatibility (Yoshida et al., 2016; Han et al., 2017).
Overall, degradability emerges as a multifactorial property shaped by both chemical bonds and three-dimensional accessibility, suggesting that enzyme engineering should consider pocket polarity and spatial accommodation in addition to catalytic residues.
SHAP-based Interpretability in ML Classification

The SHAP summary plot (Figure X) highlights several high-impact descriptors:
- KF4 (Kidera Factor 4) — reflects amino acid hydrophobicity and side-chain bulkiness; higher KF4 values indicate a stronger hydrophobic core, often associated with plastic-binding regions (Kidera et al., 1985).
- SVGER3, SVGER2 — encode electrostatic and solvent accessibility patterns; positive SHAP contributions suggest that favorable surface charge distributions enhance substrate recognition.
- PRIN3 / F5 — represent polarity and residue–residue interaction energy; their high positive SHAP values imply that polarity heterogeneity aids in adapting to chemically diverse plastics.
- BLOSUM4, Z3, ST3 — describe sequence substitution patterns and topological autocorrelations; negative contributions indicate that conservative or rigid sequence motifs correlate with lower degradative versatility.
Color gradients (red = high feature value, blue = low) visualize how each descriptor modulates the model output:
red features push predictions toward degradable plastics, while blue features suppress degradability — consistent with the understanding that flexible, surface-accessible, and electrostatically adaptive enzymes are more likely to interact with polymer substrates (Tournier et al., 2020).
Overall, SHAP analysis confirms that the model learned biophysically meaningful patterns rather than memorizing label correlations.
The results align with established enzymatic degradation mechanisms, where surface charge, flexibility, and hydrophobic exposure jointly determine substrate specificity (Wei & Zimmermann, 2017).
Shared Embedding Space Analysis

To further assess how the dual-tower model organizes enzymatic and polymeric representations in the shared latent space, a dimensionality-reduced projection (via UMAP) was generated from the final enzyme and polymer embeddings.
The visualization reveals a clear structural organization on the enzyme side: proteins cluster according to sequence and functional similarity, reflecting the model’s ability to encode biologically meaningful relationships.
In contrast, polymer representations exhibit a more diffuse distribution.
Notably, distinct polymer families such as PET, Nylon, PLA, and PHA form recognizable but partially overlapping regions, suggesting that the model captures major chemical motifs (e.g., ester or amide linkages) while retaining cross-family relational awareness.
However, several minor polymer classes appear compressed or aggregated near the periphery of the embedding map.
This phenomenon likely results from limited training samples for these categories, leading to representation collapse under the shared-space constraint.
Such clustering behavior explains why the dual-tower framework, despite being more expressive in theory, sometimes underperforms the simpler multi-class classifiers.
When data are imbalanced or sparse, the shared representation must jointly optimize both domains, which can dilute the separability of underrepresented polymer types.
Overall, this observation underscores the untapped potential of the dual-tower architecture: with more balanced polymer data and fine-tuned alignment, the shared latent space could achieve clearer inter-class boundaries and enhanced interpretability across enzyme–polymer interactions.
Discussion and Insights
Bridging Biology and Computation
Computational modeling has become an indispensable engine in modern synthetic biology, enabling the rational design, prediction, and optimization of biological systems that were once driven mainly by empirical trial-and-error.
This study demonstrates how computational learning frameworks can be harnessed to capture the intricate determinants of plastic biodegradation, a phenomenon governed by both enzymatic structure and polymer chemistry. Through progressive model design, from interpretable ML classifiers to dual-tower embedding systems, the Plaszyme project bridges biochemical intuition and data-driven discovery.
In this study, the sequence-only classifier (PlaszymeAlpha) and the cross-domain dual-tower model (PlaszymeX) exemplify this progression. Our findings also highlight that biological function can emerge naturally from learned representations, even without explicit annotation of catalytic residues or binding motifs.
In particular, contextual embeddings from protein language models (ESM) effectively internalized evolutionary constraints and structure–function relationships, allowing the model to generalize across homologous and remote enzyme families.
At the same time, the polymer descriptors revealed how molecular flexibility, polarity, and aromatic rigidity collectively shape degradability, moving beyond the simplistic view that hydrolysis depends solely on ester or amide bonds.
Model Interpretability and Scientific Transparency
The interpretability analyses (gradient-based feature attribution, SHAP value interpretation, and shared-space visualization) consistently revealed chemically and biologically plausible patterns.
These analyses indicate that the model is not a “black box,” but a learnable hypothesis generator, capable of identifying the molecular factors most correlated with enzymatic degradability.
The alignment between model-derived importance scores and known biochemical determinants (e.g., aromatic density, hydrogen-bond potential, surface polarity) validates the system’s scientific transparency and provides rational directions for enzyme engineering and substrate design.
Biological and Engineering Implications
From a biological perspective, the models elucidate how enzyme specificity may arise from combined spatial and electrostatic complementarity rather than catalytic residues alone.
Such insights could guide directed evolution or structure-guided mutagenesis, focusing on improving substrate accommodation and binding dynamics instead of merely optimizing active-site chemistry.
On the polymer side, descriptor-level analyses highlight potential strategies for designing next-generation biodegradable materials, such as tuning flexibility, introducing hydrolysis-prone linkages, or modulating aromatic density to achieve desired degradation profiles.
Limitations and Future Directions
Current Limitations
Although the current framework demonstrates strong predictive performance and chemical interpretability, several limitations remain:
- Data imbalance across plastic categories.
The number of polymer types greatly exceeds the available enzyme–polymer pairs, leading to severe long-tail effects. Certain underrepresented plastics (e.g., specialty copolymers) are insufficiently sampled, causing their representations to collapse or cluster into a single region in the shared latent space.
- Limited enzyme sample diversity.
Despite integrating multiple known hydrolase families, the current dataset remains small in both sequence and structure coverage. This limits generalization to novel or distant homologs and may bias the learned representation toward well-studied enzyme classes.
- Incomplete modeling of structural and kinetic factors.
The model primarily focuses on static structural embeddings. Physical attributes such as polymer crystallinity, chain packing, and enzyme conformational flexibility, which substantially influence degradation efficiency, are not yet explicitly modeled.
- Optimization sensitivity and shared-space compression.
Some plastics show spatial compression or overlap within the shared embedding space, indicating that the current dual-tower model has not fully converged to disentangled, domain-specific manifolds. This explains why, despite structural insight, the overall retrieval performance does not yet surpass the multi-class baseline in every metric.
Future Directions
To overcome these limitations and enhance biological interpretability, the dual-tower framework offers a flexible foundation for more advanced and generalizable training paradigms:
- Multi-task joint training (Protein-tower side)
Integrate multiple biological supervision signals onto the same protein backbone — including sequence/structure contrastive tasks (InfoNCE), fold-type or family classification, binding-pocket and active-site prediction, residue accessibility and electrostatic regression, and contact-map reconstruction.
Such multi-objective constraints enhance the backbone’s ability to capture spatial geometry and electrostatic environments, mitigating overfitting to specific degradase families.
- Cross-domain consistency and auxiliary tasks (Polymer-tower side)
Introduce self- or weakly supervised tasks on the polymer side: descriptor reconstruction and masking, normalized descriptor consistency, regression of basic physical properties (e.g., Tg, polar surface density), and functional-group fragment recognition.
Aligning these auxiliary tasks with the protein domain helps sharpen the enzyme–polymer boundary in the shared latent space.
- Unified multi-task alignment objectives (Shared space)
Extend the current listwise InfoNCE into a multi-task alignment objective jointly optimizing:
(a) degradation-pair ranking,
(b) pocket–functional group compatibility scoring, and
(c) conformational penalty/flexibility reward terms.
Integrating hard negative mining and curriculum learning (from homologous to remote pairs) can stabilize convergence and enhance separability.
- Data and supervision expansion
Incorporate high-throughput screening and metagenomic mining to expand enzyme diversity, while leveraging semi-supervised pre-training to reduce reliance on labeled data.
Including kinetic proxies (e.g., relative activity tiers) and polymer morphology descriptors (e.g., crystallinity, roughness) would better link structure and catalysis under realistic biophysical contexts.
Overall, these directions aim to evolve Plaszyme's dual-tower architecture into a biologically grounded multi-task representation framework, capable of capturing both structural realism and catalytic logic, paving the way for rational enzyme design, novel polymer discovery, and interpretable biodegradation prediction.
Reference
Machine Learning and Representation Learning
- Bromley, J., Guyon, I., LeCun, Y., Säckinger, E., & Shah, R. (1993). Signature verification using a “Siamese” time delay neural network. Advances in Neural Information Processing Systems, 6, 737–744.
- Cao, Z., Qin, T., Liu, T.-Y., Tsai, M.-F., & Li, H. (2007). Learning to rank: From pairwise approach to listwise approach. Proceedings of the 24th International Conference on Machine Learning (ICML 2007), 129–136. https://doi.org/10.1145/1273496.1273513
- Chawla, N. V., Bowyer, K. W., Hall, L. O., & Kegelmeyer, W. P. (2002). SMOTE: Synthetic minority over-sampling technique. Journal of Artificial Intelligence Research, 16, 321–357. https://doi.org/10.1613/jair.953
- Hadsell, R., Chopra, S., & LeCun, Y. (2006). Dimensionality reduction by learning an invariant mapping. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 1735–1742. https://doi.org/10.1109/CVPR.2006.100
- Hinton, G. E. (1984). Distributed representations. Technical Report CMU-CS-84-157, Carnegie Mellon University.
- Ke, G., Meng, Q., Finley, T., Wang, T., Chen, W., Ma, W., et al. (2017). LightGBM: A highly efficient gradient boosting decision tree. Advances in Neural Information Processing Systems, 30.
- Mikolov, T., Chen, K., Corrado, G., & Dean, J. (2013). Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781. https://arxiv.org/abs/1301.3781
- McInnes, L., Healy, J., & Melville, J. (2018). UMAP: Uniform manifold approximation and projection for dimension reduction. arXiv preprint arXiv:1802.03426. https://arxiv.org/abs/1802.03426
Protein Structure and Modeling
- Baek, M., DiMaio, F., Anishchenko, I., et al. (2021). Accurate prediction of protein structures and interactions using a three-track neural network. Science, 373(6557), 871–876. https://doi.org/10.1126/science.abj8754
- Gainza, P., Sverrisson, F., Monti, F., et al. (2020). Deciphering interaction fingerprints from protein surfaces using geometric deep learning. Nature Methods, 17(2), 184–192. https://doi.org/10.1038/s41592-019-0666-6
- Gligorijević, V., Renfrew, P. D., Kosciolek, T., et al. (2021). Structure-based protein function prediction using graph convolutional networks. Nature Communications, 12(1), 3168. https://doi.org/10.1038/s41467-021-23303-9
- Jing, B., Eismann, S., Soni, P. N., Townshend, R. J. L., & Dror, R. O. (2021). Learning from protein structure with geometric vector perceptrons. International Conference on Learning Representations (ICLR 2021). https://openreview.net/forum?id=1YLJDvSx6J4
- Jumper, J., Evans, R., Pritzel, A., et al. (2021). Highly accurate protein structure prediction with AlphaFold. Nature, 596(7873), 583–589. https://doi.org/10.1038/s41586-021-03819-2
- Lin, Z., Akin, H., Rao, R., et al. (2023). Evolutionary-scale prediction of atomic-level protein structure with a language model. Science, 379(6637), 1123–1130. https://doi.org/10.1126/science.ade2574
- Mirdita, M., Schütze, K., Moriwaki, Y., Heo, L., Ovchinnikov, S., & Steinegger, M. (2022). ColabFold: Making protein folding accessible to all. Nature Methods, 19(6), 679–682. https://doi.org/10.1038/s41592-022-01488-1
- Rives, A., Meier, J., Sercu, T., et al. (2021). Biological structure and function emerge from scaling unsupervised learning to 250 million protein sequences. Proceedings of the National Academy of Sciences, 118(15), e2016239118. https://doi.org/10.1073/pnas.2016239118
- Tunyasuvunakool, K., et al. (2021). Highly accurate protein structure prediction for the human proteome. Nature, 596(7873), 590–596. https://doi.org/10.1038/s41586-021-03828-1
- Fey, M., & Lenssen, J. E. (2019). Fast graph representation learning with PyTorch Geometric. arXiv preprint arXiv:1903.02428. https://arxiv.org/abs/1903.02428
- Kipf, T. N., & Welling, M. (2017). Semi-supervised classification with graph convolutional networks. International Conference on Learning Representations (ICLR 2017). https://arxiv.org/abs/1609.02907
- Veličković, P., Cucurull, G., Casanova, A., Romero, A., Lio, P., & Bengio, Y. (2018). Graph attention networks. International Conference on Learning Representations (ICLR 2018). https://arxiv.org/abs/1710.10903
- Xu, K., Hu, W., Leskovec, J., & Jegelka, S. (2019). How powerful are graph neural networks? International Conference on Learning Representations (ICLR 2019). https://arxiv.org/abs/1810.00826
Polymer Chemistry and Descriptor Analysis
- Atchley, W. R., Zhao, J., Fernandes, A. D., Drüke, T., & Su, Z. (2005). Solving the protein sequence metric problem. Proceedings of the National Academy of Sciences, 102(18), 6395–6400. https://doi.org/10.1073/pnas.0408677102
- Iannace, S., & Nicolais, L. (1997). Biodegradable aliphatic polyesters: Mechanical properties and degradation kinetics. Journal of Applied Polymer Science, 64(5), 911–919. https://doi.org/10.1002/(SICI)1097-4628(19970502)64:5%3C911::AID-APP11%3E3.0.CO;2-W
- Jolliffe, I. T., & Cadima, J. (2016). Principal component analysis: A review and recent developments. Philosophical Transactions of the Royal Society A, 374(2065), 20150202. https://doi.org/10.1098/rsta.2015.0202
- Kidera, A., Konishi, Y., Oka, M., Ooi, T., & Scheraga, H. A. (1985). Statistical analysis of the physical properties of amino acids and construction of the new physicochemical scale: the Kidera factors. Protein Engineering, Design & Selection, 1(5), 399–408. https://doi.org/10.1093/protein/1.5.399
- Landrum, G. (2016). RDKit: Open-source cheminformatics. http://www.rdkit.org/
- Osorio, D. (2020). Peptides.py: Physicochemical properties, indices and descriptors for amino-acid sequences. Zenodo. https://doi.org/10.5281/zenodo.3814196
- Todeschini, R., & Consonni, V. (2009). Molecular descriptors for chemoinformatics (2nd ed.). Wiley-VCH. https://doi.org/10.1002/9783527628766
Plastic Biodegradation and Enzyme Mechanism
- Han, X., Liu, W., Huang, J. W., Ma, J., Zheng, Y., Ko, T. P., et al. (2017). Structural insight into catalytic mechanism of PET hydrolase. Nature Communications, 8, 2106. https://doi.org/10.1038/s41467-017-02255-z
- Joo, S., Cho, I. J., Seo, H., Son, H. F., Sagong, H. Y., Shin, T. J., et al. (2018). Structural insight into molecular mechanism of poly(ethylene terephthalate) degradation. Nature Communications, 9, 382. https://doi.org/10.1038/s41467-018-02881-1
- Tournier, V., Topham, C. M., Gilles, A., et al. (2020). An engineered PET depolymerase to break down and recycle plastic bottles. Nature, 580, 216–219. https://doi.org/10.1038/s41586-020-2149-4
- Wei, R., & Zimmermann, W. (2017). Microbial enzymes for the recycling of recalcitrant petroleum-based plastics: how far are we? Microbial Biotechnology, 10(6), 1308–1322. https://doi.org/10.1111/1751-7915.12710
- Yoshida, S., Hiraga, K., Takehana, T., Taniguchi, I., Yamaji, H., Maeda, Y., et al. (2016). A bacterium that degrades and assimilates poly(ethylene terephthalate). Science, 351(6278), 1196–1199. https://doi.org/10.1126/science.aad6359