Database | XJTLU-AI-China

Background

PlaszymeDB addresses two shortcomings in the current research landscape. First, relevant findings are scattered across publications and databases, and this fragmentation makes it unduly difficult for researchers to search, compare, and apply plastic-degrading enzymes. Information held in disconnected repositories has to be integrated manually, which raises the barrier to data access in this field. Second, the data in existing repositories are often inconsistent in quality and lack standardization, so they cannot be used for advanced computational approaches such as machine learning without substantial upfront curation and cleaning. Researchers therefore need a systematic, interactive platform that both integrates this fragmented information and ensures that the data are reliable and standardized.

To meet this need, we developed PlaszymeDB, a comprehensive and meticulously curated database of plastic-degrading enzymes.

Introduction

PlaszymeDB aims to provide an open, user-friendly, and sustainably updated resource for researchers worldwide. As a comprehensive database dedicated to plastic-degrading enzymes, it includes 474 enzyme records reported up to 2025, covering 34 types of plastics (such as PET, PE, PP, and PVC), and integrates AlphaFold-predicted structural files along with selected experimentally resolved structures.

The database not only provides basic information (e.g., sequence, EC number, host organism) but also integrates structural analysis, functional annotations, and phylogenetic insights. Its goal is to enable systematic management of plastic-degrading enzymes.

Design Objectives

PlaszymeDB is more than an information platform: it is closely linked to downstream model development. First, it provides standardized datasets for training machine learning models, which supports their accuracy and reliability. Second, in this project five test sequences from model development were selected directly for wet-lab validation. By combining data-driven approaches with experimental confirmation, the database serves as a practical bridge between theoretical research and real-world applications.

PlaszymeDB aims to:

Provide a new integrated data source for researchers worldwide, reducing redundant efforts caused by dispersed information;

Integrate structural data with sequence and functional annotations, laying the groundwork for enzyme engineering and environmental remediation;

Supply candidate sequences and annotations for laboratory studies, shortening the path from data to experimental validation;

Serve as a dataset for artificial intelligence model training;

Lower the barrier to data use through open access and interactive design, enabling researchers from diverse backgrounds and disciplines to benefit and fostering global academic exchange and interdisciplinary collaboration.

Overall Workflow

The construction of PlaszymeDB follows a strict data collection and processing procedure to ensure the accuracy and reproducibility of the database content.

Data Collection

We integrated enzyme information from PlasticDB (Gambarini et al., 2022), PAZy (Buchholz et al., 2022), PMBD (Gan & Zhang, 2019), two open-source GitHub projects (Medina-Ortiz et al., 2025; Zrimec et al., 2021), and one relevant research article (Jiang et al., 2023). All collected entries were cross-validated with UniProt, NCBI, and PDB to ensure the reliability of existing sequences and annotations.

Data Standardization and Cleaning

Plastic type names were unified, a systematic ID coding scheme was designed, and redundant or anomalous sequences were removed to ensure data consistency and scientific value.
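The name unification step can be illustrated with a small lookup table. This is a minimal sketch only: the synonym table `PLASTIC_SYNONYMS` and the function `normalize_plastic` are hypothetical, and the actual mapping used for PlaszymeDB is not described in the text.

```python
import re

# Hypothetical synonym table; the real PlaszymeDB mapping is not given in the text.
PLASTIC_SYNONYMS = {
    "polyethylene terephthalate": "PET",
    "poly(ethylene terephthalate)": "PET",
    "pet": "PET",
    "polyethylene": "PE",
    "pe": "PE",
    "polypropylene": "PP",
    "pp": "PP",
    "polyvinyl chloride": "PVC",
    "poly(vinyl chloride)": "PVC",
    "pvc": "PVC",
}


def normalize_plastic(name: str) -> str:
    """Map a raw plastic name to a canonical abbreviation.

    Names are compared case-insensitively with collapsed whitespace.
    Unknown names are returned trimmed but otherwise unchanged, so they
    can be flagged for manual curation instead of being guessed.
    """
    key = re.sub(r"\s+", " ", name.strip().lower())
    return PLASTIC_SYNONYMS.get(key, name.strip())
```

Returning unknown names unchanged keeps the automated step conservative: anything the table does not cover stays visible for a curator to resolve.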

Deduplication was implemented via two approaches:

- Script-based cleaning

A custom script processed the data in batch through four main steps:

Data preprocessing: Organize datasets collected from six different sources (the professional databases PlasticDB, PAZy, and PMBD; two open-source GitHub projects; and one research article on plastic-degrading enzymes), ensuring preliminary consistency in field formats and content.

Data integration: Merge datasets by mapping to a unified set of key fields (plastic, label, sequence, genbank_ids, uniprot_ids, pdb_ids, refseq_ids, mgnify_ids, enzyme_name, ec_number, gene_name, host_organism, taxonomy, reference, source_name), resulting in a comprehensive integrated dataset.

Sequence completion: For entries lacking sequence information, retrieve the missing sequences using available ID data via external database APIs, in the order of UniProt → GenBank → PDB → RefSeq, thereby minimizing missing values.

Deduplication and quality control: Remove records without any sequence information. Perform deduplication using ["plastic", "label", "sequence"] as a composite key; entries are merged only if plastic type, sample label (positive/negative), and sequence are completely identical. In addition, standardize and merge synonymous plastic names, check field formats and fill missing values, and validate sequence quality (retaining only standard amino acid letters while discarding sequences containing non-standard characters, low-complexity regions, or abnormal lengths).

- Manual curation

Suspect entries flagged by scripts (e.g., ambiguous synonyms, anomalous EC numbers, host names with misspellings or incorrect Latin formatting, broken references, or invalid database links) were manually verified by cross-checking original publications against UniProt, NCBI, and PDB records.

Throughout the process, all operations were logged, and version updates were documented to form a complete audit trail. The combination of “scripted automation + manual curation” enhanced both efficiency and reliability.
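The sequence completion and deduplication steps of the script-based cleaning can be sketched as follows. This is an illustrative sketch under stated assumptions, not the project's actual script: the fetcher functions are injected placeholders for the external database APIs, the length bounds are hypothetical, low-complexity filtering is omitted, and duplicates are resolved by keeping the first record instead of merging their fields.

```python
from typing import Callable, Dict, Iterable, List, Optional

STANDARD_AA = set("ACDEFGHIKLMNPQRSTVWY")

# Lookup order from the text: UniProt -> GenBank -> PDB -> RefSeq.
LOOKUP_ORDER = ["uniprot_ids", "genbank_ids", "pdb_ids", "refseq_ids"]

# A fetcher takes an accession and returns a sequence, or None if not found.
Fetcher = Callable[[str], Optional[str]]


def complete_sequence(record: dict, fetchers: Dict[str, Fetcher]) -> dict:
    """Fill a missing sequence by trying each ID field in LOOKUP_ORDER."""
    if record.get("sequence"):
        return record
    for field in LOOKUP_ORDER:
        fetch = fetchers.get(field)
        if fetch is None:
            continue
        for accession in record.get(field) or []:
            seq = fetch(accession)
            if seq:
                return {**record, "sequence": seq}
    return record


def is_valid_sequence(seq: str, min_len: int = 50, max_len: int = 2000) -> bool:
    """Accept only standard residues; the length bounds are hypothetical."""
    return min_len <= len(seq) <= max_len and set(seq) <= STANDARD_AA


def deduplicate(records: Iterable[dict]) -> List[dict]:
    """Drop invalid or missing sequences, then keep one record per
    (plastic, label, sequence) composite key."""
    seen = set()
    kept = []
    for rec in records:
        seq = (rec.get("sequence") or "").strip().upper()
        if not is_valid_sequence(seq):
            continue
        key = (rec["plastic"], rec["label"], seq)
        if key not in seen:
            seen.add(key)
            kept.append({**rec, "sequence": seq})
    return kept
```

Passing the fetchers in as arguments keeps the lookup order testable without network access. In a real pipeline each fetcher would wrap the corresponding database API.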

Information Integration

Functional Annotation

The database includes enzyme annotations such as EC numbers, plastic degradation relationships, and host organisms. Because the richness of information varies across source records and the literature, annotations were supplemented as far as the available data allowed. Wherever possible, PlaszymeDB enriched incomplete entries with additional references.

Structural Information

For every protein entry, PlaszymeDB provides predicted 3D structures generated by AlphaFold (developed by DeepMind). In addition, some entries include high-resolution structures determined experimentally via X-ray crystallography, serving as critical validation for predictions.

By integrating experimental and computational structures, PlaszymeDB offers a panoramic resource for investigating enzyme features and mechanisms.

The following structural files, selected from PlaszymeDB, give a visual representation of the protein structures:

Left to right: PlaszymeDB structure IDs X02000, X0264, and X0267.

Source Tracing

PlaszymeDB establishes a systematic traceability mechanism for every protein entry:

Database links: Direct connections to UniProt, NCBI, and PDB enable quick verification of original data.

Research references: Each entry is accompanied by citation information, ensuring annotation credibility and respect for intellectual property.

This combination of database links and literature validation enhances both the credibility and accessibility of the resource.
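As an illustration of how such database links can be derived from stored identifiers, the sketch below builds entry-page URLs from the ID fields of a record. The URL patterns are the public entry-page conventions of UniProt, NCBI Protein, and RCSB PDB. Whether PlaszymeDB builds its links this way is an assumption, and `source_links` is a hypothetical helper.

```python
from urllib.parse import quote

# Public entry-page URL patterns. Using them for PlaszymeDB links is an
# assumption made for this sketch.
URL_TEMPLATES = {
    "uniprot_ids": "https://www.uniprot.org/uniprotkb/{id}/entry",
    "genbank_ids": "https://www.ncbi.nlm.nih.gov/protein/{id}",
    "refseq_ids": "https://www.ncbi.nlm.nih.gov/protein/{id}",
    "pdb_ids": "https://www.rcsb.org/structure/{id}",
}


def source_links(record: dict) -> dict:
    """Return {field: [url, ...]} for every ID field that has values."""
    links = {}
    for field, template in URL_TEMPLATES.items():
        ids = record.get(field) or []
        if ids:
            links[field] = [
                template.format(id=quote(acc.strip(), safe="")) for acc in ids
            ]
    return links


# Placeholder accessions, not real PlaszymeDB entries.
example = {"uniprot_ids": ["P12345"], "pdb_ids": ["1ABC"], "genbank_ids": []}
print(source_links(example))
```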

Quality Assurance and Review

All entries were cross-validated with UniProt, NCBI, and PDB to ensure sequence and annotation reliability. Final review followed a “two reviewers plus one arbitrator” system: two curators independently assessed the data; if they agreed, the result was confirmed; in case of disagreement, a third reviewer arbitrated through discussion to finalize the decision.
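The review rule described above amounts to a simple decision function. The sketch below is one way to express it. The function name and the string verdicts are hypothetical, and it models only the outcome of the arbitration, not the discussion itself.

```python
from typing import Optional


def final_decision(
    reviewer_a: str, reviewer_b: str, arbitrator: Optional[str] = None
) -> str:
    """Apply the "two reviewers plus one arbitrator" rule.

    If the two independent verdicts agree, that verdict is final.
    Otherwise the arbitrator's verdict decides, and it must be supplied.
    """
    if reviewer_a == reviewer_b:
        return reviewer_a
    if arbitrator is None:
        raise ValueError("reviewers disagree; an arbitrator verdict is required")
    return arbitrator
```

Raising an error when the arbitrator is missing prevents a disputed entry from being confirmed silently.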

Biological Validation and Phylogenetic Analysis

Wet-lab Validation

To verify that representative enzymes within the database exhibit true degradation activity, five enzyme sequences were selected through stratified sampling that balanced enzyme family diversity and plastic substrate types.
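One possible reading of this stratified selection is sketched below: entries are grouped into strata by enzyme family and plastic type, and candidates are drawn from the strata in turn so that no stratum is used twice before every stratum has been used once. The exact procedure used in the project is not described, so the `family` field, the function name, and the round-robin rule are all assumptions.

```python
import random
from collections import defaultdict


def stratified_pick(entries, n, seed=0):
    """Pick up to n entries, cycling through (family, plastic) strata."""
    rng = random.Random(seed)  # fixed seed keeps the selection reproducible
    strata = defaultdict(list)
    for entry in entries:
        strata[(entry["family"], entry["plastic"])].append(entry)

    pools = list(strata.values())
    for pool in pools:
        rng.shuffle(pool)  # random order within each stratum
    rng.shuffle(pools)     # random order of the strata themselves

    picked = []
    while len(picked) < n and any(pools):
        for pool in pools:
            if pool and len(picked) < n:
                picked.append(pool.pop())
    return picked
```

With five picks and at least five strata, this rule returns one candidate per stratum. With fewer strata, it returns to the largest ones only after every stratum has contributed.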

The corresponding genes were cloned into pET expression vectors and heterologously expressed in E. coli using the pelB signal peptide secretion pathway.

Protein expression was confirmed using SDS-PAGE and Western blot, followed by activity assays with model substrates representing ester-bond and polyester hydrolysis. All selected enzymes displayed measurable activity compared with negative controls, confirming that database entries correspond to biologically functional proteins.


Bioinformatic Validation

In parallel, all enzyme sequences were standardized, de-duplicated, and annotated using Pfam and InterPro domain analyses. Representative sequences were aligned via MUSCLE, and phylogenetic trees were constructed using the Maximum Likelihood method.

Results were visualized on iTOL, where color-coded clades represent different plastic types and enzyme families. The resulting tree shows that α/β-hydrolases dominate the dataset, with distinct subgroups corresponding to cutinase-like and lipase-like families, in line with known literature on polyester degradation mechanisms.

Explore the complete phylogenetic tree in an interactive preview on our website: plaszyme.org/plaszymedb/V9.html.

This combined experimental and computational validation supports the accuracy and biological grounding of PlaszymeDB, linking in silico curation with real-world enzymatic function.

Data Accessibility and Citation

To promote open science and reproducibility, PlaszymeDB has been fully packaged and archived according to the data management guidelines recommended by iGEM.

The entire curated dataset — including enzyme sequences, plastic associations, annotations, and benchmark metadata — has been deposited on Zenodo and assigned a Digital Object Identifier (DOI) for long-term accessibility.

Zenodo DOI: 10.5281/zenodo.17257278

Researchers can freely download the dataset, cite it in publications, and integrate it into computational or experimental workflows.

The standardized Zenodo release ensures data persistence, citation traceability, and compliance with FAIR (Findable, Accessible, Interoperable, Reusable) principles.

Database Implementation

PlaszymeDB Webapp: http://plaszyme.org/plaszymedb

The database uses an HTML5 + CSS3 + native JavaScript front end built as a Single-Page Application (SPA) and a RESTful API back end that uses PDO (PHP Data Objects) for database connections. The modular API integrates Mol* 3D visualization, the iTOL phylogenetic tree tool, and the Ketcher chemical editor. PlaszymeDB provides a user-friendly interface that supports fast search and multidimensional filtering. It also integrates a local BLAST tool, so users can submit new sequences for similarity analysis. In addition, downloadable .csv tables are available for diverse research applications.
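A downloaded .csv table can be filtered with a few lines of standard-library Python. The sketch below assumes that the exported columns use the unified field names listed in the data integration step (for example `plastic` and `label`). The actual export layout is not specified here, and the rows in `EXAMPLE` are toy placeholders, not real PlaszymeDB records.

```python
import csv
import io


def filter_by_plastic(csv_text, plastic, label=None):
    """Return rows whose plastic matches (case-insensitive), optionally by label."""
    reader = csv.DictReader(io.StringIO(csv_text))
    rows = []
    for row in reader:
        if row["plastic"].strip().upper() != plastic.upper():
            continue
        if label is not None and row["label"] != label:
            continue
        rows.append(row)
    return rows


# Toy table with placeholder values, not real PlaszymeDB records.
EXAMPLE = """plastic,label,sequence,enzyme_name
PET,positive,MKT,toy_enzyme_1
PE,positive,MAA,toy_enzyme_2
PET,negative,MGG,toy_enzyme_3
"""

for row in filter_by_plastic(EXAMPLE, "PET", label="positive"):
    print(row["enzyme_name"], row["sequence"])
```

For a file on disk, the text can be read with `open(path, newline="", encoding="utf-8").read()` and passed to the same function.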

Conclusion

PlaszymeDB establishes a high-quality, continuously curated foundation for global research on plastic-degrading enzymes.

By integrating sequence data, structural information, phylogenetic relationships, and functional annotations — and validating selected entries through wet-lab enzymatic assays and bioinformatic analyses — the database ensures both computational accuracy and biological reliability.

To promote open science and reproducibility, PlaszymeDB has been packaged and archived following iGEM’s data management guidelines and the FAIR principles, and the full dataset has been publicly deposited on Zenodo (DOI: 10.5281/zenodo.17257278) for long-term accessibility and citation.

As an open-access platform freely available at http://plaszyme.org/plaszymedb, it enables researchers from diverse backgrounds to search, compare, and apply enzyme data seamlessly, supporting downstream applications in synthetic biology, enzyme engineering, and environmental biotechnology.

Ultimately, PlaszymeDB is not merely a data repository but a living infrastructure that bridges data, experiments, and computational discovery, catalyzing innovation toward a sustainable and AI-empowered future in biodegradation research.

References

Buchholz, P. C. F., Feuerriegel, G., Zhang, H., et al. (2022). Plastics degradation by hydrolytic enzymes: The plastics-active enzymes database—PAZy. Proteins, 90(7), 1443–1456. https://doi.org/10.1002/prot.26325

Gambarini, V., Pantos, O., Kingsbury, J. M., Weaver, L., Handley, K. M., & Lear, G. (2022). PlasticDB: A database of microorganisms and proteins linked to plastic biodegradation. Database (Oxford), 2022, baac008. https://doi.org/10.1093/database/baac008

Gan, Z., & Zhang, H. (2019). PMBD: A comprehensive plastics microbial biodegradation database. Database (Oxford), 2019, baz119. https://doi.org/10.1093/database/baz119

Jiang, R., Shang, L., Wang, R., Wang, D., & Wei, N. (2023). Machine learning–based prediction of enzymatic degradation of plastics using encoded protein sequence and effective feature representation. Environmental Science & Technology Letters, 10(6), 464–470. https://doi.org/10.1021/acs.estlett.3c00293

Medina-Ortiz, D., Alvares-Saravia, D., Soto-García, N., Sandoval-Vargas, D., Aldridge, J., Rodríguez, S., Andrews, B., Asenjo, J. A., & Daza, A. (2025). Discovering potential plastic degrading enzymes using machine learning strategies. bioRxiv. https://doi.org/10.1101/2025.02.09.637306

Zrimec, J., Kokina, M., Jonasson, S., Zorrilla, F., & Zelezniak, A. (2021). Plastic-degrading potential across the global microbiome correlates with recent pollution trends. mBio, 12, e02155-21. https://doi.org/10.1128/mbio.02155-21
