Web App | XJTLU-AI-China

Background

The transition from specialized research frameworks to accessible web platforms represents a critical step in democratizing computational enzyme discovery. Prior to this cycle, our tools—while scientifically rigorous—required local installation, programming expertise, and significant computational resources. This created barriers for wet-lab researchers, environmental scientists, and interdisciplinary teams who needed quick access to predictive models and comprehensive enzyme databases.

Our core idea is to build a user-friendly platform. Through extensive consultations with researchers across microbiology, environmental science, synthetic biology, and chemistry, we identified a shared need: a unified, installation-free online environment that integrates predictive modeling with database exploration. The platform needed to serve two purposes: (1) enable rapid prediction of enzyme–plastic interactions from user-provided sequences or structures, and (2) facilitate knowledge discovery through a searchable database of curated enzyme–plastic relationships.

Core Requirements:

Model Platform Requirements

Multi-Input Support: Accept protein sequences (FASTA format, raw text) and 3D structures (PDB files)

Dual-Model Architecture: Integrate both sequence-based (PlaszymeAlpha) and structure-informed (PlaszymeX) predictors

Automated Workflow: Eliminate manual structure prediction steps—users provide sequences once, system handles structure generation

Scalability: Support large-scale metagenomic analyses with custom HMM pipelines for enzyme discovery

Interpretability: Provide confidence scores, degradation probabilities, and interactive 3D visualization

Export Functionality: Enable CSV export for downstream computational analysis

Asynchronous Processing: Long-running structure predictions should not block user interface

Global Accessibility: Deploy on cloud infrastructure for worldwide availability

Database Platform Requirements

High-Performance Search: Implement efficient traversal-based search algorithm across enzyme properties

Sequence Alignment: Provide BLAST-like local sequence similarity search

3D Structure Visualization: Interactive molecular viewer for experimental and predicted structures

Phylogenetic Context: Dynamic display of evolutionary relationships via embedded phylogenetic trees

Chemical Information: Render plastic chemical structures with standard nomenclature

Comprehensive Metadata: Include EC annotations, literature citations, taxonomy, and database cross-references

Filtering & Sorting: Multi-criteria filtering (plastic type, organism, EC class, structure availability)

User interface design

Predictor Platform design

Dashboard Page
User-friendly features conducive to management

The dashboard page visualizes project activity and supports basic project management. Researchers can see at a glance which projects are running, which are completed, and the total number of projects. When submitting a prediction, users can give the project a custom name. Task statuses are color-coded so that they can be told apart quickly. Together, these features address the work-management issues that researchers raised.

“Model selector” function
Help users make the appropriate model selection

For a Plastic Degradation Prediction task, users choose a model in the model selector according to their task requirements. Each model serves a different purpose, and the selector shows a short description of each to help users decide. This lets each model be used where its strengths apply, and it responds to the requests for additional usage guidance that users made during interviews.

“Plastic selector” function
Offer plastic options to facilitate specific research

The plastic selector on the PlaszymeX input interface lets users choose the specific plastics they want to study, from 34 plastic types in total. This keeps the analysis focused and avoids generating redundant results.

Metagenome HMM Scan
Using PlaszymeHMM, we combine the Plaszyme model with the actual usage environment

The Metagenome HMM Scan task type also provides detailed descriptions in the interface. During the task, PlaszymeHMM handles complex metagenome data: the HMM model identifies and extracts potential plastic-degrading enzyme sequences, which are then passed automatically to the prediction model as input.

The design of this section is based mainly on our discussions with the professor. It makes the model usable for the many researchers who work with environmental samples, helping them accelerate their research.

Result Page
The page for displaying and exporting the output results of the model

Once a prediction has finished, users can open its output with the "View" action in the Dashboard project list. The result page shows the degradation probability and confidence for each plastic type, and the results can be exported in CSV format for downstream use.
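The CSV export described above could look roughly like the following. This is a minimal sketch, not the platform's code: the function name `results_to_csv`, the column names, and the `{plastic: (probability, confidence)}` layout are all assumptions made for illustration.

```python
import csv
import io

def results_to_csv(results):
    """Serialize {plastic: (probability, confidence)} to CSV text.

    The dict layout and column names are illustrative assumptions.
    """
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["plastic", "degradation_probability", "confidence"])
    # List the most likely plastics first
    ranked = sorted(results.items(), key=lambda kv: kv[1][0], reverse=True)
    for plastic, (probability, confidence) in ranked:
        writer.writerow([plastic, f"{probability:.3f}", f"{confidence:.3f}"])
    return buffer.getvalue()

example = {"PLA": (0.12, 0.60), "PET": (0.91, 0.84)}  # made-up values
csv_text = results_to_csv(example)
```

Sorting by probability before writing keeps the exported file in the same order a user would likely read it on the result page.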

PlaszymeDB Platform design

BLAST Professional Tool And Result
Actual screenshots showing the input and output of the BLAST function

The database platform includes a dedicated sequence alignment tool. To give researchers a one-stop workflow, we packaged a tool based on the BLAST algorithm together with our local sequence data. Adjustable parameters and threshold options support a range of use cases.

iTOL Embedded Visualization Tool
Screenshot of statistical tool for visualizing the phylogenetic tree embedded in the webpage

From our conversations with users and our review of existing online platforms, we learned that phylogenetic trees can be embedded through iTOL. The embedded tree lets users explore the data visually. Its rich annotations, together with the plastic type labels we added, let users search for and locate the entries they are interested in without leaving the website, which further improves the platform's usability.

Design

Design Philosophy

The platform design adheres to three core principles:

“Less is More” Interaction Design: Minimize user input requirements while maximizing output richness. Users should provide data once and receive comprehensive, multi-faceted results.

Professional Rigor Without Complexity: Maintain scientific accuracy and comprehensive functionality without overwhelming users unfamiliar with computational tools. The interface should guide users naturally through prediction workflows without requiring them to read a manual.

Modular Architecture: Design independent but complementary modules—model prediction and database exploration—that can evolve separately while sharing common data standards.

System Architecture

Plaszyme Model Platform Architecture

Key Design Features

Isolated Environments: The V1 model (PlaszymeAlpha, sequence-based) runs in plaszyme_env and the X1 model (PlaszymeX, structure-based) runs in plaszyme_gpu; the two conda environments are completely isolated

Smart Model Routing: Automatic model selection—PDB files → X1 (structure-based), sequences → V1 (sequence-based)

Background Processing: Long-running X1 predictions execute in daemon threads with real-time status polling

Resource Management: Automatic cleanup triggers when project count reaches 50 (deletes oldest 20 projects)
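The smart model routing listed above (PDB files go to X1, sequences go to V1) can be sketched as a small dispatch function. This is an illustration only: `route_input`, the residue alphabet, and the detection heuristics are assumptions, not the logic used in the deployed app.

```python
from pathlib import Path

# Amino-acid letters accepted in sequence input (assumed alphabet)
VALID_RESIDUES = set("ACDEFGHIKLMNPQRSTVWYBXZUO*-")

def route_input(filename=None, text=None):
    """Return 'X1' for PDB structure input and 'V1' for sequence input."""
    if filename is not None and Path(filename).suffix.lower() == ".pdb":
        return "X1"
    if text is not None:
        lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
        # PDB coordinate files contain ATOM/HETATM records
        if any(ln.startswith(("ATOM  ", "HETATM")) for ln in lines):
            return "X1"
        # FASTA headers start with '>'; everything else must be residues
        residues = "".join(ln for ln in lines if not ln.startswith(">")).upper()
        if residues and set(residues) <= VALID_RESIDUES:
            return "V1"
    raise ValueError("Input is neither a PDB structure nor a protein sequence")
```

Rejecting unrecognized input with an error, rather than guessing a model, keeps a malformed upload from silently starting a long structure-based job.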

PlaszymeDB Database Platform Architecture

Key Design Features

- Custom BLAST Implementation: PHP-based sequence similarity algorithm using sliding window approach (avoids external BLAST+ dependencies)
- Dual Structure Repository: Separate directories for experimental (pdb_experimental/) and predicted (pdb_predicted/) structures
- Embedded Visualization: Iframe-based integration of Mol* viewer, Ketcher chemical editor, and iTOL phylogenetic trees
- Comprehensive Metadata: EC numbers from both literature curation and DeepEC predictions
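To make the sliding-window idea behind the custom BLAST feature concrete, here is a minimal Python sketch. It is not the PHP implementation: the function name `best_window_identity`, the default window length, and the ungapped identity score are assumptions chosen for illustration, and real BLAST additionally uses seeding, substitution matrices, and statistical scoring.

```python
def best_window_identity(query, target, window=10):
    """Best ungapped identity between any `window`-length stretch of
    `query` and any `window`-length stretch of `target`.

    Returns (identity, (query_offset, target_offset)), or (0.0, None)
    when either sequence is empty.
    """
    query, target = query.upper(), target.upper()
    window = min(window, len(query), len(target))
    if window == 0:
        return 0.0, None
    best = (0.0, None)
    for i in range(len(query) - window + 1):
        q = query[i:i + window]
        for j in range(len(target) - window + 1):
            t = target[j:j + window]
            # Fraction of aligned positions with identical residues
            identity = sum(a == b for a, b in zip(q, t)) / window
            if identity > best[0]:
                best = (identity, (i, j))
    return best
```

The nested loops compare every window pair, so the cost grows with the product of the two sequence lengths; that is acceptable for a small database but is one reason a brute-force approach is slower than BLAST+.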

Implementation

Plaszyme Model Platform Implementation Core Technologies

Component	Technology	Specification
Backend Framework	Flask (Python 3.9+)	Lightweight WSGI application with CORS support
Deep Learning	PyTorch 1.13+	CUDA-enabled for GPU acceleration (optional)
Protein Language Model	ESM-1b (650M parameters)	Pre-trained on UniRef50, 1280-dimensional embeddings
Sequence Classifier	HistGradientBoostingClassifier	Scikit-learn implementation, supports multi-label prediction
Structure Prediction	ColabFold (AlphaFold2)	Integrated via subprocess in isolated conda environment
Graph Neural Network	Custom GNN Bilinear Model	Protein-plastic interaction scoring (X1 model)
Async Task Queue	Threading + BackgroundTaskManager	Custom implementation with status tracking
Data Format	CSV, FASTA, PDB	Standardized input/output formats
Deployment	AWS EC2	Cloud-hosted for global accessibility

Technical Details

ESM Embedding Extraction (from utils.py):

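import torch
import esm

device = "cuda" if torch.cuda.is_available() else "cpu"
# ESM-1b model: 650M parameters, 33 transformer layers, 1280-dim embeddings
model, alphabet = esm.pretrained.esm1b_t33_650M_UR50S()
model.eval().to(device)
batch_converter = alphabet.get_batch_converter()
# Mean-pooling strategy for sequence-level representation
for batch in sequences:  # each batch: list of (label, sequence) tuples
    _, _, tokens = batch_converter(batch)  # returns (labels, strings, tokens)
    tokens = tokens.to(device)
    with torch.no_grad():
        output = model(tokens, repr_layers=[33])  # Last layer
    # Drop the first and last token positions, then average over the rest
    # (padded positions enter the mean unless sequences in a batch share one length)
    embeddings = output["representations"][33][:, 1:-1].mean(1)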

Background Task Management (from app.py):

import threading
from datetime import datetime

class BackgroundTaskManager:
    """Manages async PDB predictions to prevent UI blocking"""
    def __init__(self):
        self.tasks = {}  # {task_id: {status, progress, result, error}}
        self.lock = threading.Lock()

    def create_task(self, task_id, project_name, task_type='pdb'):
        # Thread-safe task creation
        with self.lock:
            self.tasks[task_id] = {
                'status': 'processing',
                'progress': 0,
                'created_at': datetime.now().isoformat()
            }
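How a daemon thread and status polling might fit around such a task table can be sketched as follows. The names `TaskStore` and `run_in_background` are hypothetical and not taken from app.py; the sketch only shows one plausible way to record the outcome of a background job so that a polling endpoint can read it.

```python
import threading
from datetime import datetime

class TaskStore:
    """Thread-safe status table (hypothetical stand-in for the task manager)."""
    def __init__(self):
        self._tasks = {}
        self._lock = threading.Lock()

    def create(self, task_id):
        with self._lock:
            self._tasks[task_id] = {
                "status": "processing", "progress": 0,
                "result": None, "error": None,
                "created_at": datetime.now().isoformat(),
            }

    def update(self, task_id, **fields):
        with self._lock:
            self._tasks[task_id].update(fields)

    def snapshot(self, task_id):
        # Return a copy so callers never hold a reference to shared state
        with self._lock:
            return dict(self._tasks[task_id])

def run_in_background(store, task_id, work):
    """Run `work()` in a daemon thread and record its outcome in the store."""
    def target():
        try:
            store.update(task_id, status="completed", progress=100, result=work())
        except Exception as exc:
            store.update(task_id, status="failed", error=str(exc))
    store.create(task_id)
    thread = threading.Thread(target=target, daemon=True)
    thread.start()
    return thread

store = TaskStore()
worker = run_in_background(store, "demo", lambda: sum(range(10)))
worker.join(timeout=5)  # a web client would poll snapshot() instead of joining
final_state = store.snapshot("demo")
```

In a web setting, a status route would simply return `store.snapshot(task_id)` as JSON each time the browser polls.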

Automatic Resource Cleanup:

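import shutil

def auto_cleanup_old_projects(projects):
    """Delete oldest 20 projects when count reaches 50"""
    # `projects`: list of dicts with 'ctime' and 'path' keys
    # (passed in here so the excerpt is self-contained)
    total_projects = len(projects)
    if total_projects >= 50:
        cycles = total_projects // 50
        to_delete = cycles * 20
        # Sort by creation time, delete oldest
        projects.sort(key=lambda x: x['ctime'])
        for project in projects[:to_delete]:
            shutil.rmtree(project['path'])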
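The arithmetic of the cleanup rule above can be checked in isolation, without touching the file system. The helper names `cleanup_count` and `select_for_deletion` are hypothetical; the threshold of 50 and batch of 20 are the values stated in the excerpt.

```python
def cleanup_count(total_projects, threshold=50, batch=20):
    """Number of oldest projects the rule would delete for a given total."""
    if total_projects < threshold:
        return 0
    return (total_projects // threshold) * batch

def select_for_deletion(projects, threshold=50, batch=20):
    """Return the oldest projects that the rule would remove (no deletion here)."""
    count = cleanup_count(len(projects), threshold, batch)
    return sorted(projects, key=lambda p: p["ctime"])[:count]

# Worked values: below 50 nothing is deleted, 50-99 deletes 20, 100-149 deletes 40
counts = {n: cleanup_count(n) for n in (49, 50, 99, 100)}
```

Separating the selection from the deletion makes the policy easy to unit-test before any directory is removed.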

ColabFold Integration:

import subprocess

# X1 model uses isolated 'plaszyme_gpu' conda environment
# Prevents dependency conflicts between V1/X1 models
subprocess.run([
    str(PLASZYME_GPU_PYTHON),  # Dedicated Python interpreter
    str(prediction_script),
    '--pdb', pdb_file,
    '--output', output_dir
], timeout=600)
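A call like the one above raises `subprocess.TimeoutExpired` when the limit is exceeded, so the caller needs to handle that case along with non-zero exit codes. The wrapper below is a sketch under that assumption; `run_isolated` is a hypothetical name, and the current interpreter stands in for the dedicated `plaszyme_gpu` one.

```python
import subprocess
import sys

def run_isolated(python_exe, script_args, timeout=600):
    """Run a script with a dedicated interpreter; return (ok, message)."""
    try:
        completed = subprocess.run(
            [str(python_exe), *map(str, script_args)],
            capture_output=True, text=True, timeout=timeout,
        )
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child before raising
        return False, f"timed out after {timeout}s"
    if completed.returncode != 0:
        return False, completed.stderr.strip() or f"exit code {completed.returncode}"
    return True, completed.stdout.strip()

# Demonstration: sys.executable stands in for the isolated environment's Python
ok, message = run_isolated(sys.executable, ["-c", "print('done')"], timeout=30)
```

Returning a status and message, rather than letting the exception propagate, lets a background task record a clear failure reason for the user.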

Deployment & Infrastructure

AWS Deployment Architecture

Plaszyme Model Platform:
- Instance Type: AWS EC2 g4dn.xlarge (1 NVIDIA T4 GPU, 16GB GPU memory, 4 vCPUs, 16GB RAM)
- Operating System: Ubuntu 20.04 LTS
- Web Server: Gunicorn (4 workers) + Nginx reverse proxy
- SSL/TLS: Let’s Encrypt certificate with automatic renewal
- Domain: http://plaszyme.org/plaszyme
- Storage: 100GB EBS SSD for model weights and results cache

PlaszymeDB Database Platform:
- Instance Type: AWS EC2 t3.medium (2 vCPUs, 4GB RAM)
- Operating System: Ubuntu 20.04 LTS
- Web Server: Apache 2.4 + PHP 7.4-FPM
- Database: MySQL 8.0 (local, not RDS for cost optimization)
- Domain: http://plaszyme.org/plaszymedb
- Storage: 50GB EBS SSD for database and PDB files

Load Balancing: Both platforms currently run on single instances; they could be scaled horizontally in the future behind an AWS Elastic Load Balancer.

Performance Optimization

Model Prediction:
- GPU acceleration reduces PlaszymeAlpha inference time from ~30s/100 sequences (CPU) to ~10s (GPU)
- ESM model caching: Pre-load ESM-1b weights on server startup to avoid repeated downloads
- Batch processing: Group sequences into batches of 4 for optimal GPU utilization
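The batching step mentioned above amounts to slicing the input list into groups of four before embedding. A minimal sketch, with `batched` as a hypothetical helper name and made-up sequences:

```python
def batched(items, batch_size=4):
    """Yield consecutive slices of `items` with at most `batch_size` elements."""
    if batch_size < 1:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

# Ten (label, sequence) pairs split into groups of four
sequences = [(f"seq{i}", "MKT" * (i + 1)) for i in range(10)]
batch_sizes = [len(batch) for batch in batched(sequences)]
```

The final batch is simply shorter when the number of sequences is not a multiple of the batch size.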

Database Queries:
- MySQL query caching enabled (256MB cache)
- Composite indexes on frequently filtered columns (plastic + ec_number)
- LIMIT 50 on search results to reduce data transfer and rendering time
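The effect of a composite index combined with a result cap can be illustrated with Python's built-in sqlite3 module as a stand-in for MySQL. The table name, column names, and sample rows below are assumptions for the sketch, not the PlaszymeDB schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE enzymes (id INTEGER PRIMARY KEY, name TEXT, plastic TEXT, ec_number TEXT)"
)
# Composite index on the two columns that are filtered together
conn.execute("CREATE INDEX idx_plastic_ec ON enzymes (plastic, ec_number)")
conn.executemany(
    "INSERT INTO enzymes (name, plastic, ec_number) VALUES (?, ?, ?)",
    [
        (f"enzyme_{i}", "PET" if i % 2 == 0 else "PLA",
         "3.1.1.101" if i % 4 == 0 else "3.1.1.74")
        for i in range(400)  # synthetic rows
    ],
)

def search(connection, plastic, ec_number, limit=50):
    """Return at most `limit` enzyme names matching both filters."""
    rows = connection.execute(
        "SELECT name FROM enzymes WHERE plastic = ? AND ec_number = ? "
        "ORDER BY id LIMIT ?",
        (plastic, ec_number, limit),
    ).fetchall()
    return [row[0] for row in rows]

hits = search(conn, "PET", "3.1.1.101")
# Ask the planner how it would execute the filtered lookup
plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT name FROM enzymes WHERE plastic = ? AND ec_number = ?",
    ("PET", "3.1.1.101"),
).fetchall()
```

Here 100 synthetic rows match the filter, but only the first 50 are returned, and the query plan can be inspected to confirm that the composite index is used.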

3D Visualization:
- Mol* viewer lazy-loading: Initialize only when structure tab is clicked
- PDB file compression: Gzip compression reduces transfer size by ~70%
- CDN for Mol* library assets (reduces load time from ~2s to ~0.3s)
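The gzip step listed above can be reproduced with Python's standard library. The sketch uses synthetic ATOM records, so the measured saving is only indicative and will differ for real PDB files; `compress_pdb` and `decompress_pdb` are hypothetical helper names.

```python
import gzip

def compress_pdb(pdb_text):
    """Gzip-compress PDB text for transfer."""
    return gzip.compress(pdb_text.encode("utf-8"))

def decompress_pdb(blob):
    """Restore the original PDB text from a gzip blob."""
    return gzip.decompress(blob).decode("utf-8")

# Synthetic ATOM records, only to have something PDB-like to compress
lines = [
    f"ATOM  {i:5d}  CA  ALA A{i:4d}    {i * 1.5:8.3f}{i * 0.5:8.3f}{i * 0.25:8.3f}  1.00 20.00           C"
    for i in range(1, 501)
]
pdb_text = "\n".join(lines) + "\n"
blob = compress_pdb(pdb_text)
saving = 1 - len(blob) / len(pdb_text.encode("utf-8"))
```

In production the same effect is usually obtained by enabling gzip in the web server, so the application code does not need to compress files itself.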

Frontend Optimization:
- Minified JavaScript/CSS bundles
- AJAX pagination for large result sets
- Debounced search input (300ms delay) to reduce API calls

Technical Innovations

Dual-Model Integration: Seamless switching between sequence-based (PlaszymeAlpha) and structure-based (PlaszymeX) predictors based on input type, using isolated conda environments to prevent dependency conflicts.

Automated Structure Prediction: Eliminated manual ColabFold execution—users submit sequences, system automatically generates structures for X1 model inference.

Custom BLAST Implementation: Pure PHP sequence similarity algorithm enables BLAST-like searches without external binary dependencies, simplifying deployment on shared hosting environments.

Asynchronous Task Management: Background thread architecture prevents UI blocking during long-running predictions (PDB structure generation ~5-10 minutes).

Intelligent Resource Management: Automatic cleanup mechanism (delete 20 oldest projects when count reaches 50) prevents server storage exhaustion.

Lessons Learned

Design Insights

User-Centered Design is Non-Negotiable: Early user interviews revealed that computational barriers (installation, dependencies) were greater obstacles than scientific complexity. Prioritizing web deployment from the start would have accelerated adoption.

“Less is More” Requires Backend Complexity: Simplifying user workflows (e.g., automatic structure prediction) necessitates sophisticated backend orchestration (isolated environments, task queuing, error handling). The trade-off is worthwhile—users need not understand underlying technical details.

Transparency Builds Trust: Providing real-time task status, confidence scores, and algorithm explanations helps users trust AI predictions, especially in high-stakes research contexts.

Technical Learnings

Conda Environment Isolation is Critical: Initially, V1 and X1 models shared dependencies, causing version conflicts (e.g., PyTorch 1.13 vs. 2.0). Complete isolation via separate environments eliminated instability.

Custom BLAST Trade-offs: PHP implementation is ~8x slower than NCBI BLAST but avoids binary dependencies and simplifies deployment. For databases >10,000 entries, migrating to BLAST+ would be necessary.

Cloud Cost Optimization: GPU instances (g4dn.xlarge) cost ~$0.50/hour; using CPU for PlaszymeAlpha and GPU only for X1 reduced costs by 40% while maintaining acceptable performance.

Future Directions

API for Programmatic Access: Enable command-line tool users to submit jobs via REST API (currently web-only).

Enhanced Interpretability: Integrate attention visualization for ESM embeddings to show which sequence regions contribute most to predictions.

Multi-Plastic Expansion: Extend prediction targets beyond current 15 plastic types to include emerging materials (e.g., bioplastics, composites).

Community Contribution Portal: Allow users to submit newly discovered enzymes to PlaszymeDB after literature validation.

Conclusion

The development of the Plaszyme model platform and PlaszymeDB database represents a successful transition from specialized research tools to accessible, production-grade web applications. By integrating advanced AI models (ESM-1b, graph neural networks), automated workflows (ColabFold structure prediction), and user-friendly interfaces, we have created a “one-stop” platform that serves both computational and experimental researchers.

Key achievements include:

(1) dual-model architecture supporting both sequence- and structure-based predictions, (2) custom BLAST implementation for dependency-free deployment, (3) asynchronous task management for long-running computations, and (4) comprehensive database with 749 curated enzyme entries and integrated 3D visualization.

Explore the full interactive platform at: http://plaszyme.org/plaszyme/

References

Letunic, I., & Bork, P. (2024). Interactive Tree of Life (iTOL) v6: Recent updates to the phylogenetic tree display and annotation tool. Nucleic Acids Research, 52(W1), W78–W82. https://doi.org/10.1093/nar/gkae268

Altschul, S. F., Gish, W., Miller, W., Myers, E. W., & Lipman, D. J. (1990). Basic local alignment search tool. Journal of Molecular Biology, 215(3), 403–410. https://doi.org/10.1016/S0022-2836(05)80360-2
