SPADE: An Intelligent Software Platform for Antimicrobial Peptide Discovery

Project Background

Antimicrobial peptide research depends on scattered publications, heterogeneous databases and ad‑hoc spreadsheets, making routine questions—what sequences have evidence against a target organism, what properties matter for a design constraint, what toxicity signals are reported—slow to answer and hard to reproduce across teams. SPADE was started to gather dependable AMP data with provenance into a single entry point, normalize key fields into a consistent schema, and provide a simple path from query to inspection to interpretation.

The scope reflects day‑to‑day needs: a curated index cross‑referenced from peer‑reviewed literature, a workflow that stays within three surfaces (Search, Peptide Card, AMP Visualization), and transparent scoring governed by program/weights_config.yaml so efficacy, stability, synthesis and toxicity can be reviewed with the same assumptions across projects. Updates are reproducible and auditable, and users can see how the dataset evolves with each release.

Operations favor reliability with low maintenance overhead: a static frontend and a lightweight Flask backend keep the attack surface small and make deployments predictable; hosting on Netlify provides global CDN distribution, automatic HTTPS and preview environments for reviewing changes before publication, while SPADE's architecture remains provider‑agnostic and centered on maintainability, security and retrieval efficiency.

Project Vision and Motivation

SPADE serves researchers and industry R&D teams who need dependable antimicrobial peptide (AMP) data that can be queried quickly and applied directly in discovery workflows, bringing curated entries, clear structures and practical filters together so everyday questions can be answered without wrestling with format inconsistencies or scattered sources.

The index covers more than 39,000 peptides, cross‑referenced from PubMed and peer‑reviewed literature, and consolidated into a single schema that preserves provenance while standardizing key fields to support consistent queries.

Data quality is maintained through a streamlined yet rigorous pipeline that combines human judgment and automation. Records are cross‑checked across sources and harmonized by manual curation when fields conflict. Duplicates are removed by exact sequence and normalized name matching, with additional near‑duplicate flags from edit‑distance checks. Each release undergoes stratified random sampling to verify correctness and relevance. We publish per‑release quality indicators (manual cross‑check counts, automated deduplication rate, and sampling coverage) together with the index so users can see how the dataset is maintained over time.
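The deduplication step can be illustrated with a minimal Python sketch. It is not the SPADE pipeline itself: the Record type, the name normalization rule, the choice to treat a match on either key as a duplicate, and the distance threshold of 1 are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Record:
    # Hypothetical minimal record; the real schema has many more fields.
    name: str
    sequence: str


def normalize_name(name: str) -> str:
    """Lower-case and keep only letters and digits (assumed normalization rule)."""
    return "".join(ch for ch in name.lower() if ch.isalnum())


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed with a rolling one-row table."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def deduplicate(records, max_distance=1):
    """Drop exact duplicates, then flag near-duplicate pairs among the survivors."""
    seen_sequences, seen_names, kept = set(), set(), []
    for rec in records:
        seq = rec.sequence.upper()
        name = normalize_name(rec.name)
        if seq in seen_sequences or name in seen_names:
            continue  # exact sequence match or normalized-name match
        seen_sequences.add(seq)
        seen_names.add(name)
        kept.append(rec)
    # Near-duplicates are only flagged for review, never removed automatically.
    flags = [
        (a.name, b.name)
        for i, a in enumerate(kept)
        for b in kept[i + 1:]
        if edit_distance(a.sequence.upper(), b.sequence.upper()) <= max_distance
    ]
    return kept, flags


# Toy records, not taken from the SPADE index.
records = [
    Record("Peptide A", "GIGKFLHSAK"),
    Record("peptide-a", "GIGKFLHSAK"),  # exact duplicate of Peptide A
    Record("Peptide B", "GIGKFLHSAR"),  # one substitution away from Peptide A
    Record("Peptide C", "KKLLKKLLKK"),
]
kept, flags = deduplicate(records)
```

The pairwise edit-distance pass is quadratic in the number of kept records, so a production pipeline over tens of thousands of peptides would likely need blocking or indexing first; that part is outside this sketch.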

To match common usage habits, the website supports Chinese, English, Japanese, German and Spanish, and the search interface keeps interactions consistent with a left‑side panel for filters and a results table that updates immediately as conditions change; filters name the fields users care about—ID, name, sequence length, activity type and target organism, physicochemical attributes such as net charge and hydrophobicity, and predicted structural features—so typical queries can be expressed in one pass.

A typical combination might target antifungal activity against Candida albicans with net charge above +2 and a hydrophobicity range suited to membrane interaction, and the operational path keeps analysis close to the task by setting conditions on the Search page, reading the updated results table, and opening the peptide's card to examine consolidated details in a single view.
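As a rough illustration of how such a combined filter behaves, the sketch below applies the same conditions to a few toy records in Python. The field names, the example IDs, and the hydrophobicity bounds (0.3 to 0.6 on an unspecified scale) are hypothetical; the actual Search page runs its filtering in the browser.

```python
def matches(peptide, activity, organism, min_charge, hydro_range):
    """Return True when a record satisfies every condition at once."""
    low, high = hydro_range
    return (
        activity in peptide["activities"]
        and organism in peptide["targets"]
        and peptide["net_charge"] > min_charge  # strictly above the threshold
        and low <= peptide["hydrophobicity"] <= high
    )


# Toy records with hypothetical field names and values.
peptides = [
    {"id": "EX_001", "activities": ["antifungal"],
     "targets": ["Candida albicans"], "net_charge": 4, "hydrophobicity": 0.45},
    {"id": "EX_002", "activities": ["antibacterial"],
     "targets": ["Escherichia coli"], "net_charge": 5, "hydrophobicity": 0.50},
    {"id": "EX_003", "activities": ["antifungal"],
     "targets": ["Candida albicans"], "net_charge": 2, "hydrophobicity": 0.40},
    {"id": "EX_004", "activities": ["antifungal", "antibacterial"],
     "targets": ["Candida albicans"], "net_charge": 3, "hydrophobicity": 0.70},
]

hits = [
    p["id"]
    for p in peptides
    if matches(p, "antifungal", "Candida albicans",
               min_charge=2, hydro_range=(0.3, 0.6))
]
```

Only EX_001 passes here: EX_003 fails because its charge is not strictly above +2, and EX_004 falls outside the assumed hydrophobicity range.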

A Guided Tour of the SPADE Platform

The website follows a workflow that aligns with how people actually use AMP data: you begin on the Home page to understand the modules and reach the key entry points, you set conditions on the Search page with a left‑side filter panel and watch the results table update as you refine queries, and you open a peptide's card to read consolidated details in one place so the transition from overview to inspection does not interrupt the task.

The Home page introduces the platform and provides direct navigation to Search, Peptide Card, AMP Visualization, Tools and Statistics; images on this page are placeholders that will be replaced later.

Figure 1: Home Page

The Search page implements multi‑dimensional filtering over a curated index of more than 39,000 peptides, with fields that match common habits—ID, name, sequence length, activity type and target organism, physicochemical attributes such as net charge and hydrophobicity, and predicted structural features—so typical queries can be expressed in one pass and the table of results remains the primary data display for scanning and selection.

Figure 2: Search Page

The Peptide Card presents a single peptide in a structured view that brings together sequence, derived properties, measured activities, predicted structure, and literature references; the card is designed as the second data display page in the workflow, and clicking from the results table to the card keeps context while shifting from list‑level screening to item‑level verification.

Below is a live interactive view of the SPADE peptide card interface showing sample peptide SPADE_N_00001. You can scroll within the embedded interface to explore different information categories including general information, activity data, structural details, and physicochemical properties.

Figure 3: Interactive Peptide Card Interface

AMP Visualization focuses on scoring views that aid interpretation, exposing plots such as hydrophobicity profiles and simple 2D structure predictions alongside a score summary of how a peptide aligns with desired properties. Score composition is transparent and follows program/weights_config.yaml: efficacy, stability, synthesis and toxicity contribute at weighted proportions, with sub‑weights covering MIC with a synergy bonus, half‑life, pH/thermal stability, protease sensitivity, disulfide count, sequence length and rare amino acids, and toxicity indicators such as Boman score, cytotoxicity and hemolysis.
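To make the idea of a weighted score summary concrete, here is a small Python sketch. The weight values are invented for illustration and are not the contents of program/weights_config.yaml; a plain dictionary stands in for the parsed YAML, and each category score is assumed to be normalized to the range 0 to 1 with higher meaning better (so the toxicity entry is really a safety score).

```python
# Hypothetical top-level weights; the real values live in the YAML file.
WEIGHTS = {"efficacy": 0.40, "stability": 0.25, "synthesis": 0.15, "toxicity": 0.20}


def score_summary(category_scores, weights=WEIGHTS):
    """Return the overall score and the contribution of each category.

    Weights are renormalized so the result stays in [0, 1] even if the
    configured weights do not sum to exactly one.
    """
    total_weight = sum(weights.values())
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive number")
    contributions = {
        name: (weight / total_weight) * category_scores[name]
        for name, weight in weights.items()
    }
    return sum(contributions.values()), contributions


# Toy category scores for a single peptide.
overall, parts = score_summary(
    {"efficacy": 0.8, "stability": 0.5, "synthesis": 1.0, "toxicity": 0.9}
)
```

Returning the per-category contributions alongside the total is what allows a view to show how the score is composed rather than only the final number.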

Below is a live interactive view of the SPADE AMP Visualization interface for peptide SPADE_N_00001. You can scroll within the embedded interface to explore scoring visualizations, hydrophobicity profiles, and structural predictions.

Figure 4: Interactive AMP Visualization Interface

The Tools page links to internal scripts and selected external utilities for alignment, structure prediction and toxicity analysis, and the Statistics page offers interactive charts that summarize global distributions—sequence lengths, activity types and target organisms—so dataset shape can be reviewed before or after focused filtering.

Figure 5: Tools Page + Statistics Page

User Guide: Getting Started with SPADE

Search & Filtering

Set filters on the left‑side panel; the results table updates instantly as you refine.

Peptide Information

Open the Peptide Card to read sequence, properties, activities, structure and references in one consolidated view.

Scoring Visualization

AMP Visualization shows more detailed information and a score summary controlled by program/weights_config.yaml.

Glossary

results table: the main list view on the Search page that updates instantly with filters.
left‑side filter panel: the control area on the Search page for Basic, Activity, Structure conditions.
Peptide Card: the consolidated single‑peptide view with sequence, properties, activity, structure, references.
AMP Visualization: the page for hydrophobicity profiles, simple structure predictions, and the score summary.
score summary: the weighted overview of efficacy, stability, synthesis, and toxicity following program/weights_config.yaml.
Flask backend: the lightweight server that handles specific processing tasks behind the static frontend.
program/weights_config.yaml: the configuration file specifying weights and ranges used in scoring and visualization.

Design Philosophy and Architectural Overview

SPADE favors maintainability through a simple, modular architecture: the frontend is a static site (HTML/CSS/vanilla JS) and specific processing tasks are handled by a lightweight Flask backend. Pages are loosely coupled, and data, UI, and configuration are separated into predictable file boundaries—for example, the curated index under data/index, result artifacts under data/result, and scoring parameters in program/weights_config.yaml—so new features can be added by introducing a page and wiring it into navigation without disturbing existing modules.

Management and retrieval efficiency are achieved by a pre‑compiled JSON index (data/index/peptide_index.json) combined with client‑side filtering and incremental loading. Large datasets are delivered in chunks (see js/chunks/) to keep first paint fast and reduce memory pressure, and the Search page updates results immediately as conditions change so screening stays on the same surface. Index generation and optimization scripts (program/generate_extended_index.py, program/optimize_data.py) keep refreshes reproducible; outputs are versioned under data/result to simplify rollbacks and audits.
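The chunked delivery idea can be sketched as follows. Python is used here only for illustration, since the real filtering runs client-side in JavaScript; the record fields, the chunk size, and the use of in-memory JSON strings in place of chunk files are assumptions.

```python
import json


def split_into_chunks(records, chunk_size):
    """Split a list of records into consecutive chunks of at most chunk_size."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    return [records[i:i + chunk_size] for i in range(0, len(records), chunk_size)]


def iter_matches(serialized_chunks, predicate):
    """Yield matching records while parsing only one chunk at a time."""
    for payload in serialized_chunks:
        for record in json.loads(payload):
            if predicate(record):
                yield record


# Toy index: ten records with a hypothetical "length" field.
records = [{"id": i, "length": 10 + i} for i in range(10)]
chunks = split_into_chunks(records, chunk_size=4)

# Each JSON string stands in for one chunk file served to the browser.
serialized = [json.dumps(chunk) for chunk in chunks]
long_ids = [r["id"] for r in iter_matches(serialized, lambda r: r["length"] >= 17)]
```

Because iter_matches is a generator, the first matches are available before later chunks have been parsed, which is the property that keeps the first results quick to display.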

Security is addressed by minimizing the attack surface and constraining server responsibilities. The static frontend avoids long‑lived servers at the edge, sensitive operations are limited to well‑scoped endpoints, and common abuse patterns are mitigated by protective scripts where appropriate. SPADE is deployed on Netlify to inherit CDN distribution and automatic HTTPS, with preview environments useful for reviewing changes before publication; the provider's role is operational and supporting, while SPADE's design choices remain the primary basis for security and reliability.

AMP-Oriented Multi-task Model

Overview

Model Background

Antimicrobial peptides (AMPs) are small polypeptides produced by the immune systems of organisms. Unlike traditional antibiotics, AMPs mainly act by physically disrupting bacterial cell membranes, which causes the cell contents to leak out and kills the cell. This mechanism of action makes it difficult for bacteria to develop resistance. In addition to broad-spectrum activity (they can be effective against bacteria, fungi, viruses, and parasites), AMPs can also regulate the body's immune response and promote wound healing. Because they are highly effective and less prone to inducing resistance, AMPs have shown great application potential in medicine, agriculture, and food preservation, and have become a focus of research on new anti-infective drugs.

Motivation

Although antimicrobial peptides (AMPs) hold tremendous therapeutic promise, their clinical translation remains hindered by multiple challenges. The central difficulty lies in balancing efficacy and safety—many natural AMPs exhibit strong antibacterial activity but also significant cytotoxicity and hemolysis toward human cells. Furthermore, they are prone to rapid proteolytic degradation, resulting in poor stability and short half-lives. Large-scale synthesis is costly, and traditional experimental screening methods struggle to efficiently identify candidates with desirable properties such as high potency, broad spectrum, and low toxicity. These factors collectively impede the transition of AMPs from laboratory research to clinical application.

To address these challenges, we established the SPADE Antimicrobial Peptide Database and built on it a systematically curated and standardized data foundation, AMP-Oriented Six tasks (AMPOS), which integrates key experimental indicators of activity, toxicity, and stability. Building upon this foundation, which is available on Hugging Face, we developed the AMP-Oriented Multi-task Model (AOMM), a unified neural network framework designed to predict multiple key properties of AMPs simultaneously, including antimicrobial spectrum, half-life, and hemolytic activity. We refer to this prediction setting as the AMP-oriented Multi-Property Prediction Task (AMPPT).

By enabling comprehensive computational evaluation of peptide properties, AOMM bridges the gap between fragmented experimental data and rational peptide design, offering an intelligent pathway to accelerate AMP discovery and optimization.

Highlight

The development of the AMP-Oriented Multi-task Model (AOMM) marks a significant step toward integrating computational intelligence with experimental research. By providing simultaneous quantitative predictions of key antimicrobial peptide properties, such as activity, stability, toxicity, and hemolysis, AOMM enables researchers to pre-screen candidate peptides before costly laboratory validation.

This approach greatly reduces experimental workload and resource consumption, accelerating the iteration cycle of wet-lab experiments while maintaining high predictive precision. Beyond serving as a screening tool, AOMM also functions as a design assistant engine, capable of guiding the directed evolution and optimization of antimicrobial peptides. Its multi-task and generalizable architecture further provides a foundation for expanding to other bioactive peptide domains, offering broad potential for future algorithmic innovation and peptide discovery.

Simple Usage Steps

Figure 6: Demo

Step 1: Prepare Your Peptide Sequence Data

Prepare a plain text file containing your peptide sequences.

For example:

Sequence: GFLGPLLKLAAKGVAKVIPHLIPSRGQ
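Before submitting, it can help to check the file locally. The Python sketch below reads sequences from plain text and rejects anything outside the 20 standard one-letter amino acid codes. Whether the platform itself accepts the "Sequence:" prefix, lower-case input, or non-standard residues is not stated here, so those handling choices are assumptions.

```python
STANDARD_AMINO_ACIDS = set("ACDEFGHIKLMNPQRSTVWY")


def parse_sequences(text):
    """Extract one peptide sequence per non-empty line and validate it."""
    sequences = []
    for line_number, line in enumerate(text.splitlines(), start=1):
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.lower().startswith("sequence:"):
            line = line.split(":", 1)[1].strip()  # drop the optional prefix
        sequence = line.upper()
        invalid = sorted(set(sequence) - STANDARD_AMINO_ACIDS)
        if invalid:
            raise ValueError(
                f"line {line_number}: unexpected characters {invalid}"
            )
        sequences.append(sequence)
    return sequences


parsed = parse_sequences("Sequence: GFLGPLLKLAAKGVAKVIPHLIPSRGQ\n")
```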

Step 2: Access Our Online Platform

Open the website: AOMM
Enter your peptide sequence into the input box.

Step 3: Select an Analysis Task

Choose the analysis type from the task panel:

AMP Identification – Predicts the probability of antimicrobial activity
Bioactivity Classification – Labels overall peptide bioactivity
Hemolysis Score – Estimates hemolytic potential
MIC Regression – Predicts minimum inhibitory concentration against multiple microorganisms
Half-life Regression – Estimates peptide stability (in minutes)

Step 4: Run and View Results

Click "Run Prediction."

The system will automatically process your sequence and generate a "Prediction Result" within minutes.

Technical Design Philosophy

Why Choose Multi-Task Learning?

In our framework, masked language modeling (MLM) functions analogously to a cloze test: by masking individual amino acids and asking the model to predict the masked residues, MLM fosters contextual understanding of peptide "language" and yields foundational sequence representations. Building on this foundation, the AMP classification task refines the model's ability to discriminate peptides with antimicrobial potential, providing discriminative features that underpin subsequent tasks. The bioactivity classification task then specifies the functional scope by identifying target microorganism categories (i.e., antimicrobial spectrum), thereby constraining the application domain. Finally, the regression tasks (hemolysis, MIC, and half-life) deliver fine-grained quantitative predictions that rely on the richer semantic and functional representations learned upstream.

Collectively, MLM → classification → bioactivity → regression establishes a hierarchical learning progression: low-level structural modeling produces general representations, mid-level classification achieves functional recognition, and high-level regression yields precise quantitative measures. Joint multi-task optimization enables cross-task information flow—improving robustness and generalization across all subtasks.

Figure 7: MLM as reading-comprehension (cloze) analogy
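The cloze-style setup can be sketched in a few lines of Python. This shows only how training inputs and targets might be prepared, not AOMM's actual preprocessing: the mask token string and the 15% masking rate (a common convention for masked language models) are assumptions.

```python
import random

MASK_TOKEN = "<mask>"  # hypothetical token name


def mask_sequence(sequence, mask_prob=0.15, rng=None):
    """Hide a random subset of residues and return (tokens, labels).

    labels maps each masked position to the residue the model should recover.
    At least one residue is always masked.
    """
    if not sequence:
        raise ValueError("sequence must not be empty")
    rng = rng or random.Random()
    tokens = list(sequence)
    n_mask = max(1, round(len(tokens) * mask_prob))
    positions = sorted(rng.sample(range(len(tokens)), n_mask))
    labels = {pos: tokens[pos] for pos in positions}
    for pos in positions:
        tokens[pos] = MASK_TOKEN
    return tokens, labels


# A fixed seed keeps the example reproducible.
tokens, labels = mask_sequence("GFLGPLLKLAAKGVAKVIPHLIPSRGQ", rng=random.Random(0))
```

During training, the model sees tokens and is scored on how well it predicts the residues stored in labels; the unmasked positions provide the context.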

Hierarchical Relationship

MLM (Context Understanding)
↓
AMP Classification (Functional Identification)
↓
Bioactivity Classification (Target Specificity)
↓
Regression Tasks (Quantitative Evaluation)

Key Technical Decisions:

1. Progressive Learning Strategy

Just as students advance from foundational courses to more specialized and applied subjects, our model follows a structured progression of learning tasks:

Foundational Understanding → Functional Recognition → Quantitative Reasoning
(Masked Sequence Modeling) → (Classification Tasks) → (Regression Predictions)

This curriculum-like design allows the model to first grasp the contextual patterns within antimicrobial peptide sequences, then learn to identify functional properties, and finally perform detailed quantitative evaluations.

Figure 8: AOMM Architecture

2. Knowledge Protection Mechanism

To prevent forgetting old knowledge when learning new tasks, we implemented:

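Elastic Weight Consolidation (EWC): Marks the parameters that matter most for earlier tasks (the important "knowledge points") and penalizes changes to them to prevent overwriting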
Experience Replay: Regularly review previously learned content
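For readers unfamiliar with EWC, the penalty term can be written out in a few lines of Python. This uses the standard quadratic form, (strength / 2) * sum of F_i * (theta_i - theta_star_i)^2, where F_i is an importance estimate (typically the diagonal of the Fisher information). The flat parameter lists, the importance values, and the strength are illustrative assumptions, not AOMM's training code.

```python
def ewc_penalty(params, anchor_params, importance, strength):
    """Quadratic EWC penalty for drifting away from previously learned parameters.

    params        -- current parameter values
    anchor_params -- values saved after the earlier task was learned
    importance    -- per-parameter importance (e.g. diagonal Fisher estimates)
    strength      -- how strongly old knowledge is protected
    """
    if not (len(params) == len(anchor_params) == len(importance)):
        raise ValueError("all inputs must have the same length")
    return 0.5 * strength * sum(
        f * (p - a) ** 2 for p, a, f in zip(params, anchor_params, importance)
    )


def total_loss(new_task_loss, params, anchor_params, importance, strength):
    """Loss on the new task plus the penalty that protects the old task."""
    return new_task_loss + ewc_penalty(params, anchor_params, importance, strength)


anchor = [1.0, 1.0]
importance = [5.0, 0.1]  # the first parameter matters far more for the old task

# Moving the unimportant parameter is cheap; moving the important one is costly.
cheap = ewc_penalty([1.0, 2.0], anchor, importance, strength=2.0)
costly = ewc_penalty([2.0, 1.0], anchor, importance, strength=2.0)
```

Experience replay is complementary: instead of constraining parameters, it mixes stored examples from earlier tasks into later training batches.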

3. Unified Sequence Encoder

All tasks share a common sequence encoding backbone, referred to as the Sequence Understanding Module. This unified encoder captures contextual and structural patterns within antimicrobial peptide sequences, providing consistent and transferable representations for all downstream tasks — from classification to regression.

Figure 9: Encoder Architecture

Design Trade-offs and Optimizations

Accuracy vs. Speed

Our Choice: Prioritize accuracy while maintaining reasonable computational efficiency.

Solution	Advantages	Disadvantages	Decision
Simple Model	Fast execution	Limited accuracy and generalization	❌ Not adopted
Complex Model	High prediction accuracy	Computationally intensive	✅ Optimized adoption
Balanced Solution	Trade-off between speed and precision	Requires parameter tuning	✅ Actually adopted

To ensure both precision and practicality, the adopted configuration employs a balanced architecture that maintains competitive accuracy without excessive resource consumption.

Generality vs. Specialization

Our Approach:

General Foundation: Pre-train on a large-scale antimicrobial peptide dataset to capture universal sequence representations.
Specialized Fine-tuning: Refine the model for specific downstream tasks (e.g., hemolysis regression, MIC prediction).
Continuous Learning: Integrate newly collected experimental data to iteratively improve task performance.

This staged design enables AOMM to combine broad generalization with task-specific optimization.

Memory Usage Optimization

To achieve optimal model performance within limited computational resources, several optimization strategies were implemented:

Layer Freezing: Only train task-relevant layers during fine-tuning to reduce memory consumption and overfitting.
Memory Cleanup: Regularly release cached variables and intermediate tensors to ensure efficient resource utilization.

Together, these techniques improve the efficiency and stability of multi-task learning, enabling AOMM to operate effectively even in constrained hardware environments.

System Architecture Overview

Overall Workflow

The overall pipeline of AOMM follows a hierarchical and modular structure, ensuring that peptide sequences are efficiently processed and accurately analyzed through each stage.

Peptide Sequence Input  →  Unified Sequence Encoder  →  Multi-Task Heads →  Final Predictions
        ↓                            ↓                        ↓                        ↓
     Raw Data         Contextual Representation   Task-specific Analysis    Quantitative Results

This workflow reflects the end-to-end design philosophy of AOMM:

Peptide Sequence Input: Users provide raw antimicrobial peptide sequences in text format.
Unified Sequence Encoder: The model encodes contextual dependencies within each sequence.
Multi-Task Heads: Specialized output layers handle classification and regression subtasks simultaneously.
Final Predictions: The system produces interpretable, quantitative results ready for experimental validation.

Core Components Explanation

1. Sequence Understanding Engine (AMP Encoder)

Function: Transforms antimicrobial peptide sequences into structured numerical representations suitable for downstream analysis.

Features: Utilizes a multi-head self-attention mechanism to capture both local residue dependencies and long-range contextual interactions within peptide sequences.

Innovation: Integrates rotary positional encoding to enhance the model's ability to preserve sequence order and structural integrity, improving representation quality for longer peptides.
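As background on rotary positional encoding, the Python sketch below rotates consecutive pairs of vector components by an angle that grows with position. It follows the commonly used interleaved-pair formulation with a frequency base of 10000; both choices are assumptions here and may differ from AOMM's encoder.

```python
import math


def rotary_encode(vector, position, base=10000.0):
    """Rotate each (even, odd) component pair by a position-dependent angle."""
    dim = len(vector)
    if dim % 2 != 0:
        raise ValueError("vector length must be even")
    rotated = []
    for i in range(0, dim, 2):
        # Lower-index pairs rotate quickly, higher-index pairs slowly.
        angle = position * base ** (-i / dim)
        cos_a, sin_a = math.cos(angle), math.sin(angle)
        x, y = vector[i], vector[i + 1]
        rotated.extend([x * cos_a - y * sin_a, x * sin_a + y * cos_a])
    return rotated


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


query = [1.0, 2.0, 3.0, 4.0]
key = [0.5, -1.0, 2.0, 0.0]

# The attention score depends only on the offset between positions (here 2).
score_near_start = dot(rotary_encode(query, 3), rotary_encode(key, 1))
score_further_on = dot(rotary_encode(query, 7), rotary_encode(key, 5))
```

The two scores agree up to floating-point error because rotations compose: the dot product of two rotated vectors depends on the difference of their angles, which is why this encoding captures relative order directly in attention scores.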

2. Task-Specific Analyzers (Task Heads)

Each predictive objective is handled by a dedicated task-specific module, ensuring optimal performance across heterogeneous tasks.

Classification Head: Processes binary or categorical decisions (e.g., AMP vs. non-AMP).
Regression Head: Predicts continuous biochemical properties (e.g., MIC values, half-life, hemolysis score).
Multi-label Head: Supports simultaneous prediction of multiple microbial targets or biological activities.

These specialized heads allow AOMM to flexibly adapt to diverse property prediction tasks while maintaining a unified learning backbone.

3. Training Coordination System

Responsible for synchronizing all learning processes and maintaining model stability throughout multi-task training.

Knowledge Retention: Employs continual learning mechanisms (e.g., EWC, replay) to prevent catastrophic forgetting.
Resource Optimization: Dynamically allocates computation to task-relevant components for efficient training.
Stable Convergence: Applies gradient clipping and adaptive learning rate scheduling to ensure smooth and reliable optimization.

This coordination system enables AOMM to achieve balanced, efficient, and robust performance across all subtasks.
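Gradient clipping, one of the stabilizers mentioned above, can be illustrated in plain Python. The sketch rescales a gradient vector when its overall norm exceeds a threshold; the threshold and the flat-list representation are assumptions, and deep learning frameworks provide their own utilities for this.

```python
import math


def clip_by_global_norm(gradients, max_norm):
    """Scale gradients down so their L2 norm does not exceed max_norm.

    Returns the (possibly rescaled) gradients and the norm before clipping.
    The direction of the update is preserved; only its length changes.
    """
    if max_norm <= 0:
        raise ValueError("max_norm must be positive")
    total_norm = math.sqrt(sum(g * g for g in gradients))
    if total_norm <= max_norm:
        return list(gradients), total_norm
    scale = max_norm / total_norm
    return [g * scale for g in gradients], total_norm


clipped, norm_before = clip_by_global_norm([3.0, 4.0], max_norm=1.0)
untouched, _ = clip_by_global_norm([0.3, 0.4], max_norm=1.0)
```

In the first call the norm is 5.0, so every component is scaled by 1/5; in the second call the norm is already below the threshold and the gradients pass through unchanged.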

Usage Recommendations

Download

git clone https://github.com/NeurEv0/Pure_Era.git

Common Application Scenarios:

Laboratory Research: Rapid screening of potential antimicrobial peptides
Drug Development: Evaluate characteristics of candidate drugs
Academic Research: Explore sequence-function relationships
Industrial Production: Optimize peptide product design

RAG System

Figure 10: RAG System Architecture

The Retrieval-Augmented Generation (RAG) system developed in this project is an intelligent search and reasoning framework specifically optimized for antimicrobial peptide (AMP) research. It bridges large-scale peptide databases and advanced language understanding models, enabling researchers to efficiently explore complex biological knowledge embedded within heterogeneous datasets.

1. Core Principle

RAG integrates two complementary components:

Retriever: Uses semantic embedding models to encode both user queries and peptide data into high-dimensional vectors. Through similarity search across the SPADE peptide database, it retrieves the most contextually relevant records — not just by keyword matching, but by conceptual and functional similarity.
Generator: Based on retrieved content, the generator reformulates and synthesizes comprehensive answers. It reorganizes information through domain-specific knowledge reordering, ensuring scientific accuracy and interpretability.
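The retrieval half can be sketched with cosine similarity over embedding vectors. In the Python example below, the three-dimensional vectors are hand-written stand-ins for what an embedding model would produce, and the record IDs and field names are hypothetical; the generator step is not shown.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return sum(x * y for x, y in zip(a, b)) / (norm_a * norm_b)


def retrieve(query_embedding, documents, top_k=2):
    """Return the top_k documents most similar to the query embedding."""
    ranked = sorted(
        documents,
        key=lambda doc: cosine_similarity(query_embedding, doc["embedding"]),
        reverse=True,
    )
    return ranked[:top_k]


# Toy embeddings; a real system would compute these with an embedding model.
documents = [
    {"id": "P1", "embedding": [0.9, 0.1, 0.0]},
    {"id": "P2", "embedding": [0.1, 0.9, 0.0]},
    {"id": "P3", "embedding": [0.7, 0.3, 0.1]},
]
query_embedding = [1.0, 0.2, 0.0]
top_ids = [doc["id"] for doc in retrieve(query_embedding, documents)]
```

The retrieved records would then be passed to the generator as context. A full scan like this is fine for a toy example, while larger collections usually rely on an approximate nearest-neighbor index.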

2. Key Features

Semantic Search: Understands queries at the meaning level (e.g., "AMPs active against Gram-negative bacteria with low hemolysis").
Natural Language Interaction: Supports full-sentence queries, lowering the technical barrier for users without bioinformatics backgrounds.
Multi-Database Integration: Simultaneously accesses curated AMP repositories (e.g., SPADE, APD3, DRAMP, DBAASP, LAMP) for cross-source analysis.
Intelligent Ranking: Prioritizes results by biological relevance, experimental validation strength, and contextual match.

3. System Advantages

Compared to traditional database search engines, RAG introduces:

Context Awareness: Goes beyond literal matching to infer functional relationships among sequences and targets.
Knowledge Generalization: Leverages pre-trained embeddings fine-tuned on peptide-specific corpora to handle incomplete or ambiguous inputs.
Dynamic Scalability: Automatically adapts to new data updates, ensuring continuous improvement in search precision.

4. Application Scenarios

Drug Discovery: Rapidly identifies candidate peptides with desired bioactivities, reducing the cost and time of early-stage screening.
Bioinformatics Research: Assists in literature review and hypothesis generation by linking peptide structure, activity, and mechanism data.
Experimental Design: Guides wet-lab researchers in selecting promising peptide templates for synthesis and validation.

In summary, the RAG system serves as the intelligent gateway to the SPADE database — combining semantic understanding, retrieval efficiency, and knowledge synthesis. It transforms peptide data exploration from manual searching to automated, intelligent discovery, greatly empowering both computational biologists and experimental researchers.

Summary

Figure 11: System Overview

The entire framework is built upon a progressive architecture that connects data integration, property prediction, and intelligent retrieval through three core components — SPADE, AOMM, and RAG.

1. SPADE: The Foundational Database

SPADE (Systematic Platform for Antimicrobial Peptide Database with Evaluation) serves as the foundational layer of the system. It integrates and standardizes data from multiple major AMP repositories (APD3, DRAMP, DBAASP, and LAMP), providing high-quality, deduplicated, and uniformly annotated peptide records. This unified data resource establishes the groundwork for model training, benchmark evaluation, and cross-task comparisons.

2. AOMM: The Predictive Intelligence

On top of SPADE's curated data, we developed AOMM (AMP-Oriented Multi-task Model), a deep learning framework designed to simultaneously predict multiple key peptide properties. Through a multi-task learning strategy, AOMM performs six interconnected tasks: sequence mask prediction, AMP classification, bioactivity identification, MIC regression, hemolysis regression, and half-life regression. It enables comprehensive evaluation of unknown peptides, thereby accelerating candidate screening and reducing experimental costs.

3. RAG: The Intelligent Retrieval System

To complement predictive modeling with intelligent information access, the RAG (Retrieval-Augmented Generation) system provides an advanced semantic retrieval interface. By leveraging embedding-based similarity search and domain-specific knowledge reordering, RAG allows users to perform natural language queries across the SPADE database. It can interpret complex research questions — such as identifying "low-hemolytic peptides active against Gram-negative bacteria" — and return contextually relevant sequences and annotations in seconds.

4. Integrated Value

Together, SPADE, AOMM, and RAG form a unified and self-reinforcing ecosystem:

SPADE provides structured, reliable data.
AOMM transforms that data into predictive insights.
RAG makes those insights accessible through intelligent retrieval.

This integration transforms antimicrobial peptide research from static data management into a dynamic, AI-driven discovery process — enabling efficient data utilization, rapid hypothesis generation, and cost-effective peptide development.

Getting Help

If you encounter any issues while using the system, please contact us at igem@xjtlu.edu.cn.

Our team will respond as soon as possible and provide timely technical support.