Contribution
We provide openly accessible data, models, software, and documentation to support future iGEM teams and relevant researchers working on antimicrobial peptides (AMPs) and ML-assisted peptide design. Below are our reusable contributions and how they benefit the community.
Our Contribution to the iGEM Community
As an exclusively dry lab team, we are committed to contributing practical resources that future iGEM teams can directly reuse and build upon. We developed the SPADE database, a large-scale and standardized AMP knowledge base that solves the issue of fragmented data. From this foundation, we curated the AMPOS benchmark dataset with six well-defined machine learning tasks, providing a shared standard for training and evaluation.
On top of these resources, we created the AOMM multitask prediction model, which achieves state-of-the-art performance across AMPOS subtasks, and a RAG-based intelligent retrieval system that supports natural-language queries with semantic understanding and relevance scoring. These tools offer both predictive insights and accessible literature-grounded evidence.
Beyond tool development, we prepared educational materials, tutorials, and outreach content to help iGEM teams and students with limited machine learning background quickly get started with computational peptide design.
Through these contributions, we aim to empower the iGEM community with open, reliable, and reusable data, models, and educational resources, accelerating innovation in synthetic biology.
Data Resources
SPADE Database (Standardized AMP Knowledge Base)
What it is
SPADE (Systematic Platform for Antimicrobial peptide Database with Evaluation) integrates >39,000 natural & modified AMPs from APD3 / DRAMP / DBAASP / LAMP with rigorous standardization (MIC unit harmonization, unified activity labels, consolidated metadata, duplicate resolution).
Why it's useful
Researchers often spend days piecing together AMP data from multiple sources, only to find inconsistent annotations. SPADE solves this by offering a single, standardized knowledge base that can be used directly for training models, performing sequence analysis, or designing experiments. It is also ready to support similarity search, motif discovery, feature engineering, and transfer learning.
What's inside (schema)
- Core fields: sequence, modifications, organism_target, MIC_value (standardized), assay_conditions (temp, medium, pH), activity_labels, toxicity/hemolysis, reference (DOI/PMID).
- Provenance fields: source_db, source_id, curation_timestamp, confidence_tag.
- Quality flags: outlier_check, unit_check, deduplication_group.
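To illustrate the MIC standardization behind the unit_check flag, here is a minimal Python sketch. The field names and the choice of µM as the harmonized unit follow the schema above, but the function and record layout are our illustration, not SPADE's actual pipeline code.

```python
def mic_ug_per_ml_to_uM(mic_ug_per_ml: float, mol_weight_da: float) -> float:
    """Convert a MIC from µg/mL to µM given the peptide's molecular weight (Da).

    µg/mL divided by g/mol, times 1000, gives µmol/L (µM).
    """
    if mol_weight_da <= 0:
        raise ValueError("molecular weight must be positive")
    return mic_ug_per_ml / mol_weight_da * 1000.0


def harmonize_record(record: dict) -> dict:
    """Fill MIC_value (µM) from whichever unit the source database reported."""
    out = dict(record)
    if record.get("MIC_unit") == "ug/mL":
        out["MIC_value"] = mic_ug_per_ml_to_uM(record["MIC_raw"], record["mol_weight"])
        out["MIC_unit"] = "uM"
    elif record.get("MIC_unit") == "uM":
        out["MIC_value"] = record["MIC_raw"]
    else:
        out["unit_check"] = "failed"  # quality flag, as in the schema above
    return out
```

For example, a 32 µg/mL MIC for a 1,600 Da peptide harmonizes to 20 µM.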
Access
Web: https://xjtlu-spade.netlify.app/
AMPOS Benchmark Dataset (AMP-Oriented Six Tasks)
What it is
From SPADE we distilled a benchmark dataset, AMPOS, that defines a set of machine-learning tasks relevant to AMP research. It currently contains six subtasks:
- Sequence Mask Prediction for language-model style pretraining
- AMP Classification (binary and multiclass)
- Bioactivity Classification
- Half-life Regression
- MIC Regression
- Hemolytic Activity Score Regression
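One convenient way to consume the six subtasks in code is a small task registry mapping each to its type and headline metric. The key names below are our illustration, not the dataset's official config identifiers.

```python
# Illustrative registry of the six AMPOS subtasks (key names are ours,
# not official dataset config identifiers).
AMPOS_TASKS = {
    "mask_prediction":    {"type": "pretraining",    "metric": "top1_accuracy"},
    "amp_classification": {"type": "classification", "metric": "roc_auc"},
    "bioactivity":        {"type": "classification", "metric": "roc_auc"},
    "half_life":          {"type": "regression",     "metric": "pearson_r"},
    "mic":                {"type": "regression",     "metric": "pearson_r"},
    "hemolysis":          {"type": "regression",     "metric": "mae"},
}


def tasks_of_type(kind: str) -> list[str]:
    """List subtask names of a given type, e.g. all regression heads."""
    return [name for name, cfg in AMPOS_TASKS.items() if cfg["type"] == kind]
```

A registry like this lets a training script attach the right loss and metric to each head without hard-coding task logic.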
Why it's useful
Most previous AMP models were trained on their own private datasets, which makes comparison difficult. AMPOS provides standardized splits, clear task definitions, and baseline models, so that results are directly comparable across teams. This benchmark makes it easier to reproduce experiments, test ablations, and apply transfer learning to specific targets such as MRSA.
Access
Hugging Face Dataset: https://huggingface.co/datasets/muskwff/AMP_six_tasks
Recommended evaluation protocol
- Splits: train/valid/test with sequence-identity-controlled partitioning (e.g., 40% identity threshold) to avoid train-test leakage.
- Metrics:
- Classification: ROC-AUC / PR-AUC / F1 (macro) / accuracy (micro)
- Regression: MAE / RMSE / Spearman ρ
- Reporting: mean ± std across k-fold or 3 seeds (42/73/2025)
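The identity-controlled partitioning above can be approximated with greedy single-linkage clustering. In this sketch, difflib's ratio stands in for a proper alignment-based identity (tools such as CD-HIT or MMseqs2 are the usual choice), so treat it as an illustration of the protocol, not our production splitter.

```python
from difflib import SequenceMatcher


def identity(a: str, b: str) -> float:
    """Crude sequence-identity proxy; real pipelines use alignment-based identity."""
    return SequenceMatcher(None, a, b).ratio()


def cluster_sequences(seqs: list[str], threshold: float = 0.4) -> list[list[str]]:
    """Greedy clustering: a sequence joins the first cluster whose
    representative it matches at >= threshold identity."""
    clusters: list[list[str]] = []
    for s in seqs:
        for c in clusters:
            if identity(s, c[0]) >= threshold:
                c.append(s)
                break
        else:
            clusters.append([s])
    return clusters


def split_by_cluster(seqs, threshold=0.4, frac_train=0.8):
    """Assign whole clusters to train or test, so near-duplicate sequences
    never straddle the split boundary."""
    clusters = cluster_sequences(seqs, threshold)
    train, test, n_target = [], [], frac_train * len(seqs)
    for c in clusters:
        (train if len(train) < n_target else test).extend(c)
    return train, test
```

The key property is that similar sequences land in the same cluster and therefore on the same side of the split, which is what prevents leakage.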
What future teams can add
- New subtasks (e.g., antifungal spectrum prediction, resistance-evolution risk)
- New labels (e.g., stability in serum, protease susceptibility)
- Data from specific organisms or environments
Model & Tools
AOMM (AMP-Oriented Multi-task Model)
What it is
AOMM is a neural network trained on AMPOS that predicts multiple peptide properties at once. Unlike single-task models that focus only on activity, AOMM evaluates antimicrobial activity, hemolytic risk, and stability together. This integrated approach provides a more complete computational profile for each peptide and achieves state-of-the-art results on all AMPOS subtasks.
Why it's useful
As a computational framework, AOMM offers standardized benchmarks and ready-to-use baselines for future teams. It allows researchers to quickly reproduce our experiments, extend them to new organisms or peptide types, and fairly compare their own models. Instead of starting from scratch, teams can build on a tested architecture and focus on innovating their own designs.
Access
- Hugging Face Model: https://huggingface.co/muskwff/amp4multitask_124M
- Inference examples & fine-tuning tips: within model card and repo
Reproducibility essentials
- Model size/params: 124M
- Tokenizer / encoding: ESM2-based amino acid level tokenization
- Training details:
- Optimizer: AdamW
- Learning rate: 5e-5 (pretrain), 2e-5 (multi-task)
- Batch size: 64
- Max epochs: 100 (with early stopping)
- Hardware & time: NVIDIA GPU, varies by task
- Seed control: 42 / 73 / 2025
- Best checkpoints & early stopping: patience=5 epochs
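The early-stopping rule above (patience = 5 on validation loss, best checkpoint kept) is framework-agnostic; here is a minimal sketch of that logic, independent of any particular training library and not tied to our actual training code.

```python
class EarlyStopping:
    """Stop training once validation loss has not improved for `patience` epochs."""

    def __init__(self, patience: int = 5, min_delta: float = 0.0):
        self.patience = patience      # epochs without improvement before stopping
        self.min_delta = min_delta    # minimum decrease that counts as improvement
        self.best = float("inf")
        self.bad_epochs = 0
        self.best_epoch = -1          # epoch whose checkpoint should be kept

    def step(self, val_loss: float, epoch: int) -> bool:
        """Record one epoch's validation loss; return True if training should stop."""
        if val_loss < self.best - self.min_delta:
            self.best, self.best_epoch, self.bad_epochs = val_loss, epoch, 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience
```

The checkpoint saved at best_epoch is the one reported, which is how "best checkpoints & early stopping" interact.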
Performance snapshot
(mean across test split; see detailed results in our AOMM page)
| Task | Metric | AOMM 124M | ESM2 150M |
|---|---|---|---|
| Sequence Mask Prediction | Top-1 Accuracy | 0.8736 | 0.3300 |
| AMP Classification | ROC-AUC | 0.9951 | - |
| MIC Regression (overall) | Pearson r | 0.7000 | - |
| Hemolysis Regression | MAE | 0.1061 | - |
| Half-life Regression | Pearson r | 0.9851 | - |
RAG-based Intelligent Retrieval
What it is
We built a Retrieval-Augmented Generation (RAG) engine on top of SPADE to make peptide searching faster and smarter. Instead of simple keyword matching, our RAG system uses embeddings and cosine similarity, combined with an AMP-specific terminology library. This allows it to understand biological context in a way that traditional search tools cannot.
Why it's useful
A researcher can ask natural questions such as:
"Find peptides active against MRSA, with low hemolysis risk, and shorter than 20 amino acids."
The system interprets the request, searches the curated database, and returns relevant candidates along with a relevance score. Each result is linked back to experimentally validated literature data. This makes the retrieval both convenient and trustworthy.
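At its core, the ranking step reduces to embedding the query and the database entries into the same vector space and sorting by cosine similarity. The toy vectors below stand in for the actual embedding model's output, and the function names are ours for illustration.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def rank_entries(query_vec, entries, top_k=3):
    """Return (peptide_id, relevance_score) pairs, best first.

    `entries` maps peptide IDs to precomputed embedding vectors; in the real
    system these would come from the embedding model indexed over SPADE.
    """
    scored = [(pid, cosine(query_vec, vec)) for pid, vec in entries.items()]
    scored.sort(key=lambda t: t[1], reverse=True)
    return scored[:top_k]
```

The relevance score returned to the user is exactly this similarity, which keeps the ranking transparent and easy to audit against the cited literature.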
Access
- Integrated into the SPADE web UI: https://xjtlu-spade.netlify.app/
- API endpoints available (see documentation in GitHub repo)
Community & Impact
Educational & Social Impact
We believe contribution is not only about building tools, but also about empowering people to use them responsibly. To lower the barrier for new iGEM teams and students, we prepared introductory slide decks and demo scripts that explain both the basics of antimicrobial peptides and practical steps to use SPADE, AMPOS, and AOMM. These materials are designed for high-school and undergraduate outreach as well as team onboarding, ensuring that even those with limited computational biology background can get started quickly.
At the same time, we actively considered the biosafety and ethical aspects of AI in biology. Our dataset, model, and code are released under permissive but clearly non-clinical licenses. We emphasize in all documentation that predictions require wet-lab validation and should not be misused in medical contexts. By doing so, we aim to encourage responsible and safe application of machine learning in synthetic biology.
Through these efforts, we hope to make computational peptide research both more accessible and more responsible, ensuring that innovation goes hand-in-hand with safety and ethics.
Community Engagement
We exchanged feedback with the BIT-LLM team, who told us that the platform demonstrates solid technical depth and a clear understanding of user needs. They highlighted three key areas for refinement: (1) enhancing the search and filtering architecture through autocomplete, advanced filtering, and full-text keyword search to accelerate access to specific mutations or activity data; (2) improving result presentation and language accessibility with a customizable summary view and a one-click Chinese interface to balance data richness and readability; and (3) introducing a structured onboarding module, such as a concise "Sequence-to-Activity in 5 Minutes" tutorial, to lower the learning curve for new users while empowering experienced ones. Overall, the feedback recognizes the project's strong foundation and suggests targeted improvements that would elevate it from a robust data platform to a benchmark-level research tool.
Beyond technical resources, we also contributed to the iGEM community through direct interactions and knowledge sharing. Our team hosted the Jiangsu–Zhejiang–Shanghai iGEM Exchange Conference, where participating teams shared project progress, exchanged experiences, and discussed challenges in both wet-lab and dry-lab work. This event not only helped us refine SPADE through feedback but also provided younger teams with practical guidance on project organization and outreach.
We also actively joined the CCiC event in Beijing, where we presented SPADE to a national audience of iGEMers, experts, and mentors. The feedback collected during this event highlighted the demand for standardized peptide datasets and validated the usability of our tools.
In addition, we carried out online outreach through platforms such as Xiaohongshu, Bilibili, Instagram, and Xiaoyuzhou. By sharing project stories, behind-the-scenes progress, and easy-to-understand explanations of AMPs and synthetic biology, we reached audiences beyond academia and sparked conversations with people new to the field.
These experiences strengthened our belief that contribution goes beyond building databases and models — it is also about creating open spaces for dialogue, collaboration, and shared learning within the iGEM community and with the wider public.
Usage & Licensing
Reusable Protocols & How-To Guides
To make sure our work can be picked up and reused by future iGEM teams, we wrote down every step of the process in plain, practical guides. These are not just abstract descriptions, but real instructions that we ourselves followed and tested.
- Data curation pipeline: how we merged data from different AMP databases, standardized MIC units, unified activity labels, and removed duplicates. This makes the dataset directly usable without the weeks of manual cleaning we went through at the start.
- Model training recipe: a clear record of the hyperparameters, configurations, and training routines we used for AOMM, including optimizer choices, scheduling, task weights, and our early-stopping criteria. Teams can reproduce our results or adapt the setup to their own needs.
- Evaluation: a checklist of metrics and validation protocols we used for each subtask. This ensures that results are reported consistently and fairly, and helps avoid data leakage across train/test splits.
- RAG stack: documentation of how the retrieval engine is built: which embedding model we used, how the database is indexed, what kind of query templates work best, and how relevance scoring is calculated.
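In the spirit of those step-by-step guides, the deduplication stage of the curation pipeline can be sketched as grouping records by a normalized sequence key. The normalization rules here (strip whitespace, uppercase) are illustrative only; the real pipeline also reconciles modifications and assay conditions before merging.

```python
from collections import defaultdict


def dedup_key(sequence: str) -> str:
    """Normalize a sequence for duplicate grouping (illustrative rules only)."""
    return "".join(sequence.split()).upper()


def group_duplicates(records: list[dict]) -> dict[str, list[dict]]:
    """Group records sharing a normalized sequence, and stamp each record
    with a deduplication_group ID, mirroring the SPADE quality flag."""
    groups: dict[str, list[dict]] = defaultdict(list)
    for rec in records:
        groups[dedup_key(rec["sequence"])].append(rec)
    for gid, members in enumerate(groups.values()):
        for rec in members:
            rec["deduplication_group"] = gid
    return dict(groups)
```

Records in the same group can then be merged or cross-checked rather than silently counted twice during training.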
Each of these is written as a step-by-step README with code snippets, so that someone new to the project can follow along without guesswork.
Software & Code
- Code repository (ETL + training + evaluation + RAG API): https://github.com/NeurEv0/Pure_Era
- Includes dataset loaders, training scripts, evaluation harness, and demo notebooks.
Quickstart
git clone https://github.com/NeurEv0/Pure_Era
cd Pure_Era
pip install -r requirements.txt
python scripts/quickstart_demo.py
Licensing & Reuse
- Data & docs: Creative Commons CC BY 4.0.
- Code: MIT License.
- Model weights: research-friendly terms (see HF model card).
- Ethics: no clinical claims; predictions require wet-lab validation.
How to Cite
If you use SPADE / AMPOS / AOMM or our RAG in your project, please cite:
- SPADE – Systematic Platform for Antimicrobial peptide Database with Evaluation (XJTLU-Software 2025).
- AMPOS&PT – AMP-Oriented Six tasks (AMPOS) with the AMP-oriented Multi-Property Prediction Task (AMPPT) (Hugging Face dataset).
- AOMM – AMP-Oriented Multi-task Model (Hugging Face model).
- Code – Pure_Era repository (GitHub).
A BibTeX snippet is provided in our repository and on the dataset/model cards.
Why This is a Contribution to iGEM
Our contribution goes beyond developing technical tools. What we offer is meant to be practical, reusable, and supportive for the whole iGEM community. By making our datasets, models, and retrieval system openly available, we give future teams a clear starting point instead of asking them to rebuild everything from scratch. This lowers the entry barrier for newcomers and helps them focus more on their own creative ideas.
Through AMPOS and AOMM, we also provide a shared way to measure and compare results, so that different teams can talk about their models on the same standard. With SPADE and the RAG engine, dry-lab outputs become directly useful for wet-lab planning, turning predictions into experiments that can really be tested.
Our engagement work — such as hosting the Jiangsu–Zhejiang–Shanghai Exchange Conference and presenting at CCiC Beijing — shows that contribution is not only about code and data, but also about building connections between teams, collecting feedback, and growing together as a community.
Finally, we take biosafety and responsibility seriously. All our resources are released with clear documentation and non-clinical disclaimers, ensuring that innovation stays safe and ethical. Combined with user-friendly interfaces, permissive licenses, and continuous outreach, our work is designed to be reused, adapted, and expanded by future iGEMers.