Results

State-of-the-art performance of AOMM and RAG system evaluation

Results Overview

This page presents the state-of-the-art performance of AOMM. The auxiliary task (bioactivity_classification) was not evaluated. We also present several test cases from the RAG system.

Metrics for Evaluation


Accuracy (Classification Accuracy) and F1 Score

1. Accuracy

  • Principle: Accuracy measures the proportion of correctly predicted samples (both positive and negative) among all samples. It is a basic metric for classification tasks but has limitations on imbalanced datasets (e.g., when one class accounts for 90% of samples, a trivial model that predicts this class for every sample still achieves 90% accuracy).
  • Formula: $$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$
  • Where:
    \(TP\) (True Positive): Number of positive samples correctly predicted as positive;
    \(TN\) (True Negative): Number of negative samples correctly predicted as negative;
    \(FP\) (False Positive): Number of negative samples incorrectly predicted as positive (Type I error);
    \(FN\) (False Negative): Number of positive samples incorrectly predicted as negative (Type II error).
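As a quick worked example, the formula can be evaluated directly from the four confusion-matrix counts (the numbers below are hypothetical):

```python
# Hypothetical confusion-matrix counts for a binary classifier.
TP, TN, FP, FN = 40, 45, 5, 10

# Accuracy: correct predictions over all predictions.
accuracy = (TP + TN) / (TP + TN + FP + FN)
print(accuracy)  # (40 + 45) / 100 = 0.85
```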

2. F1 Score

  • Principle: F1 score is the harmonic mean of Precision and Recall, designed to balance the two metrics. It is particularly useful for imbalanced datasets, where Accuracy may be misleading. Precision focuses on the correctness of positive predictions, while Recall focuses on the completeness of positive predictions.
  • Define Precision and Recall:
    Precision: Proportion of correctly predicted positive samples among all samples predicted as positive.
    $$\text{Precision} = \frac{TP}{TP + FP}$$
    Recall (Sensitivity/True Positive Rate): Proportion of correctly predicted positive samples among all actual positive samples.
    $$\text{Recall} = \frac{TP}{TP + FN}$$
  • F1 Score Calculation
    F1 score takes the harmonic mean (instead of arithmetic mean) to emphasize the impact of low values in Precision or Recall (e.g., if either is 0, F1 becomes 0).
    $$\text{F1 Score} = 2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}$$
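A worked example with hypothetical confusion-matrix counts shows how a low Recall drags the harmonic mean down:

```python
# Hypothetical counts for a binary classifier.
TP, FP, FN = 40, 5, 10

precision = TP / (TP + FP)  # 40 / 45
recall = TP / (TP + FN)     # 40 / 50
# Harmonic mean: dominated by the smaller of the two values.
f1 = 2 * precision * recall / (precision + recall)
print(round(f1, 4))  # 0.8421
```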

Area Under Curve (AUC)

  • Principle: AUC (Area Under ROC Curve) evaluates the ability of a binary classification model to distinguish between positive and negative classes. It is derived from the ROC (Receiver Operating Characteristic) curve, which plots the True Positive Rate (Recall) on the y-axis and the False Positive Rate (FPR) on the x-axis across different classification thresholds.
  • Step 1: Define FPR (False Positive Rate): FPR represents the proportion of negative samples incorrectly predicted as positive, reflecting the model's false alarm rate.
    $$\text{FPR} = \frac{FP}{FP + TN}$$
  • Step 2: ROC Curve and AUC: The ROC curve is generated by varying the classification threshold (e.g., from 0 to 1 for probability outputs) and calculating the corresponding TPR and FPR values.
    AUC is the area under the ROC curve, with a range of [0, 1]:
    AUC = 0.5: The model has no discriminative ability (equivalent to random guessing);
    AUC = 1: The model achieves perfect classification (all positive samples are ranked higher than negative samples);
    AUC > 0.5: The model has some ability to distinguish between classes (higher values indicate better performance); AUC < 0.5 indicates systematically inverted predictions (worse than random guessing).
  • Key Interpretation: AUC is threshold-invariant, meaning it does not depend on the specific classification threshold, making it suitable for comparing different models.
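AUC also has an equivalent probabilistic reading: it equals the probability that a randomly chosen positive sample is scored higher than a randomly chosen negative one (ties counted as half). A minimal pure-Python sketch using toy labels and scores:

```python
def auc(y_true, y_score):
    """AUC via pairwise comparison of positive vs. negative scores."""
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    # Count positive-over-negative wins; ties contribute 0.5.
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Toy labels and model scores (made up for illustration).
print(auc([1, 1, 1, 0, 0, 0], [0.9, 0.8, 0.4, 0.5, 0.3, 0.1]))  # 8/9 ≈ 0.8889
```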

Pearson Coefficient

For regression tasks whose target values span a large range, we used the Pearson correlation coefficient as the evaluation metric.

  • Principle: The Pearson Correlation Coefficient (denoted as \(r\)) measures the strength and direction of the linear correlation between two continuous variables (e.g., the true values \(y\) and predicted values \(\hat{y}\) of a regression model). Its range is [-1, 1].
  • Formula:
    $$r = \frac{n \sum (y_i \hat{y}_i) - (\sum y_i)(\sum \hat{y}_i)}{\sqrt{[n \sum y_i^2 - (\sum y_i)^2][n \sum \hat{y}_i^2 - (\sum \hat{y}_i)^2]}}$$
  • Where:
    \(n\): Number of samples;
    \(y_i\): \(i\)-th true value of the target variable;
    \(\hat{y}_i\): \(i\)-th predicted value of the target variable;
    \(\sum\): Summation over all samples (from \(i=1\) to \(i=n\)).
  • Key Interpretation:
    \(r = 1\): Perfect positive linear correlation (all predicted values lie exactly on the line of best fit for true values);
    \(r = -1\): Perfect negative linear correlation;
    \(r = 0\): No linear correlation (but there may be non-linear relationships);
    The closer \(|r|\) is to 1, the stronger the linear correlation between true and predicted values (better model performance for regression).
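The formula can be checked numerically; the true and predicted values below are toy numbers, not model outputs:

```python
import math

# Toy true and predicted values for illustration.
y    = [1.0, 2.0, 3.0, 4.0]
yhat = [1.1, 1.9, 3.2, 3.8]
n = len(y)

# Pearson r, computed term by term from the formula above.
num = n * sum(a * b for a, b in zip(y, yhat)) - sum(y) * sum(yhat)
den = math.sqrt((n * sum(a * a for a in y) - sum(y) ** 2)
                * (n * sum(b * b for b in yhat) - sum(yhat) ** 2))
r = num / den
print(round(r, 4))  # close to 1: strong positive linear correlation
```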

MAE (Mean Absolute Error)

  • Principle: MAE is a common metric for regression tasks, measuring the average of the absolute differences between the true values and predicted values. It reflects the average magnitude of prediction errors (without considering direction) and is robust to outliers (compared to MSE, which amplifies large errors via squaring).
  • Formula:
    $$\text{MAE} = \frac{1}{n} \sum_{i=1}^n |y_i - \hat{y}_i|$$
  • Where:
    \(n\): Number of samples;
    \(y_i\): \(i\)-th true value;
    \(\hat{y}_i\): \(i\)-th predicted value;
    \(|y_i - \hat{y}_i|\): Absolute error of the \(i\)-th sample.
  • Key Interpretation:
    MAE has the same unit as the target variable (e.g., if the target is "house price in dollars", MAE is also in dollars), making it easy to interpret;
    A smaller MAE indicates that the model's predictions are closer to the true values (better regression performance).
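A worked example with toy values:

```python
# Toy true and predicted values for illustration.
y    = [2.0, 3.5, 5.0]
yhat = [2.5, 3.0, 4.0]

# Mean of the absolute errors, in the same unit as the target.
mae = sum(abs(a - b) for a, b in zip(y, yhat)) / len(y)
print(mae)  # (0.5 + 0.5 + 1.0) / 3 ≈ 0.6667
```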

Model Performance


Masked LM (compared with ESM2 150M)

Top-k accuracy, k = 5

Model           top1_accuracy   top5_accuracy
AOMM 124M       0.8736          0.9407
ESM2 150M [1]   0.3300          0.6392

Figure 1: Comparison of the performance of AOMM 124M and ESM2 150M on the task of masked_lm.

[1] Lin, Z., Akin, H., Rao, R., Hie, B., Zhu, Z., Lu, W., ... & Rives, A. (2023). Evolutionary-scale prediction of atomic-level protein structure with a language model. Science, 379(6637), 1123-1130.

AMP Classification

AUC: 0.9951, F1: 0.9644

Figure 2: Performance of AMP classification task.

MIC Regression

Organism (or Total)          Organism ID   Pearson Correlation   Number of Samples
Total                        -             0.7000                18668
Acinetobacter baumannii      0             0.7294                711
Bacillus subtilis            1             0.8109                1515
Candida albicans             2             0.5976                1314
Enterococcus faecalis        3             0.7450                681
Escherichia coli             4             0.6681                4461
Klebsiella pneumoniae        5             0.7687                956
Micrococcus luteus           6             0.6871                570
Pseudomonas aeruginosa       7             0.6636                2551
Salmonella enterica          8             0.7218                966
Staphylococcus aureus        9             0.7187                3850
Staphylococcus epidermidis   10            0.6830                1097

Hemolysis Regression

MAE: 0.1061

Half-Life Regression

Pearson Correlation: 0.9851

Pure Era RAG: A Retrieval-Augmented Generation System for Antimicrobial Peptide Data


Introduction

Pure Era RAG is a retrieval-augmented generation (RAG) system specifically designed for the retrieval of antimicrobial peptide (AMP) data. By integrating dense vector retrieval with semantic re-ranking capabilities, our system enables researchers to quickly identify the most relevant peptide sequences based on natural language queries.

RAG Model Construction

Figure 1: Model structure

1. Data Preparation

Data quality determines the final performance of the model. We downloaded 46,876 antimicrobial peptide entries from APD3 (Wang, G., 2016), DRAMP (Ma, T., 2025), LAMP (Ye, G., 2020), and DBAASP (Pirtskhalava, M., 2021), either via Python web crawlers or via APIs. These entries include sequence information, activity information, and various physicochemical properties, and were saved in JSON format.

To keep the collection current, we deployed a scheduled crawling module on the server that automatically synchronizes data with APD3, DRAMP, and the other sources at regular intervals.

2. Data Processing Flow

First, we convert structured JSON data into standard text format (Figures 2 and 3) for subsequent embedding representation generation and semantic retrieval. During the conversion, we set key fields, converting only the necessary fields into standard text and filtering out invalid information such as "no", "unknown", and null values. For special field processing, we adopt a structured conversion strategy: converting the list of biological activities into numbered entries, extracting the ID and URL key information from UniProt entries, and standardizing the format of author, title, and other metadata for literature data. Complex nested structures are expanded into readable text to ensure semantic integrity while removing redundant information, providing standardized input for subsequent embedding models.

The output format is: [Field name] value
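As an illustration, a minimal sketch of this conversion (the record, field names, and filter list below are hypothetical; the real pipeline handles many more fields and deeper nesting):

```python
# Hypothetical record; real APD3/DRAMP entries contain many more fields.
record = {
    "Name": "Magainin 2",
    "Sequence": "GIGKFLHSAKKFGKAFVGEIMNS",
    "Hemolysis": "unknown",   # invalid value, filtered out
    "Target": None,           # null value, filtered out
    "Activity": ["Anti-Gram-negative", "Anti-Gram-positive"],
}

KEY_FIELDS = ("Name", "Sequence", "Hemolysis", "Target", "Activity")
INVALID = ("no", "unknown", "", None)

lines = []
for field in KEY_FIELDS:
    value = record.get(field)
    if isinstance(value, list):
        # Lists of biological activities become numbered entries.
        value = "; ".join(f"{i + 1}. {v}" for i, v in enumerate(value))
    if value in INVALID:
        continue  # drop "no"/"unknown"/null values
    lines.append(f"[{field}] {value}")

print("\n".join(lines))
```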

Figure 2: JSON file

Figure 3: txt file

*The figures are for illustration only; minor mismatches with the actual data are expected.

3. Language Embedding

We use Qwen3-embedding-4B to convert the biological information of peptides into a dense 2560-dimensional vector representation. This process transforms the complex biological descriptions - including amino acid sequences, antibacterial activity, physicochemical properties, and other functional characteristics - into a numerical vector form, enabling computers to understand and process the semantic information of peptides.

All generated embedding vectors are organized through a FAISS index. This specially optimized vector store supports similarity search in milliseconds. FAISS uses the IVFFlat index structure, dividing the vector space into multiple clustering units, which significantly improves the efficiency of nearest-neighbor search. As a result, semantically related entries end up close together: for example, peptides active against Gram-negative bacteria cluster in the vector space, while peptides with high hemolytic activity sit at a relatively large distance from them.

During the search process, the system first performs a rapid coarse screening in the FAISS index to identify candidate peptides, and then refines the ranking through cosine similarity calculation. This hierarchical search strategy, combined with GPU acceleration, enables the system to respond to user queries in real time across thousands of peptide entries. When a user queries "finding peptides that are effective against Escherichia coli and have low hemolytic activity", the system can accurately interpret the underlying requirements and surface the most suitable results, providing a powerful intelligent search foundation for antimicrobial peptide research and discovery.
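The coarse-then-fine strategy can be illustrated without FAISS itself; the dependency-free NumPy sketch below mimics what an IVFFlat index does internally (probe only the clusters nearest the query, then score the surviving candidates exactly). All sizes here (64 dimensions, 16 clusters, nprobe = 4) are toy values, not the production configuration, which runs FAISS over 2560-dimensional embeddings.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins for 1000 unit-normalized peptide embeddings.
emb = rng.normal(size=(1000, 64)).astype(np.float32)
emb /= np.linalg.norm(emb, axis=1, keepdims=True)

# A query vector close to entry 42 (a slightly perturbed copy).
query = emb[42] + 0.01 * rng.normal(size=64).astype(np.float32)
query /= np.linalg.norm(query)

# Coarse stage: assign every vector to its nearest "centroid" and
# probe only the few clusters closest to the query (IVF behavior).
centroids = emb[rng.choice(1000, size=16, replace=False)]
assign = (emb @ centroids.T).argmax(axis=1)
probe = np.argsort(query @ centroids.T)[-4:]          # nprobe = 4
cand = np.flatnonzero(np.isin(assign, probe))

# Fine stage: exact cosine similarity over the candidates only.
sims = emb[cand] @ query
top5 = cand[np.argsort(sims)[::-1][:5]]
print(top5)  # entry 42 should surface among the top hits
```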

4. Retrieval and Reordering

In the reordering stage, the Qwen3-reranker-4B model is used for deep semantic evaluation. This model receives formatted query-document pairs and calculates the relevance score by analyzing the difference in output logits of the "true/false" tokens. The system specifically added 25 keywords related to antimicrobial peptides (such as MIC, hemolytic, gram positive, etc.) to enhance the model's understanding of professional terms.
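The "true/false" logit scoring described above reduces to a two-way softmax; the logit values below are made up for illustration, and the real model produces them from a formatted query-document prompt:

```python
import math

# Hypothetical raw logits for the "true" / "false" tokens emitted by
# a reranker for one query-document pair.
logit_true, logit_false = 2.3, -1.1

# Relevance score = P("true") under a softmax over the two tokens,
# equivalent to sigmoid(logit_true - logit_false).
score = math.exp(logit_true) / (math.exp(logit_true) + math.exp(logit_false))
print(round(score, 4))  # 0.9677
```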

Figure 4: 25 key words in the field of antimicrobial peptides

With GPU acceleration, the model processes candidate peptides in batches in parallel and determines the matching degree of each peptide with the query through deep semantic understanding. Unlike simple keyword matching, the reordering model can understand the biological context and implicit requirements, accurately identifying complex conditions such as "low hemolytic activity but effective against Escherichia coli". Finally, the system arranges the candidate results in descending order based on the relevance score, ensuring that the returned peptide sequence best meets the user's true intention, significantly improving the accuracy and practicality of the retrieval results.

5. Technical Features

Our system adopts a dual-model architecture. The Embedding model is responsible for quickly retrieving candidate results, while the Reranker model performs fine sorting to ensure retrieval accuracy. Through domain adaptation technology, we added AMP professional domain keywords to enhance the semantic understanding ability of the tokenizer. Efficient retrieval is enabled by the support of FAISS index, which can achieve vector similarity search in milliseconds. The system has good interpretability, with output results including relevance scores and the original text, facilitating subsequent analysis and verification by researchers. The entire system provides an end-to-end processing flow, from the original JSON data to the final retrieval results, achieving fully automated processing and significantly improving research efficiency.

References

  1. Ma, T., Liu, Y., Yu, B., Sun, X., Yao, H., Hao, C., Li, J., Nawaz, M., Jiang, X., Lao, X., & Zheng, H. (2025). DRAMP 4.0: An open-access data repository dedicated to the clinical translation of antimicrobial peptides. Nucleic Acids Research, 53(D1), D403–D410. https://doi.org/10.1093/nar/gkae1046
  2. Pirtskhalava, M., Amstrong, A. A., Grigolava, M., Chubinidze, M., Alimbarashvili, E., Vishnepolsky, B., Gabrielian, A., Rosenthal, A., Hurt, D. E., & Tartakovsky, M. (2021). DBAASP v3: Database of antimicrobial/cytotoxic activity and structure of peptides as a resource for development of new therapeutics. Nucleic Acids Research, 49(D1), D288–D297. https://doi.org/10.1093/nar/gkaa991
  3. Wang, G., Li, X., & Wang, Z. (2016). APD3: The antimicrobial peptide database as a tool for research and education. Nucleic Acids Research, 44(D1), D1087–D1093. https://doi.org/10.1093/nar/gkv1278
  4. Ye, G., Wu, H., Huang, J., Wang, W., Ge, K., Li, G., Zhong, J., & Huang, Q. (2020). LAMP2: A major update of the database linking antimicrobial peptides. Database, 2020, baaa061. https://doi.org/10.1093/database/baaa061