Results

State-of-the-art performance of AOMM and RAG system evaluation

Results Overview

This page presents the state-of-the-art performance of AOMM. The auxiliary task (bioactivity_classification) was not evaluated. We also present several test cases from the RAG system.

Metrics for Evaluation


Accuracy (Classification Accuracy) and F1 Score

1. Accuracy

  • Principle: Accuracy measures the proportion of correctly predicted samples (both positive and negative) among all samples. It is a basic metric for classification tasks but has limitations on imbalanced datasets (e.g., when one class accounts for 90% of samples, a trivial model that predicts this class for every sample still achieves 90% accuracy).
  • Formula: $$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$
  • Where:
    \(TP\) (True Positive): Number of positive samples correctly predicted as positive;
    \(TN\) (True Negative): Number of negative samples correctly predicted as negative;
    \(FP\) (False Positive): Number of negative samples incorrectly predicted as positive (Type I error);
    \(FN\) (False Negative): Number of positive samples incorrectly predicted as negative (Type II error).
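As a quick worked example, the formula can be evaluated directly from the four confusion-matrix counts (the numbers below are hypothetical):

```python
# Hypothetical confusion-matrix counts for a binary classifier.
TP, TN, FP, FN = 40, 45, 5, 10

# Accuracy: correct predictions over all predictions.
accuracy = (TP + TN) / (TP + TN + FP + FN)
print(accuracy)  # (40 + 45) / 100 = 0.85
```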

2. F1 Score

  • Principle: F1 score is the harmonic mean of Precision and Recall, designed to balance the two metrics. It is particularly useful for imbalanced datasets, where Accuracy may be misleading. Precision focuses on the correctness of positive predictions, while Recall focuses on the completeness of positive predictions.
  • Define Precision and Recall:
    Precision: Proportion of correctly predicted positive samples among all samples predicted as positive.
    $$\text{Precision} = \frac{TP}{TP + FP}$$
    Recall (Sensitivity/True Positive Rate): Proportion of correctly predicted positive samples among all actual positive samples.
    $$\text{Recall} = \frac{TP}{TP + FN}$$
  • F1 Score Calculation
    F1 score takes the harmonic mean (instead of arithmetic mean) to emphasize the impact of low values in Precision or Recall (e.g., if either is 0, F1 becomes 0).
    $$\text{F1 Score} = 2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}$$
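A worked example with hypothetical confusion-matrix counts shows how a low Recall drags the harmonic mean down:

```python
# Hypothetical counts for a binary classifier.
TP, FP, FN = 40, 5, 10

precision = TP / (TP + FP)  # 40 / 45
recall = TP / (TP + FN)     # 40 / 50
# Harmonic mean: dominated by the smaller of the two values.
f1 = 2 * precision * recall / (precision + recall)
print(round(f1, 4))  # 0.8421
```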

Area Under Curve (AUC)

  • Principle: AUC (Area Under ROC Curve) evaluates the ability of a binary classification model to distinguish between positive and negative classes. It is derived from the ROC (Receiver Operating Characteristic) curve, which plots the True Positive Rate (Recall) on the y-axis and the False Positive Rate (FPR) on the x-axis across different classification thresholds.
  • Step 1: Define FPR (False Positive Rate): FPR represents the proportion of negative samples incorrectly predicted as positive, reflecting the model's false alarm rate.
    $$\text{FPR} = \frac{FP}{FP + TN}$$
  • Step 2: ROC Curve and AUC: The ROC curve is generated by varying the classification threshold (e.g., from 0 to 1 for probability outputs) and calculating the corresponding TPR and FPR values.
    AUC is the area under the ROC curve, with a range of [0, 1]:
    AUC = 0.5: The model has no discriminative ability (equivalent to random guessing);
    AUC = 1: The model achieves perfect classification (all positive samples are ranked higher than negative samples);
    AUC > 0.5: The model has some ability to distinguish between classes (higher values indicate better performance); AUC < 0.5 indicates systematically inverted predictions (worse than random guessing).
  • Key Interpretation: AUC is threshold-invariant, meaning it does not depend on the specific classification threshold, making it suitable for comparing different models.
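AUC also has an equivalent probabilistic reading: it equals the probability that a randomly chosen positive sample is scored higher than a randomly chosen negative one (ties counted as half). A minimal pure-Python sketch using toy labels and scores:

```python
def auc(y_true, y_score):
    """AUC via pairwise comparison of positive vs. negative scores."""
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    # Count positive-over-negative wins; ties contribute 0.5.
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Toy labels and model scores (made up for illustration).
print(auc([1, 1, 1, 0, 0, 0], [0.9, 0.8, 0.4, 0.5, 0.3, 0.1]))  # 8/9 ≈ 0.8889
```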

Pearson Coefficient

For regression tasks whose target values span a large range, we used the Pearson correlation coefficient as the evaluation metric.

  • Principle: The Pearson Correlation Coefficient (denoted as \(r\)) measures the strength and direction of the linear correlation between two continuous variables (e.g., the true values \(y\) and predicted values \(\hat{y}\) of a regression model). Its range is [-1, 1].
  • Formula:
    $$r = \frac{n \sum (y_i \hat{y}_i) - (\sum y_i)(\sum \hat{y}_i)}{\sqrt{[n \sum y_i^2 - (\sum y_i)^2][n \sum \hat{y}_i^2 - (\sum \hat{y}_i)^2]}}$$
  • Where:
    \(n\): Number of samples;
    \(y_i\): \(i\)-th true value of the target variable;
    \(\hat{y}_i\): \(i\)-th predicted value of the target variable;
    \(\sum\): Summation over all samples (from \(i=1\) to \(i=n\)).
  • Key Interpretation:
    \(r = 1\): Perfect positive linear correlation (all predicted values lie exactly on the line of best fit for true values);
    \(r = -1\): Perfect negative linear correlation;
    \(r = 0\): No linear correlation (but there may be non-linear relationships);
    The closer \(|r|\) is to 1, the stronger the linear correlation between true and predicted values (better model performance for regression).
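The formula can be checked numerically; the true and predicted values below are toy numbers, not model outputs:

```python
import math

# Toy true and predicted values for illustration.
y    = [1.0, 2.0, 3.0, 4.0]
yhat = [1.1, 1.9, 3.2, 3.8]
n = len(y)

# Pearson r, computed term by term from the formula above.
num = n * sum(a * b for a, b in zip(y, yhat)) - sum(y) * sum(yhat)
den = math.sqrt((n * sum(a * a for a in y) - sum(y) ** 2)
                * (n * sum(b * b for b in yhat) - sum(yhat) ** 2))
r = num / den
print(round(r, 4))  # close to 1: strong positive linear correlation
```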

MAE (Mean Absolute Error)

  • Principle: MAE is a common metric for regression tasks, measuring the average of the absolute differences between the true values and predicted values. It reflects the average magnitude of prediction errors (without considering direction) and is robust to outliers (compared to MSE, which amplifies large errors via squaring).
  • Formula:
    $$\text{MAE} = \frac{1}{n} \sum_{i=1}^n |y_i - \hat{y}_i|$$
  • Where:
    \(n\): Number of samples;
    \(y_i\): \(i\)-th true value;
    \(\hat{y}_i\): \(i\)-th predicted value;
    \(|y_i - \hat{y}_i|\): Absolute error of the \(i\)-th sample.
  • Key Interpretation:
    MAE has the same unit as the target variable (e.g., if the target is "house price in dollars", MAE is also in dollars), making it easy to interpret;
    A smaller MAE indicates that the model's predictions are closer to the true values (better regression performance).
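A worked example with toy values:

```python
# Toy true and predicted values for illustration.
y    = [2.0, 3.5, 5.0]
yhat = [2.5, 3.0, 4.0]

# Mean of the absolute errors, in the same unit as the target.
mae = sum(abs(a - b) for a, b in zip(y, yhat)) / len(y)
print(mae)  # (0.5 + 0.5 + 1.0) / 3 ≈ 0.6667
```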

Model Performance


Masked LM (compared with ESM2 150M)

Top-k accuracy, k = 5

Model           top1_accuracy   top5_accuracy
AOMM 124M       0.8736          0.9407
ESM2 150M [1]   0.3300          0.6392

Figure 1: Comparison of the performance of AOMM 124M and ESM2 150M on the task of masked_lm.

[1] Lin, Z., Akin, H., Rao, R., Hie, B., Zhu, Z., Lu, W., ... & Rives, A. (2023). Evolutionary-scale prediction of atomic-level protein structure with a language model. Science, 379(6637), 1123-1130.

AMP Classification

AUC: 0.9951, F1: 0.9644

Figure 2: Performance of AMP classification task.

MIC Regression

Organism (or Total)          Organism ID   Pearson Correlation   Number of Samples
Total                        -             0.7000                18668
Acinetobacter baumannii      0             0.7294                711
Bacillus subtilis            1             0.8109                1515
Candida albicans             2             0.5976                1314
Enterococcus faecalis        3             0.7450                681
Escherichia coli             4             0.6681                4461
Klebsiella pneumoniae        5             0.7687                956
Micrococcus luteus           6             0.6871                570
Pseudomonas aeruginosa       7             0.6636                2551
Salmonella enterica          8             0.7218                966
Staphylococcus aureus        9             0.7187                3850
Staphylococcus epidermidis   10            0.6830                1097

Hemolysis Regression

MAE: 0.1061

Half-Life Regression

Pearson Correlation: 0.9851

Pure Era RAG: A Retrieval-Augmented Generation System for Antimicrobial Peptide Data


Introduction

Pure Era RAG is a retrieval-augmented generation (RAG) system specifically designed for the retrieval of antimicrobial peptide (AMP) data. By integrating dense vector retrieval with semantic re-ranking capabilities, our system enables researchers to quickly identify the most relevant peptide sequences based on natural language queries.

RAG Model Construction

Figure 1: Model structure

1. Data Preparation

Data quality determines the final performance of the model. We downloaded 46,876 antimicrobial peptide entries from APD3 (Wang, G., 2016), DRAMP (Ma, T., 2025), LAMP (Ye, G., 2020), and DBAASP (Pirtskhalava, M., 2021), either via Python web crawlers or via APIs. These entries include sequence information, activity information, and various physicochemical properties, and were saved in JSON format.

To keep the collection current, we deployed a scheduled crawling module on the server that automatically synchronizes data with APD3, DRAMP, and the other sources at regular intervals.

2. Data Processing Flow

First, we convert structured JSON data into standard text format (Figures 2 and 3) for subsequent embedding representation generation and semantic retrieval. During the conversion, we set key fields, converting only the necessary fields into standard text and filtering out invalid information such as "no", "unknown", and null values. For special field processing, we adopt a structured conversion strategy: converting the list of biological activities into numbered entries, extracting the ID and URL key information from UniProt entries, and standardizing the format of author, title, and other metadata for literature data. Complex nested structures are expanded into readable text to ensure semantic integrity while removing redundant information, providing standardized input for subsequent embedding models.

The output format is: [Field name] value
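As an illustration, a minimal sketch of this conversion (the record, field names, and filter list below are hypothetical; the real pipeline handles many more fields and deeper nesting):

```python
# Hypothetical record; real APD3/DRAMP entries contain many more fields.
record = {
    "Name": "Magainin 2",
    "Sequence": "GIGKFLHSAKKFGKAFVGEIMNS",
    "Hemolysis": "unknown",   # invalid value, filtered out
    "Target": None,           # null value, filtered out
    "Activity": ["Anti-Gram-negative", "Anti-Gram-positive"],
}

KEY_FIELDS = ("Name", "Sequence", "Hemolysis", "Target", "Activity")
INVALID = ("no", "unknown", "", None)

lines = []
for field in KEY_FIELDS:
    value = record.get(field)
    if isinstance(value, list):
        # Lists of biological activities become numbered entries.
        value = "; ".join(f"{i + 1}. {v}" for i, v in enumerate(value))
    if value in INVALID:
        continue  # drop "no"/"unknown"/null values
    lines.append(f"[{field}] {value}")

print("\n".join(lines))
```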

Figure 2: JSON file

Figure 3: txt file

*The figures are for illustration only; minor mismatches with the actual data are expected.

3. Language Embedding

We use Qwen3-embedding-4B to convert the biological information of peptides into a dense 2560-dimensional vector representation. This process transforms the complex biological descriptions - including amino acid sequences, antibacterial activity, physicochemical properties, and other functional characteristics - into a numerical vector form, enabling computers to understand and process the semantic information of peptides.

All generated embedding vectors are organized through a FAISS index. This specially optimized vector store supports similarity search in milliseconds. FAISS uses the IVFFlat index structure, dividing the vector space into multiple clustering units, which significantly improves the efficiency of nearest-neighbor search. As a result, semantically related entries end up close together: for example, peptides active against Gram-negative bacteria cluster in the vector space, while peptides with high hemolytic activity sit at a relatively large distance from them.

During the search process, the system first performs a rapid coarse screening in the FAISS index to identify candidate peptides, and then refines the ranking through cosine similarity calculation. This hierarchical search strategy, combined with GPU acceleration, enables the system to respond to user queries in real time across thousands of peptide entries. When a user queries "finding peptides that are effective against Escherichia coli and have low hemolytic activity", the system can accurately interpret the underlying requirements and surface the most suitable results, providing a powerful intelligent search foundation for antimicrobial peptide research and discovery.
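The coarse-then-fine strategy can be illustrated without FAISS itself; the dependency-free NumPy sketch below mimics what an IVFFlat index does internally (probe only the clusters nearest the query, then score the surviving candidates exactly). All sizes here (64 dimensions, 16 clusters, nprobe = 4) are toy values, not the production configuration, which runs FAISS over 2560-dimensional embeddings.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins for 1000 unit-normalized peptide embeddings.
emb = rng.normal(size=(1000, 64)).astype(np.float32)
emb /= np.linalg.norm(emb, axis=1, keepdims=True)

# A query vector close to entry 42 (a slightly perturbed copy).
query = emb[42] + 0.01 * rng.normal(size=64).astype(np.float32)
query /= np.linalg.norm(query)

# Coarse stage: assign every vector to its nearest "centroid" and
# probe only the few clusters closest to the query (IVF behavior).
centroids = emb[rng.choice(1000, size=16, replace=False)]
assign = (emb @ centroids.T).argmax(axis=1)
probe = np.argsort(query @ centroids.T)[-4:]          # nprobe = 4
cand = np.flatnonzero(np.isin(assign, probe))

# Fine stage: exact cosine similarity over the candidates only.
sims = emb[cand] @ query
top5 = cand[np.argsort(sims)[::-1][:5]]
print(top5)  # entry 42 should surface among the top hits
```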

4. Retrieval and Reordering

In the reordering stage, the Qwen3-reranker-4B model is used for deep semantic evaluation. This model receives formatted query-document pairs and calculates the relevance score by analyzing the difference in output logits of the "true/false" tokens. The system specifically added 25 keywords related to antimicrobial peptides (such as MIC, hemolytic, gram positive, etc.) to enhance the model's understanding of professional terms.
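The "true/false" logit scoring described above reduces to a two-way softmax; the logit values below are made up for illustration, and the real model produces them from a formatted query-document prompt:

```python
import math

# Hypothetical raw logits for the "true" / "false" tokens emitted by
# a reranker for one query-document pair.
logit_true, logit_false = 2.3, -1.1

# Relevance score = P("true") under a softmax over the two tokens,
# equivalent to sigmoid(logit_true - logit_false).
score = math.exp(logit_true) / (math.exp(logit_true) + math.exp(logit_false))
print(round(score, 4))  # 0.9677
```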

Figure 4: 25 key words in the field of antimicrobial peptides

With GPU acceleration, the model processes candidate peptides in batches in parallel and determines the matching degree of each peptide with the query through deep semantic understanding. Unlike simple keyword matching, the reordering model can understand the biological context and implicit requirements, accurately identifying complex conditions such as "low hemolytic activity but effective against Escherichia coli". Finally, the system arranges the candidate results in descending order based on the relevance score, ensuring that the returned peptide sequence best meets the user's true intention, significantly improving the accuracy and practicality of the retrieval results.

5. Technical Features

Our system adopts a dual-model architecture. The Embedding model is responsible for quickly retrieving candidate results, while the Reranker model performs fine sorting to ensure retrieval accuracy. Through domain adaptation technology, we added AMP professional domain keywords to enhance the semantic understanding ability of the tokenizer. Efficient retrieval is enabled by the support of FAISS index, which can achieve vector similarity search in milliseconds. The system has good interpretability, with output results including relevance scores and the original text, facilitating subsequent analysis and verification by researchers. The entire system provides an end-to-end processing flow, from the original JSON data to the final retrieval results, achieving fully automated processing and significantly improving research efficiency.

References

  1. Ma, T., Liu, Y., Yu, B., Sun, X., Yao, H., Hao, C., Li, J., Nawaz, M., Jiang, X., Lao, X., & Zheng, H. (2025). DRAMP 4.0: An open-access data repository dedicated to the clinical translation of antimicrobial peptides. Nucleic Acids Research, 53(D1), D403–D410. https://doi.org/10.1093/nar/gkae1046
  2. Pirtskhalava, M., Amstrong, A. A., Grigolava, M., Chubinidze, M., Alimbarashvili, E., Vishnepolsky, B., Gabrielian, A., Rosenthal, A., Hurt, D. E., & Tartakovsky, M. (2021). DBAASP v3: Database of antimicrobial/cytotoxic activity and structure of peptides as a resource for development of new therapeutics. Nucleic Acids Research, 49(D1), D288–D297. https://doi.org/10.1093/nar/gkaa991
  3. Wang, G., Li, X., & Wang, Z. (2016). APD3: The antimicrobial peptide database as a tool for research and education. Nucleic Acids Research, 44(D1), D1087–D1093. https://doi.org/10.1093/nar/gkv1278
  4. Ye, G., Wu, H., Huang, J., Wang, W., Ge, K., Li, G., Zhong, J., & Huang, Q. (2020). LAMP2: A major update of the database linking antimicrobial peptides. Database, 2020, baaa061. https://doi.org/10.1093/database/baaa061