AMPilot

Background

Problem Analysis

Our project aims to treat diabetic wounds using antimicrobial peptides (AMPs) with good inhibitory effects against Staphylococcus aureus (A common wound infection bacteria), produced by yeast, in conjunction with a hydrogel (mainly composed of L-HBC). To assist our wet lab in finding suitable AMPs, we initially planned to search databases like APD, CAMP, and DBAASP. However, the data downloaded and compiled from multiple databases often described AMP properties in natural language and unstructured text, making it difficult to extract information accurately using SQL, Pandas, or regular expressions, and easy to miss key details. Even using the built-in filtering functions of these databases presented many problems: on one hand, the data in a single database is often limited, and switching between multiple databases reduces efficiency; on the other hand, the retrieved information often requires secondary processing and judgment, which increases the learning and time costs for wet-lab personnel, further decreasing efficiency. For example, a user might want to find AMPs that are positively charged and active against both fungi and Gram-negative bacteria. The database (or local CSV data table) might describe this using completely different linguistic logic but similar semantics. Traditional retrieval methods cannot effectively extract such information, which is complex and highly dependent on context. At the same time, the wet lab's requirements for AMP properties are not fixed (they may be dynamically adjusted based on experimental results), so the screening work may be performed multiple times. In our project, the wet lab requires AMPs to have a good antibacterial effect against Staphylococcus aureus and a good anti-inflammatory effect on diabetic wounds. For instance, when searching in DBAASP with parameters like "Target Group: Gram+, Target Species: Staphylococcus aureus, Biofilm: Effective, Hemolytic and Cytotoxic Activities: Prioritize peptides with low hemolytic and cytotoxic activities", the search results yield many candidate AMPs, making further selection based on the actual situation impossible.

Considering that the AMPs we need might not have been discovered yet as our research progresses, we might need to design more suitable AMPs from scratch or use machine learning or deep learning methods to build generative models for predicting AMP sequences that meet our requirements. However, whether designing AMPs based on traditional physicochemical properties or fine-tuning generative models based on existing AMPs, the actual success rate is often below 10%^[1], which leads to extremely high validation costs for wet labs, thereby limiting the efficiency of AMP design. Furthermore, due to the lack of interpretability of algorithms, a natural gap has emerged between the representations of model-designed AMPs and the experience of wet-lab personnel^[2], making it difficult to align the needs of modeling and wet-lab experiments.

Additionally, our team encountered difficulties with data analysis in the wet lab: the raw data commonly contained missing and abnormal values, and traditional automated scripts were unable to perform intelligent screening in the context of the experiment, easily leading to the loss of key information or misinterpretation of results^[3]. Meanwhile, the software provided with the instruments has fixed functionalities and cannot meet exploratory analysis needs such as custom model fitting, forcing researchers to spend a significant amount of time on inefficient manual programming and data processing. This model not only significantly increases the cognitive load on wet-lab scientists, diverting their energy from core scientific insights to tedious technical details, but also directly leads to an unnecessarily prolonged cycle of converting data into knowledge, severely hindering the rapid iteration and innovation of the project.

Overview

To address the challenge of accurately retrieving antimicrobial peptides from multiple databases, we first constructed a question-answering system based on Retrieval-Augmented Generation (RAG). Although this system enabled natural language queries of literature, we found that its fixed, linear workflow was rigid when faced with complex retrieval tasks that require iterative optimization. To overcome this, we upgraded it to an Agentic RAG. This agent employs a dynamic "think-act" loop framework, allowing it to autonomously assess retrieval quality, reconstruct queries, and iterate searches, thereby overcoming the limitations of traditional RAG and achieving a robust and precise response to complex scientific needs.

To tackle the core pain points of "low hit rates" and "lack of interpretability" in de novo antimicrobial peptide design, we proposed a novel solution centered on LLM (Large Language Model) reasoning. Instead of building another "black box" generative model, we developed the AMP Rerank Agent. This agent learns and summarizes interpretable "design experiences" in natural language by comparing known sequences. It utilizes this accumulated knowledge to intelligently re-prioritize a set of candidate antimicrobial peptide sequences. This approach not only bridges the cognitive gap between dry and wet labs with its transparent decision-making process but also directly reduces the high validation costs of wet-lab experiments by recommending the most likely successful candidates first.

Finally, to address the common data analysis challenges in wet-lab experiments, we built the Data Analysis Agent. This agent is designed to solve the fundamental flaw of traditional automated scripts, which cannot make dynamic decisions based on experimental context. It can break down high-level analysis requirements into specific, executable steps and autonomously generate code for data cleaning, statistical analysis, and model fitting. Its core advantage lies in a self-correcting loop mechanism: after executing and verifying the code, if errors are found or the results are suboptimal, the agent can re-plan and correct its analysis strategy. This process liberates scientists from tedious programming and data processing, significantly accelerating the conversion of raw data into scientific insights.

To integrate these independent functional modules into a cohesive whole and provide end-to-end intelligent support for our wet lab, we combined the three agents into a unified Multi-Agent System. Due to its strong specialization in tasks related to antimicrobial peptides, we named it AMPilot. AMPilot is more than just a simple collection of tools; it aims to break down the barriers between information retrieval, sequence design, and data analysis, providing researchers with a seamless and coherent operational experience. More importantly, to ensure the continuity of our work and benefit the broader scientific community, we have open-sourced the complete code for AMPilot. We hope it will serve as an extensible platform, not only continuing to support our future research projects but also facilitating innovation in other laboratories.

Finding Antimicrobial Peptides

RAG Q&A System

Problem Analysis and Model Establishment

To first solve the problem of finding antimicrobial peptides, we abandoned the method of searching for them in data. Instead, we hoped to leverage the ability of large models to process complex semantic information, thereby avoiding the reduced effectiveness and information loss encountered with traditional CSV/regex retrieval methods. While a general-purpose large model can fully understand user needs and produce well-formatted natural language text as output, it struggles with knowledge-intensive tasks and is prone to generating hallucinations^[4]. The combination of specialized knowledge and the reasoning capabilities of large models is what creates a Retrieval-Augmented Generation (RAG) framework. By retrieving knowledge related to the user's needs from a database and placing this augmented information into the LLM's prompt, it becomes possible to answer questions in specific professional domains. The following image shows an example of responses from keyword search, a general large model, and a RAG system:

Img.1 Example responses from keyword search, a general large model, and a RAG system

In summary, the advantages of retrieving antimicrobial peptides through a RAG system are mainly threefold: first, before performing a retrieval task, the LLM uses its reasoning ability to summarize appropriate queries (usually keywords for AMP-related questions) based on the user's input, which improves retrieval accuracy; second, the LLM can efficiently determine the validity of the retrieved information, summarize it, and use it as context to generate formatted output; third, the model's retrieval, summarization, and response tendencies can be customized through prompts, offering a degree of flexibility.

Through a literature review, we found that there is currently no RAG system that uses AMP-related literature or datasets as its database. Therefore, to find antimicrobial peptides that meet the requirements of our wet lab, we implemented a Retrieval-Augmented Generation (RAG)-based antimicrobial peptide knowledge question-answering system. We hope this system can help us find the AMPs we need and their related information from a large database of information on antimicrobial peptides.

Compared to tabular data, literature on antimicrobial peptides often contains richer context about the peptide, from which detailed biological and physicochemical properties, antimicrobial mechanisms, and other information can be obtained. Therefore, we aim to use collected AMP-related literature (usually in PDF format) as the raw data for our RAG system. We then split each document into text chunks, add indices, and finally encode all chunks and store them in a database for retrieval. The parameter selection is as follows:

The whole system include two parts: database building and retrieval module establishment, the entire workflow of the RAG system can be summarized in the following diagram:

Img.2 Diagram of the RAG Q&A system composition and workflow

We did not choose to use a LLM based on local data here, but instead used the Deepseek-R1 model with strong inference capabilities called by APIs. ALL technology stack used in this system is as follows:

Document Processing: PyPDF
Embedding Model: Sentence-Transformers
Vector Database: FAISS
Large Language Model: DeepSeek-R1
Web Framework: Flask
Frontend: HTML & JavaScript

The codes of the RAG system can be accessed in our Software repository in our Gitlab: https://gitlab.igem.org/2025/software-tools/ouc-haide/-/tree/main/RAG_System?ref_type=heads.
you can also follow the latest progress of this project by accessing the code here: https://github.com/Xiaoyun-0922/ragllm.
Below is a demonstration of the RAG system's frontend and basic functionality:

Video 1. The video of showing the frontend and basic functionality of the RAG system

Model Testing and Evaluation

Model Evaluation Rules

Evaluating a RAG system essentially means assessing the quality of both retrieval and generation, which is difficult to benchmark using traditional model evaluation systems. Inspired by the KILT benchmark^[5], we designed testing and evaluation rules for our RAG system, which is a knowledge-intensive retrieval model for antimicrobial peptides:

1. All tests are based on a Fixed Knowledge Source, i.e., a pre-selected, closed set of literature on antimicrobial peptides. This ensures a controllable evaluation environment and reproducible results. We selected 6 papers on AMPs with activity against Staphylococcus aureus^{[6], [7], [8], [9], [10], [11]}, and 4 papers on AMPs active against different bacteria^{[12], [13], [14], [15]} as the test literature set. The selection and a brief introduction of the literature are shown in the table below (only showing the 6 papers on AMPs targeting Staphylococcus aureus):

Brief introduction to 6 papers on AMPs targeting Staphylococcus aureus — Tab.2 Brief introduction to 6 papers on AMPs targeting *Staphylococcus aureus*

2. Emphasize that Traceability is Key. The standard answer for each question in the test set must be clearly linked to a specific chunk ID in the knowledge base. This rigorously verifies the factual basis of the system's answers and prevents information hallucination. For the selected literature, we design a query based on its specific content, manually select the relevant chunks, and record their IDs.
3. To further measure the system's robustness and intelligence, we introduced Task Diversity by designing two different query tasks to simulate real research scenarios: retrieving property information based on the AMP's sequence or name, and retrieving the corresponding AMP sequence or name based on property features.
4. Finally, we adopt an End-to-End Evaluation perspective, simultaneously examining multiple stages from the accuracy of information retrieval to the faithfulness and relevance of the final answer generation. This allows us to precisely locate system bottlenecks and identify whether performance shortcomings originate from the retrieval or generation module.

Overall Performance Test

Following the prescribed test guidelines, we first conduct end-to-end testing on the given dataset. The test content follows the suggestions of our wet lab and selects the questions they are most concerned about: first, whether effective information can be extracted and summarized from a given piece of literature; second, whether the required antimicrobial peptide information can be retrieved from a massive database.

The test is divided into two groups to evaluate these two functions respectively. In this testing phase, we adopt an end-to-end perspective, evaluating only the overall performance of the RAG system. Therefore, the evaluation is based solely on the quality of the message output on the front end. Before the test, we have manually located the relevant information in the literature for comparison. The two groups of test conducted as follows:

Test Group 1:

To test first function, we used the natural language query:

"Find me antimicrobial peptides that have inhibitory effects on Staphylococcus aureus (S. aureus)"

as input to test the RAG system's ability to extract information from 6 different papers. The test results are shown in the figure below (Note: only the Q&A for the first four papers are shown, the other two had similar results; only one demonstration process is described in detail, others are similar):

Img.3 Example responses of the RAG system for 4 selected papers on different antimicrobial peptides

As can be seen from the figure, the LLM correctly identified keywords such as S. aureus and effects, and combined the context of the paper to provide important information including the mechanism of antimicrobial action and current limitations (if mentioned in the literature), perfectly meeting the previously required test standards. However, the LLM's output still has some minor issues:

1. Some bacterial names are misspelled (but still recognizable), which may be due to semantic matching problems during retrieval. The retrieved information was only summarized by the LLM without further evaluation of its correctness.
2. There are some formatting errors in the LLM's output. This might be a frontend configuration issue. Since this problem did not affect the results at the time, it was not modified during the testing phase, and the focus remained on backend optimization.

Test Group 2:

To test second function, We used the natural language query:

"Find me antimicrobial peptides that have inhibitory effects on E.coli."

as input to test the RAG system's response performance on 4 papers concerning the inhibitory effects of different antimicrobial peptides. As the input suggests, this test aimed to find AMPs effective against E. coli. The demonstration video of the test is as follows:

Video 2. The video of showing the response performance of the RAG system on 4 papers concerning the inhibitory effects of different antimicrobial peptides

As can be seen from the video, the LLM successfully identified the user's query intent. Verification proved that the RAG system correctly retrieved the correct information from the source data, meeting the previously required test standards.

Retrieval Quality Evaluation

Retrieval capability is the core of a RAG system, and the demand from wet labs for literature information retrieval is high. Therefore, the model's retrieval ability needs further evaluation. We can manually select a specific question from a paper, design a query based on it, manually select the relevant chunks, and then compare the retrieval results with the designated "gold standard." Here, we only perform text splitting on the 4 papers about AMPs with activity against different bacteria mentioned above. We set the overlap to 50, collecting a total of 238 chunk records. And based on the content of the paper with title of "Antimicrobial Peptide against Mycobacterium Tuberculosis That Activates Autophagy Is an Effective Treatment for Tuberculosis.", we designed a representative query:

"What is the mechanism of action of nisin against Staphylococcus aureus?"

And the relevant text block is as follows:

"...nisin, a bacteriocin produced by Lactococcus lactis subsp. lactis, inhibits bacterial growth by a dual mechanism. It binds to lipid II, a precursor of cell wall biosynthesis, thus preventing cell wall formation, and it also forms pores in the bacterial membrane causing leakage of cell content..."

Based on the specific content of the literature, we selected 6 chunks related to this query as the "gold standard." We then used the following three industry-recognized metrics to quantify retrieval performance—Hit Rate, Mean Reciprocal Rank, and Normalized Discounted Cumulative Gain. The specific formulas for these three statistics are as follows:

1: Hit Rate (@6): This is a basic binary evaluation metric that measures whether the retrieval system can find at least one relevant text block from the gold standard within the top 6 returned results. Its mathematical definition is as follows: \[ \text{HR}@6 = \begin{cases} 1, & \text{if there exists at least one gold standard chunk } i \text{ such that } \text{rank}(i) \le 6 \\ 0, & \text{otherwise} \end{cases} \] where $\text{rank}(i)$ is the rank of the gold standard chunk $i$ in the retrieval results. An HR@6 of 1 indicates a successful retrieval, while 0 means no relevant content was recalled in the top 6 results.
2: Mean Reciprocal Rank (MRR) aims to evaluate the system's ability to place the first correct answer high up in the ranking. For a single query, this metric calculates the reciprocal of the rank of the first gold standard text block ($\text{rank}_1$). If the first relevant block is ranked 1st, the score is 1; if it's 2nd, the score is 1/2, and so on. This mechanism rewards high-ranking hits much more than low-ranking ones. For a single query, its Reciprocal Rank (RR) is calculated as: \[ \text{RR} = \frac{1}{\text{rank}_1} \] where $\text{rank}_1$ is the rank of the first gold standard text block in the retrieval results. If no relevant block is found in the top 6 results, the RR for that query is 0. MRR is the arithmetic mean of the RR values for all queries in the test set. An MRR score approaching 1.0 indicates that the model consistently and efficiently places the most relevant answer at the top of the results list.
3: Discounted Cumulative Gain (nDCG@6) is a more refined and comprehensive evaluation metric that considers both the relevance level of all relevant items in the top 6 results and their positions. The calculation of nDCG is a two-step process:
1. 1. Calculate Discounted Cumulative Gain (DCG@6), which is a weighted sum of the relevance scores of the top 6 results, where the weight is a logarithmic discount function of the position $i$: $\frac{1}{\log_2(i+1)}$. This causes documents ranked lower to contribute progressively less to the total score. \[ \text{DCG}@6 = \sum_{i=1}^{6} \frac{rel_i}{\log_2(i+1)} \]
2. 2. Since the size of the gold standard set varies for different queries, their DCG scores are not directly comparable. Therefore, we normalize by dividing by the Ideal Discounted Cumulative Gain (IDCG@6). IDCG@6 is the maximum possible DCG score obtained by ranking all gold standard text blocks in perfect order (from highest to lowest relevance) in the top 6 positions. \[ \text{nDCG}@6 = \frac{\text{DCG}@6}{\text{IDCG}@6} \]
The final nDCG@6 is a value between 0.0 and 1.0, where a higher score indicates that the ranking quality is closer to the ideal state. An nDCG@6 of 1.0 means the system returned a perfect ranking.

With this theoretical foundation for evaluation in place, we proceeded with the actual test: First, we submitted the input to the LLM, which extracted the following keywords for retrieval:

'nisin', 'mechanism of action', 'staphylococcus aureus'

Then, the formatted query generated by the LLM was passed to the retriever for semantic similarity search. Unfortunately, no matter how we adjusted the prompt or changed the embedding model, none of the top 10 returned chunks matched our pre-defined "gold standard." Therefore, the Hit Rate, Mean Reciprocal Rank, and Normalized Discounted Cumulative Gain were all 0!

We wanted to investigate the retrieved information to determine the cause of this situation. Here is the information retrieved as the TOP 1 result:

h 3.35 kDa in size) in combination with colistin against six P . aeruginosa strains. They found that the MIC of nisin was reduced from 128 to 256 μg/mL to 16–32 μg/mL, the MIC of colistin was reduced from 0.5–8 μg/mL to 0.125–4 μg/mL, and the FICI was 0.375–0.625, displaying a synergistic antibacterial effect against four P . aeruginosa strains (Jahangiri et al., 2021). In a previous study, it was observed that nisin Z combined with antibiotics had synergistic effects on P . fluorescens

Note: Some spelling errors are caused by the tokenizer and generally do not affect the retriever or the LLM's performance.

The query keywords included "nisin," "mechanism of action," and "Staphylococcus aureus." This chunk clearly contains "nisin" (mentioned multiple times) and discusses its combined effect with antibiotics (such as MIC reduction and synergistic effects), which the embedding model might have considered semantically related to "mechanism of action." The LLM-enhanced retriever used these keywords to expand the query, helping the FAISS database retrieve chunks containing "nisin" and antibacterial-related terms. Although the score was 0 (possibly indicating it did not meet a threshold), it was still ranked highly because the embedding vector captured the similarity of "nisin" as a core entity and the potential association of "synergistic antibacterial effect" with a mechanism description.

The reason for the test failure is self-evident: on one hand, the frequent appearance of keywords in the literature interfered with the retrieval; on the other hand, semantic matching likely deviated, leading the retriever astray. Subsequent rounds of multi-faceted testing also revealed frequent issues such as misspellings of technical terms and incorrect detection results. For example, literature on antimicrobial peptides often mentions not only the bacteria that the peptide can inhibit but also other bacterial names in the experimental design. If an unexpectedly appearing peptide name happens to be the one the user wants to retrieve, relying solely on limited semantic matching based on keywords cannot effectively determine if the literature is relevant, which could lead to the inclusion of incorrect information.

For this new type of retrieval model, there are two optimization approaches: one is from the perspective of algorithms and parameter tuning, and the other is from the perspective of process and mechanism optimization. Both of these require us to not only understand the problem at a macro level but also to further grasp its internal mechanisms.

System Information Flow Modeling

To address the aforementioned issues, we can mathematically model the RAG system from the perspectives of information theory and probabilistic optimization, focusing on a unified framework for information retrieval and generation, a probabilistic modeling approach, and the theoretical basis for system optimization. First, we make the following assumptions about the entire system:

1. Probabilistic Decomposability Hypothesis: We assume that the complex reasoning process from a query $Q$ to a final evidence set $\mathcal{S}$ can be effectively decomposed into a three-stage probabilistic cascade model: $P(\mathcal{S}|Q,\mathcal{C}) = P_{\text{ann}} \cdot P_{\text{filter}} \cdot P_{\text{rerank}}$. This hypothesis asserts that these three stages are conditionally independent and that their probability product can accurately approximate the true relevance distribution.
2. Quantifiable Risk Hypothesis: We assume that in scientific literature retrieval, the costs of information omission (false negatives), $\mathcal{C}_{\text{FN}}$, and information irrelevance (false positives), $\mathcal{C}_{\text{FP}}$, can be explicitly quantified. Based on this, the entire retrieval decision process can be successfully abstracted as a Bayesian decision problem aimed at minimizing expected risk.
3. Objective Orthogonality Hypothesis: We assume that the system's total objective ($\mathcal{L}_{\text{total}}$) can be decomposed into three approximately orthogonal sub-objectives: ranking quality ($\mathcal{L}_{\text{listwise}}$), generation quality ($\mathcal{L}_{\text{generation}}$), and factual consistency ($\mathcal{L}_{\text{consistency}}$). This means we can perform joint optimization through a linear weighted sum of these objectives without considering complex non-linear coupling effects between them.

Based on these assumptions, we model the RAG system: given a user query $Q$ and a corpus $\mathcal{D}=\{d_i\}_{i=1}^N$ of $N$ documents, which is preprocessed into a set of document chunks $\mathcal{C}=\{c_j\}_{j=1}^M$. The system's goal is to generate an answer $A$ that maximizes the expected factual consistency $\texttt{rel}(A,c)$ supported by the evidence $c$ provided by a retrieval strategy $\pi$, while penalizing its redundancy $\texttt{cost}(A)$:

\[ A^* = \arg\max_{A\in\mathcal{A}} \left( \mathbb{E}_{c\sim\pi(\cdot|Q,\mathcal{C})}\big[\texttt{rel}(A,c)\big] - \lambda\,\texttt{cost}(A) \right) \]

where $\mathcal{A}$ is the set of all possible answers, and $\lambda$ is a trade-off hyperparameter. This formula reveals the core of RAG technology: the quality of the answer directly depends on the quality of the retrieval strategy $\pi$. We can further model the multi-stage retrieval process as a Probabilistic Cascade Model, where the output of a later stage is conditioned on the output of the previous one: given a query $Q$ and a document library $\mathcal{C}$, we design the retrieval scheme as follows: first, use an ANN algorithm to initially obtain a candidate set $\mathcal{S}_0$ of Top-$K_0$; then, filter $\mathcal{S}_0$ based on semantic similarity to get a filtered set $\mathcal{S}_1$ of Top-$K_1$; finally, perform fine-grained ranking on $\mathcal{S}_1$ using a domain-aware reranking function and truncate it to the final Top-$K$ evidence set $\mathcal{S}$.

The generation probability of the final evidence set $\mathcal{S}$ can be decomposed into the product of three conditional probabilities (a three-level Bayesian network):

\[ P(\mathcal{S}|Q,\mathcal{C}) = P_{\text{ann}}(\mathcal{S}_0|Q) \cdot P_{\text{filter}}(\mathcal{S}_1|\mathcal{S}_0,Q,\tau) \cdot P_{\text{rerank}}(\mathcal{S}|\mathcal{S}_1,Q,\phi) \]

where:

$P_{\text{ann}}(\mathcal{S}_0|Q)$: The process of Approximate Nearest Neighbor (ANN) search recalling the Top-$K_0$ candidate set $\mathcal{S}_0$ from $\mathcal{C}$. It is worth noting that a brute-force algorithm was used during the testing phase due to the small data size.
$P_{\text{filter}}(\mathcal{S}_1|\dots)$: For each document chunk $c_j$ in $\mathcal{S}_0$, decide whether to keep it based on a comparison of its similarity score $s(\mathbf{q},\mathbf{e}_j)$ with a threshold $\tau$, forming the filtered set $\mathcal{S}_1$.
$P_{\text{rerank}}(\mathcal{S}|\dots)$: Perform fine-grained ranking on $\mathcal{S}_1$ using a domain-aware reranking function $\phi(c;Q)$ and truncate it to the final Top-K evidence set $\mathcal{S}$.

This retrieval design ensures the validity of the retrieved information to the greatest extent possible through three screenings of varying strictness. It also implies that it is difficult to improve the system's recall rate through algorithmic improvements, which is a bottleneck of standard RAG technology.

Further modeling from an optimization perspective, for each "query-document chunk" pair $(Q, c_j)$ in the validation set, we introduce a binary variable $y_j \in \{0, 1\}$, which represents the true relevance label of the document chunk $c_j$ with respect to the query $Q$, also known as the "Gold Standard."

1: $y_j = 1$: Indicates, by manual judgment, that document chunk $c_j$ is relevant to query $Q$. For example, if the query is "MIC value of nisin" and the document chunk explicitly states "the MIC of nisin is 2 μg/mL."
2: $y_j = 0$: Indicates, by manual judgment, that document chunk $c_j$ is not relevant to query $Q$. For example, for the same query, the document chunk only discusses the discovery history of nisin and does not mention the MIC value.

It can be seen that while the original similarity score $s(\mathbf{q},\mathbf{e}_j)$ can provide a relative ranking of document chunks, its numerical value itself does not have a clear probabilistic meaning. For example, a score of 0.7 does not directly equate to a 70% probability of relevance. To make subsequent risk-based scientific decisions, we must Calibrate these arbitrary-scale scores into statistically meaningful posterior probabilities $p_j = P(y_j=1|s(\mathbf{q},\mathbf{e}_j))$.

We use the Platt calibration method, which maps the original score $s_j$ to the probability interval using a parameterized Sigmoid function:

$$ p_j = \sigma(a s_j + b), \quad \text{where} \quad \sigma(x)=\frac{1}{1+e^{-x}} $$

Here, the parameters $(a,b)$ are hyperparameters. The optimization process is essentially a parameter tuning process—we use our prepared validation set containing true labels $y_j$ to find the optimal $(a,b)$ values by minimizing the Negative Log-Likelihood loss function. This loss function measures the discrepancy between the model's predicted probability $p_j$ and the true label $y_j$:

\[ \min_{a,b}\sum_{j=1}^{M_{\text{val}}} \Big[-y_j\log p_j - (1-y_j)\log(1-p_j)\Big] + \gamma_R\|(a,b)\|_2^2 \]

In this formula:

1. When the true label $y_j=1$, the loss term is $-\log p_j$. To minimize the loss, the model must make the predicted probability $p_j$ approach 1.
2. When the true label $y_j=0$, the loss term is $-\log(1-p_j)$. To minimize the loss, the model must make the predicted probability $p_j$ approach 0.

Through this process, the resulting calibration function learns how to convert an original score (e.g., 0.7) into a more credible probability (e.g., 0.85), allowing $p_j$ to serve as a reliable estimate of the confidence that $c_j$ is relevant to $Q$.

Let's assume that in the antimicrobial peptide domain, the cost of introducing an irrelevant document chunk into the context (false positive) is $\mathcal{C}_{\text{FP}}$, and the cost of missing a critical piece of evidence (like an MIC value, a false negative) is $\mathcal{C}_{\text{FN}}$. According to Bayesian decision theory, to minimize the overall risk, a document chunk should be classified as relevant ($\hat{y}_j=1$) if and only if its calibrated probability $p_j$ satisfies the following condition:

\[ \hat{y}_j = \mathbb{I}\left[p_j \ge \frac{\mathcal{C}_{\text{FP}}}{\mathcal{C}_{\text{FP}}+\mathcal{C}_{\text{FN}}}\right] \]

Let $\theta^* = \frac{\mathcal{C}_{\text{FP}}}{\mathcal{C}_{\text{FP}}+\mathcal{C}_{\text{FN}}}$; this $\theta^*$ is the optimal probability threshold. We can then solve for the corresponding original score threshold $\tau^*$ by inverting the equation $\sigma(a\tau^*+b) = \theta^*$.

For the practical scenario of antimicrobial peptide retrieval, in a literature review task, the cost of missing one key paper is far greater than reading one extra irrelevant paper. If we set $\mathcal{C}_{\text{FN}}=5$ (high cost for false negatives) and $\mathcal{C}_{\text{FP}}=1$, then the optimal probability threshold is $\theta^*=1/6\approx0.167$. This means that as long as the probability of a document chunk being relevant exceeds 16.7%, it should be recalled by the system to prioritize high recall.

To further align the model's training with our domain-specific evaluation metrics, we can align the ranking objective with domain value. For any document chunk $c$, we define its Domain Gain as the weighted sum of the key information it contains:

\[ \text{gain}(c) = w_{\text{seq}}\mathbb{I}_{c\in\text{Seq}} + w_{\text{MIC}}\mathbb{I}_{c\in\text{MIC}} + w_{\text{mech}}\mathbb{I}_{c\in\text{Mech}} \]

where $w$ are preset weights and $\mathbb{I}$ is the indicator function. Based on this gain, we adopt weighted nDCG@K as the core evaluation metric:

\[ \text{nDCG@K}(\mathcal{S}) = \frac{1}{\text{IDCG@K}}\sum_{i=1}^K \frac{\text{gain}(c_i)}{\log_2(i+1)}, \quad c_i \in \mathcal{S} \]

To align the training process with this evaluation metric, we modify the classic Listwise ranking loss function by introducing domain gain as sample importance:

\[ \mathcal{L}_{\text{listwise}} = -\sum_{Q} \sum_{c_j \in \mathcal{S}_0} P(y_j=1|\mathcal{S}_0, Q) \log P_{\text{model}}(c_j|\mathcal{S}_0,Q) \]

where $P_{\text{model}}(c_j|\mathcal{S}_0,Q) = \frac{\exp(\phi(c_j;Q))}{\sum_{k \in \mathcal{S}_0}\exp(\phi(c_k;Q))}$, and the reranking function $\phi(c_j;Q)$ already includes the term $\text{gain}(c_j)$, thereby guiding the model to learn a ranking strategy that maximizes nDCG.

Furthermore, to address the "hallucination" problem of large language models, we incorporate constraints from the generation stage into the total objective function. Let the generated answer $A$ contain $L$ factual statements $\{r_k\}_{k=1}^L$. We train a differentiable discriminator $\mathcal{V}(r_k, \mathcal{S})$ to evaluate the degree to which statement $r_k$ is supported by the evidence set $\mathcal{S}$. The evidence consistency loss is then defined as:

\[ \mathcal{L}_{\text{consistency}} = \frac{1}{L}\sum_{k=1}^L \left( 1 - \mathcal{V}(r_k, \mathcal{S}) \right)^2 \]

This loss term penalizes all generated content that lacks evidentiary support. Finally, the system's total objective function (which can be trained end-to-end or alternately) is:

\[ \mathcal{L}_{\text{total}} = \mathcal{L}_{\text{listwise}} + \lambda_{\text{gen}}\mathcal{L}_{\text{generation}} + \lambda_{\text{consist}}\mathcal{L}_{\text{consistency}} \]

where $\mathcal{L}_{\text{generation}}$ is the standard autoregressive language model loss, and $\lambda$ are the weights for each loss term.

Discussion and Summary

Through the above modeling of the entire RAG system's information flow and by introducing a probabilistic model and a comprehensive optimization loss function, it is theoretically possible to optimize system performance through parameter tuning. In our modeling, we conceived a three-tiered screening process for the RAG system—ANN search, cosine similarity matching, and information reranking—which maximizes the optimization of the retrieval algorithm and reduces information loss and LLM hallucinations during the retrieval process. However, further optimization will be difficult to achieve.

For an LLM with RAG technology, constructing such a unidirectional retrieval-QA flow can effectively improve retrieval quality. However, looking back at the modeling process of the entire system's information flow, we find that the limitation of the RAG system comes from the unidirectional nature of the retrieval process itself: a fixed retrieval paradigm and the practice of compressing complex relevance information into a single low-dimensional vector^[16]. In contrast, when we search for information in real life, we always dynamically adjust our search strategy to optimize the search plan, which greatly increases the chances of finding relevant information in a database (if a match for the target query exists). Therefore, we will no longer attempt to optimize this inherently limited method.

At this point, we have generally achieved the function of retrieving antimicrobial peptide information from literature, but it still performs poorly for specific tasks. The RAG system is essentially a data-driven retriever. However, the workload of collecting antimicrobial peptide literature is far greater than that from databases, making it impractical to build a rich literature database. Although the performance of the RAG system was poor, it provided us with valuable optimization ideas.

AMP Research Agent

Solution Exploration

The rise of agents has brought us new solutions. The antimicrobial peptide Q&A RAG system we built essentially follows a static, linear workflow: Query -> Retrieve -> Generate. While this fixed pipeline model is efficient, it exposes its inherent limitations when dealing with the complex and variable information needs of the real world. It is precisely these limitations that form the fundamental motivation for our move towards a more advanced, agent-driven Agentic RAG paradigm.

The core of an Agentic RAG system is no longer a linear pipeline but a cyclical "Think-Act" framework (e.g., ReAct: Reason, Act). After retrieval, the Agent first "thinks": "Is the content I retrieved of high quality? Can it answer the question?" If the answer is no, it can take Action," such as rewriting the query (Reformulate Query), adjusting retrieval parameters, and then retrieving again, thereby achieving a logical closed loop of iterative improvement.

For the previous problem, for example, the Agentic RAG system's processing would be based on the ReAct framework^[17], iteratively solving the problem through a "Reason -> Act -> Observe" cycle. Below is a comparison of the mechanisms of a standard RAG system and an Agentic RAG system:

Img.4 Comparison of Standard RAG and Agentic RAG Mechanisms

We no longer focus solely on searching for antimicrobial peptide information by retrieving literature. Instead, we first find relevant information about the target antimicrobial peptide based on an AMP data table, and then let a general-purpose large model search for and summarize the relevant literature on that peptide. To this end, we found the open-source antimicrobial peptide dataset grampa.csv^[18], which covers almost all AMP sequences and their property information. By further processing the grampa.csv dataset, we retained parameters like MIC value and Target bacteria. This dataset provides sufficient data support for our Agentic RAG system's "Sequence to Property" and "Property to Sequence" functions. As for the task of searching for specific literature on a given peptide, the Function Calling capabilities of existing general-purpose large models can already solve this perfectly. We will not replicate this but will focus on improving the retrieval effectiveness of the former. We used the following technology stack to build this system:

Backend Framework: LangGraph
Vector Database: Weaviate
Embedding Model: BAAI/bge-large-en-v1.5
Language Model: Deepseek-R1
Frontend: HTML & JavaScript

Based on the exceptional capabilities of the Agentic RAG System for antimicrobial peptide retrieval, we named it the AMP Research Agent. We also optimized the previous frontend design. Below is a demonstration of a "hello" interaction with it:

Video 3. The video of showing how AMP Research Agent works when greeting it.

Main Functions Testing

This Agent primarily implements two functions:

1. Retrieving the properties of an antimicrobial peptide from its sequence.
2. Retrieving suitable antimicrobial peptide sequences based on their properties.

By implementing these two functions, we essentially meet all the wet lab's needs for AMPs information retrieval. Now, we will test these two functions separately: For the first function, we can select an AMP sequence from the known database and evaluate the function based on the accuracy and completeness of the retrieval. For the second function, we can select certain features of an antimicrobial peptide and examine whether the Agent can find the optimal sequence.

Function 1 Test:

Using the natural language:

"What is the property of the AMP with LPLLAGLAANFLPKIFCKITRK sequence?"

as input, the output we received through the front-end is shown in the following video:

Video 4. The video of showing output of function 1 test through the front end.

We checked the returned information, and it was completely consistent with the information in the database, indicating that this function has been perfectly implemented.

Function 2 Test:

Using the natural language:

"Find me an antimicrobial peptide with a sequence length of 22 or 23 and an inhibitory effect on Staphylococcus aureus (S. aureus)?"

as input, the output we received through the front-end is shown in the following video:

Video 5. The video of showing output of function 2 test through the front end.

Through multiple rounds of testing and checking, the returned information was almost completely identical to the real situation, indicating that this function has also been perfectly implemented, making further quantitative evaluation unnecessary.

It is worth mentioning that, based on professional experience, we set descriptions with strong semantics like "effect" to require an MIC value of less than 20 μM, while for descriptions with weaker semantics like "inhibitory," we relaxed the requirement to an MIC value of less than 50 μM. These rules were specified in the prompt. For the Agent's output, we can see that the sequence length was constrained to between 22 and 23, strictly conforming to our input. At the same time, the MIC values were sorted in ascending order until they just met the requirement for "effect."

Through the tests above, we have successfully implemented the "property-to-sequence" and "sequence-to-property" retrieval functions. The think-act mechanism of the ReAct framework has greatly enhanced the system's robustness. As for the literature retrieval function, general-purpose large models can already perfectly extract and summarize literature content, so we do not need to replicate it. In fact, the test question for function two is precisely a need from our wet lab. Finally, after comprehensive consideration of the top eight antimicrobial peptides and research from the wet lab, we ultimately chose the second peptide retrieved: Pexiganan. The specific reasons and criteria for this choice are outside the scope of the dry lab and will not be explained in detail here.

Designing Antimicrobial Peptides

Problem Analysis

The work described above successfully found a suitable antimicrobial peptide, Pexiganan, through the AMP Research Agent, which has been put into wet-lab experiments. However, since our project uses AMPs to treat diabetic wounds, the requirements for properties like toxicity and hemolytic concentration are extremely stringent. The lab's needs for AMPs will certainly change dynamically as the experiments progress, and we cannot guarantee that existing AMPs will fully meet our future needs. To meet such demanding requirements, we may need to design antimicrobial peptides from scratch. However, whether using traditional models like designing AMP sequences through chemical and physical methods, or using machine learning and deep learning-based models for AMP sequence generation and prediction, the success rate in experiments is typically very low^[19]. Currently, relatively effective autoregressive prediction models based on the Transformer framework often use attention mechanisms to capture local features of AMP sequences. However, the lack of interpretability in deep learning algorithms creates a natural gap between the model and wet-lab personnel: professionals cannot improve the model based on their own experience, and the information learned by the model cannot be understood by humans, often leading to a misalignment between dry and wet lab collaboration.

The most fundamental principle of deep learning models that rely solely on learning from AMP sequence information is to implicitly capture the local features of the sequence, which is similar to how humans accumulate experience in their brains. The deep learning model "digests" the learned, non-natural language experience into its model weights. On the training set, a machine learning model adjusts the model's mapping range through backpropagation (using common gradient descent algorithms) to approximate the peptide space from the parameter space as closely as possible. This method shows high stability and controllability. However, limited by the convergence speed of the algorithm and the risk of getting stuck in local optima, it is difficult to achieve excellent prediction performance with a limited dataset.

Experts in antimicrobial peptides often have a high sensitivity to AMP sequences. For example, when faced with an AMP sequence, they can draw conclusions like "The guanidinium group of Arg (pKa≈12) is more positively charged than the amino group of Lys (pKa≈10.5), which can significantly enhance binding to and killing of Gram-negative bacteria (like E. coli )", and thus judge some properties of the AMP based on experience. From a cognitive science perspective, this method of judgment is different from that of machine learning models. Instead, it "bypasses" complexity through high-level abstraction, pattern recognition, and building causal models, making it a very efficient method of judgment to some extent. Although this mechanism prevents humans from designing AMP sequences from scratch, it allows for accurate judgment of local features. However, the speed of human learning is limited, and it is impossible to enhance this ability by learning all AMP sequences for humans.

The innate reasoning ability and in-context memory of LLM provides us with a new approach. If a pre-trained LLM can learn this kind of experience at a very high speed and call upon this experience to assist in reasoning during relevant tasks, it would break the limitation of the high learning cost of this experience-driven method. This "human-like" learning method, especially in-context learning, exhibits different characteristics in terms of "convergence" speed compared to traditional gradient descent training. If we understand "convergence" as the "speed of acquiring new skills," then the in-context learning of an LLM is undoubtedly extremely fast. It is almost instantaneous because it does not require retraining the model parameters. This learning speed comes from the "phase transition" of the knowledge network: for a vast knowledge network like that of antimicrobial peptide sequences, knowledge points are scattered in the early stages of learning. As learning progresses, more and more knowledge points are connected. When the density and breadth of connections cross a critical point, the entire network undergoes a "phase transition"—from a series of isolated knowledge clusters to an intelligent system capable of comprehensive understanding and flexible application. New problems no longer require a slow search for paths within the network but can be quickly understood and answered by the network as a whole. The comparison of converging the model's mapping range from the parameter space to the peptide space using machine learning methods versus experience-driven LLM reasoning methods is shown in the figure below:

Img.5 Schematic diagram of convergence effects for machine learning methods and experience-driven LLM reasoning methods

It can be seen that using experience and the reasoning ability of an LLM to predict antimicrobial peptide properties has a high level of convergence but also extremely high instability. Therefore, it is almost impossible to design new antimicrobial peptide sequences using this method. Inspired by the design of the retrieval module in our previous modeling of the RAG system's information flow (where after ANN search and cosine similarity matching, the priority of the selected document chunks was reranked) and the CoT design of Molrag^[20], we creatively propose that the priority of antimicrobial peptides designed by machine learning can be reranked. For example, for 100 pre-generated candidate AMP sequences, reranking will likely place the peptides that better meet the user's needs at the top, which would greatly reduce the validation cost for the wet lab. This approach actually utilizes the stability and controllability of machine learning models: that is, first map the value range from the parameter space to a region close to the peptide space, and then use the "epiphany-like" convergence ability of the LLM to search within this region. This reduces its instability while still leveraging the high prediction level brought by its powerful knowledge network. The convergence effect is shown in the figure below:

Img.6 Schematic diagram of the convergence effect of combining machine learning methods with experience-driven LLM reasoning

Therefore, we have creatively proposed this method of combining machine learning and large language models to design antimicrobial peptide sequences. The former already has mature implementation methods, so we will not redundantly replicate them. Instead, we focus on implementing the reranking of pre-generated antimicrobial peptides by an LLM based on experience. This is a paradigm that processes sequence information entirely based on natural language, utilizing the reasoning ability of the LLM rather than attention mechanisms to learn implicit encoded information. However, such a massive learning task places high demands on the short-term context memory of the LLM, and the response speed will decrease as the context grows. Among them, the context generated during the LLM's learning process often contains a large amount of redundant information, and the actually effective information is not concise enough. The emergence of the Memory Agent brings us a solution—a Memory Agent can extract and store learned experience, dynamically update the experience library, and search for suitable experience to place in a temporary location in the System Prompt during actual tasks, thus achieving efficient experience management and invocation. Generally, it is easier to learn effective experience by comparing similar things. Therefore, we can have the Agent summarize general rules from the comparison of similar antimicrobial peptide sequences and store them as experience in the memory module. When handling actual tasks, it can then make more profound inferences based on the appropriate experience. For example, when the Agent is comparing:

these two antimicrobial peptide sequences with known properties, the experience it might learn could be:

"The 'K' to 'P' substitution in sequence 2 reduces its antimicrobial potency."

Thereafter, when encountering similar situations, it can directly call upon this type of experience. If a contradictory situation is encountered, it updates its own experience. When the Agent has learned enough experience, the knowledge network will "phase transition," and it can then effectively rerank the priority of pre-generated antimicrobial peptide sequences.

AMP Rerank Agent Construction

Based on the above theory, our project constructed an antimicrobial peptides priority reranking Agent. To build this model, we implemented a Memory Module, a Training Module, and a Reranking Module, and named this Agent the AMP Rerank Agent. Below are the detailed modeling and construction process

Memory Module Mathematical Modeling & Construction

Before mathematically modeling the system logic of the memory module, the following assumptions must be made about the entire system to ensure that the use of certain models in the modeling is reasonable:

1. Structural Knowledge Representation Hypothesis: We assume that the highly complex, implicit design knowledge in the antimicrobial peptide domain can be isomorphically mapped and represented as a discrete, symbolic knowledge graph $M_t = G_t = (V_t, E_t)$. This hypothesis asserts that the abstraction of nodes (entities) and edges (relations) is an effective and sufficient form for capturing this knowledge.
2. Markovian Knowledge Evolution Hypothesis: We assume that the dynamic evolution process of the memory module conforms to the Markov property. That is, the next state of the knowledge graph $M_{t+1}$ depends only on the current state $M_t$ and the newly received experience $\mathcal{E}_t$, and is independent of the history of states before $M_t$. This makes the knowledge update process local and computable.
3. Local Context Sufficiency Hypothesis: We assume that when performing any reasoning task $o_t$, there exists a relevant local subgraph $M_{r,t}$ within the vast global knowledge graph $M_t$ that matches the task and contains sufficient context to make a high-quality decision. The agent does not need to perform global reasoning over the entire knowledge base.
4. Emergent Reasoning from Symbolic Accumulation Hypothesis: We assume that through the continuous, discrete accumulation of nodes and edges in the knowledge graph via the $f_{\text{update}}$ function, once the scale and density of the graph exceed a certain critical point, the system will non-linearly emerge with higher-order, integrated reasoning capabilities, i.e., a "knowledge phase transition" will occur.

Based on the above assumptions, we can provide a reasonable mathematical representation for the memory module: We can model the Agent's memory as a dynamically evolving knowledge graph $M_t$, which learns and enhances itself during each sequence comparison event $t$. The core process can be represented as:

$$ M_t \xrightarrow{\text{1. Retrieval with } o_t} M_{r,t} \xrightarrow{\text{2. Reasoning with } \pi} a_t \xrightarrow{\text{3. Update with } \mathcal{E}_t} M_{t+1} $$

At each time step $t$, the system receives an observation $o_t$ and operates based on the current memory state $M_t$. $o_t$ is a sequence comparison task:

$$ o_t = (S_A, S_B, q_t) $$

where $S_A, S_B$ are the AMP sequences to be compared, and $q_t$ is the specific comparison query, such as whether a difference in amino acids at a certain position in the sequence affects its antimicrobial activity, which depends on how the system prompt is specified. $M_t$ is the AMP knowledge graph, which can be represented in the basic form of a graph:

$$ M_t = G_t = (V_t, E_t) $$

where $V_t$ is the set of nodes (representing AMPs, functional motifs, etc.), and $E_t$ is the set of edges (representing relationships between nodes). Each node $v \in V_t$ contains attributes, including sequence and property information.

In an actual task, the Agent will retrieve a relevant subset of contextual information $M_{r,t}$ from its long-term memory $M_t$ based on the current task $o_t$. After the action $a_t$ is executed, the system receives external feedback $F_t$. Part of this comes from the experience summarized by the LLM in the AMP sequence comparison task, and another part comes from the guidance of external experts or experimental data. This complete interaction is encapsulated into an experience tuple $\mathcal{E}_t$:

$$ \mathcal{E}_t = (o_t, a_t, F_t) $$

The memory update function $f_{\text{update}}$ is responsible for integrating this experience $\mathcal{E}_t$ into the memory, completing the state transition from $M_t$ to $M_{t+1}$:

$$ M_{t+1} = f_{\text{update}}(M_t, \mathcal{E}_t) $$

The update method of this function is the state update of nodes and edges in the knowledge graph, represented by the formulas:

$ V_{t+1} = V_t \cup \{v | v \text{ is a new entity in } \mathcal{E}_t \} $
$ E_{t+1} = E_t \cup \{e | e \text{ is a new relation in } \mathcal{E}_t \} $

Through theoretical modeling, we can see that the implementation of the memory module is actually through the state update of the knowledge graph nodes. For the engineering implementation of this module, we drew inspiration from memory-agent^[21].

Training Module Construction

Data Processing

We want the LLM to learn from higher-dimensional information to summarize more profound experiences. Therefore, the dataset needs to include more property parameters. However, the grampa.csv dataset only contains information such as antimicrobial peptide sequences and MIC values, lacking information on physicochemical properties. Here, we use the open-source Python library peptides to calculate the physicochemical properties of the sequences. This library can accurately calculate the corresponding physicochemical properties based on the antimicrobial peptide sequence. We use the z-scale system for antimicrobial peptide physicochemical properties, the contents of which are shown in the table below:

Tab.3 The z-scale physicochemical property system for antimicrobial peptides

In the new dataset, only the antimicrobial peptide sequence, MIC value, and physicochemical properties are retained for training.

It is generally believed that comparing similar sequences is more likely to yield effective information. Therefore, we cluster the antimicrobial peptide sequences in the database. Sequences that are sufficiently similar are grouped together, and each group is used for one training session. The training process is actually a process of accumulating experience, and this process is defined by the system prompt. Therefore, prompt engineering is a crucial part of the AMP Design Agent.

For the practical workflow, if every two sequences were to be compared, then for $n$ sequences, a total of $C_{2}^{n} = \frac{n(n-1)}{2}$ comparisons would be needed, with a time complexity of $O(n^2)$. In reality, during the training phase, what the model learns is the relationship between local features and representations of the sequence. We can have the LLM analyze each sequence individually to evaluate it. If we were to force pairwise comparisons (although we assume the generative model will map the input to the true distribution of antimicrobial peptides), once we encounter sequences with large similarity differences, the comparison could become ineffective.

Training and Test Set Construction

Based on our previous discussion, the Agent's training is actually achieved by guiding the LLM to compare similar antimicrobial peptide sequences, linking local differences with property differences, and thus summarizing useful experience. Therefore, the training data must meet two points: each group of training data must cover at least two sufficiently similar sequences; and each piece of data should completely include the property information specified in the "Data Processing" section.

For the processed grampa dataset, we considered using a sequence alignment method with biological evolutionary significance, such as BLAST, to cluster the entire database based on sequence similarity. However, because BLAST uses a local alignment algorithm (often the Smith-Waterman algorithm), it often clusters antimicrobial peptide sequences that are similar in a certain local region but may differ greatly overall (in terms of arrangement and length). Below is an example of a cluster from the grampa dataset after similarity scoring with BLAST and clustering with the DBSCAN algorithm (showing some properties):

Tab.4 Example of clustering results for antimicrobial peptides using the BLAST algorithm for similarity scoring

The reason the BLAST algorithm gives these two sequences a high score is that they share certain local similarities. However, such a combination of clusters with large overall and multiple local feature differences makes it difficult for the LLM to learn useful experience through comparison, even if this clustering has biological significance. In fact, our goal is not to cluster biologically similar sequences, but to find more sequences with high overall similarity for comparison. Therefore, we considered using another sequence comparison method. After comparative research, we found that a simple algorithm like Cosine Similarity is actually more suitable for our current situation, here is the calculation formula of cosine similarity:

$$ \cos(\theta) = \frac{A \cdot B}{|A||B|} = \frac{\sum_{i=1}^{n} A_i B_i}{\sqrt{\sum_{i=1}^{n} A_i^2} \sqrt{\sum_{i=1}^{n} B_i^2}} $$

This algorithm happens to score similarity based on the external representation of the sequence and performs excellently in evaluating overall similarity. Below is an example of a cluster from the grampa dataset after similarity scoring using cosine similarity and clustering with DBSCAN:

Tab.5 Example of clustering results for antimicrobial peptides using cosine similarity for similarity scoring

This cluster result contains four data points, corresponding to two different types of antimicrobial records for two different antimicrobial peptides. With similar effects, we obtained a total of 14,018 training groups. To facilitate subsequent evaluation of the model's performance and to compare it with general machine learning models, we found a project on GitHub for predicting the MIC value of antimicrobial peptides^[22]. This project uses ProtBERT as a base (a BERT variant designed for protein sequences) and adds a custom regression network layer on top (including linear layers, activation functions, and Dropout) to output the final pMIC value (the logarithmic form of MIC). This project also provides a test dataset. After checking for duplicates in the grampa dataset, we selected 9 antimicrobial peptide sequences as the "gold standard" for testing.

Model Training

The model training process is actually a process of accumulating experience, and this process relies entirely on the LLM itself. By setting an appropriate Chain-of-Thought (CoT) in the System Prompt, the LLM can be guided to complete the tasks of sequence comparison and experience summarization. Its core task is: first, to provide basic guidance and record some expert experience, which ensures that the LLM starts its work from a high starting point. Then, through a rigorous chain-of-thought analysis process (including analyzing substitution sites, functional motifs, physicochemical properties, etc.), it distills general, transferable sequence design rules. Finally, it must output these profound insights in a strict JSON format to provide high-quality prior knowledge for subsequent reranking tasks. We sincerely thank Mr. Pengfei Cui for providing professional knowledge as a reference for constructing the system prompt, and the following is a schematic diagram of the prompt engineering conducted during this training process:

Prompt Engineering Schematic for the Training Module — Img.7 Prompt engineering schematic for the training module

By defining such a CoT, the Agent can be trained effectively. For example, for similar sequences:

"GVFTLIKGATQLIGKTLGKELGKTGLELMACKITNQC" "GIFSLIKGAAKVVAKGLGKEVGKFGLDLMACKVTNQC"

The training module of our Agent, through comparison and analysis, summarizes the following experience (partial):

(1)"Position 3: V→F substitution increases hydrophobicity and potential membrane interaction." (2)"Position 5: T→S substitution may slightly reduce rigidity and alter local hydrophilicity." (3)"Maintain a balance between hydrophobicity and cationicity to optimize membrane interaction and selectivity." (4)"Strategically place aromatic residues towards the C-terminal to enhance membrane anchoring."

This training process places high demands on the LLM's reasoning ability. Here, we chose Deepseek-R1 as basic LLM. However, during actual testing, we found that the training process consumes a large number of tokens, leading to high costs. Therefore, we only selected 134 cluster groups as the training set for the Agent. The Agent ultimately learned 1948 pieces of experience.

Reranking Module Construction

The reranking task is to reorder the priority of antimicrobial peptide sequences pre-generated by a machine learning model based on the user's needs. This process is based on the experience learned by the Agent during training. To ensure that the reranking task can fully compare sequences and call upon experience, we designed the following CoT:
1. Sequence Analysis -> 2. Experience Retrieval -> 3. Final Ranking.

First, the system does not immediately compare all sequences. It analyzes each candidate antimicrobial peptide sequence one by one to prepare the basic data for the subsequent ranking.

1. Calculate Physicochemical Properties: For each sequence, key biophysical parameters are first calculated using the Peptide function, including:

1. Length
2. Net Charge
3. Hydrophobicity
4. Hydrophobic Moment
5. Isoelectric Point

2. Obtain Expert-level Analysis: The calculated physicochemical properties are filled into a prompt template in the System Prompt. This prompt requests the LLM to act as an antimicrobial peptide expert and conduct a comprehensive evaluation of the potential of a single sequence, including:

1. Predicting its possible antimicrobial mechanism.
2. Analyzing its structure-activity relationship (SAR).
3. Predicting its possible range of target bacterial species.
4. Evaluating its structural stability, synthesis feasibility.

The output of this stage is a detailed, independent analysis report for each sequence. Next, the system calls the experience database to provide data support for the ranking. The Agent will perform a similarity search in the experience database to find the N most similar sequences in history and extract their analysis data and design insights. The purpose of this is to learn from history and provide context and justification for the current ranking.

Finally, the system aggregates all the information collected in the first two stages and passes it to a prompt specifically responsible for ranking to make the final decision. This step places high demands on the LLM's reasoning ability, so we also use the Deepseek-R1 model to handle such tasks. The LLM is asked to conduct a comprehensive evaluation and ranking of all sequences based on the user's needs according to the above prompt, and finally to produce a formatted output. This process is a systematic, multi-dimensional, data-supported decision-making process. It is not just a simple sorting by a single physicochemical parameter, but an imitation of a domain expert's workflow: first, independently evaluate each option, then look up relevant literature and historical data, and finally, conduct a comprehensive, well-reasoned trade-off based on the user's needs to arrive at the final ranking. The entire reranking module can be summarized in the figure below:

Img.8 Workflow diagram of the Reranking Module

With this, we have completed the construction of the entire AMP Rerank Agent. The technology stack used for the model includes:

LLM & Framework: Deepseek-R1, LangChain + LangGraph
Embedding & Similarity: BAAI/bge-large-en-v1.5, Cosine Similarity
Clustering/ML: DBSCAN, scikit-learn / SciPy
Frontend: FastAPI (REST API), Uvicorn (ASGI Server)

Model Testing

Overall Performance Test

Now, we select 10 antimicrobial peptide sequences from the test set and set a specific task to drive the Agent to rank the priority of these ten peptides. Although the data selected from the database does not guarantee that the ten peptides have high similarity, considering the cost of wet-lab validation, we chose to build a "gold standard" using existing antimicrobial peptides instead of designing new ones with a sequence generation model. The test question is designed as follows:

I need to prioritize these antimicrobial peptides for E. coli treatment. Please analyze and rank all 10 sequences considering antimicrobial potency, selectivity, and safety profile:
    
            1. KWKLFKKIGKFL
            2. FLPAIAGMAAKFLPKIFCAISKKC  
            3. GNNRPVYIPQPRPPHPRI
            4. FLSTLWNAAKSIF
            5. GWGSIFKHGRHAAKHIGHAAVNHYL
            6. GVFTLIKGATQLIGKTLGKELGKTGLELMACKITEQC
            7. LVKDNPLDISPKQVQALCTDLVIRCMCCC
            8. GLPRKILCAIAKKKGKCKGPLKLVCKC
            9. GFGCPGDAYQCSEHCRALGGGRTGGYCAGPWYLGHPTCTCSF
            10. YPELQQDLIARLL

The reranking results and frontend display are shown in the video below:

Video 6. The video of showing the reranking results and frontend display

Quantitative and Comparative Testing

Next, we use the aforementioned test set for quantitative and comparative analysis: Since the selected machine learning model can only predict MIC values, we also have the AMP Rerank Agent rank according to MIC values. The table below shows the rankings from the "gold standard," the machine learning model's prediction, and the AMP Rerank Agent's prediction:

Table of rankings from the Gold Standard, Machine Learning Prediction, and AMP Rerank Agent Prediction — Tab.6 Table of rankings from the "Gold Standard," Machine Learning Prediction, and AMP Rerank Agent Prediction

We calculate the nDCG@9 for the ranking results of both the Machine Learning Method and the AMP Rerank Agent. The results are as follows:

Tab.7 nDCG@9 results for the Machine Learning Method and AMP Rerank Agent rankings

Limited by the amount of experience data and the quality of the CoT, the AMP Rerank Agent's score is slightly lower than the machine learning model's. However, the progressiveness and interpretability of the AMP Rerank Agent still indicate its great value and potential. For example, the AMP Rerank Agent's explanation for the peptide it ranked first is (the thought process is not explicitly output): "Strong antimicrobial activity with good safety profile; stable amphipathic structure, facilitates membrane interaction; synthesis is feasible but slightly complex due to length." This shows that although our Agent has not formed a complete knowledge graph, it possesses powerful interpretability. In the future, model performance can be improved by continuing to optimize the CoT and conducting larger-scale training until the knowledge network achieves a "phase transition."

Then we want to test whewher the AMP rerank Agent truly understands the antimicrobial peptide sequence, if so, it can easily distinguish between AMPs sequences and non-AMPs sequences. We found 50 AMP sequences which is different from training set, and randomly make 50 non-AMPs sequences (using the dictionary of antimicrobial peptide sequences to randomly generate combinations, it can be assumed that the probability of the randomly combined sequence being an antimicrobial peptide sequence is 0). Then we simply modify the prompt to instruct it to classify AMP sequences and non-AMP sequences. Now we just use backend to test this function, the confusion matrix of this test is shown below:

Tab.8 Confusion matrix of the AMP Rerank Agent's classification of AMPs and non-AMPs sequences

From the confusion matrix, we can calculate the accuracy of the AMP Rerank Agent's classification of AMPs and non-AMPs sequences is 61%. This is mainly because we just conducted a Zero-shot learning, which means that the model must complete the classification task based on known information or prior knowledge before it is specifically trained to handle this classification task. there is no experience to help Agent classify. So we should try Few-shot learning to improve the performance. Now we use the same data to conduct Few-shot learning, we have tried 50-shot and 100-shot, and choose the latter which has a better conclusion as the final test result. the confusion matrix of this test is shown below:

The table of accuracy, precision, recall, and F1 score of the AMP Rerank Agent's classification of AMPs and non-AMPs sequences after Few-shot learning is shown below:

Tab.10 Accuracy, precision, recall, and F1 score of the AMP Rerank Agent's classification of AMPs and non-AMPs sequences after Few-shot learning

From the results, we can see that the accuracy, precision, recall, and F1 score of the AMP Rerank Agent's classification of AMPs and non-AMPs sequences after Few-shot learning are all significantly improved. This shows that Few-shot learning is effective in improving the performance of the AMP Rerank Agent's classification of AMPs and non-AMPs sequences.

The probability histogram of the AMP Rerank Agent's classification of AMPs and non-AMPs sequences after Few-shot learning is shown below:

From the probability histogram, we can see that the probability of the AMP Rerank Agent's classification of AMPs and non-AMPs sequences after Few-shot learning is significantly improved.

Through testing, we can see that the LLM with a memory module can indeed understand the antimicrobial peptide sequence, which also explains the good results of our previous antimicrobial peptide priority re-ranking task.

Summary

Our core work in this section is the development of the innovative AMP Rerank Agent. It leverages the reasoning capabilities of a LLM to learn and accumulate human-understandable "design experiences" in natural language by comparing existing antimicrobial peptide sequences. It then applies this accumulated knowledge to intelligently re-prioritize a set of candidate AMP sequences, recommending the most promising ones for experimental validation.

This work critically fills the interpretability gap that exists between computational design and wet-lab experimentation. Traditional machine learning models operate as "black boxes," making their design logic incomprehensible and creating a significant cognitive and collaborative barrier between dry and wet labs. Our method renders the model's decision-making process completely transparent, aligning the AI's reasoning with the domain expertise of human scientists and enabling an unprecedented level of deep integration.

More importantly, our research contributes to the next generation of intelligent drug design by establishing a new, experience-driven AI design paradigm. We provide not only a practical tool that immediately reduces experimental costs and accelerates the R&D cycle, but also a foundational framework for the future. This interpretable knowledge base can serve as an "intelligent constraint" or "design guide" for more advanced de novo generative models, directing them to create novel molecules that are not only effective in silico but also more plausible based on biochemical principles.

Automated Data Analysis

Problem Analysis

In our wet lab, we encountered an issue where the rheometer malfunctioned, causing the machine-recorded data to have missing and abnormal values (specifically, the data for the gelation temperature of L-HBC). This presented significant difficulties for the wet-lab personnel when processing this data. Additionally, for some data collection, the corresponding equipment was not configured, and the demand for data analysis in the wet lab remains high. Building an end-to-end automated data analysis tool could greatly alleviate the lab's data analysis challenges.

Our future intelligent control system for the robotic arm also has a high demand for such an automated data analysis tool. This would allow the robot to move from offline processing to online, real-time intelligent control. Combined with this tool, a well-designed system architecture and effective use of ROS could enable the construction of an advanced automated system capable of autonomously identifying wounds, analyzing their condition, and applying medication with precision.

Data Analysis Agent Establishment

An automated data analysis tool must possess these three characteristics: 1. It must fully understand the user's needs and extract key information from natural language input, which is nearly impossible for hard-coded tools. 2. It must have a thorough understanding of data science-related knowledge and be able to translate practical problems into visible code, raising the ceiling of data analysis capabilities. 3. It must have a module for executing code and analyzing conclusions, a feature that ensures the end-to-end application scenarios of this automated tool.

An AI Agent has a natural advantage in implementing these functions. Therefore, we developed a Data Analysis Agent to assist the wet lab with data analysis and subsequent hardware development. The theoretical and engineering implementation of this Agent is much simpler than the two mentioned above because a fixed workflow can be orchestrated to fully achieve its functionality.

When a user proposes a task, for example, "Please handle the outliers and missing values in this high-throughput screening data," the agent's "Planner" node is first activated. It breaks down the high-level instruction into a series of specific sub-tasks, such as "identify data distribution," "select an appropriate outlier detection algorithm," "execute a missing value imputation strategy," etc., and constructs an initial execution plan. This plan defines the flow logic between subsequent nodes. According to the plan, the "Code Generator" node is called, which is responsible for generating corresponding Python code snippets for each sub-task. This code utilizes Python's powerful standard libraries for specific data operations. Finally, the results of the code execution are summarized by the LLM and output in a fixed report format. A "Report Generator" tool then creates a file in a fixed format and returns it to the user. The following is a schematic diagram of the Agent's workflow:

Img.11 Data Analysis Agent Workflow Diagram

In engineering, we still chose LangGraph as the development framework because its characteristic of connecting nodes by building a graph is naturally suited for orchestrating workflows. Since we are only developing locally and have not designed multi-modal functions, data can only be input in natural language form, but this has almost no effect on the results. Below is the technology stack used for this model:

Frontend: FastAPI (REST API), Uvicorn (ASGI Server)
Large Model & Orchestration: LangGraph, LangChain, Deepseek-R1
Data Processing & Statistics: pandas, numpy, SciPy, scikit-learn, statsmodels
Visualization: matplotlib, seaborn

Model Testing

General Test

Below is a regression task related to antimicrobial peptides as a general test, data is randomly created based on linear relationships:

I have antimicrobial peptide data for 10 sequences. Please analyze the correlation between peptide charge and MIC values. Here's my data:
    
    Peptide A: length 12 amino acids, charge +3, MIC against S. aureus is 8.5 μM
    Peptide B: length 18 amino acids, charge +6, MIC against S. aureus is 4.2 μM  
    Peptide C: length 8 amino acids, charge +2, MIC against S. aureus is 15.3 μM
    Peptide D: length 15 amino acids, charge +5, MIC against S. aureus is 6.1 μM
    Peptide E: length 10 amino acids, charge +4, MIC against S. aureus is 7.8 μM
    Peptide F: length 22 amino acids, charge +7, MIC against S. aureus is 3.9 μM
    Peptide G: length 14 amino acids, charge +3, MIC against S. aureus is 9.2 μM
    Peptide H: length 16 amino acids, charge +5, MIC against S. aureus is 5.7 μM
    Peptide I: length 11 amino acids, charge +2, MIC against S. aureus is 12.4 μM
    Peptide J: length 20 amino acids, charge +8, MIC against S. aureus is 2.8 μM
    
    Please calculate correlation coefficients and provide statistical analysis with recommendations.

The output of the Data Analysis Agent in the front end is shown in the video below:

Video 7. The video of showing the Data Analysis Agent's output

As shown in the video, Data Analysis Agent analyzed the user's needs, transformed the requirements and data into executable code, obtained the results by calling the code executor through function calling, and judged whether the results met the user's needs. Finally, the analysis process, executable code, and results were written into a report. The report generated by the Agent is as follows (It is actually a markdown file, but it is presented here in PDF format, the next report is also the same):

From the report, it can be seen that the Data Analysis Agent can effectively assist users in completing data analysis and generating corresponding reports. Subsequent improvements can focus on addressing some of the irregularities in the report and implementing functions such as inserting images.

Real-world Task Test

To test the effectiveness of the Data Analysis Agent in assisting wet-lab experiments, we selected a data analysis task actually encountered in the lab for testing. For example, in the rheology data for the gelation temperature of L-HBC, due to differences in instrument signal sensitivity in different physical states of the sample, the Storage modulus, Loss modulus, and the Tan delta calculated from them are most likely to have missing and abnormal values (in our data, some Storage modulus and Loss modulus values were negative). This usually occurs in the low-temperature liquid phase before gelation or due to data spikes caused by sample slippage after gelation.

The core idea for processing this data is that it must be treated as a continuous time-series data with a strong physical trend, and methods that maintain its intrinsic patterns should be used for repair. The specific steps are as follows: First, obvious outliers like negative moduli should be identified and removed, marking them as missing data. Second, for a small number of isolated missing points, the optimal and most direct method is Spline Interpolation, as it can smoothly fit the non-linear curve of the gelation process, preserving the physical reality of the data to the greatest extent. We can use the Data Analysis Agent to process this data. The model input is designed as follows:

My task is to identify and handle anomalous values in this dataset. According to physical principles, the Storage modulus (G') of a material cannot be negative. Therefore, any negative values in the first column should be treated as outliers, likely caused by instrument noise when the sample's response was too weak to measure accurately.
    
    Please perform the following tasks:
    1.  Identify all rows where the 'Storage modulus (G')' is negative.
    2.  Explain briefly why these values are considered anomalies.
    3.  Recommend a robust data processing strategy (e.g., a specific type of interpolation) to replace these anomalies, justifying why your chosen method is appropriate for continuous time-series data from a physical process.
    4.  Do not perform the calculation, just provide the methodology.
    
    Here is the data:
    0.26361, 0.91175
    0.16369, 0.90036
    0.051607, 0.90807
    -0.01929, 0.91641
    0.057022, 0.91002
    ……

The report generated by the Agent is as follows:

In this report, the agent understood the user's needs, adopted the correct interpolation method to handle outliers, and successfully generated a comprehensive report.

Summary

To address the common and time-consuming challenge of data analysis in wet-lab experiments, we developed the Data Analysis Agent. Traditionally, researchers must dedicate significant time to learning programming or manually processing raw experimental data, which often contains missing values and anomalies. This diverts their focus from core scientific inquiry.

Our Data Analysis Agent functions as an automated data scientist. It understands high-level analysis requests given in natural language and autonomously breaks them down into a series of specific, executable steps. The agent then automatically generates, executes, and verifies the code required for data cleaning, statistical analysis, model fitting, and visualization. Finally, it consolidates the entire process and its findings into a clear, formatted report.

The core contribution of this work is that it liberates scientists from the complex and tedious tasks of programming and data processing, fundamentally transforming the "data-to-knowledge" pipeline. By reducing analysis tasks that would typically take hours or even days to mere minutes, the agent significantly accelerates the iterative cycle of scientific discovery. This allows researchers to dedicate their valuable time and energy to more creative scientific thinking and experimental design, effectively bridging the gap between raw data collection and profound scientific insight.

AMPilot Construction

Although the aforementioned AMP Research Agent, AMP Rerank Agent, and Data Analysis Agent each perform excellently in their respective domains—information retrieval, sequence optimization, and data analysis—they still fail to fully simulate the complete lifecycle of a real research project when operating independently. A successful research project is not a simple accumulation of isolated tasks, but a dynamic, iterative closed loop from hypothesis proposal, candidate screening, experimental validation, to data feedback.

To break down the functional barriers between these independent agents and achieve seamless integration and intelligent collaboration in the research process, we have integrated them into a unified Multi-Agent System, coordinated by a high-level scheduler. We have named it AMPilot, implying that it can act like an experienced "co-pilot," providing comprehensive intelligent navigation for antimicrobial peptide research.

The engineering implementation of AMPilot simply requires the addition of a Manager Agent. This Agent's role is to answer general questions and perform intent recognition. The latter is achieved through an LLM-based Router, which allows the LLM to understand the user's purpose and summarize it into an appropriate input to be passed to the corresponding Agent. Here, we did not use a single large graph to connect all the agents, but rather a soft connection like API calls. Below is a schematic diagram of the AMPilot architecture:

Img.10 Multi-Agent System Architecture Diagram

In this way, AMPilot integrates previously fragmented steps into an intelligent, efficient, and continuously learning organic whole, truly achieving a deep fusion of dry and wet labs. You can access the AMPilot code in the Software repository in our Gitlab: https://gitlab.igem.org/2025/software-tools/ouc-haide/-/tree/main/AMPilot?ref_type=heads.
you can also follow the latest progress of this project by accessing the code here: https://github.com/Xiaoyun-0922/AMPilot/tree/Liu .

In addition, we have not only added a startup method for the entire system, but also added debugging interfaces for each Agent, providing developers with convenient deployment methods and secondary development conditions.

Summary

To address the problem of antimicrobial peptide retrieval, our project improved the RAG System into a Research Agent, solving the issues of high cost and poor performance of traditional AMP retrieval methods. Wet-lab researchers previously had to spend a significant amount of time manually screening suitable antimicrobial peptides from multiple databases and vast amounts of literature, a process that was tedious and prone to missing key information. Our AMP Research Agent automates this process. By specifying our requirements, we ultimately found the antimicrobial peptide that met our needs: Pexiganan.

To address the problem of designing antimicrobial peptides, although our project did not design new AMP sequences within the competition period, it provided a solution for our future needs. The AMP Rerank Agent addresses the bottlenecks of "low hit rate" and "lack of interpretability" in AMP design. The success rate of de novo AMP design is extremely low, and traditional machine learning models act like a "black box," with their design logic being incomprehensible and unusable for wet-lab experts, leading to a disconnect between dry and wet labs. Our AMP Rerank Agent innovatively solves this problem. It does not directly generate sequences but relies on the LLM's own reasoning ability and the dynamic experience storage function of its memory module to intelligently prioritize candidate antimicrobial peptides. This can guide wet-lab personnel to prioritize validating the most likely successful peptides, directly reducing the high cost of trial and error, and bridging the gap between computational design and experimental experience with its transparent decision-making process.

To address the tedious data analysis work in the wet lab, we built the Data Analysis Agent to solve the "tedious and time-consuming" task of experimental data analysis. Processing raw data containing missing values and outliers from experiments, and performing statistical analysis and visualization, usually requires programming knowledge, which occupies researchers' valuable time that should be spent on scientific thinking. Our Data Analysis Agent liberates scientists from this dilemma. It can understand analysis tasks, automatically generate code to complete data cleaning, model fitting, and report generation, reducing what originally took hours or even days of data processing work to just a few minutes, significantly accelerating the process of converting raw data into scientific insights.

Further Discussion

The AMPilot project has successfully demonstrated the immense potential of applying a Multi-Agent System, particularly one based on LangGraph's cyclical "think-act" framework, to the retrieval, design, and data analysis of antimicrobial peptides. By combining Agentic RAG, LLM-based sequence reranking, and automated data analysis, AMPilot is not just a toolset, but a collaborative intelligent system aimed at accelerating the dry-lab/wet-lab feedback loop. However, while acknowledging our current achievements, we recognize that the system still has several limitations, which point to several key directions for future evolution.

1. From Experience-based Reasoning to Hybrid Modeling: Enhancing the Robustness of AMP Reranking

The core innovation of the current AMP Rerank Agent lies in using an LLM to learn natural language "experiences" from sequence comparisons. The advantage of this method is its interpretability—researchers can directly understand the basis of the model's decisions. However, the accuracy of this qualitative reasoning is highly dependent on the LLM's internal knowledge and reasoning capabilities, and may be biased when faced with highly complex or counter-intuitive sequence-function relationships.

A key next step is to build a Hybrid Evaluation Model. We can combine the LLM's qualitative reasoning as high-level features with traditional, quantitative prediction models based on physicochemical properties (such as models based on graph neural networks (GNNs) or Transformer encoders). For example, the LLM could propose the hypothesis that "adding aromatic residues to the C-terminus may enhance membrane anchoring," while a quantitative model could precisely calculate the specific impact of this change on parameters like hydrophobic moment and charge distribution. This collaborative model of "qualitative insight + quantitative validation" will greatly improve the reliability and scientific rigor of the ranking results.

2. Closing the Dry-Wet Lab Loop: Introducing Active Learning and Experimental Feedback

AMPilot currently serves primarily as an in-silico prediction and analysis platform. Its true value will be maximized through tight coupling with wet-lab experiments. Our next envisioned step is to establish an Active Learning framework.

In this framework, AMPilot would not only rank candidate sequences but also, based on its internal model's uncertainty, actively propose the most informative sequences for wet-lab validation. The experimental results (such as actual MIC values, hemolysis data) would serve as high-quality feedback, directly integrated into the AMP Rerank Agent's memory module to correct or reinforce its existing "experiences." This closed-loop system will enable AMPilot to continuously learn and evolve from real experimental data, gradually narrowing the gap between computational prediction and physical reality, thereby fundamentally improving the hit rate of antimicrobial peptide design.

3. Beyond Reranking to Generation: Achieving True De Novo Design

The current Rerank Agent operates on an existing pool of candidate sequences. The ultimate goal of antimicrobial peptide design is the De Novo Generation of entirely new sequences with specific properties. The structure-function "experiences" accumulated by AMPilot provide a solid foundation for this.

A future AMP Design Agent could evolve into a constrained generative model. The "design rules" learned by the LLM in the memory module (e.g., "maintain a balance between hydrophobicity and cationicity") could serve as constraints or guiding signals for a generative model (such as a Generative Adversarial Network (GAN) or a diffusion model) to explore a more optimal sequence space. The agent will no longer be just a "referee" but will become a "designer" capable of actively creating new molecules based on user requirements (e.g., "design a highly effective, low-toxicity peptide against E. coli") and its own learned knowledge.

4. Deep Integration and Autonomous Planning: From Multi-Agent to a Unified Scientific Agent

Currently, the three agents in AMPilot work together under the coordination of a Manager. The future goal is to achieve a deeper level of knowledge sharing and dynamic task planning among them. For example, if the Data Analysis Agent, while analyzing historical experimental data, discovers that "a certain class of sequences tends to aggregate in a specific buffer (manifesting as data anomalies)," this insight should be able to dynamically update the retrieval strategy of the Research Agent or the ranking criteria of the AMP Rerank Agent.

Ultimately, we envision AMPilot evolving into a more ambitious Autonomous Science Agent. It would not only answer questions, rank sequences, and analyze data but also be able to proactively propose scientific hypotheses, design validation experiments, interpret experimental results, and plan the next steps of research based on those results. At that point, AMPilot will no longer be just a "helper" in the lab, but a partner working alongside researchers, capable of driving the entire scientific discovery process.

References

[1] Goles, M., Daza, A., Cabas-Mora, G., Sarmiento-Varón, L., Sepúlveda-Yañez, J., Anvari-Kazemabad, H., Davari, M. D., Uribe-Paredes, R., Olivera-Nappa, Á., Navarrete, M. A., & Medina-Ortiz, D. (2024). Peptide-based drug discovery through artificial intelligence: Towards an autonomous design of therapeutic peptides. Briefings in Bioinformatics, 25(4), bbae275. https://doi.org/10.1093/bib/bbae275
[2] Jiménez-Luna, J., Grisoni, F., & Schneider, G. (2020). Drug discovery with explainable artificial intelligence. Nature Machine Intelligence, 2(10), 573-584. https://doi.org/10.1038/s42256-020-00236-4
[3] Boiko, D. A., MacKnight, R., Kline, G., & Gomes, G. (2023). Emergent autonomous scientific research capabilities of large language models. Nature, 624(7992), 596-602. https://doi.org/10.1038/s41586-023-06798-z
[4] Lewis, P., Perez, E., Piktus, A., Petroni, F., Karpukhin, V., Goyal, N., Küttler, H., Lewis, M., Yih, W., Rocktäschel, T., Riedel, S., & Kiela, D. (2020). Retrieval-augmented generation for knowledge-intensive nlp tasks. In H. Larochelle, M. Ranzato, R. Hadsell, M.F. Balcan, & H. Lin (Eds.), Advances in Neural Information Processing Systems, 33 (pp. 9459-9474). Curran Associates, Inc.
[5] Petroni, F., Piktus, A., Fan, A., Lewis, P., Yazdani, M., Dehghani, M., Kiela, D., Séaghdha, D. Ó., Min, S., & Riedel, S. (2021). KILT: A benchmark for knowledge intensive language tasks. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (pp. 2523–2544). Association for Computational Linguistics.
[6] Mohamed, M. F., Hamed, M. I., Pan, C., Wei, Y., & Ele-Dij-A, M. M. (2022). An amphipathic peptide, A24, combats multidrug-resistant Staphylococcus aureus persisters and biofilms. Frontiers in Microbiology, 13, 993421.
[7] Boonsri, B., Ristl, R., Egerbacher, M., Windhager, R., & Holinka, J. (2020). In vitro activity of nisin A against prosthetic joint infection-associated staphylococci. Journal of Global Antimicrobial Resistance, 22, 815-822.
[8] Avitabile, C., D'Andrea, L. D., Saviano, M., Olivieri, M., Gaglione, R., Vellecco, V., ... & Romanelli, A. (2013). The bactericidal activity of temporin analogues against methicillin‐resistant Staphylococcus aureus (MRSA). Journal of Peptide Science, 19(11), 690-700.
[9] Di Somma, A., Avitabile, C., Cirillo, A., Morello, S., Romanelli, A., & Gaglione, R. (2020). The antimicrobial peptide temporin G: anti-biofilm, anti-persister, potentiates tobramycin. Cellular and Molecular Life Sciences, 77, 4563-4577.
[10] Pompilio, A., Crocetta, V., Pomponio, S., Di Vincenzo, V., Fiscarelli, E., & Di Bonaventura, G. (2008). Antimicrobial peptide LL-37 is bactericidal against Staphylococcus aureus biofilms. Antimicrobial Agents and Chemotherapy, 52(12), 4539-4542.
[11] Mishra, B., Reiling, S., Zarena, D., Sylla, M., Holzgrabe, U., & Friedland, K. (2020). Evaluation of Short LL-37-Derived Peptides and Their Hybrids with Other Antimicrobial Peptides Alone and in Combination with Vancomycin against Staphylococcus aureus. International Journal of Molecular Sciences, 21(19), 7175.
[12] Chen, X., Su, S., Yan, Y., Yin, L., & Liu, L. (2023). Anti-Pseudomonas aeruginosa activity of natural antimicrobial peptides when used alone or in combination with antibiotics. Frontiers in Microbiology, 14, 1239540. https://doi.org/10.3389/fmicb.2023.1239540
[13] Zhu, C., Zhao, Y., Zhao, X., Liu, S., Xia, X., Zhang, S., Wang, Y., Zhang, H., Xu, Y., Chen, S., Jiang, J., Wu, Y., Wu, X., Zhang, G., Bai, Y., Hu, J., Fotina, H., Wang, L., & Zhang, X. (2022). The Antimicrobial Peptide MPX Can Kill Staphylococcus aureus, Reduce Biofilm Formation, and Effectively Treat Bacterial Skin Infections in Mice. Frontiers in Veterinary Science, 9, 819921. https://doi.org/10.3389/fvets.2022.819921
[14] Keeratikunakorn, K., Aunpad, R., Ngamwongsatit, N., & Kaeoket, K. (2024). The Effect of Antimicrobial Peptide (PA-13) on Escherichia coli Carrying Antibiotic-Resistant Genes Isolated from Boar Semen. Antibiotics, 13(2), 138. https://doi.org/10.3390/antibiotics13020138
[15] Peláez Coyotl, E. A., Barrios Palacios, J., Muciño, G., Moreno-Blas, D., Costas, M., Montiel Montes, T., Diener, C., Uribe-Carvajal, S., Massieu, L., Castro-Obregón, S., Ramos Espinosa, O., Mata Espinosa, D., Barrios-Payan, J., León Contreras, J. C., Corzo, G., Hernández-Pando, R., & Del Rio, G. (2020). Antimicrobial peptide against Mycobacterium tuberculosis that activates autophagy is an effective treatment for tuberculosis. Pharmaceutics, 12(11), 1071. https://doi.org/10.3390/pharmaceutics12111071
[16] Weller, O., Boratko, M., Naim, I., & Lee, J. (2025). On the theoretical limitations of embedding-based retrieval. arXiv.
[17] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., & Cao, Y. (2023). ReAct: Synergizing reasoning and acting in language models. In Eleventh International Conference on Learning Representations, ICLR 2023, Kigali, Rwanda, May 1-5, 2023.
[18] grampa.csv dataset. Retrieved from https://github.com/gwct/grampa/tree/main
[19] Porto, W. F., Irwin, J. J., Franco, O. L., & Gatica-Andrades, M. (2021). Computer-Aided Design of Antimicrobial Peptides: Are We Generating Effective Drug Candidates? Frontiers in Microbiology, 12, 668800. https://doi.org/10.3389/fmicb.2021.668800
[20] Xian, Z., Gu, J., Li, L., & Liang, S. (2025). MolRAG: Unlocking the power of large language models for molecular property prediction. In W. Che, J. Nabende, E. Shutova, & M. T. Pilehvar (Eds.), Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) (pp. 15513–15531). Association for Computational Linguistics. https://doi.org/10.18653/v1/2025.acl-long.755
[21] memory-agent. Retrieved from https://github.com/langchain-ai/memory-agent/tree/main
[22] AMP_regression_EC_SA. Retrieved from https://github.com/janecai0714/AMP_regression_EC_SA