RAG

A Retrieval-Augmented Generation System for the SPADE Database

Pure Era RAG

Introduction

Pure Era RAG System

Pure Era RAG is a retrieval-augmented generation (RAG) system specifically designed for the retrieval of antimicrobial peptide (AMP) data. By integrating dense vector retrieval with semantic re-ranking capabilities, our system enables researchers to quickly identify the most relevant peptide sequences based on natural language queries.

RAG Model Construction

Model Structure

Figure 1: Model structure

Collection: Python crawlers & APIs
Format: JSON
Updates: Regular intervals
QC: Automated filtering

Data Preparation

Data quality determines the final performance of the model. We downloaded 46,876 antimicrobial peptide records from APD3 [1], DRAMP [2], LAMP [3], and DBAASP [4], using either Python web crawlers or the databases' APIs. Each record includes sequence information, activity information, and various physicochemical properties, and is saved in JSON format.

References:
  1. Wang, G., Li, X., & Wang, Z. (2016). APD3: The antimicrobial peptide database as a tool for research and education. Nucleic Acids Research, 44(D1), D1087–D1093. https://doi.org/10.1093/nar/gkv1278
  2. Ma, T., Liu, Y., Yu, B., Sun, X., Yao, H., Hao, C., Li, J., Nawaz, M., Jiang, X., Lao, X., & Zheng, H. (2025). DRAMP 4.0: An open-access data repository dedicated to the clinical translation of antimicrobial peptides. Nucleic Acids Research, 53(D1), D403–D410. https://doi.org/10.1093/nar/gkae1046
  3. Ye, G., Wu, H., Huang, J., Wang, W., Ge, K., Li, G., Zhong, J., & Huang, Q. (2020). LAMP2: A major update of the database linking antimicrobial peptides. Database, 2020, baaa061. https://doi.org/10.1093/database/baaa061
  4. Pirtskhalava, M., Amstrong, A. A., Grigolava, M., Chubinidze, M., Alimbarashvili, E., Vishnepolsky, B., Gabrielian, A., Rosenthal, A., Hurt, D. E., & Tartakovsky, M. (2021). DBAASP v3: Database of antimicrobial/cytotoxic activity and structure of peptides as a resource for development of new therapeutics. Nucleic Acids Research, 49(D1), D288–D297. https://doi.org/10.1093/nar/gkaa991

To keep picking up newly published antimicrobial peptide information, we deployed a scheduled crawling module on the server. It automatically synchronizes data with APD3, DRAMP, and the other sources at regular intervals.

Data Processing Flow

First, we convert the structured JSON data into a standard text format (Figures 2 and 3) for subsequent embedding generation and semantic retrieval. During conversion, we keep only a set of key fields and filter out uninformative values such as "no", "unknown", and null. Special fields get a structured conversion: the list of biological activities becomes numbered entries, UniProt entries are reduced to their ID and URL, and literature metadata (author, title, and so on) is written in a uniform format. Complex nested structures are expanded into readable text, which preserves semantic integrity while removing redundant information and gives the embedding model standardized input. The input format is:

JSON Data Format
"SPADE_N_00001": {
        "SPADE ID": "SPADE_N_00001",
        "Peptide Name": "Variacin (Bacteriocin)",
        "Source": "Micrococcus varians (Gram-positive bacteria)",
        "Family": "Belongs to the lantibiotic family (Class I bacteriocin)",
        "Gene": "Not found",
        "Sequence": "GSGVIPTISHECHMNSFQFVFTCCS",
        "Sequence Length": 25,
        "UniProt Entry": "No entry found",
        "Protein Existence": "Protein level",
        "Biological Activity": [
            "Antimicrobial",
            "Antibacterial",
            "Anti-Gram+"
        ],
        "Target Organism": "Gram-positive bacteria:Lactobacillus helveticus, L. bulgaricus, Lactobacillus lactis, Lactobacillus delbrueckii, Lactobacillus acidophilus, Lactobacillus plantarum, Lactobacillus sake (LSK), Lactobacillus curvatus, Leuconostoc mesenteroides, Streptococcus thermophilus, Lactococcus lactis (SL2), Enterococcus faecalis, Enterococcus faecium, Listeria innocua, Listeria monocytogenes, Listeria welhia. Note:Inhibitory activity tested with supernatant adjusted to pH 7.",
        "Hemolytic Activity": "No hemolysis information or data found in the reference(s) presented in this entry",
        "Cytotoxicity": "Not found",
        "Binding Target": "Not found",
        "Linear/Cyclic": "Linear",
        "N-terminal Modification": "Not included yet",
        "C-terminal Modification": "Not included yet",
        "Stereochemistry": "L",
        "Structure Description": "Not found",
        "Formula": "C118H175N31O36S4",
        "Mass": 2732.1,
        "PI": 5.98,
        "Net Charge": 1,
        "Hydrophobicity": 0.45,
        "Half Life": "Mammalian:30 hourYeast:>20 hourE.coli:>10 hour",
        "Function": "Has a broad host range of inhibition against Gram-positive food spoilage bacteria. Variacin is resistant to heat and pH conditions from 2 to 10.",
        "Biophysicochemical properties": "Variacin is resistant to heat and pH conditions from 2 to 10.",
        "Literature": [
            {
                "Title": "Variacin, a new lanthionine-containing bacteriocin produced by Micrococcus varians comparison to lacticin 481 of Lactococcus lactis.",
                "Pubmed ID": "8633879",
                "Reference": "Appl Environ Microbiol. 1996 May;62(5)1799-1802.",
                "Author": "Pridmore D, Rekhif N, Pittet AC, Suri B, Mollet B.",
                "URL": "http://www.ncbi.nlm.nih.gov/pubmed/?term=8633879"
            }
        ],
        "Frequent Amino Acids": "SCF",
        "Absent Amino Acids": "ADKLORUWY",
        "Basic Residues": 2,
        "Acidic Residues": 1,
        "Hydrophobic Residues": 9,
        "Polar Residues": 15,
        "Positive Residues": 2,
        "Negative Residues": 1,
        "Similar Sequences": [
            {
                "SPADE_ID": "SPADE_N_06161",
                "Similarity": 1.0,
                "Sequence": "MTNAFQALDEVTDAELDAILGGGSGVIPTISHECHMNSFQFVFTCCS"
            },
            {
                "SPADE_ID": "SPADE_N_00041",
                "Similarity": 0.88,
                "Sequence": "KGGSGVIHTISHECNMNSWQFVFTCCS"
            },
            {
                "SPADE_ID": "SPADE_N_06089",
                "Similarity": 0.88,
                "Sequence": "MKEQNSFNLLQEVTESELDLILGAKGGSGVIHTISHECNMNSWQFVFTCCS"
            }
        ]
    }

The output format is:

Text Output Format
[Peptide Name] Variacin (Bacteriocin)
[Source] Micrococcus varians (Gram-positive bacteria)
[Family] Belongs to the lantibiotic family (Class I bacteriocin)
[Sequence] GSGVIPTISHECHMNSFQFVFTCCS
[Sequence Length] 25
[Protein Existence] Protein level
[Biological Activity] 1. Antimicrobial 2. Antibacterial 3. Anti-Gram+
[Target Organism] Gram-positive bacteria:Lactobacillus helveticus, L. bulgaricus, Lactobacillus lactis, Lactobacillus delbrueckii, Lactobacillus acidophilus, Lactobacillus plantarum, Lactobacillus sake (LSK), Lactobacillus curvatus, Leuconostoc mesenteroides, Streptococcus thermophilus, Lactococcus lactis (SL2), Enterococcus faecalis, Enterococcus faecium, Listeria innocua, Listeria monocytogenes, Listeria welhia. Note:Inhibitory activity tested with supernatant adjusted to pH 7.
[Linear/Cyclic] Linear
[Stereochemistry] L
[Formula] C118H175N31O36S4
[Mass] 2732.1
[PI] 5.98
[Net Charge] 1
[Hydrophobicity] 0.45
[Half Life] Mammalian:30 hourYeast:>20 hourE.coli:>10 hour
[Function] Has a broad host range of inhibition against Gram-positive food spoilage bacteria. Variacin is resistant to heat and pH conditions from 2 to 10.
[Biophysicochemical properties] Variacin is resistant to heat and pH conditions from 2 to 10.
[Literature]
 1. [Title]: Variacin, a new lanthionine-containing bacteriocin produced by Micrococcus varians comparison to lacticin 481 of Lactococcus lactis.; [Pubmed ID]: 8633879; [Reference]: Appl Environ Microbiol. 1996 May;62(5)1799-1802.; [Author]: Pridmore D, Rekhif N, Pittet AC, Suri B, Mollet B.; [URL]: http://www.ncbi.nlm.nih.gov/pubmed/?term=8633879
[Frequent Amino Acids] SCF
[Absent Amino Acids] ADKLORUWY
[Basic Residues] 2
[Acidic Residues] 1
[Hydrophobic Residues] 9
[Polar Residues] 15
[Positive Residues] 2
[Negative Residues] 1
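
The conversion from the JSON record to the text format above can be sketched roughly as follows. This is a minimal illustration and not the project's actual code: the names `SKIP_FIELDS`, `PLACEHOLDERS`, `is_invalid`, and `record_to_text` are hypothetical, and the list of placeholder strings is inferred from which fields are missing in the example output. The UniProt handling (keeping only ID and URL) is left out for brevity.

```python
# Fields dropped entirely. Assumed from the example output, which omits them.
SKIP_FIELDS = {"SPADE ID", "Similar Sequences"}
# Placeholder strings treated as "no information". This list is an assumption.
PLACEHOLDERS = {"", "no", "unknown", "not found", "no entry found", "not included yet"}


def is_invalid(value):
    """Return True for null values, empty containers, and placeholder strings."""
    if value is None:
        return True
    if isinstance(value, str):
        text = value.strip().lower()
        return text in PLACEHOLDERS or text.startswith("no hemolysis information")
    if isinstance(value, (list, dict)):
        return len(value) == 0
    return False


def record_to_text(record):
    """Flatten one peptide record (a dict) into '[Field] value' lines."""
    lines = []
    for field, value in record.items():
        if field in SKIP_FIELDS or is_invalid(value):
            continue
        if isinstance(value, list) and all(isinstance(v, dict) for v in value):
            # Nested entries (e.g. Literature): one numbered line per entry.
            lines.append(f"[{field}]")
            for i, entry in enumerate(value, start=1):
                parts = "; ".join(
                    f"[{k}]: {v}" for k, v in entry.items() if not is_invalid(v)
                )
                lines.append(f" {i}. {parts}")
        elif isinstance(value, list):
            # Simple lists (e.g. Biological Activity): numbered inline entries.
            numbered = " ".join(f"{i}. {v}" for i, v in enumerate(value, start=1))
            lines.append(f"[{field}] {numbered}")
        else:
            lines.append(f"[{field}] {value}")
    return "\n".join(lines)


example = {
    "SPADE ID": "SPADE_N_00001",
    "Peptide Name": "Variacin (Bacteriocin)",
    "Gene": "Not found",
    "Sequence Length": 25,
    "Biological Activity": ["Antimicrobial", "Antibacterial", "Anti-Gram+"],
    "Literature": [{"Pubmed ID": "8633879", "URL": None}],
    "Net Charge": 1,
}
print(record_to_text(example))
```

Numeric values such as a net charge of 0 are deliberately kept, since only nulls, empty containers, and placeholder strings count as missing in this sketch.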
            

Language Embedding

We use Qwen3-Embedding-4B to convert each peptide's text record into a dense 2560-dimensional vector. This turns the biological description (amino acid sequence, antibacterial activity, physicochemical properties, and other functional characteristics) into a numerical form that captures its semantics and can be compared computationally.

All embedding vectors are organized in a FAISS index, which supports similarity search in milliseconds. We use the IVFFlat index structure, which partitions the vector space into clusters so that a query is compared only against vectors in the nearest clusters, significantly speeding up nearest-neighbor search. In the embedding space, for example, peptides associated with Gram-negative bacteria lie close to each other, while peptides with high hemolytic activity lie relatively far from them.

During search, the system first runs a fast coarse screening in the FAISS index to identify candidate peptides and then ranks those candidates by cosine similarity. Combined with GPU acceleration, this hierarchical strategy lets the system answer user queries in real time across thousands of peptide entries. For a query such as "finding peptides that are effective against Escherichia coli and have low hemolytic activity", the system can interpret the underlying requirements and pick the most suitable results, providing an intelligent search foundation for antimicrobial peptide research and discovery.
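
To make the coarse-then-fine idea behind an IVF-style index concrete, here is a toy pure-Python sketch. It does not use the FAISS API and is not the project's implementation: `ToyIVFIndex`, the fixed centroids, and the 3-dimensional vectors are all hypothetical stand-ins (FAISS learns its centroids by clustering during training, and the real embeddings have 2560 dimensions).

```python
import math


def normalize(v):
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


class ToyIVFIndex:
    """Inverted-file index: each vector is stored in the cell of its nearest centroid."""

    def __init__(self, centroids):
        self.centroids = [normalize(c) for c in centroids]
        self.cells = [[] for _ in centroids]  # one inverted list per centroid

    def _nearest_cells(self, v, n):
        order = sorted(
            range(len(self.centroids)),
            key=lambda i: dot(v, self.centroids[i]),
            reverse=True,
        )
        return order[:n]

    def add(self, item_id, vector):
        v = normalize(vector)
        cell = self._nearest_cells(v, 1)[0]
        self.cells[cell].append((item_id, v))

    def search(self, query, k=3, nprobe=1):
        q = normalize(query)
        # Coarse step: only look inside the nprobe closest cells.
        candidates = [e for c in self._nearest_cells(q, nprobe) for e in self.cells[c]]
        # Fine step: rank the surviving candidates by cosine similarity.
        scored = sorted(((dot(q, v), item_id) for item_id, v in candidates), reverse=True)
        return [(item_id, score) for score, item_id in scored[:k]]


index = ToyIVFIndex(centroids=[[1, 0, 0], [0, 1, 0]])
index.add("pep_A", [0.9, 0.1, 0.0])
index.add("pep_B", [0.8, 0.3, 0.1])
index.add("pep_C", [0.1, 0.9, 0.2])
print(index.search([1.0, 0.2, 0.0], k=2, nprobe=1))
```

With `nprobe=1` the query never touches `pep_C`, because it lives in a different cell. Raising `nprobe` trades speed for recall, which is the same trade-off an IVFFlat index exposes.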

Retrieval and Re-ranking

In the re-ranking stage, the Qwen3-Reranker-4B model performs a deeper semantic evaluation. It receives formatted query-document pairs and computes a relevance score from the difference between the output logits of the "true" and "false" tokens. We also added 25 keywords related to antimicrobial peptides (such as MIC, hemolytic, and gram positive) to improve the model's handling of domain terminology.
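
One common way to turn two such logits into a score is a two-way softmax, which reduces to a sigmoid of the logit difference. The sketch below assumes that formulation. The function name `relevance_score` and the example logit values are hypothetical, and the exact scoring used in the system may differ.

```python
import math


def relevance_score(true_logit, false_logit):
    """Probability of the "true" token under a softmax over the two logits.

    exp(t) / (exp(t) + exp(f)) equals sigmoid(t - f), so only the
    difference between the two logits matters.
    """
    diff = true_logit - false_logit
    # Numerically stable sigmoid: never exponentiate a large positive number.
    if diff >= 0:
        return 1.0 / (1.0 + math.exp(-diff))
    e = math.exp(diff)
    return e / (1.0 + e)


# Hypothetical logits for two query-document pairs.
print(relevance_score(4.2, 0.7))   # strongly relevant pair
print(relevance_score(-1.5, 2.0))  # weakly relevant pair
```

Because the score depends only on the difference, adding the same constant to both logits leaves the result unchanged.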

25 Keywords in AMP

Figure 2: 25 keywords in the field of antimicrobial peptides

With GPU acceleration, the model scores candidate peptides in parallel batches and judges how well each one matches the query. Unlike simple keyword matching, the re-ranking model can take biological context and implicit requirements into account, handling compound conditions such as "low hemolytic activity but effective against Escherichia coli". Finally, the system sorts the candidates in descending order of relevance score so that the returned peptide sequences best match the user's intent, improving the accuracy and practical value of the retrieval results.
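
The batch-and-sort step can be sketched as below. The `rerank` function and its `score_batch` parameter are hypothetical names. `overlap_scorer` is only a word-overlap stand-in so the example runs without a model; in the real system the scores would come from the reranker model instead.

```python
def rerank(query, candidates, score_batch, batch_size=8):
    """Score (peptide_id, text) candidates in batches and sort by score, descending.

    score_batch is any callable that maps a list of (query, text) pairs
    to a list of floats of the same length.
    """
    scored = []
    for start in range(0, len(candidates), batch_size):
        batch = candidates[start:start + batch_size]
        scores = score_batch([(query, text) for _, text in batch])
        scored.extend((score, pid, text) for score, (pid, text) in zip(scores, batch))
    scored.sort(key=lambda item: item[0], reverse=True)
    # Return the score together with the original text for later inspection.
    return [{"id": pid, "score": score, "text": text} for score, pid, text in scored]


def overlap_scorer(pairs):
    """Stand-in scorer (NOT the reranker): fraction of query words found in the text."""
    scores = []
    for query, text in pairs:
        q_words = set(query.lower().split())
        scores.append(len(q_words & set(text.lower().split())) / len(q_words))
    return scores


candidates = [
    ("p1", "high hemolytic activity"),
    ("p2", "low hemolytic activity against E. coli"),
    ("p3", "antifungal peptide"),
]
for hit in rerank("low hemolytic activity", candidates, overlap_scorer, batch_size=2):
    print(hit["id"], round(hit["score"], 2), hit["text"])
```

Passing the scorer in as a callable keeps the batching and sorting logic independent of whichever model produces the scores.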

Technical Features


Our system uses a dual-model architecture: the embedding model quickly retrieves candidate results, and the reranker model then re-sorts them to ensure retrieval accuracy. For domain adaptation, we added AMP-specific keywords to strengthen the tokenizer's handling of specialist terms. The FAISS index makes retrieval efficient, with vector similarity search in milliseconds. The results are interpretable: each one comes with its relevance score and the original text, which makes follow-up analysis and verification easier for researchers. The whole system is an end-to-end pipeline, from the original JSON data to the final retrieval results, that runs fully automatically and improves research efficiency.