
Introduction

With the rapid expansion of probiotic research, a vast amount of strain-related data is now scattered across scientific literature, patents, and industry reports. Because there is no integrated, easily searchable platform, researchers spend significant time identifying strain-disease associations, evaluating evidence quality, and comparing candidate probiotics. This delays wet lab workflows and risks overlooking promising research leads, which constrains the efficiency of probiotic research, design, and application.

To address this bottleneck, we envisioned "ProbiEase", an AI Platform of Probiotic Design for Diseases (http://probiease.qscn.online/). We aim to combine a high-quality probiotic database with LLM-driven semantic search, so that the platform can both extract valuable insights from existing data and research papers and interact with users in natural language. This will streamline the probiotic strain screening process and guide wet lab planning, helping researchers, clinicians, and industry professionals make rapid, scientifically informed decisions.

iGEM Software Criteria

For your convenience, we have summarized our work against iGEM's software criteria. However, we strongly recommend reviewing all of our deliverables for a complete picture.

1. How well is the software compatible with, and does it leverage, existing synthetic biology standards (e.g. SBOL, other RFCs, data formats, etc.)?

We primarily store data as natural language. The integrated LLM can convert between this natural language data and standards such as SBOL (Synthetic Biology Open Language) in both directions, although the current implementation focuses on probiotic-specific data modeling (strain characteristics, disease associations, and evidence levels). The data architecture was designed with extensibility in mind, so direct compatibility with standards like SBOL can be added in the future through metadata mapping and schema alignment. Additionally, the four-tier evidence grading system (levels 1–4) we employ aligns with the scientific validation framework commonly used in bioengineering, which supports interoperability between this platform, iGEM, and other data-driven tools in synthetic biology.

2. Was this software validated by experimental work?

Yes, we have collected a large number of peer-reviewed papers and used them to build a probiotic-disease association database, which provides guidance for wet lab team members in strain selection. For example, the platform recommends E. coli Nissle 1917, Lactobacillus plantarum WCFS1, and Enterococcus faecium as the preferred candidate strains for treating Parkinson's disease, and these strains are subsequently prioritized for laboratory testing. The tight feedback loop between software predictions and wet lab validation ensures that the platform delivers actionable insights with clear biological significance.

3. Can the software be useful to other projects?

Yes, the software can play a role in various projects and assist researchers in strain selection. ProbiEase is particularly suitable for iGEM teams and academic laboratories engaged in microbiome engineering, gut health therapies, or synthetic biology projects involving bacterial chassis. Teams can leverage the platform's knowledge graph for hypothesis generation, use its evidence-based grading system for risk assessment, and intuitively explore data through the natural language interface, thereby accelerating early-stage research processes for a variety of bio-design challenges.

4. How well can the software be integrated with external tools/software applications (including APIs, packages, etc.)?

The software features externally callable APIs, a Python-based backend, and a decoupled multi-service architecture, enabling seamless integration with other software packages or services. Our RESTful API exposes key endpoints such as strain data, disease associations, and knowledge graph queries, allowing bioinformatics workflows or machine learning models to programmatically access the data. The Flask-based backend is compatible with common scientific Python libraries (e.g., pandas, NumPy, scikit-learn), facilitating data export and analysis. Additionally, the Dockerized microservices architecture ensures smooth deployment in cloud environments or CI/CD workflows, allowing ProbiEase to be used either as a standalone tool or integrated into larger computational biology platforms.

5. Is the software user-friendly?

Yes, the software comes with detailed installation/deployment documentation. The web client provides comprehensive descriptions of the software's features and usage methods. The intuitive interface includes guided user tutorials, interactive walkthroughs, and preset question suggestions in the AI Chat module, significantly lowering the barrier for non-technical users. Features such as clickable edges in the knowledge graph (displaying evidence levels and source papers) and strain information cards offer immediate context, while the natural language Q&A system enables users to retrieve complex data without needing to understand the database structure or API calls. The dual support for both web and desktop clients further enhances accessibility across different user preferences and environments.

6. How well is the software written and documented for future groups to extend and improve? (This could include code documentation, comments in the code, performance evaluation, architecture diagrams, installation and execution scripts, etc.)

The source code includes detailed inline comments and README documentation explaining its functionality and usage. The project contains clear architecture diagrams (such as microservice layouts and data flows), Docker configuration files, and step-by-step guides for both lightweight and full deployments. The code is organized using Flask blueprints and modular React components, which improves maintainability and makes new features easier to add. Comprehensive documentation covers API specifications, RAG integration, and front-end customization, so subsequent developers can extend the platform (for example, by adding new data types, integrating multimodal inputs, or connecting to external databases) with minimal learning effort.

Roadmap

To achieve the above vision, we have developed a plan with progressively ambitious goals, starting from complex, high-threshold data and gradually moving toward simple, barrier-free natural language interaction:

Target 1: Create a High-Quality Dataset

Systematically collect literature from peer-reviewed sources and, through a combination of manual and large-model cross-reading and verification, compile a comprehensive and reliable dataset on probiotic-disease associations.

Target 2: Data Structure Design

To use the dataset efficiently, we will design a corresponding probiotic profile structure and build detailed, multi-dimensional probiotic archives.

Target 3: Visualization of Complex Relationship Networks

Upon completing the above objectives, we plan to visualize the disease association data scattered across various probiotic archives, making it easier for users to explore existing relationship chains and uncover potential connections.

Target 4: Enable Natural Language Interaction

Using retrieval-augmented generation technology, we will build the above data into a knowledge base suitable for large language models. Researchers can ask any question, and the large language model will provide answers based on the knowledge base, supported by research papers as evidence.

Target 5: Multimodal Interaction

Through multimodal large models, we will implement interaction methods using text, voice, and images, further lowering the barrier to use.

Figure 1. Project Roadmap

Architecture

React Web Client: The React client built with Next.js scaffolding serves as the frontend, offering advantages such as fast response times and a modern, aesthetically pleasing interface. The frontend communicates with the backend through RESTful API requests.

Python Flask Server: The Flask server acts as the backend, providing RESTful API endpoints for data read and write operations.

Nginx Agent: Nginx acts as a reverse proxy, forwarding requests that arrive on the host machine's ports to the containerized services.

Pandas and Excel Database: We use pandas and Excel for data management. This is lighter-weight than a dedicated database and lets people without specialized computer knowledge edit the data directly.

Electron Client: For users who prefer desktop clients, we provide an Electron client, which offers cross-platform compatibility and excellent performance.

Java App: Mobile users can utilize the native Java application.

We employ Docker to containerize the React client, Flask service, and Nginx proxy, ensuring consistency between development and deployment environments and avoiding compatibility issues. This decoupled design ensures high scalability and maintainability of the software.

Figure 2. Main software architecture

Implementation

1. Dataset Collection

First, we conducted a hierarchical literature search at the genus-species-subspecies level and compiled the results into a literature inventory. Team members then collaborated to analyze each paper on the list, extracting the studied subjects, (potential) treatable diseases, and evidence levels.

To help users judge how reliable each association is, we adopted a four-tier grading strategy to evaluate the evidence levels:

  • Level 1: Theoretical Basis - Preliminary research suggesting potential efficacy through functional genes or metabolites.
  • Level 2: In Vitro Evidence - Confirmed biological effects in cell models (pathogen inhibition, immune modulation, etc.).
  • Level 3: Animal Model Evidence - Validated efficacy and safety in live animals with mechanistic insights.
  • Level 4: Clinical Trial Evidence - Statistically significant health improvements demonstrated in human RCTs - the highest evidence grade.

In practice, we implemented a cross-verification approach involving manual work and large language models (LLMs). For instance, if a literature entry was analyzed manually, it was reviewed by an LLM, and vice versa.
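
The grading scale and the cross-verification step can be illustrated with a small sketch. It is not the team's actual tooling: the function name, the key layout (paper, strain, disease), and the sample annotations are all hypothetical, chosen only to show how two independent annotation passes could be compared and disagreements flagged for another look.

```python
from enum import IntEnum


class EvidenceLevel(IntEnum):
    """The four evidence tiers described above; a higher value is stronger."""
    THEORETICAL = 1
    IN_VITRO = 2
    ANIMAL_MODEL = 3
    CLINICAL_TRIAL = 4


def cross_check(manual, llm):
    """Compare two annotation passes keyed by (paper_id, strain, disease).

    Returns (agreed, disputed): entries where both passes assigned the same
    level, and keys that need a second look because the levels differ or
    one pass is missing the entry.
    """
    agreed, disputed = {}, []
    for key in sorted(set(manual) | set(llm)):
        if key in manual and key in llm and manual[key] == llm[key]:
            agreed[key] = manual[key]
        else:
            disputed.append(key)
    return agreed, disputed


# Hypothetical annotations for illustration only (not taken from the dataset).
manual = {
    ("paper-001", "strain A", "disease X"): EvidenceLevel.ANIMAL_MODEL,
    ("paper-002", "strain B", "disease Y"): EvidenceLevel.IN_VITRO,
}
llm = {
    ("paper-001", "strain A", "disease X"): EvidenceLevel.ANIMAL_MODEL,
    ("paper-002", "strain B", "disease Y"): EvidenceLevel.CLINICAL_TRIAL,
}

agreed, disputed = cross_check(manual, llm)
print("agreed:", agreed)
print("needs review:", disputed)
```

Using an integer-backed enum keeps the levels comparable, so "at least animal-model evidence" becomes a simple `level >= EvidenceLevel.ANIMAL_MODEL` check.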

2. Data Structure and Backend Service Implementation

To utilize the data efficiently, we have designed the following data structure:

Figure 3. Data structure

The data is stored in Excel sheets, which the backend service accesses through the pandas package.
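
As a rough illustration of how flat spreadsheet rows could map onto per-strain profiles, here is a minimal sketch using only the standard library. The class names, field names, and sample rows are assumptions made for this example; the actual schema is the one shown in Figure 3, and in the real backend the rows would come from the Excel sheets via pandas rather than from a hard-coded list.

```python
from dataclasses import dataclass, field


@dataclass
class Association:
    disease: str
    evidence_level: int  # 1 (theoretical basis) to 4 (clinical trial)
    source: str          # link or citation for the supporting paper


@dataclass
class StrainProfile:
    strain_id: str
    genus: str
    species: str
    associations: list = field(default_factory=list)

    def best_evidence(self, disease):
        """Highest evidence level recorded for a disease, or None if absent."""
        levels = [a.evidence_level for a in self.associations if a.disease == disease]
        return max(levels, default=None)


def profiles_from_rows(rows):
    """Group flat spreadsheet-style rows (dicts) into one profile per strain."""
    profiles = {}
    for row in rows:
        profile = profiles.setdefault(
            row["strain_id"],
            StrainProfile(row["strain_id"], row["genus"], row["species"]),
        )
        profile.associations.append(
            Association(row["disease"], int(row["evidence_level"]), row["source"])
        )
    return profiles


# Placeholder rows; the column names and values are hypothetical.
rows = [
    {"strain_id": "strain_demo_1", "genus": "GenusA", "species": "species_a",
     "disease": "disease_demo_a", "evidence_level": 2, "source": "ref-1"},
    {"strain_id": "strain_demo_2", "genus": "GenusB", "species": "species_b",
     "disease": "disease_demo_a", "evidence_level": 1, "source": "ref-2"},
    {"strain_id": "strain_demo_1", "genus": "GenusA", "species": "species_a",
     "disease": "disease_demo_a", "evidence_level": 3, "source": "ref-3"},
]

profiles = profiles_from_rows(rows)
print(profiles["strain_demo_1"].best_evidence("disease_demo_a"))
```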

For the backend service, we organize the code with Flask blueprints, which improves system decoupling and maintainability. We provide the following API endpoints, each of which returns the corresponding data when called.

Strain Endpoints (/api/strain)
Endpoint                      Method  Description
/api/strain                   GET     Basic test endpoint
/api/strain/card-urls         GET     Get list of all strain card URLs
/api/strain/card/<id>         GET     Get condensed card info for a specific strain
/api/strain/detail/<id>       GET     Get full detailed info for a specific strain
/api/strain/names             GET     Get list of all strain names
/api/strain/search?key=xxx    GET     Fuzzy search (multi-field contains); returns matching strain card URL list

Notes: Searchable fields are defined by strain_search_allow_fields in db/strain.py. Returned card URLs follow the pattern /strain/card/<encoded_id>; the frontend can prepend the base origin.
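
The "multi-field contains" search can be pictured with the sketch below. It is an illustration under stated assumptions, not the code in db/strain.py: SEARCH_FIELDS is a hypothetical stand-in for strain_search_allow_fields, the record layout is invented, and the matching is assumed to be a case-insensitive substring test.

```python
from urllib.parse import quote

# Hypothetical stand-in for strain_search_allow_fields in db/strain.py;
# the real list of searchable fields may differ.
SEARCH_FIELDS = ("name", "genus", "species")


def search_strains(records, key, fields=SEARCH_FIELDS):
    """Return card URLs for records where any allowed field contains the key."""
    needle = key.strip().lower()
    if not needle:
        return []
    urls = []
    for rec in records:
        if any(needle in str(rec.get(f, "")).lower() for f in fields):
            # URL-encode the id so spaces and slashes are safe in the path.
            urls.append("/strain/card/" + quote(str(rec["id"]), safe=""))
    return urls


# Sample records with an assumed layout; the ids are illustrative.
records = [
    {"id": "Escherichia coli Nissle 1917", "name": "Escherichia coli Nissle 1917",
     "genus": "Escherichia", "species": "coli"},
    {"id": "Lactobacillus plantarum WCFS1", "name": "Lactobacillus plantarum WCFS1",
     "genus": "Lactobacillus", "species": "plantarum"},
]

print(search_strains(records, "nissle"))
```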

Knowledge Graph Endpoints (/api/knowledge-graph)
Endpoint                                  Method  Description
/api/knowledge-graph                      GET     Get full graph (nodes + edges)
/api/knowledge-graph/nodes                GET     Get all nodes (id/label)
/api/knowledge-graph/filter?query=<id>    GET     Get subgraph centered on a node
/api/knowledge-graph/node/<id>            GET     Get node detail (type inferred by prefix)
/api/knowledge-graph/edge/<id>            GET     Get edge detail (evidence/link/level)

Node ID Conventions: The prefixes strain_, species_, and genus_ mark nodes derived from the strain Excel sheets. The prefix disease_ marks disease nodes.
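
To make the prefix convention concrete, the sketch below infers a node's type from its id and extracts the edges that touch a chosen central node. The edge dictionary layout (id, source, target, level) and the one-hop definition of "subgraph" are assumptions for illustration; the real filter endpoint may return a different shape or a wider neighborhood.

```python
NODE_PREFIXES = ("strain_", "species_", "genus_", "disease_")


def node_type(node_id):
    """Infer the node type from the id prefix described above."""
    for prefix in NODE_PREFIXES:
        if node_id.startswith(prefix):
            return prefix.rstrip("_")
    raise ValueError(f"unrecognized node id: {node_id!r}")


def subgraph(edges, center):
    """Keep only edges touching `center`, plus the nodes those edges connect."""
    kept = [e for e in edges if center in (e["source"], e["target"])]
    nodes = sorted({center} | {e["source"] for e in kept} | {e["target"] for e in kept})
    return {"nodes": nodes, "edges": kept}


# Placeholder graph; ids and levels are made up for the example.
edges = [
    {"id": "e1", "source": "strain_demo_1", "target": "disease_demo_a", "level": 3},
    {"id": "e2", "source": "strain_demo_2", "target": "disease_demo_a", "level": 1},
    {"id": "e3", "source": "strain_demo_2", "target": "disease_demo_b", "level": 2},
]

result = subgraph(edges, "strain_demo_2")
print(node_type("strain_demo_2"), result["nodes"])
```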

Suggested Questions (Example) (/api/chat/suggested-question)
Endpoint                        Method  Description
/api/chat/suggested-question    GET     Return list of preset suggested questions
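
The endpoints above can be called from any HTTP client. The following standard-library sketch shows one way to build request URLs and fetch a response. It assumes that the API is reachable on the same origin as a local deployment (http://localhost:4080) and that responses are JSON; neither is confirmed here, and the helper names and the disease_demo node id are hypothetical.

```python
import json
from urllib.parse import quote, urlencode
from urllib.request import urlopen

BASE_URL = "http://localhost:4080"  # assumption: origin of a local deployment


def endpoint(path, **params):
    """Build an absolute URL for an API path, with optional query parameters."""
    url = BASE_URL.rstrip("/") + path
    if params:
        url += "?" + urlencode(params)
    return url


def get_json(url, timeout=10):
    """Fetch a URL and decode the body as JSON (assumes a JSON response)."""
    with urlopen(url, timeout=timeout) as resp:
        return json.load(resp)


search_url = endpoint("/api/strain/search", key="Nissle 1917")
node_url = endpoint("/api/knowledge-graph/node/" + quote("disease_demo", safe=""))
print(search_url)
print(node_url)

# With a running server, the search could then be executed as:
# card_urls = get_json(search_url)
```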

3. Implementation of an Interactive Web Client

We have built a user interface on the web client, where a navigation bar directs users to various functions, including Home, User Guide, Strain Data, Knowledge Graph, and more.

The Home page introduces the project background, the scale and content of the dataset, as well as the value and use cases of the project, enabling users to quickly understand the purpose and significance of the software.

Figure 4. Home page

The User Guide page provides detailed instructions that help users quickly learn how to use each page of the platform and make full use of the software's features.

Figure 5. User Guide page

The Strain Data page and the Knowledge Graph page visualize the data with strain information cards and an interactive knowledge graph, respectively. The two views are designed to complement each other.

Figure 6. Strain Data
Figure 7. Knowledge Graph

The information cards provide relatively detailed content but have the drawback of being somewhat isolated, making it difficult to reflect the associations between probiotics and diseases. The knowledge graph addresses this shortcoming, while also relying on the information cards to provide specific details.

4. Implementation of Natural Language Interaction

We built on the open-source RAGFlow project, customizing it to provide natural language interaction and retrieval-augmented generation while also improving the user experience.
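
For readers unfamiliar with retrieval-augmented generation, the toy sketch below shows the general pattern: retrieve the passages most relevant to a question, then assemble them into a prompt for the language model. It does not use RAGFlow's API and does not reflect how RAGFlow ranks documents. The word-overlap scoring, the function names, and the placeholder passages are all simplifications chosen for illustration.

```python
import re


def tokenize(text):
    """Lowercase a string and split it into a set of alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(question, passages, top_k=2):
    """Rank passages by how many question tokens they share; drop non-matches."""
    q = tokenize(question)
    scored = [(len(q & tokenize(p["text"])), i, p) for i, p in enumerate(passages)]
    scored = [s for s in scored if s[0] > 0]
    scored.sort(key=lambda s: (-s[0], s[1]))  # best score first, stable on ties
    return [p for _, _, p in scored[:top_k]]


def build_prompt(question, hits):
    """Combine retrieved passages and the question into a single prompt."""
    context = "\n".join(f"[{p['source']}] {p['text']}" for p in hits)
    return (
        "Answer using only the context below and cite the sources in brackets.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )


# Placeholder knowledge base; the texts are not real findings.
passages = [
    {"source": "doc-1", "text": "Placeholder passage about strain alpha and colitis."},
    {"source": "doc-2", "text": "Placeholder passage about strain beta and diabetes."},
    {"source": "doc-3", "text": "Unrelated placeholder text."},
]

question = "Which strains are linked to colitis?"
hits = retrieve(question, passages, top_k=1)
prompt = build_prompt(question, hits)
print(prompt)
```

A production system would replace the word-overlap score with embedding-based similarity and send the prompt to a model provider, but the retrieve-then-prompt shape stays the same.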

In terms of the service layer architecture, we have separated the main service program from the RAG service and enabled interaction between them through RESTful APIs. This improves compatibility and facilitates updates, iterations, and maintenance.

Figure 8. RAG service architecture

Additionally, we have added an AI Chat page to our web client, where users can converse with the AI assistant. The assistant will provide answers based on the dataset we have collected and organized. We offer some recommended questions, but users are also free to ask any questions of their own.

Figure 9. AI Chat page

Usage Example

Example 1: View Strain Details

Click on "Strain Data" in the navigation bar, then select a card to view detailed information about a strain. For example, if we click on the card for Akkermansia muciniphila ATCC BAA-835, the following information can be seen:

Figure 10. View Strain Details

Example 2: Exploring the network of probiotic-disease relationships

Click on "Knowledge Graph" in the navigation bar to explore the probiotic-disease relationship network. In this module, we provide three interactive features:

  1. Select a central node from the dropdown menu to view a subgraph.
  2. Click on a node to view detailed information.
  3. Click on an edge to view evidence and its validity level.

For example, if we click on Escherichia coli Nissle 1917, we can see more information about the diseases and strains associated with it.

Figure 11. View node details

Additionally, we can select a central node from the dropdown menu, for instance to view a subgraph related only to Akkermansia muciniphila. Clicking the arrow that links it to Type 2 Diabetes then shows the supporting research papers and the evidence validity level.

Figure 12. View evidence and its validity level

Example 3: Recommended Candidate Strains

Click on "AI Chat" in the navigation bar, select a recommended question, or enter your own research question. For example:

"Which probiotics have shown potential efficacy for Parkinson's disease? Please provide relevant clinical research evidence."

The system returns a comprehensive summary based on our curated literature database. For the above question, the response would be as follows:

Figure 13. Recommended Candidate Strains

Innovations

1. Evidence-Based Probiotic-Disease Database

We systematically collected scientific literature to build a high-quality probiotic-disease database, and introduced a 4-level evidence grading system (from theoretical basis to human trials), helping users assess the reliability of findings.

2. Strain Information Cards

Strain information cards present a standardized, structured profile for each probiotic strain, covering phenotype highlights, functional features, and safety annotations. Consistent field order and concise visual cues allow rapid comparison, lowering cognitive load during early screening and enabling non-experts to quickly identify promising candidates.

3. Interactive Knowledge Graph

The interactive knowledge graph provides a relational view of the ecosystem: nodes (strain / species / disease) and evidence‑weighted edges expose multi‑layer associations, co‑occurrence pathways, and potential intervention or co‑administration opportunities. Users can progressively expand neighborhoods, filter by evidence level or node category, and pivot from high‑level network topology to a specific strain card in one click—bridging exploratory discovery with structured detail.

4. Natural Language Q&A System

By integrating RAG and large language models, we enabled users to ask questions in natural language (e.g., "Which strains are effective for colitis?"). The system retrieves evidence and generates clear answers---no technical skills needed. This makes data access much easier, especially for non-computational researchers.

Future Improvement

1. Implementation of Integrated Data Management:

Currently, an interactive data management client is not available, and data management relies on external programs. Future plans include migrating data to PostgreSQL, establishing backend communication with the database, developing database-related RESTful APIs, and creating a visual data management interface on the frontend.

2. Enhancement of Knowledge Graph Interaction:

The interactive experience of the knowledge graph requires optimization. Future improvements will include adding filtering options based on evidence level, disease type, mechanism of action, and deeper integration with the RAG system to leverage LLMs for explaining probiotic-disease relationships to users.

3. Implementation of Multimodal Interaction:

Among the five initially set targets, the first four have been successfully accomplished. However, multimodal interaction has not been achieved due to the lack of mature solutions and scarce multimodal data related to probiotics. As relevant technologies advance, further in-depth secondary development of the RAG system will be pursued to integrate multimodal capabilities.

4. Integration of Metabolomics and Genomics Data:

We plan to link observed probiotic-disease associations with underlying genetic and metabolic connections. Based on this integration, we will develop predictive algorithms to further uncover potential relationships and explore new possibilities.

Usage

1. Using the Web/Client Version

Web Version: http://probiease.qscn.online/

Client Version:

2. Self-Deployment

git clone https://gitlab.igem.org/2025/software-tools/tjusx.git
cd tjusx

2.1 Lightweight Deployment (Excluding RAG Service)

cd docker
docker compose up --build

Visit http://localhost:4080. This setup uses the RAG service provided by our server.
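
Before opening the browser, it can be handy to confirm that the containers are actually listening. The helper below is a small convenience sketch, not part of the repository; it only checks that a TCP connection to the port succeeds, not that the application behind it is healthy.

```python
import socket


def is_port_open(host, port, timeout=1.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


if __name__ == "__main__":
    # 4080 is the port used by the lightweight deployment described above.
    print("web client reachable:", is_port_open("localhost", 4080))
```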

2.2 Full Deployment

In the project root directory:

git clone https://github.com/infiniflow/ragflow.git

Copy the files from probiease-ragflow/web/src/pages to ragflow/web/src/pages, overwriting the existing files.

cd ragflow/docker
docker compose -f docker-compose.yml up -d
cd ../web
npm install
npm run build
docker exec ragflow-server rm -rf /ragflow/web/dist
docker cp dist ragflow-server:/ragflow/web/
docker restart ragflow-server

Visit http://localhost:8000.

Provide your model provider API key, create a knowledge base and assistant, then click the "Embed in Website" feature for the assistant. Copy the embedded link. Return to the project directory. (For instructions on using the RAG service management page, refer to: https://github.com/infiniflow/ragflow)

code probiease-web/src/app/ai-chat/page.tsx

Replace the src="{original link}" on line 36 with your embedded link.

cd docker
docker compose up --build

Visit http://localhost:4080. This uses the locally running RAG service.

Conclusion

ProbiEase is a disease-oriented probiotic design platform designed to bridge the gap between the rapidly expanding scientific literature on probiotics and actionable experimental design. By integrating a manually curated high-confidence database with an intuitive natural language interface and powerful visualization tools, ProbiEase significantly reduces the time and expertise required to identify promising probiotic candidates for therapeutic applications.

Our platform successfully addresses a critical bottleneck in synthetic biology by providing a data-driven foundation for wet-lab experiments. For instance, one of our wet-lab team members noted: "ProbiEase's structured data and interactive knowledge graph enabled us to quickly identify three candidate strains with documented anti-inflammatory properties."

Future work will focus on expanding the database to include metabolomics and genomics data, and on incorporating predictive models to further refine candidate strain selection. As an open-source tool, ProbiEase aims to contribute to the iGEM community and the broader field of therapeutic microbiology by broadening access to complex data analysis and accelerating the development of novel probiotic therapies.
