1 Parameter Definition & Glossary

Abstract

This glossary defines the parameters and terminology of the CytoFlow antimicrobial peptide development platform. It contains 15 parameter tables covering CytoGrow (medium optimization, yeast kinetics, glucose consumption), CytoGuard (activity prediction with hypergraph neural networks), and CytoEvolve (reinforcement learning-based sequence optimization), along with 82 alphabetically ordered technical terms ranging from mathematical concepts (Gaussian Processes, TF-IDF) to biological metrics (MIC, OD₆₀₀). It serves as a reference for the mathematical models, machine learning architectures, and experimental parameters underlying the CytoFlow framework developed by the iGEM 2025 Jiangnan-China team.

2 DBTL Cycle

Internal DBTL Cycle of the Dry Lab

The dry lab's computational workflow follows a systematic Design-Build-Test-Learn (DBTL) cycle, iteratively refining the CytoFlow platform through data-driven optimization. This cycle integrates three core models—CytoGrow, CytoGuard, and CytoEvolve—to achieve end-to-end antimicrobial peptide development from fermentation optimization to molecular design.

Summary

The dry lab's internal DBTL cycle drives continuous improvement of the CytoFlow platform through rigorous computational experimentation. Each cycle transforms biological insights into refined models, validated predictions, and actionable knowledge, enabling efficient antimicrobial peptide development without requiring exhaustive wet lab experimentation. This iterative process exemplifies modern synthetic biology's integration of machine learning, statistical modeling, and reinforcement learning to tackle complex bioengineering challenges.

3 Supplementary Experiments

3.1 Gene Impact Prediction for Fermentation Optimization

Abstract

This document presents a multi-environment gene influence prediction model built on Random Forest Regression and Principal Component Analysis. The model identifies key genes and interprets their biological significance across different environmental conditions.

Core Technical Framework

Mathematical Foundation:
- Z-score normalization eliminates scale effects in raw gene abundance data
- PCA reduces high-dimensional environmental features to 2D visualization space
- K-means clustering partitions environments into 3 distinct groups based on gene expression similarity
- Feature engineering constructs second-order and third-order interaction terms to capture nonlinear environmental relationships
- Random Forest Regression (100 decision trees) predicts gene influence scores using multi-order environmental features
- Bootstrap stability analysis (100 iterations) validates prediction robustness and identifies core genes
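The normalization and feature-engineering steps listed above can be sketched in plain Python. This is a minimal illustration, not the team's implementation: it assumes that an interaction term is the product of the Z-scores of the participating environments (the source does not specify the construction), and the environment names and abundance values are hypothetical.

```python
from itertools import combinations
from math import prod
from statistics import mean, pstdev


def zscore(values):
    """Standardize one gene's abundances across environments (mean 0, std 1)."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]  # a constant profile carries no signal
    return [(v - mu) / sigma for v in values]


def interaction_features(row, names, max_order=3):
    """Return single-environment values plus 2nd- and 3rd-order products.

    Assumption: an interaction term is the product of the participating
    environments' values; the source does not state how terms are built.
    """
    features = dict(zip(names, row))
    for order in range(2, max_order + 1):
        for idx in combinations(range(len(row)), order):
            label = "*".join(names[i] for i in idx)
            features[label] = prod(row[i] for i in idx)
    return features


# Hypothetical abundances of one gene in three hypothetical environments.
names = ["env_A", "env_B", "env_C"]
z = zscore([2.0, 4.0, 6.0])
feats = interaction_features(z, names)
for label, value in feats.items():
    print(f"{label}: {value:.3f}")
```

With three environments this yields seven features per gene (three single terms, three pairwise products, one triple product); the count grows combinatorially with the number of environments, which is one reason a tree ensemble with built-in feature ranking is a reasonable choice of regressor.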

Key Performance Metrics:
- Gini importance for feature contribution assessment
- Pearson correlation coefficients for environment relationship analysis
- Bootstrap frequency (selection rate across resampling iterations) for gene stability evaluation
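Two of these metrics, the Pearson correlation and the bootstrap frequency, are simple enough to sketch without external libraries. The example below is illustrative only: the per-gene score (mean absolute Z-score over the resampled environments) is a stand-in for the Random Forest prediction described in the source, and the gene names and profiles are hypothetical.

```python
import random
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    if sx == 0 or sy == 0:
        return 0.0  # correlation is undefined for a constant sequence
    return cov / (sx * sy)


def bootstrap_frequency(profiles, top_k, iterations=100, seed=0):
    """Count how often each gene ranks in the top_k across resamples.

    profiles maps a gene name to its Z-scores, one per environment.
    The score used here is a stand-in, not the Random Forest output.
    """
    rng = random.Random(seed)
    n_env = len(next(iter(profiles.values())))
    counts = {gene: 0 for gene in profiles}
    for _ in range(iterations):
        # Resample environments with replacement.
        sample = [rng.randrange(n_env) for _ in range(n_env)]
        scores = {
            gene: sum(abs(vals[i]) for i in sample) / n_env
            for gene, vals in profiles.items()
        }
        ranked = sorted(scores, key=scores.get, reverse=True)
        for gene in ranked[:top_k]:
            counts[gene] += 1
    return counts


# Hypothetical Z-score profiles for three genes in four environments.
profiles = {
    "gene_1": [2.0, -2.1, 1.9, 2.2],
    "gene_2": [0.1, -0.2, 0.1, 0.0],
    "gene_3": [0.5, 1.5, -0.4, 0.3],
}
counts = bootstrap_frequency(profiles, top_k=1)
r = pearson([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
print(counts, round(r, 6))
```

A gene's count divided by the number of iterations gives its bootstrap frequency, so a stability threshold such as 80% translates directly into a count threshold of 80 when 100 iterations are used.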

Seven-Figure Visualization Pipeline

Environment Clusters (PCA projection): Reveals similarity patterns among environmental conditions
Environment Correlation Heatmap: Displays positive correlation strength (ρ ≥ 0) between environment pairs
Gene Abundance Heatmap: Shows Z-score profiles of Top 10 non-duplicate genes across all environments
Feature Importance: Ranks Top 5 predictive features (single environments + interaction terms)
Gene Expression Trends: Tracks Z-score trajectories across environments for pattern classification
KEGG Pathway Enrichment: Links predicted genes to biological pathways
Bootstrap Frequency: Quantifies prediction stability (a gene selected in 80-100 of the 100 bootstrap iterations is considered extremely reliable)

Biological Impact & Applications

Scientific Value:
- Identifies environment-specific marker genes and universal core genes
- Reveals synergistic/antagonistic effects between environmental factors
- Provides data-driven guidance for experimental design and environment selection

Practical Applications:
- Drug target discovery (high-frequency genes with >80% bootstrap stability)
- Biomarker identification for specific environmental conditions
- Risk reduction in experimental validation through stability filtering

Workflow Integration: The system follows a complete pipeline from raw data normalization → clustering analysis → feature engineering → model training → stability validation → pathway enrichment, ensuring both statistical rigor and biological interpretability.

3.2 iGEM Human Practices Data Collection and Analysis Project

Abstract

This project is a collaboration between the Dry Lab and HP teams of the 2025 iGEM Jiangnan-China team. To avoid the inefficiency of reading HP content manually, we used automated data collection and AI-based analysis to systematically study the Human Practices working patterns of 208 iGEM teams worldwide (105 teams from 2023 and 103 teams from 2024).

Technical Implementation: The Dry Lab team obtained team data from the iGEM official website and developed Python web scrapers to automatically collect HP pages from 176 undergraduate teams (96.7% success rate), gathering approximately 10 MB of text data. Large language models such as Google Gemini were then used to batch-process the text, extracting structured knowledge along three dimensions (core philosophy, main activities, and highlights and creativity) and producing two complete analysis documents.
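The text-extraction step of such a scraper can be sketched with Python's standard `html.parser` module. This is a hedged illustration rather than the team's scraper: the class and function names are hypothetical, fetching is omitted, and the sample HTML and team names are made up for the example.

```python
from html.parser import HTMLParser


class VisibleTextExtractor(HTMLParser):
    """Collect visible text while skipping <script> and <style> content."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self._skip_depth == 0:
            self.chunks.append(text)


def extract_text(html):
    """Return the visible text of an HTML page, one chunk per line."""
    parser = VisibleTextExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.chunks)


def success_rate(results):
    """Fraction of teams whose page yielded non-empty text."""
    ok = sum(1 for text in results.values() if text)
    return ok / len(results)


# Made-up page content and team names, for illustration only.
page = (
    "<html><head><style>p{}</style></head><body>"
    "<h1>Human Practices</h1><script>var x = 1;</script>"
    "<p>We met stakeholders.</p></body></html>"
)
results = {"team_a": extract_text(page), "team_b": None}
print(extract_text(page))
print(success_rate(results))
```

In a full pipeline the extracted text for each team would be stored and then passed to the language model in batches; tracking failures explicitly, as `success_rate` does, is what makes a figure like the reported collection success rate reproducible.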

Collaboration Model: Dry Lab provided technical support (data collection, text processing, AI analysis), while the HP team was responsible for analysis framework design and content interpretation. Together, both parties produced data-driven HP work guidelines, achieving deep integration of technology and social responsibility.

Core Values:

Methodological Innovation: Established a replicable HP data collection and analysis process
Knowledge Accumulation: Built a best practices library of 200+ excellent cases
Capability Enhancement: Mastered skills including Web Scraping, NLP, and LLM applications
Community Contribution: Provided open data resources and interdisciplinary collaboration demonstrations for future teams

Significance for Future Teams:

Learning Resources: Provides a systematic library of excellent global HP cases, enabling new teams to quickly understand industry best practices
Methodological Reference: Demonstrates how to use data science to improve HP work quality and avoid working in isolation
Tool Support: Plans to open-source an HP intelligent analysis platform, lowering the barrier to data collection and analysis
Collaboration Paradigm: Proves that Dry Lab and HP teams can deeply integrate, breaking traditional barriers
Standard Evolution: Through cross-year comparative analysis, helps teams grasp trends in HP evaluation system changes

This project not only provides solid support for the Jiangnan-China team's LL-37 antimicrobial peptide project HP work, but also explores a new path of data-driven Human Practices for the iGEM community, promoting the transformation of synthetic biology from "qualitative description" to "quantitative assessment."

3.3 Intelligent Antimicrobial Peptide Recommendation System Based on Large Language Models

Abstract

The rising threat of antimicrobial resistance necessitates accelerated discovery and optimization of novel antimicrobial peptides (AMPs). We present an Intelligent Antimicrobial Peptide Recommendation System that integrates Large Language Models (LLMs) with curated AMP databases to provide expert-level peptide selection, efficacy prediction, and mechanistic analysis. Built on the Dify workflow platform utilizing the qvq-max-latest model, the system employs a dual-LLM architecture with intelligent query routing: a primary LLM analyzes user intent and determines information requirements, while a secondary LLM synthesizes knowledge-base data with parametric biological understanding to generate scientifically rigorous responses. The system offers four core capabilities: (1) context-aware AMP recommendation based on target organisms and application scenarios, (2) minimum inhibitory concentration (MIC) prediction, (3) rational sequence variant generation, and (4) functional mechanism explanation. When queried to recommend a human antimicrobial peptide, the system delivered a comprehensive analysis of LL-37 (Cathelicidin), encompassing its broad-spectrum activity, resistance-resilient membrane disruption mechanism, immunomodulatory properties, wound healing capabilities, antibiotic synergy potential, safety profile, and clinical applications—demonstrating the integration of molecular data (sequence, MIC values) with higher-order reasoning (comparative analysis, clinical relevance). As part of the iGEM 2025 BIOSHIELD project, this system accelerates research cycles by reducing peptide selection from weeks to minutes, facilitates knowledge transfer across multidisciplinary teams, and serves as an interactive educational tool. The modular architecture ensures scalability with evolving AI capabilities and expanding AMP knowledge bases, positioning the system as a sustainable resource for antimicrobial peptide research and therapeutic development.
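The dual-LLM routing described in the abstract can be sketched as a small control flow in which both models are replaced by plain functions. Everything below is a hypothetical stand-in: the function names, the keyword-based router, and the one-entry knowledge base are illustrative assumptions, and none of it reflects the Dify workflow configuration or the qvq-max-latest model's actual interface.

```python
import re
from dataclasses import dataclass
from typing import List


@dataclass
class RoutingDecision:
    intent: str                 # recommend, predict_mic, variants, mechanism
    needs_knowledge_base: bool  # whether to retrieve curated AMP entries


def tokenize(query: str) -> set:
    """Lowercase word tokens, so 'mic' does not match 'antimicrobial'."""
    return set(re.findall(r"[a-z0-9-]+", query.lower()))


def keyword_router(query: str) -> RoutingDecision:
    """Stand-in for the primary LLM: classify intent from keywords."""
    tokens = tokenize(query)
    if "mic" in tokens:
        return RoutingDecision("predict_mic", True)
    if tokens & {"variant", "variants"}:
        return RoutingDecision("variants", True)
    if "mechanism" in tokens:
        return RoutingDecision("mechanism", False)
    return RoutingDecision("recommend", True)


# Hypothetical one-entry knowledge base keyed by a query keyword.
KNOWLEDGE_BASE = {"human": ["LL-37 (Cathelicidin)"]}


def retrieve(query: str) -> List[str]:
    """Return knowledge-base entries whose key appears in the query."""
    tokens = tokenize(query)
    hits = []
    for key, entries in KNOWLEDGE_BASE.items():
        if key in tokens:
            hits.extend(entries)
    return hits


def synthesize(query: str, intent: str, context: List[str]) -> str:
    """Stand-in for the secondary LLM: combine the query with context."""
    joined = ", ".join(context) if context else "none"
    return f"[{intent}] {query} | context: {joined}"


def answer_query(query: str) -> str:
    decision = keyword_router(query)
    context = retrieve(query) if decision.needs_knowledge_base else []
    return synthesize(query, decision.intent, context)


answer = answer_query("Recommend a human antimicrobial peptide")
print(answer)
```

Separating routing from synthesis keeps retrieval optional: queries that the router judges answerable from general biological knowledge skip the knowledge-base lookup, while recommendation and MIC queries are grounded in curated entries.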

Keywords: Large Language Models, Antimicrobial Peptides, Dify Platform, Knowledge Retrieval, Drug Discovery, AI-Assisted Research, LL-37, BIOSHIELD
