This glossary provides parameter definitions and terminology explanations for the CytoFlow antimicrobial peptide development platform. It includes 15 parameter tables covering CytoGrow (medium optimization, yeast kinetics, glucose consumption), CytoGuard (activity prediction with hypergraph neural networks), and CytoEvolve (reinforcement learning-based sequence optimization), along with 82 alphabetically organized technical terms ranging from mathematical concepts (Gaussian Processes, TF-IDF) to biological metrics (MIC, OD₆₀₀). The document serves as a reference guide to the mathematical models, machine learning architectures, and experimental parameters underlying the CytoFlow framework developed by the iGEM 2025 Jiangnan-China team.
The dry lab's computational workflow follows a Design-Build-Test-Learn (DBTL) cycle, iteratively refining the CytoFlow platform through data-driven optimization. This cycle integrates three core models (CytoGrow, CytoGuard, and CytoEvolve) to achieve end-to-end antimicrobial peptide development, from fermentation optimization to molecular design.
The Design phase establishes computational frameworks and defines optimization objectives based on biological constraints and project goals. For CytoGrow, we design hybrid models that combine response surface methodology with Gaussian processes to optimize medium composition (Grow-Medium), model yeast kinetics (Grow-Yeast), and predict glucose consumption (Grow-Glucose). For CytoGuard, we design a hypergraph neural network architecture that integrates multi-model embeddings (ESM-2, Ankh, ProtT5) with dynamic k-mer selection to predict antimicrobial activity. For CytoEvolve, we design a reinforcement learning framework that uses policy networks with a diffusion architecture to guide sequence mutations toward higher antimicrobial activity.
The Build phase implements the designed computational models through feature engineering, model training, and hyperparameter tuning. For CytoGrow, we build quadratic response surfaces fitted to experimental OD data, train Gaussian process residual models with RBF kernels, and implement Bayesian optimization with UCB acquisition functions for medium optimization. For CytoGuard, we fine-tune protein language models on 10K AMP sequences, construct hypergraph representations with TF-IDF weighted k-mers, and train attention-based fusion layers to combine multi-model embeddings. For CytoEvolve, we build the Mutator policy network integrated with diffusion mechanisms, implement the REINFORCE algorithm with experience replay, and connect CytoGuard as the activity evaluator to provide reward signals.
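As an illustration of the TF-IDF weighted k-mers mentioned above, the following is a minimal sketch using only the Python standard library. The toy peptide strings, the choice of k = 3, the unsmoothed IDF formula, and the helper names `kmers`, `tfidf_kmers`, and `top_kmers` are all assumptions made for this example; how CytoGuard actually computes the weights or uses them inside the hypergraph is not specified here.

```python
import math
from collections import Counter


def kmers(seq, k):
    """Return all overlapping k-mers of a sequence."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def tfidf_kmers(sequences, k=3):
    """Compute TF-IDF weights for k-mers; returns one dict per sequence."""
    docs = [Counter(kmers(s, k)) for s in sequences]
    n_docs = len(docs)
    # Document frequency: number of sequences that contain each k-mer.
    df = Counter()
    for counts in docs:
        df.update(counts.keys())
    weights = []
    for counts in docs:
        total = sum(counts.values())
        w = {}
        for km, c in counts.items():
            tf = c / total                   # term frequency within the sequence
            idf = math.log(n_docs / df[km])  # plain IDF, no smoothing (assumption)
            w[km] = tf * idf
        weights.append(w)
    return weights


def top_kmers(weight_dict, n=3):
    """Pick the n highest-weighted k-mers (ties broken alphabetically)."""
    return sorted(weight_dict, key=lambda km: (-weight_dict[km], km))[:n]


# Toy peptide fragments (hypothetical, not real AMP data).
seqs = ["KKLLKK", "AKLLGG", "GKLLAA"]
weights = tfidf_kmers(seqs, k=3)
selected = top_kmers(weights[0], n=2)
```

In this toy set the 3-mer `KLL` occurs in every sequence, so its IDF and therefore its weight are zero, while k-mers unique to one sequence receive the highest weights. A selection step such as `top_kmers` is one simple way a per-sequence k-mer subset could be chosen.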
The Test phase evaluates model performance using quantitative metrics and validation strategies on held-out datasets. For CytoGrow, we test medium optimization predictions through a grid search over 14,641 candidate formulations, comparing the Mean and UCB strategies against experimental OD values (best: OD = 0.401, predicted: 0.408–0.424), and validate the kinetic models with R² scores (Grow-Yeast: R² = 0.980, Grow-Glucose: R² = 0.955). For CytoGuard, we test activity predictions on independent test sets, achieving a Spearman correlation of 0.8543, a Pearson correlation of 0.9105, an RMSE of 0.1806, and an R² of 0.9053; ablation studies confirm that multi-model fusion and dynamic k-mer selection both improve performance. For CytoEvolve, we test generated LL-37 variants by running iterative mutation-evaluation cycles, measuring the convergence of activity scores over episodes, and validating that evolved sequences show predicted improvements in antimicrobial activity compared to wild-type LL-37.
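To make the difference between the Mean and UCB strategies concrete, here is a small sketch of a grid search over candidate formulations. Note that 14,641 = 11⁴, so the grid below assumes four medium components at eleven levels each; that layout is an assumption, as are the placeholder functions `toy_mean` and `toy_std` and the value of `kappa`. In practice the mean and standard deviation would come from the fitted response surface and Gaussian process model rather than from these toy formulas.

```python
import itertools


def ucb(mean, std, kappa=2.0):
    """Upper confidence bound: predicted mean plus an exploration bonus."""
    return mean + kappa * std


# Hypothetical grid: 4 components x 11 levels each = 11**4 candidates.
levels = [i / 10 for i in range(11)]
grid = list(itertools.product(levels, repeat=4))


def toy_mean(x):
    # Placeholder quadratic surface peaking at 0.5 for every component.
    return 0.4 - sum((xi - 0.5) ** 2 for xi in x)


def toy_std(x):
    # Placeholder uncertainty that grows with the total concentration.
    return 0.1 * sum(x)


# Mean strategy: pick the candidate with the highest predicted mean.
best_mean = max(grid, key=toy_mean)

# UCB strategy: also reward candidates where the model is uncertain.
best_ucb = max(grid, key=lambda x: ucb(toy_mean(x), toy_std(x)))
```

With these placeholder surfaces the Mean strategy selects the center of the grid, while UCB shifts toward the region with larger modeled uncertainty. A larger `kappa` pushes the choice further toward exploration.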
The Learn phase extracts insights from test results to refine models and guide the next DBTL iteration.
These insights inform subsequent iterations: adjusting exploration-exploitation balance, incorporating additional protein language models, and extending to multi-objective optimization for both OD and LL-37 yield.
The dry lab's internal DBTL cycle drives continuous improvement of the CytoFlow platform through rigorous computational experimentation. Each cycle transforms biological insights into refined models, validated predictions, and actionable knowledge, enabling efficient antimicrobial peptide development without requiring exhaustive wet lab experimentation. This iterative process exemplifies modern synthetic biology's integration of machine learning, statistical modeling, and reinforcement learning to tackle complex bioengineering challenges.
This document presents a comprehensive multi-environmental gene influence prediction model developed using Random Forest Regression and Principal Component Analysis. The system integrates advanced machine learning techniques to identify key genes and interpret their biological significance across different environmental conditions.
Core Technical Framework
Mathematical Foundation:
- Z-score normalization eliminates scale effects in raw gene abundance data
- PCA reduces high-dimensional environmental features to 2D visualization space
- K-means clustering partitions environments into 3 distinct groups based on gene expression similarity
- Feature engineering constructs second-order and third-order interaction terms to capture nonlinear environmental relationships
- Random Forest Regression (100 decision trees) predicts gene influence scores using multi-order environmental features
- Bootstrap stability analysis (100 iterations) validates prediction robustness and identifies core genes
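The first and fourth steps above can be sketched with the standard library alone. This sketch assumes that interaction terms are plain products of two or three distinct environmental features and that the Z-score uses the population standard deviation; neither detail is stated in the source. The names `zscore`, `interaction_terms`, `E1`, `E2`, and `E3` are hypothetical.

```python
import itertools
import math
import statistics


def zscore(values):
    """Standardize values to zero mean and unit variance."""
    mu = statistics.fmean(values)
    sd = statistics.pstdev(values)  # population SD (assumption); sample SD also works
    return [(v - mu) / sd for v in values]


def interaction_terms(row, names, max_order=3):
    """Extend first-order features with second- and third-order products."""
    feats = dict(zip(names, row))
    for order in range(2, max_order + 1):
        for combo in itertools.combinations(range(len(row)), order):
            label = "*".join(names[i] for i in combo)
            feats[label] = math.prod(row[i] for i in combo)
    return feats


# Toy abundance values for one gene across three samples (hypothetical).
z = zscore([1.0, 2.0, 3.0])

# Toy environmental feature vector (hypothetical names and values).
features = interaction_terms([2.0, 3.0, 5.0], ["E1", "E2", "E3"])
```

For three environmental features this produces seven inputs in total: three first-order, three second-order, and one third-order term. The resulting dictionary values could then be passed as a feature row to a regressor such as a random forest.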
Key Performance Metrics:
- Gini importance for feature contribution assessment
- Pearson correlation coefficients for environment relationship analysis
- Bootstrap frequency (selection rate across resampling iterations) for gene stability evaluation
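Bootstrap frequency, as defined in the last item, can be illustrated as follows. The sketch resamples the data with replacement, ranks genes in each resample, and reports how often each gene lands in the top set. The scoring rule here (mean abundance) is only a stand-in for the Random Forest influence score, and the gene names, toy values, `top_n`, and the function name `bootstrap_frequency` are all assumptions made for the example.

```python
import random
from collections import Counter


def bootstrap_frequency(samples, top_n=2, n_iter=100, seed=0):
    """Fraction of bootstrap resamples in which each gene ranks in the top_n."""
    rng = random.Random(seed)
    genes = list(samples[0])
    hits = Counter()
    for _ in range(n_iter):
        # Resample rows with replacement, keeping the original sample size.
        resample = rng.choices(samples, k=len(samples))
        # Stand-in score: mean abundance (not the actual model score).
        scores = {g: sum(s[g] for s in resample) / len(resample) for g in genes}
        top = sorted(genes, key=lambda g: -scores[g])[:top_n]
        hits.update(top)
    return {g: hits[g] / n_iter for g in genes}


# Toy data: one dict of gene abundances per sample (hypothetical values).
samples = [
    {"geneA": 10, "geneB": 5, "geneC": 4, "geneD": 1},
    {"geneA": 9, "geneB": 3, "geneC": 6, "geneD": 0},
    {"geneA": 11, "geneB": 6, "geneC": 2, "geneD": 1},
    {"geneA": 12, "geneB": 4, "geneC": 5, "geneD": 2},
]

freq = bootstrap_frequency(samples, top_n=2, n_iter=100)

# Genes selected in more than 80% of resamples are treated as stable.
core_genes = [g for g, f in freq.items() if f > 0.8]
```

In this toy data `geneA` is the highest-scoring gene in every sample, so it is selected in every resample, while `geneB` and `geneC` compete for the second slot and split the remaining selections between them.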
Seven-Figure Visualization Pipeline
Biological Impact & Applications
Scientific Value:
- Identifies environment-specific marker genes and universal core genes
- Reveals synergistic/antagonistic effects between environmental factors
- Provides data-driven guidance for experimental design and environment selection
Practical Applications:
- Drug target discovery (high-frequency genes with >80% bootstrap stability)
- Biomarker identification for specific environmental conditions
- Risk reduction in experimental validation through stability filtering
Workflow Integration: The system follows a complete pipeline from raw data normalization → clustering analysis → feature engineering → model training → stability validation → pathway enrichment, ensuring both statistical rigor and biological interpretability.
This project is a close collaboration between the Dry Lab and HP teams of the 2025 iGEM Jiangnan-China team. To address the inefficiency of manually reading HP content, we used automated data collection and AI-based analysis to systematically study the Human Practices working patterns of 208 iGEM teams worldwide (105 teams in 2023 + 103 teams in 2024).
Technical Implementation: The Dry Lab team obtained team data from the iGEM official website, developed Python web scrapers to automatically collect HP pages from 176 undergraduate teams (96.7% success rate), gathering approximately 10 MB of text data. Subsequently, large language models such as Google Gemini were used for batch processing analysis, extracting structured knowledge along three dimensions: "core philosophy, main activities, highlights and creativity," generating two complete analysis documents.
Collaboration Model: Dry Lab provided technical support (data collection, text processing, AI analysis), while the HP team was responsible for analysis framework design and content interpretation. Together, both parties produced data-driven HP work guidelines, achieving deep integration of technology and social responsibility.
Core Values:
Significance for Future Teams:
This project not only provides solid support for the Jiangnan-China team's LL-37 antimicrobial peptide project HP work, but also explores a new path of data-driven Human Practices for the iGEM community, promoting the transformation of synthetic biology from "qualitative description" to "quantitative assessment."
The rising threat of antimicrobial resistance necessitates accelerated discovery and optimization of novel antimicrobial peptides (AMPs). We present an Intelligent Antimicrobial Peptide Recommendation System that integrates Large Language Models (LLMs) with curated AMP databases to provide expert-level peptide selection, efficacy prediction, and mechanistic analysis. Built on the Dify workflow platform utilizing the qvq-max-latest model, the system employs a dual-LLM architecture with intelligent query routing: a primary LLM analyzes user intent and determines information requirements, while a secondary LLM synthesizes knowledge-base data with parametric biological understanding to generate scientifically rigorous responses. The system offers four core capabilities: (1) context-aware AMP recommendation based on target organisms and application scenarios, (2) minimum inhibitory concentration (MIC) prediction, (3) rational sequence variant generation, and (4) functional mechanism explanation. When queried to recommend a human antimicrobial peptide, the system delivered a comprehensive analysis of LL-37 (Cathelicidin), encompassing its broad-spectrum activity, resistance-resilient membrane disruption mechanism, immunomodulatory properties, wound healing capabilities, antibiotic synergy potential, safety profile, and clinical applications—demonstrating the integration of molecular data (sequence, MIC values) with higher-order reasoning (comparative analysis, clinical relevance). As part of the iGEM 2025 BIOSHIELD project, this system accelerates research cycles by reducing peptide selection from weeks to minutes, facilitates knowledge transfer across multidisciplinary teams, and serves as an interactive educational tool. The modular architecture ensures scalability with evolving AI capabilities and expanding AMP knowledge bases, positioning the system as a sustainable resource for antimicrobial peptide research and therapeutic development.
Keywords: Large Language Models, Antimicrobial Peptides, Dify Platform, Knowledge Retrieval, Drug Discovery, AI-Assisted Research, LL-37, BIOSHIELD