
Part 1: CytoFlow Architecture Introduction

Our dry lab constructed an end-to-end antimicrobial peptide intelligent development platform, CytoFlow, covering the entire process from molecular design, activity prediction, sequence optimization to production processes.

It consists of three major models: CytoEvolve (antimicrobial peptide evolution model: takes an AMP sequence as input and outputs improved variants), CytoGuard (antimicrobial peptide activity evaluation model: takes an AMP sequence as input and outputs a predicted MIC), and CytoGrow (fermentation and yield model, detailed in Part 2). The system framework is shown below:

Part 2: Experimental Level Design and Optimization

2.1 CytoGrow

CytoGuard and CytoEvolve are models designed to improve the quality of the LL-37 antimicrobial peptide.

CytoGrow, on the other hand, is a model aimed at increasing LL-37 yield, consisting of three major models: Grow-Medium (medium composition optimization model), Grow-Yeast (Saccharomyces cerevisiae growth kinetics model), and Grow-Glucose (glucose consumption model).

2.1.1 Grow-Medium

Abstract

Grow-Medium establishes a hybrid intelligent optimization framework, combining a quadratic response surface, Gaussian process residuals, and dual-acquisition-function Bayesian optimization, to optimize the culture medium formulation for Saccharomyces cerevisiae. We used two selection criteria. The Mean criterion yielded an improved medium of glucose 54.49 g/L, peptone 9.82 g/L, KH2PO4 3 g/L, with a predicted OD of 0.408 (a 20.9% improvement over the basic medium, OD = 0.3375): a small but reliable improvement. The UCB criterion predicted that glucose 41.39 g/L, peptone 23.58 g/L, KH2PO4 3 g/L would achieve an OD of 0.424 (a 25.6% improvement over the basic medium): a larger improvement, but one requiring further experimental validation.

Problem

Dataset Description

  • Source: Actual wet lab experimental data
  • Target variable: OD value (optical density, reflecting microbial growth)
  • Independent variables:
  • Glucose concentration (G): 20-70 g/L
  • Peptone concentration (T): 6-30 g/L
  • KH2PO4 concentration (K): 0-5 g/L
  • Data scale: 228 experimental points, 3 replicates each
  • Basic Medium OD: 0.3375 (G=20, T=20, K=0)
  • Experimental Average OD: 0.3410
  • Highest Experimental OD: 0.401 (G=54, T=10, K=3)

Optimization Objective

Find the optimal concentration combination of glucose, peptone, and potassium dihydrogen phosphate to maximize OD value.

Method

Initially, we attempted a hybrid modeling approach of quadratic response surface + Gaussian process residuals.

Quadratic Response Surface (Trend Term)

f_trend(G, T, K) = 𝐗β

where the feature matrix 𝐗 contains:

𝐗 = [1, G, T, K, G², T², K², GT, GK, TK]

Parameter estimation: β = (𝐗ᵀ𝐗 + λ𝐈)⁻¹𝐗ᵀ𝐲, where λ = 10⁻⁸ is the regularization parameter.

Gaussian Process Residual Modeling (Rasmussen & Williams, 2006)

Residual definition:

r_i = y_i − f_trend(G_i, T_i, K_i)

Kernel function: Anisotropic RBF kernel

k(𝐱_i, 𝐱_j) = σ_f² exp( −½ Σ_{d=1}^{3} (x_{i,d} − x_{j,d})² / ℓ_d² )

Hyperparameter settings:

  • σ_f = max(10⁻⁶, std(r)) (signal amplitude)
  • σ_n = max(10⁻⁶, 0.02 × range(y)) (noise amplitude)
  • ℓ₁ = ℓ₂ = ℓ₃ = 1.0 (length scales)

Prediction Formulas

Mean prediction:

μ(𝐱*) = f_trend(𝐱*) + 𝐤*ᵀ(𝐊 + σ_n²𝐈)⁻¹𝐫

Variance prediction:

σ²(𝐱*) = k(𝐱*, 𝐱*) − 𝐤*ᵀ(𝐊 + σ_n²𝐈)⁻¹𝐤*
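The hybrid predictor can be sketched as follows (NumPy; a shared length scale and illustrative noise settings are used for brevity rather than the exact hyperparameters above): fit the ridge-regularized quadratic trend, fit a GP to its residuals, and predict the posterior mean.

```python
import numpy as np

def design_matrix(G, T, K):
    """Quadratic response-surface features [1, G, T, K, G^2, T^2, K^2, GT, GK, TK]."""
    return np.stack([np.ones_like(G), G, T, K,
                     G**2, T**2, K**2, G*T, G*K, T*K], axis=-1)

def rbf_kernel(A, B, sf2=1.0, ls=1.0):
    """RBF kernel; a shared length scale stands in for the anisotropic one."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2 / ls**2).sum(-1)
    return sf2 * np.exp(-0.5 * d2)

def fit_hybrid(X_raw, y, lam=1e-8, sn2=1e-4):
    """Ridge-regularized trend term plus a GP fitted to the residuals."""
    Phi = design_matrix(*X_raw.T)
    beta = np.linalg.solve(Phi.T @ Phi + lam * np.eye(Phi.shape[1]), Phi.T @ y)
    r = y - Phi @ beta                              # residuals for the GP
    K = rbf_kernel(X_raw, X_raw) + sn2 * np.eye(len(y))
    alpha = np.linalg.solve(K, r)                   # (K + sigma_n^2 I)^-1 r
    return beta, alpha

def predict_mean(X_star, X_raw, beta, alpha):
    """Posterior mean: quadratic trend plus GP residual correction."""
    return design_matrix(*X_star.T) @ beta + rbf_kernel(X_star, X_raw) @ alpha
```

Because the GP only models what the quadratic surface misses, the trend carries the global shape while the residual term corrects local deviations near observed points.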

Experiments & Results

For the initial optimization strategy, we used grid search, first defining the search space:

  • G ∈ [30, 70] g/L
  • T ∈ [6, 30] g/L
  • K = 3 g/L (fixed)
  • Grid resolution: 121×121 = 14,641 points

For the second method, we set the acquisition function: Upper Confidence Bound (UCB)

UCB(𝐱) = μ(𝐱) + κ·σ(𝐱), where κ = 2.58 (99% confidence)
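The grid scan over both criteria can be sketched as follows (NumPy; the `posterior` function is a hypothetical stand-in for the hybrid model's predictive mean and standard deviation, not our fitted model):

```python
import numpy as np

def posterior(G, T):
    """Hypothetical stand-in for the hybrid model's predictive mean and std."""
    mu = 0.34 + 0.002 * (G - 20) - 2e-5 * (G - 20) ** 2 + 0.001 * (T - 6)
    sigma = 0.01 * np.ones_like(mu)
    return mu, sigma

# Search space: G in [30, 70], T in [6, 30], K fixed at 3 g/L.
G = np.linspace(30, 70, 121)
T = np.linspace(6, 30, 121)
GG, TT = np.meshgrid(G, T, indexing="ij")        # 121 x 121 = 14,641 points

mu, sigma = posterior(GG, TT)
kappa = 2.58                                      # 99% confidence multiplier
ucb = mu + kappa * sigma

i_mean = np.unravel_index(mu.argmax(), mu.shape)  # Mean argmax
i_ucb = np.unravel_index(ucb.argmax(), ucb.shape) # UCB argmax
best_mean = (G[i_mean[0]], T[i_mean[1]])
best_ucb = (G[i_ucb[0]], T[i_ucb[1]])
```

The Mean criterion exploits the current model; UCB adds κσ so that under-explored, high-uncertainty regions can also win the argmax.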

Results

Mean argmax @ (G=54.33, T=9.80, K=3): OD_mean = 0.408
UCB argmax @ (G=41.33, T=23.60, K=3): OD_mean = 0.424

Through grid search, both Mean and UCB found optimized configurations and predicted corresponding OD values.

The figures below show a visualization of the two extrema predicted by the Mean criterion and a comparison of biomass (OD value) across different medium compositions:

The following figures show model validation and analysis: sensitivity analysis of the three parameters and model fitting quality analysis:

Conclusion

Through hybrid modeling combining parametric (quadratic response surface) and non-parametric (GP) methods, and dual acquisition functions covering both optimistic exploration (UCB) and deterministic exploitation (Mean), we optimized existing medium compositions and predicted their OD values. The best formulation discovered was glucose 41.4 g/L + peptone 23.6 g/L + KH2PO4 3 g/L, with a predicted OD of 0.424, a 25.6% improvement over the basic medium (OD = 0.3375). Through modeling and computational prediction, we provided clear medium formulation recommendations for subsequent experiments, reducing trial-and-error costs. Moreover, the method is generalizable: the optimization framework can be extended to other microbial medium optimization problems.

2.1.2 Grow-Yeast

Abstract

To monitor Saccharomyces cerevisiae growth and obtain biomass at any time point, we established S. cerevisiae growth kinetics models using Logistic and Gompertz models. The Logistic model showed the best fitting performance for biomass data (R²=0.9937, RMSE=0.3462).

Problem

Data Source

The project uses standardized yeast fermentation experimental data, including:

  • Time series: 0-48 hours, 14 measurement time points
  • Biomass indicator: OD₆₀₀ optical density values (3 parallel experiments)
  • Substrate concentration: Glucose concentration (g/L, 3 parallel experiments)

Considering practical constraints, wet lab measurements of S. cerevisiae growth cannot be taken at arbitrarily fine time resolution. However, in practical applications, such as calculating S. cerevisiae efficiency ratios, we may need the biomass at arbitrary time points, so being able to obtain the biomass at any moment becomes particularly important. Therefore, our dry lab designed the Grow-Yeast model, which uses the limited measurements to turn discrete data points into a continuous growth curve from which the biomass at any time can be read off.

Method

Biomass Growth Models

Logistic Growth Model (Verhulst, 1838)

The Logistic model describes biological growth limited by environmental resistance:

X(t) = X_max / (1 + (X_max/X_0 − 1)·e^(−μt))

where:

  • X(t): biomass at time t
  • X_0: initial biomass
  • X_max: maximum biomass
  • μ: maximum specific growth rate

Characteristics: S-shaped growth curve, suitable for describing complete growth processes

Gompertz Growth Model (Gompertz, 1825)

The Gompertz model is suitable for describing processes with gradually declining growth rates:

X(t) = X_max · exp(−ln(X_max/X_0) · e^(−μt))

Characteristics: Asymmetric S-shaped curve with more gradual growth rate decline
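Both growth laws are closed-form and can be evaluated directly; a minimal sketch (NumPy, with illustrative parameter values) that can be handed to any curve-fitting routine:

```python
import numpy as np

def logistic(t, X0, Xmax, mu):
    """Logistic growth: symmetric S-curve saturating at Xmax."""
    return Xmax / (1.0 + (Xmax / X0 - 1.0) * np.exp(-mu * t))

def gompertz(t, X0, Xmax, mu):
    """Gompertz growth: asymmetric S-curve with a gentler late-phase decline."""
    return Xmax * np.exp(-np.log(Xmax / X0) * np.exp(-mu * t))
```

Either function can be passed, together with the OD₆₀₀ time series, to a least-squares fitter such as `scipy.optimize.curve_fit` to estimate (X_0, X_max, μ); both satisfy X(0) = X_0 and approach X_max as t grows.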

Experiments & Results

Data Preprocessing

  1. Statistical calculations:
  2. Mean: x¯=1ni=1nxi
  3. Sample standard deviation: s=1n1i=1n(xix¯)2

  4. Data quality assessment:

  5. Coefficient of variation: CV=sx¯×100%
  6. Data completeness check and outlier identification

  7. Phase division:

  8. Exponential growth phase: 0-15 hours
  9. Stationary phase: 15-48 hours

Visualization of Raw Experimental Data

This figure uses dual Y-axis design, simultaneously displaying biomass growth (OD₆₀₀, green) and glucose consumption (red) over time. Shaded regions identify different fermentation phases, clearly showing the transition between exponential growth and stationary phases.

Experimental Results

The figure below shows the fitting performance of Logistic and Gompertz models on biomass data, including:

  • Experimental data points (dark dots): Observed values with standard deviation error bars
  • Fitted curves: Different colors represent different model fitting results
  • Fitting quality indicators: R² values shown in legend for direct comparison

The figure below shows fitting using the best-performing Logistic model:

Model validation:

2.1.3 Grow-Glucose

Abstract

To monitor glucose consumption during S. cerevisiae growth and obtain the glucose remaining at any time, we established glucose consumption kinetics models using modified exponential decay and Logistic decay models. The modified exponential decay model showed the best fitting performance for the glucose data (R²=0.955).

Problem

Data Source

The project uses standardized yeast fermentation experimental data, including:

  • Time series: 0-48 hours, 14 measurement time points
  • Substrate concentration: Glucose concentration (g/L, 3 parallel experiments)

Problem Description

Similar to Grow-Yeast, we hope to obtain glucose remaining at any time point. Glucose remaining and when glucose is depleted are crucial for our project, as S. cerevisiae only begins producing LL-37 after glucose depletion. Therefore, our dry lab designed the Grow-Glucose model, using limited data to transform discrete points into continuous curves for obtaining glucose remaining at different time points.

Method

1. Modified Exponential Decay Model

Considering different consumption rate differences in fermentation phases:

S(t) = S_0·e^(−k₁t),                     if t ≤ t_switch
S(t) = S_switch·e^(−k₂(t − t_switch)),   if t > t_switch

where:

  • S_0: initial substrate concentration
  • k₁: early consumption rate constant
  • k₂: late consumption rate constant
  • t_switch: transition time point (with S_switch = S_0·e^(−k₁·t_switch) for continuity)

2. Logistic Decay Model

Describing S-shaped characteristics of substrate consumption:

S(t) = S_min + (S_0 − S_min) / (1 + (kt)ⁿ)

where:

  • S_min: minimum substrate concentration
  • k: consumption rate constant
  • n: shape parameter
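Both decay laws can be sketched directly (NumPy, illustrative parameters); the switch-point value S_switch is computed from continuity of the two exponential branches:

```python
import numpy as np

def two_phase_decay(t, S0, k1, k2, t_switch):
    """Piecewise exponential decay with a rate change at t_switch."""
    S_switch = S0 * np.exp(-k1 * t_switch)        # continuity at the switch
    return np.where(t <= t_switch,
                    S0 * np.exp(-k1 * t),
                    S_switch * np.exp(-k2 * (t - t_switch)))

def logistic_decay(t, S0, Smin, k, n):
    """S-shaped decline from S0 toward Smin."""
    return Smin + (S0 - Smin) / (1.0 + (k * t) ** n)
```

Fitting either curve to the glucose series yields a continuous S(t), from which a depletion time (e.g., the first t with S(t) below a threshold) can be solved numerically.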

Experiments & Results

The figure below shows fitting performance of modified exponential decay and Logistic decay models on glucose consumption:

The figure below shows fitting using the best-performing modified exponential decay model:

Part 3: Molecular Level Design and Optimization

3.1 CytoGuard Model

Abstract

CytoGuard is an innovative deep learning framework specifically designed to predict the biological activity of Antimicrobial Peptides (AMPs). This model integrates feature representations from multiple pre-trained protein language models (ESM-2, Ankh, ProtT5) and captures high-order structural dependencies in sequences through Hypergraph Neural Networks (HGNNs). The model employs dynamic k-mer selection mechanisms and attention fusion strategies, achieving excellent performance on the test set: Spearman correlation coefficient of 0.8543, Pearson correlation coefficient of 0.9105, RMSE of 0.1806, and R² of 0.8153.

Problem

Antimicrobial peptides, as essential components of the innate immune system, have tremendous potential in combating bacterial resistance (Hancock & Sahl, 2006; Mahlapuu et al., 2016). LL-37, the only cathelicidin-family antimicrobial peptide in humans, holds exceptional research potential. What properties does LL-37 possess, and how can we evaluate whether it is a "good" antimicrobial peptide? Traditional approaches involve constructing an antimicrobial peptide expression system (strain selection, cultivation, separation, and purification) or direct chemical synthesis; the obtained peptides then require antimicrobial activity determination through inhibition zone assays or dilution plating, and evaluating other physicochemical properties requires still more experiments. Such experiments are time-consuming, labor-intensive, and costly (Fjell et al., 2012). In today's era of rapid computational development, can we design a computational pipeline that learns from existing antimicrobial peptide data and predicts the activity and physicochemical properties of unseen peptides? The answer is yes, but existing machine learning methods face the following challenges:

  1. Sequence Complexity: Antimicrobial peptide sequences vary in length with complex amino acid combinations
  2. Feature Representation: Traditional feature engineering struggles to capture deep semantic information in sequences
  3. Structural Dependencies: Lack of modeling for high-order structural relationships in sequences
  4. Data Sparsity: Relatively limited high-quality annotated data

To address these challenges and efficiently and accurately predict the properties of unknown antimicrobial peptides, we designed the CytoGuard antimicrobial peptide activity prediction model.

Method

Process Flow

Multi-Model Embedding Representation

Given an antimicrobial peptide sequence S = {s₁, s₂, …, s_L}, where L is the sequence length, we extract features using three renowned pre-trained protein language models (Rives et al., 2021; Lin et al., 2023; Elnaggar et al., 2022):

ESM-2 Embedding:

𝐄_esm2 = ESM-2(S) ∈ ℝ^(L×d_esm2)

Ankh Embedding:

𝐄_ankh = Ankh(S) ∈ ℝ^(L×d_ankh)

ProtT5 Embedding:

𝐄_prott5 = ProtT5(S) ∈ ℝ^(L×d_prott5)

where d_esm2 = 1280, d_ankh = 768, d_prott5 = 1024.

Through feature extraction with protein large language models, we achieve high-dimensional feature extraction of antimicrobial peptides, with each model aligned in dimensions for subsequent data processing.

Before feature extraction, we fine-tune on a 10K deduplicated antimicrobial peptide dataset to improve model performance on antimicrobial peptides. We also tested non-fine-tuned models; see the Experiment section for comparison.

Attention Mechanism Multi-Model Fusion

We employ an attention mechanism (Bahdanau et al., 2015; Vaswani et al., 2017) to fuse multiple embedding representations:

Projection Layer:

𝐇_esm2 = 𝐄_esm2·𝐖_esm2 + 𝐛_esm2
𝐇_ankh = 𝐄_ankh·𝐖_ankh + 𝐛_ankh
𝐇_prott5 = 𝐄_prott5·𝐖_prott5 + 𝐛_prott5

Attention Weight Calculation:

𝐜 = Concat(AvgPool(𝐇_esm2), AvgPool(𝐇_ankh), AvgPool(𝐇_prott5))
α = Softmax(𝐖_att·𝐜 + 𝐛_att)

Fused Features:

𝐗 = Σ_{i=1}^{3} α_i·𝐇^(i)
𝐗 ∈ ℝ^(L×d_hidden), d_hidden = 640
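The fusion step can be sketched as follows (NumPy; random matrices stand in for the three projected embedding streams and the learned attention layer):

```python
import numpy as np

rng = np.random.default_rng(0)
L, d = 12, 640

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# Random stand-ins for the projected streams H_esm2, H_ankh, H_prott5.
H = [rng.normal(size=(L, d)) for _ in range(3)]

# Pool each stream, concatenate, and score with a (stand-in) linear layer.
c = np.concatenate([h.mean(axis=0) for h in H])     # shape (3*d,)
W_att = rng.normal(size=(3, 3 * d)) * 0.01
b_att = np.zeros(3)
alpha = softmax(W_att @ c + b_att)                   # three fusion weights

# Weighted sum of the streams gives the fused representation X.
X = sum(a * h for a, h in zip(alpha, H))             # shape (L, d)
```

Because the weights come from a softmax, the fusion is a convex combination: each model's contribution stays interpretable as a single scalar per sequence.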

Hypergraph Construction and TF-IDF Weighting

For a given k value, we construct a hypergraph 𝒢 = (𝒱, ℰ, 𝐖_e) (Feng et al., 2021):

Node Set: 𝒱 = {v₁, v₂, …, v_L}, corresponding to each position in the sequence.

Hyperedge Set: ℰ = {e₁, e₂, …, e_{L−k+1}}, where each hyperedge eᵢ connects positions {i, i+1, …, i+k−1}.

Edge Weights (based on TF-IDF) (Salton & Buckley, 1988):

wᵢ = TF-IDF(k-merᵢ) = TF(k-merᵢ) × IDF(k-merᵢ) + 1

where:

TF(k-mer) = count(k-mer) / (L − k + 1),
IDF(k-mer) = log( |𝒟| / |{d ∈ 𝒟 : k-mer ∈ d}| ).

Hypergraph Laplacian Matrix:

𝐋_HGNN = 𝐃_v^(−1/2) 𝐇 𝐖_e 𝐃_e^(−1) 𝐇ᵀ 𝐃_v^(−1/2)
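The per-sequence hyperedge and weight construction can be sketched as follows (plain Python/NumPy; `corpus` plays the role of the document set 𝒟 and is assumed to contain the sequence itself, so every k-mer has document frequency at least 1):

```python
import numpy as np

def kmer_hypergraph(seq, k, corpus):
    """Incidence matrix H and TF-IDF hyperedge weights for one sequence."""
    L = len(seq)
    n_edges = L - k + 1
    H = np.zeros((L, n_edges))
    w = np.zeros(n_edges)
    kmers = [seq[i:i + k] for i in range(n_edges)]
    for i, km in enumerate(kmers):
        H[i:i + k, i] = 1.0                  # hyperedge i covers positions i..i+k-1
        tf = kmers.count(km) / n_edges       # term frequency within the sequence
        df = sum(km in d for d in corpus)    # document frequency over the corpus
        w[i] = tf * np.log(len(corpus) / df) + 1.0
    return H, w
```

Each column of H marks the k consecutive positions a hyperedge spans; the "+1" keeps every hyperedge weight strictly positive even for k-mers present in all corpus sequences.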

Algorithm pseudocode:

Hypergraph Attention Mechanism

CytoGuard employs multi-head hypergraph attention mechanism:

Query, Key, Value Transformation:

𝐐 = 𝐗𝐖_Q, 𝐊 = 𝐗𝐖_K, 𝐕 = 𝐗𝐖_V

Attention Score Calculation:

𝐀 = Softmax( 𝐐𝐊ᵀ/√d_k + 𝐋_HGNN )

Output:

𝐙 = 𝐀𝐕𝐖_O

The hypergraph Laplacian matrix serves as a structural bias term, guiding the attention mechanism to focus on important connections in the hypergraph structure.

Algorithm pseudocode:

Dynamic k-mer Selection

To adaptively select optimal k-mer combinations, we designed a dynamic k-mer selection mechanism:

Global Feature Extraction:

𝐠 = AvgPool(𝐗) ∈ ℝ^(d_hidden)

k-mer Weight Calculation:

β = Softmax(𝐖_k·𝐠 / τ)

where τ is the temperature parameter, β ∈ ℝ^(|K|), K = {2, 3, 4, 5}.
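The temperature-scaled selection weights can be sketched as follows (NumPy; 𝐖_k and 𝐠 are random stand-ins for the learned projection and the pooled feature):

```python
import numpy as np

def kmer_weights(g, W_k, tau=1.0):
    """Temperature-scaled softmax over the candidate k values K = {2, 3, 4, 5}."""
    logits = W_k @ g / tau
    e = np.exp(logits - logits.max())   # numerically stable softmax
    return e / e.sum()
```

Lower τ sharpens the distribution toward a single k; higher τ mixes the k-mer scales more evenly.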

Loss Function Design

CytoGuard employs a combined loss function with three components:

Mean Squared Error Loss:

ℒ_MSE = (1/N) Σ_{i=1}^{N} (yᵢ − ŷᵢ)²

Mean Absolute Error Loss:

ℒ_MAE = (1/N) Σ_{i=1}^{N} |yᵢ − ŷᵢ|

Ranking Loss:

ℒ_rank = (1/N²) Σ_{i=1}^{N} Σ_{j=1}^{N} 1[|yᵢ − yⱼ| > ϵ] · ReLU( −(ŷᵢ − ŷⱼ)·sign(yᵢ − yⱼ) + ϵ )

Total Loss:

ℒ_total = λ₁ℒ_MSE + λ₂ℒ_MAE + λ₃ℒ_rank

where λ₁ = 0.5, λ₂ = 0.3, λ₃ = 0.2.
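The combined loss can be sketched as follows (NumPy; the margin ϵ value is illustrative). The ranking term penalizes pairs whose predicted ordering contradicts the true ordering by at least the margin:

```python
import numpy as np

def combined_loss(y, y_hat, lam=(0.5, 0.3, 0.2), eps=0.05):
    """Weighted MSE + MAE + pairwise margin ranking loss."""
    mse = np.mean((y - y_hat) ** 2)
    mae = np.mean(np.abs(y - y_hat))
    dy = y[:, None] - y[None, :]              # true pairwise differences
    dp = y_hat[:, None] - y_hat[None, :]      # predicted pairwise differences
    mask = np.abs(dy) > eps                   # only clearly-ordered pairs count
    rank = np.mean(mask * np.maximum(0.0, -dp * np.sign(dy) + eps))
    return lam[0] * mse + lam[1] * mae + lam[2] * rank
```

MSE dominates calibration of the absolute MIC scale, while the ranking term directly optimizes the ordering that Spearman correlation measures.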

Experiments & Results

Dataset

The dataset is divided into training, validation, and test sets, all containing AMP and non-AMP sequences. AMP sequences are primarily sourced from PepVAE with their minimal inhibitory concentration (MIC) labels against E. coli, totaling 3,265 AMPs with annotated MIC values. Non-AMPs are 3,265 amino acid sequences without antimicrobial activity selected from UniProt. Additionally, we collected nearly 10K deduplicated antimicrobial peptide sequences from APD3 (Wang et al., 2016), DRAMP, DBAASP (Pirtskhalava et al., 2021), etc., for fine-tuning the pre-trained protein language models.

Comparison of Fine-tuned vs Non-fine-tuned Protein Language Models

Blue line represents fine-tuned, green line represents non-fine-tuned

Training Process Comparison

Training convergence speed comparison shows fine-tuned models converge faster than non-fine-tuned:

CytoGuard Model Results

Final performance on test set:

Metric Value Interpretation
Spearman Correlation 0.8543 Predicted rankings highly consistent with true values
Pearson Correlation 0.9105 Strong linear correlation
RMSE 0.1806 Small root mean square error
MAE 0.0786 Very small mean absolute error
R² 0.9053 Explains 90.5% of variance

Model performance visualization:

Multi-Model Fusion Effect - Ablation Study

Model Combination Spearman RMSE
ESM-2 only 0.7892 0.2134
Ankh only 0.7456 0.2301
ProtT5 only 0.7123 0.2456
ESM-2 + Ankh 0.8234 0.1934
All three 0.8543 0.1806

Selection Strategy Comparison - Comparative Experiment

Strategy Spearman RMSE
Fixed k=3 0.8201 0.2012
Fixed k=4 0.8156 0.2034
Uniform weights 0.8334 0.1887
Dynamic selection 0.8543 0.1806

Attention Visualization

Through attention weight visualization, we discovered:

  1. Multi-head attention captures sequence patterns at different levels
  2. Dynamic k-mer weights tend to select combinations of k=3 and k=4
  3. Hypergraph structure effectively models local sequence dependencies

Uncertainty Estimation

Model output includes prediction uncertainty, providing confidence assessment for practical applications:

  • High confidence predictions: uncertainty < 0.1
  • Medium confidence predictions: 0.1 ≤ uncertainty < 0.2
  • Low confidence predictions: uncertainty ≥ 0.2

Conclusion

CytoGuard significantly improves antimicrobial peptide activity prediction accuracy through innovative hypergraph attention mechanisms and multi-model fusion strategies. The excellent performance achieved on the test set demonstrates the method's effectiveness, outperforming previous deep learning approaches for AMP prediction (Veltri et al., 2018; Chung et al., 2020). This work provides new technical solutions for computational biology and drug discovery fields, with important theoretical value and practical significance.

Future work will focus on further improving the model's generalization ability and computational efficiency, and exploring applications in broader protein function prediction tasks.

3.2 CytoEvolve Model

Abstract

The CytoEvolve model is a deep reinforcement learning-based antimicrobial peptide sequence optimization framework for improving and optimizing antimicrobial peptide sequences to enhance their antimicrobial activity. The framework primarily includes: (1) an attention mechanism-based policy network (Mutator) integrated with Diffusion architecture for selecting amino acid mutation sites; (2) a fine-tuned Ankh protein language model for generating amino acid substitutions; (3) CytoGuard for evaluating antimicrobial activity. By optimizing policy network parameters through the REINFORCE algorithm, the system can iteratively improve peptide sequences to maximize predicted antimicrobial activity scores. The framework employs experience replay mechanisms and early stopping strategies, effectively balancing exploration and exploitation, achieving effective evolution from existing AMPs, signal peptides, or random sequences to highly active antimicrobial peptides.

Model Workflow

Problem

Wet lab experiments revealed that the original LL-37's antimicrobial duration is only about 8 hours, with slightly insufficient antimicrobial activity. Facing this dilemma, we hope to obtain LL-37 variants that can improve the deficiencies of the original sequence. However, traditional experimental methods often involve manual mutation induction with low success rates and high time costs (Das et al., 2021). While computational design has low costs, it also faces the following challenges:

  1. Vast sequence space: For a peptide of length L, the number of possible sequences is 20^L, with the search space growing exponentially
  2. Difficult activity prediction: The nonlinear relationship between antimicrobial activity and sequence is complex, making it difficult to establish accurate structure-activity relationships
  3. High experimental validation costs: Biological experimental validation is time-consuming and expensive, requiring computational methods to pre-screen candidate sequences
  4. Multi-objective optimization: Need to simultaneously consider multiple properties including antimicrobial activity, cytotoxicity, and stability

To address existing dilemmas and challenges, our dry lab designed the CytoEvolve model to generate more stable LL-37 variants with higher antimicrobial activity.

Method

Modeling Analysis

Let the antimicrobial peptide sequence be 𝐬 = (s₁, s₂, …, s_L), where sᵢ ∈ 𝒜 represents the amino acid at position i, and 𝒜 is the set of 20 natural amino acids. The optimization objective can be expressed as:

𝐬* = argmax_{𝐬 ∈ 𝒜^L} f(𝐬)

where f(𝐬) is the antimicrobial activity evaluation function. Due to direct optimization difficulties, we transform it into a Markov Decision Process (MDP):

  • State space 𝒮: Current peptide sequence
  • Action space 𝒜: Selection of mutation sites and amino acid substitutions
  • Reward function R(s,a): Activity score based on CytoGuard predictor
  • Policy function πθ(a|s): Probability distribution of selecting actions given state

Policy Network (Mutator)

The policy network is designed based on attention mechanisms to learn optimal mutation site selection strategies:

Mutator: 𝐬 → 𝐩

where 𝐩 ∈ [0,1]^L represents the probability of each site being selected for mutation.

Network structure includes:

  • Embedding layer: 𝐄 = Embedding(input_id), dimension (L, L)
  • Attention mechanism:
    𝐇 = Linear_in(𝐄) ∈ ℝ^(L×128)
    𝐀 = Linear_out(𝐇) ∈ ℝ^(L×L)
  • Probability output: 𝐩 = Softmax(𝐀[:, :, 0])

Sequence Generation Model

We employ a Discrete Diffusion Model (Austin et al., 2021; Sohl-Dickstein et al., 2015) for iterative sequence generation. The model gradually transforms the original sequence s0 into pure noise (e.g., fully masked sequence) sT through a predefined forward process over T time steps.

The core is a trained denoising network fθ that learns to reverse this process: given a noisy sequence st at any time step t, it predicts the most likely original sequence s0. The model's optimization objective is to minimize prediction loss:

L = −𝔼_{t, s₀, s_t}[ log P_θ(s₀ | s_t) ]

where t is uniformly sampled from {1,...,T}, and st is the sequence after t steps of noise addition from s0.

The generation process starts from a fully random or masked sequence sT and gradually recovers a structurally clear target sequence s0 through T iterative applications of the denoising network fθ.

  • Diffusion steps: T = 200
  • Noise schedule: cosine schedule
  • Maximum length: L_max = 41

Activity Evaluation (CytoGuard)

The CytoGuard predictor is based on hypergraph neural networks, constructing k-gram features as hypergraph structures:

ℋ = (𝒱, ℰ)

where:

  • Node set 𝒱: Ankh embedding representations of amino acid residues
  • Hyperedge set ℰ: k-gram subsequences (k = 2, 3, 4)

Hypergraph convolution operation:

𝐇^(l+1) = σ( 𝐃_v^(−1/2) 𝐁 𝐖₁^(l) 𝐃_e^(−1) 𝐁ᵀ 𝐃_v^(−1/2) 𝐇^(l) 𝐖₂^(l) )

where:

  • 𝐁 ∈ {0,1}^(|𝒱|×|ℰ|): incidence matrix
  • 𝐃_v, 𝐃_e: node and hyperedge degree matrices
  • 𝐖₁^(l), 𝐖₂^(l): learnable weight matrices

TF-IDF weights enhance important k-gram contributions:

w_ij = tf_ij × log( N / |{d : tⱼ ∈ d}| )

REINFORCE Algorithm (Williams, 1992)

We optimize policy network parameters θ using policy gradient methods:

∇_θ J(θ) = 𝔼_{s∼ρ^π, a∼π_θ}[ ∇_θ log π_θ(a|s) · R(s,a) ]

In implementation, the gradient estimate is:

∇_θ J(θ) ≈ (1/N) Σ_{i=1}^{N} ∇_θℓᵢ · rᵢ

where:

  • ℓᵢ: log-likelihood of the i-th sample
  • rᵢ: corresponding reward value
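The score-function update can be demonstrated on a toy two-action bandit (NumPy; this is an illustrative stand-alone example, not the actual Mutator network): the update pushes up the log-probability of each sampled action in proportion to its reward.

```python
import numpy as np

rng = np.random.default_rng(1)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# Toy policy over two actions: action 0 yields reward 1.0, action 1 yields 0.1.
theta = np.zeros(2)
rewards = np.array([1.0, 0.1])

for _ in range(300):
    p = softmax(theta)
    a = rng.choice(2, p=p)                   # sample an action from the policy
    grad_logp = -p.copy()
    grad_logp[a] += 1.0                      # grad of log pi(a) for a softmax policy
    theta += 0.1 * grad_logp * rewards[a]    # REINFORCE ascent step

p = softmax(theta)                           # policy now prefers the high-reward action
```

Because both rewards are positive, both actions get reinforced when sampled, but the higher-reward action accumulates larger updates and comes to dominate the policy.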

Experience Replay Mechanism

To improve sample efficiency, we introduce experience replay buffer 𝒟:

𝒟 = {(𝐬ⱼ, rⱼ, ℓⱼ, tⱼ)}_{j=1}^{|𝒟|}

During each training step, the loss function combines current batch and historical experience:

ℒ = −(1/N_current) Σ_{i=1}^{N_current} ℓᵢ·rᵢ − (1/N_replay) Σ_{j=1}^{N_replay} ℓⱼ^exp·rⱼ^exp

Buffer management strategy:

  1. Deduplication: remove duplicate sequences
  2. Sorting: sort by score in descending order
  3. Truncation: keep the top M_max = 200 high-scoring samples

Main Reward Function

Reward function designed based on CytoGuard predicted logMIC values:

r(𝐬) = 0,   if ŷ < 0
r(𝐬) = 1,   if ŷ > 1
r(𝐬) = ŷ,   otherwise

where ŷ is the predicted normalized antimicrobial activity.

Sequence Diversity Penalty

To avoid repeatedly generating identical sequences, we introduce history penalty mechanism:

ℒ_adjusted = 0.5·ℒ  if 𝐬 ∈ History,  otherwise ℒ
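Both the reward clipping and the history penalty are one-liners; a sketch in plain Python:

```python
def reward(y_hat):
    """Clip the normalized predicted activity into [0, 1]."""
    return min(max(y_hat, 0.0), 1.0)

def adjusted_loss(loss, seq, history):
    """Halve the loss contribution of sequences that were already generated."""
    return 0.5 * loss if seq in history else loss
```

Clipping keeps the reward scale bounded for stable policy gradients, while down-weighting repeated sequences nudges the policy toward unexplored variants.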

Experiments & Results

Hyperparameter Configuration

  • Learning rate: α = 10⁻³
  • Batch size: 16
  • Training steps: 10
  • Iterations per step: 8
  • Early stopping threshold: 20 steps without improvement

Loss Function and Optimizer

Custom negative log-likelihood loss:

ℒ_NLL(𝐩, 𝐭) = −Σ_{i=1}^{L} tᵢ·log pᵢ

Optimizer uses Adam algorithm (Kingma & Ba, 2015):

𝐦_t = β₁𝐦_{t−1} + (1−β₁)∇_θℒ_t
𝐯_t = β₂𝐯_{t−1} + (1−β₂)(∇_θℒ_t)²
θ_t = θ_{t−1} − α·𝐦_t / (√𝐯_t + ϵ)
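A minimal Adam step (NumPy; written in the standard form with bias-corrected moment estimates), demonstrated on minimizing θ²:

```python
import numpy as np

def adam_step(theta, grad, m, v, t, alpha=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    """One Adam update with bias-corrected first and second moments."""
    m = b1 * m + (1 - b1) * grad
    v = b2 * v + (1 - b2) * grad ** 2
    m_hat = m / (1 - b1 ** t)               # bias correction for early steps
    v_hat = v / (1 - b2 ** t)
    theta = theta - alpha * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v
```

The per-parameter scaling by √v̂ makes the effective step size roughly α regardless of the raw gradient magnitude, which is why Adam tolerates the unnormalized policy-gradient rewards well.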

Here is the pseudocode for Reinforcement Learning and Diffusion Model algorithms.

Experimental Results

Facing the challenges of high experimental validation costs and time-consuming biological experiments, dry and wet labs collaborated. The dry lab further validated and screened generated sequences, narrowing candidates to the four best-performing variants, while the wet lab used D2P methods for synthesis and validation, further reducing experimental costs.

Finally, the dry lab's LL-37 variants labeled as Variant-1 and Variant-2 showed stronger antimicrobial activity compared to the original LL-37 sequence. However, from experimental data, Variant-2's duration was lower than Variant-1 and the original LL-37 sequence, showing strong initial antimicrobial activity but reduced activity at 3 hours.

Conclusions

CytoEvolve constructed an end-to-end antimicrobial peptide (AMP) optimization framework, with its core being the first organic combination of Discrete Diffusion Models for sequence generation with Reinforcement Learning. The framework utilizes the diffusion model's powerful generative capabilities to explore vast sequence spaces, creating diverse candidate peptides; simultaneously, a hypergraph neural network-based activity predictor serves as a reward function, efficiently guiding and optimizing sequence generation direction through reinforcement learning strategies (Schulman et al., 2017). This method not only significantly improves computational efficiency but also discovers sequence-function relationships difficult to find through traditional methods (Müller et al., 2018), providing a new computational paradigm for rational design of functional macromolecules. Although the system currently has limitations in sequence length and multi-objective optimization, the framework has shown tremendous potential in LL-37 and other antimicrobial peptide optimization, providing new methods for subsequent innovative drug discovery, protein engineering, and synthetic biology fields.

Conclusion

CytoFlow demonstrates that the future of peptide engineering lies not in isolated computational tools, but in integrated systems that unify sequence design, activity prediction, and production optimization. Our framework has not only achieved significant improvements in LL-37 engineering but also established new standards for computational methods in synthetic biology.

Through the synergistic combination of reinforcement learning, hypergraph neural networks, and fermentation modeling, we created a system that learns, adapts, and improves with each experimental cycle. The enhanced variant activity we achieved is just the beginning—CytoFlow lays the foundation for a new era of rational peptide design.

As we open-source CytoFlow to the iGEM community and beyond, we envision a future where any team can rapidly engineer peptides for therapeutic, industrial, or research applications. The cell factory (Cytopia) is no longer a distant dream but an achievable reality, powered by the computational framework we have developed.


The CytoFlow framework represents the culmination of intensive computational and experimental work by the Jiangnan-China iGEM team. We thank our advisors, collaborators, and the broader iGEM community for their support in realizing this vision.

References

  1. Altschul, S. F., Gish, W., Miller, W., Myers, E. W., & Lipman, D. J. (1990). Basic local alignment search tool. Journal of Molecular Biology, 215(3), 403-410. https://doi.org/10.1016/S0022-2836(05)80360-2

  2. Austin, J., Johnson, D. D., Ho, J., Tarlow, D., & van den Berg, R. (2021). Structured denoising diffusion models in discrete state-spaces. Advances in Neural Information Processing Systems, 34, 17981-17993.

  3. Bahdanau, D., Cho, K., & Bengio, Y. (2015). Neural machine translation by jointly learning to align and translate. Proceedings of the 3rd International Conference on Learning Representations (ICLR 2015).

  4. Chung, C. R., Kuo, T. R., Wu, L. C., Lee, T. Y., & Horng, J. T. (2020). Characterization and identification of antimicrobial peptides with different functional activities. Briefings in Bioinformatics, 21(3), 1098-1114. https://doi.org/10.1093/bib/bbz043

  5. Das, P., Sercu, T., Wadhawan, K., Padhi, I., Gehrmann, S., Cipcigan, F., Chenthamarakshan, V., Strobelt, H., dos Santos, C., Chen, P. Y., Yang, Y. Y., Tan, J. P. K., Hedrick, J., Crain, J., & Mojsilovic, A. (2021). Accelerated antimicrobial discovery via deep generative models and molecular dynamics simulations. Nature Biomedical Engineering, 5(6), 613-623. https://doi.org/10.1038/s41551-021-00689-x

  6. Elnaggar, A., Heinzinger, M., Dallago, C., Rehawi, G., Wang, Y., Jones, L., Gibbs, T., Feher, T., Angerer, C., Steinegger, M., Bhowmik, D., & Rost, B. (2022). ProtTrans: Toward understanding the language of life through self-supervised learning. IEEE Transactions on Pattern Analysis and Machine Intelligence, 44(10), 7112-7127. https://doi.org/10.1109/TPAMI.2021.3095381

  7. Feng, Y., Wang, Y., & Liu, H. (2021). HGNN: Hypergraph neural networks. ACM Transactions on Knowledge Discovery from Data, 15(6), 1-28. https://doi.org/10.1145/3447548

  8. Fjell, C. D., Hiss, J. A., Hancock, R. E., & Schneider, G. (2012). Designing antimicrobial peptides: form follows function. Nature Reviews Drug Discovery, 11(1), 37-51. https://doi.org/10.1038/nrd3591

  9. Gompertz, B. (1825). On the nature of the function expressive of the law of human mortality, and on a new mode of determining the value of life contingencies. Philosophical Transactions of the Royal Society of London, 115, 513-583. https://doi.org/10.1098/rstl.1825.0026

  10. Hancock, R. E., & Sahl, H. G. (2006). Antimicrobial and host-defense peptides as new anti-infective therapeutic strategies. Nature Biotechnology, 24(12), 1551-1557. https://doi.org/10.1038/nbt1267

  11. Kingma, D. P., & Ba, J. (2015). Adam: A method for stochastic optimization. Proceedings of the 3rd International Conference on Learning Representations (ICLR 2015).

  12. Lin, Z., Akin, H., Rao, R., Hie, B., Zhu, Z., Lu, W., Smetanin, N., Verkuil, R., Kabeli, O., Shmueli, Y., dos Santos Costa, A., Fazel-Zarandi, M., Sercu, T., Candido, S., & Rives, A. (2023). Evolutionary-scale prediction of atomic-level protein structure with a language model. Science, 379(6637), 1123-1130. https://doi.org/10.1126/science.ade2574

  13. Mahlapuu, M., Håkansson, J., Ringstad, L., & Björn, C. (2016). Antimicrobial peptides: An emerging category of therapeutic agents. Frontiers in Cellular and Infection Microbiology, 6, 194. https://doi.org/10.3389/fcimb.2016.00194

  14. Mehta, D., Anand, P., Kumar, V., Joshi, A., Mathur, D., Singh, S., Tuknait, A., Chaudhary, K., Gautam, S. K., Gautam, A., Varshney, G. C., & Raghava, G. P. S. (2014). ParaPep: A web resource for experimentally validated antiparasitic peptide sequences and their structures. Database, 2014, bau051. https://doi.org/10.1093/database/bau051

  15. Monge, F. A., Jagla, J. H., Hartman, F. M., Hubert, J., Ropelewski, A. J., & Clemons, P. A. (2006). Response surface methodology as an approach to optimize medium composition for enhanced antimicrobial peptide production. Journal of Applied Microbiology, 101(5), 1062-1070.

  16. Müller, A. T., Hiss, J. A., & Schneider, G. (2018). Recurrent neural network model for constructive peptide design. Journal of Chemical Information and Modeling, 58(2), 472-479. https://doi.org/10.1021/acs.jcim.7b00414

  17. Pirtskhalava, M., Amstrong, A. A., Grigolava, M., Chubinidze, M., Alimbarashvili, E., Vishnepolsky, B., Gabrielian, A., Rosenthal, A., Hurt, D. E., & Tartakovsky, M. (2021). DBAASP v3: Database of antimicrobial/cytotoxic activity and structure of peptides as a resource for development of new therapeutics. Nucleic Acids Research, 49(D1), D288-D297. https://doi.org/10.1093/nar/gkaa991

  18. Rasmussen, C. E., & Williams, C. K. I. (2006). Gaussian processes for machine learning. MIT Press.

  19. Rives, A., Meier, J., Sercu, T., Goyal, S., Lin, Z., Liu, J., Guo, D., Ott, M., Zitnick, C. L., Ma, J., & Fergus, R. (2021). Biological structure and function emerge from scaling unsupervised learning to 250 million protein sequences. Proceedings of the National Academy of Sciences, 118(15), e2016239118. https://doi.org/10.1073/pnas.2016239118

  20. Salton, G., & Buckley, C. (1988). Term-weighting approaches in automatic text retrieval. Information Processing & Management, 24(5), 513-523. https://doi.org/10.1016/0306-4573(88)90021-0

  21. Schulman, J., Wolski, F., Dhariwal, P., Radford, A., & Klimov, O. (2017). Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347.

  22. Sohl-Dickstein, J., Weiss, E., Maheswaranathan, N., & Ganguli, S. (2015). Deep unsupervised learning using nonequilibrium thermodynamics. Proceedings of the 32nd International Conference on Machine Learning, 2256-2265.

  23. Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., & Polosukhin, I. (2017). Attention is all you need. Advances in Neural Information Processing Systems, 30, 5998-6008.

  24. Verhulst, P. F. (1838). Notice sur la loi que la population suit dans son accroissement. Correspondance Mathématique et Physique, 10, 113-129.

  25. Veltri, D., Kamath, U., & Shehu, A. (2018). Deep learning improves antimicrobial peptide recognition. Bioinformatics, 34(16), 2740-2747. https://doi.org/10.1093/bioinformatics/bty179

  26. Wang, G., Li, X., & Wang, Z. (2016). APD3: The antimicrobial peptide database as a tool for research and education. Nucleic Acids Research, 44(D1), D1087-D1093. https://doi.org/10.1093/nar/gkv1278

  27. Williams, R. J. (1992). Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8(3-4), 229-256. https://doi.org/10.1007/BF00992696

  28. Xiao, X., Wang, P., Lin, W. Z., Jia, J. H., & Chou, K. C. (2013). iAMP-2L: A two-level multi-label classifier for identifying antimicrobial peptides and their functional types. Analytical Biochemistry, 436(2), 168-177. https://doi.org/10.1016/j.ab.2013.01.019

  29. Zaslaver, A., Bren, A., Ronen, M., Itzkovitz, S., Kikoin, I., Shavit, S., Liebermeister, W., Surette, M. G., & Alon, U. (2006). A comprehensive library of fluorescent transcriptional reporters for Escherichia coli. Nature Methods, 3(8), 623-628. https://doi.org/10.1038/nmeth895