Overview

The harsh conditions of extreme cold impede humanity's pursuit of beauty. We have discovered that antifreeze proteins possess functions such as lowering the freezing point and enhancing thermal hysteresis activity, offering novel insights into frost resistance in extreme environments. However, the types of naturally occurring antifreeze proteins are limited, and the precise mechanisms of frost resistance among different proteins remain unclear. We plan to design novel synthetic antifreeze proteins to further enhance their frost resistance activity.

We contend this project aligns strongly with the DBTL principle. We commence with rational protein design using computational models. Following scoring of antifreeze protein candidates via predictive models, we proceed with synthetic biology and wet-lab experiments to construct them in diverse expression systems. Validation employs molecular dynamics simulations and ice crystal formation assays, aiming to identify key mechanisms in phenotypically superior proteins. This approach holds promise for future iGEM projects in antifreeze protein design.

The scope of in vitro experiments spans de novo protein design to molecular dynamics exploration, incorporating machine learning, neural networks, and large language model technologies. This systematic approach to exploration and innovation delivers multidimensional solutions to the challenge of antifreeze protein design.

Tianjin 2025 Model Framework Diagram

Figure 1. Tianjin 2025 Model Framework Diagram

Protein Design

Design Concept

This project establishes a three-track parallel computational design strategy, combining point mutation optimization and inverse folding based on natural templates, as well as de novo design based on physical rules, to systematically explore the design space of antifreeze proteins. We aim not only to optimize the performance of existing AFPs, but also to create new antifreeze proteins with novel structures and enhanced functions.

Design Architecture

Our design platform includes three complementary design dimensions, each subject to rigorous validation criteria.

Track 1: Point Mutation Optimization Based on Natural Templates

We selected four natural antifreeze proteins (PDB codes of 3WP9, 3ULT, 4NU2 and 6A8K) with significant antifreeze effect as design templates, and designed mutation sites through the ESM-1v model. Highly conserved key residues during evolution (used to maintain structural core stability) and highly variable surface residues (as potential targets for functional optimization) were identified. On this basis, we prioritize directed mutations in the variable region to explore the possibility of functional enhancement while maintaining the overall fold of the protein.

Track 2: Inverse Folding Optimization Based on Natural Templates

Based on the multi-objective optimization framework, the design strategy organically combines deep learning and molecular simulation technology to achieve systematic improvement of antifreeze protein performance. The entire process follows DBTL's engineering principles to ensure a high success rate for the final candidate molecule.

The first stage is the initial design of multiple templates
We continued to modify the above four proteins, and through the multi-parameter scanning strategy of ProteinMPNN, we generated a series of candidate sequences for each template, realizing the systematic exploration of the AFP sequence space.

The second stage is multi-dimensional performance evaluation
High-throughput screening of candidate sequences was achieved through quadruple parallel calculations of AlphaFold2 structure verification, FoldX stability prediction, protein-sol solubility analysis and predictor function evaluation.

The third stage is dynamic performance verification
Molecular dynamics simulation is used to evaluate the initial dynamics of excellent candidate molecules to provide decisions for the final output. See our Molecular Dynamics Simulation page for details.

The fourth stage is the iterative optimization cycle
Based on the simulation results, we establish an adaptive optimization loop: candidate molecules with better performance than natural antifreeze proteins or better performance are directly exported for wet experimental verification. The variants that do not meet the standard enter the next round of design as parent sequences, and achieve directed evolution of performance through a gradual cooling strategy.

Inverse folding optimization process

Figure 2. Inverse folding optimization process
Track 3: DeNovo Design Based on Inverse Folding Optimization

We have developed three innovative de novo design strategies to tailor the structure to the functional needs of antifreeze proteins.

Strategy 1 Spiral periodic function array
Considering that the structure of type I antifreeze protein is mainly composed of α helix and rich in alanine, and alanine is the strongest helix former, we use alanine as the sequence to build a standard α helix, and use 11 residue intervals to implant threonine. This interval corresponds precisely to the 3-turn period of the α spiral, ensuring that all antifreeze sites are oriented on the same side of the helix to form a continuous ice-binding interface, which is visualized using PyMOL.Subsequent DeNovo1 and DeNovo2 were designed in this way.

Spiral array design visualization
Figure 3. Using PyMOL to ensure that the antifreeze sites are oriented on the same side of the helix (the figure shows Cα)

Strategy 2 Ice lattice matching function array
The literature shows that the strongest ice binding can be achieved when the spacing of the antifreeze sites is 1.5 times that of the ice lattice (4.5 Å), or 6.75 Å. Therefore, we also use the alanine spiral skeleton as the basis and implant threonine at 3-residue intervals. The design makes the spacing of the antifreeze sites approximately 6.75Å to maximize the complementarity of the ice crystal interface and optimize the adsorption energy.Subsequent DeNovo3 and DeNovo4 were designed in this way.

Ice lattice matching visualization
Figure 4.1 Regulation of antifreeze site spacing using PyMOL (the figure shows Cα)
Ice lattice matching visualization
Figure 4.2 Adjusted final spacing (the figure shows Cα)
Ice lattice matching visualization
Figure 5. Simplified process of designing strategies 1 and 2 from scratch

Strategy 3 AI generates a new skeleton
RFdiffusion2 was used to generate a new skeleton from scratch to explore the design space beyond natural proteins, and subsequent DeNovo5 and DeNovo6 were generated using this method. Unified optimization process.

At this point, all the skeleton structures generated by the de novo design strategy have subsequently entered the unified inverse folding optimization program. Sequence design was performed by ProteinMPNN, and key functional residues were fixed during optimization while the remaining sites were optimized for stability and solubility. At the same time, based on feedback from subsequent molecular dynamics simulations and the potential for ice nucleation due to the lack of a water transport core, we added caps to DeNovo2 and DeNovo4 (source Uniprot:A0A915PK15) and not to DeNovo1 and DeNovo3.

Results and Analysis

Track 1: We selected the loci with a score threshold greater than 1 to obtain mutants with no more than 3 loci for four protein mimicking mutations, and successfully constructed them in different chassis strains of E. coli and yeast.

ESM-1v mutation simulation results
Figure 6. Simulation results of ESM-1v mutation

Track 2 & 3: Through the inverse folding optimization process, we generated no less than 300 candidate sequences for each template, and successfully obtained and output 10 high-quality antifreeze protein design variants that were validated in multiple dimensions, which were then verified by wet experiments. Among them, 3WP9-IF, 3ULT-IF, 4NU2-IF and 6A8K-IF were obtained by the inverse folding optimization pipeline, DeNovo1 and DeNovo2 were obtained by spiral periodic functional arrays, DeNovo3 and DeNovo4 were obtained by ice lattice matching functional arrays, and DeNovo5 and DeNovo6 were generated by AI (the sequences of the 10 output proteins and some candidate sequences will be placed in the supplementary file).

Blastp sequence alignment
We first sequence aligned the protein sequences obtained with ProteinMPNN defolding, which was done using Blastp.

Sequence comparison

3WP9-IF                                                                                3ULT-IF


Sequence comparison

4NU2-IF                                                                                6A8K-IF


Figure 7. Sequence comparison
Structural Reliability Analysis (AlphaFold2)

The sequence consistency between the four variants and the original protein was less than 55%, and the sequence similarity was less than 70%, indicating that the sequence of the variant obtained by defolding was large and the original sequence was far from the original sequence, and it had strong novelty. Structural reliability analysis of AlphaFold2

We conducted a rigorous folding reliability evaluation of 10 design variants using a comprehensive structural biology analysis method.

AlphaFold2 structure reliability

3WP9-IF                                                                                3ULT-IF


AlphaFold2 structure reliability

4NU2-IF                                                                                6A8K-IF


AlphaFold2 structure reliability

DeNovo1                                                                                DeNovo2


AlphaFold2 structure reliability

DeNovo3                                                                                DeNovo4


AlphaFold2 structure reliability

DeNovo5                                                                                DeNovo6


Figure 8.1 pLDDT

Most of the regional confidence levels of the inverse fold variant and the de novo design variant are above 90, and the local low confidence area is mostly a flexible region, and each variant shows high confidence as a whole. The predicted structure is very similar to the actual structure, which lays a good foundation for stability testing and molecular dynamics simulation.

AlphaFold2 structure reliability

3WP9-IF                                                                                3ULT-IF


AlphaFold2 structure reliability

4NU2-IF                                                                                6A8K-IF


AlphaFold2 structure reliability

DeNovo1                                                                                DeNovo2


AlphaFold2 structure reliability

DeNovo3                                                                                DeNovo4


AlphaFold2 structure reliability

DeNovo5                                                                                DeNovo6


Figure 8.2 PAE

We analyzed the prediction alignment error plots provided by AlphaFold Server and found that the PAE plots of variants other than 3ULT-IF and DeNovo5 roughly showed diagonally distributed dark green squares with clear outlines, indicating that each variant had one or more stable domains and the relative positions between internal residues were very stable. The PAE plots of 3ULT-IF and DeNovo5 showed some light green plates, indicating that there were some disordered regions in their sequences, which could not form stable folds, but we could still further verify their frost resistance through wet experiments.

Comprehensive Performance Analysis

We quantitatively evaluated and compared key performance indicators for all variants.

Table 1 Key calculation evaluation indicators of output proteins
AlphaFold2 structure reliability

Stability improvement: ΔΔG of all variants is less than 0, and the inverse folding strategy has obvious advantages in stability optimization.

Significant functional activity: all variants showed high frost resistance probability, and the predicted frost resistance activity of rational design variants was greater than 0.8, which verified the effectiveness of physical rule design.

Good solubility: All variants have high solubility, which provides a good foundation for frost resistance. Gromacs molecular dynamics simulation.

This section will not be demonstrated first, and a separate module will be opened for detailed explanation.

Cross-validation of dry and wet experiments

The computational design pipeline of this project was finally verified by wet experiments. The experimental results show that the success rate and performance trend of the computational design strategy are more consistent with our predictions.

Track 1

ESM-1v mutation simulation results
Figure 9. Comparison of IRI activity between wild-type AFPs and their mutants. (a) Ice crystal morphology after freezing at -6 °C for 30 min. (b) Mean ice crystal diameter analysis. Samples on the x-axis include the AFPs and their mutants (Sta: wild-type E. coli control). The y-axis indicates the measured ice crystal size (µm²)

Track 2

ESM-1v mutation simulation results
Figure 10. IRI activity of crude enzyme solutions containing the engineered proteins 6A8K-IF, 4NU2-IF, and 3WP9-IF. (a) Representative ice crystal images after incubation at -6 °C for 30 minutes. (b) Mean ice crystal diameter analysis. The x-axis indicates the different protein samples, with "Sta" corresponding to the wild-type E. coli crude enzyme solution (control). The y-axis shows the ice crystal size (µm²) measured after 30 minutes of freezing.

Track 3

ESM-1v mutation simulation results
Figure 11. IRI activity characterization of crude enzyme solutions containing de novo proteins designed by inverse folding. (a) Ice crystal morphology observed after freezing at -6 °C for 30 minutes. (b) Mean ice crystal diameter analysis. The x-axis indicates the different protein samples, with “Sta” representing the wild-type E. coli crude enzyme solution (control). The y-axis shows the ice crystal size (µm²).

The wet experiment results show that our computational design strategy is indeed effective, especially the de novo design strategy, which proves its great value in cutting-edge protein engineering with its high success rate of 50% and excellent performance. However, compared with natural antifreeze protein, there is still a certain gap in the design of antifreeze protein, and the design strategy needs to be further improved and perfected. See the Engineering section for a detailed analysis.

Comparison of Design Strategy Effectiveness

Both the point mutation optimization and the inverse folding optimization strategies show the most reliable stability improvement effect. This result is in line with expectations, as both strategies are based on an optimized framework for native proteins, maximizing the preservation of evolutionarily tested stable structural cores. The four inverse fold variants also showed high solubility and frost resistance probability, and the protein crude enzyme solution of the mutants of the two strategies was verified by wet experiments, proving the robust optimization ability of these proteins under the premise of maintaining function.

The rational design strategy performed well in terms of frost resistance, especially the variant DeNovo3 based on the ice crystal matching principle obtained a predicted frost resistance probability of 0.960. This result verifies the effectiveness of the functional locus arrangement design based on physical rules, and its excellent performance in core functional indicators makes it the first choice for function-oriented applications.

Although DeNovo1 and DeNovo3 exhibited a certain degree of ice-promoting nucleation activity in the experiment, after we added hydrophobic cores, DeNovo2 and DeNovo4 showed higher antifreeze activity than natural proteins and exhibited the same trend as predicted.

The AI-generated design strategy is unique in terms of structural novelty, and the resulting variants have an unnatural folding pattern, providing a new structural template for the antifreeze protein family, which has important basic research value. Among them, the DeNovo5 and DeNovo6 variants have good indicators, which are worthy of further in-depth study.

Conclusion

The three-track design strategy proved highly successful. Point mutation and inverse folding optimization are reliable for stable variants, rational de novo design shows great functional potential, and AI generation offers structural innovation. All output molecules have been computationally verified, providing a solid foundation for wet experiments.

Prediction Model

Description

AFP-igemTJ2025 presents an end-to-end computational framework for predicting protein function directly from sequence data. It employs a two-stage architecture combining a pre-trained protein language model for feature extraction with a bespoke deep learning classifier for accurate functional annotation.

Datasets

We constructed a dataset comprising 920 AFPs and 9493 non-AFPs. A comprehensive search was conducted in the UniProtKB database until January 24, 2025 using specific keywords, resulting in the collection of 6589 AFPs. Next, the maximal pairwise sequence identity of the proteins in the manually inspected dataset was culled to ≤40% using CD-HIT, yielding a set of 920 unique AFPs.

AFP dataset curation pipeline overview

Figure 12. Keywords for antifreeze proteins

The negative dataset was derived from 9493 seed proteins of Pfam protein families that are not associated with antifreeze proteins, which is widely used for evaluating the performance of AFPs prediction methods.

Balance dataset: The dataset was divided into training and test sets, 644 AFPs and 644 non-AFPs were randomly selected as positive and negative samples to form the training dataset, and the remaining 276 AFPs and 8849 non-AFPs were designated as the test dataset.

Imbalanced dataset: To further validate the predictive performance and facilitate subsequent research, we introduced an imbalanced dataset. This dataset was divided with 70% AFPs and non-AFPs for training and the remaining 30% for independent testing respectively.

Predicted Model Architecture

The model is built upon the ProtT5-XL-UniRef50 encoder, a transformer-based model pre-trained on the UniRef50 database, to generate high-dimensional feature representations (1024 dimensions per amino acid) from input protein sequences. These rich embeddings capture complex semantic and syntactic patterns within the protein sequences.

The architecture of the ProtT5 model essentially involves the complete transfer and adaptation of the highly successful T5 (Text-To-Text Transfer Transformer) model from natural language processing to the field of protein sequence analysis. Its core is an encoder-decoder framework based on the Transformer architecture. The model learns the "language rules" of proteins through self-supervised pre-training on large databases comprising billions of protein sequences, using a method called "Span Corruption." Specifically, the model randomly masks continuous segments of the input amino acid sequence, and the decoder reconstructs these original masked amino acids based on the sequence context understood by the encoder. This process forces the model to deeply grasp the complex biochemical, structural, and evolutionary relationships between amino acids. Ultimately, the power of ProtT5 lies in its ability to generate a highly context-aware vector representation for each amino acid in its input. These representations, enriched with extensive biological knowledge, can be directly used as features for various downstream tasks, such as structure prediction, function annotation, and protein engineering, making it a fundamental and powerful tool in computational biology.

Figure 2 The Transformer-model architecture
Figure 13. The Transformer-model architecture
Figure 3 Feature extraction overview of ProtT5
Figure 14. Feature extraction overview of ProtT5

The subsequent classifier, named igemTJModel, utilizes a sophisticated multi-modal neural network architecture to process these features. It integrates 1D convolutional layers (1D-CNN) to identify local amino acid motifs, bidirectional LSTM (Bi-LSTM) layers to capture long-range contextual dependencies across the sequence, and an attention mechanism to dynamically weigh the importance of specific residues, significantly enhancing model interpretability. Regularization strategies like Layer Normalization, Dropout, and residual connections are incorporated to ensure training stability and prevent overfitting.

Figure 4 The igemTJmodel architecture
Figure 15. The igemTJmodel architecture
Evaluation Metrics and Results

The following are commonly used evaluation formulas for binary classification models and our evaluation results.

  1. Accuracy (ACC) — Measures the proportion of correct predictions among all predictions.
    $$ {ACC} = \frac{\text{TP} + \text{TN}}{\text{TP} + \text{TN} + \text{FP} + \text{FN}} $$
  2. Matthews Correlation Coefficient (MCC) — A balanced measure that considers all four confusion matrix categories, reliable even when classes are of very different sizes.
    $$\text{MCC} = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}}$$
  3. Precision — Measures the proportion of correctly predicted positive observations among all predicted positives.
    $$\text{Precision} = \frac{TP}{TP + FP}$$
  4. Area Under the ROC Curve (AUC) — The ROC curve plots the True Positive Rate (TPR) against the False Positive Rate (FPR). The AUC measures the model's ranking ability.
    $$\text{TPR} = \frac{TP}{TP + FN}, \quad \text{FPR} = \frac{FP}{FP + TN}$$
    $$ \displaystyle AUC = \int_{0}^{1} TPR(FPR) \, dFPR $$
  5. Area Under the Precision-Recall Curve (AUPR / AP) — Especially useful for imbalanced datasets.
    $$Recall = \frac{TP}{TP + FN}, \quad Precision = \frac{TP}{TP + FP}$$
    $$ \displaystyle AUPR = \int_{0}^{1} Precision(Recall) \, dRecall $$
Figure 5 Evaluation results
Figure 16. Evaluation results

Comparison with existing predictor

Figure 6 Comparison with existing predictors.
Figure 17. Comparison with existing predictors.

The main advantage of the AFP-igemTJ2025 model lies in its achievement of extremely high accuracy, specificity, and AUC on an imbalanced dataset, while maintaining high MCC and precision. This makes the model particularly valuable in applications requiring high reliability and low false positive rates, such as protein function prediction in bioinformatics. Although its sensitivity is relatively low, comprehensive metrics indicate that the model outperforms other compared models in overall classification performance. When deploying the model, it is recommended to balance sensitivity and specificity according to specific needs, for example, by adjusting the classification threshold to optimize sensitivity.

Application in our project

Based on the integrated results of the dry lab predictive model and the wet lab ice crystal analysis, a certain consistency between the two can be clearly observed: the predictive model identified DeNovo2 and DeNovo4 as antifreeze proteins with an extremely high probability exceeding 0.99, while DeNovo6 was excluded with a low probability of less than 0.1. Correspondingly, the wet lab mean ice crystal grain size analysis showed that the ice crystal sizes in the DeNovo2 and DeNovo4 treatment groups were significantly smaller than those in the wild-type control group, clearly demonstrating antifreeze activity that inhibits ice crystal growth. Compared to DeNovo2 and DeNovo4, the ice crystal size in the DeNovo6 treatment group was closer to that of the control group, exhibiting weaker antifreeze activity. Although the predicted probability of DeNovo6 being an antifreeze protein is very low, the mean ice crystal grain size analysis chart indicates that it still possesses some antifreeze activity. This reflects the difference in sensitivity between the computational model and wet lab experiments.

Figure 7 Prediction of the dry-lab designed proteins DeNovo2, DeNovo4, and DeNovo6 against wet-lab mean ice crystal size analysis.
Figure 18. Prediction of the dry-lab designed proteins DeNovo2, DeNovo4, and DeNovo6 against wet-lab mean ice crystal size analysis.

ProtT5, as an evolutionarily informed pre-trained protein language model, excels at extracting "structure-sensitive" features from sequences that are crucial for the overall folding and stable three-dimensional structure of proteins. However, antifreeze activity is a highly specific "function" that relies not only on the overall protein fold but more precisely on an "Ice Binding Surface (IBS)" formed by specific amino acid residues on its surface that precisely complements ice crystal planes (such as primary prism or secondary prism planes). In the case of DeNovo6, ProtT5 likely accurately captured its overall structural features and, based on this, judged that it "does not belong to a typical AFP fold category present in the training set," thus assigning a very low prediction probability. However, the weak activity detected by wet lab experiments suggests that the sequence of DeNovo6 may have incidentally formed a minor, localized surface patch with weak structure-function correlation, which coincidentally possesses some ice-binding capability. The AFP dataset used to train the model consists of strongly effective AFPs with clearly defined functions, whose features are very prominent. Consequently, the model learned the patterns of "typical AFPs," but for these ambiguous, weakly functional "atypical" sequences, it rejects them due to their lack of prominent features.

This illustrates a common challenge in functional prediction models: the model learns the "typical patterns of known functions," while experiments measure the "real physico-chemical effects." ProtT5 excellently accomplished its task – distinguishing DeNovo6 from typical AFPs. Meanwhile, the wet lab experiments revealed a more nuanced fact: even a protein deemed "atypical" by the structural model may possess certain biological functions due to its local, weak physico-chemical properties. This finding is not contradictory but rather highlights the necessity of integrating dry and wet lab approaches – computational models are used for efficiently screening high-potential candidates, while highly sensitive experiments are employed to capture those interesting "edge cases" that the model might miss.

In our protein design process, we do not rely solely on prediction probabilities for selection but instead integrate numerous indicators. For more details, please refer to the "Protein Design" module.These mutually corroborating results not only effectively validate the accuracy and reliability of our model but also fully demonstrate its practical value for the rapid and precise screening of potential antifreeze proteins, providing robust support for subsequent rational protein design and functional research.

Web application
Figure 8 AFP-igemTJ2025 web application
Figure 19. AFP-igemTJ2025 web application

A user-friendly web application is built on Streamlit for real-time prediction and hypothesis testing by researchers. This tool provides a powerful, high-performance, and interpretable platform for accelerating AFP discovery and analysis.

Discussion

The core advantage of this antifreeze protein prediction model lies in its construction of an end-to-end deep learning framework based on state-of-the-art protein language models. It skillfully leverages the ProtT5-XL model to extract deep evolutionary and structural features from protein sequences, moving beyond traditional methods that rely on manually designed features. In its architecture, the model integrates convolutional neural networks to capture local patterns, bidirectional long short-term memory networks to understand sequence context, and an attention mechanism to focus on critical regions, forming a powerful multi-modal network. This design enables excellent handling of imbalanced datasets while providing multidimensional evaluation metrics such as AUC and MCC, ensuring high prediction accuracy and robustness. Furthermore, through an integrated Streamlit application, it offers user-friendly interaction, supports threshold adjustment and batch sequence analysis, ultimately transforming cutting-edge artificial intelligence technology into a convenient, efficient, and reliable next-generation AFP discovery tool for biologists.

Looking forward, there are several promising directions to further enhance the model's capabilities. First, while ProtT5 excels at capturing evolutionary and fold-level information, the antifreeze activity is ultimately determined by the specific physicochemical properties of the Ice Binding Surface (IBS). Integrating dedicated IBS prediction modules or explicit structural features (e.g., from AlphaFold2 predictions) could directly model this structure-function relationship, potentially improving the detection of atypical or de novo AFPs with weak but specific ice-binding patches, as hinted by the case of DeNovo6. Second, expanding the training dataset to include more diverse and weakly active AFPs could help the model learn a broader functional spectrum. Finally, exploring explainable AI (XAI) techniques to better visualize the decisive residues for prediction would greatly aid in protein engineering efforts, transforming the model from a black-box predictor into a design guide. These future developments will bridge the gap between sequence-based prediction and mechanistic understanding, pushing the boundary of computational AFP discovery.

Supplementary instruction

For more methods on installation and usage, please visit the README at https://gitlab.igem.org/2025/software-tools/tianjin. Here we demonstrates how to use the page interactions to predict antifreeze protein probability after locally installing and training our model.

Demonstration Video

Molecular Dynamics Simulation

In this module, we used molecular dynamics simulation to simulate the effect of the antifreeze protein we constructed and designed in ice-water mixtures. We used molecular dynamics simulation to predict the action process of our designed antifreeze protein in ice-water mixtures and evaluated the thermal stability and residue flexibility of the protein.

In our communication with Professor Zhang Lei, we learned that the antifreeze activity of AFP can be more intuitively characterized by the F3/F4 model. Therefore, after testing the antifreeze performance of our antifreeze protein using a wet experiment, we used the F4 program obtained from Professor Zhang Lei's research group to evaluate the inhibitory effect of the antifreeze protein on the growth rate of ice crystals, hoping to explore new antifreeze characteristics of the antifreeze protein.

Relevant parameters

All MD simulations were performed using the GROMACS software package and the CHARMM 27 all-atom force field implemented on the LINUX architecture. We first dehydrated the antifreeze protein and then placed it in the ice box we constructed using the TIP4P model.

In order to neutralize the charge of the system, Na+ and Cl- were added to the system. The distance between all protein atoms and the box edge was equal to 1.0 nm. Each system was minimized for 50,000 steps using the steepest descent method. Nonbonded interactions were described using the Lennard-Jones potential with a cutoff radius of 12 Å. The switching function was used to smoothly cut off the van der Waals and electrostatic potentials at the cutoff distance to enhance energy conservation.

The switching process began at 10.0 Å. Long-range electrostatic interactions were calculated using the particle mesh Ewald (PME) method. The time step was set to 1 fs. Before modeling, the ice box containing the antifreeze protein was equilibrated for 100 ps at both temperature (NVT) and pressure (NPT) at 200 K and atmospheric pressure.

After equilibration, the components were integrated. Following integration, molecular dynamics simulations were performed at 200 K and atmospheric pressure with a step size of 2 fs for 500,000 simulation steps, for a total of 10 ns.

RMSD and RMSF Analyses

In molecular dynamics simulation, RMSD (Root Mean Square Deviation) is the core indicator to characterize the degree of conformational change of molecules/systems. Its essence is to quantitatively describe the stability, fluctuation amplitude or change trend of molecular conformation by calculating the "difference in atomic coordinates between the target structure and the reference structure during the simulation process". The smaller the RMSD value, the closer the protein structure is to the reference state and the more stable the overall conformation is; otherwise, it indicates that the protein has undergone significant overall conformational changes. The root mean square deviation (RMSD) of the selected element relative to its reference value is defined as

$$ \mathrm{RMSD}=\sqrt{\frac{1}{N} \sum_{i=1}^{N}\left(r_{i}-r_{0}\right)^{2}} $$

where ri represents the element position at time I and ro is the reference value.

In molecular dynamics simulation, RMSF (Root Mean Square Fluctuation) is the core indicator to characterize the dynamic flexibility (or degree of fluctuation) of a single atom or residue (such as an amino acid residue in a protein) in a molecular system. The larger the RMSF value, the more intense the movement of the residue in the region and the higher the flexibility; otherwise, the more rigid the region is. The root mean square fluctuation (RMSF) of the selected element relative to its average value is defined as

$$ \mathrm{RMSF}=\sqrt{\frac{1}{N} \sum_{i=1}^{N}\left(r_{i}-\langle r\rangle\right)^{2}} $$

where ri represents the element position at time I and r represents the average value.

Principle of the F4 code:

The four-body order parameter (F4) characterizes the structural arrangement of water molecules, determining whether they are in a liquid or hydrated state, and thus characterizing the tetrahedral network structure of water molecules. This can be used to analyze ice crystal growth, melting, or phase transitions.

F4 measures the dihedral angle formed by the outermost hydrogen and oxygen atoms of two adjacent water molecules within a system. The calculation formula is as follows:

$$ F_4 = \frac{1}{n}\sum_{i=1}^{n} \cos(3\varphi_i) $$

where n is the total number of water-water pairs; ∅i represents the torsion angle between H and O. We use the values of the F4 order parameter curve to characterize the growth rate of hydrates in different systems. The F4 values for liquid water and ice molecules are −0.04 and −0.4. Therefore, we can use the difference in the decrease in the F4 value to assess ice growth. A smaller decrease indicates less ice and more water; a larger decrease indicates more ice and less water, which in turn indicates the ice-inhibiting activity of our antifreeze protein.

Practical Workflow

After running a molecular dynamics simulation, we first processed the trajectory file and exported it to the GRO format.

This can be done directly using the gmx trjconv command:

gmx trjconv -f md.xtc -s md.tpr -o traj.gro -pbc whole

Select 0 and press Enter to generate traj.gro.

CalF4.exe
On Windows, open a CMD command window, switch to the working directory, and enter the command:

CalF4.exe -f traj.gro -a OW

Press Enter to start the calculation. After the calculation is complete, a text file, F4.txt, will be generated. Open it in Notepad and you will see the frame number in the first column and the F4 value in the second column. We then used Oringin to perform a fitting plot and observe the decrease in the F4 value. We believe that the smaller the decrease in the fitted curve, the slower the ice crystal growth rate and the better the antifreeze effect of the antifreeze protein.

Hydrogen Bond Analysis

After running the simulation, we hope to explore the reasons for the improved antifreeze effect of the antifreeze protein after the helix replacement and our DeNovo design. We used AFP11 and NEW11-9 (NEW series design, see engineering) in this study. Our literature research revealed that hydrogen bonds are an important factor affecting the antifreeze effect of antifreeze proteins. Most AFPs can exert their antifreeze activity by breaking hydrogen bonds between water molecules, or by forming hydrogen bonds with water to prevent ice formation. Therefore, we used the gmx hbond program included in gromacs and the hydrogen bond identification program in VMD to observe changes in hydrogen bonds. gmx hbond can calculate: the number of hydrogen bonds in the system changes over time; the donor-acceptor distance and angle distribution of hydrogen bonds, etc. The default criteria for hydrogen bonds in GROMACS are: donor-acceptor distance ≤ 0.35 nm; donor-hydrogen bond-acceptor angle ≥ 150°.

Results
Comparison of AFP11 and NEW11-9
AFP dataset curation pipeline overview

Figure 20. Comparison of RMSD between AFP11 (left) and NEW11-9 (right)
AFP dataset curation pipeline overview

Figure 21. Comparison of RMSF of AFP11 (left) and NEW11-9 (right)

Through the comparison of RMSD and RMSF, we can see that NEW11-9 has a more stable structure than AFP11, and the helix we replaced is at about residue 570. It can be seen that the residues of NEW11-9 are more rigid, indicating that the replaced helix on NEW11-9 forms more stable hydrogen bonds with ice suface and exerts antifreeze activity.

AFP dataset curation pipeline overview

Figure 22. Comparison of the number of hydrogen bonds between AFP11 and NEW11-9

As shown in the figure, the NEW11-9 system exhibits fewer hydrogen bonds than AFP11.

AFP dataset curation pipeline overview

Figure 23. Comparison of F4 values of AFP11 (red) and NEW11-9 (black).

Our fitted curves show that the decrease in the F4 of NEW11-9 is smaller than that of AFP11. This suggests that the antifreeze activity of AFP11 after the helical modification is stronger, which was confirmed in our wet lab experiment.

AFP dataset curation pipeline overview

Figure 24. Wet lab characterization results of AFP11 and NEW11-9
Comparison of Definitely DeNovo1 and DeNovo3
AFP dataset curation pipeline overview

Figure 25. RMSD of DeNovo 1 (left) and DeNovo 3 (right)
AFP dataset curation pipeline overview

Figure 26. RMSF of DeNovo 1 (left) and DeNovo 3 (right)

At the same time, the RMSF we designed from scratch shows that the residues of our designed antifreeze protein DeNovo3 are more rigid than those of DeNovo1, indicating that the hydrogen bonds formed by DeNovo3 with ice suface are more stable and the antifreeze activity is stronger.

AFP dataset curation pipeline overview

Figure 27. Schematic diagram of hydrogen bond formation between De Novo1 (left) and DeNovo3 (right) in an ice-water system.

It is clearly visible from the figure that DeNovo1 forms more hydrogen bonds than DeNovo3.

AFP dataset curation pipeline overview

Figure 28. Comparison of the number of hydrogen bonds between De Novo1 and DeNovo3

As shown in the figure, DeNovo1 has a higher number of hydrogen bonds than DeNovo3.

F3/F4 scatter fitting curve:
AFP dataset curation pipeline overview

Figure 29. F4 curves of DeNovo1 and DeNovo3

The F4 curves of DeNovo1 and DeNovo3 indicate that both have weak antifreeze activity. Furthermore, the F4 mechanism suggests that DeNovo3 has stronger antifreeze activity.

AFP dataset curation pipeline overview

Figure 30. Wet lab characterization results of DeNovo1 and DeNovo3

However, our simulated proteins DeNovo1 and 3 ultimately failed to achieve the expected results in the experimental phase, whereas control groups 2 and 4—which employed identical design strategies—yielded outcomes consistent with simulation predictions.

Conclusion: From the above figure showing the number of hydrogen bonds of each antifreeze protein, proteins with more stable structures, including those with structural rigidity and fewer hydrogen bonds, are more likely to exhibit antifreeze activity. This evidence demonstrates that they bind more tightly to ice, and the hydrophobic regions of antifreeze proteins play an indispensable role—a finding consistent with the literature.

Short Peptide Synthesis

Construction and Cleaning of the Dataset

Figure 1 Data processing workflow for cold-tolerant proteins
Figure 31. Data processing workflow for cold-tolerant proteins
We obtained 33,882 sequences and species data by searching for keywords such as "Antifreeze protein" in UniProt. We conducted an initial screening of the data, applying filters based on species and sequence length. The filtering process was divided into pre-filtering and precise filtering.

Pre-filtering within the group was performed using the k-mer algorithm. Taking two sequences as an example, after generating k-mers, the Jaccard similarity was calculated.
$$ K_1 = \{ kmer_{1,1}, kmer_{1,2}, \ldots, kmer_{1,m} \},\quad K_2 = \{ kmer_{2,1}, kmer_{2,2}, \ldots, kmer_{2,n} \} $$
$$ J(\text{Sequence}_1, \text{Sequence}_2) = J(K_1, K_2) = \frac{|K_1 \cap K_2|}{|K_1 \cup K_2|} $$
This simplifies the complex sequence alignment problem into a set operation. The k-mer method is first used for rapid screening, followed by precise alignment only on sequences with high similarity. Then, the Bio.Align module is employed for global sequence alignment.


After obtaining 7,178 data entries, we continued to clean the data. We used the CKSAAP algorithm from AFP-LSE for data cleaning, aiming to remove non-AFP protein data. Here, we applied the principles of the CKSAAP algorithm, drawing inspiration from the AFP-LSE model, and defined the amino acid group:
AA = {A, R, N, D, C, Q, E, G, H, I, L, K, M, F, P, S, T, W, Y, V} (20 amino acids in total; other amino acids are not included in the statistics).

The CKSAAP feature vector V is a 400-dimensional vector (v₁, v₂, ..., v₄₀₀), where vᵢ represents the frequency of the i-th amino acid pair (e.g., A-A, A-R, ..., V-V) appearing with a spacing of 8 positions in the sequence.

The team's research found that the full version of AFP-LSE mandatorily depends on the TensorFlow deep learning framework, which is its primary limitation. Based on this, the team developed a lightweight model that is more user-friendly for Windows systems.

The mathematical principle of the lightweight model is based on the assumption that the CKSAAP feature distribution of AFPs has greater variance, enabling binary classification through a variance threshold.

$$ \sigma^2 = \frac{1}{400} \sum_{i=1}^{400} (v_i - \mu)^2 $$
For the results, we only retained data with a confidence level above 95%, resulting in 6,557 entries.


Identification and Analysis of Repetitive Motifs

Next, we drew inspiration from HHrepID to screen for repetitive motifs. Protein sequences containing intrinsic repeat units exhibit significantly higher local self-similarity compared to random sequences.

The model quantifies this self-similarity using normalized scores from local sequence alignments. Given a protein sequence S (with length L), the detection process is as follows:

  1. Step 1: Candidate Motif Generation
    Starting from the N-terminus of the sequence, a series of candidate repetitive motifs Mk of increasing lengths are systematically generated.
    $$ M_k = S[1:k],\quad k \in [k_{\min}, k_{\max}] $$
  2. Step 2: Self-Alignment and Score Calculation
    For each candidate motif Mk, the Smith-Waterman local alignment algorithm is used to align it against the full length of sequence S, calculating an optimal alignment score A(S,Mk). The scoring function is as follows:
    $$ A(S, M_k) = \max_{\text{alignments}} \left( \sum \text{(match score)} - \sum \text{(mismatch penalty)} - \sum \text{(gap penalty)} \right) $$
  3. Step 3: Significance Threshold Determination
    To distinguish true repetitive signals from random background noise, an empirical significance score threshold θ is set. A candidate motif is identified as a valid repetitive motif if and only if its alignment score exceeds this threshold.
    $$ \text{Repeat Motif} = \{ M_k \mid A(S, M_k) > \theta \} $$

It was specified that the repetitive motifs must consist of the 20 standard amino acids. A total of 5,164 repetitive motifs were successfully screened.

Phylogenetic Analysis and Selection of Characteristic Sequences

Next, we proceeded to construct an evolutionary tree. We first performed a phylogenetic analysis of the sequences using MEGA software. In initial attempts, we exported a distance matrix, hoping to screen for similar AFP proteins based on genetic distance. However, this method proved unsatisfactory, as the distance matrix failed to clearly reflect the evolutionary relationships between sequences, resulting in ambiguous classification boundaries and making it difficult to effectively distinguish between functionally similar subtypes.

Figure 2 Classification based on distance matrix
Figure 32. Classification based on distance matrix

After discussion, the team concluded that relying solely on genetic distance for screening was overly dependent on pre-set thresholds and could not adequately capture the topological evolutionary information between sequences, potentially missing key phylogenetic signals. Consequently, we adjusted our strategy, shifting focus to the overall phylogenetic tree constructed. By analyzing the clustering patterns of leaf nodes and the branching structure, we could more intuitively identify protein groups with a shared evolutionary history.

Figure.3 Statistical Analysis of Amino Acid Frequency
Figure 33. Statistical Analysis of Amino Acid Frequency

We selected 19 subtrees from them, constructed and exported their NWK files. After text processing, the files were converted to FASTA format and uploaded to WebLogo for visualization.

Figure.4 Partial Repeat Motif Visualisation via WebLogo
Figure 34. Partial Repeat Motif Visualisation via WebLogo

The WebLogo plots illustrate the amino acid conservation within the repetitive motifs. The screened short peptides will be further validated in subsequent experiments.

Chatbot

Introduction

The communication with the iGEM team from Zhejiang University at CCIC greatly inspired us. We realized that apart from fields closely related to experiments such as protein prediction, artificial intelligence can also assist our HP work. We envisioned a Chatbot, similar to the common general language models (LLM) like DeepSeek and ChatGPT on the market, capable of precisely answering questions related to our team's project raised by users. In this year's project, we mainly applied the Chatbot to the external publicity of the iGEM team, so that others could fully understand our project.

AI agent

An AI agent refers to an intelligent agent system with the ability to think actively and take actions. It can understand requirements, plan goals, call tools and execute tasks through large models.

User - AI Agent - LLM

User-provided information is referred to as the user prompt, which is typically bundled with the system prompt and sent to the LLM. The system prompt for general LLMs is usually fixed and invisible, failing to meet project requirements. However, AI agents enable developers to customize system prompts. These customized prompts are then bundled with user prompts and sent to the LLM upon receiving them. Customizable system prompts empower us to define the “role” of the AI large model according to our project needs, including positioning, response format, output constraints, and more.

Figure 1. User - AI Agent - LLM Workflow Comparison
Figure 35. User - AI Agent - LLM Workflow Comparison
Tool - AI Agent - LLM

Although the prompt issue has been resolved, the LLM essentially can only understand language and generate text, and cannot actively call tools to solve problems. For example, by customizing the System Prompt, we hope that the LLM can only answer questions based on the project wiki web page content we provide, but the LLM cannot actively call and read the web page. At this time, the AI agent can inform the LLM of the tools it possesses, such as web browsing plugins, drawing plugins, and their relevant information in a way that the large model can understand(This content is usually directly included in the System Prompt, or expressed in the form of Function calling). After receiving this information, the LLM can return a request to call the tool to AI Agent, and AI Agent will return the result after calling the tool, and then the LLM will generate the final result to return to the user.

Figure 2. Tool - AI Agent – LLM Workflow
Figure 36. Tool - AI Agent – LLM Workflow
Implementation

We used Coze to build our Chatbot that incorporates an AI agent. After continuous testing and improvement, we have determined the following system prompt

# character
You are an information assistant specifically serving the 2025 Synthetic Biology Project of the iGEM team of Tianjin University. Your name is "Kunpeng Project Assistant". Your main responsibility is to provide clear and accurate project introductions and scientific popularization to others based on the official and verified materials provided by the team, while strictly maintaining the scientific rigor of the project. You must be able to accurately and quickly answer questions based on the given url web page links:https://2025.igem.wiki/tianjin/description,https://2025.igem.wiki/tianjin/implementation,https://2025.igem.wiki/tianjin/contribution,https://2025.igem.wiki/tianjin/notebook Answer in the language the user asks in.
 

## target:
Enable users to clearly understand the content within the URL, namely the work of the IGEM team from Tianjin University.
 

## skill:
1,When receiving a user's question, the first step is to determine whether the intention is related to understanding the content of the URL.

If it is relevant, immediately call the link reading and image understanding plugins to conduct a comprehensive and detailed search within the existing URL; if relevant questions and answers are found, reply in the following format, and if necessary, call the TreeMind diagram plugin to explain to the user.
2,If no relevant questions and answers are found, politely refuse to answer.
3,If it is not relevant, only then allow the use of the kimi plugin for a simple response. Do not expand on it and do not follow a certain format for answering.

## Restrictions -
The content must be output strictly in the prescribed format and must not deviate from the given framework requirements.

If no relevant answers can be found in the given URL, do not make assumptions. Just reply that you don't know.
When using the Kimi plugin, do not diverge. Simply provide a brief answer.
When answering questions for the user, use the first person.
The content of Tianjin University iGEM includes wet experiments, dry experiments, HP, etc.
The answer should be as concise as possible, and must be less than 600 words. Even if the user requests a detailed or complex explanation, please comply.
Use a more standardized numbered list marking method to avoid garbled text due to inconsistent platform grammar.
Use standard written punctuation, do not use bold, and do not add dots before each numbered list item.

We have selected the following plugins for the AI agent.

Figure.3 Plugins for the AI agent

Figure 37. Plugins for the AI agent

Used respectively for web browsing, image understanding, mind mapping creation, and answering non-project-related questions. After conducting tests and comparisons, we chose Doubao-1.6-thinking-250715 as the LLM for the Chatbot.

Deployment

We deployed the Chatbot on the popular social platform WeChat used by Chinese people, embedding it in the WeChat official account (a public account information service platform), and hosting the private message responses of the official account. The public can ask questions to the Chatbot through private messages to learn about the relevant content of iGEMTianjin 2025.

Figure.4 Deploying a Chatbot on the WeChat Platform

Figure 38. Deploying a Chatbot on the WeChat Platform

References

[1] Yeh Y, Feeney R E. Antifreeze proteins: structures and mechanisms of function[J]. Chemical Reviews, 1996, 96(2): 601-618.
[2] O'Neil K T, DeGrado W F. A thermodynamic scale for the helix-forming tendencies of the commonly occurring amino acids[J]. Science, 1990, 250(4981): 646-651.
[3] Zhang N, Du Y T, Yao P Q, et al. Synergistic effect of hyperactive antifreeze protein on inhibition of gas-hydrate growth by hydrophobic and hydrophilic groups[J]. The Journal of Physical Chemistry B, 2023, 127(49): 10469-10477.
[4] Cui S L, Zhang W J, Shui X G, et al. Revealing the effect of threonine on the binding ability of antifreeze proteins with ice crystals by free-energy calculations[J]. Chemical Journal of Chinese Universities-Chinese, 2022, 43(3): 97-103.
[5] Zhang X, Yang J, Tian Y, et al. Precise de novo design principle of antifreeze peptides[J]. Journal of the American Chemical Society, 2025, 147(21): 17682-17688.
[6] Watson J L, Juergens D, Bennett N R, et al. De novo design of protein structure and function with RFdiffusion[J]. Nature, 2023, 620(7976): 1089-1100.
[7] Elnaggar A, Heinzinger M, Dallago C, et al. ProtTrans: Toward understanding the language of life through self-supervised learning[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2022, 44(10): 7112-7127.
[8] Wu J, Liu Y, Zhu Y, et al. Improving antifreeze proteins prediction with protein language models and hybrid feature extraction networks[J]. IEEE/ACM Transactions on Computational Biology and Bioinformatics, 2024, 21(6): 2349-2358.
[9] Sonnhammer E L, Eddy S R, Durbin R. Pfam: a comprehensive database of protein domain families based on seed alignments[J]. Proteins, 1997, 28(3): 405-420.
[10] Kandaswamy K K, Chou K C, Martinetz T, et al. AFP-Pred: A random forest approach for predicting antifreeze proteins from sequence-derived properties[J]. Journal of Theoretical Biology, 2011, 270(1): 56-62.
[11] Mondal S, Pai P P. Chou's pseudo amino acid composition improves sequence-based antifreeze protein prediction[J]. Journal of Theoretical Biology, 2014, 356: 30-35.
[12] He X, Han K, Hu J, et al. TargetFreeze: Identifying antifreeze proteins via a combination of weights using sequence evolutionary information and pseudo amino acid composition[J]. The Journal of Membrane Biology, 2015, 248(6): 1005-1014.
[13] Pratiwi R, Malik A A, Schaduangrat N, et al. CryoProtect: A web server for classifying antifreeze proteins from nonantifreeze proteins[J]. Journal of Chemistry, 2017, 2017: 9861752.
[14] Usman M, Khan S, Lee J A. AFP-LSE: Antifreeze proteins prediction using latent space encoding of composition of k-spaced amino acid pairs[J]. Scientific Reports, 2020, 10: 7197.
[15] Khan A, Uddin J, Ali F, et al. Prediction of antifreeze proteins using machine learning[J]. Scientific Reports, 2022, 12: 20672.
[16] Jiang W T, Yang F J, Chen X, et al. Molecular simulation-based research on antifreeze peptides: advances and perspectives[J]. Journal of Future Foods, 2022, 2-3: 203-212.
[17] Kundu S, Roy D. Structural analysis of Ca²⁺ dependent and Ca²⁺ independent type II antifreeze proteins: A comparative molecular dynamics simulation study[J]. Journal of Molecular Graphics and Modelling, 2012, 38: 211-219.
[18] Maddah M, Maddah M, Peyvandi K. Investigation on structural properties of winter flounder antifreeze protein in interaction with clathrate hydrate by molecular dynamics simulation[J]. The Journal of Chemical Thermodynamics, 2021, 152: 106267.
[19] Wang M, Ma A, Wang H, et al. Atomic molecular dynamics simulation advances of de novo-designed proteins[J]. Quarterly Reviews of Biophysics, 2024, 57: e14.
[20] Patel S N, Graether S P. Structures and ice-binding faces of the alanine-rich type I antifreeze proteins[J]. Biochemistry and Cell Biology, 2010, 88(2): 223-229.
[21] Kuiper M J, et al. The biological function of an insect antifreeze protein simulated by molecular dynamics[J]. eLife, 2015, 4: e05142.
[22] Smith T F, Waterman M S. Identification of common molecular subsequences[J]. Journal of Molecular Biology, 1981, 147(1): 195-197.
[23] Usman M, Khan S, Lee J A. AFP-LSE: Antifreeze proteins prediction using latent space encoding of composition of k-spaced amino acid pairs[J]. Scientific Reports, 2020, 10: 7197.
[24] Rabiner L R. A tutorial on hidden Markov models and selected applications in speech recognition[J]. Proceedings of the IEEE, 2002, 77(2): 257-286.
[25] Biegert A, Söding J. De novo identification of highly diverged protein repeats by probabilistic consistency[J]. Bioinformatics, 2008, 24(6): 807-814.
[26] Tamura K, et al. MEGA4: molecular evolutionary genetics analysis (MEGA) software version 4.0[J]. Molecular Biology and Evolution, 2007, 24(8): 1596-1599.
[27] Crooks G E, et al. WebLogo: a sequence logo generator[J]. Genome Research, 2004, 14(6): 1188-1190.