AFP-igemTJ2025 is an end-to-end computational framework for predicting antifreeze proteins (AFPs) directly from sequence data. It employs a two-stage architecture that combines a pre-trained protein language model for feature extraction with a bespoke deep learning classifier for functional annotation.
We constructed a dataset comprising 920 AFPs and 9493 non-AFPs. A keyword search of the UniProtKB database (up to January 24, 2025) returned 6589 AFP candidates. After manual inspection, the sequences were clustered with CD-HIT so that the maximal pairwise sequence identity was ≤40%, yielding a set of 920 unique AFPs. The negative dataset was derived from 9493 seed proteins of Pfam protein families that are not associated with antifreeze proteins; this negative set is widely used for evaluating the performance of AFP prediction methods.
Balanced dataset: The dataset was divided into training and test sets. 644 AFPs and 644 non-AFPs were randomly selected as positive and negative samples to form the training set, and the remaining 276 AFPs and 8849 non-AFPs were designated as the test set.
Imbalanced dataset: To further validate predictive performance and facilitate subsequent research, we also introduced an imbalanced dataset, in which 70% of the AFPs and non-AFPs were used for training and the remaining 30% were held out for independent testing.
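As an illustration of the balanced split, the following minimal Python sketch draws 644 random positives and negatives for training and keeps the rest for testing (the placeholder ID lists and the fixed random seed are assumptions; our actual preprocessing scripts may differ).

import random

# Placeholder identifier lists standing in for the parsed FASTA entries.
afp_ids = [f"AFP_{i}" for i in range(920)]
non_afp_ids = [f"NEG_{i}" for i in range(9493)]

random.seed(42)
train_pos = random.sample(afp_ids, 644)
train_neg = random.sample(non_afp_ids, 644)

# Everything not drawn for training forms the test set (276 AFPs, 8849 non-AFPs).
test_pos = sorted(set(afp_ids) - set(train_pos))
test_neg = sorted(set(non_afp_ids) - set(train_neg))
print(len(train_pos), len(train_neg), len(test_pos), len(test_neg))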
Prediction Model Architecture
The model is built upon the ProtT5-XL-UniRef50 encoder, a
transformer-based model pre-trained on the UniRef50
database, to generate high-dimensional feature representations (1024
dimensions per amino acid) from input
protein sequences. These rich embeddings capture complex semantic
and
syntactic patterns within the protein
sequences.
The architecture of the ProtT5 model
essentially involves the complete
transfer and adaptation of the highly successful T5 (Text-To-Text
Transfer Transformer) model from natural language processing to the
field of protein sequence analysis. Its core is an encoder-decoder
framework based on the Transformer architecture. The model learns
the
"language rules" of proteins through self-supervised pre-training on
large databases comprising billions of protein sequences, using a
method
called "Span Corruption." Specifically, the model randomly masks
continuous segments of the input amino acid sequence, and the
decoder
reconstructs these original masked amino acids based on the sequence
context understood by the encoder. This process forces the model to
deeply grasp the complex biochemical, structural, and evolutionary
relationships between amino acids. Ultimately, the power of ProtT5
lies
in its ability to generate a highly context-aware vector
representation
for each amino acid in its input. These representations, enriched
with
extensive biological knowledge, can be directly used as features for
various downstream tasks, such as structure prediction, function
annotation, and protein engineering, making it a fundamental and
powerful tool in computational biology.
Figure 13. The Transformer model architecture
Figure 14. Feature extraction overview of ProtT5
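For reference, a minimal sketch of extracting per-residue ProtT5 embeddings with the Hugging Face transformers library is shown below (the checkpoint name Rostlab/prot_t5_xl_uniref50 is the public release; the example sequence and device handling are illustrative, and our actual extraction pipeline may differ).

import re
import torch
from transformers import T5EncoderModel, T5Tokenizer

device = "cuda" if torch.cuda.is_available() else "cpu"
tokenizer = T5Tokenizer.from_pretrained("Rostlab/prot_t5_xl_uniref50", do_lower_case=False)
model = T5EncoderModel.from_pretrained("Rostlab/prot_t5_xl_uniref50").to(device).eval()

sequence = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"  # placeholder sequence
# ProtT5 expects space-separated residues; rare amino acids are mapped to X.
prepared = " ".join(re.sub(r"[UZOB]", "X", sequence))
inputs = tokenizer(prepared, return_tensors="pt").to(device)

with torch.no_grad():
    hidden = model(**inputs).last_hidden_state      # shape: (1, L + 1, 1024)
per_residue = hidden[0, : len(sequence)]            # drop the trailing </s> token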
The subsequent classifier, named igemTJModel, utilizes a
sophisticated multi-modal neural network
architecture to process these features. It integrates 1D
convolutional
layers (1D-CNN) to identify local
amino acid motifs, bidirectional LSTM (Bi-LSTM) layers to capture
long-range contextual dependencies across
the sequence, and an attention mechanism to dynamically weigh the
importance of specific residues,
significantly enhancing model interpretability. Regularization
strategies like Layer Normalization, Dropout,
and residual connections are incorporated to ensure training
stability
and prevent overfitting.
Figure 15. The igemTJModel architecture
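Below is a minimal PyTorch sketch of the kind of CNN + Bi-LSTM + attention classifier described above (layer sizes, the attention formulation, and the pooling scheme are illustrative assumptions, and the residual connections of the real igemTJModel are omitted for brevity).

import torch
import torch.nn as nn

class AFPClassifierSketch(nn.Module):
    # Illustrative classifier over per-residue ProtT5 embeddings of shape (B, L, 1024).
    def __init__(self, embed_dim=1024, conv_channels=256, lstm_hidden=128, dropout=0.3):
        super().__init__()
        self.conv = nn.Conv1d(embed_dim, conv_channels, kernel_size=3, padding=1)  # local motifs
        self.norm = nn.LayerNorm(conv_channels)
        self.lstm = nn.LSTM(conv_channels, lstm_hidden, batch_first=True, bidirectional=True)
        self.attn = nn.Linear(2 * lstm_hidden, 1)    # per-residue attention scores
        self.dropout = nn.Dropout(dropout)
        self.head = nn.Linear(2 * lstm_hidden, 1)    # binary logit: AFP vs non-AFP

    def forward(self, x):                            # x: (B, L, 1024)
        h = self.conv(x.transpose(1, 2)).transpose(1, 2)
        h = torch.relu(self.norm(h))
        h, _ = self.lstm(h)                          # (B, L, 2 * lstm_hidden)
        weights = torch.softmax(self.attn(h), dim=1) # attention over residues
        pooled = (weights * h).sum(dim=1)            # attention-weighted pooling
        return self.head(self.dropout(pooled)).squeeze(-1)

# Example: score a batch of two length-50 sequences with random embeddings.
probs = torch.sigmoid(AFPClassifierSketch()(torch.randn(2, 50, 1024)))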
Evaluation Metrics and Results
The following are commonly used evaluation
formulas for binary classification models and our evaluation
results.
- Accuracy (ACC): the proportion of correct predictions among all predictions.
$$
\text{ACC} = \frac{TP + TN}{TP + TN + FP + FN}
$$
- Matthews Correlation Coefficient (MCC): a balanced measure that takes all four confusion matrix categories into account and remains reliable even when the classes are of very different sizes.
$$
\text{MCC} = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}}
$$
- Precision: the proportion of correctly predicted positive observations among all predicted positives.
$$
\text{Precision} = \frac{TP}{TP + FP}
$$
- Area Under the ROC Curve (AUC): the ROC curve plots the True Positive Rate (TPR) against the False Positive Rate (FPR); the AUC measures the model's ranking ability.
$$
\text{TPR} = \frac{TP}{TP + FN}, \quad \text{FPR} = \frac{FP}{FP + TN}
$$
$$
\text{AUC} = \int_{0}^{1} TPR(FPR) \, dFPR
$$
- Area Under the Precision-Recall Curve (AUPR / AP): especially informative for imbalanced datasets.
$$
\text{Recall} = \frac{TP}{TP + FN}, \quad \text{Precision} = \frac{TP}{TP + FP}
$$
$$
\text{AUPR} = \int_{0}^{1} Precision(Recall) \, dRecall
$$
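A minimal sketch of computing these metrics with scikit-learn is shown below (the label and score arrays are placeholders; our evaluation scripts and thresholding may differ).

import numpy as np
from sklearn.metrics import (accuracy_score, matthews_corrcoef, precision_score,
                             roc_auc_score, average_precision_score)

# Placeholder ground-truth labels and predicted probabilities.
y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])
y_score = np.array([0.92, 0.10, 0.67, 0.81, 0.40, 0.05, 0.55, 0.30])
y_pred = (y_score >= 0.5).astype(int)   # default classification threshold

print("ACC:", accuracy_score(y_true, y_pred))
print("MCC:", matthews_corrcoef(y_true, y_pred))
print("Precision:", precision_score(y_true, y_pred))
print("AUC:", roc_auc_score(y_true, y_score))
print("AUPR:", average_precision_score(y_true, y_score))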
Figure 16. Evaluation results
Comparison with existing predictors
Figure 17. Comparison with existing
predictors.
The main advantage of the AFP-igemTJ2025
model
lies in its achievement of extremely high accuracy, specificity, and
AUC on an imbalanced dataset, while maintaining high MCC and
precision. This makes the model particularly valuable in
applications
requiring high reliability and low false positive rates, such as
protein function prediction in bioinformatics. Although its
sensitivity is relatively low, comprehensive metrics indicate that
the
model outperforms other compared models in overall classification
performance. When deploying the model, it is recommended to balance
sensitivity and specificity according to specific needs, for
example,
by adjusting the classification threshold to optimize
sensitivity.
Application in our project
Based on the integrated results of the dry lab predictive model and the wet lab ice crystal analysis, a clear consistency between the two can be observed: the predictive model identified DeNovo2 and DeNovo4 as antifreeze proteins with an extremely high probability exceeding 0.99, while DeNovo6 was excluded
with a low probability of less than 0.1. Correspondingly, the wet
lab
mean ice crystal grain size analysis showed that the ice crystal
sizes
in the DeNovo2 and DeNovo4 treatment groups were significantly
smaller
than those in the wild-type control group, clearly demonstrating
antifreeze activity that inhibits ice crystal growth. Compared to
DeNovo2 and DeNovo4, the ice crystal size in the DeNovo6 treatment
group was closer to that of the control group, exhibiting weaker
antifreeze activity. Although the predicted probability of DeNovo6
being an antifreeze protein is very low, the mean ice crystal grain
size analysis chart indicates that it still possesses some
antifreeze
activity. This reflects the difference in sensitivity between the
computational model and wet lab experiments.
Figure 18. Prediction of the dry-lab
designed proteins DeNovo2, DeNovo4, and DeNovo6 against wet-lab
mean
ice crystal size analysis.
ProtT5, as an evolutionarily informed
pre-trained protein language model, excels at extracting
"structure-sensitive" features from sequences that are crucial for
the
overall folding and stable three-dimensional structure of proteins.
However, antifreeze activity is a highly specific "function" that
relies not only on the overall protein fold but more precisely on an
"Ice Binding Surface (IBS)" formed by specific amino acid residues
on
its surface that precisely complements ice crystal planes (such as
primary prism or secondary prism planes). In the case of DeNovo6,
ProtT5 likely accurately captured its overall structural features
and,
based on this, judged that it
"does not belong to a typical AFP fold category present in the training set,"
thus assigning a very low prediction probability. However, the weak
activity detected by wet lab experiments suggests that the sequence
of
DeNovo6 may have incidentally formed a minor, localized surface
patch
with weak structure-function correlation, which coincidentally
possesses some ice-binding capability. The AFP dataset used to train
the model consists of strongly effective AFPs with clearly defined
functions, whose features are very prominent. Consequently, the
model
learned the patterns of "typical AFPs," but for these ambiguous,
weakly functional "atypical" sequences, it rejects them due to their
lack of prominent features.
This illustrates a common challenge in
functional prediction models: the model learns the
"typical patterns of known functions," while experiments measure the
"real physico-chemical effects." ProtT5 excellently accomplished its
task – distinguishing DeNovo6 from typical AFPs. Meanwhile, the wet
lab experiments revealed a more nuanced fact: even a protein deemed
"atypical" by the structural model may possess certain biological
functions due to its local, weak physico-chemical properties. This
finding is not contradictory but rather highlights the necessity of
integrating dry and wet lab approaches – computational models are
used
for efficiently screening high-potential candidates, while highly
sensitive experiments are employed to capture those interesting
"edge cases" that the model might miss.
In our protein design process, we do not
rely
solely on prediction probabilities for selection but instead
integrate
numerous indicators. For more details, please refer to the
"Protein Design" module.These mutually corroborating results not
only
effectively validate the accuracy and reliability of our model but
also fully demonstrate its practical value for the rapid and precise
screening of potential antifreeze proteins, providing robust support
for subsequent rational protein design and functional research.
Web application
Figure 19. AFP-igemTJ2025 web
application
A user-friendly web application is built on Streamlit
for real-time prediction and hypothesis
testing by researchers. This tool provides a powerful,
high-performance, and interpretable platform for
accelerating AFP discovery and analysis.
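A minimal Streamlit sketch of this kind of front end is shown below (predict_afp_probability is a hypothetical placeholder for the trained model; the deployed application additionally supports threshold adjustment and batch analysis).

import streamlit as st

def predict_afp_probability(sequence: str) -> float:
    # Hypothetical placeholder; the real app calls the trained igemTJModel here.
    return 0.5

st.title("AFP-igemTJ2025")
seq = st.text_area("Paste a protein sequence (single-letter amino acid codes):")
threshold = st.slider("Classification threshold", 0.0, 1.0, 0.5)

if st.button("Predict") and seq:
    prob = predict_afp_probability(seq.strip())
    st.write(f"Predicted AFP probability: {prob:.3f}")
    st.write("Prediction:", "AFP" if prob >= threshold else "non-AFP")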
Discussion
The core advantage of this antifreeze
protein
prediction model lies in its construction of an end-to-end deep
learning framework based on state-of-the-art protein language
models.
It skillfully leverages the ProtT5-XL model to extract deep
evolutionary and structural features from protein sequences, moving
beyond traditional methods that rely on manually designed features.
In
its architecture, the model integrates convolutional neural networks
to capture local patterns, bidirectional long short-term memory
networks to understand sequence context, and an attention mechanism
to
focus on critical regions, forming a powerful multi-modal network.
This design enables excellent handling of imbalanced datasets while
providing multidimensional evaluation metrics such as AUC and MCC,
ensuring high prediction accuracy and robustness. Furthermore,
through
an integrated Streamlit application, it offers user-friendly
interaction, supports threshold adjustment and batch sequence
analysis, ultimately transforming cutting-edge artificial
intelligence
technology into a convenient, efficient, and reliable
next-generation
AFP discovery tool for biologists.
Looking forward, there are several promising
directions to further enhance the model's capabilities. First, while
ProtT5 excels at capturing evolutionary and fold-level information,
the antifreeze activity is ultimately determined by the specific
physicochemical properties of the Ice Binding Surface (IBS).
Integrating dedicated IBS prediction modules or explicit structural
features (e.g., from AlphaFold2 predictions) could directly model
this
structure-function relationship, potentially improving the detection
of atypical or de novo AFPs with weak but specific ice-binding
patches, as hinted by the case of DeNovo6. Second, expanding the
training dataset to include more diverse and weakly active AFPs
could
help the model learn a broader functional spectrum. Finally,
exploring
explainable AI (XAI) techniques to better visualize the decisive
residues for prediction would greatly aid in protein engineering
efforts, transforming the model from a black-box predictor into a
design guide. These future developments will bridge the gap between
sequence-based prediction and mechanistic understanding, pushing the
boundary of computational AFP discovery.
Supplementary instruction
For more methods on installation and usage, please visit the README
at
https://gitlab.igem.org/2025/software-tools/tianjin.
Here we demonstrate how to use the page interactions to predict antifreeze protein probability after locally installing and training our model.
Demonstration Video
Molecular Dynamics Simulation
In this module, we used molecular dynamics (MD) simulation to model the behavior of our constructed and designed antifreeze proteins in ice-water mixtures, and to evaluate their thermal stability and residue flexibility.
In our communication with Professor Zhang Lei, we learned that the antifreeze activity of AFPs can be characterized more intuitively by the F3/F4 order parameters. Therefore, after testing the antifreeze performance of our proteins in wet lab experiments, we used the F4 program provided by Professor Zhang Lei's research group to evaluate how strongly each antifreeze protein inhibits the growth rate of ice crystals, hoping to uncover new antifreeze characteristics of these proteins.
Relevant parameters
All MD simulations were performed using the GROMACS software
package and the CHARMM 27 all-atom force field implemented on the
LINUX architecture. We first dehydrated the antifreeze protein and
then placed it in the ice box we constructed using the TIP4P
model.
In order to neutralize the charge of the system, Na+ and
Cl- were added to the system. The distance between all protein
atoms and the box edge was equal to 1.0 nm. Each system was
minimized for 50,000 steps using the steepest descent method.
Nonbonded interactions were described using the Lennard-Jones
potential with a cutoff radius of 12 Å. The switching function was
used to smoothly cut off the van der Waals and electrostatic
potentials at the cutoff distance to enhance energy conservation.
The switching process began at 10.0 Å. Long-range electrostatic
interactions were calculated using the particle mesh Ewald (PME)
method. The time step was set to 1 fs. Before modeling, the ice
box containing the antifreeze protein was equilibrated for 100 ps
at both temperature (NVT) and pressure (NPT) at 200 K and
atmospheric pressure.
After equilibration, production molecular dynamics simulations were performed at 200 K and atmospheric pressure with a 2 fs time step for 500,000 simulation steps, for a total of 10 ns.
RMSD and RMSF Analyses
In molecular dynamics simulation, RMSD (Root Mean Square
Deviation) is the core indicator to characterize the degree of
conformational change of molecules/systems. Its essence is to
quantitatively describe the stability, fluctuation amplitude or
change trend of molecular conformation by calculating the
"difference in atomic coordinates between the target structure and the reference structure during the simulation process".
The smaller the RMSD value, the closer the protein structure is to
the reference state and the more stable the overall conformation
is; otherwise, it indicates that the protein has undergone
significant overall conformational changes. The root mean square
deviation (RMSD) of the selected element relative to its reference
value is defined as
$$
\mathrm{RMSD}=\sqrt{\frac{1}{N} \sum_{i=1}^{N}\left(r_{i}-r_{0}\right)^{2}}
$$
where $r_i$ is the position of the selected element at time $i$ and $r_0$ is the reference position.
In molecular dynamics simulation, RMSF (Root Mean Square
Fluctuation) is the core indicator to characterize the dynamic
flexibility (or degree of fluctuation) of a single atom or residue
(such as an amino acid residue in a protein) in a molecular
system. The larger the RMSF value, the more intense the movement
of the residue in the region and the higher the flexibility;
otherwise, the more rigid the region is. The root mean square
fluctuation (RMSF) of the selected element relative to its average
value is defined as
$$
\mathrm{RMSF}=\sqrt{\frac{1}{N} \sum_{i=1}^{N}\left(r_{i}-\langle r\rangle\right)^{2}}
$$
where $r_i$ is the position of the selected element at time $i$ and $\langle r\rangle$ is its time-averaged position.
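As an illustration, RMSD and RMSF can be computed from the GROMACS trajectory with MDAnalysis as sketched below (the file names md.tpr/md.xtc follow this section's workflow; the backbone and C-alpha selections are assumptions, and our own analysis may equally have used gmx rms and gmx rmsf).

import MDAnalysis as mda
from MDAnalysis.analysis import align, rms

u = mda.Universe("md.tpr", "md.xtc")

# RMSD of the protein backbone relative to the first frame.
rmsd = rms.RMSD(u, u, select="backbone").run()
# rmsd.results.rmsd columns: frame index, time (ps), RMSD (Angstrom)

# Align the trajectory on the backbone, then compute per-residue RMSF on C-alpha atoms.
align.AlignTraj(u, u, select="backbone", in_memory=True).run()
calphas = u.select_atoms("name CA")
rmsf = rms.RMSF(calphas).run()
for resid, value in zip(calphas.resids, rmsf.results.rmsf):
    print(resid, value)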
Principle of the F4 code:
The four-body order parameter (F4) characterizes the structural
arrangement of water molecules, determining whether they are in a
liquid or hydrated state, and thus characterizing the tetrahedral
network structure of water molecules. This can be used to analyze
ice crystal growth, melting, or phase transitions.
F4 measures the dihedral angle formed by the outermost hydrogen
and oxygen atoms of two adjacent water molecules within a system.
The calculation formula is as follows:
$$
F_4 = \frac{1}{n}\sum_{i=1}^{n} \cos(3\varphi_i)
$$
where n is the total number of water-water pairs and $\varphi_i$ is the torsion angle between the outermost H and O atoms of pair $i$. We use the F4 order parameter curve to characterize the growth rate of hydrates in different systems. The F4 values of liquid water and ice are approximately −0.04 and −0.4, respectively. Therefore, the magnitude of the decrease in F4 can be used to assess ice growth: a smaller decrease indicates less ice and more liquid water, whereas a larger decrease indicates more ice and less liquid water, which in turn reflects the ice-growth-inhibiting activity of our antifreeze protein.
Practical Workflow
After running a molecular dynamics simulation, we first processed
the trajectory file and exported it to the GRO format.
This can be done directly using the gmx trjconv command:
gmx trjconv -f md.xtc -s md.tpr -o traj.gro -pbc whole
Select 0 and press Enter to generate traj.gro.
CalF4.exe
On Windows, open a CMD command window, switch to the working
directory, and enter the command:
CalF4.exe -f traj.gro -a OW
Press Enter to start the calculation. After the calculation is
complete, a text file, F4.txt, will be generated. Open it in
Notepad and you will see the frame number in the first column and
the F4 value in the second column. We then used Origin to fit and plot the data and observe the decrease in the F4 value. We consider that the smaller the decrease in the fitted curve, the slower the ice crystal growth rate and the better the antifreeze effect of the antifreeze protein.
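For readers without Origin, an equivalent Python sketch for reading F4.txt and fitting a linear trend is shown below (the two-column whitespace-separated format follows the description above; the linear fit is an assumption about the fitting model).

import numpy as np
import matplotlib.pyplot as plt

frames, f4 = np.loadtxt("F4.txt", unpack=True)   # column 1: frame, column 2: F4 value
slope, intercept = np.polyfit(frames, f4, 1)     # linear fit; slope < 0 means F4 decreases

plt.plot(frames, f4, ".", label="F4")
plt.plot(frames, slope * frames + intercept, label=f"fit, slope = {slope:.2e}")
plt.xlabel("Frame")
plt.ylabel("F4")
plt.legend()
plt.savefig("f4_fit.png", dpi=300)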
Hydrogen Bond Analysis
After running the simulation, we hope to explore the reasons for
the improved antifreeze effect of the antifreeze protein after the
helix replacement and our DeNovo design. We used AFP11 and
NEW11-9 (NEW series design, see engineering) in this study. Our
literature research revealed that hydrogen bonds are an important
factor affecting the antifreeze effect of antifreeze proteins.
Most AFPs can exert their antifreeze activity by breaking hydrogen
bonds between water molecules, or by forming hydrogen bonds with
water to prevent ice formation. Therefore, we used the gmx hbond program included in GROMACS and the hydrogen bond identification tool in VMD to observe changes in hydrogen bonding. gmx hbond can calculate how the number of hydrogen bonds in the system changes over time, as well as the donor-acceptor distance and angle distributions. The default hydrogen bond criteria in GROMACS are a donor-acceptor distance ≤ 0.35 nm and a donor-hydrogen···acceptor angle ≥ 150°.
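A small sketch for plotting the hydrogen bond count output by gmx hbond is given below (the file name hbnum.xvg, as produced by the -num option, is an assumption about how we ran the command).

import numpy as np
import matplotlib.pyplot as plt

# Read a GROMACS .xvg file, skipping comment (#) and metadata (@) lines.
rows = [line.split() for line in open("hbnum.xvg")
        if line.strip() and not line.startswith(("#", "@"))]
data = np.array(rows, dtype=float)
time_ps, n_hbonds = data[:, 0], data[:, 1]   # second column: number of hydrogen bonds

plt.plot(time_ps, n_hbonds)
plt.xlabel("Time (ps)")
plt.ylabel("Number of hydrogen bonds")
plt.savefig("hbonds.png", dpi=300)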
Results
Comparison of AFP11 and NEW11-9
Figure 20. Comparison of RMSD between AFP11 (left) and NEW11-9 (right)
Figure 21. Comparison of RMSF of AFP11 (left) and NEW11-9 (right)
The comparison of RMSD and RMSF shows that NEW11-9 has a more stable structure than AFP11; the helix we replaced lies at around residue 570. The residues of NEW11-9 are more rigid, suggesting that the replaced helix on NEW11-9 forms more stable hydrogen bonds with the ice surface and exerts antifreeze activity.
Figure 22. Comparison of the number of hydrogen bonds between
AFP11 and NEW11-9
As shown in the figure, the NEW11-9 system exhibits fewer hydrogen
bonds than AFP11.
Figure 23. Comparison of F4 values of AFP11 (red) and NEW11-9
(black).
Our fitted curves show that the decrease in the F4 value of NEW11-9 is smaller than that of AFP11. This suggests that the antifreeze activity of the helix-modified variant NEW11-9 is stronger than that of AFP11, which was confirmed in our wet lab experiments.
Figure 24. Wet lab characterization results of AFP11 and NEW11-9
Comparison of DeNovo1 and DeNovo3
Figure 25. RMSD of DeNovo 1 (left) and
DeNovo 3 (right)
Figure 26. RMSF of DeNovo 1 (left) and
DeNovo 3 (right)
At the same time, the RMSF profiles of our de novo designed proteins show that the residues of DeNovo3 are more rigid than those of DeNovo1, suggesting that the hydrogen bonds formed by DeNovo3 with the ice surface are more stable and that its antifreeze activity is stronger.
Figure 27. Schematic diagram of hydrogen bond formation between De
Novo1 (left) and DeNovo3 (right) in an ice-water system.
It is clearly visible from the figure that DeNovo1 forms more
hydrogen bonds than DeNovo3.
Figure 28. Comparison of the number of hydrogen bonds between De
Novo1 and DeNovo3
As shown in the figure, DeNovo1 has a higher number of hydrogen
bonds than DeNovo3.
F3/F4 scatter fitting curve:
Figure 29. F4 curves of DeNovo1 and DeNovo3
The F4 curves of DeNovo1 and DeNovo3 indicate that both have weak antifreeze activity; by the F4 criterion, DeNovo3 has the stronger antifreeze activity of the two.
Figure 30. Wet lab characterization results of DeNovo1 and DeNovo3
However, the simulated proteins DeNovo1 and DeNovo3 ultimately failed to achieve the expected results in the experimental phase, whereas DeNovo2 and DeNovo4, which employed identical design strategies, yielded outcomes consistent with the simulation predictions.
Conclusion: From the above figures showing the number of hydrogen bonds of each antifreeze protein, proteins with more stable structures, namely those with greater structural rigidity and fewer hydrogen bonds, are more likely to exhibit antifreeze activity. This suggests that they bind more tightly to ice and that the hydrophobic regions of antifreeze proteins play an indispensable role, a finding consistent with the literature.
Short Peptide Synthesis
Construction and Cleaning of the
Dataset
Figure 31. Data processing workflow
for
cold-tolerant proteins
We obtained 33,882 sequences and species data by searching for
keywords such as "Antifreeze protein" in
UniProt. We conducted an initial screening of the data, applying
filters based on species and sequence length.
The filtering process was divided into pre-filtering and precise
filtering.
Pre-filtering within the group was performed using the k-mer
algorithm. Taking two sequences as an example,
after generating k-mers, the Jaccard similarity was
calculated.
$$
K_1 = \{ kmer_{1,1}, kmer_{1,2}, \ldots, kmer_{1,m} \},\quad
K_2
=
\{ kmer_{2,1}, kmer_{2,2}, \ldots,
kmer_{2,n} \}
$$
$$
J(\text{Sequence}_1, \text{Sequence}_2) = J(K_1, K_2) =
\frac{|K_1
\cap K_2|}{|K_1 \cup K_2|}
$$
This simplifies the complex sequence alignment problem into a
set
operation. The k-mer method is first used
for rapid screening, followed by precise alignment only on
sequences
with high similarity. Then, the Bio.Align
module is employed for global sequence alignment.
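A minimal sketch of this two-step filter (k-mer Jaccard pre-screening followed by Biopython global alignment) is shown below; the k-mer length of 3 and the Jaccard cutoff of 0.5 are illustrative assumptions rather than our production thresholds.

from Bio import Align

def kmer_set(seq, k=3):
    # Set of overlapping k-mers of a sequence.
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def jaccard(seq1, seq2, k=3):
    a, b = kmer_set(seq1, k), kmer_set(seq2, k)
    return len(a & b) / len(a | b)

aligner = Align.PairwiseAligner()
aligner.mode = "global"

seq1 = "MALTCALLLALAAASAT"   # placeholder sequences
seq2 = "MALTCALLLALSAASAA"

# Step 1: fast k-mer pre-screen; Step 2: precise alignment only for similar pairs.
if jaccard(seq1, seq2) > 0.5:
    print("Jaccard:", jaccard(seq1, seq2), "alignment score:", aligner.score(seq1, seq2))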
After obtaining 7,178 data entries, we continued to clean the data using the CKSAAP (composition of k-spaced amino acid pairs) algorithm from AFP-LSE, aiming to remove non-AFP protein data. Following the AFP-LSE model, we defined the amino acid alphabet:
AA = {A, R, N, D, C, Q, E, G, H, I, L, K, M, F, P, S, T,
W,
Y,
V} (20 amino acids in total; other
amino acids are not included in the statistics).
The CKSAAP feature vector V is a 400-dimensional vector (v₁, v₂,
..., v₄₀₀), where vᵢ represents the frequency
of the i-th amino acid pair (e.g., A-A, A-R, ..., V-V) appearing
with a spacing of 8 positions in the
sequence.
Our research found that the full version of AFP-LSE has a hard dependency on the TensorFlow deep learning framework, which is its main practical limitation. We therefore developed a lightweight model that is easier to use on Windows systems.
The mathematical principle of the lightweight model is based on
the
assumption that the CKSAAP feature
distribution of AFPs has greater variance, enabling binary
classification through a variance threshold.
$$
\sigma^2 = \frac{1}{400} \sum_{i=1}^{400} (v_i - \mu)^2
$$
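The sketch below illustrates both steps: building the 400-dimensional CKSAAP vector with spacing k = 8 and applying a variance threshold (the pair-indexing convention, the normalization, and the threshold value are illustrative assumptions, not the exact settings we used).

from itertools import product

AA = "ARNDCQEGHILKMFPSTWYV"
PAIRS = ["".join(p) for p in product(AA, repeat=2)]   # 400 amino acid pairs

def cksaap_vector(seq, k=8):
    # Frequency of each amino acid pair whose members are separated by k residues.
    counts = dict.fromkeys(PAIRS, 0)
    total = 0
    for i in range(len(seq) - k - 1):
        pair = seq[i] + seq[i + k + 1]
        if pair in counts:
            counts[pair] += 1
            total += 1
    return [c / total if total else 0.0 for c in counts.values()]

def is_afp_like(seq, threshold=1e-4):
    v = cksaap_vector(seq)
    mu = sum(v) / len(v)
    variance = sum((x - mu) ** 2 for x in v) / len(v)
    return variance > threshold    # higher CKSAAP variance -> classified as AFP-like

print(is_afp_like("DTASDAAAAAALTAANAKAAAELTAANAAAAAAATAR"))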
For the results, we only retained data with a confidence level
above
95%, resulting in 6,557 entries.
Identification and Analysis of
Repetitive
Motifs
Next, we drew inspiration from HHrepID to screen for repetitive
motifs. Protein sequences containing intrinsic
repeat units exhibit significantly higher local self-similarity
compared to random sequences.
The model quantifies this self-similarity using normalized
scores from local sequence alignments. Given a
protein sequence S (with length L), the detection process is as
follows:
- Step 1: Candidate Motif Generation. Starting from the N-terminus of the sequence, a series of candidate repetitive motifs $M_k$ of increasing lengths are systematically generated.
$$
M_k = S[1:k],\quad k \in [k_{\min}, k_{\max}]
$$
- Step 2: Self-Alignment and Score Calculation. For each candidate motif $M_k$, the Smith-Waterman local alignment algorithm is used to align it against the full length of sequence S, calculating an optimal alignment score $A(S, M_k)$. The scoring function is as follows:
$$
A(S, M_k) = \max_{\text{alignments}} \left( \sum \text{(match score)} - \sum \text{(mismatch penalty)} - \sum \text{(gap penalty)} \right)
$$
- Step 3: Significance Threshold Determination. To distinguish true repetitive signals from random background noise, an empirical significance score threshold $\theta$ is set. A candidate motif is identified as a valid repetitive motif if and only if its alignment score exceeds this threshold.
$$
\text{Repeat Motif} = \{ M_k \mid A(S, M_k) > \theta \}
$$
It was specified that the repetitive motifs must consist of the
20
standard amino acids. A total of 5,164
repetitive motifs were successfully screened.
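As an illustration of Steps 1-3, the sketch below enumerates N-terminal candidate motifs and scores them against the full sequence with Biopython's local pairwise aligner (the BLOSUM62 scoring, the gap penalties, the k range, and the threshold are illustrative assumptions).

from Bio import Align
from Bio.Align import substitution_matrices

aligner = Align.PairwiseAligner()
aligner.mode = "local"                      # Smith-Waterman-style local alignment
aligner.substitution_matrix = substitution_matrices.load("BLOSUM62")
aligner.open_gap_score = -10
aligner.extend_gap_score = -0.5

def repeat_motifs(sequence, k_min=4, k_max=20, theta=30.0):
    # Step 1: candidate prefixes M_k; Step 2: self-alignment score; Step 3: threshold.
    motifs = []
    for k in range(k_min, min(k_max, len(sequence)) + 1):
        motif = sequence[:k]                # M_k = S[1:k]
        score = aligner.score(sequence, motif)
        if score > theta:
            motifs.append((motif, score))
    return motifs

print(repeat_motifs("DTASDAAAAAALTAANAKAAAELTAANAAAAAAATAR"))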
Phylogenetic Analysis and Selection
of
Characteristic Sequences
Next, we proceeded to construct an evolutionary tree. We first
performed a phylogenetic analysis of the
sequences using MEGA software. In initial attempts, we exported
a
distance matrix, hoping to screen for
similar AFP proteins based on genetic distance. However, this
method
proved unsatisfactory, as the distance
matrix failed to clearly reflect the evolutionary relationships
between sequences, resulting in ambiguous
classification boundaries and making it difficult to effectively
distinguish between functionally similar
subtypes.
Figure 32. Classification based on
distance matrix
After discussion, the team concluded that relying solely on
genetic
distance for screening was overly
dependent on pre-set thresholds and could not adequately capture
the
topological evolutionary information
between sequences, potentially missing key phylogenetic signals.
Consequently, we adjusted our strategy,
shifting focus to the overall phylogenetic tree constructed. By
analyzing the clustering patterns of leaf
nodes and the branching structure, we could more intuitively
identify protein groups with a shared
evolutionary history.
Figure 33. Statistical Analysis of
Amino
Acid Frequency
We selected 19 subtrees, constructed and exported their Newick (NWK) files. After text processing, the leaf sequences of each subtree were collected into FASTA format and uploaded to WebLogo for visualization.
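A small sketch of this conversion step is shown below (the file names and the seq_by_id lookup dictionary are hypothetical; our actual text processing may differ).

from Bio import Phylo

tree = Phylo.read("subtree_01.nwk", "newick")
leaf_ids = [leaf.name for leaf in tree.get_terminals()]

# Hypothetical mapping from sequence ID to amino acid sequence,
# e.g. parsed beforehand from the cleaned UniProt FASTA file.
seq_by_id = {"A0A0A0XXXX": "MALTCALLLALAAASAT"}

with open("subtree_01.fasta", "w") as handle:
    for leaf_id in leaf_ids:
        if leaf_id in seq_by_id:
            handle.write(f">{leaf_id}\n{seq_by_id[leaf_id]}\n")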
Figure 34. Partial Repeat Motif
Visualisation via WebLogo
The WebLogo plots illustrate the amino acid conservation within
the
repetitive motifs. The screened short
peptides will be further validated in subsequent experiments.
Chatbot
Introduction
The communication with the iGEM team from Zhejiang University at
CCIC greatly inspired us. We realized that apart from fields
closely
related to experiments such as protein prediction, artificial
intelligence can also assist our HP work. We envisioned a Chatbot, similar to common general-purpose large language models (LLMs) such as DeepSeek and ChatGPT, capable of precisely answering users' questions about our team's project. In this year's project, we mainly applied the Chatbot to the team's external outreach, so that others could fully understand our project.
AI agent
An AI agent is an intelligent system with the ability to reason and act. Driven by a large model, it can understand requirements, plan goals, call tools, and execute tasks.
User - AI Agent - LLM
User-provided information is referred to as the user prompt, which is typically bundled with the system prompt and sent to the LLM. The system prompt of a general-purpose LLM is usually fixed and invisible, which does not meet our project requirements. AI agents, however, allow developers to customize the system prompt: when a user prompt is received, it is bundled with the customized system prompt and sent to the LLM. A customizable system prompt lets us define the “role” of the large model according to our project needs, including its positioning, response format, output constraints, and more.
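Conceptually, this bundling corresponds to the familiar chat-completion message structure, as in the hedged sketch below (we actually built the agent with Coze rather than calling an API directly; the openai client, the model name, and the prompt text here are purely illustrative).

from openai import OpenAI

client = OpenAI()   # illustrative; Coze handles this step for us in practice

SYSTEM_PROMPT = (
    "You are the Kunpeng Project Assistant for the iGEM Tianjin 2025 team. "
    "Answer only from the team's wiki pages and reply in the user's language."
)

def ask(user_question: str) -> str:
    # The agent bundles its customized system prompt with the user prompt.
    response = client.chat.completions.create(
        model="gpt-4o-mini",   # placeholder model name
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": user_question},
        ],
    )
    return response.choices[0].message.content

print(ask("What is the Tianjin 2025 project about?"))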
Figure 35. User - AI Agent - LLM
Workflow Comparison
Tool - AI Agent - LLM
Although the prompt issue has been resolved, the LLM by itself can only understand language and generate text; it cannot actively call tools to solve problems. For example, by customizing the system prompt we hope the LLM will answer questions based only on the project wiki pages we provide, but the LLM cannot actively fetch and read those web pages. Here the AI agent can inform the LLM of the tools it possesses, such as web browsing and drawing plugins, together with their relevant information, in a form the large model can understand (this content is usually included directly in the system prompt or expressed via function calling). After receiving this information, the LLM can return a tool-call request to the AI agent; the AI agent calls the tool and returns the result, and the LLM then generates the final answer for the user.
Figure 36. Tool - AI Agent – LLM
Workflow
Implementation
We used Coze to build a Chatbot that incorporates an AI agent. After continuous testing and improvement, we settled on the following system prompt:
# character
You are an information assistant specifically serving the 2025 Synthetic Biology Project of the iGEM team of Tianjin University. Your name is "Kunpeng Project Assistant". Your main responsibility is to provide clear and accurate project introductions and scientific popularization to others based on the official and verified materials provided by the team, while strictly maintaining the scientific rigor of the project. You must be able to accurately and quickly answer questions based on the given url web page links:https://2025.igem.wiki/tianjin/description,https://2025.igem.wiki/tianjin/implementation,https://2025.igem.wiki/tianjin/contribution,https://2025.igem.wiki/tianjin/notebook Answer in the language the user asks in.
## target:
Enable users to clearly understand the content within the URL, namely the work of the IGEM team from Tianjin University.
## skill:
1,When receiving a user's question, the first step is to determine whether the intention is related to understanding the content of the URL.
If it is relevant, immediately call the link reading and image understanding plugins to conduct a comprehensive and detailed search within the existing URL; if relevant questions and answers are found, reply in the following format, and if necessary, call the TreeMind diagram plugin to explain to the user.
2,If no relevant questions and answers are found, politely refuse to answer.
3,If it is not relevant, only then allow the use of the kimi plugin for a simple response. Do not expand on it and do not follow a certain format for answering.
## Restrictions -
The content must be output strictly in the prescribed format and must not deviate from the given framework requirements.
If no relevant answers can be found in the given URL, do not make assumptions. Just reply that you don't know.
When using the Kimi plugin, do not diverge. Simply provide a brief answer.
When answering questions for the user, use the first person.
The content of Tianjin University iGEM includes wet experiments, dry experiments, HP, etc.
The answer should be as concise as possible, and must be less than 600 words. Even if the user requests a detailed or complex explanation, please comply.
Use a more standardized numbered list marking method to avoid garbled text due to inconsistent platform grammar.
Use standard written punctuation, do not use bold, and do not add dots before each numbered list item.
We have selected the following plugins for the AI agent.
Figure 37. Plugins for the AI
agent
These plugins are used respectively for web browsing, image understanding, mind map creation, and answering non-project-related questions. After tests and comparisons, we chose Doubao-1.6-thinking-250715 as the LLM for the Chatbot.
Deployment
We deployed the Chatbot on WeChat, a social platform widely used in China, embedding it in our WeChat official account (a public-account information service) and letting it handle the account's private message replies. The public can ask the Chatbot questions via private messages to learn about the work of iGEM Tianjin 2025.
Figure 38. Deploying a Chatbot on
the
WeChat Platform
References
[1] Yeh Y, Feeney R E. Antifreeze proteins: structures and
mechanisms of function[J]. Chemical Reviews, 1996, 96(2): 601-618.
[2] O'Neil K T, DeGrado W F. A thermodynamic scale for the
helix-forming tendencies of the commonly occurring amino acids[J].
Science, 1990, 250(4981): 646-651.
[3] Zhang N, Du Y T, Yao P Q, et al. Synergistic effect of
hyperactive antifreeze protein on inhibition of gas-hydrate growth
by hydrophobic and hydrophilic groups[J]. The Journal of Physical
Chemistry B, 2023, 127(49): 10469-10477.
[4] Cui S L, Zhang W J, Shui X G, et al. Revealing the effect of
threonine on the binding ability of antifreeze proteins with ice
crystals by free-energy calculations[J]. Chemical Journal of Chinese
Universities-Chinese, 2022, 43(3): 97-103.
[5] Zhang X, Yang J, Tian Y, et al. Precise de novo design principle
of antifreeze peptides[J]. Journal of the American Chemical Society,
2025, 147(21): 17682-17688.
[6] Watson J L, Juergens D, Bennett N R, et al. De novo design of
protein structure and function with RFdiffusion[J]. Nature, 2023,
620(7976): 1089-1100.
[7] Elnaggar A, Heinzinger M, Dallago C, et al. ProtTrans: Toward
understanding the language of life through self-supervised
learning[J]. IEEE Transactions on Pattern Analysis and Machine
Intelligence, 2022, 44(10): 7112-7127.
[8] Wu J, Liu Y, Zhu Y, et al. Improving antifreeze proteins
prediction with protein language models and hybrid feature
extraction networks[J]. IEEE/ACM Transactions on Computational
Biology and Bioinformatics, 2024, 21(6): 2349-2358.
[9] Sonnhammer E L, Eddy S R, Durbin R. Pfam: a comprehensive
database of protein domain families based on seed alignments[J].
Proteins, 1997, 28(3): 405-420.
[10] Kandaswamy K K, Chou K C, Martinetz T, et al. AFP-Pred: A
random forest approach for predicting antifreeze proteins from
sequence-derived properties[J]. Journal of Theoretical Biology,
2011, 270(1): 56-62.
[11] Mondal S, Pai P P. Chou's pseudo amino acid composition
improves sequence-based antifreeze protein prediction[J]. Journal of
Theoretical Biology, 2014, 356: 30-35.
[12] He X, Han K, Hu J, et al. TargetFreeze: Identifying antifreeze
proteins via a combination of weights using sequence evolutionary
information and pseudo amino acid composition[J]. The Journal of
Membrane Biology, 2015, 248(6): 1005-1014.
[13] Pratiwi R, Malik A A, Schaduangrat N, et al. CryoProtect: A web
server for classifying antifreeze proteins from nonantifreeze
proteins[J]. Journal of Chemistry, 2017, 2017: 9861752.
[14] Usman M, Khan S, Lee J A. AFP-LSE: Antifreeze proteins
prediction using latent space encoding of composition of k-spaced
amino acid pairs[J]. Scientific Reports, 2020, 10: 7197.
[15] Khan A, Uddin J, Ali F, et al. Prediction of antifreeze
proteins using machine learning[J]. Scientific Reports, 2022, 12:
20672.
[16] Jiang W T, Yang F J, Chen X, et al. Molecular simulation-based
research on antifreeze peptides: advances and perspectives[J].
Journal of Future Foods, 2022, 2-3: 203-212.
[17] Kundu S, Roy D. Structural analysis of Ca²⁺ dependent and Ca²⁺
independent type II antifreeze proteins: A comparative molecular
dynamics simulation study[J]. Journal of Molecular Graphics and
Modelling, 2012, 38: 211-219.
[18] Maddah M, Maddah M, Peyvandi K. Investigation on structural
properties of winter flounder antifreeze protein in interaction with
clathrate hydrate by molecular dynamics simulation[J]. The Journal
of Chemical Thermodynamics, 2021, 152: 106267.
[19] Wang M, Ma A, Wang H, et al. Atomic molecular dynamics
simulation advances of de novo-designed proteins[J]. Quarterly
Reviews of Biophysics, 2024, 57: e14.
[20] Patel S N, Graether S P. Structures and ice-binding faces of
the alanine-rich type I antifreeze proteins[J]. Biochemistry and
Cell Biology, 2010, 88(2): 223-229.
[21] Kuiper M J, et al. The biological function of an insect
antifreeze protein simulated by molecular dynamics[J]. eLife, 2015,
4: e05142.
[22] Smith T F, Waterman M S. Identification of common molecular
subsequences[J]. Journal of Molecular Biology, 1981, 147(1):
195-197.
[23] Rabiner L R. A tutorial on hidden Markov models and selected applications in speech recognition[J]. Proceedings of the IEEE, 1989, 77(2): 257-286.
[24] Biegert A, Söding J. De novo identification of highly diverged protein repeats by probabilistic consistency[J]. Bioinformatics, 2008, 24(6): 807-814.
[25] Tamura K, et al. MEGA4: molecular evolutionary genetics analysis (MEGA) software version 4.0[J]. Molecular Biology and Evolution, 2007, 24(8): 1596-1599.
[26] Crooks G E, et al. WebLogo: a sequence logo generator[J]. Genome Research, 2004, 14(6): 1188-1190.