Model
Inspiration for the Research

During our lab experience, we came up with the idea of using KYNase (kynureninase) to target and treat rectal cancer. However, our professors pointed out that using an enzyme as a treatment has several problems:

(a) Enzymes may accidentally damage normal cells that rely on the same metabolic pathway.

(b) Enzymes produce no immune memory effect.

(c) Combining multiple enzymes may increase toxicity.

(d) An enzyme may act only on a specific organ rather than throughout the body.

Later, they introduced another possible approach: using antigenic peptides to treat cancer. Antigenic peptides are said to offer several benefits while overcoming some of the restrictions of KYNase:

(a) They can target cancer cells directly and accurately.

(b) They can activate T cells, forming long-term immune surveillance, increasing the efficiency of the response, and reducing the risk of recurrence.

(c) By targeting a specific HLA, we can develop treatments for specific cancers more effectively.

(d) They can travel throughout the body via the immune system.

We later found that developing a useful treatment experimentally involves several difficulties:

(a) Repeatedly changing the combination and sequence of amino acids to build a peptide effective against cancer is expensive.

(b) Different batches of synthetic peptides or cell culture conditions may cause fluctuations in the results.

(c) Arriving at a successful formulation by experiment alone is time-consuming.

(d) The experiments demand strict control of purity and accuracy.

So we decided to use computational models to assist the development of antigenic peptides.


Background and Existing Models
2.1 Current Research Status

In research on the TCR-MHC-antigen peptide complex, a number of representative advances have been made internationally in recent years. For example, in TCR-pMHC complex prediction, a study published in Frontiers in Immunology compared the performance of several docking platforms, including ClusPro, LightDock, ZDOCK, and HADDOCK, to determine the most suitable method for TCR-pMHC modeling. The results show that the HADDOCK platform performs best in modeling TCR-pMHC complexes, providing a reliable tool for subsequent research.

Another important development is the use of deep learning techniques to improve the prediction accuracy of neoantigens. In a study published in Nature Biotechnology, researchers analyzed tumor HLA-associated peptides by mass spectrometry and, combined with deep learning models, significantly improved the recognition accuracy of neoantigens. Through this method, researchers can accurately predict highly immunogenic peptides that bind patients' TCRs, providing new ideas for the design of tumor vaccines.

Professor Zeng Jianyang of the Institute for Interdisciplinary Information Sciences at Tsinghua University published a paper in Bioinformatics on predicting the binding of peptides to MHC class I molecules. In the development of cancer treatment vaccines, predicting which peptides bind to major histocompatibility complexes (MHCs) plays a crucial role. The authors propose ACME, a deep neural network prediction framework based on the attention mechanism. In ACME's network framework, the algorithm uses convolutional neural networks to learn features extracted from peptide and MHC sequences, and uses an attention mechanism to extract interpretable binding patterns. This makes the model highly interpretable and provides useful insights for analyzing the binding preferences of peptides to MHC.

In addition, some studies have explored the application of antibodies that bind TCR to MHC (TCR-mimic antibodies) in cancer treatment. These antibodies are designed to bind to MHC-antigen peptides in a TCR-like manner, thereby activating the patient's immune system. In this way, researchers can quickly develop therapeutic antibodies against specific tumor antigens, reducing the time from laboratory to clinical application.

Overall, research on the TCR-MHC-antigen peptide complex in immunotherapy has made important progress both domestically and internationally. Domestic research mainly focuses on predicting and designing antigen peptides through artificial intelligence and bioinformatics to improve the specificity and efficiency of immunotherapy. Foreign research is at the forefront of structural elucidation and clinical application of antigen peptide vaccines, and has advanced the field through extensive basic research and technical optimization. In the future, with further international cooperation and continued technical progress, research on the TCR-MHC-antigen peptide complex will continue to drive the development of personalized immunotherapy and precision tumor therapy.


2.2 Limitations of Existing Models
Limitations in Antigen Peptide-HLA Affinity Prediction:

There are currently nine mainstream methods for predicting antigen peptide-HLA affinity, but none of them is perfect; they share several limitations:

(a) Data limitations:

HLA is generally divided into two classes: HLA class I (HLA-I) and HLA class II (HLA-II). HLA-I is expressed on the surface of all nucleated cells, while HLA-II is expressed only on a small number of antigen-presenting cells. The datasets used by the nine methods above cover only HLA-I and do not consider HLA-II. As of November 2023, the biological science community has identified antigen peptides of 8 to 14 amino acids that can bind to HLA. However, when reading the papers on these nine methods, we found that not every method is applicable to peptides of 8 to 14 amino acids, which is related to when the papers were published: apart from NetMHCpan-BA, CombLib supports only peptides with a fixed length of 9, while the other methods support only peptides with lengths ranging from 8 to 11.

(b) Prediction Accuracy Limitations:

The evaluation metrics used for prediction accuracy are AUC, ACC, MCC, and F1. Although these methods achieve high predictive accuracy for 9-amino-acid peptides binding to HLA-I sequences, their predictive performance for peptides of other lengths (where prediction is possible at all) remains unsatisfactory. Biologically, this can be explained by the fact that 9-amino-acid peptides are the most likely to bind HLA alleles. A sketch of how these metrics are computed is given below.
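As a concrete illustration, here is a minimal sketch of how these four metrics can be computed with scikit-learn; the labels and scores below are illustrative placeholders, not results from any of the nine methods.

```python
# Sketch: computing AUC, ACC, MCC, and F1 for binary binding labels.
from sklearn.metrics import (roc_auc_score, accuracy_score,
                             matthews_corrcoef, f1_score)

y_true = [1, 0, 1, 1, 0, 0, 1, 0]                    # ground-truth binding labels
y_score = [0.9, 0.2, 0.7, 0.6, 0.4, 0.1, 0.8, 0.3]   # predicted probabilities
y_pred = [int(s >= 0.5) for s in y_score]            # binarize at 0.5

print("AUC:", roc_auc_score(y_true, y_score))        # AUC uses raw scores
print("ACC:", accuracy_score(y_true, y_pred))        # the rest use hard labels
print("MCC:", matthews_corrcoef(y_true, y_pred))
print("F1 :", f1_score(y_true, y_pred))
```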


(c) Model Generalizability Limitations:

As can be seen from the above, none of these nine methods can handle antigen peptide lengths that are absent from their training data, so the generalizability of these nine methods is poor.


Limitations in Antigen Peptide-TCR Affinity Prediction:

Currently, research on antigen peptide-TCR binding is considerably more challenging than antigen peptide-HLA binding, resulting in much smaller dataset sizes for peptide-TCR interactions. The earliest approaches for predicting peptide-TCR affinity relied on experimental methods such as tetramer analysis, combined with sequencing and scanning of tetramer-associated T-cell receptors to detect interactions between TCRs and antigen peptides. However, such experimental approaches are time-consuming, technically demanding, and costly. Beyond experimental techniques, existing computational methods for predicting peptide-TCR affinity can be broadly classified into two categories:

(a) Peptide-specific models, such as TCRGP, TCRex, and NetTCR-2.0;

(b) General models that do not restrict predictions to specific peptides but instead require known binding TCRs for training, including pMTnet, DLpTCR, ERGO, and TITAN.

Evidently, the first category of tools is limited to specific peptides. The second category represents general peptide-TCR binding prediction models. However, these approaches fail to identify peptides with very few known binding TCRs or peptides absent from the training data. In other words, their generalizability is weak and cannot extend to novel antigen peptides or exogenous peptides not represented in the training set. Yet, the ability to recognize such novel or exogenous peptide-TCR binding patterns is critical for immunological research and immunotherapy.


Antigen Immunogenicity Prediction Model Based on the Attention Mechanism
3.1 Hypothesis Development

Antigen peptide immunogenicity can be accurately predicted through the HLA-peptide-TCR ternary interaction, while the Transformer architecture is particularly effective at capturing long-range dependencies within amino acid sequences.

The design of tumor antigen peptide vaccines is fundamentally based on two key processes of the tumor immunity cycle: antigen presentation and T-cell receptor (TCR) recognition. These processes are driven by two essential binding events: antigen peptide-HLA binding and antigen peptide-TCR binding. The interaction between antigen peptides and HLA molecules is the critical step in antigen presentation, whereas binding to TCRs is the indispensable step for T-cell activation.

At present, mainstream approaches to predicting antigen immunogenicity rely heavily on machine learning models, particularly neural networks, to estimate the binding affinities between peptides and HLA or TCR alleles. However, these methods exhibit substantial limitations in predictive accuracy, generalizability, and adaptability across diverse datasets.

To address these challenges, we have designed and developed a Transformer-based deep learning model, the HLA-Pep-TCR Transformer, which can be applied to predict both antigen peptide-HLA affinity and antigen peptide-TCR affinity.

This model provides several significant advantages: high predictive performance, comprehensive datasets, broad applicability, and strong generalizability. Collectively, these strengths fill a critical gap in the field by offering researchers a robust and efficient tool for affinity prediction, thereby supporting more accurate and effective tumor antigen peptide vaccine design.


3.2 Model Foundation

The general structure of the Prediction Model is shown as Figure 3.2-1:

Figure 3.2-1 Prediction Model structure

In this model, we designed for the first time two models for predicting HLA-Pep-TCR affinity, based on Transformer and RetNet respectively. RetNet is a new architecture released by Microsoft Research Asia in July 2023, offering better performance, lower inference cost, and parallelizable training compared with the Transformer:

(a) HLA-Pep-TCR Transformer

(b) HLA-Pep-TCR RetNet

These two models can both be used to predict affinity scores for Pep-HLA and Pep-TCR, thereby determining the affinity of the predicted Pep-HLA and Pep-TCR pairs.

The input of the Prediction Model includes:

(a) Amino acid sequence for antigen peptide

(b) Amino acid sequence for HLA

(c) Amino acid sequence for TCR

The output of the Prediction Model includes:

(a) Predict Binding Score: the predicted probability value for the affinity of Pep-HLA and Pep-TCR.

(b) Contrib: a contribution matrix accumulated by the Prediction Model from previous runs.

We save Contrib into a static .npy file for direct loading. It is computed with the formula:

$$\mathrm{Contribution} = \mathrm{Frequency} \times \mathrm{Score}$$

Here, Contribution is the contribution value of an amino acid at a given position; Frequency is the frequency (or weight) with which the amino acid appears at that position; and Score is the Self-Attention or Self-Retention value of the amino acid at its position. Contrib is the matrix of dimension (batch_size, amino_acid_num, max_peplen) formed by these contribution values.
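The sketch below illustrates this computation and the .npy caching under the definitions above; the shapes and the elementwise product follow the formula, while the array contents and file name are illustrative.

```python
# Sketch: building and caching the Contrib matrix.
import numpy as np

batch_size, amino_acid_num, max_pep_len = 32, 21, 14   # illustrative sizes

# frequency[b, a, p]: weight of amino acid a appearing at peptide position p
frequency = np.random.rand(batch_size, amino_acid_num, max_pep_len)
# score[b, a, p]: Self-Attention (or Self-Retention) value at that position
score = np.random.rand(batch_size, amino_acid_num, max_pep_len)

# Contribution = Frequency * Score, elementwise
contrib = frequency * score        # (batch_size, amino_acid_num, max_pep_len)

np.save("contrib.npy", contrib)    # saved once ...
contrib = np.load("contrib.npy")   # ... then loaded directly afterwards
```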

(c) Pep Self-Attention/Pep Self-Retention: the Self-Attention matrix or Self-Retention matrix of the antigenic peptide.

(d) Concat Multi-Head Self-Attention/Multi-Scale Self-Retention: the Multi-Head Self-Attention matrix or Multi-Scale Self-Retention matrix of Pep-HLA or Pep-TCR.

We first introduce the HLA-Pep-TCR affinity prediction model based on the Transformer, shown as Figure 3.2-2.

Figure 3.2-2 HLA-Pep-TCR Transformer

The HLA-Pep-TCR Transformer model is mainly composed of four modules: the Embedding Block, the Encoder Block, the Decoder Block, and the Solution Block.

(a) Embedding Block (shown as Figure 3.2-3):

Figure 3.2-3 Embedding Block

In addition to the embedding encoding of the HLA, antigen peptide, and TCR sequences, we use absolute sinusoidal ("triangular") positional encoding to inject position information into the encoding:

$$PE_{(pos,\,2i)} = \sin\!\left(\frac{pos}{10000^{2i/d_{\mathrm{model}}}}\right), \qquad PE_{(pos,\,2i+1)} = \cos\!\left(\frac{pos}{10000^{2i/d_{\mathrm{model}}}}\right)$$

where pos is the position in the embedded sequence and i indexes the embedding dimension.

Meanwhile, the Embedding Block uses Dropout to enhance the robustness of the model and suppress over-fitting. A sketch of this block is given below.
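The following is a minimal sketch of such an Embedding Block, combining a learned amino acid embedding with the sinusoidal positional encoding above and Dropout; the vocabulary size, model width, and dropout rate are illustrative hyperparameters, not our exact settings.

```python
# Sketch: Embedding Block = token embedding + sinusoidal positions + Dropout.
import math
import torch
import torch.nn as nn

class EmbeddingBlock(nn.Module):
    def __init__(self, vocab_size=21, d_model=64, max_len=512, dropout=0.1):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.dropout = nn.Dropout(dropout)
        pe = torch.zeros(max_len, d_model)
        pos = torch.arange(max_len).unsqueeze(1).float()
        div = torch.exp(torch.arange(0, d_model, 2).float()
                        * (-math.log(10000.0) / d_model))
        pe[:, 0::2] = torch.sin(pos * div)   # even dimensions: sine
        pe[:, 1::2] = torch.cos(pos * div)   # odd dimensions: cosine
        self.register_buffer("pe", pe)       # fixed, not trained

    def forward(self, tokens):               # tokens: (batch, seq_len)
        x = self.embed(tokens) + self.pe[: tokens.size(1)]
        return self.dropout(x)
```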

(b) Encoder Block (shown as Figure 3.2-4):

Figure 3.2-4 Encoder block

The Encoder Block includes two sectors: the first is Multi-Head Self-Attention and the second is the FC layers. LayerNorm is applied between and after the two sectors for normalization, stabilizing the forward input distribution and accelerating convergence.

Multi-Head Self-Attention, used in the first sector, performs the Multi-Head Self-Attention calculation over the sequence vectors of HLA, antigenic peptide, and TCR, while masking the padded positions of each sequence to prevent them from misleading the model.

The formula for Self-Attention (Figure 3.2-5) is the standard scaled dot-product form:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{T}}{\sqrt{d_k}}\right)V$$

Figure 3.2-5 Self-Attention

Here, Q, K, and V each have a dimension of (batch_size, len_k, d_k); a runnable sketch of this masked computation is given below.
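A minimal sketch of this masked scaled dot-product attention follows; the padding-mask convention (True marks padded positions) is an assumption for illustration.

```python
# Sketch: scaled dot-product Self-Attention with a padding mask.
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(Q, K, V, pad_mask=None):
    d_k = Q.size(-1)
    scores = Q @ K.transpose(-2, -1) / d_k ** 0.5      # (batch, len_q, len_k)
    if pad_mask is not None:
        # padded positions get -inf so softmax assigns them ~zero weight
        scores = scores.masked_fill(pad_mask, float("-inf"))
    attn = F.softmax(scores, dim=-1)                   # Self-Attention Score Matrix
    return attn @ V, attn
```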

Then, we design Multi-Head Self-Attention as Figure 3.2-6:


Figure 3.2-6 Multi-Head Self-Attention

Multi-Head Self-Attention is equivalent to performing Self-Attention several times in parallel; this improves the predictive performance of the model, capturing richer features while reducing contingency and enhancing fault tolerance. Meanwhile, residual connections are applied to the results of the Multi-Head Self-Attention calculation to prevent vanishing gradients during back-propagation.

At this point, the model computes the Self-Attention Score Matrix:

$$\mathrm{Score} = \mathrm{softmax}\!\left(\frac{QK^{T}}{\sqrt{d_k}}\right)$$

It then applies the Multi-Head transformation to obtain the Multi-Head Self-Attention Score:

$$\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_h)\,W^{O}, \qquad \mathrm{head}_i = \mathrm{Attention}(QW_i^{Q}, KW_i^{K}, VW_i^{V})$$

The Multi-Head Self-Attention Score Matrix serves as an input to the mutation model, in which it plays an important role. A self-contained sketch of this sector follows.
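Below is a self-contained sketch of this sector: nine-head self-attention followed by the residual connection and LayerNorm. The width d_model = 63 is chosen only so that it divides evenly across nine heads, and the padding mask is omitted for brevity; both are illustrative choices.

```python
# Sketch: Multi-Head Self-Attention with residual connection + LayerNorm.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiHeadSelfAttention(nn.Module):
    def __init__(self, d_model=63, n_heads=9):
        super().__init__()
        assert d_model % n_heads == 0
        self.n_heads, self.d_k = n_heads, d_model // n_heads
        self.W_q = nn.Linear(d_model, d_model)
        self.W_k = nn.Linear(d_model, d_model)
        self.W_v = nn.Linear(d_model, d_model)
        self.W_o = nn.Linear(d_model, d_model)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, x):                     # x: (batch, seq_len, d_model)
        b, n, _ = x.shape
        def split(t):                          # -> (batch, heads, seq_len, d_k)
            return t.view(b, n, self.n_heads, self.d_k).transpose(1, 2)
        Q, K, V = split(self.W_q(x)), split(self.W_k(x)), split(self.W_v(x))
        scores = Q @ K.transpose(-2, -1) / self.d_k ** 0.5
        attn = F.softmax(scores, dim=-1)       # per-head Self-Attention Scores
        out = (attn @ V).transpose(1, 2).reshape(b, n, -1)  # concat heads
        return self.norm(x + self.W_o(out))    # residual connection + LayerNorm
```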

The second part is FC layers (shown as Figure 3.2-7):

Figure 3.2-7 FC layers

A stack of fully connected layers in which the channel count first rises and then descends is used to process the features obtained from the preceding sector of the Encoder Block.

ReLU, defined as ReLU(x) = max(0, x), is used as the activation function, shown in Figure 3.2-8:

Figure 3.2-8 ReLU
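A minimal sketch of this rise-then-fall FC sector is given below; the fourfold channel expansion and the residual wrap are illustrative assumptions, as the exact widths are not specified here.

```python
# Sketch: FC layers whose channel count rises and then descends, with ReLU.
import torch.nn as nn

class FCLayers(nn.Module):
    def __init__(self, d_model=63, expansion=4):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d_model, d_model * expansion),   # channels rise
            nn.ReLU(),                                 # ReLU(x) = max(0, x)
            nn.Linear(d_model * expansion, d_model),   # channels descend
        )
        self.norm = nn.LayerNorm(d_model)

    def forward(self, x):
        return self.norm(x + self.net(x))              # residual + LayerNorm
```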

(c) Decoder Block (shown as Figure 3.2-9)

Figure 3.2-9 Decoder block

First, the sequences processed by the Encoder Block are concatenated, and the result is then processed in a manner similar to an Encoder Block; in addition, the Decoder Block computes the Concat Multi-Head Self-Attention score, which is obtained in the same way as the Self-Attention Score Matrix.

(d) Solution Block (shown as Figure 3.2-10)

Figure 3.2-10 Solution Block

Multiple FC layers are used to predict the final affinity score, with ReLU as the activation function and BatchNorm for normalization to stabilize the forward input distribution and accelerate convergence. A sketch is given below.
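The following sketch shows one plausible Solution Block under these constraints; the layer widths and the final Sigmoid (mapping the score to a probability in [0, 1]) are our assumptions for illustration.

```python
# Sketch: Solution Block = stacked FC layers + BatchNorm + ReLU + score head.
import torch.nn as nn

class SolutionBlock(nn.Module):
    def __init__(self, in_dim=1024):
        super().__init__()
        self.head = nn.Sequential(
            nn.Linear(in_dim, 256),
            nn.BatchNorm1d(256),    # stabilizes the forward input distribution
            nn.ReLU(),
            nn.Linear(256, 64),
            nn.BatchNorm1d(64),
            nn.ReLU(),
            nn.Linear(64, 1),
            nn.Sigmoid(),           # assumed: affinity score as a probability
        )

    def forward(self, x):           # x: (batch, in_dim) concatenated features
        return self.head(x)
```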

In summary, the core idea of our HLA-Pep-TCR Transformer is to use the Self-Attention mechanism to determine the similarity between different sequences through matrix multiplication among Q, K, and V, and then compute the Multi-Head Self-Attention Score and the Concat Multi-Head Self-Attention Score of all antigenic peptides for subsequent mutation modeling. The Encoder and Decoder of the HLA-Pep-TCR Transformer have one layer each, with nine attention heads. The Embedding Block adds absolute triangular positional encoding on top of the amino acid encoding, and then employs the Dropout technique to enhance the robustness of the model.

After the Embedding Block, the HLA-Pep-TCR Transformer generates embeddings of the HLA, antigenic peptide, and TCR amino acid sequences. These embedded sequences are then used as inputs to the Encoder Block, which contains the masked Multi-Head Self-Attention mechanism and the FC layers. The FC layers are a combination of fully connected layers in which the channels rise and then fall; this module refines the feature representation obtained by the self-attention mechanism. The output features of the antigenic peptide are then concatenated with those of the HLA and TCR alleles separately; the concatenated data is passed through the Decoder Block, and the Solution Block is used to predict the affinity scores of antigenic peptide-HLA and antigenic peptide-TCR, respectively, so as to determine whether the antigenic peptide binds the HLA (or TCR). The sketch below wires these steps together.
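To make the data flow explicit, here is a high-level sketch wiring the hypothetical modules from the earlier sketches together for the Pep-HLA branch; all module names and the flattening step are illustrative, not the exact implementation.

```python
# Sketch: forward pass for the Pep-HLA branch of the model.
import torch

def forward_pep_hla(pep_tokens, hla_tokens, embed, encoder, decoder, solution):
    pep = encoder(embed(pep_tokens))      # Embedding + Encoder Block (peptide)
    hla = encoder(embed(hla_tokens))      # Embedding + Encoder Block (HLA)
    joint = torch.cat([pep, hla], dim=1)  # Concat along the sequence axis
    joint = decoder(joint)                # Decoder Block on the joint sequence
    feat = joint.flatten(start_dim=1)     # flatten features for the FC head
    return solution(feat)                 # Solution Block -> binding score
```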


3.3 Data Provided

When building the model, we collected the vast majority of our data from the Immune Epitope Database (IEDB) of the U.S. National Institute of Allergy and Infectious Diseases (NIAID), and also collected datasets from relevant papers in Nature, encompassing common cancer antigen datasets including non-small cell lung cancer, melanoma, ovarian cancer, and pancreatic cancer.

Ultimately, we collected a total of 718,322 samples for the antigenic peptide-HLA dataset, covering all types of HLA, shown as Figure 3.3-1.

Figure 3.3-1 Peptide-HLA Dataset

The attributes are:

Peptide: The amino acid sequence of antigenic peptides

Length: The length of antigen peptides

HLA: Name of HLA

HLAsequence: The amino acid sequence of HLA

Label: Affinity Label

The antigenic peptide-TCR dataset we collected contains a total of 115,318 samples, including all types of TCRs, shown as Figure 3.3-2:

Figure 3.3-2 Peptide-TCR Dataset

The attributes of the dataset are:

Peptide: The amino acid sequence of an antigenic peptide

TCR: The amino acid sequence of TCR

Label: Affinity Label

Later, we randomly divided these datasets into Train and Test sets with a ratio of 8:2. The portions of the datasets used by the nine antigen peptide-HLA affinity prediction methods and the antigen peptide-TCR affinity prediction methods studied by predecessors that differ from our Train and Test sets are used as the External Set; the antigen peptides in the External Set do not appear in the Train Set. A sketch of this split is given below.
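The following is a minimal sketch of the 8:2 split and the External Set filtering; the file names and the use of pandas/scikit-learn are assumptions, though the column names follow Figures 3.3-1 and 3.3-2.

```python
# Sketch: 8:2 Train/Test split, plus an External Set of unseen peptides.
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv("peptide_hla.csv")   # Peptide, Length, HLA, HLAsequence, Label
train_df, test_df = train_test_split(df, test_size=0.2, random_state=42)

external = pd.read_csv("external_candidates.csv")   # from the prior methods
# keep only samples whose peptides never appear in the Train Set
external_set = external[~external["Peptide"].isin(set(train_df["Peptide"]))]
```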


Conclusion

With the data gained from the model, we can emphasize the following advantages:

Better performance

Using mainstream deep learning technologies, our two prediction models, the HLA-Pep-TCR Transformer and the HLA-Pep-TCR RetNet, achieve higher performance than the nine previously established antigenic peptide-HLA affinity prediction methods and the four general peptide-TCR prediction models discussed above.

Good versatility

The two prediction models we designed, based on the deep learning architectures Transformer and RetNet, can be used to predict both antigenic peptide-HLA affinity and antigenic peptide-TCR affinity. They can also be used for antigen peptides of all lengths; that is, the length of the antigen peptide is no longer a limiting condition, shown as Figure 4.0-1 and Figure 4.0-2.

Figure 4.0-1 Peptide-HLA Set
Figure 4.0-2 Peptide-TCR Set
Good generalization

The prediction models we designed based on Transformer and RetNet have strong generalization ability. When tested on the External Set, the scores for all metrics are above 87%, as shown in Figures 4.0-3 to 4.0-8, which exceeds our expectations.

Figure 4.0-3 Peptide-HLA Train Set Score
Figure 4.0-4 Peptide-HLA Test Set Score
Figure 4.0-5 Peptide-HLA External Set Score
Figure 4.0-6 Peptide-TCR Train Set Score
Figure 4.0-7 Peptide-TCR Test Set Score
Figure 4.0-8 Peptide-TCR External Set Score

Moreover, with the help of these data, we can improve our experiments, increasing efficiency while reducing development costs, and ultimately offering more effective and specific cancer treatments.