The variation in binding affinity among different peptide chains is a crucial design element for ensuring the proper functioning of our experimental pathway. To assess the feasibility of this pathway, we employed multiple strategies to determine the binding energies of the peptide chain combinations used in our experiments. We found that affinity prediction generally requires first simulating binding conformations with molecular docking software and then evaluating the resulting structures with a scoring tool. However, we faced a practical problem: no specialized tools exist for predicting how strongly two peptides bind, and general-purpose models are not precise enough to reliably distinguish our peptide pairs, whose differences in affinity are small. Detailed molecular simulations could in principle provide more reliable data, but they are often prohibitively time-consuming and technically demanding.

We therefore developed a computational model that enables both straightforward, rapid prediction of peptide-binding affinity and rational design of affinity-enhancing mutations, facilitating efficient peptide engineering in our detection system.

Prediction Part

Principles

Given the high-dimensional nature of the database and the limited number of samples, many powerful models, such as random forests and neural networks, are prone to severe overfitting and are therefore unsuitable for this task. Instead, we employ lasso regression and ridge regression, which are more appropriate for such scenarios and help mitigate overfitting.

The key advantage of lasso and ridge regression lies in their incorporation of a penalty term into the loss function, which constrains model complexity and helps prevent overfitting, a technique known as regularization. In the lasso model, the loss function is defined as:

$$L_{\text{lasso}}(\beta) = \sum_{i=1}^{n}\left(y_i - \mathbf{x}_i^{\top}\beta\right)^2 + \lambda\sum_{j=1}^{p}\lvert\beta_j\rvert$$

In the ridge model, the loss function becomes:

$$L_{\text{ridge}}(\beta) = \sum_{i=1}^{n}\left(y_i - \mathbf{x}_i^{\top}\beta\right)^2 + \lambda\sum_{j=1}^{p}\beta_j^{2}$$

Compared to ordinary linear regression, lasso introduces an L1 regularization term $\lambda\sum_{j}\lvert\beta_j\rvert$, while ridge regression uses an L2 regularization term $\lambda\sum_{j}\beta_j^{2}$. These different penalty forms lead to distinct model behaviors: lasso tends to produce sparse solutions by driving some coefficients exactly to zero [1], while ridge regression does not force coefficients to zero and so reduces the risk of inadvertently excluding relevant variables [2]. It is therefore worthwhile to construct and compare both models to identify the more effective approach.
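This contrast can be sketched with scikit-learn on synthetic data (illustrative only; the data, dimensions, and hyperparameters below are not those of our actual pipeline):

```python
# Contrast L1 (lasso) and L2 (ridge) penalties on a p >> n problem,
# the regime our peptide dataset falls into.
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
n_samples, n_features = 40, 200            # far more features than samples
X = rng.normal(size=(n_samples, n_features))
true_coef = np.zeros(n_features)
true_coef[:5] = [2.0, -1.5, 1.0, 0.8, -0.6]  # only 5 informative features
y = X @ true_coef + 0.1 * rng.normal(size=n_samples)

lasso = Lasso(alpha=0.1).fit(X, y)
ridge = Ridge(alpha=1.0).fit(X, y)

# Lasso drives most coefficients exactly to zero (sparse solution) ...
print("nonzero lasso coefficients:", np.sum(lasso.coef_ != 0))
# ... while ridge shrinks coefficients but keeps essentially all of them nonzero.
print("nonzero ridge coefficients:", np.sum(ridge.coef_ != 0))
```

The sparsity of the lasso solution is what makes it attractive for high-dimensional embeddings, while ridge retains all variables at reduced weight.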

Construction Process

The construction process of our prediction model can be illustrated by the flowchart below:

Fig.1 The construction process of the prediction model
  1. Database Creation.

    Previous investigations revealed a lack of databases documenting the binding strengths of homo- and heterodimeric peptide chains, especially for CC domain structures. In collaboration with the wet lab team, we constructed a database containing over 100 entries with sequence and Kd (dissociation constant, a measure of binding strength) information, sourced from the RCSB Protein Data Bank (RCSB PDB) [3], PDBbind+, and related literature. The database and all references it draws on are available in our iGEM GitLab repository.

  2. Pre-training: Encoding Sequences with ESM-2.

    According to Cui et al. (2021), multiple sequence representation methods exist for converting peptide sequences and their physicochemical properties into data that models can use. Among these, typical end-to-end approaches such as one-hot encoding (a type of orthogonal encoding) represent only the amino acid type at each position, while manually extracted amino acid descriptors can encode both positional information and rich physicochemical properties, offering a more comprehensive feature representation.

    Furthermore, the literature indicates that transfer learning-based representation methods (such as the ESM model) have become the predominant approach. Such models, pre-trained on large-scale unlabeled sequences, can generate low-dimensional, abstract, context-aware embedding vectors. Compared to one-hot encoding and handcrafted features, the representations derived from Transformer-based architectures like ESM are not only compact in dimensionality but also better at capturing deep structural and evolutionary information in sequences. Therefore, they demonstrate stronger applicability and generalization capability on datasets with limited sample sizes [4].

    As a consequence, we used the pre-trained ESM-2 model (esm2_t12_35M_UR50D) to encode all sequences, extracting the last hidden layer as input features for our model, which reduces dimensionality while retaining key features. ESM-2 is a protein language model developed by Meta AI, based on the Transformer architecture. It is pre-trained on millions of protein sequences using masked language modeling and generates amino acid embeddings that capture structural and functional information without requiring multiple sequence alignment [5].

  3. Model Building and Validation.

    This stage is crucial for obtaining a suitable model and can be divided into several steps:

    • The dataset was split into training and test sets using stratified sampling.
    • Models were constructed and trained on the training set, with parameters tuned appropriately.
    • Model performance was evaluated on the test set, and different models were systematically compared.
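The steps above can be sketched as follows, with synthetic stand-in data (in practice X would hold the ESM-2 embeddings and y the measured Kd-derived values; one common way to stratify a continuous target, assumed here, is to bin it into quantiles):

```python
# Stratified split of a continuous target, then train-and-compare.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(1)
X = rng.normal(size=(120, 480))       # stand-in for 120 entries of 480-dim embeddings
y = X[:, :3] @ np.array([1.5, -1.0, 0.5]) + 0.2 * rng.normal(size=120)

# Bin the continuous target into quartiles so that both splits
# cover the full range of binding strengths.
bins = np.digitize(y, np.quantile(y, [0.25, 0.5, 0.75]))
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, stratify=bins, random_state=0)

# Train both candidate models on the training set, compare on the test set.
for name, mdl in [("lasso", Lasso(alpha=0.05)), ("ridge", Ridge(alpha=1.0))]:
    mdl.fit(X_tr, y_tr)
    print(name, "test R^2 =", round(mdl.score(X_te, y_te), 3))
```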

Results and Discussion: Current Challenges

We evaluated the performance of the Lasso and Ridge models on a test set split from our dataset, as illustrated in Fig.2 and Fig.3.

Fig.2 The performance of Ridge regression model

Fig.3 The performance of Lasso regression model

As shown, the Lasso regression model achieved the better evaluation score, reaching a qualitative to semi-quantitative level of prediction, and was therefore selected as our final predictive model.

We further evaluated the Lasso model using the CD and CC' complexes from our experimental system. The model output a lower score for CC' (0.67) than for CD (3.78); since a lower score corresponds to higher binding affinity, this predicts stronger binding for CC', consistent with the results obtained from molecular dynamics simulations.

To enhance practical utility and classification performance, we also transformed the regression model into a discriminative model with five distinct affinity levels. This adaptation yielded a modest improvement in model performance.
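A minimal sketch of such a mapping, with illustrative cut-offs rather than the thresholds actually used in our model:

```python
# Map continuous predicted scores onto five discrete affinity levels.
import numpy as np

def score_to_level(scores, cutoffs=(1.0, 2.0, 3.0, 4.0)):
    """Map predicted scores to levels 1 (strongest) .. 5 (weakest).

    A lower regression score means higher predicted binding affinity,
    so level 1 collects the lowest scores. The cut-offs are illustrative.
    """
    return np.digitize(scores, cutoffs) + 1

# With these example cut-offs, a score of 0.67 lands in level 1
# and a score of 3.78 in level 4.
print(score_to_level(np.array([0.67, 3.78])))
```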

These results demonstrate that, compared with the conventional prediction workflow, our model is more straightforward and convenient to use. It bypasses the usual sequence-to-structure relay, significantly streamlining the prediction process and reducing computational time while maintaining reasonable accuracy. The model can therefore serve as an efficient tool for rapidly estimating peptide-binding strength under time constraints.

Despite these advantages, the model still exhibits limitations in predictive accuracy. We attempted several optimization strategies—including parameter tuning, algorithm substitution (e.g., Partial Least Squares, PLS), dimensionality reduction via Principal Component Analysis (PCA), and dataset refinement—yet these measures yielded only marginal improvements. Further development is therefore necessary to enhance model performance. For a detailed overview of the model selection process and the rationale behind our final choice, please refer to the Engineering part.

Optimization and Design Part

Based on the established prediction model, we have developed two methods to propose effective mutation strategies. Once the user provides the sequences of the peptide complex and specifies the chain to be engineered, the system rapidly outputs a set of promising mutation proposals.

This approach can be an efficient preliminary screening tool that can be effectively integrated with molecular simulation analysis. While molecular simulations provide detailed mechanistic insights, our method quickly offers preparatory and exploratory directions, helping users identify promising improvement targets and refine designs promptly. Through this strategy, we enable rapid and directional modulation of peptide binding strength, complementing molecular simulation, reducing experimental and computational time costs, and offering intuitive insights into the sequence–binding strength relationship.

Case A: Exhaustive Search for Single-Site Mutations

During database construction, we observed that even a single-point mutation could significantly alter the binding strength between two peptide chains. Given the manageable complexity of our model and the relatively short length of each peptide chain, we implemented an exhaustive search strategy based on the prediction model to identify the optimal single-site mutation while keeping the rest of the peptide fixed. Our results confirm the feasibility of using exhaustive search for designing single-site mutations.
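The search itself is simple to express. In the sketch below, `predict_score` is a hypothetical toy stand-in for our trained model (so the example runs on its own); as in our model, a lower score means stronger predicted binding:

```python
# Exhaustive single-site mutation search: try every amino acid at every
# position and rank mutants by predicted score (lower = stronger binding).
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def predict_score(seq: str) -> float:
    # Toy stand-in that simply favors lysines; the real version
    # would call the trained Lasso model on the ESM-2 embedding.
    return 5.0 - 0.1 * seq.count("K")

def best_single_mutations(seq: str, top_n: int = 5):
    """Score all single-site mutants of seq; return the top_n best."""
    candidates = []
    for pos, original in enumerate(seq):
        for aa in AMINO_ACIDS:
            if aa == original:
                continue
            mutant = seq[:pos] + aa + seq[pos + 1:]
            candidates.append((predict_score(mutant), pos, original, aa, mutant))
    candidates.sort(key=lambda c: c[0])   # ascending: best (lowest) score first
    return candidates[:top_n]

for score, pos, orig, new, _ in best_single_mutations("DILGML"):
    print(f"{orig}{pos + 1}{new}: predicted score {score:.2f}")
```

For a peptide of length L there are only 19L single-site mutants, so exhaustive enumeration stays cheap as long as each prediction is fast.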

For example, we chose an entry from our database as the input to show our result:

Fig.4 The example simulation result of Case A

Based on the results presented in Fig. 4, the five single-site mutations shown represent the top candidates identified by our exhaustive search, each leading to a consistent reduction in predictive score relative to the original sequence. Since a lower score corresponds to higher predicted binding affinity in our model, these mutations are those most likely to enhance peptide-peptide binding strength. For example, as illustrated in Fig. 4, when the original sequence “DILGMLKSLHQLQVENRRLEEQIKNLTAKKERLQLLNAQLSV” is mutated to “KILGMLKSLHQLQVENRRLEEQIKNLTAKKERLQLLNAQLSV” (i.e., D→K at the first residue), the model predicts a significant increase in binding affinity, demonstrating the practical utility of our design approach.

Case B: Multi-Site Statistical Design Strategy Driven by High-Frequency Mutations in Genetic Algorithm

Fig.5 The construction process of the design model (Case B)

As shown in Fig. 5, we employed a genetic algorithm (GA) guided by our predictive model to optimize multi-site mutations. The GA was executed over multiple independent runs, and the top-performing mutant sequences from each run were collected for statistical analysis. By systematically analyzing the mutation positions and amino acid substitutions appearing in these high-fitness candidates, we derived robust reference suggestions for sequence optimization.

To visualize these results intuitively, we summarized the mutation statistics in a heatmap. In this representation, the x-axis corresponds to residue positions along the peptide chain, while the y-axis indicates possible amino acid substitutions (each number indexes an amino acid). The color intensity at each coordinate reflects the frequency with which that specific mutation (i.e., a particular substitution at a particular position) appears across all top-performing mutants. Mutations that occur with high frequency, indicating a strong association with improved binding affinity (fitness score), are readily identifiable, providing a clear and actionable overview of beneficial mutation patterns.
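A condensed sketch of this workflow, with a toy fitness function standing in for the real predictive model (lower fitness = stronger predicted binding; the GA parameters and all names here are illustrative, not those of our actual implementation):

```python
# GA over peptide sequences plus mutation-frequency statistics across runs,
# the data behind the heatmap.
import numpy as np

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
rng = np.random.default_rng(42)

def fitness(seq):                  # toy stand-in for the Lasso model's score
    return 5.0 - 0.1 * seq.count("K") - 0.05 * seq.count("E")

def mutate(seq, rate=0.1):
    return "".join(rng.choice(list(AMINO_ACIDS)) if rng.random() < rate else aa
                   for aa in seq)

def run_ga(start, generations=30, pop_size=50):
    pop = [mutate(start) for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=fitness)
        survivors = pop[:pop_size // 2]            # keep the fitter half
        pop = survivors + [mutate(s) for s in survivors]
    return min(pop, key=fitness)                   # best sequence of this run

start = "DILGML"
best_per_run = [run_ga(start) for _ in range(10)]  # independent GA runs

# Count, per (substituted amino acid, position), how often each mutation
# appears among the best sequences of all runs.
freq = np.zeros((len(AMINO_ACIDS), len(start)), dtype=int)
for seq in best_per_run:
    for pos, (orig, new) in enumerate(zip(start, seq)):
        if new != orig:
            freq[AMINO_ACIDS.index(new), pos] += 1
print("most frequent mutation (aa index, position):",
      np.unravel_index(freq.argmax(), freq.shape))
```

The `freq` matrix is what gets rendered as the heatmap: high-count cells mark substitutions that repeatedly survive selection across independent runs.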

We demonstrated this approach using the C–C′ complex from our pathway, with Chain C designated as the optimization target. The resulting heatmap (Fig.6) clearly highlights several mutation hotspots, where brighter shades indicate higher mutation frequencies and, by extension, stronger predicted binding affinity. These high-frequency positions offer experimentally actionable guidance for enhancing interaction strength.

Fig.6 Example GA-driven mutation analysis for the CC′ complex, with Chain C as the design target

For instance, the mutation at coordinate (7,1), the highest-frequency spot in the heatmap, is strongly associated with enhanced binding affinity: the sequence carrying this mutation achieved the highest predicted binding strength in our analysis. Moreover, among the best sequences collected from each GA run, the sequence most similar to it that lacks this specific mutation shows a substantially lower predicted binding affinity. This contrast provides strong evidence that the (7,1) mutation strengthens peptide binding under the evaluation framework of our prediction model. Due to time constraints, however, we have not yet verified this point with wet-lab experiments.

In summary, this GA-based strategy significantly streamlines the mutation design process, enabling rapid and data-driven optimization of peptide configurations. This contributes directly to enhanced reaction efficiency and detection accuracy in our proposed pathway.

It should be noted, however, that the quality of the design suggestions depends intrinsically on the accuracy of the underlying predictive model that serves as the GA fitness function. As more accurate affinity prediction models become available, this framework can be seamlessly adapted to them, further improving the reliability and value of its mutation recommendations.

Future Work

The predictive accuracy of this part also leaves room for improvement. Our immediate objective is therefore to enrich the underlying database, a critical step toward improving the model's predictive fidelity. With a more mature dataset, we aim to deploy higher-performance algorithms. In parallel, we plan to improve the design framework by exploring more suitable algorithms and more reliable, advanced design methods, such as automating existing thermodynamic mutation-analysis workflows, thereby streamlining and optimizing the mutation design process.

View our code on our iGEM GitLab.

References


[1] Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society: Series B (Methodological), 58(1), 267–288.
[2] Hoerl, A. E., & Kennard, R. W. (1970). Ridge regression: Biased estimation for nonorthogonal problems. Technometrics, 12(1), 55–67.
[3] Berman, H., Henrick, K., & Nakamura, H. (2003). Announcing the worldwide Protein Data Bank. Nature Structural & Molecular Biology, 10(12), 980.
[4] Cui, F., Zhang, Z., & Zou, Q. (2021). Sequence representation approaches for sequence-based protein prediction tasks that use deep learning. Briefings in Functional Genomics, 20(1), 61–73.
[5] Lin, Z., Akin, H., Rao, R., Hie, B., Zhu, Z., Lu, W., ... & Rives, A. (2022). Language models of protein sequences at the scale of evolution enable accurate structure prediction. bioRxiv, 2022, 500902.

© 2025 - Content on this site is licensed under a Creative Commons Attribution 4.0 International license.

The repository used to create this website is available at gitlab.igem.org/2025/ucas-china.