Model

Abstract

Directed evolution has been a central methodology for protein engineering, playing a critical role in applications such as pharmaceutical research (e.g., antibody development) and environmental solutions (e.g., the improvement of plastic-degrading enzymes). However, directed evolution is labor-intensive and requires high-throughput screening systems. Furthermore, it can explore only a limited portion of sequence space and is prone to converging on local optima. To address these limitations, we developed LEAPS (Language model guided Exploration of Augmented Protein Sequence space), a machine learning model that improves protein function from limited data. LEAPS efficiently explores vast sequence space through an iterative process of generating diverse novel sequences and predicting their functions. While designing proteins with desired properties from small datasets has traditionally been challenging, LEAPS overcomes this barrier and pioneers a new era of protein design.

Introduction

Proteins are essential biomacromolecules involved in all biological phenomena. Through billions of years of evolution, they have diversified, been selected, and adapted to perform various functions: catalyzing chemical reactions as enzymes, supporting cellular and tissue morphology as structural components, and participating in immune responses as antibodies. However, natural proteins have constraints in thermal stability, catalytic efficiency, substrate specificity, and expression levels, and are not necessarily optimized for artificial use. Most proteins require improvement before industrial application. The two most representative improvement methods are directed evolution and rational design. Directed evolution experimentally searches for proteins with desired properties through iterative cycles of random mutagenesis and selection [1]. Rational design deliberately modifies properties based on knowledge of protein structure and mechanisms [2]. While both approaches have achieved numerous successes, each has inherent limitations. Directed evolution can explore only a small fraction of the vast sequence space and tends to converge on local optima. Furthermore, a large number of mutants must be screened, making high-throughput screening systems indispensable. Even when such systems can be established, enormous human, material, and temporal costs are required. For these reasons, directed evolution is not universally applicable to all proteins. Rational design, based on structural information and mechanistic understanding, theoretically enables efficient improvement. In practice, however, complete understanding of protein three-dimensional structure, dynamics, and substrate interactions is rarely achieved, and many unknown factors remain [3]. Consequently, functional modifications often fail to achieve the intended results. Rational design also requires specialized knowledge of proteins and cannot easily be performed by non-experts.
Thus, despite the countless proteins requiring improvement, both methods have limited applicability.

Machine learning, which has made remarkable advances in recent years, has attracted significant attention as a potential solution. Machine learning technologies, including deep learning, have been applied across various fields with substantial success. For example, the emergence of AlphaFold [4], which predicts protein three-dimensional structure from amino acid sequences, brought a major revolution to protein research and life sciences as a whole. Numerous models have been proposed to predict functional changes accompanying mutations and to generate novel sequences by applying language models.

These approaches demonstrate the potential to computationally expedite the exploration of vast sequence space, which was difficult with conventional directed evolution or rational design. However, challenges remain in these machine learning methods. First, when generative models are used alone, they can generate enormous numbers of novel sequences, but it remains unclear which possess improved function, ultimately requiring experimental validation of many sequences [5]. This makes generative models difficult to apply directly to protein modification. Conversely, using only predictive models necessitates reliance on random mutations or existing variant libraries to obtain candidate sequences [6], resulting in restricted search ranges. In other words, the conventional framework, in which generation and prediction operate independently, makes it difficult to explore sequences efficiently while ensuring diversity. Furthermore, many existing models require vast experimental datasets, making application to proteins with limited data challenging [7]. There is also a tendency for search ranges to be biased toward local sequence distributions [8]. Thus, mechanisms to maintain diversity while exploring broad ranges and optimizing function remain insufficiently established.

Against this background, we developed LEAPS (Language model guided Exploration of Augmented Protein Sequence space), a novel approach that efficiently achieves functional improvement from limited experimental data. LEAPS enables iterative optimization through the integrated combination of data augmentation, sequence generation, and function prediction, leveraging the mutual interaction between generative and predictive models. Specifically, generative models create novel sequences, and predictive models evaluate their function. High-scoring sequences are used to fine-tune the generative model via LoRA, and the cycle of generating sequences is repeated, gradually converging generated sequences toward the target. This framework enables multi-objective optimization—simultaneous improvement of multiple properties—which was difficult with conventional directed evolution, provided that a predictive model can be constructed.

This overcomes the constraints inherent in using generative or predictive models alone, enabling broad and efficient exploration of sequence space. LEAPS surmounts the problems faced by conventional methods—“data scarcity,” “susceptibility to local optima,” and “lack of search diversity”—functioning as a highly versatile protein design model.

Model Overview


LEAPS is a model developed to improve single or multiple protein functions simultaneously, even from as few as 40 labeled sequences. First, large quantities of virtual functional variant data are generated from input protein sequences through shuffling and mutagenesis, followed by selection via the Qualifier. Next, a predictive model is trained to perform regression prediction of protein functional values (e.g., enzymatic activity, thermal stability) from the 40 labeled input sequences. Using this predictive model, high-scoring sequences are selected from the previously generated variants. The Generator, trained on these high-scoring sequences, generates novel related sequences. Generated sequences are selected by the Validator and Predictor, and highly evaluated sequences are fed back to the Generator for retraining. By iteratively repeating this cycle of learning and generating high-scoring sequences followed by evaluation, the generated sequences converge toward high-function regions of sequence space, improving the protein.
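The cycle described above can be summarized in a short sketch. Every function name below is an illustrative placeholder (a toy generator, predictor, and filter), not the actual LEAPS implementation:

```python
import random

def leaps_cycle(seed_sequences, generate, predict, validate, finetune,
                n_iterations=3, top_k=5):
    """Hypothetical generate -> validate -> predict -> select -> retrain loop."""
    pool = list(seed_sequences)
    for _ in range(n_iterations):
        candidates = generate(pool)                          # Generator proposes sequences
        candidates = [s for s in candidates if validate(s)]  # Validator filters out junk
        scored = sorted(candidates, key=predict, reverse=True)
        pool = scored[:top_k] or pool                        # keep high-scoring sequences
        finetune(pool)                                       # e.g. LoRA update of the Generator
    return pool

# Toy demonstration: the "function" to optimize is simply the alanine count.
rng = random.Random(0)
AA = "ACDEFGHIKLMNPQRSTVWY"

def toy_generate(pool):
    out = []
    for s in pool:
        for _ in range(4):                 # four point mutants per parent
            i = rng.randrange(len(s))
            out.append(s[:i] + rng.choice(AA) + s[i + 1:])
    return out

best_pool = leaps_cycle(["ACDEFG"], toy_generate,
                        predict=lambda s: s.count("A"),
                        validate=lambda s: True,
                        finetune=lambda pool: None)
```

In the real pipeline, `generate` would correspond to ProGen2, `predict` to the ESM2-based Predictor, `validate` to the EVmutation-based Validator, and `finetune` to the LoRA update described below.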

Module Description

Shuffling & Mutation Program

The Shuffling & Mutation program is a module designed to efficiently create numerous variants for data augmentation required for Generator training.

First, variant generation through sequence shuffling is performed. In this method, multiple wild-type sequences are divided into fixed window sizes (1, 3, 5), and sequence fragments are randomly exchanged between different sequences at each window unit, creating variants with novel combinations.
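This window shuffling can be sketched as follows, under simplifying assumptions that are ours rather than the project's (two equal-length parents, a 50% swap probability per window):

```python
import random

def window_shuffle(seq_a, seq_b, window, rng=random):
    """Swap aligned fixed-size windows between two equal-length parent
    sequences, each window exchanged with 50% probability (illustrative)."""
    assert len(seq_a) == len(seq_b)
    child_a, child_b = [], []
    for start in range(0, len(seq_a), window):
        frag_a = seq_a[start:start + window]
        frag_b = seq_b[start:start + window]
        if rng.random() < 0.5:            # randomly exchange this window
            frag_a, frag_b = frag_b, frag_a
        child_a.append(frag_a)
        child_b.append(frag_b)
    return "".join(child_a), "".join(child_b)

a, b = window_shuffle("AAAAAAAAA", "CCCCCCCCC", window=3, rng=random.Random(0))
# Each child is a mosaic of 3-residue blocks drawn from the two parents.
```

Running the same routine with window sizes 1, 3, and 5 yields recombinants at three levels of granularity.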

Second, for all 40 input sequences, all point substitution variants are created by replacing the amino acid at every position with any of the 20 standard amino acids. This operation constructs a point mutation library that enables comprehensive exploration of functional importance at each position in each sequence.
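A minimal sketch of building such a point-substitution library (here skipping the identity substitution, which would merely reproduce the input sequence):

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids

def point_mutation_library(seq):
    """All single-substitution variants: every position x 19 alternative residues."""
    variants = []
    for i, wt in enumerate(seq):
        for aa in AMINO_ACIDS:
            if aa != wt:                  # skip the wild-type residue itself
                variants.append(seq[:i] + aa + seq[i + 1:])
    return variants

lib = point_mutation_library("ACD")       # 3 positions x 19 substitutions = 57 variants
```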

The above processes enable diverse and comprehensive variant generation. However, likelihood estimation accuracy is known to deteriorate significantly when multiple “challenging mutations” are present [9]. This problem is circumvented by retaining as candidates only variants that differ by no more than four residues from at least one of the input sequences.
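This four-residue screen amounts to a Hamming-distance filter against the set of input sequences; a minimal sketch:

```python
def hamming(a, b):
    """Number of differing positions between two equal-length sequences."""
    return sum(x != y for x, y in zip(a, b))

def within_radius(variant, references, max_dist=4):
    """Retain a variant only if it lies within max_dist substitutions of at
    least one input sequence (a sketch of the <= 4-residue screen)."""
    return any(hamming(variant, ref) <= max_dist for ref in references)

refs = ["ACDEFGHIK"]
assert within_radius("ACDEFGHIV", refs)       # 1 substitution -> kept
assert not within_radius("AAAAAAHIK", refs)   # 5 substitutions -> discarded
```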

Qualifier

While the Shuffling & Mutation program generates diverse variants, many have lost function. Such non-functional variants must be excluded from Generator training. The Qualifier removes variants that have lost function by calculating likelihood, a measure of protein sequence plausibility.

Using the protein language model SaProt-650M, likelihood scores of generated variants are calculated.

$$\sum_{i \in M} \left[ \log p(x_i = x_i^{mt} \mid \boldsymbol{x}_{-M}) - \log p(x_i = x_i^{wt} \mid \boldsymbol{x}_{-M}) \right]$$

Using the above masked marginal scoring function, the log-likelihood difference from the wild-type sequence is calculated for each variant. The masked marginal scoring function replaces each target residue position $i$ with the mask token [MASK], and the model predicts the probability $p(a_i \mid \text{context})$ of each amino acid appearing at that position. From this probability distribution, the log-likelihoods of the wild-type and mutant residues are obtained, and their difference ($\Delta$ log-likelihood) is calculated to quantitatively evaluate the impact of mutations on sequence naturalness and evolutionary plausibility.

Variants with non-negative scores (i.e., whose likelihood is higher than or equal to that of the wild type) are selected. Only variants remaining after this primary screening proceed to secondary screening as virtual functional variants.
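Given the per-position log-probability matrix obtained by masking each position in turn, the masked marginal score reduces to a simple sum. The sketch below assumes such a matrix is already available (the call to SaProt itself is omitted, and the 20-column amino-acid ordering is an assumption):

```python
import numpy as np

AA = "ACDEFGHIKLMNPQRSTVWY"

def masked_marginal_score(log_probs, wt_seq, mt_seq):
    """Sum over mutated positions of log p(mt) - log p(wt). log_probs[i] is
    assumed to be the masked language model's log-probability distribution
    over the 20 amino acids at position i, computed with position i masked."""
    score = 0.0
    for i, (wt, mt) in enumerate(zip(wt_seq, mt_seq)):
        if wt != mt:
            score += log_probs[i, AA.index(mt)] - log_probs[i, AA.index(wt)]
    return score

# Sanity check: under a uniform distribution every mutation scores exactly zero.
uniform = np.log(np.full((3, 20), 1.0 / 20))
assert masked_marginal_score(uniform, "ACD", "ACE") == 0.0
```

Variants whose score is non-negative would then pass the Qualifier's primary screen.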

The Shuffling & Mutation program and Qualifier construct high-quality virtual variant datasets without experimentation. This enables Generator training from 40 sequences, which was previously difficult. (See Engineering Cycle 7.1)

Predictor

The Predictor learns relationships between sequences and functions from functional values labeled on input sequences and predicts functional values of unknown sequences. We achieved improved predictive model accuracy through neural networks, novel data augmentation methods, and introduction of custom dropout.

This Predictor consists of ESM2 with frozen parameters and a fully connected layer coupled to its final layer serving as a regression model. This fully connected layer learns the relationship between protein embedding representations output from ESM2 and their functions. Neural network nonlinearity enables learning of high-order feature interactions that cannot be captured by linear models or random forests.
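In miniature, this architecture is a small fully connected head on top of frozen embeddings. The sketch below uses random numpy weights and a stand-in array for the ESM2 embedding (the 1280-dimensional size matches esm2_t33_650M, an assumption on our part); only the head's parameters would be trained:

```python
import numpy as np

rng = np.random.default_rng(0)

def regression_head(emb, W1, b1, W2, b2):
    """Fully connected layer + ReLU on a frozen per-sequence embedding,
    mapping it to a scalar functional value (toy weights, not the trained model)."""
    h = np.maximum(emb @ W1 + b1, 0.0)    # hidden layer with ReLU nonlinearity
    return h @ W2 + b2                    # linear output: predicted function

W1 = rng.normal(scale=0.02, size=(1280, 128)); b1 = np.zeros(128)
W2 = rng.normal(scale=0.02, size=(128, 1));   b2 = np.zeros(1)
emb = rng.normal(size=(4, 1280))              # stand-in for ESM2 mean-pooled embeddings
preds = regression_head(emb, W1, b1, W2, b2)  # shape (4, 1): one value per sequence
```

The ReLU hidden layer is what gives the head the nonlinear capacity the text contrasts with linear models; dropout (the authors describe a custom variant) would be applied to `h` during training.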

While neural networks possess high representational capacity, they also carry risk of overfitting. Therefore, we developed data augmentation methods and custom dropout that suppress overfitting while maintaining neural network representational capacity. This enables the Predictor to achieve high representational capacity and generalization performance, predicting with higher accuracy than conventional methods using LASSO regression.

Generator

The Generator learns sequences that receive high evaluation from the predictor and pass secondary screening, generating similar sequences. By fine-tuning ProGen2, a pre-trained autoregressive protein language model, sequences are generated based on evolutionary and structural constraints learned by the protein language model. This preferentially generates sequences located in biologically plausible mutation space, efficiently exploring toward global optima difficult to reach through random search.
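The LoRA update used in this fine-tuning has a simple form: a frozen pre-trained weight W is augmented by a trainable low-rank product scaled by α/r. A toy numpy sketch, with dimensions far smaller than any ProGen2 layer:

```python
import numpy as np

rng = np.random.default_rng(0)
d, r, alpha = 8, 2, 4                      # tiny illustrative dimensions

W = rng.normal(size=(d, d))                # frozen pre-trained weight
A = rng.normal(scale=0.01, size=(r, d))    # trainable low-rank factor
B = np.zeros((d, r))                       # zero-initialised, so training starts from W

def lora_forward(x, W, A, B, alpha, r):
    """LoRA: effective weight = frozen W plus the low-rank update (alpha/r) * B A."""
    return x @ (W + (alpha / r) * (B @ A)).T

x = rng.normal(size=(3, d))
# At initialisation the update is zero, so outputs match the frozen model exactly.
assert np.allclose(lora_forward(x, W, A, B, alpha, r), x @ W.T)
```

Because only A and B (2·d·r parameters here) are trained, each iteration's fine-tuning on a handful of high-scoring sequences stays cheap and avoids catastrophically overwriting the pre-trained model.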

Validator

As an external evaluation method independent of the predictive model to remove non-functional sequences, the Validator utilizes EVmutation.

EVmutation is a method that predicts mutational effects based on evolutionary information. Specifically, by analyzing sequence conservation and covariation patterns (patterns where multiple positions change in coordination) in protein families, it quantitatively evaluates the impact of specific mutations on function. By removing sequences with low $\Delta E$ scores calculated by the following equations, contamination by non-functional sequences is prevented.

$$E(\mathbf{x}) = \sum_{i=1}^{L} h_i(x_i) + \sum_{i<j} J_{ij}(x_i, x_j)$$

$$\Delta E = E(\mathbf{x}_{\mathrm{mut}}) - E(\mathbf{x}_{\mathrm{wt}})$$
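A toy numpy version of this statistical energy, with random parameters standing in for the single-site fields and pairwise couplings that EVmutation actually fits to a protein family alignment:

```python
import numpy as np

rng = np.random.default_rng(0)
L_len, q = 5, 20                            # toy sequence length, amino-acid alphabet size
h = rng.normal(size=(L_len, q))             # single-site fields h_i (random stand-ins)
J = rng.normal(size=(L_len, L_len, q, q))   # pairwise couplings J_ij (random stand-ins)

def potts_energy(x, h, J):
    """E(x) = sum_i h_i(x_i) + sum_{i<j} J_ij(x_i, x_j)."""
    L = len(x)
    e = sum(h[i, x[i]] for i in range(L))
    e += sum(J[i, j, x[i], x[j]] for i in range(L) for j in range(i + 1, L))
    return e

wt  = [0, 1, 2, 3, 4]                       # sequences encoded as amino-acid indices
mut = [0, 1, 2, 3, 5]                       # single substitution at the last position
delta_E = potts_energy(mut, h, J) - potts_energy(wt, h, J)
```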

While the Generator tends to generate evolutionarily and biologically plausible sequences, generated sequences do not necessarily retain function. Moreover, the Predictor was shown to be unable to identify and exclude these non-functional sequences (see Engineering Cycle 4).

In iterative optimization cycles combining predictive and generative models, contamination by non-functional sequences leads to a vicious cycle where lower-quality sequences are generated based on them. To prevent such chains of quality degradation, a mechanism for removing non-functional sequences through independent evaluation criteria is essential. Therefore, by introducing EVmutation, which considers evolutionary constraints, as a Validator, sequences that have lost function are effectively filtered, maintaining the soundness of the optimization process.

Advantage of LEAPS: Comparison with Random Mutation Baseline

A key feature of LEAPS is its ability to efficiently explore high-activity sequences through iterative evaluation by predictive models and LoRA fine-tuning of generative models.

Simple optimization by random mutation has limited search range and easily falls into local optima. In contrast, LEAPS utilizes evolutionary and structural knowledge learned by protein language models, preferentially exploring biologically plausible mutation space and is expected to reach optimal solutions more efficiently.

However, quantitative comparison with random mutation is essential to demonstrate generative model superiority. If equivalent performance is obtained, the significance of using computationally expensive generative models diminishes. Therefore, we implemented a method that introduces point mutations at each sequence position with 1% probability, iteratively evaluating and selecting with the predictive model, and conducted control experiments unifying all conditions except substituting this as the Generator.
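The baseline mutator can be sketched directly from this description, with each position substituted independently at 1% probability:

```python
import random

AA = "ACDEFGHIKLMNPQRSTVWY"

def random_mutate(seq, rate=0.01, rng=random):
    """Baseline generator: each position mutates independently with
    probability `rate` (1% in the control experiment) to a different residue."""
    out = []
    for res in seq:
        if rng.random() < rate:
            out.append(rng.choice([a for a in AA if a != res]))
        else:
            out.append(res)
    return "".join(out)

mutant = random_mutate("ACDEFGHIK" * 10, rate=0.01, rng=random.Random(0))
```

In the control experiment this routine simply replaces ProGen2 as the Generator, while the Predictor-based evaluation and selection steps are kept identical.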


Fig. 1. Predicted brightness scores of final iteration sequences when using Random mutation and ProGen2 as Generator

In this histogram, part of the frequency axis is omitted to emphasize differences in the Estimated brightness > 2 region. Using ProGen2 as the Generator exceeded Random mutation in both the maximum-scoring sequences and the top 25% of sequences.

As shown in Fig. 1, comparing predicted brightness score distributions of sequences obtained in the final iteration, clear performance improvement was confirmed when using ProGen2 as Generator compared to Random mutation. Specifically, with ProGen2, the maximum predicted brightness score reached 50, greatly exceeding approximately 22 for Random mutation. Moreover, the average score of the top 25% sequences was approximately 12 for ProGen2 versus approximately 2.5 for Random mutation, demonstrating ProGen2’s superiority.

Particularly noteworthy is the difference in the number of sequences in the high-score region of Estimated brightness > 15. ProGen2 produced comparatively many sequences exceeding this value, suggesting that the sequence population as a whole is shifting toward high function, rather than only a few variants achieving high activity.

These results suggest the possibility that utilizing protein language models enables efficient access to high-activity regions difficult to reach through simple stochastic search.

In this study, we validated the effectiveness of LEAPS in protein sequence optimization by comparing it with control experiments using random mutation. The results demonstrated that using ProGen2 as Generator significantly improved both the maximum predicted brightness score and average scores of top sequences compared to Random mutation. These results quantitatively demonstrate that utilizing evolutionary and structural knowledge learned by protein language models enables efficient access to high-score regions difficult to reach through simple stochastic search.

Random mutation introduces mutations at each position with equal probability, dispersing exploration throughout sequence space and easily falling into local optima. In contrast, ProGen2 proposes mutations based on sequence patterns learned from large-scale protein sequence databases, preferentially exploring biologically plausible mutation space. Furthermore, by learning features of high-activity sequences selected through LoRA fine-tuning, search direction is optimized with each iteration, achieving efficient convergence.

In drug discovery and enzyme engineering contexts where experimental costs far exceed computational costs, finding high-activity sequences with limited experimental trials is critical. These results demonstrate that LEAPS can reach higher-scoring sequences compared to conventional stochastic methods, supporting this method’s superiority in candidate sequence selection before experimental validation.

LEAPS can substantially reduce experimental trials through efficient sequence exploration, making it a powerful approach to accelerate AI-driven protein improvement. This method enables efficient identification of optimal sequence candidates within limited resources, and is expected to dramatically shorten research and development cycles in protein engineering.

References

[1] Arnold FH. Design by directed evolution. Acc Chem Res. 1998;31(3):125–31.

[2] Bornscheuer UT, Huisman GW, Kazlauskas RJ, Lutz S, Petiard J, Schwaneberg U. Engineering enzymes for non-natural reactions. Annu Rev Biochem. 2012;81:53–82.

[3] Kinch LN, Grishin NV. Opportunities and challenges in design and optimization of protein function. Protein Sci. 2020;29(11):2311–23.

[4] Jumper, J., Evans, R., Pritzel, A., Green, T., Figurnov, M., Ronneberger, O., Tunyasuvunakool, K., Bates, R., Žídek, A., Potapenko, A., Bridgland, A., Meyer, C., Kohl, S. a. A., Ballard, A. J., Cowie, A., Romera-Paredes, B., Nikolov, S., Jain, R., Adler, J., … Hassabis, D. (2021). Highly accurate protein structure prediction with AlphaFold. Nature, 596(7873), 583–589.

[5] Gustafsson, C., Govindarajan, S., & Emig, R. (2001). Exploration of sequence space for protein engineering. Journal of Molecular Recognition, 14(5), 308–314.

[6] Teyra J, Colak R, Kinch LN, Grishin NV, Ghabas A. Recent Advances in Machine Learning Variant Effect Prediction Tools for Protein Engineering. Int J Mol Sci. 2022;23(7):3799.

[7] Sliwoski GR, Lowe EW, Kinch LN, Grishin NV, Ghabas A. Incorporating physics to overcome data scarcity in predictive modeling of protein function: A case study of BK channels. PLoS Comput Biol. 2023;19(9):e1011460.

[8] Romero PA, Arnold FH. Exploring protein fitness landscapes by directed evolution. Nat Rev Mol Cell Biol. 2009;10(12):866–76. 

[9] Kinch LN, Grishin NV, Ghabas A. What makes the effect of protein mutations difficult to predict? Cell Rep Methods. 2023;3(10):100609.


© 2025 - Content on this site is licensed under a Creative Commons Attribution 4.0 International license

The repository used to create this website is available at gitlab.igem.org/2025/tsukuba.