Description

Index

Abstract

Background

Our Solution: The LEAPS Method

Practical Application

Software: “LEAPS-Software”

Safety & Security

Future Vision

Abstract

In this study, we developed a novel machine learning model named “LEAPS” that enables efficient protein function improvement from limited datasets. Conventional protein engineering requires large-scale experimental data, extensive expertise, and prolonged trial-and-error processes, making research execution heavily dependent on institutional funding and human resources. Consequently, significant barriers remain for student teams and small-scale laboratories attempting to engage in protein engineering. LEAPS substantially reduces these constraints and provides an accessible environment for researchers to perform multi-objective optimization of proteins. This model enables student teams participating in iGEM and researchers with limited resources to undertake functional improvements such as enzymatic activity enhancement and antibody affinity optimization, dramatically expanding research possibilities. Furthermore, we have developed and publicly released LEAPS as a web application, establishing a platform accessible to the broader research community. This model has the potential to promote data-driven protein engineering and accelerate research and development across all fields of life sciences.

Background

Importance of Protein Function Improvement and Multi-Objective Optimization

Proteins play a central role in governing biological phenomena at the molecular level. At the same time, they serve as fundamental tools in medical and biological research and are positioned as key components in the emerging fields of synthetic biology and biomanufacturing. Therefore, understanding protein function and engineering proteins for specific purposes represents an extremely important research challenge spanning from basic science to applied development.

Particularly in practical applications, it is necessary to simultaneously optimize multiple properties rather than a single characteristic, such as enzymatic activity, substrate specificity, and stability. For example, in industrially utilized enzymes, achieving both stability under high-temperature processes and high enzymatic activity directly contributes to the reduction of production costs. In therapeutic antibodies, the combination of high binding affinity to target antigens and thermodynamic stability required for long-term storage and predictable in vivo behavior is essential for ensuring therapeutic efficacy and quality. Thus, multi-objective optimization of proteins—improving multiple properties simultaneously—is indispensable for creating proteins suited to specific purposes.
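One common way to make "improving multiple properties simultaneously" concrete is Pareto dominance: a variant is kept if no other variant is at least as good on every property and strictly better on one. The sketch below illustrates that idea only; it is not taken from the LEAPS implementation, and the variant names and (activity, thermostability) scores are hypothetical.

```python
from typing import Dict, List, Tuple

Scores = Tuple[float, ...]  # one value per property; higher is better


def dominates(a: Scores, b: Scores) -> bool:
    """True if a is no worse than b on every property and better on at least one."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def pareto_front(variants: Dict[str, Scores]) -> List[str]:
    """Names of variants that are not dominated by any other variant."""
    return [
        name
        for name, scores in variants.items()
        if not any(
            dominates(other, scores)
            for other_name, other in variants.items()
            if other_name != name
        )
    ]


# Hypothetical (relative activity, relative thermostability) measurements.
variants = {
    "wild_type": (1.0, 1.0),
    "variant_A": (1.8, 0.7),  # most active, but less stable
    "variant_B": (1.2, 1.4),  # better than wild type on both properties
    "variant_C": (0.9, 0.9),
    "variant_D": (1.1, 1.3),
}

print(pareto_front(variants))  # ['variant_A', 'variant_B']
```

Here variant_A and variant_B represent different trade-offs, and neither can be called better than the other without deciding how much each property matters for the intended application.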

Challenges in Existing Protein Improvement Methods

However, achieving efficient multi-objective optimization remains difficult with conventional protein engineering approaches. The major existing methods face the following challenges:

Directed Evolution: This method obtains proteins with desired functions through repeated random mutagenesis and selection (screening). However, library construction and evaluation require enormous numbers of experiments, consuming substantial time and cost. Furthermore, reliance on sequential mutagenesis, such as single amino acid substitutions, makes it difficult to explore sequence spaces where multiple mutations cooperatively express function (epistatic interactions), leading to entrapment in local optima.
Rational Design: This approach designs amino acid sequences based on protein structure information and knowledge of functional mechanisms. Its application is restricted because it requires specialized knowledge and structural information. Moreover, designing a sequence that simultaneously accounts for the factors influencing multiple properties is a significant challenge.

To address these challenges, it is necessary to develop new methodologies for efficient and comprehensive multi-objective optimization.

The Barrier of Vast Sequence Space

A fundamental difficulty in protein improvement lies in the astronomically large sequence space to be explored. Even for a tripeptide, with 20 possible amino acids at each position, the number of combinations reaches $20^3 = 8{,}000$. For a protein composed of 250 amino acid residues, the total number of theoretically possible sequences reaches $20^{250} \approx 1.81 \times 10^{325}$. This number far exceeds even the number of atoms in the observable universe (commonly estimated at around $10^{80}$), making exhaustive exploration of all sequences impossible.
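The figures above can be checked with a few lines of Python. This is a minimal sketch; the function names are illustrative and not part of LEAPS.

```python
import math

ALPHABET_SIZE = 20  # the 20 standard amino acids


def sequence_space_size(length: int) -> int:
    """Exact number of distinct amino acid sequences of the given length."""
    return ALPHABET_SIZE ** length


def scientific_notation(length: int):
    """Return (mantissa, exponent) with 20**length ≈ mantissa × 10**exponent.

    Works in log space so it stays usable for very long sequences.
    """
    log10_size = length * math.log10(ALPHABET_SIZE)
    exponent = math.floor(log10_size)
    mantissa = 10 ** (log10_size - exponent)
    return mantissa, exponent


print(sequence_space_size(3))  # 8000

mantissa, exponent = scientific_notation(250)
print(f"{mantissa:.2f} x 10^{exponent}")  # 1.81 x 10^325
```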

Machine Learning in Protein Engineering

As an approach to efficiently explore this vast sequence space, machine learning applications have recently attracted significant attention. Machine learning models are expected to predict the function of unknown sequences or generate sequences with desired functions by learning complex relationships between amino acid sequences and functions from data. This capability holds great potential for achieving multi-objective optimization of proteins. However, many existing models still require large-scale experimental datasets for high-accuracy prediction and design. This “high data requirement” presents a new barrier to entry for student teams and small-scale laboratories attempting to utilize machine learning. Therefore, development of new machine learning methods that can effectively learn from small datasets and enable high-accuracy multi-objective optimization is necessary.

Our Solution: The LEAPS Method

Utilization of Protein Language Models (pLM)

To address this challenge, we focused on Protein Language Models (pLMs). pLMs are based on the same principles as the Large Language Models (LLMs) behind tools such as ChatGPT and Gemini. Just as LLMs learn word occurrence patterns and context from vast amounts of text data to intrinsically acquire natural language grammar, pLMs treat protein amino acid sequences as “language.” Proteins are polymers (corresponding to sentences) in which 20 types of amino acids (corresponding to words) are linked with a clear direction from N-terminus to C-terminus. By learning from billions of known amino acid sequences, pLMs acquire a universal “grammar” for protein viability: the rules governing amino acid combinations and inter-residue interactions.

Fig. 1: Similarities between large language models and protein language models

This “grammatical understanding” is the key to enabling multi-objective optimization. Because pLMs evaluate and generate amino acid sequences considering the context of the entire protein, they can capture not only individual residue functions but also complex long-range interactions between residues (epistasis). This enables jumps (large changes) to functionally superior sequence regions that cannot be reached through simple combinations of point mutations. In other words, pLMs enable exploration of discontinuous sequence spaces in silico, which is difficult with directed evolution methods, bringing new possibilities to protein design.

Our Method: The LEAPS Workflow

We developed “LEAPS,” a unique machine learning method integrating predictive and generative models to achieve multi-objective protein optimization from small datasets. LEAPS performs “in silico-complete” protein improvement through the following workflow:

1. Generate novel sequences using a generative model
2. Screen the output sequences for high-function variants using a predictive model
3. Train the generative model on the selected high-function sequences
4. Return to step 1, where the generative model generates high-function sequences

Through this iterative optimization cycle, LEAPS efficiently explores the vast sequence space to identify sequence candidates that simultaneously satisfy multiple properties. This method challenges the conventional wisdom that “high-performance models cannot be built without large-scale data,” enabling practical protein improvement from minimal data. This capacity to adapt to small datasets supports challenging applications such as improving poorly characterized proteins and creating novel functions, demonstrating the feasibility of a more universal protein design framework that does not depend on data abundance.
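The generate, screen, retrain cycle described above can be illustrated with a deliberately simplified toy. This sketch rests on several assumptions and is not the LEAPS implementation: the "generative model" is a position-wise residue frequency table rather than a pLM, the "predictive model" is a made-up scoring function (`predict`, with a hypothetical `TARGET_MOTIF`), and the two predicted properties are combined by a plain sum.

```python
import random
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
LENGTH = 12
HYDROPHOBIC = set("AILMFVW")
TARGET_MOTIF = "MKTAYIAKQRQI"  # hypothetical; stands in for an unknown optimum


def predict(seq):
    """Toy predictive model returning (activity, stability) in [0, 1]."""
    activity = sum(a == b for a, b in zip(seq, TARGET_MOTIF)) / LENGTH
    stability = sum(a in HYDROPHOBIC for a in seq) / LENGTH
    return activity, stability


def combined(seq):
    """Collapse both properties into one score (equal weights: an arbitrary choice)."""
    return sum(predict(seq))


def fit(seqs, pseudocount=1.0):
    """'Train' the toy generative model: per-position residue counts plus a pseudocount."""
    model = []
    for pos in range(LENGTH):
        counts = Counter(s[pos] for s in seqs)
        model.append({a: counts[a] + pseudocount for a in AMINO_ACIDS})
    return model


def sample(model, rng):
    """Draw one sequence, sampling each position independently."""
    return "".join(
        rng.choices(AMINO_ACIDS, weights=[column[a] for a in AMINO_ACIDS])[0]
        for column in model
    )


def leaps_like_loop(rounds=10, n_generate=200, n_keep=20, seed=0):
    rng = random.Random(seed)
    model = fit([])  # no training data yet, so sampling starts uniform
    best_seq, history = None, []
    for _ in range(rounds):
        candidates = [sample(model, rng) for _ in range(n_generate)]   # step 1
        top = sorted(candidates, key=combined, reverse=True)[:n_keep]  # step 2
        model = fit(top)                                               # step 3
        if best_seq is None or combined(top[0]) > combined(best_seq):
            best_seq = top[0]
        history.append(combined(best_seq))                             # step 4: repeat
    return best_seq, history


best, history = leaps_like_loop()
print(best, predict(best))
```

Because the best sequence seen so far is carried across rounds, `history` never decreases. In a real setting, the scoring function would be a predictor trained on the available experimental measurements, and retraining would update a far more expressive generative model.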

Practical Application

A particularly important application target of this research is resource-constrained research environments, including student teams participating in iGEM. Many iGEM students face constraints in resources such as funding, time, and experimental facilities. Consequently, even when they wish to tackle the attractive theme of protein function improvement, constructing the necessary large-scale datasets themselves is difficult, and they are often forced to abandon their ideas.

LEAPS has the potential to break through this situation. Because LEAPS can guide practical-level functional improvements from as few as 40 experimental data points, even student teams can undertake advanced projects such as designing and improving original enzymes or antibodies. This dramatically increases freedom in research theme selection and leads to more challenging and impactful research.

Providing young researchers who will lead the next generation with an environment where they can easily validate their ideas directly impacts the future of science and technology. LEAPS enables realization of innovative ideas that might have been buried due to resource constraints. LEAPS aims to serve not merely as a single technical tool but as a platform that liberates research possibilities.

Software: “LEAPS-Software”

To deliver the benefits of LEAPS broadly to the research community, we developed “LEAPS-Software,” a web application implementing the LEAPS method, and are preparing for its public release. This enables wet-lab researchers without programming or machine learning expertise to improve their proteins through an intuitive interface.

Safety & Security

When providing powerful protein design tools like LEAPS as an open platform, it is necessary to carefully consider potential risks and dual-use concerns. Technology that enables anyone to easily improve proteins carries risks of misuse by malicious actors. For example, it could be used to enhance the functions of proteins harmful to human health or the environment, such as toxin proteins, allergens, or viral receptor-binding domains. Furthermore, we must consider the possibility that improved proteins might unintentionally acquire unexpected toxicity or allergenicity (unintended gain-of-function).

To mitigate these risks, we implemented safety protocols in LEAPS-Software. This system automatically cross-references sequences output by the model against databases of known harmful protein sequences and issues alerts when sequence similarity exceeds a certain threshold. Technological advancement and safety assurance are two sides of the same coin, and we will promote responsible research practices.
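The screening step described above might look roughly like the following sketch. Everything specific here is an assumption made for illustration: the database entries are made-up placeholder strings (not real harmful sequences), the threshold value of 0.8 is arbitrary, and Python's `difflib.SequenceMatcher` stands in for whatever similarity measure LEAPS-Software actually uses. A production screen would more likely rely on a dedicated sequence alignment tool and curated databases.

```python
from difflib import SequenceMatcher

# Placeholder entries only; a real deployment would load a curated database.
FLAGGED_SEQUENCES = {
    "placeholder_entry_1": "ACDEFGHIKLMNPQRSTVYACDEF",
    "placeholder_entry_2": "KKKKLLLLEEEEDDDDRRRR",
}

SIMILARITY_THRESHOLD = 0.8  # assumed value; the actual threshold is not stated


def similarity(a: str, b: str) -> float:
    """Crude similarity in [0, 1] based on matching blocks (not a true alignment score)."""
    return SequenceMatcher(None, a, b, autojunk=False).ratio()


def screen(candidate, database=FLAGGED_SEQUENCES, threshold=SIMILARITY_THRESHOLD):
    """Return (entry name, similarity) for every database entry exceeding the threshold."""
    alerts = []
    for name, reference in database.items():
        score = similarity(candidate, reference)
        if score > threshold:
            alerts.append((name, round(score, 3)))
    return alerts


# A generated sequence differing from a flagged entry by one residue raises an alert.
print(screen("ACDEFGHIKWMNPQRSTVYACDEF"))
# An unrelated sequence passes without alerts.
print(screen("WWWWWWWWWWWW"))  # []
```

An empty result means only that no listed entry exceeded the threshold, so the outcome depends heavily on database coverage and the threshold chosen.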

Future Vision

Our LEAPS model currently utilizes existing protein language models. Therefore, advances in the pLM field directly translate to functional enhancements of LEAPS. In the future, when more advanced next-generation pLMs are developed, integrating them into the LEAPS framework is expected to yield improved prediction accuracy, generation of sequences with higher novelty, and dramatic improvements in search efficiency for multi-objective optimization. Thus, LEAPS is envisioned not as a static tool completed upon initial development but as a dynamic platform that continuously improves performance synergistically with AI technology evolution.

We strongly hope that LEAPS will contribute to raising the standard of AI-driven research in the iGEM community. By demonstrating successful cases of overcoming resource barriers faced by student teams through AI capabilities, we aim to reduce psychological and technical barriers to introducing computational methods into wet-lab experiments. If this project can serve as a catalyst for future iGEM teams to regard AI as part of their standard toolkit and contribute to further expanding the possibilities of synthetic biology, there would be no greater satisfaction.

The repository used to create this website is available at gitlab.igem.org/2025/tsukuba.