E n g i n e e r i n g

Project

During the conceptual phase of LEAPS, we recognized the significant potential of this system.

The main features of LEAPS can be summarized in three key points:

  • Improvement from limited data: While conventional methods require large amounts of experimental data, LEAPS is expected to enable protein improvement even from limited datasets.

  • Protein improvement with a minimal wet lab experiments: Through iterative prediction and generation processes, protein improvement can be achieved with a minimal wet lab experiments. This is expected to reduce experimental costs and time.

  • Multi-objective optimization: By simultaneously optimizing multiple properties, this approach may lead to more practical protein design.

To harness the potential of LEAPS, the Engineering phase proceeded with system construction through the following approach.

First Stage: Proof of Concept (Cycles 1-2)

In conventional in silico protein improvement using pLMs, methods employing either prediction models or generation models independently have been mainstream. In contrast, LEAPS connects prediction and generation models and uses them iteratively. Therefore, the initial goal was to demonstrate that the complete design flow of prediction, generation, and evaluation functions properly. At this stage, we focused on conceptually demonstrating that protein sequences with targeted properties can actually be designed through a single cycle of the prediction, generation, and evaluation process.

Second Stage: Improvement of Individual Modules (Cycles 3-7)

After confirming the basic operation of the system through proof of concept, we worked on improving the performance of the prediction and generation models. LEAPS is highly modularized, allowing independent modification and replacement of prediction and generation models. We validated each module under various experimental conditions to optimize them, establishing a foundation that enables higher-quality sequence generation and accurate evaluation in subsequent iterative optimization processes.

Third Stage: Achievement of Optimization through Iterative Prediction & Generation (Cycles 8, 10)

After sufficiently improving the performance of individual modules, the project progressed to sequence improvement through iterative generation and evaluation, verifying that generated sequences gradually converge toward improvement of the targeted multi-objective functions. In each iteration, we monitored the distribution of evaluation values and sequence diversity, making adjustments to ensure proper functioning of the optimization process.

Through this three-tiered approach, we were able to systematically achieve establishment of basic system functions, performance improvement of individual components, and acquisition of practical optimization capabilities. The following sections detail the specific experiments conducted at each stage and the various improvements made to enhance model performance.

Cycle 1: Initial Construction of Generation and Prediction Models

Cycle 1.1: Sequence Generation with ProGen2

Design:

Our model improves protein sequences to possess desired properties through iterative generation and prediction. For this purpose, the generation model must be capable of producing functional protein sequences. In this cycle, we selected an appropriate protein for proof of concept and determined the construction strategy for the generation model.

Through an interview with Mr. Tsuzuki, an expert in experimental design optimization, we received advice that the protein selected for proof of concept improvement should possess the following properties:

  1. Length suitable for total synthesis as DNA sequences: From the perspectives of synthesis cost and experimental efficiency, moderate sequence length is desirable

  2. Expressible in E. coli: Enables easy establishment of experimental systems and rapid verification

  3. Easy to assay: Has an established measurement system capable of quantitatively evaluating improvement effects

As a protein possessing all these properties, we selected green fluorescent protein (GFP). GFP has a sequence length of approximately 238 amino acids [1], which falls within the range of total synthesis, has an established expression system in E. coli, and has a clear and quantitative evaluation metric in fluorescence intensity. Based on these characteristics, we determined that GFP is an optimal target for proof of concept of our iterative improvement approach.

From interviews with machine learning experts, it became clear that building large-scale protein sequence generation models from scratch ourselves would not be realistic. Considering the computational resources, data volume, and development period required for training, utilizing existing pre-trained models would be more efficient.

Therefore, in subsequent cycles, we adopted a strategy of fine-tuning pre-trained models for use. We selected ProGen2 [2] as the generation model.

Build:

We adopted ProGen2 [2] as the generation model for the following three reasons:

・Pre-trained on a protein sequence database of hundreds of millions, enabling generation of diverse and structurally natural sequences without excessive dependence on known proteins

・Capable of fine-tuning for specific protein families

・Already utilized in prior research [3], demonstrating the ability to generate functional sequences

In this study, we adopted the ProGen2-small model among multiple available model sizes in ProGen2. This choice was made to efficiently conduct training from the perspective of computational resources, as fine-tuning needs to be performed repeatedly.

We obtained approximately 30,000 sequences from prior research [3] and used them as training data. For model fine-tuning, we adopted full fine-tuning that updates all parameters.

Additionally, generated sequences, Aequorea victoria-derived GFP sequences [1], and sequences randomly obtained from UniProt [4] were converted into embeddings from the protein language model ESM2 [5], dimensionally reduced by PCA, and plotted.

Test:

Generated sequences showed high homology with avGFP, greatly exceeding 30%, which is generally considered the criterion for homologs.

Unknown-10.png

fig. 1. Principal component analysis of various protein sequences using ESM-2 embeddings

Generated amino acid sequences (blue), sequences randomly obtained from UniProt (red), and Aequorea victoria-derived GFP sequences (green) were converted into features by the protein language model ESM-2 and visualized by principal component analysis (PCA).

Generated sequences occupied a feature space in the vicinity of the avGFP sequence and were distributed in a region clearly distinct from sequences randomly obtained from UniProt.

Learn:

Sequence homology strongly indicates that the generated sequences are homologs of avGFP. Additionally, protein language models can capture the “meaning” of sequences, and embeddings of input sequences tend to cluster together in latent space when they belong to the same family or possess the same function. Considering this along with the above plots, there is a high possibility that the generated sequences are functionally similar to GFP. From these two results, we conclude that ProGen2 was able to generate biologically meaningful GFP-like proteins.

Cycle 1.2: Brightness Prediction of GFP with Prediction Model ProtT5

Design:

The framework we ultimately aim for requires a model that predicts protein function from sequence alone. In this cycle, to prove that the prediction model can predict protein function and select sequences with desired properties, we aimed to construct a prediction model for GFP brightness using the protein language model ProtT5.

Build:

We obtained brightness data for approximately 50,000 sequences from prior research and assigned labels of Bright to those above a certain brightness threshold and Dim to those below. We divided training and test data in a 9:1 ratio and performed binary classification fine-tuning using LoRA with ProtT5-XL-uniref50 [6].

Test:

Unknown-11.png

fig. 2. Confusion matrix of actual labels and model-predicted labels

In the confusion matrix, True Positive (TP) represents the number of sequences that are actually Bright and predicted as Bright by the model, False Positive (FP) represents the number of sequences that are actually Dim but predicted as Bright by the model, False Negative (FN) represents the number of sequences that are actually Bright but predicted as Dim by the model, and True Negative (TN) represents the number of sequences that are actually Dim and predicted as Dim by the model.

From these results, the model’s accuracy was calculated to be approximately 0.843, precision approximately 0.731, recall approximately 0.909, and F1 score approximately 0.810. The particularly high recall indicates that approximately 91% of sequences that are actually Bright were correctly detected.

Learn:

In this cycle, fine-tuning using ProtT5-XL-uniref50 demonstrated that brightness can be predicted with high accuracy from GFP sequence information alone. The accuracy of approximately 84% and recall of 91% indicate that protein language models can effectively capture functional features within sequences.

It was confirmed that this model can efficiently screen functional protein sequences while maintaining high recall. This proved that sequence selection by prediction models is practically feasible in the framework we aim for, establishing a foundation for the next cycle.

However, there was also the challenge of requiring 50,000 sequences for training, and improvement was desired.

Cycle 1.3: GFP Design Using Generation Model ProGen2 and Prediction Model ProtT5

Design:

In this cycle, we aimed to design novel fluorescent protein sequences similar to existing GFP sequences through sequence generation by the generation model and sequence selection by the prediction model. Additionally, iterative optimization was not performed; the objective was to conceptually demonstrate the possibility of obtaining fluorescent GFP through the complete design flow from generation to evaluation.

Build:

Using the generation model ProGen2 trained on the DMS dataset, we generated novel sequence candidates with existing GFP sequences as input (see Cycle 1.1). The generated sequence population was input into the prediction model to predict fluorescence intensity as a binary classification of bright or dim (see Cycle 1.2), and candidate sequences predicted to possess fluorescence were selected.

For candidate sequences selected in the design phase, we conducted wet experimental verification. Selected sequences were codon-optimized and synthesized using IDT’s gene synthesis service. Synthesized genes were cloned into the NcoI site of the pET28a(+) expression vector. The expression vector was transformed into BL21(DE3), expressed in liquid LB medium by IPTG induction. E. coli were lysed, fluorescent proteins were extracted, and spectra and brightness were measured with a luminometer.

Test:

Enginering_Fig. 4.jpg

fig. 3. Fluorescent E. coli

(A) is the extracted protein from T01_02 sequence, (B) is the E. coli colony expressing the T01_04 sequence.

As a result of fluorescence measurement of the three selected candidate sequences, clear fluorescence was confirmed in two sequences. The relative fluorescence intensity per protein concentration for each sequence was as follows (normalized with wild-type GFP as 100%).

Fluorescence was confirmed in 2 out of 3 sequences.

Candidate sequence 1 (3 residue substitutions): No fluorescence

Candidate sequence 2 (4 residue substitutions): Fluorescence intensity 317% (fluorescence confirmed)

Candidate sequence 3 (5 residue substitutions): Fluorescence intensity 54% (fluorescence confirmed)

Among the three sequences classified as bright by the prediction model, fluorescence was actually observed in two sequences. For the one sequence where fluorescence was not confirmed, problems with folding or hydrophobic packing surrounding the chromophore are suggested.

Learn:

Five-residue variants of GFP theoretically exist in more than 15 quadrillion combinations, representing a vast sequence space. Prior research [7] has reported that when five or more mutations are introduced, variants with low fluorescence intensity become extremely common, making it exceedingly difficult to create variants with five or more residues that maintain function.

In this cycle, through a design approach combining the ProGen2 generation model and prediction model, we successfully synthesized a fluorescent five-residue variant in one of the three selected sequences. This demonstrates that functional sequences could be explored with only a small number of attempts from a vast sequence space (more than 15 quadrillion combinations).

In conventional random mutagenesis or screening methods, a large number of experiments would be necessary to find functional five-residue variants from the vast sequence space of 15 quadrillion combinations. However, this method successfully obtained five-residue variants, previously considered difficult to achieve, from a small number of candidate sequences by combining sequence generation by the generation model and preliminary screening by the prediction model.

The superiority of AI-leveraging approaches was demonstrated in the domain of multi-mutant design, which has been conventionally considered difficult.

Cycle 2: Single-Objective Optimization in Fluorescent Proteins

Cycle 2.1: Utilizing ESM2+LASSO Regression for Prediction Model

Design:

In Cycle 1, we achieved very high prediction accuracy with a recall of approximately 0.909 using the prediction model. However, this model required a dataset of approximately 50,000 sequence-fluorescence brightness pairs for training. In an interview with Mr. Nukui from the biopharmaceutical venture BiophenomX, we were advised that obtaining such a large amount of assay data for typical target proteins for improvement is realistically difficult. Specifically, the amount of data that can actually be collected is approximately 40 sequences. Based on this, we aimed to develop a model capable of predicting with practical accuracy even when training data is limited to approximately 40 sequences.

Additionally, in an interview with Mr. Tsuzuki, an expert in experimental optimization, we received a suggestion that improving or modifying the color of fluorescence, specifically the fluorescence wavelength, would be comprehensible and interesting. Fluorescence wavelength is an easy-to-measure and objective metric, and changes are visually apparent, making it an appropriate target for evaluating the prediction accuracy of models.

Therefore, in this cycle, we set the goal of constructing a prediction model with practical accuracy from fluorescence maximum wavelength data for 40 sequences. In such limited-data environments, binary classification models cannot achieve high accuracy. Therefore, referencing prior research, we constructed a prediction model using LASSO regression, which is capable of regression prediction without overfitting even with limited data.

Build:

First, using fluorescent protein sequences as input, we obtained high-dimensional embedding vectors (1280 dimensions × n pieces of residues) using the protein language model ESM2. The obtained embeddings reflect evolutionary and structural features of sequences and are considered capable of representing not only sequence similarity but also potential functional differences. The referenced paper used a protein language model called UniRep, but in this study we adopted ESM2, which has been reported to possess superior representational capacity having been trained on larger-scale datasets [5]. ESM2 is pre-trained on hundreds of millions of protein sequences and was expected to capture richer evolutionary information compared to UniRep. Subsequently, using this vector as explanatory variables and the corresponding maximum fluorescence wavelength as the objective variable, we trained a LASSO regression model.

For training data, we utilized a dataset of 40 sequences and maximum fluorescence wavelengths obtained from FPbase [9], and test data was similarly obtained from FPbase.

Appendix: What is embedding and training

Embedding

While protein amino acid sequences concisely represent protein structure and function, it is difficult for computers to understand amino acid sequences directly [8]. Furthermore, because the amount of information contained in amino acid sequences is very large, using them directly in machine learning is also difficult [8].

To solve these problems, an operation called embedding is performed in machine learning. In this case, embedding refers to the task of converting protein amino acid sequences into multi-dimensional arrays that computers can easily understand using ESM2 [10].

Considering inputting an amino acid sequence with n pieces of residues into ESM2, ESM2 outputs a 1280-dimensional vector for each residue. This vector represents the interrelationship between the target residue and other residues, and even for the same residue, different vectors are output if the position within the amino acid sequence differs.

For n-residue amino acids, the aforementioned vectors are output for each residue, so a matrix of size 1280×n, [x11x21xn1][x12x22xn2][x1,1280x2,1280xn,1280]\begin{bmatrix} x_{11} \\ x_{21} \\ \vdots\\x_{n1} \end{bmatrix}\begin{bmatrix} x_{12} \\ x_{22} \\ \vdots\\x_{n2} \end{bmatrix}\cdots\begin{bmatrix} x_{1,1280} \\ x_{2,1280} \\ \vdots\\x_{n,1280} \end{bmatrix}, is output.

Training

A prediction model is created based on sequences output by ESM2.

The equation below is used for prediction in LASSO regression [11].

y^=b0+b11x11+b21x21bn,1280xn,1280\hat{y}=b_{0}+b_{11}x_{11}+b_{21}x_{21} \cdots b_{n,1280}x_{n,1280}

y^\hat{y} is called the objective variable and in this project represents scalar values such as GFP brightness or fluorescence wavelength.

Next, x11x_{11} to xn,1280x_{n,1280} are called explanatory variables, and each component of the matrix output by embedding is used. The objective variable y^\hat{y} is obtained by multiplying each explanatory variable by constants b11b_{11} to bn,1280b_{n,1280} and adding constant b0b_{0}.

The training dataset consists of paired protein amino acid sequence information for 40 sequences and the functional values (brightness, fluorescence wavelength, etc.) of each protein. By performing training, we can determine b11b_{11} to bn,1280b_{n,1280} and b0b_{0} such that the error between the objective variable y^\hat{y} and the true functional value yy of that protein is minimized as much as possible, and subsequently predict functions for amino acid sequences not included in the training dataset.

elhynt2utar73sz9lu76.png

fig. 4. Fluorescence wavelength prediction pipeline combining ESM2 embeddings and LASSO regression

Test:

fig5.png

fig. 5. Correlation between predicted and measured values for maximum fluorescence wavelength of fluorescent proteins

Scatter plot of wavelengths predicted by LASSO regression model using ESM2 embedding vectors (horizontal axis) and measured wavelengths (vertical axis). The coefficient of determination R2=0.8R^2=0.8 indicates good prediction accuracy.

Learn:

Fluorescence wavelength prediction by the LASSO regression model yielded good prediction accuracy with a coefficient of determination of R2=0.8R^2=0.8 (fig. 5). This result demonstrates that practical accuracy in fluorescence wavelength prediction is possible by combining embedding representations from ESM2 with LASSO regression, even with training data limited to only 40 sequences.

Through this cycle, the following important insights were obtained. First, it was confirmed that embedding vectors obtained from the protein language model ESM2 effectively capture evolutionary and structural features of sequences and function as useful features even in limited-data environments. Second, it was demonstrated that automatic feature selection by LASSO regression is effective in suppressing overfitting with limited sample sizes. This represents a realistic solution to the practical constraint faced by actual biopharmaceutical ventures that “obtaining large amounts of assay data is difficult.”

Additionally, by setting fluorescence wavelength, a continuous value, as the prediction target, we became able to quantitatively evaluate subtle differences in properties that could not be captured by the binary classification model in Cycle 1. This enabled not merely binary judgment but specific predictions of “how much wavelength shift can be expected,” allowing the prediction model to provide more useful information for decision-making in protein engineering.

Cycle 2.2: Design of Fluorescent Proteins with Various Fluorescence Wavelengths

Design:

In this cycle, we aimed to design fluorescent proteins that emit specified fluorescence wavelengths by selecting generated sequences based on predicted values obtained from the prediction model. Here too, iterative optimization was not performed; the objective was to conceptually demonstrate the possibility of obtaining fluorescent proteins with targeted properties through the complete design flow from generation to evaluation.

Build:

The prediction model was the ESM2+LASSO regression model constructed in Cycle 2.1, and the training data utilized a dataset of 40 sequences and maximum fluorescence wavelengths obtained from FPbase.

The generation model was ProGen2, and while in Cycle 1 we used avGFP DMS datasets for training data, in this case we utilized 700 fluorescent protein sequences with various fluorescence wavelengths derived from FPbase. These sequences covered a broad fluorescence spectrum from blue to red regions.

The generated sequence population was input into the prediction model to obtain predicted maximum fluorescence wavelengths. Based on the predicted maximum fluorescence wavelengths, sequences predicted to fluoresce blue, green, or yellow were selected.

For candidate sequences predicted to fluoresce blue, green, or yellow, six sequences each, wet experimental verification was conducted. Selected sequences were codon-optimized and synthesized using IDT’s gene synthesis service. Synthesized genes were cloned into the NcoI site of the pET28a(+) expression vector. The expression vector was transformed into BL21(DE3), expressed in liquid LB medium by IPTG induction. E. coli were lysed, fluorescent proteins were extracted, and spectra and brightness were measured with a luminometer.

Test:

Representative sequences for each of the three colors were irradiated with black light and the fluorescence was photographed (fig. 6). Additionally, fluorescence spectrum data were acquired for each sample using a fluorescence spectrophotometer, yielding measured maximum fluorescence wavelengths. The fluorescence spectra for one representative sample each that fluoresced blue, green, or yellow among those samples are shown below (fig. 7). Furthermore, for samples from which fluorescence spectrum data could be acquired, a correlation plot of predicted maximum fluorescence wavelength versus measured maximum fluorescence wavelength was created, and although the sample size was small, strong correlation was observed (fig. 8).

opzcpuqgowhgfrft6m5z.jpg

fig. 6. Extracted proteins fluorescing under ultraviolet light

Representative sequences for each of the three colors were irradiated with black light and the fluorescence was photographed.

(a)

20251008_13.png

(b)

20251008_14.png

(c)

20251008_15.png

fig. 7. Fluorescence spectra of representative sequences for blue (a), green (b), and yellow (c) fluorescent proteins

Solid lines indicate fluorescence spectra.

20251008_7.png

fig. 8. Correlation plot of predicted maximum fluorescence wavelength (λem-pred) and measured maximum fluorescence wavelength (λem-obs)

Dashed line represents perfect agreement line (y = x). Color coding indicates targeted fluorescence color (blue/green/yellow), plotting only sequences that exhibited fluorescence.

Learn:

This cycle demonstrated that design of fluorescent proteins that emit specified fluorescence wavelengths is possible through a complete design flow combining generation and prediction models.

ProGen2 trained on diverse fluorescent protein sequences from FPbase was able to generate sequence populations covering a wide fluorescence wavelength range. When fluorescence wavelength prediction was performed on these generated sequences using the ESM2+LASSO regression model and sequences corresponding to target wavelength ranges (blue, green, yellow) were selected, multiple sequences that actually fluoresced in the intended colors were obtained (fig. 6). Particularly, fluorescence spectrum measurements of representative sequences for each color confirmed clear fluorescence peaks (fig. 7), and strong correlation was observed between predicted maximum fluorescence wavelength and measured maximum fluorescence wavelength not only for test data but also for designed sequences (fig. 8). Although the sample size is limited, this result supports the effectiveness of the prediction model. In the future, to demonstrate the versatility of this method, it is desired to clarify that properties other than fluorescence wavelength can also be specified.

On the other hand, notable differences were observed in success rates by color. For yellow fluorescent proteins, 3 out of 6 designed sequences fluoresced in the target wavelength range, whereas for blue and green, only 1 out of 6 sequences each exhibited fluorescence. This result reveals that while the prediction model’s fluorescence wavelength prediction accuracy demonstrates certain effectiveness, there is room for improvement in its ability to select sequences that actually express and fold as functional fluorescent proteins.

Cycle 2.3: Design of Fluorescent Proteins with Targeted Brightness

Design:

In this cycle, we aimed to design highly bright fluorescent proteins by selecting generated sequences based on brightness prediction values obtained from the prediction model. Similar to maximum fluorescence wavelength prediction, we constructed a brightness prediction model combining ESM2 and LASSO from a dataset of 40 sequences and brightness, but favorable results were not obtained with test data (see fig. 9).

fig9.png

fig. 9. Correlation between predicted and measured values for fluorescent protein brightness

Scatter plot of brightness predicted by LASSO regression model using ESM2 embedding vectors (horizontal axis) and measured brightness obtained from FPbase (vertical axis). The coefficient of determination R2=0.29R^2=0.29 indicates no strong correlation between predictions and measured values.

However, since prior research [12] predicted brightness using similar methods and improved brightness by combining with random mutations, we decided to conduct this cycle. Here too, iterative optimization was not performed; the objective was to conceptually demonstrate the possibility of obtaining fluorescent proteins with targeted properties through the complete design flow from generation to evaluation.

Build:

The prediction model constructed an ESM2+LASSO regression model as in Cycle 2.2. Training data utilized sequences and brightness datasets obtained from FPbase [9]. The generation model was ProGen2, and training data utilized 700 fluorescent protein sequences with various fluorescence wavelengths from FPbase, as in Cycle 2.2. The generated sequence population was input into the prediction model to obtain predicted brightness. Top 6 and worst 6 sequences by predicted brightness were selected and wet experimental verification was conducted. Selected sequences were codon-optimized and synthesized using IDT’s gene synthesis service. Synthesized genes were cloned into the NcoI site of the pET28a(+) expression vector. The expression vector was transformed into BL21(DE3), expressed in liquid LB medium by IPTG induction. E. coli were lysed, fluorescent proteins were extracted, and spectra and brightness were measured with a luminometer.

Test:

table 1. Table showing predicted brightness and actually measured brightness

Sequence NumberPredicted brightnessMeasured relative fluorescence intensityλEm Max
T02_1975.115180860.73196138346.5
T02_2072.038562620.8719466337.5
T02_2171.6504750417.6066062337
T02_2362.600405110.17434677522.5
T02_25-23.729723170.171889755511.5
T02_26-16.599433730.108705103509.5
T02_28-10.878384770.427399856474
T02_29-10.600208590.376423762474.5

The low-predicted brightness group (Worst 6) all showed relative fluorescence intensity below 1 as predicted, indicating relatively accurate prediction of low-brightness sequences. However, many false positives (high predicted values but low actual measurements) were observed in high-brightness sequence predictions. This suggests that training data is limited and complex structural factors affecting brightness cannot be completely captured.

Learn:

These results revealed that the ESM2+LASSO regression model does not achieve sufficient performance for brightness prediction. To achieve practical accuracy with a limited dataset of 40 sequences, fundamental improvements in prediction model architecture and feature selection are considered necessary.

While this cycle had significance as “proof of concept” for a design flow combining generation and prediction models, it did not achieve establishment of a practical design method. Improvement of the prediction model is necessary.

Cycle 3: Improvement of Prediction Model Accuracy

Cycle 3.1: ESM2+Random Forest Regression for Prediction Model

Design:

ESM2+LASSO regression demonstrated high performance in the task of predicting maximum fluorescence wavelength (see figs. 5, 7). However, for brightness prediction, there was little correlation between predicted and measured values (see figs. 8, 9), which could not be considered practical.

The cause of this was thought to be that LASSO regression, being a linear model, could not fully capture the complex nonlinear relationships inherent in brightness. Maximum fluorescence wavelength is largely determined by relatively simple factors such as the amino acid types in the chromophore [13]. In contrast, brightness depends on diverse factors that are difficult to predict directly from sequence, including chromophore formation efficiency, protein folding, and quantum yield [13]. Therefore, we determined that a model capable of learning nonlinear interactions between features was necessary.

In this cycle, we adopted random forest regression as the prediction model. Random forest is an ensemble learning method based on decision trees, possessing high capability for learning nonlinear relationships and being relatively robust against overfitting [14]. Additionally, it is known to function effectively even with small datasets [14]. Here, we constructed a model that predicts brightness by random forest regression using embedding vectors from ESM2 as features.

Build:

As the prediction model, we constructed a random forest regression model with ESM2 embedding vectors as input. Specifically, we extracted representation vectors for each sequence from ESM2 and used embedding vectors of 1280 dimensions × n pieces of residues as features. Using scikit-learn’s RandomForestRegressor, we set the number of decision trees (n_estimators) to 1000 and adopted default values for other parameters (max_depth, min_samples_split, min_samples_leaf, etc.) to train the model. For training data, we utilized a dataset of 40 sequences and brightness obtained from FPbase, and test data was similarly obtained from FPbase. The prediction accuracy of the model was evaluated by the coefficient of determination (R²) on test data.

Test:

fig10.png

fig. 10. Scatter plot of predicted brightness (horizontal axis) and measured brightness (vertical axis) on test data

Blue circles indicate prediction results by LASSO regression (R² = 0.29), red circles indicate prediction results by random forest regression (R² = 0.39). The dashed line represents the diagonal line of ideal prediction (predicted value = measured value). Random forest regression showed improved prediction accuracy compared to LASSO regression.

Learn:

Random forest regression significantly improved brightness prediction accuracy compared to LASSO regression (see fig. 10). While LASSO regression showed almost no correlation between predicted and measured values, random forest regression improved the R² value, approaching a more practical level.

This improvement is attributed to random forest’s capability for learning nonlinear relationships. Brightness is a property determined by the complex interplay of multiple factors including chromophore formation efficiency, protein three-dimensional structure, and quantum yield [13]. Random forest, based on decision trees, is inferred to have flexibly learned these inter-feature interactions that linear models could not capture [15]. From the results of this cycle, it was concluded that models capable of capturing nonlinear relationships are necessary for predicting complex properties like brightness with high accuracy.

Although the introduction of random forest regression showed some improvement, the R² value still did not reach a practical level. This was thought to be due to issues with the dimensionality of ESM2 embedding vectors. The high-dimensional features of 1280 dimensions × n pieces of residues are excessive for the limited sample size of 40, potentially causing the so-called “curse of dimensionality” [16]. It was desired to further improve prediction accuracy by effectively reducing dimensionality while retaining information from the entire sequence.

Cycle 3.2: Max Pooling

Design:

The introduction of random forest regression in Cycle 3.1 enabled learning of nonlinear relationships. However, ESM2 embedding vectors are extremely high-dimensional features of 1280 dimensions × n pieces of residues, which were thought to potentially cause the “curse of dimensionality.”

The curse of dimensionality is a phenomenon where machine learning model performance deteriorates as data becomes sparsely distributed in high-dimensional space with increasing feature dimensionality. Particularly when sample size is limited (approximately 40 in this case), distances between samples become uniform in high-dimensional feature space, preventing models from learning meaningful patterns. While random forest is somewhat robust to high-dimensional data, when feature dimensionality greatly exceeds sample size, overfitting and degradation of generalization performance are likely to occur [AL].

The number of features of 1280 dimensions × n pieces of residues is clearly excessive for a sample size of 40. To solve this problem, a method to drastically reduce dimensionality while retaining information from the entire sequence was needed.

Therefore, in this cycle, we applied max pooling. Max pooling is an operation that takes the maximum value across all residue positions for each dimension (1280 dimensions), enabling aggregation into a fixed-length 1280-dimensional vector regardless of sequence length. This operation can extract the most prominently activated information in each feature dimension, and because dimensionality was drastically reduced (1280 dimensions × n pieces of residues → 1280 dimensions), the feature dimensionality ratio relative to 40 samples was improved, expected to mitigate the effects of the curse of dimensionality [17]. In particular, we aimed to improve prediction model performance by selectively extracting information from important amino acid residues affecting brightness and maximum excitation wavelength in fluorescent proteins (such as residues surrounding the chromophore) through max pooling.

Build:

As the prediction model, we constructed a random forest regression model with max-pooled ESM2 embedding vectors as input. Specifically, after extracting representation vectors (1280 dimensions × n pieces of residues) for each sequence from ESM2, we applied a max pooling operation that takes the maximum value across all residue positions for each dimension, aggregating into a fixed-length 1280-dimensional vector.

For training data, we used a brightness dataset of 40 sequences obtained from FPbase and prepared data following the same procedure as Cycle 3.1.

For model training, we used scikit-learn’s RandomForestRegressor and adopted the same hyperparameter settings as Cycle 3.1 (n_estimators=1000, others at default values). The prediction accuracy of the model was evaluated by the coefficient of determination (R²) on test data.

Test:

fig11.png

fig. 11. Scatter plot of predicted brightness (horizontal axis) and measured brightness (vertical axis) on test data

Red circles indicate prediction results by random forest regression (R² = 0.388), brown circles indicate prediction results by random forest regression + max pooling (R² = 0.434). The dashed line represents the diagonal line of ideal prediction (predicted value = measured value).

Learn:

The introduction of max pooling drastically reduced feature dimensionality (1280 dimensions × n pieces of residues → 1280 dimensions) and mitigated the effects of the curse of dimensionality. As a result, the random forest regression model showed improved prediction accuracy compared to LASSO regression (R² = X.XX). This is attributed to max pooling efficiently extracting important features from the entire sequence and improving the balance between features and sample size.

However, the R² value still remained at X.XX, leaving room for improvement. From the results of Cycles 3.1 and 3.2, it became clear that introducing flexible top models capable of capturing nonlinear relationships leads to improved prediction accuracy. On the other hand, while dimensionality reduction methods like max pooling have some improvement effect, it seems unlikely to expect further significant performance gains. As a future challenge toward further performance improvement, the introduction of top models with higher expressive power was identified.

Cycle 4: Extension of Prediction Model ESM2

Cycle 4.1: ESM2+Fully Connected Layers for Prediction Model

Design:

By changing the top model from LASSO regression to random forest regression capable of learning nonlinear relationships in Cycle 3.1, we were able to maintain some level of prediction accuracy even for difficult tasks like brightness prediction. However, room for improvement still remained, requiring the introduction of models with higher expressive power.

While random forest regression can learn nonlinear relationships, its learning mechanism is based on stepwise partitioning by an ensemble of decision trees. On the other hand, protein properties are thought to depend on complex interactions among multiple amino acid residues and hierarchical structural features. Neural networks were considered more suitable for learning such highly nonlinear and hierarchical feature representations.

Therefore, in this cycle, we introduced a neural network model with fully connected layers built on top of ESM2 embedding vectors. In fully connected layers, each element of the input is connected to each element of the output, and complex nonlinear mappings can be learned by combining linear transformations using weight matrices with activation functions [18].

By stacking multiple fully connected layers, the model can hierarchically extract features and represent more complex input-output relationships. Each layer nonlinearly transforms the output of the previous layer, enabling learning of higher-order inter-feature interactions that cannot be captured by simple linear models or random forests [18].

Specifically, we adopted a multi-layer structure that takes the 1280-dimensional vector after mean pooling as input and progressively reduces dimensions. The first layer reduces 1280 dimensions to 640 dimensions, and the second layer outputs a one-dimensional predicted value. ReLU activation functions were placed after each fully connected layer, introducing nonlinearity by converting negative values to zero.

Through this progressive dimension reduction, only the information truly necessary for the regression task can be selectively extracted from the high-dimensional features of 1280 dimensions and ultimately aggregated into a single predicted value. The weight matrices of fully connected layers are optimized through learning and are expected to hierarchically extract features important for prediction. Furthermore, because neural networks learn smooth and continuous mappings from input to output through optimization via gradient descent, improved interpolation performance within feature space and improved prediction accuracy for unknown sequences were expected.

Build:

As the prediction model, we constructed a neural network regression model with max-pooled ESM2 embedding vectors as input. Specifically, after extracting representation vectors (1280 dimensions × n pieces of residues) for each sequence from the pre-trained ESM2 model, we applied a max pooling operation weighted by attention masks and aggregated into a fixed-length 1280-dimensional vector. The parameters of the ESM2 model were frozen and used fixedly as a feature extractor.

The neural network structure sequentially arranged the following layers: as the first layer, a fully connected layer from 1280 to 640 dimensions and a ReLU activation function; as the second layer, a fully connected layer from 640 to 1 dimension. Only the parameters (weights and biases) of these fully connected layers were trained and optimized for the regression task.

For training data, we used a brightness dataset of 40 sequences obtained from FPbase and prepared data following the same procedure as Cycle 3.1. For model training, we used Transformers, adopting mean squared error (MSE) as the loss function and Adam (learning rate 0.001) as the optimization algorithm. Training was conducted for 100 epochs with early stopping and hyperparameter optimization using Optuna. The prediction accuracy of the model was evaluated by the coefficient of determination (R²) on test data.

Test:

fig12.png

fig. 12. Scatter plot of predicted brightness (horizontal axis) and measured brightness (vertical axis) on test data

Brown circles indicate prediction results by random forest regression + max pooling (R² = 0.434), green circles indicate prediction results by ESM2 + fully connected layers (R² = 0.51). The dashed line represents the diagonal line of ideal prediction (predicted value = measured value).

It became clear that ESM2+fully connected layers demonstrated superior prediction performance compared to random forest regression+max pooling.

Learn:

The introduction of a neural network model with fully connected layers built on top of ESM2 embedding vectors confirmed improved prediction accuracy compared to random forest regression+max pooling (R² = 0.434 → 0.51). This is attributed to multi-layer fully connected layers hierarchically extracting features and learning higher-order inter-feature interactions that random forests could not capture [17]. Particularly, it is inferred that through optimization via gradient descent, selectively extracting information truly necessary for brightness prediction from ESM2’s representation space and learning smooth and continuous mappings led to improved prediction accuracy for unknown sequences [5][17].

However, while neural networks possess high expressive power, they have the challenge of being prone to overfitting due to numerous parameters. Particularly in this study, with a limited sample size of approximately 40, it is necessary to learn numerous parameters including weights and biases of fully connected layers, making the risk of overfitting non-negligible [16].

Through the series of improvements from Cycles 2 to 4.1, it was demonstrated that prediction accuracy can be improved by progressively enhancing the expressive power of top models. In the future, we aim for further improvement in prediction accuracy by maintaining the high expressive power of neural networks while appropriately controlling overfitting.

Cycle 4.2: ESM2+Fully Connected Layers+Data Augmentation (Conservative Mutations) for Prediction Model

Design:

By introducing a neural network model with fully connected layers built on top of ESM2 embedding vectors in Cycle 4.1, we learned higher-order inter-feature interactions that random forest regression could not capture and achieved improved prediction accuracy. However, because neural networks possess high expressive power, they are prone to overfitting due to numerous parameters, and particularly with a limited sample size of approximately 40 as in this study, the risk of overfitting became a serious challenge.

To prevent overfitting and improve model generalization performance, it was considered effective to efficiently augment the training data itself in addition to introducing regularization techniques. However, simple random sequence mutations could potentially destroy protein structure and function, making them not biologically valid data augmentation.

Therefore, in this cycle, we introduced a data augmentation method through function-maintaining mutations utilizing the BLOSUM62 substitution matrix. BLOSUM62 is a matrix showing scores for evolutionarily conserved amino acid substitutions, based on substitution patterns actually observed in nature [19]. Substitutions with positive scores represent those that occur frequently during evolution, meaning substitutions that easily maintain protein function, while negative substitutions represent those that occur rarely and easily impair function.

This method realizes sequence mutations within a range that does not significantly impair protein function by allowing only substitutions to similar amino acids with positive or non-negative BLOSUM62 scores relative to the original amino acid. Specifically, sequences are sampled from existing training data, and amino acids at several positions in those sequences are substituted with other evolutionarily similar amino acids. Training data is augmented by assigning labels with small noise added to the original label values to sequences generated in this manner.

This method is expected to generate biologically meaningful diverse sequence variations from limited experimental data, enabling models to make robust predictions against slight sequence changes. Data augmentation through function-maintaining mutations brings effective increases in training data and was thought to contribute to improved generalization performance for unknown sequences while suppressing neural network overfitting.

Build:

The basic structure of the prediction model was composed of ESM2 embedding vectors, mean pooling, and two fully connected layers (1280 dimensions → 640 dimensions → 1 dimension), as in Cycle 4.1.

For data augmentation implementation, we introduced function-maintaining mutations based on the BLOSUM62 matrix to sequences randomly sampled from existing training data. Specifically, X positions in each sequence were randomly selected, and the amino acids at those positions were substituted with similar amino acids having positive or non-negative BLOSUM62 scores. When multiple candidate amino acids existed, one was randomly selected from among them. Labels were assigned to generated mutant sequences by adding random noise within ±10% range to the label value of the original sequence.

Through this operation, we additionally generated X mutant sequences for the original 40-sequence training data and trained the model on an augmented dataset totaling (40+X) sequences. Training settings were identical to Cycle 4.1 in using MSE as the loss function [20], Adam as the optimization algorithm [21], and implementing 100 epochs of training with early stopping [22] and hyperparameter optimization using Optuna [23].

For training data, we used a brightness dataset of 40 sequences obtained from FPbase and prepared data following the same procedure as Cycle 3.1. The prediction accuracy of the model was evaluated by the coefficient of determination (R²) on test data.

Test:

fig13.png

fig. 13. Scatter plot of predicted brightness (horizontal axis) and measured brightness (vertical axis) on test data

Green circles indicate prediction results by ESM2 + fully connected layers (R² = 0.509), orange circles indicate prediction results by ESM2 + fully connected layers + data augmentation (Conservative mutations) (R² = 0.51). The dashed line represents the diagonal line of ideal prediction (predicted value = measured value).

Learn:

The introduction of data augmentation through function-maintaining mutations utilizing the BLOSUM62 substitution matrix confirmed improved prediction accuracy compared to the model without data augmentation (R² = 0.509 → 0.51) (numerical values). This is attributed to more stable learning being possible even with limited experimental data by adding biologically valid sequence variations, improving the model’s generalization performance.

An important point of this method is allowing only substitutions to evolutionarily similar amino acids based on the BLOSUM62 matrix. With random mutations, substitutions that destroy protein function are included, so high activity values are assigned as labels despite actually being low activity, inhibiting model learning. On the other hand, function-maintaining mutations based on BLOSUM62 follow substitution patterns actually observed in nature, enabling generation of sequences with activity similar to the original sequence and allowing biologically valid data augmentation.

Mutant sequences generated by this method are similar to but subtly different from original sequences. Therefore, it is inferred that robust learning was promoted that prevents the model from overfitting to specific sequence patterns, does not overreact to slight sequence changes, and enables stable predictions even for unknown sequences. Adding small noise to original label values may also have contributed to learning predictions within an appropriate range.

The improvement in prediction accuracy through data augmentation demonstrates the effectiveness of this method in efficiently increasing biologically valid learning samples from limited experimental data. On the other hand, there may be limitations to the diversity of sequence space that the model can learn with only the function-maintaining mutations introduced in this cycle. Conservative substitutions based on BLOSUM62 are effective for generating sequences with activity similar to the original sequence but are thought to contribute less to improved prediction accuracy for sequences with significantly different activity. Room remains for further improving model prediction performance by covering broader sequence space and incorporating samples with diverse activity values into learning [24].

Cycle 4.3: ESM2+Fully Connected Layers+Data Augmentation (Conservative Mutations+Deleterious Mutations) for Prediction Model

Design:

By introducing data augmentation through function-maintaining mutations utilizing the BLOSUM62 substitution matrix in Cycle 4.2, we generated biologically valid sequence variations from limited experimental data and achieved improved model generalization performance.

However, the model in Cycle 4.2 learns data augmented only with function-maintaining mutations. While this allows the model to learn features of “functional sequences,” there are no opportunities to learn features of “non-functional sequences.” This study aims to develop a universal protein improvement model usable by anyone, and the approximately 40 limited training data provided by users is expected to contain high-activity sequences selected through experiments. Therefore, the possibility exists that sequences that significantly impair protein function are not included. In such situations, the model cannot appropriately identify non-functional sequences and risks predicting high activity values even for sequences with virtually no activity. This is similar to the phenomenon where a digit classification model confidently makes incorrect classifications even for completely different inputs never seen during training [25].

This problem has particularly fatal effects in iterative optimization cycles combining prediction and generation models as in this study. When the prediction model incorrectly predicts non-functional sequences as high activity, the generation model learns incorrect features from those sequences and falls into a vicious cycle of generating even lower quality sequences in the next cycle. Garbage in, garbage out—to prevent such chains of quality degradation, it is essential for the prediction model to possess the ability to accurately identify “non-functional sequences” [17][25].

Data augmentation through function-maintaining mutations is effective for learning fine features in high-activity regions but insufficient for covering the full range of activity in sequence space. Conservative substitutions based on BLOSUM62 are suitable for generating sequences with activity similar to the original sequence but do not contribute to improved prediction accuracy for low-activity or non-functional sequences with significantly different activity. For the model to learn clear decision boundaries discriminating functional from non-functional sequences, non-functional sequences must be explicitly included in training data.

Therefore, in this cycle, in addition to function-maintaining mutations, we introduced data augmentation through deleterious mutations that intentionally destroy protein function. This method artificially generates low-activity sequences using evolutionarily non-conservative substitutions with negative scores in the BLOSUM62 matrix. However, simply destroying random positions may have insufficient impact on function.

Therefore, we adopted a method to quantify the importance of each amino acid position utilizing ESM model’s mask prediction capability. Specifically, each position in the sequence is masked one at a time, and we evaluate how confidently the ESM model can predict the amino acid at that position. Positions with high prediction confidence likely play important roles in that sequence’s context, and destroying them is expected to have greater impact on protein function.

In introducing deleterious mutations, non-conservative amino acid substitutions with negative BLOSUM62 scores are made at identified important positions. By assigning extremely low activity values (essentially zero) as labels to generated destroyed sequences (pseudo-non-functional sequences), the model explicitly learns features of “non-functional sequences.”

This two-stage data augmentation strategy—function-maintaining and function-destroying mutations—enables the model to learn covering the full range of activity in sequence space. Function-maintaining mutations train fine features in high-activity regions, while deleterious mutations clarify decision boundaries for identifying non-functional sequences. This is expected to enable the prediction model to appropriately discriminate functional from non-functional sequences and maintain high-quality sequence generation even in subsequent generation cycles.

Build:

The basic structure of the prediction model was composed of ESM2 embedding vectors, mean pooling, and two fully connected layers (1280 dimensions → 640 dimensions → 1 dimension), as in Cycle 4.2.

Data augmentation was implemented in two stages. As the first stage, we introduced function-maintaining mutations as in Cycle 4.2.

As the second stage, we implemented generation of 8 low-activity sequences through deleterious mutations. First, for each sequence in the training data, we evaluated the importance of each amino acid position using ESM2’s masked language model functionality. Specifically, each position in the sequence was sequentially replaced with a mask token, and the log probability of the original amino acid was calculated from the prediction probability distribution output by the ESM model. Higher values indicate that the position is easier for the model to predict, meaning it is an important position in the sequence’s context.

Based on importance scores, we calculated the average score for each sequence and used this as a weight for weighted sampling to select sequences to be destroyed. For selected sequences, N positions were sampled from the top 3N positions (N is the number of positions to destroy) with high importance scores, with probability proportional to scores. This was designed to preferentially destroy important positions.

For amino acids at selected positions, non-conservative substitution candidates with BLOSUM62 scores of -1 or below were extracted. When candidates did not exist, the score threshold was sequentially relaxed to 0 or below, and if still no candidates existed, any amino acid other than the original amino acid was considered a candidate. One was randomly selected from candidates to execute substitution, and destroyed sequences were assigned extremely low activity values of 1e-6 as labels.

Through this two-stage data augmentation, we additionally generated function-maintaining and deleterious mutant sequences for the original 40-sequence training data and trained the model on a greatly augmented dataset in total. Training settings were identical to Cycle 4.2 in using MSE as the loss function, Adam as the optimization algorithm, and implementing 100 epochs of training with early stopping and hyperparameter optimization using Optuna.

For training data, we used a brightness dataset of 40 sequences obtained from FPbase and prepared data following the same procedure as Cycle 3.1. The prediction accuracy of the model was evaluated by creating violin plots from predicted score values for sequences with brightness 0 obtained from the DMS dataset. Evaluation was also conducted using the coefficient of determination (R²) on test data obtained from FPbase.

Test:

violin_plot_en.png

fig. 14. Violin plot of predicted values for non-functional sequences (activity value 0)

Yellow-green violin plot indicates prediction results by ESM2 + fully connected layers + data augmentation (Conservative mutations), blue violin plot indicates prediction results by ESM2 + fully connected layers + data augmentation (Conservative mutations + Deleterious mutations).

The model with data augmentation using deleterious mutations shows slightly smaller predicted values overall for non-functional sequences. However, it still appears to rarely return brightness 0 as a predicted value. Therefore, it is insufficient for identifying and removing non-functional sequences.

fig14.png

fig. 15. Scatter plot of predicted brightness (horizontal axis) and measured brightness (vertical axis) on test data

Orange circles indicate prediction results by ESM2 + fully connected layers + data augmentation (Conservative mutations) (R² = 0.509), yellow circles indicate prediction results by ESM2 + fully connected layers + data augmentation (Conservative mutations + Deleterious mutations) (R² = 0.48). The dashed line represents the diagonal line of ideal prediction (predicted value = measured value).

On the other hand, for the prediction task on functional sequences, decreased prediction accuracy was confirmed with the addition of deleterious mutations. Comparing test data R² scores, the model with Conservative mutations only (test R² = 0.48) versus the model with added Deleterious mutations (test R² = 0.51) showed decreased test scores, revealing that introduction of deleterious mutations actually inhibited prediction performance for functional sequences.

This decrease in prediction accuracy suggests that low-activity sequence data generated by deleterious mutations may have hindered learning of fine features in high-activity regions. While datasets augmented only with function-maintaining mutations allow models to learn concentrating on the feature space of high-activity sequences, adding deleterious mutations creates the need to simultaneously learn features of low-activity regions. With two-layer fully connected layers having limited learning capacity, simultaneously learning both regions with high accuracy is difficult, and as a result, prediction accuracy in high-activity regions was likely sacrificed.

Learn:

Introduction of data augmentation through deleterious mutations, contrary to expectations, resulted in decreased prediction accuracy for functional sequences. The model with Conservative mutations only (test R² = 0.48) versus the model with added Deleterious mutations (test R² = 0.51) showed decreased test scores, confirming degradation of generalization performance opposite to data augmentation with Conservative mutations. Low-activity sequences generated by deleterious mutations were introduced as negative samples but rather resulted in hindering learning of fine features in high-activity regions. In a model with limited learning capacity, being required to simultaneously learn both high- and low-activity regions likely resulted in failing to achieve sufficient accuracy in either region.

Furthermore, regarding the ability to identify non-functional sequences, which was the main objective of this cycle, the expected improvement was not obtained (fig. 14). Despite adding deleterious mutations to training data, the tendency to predict high activity values for non-functional sequences with actually zero activity remained.

The following possibilities are considered as causes of this failure. First, the relatively simple architecture of two-layer fully connected layers may have made learning complex nonlinear decision boundaries in sequence space difficult. Appropriately separating high- and low-activity regions might require deeper layers or more advanced architectures, but with limited data volume, the risk of overfitting increases [26]. Next, the balance between deleterious mutation data and function-maintaining mutation data may have been inappropriate. Excessive addition of low-activity sequences likely biased the model’s learning resources toward low-activity regions, sacrificing prediction accuracy in the originally important high-activity regions.

More fundamentally, there are limitations to the learning approach itself using artificially generated destroyed sequences. Machine learning models can only learn patterns contained in training data and cannot make appropriate predictions for types of non-functional sequences never seen during training. The results of this cycle indicate that artificial data generation and prediction model enhancement with simple architectures are insufficient for the challenge of identifying non-functional sequences. Furthermore, the fact that introduction of deleterious mutations decreased prediction accuracy for functional sequences revealed that the balance of quantity and quality in data augmentation critically affects prediction performance, suggesting that data augmentation methods generating high-quality virtual data are what lead to improved model prediction accuracy.

In iterative optimization combining prediction and generation models, depending on prediction models that cannot appropriately identify non-functional sequences gives incorrect signals to generation model learning, triggering chains of quality degradation. However, the data augmentation through deleterious mutations attempted in this cycle not only failed to improve non-functional sequence identification ability but even decreased prediction accuracy for functional sequences. To solve this problem, we concluded it necessary to identify and eliminate non-functional sequences by methods that do not depend on prediction models.

Cycle 5: ESM2+Fully Connected Layers+Data Augmentation (Conservative Mutations)+Custom Dropout for Prediction Model (Provisional due to length)

Design:

In Cycle 4.3, introduction of data augmentation through Conservative mutations [27] confirmed effects of improved prediction accuracy and suppression of overfitting. Appropriate regularization prevents models from overly fitting to specific patterns in training data and promotes learning of features essential for activity prediction.

However, room for improvement in generalization performance still remained. In this study’s setting of learning from small-scale data of only 40 sequences, in the high-dimensional ESM2 embedding space of 1280 dimensions [5], there is high risk that models incorrectly learn incidental correlations or noisy features contained in training data as true activity determinants.

This problem is fundamentally similar to the background correlation problem in object detection [28]. When a cat detection model co-occurred with “blue sky” backgrounds in many training images, the model learns incidental correlations of background color rather than essential features of cats (ear shape, whisker patterns). Similarly in protein activity prediction, when dataset-specific biases such as sequence length or amino acid composition at specific positions happen to correlate with activity values, models incorrectly learn these as activity determinants.

Normal dropout prevents excessive dependence on specific features by randomly deactivating neurons but treats essential features and noisy features equally [29]. With small-scale data, efficient learning of essential features is necessary within limited learning opportunities, requiring more strategic regularization methods.

Therefore, in this cycle, we introduced custom dropout based on coefficient of variation (CV) [30]. The core idea of this approach is based on the hypothesis that “feature dimensions with large value variations among samples within a batch are likely incidental correlations or noise.”

Specifically, for each dimension of feature vectors in each batch, we calculate the coefficient of variation CV = σ / |median|. Dimensions with large CV are unstable in value among samples and differ greatly by sequence—that is, they are likely not universal determinants of activity but dataset-specific noise or incidental patterns. Conversely, dimensions with small CV show consistent values across many samples and likely reflect essential contributions to activity.

In custom dropout, the top 50% of dimensions with large CV are selectively dropped. In the object detection example, this corresponds to preferentially deactivating features that vary greatly across images like “background color” while protecting essential features that consistently appear like “cat ear shape.”

This data-dependent strategy can achieve more efficient regularization than random dropout. In ESM2’s embedding space, this is expected to protect biophysical features consistently important across sequences (hydrophobicity patterns, secondary structure propensities, etc.) while suppressing dataset-specific artifacts.

Build:

Custom dropout implementation was as follows. During training, for feature vectors (1280 dimensions) in each batch, we calculated standard deviation and median for each dimension and computed coefficient of variation CV = σ / (|median| + ε). Here, ε is a small value (1e-8) for numerical stability. We identified the top 50% of dimensions (640 dimensions) with large CV and generated masks that set values of those dimensions to zero. Finally, we applied a scaling factor (1 / (1 - 0.5) = 2.0) to preserve expected values. During inference, dropout was not applied and all dimensions were used as is.

Other conditions were identical to Cycle 4.3.

To verify the effectiveness of custom dropout, we evaluated prediction accuracy on test data using coefficient of determination (R²) under three conditions: model without dropout, model using normal random dropout (p=0.5), and model using custom dropout (p=0.5).

Test:

fig16.png

fig. 16. Scatter plot of predicted brightness (horizontal axis) and measured brightness (vertical axis) on test data

Orange circles indicate ESM2 + fully connected layers + data augmentation (Conservative mutations), purple circles indicate prediction results by ESM2 + fully connected layers + data augmentation (Conservative mutations) + custom dropout (R² = 0.545). The dashed line represents the diagonal line of ideal prediction (predicted value = measured value).

It became clear that ESM2+fully connected layers+data augmentation (Conservative mutations)+custom dropout demonstrated superior prediction performance.

Learn:

Introduction of custom dropout based on coefficient of variation confirmed notable improvement in prediction accuracy. Comparing test data R² scores, the model with custom dropout applied (test R² = 0.545) showed clear performance improvement over the model without dropout (test R² = 0.51).

Particularly noteworthy is that custom dropout demonstrated superior performance even compared to random dropout. Random dropout treats all dimensions equally, indiscriminately dropping both essential and noisy features, inefficiently consuming limited learning opportunities in small-scale data. In contrast, custom dropout calculates coefficient of variation for each batch and dynamically selects drop targets at each learning stage. Through this adaptive approach, the model can gradually focus on essential features according to training progress, considered to have realized efficient learning.

The results of this cycle demonstrated that by combining high-quality data augmentation through Conservative mutations in Cycle 4.2 with strategic regularization through custom dropout, prediction model performance can be steadily improved even under the constraint of limited 40 sequences. This finding becomes an important foundation toward further improvement of prediction models in the next cycle and ultimately realization of iterative optimization cycles through integration with generation models.

Cycle 6: LoRA Fine-tuning of the Generative Model

Cycle 6.1:

Design:

In Cycles 1 and 2, we were able to generate proteins with targeted properties using fine-tuned generation models. The method used then was full fine-tuning, which updates all parameters the model possesses. This full fine-tuning has the highest task adaptation capability because it can update the entire model and can theoretically reach the highest performance ceiling compared to parameter-efficient methods. On the other hand, with small-scale datasets, there are too many parameters, making overfitting likely [31]. Furthermore, because all parameters are updated, long training times and enormous computational resources are required. Therefore, full fine-tuning cannot be reduced to realistic computational amounts for the method we aim for—converging generated sequences toward those with desired properties through repeated model fine-tuning. Therefore, in this cycle, to efficiently perform fine-tuning, we introduce parameter-efficient fine-tuning using LoRA (Low-Rank Adaptation) [32].

LoRA is a method that adds low-rank decomposed adaptation matrices to weight matrices of pre-trained models and learns only these adaptation matrices. This achieves performance close to full fine-tuning while drastically reducing the number of parameters to update. Specifically, by inserting small-scale learnable matrices into each layer while keeping the original model weights fixed, computational cost, memory usage, and training time can be dramatically reduced.

In this cycle, to verify the effectiveness of LoRA fine-tuning, we conduct theoretical and experimental comparisons with full fine-tuning.

Build:

Model training implementation was as follows. The dataset used sequences generated in the previous iteration with screening applied. This dataset was split into training and test data at a 9:1 ratio. The pre-trained model was loaded in half precision, and LoRA was limited to qkv_proj and out_proj of the attention mechanism, freezing other weights. This aimed to acquire context while suppressing computational cost and memory usage. For model training, we used Transformers, adopting cross-entropy per token as the loss function and Adam as the optimization algorithm. Training was conducted for 6 epochs with early stopping and hyperparameter optimization using Optuna.

Model generation implementation was as follows. Parameters required during generation adopted those reflecting characteristics of target sequences. For example, when assuming GFP, the prompt was “MSKGE” and generation was performed with 233 tokens, subtracting the prompt length of 5 from GFP’s sequence length of 238. This enabled stable generation of sequences with consistent length while maintaining the known N-terminal motif.

To verify the effectiveness of fine-tuning using LoRA, we compared the number of trainable parameters and training time between full fine-tuning and LoRA fine-tuning. Additionally, using known fluorescent proteins as training data, we quantitatively compared distributions of both models by principal component analysis of sequences generated by both models.

Test:

comparison_plot_v4.png

fig. 17. Comparison of trainable parameters and runtime between Full Fine-Tuning and LoRA

With LoRA, the number of trainable parameters was limited to 1/256 compared to Full Fine-Tuning. This reduced runtime to 0.7 times.

Unknown-18.png

fig. 18. Principal component analysis (ESM2 Embedding) of sequences output by Full Fine-tuning and LoRA with known fluorescent proteins

Learn:

This cycle demonstrated that parameter-efficient fine-tuning by LoRA is a practical method for iterative improvement of protein generation models. LoRA succeeded in reducing the number of trainable parameters to 1/256 while shortening runtime to 0.7 times compared to full fine-tuning. This dramatic reduction in computational cost made our approach of repeatedly fine-tuning models realizable with practical computational amounts. Furthermore, from comparison by principal component analysis, the distribution of sequences generated by LoRA fine-tuning showed no extreme divergence from full fine-tuning, suggesting that LoRA can demonstrate sufficient performance while reducing overfitting risk in small-scale datasets. In future cycles, we decided to adopt this LoRA fine-tuning as a standard method.

Cycle 7: Likelihood-based Data Augmentation

Cycle 7.1:

Design:

In Cycle 2.2, despite limiting prediction model training data to a very small number of 40, we were able to design fluorescent proteins that emit target fluorescence wavelengths. On the other hand, while the generation model can learn from sequences alone without requiring labels that quantitatively represent function, it still required relatively large data of 700 sequences.

To resolve this imbalance in training data required by prediction and generation models and develop a model capable of truly improving proteins from small amounts of data, we devised a data augmentation method utilizing protein language models as a technique to expand training data for generation models from limited sequences.

Specifically, after generating large numbers of variants from a small number of wild-type sequences, biologically valid sequences are selected based on likelihood from protein language models, thereby constructing high-quality virtual training datasets without conducting experiments.

Likelihood is a metric that quantifies how “plausible” a sequence is based on statistical patterns of natural protein sequences learned by protein language models [33].

In this cycle, we calculated using the masked marginal scoring function expressed by the following formula [34]:

iMlogp(xi=ximtxM)logp(xi=xiwtxM)\sum_{i \in M} \log p(x_i = x_i^{mt} | \boldsymbol{x}_{-M}) - \log p(x_i = x_i^{wt} | \boldsymbol{x}_{-M})

Here, MM is the set of mutation introduction positions, ximtx_i^{mt} is the mutant amino acid, xiwtx_i^{wt} is the wild-type amino acid, and xM\boldsymbol{x}_{-M} represents the sequence excluding mutation positions. This formula sums the logarithmic ratio of occurrence probabilities between mutant and wild-type amino acids at each mutation position across all mutation positions, conditioned on sequence information excluding mutation positions.

If this score is positive, the mutant sequence has higher likelihood from the language model’s perspective than wild-type, indicating it is more valid as a protein sequence that could exist in nature. Conversely, if negative, likelihood is low and there is high possibility it is a biologically unnatural sequence. In fact, when examining the relationship between likelihood and fluorescence brightness using GFP’s comprehensive mutation analysis (DMS) dataset, clear correlation was observed between the two (Fig. 19).

vprVVWVVqvNysoaPnz4s88q9Vqwx0aAMA1gMQOAAAAoJlAdycAAAAAzQQSOwAA0AAIBmAokdAAAAQDOBxA4AAACgmUBiBwAAANBMILEDAAAAaCbPHYt0w1foUwAAAAAElFTkSu0AQmCCのコピー.png

Fig. 19. Correlation between likelihood score and fluorescence brightness in GFP DMS dataset

Relationship between likelihood scores calculated by masked marginal scoring function using the SaProt-650M model and experimentally measured fluorescence brightness of GFP variants.

In this method, by using this likelihood score as a filtering criterion, we designed to select only sequences with high probability of being structurally and functionally stable from among the large number of generated variants and utilize them as training data for the generation model. This aimed to construct a model capable of effectively generating novel proteins even with limited data by producing high-quality virtual variants necessary for generation model training from just tens of experimental data points.

Build:

First, we generated variants through single residue substitution mutations (point mutations). For each wild-type sequence, we generated all combinations by substituting the amino acid at each position in the sequence with any of the 20 standard amino acids. In other words, from one wild-type sequence of length L residues, we generated a maximum of (L × 19) single residue substitution variants. Through this operation, we constructed a variant library enabling comprehensive exploration of functional importance at each position in wild-type sequences.

Second, we generated variants through sequence shuffling. In this method, multiple wild-type sequences are divided by certain window sizes (1, 3, 5), and sequence fragments are randomly exchanged between different wild-types in window units, generating variants with novel combinations. Specifically, each sequence was divided by the specified window size, and windows at the same positions were randomly swapped according to a set shuffling rate (0.1). This operation was repeatedly executed until reaching the set number of variants.

For all generated variants, we conducted validation based on sequence similarity to wild-type sequences. Specifically, amino acid residues were compared one by one between each variant and corresponding wild-type sequence, counting the number of different residues. Only variants with this difference within 4 residues were retained as candidates for subsequent likelihood screening.

Next, we calculated likelihood scores of generated variants using the protein language model SaProt-650M. Using the aforementioned masked marginal scoring function, we calculated the log-likelihood difference from wild-type sequences for each variant and selected only variants where this value was positive (i.e., variants with likelihood equal to or higher than wild-type).

For generation model training, we applied LoRA (Low-Rank Adaptation) using the PEFT (Parameter-Efficient Fine-Tuning) library with ProGen2-small [2] as the base model [32][35]. LoRA parameters were set to rank r=32, alpha=64 (r × 2), dropout=0.1.

To evaluate model performance, we compared models trained under three conditions: (1) model trained on the entire 700 sequences as in Cycle 2.2 (Full dataset model), (2) model trained on only 40 sequences (Limited dataset model), (3) model trained on sequences generated from 40 sequences through likelihood-based data augmentation (Augmented dataset model). Sequences were generated from each model under identical conditions, converted to embedding vectors with the ESM2-650M model [5], then dimensionally reduced to two dimensions by principal component analysis (PCA) and plotted to visually compare the distribution and diversity of sequences generated by each model.

Test:

20251008_20.png

fig. 20. Predicted structure of sequences generated by ProGen2 fine-tuned with LoRA on 40 sequences

With 40 sequences, proper sequences were not generated due to overfitting. Even predicted structures showed no biologically valid structures whatsoever.

Without lowering temperature to 0.1, sequences homologous to fluorescent proteins were not generated, and sequences practically usable were not produced at all.

スクリーンショット 2025-10-09 120801.png

fig. 21. Predicted structure of sequences generated by ProGen2 fine-tuned with LoRA on the same 700 sequences as Cycle 2.2

スクリーンショット 2025-10-08 194247.png

fig. 22. Predicted structure of sequences generated by ProGen2 fine-tuned with LoRA on sequences generated through data augmentation from 40 sequences

Learn:

In this cycle, we verified whether generation models can be effectively trained from limited training data through a data augmentation method based on protein language model likelihood. As a result, this method was demonstrated to be an effective approach for solving serious problems in generation model learning with small data.

First, the Limited dataset model trained on only 40 sequences could not generate any biologically valid structures, as shown in Fig. 20. Predicted structures did not form the typical β-barrel structure of fluorescent proteins, and overfitting clearly occurred. Without lowering temperature to 0.1, even sequences homologous to fluorescent proteins were not generated, rendering it completely impractical. This indicates that while generation models have the advantage of being learnable from sequences alone, a certain amount of data or more is still necessary to appropriately capture sequence space diversity.

In contrast, the Full dataset model trained on the entire 700 sequences (Fig. 21) could stably generate sequences with biologically valid structures, as shown in Cycle 2.2. This reconfirms that generation models function appropriately when sufficient training data exists.

An important finding was that the Augmented dataset model trained on sequences generated through likelihood-based data augmentation from 40 sequences could generate biologically valid fluorescent protein-like structures, as shown in Fig. 22, in contrast to the Limited dataset model. This result demonstrates that high-quality virtual training data can be constructed without experimental data by utilizing statistical patterns of natural proteins learned by protein language models.

The key to this method’s success lies in likelihood calculation through masked marginal scoring function and the selection process based on it. The correlation between likelihood scores and fluorescence brightness shown in Fig. 19 suggests that likelihood is related not only to sequence “naturalness” but also to functional validity. By selecting only variants with positive likelihood, sequences with high probability of being structurally and functionally stable can be preferentially included in training data, which is considered to have prevented overfitting and enabled learning of appropriate sequence distributions.

Additionally, combining two mutation introduction methods—point mutation and shuffling—was effective for balancing diversity and validity. Point mutation comprehensively explores functional importance at each position, while shuffling creates combinations of beneficial sequence fragments between different wild-types. By limiting differences from wild-type to within 4 residues, diversity was secured within a range not too deviant from wild-type.

This method resolved the imbalance in training data amounts required by prediction models (40 data) and generation models (conventionally requiring 700 data), opening the path to constructing both from equivalent small amounts of data. This is an important advancement enabling significant reduction in experimental costs and realization of protein design cycles at earlier stages.

Cycle 7.2:

Design:

In Cycle 7.1, we evaluated biological validity of generated variant sequences through likelihood calculation combining the protein language model SaProt-650M with masked marginal scoring function. This method showed relatively good correlation with experimentally measured GFP fluorescence brightness and was confirmed effective as a filtering criterion for sequences. However, masked marginal scoring evaluates each mutation position independently and thus had the limitation of being unable to consider interactions between multiple mutations, namely epistasis.

Epistasis is a phenomenon where mutations at certain positions combine with mutations at other positions to produce synergistic or antagonistic effects that cannot be explained by simple sums of individual mutation effects [36]. Epistasis is known to play an important role in protein function and stability, and particularly when introducing multiple mutations simultaneously, appropriately evaluating their interactions is essential for improving design accuracy.

This data augmentation method aims to produce high-quality virtual variants for generation models to learn [37]. However, if likelihood calculation accuracy is insufficient, biologically inappropriate sequences mix into training data, and generation models learn incorrect features and patterns from those sequences. As a result, models come to generate functionally inferior or structurally unstable sequences. This is the fundamental problem in machine learning known as “garbage in, garbage out.” Therefore, increasing accuracy of sequence selection in data augmentation becomes an extremely important element affecting generation model performance.

Therefore, in this cycle, we aimed to improve variant screening accuracy by combining likelihood calculation methods capable of considering epistasis [38] and protein language models with different architectures with masked marginal scoring.

Build:

Specifically, we designed experiments to compare and verify the following four combinations:

  1. SaProt and masked marginal scoring: The method used in Cycle 7.1, functioning as a baseline (reference). A standard evaluation method based only on sequence information.

  2. ESM2 and pseudo perplexity: Pseudo perplexity is a method that calculates likelihood considering the entire sequence context [39], and unlike masked marginal scoring, can indirectly capture interactions between mutations. There is possibility of more appropriately evaluating epistasis effects [40].

  3. ProGen2 and perplexity: ProGen2 [2] is an autoregressive generation model learned through the process of sequentially generating sequences from left to right. In this verification, we use two model sizes: small and base. Perplexity [41] is a standard likelihood evaluation metric in autoregressive models, reflecting the generation probability of the entire sequence. We verify the impact of model architecture differences on likelihood calculation performance.

By comparing these methods, we aimed to evaluate the balance between epistasis consideration capability and model performance, and identify optimal likelihood calculation methods for efficiently selecting high-quality variants from limited data.

To quantitatively evaluate the performance of each likelihood calculation method, we randomly extracted 10,000 sequences from GFP’s comprehensive mutation analysis (DMS) dataset and used them as test data. This dataset contains experimentally measured fluorescence brightness values corresponding to each variant sequence, functioning as a reference for evaluating correlation between likelihood scores and actual function. As protein language models, we used three types: SaProt-650M, ESM2-650M, and ProGen2-base. SaProt-650M and ESM2-650M have masked language model architectures and evaluate sequences considering bidirectional context information. Meanwhile, ProGen2-base is an autoregressive generation model learned through the process of generating sequences unidirectionally. As likelihood calculation methods, we implemented the following three. Masked marginal scoring [34] is the method used in Cycle 7.1, expressed by the following formula:

iMlogp(xi=ximtxM)logp(xi=xiwtxM)\sum_{i \in M} \log p(x_i = x_i^{mt} \mid \boldsymbol{x}_{-M}) - \log p(x_i = x_i^{wt} \mid \boldsymbol{x}_{-M})

Here, MM is the set of mutation introduction positions, ximtx_i^{mt} is the mutant amino acid, xiwtx_i^{wt} is the wild-type amino acid, and xM\boldsymbol{x}_{-M} represents the sequence excluding mutation positions. In this method, each mutation position is evaluated independently, and the overall variant score is calculated by summing log-likelihood differences between mutant and wild-type. Applied to the masked language model SaProt.

Pseudo perplexity [39] is a method that evaluates likelihood of the entire sequence, defined by the following formula:

exp(1Li=1Llogp(xixi))\exp\left(-\frac{1}{L}\sum_{i=1}^{L} \log p(x_i \mid \boldsymbol{x}_{-i})\right)

Here, LL is sequence length, xix_i is the amino acid at position ii, and xi\boldsymbol{x}_{-i} represents the sequence excluding position ii. In this method, each position in the sequence is sequentially masked and the occurrence probability of each amino acid is predicted conditioned on the remaining sequence. The average of negative log-likelihoods across all positions is calculated and its exponent taken to obtain perplexity. Lower pseudo perplexity indicates that the model predicts the sequence more easily, meaning there is higher possibility it is biologically valid. Unlike masked marginal scoring, because it considers the entire sequence context, it can indirectly capture interactions between mutations. Applied to the masked language model ESM2.

Perplexity [41] is a standard likelihood evaluation metric in autoregressive models, expressed by the following formula:

exp(1Li=1Llogp(xix<i))\exp\left(-\frac{1}{L}\sum_{i=1}^{L} \log p(x_i \mid \boldsymbol{x}_{<i})\right)

Here, x<i\boldsymbol{x}_{<i} represents the sequence before position ii. Because autoregressive models are learned through the process of sequentially generating sequences from left to right, amino acids at each position are predicted conditioned only on the sequence before it. The average of negative log-likelihoods across all positions is calculated and its exponent taken to obtain perplexity. Applied to ProGen2-base.

For each combination of model and likelihood calculation method, we calculated likelihood scores for the extracted 10,000 sequences and computed correlation coefficients with experimentally measured fluorescence brightness to quantitatively evaluate the prediction accuracy of each method.

Test:

スライド2.png

Fig. 23. Correlation between each likelihood calculation method and fluorescence brightness

For 10,000 sequences extracted from GFP DMS dataset, shows correlation between scores calculated by each model and likelihood calculation method combination with experimentally measured fluorescence brightness. Horizontal axis represents likelihood scores calculated by each method, vertical axis represents experimentally measured fluorescence brightness.

Left: (A) SaProt-650M and Masked marginal scoring (Spearman’s ρ = 0.52)

Center: (B) ESM2-650M and Pseudo perplexity (Spearman’s ρ = 0.199)

Right: (C) ProGen2-base and Perplexity (Spearman’s ρ = 0.312)

スライド1.png

Fig. 24. Comparison of brightness distributions by likelihood filtering

Violin plots of experimentally measured fluorescence brightness in sequences calculated as likelihood score ≥ 0 by each method (Likelihood ≥ 0). (A) SaProt+Masked marginal scoring, (B) ESM2+Pseudo perplexity, (C) ProGen2+Perplexity

Learn:

In this study, we systematically compared and verified combinations of multiple protein language models (pLMs) and likelihood calculation methods for predicting fluorescence brightness of GFP variants. As a result, it was confirmed that the combination of SaProt-650M and Masked marginal scoring that we had adopted from the beginning showed the highest prediction performance compared to other major models such as ESM2 and ProGen2.

A particularly noteworthy point is that in the sequence population judged by SaProt to have likelihood scores ≥ 0, the proportion of false positive sequences with low measured values was remarkably small. This high precision means that sequences judged by the model as “functionally promising” have high probability of actually possessing high fluorescence brightness. This characteristic is extremely important in practice and demonstrates that low-function sequences can be efficiently excluded at the screening stage of training data passed to generation models.

Cycle 8: In Silico Concept Validation

Design:

In this project, we have aimed to establish a method for optimizing protein sequences with desired properties from limited experimental data through iterative learning cycles of generation and prediction models using protein language models.

In Cycles 1 and 2, we conceptually demonstrated the possibility of designing proteins with desired properties by combining sequence generation by generation models with evaluation by prediction models. From Cycles 3 to 5, we progressively improved the accuracy and generalization performance of prediction models through introduction of neural networks utilizing ESM2 embedding vectors and data augmentation methods based on BLOSUM62. In Cycles 6 and 7, we constructed the technical foundation enabling fine-tuning of generation models in small-data environments of only 40 samples.

Through the cycles thus far, preparations to achieve our goals in small-data environments were completed. In this cycle, we integrate the improved prediction model with the small-data-compatible generation model and implement iterative cycles of generation and prediction. Specifically, novel sequences are generated by the generation model, evaluated by the prediction model, then the generation model is retrained using only highly-evaluated sequences. By repeating this process multiple times, we verify in an in silico environment that generated sequences converge toward desired properties.

We set the following as metrics to confirm sequence convergence. First, in sequence populations generated at each cycle, the proportion of sequences with multiple target properties simultaneously optimized increases as cycles are repeated. Second, while sequence diversity is appropriately maintained, the distribution in sequence space consolidates toward specific regions suited for target properties. Through these metrics, we quantitatively evaluate whether generation models effectively learn based on prediction model evaluation and can explore sequence spaces with desired properties.

Build:

For the prediction model, we adopted the multi-output neural network model using ESM2 embedding vectors and fully connected layers that showed the highest prediction performance in Cycle 5.

For the generation model, we adopted an approach combining the data augmentation method based on BLOSUM62 introduced in Cycle 7.2 with the efficient fine-tuning method using LoRA established in Cycle 6.

For training data, we used a fluorescent protein dataset of 40 sequences obtained from FPbase. Each sequence has three properties assigned as labels: maximum fluorescence wavelength, maximum excitation wavelength, and brightness.

In this cycle, we set clear optimization targets of maximum fluorescence wavelength 448 nm, maximum excitation wavelength 383 nm, and maximized brightness. This aims to design sequences achieving high brightness while possessing blue fluorescent protein characteristics.

For implementation of early stopping in generation model training, at the end of each epoch, 1000 sequences were sampled from the generation model, converted to embedding vectors by ESM2, and the variance of these embedding vectors was calculated.

We visualized the distribution of predicted property values of sequences generated at each cycle with histograms and the distribution in sequence space with embeddings by ESM2 and principal component analysis, tracking changes as cycles progressed.

Test:

(A)

image.png

(B)

image.png

(C)

image.png

fig. 25. Changes in distribution of predicted property values of generated sequences in iterative cycles

Histograms of predicted property values of sequences generated at Iteration 1 and Iteration 8. (A) Distribution of maximum fluorescence wavelength. Target value 448 nm shown by red dashed line. (B) Distribution of maximum excitation wavelength. Target value 383 nm shown by red dashed line. (C) Distribution of brightness. It can be confirmed that as cycles progress, distributions concentrate near target values and the proportion of sequences with high brightness increases.

Unknown-16.png

fig. 26. Changes in distribution of generated sequences in sequence space by principal component analysis

Distribution of generated sequences projected onto a 2-dimensional space of first principal component (PC1) and second principal component (PC2) by applying principal component analysis to ESM2 embedding vectors. Each color indicates different cycles (Iteration 1: blue, Iteration 4: red, Iteration 8: green). In initial cycles sequences are dispersed over a wide range, but as cycles progress, consolidation toward specific regions is observed. Yellow points indicate original training data (40 sequences from FPbase).

20251008_16.png

fig. 27. Transitions in proportion of sequences satisfying target properties and sequence diversity by cycle

Proportion of sequences among those generated at each cycle that satisfy all three target properties (maximum fluorescence wavelength 448±10 nm, maximum excitation wavelength 383±10 nm, brightness ≥10). As cycles progress, the proportion of sequences satisfying targets increases, and distribution in sequence space converges toward target regions.

Learn:

As shown in Fig. 25, while predicted property values of generated sequences were dispersed over a wide range in the initial cycle (Iteration 1), concentration near target values became prominent as cycles progressed. Particularly for maximum fluorescence wavelength (A), a clear peak formed near 448 nm at Iteration 8, and convergence toward 383 nm was also confirmed for maximum excitation wavelength (B). Regarding brightness (C), the proportion of high-brightness sequences progressively increased with cycle progression.

Additionally, focusing on Fig. 27, the proportion of sequences satisfying all three target properties started at approximately 1.7% at Iteration 1 and rose to approximately 3.9% at Iteration 8. This doubles the sequences satisfying the three target properties during the generation and evaluation process. Furthermore, considering that we started from extremely limited training data of only 40 sequences, this represents a remarkable achievement.

From visualization by principal component analysis in Fig. 26, gradual changes in the distribution of generated sequences in sequence space were confirmed. In the initial cycle (blue), sequences were scattered over a wide range, but through intermediate cycles (red), by the final cycle (green) they consolidated into a smaller region than initially. This change indicates that the iterative learning cycle functioned as designed, and the generation model effectively converged toward sequence spaces with desired properties based on prediction model evaluation.

Cycle 9: Validation of LEAPS version 3 Using Fluorescent Proteins

Design:

The complete picture of LEAPS was completed. This system begins with shuffling and proceeds through likelihood-based screening, sequence generation by the generation model, sequence selection by the prediction model, and sequence generation by the generation model again, repeating this latter half of the process. We verify whether this cycle can actually achieve protein improvement.

For verification, we conducted experiments using fluorescent proteins. We performed multi-objective optimization to maximize brightness while setting maximum excitation wavelength to 383 nm and maximum fluorescence wavelength to 448 nm. Through this experiment, we evaluate the effectiveness of the LEAPS system.

Build:

For training data, we used a brightness dataset of 40 sequences obtained from FPbase [9]. Data preparation followed the same procedure as Cycle 3.1.

With the dataset prepared for LEAPS v3 as input, we specified improvement of multiple labels as follows:

・Brightness: maximize

・Maximum excitation wavelength: approach 383 nm

・Maximum fluorescence wavelength: approach 448 nm

Sequences output by LEAPS were synthesized for cell-free systems and expressed using a cell-free system (PUREfrex®). Expressed proteins were purified using His-tag columns. For purified fluorescent proteins, brightness, maximum fluorescence wavelength, and maximum excitation wavelength were measured. Based on these measurements, we evaluated the optimization effects.

Test & Learn:

~coming soon~

It will be done by the Grand Jamboree.

Cycle 10: Comparative Study

Design:

Through Cycles 1 to 8, we completed LEAPS, a model that improves proteins toward target functions through iterative optimization combining generation and prediction models. What sets this model apart from conventional methods is that it efficiently explores sequence space and converges toward high-activity sequences by repeating evaluation by the prediction model and LoRA fine-tuning of the generation model. In computational sequence optimization, methods introducing simple random mutations and evaluating with prediction models are also conceivable, but these methods can only explore a small fraction of the vast sequence space and tend to fall into local optima. In contrast, LEAPS generates sequences based on evolutionary and structural constraints learned by protein language models. Therefore, it preferentially explores biologically valid mutation spaces and is expected to efficiently converge toward global optima that are difficult to reach through random search.

LEAPS is expected to more efficiently generate promising sequence candidates by leveraging biological knowledge learned by protein language models.

However, quantitatively demonstrating how much sequence optimization using generation models actually outperforms simple random mutations is essential for proving the utility of this model. If random mutations can achieve similar performance improvements, the significance of using computationally expensive generation models and LoRA fine-tuning would diminish. The true value of LEAPS lies in the generation model having learned biologically valid sequence mutations being able to efficiently converge toward high-activity regions that are difficult to reach through random search.

Therefore, in this cycle, we conducted a control experiment replacing the generation model component of LEAPS with random mutation generation and quantitatively evaluated the improvement effect of utilizing generation models by comparing predicted activity scores of finally output sequences. Specifically, using the same prediction model and same number of cycles, we concurrently implemented (1) generation model-based optimization by LEAPS and (2) optimization by random mutations. In each cycle, while LEAPS generates sequences from the generation model, evaluates and selects with the prediction model, and performs LoRA fine-tuning, the random mutation version generates sequence populations with random point mutations introduced to original sequences and similarly evaluates and selects with the prediction model. However, the random mutation version does not perform generation model fine-tuning, using selected high-score sequences as starting points for mutation introduction in the next cycle.

If LEAPS using generation models can be shown to generate sequences with clearly higher activity scores compared to random mutations, the effectiveness of sequence optimization utilizing protein language models will be demonstrated. This will achieve the objective of this research: efficiently designing highly functional proteins from limited experimental data.

Build:

In the control experiment with random mutations, we implemented a method performing sequence optimization only through stochastic mutation introduction and evaluation/selection by prediction models without using generation models. We constructed an iterative process introducing point mutations with 1% probability at each amino acid position and using selected high-activity sequences as starting points for mutation introduction in the next cycle.

The experimental setup was as follows. In the first iteration, 100,000 mutant sequences were generated from wild-type sequences, and after evaluation by the prediction model, the top 100 sequences were selected. From the second iteration onward, using the top 100 sequences from the previous cycle as starting points, 30,000 mutant sequences each were generated, evaluated, and selected.

For optimization by LEAPS, we directly applied the method from Cycle 8, aligning the number of generated sequences with the random mutation experiment (100,000 in the first iteration, 30,000 each from the second iteration onward). Both methods used the same prediction model and selected the top 100 sequences in each iteration. The only difference was whether sequence generation was by generation models having learned biological knowledge or by purely stochastic random mutations.

We plotted predicted brightness of selected sequences at the final iteration stopped by early stopping as histograms. We examined whether there was a significant difference in the mean values of respective predicted brightnesses using a two-sample Z-test (Fig.N).

Additionally, sequences generated by the generation model and random mutations in each iteration were converted to features by ESM-2 and visualized by principal component analysis (PCA).

Test:

20251008_17.png

Fig.28. Predicted brightness when using Random mutation and ProGen2 as Generator

In this histogram, frequency plotting was partially omitted to emphasize differences in the Estimated brightness > 2 region. Compared to Random mutation, using ProGen2 as Generator exceeded both maximum value sequences and top 25% sequences. The P-value calculated from the Z-value was less than 10^-3, indicating a significant difference.

Learn:

Comparing sequence optimization by random mutations with LEAPS, LEAPS demonstrated statistically significantly superior performance (p < 0.001). The difference between both methods was particularly prominent in high-activity regions where predicted brightness exceeded 2, with LEAPS consistently showing high activity not only in maximum values but across the entire top 25% sequence population.

This result demonstrates the effectiveness of leveraging biological knowledge learned by protein language models. Because random mutations introduce mutations independently at each amino acid position, exploration of vast sequence space is inherently inefficient. In contrast, ProGen2 has learned evolutionary and structural constraints from enormous natural protein sequences, and LEAPS leverages this knowledge to preferentially explore biologically valid mutation patterns. Furthermore, through iterative LoRA fine-tuning, the generation model learns features of high-activity sequences in each cycle and can generate more promising sequences in the next cycle.

This result clearly demonstrates the significance of using computationally expensive generation models and LoRA fine-tuning. In drug discovery and enzyme engineering contexts where experimental costs far exceed computational costs, the advantage of LEAPS being able to present higher-activity sequence candidates before experimental verification is important.

Conclusion

Through Cycles 1–10, we successfully developed and refined LEAPS, a system for efficient protein engineering from limited experimental data, and validated its effectiveness. This achievement was enabled by the following technical innovations.

Predictor Improvement

Conventional approaches relied on simple linear regression methods, such as LASSO regression applied to ESM2 embedding vectors, to build predictive models. We developed a neural network that combines ESM2 embedding vectors with fully connected layers, and introduced data augmentation based on BLOSUM62 along with custom dropout based on coefficient of variation. These innovations enabled us to achieve a practical prediction accuracy of R² = 0.545 for fluorescent protein brightness prediction tasks, even under limited data conditions. This represents a substantial improvement over the conventional LASSO regression approach, which achieved R² = 0.291.

Data Augmentation For Generator

Fine-tuning large-scale models such as ProGen2 under limited data conditions causes overfitting and fails to generate biologically plausible sequences. To address this challenge, we leveraged the protein language model SaProt and masked marginal scoring to create biologically plausible virtual mutants from a small number of wild-type sequences, thereby expanding the data required for training generative models without conducting additional experiments. This approach enabled sequence generation similar to the training data with only 40 data points.

Multi-objective Optimization

In in silico validation, we simultaneously specified three objectives and, through iterative cycles of generation and evaluation, more than doubled the proportion of sequences satisfying all objectives. Furthermore, the number of sequences with brightness scores exceeding 50 increased substantially throughout this process. These results demonstrate that LEAPS functions effectively even in complex multi-objective optimization tasks.

Comparison with Random Mutation Baseline

In the Cycle 10 comparative study, LEAPS demonstrated statistically significant superior performance compared to random mutation-based optimization (p < 0.001). Particularly in high-activity regions, LEAPS consistently achieved high predicted activity not only in maximum values but across the entire top 25% of sequences. These findings demonstrate that by leveraging the evolutionary and structural knowledge encoded in protein language models, LEAPS can efficiently explore sequence spaces that are difficult to reach through random search strategies.

Reference

  1. Inouye, S., & Tsuji, F. I. (1994). Aequorea green fluorescent protein. FEBS Letters, 341(2–3), 277–280.

  2. Nijkamp, E., Ruffolo, J. A., Weinstein, E. N., Naik, N., & Madani, A. (2023). ProGen2: Exploring the boundaries of protein language models. Cell Systems, 14(11), 968-978.e3.

  3. Sarkisyan, K. S., Bolotin, D. A., Meer, M. V., Usmanova, D. R., Mishin, A. S., Sharonov, G. V., Ivankov, D. N., Bozhanova, N. G., Baranov, M. S., Soylemez, O., Bogatyreva, N. S., Vlasov, P. K., Egorov, E. S., Logacheva, M. D., Kondrashov, A. S., Chudakov, D. M., Putintseva, E. V., Mamedov, I. Z., Tawfik, D. S., … Kondrashov, F. A. (2016). Local fitness landscape of the green fluorescent protein. Nature, 533(7603), 397–401.

  4. Bateman, A., Martin, M., Orchard, S., Magrane, M., Ahmad, S., Alpi, E., Bowler-Barnett, E. H., Britto, R., Bye-A-Jee, H., Cukura, A., Denny, P., Dogan, T., Ebenezer, T., Fan, J., Garmiri, P., Da Costa Gonzales, L. J., Hatton-Ellis, E., Hussein, A., Ignatchenko, A., … Sundaram, S. (2022). UniProt: the Universal Protein Knowledgebase in 2023. Nucleic Acids Research, 51(D1), D523–D531.

  5. Lin, Z., Akin, H., Rao, R., Hie, B., Zhu, Z., Lu, W., Smetanin, N., Verkuil, R., Kabeli, O., Shmueli, Y., Costa, A. D. S., Fazel-Zarandi, M., Sercu, T., Candido, S., & Rives, A. (2023). Evolutionary-scale prediction of atomic-level protein structure with a language model. Science, 379(6637), 1123–1130.

  6. Elnaggar, A., Heinzinger, M., Dallago, C., Rehawi, G., Wang, Y., Jones, L., Gibbs, T., Feher, T., Angerer, C., Steinegger, M., Bhowmik, D., & Rost, B. (2021). ProtTrans: Toward Understanding the Language of Life Through Self-Supervised Learning. IEEE Transactions on Pattern Analysis and Machine Intelligence, 44(10), 7112–7127.

  7. Gonzalez Somermeyer, L., Fleiss, A., Mishin, A. S., Bozhanova, N. G., Igolkina, A. A., Meiler, J., Alaball Pujol, M. E., Putintseva, E. V., Sarkisyan, K. S., & Kondrashov, F. A. (2022). Heterogeneity of the GFP fitness landscape and data-driven protein design. eLife, 11, e75842.

  8. Ofer, D., Brandes, N., & Linial, M. (2021). The language of proteins: NLP, machine learning & protein sequences. Computational and structural biotechnology journal, 19, 1750–1758.

  9. Lambert, T.J. FPbase: a community-editable fluorescent protein database. Nat Methods 16, 277–278 (2019).

  10. Yang, K. K., Wu, Z., Bedbrook, C. N., & Arnold, F. H. (2018). Learned protein embeddings for machine learning. Bioinformatics, 34(15), 2642–2648.

  11. Tibshirani, R. (1996). Regression Shrinkage and Selection via the Lasso. Journal of the Royal Statistical Society. Series B (Methodological), 58(1), 267–288.

  12. Predicting the Effects of Mutations on Protein Function with ESM-2. (n.d.).

  13. Tsien R. Y. (1998). The green fluorescent protein. Annual review of biochemistry, 67, 509–544.

  14. Kathuria, C., Mehrotra, D., & Misra, N. K. (2018). Predicting the protein structure using random forest approach. Procedia Computer Science, 132, 1654–1662.

  15. Svetnik, V., Liaw, A., Tong, C., Culberson, J. C., Sheridan, R. P., & Feuston, B. P. (2003). Random Forest: a classification and regression tool for compound classification and QSAR modeling. Journal of Chemical Information and Computer Sciences, 43(6), 1947–1958.

  16. Peng, D., Gui, Z., & Wu, H. (2023). Interpreting the Curse of Dimensionality from Distance Concentration and Manifold Effect. ArXiv, abs/2401.00422.

  17. Valentini, G., Malchiodi, D., Gliozzo, J., Mesiti, M., Soto-Gomez, M., Cabri, A., Reese, J., Casiraghi, E., & Robinson, P. N. (2023). The promises of large language models for protein design and modeling. Frontiers in Bioinformatics, 3.

  18. Ian Goodfellow, Yoshua Bengio, & Aaron Courville (2016). Deep Learning. MIT Press.

  19. Henikoff, S., & Henikoff, J. G. (1992). Amino acid substitution matrices from protein blocks. Proceedings of the National Academy of Sciences of the United States of America, 89(22), 10915–10919.

  20. Jadon, A., Patil, A., Jadon, S. (2024). A Comprehensive Survey of Regression-Based Loss Functions for Time Series Forecasting. In: Sharma, N., Goje, A.C., Chakrabarti, A., Bruckstein, A.M. (eds) Data Management, Analytics and Innovation. ICDMAI 2024. Lecture Notes in Networks and Systems, vol 998. Springer, Singapore.

  21. Kingma, D. P., & Ba, J. (2014). Adam: A method for stochastic optimization.

  22. Hu, T., & Lei, Y. (2022). Early Stopping for Iterative Regularization with General Loss Functions.

  23. Takuya Akiba, Shotaro Sano, Toshihiko Yanase, Takeru Ohta, and Masanori Koyama. 2019. Optuna: A Next-generation Hyperparameter Optimization Framework. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining (KDD ‘19). Association for Computing Machinery, New York, NY, USA, 2623–2631.

  24. Minot, M., & Reddy, S. T. (2022). Nucleotide augmentation for machine learning-guided protein engineering. Bioinformatics Advances, 3(1).

  25. Yang, J., Zhou, K., Li, Y. et al. Generalized Out-of-Distribution Detection: A Survey. Int J Comput Vis 132, 5635–5662 (2024).

  26. Yue Zhou, Chenlu Guo, Xu Wang, Yi Chang, & Yuan Wu. (2024). A Survey on Data Augmentation in Large Model Era.

  27. Shen, H., Price, L. C., Bahadori, T., & Seeger, F. (2021). Improving Generalizability of Protein Sequence Models with Data Augmentations.

  28. Geirhos, R., Jacobsen, J., Michaelis, C., Zemel, R., Brendel, W., Bethge, M., & Wichmann, F. A. (2020). Shortcut learning in deep neural networks. Nature Machine Intelligence, 2(11), 665–673.

  29. Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., & Salakhutdinov, R. (2014). Dropout: a simple way to prevent neural networks from overfitting. J. Mach. Learn. Res., 15(1), 1929–1958.

  30. Saeys, Y., Inza, I., & Larrañaga, P. (2007). A review of feature selection techniques in bioinformatics. Bioinformatics (Oxford, England), 23(19), 2507–2517.

  31. Lialin, V., Deshpande, V., & Rumshisky, A. (2023). Scaling Down to Scale Up: A guide to Parameter-Efficient Fine-Tuning. arXiv (Cornell University).

  32. Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, & Weizhu Chen. (2021). LoRA: Low-Rank Adaptation of Large Language Models.

  33. Rives, A., Meier, J., Sercu, T., Goyal, S., Lin, Z., Liu, J., Guo, D., Ott, M., Zitnick, C. L., Ma, J., & Fergus, R. (2021). Biological structure and function emerge from scaling unsupervised learning to 250 million protein sequences. Proceedings of the National Academy of Sciences, 118(15).

  34. Hie, B., Zhong, E. D., Berger, B., & Bryson, B. (2021). Learning the language of viral evolution and escape. Science (New York, N.Y.), 371(6526), 284–288.

  35. Lingling Xu, Haoran Xie, Si-Zhao Joe Qin, Xiaohui Tao, & Fu Lee Wang. (2023). Parameter-Efficient Fine-Tuning Methods for Pretrained Language Models: A Critical Review and Assessment.

  36. Tewhey, R., Kotliar, D., Park, D. S., Liu, B., Winnicki, S., Reilly, S. K., Andersen, K. G., Mikkelsen, T. S., Lander, E. S., Schaffner, S. F., & Sabeti, P. C. (2016). Direct Identification of Hundreds of Expression-Modulating Variants using a Multiplexed Reporter Assay. Cell, 165(6), 1519–1529.

  37. Shorten, C., & Khoshgoftaar, T. M. (2019). A survey on Image Data Augmentation for Deep Learning. Journal of Big Data, 6(1).

  38. Hsu, C., Nisonoff, H., Fannjiang, C., & Listgarten, J. (2022). Learning protein fitness models from evolutionary and assay-labeled data. Nature biotechnology, 40(7), 1114–1122.

  39. Salazar, J., Liang, D., Nguyen, T., & Kirchhoff, K. (2020). Masked Language Model Scoring. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics.

  40. Madani, A., Krause, B., Greene, E.R. et al. Large language models generate functional protein sequences across diverse families. Nat Biotechnol 41, 1099–1106 (2023).

  41. Daniel Jurafsky, & James H. Martin (2025). Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition, with Language Models.

Slide 1Slide 2Slide 3Slide 4Slide 5

© 2025 - Content on this site is licensed under a Creative Commons Attribution 4.0 International license

The repository used to create this website is available at gitlab.igem.org/2025/tsukuba.