Models and Methodology | BIT-LLM

Overall Technical Roadmap

The core technical framework of our PROTEUS project is a multi-stage computational workflow for high-throughput, intelligent optimization of protein sequences. Its design philosophy is to learn universal laws from massive data and then perform fine-grained modifications for specific tasks. The workflow takes one or more original protein sequences as input and, through modification by an AI model, outputs a series of new candidate sequences that have undergone rigorous computational evaluation and show higher predicted performance, providing high-quality targets for subsequent wet lab validation [2]. It can be summarized in the following five interconnected, logically progressive steps:

1. Large-scale Data Foundation Construction and Multi-functional Scoring Function Training

This is the cornerstone of all our work. We did not limit ourselves to a single protein but systematically collected and organized 50 different protein datasets from the ProteinGym benchmark database. We performed in-depth functional classification of this data and, based on this, developed an iterative training strategy for scoring functions, moving from separate to unified models.

2. Base Model Selection

We chose the cutting-edge ESM-2 protein language model as the core computational engine of our work, leveraging its powerful sequence understanding and generation capabilities.

3. Iterative, Generalized Model Fine-tuning

We did not directly use the pre-trained model. Instead, through a carefully designed, multi-round iterative fine-tuning process, we injected knowledge learned from 50 datasets into the model, making it "adapt" to a broader range of protein optimization tasks.

4. High-throughput Sequence Generation and Modification

We ultimately established a sequence modification strategy based on Point-by-Point Scanning Mask Prediction. This strategy allows for a comprehensive and detailed exploration of the target sequence, systematically generating over 25,000 new candidate sequences.

5. Rigorous Performance Evaluation and Wet Lab Candidate Selection

We established a three-way comparative computational evaluation system, which compares the original sequences, the sequences modified by the baseline model, and the sequences modified by the fine-tuned model, using our trained scoring function to rank the large number of generated sequences. Finally, from the top-ranking sequences for several key proteins such as A4GRB6_PSEAI_Chen_2020, D7PM05_CLYGR_Somermeyer_2022, and GFP_AEQVI_Sarkisyan_2016, we selected high-scoring mutants for wet lab validation.

The following sections explain the technical details, decision-making rationale, and iterative thought process behind each step.

Foundational Components: Data, Scoring, and Base Model

We firmly believe that any successful machine learning application originates from a deep understanding and clever utilization of high-quality data. Therefore, before building and training any model, we invested a great deal of effort into constructing a robust data foundation, developing an accurate scoring function, and selecting a powerful base model.

Data Sourcing from ProteinGym

To ensure the breadth and reproducibility of our work, we chose ProteinGym as our primary data source. ProteinGym is a public database that provides standardized benchmark tests for protein language models and machine learning models. It aggregates a large amount of published Deep Mutational Scanning (DMS) experimental data. DMS experiments can measure the fitness scores of tens of thousands of protein variants (usually single-point mutations) at once, providing us with valuable large-scale, high-quality sequence-function labeled data. Our team systematically selected and processed 50 different protein datasets from the ProteinGym database, ranging from viral proteins [5] and fluorescent proteins to various metabolic enzymes. This exposes our model to a diverse range of protein families, structures, and functions, laying the foundation for it to learn more generalizable principles.

Functional Classification of Datasets

The "performance" or "fitness" of a protein is a highly context-dependent concept. For an enzyme, performance might mean higher catalytic efficiency (kcat/Km); for an antibody, it could be stronger antigen-binding affinity; and for an industrial protein, it might be superior thermal stability. Directly predicting all these vastly different physicochemical properties with a single model is extremely difficult and imprecise. Therefore, our primary strategy is "divide and conquer." With 50 datasets in hand, we designed the following two-level classification standard:

Primary Standard - Experimental Method (selection_assay): We used the selection_assay field recorded in each dataset as the primary classification criterion. This field describes the biological method used in the DMS experiment to screen for high-activity variants. For example, variants that remain active after heat treatment are classified as "Thermostability"[3]; variants that promote cell growth on a specific substrate are classified as "Growth-based Selection."
Secondary Standard - Literature Source: For some datasets where the selection_assay description was vague or unique, we grouped them with other datasets from the same published study (reference). We assumed that the protein functions and evaluation systems focused on within the same study have inherent consistency.

Through this fine-grained classification process, we successfully organized the 50 datasets into several functional categories, including Stability, Growth, Viral Replication, Fluorescence, and Binding Affinity.

Functional classification of protein datasets based on selection assay methods
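The two-level rule described above can be sketched as a small lookup with a fallback. This is only an illustration: the keyword table, the `dataset_id` key, and the "Unclassified" label are hypothetical, and only `selection_assay` and `reference` are field names taken from the text.

```python
from typing import Optional

# Hypothetical keyword rules; the exact wording the team matched on is not
# given in the text beyond the examples quoted above.
ASSAY_KEYWORDS = {
    "Stability": ("stability", "heat", "thermo"),
    "Growth": ("growth",),
    "Fluorescence": ("fluorescence",),
    "Binding Affinity": ("binding",),
    "Viral Replication": ("viral", "replication"),
}


def classify_by_assay(selection_assay: str) -> Optional[str]:
    """Primary standard: map the selection_assay text to a category."""
    text = selection_assay.lower()
    for category, keywords in ASSAY_KEYWORDS.items():
        if any(keyword in text for keyword in keywords):
            return category
    return None  # description is vague or unique


def classify_datasets(records: list[dict]) -> dict[str, str]:
    """Apply the primary standard, then fall back to the literature source."""
    assigned: dict[str, str] = {}
    by_reference: dict[str, str] = {}
    unresolved = []
    for record in records:
        category = classify_by_assay(record["selection_assay"])
        if category is None:
            unresolved.append(record)
        else:
            assigned[record["dataset_id"]] = category
            by_reference.setdefault(record["reference"], category)
    # Secondary standard: inherit the category of a dataset from the same study.
    for record in unresolved:
        assigned[record["dataset_id"]] = by_reference.get(
            record["reference"], "Unclassified"
        )
    return assigned
```

A dataset whose assay text matches no keyword is grouped with another dataset from the same study when one exists, which mirrors the assumption that a single study uses a consistent evaluation system.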

The Scoring Function: A "Virtual Experimenter"

The role of the scoring function in our workflow is that of a "computational biologist" or "virtual experimenter." After the model generates a new protein sequence, we need a reliable tool to quickly predict its performance. The development of our scoring function went through a significant iterative process.

Phase One: Separated, Specific Scoring Functions

In the initial phase, we followed the "divide and conquer" strategy, independently training a dedicated scoring function for each functional category (e.g., "Thermostability"). We used a classic and highly efficient machine learning model, specifically Ridge Regression, which is a form of regularized linear regression. This model was chosen for its fast training speed, strong interpretability, and its effectiveness in handling high-dimensional feature data (like sequence embeddings) by using regularization to prevent overfitting. We used sequence embeddings from a pre-trained ESM-2 model as input features and DMS fitness scores as the training target. The advantage of this method was its high specificity. However, it resulted in high management costs and data silos, as a model trained for "Thermostability" could not learn from "Binding Affinity" data, despite potential shared underlying principles.
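For reference, the Ridge Regression objective mentioned above can be written out as follows. The notation is ours and not from the original text: X is the matrix of sequence embeddings (one row per variant), y is the vector of DMS fitness scores, w is the weight vector, and α is the regularization strength. The intercept is omitted for brevity, which assumes centered data.

```latex
\hat{w} \;=\; \arg\min_{w} \; \lVert y - Xw \rVert_2^2 \;+\; \alpha \lVert w \rVert_2^2,
\qquad
\hat{w} \;=\; \left( X^{\top} X + \alpha I \right)^{-1} X^{\top} y .
```

The penalty term α‖w‖² shrinks the weights, which is what keeps the fit stable when the embedding dimension is large relative to the number of labeled variants.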

4-panel evaluation plots for two proteins demonstrating model robustness and generalization ability

Phase Two: Merged, Generalized Scoring Function

To overcome these limitations, we entered the second phase: merged training. We hypothesized that different biological functions may share more fundamental, universal sequence patterns. For example, good "enzymatic activity" likely requires a degree of "structural stability." We merged datasets from functionally similar categories to form a larger training set, which was then used to train a single, more powerful scoring function.

Scoring Function Process

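The training and evaluation code for the scoring function is shown below:

import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split


def train_and_evaluate(X, y, test_size=0.2, random_state=42):
    """Train a Ridge regression model on embeddings and evaluate it."""
    print("\n--- Model training and evaluation ---")
    # Hold out a fraction of the data for testing.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=test_size, random_state=random_state
    )
    print(f"Training set size: {len(X_train)}")
    print(f"Test set size: {len(X_test)}")

    # L2-regularized linear regression on the embedding features.
    model = Ridge(alpha=0.1)
    model.fit(X_train, y_train)

    y_test_pred = model.predict(X_test)
    test_r2 = r2_score(y_test, y_test_pred)
    test_rmse = np.sqrt(mean_squared_error(y_test, y_test_pred))

    print("\nModel performance on the test set:")
    print(f"  R² (coefficient of determination): {test_r2:.4f}")
    print(f"  RMSE (root mean squared error): {test_rmse:.4f}")

    return model, (y_test, y_test_pred)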

Below are the results of our similarity analysis on the stability-related category datasets, which we used as a basis for clustering and merging.

Similarity analysis of stability-related protein datasets for clustering and merging

This merged scoring function became the primary evaluation tool in our workflow. By training on richer and more diverse data, its generalization ability was significantly enhanced, providing a reliable basis for screening candidates.

The Core Engine: ESM-2 Protein Language Model

The core engine of all our computational work is the ESM-2 model, developed by the Meta AI team [6]. ESM-2 is a language model based on the Transformer architecture, specifically designed for protein sequences. Through self-supervised pre-training on a massive database of hundreds of millions of real protein sequences (like UniRef), it has learned the complex patterns of evolution, structure, and function embedded in amino acid sequences [1].

It comprehends both local grammar (adjacent amino acids) and long-range co-evolutionary relationships (global semantics). This "contextual understanding" enables ESM-2 to excel at its core task, Masked Language Modeling (MLM): when one or more residues are masked, the model predicts the most likely amino acid at each masked position. This is precisely the capability we need for protein sequence modification. By providing the model with a masked sequence, we are essentially asking: "Which amino acid substitution at this position would best maintain or improve the protein's overall fitness?" In this project, to balance efficiency and performance, we used two versions of the model: ESM2-650M and ESM2-35M.

Core Methodology: Iterative Fine-tuning with Contrastive Learning

While the pre-trained ESM-2 model is capable, its performance is limited by its general knowledge. Model fine-tuning is the key step to transform the model from a "generalist" to a "specialist." Our strategy evolved through a complete iterative process.

Initial Approach: Fine-tuning on Positive Samples

Goal: Our initial task was to verify the feasibility of the fine-tuning route—to prove that secondary training could make the model's performance surpass the original pre-trained model.
Dataset & Method: We selected about 300 high-activity sequences from the A4GRB6_PSEAI_Chen_2020.csv dataset to form a pure "positive example" training set. We used the esm2_t33_650M_UR50D model and trained it for 5 epochs with a learning rate of 1e-5.
Results and Reflection: A small-scale test showed that 7 out of 10 modified sequences had improved scores, indicating the potential of fine-tuning. However, a serious problem emerged: the model's output was highly homogeneous and lacked diversity. This suggested that the model had overfit to the specific high-activity patterns and lost its ability to explore a broader sequence space, which is a serious limitation for protein engineering.

Advanced Strategy: Integrated Contrastive Learning

After reviewing the problems from the first phase, we realized that a true expert knows both what is right and what is wrong. We fundamentally optimized our fine-tuning strategy.

Innovation 1: Generalized Integrated Contrastive Learning Dataset

We expanded our scope to all 50 ProteinGym datasets, creating a large-scale database of over one hundred thousand sequences. We defined high-activity sequences as "Positive Samples" and low-activity sequences as "Negative Samples." By training the model on both, it could not only learn the beneficial patterns from positive samples but also learn to avoid the harmful "non-target patterns" from negative samples. This contrastive learning approach enhanced the model's discriminative power and generalizability.
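How sequences were assigned to the positive and negative sets is not spelled out here, so the following is only one plausible sketch: label the top and bottom fitness quantiles within a single dataset and drop the ambiguous middle. The function name, the quantile cut-offs, and the decision to label per dataset (because DMS score scales differ between assays) are all assumptions.

```python
def split_positive_negative(records, low_quantile=0.25, high_quantile=0.75):
    """Label sequences from ONE dataset by fitness quantile.

    records: list of (sequence, dms_score) pairs.
    Returns (positives, negatives); mid-range sequences are dropped.
    The quantile cut-offs are illustrative, not values from the project.
    """
    scores = sorted(score for _, score in records)

    def quantile(q):
        # Nearest-rank quantile; adequate for a sketch.
        index = min(len(scores) - 1, int(q * len(scores)))
        return scores[index]

    low_cut, high_cut = quantile(low_quantile), quantile(high_quantile)
    positives = [seq for seq, score in records if score >= high_cut]
    negatives = [seq for seq, score in records if score <= low_cut]
    return positives, negatives
```

Running this separately on each dataset and concatenating the results would yield a combined positive/negative corpus of the kind described above.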

Innovation 2: Optimizing Training Efficiency and Model Selection

To accelerate iteration, we switched to the more lightweight ESM2-35M model. To prevent overfitting, we adopted an "Indirect Early Stopping" strategy. We set EPOCHS to 1, but after every training step, we evaluated the model's loss on a validation set. We only kept the model checkpoint from the step at which the validation loss reached its minimum. Compared with training for a fixed number of epochs, this selects the checkpoint by validation performance rather than by a preset schedule.
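The checkpoint-selection logic described above can be sketched independently of any deep learning framework. The callables `train_step`, `validation_loss`, and `get_state` are hypothetical stand-ins for the real training step, validation pass, and model state export.

```python
import copy


def train_one_epoch_keep_best(batches, train_step, validation_loss, get_state):
    """Run one epoch and keep the state with the lowest validation loss.

    train_step(batch): performs one optimization step (hypothetical callable).
    validation_loss(): returns the current validation loss as a float.
    get_state(): returns the current model state (e.g. a dict of weights).
    """
    best_loss = float("inf")
    best_state = None
    best_step = -1
    for step, batch in enumerate(batches):
        train_step(batch)
        loss = validation_loss()
        if loss < best_loss:
            # Snapshot the state so later steps cannot overwrite it.
            best_loss, best_step = loss, step
            best_state = copy.deepcopy(get_state())
    return best_state, best_loss, best_step
```

Evaluating after every step is costly on a large validation set; in practice one might evaluate every N steps or on a validation subset, which is a trade-off this sketch does not address.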

Contrastive learning fine-tuning workflow with positive and negative samples

Application and Validation

After obtaining a well-crafted, generalized fine-tuned model, the next step was to use it to generate new candidate sequences and rigorously evaluate its performance.

Sequence Generation: Point-by-Point Scanning Mask Prediction

We designed a systematic and interpretable strategy: Point-by-point Scanning Mask Prediction.

Core Algorithm: For a given sequence of length L, we loop from the first to the last position. In each iteration i, we replace the amino acid at position i with a <mask> token. This masked sequence is fed into our fine-tuned ESM-2 35M model. The model predicts the most likely amino acid for that position. If the predicted amino acid is different from the original and its confidence score is above a threshold, we log it as a potential beneficial mutation.
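The masking step of the scan described above can be sketched as follows. The helper name is hypothetical, and the `<mask>` string is assumed to be the mask token of whatever tokenizer is in use.

```python
def make_masked_inputs(sequence, mask_token="<mask>"):
    """Return one masked copy of the sequence per position (0-based order)."""
    return [
        sequence[:i] + mask_token + sequence[i + 1:]
        for i in range(len(sequence))
    ]


# A 3-residue toy sequence yields three masked inputs.
masked = make_masked_inputs("MKT")
# masked == ["<mask>KT", "M<mask>T", "MK<mask>"]
```

Each masked string would then be passed to the fine-tuned model, giving one list of ranked predictions per position; a sequence of length L therefore needs L forward passes, which can be batched.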

Output and Final Sequence Generation: This process generates a complete "mutation suggestion map." Based on this map, we can generate candidate sequences. In this project, we mainly focused on generating and validating high-confidence single-point mutation sequences.

Comprehensive mutation suggestion map generated by point-by-point scanning

The code implementation:

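# The function name and signature are assumed; the original excerpt shows
# only the loop body and the return statement.
def build_predicted_sequence(pipeline_output, original_aas, confidence_threshold):
    """Assemble the predicted sequence from per-position mask predictions."""
    predicted_seq = ""
    analysis = ""
    different_cnt = 0

    for i, single_pos_predictions in enumerate(pipeline_output):
        # No prediction for this position: keep the original residue.
        if not single_pos_predictions:
            predicted_seq += original_aas[i]
            continue

        # Predictions are assumed to be sorted by descending score.
        top_pred = single_pos_predictions[0]

        if top_pred['score'] < confidence_threshold:
            # Low confidence: keep the original residue.
            predicted_seq += original_aas[i]
        else:
            predicted_seq += top_pred['token_str']
            if top_pred['token_str'] != original_aas[i]:
                analysis += f"\tposition {i}: predict {top_pred['token_str']} while original {original_aas[i]} (score: {top_pred['score']:.4f})\n"
                different_cnt += 1

    analysis_header = f'\ttotal different aa count: {different_cnt}\n'
    return predicted_seq, analysis_header + analysis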

Strategy Advantages: The advantage of this method is its systematic and comprehensive nature. It ensures that the optimization potential of every single residue is examined by the model, providing valuable insights into the sequence-function relationship.
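Since the project focused on high-confidence single-point mutants, the suggestion map can be turned into candidate sequences one substitution at a time. The sketch below assumes the map is available as `(position, new_aa, score)` tuples with 0-based positions; that representation and the function name are illustrative.

```python
def single_point_mutants(sequence, suggestions):
    """Build one mutant per suggestion, ordered by descending confidence.

    suggestions: iterable of (position, new_aa, score), 0-based positions.
    Returns a list of (label, mutated_sequence, score), where the label uses
    the usual 1-based notation, e.g. "M1C".
    """
    mutants = []
    ranked = sorted(suggestions, key=lambda item: item[2], reverse=True)
    for position, new_aa, score in ranked:
        original_aa = sequence[position]
        if original_aa == new_aa:
            continue  # not a mutation
        mutated = sequence[:position] + new_aa + sequence[position + 1:]
        mutants.append((f"{original_aa}{position + 1}{new_aa}", mutated, score))
    return mutants
```

Each resulting sequence can then be scored with the scoring function and ranked against the original.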

The figure below shows the sequence obtained after performing a point-by-point scanning mask prediction on the M1C mutant using the model fine-tuned on the A4GRB6_PSEAI_Chen_2020 dataset. The residues that differ from the original sequence, which are the potential beneficial mutations revealed by our fine-tuned model, are highlighted in yellow. The blue background indicates residues that are the same as the original sequence.

Point-by-point scanning results on M1C mutant showing beneficial mutations (yellow) and unchanged residues (blue)

Scientific Validation: A Three-Way Comparative Framework

To scientifically evaluate our workflow, we established a three-sequence comparative framework. For every low-activity sequence, we compared the following three scores evaluated by our merged scoring function:

s1 (Baseline Score): The score of the original low-activity sequence.
s2 (Control Score): The score of the optimal sequence generated by the untuned, original ESM-2 model.
s3 (Experimental Score): The score of the optimal sequence generated by our fine-tuned ESM2-35M model.

Our gold standard for a "successful modification" is: s3 > s2 > s1. This strict standard requires our method not only to improve upon the original but also to outperform a strong control based on the same base model, which demonstrates the added value of our contrastive learning fine-tuning strategy. The figure below is a flowchart of our simulated experimental evaluation process.

Three-sequence comparative evaluation framework flowchart
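The success criterion s3 > s2 > s1 is simple to state in code. The helper names below are hypothetical; the scores are whatever the merged scoring function returns for the three sequences.

```python
def is_successful(s1, s2, s3):
    """True when the fine-tuned model beats the control, which beats the original."""
    return s3 > s2 > s1


def success_rate(score_triples):
    """Fraction of (s1, s2, s3) triples that meet the strict criterion."""
    triples = list(score_triples)
    if not triples:
        return 0.0
    return sum(is_successful(*triple) for triple in triples) / len(triples)
```

Note that the criterion is strict: a case where the fine-tuned model improves on the original but the untuned model makes it worse (s3 > s1 > s2) does not count as a success.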

Below are the evaluation results for the model fine-tuned on the CSN4_MOUSE_Tsuboyama_2023_1UFM dataset, shown as box plots and violin plots. The quantiles, mean, and overall distribution of scores for sequences modified by the fine-tuned model are higher than those of both the original sequences and the sequences modified by the original model.

Box plots and violin plots showing improved performance distribution of fine-tuned model

Below are the specific score plots of the modification results on this dataset. Each red scatter point indicates that the sequence score represented by the vertical axis is better than the sequence score represented by the horizontal axis, while yellow scatter points indicate the opposite.

Comparative score analysis plots (red points indicate better performance, yellow points indicate worse performance)

Frontier Exploration: Reinforcement Learning for Automated Hyperparameter Tuning

In addition to our main research, we explored using Reinforcement Learning (RL) to automate the choice of the Learning Rate.

Motivation: We aimed to build an "intelligent agent" that can autonomously and dynamically decide the optimal learning rate based on real-time training feedback, using our protein-specific scoring function for more biologically meaningful decisions.
RL Framework:
- State: The current training epoch.
- Action Set: Adjust the learning rate: {no change, increase by 20%, decrease by 20%}.
- Reward Function: The average score predicted by our protein scoring function on a validation subset.
- Policy: An ε-greedy strategy (ε=0.2) to balance exploitation and exploration.
Implementation and Results: We implemented a Q-learning framework. A comparative experiment on the BLAT_ECOLX_Firnberg_2014 dataset showed that the RL agent reached a validation loss of 8.4e-3, lower than the 1e-2 obtained with a standard cosine annealing strategy.

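The agent is implemented as follows (the values of ALPHA and GAMMA are placeholders, since they are not reported in the text):

import numpy as np

EPSILON = 0.2  # exploration rate (ε = 0.2, as stated above)
ALPHA = 0.1    # Q-learning step size; placeholder value, not from the text
GAMMA = 0.9    # discount factor; placeholder value, not from the text


class LearningRateAgent:
    def __init__(self, num_epochs, n_actions):
        # Q-table with shape [epoch][action]
        self.Q = np.zeros((num_epochs, n_actions))
        self.num_epochs = num_epochs
        self.n_actions = n_actions

    def select_action(self, epoch, epsilon=EPSILON):
        """ε-greedy choice of the learning-rate action for this epoch."""
        print("RL policy for the current epoch:", end=" ")
        if np.random.rand() < epsilon:
            action_index = np.random.randint(self.n_actions)
            print(f"explore, action {action_index}")
            return action_index
        best_action = np.argmax(self.Q[epoch])
        print(f"exploit, best action so far {best_action}")
        return best_action

    def update(self, epoch, action, reward):
        """Q-learning update."""
        if epoch < self.num_epochs - 1:
            best_next = np.max(self.Q[epoch + 1])
        else:
            best_next = 0  # the last epoch has no successor
        self.Q[epoch, action] += ALPHA * (reward + GAMMA * best_next - self.Q[epoch, action])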
In a specific experiment, we used an initial learning rate of 1e-3 and trained an agent over a 5-epoch training cycle. We took the optimal learning rate adjustment strategy for each epoch as the final result, obtaining the learning rate sequence for the 5 epochs: [1.2e-3, 1.2e-3, 9.6e-4, 7.68e-4, 7.68e-4].
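The reported schedule is consistent with the ±20% action set applied to the initial rate of 1e-3. The sketch below reproduces it; the action sequence is inferred from the reported learning rates, not taken from the project logs, and the names are illustrative.

```python
# Multiplicative effect of each action on the learning rate.
ACTION_FACTORS = {"no change": 1.0, "increase": 1.2, "decrease": 0.8}


def apply_actions(initial_lr, actions):
    """Return the learning rate used in each epoch after applying its action."""
    learning_rates = []
    lr = initial_lr
    for action in actions:
        lr *= ACTION_FACTORS[action]
        learning_rates.append(lr)
    return learning_rates


# Inferred action sequence for the five epochs.
schedule = apply_actions(
    1e-3, ["increase", "no change", "decrease", "decrease", "no change"]
)
print([f"{lr:.3g}" for lr in schedule])
```

Up to floating-point rounding, this gives 1.2e-3, 1.2e-3, 9.6e-4, 7.68e-4, 7.68e-4, matching the sequence above.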

Learning rate scheduling strategy learned by reinforcement learning agent

In deep learning, it is common to gradually decrease the learning rate in the later stages of training to avoid oscillation near the optimal solution and to obtain more stable training results. The schedule learned by our reinforcement learning agent is consistent with this prior knowledge.

Conclusion: This exploration demonstrated the potential of RL for automating and optimizing model training. While not fully integrated into our final workflow due to time constraints, it serves as a technical proof-of-concept for future work towards a more intelligent, automated AI protein engineering platform.

References

[1] Hayes T, Rao R, Akin H, et al. Simulating 500 million years of evolution with a language model [J]. Science, 2025: eads0018. DOI: 10.1126/science.ads0018.

[2] Qian H, Wang Y, Zhou X, et al. ESM-Ezy: a deep learning strategy for the mining of novel multicopper oxidases with superior properties [J]. Nature Communications, 2025, 16(1): 3274. DOI: 10.1038/s41467-025-58521-y.

[3] Mahas A, Marsic T, Lopez-Portillo Masson M, et al. Characterization of a thermostable Cas13 enzyme for one-pot detection of SARS-CoV-2 [J]. Proceedings of the National Academy of Sciences of the United States of America, 2022, 119(28): e2118260119. DOI: 10.1073/pnas.2118260119.

[4] Kelly J R, Rubin A J, Davis J H, et al. Measuring the activity of BioBrick promoters using an in vivo reference standard [J]. Journal of Biological Engineering, 2009, 3(1): 4. DOI: 10.1186/1754-1611-3-4.

[5] Xu C, Zhou Y, Xiao Q, et al. Programmable RNA editing with compact CRISPR–Cas13 systems from uncultivated microbes [J]. Nature Methods, 2021, 18(5): 499-506. DOI: 10.1038/s41592-021-01124-4.

[6] Bepler T, Berger B. Learning the protein language: Evolution, structure, and function [J]. Cell Systems, 2021, 12(6): 654-669.e3. DOI: 10.1016/j.cels.2021.05.017.

[7] Shetty R P, Endy D, Knight T F. Engineering BioBrick vectors from BioBrick parts [J]. Journal of Biological Engineering, 2008, 2(1): 5.