Building on the method design and implementation detailed in the previous chapters, we now turn to validating our results. This chapter systematically presents the performance of our PROTEUS workflow in large-scale computational experiments and provides an in-depth, multi-dimensional analysis of the results, aiming to validate the effectiveness, generality, and practical value of our method from both macro and micro perspectives.

Macro Performance Validation: Broad-Scale Testing Based on 50 Proteins and 25,000+ Generated Sequences

To evaluate the general effectiveness of our model fine-tuned with "integrated contrastive learning" (ESM-2 35M) and our "point-by-point scanning" sequence generation strategy, we conducted a large-scale computational validation experiment whose breadth far exceeds case studies on a single protein.

Test Scope

Our test set covered all 50 previously processed ProteinGym datasets. From these 50 datasets, we selected a total of several thousand original low-activity sequences as starting points for modification.
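
As an illustration of how such starting points can be drawn, the sketch below samples low-scoring sequences from a ProteinGym-style DMS file. The column names, quantile cutoff, and sample size are illustrative assumptions rather than the exact settings we used.

```python
import pandas as pd

def select_low_activity(csv_path: str, quantile: float = 0.25,
                        n_samples: int = 100, seed: int = 0):
    """Pick low-activity starting sequences from one ProteinGym-style DMS file.

    Assumes columns named 'mutated_sequence' and 'DMS_score'; adjust the
    names if the local copy of the dataset differs.
    """
    df = pd.read_csv(csv_path)
    cutoff = df["DMS_score"].quantile(quantile)   # activity threshold
    low = df[df["DMS_score"] <= cutoff]           # bottom fraction = "low activity"
    picked = low.sample(n=min(n_samples, len(low)), random_state=seed)
    return picked["mutated_sequence"].tolist()
```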

Experiment Scale and Output

For each selected low-activity sequence, we applied our complete "point-by-point scanning" modification workflow. Through this high-throughput computational process, we generated and evaluated over 25,000 new designed candidate sequences. This library of generated sequences is the core output of our work and the data basis for all subsequent analysis.
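
The sketch below illustrates the core loop of such a point-by-point scan: every position of the starting sequence is substituted with every alternative residue, each single-point mutant is scored, and the top-scoring variants are kept. The `score_fn` argument is a stand-in for our merged scoring function, and the mutation-naming scheme is illustrative.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def point_by_point_scan(seq: str, score_fn, keep_top: int = 5):
    """Minimal sketch of a point-by-point scan over one starting sequence.

    For every position, every single-residue substitution is proposed and
    scored with `score_fn`; the highest-scoring single-point mutants are
    returned as (score, mutation label, mutant sequence) tuples.
    """
    candidates = []
    for i, wild_type in enumerate(seq):
        for aa in AMINO_ACIDS:
            if aa == wild_type:
                continue
            mutant = seq[:i] + aa + seq[i + 1:]
            label = f"{wild_type}{i + 1}{aa}"       # e.g. 'A42G' (1-based index)
            candidates.append((score_fn(mutant), label, mutant))
    candidates.sort(key=lambda c: c[0], reverse=True)
    return candidates[:keep_top]
```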

Core Evaluation Metric and Overall Result

We applied the strict three-sequence comparative evaluation framework described earlier (s3 > s2 > s1) to determine whether each modification counts as a "success." Statistical analysis of these tens of thousands of "virtual experiments" yields a clear macro-level conclusion: across tests spanning 50 different protein families, our method achieved an average success rate significantly higher than the random baseline.
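
To make the success criterion concrete, the minimal sketch below counts how many score triples satisfy the gold standard. Here s1, s2, and s3 denote the three scores of the comparative framework described above; the helper names and the triple layout are illustrative.

```python
def is_success(s1: float, s2: float, s3: float) -> bool:
    """Gold-standard check from the three-sequence framework: s3 > s2 > s1."""
    return s3 > s2 > s1

def success_rate(triples):
    """Fraction of (s1, s2, s3) score triples that satisfy the gold standard."""
    if not triples:
        return 0.0
    hits = sum(is_success(s1, s2, s3) for s1, s2, s3 in triples)
    return hits / len(triples)
```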

Figure: Macro-scale generalizability (success rate across protein families).
Key Data Point: Taking the A4GRB6_PSEAI_Chen_2020 dataset, which we examined most closely, as an example, we randomly selected 500 of its low-activity sequences for independent testing. The statistics show that 357 sequences, after modification by our fine-tuned model, ended with a scoring function score (s3) higher than their original score (s1), a success rate of 71.4%. This strongly indicates that, at least in this protein family, our method can systematically shift low-activity sequences toward higher activity. Although success rates on other proteins fluctuated, the overall trend remained consistently positive.

This macro result is of great significance, as it indicates that our "generalized" fine-tuning strategy is successful. By training on diverse data, our model has indeed learned some transferable, universal optimization principles that can be applied across protein families, making it not just an "A4GRB6 expert," but a "protein optimization assistant" with broader application potential.

From Results to Application: Screening High-Quality Candidate Sequences for Wet Lab Validation

The ultimate goal of our work is to serve real biological research and engineering applications. Therefore, we strictly rank and screen the computationally generated sequence library to provide a concrete, actionable list of candidate sequences for our team's wet lab component.

Screening Criteria
  • Scoring Function Ranking: We first rank all 25,000+ generated sequences in descending order of the predicted scores from our trained "merged scoring function."
  • Meeting the Gold Standard: We keep only sequences that strictly satisfy the s3 > s2 > s1 gold standard, to ensure that their performance improvement is real and attributable to our fine-tuning strategy.
  • Synthesizability and Expressibility Considerations: We use bioinformatics tools (such as ProtParam) to perform a preliminary evaluation of the top-ranked sequences, excluding those with extreme physicochemical properties (such as a very high or low isoelectric point, or a reliance on many rare codons), to increase their probability of successful expression in subsequent wet lab experiments. A minimal sketch of the combined screening follows this list.
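
The sketch below assumes each candidate is stored as a record with its sequence and the three scores; the record layout, the isoelectric-point window, and the use of Biopython's ProtParam module are illustrative assumptions. Only the isoelectric-point filter is shown; the other physicochemical checks follow the same pattern.

```python
from Bio.SeqUtils.ProtParam import ProteinAnalysis

def screen_candidates(records, pi_range=(4.0, 10.0), top_k=5):
    """Minimal screening sketch for generated candidate sequences.

    `records` is assumed to be a list of dicts with keys
    'sequence', 's1', 's2', 's3' (names are illustrative).
    Steps: keep gold-standard passes (s3 > s2 > s1), drop sequences with an
    extreme predicted isoelectric point, then return the top_k by s3.
    """
    passed = [r for r in records if r["s3"] > r["s2"] > r["s1"]]
    filtered = []
    for r in passed:
        pi = ProteinAnalysis(r["sequence"]).isoelectric_point()
        if pi_range[0] <= pi <= pi_range[1]:
            filtered.append(r)
    filtered.sort(key=lambda r: r["s3"], reverse=True)
    return filtered[:top_k]
```
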
Final Delivery

After the multi-layer screening described above, we ultimately selected 3-5 optimal single-point mutation candidate sequences for each of several key research proteins, including A4GRB6_PSEAI_Chen_2020, D7PM05_CLYGR_Somermeyer_2022, and GFP_AEQVI_Sarkisyan_2016. This list has been delivered to our team's wet lab group, and the related gene synthesis, protein expression, and functional validation work is currently underway. We eagerly await the final convergence of computational design and real-world function.