Overview

Here we describe the entire PROTEUS workflow, from the conception of the project and its model to the success of the build.

The PROTEUS project is structured around the Design-Build-Test-Learn (DBTL) cycle, a cornerstone of iterative optimization in synthetic biology [1][2].

As illustrated below, by integrating our dry-lab modeling with targeted wet-lab experiments, we established a feedback loop that accelerates protein optimization. Furthermore, through our newly proposed HP-4R cycle, we continuously refine and renew the project throughout the iterative DBTL process.

PROTEUS DBTL Cycle
Figure 1. The PROTEUS DBTL cycle

Design

Data Profiling & Transformation

Our data foundation is a robust, large-scale corpus comprising 50 distinct protein datasets curated and processed from the ProteinGym benchmark, a comprehensive collection of deep mutational scanning (DMS) assays [3]. This diverse corpus encompasses thousands of variants across multiple protein families, functionally categorized according to the experimental assays used (e.g., thermal stability, binding affinity, growth). This structured approach ensures that models learn generalizable principles of protein fitness rather than overfitting to individual targets [4].

By curating DMS assays and harmonizing labels by assay type (e.g., activity, thermal stability, binding), we generated a multi-task training corpus suitable for both zero-shot and supervised settings. Benchmarks such as ProteinGym provide large and diverse ground truth (millions of mutants across hundreds of assays), enabling robust evaluation and supervised fine-tuning. The data were transformed through deduplication, mapping of assay readouts to normalized fitness scores, metadata harmonization (assay type, organism, experimental conditions), and sequence-level filtering (length cutoffs, low-complexity masking).
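
To make these transformation steps concrete, the sketch below shows one way the pipeline could be expressed in pandas. The column names (sequence, assay_id, raw_score, assay_type) and the per-assay z-scoring are illustrative assumptions, not the actual schema of our corpus.

```python
# Illustrative data-transformation pipeline (column names are hypothetical).
import pandas as pd

def transform_dms(df: pd.DataFrame, max_len: int = 1022) -> pd.DataFrame:
    # 1. Deduplicate identical (sequence, assay) records
    df = df.drop_duplicates(subset=["sequence", "assay_id"])
    # 2. Map raw assay readouts to normalized fitness scores
    #    (per-assay z-scoring so scores are comparable across assays)
    df["fitness"] = df.groupby("assay_id")["raw_score"].transform(
        lambda s: (s - s.mean()) / s.std()
    )
    # 3. Harmonize metadata into a controlled vocabulary of assay types
    df["assay_type"] = df["assay_type"].str.lower().map(
        {"activity": "activity", "tm": "stability", "binding": "binding"}
    )
    # 4. Sequence-level filtering: drop over-long sequences
    #    (low-complexity masking would be applied at this stage as well)
    df = df[df["sequence"].str.len() <= max_len]
    return df.dropna(subset=["fitness", "assay_type"])
```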

Data Profiling & Transformation
Figure 2. Data profiling & transformation: Functional classification of protein datasets based on selection assay methods

Strategy Selection

Pre-trained protein language models (PLMs), such as the ESM series, provide transferable sequence embeddings and masked-prediction capabilities that encode structural constraints. Fine-tuning on DMS data adapts these general-purpose representations into task-specific fitness predictors [5].

Our model selection focuses on ESM-2, which demonstrates robust zero-shot capability and supports masked-token prediction for variant generation [6]. We plan to employ supervised fine-tuning with contrastive learning on positive and negative samples. This approach not only enables the model to recognize patterns associated with high-fitness sequences but also teaches it to avoid features linked to poor functionality, thereby significantly enhancing its discriminative power [7].

Point-by-point scanning mask prediction is selected for generating systematic variants: masked-token prediction is efficient, interpretable (yielding per-site scores), and compatible with the training objectives of PLMs; similar strategies have been applied successfully in protein design and variant prediction [8][9]. For a target sequence, PROTEUS sequentially masks each residue position and queries the fine-tuned PLM to propose alternative amino acids, mapping the predicted substitutions into an interpretable variant library that outperforms random masking strategies.
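
A minimal sketch of this scanning loop, written against the Hugging Face transformers interface to ESM-2, is shown below. The public 35M checkpoint and the top-k cutoff are placeholders; in PROTEUS the queries go to the fine-tuned model.

```python
# Point-by-point scanning mask prediction (checkpoint is a placeholder).
import torch
from transformers import AutoTokenizer, EsmForMaskedLM

MODEL = "facebook/esm2_t12_35M_UR50D"  # stand-in for our fine-tuned weights
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = EsmForMaskedLM.from_pretrained(MODEL).eval()

def scan(seq: str, top_k: int = 3):
    """Mask each residue in turn; return (position, wt, proposal, logit) tuples."""
    enc = tokenizer(seq, return_tensors="pt")
    proposals = []
    for pos in range(len(seq)):
        ids = enc["input_ids"].clone()
        ids[0, pos + 1] = tokenizer.mask_token_id  # +1 skips the BOS token
        with torch.no_grad():
            logits = model(input_ids=ids).logits[0, pos + 1]
        for tok in logits.topk(top_k).indices.tolist():
            aa = tokenizer.convert_ids_to_tokens(tok)
            if len(aa) == 1 and aa != seq[pos]:  # keep true substitutions only
                proposals.append((pos, seq[pos], aa, logits[tok].item()))
    return proposals
```

Because every proposal carries its position and logit, the resulting library is directly rankable and traceable back to individual sites.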

Point-by-point Scanning Strategy
Figure 3. Point-by-point scanning mask prediction strategy

Build

Training the Scoring Function

First, we conducted benchmark tests and ablations on ProteinGym to ensure that the scoring function correlates with DMS readouts and improves upon zero-shot baselines.

The training of the scoring function proceeded through iterative DBTL cycles and involved two construction and optimization phases: (i) task-specific regression heads (e.g., XGBoost/LightGBM or simple MLPs) trained on ESM-2 embeddings to calibrate predictions per assay; and (ii) integration of the per-assay scorers to harmonize assay types into a unified, standardized cross-protein ranking score. This "computational referee" enables direct comparison of candidate designs across different functional objectives.
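
Phase (i) can be sketched as follows, assuming mean-pooled ESM-2 embeddings and an XGBoost head; the checkpoint name and hyperparameters are illustrative defaults rather than our tuned configuration.

```python
# Phase (i) sketch: per-assay regression head on mean-pooled ESM-2 embeddings.
import numpy as np
import torch
from scipy.stats import spearmanr
from transformers import AutoTokenizer, EsmModel
from xgboost import XGBRegressor

MODEL = "facebook/esm2_t12_35M_UR50D"  # placeholder checkpoint
tokenizer = AutoTokenizer.from_pretrained(MODEL)
encoder = EsmModel.from_pretrained(MODEL).eval()

def embed(seqs):
    """Mean-pooled last-layer embeddings, one vector per sequence."""
    vecs = []
    for s in seqs:
        inputs = tokenizer(s, return_tensors="pt")
        with torch.no_grad():
            h = encoder(**inputs).last_hidden_state[0, 1:-1]  # drop BOS/EOS
        vecs.append(h.mean(dim=0).numpy())
    return np.stack(vecs)

def fit_assay_head(train_seqs, train_y, test_seqs, test_y):
    """Calibrate one regression head per assay; report rank agreement with DMS."""
    head = XGBRegressor(n_estimators=500, max_depth=6, learning_rate=0.05)
    head.fit(embed(train_seqs), train_y)
    preds = head.predict(embed(test_seqs))
    return head, spearmanr(preds, test_y).correlation
```

Phase (ii) would then standardize the per-assay predictions (e.g., z-scoring) before merging them into the unified cross-protein ranking score.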

Scoring Function Process
Figure 4. Scoring function process

Model Fine-tuning

Model fine-tuning was iteratively refined through the DBTL cycle, progressing through four major construction and optimization phases:

Model Fine-tuning
Figure 5. Model fine-tuning
Initial Validation and Failure Modes

Early experiments fine-tuned ESM2-650M on a small set of pure positive sequences to test feasibility. Although the fine-tuned model generated some improved variants (7/10 test cases showed improvement), the outputs exhibited a concerning lack of diversity (homogenization/overfitting): different inputs yielded nearly identical suggestions, indicating that the model memorized "high-activity motifs" rather than learning discriminative features useful across sequence contexts.

Contrastive and Integrated Fine-Tuning (Core Innovation)

To prevent overfitting and teach the model to distinguish effective from ineffective variants, an integrated contrastive learning dataset was constructed, combining high-activity (positive) and low-activity (negative) sequences from all 50 DMS datasets (>100k examples). During fine-tuning, the model learned to increase the relative likelihood of positive examples while decreasing that of negative examples, thereby improving discriminative capability and generalization to unseen proteins [10]. This contrastive setup explicitly taught the model which local and global patterns correspond to fitness gains and which represent pitfalls to be avoided, significantly enhancing downstream design performance [11].
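
One plausible way to express this objective is a pairwise ranking loss over the model's sequence log-likelihoods, as sketched below; the exact loss and the pairing of positives with negatives used in PROTEUS may differ.

```python
# Pairwise contrastive objective over PLM log-likelihoods (one plausible
# formulation; the exact training loss used in PROTEUS may differ).
import torch
import torch.nn.functional as F

def sequence_log_likelihood(logits, input_ids):
    """Sum of per-token log-probabilities of the observed residues."""
    logp = logits.log_softmax(dim=-1)
    tok_lp = logp.gather(-1, input_ids.unsqueeze(-1)).squeeze(-1)
    return tok_lp.sum(dim=-1)  # one scalar per sequence in the batch

def contrastive_loss(pos_logits, pos_ids, neg_logits, neg_ids, margin=1.0):
    """Push positive sequences to higher likelihood than paired negatives."""
    ll_pos = sequence_log_likelihood(pos_logits, pos_ids)
    ll_neg = sequence_log_likelihood(neg_logits, neg_ids)
    # softplus(margin - gap) is a smooth hinge on the likelihood gap
    return F.softplus(margin - (ll_pos - ll_neg)).mean()
```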

Training Procedure and Regularization

For large-scale fine-tuning, a shift was made to ESM2-35M to accelerate the cycle. To avoid overfitting to the heterogeneous dataset, an "indirect early stopping" strategy was adopted: a single pass over the data (epochs = 1) was performed, but validation loss was evaluated after each optimization step, and the checkpoint with the minimum validation loss was retained. This approach maintained rapid iteration while preventing late-stage overfitting to noisy DMS labels.
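
In sketch form, the indirect early stopping loop looks like this; train_loader, eval_fn, and the Hugging Face-style loss access are placeholders for our actual training objects, and evaluating the full validation set after every step is shown literally as described.

```python
# "Indirect early stopping": one epoch, validate after every step,
# keep the checkpoint with the minimum validation loss.
import copy
import torch

def train_one_epoch(model, optimizer, train_loader, eval_fn):
    best_loss, best_state = float("inf"), None
    model.train()
    for batch in train_loader:          # epochs = 1: a single pass
        optimizer.zero_grad()
        loss = model(**batch).loss
        loss.backward()
        optimizer.step()

        val_loss = eval_fn(model)       # validation loss after each step
        if val_loss < best_loss:        # retain the best checkpoint so far
            best_loss = val_loss
            best_state = copy.deepcopy(model.state_dict())

    model.load_state_dict(best_state)   # roll back to the minimum-val-loss step
    return model, best_loss
```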

Automated Hyperparameter Tuning (Proof of Concept)

Based on feedback from HP-4R, we developed a prototype reinforcement-learning-driven dynamic learning-rate agent (ε-greedy policy) that multiplicatively adjusts the learning rate in response to the validation loss. On a held-out dataset (BLAT_ECOLX_Firnberg_2014), the RL scheduler achieved a lower validation loss (8.4e-3) than the standard cosine scheduler (1e-2), indicating potential for automated, adaptive training in future production workflows; it was not fully integrated into the pipeline due to time and resource constraints.
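
The sketch below shows how such an ε-greedy agent can wrap a standard optimizer; the action set, ε value, and reward definition are illustrative assumptions rather than the prototype's exact settings.

```python
# ε-greedy multiplicative learning-rate agent (action set, ε, and the
# reward definition are illustrative; the actual prototype may differ).
import random

class EpsilonGreedyLR:
    ACTIONS = (0.5, 1.0, 2.0)  # multiplicative adjustments to the LR

    def __init__(self, optimizer, eps=0.1):
        self.opt, self.eps = optimizer, eps
        self.q = {a: 0.0 for a in self.ACTIONS}  # running reward estimates
        self.n = {a: 0 for a in self.ACTIONS}
        self.prev_loss = None

    def step(self, val_loss):
        if self.prev_loss is not None:
            # Reward the previous action by the drop in validation loss
            reward = self.prev_loss - val_loss
            self.n[self.last] += 1
            self.q[self.last] += (reward - self.q[self.last]) / self.n[self.last]
        self.prev_loss = val_loss
        # ε-greedy choice: mostly exploit the best-looking multiplier
        if random.random() < self.eps:
            self.last = random.choice(self.ACTIONS)
        else:
            self.last = max(self.q, key=self.q.get)
        for group in self.opt.param_groups:
            group["lr"] *= self.last
```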

Wet-lab Build

The gene sequences of the top-scoring variants and their wild-type counterparts were reverse-translated and codon-optimized (e.g., with NovoPro). Primers were designed in SnapGene, and the selected vector backbone was prepared by PCR-based backbone amplification and ligation. For instance, to construct the β-lactamase variants, the genes were cloned into the pHT01-ATAAAA plasmid, replacing the native ampicillin resistance gene. The constructed plasmids were verified by colony PCR and sequencing before transformation into competent E. coli cells (or, where appropriate, B. subtilis) for expression, ensuring direct assessment of protein phenotypes and avoiding confounding background properties such as intrinsic drug resistance.

For the β-lactamase constructs, agarose gel electrophoresis confirmed successful PCR amplification of the gene fragment (~800 bp) and the plasmid backbone (~7 kb), and colony PCR confirmed successful ligation.

Because the case study focused on β-lactamase, an antibiotic resistance determinant, risk-minimization measures were implemented: non-clinical laboratory strains, physical containment, selection at lower-risk antibiotic concentrations, and prior approval from the Institutional Biosafety Committee.

Electrophoresis Results
Figure 6. Electrophoresis results of the PCR synthesized gene and the product after PCR amplification of the plasmid backbone

Test

Scoring Results & Structure Prediction

Validation Metrics: A rigorous three-tier benchmark (s3 [experimental score] > s2 [control score] > s1 [baseline score]) was adopted as the gold standard for "successful engineering." The merged scoring function was used to rank the candidate pool, and a modification was considered successful only when the condition s3 > s2 > s1 was strictly met, robustly demonstrating the added value of our fine-tuning strategy. In our primary β-lactamase case study, tests on 500 low-activity sequences from the relevant dataset (A4GRB6_PSEAI_Chen_2020) showed that our method improved the performance scores of 71.4% of these sequences. Across datasets, the approach yielded statistically significant gains over both random baselines and zero-shot controls, demonstrating the effectiveness of the PROTEUS pipeline.
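
Expressed in code, the strict criterion and the resulting success rate are straightforward; the (s1, s2, s3) triples below are placeholders for the baseline, control, and experimental scores of each candidate.

```python
# Strict three-tier success criterion (score triples are placeholders).
def is_success(s1: float, s2: float, s3: float) -> bool:
    """Experimental > control > baseline, all strictly."""
    return s3 > s2 > s1

def success_rate(triples) -> float:
    """Fraction of (s1, s2, s3) candidates meeting the criterion."""
    hits = sum(is_success(s1, s2, s3) for s1, s2, s3 in triples)
    return hits / len(triples)
```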

Success Rate Across Protein Families
Figure 7. Macro-Scale Generalizability: Success Rate Across Protein Families

Wet-lab Test

Initially, the wet-lab team used structure prediction tools such as AlphaFold to assist validation, aiming to detect gross structural disruptions (e.g., loss of the domain core, disruption of active-site geometry). Candidates with poor predicted structural plausibility were deprioritized, a strategy that reduced wasted wet-lab resources.

Predicted Protein Structure
Figure 8. Predicted protein structure of optimized beta-lactamase (neurosnap-Alphafold2 online tool)

Subsequently, tailored experimental plans were drawn up based on each protein's characteristics, selectively performing protein expression, purification, and functional validation. Functional readouts were chosen according to the property of interest: activity assays (enzyme kinetics), thermal stability (ΔTm by differential scanning fluorimetry, DSF), binding (ELISA or biolayer interferometry, BLI), or growth-based selection (colony counting of β-lactamase-expressing strains on antibiotic plates).

As an example, experimental validation was conducted via an antibiotic resistance assay: Escherichia coli strains transformed with plasmids carrying either the optimized β-lactamase gene or the native β-lactamase gene, together with an untransformed blank control, were cultured on ampicillin plates. Bacterial colony growth was observed, quantified, and averaged across replicates. The strain expressing the optimized variant yielded a mean of 233 colonies, compared with 117 for the native-gene control and 10 for the untransformed blank. The higher survival conferred by the optimized β-lactamase variant demonstrates the success of the modeling approach in this case study.

Culture Status on Ampicillin Medium
Figure 9. Culture status on ampicillin-containing medium

Learn

Results Analysis

The test results provided clear validation of the PROTEUS pipeline. Summary statistics from the modeling group included macro-level tests showing improvement rates (e.g., the 71.4% success rate on the target set), which guided model weight updates and score recalibration. On ampicillin-containing medium, strains expressing optimized β-lactamase variants produced significantly more colonies, indicating enhanced antibiotic resistance conferred by PROTEUS-designed variants and confirming the model's predictive accuracy.

All wet-lab measurements were fed back into the scoring pipeline. For each tested variant, predicted scores were compared with measured outcomes, and correlations were calculated to classify improved versus non-improved variants and to validate effects at each site (see the sketch below). Experimental results (DMS-style readouts at the tested positions) were added to the training corpus and used to retrain or further fine-tune the scoring function via transfer learning.

ProteinGym-style cross-validation guided the detection of overfitting and the selection of retraining strategies. This iterative approach helped capture context-dependent effects and partially addressed the epistasis identified in the wet-lab data.
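
The prediction-versus-measurement comparison can be sketched as follows; the score arrays, the wild-type reference, and the median-based split into model-favored variants are illustrative placeholders.

```python
# Comparing predicted scores against wet-lab measurements (arrays are
# placeholders for the variants actually tested in a given round).
import numpy as np
from scipy.stats import spearmanr

def feedback_metrics(predicted, measured, wt_measured):
    predicted, measured = np.asarray(predicted), np.asarray(measured)
    rho = spearmanr(predicted, measured).correlation  # rank agreement
    improved = measured > wt_measured                 # improved vs. non-improved
    # Fraction of model-favored variants that actually improved on wild type
    precision = improved[predicted > np.median(predicted)].mean()
    return rho, precision
```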

Model Optimization

The integrated scoring function was retrained using the new wet-lab labels. Combinatorial and iterative strategies were explored to capture epistatic effects (e.g., nested rounds of scanning performed on backgrounds already containing validated beneficial mutations). Additionally, scaling to larger foundation models and multimodal (sequence and structure) representations was implemented to better capture interaction effects. Plans were also documented for the integration of scoring functions, iterative model fine-tuning, and RL-based hyperparameter optimization.

Future work in our project will focus on further improvements, including the exploration of more advanced base models such as ESM3 and the development of multi-objective optimization capabilities to simultaneously meet diverse design requirements such as activity, stability, and expressibility. We will also actively incorporate automated experimental platforms and high-throughput screening methods to significantly enhance the efficiency and data output of the DBTL cycle, thereby continuously improving the model's practical design performance.

References

[1] National Academies of Sciences, Engineering, and Medicine. (2025). The Age of AI in the Life Sciences: Benefits and Biosecurity Considerations. Chapter 2, Design-Build-Test-Learn: Impact of AI on the Synthetic Biology Process. Washington, DC: National Academies Press. Available from: https://www.ncbi.nlm.nih.gov/books/NBK614601/

[2] Li, W., Mao, Z., Xiao, Z., Liao, X., Koffas, M., Chen, Y., Ma, H., & Tang, Y. J. (2025). Large language model for knowledge synthesis and AI-enhanced biomanufacturing. Trends in biotechnology, 43(8), 1864–1875. https://doi.org/10.1016/j.tibtech.2025.02.008

[3] ProteinGym. Registry of Open Data on AWS. Available from: https://registry.opendata.aws/proteingym/

[4] Notin, P., Kollasch, A. W., Ritter, D., van Niekerk, L., Paul, S., Spinner, H., Rollins, N., Shaw, A., Weitzman, R., Frazer, J., Dias, M., Franceschi, D., Orenbuch, R., Gal, Y., & Marks, D. S. (2023). ProteinGym: Large-Scale Benchmarks for Protein Design and Fitness Prediction. bioRxiv : the preprint server for biology, 2023.12.07.570727. https://doi.org/10.1101/2023.12.07.570727

[5] Lin, Z., Akin, H., Rao, R., Hie, B., Zhu, Z., Lu, W., Smetanin, N., Verkuil, R., Kabeli, O., Shmueli, Y., Dos Santos Costa, A., Fazel-Zarandi, M., Sercu, T., Candido, S., & Rives, A. (2023). Evolutionary-scale prediction of atomic-level protein structure with a language model. Science (New York, N.Y.), 379(6637), 1123–1130. https://doi.org/10.1126/science.ade2574

[6] Lin, Z., Akin, H., Rao, R., Hie, B., Zhu, Z., Lu, W., Smetanin, N., Verkuil, R., Kabeli, O., Shmueli, Y., Dos Santos Costa, A., Fazel-Zarandi, M., Sercu, T., Candido, S., & Rives, A. (2023). Evolutionary-scale prediction of atomic-level protein structure with a language model. Science, 379(6637), 1123–1130. https://doi.org/10.1126/science.ade2574

[7] Schmirler, R., Heinzinger, M. & Rost, B. Fine-tuning protein language models boosts predictions across diverse tasks. Nat Commun 15, 7407 (2024). https://doi.org/10.1038/s41467-024-51844-2

[8] Marquet, C., Schlensok, J., Abakarova, M., Rost, B., & Laine, E. (2024). Expert-guided protein language models enable accurate and blazingly fast fitness prediction. Bioinformatics (Oxford, England), 40(11), btae621. https://doi.org/10.1093/bioinformatics/btae621

[9] Akpinaroglu, D., Seki, K., Guo, A., Zhu, E., Kelly, M. J. S., & Kortemme, T. (2023). Structure-conditioned masked language models for protein sequence design generalize beyond the native sequence space. bioRxiv, 2023.12.15.571823. https://doi.org/10.1101/2023.12.15.571823

[10] Singh, R., Sledzieski, S., Bryson, B., Cowen, L., & Berger, B. (2023). Contrastive learning in protein language space predicts interactions between drugs and protein targets. Proceedings of the National Academy of Sciences of the United States of America, 120(24), e2220778120. https://doi.org/10.1073/pnas.2220778120

[11] Fram, B., Su, Y., Truebridge, I., Riesselman, A. J., Ingraham, J. B., Passera, A., Napier, E., Thadani, N. N., Lim, S., Roberts, K., Kaur, G., Stiffler, M. A., Marks, D. S., Bahl, C. D., Khan, A. R., Sander, C., & Gauthier, N. P. (2024). Simultaneous enhancement of multiple functional properties using evolution-informed protein design. Nature communications, 15(1), 5141. https://doi.org/10.1038/s41467-024-49119-x