Project Description | BIT-LLM

Introduction

Proteins play a crucial role in various fields, including medicine, chemical manufacturing, energy, agriculture, and consumer goods [1]. However, for both scientific research and industrial applications, proteins often require engineering to enhance properties such as stability, activity, selectivity, and binding affinity [2][3]. To address this need, the BIT-LLM project, PROTEUS, leverages advanced Protein Language Models (PLMs) to accelerate the process of protein optimization. We fine-tuned a transformer-based PLM (ESM-2 [4]) to enable it to distinguish between beneficial and deleterious mutations. Using a masking and scanning strategy, we iteratively masked each amino acid in the target sequence and allowed the model to predict the optimal substitution. Based on this project, BIT-LLM has developed an open-source software tool that allows users to input a target protein sequence and obtain an optimization solution with a single click, thereby providing a universal AI-driven design platform for synthetic biology.
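The masking and scanning strategy can be illustrated with a minimal Python sketch. It is not the PROTEUS implementation: the `Predictor` callback, the `MASK` placeholder, and `toy_predictor` are hypothetical stand-ins for a fine-tuned PLM, used only to show the loop that masks one residue at a time and records the top-ranked substitution.

```python
from typing import Callable, Dict, List, Tuple

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues
MASK = "<mask>"  # placeholder mask symbol (assumption for this sketch)

# Hypothetical predictor: receives the tokens with one position masked and the
# masked index, and returns a probability for each candidate amino acid.
Predictor = Callable[[List[str], int], Dict[str, float]]


def masked_scan(sequence: str, predict: Predictor) -> List[Tuple[int, str, str, float]]:
    """Mask each position in turn and record the top-ranked residue.

    Returns tuples of (1-based position, wild-type residue,
    predicted residue, predicted probability).
    """
    results = []
    for i, wild_type in enumerate(sequence):
        tokens = list(sequence)
        tokens[i] = MASK  # hide exactly one residue
        probs = predict(tokens, i)
        best = max(AMINO_ACIDS, key=lambda aa: probs.get(aa, 0.0))
        results.append((i + 1, wild_type, best, probs.get(best, 0.0)))
    return results


def toy_predictor(tokens: List[str], index: int) -> Dict[str, float]:
    """Illustrative stand-in for a model: always prefers leucine."""
    probs = {aa: 0.5 / 19 for aa in AMINO_ACIDS}
    probs["L"] = 0.5
    return probs


# Keep only positions where the predicted residue differs from the wild type.
suggestions = [
    (pos, wt, new)
    for pos, wt, new, _ in masked_scan("MKTA", toy_predictor)
    if new != wt
]
print(suggestions)
```

In practice the callback would wrap a forward pass of the fine-tuned model, and the scan would be repeated on the updated sequence to make the procedure iterative.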

Figure 1. PROTEUS navigates the vast ocean of proteins

Meaning of "PROTEUS"

PROTEUS stands for "PROtein Transform Engineered by Universal Software", the name of our general-purpose model for protein optimization. The name also alludes to Proteus, the shape-shifting sea god of ancient Greek mythology. We chose it to signify the model's potential: like its namesake, it can navigate the vast ocean of proteins and generate an endless array of optimized novel sequences.

Background

Proteins serve as the fundamental workhorses of life, functioning as efficient catalysts, precise molecular switches, and robust structural scaffolds, yet engineering them to achieve improved functions remains a significant challenge in synthetic biology [5]. For instance, a modestly sized protein of approximately 100 amino acids has a sequence space of 20^100 possible variants, so traditional directed evolution can explore only a minuscule fraction of this landscape [6]. Such conventional approaches are also labor-intensive and resource-demanding, often requiring many wet-lab iterations to confirm that a desired trait has improved [7].
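To make the scale concrete, the short Python calculation below compares the full sequence space of a 100-residue protein with the number of single and double substitutions. The variable names are illustrative; the only inputs are the 20-letter alphabet and the length of 100 from the text.

```python
import math

ALPHABET_SIZE = 20   # standard amino acids
LENGTH = 100         # residues in the example protein

full_space = ALPHABET_SIZE ** LENGTH             # exact integer, 20^100
order_of_magnitude = len(str(full_space)) - 1    # power of ten of 20^100

# Variants that differ from one fixed starting sequence:
single_mutants = LENGTH * (ALPHABET_SIZE - 1)
double_mutants = math.comb(LENGTH, 2) * (ALPHABET_SIZE - 1) ** 2

print(f"all sequences: about 10^{order_of_magnitude}")   # about 10^130
print(f"single substitutions: {single_mutants}")         # 1900
print(f"double substitutions: {double_mutants}")         # 1786950
```

The single-substitution neighborhood is small enough to score exhaustively, while the full space is far beyond any screening campaign, which is why a model that ranks candidates is useful.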

Much like a human language, a protein's sequence is a chain of letters from a finite alphabet: the 20 amino acids. These letters form secondary structures ("words"), which assemble into functional domains ("sentences") that convey a specific biological action ("meaning"). A key parallel is that both proteins and language are information-complete; the entire blueprint for a protein's structure and function is efficiently encoded in its sequence [8]. With the continuous advancement of artificial intelligence (AI), this view of proteins as a language has introduced a transformative paradigm: PLMs, pre-trained on massive sequence databases, can distill the "grammar" of proteins and capture complex evolutionary patterns, rapidly reshaping the field of protein science [9]. Trained on extensive protein sequence datasets, these models combine deep learning, generative modeling, and evolutionary principles to predict mutation effects, functions, and structures, which in turn supports the design of proteins with enhanced properties, among other applications [10][11][12].

Original Intent

We ran full-text PubMed searches for "Protein Language Model (PLM)" OR "AI-driven protein design" covering the past decade. The results show a clear increase in the number of relevant publications from 2016 to 2025, with growth accelerating after 2022. This growth is driven by the combined forces of increased computing power, algorithmic innovation, and data accumulation, and it also reflects a shift in research focus from exploring basic methods to solving practical biomedical problems. Key phases are summarized as follows:

2016–2018: Low-level Stagnation Phase

Research was scattered, focusing primarily on how to apply Natural Language Processing (NLP) techniques to protein sequence analysis, and produced preliminary models based on Long Short-Term Memory (LSTM) networks and early Transformer architectures.

2019–2021: Inflection Phase

Achievements like ESM and AlphaFold boosted attention in the field, with research beginning to focus on specific applications of protein language models, including protein structure prediction, function annotation, and mutation effect analysis.

2022–2023: Rapid Growth Phase

PLMs began to align with industrial needs (e.g., pharmaceuticals, industrial enzymes).

2024–2025: Explosive Growth Phase

The number of studies doubled, entering a phase of interdisciplinary integration and application transformation. Emphasis was placed on the integration of multi-modal data (e.g., combining 3D structural information, experimental data, and biological prior knowledge) to further enhance model performance and applicability.

Figure 2. Results of a Decade-long PubMed Search for "Protein Language Model" OR "AI-driven protein design"

Overall, the field of protein language models and AI-driven design exhibits the following prominent trends:

Integration as the Mainstream [13]: Single-sequence models are no longer sufficient to meet complex needs; multi-modal integration of sequence, structure, function, and even physicochemical properties is the core direction of current model development.
Focus on Experimental Closed Loops [14]: The value of research is ultimately validated by wet-lab experiments. The rapid iterative closed loop of "AI design → experimental validation → data feedback" has become a standard for top-tier research.
Pursuit of Practicality and Accessibility [15]: The technology is increasingly packaged as ready-to-use tools and democratized through cloud platforms and automated processes, with the goal of making these powerful AI tools accessible to a wide range of biologists.

PLM technology is developing vigorously, and the scientific community has a broad need for protein function optimization, such as improving stability, enhancing catalytic efficiency and specificity, optimizing immune regulation, and designing new functions. Therefore, after continuous discussions among our team members and instructors, we shifted our initial focus from using PLMs to optimize a specific protein to developing the PROtein Transform Engineered by Universal Software (i.e., the PROTEUS project). Through a complete and rigorous DBTL (Design-Build-Test-Learn) cycle, we validate results via wet-lab experiments and achieve continuous iteration.

Objectives

The PROTEUS project is designed to achieve the following five core objectives:

1. Universal Protein Optimization

To develop a general-purpose, AI-driven protein optimization platform capable of generating optimized novel sequences based on user-input protein sequences and specified performance metrics, such as enzymatic activity or stability.

2. Dry–Wet Lab Closed Loop

To establish a complete "DBTL" cycle, wherein AI-designed sequences are experimentally validated through wet lab experiments, and the resulting data are fed back into the model to form a continuously self-improving intelligent system.

3. Platform Accessibility and Open Source

To create a user-friendly, open-source software platform that enables synthetic biology researchers to easily utilize state-of-the-art AI-based protein design tools without requiring extensive computational expertise.

4. Innovation and Implementation of the HP-4R Cycle

Not only to adopt the HP-4R (Human Practices for Record, Reflect, Refine, and Renew) framework as the project's working methodology, but also to provide the iGEM community and the broader synthetic biology field with a systematic, replicable human practices framework, demonstrating how continuous social engagement can shape responsible technology projects.

5. Responsible Innovation

To prospectively consider and integrate AI biosafety and ethical guidelines throughout the entire project design and development lifecycle, thereby promoting responsible advancement in the field of protein design.

Results

We have successfully developed PROTEUS, a universal AI-driven protein optimization platform based on the ESM-2 protein language model. Through function-oriented fine-tuning on 50 diverse protein datasets, combined with a DMS scoring function developed in-house and AlphaFold structural validation, the platform accurately identifies key functional sites and enables targeted modification of protein sequences. Using β-lactamase as an example, we validated functional improvements in the optimized sequences through a full-cycle experimental workflow encompassing gene editing, protein expression, and functional characterization. Ultimately, the entire pipeline was packaged into an open-source web application, providing users with a one-click protein optimization solution.
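The in-house scoring function is not specified in this description. Purely as an illustration, and assuming DMS here refers to deep mutational scanning, one widely used way to score a substitution with a masked language model is the log-odds ratio between the mutant and wild-type residues at the masked position. In the sketch below, `position_probs`, `parse_mutation`, and `log_odds_score` are hypothetical names, and the probabilities are made-up numbers rather than model output.

```python
import math
import re
from typing import Dict, List, Tuple


def parse_mutation(mutation: str) -> Tuple[str, int, str]:
    """Split a label such as 'K2R' into (wild type, 1-based position, mutant)."""
    match = re.fullmatch(r"([A-Z])(\d+)([A-Z])", mutation)
    if match is None:
        raise ValueError(f"unrecognized mutation label: {mutation!r}")
    return match.group(1), int(match.group(2)), match.group(3)


def log_odds_score(sequence: str, mutation: str,
                   position_probs: List[Dict[str, float]]) -> float:
    """Return log p(mutant) - log p(wild type) at the mutated position.

    position_probs[i] holds the residue probabilities predicted when
    position i (0-based) is masked. Positive scores mean the model
    prefers the mutant over the wild-type residue.
    """
    wild_type, position, mutant = parse_mutation(mutation)
    if sequence[position - 1] != wild_type:
        raise ValueError(f"{mutation} does not match the sequence")
    probs = position_probs[position - 1]
    return math.log(probs[mutant]) - math.log(probs[wild_type])


# Made-up probabilities for a three-residue toy sequence.
sequence = "MKT"
position_probs = [
    {"M": 0.90, "L": 0.10},
    {"K": 0.20, "R": 0.40, "E": 0.05},
    {"T": 0.60, "S": 0.30},
]
candidates = ["K2R", "K2E", "T3S"]
ranked = sorted(candidates,
                key=lambda m: log_odds_score(sequence, m, position_probs),
                reverse=True)
print(ranked)
```

Ranking candidates by such a score is one simple way to separate substitutions the model favors from those it disfavors; a fine-tuned model or a different scoring rule would plug into the same interface.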

More importantly, the final form of the project was not the result of the initial design, but rather the outcome of the HP-4R cycle, which we pioneered and implemented. Through more than 20 in-depth interactions with domain experts, peer research teams, and the public, we systematically integrated external insights into each iteration of the project. This approach ensured that PROTEUS achieves not only technical sophistication but also leadership in responsible innovation.

Figure 3. Our software PROTEUS

Key Features

The PROTEUS platform exhibits the following five key features:

1. AI-Driven Directed Evolution

Leveraging the ESM-2 protein language model fine-tuned for functional guidance, the platform employs masked language modeling and reinforcement learning strategies to intelligently explore protein sequence space, enabling efficient design beyond natural evolutionary constraints.

2. Integrated Computational and Experimental Validation

AlphaFold 3 structural prediction is incorporated as a computational validation module, which, combined with wet-lab functional assays, forms a robust multi-tier design verification system.

3. User-Friendly Interface Design

The platform offers both "Novice" and "Expert" modes to accommodate users with varying levels of experience. Streamlined sequence input and intuitive result visualization make complex protein design accessible and straightforward.

4. Open-Source and Modular Architecture

Core platform code and trained models are fully open-source. The modular design enables researchers to build upon and customize our framework, fostering community collaboration and innovation.

5. Responsible Innovation Shaped by the HP-4R Cycle

Every feature of the project stems from ongoing dialogue with society. From the integration of structural validation in response to expert recommendations, to the reinforcement of safety and ethical design addressing public concerns, PROTEUS itself exemplifies the HP-4R methodology.

Impact

The PROTEUS project is expected to significantly accelerate research progress in fields such as enzyme engineering, drug development, and biomanufacturing by lowering the technical barriers to AI-driven protein design. Our open-source strategy not only provides free tools for academia but also establishes a new paradigm for early-stage R&D in industry. The project's forward-looking considerations for AI biosecurity set a model for responsible innovation in the integration of synthetic biology and AI. Ultimately, we envision PROTEUS evolving into a dynamic community platform that brings together the collective intelligence of global researchers, jointly advancing protein design from an art to a science and offering novel biotechnological solutions to address global challenges in health, environment, and energy.

Future

Looking ahead, the PROTEUS team will continue to advance in the following directions:

1. Model and Algorithm Upgrades

Exploration of more advanced foundational models, such as ESM-3, and development of multi-objective optimization capabilities to simultaneously meet multiple design requirements, including activity, stability, and expressibility.

2. Experimental Throughput and Closed-Loop Expansion

Active integration of automated experimental platforms and high-throughput screening methods to significantly enhance the efficiency and data output of the DBTL cycle, thereby continuously improving the model's practical design capabilities.

3. Platform Functionality Expansion

Development of a natural language interaction interface to enable users to describe functional requirements via text, while extending optimization targets from single proteins to protein complexes and biological pathways.

4. Community and Ecosystem Development

Through ongoing open-source collaboration, educational outreach, and industry-academia partnerships, we aim to establish PROTEUS as one of the standard tools in the field of AI-driven protein design. We will implement a mechanism for sharing and giving feedback on user-generated designs, fostering a continuously evolving ecosystem for protein optimization. We will remain responsive to community input, allowing the HP-4R cycle to guide PROTEUS toward a broader future.

References

[1] Yang, J., Li, F. Z., & Arnold, F. H. (2024). Opportunities and Challenges for Machine Learning-Assisted Enzyme Engineering. ACS central science, 10(2), 226–241. https://doi.org/10.1021/acscentsci.3c01275

[2] Acevedo-Rocha, C. G., Li, A., D'Amore, L., Hoebenreich, S., Sanchis, J., Lubrano, P., Ferla, M. P., Garcia-Borràs, M., Osuna, S., & Reetz, M. T. (2021). Pervasive cooperative mutational effects on multiple catalytic enzyme traits emerge via long-range conformational dynamics. Nature communications, 12(1), 1621. https://doi.org/10.1038/s41467-021-21833-w

[3] Zhang, D., Xu, F., Wang, F., Le, L., & Pu, L. (2025). Synthetic biology and artificial intelligence in crop improvement. Plant communications, 6(2), 101220. https://doi.org/10.1016/j.xplc.2024.101220

[4] Lin, Z., et al. (2023). Evolutionary-scale prediction of atomic-level protein structure with a language model. Science, 379, 1123–1130. https://doi.org/10.1126/science.ade2574

[5] Bepler, T., & Berger, B. (2021). Learning the protein language: Evolution, structure, and function. Cell systems, 12(6), 654–669.e3. https://doi.org/10.1016/j.cels.2021.05.017

[6] Ding, W., Nakai, K., & Gong, H. (2022). Protein design via deep learning. Briefings in bioinformatics, 23(3), bbac102. https://doi.org/10.1093/bib/bbac102

[7] Sumida, K. H., Núñez-Franco, R., Kalvet, I., Pellock, S. J., Wicky, B. I. M., Milles, L. F., Dauparas, J., Wang, J., Kipnis, Y., Jameson, N., Kang, A., De La Cruz, J., Sankaran, B., Bera, A. K., Jiménez-Osés, G., & Baker, D. (2024). Improving Protein Expression, Stability, and Function with ProteinMPNN. Journal of the American Chemical Society, 146(3), 2054–2061. https://doi.org/10.1021/jacs.3c10941

[8] Ferruz, N., Schmidt, S., & Höcker, B. (2022). ProtGPT2 is a deep unsupervised language model for protein design. Nature communications, 13(1), 4348. https://doi.org/10.1038/s41467-022-32007-7

[9] McCarthy, S., & Gonen, S. (2024). δ-Conotoxin Structure Prediction and Analysis through Large-Scale Comparative and Deep Learning Modeling Approaches. Advanced science, 11, 2404786. https://doi.org/10.1002/advs.202404786

[10] Zhang, Q., Chen, W., Qin, M., Wang, Y., Pu, Z., Ding, K., Liu, Y., Zhang, Q., Li, D., Li, X., Zhao, Y., Yao, J., Huang, L., Wu, J., Yang, L., Chen, H., & Yu, H. (2025). Integrating protein language models and automatic biofoundry for enhanced protein evolution. Nature communications, 16(1), 1553. https://doi.org/10.1038/s41467-025-56751-8

[11] Jin, S., Zeng, Z., Xiong, X., Huang, B., Tang, L., Wang, H., Ma, X., Tang, X., Shao, G., Huang, X., & Lin, F. (2025). AMPGen: an evolutionary information-reserved and diffusion-driven generative model for de novo design of antimicrobial peptides. Communications biology, 8(1), 839. https://doi.org/10.1038/s42003-025-08282-7

[12] Furui, K., Sakano, K., & Ohue, M. (2025). Predictive and therapeutic applications of protein language models. Allergology international : official journal of the Japanese Society of Allergology, S1323-8930(25)00087-5. Advance online publication. https://doi.org/10.1016/j.alit.2025.08.004

[13] Xie, C., Wei, Y., Luo, X., Yang, H., Lai, H., Dao, F., Feng, J., & Lv, H. (2025). NeXtMD: a new generation of machine learning and deep learning stacked hybrid framework for accurate identification of anti-inflammatory peptides. BMC biology, 23(1), 212. https://doi.org/10.1186/s12915-025-02314-8

[14] Ni, B., Kaplan, D. L., & Buehler, M. J. (2024). ForceGen: End-to-end de novo protein generation based on nonlinear mechanical unfolding responses using a language diffusion model. Science advances, 10(6), eadl4000. https://doi.org/10.1126/sciadv.adl4000

[15] Rotkevich, M., Viana, C., Neguembor, M. V., & Cosma, M. P. (2025). Deep learning in chromatin organization: from super-resolution microscopy to clinical applications. Cellular and molecular life sciences : CMLS, 82(1), 323. https://doi.org/10.1007/s00018-025-05837-z