Overview
In the grand blueprint of life, proteins are the molecular machines that carry out nearly all known biological functions. Protein engineering, the directed modification of these molecules, has become a core engine of the modern biotechnology revolution, opening possibilities from pharmaceutical development to green industrial manufacturing. This power, however, confronts an immense challenge: the vastness of sequence space. A small protein of just 100 amino acids has 20^100 possible sequences, a number far exceeding the estimated count of atoms in the observable universe. Finding a better-performing variant through traditional random mutagenesis and trial-and-error is therefore like searching for a needle in a cosmic haystack: extremely time-consuming, costly, and rarely successful.
In recent years, the rapid development of artificial intelligence, especially Large Language Models (LLMs), has brought a historic opportunity to overcome this dilemma. Researchers have discovered a profound analogy: just as human languages follow specific grammatical and semantic rules, protein sequences have also formed an intrinsic "grammar" and "semantics" through billions of years of evolution. Protein Language Models (PLMs) like ESM-2, developed by the Meta AI team, can learn these deep patterns by pre-training on hundreds of millions of real sequences. This allows them to profoundly understand the contextual information of sequences, paving the way for a new paradigm of computationally driven protein engineering.
It is against this backdrop of challenges and opportunities that our project, PROTEUS, was born. PROTEUS is a computational platform designed to accelerate protein sequence design and intelligent optimization using AI. Our core mission is not merely to apply existing models, but to develop, validate, and integrate a complete, end-to-end computational workflow. We aim to create a powerful and easy-to-use tool that empowers researchers to rapidly screen promising protein modifications through efficient computational simulations, reserving precious wet-lab resources (time, reagents, and manpower) for only the most promising candidate sequences "prophesied" by the AI.
Our core methodology is based on the cutting-edge protein language model ESM-2, and through a series of rigorous, data-driven iterations, we have established a highly efficient workflow that spans from data processing to the delivery of wet lab-ready candidates.
A Foundation of Diverse Data: The cornerstone of our work is a large-scale dataset systematically constructed by integrating and functionally classifying 50 different protein datasets from the ProteinGym benchmark database.
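To make the integration step above concrete, here is a minimal, hypothetical sketch of merging several ProteinGym-style assays under functional classes. The mapping `FUNCTION_OF_ASSAY`, the median-split labeling, and the function names are illustrative assumptions, not the project's actual code; ProteinGym assays in practice arrive as CSV files with per-variant DMS scores.

```python
# Hypothetical sketch: merge several DMS assays into one dataset grouped by
# an assumed functional class, labeling each sequence high (1) or low (0)
# activity relative to its assay's median score.
from collections import defaultdict

# Assumed mapping from assay name to a coarse functional class (illustrative).
FUNCTION_OF_ASSAY = {
    "A4GRB6_PSEAI_Chen_2020": "enzyme_activity",
    "GFP_AEQVI_Sarkisyan_2016": "fluorescence",
}

def integrate_assays(assays):
    """assays: {assay_name: [(sequence, dms_score), ...]}.
    Returns {function_class: [(sequence, label), ...]} where label=1 marks
    sequences scoring above the per-assay median (high activity)."""
    merged = defaultdict(list)
    for name, rows in assays.items():
        scores = sorted(s for _, s in rows)
        median = scores[len(scores) // 2]
        fn = FUNCTION_OF_ASSAY.get(name, "other")
        for seq, score in rows:
            merged[fn].append((seq, 1 if score > median else 0))
    return dict(merged)
```

Thresholding each assay at its own median is one simple way to make scores from 50 heterogeneous assays comparable; the real pipeline may use a different normalization.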
An Iterative "Computational Referee": On this foundation, we developed an iterative training strategy for scoring functions, evolving from separate models for specific functions to a more generalizable merged model. This provides a reliable computational referee for accurately evaluating sequence performance.
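The "computational referee" is, at its core, a model trained on (sequence, measured score) pairs that can then score unseen sequences. The toy sketch below illustrates that fit-then-score interface with a linear model over amino-acid composition trained by gradient descent; the project's actual referee is ESM-2-based, so every detail here (features, optimizer, function names) is a stand-in assumption.

```python
# Toy stand-in for a sequence scoring function: a linear model over
# amino-acid composition, fit by stochastic gradient descent on
# (sequence, measured_score) pairs. Illustrates the interface only.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
IDX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}

def featurize(seq):
    """Amino-acid composition: fraction of each residue type."""
    v = [0.0] * 20
    for aa in seq:
        v[IDX[aa]] += 1.0 / len(seq)
    return v

def fit_scorer(pairs, lr=0.5, epochs=200):
    """pairs: [(sequence, measured_score)]; returns a weight vector."""
    w = [0.0] * 20
    for _ in range(epochs):
        for seq, y in pairs:
            x = featurize(seq)
            err = sum(wi * xi for wi, xi in zip(w, x)) - y
            for i in range(20):
                w[i] -= lr * err * x[i]
    return w

def score(w, seq):
    """Predicted performance score for a candidate sequence."""
    return sum(wi * xi for wi, xi in zip(w, featurize(seq)))
```

Once fitted, `score` plays the referee's role in the workflow: every generated candidate gets a number, and only the top-ranked candidates move forward.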
Generalized Fine-Tuning: We then performed "generalized" fine-tuning of the base language model (ESM-2 35M). We innovatively integrated over one hundred thousand high-activity (positive) and low-activity (negative) sequences from all 50 datasets for contrastive learning, enabling the model to learn both universally beneficial sequence patterns and common "pitfalls" to avoid.
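One common way to realize the positive-versus-negative objective described above is a margin ranking loss that pushes the model to score high-activity sequences above low-activity ones. The sketch below computes such a loss over precomputed scores; it is an assumed illustration of the contrastive idea, not the project's actual ESM-2 fine-tuning objective.

```python
# Sketch of a contrastive (margin ranking) objective: penalize any
# negative-sample score that comes within `margin` of a positive-sample
# score. During fine-tuning this loss would be minimized over model weights.
def margin_ranking_loss(pos_scores, neg_scores, margin=1.0):
    """Mean hinge loss over all positive/negative score pairs."""
    total, n = 0.0, 0
    for p in pos_scores:
        for q in neg_scores:
            total += max(0.0, margin - (p - q))
            n += 1
    return total / n
```

When positives are already scored at least `margin` above every negative, the loss is zero; otherwise the gradient pulls the two score distributions apart, which is the behavior the contrastive fine-tuning step relies on.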
The Core Modification Algorithm: Building on this, we established our core algorithm: a Point-by-point Scanning Mask Prediction strategy. This method systematically masks each position of an original sequence in turn and predicts favorable substitutions there, comprehensively exploring potential beneficial mutations; it successfully generated over 25,000 new candidate sequences.
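The scanning strategy above can be sketched in a few lines. In this hypothetical sketch, `predict_masked` stands in for the ESM-2 masked-language-model head (an assumed interface, not the project's exact API): given a sequence and a masked position, it returns amino acids ranked by model probability, and every top prediction that differs from the wild-type residue becomes a candidate single-point mutant.

```python
# Point-by-point scanning mask prediction (sketch): mask each position of
# the wild-type sequence, ask a predictor for likely residues there, and
# collect every proposed substitution as a candidate single-point mutant.
def scan_mask_mutants(seq, predict_masked, top_k=3):
    """predict_masked(seq, pos) -> amino acids ranked by model probability
    for the masked position (a stand-in for an MLM head). Returns candidate
    mutants as (position, wt_aa, new_aa, mutated_sequence) tuples."""
    candidates = []
    for pos, wt in enumerate(seq):
        for aa in predict_masked(seq, pos)[:top_k]:
            if aa != wt:  # keep only true mutations, skip wild-type calls
                mutant = seq[:pos] + aa + seq[pos + 1:]
                candidates.append((pos, wt, aa, mutant))
    return candidates
```

With `top_k` predictions retained per position, a length-L protein yields up to L × top_k candidates in a single pass, which is how a scan over a few hundred starting sequences can produce tens of thousands of candidates for the scoring function to rank.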
The performance validation results are highly encouraging. In a test on 500 low-activity sequences from the A4GRB6_PSEAI_Chen_2020 dataset, our workflow improved the predicted performance score of 357 sequences (71.4%). More importantly, from this vast pool of generated sequences we selected high-quality single-point mutants of key proteins from datasets such as A4GRB6_PSEAI_Chen_2020 and GFP_AEQVI_Sarkisyan_2016 to serve as the final candidates for our project's wet-lab validation.
Finally, to make our powerful workflow accessible, we have encapsulated the entire process into a user-friendly web application. We aim to provide the iGEM community and the broader scientific community with a powerful and intuitive tool for intelligent protein design, effectively bridging the gap between computational prediction and experimental success.