Model | CAFA-Beijing

1 Overview

In the Model section, we use a series of AI-driven de novo protein design tools, including RFdiffusion, ProteinMPNN, AutoDock Vina and PRODIGY, to semi-rationally design a mutation library whose prototype is MSH-α. We then screen the library for targeted MSH-α analogues, using affinity and biocompatibility as indicators. Finally, the sequences that perform well on every virtual screening indicator are pooled and tested experimentally one by one to obtain the product we want.

2 Background Information

The initial requirement of the protein design section was to design a short peptide drug that can inhibit melanin production. From our study of the melanin synthesis pathway and the activity of the proteins involved, we learned that melanin production is triggered when MSH-α binds to MC1R and activates it through a conformational change, which then sets off a cascade of downstream metabolic reactions.

Through further study, we transformed the requirements into: designing a short peptide drug that can specifically bind to MC1R and inhibit its conformational activation.

The purpose of the Model section is to use a series of AI-driven de novo protein design tools to semi-rationally design a mutation library whose prototype is MSH-α, and then to screen it for targeted MSH-α analogues using affinity and biocompatibility as indicators.

Fig.1 Part of the MSH-α analogues designed through model

Because protein sequences and spatial structures are so diverse, a fully enumerated mutation library contains an extremely large number of sequences. The Model section narrows the library in an efficient and reasoned way, down to a size that our team's laboratory can test experimentally one sequence at a time.

Of the two indicators, affinity is determined by the Gibbs free energy of binding between MC1R and the short peptide, with the stability of the MC1R-peptide complex used as a second criterion. Biocompatibility is judged from the quantitative results of direct contact experiments, by analyzing the changes in cell morphology observed at the material contact area and in adjacent regions.

The tools used in the Model section are all open source and can be found on GitHub.

Codon usage was optimized with the preferred codons of the E. coli B strain. Because the expressed sequence is relatively short, each amino acid was assigned its single preferred codon instead of sampling codons from the usage probability distribution.
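The "one preferred codon per amino acid" strategy described above can be sketched as follows. This is a minimal illustration, not the script we actually ran: the codon table lists codons commonly reported as most frequent in E. coli and is an assumption, and the function name `back_translate` and the example peptide are hypothetical. The table should be replaced with the usage table of the actual expression strain.

```python
# Illustrative "one amino acid -> one preferred codon" table.
# ASSUMPTION: codons commonly reported as most frequent in E. coli;
# swap in the codon usage table of the real expression strain.
PREFERRED_CODON = {
    "A": "GCG", "R": "CGT", "N": "AAC", "D": "GAT", "C": "TGC",
    "Q": "CAG", "E": "GAA", "G": "GGC", "H": "CAT", "I": "ATT",
    "L": "CTG", "K": "AAA", "M": "ATG", "F": "TTT", "P": "CCG",
    "S": "AGC", "T": "ACC", "W": "TGG", "Y": "TAT", "V": "GTG",
}


def back_translate(peptide: str, stop_codon: str = "TAA") -> str:
    """Back-translate a peptide using one fixed codon per residue."""
    peptide = peptide.strip().upper()
    unknown = sorted(set(peptide) - set(PREFERRED_CODON))
    if unknown:
        raise ValueError(f"unsupported residues: {unknown}")
    return "".join(PREFERRED_CODON[aa] for aa in peptide) + stop_codon


if __name__ == "__main__":
    # Hypothetical example peptide, not one of the designed analogues.
    print(back_translate("MKWG"))
```

For a short peptide this deterministic mapping is simple and reproducible; sampling from the full codon usage distribution mainly matters for longer genes, where repeating one codon many times can become a problem.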

REFERENCES
Cell Research (2021) 31:1061–1071

3.1 RFdiffusion

In 2016, Baker et al. pointed out in their review that the era of de novo protein design was approaching. For a protein 200 amino acids long, the total sequence space contains 20^200 sequences. Proteins that evolved through stepwise mutation and natural selection are not uniformly distributed in this space; instead, they cluster into dispersed families. Earlier human modifications of natural proteins were merely extensions of natural evolution and were not systematic. De novo design explores the whole sequence space and generates proteins unrelated to natural ones. They also pointed out that structure prediction, fixed-backbone protein design and de novo design are all optimization problems: each constructs an energy function and searches for the solution with the lowest overall protein energy.
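To give a feeling for these numbers, the short sketch below computes the size of the sequence space for a given length. The function names are ours and purely illustrative; the 13-residue case is included because native α-MSH is 13 amino acids long.

```python
def sequence_space_size(length: int, alphabet_size: int = 20) -> int:
    """Number of distinct sequences of the given length."""
    return alphabet_size ** length


def order_of_magnitude(n: int) -> int:
    """Power of ten of a positive integer (number of digits minus one)."""
    return len(str(n)) - 1


if __name__ == "__main__":
    for length in (13, 200):
        size = sequence_space_size(length)
        print(f"length {length}: about 10^{order_of_magnitude(size)} sequences")
```

Even for a 13-residue peptide the space holds roughly 8 x 10^16 sequences, far more than can be synthesized and tested, which is why the library has to be narrowed computationally.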

RFdiffusion is a protein design model developed by David Baker and his team. It uses deep learning to design protein structures with specific functions de novo. Given known binding hotspots on a target protein, it can design binders of different lengths that interact strongly with the target.

By contrast, physics-based protein design relies on constructing energy functions: it uses molecular dynamics and statistical thermodynamics to build the potential energy surface over the conformational space of a protein, and then searches for local minima to find the optimal backbone and sequence.

Fig.2 Part of the result through RFdiffusion

RFdiffusion writes its results in PDB format. Because the model designs the backbone structure rather than the sequence, the designed residues are written as placeholder residues (glycine) and carry no meaningful sequence information. The sequence therefore has to be recovered with ProteinMPNN.
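Before passing a design to ProteinMPNN, it can be useful to check what sequence information a PDB file actually contains. The sketch below reads the one-letter sequence of every chain from the CA atoms of the fixed-column `ATOM` records. It is a minimal parser written for illustration, and the function name `chain_sequences` is hypothetical; it is not part of RFdiffusion.

```python
THREE_TO_ONE = {
    "ALA": "A", "ARG": "R", "ASN": "N", "ASP": "D", "CYS": "C",
    "GLN": "Q", "GLU": "E", "GLY": "G", "HIS": "H", "ILE": "I",
    "LEU": "L", "LYS": "K", "MET": "M", "PHE": "F", "PRO": "P",
    "SER": "S", "THR": "T", "TRP": "W", "TYR": "Y", "VAL": "V",
}


def chain_sequences(pdb_text: str) -> dict[str, str]:
    """Return the one-letter sequence of each chain, read from CA atoms."""
    chains: dict[str, list[str]] = {}
    seen = set()
    for line in pdb_text.splitlines():
        # Fixed PDB columns: atom name 13-16, residue name 18-20, chain 22.
        if not line.startswith("ATOM") or line[12:16].strip() != "CA":
            continue
        chain_id = line[21]
        residue_key = (chain_id, line[22:27])  # number + insertion code
        if residue_key in seen:
            continue  # skip alternate locations of the same residue
        seen.add(residue_key)
        one_letter = THREE_TO_ONE.get(line[17:20].strip(), "X")
        chains.setdefault(chain_id, []).append(one_letter)
    return {cid: "".join(residues) for cid, residues in chains.items()}
```

A typical call would be `chain_sequences(open("design_0.pdb").read())`, where the file name is only an example. A designed chain that comes back as a run of `G` has not been assigned a sequence yet.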

REFERENCES
Nature volume 639, pages225–231 (2025)

3.2 ProteinMPNN

Developed in 2022 by David Baker's team at the University of Washington, ProteinMPNN (Protein Message Passing Neural Network) is a deep learning-based protein sequence design model. It represents the protein as a graph and uses message passing to map a three-dimensional protein structure to an amino acid sequence, a task known as the "inverse folding problem".

Its core advantage is that it efficiently generates native-like protein sequences and can design sequences for complex structures that do not exist in nature, and it performs very well in both computational and experimental tests. The amino acids at different positions can be coupled within a single chain or across multiple chains, which is why it is widely used in current protein design.

ProteinMPNN not only outperforms the traditional Rosetta method in native sequence recovery, but can also rescue designs that previously failed. By removing an efficiency bottleneck with cutting-edge AI technology, it has significant implications for protein engineering, drug design, enzyme design and other fields.

We used ProteinMPNN to recover the amino acid sequence information missing from the PDB files. In this way, the ligand sequence was recovered after one round of fuzzy processing, which yields a semi-rationally designed mutation library.
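Collecting the designed sequences into a library can be sketched as below. The code assumes, as in the ProteinMPNN output files we are aware of, that results are written as FASTA records whose header is a comma-separated list of `key=value` fields including `score`, that the first record describes the input structure, and that a lower score is better. These are assumptions to verify against the actual output; the function names are hypothetical.

```python
def parse_fasta(text: str) -> list[tuple[str, str]]:
    """Parse FASTA text into (header, sequence) pairs."""
    records, header, chunks = [], None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def header_fields(header: str) -> dict[str, str]:
    """Split a 'key=value, key=value' header into a dictionary."""
    fields = {}
    for part in header.split(","):
        if "=" in part:
            key, value = part.split("=", 1)
            fields[key.strip()] = value.strip()
    return fields


def build_library(text: str, skip_first: bool = True) -> list[tuple[str, float]]:
    """Unique sequences with their best (lowest) score, best first."""
    records = parse_fasta(text)
    if skip_first:
        records = records[1:]  # ASSUMPTION: first record is the input structure
    best: dict[str, float] = {}
    for header, sequence in records:
        score = float(header_fields(header).get("score", "inf"))
        if sequence not in best or score < best[sequence]:
            best[sequence] = score
    return sorted(best.items(), key=lambda item: item[1])
```

Deduplicating at this stage keeps identical sequences sampled more than once from being docked repeatedly in the later screening steps.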

Fig.3 Part of the raw result through ProteinMPNN

3.3 Autodock Vina

AutoDock is an open-source molecular docking program developed by the Olson Laboratory at the Scripps Research Institute in the late 1980s. It was created to meet the needs of computer-aided drug design in studying interactions between biological macromolecules and small molecules. Its development built on advances in X-ray diffraction, which made the structures of biological macromolecules available, and it gives researchers a tool for predicting how ligands bind to receptors.

AutoDock Vina is a molecular docking program also developed at Scripps, in the Molecular Graphics Laboratory (MGL). Compared with AutoDock 4, AutoDock Vina improves the average accuracy of binding mode prediction, speeds up the search by using a simpler scoring function, and still gives reproducible docking results for systems with approximately 20 rotatable bonds.

We used AutoDock Vina to predict the Gibbs free energy of binding between the receptor protein and the ligand peptide, which provides the criterion for assessing the ligand's affinity. To reduce prediction error, the ten best binding energies of each sequence are retained. These results serve as the main screening criterion.
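Retaining and summarizing the best binding energies per sequence can be sketched as follows. The code assumes that each docked pose in the Vina output PDBQT file is preceded by a `REMARK VINA RESULT:` line whose first number is the predicted affinity in kcal/mol; the function names and the choice of the mean as a summary are our illustration, not a fixed part of the pipeline.

```python
import re
from statistics import mean

# First number after "REMARK VINA RESULT:" is the affinity in kcal/mol.
VINA_RESULT = re.compile(
    r"^REMARK VINA RESULT:\s+(-?\d+(?:\.\d+)?)", re.MULTILINE
)


def vina_affinities(pdbqt_text: str, top_n: int = 10) -> list[float]:
    """Return up to top_n affinities, most favorable (most negative) first."""
    scores = [float(value) for value in VINA_RESULT.findall(pdbqt_text)]
    return sorted(scores)[:top_n]


def summarize(pdbqt_text: str, top_n: int = 10) -> dict[str, float]:
    """Best and mean affinity over the retained binding modes."""
    scores = vina_affinities(pdbqt_text, top_n)
    if not scores:
        raise ValueError("no 'REMARK VINA RESULT' lines found")
    return {"best": scores[0], "mean": mean(scores), "n_modes": len(scores)}
```

Ranking sequences by the mean over several modes, rather than by the single best pose, makes the comparison less sensitive to one lucky docking pose.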

Fig.4 Part of the result through AutoDock Vina

In addition, to reduce overfitting to the receptor during screening, we also analyzed the pockets and contact interfaces of the receptor and predicted the Gibbs free energy of binding between the receptor protein and the ligand once more, as an independent cross-check. This part of the work was done with HDOCK and PRODIGY, with PyMOL as a visual aid.

3.4 HDOCK

HDOCK Server is a molecular docking tool developed by the Huang Laboratory at Huazhong University of Science and Technology. It predicts the binding conformations of protein-protein and protein-nucleic acid complexes and is well suited to biomedical research. The PDB files generated by HDOCK can be visualized in three dimensions with PyMOL, which allows us to analyze the interactions between residues at the receptor-ligand interface as well as the receptor's pockets.
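The interface analysis can also be reproduced numerically from a docked PDB file. The sketch below lists the residues of two chains that have any atoms within a distance cutoff of each other. The chain identifiers `A` and `B` and the 5.0 Å cutoff are assumptions chosen for illustration and should be adapted to the actual complex; the function names are hypothetical.

```python
import math


def read_atoms(pdb_text: str, chain_id: str):
    """Collect ((residue name, residue number), (x, y, z)) for one chain."""
    atoms = []
    for line in pdb_text.splitlines():
        if line.startswith("ATOM") and line[21] == chain_id:
            residue = (line[17:20].strip(), int(line[22:26]))
            xyz = (float(line[30:38]), float(line[38:46]), float(line[46:54]))
            atoms.append((residue, xyz))
    return atoms


def interface_residues(pdb_text: str, receptor_chain: str = "A",
                       ligand_chain: str = "B", cutoff: float = 5.0):
    """Residues of each chain with an atom within `cutoff` of the other."""
    receptor = read_atoms(pdb_text, receptor_chain)
    ligand = read_atoms(pdb_text, ligand_chain)
    receptor_hits, ligand_hits = set(), set()
    for rec_residue, rec_xyz in receptor:
        for lig_residue, lig_xyz in ligand:
            if math.dist(rec_xyz, lig_xyz) <= cutoff:
                receptor_hits.add(rec_residue)
                ligand_hits.add(lig_residue)

    def by_number(residue):
        return residue[1]

    return sorted(receptor_hits, key=by_number), sorted(ligand_hits, key=by_number)
```

The all-against-all loop is slow for large complexes but is adequate for a receptor with a short peptide ligand; the resulting residue lists can be compared with the contacts seen in PyMOL.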

Fig.5 Part of the result through HDOCK, visualized

Fig.6 Interface and pocket analysis result

Fig.7 Pocket information visualized by PyMOL

3.5 Prodigy

PRODIGY (PROtein binDIng enerGY prediction) is a tool that predicts the binding affinity of protein-protein complexes from their three-dimensional structures. It uses an efficient contact-based approach to estimate the binding free energy and the dissociation constant, while also giving insight into the structural determinants of protein interactions. By combining the contacts at the interface with properties of the non-interacting surface, PRODIGY makes reliable predictions, which are important for understanding molecular interactions, guiding the development of therapies and designing protein complexes.
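The binding free energy and the dissociation constant are two views of the same quantity, related by ΔG = RT ln(Kd) with Kd expressed relative to a 1 M standard state. The sketch below converts between them; the function names and the default temperature of 25 °C are our choices for illustration.

```python
import math

R_KCAL = 0.0019872  # gas constant in kcal/(mol*K)


def kd_from_dg(dg_kcal_per_mol: float, temp_celsius: float = 25.0) -> float:
    """Dissociation constant (M) from binding free energy (kcal/mol)."""
    temp_kelvin = temp_celsius + 273.15
    return math.exp(dg_kcal_per_mol / (R_KCAL * temp_kelvin))


def dg_from_kd(kd_molar: float, temp_celsius: float = 25.0) -> float:
    """Binding free energy (kcal/mol) from dissociation constant (M)."""
    temp_kelvin = temp_celsius + 273.15
    return R_KCAL * temp_kelvin * math.log(kd_molar)


if __name__ == "__main__":
    for dg in (-6.0, -8.0, -10.0):
        print(f"dG = {dg:5.1f} kcal/mol -> Kd = {kd_from_dg(dg):.2e} M")
```

Because the relation is exponential, a difference of about 1.4 kcal/mol at room temperature corresponds to roughly a tenfold change in Kd, which helps when judging whether two candidates really differ.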

To make batch operation and quick assessment easier, we used the following code to run PRODIGY in batch mode and output only the affinity result.

Fig.8 Run Prodigy in batch mode and only output the affinity result
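For readers who want a text version, a batch run of this kind could look like the sketch below. It is not the script shown in the figure. It assumes that PRODIGY is installed as the command `prodigy`, that `-q` prints only the predicted affinity and that `--selection` takes the two chain groups; please confirm these options with `prodigy --help` for the installed version. The helper names and the `A`/`B` chain labels are hypothetical.

```python
import subprocess
from pathlib import Path


def prodigy_command(pdb_path, chains=("A", "B")) -> list[str]:
    """Build the command line; the structure comes before --selection."""
    # ASSUMPTION: '-q' and '--selection' as described in the PRODIGY help.
    return ["prodigy", str(pdb_path), "-q", "--selection", *chains]


def parse_affinity(stdout: str) -> float:
    """Take the last numeric token of the output as the affinity."""
    for token in reversed(stdout.split()):
        try:
            return float(token)
        except ValueError:
            continue
    raise ValueError("no numeric affinity found in PRODIGY output")


def batch_affinities(pdb_dir, chains=("A", "B"), run=subprocess.run):
    """Run PRODIGY on every .pdb file in a folder; return {file: affinity}."""
    results = {}
    for pdb_path in sorted(Path(pdb_dir).glob("*.pdb")):
        done = run(prodigy_command(pdb_path, chains),
                   capture_output=True, text=True, check=True)
        results[pdb_path.name] = parse_affinity(done.stdout)
    return results


if __name__ == "__main__":
    # "complexes" is a placeholder folder of docked receptor-peptide models.
    ranked = sorted(batch_affinities("complexes").items(), key=lambda kv: kv[1])
    for name, affinity in ranked:
        print(f"{name}\t{affinity:.2f}")
```

The `run` parameter exists only so that the function can be tested without PRODIGY installed; in normal use the default `subprocess.run` is kept.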

4 Result

Fig.9 Cell Viability Under Co-Cultivation with Different Peptide Sequences

Fig.10 Expression of TRP1 Protein in Cells Under Different Peptide Intervention Conditions

*For more detailed information, please refer to Results.

5 Discussion and the Road Ahead

Through modeling, we successfully achieved the de novo design of MSH-α analogues. By combining virtual screening with subsequent experimental screening, we obtained an inhibitor of MC1R that was shown to have the drug effect we had anticipated.

De novo protein design is a very promising new field in drug development. With the continuous development of AI tools and the expansion of databases, we believe that it will soon become an indispensable part of drug design. We will continue to explore this field. Building on this first attempt, we hope to develop a de novo protein design tool that is simple to operate, so that more iGEM teams can enter the field.

During the experiments, we also discovered some issues with our de novo design process. One is that some sequences appear to be overfitted during screening, which can be mitigated by adjusting the filtering parameters. Another significant issue is that the model cannot tell us whether a ligand inhibits or activates the receptor. The main difference between the two lies in the effect of the ligand on the degree of phosphorylation of the receptor.

Because they are driven by static data, AI tools still cannot capture the interplay between receptor phosphorylation and ligand binding. At present, therefore, we cannot directly determine whether a ligand activates or inhibits the receptor; this information can only be obtained from phosphorylation experiments. In the future, we hope to collect and organize protein phosphorylation data and combine it with receptor and ligand information for further research on protein-protein docking, with the aim of providing a way to make a preliminary assessment of receptor phosphorylation with digital tools before synthesis.