2.3.1 RFdiffusion
In their 2016 review, Baker et al. argued that the era of de novo protein design was arriving. For a protein 200 amino acids long, the sequence space contains 20^200 possible sequences. Proteins that arose through stepwise mutation and natural selection are not uniformly distributed in this space; instead, they cluster into scattered families. Earlier human modifications of natural proteins were essentially extensions of natural evolution and did not explore the space systematically. De novo design explores the whole sequence space and can generate proteins unrelated to natural ones. They also pointed out that structure prediction, fixed-backbone protein design, and de novo design are all optimization problems: each constructs an energy function and searches for the state with the lowest overall energy.
RFdiffusion is a protein design model developed by David Baker and his team. It uses deep learning to design protein structures de novo for specific functions. Given known binding hotspots on a target protein, it can design binders of different lengths that interact strongly with the target.
Physics-based protein design relies on constructing energy functions. It uses molecular dynamics and statistical thermodynamics to build the potential energy surface over the conformational space of a protein, then searches for local minima to find the optimal backbone and sequence.
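To give a sense of scale, the size of this sequence space can be computed directly. The snippet below is only an illustration of the arithmetic; the variable names are ours.

```python
import math

# Number of distinct sequences for a 200-residue protein built from
# the 20 standard amino acids.
n_residue_types = 20
length = 200
space_size = n_residue_types ** length  # exact, Python integers are unbounded

# Express the result as a power of ten for readability.
exponent = length * math.log10(n_residue_types)
print(f"20^200 has {len(str(space_size))} digits, about 10^{exponent:.1f}")
```

The number has 261 digits, which is why exhaustive enumeration is out of the question and design has to be treated as a guided search.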
Fig. 2 Part of the RFdiffusion output
RFdiffusion writes its designs as PDB files. Because the model generates backbone coordinates rather than a finalized amino acid sequence, the sequence in the output file is not meaningful: designed positions are typically filled with placeholder residues such as glycine. A sequence therefore has to be assigned to each backbone with ProteinMPNN.
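One quick way to inspect such an output file is to read the sequence of each chain from its CA atoms. The sketch below uses only the fixed-column PDB format; the function names are hypothetical, and the glycine check assumes that designed positions are written as glycine placeholders, which is our understanding of the default RFdiffusion output rather than something verified here.

```python
# Three-letter to one-letter codes for the 20 standard amino acids.
THREE_TO_ONE = {
    "ALA": "A", "ARG": "R", "ASN": "N", "ASP": "D", "CYS": "C",
    "GLN": "Q", "GLU": "E", "GLY": "G", "HIS": "H", "ILE": "I",
    "LEU": "L", "LYS": "K", "MET": "M", "PHE": "F", "PRO": "P",
    "SER": "S", "THR": "T", "TRP": "W", "TYR": "Y", "VAL": "V",
}


def chain_sequences(pdb_text):
    """Return {chain_id: one-letter sequence} read from CA atoms."""
    sequences = {}
    for line in pdb_text.splitlines():
        # Fixed-column PDB format: atom name in columns 13-16,
        # residue name in columns 18-20, chain ID in column 22.
        if line.startswith("ATOM") and line[12:16].strip() == "CA":
            chain = line[21]
            residue = THREE_TO_ONE.get(line[17:20], "X")
            sequences[chain] = sequences.get(chain, "") + residue
    return sequences


def glycine_fraction(sequence):
    """Fraction of positions that are glycine (0.0 for an empty string)."""
    return sequence.count("G") / len(sequence) if sequence else 0.0
```

A chain whose glycine fraction is close to 1.0 is most likely a designed backbone that still needs a sequence from ProteinMPNN.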
REFERENCES
Nature, volume 639, pages 225–231 (2025)
2.3.2 ProteinMPNN
Developed in 2022 by David Baker's team at the University of Washington, ProteinMPNN (Protein Message Passing Neural Network) is a deep learning-based protein sequence design model. It maps a three-dimensional protein structure to an amino acid sequence using a graph representation and a message passing mechanism, a task known as the "inverse folding problem".
Its core advantage is the efficient generation of native-like protein sequences and the ability to design sequences for complex structures that do not exist in nature, with strong performance in both computational and experimental tests. Amino acid identities at different positions can be coupled within a single chain or across multiple chains, which has led to its wide use in current protein design.
ProteinMPNN not only outperforms the traditional Rosetta method in native sequence recovery, but can also rescue designs that previously failed. By using AI to ease an efficiency bottleneck in research, it has significant implications for protein engineering, drug design, enzyme design, and other fields.
We used ProteinMPNN to assign amino acid sequences to the backbones in the PDB files. In this way, the sequence of the ligand was restored after one round of fuzzy processing, yielding a semi-rationally designed mutation library.
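The recovery rate mentioned above is the fraction of positions at which a designed sequence reproduces the native residue. A minimal sketch of that metric follows; the function name and the two short sequences are hypothetical and serve only to exercise the calculation.

```python
def sequence_recovery(native, designed):
    """Fraction of aligned positions where the designed residue matches."""
    if len(native) != len(designed):
        raise ValueError("sequences must have the same length")
    if not native:
        return 0.0
    matches = sum(a == b for a, b in zip(native, designed))
    return matches / len(native)


# Hypothetical 10-residue sequences that differ at two positions.
native = "MKTAYIAKQR"
designed = "MKSAYLAKQR"
print(f"recovery = {sequence_recovery(native, designed):.2f}")
```

The sketch assumes the two sequences are already aligned position by position, which holds when the design keeps the backbone length fixed.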
Fig. 3 Part of the raw ProteinMPNN output
2.3.3 AutoDock Vina
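Raw output of this kind can be turned into a ranked list with a few lines of parsing. The sketch below assumes a FASTA file whose header lines carry comma-separated numeric fields such as `sample=1`, `score=0.7291`, and `seq_recovery=0.5736`, and that a lower `score` is better; both points are assumptions about the output layout and should be checked against the actual files. The function names are hypothetical.

```python
import re

# Numeric "key=value" fields, e.g. "score=0.7291". The field names are
# an assumption about the header layout, not a guaranteed format.
NUMERIC_FIELD = re.compile(r"(\w+)=([-+]?\d+(?:\.\d+)?)(?=,|\s|$)")


def parse_designs(fasta_text):
    """Return one dict per record: numeric header fields plus the sequence."""
    records = []
    for block in fasta_text.split(">")[1:]:
        header, _, body = block.partition("\n")
        fields = {k: float(v) for k, v in NUMERIC_FIELD.findall(header)}
        fields["sequence"] = "".join(body.split())
        records.append(fields)
    return records


def rank_samples(records):
    """Keep sampled designs (those with a 'sample' field), lowest score first."""
    samples = [r for r in records if "sample" in r and "score" in r]
    return sorted(samples, key=lambda r: r["score"])
```

Records without a `sample` field, such as an entry describing the input structure, are skipped by `rank_samples` so that only designed sequences are ranked.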
AutoDock is open-source molecular docking software developed by the Olson Laboratory at the Scripps Research Institute in the late 1980s. It was created to meet the needs of computer-aided drug design for studying interactions between biological macromolecules and small molecules. Its development built on advances in X-ray diffraction for determining macromolecular structures, giving researchers a tool to predict how ligands bind to receptors.
AutoDock Vina is a molecular docking program also developed at Scripps, by the Molecular Graphics Laboratory (MGL). Compared with AutoDock 4, AutoDock Vina improves the average accuracy of binding mode prediction, speeds up the search by using a simpler scoring function, and still gives reproducible docking results for systems with approximately 20 rotatable bonds.
We used AutoDock Vina to predict the Gibbs binding free energy of the interaction between the receptor protein and the ligand protein, which provides a criterion for assessing the ligand's affinity. To reduce prediction error, the top ten binding energy results for each sequence were retained. These results served as the main screening criteria.
Fig. 4 Part of the AutoDock Vina output
In addition, to keep the screening from overfitting to a single scoring method, we analyzed the pockets and contact interfaces of the receptor and predicted the Gibbs free energy of binding between the receptor protein and the ligand protein again with independent tools. This part of the work was done with HDOCK and PRODIGY, with PyMOL used for visualization.
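Collecting the retained energies from many docking runs is easy to automate. The sketch below assumes Vina's PDBQT output, in which each pose carries a line beginning with `REMARK VINA RESULT:` followed by the affinity in kcal/mol and two RMSD values; the function names and the summary fields are hypothetical.

```python
def vina_affinities(pdbqt_text):
    """Return predicted affinities (kcal/mol) in the order they were written."""
    affinities = []
    for line in pdbqt_text.splitlines():
        if line.startswith("REMARK VINA RESULT:"):
            # Fields after the tag: affinity, RMSD lower bound, RMSD upper bound.
            affinities.append(float(line.split(":", 1)[1].split()[0]))
    return affinities


def top_poses(pdbqt_text, n=10):
    """Keep the n most favorable (most negative) affinities."""
    return sorted(vina_affinities(pdbqt_text))[:n]


def summarize(pdbqt_text, n=10):
    """Best and mean affinity over the retained poses, for screening."""
    kept = top_poses(pdbqt_text, n)
    if not kept:
        raise ValueError("no 'REMARK VINA RESULT' lines found")
    return {"best": kept[0], "mean_top_n": sum(kept) / len(kept), "n": len(kept)}
```

Vina writes no more poses than its `num_modes` setting allows, so keeping ten results per sequence requires that setting to be at least 10.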
2.3.4 HDOCK
HDOCK Server is a molecular docking tool developed by the Huang Laboratory at Huazhong University of Science and Technology. It predicts the binding conformations of protein-protein and protein-nucleic acid complexes and is well suited to biomedical research. The PDB files generated by HDOCK can be visualized in three dimensions with PyMOL, which allows analysis of the interactions between residues at the receptor-ligand interface and of the receptor's pockets.
Fig. 5 Part of the HDOCK output, visualized
Fig. 6 Interface and pocket analysis result
Fig. 7 Pocket information visualized with PyMOL
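The interface residues of a docked complex can also be listed programmatically, which is convenient when many models have to be compared. The sketch below is a plain distance-based definition and not the procedure used by HDOCK or PyMOL: a residue counts as interfacial if any of its atoms lies within a cutoff of an atom in the other chain. The 5.0 angstrom cutoff, the chain IDs, and the function names are assumptions chosen for illustration.

```python
import math


def parse_atoms(pdb_text):
    """Yield (chain, residue number, residue name, (x, y, z)) for ATOM records."""
    for line in pdb_text.splitlines():
        if line.startswith("ATOM"):
            xyz = (float(line[30:38]), float(line[38:46]), float(line[46:54]))
            yield line[21], int(line[22:26]), line[17:20], xyz


def interface_residues(pdb_text, chain_a, chain_b, cutoff=5.0):
    """Residues of each chain with any atom within `cutoff` of the other chain."""
    atoms = list(parse_atoms(pdb_text))
    side_a = [a for a in atoms if a[0] == chain_a]
    side_b = [a for a in atoms if a[0] == chain_b]
    found = {chain_a: set(), chain_b: set()}
    for _, num_a, name_a, xyz_a in side_a:
        for _, num_b, name_b, xyz_b in side_b:
            if math.dist(xyz_a, xyz_b) <= cutoff:
                found[chain_a].add((num_a, name_a))
                found[chain_b].add((num_b, name_b))
    return {chain: sorted(residues) for chain, residues in found.items()}
```

The all-pairs loop is quadratic in the number of atoms, which is acceptable for single complexes; a spatial grid or a k-d tree would be the usual replacement for large batches.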
2.3.5 Prodigy
PRODIGY (PROtein binDing enerGY prediction) is a tool that predicts the binding affinity of protein-protein complexes from their three-dimensional structures. It uses an efficient contact-based approach to estimate the binding free energy and the dissociation constant, and it also gives insight into the structural determinants of protein interactions. By combining interface contact characteristics with features of the non-interacting surface, PRODIGY can make reliable predictions, which are useful for understanding molecular interactions, guiding the development of therapies, and designing protein complexes.
To support batch processing and quick screening, we used the code shown in Fig. 8 to run PRODIGY in batch mode and output only the predicted affinity.
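The binding free energy and the dissociation constant are two views of the same quantity, linked by the standard thermodynamic relation ΔG = RT ln(Kd). The sketch below shows the conversion in both directions; the function names, the default temperature of 25 °C, and the example value are assumptions for illustration, not results from this work.

```python
import math

R_KCAL = 1.987204e-3  # gas constant in kcal/(mol*K)


def dg_to_kd(dg_kcal_per_mol, temp_celsius=25.0):
    """Dissociation constant (M) from binding free energy via dG = RT ln(Kd)."""
    rt = R_KCAL * (temp_celsius + 273.15)
    return math.exp(dg_kcal_per_mol / rt)


def kd_to_dg(kd_molar, temp_celsius=25.0):
    """Inverse relation: binding free energy (kcal/mol) from Kd (M)."""
    rt = R_KCAL * (temp_celsius + 273.15)
    return rt * math.log(kd_molar)


# Hypothetical value, not a result from this work.
print(f"Kd = {dg_to_kd(-10.0):.1e} M")
```

A more negative ΔG corresponds to a smaller Kd, so ranking candidates by either quantity gives the same order at a fixed temperature.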
Fig.8 Run Prodigy in batch mode and only output the affinity result
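A batch run of this kind can be sketched as follows. This is not the script shown in Fig. 8 but a minimal illustration. It assumes that the PRODIGY command-line tool is installed as `prodigy`, that `--selection` chooses the interacting chains, and that `-q` restricts the output to the predicted affinity; these flags should be confirmed with `prodigy --help`. The function names and the default chain IDs are hypothetical.

```python
import subprocess
from pathlib import Path


def prodigy_command(pdb_path, chains=("A", "B")):
    """Build the command line; '-q' and '--selection' are assumed flags."""
    return ["prodigy", str(pdb_path), "--selection", *chains, "-q"]


def run_batch(folder, chains=("A", "B")):
    """Run PRODIGY on every .pdb file in `folder`; return {file name: raw output}."""
    results = {}
    for pdb_path in sorted(Path(folder).glob("*.pdb")):
        done = subprocess.run(
            prodigy_command(pdb_path, chains),
            capture_output=True, text=True, check=False,
        )
        # Keep the raw text; a failed run is recorded rather than raised.
        results[pdb_path.name] = (
            done.stdout.strip() if done.returncode == 0 else "FAILED"
        )
    return results
```

Recording failures instead of raising keeps one malformed model from stopping the whole batch, and the raw output is stored unparsed because its exact layout is not assumed here.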