Introduction

In the fields of phage therapy and microbial control, tail fiber proteins serve as the core components for phages to recognize host bacteria. Their modification is a key approach to expanding the host range and enhancing targeted killing efficiency. Traditional phage tail fiber modification relies on a cyclic iterative model of sequence mutation and experimental verification, which consumes substantial human and material resources. This process is not only time-consuming but also struggles to efficiently meet the targeting requirements of specific pathogenic bacteria. Our Wet Lab also encountered this challenge during experimental design—they could not identify the optimal replacement sites through literature review and simple sequence alignment.

To overcome this limitation, we have developed Alphage—the first intelligent tool focused on the precise modification of T7 phage tail fiber proteins. By integrating multi-dimensional model evaluations, including database screening, AI-based sequence alignment, and structure prediction, Alphage can quickly output tail fiber replacement sites and modification schemes with high feasibility. This transforms the traditional "trial-and-error iteration" into "precision design", providing efficient digital support for the engineering modification of phages.

Figure.1 Flow chart of using the software

Modules

Data Cleaning

Figure.2 The process of Data Cleaning

We retrieved information on all viruses (including name, taxonomy, GenBank accession number, and genome coverage) from the ICTV (International Committee on Taxonomy of Viruses). First, we removed non-phage viruses and any entry whose genome coverage was not "Complete genome", yielding a database of complete phage genomes. Next, we used the GenBank accession numbers to retrieve the corresponding tail fiber protein sequences from NCBI (National Center for Biotechnology Information). The collected sequences serve as the initial database for Alphage.
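The filtering step can be sketched as follows. This is a minimal illustration, not the Alphage implementation: the field names (`host_source`, `genome_coverage`, `accession`), the rule "host is bacteria means phage", and the placeholder accession strings are all assumptions made for the example.

```python
def select_complete_phage_genomes(records):
    """Keep records that look like phages and have a complete genome.

    Field names are hypothetical; a real ICTV export may label them differently.
    """
    kept = []
    for rec in records:
        host = rec.get("host_source", "").strip().lower()
        coverage = rec.get("genome_coverage", "").strip().lower()
        # Assumption: a virus is treated as a phage when its host is bacterial.
        if host == "bacteria" and coverage == "complete genome":
            kept.append(rec)
    return kept


# Toy records with placeholder accessions (not real GenBank identifiers).
records = [
    {"name": "phage A", "host_source": "bacteria",
     "genome_coverage": "Complete genome", "accession": "PLACEHOLDER_1"},
    {"name": "phage B", "host_source": "bacteria",
     "genome_coverage": "Partial genome", "accession": "PLACEHOLDER_2"},
    {"name": "virus C", "host_source": "vertebrates",
     "genome_coverage": "Complete genome", "accession": "PLACEHOLDER_3"},
]

complete_phages = select_complete_phage_genomes(records)
print([rec["name"] for rec in complete_phages])
```

The accession numbers of the surviving records would then be used to look up tail fiber protein sequences in NCBI.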

Sequence Alignment

Figure.3 The process of Sequence Alignment

We use the tail fiber protein amino acid sequences from the initial database as input to DeepBLAST, which predicts protein structural similarity directly from sequence data using a Convolutional Neural Network (CNN), and align them with the T7 tail fiber protein. First, a pre-trained protein language model encodes both the T7 and target-phage protein sequences into residue-level embeddings. A CNN scoring module then calculates the match and gap score matrices for the two proteins. These score matrices are fed into a differentiable Needleman-Wunsch algorithm, which determines the optimal structural alignment path and generates the alignment together with its TM-score. If the TM-score is ≥ 0.5 (an empirical threshold), the two sequences are considered to share a similar three-dimensional fold, and we infer that they are remote structural homologs.
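To make the alignment step concrete, here is a sketch of the classical (hard-max, non-differentiable) Needleman-Wunsch recursion over a precomputed match-score matrix. DeepBLAST uses a differentiable variant with scores produced by its CNN module; in this illustration a simple identity score and a constant gap penalty stand in for those learned matrices, and the function name is hypothetical.

```python
def needleman_wunsch(match, gap=-1.0):
    """Global alignment over a match-score matrix.

    match[i][j] scores aligning residue i of sequence A with residue j of B.
    Returns (score, path); path holds (i, j) pairs, with None marking a gap.
    """
    n, m = len(match), len(match[0])
    dp = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dp[i][0] = i * gap
    for j in range(1, m + 1):
        dp[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            dp[i][j] = max(
                dp[i - 1][j - 1] + match[i - 1][j - 1],  # align i with j
                dp[i - 1][j] + gap,                      # gap in B
                dp[i][j - 1] + gap,                      # gap in A
            )
    # Trace back from the bottom-right corner to recover one optimal path.
    path, i, j = [], n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and dp[i][j] == dp[i - 1][j - 1] + match[i - 1][j - 1]:
            path.append((i - 1, j - 1))
            i, j = i - 1, j - 1
        elif i > 0 and (j == 0 or dp[i][j] == dp[i - 1][j] + gap):
            path.append((i - 1, None))
            i -= 1
        else:
            path.append((None, j - 1))
            j -= 1
    path.reverse()
    return dp[n][m], path


# Toy sequences; identity scoring replaces the CNN-derived score matrix.
seq_a, seq_b = "ACD", "AD"
match = [[1.0 if a == b else -1.0 for b in seq_b] for a in seq_a]
score, path = needleman_wunsch(match)
print(score, path)
```

The returned path pairs residue indices of the two sequences, which is the kind of residue-to-residue correspondence needed to map a T7 position onto a target-phage position.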

The model typically outputs multiple potential replacement sites. Based on these sites, substitutions are made sequentially to the T7 phage tail fiber protein, and new tail fiber protein sequences are synthesized after each substitution. The TM-score is recorded as a metric for evaluating the structural homology between the original and modified tail fiber proteins.
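One way to picture the substitution step is shown below. The sketch assumes that a chimera keeps the T7 N-terminal region up to the replacement site and takes the C-terminal region from the target phage, with the matching target position supplied by the alignment. The function name, the 1-based site convention, and the toy sequences are assumptions for illustration only.

```python
def build_chimera(t7_seq, donor_seq, t7_site, donor_site):
    """Join T7 residues 1..t7_site to donor residues donor_site+1..end.

    Both sites are 1-based residue counts (hypothetical convention).
    """
    if not 0 < t7_site <= len(t7_seq):
        raise ValueError("t7_site outside the T7 sequence")
    if not 0 <= donor_site < len(donor_seq):
        raise ValueError("donor_site outside the donor sequence")
    return t7_seq[:t7_site] + donor_seq[donor_site:]


# Toy sequences, not real tail fiber proteins.
t7 = "MKTAYIAKQR"
donor = "GSHLVEALYL"

# Each candidate is (T7 site, aligned donor site); one chimera per candidate.
candidates = [(3, 3), (5, 6)]
chimeras = {site: build_chimera(t7, donor, site, d) for site, d in candidates}
print(chimeras)
```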

Structure Prediction

The newly synthesized tail fiber protein sequences are input into ESMFold, a deep-learning protein structure prediction model. A region adjacent to the target site is selected to predict the protein structure and evaluate the stability of the modified protein. Within this module, the ESM-2 language model encodes individual sequences into residue representations containing evolutionary constraints. Subsequently, spatial interactions are extracted via the Folding Trunk, and the equivariant Transformer structural module directly outputs atomic coordinates and confidence scores (pLDDT). Following iterative optimization, a high-precision three-dimensional structure is obtained.

The pLDDT values of the newly synthesized tail fibers are recorded as a reference for stability. If the pLDDT value is low after substitution, it indicates that such a replacement may disrupt the specific structural domains of the T7 phage tail fiber protein, and thus this substitution site is excluded.
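The exclusion rule can be sketched as a windowed average followed by a cutoff. The window size and the cutoff of 70 are assumptions chosen for the example (70 is a commonly used boundary for confident predictions); the document does not state the values Alphage uses.

```python
def mean_plddt_near_site(plddt, site, flank=10):
    """Average per-residue pLDDT in a window around a 1-based site."""
    start = max(0, site - 1 - flank)
    end = min(len(plddt), site + flank)
    window = plddt[start:end]
    return sum(window) / len(window)


def keep_stable_sites(site_scores, cutoff=70.0):
    """Return the sites whose local pLDDT is at or above the cutoff."""
    return [site for site, score in site_scores.items() if score >= cutoff]


# Toy per-residue pLDDT profile: a confident half and a poorly predicted half.
plddt = [90.0] * 5 + [40.0] * 5
site_scores = {site: mean_plddt_near_site(plddt, site, flank=2) for site in (3, 8)}
print(site_scores, keep_stable_sites(site_scores))
```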

Infection Efficacy

The newly synthesized tail fiber protein sequences are simultaneously input into a phage-host interaction prediction tool called PHIEmbed. It employs protein language models to represent the receptor-binding proteins of phages, and evaluates both the binding ability between the modified tail fibers and host bacteria and the subsequent infection efficacy. Using this model, residue-level embeddings are generated and averaged to obtain fixed-length protein vectors. These embeddings are then fed into a weighted Random Forest classifier. During the training phase, host genera are used as labels, and class imbalance is addressed through the class weighting. In the prediction phase, a confidence threshold k is set to control classification decisions, enabling the prediction of binding probabilities between the newly synthesized tail fiber proteins and their corresponding hosts.
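Two of the steps above are simple enough to sketch without the model itself: averaging residue-level embeddings into one fixed-length vector, and applying the confidence threshold k to class probabilities. The function names are hypothetical, the probabilities are toy values rather than classifier output, and returning `None` for a low-confidence call is an assumption about how a rejected prediction might be represented.

```python
def mean_pool(residue_embeddings):
    """Average a (length x dim) list of residue embeddings into one vector."""
    length = len(residue_embeddings)
    dim = len(residue_embeddings[0])
    return [sum(row[d] for row in residue_embeddings) / length for d in range(dim)]


def predict_host(class_probabilities, k=0.5):
    """Return the most probable host genus, or None if its probability < k."""
    genus, prob = max(class_probabilities.items(), key=lambda item: item[1])
    return genus if prob >= k else None


# Two residues with 2-dimensional embeddings (toy numbers).
protein_vector = mean_pool([[1.0, 2.0], [3.0, 4.0]])

# Toy class probabilities standing in for Random Forest output.
probabilities = {"Escherichia": 0.7, "Pseudomonas": 0.3}
print(protein_vector, predict_host(probabilities, k=0.6))
```

Raising k makes the classifier more conservative: with k = 0.8 the same probabilities would yield no host call.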

Figure.4 The process of Structure Prediction and Infection Efficacy

Comprehensive Evaluation

Through the three aforementioned modules, we obtain features of tail fiber proteins with clear biological significance. We select three representative features as the standard indicator parameters for our evaluation protocol: “Homology”, “ΔHomology”, and “Infection”.

Figure.5 The parameters settings page

“Homology” represents the sequence homology between the tail fiber proteins of the target-phage and T7 phage, as provided by the DeepBLAST model.
“ΔHomology” refers to the difference in homology between the fragments flanking the replacement site. Employing the concept of a descent algorithm, it identifies positions where homology decreases sharply, aiding in the discovery of “optimal” replacement sites.
“Infection” denotes the infection efficiency of the chimeric tail fiber phage against host bacteria, as generated by the PHIEmbed model.
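As an illustration of the “ΔHomology” idea, the sketch below scores each candidate site by the mean homology of the residues just before it minus the mean homology of the residues just after it, so that a sharp drop gives a large value. The per-residue match flags, the window size, and this exact formula are assumptions; the document does not give the precise definition used in Alphage.

```python
def delta_homology(match_flags, site, flank=5):
    """Mean homology before the site minus mean homology after it.

    match_flags[i] is 1 if aligned residue i matches T7, else 0.
    site is a 1-based count of residues kept on the N-terminal side.
    """
    left = match_flags[max(0, site - flank):site]
    right = match_flags[site:site + flank]
    if not left or not right:
        return 0.0
    return sum(left) / len(left) - sum(right) / len(right)


# Toy profile: six matching residues followed by six mismatches.
flags = [1] * 6 + [0] * 6
best_site = max(range(1, len(flags)), key=lambda s: delta_homology(flags, s, flank=3))
print(best_site, delta_homology(flags, best_site, flank=3))
```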

To integrate these features and quantify their relative importance to the success rate of tail fiber replacement, we introduce a robust machine learning model. The innovation of this model lies in our integration of the Pairwise Comparison Concept from learning-to-rank with an interpretable linear model, aiming to construct a function capable of reliably prioritizing newly designed schemes.

We collected experimentally validated tail fiber replacement schemes from the literature, using their reported successful cleavage sites as “ground truth.” Subsequently, we performed a full-sequence scan of the relevant tail fiber protein sequences to identify all potential modification sites. For each potential site, we calculated the three feature values (“Homology,” “ΔHomology,” and “Infection”) using the three aforementioned analysis modules.

Figure.6 Pairwise loss and pair accuracy

Furthermore, drawing on the concept of Pairwise Learning-to-rank, we use the spatial distance between each potential site and the literature-reported standard site as a supervision signal: the closer the distance, the more the site is considered to approximate the ideal cleavage site. By pairwise comparing the distance differences between different sites and the ground truth, we construct sample pairs for model training.
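The pair construction can be sketched as follows. Each pair of candidate sites becomes one training sample whose input is the difference of their feature vectors and whose label says which site lies closer to the ground-truth site. Measuring distance as a difference in residue position and skipping ties are assumptions made for this illustration.

```python
def make_pairs(features, distances):
    """Build pairwise samples from per-site features and distances.

    features[i]  = [homology, delta_homology, infection] for site i
    distances[i] = distance of site i from the literature-reported site
    Returns (diffs, labels); label 1 means the first site of the pair is closer.
    """
    diffs, labels = [], []
    for i in range(len(features)):
        for j in range(i + 1, len(features)):
            if distances[i] == distances[j]:
                continue  # a tie carries no ranking information
            diffs.append([a - b for a, b in zip(features[i], features[j])])
            labels.append(1 if distances[i] < distances[j] else 0)
    return diffs, labels


# Toy feature values for three candidate sites.
features = [[0.9, 0.5, 0.6], [0.4, 0.1, 0.5], [0.2, 0.0, 0.5]]
distances = [0, 10, 10]
diffs, labels = make_pairs(features, distances)
print(len(diffs), labels)
```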

In the model training phase, we reserve 15% of the data as the test set, adopt a weight regularization strategy, and use Binary Cross-Entropy Loss (BCELoss) as the optimization objective. A Logistic Regression Model is then used to learn the weights of the three features. These weights objectively reflect the relative importance of each feature in predicting the success rate of tail fiber replacement.
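A minimal sketch of this training step is given below: logistic regression on pairwise feature differences, optimized by gradient descent on binary cross-entropy with an L2 penalty as the weight regularization. The learning rate, penalty strength, epoch count, toy data, and the final normalization of absolute weights into relative importances are all assumptions; the 15% hold-out split is omitted for brevity.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def bce_loss(w, diffs, labels):
    """Mean binary cross-entropy of a bias-free logistic model."""
    total = 0.0
    for x, y in zip(diffs, labels):
        p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)))
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(diffs)


def train(diffs, labels, lr=0.5, l2=0.01, epochs=500):
    """Gradient descent on BCE + (l2 / 2) * ||w||^2."""
    w = [0.0] * len(diffs[0])
    n = len(diffs)
    for _ in range(epochs):
        grad = [l2 * wi for wi in w]  # gradient of the L2 penalty
        for x, y in zip(diffs, labels):
            p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)))
            for d, xi in enumerate(x):
                grad[d] += (p - y) * xi / n
        w = [wi - lr * g for wi, g in zip(w, grad)]
    return w


def relative_importance(w):
    """Normalize absolute weights so they sum to 1 (one possible convention)."""
    total = sum(abs(wi) for wi in w)
    return [abs(wi) / total for wi in w]


# Toy pairwise samples: feature differences and which site was closer.
diffs = [[0.5, 0.4, 0.1], [0.7, 0.5, 0.0], [-0.5, -0.4, -0.1], [-0.7, -0.5, 0.0]]
labels = [1, 1, 0, 0]
w = train(diffs, labels)
print(relative_importance(w))
```

No bias term is used because swapping the two sites of a pair should simply flip the prediction.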

The weight coefficients obtained from model training are set as default values in the calculation process of evaluating replacement schemes: 62% for “Homology”, 29% for “ΔHomology”, and 9% for “Infection”. Additionally, we have added an interactive function in the software that allows modifying the parameter weights. Users can adjust these weights according to their needs to screen for the most suitable replacement scheme.
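The default weights can be applied as a weighted sum, sketched below. The sketch assumes each feature has been scaled to the range 0 to 1 and that user-supplied weights are renormalized to sum to 1; neither detail is stated in the text, and the function names are hypothetical.

```python
DEFAULT_WEIGHTS = {"homology": 0.62, "delta_homology": 0.29, "infection": 0.09}


def normalize_weights(weights):
    """Rescale weights so they sum to 1."""
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("weights must have a positive sum")
    return {name: value / total for name, value in weights.items()}


def composite_score(features, weights=DEFAULT_WEIGHTS):
    """Weighted sum of the three indicator values for one replacement site."""
    w = normalize_weights(weights)
    return sum(w[name] * features[name] for name in w)


# Toy feature values for a single candidate site.
site_features = {"homology": 0.8, "delta_homology": 0.5, "infection": 0.6}
print(composite_score(site_features))

# A user who cares only about infection can override the defaults.
custom = {"homology": 0.0, "delta_homology": 0.0, "infection": 1.0}
print(composite_score(site_features, custom))
```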

Results

We rank all predicted replacement schemes based on the standard parameters and corresponding weights determined by the Comprehensive Evaluation Module, and output the top optimal replacement schemes. The display interface includes the replacement site, the site’s comprehensive score (shown in red), and the scores of the three indicators—all of which are also visually presented in a line chart on the left. Additionally, it provides the reliability score (shown in pink) of the modified chimeric tail fiber, which is reflected by the pLDDT parameter calculated by ESMFold. The sequence of the chimeric tail fiber is also provided alongside these data.
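The ranking itself amounts to sorting by the comprehensive score and keeping the first few entries, with pLDDT carried along as the reliability value. The record layout and all numbers below are invented for illustration and are not Alphage output.

```python
def top_schemes(schemes, n=3):
    """Sort schemes by comprehensive score, highest first, and keep n of them."""
    return sorted(schemes, key=lambda s: s["score"], reverse=True)[:n]


# Toy schemes: site, comprehensive score, and reliability (pLDDT).
schemes = [
    {"site": 10, "score": 0.41, "plddt": 71.0},
    {"site": 20, "score": 0.83, "plddt": 82.5},
    {"site": 30, "score": 0.65, "plddt": 64.2},
]

for scheme in top_schemes(schemes, n=2):
    print(scheme["site"], scheme["score"], scheme["plddt"])
```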

Figure.7 The results of the Alphage

When Pseudomonas phage phiPsa17 is entered and a tail-fiber replacement scheme is predicted, Alphage recommends a replacement site window (approximately residues 140–150), streamlining Wet Lab tail-fiber engineering.

Based on the approximate window predicted by Alphage, the Wet Lab, after literature review and consultation of our Dry Lab Protein-Protein Docking Model, selected position 149 for substitution within the T7 phage gp17 protein. The resulting chimera was experimentally confirmed to be infective, supporting the accuracy of Alphage's prediction.

Figure.8 The tail fiber-substituted phage T7 ∆C-gp17:: C-VO98_215 corresponding to the Alphage prediction results is tightly adsorbed on the phage surface.

Figure.9 The protocol provided by Alphage helped the Wet Lab team construct a phage-like particle that gained the ability to adsorb to DC3000.

Literature Validation

To further verify the general applicability of Alphage, we extended our validation beyond the T7 phage system. In a study published in Nature Communications by Gil et al. (2023), titled “Tailoring the Host Range of Ackermannviridae,” the authors focused on host range engineering of bacteriophages belonging to the Ackermannviridae family. Members of this family typically encode multiple tailspike proteins (TSP1–4), each responsible for recognizing distinct host receptors. This feature makes Ackermannviridae phages an ideal model for studying host recognition mechanisms and precise host range modulation, and highlights their potential as a platform for constructing multi-receptor-binding protein (multi-RBP) systems.
In their study, the Gil team constructed chimeric phages by systematically replacing and optimizing multiple TSP regions, which markedly improved the specificity and sensitivity of bacterial detection. This strategy expanded the phage’s ability to recognize a broader range of Salmonella serovars while reducing cross-reactivity with non-target strains, demonstrating broad potential for applications in food safety monitoring and synthetic biology.

Figure.10 The tail fiber replacement schemes of the Gil team

The replacement sites are 158 and 250. N-terminal regions native to the recipient, SPTDl.NL, are indicated in blue. C-terminal regions derived from the donor, CBA120.NL, are indicated in red.

We extracted the tailspike protein sequences analyzed in the study and used Alphage to predict potential fusion modification sites. With the default parameter weights applied, the replacement sites recommended by Alphage were highly consistent with the experimentally validated modification sites of RBP-CBA120-1 and RBP-CBA120-2 reported in the literature. Alphage thus reproduces the research team's tail fiber replacement results, which supports the usability and reliability of its predictions.

Figure.11 The prediction results of Alphage

The predicted replacement sites are 156 and 250, which are similar to the Gil team’s tail fiber replacement schemes.
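The agreement can be stated in residues. The short sketch below takes the predicted sites (156 and 250) and the reported sites (158 and 250) from the text and computes, for each prediction, the distance to the nearest reported site; the function name is hypothetical.

```python
def site_deviations(predicted, reported):
    """Distance in residues from each predicted site to the nearest reported site."""
    return [min(abs(p - r) for r in reported) for p in predicted]


predicted_sites = [156, 250]   # Alphage predictions quoted above
reported_sites = [158, 250]    # sites reported by the Gil team
deviations = site_deviations(predicted_sites, reported_sites)
print(deviations)  # [2, 0]
```

One prediction matches the reported site exactly and the other is two residues away.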

In conclusion, Alphage is not only effective for tail fiber engineering of T7 phages, but also provides reliable computational guidance for the design of tailspike modifications in other phages. This demonstrates its potential value in phage-directed engineering, host range modulation, and cross-species applications.

Installation & Usage

All code for the Alphage software is available in our GitLab Software Tool repository.

The complete workflow for using Alphage is as follows:

Simple Version

Please ensure that your computer has Python installed. If not, please go to https://www.python.org/downloads/ to download Python.
Please ensure that your computer has Git installed. If not, please go to https://git-scm.com/downloads to download Git.

Open your terminal (Command Prompt for Windows, Terminal for macOS/Linux), then create and activate a virtual environment:

For Windows (Command Prompt):

python -m venv Alphage_GUI

Alphage_GUI\Scripts\activate

For Windows (PowerShell):

python -m venv Alphage_GUI

.\Alphage_GUI\Scripts\Activate.ps1

For macOS/Linux (Terminal):

python3 -m venv Alphage_GUI

source Alphage_GUI/bin/activate

Please ensure that your Python environment has pandas, PyQt5, openpyxl and matplotlib deployed. The lack of any of these libraries may cause the software to fail. You can use the following code to install the required libraries (after activating the virtual environment):

pip install pandas PyQt5 openpyxl matplotlib

Note: For macOS/Linux, use pip3 instead of pip if pip points to Python 2.
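If you want to confirm that the installation succeeded before launching the software, a small check like the following can help. It is an optional sketch and not part of the Alphage repository; it only uses the standard library to test whether each required package can be found in the active environment.

```python
import importlib.util

# Packages named in the installation step above.
REQUIRED = ["pandas", "PyQt5", "openpyxl", "matplotlib"]


def missing_modules(names):
    """Return the names that cannot be found in the current environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


missing = missing_modules(REQUIRED)
if missing:
    print("Missing packages:", ", ".join(missing))
else:
    print("All required packages are installed.")
```

Save it to a file, run it with the same interpreter you will use for Alphage (with the virtual environment activated), and install anything it reports as missing.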

Clone our GitLab repository:

git clone https://gitlab.igem.org/2025/software-tools/cau-china.git

Navigate to the specified directory:

For Windows:

cd cau-china\software\Alphage_GUI

For macOS/Linux:

cd cau-china/software/Alphage_GUI

Run the CAU-China.py program:

For Windows:

python CAU-China.py

For macOS/Linux:

python3 CAU-China.py

Open the software, enter the name of the phage whose tail fiber you want to replace in the search box on the first interface, and click the search button.
Adjust the parameter weights on the second interface. The default weights were obtained from the pairwise logistic regression model described under Comprehensive Evaluation; if you have no specific preference for the weight values, keep the defaults.
To see what a parameter means, hover your mouse cursor over the “?” next to it, and an explanation of the parameter will appear on the right.
After setting the parameter weights, click the button below to proceed to the next page. If you notice an error in the phage name you entered, click the back button to return to the phage search interface.
After a short loading animation, the tail fiber replacement schemes for the phage are listed on the third interface, including replacement site, weighted score, reliability, and chimeric sequence. To see the individual scores, hover your mouse cursor over the small triangle next to the score, and the three component scores will expand below.

Full version

For installation and usage of the Full version, please refer to the user manual below:

Contributions

Alphage provides the Wet Lab team with rationally designed tail fiber modification schemes, saving the time and resources otherwise spent on design and trial-and-error in the early stage of wet experiments. In addition, we learned from the Conference of China iGEMer Community (CCiC) that many teams use phages to regulate the suicide of engineered bacteria. We therefore expect Alphage to offer other teams in the iGEM community a new, customizable suicide module solution: by engineering the model phage T7 and using Alphage to design matching tail fibers, it is possible to construct precise, efficient, and regulatable suicide elements that specifically target and eliminate chosen host bacteria. In this process, Alphage would serve as a cornerstone for constructing such suicide modules.

Discussion

From the initial concept to the current presentation, our software has undergone five iterations. Initially, Alphage relied solely on homology to identify replacement sites. In subsequent testing, to avoid losing potential homologous fragments, we integrated DeepBLAST. Then, considering the binding between chimeric tail fiber phages and host bacteria, we employed PHIEmbed, a model built on protein language model embeddings, to predict infection efficiency. Meanwhile, we interviewed Prof. Feng, a phage expert from Beijing University of Chemical Technology. Based on his suggestions, we selected the pLDDT parameter from the ESMFold model as an indicator of the structural stability of chimeric tail fibers and included it in the final scheme display. Finally, we used pairwise logistic regression to determine the weights of the evaluation metrics, enhanced the UI design, and added user interaction functions, making Alphage not only scientifically grounded but also more visually appealing and user-friendly.

While we have built a highly user-friendly system, we unfortunately have not yet expanded the existing dataset further due to time constraints. In the future, we will attempt to improve and optimize the software, enhance its search and runtime speed, incorporate more phage information, and conduct comprehensive testing of its functions.

For details of each iteration, please see our Engineering page.

References

[1] Hamamsy, T., Morton, J.T., Blackwell, R. et al. Protein remote homology detection and structural alignment using deep learning. Nat Biotechnol 42, 975–985 (2024). https://doi.org/10.1038/s41587-023-01917-2

[2] Lin, Z. et al. Evolutionary-scale prediction of atomic-level protein structure with a language model. Science 379, 1123–1130 (2023). https://doi.org/10.1126/science.ade2574

[3] Gonzales, M.E.M., Ureta, J.C., Shrestha, A.M.S. Protein embeddings improve phage-host interaction prediction. PLoS One 18(7), e0289030 (2023). https://doi.org/10.1371/journal.pone.0289030

[4] Burges, C., Shaked, T., Renshaw, E., Lazier, A., Deeds, M., Hamilton, N., & Hullender, G. (2005). Learning to rank using gradient descent. In Proceedings of the 22nd International Conference on Machine Learning (pp. 89–96). ACM.