Overview
Protein screening, a long-established process, aims to narrow down extensive protein libraries to a select few candidates. This is done to identify the ideal protein for specific functions, such as discovering new drugs, biofuels, or biomarkers. In the early stages of our project, we underwent the usual process of screening for promising proteins by searching through online libraries such as NCBI, Uniprot and RSCB. However, we found the field of Bioremediation to be underdocumented, where proteins which bind to heavy metals (e.g, metallothioneins, phytochelatins) often lacked a record of crystal structure as well as data surrounding its binding properties.
For this reason, our Dry Lab team aimed to leverage tools such as AlphaFold 3 and MIB2 to computationally screen for protein candidates to be used in our metal-binding circuit. We also wanted to screen for proteins with the optimal properties, such as stability under acidic conditions and having sufficient binding capacity in order to increase real-world application.
Design
Before deciding on what tools to use, our entire team was faced by the question of:
How do we decide what an ‘ideal protein’ is for our project?
Within sludge, the most noticeable problem from a bioremediation standpoint was the large discrepancy of Zinc to Cadmium ions (around 1000 fold difference). Additionally, sources from government agencies like T-Park suggested an average of around 6kg of Cadmium to exist within around 2000 tonnes of sludge ash. Therefore, we decided our protein had to have 3 major criteria:
- Specificity - it should be very specific and bind exclusively to Cadmium, irrespective of the concentration or presence of other heavy metals. Only with this, we would be able to effectively resolve our problem.
- Sensitivity - it should be sensitive to low concentrations of cadmium and be able to facilitate the binding process even at uM concentrations. Our literature review suggested the cadmium concentration would exist around 4 mg / L (Halecki et al., 2024). This translated to roughly 40uM and was a reasonable level of sensitivity to begin with. Furthermore, metallothioneins are dynamic structures which only fold and coordinate properly under the induction of heavy metals (Freisinger & Vašák, 2013). Thus, an alignment between the metallothionein’s ideal cadmium concentration and the concentration found in real-world applications was crucial.
- Binding Efficiency - it should be able to bind to several cadmium ions per protein. However, the protein must be equally robust and compact to prevent metabolic burden or improper surface-expression within the bacteria. Therefore, we used this general equation as a guide:
Binding Efficiency = Amount of Cd2+ ions binding per protein / Protein Amino Acid Length
In our research, we stumbled across EC20 linked to the Lpp-OmpA surface display system. This circuit has documented use in many past iGEM projects. EC20 has many documented benefits, including its alleged specificity and artificially induced capacity of over 10 Cd2+ per 42 amino acids. However, in our research, we encountered some interesting matters.
- Despite much usage of both EC20 and EC20 in conjunction with Lpp-OmpA surface display, the actual characterisation of EC20 and literature regarding its properties surprisingly lacked documentation.
- Since its discovery (Bae et al., 2000), its crystal structure has not been resolved nor documented. In principle, EC20 was the most suitable and optimal candidate for our work. It provided a sufficient robustness and matched all three of the aforementioned criteria from our initial evaluation. Unfortunately, the lack of characterisation results in a degree of uncertainty for use in Wet Lab experiments. This led us to use computational tools to further analyse EC20.
- We also wanted to verify our outputs by screening proteins which had more literature support and compare the results. If working, we also hoped to screen for other less documented proteins to see if they might show some promise to be tested in the Wet Lab.
Build
Firstly, we used AlphaFold 3 to determine the structure of EC20. Taking the sequence from the iGEM part BBa_K1321005, we used the Expasy translate tool to acquire the amino acid sequence of EC20 and predict it in AlphaFold 3.
However, we did not want to limit ourselves to simply EC20. Therefore, we also built a library of other MTs, phytochelatins and proteins with homologous functions and used AlphaFold 3 to predict its folding with the Lpp-OmpA delivery system. For many proteins, they lacked an existing .pdb or .cif file on databases which could complicate simulations. Therefore, the first screening method was to check the pTM and pLdTT of the isolated structure to ensure that the plDDT was reasonable before continuing with Lpp-OmpA folding (showing medium confidence bands).
A key aspect of our selection process involved predicting the structures of candidate proteins to ensure they have well-defined conformations under physiological conditions, which is critical for their functionality as cadmium binders. Specifically, we aimed to achieve accurate structural representations of these proteins that reflect their behavior in the presence of metal ions. Given that AlphaFold 3 does not currently support cadmium ion simulations directly, we opted to use zinc ions as a proxy due to their structural and chemical similarities. Furthermore, metallothioneins are documented to possess affinity for both zinc and cadmium (Ruttkay-Nedecky et al., 2013; Nordberg, M. & Nordberg, G. F., 2022; Irvine, Pinter, & Stillman, 2016). By simulating the candidate proteins both in their unbound state and in complex with zinc ions, we analysed their structural stability and binding potential. To determine whether the predicted protein structures were well-defined, we utilized key metrics provided by AlphaFold 3, specifically the pLDDT (predicted Local Distance Difference Test) score, pTM (predicted Template Modeling) score, and ipTM (interacting predicted Template Modeling) score.
- pLDDT scores provide confidence levels for the accuracy of specific regions within the protein model, with scores above 70 generally indicating reliable structural conformations. Higher pLDDT values suggest that these regions are likely to be well-defined and stable.
- pTM scores evaluate the overall quality of the protein model by comparing it to known structures in the Protein Data Bank (PDB). A pTM score above 0.6 typically indicates a reliable model with a high likelihood of accurately representing the protein's true structure.
- ipTM scores are specifically designed to assess the quality of protein structures in the context of complexes, such as a metal ion bound to a protein. This score helps determine how well the protein will maintain its structural integrity upon binding with zinc ions (or cadmium in the future). Higher ipTM scores correlate with greater stability and proper interaction between the metal ion and the protein.
Figure 1. Example of AlphaFold 3 protein prediction results showing pLDDT, pTM and ipTM scores.
By assessing pLDDT, pTM, and ipTM scores of our candidate proteins, we could effectively gauge their structural integrity and potential for cadmium binding. This comprehensive evaluation process allowed us to select proteins that were likely to be functionally effective in removing cadmium from sludge. Thus, we utilized AlphaFold 3 to facilitate these structural predictions, providing valuable insights into the potential effectiveness of each protein as a cadmium-binding agent.
Structural and Binding Characteristics
We summarized the results of pTM and ipTM scores from our simulations in the table below, including additional relevant information from the literature, such as binding capacities and the presence of experimentally determined solution structures.
It is important to note that PflQ2MT has been observed to preferentially bind only 3 zinc ions instead of the typical 4, however, PflQ2MT can bind up to 4 cadmium ions (Habjanič, Zerbe, & Freisinger, 2018). Given this information, we chose to simulate the binding of the maximum capacity of 4 zinc ions in our AlphaFold 3 predictions, substituting for cadmium ions.
Name |
Organism Type |
Organism |
Binding Capacity (no. of Cd2+) |
Length (a.a.) |
RCSB Solution Structure* |
Structure Prediction Score |
Structure Prediction Score with Zn2+ |
EC20
(Bae et al., 2000)
|
N/A (synthetic) |
N/A (synthetic) |
10 |
21 |
- |
pTM = 0.32-0.51# |
ipTM = 0.18, pTM = 0.23 |
LgiMT1
(García-Risco et al., 2023)
|
Sea snail |
Lottia gigantea |
4 + 3 |
74 |
- |
pTM = 0.2 |
ipTM = 0.78, pTM = 0.68 |
LgiMT2
(García-Risco et al., 2023)
|
Sea snail |
Lottia gigantea |
4 + 3 |
74 |
- |
pTM = 0.39 |
ipTM = 0.67, pTM = 0.62 |
PflQ2MT (shortened)
(Habjanič et al., 2018)
|
Bacteria |
Pseudomonas fluorescens |
4 |
52 |
6GV6 |
pTM = 0.7 |
ipTM = 0.89, pTM = 0.79 |
Helix pomatia CdMT
(Beil et al., 2019)
|
Land snail |
Helix pomatia |
3 + 3 |
67 |
6QK6 |
pTM = 0.4 |
ipTM = 0.7, pTM = 0.6 |
Mouse-CdMT1
(Klaus Zangger et al., 1999)
|
Mammal |
Mouse |
4 |
31 |
1DFS |
pTM = 0.34 |
ipTM = 0.77, pTM = 0.66 |
2A81G5 (antibody)
(Blake et al., 1996)
|
Mammal |
Mouse |
1 |
Heavy Chain: 114, Light Chain: 112 |
- |
ipTM = 0.89, pTM = 0.92 |
Cannot be simulated
2A81G5 binds to EDTA-chelated cadmium, but AlphaFold does not support the chemical EDTA
|
Figure 2.Summary of literature and AlphaFold 3 results for comparison
*Experimentally determined structure (via NMR or similar techniques)
Identifying properties of EC20
To evaluate the structural properties of EC20, we input the protein sequence into AlphaFold 3 and conducted six different trials. This approach was chosen to assess the consistency and reliability of structure predictions across multiple simulations. Each trial was set up without any change in conditions.
The results from these trials are as follows:
- Trial #1: pTM = 0.33
- Trial #2: pTM = 0.51
- Trial #3: pTM = 0.42
- Trial #4: pTM = 0.32
- Trial #5: pTM = 0.33
- Trial #6: pTM = 0.41
To visually summarize the results, we have combined the structural representations from all six trials, labeled with its corresponding trial number and pTM value. Figure 3 below shows the structures of EC20 for all six trials.
Figure 3. AlphaFold 3 results showing the structures of EC20 for all 6 trials
Analysis of Trial Results
Trials #1, #2, #4, and #5 demonstrated that EC20 folds onto itself, with Trial #2 yielding the highest pTM score of 0.51, indicating a more favorable and stable configuration. In contrast, Trial #3 exhibited an arc-like structure, while Trial #6 illustrated EC20 as predominantly flat, lacking the self-folding characteristic seen in the other trials.
The pTM scores ranged from a low of 0.32 to a high of 0.51. This significant variability suggested that the structure of EC20 was not well-defined. Alternatively, it suggested that the entirety of EC20 was a variable region, and its folding depended on the environment. Given that EC20 is a synthetic metal-chelating protein, such variability is expected; however, it raises potential concerns during modeling due to the numerous structural variations and the likelihood of protein misfolding.
Results
Figure 4. Structure of EC20 with and without addition of zinc ions
Figure 5. Structure of LgiMT1 with and without addition of zinc ions
Figure 6. Structure of LgiMT2 with and without addition of zinc ions
Figure 7. Structure of PflQ2MT with and without addition of zinc ions
Figure 8. Structure of Helix Pomatia CdMT with and without addition of zinc ions
Figure 9. Structure of Mouse-CdMT1 with and without addition of zinc ions
Figure 10. Structure of 2A81G5
Analysis
Several notable observations emerged from the AlphaFold structural predictions:
- Zinc Ion Binding: Zinc ions bind primarily to cysteine residues within the proteins, demonstrating the key role of these amino acids in metal ion coordination. This binding pattern highlights the importance of cysteine in metalloproteins. [Figure 11. below]
- Cystine-Rich Regions: Zinc ions tend to concentrate in cystine-rich regions, known as cystine pockets, which are characterized by clusters of cysteine residues that facilitate metal binding. [Figure 12. below]
- Structural Certainty and Stability: Some proteins exhibited low structural certainty, as indicated by low pTM scores, but showed improved structural stability upon zinc binding. Notably, LgiMT1 and LgiMT2 demonstrated this behavior, as evidenced by significant alterations in color corresponding to the pLDDT scores of the entire protein. The increase in pLDDT scores upon zinc binding indicates greater confidence in structural predictions and suggests that metal binding may enhance folding and stability. [Figures 5 & 6. above]
Figure 11. Showing a zinc ion (in green) bound to 4 thiol group from cystine (yellow spheres)
Figure 12. Showing zinc ions (in green) bound to a cystine rich region (in yellow brackets) of a protein
After comparing the pTM and ipTM scores of the available proteins, we opted for PflQ2MT due to its well-defined structure, which exhibited an ipTM score of 0.89 and a pTM score of 0.79 when zinc ions are bound. These scores indicate that PflQ2MT has a high-quality structural model, with the ipTM score reflecting its stability and interaction profile in the presence of metal ions. Specifically, an ipTM score of 0.89 suggests that PflQ2MT maintains excellent structural integrity when bound to zinc, greatly enhancing the likelihood of proper function when engaging with cadmium. Additionally, the pTM score of 0.79 confirms that its overall structure is reliable and closely resembles known protein configurations found in the Protein Data Bank (PDB).
Selecting PflQ2MT not only allows us to leverage a protein with good binding characteristics but also facilitates future modeling efforts. Its well-defined structure reduces concerns about misfolding, which can be a significant issue with proteins that exhibit poorly defined structures under experimental conditions. Thus, choosing PflQ2MT optimizes our project’s chances for success in cadmium removal efforts due to its robust structural and functional credentials.
Further Screening
As previously mentioned, since AlphaFold 3 did not support cadmium ions (Cd²⁺), we sought alternative software to predict cadmium ion binding. The MIB2 (Metal Ion-Binding Site Prediction and Modeling Server) is an online tool designed to identify and model metal ion-binding sites in proteins. MIB2 supports predictions for 18 different metal ions and enhances accuracy by applying metal-specific scoring and a broader set of structural templates (Lu et al., 2022).
To test cadmium ion binding affinity (and other metal ions) to our selected protein, we utilized MIB2 to predict cadmium binding sites. MIB2 provides PS2 for protein structure prediction, but for more accurate results, we opted to generate a protein structure using other prediction tools such as AlphaFold and subsequently uploaded the PDB file to MIB2 for metal ion binding predictions. We selected all 18 available metal ions for the prediction: Ca²⁺, Cu²⁺, Fe³⁺, Mg²⁺, Mn²⁺, Zn²⁺, Cd²⁺, Fe²⁺, Ni²⁺, Hg²⁺, Co²⁺, Cu⁺, Au⁺, Ba²⁺, Pb²⁺, Pt²⁺, Sm³⁺ and Sr²⁺.
MIB2 Simulation
We can analyzed the binding potential for cadmium using the data generated by MIB2 [Figure 13 below]. The graph displays the binding potential of the metal ion (x-axis) to each amino acid residue number (y-axis). The orange dots represent residues that can bind cadmium ions, typically appearing at the peak; however, this is not always the case, as binding may occur through coordination of neighboring amino acids. For example, amino acids numbered 10, 28, 29, and 47, all cysteine residues, can bind cadmium with a score of 1.392 [Figure 14 below].
Figure 13. Binding potential of Cd2+ to amino acids residues of PflQ2MT
Figure 14. Docking results of Cd2+ to amino acids residues of PflQ2MT
We combined the MIB2 cadmium ion binding predictions for PflQ2MT to reveal the top 15 binding sites out of 63 [Figure 15 below], with docking scores ranging from 0.845 to 1.392, visualized using ChimeraX. The cadmium ions appear concentrated in one region (indicated by the red box) near the cysteine residues (indicated in yellow), aligning with previous AlphaFold 3 results and the solution structure of PflQ2MT. However, there was one exception where a cadmium ion binds outside of cysteine residues (indicated by the red arrow), indicating some inaccuracies in the predictions.
Figure 15. MIB2 results of cadmium ion binding for PflQ2MT, showing the top 15 results.
Moreover, we examined the affinity of various metal ions present in the MIB2 complex for individual cysteine residues in PflQ2MT. Our primary goal was to identify a metallothionein that exhibited specificity for cadmium while minimizing any interference or competition from other non-toxic heavy metals. We focused all the nine cysteine residues located at positions 5, 7, 10, 12, 28, 32, 42, 47, and 49. Positive binding scores indicate that a metal ion has an affinity for the cysteine residues, while a negative score signifies that there is no affinity.
The binding affinity results, as illustrated in Figure 16 below, reveal that the metal ions with the highest affinity is Ni²⁺, followed closely by Cu⁺ and Au⁺. Whereas the binding score for Cd²⁺ remained the most consistent, followed by Zn²⁺ and Pb²⁺. It is important to note that Pf²⁺ was not considered due to its extreme binding score, as the maximum score in our analysis was around 3, while the upper quartile was just barely above -0.5.
The results further indicate that the cysteine residues display binding affinities to metal ions such as Cu²⁺, Fe³⁺, Mn²⁺, Zn²⁺, Cd²⁺, Fe²⁺, Ni²⁺, Hg²⁺, Co²⁺, Cu⁺, Au²⁺, Ba²⁺, and Pb²⁺, all showing upper quartile binding scores greater than 1. This observation corroborates previous modeling attempts using AlphaFold 3, which highlighted the affinity of Zn²⁺ to cysteine, characterized by its thiol functional group, for various metal ions (Yang et al., 2024).
Figure 16. Binding score of different metal ions to cysteine residues on PflQ2MT
However, we wanted to ascertain whether the metal ions listed above could specifically bind to the cysteine pocket of PflQ2MT, as the previous results only reflect the binding affinity of individual cysteine residues. The binding score data from MIB2 predicted binding by evaluating each cysteine residue independently, scoring their affinities one at a time.
Since proteins exist as three-dimensional structures, metal ions typically bind to a pocket formed by multiple residues working together, rather than to one isolated cysteine alone. Thus, further analysis was necessary to consider the docking scores and to quantify the number of cysteines identified by MIB2 as participating in metal ion binding within the three-dimensional structure of PflQ2MT. This assessment helped determine how many cysteines are involved collectively in binding metal ions, ultimately clarifying which metal ions can effectively bind to PflQ2MT.
As such, we first analysed the number of cysteines involved in binding for each metal ion. Figure 17 below illustrates the number of cysteines identified for binding by each metal ion, with a maximum of 9. The results indicate a noteworthy discrepancy: while Cu²⁺, Fe³⁺, Zn²⁺, Hg²⁺, Co²⁺, and Au²⁺ exhibited affinity for individual cysteine residues, a consideration of the three-dimensional protein structure revealed that they do not effectively bind to these residues. As a result, the metal ions identified as capable of binding include Mn²⁺, Cd²⁺, Fe²⁺, Ni²⁺, Cu⁺, Ba²⁺, and Pb²⁺.
PflQ2MT exhibited the greatest preference for Cd²⁺ and Cu⁺ (binding to all 9 cysteines), followed by Fe²⁺ at 7 and Pb²⁺ at 5. Cu⁺ is less stable and tends to oxidize to the Cu²⁺ form, and since Cu²⁺ cannot bind to any cysteines, it is unlikely to compete effectively with Cd²⁺. These results suggest that PflQ2MT preferentially binds to Cd²⁺ more than any of the other predicted metal ions.
Figure 17. Number of cysteines involved in binding for each metal ion
Validating the Results
The solution structure of PflQ2MT demonstrates that cadmium binds to the cysteine-rich pocket (Habjanič et al., 2018). Additionally, it is known that metal ions generally bind to cysteine residues in metallothioneins (Yang et al., 2024). Therefore, the predictions are mostly accurate.
However, the research paper that characterized PflQ2MT indicates that it can bind both zinc and cadmium ions (Habjanič et al., 2018), which contrasts with our findings showing that Zn²⁺ cannot bind to any cysteine residues. Although Zn²⁺ exhibits a higher binding affinity than Cd²⁺, MIB2 did not mark any cysteine residues as capable of binding Zn²⁺. We suspect this discrepancy could be attributed to the following issue.
Learn: Limitations Regarding the Predictions
While the cadmium ion binding predictions appear generally accurate, some limitations exist with MIB2 due to its lack of molecular dynamics in docking metal ions to proteins.
MIB2 algorithm appears to treat proteins as a static structure, implying that the protein is entirely rigid and unable to alter its topology. This assumption could significantly impact the accuracy of its predictions for dynamic structures such as metal-binding proteins. This issue becomes particularly evident when predicting metal binding affinities for proteins such as LgiMT1 and LgiMT2, where the changes in topology are substantial upon metal ion binding (refer to the AlphaFold 3 analysis from before, and Figures 5 & 6. Consequently, the metal binding affinity results may lack precision, as they fail to account for the correct dynamic topology of the protein.
Figure 18 below shows the MIB2 cadmium binding prediction for LgiMT1, showing all potential cadmium binding positions. It is observed that cadmium concentrates at four distinct positions (indicated with red boxes), whereas literature indicates that cadmium should only bind at two positions, the β and γ domains in LgiMT1, which bind 3 and 4 cadmium ions, respectively) (García-Risco et al., 2023).
Figure 18. Visualization of MIB2 prediction for all cadmium binding positions of LgiMT1
The expected structure with annotations of the β and γ domains is included for comparison, derived from AlphaFold 3 results of LgiMT1 with 7 zinc ions [Figure 19 below], which we predicted previously. This observation reinforces that the absence of a well-defined initial structure can significantly affect prediction accuracy. Although PflQ2MT displays a relatively well-defined structure, subtle changes may nonetheless sufficiently alter the resulting predictions.
Figure 19. AlphaFold 3 structure prediction of LgiMT1 binding with 7 zinc ions from an earlier section, with annotations showing the β and γ domains
Thus, in our computational workflow, we highlighted the many limitations that AI servers face. This highlighted two key learning points which guided our future cycles.
- Models cannot be fully in-silico and must be validated by Wet Lab experiments or existing literature to increase validity of outputs
- AI tools themselves are not the most apt tool to simulate dynamic proteins such as EC20 and PflQ2MT. Thus, a more robust and specific model had to be developed.
- Therefore, it was in our future cycles where our team began working with Molecular Dynamics simulations in order to check for binding, folding and stability under the specific conditions we would test and use our protein.
References
- Bae, W., Chen, W., Mulchandani, A., & Mehra, R. K. (2000). Enhanced bioaccumulation of heavy metals by bacterial cells displaying synthetic phytochelatins. Biotechnol. Bioeng., 70(5), 518–524. https://doi.org/10.1002/10970290(20001205)70:5%3C518::AIDBIT6%3E3.0.CO;25
- Beil, A., Jurt, S., Walser, R., Schönhut, T., Güntert, P., Palacios, Ò., Atrian, S., Capdevila, M., Reinhard Dallinger, & Zerbe, O. (2019). The Solution Structure and Dynamics of Cd-Metallothionein from Helix pomatia Reveal Optimization for Binding Cd over Zn. Biochemistry, 58(45), 4570–4581. https://doi.org/10.1021/acs.biochem.9b00830
- Blake, D. A., Chakrabarti, P., Khosraviani, M., Hatcher, F. M., Westhoff, C. M., Goebel, P., Wylie, D. E., & Robert, B. (1996). Metal Binding Properties of a Monoclonal Antibody Directed toward MetalChelate Complexes *. Journal of Biological Chemistry, 271(44), 27677–27685. https://doi.org/10.1074/jbc.271.44.27677
- Freisinger, E., & Vašák, M. (2013). Cadmium in Metallothioneins. Metal Ions in Life Sciences, 11, 339–371. https://doi.org/10.1007/9789400751798_11
- García-Risco, M., Calatayud, S., Pedrini-Martha, V., Ricard Albalat, Palacios, Ò., Capdevila, M., & Reinhard Dallinger. (2023). A de novo evolved domain improves the cadmium detoxification capacity of limpet metallothioneins. Scientific Reports, 13(1). https://doi.org/10.1038/s41598-023-35786-1
- Habjanič, J., Zerbe, O., & Freisinger, E. (2018). A histidine-rich Pseudomonas metallothionein with a disordered tail displays higher binding capacity for cadmium than zinc. Metallomics, 10(10), 1415–1429. https://doi.org/10.1039/C8MT00193F
- Halecki, W., Sionkowski, T., Chmielowski, K., Kowalczyk, A., & Kalarus, K. (2024). Municipal wastewater quality control: Heavy metal comparative Analysis—Case study. Environmental Protection and Natural Resources, 34. https://doi.org/10.2478/oszn-2023-0023
- Klaus Zangger, GÜLIN ÖZ, Armitage, I. M., & Otvos, J. D. (1999). Three‐dimensional solution structure of mouse [Cd7]‐metallothionein‐l by homonuclear and heteronuclear NMR spectroscopy. Protein Science, 8(12), 2630–2638. https://doi.org/10.1110/ps.8.12.2630
- Lin, Y.-F., Cheng, C.-W., Shih, C.-S., Hwang, J.-K., Yu, C.-S., & Lu, C.-H. (2016). MIB: Metal Ion-Binding Site Prediction and Docking Server. Journal of Chemical Information and Modeling, 56(12), 2287–2291. https://doi.org/10.1021/acs.jcim.6b00407
- Lu, C.-H., Chen, C.-C., Yu, C.-S., Liu, Y.-Y., Liu, J.-J., Wei, S.-T., & Lin, Y.-F. (2022). MIB2: metal ion-binding site prediction and modeling server. Bioinformatics, 38(18), 4428–4429. https://doi.org/10.1093/bioinformatics/btac534
- Lu, C.-H., Lin, Y.-F., Lin, J.-J., & Yu, C.-S. (2012). Prediction of Metal Ion–Binding Sites in Proteins Using the Fragment Transformation Method. PLoS ONE, 7(6), e39252. https://doi.org/10.1371/journal.pone.0039252
- Yang, R., Dumila Roshani, Gao, B., Li, P., & Shang, N. (2024). Metallothionein: A Comprehensive Review of Its Classification, Structure, Biological Functions, and Applications. Antioxidants, 13(7), 825–825. https://doi.org/10.3390/antiox13070825
Overview
GROMACS is a software specialising in Molecular Dynamics (MD) simulations. MD simulations calculate the physical movements of molecules based on energies, modeling their behaviour under realistic conditions and physical constraints over time. In synthetic biology, these simulations predict interactions between biological components.
Our primary motivation for this model is to efficiently answer biophysical questions that are resource-intensive for wet lab experimentation and to verify assumptions from circuit design. This is crucial for enhancing our understanding of the genetic circuit and communicating results to other modules to facilitate informed design changes. The predictive nature of MD modeling is especially vital for novel circuits like the circuit like ours involving metallothionein (MT), as it provides a computational basis for supporting real-life experimental data.
The computational tool, GROMACS, was selected for this task due to its robust interface and high performance ability. A key feature is its ability to incorporate force fields in simulations, investigating the potential energy between molecules while taking into account various parameters such as temperature, pH, pressure and density. These different conditions may require more extensive and time-consuming measures to explore through wet lab experiments alone, but can be tested in a fairly accurate manner with the help of MD simulation.
Our Dry Lab team focused on evaluating the binding potential between metallothionein (MT) and cadmium ions by generating predictive trajectories of the cadmium ion around our fusion protein. We also worked to determine the optimal pH for binding. Results were analysed through trajectory animations and by quantifying intermolecular distances to evaluate binding probability.
The insights from the minimum distances between the cadmium ion and MT suggest that cadmium ions exhibit more interactions with the protein. However, due to limited computational resources, these results primarily provide a foundation for future optimization. Future directions include subsequent work that involve longer simulation times, calculation of binding free energy (ΔG), and refined concentration setups to draw stricter conclusions regarding the specificity of cadmium binding compared to zinc ions.
Design
Through molecular dynamics simulation on GROMACS, our team mainly aims at evaluating the binding potential between MT and cadmium ions.
Designing system
Given that the concentration of zinc ion is significantly higher than cadmium ion in a random sludge sample, in ratios which can range from approximately 20:1 to 100:1 (Milik et al., 2017). We aim to simulate the situation as accurately as possible and select a concentration of 18 zinc ions to 1 cadmium ion for a zinc ion concentration sufficient enough to provide insightful results while keeping that of the cadmium ion minimal to reflect reality. However, it is worth noting that this concentration is still higher than what is expected in a real life sludge sample.
System Overview and Expectations
Our designed system consists of our fusion protein in an environment of 18 zinc ions and 1 cadmium ions. The system then goes through neutralisation with addition of sodium and chloride ions, then its energy is minimized. We are studying specifically the ion behaviour around the MT through their trajectory, and the distance between the cadmium ion and the MT residues are measured. A distance is interpreted to be potential bonding between the protein residue and cadmium ion when it is in the range of approximately 2.52 – 2.54 angstrom (Å) (Jalilehvand et al., 2009).
Additional considerations
Furthermore, gromacs energies parameters such as density, pressure, potential energy, and temperature are also plotted against time to visualise how these factors change with time within the system. Such observations will help to determine the stability of the system.
Build
Getting the software ready
Most of the dry lab members working on this task accessed GROMACS through homebrew and worked on the task using MacOS terminal, but it can also be accessed through VS Code.
Convert protein structure into .gro format + Force field selection
A forcefield CHARMM36 is specifically used in our case since only this force field has cadmium ion properties (GROMACS forums editors, 2023).
In a new folder, we input our fusion protein structure, which is a .pdb file since it is a molecular structure. The overall construction of our fusion protein cannot be found on databases and therefore have to be constructed by ourselves. The .pdb fusion protein structural file is made using AlphaFold by inputting the full protein sequence. We convert the .pdf file into .gro file that is compatible with working with in gromacs by the command:
- gmx pdb2gmx -f fusion_pro2.pdb -o processed.gro -ter -ignh
Figure 1.Force field selection process
It shows a selection of force fields that are available to choose from, select charmm36 force field (number 8 in our selection).
Components selection + definition
Figure 2.Water molecule model selection
Next, we choose either TIP3P or TIP3P_ORIGINAL as our water molecule, which are compatible with our chosen force field.
Figure 3. Water molecule selection
Structure of the content in the .pdb file has been read and analysed, information stored for use later.
Figure 4.Terminus type selection
Since the environment in which this process happens is acidic (low pH), we choose a neutral N-terminus and a COO- (charged) terminus.
System configuration + solvation
We construct a cubic system - the “box” which encloses our system for simulation with the following command:
- gmx editconf -f processed.gro -o newbox.gro -c -d 1.0 -bt cubic
Figure 5.Construction of cubic system
What we have done here is inputting the .gro file which we have processed earlier, to obtain newbox.gro, which is the system placed in a box with the properties outputted above. After setting up the box, we fill the box with the water molecules which we have selected earlier by the command:
- gmx solvate -cp newbox.gro -cs spc216.gro -o solvated.gro -p topol.top
Figure 6.System configuration
Adding cadmium and zinc ions>
Next, adding our desired ions is a critical step. We add the cadmium and zinc ions respectively to a preprocessed .mdp file.
Add zinc ions by the command:
- gmx genion -s ions.tpr -o system_with_zn.gro -p topol.top -pname ZN -np 18
Figure 7.Addition of zinc ions
Preprocessing is needed after adding the zinc ions with the following command:
- gmx grompp -f minim.mdp -c system_with_zn.gro -p topol.top -o ions_zn.tpr
Figure 8.Preprocessing after addition of zinc ions
Next, add cadmium ions:
- gmx genion -s ions_zn.tpr -o system_with_zn_cd.gro -p topol.top -pname CD2 -np 1
Figure 9.Addition of cadmium ion
Preprocess again with:
- gmx grompp -f minim.mdp -c system_with_zn_cd.gro -p topol.top -o ions_final.tpr
Figure 10.Preprocessing after addition of cadmium ion
Neutralisation
The system then goes through neutralisation with addition of sodium and chloride ions.
Addition of sodium and chloride ions:
- gmx genion -s ions_final.tpr -o solvated_ions.gro -p topol.top -pname NA -nname CL -neutral
Figure 11.Neutralisation process
Energy minimization
Next, the energy of the system is minimized. The software will figure out optimal locations of the molecules relative to the MT. The MT also adjusts its structure to adopt lower energy conformations. The purpose of this step is to mimic nature's preference for lowest energy conformations.
NVT, NPT Equilibrium and MD Simulations
After running energy minimization, we can run NVT and NPT equilibrium, and md simulations to generate data for temperature, pressure, and trajectories respectively in order to check for system and protein stability.
Visualization
Using PyMOL, input the system file and trajectory file, an animation of the moving zinc and cadmium ions can be generated. Depending on the number of frames, the animation length varies.
Test
Binding Affinity
A protein is composed of long amino acid sequences, each of these amino acids are referred to as a residue. Binding potential is a variable that measures the probability of a residue binding to specific molecules. In our case, we focus on evaluating the binding potential between MT and cadmium ions. Higher binding potential values indicate a higher chance of binding occurring between residues and the specific molecule.
Figure 14. Residues’ binding potential for cadmium ions
The graph illustrates the binding potential for cadmium ions for the different residues. The x axis shows the residue numbers, and the y axis plots the binding potential which is a dimensionless value. The baseline for the binding potential score is approximately -0.5, residues having a binding potential score close to this value indicate weak attraction to bind to cadmium ions.
The graphs show a trend of comparably higher binding potential for higher-numbered residues. The residues with the highest binding potential to cadmium ions are shown to be 171-CYS, 189-GLN 193-GLU, and 201-CYS. It means that for these specific residues, they are most likely to bind to cadmium ions. Therefore, the measured distance between the MT and cadmium ions are displayed in the following graphs for further investigation.
The binding distance between cadmium ions and cystine is 2.52 – 2.54 angstrom (Å) (Jalilehvand et al., 2009). This means that a distance within this range or smaller indicates that cadmium ion is bound to cysteine. Our graphs are frames obtained from the trajectories of the ions which are generated in GROMACs. In our graphs, the distances between the cadmium ion and the specific MT residue are displayed, all in the units of Å, which can be directly compared to the cutoff for binding distance. Ideally, the expected results should show the residues having a distance within or smaller than 2.52 – 2.54 Å to verify the cadmium ion can bind to MT.
Figure 15. Measured distance between cadmium ion and 171-CYS residue
Figure 16. Measured distance between cadmium ion and 171-CYS residue
The distances shown between 171-CYS and the cadmium ion are measured to be 38.9Å and 13.1Å over two different frames, which are not in the range of expected values of binding distance. This means that even though given the higher binding potential, the distance between the cadmium ion and 171-CYS residue indicates that the condition in this case is not optimistic for binding.
Figure 17. Measured distance between cadmium ion and 189-GLN residue
Figure 18. Measured distance between cadmium ion and 193-GLU residue
The distances between 189-GLN and 193-GLU are shown to be 50.7Å and 27.5Å respectively. These values are also higher and not in the range of expected values of binding distance. Thus in this case, the chances of binding are also low.
Figure 19. Measured distance between cadmium ion and 201-CYS residue
Figure 20. Measured distance between cadmium ion and 201-CYS residue
201-CYS and the cadmium ion show a distance of 44.5Å and 20.3Å across 2 frames. These two values are also not in the range of expected values of binding distance, and therefore the chances of binding are low, according to this result.
System and Protein Stability
Molecular dynamics simulation also allows us to analyze the system conditions, structural properties and energies over time, facilitating the assessment of system and protein stability.
The properties examined include system density, potential energy, system pressure, system temperature, backbone and radius of gyration of the protein.
System Conditions
Figure 21. System density over time
The density is shown to be relatively stable with slight fluctuations around the level of 1020-1030 kg/m^3 after an initial spike, meaning that the system has stabilized rapidly and achieved equilibrium given that there is no major change in density.
Figure 22. Potential energy of the system over time
Potential energy rapidly decreases as the system relaxes, and reaches a low energy state at around -1.0kJ/mol, indicating stability has been achieved given that the curve is near-plateau.
Figure 23. System pressure over time
The pressure of the system fluctuates around the level of 0 bar. Since the fluctuations are near an average pressure of 0 bar, it shows that the controlled pressure is maintained to be around 0 bar which is typical for MD simulations.
Figure 24. System temperature over time
The temperature of the system is around 300K (~27℃). The fluctuations with an average temperature of 300K Indicates that system temperature is also under control and as expected.
Protein Structural Properties
Figure 25. Backbone RMSD of the protein over time
The Root Mean Square Deviation (RMSD) is a measure of similarity in conformation between the backbone and protein structure. The graph shows an initial increase until around 0.4nm, which indicates stabilisation around that value. After that, the RMSD fluctuates around 0.4nm, indicating that the protein structure has stabilised and is not undergoing major conformational changes.
Figure 26. Radius of gyration of the protein over time
Radius of gyration measures the distance of concentration of mass from a rotational axis. Overall, the measured data shows minor fluctuation around a steady range for all components (total and axial). It can be concluded that the spatial distribution of atoms is consistent throughout.
Ideally, we expect that system conditions and protein structure to stay consistent and stable in order to output results. Through studying the graphs which present us with such information, the fluctuation of results under our simulated conditions and timeframe, the protein did not reach a stable binding state.
Learn
While limited by computational resources, our GROMACS simulations provide insight on valuable initial trends rather than conclusive proof. Our analysis of the minimum measured distances between the cadmium ion and MT suggests that cadmium ions may have more interactions with the MT protein than zinc ions. However, this remains a preliminary observation.
Interpretation and Limitations
The distances observed in our simulation were consistently longer than the typical length for a stable metal-thiolate bond. This indicates that under our simulated conditions and timeframe, the system did not reach a stable binding state. The fluctuations in the distance graphs confirm that a longer simulation time is required to achieve stability and observe a definitive binding event, which is beyond the scope of our current computational resources.
Connections with other modules
For Wet Lab, our results highlight that binding should be treated as an assumption to be tested. It is recommended that wet-lab experiments prioritize direct validation of the cadmium-MT binding affinity, as the computational model could not yet confirm it.
For Circuit Design, our analysis highlights that the performance of the MT part is the project's critical uncertainty.
Future directions
Given the results obtained from our trails, future work should focus on:
1. Longer simulation times: Achieving a standard timeframe calculation of 100ns or longer.
2. Binding free energy calculation: Quantitatively compare the binding strength of cadmium ions and zinc ions, so that rather than purely investigating in the binding potential between the cadmium ion and MT, the binding specificity of the two different ions to the protein can also be explored and compared.
3. Refined setup: Optimisation of system parameters and ions concentration to achieve enhanced stability.
References
- Bondi, A. (1964). van der Waals Volumes and Radii. The Journal of Physical Chemistry, 68(3), 441–451. https://doi.org/10.1021/j100785a001
- GROMACS forums. (2023, February 22). CHARMM force field parameters for metal ions Cd(II) and Pb(II). GROMACS Forums. https://gromacs.bioexcel.eu/t/charmm-force-field-parameters-for-metal-ions-cd-ii-and-pb-ii/5852
- Hub, J. S., Bert, Helmut Grubmüller, & Gerrit Groenhof. (2014). Quantifying Artifacts in Ewald Simulations of Inhomogeneous Systems with a Net Charge. Journal of Chemical Theory and Computation, 10(1), 381–390. https://doi.org/10.1021/ct400626b
- Farideh Jalilehvand, Leung, B. O., & Mah, V. (2009). Cadmium(II) Complex Formation with Cysteine and Penicillamine. Inorganic Chemistry, 48(13), 5758–5771. https://doi.org/10.1021/ic802278r
- Milik, J., Pasela, R., Lachowicz, M., & Chalamoński, M. (2017). THE CONCENTRATION OF TRACE ELEMENTS IN SEWAGE SLUDGE FROM WASTEWATER TREATMENT PLANT IN GNIEWINO. Journal of Ecological Engineering, 18(5), 118–124. https://doi.org/10.12911/22998993/74628
Membrane Protein Docking
In order to understand whether the fusion protein could correctly express itself on the membrane, we decided to use computational tools to stimulate the expression of the fusion protein Lpp-OmpA-PflQ2MT into the bacterial outer membrane. By using the computational tools CHARMM-GUI and GROMACS, we can predict whether the Lpp-OmpA-PflQ2MT can correctly be expressed on the membrane of E. coli (strain K12) and P. putida (strain KT2440).
Step1: Preparing the Protein Sequence
Before using CHARMM-GUI to stimulate the membrane protein, we would need to know the characteristics and the original position of the component on a membrane to better understand the results. Although Wet Lab work was done in DH5a and BL21, the asset library and characterisation of E. coli K12 was more documented and well known, thus we first did the simulation using E. coli K12 structure. This served as a proxy for other E. coli strains and also a basis to validate the result of P. putida KT2440
Setup
Membrane source: E. coli K12 and P. putida KT2440
Fusion protein composition: Lpp-OmpA-PflQ2MT
Lpp (Braun’s Lipoprotein): A major outer membrane protein in E. coli that anchors the outer membrane to the peptidoglycan layer, typically lipidated at a cysteine residue for membrane attachment. [Figure 1]
UniProt Accession Number: P69776 (Originates from E. coli K12)
OmpA: An outer membrane protein in Gram-negative bacteria with a β-barrel transmembrane domain (N-terminal) and a periplasmic domain that interacts with peptidoglycan. [Figure 2]
RCSB Accession Number: 1QJP (Pautsch & Schulz, 2000) (Originates from E. coli BL21(DE3))
PflQ2MT: A metallothionein-like domain from another microorganism, rich in cysteine residues, designed to bind heavy metal ions (preferentially four Cd²⁺ ions via thiolate bonds). [Figure 3]
UniProt Accession Number: J2EKT7 (Originates from P. fluorescens (strain Q2-87))
Figure 1, 2, and 3 (from left to right)Showing the structures of Lpp, OmpA, and PflQ2MT (short-form) respectively
The protein sequence is then combined and modelled using AlphaFold 3 and the structure for Lpp-OmpA-PflQ2MT is produced as shown below. [Figure 4]
Figure 4. Fusion protein structure generated from AlphaFold 3
Step 2: Setting up CHARMM-GUI
CHARMM-GUI is a web-based platform to interactively build complex systems and prepare the inputs with well-established and reproducible simulation protocols for state-of-the-art molecular simulations. CHARMM-GUI is a suitable option to stimulate complex systems. Below we explain how to model a membrane protein.
In order to make sure the orientation of the fusion protein and composition of membrane will be accurate, we first modelled OmpA only. This is because OmpA’s orientation on protein is known and can be found on OPM database (Orientation of Proteins in Membrane database) with the access code 1qjp.
Figure 5. Image from OPM illustrating the orientation of OmpA (OPM, 2025)
To generate the membrane, we referenced a research done on a similar protein, OphrH modelled in E. coli K12. Thus the membrane composition shown in figure 6 below (Lee, Patel, Kucharska, Tamm, & Im, 2017).
System |
Lipid composition |
No. of Lipids |
E. coli K12 |
Top |
Bottom |
Top |
Bottom |
E. coli Lipid A + K12 core |
PPPE/PVPG/PVCL2 |
36 |
75/20/5 |
Figure 6. Table showing membrane lipid composition of E. coli K12 from research
For the composition of the upper leaflet of E. coli K12 membrane, we referenced CHARMM-GUI tutorial on “Membrane Builder: BtuB in E. coli outer membrane”, where the Core type is K-12 and the number of O-antigen repeating units is 2 and O6 is used as the available O-antigen and other parameters remains the same as default.
Results & Analysis
The resultant topology of OmpA in the membrane in figure 7 is shown below, which is the same as the OmpA from OPM shown in figure 5 above. With the correct topology and system size, we can add the other components into the model while using the same parameters.
Figure 7. CHARMM-GUI result of docking OmpA to the membrane
To further judge whether the positioning of the fusion protein is accurate, we compared the cross sectional area of OmpA to the fusion protein OmpA-PflQ2MT .
Cross sectional Area of the Protein |
 |
 |
Protein |
OmpA |
OmpA-PflQ2MT |
Parameters for positioning |
Directly imported form OPM server |
Positioned using PPM2.0 |
Figure 8 and 9. Cross-Sectional Depiction of OmpA and OmpA-PflQ2MT respectively through the membrane
As seen from the figure 8, OmpA’s cross-sectional area of the protein span across extra-cellular and intra-cellular with more surface extruding into the extra-cellular part, which corresponds to its topology in OPM server and its characteristics as a beta-barrel transmembrane protein. Thus, we can use this knowledge to validate whether the fusion protein is correctly positioned.
As an incorrect example, we can see from figure 9 that most of the protein is expressed on the intra-cellular side which is not how a transmembrane protein should be expressed inside a membrane.
Therefore, we tested out other PDB orientations until we got one that is the closest as its real-life counterpart as validated with OmpA topology on OPM.
PDB orientation parameters |
Calculated Cross Sectional Area |
Visualized |
- Positioned using PDB orientation
- Rotate molecule respect to the X axis -90 Degree
|
 |
 |
Figure 10. (left) shows the calculated cross sectional area;
Figure 11. (right) illustrates the docking of OmpA-PflQ2MT to the membrane
Figure 12. Illustration of OmpA-PflQ2MT with the respective layer of membrane labelled
The red box area indicates upper leaflet, whereas the pink box area indicates lower leaflet and the green arrow pointing to the green protein is OmpA-PflQ2MT (slightly hidden).
Learn: Results and interpretation
During our membrane protein simulation, we ran into several problems in the end, and we could not follow through with analysis and continue development of this dry lab part. The reasons are as follows:
1. Limited benchmark data for membrane protein simulations
During our initial research into membrane protein simulations, there were very few tutorials and information on membrane modelling. Therefore, it was difficult to verify the accuracy of our model or confirm whether the chosen parameters (e.g. force fields, PDB manipulation option, system size, component building options, etc) align with real-world bacterial behaviors through a purely in-silico model. Without established reference datasets or similar peer-reviewed case studies like ours utilizing Lpp-OmpA-(a metal binding protein), our results are largely unsubstantiated and would require characterisation studies and validation from wet lab data. Therefore, the original purpose of predicting wet lab results is quite challenging and requires a shift in paradigm. Rather, the fundamental model itself must first be validated before subsequently relying on it to predict Wet Lab results. Only then, it can be more reliable to help predict the outcome of experiments which exceed the scope of the Wet Lab.
2. Little to no data on the membrane lipid composition of P. putida KT2440
Our end goal is to take the simulation and analyze results of the fusion protein on E. coli K12 to compare with that on P. putida KT2440. General gram-negative membrane models like E. coli K12 are very well documented, with plenty of peer-reviewed articles and tutorials using the same parameters to generate the membrane. While for P. putida, the bacteria membrane is known to vary under environmental stresses (e.g. extreme pH value, heavy metal exposure, etc. ). There is not much literature surrounding the composition of the membrane, hence our model of the lipid membrane is simply an educated guess with many assumptions. An inaccurate membrane model would skew the protein insertion dynamics or cadmium accessibility to PflQ2MT on the membrane, resulting in analysis which must be interpreted with caution.
3. Structural Artifacts in Protein Prediction Affecting Model
The predicted structure of the fusion protein introduced a notable difference from the original protein component. Specifically as seen from figure 13 below, a structure that originally should be part of the OmpA β-barrel domain has manifested as uncoiled strings rather than the expected barrel conformation at the junction connecting to the PflQ2MT domain. This is likely due to limitations in structure prediction tools on predicting fusion proteins involving sequence from several organisms. This distortion would alter membrane insertion and exposure of PflQ2MT domain where it should be exposed outside of the membrane, but due the the protein conformation it is “squished” next to OmpA which result in it expressing partially within the membrane as well. We had tried other protein structural prediction software like SWISS-MODEL and RoseTTAFold which all gave similar results. This hinders our abilities to further stimulate the binding of metal-ions onto the membrane protein as well as inaccurate RMSD metric and binding energy calculations down the line. With more time, a dynamic model which accounted for the fact that fusion proteins are not static would have been more beneficial and insightful.
Figure 13. Comparison of fusion protein (OmpA-PflQ2MT) (left) with OmpA (Right)
CHARMM-GUI provided an easy entry point for membrane protein modeling, and we were able to utilise it to get successful docking results. Thus, we believe our engineering work provided insightful troubleshooting and thought processes for us and possibly for future teams.
However, there are still many uncertainties and assumptions which had to be made, leading to potential discrepancy between the simulation of the bacteria membrane and protein compared to its behaviour in bacteria, and due to the computational and time limitations, we did not have time to improve the model further.
In the future to better utilize membrane protein stimulation and improve the modelling, we could find experts to help validate the accuracy of our model and ask for their expertise on such matters. Not only that, if resources allow, we can do our own wet lab experiments on the membrane of P. putida to get more accurate results for the lipid composition of the membrane system, to further improve the computational model.
References
- Brown, T. P., Santa, D. E., Berger, B. A., Kong, L., Wittenberg, N. J., & Im, W. (2024). CHARMM GUI Membrane Builder for oxidized phospholipid membrane modeling and simulation. Current Opinion in Structural Biology, 86, 102813–102813. https://doi.org/10.1016/j.sbi.2024.102813
- Feng, S., Park, S., Choi, Y. K., & Im, W. (2023). CHARMM-GUI Membrane Builder: Past, Current, and Future Developments and Applications. Journal of Chemical Theory and Computation, 19(8), 2161–2185. https://doi.org/10.1021/acs.jctc.2c01246
- Gee, S., Glover, K. J., Wittenberg, N. J., & Im, W. (2024). CHARMM‐GUI Membrane Builder for Lipid Droplet Modeling and Simulation. ChemPlusChem, 89(8). https://doi.org/10.1002/cplu.202400013
- Jo, S., Cheng, X., Islam, S. M., Huang, L., Rui, H., Zhu, A., Lee, H. S., Qi, Y., Han, W., Vanommeslaeghe, K., MacKerell, A. D., Roux, B., & Im, W. (2014). CHARMM-GUI PDB Manipulator for Advanced Modeling and Simulations of Proteins Containing Nonstandard Residues. Advances in Protein Chemistry and Structural Biology, 235–265. https://doi.org/10.1016/bs.apcsb.2014.06.002
- Jo, S., Kim, T., & Im, W. (2007). Automated Builder and Database of Protein/Membrane Complexes for Molecular Dynamics Simulations. PLoS ONE, 2(9), e880. https://doi.org/10.1371/journal.pone.0000880
- Jo, S., Kim, T., Iyer, V. G., & Im, W. (2008). CHARMM-GUI: A web-based graphical user interface for CHARMM. Journal of Computational Chemistry, 29(11), 1859–1865. https://doi.org/10.1002/jcc.20945
- Jo, S., Lim, J. B., Klauda, J. B., & Im, W. (2009). CHARMM-GUI Membrane Builder for Mixed Bilayers and Its Application to Yeast Membranes. Biophysical Journal, 97(1), 50–58. https://doi.org/10.1016/j.bpj.2009.04.013
- Kong, L., Park, S.-J., & Im, W. (2024). CHARMM-GUI PDB Reader and Manipulator: Covalent Ligand Modeling and Simulation. Journal of Molecular Biology/Journal of Molecular Biology, 168554–168554. https://doi.org/10.1016/j.jmb.2024.168554
- Lee, J., Patel, D. S., Kucharska, I., Tamm, L. K., & Im, W. (2017). Refinement of OprH-LPS interactions by molecular simulations. Biophysical Journal, 112(2), 346–355. https://doi.org/10.1016/j.bpj.2016.12.006
- Lee, J., Patel, D. S., Ståhle, J., Park, S.-J., Kern, N. R., Kim, S., Lee, J., Cheng, X., Valvano, M. A., Holst, O., Knirel, Y. A., Qi, Y., Jo, S., Klauda, J. B., Widmalm, G., & Im, W. (2018). CHARMM-GUIMembrane Builderfor Complex Biological Membrane Simulations with Glycolipids and Lipoglycans. Journal of Chemical Theory and Computation, 15(1), 775–786. https://doi.org/10.1021/acs.jctc.8b01066
- OPM. (2025). Umich.edu. https://opm.phar.umich.edu/proteins/1
- Pautsch, A., & Schulz, G. E. (2000). High-resolution structure of the OmpA membrane domain11Edited by D. C. Rees. Journal of Molecular Biology, 298(2), 273–282. https://doi.org/10.1006/jmbi.2000.3671
- Park, S., Choi, Y. K., Kim, S., Lee, J., & Im, W. (2021). CHARMM-GUI Membrane Builder for Lipid Nanoparticles with Ionizable Cationic Lipids and PEGylated Lipids. Journal of Chemical Information and Modeling, 61(10), 5192–5202. https://doi.org/10.1021/acs.jcim.1c00770
- Park, S.-J., Kern, N., Brown, T., Lee, J., & Im, W. (2023). CHARMM-GUI PDB Manipulator: Various PDB Structural Modifications for Biomolecular Modeling and Simulation. Journal of Molecular Biology, 167995. https://doi.org/10.1016/j.jmb.2023.167995
- Wu, E. L., Cheng, X., Jo, S., Rui, H., Song, K. C., Dávila-Contreras, E. M., Qi, Y., Lee, J., Monje-Galvan, V., Venable, R. M., Klauda, J. B., & Im, W. (2014). CHARMM-GUI Membrane Builder Toward Realistic Biological Membrane Simulations. Journal of Computational Chemistry, 35(27), 1997–2004. https://doi.org/10.1002/jcc.23702
Development of neural network models
1. Prediction of metal binding sites in proteins
The model has been documented into a paper which has been accepted by 2025 9th International Conference on Computational Biology and Bioinformatics (ICCBB 2025). The preprint version has been released: https://doi.org/10.1101/2025.09.13.676010.
Design
Computational prediction of residue-level metal-binding sites has progressed quickly in recent years. However, we found that existing methods still face major limitations. Sequence-based predictors for multiple metals often show limited accuracy or rely on rigid, ion-specific designs that must be completely rebuilt to include new metal types. Structure-based predictors can achieve higher accuracy, but they depend on high-resolution 3D structures and computationally intensive steps such as homology modeling or structural alignment. More recently, protein language model approaches like ESM-1b and ESM-2 have improved predictive performance, but they remain resource-heavy, with long inference times and substantial GPU memory requirements for large sequence datasets.
Build
To address these challenges, we developed a new multi-metal binding site predictor based solely on amino acid sequences. As shown in the figure above, our framework uses a two-stage architecture: in the first stage, independent convolutional neural networks (CNNs) are trained for each specific metal ion; in the second stage, the outputs of these networks are combined into a fusion model.
Test
Training time
Because of this lightweight design, both the individual metal-specific predictors and the combined fusion network can be trained in less than an hour using a single NVIDIA A800 GPU. This fast turnaround allows for rapid testing of design choices and real-time parameter optimization, which is particularly valuable for large-scale proteome analysis.
Threshold-Dependent Analyses
To evaluate the performance of our fusion network, we used a stratified five-fold cross-validation approach during training and tested the final models on an independent dataset. Results from all folds were averaged to ensure reliability. For binary classification, we introduced a probability cutoff τ: residues predicted above this threshold were considered metal-binding, while those below it were treated as non-binding.
F1 Score and Matthews Correlation Coefficient
When examining different threshold values, the ensemble displayed strong and consistent performance across metals. The overall accuracy peaked within a moderate threshold range, where both precision and recall were well balanced. Matthews Correlation Coefficient (MCC) metrics followed a similar trend to the F1 score, confirming that the ensemble achieved a stable compromise between false positives and false negatives, even when dealing with uneven class distribution.
Precision–Recall Trade-Off
The precision–recall curves revealed different strengths for each type of metal. Iron predictions maintained good accuracy even as recall increased, making the model well-suited for exhaustive screening tasks. Zinc predictions required more cautious thresholding for applications where precision is most important. Magnesium predictions showed steadier performance across a wide range of thresholds, providing a good balance of sensitivity and reliability for more general applications.
Per-Metal Performance
At the optimal operating threshold, each metal displayed characteristic trade-offs. Iron showed the strongest balance between precision and recall, magnesium performed consistently across conditions, and the recall of zinc can be increased after minor adjustment to the threshold. These differences allow the method to be tuned for specific goals, whether maximizing coverage for large-scale annotation or prioritizing precision for experiments needing more confident predictions.
On the whole, the ensemble delivered reliable performance across all metals. Supported by the above testing results, we finally selected τ = 0.45, which ensured a balanced outcome.
Ablation Study
To better understand how each part of our model contributed to overall performance, we carried out an ablation study using stratified five-fold cross-validation. By systematically testing different configurations, we were able to observe how design choices progressively improved prediction accuracy.
CNN Layers and Dropout
Our initial baseline, which relied on simple sequence encoding with a single CNN layer, performed poorly and struggled to extract meaningful features. Adding more CNN layers improved the results greatly, as deeper hierarchical filters captured increasingly complex sequence patterns important for recognizing binding motifs. The introduction of a dropout layer further strengthened the model by reducing overfitting, making predictions more reliable across different datasets while still remaining responsive to rare binding residues.
Handling Class Imbalance
Because metal-binding residues are much less common than non-binding positions, class imbalance posed a significant challenge. To address this, we designed a weighted loss function that increased the influence of the minority binding class during training. We then compared our loss function with the traditional F1 loss. From the results, it is clear that our loss function boosted the model’s recall while preserving overall balance, countering the natural bias toward non-binding sites. Unlike data resampling methods, this loss-based approach improved the detection of rare positives in a direct and efficient manner.
Fusion Layer
When predictions from the separate metal-specific CNN modules were combined through a fusion layer, we observed further refinement of performance. This integration not only smoothed out inconsistencies across different metal types but also made it possible to apply a shared threshold without requiring separate calibration for each ion. Importantly, the fusion mechanism also captured the interplay between metals, leading to more balanced precision and recall. In practice, this meant fewer cases of the model favoring precision over sensitivity or vice versa, resulting in more stable predictions overall.
The ablation experiments highlighted two fundamental insights. First, the weighted loss function is indispensable for managing the extreme residue imbalance, ensuring that the model fairly represents both common non-binding and rare binding sites. Second, the fusion framework, though modest in impact when considered in isolation, adds an important layer of robustness by aligning performance across different metals and encoding subtle cross-metal relationships that single models cannot capture.
Learn
From our testing phase, we learned that it is possible to predict proteins accurately without explicitly using structural information as input. One possible explanation is that the amino acid sequence itself already encodes the relevant structural features. For future improvements, we plan to extend the model to additional metals and work towards raising the F1 score beyond 0.9.
2. Generating new protein sequences that can bind to specific metal ions
Design
Building upon our demonstration that accurate prediction of metal binding sites is feasible using solely protein sequences as input, we advance to the development of a generative model for producing novel protein sequences. Here we propose a diffusion-based generative model that is built around a U-Net architecture, wherein the U-Net receives noisy latents and metal labels as inputs, outputting the predicted noise for iterative denoising.
The overall architecture is a latent diffusion model tailored for generating protein sequences, particularly those involving metal-binding properties (conditioned on metal labels). It combines an autoencoder framework with a diffusion process in latent space, enabling efficient generation of diverse, novel protein sequences.
Key Components:
- Encoder:Protein sequences (represented as strings of amino acid one-letter codes, e.g., "NELRCGCPDCHKVDPE RVFNH...") are fed into an encoder, which maps them into a compressed latent representation denoted as Z0. This latent embedding captures essential features of the sequence in a lower-dimensional space, facilitating the subsequent diffusion process.
- Forward Diffusion:Starting from the latent Z0, Gaussian noise is progressively added over multiple steps through a forward diffusion process, transforming it into a highly noised version Zt. This creates a Markov chain where each step increases randomness, ultimately approximating a standard Gaussian distribution. The process is deterministic and serves to train the model on how to handle noise addition.
- U-Net for Denoising:The core of the generative process is a U-Net-based denoiser. During the reverse diffusion (generation or reconstruction), the U-Net receives the noised latent Zt as input, along with conditioned metal labels (e.g. specifying types like zinc or iron for metalloprotein design). The metal labels are first passed through a separate encoder to embed them into a compatible format, which is then integrated into the U-Net. The U-Net predicts the noise added at each step ("pred noise"), allowing iterative subtraction of noise to recover a clean latent representation.
- Decoder:Once the latent has been denoised through repeated U-Net applications, the resulting clean latent Z0’ is passed through a decoder to reconstruct or generate new protein sequences. This maps the latent back to the original sequence space, producing functional protein variants.
This design leverages the strengths of latent diffusion (computational efficiency in lower dimensions) and U-Net's hierarchical feature extraction, making it suitable for tasks like de novo metalloprotein design where conditioning on labels ensures generated sequences exhibit desired properties like metal-binding affinity.
Design
Overview
As part of Cycle 3 of our project, it involved the development of a dynamic cadmium-sensitive Biosensor. As more thoroughly described under Cycle 3 of our Engineering Cycles page, this biosensor circuit aimed to use a multi-regulatory system in order to increase the sensitivity of the circuit towards cadmium (to detect and glow under trace concentrations found in sludge). The circuit also aimed to increase expression rate of the downstream mScarlet fluorescent protein and its subsequent degradation by utilising a Toehold Switch expression system and an LVA degradation tag found natively in the Pseudomonas putida bacteria.
Our Hypothesis
The hypothesis from our team was to produce a circuit, that under induction of a certain concentration of aTc and Cadmium, would produce a sharp fluorescent peak in under 6 hours, followed by a rapid non-linear degradation over 2-3 hours.
Thus, we sought to try to use ODEs to model the entire biosensor model and prove our hypothesis on how the Biosensor Circuit was intended to work.
To model the entire biosensor, it needed to model the uptake and flux of both cadmium ions and anhydrotetracycline (aTc), and how the concentration of these two inputs affect the regulatory effect of the tetracycline repressor (tetR) for repression and the cadmium-responsive regulator (cadR) enabling cadmium detection and mScarlet serving as a fluorescent reporter.
Kinetic Equations
To model the entire biosensor, it needed to model the uptake and flux of both cadmium ions and anhydrotetracycline (aTc), and how the concentration of these two inputs affect the regulatory effect of the tetracycline repressor (tetR) for repression and the cadmium-responsive regulator (cadR) enabling cadmium detection and mScarlet serving as a fluorescent reporter. Therefore, we first sought to break the complex regulatory elements down into simpler kinetic equations first, as a means to illustrate the holistic system. In doing so, we formulated the equations as seen below in Figures 1, 2, 3, based on the major regulatory elements.
Figure 1. tetR Kinetic Equations
Figure 2. cadR Kinetic Equations
Figure 3. Riboswitch Kinetic Equations
Build
Using SimBiology and MatLab
Initially, our team opted to use SimBiology to model our system. While MatLab provided a greater degree of flexibility and power to model ODE systems, SimBiology provided certain safeguards to reduce likelihood of human errors in our model. Here, it allowed us to model our equations using simple kinetic equations and provided validity checks on aspects such as parameter units which helped us to quickly validate and revise the accuracy of our models, parameters and data. By using SimBiology, it allowed us to clearly define our parameters as seen in Figure 4 and better represent the mechanism of the Biosensor on a molecular level.
For the above kinetic equations, we made several assumptions for modelling purposes. Firstly, we assumed the majority of degradation, transcription and translation reactions were first-order reactions unless literature suggested otherwise. For initial modelling purposes, we also assumed the cell would remain a constant volume and would not grow. To model the influx of cadmium before triggering the cell, we assumed a constant passive diffusion mechanism, and similarly, assumed the cadmium would only be used to activate the Biosensor and would not be replenished or utilised in other reactions. We assumed the reaction would linearly scale-up with volume as we first modeled it under the environment of one cell 1 fl, although this volume is dynamic and varies depending on growth conditions (Volkmer & Heinemann, 2011).
Parameter Table
The following table lists the parameters used in the ODEs, grouped by their relevance to tetR, cadR, and Switch & Trigger components, with their values and units. Note that parameters without sources, particularly those focusing on degradation rates, were estimated by assuming a first-order degradation pathway and calculating values based on an estimated half-life. Some values were also predicted using the BioInformatics Tool ProtoParam (Gasteiger et al., 2005). Unfortunately, not every species has been characterised or studied in studies. Some values also had to be arbitrarily assigned and adjusted through trial and error testing as they are not able to be directly characterised, or the kinetic equations modelling them are
TetR Reaction
cadR Reaction
Switch & Trigger
Figure 4. Parameters of Kinetic Equations
By combining the kinetic laws above, we developed a schematic to represent the flux of the Biosensor Circuit components within a bacterial cell as seen in Figure 5. The schematic was built in one of the supplementary MatLab Apps, the SimBiology Model Builder.
Figure 5. SimBiology Schematic of Biosensor Circuit in Bacterial Cell
Within this schematic, the entire circuit is built within a single Pseudomonas putida bacteria cell which as mentioned was assumed to be 1 fl (Volkmer & Heinemann, 2011). Parameters from Figures 1, 2 & 3 were assigned within SimBiology’s parameter field. All variable species except for two were set with initial concentrations of 0, although this understandably would not be entirely realistic. The first was aTc, where according to the paper where the Biosensor was adapted from, we replicated their protocol and inputted a final concentration 0.938 micromolar of aTc (Hu et al., 2023). For the starting cadmium concentration, a median concentration was selected based on the survivability of the bacteria transformed with metallothionein.
Each blue oval represents a species with variable concentration to be determined by the chemical equations applied to it. The yellow circles are components to assign a reaction with the arrow following the direction of the reaction. The equations used were all ones derived from Figure 4. By linking a reactant and product through the arrow, it directly indicates to the cell that the reactant is consumed in the production of the product and thus sets that reactant as a limiting reactant. In the opposite case, such as with the tetR_repression reaction, tetR protein directly represses the transcription of tetR mRNA. However, the kinetic equation itself for tetR repression includes this repression dynamic, and the homodimer tetR protein which binds and represses the promoter undergoes a reversible reaction. Thus, while tetR protein is a main reactant, it is not directly linked as a reactant. This is the logic used when the schematic for the Biosensor model was designed.
Test
SimBiology Outputs
After the model was built, the SimBiology Model was exported into the built-in SimBiology Analyser where the kinetic laws were computed together as an ODE system. It was hypothesised by the Circuit Design Team that a fluorescent peak could be reached within 4 hours of mScarlet induction. It was also expected that this would be succeeded by rapid non-linear degradation as expected and demonstrated in Iteration 1 of the LVA-tag modelling. Thus, the timespan was reasonably set at 650 minutes.
To compute the ODEs, the SimBiology Analyser provides several solver types to choose between. In this case, the ODE15s was chosen as the solver. As the ODE system captures processes like transcription, translation, and protein degradation, which operate on significantly different timescales, a stiff model arises due to the discrepancy of timeframe in which species dynamics occurs. Therefore, it seemed beneficial to use ODE15s as the solver which is able to incorporate both the large steps necessary for the comparatively slow dynamics of protein translation and degradation along with the small steps necessary in fast dynamics. ODE15s employs variable-order numerical differentiation formulas, which handle the disparate timescales more effectively. This allows ODE15s to take larger time steps while maintaining stability and accuracy, making it more efficient for my model. Additionally, its adaptive nature adjusts the solver’s order dynamically, optimizing performance for the complex regulatory relationships in the Biosensor Circuit.
Thus, the outputs of the model were as generated, with a generated model for mScarlet as seen in Figures 6 below.
Figure 6. mScarlet degradation tag
The graph plots mScarlet concentration against time and focuses on the reagent most wanted for this model. It results in a Biosensor which peaks around 0.065 micromole / liter concentration. Based on this model, the behaviour of this circuit behaved similarly to the expectation from Circuit design, having a fast response, followed by a fluorescent peak, and then a drop, thus producing the circuit behaviour of a dynamic biosensor.
Modelling in MatLab
With an output that behaved as expected, we transitioned and plotted the model in MatLab to create a more universally readable file. Firstly, this involved combining the Kinetic Equations to create a set of 11 ODEs to model the entire system as seen in Figure 7:
Species |
ODE equation |
tetR free protein |
 |
cadR free protein |
 |
Trigger RNA |
 |
aTc |
 |
tetR mRNA |
 |
Cadmium Ions |
 |
Hairpin Structure |
 |
Induced Riboswitch |
 |
mScarlet |
 |
tetR-aTc complex |
 |
cadR mRNA |
 |
Figure 7. Combined ODEs of Biosensor Model
The ODEs were plotted in MatLab and computed using ODE15s again with the same tolerances, timespan and parameters. Subsequently, a species profile and mScarlet plot was generated and the model largely simulated as expected with no anomalous behaviour or outputs (such as negatives or largely disobeying with the behavioural species in SimBiology).
In Figure 8 and 9, both the concentration profile and mScarlet could be seen.
Figure 8. Concentration Profiles
Figure 9. mScarlet Model
Learn
What do the outputs show?
For the output, the general expected output is around 0.065 which appears to not be a detectable fluorescence by the human eye, even under UV or Blue-green excitation (NISWENDER et al., 1995). The model was intended to be a proof of concept and to verify the hypothesis of the Biosensor circuit, however, the Biosensor was not able to successfully express any visible fluorescent protein and thus was not able to be characterised or used to verify our model. Interestingly, literature suggests that the model aligns with the Wet Lab output as 0.065 micromole / litre of most fluorescent proteins is insufficient to be detected by eye. Potentially, this indicates the circuit is behaving as sequencing results show the circuit was assembled and transformed properly. Thus, the circuit may be behaving as expected but at an insufficient sensitivity.
Notably, there is also a slight discrepancy in the degradation relationship of the plots in Figure 6 and Figure 9. This slight difference in computation may be due to hidden computational processing in SimBiology as all equations and parameters were verified to be identical in both the MatLab and SimBiology file. However, they both peak at the same concentration and follow the same general behaviour, perhaps indicating it is a negligible discrepancy.
It is also important to note that the parameters are educated guesses at best, and concentrations seen from profiles and the mScarlet graph are very unlikely to reflect the actual model of the circuit. With more time, we had hoped to be able to take a more reductionist approach and attempt to characterise certain parts we used that were not extensively documented. This would have benefitted both the validity of our model and the wider iGEM community.
Future Directions
Upon the development of our model, our Wet Lab work had shown inconclusive results regarding the LVA-degradation tag, and the subsequent Biosensor circuit was unable to produce reproducible results. Our team thus hypothesised two potential reasons
1. The estimation of parameters resulted in an overall system that misrepresented what was actually happening in the system
2. The Biosensor Circuit itself was non-functional, or there was a deeper issue which appeared as a failure of the LVA degradation tag.
Regrettably, our team lacked the time to perform proper sensitivity analysis. This would have allowed us to better fine-tune and determine the Cadmium and aTc sensitivity to trigger our circuit. More specifically, we hoped a sensitivity analysis could determine the parts contributing to the maximum peak of mScarlet or the initial rate of mScarlet production. It would also highlight if certain parts such as the promoter were acting as a bottle-neck for the desired expression or sensitivity of our circuit, and we could have sought to replace such parts with better characterised ones. This would allow us to see whether reducing the strength of certain regulatory elements were worth testing.
Overall, ODEs are merely a model to simulate and are predominantly valuable from trying to validate hypotheses regarding the entirety of the model, rather than predicting quantitative results. Naturally, there are many parameters and data which are not reliably characterised and educated guesses must be made. However, as seen, ODE modelling can show value when computing the flux of many different species and be used to see how regulatory components may interact with one another.
References
- Andersen, J. B., Sternberg, C., Poulsen, L. K., Bjørn, S. P., Givskov, M., & Molin, S. (1998). New Unstable Variants of Green Fluorescent Protein for Studies of Transient Gene Expression in Bacteria. Applied and Environmental Microbiology, 64(6), 2240–2246. https://doi.org/10.1128/aem.64.6.2240-2246.1998
- Bernstein, J. A., Khodursky, A. B., Lin, P.-H. ., Lin-Chao, S., & Cohen, S. N. (2002). Global analysis of mRNA decay and abundance in Escherichia coli at single-gene resolution using two-color fluorescent DNA microarrays. Proceedings of the National Academy of Sciences, 99(15), 9697–9702. https://doi.org/10.1073/pnas.112318199
- Gasteiger, E., Hoogland, C., Gattiker, A., Duvaud, S., Wilkins, M., Appel, R., & Bairoch, A. (2005). Protein Analysis Tools on the ExPASy Server Protein Identification and Analysis Tools on the ExPASy Server (pp. 571–607). https://link.springer.com/book/10.1385/1592598900
- Hu, S., Zhang, G., & Jia, X. (2023). Improvement of a highly sensitive and specific whole-cell biosensor by adding a positive feedback amplifier. Synthetic and Systems Biotechnology, 8(2), 292–299. https://doi.org/10.1016/j.synbio.2023.03.007
- Kamionka, A. (2004). Two mutations in the tetracycline repressor change the inducer anhydrotetracycline to a corepressor. Nucleic Acids Research, 32(2), 842–847. https://doi.org/10.1093/nar/gkh200
- Köbbing, S., Blank, L. M., & Wierckx, N. (2020). Characterization of context-dependent effects on synthetic promoters. Frontiers in Bioengineering and Biotechnology, Volume 8 - 2020. https://doi.org/10.3389/fbioe.2020.00551
- Liu, X., Hu, Q., Yang, J., Huang, S., Wei, T., Chen, W., He, Y., Wang, D., Liu, Z., Wang, K., Gan, J., & Chen, H. (2019). Selective cadmium regulation mediated by a cooperative binding mechanism in CadR. Proceedings of the National Academy of Sciences, 116(41), 20398–20403. https://doi.org/10.1073/pnas.1908610116
- Moran, M. A., Satinsky, B., Gifford, S. M., Luo, H., Rivers, A., Chan, L.-K., Meng, J., Durham, B. P., Shen, C., Varaljay, V. A., Smith, C. B., Yager, P. L., & Hopkinson, B. M. (2012). Sizing up metatranscriptomics. The ISME Journal, 7(2), 237–243. https://doi.org/10.1038/ismej.2012.94
- NISWENDER, K. D., BLACKMAN, S. M., ROHDE, L., MAGNUSON, M. A., & PISTON, D. W. (1995). Quantitative imaging of green fluorescent protein in cultured cells: Comparison of microscopic techniques, use in fusion proteins and detection limits. Journal of Microscopy, 180(2), 109–116. https://doi.org/10.1111/j.1365-2818.1995.tb03665.x
- Scholz, O., Schubert, P., Kintrup, M., & Hillen, W. (2000). Tet Repressor Induction without Mg2+. Biochemistry, 39(35), 10914–10920. https://doi.org/10.1021/bi001018p
- Tuttle, L. M., Salis, H. M., Tomshine, J. R., & Kaznessis, Y. N. (2005). Model-driven designs of an oscillating gene network. Biophysical Journal, 89(6), 3873–3883. https://doi.org/10.1529/biophysj.105.064204
- Volkmer, B., & Heinemann, M. (2011). Condition-Dependent Cell Volume and Concentration of Escherichia coli to Facilitate Data Conversion for Systems Biology Modeling. PLoS ONE, 6(7), e23126. https://doi.org/10.1371/journal.pone.0023126
- Wade, J. T., & Struhl, K. (2008). The transition from transcriptional initiation to elongation. Current Opinion in Genetics & Development, 18(2), 130–136. https://doi.org/10.1016/j.gde.2007.12.008
- Yu, J. (2006). Probing Gene Expression in Live Cells, One Protein Molecule at a Time. Science, 311(5767), 1600–1603. https://doi.org/10.1126/science.1119623
Protein Structure Prediction for Kill Switch System
We discovered the citric acid promoter PodpH, which was used by a previous iGEM team that could be used for our citric-dependent kill switch. However, due to literature on the PopdH promoter and tctD-tctE two-component system being mostly from P. aeruginosa, we were unable to find any literature that describes the use of P. aeruginosa’s PodpH in our chassis P. putida. We tried NCBI BLAST but found no match of the PodpH promoter sequence in P. putida, and we were also unable to locate the native PodpH from P. putida.
tctD and tctE (Tricarboxylate Transport Regulatory Protein D and Tricarboxylate Transport Sensor Kinase E) regulates the promoter PodpH and its gene opdH in P. aeruginosa. Therefore, we decided to compare the protein structure of tctD and tctE from P. aeruginosa and P. putida using AlphaFold 3 and RaptorX, to determine whether PodpH from P. aeruginosa has cross-compatibility.
Abbreviations:
- PP - P. putida
- PA - P. aeruginosa
- tctD-PP - protein from P. putida
- tctD-PA - tctD protein from P. aeruginosa
- tctE-PP - tctE protein from P. putida
- tctE-PA - tctE protein from P. aeruginosa
The object is to find the structural similarities or differences of the tctD and tctE from PP and PA. We first predicted the protein structure for tctD and tctE from both bacteria using AlphaFold 3, then checked the similarities between the tctD and tctE from PP and PA, we used RaptorX for structure alignment. Raptor X structure Alignment employs a statistical learning method to design a new threading scoring function, aiming at better measuring the compatibility between a target sequence and a template structure (Peng, J., & Xu, J. (2011))
Results and Analysis
The quality of the structural alignment of the protein sequence is judged using the parameters provided by Raptor X as follows:
- LALI (Length of Alignment):
- It measure the number of aligned residues in a protein structure alignment
- So in our use case, it counts how many residues from one protein are matched to residues in another protein
- RMSD (Root Mean Square Deviation):
- Formula:
, unit (angstroms(Å))
- It measures the average distance between the aligned atoms in two or more protein structures
- It is quite sensitive to outliers which can inflate the value
- So in our use case, a lower RMSD will indicate a closer structural match.
- uGDT (Unnormalized Global Distance Test)
- It measures how well two protein structures align by counting how many residues are close in space after alignment
- It weights closer alignment more heavily
- GDT ( Global Distance Test)
- It compares alignments across proteins of different sizes using percentage
- In out use case, it provides a more refined view of alignment quality compared to RMSD by considering multiple distance threshold
- -> A higher GDT score indicates better structural similarity
- TMscore (Template Modeling Score) (ranges from 0 to 1)
- Formula:

- It is defined to assess the topological similarity of two protein structures (Jinrui Xu, Yang Zhang (2010))
- Where if the score is:
- Greater than 0.6: indicate a high likelihood (90%) that two proteins share the same fold
- Smaller than 0.4: Suggests a high likelihood (90%) that the proteins have different folds
Comparing tctD-PP and tctD-PA
Figure 1 below shows the structure of tctD-PP and tctD-PA, with pTM scores of 0.59 and 0.68, where most regions of the two proteins have plDDT scores of 70 and 90 or higher, indicating the protein structure is reasonably accurate for comparison.
Figure 1. tctD-PP (left) and tctD-PA (right) structures from AlphaFold 3
Figure 2 below shows the structure alignment attempt for tctD-PP and tctD-PA. The results show that the LALI is 123 out of 222 indicating only half is aligned, but RMSD is 1.07 < 2 which is an excellent alignment with highly similar structures. uGDT value is 120(54) which also shows good alignment with moderate to high similarity.
Based on visual observations, tctD is composed of two distinct parts, and we suspected the two proteins fold in different directions, which causes the LALI value to only be around half as only one part of the two is aligned. Therefore, we designed to compare one part at a time (calling them part 1A, 1B and 2A, and 2B of tctD-PP and tctD-PA respectively).
Results for comparing 1A and 2A indicates that the LALI value is 102 out of 102, so it is an exact match. The RMSD value is 0.26 < 2 indicates excellent alignment with highly similar structures, and the uGDT value is 102(100) showing nearly exact alignment. The TMscore is 0.995, which also shows very likely (90% chance) that 1A and 2A share the same fold. [Figure 3 below]
Results for comparing 1B and 2B indicates that the LALI value is 117 out of 120, so it is a near exact match. The RMSD value is 1.89 < 2 indicates excellent alignment with highly similar structures as well, and the uGDT value is 106(88) showing nearly exact alignment. The TMscore is 0.890, which also shows very likely (90% chance) that 1B and 2B shares the same fold. But compared to 1A and 2A, the structure is less similar, hence probably the reason why the score was lower for the whole tctD protein. [Figure 4 below]
Figure 2. RaptorX results showing alignment of tctD-PP and tctD-PA
Figure 3. RaptorX results comparing 1A and 2A of tctD-PP and tctD-PA
Figure 4. RaptorX results comparing 1B and 2B of tctD-PP and tctD-PA
Comparing tctE-PP and tctEPA
Figure 5 below shows the structure of tctE-PP and tctE-PA, with pTM scores of 0.68 and 0.69, where most regions of the two proteins have plDDT scores of 70 and 90 or higher, indicating the protein structure is highly accurate and is good for comparison.
Figure 5. tctE-PP (left) and tctE-PA (right) structures from AlphaFold 3
Length of the protein is slightly different at 460 and 463 for tctE-PP and tctE-PA respectively. The results for alignment shows the LALI is 459 out of 460-463 which indicates a near exact match for both proteins. The RMSD value 1.80 < 2, which shows excellent alignment with highly similar structures.
The uGDT value is 369(88) > 80 indicates near exact alignment. The TMscore is 0.947 which shows very likely (90% chance) that the protein shares the same fold. [Figure 6 below]
Figure 6.RaptorX results comparing tctE-PP and tctE-PA
The results from structure alignment comparing tctD and tctE of PP and PA shows that the two proteins share a very similar structure. As a result, we determined that the tctD and tctE should be able to regulate PodpH when used in our chassis P. putida. This aided the decision of circuit design to use PodpH and tctD-PP as the citric acid dependent mechanism in our kill switch design.
References
- Peng, J., & Xu, J. (2011). RaptorX: exploiting structure information for protein alignment by statistical inference. Proteins, 79 Suppl 10(Suppl 10), 161–171. https://doi.org/10.1002/prot.23175
- Jinrui Xu, Yang Zhang (2010), How significant is a protein structure similarity with TM-score = 0.5?, Bioinformatics, Volume 26, Issue 7, April 2010, Pages 889–895, https://doi.org/10.1093/bioinformatics/btq066