In Silico Validation of a Modified nirBD Protein Construct Using AlphaFold2

Introduction

Our primary objective is to integrate the nirBD, nrfABCDEFG and nrfHAIJ genes, which are responsible for nitrite reductase activity, into the genome of P. putida KT2440. However, because of the limited availability of restriction sites, an unintended extra stretch of sequence is carried over into the translated protein.

This is a potential risk: once translated, the extraneous DNA adds amino acids that could significantly alter the folding and structure of the resulting protein and thereby impair its function.

Therefore, before investing our resources in wet-lab experiments, we decided to predict how the modified protein would compare, at least structurally, to its native counterpart.

For this purpose, we used AlphaFold2, a deep learning model developed by Google DeepMind that predicts the 3D structure of a protein from its amino acid sequence.

Methodology

Our workflow can be divided into three stages:

Step 1: Sequence Preparation and Translation

We obtained the sequences of the three operons from the following organisms:

nirBD: Pseudomonas putida KT2440
nrfHAIJ: Wolinella succinogenes DSM 1740
nrfABCDEFG: Escherichia coli K-12

We used an interface to optimize the sequence for expression in E. coli.

We checked these constructs in SnapGene to validate the translation.

Restriction sites were to be added at the ends to allow insertion into plasmids from the pSEVA backbone.

Two options were available for the first restriction site, EcoRI and AvrII.

Using either of the sites would mean some extra nucleotides would be present between the start codon on the plasmid and the inserted sequence.

On translation, this would add a few extra amino acids to the protein, and we wanted to understand how they would affect its folding and, in turn, its enzymatic activity.
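The effect of the extra nucleotides on the translated product can be illustrated with a small sketch. It assumes a hypothetical layout in which the restriction site sits directly after the plasmid's start codon and in frame with the insert; the real pSEVA spacer sequences are not reproduced here, and `TOY_INSERT` is a placeholder, not one of our genes. Only the EcoRI (GAATTC) and AvrII (CCTAGG) recognition sequences are real.

```python
# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def translate(dna: str) -> str:
    """Translate an in-frame DNA sequence, stopping at the first stop codon."""
    dna = dna.upper()
    protein = []
    # Ignore any trailing bases that do not form a complete codon.
    for pos in range(0, len(dna) - len(dna) % 3, 3):
        aa = CODON_TABLE[dna[pos:pos + 3]]
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)


# Hypothetical construct layout: start codon + restriction site + insert.
START = "ATG"
SITES = {"EcoRI": "GAATTC", "AvrII": "CCTAGG"}  # real recognition sequences
TOY_INSERT = "AAAGGTTAA"  # placeholder ORF fragment (assumption)


def construct_protein(site_name: str) -> str:
    """Translate the toy construct built with the given restriction site."""
    return translate(START + SITES[site_name] + TOY_INSERT)


for name in SITES:
    print(name, construct_protein(name))
# EcoRI MEFKG   (site adds Glu-Phe after the start Met)
# AvrII MPRKG   (site adds Pro-Arg after the start Met)
```

In the actual constructs the number of extra residues also depends on the spacer between the start codon and the site, so the counts differ from this toy case.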

Step 2: Protein Structure Prediction

Prediction of the folded structure was the most computationally intensive part of this project. AlphaFold2 first generates a Multiple Sequence Alignment (MSA) by searching sequence databases totalling about 2.5 TB, which lets it identify patterns across related sequences that inform the structure. This is followed by a GPU-intensive step in which the neural network predicts the coordinates of each atom in 3D space.

Running the whole process locally is not very practical, so we tried out two different approaches:

1. ColabFold: An open-source project that integrates AlphaFold2 with Google Colab, Google’s cloud computing service.

While it may work for short amino acid chains, the hardware limitations of the free tier (12 GB of RAM and a weekly usage limit) would have been a major bottleneck for our project.

2. LocalColabFold: This is the option we went ahead with. Unlike ColabFold, it runs the neural network for structure prediction on local hardware, while the MSA generation step that precedes prediction is outsourced to a public server.

This significantly reduces local storage and computational requirements. We ran our predictions on high-performance systems in our Under Graduate Computer Laboratory (UGCL). The entire setup and modelling process took around 9 to 10 hours.
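A LocalColabFold run of this kind could be prepared roughly as follows. This is a sketch rather than our exact invocation: the record names follow the labels used in our results tables, the sequences are placeholders, and the file and directory names are hypothetical. The only tool-specific assumption is that LocalColabFold provides the `colabfold_batch` command, which takes an input FASTA file and an output directory.

```python
def to_fasta(records: dict) -> str:
    """Format {name: sequence} pairs as FASTA text."""
    lines = []
    for name, seq in records.items():
        lines.append(f">{name}")
        lines.append(seq)
    return "\n".join(lines) + "\n"


def colabfold_command(fasta_path: str, out_dir: str) -> list:
    """Build the argument list for a basic colabfold_batch run."""
    return ["colabfold_batch", str(fasta_path), str(out_dir)]


# Placeholder sequences (assumption); real inputs are the full proteins.
records = {
    "nirB_nat": "MKV",
    "nirB_ec": "MEFKV",
    "nirB_av": "MPRKV",
}
fasta_text = to_fasta(records)
cmd = colabfold_command("constructs.fasta", "predictions")
print(" ".join(cmd))

# To launch the prediction on a machine with LocalColabFold installed:
#   from pathlib import Path
#   import subprocess
#   Path("constructs.fasta").write_text(fasta_text)
#   subprocess.run(cmd, check=True)
```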

Step 3: Structural Comparison and Analysis

The predicted models are output as .pdb files, ranked by their pLDDT scores (a measure of how confident the neural network is in the predicted structure), along with confidence plots.

Predicted Aligned Error (PAE) plots of the native NirB protein. Each plot shows the model’s confidence in the position of one amino acid relative to another. Blue → high certainty, Red → low certainty.

The ranks are decided by pLDDT (predicted Local Distance Difference Test), which is another measure of the model’s confidence, here in the spatial coordinates of each part of the protein.

For native NirB, the rank-wise pLDDT values were 92, 90.8, 90.3, 90.1 and 89.5.

pLDDT > 90: very high confidence
70 < pLDDT < 90: confident
50 < pLDDT < 70: low confidence
pLDDT < 50: very low confidence
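The bands above can be applied mechanically to the reported scores. The sketch below does this for the native NirB values; how to treat scores exactly on a boundary (90, 70 or 50) is not specified by the bands, so the choice made here is an assumption.

```python
def plddt_band(score: float) -> str:
    """Map a pLDDT score (0 to 100) to its confidence band."""
    if score > 90:
        return "very high"
    if score >= 70:  # boundary handling is an assumption
        return "confident"
    if score >= 50:  # boundary handling is an assumption
        return "low"
    return "very low"


# Rank-wise pLDDT values reported for native NirB.
NIRB_NATIVE = [92.0, 90.8, 90.3, 90.1, 89.5]

for rank, score in enumerate(NIRB_NATIVE, start=1):
    print(f"Rank {rank}: {score} -> {plddt_band(score)}")
# Ranks 1 to 4 fall in "very high"; Rank 5 (89.5) falls in "confident".
```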

As one can observe, all the predictions fall in the very high or confident range. These predicted structures of the native proteins are then compared with the modified versions.

There are multiple ways to do this; we chose to use RMSD and TM-score:

Root Mean Square Deviation (RMSD): The square root of the mean squared distance between corresponding atoms of two superimposed structures, reported in Å. A lower RMSD value indicates greater similarity.
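As a minimal illustration of the definition, the sketch below computes RMSD for two coordinate sets that are assumed to be already superimposed and listed in matching atom order. It does not perform the superposition itself, which structural alignment tools handle before reporting RMSD, and the coordinates are made up for the example.

```python
import math


def rmsd(coords_a, coords_b) -> float:
    """RMSD between two equally long lists of (x, y, z) coordinates.

    Assumes the structures are already superimposed and that atoms
    at the same index correspond to each other.
    """
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("coordinate lists must be non-empty and equal in length")
    total = 0.0
    for (xa, ya, za), (xb, yb, zb) in zip(coords_a, coords_b):
        total += (xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
    return math.sqrt(total / len(coords_a))


# Toy coordinates in Å (made up): one atom matches, one is displaced by 2 Å.
reference = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
modified = [(0.0, 0.0, 0.0), (1.0, 0.0, 2.0)]
print(round(rmsd(reference, modified), 3))  # 1.414
```

Because the deviations are squared before averaging, a few badly placed atoms raise RMSD more than many slightly shifted ones.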

TM-score: A length-normalized metric of structural similarity on a scale of 0 to 1, where 1 indicates a perfect match. A score > 0.5 generally signifies that the two proteins share the same global fold.
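For reference, the standard TM-score definition sums 1 / (1 + (d_i / d0)^2) over aligned residue pairs and divides by the length of the reference protein, with d0 = 1.24 (L - 15)^(1/3) - 1.8 Å. The sketch below evaluates that expression for one given superposition; real tools also search for the superposition that maximizes the score, which is not done here. The 0.5 Å floor on d0 is an assumption borrowed from common implementations.

```python
def tm_d0(l_ref: int) -> float:
    """Length-dependent distance scale d0 (in Å) for a reference of l_ref residues."""
    if l_ref <= 15:
        raise ValueError("the d0 formula is intended for chains longer than 15 residues")
    d0 = 1.24 * (l_ref - 15) ** (1.0 / 3.0) - 1.8
    return max(d0, 0.5)  # floor for very short chains (assumption)


def tm_score(distances, l_ref: int) -> float:
    """TM-score for one superposition.

    distances: per-residue distances (Å) between aligned residue pairs.
    l_ref: number of residues in the reference (here, native) protein.
    """
    d0 = tm_d0(l_ref)
    return sum(1.0 / (1.0 + (d / d0) ** 2) for d in distances) / l_ref


# Toy cases for a 100-residue reference (made-up distances).
print(tm_score([0.0] * 100, 100))         # 1.0, identical structures
print(tm_score([tm_d0(100)] * 100, 100))  # 0.5, every pair exactly d0 apart
```

Dividing by the reference length rather than the number of aligned pairs is what makes the score comparable between proteins of different sizes.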

Native and/or Rank-1 proteins are taken as the reference in all measurements; a reference compared with itself has an RMSD of 0 and a TM-score of 1.

pLDDT Scores
Protein	Rank 1	Rank 2	Rank 3	Rank 4	Rank 5
nirB_av	91.8	90.7	90.1	89.9	89.1
nirB_ec	91.1	90.1	89.7	89.5	88.4
nirB_nat	92.0	90.8	90.3	90.1	89.5
nrfA_av	92.2	92.1	92.1	92.0	91.9
nrfA_ec	91.4	91.3	91.3	91.2	91.0
nrfA_nat	92.7	92.6	92.6	92.6	92.2
nrfH_av	84.6	83.5	83.1	82.1	81.9
nrfH_ec	85.1	82.8	82.2	82.2	81.8
nrfH_nat	85.9	84.7	84.4	83.2	82.3

TM Score (Rank-1)
	Native	EcoRI	AvrII
nrfA	1	0.96	0.95
nirB	1	0.95	0.97
nrfH	1	0.86	0.85

TM Score (Rank-2)
	Native	EcoRI	AvrII
nrfA	1	0.95	0.97
nirB	1	0.81	0.84
nrfH	1	0.86	0.85

TM Score (Rank-3)
	Native	EcoRI	AvrII
nrfA	1	0.94	0.95
nirB	1	0.53	0.53
nrfH	1	0.90	0.90

RMSD (Å) Rank-1
	EcoRI	AvrII
nrfA	2.32	1.33
nirB	2.32	1.80
nrfH	2.07	2.12

RMSD (Å) Rank-2
	EcoRI	AvrII
nrfA	2.06	1.78
nirB	5.04	4.42
nrfH	2.45	2.67

RMSD (Å) Rank-3
	EcoRI	AvrII
nrfA	1.53	1.64
nirB	4.06	4.02
nrfH	1.71	1.73
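The Rank-1 tables above can be tallied against the thresholds used in the discussion (TM-score above 0.90, TM-score above 0.5 and RMSD below 2.0 Å). The values are copied from the tables; the helper name `count_passing` is only illustrative.

```python
# Rank-1 values copied from the tables above.
RANK1_TM = {
    "nrfA": {"EcoRI": 0.96, "AvrII": 0.95},
    "nirB": {"EcoRI": 0.95, "AvrII": 0.97},
    "nrfH": {"EcoRI": 0.86, "AvrII": 0.85},
}
RANK1_RMSD = {
    "nrfA": {"EcoRI": 2.32, "AvrII": 1.33},
    "nirB": {"EcoRI": 2.32, "AvrII": 1.80},
    "nrfH": {"EcoRI": 2.07, "AvrII": 2.12},
}


def count_passing(table, predicate) -> int:
    """Count protein/site entries whose value satisfies the predicate."""
    return sum(
        1
        for per_site in table.values()
        for value in per_site.values()
        if predicate(value)
    )


total = sum(len(per_site) for per_site in RANK1_TM.values())
print("TM-score > 0.90:", count_passing(RANK1_TM, lambda v: v > 0.90), "of", total)
print("TM-score > 0.50:", count_passing(RANK1_TM, lambda v: v > 0.50), "of", total)
print("RMSD < 2.0 Å:", count_passing(RANK1_RMSD, lambda v: v < 2.0), "of", total)
# TM-score > 0.90: 4 of 6
# TM-score > 0.50: 6 of 6
# RMSD < 2.0 Å: 2 of 6
```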

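The TM-scores are above 0.90 for most Rank-1 structures (NrfA and NirB with either site), indicating an almost identical global fold. NrfH is lower, at 0.85 to 0.86, but still well above the 0.5 threshold. The lower-ranked NirB models diverge more, with TM-scores of 0.81 to 0.84 at Rank-2 and 0.53 at Rank-3.

Rank-1 RMSD values range from 1.33 to 2.32 Å, at or close to 2.0 Å in every case, indicating mostly minor local conformational changes.

These are likely to lie in intrinsically disordered regions away from the protein’s active site.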

NirB Native-AvrII-EcoRI overlaid

NrfA Native-AvrII-EcoRI overlaid

NrfH Native-AvrII-EcoRI overlaid

Conclusion

The in silico modelling indicates that the extra amino acids resulting from our cloning strategy cause little to no change to the overall predicted structure of the NirB, NrfA and NrfH proteins. The high degree of structural similarity between the native and modified models gives us confidence that the engineered proteins will retain their intended biological function. These results serve as a crucial validation step in justifying our progression to the wet lab for experimental characterization.

Based on these results, we moved ahead with EcoRI as our restriction site, since the enzyme is well characterized and commonly used for digestion, even though it introduces a few more extra amino acids than AvrII.