Iteration 1

Design

Chain exchange between coiled-coil (CC) domains is the core of our project, and it requires CC domain pairs with different binding affinities. Based on a previous study (Brechun, Arndt, & Woolley, 2019), we adopted the CC domains WinZip B1 and WinZip A2, where WinZip B1/B1 is a medium-strength interacting pair and WinZip A2/B1 is a strongly interacting pair.

We designated WinZip B1 as both the Target chain and Autoinhibitor chain, while WinZip A2 serves as the Displacer chain (Fig.1). Thus, after the Linker between the two WinZip B1 domains is cleaved by proteases, WinZip A2 will bind to B1. We applied this CC domain pair to our positive feedback loop. For further details, please refer to our Design.

Fig.1 Chain exchange diagram.

Build

We first established a set of ordinary differential equations (ODEs) to model the positive feedback pathway, simulating the dynamics of reconstituted TEV protease and relevant products before experimental validation.
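As an illustration of how such an ODE model can be set up, here is a minimal sketch of a two-species positive feedback loop integrated with a classical Runge-Kutta step. It is not our actual model: the species (locked substrate `L`, active protease `P`), the rate constants `k_input` and `k_cat`, and the `yield_fraction` parameter are all hypothetical placeholders chosen only to show the amplification behaviour.

```python
# Minimal sketch of a positive feedback loop; all species names and rate
# constants are hypothetical placeholders, not fitted values.

def derivatives(state, k_input, k_cat, yield_fraction):
    """Return (dL/dt, dP/dt) for locked substrate L and active protease P."""
    locked, protease = state
    # Cleavage is driven by the external input plus the protease already formed.
    cleavage = (k_input + k_cat * protease) * locked
    return (-cleavage, yield_fraction * cleavage)

def rk4_step(state, dt, *params):
    """Advance the state by one classical fourth-order Runge-Kutta step."""
    def shifted(base, slope, h):
        return tuple(b + h * s for b, s in zip(base, slope))
    k1 = derivatives(state, *params)
    k2 = derivatives(shifted(state, k1, dt / 2), *params)
    k3 = derivatives(shifted(state, k2, dt / 2), *params)
    k4 = derivatives(shifted(state, k3, dt), *params)
    return tuple(
        s + dt / 6 * (a + 2 * b + 2 * c + d)
        for s, a, b, c, d in zip(state, k1, k2, k3, k4)
    )

def simulate(locked0=1.0, k_input=0.01, k_cat=2.0, yield_fraction=0.5,
             dt=0.01, t_end=20.0):
    """Integrate the model and return a list of (time, L, P) tuples."""
    state = (locked0, 0.0)
    trajectory = [(0.0, *state)]
    steps = int(round(t_end / dt))
    for i in range(1, steps + 1):
        state = rk4_step(state, dt, k_input, k_cat, yield_fraction)
        trajectory.append((i * dt, *state))
    return trajectory

trajectory = simulate()
final_time, final_locked, final_protease = trajectory[-1]
```

In this toy model the protease accelerates its own production, so `P` follows a sigmoidal curve and plateaus at `yield_fraction * locked0` once the locked substrate is used up.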

Test

We conducted a response variable-time simulation of our positive feedback loop (Fig.2). The results indicate that our pathway achieved the anticipated positive feedback amplification effect, but the amount of reconstituted TEV protease was lower than expected.

Fig.2 Response variable-time simulation of our positive feedback loop.

Learn

Because both the Target chain and the Autoinhibitor chain in our pathway are WinZip B1, WinZip A2, acting as the Displacer, binds not only to the Target chain but also to the Autoinhibitor chain during the chain exchange reaction. In our simulation, this reduced the amount of reconstituted TEV protease by approximately 50%.
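The competition described here can be illustrated with a simple equilibrium partition. The sketch below assumes that the Displacer is limiting and that the Target and Autoinhibitor chains are present at the given free concentrations; the function name and the numerical values are hypothetical and only show the trend, not our fitted parameters.

```python
def fraction_on_target(kd_target, kd_autoinhibitor,
                       target_conc=1.0, autoinhibitor_conc=1.0):
    """Fraction of bound Displacer that sits on the Target chain.

    At equilibrium the ratio of the two complexes equals the ratio of
    (free partner concentration / Kd), so each partner gets that weight.
    Kd values and concentrations must share the same (arbitrary) unit.
    """
    weight_target = target_conc / kd_target
    weight_autoinhibitor = autoinhibitor_conc / kd_autoinhibitor
    return weight_target / (weight_target + weight_autoinhibitor)

# Identical chains (same Kd, same concentration): the Displacer splits evenly.
equal_affinity = fraction_on_target(kd_target=1.0, kd_autoinhibitor=1.0)

# Hypothetical Target variant that binds the Displacer ten times more tightly.
improved_target = fraction_on_target(kd_target=0.1, kd_autoinhibitor=1.0)
```

With identical Target and Autoinhibitor chains the Displacer is shared equally, which is consistent with the roughly 50% loss seen in the simulation. Lowering the Kd of the Target chain alone shifts the partition towards productive binding.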

Iteration 2

Design

To enhance the efficacy of the Input 1 component, specifically to reconstitute greater quantities of TEV protease, we aimed to conduct directed evolution on the WinZip B1 target chain to improve its affinity for WinZip A2.

Build

Our Model analyzed the energy distribution between residues of WinZip A2 and WinZip B1 (Fig.3). Residues with lower (more negative) ΔG values contribute more strongly to stabilizing the binding, while higher values indicate weaker or destabilizing effects.

Fig.3 Per-residue decomposition of ΔG calculated by MMPBSA using the normal PB model. A: WinZip A2; B: WinZip B1.

As shown in Fig.3, residues Val1 and Glu30 contribute unfavorably to the binding of the two leucine zippers. We therefore considered introducing a V1L mutation to increase the hydrophobicity of the hydrophobic region, together with an E30Q mutation to reduce electrostatic repulsion, thereby lowering the binding free energy and strengthening the interaction.
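To give a sense of scale for such changes, the standard thermodynamic relation Kd_mut / Kd_wt = exp(ΔΔG / RT) converts a change in binding free energy into a fold change in dissociation constant. The sketch below applies it; the ΔΔG value used in the example is hypothetical, and MMPBSA energies are approximate, so the output should be read as an order-of-magnitude estimate only.

```python
import math

GAS_CONSTANT_KCAL = 1.987204e-3  # kcal / (mol * K)

def kd_fold_change(ddg_kcal_per_mol, temperature_k=298.15):
    """Return Kd_mutant / Kd_wildtype for a binding free energy change.

    A negative ddG (more favourable binding) gives a ratio below 1,
    meaning a lower Kd and therefore tighter binding.
    """
    return math.exp(ddg_kcal_per_mol / (GAS_CONSTANT_KCAL * temperature_k))

# Hypothetical example: a mutation that improves binding by 1 kcal/mol.
fold = kd_fold_change(-1.0)
```

At 25 °C, RT is about 0.59 kcal/mol, so each 1 kcal/mol of stabilization lowers Kd by roughly a factor of five.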

Test

Following directed evolution, we conducted response variable-time simulations under conditions where the Target chain exhibited increased affinity toward the Displacer chain (Fig.4). The total amount of reconstituted TEV protease was clearly greater than in Iteration 1.

Fig.4 Response variable-time simulation of our positive feedback loop with directed evolution.

Learn

The results of our simulation confirmed that enhancing the binding affinity between WinZip A2 and the evolved WinZip B1 variant effectively increased the amount of reconstituted TEV protease. Future work will focus on experimental validation of the chain exchange efficiency.

Iteration 1

Design

To verify whether our split luciferase fragments could bind and restore activity in response to the interaction of their fused coiled-coil (CC) domains, AP4(C')/P3(C), we designed two fusion proteins, nLuc_AP4(C') and P3(C)_cLuc (Fig.5).

Fig.5 Principle of the split luciferase complementation assay.

Build

We commissioned a biotechnology company to synthesize the nLuc_AP4(C')_pET-15b and P3(C)_cLuc_pET-28a+ vectors and then co-transformed both plasmids into E. coli BL21(DE3) competent cells. Selection with ampicillin and kanamycin ensured that only cells carrying both plasmids survived. Colony PCR was then used to verify that both plasmids had been successfully transformed.

Test

IPTG was added to induce the co-expression of nLuc_AP4(C') and P3(C)_cLuc. We then attempted to detect luminescent signals at 590 nm using the Bacterial Firefly Luciferase Reporter Gene Assay Kit (Beyotime, Bac-Lumi™). However, we did not detect significant luminescence, which might indicate poor intracellular binding of nLuc_AP4(C') and P3(C)_cLuc.

Learn

We could not determine whether the poor signal was due to a problem with the assay kit or to the failure of nLuc_AP4(C') and P3(C)_cLuc to reconstitute an active luciferase in the cells. Therefore, we decided to construct a plasmid expressing intact luciferase as a positive control. This control not only verifies the luminescence detection method but also allows a direct comparison of the activity of reconstituted and intact luciferase.

Iteration 2

Design

We constructed a plasmid that expresses intact luciferase as a positive control based on known sequences of nLuc and cLuc (Fig.6).

Fig.6 Vector Luc_pET-28a+.

Build

We amplified the nLuc and cLuc sequences from nLuc_AP4(C')_pET-15b and P3(C)_cLuc_pET-28a+ by PCR and joined them by homologous recombination to construct the Luc_pET-28a+ plasmid. We then transformed the plasmid into E. coli BL21(DE3) competent cells and verified successful transformation by colony PCR.

Test

IPTG was used to induce the expression of intact luciferase, and luminescent signals at 590 nm were measured using the Bacterial Firefly Luciferase Reporter Gene Assay Kit (Beyotime, Bac-Lumi™). This time, significant luminescence was detected, confirming the reliability of the assay kit (Fig.7).

Fig.7 Luminescence of complete Luc.

Learn

Our successful positive control ruled out issues with the assay kit, indicating that nLuc and cLuc may have difficulty binding under the assay reagent conditions, or that their complex may lack intracellular activity.

Since co-expression in cells makes it hard to control the relative expression levels of different proteins, and in an attempt to eliminate interference from the complex intracellular environment on their binding, we decided to purify these two fusion proteins (nLuc_AP4(C') and P3(C)_cLuc) for extracellular experiments.

In addition, although the assay kit protocol recommends incubating for 5 minutes after adding the assay reagent before measuring luminescence, the BNU-China team reminded us during our discussions that luciferase luminescence decays rapidly. Therefore, in our subsequent experiments, luminescence measurements were started immediately after the assay reagent was added and continued for 10 minutes.

Iteration 3

Design

Since the 6×His-tag is retained on our nLuc_AP4(C')_pET-15b and P3(C)_cLuc_pET-28a+ plasmids, we can purify the two proteins using Ni-NTA column affinity chromatography.

Build

The nLuc_AP4(C')_pET-15b and P3(C)_cLuc_pET-28a+ vectors were separately transformed into E. coli BL21(DE3) competent cells, and protein expression was induced with IPTG. The cells were lysed with an ultrasonic cell disruptor, and the proteins were purified on pre-packed Ni-NTA gravity columns with gradient imidazole elution. The purified proteins were then dialyzed overnight against a buffer containing 250 mM NaCl and 20 mM Tris-HCl (pH 7.5) to remove imidazole.

Test

Purified nLuc_AP4(C') and P3(C)_cLuc proteins at different concentrations were added to a buffer containing 250 mM NaCl and 20 mM Tris-HCl (pH 7.5). After co-incubation for 1 hour, the Bacterial Firefly Luciferase Reporter Gene Assay Kit (Beyotime, Bac-Lumi™) was used to measure the luminescent signals at 590 nm (Fig.8).

Fig.8 The luminescence signal decays with time.

Learn

Measurements over the time course showed that the luminescent signal decayed very rapidly, with no significant luminescence observed at the 5-minute mark. This indicates that the earlier intracellular split luciferase luminescence assay likely failed not because the split luciferase fragments failed to bind or the reconstituted luciferase lacked intracellular activity, but because the luminescent signal had already decayed to the baseline within 5 minutes of incubation.
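One way to quantify how quickly the signal disappears is to fit a single-exponential decay to the time course and report its half-life. The sketch below does this with a log-linear least-squares fit. It assumes background-subtracted, strictly positive readings and a single-exponential decay, which we have not verified for our data; the time points and signal values are synthetic and are not our measurements.

```python
import math

def fit_exponential_decay(times, signals):
    """Least-squares fit of ln(signal) = ln(s0) - k * t; returns (s0, k)."""
    logs = [math.log(s) for s in signals]
    n = len(times)
    mean_t = sum(times) / n
    mean_y = sum(logs) / n
    sxx = sum((t - mean_t) ** 2 for t in times)
    sxy = sum((t - mean_t) * (y - mean_y) for t, y in zip(times, logs))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_t
    return math.exp(intercept), -slope

# Synthetic time course (minutes, arbitrary luminescence units).
times_min = [0.0, 0.5, 1.0, 2.0, 3.0]
signals = [1000.0 * math.exp(-1.2 * t) for t in times_min]

s0, k = fit_exponential_decay(times_min, signals)
half_life_min = math.log(2) / k
fraction_left_at_5_min = math.exp(-k * 5.0)
```

A fitted half-life well below one minute would mean that less than 1% of the initial signal remains after a 5-minute incubation, which is the situation described above.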

Measurements across a protein concentration gradient showed that protein concentration strongly affected the initial luminescence intensity: the split luciferase fragments at a concentration of 10 μM exhibited remarkably strong luminescence at 590 nm. This not only confirms that our reconstituted luciferase is a reliable output in our cell-free system, but also demonstrates that we can use split luciferase as a characterization tool to verify whether other split proteins in our system bind successfully.

Iteration 1

Design

We aimed to verify whether our tri-split GFP (comprising GFP1-9, GFP10, and GFP11) could be reconstituted through rearrangement mediated by CC domains fused to GFP10 and GFP11. Initially, we designed the GFP_pET-28a+ vector to co-express the three proteins: GFP1-9, GFP10_AP4(C'), and P3(C)_GFP11 (Fig.9).

Fig.9 Vector GFP_pET-28a+.

Build

The designed fragments were synthesized by a biotechnology company and assembled into a complete plasmid via homologous recombination. The product was then transformed into E. coli TOP10 competent cells for plasmid amplification and extraction.

Test

We sent the open reading frame (ORF) region of the constructed plasmid to the biotechnology company for sequencing. However, after several independent construction trials, sequencing results consistently showed that the segment between the double terminators (i.e., the sequence encoding GFP1-9) was lost (Fig.10).

Fig.10 Sequencing results: missing segments encoding GFP1-9.

Learn

We analyzed the cause of the sequence loss. Notably, two identical double terminator sequences were designed on our plasmid. Such long repetitive sequences tend to trigger homologous recombination during replication in E. coli, which results in the deletion of the intervening segment. Therefore, when designing plasmids, the inclusion of multiple long repetitive sequences (such as double terminators) on a single plasmid should be avoided. We needed to design an additional plasmid to ensure the expression of GFP1-9.

We decided to retain the plasmid without sequence encoding GFP1-9 and renamed it GFP_Final_pET-28a+ (Fig.11).

Fig.11 Vector GFP_Final_pET-28a+.

Iteration 2

Design

To co-express GFP1-9 with GFP_Final_pET-28a+, we constructed a separate plasmid, GFP1-9_pET-15b, dedicated to GFP1-9 expression. GFP1-9_pET-15b carries a different antibiotic resistance marker from GFP_Final_pET-28a+, enabling the stable coexistence and co-expression of both plasmids in a single E. coli cell.

Build

We commissioned the biotechnology company to synthesize the GFP1-9_pET-15b plasmid. GFP_Final_pET-28a+ and GFP1-9_pET-15b were co-transformed into BL21(DE3) competent cells. To verify the successful transformation of both plasmids, we conducted colony PCR on the resulting bacterial colonies.

Test

The co-transformed E. coli cells were cultured in a microplate (ELISA) reader with continuous shaking at 25°C and induced with 0.3 mM IPTG and 0.1% rhamnose (Rha). The culture was monitored for 16 hours, with fluorescence intensity and OD₆₀₀ measured every 10 minutes. After normalizing the fluorescence intensity to OD₆₀₀ (to account for differences in cell density), we plotted the normalized fluorescence intensity over time (Fig.12).
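The normalization step can be sketched as follows. This is a minimal illustration, not our analysis script: the blank-subtraction arguments and the `min_od` cut-off (used to skip readings where the cell density is too low to divide by reliably) are assumptions, and the numbers in the example are made up.

```python
def normalize_fluorescence(fluorescence, od600, blank_fluorescence=0.0,
                           blank_od600=0.0, min_od=0.01):
    """Return blank-corrected fluorescence divided by blank-corrected OD600.

    Time points whose corrected OD600 is below min_od yield None, because
    dividing by a near-zero density would only amplify noise.
    """
    normalized = []
    for f, od in zip(fluorescence, od600):
        corrected_od = od - blank_od600
        if corrected_od < min_od:
            normalized.append(None)
        else:
            normalized.append((f - blank_fluorescence) / corrected_od)
    return normalized

# Hypothetical readings at three time points (arbitrary units).
example = normalize_fluorescence([10.0, 200.0, 900.0], [0.005, 0.2, 0.5])
```

Plotting the resulting values against time gives a curve in which an increase reflects more fluorescence per cell rather than simply more cells.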

Fig.12 Fluorescence Intensity of reconstituted GFP.

Learn

Fluorescence intensity analysis revealed that a significant fluorescence signal was detected as early as 3 hours after induction. This result confirms the feasibility of using split GFP as the output module for our system.

In E. coli cells, the intracellular concentration of expressed proteins rises gradually from zero after induction, a time-dependent accumulation inherent to in vivo protein expression. In vitro experiments do not require this gradual expression step. We therefore infer that a shorter incubation period should be sufficient to generate a detectable fluorescence signal in in vitro assays. This experiment also demonstrated that the fluorescence intensity of split GFP can serve as a reliable indicator of the binding of other split proteins.

To maximize the predictive accuracy of the Seq2Affinity-CC model for estimating the binding strength of coiled-coil (CC) domains based on their amino acid sequences, we conducted an iterative optimization process under the constraints of the available dataset.

Iteration 1

Design

Given the high-dimensional, small-sample-size characteristics of our dataset, we recognized that complex models were prone to overfitting. We therefore screened a set of lightweight models suitable for such scenarios, including random forests, sparse support vector machines (SVM), Lasso regression, and Ridge regression.

Build

The dataset was split into training and test sets. Each regression model was built, trained, and evaluated by calculating its R-squared (R²) score on the test set. The hyperparameters of the random forest were fully tuned.
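For reference, the evaluation procedure can be sketched with two small helpers. This is an illustrative standard-library version, not our training code; the function names, the 20% test fraction, and the fixed seed are assumptions.

```python
import random

def train_test_split(rows, test_fraction=0.2, seed=0):
    """Shuffle a copy of rows and return (train_rows, test_rows)."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, int(round(len(shuffled) * test_fraction)))
    return shuffled[n_test:], shuffled[:n_test]

def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_residual / SS_total."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

train_rows, test_rows = train_test_split(list(range(10)))
```

Note that R² on a held-out test set can be negative: a model that predicts worse than simply returning the mean of the test targets scores below zero, which is a common symptom of overfitting on small datasets.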

Test

The comparative R² scores for all candidate models are shown in Fig.13.

Fig.13 The R² scores of different models.

Learn

The results indicated that ridge and lasso regression outperformed random forests and the sparse support vector machine (SVM) on this dataset. The key reason is that lasso and ridge incorporate regularization, which penalizes large coefficients (and, in the case of lasso, sets some of them to zero) and thereby mitigates overfitting. (Principles are provided in the [Model part].) Based on this finding, we selected lasso and ridge regression for further optimization.

Iteration 2

Design

With the model types narrowed down, we pursued further improvement through careful hyperparameter tuning and dataset refinement.

We also followed the suggestions from HP to explore transforming the regression model into a discriminative (classification) model for potentially higher accuracy. Additionally, we considered incorporating non-binding data into the dataset to expand its size and utility for the discriminative approach.

Build

We implemented the following steps:

Re-examined the dataset to remove entries with high uncertainty due to low experimental accuracy.
Systematically screened optimal hyperparameters for the lasso and ridge models using grid search.
Implemented and tested a partial least squares (PLS) regression model and a discriminative model.
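The grid search step above can be sketched as a loop over candidate regularization strengths, each scored by k-fold cross-validation. To stay self-contained, the sketch uses a deliberately simplified one-feature ridge model without an intercept; the function names, the fold count, and the candidate grid are hypothetical and do not reproduce our actual pipeline.

```python
def ridge_fit_1d(xs, ys, alpha):
    """Closed-form ridge slope for a single feature and no intercept."""
    return sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + alpha)

def k_fold_indices(n, k):
    """Yield (train_indices, validation_indices) for k contiguous folds."""
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        val = list(range(start, start + size))
        train = [i for i in range(n) if i < start or i >= start + size]
        yield train, val
        start += size

def cv_mse(xs, ys, alpha, k=5):
    """Mean squared validation error of the ridge model over k folds."""
    errors = []
    for train, val in k_fold_indices(len(xs), k):
        w = ridge_fit_1d([xs[i] for i in train], [ys[i] for i in train], alpha)
        errors.extend((ys[i] - w * xs[i]) ** 2 for i in val)
    return sum(errors) / len(errors)

def grid_search(xs, ys, alphas, k=5):
    """Return (best_alpha, scores) where scores maps alpha to CV error."""
    scores = {alpha: cv_mse(xs, ys, alpha, k) for alpha in alphas}
    best_alpha = min(scores, key=scores.get)
    return best_alpha, scores
```

For example, `grid_search(xs, ys, [0.01, 0.1, 1.0, 10.0], k=5)` returns the candidate with the lowest cross-validated error together with the full score table. With very few samples the ranking can change from one fold assignment to another, so the scores are best compared across several splits.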

Test

After dataset refinement, hyperparameter adjustment, and model development, we evaluated the models based on R² scores for regression and classification accuracy for the discriminative model, followed by a comprehensive comparison.

Learn

The key findings were:

The lasso model reached a higher best-case predictive performance than the ridge model.
The PLS model performance was also inferior to that of the lasso model.
The discriminative model improved classification accuracy to some extent; however, its performance was still significantly constrained by the small dataset size, even after the inclusion of non-binding peptide pairs.

Consequently, we selected the lasso regression model as our final predictor, as it achieved a qualitative to semi-quantitative prediction level suitable for our application.

In the process of improving the model, we realized that the superior performance of lasso regression in this situation can be fundamentally attributed to its mechanism for handling high-dimensional, small-sample data: Lasso tends to produce sparse solutions by driving some coefficients to zero, effectively performing variable selection. This occurs because its constraint region has sharp corners (e.g., diamond-shaped in 2D), and the optimal solution often lies at one of these corners. In contrast, ridge regression, with its smooth spherical constraint region, does not force coefficients to zero but generally provides more stable estimates. For our specific data characteristics, Lasso's feature selection capability simplifies the model more effectively, enabling it to better identify and learn the key features within the dataset.
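The difference between the two penalties is easiest to see in the textbook special case of an orthonormal design, where both estimators have closed forms in terms of the ordinary least-squares coefficient. The sketch below assumes that special case and the objective scaling ½‖y − Xβ‖² + λ‖β‖₁ for lasso and ½‖y − Xβ‖² + (λ/2)‖β‖² for ridge; it illustrates the mechanism and is not how our model was fitted.

```python
def lasso_shrink(coef, lam):
    """Soft-thresholding: shrink towards zero by lam, clipping at zero."""
    magnitude = max(abs(coef) - lam, 0.0)
    return magnitude if coef >= 0 else -magnitude

def ridge_shrink(coef, lam):
    """Proportional shrinkage: scale by 1 / (1 + lam), never exactly zero."""
    return coef / (1.0 + lam)

# Hypothetical least-squares coefficients: two strong features, two weak ones.
ols_coefficients = [2.0, -1.5, 0.3, -0.1]
lasso_coefficients = [lasso_shrink(c, 0.5) for c in ols_coefficients]
ridge_coefficients = [ridge_shrink(c, 0.5) for c in ols_coefficients]
```

Here lasso sets the two weak coefficients to zero and keeps the two strong ones, while ridge retains all four at reduced magnitude. This is the variable selection behaviour that favours lasso when many features carry little signal.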

We also concluded that a substantial improvement in predictive performance is ultimately dependent on expanding the dataset. As the dataset grows, our engineering framework will allow us to re-evaluate and select even more suitable and powerful models.