Toehold Switch | DUT-China

1. Background & Motivation

As an emerging RNA regulatory element, the toehold switch has shown great potential in biosensing and gene regulation. This module aims to develop a highly specific and efficient toehold switch for miR-21 and to explore new approaches to toehold switch design.

2. Problem Discovery

Limitations of the initial design

Using the TrigGate platform developed by the TAU team, we initially designed and generated 25 toehold switch sequences. Wet-lab verification of the first batch showed that the generated switches had poor stability and high fluorescence in the OFF state.

Figure. Verification of toehold-regulated protein expression, shown as the fluorescence intensity measured by a microplate reader at a wavelength of 560 nm. Groups C1, E1, I1, K1, and T1 were transfected only with mRNA carrying the toehold switch sequence and the Luciferase sequence; groups C2, E2, I2, K2, and T2 were transfected with mRNA and miR-21 analogs; groups C3, E3, I3, K3, and T3 were transfected with mRNA and miR-21 mimics. The toehold sequence used in groups T1, T2, and T3 was taken from the literature; the others were generated by the model. Statistical significance was calculated by ordinary one-way ANOVA with Bonferroni's method for multiple comparisons (* p<0.05; ** p<0.01; *** p<0.001; **** p<0.0001; mean ± SD).

Our analysis identified the following causes:

Problem category	Specific reasons
Structural stability	GC content is low and free energy is too high
Nonspecific activation	Pseudoknot structures are not considered, and the tendency to form dimers is strong
Conformational competition	The Kozak sequence is not fully paired and is prone to unwinding
Missing parameters	No specific concentration was set because the required parameters were lacking

3. Genetic Algorithm Engine

We need a switch with microRNA specificity and a high ON/OFF ratio. The first batch of wet-lab results showed that the sequences generated by the model were not ideal and did not meet the requirements of our project, so we carried out a second round of design. TrigGate generates switches with low diversity and is inconvenient to use. After studying the platform in depth, we found the cause:

The core module of the platform uses a fixed set of at most seven random numbers (seeds) during generation to determine the final sequences. In other words, when all user-supplied constraints are the same, repeated runs with the same seeds produce very little sequence variation. This design ensures that experiments are repeatable and that results can be reproduced.
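The effect of fixed seeds can be illustrated with a minimal sketch. This is not TrigGate's code: the function `generate_candidates` and the seed list `FIXED_SEEDS` are hypothetical stand-ins, and the only point is that a fixed seed list makes repeated runs return the same small set of sequences.

```python
import random

# Hypothetical fixed seed list; TrigGate's actual values are not shown here.
FIXED_SEEDS = [0, 1, 2, 3, 4, 5, 6]


def generate_candidates(seeds, length=12):
    """Return one random RNA sequence per seed (stand-in for a real generator)."""
    candidates = []
    for seed in seeds:
        rng = random.Random(seed)  # independent generator per seed
        candidates.append("".join(rng.choice("ACGU") for _ in range(length)))
    return candidates


first_run = generate_candidates(FIXED_SEEDS)
second_run = generate_candidates(FIXED_SEEDS)

# Identical seeds and constraints give identical output on every run.
print(first_run == second_run)
# The number of distinct sequences can never exceed the number of seeds.
print(len(set(first_run)), "distinct sequences from", len(FIXED_SEEDS), "seeds")
```

Reproducibility and diversity pull in opposite directions here: the same property that makes a run repeatable also caps how many different candidates the user can ever see.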

However, because we need the best possible switch sequence, such a fixed result is not ideal. We therefore added a genetic algorithm to increase the diversity of the generated sequences while maintaining their performance.

1. We added a switch for the genetic algorithm, so the user can choose whether to use it. The user can also tune the algorithm by setting the population size or the number of generations, which leaves considerable freedom in how the tool is used.

2. We extended toehold_generator and created a GeneticToeholdGenerator. The genetic algorithm works as follows. NUPACK first generates the initial population, and a pre-test measures the fitness obtained with different parameter settings to select suitable ones. The selected parameters are then passed to the formal generation program, in which one round of generation and screening consists of the following steps: generating sequences with NUPACK under the combined constraints, producing offspring through crossover and mutation, calculating the free energy difference to determine fitness, and checking for defects such as pseudoknots and reading frame issues. These steps are repeated until the preset number of generations or population size is reached, at which point the algorithm stops.

3. To determine the best mutation rate and crossover rate for each user input sequence, we added a parameter test program (ga_parameter_test). It automatically tests the fitness and the free energy difference obtained with different parameter combinations and selects the most suitable combination, which is then used in the formal generation program.
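The loop described in the list above can be sketched as follows. This is a simplified illustration, not the GeneticToeholdGenerator itself: `GAConfig`, `placeholder_fitness`, `crossover`, `mutate`, and `evolve` are hypothetical names, the initial population is random rather than generated by NUPACK, the fitness is a GC-fraction stand-in for the free energy difference, and the defect checks (pseudoknots, reading frame) are omitted.

```python
import random
from dataclasses import dataclass

BASES = "ACGU"


@dataclass
class GAConfig:
    use_ga: bool = True        # user-facing switch: skip evolution when False
    population_size: int = 20
    generations: int = 10
    crossover_rate: float = 0.7
    mutation_rate: float = 0.05


def placeholder_fitness(seq):
    """Stand-in score (GC fraction). A real fitness would be derived from
    the free energy difference computed by a folding package."""
    return (seq.count("G") + seq.count("C")) / len(seq)


def crossover(a, b, rng):
    """Single-point crossover between two equal-length sequences."""
    point = rng.randrange(1, len(a))
    return a[:point] + b[point:]


def mutate(seq, rate, rng):
    """Replace each base by a randomly drawn base with probability `rate`."""
    return "".join(
        rng.choice(BASES) if rng.random() < rate else base for base in seq
    )


def evolve(initial, config, fitness=placeholder_fitness, seed=None):
    """Return the final population sorted by fitness, best first."""
    rng = random.Random(seed)
    population = list(initial)
    if not config.use_ga:
        return sorted(population, key=fitness, reverse=True)
    for _ in range(config.generations):
        # Keep the better half as parents (truncation selection with elitism).
        ranked = sorted(population, key=fitness, reverse=True)
        parents = ranked[: max(2, len(ranked) // 2)]
        offspring = []
        while len(parents) + len(offspring) < config.population_size:
            a, b = rng.sample(parents, 2)
            child = crossover(a, b, rng) if rng.random() < config.crossover_rate else a
            offspring.append(mutate(child, config.mutation_rate, rng))
        population = parents + offspring
    return sorted(population, key=fitness, reverse=True)


# Random initial population as a stand-in for NUPACK-generated candidates.
init_rng = random.Random(0)
initial = ["".join(init_rng.choice(BASES) for _ in range(30)) for _ in range(20)]
final = evolve(initial, GAConfig(), seed=1)
print(final[0], placeholder_fitness(final[0]))
```

Because the parents are carried over unchanged into the next generation, the best score in the population cannot decrease from one generation to the next. Setting `use_ga=False` returns the ranked initial candidates without any evolution, which mirrors the on/off switch offered to the user.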

4. Iterative Optimization

Three-round iterative evolution

1.1 The first batch of wet-lab results prompted us to rethink and improve the model. After careful analysis, we decided to add a genetic algorithm to the program. We first identified where in the script the genetic algorithm should be applied and studied the mechanism of the sequence generator in detail. We then decided to add only the genetic algorithm component and leave the remaining steps unmodified, to preserve the effectiveness and accuracy of the model.

The original design constraints of the model are very strict, so we kept the original generation logic: NUPACK generates random sequences to obtain candidate sequences. In the genetic algorithm step, we apply crossover and mutation to these candidates to obtain more diverse results under the original strict constraints, and finally we screen and verify the results.

1.2 In the first round of design, we set the parameters very simply, using a single pair of crossover and mutation rates that was not sequence-specific. After simple debugging, we did obtain a larger sequence space. However, when we examined the structure visualizations and thermodynamic parameters of these sequences, they showed poor functionality, so we reconsidered the design and began a second iteration.

2.1 In the second round of design, we analyzed why the first round failed. We believe the functional defects were caused by the lack of thermodynamic constraints. We therefore added both NUPACK and ViennaRNA to the evaluation function, so that evaluation can still proceed if one of them cannot be called. After resolving all reported errors, we obtained a second batch of sequences. This batch showed good thermodynamic properties, but the visualization results were still not ideal and fell short of our expectations.
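The idea of keeping a second folding package available when the first cannot be called can be sketched as below. The helper `evaluate_with_fallback` and the two backend functions are hypothetical stand-ins: they do not call NUPACK or ViennaRNA, and the returned energy is a dummy number used only to show the control flow.

```python
def evaluate_with_fallback(sequence, backends):
    """Try each (name, function) backend in order and return (name, energy)
    from the first one that succeeds."""
    errors = []
    for name, compute_energy in backends:
        try:
            return name, compute_energy(sequence)
        except Exception as exc:  # e.g. ImportError when a package is missing
            errors.append(f"{name}: {exc}")
    raise RuntimeError("no folding backend available: " + "; ".join(errors))


# Hypothetical stand-ins for wrappers around the real packages.
def nupack_energy(sequence):
    raise ImportError("nupack is not installed")  # simulate an unavailable backend


def viennarna_energy(sequence):
    return -0.5 * len(sequence)  # dummy value, not a real free energy


backend, energy = evaluate_with_fallback(
    "GGGAAACCC",
    [("nupack", nupack_energy), ("viennarna", viennarna_energy)],
)
print(backend, energy)
```

Returning the backend name together with the energy makes it possible to record which package produced each value, which matters because the two packages may use different energy parameters.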

3.1 In the third round of design, we added a parameter testing procedure so that sequence structures continue to meet the requirements during population evolution instead of gradually losing their features. We created a script that automatically tests parameters and identifies the best crossover and mutation rates for each input sequence, so that different sequences are evolved with sequence-specific settings. Experimental verification shows that the results are still not ideal, and we are looking for a better way to set the constraints.
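A parameter test of this kind can be sketched as a small grid search. The code below is an assumption about how such a script might be organized, not the actual ga_parameter_test: `select_best_parameters` is a hypothetical helper, and `toy_run_ga` is a made-up scoring function that replaces a real genetic algorithm run.

```python
import itertools
import random
import statistics


def select_best_parameters(run_ga, crossover_rates, mutation_rates, repeats=3):
    """Grid-search both rates; return (best_pair, mean score per pair)."""
    results = {}
    for cx, mut in itertools.product(crossover_rates, mutation_rates):
        # Average over several seeded runs to reduce the effect of chance.
        scores = [run_ga(cx, mut, seed) for seed in range(repeats)]
        results[(cx, mut)] = statistics.mean(scores)
    best = max(results, key=results.get)
    return best, results


def toy_run_ga(crossover_rate, mutation_rate, seed):
    """Hypothetical stand-in for one GA run returning its best fitness.
    The score peaks at crossover 0.7 and mutation 0.05 purely for illustration."""
    noise = random.Random(seed).uniform(-0.01, 0.01)
    return 1.0 - abs(crossover_rate - 0.7) - 4 * abs(mutation_rate - 0.05) + noise


best, table = select_best_parameters(
    toy_run_ga,
    crossover_rates=[0.5, 0.7, 0.9],
    mutation_rates=[0.01, 0.05, 0.1],
)
print("best (crossover, mutation):", best)
```

In practice `run_ga` would wrap a full evolution run for one input sequence, so the selected pair is specific to that sequence, as described above.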

Our current model is still not perfect. In future work, we will continue to look for better generation methods to improve it.

5. Computational Validation

To check whether our improvements to the TrigGate model, especially the introduction of a genetic algorithm, can generate toehold switch sequences with better performance, we use a twofold validation method based on computational biology. The core idea is to quantitatively evaluate the predictive ability of the improved model along two key dimensions, functional performance and structural stability, by plotting a "functional adaptability curve" and a "free energy change curve". We also added logging so that the program's output can be interpreted, which helps us evaluate how the model behaves.

First, we will use both the original model and the improved model to generate 20 (or more) sequences with their model prediction scores, use the NUPACK and ViennaRNA components in the model to calculate the free energy, and calculate the ON/OFF ratio with the machine learning program. (Because the iterative improvement of the model is not yet complete, we have not produced results. Once the iteration is finished, we will use real data for the evaluation.)

The fitness curve evaluates the functional prediction ability of the model by comparing the model's prediction score with the ON/OFF ratio, and verifies whether the model can accurately identify high-performance sequences. Once we have the results, we will analyze the following points.

Coefficient of determination R²: evaluates the strength of the linear relationship between the model's predicted score and the ON/OFF ratio, with R² > 0.7 indicating a strong correlation.

Data point distribution: check whether there is a positive trend from the lower left to the upper right, and whether the improved model's data points are more concentrated in the upper right quadrant.

Extreme performance: compare the maximum and average ON/OFF ratios of the two models to see whether the improved model shows a significant improvement.

Trendline slope: the larger the slope, the better the model distinguishes high-performance sequences.
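The numerical criteria above (R², trendline slope, maximum and average ON/OFF ratio) can be computed with a short script. The sketch below uses a plain least-squares fit; the function names are hypothetical and the input numbers are made up for illustration, since no model output is available yet.

```python
def linear_fit(x, y):
    """Least-squares slope, intercept and R^2 for y = slope * x + intercept."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r_squared = sxy ** 2 / (sxx * syy)  # squared Pearson correlation
    return slope, intercept, r_squared


def summarize(scores, on_off):
    """Collect the fitness-curve criteria for one model."""
    slope, _, r_squared = linear_fit(scores, on_off)
    return {
        "slope": slope,
        "r_squared": r_squared,
        "max_on_off": max(on_off),
        "mean_on_off": sum(on_off) / len(on_off),
    }


# Made-up values for illustration only; these are not experimental data.
scores = [0.1, 0.3, 0.5, 0.7, 0.9]
on_off = [2.0, 4.0, 6.0, 8.0, 10.0]
summary = summarize(scores, on_off)
print(summary)
```

Running `summarize` once for the original model and once for the improved model gives directly comparable numbers for each criterion.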

The ddG curve evaluates the accuracy of the structural stability prediction by comparing the model-predicted free energy with a calculated reference value. It checks that the switch is stable enough in the untriggered state to avoid leaky expression, and that the free energy difference is balanced enough to allow an effective conformational change. Once we have the results, we will analyze the following points.

Prediction error (RMSE): the root mean square error between the model's predicted free energy and the calculated value. The smaller the error, the more accurate the prediction.

Stability distribution: the proportion of sequences whose dG_off falls within the ideal stability interval (-5 to -15 kcal/mol).

Free energy difference range: verify that ΔΔG is within the valid trigger range (-10 to -20 kcal/mol).

Data consistency: observe how close the model's predicted values are to the calculated values; ideally the points lie close to the reference line.
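The quantitative checks above can be expressed in a few lines. In this sketch the helper names are hypothetical and the energy values are made up for illustration; only the two intervals are taken from the text.

```python
import math


def rmse(predicted, reference):
    """Root mean square error between two equal-length lists."""
    squared = [(p - r) ** 2 for p, r in zip(predicted, reference)]
    return math.sqrt(sum(squared) / len(squared))


def fraction_in_range(values, low, high):
    """Share of values that lie in the closed interval [low, high]."""
    return sum(low <= v <= high for v in values) / len(values)


# Intervals quoted in the text, in kcal/mol (lower bound first).
DG_OFF_RANGE = (-15.0, -5.0)
DDG_RANGE = (-20.0, -10.0)

# Made-up values for illustration only; these are not model output.
predicted_dg_off = [-12.0, -8.0, -4.0, -16.0]
reference_dg_off = [-11.0, -9.0, -5.0, -15.0]
ddg = [-12.0, -18.0, -9.0, -21.0]

error = rmse(predicted_dg_off, reference_dg_off)
stable_share = fraction_in_range(predicted_dg_off, *DG_OFF_RANGE)
trigger_share = fraction_in_range(ddg, *DDG_RANGE)
print(error, stable_share, trigger_share)
```

The data consistency criterion is the visual counterpart of the RMSE: plotting predicted against reference values should place the points near the line where both are equal.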

Through the combined validation of these two curves, we aim to provide two complementary lines of computational evidence for the model improvement before committing to wet-lab experiments. The functional adaptability curve checks whether the generated sequences perform better in predicted function, while the free energy change curve checks whether they are structurally stable. Together they can show whether the genetic algorithm maintains sequence quality while expanding the design space, and whether it can in principle address the key defects found in the initial wet-lab experiment, such as poor switch stability and leaky expression. This lays a theoretical foundation for the subsequent experimental verification.

6. Future Roadmap

Direction of optimization	Specific measures	Expected effect
Experimental verification	Validate the effectiveness of the improved algorithm through wet-lab experiments	Biological function confirmation
Computational efficiency	Optimize code and utilize efficient libraries	Reduce computation time
Quality improvement	Enhance quality control with multi-dimensional criteria	Increase in the proportion of high-quality sequences
Fitness function	Refine the fitness function by optimizing weighting factors	Enhanced prediction accuracy

We hope to establish an integrated toehold switch design platform to achieve seamless connection from computational design to experimental verification, and promote the application of RNA regulatory elements in precision medicine.

7. User Manual

Click to Download