5'UTR Sequence | DUT-China

1. Introduction

In this iGEM project, our team proposes precision treatment for liver cancer through modular mRNA regulation of key protein expression. To enable efficient and stable protein expression, we designed and optimized the 5'UTR in the dry lab. Through dry-lab simulation and wet-lab validation, we aim to ensure that the mRNA drug maintains high expression and low degradation within the liver cancer microenvironment.

Conventional computational methods for 5'UTR sequence design and optimization face limitations such as restricted search spaces, delayed functional prediction, and difficulties in handling variable-length sequences, making it challenging to efficiently obtain sequences with high expression and stability. UTRGAN and UTR-Insight are two core models introduced specifically to address these issues. Using these models as a starting point, we conducted a series of studies. Combined with validation and feedback from wet experiments, we ultimately established a complete workflow and developed a web platform.

2. Methods and Results

UTRGAN uses a generative adversarial network (GAN) to generate natural-like sequences from a low-dimensional latent space, which can then be optimized for function. UTR-Insight predicts sequence translation efficiency, expressed as mean ribosome load (MRL), and supports variable-length inputs and multi-metric evaluation. Used together, the two models speed up sequence design and provide technical support for sequence property prediction.

We conducted three iterative optimizations based on these two models. First, we replicated the models, using existing tools to design sequences for wet-lab validation. To simplify the workflow and reduce computational time, we designed a fusion model and optimized it based on wet-lab feedback, leading to the design of a second batch of sequences for validation. After receiving feedback from the second batch of wet-lab results, we revisited our entire approach, decided to shift strategies, redesigned the workflow, and developed a web platform. The final batch of wet-lab validation yielded excellent results.

Figure 1. Experimental Iteration Process: model replication based on UTRGAN and UTR-Insight; generation of the first sequence batch using existing tools, followed by wet-lab validation; feedback-driven construction of a fusion model that reduces computational overhead and simplifies the workflow while producing a second sequence batch for re-validation; workflow restructuring and web platform development based on second-batch feedback to enable parallel design and execution; a third round of wet-lab experiments showing that the platformized workflow significantly enhanced the expression levels of the designed sequences.

(1) First Iteration

During the first iteration, we established a technical environment supporting independent operation of both UTRGAN and UTR-Insight. Considering their characteristics and operational requirements, we adopted a virtual environment isolation strategy, deploying each model in a separate virtual environment to ensure stable, functionally independent operation through granular control.

Building on the independently replicated models, we advanced the experimental workflow. After generating the first batch of sequences, we predicted their properties and screened them with the UTR-Insight model. Selected sequences then went to wet-lab validation, providing critical experimental data for subsequent iterations.

Our initial wet-lab results were disappointing, showing no significant improvement in protein expression levels. We therefore analyzed the sequences themselves. They showed excessively high GC content and strongly negative minimum free energy (MFE) values, which may lead to overly stable secondary structures such as hairpins. These structures could hinder ribosome binding and scanning and increase the risk of mRNA degradation.

Sequence	GC content (%)	MFE (kcal/mol)
AGTGCGTGAGCTTGAGTGGGCTTTCAGGTGTCATGGTTCTGGGAAAAGCTGAGGGCTAAAGGCCCTGTAGAGACCGCCGTACGAACCCTCAA	55	-29.8
ACCCCGCGACCTCTTACCGACCCCCCCGCGTGGGTCCCCCGCGGGAGCGGCGGCTGCGCGCAGTCTCAGGC	79	-23.7
GGTGCGCCGCGGCCAGAGGCGCACCCGGCGGGGCTCCACGCCGCGCCAGGGTGAGTGGCTGGCAGCGGTGGCGGCTTCTCGGCCCCGGGAG	81	-52.5
CAGGAGTCCGCCCGGTTTTTTCTGTGGCACTTGAGCTACAAGGGGGCCTGTTCCCTGCACCTGGCTGCAGGTG	63	-26.0
CCGGCGCCAGGAAGCTTGGGGTGGCTTCCGGGCACCCCGGATTCCTGCGGCTTCGGGGCCGAAACCGGGGGCTCCTTTGGTG	72	-42.0
TCGGACCGACTGAGCCTCCCGTCGGCGCCTCCACTTGGAGACTGGTGCAGGGAGAGATGCCCGGCCCGTTGGATGGGCCACC	70	-33.3
TGGTGAGGGCAGGGGAGATGCACGCGGAGGCCCTTCGCACTACCTCCGGCCTCCTCCGAGGTCGATTC	67	-23.7
GCTTCCTCGCCCGGGCGCTGTCTCCTGGATTGGCCCGCGCCGTCCCCAAAGGGCCCTTTGTGGACGCGAGTACGGGATGATGTGTGAAGACAGG	67	-39.2
AGTCCCGATGCTTGAAGTCTGAGTCCACTGCGGCGCTCGCCGCCTCCTCAGGGACAGCACTTACA	61	-12.7
CGGAGGACATGCCTTGTGAGAGGAAAGCGTCCCCAGGATAGGCGCTTGACTTGCCGAATCGCAAGTCACCGGGGGACACA	61	-27.9

Table 1. First batch of sequences. Across the ten experimentally validated sequences listed, the average GC content was 67.6% and the average minimum free energy was -31.1 kcal/mol.
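The GC percentages in Table 1 can be recomputed directly from the sequences. The snippet below is a minimal sketch of that check; the function names and the 60% flagging threshold are illustrative assumptions, not part of our pipeline. MFE values require an RNA folding tool and are not computed here.

```python
def gc_content(seq: str) -> float:
    """Return the percentage of G and C bases in a nucleotide sequence."""
    seq = seq.upper().replace(" ", "")
    if not seq:
        raise ValueError("empty sequence")
    gc = sum(1 for base in seq if base in "GC")
    return 100.0 * gc / len(seq)


def flag_high_gc(sequences, threshold=60.0):
    """Return the sequences whose GC content exceeds the threshold.

    The 60% default is an assumed cutoff for illustration only.
    """
    return [s for s in sequences if gc_content(s) > threshold]


if __name__ == "__main__":
    # Second row of Table 1.
    example = ("ACCCCGCGACCTCTTACCGACCCCCCCGCGTGGGTCCCCCGCGGGAGCGG"
               "CGGCTGCGCGCAGTCTCAGGC")
    print(f"GC content: {gc_content(example):.1f}%")
    print("Flagged as high GC:", bool(flag_high_gc([example])))
```

A check like this could be run on every generated batch before wet-lab selection to catch GC-rich designs early.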

We also examined the model design and noted that the model imposes no restrictions on GC content. UTRGAN is a generative model based on generative adversarial networks (GANs), characterized by a competitive process between a generator and a discriminator. To evade the discriminator's "penalty," the generator strives to design sequences resembling natural 5'UTRs. Since natural sequences are typically stable with high GC content, the model-generated sequences also end up with excessively high GC content. Additionally, the model lacks a minimum free energy prediction function; adding one would further improve our model.

The first-generation models operated independently, which created significant challenges. We had to set up two separate environments to accommodate both models. Their input and output formats also differed: UTRGAN's output could not be fed directly into UTR-Insight for property prediction and required additional data processing. Time consumption was another major concern. Generating a batch of 500 sequences with UTRGAN takes approximately three and a half hours. Data processing and property prediction with UTR-Insight then extend the entire workflow to at least eight hours. Reducing computational time and accelerating processing were therefore critical priorities for advancing the project.

To address these issues, we iterated the model to achieve improved results.

(2) Second Iteration

We addressed the limitations of the first-generation model through enhancements. First, to resolve issues like environment setup and data misalignment in the phased workflow, we designed a fusion model integrating UTRGAN and UTR-Insight. The goal was to build a fully integrated end-to-end model, achieving a closed-loop process from sequence generation to property prediction. Second, we added a GC content control module that dynamically adjusts base generation probabilities to prevent excessively high GC content at the source.
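One simple way to adjust base generation probabilities is to cap the combined probability of G and C at each position and hand the removed mass to A and T. The sketch below illustrates this idea only; the function name `cap_gc_probability`, the dictionary layout, and the 0.6 cap are assumptions, and the module in our fusion model may differ in detail.

```python
BASES = ("A", "C", "G", "T")


def cap_gc_probability(probs, max_gc=0.6):
    """Rescale one position's base probabilities so P(G) + P(C) <= max_gc.

    probs maps each base in BASES to a probability; the values should sum to 1.
    The cap value is an assumed setting for illustration.
    """
    gc_mass = probs["G"] + probs["C"]
    if gc_mass <= max_gc:
        return dict(probs)  # already within the cap, nothing to adjust

    at_mass = probs["A"] + probs["T"]
    gc_scale = max_gc / gc_mass
    if at_mass > 0:
        # Give the removed G/C mass to A/T, keeping their relative ratio.
        at_scale = (1.0 - max_gc) / at_mass
        a, t = probs["A"] * at_scale, probs["T"] * at_scale
    else:
        # Degenerate case: no A/T mass at all, so split it evenly.
        a = t = (1.0 - max_gc) / 2
    return {"A": a, "C": probs["C"] * gc_scale, "G": probs["G"] * gc_scale, "T": t}


if __name__ == "__main__":
    position = {"A": 0.1, "C": 0.4, "G": 0.4, "T": 0.1}
    print(cap_gc_probability(position))
```

Applying the cap per position bounds the expected GC content of a sampled sequence, although individual samples can still deviate from it.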

To address the sequence stability problems seen in the initial wet-lab experiments, we made targeted enhancements to core functionality and optimization logic. For core functionality, we integrated minimum free energy prediction, bringing mRNA stability, a critical regulatory dimension, into the core constraint framework of sequence design. For optimization logic, we introduced a gradient optimization mechanism. Building on UTRGAN's original generation logic, this mechanism uses gradient-based optimization to steer the input to sequence generation, further improving the functional fitness and experimental validity of the generated sequences.

Figure 2. Fusion model mechanism. a) Trained generator module. Sampling vectors from a 25-dimensional latent space, the generator produces 128 nt one-hot-encoded 5'UTR sequences; through adversarial training between generator and discriminator, the generator ultimately outputs sequences with natural-like structure. b) 5'UTR design workflow integrating UTR-Insight. The trained generator produces initial sequences, whose MRL/TE is predicted by the UTR-Insight model. The generator input vector is then optimized by gradient ascent through backpropagation until high MRL/TE targets are reached, yielding the optimized sequence.
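The latent-vector optimization in panel b can be illustrated with a toy example. The sketch below is not our model: `toy_score` is a hypothetical stand-in for "predicted MRL of the sequence generated from z", and the gradient is estimated by finite differences rather than by backpropagation through real networks. It only shows the update rule, in which the input vector is moved repeatedly in the direction that increases the predicted score.

```python
import random

LATENT_DIM = 25  # latent size given in the Figure 2 caption


def toy_score(z):
    """Hypothetical stand-in for predictor(generator(z)).

    The score is highest when every component of z equals 0.5.
    """
    return -sum((zi - 0.5) ** 2 for zi in z)


def numerical_gradient(f, z, eps=1e-5):
    """Central-difference estimate of the gradient of f at z."""
    grad = []
    for i in range(len(z)):
        up = z[:i] + [z[i] + eps] + z[i + 1:]
        down = z[:i] + [z[i] - eps] + z[i + 1:]
        grad.append((f(up) - f(down)) / (2 * eps))
    return grad


def gradient_ascent(f, z0, lr=0.1, steps=100):
    """Repeatedly move z along the gradient of f to increase the score."""
    z = list(z0)
    for _ in range(steps):
        grad = numerical_gradient(f, z)
        z = [zi + lr * gi for zi, gi in zip(z, grad)]
    return z


if __name__ == "__main__":
    rng = random.Random(0)
    z_start = [rng.gauss(0.0, 1.0) for _ in range(LATENT_DIM)]
    z_opt = gradient_ascent(toy_score, z_start)
    print("score before:", toy_score(z_start))
    print("score after: ", toy_score(z_opt))
```

In a deep learning framework, the finite-difference step would be replaced by automatic differentiation through the generator and the predictor, with the network weights held fixed and only the input vector updated.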

We restructured six code modules across both models. In the integrated model, we added a GC content control module to regulate sequence GC content and a minimum free energy prediction module to quantify mRNA secondary structure stability during sequence generation. We also modified the property prediction module of UTR-Insight, which replaces the original property prediction function of the UTRGAN model to achieve more accurate predictions. By restructuring the input and output modules, the distinct functions of the two models are executed sequentially, eliminating the need for manual data alignment. For the computational modules, we improved performance and reduced computation time through lightweight strategies such as depthwise separable convolutions, attention mechanisms, and inference acceleration.

Model	Time to design 500 sequences (h)	Complete workflow time (h)	GC content control	MFE prediction	Manual data alignment required
Original model	3.5	8	No	No	Yes
Fusion model (after iteration)	1.5	3	Yes	Yes	No

Table 2. Comparative Properties of Two-Generation Models

Using the newly designed fusion model, we conducted a second round of sequence design and wet-lab validation, with high expectations for the results. However, the second round of wet-lab experiments showed that our designed sequences performed poorly, even worse than in the first round of validation. The fusion model had failed to deliver the expected performance, and we found ourselves in a predicament.

We re-examined our work to pinpoint the model's issues. When reproducing UTR-Insight, we had skipped the pre-training steps, which are critical for model performance, and had instead trained the main model with a randomly initialized Transformer encoder. This omission prevented the encoder from effectively capturing hierarchical biological features within sequences and departed from the original model's design logic. Consequently, the reproduced model's prediction accuracy fell short of the levels reported in the literature. For the fusion model, we lacked sufficient training data, and some code modules had been restructured, so its performance relative to the original paper could not be guaranteed.

During sequence selection, we relied solely on model prediction values without establishing a reference standard. Consequently, we cannot confirm whether the selected sequences achieve optimal protein expression outcomes.

Although the second-generation model did not yield satisfactory results, it provided valuable insights and allowed us to accumulate significant experience. To enhance model performance and increase the reliability of experimental outcomes, we decided to change our strategy and redesign our model.

(3) Third Iteration

Following discussions and analysis, we altered our original strategy and decided to design a workflow integrating several components. We first pre-trained the model, achieving accuracy comparable to the original literature.

Figure 3. Training Accuracy vs Literature Target. Top-1 training accuracy of our reproduction (dashed) closely matches the reference implementation (solid) and remains within the literature accuracy band (center ±0.5 percentage points) for most of training. Final gap at the last epoch is 0.00 pp, with a tail mean absolute difference of 0.00 pp over the last 10 epochs. Note: "pp" denotes percentage points.

To minimize modifications to the original model, we designed an automated workflow enabling parallel operation of both models for sequence design optimization and property prediction. We first established a compatible environment where both models functioned correctly. Building upon this foundation, we developed a scripted tool to integrate the two models. The workflow incorporates previous best practices, including standardized data formats, GC content constraints, and minimum free energy predictions. To address the lack of reference standards, we selected two natural sequences from the literature for comparison. We calculated their MRL values and minimum free energies, using these metrics as benchmarks to screen for sequences with higher MRL values and comparable minimum free energies.
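The benchmark-based screening described above can be sketched as a simple filter. Everything concrete in this example is an assumption made for illustration: the `Candidate` type, the 5 kcal/mol tolerance used to define "comparable" MFE, and all numeric values, which are made up and are not our measured or predicted data.

```python
from typing import NamedTuple


class Candidate(NamedTuple):
    name: str
    mrl: float  # predicted mean ribosome load
    mfe: float  # minimum free energy in kcal/mol


def screen(candidates, references, mfe_tolerance=5.0):
    """Keep candidates that beat every reference on MRL and whose MFE lies
    within mfe_tolerance kcal/mol of the range spanned by the references.

    The result is sorted by predicted MRL, highest first.
    """
    best_ref_mrl = max(r.mrl for r in references)
    mfe_low = min(r.mfe for r in references) - mfe_tolerance
    mfe_high = max(r.mfe for r in references) + mfe_tolerance
    kept = [
        c for c in candidates
        if c.mrl > best_ref_mrl and mfe_low <= c.mfe <= mfe_high
    ]
    return sorted(kept, key=lambda c: c.mrl, reverse=True)


if __name__ == "__main__":
    # Hypothetical values for two natural reference 5'UTRs and four designs.
    refs = [Candidate("ref1", 6.0, -20.0), Candidate("ref2", 6.5, -25.0)]
    designs = [
        Candidate("a", 7.0, -22.0),
        Candidate("b", 6.2, -21.0),
        Candidate("c", 7.5, -45.0),
        Candidate("d", 7.2, -28.0),
    ]
    for c in screen(designs, refs):
        print(c.name, c.mrl, c.mfe)
```

Using the references to set both thresholds ties the selection to sequences with known expression behavior instead of relying on the raw prediction values alone.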

We generated a third batch of sequences using the new model and selected a subset for wet-lab validation.

Figure 4. Comparison of model output quality across three batches of 5'UTR designs

The design quality of the third batch of 5'UTRs improved compared to the first two batches. Notably, the regulatory efficacy of sequence 37 surpassed that of two naturally occurring highly expressed 5'UTRs, demonstrating application potential. This marks a significant milestone in the iterative refinement of our UTR optimization model.

Following the workflow design, we focused on developing an interactive web platform to provide an effective tool for sequence design optimization.

Video 1. Web Platform Demonstration

This platform enables users to upload sequences or parameters, automatically invoking our new model for sequence design and property prediction. Users can visually observe how different 5'UTR sequences impact translation efficiency and stability within cellular environments. This not only significantly lowers the technical barrier for 5'UTR design but also provides a powerful tool for mRNA therapy research and development.

3. Discussion

Our work went through three iterative rounds of 5'UTR sequence design, forming a complete closed loop between dry-lab design and wet-lab validation. The resulting interactive web platform turns our technical results into a low-barrier tool and lays the foundation for subsequent applications. The process also exposed issues in existing workflows: a disconnect between generation and functional requirements, delayed stability assessment, and insufficient multi-objective coordination.

Building on these practical insights, we propose a reverse generation strategy beyond initial forward screening attempts to enhance sequence design efficiency and reliability. This model first generates sequences by combining random noise with key target gene information, actively adjusting nucleotide composition during generation to maintain reasonable ranges while ensuring structural integrity and translation efficiency through multi-objective constraints. Next, it comprehensively evaluates generated sequences across dimensions including structural similarity, translational potential, stability, and rational base composition, retaining only sequences meeting all criteria. Finally, it refines qualified sequences to enhance translational efficiency while correcting stability deviations and preventing base composition shifts, ultimately selecting conflict-free sequences with optimal translational efficiency and stability. This effectively resolves the efficiency and reliability issues of traditional workflows.

Figure 5. Flowchart. Online pipeline (solid line): Input target gene information and random noise undergo conditional generation, with the GC control module dynamically adjusting base probabilities during generation to keep GC content between 40% and 60%. The output then enters a multidimensional discriminator, which scores and weights four dimensions: structural similarity (k-mer/secondary structure), translation potential (MRL prediction), stability (MFE range), and base composition (GC%). Only sequences meeting all four criteria pass the gate. Constrained multi-objective optimization is applied to qualifying sequences (enhancing MRL, correcting MFE, stabilizing GC), with non-dominated solutions selected via the Pareto criterion as final outputs. Offline training (dashed line): Using UTRdb as the baseline, UTRGAN output as the generated set, and a high-expression reference set as data sources, a two-stage training approach (Phase 2: MRL + MFE synergy) employs a multi-objective loss function comprising WGAN-GP, MRL loss, MFE interval loss, and GC interval loss, with the weights λ dynamically calibrated on a high-quality validation set. This workflow is designed to enhance design efficiency and expression performance while ensuring reliability.
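The Pareto selection step named in the caption keeps every candidate that no other candidate beats on all objectives at once. The following is a minimal sketch, assuming each candidate is summarized as a tuple of objective values that are all to be maximized (for example, predicted MRL and a stability score); the tuple layout and the sample values are hypothetical.

```python
def dominates(a, b):
    """True if a is at least as good as b on every objective and strictly
    better on at least one. All objectives are maximized."""
    return (all(x >= y for x, y in zip(a, b))
            and any(x > y for x, y in zip(a, b)))


def pareto_front(points):
    """Return the non-dominated points, preserving input order."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


if __name__ == "__main__":
    # Hypothetical (MRL score, stability score) pairs.
    scores = [(1, 5), (2, 4), (3, 3), (2, 2), (1, 1)]
    print(pareto_front(scores))
```

An objective that should be minimized, such as the distance of GC content from a target value, can be negated before it is passed in. This pairwise comparison is quadratic in the number of candidates, which is acceptable for batches of a few hundred sequences.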

4. Summary

This study centers on dry-lab experiments for 5'UTR sequence optimization, focusing on applying two existing models and designing a novel one. Through three rounds of iterative dry-lab and wet-lab refinement, we developed an integrated 5'UTR design model combining the UTRGAN generative model and the UTR-Insight property prediction model. It incorporates GC content control and minimum free energy prediction to support multi-objective optimization. We also developed a web platform for sequence design, significantly lowering the technical barrier for 5'UTR design. To address limitations in existing workflows, we proposed a reverse-design model that combines a multi-objective generator, a multi-dimensional discriminator, and a multi-objective optimizer to directly output highly expressible and stable 5'UTR sequences, with the aim of overcoming the shortcomings of traditional approaches.

Moving forward, we plan to broaden the model's applicability by integrating multi-omics data mining mechanisms to enhance design precision, advancing clinical translation to accelerate the transition from laboratory to medical applications. Concurrently, we will combine cross-disciplinary technologies such as gene editing to further elevate the model's application value and innovation potential, providing robust support for subsequent research and development in mRNA therapeutics.

References:

[1] Barazandeh et al. Bioinformatics Advances, 2025 https://doi.org/10.1093/bioadv/vbaf134

[2] Pan et al. BMC Genomics (2025) 26:107 https://doi.org/10.1186/s12864-025-11269-7

5. User Manual

Click to Download