
Overview

Fig. 1. Overall Workflow Diagram

To systematically optimize gene expression, we set out to build a comprehensive UTR optimization platform. We downloaded the open-source UTR-LM model from GitHub (https://github.com/a96123155/UTR-LM)[1], which accurately predicts MRL scores from input UTR sequences (MRL, mean ribosome loading, is the number of ribosomes actively translating a given mRNA molecule at any one time). On top of this model, we developed a mutation generation module and an optimization screening module, enabling automatic, high-throughput generation of UTR mutant sequences, calculation of their MRL scores, and output of the high-scoring sequences. To make the platform easier to use, we encapsulated the system into a web application and deployed it on a server, facilitating subsequent experiments and use by other iGEM teams.

Principles

UTR-LM Model

UTR-LM is a language model developed specifically for the 5′ UTR of mRNA. It uses deep learning to model the impact of this region on protein translation efficiency, enabling accurate MRL prediction from an input UTR sequence. For more details, please visit: https://github.com/a96123155/UTR-LM/tree/main

Mutation Optimization

We developed a Mutation Generation Module and an Optimization Screening Module.

Mutation Generation Module: This module takes an original UTR sequence as input and efficiently generates a large number of diverse mutant sequences. We implemented several mutation strategies (single-point random mutation, multi-point random mutation, and directed weighted mutation) to address different optimization needs, as sketched below.
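
The exact implementation lives in utr_random_mutator.py (see the code table below). As a minimal sketch of the three strategies, with function names, parameters, and the 50/50 strategy mix being our illustrative choices rather than the repository's actual interface:

```python
import random

BASES = "ACGU"  # RNA alphabet for 5' UTR sequences; use "ACGT" if working in DNA space

def single_point_mutation(seq: str) -> str:
    """Substitute one random position with a different base."""
    i = random.randrange(len(seq))
    return seq[:i] + random.choice([b for b in BASES if b != seq[i]]) + seq[i + 1:]

def multi_point_mutation(seq: str, n_points: int = 3) -> str:
    """Apply independent substitutions at several distinct random positions."""
    chars = list(seq)
    for i in random.sample(range(len(seq)), k=min(n_points, len(seq))):
        chars[i] = random.choice([b for b in BASES if b != chars[i]])
    return "".join(chars)

def directed_weighted_mutation(seq: str, weights: list) -> str:
    """Mutate one position drawn according to per-position weights,
    e.g. to favor changes outside a motif that should be preserved."""
    i = random.choices(range(len(seq)), weights=weights, k=1)[0]
    return seq[:i] + random.choice([b for b in BASES if b != seq[i]]) + seq[i + 1:]

def generate_mutants(seq: str, n: int = 10000) -> list:
    """Build a diverse mutant library by mixing strategies."""
    return [single_point_mutation(seq) if random.random() < 0.5
            else multi_point_mutation(seq)
            for _ in range(n)]
```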

Optimization Screening Module: This module feeds all generated mutant sequences in batch into the UTR-LM model to obtain a predicted MRL score for each mutant. It then sorts all sequences by score and outputs the Top-N (e.g., Top-200) high-scoring sequences, as sketched below.
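
Conceptually, the screening step is score, sort, truncate, and write out. A minimal sketch, where score_fn is a placeholder for the UTR-LM forward pass (handled in our pipeline by MJ5_Predict_and_Extract_Attention_Embedding.py):

```python
import csv

def screen_mutants(mutants, score_fn, top_n=200, out_path="utr_top200.csv"):
    """Score each mutant, rank by predicted MRL, and save the Top-N as CSV."""
    scored = [(seq, score_fn(seq)) for seq in mutants]   # batch prediction
    scored.sort(key=lambda pair: pair[1], reverse=True)  # highest MRL first
    with open(out_path, "w", newline="") as fh:
        writer = csv.writer(fh)
        writer.writerow(["sequence", "predicted_MRL"])
        writer.writerows(scored[:top_n])
    return scored[:top_n]
```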

Through the collaborative work of these two modules, we established an automated “in vitro evolution” platform for UTR sequences, which can rapidly screen a vast virtual sequence space for designed sequences with theoretically high expression potential.

Fig. 2. Example of Output Results

Website

We have encapsulated this system into a web application and deployed it on a server. You can access our website at: http://39.106.228.43:8000/.

We developed the web application using the Django framework and utilized a Lite Application Server from Alibaba Cloud as the web server. The website’s functionalities include:

  • Accepting uploaded original UTR sequences (required format: FASTA).
  • Returning the Top-200 high-scoring optimized sequences to the user in a CSV file.
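
As an illustration of this flow, a stripped-down Django view might look like the following; the upload field name and the run_pipeline helper are placeholders, not our exact production code:

```python
# views.py (sketch): accept a FASTA upload, return the Top-200 results as CSV
from django.http import HttpResponse, HttpResponseBadRequest

def optimize(request):
    if request.method != "POST" or "fasta" not in request.FILES:
        return HttpResponseBadRequest("POST a FASTA file in the 'fasta' field.")
    text = request.FILES["fasta"].read().decode("utf-8")
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    if not lines or not lines[0].startswith(">"):
        return HttpResponseBadRequest("Input must be in FASTA format.")
    # Concatenate the sequence lines of the first record
    seq = "".join(ln for ln in lines[1:] if not ln.startswith(">"))
    top200 = run_pipeline(seq)  # placeholder: mutate -> predict MRL -> rank (see modules above)
    response = HttpResponse(content_type="text/csv")
    response["Content-Disposition"] = "attachment; filename=utr_top200.csv"
    response.write("sequence,predicted_MRL\n")
    for s, score in top200:
        response.write(f"{s},{score}\n")
    return response
```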

The website features a clean, intuitive, and user-friendly interface.

Note: Due to server performance limitations and review processes by Alibaba Cloud, the website may occasionally experience accessibility issues.

Code and Operations

Core Code Description

Note: The table below lists the core code used in this project; a large amount of indirectly related code in the software repository is not listed here. All paths in the table are relative paths within the UTR-LM directory.
| File | Description | Path |
| --- | --- | --- |
| utr_random_mutator.py | Performs random mutation on sequences from the UTR-LM-main/design-utr/origin-utr.fa file. | UTR-LM-main/Scripts/UTRLM_mutants/utr_random_mutator.py |
| 11.py | Data cleansing for the mutated sequences. | UTR-LM-main/Data/11.py |
| MJ5_Predict_and_Extract_Attention_Embedding.py | Predicts the MRL score for each mutant sequence. | UTR-LM-main/Scripts/UTRLM_downstream/MJ5_Predict_and_Extract_Attention_Embedding.py |
| mutant1_utr_top200.py | Sorts the sequences by score and outputs the Top-200 sequences. | UTR-LM-main/Scripts/UTRLM_mutants/mutant1_utr_top200.py |
| main_control.py | Integrates the four files above; running it executes them in sequence to carry out the entire workflow. | UTR-LM-main/main_control.py |
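
To give a sense of the control flow, here is a minimal sketch of a driver in the spirit of main_control.py; the real script may differ in its details:

```python
# Run the four pipeline stages in order, relative to the UTR-LM-main directory
import subprocess
import sys

STEPS = [
    "Scripts/UTRLM_mutants/utr_random_mutator.py",     # 1. generate mutant library
    "Data/11.py",                                      # 2. cleanse mutated sequences
    "Scripts/UTRLM_downstream/MJ5_Predict_and_Extract_Attention_Embedding.py",  # 3. predict MRL
    "Scripts/UTRLM_mutants/mutant1_utr_top200.py",     # 4. rank and keep the Top-200
]

for script in STEPS:
    print(f"Running {script} ...")
    subprocess.run([sys.executable, script], check=True)  # abort the pipeline if a stage fails
print("Pipeline finished; the Top-200 sequences were written by the final stage.")
```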

Operations: For details about our code and how to run it, please refer to: https://gitlab.igem.org/2025/software-tools/tsinghua

Acknowledgment

In completing this project, we received invaluable support and inspiration from the open-source community and our academic peers, to whom we extend our sincere gratitude.

First and foremost, we thank the developers and contributors of the UTR-LM project (https://github.com/a96123155/UTR-LM)[1]. Its cutting-edge research and high-quality open-source code provided a solid theoretical and technical foundation that greatly facilitated our work.

Secondly, we extend our sincere respect and deep gratitude to the faculty members and senior colleagues of the Lu Zhi Laboratory for their meticulous guidance, constructive suggestions, and selfless assistance throughout the implementation of this project. Their profound academic expertise, rigorous scientific attitude, and spirit of open sharing are a model and motivation for our continuous progress.

Finally, we once again thank all organizations and individuals who directly or indirectly supported this project.

References

  1. Chu, Y., Yu, D., Li, Y. et al. A 5′ UTR language model for decoding untranslated regions of mRNA and function predictions. Nat. Mach. Intell. 6, 449–460 (2024). https://doi.org/10.1038/s42256-024-00823-9