Introduction

RNAi technology, which achieves gene silencing through siRNA-mediated mRNA degradation, has become an important tool in gene function research and drug development. As the precursor of siRNA, dsRNA must meet multiple design requirements, including high specificity, high silencing efficiency, and low off-target effects. During the process of designing dual-fusion and triple-fusion dsRNA sequences, we found that existing software has functional limitations and operational constraints, making it difficult to address complex needs. To overcome these challenges, we developed a highly integrated and automated dsRNA design system called APHiGEM. Starting from sequence input, this system can perform multi-rule screening and conduct in-depth off-target analysis based on multi-species transcriptome databases. Ultimately, it outputs comprehensively ranked and visually presented results, significantly improving the reliability, efficiency, and applicability of dsRNA design.

Motivation

During the initial project phase, consultations with domain experts suggested that multi-target fused dsRNA typically achieves more efficient gene silencing than single-target dsRNA. To test this hypothesis, we aggregated predicted siRNA start-site data from multiple siRNA prediction platforms and visualized them with software developed in-house. Guided by the mapped positions, we selected specific siRNA sequences and constructed our first triple-fused dsRNA. Subsequent aphid mortality bioassays showed that this multi-target fused dsRNA significantly enhanced both aphid lethality and target mRNA silencing efficiency.

Fig 1. (a) Visualization of the collected predicted start-site data, which guided the construction of our multi-fusion dsRNA. (b) Corrected aphid mortality after 120 hours of treatment with 800 ng/μl dsCHS, dsCP, dsCYP450, or dsF3: 20.48%, 30%, 31.4%, and 44.9%, respectively. Compared with the single-gene targets, the multi-target fusion dsF3 showed significantly higher lethal efficacy, supporting our hypothesis that simultaneously silencing multiple key genes can improve RNAi-induced mortality. (c) Relative expression levels of CHS, CYP450, and CP19 (CP) in brown citrus aphids (Toxoptera citricida) after treatment with 800 ng/μl dsCHS, dsCYP450, or dsCP for 120 hours, compared to the control. (d) Relative expression levels of CHS, CYP450, and CP19 (CP) in Toxoptera citricida after treatment with 800 ng/μl dsF3 for 120 hours, compared to the control. In (c) and (d), error bars represent the standard deviation (SD) derived from at least three biological replicates, and all datasets were statistically analyzed using a Student's t-test (*P < 0.05; **P < 0.01; ***P < 0.001; ****P < 0.0001).

However, throughout this prediction and application process, we identified a market gap in specialized siRNA/dsRNA prediction products. Furthermore, since most existing prediction tools are built on deep learning architectures, they typically suffer from poor transferability, closed-source limitations, non-transparent detection processes, and difficult redeployment. Additionally, most platforms separate prediction and fusion functions, substantially increasing user operational complexity and diminishing user experience.

Consequently, we conceived the idea of building from scratch an open-source, easily redeployable, and highly scalable computational design system for fused dsRNA—"APHiGEM"—based on empirical rules.

Planning

To build a computer-aided dsRNA design system from scratch, we adopted an agile development framework. By combining manual testing, visualization, and multi-script automated testing during development, we aimed to obtain user feedback as quickly as possible and to iterate on the software based on that feedback, gradually refining the system's functionality. This formed a practical closed loop of "rapid feedback, timely fixes, and efficient development". Our core stakeholders are the wet lab partners within our team and colleagues from other laboratories, who have deeper insight into the specific needs of laboratory settings; their practical feedback directly shaped the direction of feature development and the completeness of the system. We first built a minimum viable product to demonstrate the core functionality, then progressively updated the backend architecture and expanded the functional modules. Feedback from each iteration provided data and improvement directions for the next, which kept the software closely aligned with users' actual needs.

Initially, we analyzed the software objectives jointly with stakeholders, prototyped the target functions with simplified code, and built a minimum viable model. This confirmed the feasibility of the software objectives and laid the foundation for flexible later adjustments. Building on this model, we relied on the closed loop of "rapid feedback, timely fixes, and efficient development" to advance the software through successive iterations, with the functionality of the design system expanding at each one. The sections below walk through the complete iteration process of the APHiGEM system:

APHiGEM-1

Design

Modern siRNA design principles^[1] form the basis on which the program selects candidate sequences, so a scoring system aligned with these principles is required to support preliminary screening of usable sequences. In addition, to accommodate users' habits and application scenarios, the program must be tolerant of varied inputs, which requires supporting multiple input modes.

Build

Based on modern siRNA design principles, we constructed our own siRNA screening workflow and developed a Flask-based web application framework^[2]. Additionally, to grade each candidate sequence and facilitate users' efficient selection of high-quality sequences, we built a scoring system compatible with modern siRNA design principles.
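The screening-and-scoring idea described above can be illustrated with a minimal sketch. It is not the APHiGEM implementation: the window length of 21 nt, the 30-52% GC range, the homopolymer rule, and all names (`Candidate`, `enumerate_candidates`) are illustrative assumptions standing in for whichever rules the actual workflow applies.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    start: int         # 0-based start of the window in the target sequence
    sequence: str      # candidate window, written in the RNA alphabet
    gc_percent: float  # GC content of the window, in percent
    score: int         # number of placeholder rules the window satisfies


def gc_percent(seq: str) -> float:
    """Return the GC content of seq as a percentage."""
    return 100.0 * sum(base in "GC" for base in seq) / len(seq)


def enumerate_candidates(target, length=21, gc_min=30.0, gc_max=52.0):
    """Slide a fixed-length window over the target and score each window.

    The two rules below are placeholders (assumptions), not APHiGEM's rules.
    """
    rna = target.upper().replace("T", "U")
    candidates = []
    for start in range(len(rna) - length + 1):
        window = rna[start:start + length]
        gc = gc_percent(window)
        score = 0
        # Placeholder rule 1: moderate GC content.
        if gc_min <= gc <= gc_max:
            score += 1
        # Placeholder rule 2: no run of four or more identical bases.
        if not any(base * 4 in window for base in "ACGU"):
            score += 1
        candidates.append(Candidate(start, window, gc, score))
    # Best score first; ties are ordered by position in the target.
    return sorted(candidates, key=lambda c: (-c.score, c.start))
```

Sorting by score and then by position keeps the ranking deterministic, which makes the preliminary screening reproducible between runs.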

Test

During preliminary testing, we completed a simple user interface supporting basic sequence input and result display. Through this interface, users obtain candidate sequences that have passed the design-principle screening, that is, sequences that satisfy the empirical rules and the thermodynamic stability requirements^[3]. However, because there was no visualization interface, users could not intuitively see the positions and distribution of the candidate sequences.

Learn

We added a visualization interface that presents candidate sequence analysis results to users in the form of line charts. Users can now intuitively understand the positional information and distribution of candidate sequences, as well as detailed information for each usable sequence, facilitating the selection of qualified candidate sequences.

APHiGEM-2

Design

Reducing off-target effects is a key criterion for evaluating nucleic acid-based therapeutics^[4]. To ensure that the sequences we design meet safety requirements, we chose PostgreSQL as the core relational database to store transcriptome sequence data from potential off-target species^[5], so that candidate sequences can be aligned against the selected transcriptome sequences. We also added a k-mer (k=7) index^[6] to the seed-region screening step to improve the efficiency of large-scale screening.
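As a rough illustration of how a k-mer (k=7) index can speed up seed-region screening, the sketch below builds an in-memory index of every 7-mer in a set of transcripts and looks up the site complementary to a guide strand's seed. It assumes the common convention that the seed is positions 2-8 of the guide strand; the function names and the dictionary-based index are hypothetical and do not describe how APHiGEM stores its index.

```python
from collections import defaultdict

K = 7  # k-mer length stated in the text


def build_kmer_index(transcripts, k=K):
    """Map every k-mer to the set of transcript names that contain it."""
    index = defaultdict(set)
    for name, seq in transcripts.items():
        seq = seq.upper().replace("T", "U")
        for i in range(len(seq) - k + 1):
            index[seq[i:i + k]].add(name)
    return index


def reverse_complement(rna):
    """Return the reverse complement of an RNA string."""
    return rna.translate(str.maketrans("ACGU", "UGCA"))[::-1]


def seed_hits(guide, index, k=K):
    """Return the transcripts containing a site complementary to the seed.

    Assumption: the seed is guide positions 2-8 (1-based), i.e. guide[1:8].
    """
    seed = guide.upper().replace("T", "U")[1:1 + k]
    site = reverse_complement(seed)  # the mRNA site that pairs with the seed
    return set(index.get(site, set()))
```

Because each lookup is a single dictionary access, the cost of screening one candidate no longer grows with the total transcriptome size; only building the index does.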

Build

We established a database framework centered on the "Class-Species-Gene" three-table relationship, storing 10 classes and 257 species of potential off-target subjects deemed most relevant for detection, along with their transcriptome sequences for alignment. Using Python library-integrated database methods, we implemented efficient sequence retrieval and alignment capabilities.

Test

Script-based calibration tests and data validation showed that the program integrated the PostgreSQL database successfully: it correctly mapped the front-end species selection to the corresponding database IDs and accurately performed sequence retrieval and candidate sequence alignment. However, when we expanded the data to a larger transcriptome dataset for a single species, storage succeeded but retrieval became a significant challenge, mainly in the form of timeouts when processing large numbers of samples and difficulty retrieving sequences from the database.

Learn

To address these challenges, we implemented substantial optimizations at both database and memory management levels^[7]:

Database indexing optimization: Created composite indexes on frequently queried columns to accelerate sequence retrieval
Connection pooling: Implemented database connection pooling to reduce connection overhead
Batch processing: Modified alignment algorithms to process sequences in batches rather than individually
Memory management: Implemented efficient memory allocation and garbage collection strategies
Caching mechanism: Added result caching to avoid redundant computations for repeated queries
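
Two of the optimizations listed above, batch processing and result caching, can be sketched in a few lines. This is a simplified, in-memory illustration under stated assumptions: exact substring matching stands in for the real alignment step, and `batched`, `count_matches`, and `align_in_batches` are hypothetical names. In a database-backed setting, each batch would correspond to one fetch of rows rather than a slice of a list.

```python
from functools import lru_cache
from itertools import islice


def batched(items, size):
    """Yield successive lists of at most `size` items."""
    it = iter(items)
    while batch := list(islice(it, size)):
        yield batch


@lru_cache(maxsize=4096)
def count_matches(candidate, transcript):
    """Count exact (possibly overlapping) occurrences of candidate.

    Results are cached, so a repeated query is answered without rescanning.
    """
    count, pos = 0, transcript.find(candidate)
    while pos != -1:
        count += 1
        pos = transcript.find(candidate, pos + 1)
    return count


def align_in_batches(candidates, transcripts, batch_size=500):
    """Total the matches of each candidate, one batch of transcripts at a time."""
    totals = {c: 0 for c in candidates}
    for batch in batched(transcripts, batch_size):
        for transcript in batch:
            for c in candidates:
                totals[c] += count_matches(c, transcript)
    return totals
```

Processing in batches bounds how many sequences are held in memory at once, while the cache trades a bounded amount of memory for faster repeated queries.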

APHiGEM-3

Design

Based on the high-quality siRNA candidate sequences and related information obtained in the earlier version, the next step requires the fusion of multiple sequences into dsRNA.

Build

We provide users with sequence selection options on the off-target analysis results page, allowing them to choose sequences with no or minimal off-target effects for final multi-sequence dsRNA fusion.

Test

Testing showed that the software meets the basic requirements for dsRNA fusion, but it could not accommodate users' individual needs; in particular, there was no way to bias the fused construct toward particular functions.

Learn

Recognizing the need for personalization, we added a sequence weighting feature prior to dsRNA fusion. Users can now assign a weight to each selected siRNA sequence, and the ratio of the weights determines each sequence's share of the length of the final fused dsRNA (e.g., two sequences weighted 2:1 will occupy approximately 2:1 of the length of the fused sequence).
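
The arithmetic behind the weighting can be sketched as follows. The source does not state how APHiGEM converts weights into lengths, so this is only one plausible scheme: proportional allocation with largest-remainder rounding. The function name `allocate_lengths` and the example total length of 300 nt are assumptions for illustration.

```python
import math


def allocate_lengths(weights, total_length):
    """Split total_length among segments in proportion to their weights.

    Uses largest-remainder rounding so the parts always sum to total_length.
    This is an illustrative scheme, not necessarily the one APHiGEM uses.
    """
    if not weights or any(w <= 0 for w in weights):
        raise ValueError("weights must be a non-empty list of positive numbers")
    total_weight = sum(weights)
    exact = [w * total_length / total_weight for w in weights]
    lengths = [math.floor(x) for x in exact]
    # Give the nucleotides lost to rounding to the largest fractional parts.
    leftover = total_length - sum(lengths)
    by_remainder = sorted(
        range(len(weights)), key=lambda i: exact[i] - lengths[i], reverse=True
    )
    for i in by_remainder[:leftover]:
        lengths[i] += 1
    return lengths


# Two sequences weighted 2:1 sharing a hypothetical 300 nt construct.
print(allocate_lengths([2, 1], 300))  # [200, 100]
```

When the total length is not divisible by the weights, the proportions are only approximate, which matches the "approximately 2:1" wording above.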

This iterative development process ensures alignment between software functionality and laboratory requirements. The agile development framework we employ enables real-time and efficient continuous feedback collection and optimization.

Design

Backend Architecture

Based on our project requirements, we have established the modular architecture shown in Figure 2, with several key components as follows:

Fig 2. The figure illustrates the various components of our backend code and the complete operation process.

Web Application Server (Flask): A lightweight framework that facilitates rapid iteration on front-end and back-end interaction. In this dsRNA design project, the Flask web application server acts as the core hub. As the backbone of the system's application layer, it coordinates all interactions among the front-end user interface, the back-end computing engine, database storage, and external API services.

siRNA Design Engine: The siRNA design engine represents the core algorithm module of this dsRNA computer-aided design tool, tasked with intelligently designing and screening highly efficient siRNA candidate sequences from input gene sequences.

Off-Target Effect Analysis Module: The off-target effect analysis module serves as a critical safety assurance component in the siRNA design system. Its primary function is to evaluate potential non-specific binding effects that designed siRNA sequences may produce in biological systems, ensuring siRNA specificity and safety.

PostgreSQL Database (pgsql): Relational databases are well suited to storing large-scale sequence data and offer high query efficiency. This database stores mRNA sequence data from non-target species, enabling rapid data retrieval by the software backend.

Visualization Component: This component processes positional information of siRNA candidate sequences through visualization techniques, enabling users to intuitively observe the results.

PostgreSQL Database

This project utilizes a PostgreSQL database (port 5433) deployed via Docker containers. Our choice of PostgreSQL over alternatives was driven by its proven reliability and a robust feature set that is well suited to our biological data. The schema primarily contains three core tables:

taxonomic_order table: Stores the 10 taxonomic classifications of the off-target analysis subjects we have incorporated.

species table: Contains information on the 257 species involved.

gene_sequence table: Stores gene sequence data for each species (including gene name, sequence content, length, and GC content).

In addition, PostgreSQL's strong support for complex queries and its advanced JSON support provide a solid foundation for future in-depth data analysis and schema evolution.

Fig 3. The figure illustrates the three core tables in the database architecture and their specific structures and contents.
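
For readers who want a concrete picture of the three-table relationship, here is a minimal sketch. It uses Python's built-in `sqlite3` module with an in-memory database purely as a stand-in so the example runs anywhere; the actual system uses PostgreSQL. The table names and the gene_sequence fields come from the description above, while the key columns (`id`, `order_id`, `species_id`), the index name, and the helper functions are assumptions.

```python
import sqlite3

# Stand-in schema: table names follow the text, key columns are assumed.
SCHEMA = """
CREATE TABLE taxonomic_order (
    id   INTEGER PRIMARY KEY,
    name TEXT NOT NULL UNIQUE
);
CREATE TABLE species (
    id       INTEGER PRIMARY KEY,
    order_id INTEGER NOT NULL REFERENCES taxonomic_order(id),
    name     TEXT NOT NULL UNIQUE
);
CREATE TABLE gene_sequence (
    id         INTEGER PRIMARY KEY,
    species_id INTEGER NOT NULL REFERENCES species(id),
    gene_name  TEXT NOT NULL,
    sequence   TEXT NOT NULL,
    length     INTEGER NOT NULL,
    gc_content REAL NOT NULL
);
-- Composite index on the columns used for lookups by species and gene name.
CREATE INDEX idx_gene_species_name ON gene_sequence (species_id, gene_name);
"""


def open_demo_db():
    """Create an in-memory database with the sketched schema."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    return conn


def sequences_for_species(conn, species_name):
    """Return (gene_name, sequence) pairs stored for one species."""
    rows = conn.execute(
        "SELECT g.gene_name, g.sequence FROM gene_sequence AS g "
        "JOIN species AS s ON s.id = g.species_id "
        "WHERE s.name = ? ORDER BY g.gene_name",
        (species_name,),
    )
    return rows.fetchall()
```

Keeping classification, species, and sequences in separate tables lets a front-end species selection resolve to a single ID before any sequence rows are read.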

The database supports batch import of FASTA files (up to 500MB), provides rapid querying functionality by species and gene names, and enhances response speed for repeated queries through a result caching mechanism. This enables efficient data storage and retrieval services for siRNA design and off-target effect analysis.
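
The import step can be pictured with a small sketch that reads FASTA records and derives the length and GC content stored alongside each sequence. It is a simplified illustration, not the project's importer: `parse_fasta` and `to_rows` are hypothetical names, the record name is taken as the first word of the header line, and the sketch streams lines one at a time so that large files need not be loaded whole.

```python
def parse_fasta(lines):
    """Yield (name, sequence) pairs from an iterable of FASTA lines."""
    name, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            header = line[1:].split()
            name, chunks = (header[0] if header else ""), []
        elif name is not None:
            chunks.append(line.upper())
    if name is not None:
        yield name, "".join(chunks)


def to_rows(lines):
    """Yield (gene_name, sequence, length, gc_content) tuples for insertion."""
    for name, seq in parse_fasta(lines):
        gc = 100.0 * sum(base in "GC" for base in seq) / len(seq) if seq else 0.0
        yield name, seq, len(seq), round(gc, 2)
```

A file can be passed directly, for example `to_rows(open(path))`, since a file object iterates line by line.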

Engine

Based on our hands-on experience, we find that existing siRNA/dsRNA prediction tools are predominantly developed by companies using deep learning. While these tools achieve high accuracy for specific types of RNA, they suffer from several limitations: they are not open-source, have poor transferability, and pose high development barriers. These issues hinder their broad application and further development by the iGEM community and future teams.

To tackle this challenge, we took a different approach in our algorithm design by adopting a principle based on "empirical rules." We systematically reviewed extensive literature and code, transforming reported siRNA sequence features into computable rules and organizing them in a logical sequence to ultimately construct the "siRNA Design Engine." This rule-based principle grants the prediction system exceptional extensibility and flexibility. Based on the public template we provide, users can freely add or remove empirical rules by consulting the literature to meet their specific application needs. Imagine this engine as a castle made of building blocks, where each empirical rule is a single block—by rearranging these blocks, you can build customized castles with varied functions. Compared to the "black-box" nature of deep learning models, our method significantly enhances the interpretability of the results, ensuring a transparent and controllable prediction process.
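
The "building block" idea can be made concrete with a short sketch in which each empirical rule is a named, weighted check and the engine is simply a list of such rules. This is an illustration of the design, not the public template itself: the `Rule` type, the `evaluate` function, the two default rules, and their weights are all assumptions.

```python
from typing import Callable, NamedTuple


class Rule(NamedTuple):
    name: str                     # label shown in the per-rule report
    weight: float                 # contribution to the score when satisfied
    check: Callable[[str], bool]  # predicate applied to a candidate sequence


# Placeholder rules (assumptions); real rules would come from the literature.
DEFAULT_RULES = [
    Rule("ends_with_A_or_U", 1.0, lambda s: s[-1] in "AU"),
    Rule("au_rich_last_five", 1.0, lambda s: sum(b in "AU" for b in s[-5:]) >= 3),
]


def evaluate(seq, rules=DEFAULT_RULES):
    """Return the total score and a per-rule pass/fail report for seq."""
    report = {rule.name: rule.check(seq) for rule in rules}
    score = sum(rule.weight for rule in rules if report[rule.name])
    return score, report


# Adding a block: extend the rule list without touching the engine.
custom = Rule("starts_with_G", 0.5, lambda s: s.startswith("G"))
score, report = evaluate("GCAUGCAUGCAUGCAUAAUUA", DEFAULT_RULES + [custom])
print(score)   # 2.5
print(report)
```

Because every rule reports its own outcome, a user can see exactly why a candidate received its score, which is the interpretability advantage over a black-box model.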

Usage

The APHiGEM system can be deployed and used in two ways: through Docker containerization for rapid deployment, or through manual installation for customized environments. Below we provide detailed instructions for both deployment methods along with usage examples. For more operational details and development tools, please visit our GitLab repository.

Docker: Ensures that different users/labs can reproduce the results in the same environment, avoiding the issue where the application can run on the developer's computer but not on other machines^[8].

Docker containerization deployment

Manual deployment

Usage examples

Conclusion

According to our plan, the APHiGEM system has successfully addressed the aforementioned four key challenges, providing future researchers and scientists in related fields with an integrated dsRNA design platform that combines automated design, visual analysis, and off-target assessment.

However, the current system's off-target analysis still suffers from limitations in sample size. Ideally, alignment should be performed against the complete transcriptome of a given species. Yet, due to constraints in data availability, our database does not yet encompass sufficient transcriptomic information.

It is worth noting that most existing software in this field is costly and, due to technical limitations, largely closed-source, making it difficult for users to customize or improve the tools according to their specific needs. In contrast, APHiGEM not only offers robust support for the computer-aided design of fused dsRNA in terms of functionality but also, thanks to its open-source nature and highly extensible technical architecture, allows subsequent teams to easily use, modify, or further develop the system.

Contribution

Based on our software, future teams can utilize the source code and various script tools provided in our GitLab repository to expand the software in the following aspects:

Deep Learning Model Integration: Introduce deep learning models to predict siRNA efficacy, replacing or supplementing the existing rule-based scoring system.
Advanced Off-Target Effect Analysis: Enhance off-target prediction algorithms by considering additional factors such as seed region mismatches and imperfect pairings.
Thermodynamic Parameter Optimization: Adopt more accurate thermodynamic models to calculate RNA secondary structure and stability.
Multi-Objective Optimization Algorithm: Incorporate multi-objective optimization algorithms to balance multiple design goals such as efficacy, specificity, and stability.
Batch Design and Comparison: Support simultaneous design for multiple genes and enable result comparison.
Gene Editing Integration: Integrate with gene editing technologies (e.g., CRISPR) to provide combined design solutions.
Drug Delivery System Design: Add modules for designing siRNA delivery systems, such as liposomes or nanoparticles.
Experimental Validation Database: Establish a database of experimentally validated siRNA sequences and their efficacy data.
Automated Experimental Design: Automatically generate experimental protocols and citations based on design results.

References

[1] Fire, A., Xu, S., Montgomery, M. K., Kostas, S. A., Driver, S. E., & Mello, C. C. (1998). Potent and specific genetic interference by double-stranded RNA in Caenorhabditis elegans. Nature, 391(6669), 806–811. https://doi.org/10.1038/35888

[2] Vert, J. P., Foveau, N., Lajaunie, C., & Vandenbrouck, Y. (2006). An accurate and interpretable model for siRNA efficacy prediction. BMC Bioinformatics, 7, 520. https://doi.org/10.1186/1471-2105-7-520

[3] Bonnal, C., Lischetti, U., Pireddu, L., Silvestri, F., Talenti, A., D'Anastasio, E., Cestari, M., Primi, F., Bovo, S., Morandin, F., Lazzari, B., Taccioli, C., Tilesi, F., Agapito, G., Chillemi, G., Fioravanti, D., Pesole, G., & Zambelli, F. (2022). COVID-19 PubSeq: a public sequence platform for rapid epidemic response. BMC Bioinformatics, 23(1), 949. https://doi.org/10.1186/s12859-022-04924-3

[4] SantaLucia, J., Jr. (1998). A unified view of polymer, dumbbell, and oligonucleotide DNA nearest-neighbor thermodynamics. Proceedings of the National Academy of Sciences of the United States of America, 95(4), 1460–1465. https://doi.org/10.1073/pnas.95.4.1460

[5] Weber, J. L., & Myers, G. W. (1997). Human whole-genome shotgun sequencing. Genome Research, 7(5), 401–409. https://doi.org/10.1101/gr.7.5.401

[6] Chen, Y., Shi, Y., Wang, Z., An, X., Wei, S., Andronis, C., Vontas, J., Wang, J.-J., & Niu, J. (2025). dsRNAEngineer: A web-based tool of comprehensive dsRNA design for pest control. Trends in Biotechnology. Advance online publication. https://doi.org/10.1016/j.tibtech.2024.10.004

[7] The PostgreSQL Global Development Group. (2024). PostgreSQL 16.2 documentation. Retrieved March 28, 2024, from https://www.postgresql.org/docs/16/index.html

[8] Marçais, G., Pellow, D., Bork, D., Orenstein, Y., Shamir, R., & Kingsford, C. (2017). Improving the performance of minimizers and winnowing schemes. Bioinformatics, 33(14), i110–i117. https://doi.org/10.1093/bioinformatics/btx235

[9] Henkel, S., Bird, C., Lahiri, S. K., & Reps, T. (2020). A comprehensive study of Docker in the GitHub ecosystem. Proceedings of the 28th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering, 1155–1166. https://doi.org/10.1145/3368089.3409706