Software | MIT-MAHE

While siRNA-nanoparticle (siRNA-NP) formulations offer a promising approach in tackling pepper foot rot, our team realized that multiple factors are preventing the widespread adoption of this approach for crop protection:

The primary challenge in siRNA-NP delivery systems is finding the most suitable siRNA that will have a stable interaction with the nanoparticle delivering it.
The labour-intensive process of designing siRNA involves weeks or sometimes months of research and optimization.
Computational tools that can predict the stability of siRNA-NP complexes without wet lab experimentation are lacking.
Traditional approaches fail to account for the complex interdependencies among parameters such as Gibbs free energy (ΔG), entropy (ΔS), RMSD, and nonpolar and polar interactions, resulting in suboptimal formulations and higher clinical failure rates.

Our Solution

To address these hurdles, our team has developed a two-module software tool that combines automation and design to rapidly and reproducibly optimize siRNA-nanoparticle systems.

Designing our siRNA took us roughly two and a half months. Looking for a way to make this process easier, we decided to automate it and came across Selenium, a browser automation framework with Python bindings that is widely used for web scraping and website testing. It was a good fit for our objective: most of the design websites did not offer a clear API for requests, but the websites themselves were effortless to navigate.

Using a Selenium-based automation bot, we created a unified pipeline called siUltimate that links existing siRNA design platforms such as siDirect, siRNApred, and DuplexFold. Researchers simply input their gene sequence and automatically retrieve the 15 siRNAs best suited for their purpose. This approach is powerful because it feeds directly into the machine learning (ML) model as input.

The molecular descriptors that we evaluate from the siRNA-NP complex using docking and molecular dynamics simulations, such as binding free energy, root mean square deviation (RMSD), and solvent-accessible surface area, became the foundation for our machine learning model, which predicts the stability of a given complex.

The real innovation lies in combining automation through Selenium bots, detailed molecular modeling, and ML-based predictions. This represents a step towards building a comprehensive design platform to help researchers design better delivery kits for siRNA-based solutions.

Approach

The Stability Prediction Model (S.E.N.S.E)

Data Collection: We performed molecular dynamics (MD) and docking simulations with the siRNA-NP complex, obtaining energy values and trajectory data. These outputs were then formatted and used as inputs for the model.

Threshold and Analysis: The threshold values and our analysis of the MD simulation data were reviewed and validated against the literature.
Data Loading and Initial Cleaning: We started by loading the data obtained from the MD simulations and docking into a pandas DataFrame, a digital spreadsheet that can handle different types of data in each column. The original file had formatting issues; for example, the header information was scattered across multiple columns. To tackle this, we manually assigned proper column names so the data could be interpreted. We also removed rows that were not actual data entries, such as repeated headers, empty rows, and rows missing important information like FILE_NAME. During this process, we extracted the nanoparticle type ('lipid' or 'chitosan') from one of the columns and then dropped that temporary column, since it was no longer needed.
Feature Selection and Target Definition: From the available data columns, we selected the features most relevant for prediction, such as RMSD, Confidence Score, Docking Score, H-Bonding, van der Waals Energy, Total Energy, Lipophilic, Metal Coordination, Rotatable Bond Energy, and Internal Strain. These features, drawn from our HDOCK and Glide docking results, served as the input variables for the model.
Data Preprocessing: Before training the model, we prepared the data. We first checked for missing values (NaN entries) in our selected features and target variable; none were observed in this dataset. We then applied Standard Scaling to the features and the target variable. This transforms the data to have a mean of 0 and a standard deviation of 1, which is essential because some machine learning models (like Lasso) are sensitive to scale and can be misled when one feature has much larger numbers than another.
Outlier Identification and Removal: On analyzing and visualizing the data (using the SHAP summary plot), we found a data point with an unusually high parameter value that significantly affected the output. This outlier was therefore removed from the dataset.
Model Training: XGBoost iteratively builds an ensemble of decision trees. In each iteration, a new tree is trained to predict the residual errors from the predictions of the previous trees. The predictions of this new tree are then scaled by a learning rate and added to the ensemble's prediction. The objective function being minimized here is the mean squared error, aiming to reduce the error between the predicted and actual docking scores. The n_estimators parameter decides the number of trees in the ensemble (set to 100), and learning_rate (set to 0.1) shrinks the contribution of each tree. In this way, the model learns a complex, non-linear relationship between the input features and the docking score.
Model Interpretation: We loaded docking results from HDOCK, which include parameters like binding rank, confidence scores, docking scores, and the critical target variable 'Ligand RMSD', a measure of structural deviation where lower values indicate better binding stability. We then coded our system to load siRNA sequences from a separate file and link them to their docking results through unique identifiers. To make the sequence data usable for machine learning, each siRNA sequence is converted into a binary numerical format that captures the nucleotide patterns. These encoded sequences are then combined with the docking parameters to create a complete feature set for model training. After the features were preprocessed and normalized, the model predicted Ligand RMSD values based on both the sequence characteristics and the docking parameters.
Model Evaluation: We used two standard metrics during cross-validation to measure our model's performance. RMSE (Root Mean Squared Error) tells us how far our predictions are from the actual values, so lower numbers are better. R-squared measures the proportion of variation in the target variable that the model can explain, with values closer to 1.0 indicating better performance and 1.0 representing perfect predictions. We calculated the mean and standard deviation of these metrics across all cross-validation rounds to assess the model's performance robustly. The model captures nearly 75% of the variance in the docking score data (R² ≈ 0.7459), which indicates good predictive capability. Together, the iterations confirmed that both sequence-level features (like GC and AU content, dinucleotide frequencies, and entropy) and nanoparticle properties contribute to the predictions. The evaluation also highlighted that while non-linear models like XGBoost are better at handling the complexity of biological interactions, linear models like Lasso remain essential for understanding which parameters drive the predictions.
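The scaling, residual-fitting, and evaluation steps described above can be illustrated with a small pure-Python sketch. This is a toy illustration under stated assumptions, not XGBoost itself and not our production code: it uses a single feature, one-split trees ("stumps"), and made-up numbers, and it reports metrics on the training data rather than across cross-validation folds. All function names (standardize, fit_stump, boost, and so on) are hypothetical. The n_estimators and learning_rate arguments mirror the settings mentioned under Model Training.

```python
import math
from statistics import mean, pstdev


def standardize(values):
    """Rescale values to mean 0 and standard deviation 1 (z-scores)."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]


def fit_stump(x, residuals):
    """Find the one-split tree on x that best fits the residuals (least squares).

    Assumes x holds at least two distinct values.
    """
    best = None
    for threshold in sorted(set(x))[:-1]:  # splits that leave both sides non-empty
        left = [r for xi, r in zip(x, residuals) if xi <= threshold]
        right = [r for xi, r in zip(x, residuals) if xi > threshold]
        left_mean, right_mean = mean(left), mean(right)
        sse = (sum((r - left_mean) ** 2 for r in left)
               + sum((r - right_mean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    return best[1:]  # (threshold, left_mean, right_mean)


def predict_stump(stump, xi):
    threshold, left_mean, right_mean = stump
    return left_mean if xi <= threshold else right_mean


def boost(x, y, n_estimators=100, learning_rate=0.1):
    """Fit stumps one after another, each on the residuals left by the ensemble so far."""
    base = mean(y)  # the ensemble starts from the mean of the target
    predictions = [base] * len(y)
    stumps = []
    for _ in range(n_estimators):
        residuals = [yi - pi for yi, pi in zip(y, predictions)]
        stump = fit_stump(x, residuals)
        stumps.append(stump)
        # Each new tree's output is shrunk by the learning rate before being added.
        predictions = [pi + learning_rate * predict_stump(stump, xi)
                       for xi, pi in zip(x, predictions)]
    return base, learning_rate, stumps


def predict(model, xi):
    base, learning_rate, stumps = model
    return base + learning_rate * sum(predict_stump(s, xi) for s in stumps)


def rmse(y_true, y_pred):
    return math.sqrt(mean((t - p) ** 2 for t, p in zip(y_true, y_pred)))


def r_squared(y_true, y_pred):
    y_bar = mean(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - y_bar) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


# Made-up numbers for illustration only; not taken from our dataset.
feature = standardize([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0])
target = standardize([1.0, 1.2, 0.9, 1.1, 3.0, 3.2, 2.9, 3.1])

model = boost(feature, target, n_estimators=100, learning_rate=0.1)
fitted = [predict(model, xi) for xi in feature]
print(f"training RMSE = {rmse(target, fitted):.3f}, R^2 = {r_squared(target, fitted):.3f}")
```

A real XGBoost model additionally uses regularization and multi-feature trees, and an honest performance estimate requires scoring on held-out folds as described under Model Evaluation.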

siRNA Design Automation (siUltimate)

Inspiration: The process of siRNA design was challenging and time-consuming, and it requires extensive knowledge of the suitable tools and methods. Our own siRNA design spanned over two months, highlighting the need for a more streamlined and efficient method (Reynolds et al., 2004; Ui-Tei et al., 2004).
Search for suitable tools: In our search for ways to automate this process, we came across Selenium, a browser automation framework with Python bindings that is widely used for web scraping and website testing. It suited our purposes well: most of the design websites did not offer a clear API for requests, but the websites themselves were straightforward to navigate.
How Selenium works: Selenium automates website testing by interacting with websites in a human-like manner. Elements on a webpage can be selected using CSS selectors, making it intuitive for anyone with web development experience. Interaction with the selected elements is then performed by sending keyboard input and clicks. Using this approach, our software automated interactions with siDirect, siRNApred, and MaxExpect (Amarzguioui & Prydz, 2004).
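The select-then-interact pattern described above can be sketched as follows. This is a minimal illustration, not the siUltimate source: the function names, the URL, and the CSS selectors in the usage note are placeholders, and the real design websites each need their own selectors and usually explicit waits for results to load. The constant CSS holds the same string value as Selenium's By.CSS_SELECTOR, and Selenium itself is imported only inside the function that launches a browser.

```python
CSS = "css selector"  # same string value as selenium.webdriver.common.by.By.CSS_SELECTOR


def submit_sequence(driver, url, sequence, input_css, submit_css):
    """Open a page, type the sequence into an input box, and click submit.

    `driver` is any Selenium WebDriver; the two CSS selectors are supplied by
    the caller because they differ for every website.
    """
    driver.get(url)
    box = driver.find_element(CSS, input_css)  # select the element by CSS selector
    box.clear()
    box.send_keys(sequence)                    # keyboard input, as a user would type
    driver.find_element(CSS, submit_css).click()


def run_in_firefox(url, sequence, input_css, submit_css):
    """Drive a real browser; requires the selenium package and a Firefox install."""
    from selenium import webdriver  # imported lazily so submit_sequence stays importable

    driver = webdriver.Firefox()
    try:
        submit_sequence(driver, url, sequence, input_css, submit_css)
        return driver.page_source  # raw HTML of whatever page is shown after submitting
    finally:
        driver.quit()  # always close the browser, even if a step fails
```

A call might look like run_in_firefox("https://example.org/design", "AUGGCU...", "textarea#sequence", "input[type=submit]"), where every argument is a made-up placeholder to be replaced with the values for the actual tool.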

The results and analysis of our model are explained in detail on our Model page.

User Manual

Go to our software repository for instructions on building and opening the application.
Enter the target gene sequence in the rectangular input box. It can be either in FASTA format or a single continuous string. You can enter either the target mRNA sequence or the target gene sequence (enter exons only).
Select a nanoparticle from the dropdown list, or choose “None” if you only wish to design an siRNA sequence.

Click on the “Design siRNA” button.
After submission, you will see two options:

Check Status
Submit Another Job

Clicking the "Check Status" button opens a screen with a progress bar that shows the current progress of the job.

Once the job is completed, a "View Output" button will appear.

Fig 7. Job status page on job completion

On clicking the button, the output page containing the results table will display:

A list of siRNA sequences is generated along with nanoparticle docking score if a nanoparticle was selected.
A list of top-ranked siRNA sequences is displayed if the “None” option was selected.

Additionally, researchers who wish to share experimental results and nanoparticle data to help us build a better model can contact the team through the email ID provided on the homepage (softwareigemmitmahe@gmail.com).

How is this Software Useful?

This software provides a powerful, time-saving, and predictive platform for designing siRNAs best suited to the chosen nanoparticle systems. Here is why it is helpful for researchers:

Rapid siRNA Selection

Gene sequences are automatically screened using siRNA design platforms (siDirect, siRNApred, DuplexFold) through Selenium automation.
Saves weeks of manual work, providing the top 15 candidate siRNAs instantly.

Predicts Stability Without Heavy Computation

Predicts the docking score without running a new docking simulation, helping you identify the siRNA best suited to the chosen nanoparticle and the required application.
The model utilizes previously collected docking data (HDOCK, Glide) and molecular descriptors (ΔG, RMSD, H-bonding, van der Waals, solvent-accessible surface area, etc.).
ML models (XGBoost, Random Forest, Lasso) predict interactions of siRNA-NP complexes without performing actual docking.

Feature-Rich and Accurate

Uses sequence-specific features (AU content, GC content, dinucleotide counts, sequence entropy) together with nanoparticle type (chitosan vs. lipid).
Explains about 75% of the variance in docking scores (R² ≈ 0.7459), giving confidence in predictions.
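As a rough illustration of the sequence-specific features listed above, the sketch below computes them for a single siRNA string. It rests on a few assumptions that may differ from our actual implementation: sequences are written in the RNA alphabet (T is converted to U), "sequence entropy" is taken to mean Shannon entropy of the base composition, and the one_hot helper shows one common reading of the binary sequence format mentioned under Model Interpretation. All function names are hypothetical.

```python
import math
from collections import Counter

BASES = "ACGU"


def normalize(seq):
    """Upper-case the sequence and write it in the RNA alphabet (T -> U)."""
    return seq.upper().replace("T", "U")


def gc_content(seq):
    """Fraction of bases that are G or C."""
    seq = normalize(seq)
    return (seq.count("G") + seq.count("C")) / len(seq)


def au_content(seq):
    """Fraction of bases that are A or U."""
    seq = normalize(seq)
    return (seq.count("A") + seq.count("U")) / len(seq)


def dinucleotide_counts(seq):
    """Count overlapping two-base windows, e.g. AUAU gives AU: 2, UA: 1."""
    seq = normalize(seq)
    return Counter(seq[i:i + 2] for i in range(len(seq) - 1))


def shannon_entropy(seq):
    """Shannon entropy (in bits) of the base composition."""
    seq = normalize(seq)
    return -sum((n / len(seq)) * math.log2(n / len(seq))
                for n in Counter(seq).values())


def one_hot(seq):
    """Flatten the sequence into 4 binary values per position, ordered A, C, G, U."""
    return [int(base == b) for base in normalize(seq) for b in BASES]


# Made-up 21-nt sequence for illustration only; not one of our designed siRNAs.
example = "GCAUGGACUUAGCCAUAGCUU"
print("GC:", round(gc_content(example), 3), "AU:", round(au_content(example), 3))
print("entropy (bits):", round(shannon_entropy(example), 3))
print("most common dinucleotides:", dinucleotide_counts(example).most_common(3))
print("one-hot length:", len(one_hot(example)))
```

Each function returns a plain number, Counter, or list, so the values can be placed side by side with the docking parameters and the nanoparticle type to form one feature row per siRNA.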

Integration & Reproducibility

Combines automation, docking, molecular descriptors, and ML prediction in one streamlined pipeline.
Reproducible and user-friendly for researchers unfamiliar with coding.

Future Scope

In the future, collaborations with labs can play a crucial role in this process. Working with standardized parameters and results from different docking software will clarify the key factors influencing siRNA–nanoparticle stability. Integrating computational modelling with experimental validation will allow us to refine our model and ensure that it is both experimentally and theoretically robust. We also plan to use larger training datasets to enhance the model's reliability and predictive power, and to expand the number of nanoparticles studied. Together, these steps will increase the model's applicability across different systems and help us create a computational tool that is both accurate and practical.

References

Amarzguioui, M., & Prydz, H. (2004). An algorithm for selection of functional siRNA sequences. Biochemical and Biophysical Research Communications, 316(4), 1050–1058. https://doi.org/10.1016/j.bbrc.2004.02.157

Reynolds, A., Leake, D., Boese, Q., Scaringe, S., Marshall, W. S., & Khvorova, A. (2004). Rational siRNA design for RNA interference. Nature Biotechnology, 22(3), 326–330. https://doi.org/10.1038/nbt936

Ui-Tei, K., Naito, Y., Takahashi, F., Haraguchi, T., Ohki-Hamazaki, H., Juni, A., Ueda, R., & Saigo, K. (2004). Guidelines for selecting highly effective siRNA sequences for mammalian and chick RNA interference. Nucleic Acids Research, 32(3), 936–948. https://doi.org/10.1093/nar/gkh247