Software | Thessaloniki

Overview

Machine learning is transforming synthetic biology by enabling predictive design to replace traditional trial-and-error approaches. Advanced and efficient machine learning models can generate accurate predictions that accelerate laboratory workflows, advance research, and reduce experimental costs. However, model performance critically depends on training data quality, making extensive preprocessing essential to ensure models learn from rich, high-quality datasets.

In siREN, our machine learning models provided initial predictions of candidate siRNA silencing efficacy before experimental validation. To support model development, we created siRBench, a harmonized dataset containing more than 4000 siRNAs with experimentally measured silencing efficiencies and complementary thermodynamic and structural features.

siRBench Dataset

Creating a high-quality dataset was essential before developing machine learning models. While multiple public siRNA sequence datasets exist, direct comparison reveals significant challenges:

Inconsistent efficacy reporting: Silencing efficiencies are experimentally measured using different assays, protocols, and cell lines, and are often reported on different scales.
Heterogeneous sequence formats: siRNAs vary in length (commonly 19 nt or 21 nt) and orientation (5’-3’ or 3’-5’), complicating direct integration.
Data inconsistencies: Several datasets contain minor errors that undermine reproducibility.

As iGEM Thessaloniki 2025, we created siRBench, the first harmonized training and benchmarking dataset integrating siRNA data at scale. The siRBench dataset is available on our GitLab Repository.

Diagram 1: siRBench Dataset Creation Pipeline

Gathering and Preprocessing Data

We established clear specifications for siRBench to guide our dataset creation pipeline. For each siRNA sequence, we defined the following required fields:

The siRNA antisense strand: 19 nt length, without overhangs, in 5'-3' orientation
Exact target mRNA site: 19 nt sequence
Extended target mRNA site: 57 nt sequence containing the 19 nt target site with 19 nt flanking regions on each side
Efficacy score: Normalized value from 0 (ineffective) to 1 (highly effective) representing silencing performance
Data source: The source publication of each entry (one of 11 public sources)
Cell line: The experimental cell system used for efficacy measurements
Engineered features: Computed properties for sequence analysis

We enforced the following standardization rules across all sequences in the dataset:

Sequence uniformity: All siRNA antisense strands are 19 nt in length, oriented 5'→3', and reverse-complement to their corresponding exact target mRNA sites
Efficacy normalization: All efficacy values are normalized to a 0–1 scale, representing the fraction of target mRNA knocked down (0 = no silencing, 1 = complete silencing)
Uniqueness: Each siRNA-target mRNA pair appears only once in the dataset
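
The rules above lend themselves to an automated check. The following is a minimal sketch, not the pipeline used to build siRBench: it assumes sequences are stored in the RNA alphabet (A, C, G, U), and the function names `check_record` and `find_repeated_pairs` are hypothetical.

```python
# Assumption: sequences use the RNA alphabet; a DNA-alphabet file would need T -> U first.
COMPLEMENT = str.maketrans("ACGU", "UGCA")


def reverse_complement(rna: str) -> str:
    """Return the reverse complement of an RNA sequence, read 5'->3'."""
    return rna.translate(COMPLEMENT)[::-1]


def check_record(antisense: str, target_site: str, efficacy: float) -> list:
    """Return the list of rule violations for one record (empty if it passes)."""
    problems = []
    if len(antisense) != 19 or len(target_site) != 19:
        problems.append("length")
    if reverse_complement(target_site) != antisense:
        problems.append("not reverse-complementary")
    if not 0.0 <= efficacy <= 1.0:
        problems.append("efficacy out of range")
    return problems


def find_repeated_pairs(records):
    """Return every (antisense, target_site) pair that appears more than once."""
    seen, repeated = set(), []
    for antisense, target_site, _efficacy in records:
        key = (antisense, target_site)
        if key in seen:
            repeated.append(key)
        seen.add(key)
    return repeated
```

A record passes when `check_record` returns an empty list, and the dataset satisfies the uniqueness rule when `find_repeated_pairs` returns an empty list.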

The final siRBench dataset comprises 4098 siRNA antisense sequences from 11 different source publications: Huesken ¹, Takayuki ², Shabalina ³, Simone ⁴, Amarzguioui ⁵, Harborth ⁶, Hsieh ⁷, Khvorova ⁸, Reynolds ⁹, Vickers ¹⁰, and Ui-Tei ¹¹. Note that Mixset is an aggregated dataset containing sequences from seven of these sources (Amarzguioui, Harborth, Hsieh, Khvorova, Reynolds, Vickers, and Ui-Tei), which we separated back into their original publications for proper cell line annotation.

The sections below document the preprocessing applied to each source dataset.

We obtained the Huesken, Takayuki, and Mixset data through previous publications on siRNA efficacy prediction, which provided .csv files containing the siRNA antisense strand, the extended target mRNA sequence, and the siRNA efficacy value.

Sequence orientation standardization: We extracted the central 19 nt target mRNA sequence and determined strand orientation by checking whether the siRNA was the complement (3'-5') or the reverse complement (5'-3') of the target site. Huesken and Mixset sequences were already in 5'-3' orientation, while Takayuki sequences were 3'-5' and required reversal before integration into siRBench.
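
The orientation check described above can be sketched as follows. This is an illustration under stated assumptions rather than our exact script: sequences are assumed to be in the RNA alphabet, the 19 nt site is assumed to sit at positions 20 to 38 of the 57 nt extended sequence (19 nt flanks on each side), and all function names are hypothetical.

```python
# Assumption: RNA alphabet (A, C, G, U).
COMPLEMENT = str.maketrans("ACGU", "UGCA")


def central_site(extended_target: str, flank: int = 19, site_len: int = 19) -> str:
    """Extract the central target site from the extended target sequence."""
    return extended_target[flank:flank + site_len]


def classify_orientation(sirna: str, target_site: str):
    """Return "5'-3'", "3'-5'", or None if the siRNA matches neither reading."""
    complement = target_site.translate(COMPLEMENT)
    if sirna == complement[::-1]:
        return "5'-3'"  # reverse complement of the target site
    if sirna == complement:
        return "3'-5'"  # plain complement, i.e. written in reverse
    return None


def standardize(sirna: str, target_site: str) -> str:
    """Return the siRNA in 5'-3' orientation, reversing it when necessary."""
    orientation = classify_orientation(sirna, target_site)
    if orientation is None:
        raise ValueError("siRNA is not complementary to the target site")
    return sirna if orientation == "5'-3'" else sirna[::-1]
```

Raising an error for non-complementary pairs is a design choice of this sketch; it makes mismatched entries visible instead of passing them through silently.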

Efficacy normalization: All efficacy values from these sources were already normalized to the [0,1] range, where higher values indicate stronger silencing activity.

Cell line annotation: Cell lines were H1299 for Huesken and HeLa for Takayuki. Since the Mixset dataset aggregated sequences from 7 different sources, we traced each sequence back to its original publication to assign the appropriate experimental cell line for each entry.

The siRNA dataset from Shabalina et al. comprised 653 antisense strand sequences with reported gene-silencing activity values, GenBank accession numbers for target mRNAs and target site coordinates. Unlike other sources, efficacy values were reported as non-negative real numbers where 0 indicated complete knockdown and higher values represented weaker silencing, the inverse of our desired scale.

We used the provided GenBank accession numbers and genomic coordinates to retrieve extended target mRNA sequences. However, we discovered that siRNA sequences were not always fully complementary to the reported target regions. To resolve this, we downloaded complete target mRNA sequences and performed exhaustive searches to identify exact complementary sites for each siRNA. This approach successfully located complementary targets for 650 of 653 siRNAs; the remaining 3 were excluded from siRBench.
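
An exhaustive search of this kind can be sketched in a few lines. The example below is an assumption-laden illustration, not our production code: the function name `find_target_sites` is hypothetical, the mRNA is converted from the DNA alphabet to RNA before matching, and sites too close to a transcript end to carry full 19 nt flanks are reported without an extended window (how such cases were handled in siRBench is not specified here).

```python
COMPLEMENT = str.maketrans("ACGU", "UGCA")


def to_rna(seq: str) -> str:
    """Upper-case a sequence and convert the DNA alphabet to RNA."""
    return seq.upper().replace("T", "U")


def reverse_complement(rna: str) -> str:
    return rna.translate(COMPLEMENT)[::-1]


def find_target_sites(antisense: str, mrna: str, flank: int = 19):
    """Find every exact complementary site of an antisense strand in an mRNA.

    Returns a list of (start, extended_window) tuples, where start is the
    0-based site position and extended_window is the site plus `flank`
    nucleotides on each side, or None when the flanks do not fit.
    """
    rna = to_rna(mrna)
    site = reverse_complement(to_rna(antisense))
    hits = []
    start = rna.find(site)
    while start != -1:
        lo, hi = start - flank, start + len(site) + flank
        window = rna[lo:hi] if lo >= 0 and hi <= len(rna) else None
        hits.append((start, window))
        start = rna.find(site, start + 1)  # continue, allowing overlapping hits
    return hits
```

An empty result corresponds to the siRNAs for which no complementary site could be located.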

We first normalized activity values to the [0,1] range. To align with other datasets where higher values indicate stronger silencing, we applied the transformation 1−α for each normalized activity value α. This ensured that all siRBench efficacy scores follow the same interpretation: 0 represents no silencing and 1 represents complete knockdown.
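
As a worked illustration of this transformation, the sketch below assumes min-max scaling for the normalization step; the text above does not state which normalization was used, so treat that choice and the function name `invert_and_normalize` as assumptions.

```python
def invert_and_normalize(activities):
    """Map raw activities (0 = complete knockdown) to efficacies in [0, 1].

    Assumption: min-max scaling. Each value is scaled to alpha in [0, 1]
    and then flipped with 1 - alpha, so that 1 means complete knockdown.
    """
    lo, hi = min(activities), max(activities)
    if hi == lo:
        raise ValueError("cannot normalize a constant list of activities")
    return [1.0 - (a - lo) / (hi - lo) for a in activities]
```

For example, raw activities of 0.0, 0.5, and 2.0 map to efficacies of 1.0, 0.75, and 0.0.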

All Shabalina sequences were in 5'-3' orientation, and experiments were conducted in HeLa cells.

The Simone dataset comprises 322 siRNA sequences obtained from the siRNADiscovery GitHub repository. The data was provided in three separate files: two FASTA files containing the siRNA sequences and the full-length target mRNA sequences, and one CSV file mapping siRNA-target pairs (via FASTA IDs) to efficacy values. We merged these files into a single CSV file containing three columns: the siRNA sequence, the extended target mRNA site, and the efficacy value. All siRNA sequences from the Simone dataset were 21 nt in length. To maintain consistency with our 19 nt standard, we trimmed the last 2 nt of each siRNA sequence.

The efficacy values were already normalized to the [0,1] range with higher values indicating stronger silencing, matching siRBench specifications. All sequences were in 5'-3' orientation, and experiments were conducted in Hep3B cells.
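
The merge and trimming steps can be sketched as below. The column names `sirna_id`, `mrna_id`, and `efficacy` are hypothetical, since the actual headers of the source files are not reproduced here, and locating the extended target site within the full-length mRNA is left out of this sketch.

```python
import csv


def read_fasta(lines):
    """Parse FASTA text (any iterable of lines) into {record_id: sequence}."""
    records, name = {}, None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:].split()[0]  # the ID is the first token of the header
            records[name] = ""
        elif name is not None:
            records[name] += line
    return records


def merge_pairs(sirna_fasta, mrna_fasta, pairs_csv, keep: int = 19):
    """Join siRNA and mRNA sequences to efficacy values via their FASTA IDs."""
    sirnas = read_fasta(sirna_fasta)
    mrnas = read_fasta(mrna_fasta)
    rows = []
    for row in csv.DictReader(pairs_csv):
        rows.append({
            "sirna": sirnas[row["sirna_id"]][:keep],  # drop the last 2 nt of a 21-mer
            "mrna": mrnas[row["mrna_id"]],
            "efficacy": float(row["efficacy"]),
        })
    return rows
```

Each argument can be an open file handle or a list of lines, which keeps the functions easy to try on small inputs.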

Duplicate removal was the final essential step before finalizing our dataset. We defined duplicates as siRNA pairs sharing identical antisense strands and extended target mRNA sites.

We first examined each dataset individually. The Huesken, Takayuki and Simone datasets contained no duplicates. The Shabalina dataset contained 3 duplicate siRNAs with identical efficacy scores; we retained one sequence from each pair. The Mixset dataset contained 8 duplicate pairs that came from different source studies and carried conflicting efficacy scores. For these cases, we retained the sequence from the less-represented source to maintain dataset balance, excluding 4 additional siRNAs from siRBench.

We identified 398 duplicate pairs between Mixset and Shabalina with conflicting efficacy values due to different experimental conditions and cell lines. We retained the Mixset-derived efficacy values, as this dataset is more widely used as a benchmark in siRNA prediction studies.
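
The cross-dataset rule can be sketched as a priority-based deduplication. This is a simplified illustration: the record keys (`antisense`, `extended_target`, `source`) and the `priority` mapping are hypothetical names, and the within-Mixset rule (keeping the less-represented source) would need a priority derived from per-source counts instead.

```python
def deduplicate(records, priority):
    """Keep one record per (antisense, extended target) pair.

    `priority` maps a source name to a rank; when two records collide,
    the one whose source has the lower rank is kept.
    """
    best = {}
    for rec in records:
        key = (rec["antisense"], rec["extended_target"])
        kept = best.get(key)
        if kept is None or priority[rec["source"]] < priority[kept["source"]]:
            best[key] = rec
    return list(best.values())
```

With `priority = {"Mixset": 0, "Shabalina": 1}`, a pair present in both datasets keeps its Mixset-derived efficacy value.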

This comprehensive preprocessing procedure yielded a balanced, normalized and high-quality training dataset maximizing the number of unique, reliable siRNA sequences.

Feature Engineering

Feature engineering transforms raw data into numerical representations that machine learning models can process. For siRBench, we converted each siRNA-target mRNA pair into a feature set capturing sequence composition, structural properties, and thermodynamic characteristics.

Machine learning algorithms cannot directly interpret nucleotide sequences, which are categorical rather than numerical data. Instead, models learn from quantitative features such as nucleotide frequencies, position-specific k-mers, predicted secondary structures, duplex stability (ΔG), and other sequence-derived properties. By pre-calculating these features, we enabled our models to identify motifs that distinguish highly effective siRNAs from less effective ones.
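
To make the idea concrete, the sketch below turns a sequence into two of the simpler feature types mentioned above: nucleotide frequencies and position-specific k-mers. It is an illustration only; the feature names are hypothetical and these are not the siRBench feature definitions.

```python
from collections import Counter


def composition_features(sirna: str) -> dict:
    """Nucleotide frequencies and GC content of an RNA sequence."""
    counts = Counter(sirna)
    n = len(sirna)
    feats = {f"freq_{base}": counts.get(base, 0) / n for base in "ACGU"}
    feats["gc_content"] = feats["freq_G"] + feats["freq_C"]
    return feats


def positional_kmers(sirna: str, k: int = 2) -> dict:
    """One indicator feature per (position, k-mer) observed in the sequence."""
    return {f"pos{i + 1}_{sirna[i:i + k]}": 1 for i in range(len(sirna) - k + 1)}
```

Both functions return plain dictionaries, so their outputs can be merged into a single feature row per siRNA.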

Feature engineering was particularly critical for integrating heterogeneous data from multiple public datasets. It provided a unified numerical framework for comparing siRNAs measured under different experimental conditions and cell lines. The quality and diversity of the engineered features directly impact model accuracy and generalizability, making feature engineering a central pillar of reliable siRNA efficacy prediction.

Through extensive literature review, we identified 100 thermodynamic and structural features to enrich our training dataset. All feature values are rounded to three decimal places. The list below describes these features and their biological significance.

Thermodynamic Features

ends: Terminal asymmetry metric comparing 5′ and 3′ end stability of the siRNA duplex
DG_total: Total Gibbs free energy (ΔG) of duplex formation calculated from nearest-neighbor parameters ¹²
DH_total: Total enthalpy change (ΔH, kcal/mol) of duplex formation from nearest-neighbor parameters ¹²
DG_pos1..18: Per-step stacking ΔG along the duplex at each of the 18 nearest-neighbor positions in the 19-bp region
DH_pos1..18: Per-step stacking ΔH at corresponding positions
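
The per-step stacking features can be illustrated with a nearest-neighbor lookup. In this sketch the ΔG values (kcal/mol, 37 °C) are the commonly cited Watson-Crick stacking parameters and should be checked against reference ¹² before use. Reading the steps along one strand 5'→3' and omitting initiation and terminal penalties are assumptions of this example, not a statement of how DG_total is computed in siRBench.

```python
# Stacking free energies per dinucleotide step, read 5'->3' on one strand of a
# Watson-Crick duplex. Assumption: values to be verified against reference 12.
NN_DG = {
    "AA": -0.93, "UU": -0.93, "AU": -1.10, "UA": -1.33,
    "CU": -2.08, "AG": -2.08, "CA": -2.11, "UG": -2.11,
    "GU": -2.24, "AC": -2.24, "GA": -2.35, "UC": -2.35,
    "CG": -2.36, "GG": -3.26, "CC": -3.26, "GC": -3.42,
}


def stacking_profile(strand: str, table: dict = NN_DG) -> list:
    """Per-step stacking energies: a 19 nt strand yields 18 values."""
    return [table[strand[i:i + 2]] for i in range(len(strand) - 1)]


def stacking_total(strand: str, table: dict = NN_DG) -> float:
    """Sum of the stacking steps, rounded to three decimal places."""
    return round(sum(stacking_profile(strand, table)), 3)
```

Passing a different `table` (for example, enthalpy values) gives the analogous per-step ΔH profile.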

Structural Constraint Energies (RNAfold/RNAcofold)

single_energy_total: Minimum free energy (MFE) of the isolated guide strand (RNAfold)
single_energy_pos1..19: Position-specific accessibility of each nucleotide in the guide strand (RNAfold)
duplex_energy_total: Hybridization ΔG for the guide-target duplex (RNAcofold)
duplex_energy_sirna_pos1..19: Per-nucleotide contribution of the guide strand to duplex energy at each aligned position (RNAcofold)
duplex_energy_target_pos1..19: Per-nucleotide contribution of the target site to duplex energy (RNAcofold)

RNAup Accessibility Features

RNAup_open_dG: Energy cost to unpair interacting regions on both guide and target strands prior to binding
RNAup_interaction_dG: Hybridization energy gained upon binding of unpaired regions

Proposed Training and Testing Sets

Before model development, we strategically split siRBench into training and test sets based on experimental cell line rather than using random splitting methods like scikit-learn's train_test_split function. This cell line-based approach ensures the model is trained and evaluated on different biological contexts, preventing information leakage and better reflecting real-world scenarios where predictions must generalize to new cellular environments. The training set comprises 3,408 siRNAs with efficacy measurements from H1299 and HeLa cell lines, while the test set contains 690 siRNAs from all other cell lines. The training set was further randomly divided into training (90%) and validation (10%) subsets.
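
A split of this kind can be sketched with the standard library alone. The record key `cell_line`, the function name, and the fixed seed are assumptions made for the example; only the cell line rule and the 90/10 proportions come from the description above.

```python
import random

TRAIN_CELL_LINES = {"H1299", "HeLa"}


def split_by_cell_line(records, val_fraction: float = 0.1, seed: int = 0):
    """Split records into train, validation, and test sets.

    Records measured in the training cell lines are shuffled and divided
    into train/validation; all other cell lines form the test set.
    """
    train_pool = [r for r in records if r["cell_line"] in TRAIN_CELL_LINES]
    test = [r for r in records if r["cell_line"] not in TRAIN_CELL_LINES]
    rng = random.Random(seed)  # assumption: fixed seed for reproducibility
    rng.shuffle(train_pool)
    n_val = round(len(train_pool) * val_fraction)
    return train_pool[n_val:], train_pool[:n_val], test
```

Because the test set is defined by cell line before any shuffling, no test measurement can end up in the training or validation subsets.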

Machine Learning Models

With siRBench finalized, we developed machine learning models to predict siRNA silencing efficacy. Our model development followed iterative engineering cycles using the Design-Build-Test-Learn framework (see our Engineering Page).

We explored multiple machine learning approaches and ultimately implemented two models with distinct architectures, both fully documented in our GitLab Repository.

Diagram 2: Machine Learning Models Development Pipeline

LightGBM-Based Model

We sought an advanced machine learning architecture suited to our task's characteristics: multiple engineered features, a medium-sized dataset, and potentially complex non-linear relationships. After literature review, we selected LightGBM (Light Gradient Boosting Machine) ¹³, an efficient implementation of gradient boosting decision trees.

Unlike linear models, LightGBM captures non-linear feature interactions, where silencing depends on the complex interplay of thermodynamic stability, accessibility, and positional effects rather than simple linear rules. LightGBM builds an ensemble of decision trees sequentially, with each tree correcting errors from previous ones. Its leaf-wise growth strategy expands branches that maximize error reduction, enabling the model to learn highly specific patterns. Efficiency optimizations including histogram-based feature binning, gradient-based sampling, and exclusive feature bundling make it faster and more memory-efficient than other gradient boosting implementations.
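
The sequential error-correcting idea can be shown with a toy example. The sketch below is not LightGBM: it boosts one-split decision stumps on a single feature with squared-error loss, and it has none of the leaf-wise growth, histogram binning, sampling, or feature bundling described above. All names are hypothetical.

```python
def fit_stump(x, residuals):
    """Best single-threshold split of a 1-D feature (needs >= 2 distinct x values)."""
    best = None
    for threshold in sorted(set(x))[:-1]:
        left = [r for xi, r in zip(x, residuals) if xi <= threshold]
        right = [r for xi, r in zip(x, residuals) if xi > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((r - left_mean) ** 2 for r in left)
               + sum((r - right_mean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    return best[1:]


def boost(x, y, n_trees=50, learning_rate=0.1):
    """Fit stumps sequentially, each one to the residuals of the current model."""
    base = sum(y) / len(y)
    preds = [base] * len(y)
    stumps = []
    for _ in range(n_trees):
        residuals = [yi - pi for yi, pi in zip(y, preds)]
        threshold, left, right = fit_stump(x, residuals)
        stumps.append((threshold, left, right))
        preds = [p + learning_rate * (left if xi <= threshold else right)
                 for xi, p in zip(x, preds)]
    return (base, learning_rate, stumps), preds


def predict(model, xi):
    """Predict one value by adding up the scaled outputs of all stumps."""
    base, learning_rate, stumps = model
    p = base
    for threshold, left, right in stumps:
        p += learning_rate * (left if xi <= threshold else right)
    return p


def mse(y, preds):
    return sum((yi - pi) ** 2 for yi, pi in zip(y, preds)) / len(y)
```

On a non-linear target such as y = x², the training error falls as stumps are added, which is the behavior the ensemble relies on.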

The advantages for our dataset are:

Efficiently handles medium-sized datasets (~4,100 siRNAs) without requiring the massive training sets needed for deep learning.
Naturally processes correlated and position-specific features (DG_pos1..18, DH_pos1..18, duplex energies) to identify non-linear patterns that simpler models cannot capture.

We evaluated performance using Mean Squared Error (MSE) and Pearson correlation coefficient (R) to measure alignment between predicted and actual silencing efficacies.

Metric	LightGBM Model
Mean Squared Error (MSE)	0.0529
Pearson Coefficient (R)	0.5454
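
For reference, both metrics can be computed with the standard library. This is a generic sketch of the two formulas, not the evaluation script from our repository, and the function names are our own choice for the example.

```python
import math


def mean_squared_error(y_true, y_pred):
    """Average squared difference between measured and predicted efficacies."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def pearson_r(y_true, y_pred):
    """Pearson correlation: covariance divided by the product of the spreads."""
    n = len(y_true)
    mean_t = sum(y_true) / n
    mean_p = sum(y_pred) / n
    cov = sum((t - mean_t) * (p - mean_p) for t, p in zip(y_true, y_pred))
    var_t = sum((t - mean_t) ** 2 for t in y_true)
    var_p = sum((p - mean_p) ** 2 for p in y_pred)
    return cov / math.sqrt(var_t * var_p)
```

MSE penalizes the absolute size of the errors, while Pearson R only measures how well the predictions track the ranking and linear trend of the measurements, so the two can disagree about which model is better.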

We applied this model to predict silencing efficacy for all siRNAs targeting BCL2 and BTK genes, including both our own designs and experimentally validated sequences from the literature. Predictions are available on our Model Page.

Full documentation for the LightGBM model can be found in our GitLab Repository.

TabPFN Regressor

After implementing and evaluating LightGBM, we expanded our modeling strategy by exploring a fundamentally different machine learning approach. Our goal was to determine whether an alternative learning mechanism could capture additional patterns and potentially improve predictive performance.

Given our tabular dataset with complex non-linear dependencies, we focused our literature review on models specifically designed for tabular data prediction. We selected the TabPFN (Tabular Prior-Data Fitted Network) regressor ¹⁴, a transformer-based model pre-trained on millions of synthetic tasks, which enables accurate predictions with minimal hyperparameter tuning.

TabPFN leverages a transformer architecture and extensive pre-training to capture complex, non-linear feature interactions in tabular data without requiring dataset-specific training. Because this mechanism differs fundamentally from gradient boosting, it allowed us to explore whether an alternative model architecture could enhance siRNA efficacy prediction.

We evaluated performance using Mean Squared Error (MSE) and Pearson correlation coefficient (R) to measure alignment between predicted and actual silencing efficacies.

Metric	TabPFN Regressor Model
Mean Squared Error (MSE)	0.0665
Pearson Coefficient (R)	0.5576

Full documentation for TabPFN model development is available in our GitLab Repository.

Future Advancements

Our software currently comprises two machine learning models for siRNA silencing efficacy prediction: a gradient-boosting framework (LightGBM) and a probabilistic foundation model (TabPFN). Both provide strong baselines for predicting silencing efficacy in our project. However, we envision several future advancements that could significantly enhance the scope and accuracy of our tool:

Deep Learning Architectures: Integration of Convolutional Neural Networks (CNNs) for motif extraction, Long Short-Term Memory networks (LSTMs) for sequential dependencies, and Attention-based mechanisms (e.g., Transformers) for long-range contextual interactions.
Pre-Trained Biological Models: Incorporation of RNA foundation models (such as RNA-FM) to leverage embeddings trained on large-scale RNA datasets, enabling generalization beyond our experimental training data.
Expanded Feature Engineering: Addition of features beyond thermodynamic and sequence-based properties, such as secondary structure accessibility metrics, to further improve model performance.
Interactive User Interface: Development of a user-friendly web application allowing researchers to input candidate siRNA sequences and receive predictive scores in real time.

By pursuing these advancements, our software can evolve from a predictive tool into a comprehensive decision-support platform for rational design of therapeutic siRNAs.

1. Huesken, D., Lange, J., Mickanin, C. et al. Design of a genome-wide siRNA library using an artificial neural network. Nat Biotechnol 23, 995–1001 (2005). https://doi.org/10.1038/nbt1118
2. Takayuki Katoh, Tsutomu Suzuki, Specific residues at every third position of siRNA shape its efficient RNAi activity, Nucleic Acids Research, Volume 35, Issue 4, 15 February 2007, Page e27, https://doi.org/10.1093/nar/gkl1120
3. Shabalina, S.A., Spiridonov, A.N. & Ogurtsov, A.Y. Computational models with thermodynamic and composition features improve siRNA design. BMC Bioinformatics 7, 65 (2006). https://doi.org/10.1186/1471-2105-7-65
4. Sciabola S, Cao Q, Orozco M. et al. Improved nucleic acid descriptors for siRNA efficacy prediction. Nucleic Acids Res 2013;41:1383–94. https://doi.org/10.1093/nar/gks1191
5. Mohammed Amarzguioui, Torgeir Holen, Eshrat Babaie, Hans Prydz, Tolerance for mutations and chemical modifications in a siRNA, Nucleic Acids Research, Volume 31, Issue 2, 15 January 2003, Pages 589–595, https://doi.org/10.1093/nar/gkg147
6. Harborth J, Elbashir SM, Vandenburgh K, Manninga H, Scaringe SA, Weber K, Tuschl T. Sequence, chemical, and structural variation of small interfering RNAs and short hairpin RNAs and the effect on mammalian gene silencing. Antisense Nucleic Acid Drug Dev. 2003 Apr;13(2):83-105. doi: 10.1089/108729003321629638. PMID: 12804036.
7. Andrew C. Hsieh, Ronghai Bo, Judith Manola, Francisca Vazquez, Olivia Bare, Anastasia Khvorova, Stephen Scaringe, William R. Sellers, A library of siRNA duplexes targeting the phosphoinositide 3‐kinase pathway: determinants of gene silencing for use in cell‐based screens, Nucleic Acids Research, Volume 32, Issue 3, 1 February 2004, Pages 893–901, https://doi.org/10.1093/nar/gkh238
8. Anastasia Khvorova, Angela Reynolds, Sumedha D. Jayasena, Functional siRNAs and miRNAs Exhibit Strand Bias, Cell, Volume 115, Issue 2, 2003, Pages 209-216, ISSN 0092-8674, https://doi.org/10.1016/S0092-8674(03)00801-8
9. Reynolds A, Leake D, Boese Q, Scaringe S, Marshall WS, Khvorova A. Rational siRNA design for RNA interference. Nat Biotechnol. 2004 Mar;22(3):326-30. doi: 10.1038/nbt936. Epub 2004 Feb 1. PMID: 14758366.
10. https://www.jbc.org/article/S0021-9258(19)32641-9/fulltext
11. Kumiko Ui-Tei, Yuki Naito, Fumitaka Takahashi, Takeshi Haraguchi, Hiroko Ohki‐Hamazaki, Aya Juni, Ryu Ueda, Kaoru Saigo, Guidelines for the selection of highly effective siRNA sequences for mammalian and chick RNA interference, Nucleic Acids Research, Volume 32, Issue 3, 1 February 2004, Pages 936–948, https://doi.org/10.1093/nar/gkh247
12. https://pubs.acs.org/doi/10.1021/bi9809425
13. LightGBM's Documentation Page, https://lightgbm.readthedocs.io/en/stable/
14. Hollmann, N., Müller, S., Purucker, L. et al. Accurate predictions on small data with a tabular foundation model. Nature 637, 319–326 (2025). https://doi.org/10.1038/s41586-024-08328-6

"Life is a DNA software system."

- Craig Venter