Applying the engineering cycle was instrumental to the progress and success of our project, guiding both our wet lab and dry lab efforts. Through multiple Design-Build-Test-Learn (DBTL) iterations, we systematically refined the key components of our modeling approach while addressing and resolving experimental challenges in the laboratory. Each iteration provided insights that informed the next, ensuring a continuous process of improvement and optimization.
In the Wet Lab, our engineering cycle iterations focused on optimizing experimental design and validating gene silencing efficiency in MEC-1 cells:
In the Dry Lab, our iterations aimed to enhance the performance and accuracy of our siRNA efficacy prediction pipeline through progressive data refinement and model advancement:
GSK126 is a selective small-molecule inhibitor of the EZH2 methyltransferase that acts competitively with S-adenosyl methionine; EZH2 is frequently dysregulated in cancer. MEC-1 cells were cultured and exposed to increasing concentrations of GSK126 to observe potential changes in cell viability, proliferation, or morphology.
This step was essential not only to understand how MEC-1 cells respond to pharmacological treatment but also to gain practical familiarity with their handling, culture dynamics, and baseline behavior, enabling subsequent comparison with the cellular effects observed following siRNA treatment.
The experiment was conducted in accordance with the protocol provided in our Experiments Page.
More detailed descriptions of the experimental setup, dilutions, and handling steps can be found in our Lab Notebook.
The effects of the inhibitor were evaluated through:
From this cycle, we determined a concentration range that can be effectively paired with siRNA treatments, allowing us to fine-tune between cytostatic and cytotoxic responses based on the specific goals of each experiment.
The goal of this experimental cycle was to establish a reliable method for delivering siRNAs into MEC-1 cells and to evaluate whether this approach could achieve functional gene silencing of BCL-2. Given that MEC-1 is a suspension cell line with known sensitivity to lipid-based transfection reagents, particular attention was placed on identifying conditions that balance efficient uptake with minimal cytotoxicity.
Further procedural details are provided in the Lab Notebook.
Transfection of MEC-1 cells with BCL-2 siRNAs revealed an inhibitory effect on cellular proliferation. At early timepoints (24 h), both proliferation and viability were reduced across all conditions, with the strongest inhibition observed at higher concentrations (50 nM) and in the INTERFERin-only group. However, at later timepoints, cells exhibited partial recovery of growth and viability, indicating an adaptive response following initial treatment stress, which appeared to be primarily associated with the transfection reagent rather than the siRNA itself.
These effects were largely cytostatic, reflecting transient stress responses rather than overt cytotoxicity. Over time, cells demonstrated progressive recovery: by 72 h, morphology and viability had stabilized, and by 96 h, proliferation in all groups, including high-concentration and INTERFERin-only conditions, surpassed that of the control. This indicates that MEC-1 cells adapt to INTERFERin exposure and restore normal growth dynamics following the initial stress phase. From this cycle, we established that siRNA concentrations in the 5–20 nM range, with 20 nM showing the most favorable results, provide a balance between effective transfection and minimal long-term impact on cell viability, forming the basis for subsequent functional silencing studies. However, the cytotoxicity observed at higher INTERFERin volumes prompted a dedicated titration experiment to optimize reagent-to-cell and siRNA ratios and enhance overall cellular tolerance.
This cycle aimed to determine the optimal concentration range of INTERFERin that allows efficient siRNA delivery while minimizing cytotoxicity. MEC-1 cells were exposed to gradient volumes of INTERFERin. Cell viability, growth, and morphology were assessed over time to distinguish transient stress effects from sustained toxicity.
Further procedural details are provided in the Lab Notebook.
Treatment of MEC-1 cells with INTERFERin demonstrated a dose-dependent inhibitory effect on cellular proliferation. Smaller volumes were associated primarily with cytostatic outcomes consistent with a temporary arrest of cell division, but higher volumes (5 μL) showed cytotoxicity. When comparing cultures seeded at different densities, the inhibitory effect was slightly less pronounced at 300,000 cells/mL than at 150,000 cells/mL, likely due to reduced effective dose per cell and the presence of survival-supporting microenvironmental factors.
INTERFERin exhibits cytotoxic/cytostatic effects at high doses. Its main negative impact at larger volumes (3–5 µL) is an inducible and reversible arrest of the cell cycle, rather than induction of cell death. This suggests that the observed effects are likely non-permanent and that MEC-1 cells can recover once the stress is alleviated.
From this cycle, we established an optimal working range of INTERFERin volumes that balances efficient transfection with minimal cytotoxicity and an optimal cell seeding setup, providing a reliable basis for downstream siRNA delivery experiments.
The objective of this second repetition was to validate and refine the conditions established in the initial siRNA transfection experiment. As MEC-1 cells are highly sensitive to lipid-based transfection reagents, we aimed to confirm the reproducibility of siRNA uptake and BCL2 gene silencing under the optimized parameters identified previously. Repeating the experiment allowed us to assess consistency across independent runs, strengthen the reliability of our observations, and further define the balance between efficient delivery and minimal off-target cytotoxic effects.
Further procedural details are provided in the Lab Notebook.
All conditions were tested in replicates to ensure reproducibility.
Transfection of MEC-1 cells with BCL2 siRNAs revealed a concentration-dependent pattern of uptake and processing. At 20 nM, a distinct fluorescence peak was observed at 6 h, followed by a gradual decline and stabilization by 24–48 h. This dynamic profile indicates efficient internalization and intracellular processing of the siRNA. In contrast, at 50 nM, fluorescence increased continuously over time, suggesting overloading of the delivery system and sequestration of siRNAs in vesicles, accompanied by mild growth inhibition. Lower concentrations (5–10 nM), scrambled siRNA, and INTERFERin-only controls exhibited no significant fluorescence signal. Importantly, cell viability remained consistently high (>85–90%) across all conditions, demonstrating that the treatments influenced proliferation dynamics rather than inducing cytotoxicity.
Our observations demonstrated that higher siRNA concentrations correlated with increased uptake efficiency, confirming successful delivery into MEC-1 cells. Among the range of concentrations tested, the 20 nM and 50 nM treatments produced similar cellular outcomes, including slightly reduced viability and evident cell cycle arrest. Notably, the 20 nM concentration, representing a milder treatment, achieved comparable functional results to 50 nM, suggesting it may serve as a more optimal and less cytotoxic condition for subsequent experiments. These findings underscore the importance of balancing delivery efficiency with cellular tolerance, guiding further optimization of transfection parameters.
This cycle aimed to confirm that optimized transfection conditions translate into functional BCL2 silencing at the transcript (and, where feasible, protein) level. Cells across all conditions were harvested at 6 h, 24 h, and 48 h for total RNA isolation, cDNA synthesis, and qPCR quantification of BCL2 normalized to stable reference genes (e.g., YWHAZ). To verify total RNA integrity and confirm that the qPCR amplified the correct product, we ran the samples on gel electrophoresis, checking the RNA banding pattern and that the amplicon bands were specific and of the expected size. Although cell pellets were prepared for downstream protein analysis, Western blot experiments could not be completed before the wiki freeze due to time constraints. These assays remain part of our planned follow-up validation to corroborate transcript-level findings at the protein level.
Further procedural details are provided in the Lab Notebook.
qPCR analysis supported these findings, showing a partial reduction in BCL2 mRNA levels (~30%) at 20 nM and a more pronounced decrease (~50%) at 50 nM, although these effects did not reach statistical significance due to experimental variability and technical constraints. The quality and quantity of the extracted total RNA were suboptimal, which likely limited detection sensitivity and contributed to the observed variability. Nevertheless, the biological evidence aligns with the expected silencing effect: in the cells that successfully internalized the siRNA, slightly reduced viability and evident cell cycle arrest were observed. These results support the functional activity of the siRNA treatment and underscore the need to scale up the experimental setup to obtain higher-quality RNA templates, thereby enabling more robust and statistically supported downstream analyses.
Furthermore, fluctuations of the YWHAZ reference gene in negative controls highlighted the necessity of re-examining its suitability as a normalizer. Future validation efforts will therefore include the use of additional biological replicates and assessment of protein-level expression to confirm the stability of normalization strategies.
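Normalization of a target gene to a reference gene is commonly computed with the 2^-ΔΔCt method. The sketch below illustrates that calculation under the assumption that this method applies here and that both primer pairs amplify with roughly 100% efficiency; the function names and all Ct values are hypothetical and are not measured data.

```python
def relative_expression(ct_target_treated, ct_ref_treated,
                        ct_target_control, ct_ref_control):
    """Fold change of the target gene (treated vs. control) by 2^-ddCt.

    Assumes ~100% amplification efficiency for both primer pairs.
    """
    delta_ct_treated = ct_target_treated - ct_ref_treated
    delta_ct_control = ct_target_control - ct_ref_control
    delta_delta_ct = delta_ct_treated - delta_ct_control
    return 2.0 ** (-delta_delta_ct)


def knockdown_percent(fold_change):
    """Convert a fold change (1.0 = no change) to percent knockdown."""
    return (1.0 - fold_change) * 100.0


# Hypothetical Ct values, for illustration only (not measured data).
fold = relative_expression(25.0, 18.0, 24.0, 18.0)
print(f"fold change = {fold:.2f}, knockdown = {knockdown_percent(fold):.0f}%")
```

Because the reference Ct enters the exponent directly, a shift of one cycle in the reference gene between samples changes the computed fold change by a factor of two, which is why the stability of the normalizer matters.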
The first step, before searching for training data, was to define the purpose of our machine learning model and what the expected input(s) and output(s) would be. Our goal from the beginning was to build a model that would predict siRNA silencing efficiency. We defined siRNA silencing efficiency as the fraction of the target mRNA that is knocked down, so the output of the model is a number in the range [0, 1], where 0 indicates no silencing effect and 1 indicates maximum silencing effect. In other words, the higher the model output is, the more promising the candidate siRNA is.
Regarding the training data, we decided that our dataset should contain:
For model training we decided to use the siRNA antisense strand and the extended target mRNA site as inputs, with the siRNA silencing efficacy score as the target value. The other data would be taken into consideration for the creation of the training dataset and its split into training and testing sets. For the creation of a harmonized dataset, we made sure that all siRNA sequences followed these rules:
For this model, we made the following assumptions:
For the first engineering cycle of the machine learning model development, we decided to employ two basic machine learning models, Support Vector Regression (SVR) and XGBoost, and to evaluate their performance. To transform the input data (siRNA and target mRNA sequences) into a numerical form that these models can process, we used the k-mer method with four different values of k (k = 2, 3, 4, 5), in order to determine which value of k yields the best model performance.
As a metric for the model performance evaluation, we decided to use the Mean Squared Error (MSE).
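The k-mer encoding described above can be sketched as follows. This is a minimal illustration rather than the project code: it assumes raw k-mer counts over the RNA alphabet (normalized frequencies would be an equally plausible choice) and assumes that the siRNA and target-site vectors are simply concatenated. The function names are hypothetical.

```python
from itertools import product

ALPHABET = "ACGU"


def kmer_vocabulary(k):
    """All 4**k possible k-mers, in a fixed lexicographic order."""
    return ["".join(p) for p in product(ALPHABET, repeat=k)]


def kmer_counts(sequence, k):
    """Count overlapping k-mers in a sequence; returns a vector of length 4**k."""
    seq = sequence.upper().replace("T", "U")  # accept DNA-style input
    vocabulary = kmer_vocabulary(k)
    counts = dict.fromkeys(vocabulary, 0)
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if kmer in counts:  # skip k-mers containing ambiguous bases such as N
            counts[kmer] += 1
    return [counts[kmer] for kmer in vocabulary]


def encode_pair(sirna, target_site, k):
    """Assumed layout: siRNA k-mer vector followed by target-site k-mer vector."""
    return kmer_counts(sirna, k) + kmer_counts(target_site, k)
```

The vector length grows as 4^k per sequence (16, 64, 256 and 1024 for k = 2 to 5), so larger k gives sparser features for 19 nt sequences.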
Initially, we had to build our training dataset, which needed to be rich and balanced in order to ensure high data quality and enhance model performance. We conducted a literature review and identified 5 different siRNA silencing efficacy datasets: Huesken, Takayuki, Mixset, Shabalina and Simone. The Mixset dataset comprises siRNA antisense sequences from 7 different sources: Amarzguioui, Harborth, Hsieh, Khvorova, Reynolds, Vickers, Ui-Tei. We then preprocessed and curated the data (more information in the dropdown buttons below), resulting in a harmonized training dataset that comprises 4098 siRNA antisense strands, along with their respective silencing efficacy values.
We obtained data from these sources through previous publications on siRNA efficacy prediction, which provided .csv files containing: the siRNA antisense strand, the extended target mRNA sequence and siRNA efficacy values.
Sequence orientation standardization: We extracted the central 19 nt target mRNA sequence and determined strand orientation by checking whether the siRNA was the complement (3'-5') or the reverse complement (5'-3') of the target site. Huesken and Mixset sequences were already in 5'-3' orientation, while Takayuki sequences were 3'-5' and required reversal before integration into siRBench.
Efficacy normalization: All efficacy values from these sources were already normalized to the [0,1] range, where higher values indicate stronger silencing activity.
Cell line annotation: Cell lines were H1299 for Huesken and HeLa for Takayuki. Since the Mixset dataset aggregated sequences from 7 different sources, we traced each sequence back to its original publication to assign the appropriate experimental cell line for each entry.
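The orientation check described above can be sketched as follows, assuming the 19 nt target site is written 5'-3' and that the siRNA is perfectly complementary to it. The function names are hypothetical.

```python
COMPLEMENT = str.maketrans("ACGU", "UGCA")


def to_rna(seq):
    """Upper-case the sequence and convert DNA-style T to U."""
    return seq.upper().replace("T", "U")


def complement(seq):
    return to_rna(seq).translate(COMPLEMENT)


def reverse_complement(seq):
    return complement(seq)[::-1]


def standardize_antisense(sirna, target_site):
    """Return the antisense strand in 5'-3' orientation.

    target_site is the 19 nt mRNA site, written 5'-3'.
    """
    sirna = to_rna(sirna)
    if sirna == reverse_complement(target_site):
        return sirna          # already written 5'-3'
    if sirna == complement(target_site):
        return sirna[::-1]    # written 3'-5': reverse it
    raise ValueError("siRNA is not fully complementary to the target site")
```

Raising an error for non-complementary pairs, rather than guessing, makes mismatched entries visible so they can be inspected manually.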
The siRNA dataset from Shabalina et al. comprised 653 antisense strand sequences with reported gene-silencing activity values, GenBank accession numbers for target mRNAs and target site coordinates. Unlike other sources, efficacy values were reported as non-negative real numbers where 0 indicated complete knockdown and higher values represented weaker silencing, the inverse of our desired scale.
We used the provided GenBank accession numbers and genomic coordinates to retrieve extended target mRNA sequences. However, we discovered that siRNA sequences were not always fully complementary to the reported target regions. To resolve this, we downloaded complete target mRNA sequences and performed exhaustive searches to identify exact complementary sites for each siRNA. This approach successfully located complementary targets for 650 of 653 siRNAs; the remaining 3 were excluded from siRBench.
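An exhaustive search for exact complementary sites can be sketched as below. It assumes the antisense strand is written 5'-3' and that only perfect matches count; the function names are hypothetical, and retrieval of the mRNA from GenBank is left out.

```python
COMPLEMENT = str.maketrans("ACGU", "UGCA")


def reverse_complement(seq):
    """Reverse complement of an RNA (or DNA-style) sequence, returned as RNA."""
    rna = seq.upper().replace("T", "U")
    return rna.translate(COMPLEMENT)[::-1]


def find_target_sites(antisense, mrna):
    """Return all 0-based start positions where the mRNA is exactly
    complementary to the antisense strand (overlapping hits included)."""
    site = reverse_complement(antisense)
    mrna = mrna.upper().replace("T", "U")
    positions = []
    start = mrna.find(site)
    while start != -1:
        positions.append(start)
        start = mrna.find(site, start + 1)
    return positions
```

An empty result corresponds to an siRNA with no exact complementary site in the transcript, and a result with several positions signals an ambiguous target that needs manual resolution.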
We first normalized activity values to the [0,1] range. To align with other datasets where higher values indicate stronger silencing, we applied the transformation 1−α for each normalized activity value α. This ensured that all siRBench efficacy scores follow the same interpretation: 0 represents no silencing and 1 represents complete knockdown.
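The rescaling and inversion can be sketched as follows. The text above does not state which normalization was used, so min-max scaling is an assumption here, and the function name and example values are hypothetical.

```python
def invert_normalized_activity(values):
    """Min-max scale raw activities to [0, 1], then apply 1 - alpha so that
    1 means complete knockdown and 0 means no silencing."""
    lowest, highest = min(values), max(values)
    if highest == lowest:
        raise ValueError("cannot normalize constant values")
    span = highest - lowest
    return [1.0 - (v - lowest) / span for v in values]


# Hypothetical raw activities: 0 = complete knockdown, larger = weaker silencing.
print(invert_normalized_activity([0.0, 0.5, 2.0]))  # [1.0, 0.75, 0.0]
```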
All Shabalina sequences were in 5'-3' orientation, and experiments were conducted in HeLa cells.
The Simone dataset comprises 322 siRNA sequences obtained from the siRNADiscovery GitHub repository. The data was provided in three separate files: two FASTA files containing the siRNA sequences and the full-length target mRNA sequences, and one CSV file mapping siRNA-target pairs (via FASTA IDs) to efficacy values. We merged these files into a single CSV file containing three columns: the siRNA sequence, the extended target mRNA site, and the efficacy value. All siRNA sequences from the Simone dataset were 21 nt in length. To maintain consistency with our 19 nt standard, we trimmed the last 2 nt of each siRNA sequence.
The efficiency values were already normalized to the [0,1] range with higher values indicating stronger silencing, matching siRBench specifications. All sequences were in 5'-3' orientation, and experiments were conducted in Hep3B cells.
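Merging two FASTA files with a CSV mapping and trimming the siRNAs to 19 nt can be sketched as below. The column names (`sirna_id`, `mrna_id`, `efficacy`) are hypothetical, the inputs are passed as strings for simplicity, and extraction of the extended target site from the full-length mRNA is deliberately left out.

```python
import csv
import io


def parse_fasta(text):
    """Parse FASTA text into {record_id: sequence}; multi-line records are joined."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:].split()[0]
            records[name] = ""
        elif name is not None:
            records[name] += line
    return records


def merge_pairs(sirna_fasta, mrna_fasta, pairs_csv, sirna_length=19):
    """Join siRNA and mRNA records through the CSV mapping and trim each
    siRNA to its first `sirna_length` nucleotides."""
    sirnas = parse_fasta(sirna_fasta)
    mrnas = parse_fasta(mrna_fasta)
    rows = []
    for row in csv.DictReader(io.StringIO(pairs_csv)):
        rows.append({
            "sirna": sirnas[row["sirna_id"]][:sirna_length],
            "mrna": mrnas[row["mrna_id"]],
            "efficacy": float(row["efficacy"]),
        })
    return rows
```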
Duplicate removal was the final essential step before finalizing our dataset. We defined duplicates as siRNA pairs sharing identical antisense strands and extended target mRNA sites.
We first examined each dataset individually. The Huesken, Takayuki and Simone datasets contained no duplicates. The Shabalina dataset contained 3 duplicate siRNAs with identical efficacy scores; we retained one sequence from each pair. Furthermore, the Mixset dataset contained 8 duplicate pairs, from different source studies with conflicting efficacy scores. For these cases, we retained the sequence from the less-represented source to maintain dataset balance, excluding 4 additional siRNAs from siRBench.
We identified 398 duplicate pairs between Mixset and Shabalina with conflicting efficacy values due to different experimental conditions and cell lines. We retained the Mixset-derived efficacy values, as this dataset is more widely used as a benchmark in siRNA prediction studies.
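Duplicate resolution keyed on the (antisense strand, extended target site) pair can be sketched as follows. The record fields and the `SOURCE_RANK` table are hypothetical; the ranking shown only encodes the "prefer Mixset over Shabalina" rule and would need extending for the within-Mixset rule.

```python
# Hypothetical ranking: lower rank wins when two records share the same key.
SOURCE_RANK = {"Mixset": 0, "Shabalina": 1}


def source_rank(record):
    return SOURCE_RANK.get(record["source"], 0)


def deduplicate(records, priority=source_rank):
    """Keep one record per (antisense, target_site) key, choosing the record
    with the lowest priority value; on ties the first one seen is kept."""
    best = {}
    for record in records:
        key = (record["antisense"], record["target_site"])
        if key not in best or priority(record) < priority(best[key]):
            best[key] = record
    return list(best.values())
```

Passing the priority as a function keeps the conflict rule explicit and easy to swap, for example for a rule based on how many entries each source contributes.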
This comprehensive preprocessing procedure yielded a balanced, normalized and high-quality training dataset maximizing the number of unique, reliable siRNA sequences.
Before model development, we strategically split siRBench into training and test sets based on experimental cell line rather than using random splitting methods like scikit-learn's train_test_split function. This cell line-based approach ensures the model is trained and evaluated on different biological contexts, preventing information leakage and better reflecting real-world scenarios where predictions must generalize to new cellular environments. The training set comprises 3,408 siRNAs with efficacy measurements from H1299 and HeLa cell lines, while the test set contains 690 siRNAs from all other cell lines. The training set was further randomly divided into training (90%) and validation (10%) subsets.
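The two-stage split can be sketched as follows: first by cell line, then a random 90/10 split of the training portion. The `cell_line` field name and the fixed seed are assumptions made for illustration.

```python
import random

TRAIN_CELL_LINES = {"H1299", "HeLa"}


def split_by_cell_line(records, train_cell_lines=TRAIN_CELL_LINES):
    """Records from the training cell lines go to train, all others to test."""
    train = [r for r in records if r["cell_line"] in train_cell_lines]
    test = [r for r in records if r["cell_line"] not in train_cell_lines]
    return train, test


def train_validation_split(train, validation_fraction=0.1, seed=0):
    """Randomly hold out a fraction of the training records for validation."""
    shuffled = list(train)
    random.Random(seed).shuffle(shuffled)
    n_validation = round(len(shuffled) * validation_fraction)
    return shuffled[n_validation:], shuffled[:n_validation]
```

Because the test set contains only cell lines the model never saw, the test error estimates generalization to a new cellular context rather than interpolation within a known one.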
The final part of the “Build” step was to write the code for the SVR and XGBoost models. The main parts of the code are shown in the pictures below.
As a metric for the model performance evaluation, we used the Mean Squared Error (MSE). The table below includes the calculated MSE (with 4 decimal digits) for the different values of k that we tried, for both machine learning models.
| | SVR Model | XGBoost Model |
| k=2 | 0.0702 | 0.0733 |
| k=3 | 0.0653 | 0.0738 |
| k=4 | 0.0733 | 0.0749 |
| k=5 | 0.0792 | 0.0764 |
Through this cycle, we successfully built a harmonized dataset by integrating multiple datasets and by preprocessing and curating the data. We also built two different machine learning models for siRNA silencing efficacy prediction and evaluated their performance: from the MSE table above, we can conclude that both models performed similarly, while the best performance (lowest MSE, 0.0653) was achieved by the Support Vector Regression model with k = 3 for the k-mer method.
We moved on to complete more engineering cycles for this model development, with the goal of improving model performance. In this step, we brainstormed ways to improve model efficiency and decided to implement these two ideas:
Based on the findings of the previous engineering cycle, we first conducted a literature review on the features that affect siRNA silencing efficacy, with the goal of applying feature engineering. We hypothesized that adding engineered (pre-calculated) sequence features to the model input would improve model performance, since the model would be able to make predictions based on features that it could not previously derive on its own.
Through extensive literature review, we identified 100 thermodynamic and structural features to enrich our training dataset. All feature values are rounded to three decimal places. The dropdown button below lists these features with descriptions of their biological significance.
Thermodynamic Features
Structural Constraint Energies (RNAfold/RNAcofold)
RNAup Accessibility Features
For the model fitting, we used 90% of the training dataset for model training and the remaining 10% as the validation set.
Afterwards, we proceeded to search for more advanced machine learning models to employ for our task. We deemed that by using a model that adapts better to the characteristics of our task (multiple calculated features, medium-sized dataset, potential non-linear relationships), the model performance would be enhanced. After literature review, we selected LightGBM (Light Gradient Boosting Machine), an efficient implementation of gradient boosting decision trees.
Unlike linear models, LightGBM captures non-linear feature interactions, where silencing depends on the complex interplay of thermodynamic stability, accessibility, and positional effects rather than simple linear rules. LightGBM builds an ensemble of decision trees sequentially, with each tree correcting errors from previous ones. Its leaf-wise growth strategy expands branches that maximize error reduction, enabling the model to learn highly specific patterns. Efficiency optimizations including histogram-based feature binning, gradient-based sampling, and exclusive feature bundling make it faster and more memory-efficient than other gradient boosting implementations.
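The core idea of sequential boosting, where each new tree is fitted to the errors left by the previous ones, can be illustrated with a toy implementation. The sketch below is not LightGBM: it uses single-split trees (stumps) on one feature with squared-error loss, and it omits leaf-wise growth, histogram binning, gradient-based sampling and feature bundling. All names are hypothetical.

```python
def fit_stump(xs, residuals):
    """Find the threshold split on one feature that minimizes squared error."""
    best = None
    for threshold in sorted(set(xs))[:-1]:  # the largest value would leave the right side empty
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((r - left_mean) ** 2 for r in left)
               + sum((r - right_mean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    if best is None:
        raise ValueError("need at least two distinct feature values")
    return best[1:]


def fit_boosting(xs, ys, n_rounds=50, learning_rate=0.1):
    """Start from the mean, then repeatedly fit a stump to the current residuals."""
    base = sum(ys) / len(ys)
    predictions = [base] * len(ys)
    stumps = []
    for _ in range(n_rounds):
        residuals = [y - p for y, p in zip(ys, predictions)]
        threshold, left, right = fit_stump(xs, residuals)
        stumps.append((threshold, left, right))
        predictions = [p + learning_rate * (left if x <= threshold else right)
                       for x, p in zip(xs, predictions)]
    return {"base": base, "learning_rate": learning_rate, "stumps": stumps}


def predict(model, x):
    correction = sum(left if x <= t else right for t, left, right in model["stumps"])
    return model["base"] + model["learning_rate"] * correction
```

Each round shrinks the training error a little; the learning rate trades the number of rounds against the risk of overfitting, which is the same trade-off tuned in full gradient boosting libraries.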
The advantages for our dataset are:
After completing the “Design” step of this engineering cycle, we moved on to write code for both feature engineering and LightGBM model development. Full documentation of these aspects of our modeling is available in our GitLab Repository.
Full documentation for the final training dataset creation can be found in our GitLab Repository.
Full documentation for LightGBM Model development can be found in our GitLab Repository.
The next step after we built our enriched training dataset and trained the model was to evaluate the model performance. We used the Mean Squared Error (MSE) metric, and added the Pearson Correlation Coefficient (R) metric, in order to measure the linear correlation between the predicted and the actual silencing efficacies. The table below shows the metric values.
| Metric | LightGBM Model |
| Mean Squared Error (MSE) | 0.0529 |
| Pearson Correlation Coefficient (R) | 0.5454 |
Compared to the previous models we implemented, we observed a clear improvement in predictive accuracy, as the Mean Squared Error (MSE) decreased from 0.0653 (the best value in the previous cycle) to 0.0529. A lower MSE indicates that the predicted siRNA efficacies are, on average, closer to the experimental values, meaning the model has become more accurate.
Additionally, the Pearson correlation coefficient (R = 0.5454) shows that there is a moderate positive linear correlation between our predictions and the true values. In practice, this means that the model successfully captures the trend of siRNA activity: when experimental efficacy is high, predicted efficacy also tends to be high, and vice versa. Although not perfect, this correlation confirms that the model is learning meaningful biological patterns rather than random noise.
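For reference, the two metrics can be computed as in the sketch below, which uses the standard definitions and plain Python lists. The function names and the example values are hypothetical and are not model outputs.

```python
import math


def mean_squared_error(y_true, y_pred):
    """Average squared difference between measured and predicted values."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def pearson_r(y_true, y_pred):
    """Pearson correlation coefficient between measured and predicted values."""
    n = len(y_true)
    mean_t = sum(y_true) / n
    mean_p = sum(y_pred) / n
    covariance = sum((t - mean_t) * (p - mean_p) for t, p in zip(y_true, y_pred))
    var_t = sum((t - mean_t) ** 2 for t in y_true)
    var_p = sum((p - mean_p) ** 2 for p in y_pred)
    if var_t == 0 or var_p == 0:
        raise ValueError("correlation is undefined for constant input")
    return covariance / math.sqrt(var_t * var_p)


# Hypothetical efficacies, for illustration only.
measured = [0.2, 0.5, 0.8]
predicted = [0.3, 0.6, 0.9]  # every prediction is offset by +0.1
print(round(mean_squared_error(measured, predicted), 4))
print(round(pearson_r(measured, predicted), 4))
```

The two metrics measure different things: adding a constant offset to every prediction leaves R unchanged but increases the MSE, so a model can rank candidates well while still being biased in absolute terms.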
At this stage, our LightGBM model already provided satisfactory results, showing both a reduced prediction error and a meaningful correlation with experimental values. We could have stopped here and considered this as our final predictive tool. However, we wanted to go the extra mile and explore whether a different machine learning approach, based on a distinct underlying mechanism, could further improve performance and capture additional patterns in the data.
To achieve this, we recognized the need for a deeper literature review on alternative architectures and modeling strategies, which would guide us in selecting and implementing the next model in our pipeline. Therefore, we implemented one more engineering cycle for the model development.
After implementing and evaluating LightGBM, we expanded our modeling strategy by exploring a fundamentally different machine learning approach. Our goal was to determine whether an alternative learning mechanism could capture additional patterns and potentially improve predictive performance.
Given our tabular dataset with complex non-linear dependencies, we focused our literature review on models specifically designed for tabular data prediction. We selected the TabPFN (Tabular Prior-Data Fitted Network) regressor, a transformer-based model pre-trained on millions of synthetic tasks that enables accurate predictions with minimal hyperparameter tuning.
TabPFN leverages a transformer architecture and extensive pre-training to capture complex, non-linear feature interactions in tabular data without dataset-specific gradient training: the labeled training set is supplied as context, and predictions are made in a forward pass of the pre-trained network. This fundamentally different mechanism from gradient boosting allowed us to explore whether alternative model architectures could enhance siRNA efficacy prediction.
Regarding our training dataset, we did not make any changes to it for this iteration of the engineering cycle.
We moved on to write code for the TabPFN Regressor Model. Full documentation for TabPFN model development is available in our GitLab Repository.
For the model performance evaluation, we used the Mean Squared Error (MSE) metric and the Pearson Correlation Coefficient (R) metric. The table below shows the metric values.
| Metric | TabPFN Regressor Model |
| Mean Squared Error (MSE) | 0.0665 |
| Pearson Correlation Coefficient (R) | 0.5576 |
From the evaluation of the TabPFN regressor, we obtained a Mean Squared Error (MSE) of 0.0665 and a Pearson correlation coefficient (R) of 0.5576 on the test set. Although this MSE value is higher than the one achieved by LightGBM (0.0529), it still reflects a reasonable prediction error. On the other hand, the Pearson correlation coefficient is slightly higher than that of LightGBM (0.5454), showing that TabPFN captured the overall trend between predicted and true efficacies slightly better.
This comparison highlights the trade-off between the two models:
Considering these complementary strengths, we propose both models as part of our modeling strategy, since each offers distinct advantages for understanding and predicting siRNA silencing efficacy.
Finally, we recognize that further engineering cycles could involve the construction of deep learning architectures tailored to RNA data. These may include Convolutional Neural Networks (CNNs) to capture local sequence motifs, LSTMs and attention layers to learn sequential dependencies, and the integration of pre-trained RNA language models such as RNA-FM. Such approaches could provide even more powerful modeling capabilities in future iterations of our work.