Engineering Success

"The only real mistake is the one from which we learn nothing."

- Henry Ford

Overview


Applying the engineering cycle was instrumental to the progress and success of our project, guiding both our wet lab and dry lab efforts. Through multiple Design-Build-Test-Learn (DBTL) iterations, we systematically refined the key components of our modeling approach while resolving experimental challenges in the laboratory. Each iteration provided insights that informed the next, ensuring continuous improvement and optimization.

In the Wet Lab, our engineering cycle iterations focused on optimizing experimental design and validating gene silencing efficiency in MEC-1 cells.

In the Dry Lab, our iterations aimed to enhance the performance and accuracy of our siRNA efficacy prediction pipeline through progressive data refinement and model advancement.


Iteration 1: Benchmark MEC-1 response to GSK126 inhibitor


Design


GSK126 is a selective, S-adenosylmethionine-competitive small-molecule inhibitor of the EZH2 methyltransferase, an enzyme that is often dysregulated in cancer. MEC-1 cells were cultured and exposed to increasing concentrations of GSK126 to observe potential changes in cell viability, proliferation, or morphology.

This step was essential not only to understand how MEC-1 cells respond to pharmacological treatment but also to gain practical familiarity with their handling, culture dynamics, and baseline behavior, enabling subsequent comparison with the cellular effects observed following siRNA treatment.

Build


The experiment was conducted in accordance with the protocol provided in our Experiments Page.

  • MEC-1 cells were seeded in a 24-well plate at optimal density following centrifugation and resuspension.
  • Cells were treated with a gradient of GSK126 concentrations (nM):
    • 0
    • 5
    • 10
    • 20
    • 40
  • Controls included untreated cells and DMSO vehicle (0.2–0.5% final concentration, non-toxic range).
  • 2 technical replicates were prepared for each condition.
  • Incubation for 24 and 48 hours at 37 °C, 5% CO₂.

More detailed descriptions of the experimental setup, dilutions, and handling steps can be found in our Lab Notebook.

Test


The effects of the inhibitor were evaluated through:

  • Microscopic observation of morphology and culture homogeneity.
  • Cell viability and proliferation capacity assessment using Trypan Blue staining and Neubauer chamber counts.
  • 3 technical replicates were measured at both 24- and 48-hour timepoints to track time-dependent responses.
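The viability and density values obtained from a Trypan Blue count follow from simple arithmetic. The sketch below assumes a standard Neubauer chamber, in which each large square holds 0.1 µL (hence the 10^4 factor), and a 1:1 mix of cell suspension and Trypan Blue (dilution factor 2). The function name and the example counts are hypothetical and are not values from our experiments.

```python
def trypan_blue_summary(live_counts, dead_counts, dilution_factor=2.0):
    """Summarise a Neubauer chamber count.

    live_counts / dead_counts: cells counted per large square
    (unstained = live, blue = dead).
    dilution_factor: assumed 2.0 for a 1:1 mix with Trypan Blue.
    """
    if len(live_counts) != len(dead_counts) or not live_counts:
        raise ValueError("need one live and one dead count per square")
    squares = len(live_counts)
    live = sum(live_counts)
    dead = sum(dead_counts)
    total = live + dead
    # Each large square of a standard Neubauer chamber covers 0.1 uL = 1e-4 mL.
    viable_per_ml = (live / squares) * dilution_factor * 1e4
    viability = 100.0 * live / total if total else 0.0
    return {"viable_cells_per_mL": viable_per_ml, "viability_percent": viability}


# Hypothetical counts from four large squares.
example = trypan_blue_summary([45, 50, 48, 57], [5, 4, 6, 5])
print(example)
```

Comparing the viable cell density between timepoints gives the proliferation trend, while the viability percentage separates cytostatic effects (fewer cells, high viability) from cytotoxic ones (falling viability).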

Learn


From this cycle, we determined a concentration range that can be effectively paired with siRNA treatments, allowing us to fine-tune between cytostatic and cytotoxic responses based on the specific goals of each experiment.

Iteration 2: Establishing transfection feasibility and reagent tolerance in MEC-1


Design


The goal of this experimental cycle was to establish a reliable method for delivering siRNAs into MEC-1 cells and to evaluate whether this approach could achieve functional gene silencing of BCL-2. Given that MEC-1 is a suspension cell line with known sensitivity to lipid-based transfection reagents, particular attention was placed on identifying conditions that balance efficient uptake with minimal cytotoxicity.

Build


  • MEC-1 cells were cultured until viability reached ≥90% and then were seeded into 24-well plates.
  • siRNAs were prepared at working concentrations.
  • siRNAs were complexed with INTERFERin in Opti-MEM and incubated to allow complex formation.
  • Experimental layout included:
    • Concentration gradient: 5–50 nM.
    • Controls: untreated, INTERFERin-only, scrambled siRNA, siRNA-only.
  • Plates were incubated at 37 °C, 5% CO₂ for up to 96 h.

Further procedural details are provided in the Lab Notebook.

Test


  • Samples were collected at 24, 48, 72, and 96 h for:
    • Cell viability and cell proliferation capacity assays (Trypan Blue)
    • Cy3 fluorescence uptake
    • Microscopic observation of morphology and culture homogeneity

Learn


Transfection of MEC-1 cells with BCL-2 siRNAs revealed an inhibitory effect on cellular proliferation. At early timepoints (24 h), both proliferation and viability were reduced across all conditions, with the strongest inhibition observed at higher concentrations (50 nM) and in the INTERFERin-only group. However, at later timepoints, cells exhibited partial recovery of growth and viability, indicating an adaptive response following initial treatment stress, which appeared to be primarily associated with the transfection reagent rather than the siRNA itself.

These effects were largely cytostatic, reflecting transient stress responses rather than overt cytotoxicity. Over time, cells demonstrated progressive recovery: by 72 h, morphology and viability had stabilized, and by 96 h, proliferation in all groups, including high-concentration and INTERFERin-only conditions, surpassed that of the control. This indicates that MEC-1 cells adapt to INTERFERin exposure and restore normal growth dynamics following the initial stress phase. From this cycle, we established that siRNA concentrations in the 5–20 nM range provide a balance between effective transfection and minimal long-term impact on cell viability, with 20 nM showing the most favorable results, forming the basis for subsequent functional silencing studies. However, the cytotoxicity observed at higher INTERFERin volumes prompted a dedicated titration experiment to optimize reagent-to-cell and siRNA ratios and enhance overall cellular tolerance.

Iteration 3: INTERFERin titration for cytotoxicity optimization


Design


This cycle aimed to determine the optimal concentration range of INTERFERin that allows efficient siRNA delivery while minimizing cytotoxicity. MEC-1 cells were exposed to gradient volumes of INTERFERin. Cell viability, growth, and morphology were assessed over time to distinguish transient stress effects from sustained toxicity.

Build


  • MEC-1 cells were counted and plated in 24-well plates at uniform density.
  • Master mixes were prepared by combining INTERFERin with Opti-MEM and incubating for 15 minutes to allow complex formation.
  • A gradient of INTERFERin volumes (µL) was tested in parallel across technical replicates:
    • 1
    • 2
    • 3
    • 5
  • Controls: untreated cells.
  • Plates were incubated at 37 °C, 5% CO₂ for 48 h.

Further procedural details are provided in the Lab Notebook.

Test


  • Cell viability was measured at 24 h and 48 h across all INTERFERin concentrations.
  • Cell proliferation trends were recorded relative to controls.
  • Microscopic assessment for stress indicators (aggregation, clustering, reduced viable cell numbers).

Learn


Treatment of MEC-1 cells with INTERFERin demonstrated a dose-dependent inhibitory effect on cellular proliferation. Smaller volumes were associated primarily with cytostatic outcomes consistent with a temporary arrest of cell division, but higher volumes (5 μL) showed cytotoxicity. When comparing cultures seeded at different densities, the inhibitory effect was slightly less pronounced at 300,000 cells/mL than at 150,000 cells/mL, likely due to reduced effective dose per cell and the presence of survival-supporting microenvironmental factors.

INTERFERin exhibits cytotoxic/cytostatic effects at high doses. Its main negative impact at larger volumes (3–5 µL) is an inducible and reversible arrest of the cell cycle, rather than induction of cell death. This suggests that the observed effects are likely non-permanent and that MEC-1 cells can recover once the stress is alleviated.

From this cycle, we established an optimal working range of INTERFERin volumes that balances efficient transfection with minimal cytotoxicity and an optimal cell seeding setup, providing a reliable basis for downstream siRNA delivery experiments.

Iteration 4 (Repetition 2): Establishing transfection feasibility and reagent tolerance in MEC-1


Design


The objective of this second repetition was to validate and refine the conditions established in the initial siRNA transfection experiment. As MEC-1 cells are highly sensitive to lipid-based transfection reagents, we aimed to confirm the reproducibility of siRNA uptake and BCL2 gene silencing under the optimized parameters identified previously. Repeating the experiment allowed us to assess consistency across independent runs, strengthen the reliability of our observations, and further define the balance between efficient delivery and minimal off-target cytotoxic effects.

Build


  • MEC-1 cells were seeded into 24-well plates (200 µL/well).
  • Eight experimental conditions were prepared with Master Mixes (Opti-MEM, siRNA at varying concentrations, including Cy3-labeled constructs, and INTERFERin).
  • Complexes were incubated for 15 min at room temperature before addition to cells.
  • Two plates were prepared for each time point (6 h, 24 h, 48 h):
    • Plate A: RNA extraction + cellular uptake
    • Plate B: protein extraction
  • Medium was supplemented at 6 h post-transfection to reach a final volume of 1 mL/well.

Further procedural details are provided in the Lab Notebook.

Test


  • Cell viability & proliferation were measured at 6 h, 24 h, and 48 h using Trypan Blue staining and Neubauer chamber counting.
  • Cell morphology was assessed by microscopy at each time point, with representative images captured.
  • siRNA uptake was quantified using Cy3 fluorescence from one-third of the samples at each time point.
  • RNA analysis: remaining cells from RNA plates were pelleted for RNA extraction, and later processed for qPCR to measure BCL2 expression relative to reference genes.
  • Protein analysis: 300,000 cells per condition collected from protein plates, pelleted for downstream protein extraction and BCL2 assessment.

All conditions were tested in replicates to ensure reproducibility.

Learn


Transfection of MEC-1 cells with BCL2 siRNAs revealed a concentration-dependent pattern of uptake and processing. At 20 nM, a distinct fluorescence peak was observed at 6 h, followed by a gradual decline and stabilization by 24–48 h. This dynamic profile indicates efficient internalization and intracellular processing of the siRNA. In contrast, at 50 nM, fluorescence increased continuously over time, suggesting overloading of the delivery system and sequestration of siRNAs in vesicles, accompanied by mild growth inhibition. Lower concentrations (5–10 nM), scrambled siRNA, and INTERFERin-only controls exhibited no significant fluorescence signal. Importantly, cell viability remained consistently high (>85–90%) across all conditions, demonstrating that the treatments influenced proliferation dynamics rather than inducing cytotoxicity.

Our observations demonstrated that higher siRNA concentrations correlated with increased uptake efficiency, confirming successful delivery into MEC-1 cells. Among the range of concentrations tested, the 20 nM and 50 nM treatments produced similar cellular outcomes, including slightly reduced viability and evident cell cycle arrest. Notably, the 20 nM concentration, representing a milder treatment, achieved comparable functional results to 50 nM, suggesting it may serve as a more optimal and less cytotoxic condition for subsequent experiments. These findings underscore the importance of balancing delivery efficiency with cellular tolerance, guiding further optimization of transfection parameters.

Iteration 5: Functional validation of BCL2 silencing/Expression Readouts


Design


This cycle aimed to confirm that optimized transfection conditions translate into functional BCL2 silencing at the transcript (and, where feasible, protein) level. Cells across all conditions were harvested at 6 h, 24 h, and 48 h for total RNA isolation, cDNA synthesis, and qPCR quantification of BCL2 normalized to stable reference genes (e.g., YWHAZ). We used gel electrophoresis both to check total RNA integrity and to verify that the qPCR amplicons were specific and of the expected size. Although cell pellets were prepared for downstream protein analysis, Western blot experiments could not be completed before the wiki freeze due to time constraints. These assays remain part of our planned follow-up validation to corroborate transcript-level findings at the protein level.

Build


  • MEC-1 cells were transfected under optimized conditions (20 nM BCL2 siRNA with INTERFERin).
  • Conditions included: untreated cells, INTERFERin-only, non-targeting siRNA (scrambled), siRNA-only, and BCL-2 targeting siRNA.
  • RNA plates were processed:
    • Cells were pelleted and total RNA was extracted.
    • cDNA was synthesized for downstream qPCR.
    • qPCR was performed for BCL2 and reference genes (e.g., p21, YWHAZ), with samples run in replicates.
  • Specificity of amplicons was verified by gel electrophoresis.
  • Parallel protein pellets were prepared for Western blotting; however, due to time constraints, these assays could not be completed before the wiki freeze.

Further procedural details are provided in the Lab Notebook.

Test


  • qPCR analysis was performed on cDNA from BCL2-siRNA, non-targeting siRNA, INTERFERin-only, and untreated groups.
  • Relative expression was calculated using the ΔΔCt method with YWHAZ as the reference gene.
  • Gel electrophoresis was run to verify the specificity and expected size of qPCR amplicons, and to check total RNA integrity.
  • Fluorescence analysis: siRNA uptake was monitored at every tested concentration using Cy3-labeled siRNAs. Fluorescence intensity was measured over time with a spectrometer to assess internalization dynamics.
  • Viability and growth monitoring were performed at the same timepoints to confirm that reduced BCL2 expression was not due to cytotoxicity.
  • Replicates were included across all groups to ensure reproducibility.
  • Protein pellets were reserved for Western blotting, which could not be carried out before the wiki freeze.
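The ΔΔCt calculation mentioned in the list above can be sketched as follows. This is a minimal illustration that assumes roughly 100% amplification efficiency for both the target and the reference gene (so each cycle corresponds to a two-fold change); the function name and the Ct values are hypothetical and are not our measured data.

```python
def relative_expression(ct_target_treated, ct_ref_treated,
                        ct_target_control, ct_ref_control):
    """Fold change of the target gene (treated vs control) via 2^-ddCt."""
    # Normalise the target gene to the reference gene within each sample.
    d_ct_treated = ct_target_treated - ct_ref_treated
    d_ct_control = ct_target_control - ct_ref_control
    # Compare the treated sample with the control sample.
    dd_ct = d_ct_treated - d_ct_control
    return 2.0 ** (-dd_ct)


# Hypothetical Ct values: BCL2 vs YWHAZ in siRNA-treated and untreated cells.
fold = relative_expression(25.5, 20.0, 25.0, 20.0)
knockdown_percent = (1.0 - fold) * 100.0
print(round(fold, 3), round(knockdown_percent, 1))
```

Because the reference gene Ct enters both ΔCt terms, any instability of the reference gene between conditions shifts the reported fold change directly, which is why the choice of normalizer matters.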

Learn


qPCR analysis supported these findings, showing a partial reduction in BCL2 mRNA levels (~30%) at 20 nM and a more pronounced decrease (~50%) at 50 nM, although these effects did not reach statistical significance due to experimental variability and technical constraints. The quality and quantity of the extracted total RNA template were suboptimal, which likely limited detection sensitivity and contributed to the observed variability. Nevertheless, the biological evidence is consistent with the expected silencing effect: in the cells that successfully internalized the siRNA, slightly reduced viability and evident cell cycle arrest were observed. These results support the functional activity of the siRNA treatment and underscore the need to scale up the experimental setup to obtain higher-quality RNA templates, thereby enabling more robust and statistically supported downstream analyses.

Furthermore, fluctuations of the YWHAZ reference gene in negative controls highlighted the necessity of re-examining its suitability as a normalizer. Future validation efforts will therefore include the use of additional biological replicates and assessment of protein-level expression to confirm the stability of normalization strategies.

Iteration 1: Training Dataset Creation and Designing Introductory ML Models


Design


The first step, before searching for training data, was to define the purpose of our machine learning model and its expected input(s) and output(s). Our goal from the beginning was to build a model that predicts siRNA silencing efficiency. We defined silencing efficiency as the fraction of target mRNA knocked down, so the model output is a number in the range [0, 1], where 0 indicates no silencing effect and 1 indicates maximum silencing effect. In other words, the higher the model output, the more promising the candidate siRNA.

Regarding the training data, we decided that our dataset should contain:

  • The siRNA antisense strand: 19 nt length, without overhangs, in 5'-3' orientation
  • Exact target mRNA site: 19 nt sequence
  • Extended target mRNA site: 57 nt sequence containing the 19 nt target site with 19 nt flanking regions on each side
  • Efficacy score: Normalized value from 0 (ineffective) to 1 (highly effective) representing silencing performance
  • Data source: The originating dataset source publication (from 11 public sources)
  • Cell line: The experimental cell system used for efficacy measurements
  • Engineered features: Computed properties for sequence analysis

As model inputs we decided to use the siRNA antisense strand and the extended target mRNA site, with the siRNA silencing efficacy score as the value to be predicted. The other fields would be taken into consideration when creating the training dataset and splitting it into training and testing sets. To create a harmonized dataset, we made sure that all siRNA sequences followed the rules below:

  • Sequence uniformity: All siRNA antisense strands are 19 nt in length, oriented 5'→3', and reverse-complement to their corresponding exact target mRNA sites
  • Efficacy normalization: All efficacy values are normalized to a 0–1 scale, representing the percentage of target mRNA knockdown (0 = no silencing, 1 = complete silencing)
  • Uniqueness: Each siRNA-target mRNA pair appears only once in the dataset

For this model, we made the following assumptions:

  • siRNA function is decided primarily by the antisense (guide) strand: The target recognition inside RISC is facilitated only by the antisense strand, which is assembled into Ago2.
  • Availability of target mRNA sites influences silencing efficacy: siRNAs cannot bind to sequences that are buried in stable secondary structures or protein complexes. By incorporating the mRNA region surrounding the binding site, we are able to compensate for local folding and accessibility influencing binding.

For the first engineering cycle of the machine learning model development, we decided to employ two basic machine learning models, Support Vector Regression (SVR) and XGBoost, and to compare their performance. To transform the input data (siRNA and target mRNA sequences) into a numerical form that these models can process, we used the k-mer method with four different values of k (k = 2, 3, 4, 5) to evaluate which value gives the best model performance.

As a metric for the model performance evaluation, we decided to use the Mean Squared Error (MSE).
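To make the k-mer encoding concrete, here is a minimal sketch of one common variant: overlapping k-mers over the alphabet A, C, G, U are counted and normalized to frequencies, and the siRNA and target vectors are concatenated. Whether raw counts or frequencies are used, and how the two sequences are combined, are assumptions made for this illustration; the function names are hypothetical.

```python
from itertools import product

ALPHABET = "ACGU"


def kmer_vector(seq, k):
    """Frequency of every possible k-mer in seq (fixed order, length 4**k)."""
    seq = seq.upper().replace("T", "U")
    kmers = ["".join(p) for p in product(ALPHABET, repeat=k)]
    index = {kmer: i for i, kmer in enumerate(kmers)}
    counts = [0.0] * len(kmers)
    windows = len(seq) - k + 1
    for i in range(windows):
        kmer = seq[i:i + k]
        if kmer in index:  # windows with ambiguous bases (e.g. N) are skipped
            counts[index[kmer]] += 1.0
    if windows > 0:
        counts = [c / windows for c in counts]
    return counts


def featurize(sirna, target, k):
    """Assumed input layout: siRNA k-mer vector followed by target k-mer vector."""
    return kmer_vector(sirna, k) + kmer_vector(target, k)


vec = kmer_vector("ACGU", 2)
print(len(vec), len(featurize("ACGUACGUACGUACGUACG", "A" * 57, 3)))
```

Each sequence contributes 4^k features (16, 64, 256 and 1,024 for k = 2 to 5), so larger k values give sparser vectors for a 19 nt siRNA.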

Build


First, we had to build our training dataset, which needed to be rich and balanced to ensure high data quality and support model performance. We conducted a literature review and identified 5 different siRNA silencing efficacy datasets: Huesken, Takayuki, Mixset, Shabalina and Simone. The Mixset dataset comprises siRNA antisense sequences from 7 different sources: Amarzguioui, Harborth, Hsieh, Khvorova, Reynolds, Vickers and Ui-Tei. We then preprocessed and curated the data (details are given below), resulting in a harmonized training dataset, referred to below as siRBench, that comprises 4,098 siRNA antisense strands along with their respective silencing efficacy values.

We obtained the Huesken, Takayuki and Mixset data through previous publications on siRNA efficacy prediction, which provided .csv files containing the siRNA antisense strand, the extended target mRNA sequence and siRNA efficacy values.

Sequence orientation standardization: We extracted the central 19 nt target mRNA sequence and determined strand orientation by checking whether the siRNA was the complement (3'-5') or the reverse complement (5'-3') of the target site. Huesken and Mixset sequences were already in 5'-3' orientation, while Takayuki sequences were 3'-5' and required reversal before integration into siRBench.
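The orientation check described above can be sketched as follows. This is an illustrative version rather than the exact script we used; the function names are hypothetical, and sequences are assumed to contain only A, C, G and U (T is converted to U).

```python
COMPLEMENT = str.maketrans("ACGU", "UGCA")


def to_rna(seq):
    return seq.upper().replace("T", "U")


def orientation(sirna, target_site):
    """Classify the siRNA relative to its exact target site (read 5'-3')."""
    sirna, target_site = to_rna(sirna), to_rna(target_site)
    complement = target_site.translate(COMPLEMENT)
    if sirna == complement[::-1]:
        return "5'-3'"      # reverse complement: already in the desired orientation
    if sirna == complement:
        return "3'-5'"      # plain complement: must be reversed
    return "mismatch"


def standardize(sirna, target_site):
    """Return the antisense strand in 5'-3' orientation, or raise on mismatch."""
    found = orientation(sirna, target_site)
    if found == "5'-3'":
        return to_rna(sirna)
    if found == "3'-5'":
        return to_rna(sirna)[::-1]
    raise ValueError("siRNA is not complementary to the target site")


print(orientation("CGUU", "AACG"), standardize("UUGC", "AACG"))
```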

Efficacy normalization: All efficacy values from these sources were already normalized to the [0,1] range, where higher values indicate stronger silencing activity.

Cell line annotation: Cell lines were H1299 for Huesken and HeLa for Takayuki. Since the Mixset dataset aggregated sequences from 7 different sources, we traced each sequence back to its original publication to assign the appropriate experimental cell line for each entry.

The siRNA dataset from Shabalina et al. comprised 653 antisense strand sequences with reported gene-silencing activity values, GenBank accession numbers for target mRNAs and target site coordinates. Unlike other sources, efficacy values were reported as non-negative real numbers where 0 indicated complete knockdown and higher values represented weaker silencing, the inverse of our desired scale.

We used the provided GenBank accession numbers and target site coordinates to retrieve extended target mRNA sequences. However, we discovered that siRNA sequences were not always fully complementary to the reported target regions. To resolve this, we downloaded the complete target mRNA sequences and performed exhaustive searches to identify exact complementary sites for each siRNA. This approach located complementary targets for 650 of 653 siRNAs; the remaining 3 were excluded from siRBench.
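A minimal sketch of such an exhaustive search is shown below: the reverse complement of the antisense strand is searched for along the full mRNA, and a 57 nt window (19 nt site plus 19 nt flanks) is cut around each hit. The function names are hypothetical, and padding with "N" when a site lies near the end of the transcript is an assumption of this sketch, not a documented rule of our pipeline.

```python
COMPLEMENT = str.maketrans("ACGU", "UGCA")


def to_rna(seq):
    return seq.upper().replace("T", "U")


def reverse_complement(seq):
    return to_rna(seq).translate(COMPLEMENT)[::-1]


def find_target_sites(antisense, mrna):
    """0-based start positions of every exact complementary site in the mRNA."""
    site = reverse_complement(antisense)  # sequence expected in the mRNA
    mrna = to_rna(mrna)
    positions = []
    start = mrna.find(site)
    while start != -1:
        positions.append(start)
        start = mrna.find(site, start + 1)  # allow overlapping hits
    return positions


def extended_site(mrna, start, site_len=19, flank=19, pad="N"):
    """Target site with flanks on both sides, padded at transcript ends."""
    mrna = to_rna(mrna)
    left = mrna[max(0, start - flank):start].rjust(flank, pad)
    right = mrna[start + site_len:start + site_len + flank].ljust(flank, pad)
    return left + mrna[start:start + site_len] + right


# Synthetic example, not a real transcript.
site = "AUGGCUAGCUAGGAUCCAU"
mrna = "G" * 25 + site + "C" * 25
hits = find_target_sites(reverse_complement(site), mrna)
window = extended_site(mrna, hits[0])
print(hits, len(window))
```

An siRNA for which the search returns no position has no exact complementary site in the transcript and would be excluded.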

We first normalized activity values to the [0,1] range. To align with other datasets where higher values indicate stronger silencing, we applied the transformation 1−α for each normalized activity value α. This ensured that all siRBench efficacy scores follow the same interpretation: 0 represents no silencing and 1 represents complete knockdown.
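The two-step rescaling can be sketched as below. The normalization method is not specified above, so min-max scaling is assumed here; the function name and the input values are hypothetical.

```python
def invert_activity(values):
    """Min-max normalise to [0, 1], then flip so that 1 = complete knockdown."""
    lo, hi = min(values), max(values)
    if hi == lo:
        raise ValueError("all activity values are identical")
    normalized = [(v - lo) / (hi - lo) for v in values]   # alpha in [0, 1]
    return [1.0 - alpha for alpha in normalized]          # efficacy = 1 - alpha


# Hypothetical residual-activity values: 0 = complete knockdown.
print(invert_activity([0.0, 0.5, 2.0]))
```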

All Shabalina sequences were in 5'-3' orientation, and experiments were conducted in HeLa cells.

The Simone dataset comprises 322 siRNA sequences obtained from the siRNADiscovery GitHub repository. The data was provided in three separate files: two FASTA files containing the siRNA sequences and the full-length target mRNA sequences, and one CSV file mapping siRNA-target pairs (via FASTA IDs) to efficacy values. We merged these files into a single CSV file containing three columns: the siRNA sequence, the extended target mRNA site, and the efficacy value. All siRNA sequences from the Simone dataset were 21 nt in length. To maintain consistency with our 19 nt standard, we trimmed the last 2 nt of each siRNA sequence.

The efficacy values were already normalized to the [0,1] range with higher values indicating stronger silencing, matching siRBench specifications. All sequences were in 5'-3' orientation, and experiments were conducted in Hep3B cells.

Duplicate removal was the final essential step before finalizing our dataset. We defined duplicates as siRNA pairs sharing identical antisense strands and extended target mRNA sites.

We first examined each dataset individually. The Huesken, Takayuki and Simone datasets contained no duplicates. The Shabalina dataset contained 3 duplicate siRNAs with identical efficacy scores; we retained one sequence from each pair. Furthermore, the Mixset dataset contained 8 duplicate pairs, from different source studies with conflicting efficacy scores. For these cases, we retained the sequence from the less-represented source to maintain dataset balance, excluding 4 additional siRNAs from siRBench.

We identified 398 duplicate pairs between Mixset and Shabalina with conflicting efficacy values due to different experimental conditions and cell lines. We retained the Mixset-derived efficacy values, as this dataset is more widely used as a benchmark in siRNA prediction studies.
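The cross-dataset rule (keep the Mixset entry when the same siRNA-target pair also appears in Shabalina) can be sketched as follows. The record layout, field names and priority table are hypothetical and chosen only for illustration; the within-Mixset rule based on source representation is not covered by this sketch.

```python
def deduplicate(records, priority):
    """Keep one record per (antisense, extended target) pair.

    priority maps a source name to a rank; the lowest rank wins a conflict.
    """
    kept = {}
    for rec in records:
        key = (rec["antisense"], rec["target"])
        current = kept.get(key)
        if current is None or priority[rec["source"]] < priority[current["source"]]:
            kept[key] = rec
    return list(kept.values())


# Toy records with placeholder sequences and efficacy values.
records = [
    {"antisense": "seqA", "target": "siteA", "efficacy": 0.2, "source": "Shabalina"},
    {"antisense": "seqA", "target": "siteA", "efficacy": 0.8, "source": "Mixset"},
    {"antisense": "seqB", "target": "siteB", "efficacy": 0.5, "source": "Shabalina"},
]
unique = deduplicate(records, priority={"Mixset": 0, "Shabalina": 1})
print([(r["antisense"], r["source"], r["efficacy"]) for r in unique])
```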

This comprehensive preprocessing procedure yielded a balanced, normalized and high-quality training dataset maximizing the number of unique, reliable siRNA sequences.

Before model development, we strategically split siRBench into training and test sets based on experimental cell line rather than using random splitting methods like scikit-learn's train_test_split function. This cell line-based approach ensures the model is trained and evaluated on different biological contexts, preventing information leakage and better reflecting real-world scenarios where predictions must generalize to new cellular environments. The training set comprises 3,408 siRNAs with efficacy measurements from H1299 and HeLa cell lines, while the test set contains 690 siRNAs from all other cell lines. The training set was further randomly divided into training (90%) and validation (10%) subsets.
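A cell line-based split of this kind can be sketched as follows. The record layout, the function name and the fixed random seed are assumptions made for illustration.

```python
import random

TRAIN_CELL_LINES = {"H1299", "HeLa"}


def split_by_cell_line(records, val_fraction=0.1, seed=0):
    """Return (train, validation, test) lists split by experimental cell line."""
    train_pool = [r for r in records if r["cell_line"] in TRAIN_CELL_LINES]
    test = [r for r in records if r["cell_line"] not in TRAIN_CELL_LINES]
    rng = random.Random(seed)          # fixed seed so the split is reproducible
    rng.shuffle(train_pool)
    n_val = round(len(train_pool) * val_fraction)
    return train_pool[n_val:], train_pool[:n_val], test


# Toy records: 20 from training cell lines, 5 from a held-out cell line.
records = [{"id": i, "cell_line": "HeLa" if i % 2 else "H1299"} for i in range(20)]
records += [{"id": 100 + i, "cell_line": "Hep3B"} for i in range(5)]
train, val, test = split_by_cell_line(records)
print(len(train), len(val), len(test))
```

Because no test record shares a cell line with the training pool, the test error reflects generalization to an unseen cellular context rather than interpolation within a familiar one.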

The final part of the “Build” step was to write the code for the SVR and XGBoost models. The main parts of the code are shown in the pictures below.

Figure 1: Code for the Development of the Support Vector Regression Model
Figure 2: Code for the Development of the XGBoost Model

Test


As a metric for the model performance evaluation, we used the Mean Squared Error (MSE). The table below includes the calculated MSE (with 4 decimal digits) for the different values of k that we tried, for both machine learning models.

k | SVR Model (MSE) | XGBoost Model (MSE)
2 | 0.0702 | 0.0733
3 | 0.0653 | 0.0738
4 | 0.0733 | 0.0749
5 | 0.0792 | 0.0764
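For reference, the MSE is the average squared difference between predicted and measured efficacy. The sketch below is a plain-Python illustration with hypothetical values; it is not the evaluation script behind the table.

```python
def mean_squared_error(y_true, y_pred):
    """Average squared difference between measured and predicted values."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


# Hypothetical efficacies on the 0-1 scale.
print(mean_squared_error([0.0, 1.0], [0.5, 0.5]))
```

Since efficacy lies between 0 and 1, taking the square root gives an error in the same units: an MSE of 0.0653, for example, corresponds to a root-mean-square error of roughly 0.26.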

Learn


Through this cycle, we built a harmonized dataset by integrating multiple datasets and preprocessing and curating the data. We also built two different machine learning models for siRNA silencing efficacy prediction and evaluated their performance. From the MSE table above, we conclude that both models performed similarly, with the best result (MSE = 0.0653) obtained by the Support Vector Regression model with k = 3 for the k-mer method.

We moved on to complete more engineering cycles for this model development, with the goal of improving model performance. In this step, we brainstormed ways to improve model efficiency and decided to implement these two ideas:

  • Conduct literature review on features that affect the siRNA silencing efficacy, and apply feature engineering in order to help the model make predictions based not only on the input sequences, but also on engineered thermodynamic, structural and sequence features.
  • Conduct literature review for more advanced machine learning models to employ for this task and therefore, improve efficiency.

Iteration 2: Improving Training Dataset and Employing A More Advanced ML Model


Design


Based on the findings of the previous engineering cycle, we first conducted a literature review on the features that affect siRNA silencing efficacy, with the goal of applying feature engineering. We hypothesized that adding engineered (pre-calculated) sequence features to the model input would improve model performance, since the model could then base its predictions on features that it could not derive on its own.

Through extensive literature review, we identified 100 thermodynamic and structural features to enrich our training dataset. All feature values are rounded to three decimal places. These features are listed below with descriptions of their biological significance.

Thermodynamic Features

  • ends: Terminal asymmetry metric comparing 5′ and 3′ end stability of the siRNA duplex
  • DG_total: Total Gibbs free energy (ΔG) of duplex formation calculated from nearest-neighbor parameters
  • DH_total: Total enthalpy change (ΔH, kcal/mol) of duplex formation from nearest-neighbor parameters
  • DG_pos1..18: Per-step stacking ΔG along the duplex at each of the 18 nearest-neighbor positions in the 19-bp region
  • DH_pos1..18: Per-step stacking ΔH at corresponding positions

Structural Constraint Energies (RNAfold/RNAcofold)

  • single_energy_total: Minimum free energy (MFE) of the isolated guide strand (RNAfold)
  • single_energy_pos1..19: Position-specific accessibility of each nucleotide in the guide strand (RNAfold)
  • duplex_energy_total: Hybridization ΔG for the guide-target duplex (RNAcofold)
  • duplex_energy_sirna_pos1..19: Per-nucleotide contribution of the guide strand to duplex energy at each aligned position (RNAcofold)
  • duplex_energy_target_pos1..19: Per-nucleotide contribution of the target site to duplex energy (RNAcofold)

RNAup Accessibility Features

  • RNAup_open_dG: Energy cost to unpair interacting regions on both guide and target strands prior to binding
  • RNAup_interaction_dG: Hybridization energy gained upon binding of unpaired regions

For model fitting, we used 90% of the training dataset for model training and the remaining 10% as the validation set.

Afterwards, we proceeded to search for more advanced machine learning models to employ for our task. We deemed that by using a model that adapts better to the characteristics of our task (multiple calculated features, medium-sized dataset, potential non-linear relationships), the model performance would be enhanced. After literature review, we selected LightGBM (Light Gradient Boosting Machine), an efficient implementation of gradient boosting decision trees.

Unlike linear models, LightGBM captures non-linear feature interactions, where silencing depends on the complex interplay of thermodynamic stability, accessibility, and positional effects rather than simple linear rules. LightGBM builds an ensemble of decision trees sequentially, with each tree correcting errors from previous ones. Its leaf-wise growth strategy expands branches that maximize error reduction, enabling the model to learn highly specific patterns. Efficiency optimizations including histogram-based feature binning, gradient-based sampling, and exclusive feature bundling make it faster and more memory-efficient than other gradient boosting implementations.
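The idea of trees that sequentially correct the errors of earlier ones can be shown with a toy example. The sketch below is not LightGBM: it boosts one-split trees (stumps) on a single feature with squared error, and omits leaf-wise growth, histogram binning and the other optimizations named above. All names and the synthetic data are hypothetical.

```python
def fit_stump(xs, residuals):
    """Least-squares split of one feature into two leaves (needs >= 2 distinct x)."""
    best = None
    for threshold in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((r - left_mean) ** 2 for r in left)
               + sum((r - right_mean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    return best[1:]


def boost(xs, ys, n_rounds=50, learning_rate=0.1):
    """Fit stumps one after another, each on the residuals left so far."""
    base = sum(ys) / len(ys)
    preds = [base] * len(ys)
    stumps = []
    for _ in range(n_rounds):
        residuals = [y - p for y, p in zip(ys, preds)]
        threshold, left_mean, right_mean = fit_stump(xs, residuals)
        stumps.append((threshold, left_mean, right_mean))
        preds = [p + learning_rate * (left_mean if x <= threshold else right_mean)
                 for x, p in zip(xs, preds)]
    return {"base": base, "learning_rate": learning_rate, "stumps": stumps}


def predict(model, x):
    total = model["base"]
    for threshold, left_mean, right_mean in model["stumps"]:
        total += model["learning_rate"] * (left_mean if x <= threshold else right_mean)
    return total


xs = [i / 10 for i in range(11)]
ys = [4 * x * (1 - x) for x in xs]   # synthetic non-linear target, not siRNA data
model = boost(xs, ys)
baseline_mse = sum((y - model["base"]) ** 2 for y in ys) / len(ys)
boosted_mse = sum((y - predict(model, x)) ** 2 for x, y in zip(xs, ys)) / len(ys)
print(baseline_mse, boosted_mse)
```

Each round lowers the training error a little, and a curved relationship that no single straight line could fit is approximated by the sum of many small step functions.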

The advantages for our dataset are:

  • Efficiently handles medium-sized datasets (~4,100 siRNAs) without requiring the massive training sets needed for deep learning.
  • Naturally processes correlated and position-specific features (DG_pos1..18, DH_pos1..18, duplex energies) to identify non-linear patterns that simpler models cannot capture.

Build


After completing the “Design” step of this engineering cycle, we moved on to write code for both feature engineering and LightGBM model development. Full documentation of these aspects of our modeling is available in our GitLab Repository.

Full documentation for the final training dataset creation can be found in our GitLab Repository.

Full documentation for LightGBM Model development can be found in our GitLab Repository.

Test


The next step after we built our enriched training dataset and trained the model was to evaluate the model performance. We used the Mean Squared Error (MSE) metric, and added the Pearson Correlation Coefficient (R) metric, in order to measure the linear correlation between the predicted and the actual silencing efficacies. The table below shows the metrics values.

Metric | LightGBM Model
Mean Squared Error (MSE) | 0.0529
Pearson Correlation Coefficient (R) | 0.5454
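The Pearson correlation coefficient measures how closely predicted and measured efficacies follow a straight-line relationship. A plain-Python sketch is given below for reference; the function name and the example values are hypothetical, and this is not the evaluation script behind the table.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient between two equally long sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if spread_x == 0 or spread_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (spread_x * spread_y)


# Hypothetical measured vs predicted efficacies.
print(pearson_r([0.1, 0.4, 0.9], [0.2, 0.5, 0.7]))
```

Squaring R gives the share of variance captured by a linear fit: R = 0.5454 corresponds to R² ≈ 0.30.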

Learn


Compared to the previous models we implemented, we observed a clear improvement in predictive accuracy, as the Mean Squared Error (MSE) decreased to 0.0529. A lower MSE indicates that the predicted siRNA efficacies are, on average, closer to the experimental values, meaning the model has become more precise.

Additionally, the Pearson correlation coefficient (R = 0.5454) shows that there is a moderate positive linear correlation between our predictions and the true values. In practice, this means that the model successfully captures the trend of siRNA activity: when experimental efficacy is high, predicted efficacy also tends to be high, and vice versa. Although not perfect, this correlation confirms that the model is learning meaningful biological patterns rather than random noise.

At this stage, our LightGBM model already provided satisfactory results, showing both a reduced prediction error and a meaningful correlation with experimental values. We could have stopped here and considered this our final predictive tool. However, we wanted to explore whether a different machine learning approach, based on a distinct underlying mechanism, could further improve performance and capture additional patterns in the data.

To achieve this, we recognized the need for a deeper literature review on alternative architectures and modeling strategies, which would guide us in selecting and implementing the next model in our pipeline. Therefore, we implemented one more engineering cycle for the model development.

Iteration 3: Employing A Different Advanced ML Model


Design


After implementing and evaluating LightGBM, we expanded our modeling strategy by exploring a fundamentally different machine learning approach. Our goal was to determine whether an alternative learning mechanism could capture additional patterns and potentially improve predictive performance.

Given our tabular dataset with complex non-linear dependencies, we focused our literature review on models specifically designed for tabular data prediction. We selected the TabPFN (Tabular Prior-Data Fitted Network) regressor, a transformer-based model pre-trained on millions of synthetic tasks that enables accurate predictions with minimal hyperparameter tuning.

TabPFN leverages a transformer architecture and extensive pre-training to capture complex, non-linear feature interactions in tabular data. It does not require dataset-specific gradient training: the training data are supplied to the pre-trained network as context at prediction time. This fundamentally different mechanism from gradient boosting allowed us to explore whether alternative model architectures could enhance siRNA efficacy prediction.

Regarding our training dataset, we did not make any changes to it for this iteration of the engineering cycle.

Build


We moved on to write code for the TabPFN Regressor Model. Full documentation for TabPFN model development is available in our GitLab Repository.
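Because the training dataset was left unchanged, both models can be scored on an identical split. The sketch below shows one way such a model-agnostic evaluation could look. It is an illustration under stated assumptions, not our repository code: `evaluate` and `FirstFeatureLinear` are hypothetical names, the toy data are made up, and the stand-in model only exists so that the example runs without extra packages.

```python
import math


class FirstFeatureLinear:
    """Hypothetical stand-in model: least-squares line on the first feature only."""

    def fit(self, X, y):
        xs = [row[0] for row in X]
        mean_x, mean_y = sum(xs) / len(xs), sum(y) / len(y)
        sxx = sum((x - mean_x) ** 2 for x in xs)
        sxy = sum((x - mean_x) * (t - mean_y) for x, t in zip(xs, y))
        self.slope_ = sxy / sxx if sxx else 0.0
        self.intercept_ = mean_y - self.slope_ * mean_x
        return self

    def predict(self, X):
        return [self.intercept_ + self.slope_ * row[0] for row in X]


def evaluate(model, X_train, y_train, X_test, y_test):
    """Fit on the training split, then report MSE and Pearson R on the test split."""
    model.fit(X_train, y_train)
    preds = [float(p) for p in model.predict(X_test)]
    n = len(y_test)
    mse = sum((t - p) ** 2 for t, p in zip(y_test, preds)) / n
    mean_t, mean_p = sum(y_test) / n, sum(preds) / n
    cov = sum((t - mean_t) * (p - mean_p) for t, p in zip(y_test, preds))
    var_t = sum((t - mean_t) ** 2 for t in y_test)
    var_p = sum((p - mean_p) ** 2 for p in preds)
    # R is undefined when either series is constant; report NaN in that case.
    r = cov / math.sqrt(var_t * var_p) if var_t > 0 and var_p > 0 else float("nan")
    return {"mse": mse, "pearson_r": r}


# Toy data for illustration only (two features per row, made-up targets).
X_train = [[0.0, 1.0], [1.0, 0.0], [2.0, 1.0], [3.0, 0.0]]
y_train = [0.1, 0.3, 0.5, 0.7]
X_test = [[0.5, 1.0], [1.5, 0.0], [2.5, 1.0]]
y_test = [0.2, 0.4, 0.6]

metrics = evaluate(FirstFeatureLinear(), X_train, y_train, X_test, y_test)
print(metrics)
```

Any regressor exposing scikit-learn-style `fit` and `predict` methods could be passed to `evaluate` in place of the stand-in, for example `lightgbm.LGBMRegressor()` or `tabpfn.TabPFNRegressor()`, assuming those packages are installed.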

Test


To evaluate model performance, we again used the Mean Squared Error (MSE) and the Pearson Correlation Coefficient (R). The table below shows the metric values.

TabPFN Regressor Model
Mean Squared Error (MSE): 0.0665
Pearson Correlation Coefficient (R): 0.5576

Learn


From the evaluation of the TabPFN regressor, we obtained a Mean Squared Error (MSE) of 0.0665 and a Pearson correlation coefficient (R) of 0.5576 on the test set. This MSE is about 26% higher than the one achieved by LightGBM (0.0529), although it still reflects a reasonable prediction error. On the other hand, the Pearson correlation coefficient is slightly higher than that of LightGBM (0.5454), a difference of about 0.012, showing that TabPFN captured the overall trend between predicted and true efficacies marginally better.

This comparison highlights the trade-off between the two models:

  • LightGBM achieved lower prediction error, meaning its predictions were closer to the experimental values on average.
  • TabPFN showed a marginally stronger linear correlation with the true values, suggesting it tracked the overall direction of siRNA efficacy trends slightly better, even though its individual predictions deviated more in absolute terms.

Considering these complementary strengths, we propose both models as part of our modeling strategy, since each offers distinct advantages for understanding and predicting siRNA silencing efficacy.
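The following toy example illustrates how one model can have the lower MSE while another has the higher Pearson R. The two sets of predictions are invented for illustration and do not correspond to our LightGBM or TabPFN outputs; the helper names are likewise hypothetical.

```python
import math


def mse(y_true, y_pred):
    """Mean squared error."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def pearson_r(y_true, y_pred):
    """Pearson correlation coefficient (assumes non-constant inputs)."""
    n = len(y_true)
    mean_t, mean_p = sum(y_true) / n, sum(y_pred) / n
    cov = sum((t - mean_t) * (p - mean_p) for t, p in zip(y_true, y_pred))
    var_t = sum((t - mean_t) ** 2 for t in y_true)
    var_p = sum((p - mean_p) ** 2 for p in y_pred)
    return cov / math.sqrt(var_t * var_p)


# Invented values for illustration only.
measured = [0.2, 0.4, 0.6, 0.8]

# Model A: every prediction is close to the measurement, but the small
# errors point in different directions, so the linear trend is imperfect.
model_a = [0.25, 0.35, 0.65, 0.75]

# Model B: predictions follow the trend exactly (0.3 + 0.5 * measured),
# but are compressed toward the middle, so absolute errors are larger.
model_b = [0.3 + 0.5 * m for m in measured]

for name, preds in (("A", model_a), ("B", model_b)):
    print(f"Model {name}: MSE = {mse(measured, preds):.4f}, "
          f"R = {pearson_r(measured, preds):.4f}")
```

Here model A has the smaller error and model B the stronger correlation, which is why the choice of model depends on whether absolute efficacy values or relative trends matter more for the task at hand.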

Finally, we recognize that further engineering cycles could involve the construction of deep learning architectures tailored to RNA data. These may include Convolutional Neural Networks (CNNs) to capture local sequence motifs, LSTMs and attention layers to learn sequential dependencies, and the integration of pre-trained RNA language models such as RNA-FM. Such approaches could provide even more powerful modeling capabilities in future iterations of our work.
