Antisense oligonucleotides (ASOs) have demonstrated clinical success in several genetic diseases, but their therapeutic potential in cancer remains largely unrealized. Tumor heterogeneity, delivery barriers, and off-target effects have limited efficacy, even though ASOs uniquely enable precise modulation of RNA, including targets considered undruggable. Designing potent and specific ASOs therefore remains a major challenge, often constrained by labor-intensive experimental optimization.
We developed ASOdesigner, a stand-alone machine-learning framework for ASO sequence generation and evaluation. The process begins with comprehensive data cleanup and validation, which involves locating many target genes across the human genome, including SNP-containing variants, and verifying corresponding experimental inhibition data retrieved from Lens.org and other public sources. We then apply an ODE-based kinetic model to infer an efficiency variable representing ASO–RNA hybridization dynamics, which serves as the target of the machine-learning predictor. The model learns this variable from an extensive feature library encompassing structural, thermodynamic, biochemical, and transcriptomic properties, including folding free energy, duplex hybridization energy, local GC/AT skew, repetitiveness, toxic motif frequency, RNase H1 cleavage preferences, positional nucleotide entropy, codon-usage bias, mRNA half-life, transcript accessibility, secondary-structure context, and global expression level. The full ASODesigner source code is available on GitHub: ASOdesign. The final model and production software used in this work are hosted in our iGEM repository: Software Tools.
The next stage performs genome-wide off-target analysis, integrating Watson–Crick mismatch tolerance and homology mapping using a bow-tie-indexed database and a modified RIsearch hybridization engine. Finally, top-ranked ASOs undergo molecular-dynamics simulations executed on a high-performance computing (HPC) cluster with Slurm-based parallelization, enabling large-scale evaluation of conformational stability and validation of predicted binding efficiency.
ASOdesigner operates independently from, but interoperates with, additional computational modules: a gene interaction network graph identifying selective and non-toxic gene targets, and an antibody–epitope design module enabling targeted delivery and immune activation. Our antibody–epitope modeling uses PSSM-based sequence optimization to identify insertion sites that preserve antibody structure while ensuring efficient ASO delivery, complemented by expression optimization using MNDL and ESO stability predictions to achieve high yield. We further applied epitope presentation prediction algorithms to identify which neo-epitopes are most likely to be displayed on cancer cells, ensuring that immune activation aligns with naturally presented tumor antigens.
Together, these models form an integrated computational pipeline-from ASO design to delivery-offering a transferable framework that other teams can adapt for therapeutic modeling, advancing ASO-based cancer therapeutics toward practical viability.
Our model integrates multiple components that together form the complete ASO design and analysis pipeline. Each icon below represents one of these key elements - CLICK on any icon to explore how our model connects to that part in more detail.
Our project integrates three complementary modeling frameworks that together capture every layer of therapeutic design—from molecular efficiency to system-level targeting and delivery.
1. ASOdesigner.
We developed a stand-alone machine-learning framework for ASO sequence generation and
evaluation. After comprehensive data cleanup and validation—locating many target genes
across the genome, including SNP-containing variants, and verifying inhibition data from
Lens.org—we applied an ODE-based kinetic model to infer a quantitative efficiency
variable that summarizes experimental ASO activity.
This variable serves as the target of a machine-learning model trained on a rich feature
library encompassing structural, thermodynamic, biochemical, and transcriptomic
descriptors:
folding free energy, hybridization energy, GC/AT skew, repetitiveness, toxic motif
frequency, RNase H1 cleavage preferences, codon-usage bias, mRNA half-life, transcript
accessibility, secondary-structure context, and global expression level.
We further perform genome-wide off-target analysis using a bow-tie-indexed homology
search and a modified RIsearch hybridization engine, while molecular-dynamics
simulations on an HPC cluster (Slurm-based parallelization) validate the stability and
persistence of top candidates.
2. Gene interaction network.
A graph-based network model identifies synthetic-lethal gene pairs, allowing us to
select cancer-specific, non-toxic targets.
Predictions were validated against published synthetic-lethality datasets, confirming
biological plausibility and guiding wet-lab priorities.
3. Antibody–epitope modeling.
For targeted delivery and immune activation, we designed an antibody–epitope integration
module using PSSM-based sequence optimization to locate insertion sites that preserve
antibody structure while ensuring efficient ASO delivery.
Expression was further optimized using MNDL and ESO stability predictions to achieve
high yield.
Finally, epitope presentation prediction algorithms identified neo-epitopes most likely
to be displayed by cancer cells, aligning immune activation with naturally presented
tumor antigens.
Together, these frameworks form a unified and modular computational pipeline—from ASO efficiency prediction to target prioritization and immune-optimized delivery—demonstrating strong originality, technical sophistication, and biological integration.
Modeling revealed several biological and experimental principles
that changed how we interpret
ASO performance and design.
We found that dose and treatment period have a strong influence on
measured inhibition and must be
corrected for to ensure comparability between datasets.
Across cell lines, the importance of specific features varied substantially,
highlighting the need for a
modest, low-bias feature set or additional variables to describe
cell-specific responses.
Our analysis showed that AT-skew and multiple folding-related
features are major
determinants of ASO efficiency, suggesting that structural factors play a larger
role than previously
assumed.
We also found that RNase H1 sequence motifs contribute measurable
predictive value to inhibition
outcomes, supporting their inclusion in future models.
While off-target features are known to explain toxicity, our analysis also revealed a
modest positive correlation with on-target inhibition, likely due to
competition with partially
complementary transcripts.
Molecular-dynamics simulations helped us to understand that structural
biophysical features such as
local base pair stagger are correlative with ASO performances.
Through this process, we learned that even within selective synthetic-lethal contexts,
suitable targets are rare once toxicity and essentiality are considered
— highlighting the value of
gene-interaction networks for narrowing viable options.
Moreover, research into what constitutes a true loss-of-function
mutation can dramatically change the
estimated number of patients carrying actionable biomarkers, redefining therapeutic
reach.
In the antibody-epitope module, PSSM-based optimization and expression
modeling showed that many
theoretically valid insertion sites disrupt folding or yield, underscoring the need for
structural and translational constraints in design.
Together, these findings clarified the biological limits within which
computational optimization remains effective
and turned modeling into a framework for understanding — not just predicting —
therapeutic behavior.
Our team extensively relied on experimental measurements throughout the project. Our ASO model was built upon experimentally derived datasets from the literature, which we used to train and validate our features. Many of these features themselves were engineered using quantitative measurements that were not readily available computationally.
Target selection relied on data identifying biomarkers prevalent in NSCLC patients, as well as experimental databases and literature-based sources. To read more, visit our Target Selection section .
For the immune component, neo-epitope discovery and optimal antibody-insertion site prediction were guided by experimentally validated datasets. To read more, visit our Immune System Engagement & Antibody Optimization section .
We used experimentally measured structures and thermodynamic data as the basis for molecular dynamics (MD) simulations of ASO–target complexes, allowing us to extract new quantitative features—such as structural stability and base-pair persistence—to further improve our model’s predictive accuracy.
The consistent improvement of our model over the existing reference sequence suggests that the additional features we incorporated— PS-based hybridization weights, diverse RNA folding techniques, sequence-derived metrics, RNase H motifs, and others— meaningfully enhance predictive performance. We believe that continuing to expand along this direction will yield even greater improvements.
Our modeling approach demonstrates how integrating multiple computational layers can produce both biologically interpretable and predictive outcomes. We introduced a hybrid ODE–machine learning (ML) framework that reduces overfitting and bias by incorporating biological constraints while still capturing non-linear relationships. This hybrid design balances mechanistic understanding with predictive power—an approach broadly applicable to synthetic and systems biology.
We further combined high-resolution molecular dynamics (MD) simulations with low-resolution ML models, using coarse nearest-neighbor weights to approximate hybridization. The ML layer provides generalizable direction, while MD offers atomistic validation—bridging macroscopic learning with molecular precision. This structure can guide other teams seeking to connect large-scale predictions with biophysical mechanisms.
Beyond architecture, several transferable principles in our workflow exemplify good modeling practice:
Finally, our modular pipeline design allows each component—ASO modeling, gene-network prioritization, and epitope optimization—to operate independently yet interconnect meaningfully, serving as a scalable blueprint for future synthetic biology teams.
Antisense oligonucleotides (ASOs) are short, single-stranded DNA or RNA molecules that selectively bind to complementary mRNA sequences to regulate gene expression. Their ability to silence specific genes has made them powerful therapeutic tools for both genetic and oncologic applications. However, designing effective and safe ASOs remains a challenge due to factors such as RNA structure, hybridization dynamics, off-target binding, and chemical modifications. Our model integrates these parameters—together with systems-level analyses such as synthetic lethality and immune activation—to create a unified, multi-layered framework for precise ASO design and delivery.
For several decades, ASOs have been a promising therapeutics tool, targeting disease at the mRNA level with considerable potential in both genetic disorders and cancer therapy [1], [2]. ASOs are typically short, single-stranded DNA or RNA molecules, 16–22 nucleotides in length, that bind to complementary mRNA sequences through Watson–Crick base pairing. This interaction can regulate gene expression by a variety of mechanisms, the most common being RNase H-mediated cleavage of the RNA strand (Figure 1). Their ability to target specific proteins and induce knockdown makes them powerful tools, with more than ten drugs already approved by the FDA and many additional candidates under clinical evaluation [3], [4]. ASO efficacy depends critically on two core elements, the nucleotide sequence and chemical modifications, which together influence efficacy, target specificity, nuclease resistance and safety [4], [5].
The design of an ASO typically starts with selecting regions complementary to the target mRNA. Although sequences around 20 nucleotides in length are usually sufficient to achieve genomic and transcriptomic specificity, partial matches or mismatched binding can still result in off-target interactions. In addition to sequence complementarity, the structural features of the target mRNA, such as folding patterns and secondary structures, play an important role in determining binding accessibility. Furthermore, chemical modifications introduced into the ASO backbone influence its stability, affinity for the target, and interactions with cellular proteins [1]. Consequently, successful ASO development depends on accurately predicting both target engagement and the off-target effects.
A range of computational tools support ASO design, each addressing specific aspects of the process. Secondary structure prediction models such as Sfold and ViennaRNA can identify accessible regions of the mRNA likely to permit efficient hybridization and even provide numeric characteristics of the structure for better prediction [6]. Analysis of sequence composition, including GC content and characteristic nucleotide motifs, informs predictions of binding stability [7]. Meanwhile, tools such as GGGenome help detect potential off-target interactions by identifying homologous sequences across the transcriptome, considering mismatches, gaps and strand location [8]. Although informative, these tools often operate in isolation, requiring users to combine results manually across platforms, which can be fragmented and time-consuming.
Despite notable advances, current methods show clear limitations. The lack of integrated approaches can reduce prediction accuracy and leave key factors unaddressed. Hybridization dynamics and mRNA secondary structure are sometimes overlooked or only partially modeled, which may compromise predictions [1]. Off-target effects also remain a major challenge due to their potential for unwanted consequences on non-target transcripts [4].
Several specialized computational models have emerged to address these challenges. ASOptimizer employs a deep learning framework that incorporates sequence engineering and chemical modification optimization, showing improved knockdown efficiency in vitro, but only optimizes existing sequence modifications without accounting for properties that may limit effectiveness, such as off-target effects and mRNA folding [9]. PARED systematically generates and annotates ASO candidates, evaluating features such as exon location, SNP overlap, and cross-species alignment. However, it functions primarily as a candidate enumeration and annotation tool, leaving interpretation and prioritization to the researcher [10]. eSkip-Finder, meanwhile, focuses on exon-skipping ASOs using machine learning trained on curated datasets, offering precise predictions within its niche but with limited applicability to general ASO design [11]. Collectively, these advances illustrate meaningful progress toward specialized computational tools, while also underscoring the need for broadly validated, integrated platforms that account for additional molecular and cellular features.
Our goal is to address this gap by developing a multi-step algorithm that integrates the diverse factors influencing ASO activity. This includes binding affinity to the target mRNA, prediction of RNA binding protein motifs that could compete with ASOs on the target, off-target risks, ASO and RNA sequence motifs and chemical modifications of the antisense. Another important aspect is mRNA-specific features such as secondary structure and expression levels, as well as cellular context and RNase H function.
Hybridization occurs between the ASO and the target mRNA which causes the degradation of the mRNA. This connection between the two strands is an important step in the process of degradation which makes it important to incorporate into our model. Molecular dynamic simulations provide the ability to explore structural arrangement, interactions and motion of the oligos with their chemical modifications by solving Newton's equations of motion. This analysis can be used to enhance the ability of our model to consider stability and dynamics of ASOs and hybridization to the mRNA [12].
The design of ASOs also relies on carefully selecting the target gene, with the aim of minimizing off-target effects on healthy cells and reducing toxicity. One strategy to achieve this is based on the concept of synthetic lethality (SL), where the simultaneous inhibition or mutation of two interacting genes leads to cell death, while disruption of only one gene has little or no effect. By targeting a gene that has an SL partner unique to cancer cells, it is possible to selectively kill tumor cells while sparing healthy ones [13]. Incorporating this analysis into ASO design supports our goal of developing cancer therapeutics that are both effective and highly specific in their choice of targets.
Our solution relies on mRNA degradation using our ASOs, delivered by conjugating the oligo to an antibody, with an integrated epitope to create a treatment from several directions. Epitopes are short peptides derived from specific regions of antigens that interact with the immune system by being presented on major histocompatibility complex (MHC) molecules for recognition by T cells [14], [15], [16]. Incorporating an epitope into our design increases its presentation on the tumor cell surface, enhancing tumor-specific immune responses and promoting the formation of immunological memory to reduce the risk of recurrence. Because immunotherapy depends on identifying epitopes capable of binding MHC molecules and arising from tumor mutations, epitope prediction and selection are key steps in our model [14]. In our design, the selected epitope is fused into the antibody structure, and computational modeling is applied to ensure that antibody functionality is preserved while enabling effective immune activation.
By combining these elements within a comprehensive computational framework, with a calculated and precise training process and embedding them into our novel pipeline, we aim to enhance efficacy, reduce toxicity, and ultimately support the design of specific ASOs for tumor cell elimination, and create a multi-step platform to fight cancer.
Our platform overcomes the limitations of existing ASO design tools through a comprehensive, data-driven approach. Learn more in our Comparison To Competitors.
Our goal is to develop a predictive model for the design of all components of the antibody–ASO–epitope conjugate platform for cancer therapy. The model is designed to achieve:
The primary aim of our model is to create a ranking framework that incorporates novel features not considered in previous approaches, in order to improve both specificity and efficacy. Based on this ranking, the model generates the most promising ASO sequences for a given target.
Unique predictive features include:
Prioritizing ASO targets based on synthetic lethality (SL) principles, ensuring selective killing of cancer cells while sparing healthy tissue.
Designing antibodies with optimized expression levels and epitope incorporation, ensuring that the chosen epitopes are effectively displayed on cancer cells, while maintaining structural integrity of the antibody. This enables efficient antibody–oligonucleotide conjugation and proper immune activation alongside ASO activity.
In exploring modeling methods for ASOs, we came across ASOptimizer, a framework designed to optimize ASO design at both the sequence and molecular levels [9]. In their study, the authors constructed a dataset of experimental observations extracted from patents and scientific publications on ASOs. This resource provided us with an opportunity to build our model on real biological data and investigate the behavior of ASOs under different conditions.
The dataset we retrieved contains ~37,000 in vitro entries, each including an ASO sequence along with experimental details such as ASO concentration (volume in the dataset), treatment duration, cell line type, target gene, cell density per well, sequence length, and several other parameters. Entries also specify chemical modifications and their positions within the sequence, different linkage types, SMILES notation, and inhibition percentages measured by qRT-PCR. The inhibition value served as the key readout to assess the effectiveness of an ASO in reducing target mRNA levels.
We began by exploring the dataset using the Python Pandas package. By examining unique values and distributions, we mapped the diversity of target genes, the number of distinct cell lines, and the variation across experimental setups. The ASOptimizer paper placed particular emphasis on the role of chemical modification patterns, and indeed we found the dataset to contain a wide range of modification types and positional arrangements. Distributions of inhibition range, types of experimental properties, sequence length, genes and cell lines are shown in figures 1 and 2.
After the initial exploration, we validated gene names and cell line identifiers. To simplify downstream analysis, we added standardized features such as the canonical gene name and the organism of origin for each cell line. Columns with inconsistent data types (e.g., mixing numeric and string values) were converted to a uniform format. We also checked the validity of the sequence length column and removed entries with missing inhibition values, and averaged repetitive experiments to avoid introducing bias into the model.
To enrich the dataset, we designed functions to identify the location of each ASO sense sequence on the target gene pre-mRNA. For this task, we used the human genome reference (GENCODE, release 34; GRCh38.p13) [17]. In cases where ASO sequences could not be mapped to pre-mRNA, we searched across different transcripts using the Python package gget [18]. For some sequences, further investigation revealed that they originated from patents describing ASOs targeting single nucleotide polymorphisms (SNPs) [19]. For each successfully mapped sequence, we recorded its absolute location in the target and its relative position (normalized by transcript length) as new features.
We next identified and flagged entries with abnormal values, such as unusually high ASO volumes or unrecognized cell lines. Where possible, we verified questionable entries by consulting the original patent records. Flagged entries were annotated in the dataset to enable filtering during downstream analyses (see Figure 3).
Two specific challenges required special handling:
After preprocessing, the dataset was transformed from the raw ~37,000 entries into a consistent and reliable form. The final dataset contained ~35,000 entries with validated sequence, experimental, and modification information, with indications for duplicated entries. Each entry included its ASO sequence, inhibition measurement, and associated experimental metadata, ensuring that the data was valid and informative for further analysis and feature engineering.
Although the dataset included a wide range of genes and cell lines, certain experimental conditions may have introduced substantial bias. In particular, high ASO concentrations and various treatment periods can artificially boost inhibition values, making it difficult to distinguish whether an observed effect reflects the intrinsic properties of the ASO sequence or simply the experimental setup. This complicates interpretation and risks training a model that captures biases in experimental design rather than true sequence–function relationships. An example of the effect of the ASO volume is shown in figure 4.
Another major challenge was handling chemical modifications. Identical ASO sequences can appear multiple times in the dataset with different modification types and positional patterns, which dramatically influence inhibition outcomes. To capture these effects accurately, the model must be designed to treat modifications not as minor variations, but as key determinants of ASO behavior that require dedicated feature encoding. Without explicitly addressing modifications, predictions risk being misleading or incomplete.
Finally, duplicated or inconsistent entries also posed difficulties. While we implemented strategies to merge and clean these records, not all ambiguities could be resolved. Consulting patents to validate questionable entries was particularly time-consuming and could not be applied universally. Despite these challenges, the preprocessing pipeline allowed us to create a structured, enriched dataset with reliable features. This prepared the ground for building models capable of predicting ASO performance based on both sequence and chemical properties.
We begin with a simplified model, assuming the ASO–mRNA reaction proceeds at a rate that depends linearly on reactant concentrations. Because ASO degradation is slow and largely independent of target interactions [Z1], we can treat the ASO concentration as constant throughout the reaction.
Let $V$ denote the ASO concentration (sometimes referred to as volume), $m$ the concentration of target mRNA, and $k$ the efficiency constant of that ASO. The mRNA degradation can be expressed as:
\[ \dfrac{dm}{dt} = -kVm(t) \]
The solution of this first-order differential equation is:
\[ m(t) = m(0)e^{-kVt} \]
Experimentally, mRNA degradation is reported as the inhibition percentage: if $I$ denotes inhibition, $T$ the treatment period, and $m(0)=1$, then
\[ I = 100\,(1 - m(T)) \]
Substituting the analytical solution gives:
\[ \begin{aligned} I &= 100\,(1 - e^{-kVT}) \\ e^{-kVT} &= 1 - \frac{I}{100} \\ -kVT &= \ln(1 - \tfrac{I}{100}) =: I^* \\ -k &= \frac{\ln(1 - \tfrac{I}{100})}{VT} \end{aligned} \]
Here $I^*$ is termed the log-inhibition. Because the dataset occasionally contains $I=100$, which would cause infinite efficiency, we introduce a small bias term $c$ as a log-correction: $\,I^* = \log(100 - I + c)$, with $c = 1$ (≈1% bias).
Two insights follow: (1) correlations with log-inhibition are more meaningful than those with raw inhibition (see Figures 5 and 6), and (2) if the ODE model holds, log-inhibition should vary linearly with both ASO concentration ($V$) and treatment time ($T$).
The table below summarizes Pearson correlations between volume and inhibition before and after the log transformation across several genes and cell lines.
| Cell line | Gene | Samples | Pearson (Inhibition) | p-value (Inhibition) | Pearson (Log-Inhibition) | p-value (Log-Inhibition) |
|---|---|---|---|---|---|---|
| A431 | KRAS | 925 | 0.422 | \(2.95\times10^{-41}\) | 0.477 | \(9.23\times10^{-54}\) |
| A431 | SOD1 | 429 | 0.409 | \(1.03\times10^{-18}\) | 0.409 | \(9.65\times10^{-19}\) |
| A431 | APOL1 | 2134 | 0.368 | \(2.10\times10^{-69}\) | 0.363 | \(1.91\times10^{-67}\) |
| KARPAS-229 | IRF5 | 972 | 0.301 | \(8.73\times10^{-22}\) | 0.419 | \(1.15\times10^{-42}\) |
| MM.1R | IRF4 | 3394 | 0.383 | \(3.47\times10^{-119}\) | 0.483 | \(6.67\times10^{-198}\) |
| SK-MEL-28 | IRF4 | 2836 | 0.295 | \(3.96\times10^{-58}\) | 0.397 | \(7.22\times10^{-108}\) |
| H929 | IRF4 | 73 | -0.044 | \(7.12\times10^{-1}\) | -0.083 | \(4.86\times10^{-1}\) |
| HepaRG | HSD17B13 | 5566 | 0.515 | \(<10^{-300}\) | 0.619 | \(<10^{-300}\) |
| HepG2 | SOD1 | 1625 | 0.365 | \(2.32\times10^{-52}\) | 0.476 | \(1.54\times10^{-92}\) |
| HepG2 | DGAT2 | 2504 | 0.543 | \(8.70\times10^{-192}\) | 0.608 | \(1.30\times10^{-252}\) |
Pearson correlations improved markedly after converting to log-inhibition, supporting the ODE-based transformation. However, when applied to the treatment period ($T$), the relationship was weak or even negative:
Thus, while the simple ODE captures the volume dependence, it does not fully explain the effect of treatment time. We therefore corrected for volume but not for treatment period.
From the equation above, one might attempt a direct normalization $-kT = \dfrac{I^*}{V}$, but measurement noise makes this unreliable. To mitigate this, we fitted a linear regression (OLS) between volume and log-inhibition, obtaining the empirical fit \[ (-kT)(V \cdot 6.01\times10^{-5} + 0.537) = I^*, \] and hence \[ -kT = \dfrac{I^*}{6.01\times10^{-5}V + 0.537} := I^V, \] where $I^V$ is the volume-corrected inhibition. The derived value of $-kT$ was then used as the efficiency descriptor in the machine-learning model.
With the cleaned and validated dataset in place, we next aimed to understand the patterns behind ASO performance. To do so, we translated the experimental observations into quantitative models that connect inhibition outcomes with factors such as concentration, treatment duration, and molecular properties.
The following parts describes our Feature Engineering processes.
Discover how our Machine Learning model combines these features for optimal predictions.
When an ASO binds its target mRNA, hybridization occurs. A common way to quantify this interaction is the melting temperature ($T_m$), defined as the temperature at which half of molecules are duplexed and half remain single-stranded.
An optimal binding affinity has been observed empirically [20] (see Figures 7 and 8). If duplex stability is too low, hybrid formation and RNase H1 recruitment are inefficient. If binding is too strong, off-target hybridization increases. For 16-mer gapmer LNA–ASOs, $T_m \approx 50$–$55\,^\circ$C balances efficacy and safety; values $>55\,^\circ$C correlate with hepatotoxic risk [20]. Another thermodynamic measure of binding strength is the Gibbs free energy, denoted $\Delta G^\circ$.
Accurate $T_m$ or $\Delta G^\circ$ values require in vitro melting experiments (UV or fluorescence), which underpin the nearest-neighbor thermodynamic database [21].
ASOs typically include (i) phosphate-backbone modifications, most commonly complete phosphorothioate (PS) substitution, and (ii) sugar modifications such as 2′-O-MOE or locked nucleic acid (LNA). While PS backbones are nearly universal, sugar chemistry strongly influences hybridization affinity and specificity. Accurate modeling must account for both.
We use several pre-computed hybridization metrics as model features:
We incorporated PS nearest-neighbor weights derived from wet-lab measurements [23]. Values reported at $50\,^\circ$C were converted to $37\,^\circ$C using the standard thermodynamic relation:
\[ \Delta G^\circ(T) = \Delta H^\circ - T\,\Delta S^\circ \]
where $\Delta G^\circ(T)$ is the standard Gibbs free energy of duplex formation at temperature $T$ (in kelvin), $\Delta H^\circ$ is the enthalpy change of hybridization, and $\Delta S^\circ$ is the corresponding entropy change. This linear approximation assumes $\Delta H^\circ$ and $\Delta S^\circ$ remain constant over the temperature range considered, allowing extrapolation of free energies between experimentally measured and physiological conditions. It captures the balance between enthalpic stabilization from base-pairing and stacking, and the entropic cost of ordering two flexible strands into a duplex.
The dataset provided PS/PS, PS/DNA, and DNA/DNA weights. PS/RNA data were unavailable, as RNA strands are less stable and difficult to assess via high-resolution melting. We then estimated:
\[ \Delta G^\circ_{\mathrm{PSDNA\text{–}RNA}} \approx \Delta G^\circ_{\mathrm{DNA\text{–}RNA}} + \bigl(\Delta G^\circ_{\mathrm{PSDNA\text{–}DNA}} - \Delta G^\circ_{\mathrm{DNA\text{–}DNA}}\bigr) \]
This approximation enables rapid on-target PS–RNA energy estimation. However, it is limited to local nearest-neighbor corrections; transcriptome-wide search tools provide broader context, including alternative sites, structural accessibility, and off-target potential.
Molecular-dynamics simulations yielded nearest-neighbor weights for PS-MOE/RNA duplexes [24]. Summing these values across each sequence provides an MOE-aware hybridization feature.
A 2025 tool proposed a heuristic formula for estimating $T_m$ [25]:
\[ T_m = 100.5 + 41(N_G + N_C - N_A - N_T) - \frac{820}{N_A + N_T + N_G + N_C} + 16.6\,\log_{10}\!\bigl([\mathrm{Na}^+]\bigr) \]
where \(N_A, N_T, N_G, N_C\) are the counts of A, T, G, and C nucleotides in the ASO sequence. This heuristic provides a rapid $T_m$ estimate for short oligonucleotides under monovalent-salt conditions, complementing thermodynamics-based calculations.
Other interesting features that relate to hybridization energy are self_energy and
internal_fold. The first is the energy of 2 ASOs binding together, speculating that if
this happens,
then the likelihood to bind to target mRNA is reduced. This was calculated with the primer3.calc_homodimer
function.
internal_fold is the gibbs free energy of an ASO hybridized to itself, calculated with
ViennaRNA
(see Fold section).
While already providing predictive power, we believe that enriching the calculation with proper
chemical weights
of ASOs, and utilizing MD, we can increase predictive power even further.
Overall, we integrated both simplified analytical features and high-resolution, physics-based methods, improving upon existing approaches for modeling ASO hybridization and enhancing on-target prediction accuracy. When combined with fast transcriptome-wide search tools such as RIsearch (see Off-target analysis), these hybridization metrics also enable efficient and accurate off-target assessment, drastically improving genome-scale prediction quality. Future work will incorporate experimentally derived LNA nearest-neighbor weights from the literature, further refining thermodynamic modeling for chemically modified oligonucleotides.
RNase H is an endonuclease that specifically cleaves the RNA strand of an RNA:DNA duplex, while leaving the DNA intact. In human cells, RNase H1 is essential for maintaining genome stability: it removes RNA primers from Okazaki fragments during DNA replication and resolves R-loops formed when RNA hybridizes to complementary DNA.
Unlike sequence-specific nucleases, RNase H1 recognizes the structural feature of the RNA:DNA hybrid rather than a defined sequence motif. However, its cleavage is not uniform — the efficiency and exact cut site depend strongly on the local nucleotide environment surrounding the scissile phosphate [26].
This property is especially important for ASOs. When an ASO binds its target mRNA, it forms a DNA:RNA hybrid that recruits RNase H1. The enzyme then catalyzes degradation of the RNA strand, reducing expression of the target gene. Understanding how RNase H1 recognizes and cleaves different hybrid contexts is therefore central to predicting ASO potency. Certain bases and short sequence motifs promote cleavage, while others inhibit it, creating characteristic patterns of preference.
To systematically map the determinants of RNase H1 activity, the high-throughput RNase H Sequence Preference Assay (H-SPA) developed by Kielpinski et al. (2017) was applied [27]. This assay compares cleaved versus input libraries of randomized RNA:DNA hybrids, allowing precise measurement of sequence influence on cleavage.
The results showed that both single nucleotides and dinucleotide combinations surrounding the scissile phosphate can strongly modulate cleavage efficiency. These findings established that RNase H1 function follows reproducible, sequence-specific rules — knowledge that can be directly integrated into predictive computational models for ASO design.
In the H-SPA assay, randomized RNA:DNA hybrids were subjected to RNase H1 digestion, and cleaved versus input libraries were sequenced. To quantify the contribution of each base, enrichment scores were calculated as:
$$ \omega_i = \frac{f^{\text{cleaved}}_{i}(b)}{f^{\text{input}}_{i}(b)} $$
where \( f^{\text{cleaved}}_{b} \) is the frequency of base \( b \) at position \( i \) in the cleaved population, and \( f^{\text{input}}_{b} \) is its frequency in the input library [27]. These enrichment values were then assembled into position-specific scoring matrices (PSSMs), representing the contribution of each nucleotide at each position relative to the cleavage site. Equivalent matrices were also constructed for dinucleotides, capturing local pairwise sequence effects beyond single bases.
In addition, the authors introduced a relative processing rate to capture kinetic-like activity of cleavage:
$$ k_{\text{real}} = \frac{\log \log \left( 1 - f_{\text{sub}} \right)}{\log \log \left( 1 - f_{\text{ref}} \right)} $$
where \( f_{\text{sub}} \) is the cleaved fraction of a given substrate, and \( f_{\text{ref}} \) is that of a reference substrate.
Both log2 fold-change \(\log_{2}(\text{FC})\) and \( k_{\text{real}} \) values were reported for three experimental constructs, each based on an independent input library and cleavage experiment. This means the full set of enrichment matrices was generated three times under distinct experimental conditions, producing separate weight sets that together provide a robust representation of RNase H1 sequence preferences [27].
Based on the H-SPA–derived weight sets, we defined several RNase H1–related feature families. These features capture different aspects of sequence preference and cleavage efficiency, ranging from single-nucleotide effects to local dinucleotide context and motif presence. The table below summarizes each feature type, its description, and the variants computed.
| Feature type | Description | Variants computed |
|---|---|---|
| Position-specific nucleotide scores | Each base within the ASO window was assigned a weight from the H-SPA-derived PSSM, and the average score (normalized by window length) was calculated. This feature reflects how specific nucleotides at defined positions favor or disfavor RNase H1 cleavage. | 3 experimental sets (R4a, R4b, R7) × 2 weight types (log2(FC), \( k_{\text{rel}} \)) |
| Dinucleotide scores | Similar to the single-nucleotide PSSMs but applied to adjacent base pairs. Each dinucleotide within the ASO window was assigned a weight from the H-SPA-derived PSSM, and the average score (normalized by the number of dinucleotides) was calculated. This feature captures local sequence context effects that single-base models cannot fully explain. | 3 × 2 (as above) |
| Motif presence |
Binary indicators (0/1) marking whether an ASO contains specific motifs identified in H-SPA.
Motifs such as TCCC or GGGA were associated with higher cleavage
efficiency, whereas other motifs were associated with reduced cleavage.
|
Independent of weight sets |
While the original article used fixed windows (14–16 nt for nucleotide weights and ~13 nt for dinucleotide weights), our dataset contained ASOs of variable lengths. To adapt the scoring system, we developed a correlation-optimized windowing strategy (see Figure 9):
This approach ensured that each ASO was scored in a way that best reflected the experimentally relevant determinants of RNase H1 cleavage, rather than assuming a single fixed central window.
In addition to sequence-intrinsic determinants of RNase H1 cleavage, we considered the possibility that off-target binding to transcripts encoding RNASEH1 itself could compromise ASO efficacy. Because ASO activity relies on recruitment of RNase H1 to the intended RNA target, knockdown of the enzyme on unintended transcripts represents a form of “inhibition of the inhibition”. Even modest levels of such off-target binding may therefore have disproportionate functional impact.
To capture this effect, we introduced a feature quantifying the RNASEH1 off-target burden of each ASO. The scoring framework was rooted in statistical mechanics: we enumerated all predicted binding sites across the transcriptome and summed their Boltzmann-weighted contributions to the partition function, effectively calculating a statistical-mechanics–derived Z value for each ASO (energies from RISearch [22]). Sites mapping specifically to RNASEH1 transcripts were extracted to construct an off-target score reflecting the potential for enzyme knockdown.
For each ASO, all predicted RNA binding sites were retrieved using RISearch [22]. Each interaction was assigned an energy value \( \Delta G \), and its contribution to the partition function was computed as:
$$ Z = \sum_{i} e^{-\Delta G_{i}/RT} $$
Where \( RT = 0.616 \, \text{kcal/mol} \) at 310 K, and \( Z \) is the sum over all candidate off-target sites. The subset of contributions mapping specifically to RNASEH1 transcripts was defined as the RNase H1 off-target score.
Across the full dataset, this feature showed negative correlations with ASO inhibition efficiency (Figures 10-12), consistent with the expected biology-ASOs predicted to engage RNASEH1 transcripts tended to be less effective against their intended targets.
When stratifying ASOs by chemistry and length classes, however, more nuanced patterns emerged:
A summary of these group-wise results is provided in Figures 13, 14, highlighting both consistent trends and exceptional cases.
In addition to sequence-derived preferences and off-target risks, the cellular abundance of RNase H1 itself can influence ASO potency. Because RNase H1 is the enzyme responsible for catalyzing cleavage of RNA in ASO–mRNA hybrids, limited enzyme availability may restrict inhibition, whereas higher expression levels may enhance knockdown efficiency.
To capture this context-dependent factor, we extracted $log(\text{TPM}+1)$ expression values for
RNASEH1 across relevant cell lines from the OmicsExpressionProteinCodingGenesTPMLogp1
dataset.
Cell line identifiers were harmonized using Achilles IDs, and values were assigned on a
per-cell-line basis
as a continuous feature (RNASEH1_expression).
Preliminary correlation analyses indicated that RNase H1 expression exhibits variable but biologically plausible associations with measured inhibition. Several cell lines showed the expected positive relationship between higher RNASEH1 expression and stronger ASO activity, consistent with enzyme-limiting behavior.
By integrating experimentally derived PSSMs with our correlation-optimized windowing approach, we translated the sequence preferences of RNase H1 into robust computational features. These features capture both single-base and dinucleotide contributions, as well as motif-level effects, providing a biologically grounded representation of cleavage propensity.
In parallel, we introduced an additional feature quantifying the potential sequestration of RNase H1 through unintended binding to its own transcripts. By embedding this off-target burden into a statistical-mechanics framework, we accounted for the possibility that enzyme availability itself may be compromised by ASO binding. Together, these complementary perspectives- intrinsic sequence determinants of cleavage and extrinsic risks of enzyme sequestration- contribute to a more accurate prediction of ASO activity and highlight the central role of RNase H1 in determining therapeutic efficiency.
RNA folding determines the secondary structure of transcripts, which strongly influences their accessibility for binding by ASOs. Regions that are base-paired or highly structured are less accessible, whereas unpaired or single-stranded regions are more available for hybridization. Therefore, assessment of RNA folding is considered critical in the selection of ASO target sites. In general, sequences predicted to be accessible are more likely to support efficient binding and functional inhibition (see Figure 15). Multiple studies have demonstrated that improved target accessibility can significantly enhance knockdown efficacy [28][29].
To capture local structural stability within each target region, we computed folding-related
features using the ViennaRNA package [30],
which estimates RNA secondary structure and calculates the minimum free energy (MFE) through dynamic
programming algorithms.
For each target RNA, we applied a custom function calculate_avg_mfe_over_sense_region
that scans the sequence with a sliding window to evaluate the average local MFE per nucleotide.
Specifically, windows of 40 nucleotides with a 15-nt step were used. The MFE of each window was
normalized per nucleotide, and the average across overlapping regions was computed to yield the
energy landscape.
The folding features were optimized by systematically varying the window and step sizes to maximize
correlation with the inhibition values.
Step sizes ranging from 1 to 10 and window sizes ranging from 20 to 70 nucleotides were
examined.
For each ASO, the sense region was further extended upstream and downstream using the get_sense_with_flanks
function to account for contextual folding effects around the target site.
The final average MFE feature thus reflects the mean energy of the entire flanked sense
region.
More negative MFE values correspond to stronger RNA folding (i.e., more stable local secondary
structures), while higher values indicate more open conformations (see Figure 16).
Correlation analysis revealed that ASOs targeting regions with less stable (less negative) MFE
values tend to exhibit higher inhibition efficiency (Spearman’s ρ = 0.21, p ≈ 1.5×10⁻²³²),
supporting the biological interpretation that loosely folded regions are more accessible for ASO
binding.
In parallel, average local accessibility was calculated using a sliding-window approach
implemented in the Raccess software [31]
(also referred to as RNAaccess). This algorithm estimates the probability of each nucleotide being
unpaired,
and thus accessible for hybridization with the ASO.
For this purpose, we utilized a custom Python function named AccessCalculator, developed by
Yehuda Landau, a member of the Tuller Lab.
This function receives several key parameters:
(1) seq – the nucleotide sequence for which accessibility is to be computed.
(2) access_size – the length of the target region under evaluation.
(3) in_gc, max_gc – optional constraints controlling the GC content range.
(4) access_win_size – defines the number of nucleotides to extend upstream and downstream of
the target for context-based folding estimation.
(5) seed_sizes – determines the probability that each seed region of a given size remains
unpaired.
The schematic representation in Figure 17 illustrates this process: the mRNA trigger sequence is scanned by overlapping windows, local folding probabilities are computed within each window, and accessibility values are aggregated across the binding segment. This visualization helps to understand how the moving-window algorithm captures the local structural context around ASO target sites, providing a probabilistic estimate of nucleotide unpairing and accessibility.
The output of this computation is a probabilistic accessibility score per nucleotide, representing
the likelihood of being unpaired.
We then applied the compute_sense_accessibility function to aggregate these data into a
single numerical value per ASO,
the mean accessibility across all nucleotides within its binding region.
A higher computed value indicates that the target region is more structurally closed and thus less
accessible to ASO binding.
Due to the computational intensity of the analysis, the process was parallelized and executed in
batches before being merged into the final dataset.
When correlating the resulting accessibility values with experimental log(inhibition) data, a negative Spearman correlation (ρ = –0.14, p = 3.02×10⁻¹¹⁹) was observed, supporting the intuitive relationship that more open regions tend to be more effectively targeted by ASOs (see Figure 18).
In addition to the average MFE \ accessibility features, two complementary folding-related features
were examined:
Mean MFE values at the sequence edges - calculated using the function calculate_mfe_over_edges_sense_region,
which evaluates the mean folding energy of the first and last four nucleotides of each target.
This feature was motivated by the hypothesis that the terminal regions may play a key role in local
opening and closing dynamics, thereby influencing ASO accessibility.
The observed correlation with inhibition was slightly lower but comparable to that of the mean MFE
feature.
Minimum average MFE across the region - computed using the function calculate_min_mfe_over_sense_region,
designed to capture the most stable local substructure within each ASO target.
This approach was based on the rationale that the strongest folded subregion may dominate the
accessibility landscape, and that averaging across the entire region could obscure such effects.
However, in practice, this feature also did not show a substantially higher correlation with
inhibition compared to the mean MFE metric.
Both features were ultimately examined in the feature selection step of the model to allow data-driven evaluation of their predictive contribution relative to other structural and thermodynamic parameters.
The therapeutic potential of Antisense oligonucleotides depends not only on chemical modifications or protein interactions, but also on the intrinsic sequence properties of the oligonucleotide itself. The sequence dictates how the ASO folds, how it interacts with RNA and proteins, and whether it harbors repetitive or toxic motifs that might compromise safety or efficacy. While chemical design can be adjusted, the underlying sequence composition remains a fundamental determinant of ASO performance.
To capture these properties systematically, we developed a set of sequence-derived features. These features quantify base composition, structural tendencies within the ASO itself, repetitiveness, and the presence of known toxic motifs. By doing so, we transform qualitative biological intuitions into quantitative predictors that can be integrated into machine learning models for ASO optimization.
Unlike the folding of target RNA, which determines whether a binding site is open or blocked, the sequence of the ASO itself encodes its own structural and functional tendencies. Before any chemical modification is added, the letters of an ASO already shape its behavior: whether it folds onto itself, remains flexible, or carries hidden motifs that influence safety and efficacy.
One crucial aspect is self-structure formation. Certain ASO sequences can fold back on themselves, forming intramolecular hairpins or even G-quadruplexes. Palindromic regions, long homopolymeric runs, or stretches of consecutive guanines make these conformations particularly likely. While stable, such self-structures act as competitors: instead of engaging the intended mRNA, the ASO becomes trapped in its own secondary structure, reducing potency and sometimes increasing toxicity [33][34].
Base composition also exerts a strong influence. GC-rich ASOs behave like rigid beams, forming strong duplexes but often at the cost of reduced flexibility and higher risk of aggregation. AT-rich sequences, in contrast, are looser and more flexible, which may help them adapt structurally but can also reduce binding stability. The composition can be summarized by GC content (overall G+C fraction) and nucleotide skews (AT and GC skew measuring strand asymmetry); together, these capture the trade-off between strength and adaptability [35] .
Another layer of sequence biology lies in repetitiveness and motifs. Long G-runs are notorious for forming G-quadruplexes, which hinder target binding and have been linked to adverse outcomes in preclinical studies [36] . Tandem or dispersed repeats decrease sequence uniqueness, increasing the chance of binding to unintended transcripts. Beyond structural motifs, toxic sequence patterns such as $\text{UGU}$, $\text{TGGT}$, or $\text{GGGT}$ have been correlated with hepatotoxicity and immune activation, highlighting the importance of screening them computationally before candidate selection[37] .
Entropy and diversity are subtler but equally important measures. Sequences with balanced distributions of nucleotides and dinucleotides have higher entropy, making them less prone to repetitive structures and more likely to remain accessible for hybridization. In contrast, low-entropy sequences are more predictable, more repetitive and often less effective. Importantly, studies have shown that the local sequence context itself strongly influences the thermodynamic properties and hybridization efficiency of ASOs, with even single modifications exerting sequence-dependent effects [38] . Finally, flexibility is encoded in the local arrangement of bases. Dinucleotides such as AT and TA introduce weaker stacking interactions, functioning like hinges within the oligonucleotide. These flexible points allow the ASO to adapt dynamically when pairing with RNA, potentially enhancing recognition and supporting RNase H1 activity[39] . Together, these properties for a hidden biological language of ASO sequences - a language that encodes stability, flexibility, safety, and efficacy. By decoding it through computational features, we can anticipate which sequences are likely to succeed as therapeutic candidates, long before they are tested in the lab.
To transform the biological properties of ASO sequences into measurable descriptors, we defined several feature families. Each feature captures one aspect of the sequence and is normalized by ASO length, ensuring fair comparison across oligonucleotides of different sizes.
1. Secondary Structure Features
These features measure the tendency of an ASO to fold back on itself and form intramolecular structures that may reduce its ability to bind the target mRNA.
AAAA, CCCC).
These stretches promote stacking interactions or unusual local structures, both of which can
reduce hybridization efficiency with the intended RNA.
2. Composition Features
These reflect the overall nucleotide balance within the ASO, which directly impacts stability, flexibility and immunogenicity.
3. Repetitiveness and Motifs
These capture sequence patterns that may compromise specificity or safety of the ASO.
4. Entropy and Diversityy
These features quantify the complexity of the ASO sequence. High complexity usually improves specificity, while low complexity reflects bias or repetitiveness.
where \( P_d \) is the frequency of each dinucleotide. Low values indicate bias toward repetitive dinucleotide patterns.
5. Additional Features (Single-Feature Categories)
These features complement the core families by highlighting structural adaptability and nonlinear effects
Through this framework, each ASO is mapped into a high-dimensional feature space that encodes its structural, compositional and functional tendencies. This systematic representation allows us to compare, filter and optimize candidates with precision.
By converting sequence properties into numerical descriptors, each ASO is represented by a multidimensional profile that highlights its inherent strengths and weaknesses. Self-folding signatures, toxic motifs, $\text{GC}$/$\text{AT}$ balance, entropy, and flexibility are all captured in a way that allows us to compare candidates objectively.
Codon usage bias (CUB) describes the unequal use of synonymous codons in protein-coding regions. Although multiple codons can encode the same amino acid, organisms tend to prefer certain codons over others due to evolutionary constraints, tRNA availability, and the need to optimize translation efficiency. These preferences are not random: they influence ribosome speed, pausing, and local RNA structure [40][41][42].
For ASOs, this context is highly relevant. Sites in regions of strong codon optimization may be structurally constrained, while sites enriched in rare codon may represent more accessible binding opportunities. ASO-target interactions are therefore determined not only by sequence complementarity but also by the biological translation context, which can modulate ASO activity through effects on RNase H1 access and binding [43].
By capturing CUB-related properties as quantitative features, we create a systematic way to represent translation-related biases in our predictive models. This helps bridge the gap between sequence-level biology and ASO performance, ensuring that design decisions incorporate both molecular complementarity and translational context.
The following features were selected to capture different aspects of codon usage bias and sequence complementarity. Each of them reflects a distinct biological property that can influence ASO binding and activity. Together, they form a complementary set of descriptors that quantify both codon-level translation dynamics and ASO-target hybridization potential.
ENC is one of the earliest and most widely used measures of codon usage bias. It provides a simple yet informative way to quantify how restricted a sequence is in its choice of synonymous codons. Highly expressed genes often show stronger bias toward "preferred" codons, while lowly expressed or less optimized genes tend to use codons more uniformly [40]
Definition: for each amino acid $\text{AA}$ with $k$ codons the probability that two random codons are identical:
In the last step, compute the ENC score:
For ASO design, ENC highlights how translation efficiency may influence accessibility.
CAI was introduced as a way to measure how well a gene's codon usage matches that of a reference set of highly expressed genes. Because translation is shaped by evolutionary optimization, codons found in ribosomal and housekeeping genes are often considered "optimal". Comparing local sequence usage with this reference gives insight into how efficiently the region may be translated [41].
Definition: First, calculate the Relative Synonymous Codon Usage(RSCU) parameter for each codon as follows:
Relation to ASOs
In the context of ASOs, CAI reflects whether ribosomes are frequently present at the binding site.
tAI extends the logic of codon adaptation by considering the actual tRNA pool of the organism. Instead of relying solely on reference gene sets, it incorporates tRNA gene copy numbers and wobble base-pairing rules. In this way, tAI reflects the translational "supply side": how available each codon's decoding machinery is within the cell [42].
Definition: In the first step, a weight is assigned to each codon, reflecting the abundance of the tRNAs that recognize it, and is calculated as follows:
After calculating the weights $W_i$, the next step is to normalize them as follows:
In the final step, calculate the tAI of a coding region (a gene or a part of it) as the geometric mean over the $w_i$ of its codons:
Relation to ASOs
For ASO design, tAI provides a direct link between tRNA availability and local ribosome speed.
Unlike ENC, CAI and tAI, which focus on codon usage, Chimera captures the sequence-level match between the ASO and its potential target sites. This feature is based on the Chimera ARS framework [44]. It indirectly measures hybridization potential by quantifying subsequences overlaps between the ASO and transcript regions.
Definition:
First, choose your reference set and the coding sequence. Then, for each position $i$ in the coding sequence $S$ find the longest sub-sequence $S_i$ that starts in that position, and also appears at the reference set. Afterward, let $|S|$ denote the length of a sequence $S$; the ChimeraARS of $S$ is the mean length of all subsequences $S_i$:
Relation to ASOs
Chimera captures the sequence similarity between the ASO and mRNA windows across the transcriptome, serving a dual-purpose metric:
To evaluate codon usage bias and hybridization features in a biologically relevant context, we applied a window-based approach centered on the ASO binding site. This reflects the fact that ribosomes, RNA-binding proteins and RNA structures operate on regions rather than isolated nucleotides.
Each feature was computed using algorithms adapted from the original literature, applied consistently across all contexts (ASO, local windows, global transcript).
ENC (Effective Number of Codons)
CAI (Codon Adaptation Index)
tAI (Translational Adaptation Index)
Chimera Similarity
General Note:
All calculations were performed consistently across $\pm20-70$ nt windows, global CDS and ASO sequence where applicable. This ensures comparability between features and prevents context-specific bias.
| Feature | CDS windows | pre-mRNA windows | Global (full CDS) | ASO sequence |
|---|---|---|---|---|
| ENC | ✓ | ✗ | ✓ | ✓ |
| tAI | ✓ | ✗ | ✓ | ✓ |
| CAI | ✓ | ✗ | ✓ | ✗ |
| Chimera | ✓ | ✓ | ✗ | ✗ |
Conclusion
CUB features capture how codon usage shapes the translational and structural environment around ASO binding sites. ENC, CAI and tAI measure bias and adaptation, while Chimera similarity directly quantifies sequence complementarity. By combining local windows, global transcript measures and ASO-level descriptors these features create a comprehensive view of both biological context and hybridization potential.
When designing ASOs, chemical modifications play a central role, since they enable the molecules to remain stable in biological environments, reach their target sites, and maintain a therapeutic activity. At the same time, the type and distribution of these modifications can strongly influence potency, specificity and safety, which makes them an essential aspect to capture in our feature set.
The therapeutic potential of ASOs is closely linked to their chemical design. Unmodified DNA oligos are inherently unstable in biological fluids and undergo rapid degradation by nucleases, leading to short half-life and poor bioavailability. To overcome these limitations, a broad set of chemical modifications have been introduced at the backbone, sugar or nucleobase level. These modifications improve stability, enhance RNA binding affinity, reduce immunogenicity, and modulate pharmacokinetics [45].
One of the earliest and most widely adopted modifications in therapeutic ASOs is the phosphorothioate (PS) backbone. In this design, one of the non-bridging oxygen atoms in the phosphate group is replaced with sulfur. This subtle change dramatically increases resistance to nuclease degradation, which is essential for maintaining stability in biological fluids. In addition, PS linkages promote interactions with plasma and cellular proteins, improving systemic distribution and cellular uptake. However, these benefits come at a cost: PS linkages can reduce binding affinity for the RNA target and are associated with immune-related toxicities, such as complement activation or platelet effects [46],[47]. Thus, while PS chemistry forms the backbone of almost all therapeutic ASOs, it also illustrates the balance between improved pharmacokinetics and potential safety liabilities.
Sugar modifications represent another major class of chemical strategies. Substitutions at the 2' position of the ribose, such as 2'-O-methyl(2'-OMe) and 2'-O-methoxyethyl(2'-MOE) enhance nuclease resistance and improve binding affinity toward the target RNA target. These modifications also reduce immune recognition, which is critical for tolerability in vivo. Among sugar chemistries, Locked Nucleic Acids(LNA) are particularly notable: they contain a methylene bridge that locks the ribose into a rigid conformation, leading to very high affinity for complementary RNA and a significant increase in thermal stability of the ASO-RNA duplex [46],[48]. This enhanced potency, however, has to be carefully balanced, as high densities of LNA residues have been linked with hepatotoxicity.
A particularly effective strategy that combines different chemistries is the gapmer architecture. Gapmers consist of a central stretch of DNA nucleotides flanked on both sides by chemically modified nucleotides such as MOE or LNA. The flanking regions provide nuclease resistance and stabilize the oligonucleotide, which the central "gap" is recognized by RNase H1, enabling catalytic cleavage of the target RNA. This dual functionality makes gapmers highly versatile: they retain the stability conferred by chemical modifications but also harness the enzymatic machinery of the cell for efficient RNA degradation [49]. Gapmers therefore represent a cornerstone design in antisense therapeutics, bridging chemical stability with biological activity.
Other ASO designs, such as mixmers or fully modified steric blockers, also rely on chemical modifications, but gapmers remain the most widely used therapeutic architecture and best illustrate how modification patterns translate into functional features [50].
Experimental studies have shown that the biological activity of ASOs depends not only on whether chemical modifications are present but also on how they are arranged across the sequence [45],[46],[47],[48],[49]. Parameters such as modification density, clustering, symmetry, or repetitiveness have all been linked to changes in stability, potency and toxicity.
To systematically capture these principles, we developed a set of features designed to reflect concepts highlighted in literature: for example, terminal runs to represent nuclease protection, block length to represent clustering, or entropy to represent diversity.
By doing so, we created a framework where each ASO sequence can be mapped onto a numerical profile of its chemical modification architecture. This bridge between biology and computation allows us to evaluate the contribution of chemical design to ASO performance and integrate these effects into predictive models for optimization.
The following illustrations present the individual features we defined to describe chemical modifications in ASOs. Each figure shows:
Together, these visuals provide an intuitive understanding of how each feature is calculated and what biological property it represents.
By computing these features for every ASO, each sequence can be represented as a numerical profile of its chemical modification architecture. These profiles can be correlated with experimental data on nuclease stability, RNase H! activity, immunogenicity or toxicity. In this way, modification patterns are transformed into quantitative predictors that can be integrated into machine learning models for ASO design and optimization.
RNA binding proteins (RBPs) are proteins that bind RNA molecules and influence their stability, processing, and translation. They act after transcription, shaping the life cycle of the mRNA in the cell. Each RBP recognizes short sequence motifs (typically 3-8 nucleotides) or structural elements, and by binding to these regions they can protect the transcript, destabilize it, regulate its splicing, or control how efficiently it is translated [51],[52],[53]. In this way, RBPs function as local regulators that fine-tune RNA fate. For example, stabilizers such as HuR (ELAVL1) bind and protect mRNAs from rapid decay, whereas destabilizers such as TTP (ZFP36) promote degradation by recruiting decay machinery [54]. Beyond stability control, translation regulators affect how efficiently ribosomes initiate or elongate, directly shaping protein output, while splicing regulators influence whether specific exons are included or skipped during pre-mRNA processing. In many cases, the role of an RBP is still not fully understood, and such cases are categorized as other/unclear.
For an ASO to be effective, it must access its complementary site on the target mRNA and form a stable duplex that recruits RNase H1 or blocks translation. However, this binding does not occur in isolation: the local RNA environment is often occupied by RBPs. These proteins, recognized through their short sequence motifs, can directly overlap with the ASO site or cluster in the surrounding region, thereby altering accessibility [55].
If stabilizing RBPs dominate the region, they can protect the transcript and counteract ASO-mediated degradation. Conversely, when destabilizing RBPs are enriched, the RNA is already prone to decay, and ASOs may act in a more permissive environment. Translation regulators may modulate whether ASO binding reduces protein output, while splicing regulators can shift isoform usage in ways that either enhance or diminish the therapeutic effect. In many cases, the combined influence of multiple RBPs creates a complex competitive landscape [56].
Thus, the motif landscape of RBPs is not a secondary detail but rather a critical determinant of ASO performance. Two ASOs with identical complementarity to their targets can behave very differently depending on whether their binding sites are free from RBP interference or embedded within regions densely occupied by stabilizers, destabilizers, or other regulatory proteins [57]. Figure 30 illustrates this principle: in the top panel, RBP binding occupies the target site and impedes ASO hybridization, whereas in the bottom panel the site is accessible, allowing stable ASO-mRNA duplex formation.
To translate these biological insights into quantitative features, we first generated an integrated dataset that links motif information with cell line–specific expression profiles. Motif data were obtained from the ATtRACT database, which provides a curated collection of RNA-binding protein motifs derived from experimental studies.
The raw motif table included the cell line in which each motif is relevant, the motif sequence in RNA alphabet, and the RBP identity (gene name). We then extended this table with two additional layers of information: a functional role label for each RBP (stabilizer, destabilizer, splicing regulator, translation regulator or other/unclear) and the corresponding expression values obtained from transcript abundance files for each cell line.
The final processed table therefore contained, for each row, the specific cell line providing the biological context, the motif sequence, the RBP identity, the functional role label, and the expression level of that RBP in the given cell line. This structured integration ensured that each motif hit could be quantitatively interpreted in terms of both its sequence-level presence and its biological activity. The Table below provides a few representative examples of motifs from this dataset. The complete dataset is alos available for download as a CSV file.
| Cell line | Gene (raw) | Expression | Gene name | Gene ID | Organism | Motif | Len | Functional role |
|---|---|---|---|---|---|---|---|---|
| A431 | RBM5 (10181) | 4.835095398 | RBM5 | ENSG00000003756 | Homo_sapiens | AAAAAAA | 7 | other/unclear |
| A549 | SRSF1 (6426) | 8.046300059 | SRSF1 | ENSG00000136450 | Homo_sapiens | CCCAGCA | 7 | splicing_regulator |
| KMS11 | PCBP2 (5094) | 11.57882081 | PCBP2 | ENSG00000197111 | Homo_sapiens | ACCUUAAA | 8 | stabilizer |
| A431 | PABPC1 (26986) | 9.606238631 | PABPC1 | ENSG00000070756 | Homo_sapiens | AAAAAAAA | 8 | translation_regulator |
| HELA | ELAVL1 (1994) | 5.311862727 | ELAVL1 | ENSG00000066044 | Homo_sapiens | UUUGUUU | 7 | stabilizer |
| A431 | ELAVL2 (1993) | 3.475162724 | ELAVL2 | ENSG00000107105 | Homo_sapiens | CUUUA | 5 | stabilizer |
| HELA | SRSF3 (6428) | 8.701327257 | SRSF3 | ENSG00000112081 | Homo_sapiens | UCAACGAUU | 9 | splicing_regulator |
| HELA | ZFP36 (7538) | 3.146591687 | ZFP36 | ENSG00000128016 | Homo_sapiens | ACCUGC | 6 | destabilizer |
| A431 | PCBP2 (5094) | 8.741048949 | PCBP2 | ENSG00000197111 | Homo_sapiens | CCUUCCU | 7 | stabilizer |
| A549 | PUM1 (9698) | 4.580929345 | PUM1 | ENSG00000134644 | Homo_sapiens | UGUAAUAUU | 9 | destabilizer |
Example entries from the extended RBP motif dataset, showing for each motif the associated cell line, RNA sequence, RBP identity, functional role and expresion value.
This merged and annotated dataset formed the foundation for all downstream feature calculation, enabling us to weight motif overlaps by expression and incorporate biological roles directly into the analysis.
Building on these biological insights and the integrated dataset of motifs and expression values, we analyzed ASO binding sites and their surrounding regions for RBP motifs. By treating these short sequence patterns as quantifiable elements, we transformed the qualitative influence of RBPs into numerical features that capture their potential to shape ASO performance. To ensure that both fine-scale and transcript-wide effects were represented, features were computed at two complementary levels:
RBP Overlap Density
This feature measures how strongly the region that includes the ASO binding site and its surrounding nucleotides is obstructed by RBP motifs. For each window and for the full CDS, we calculated the proportion of nucleotides covered by at least one motif:
\[ \text{Overlap Density} = \dfrac{\#\{\text{Covered bases in the region}\}}{L} \]
where $\#\{\text{Covered bases in the region}\}$ is the number of nucleotides overlapped by one or more motifs, and $L$ is the length of the selected region. Higher values reflect stronger RBP occupancy and more extensive competition with the ASO, whereas lower values indicate that the region is relatively free of RBP interference. Figure 31 illustrates this principle: in the top panel (high density), multiple RBPs cover much of the window around the ASO binding site, limiting accessibility. In the bottom panel (low density), only a few RBPs occupy the window, leaving the site more exposed and permissive for ASO-mRNA duplex formation.
Role-Contrast Features
This feature evaluates the balance between two functional groups of RBPs (for example destabilizers versus stabilizers) within the region that includes the ASO site and its flanking nucleotides. It is expressed as both a difference and a ratio:
\[ \Delta = D_A - D_B , \quad R = \frac{D_A}{D_B + \varepsilon} \]
Where $D_A$ and $D_B$ are the motif densities of roles $A$ and $B$ in the region, $L$ si the length of the region, and $\epsilon$ is a small constant to prevent division by zero. For example, in biological terms, stabilizers dominance over destabilizers ($D_A > D_B$) implies protection of the transcript and reduced ASO efficacy, whereas destabilizers dominance ($D_A < D_B$) indicates a decay prone environment that may be more permissive to ASO activity. In our case we implemented this feature on the following pairs: destabilizer vs stabilizer, splicing regulator vs translation regulator, destabilizer vs splicing regulator, destabilizer vs translation regulator, unclear vs stabilizer and unclear vs destabilizer. Figure 33 illustrates this principle: the top panel shows a window dominated by destabilizers, while the bottom panel shows a window dominated by stabilizers.
Expression-Weighted Density
This feature incorporated RBP mRNA expression levels into the density calculation, recognizing that abundantly transcribed RBPs are more likely to exert stronger effects in the cell. For each window or the full CDS, we computed:
\[ \text{Weighted Density} = \frac{\sum_i (L_i \cdot E_i)}{L} \]
Where \(L_i\) is the overlap length contributed by RBP \(i\), \(E_i\) is its mRNA expression in the given cell line, and \(L\) is the total length of the region. Biologically, high weighted density indicates that RBPs with high mRNA abundance cover the region extensively, reducing ASO accessibility, while low weighted density suggests minimal obstruction from actively expressed RBPs.
To evaluate the relevance of this feature in practice, we tested the correlation between expression-weighted density and ASO log-inhibition, focusing here on stabilizer RBPs as a representative example.
Motif Diversity (ASO-only and window-level)
This feature captures whether regulation of the region (ASO site plus its surroundings) is dominated by a single RBP or distributed among multiple RBPs. For each window or the full CDS, it is expressed as:
\[ D_R = \dfrac{N_{\text{unique}}}{N_{\text{total}}} \]
where \(D_R\) is the Diversity Ratio, \(N_{\text{unique}}\) is the number of unique RBP identities, and \(N_{\text{total}}\) is the total number of motif hits. We evaluated two complementary variants: ASO-only diversity, counting motif overlaps that fall strictly on the ASO site (same value for all windows), and window-level diversity, counting all overlaps within the surrounding window. Biologically, a low \(D_R\) suggests that one RBP largely determines the outcome (which can either hinder or support ASO action depending on its role), whereas high diversity indicates competition among many RBPs, potentially leaving more opportunities for ASO binding. Figure 37 illustrates this principle: in the top panel, the window is dominated by a single RBP type (low diversity), while in the bottom panel, multiple RBP types occupy the window (high diversity).
To evaluate the relevance of this feature in practice, we tested the correlation between motif diversity and ASO log-inhibition, focusing here on stabilizer RBPs as a representative example.
Positional Splits
This feature measures how RBP occupancy is distributed across different parts of the ASO binding site within its surrounding region. The ASO binding site is divided into three equal segments: Left, Core, and Right. For each segment, we calculate the fraction of nucleotides covered by at least one motif:
\[ \mathrm{Frac_{part}} = \frac{N_{\mathrm{overlap}}}{L_{\mathrm{part}}} \]
Where \(N_{\mathrm{overlap}}\) is the number of nucleotides overlapped by motifs within that segment, and \(L_{\mathrm{part}}\) is the length of the segment. Each fraction ranges from 0 (no overlap) to 1 (full overlap). These values were computed per role, across local windows (\(\pm20, \pm40, \pm70\) nucleotides) and for the global CDS. Biologically, this feature highlights whether RBP binding tends to cluster at the edges of the ASO site or directly over its central core. Core overlaps are likely to interfere most strongly with ASO–mRNA hybridization, while overlaps confined to flanking edges may be less disruptive. Figure 39 illustrates this principle: in the top panel, most RBPs bind at the edges, resulting in high left/right fractions and a low core fraction, while in the bottom panel, RBPs cluster in the middle, producing a high core fraction and low left/right fractions.
Together, these five features provide a multidimensional profile of the RBP environment in both local windows (ASO site plus surrounding regions) and the global CDS. They describe: how obstructed the region is (overlap density), how RBP mRNA expression strengthens or weakens that obstruction (weighted density), which functional forces dominate (role contrast, on both regular and mRNA-weighted densities), whether regulation is monopolized or diverse (motif diversity), and where along the ASO binding site RBP occupancy is concentrated (positional splits). By systematically merging motif and mRNA expression data and computing these features in both local and global contexts, we capture the biological reality of RBP regulation and ensure our ASO predictions account for both fine-scale adn transcript-wide influences.
When designing ASOs, one of the major challenges is avoiding off-target effects [58]. These can be broadly categorized into hybridization-dependent and hybridization-independent mechanisms.
Hybridization-dependent off-targets occur when an ASO partially binds unintended RNA transcripts through imperfect sequence complementarity, leading to RNase H-mediated cleavage or steric blocking of translation. In contrast, hybridization-independent off-targets arise without base pairing and are instead driven by the chemical properties of the ASO—such as phosphorothioate backbones or hydrophobic modifications—that promote nonspecific interactions with proteins or trigger immune activation.
ASOs are designed to bind a specific RNA transcript, yet hybridization-dependent off-targeting can still occur with other RNAs. The likelihood of such unintended binding depends primarily on two factors:
An off-target interaction is therefore most relevant when both sequence similarity and transcript abundance permit significant hybridization between the ASO and unintended RNAs.
As an initial step, we used the Hamming distance to quantify sequence similarity between ASOs and RNA targets. It measures how many positions differ between two sequences of equal length:
\( d_H(x, y) = \sum_{i=1}^{n} [x_i \ne y_i] \)
While this simple metric captures single-base substitutions, it does not account for insertions or deletions. We therefore use it as a starting point before extending our analysis to more complex alignment-based measures.
To evaluate an ASO’s potential to degrade its intended target, we analyzed the distribution of perfect and near-perfect matches (Hamming distance ≤ 2) across the genome. Near-complementary sites can act as competing sinks, transiently binding the ASO and limiting its availability for productive binding to the correct target transcript.
We also examined imperfect matches within the target gene itself. In some cases, partial complementarity can still enable RNase H–mediated cleavage, similar to what occurs in off-target interactions, thereby contributing to overall target degradation.
For estimating potential off-targets across the whole genome, we used the Bowtie alignment tool. The workflow involved several steps:
For estimating potential off-targets across the whole genome, we used the Bowtie alignment tool. The workflow involved several steps:
bowtie-build for fast searching.
bowtie, producing
SAM-format
alignments.
The index creation was parallelized across chromosomes and completed within minutes, allowing us to perform off-target searches extremely quickly. Because Bowtie’s search runtime scales approximately linearly with the number of detected off-targets, genome-wide analysis could be executed efficiently with minimal computational cost.
RNA molecules — both pre-mRNA and mature RNA — are inherently three-dimensional, and their structure plays a critical role in determining how they interact with other molecules. In addition to Hamming-distance–based analysis, we also evaluate ASOs by their binding free energy ($\Delta G$) to both the intended target and potential off-targets.
RIsearch is a fast alignment-based algorithm for large-scale detection of nucleotide interactions, used in applications such as CRISPR off-target prediction and RNA–RNA or RNA–DNA hybridization analysis. RIsearch employs an efficient dynamic programming algorithm with a time complexity of O(query length × target length), allowing it to account for potential secondary-structure interactions between the ASO and its RNA target.
RIsearch also accounts for RNA structural features such as bulges, loops, and hairpins formed between the reference and query sequences, estimating the Gibbs free energy of binding between each ASO and its candidate gene locations. Lower (more negative) $\Delta G$ values correspond to more thermodynamically stable duplexes and serve as an indicator of higher binding affinity and potential ASO efficiency.
For our analysis, we used an in-house modified version of RIsearch. This customized implementation allows us to adjust the scoring weights for each nucleotide pairing, providing fine-grained control over how different base-pair interactions contribute to the overall binding energy. This flexibility enables more accurate modeling of ASO–RNA interactions in line with experimental observations.
To tailor it for ASO design, we first transposed its built-in RNA/DNA interaction matrix so that the query strand corresponds to PS-DNA (the ASO) and the reference strand to RNA (the target mRNA), ensuring that base-pair energetics are applied in the correct orientation. We then integrated nearest-neighbor (NN) parameters derived from chemical HRM experiments for PS-DNA/DNA duplexes, and using our previously defined conversion formula (See Hybridization section), inferred the equivalent PS-DNA/RNA weights. This adaptation enables chemistry-aware $\Delta G$ calculations for ASO–mRNA hybrids, where more negative values indicate stronger and more thermodynamically stable binding.
Data Sources
To capture these aspects, we used data from the Cancer Dependency Map (DepMap) [59]. All datasets were taken from the 25Q2 release of DepMap, ensuring consistency across expression, mutation, and cell line annotation files. DepMap is a large public resource that provides multi-omics data across hundreds of human cancer cell lines, including:
In addition, we used [60] for:
Building the Mutated Transcriptome
We first focused on constructing the transcriptome layer, as it offered a more
direct
and accessible
way to capture functional genetic variation across cell lines. For each cell line, we then built a
mutated transcriptome by integrating these data:
The result was a complete list of all transcripts in the cell line, each with its sequence (mutated if needed) and expression level.
Calculating Off-Target Scores
With the mutated transcriptome available, we could evaluate how strongly each ASO binds unintended
RNAs:
\[ \text{OffTarget}_{\text{ARTM}_1} = \sum_{i=1}^{N} \left( E_i \times \text{Expr}_i \right) \]
where \(E_i\) is the hybridization free energy (\(\Delta G\)) between the ASO and the \(i\)-th transcript, and \(\text{Expr}_i\) is the corresponding transcript’s expression level (e.g., \(\text{TPM}\) or \(\log(\text{TPM}+1)\)). The subscript \(\text{ARTM}_1\) denotes the first arithmetic method, where off-target contributions are summed as weighted products of binding strength and transcript abundance.
We called this approach the arithmetic method, where hybridization strength is directly weighted by transcript expression levels.
Results and Improvements
At first, we used mature mRNA sequences (after splicing) and found only a weak
correlation between the off-target score and ASO activity.
However, previous studies suggest that many ASOs act primarily on pre-mRNA rather
than
mature transcripts.
When we repeated the analysis using pre-mRNA sequences (including introns), the results improved
substantially.
Off-target scores showed a stronger, statistically significant
(see supplementary Tables A,B,C)
negative correlation with ASO
efficacy —
the higher the predicted off-target binding, the less effective the ASO was at silencing its
intended
target, exactly as expected.
Effect of Hybridization Cutoffs
We tested different cutoffs for minimum hybridization strength to determine which interactions are
biologically relevant.
A threshold of around 1200 RIsearch units (corresponding to approximately 10–12
consecutive complementary base pairs)
gave the best results (Figures 40-42).
This value provided a good balance between noise reduction — filtering out weak,
spurious interactions —
and information retention, preserving interactions strong enough to compete with
the
intended target.
Comparing Different Scoring Methods
To ensure our off-target metric was robust, we systematically compared multiple formulations that varied in how transcript expression was incorporated. These methods explored different mathematical relationships between hybridization energy and expression level.
1. Squared Expression Weighting
Our first variant extended the arithmetic model by giving more weight to highly expressed transcripts:
\[ \text{OffTarget}_{\text{ARTM}_2} = \sum_{i=1}^{N} \left( E_i \times \text{Expr}_i^{2} \right) \]
where \(E_i\) is the hybridization free energy (\(\Delta G\)) between the ASO and the \(i\)-th transcript, and \(\text{Expr}_i\) is its expression level (e.g., \(\text{TPM}\) or \(\log(\text{TPM}+1)\)). The subscript \(\text{ARTM}_2\) denotes the second arithmetic method, where expression values are squared to emphasize highly expressed off-target transcripts.
This formulation amplified the contribution of abundant RNAs, highlighting cases where expression level alone may dominate off-target behavior. It produced slightly weaker correlations compared to the main arithmetic model but remained statistically significant (see Table D).
2. Geometric Method
Next, we explored a geometric formulation, where hybridization energies were combined multiplicatively rather than additively. This approach captures proportional rather than linear effects of expression:
\[ \text{OffTarget}_{GEO} = \prod_{i=1}^{N} \left( E_i^{\text{Expr}_i} \right) \]
where \(E_i\) is the hybridization free energy (\(\Delta G\)) and \(\text{Expr}_i\) is its transcript expression level. Highly expressed transcripts therefore contribute multiplicatively to the total off-target score.
To prevent numerical underflow caused by very small products, we applied a log–exp transformation:
\[ \text{OffTarget}_{GEO} = \exp \left( \sum_{i=1}^{N} \text{Expr}_i \times \log(E_i) \right) \]
This preserved the geometric interpretation while allowing stable numerical computation. The results were comparable to the arithmetic model (see Table E), demonstrating that expression weighting remains robust across mathematical formulations.
3. Statistical (Boltzmann-Weighted) Method
Finally, we introduced a Boltzmann-weighted statistical approach, which treats hybridization events probabilistically, weighting them according to their thermodynamic favorability:
\[ \text{OffTarget}_{\text{MS}} = \sum_{i=1}^{N} \text{Expr}_i \times \exp \left( -\frac{E_i}{\kappa_B T} \right) \]
where \(E_i\) is the binding free energy (\(\Delta G\)) between the ASO and the \(i\)-th transcript, \(\kappa_B\) is the Boltzmann constant, and \(T\) is the effective temperature (in Kelvin). This formulation links molecular thermodynamics with expression-level weighting, approximating the statistical likelihood that an ASO transiently binds a given transcript in the cellular environment. The subscript \(\text{MS}\) denotes the mechanical-statistics method, which integrates thermodynamic weighting of hybridization energy into the off-target scoring process.
The Boltzmann-weighted model maintained the same negative correlation trend observed in arithmetic and geometric methods, confirming that biologically, expression-driven weighting remains consistent across both deterministic and probabilistic frameworks (see Table F).
Summary of Results
Together, these results demonstrate that weighting off-target scores by hybridization energy and transcript expression provides a reliable predictor of ASO performance.
We've begun our journey with data curation, then we utilized a simple ODE to reduce Volume related bias, and we started to engineer features that explain this volume-corrected inhibition. Now we need to put all the pieces together, and find for each ASO, its $-k$ efficiency number, or $-kT$ as we don't correct for treatment period in this version.
As we focus on lung cancer, or more specifically A549 cell lines, we noticed that we don't have any lung cells in our data. The most common ones are melanoma(A431, SK-MEL-28), liver cells (HepG2, HepaRG) and lymphoma (MM1.R, KARPAS-229). This calls for a model that can generalize across cell lines, and won't suffer from experiment or cell line specific bias.
We chose to utilize XGBRanker[62], a model based on XGBoost, that is focused on ranking pairs of ASOs. The idea is the following - different experiments and cell lines, might have some unique properties that relate specifically to them. By utilizing a pairwise model, we can utilize an apples to apples comparison, allowing us to find features that generalize well across cell lines.
We applied feature selection to identify which biological and chemical parameters most strongly associate with ASO efficacy while limiting overfitting. Features with occasional missing values (e.g., codon-usage metrics for intron-targeting ASOs) were 0-filled to maintain comparability across all sequences.
Because the feature space was large, we used a two-stage procedure. At each step we (i) ranked the remaining features by mutual information with the response and shortlisted the top-80, then (ii) selected from this pool the feature whose maximum absolute Spearman correlation to the already-chosen set was smallest (ties broken by higher MI). We repeated this until 40 features were selected, preserving relevance while actively limiting redundancy.
After this screen, we manually curated a balanced panel that preserved representation from key categories (hybridization/structure, sequence composition, folding/accessibility, RNase H motifs, contextual/genomic, and assay/experimental). The table below shows representative features removed during the preliminary pass and the closest retained counterparts.
| Dropped (Preliminary) | Closest Kept | Comment |
|---|---|---|
|
premRNA_gc_skew_50 premRNA_gc_skew_20 |
gc_skew | Best GC-skew feature chosen (ASO-only) |
|
premRNA_at_skew_50 premRNA_at_skew_20 |
at_skew | Best AT-skew feature chosen (ASO-only) |
|
premRNA_gc_content_20 premRNA_cg_dinucleotide_fraction_70 |
gc_content gc_content_3_prime_5 |
Best GC-content descriptors |
|
premRNA_sequence_entropy_50 premRNA_dinucleotide_entropy_30 premRNA_dinucleotide_entropy_70 |
entropy nucleotide_diversity |
Non-redundant entropy/heterogeneity measures |
|
premRNA_tandem_repeats_score_70 premRNA_homooligo_count_70 premRNA_homooligo_count_40 |
homooligo_count poly_pyrimidine_stretch |
Repeat-type features with distinct signal |
|
Modification_evenness Modification_symmetry_score Modification_adjacent_pair_count Modification_dominant_mod_fraction Modification_pos_std |
Modification_skew_index Modification_min_distance_to_3prime |
Compact modification-asymmetry proxies |
|
dsm_su95_rev_wGU_pos1386t37Falseon_target_energy_max600 dsm_su95_rev_woGU_pos1384t37Trueon_target_energy_max600 |
self_energy internal_fold exp_ps_hybr |
Representative hybridization/secondary-structure signals |
| min_mfe_45 |
on_target_fold_openness_normalized40_15 sense_avg_accessibility |
mRNA folding / accessibility retained |
| RNaseH1_score_R4b | RNaseH1_Krel_score_R7_krel | More informative RNase-H window kept |
| ENC_score_global_CDS | CAI_score_global_CDS | Best global codon-usage |
| premRNA_stop_codon_count_70 | stop_codon_count | Simpler stop-codon load (ASO-proximal) |
| premRNA_at_rich_region_score_50 | at_rich_region_score | Best AT-richness descriptor (ASO) |
| off_target.top200.cutoff800.premRNA_log | off_target.top200.cutoff600.premRNA_TPM | Best TPM-weighted off-target |
| mrna_length |
Location_div_by_length sense_exon sense_intron sense_utr |
Contextualized length/position features |
We also included normalized_start, normalized_sense_start_from_end,
hairpin_score, and the assay-level Treatment_Period(hours) as
domain-informed additions.
Backward Feature Selection
We began with 28 candidates and removed features one at a time until 15 remained. At each round we
trained a
full model without each candidate (“leave-one-feature-out”), scored every variant, and
dropped the feature
whose removal produced the smallest drop (or a gain) in our objective. This keeps the
selection grounded in
the full multivariate context rather than isolated, one-by-one additions.
The table shows each feature’s ablation effect (ΔNDCG, see more Definition of Metrics ): the change in validation score when that feature is removed. These values do not sum to the final NDCG because effects overlap across correlated features. Positive values indicate improved generalization; negative values indicate a small adverse impact.
| Feature | Δ gmean NDCG@200 (unseen cell lines) |
|---|---|
| Treatment_Period(hours) | +0.0535 |
| at_skew | +0.0301 |
| CAI_score_global_CDS | +0.0245 |
| stop_codon_count | +0.0245 |
| sense_avg_accessibility | +0.0204 |
| on_target_fold_openness_normalized40_15 | +0.0197 |
| sense_utr | +0.0189 |
| nucleotide_diversity | +0.0180 |
| internal_fold | +0.0169 |
| normalized_start | +0.0141 |
| RNaseH1_Krel_score_R7_krel | +0.0125 |
| hairpin_score | +0.0112 |
| Modification_min_distance_to_3prime | +0.0081 |
| at_rich_region_score | +0.0051 |
| self_energy | −0.0022 |
This NDCG@200 threshold for unseen cell lines balances sensitivity to top-performers with stability across datasets of very different sizes (≈1,000–8,000 ASOs per cell line). Under this objective, certain biological signals (e.g., sequence composition or RNase-H window effects) dominated, while explicit hybridization energies contributed less once correlated variables were present.
self_energy was removed because its ablation increased the validation NDCG (ΔNDCG >
0 in the table), indicating the feature hurt performance when included.
On treatment period and potential bias. We include
Treatment_Period(hours) as an assay-level “hyper-feature.” The
ODE transform that corrected dose/volume does not capture time effects, so we
treat duration explicitly rather than forcing a mismatched fit. This makes any
protocol differences transparent while keeping the model focused on
biology-driven features. Next, we will extend the kinetics (e.g., Michaelis–Menten
or related ODEs) to account for duration without introducing bias.
The final model therefore uses 14 features (with Treatment_Period(hours) acting as an
experimental hyper-feature) and shows consistent generalization across unseen cell lines.
The chosen features are:
The journey of selecting our therapeutic target begins with one central question: Which synthetic lethal (SL) pair holds the greatest promise for lung cancer patients? This process is not straightforward — it requires balancing clinical relevance, biological plausibility, and computational evidence.
We first set out to identify mutations that are truly relevant for lung cancer. Using large-scale genomic datasets, including TCGA for LUAD and LUSC, as well as MSK 2022/2023 cohorts for metastatic NSCLC and CPTAC 2021 for squamous carcinoma, we compiled a comprehensive map of frequently altered genes in these tumors [63]–[66]. From TCGA LUAD, we analyzed 2,701,623 mutations spanning 17,909 genes, and from TCGA LUSC, 2,380,747 mutations across 17,950 genes. This large-scale mutational landscape served as the foundation for downstream prioritization of genes that may serve as viable therapeutic targets.
Not every mutation can serve as a useful entry point. We therefore intersected our gene list with curated databases of synthetic lethal interactions, which represent the current state-of-the-art resources for SL data [67]–[68]. To ensure robustness, only pairs with a confidence score above 0.5 were retained, yielding approximately 1,000 candidate genes.
Because the therapeutic logic of synthetic lethality typically begins with a dysfunctional gene, a reasonable assumption is that loss-of-function (LOF) mutations occurring frequently in tumor cells will most often involve tumor suppressor genes (TSGs). Only LOF mutations are relevant in this context, since the principle of synthetic lethality relies on the simultaneous loss of function in two genes — which is fatal for the cancer cell, whereas LOF in each gene individually is not harmful.
To focus our analysis, we refined our candidate list using the BIOINFO classification of tumor suppressor genes [69], which narrowed the set to 176 genes with known or suspected tumor-suppressive roles.
Not all mutations in these genes translate into functional loss. We therefore applied stringent criteria to classify mutations as bona fide loss-of-function (LOF) events:
As an illustrative case, TP53 emerged from our LOF classification pipeline as one of the most frequently disrupted tumor suppressors in lung cancer [73]. Despite its high overall mutation rate, each specific high-confidence LOF mutation is rare (<1%), reflecting extensive heterogeneity across patients. To ensure clinical relevance, we excluded genes with a total LOF frequency below 5% in lung cancer cohorts.
After this multi-layered refinement, we arrived at a shortlist of the top synthetic lethal (SL) pairs, representing the most promising therapeutic opportunities to advance into biological validation and ASO design. Among these, particular emphasis was placed on patient populations harboring loss-of-function mutations in TP53 and homozygous deletions in MTAP, two alterations that play central roles in tumor progression and therapeutic vulnerability [74],[75].
After classifying mutations according to our multi-model LOF framework,
we estimated the overall frequency of LOF events for each tumor suppressor gene across
non–small cell lung cancer (NSCLC) cohorts.
The figure below presents the top genes with the highest LOF frequencies,
highlighting those that represent recurrent vulnerabilities in NSCLC.
Among these, MTAP (16%) and TP53 (15%) were prioritized for further study
due to their strong biological rationale and the existence of known inhibitory mechanisms and
synthetic lethal partners.
An overview of the top tumor suppressor genes ranked by LOF frequency is shown in Figure 47, while the complete computational pipeline for identifying tumor suppressor–synthetic lethality (TSG–SL) pairs is illustrated in Figure 48.
At the heart of this strategy lies the essence of synthetic lethality - exploiting vulnerabilities unique to cancer cells. By silencing the SL partner of a mutated tumor suppressor gene, our ASO aims to induce selective cancer cell death while sparing healthy cells that retain the intact pathway. Furthermore, by targeting a conserved region of the partner mRNA, our approach seeks to overcome tumor heterogeneity and maintain efficacy across diverse patient mutation profiles.
For MTAP, the loss of this metabolic enzyme leads to intracellular accumulation of methylthioadenosine (MTA), which selectively inhibits the arginine methyltransferase PRMT5. This inhibition creates a unique metabolic dependency on the PRMT5–MAT2A–RIOK1 axis, as PRMT5 activity relies on methyl donors produced by MAT2A and its interaction with RIOK1. Consequently, MTAP-deleted cells become highly sensitive to disruption of these partners, forming a well-defined synthetic lethality network that offers a biologically grounded opportunity for targeted therapy [75].
For TP53, one of the most frequently mutated tumor suppressors in lung cancer, mutations display remarkable variability, spanning over four orders of magnitude in transcriptional activity between mutant forms. These variants interact with genetic and epigenetic modifiers, shaping heterogeneous phenotypes and influencing treatment response [73]. Given the extensive number of potential SL partners reported in the literature [67]–[68], we performed a systematic evaluation of candidate pairs to identify those most consistent across datasets and biologically plausible for ASO-mediated targeting. This integrative analysis helped prioritize the most promising TP53-related synthetic lethal interactions to guide experimental validation.
We extended the TP53 search by ranking genes whose neighborhoods most resemble TP53 across multiple evidence channels, each reflecting a distinct biological signal: coexpression (genes varying together across datasets → functional linkage), experimental (directly assayed interactions or perturbation evidence), database (manually curated pathways/complexes), text mining (literature co-mentions suggesting related biology), neighborhood (conserved genomic/protein context), fusion (gene-fusion events implying coupling), cooccurrence (phylogenetic co-presence across species), and, where available, genetic high / genetic low (bins of genetic-interaction confidence). For every channel we compute per-gene similarity (Jaccard, Overlap) and neighborhood enrichment (hypergeometric p-value), so table headers like “Jaccard coexpression” or “Hypergeom experimental” read directly as: similarity/enrichment with TP53 based solely on that evidence stream.
Jaccard
coexpression, Hypergeom experimental).
combined_score,
require the selected channel’s score > 0, map STRING IDs to gene symbols, and collapse
[76]
duplicate undirected edges.
pmid_support count;
for STRING we retain the max of numeric columns per edge.
N(gene) are constructed from the filtered graph. We drop very small
nodes (less than 10) and trim extreme hubs by percentile (99th percentile).
Our ranking methodoly was the following:
tp53_like_results.csv with Protein, Degree, SharedWithTP53, Jaccard,
OverlapCoef, HypergeomP.
The STRING single-channel routine returns a similarly structured table per evidence channel.
Using our TP53-like neighborhood screen across multiple evidence channels, RB1 and TP53BP1 consistently showed strong overlap with the TP53 interaction neighborhood while passing hub filters. Both genes rank with extremely low neighborhood-enrichment p-values and non-trivial Jaccard overlap in several independent channels, indicating robust TP53-proximity rather than a single-dataset artifact.
We generated the following CSV output for out analysis: Gene interaction CSV
Bottom line: RB1 and TP53BP1 are TP53-like across multiple evidence streams, statistically enriched yet non-hubby, and biologically complementary. That makes them high-confidence, interpretable choices for follow-up analyses and wet-lab validation.
To design antibodies capable of efficient expression, immune activation, and effective delivery, we implemented a multi-stage computational and experimental workflow integrating codon optimization, structural modeling, and epitope selection.
The amino acid sequences of the heavy and light chains of cetuximab (the chosen antibody for the therapeutic complex) were retrieved and processed through two complementary optimization routes.
First, the optimized sequences were refined using the Evolutionary Stability Optimizer (ESO) [77], which detects and eliminates mutational and epigenetic hotspots to increase long-term genetic stability without altering protein structure.
Second, codon usage and non-coding regulatory elements were optimized using the
MNDL Bio deep-learning platform
[78],
to enhance transcriptional and translational efficiency in CHO cells.
An illustration of this process is shown in Figure 49.
To guide the design of antibody–epitope constructs, we developed a multi-step in silico pipeline. First, frequent non–small cell lung cancer (NSCLC) mutations were identified from TCGA datasets to define clinically relevant targets [79].
Candidate neo-epitopes were then predicted by integrating proteasomal processing and MHC class I presentation algorithms, which were developed in collaboration with Tuller’s Lab member -Nicolas Lynn (explained in detail below).
In parallel, the antibody sequence space was analyzed using codon distribution and positional probability profiles [80], derived from large-scale analysis of millions of antibody sequences.
These data were processed to generate codon usage frequencies and position-specific scoring matrices (PSSMs), which enabled the identification of structurally permissive regions suitable for peptide insertion (explained in detail below, based on collaboration with Doron Armon’s, a Tuller Lab member).
From this process, multiple candidate constructs were generated and subsequently evaluated computationally for structural stability and the likelihood of efficient MHC-I presentation. An illustration of this pipeline is shown in Figure 50, alongside the complete pipeline in Figure 51.
To guide the design of antibody–epitope fusion constructs, we employed a multi-step in silico pipeline for predicting tumor-specific neoepitopes with high immunogenic potential. The workflow, developed in collaboration with Tuller’s Lab member Nicolas Lynn, integrates multiple layers of antigen processing and presentation modeling (Figure 52).
First, recurrent non–small cell lung cancer (NSCLC) mutations were extracted from TCGA datasets and processed through the OncoSplice algorithm [81] to identify mis-splicing events and their resulting aberrant transcripts. These transcripts were translated in silico to derive tumor-specific 12-mer peptide candidates absent from the normal human proteome (filtered via UniProt BLAST [82]).
Next, a computational ensemble of tools was used to model each key stage of the MHC-I antigen presentation pathway - from proteasomal cleavage to peptide-MHC binding and T-cell activation. NetChop [83], MHCFlurry [84], and NetCTLpan [85] predicted peptide processing and TAP transport; NetMHCpan4.1 [86] and MHCFlurry [84] estimated binding affinity to MHC-I alleles; and DeepImmuno [87] and IEDB (Calis) [88] evaluated immunogenicity likelihood through TCR–pMHC interaction features.
Each peptide received a composite score reflecting its processing efficiency, presentation probability, and immunogenic potential, enabling ranking of candidate epitopes. This pipeline demonstrated that several predicted neoepitopes undergo all major biological steps required for proteolytic cleavage, MHC-I loading, and T-cell recognition — confirming their potential as immunogenic targets for antibody conjugation and vaccine design.
To characterize the antibody sequence landscape and identify regions suitable for epitope insertion, large-scale antibody repertoires were analyzed using data from the Observed Antibody Space (OAS) database [89]. This resource compiles billions of cleaned and annotated antibody sequences from over 100 immune-repertoire sequencing studies, including approximately 4.1 million paired and 2.4 billion unpaired heavy and light chain sequences.
For this analysis, only human unpaired antibody sequences were selected. Using high-performance computing (HPC) resources and the Dask library for parallelized data management [90], the variable V(D)J regions of antibody sequences were extracted and translated into amino acids. Only productive and complete V(D)J sequences were retained while excluding immature B-cell data to ensure biological relevance.
Codon usage frequencies were computed for each data unit by counting codons within each amino acid group, excluding stop codons and ambiguous nucleotides. These frequencies were then aggregated across all human data units to generate global codon distribution profiles for both heavy and light chains.
To obtain a position-specific scoring matrix (PSSM) describing positional amino acid preferences, approximately 120,000 productive light-chain sequences were randomly sampled from the OAS dataset, ensuring complete V(D)J coverage [89], [91].
Multiple sequence alignment was performed using the ANARCI tool [92], which assigns standardized IMGT numbering to antibody variable regions [93]. The engineered light-chain sequence from our study was aligned according to this numbering scheme to ensure consistent positional mapping.
Amino acid frequencies were then converted into a PSSM using the Biopython package [94]. The computation used background frequencies derived from the overall amino acid distribution and applied pseudo-counts equal to 0.5× the background frequency to prevent zeros in the matrix.
This matrix captures positional conservation and diversity across antibody variable regions, serving as a statistical baseline to identify structurally permissive sites for epitope insertion while maintaining the native antibody fold. This analysis was performed in collaboration with Doron Armon (Tuller Lab). provided the foundational data supporting our computational design of antibody–epitope constructs.
Two matrices were utilized - a position-specific scoring matrix (PSSM) and a position weight matrix (PWM), generated from approximately 120,000 antibody variable-region sequences [80].
The PSSM contains position-dependent scores for each amino acid, defined as:
\[ M_{ij} = \log_{2}\!\left(\frac{p_{ij}}{q_i}\right) \tag{1} \]Where \(p_{ij}\) represents the probability of amino acid \(i\) occuring at position \(j\), and \(q_i\) denotes the background probability of amino acid \(i\) as estimated from the multiple sequence alignment (MSA).
The PWM matrix contains the raw positional probabilities - \(p_ij\) for each amino acid at every position.
Based on a reference article provided in[95], which applied the BLOSUM50 substitution matrix, we grouped amino acids according to their biochemical similarity. This grouping aimed to reduce the dimensionality of the matrices while accounting for conservative substitutions that preserve structural and functional properties during epitope insertion into the antibody sequence.
Accordingly, amino acids were clustered into five groups:
\[ \begin{aligned} G1 &= \mathrm{L, V, I, M, C}\\ G2 &= \mathrm{A, S, G, T, P}\\ G3 &= \mathrm{F, Y, W}\\ G4 &= \mathrm{E, D, N, Q}\\ G5 &= \mathrm{K, R, H} \end{aligned} \]We processed the PWM by aggregating amino acid probabilities according to the newly defined similarity groups, summing the individual probabilities within each group to obtain group-level position-specific probabilities.
To reconstruct the background probabilities \(q_i\) for each amino acid, we applied a reverse-engineering approach using the relationship between the PSSM and PWM matrices. Specifically we removed the logarithmic transformation from the PSSM by converting each entry elementwise as \(2^M\), and subsequently computed the background probabilities by dividing the corresponding PWM values by these transformed PSSM values.
The reconstructed background probabilities were then grouped by summing the values of all amino acids within each similarity group. Using these values and the corresponding grouped PWM, a new PSSM was calculated according to the earlier equation for \( M_{ij} \).
Once the set of neo-epitopes was predicted based on NSCLC mutations, each epitope was evaluated for
insertion
compatibility within the antibody variable region using the updated PSSM matrix. Using a sliding
window equal in
length to each epitope, we computed the cumulative PSSM score by summing the position-specific
values for the amino acids (or their respective groups) within the window, thereby quantifying the
positional
compatibility between the epitope and each potential insertion site. CDR regions of the variable
domain were
excluded from the analysis due to their importance in receptor binding[96].
The formula used:
All the epitopes were combined into a dataframe and the dataframe was sorted from the highest to the lowest score.
Finally, the top-scoring epitope-insertion pairs were re-evaluated using the MHC-presentation prediction model to confirm that the inserted epitopes retained high predicted processing and presentation potential following integration into the antibody scaffold.
Molecular Dynamics (MD) Simulations
Molecular Dynamics (MD) simulations offer a powerful computational approach to study the structural
and dynamic
behavior of biomolecules at atomic resolution. By numerically solving Newton’s equations of motion
for atoms over
time, MD enables prediction of how molecules fold, bind, and fluctuate in environments mimicking
physiological
conditions [97].
In the context of our project, MD simulations were used to evaluate the conformational dynamics and
interaction
profiles of ASOs with target mRNA duplexes.
Extracted features from these simulations—such as hybridization energies, RMSD profiles, and
flexibility
metrics—were incorporated as a second step in our prediction model.
These dynamic features go beyond static structure-based predictors, offering a more nuanced
understanding of ASO
efficacy.
Initial models of the DNA:RNA duplexes were constructed using Discovery Studio. The extracted PDB files were then reformatted to comply with AMBER’s LEaP atom conventions. To model chemically modified antisense oligonucleotides, the [98] toolkit was employed to generate custom libraries for four phosphorothioate (PS)-modified residues: PS-modified Adenine (PSA), PS-modified Cytosine (PSC), PS-modified Guanine (PSG), and PS-modified Thymine (PST). Using CPPTRAJ [99], the duplex PDBs were updated according to the created libraries, and native residues were renamed to their modified counterparts.
Figure 53. Initial 3D structure of an ASO–RNA duplex after PDB construction and cleaning.
Topology and coordinate files were prepared using the LEaP program within AmberTools 2025. The OL15 [100][101] force field was applied for modified DNA strands and OL3[102] for RNA, while missing parameters were supplemented with the general AMBER force field (GAFF2) [103] and libraries generated by modXNA for the modified residues. Each PS-DNA:RNA duplex was solvated in a truncated octahedral box of the optimal 4-point rigid water model (OPC)[104], with a 10 Å buffer between the solute and the box boundary. The system was neutralized with Na⁺ and Cl⁻ counterions, and excess ions were introduced to achieve a physiological ionic strength of 150 mM.
Figure 54. ASO–RNA duplex after system setup in LEaP, solvated in a truncated octahedral box of water with added NaCl ions, prepared for simulation.
All simulations were performed using the sander engine from AmberTools 2025. A single system underwent a two-stage energy minimization: first, with restraints on the DNA:RNA duplex to allow relaxation of the solvent and ions, followed by unrestrained minimization of the entire system. The structure was then gradually heated to 310 K over 100 ps under constant volume conditions, with harmonic restraints applied to the nucleic acids. Equilibration was performed for 500 picoseconds in the NPT ensemble (constant pressure and temperature) at 310 K and 1 atm, with reduced restraint forces. Finally, a 2 nanosecond unrestrained main production run was carried out. All simulations used a 2 fs timestep, a 10 Å cutoff for nonbonded interactions, and employed the Langevin thermostat for temperature control. The SHAKE algorithm [105] was applied to constrain all hydrogen atoms bound to heavy atoms. Trajectories were saved every 2 picoseconds for downstream analysis.
Figure 55. Representative MD trajectory of an ASO–RNA duplex during the production stage.
To efficiently run the large set of duplex simulations, we employed parallel computing on the Power-SLURM and Power9 high-performance computing (HPC) servers at Tel Aviv University. Job submission and resource allocation were handled using PBS scripts, which automated the generation of input files, submission of tasks, and monitoring of progress across multiple experiments. Each simulation was parallelized across 64 CPU cores, allowing for efficient scaling and reduced execution time. On average, a full production run for a single duplex required approximately 15 hours of wall-clock time to complete. Depending on queue availability, between 10 and 30 simulations were executed in parallel, corresponding to an effective utilization of roughly 640–1920 CPU cores at peak load. This approach enabled us to manage the computationally demanding workload of simulating over 100 modified ASOs, ensuring both reproducibility and scalability across multiple systems.
For post-simulation analysis, trajectories and topology files were processed using CPPTRAJ.
Solvent molecules and ions were removed, and the duplex was centered and imaged into the primary
unit cell to ensure structural consistency across frames.
The following ֵanalysis was performed:
nastruct command [99], a comprehensive set of
helicoidal and
base-pair step descriptors were extracted, including shear, stretch, stagger, buckle, propeller
twist, opening,
hydrogen bonding, and groove dimensions (major/minor). These parameters provide a detailed
picture of the duplex
geometry and stability.
Together, this analysis captures both the global dynamics and fine-grained structural features of
the duplex.
To illustrate the computational workflow, below is a schematic of the MD pipeline, from initial PDB
preparation through parallelized simulations to final structural and energetic analyses.
Molecular dynamics (MD) simulations were performed for approximately one hundred antisense oligonucleotides (ASOs) in complex with their target RNA sequences. From these trajectories, we systematically extracted a rich set of structural and energetic descriptors, yielding more than five hundred features for each duplex. These features spanned local base-pair and base-step geometries, global stability measures such as RMSD, sugar puckering states, hydrogen bonding patterns, and binding free energy decompositions obtained from MMPBSA analysis. Together, this large-scale dataset provides a quantitative bridge between atomistic dynamics and experimentally measured inhibition values for the ASOs.
In total, over 500 structural and energetic descriptors were extracted from the molecular dynamics (MD) trajectories. These features capture both the geometry of the duplex and its thermodynamic behavior, providing a rich quantitative fingerprint of each ASO. Below we summarize the main categories:
A complete glossary of all features with definitions and explanations, is available in the Supplementary Appendix
While the full dataset includes hundreds of structural and energetic descriptors as discussed, the following figures visualize a few representative examples on a single duplex, illustrating how these features are extracted and what kinds of insights they may provide.
The root mean square deviation (RMSD) over the course of the simulation provides a measure of the overall structural stability of the system. As shown in Figure 58, the RMSD rises rapidly during initial equilibration but subsequently stabilizes, fluctuating around a plateau value. This stabilization indicates that the duplex adopts a persistent conformation within the timescale of the simulation, and no large-scale unfolding events are observed.
Another representative feature is the twist angle distribution across base-pair steps. Twist is a fundamental descriptor of nucleic acid helical geometry, reflecting the degree of rotation between consecutive base pairs. In Figure 59, the average twist values are plotted for each base-pair step of the duplex. While most steps retain persistent close to canonical twist values, several positions exhibit some deviations, pointing perhaps to regions of local flexibility or distortion induced by the ASO sequence. Such local geometric heterogeneity can influence recognition, hybridization stability, and ultimately biological efficacy.
Energetic contributions were also decomposed to provide insight into the forces stabilizing or destabilizing the duplex. Figure 60 displays as example of the energy breakdown from a simulation: van der Waals, electrostatic, generalized Born solvation, and nonpolar surface terms of the complex. As expected, electrostatics in vacuum (EEL) contribute strongly destabilizing forces (positive value), whereas solvation (EGB) counterbalances these interactions with stabilizing contributions (negative value). van der Waals interactions provide additional stabilization, while the nonpolar surface term (ESURF) is minor. This decomposition highlights the delicate balance of competing forces that governs duplex stability.
It is important to note that the RMSD profile, twist distribution, and energy decomposition shown here represent only a small subset of the structural and energetic descriptors extracted from the full molecular dynamics production runs. In total, hundreds of features like these were computed, capturing diverse aspects of duplex stability, geometry, and dynamics, many of which may provide additional insight into ASO behavior beyond the examples highlighted here.
Beyond single-case visualizations, the complete dataset of ~ one hundred ASOs allows a direct quantitative comparison between simulation-derived features and experimental inhibition. In order to normalize the inhibition data, raw values were converted to a logarithmic scale (log inhibition). Correlation analysis across the full dataset revealed that many structural and energetic features exhibit significant associations with inhibition potency. As shown in Figure 61, several energy-based terms, display correlation coefficients around 0.6 with log inhibition, underscoring their predictive value.
In addition to simple correlation, binned scatter plots were generated to better visualize the relationship between feature values and inhibition. In Figure 62, inhibition values are averaged within bins of feature values, revealing monotonic trends across multiple descriptors. These plots emphasize how shifts in specific structural or energetic parameters translate into consistent differences in biological activity, reinforcing the mechanistic link between MD-derived features and function.
Finally, the interrelationships between the most predictive features were explored using a correlation heatmap. Figure 63 displays the correlation structure of the top features with log inhibition.
Together, these analyses demonstrate that atomistic simulations of ASO:RNA duplexes capture features that are strongly associated with biological inhibition. The ability to extract and quantify these descriptors provides not only mechanistic insight into ASO-RNA interactions but also a foundation for predictive modeling, where MD-derived features can be systematically leveraged to guide rational ASO design.
To further validate the predictive value of the extracted molecular dynamics features, we simulated a subset of seven ASO-RNA duplexes that we experimentally tested in our wet-lab. These ASOs were designed by our main computational model against MALAT1 RNA, and their ability to reduce MALAT1 RNA levels was quantified in cell-based assays. By running MD simulations on the same ASO-RNA duplexes, we directly compared computationally derived structural and energetic features with experimentally observed RNA levels.
It should be noted, that these ASOs had been preselected for experimental validation based on our main computational model that ranked candidates among other things according to energetic descriptors - such as hybridization free energies, nearest-neighbor hybridization, and melting temperature heuristics. Consequently, this pre-selection enriched the experimental set for sequences that were already optimized with respect to energetic stability.
This background is important in interpreting our findings. When correlating the MD-derived features with experimental expression outcomes, we observed that energetic descriptors contributed very little additional predictive power within this small, preselected set. Instead, the features that showed the strongest correlations with experimental RNA levels of MALAT1 were predominantly structural backbone descriptors such as roll, rise, shear, or groove widths (more information about backbone descriptors is available in the Supplementary Appendix). This contrasts with the previous ~100 ASO dataset previously discussed, in which energetic features were among the strongest predictors. We hypothesize that because the experimental ASOs were already selected from the “best” energetic space, as a result, energetic features lost their discriminatory power, and non-energetic structural descriptors, absent from the initial selection pipeline, emerged as the differentiating factors.
To illustrate this point, we quantified the relative contribution of energetic versus non-energetic features in the experimental dataset. Among the top 20 features, none were energetic descriptors, whereas in the top 30 features, only 7% were energetic (Figure 64).
We next visualized selected correlations in detail. Scatter plots of representative structural features (Figure 65, top row) showed strong associations with expression, with clear linear trends and highly significant correlation coefficients. In contrast, scatter plots of the previously top selected energetic features (Figure 65, bottom row) demonstrated much weaker predictive ability in this experimental set, reinforcing the observation that energetic features provide limited discriminatory power once pre-selection has already been applied.
Overall, these findings suggest a complementary role between our main computational model and as a second step - MD-derived descriptors. Together, they may form the basis for an integrated, multi-layered predictive pipeline for ASO efficacy.
While our current analyses focused on duplex formation between ASOs and their target RNA sequences, an important future step will be to systematically simulate the folding behavior of single-stranded ASOs. In biological systems, ASOs are not always present in their idealized linear conformation; instead, they may self-organize into secondary structures. These conformations can significantly influence hybridization kinetics, accessibility of specific motifs, and ultimately the efficiency of target inhibition.
MD simulations of self-folding events would allow us to capture this structural heterogeneity at atomic resolution. By monitoring how an initially linear ASO collapses into partially folded states, we can assess the degree of structural stability and identify sequence motifs prone to intramolecular interactions. Such simulations will provide complementary insights to duplex analyses: whereas duplex stability reflects the binding potential with the intended RNA target, single-stranded folding reveals the internal competition that may reduce the effective binding fraction in a cellular environment.
As an illustrative example, a preliminary simulation of a randomly selected ASO demonstrates how a single strand spontaneously folds into a compact structure over the course of 10 nanoseconds. Although not part of our experimental panel, this case highlights the importance of incorporating MD derived folding behavior into predictive models. In future work, integrating single-stranded folding features with duplex-derived parameters will enable a more comprehensive understanding of ASO activity and improve predictive power across diverse sequence contexts.
Figure 67. Molecular dynamics simulation of a single-stranded ASO folding into a compact secondary structure.
Across both the large-scale dataset of ~100 ASOs and the experimentally validated subset of seven MALAT1-targeting ASOs, our analyses revealed that structural and backbone-related features derived from molecular dynamics are strong predictors of biological efficacy. While energetic descriptors dominated the larger screening dataset, they lost discriminative power within the preselected experimental subset, underscoring how structural variability becomes prominent once energetic space is saturated. Importantly, our single-stranded folding simulations further highlight the intrinsic conformational heterogeneity that may impact target accessibility.
Taken together, these results suggest that molecular dynamics is most valuable as a second-stage prioritization tool within our general ASO discovery pipeline. The initial generation of candidate ASOs should rely on our broader predictive model, which is computationally efficient and captures many determinants. From this set, MD simulations can then refine prioritization by quantifying structural features, that strongly correlate with inhibitory activity. In this way, MD complements the general model: it adds a higher-resolution filter that helps focus experimental resources on the most promising ASOs.
Our model integrates structural, thermodynamic, and sequence-based features of antisense oligonucleotides (ASOs) to predict their inhibition efficiency. It was trained using experimentally measured datasets across multiple human cell lines to identify parameters most predictive of ASO activity.
The model was trained with two primary goals: first, to produce results that generalize across multiple cell lines, and second, to improve the ranking of top-performing ASOs rather than solely maximizing correlation values.
During development, we adopted a cross-validation approach — training on three cell lines and testing on the fourth (the unseen cell line). For the final version, we trained on all available data to maximize the model’s learning capacity, which means there is no completely unseen test set for direct comparison.
This evaluation approach led to a clear improvement over our previous model:
| Cell line | Samples | Spearman (Old) | Spearman (New) |
|---|---|---|---|
| A431 | 6541 | 0.16 | 0.429 |
| SK-MEL-28 | 3394 | 0.28 | 0.340 |
| KARPAS-229 | 972 | 0.24 | 0.290 |
| MM1.R | 2836 | 0.40 | 0.484 |
Correlations improved across all unseen cell lines, most notably in A431 (a melanoma-derived line) — a particularly relevant result, as we hypothesize that its behavior may be closer to lung tissue than that of lymphoma or liver-derived lines. Moreover, the correlations became more consistent overall: the lowest correlation increased from 0.16 to 0.29. Since our focus is on top-performing ASOs, we next evaluated ranking-based metrics.
The Normalized Discounted Cumulative Gain (NDCG)[107] measures how well the model ranks ASOs by their true experimental effectiveness. It assigns higher importance to correctly ranking the most active ASOs near the top, while penalizing errors that appear lower in the list. Formally, it is defined as:
\[ \mathrm{NDCG@k} = \frac{1}{\mathrm{IDCG@k}} \sum_{i=1}^{k} \frac{2^{rel_i} - 1}{\log_2(i + 1)} \]
where \( rel_i \) is the true relevance of the ASO at rank \( i \), and \( \mathrm{IDCG@k} \) is the maximum possible (ideal) DCG for the same set. Values close to 1 indicate nearly perfect ranking.
The Precision@k (P@k) metric measures the fraction of true top-performing ASOs among the top \( k \) predictions:
\[ \mathrm{P@k} = \frac{\text{Number of true top ASOs in top } k}{k} \]
While Precision focuses on how many of the top predictions are correct, NDCG accounts for their relative ordering, making both complementary in assessing ASO ranking performance.
The following table compares the ranking performance of our ASO efficacy model across four human cell lines. Metrics include precision at 50 and 100 (P@50, P@100) and normalized discounted cumulative gain (NDCG) at various cutoffs (50, 100, 200). Higher values indicate better ranking performance.
| Cell line | Samples | P@50 (Old) | P@100 (Old) | NDCG@200 (Old) | NDCG@All (Old) | P@50 (New) | P@100 (New) | NDCG@200 (New) | NDCG@All (New) |
|---|---|---|---|---|---|---|---|---|---|
| A431 | 6541 | 0 | 0 | 0.5192 | 0.965 | 0.04 | 0.08 | 0.606 | 0.957 |
| SK-MEL-28 | 3394 | 0.06 | 0.12 | 0.8428 | 0.981 | 0.10 | 0.23 | 0.84 | 0.976 |
| KARPAS-229 | 972 | 0.02 | 0.17 | 0.8576 | 0.980 | 0.02 | 0.07 | 0.847 | 0.973 |
| MM1.R | 2836 | 0.12 | 0.17 | 0.7929 | 0.9826 | 0.02 | 0.07 | 0.728 | 0.972 |
The new model demonstrates a clear trade-off between peak performance on specific cell lines and robust generalization. While the previous model achieved higher NDCG scores on KARPAS-229 and MM1.R, it failed completely on A431, with a precision of zero. Our new model successfully remedies this by significantly improving performance on the weakest cell line (A431), establishing a much higher and more consistent baseline across all data. This shift aligns directly with our primary goal of creating a generalizable model that avoids catastrophic failures on unseen data, even if it means a slight reduction in ranking performance on previously well-predicted lines.
When training on the full dataset, the global correlation between predicted scores and experimentally measured inhibition is summarized below.
When ranking all ASOs by the model’s predicted scores and correlating them with volume-corrected inhibition, we observe a strong fit. Even more notably, the raw inhibition values themselves follow the same trend — as shown by the dense cluster of dark-red points in the upper-right corner — confirming that our model successfully identifies high-performing ASOs.
As for a direct measurement of top performers, we can take a look at the following graph:
We can see that for the top \( k \), our model performs well — especially when \( k \) is small, which is the most relevant use case, as we aim to minimize cost and resources by ordering only a few top ASOs.
In practical terms, this means that if we select the top 50 ASOs predicted by the model, a large proportion are indeed among the experimentally strongest inhibitors. This enables substantial cost and time reduction when designing ASO libraries, as only a small subset needs to be synthesized and tested experimentally.
Future iterations of the model will incorporate additional features, including RNA–protein interaction profiles and LNA thermodynamic weights, and will be validated on newly measured datasets to confirm cross-cell-line generalization and robustness.
In summary, our updated model successfully meets its design objectives. It demonstrates superior and more consistent correlation across multiple cell lines, particularly improving performance on previously challenging data. Furthermore, its ability to accurately rank the most effective ASOs in the top percentiles, as shown by the precision-at-k analysis, confirms its utility as a powerful tool for prioritizing candidates for experimental validation. This balance of generalization and top-tier ranking performance marks a significant step forward in our ASO efficacy prediction capabilities.
Our model demonstrated its effectiveness in real experiments, successfully designing ASOs that outperformed
the current MALAT1 benchmark used in the industry. This confirms that ASOdesigner’s predictions translate
into meaningful experimental results and can greatly reduce the effort required for ASO development.
Specifically, it supports the novel features we introduced in the our model.
The details of this success and its experimental validation are described on
our Results page, and our process is described
in our Engineering page on our second and third iterations.
Moving forward, we plan to enhance the model by integrating features such as RNA-protein interaction
profiles and improve the way we model chemical modification effects, and by validating it across additional
datasets to ensure broader applicability. These next steps will further strengthen ASOdesigner as a
reliable and versatile tool for advancing ASO-based applications such as cancer therapeutics.