Model

Abstract

Antisense oligonucleotides (ASOs) have demonstrated clinical success in several genetic diseases, but their therapeutic potential in cancer remains largely unrealized. Tumor heterogeneity, delivery barriers, and off-target effects have limited efficacy, even though ASOs uniquely enable precise modulation of RNA, including targets considered undruggable. Designing potent and specific ASOs therefore remains a major challenge, often constrained by labor-intensive experimental optimization.

We developed ASOdesigner, a stand-alone machine-learning framework for ASO sequence generation and evaluation. The process begins with comprehensive data cleanup and validation, which involves locating many target genes across the human genome, including SNP-containing variants, and verifying corresponding experimental inhibition data retrieved from Lens.org and other public sources. We then apply an ODE-based kinetic model to infer an efficiency variable representing ASO–RNA hybridization dynamics, which serves as the target of the machine-learning predictor. The model learns this variable from an extensive feature library encompassing structural, thermodynamic, biochemical, and transcriptomic properties, including folding free energy, duplex hybridization energy, local GC/AT skew, repetitiveness, toxic motif frequency, RNase H1 cleavage preferences, positional nucleotide entropy, codon-usage bias, mRNA half-life, transcript accessibility, secondary-structure context, and global expression level. The full ASODesigner source code is available on GitHub: ASOdesign. The final model and production software used in this work are hosted in our iGEM repository: Software Tools.

The next stage performs genome-wide off-target analysis, integrating Watson–Crick mismatch tolerance and homology mapping using a bow-tie-indexed database and a modified RIsearch hybridization engine. Finally, top-ranked ASOs undergo molecular-dynamics simulations executed on a high-performance computing (HPC) cluster with Slurm-based parallelization, enabling large-scale evaluation of conformational stability and validation of predicted binding efficiency.

ASOdesigner operates independently from, but interoperates with, additional computational modules: a gene interaction network graph identifying selective and non-toxic gene targets, and an antibody–epitope design module enabling targeted delivery and immune activation. Our antibody–epitope modeling uses PSSM-based sequence optimization to identify insertion sites that preserve antibody structure while ensuring efficient ASO delivery, complemented by expression optimization using MNDL and ESO stability predictions to achieve high yield. We further applied epitope presentation prediction algorithms to identify which neo-epitopes are most likely to be displayed on cancer cells, ensuring that immune activation aligns with naturally presented tumor antigens.

Together, these models form an integrated computational pipeline-from ASO design to delivery-offering a transferable framework that other teams can adapt for therapeutic modeling, advancing ASO-based cancer therapeutics toward practical viability.

Our model integrates multiple components that together form the complete ASO design and analysis pipeline. Each icon below represents one of these key elements - CLICK on any icon to explore how our model connects to that part in more detail.

iGEM Judging Criteria

Background

Antisense oligonucleotides (ASOs) are short, single-stranded DNA or RNA molecules that selectively bind to complementary mRNA sequences to regulate gene expression. Their ability to silence specific genes has made them powerful therapeutic tools for both genetic and oncologic applications. However, designing effective and safe ASOs remains a challenge due to factors such as RNA structure, hybridization dynamics, off-target binding, and chemical modifications. Our model integrates these parameters—together with systems-level analyses such as synthetic lethality and immune activation—to create a unified, multi-layered framework for precise ASO design and delivery.

Model Objectives

Our goal is to develop a predictive model for the design of all components of the antibody–ASO–epitope conjugate platform for cancer therapy. The model is designed to achieve:

Accurate ASO Prediction

The primary aim of our model is to create a ranking framework that incorporates novel features not considered in previous approaches, in order to improve both specificity and efficacy. Based on this ranking, the model generates the most promising ASO sequences for a given target.

Unique predictive features include:

  • On-target hybridization efficiency under different chemical modifications
  • Molecular dynamics (MD) associated features
  • mRNA folding openness features
  • Off-target potential (considering expression levels of off-targets)
  • RNase H–preferred cleavage regions
  • Additional features described in the Methods section

Smart Target Selection

Prioritizing ASO targets based on synthetic lethality (SL) principles, ensuring selective killing of cancer cells while sparing healthy tissue.

Immune System Engagement and Antibody Optimization

Designing antibodies with optimized expression levels and epitope incorporation, ensuring that the chosen epitopes are effectively displayed on cancer cells, while maintaining structural integrity of the antibody. This enables efficient antibody–oligonucleotide conjugation and proper immune activation alongside ASO activity.

Methods

Dataset and Pre-Processing

Dataset Source

In exploring modeling methods for ASOs, we came across ASOptimizer, a framework designed to optimize ASO design at both the sequence and molecular levels [9]. In their study, the authors constructed a dataset of experimental observations extracted from patents and scientific publications on ASOs. This resource provided us with an opportunity to build our model on real biological data and investigate the behavior of ASOs under different conditions.

The dataset we retrieved contains ~37,000 in vitro entries, each including an ASO sequence along with experimental details such as ASO concentration (volume in the dataset), treatment duration, cell line type, target gene, cell density per well, sequence length, and several other parameters. Entries also specify chemical modifications and their positions within the sequence, different linkage types, SMILES notation, and inhibition percentages measured by qRT-PCR. The inhibition value served as the key readout to assess the effectiveness of an ASO in reducing target mRNA levels.

Pre-Processing Steps

Understanding the Data

We began by exploring the dataset using the Python Pandas package. By examining unique values and distributions, we mapped the diversity of target genes, the number of distinct cell lines, and the variation across experimental setups. The ASOptimizer paper placed particular emphasis on the role of chemical modification patterns, and indeed we found the dataset to contain a wide range of modification types and positional arrangements. Distributions of inhibition range, types of experimental properties, sequence length, genes and cell lines are shown in figures 1 and 2.

Figure 1: Distribution of inhibition percentages in ASO experiments.
Histogram shows the frequency distribution of inhibition values ranging from -50% to 100% (n=34,741). Red dashed line indicates mean inhibition (45.7%), orange dashed line shows median inhibition (47%). Inset shows detailed distribution of extreme negative entries(<-50%) with frequency counts and range values.
Competitive Analysis Matrix
Figure 2: Comprehensive distribution analysis of ASO experimental dataset characteristics.
A) Target gene distribution. Horizontal bar chart showing the top 15 most frequent genes exist in data, with IRF4 leading at 6,397 entries, followed by HSD17B13 (5,660 entries) and MALAT1 (3,063 entries). B) ASO volume distribution. Log-scale histogram displaying the bimodal distribution of treatment concentrations (nM), which in the data refered to as volume, with concentrations around 1-10 nM and 1-10k nM ranges. C) Treatment duration distribution. Bar chart showing experimental treatment duration, predominantly 24-hour treatments (26,316 entries) and 48-hour treatments (5,100 entries). D) ASO sequence length distribution. Pie chart illustrating nucleotide length frequencies, with 16-mers comprising 63.6% of sequences, 17-mers at 16.2%, and 20-mers at 13.4%. E) Cell line distribution. Pie chart showing the top 10 cell lines used, led by HepaRG (19.6%), A431 (19.1%), and HepG2 (17.8%)

Validation and Data Cleaning

After the initial exploration, we validated gene names and cell line identifiers. To simplify downstream analysis, we added standardized features such as the canonical gene name and the organism of origin for each cell line. Columns with inconsistent data types (e.g., mixing numeric and string values) were converted to a uniform format. We also checked the validity of the sequence length column and removed entries with missing inhibition values, and averaged repetitive experiments to avoid introducing bias into the model.

Extracting Additional Features

To enrich the dataset, we designed functions to identify the location of each ASO sense sequence on the target gene pre-mRNA. For this task, we used the human genome reference (GENCODE, release 34; GRCh38.p13) [17]. In cases where ASO sequences could not be mapped to pre-mRNA, we searched across different transcripts using the Python package gget [18]. For some sequences, further investigation revealed that they originated from patents describing ASOs targeting single nucleotide polymorphisms (SNPs) [19]. For each successfully mapped sequence, we recorded its absolute location in the target and its relative position (normalized by transcript length) as new features.

Dealing with Outliers

We next identified and flagged entries with abnormal values, such as unusually high ASO volumes or unrecognized cell lines. Where possible, we verified questionable entries by consulting the original patent records. Flagged entries were annotated in the dataset to enable filtering during downstream analyses (see Figure 3).

Two specific challenges required special handling:

  • Modification scans: Some experiments used identical ASO sequences but varied the chemical modifications. These were explicitly labeled as “modification scan” experiments so that our model could disentangle sequence effects from modification effects.
  • Duplicated rows with differing inhibition values: In these cases, we merged the duplicates by assigning a common index and averaging the inhibition percentage across replicates.
Competitive Analysis Matrix
Figure 3: Overview of the preprocessing pipeline applied to the ASOptimizer dataset, including data cleaning, validation, feature extraction, and handling of special cases, leading to the final dataset used for modeling.

Final Dataset

After preprocessing, the dataset was transformed from the raw ~37,000 entries into a consistent and reliable form. The final dataset contained ~35,000 entries with validated sequence, experimental, and modification information, with indications for duplicated entries. Each entry included its ASO sequence, inhibition measurement, and associated experimental metadata, ensuring that the data was valid and informative for further analysis and feature engineering.

Challenges and Considerations

Although the dataset included a wide range of genes and cell lines, certain experimental conditions may have introduced substantial bias. In particular, high ASO concentrations and various treatment periods can artificially boost inhibition values, making it difficult to distinguish whether an observed effect reflects the intrinsic properties of the ASO sequence or simply the experimental setup. This complicates interpretation and risks training a model that captures biases in experimental design rather than true sequence–function relationships. An example of the effect of the ASO volume is shown in figure 4.

Competitive Analysis Matrix
Figure 4: Distribution of ASO inhibition across concentration (volume) ranges.
ASOs were grouped into logarithmically spaced bins based on tested volume. Each point represents the mean inhibition percentage within a bin, with vertical error bars showing the standard deviation (STD). Labels below each point indicate the mean ± STD (inhibition %) and the number of ASOs in the bin (n). The x-axis is logarithmic while the y-axis is linear to capture trends across orders of magnitude. Entries with missing, zero, or negative values for volume or inhibition were filtered out, and bins containing fewer than three ASOs were excluded. Pearson and Spearman correlation analyses yielded significant positive associations between ASO concentration (volume) and inhibition (Pearson r = 0.27, p < 10⁻³⁰⁰; Spearman ρ = 0.28, p < 10⁻³⁰⁰).

Another major challenge was handling chemical modifications. Identical ASO sequences can appear multiple times in the dataset with different modification types and positional patterns, which dramatically influence inhibition outcomes. To capture these effects accurately, the model must be designed to treat modifications not as minor variations, but as key determinants of ASO behavior that require dedicated feature encoding. Without explicitly addressing modifications, predictions risk being misleading or incomplete.

Finally, duplicated or inconsistent entries also posed difficulties. While we implemented strategies to merge and clean these records, not all ambiguities could be resolved. Consulting patents to validate questionable entries was particularly time-consuming and could not be applied universally. Despite these challenges, the preprocessing pipeline allowed us to create a structured, enriched dataset with reliable features. This prepared the ground for building models capable of predicting ASO performance based on both sequence and chemical properties.


ODE-Based Kinetics

First-Order Kinetics

We begin with a simplified model, assuming the ASO–mRNA reaction proceeds at a rate that depends linearly on reactant concentrations. Because ASO degradation is slow and largely independent of target interactions [Z1], we can treat the ASO concentration as constant throughout the reaction.

Let $V$ denote the ASO concentration (sometimes referred to as volume), $m$ the concentration of target mRNA, and $k$ the efficiency constant of that ASO. The mRNA degradation can be expressed as:

\[ \dfrac{dm}{dt} = -kVm(t) \]

The solution of this first-order differential equation is:

\[ m(t) = m(0)e^{-kVt} \]

Relating the Model to Experimental Inhibition

Experimentally, mRNA degradation is reported as the inhibition percentage: if $I$ denotes inhibition, $T$ the treatment period, and $m(0)=1$, then

\[ I = 100\,(1 - m(T)) \]

Substituting the analytical solution gives:

\[ \begin{aligned} I &= 100\,(1 - e^{-kVT}) \\ e^{-kVT} &= 1 - \frac{I}{100} \\ -kVT &= \ln(1 - \tfrac{I}{100}) =: I^* \\ -k &= \frac{\ln(1 - \tfrac{I}{100})}{VT} \end{aligned} \]

Here $I^*$ is termed the log-inhibition. Because the dataset occasionally contains $I=100$, which would cause infinite efficiency, we introduce a small bias term $c$ as a log-correction: $\,I^* = \log(100 - I + c)$, with $c = 1$ (≈1% bias).

Implications

Two insights follow: (1) correlations with log-inhibition are more meaningful than those with raw inhibition (see Figures 5 and 6), and (2) if the ODE model holds, log-inhibition should vary linearly with both ASO concentration ($V$) and treatment time ($T$).

Model Validation

The table below summarizes Pearson correlations between volume and inhibition before and after the log transformation across several genes and cell lines.

Cell line Gene Samples Pearson (Inhibition) p-value (Inhibition) Pearson (Log-Inhibition) p-value (Log-Inhibition)
A431 KRAS 925 0.422 \(2.95\times10^{-41}\) 0.477 \(9.23\times10^{-54}\)
A431 SOD1 429 0.409 \(1.03\times10^{-18}\) 0.409 \(9.65\times10^{-19}\)
A431 APOL1 2134 0.368 \(2.10\times10^{-69}\) 0.363 \(1.91\times10^{-67}\)
KARPAS-229 IRF5 972 0.301 \(8.73\times10^{-22}\) 0.419 \(1.15\times10^{-42}\)
MM.1R IRF4 3394 0.383 \(3.47\times10^{-119}\) 0.483 \(6.67\times10^{-198}\)
SK-MEL-28 IRF4 2836 0.295 \(3.96\times10^{-58}\) 0.397 \(7.22\times10^{-108}\)
H929 IRF4 73 -0.044 \(7.12\times10^{-1}\) -0.083 \(4.86\times10^{-1}\)
HepaRG HSD17B13 5566 0.515 \(<10^{-300}\) 0.619 \(<10^{-300}\)
HepG2 SOD1 1625 0.365 \(2.32\times10^{-52}\) 0.476 \(1.54\times10^{-92}\)
HepG2 DGAT2 2504 0.543 \(8.70\times10^{-192}\) 0.608 \(1.30\times10^{-252}\)

Pearson correlations improved markedly after converting to log-inhibition, supporting the ODE-based transformation. However, when applied to the treatment period ($T$), the relationship was weak or even negative:

Correlation between treatment period and inhibition
Figure 5: Negative correlation between treatment period and inhibition.
The plot shows the relationship between the duration of treatment and observed inhibition values. A clear negative trend is observed, indicating that longer treatment periods correspond to reduced inhibition. Each data point represents a mean value with error bars indicating standard deviation.
Correlation between treatment period and log-inhibition
Figure 6: Negative correlation between treatment period and log-inhibition.
The log-transformed inhibition data preserve the inverse relationship seen in Figure 5, confirming the robustness of the observed trend across different data scales.

Thus, while the simple ODE captures the volume dependence, it does not fully explain the effect of treatment time. We therefore corrected for volume but not for treatment period.

Dealing with Noise

From the equation above, one might attempt a direct normalization $-kT = \dfrac{I^*}{V}$, but measurement noise makes this unreliable. To mitigate this, we fitted a linear regression (OLS) between volume and log-inhibition, obtaining the empirical fit \[ (-kT)(V \cdot 6.01\times10^{-5} + 0.537) = I^*, \] and hence \[ -kT = \dfrac{I^*}{6.01\times10^{-5}V + 0.537} := I^V, \] where $I^V$ is the volume-corrected inhibition. The derived value of $-kT$ was then used as the efficiency descriptor in the machine-learning model.



Feature Engineering

With the cleaned and validated dataset in place, we next aimed to understand the patterns behind ASO performance. To do so, we translated the experimental observations into quantitative models that connect inhibition outcomes with factors such as concentration, treatment duration, and molecular properties.
The following parts describes our Feature Engineering processes.


Hybridization

When an ASO binds its target mRNA, hybridization occurs. A common way to quantify this interaction is the melting temperature ($T_m$), defined as the temperature at which half of molecules are duplexed and half remain single-stranded.

An optimal binding affinity has been observed empirically [20] (see Figures 7 and 8). If duplex stability is too low, hybrid formation and RNase H1 recruitment are inefficient. If binding is too strong, off-target hybridization increases. For 16-mer gapmer LNA–ASOs, $T_m \approx 50$–$55\,^\circ$C balances efficacy and safety; values $>55\,^\circ$C correlate with hepatotoxic risk [20]. Another thermodynamic measure of binding strength is the Gibbs free energy, denoted $\Delta G^\circ$.

Accurate $T_m$ or $\Delta G^\circ$ values require in vitro melting experiments (UV or fluorescence), which underpin the nearest-neighbor thermodynamic database [21].

ASO Chemical Modifications

ASOs typically include (i) phosphate-backbone modifications, most commonly complete phosphorothioate (PS) substitution, and (ii) sugar modifications such as 2′-O-MOE or locked nucleic acid (LNA). While PS backbones are nearly universal, sugar chemistry strongly influences hybridization affinity and specificity. Accurate modeling must account for both.

In Silico Hybridization Scoring

We use several pre-computed hybridization metrics as model features:

Phosphorothioate (PS) Nearest-Neighbor Weights

We incorporated PS nearest-neighbor weights derived from wet-lab measurements [23]. Values reported at $50\,^\circ$C were converted to $37\,^\circ$C using the standard thermodynamic relation:

\[ \Delta G^\circ(T) = \Delta H^\circ - T\,\Delta S^\circ \]

where $\Delta G^\circ(T)$ is the standard Gibbs free energy of duplex formation at temperature $T$ (in kelvin), $\Delta H^\circ$ is the enthalpy change of hybridization, and $\Delta S^\circ$ is the corresponding entropy change. This linear approximation assumes $\Delta H^\circ$ and $\Delta S^\circ$ remain constant over the temperature range considered, allowing extrapolation of free energies between experimentally measured and physiological conditions. It captures the balance between enthalpic stabilization from base-pairing and stacking, and the entropic cost of ordering two flexible strands into a duplex.

The dataset provided PS/PS, PS/DNA, and DNA/DNA weights. PS/RNA data were unavailable, as RNA strands are less stable and difficult to assess via high-resolution melting. We then estimated:

\[ \Delta G^\circ_{\mathrm{PSDNA\text{–}RNA}} \approx \Delta G^\circ_{\mathrm{DNA\text{–}RNA}} + \bigl(\Delta G^\circ_{\mathrm{PSDNA\text{–}DNA}} - \Delta G^\circ_{\mathrm{DNA\text{–}DNA}}\bigr) \]

This approximation enables rapid on-target PS–RNA energy estimation. However, it is limited to local nearest-neighbor corrections; transcriptome-wide search tools provide broader context, including alternative sites, structural accessibility, and off-target potential.

RISearch hybridization efficiency and ASO performance
Figure 7. Relationship between RISearch-assisted hybridization strength and ASO efficacy. When examining the fraction of high performers (ASOs achieving >93% inhibition), two peaks emerge—one near 16-mer ASOs and another near 20-mer ASOs. Weak hybridization results in low activity, whereas excessive hybridization reduces efficacy and may increase unmeasured toxicity. An optimal hybridization range is therefore required for efficient ASO activity.
Hybridization efficiency and ASO performance
Figure 8. Relationship between direct hybridization strength and ASO efficacy. Two clear peaks appear—one corresponding to ~16-mer ASOs and another to ~20-mer ASOs. A third, weaker peak near –2750 scaled Gibbs energy exhibits wide confidence intervals, indicating limited statistical support in that region.
Molecular-Dynamics MOE/PS Weights

Molecular-dynamics simulations yielded nearest-neighbor weights for PS-MOE/RNA duplexes [24]. Summing these values across each sequence provides an MOE-aware hybridization feature.

Simplified Melting-Temperature Heuristic

A 2025 tool proposed a heuristic formula for estimating $T_m$ [25]:

\[ T_m = 100.5 + 41(N_G + N_C - N_A - N_T) - \frac{820}{N_A + N_T + N_G + N_C} + 16.6\,\log_{10}\!\bigl([\mathrm{Na}^+]\bigr) \]

where \(N_A, N_T, N_G, N_C\) are the counts of A, T, G, and C nucleotides in the ASO sequence. This heuristic provides a rapid $T_m$ estimate for short oligonucleotides under monovalent-salt conditions, complementing thermodynamics-based calculations.

Other Features

Other interesting features that relate to hybridization energy are self_energy and internal_fold. The first is the energy of 2 ASOs binding together, speculating that if this happens, then the likelihood to bind to target mRNA is reduced. This was calculated with the primer3.calc_homodimer function. internal_fold is the gibbs free energy of an ASO hybridized to itself, calculated with ViennaRNA (see Fold section).
While already providing predictive power, we believe that enriching the calculation with proper chemical weights of ASOs, and utilizing MD, we can increase predictive power even further.

Summary

Overall, we integrated both simplified analytical features and high-resolution, physics-based methods, improving upon existing approaches for modeling ASO hybridization and enhancing on-target prediction accuracy. When combined with fast transcriptome-wide search tools such as RIsearch (see Off-target analysis), these hybridization metrics also enable efficient and accurate off-target assessment, drastically improving genome-scale prediction quality. Future work will incorporate experimentally derived LNA nearest-neighbor weights from the literature, further refining thermodynamic modeling for chemically modified oligonucleotides.


RNase H

RNase H is an endonuclease that specifically cleaves the RNA strand of an RNA:DNA duplex, while leaving the DNA intact. In human cells, RNase H1 is essential for maintaining genome stability: it removes RNA primers from Okazaki fragments during DNA replication and resolves R-loops formed when RNA hybridizes to complementary DNA.

Unlike sequence-specific nucleases, RNase H1 recognizes the structural feature of the RNA:DNA hybrid rather than a defined sequence motif. However, its cleavage is not uniform — the efficiency and exact cut site depend strongly on the local nucleotide environment surrounding the scissile phosphate [26].

This property is especially important for ASOs. When an ASO binds its target mRNA, it forms a DNA:RNA hybrid that recruits RNase H1. The enzyme then catalyzes degradation of the RNA strand, reducing expression of the target gene. Understanding how RNase H1 recognizes and cleaves different hybrid contexts is therefore central to predicting ASO potency. Certain bases and short sequence motifs promote cleavage, while others inhibit it, creating characteristic patterns of preference.

RNase H

To systematically map the determinants of RNase H1 activity, the high-throughput RNase H Sequence Preference Assay (H-SPA) developed by Kielpinski et al. (2017) was applied [27]. This assay compares cleaved versus input libraries of randomized RNA:DNA hybrids, allowing precise measurement of sequence influence on cleavage.

The results showed that both single nucleotides and dinucleotide combinations surrounding the scissile phosphate can strongly modulate cleavage efficiency. These findings established that RNase H1 function follows reproducible, sequence-specific rules — knowledge that can be directly integrated into predictive computational models for ASO design.

In the H-SPA assay, randomized RNA:DNA hybrids were subjected to RNase H1 digestion, and cleaved versus input libraries were sequenced. To quantify the contribution of each base, enrichment scores were calculated as:

$$ \omega_i = \frac{f^{\text{cleaved}}_{i}(b)}{f^{\text{input}}_{i}(b)} $$

where \( f^{\text{cleaved}}_{b} \) is the frequency of base \( b \) at position \( i \) in the cleaved population, and \( f^{\text{input}}_{b} \) is its frequency in the input library [27]. These enrichment values were then assembled into position-specific scoring matrices (PSSMs), representing the contribution of each nucleotide at each position relative to the cleavage site. Equivalent matrices were also constructed for dinucleotides, capturing local pairwise sequence effects beyond single bases.

In addition, the authors introduced a relative processing rate to capture kinetic-like activity of cleavage:

$$ k_{\text{real}} = \frac{\log \log \left( 1 - f_{\text{sub}} \right)}{\log \log \left( 1 - f_{\text{ref}} \right)} $$

where \( f_{\text{sub}} \) is the cleaved fraction of a given substrate, and \( f_{\text{ref}} \) is that of a reference substrate.

Both log2 fold-change \(\log_{2}(\text{FC})\) and \( k_{\text{real}} \) values were reported for three experimental constructs, each based on an independent input library and cleavage experiment. This means the full set of enrichment matrices was generated three times under distinct experimental conditions, producing separate weight sets that together provide a robust representation of RNase H1 sequence preferences [27].

Features implemented in our model

Based on the H-SPA–derived weight sets, we defined several RNase H1–related feature families. These features capture different aspects of sequence preference and cleavage efficiency, ranging from single-nucleotide effects to local dinucleotide context and motif presence. The table below summarizes each feature type, its description, and the variants computed.

Feature type Description Variants computed
Position-specific nucleotide scores Each base within the ASO window was assigned a weight from the H-SPA-derived PSSM, and the average score (normalized by window length) was calculated. This feature reflects how specific nucleotides at defined positions favor or disfavor RNase H1 cleavage. 3 experimental sets (R4a, R4b, R7) × 2 weight types (log2(FC), \( k_{\text{rel}} \))
Dinucleotide scores Similar to the single-nucleotide PSSMs but applied to adjacent base pairs. Each dinucleotide within the ASO window was assigned a weight from the H-SPA-derived PSSM, and the average score (normalized by the number of dinucleotides) was calculated. This feature captures local sequence context effects that single-base models cannot fully explain. 3 × 2 (as above)
Motif presence Binary indicators (0/1) marking whether an ASO contains specific motifs identified in H-SPA. Motifs such as TCCC or GGGA were associated with higher cleavage efficiency, whereas other motifs were associated with reduced cleavage. Independent of weight sets

Our Window Optimization Strategy

While the original article used fixed windows (14–16 nt for nucleotide weights and ~13 nt for dinucleotide weights), our dataset contained ASOs of variable lengths. To adapt the scoring system, we developed a correlation-optimized windowing strategy (see Figure 9):

  • For each ASO length L, all possible placements of the H-SPA scoring window along the ASO were evaluated.
  • Feature values were computed for each placement and correlated with experimental inhibition data.
  • The window placement that maximized correlation for a given ASO length was selected and used consistently for scoring across all ASOs of that length.

This approach ensured that each ASO was scored in a way that best reflected the experimentally relevant determinants of RNase H1 cleavage, rather than assuming a single fixed central window.

Correlation strength per evaluated window
Figure 9: Illustration of our window optimization procedure. All possible window placements along an ASO are evaluated, feature values are computed and correlated with inhibition, and the window with the strongest correlation is selected. The example illustrates a single ASO length, but in practice the procedure was applied independently to all lengths.
RNase H1 Off-Target Feature

In addition to sequence-intrinsic determinants of RNase H1 cleavage, we considered the possibility that off-target binding to transcripts encoding RNASEH1 itself could compromise ASO efficacy. Because ASO activity relies on recruitment of RNase H1 to the intended RNA target, knockdown of the enzyme on unintended transcripts represents a form of “inhibition of the inhibition”. Even modest levels of such off-target binding may therefore have disproportionate functional impact.

To capture this effect, we introduced a feature quantifying the RNASEH1 off-target burden of each ASO. The scoring framework was rooted in statistical mechanics: we enumerated all predicted binding sites across the transcriptome and summed their Boltzmann-weighted contributions to the partition function, effectively calculating a statistical-mechanics–derived Z value for each ASO (energies from RISearch [22]). Sites mapping specifically to RNASEH1 transcripts were extracted to construct an off-target score reflecting the potential for enzyme knockdown.

Off-Target Partition Function Scoring

For each ASO, all predicted RNA binding sites were retrieved using RISearch [22]. Each interaction was assigned an energy value \( \Delta G \), and its contribution to the partition function was computed as:

$$ Z = \sum_{i} e^{-\Delta G_{i}/RT} $$

Where \( RT = 0.616 \, \text{kcal/mol} \) at 310 K, and \( Z \) is the sum over all candidate off-target sites. The subset of contributions mapping specifically to RNASEH1 transcripts was defined as the RNase H1 off-target score.

Across the full dataset, this feature showed negative correlations with ASO inhibition efficiency (Figures 10-12), consistent with the expected biology-ASOs predicted to engage RNASEH1 transcripts tended to be less effective against their intended targets.

Correlations with ASO inhibition efficiency table
Figure 10: Correlations with ASO inhibition efficiency table. The table summarizes the correlation values obtained between different computed features and experimental ASO inhibition efficiency across all datasets.
Pearson correlation between ASO off-target score and inhibition
Figure 11: Pearson correlation between ASO off-target score with RNaseH1 and ASO inhibition across varying cutoff values (minimum interaction score required for inclusion). Circles indicate correlations computed over the full dataset, whereas X symbols represent correlations after excluding zero-valued cases to focus on informative instances.
Spearman correlation between ASO off-target score and inhibition
Figure 12: Spearman correlation between ASO off-target to RNaseH1 score and ASO inhibition across varying cutoff values (minimum interaction score required for inclusion). Circles indicate correlations computed over the full dataset, whereas X symbols represent correlations after excluding zero-valued cases to focus on informative instances.

When stratifying ASOs by chemistry and length classes, however, more nuanced patterns emerged:

  • Several modification/length groups displayed correlations in the expected negative direction.
  • A smaller subset of groups showed weak or even opposite correlations, suggesting context-dependent compensatory effects.
  • Not all groups reached statistical significance, but in multiple cases the association was strong enough to support the biological plausibility of RNase H1 off-target sequestration.

A summary of these group-wise results is provided in Figures 13, 14, highlighting both consistent trends and exceptional cases.

Summary table of group-wise results
Figure 13: Summary table of group-wise results.
Spearman correlation between off-target score and inhibition across chemical modifications and lengths
Figure 14: Spearman correlation between off-target score (cutoff 0) with RNaseH1 and inhibition across different chemical modifications and sequence lengths.

RNase H1 Expression Feature

In addition to sequence-derived preferences and off-target risks, the cellular abundance of RNase H1 itself can influence ASO potency. Because RNase H1 is the enzyme responsible for catalyzing cleavage of RNA in ASO–mRNA hybrids, limited enzyme availability may restrict inhibition, whereas higher expression levels may enhance knockdown efficiency.

To capture this context-dependent factor, we extracted $log(\text{TPM}+1)$ expression values for RNASEH1 across relevant cell lines from the OmicsExpressionProteinCodingGenesTPMLogp1 dataset. Cell line identifiers were harmonized using Achilles IDs, and values were assigned on a per-cell-line basis as a continuous feature (RNASEH1_expression).

Preliminary correlation analyses indicated that RNase H1 expression exhibits variable but biologically plausible associations with measured inhibition. Several cell lines showed the expected positive relationship between higher RNASEH1 expression and stronger ASO activity, consistent with enzyme-limiting behavior.


Conclusion

By integrating experimentally derived PSSMs with our correlation-optimized windowing approach, we translated the sequence preferences of RNase H1 into robust computational features. These features capture both single-base and dinucleotide contributions, as well as motif-level effects, providing a biologically grounded representation of cleavage propensity.

In parallel, we introduced an additional feature quantifying the potential sequestration of RNase H1 through unintended binding to its own transcripts. By embedding this off-target burden into a statistical-mechanics framework, we accounted for the possibility that enzyme availability itself may be compromised by ASO binding. Together, these complementary perspectives- intrinsic sequence determinants of cleavage and extrinsic risks of enzyme sequestration- contribute to a more accurate prediction of ASO activity and highlight the central role of RNase H1 in determining therapeutic efficiency.


RNA Folding

Relevance of RNA Folding for Target Accessibility

RNA folding determines the secondary structure of transcripts, which strongly influences their accessibility for binding by ASOs. Regions that are base-paired or highly structured are less accessible, whereas unpaired or single-stranded regions are more available for hybridization. Therefore, assessment of RNA folding is considered critical in the selection of ASO target sites. In general, sequences predicted to be accessible are more likely to support efficient binding and functional inhibition (see Figure 15). Multiple studies have demonstrated that improved target accessibility can significantly enhance knockdown efficacy [28][29].

Relevance of RNA Folding for Target Accessibility Illustration
Figure 15: Relevance of RNA Folding for Target Accessibility illustration. The schematic demonstrates how RNA secondary structure influences ASO binding, where accessible single-stranded regions serve as favorable hybridization sites, while highly structured or base-paired regions reduce binding efficiency.

Computed Folding Features

To capture local structural stability within each target region, we computed folding-related features using the ViennaRNA package [30], which estimates RNA secondary structure and calculates the minimum free energy (MFE) through dynamic programming algorithms. For each target RNA, we applied a custom function calculate_avg_mfe_over_sense_region that scans the sequence with a sliding window to evaluate the average local MFE per nucleotide. Specifically, windows of 40 nucleotides with a 15-nt step were used. The MFE of each window was normalized per nucleotide, and the average across overlapping regions was computed to yield the energy landscape. The folding features were optimized by systematically varying the window and step sizes to maximize correlation with the inhibition values. Step sizes ranging from 1 to 10 and window sizes ranging from 20 to 70 nucleotides were examined.
For each ASO, the sense region was further extended upstream and downstream using the get_sense_with_flanks function to account for contextual folding effects around the target site. The final average MFE feature thus reflects the mean energy of the entire flanked sense region.
More negative MFE values correspond to stronger RNA folding (i.e., more stable local secondary structures), while higher values indicate more open conformations (see Figure 16).

Relationship between local RNA folding and ASO inhibition efficiency
Figure 16: Relationship between local RNA folding and ASO inhibition efficiency. Data points represent binned averages of ASOs grouped by folding energy, with shaded regions indicating the ±95% confidence interval. The average minimum free energy (MFE) was calculated using ViennaRNA with a 40-nt window and 15-nt step, over the sense region versus the logarithm of ASO inhibition. Spearman’s correlation observed – ρ = 0.21, p ≈ 1.5 × 10⁻²³² (N = 23,519). The schematic below illustrates this trend – as the mRNA becomes less folded, the ASO can more easily hybridize to its target site.

Correlation analysis revealed that ASOs targeting regions with less stable (less negative) MFE values tend to exhibit higher inhibition efficiency (Spearman’s ρ = 0.21, p ≈ 1.5×10⁻²³²), supporting the biological interpretation that loosely folded regions are more accessible for ASO binding.
In parallel, average local accessibility was calculated using a sliding-window approach implemented in the Raccess software [31] (also referred to as RNAaccess). This algorithm estimates the probability of each nucleotide being unpaired, and thus accessible for hybridization with the ASO.

For this purpose, we utilized a custom Python function named AccessCalculator, developed by Yehuda Landau, a member of the Tuller Lab. This function receives several key parameters:
(1) seq – the nucleotide sequence for which accessibility is to be computed.
(2) access_size – the length of the target region under evaluation.
(3) in_gc, max_gc – optional constraints controlling the GC content range.
(4) access_win_size – defines the number of nucleotides to extend upstream and downstream of the target for context-based folding estimation.
(5) seed_sizes – determines the probability that each seed region of a given size remains unpaired.

The schematic representation in Figure 17 illustrates this process: the mRNA trigger sequence is scanned by overlapping windows, local folding probabilities are computed within each window, and accessibility values are aggregated across the binding segment. This visualization helps to understand how the moving-window algorithm captures the local structural context around ASO target sites, providing a probabilistic estimate of nucleotide unpairing and accessibility.

Algorithm for segment accessibility approximation
Figure 17: Algorithm for segment accessibility approximation [32] – schematic representation of the moving-window approach used by the AccessCalculator function to estimate RNA accessibility along an mRNA sequence. The mRNA trigger sequence is scanned by a series of overlapping windows (defined by the Moving Window Size and Moving Window Index). Within each window, local folding probabilities are computed, and accessibility values are aggregated across the trigger binding segment (access_size). This process captures the local structural context surrounding the ASO binding site, providing a probabilistic estimate of nucleotide unpairing and accessibility.

The output of this computation is a probabilistic accessibility score per nucleotide, representing the likelihood of being unpaired. We then applied the compute_sense_accessibility function to aggregate these data into a single numerical value per ASO, the mean accessibility across all nucleotides within its binding region.
A higher computed value indicates that the target region is more structurally closed and thus less accessible to ASO binding. Due to the computational intensity of the analysis, the process was parallelized and executed in batches before being merged into the final dataset.

When correlating the resulting accessibility values with experimental log(inhibition) data, a negative Spearman correlation (ρ = –0.14, p = 3.02×10⁻¹¹⁹) was observed, supporting the intuitive relationship that more open regions tend to be more effectively targeted by ASOs (see Figure 18).

Relationship between ASO accessibility and inhibition efficiency
Figure 18: Each blue point represents an individual ASO, red points represent accessibility bins to visualize the overall trend. The x-axis denotes the average local accessibility of the ASO binding region, computed using the AccessCalculator function based on Raccess folding probabilities, while the y-axis shows the experimental log(inhibition) values. A mild but significant negative correlation was observed (Spearman ρ = –0.13, p ≈ 8.06 × 10⁻¹³⁸, N = 23,519), indicating that ASOs targeting more open (highly accessible) regions tend to achieve stronger inhibition effects.


In addition to the average MFE \ accessibility features, two complementary folding-related features were examined:
Mean MFE values at the sequence edges - calculated using the function calculate_mfe_over_edges_sense_region, which evaluates the mean folding energy of the first and last four nucleotides of each target. This feature was motivated by the hypothesis that the terminal regions may play a key role in local opening and closing dynamics, thereby influencing ASO accessibility. The observed correlation with inhibition was slightly lower but comparable to that of the mean MFE feature.

RNA target sequence highlighting terminal regions for mean MFE calculation
Figure 19 RNA target sequence showing the first and last four nucleotides in red, representing the sequence edges used to calculate mean MFE. The remaining central nucleotides are shown in pink. This highlights the contribution of terminal regions to local folding and ASO accessibility.

Minimum average MFE across the region - computed using the function calculate_min_mfe_over_sense_region, designed to capture the most stable local substructure within each ASO target. This approach was based on the rationale that the strongest folded subregion may dominate the accessibility landscape, and that averaging across the entire region could obscure such effects. However, in practice, this feature also did not show a substantially higher correlation with inhibition compared to the mean MFE metric.

RNA target sequence showing minimum average MFE subregion
Figure 20: RNA target sequence with the most stable local subregion highlighted in green, representing the minimum average MFE region. The remaining nucleotides are shown in pink, illustrating that the strongest folded substructure may dominate local accessibility for ASO binding.

Both features were ultimately examined in the feature selection step of the model to allow data-driven evaluation of their predictive contribution relative to other structural and thermodynamic parameters.


Sequence Based Features

Introduction

The therapeutic potential of Antisense oligonucleotides depends not only on chemical modifications or protein interactions, but also on the intrinsic sequence properties of the oligonucleotide itself. The sequence dictates how the ASO folds, how it interacts with RNA and proteins, and whether it harbors repetitive or toxic motifs that might compromise safety or efficacy. While chemical design can be adjusted, the underlying sequence composition remains a fundamental determinant of ASO performance.

To capture these properties systematically, we developed a set of sequence-derived features. These features quantify base composition, structural tendencies within the ASO itself, repetitiveness, and the presence of known toxic motifs. By doing so, we transform qualitative biological intuitions into quantitative predictors that can be integrated into machine learning models for ASO optimization.

Super-ASO structure

Biological Insight

Unlike the folding of target RNA, which determines whether a binding site is open or blocked, the sequence of the ASO itself encodes its own structural and functional tendencies. Before any chemical modification is added, the letters of an ASO already shape its behavior: whether it folds onto itself, remains flexible, or carries hidden motifs that influence safety and efficacy.

One crucial aspect is self-structure formation. Certain ASO sequences can fold back on themselves, forming intramolecular hairpins or even G-quadruplexes. Palindromic regions, long homopolymeric runs, or stretches of consecutive guanines make these conformations particularly likely. While stable, such self-structures act as competitors: instead of engaging the intended mRNA, the ASO becomes trapped in its own secondary structure, reducing potency and sometimes increasing toxicity [33][34].

Base composition also exerts a strong influence. GC-rich ASOs behave like rigid beams, forming strong duplexes but often at the cost of reduced flexibility and higher risk of aggregation. AT-rich sequences, in contrast, are looser and more flexible, which may help them adapt structurally but can also reduce binding stability. The composition can be summarized by GC content (overall G+C fraction) and nucleotide skews (AT and GC skew measuring strand asymmetry); together, these capture the trade-off between strength and adaptability [35] .

Another layer of sequence biology lies in repetitiveness and motifs. Long G-runs are notorious for forming G-quadruplexes, which hinder target binding and have been linked to adverse outcomes in preclinical studies [36] . Tandem or dispersed repeats decrease sequence uniqueness, increasing the chance of binding to unintended transcripts. Beyond structural motifs, toxic sequence patterns such as $\text{UGU}$, $\text{TGGT}$, or $\text{GGGT}$ have been correlated with hepatotoxicity and immune activation, highlighting the importance of screening them computationally before candidate selection[37] .

Entropy and diversity are subtler but equally important measures. Sequences with balanced distributions of nucleotides and dinucleotides have higher entropy, making them less prone to repetitive structures and more likely to remain accessible for hybridization. In contrast, low-entropy sequences are more predictable, more repetitive and often less effective. Importantly, studies have shown that the local sequence context itself strongly influences the thermodynamic properties and hybridization efficiency of ASOs, with even single modifications exerting sequence-dependent effects [38] . Finally, flexibility is encoded in the local arrangement of bases. Dinucleotides such as AT and TA introduce weaker stacking interactions, functioning like hinges within the oligonucleotide. These flexible points allow the ASO to adapt dynamically when pairing with RNA, potentially enhancing recognition and supporting RNase H1 activity[39] . Together, these properties for a hidden biological language of ASO sequences - a language that encodes stability, flexibility, safety, and efficacy. By decoding it through computational features, we can anticipate which sequences are likely to succeed as therapeutic candidates, long before they are tested in the lab.

From Biological Insight to Computational Features

To transform the biological properties of ASO sequences into measurable descriptors, we defined several feature families. Each feature captures one aspect of the sequence and is normalized by ASO length, ensuring fair comparison across oligonucleotides of different sizes.

1. Secondary Structure Features

These features measure the tendency of an ASO to fold back on itself and form intramolecular structures that may reduce its ability to bind the target mRNA.

  • Hairpin structure and ΔG: Predicts whether the ASO can form intramolecular hairpins and estimates their stability. A hairpin forms when complementary regions within the sequence base-pair, creating a stem-loop. The free energy (ΔG) quantifies how stable this fold is: more negative values indicate stronger self-folding, which competes with target binding.
  • Palindromic fraction: Measures the proportion of subsequences that are reverse complements of one another. Palindromic regions are strong candidates for forming hairpins or stem-loop structures, thereby increasing the likelihood of self-interaction
  • Homooligo count: Detects long runs of identical nucleotides (e.g. AAAA, CCCC). These stretches promote stacking interactions or unusual local structures, both of which can reduce hybridization efficiency with the intended RNA.
  • GC block length: Identifies the longest consecutive stretch of G or C nucleotides. GC blocks form especially stable secondary structures due to strong hydrogen bonding and stacking interactions, which may increase rigidity and limit flexibility of the ASO.

2. Composition Features

These reflect the overall nucleotide balance within the ASO, which directly impacts stability, flexibility and immunogenicity.

  • $\text{GC}$ content: The fraction of guanine and cytosine bases in the sequence, defined as:
    \( \mathrm{\text{GC content}} = \dfrac{G + C}{G + C + A + T} \)
    Where $\text{G,C,A}$ and $\text{T}$ denote the counts of the corresponding nucleotides. Higher $\text{GC}$ increases thermodynamic stability but reduces flexibility. Local $\text{GC}$ at the 3′ end is especially relevant for duplex formation.
  • AT skew: Measures imbalance between adenine and thymine, defined as:
    \( \mathrm{\text{AT skew}} = \dfrac{A - T}{A + T} \)
    Positive values indicate enrichment of adenine, while negative values indicate enrichment of thymine.
  • GC skew: Measures imbalance between guanine and cytosine, defined as:
    \( \mathrm{\text{GC skew}} = \dfrac{G - C}{G + C} \)
    This skew reflects strand asymmetry and may indicate local preferences in base pairing and folding.
  • Purine content: The fraction of purine in the sequence. Purines(AG) are bulkier than pyrimidines(CT), and high purine content influences stacking interactions and helical geometry.
  • Poly-pyrimidine stretches: Identifies the longest consecutive run of cytosine or thymine. Such stretches may promote unusual conformations and affect protein recognition.
  • CG dinucleotide fraction: The relative frequency of CG dinucleotides in the sequence, defined as:
    \( \mathrm{\text{CpG}} = \dfrac{\text{count of CG dinucleotides}}{\text{total dinucleotides}} \)
    This metric provides a standardizes way to capture enrichment of CpG motifs across ASOs of different lengths.

3. Repetitiveness and Motifs

These capture sequence patterns that may compromise specificity or safety of the ASO.

  • G-run count: Detects long runs of guanine bases. These regions can fold into G-quadruplexes, highly stable four-stranded structures that hinder hybridization to target mRNA strand.
  • Tandem repeats score: Identifies short motifs repeated consecutively (e.g $\text{CACACA}$). Such repeats include sequence uniqueness and increase risk of off-target binding.
  • Dispersed repeats score: Measures repeated motifs that appear in multiple positions along the ASO. These dispersed repeats create redundancy and may compromise target specificity.
  • Toxic motif count: Flags presence of motifs associated with toxicity ($\text{TGT}$, $\text{TGGT}$, $\text{GGGT}$). Their occurrence has been linked to immune activation or hepatotoxicity in vivo.

4. Entropy and Diversityy

These features quantify the complexity of the ASO sequence. High complexity usually improves specificity, while low complexity reflects bias or repetitiveness.

  • Nucleotide entropy: Shannon entropy based on the distribution of $\text{A,C,G}$ and $\text{T}$ nucleotides, defined as:
    \( \mathrm{H_{\text{ASO}_{\text{nuc}}}} = -\sum_{n \in \{A, C, G, T\}} P_n \cdot \log_2(P_n) \)
    Where $P_n$ is the frequency of base $n$. Higher entropy indicates a balanced nucleotide composition.
  • Dinucleotide entropy: Extending Shannon's entropy to adjacent pairs, defined as:
    \( \mathrm{H_{\text{ASO,dinuc}}} = -\sum_{d \in \{AA, AT, TT, GC, \dots\}} P_d \cdot \log_2(P_d) \)

    where \( P_d \) is the frequency of each dinucleotide. Low values indicate bias toward repetitive dinucleotide patterns.

  • Nucleotide diversity: Measures the proportion of distinct dinucleotide types present out of the 16 possible. A higher diversity score corresponds to a greater variability and less redundancy.

5. Additional Features (Single-Feature Categories)

These features complement the core families by highlighting structural adaptability and nonlinear effects

  • Flexibility ($\text{AT}/\text{TA}$ fraction): $\text{AT}$ and $\text{TA}$ dinucleotides stack weakly compared to $\text{GC}$, creating "hinges" in the backbone. A higher fraction of these dinucleotides implies greater conformational flexibility during hybridization.
  • Interactions: Some features influence one another in a non-linear way. For example, high $\text{GC}$ content may only reduce efficacy if combined with strong self-structure. Interaction terms (e.g. $\text{GC}$ content $\times$ hairpin $\Delta G$ capture such combined effects, enabling more accurate modeling.
ASO sequence feature families
Figure 21: ASO sequences are translated into feature families that capture structure, composition, repetitiveness, and diversity for computational analysis.

Through this framework, each ASO is mapped into a high-dimensional feature space that encodes its structural, compositional and functional tendencies. This systematic representation allows us to compare, filter and optimize candidates with precision.

Application

By converting sequence properties into numerical descriptors, each ASO is represented by a multidimensional profile that highlights its inherent strengths and weaknesses. Self-folding signatures, toxic motifs, $\text{GC}$/$\text{AT}$ balance, entropy, and flexibility are all captured in a way that allows us to compare candidates objectively.


Codon Usage Bias

Introduction

Codon usage bias (CUB) describes the unequal use of synonymous codons in protein-coding regions. Although multiple codons can encode the same amino acid, organisms tend to prefer certain codons over others due to evolutionary constraints, tRNA availability, and the need to optimize translation efficiency. These preferences are not random: they influence ribosome speed, pausing, and local RNA structure [40][41][42].

Comparison of codon usage bias in the MTAP coding sequence and the human codon usage table
Figure 22: Comparison of codon usage bias between the MTAP coding sequence and the human codon usage table. Bars represent the relative adaptiveness of individual codons within each amino acid family, with blue corresponding to MTAP and magenta to the average human codon preferences. The analysis highlights families where MTAP closely mirrors human codon bias (e.g., Pro, Thr, Val) and others showing notable deviations (e.g., Arg, Ser).

For ASOs, this context is highly relevant. Sites in regions of strong codon optimization may be structurally constrained, while sites enriched in rare codon may represent more accessible binding opportunities. ASO-target interactions are therefore determined not only by sequence complementarity but also by the biological translation context, which can modulate ASO activity through effects on RNase H1 access and binding [43].

By capturing CUB-related properties as quantitative features, we create a systematic way to represent translation-related biases in our predictive models. This helps bridge the gap between sequence-level biology and ASO performance, ensuring that design decisions incorporate both molecular complementarity and translational context.

Super-ASO structure

Feature Definitions

The following features were selected to capture different aspects of codon usage bias and sequence complementarity. Each of them reflects a distinct biological property that can influence ASO binding and activity. Together, they form a complementary set of descriptors that quantify both codon-level translation dynamics and ASO-target hybridization potential.

ENC (Effective Number of Codons)

ENC is one of the earliest and most widely used measures of codon usage bias. It provides a simple yet informative way to quantify how restricted a sequence is in its choice of synonymous codons. Highly expressed genes often show stronger bias toward "preferred" codons, while lowly expressed or less optimized genes tend to use codons more uniformly [40]

Definition: for each amino acid $\text{AA}$ with $k$ codons the probability that two random codons are identical:

\( \mathrm{F_\text{AA}} = p_1^2 +p_2^2 +\dots+p_k^2 \)
where $p_i$ is the frequency that we see that codon $i$ encoding the amino acid $\text{AA}$. The average $F$ is then calculated for amino acids according to their number of synonymous codons as follows:
  • $F_2$: mean $F_\text{AA}$ over all amino acids with 2 codons (there are 9 such amino acids).
  • $F_3$: mean $F_\text{AA}$ over all amino acids with 3 codons (there is 1 such amino acids).
  • $F_4$: mean $F_\text{AA}$ over all amino acids with 4 codons (there are 5 such amino acids).
  • $F_6$: mean $F_\text{AA}$ over all amino acids with 6 codons (there are 3 such amino acids).

In the last step, compute the ENC score:

\( \text{ENC} = 2 + \dfrac{9}{F_2} + \dfrac{1}{F_3} + \dfrac{5}{F_4} + \dfrac{3}{F_6} \)

Codon usage bias ENC comparison
Figure 23: The upper sequence illustrates a low ENC score with minimal codon variety, while the lower sequence illustrates a high ENC score with diverse codon usage.
Biological Interpretation:
  • Low ENC (close to 20) → strong codon bias, highly optimized translation [40].
  • High ENC (close to 61) → weak codon bias, less optimized translation [40].
Relation to ASOs

For ASO design, ENC highlights how translation efficiency may influence accessibility.

  • Low ENC (optimized codon usage) → rapid ribosome progression, ribosome crowding → reduced ASO accessibility
  • High ENC (diverse codon usage, including rare codons) → slower ribosome progression increased ASO accessibility
CAI (Codon Adaptation Index)

CAI was introduced as a way to measure how well a gene's codon usage matches that of a reference set of highly expressed genes. Because translation is shaped by evolutionary optimization, codons found in ribosomal and housekeeping genes are often considered "optimal". Comparing local sequence usage with this reference gives insight into how efficiently the region may be translated [41].

Definition: First, calculate the Relative Synonymous Codon Usage(RSCU) parameter for each codon as follows:

\( \text{RSCU}_i = \dfrac{X_i}{\dfrac{1}{n}\displaystyle\sum_{i=1}^n X_i} \)
where $n$ is the number of synonymous codons ($1\le n\le6$) for the amino acid under study (that codon $i$ encodes), and $X_i$ is the number of occurrences of codon $i$ in the reference set. In the second step, calculate the relative adaptiveness of a codon ($w_i$) as follows:
\( w_i = \dfrac{\text{RSCU}_i}{\text{RSCU}_\text{max}} \)

where $\text{RSCU}_\text{max}$ is the RSCU value for the most frequently used codon for an amino acid. In the final step, calculate the CAI value for a gene as the geometric mean of $w_i$ values for all the codons used in that gene:
\( \mathrm{CAI} = \left( \displaystyle\prod_{i=1}^{L} w_i \right)^{\tfrac{1}{L}} \)
where $L$ is the number of codons in the ORF.
Diagram explaining CAI
Figure 24: Codon usage frequencies from an illustrative reference set, where each codon is shown with its relative occurrence frequency. To complement the reference, two example sequences are included to demonstrate how codon choice influences CAI scoring: the left sequence is enriched in rare codons and represents a low CAI score, while the right sequence is enriched in frequent codons and represents a high CAI score.
Biological Interpretation:

  • High CAI → codons optimized for efficient translation.
  • Low CAI → codons deviating from optimal patterns, associated with reduced efficiency [41].

Relation to ASOs

In the context of ASOs, CAI reflects whether ribosomes are frequently present at the binding site.

  • High CAI → fast translation with frequent ribosome occupancy → site less accessible for ASO binding.
  • Low CAI → reduced translation efficiency, occasional ribosome slowdown → greater opportunity for ASO binding.

tAI (Translational Adaptation Index)

tAI extends the logic of codon adaptation by considering the actual tRNA pool of the organism. Instead of relying solely on reference gene sets, it incorporates tRNA gene copy numbers and wobble base-pairing rules. In this way, tAI reflects the translational "supply side": how available each codon's decoding machinery is within the cell [42].

Definition: In the first step, a weight is assigned to each codon, reflecting the abundance of the tRNAs that recognize it, and is calculated as follows:

\( W_i = \displaystyle\sum_{j=1}^{m} (1 - S_{ij})\, \text{tCGN}_{ij} \)
Where $W_i$ is the value for codon $i$, $S_{ij}$ is the affinity of the interaction between codon $i$ and tRNA $j$, the sum is over all relevant tRNA molecules that recognize the codon, and $\text{tCGN}_ij$ is the level of tRNA $j$ that recognizes codon $i$.

After calculating the weights $W_i$, the next step is to normalize them as follows:

\[ w_i = \begin{cases} \dfrac{W_i}{\max(W_j)}, & W_i \neq 0 \\[1em] \operatorname{mean}(w_j), & \text{otherwise} \end{cases} \]

In the final step, calculate the tAI of a coding region (a gene or a part of it) as the geometric mean over the $w_i$ of its codons:

\[ \text{tAI} = \left( \displaystyle\prod_{i=1}^{L} w_i \right)^{\tfrac{1}{L}} \]

Where $L$ is the number of codons on the sequence.
Diagram explaining tAI
Figure 25: Illustration of how the tAI score reflects codon adaptation to the cellular tRNA pool. Codon-specific weights are determined by the availability of tRNAs that recognize each codon. The figure contrasts a sequence composed mainly of codons recognized by abundant tRNAs (high tAI score) with a sequence containing codons decoded by rare tRNAs (low tAI score).
Biological Interpretation

  • High tAI → codons well-supported by abundant tRNAs → faster translation
  • Low tAI → reliance on rare tRNAs → slower, less efficient translation[42].

Relation to ASOs

For ASO design, tAI provides a direct link between tRNA availability and local ribosome speed.

  • High tAI → abundant tRNAs, smooth translation → shorter time window for ASO binding
  • Low tAI → rare tRNAs, ribosome stalling → greater ASO accessibility

Chimera ARS

Unlike ENC, CAI and tAI, which focus on codon usage, Chimera captures the sequence-level match between the ASO and its potential target sites. This feature is based on the Chimera ARS framework [44]. It indirectly measures hybridization potential by quantifying subsequences overlaps between the ASO and transcript regions.

Definition:

First, choose your reference set and the coding sequence. Then, for each position $i$ in the coding sequence $S$ find the longest sub-sequence $S_i$ that starts in that position, and also appears at the reference set. Afterward, let $|S|$ denote the length of a sequence $S$; the ChimeraARS of $S$ is the mean length of all subsequences $S_i$:
\[ \mathrm{ChimeraARS}(S) = \frac{1}{|S|} \sum_{i=1}^{|S|} |S_i| \]
Diagram explaining ChimeraARS
Figure 26: Demonstration of how the Chimera score reflects sequence complementarity relative to a reference. The reference sequence is shown at the top, and two example sequences are aligned below it. The sequence on the left yields a high Chimera score due to multiple overlapping matches, whereas the sequence on the right yields a low score with only a few partial matches.

Relation to ASOs

Chimera captures the sequence similarity between the ASO and mRNA windows across the transcriptome, serving a dual-purpose metric:

  • On-target: High similarity at the intended site → strong complementarity and efficient RNase H1 recruitment
  • Off-target: High similarity with other mRNA regions → risk of unintended binding and off-target effects

Windows Generation

To evaluate codon usage bias and hybridization features in a biologically relevant context, we applied a window-based approach centered on the ASO binding site. This reflects the fact that ribosomes, RNA-binding proteins and RNA structures operate on regions rather than isolated nucleotides.

  • Window sizes: Symmetrical windows of $\pm 20,30,40,50,60$ and $70$ nucleotides were defined around each ASO binding site. These sizes capture both very local influences ($\pm 20$ nt, close to ribosome footprint) and broader contexts (up to $\pm70$ nt) that may affect translation and RNA folding.
  • Multiple contexts:
    • Local windows represent the immediate sequence environment around the ASO site.
    • Global values (e.g., full CDS) represent the overall codon usage of the transcript.
    • ASO-level values capture intrinsic properties of the ASO sequence itself.
  • Clipping: When windows extend beyond transcript boundaries, they were trimmed so that no artificial nucleotides were added.

Diagram explaining sequence window definitions
Figure 27: Illustration of how local sequence windows were defined for feature calculation. Shown here are examples of windows of ±20, ±30, and ±40 nucleotides around the ASO binding site. In practice, additional window sizes (±50, ±60, and ±70) were also applied to capture broader sequence contexts.

How Each Feature is Calculated

Each feature was computed using algorithms adapted from the original literature, applied consistently across all contexts (ASO, local windows, global transcript).

ENC (Effective Number of Codons)

  • Codon frequencies were calculated within each window.
  • ENC values were derived using the standard formula.
  • Applied to CDS windows, full CDS, and ASO sequence

CAI (Codon Adaptation Index)

  • Codon weights were derived from a reference set of housekeeping and ribosomal genes
  • For each window or CDS, codon frequencies were compared against these weights to produce a CAI score

tAI (Translational Adaptation Index)

  • Codon weights were derived from human tRNA gene copy numbers, adjusted for wobble base-pairing rules.
  • Scores were calculated for each window, full CDS and ASO sequence.

Chimera Similarity

  • For each CDS or pre-mRNA window, we quantified the overlap between ASO subsequences and transcript subsequences.
  • Values were normalized by ASO length to allow comparisons across ASOs.

General Note:

All calculations were performed consistently across $\pm20-70$ nt windows, global CDS and ASO sequence where applicable. This ensures comparability between features and prevents context-specific bias.

Summary Table & General Summary

Summary of the context in which each feature was calculated. While the previous sections provide detailed definitions and illustrative figures for ENC, CAI, tAI and Chimera, the table below offers a concise overview of the calculation scope across ASO level, local windows, full CDS, and pre-mRNA contexts.
Feature CDS windows pre-mRNA windows Global (full CDS) ASO sequence
ENC
tAI
CAI
Chimera
feature calculated in this context
feature not calculated in this context

Conclusion

CUB features capture how codon usage shapes the translational and structural environment around ASO binding sites. ENC, CAI and tAI measure bias and adaptation, while Chimera similarity directly quantifies sequence complementarity. By combining local windows, global transcript measures and ASO-level descriptors these features create a comprehensive view of both biological context and hybridization potential.


Modifications

Introduction

When designing ASOs, chemical modifications play a central role, since they enable the molecules to remain stable in biological environments, reach their target sites, and maintain a therapeutic activity. At the same time, the type and distribution of these modifications can strongly influence potency, specificity and safety, which makes them an essential aspect to capture in our feature set.

The therapeutic potential of ASOs is closely linked to their chemical design. Unmodified DNA oligos are inherently unstable in biological fluids and undergo rapid degradation by nucleases, leading to short half-life and poor bioavailability. To overcome these limitations, a broad set of chemical modifications have been introduced at the backbone, sugar or nucleobase level. These modifications improve stability, enhance RNA binding affinity, reduce immunogenicity, and modulate pharmacokinetics [45].

One of the earliest and most widely adopted modifications in therapeutic ASOs is the phosphorothioate (PS) backbone. In this design, one of the non-bridging oxygen atoms in the phosphate group is replaced with sulfur. This subtle change dramatically increases resistance to nuclease degradation, which is essential for maintaining stability in biological fluids. In addition, PS linkages promote interactions with plasma and cellular proteins, improving systemic distribution and cellular uptake. However, these benefits come at a cost: PS linkages can reduce binding affinity for the RNA target and are associated with immune-related toxicities, such as complement activation or platelet effects [46],[47]. Thus, while PS chemistry forms the backbone of almost all therapeutic ASOs, it also illustrates the balance between improved pharmacokinetics and potential safety liabilities.

Sugar modifications represent another major class of chemical strategies. Substitutions at the 2' position of the ribose, such as 2'-O-methyl(2'-OMe) and 2'-O-methoxyethyl(2'-MOE) enhance nuclease resistance and improve binding affinity toward the target RNA target. These modifications also reduce immune recognition, which is critical for tolerability in vivo. Among sugar chemistries, Locked Nucleic Acids(LNA) are particularly notable: they contain a methylene bridge that locks the ribose into a rigid conformation, leading to very high affinity for complementary RNA and a significant increase in thermal stability of the ASO-RNA duplex [46],[48]. This enhanced potency, however, has to be carefully balanced, as high densities of LNA residues have been linked with hepatotoxicity.

A particularly effective strategy that combines different chemistries is the gapmer architecture. Gapmers consist of a central stretch of DNA nucleotides flanked on both sides by chemically modified nucleotides such as MOE or LNA. The flanking regions provide nuclease resistance and stabilize the oligonucleotide, which the central "gap" is recognized by RNase H1, enabling catalytic cleavage of the target RNA. This dual functionality makes gapmers highly versatile: they retain the stability conferred by chemical modifications but also harness the enzymatic machinery of the cell for efficient RNA degradation [49]. Gapmers therefore represent a cornerstone design in antisense therapeutics, bridging chemical stability with biological activity.

Other ASO designs, such as mixmers or fully modified steric blockers, also rely on chemical modifications, but gapmers remain the most widely used therapeutic architecture and best illustrate how modification patterns translate into functional features [50].

Diagram illustrating common ASO chemical modifications
Figure 28: Illustration of common ASO chemical modifications. Top: Gapmer with MOE-modified flanks. Middle: Fully phosphorothioated (PS) backbone. Bottom: Mixmer LNA-modified ASO.
Overall, chemical modifications define the therapeutic window of ASOs. They enhance potency, stability and biodistribution, but they may also introduce toxicities if modification density is too high, if distribution is uneven, or if repetitive motifs are present. Understanding these trade-offs is essential for effective ASO design [45],[46],[47],[48],[49].

From Biological Insights to Computational Features

Experimental studies have shown that the biological activity of ASOs depends not only on whether chemical modifications are present but also on how they are arranged across the sequence [45],[46],[47],[48],[49]. Parameters such as modification density, clustering, symmetry, or repetitiveness have all been linked to changes in stability, potency and toxicity.

To systematically capture these principles, we developed a set of features designed to reflect concepts highlighted in literature: for example, terminal runs to represent nuclease protection, block length to represent clustering, or entropy to represent diversity.

By doing so, we created a framework where each ASO sequence can be mapped onto a numerical profile of its chemical modification architecture. This bridge between biology and computation allows us to evaluate the contribution of chemical design to ASO performance and integrate these effects into predictive models for optimization.

Features of Chemical Modifications

The following illustrations present the individual features we defined to describe chemical modifications in ASOs. Each figure shows:

  • A schematic sequence with modified nucleotides highlighted.
  • The corresponding mathematical formula.
  • An example feature output.

Together, these visuals provide an intuitive understanding of how each feature is calculated and what biological property it represents.

Diagram illustrating ASO chemical modifications – example 2 Diagram illustrating ASO chemical modifications – example 3 Diagram illustrating ASO chemical modifications – example 4
Figure 29: Additional illustrations of ASO chemical modification patterns. For comparability across ASOs of different lengths, all computed features were normalized by ASO length (or by window size, where applicable) during the feature calculation process.

Application

By computing these features for every ASO, each sequence can be represented as a numerical profile of its chemical modification architecture. These profiles can be correlated with experimental data on nuclease stability, RNase H! activity, immunogenicity or toxicity. In this way, modification patterns are transformed into quantitative predictors that can be integrated into machine learning models for ASO design and optimization.


RNA Binding Proteins

Introduction

RNA binding proteins (RBPs) are proteins that bind RNA molecules and influence their stability, processing, and translation. They act after transcription, shaping the life cycle of the mRNA in the cell. Each RBP recognizes short sequence motifs (typically 3-8 nucleotides) or structural elements, and by binding to these regions they can protect the transcript, destabilize it, regulate its splicing, or control how efficiently it is translated [51],[52],[53]. In this way, RBPs function as local regulators that fine-tune RNA fate. For example, stabilizers such as HuR (ELAVL1) bind and protect mRNAs from rapid decay, whereas destabilizers such as TTP (ZFP36) promote degradation by recruiting decay machinery [54]. Beyond stability control, translation regulators affect how efficiently ribosomes initiate or elongate, directly shaping protein output, while splicing regulators influence whether specific exons are included or skipped during pre-mRNA processing. In many cases, the role of an RBP is still not fully understood, and such cases are categorized as other/unclear.

Impact of RBPs on ASO Efficacy

For an ASO to be effective, it must access its complementary site on the target mRNA and form a stable duplex that recruits RNase H1 or blocks translation. However, this binding does not occur in isolation: the local RNA environment is often occupied by RBPs. These proteins, recognized through their short sequence motifs, can directly overlap with the ASO site or cluster in the surrounding region, thereby altering accessibility [55].

If stabilizing RBPs dominate the region, they can protect the transcript and counteract ASO-mediated degradation. Conversely, when destabilizing RBPs are enriched, the RNA is already prone to decay, and ASOs may act in a more permissive environment. Translation regulators may modulate whether ASO binding reduces protein output, while splicing regulators can shift isoform usage in ways that either enhance or diminish the therapeutic effect. In many cases, the combined influence of multiple RBPs creates a complex competitive landscape [56].

Thus, the motif landscape of RBPs is not a secondary detail but rather a critical determinant of ASO performance. Two ASOs with identical complementarity to their targets can behave very differently depending on whether their binding sites are free from RBP interference or embedded within regions densely occupied by stabilizers, destabilizers, or other regulatory proteins [57]. Figure 30 illustrates this principle: in the top panel, RBP binding occupies the target site and impedes ASO hybridization, whereas in the bottom panel the site is accessible, allowing stable ASO-mRNA duplex formation.

Comparison of ASO binding outcomes in different RBP environments
Figure 30: Comparison of ASO binding outcomes in different RBP environments. In the top panel, RBP binding occupies the target site and blocks ASO hybridization. In the bottom panel, the site remains accessible, enabling stable ASO–mRNA duplex formation. This schematic highlights how local RBP occupancy can strongly influence the efficacy of otherwise identical ASOs.

Data Processing

To translate these biological insights into quantitative features, we first generated an integrated dataset that links motif information with cell line–specific expression profiles. Motif data were obtained from the ATtRACT database, which provides a curated collection of RNA-binding protein motifs derived from experimental studies.

The raw motif table included the cell line in which each motif is relevant, the motif sequence in RNA alphabet, and the RBP identity (gene name). We then extended this table with two additional layers of information: a functional role label for each RBP (stabilizer, destabilizer, splicing regulator, translation regulator or other/unclear) and the corresponding expression values obtained from transcript abundance files for each cell line.

The final processed table therefore contained, for each row, the specific cell line providing the biological context, the motif sequence, the RBP identity, the functional role label, and the expression level of that RBP in the given cell line. This structured integration ensured that each motif hit could be quantitatively interpreted in terms of both its sequence-level presence and its biological activity. The Table below provides a few representative examples of motifs from this dataset. The complete dataset is alos available for download as a CSV file.

Cell line Gene (raw) Expression Gene name Gene ID Organism Motif Len Functional role
A431 RBM5 (10181) 4.835095398 RBM5 ENSG00000003756 Homo_sapiens AAAAAAA 7 other/unclear
A549 SRSF1 (6426) 8.046300059 SRSF1 ENSG00000136450 Homo_sapiens CCCAGCA 7 splicing_regulator
KMS11 PCBP2 (5094) 11.57882081 PCBP2 ENSG00000197111 Homo_sapiens ACCUUAAA 8 stabilizer
A431 PABPC1 (26986) 9.606238631 PABPC1 ENSG00000070756 Homo_sapiens AAAAAAAA 8 translation_regulator
HELA ELAVL1 (1994) 5.311862727 ELAVL1 ENSG00000066044 Homo_sapiens UUUGUUU 7 stabilizer
A431 ELAVL2 (1993) 3.475162724 ELAVL2 ENSG00000107105 Homo_sapiens CUUUA 5 stabilizer
HELA SRSF3 (6428) 8.701327257 SRSF3 ENSG00000112081 Homo_sapiens UCAACGAUU 9 splicing_regulator
HELA ZFP36 (7538) 3.146591687 ZFP36 ENSG00000128016 Homo_sapiens ACCUGC 6 destabilizer
A431 PCBP2 (5094) 8.741048949 PCBP2 ENSG00000197111 Homo_sapiens CCUUCCU 7 stabilizer
A549 PUM1 (9698) 4.580929345 PUM1 ENSG00000134644 Homo_sapiens UGUAAUAUU 9 destabilizer

Example entries from the extended RBP motif dataset, showing for each motif the associated cell line, RNA sequence, RBP identity, functional role and expresion value.

This merged and annotated dataset formed the foundation for all downstream feature calculation, enabling us to weight motif overlaps by expression and incorporate biological roles directly into the analysis.

Features of RBP

Building on these biological insights and the integrated dataset of motifs and expression values, we analyzed ASO binding sites and their surrounding regions for RBP motifs. By treating these short sequence patterns as quantifiable elements, we transformed the qualitative influence of RBPs into numerical features that capture their potential to shape ASO performance. To ensure that both fine-scale and transcript-wide effects were represented, features were computed at two complementary levels:

  • Local windows - $\pm 20, \pm 30, \pm 40, \pm 50, \pm 60$ and $\pm70$ nucleotides around the ASO binding site, describing the immediate neighborhood that directly governs accessibility.
  • Global CDS - across the entire coding sequence, capturing broader RBP occupancy trends that affect transcript-wide stability and processing.
The features developed are outlined below.

RBP Overlap Density

This feature measures how strongly the region that includes the ASO binding site and its surrounding nucleotides is obstructed by RBP motifs. For each window and for the full CDS, we calculated the proportion of nucleotides covered by at least one motif:

\[ \text{Overlap Density} = \dfrac{\#\{\text{Covered bases in the region}\}}{L} \]

where $\#\{\text{Covered bases in the region}\}$ is the number of nucleotides overlapped by one or more motifs, and $L$ is the length of the selected region. Higher values reflect stronger RBP occupancy and more extensive competition with the ASO, whereas lower values indicate that the region is relatively free of RBP interference. Figure 31 illustrates this principle: in the top panel (high density), multiple RBPs cover much of the window around the ASO binding site, limiting accessibility. In the bottom panel (low density), only a few RBPs occupy the window, leaving the site more exposed and permissive for ASO-mRNA duplex formation.

Illustration of RBP overlap density across the ASO binding site window
Figure 31: Illustration of RBP overlap density across the ASO binding-site window. The top panel shows a high-density scenario, where several RBPs occupy the window and restrict ASO accessibility. The bottom panel shows a low-density scenario, with minimal RBP presence and greater potential for ASO–mRNA duplex formation.
To evaluate the relevance of this feature in practice, we tested the correlation between RBP overlap density and ASO log inhibition, focusing here on stabilizer RBPs as an illustrative example. Similar analyses were performed for other RBP roles (destabilizers, splicing regulators, translation regulators and unclear)
Correlations between RBP overlap density (stabilizers) and ASO log inhibition across multiple windows
Figure 32: Correlations between RBP overlap density (stabilizers) and ASO log-inhibition across multiple windows. Bars represent Pearson (blue), Spearman (orange), and normalized mutual information (green). Negative correlations indicate that higher stabilizer occupancy within the window is associated with reduced ASO efficacy. Shown here for stabilizers; analogous analyses were performed for other RBP roles. On the x-axis, feature names are shown as defined in the dataset. Each name encodes the feature type (overlap density), the RBP role (e.g., stabilizer), and the window size (±20, ±30, ±40, ±50, ±60, ±70, or global).

Role-Contrast Features

This feature evaluates the balance between two functional groups of RBPs (for example destabilizers versus stabilizers) within the region that includes the ASO site and its flanking nucleotides. It is expressed as both a difference and a ratio:

\[ \Delta = D_A - D_B , \quad R = \frac{D_A}{D_B + \varepsilon} \]

Where $D_A$ and $D_B$ are the motif densities of roles $A$ and $B$ in the region, $L$ si the length of the region, and $\epsilon$ is a small constant to prevent division by zero. For example, in biological terms, stabilizers dominance over destabilizers ($D_A > D_B$) implies protection of the transcript and reduced ASO efficacy, whereas destabilizers dominance ($D_A < D_B$) indicates a decay prone environment that may be more permissive to ASO activity. In our case we implemented this feature on the following pairs: destabilizer vs stabilizer, splicing regulator vs translation regulator, destabilizer vs splicing regulator, destabilizer vs translation regulator, unclear vs stabilizer and unclear vs destabilizer. Figure 33 illustrates this principle: the top panel shows a window dominated by destabilizers, while the bottom panel shows a window dominated by stabilizers.

Role-contrast features within the ASO binding-site window
Figure 33: Role-contrast features within the ASO binding-site window. The top panel shows a window dominated by destabilizers, resulting in a role-contrast value that favors destabilizers. The bottom panel shows a window dominated by stabilizers, resulting in a role-contrast value that favors stabilizers.
To evaluate the relevance of this feature in practice, we tested the correlation between role-contrast values and ASO log-inhibition. Here we show results for the destabilizer vs stabilizer pair, expressed both as a difference ($\Delta$) and as a ratio ($R$)
Correlations between role-contrast delta (destabilizer vs stabilizer) and ASO log-inhibition across multiple windows
Figure 34: Correlations between role-contrast Δ (destabilizer vs stabilizer) and ASO log-inhibition across multiple windows. Bars represent Pearson (blue), Spearman (orange), and normalized mutual information (green). Positive correlations indicate that destabilizer dominance is associated with stronger ASO efficacy. On the x-axis, feature names are shown as defined in the dataset. Each name encodes the feature type (role-contrast Δ), the RBP roles being compared (destabilizer vs stabilizer), and the window size (±20, ±30, ±40, ±50, ±60, ±70, or global).
Correlations between role-contrast ratio (destabilizer vs stabilizer) and ASO log-inhibition across multiple windows
Figure 35: Correlations between role-contrast R (destabilizer vs. stabilizer) and ASO log-inhibition across multiple windows. Bars represent Pearson (blue), Spearman (orange), and normalized mutual information (green). Negative correlations indicate that stabilizer dominance (larger denominator) is associated with reduced ASO activity. On the x-axis, feature names are shown as defined in the dataset. Each name encodes the feature type (role-contrast R), the RBP roles being compared (destabilizer vs. stabilizer), and the window size (±20, ±30, ±40, ±50, ±60, ±70, or global).

Expression-Weighted Density

This feature incorporated RBP mRNA expression levels into the density calculation, recognizing that abundantly transcribed RBPs are more likely to exert stronger effects in the cell. For each window or the full CDS, we computed:

\[ \text{Weighted Density} = \frac{\sum_i (L_i \cdot E_i)}{L} \]

Where \(L_i\) is the overlap length contributed by RBP \(i\), \(E_i\) is its mRNA expression in the given cell line, and \(L\) is the total length of the region. Biologically, high weighted density indicates that RBPs with high mRNA abundance cover the region extensively, reducing ASO accessibility, while low weighted density suggests minimal obstruction from actively expressed RBPs.

To evaluate the relevance of this feature in practice, we tested the correlation between expression-weighted density and ASO log-inhibition, focusing here on stabilizer RBPs as a representative example.

Correlations between expression-weighted density (stabilizers) and ASO log-inhibition across multiple windows
Figure 36: Correlations between expression-weighted density (stabilizers) and ASO log-inhibition across multiple windows. Bars represent Pearson (blue), Spearman (orange), and normalized mutual information (green). Negative correlations indicate that higher expression-weighted stabilizer occupancy is associated with reduced ASO efficacy. On the x-axis, feature names are shown as defined in the dataset. Each name encodes the feature type (expression-weighted density), the RBP role (e.g., stabilizer), and the window size (±20, ±30, ±40, ±50, ±60, ±70, or global).

Motif Diversity (ASO-only and window-level)

This feature captures whether regulation of the region (ASO site plus its surroundings) is dominated by a single RBP or distributed among multiple RBPs. For each window or the full CDS, it is expressed as:

\[ D_R = \dfrac{N_{\text{unique}}}{N_{\text{total}}} \]

where \(D_R\) is the Diversity Ratio, \(N_{\text{unique}}\) is the number of unique RBP identities, and \(N_{\text{total}}\) is the total number of motif hits. We evaluated two complementary variants: ASO-only diversity, counting motif overlaps that fall strictly on the ASO site (same value for all windows), and window-level diversity, counting all overlaps within the surrounding window. Biologically, a low \(D_R\) suggests that one RBP largely determines the outcome (which can either hinder or support ASO action depending on its role), whereas high diversity indicates competition among many RBPs, potentially leaving more opportunities for ASO binding. Figure 37 illustrates this principle: in the top panel, the window is dominated by a single RBP type (low diversity), while in the bottom panel, multiple RBP types occupy the window (high diversity).

Illustration of motif diversity within the ASO binding-site window
Figure 37: Illustration of motif diversity within the ASO binding-site window. The top panel shows a low-diversity scenario dominated by one RBP type. The bottom panel shows a high-diversity scenario where multiple RBP types compete for occupancy, resulting in a higher motif diversity value.

To evaluate the relevance of this feature in practice, we tested the correlation between motif diversity and ASO log-inhibition, focusing here on stabilizer RBPs as a representative example.

Correlations between motif diversity (stabilizers) and ASO log-inhibition across multiple windows
Figure 38: Correlations between motif diversity (stabilizers) and ASO log-inhibition across multiple windows. Bars represent Pearson (blue), Spearman (orange), and normalized mutual information (green). Negative correlations indicate that higher expression-weighted stabilizer occupancy is associated with reduced ASO efficacy. On the x-axis, feature names are shown as defined in the dataset. Each name encodes the feature type (motif diversity), the RBP role (e.g., stabilizer), and the window size (±20, ±30, ±40, ±50, ±60, ±70, or global).

Positional Splits

This feature measures how RBP occupancy is distributed across different parts of the ASO binding site within its surrounding region. The ASO binding site is divided into three equal segments: Left, Core, and Right. For each segment, we calculate the fraction of nucleotides covered by at least one motif:

\[ \mathrm{Frac_{part}} = \frac{N_{\mathrm{overlap}}}{L_{\mathrm{part}}} \]

Where \(N_{\mathrm{overlap}}\) is the number of nucleotides overlapped by motifs within that segment, and \(L_{\mathrm{part}}\) is the length of the segment. Each fraction ranges from 0 (no overlap) to 1 (full overlap). These values were computed per role, across local windows (\(\pm20, \pm40, \pm70\) nucleotides) and for the global CDS. Biologically, this feature highlights whether RBP binding tends to cluster at the edges of the ASO site or directly over its central core. Core overlaps are likely to interfere most strongly with ASO–mRNA hybridization, while overlaps confined to flanking edges may be less disruptive. Figure 39 illustrates this principle: in the top panel, most RBPs bind at the edges, resulting in high left/right fractions and a low core fraction, while in the bottom panel, RBPs cluster in the middle, producing a high core fraction and low left/right fractions.

Positional splits across the ASO binding site showing edge vs core occupancy of RBPs
Figure 39: Positional splits across the ASO binding site. The binding site is divided into three equal parts (Left, Core, Right), and the fraction of nucleotides overlapped by RBPs is calculated for each part. The top panel shows edge occupancy, where RBPs bind mainly at the left and right edges of the site, resulting in high left/right fractions and a low core fraction. The bottom panel shows core occupancy, where RBPs concentrate in the middle, producing a high core fraction and low edge fractions, which is expected to interfere most strongly with ASO–mRNA duplex formation.

Summary

Together, these five features provide a multidimensional profile of the RBP environment in both local windows (ASO site plus surrounding regions) and the global CDS. They describe: how obstructed the region is (overlap density), how RBP mRNA expression strengthens or weakens that obstruction (weighted density), which functional forces dominate (role contrast, on both regular and mRNA-weighted densities), whether regulation is monopolized or diverse (motif diversity), and where along the ASO binding site RBP occupancy is concentrated (positional splits). By systematically merging motif and mRNA expression data and computing these features in both local and global contexts, we capture the biological reality of RBP regulation and ensure our ASO predictions account for both fine-scale adn transcript-wide influences.


Off Target Effects

When designing ASOs, one of the major challenges is avoiding off-target effects [58]. These can be broadly categorized into hybridization-dependent and hybridization-independent mechanisms.

Hybridization-dependent off-targets occur when an ASO partially binds unintended RNA transcripts through imperfect sequence complementarity, leading to RNase H-mediated cleavage or steric blocking of translation. In contrast, hybridization-independent off-targets arise without base pairing and are instead driven by the chemical properties of the ASO—such as phosphorothioate backbones or hydrophobic modifications—that promote nonspecific interactions with proteins or trigger immune activation.

ASOs are designed to bind a specific RNA transcript, yet hybridization-dependent off-targeting can still occur with other RNAs. The likelihood of such unintended binding depends primarily on two factors:

  • Sequence complementarity (hybridization strength): the more similar the sequence of the off-target RNA to that of the target, the stronger the binding.
  • Expression level of the RNA: even if an RNA is not a perfect match, high expression levels can still lead to strong competition with the intended target.

An off-target interaction is therefore most relevant when both sequence similarity and transcript abundance permit significant hybridization between the ASO and unintended RNAs.

Sequence-Based Similarity: Match-Distance Analysis

As an initial step, we used the Hamming distance to quantify sequence similarity between ASOs and RNA targets. It measures how many positions differ between two sequences of equal length:

\( d_H(x, y) = \sum_{i=1}^{n} [x_i \ne y_i] \)

While this simple metric captures single-base substitutions, it does not account for insertions or deletions. We therefore use it as a starting point before extending our analysis to more complex alignment-based measures.

To evaluate an ASO’s potential to degrade its intended target, we analyzed the distribution of perfect and near-perfect matches (Hamming distance ≤ 2) across the genome. Near-complementary sites can act as competing sinks, transiently binding the ASO and limiting its availability for productive binding to the correct target transcript.

We also examined imperfect matches within the target gene itself. In some cases, partial complementarity can still enable RNase H–mediated cleavage, similar to what occurs in off-target interactions, thereby contributing to overall target degradation.

For estimating potential off-targets across the whole genome, we used the Bowtie alignment tool. The workflow involved several steps:

For estimating potential off-targets across the whole genome, we used the Bowtie alignment tool. The workflow involved several steps:

  1. Building an index of the reference genome using bowtie-build for fast searching.
  2. Aligning ASO sequences against this indexed genome using bowtie, producing SAM-format alignments.
  3. Converting SAM files to BAM for efficient storage.
  4. Analyzing the alignment counts with a custom Python script that quantified the number of matches per genomic location.

The index creation was parallelized across chromosomes and completed within minutes, allowing us to perform off-target searches extremely quickly. Because Bowtie’s search runtime scales approximately linearly with the number of detected off-targets, genome-wide analysis could be executed efficiently with minimal computational cost.

Structure-Aware Thermodynamics: RISearch Binding Model

RNA molecules — both pre-mRNA and mature RNA — are inherently three-dimensional, and their structure plays a critical role in determining how they interact with other molecules. In addition to Hamming-distance–based analysis, we also evaluate ASOs by their binding free energy ($\Delta G$) to both the intended target and potential off-targets.

RIsearch is a fast alignment-based algorithm for large-scale detection of nucleotide interactions, used in applications such as CRISPR off-target prediction and RNA–RNA or RNA–DNA hybridization analysis. RIsearch employs an efficient dynamic programming algorithm with a time complexity of O(query length × target length), allowing it to account for potential secondary-structure interactions between the ASO and its RNA target.

RIsearch also accounts for RNA structural features such as bulges, loops, and hairpins formed between the reference and query sequences, estimating the Gibbs free energy of binding between each ASO and its candidate gene locations. Lower (more negative) $\Delta G$ values correspond to more thermodynamically stable duplexes and serve as an indicator of higher binding affinity and potential ASO efficiency.

For our analysis, we used an in-house modified version of RIsearch. This customized implementation allows us to adjust the scoring weights for each nucleotide pairing, providing fine-grained control over how different base-pair interactions contribute to the overall binding energy. This flexibility enables more accurate modeling of ASO–RNA interactions in line with experimental observations.

To tailor it for ASO design, we first transposed its built-in RNA/DNA interaction matrix so that the query strand corresponds to PS-DNA (the ASO) and the reference strand to RNA (the target mRNA), ensuring that base-pair energetics are applied in the correct orientation. We then integrated nearest-neighbor (NN) parameters derived from chemical HRM experiments for PS-DNA/DNA duplexes, and using our previously defined conversion formula (See Hybridization section), inferred the equivalent PS-DNA/RNA weights. This adaptation enables chemistry-aware $\Delta G$ calculations for ASO–mRNA hybrids, where more negative values indicate stronger and more thermodynamically stable binding.

Off-target by Cell-Line Expression Profile

Data Sources

To capture these aspects, we used data from the Cancer Dependency Map (DepMap) [59]. All datasets were taken from the 25Q2 release of DepMap, ensuring consistency across expression, mutation, and cell line annotation files. DepMap is a large public resource that provides multi-omics data across hundreds of human cancer cell lines, including:

  • Transcript expression levels (both coding and non-coding RNAs)
  • Mutation profiles for each cell line
  • Cell line identifiers (CCLE IDs) that link datasets together

In addition, we used [60] for:

  • The human reference genome sequence.
  • A GTF annotation file containing exon–intron structures and transcript coordinates.

Building the Mutated Transcriptome
We first focused on constructing the transcriptome layer, as it offered a more direct and accessible way to capture functional genetic variation across cell lines. For each cell line, we then built a mutated transcriptome by integrating these data:

  • Expression matrix – expression levels for each transcript, recorded both in TPM (transcripts per million) and log(TPM+1).
  • Mutation set – mutations reported in the cell line.
  • Transcript annotation – mapping ENST transcript IDs to exon/intron structure and their coordinates in the chromosome.
  • Sequence reconstruction – extracting the transcript sequence from the genome and introducing mutations directly, following cDNA mutation notation.

The result was a complete list of all transcripts in the cell line, each with its sequence (mutated if needed) and expression level.

Calculating Off-Target Scores
With the mutated transcriptome available, we could evaluate how strongly each ASO binds unintended RNAs:

  • For each ASO, we selected the cell line in which it was tested.
  • Using RIsearch, we calculated hybridization energies between the ASO and every transcript in that cell line (excluding the intended target) [61].
  • Binding strength (free energy, \(\Delta G\)) was combined with transcript abundance according to:

\[ \text{OffTarget}_{\text{ARTM}_1} = \sum_{i=1}^{N} \left( E_i \times \text{Expr}_i \right) \]

where \(E_i\) is the hybridization free energy (\(\Delta G\)) between the ASO and the \(i\)-th transcript, and \(\text{Expr}_i\) is the corresponding transcript’s expression level (e.g., \(\text{TPM}\) or \(\log(\text{TPM}+1)\)). The subscript \(\text{ARTM}_1\) denotes the first arithmetic method, where off-target contributions are summed as weighted products of binding strength and transcript abundance.

  • We ranked transcripts by this weighted score and summed contributions from the top \(N\) most relevant off-targets (testing multiple values of \(N\)).

We called this approach the arithmetic method, where hybridization strength is directly weighted by transcript expression levels.

Results and Improvements
At first, we used mature mRNA sequences (after splicing) and found only a weak correlation between the off-target score and ASO activity. However, previous studies suggest that many ASOs act primarily on pre-mRNA rather than mature transcripts. When we repeated the analysis using pre-mRNA sequences (including introns), the results improved substantially. Off-target scores showed a stronger, statistically significant (see supplementary Tables A,B,C) negative correlation with ASO efficacy — the higher the predicted off-target binding, the less effective the ASO was at silencing its intended target, exactly as expected.

Effect of Hybridization Cutoffs
We tested different cutoffs for minimum hybridization strength to determine which interactions are biologically relevant. A threshold of around 1200 RIsearch units (corresponding to approximately 10–12 consecutive complementary base pairs) gave the best results (Figures 40-42). This value provided a good balance between noise reduction — filtering out weak, spurious interactions — and information retention, preserving interactions strong enough to compete with the intended target.

Off-target score performance across hybridization cutoffs
Figure 40. Pearson correlation between off-target arithmetic weighting scores and ASO inhibition across varying analysis parameters. Correlations are shown as a function of different top N thresholds (limiting the analysis to the N most highly expressed genes) and different cutoff values (minimum interaction score required for inclusion). Circles indicate correlations computed over the full dataset, whereas X symbols represent correlations after excluding zero-valued cases to focus on informative instances.
Comparison of ΔG thresholds and prediction accuracy
Figure 41. Pearson correlation between off-target arithmetic weighting scores and ASO inhibition across varying analysis parameters. Correlations are shown as a function of different top N thresholds (limiting the analysis to the N most highly expressed genes) and different cutoff values (minimum interaction score required for inclusion). Circles indicate correlations computed over the full dataset, whereas X symbols represent correlations after excluding zero-valued cases to focus on informative instances.
Correlation between hybridization strength and ASO efficacy
Figure 42. Pearson correlation between off-target arithmetic weighting (by log expression) scores and ASO inhibition across varying analysis parameters. Correlations are shown as a function of different top N thresholds (limiting the analysis to the N most highly expressed genes) and different cutoff values (minimum interaction score required for inclusion). Circles indicate correlations computed over the full dataset, whereas X symbols represent correlations after excluding zero-valued cases to focus on informative instances.

Comparing Different Scoring Methods

To ensure our off-target metric was robust, we systematically compared multiple formulations that varied in how transcript expression was incorporated. These methods explored different mathematical relationships between hybridization energy and expression level.

1. Squared Expression Weighting

Our first variant extended the arithmetic model by giving more weight to highly expressed transcripts:

\[ \text{OffTarget}_{\text{ARTM}_2} = \sum_{i=1}^{N} \left( E_i \times \text{Expr}_i^{2} \right) \]

where \(E_i\) is the hybridization free energy (\(\Delta G\)) between the ASO and the \(i\)-th transcript, and \(\text{Expr}_i\) is its expression level (e.g., \(\text{TPM}\) or \(\log(\text{TPM}+1)\)). The subscript \(\text{ARTM}_2\) denotes the second arithmetic method, where expression values are squared to emphasize highly expressed off-target transcripts.

This formulation amplified the contribution of abundant RNAs, highlighting cases where expression level alone may dominate off-target behavior. It produced slightly weaker correlations compared to the main arithmetic model but remained statistically significant (see Table D).

Off-target score comparison for squared expression weighting (ARTM₂ formulation)
Figure 43. Pearson correlation between off-target arithmetic weighting (by squared expression) scores and ASO inhibition across varying analysis parameters. Correlations are shown as a function of different top N thresholds (limiting the analysis to the N most highly expressed genes) and different cutoff values (minimum interaction score required for inclusion). Circles indicate correlations computed over the full dataset, whereas X symbols represent correlations after excluding zero-valued cases to focus on informative instances.

2. Geometric Method

Next, we explored a geometric formulation, where hybridization energies were combined multiplicatively rather than additively. This approach captures proportional rather than linear effects of expression:

\[ \text{OffTarget}_{GEO} = \prod_{i=1}^{N} \left( E_i^{\text{Expr}_i} \right) \]

where \(E_i\) is the hybridization free energy (\(\Delta G\)) and \(\text{Expr}_i\) is its transcript expression level. Highly expressed transcripts therefore contribute multiplicatively to the total off-target score.

To prevent numerical underflow caused by very small products, we applied a log–exp transformation:

\[ \text{OffTarget}_{GEO} = \exp \left( \sum_{i=1}^{N} \text{Expr}_i \times \log(E_i) \right) \]

This preserved the geometric interpretation while allowing stable numerical computation. The results were comparable to the arithmetic model (see Table E), demonstrating that expression weighting remains robust across mathematical formulations.

Geometric method comparison of off-target scoring (GEO formulation)
Figure 44. Pearson correlation between off-target geometric weighting scores and ASO inhibition across varying analysis parameters. Correlations are shown as a function of different top N thresholds (limiting the analysis to the N most highly expressed genes) and different cutoff values (minimum interaction score required for inclusion). Circles indicate correlations computed over the full dataset, whereas X symbols represent correlations after excluding zero-valued cases to focus on informative instances.
Off-target mechanical statistics-based weighting score correlation analysis
Figure 45. Pearson correlation between off-target mechanical statistics-based weighting scores and ASO inhibition across varying analysis parameters. Correlations are shown as a function of different top N thresholds (limiting the analysis to the N most highly expressed genes) and different cutoff values (minimum interaction score required for inclusion). Circles indicate correlations computed over the full dataset, whereas X symbols represent correlations after excluding zero-valued cases to focus on informative instances.

3. Statistical (Boltzmann-Weighted) Method

Finally, we introduced a Boltzmann-weighted statistical approach, which treats hybridization events probabilistically, weighting them according to their thermodynamic favorability:

\[ \text{OffTarget}_{\text{MS}} = \sum_{i=1}^{N} \text{Expr}_i \times \exp \left( -\frac{E_i}{\kappa_B T} \right) \]

where \(E_i\) is the binding free energy (\(\Delta G\)) between the ASO and the \(i\)-th transcript, \(\kappa_B\) is the Boltzmann constant, and \(T\) is the effective temperature (in Kelvin). This formulation links molecular thermodynamics with expression-level weighting, approximating the statistical likelihood that an ASO transiently binds a given transcript in the cellular environment. The subscript \(\text{MS}\) denotes the mechanical-statistics method, which integrates thermodynamic weighting of hybridization energy into the off-target scoring process.

The Boltzmann-weighted model maintained the same negative correlation trend observed in arithmetic and geometric methods, confirming that biologically, expression-driven weighting remains consistent across both deterministic and probabilistic frameworks (see Table F).

Off-target summary

Summary of Results

  • The pre-mRNA–based approach outperformed the mature mRNA model, aligning with literature suggesting that pre-mRNA is the primary target of ASOs.
  • The arithmetic method with an energy cutoff of approximately 1200 RIsearch units produced the strongest and most consistent correlations (Table A).
  • Alternative scoring methods (squared expression, geometric, and Boltzmann-weighted) confirmed the same overall trend, though with slightly weaker performance.
  • The thermodynamics-based formulation was less effective, likely due to model simplifications in how binding sites were represented.
  • When analyzing non-zero ASO off-target scores, we observed that higher energy cutoffs and smaller top-\(N\) sets produced a stronger negative correlation between off-target binding and inhibition efficiency. This indicates that ASOs with strong hybridization to highly expressed transcripts tend to be less effective, as they become sequestered by these off-targets rather than engaging their intended gene. Therefore, expression-weighted off-target binding may serve as a powerful early-stage filter, enabling the exclusion of ASO candidates that are likely to become trapped in unintended interactions.
Spearman correlation between off-target scores and ASO inhibition across varying cutoff values
Figure 46. Spearman correlation between off-target scores and ASO inhibition across varying cutoff values, restricted to nonzero cases. Correlations are displayed for different top N thresholds, and dashed lines represent linear fits illustrating the general trend across cutoffs.

Together, these results demonstrate that weighting off-target scores by hybridization energy and transcript expression provides a reliable predictor of ASO performance.

Supplementary Tables


Machine Learning

We've begun our journey with data curation, then we utilized a simple ODE to reduce Volume related bias, and we started to engineer features that explain this volume-corrected inhibition. Now we need to put all the pieces together, and find for each ASO, its $-k$ efficiency number, or $-kT$ as we don't correct for treatment period in this version.

As we focus on lung cancer, or more specifically A549 cell lines, we noticed that we don't have any lung cells in our data. The most common ones are melanoma(A431, SK-MEL-28), liver cells (HepG2, HepaRG) and lymphoma (MM1.R, KARPAS-229). This calls for a model that can generalize across cell lines, and won't suffer from experiment or cell line specific bias.

We chose to utilize XGBRanker[62], a model based on XGBoost, that is focused on ranking pairs of ASOs. The idea is the following - different experiments and cell lines, might have some unique properties that relate specifically to them. By utilizing a pairwise model, we can utilize an apples to apples comparison, allowing us to find features that generalize well across cell lines.


Feature Selection

We applied feature selection to identify which biological and chemical parameters most strongly associate with ASO efficacy while limiting overfitting. Features with occasional missing values (e.g., codon-usage metrics for intron-targeting ASOs) were 0-filled to maintain comparability across all sequences.

Because the feature space was large, we used a two-stage procedure. At each step we (i) ranked the remaining features by mutual information with the response and shortlisted the top-80, then (ii) selected from this pool the feature whose maximum absolute Spearman correlation to the already-chosen set was smallest (ties broken by higher MI). We repeated this until 40 features were selected, preserving relevance while actively limiting redundancy.

After this screen, we manually curated a balanced panel that preserved representation from key categories (hybridization/structure, sequence composition, folding/accessibility, RNase H motifs, contextual/genomic, and assay/experimental). The table below shows representative features removed during the preliminary pass and the closest retained counterparts.

Dropped (Preliminary) Closest Kept Comment
premRNA_gc_skew_50
premRNA_gc_skew_20
gc_skew Best GC-skew feature chosen (ASO-only)
premRNA_at_skew_50
premRNA_at_skew_20
at_skew Best AT-skew feature chosen (ASO-only)
premRNA_gc_content_20
premRNA_cg_dinucleotide_fraction_70
gc_content
gc_content_3_prime_5
Best GC-content descriptors
premRNA_sequence_entropy_50
premRNA_dinucleotide_entropy_30
premRNA_dinucleotide_entropy_70
entropy
nucleotide_diversity
Non-redundant entropy/heterogeneity measures
premRNA_tandem_repeats_score_70
premRNA_homooligo_count_70
premRNA_homooligo_count_40
homooligo_count
poly_pyrimidine_stretch
Repeat-type features with distinct signal
Modification_evenness
Modification_symmetry_score
Modification_adjacent_pair_count
Modification_dominant_mod_fraction
Modification_pos_std
Modification_skew_index
Modification_min_distance_to_3prime
Compact modification-asymmetry proxies
dsm_su95_rev_wGU_pos1386t37Falseon_target_energy_max600
dsm_su95_rev_woGU_pos1384t37Trueon_target_energy_max600
self_energy
internal_fold
exp_ps_hybr
Representative hybridization/secondary-structure signals
min_mfe_45 on_target_fold_openness_normalized40_15
sense_avg_accessibility
mRNA folding / accessibility retained
RNaseH1_score_R4b RNaseH1_Krel_score_R7_krel More informative RNase-H window kept
ENC_score_global_CDS CAI_score_global_CDS Best global codon-usage
premRNA_stop_codon_count_70 stop_codon_count Simpler stop-codon load (ASO-proximal)
premRNA_at_rich_region_score_50 at_rich_region_score Best AT-richness descriptor (ASO)
off_target.top200.cutoff800.premRNA_log off_target.top200.cutoff600.premRNA_TPM Best TPM-weighted off-target
mrna_length Location_div_by_length
sense_exon
sense_intron
sense_utr
Contextualized length/position features

We also included normalized_start, normalized_sense_start_from_end, hairpin_score, and the assay-level Treatment_Period(hours) as domain-informed additions.

Backward Feature Selection
We began with 28 candidates and removed features one at a time until 15 remained. At each round we trained a full model without each candidate (“leave-one-feature-out”), scored every variant, and dropped the feature whose removal produced the smallest drop (or a gain) in our objective. This keeps the selection grounded in the full multivariate context rather than isolated, one-by-one additions.

  1. Objective. Geometric mean \( \mathrm{NDCG}@200 \) under leave-one-cell-line CV (4 models: train on 3, evaluate on the held-out 4th).
  2. Round \(k\) procedure. For each remaining feature \(f\): train a model without \(f\), compute the CV objective, and record \( \Delta \mathrm{score}(f) \). Remove the feature whose removal yields the best (least harmful / most beneficial) \( \Delta \mathrm{score}(f) \).
  3. Stopping rule. Stop when 15 features remain, or earlier if any further removal would drop the CV objective beyond a predefined tolerance.

The table shows each feature’s ablation effect (ΔNDCG, see more Definition of Metrics ): the change in validation score when that feature is removed. These values do not sum to the final NDCG because effects overlap across correlated features. Positive values indicate improved generalization; negative values indicate a small adverse impact.

Feature Δ gmean NDCG@200 (unseen cell lines)
Treatment_Period(hours) +0.0535
at_skew +0.0301
CAI_score_global_CDS +0.0245
stop_codon_count +0.0245
sense_avg_accessibility +0.0204
on_target_fold_openness_normalized40_15 +0.0197
sense_utr +0.0189
nucleotide_diversity +0.0180
internal_fold +0.0169
normalized_start +0.0141
RNaseH1_Krel_score_R7_krel +0.0125
hairpin_score +0.0112
Modification_min_distance_to_3prime +0.0081
at_rich_region_score +0.0051
self_energy −0.0022

This NDCG@200 threshold for unseen cell lines balances sensitivity to top-performers with stability across datasets of very different sizes (≈1,000–8,000 ASOs per cell line). Under this objective, certain biological signals (e.g., sequence composition or RNase-H window effects) dominated, while explicit hybridization energies contributed less once correlated variables were present.

self_energy was removed because its ablation increased the validation NDCG (ΔNDCG > 0 in the table), indicating the feature hurt performance when included.

On treatment period and potential bias. We include Treatment_Period(hours) as an assay-level “hyper-feature.” The ODE transform that corrected dose/volume does not capture time effects, so we treat duration explicitly rather than forcing a mismatched fit. This makes any protocol differences transparent while keeping the model focused on biology-driven features. Next, we will extend the kinetics (e.g., Michaelis–Menten or related ODEs) to account for duration without introducing bias.

The final model therefore uses 14 features (with Treatment_Period(hours) acting as an experimental hyper-feature) and shows consistent generalization across unseen cell lines.

The chosen features are:

  • Treatment period (hours) — experimental covariate included to control for exposure time.
  • AT skew — \((A-T)/(A+T)\); does the ASO sequence contain relatively more A’s or T’s overall?
  • CAI score (global CDS) — codon adaptation index computed on the corresponding transcript’s CDS.
  • Stop codon count — number of stop triplets present within the ASO sequence.
  • Sense average accessibility — target-site openness from RAccess on the sense transcript.
  • on_target_fold_openness_normalized40_15 — ViennaRNA windowed openness (sliding windows; normalized) around the target site.
  • sense_utr — indicator: the ASO-targeted sense region lies in a UTR of some transcript.
  • Nucleotide diversity — count of distinct dinucleotides observed (out of 16 possible).
  • Internal fold — propensity of the ASO to fold onto itself (self-structure likelihood).
  • normalized_start — position of the targeted sense start, normalized by total gene length.
  • RNaseH1_Krel_score_R7_krel — best-correlating RNase H1 window (R7) within the ASO.
  • hairpin_score — potential to form hairpins, estimated by recurrence of short motifs in the reverse complement.
  • Modification_min_distance_to_3prime — minimum distance of any chemical modification from the 3′ end (captures modification asymmetry).
  • AT-rich region score — fraction of the sequence contained in AT-rich stretches.

Target Selection

The journey of selecting our therapeutic target begins with one central question: Which synthetic lethal (SL) pair holds the greatest promise for lung cancer patients? This process is not straightforward — it requires balancing clinical relevance, biological plausibility, and computational evidence.

Step 1 – Mapping the Mutational Landscape

We first set out to identify mutations that are truly relevant for lung cancer. Using large-scale genomic datasets, including TCGA for LUAD and LUSC, as well as MSK 2022/2023 cohorts for metastatic NSCLC and CPTAC 2021 for squamous carcinoma, we compiled a comprehensive map of frequently altered genes in these tumors [63][66]. From TCGA LUAD, we analyzed 2,701,623 mutations spanning 17,909 genes, and from TCGA LUSC, 2,380,747 mutations across 17,950 genes. This large-scale mutational landscape served as the foundation for downstream prioritization of genes that may serve as viable therapeutic targets.

Step 2 – Anchoring in Synthetic Lethality

Not every mutation can serve as a useful entry point. We therefore intersected our gene list with curated databases of synthetic lethal interactions, which represent the current state-of-the-art resources for SL data [67][68]. To ensure robustness, only pairs with a confidence score above 0.5 were retained, yielding approximately 1,000 candidate genes.

Step 3 – Focusing on Tumor Suppressors

Because the therapeutic logic of synthetic lethality typically begins with a dysfunctional gene, a reasonable assumption is that loss-of-function (LOF) mutations occurring frequently in tumor cells will most often involve tumor suppressor genes (TSGs). Only LOF mutations are relevant in this context, since the principle of synthetic lethality relies on the simultaneous loss of function in two genes — which is fatal for the cancer cell, whereas LOF in each gene individually is not harmful.

To focus our analysis, we refined our candidate list using the BIOINFO classification of tumor suppressor genes [69], which narrowed the set to 176 genes with known or suspected tumor-suppressive roles.

Step 4 – Defining True Loss-of-Function Mutations

Not all mutations in these genes translate into functional loss. We therefore applied stringent criteria to classify mutations as bona fide loss-of-function (LOF) events:

  • Splicing disruptions predicted by the OncoSplice model, which forecasts the altered mRNA isoform [70].
  • Start codon disruptions scored using the Exposition model for translation initiation [71].
  • Frameshift or nonsense variants annotated as deleterious by Ensembl’s Variant Effect Predictor(VEP) [72], cross-validated with OncoSplice percentile scoring (<0.75 considered LOF).
  • Copy number losses, when present, were treated as strong LOF indicators.

As an illustrative case, TP53 emerged from our LOF classification pipeline as one of the most frequently disrupted tumor suppressors in lung cancer [73]. Despite its high overall mutation rate, each specific high-confidence LOF mutation is rare (<1%), reflecting extensive heterogeneity across patients. To ensure clinical relevance, we excluded genes with a total LOF frequency below 5% in lung cancer cohorts.

Step 5 – From Thousands to Candidates

After this multi-layered refinement, we arrived at a shortlist of the top synthetic lethal (SL) pairs, representing the most promising therapeutic opportunities to advance into biological validation and ASO design. Among these, particular emphasis was placed on patient populations harboring loss-of-function mutations in TP53 and homozygous deletions in MTAP, two alterations that play central roles in tumor progression and therapeutic vulnerability [74],[75].

After classifying mutations according to our multi-model LOF framework, we estimated the overall frequency of LOF events for each tumor suppressor gene across non–small cell lung cancer (NSCLC) cohorts. The figure below presents the top genes with the highest LOF frequencies, highlighting those that represent recurrent vulnerabilities in NSCLC. Among these, MTAP (16%) and TP53 (15%) were prioritized for further study due to their strong biological rationale and the existence of known inhibitory mechanisms and synthetic lethal partners.
An overview of the top tumor suppressor genes ranked by LOF frequency is shown in Figure 47, while the complete computational pipeline for identifying tumor suppressor–synthetic lethality (TSG–SL) pairs is illustrated in Figure 48.

Top tumor suppressor genes ranked by estimated frequency of high-confidence LOF mutations across NSCLC cohorts
Figure 47. Top tumor suppressor genes ranked by estimated frequency of high-confidence loss-of-function (LOF) mutations across NSCLC cohorts. Values account for independent contributions of deletion, frameshift, splicing, and initiation-disruption events, weighted by the expected distribution of 40% LUAD and 30% LUSC cases [73].
Pipeline for identifying tumor suppressor–synthetic lethality (TSG–SL) pairs in lung cancer
Figure 48. This pipeline identifies candidate tumor suppressor–synthetic lethality (TSG–SL) pairs in lung cancer: (1) Database (DB) – large-scale lung cancer genomic datasets (TCGA LUAD/LUSC, MSK 2022/2023, CPTAC 2021) were used to compile mutations and copy number alterations across tumors. (2) Initial Filtering (SL partners) – from this map, only genes with a synthetic lethality (SL) partner were retained, based on curated SL databases (e.g., SynLethDB), keeping only pairs with confidence scores >0.5 (~1000 genes remained). (3) Tumor Suppressor Gene (TSG) Filtering – because synthetic lethality requires functional loss, we restricted the list to genes classified as tumor suppressors (TSGs) using the BIOINFO/TSGene annotation (~176 genes remained). (4) Loss-of-Function (LOF) Mutation Classification – we then classified mutations as bona fide LOF events using four main criteria: (4.1) splicing disruptions (OncoSplice model), (4.2) start codon disruptions (Exposition model), (4.3) frameshift/nonsense variants annotated by VEP (cross-validated with OncoSplice), and (4.4) copy number losses. (5) LOF Frequency Filtering – TSGs with a total LOF mutation frequency <5% in lung cancer cohorts were excluded, leaving only recurrently inactivated tumor suppressors for downstream SL partner selection.

The Guiding Principle

At the heart of this strategy lies the essence of synthetic lethality - exploiting vulnerabilities unique to cancer cells. By silencing the SL partner of a mutated tumor suppressor gene, our ASO aims to induce selective cancer cell death while sparing healthy cells that retain the intact pathway. Furthermore, by targeting a conserved region of the partner mRNA, our approach seeks to overcome tumor heterogeneity and maintain efficacy across diverse patient mutation profiles.

For MTAP, the loss of this metabolic enzyme leads to intracellular accumulation of methylthioadenosine (MTA), which selectively inhibits the arginine methyltransferase PRMT5. This inhibition creates a unique metabolic dependency on the PRMT5–MAT2A–RIOK1 axis, as PRMT5 activity relies on methyl donors produced by MAT2A and its interaction with RIOK1. Consequently, MTAP-deleted cells become highly sensitive to disruption of these partners, forming a well-defined synthetic lethality network that offers a biologically grounded opportunity for targeted therapy [75].

For TP53, one of the most frequently mutated tumor suppressors in lung cancer, mutations display remarkable variability, spanning over four orders of magnitude in transcriptional activity between mutant forms. These variants interact with genetic and epigenetic modifiers, shaping heterogeneous phenotypes and influencing treatment response [73]. Given the extensive number of potential SL partners reported in the literature [67][68], we performed a systematic evaluation of candidate pairs to identify those most consistent across datasets and biologically plausible for ASO-mediated targeting. This integrative analysis helped prioritize the most promising TP53-related synthetic lethal interactions to guide experimental validation.

Interaction Graph Network

We extended the TP53 search by ranking genes whose neighborhoods most resemble TP53 across multiple evidence channels, each reflecting a distinct biological signal: coexpression (genes varying together across datasets → functional linkage), experimental (directly assayed interactions or perturbation evidence), database (manually curated pathways/complexes), text mining (literature co-mentions suggesting related biology), neighborhood (conserved genomic/protein context), fusion (gene-fusion events implying coupling), cooccurrence (phylogenetic co-presence across species), and, where available, genetic high / genetic low (bins of genetic-interaction confidence). For every channel we compute per-gene similarity (Jaccard, Overlap) and neighborhood enrichment (hypergeometric p-value), so table headers like “Jaccard coexpression” or “Hypergeom experimental” read directly as: similarity/enrichment with TP53 based solely on that evidence stream.

  • Channel names in the CSV appear without double underscores (e.g., Jaccard coexpression, Hypergeom experimental).
  • Similarity metrics: \( \displaystyle \text{Jaccard}(g) = \frac{\lvert T \cap N(g)\rvert}{\lvert T \cup N(g)\rvert} \), \( \displaystyle \text{Overlap}(g) = \frac{\lvert T \cap N(g)\rvert}{\min\{\lvert T\rvert,\ \lvert N(g)\rvert\}} \), where \(T = N(\mathrm{TP53})\) and \(N(g)\) is the neighbor set of gene \(g\).
  • Neighborhood enrichment (hypergeometric survival): define population size \(N = \lvert U\rvert\) (all genes except TP53), TP53 degree \(K=\lvert T\rvert\), candidate degree \(d=\lvert N(g)\rvert\), and shared neighbors \(k=\lvert T \cap N(g)\rvert\). We report \( \displaystyle p = \Pr[X \ge k],\quad X \sim \mathrm{Hypergeom}(N,\ K,\ d) \), i.e. \( \displaystyle p = \sum_{x=k}^{\min(K,d)} \frac{\binom{K}{x}\,\binom{N-K}{d-x}}{\binom{N}{d}} \).
  • Filters: minimum degree removes tiny/noisy nodes; a high-percentile degree cutoff excludes extreme hubs before scoring.

About the Databases

  • BioGRID. A manually curated repository of physical and genetic interactions from the literature. We use human–human edges, restrict to physical interactions from Low Throughput studies, remove self-loops and incomplete symbol rows, and collapse duplicates by counting distinct PubMed IDs https://thebiogrid.org/.
  • STRING. A functional association network that aggregates diverse evidence channels (coexpression, experiments, database curation, text mining, etc.). We filter by combined_score, require the selected channel’s score > 0, map STRING IDs to gene symbols, and collapse [76] duplicate undirected edges.

Data & Filtering

  • Undirected edges are canonicalized (A<B) and merged. For BioGRID we keep a pmid_support count; for STRING we retain the max of numeric columns per edge.
  • Neighbor sets N(gene) are constructed from the filtered graph. We drop very small nodes (less than 10) and trim extreme hubs by percentile (99th percentile).

Ranking & Output

Our ranking methodoly was the following:

  • Rank by ascending Hypergeom p, then descending Jaccard, then degree
  • Top-K results are printed/saved. BioGRID script writes tp53_like_results.csv with Protein, Degree, SharedWithTP53, Jaccard, OverlapCoef, HypergeomP. The STRING single-channel routine returns a similarly structured table per evidence channel.
  • If a p-value was available from SLIDR tool it was also added [68][76]
Our ranking process lead us to choose genes RB1 and TP53BP1 as the synthetic lethality target partners.


Why We Chose RB1 and TP53BP1

Using our TP53-like neighborhood screen across multiple evidence channels, RB1 and TP53BP1 consistently showed strong overlap with the TP53 interaction neighborhood while passing hub filters. Both genes rank with extremely low neighborhood-enrichment p-values and non-trivial Jaccard overlap in several independent channels, indicating robust TP53-proximity rather than a single-dataset artifact.

  • Cross-channel robustness. Each gene was evaluated across 7 channels. RB1 was top-5% by enrichment p-value in 1 channel and top-5% by Jaccard in 1; TP53BP1 was top-5% by p-value in 2 channels and top-5% by Jaccard in 1. In ≥5/6 channels they share non-zero TP53 neighbors, demonstrating breadth rather than a single-mode signal.
  • Strong enrichment, not just big degree. Median hypergeometric p-values were ~6.7×10⁻²⁶ (RB1) and ~5.5×10⁻²⁷ (TP53BP1), with median Jaccard ~0.075 and ~0.061, respectively. Both passed degree percentile cutoffs, so significance isn’t driven by hubness.
  • Independent evidence modes. Both genes show very strong enrichment in coexpression and experimental channels (e.g., RB1: p≈4.4×10⁻¹⁰⁹ in experimental; TP53BP1: p≈8.8×10⁻¹⁴¹ in experimental), and RB1 also peaks in a genetic-evidence channel, increasing confidence through orthogonal data.
  • Biological complementarity. RB1 anchors cell-cycle checkpoint control, while TP53BP1 is a central DNA-damage response adaptor. Together they trace two major arms of TP53’s tumor-suppressor network (cell-cycle arrest and genome integrity), giving us mechanistic breadth rather than redundancy.
  • Actionable for downstream work. Because both are well-annotated tumor suppressors with rich literature and assays, they are practical anchors for hypothesis testing (e.g., synthetic-lethal contexts, pathway modulation, and validation readouts).

We generated the following CSV output for out analysis: Gene interaction CSV

Bottom line: RB1 and TP53BP1 are TP53-like across multiple evidence streams, statistically enriched yet non-hubby, and biologically complementary. That makes them high-confidence, interpretable choices for follow-up analyses and wet-lab validation.

Immune System Engagement & Antibody Optimization

Antibody Sequence Optimization

To design antibodies capable of efficient expression, immune activation, and effective delivery, we implemented a multi-stage computational and experimental workflow integrating codon optimization, structural modeling, and epitope selection.

The amino acid sequences of the heavy and light chains of cetuximab (the chosen antibody for the therapeutic complex) were retrieved and processed through two complementary optimization routes.

First, the optimized sequences were refined using the Evolutionary Stability Optimizer (ESO) [77], which detects and eliminates mutational and epigenetic hotspots to increase long-term genetic stability without altering protein structure.

Second, codon usage and non-coding regulatory elements were optimized using the MNDL Bio deep-learning platform [78], to enhance transcriptional and translational efficiency in CHO cells.
An illustration of this process is shown in Figure 49.

Antibody design and optimization workflow using MNDL Bio and ESO platforms
Figure 49. Antibody design and optimization workflow. The process begins with the original cetuximab heavy and light chain plasmids. In the first optimization route (middle row), the sequences are directly processed using the MNDL Bio platform to enhance expression in CHO cells, followed by assembly into expression plasmids and antibody production. In the second route (bottom row), sequences are first optimized using the Evolutionary Stability Optimizer (ESO) system, then further refined via MNDL Bio before assembly and CHO expression. Both optimized constructs, along with the unmodified version (top row), will be compared for antibody yield, binding efficiency, and internalization performance.

Epitope Integration Pipeline Overview

To guide the design of antibody–epitope constructs, we developed a multi-step in silico pipeline. First, frequent non–small cell lung cancer (NSCLC) mutations were identified from TCGA datasets to define clinically relevant targets [79].

Candidate neo-epitopes were then predicted by integrating proteasomal processing and MHC class I presentation algorithms, which were developed in collaboration with Tuller’s Lab member -Nicolas Lynn (explained in detail below).

In parallel, the antibody sequence space was analyzed using codon distribution and positional probability profiles [80], derived from large-scale analysis of millions of antibody sequences.

These data were processed to generate codon usage frequencies and position-specific scoring matrices (PSSMs), which enabled the identification of structurally permissive regions suitable for peptide insertion (explained in detail below, based on collaboration with Doron Armon’s, a Tuller Lab member).

From this process, multiple candidate constructs were generated and subsequently evaluated computationally for structural stability and the likelihood of efficient MHC-I presentation. An illustration of this pipeline is shown in Figure 50, alongside the complete pipeline in Figure 51.

Computational workflow for in silico design of antibody–epitope constructs
Figure 50. Computational workflow for in silico design of antibody–epitope constructs. Frequent mutations in non–small cell lung cancer (NSCLC) were extracted from TCGA datasets to identify clinically relevant neoepitopes already presented on tumor cells (A–B). Using the Epitope Selection and Immunogenicity Modeling pipeline, these tumor-derived epitopes were computationally characterized for MHC-I presentation and immunogenicity. Selected epitopes were then inserted into antibody variable regions (C) while preserving their native structure, as confirmed by PSSM-based compatibility analysis. The resulting antibody–epitope constructs were further evaluated in silico for proper proteasomal processing and MHC-I loading (D), ensuring that the engineered epitopes remain immunogenic upon degradation. Finally, synthetic lethality (SL)-induced stress promotes enhanced presentation of these epitopes near cell death, allowing cytotoxic T cells to recognize and eliminate tumor cells more efficiently, ultimately training the immune system for long-term anti-tumor immunity (E).
Computational workflow for in silico design of antibody–epitope constructs in lung cancer
Figure 51. Computational workflow for in silico design of antibody–epitope constructs. Frequent lung cancer mutations (LUAD, LUSC) are identified from TCGA datasets, and candidate neo-epitopes are predicted using computational pipelines for proteasomal processing and MHC presentation. Antibody sequences are then analyzed using codon distribution data to locate structurally permissive regions within the variable domains. Epitope insertion sites are selected to minimize disruption of the antibody’s sequence profile (PSSM), generating multiple candidate constructs. These constructs are then evaluated in silico to ensure both structural integrity and a high likelihood of MHC-I presentation.

Epitope Selection and Immunogenicity Modeling

To guide the design of antibody–epitope fusion constructs, we employed a multi-step in silico pipeline for predicting tumor-specific neoepitopes with high immunogenic potential. The workflow, developed in collaboration with Tuller’s Lab member Nicolas Lynn, integrates multiple layers of antigen processing and presentation modeling (Figure 52).

First, recurrent non–small cell lung cancer (NSCLC) mutations were extracted from TCGA datasets and processed through the OncoSplice algorithm [81] to identify mis-splicing events and their resulting aberrant transcripts. These transcripts were translated in silico to derive tumor-specific 12-mer peptide candidates absent from the normal human proteome (filtered via UniProt BLAST [82]).

Next, a computational ensemble of tools was used to model each key stage of the MHC-I antigen presentation pathway - from proteasomal cleavage to peptide-MHC binding and T-cell activation. NetChop [83], MHCFlurry [84], and NetCTLpan [85] predicted peptide processing and TAP transport; NetMHCpan4.1 [86] and MHCFlurry [84] estimated binding affinity to MHC-I alleles; and DeepImmuno [87] and IEDB (Calis) [88] evaluated immunogenicity likelihood through TCR–pMHC interaction features.

Each peptide received a composite score reflecting its processing efficiency, presentation probability, and immunogenic potential, enabling ranking of candidate epitopes. This pipeline demonstrated that several predicted neoepitopes undergo all major biological steps required for proteolytic cleavage, MHC-I loading, and T-cell recognition — confirming their potential as immunogenic targets for antibody conjugation and vaccine design.

Stepwise computational pipeline for neoepitope identification and immunogenicity prediction
Figure 52. Stepwise computational pipeline for neoepitope identification and immunogenicity prediction. Step 1–2: Mutational profiles from TCGA datasets are processed through OncoSplice to detect mis-splicing events and their transcript-level consequences. Step 3–5: Tumor-specific 12-mer peptides are generated from aberrant transcripts and filtered against the wild-type proteome to retain only cancer-specific sequences. Step 6: Each candidate peptide is analyzed across key stages of MHC-I presentation — proteasomal processing (NetChop, MHCFlurry, NetCTLpan), MHC binding (NetMHCpan4.1, MHCFlurry), and T-cell recognition (DeepImmuno, IEDB). Step 7: Scores from all modules are integrated into a unified framework that quantifies processing efficiency, presentation probability, and immunogenic potential, yielding a ranked list of tumor-selective epitopes suitable for antibody–epitope design.

Antibody Data Collection and Processing

To characterize the antibody sequence landscape and identify regions suitable for epitope insertion, large-scale antibody repertoires were analyzed using data from the Observed Antibody Space (OAS) database [89]. This resource compiles billions of cleaned and annotated antibody sequences from over 100 immune-repertoire sequencing studies, including approximately 4.1 million paired and 2.4 billion unpaired heavy and light chain sequences.

For this analysis, only human unpaired antibody sequences were selected. Using high-performance computing (HPC) resources and the Dask library for parallelized data management [90], the variable V(D)J regions of antibody sequences were extracted and translated into amino acids. Only productive and complete V(D)J sequences were retained while excluding immature B-cell data to ensure biological relevance.

Codon usage frequencies were computed for each data unit by counting codons within each amino acid group, excluding stop codons and ambiguous nucleotides. These frequencies were then aggregated across all human data units to generate global codon distribution profiles for both heavy and light chains.

To obtain a position-specific scoring matrix (PSSM) describing positional amino acid preferences, approximately 120,000 productive light-chain sequences were randomly sampled from the OAS dataset, ensuring complete V(D)J coverage [89], [91].

Multiple sequence alignment was performed using the ANARCI tool [92], which assigns standardized IMGT numbering to antibody variable regions [93]. The engineered light-chain sequence from our study was aligned according to this numbering scheme to ensure consistent positional mapping.

Amino acid frequencies were then converted into a PSSM using the Biopython package [94]. The computation used background frequencies derived from the overall amino acid distribution and applied pseudo-counts equal to 0.5× the background frequency to prevent zeros in the matrix.

This matrix captures positional conservation and diversity across antibody variable regions, serving as a statistical baseline to identify structurally permissive sites for epitope insertion while maintaining the native antibody fold. This analysis was performed in collaboration with Doron Armon (Tuller Lab). provided the foundational data supporting our computational design of antibody–epitope constructs.

Algorithm to Find Insertion Site in Antibody for Epitopes Using PSSM

Two matrices were utilized - a position-specific scoring matrix (PSSM) and a position weight matrix (PWM), generated from approximately 120,000 antibody variable-region sequences [80].

The PSSM contains position-dependent scores for each amino acid, defined as:

\[ M_{ij} = \log_{2}\!\left(\frac{p_{ij}}{q_i}\right) \tag{1} \]

Where \(p_{ij}\) represents the probability of amino acid \(i\) occuring at position \(j\), and \(q_i\) denotes the background probability of amino acid \(i\) as estimated from the multiple sequence alignment (MSA).

The PWM matrix contains the raw positional probabilities - \(p_ij\) for each amino acid at every position.

Based on a reference article provided in[95], which applied the BLOSUM50 substitution matrix, we grouped amino acids according to their biochemical similarity. This grouping aimed to reduce the dimensionality of the matrices while accounting for conservative substitutions that preserve structural and functional properties during epitope insertion into the antibody sequence.

Accordingly, amino acids were clustered into five groups:

\[ \begin{aligned} G1 &= \mathrm{L, V, I, M, C}\\ G2 &= \mathrm{A, S, G, T, P}\\ G3 &= \mathrm{F, Y, W}\\ G4 &= \mathrm{E, D, N, Q}\\ G5 &= \mathrm{K, R, H} \end{aligned} \]

We processed the PWM by aggregating amino acid probabilities according to the newly defined similarity groups, summing the individual probabilities within each group to obtain group-level position-specific probabilities.

To reconstruct the background probabilities \(q_i\) for each amino acid, we applied a reverse-engineering approach using the relationship between the PSSM and PWM matrices. Specifically we removed the logarithmic transformation from the PSSM by converting each entry elementwise as \(2^M\), and subsequently computed the background probabilities by dividing the corresponding PWM values by these transformed PSSM values.

The reconstructed background probabilities were then grouped by summing the values of all amino acids within each similarity group. Using these values and the corresponding grouped PWM, a new PSSM was calculated according to the earlier equation for \( M_{ij} \).

Once the set of neo-epitopes was predicted based on NSCLC mutations, each epitope was evaluated for insertion compatibility within the antibody variable region using the updated PSSM matrix. Using a sliding window equal in length to each epitope, we computed the cumulative PSSM score by summing the position-specific values for the amino acids (or their respective groups) within the window, thereby quantifying the positional compatibility between the epitope and each potential insertion site. CDR regions of the variable domain were excluded from the analysis due to their importance in receptor binding[96]. The formula used: \(\displaystyle \text{PSSM score}=\sum_{k=j}^{j+L} M_{ik}\ \text{where } i \text{ is the amino acid at position } k\)

All the epitopes were combined into a dataframe and the dataframe was sorted from the highest to the lowest score.

Finally, the top-scoring epitope-insertion pairs were re-evaluated using the MHC-presentation prediction model to confirm that the inserted epitopes retained high predicted processing and presentation potential following integration into the antibody scaffold.

Molecular Dynamics (MD)

Molecular Dynamics (MD) Simulations

Molecular Dynamics (MD) simulations offer a powerful computational approach to study the structural and dynamic behavior of biomolecules at atomic resolution. By numerically solving Newton’s equations of motion for atoms over time, MD enables prediction of how molecules fold, bind, and fluctuate in environments mimicking physiological conditions [97]. In the context of our project, MD simulations were used to evaluate the conformational dynamics and interaction profiles of ASOs with target mRNA duplexes. Extracted features from these simulations—such as hybridization energies, RMSD profiles, and flexibility metrics—were incorporated as a second step in our prediction model. These dynamic features go beyond static structure-based predictors, offering a more nuanced understanding of ASO efficacy.

MD - Methods

1. Initial Structure Generation

Initial models of the DNA:RNA duplexes were constructed using Discovery Studio. The extracted PDB files were then reformatted to comply with AMBER’s LEaP atom conventions. To model chemically modified antisense oligonucleotides, the [98] toolkit was employed to generate custom libraries for four phosphorothioate (PS)-modified residues: PS-modified Adenine (PSA), PS-modified Cytosine (PSC), PS-modified Guanine (PSG), and PS-modified Thymine (PST). Using CPPTRAJ [99], the duplex PDBs were updated according to the created libraries, and native residues were renamed to their modified counterparts.

Figure 53. Initial 3D structure of an ASO–RNA duplex after PDB construction and cleaning.

2. Parameterization and Topology Building

Topology and coordinate files were prepared using the LEaP program within AmberTools 2025. The OL15 [100][101] force field was applied for modified DNA strands and OL3[102] for RNA, while missing parameters were supplemented with the general AMBER force field (GAFF2) [103] and libraries generated by modXNA for the modified residues. Each PS-DNA:RNA duplex was solvated in a truncated octahedral box of the optimal 4-point rigid water model (OPC)[104], with a 10 Å buffer between the solute and the box boundary. The system was neutralized with Na⁺ and Cl⁻ counterions, and excess ions were introduced to achieve a physiological ionic strength of 150 mM.

Figure 54. ASO–RNA duplex after system setup in LEaP, solvated in a truncated octahedral box of water with added NaCl ions, prepared for simulation.

3. Simulation

All simulations were performed using the sander engine from AmberTools 2025. A single system underwent a two-stage energy minimization: first, with restraints on the DNA:RNA duplex to allow relaxation of the solvent and ions, followed by unrestrained minimization of the entire system. The structure was then gradually heated to 310 K over 100 ps under constant volume conditions, with harmonic restraints applied to the nucleic acids. Equilibration was performed for 500 picoseconds in the NPT ensemble (constant pressure and temperature) at 310 K and 1 atm, with reduced restraint forces. Finally, a 2 nanosecond unrestrained main production run was carried out. All simulations used a 2 fs timestep, a 10 Å cutoff for nonbonded interactions, and employed the Langevin thermostat for temperature control. The SHAKE algorithm [105] was applied to constrain all hydrogen atoms bound to heavy atoms. Trajectories were saved every 2 picoseconds for downstream analysis.

Figure 55. Representative MD trajectory of an ASO–RNA duplex during the production stage.

4. Parallelization and HPC Execution

To efficiently run the large set of duplex simulations, we employed parallel computing on the Power-SLURM and Power9 high-performance computing (HPC) servers at Tel Aviv University. Job submission and resource allocation were handled using PBS scripts, which automated the generation of input files, submission of tasks, and monitoring of progress across multiple experiments. Each simulation was parallelized across 64 CPU cores, allowing for efficient scaling and reduced execution time. On average, a full production run for a single duplex required approximately 15 hours of wall-clock time to complete. Depending on queue availability, between 10 and 30 simulations were executed in parallel, corresponding to an effective utilization of roughly 640–1920 CPU cores at peak load. This approach enabled us to manage the computationally demanding workload of simulating over 100 modified ASOs, ensuring both reproducibility and scalability across multiple systems.

5. Post-Simulation Analysis

For post-simulation analysis, trajectories and topology files were processed using CPPTRAJ. Solvent molecules and ions were removed, and the duplex was centered and imaged into the primary unit cell to ensure structural consistency across frames.
The following ֵanalysis was performed:

  • Root Mean Square Deviation (RMSD): Calculated to quantify the structural deviation over time. RMSD was computed for the entire DNA:RNA duplex, as well as specifically for the central base-paired region, to distinguish between end fluctuations and core stability.
  • Sugar Pucker Distributions: The conformations of the ribose/deoxyribose sugars were analyzed to monitor transitions between different pucker states, providing insight into local conformational changes.
  • Nucleic Acid Structural Parameters: Using the nastruct command [99], a comprehensive set of helicoidal and base-pair step descriptors were extracted, including shear, stretch, stagger, buckle, propeller twist, opening, hydrogen bonding, and groove dimensions (major/minor). These parameters provide a detailed picture of the duplex geometry and stability.
  • Binding Free Energy Calculations: To complement the structural descriptors, free energy estimations were performed using MMPBSA.py [106] with the Generalized Born (GB) implicit solvent model. Analysis was done using the modified DNA strand as ligand and the RNA strand as the receptor. The decomposition included: van der Waals interactions, electrostatic interactions in the gas phase, polar solvation energy (Generalized Born model), and nonpolar solvation energy (surface tension model). Gas-phase and solvation contributions were combined to yield total free energy estimates.
  • Entropy Calculations: To include entropic effects, the quasi-harmonic analysis option in MMPBSA.py was applied. This provided an estimate of the configurational entropy contribution (–TΔS), which was added to the enthalpic terms to yield a more complete description of binding free energy.

Together, this analysis captures both the global dynamics and fine-grained structural features of the duplex.
To illustrate the computational workflow, below is a schematic of the MD pipeline, from initial PDB preparation through parallelized simulations to final structural and energetic analyses.

Overview of the full pipeline used in this study.
Figure 56. Overview of the full pipeline used in this study. After initial duplex construction, the system was branched into multiple parallel MD setups (Duplex 1 … Duplex N), each undergoing minimization, heating, equilibration, and production simulations independently. The resulting trajectories were then collected and merged for post-processing, feature extraction, and comparative analysis. This structure ensured reproducibility across replicates and enabled robust evaluation of ASO–RNA duplex behavior.

MD - Results

Overview

Molecular dynamics (MD) simulations were performed for approximately one hundred antisense oligonucleotides (ASOs) in complex with their target RNA sequences. From these trajectories, we systematically extracted a rich set of structural and energetic descriptors, yielding more than five hundred features for each duplex. These features spanned local base-pair and base-step geometries, global stability measures such as RMSD, sugar puckering states, hydrogen bonding patterns, and binding free energy decompositions obtained from MMPBSA analysis. Together, this large-scale dataset provides a quantitative bridge between atomistic dynamics and experimentally measured inhibition values for the ASOs.

Feature Extraction

In total, over 500 structural and energetic descriptors were extracted from the molecular dynamics (MD) trajectories. These features capture both the geometry of the duplex and its thermodynamic behavior, providing a rich quantitative fingerprint of each ASO. Below we summarize the main categories:

  • Base-Pair Step Geometry
    These features include Shear, Stretch, Stagger, Buckle, Propeller, Opening, Twist, Roll, Tilt, Rise, Slide, Shift, Zeta Potential.
    These parameters describe how adjacent base pairs are positioned relative to one another. They capture distortions of the helical axis, unwinding or over-twisting, groove widening, and local flexibility. In addition to the averages of these features, standard deviations were calculated, as high variability may indicate conformational heterogeneity, which can influence hybridization dynamics.
    For these features, each duplex was numbered symmetrically from the 5′ to 3′ end of each strand. As depicted in Figure 57, in a 16-mer ASO, the first base of the ASO strand (position 1) pairs with the last base of the complementary RNA strand (position 32), the second with the 31st, and so on. Base-pair steps are therefore denoted in paired form (e.g., 1–32, 2–31), a convention used in the feature labels and in subsequent plots.
    Figure 57: Numbering scheme of the DNA:ASO duplex. In black indices - the ASO, and in blue indices - the RNA strand
  • Hydrogen Bonding and Base Pairing Stability
    These features include Hydrogen Bonding mean, Base Pairing mean, Major/Minor groove widths. These parameters track the number and strength of hydrogen bonds between paired bases and reflect how tightly and consistently the duplex is held together.
  • Global Structural Stability
    This includes both Root-mean-square deviation (RMSD) features and pucker states. RMSD captures overall deviation of the duplex from the initial structure, reflecting structural stability.Sugar pucker conformations (C2’ endo vs. C3’ endo) control the balance between DNA- and RNA-like geometry.
  • Energetic and Thermodynamic Descriptors
    These parameters include enthalpic (electrostatic, van der Waals, solvation) and entropic contributions. They summarize the balance between stability and flexibility of the ASO-RNA duplex and are used to assess binding strength and the free-energy landscape.

A complete glossary of all features with definitions and explanations, is available in the Supplementary Appendix

Representative Feature Visualization

While the full dataset includes hundreds of structural and energetic descriptors as discussed, the following figures visualize a few representative examples on a single duplex, illustrating how these features are extracted and what kinds of insights they may provide.

The root mean square deviation (RMSD) over the course of the simulation provides a measure of the overall structural stability of the system. As shown in Figure 58, the RMSD rises rapidly during initial equilibration but subsequently stabilizes, fluctuating around a plateau value. This stabilization indicates that the duplex adopts a persistent conformation within the timescale of the simulation, and no large-scale unfolding events are observed.

Figure 58: Root mean square deviation (RMSD) of a duplex structure as a function of simulation time.

Another representative feature is the twist angle distribution across base-pair steps. Twist is a fundamental descriptor of nucleic acid helical geometry, reflecting the degree of rotation between consecutive base pairs. In Figure 59, the average twist values are plotted for each base-pair step of the duplex. While most steps retain persistent close to canonical twist values, several positions exhibit some deviations, pointing perhaps to regions of local flexibility or distortion induced by the ASO sequence. Such local geometric heterogeneity can influence recognition, hybridization stability, and ultimately biological efficacy.

Figure 59: Distribution of twist angles across base-pair steps of the representative duplex. The x-axis shows symmetrically paired positions along the duplex. As illustrated in Figure 57, for example, “1–32” corresponds to the first base of the ASO paired with the 32nd base of the complementary RNA strand, “2–31” corresponds to the second base paired with the 31st, and so on. When two such labels are joined with a dash (e.g., “1–32 | 2–31”), this denotes the step between consecutive base pairs, here between the 1–32 and 2–31 pairs.

Energetic contributions were also decomposed to provide insight into the forces stabilizing or destabilizing the duplex. Figure 60 displays as example of the energy breakdown from a simulation: van der Waals, electrostatic, generalized Born solvation, and nonpolar surface terms of the complex. As expected, electrostatics in vacuum (EEL) contribute strongly destabilizing forces (positive value), whereas solvation (EGB) counterbalances these interactions with stabilizing contributions (negative value). van der Waals interactions provide additional stabilization, while the nonpolar surface term (ESURF) is minor. This decomposition highlights the delicate balance of competing forces that governs duplex stability.

Figure 60: Decomposition of molecular mechanics energies for the ASO-RNA duplex. The relative contributions of van der Waals, electrostatic, internal, and solvation terms highlight the dominant forces stabilizing the complex.

It is important to note that the RMSD profile, twist distribution, and energy decomposition shown here represent only a small subset of the structural and energetic descriptors extracted from the full molecular dynamics production runs. In total, hundreds of features like these were computed, capturing diverse aspects of duplex stability, geometry, and dynamics, many of which may provide additional insight into ASO behavior beyond the examples highlighted here.

Feature-Inhibition Correlations Across the Dataset

Beyond single-case visualizations, the complete dataset of ~ one hundred ASOs allows a direct quantitative comparison between simulation-derived features and experimental inhibition. In order to normalize the inhibition data, raw values were converted to a logarithmic scale (log inhibition). Correlation analysis across the full dataset revealed that many structural and energetic features exhibit significant associations with inhibition potency. As shown in Figure 61, several energy-based terms, display correlation coefficients around 0.6 with log inhibition, underscoring their predictive value.

Figure 61: Ranking of the top simulation-derived features by correlation strength with experimental inhibition values. Several structural and energetic features show high predictive value.

In addition to simple correlation, binned scatter plots were generated to better visualize the relationship between feature values and inhibition. In Figure 62, inhibition values are averaged within bins of feature values, revealing monotonic trends across multiple descriptors. These plots emphasize how shifts in specific structural or energetic parameters translate into consistent differences in biological activity, reinforcing the mechanistic link between MD-derived features and function.

Figure 62: Representative scatter plots of selected features versus inhibition values. Data is binned to emphasize trends, demonstrating clear positive or negative correlations between structural features and biological activity.

Finally, the interrelationships between the most predictive features were explored using a correlation heatmap. Figure 63 displays the correlation structure of the top features with log inhibition.

Figure 63: Heatmap of pairwise correlations between extracted features and inhibition values across all ~100 ASOs.

Together, these analyses demonstrate that atomistic simulations of ASO:RNA duplexes capture features that are strongly associated with biological inhibition. The ability to extract and quantify these descriptors provides not only mechanistic insight into ASO-RNA interactions but also a foundation for predictive modeling, where MD-derived features can be systematically leveraged to guide rational ASO design.

ASO Validation from Wet-Lab Experiment

To further validate the predictive value of the extracted molecular dynamics features, we simulated a subset of seven ASO-RNA duplexes that we experimentally tested in our wet-lab. These ASOs were designed by our main computational model against MALAT1 RNA, and their ability to reduce MALAT1 RNA levels was quantified in cell-based assays. By running MD simulations on the same ASO-RNA duplexes, we directly compared computationally derived structural and energetic features with experimentally observed RNA levels.

It should be noted, that these ASOs had been preselected for experimental validation based on our main computational model that ranked candidates among other things according to energetic descriptors - such as hybridization free energies, nearest-neighbor hybridization, and melting temperature heuristics. Consequently, this pre-selection enriched the experimental set for sequences that were already optimized with respect to energetic stability.

This background is important in interpreting our findings. When correlating the MD-derived features with experimental expression outcomes, we observed that energetic descriptors contributed very little additional predictive power within this small, preselected set. Instead, the features that showed the strongest correlations with experimental RNA levels of MALAT1 were predominantly structural backbone descriptors such as roll, rise, shear, or groove widths (more information about backbone descriptors is available in the Supplementary Appendix). This contrasts with the previous ~100 ASO dataset previously discussed, in which energetic features were among the strongest predictors. We hypothesize that because the experimental ASOs were already selected from the “best” energetic space, as a result, energetic features lost their discriminatory power, and non-energetic structural descriptors, absent from the initial selection pipeline, emerged as the differentiating factors.

To illustrate this point, we quantified the relative contribution of energetic versus non-energetic features in the experimental dataset. Among the top 20 features, none were energetic descriptors, whereas in the top 30 features, only 7% were energetic (Figure 64).

Figure 64: Relative contribution of energetic and structural features among the top predictors.

We next visualized selected correlations in detail. Scatter plots of representative structural features (Figure 65, top row) showed strong associations with expression, with clear linear trends and highly significant correlation coefficients. In contrast, scatter plots of the previously top selected energetic features (Figure 65, bottom row) demonstrated much weaker predictive ability in this experimental set, reinforcing the observation that energetic features provide limited discriminatory power once pre-selection has already been applied.

Figure 65: Representative correlations of structural versus energetic features with experimental expression outcomes. Scatter plots showing three highly predictive backbone features (top row) and three less predictive energetic features (bottom row), each plotted against MALAT1 RNA levels from the wet lab experiment.

Overall, these findings suggest a complementary role between our main computational model and as a second step - MD-derived descriptors. Together, they may form the basis for an integrated, multi-layered predictive pipeline for ASO efficacy.

Simulating Single-Stranded ASO Folding

While our current analyses focused on duplex formation between ASOs and their target RNA sequences, an important future step will be to systematically simulate the folding behavior of single-stranded ASOs. In biological systems, ASOs are not always present in their idealized linear conformation; instead, they may self-organize into secondary structures. These conformations can significantly influence hybridization kinetics, accessibility of specific motifs, and ultimately the efficiency of target inhibition.

MD simulations of self-folding events would allow us to capture this structural heterogeneity at atomic resolution. By monitoring how an initially linear ASO collapses into partially folded states, we can assess the degree of structural stability and identify sequence motifs prone to intramolecular interactions. Such simulations will provide complementary insights to duplex analyses: whereas duplex stability reflects the binding potential with the intended RNA target, single-stranded folding reveals the internal competition that may reduce the effective binding fraction in a cellular environment.

As an illustrative example, a preliminary simulation of a randomly selected ASO demonstrates how a single strand spontaneously folds into a compact structure over the course of 10 nanoseconds. Although not part of our experimental panel, this case highlights the importance of incorporating MD derived folding behavior into predictive models. In future work, integrating single-stranded folding features with duplex-derived parameters will enable a more comprehensive understanding of ASO activity and improve predictive power across diverse sequence contexts.

Figure 67. Molecular dynamics simulation of a single-stranded ASO folding into a compact secondary structure.

Summary of Findings

Across both the large-scale dataset of ~100 ASOs and the experimentally validated subset of seven MALAT1-targeting ASOs, our analyses revealed that structural and backbone-related features derived from molecular dynamics are strong predictors of biological efficacy. While energetic descriptors dominated the larger screening dataset, they lost discriminative power within the preselected experimental subset, underscoring how structural variability becomes prominent once energetic space is saturated. Importantly, our single-stranded folding simulations further highlight the intrinsic conformational heterogeneity that may impact target accessibility.

Taken together, these results suggest that molecular dynamics is most valuable as a second-stage prioritization tool within our general ASO discovery pipeline. The initial generation of candidate ASOs should rely on our broader predictive model, which is computationally efficient and captures many determinants. From this set, MD simulations can then refine prioritization by quantifying structural features, that strongly correlate with inhibitory activity. In this way, MD complements the general model: it adds a higher-resolution filter that helps focus experimental resources on the most promising ASOs.

Model Results

Our model integrates structural, thermodynamic, and sequence-based features of antisense oligonucleotides (ASOs) to predict their inhibition efficiency. It was trained using experimentally measured datasets across multiple human cell lines to identify parameters most predictive of ASO activity.

The model was trained with two primary goals: first, to produce results that generalize across multiple cell lines, and second, to improve the ranking of top-performing ASOs rather than solely maximizing correlation values.

During development, we adopted a cross-validation approach — training on three cell lines and testing on the fourth (the unseen cell line). For the final version, we trained on all available data to maximize the model’s learning capacity, which means there is no completely unseen test set for direct comparison.

This evaluation approach led to a clear improvement over our previous model:

Cell line Samples Spearman (Old) Spearman (New)
A431 6541 0.16 0.429
SK-MEL-28 3394 0.28 0.340
KARPAS-229 972 0.24 0.290
MM1.R 2836 0.40 0.484

Correlations improved across all unseen cell lines, most notably in A431 (a melanoma-derived line) — a particularly relevant result, as we hypothesize that its behavior may be closer to lung tissue than that of lymphoma or liver-derived lines. Moreover, the correlations became more consistent overall: the lowest correlation increased from 0.16 to 0.29. Since our focus is on top-performing ASOs, we next evaluated ranking-based metrics.

Definition of Metrics

The Normalized Discounted Cumulative Gain (NDCG)[107] measures how well the model ranks ASOs by their true experimental effectiveness. It assigns higher importance to correctly ranking the most active ASOs near the top, while penalizing errors that appear lower in the list. Formally, it is defined as:

\[ \mathrm{NDCG@k} = \frac{1}{\mathrm{IDCG@k}} \sum_{i=1}^{k} \frac{2^{rel_i} - 1}{\log_2(i + 1)} \]

where \( rel_i \) is the true relevance of the ASO at rank \( i \), and \( \mathrm{IDCG@k} \) is the maximum possible (ideal) DCG for the same set. Values close to 1 indicate nearly perfect ranking.

The Precision@k (P@k) metric measures the fraction of true top-performing ASOs among the top \( k \) predictions:

\[ \mathrm{P@k} = \frac{\text{Number of true top ASOs in top } k}{k} \]

While Precision focuses on how many of the top predictions are correct, NDCG accounts for their relative ordering, making both complementary in assessing ASO ranking performance.

Ranking Metrics

The following table compares the ranking performance of our ASO efficacy model across four human cell lines. Metrics include precision at 50 and 100 (P@50, P@100) and normalized discounted cumulative gain (NDCG) at various cutoffs (50, 100, 200). Higher values indicate better ranking performance.

Cell line Samples P@50 (Old) P@100 (Old) NDCG@200 (Old) NDCG@All (Old) P@50 (New) P@100 (New) NDCG@200 (New) NDCG@All (New)
A431 6541 0 0 0.5192 0.965 0.04 0.08 0.606 0.957
SK-MEL-28 3394 0.06 0.12 0.8428 0.981 0.10 0.23 0.84 0.976
KARPAS-229 972 0.02 0.17 0.8576 0.980 0.02 0.07 0.847 0.973
MM1.R 2836 0.12 0.17 0.7929 0.9826 0.02 0.07 0.728 0.972

The new model demonstrates a clear trade-off between peak performance on specific cell lines and robust generalization. While the previous model achieved higher NDCG scores on KARPAS-229 and MM1.R, it failed completely on A431, with a precision of zero. Our new model successfully remedies this by significantly improving performance on the weakest cell line (A431), establishing a much higher and more consistent baseline across all data. This shift aligns directly with our primary goal of creating a generalizable model that avoids catastrophic failures on unseen data, even if it means a slight reduction in ranking performance on previously well-predicted lines.

Global Correlation and Ranking

When training on the full dataset, the global correlation between predicted scores and experimentally measured inhibition is summarized below.

Predicted vs Volume-Corrected Inhibition (trained cell lines)
Figure 68. Predicted vs Volume-Corrected Inhibition (trained cell lines). Each point represents an ASO, with color intensity corresponding to the raw log inhibition (darker red = stronger activity). The x-axis shows the model’s predicted ranking (higher = better), while the y-axis indicates measured volume-corrected inhibition. The binned-median line (±10–90%) shows a clear positive trend, and the concentration of dark-red points in the upper-right confirms that high predicted scores align with strong inhibition (Pearson r = 0.521, Spearman ρ = 0.487, n = 13,743).

When ranking all ASOs by the model’s predicted scores and correlating them with volume-corrected inhibition, we observe a strong fit. Even more notably, the raw inhibition values themselves follow the same trend — as shown by the dense cluster of dark-red points in the upper-right corner — confirming that our model successfully identifies high-performing ASOs.

As for a direct measurement of top performers, we can take a look at the following graph:

Precision–at–k performance of the ASO ranking model
Figure 69. Precision–at–k performance of the ASO ranking model. The curve shows the fraction of true top-performing ASOs (top 2%) among the top-k model predictions. High precision at low k indicates that the model accurately ranks the most effective ASOs near the top, with performance gradually approaching background levels as k increases.

We can see that for the top \( k \), our model performs well — especially when \( k \) is small, which is the most relevant use case, as we aim to minimize cost and resources by ordering only a few top ASOs.

In practical terms, this means that if we select the top 50 ASOs predicted by the model, a large proportion are indeed among the experimentally strongest inhibitors. This enables substantial cost and time reduction when designing ASO libraries, as only a small subset needs to be synthesized and tested experimentally.

Future iterations of the model will incorporate additional features, including RNA–protein interaction profiles and LNA thermodynamic weights, and will be validated on newly measured datasets to confirm cross-cell-line generalization and robustness.

In summary, our updated model successfully meets its design objectives. It demonstrates superior and more consistent correlation across multiple cell lines, particularly improving performance on previously challenging data. Furthermore, its ability to accurately rank the most effective ASOs in the top percentiles, as shown by the precision-at-k analysis, confirms its utility as a powerful tool for prioritizing candidates for experimental validation. This balance of generalization and top-tier ranking performance marks a significant step forward in our ASO efficacy prediction capabilities.

Conclusion

Our model demonstrated its effectiveness in real experiments, successfully designing ASOs that outperformed the current MALAT1 benchmark used in the industry. This confirms that ASOdesigner’s predictions translate into meaningful experimental results and can greatly reduce the effort required for ASO development. Specifically, it supports the novel features we introduced in the our model. The details of this success and its experimental validation are described on our Results page, and our process is described in our Engineering page on our second and third iterations.

Moving forward, we plan to enhance the model by integrating features such as RNA-protein interaction profiles and improve the way we model chemical modification effects, and by validating it across additional datasets to ensure broader applicability. These next steps will further strengthen ASOdesigner as a reliable and versatile tool for advancing ASO-based applications such as cancer therapeutics.

References