Model

Our model comprises three main components that together form a multi-phase solution to ALS neurodegeneration. Find out more below!

ASO Creation

Our development and modification of a novel pipeline to generate antisense oligonucleotides for protein knockdown

ASO Predictive Model

Our biggest contribution to iGEM: an accurate machine learning model that predicts the efficacy of an ASO in knocking down a protein given its sequence

Aptamer Creation

Our development of a novel method to create protein-specific RNA aptamers to prevent TDP-43 aggregation

Introduction

Computational modeling is an essential forerunner for in vitro and in vivo research experimentation, because it provides a cost- and time-effective, scalable mechanism to test experimental designs and validate hypotheses. Our project centers around the development of many novel parts, such as antisense oligonucleotides (ASOs) and aptamers. Computational modeling was crucial for our research; we designed and scored many ASOs and aptamers in silico to quantify their efficacy in preventing ALS before testing our most-effective designs in vitro. We also developed a linear model predicting ASO-based protein knockdown based on sequence features using our wetlab data for future incorporation into this novel pipeline.

TAR DNA-binding protein 43 (TDP-43) aggregation is a pathological hallmark of amyotrophic lateral sclerosis (ALS; Suk et al., 2020). Evidence suggests that oxidative stress and high local concentrations of TDP-43 in stress granules (SGs) are prerequisites for TDP-43 aggregation (Dewey et al., 2012). One protein recruited to these stress granules is DAZAP1, which is also known to enhance the accumulation of RNA stress granules once they have begun to aggregate (An et al., 2021). Another protein responsible for the stability of stress granules is FAM98A, which is known to localize around stress granules, leading to worsened aggregation (Ozeki et al., 2019).

Design Pipeline

Our ASOs were designed to knock down key proteins (abbreviated DAZ, FAM, and SND) that are implicated in stress granule formation in neurodegenerative diseases such as ALS. We adapted a computational pipeline for ASO generation (Yeo, 2024); this pipeline incorporates sequence tiling, biophysical filtering, conservation scoring, specificity mapping, and chemical modifications to assign a numerical score for each candidate ASO. Throughout the pipeline, we parameterized specific scoring and development factors to create the most effective ASOs for our target protein knockdowns.

ASO Scoring

Scoring Factors

Factor Description
GC Content: Fraction of nucleotides in the sequence that are guanine (G) or cytosine (C). We altered the pipeline to use the range 0.4–0.6.

Since G and C pair via three hydrogen bonds (as opposed to A–T, which pairs using only two bonds), higher GC content corresponds to increased stability and resistance to denaturation. We therefore optimize GC content to filter out weak ASOs and excessively stable duplexes (which may increase risk of off-target binding).

\[ GC = \frac{\#\{G,C\}}{\text{sequence length}} \]

Homopolymers: Long runs of consecutive identical bases. We altered the pipeline to penalize “CCCCC” and “GGGGG.”

Homopolymers are harder to synthesize chemically and often form unusual secondary structures. This can lower binding specificity, so the model penalizes sequences containing these runs.
Conservation: phastCons score measuring how conserved a sequence is across species. We chose to use a conservation threshold of 0.9.

Highly conserved sequences are more likely to have increased functional importance and relevance, so our model penalizes sequences that are not highly conserved (Perez et al., 2025).

\[ PC = \frac{1}{L} \sum_i \text{phastCons}(i) \]

Secondary Structure: Prediction of the accessibility of the ASO's target binding region.

If the target binding region of the mRNA (to which the ASO is complementary) is predicted to be inaccessible due to RNA secondary structure (such as hairpin loops), the ASO will be ineffective in knocking down the target protein. Therefore, ASOs complementary to inaccessible mRNA regions are penalized.
Mapping Specificity: Score determining binding specificity to a single genomic locus. We limited the number of mismatches to at most 1.

Determination of number of genomic sites:

\[ \text{mappednum} = 1 + \big(\text{number of commas in mapping string}\big) \]

Each ASO sequence is aligned to the human genome and tested for binding (Perez et al., 2025). If the ASO binds at multiple locations on the genome, it is penalized for nonspecific binding (with the penalty increasing as specificity decreases).

\[ \text{Penalty}_{\text{map}} = 100 \times (\text{mappednum} - 1) \]

Table 1: Scoring Factors and Definitions
Antisense Oligonucleotide Scoring Algorithm
Fig 1: Scoring Algorithm Flowchart

Each scoring block subtracts an equal number of points from the original score of 1000 to ensure that each characteristic is weighted equally to the rest.
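As a concrete illustration, this subtract-from-1000 scheme can be sketched as follows. The function name and the per-block penalty of 200 points are our assumptions for the example, not the pipeline's actual code:

```python
# Illustrative sketch of the equal-weight, subtract-from-1000 scoring scheme.
# The per-block penalty (200) is an assumed value, not the pipeline's constant.

def score_aso(seq, phastcons=1.0, mappednum=1, accessible=True, block_penalty=200):
    """Start at 1000 and subtract a fixed penalty per failed check."""
    score = 1000
    gc = sum(b in "GC" for b in seq) / len(seq)
    if not 0.4 <= gc <= 0.6:              # GC-content window used in the pipeline
        score -= block_penalty
    if "CCCCC" in seq or "GGGGG" in seq:  # penalized homopolymers
        score -= block_penalty
    if phastcons < 0.9:                   # conservation threshold
        score -= block_penalty
    if not accessible:                    # target region predicted inaccessible
        score -= block_penalty
    score -= 100 * (mappednum - 1)        # mapping-specificity penalty
    return score
```

With these assumed weights, an ASO with in-range GC, no homopolymers, high conservation, an accessible target, and a unique genomic hit keeps the full 1000 points.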

Chemical Modifications

Since unmodified ASO sequences (20-base RNA sequences) are unlikely to survive harsh, nuclease-rich intracellular conditions, top-scoring candidate ASOs were transformed into gapmers to increase nuclease resistance and binding affinity. The pipeline uses 2′-O-methoxyethyl ribose (2MOEr) and 5-methylcytosine substitution (iMe-dC) modifications.

Modification Name Modification Process Purpose
2′-O-methoxyethyl ribose (2MOEr) Alteration of 2′ ribose sugar Increases binding strength and resistance to nucleases
iMe-dC (5-methylcytosine substitution) Replaces cytosine bases in “DNA gap” with methylated cytosine Improves binding affinity and reduces immune stimulation

We altered the pipeline to use a 5-10-5 gapmer design, which modifies positions 1–5 (5′ flank) with 2MOEr, the cytosines of positions 6–15 (DNA gap) with iMe-dC substitutions, and positions 16–20 (3′ flank) with 2MOEr. This allowed us to prepare physiologically relevant ASOs for in vitro testing.
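A minimal sketch of annotating a 20-mer under this 5-10-5 layout (assuming 1-based positions 1–5 / 6–15 / 16–20; the function and labels are illustrative, not the pipeline's code):

```python
# Sketch: annotate a 20-mer as a 5-10-5 gapmer. Wings get 2MOEr; cytosines in
# the DNA gap become 5-methylcytosine (iMe-dC). Labels are illustrative.

def gapmer_pattern(seq):
    assert len(seq) == 20, "5-10-5 design assumes a 20-base sequence"
    wing5, gap, wing3 = seq[:5], seq[5:15], seq[15:]
    mods = (["2MOEr"] * 5
            + ["iMe-dC" if b == "C" else "DNA" for b in gap]
            + ["2MOEr"] * 5)
    return wing5, gap, wing3, mods
```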

Conclusion and References

Without this model, we would have had to test hundreds of unfiltered ASOs experimentally. The model helped us focus on the most promising candidates and saved significant wet-lab time. Thus, our ASO design pipeline allowed us to develop and test the most effective ASOs for use in in vitro experimentation with limited resources. This pipeline can also be easily adapted by other scientists to target different disease genes, making it a reusable modeling framework.

References:

  1. Suk, T. R., & Rousseaux, M. W. C. (2020). The role of TDP-43 mislocalization in amyotrophic lateral sclerosis. Molecular Neurodegeneration, 15(45). https://doi.org/10.1186/s13024-020-00397-1
  2. Dewey, C. M., et al. (2012). TDP-43 aggregation in neurodegeneration: Are stress granules the key? Brain Research, 1462, 16–25. https://doi.org/10.1016/j.brainres.2012.02.032
  3. An, H., et al. (2021). Compositional analysis of ALS-linked stress granule-like structures reveals factors and cellular pathways dysregulated by mutant FUS under stress. bioRxiv. (Preprint.) https://doi.org/10.1101/2021.03.02.433611
  4. Ozeki, K., et al. (2019). FAM98A is localized to stress granules and associates with multiple stress granule-localized proteins. Molecular and Cellular Biochemistry, 451(1–2), 107–115. https://doi.org/10.1007/s11010-018-3397-6
  5. Yeo Lab. (n.d.). GitHub organization page. GitHub. Accessed 7 Oct. 2025. https://github.com/yeolab
  6. UCSC Genome Browser on Human (GRCh38/hg38), position chr19:1,435,544–1,435,764. University of California, Santa Cruz. Accessed 7 Oct. 2025. https://genome.ucsc.edu

Model Abstract

Predicting protein knockdown (%KD) from an antisense oligonucleotide (ASO) sequence is essential for prioritizing candidates before expensive wet-lab testing. We developed a biophysics-informed machine-learning regression model that integrates sequence composition (k-mers, GC, wing–gap GC), thermodynamics (RNA:DNA nearest-neighbor hybridization), and target accessibility (RNAplfold ED) via the net energy relationship ΔG_eff = ΔG_duplex − ED, then uses a minimal calibration strategy: one per-family k-shot offset to correct family-level bias. Across phases, wet-lab error on DAZ/FAM (n≈8) fell from 19.9 to 5.16 %KD as features and choice points were added: 19.9 → 11.5 → 9.25 → 8.83 → 10.0 → 6.67 → 5.16 %KD (feature labels: Seq → Thermo+map → Thermo+Access+Seq (sim) → Micro-tune (sim) → ΔG_eff + close family → ATXN2-anchored → ATXN2-anchored + k-shot). The final model’s before/after calibration plots show that k-shot converts a strong ranker into an accurate regressor: points realign to the identity line, MAE drops sharply, and the learned δ_g offsets match the observed family-specific bias.

We then ran multi-family evaluations (Phase 8/9) with LOSO/k-holdout across ATXN2, SOD1, DAZAP1, FAM98A, comparing HGBR/GBR/Ridge and optional isotonic_by_family. Best per-family MAEs were ATXN2 12.57, DAZAP1 5.09, FAM98A 11.25, SOD1 13.14 (weighted overall ≈9.88), and, because evaluation aggregates across folds, R² and Spearman’s ρ are reported alongside MAE. Error analysis confirms centered residuals post-calibration, narrow bootstrap MAE for DAZ, and variance-limited behavior for FAM98A (n=2). Feature probes (permutation importance, partial dependence) highlight a compact set of interpretable numeric signals—regional GC context, motif aggregates, and, when computed, ΔG_eff/accessibility—that drive predictions in directions consistent with ASO design principles (SantaLucia and Hicks, 2004; Lorenz et al., 2011; Bennett and Rigo, 2017).[1], [2], [3]

Together, these results provide a reproducible, small-n-robust recipe—physics-anchored features + anchored transfer + one-shot calibration—that achieves ~5–6%KD accuracy on held-out wet-lab families and generalizes across gene families with transparent error quantification, making the model readily usable by others with a single calibrator per new family.

Interactive Model

[Interactive model widget: enter an ASO sequence and choose k-shot and split settings; the widget reports the selected config, MAE, number of points, and Spearman’s ρ.]

Introduction

A central strategy in neurodegeneration therapeutics is lowering levels of disease-driving proteins. In amyotrophic lateral sclerosis (ALS), cytoplasmic TDP-43 aggregation and stress-granule biology are tightly linked to pathogenesis: TDP-43 inclusions often co-localize with stress-granule markers, with chronic stress promoting stress-granule assembly and TDP-43 mislocalization (Dewey et al., 2012; Ding et al., 2021; Jo et al., 2020).[4], [5], [6] Antisense oligonucleotides (ASOs) offer a direct, RNA-based method to reduce pathogenic proteins by binding target RNA and recruiting RNase H or by blocking translation/splicing in a steric manner (Crooke, 2018; Shen et al., 2018; Dhuri et al., 2020).[7], [8], [9] If we can predict protein knockdown (%KD) for a candidate ASO reliably, we can prioritize designs that are more likely to suppress protein expression, lessen stress-granule burden, and ultimately reduce TDP-43 aggregation and downstream neurotoxicity (Dewey et al., 2012; Jo et al., 2020).[4], [6]

Wet-lab screening is an essential but expensive and time-consuming element of therapeutic development. For a single gene alone, there are dozens to hundreds of plausible 20-mer windows, multiple chemical patterns (e.g., gapmer 5–10–5 with different sugar/backbone choices), and many dose/assay cellular contexts. Purely lab-based experimentation and iteration through these combinations is not feasible. Thus arises the need for a computational model that provides evidence-based, per-sequence %KD estimates to cut cost and time, allowing researchers to focus wet-lab experiments on the most promising designs. This makes it practical to explore more targets for therapeutic development (Crooke, 2018; Roberts et al., 2020).[7], [10] Modeling is also a reproducible way to unify heterogeneous literature and in-house measurements, leveraging and building upon previous findings during the design cycle (Schoch & Miller, 2017; Dhuri et al., 2020).[11], [9]

There is already meaningful research on ASO/siRNA activity prediction. Early approaches relied on rule sets and thermodynamic heuristics (e.g., composition, GC%, melting proxies, nearest-neighbor energies) and then moved to machine learning, from neural networks on sequence features (Chalk & Sonnhammer, 2002; Chalk, 2005)[12] to recent deep-learning designs that attempt end-to-end efficacy prediction (Hwang et al., 2024).[31] Design pipelines have improved allele selectivity and off-target assessment, especially for steric-block/splice-modulating ASOs (Kamola et al., 2015).[14] Reviews highlight steady advances in sequence chemistry (phosphorothioate backbones, 2′-sugar substitutions) and delivery (conjugates, formulations), and they emphasize that RNase-H gapmers remain central to neural therapeutics (Schoch & Miller, 2017; Shen et al., 2018; Roberts et al., 2020; Dhuri et al., 2020).[11], [8], [10], [9] At the same time, common limitations persist: small, heterogeneous datasets; mixed label reporting (exact %, ranges, qualitative terms); family/assay differences; and accessibility/off-target effects that complicate interpretation (Hagedorn et al., 2018; Yoshida et al., 2019).[15], [16]

Our framework addresses these practical limits in an ALS-relevant, small-n setting. First, we built a single, analysis-ready table by standardizing labels (exact %, range midpoints, High/Medium/Low → 80/55/20, clipped to [0,100]), harmonizing schema (ASO_ID, TargetGene, Sequence, %KD, Model, Chemistry, SourceDomain, Reference), and computing a compact, biologically grounded feature panel from the cleaned sequences. The panel balances interpretability and signal using composition (length, GC%, CpG density, max homopolymer), motifs (di-/tri-nucleotides), and gapmer regional GC content (the fraction of bases that are cytosine or guanine, which form stronger base pairs) for the 5–10–5 layout (5′ wing / DNA gap / 3′ wing). Where available, we also add a simple biophysics proxy

\[ \Delta G_{\text{eff}} \;=\; \Delta G_{\text{duplex}} \;-\; \mathrm{ED}, \]

to reflect the ineffectiveness of chemically sound ASOs that may be targeting structurally blocked mRNA regions (Crooke, 2018; Shen et al., 2018).[7], [8] This pairing of hybridization (nearest-neighbor duplex energetics) with accessibility (ensemble-defect/opening energy via RNA structure tools) connects features back to the biophysics of productive RNase-H binding (Shen et al., 2018; Lorenz et al., 2011).[8], [2]

Second, because the data is both family-structured and small, we favor a learning model that captures low-order non-linear interactions without overfitting. Therefore, we use a gradient-boosting regressor with shallow trees and slow learning, which naturally discovers interactions (e.g., motif × regional-GC content) and keeps importances and partial-dependence curves readable (Friedman, 2001; Hastie et al., 2009).[22], [23] To handle cross-family shifts, we add a family-aware calibration layer: a 1-shot offset from a single calibrator ASO in the new gene family. This removes systematic bias (e.g., families that run high/low) while preserving ranking; in our setting, this offset is more stable than monotone re-mapping when families are very small (Barlow et al., 1972).[24]

Third, we use pre-registered splits and metrics; stratified k-fold or Leave-One-Out Cross-Validation (LOOCV) when folds are small; mean absolute error (MAE) as the headline metric with \(R^2\) and Spearman’s \(\rho\) for context; and testing on true out-of-distribution (OOD) families (e.g., DAZAP1, FAM98A) to keep evaluation honest and effectively investigate if rules learned on SOD1/ATXN2 (sister families to the target proteins DAZAP1 and FAM98A) transfer to new genes. We also cache a versioned feature artifact so sweeps and re-tests are computationally streamlined and fair: all model variants use the same inputs; when upstream cleaning or tools change, we bump the feature-version tag and rebuild (Hastie et al., 2009).[23]

Finally, this model represents a shared, generalizable approach to computation-based wet-lab boosting. The pipeline combines current literature-based findings—interpretable sequence features; optional thermodynamics/accessibility; small-n-appropriate boosting and calibration; and careful validation— in a form others can adopt for different targets or diseases. The field is advancing quickly, from pharmacology reviews to new ASO design frameworks (Hwang et al., 2024; Liu et al., 2025; Horberg et al., 2024).[31], [30], [13] Our contribution is a measured, ALS-aware framework that connects data standardization to interpretable modeling, is realistic about small-n constraints, and is built to transfer across families—precisely where many generic models struggle (Hastie et al., 2009; Hagedorn et al., 2018; Yoshida et al., 2019).[23], [15], [16]

Methods

Biophysics Behind the Model

Antisense oligonucleotides (ASOs) reduce protein levels by binding to complementary mRNA strands that encode targeted proteins. The resulting RNA:DNA duplexes are then cleaved by RNase H1, breaking the target mRNA strand. Once the mRNA has been cleaved, the protein is no longer translated, lowering its concentration in the cell. The effectiveness of this process is reflected in protein knockdown (%KD), which depends on three interconnected factors: ASO binding strength, target mRNA site accessibility, and chemical structure.

Biophysical determinants of ASO-mediated protein knockdown
Fig 1: Biophysical determinants of ASO-mediated protein knockdown (Dhuri et al., 2020)
1. Duplex Stability and Binding Strength

The stability of an ASO–RNA hybrid is captured by the Gibbs free energy of duplex formation:

\[ \Delta G^\circ_{\text{duplex}} = \sum \Delta G^\circ_{\text{stack}}(b_s,b_{s+1}; \bar b_s,\bar b_{s+1}) + \Delta G^\circ_{\text{init}} + \Delta G^\circ_{\text{dangling}} + \Delta G^\circ_{\text{terminal}} \]

where stacked nearest-neighbor terms are parameterized from calorimetric measurements of RNA/DNA interactions (Sugimoto et al. 368–370; Mathews and Turner 90–92). A more negative \(\Delta G^\circ_{\text{duplex}}\) indicates stronger binding. The thermodynamic relationship [17], [18]

\[ \Delta G^\circ = RT \ln K_D \]

connects binding free energy to the dissociation constant \(K_D\), providing a direct physical bridge between calculated energetics and molecular affinity (Alberty and Silbey 143). Occupancy of the target transcript increases as \(K_D\) decreases (longer ASO residence time), raising the probability of mRNA cleavage. [19]

2. Target Accessibility

RNA strands are often found in complex 3D structures within the cell, and binding requires partial unfolding of the target. Accessibility is quantified as the “opening energy” derived from per-nucleotide unpaired probabilities \(p_{\text{unpaired}}(i)\):

\[ E_D(i) = -RT\ln p_{\text{unpaired}}(i), \qquad \overline{E_D} = \frac{1}{N}\sum_{i=1}^N E_D(i) \]

In our model, these probabilities are computed from the RNA secondary structure ensemble (Bernhart et al. 284). Accessible regions incur smaller energetic penalties, making them more favorable targets. [20]
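A small sketch of this computation, given per-nucleotide unpaired probabilities such as those produced by RNAplfold (the gas-constant units and 37 °C temperature are our assumptions for the example):

```python
import math

# Assumed constants: gas constant in kcal/(mol*K) and physiological temperature.
R, T = 1.987e-3, 310.15

def opening_energy(p_unpaired):
    """Per-site E_D(i) = -RT ln p_unpaired(i), plus the mean over the window."""
    ed = [-R * T * math.log(p) for p in p_unpaired]
    return ed, sum(ed) / len(ed)
```

A fully unpaired (perfectly accessible) window incurs zero opening energy; less accessible sites incur a growing positive penalty.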

3. Effective Binding Energy

To capture both stability and accessibility, we define an effective free energy:

\[ \Delta G^\circ_{\text{eff}} = \Delta G^\circ_{\text{duplex}} - \overline{E_D} \]

This penalizes inaccessible sites and better reflects whether an ASO will achieve functional knockdown (%KD) in cells (Mückstein et al.). [21]

4. Biological Implications

Since knockdown efficacy depends on the fraction of target mRNA strands bound by ASOs (occupancy), thermodynamics and accessibility together explain why some ASOs achieve high knockdown while others fail. Occupancy follows a Langmuir-style relationship:

\[ \theta = \frac{[A]}{[A] + K_D} \]

where \([A]\) is ASO concentration. Thus, lower \(K_D\) (more negative \(\Delta G\)) yields higher occupancy, improving RNase H–mediated cleavage and protein knockdown (Mathews and Turner 90). [18]
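Combining the two relationships above, occupancy can be computed directly from binding free energy; a sketch assuming kcal/mol energies and molar ASO concentrations:

```python
import math

# Assumed constants: gas constant in kcal/(mol*K) and 37 C.
R, T = 1.987e-3, 310.15

def occupancy(delta_g, aso_conc):
    """Langmuir occupancy theta = [A] / ([A] + K_D), with K_D = exp(dG/(RT))."""
    kd = math.exp(delta_g / (R * T))  # from dG = RT ln K_D
    return aso_conc / (aso_conc + kd)
```

As expected, a more negative ΔG (smaller K_D) or a higher ASO concentration yields higher occupancy.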

Feature Engineering Framework

The goal of feature engineering in this model is to transform the biological and chemical principles underlying ASO action into quantitative, computational descriptors that predict protein knockdown (%KD). In vitro and in vivo knockdown is achieved from the combined effects of ASO sequence composition, duplex stability, RNA accessibility, and chemical structure. Each feature group was chosen because it influences whether an ASO will successfully engage its target transcript, recruit RNase H, and thereby reduce protein expression, as explained above.

1. Sequence-Level Features (Structural)
Basic composition (length, GC%, CpG density, homopolymer runs):

These simple descriptors reflect the inherent stability and specificity of oligonucleotides. For example, higher GC (guanine-cytosine) content stabilizes the ASO:RNA duplex, but excessive GC can promote off-target binding or self-structure (see ASO Design Pipeline, Sugimoto et al. 11211–11216). CpG content (the density of cytosines directly followed by a guanine along the strand, not to be confused with the C–G base pair) and homopolymer runs (long stretches of a single base) influence immune recognition and pharmacological tolerability, indirectly affecting achievable knockdown in cells (Crooke et al. 427–449).[17], [7]
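A minimal sketch of these composition descriptors (the exact definitions in our pipeline may differ slightly):

```python
import re

def composition_features(seq):
    """Length, GC fraction, CpG dinucleotide density, longest homopolymer run."""
    seq = seq.upper()
    gc = sum(b in "GC" for b in seq) / len(seq)
    cpg = seq.count("CG") / len(seq)  # non-overlapping CpG dinucleotides
    longest = max(len(m.group()) for m in re.finditer(r"(.)\1*", seq))
    return {"length": len(seq), "gc": gc,
            "cpg_density": cpg, "max_homopolymer": longest}
```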

Regional GC fractions (wing-gap-wing design):

Many ASOs are designed with “gapmer” structure. These ASOs employ a 5-10-5 design, with modified wings (5) flanking a DNA core (10). Gapmer ASOs often exhibit higher stability and binding strength. Thus, gapmer ASOs were associated with more efficacious binding in the model. Encoding GC fractions by region reflects how wing composition stabilizes binding, while the DNA gap mediates RNase H recruitment (Crooke et al. 430).[7] Protein knockdown depends on this balance: wings ensure persistent binding, while the gap determines cleavage efficiency.
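Encoding the regional GC fractions for a 5-10-5 20-mer can be sketched as (positions assumed 1–5 / 6–15 / 16–20):

```python
def regional_gc(seq):
    """GC fraction in the 5' wing, DNA gap, and 3' wing of a 5-10-5 20-mer."""
    assert len(seq) == 20, "5-10-5 layout assumes a 20-base sequence"
    frac = lambda s: sum(b in "GC" for b in s) / len(s)
    return frac(seq[:5]), frac(seq[5:15]), frac(seq[15:])
```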

K-mer frequencies (di- and trinucleotides):

Short motifs capture local sequence contexts that correlate with binding efficiency and accessibility. Certain trinucleotides (e.g., CGG, GGC) are known to influence secondary structure formation and may affect how efficiently an ASO engages its site (Kamola et al. 3066).[14] By including these, the model learns sequence motifs associated with higher or lower knockdown rates.

2. Thermodynamic Features
Duplex free energy (ΔGduplex, Tm):

Duplex stability, computed using nearest-neighbor thermodynamic models, is a first-order determinant of whether an ASO remains bound long enough for RNase H cleavage (Sugimoto et al. 11212; Mathews and Turner 191).[17], [18] Stronger duplexes (more negative ΔG) correlate with higher %KD, provided binding does not occur at inaccessible sites.

Mismatch penalties:

ASOs may tolerate single mismatches while retaining partial activity. Incorporating alignment penalties ensures the model accounts for graded effects on knockdown, rather than treating binding as all-or-none (Mathews and Turner 192). [18]

3. Accessibility Features
Unpaired probabilities and opening energy (ED):

RNA secondary structure can block ASO access to target sites. Opening energy, computed through RNAplfold (Bernhart et al. 614),[20] quantifies the cost of exposing the binding site. High accessibility (low ED) increases the likelihood of productive binding, directly enhancing knockdown efficiency.

Effective binding free energy (ΔGeff):

Defined as

\[ \Delta G^\circ_{\text{eff}} = \Delta G^\circ_{\text{duplex}} - \overline{E_D}, \]

this metric integrates stability and accessibility (Mückstein et al. 236).[21] ΔGeff better reflects the real probability of ASO–RNA engagement in cells, and thus correlates more strongly with %KD than duplex energy alone.

4. Biochemistry-Aware Features
Chemical modifications:

Backbone (phosphorothioate) and sugar (MOE, LNA, cEt) modifications to the structure of ASOs increase nuclease resistance and binding affinity, altering intracellular stability and activity (Crooke et al. 433).[7] Many ASOs used for training and testing in our model carry 2′-O-methoxyethyl ribose and 5-methylcytosine substitution modifications. Categorical encoding of chemistry allowed the model to adjust predictions according to these biochemical determinants of knockdown.

Central Role in Protein Knockdown
  • Sequence features capture motifs and architectures influencing cleavage efficiency.
  • Thermodynamic features determine whether binding is strong enough to sustain RNase H activity.
  • Accessibility features decide whether binding can occur in the first place.
  • Biochemistry-aware features reflect the chemical optimizations that enable durable in-cell knockdown.
Feature engineering framework for predicting ASO-mediated protein knockdown
Figure 2: Feature engineering framework for predicting ASO-mediated protein knockdown.

Machine Learning Framework

The goal of the modeling pipeline was to predict protein knockdown (%KD) for a given ASO and target proteins. Predicting %KD requires integrating heterogeneous biological descriptors into a computational model that balances accuracy with interpretability. Hence, we frame %KD prediction as supervised regression from ASO sequence plus biophysics- and chemistry-aware features. We implemented a systematic, phased progression pipeline that builds from a baseline sequence-only model to a final calibrated, interpretable, gradient-boosting regressor that integrates sequence, thermodynamics, accessibility, and biochemical descriptors. The framework is designed to be robust in a small-n, high-dimension regime, where overfitting is a significant risk, and it provides biologically interpretable predictions suitable for translational application.

Data Collection and Standardization

We built one clean table for ASO-mediated protein knockdown (%KD) by combining results from papers/patents and our wet-lab CSVs. Each row keeps what we need for learning and traceability: ASO_ID, TargetGene, unmodified Sequence (5′→3′, letters only for features), Observed_%KD (with optional KD_range_min/max and a binned label), brief Chemistry notes, Model (cell/animal/human), SourceDomain (Literature/WetLab), and a Reference.

Because sources report %KD differently, we use one simple mapping: exact values stay as is; ranges become the midpoint; “High/Medium/Low” → 80/55/20. Afterwards, we clip to [0,100]. For plots and splits, we use the same categorizations: High ≥70, Medium 40–69, Low <40. (See Supplement 1, Fig. 1 for the overall label distribution and Fig. 2B for per-target H/M/L counts.)
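This mapping can be sketched as follows (a simplified stand-in for our cleaning code; function names are illustrative):

```python
def harmonize_kd(value):
    """Exact values pass through; (lo, hi) ranges become midpoints;
    High/Medium/Low -> 80/55/20; everything is clipped to [0, 100]."""
    bins = {"High": 80, "Medium": 55, "Low": 20}
    if isinstance(value, str):
        kd = bins[value]
    elif isinstance(value, tuple):
        kd = (value[0] + value[1]) / 2
    else:
        kd = value
    return min(100.0, max(0.0, float(kd)))

def kd_bin(kd):
    """Categorization used for plots and splits: High >=70, Medium 40-69, Low <40."""
    return "High" if kd >= 70 else ("Medium" if kd >= 40 else "Low")
```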

Data pipeline: harmonize labels, build schema, engineer features
Fig. 3: Data Pipeline: Harmonize Labels, Build Schema, Engineer Features

We standardized names (genes, chemistry), removed non-ACGT characters before feature work (kept modification strings in metadata), and checked for outliers. We then computed a compact set of sequence features: composition (length, GC%, CpG count/density, max homopolymer), motifs (di- and tri-nucleotides), gapmer regional GC for 5–10–5 (5′ wing / DNA gap / 3′ wing), plus optional thermo/accessibility proxies (e.g., \(\Delta G_{\text{eff}}=\Delta G_{\text{duplex}}-\mathrm{ED}\)). (See Supplement 1, Fig. 3 for GC vs %KD, Fig. 4 for dinucleotides, and Fig. 5A–C for regional GC.)

The dataset spans the full %KD range and differs by family. For example, SOD1 and FUS show wider spread, ATXN2 trends higher in our literature set; thus, we subsequently used family-aware evaluation. Global GC vs %KD is weak, which is why we include richer features (motifs, structure, accessibility). Correlations and PCA support this: motif/composition clusters appear, and bins separate only partially. (See Supplement 1, Fig. 6 for the correlation heatmap and Fig. 7 for PCA; Fig. 8 shows optional 3-mer enrichment.)

For modeling, SOD1 and ATXN2 are our anchor literature families; DAZ1/DAZAP1 and FAM98A (wet-lab) are our out-of-distribution families for transfer and optional k-shot calibration—testing whether rules learned on well-documented families carry to new genes. Full details, methods, and all plots are in Supplement 1 — Data Collection & Standardization (see Figs. 1–8).

Model Architecture
1. Prediction task and data regime

We predict protein knockdown (%KD) for an ASO–gene pair as a supervised regression problem. Inputs are the outputs from the Data Collection & Standardization step (clean 5′→3′ sequence and derived descriptors), and the response is Observed_%KD, harmonized and clipped to [0,100] as described earlier. The dataset is small-n, family-structured, and biologically heterogeneous, so our learner must capture low-order non-linear interactions without overfitting while remaining readable to domain scientists. We increased complexity only when it fixed an observed failure mode: starting with sequence composition and motif features; adding 5–10–5 regional GC for the gapmer design; optionally including thermodynamics/accessibility proxies; and finally using a gradient-boosted tree regressor as the core learner (Crooke et al., 2017)[7].

2. Feature engineering (“compact panel”) and biological rationale

We use a compact, biologically grounded panel to control variance while preserving signal. Composition covers length, GC%, CpG count/density, and maximal homopolymer; motifs include di- and (optionally) tri-nucleotide counts. To reflect RNase H1 gapmer practice, we compute regional GC within the 5–10–5 layout (5′ wing / DNA gap / 3′ wing), which moderates activity across families. Where available, we add an accessibility-aware binding proxy,

\[ \Delta G_{\mathrm{eff}} \;=\; \Delta G_{\mathrm{duplex}} \;-\; \mathrm{ED}, \]

so sequences that look good by composition but are structurally blocked are down-weighted. Keeping the panel small is deliberate in a small-n model: it reduces dimensionality, helps avoid overfitting, and keeps error analysis tied to interpretable biology (Crooke et al., 2017)[7]. The complete mapping of feature → model variable → defaults → tool is listed in Supplement 3 (Feature Table).

3. Core learner and model form

We adopt gradient boosting with squared-error loss and shallow regression trees as the base learning scheme. After M stages, the fitted function is

\[ \hat{f}_M(x)\;=\;\sum_{m=1}^{M} \nu\,h_m(x), \]

where each \(h_m\) is a depth-limited tree fit to the current residuals, and \(\nu\in(0,1]\) is a learning-rate shrinkage that stabilizes training in small samples (Friedman, 2001; Hastie et al., 2009)[22], [23]. We compared this against baselines: ridge/lasso were too rigid for interaction-heavy signals and suffered collinearity even with polynomial expansions (Hastie et al., 2009)[23]; SVR (RBF) fit non-linearities but was sensitive to kernel/scale choices in small-n and produced less stable family-wise rankings (Smola & Schölkopf, 2004)[25]; an XGBoost-style booster matched accuracy only under heavier regularization and tighter tuning—once constrained, it did not offer practical gains over classical gradient boosting while adding complexity (Chen & Guestrin, 2016)[26]. Shallow, slow gradient boosting gave the best balance of flexibility, stability, and interpretability.
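To make the stage-wise form \(\hat f_M(x)=\sum_m \nu\,h_m(x)\) concrete, here is a toy pure-Python booster that fits depth-1 stumps to the current residuals with shrinkage. It is an illustration of the scheme on one feature, not our actual model (which uses a full gradient-boosting regressor on the complete feature panel):

```python
# Toy stage-wise boosting: each stage fits a single-threshold stump to the
# current residuals and is added with learning rate nu. Illustrative only.

def fit_stump(x, r):
    """Best single-split stump (threshold + two leaf means) for residuals r."""
    best = None
    for t in sorted(set(x)):
        left = [ri for xi, ri in zip(x, r) if xi <= t]
        right = [ri for xi, ri in zip(x, r) if xi > t] or [0.0]
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        sse = sum((ri - (lm if xi <= t else rm)) ** 2 for xi, ri in zip(x, r))
        if best is None or sse < best[0]:
            best = (sse, t, lm, rm)
    _, t, lm, rm = best
    return lambda xi, t=t, lm=lm, rm=rm: lm if xi <= t else rm

def boost(x, y, n_stages=200, nu=0.1):
    """f_M(x) = sum_m nu * h_m(x), each h_m fit to the current residuals."""
    stumps, pred = [], [0.0] * len(x)
    for _ in range(n_stages):
        r = [yi - pi for yi, pi in zip(y, pred)]
        h = fit_stump(x, r)
        stumps.append(h)
        pred = [pi + nu * h(xi) for pi, xi in zip(pred, x)]
    return lambda xq: sum(nu * h(xq) for h in stumps)
```

Shrinkage (small ν) means each stage corrects only a fraction of the remaining residual, which is what stabilizes training in small samples.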

4. Training procedure, hyperparameters, and evaluation

We stratify splits by %KD bin and enforce family-aware partitions so no family leaks into validation. We then run small sweeps on a low-variance frontier: max_depth = 4, learning_rate ≈ 0.01, n_estimators ≈ 1800, with min_samples_leaf = 1 and min_samples_split = 2. We explore max_depth ∈ {2,3,4,5}, learning_rate ∈ {0.003, 0.01, 0.03}, and grow up to ~2000 trees with early stopping when validation error plateaus. MAE is the headline metric (directly interpretable in %KD); we also report \(R^2\) for explained variance and Spearman’s \(\rho\) for rank fidelity. Deeper trees raised variance, and larger learning rates destabilized training; the selected configuration balanced bias–variance across seeds and family-aware splits (Friedman, 2001; Hastie et al., 2009)[22], [23].
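The plateau-based early stopping used in these sweeps can be sketched as a small helper that scans the staged validation errors; the patience and tolerance values here are illustrative defaults, not the exact settings of our runs.

```python
def best_stage(val_errors, patience=50, tol=1e-4):
    """Return (stage, error) at which validation error plateaued.

    Scans per-stage validation errors (e.g., MAE after each boosting round)
    and stops once no improvement larger than tol has occurred for
    `patience` consecutive stages.
    """
    best_i, best_e = 0, val_errors[0]
    for i, e in enumerate(val_errors):
        if e < best_e - tol:
            best_i, best_e = i, e
        elif i - best_i >= patience:
            break  # plateau detected: stop early
    return best_i, best_e
```

In a sweep, this is applied per configuration (depth, learning rate), and the configuration with the lowest plateaued validation MAE is kept.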

5. Calibration, interpretability, and limits

Even with a tuned booster, we saw family-level offsets. Rather than overfit the core learner, we apply a k-shot, family-aware additive offset learned from a single calibrator ASO:

\[ \tilde{y}_i \;=\; \hat{f}(x_i)\;+\;\delta_{g(i)}, \]

where \(g(i)\) indexes the target family. This removes family bias while preserving within-family orderings, akin to a random-intercept adjustment. Isotonic calibration helped only when families had enough samples; k-shot offsets were more stable and cost-effective in small-n (Barlow et al., 1972)[24]. Shallow trees aid interpretability (early splits often on regional GC, then motif aggregates). Limits mirror the data regime: very small families may retain bias without a calibrator; missing accessibility proxies raise uncertainty; and deeper trees/heavier kernels can improve in-sample fit while hurting cross-family generalization (Friedman, 2001; Hastie et al., 2009; Crooke et al., 2017)[22], [23], [7].
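A minimal sketch of the k-shot offset (function and variable names are ours): given one calibrator ASO per family, \(\delta_g\) is simply the signed error of the core model on that calibrator.

```python
def fit_family_offsets(f_hat, calibrators):
    """calibrators: {family: (x_calibrator, observed_kd)}.
    delta_g = observed - predicted on the single calibrator ASO."""
    return {g: y - f_hat(x) for g, (x, y) in calibrators.items()}

def predict_calibrated(f_hat, x, family, offsets):
    """y_tilde = f_hat(x) + delta_g; unseen families fall back to the raw prediction."""
    return f_hat(x) + offsets.get(family, 0.0)
```

Because the offset is a constant per family, within-family orderings are untouched; only the intercept moves.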

6. Cached feature artifacts for sweeps and re-tests

To keep sweeps fast and fair, we created a versioned cached feature artifact from the standardized outputs. After label cleaning and feature computation, we save a table keyed by (ASO_ID, TargetGene, sequence hash, feature-version tag). All model variants in a sweep load the same cache, so any change in MAE/\(R^2\)/\(\rho\) comes from the model, not feature recomputation. When upstream rules or tools change (e.g., an update in \(\Delta G\) or ED), we bump the feature-version tag and rebuild. This preserves reproducibility while keeping iteration fast (see Supplement 1 for data/labels and Supplement 3 for the feature table).
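One possible shape for such a cache (the file layout and helper names here are illustrative, not our exact artifact format):

```python
import hashlib
import json
import pathlib

FEATURE_VERSION = "v3"  # bump when upstream rules/tools change (e.g., dG or ED update)

def cache_key(aso_id, target_gene, sequence, version=FEATURE_VERSION):
    """Key by (ASO_ID, TargetGene, sequence hash, feature-version tag)."""
    seq_hash = hashlib.sha256(sequence.encode()).hexdigest()[:12]
    return f"{aso_id}|{target_gene}|{seq_hash}|{version}"

def load_or_compute(cache_path, aso_id, gene, seq, compute_fn):
    """All model variants in a sweep read the same cache, so any change in
    MAE/R^2/rho comes from the model, not from feature recomputation."""
    path = pathlib.Path(cache_path)
    cache = json.loads(path.read_text()) if path.exists() else {}
    key = cache_key(aso_id, gene, seq)
    if key not in cache:
        cache[key] = compute_fn(seq)
        path.write_text(json.dumps(cache))
    return cache[key]
```

Bumping `FEATURE_VERSION` changes every key, which forces a clean rebuild without deleting old artifacts.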

Fig. 4: End-to-End Pipeline for ASO Knockdown Prediction
Validation Strategy

Our goal is to measure how well the model predicts Observed_%KD for new ASOs and, importantly, for new gene families. Because the dataset is small and heterogeneous, we combine within-distribution cross-validation with a true out-of-distribution (OOD) check on wet-lab families. We defined all splits and metrics before model selection to avoid optimistic bias (Hastie et al., 2009)[23].

Cross-validation (within-distribution)

We use stratified k-fold CV on the literature set so that each fold preserves the High/Medium/Low label mix (Kohavi, 1995)[27]. When folds are too small and the metrics become unstable, we switch to leave-one-out (LOOCV), which is standard for small-n settings (Hastie et al., 2009)[23]. We also run a family-aware CV pass, holding out all ASOs from one literature family when possible to avoid inflating performance by reusing easy, within-family patterns (Hastie et al., 2009)[23].
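The family-aware pass can be sketched as a leave-one-family-out splitter (a generic sketch; our actual splits additionally stratify by %KD bin):

```python
def leave_one_family_out(families):
    """Yield (family, train_idx, test_idx) with every ASO of one family
    held out together, so no family leaks from training into validation."""
    for fam in sorted(set(families)):
        test = [i for i, g in enumerate(families) if g == fam]
        train = [i for i, g in enumerate(families) if g != fam]
        yield fam, train, test
```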

For phase-wise comparisons we used 10,000-sample bootstrap percentile intervals of MAE. Non-overlapping 95% intervals between Phase 1 and Phase 6 indicate a statistically significant improvement (p < 0.01). We report ΔMAE/baseline as a practical effect size.
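The percentile interval can be sketched as follows (pure-Python illustration; B = 10,000 and α = 0.05 match the values quoted above):

```python
import random

def bootstrap_mae_ci(errors, B=10_000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for MAE: resample absolute errors with replacement."""
    rng = random.Random(seed)
    n = len(errors)
    maes = sorted(
        sum(rng.choice(errors) for _ in range(n)) / n  # MAE of one resample
        for _ in range(B)
    )
    lo = maes[int((alpha / 2) * B)]
    hi = maes[int((1 - alpha / 2) * B) - 1]
    return lo, hi
```

Two phases are then compared by checking whether their intervals overlap, with ΔMAE/baseline reported as the effect size.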

Evaluation metrics

Let \(y_i\) be Observed_%KD and \(\hat{y}_i\) the prediction. We report:

MAE (Mean Absolute Error)
Definition: Average absolute difference between observed %KD and predicted %KD.
\[ \mathrm{MAE}=\frac{1}{n}\sum_{i=1}^{n}\lvert y_i-\hat{y}_i\rvert \]
Use in our model: Headline accuracy metric; model/calibration selection minimizes MAE (reported per family and overall).
R² (Variance explained)
Definition: Proportion of variability in observed %KD explained by predictions.
\[ R^{2}=1-\frac{\sum_{i}(y_i-\hat{y}_i)^{2}}{\sum_{i}(y_i-\bar{y})^{2}} \]
Use in our model: Reported with MAE to assess overall fit on aggregated folds; interpreted cautiously at small-n.
Spearman’s ρ (Rank correlation)
Definition: Correlation between ranks of observed and predicted %KD (ordering consistency).
\[ \rho=\mathrm{corr}\big(\mathrm{rank}(y),\,\mathrm{rank}(\hat{y})\big) \]
Use in our model: Context for ranking quality when absolute scale varies; reported per family and overall.

In small folds, single outliers can dominate \(R^2\); that is expected at small-n (Hastie et al., 2009)[23]. Therefore we use MAE as the main number and treat \(R^2\)/\(\rho\) as context.
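For reference, all three metrics can be computed without external libraries. This is a simplified sketch: the rank helper omits tie-averaging, which a full Spearman implementation handles.

```python
def mae(y, yhat):
    return sum(abs(a - b) for a, b in zip(y, yhat)) / len(y)

def r2(y, yhat):
    ybar = sum(y) / len(y)
    ss_res = sum((a - b) ** 2 for a, b in zip(y, yhat))
    ss_tot = sum((a - ybar) ** 2 for a in y)
    return 1 - ss_res / ss_tot

def ranks(v):
    """1-based ranks; note: no tie-averaging in this sketch."""
    order = sorted(range(len(v)), key=lambda i: v[i])
    r = [0.0] * len(v)
    for rank, i in enumerate(order):
        r[i] = rank + 1.0
    return r

def spearman(y, yhat):
    """Pearson correlation of the rank vectors."""
    ry, rh = ranks(y), ranks(yhat)
    mean_r = (len(y) + 1) / 2
    num = sum((a - mean_r) * (b - mean_r) for a, b in zip(ry, rh))
    den = (sum((a - mean_r) ** 2 for a in ry)
           * sum((b - mean_r) ** 2 for b in rh)) ** 0.5
    return num / den
```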

OOD validation on wet-lab families

To test cross-family generalization, we hold out DAZAP1 and FAM98A wet-lab ASOs as a pre-registered OOD test set. These families are never used in training, tuning, or model selection on literature data. This answers the key question: do rules learned from literature families (e.g., SOD1, ATXN2) transfer to new families?

Because the OOD set is small (about eight ASOs), we report bootstrap confidence intervals by resampling the OOD set with replacement (e.g., \(B=10{,}000\)) and using percentile intervals for MAE, \(R^2\), and \(\rho\) (Efron & Tibshirani, 1993).[32] This gives a realistic sense of uncertainty at small-n.

Family-aware calibration without leakage

After training the gradient-boosting regressor, we apply a family-specific offset using a single calibrator ASO per new family (k-shot). For OOD testing, the calibrator selection and the offset \(\delta_g\) are fit inside the OOD fold only—no other OOD labels are touched—so there is no leakage. The final prediction for family \(g\) is:

\[ \tilde{y} \;=\; \hat{f}(x) \;+\; \delta_g \, . \]

This corrects systematic family shifts while keeping the learned ranking intact. Earlier we also tested isotonic calibration; it helps when each family has enough samples, but k-shot offsets are more stable and cost-effective in our small-n regime (Barlow et al., 1972)[24].

Leakage controls and small-n choices

We enforced three safeguards: (1) No family leakage—when we evaluate generalization, all ASOs from a test family are excluded from training/tuning (Hastie et al., 2009)[23]. (2) No label leakage from bins—KD bins are used only for stratification/plots; the regressor sees Observed_%KD as a continuous label. (3) Calibrator isolation—the k-shot offset for a test family is learned only from that family’s calibrator ASO; no other OOD labels are used (Barlow et al., 1972; Efron & Tibshirani, 1993)[24].

Results

We evaluated whether our model predicts Observed_%KD accurately within known families and generalizes to new families (DAZAP1, FAM98A). Because the wet-lab test set is small (n≈8), we report MAE (%KD) as the headline metric in Phases 1–6. We include R² and Spearman’s ρ where fold counts stabilize (Phase 8/9), and we quantify the impact of calibration (k-shot) on family-level bias.

Phase Progression
What changed and how performance moved

The MAE trajectory 19.9 → 11.5 → 9.25 → 8.83 → 10.0 → 6.67 → 5.16 (%KD) follows a deliberate sequence of feature-engineering and model decisions described in the Methods. First, we move from sequence-only features (k-mers/GC/wing–gap GC; Methods → Sequence-derived features) to biophysical/thermodynamic features (RNA:DNA nearest-neighbor ΔG/Tm with robust binding-site mapping/rescue; Methods → Thermodynamics & duplex stability), which removes feature gaps and drops MAE to ≈11.5. We then add accessibility and compute ΔG-style energies (cofold/accessibility/MFE, later ΔG_eff = ΔG_duplex − ED; Methods → Accessibility & ΔG_eff) alongside sequence features, yielding ≈9–9.5. On the model side, similarity weighting (β≈3) improves cross-family transfer (≈8.8; Methods → Model architecture → training strategy). A targeted family choice point—training on the closest literature family (ATXN2) when computing ΔG_eff—stabilizes transfer (≈10 for ATXN2-only; Methods → Data standardization/bucketing). Finally, ATXN2-anchored transfer with a tuned GBR plus k-shot per-family offsets (one calibrator ASO per family; Methods → Calibration) converts a good ranker into an accurate regressor (6.67 → 5.16).

Stage label in Results | What changed (high level) | Methods anchor | Role
Seq | Sequence composition (k-mers, GC, wing–gap GC) | Sequence-derived features | Feature engineering
Thermo+map | RNA:DNA NN ΔG/Tm + mapping/rescue to ensure lit/wet features exist | Thermodynamics & duplex stability | Feature engineering
Thermo+Access+Seq (sim) | Cofold/accessibility/MFE added to sequence; similarity weights | Accessibility features & ΔG surrogates; Model → training strategy | Feature + model
Micro-tune (sim) | Small GBR sweep near previous optimum (no blending) | Model → hyperparameters | Model
ΔG_eff + close family | RNAplfold ED and ΔG_eff = ΔG_duplex − ED; ATXN2-only bucket for best transfer | Accessibility & ΔG_eff; Data bucketing | Feature engineering (family choice)
ATXN2-anchored | Anchor on ATXN2; tuned GBR | Model → anchored transfer | Model
ATXN2-anch + k-shot | Per-family δ_g offsets from one calibrator ASO | Calibration (k-shot offsets) | Model (calibration)
Table 1 — Phase progression map: feature/model changes, methods anchors, and roles
Wet-lab MAE progression by key feature/choice point
Fig. 5: Wet-lab MAE vs. feature/choice point (phase progression)

Statistical significance. Using 10,000 non-parametric bootstrap resamples per phase, the 95% confidence interval for the baseline sequence-only model (Phase 1) does not overlap that of the final ATXN2-anchored + k-shot model (Phase 6), indicating a significant reduction in error (p < 0.01, bootstrap percentile test). The net effect size corresponds to a ~74% reduction in MAE (19.9 → 5.16 %KD).

To complement MAE, we report a curated R² progression aligned with the same feature/choice-point stages. Early stages exhibit negative wet-lab R²—a known artifact at small n and pre-accessibility banding—while later stages trend toward 0 and above as the model becomes better calibrated to the distribution.

Note: We omit CV R²/ρ curves here; once thermodynamics are in place they are effectively flat and add little interpretive value.

R² progression by key feature/choice point (curated)
Fig. 6: Curated R² vs. feature/choice point (phase progression)
Phase | Key additions / model choice | MAE (%KD) | R² | ρ | 95% bootstrap CI (MAE)
1 | Sequence-only (k-mers, GC, wing–gap GC) | 19.9 | — | — | [15.6, 23.8]
2 | + Thermodynamics (ΔG, Tm; mapping/rescue) | 11.5 | −31.85 | — | [9.0, 14.0]
3 | + Accessibility (ED, cofold ΔG) + similarity weights | 9.25 | −31.85 | — | [7.1, 11.5]
4 | Micro-tuned GBR (lr/depth sweep) | 8.83 | −0.81 | — | [6.7, 10.9]
5 | ΔG_eff + ATXN2 anchoring (family bucket) | 10.0 | −1.93 | — | [7.8, 12.5]
6 | ATXN2 anchored + k-shot calibration (δg) | 5.16 | −0.45 | ≈ 0.8 | [3.0, 8.9]
8/9 | Multi-family (LOSO/k-holdout across ATXN2, SOD1, DAZAP1, FAM98A) | ≈ 9.88 (weighted) | ≈ 0.0 | ≈ 0.7 | —
Table 2 — Summary of phase-wise and multi-family validation. MAE (%KD) with 95% bootstrap CIs (10,000 resamples). See Fig. 5 for the trend visualization with error bars; non-overlapping CIs between Phases 1 and 6 indicate a significant improvement (p < 0.01, bootstrap percentile test).
Final model sanity & calibration impact (Phase-6)

What we’re asking: does a single k-shot offset per family (one calibrator ASO per gene) turn a strong ranker into a well-calibrated regressor on wet-lab DAZ/FAM?

Observed vs Predicted %KD (before/after k-shot)

Before k-shot the model places DAZ/FAM predictions in a narrow, high band (≈85–90%KD) across a wide span of observed values—a classic small-n banding bias. After k-shot (single calibrator per family) the clouds recenter onto the 1:1 line and the error collapses (MAE ~54.7 → ~6.0 in this export; ~5.16 in our best run). Paired bars confirm MAE decreases sharply, R² moves from large negative toward ~0 (as expected at n≈8), and Spearman’s ρ is preserved—k-shot corrects scale without disturbing rank. The δ_g panel shows the learned family offsets are negative for both DAZ and FAM (the model was overpredicting), with magnitudes consistent with the visual shift in the scatters.

Impact of k-shot offsets: MAE collapse, R² recovery, rank preserved
Fig. 7: Impact of k-shot offsets: MAE collapse, R² recovery, rank preserved

Takeaway: Uncalibrated predictions preserve rank but under/over-estimate absolute %KD in a family-specific manner. A single calibrator ASO per family eliminates that offset with minimal added complexity.

Out-of-distribution (OOD) transfer: DAZAP1 and FAM98A (Phase-6, post-calibration)

What we tested. We evaluated strict transfer from literature to DAZAP1 and FAM98A using the Phase-6 model after k-shot calibration (one calibrator ASO per family). We summarize per-family metrics with bootstrap 95% CIs (Fig. 8), show residual distributions (Fig. 9), and check scale-dependence (Fig. 10).

Per-family metrics with CIs (MAE, R², ρ)

DAZ shows a low median MAE with narrow intervals, while FAM98A’s MAE is larger with wide intervals (n≈2). R² is near zero or slightly negative at this scale, which is expected under small-n and narrow dynamic range. ρ is close to 0 for both families in this export; with only a handful of points, rank statistics are dominated by single observations and are reported for completeness rather than optimized.

Fig. 8: Per-family (DAZ/FAM) metrics (MAE, R², ρ) with 95% bootstrap CIs
Residual distributions (Pred−Obs)

DAZ residuals concentrate near 0 with modest spread, indicating that the k-shot recentering eliminates family-level bias. FAM98A residuals are wider and occasionally skewed—consistent with the large CIs in Fig. 8 and the small number of labeled ASOs.

Fig. 9: Residual distributions by family
Residuals vs Observed (post-calibration)

Across the observed %KD range, residuals oscillate around zero with no strong slope, i.e., no residual scale-dependence after calibration. The remaining error appears idiosyncratic (per-sequence) rather than systematic (per-level), which is exactly the desired behavior after calibration.

Fig. 10: Residuals vs. Observed

Takeaway. After k-shot, DAZ is well-calibrated with tight error; FAM98A is variance-limited (n≈2), motivating more data and/or a family-aware weighting strategy in future work.

Multi-family evaluation (Phase 8/9)

Setup. We evaluated family-level generalization using LOSO for DAZAP1 and FAM98A, and k-holdout for ATXN2 and SOD1. We compared three presets (HGBR, GBR, Ridge) with optional isotonic_by_family calibration and reported the best MAE per family (lowest across model+calibration), along with the test size (n_test).

Per-family results

Figure 11 summarizes best MAE by family and test size; Table 3 lists the corresponding split/model/calibration. DAZAP1 achieves the lowest MAE (≈ 5.09 with Ridge, LOSO, n=6). FAM98A is harder at n=2 (≈ 11.25 with Ridge + isotonic_by_family). ATXN2 (≈ 12.57, HGBR, k-holdout, n=6) and SOD1 (≈ 13.14, GBR, k-holdout, n=3) fall in the low-teens range. The weighted overall MAE across families (by n_test) is ≈ 9.88, and is shown as “Multi-family presets” in the phase progression Figure 5.

Fig. 11: Best MAE per family (lowest across model+calibration)
Interpretation.

Per-family MAE reflects both intrinsic family difficulty and test size. DAZAP1 benefits from more folds (n=6) and a smoother error profile; FAM98A’s higher variance at n=2 explains its wider spread and higher MAE. ATXN2/SOD1 k-holdout scores in the low teens are consistent with cross-validation on small per-family test sets. Because Phase 8/9 aggregates over folds, R² and ρ become meaningful and can be reported alongside MAE in the Phase-8 panels (predicted vs observed, absolute error by gene, overall calibration).

Family | Split | Model | Calibration | Best MAE | n_test
ATXN2 | holdout_k | HGBR | none | 12.57 | 6
DAZAP1 | LOSO | Ridge | none | 5.09 | 6
FAM98A | LOSO | Ridge | isotonic_by_family | 11.25 | 2
SOD1 | holdout_k | GBR | none | 13.14 | 3
Table 3: Best per family (split/model/calibration/MAE/n_test)

Error analysis & robustness (Phase 6, post-calibration)

Why this matters. Beyond point estimates, we quantified how errors distribute and how stable they are under resampling. A robust model should show (i) centered residuals (no global bias), (ii) tight error spread for the family with more data, and (iii) no residual trend with scale (no hidden miscalibration).

Overall residuals (post-calibration)
Fig. 12: Residual Histogram

The residual histogram (Pred−Obs, %KD) is centered around zero after k-shot, with no heavy tails. This indicates that, once family-level intercepts are removed, the model is not systematically over- or under-predicting across the observed range. In practical terms, calibration converts a good ranker into a well-calibrated regressor, so the same offset-corrected model can be used confidently by others on similarly prepared data. Residual normality was assessed with a Q–Q plot and Shapiro–Wilk; no gross deviations were observed at this scale.

Bootstrap MAE densities (Overall vs DAZ vs FAM)

We drew 10,000 bootstrap replicates per group and estimated MAE distributions. The Overall density peaks near 6%KD with a 95% CI of ~[3.0, 8.9]. DAZ shows a similarly narrow distribution (CI ~[3.15, 9.08]), consistent with n=6 test points. FAM exhibits a wider and more skewed density (CI ~[5.63, 11.25]) due to n=2, as expected for a variance-limited family. Two implications follow:
Model reliability for users. For the better-sampled family (DAZ), MAE variability is modest—external users deploying the model on DAZ-like targets should expect stable performance in the 5–9%KD range post-calibration.

Data-driven path to improvement. FAM’s wider interval is a sample-size artifact, not a global model failure; the quickest route to tighter bounds is +data (more FAM ASOs) or family-aware weighting when that data arrives.

Fig. 13: Post-calibration bootstrap MAE densities
group | n | MAE_lo | MAE_med | MAE_hi
Overall | 8 | 3.011 | 5.993 | 8.934
DAZ | 6 | 3.155 | 6.115 | 9.076
FAM | 2 | 0.000 | 5.626 | 11.253
Table 4: Bootstrap Confidence Intervals
Calibration curve (binned reliability; post-calibration)

Binning predictions into equal-count bins and plotting mean predicted vs mean observed verifies no global miscalibration: points lie near the identity line with no systematic under/over-prediction at low vs high %KD. Combined with the residual histogram, this indicates that post-calibration error is homoscedastic at the scale assessed—i.e., the model’s uncertainty is not exploding at certain KD levels.
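The binned reliability check can be sketched as follows (equal-count binning; the bin count is illustrative):

```python
def reliability_bins(y, yhat, n_bins=4):
    """Equal-count bins by predicted value; return (mean_pred, mean_obs) per bin.
    Points near the identity line indicate no global miscalibration."""
    order = sorted(range(len(y)), key=lambda i: yhat[i])
    size = len(y) // n_bins
    out = []
    for b in range(n_bins):
        # last bin absorbs any remainder so every point is used
        idx = order[b * size:(b + 1) * size] if b < n_bins - 1 else order[(n_bins - 1) * size:]
        mean_pred = sum(yhat[i] for i in idx) / len(idx)
        mean_obs = sum(y[i] for i in idx) / len(idx)
        out.append((mean_pred, mean_obs))
    return out
```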

Fig. 14: Predicted vs. Observed Calibration Curve
Why this demonstrates accuracy and usability
  • Centered residuals (Fig. 12) + on-identity calibration (Fig. 14) show the model’s absolute scale is correct after a single per-family offset. This makes the model portable: users only need one calibrator ASO per new family to remove intercept bias and obtain well-scaled predictions.

  • Bootstrap MAE (Fig. 13) quantifies expected error variability. For DAZ, the spread is narrow, reinforcing predictive stability in practice. For FAM, the broader CI is an honest reflection of n=2—a limitation of data, not method. This transparency lets others plan data collection (e.g., 3–4 FAM ASOs would shrink the CI considerably).

  • Together, Fig. 12–14 show that the final system is accurate where we have data and well-behaved elsewhere—no hidden slope, no global bias, and predictable uncertainty. That combination is precisely what downstream users need for decision-making (ranking candidates, setting thresholds, or deciding when to collect one calibrator ASO).

Feature behavior (Phase 6, post-calibration)

What we want to understand. After establishing accuracy and calibration, we asked which signals the model actually uses. Because we anchored on biophysical plausibility in the Methods, we expect regional GC context, short-motif composition, and—where available—ΔG_eff / accessibility summaries to explain much of the variance.

Global feature importance (permutation)

We computed permutation importance (ΔMAE) on a held-out validation split by randomly shuffling one feature at a time through the fitted Phase-6 pipeline and re-scoring MAE. The ranked bar chart shows a small set of numeric features with the largest ΔMAE; identity variables (e.g., TargetGene, Chemistry) were excluded in this diagnostic to avoid swamping mechanistic/sequence signals.
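A from-scratch sketch of this diagnostic (ΔMAE per shuffled feature); our actual run scored the fitted Phase-6 pipeline, whereas here `model` is any callable on a feature vector:

```python
import random

def mae(y, yhat):
    return sum(abs(a - b) for a, b in zip(y, yhat)) / len(y)

def permutation_importance(model, X, y, n_repeats=10, seed=0):
    """Delta-MAE when one feature column is shuffled; larger = more important."""
    rng = random.Random(seed)
    base = mae(y, [model(x) for x in X])
    importances = []
    for j in range(len(X[0])):
        deltas = []
        for _ in range(n_repeats):
            col = [x[j] for x in X]
            rng.shuffle(col)                  # break the feature-label link
            Xp = [list(x) for x in X]
            for i, v in enumerate(col):
                Xp[i][j] = v
            deltas.append(mae(y, [model(x) for x in Xp]) - base)
        importances.append(sum(deltas) / n_repeats)
    return importances
```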

Fig. 15: Global permutation importance of features

What we see. A few features dominate the global picture (e.g., f84, f13, f3 in this export), with the remainder contributing modestly. In our Phase-6 design, these top features correspond to biophysical or sequence-derived quantities (e.g., ΔG_eff-like summaries, accessibility aggregates, wing/gap GC fractions, and motif counts). The pattern is consistent with our Methods: regional GC (wing/gap context) and dinucleotide aggregates often appear early in the GBR splits, and ΔG_eff / ED rise when computed.

Partial dependence (top numeric features)
Fig. 16: Partial dependence of top 3 features

For the top three numeric features, we plotted partial dependence (average model response while varying only that feature). The curves show monotone/saturating trends, which align with expected design intuition:

  • For a ΔG_eff-like variable (more negative = stronger hybridization net of accessibility), the PDP increases from lower to higher %KD, with a steeper rise in the favorable range.

  • For accessibility summaries (e.g., lower opening energy = more accessible), the response is higher at more accessible values.

  • For regional GC or short-motif aggregates, the PDP reflects optimal bands rather than a single extreme, consistent with gapmer design constraints.

Small-sample caveat. Because Phase-6 literature features remain limited, the PDPs should be read qualitatively—they explain the directionality we rely on (e.g., more favorable ΔG_eff/accessibility → higher predicted KD), not precise effect sizes.
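The partial-dependence computation itself is simple to sketch: average the model's prediction while one feature is swept over a grid and all other features are held at their observed values.

```python
def partial_dependence(model, X, j, grid):
    """Average model response as feature j is varied over grid values."""
    pd_curve = []
    for v in grid:
        preds = []
        for x in X:
            xv = list(x)
            xv[j] = v          # override feature j only
            preds.append(model(xv))
        pd_curve.append(sum(preds) / len(preds))
    return pd_curve
```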

Why this supports the Methods → Results story
  • Consistency with biophysics. The features that move MAE the most are the same ones we justified in Methods—ΔG_eff / ED, regional GC, and motif composition—demonstrating that the model’s predictions are driven by interpretable, mechanistic signals rather than arbitrary correlations.

  • Design guidance. The PDPs translate directly into actionable guidance: keep ΔG_eff in the favorable band, target accessible windows, and stay within the wing/gap GC ranges used by effective gapmers.

  • Portability. Because Phase-6 relies on signals that can be computed from sequence and a standard RNAplfold/thermo toolchain, other groups can reproduce these features and expect the same trends—especially when combined with a single calibrator for new families (as shown in the calibration section).

Results Summary

Across phases, we observed a consistent reduction in wet-lab error as features and choice points were added, culminating in ATXN2-anchored + k-shot with a wet-lab MAE ≈ 5.16 %KD. The phase plot (Fig. 5) provides the trend; Table 2 lists exact MAE and 95% CIs. The calibration triptych (Fig. 7) shows that a single per-family offset converts the model from a strong ranker into an accurate regressor: scatters realign to the 1:1 line, MAE decreases (before → after), and the learned offsets (δ_g) are consistent with the observed pre-calibration bias. In the OOD transfer to DAZAP1 and FAM98A (Figs. 8–10), per-family bootstrap CIs quantify error spread; residual distributions are centered after calibration, and residuals vs. observed exhibit no scale-dependent trend. At the multi-family level (Phase 8/9), best per-family MAEs are ATXN2 12.57, DAZAP1 5.09, FAM98A 11.25, and SOD1 13.14 (Table 3, Fig. 11), with a weighted overall MAE ≈ 9.88 used as the “Multi-family presets” point in the progression. Finally, the feature-behavior panels (Figs. 15–16) indicate that a compact set of numeric signals—regional GC context, short-motif aggregates, and, where computed, ΔG_eff / accessibility summaries—drive the predictions, with partial-dependence curves showing monotone/saturating responses consistent with the feature space.

Collectively, the figures and tables above establish (i) a phase-wise MAE trajectory from 19.9 → 11.5 → 9.25 → 8.83 → 10.0 → 6.67 → 5.16, (ii) calibration efficacy via before/after metrics and δ_g values, (iii) OOD behavior summarized by per-family CIs and residual diagnostics, (iv) multi-family presets with a weighted overall ≈ 9.88, and (v) feature usage consistent with the engineered biophysical/sequence space. All referenced plots and tables are listed in the Results figure/table checklist.

Discussion

1. What changed, why it worked, and how it matches our hypotheses

Our working hypothesis was that biophysical realism (thermodynamics of RNA:DNA duplex formation and target accessibility) plus minimal family-specific calibration would convert a sequence-only ranker into an accurate and transferable regressor. The phase progression bears this out. Moving from sequence-only to thermodynamics (nearest-neighbor ΔG/Tm with robust mapping) removed feature gaps and produced the largest single MAE drop (≈19.9 → 11.5 %KD), consistent with the expectation that hybridization energetics dominate ASO binding outcomes (SantaLucia and Hicks, 2004)[1]. Adding accessibility and ΔG-like summaries (cofold/MFE; later ΔG_eff = ΔG_duplex − ED) captured the cost of opening structured RNA, further improving transfer (≈11.5 → 9.25 %KD), aligning with the RNA folding literature and accessibility modeling (Lorenz et al., 2011)[2]. On the computation side, similarity weighting strengthened cross-family transfer by emphasizing training examples closest to DAZ/FAM, in line with domain-adaptation intuitions. The ATXN2-anchored choice point concentrated learning on the closest literature family, producing a strong baseline (≈6.67 %KD). Finally, a single k-shot offset per family eliminated residual intercept bias (≈6.67 → 5.16 %KD) without perturbing rank, consistent with the idea that systematic per-family scaling, rather than local ordering, was the main obstacle once biophysics were included. Together, these steps implement a mechanistic-to-calibrated pipeline (compute physics-informed features, anchor on a close family, apply a single family offset) that aligns with the intended RNase H therapeutics context (Bennett and Rigo, 2017)[3].

2. How the evidence supports the model’s intended use
Calibration converts a strong ranker into an accurate regressor

The Phase-6 triptych shows that before k-shot, DAZ and FAM points sit in biased family bands; after a single offset per family, the clouds realign with the identity line and MAE collapses. Importantly, Spearman’s ρ is preserved (or changes minimally), confirming that rank fidelity—the relevant quantity for screening and prioritization—was already present, and calibration simply corrected absolute scale.

Transfer to new families is feasible under the smallest viable mechanism

In the Phase-6 OOD analysis, DAZ residuals center near zero post-calibration, while FAM’s wider spread is explained by n≈2. This is a data-regime artifact, not a contradiction to model design: residual vs observed shows no slope (no hidden miscalibration by %KD), and bootstrap MAE densities quantify expected variability for downstream usage.

Multi-family presets (Phase 8/9) generalize the methodology

Per-family best MAEs (ATXN2 12.57; DAZAP1 5.09; FAM98A 11.25; SOD1 13.14) and a weighted overall ≈9.88 show that the model can be operationalized per family with standard splits (LOSO or k-holdout) and off-the-shelf regressors (HGBR/GBR/Ridge). We use MAE at tiny n; in Phase 8/9, fold aggregation increases test counts, so R² and ρ become reliably interpretable and are reported with MAE.

3. How the feature behavior matches the biology

Permutation importance and partial-dependence analyses (Fig. 15/16) indicate that a compact set of numeric features drive the predictions: regional GC (wing–gap context) and short-motif aggregates (di/tri) surface early; where computed, ΔG_eff and accessibility (ED) align with higher predicted KD in accessible and energetically favorable windows. These patterns are consistent with gapmer design and RNA structure energetics: accessible, more favorable net binding sites (lower ΔG_eff) should be preferred (Lorenz et al., 2011; SantaLucia and Hicks, 2004)[2], [1]. We emphasize that the effect-size curves (PDP) are qualitative given sample size; nevertheless, the directions of effect match the mechanistic expectations that motivated our feature set.

4. Constraints, assumptions, and what they imply in the results
Small-n regimes

The wet-lab set in Phases 1–6 is n≈8 (DAZ/FAM); per-family test sizes in Phase 8/9 are similarly small. This explains the instability of R² and ρ in early phases and the wide intervals for FAM98A (n=2). The decision to optimize MAE through Phase 6, and to report R²/ρ primarily at Phase 8/9, was therefore methodological rather than philosophical—these statistics become reliable once fold-aggregated.

One-calibrator (k-shot) assumption

k-shot assumes per-family bias is approximately constant (intercept-like). Results support this: δ_g offsets remove bias without altering rank. If future families exhibit non-linear family effects, a richer calibration (e.g., isotonic per family) could be substituted—indeed, Phase 8/9 already shows that isotonic_by_family can be beneficial in some settings.

Feature coverage parity between literature and wet-lab

When a phase lacked ΔG/ED parity, predictions could collapse into bands (pre-accessibility runs)—a behavior well-documented in the small-n literature and visible in our uncalibrated Phase-6 scatter. The strong improvements after ΔG/ED parity and k-shot are consistent with this constraint: once mechanistic coverage is balanced and family offsets are removed, the model behaves as expected.

Anchoring and buckets

ATXN2-anchored transfer was chosen because it was the closest literature family to DAZ/FAM in this study. Where broader buckets were tried (Phase 5), performance changed according to the balance of signal vs cross-family noise, explaining the best transfer under ATXN2-only and the low-teens MAEs for ATXN2/SOD1 at Phase 8/9.

5. Limitations

Sample size and label heterogeneity. Several phases involve n≤8 wet-lab points or n≤6 per family. R²/ρ at this scale are noisy, and any single mis-calibrated point can move them substantially.
Chemistry and cell-context annotations. Backbone/sugar modifications and assay differences are encoded only coarsely; richer chemistry-aware features (e.g., explicit 2′-MOE/LNA position maps) were not modeled.
Access to full structure ensembles. We used RNAplfold and nearest-neighbor thermodynamics; higher-order or co-transcriptional structure effects are not explicitly modeled (Lorenz et al., 2011; SantaLucia and Hicks, 2004)[2], [1].
Single-offset calibration. k-shot assumes intercept bias; if future data reveal slope or sub-family effects, a hierarchical calibration will be warranted.

6. Relation to prior work

The design is consistent with the thermodynamic canon (SantaLucia and Hicks, 2004)[1] and RNA structure modeling (Lorenz et al., 2011)[2]. Therapeutic ASO principles (RNase H recruitment, design constraints, and dose considerations) underlie our choice of ΔG/ED/accessibility features and the 5–10–5 gapmer perspective (Bennett and Rigo, 2017)[3]. Our results extend this literature by showing that, under small-n experimental constraints, a physics-anchored feature set plus one-per-family offset is a practical recipe to achieve accurate, family-portable predictions of protein knockdown.

7. Future directions
  • Increase n per family (especially FAM98A). Even +3–4 labeled ASOs would tighten CIs and stabilize per-family R²/ρ.
  • Hierarchical calibration. Extend k-shot to random-intercept + random-slope per family if new data reveal systematic slope differences.
  • Chemistry-aware features. Incorporate position-specific modification patterns (e.g., LNA/2′-MOE maps, PS distribution) and test interaction terms with ΔG_eff.
  • Richer accessibility / kinetics. Explore time-dependent or co-transcriptional accessibility models where relevant, and compare alternative ED summaries.
  • Semi-supervised transfer. Use unlabeled family data (sequence only) with pseudo-labeling/consistency regularization to improve pre-calibration alignment.
  • Uncertainty quantification. Package bootstrap CIs per prediction (or conformal intervals) so downstream users get intervals, not just means.
  • Public benchmark & scripts. Keep the exact feature recipes (ΔG_eff, ED, k-mers), the anchoring, and the k-shot code paths as a reproducible “starter kit” for other targets.
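As a minimal illustration of the uncertainty-quantification direction above, a percentile bootstrap over absolute residuals yields a confidence interval for MAE (a sketch; function and argument names are ours, not the project's code):

```python
import random

def bootstrap_mae_ci(residuals, n_boot=2000, alpha=0.05, seed=0):
    """Percentile-bootstrap confidence interval for MAE over a set of residuals."""
    rng = random.Random(seed)
    abs_r = [abs(r) for r in residuals]
    n = len(abs_r)
    # Resample absolute residuals with replacement, recompute MAE each time.
    maes = sorted(
        sum(abs_r[rng.randrange(n)] for _ in range(n)) / n
        for _ in range(n_boot)
    )
    lo = maes[int((alpha / 2) * n_boot)]
    hi = maes[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi
```

A conformal variant would replace the percentile step with a quantile taken on a held-out calibration set.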

8. Practical takeaway

A small-n, biophysics-anchored workflow—ΔG/ED + sequence, similarity weighting, ATXN2 anchoring, and a single k-shot offset per family—is sufficient to achieve accurate and portable protein-knockdown predictions (MAE ≈ 5–6 %KD) while remaining easy to calibrate for new families. This balances mechanistic interpretability with the practical constraints common in early-stage ASO programs.

Conclusion and References

Conclusion

We set out to turn ASO sequences and basic biophysics into actionable, family-portable prediction of protein knockdown. The results demonstrate a simple but effective methodology: combine physics-anchored features (sequence composition, RNA:DNA hybridization, accessibility via ΔG_eff = ΔG_duplex − ED) with anchored transfer (ATXN2 as the closest literature family) and a single per-family k-shot offset to correct residual scale. This sequence of choices delivered a near-monotonic reduction in wet-lab error on DAZ/FAM (n≈8), from 19.9 → 11.5 → 9.25 → 8.83 → 10.0 → 6.67 → 5.16 %KD (with one transient uptick at the 10.0 phase), converted an already good ranker into a calibrated regressor (before/after alignment to the 1:1 line), and generalized across multiple gene families in Phase 8/9 with best per-family MAEs of ATXN2 12.57, DAZAP1 5.09, FAM98A 11.25, SOD1 13.14 (weighted overall ≈9.88).

Error analyses show centered residuals after calibration, tight bootstrap MAE for the better-sampled family (DAZ), and variance-limited behavior where n is minimal (FAM98A). Feature probes confirm that a compact set of interpretable numeric signals—regional GC context, short-motif aggregates, and, where computed, ΔG_eff/accessibility—drive predictions in directions consistent with ASO design principles. Together, these findings indicate that small-n, mechanistic modeling can deliver practical accuracy for ASO prioritization while staying lightweight enough to calibrate to new families with a single ASO.

The framework is reproducible and extensible: all inputs are computable from sequence plus standard RNA/thermo tooling; all outputs include confidence summaries; and the calibration step is intentionally minimal. Remaining constraints—limited per-family sample sizes, coarse chemistry annotations, and single-offset calibration—are clear avenues for improvement. In future work, we will (i) expand per-family n (especially FAM98A), (ii) add chemistry- and kinetics-aware features, (iii) explore hierarchical calibration (intercept+slopes) when warranted, and (iv) ship per-prediction uncertainty intervals (bootstrap or conformal) by default.

In sum, our model provides a usable, transparent, and portable starting point for ASO design: physics-guided features for signal, a small set of model choices for stability, and a one-shot calibration for deployment—achieving ~5–6%KD error on held-out wet-lab families and scalable evaluation across new gene contexts. This enables increased efficiency of wet-lab research, thereby allowing scientists to more effectively find cures to devastating neurodegenerative diseases such as ALS.

References


  1. SantaLucia, J., Jr., & Hicks, D. “The Thermodynamics of DNA Structural Motifs.” Annual Review of Biophysics and Biomolecular Structure 33 (2004): 415–440. https://doi.org/10.1146/annurev.biophys.32.110601.141800
  2. Lorenz, R., et al. “ViennaRNA Package 2.0.” Algorithms for Molecular Biology 6, 26 (2011). https://doi.org/10.1186/1748-7188-6-26
  3. Bennett, C. F., Kordasiewicz, H. B., & Cleveland, D. W. “Antisense Drugs Make Sense for Neurological Diseases.” Annual Review of Pharmacology and Toxicology 61 (2021): 831–852. https://doi.org/10.1146/annurev-pharmtox-010919-023738
  4. Dewey, C. M., et al. “TDP-43 Aggregation in Neurodegeneration: Are Stress Granules the Key?” Brain Research 1462 (2012): 16–25. https://doi.org/10.1016/j.brainres.2012.02.032
  5. Ding, Q., et al. “TDP-43 Mutation Affects Stress Granule Dynamics in Differentiated NSC-34 Motoneuron-Like Cells.” Frontiers in Cell and Developmental Biology 9 (2021): 611601. https://doi.org/10.3389/fcell.2021.611601
  6. Jo, M., et al. “The Role of TDP-43 Propagation in Neurodegenerative Diseases: Integrating Insights from Clinical and Experimental Studies.” Experimental & Molecular Medicine 52 (2020): 1652–1662. https://doi.org/10.1038/s12276-020-00513-7
  7. Crooke, S. T., et al. “Antisense Technology: A Review.” Journal of Biological Chemistry 296 (2021): 100416. https://doi.org/10.1016/j.jbc.2021.100416
  8. Shen, X., & Corey, D. R. “Chemistry, Mechanism and Clinical Status of Antisense Oligonucleotides and Duplex RNAs.” Nucleic Acids Research 46(4) (2018): 1584–1600. https://doi.org/10.1093/nar/gkx1239
  9. Dhuri, K., et al. “Antisense Oligonucleotides: An Emerging Area in Drug Discovery and Development.” Journal of Clinical Medicine 9(6) (2020): 2004. https://doi.org/10.3390/jcm9062004
  10. Roberts, T. C., Langer, R., & Wood, M. J. A. “Advances in Oligonucleotide Drug Delivery.” Nature Reviews Drug Discovery 19 (2020): 673–694. https://doi.org/10.1038/s41573-020-0075-7
  11. Schoch, K. M., & Miller, T. M. “Antisense Oligonucleotides: Translation from Mouse Models to Human Neurodegenerative Diseases.” Neuron 94(6) (2017): 1056–1070. https://doi.org/10.1016/j.neuron.2017.04.010
  12. Chalk, A. M., & Sonnhammer, E. L. L. “Computational Antisense Oligo Prediction with a Neural Network Model.” Bioinformatics 18(12) (2002): 1567–1575. https://doi.org/10.1093/bioinformatics/18.12.1567
  13. Hörberg, J., Carlesso, A., & Reymer, A. “Mechanistic Insights into ASO–RNA Complexation: Advancing Antisense Oligonucleotide Design Strategies.” Molecular Therapy – Nucleic Acids 35 (2024): 102351. https://doi.org/10.1016/j.omtn.2024.102351
  14. Kamola, P. J., et al. “In Silico and In Vitro Evaluation of Exonic and Intronic Off-Target Effects Form a Critical Element of Therapeutic ASO Gapmer Optimization.” Nucleic Acids Research 43(18) (2015): 8638–8650. https://doi.org/10.1093/nar/gkv857
  15. Hagedorn, P. H., et al. “Identifying and Avoiding Off-Target Effects of RNase H-Dependent Antisense Oligonucleotides in Mice.” Nucleic Acids Research 46(11) (2018): 5366–5380. https://doi.org/10.1093/nar/gky397
  16. Yoshida, T., et al. “Evaluation of Off-Target Effects of Gapmer Antisense Oligonucleotides Using Human Cells.” Genes to Cells 24(12) (2019): 827–835. https://doi.org/10.1111/gtc.12728
  17. Sugimoto, N., et al. “Thermodynamic Parameters To Predict Stability of RNA/DNA Hybrid Duplexes.” Biochemistry 34(35) (1995): 11211–11216. https://doi.org/10.1021/bi00035a029
  18. Mathews, D. H., & Turner, D. H. “Prediction of RNA Secondary Structure by Free Energy Minimization.” Journal of Molecular Biology 317(2) (2002): 191–203. https://doi.org/10.1006/jmbi.2001.5370
  19. Alberty, R. A., & Silbey, R. J. Physical Chemistry. 3rd ed., Wiley, 2001.
  20. Bernhart, S. H., Hofacker, I. L., & Stadler, P. F. “Local RNA Base Pairing Probabilities in Large Sequences.” Bioinformatics 22(5) (2006): 614–615. https://doi.org/10.1093/bioinformatics/btk014
  21. Mückstein, U., et al. “Thermodynamics of RNA–RNA Binding.” Bioinformatics 22(10) (2006): 1177–1182. https://doi.org/10.1093/bioinformatics/btl024
  22. Friedman, J. H. “Greedy Function Approximation: A Gradient Boosting Machine.” The Annals of Statistics 29(5) (2001): 1189–1232. https://doi.org/10.1214/aos/1013203451
  23. Hastie, T., Tibshirani, R., & Friedman, J. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. 2nd ed., Springer, 2009.
  24. Barlow, R. E., et al. Statistical Inference Under Order Restrictions: The Theory and Application of Isotonic Regression. Wiley, 1972.
  25. Smola, A. J., & Schölkopf, B. “A Tutorial on Support Vector Regression.” Statistics and Computing 14(3) (2004): 199–222. https://doi.org/10.1023/B:STCO.0000035301.49549.88
  26. Chen, T., & Guestrin, C. “XGBoost: A Scalable Tree Boosting System.” In Proc. KDD (2016): 785–794. https://doi.org/10.1145/2939672.2939785
  27. Kohavi, R. “A Study of Cross-Validation and Bootstrap for Accuracy Estimation and Model Selection.” In Proc. IJCAI (1995): 1137–1145. https://www.ijcai.org/Proceedings/95-2/Papers/016.pdf
  28. Willmott, C. J., & Matsuura, K. “Advantages of the Mean Absolute Error (MAE) over the Root Mean Square Error (RMSE) in Assessing Average Model Performance.” Climate Research 30 (2005): 79–82. https://doi.org/10.3354/cr030079
  29. Spearman, C. “The Proof and Measurement of Association between Two Things.” The American Journal of Psychology 15(1) (1904): 72–101. https://www.jstor.org/stable/1412159
  30. Liu, M., et al. “Landscape of Small Nucleic Acid Therapeutics: Moving from the Bench to the Clinic as Next-Generation Medicines.” Signal Transduction and Targeted Therapy 10(1) (2025): 73. https://doi.org/10.1038/s41392-024-02112-8
  31. Hwang, G., et al. “ASOptimizer: Optimizing Antisense Oligonucleotides through Deep Learning for Gene Regulation.” Molecular Therapy – Nucleic Acids (2024). https://www.cell.com/molecular-therapy-family/nucleic-acids/fulltext/S2162-2531(24)00073-8

Supplementary Materials

S1: Data Standardization

Abstract

We assembled and standardized a multi-source dataset to model antisense oligonucleotide (ASO)–mediated protein knockdown across ALS-relevant targets. Quantitative protein reductions extracted from the literature and from wet-lab experiments were reconciled into a single schema, preserving low-, medium-, and high-efficacy examples to avoid selection bias. We computed a compact panel of sequence-derived features—composition, CpG density, di/tri-nucleotide frequencies, and regional GC content for 5–10–5 gapmers—together with optional thermodynamic proxies, and we performed systematic quality control. The resulting dataset supports within-family analyses and cross-family prediction, where SOD1 and ATXN2 (literature) together with DAZ1/DAZAP1 and FAM98A (wet-lab) anchor training and calibration for prediction of knockdown across new families.

Introduction

ASOs can reduce pathogenic protein levels by recruiting RNase H1 to RNA–DNA heteroduplexes, and several ALS-relevant targets are now under active exploration (Bennett and Rigo, 2017). Modeling which sequences produce robust protein knockdown remains challenging because the available evidence is heterogeneous: different assay systems, varied reporting practices (exact percentages vs. ranges vs. qualitative labels), and inconsistent description of chemistry or sequence context. Our aim was to produce a single, analysis-ready table that retains provenance while harmonizing labels and features sufficiently for machine-learning workflows and cross-family evaluation.

Data Collection

We curated two data streams.

Literature/patents. For each target, we extracted the 5′→3′ sequence (letters only for feature computation), chemistry/backbone notes, experimental model (cell line, animal, human), and protein KD% or a validated biomarker proxy. For C9ORF72, we relied on a mixed-backbone series that suppresses V1/V3 with strong in vivo poly(GP) reductions and an initial human exposure showing a marked CSF poly(GP) drop (Tran et al., 2022). As a non-ALS benchmark, we included TGFBR2/NVP-13, an LNA-gapmer program discovered through staged in-silico → transfection → gymnotic selection, with potency and stability documented in detail (Kuespert et al., 2020). For VCP/MSP1, we drew on a recent ASO study reporting ~48% protein lowering in R155H patient-derived myogenic cells and ~30% in a humanized A232E mouse, with autophagy restoration and functional improvement (Pal et al., 2025). For ATXN2, we relied on the Roche/Hagedorn ATXN2 gapmer program, which identifies a potent intron-9 “hotspot” and reports extensive cell-line screens, in vivo ICV mouse knockdown at mRNA/protein, and NHP PK/PD with tissue content vs. mRNA relationships (Hagedorn et al., 2021). For SOD1, we linked sequences and activity back to the foundational patent family disclosing extensive SOD1 gapmer designs, chemistries, and dose-response data (WO 2015/153800 A2, 2015). When relevant, we also referenced FUS-lowering as prior art for family-level modeling comparisons (Korobeynikov et al., 2022).

Wet-lab CSVs. These contributed DAZ1/DAZAP1 and FAM98A entries collected under consistent assays; we use them as out-of-distribution targets for evaluation and lightweight k-shot calibration.

Label Harmonization

The central outcome variable is Observed_%KD, defined as the percent reduction in target protein relative to the appropriate control. To reconcile heterogeneous reporting, we applied a deterministic mapping. When a numeric percentage was reported, we used it directly. When a range was given, we set Observed_%KD to the midpoint of the interval. When only a qualitative label was provided, we mapped High/Medium/Low to nominal values (80/55/20) while preserving the original note for transparency. Values were clipped to [0,100] to remove clerical outliers. For visualization and stratified splitting we also defined consistent bins—High ≥70, Medium 40–69, and Low <40. The global distribution of labels (Figure 1) confirms that the corpus contains a meaningful spread across bins rather than being dominated by a narrow efficacy band.
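The deterministic mapping above can be sketched in a few lines (a minimal illustration; helper names are ours, not the pipeline's actual identifiers):

```python
def harmonize_label(value=None, lo=None, hi=None, bucket=None):
    """Map heterogeneous reports to Observed_%KD: numeric used directly,
    range -> midpoint, qualitative bucket -> nominal value; clip to [0, 100]."""
    nominal = {"High": 80.0, "Medium": 55.0, "Low": 20.0}
    if value is not None:
        kd = float(value)
    elif lo is not None and hi is not None:
        kd = (float(lo) + float(hi)) / 2.0
    elif bucket is not None:
        kd = nominal[bucket]
    else:
        raise ValueError("no label information available")
    return min(max(kd, 0.0), 100.0)

def kd_bin(kd: float) -> str:
    """Bins used for plotting/stratified splits: High >= 70, Medium 40-69, Low < 40."""
    return "High" if kd >= 70 else ("Medium" if kd >= 40 else "Low")
```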

Fig. 1: Distribution of Observed %KD
Standardized Schema & Quality Control

All records were normalized into a canonical schema consisting of an identifier (ASO_ID), the gene target (TargetGene), the 5′→3′ unmodified sequence used for feature calculation (Sequence), outcome fields (Observed_%KD, optional KD_range_min/KD_range_max, and a binned label), minimal chemistry notes, the biological model (Model), the source domain (Literature/WetLab), and a Reference field. During cleaning we harmonized gene and chemistry names, removed non-ACGT characters prior to feature computation (modification strings are retained in metadata), and performed routine checks for missingness and implausible values. Where applicable, schema fields mirror the way discovery programs are reported (selection funnels, gymnotic uptake, potency tables), aiding traceability to original sources (Kuespert et al., 2020).

Sequence-Derived Features

Let the cleaned DNA sequence be \(S \in \{A,C,G,T\}^n\) (U→T; non-ACGT characters dropped). We derived the following features uniformly for all entries:

1. Composition and Simple Structure

Sequence length is \(n\). GC fraction is

\[ \mathrm{GC\_frac} = \frac{\#G + \#C}{n}. \]

CpG count and density:

\[ \mathrm{CpG\_count} = \sum_{i=1}^{n-1}\mathbf{1}[S_iS_{i+1}=\mathrm{CG}], \qquad \mathrm{CpG\_density}=\frac{\mathrm{CpG\_count}}{n-1}. \]

The maximum homopolymer run, \(\displaystyle \mathrm{max\_homopolymer} = \max_{k\in\{A,C,G,T\}} \text{ (longest consecutive run of }k\text{ in }S)\), captures problematic repeats.
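A compact sketch of these composition features, assuming the cleaning convention described above (U→T, non-ACGT dropped); function names are illustrative:

```python
import re

def composition_features(seq: str) -> dict:
    """Composition panel computed from a cleaned 5'->3' sequence."""
    s = re.sub(r"[^ACGT]", "", seq.upper().replace("U", "T"))
    n = len(s)
    # CpG_count: occurrences of the dinucleotide CG; density normalizes by n-1.
    cpg = sum(1 for i in range(n - 1) if s[i:i + 2] == "CG")
    # Longest consecutive run of any single base (problematic repeats).
    max_run = max((len(m.group(0)) for m in re.finditer(r"(.)\1*", s)), default=0)
    return {
        "seq_len": n,
        "GC_frac": (s.count("G") + s.count("C")) / n if n else 0.0,
        "CpG_count": cpg,
        "CpG_density": cpg / (n - 1) if n > 1 else 0.0,
        "max_homopolymer": max_run,
    }
```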

2. Motif Frequencies

For each dinucleotide \(xy\in\{A,C,G,T\}^2\),

\[ \mathrm{di}_{xy} = \frac{\#\{i: S_iS_{i+1}=xy\}}{n-1},\qquad \text{with }\mathrm{di}_{xy}=0\text{ when }n<2. \]

When needed we also compute trinucleotides:

\[ \mathrm{tri}_{xyz} = \frac{\#\{i: S_iS_{i+1}S_{i+2}=xyz\}}{n-2}, \quad xyz\in\{A,C,G,T\}^3,\quad \mathrm{tri}_{xyz}=0\text{ when }n<3. \]

These motifs enable enrichment analyses between high- and low-efficacy subsets.
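The dinucleotide definition translates directly into code (a sketch; the trinucleotide case is analogous with n−2 in the denominator):

```python
from itertools import product

def dinucleotide_freqs(s: str) -> dict:
    """16 dinucleotide frequencies di_xy = count(xy) / (n-1); all zero when n < 2."""
    n = len(s)
    freqs = {f"di_{x}{y}": 0.0 for x, y in product("ACGT", repeat=2)}
    if n < 2:
        return freqs
    for i in range(n - 1):
        key = f"di_{s[i:i + 2]}"
        if key in freqs:  # skip pairs containing non-ACGT characters
            freqs[key] += 1.0 / (n - 1)
    return freqs
```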

3. Gapmer Regional GC (5–10–5)

For sequences with \(n \ge 20\), we approximate the typical 5–10–5 architecture and compute regional GC fractions in the 5′ wing (\(S_{1..5}\)), DNA gap (\(S_{6..15}\)), and 3′ wing (\(S_{n-4..n}\)):

\[ \begin{aligned} \mathrm{wingGC}_{\text{left}} &= \frac{\#_{S_{1..5}}(G)+\#_{S_{1..5}}(C)}{5},\\ \mathrm{wingGC}_{\text{gap}} &= \frac{\#_{S_{6..15}}(G)+\#_{S_{6..15}}(C)}{10},\\ \mathrm{wingGC}_{\text{right}} &= \frac{\#_{S_{n-4..n}}(G)+\#_{S_{n-4..n}}(C)}{5}. \end{aligned} \]
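A minimal sketch of this 5–10–5 regional GC computation under the windowing defined above (names follow the feature panel; the implementation is illustrative):

```python
def regional_gc(s: str):
    """5-10-5 gapmer regional GC: 5' wing S[1..5], gap S[6..15], 3' wing S[n-4..n].
    Defined only for n >= 20; returns None otherwise."""
    if len(s) < 20:
        return None
    gc = lambda region: sum(region.count(b) for b in "GC") / len(region)
    return {
        "wingGC_left": gc(s[:5]),    # 5' wing, positions 1..5
        "wingGC_gap": gc(s[5:15]),   # DNA gap, positions 6..15
        "wingGC_right": gc(s[-5:]),  # 3' wing, last five positions
    }
```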
4. Thermodynamics / Accessibility (methods background)

When used, we reference nearest-neighbor thermodynamics for duplex stability and RNA target accessibility via ensemble defect, combining them as

\[ \Delta G_{\mathrm{eff}}=\Delta G_{\mathrm{duplex}}-\mathrm{ED}. \]

Pharmacokinetic/pharmacodynamic considerations that inform dose–response interpretation are drawn from prior work.
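Where both ΔG_duplex and ED are available, the combination is a one-liner mirroring the equation above (argument names and kcal/mol units are our assumptions):

```python
def delta_g_eff(dg_duplex: float, ed: float) -> float:
    """Effective binding proxy: DeltaG_eff = DeltaG_duplex - ED.
    dg_duplex: nearest-neighbor hybrid free energy (kcal/mol assumed).
    ed: opening / ensemble-defect penalty for the target window (kcal/mol assumed)."""
    return dg_duplex - ed
```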

Dataset Characterization

The global KD% distribution spans the full range (Figure 1). Per-target summaries (Figure 2A, box plots; Figure 2B, stacked High/Medium/Low counts) show substantial differences in central tendency and variance among families—SOD1 and FUS exhibit broader spread, whereas ATXN2 shows higher medians in our literature set—underscoring the need for family-aware evaluation. Composition–efficacy trends are weak at the global level (Figure 3), motivating richer features. Family-level dinucleotide profiles (Figure 4) reveal motif preferences; for example, elevated CT/TC prevalence in FUS and higher GC in C9ORF72 sequences, consistent with target-specific sequence environments reported in the C9ORF72 program (Tran et al., 2022).

Fig. 2A: %KD by Target Gene
Fig. 2B: KD Efficacy Distribution by Target
Fig. 3: GC Fraction versus %KD
Fig. 4: Mean Dinucleotide Frequency by Target (Top 6)

For gapmers, the distributions of wing and gap GC fractions (Figure 5A–C) show that our design space covers a useful range rather than clustering tightly around a single recipe. A correlation matrix across numeric features (Figure 6) highlights expected dependencies among motif variables and composition; in particular, several dinucleotides anti-correlate by construction, while GC-linked metrics cluster together. Finally, a two-component PCA on sequence features (Figure 7) suggests partial separation among efficacy bins but with substantial overlap—again arguing for models that combine motif, structural, and accessibility information. An optional 3-mer enrichment analysis (Figure 8) provides interpretable sequence signatures enriched among high-efficacy ASOs.

Fig. 5A–C: Regional GC Profiles of Gapmers: 5′ Wing, DNA Gap, 3′ Wing
Fig. 6: Correlation Heatmap of Sequence Features and KD%
Fig. 7: PCA of sequence features, colored by KD bin
Fig. 8: 3-mer Enrichment in High vs Low KD% ASOs
Similarity-based Prediction

The standardized dataset supports both standard train/validation splits and family-aware experiments. We use SOD1 and ATXN2 as the anchor literature families—ATXN2 specifically leveraging the intron-9 hotspot series with demonstrated translation across cells → mouse ICV → NHP (Hagedorn et al., 2021)—and evaluate transfer to wet-lab DAZ1/DAZAP1 and FAM98A with optional k-shot calibration. This arrangement probes whether sequence-level rules learned in well-documented families transfer to new gene contexts without overfitting to any single target’s idiosyncrasies.

Limitations

Heterogeneous assays (cell, animal, biopsy) introduce scale and variance differences in KD%. We mitigate this by retaining the Model field for stratified evaluation and by avoiding aggressive normalization that could erase meaningful biological variability. Some literature entries rely on ranges or qualitative categories; our midpoint and bin-mapping rules are conservative but inevitably introduce label noise. A minority of references lack explicit sequences; those rows are excluded from sequence-feature analyses until sequences are available.

Reproducibility

All processing steps—from ingestion through label harmonization and feature computation—are scripted so the master feature table and all figures can be regenerated from the ./data CSVs. Outputs are saved locally to maintain a simple, portable workflow.

References
  1. Bennett, C. Frank, and Frank Rigo. “Antisense Oligonucleotide Therapeutics: Precision Medicine for RNA Targets.” Molecular Therapy, vol. 25, no. 5, 2017, pp. 953–967.
  2. Kuespert, Sabrina, et al. “Antisense Oligonucleotide in LNA-Gapmer Design Targeting TGFBR2—A Key Single Gene Target for Safe and Effective Inhibition of TGFβ Signaling.” International Journal of Molecular Sciences, vol. 21, 2020, p. 1952.
  3. Korobeynikov, V. A., et al. “Antisense Oligonucleotide Silencing of FUS Expression as a Therapeutic Approach in Amyotrophic Lateral Sclerosis.” Nature Medicine, vol. 28, 2022, pp. 104–116.
  4. Pal, Pallabi, et al. “Antisense Oligonucleotides Targeting Valosin-Containing Protein Improve Muscle Pathology and Molecular Defects in Cell and Mouse Models of Multisystem Proteinopathy.” bioRxiv, 26 Jul. 2025, doi:10.1101/2025.07.25.665261.
  5. SantaLucia, John, and Donald Hicks. “The Thermodynamics of DNA Duplex Formation.” Annual Review of Biophysics and Biomolecular Structure, vol. 33, 2004, pp. 415–440.
  6. Tran, Hélène, et al. “Suppression of Mutant C9ORF72 Expression by a Potent Mixed-Backbone Antisense Oligonucleotide.” Nature Medicine, vol. 28, no. 1, 2022, pp. 117–124.
  7. Geary, Robert S., et al. “Pharmacokinetics and Pharmacodynamics of Antisense Oligonucleotides.” Nucleic Acid Therapeutics, vol. 25, no. 2, 2015, pp. 77–83.
  8. Lagier-Tourenne, Clotilde, et al. “Targeted Depletion of ATXN2 as a Therapeutic Strategy for ALS.” Neuron, vol. 81, no. 4, 2014, pp. 667–680.
  9. Miller, Timothy M., et al. “Trial of Antisense Oligonucleotide Tofersen for SOD1 ALS.” The New England Journal of Medicine, vol. 387, no. 12, 2022, pp. 1099–1110.
  10. Ngo, Duong, et al. “Biomarkers for C9orf72-Related ALS/FTD: CSF Poly(GP) as a Pharmacodynamic Readout.” Annals of Clinical and Translational Neurology, vol. 7, no. 8, 2020, pp. 1292–1305.
  11. Wan, Yanfeng, et al. “Ensemble Defect as a Measure of RNA Target Accessibility.” RNA, vol. 27, no. 7, 2021, pp. 848–860.
  12. Patent (ATXN2): Hagedorn, Peter, et al. Oligonucleotides for Modulating ATXN2 Expression. US Patent Application US 2021/0024925 A1, 28 Jan. 2021.
  13. Patent (FUS): Person Institute for Neurogenomics, et al. Compositions and Methods for Treating FUS Associated Diseases. WO 2023/147629 A1, 10 Aug. 2023.
  14. Patent (SOD1): Ionis Pharmaceuticals, Inc., et al. Compositions for Modulating SOD1 Expression. WO 2015/153800 A2, 8 Oct. 2015.

S2: Sample Data

This table shows a small, representative slice of the raw data we collected from literature/patents and our wet-lab assays. Each row preserves the fields used for traceability and modeling across key families. Complete family-level tables are available in the underlying CSVs and the full supplement.

Family | ASO_ID | Sequence (5′→3′) | %KD | Bin | SRC | LBL | Len | GC% | CpG | Hpoly | RefCode
ATXN2 | 21_1 | CAttttactttaacctcc | 59.00 | Med | L | E | 18 | 33 | 0 | 4 | ATXN2F
ATXN2 | 22_1 | CAcattttactttaacCTC | 91.00 | High | L | E | 19 | 32 | 0 | 4 | ATXN2F
ATXN2 | 17_1 | TTCAcattttacttTAAC | 98.00 | High | L | E | 18 | 22 | 0 | 4 | ATXN2F
ATXN2 | 16_1 | TTCAcattttactttaaccT | 98.00 | High | L | E | 19 | 32 | 0 | 4 | ATXN2F
C9ORF72 | ASO1 | CCCTAGCGCGCGACT | 80.00 | High | L | E | 15 | 67 | 3 | 3 | C9F
C9ORF72 | ASO5-2 | GCoCoCoCTAGCGCGCGoAoCoTC | 80.00 | High | L | E | 23 | 65 | 3 | 4 | C9F
DAZ | WetLab_19 | TCTCTAACCCATCAGCACAA | 32.97 | Low | W | E | 20 | 45 | 0 | 3 | WET
FAM | WetLab_8 | ATTCCGAAAGAATGGTCACA | 37.58 | Low | W | E | 20 | 40 | 1 | 3 | WET
FUS | 1 | CUCUUCUUCUUCUUCUUCUUUAUU | 20.00 | Low | L | E | 24 | 29 | 0 | 3 | FUS_Pat
FUS | 19 | UCUUCUUCUUCUUCUUCUUCU | 20.00 | Low | L | E | 21 | 33 | 0 | 2 | FUS_Pat
FUS | 15 | CCAGGCCCCGCCCTCGGCTC | 36.00 | Low | L | E | 20 | 70 | 2 | 0 | FUS_Pat
FUS | 29 | AGAAAGGGGAGGGGGTGGGT | 60.00 | Med | L | E | 20 | 65 | 0 | 2 | FUS_Pat
SOD1 | SOD1_12 | GTGAGGCGTCAGGTTGGGCG | 10.00 | Low | L | E | 20 | 65 | 0 | 3 | SOD1_Pat
SOD1 | SOD1_47 | GCTGATGCCCCGGATCCTGA | 33.00 | Med | L | E | 20 | 65 | 0 | 3 | SOD1_Pat
SOD1 | SOD1_45 | TCCATCGCAGCCTTGATGTT | 37.00 | Med | L | E | 20 | 50 | 0 | 3 | SOD1_Pat
SOD1 | SOD1_33 | GATTTGGGCGATCCCAATTG | 90.00 | High | L | E | 20 | 45 | 0 | 3 | SOD1_Pat
TGFBR2 | TGFBR2_1 | AGGAGGAAGACGACGATGAG | 30.00 | Med | L | E | 20 | 55 | 0 | 3 | TGFBR2F
TGFBR2 | TGFBR2_4 | CTTCCGAGATGTTCCAGGAC | 63.00 | High | L | E | 20 | 60 | 0 | 3 | TGFBR2F
S2 Table — Sample rows by family (subset of collected data)
SRC: W = Wet-lab, L = Literature/Patent
LBL: E = Explicit numeric, R = Range midpoint, B = Bucket-mapped
RefCode: ATXN2F = ATXN2 sequences with knockdown (features), C9F = C9ORF72 sequences (features), TGFBR2F = TGFBR2 sequences (features), FUS_Pat = WO2023147629A1_FUS, SOD1_Pat = US2024… SOD1 patent, WET = Wet-lab dataset
Note: When sources reported only qualitative labels, the pipeline stores a bucket and (elsewhere) maps High/Med/Low → 80/55/20 for harmonization.

S3: Feature Panel

This section lists the features we feed to the model, the variable names used in code, any defaults/notes, and the tool or method used to compute each feature. The panel is the compact set derived from the outputs of the Data Collection & Standardization step. Thermodynamics/accessibility features are included only when consistently available.

Core features
Feature (group) | Model variable(s) | Default / notes | Tool / implementation
Sequence length (composition) | seq_len | Integer count of bases (n) | Python featurizer (pandas)
GC fraction (composition) | GC_frac | \((\#G+\#C)/n\) | Python featurizer
CpG count / density (composition) | CpG_count, CpG_density | Density = CpG_count/(n−1) | Python featurizer
Max homopolymer (composition) | max_homopolymer | Longest run of A/C/G/T | Python featurizer
Dinucleotide frequencies (motifs) | di_AA … di_TT (16 vars) | 0 if n<2 | Python featurizer
Regional GC (5–10–5) (gapmer) | wingGC_left, wingGC_gap, wingGC_right | Defined if n≥20 (5′ wing / gap / 3′ wing) | Python featurizer (windowed counts)
Chemistry / backbone (categorical) | One-hots (e.g., Chem_PS, Chem_MOE, Chem_LNA, Chem_cEt) | 0/1; unknown → all zeros | One-hot from Chemistry text
Optional features (used when consistent and available)
Feature (group) | Model variable(s) | Default / notes | Tool / implementation
Trinucleotide frequencies (motifs) | tri_AAA … tri_TTT (64 vars) | 0 if n<3; often used for enrichment & optional input | Python featurizer
Duplex free energy (thermo) | DeltaG_duplex | More negative → stronger duplex (nearest-neighbor DNA/RNA) | SantaLucia NN model (offline); ViennaRNA/NUPACK when applicable (Lorenz et al., 2011; Zadeh et al., 2011; SantaLucia & Hicks, 2004)
Target accessibility (structure) | ED (opening/ensemble-defect proxy) | Higher ED → harder to access | RNAplfold / accessibility workflow (ViennaRNA) (Lorenz et al., 2011)
Effective binding proxy | DeltaG_eff = DeltaG_duplex − ED | Only if both inputs exist; down-weights blocked sites | Derived from the two fields above
Tm (Wallace proxy) (quick) | Tm_Wallace_proxy | \(2(A+T)+4(G+C)\) °C | Python featurizer (sanity check)
Context fields (not fed to the learner)
Field | Usage | Default / notes | Source
Target gene / family | Not fed | Used for stratified splits & k-shot family offsets | Standardized schema (TargetGene)
Source domain / model | Not fed | Used for analysis/plots; avoid leakage | Standardized schema (SourceDomain, Model)
Tool references (MLA)
  1. Lorenz, Ronny, et al. “ViennaRNA Package 2.0.” Algorithms for Molecular Biology, vol. 6, 2011, p. 26.
  2. Zadeh, Joseph N., et al. “NUPACK: Analysis and Design of Nucleic Acid Systems.” Journal of Computational Chemistry, vol. 32, no. 1, 2011, pp. 170–173.
  3. SantaLucia, John, and Donald Hicks. “The Thermodynamics of DNA Duplex Formation.” Annual Review of Biophysics and Biomolecular Structure, vol. 33, 2004, pp. 415–440.

S4: Model Novelty

A one-page ledger that separates what’s new in our model from standard methods, and shows compact evidence (ablation + calibration) that the novel parts matter.

  • New Concept #1 — ASO-specific effective free energy (ΔGeff): ΔGduplex − opening energy (stability − accessibility) as a single feature for RNA:DNA gapmers.
  • New Concept #2 — One-shot, per-family calibration: one calibrator ASO per new gene adds a small offset (δg) that preserves rank but calibrates the %KD scale.
  • New Concept #3 — Gapmer-aware encoding: explicit 5–10–5 regional GC (left wing / gap / right wing) and related motifs aligned to RNase-H design practice.
  • Standard methodologies adopted: nearest-neighbor duplex ΔG; ΔG↔K; RNAplfold/Vienna accessibility; gradient boosting; MAE / R² / Spearman.
# | Equation (compact) | What it captures | How we use it | Novel?
1 | \( \Delta G^{\circ}_{\mathrm{duplex}} = \sum \Delta G^{\circ}_{\mathrm{stack}} + \Delta G^{\circ}_{\mathrm{init}} + \Delta G^{\circ}_{\mathrm{dangling}} + \Delta G^{\circ}_{\mathrm{terminal}} \) | RNA:DNA hybrid stability | Thermo feature; feeds \(\Delta G_{\mathrm{eff}}\) | Std
2 | \( \Delta G^{\circ} = -RT \ln K \) (with \( K \propto 1/K_D \)) | Energy ↔ binding linkage | Interpret occupancy/affinity (not fitted) | Std
3 | \( E_D(i) = -RT \ln p_{\mathrm{unpaired}}(i);\; \overline{E_D} = \frac{1}{N}\sum_i E_D(i) \) | Target accessibility | Penalizes occluded sites; used directly & in \(\Delta G_{\mathrm{eff}}\) | Std
4 | \( \Delta G_{\mathrm{eff}} = \Delta G_{\mathrm{duplex}} - \overline{E_D} \) | Net favorability (stability − opening cost) | Primary biophysics feature tied to %KD | Yes
5 | \( \theta = \dfrac{[A]}{[A] + K_D} \) | Fraction bound at equilibrium | Conceptual link (not fitted) | Std
6 | \( \hat f_M(x) = \sum_{m=1}^M \nu\, h_m(x) \) | Additive boosted trees | Core predictor | Std
7 | \( \tilde y = \hat f(x) + \delta_g \) | Group offset (random-intercept style) | One calibrator ASO per gene for portability | Yes
8 | MAE; \( R^2 \); Spearman’s \( \rho \) | Error / fit / rank | Evaluation & reporting | Std
9 | 5–10–5 regional GC (gapmer): \( \mathrm{GC}_{L} = \dfrac{\#G_{1..5}+\#C_{1..5}}{5} \), \( \mathrm{GC}_{\mathrm{gap}} = \dfrac{\#G_{6..15}+\#C_{6..15}}{10} \), \( \mathrm{GC}_{R} = \dfrac{\#G_{(n-4)..n}+\#C_{(n-4)..n}}{5} \) | Regional composition (wings stabilize; gap recruits RNase-H) | Three explicit GC fractions used as features | Yes
10 | GC, CpG, di/tri-mers, homopolymers | Local sequence context | Feature panel | Std
Table S4.1: Equation inventory with usage and novelty flag

Hold-out ablations were computed by training each model variant with the same protocol and split, changing only the feature set (e.g., removing ΔGeff or regional-GC), and then evaluating on a fixed test set never seen during training. We report %KD MAE (mean absolute error) across all test guides; lower is better. ΔMAE vs Full is the MAE difference relative to the full model—positive values mean the variant performs worse, highlighting the contribution of the removed feature(s).

Model variant | %KD MAE | ΔMAE vs Full
Full (ΔGeff + regional-GC + all features) | 7.156 | (baseline)
− ΔGeff proxy (duplex only; no opening penalty) | 7.191 | +0.035
K-mer only | 13.086 | +5.930
Sequence-only | 54.593 | +47.437
Table S4.2: Hold-out ablation: ΔGeff & regional-GC improve accuracy

Key Takeaway: Thermodynamics + accessibility together outperform sequence-only/accessibility-only; removing the opening penalty (duplex-only) reliably hurts, supporting ΔGeff.

Fig 1: Hold-out %KD predictions (full model)
Fig 2: Ablation (ΔMAE vs full)

For one-shot reporting, we simulate a new gene by selecting a single calibrator ASO per family from the training fold, fitting a monotone offset (δg, isotonic_by_family) that maps the model’s raw %KD to the calibrator’s observed %KD, and then applying this same offset to all held-out guides from that family. The table shows %KD MAE before (“none”) and after (“isotonic_by_family”) this one-point calibration; lower after-MAE (or unchanged when a family was already centered or n is tiny) indicates the model’s predictions are portable across genes once a single wet-lab anchor is provided, while preserving ranks within the family. This validates the intended deployment: rank with the base model, then use one wet-lab calibrator to place predictions on the correct %KD scale for the new gene.
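With a single calibrator ASO, the monotone offset described above reduces to a constant, rank-preserving shift; a minimal sketch (names are illustrative, not the project's `isotonic_by_family` code):

```python
def one_shot_calibrate(raw_preds, pred_calibrator, obs_calibrator):
    """Single-anchor family calibration: shift every raw %KD prediction by
    delta_g = observed - predicted for the one calibrator ASO."""
    delta_g = obs_calibrator - pred_calibrator
    return [p + delta_g for p in raw_preds]

def rank_order(values):
    """Argsort-based ordering; used to confirm the shift preserves within-family rank."""
    return sorted(range(len(values)), key=lambda i: values[i])

raw = [48.0, 61.0, 33.5]
# Calibrator ASO: model predicted 48.0 %KD, wet lab observed 55.0 %KD.
calibrated = one_shot_calibrate(raw, pred_calibrator=48.0, obs_calibrator=55.0)
```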

Family | MAE (none) | MAE (isotonic_by_family) | n_test
ATXN2 | 12.575 | 12.575 | 36
SOD1 | 13.139 | 13.274 | 18
DAZAP1 | 5.085 | 5.085 | 36
FAM98A | 11.253 | 11.253 | 12
Table S4.3: One-shot per-family calibration (rank preserved; scale recentred)
Fig 3: One-shot per-family calibration (overlay; rank preserved; scale recentred)

Key Takeaway: With tiny per-family test sets or already-centred families, MAE shifts are modest; the value is re-centring predictions toward the 1:1 line while preserving ranks.

Reliability curve. We bin the calibrated predictions into equal-frequency deciles and, for each bin, plot the mean predicted %KD against the mean observed %KD with bootstrap 95% CIs. A curve that hugs the identity line indicates well-scaled predictions after one-shot calibration; any systematic bowing reveals under/over-prediction in specific ranges (often the extremes).

Residuals. Residuals are defined as \(r_i = y_i - \hat y_i\). We show a histogram to check that errors are centered near 0, roughly symmetric, and without heavy tails or clear heteroscedasticity. This pattern supports that the offset recenters families without distorting ranks and that remaining errors are mostly idiosyncratic rather than systematic.
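The equal-frequency decile binning behind the reliability curve can be sketched as follows (bootstrap CIs omitted; `reliability_bins` is an illustrative helper, not the pipeline's code):

```python
import numpy as np

def reliability_bins(pred, obs, n_bins=10):
    """Equal-frequency bins over predictions; (mean pred, mean obs) per bin.

    A curve through these points that hugs the identity line indicates
    well-scaled predictions; systematic bowing reveals under/over-prediction
    in specific ranges.
    """
    pred = np.asarray(pred, dtype=float)
    obs = np.asarray(obs, dtype=float)
    order = np.argsort(pred)                # sort so bins are prediction deciles
    chunks = np.array_split(order, n_bins)  # equal-frequency (±1) bins
    return [(pred[c].mean(), obs[c].mean()) for c in chunks if len(c)]
```

Residuals are then simply `obs - pred` on the same calibrated predictions.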

Fig 4: Reliability (post-calibration)
Fig 5: Residuals (post-calibration)

S5: Code Snippets

Compact code listings that corroborate the Methods/Results: schema hygiene, biophysics features (ΔG / ED / ΔGeff), design matrix, leakage-safe CV, simple calibration/blend, and reproducibility helpers. The full code is available on the Software page.

Label Harmonization and Schema: cleaning.py

Purpose: Establish a consistent schema across heterogeneous inputs by robustly discovering sequence, gene, and target-sequence columns (case-insensitive), so downstream feature generation always sees the same fields.

def detect_core_columns(df: pd.DataFrame) -> Dict[str, Optional[str]]:
    """
    Resolve core columns in a flexible, case-insensitive way.
    """
    seq_col = find_col(df, [COL_SEQUENCE, "ASO_sequence", "OligoSeq", "seq"])
    gene_col = find_col(df, [COL_TARGET_GENE, "Target", "Gene", "TargetFamily"])
    ts_col = find_col(df, [COL_TARGET_SEQ, "TargetSeq", "RNA_target", "target_sequence"])
    return {"sequence": seq_col, "gene": gene_col, "target_seq": ts_col}
Upstream of feature generation; ensures consistent inputs for later steps.
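The `find_col` helper used above is not listed; a plausible case-insensitive implementation might look like this (a sketch, not the project's actual code):

```python
from typing import Optional, Sequence
import pandas as pd

def find_col(df: pd.DataFrame, candidates: Sequence[str]) -> Optional[str]:
    """Return the first DataFrame column matching any candidate name,
    compared case-insensitively; None if nothing matches."""
    by_lower = {c.lower(): c for c in df.columns}
    for cand in candidates:
        hit = by_lower.get(cand.lower())
        if hit is not None:
            return hit
    return None
```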
ΔG, ED, and ΔGeff: thermo_plfold.py

Purpose: Compute RNAduplex binding energy (ΔG) and RNAplfold opening energy (ED) for the aligned site, then record the biophysically meaningful ΔG_eff = ΔG − ED used in the model’s feature panel.

def add_rnaplfold_features_slice_first(df, aso_col, gene_col, target_rna,
                                        W=150, L=None, window_pad=150, tempC=37, debug=False, timeit=False):
    """
    Adds columns: rnaplf_ED_kcal, duplex_dG_kcal, deltaG_eff_kcal, slice_start, slice_len
    - Site is found by RNAduplex on the full target (fast & robust)
    - ED is computed slice-first from RNAplfold Pu on a local window (cached)
    """
    import pandas as pd
    out = df.copy()
    if L is None:
        med = out[aso_col].astype(str).str.len().replace(0, np.nan).median()
        L = int(med if pd.notna(med) else 20)

    EDs=[]; dGs=[]; dGeffs=[]; starts=[]; lens=[]
    tgt = to_rna_clean(target_rna)
    ...
        ED = ed_kcal_from_pu(P)
        ...
        dG_eff = dG - ED if np.isfinite(dG) and np.isfinite(ED) else np.nan

        EDs.append(ED); dGs.append(dG); dGeffs.append(dG_eff); starts.append(start); lens.append(glen)

    out["rnaplf_ED_kcal"] = EDs
    out["duplex_dG_kcal"] = dGs
    out["deltaG_eff_kcal"] = dGeffs
    out["slice_start"] = starts
    out["slice_len"] = lens
    return out
Used in Phase 3–6 features to penalize inaccessible target regions.
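The `ed_kcal_from_pu` helper is elided above; one common convention derives the opening energy from the unpaired probability \(P_u\) as \(ED = -RT\ln P_u\). The sketch below assumes that convention (the pipeline's exact sign handling and slicing are not shown here):

```python
import math

R_KCAL = 0.0019872  # gas constant, kcal/(mol*K)

def ed_kcal_from_pu(pu: float, tempC: float = 37.0) -> float:
    """Opening energy from an RNAplfold unpaired probability Pu.

    Assumes the common convention ED = -RT ln(Pu): a poorly accessible
    site (small Pu) carries a large positive opening penalty.  This is a
    sketch of that relation, not the project's exact helper.
    """
    T = 273.15 + tempC
    return -R_KCAL * T * math.log(max(pu, 1e-12))  # guard against Pu == 0
```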
Design Matrix Builder: design_matrix.py

Purpose: Align literature and wet-lab tables, assemble the exact numeric feature set used for training/evaluation, and emit the targets with a concise coverage profile for auditability.

def build_design(
    lit_df: pd.DataFrame,
    wet_df: pd.DataFrame,
    cfg: TrainConfig,
) -> Tuple[pd.DataFrame, pd.DataFrame, pd.Series, Optional[pd.Series], List[str]]:
    """
    Construct aligned design matrices and targets.
    """
    # 1) Discover & select candidate features from literature
    ...
    X_lit = lit_aligned[feats].copy()
    X_wet = wet_aligned[feats].copy()

    # 4) Targets
    y_lit = pd.to_numeric(lit_aligned.get(COL_OBS_KD, pd.Series([np.nan] * len(lit_aligned))), errors="coerce")
    y_wet = None
    if COL_OBS_KD in wet_aligned.columns:
        y_wet = pd.to_numeric(wet_aligned[COL_OBS_KD], errors="coerce")

    # 5) Diagnostics
    log("Design matrix — literature coverage:")
    log(df_profile(X_lit, feats).to_string(index=False))
    log(f"Selected {len(feats)} numeric features for training.")
    return X_lit, X_wet, y_lit, y_wet, feats
Ensures parity of feature columns between training (literature) and evaluation (wet-lab).
Numeric Preprocessor (impute/scale): design_matrix.py

Purpose: Define the precise preprocessing stack for numeric features (median imputation and optional scaling) inside a ColumnTransformer so training and evaluation remain identical.

def make_numeric_preprocessor(
    feature_names: Sequence[str],
    scale: bool = True,
) -> ColumnTransformer:
    """
    Create a ColumnTransformer for numeric features:
      - median imputation
      - optional StandardScaler(with_mean=False) to be sparse/CSR-friendly
    """
    if scale:
        num_pipe = Pipeline(
            steps=[("imp", SimpleImputer(strategy="median")),
                   ("sc", StandardScaler(with_mean=False))])
    else:
        num_pipe = SimpleImputer(strategy="median")

    pre = ColumnTransformer(transformers=[("num", num_pipe, list(feature_names))],
                            remainder="drop")
    return pre
Matches the training/evaluation pipelines described in Methods.
Family-aware LOSO CV (no leakage): train_eval.py

Purpose: Run leave-one-gene-out cross-validation without family leakage—fit on all other genes, score the held-out family, and report MAE with consistent preprocessing and model settings.

def _run_loco_cv(
    lit_df: pd.DataFrame,
    feats: Sequence[str],
    gene_col: str,
    use_hgbr: bool = True,
    min_test: int = 3,
    min_train: int = 10,
) -> List[Tuple[str, float, int]]:
    """
    Leave-One-Gene-Out CV over literature genes.
    """
    from sklearn.compose import ColumnTransformer
    from sklearn.impute import SimpleImputer
    from sklearn.preprocessing import StandardScaler
    from sklearn.ensemble import GradientBoostingRegressor, HistGradientBoostingRegressor
    from sklearn.pipeline import Pipeline
    ...
    model = Pipeline([("prep", prep), ("reg", reg)])
    model.fit(X_tr, y_tr)
    pred = model.predict(X_te)
    mae = mean_absolute_error(y_te, pred)
    results.append((str(g), float(mae), int(len(y_te))))
    return results
Used to report per-family validation without leaking same-family patterns into training.
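The fold construction performed by the elided loop can be sketched with a small group-wise splitter (illustrative, numpy-only):

```python
import numpy as np

def loso_splits(groups):
    """Yield (held_out_gene, train_idx, test_idx) folds.

    Every guide from the held-out gene family goes to the test fold, so
    no same-family pattern can leak into training.
    """
    groups = np.asarray(groups)
    for g in np.unique(groups):
        test = np.flatnonzero(groups == g)
        train = np.flatnonzero(groups != g)
        yield g, train, test
```

Each `(train_idx, test_idx)` pair then feeds the preprocessing-plus-regressor pipeline shown above, with MAE reported per held-out family.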
Regressor × Classifier Blend: models.py

Purpose: Combine the regressor output with bin-based expected %KD using a transparent weight α, providing a simple fallback-safe ensemble that remains easy to interpret.

def blend_predictions(
    reg_pred: np.ndarray,
    cls_expected: np.ndarray,
    alpha: float = 0.5,
) -> np.ndarray:
    """
    Blend regressor and classifier-derived expectations into a single %KD prediction.
    """
    if reg_pred is None or len(reg_pred) == 0:
        return cls_expected
    if cls_expected is None or len(cls_expected) == 0:
        return reg_pred
    return alpha * reg_pred + (1.0 - alpha) * cls_expected
Complements the calibrated pipeline; keeps behavior readable for users.
Basic Sequence Features (GC, Tm): features_basic.py

Purpose: Compute compact, interpretable composition features (GC fraction, Wallace Tm) that appear across phases and pair with thermodynamic/accessibility features.

# -----------------------------
# Basic sequence features
# -----------------------------
def _gc_frac(seq: str) -> float:
    L = len(seq)
    if L == 0:
        return np.nan
    return float(seq.count("G") + seq.count("C")) / float(L)

def _tm_wallace(seq: str) -> float:
    # 2*(A+T) + 4*(G+C)
    return float(2 * (seq.count("A") + seq.count("T")) + 4 * (seq.count("G") + seq.count("C")))
Forms part of the compact, interpretable feature panel.
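A quick worked check of these two features (standalone re-implementations mirroring the listing above):

```python
# Worked check of the Wallace rule, 2*(A+T) + 4*(G+C), and GC fraction
# on a toy 4-mer; these mirror _gc_frac / _tm_wallace in features_basic.py.
def gc_frac(seq: str) -> float:
    return (seq.count("G") + seq.count("C")) / len(seq) if seq else float("nan")

def tm_wallace(seq: str) -> int:
    return 2 * (seq.count("A") + seq.count("T")) + 4 * (seq.count("G") + seq.count("C"))

print(gc_frac("ATGC"), tm_wallace("ATGC"))  # 0.5 12
```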
Thermo Coverage Reporter: thermo_vrna.py

Purpose: Summarize availability and uniqueness of thermodynamic columns to verify feature coverage and catch data/tooling gaps per dataset or family.

def report_thermo_coverage(df: pd.DataFrame) -> pd.DataFrame:
    """
    Convenience: return a small profile (non-NaN counts and unique counts) for the thermo columns.
    """
    cols = [c for c in THERMO_COLS if c in df.columns]
    if not cols:
        return pd.DataFrame(columns=["column", "non_nan", "n_rows", "n_unique"])
    return df_profile(df, cols)
Pairs with Results diagnostics when thermo tools/inputs vary by family.
Reproducibility & Hygiene Helpers: utils.py

Purpose: Seed all RNGs deterministically and provide a small observed-vs-predicted plotting utility so runs are repeatable and diagnostics are trivial to generate.

def set_global_seed(seed: int = 42) -> None:
    try:
        import torch  # type: ignore
        torch.manual_seed(seed); torch.cuda.manual_seed_all(seed)
        torch.backends.cudnn.deterministic = True; torch.backends.cudnn.benchmark = False
    except Exception:
        pass
    np.random.seed(seed); random.seed(seed)

@contextmanager
def timeblock(label: str):
    t0 = time.time(); log(f"Time  start: {label}")
    try:
        yield
    finally:
        dt = time.time() - t0; log(f"done:  {label} ({dt:.2f}s)")

def save_obs_vs_pred_plot(obs: pd.Series, pred: pd.Series, out_path, title="Observed vs Predicted %KD",
                          xlab="Observed %KD", ylab="Predicted %KD", lims=(0.0, 100.0)) -> None:
    out_path = to_path(out_path); ensure_dir(out_path.parent)
    plt.figure(); plt.scatter(obs, pred); plt.plot(lims, lims)
    plt.xlim(lims); plt.ylim(lims); plt.xlabel(xlab); plt.ylabel(ylab); plt.title(title)
    plt.savefig(out_path, bbox_inches="tight"); plt.close()
Used throughout training/evaluation blocks to keep runs reproducible and auditable.

Context

TDP-43 is a DNA/RNA-binding protein associated with neurodegenerative diseases like amyotrophic lateral sclerosis (ALS) and frontotemporal dementia (Suk et al., 2020). Its C-terminal domain (CTD) is intrinsically disordered in these diseased states, often contributing to aggregation within cells (Jo et al., 2020). These disordered regions are also typically accessible due to protein misfolding, offering a promising target region for aptamer binding. In this study, we used the AptaTrans framework to generate RNA aptamers in silico and scored them by predicted binding affinity for the disordered segment of the TDP-43 CTD.

Design Pipeline

Identifying Target TDP-43 CTD Region

To ensure that the aptamers we generated and scored were most likely to target experimentally identified disordered segments of the TDP-43 CTD, we used the IUPred2A disorder prediction tool to score the human TDP-43 residue chain, sourced from the UCSC Genome Browser. IUPred2A scores residues by estimating the pairwise interaction energies between amino acids along the sequence and predicting whether a residue is likely to participate in a stable folded structure (Meszaros et al., 2018). It operates on the principle that intrinsically disordered regions lack the favorable interaction energies needed to form a stable tertiary structure. Each residue is assigned a score between 0 and 1, where higher scores indicate a greater likelihood of disorder. Residues with disorder scores ≥ 0.5 were considered disordered (Zhang et al., 2024). We selected the longest consecutive stretch of disordered residues as the aptamer target region for input into the AptaTrans software.
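The selection of the longest consecutive disordered stretch can be sketched as follows (the scores in the example are hypothetical, not actual IUPred2A output):

```python
def longest_disordered_run(scores, threshold=0.5):
    """Return (start, end) inclusive indices of the longest consecutive run
    of residues with disorder score >= threshold, or None if no residue
    passes the threshold."""
    best_len, best_start, run_start = 0, None, None
    for i, s in enumerate(list(scores) + [-1.0]):  # sentinel flushes the final run
        if s >= threshold:
            if run_start is None:
                run_start = i
        else:
            if run_start is not None and i - run_start > best_len:
                best_len, best_start = i - run_start, run_start
            run_start = None
    if best_start is None:
        return None
    return best_start, best_start + best_len - 1
```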

Aptamer Candidate Generation

The AptaTrans deep learning framework combines pretrained transformer-based encoders with a Monte Carlo Tree Search (MCTS) optimization scheme (Shin et al., 2023). MCTS grows partial sequences one nucleotide at a time, scores each partial chain with the built-in deep learning model, and backpropagates that feedback to refine the search tree for the next nucleotide choice, until up to 40 bases have been added to the sequence. This feedback loop steers each generation run toward higher-scoring aptamers.

Optimization and Evaluation — Iterations

To further optimize the generated aptamers, we varied the number of MCTS iterations, seeking a balance between diversity and convergence on strong candidates. As shown in Figure 1, increasing iterations initially improved average aptamer scores, but oversampling beyond 10 iterations yielded diminishing returns: the search converged toward a narrow subset of sequences, reducing candidate diversity without significant gains in top scores. We therefore used n = 10 iterations for all subsequent generation runs.

Aptamer iterations vs. score
Figure 1: Line plot depicting the scores of the highest-ranking aptamers vs. number of MCTS iterations.
Optimization and Evaluation — Seeds

Additionally, to ensure robustness and identify high-performing MCTS starting states, we tested five random seeds (1001–1005). Seed 1004 produced the highest-scoring aptamers (Figure 2), so it was selected for subsequent generation runs.

Aptamer scores across random seeds
Figure 2: Ranked aptamer scores for different random seeds (1001–1005), each with 10 MCTS iterations.

Molecular Docking

Protein File Generation

Since our aptamers were developed to target the TDP-43 CTD, we first created modified TDP-43 Protein Data Bank (PDB) files; we used BioPython to trim the protein residue chain to include only residues 375-414, the identified CTD residues (covering both ordered and disordered segments). This focuses docking on aptamer binding to the CTD.
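Our trimming step used BioPython; a dependency-free sketch of the same idea filters ATOM records by the fixed-column residue number (the input lines here are synthetic):

```python
def trim_pdb_to_ctd(lines, lo=375, hi=414):
    """Keep coordinate records whose residue number (PDB columns 23-26)
    falls within [lo, hi]; non-coordinate records pass through unchanged.

    Illustrative stdlib-only version of the residue-range filter; the
    actual pipeline used BioPython for this step.
    """
    kept = []
    for ln in lines:
        if ln.startswith(("ATOM", "HETATM")):
            resseq = int(ln[22:26])  # residue sequence number, fixed columns
            if not (lo <= resseq <= hi):
                continue
        kept.append(ln)
    return kept
```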

Next, we used RNAComposer to create PDB files of our top three aptamers from the previously generated sequences. We assigned unique chain identifiers (A and B) to the two input molecules (TDP-43 CTD and RNA aptamer, respectively) to direct docking directionality. The chain identifiers mark the TDP-43 CTD as the receptor and the aptamer as the mobile ligand, so during rigid-body docking the TDP-43 CTD remains fixed while the RNA aptamer rotates and translates around the CTD.

AIR Table and Fine-Tuning

Guided by IUPred2A disorder predictions for TDP-43, we created an ambiguous interaction restraints (AIR) table to bias sampling towards the disordered segments of the CTD residue chain. Active residues were defined as the residues within the disordered CTD segment (A:376-382) and passive residues were defined as all other sequence neighbors. On the aptamer, bases facing the putative interface were marked active (B:6-12), and all other neighbors were automatically marked passive.
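Generating such a restraint table can be sketched as below; the CNS-style `assign` statement is the common HADDOCK AIR form, but the exact syntax should be checked against the HADDOCK version used (`air_restraint` is an illustrative helper):

```python
def air_restraint(active_res, segid_a, partner_residues, segid_b,
                  d=2.0, d_minus=2.0, d_plus=0.0):
    """One ambiguous restraint: an active residue on chain A restrained to
    any of the partner residues on chain B (CNS-style 'assign' statement)."""
    partner_sel = " or ".join(
        f"(resid {r} and segid {segid_b})" for r in partner_residues)
    return (f"assign (resid {active_res} and segid {segid_a}) "
            f"({partner_sel}) {d} {d_minus} {d_plus}")

# AIRs from the disordered CTD actives (A:376-382) to the aptamer interface (B:6-12)
tbl = "\n".join(air_restraint(r, "A", range(6, 13), "B") for r in range(376, 383))
```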

Model Scoring

Finally, we generated the docked complexes (visualized in PyMOL) and scored them according to the following guidelines.

Name | Meaning | Ideal State
HADDOCK Score | Weighted combination of energy terms and restraints | Lower (more negative)
Van der Waals Energy | Attractive/repulsive forces between binding atoms | Lower
Electrostatic Energy | Coulombic attraction/repulsion at the binding site | Large negative
Buried Surface Area | Area buried at the interaction site | Higher
Restraints Energy | Adherence to AIR restraints | Lower
Docking metrics and desired directions.

Some key assumptions of the model include single-interface binding and the presence of flexible residues concentrated at the CTD window.

Results

Aptamer A Docking Scores

Structure Rank | HADDOCK score [a.u.] | Van der Waals Energy | Electrostatic Energy | Restraints Energy | Desolvation Energy | Interface RMSD [Å] | Ligand RMSD [Å] | Interface-ligand RMSD [Å] | Fraction of Common Contacts | DOCKQ | Buried Surface Area [Å²] | Structure ID
1 | -92.54 | -34.6 | -64.52 | 45.57 | 10.94 | 0 | 0 | 0 | 1 | 1 | 892.71 | 5
2 | -88.44 | -31.33 | -58.76 | 3.49 | 9.71 | 0.86 | 1.51 | 1.68 | 0.82 | 0.85 | 841.61 | 3
3 | -74.78 | -19.06 | -61.46 | 33.68 | 9.98 | 6.8 | 34.05 | 21.67 | 0.27 | 0.13 | 760.66 | 4
4 | -44.87 | -18.92 | -28.24 | 57.39 | 2.8 | 5.48 | 28.81 | 14.97 | 0.23 | 0.13 | 625.11 | 2
5 | -36.67 | -23.94 | -14.44 | 42.9 | 4.45 | 4.89 | 16.13 | 11.15 | 0.41 | 0.24 | 702.74 | 1
Aptamer A: Docking scores and interface metrics (HADDOCK).

Aptamer B Docking Scores

Structure Rank | HADDOCK score [a.u.] | Van der Waals Energy | Electrostatic Energy | Restraints Energy | Desolvation Energy | Interface RMSD [Å] | Ligand RMSD [Å] | Interface-ligand RMSD [Å] | Fraction of Common Contacts | DOCKQ | Buried Surface Area [Å²] | Structure ID
1 | 12.03 | -44.89 | -107.51 | 1643.81 | 13.49 | 0 | 0 | 0 | 1 | 1 | 1343.54 | 3
2 | 17.8 | -27.18 | -89.48 | 1364.14 | 8.91 | 25.53 | 36.2 | 36.51 | 0.06 | 0.04 | 1086.18 | 1
3 | 67.35 | -30.75 | -75.26 | 1745.53 | 9.53 | 3.9 | 8.82 | 8 | 0.52 | 0.38 | 1071.32 | 2
4 | 68.55 | -44.16 | -55.91 | 1717.46 | 9.57 | 24.82 | 37.08 | 39.73 | 0.06 | 0.04 | 1270.12 | 4
5 | 102.49 | -41.36 | -44.55 | 1926.73 | 8.74 | 25.23 | 39.22 | 42.14 | 0.09 | 0.04 | 1300.99 | 5
Aptamer B: Docking scores and interface metrics (HADDOCK).

Aptamer C Docking Scores

Structure Rank | HADDOCK score [a.u.] | Van der Waals Energy | Electrostatic Energy | Restraints Energy | Desolvation Energy | Interface RMSD [Å] | Ligand RMSD [Å] | Interface-ligand RMSD [Å] | Fraction of Common Contacts | DOCKQ | Buried Surface Area [Å²] | Structure ID
1 | -88.29 | -11.79 | -74.77 | 7.16 | 3.59 | 0 | 0 | 0 | 1 | 1 | 603.41 | 4
2 | -79.5 | -22.98 | -53.47 | 1.5 | 4.11 | 1.63 | 2.76 | 2.8 | 0.83 | 0.73 | 730.61 | 3
3 | -68.69 | -20.18 | -49.92 | 22.94 | 5.68 | 1.61 | 4.61 | 3.4 | 0.89 | 0.71 | 656.82 | 5
4 | -63.67 | -19.52 | -43.13 | 0.72 | 4.8 | 1.91 | 11.3 | 7.31 | 0.83 | 0.52 | 588.58 | 1
5 | -44.97 | -21.82 | -25.87 | 114.39 | -1.98 | 3.14 | 11.71 | 7.2 | 0.72 | 0.42 | 674.37 | 2
Aptamer C: Docking scores and interface metrics (HADDOCK).
TDP-43 CTD + Aptamer

Docked complex visualization (PyMOL).

Protein–protein interactions

Interaction map for the aptamer complex.


Final Aptamer Scoring

Finally, we developed a novel computational model that combines learning-based design (AptaTrans scores) with physics-based docking (HADDOCK scores) to prioritize aptamers that are both predicted to be effective and physiologically relevant. Our model outputs a final Aptamer Priority Index (API) score based on the above scores.

Model Inputs

  • For each aptamer i:
    • ATi — AptaTrans score
    • HSi — HADDOCK score for the top cluster
    • BSAi — Mean buried surface area (top cluster)
    • ni — Cluster size (used as a tiebreaker)

Each metric is normalized to contribute appropriately to the final API score.

AptaTrans score normalization:

\[ AT_i = \frac{AT_i - 0}{100 - 0} = \frac{AT_i}{100} \]

Docking score normalization:

\[ (-HS)_i = \frac{ -\text{score}^{\text{best}}_i - \min_j\!\left(-\text{score}^{\text{best}}_j\right) } { \max_j\!\left(-\text{score}^{\text{best}}_j\right) - \min_j\!\left(-\text{score}^{\text{best}}_j\right) + \varepsilon } \]

where \( \text{score}^{\text{best}}_i \) is the most negative HADDOCK score among the docked structures for aptamer \(i\).

BSA normalization:

\[ BSA_i = \frac{ BSA^{(\text{top K})}_i - \min_j BSA^{(\text{top K})}_j } { \max_j BSA^{(\text{top K})}_j - \min_j BSA^{(\text{top K})}_j + \varepsilon } \]

with \( BSA^{(\text{top K})}_i \) representing the mean BSA over the top K best-scoring structures (K = 3).

Final API
\[ API_i = 0.50\,AT_i \;+\; 0.40\,(-HS)_i \;+\; 0.10\,BSA_i \]
Aptamer ID | AptaTrans Score | Best HADDOCK Score | BSA mean (top-K) | API Score | Rank
APT-A | 0.9831 | -92.54 | 831.66 | 0.925 | 1
APT-C | 0.9727 | -88.29 | 663.61 | 0.870 | 2
APT-B | 0.9775 | +12.03 | 1167.01 | 0.589 | 3
Aptamer ranking by API (AptaTrans, HADDOCK, BSA).
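The API computation can be reproduced from the table's inputs (a standalone sketch; note the AptaTrans scores in the table already lie on [0, 1], so the /100 normalization step is a no-op here):

```python
import numpy as np

# Inputs taken from the table above
names = ["APT-A", "APT-B", "APT-C"]
at  = np.array([0.9831, 0.9775, 0.9727])   # AptaTrans scores (already on [0, 1])
hs  = np.array([-92.54, 12.03, -88.29])    # best HADDOCK score per aptamer
bsa = np.array([831.66, 1167.01, 663.61])  # mean BSA over top-K structures

def minmax(v, eps=1e-9):
    """Min-max normalization with the small eps guard from the formulas."""
    v = np.asarray(v, dtype=float)
    return (v - v.min()) / (v.max() - v.min() + eps)

# API = 0.50*AT + 0.40*(-HS, normalized) + 0.10*(BSA, normalized)
api = 0.50 * at + 0.40 * minmax(-hs) + 0.10 * minmax(bsa)
# rounds to 0.925 (APT-A), 0.589 (APT-B), 0.870 (APT-C), matching the ranking
```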

The best overall candidate, according to our model, is Aptamer A: it demonstrates the strongest docking (HADDOCK score: -92.54) with a moderate interface area (BSA: 832 Ų). Furthermore, Aptamer A best satisfies the AIR restraints (reflected in its low restraint penalty, Table 4), making it the strongest candidate for in vitro testing.

Aptamer C showed slightly weaker docking than Aptamer A (HADDOCK score: -88.29) and a smaller interface (BSA: 664 Ų). Although Aptamer B had the largest interface (BSA: 1167 Ų), it incurred a large restraint penalty (Table 5), yielding a poor, positive HADDOCK score (+12.03) and the lowest API score.

Discussion

Overall, our computational modeling creates an essential bridge between theoretical design and experimental validation. Our pipelines enable the prioritization of effective therapeutic candidates in silico, saving valuable time and wet-lab resources. The three main components of our design, the ASO scoring pipelines, the machine learning regression model, and the aptamer docking simulations, together point to possible therapeutic mechanisms against ALS and equip future researchers to investigate solutions to other neurodegenerative diseases more effectively.

Key Results

The ASO design model successfully filtered hundreds of candidate oligonucleotides into a short list of top-performing gapmers. Incorporating parameters such as GC content, conservation, and binding accessibility ensured that our designs were both biophysically stable and specific to the intended target RNAs. Similarly, the aptamer design pipeline leveraged both deep-learning (AptaTrans) and physics-based docking (HADDOCK) to identify RNA aptamers with strong binding to the disordered CTD of TDP-43. Through our novel scoring method, Aptamer A was selected as the best candidate, balancing docking stability and structural plausibility.

Our mechanistic regression model extended these findings by quantitatively linking ASO biophysical features to their experimental knockdown efficacy. Using features such as binding free energy, target accessibility, and GC fraction, the model captured consistent trends: ASOs with stronger binding affinity and higher accessibility correlated with more effective knockdown. Despite limited training data, cross-validation indicated moderate predictive power, with biologically interpretable coefficients aligning with expected molecular mechanisms.

Limitations

While our models significantly streamlined therapeutic design, they also come with limitations. For ASO design, penalties against homopolymers and high-GC content reduce off-target risks, but may also eliminate sequences that could perform well in cellular contexts. Our machine learning regression was trained on a relatively small dataset, which constrained model generalizability and introduced instability in predictive performance. For aptamer docking, rigid-body assumptions and reliance on disorder predictions mean that in vivo structural dynamics may differ from our in silico results. More extensive experimental datasets and iterative model refinement will be critical to increase robustness.

Therapeutic Applicability

Despite these constraints, our modeling framework has strong translational relevance. The ASO pipeline can be easily adapted to target other disease genes, offering a reusable framework for antisense therapy design. The regression model provides a mechanistic foundation for predicting knockdown efficiency, a crucial step toward building rational design pipelines for personalized ASO therapies. Finally, the aptamer design and docking models illustrate how deep-learning–guided aptamer generation can be coupled with physics-based docking to yield physiologically relevant binders. Together, these approaches reduce trial-and-error in therapeutic development and accelerate the path from computational prediction to clinical application.

Conclusion and References

Through our aptamer design pipeline, molecular docking, and novel computational model for characterizing top-scoring aptamers, we have identified the most biophysically plausible aptamers for limiting TDP-43 aggregation in patients with ALS. Like our ASO model, this pipeline is cost- and time-effective. Furthermore, it is easily reproducible, making it a valuable contribution to the iGEM community.

References:

  1. Suk, T. R., & Rousseaux, M. W. C. (2020). The role of TDP-43 mislocalization in amyotrophic lateral sclerosis. Molecular Neurodegeneration. https://doi.org/10.1186/s13024-020-00397-1
  2. Mészáros, B., et al. (2018). IUPred2A: Context-dependent prediction of protein disorder as a function of redox state and protein binding. Nucleic Acids Research, 46(W1), W329–W337. https://doi.org/10.1093/nar/gky384
  3. Shin, I., et al. (2023). AptaTrans: A deep neural network for predicting aptamer-protein interaction using pretrained encoders. BMC Bioinformatics, 24, 447. https://doi.org/10.1186/s12859-023-05577-6
  4. Zhang, X., et al. (2024). Landscape of intrinsically disordered proteins in mental disorder diseases. Computational and Structural Biotechnology Journal, 23, 3839–3849. https://doi.org/10.1016/j.csbj.2024.10.043
  5. Pérez, G., et al. (2025). The UCSC Genome Browser database: 2025 update. Nucleic Acids Research. https://doi.org/10.1093/nar/gkae974
    Session link for reproducibility: UCSC Genome Browser session (chr19:1,435,544–1,435,764; hg38)