Model

Our model comprises three main components that together form a multi-phase solution to ALS neurodegeneration. Find out more below!

ASO Creation

Our development and modification of a novel pipeline to generate antisense oligonucleotides for protein knockdown

ASO Predictive Model

Our biggest contribution to iGEM: an accurate machine learning model that predicts the efficacy of an ASO in knocking down a protein given its sequence

Aptamer Creation

Our development of a novel method to create protein-specific RNA aptamers to prevent TDP-43 aggregation

Introduction

Computational modeling is an essential forerunner for in vitro and in vivo research experimentation, because it provides a cost- and time-effective, scalable mechanism to test experimental designs and validate hypotheses. Our project centers around the development of many novel parts, such as antisense oligonucleotides (ASOs) and aptamers. Computational modeling was crucial for our research; we designed and scored many ASOs and aptamers in silico to quantify their efficacy in preventing ALS before testing our most-effective designs in vitro. We also developed a linear model predicting ASO-based protein knockdown based on sequence features using our wetlab data for future incorporation into this novel pipeline.

TAR DNA-binding protein 43 (TDP-43) aggregation is a pathological hallmark of amyotrophic lateral sclerosis (ALS; Suk et al., 2020). Evidence suggests that oxidative stress and high local concentrations of TDP-43 in stress granules (SGs) are prerequisites for TDP-43 aggregation (Dewey et al., 2012). One protein recruited to these stress granules is DAZAP1, which is also known to enhance the accumulation of RNA stress granules once they have begun to aggregate (An et al., 2021). Another protein responsible for the stability of stress granules is FAM98A, which is known to localize around stress granules, leading to worsened aggregation (Ozeki et al., 2019).

Design Pipeline

Our ASOs were designed to knock down key proteins (abbreviated DAZ, FAM, and SND) that are implicated in stress granule formation in neurodegenerative diseases such as ALS. We adapted a computational pipeline for ASO generation (Yeo, 2024); this pipeline incorporates sequence tiling, biophysical filtering, conservation scoring, specificity mapping, and chemical modifications to assign a numerical score for each candidate ASO. Throughout the pipeline, we parameterized specific scoring and development factors to create the most effective ASOs for our target protein knockdowns.

ASO Scoring

Scoring Factors

Factor Description
GC Content: Fraction of nucleotides in the sequence that are guanine (G) or cytosine (C). We altered the pipeline to use the range 0.4–0.6.

Since G and C pair via three hydrogen bonds (as opposed to A–T, which pairs using only two bonds), higher GC content corresponds to increased stability and resistance to denaturation. We therefore optimize GC content to filter out weak ASOs and excessively stable duplexes (which may increase risk of off-target binding).

\[ GC = \frac{\#\{G,C\}}{\text{sequence length}} \]

Homopolymers: Long runs of consecutive identical bases. We altered the pipeline to penalize “CCCCC” and “GGGGG.”

Homopolymers are harder to synthesize chemically and often form unusual secondary structures. This can lower binding specificity, so the model penalizes sequences containing these runs.
Conservation: phastCons score measuring how conserved a sequence is across species. We chose to use a conservation threshold of 0.9.

Highly conserved sequences are more likely to have increased functional importance and relevance, so our model penalizes sequences that are not highly conserved (Perez et al., 2025).

\[ PC = \frac{1}{L} \sum_i \text{phastCons}(i) \]

Secondary Structure: Prediction of the accessibility of the ASO's target binding region.

If the target binding region of the mRNA (to which the ASO is complementary) is predicted to be inaccessible due to RNA secondary structure (such as hairpin loops), the ASO will be ineffective in knocking down the target protein. Therefore, ASOs complementary to inaccessible mRNA regions are penalized.
Mapping Specificity: Score determining binding specificity to a single genomic locus. We limited the number of mismatches to at most 1.

Determination of number of genomic sites:

\[ \text{mappednum} = 1 + \big(\text{number of commas in mapping string}\big) \]

Each ASO sequence is aligned to the human genome and tested for binding (Perez et al., 2025). If the ASO binds at multiple locations on the genome, it is penalized for nonspecific binding (with the penalty increasing as specificity decreases).

\[ \text{Penalty}_{\text{map}} = 100 \times (\text{mappednum} - 1) \]

Table 1: Scoring Factors and Definitions
Antisense Oligonucleotide Scoring Algorithm
Fig 1: Scoring Algorithm Flowchart

Each scoring block subtracts an equal number of points from the original score of 1000 to ensure that each characteristic is weighted equally to the rest.
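As a concrete illustration, this subtract-from-1000 scheme can be sketched as follows. The function name and the per-block penalty of 200 points are our assumptions for the example, not the pipeline's actual code:

```python
# Illustrative sketch of the equal-weight, subtract-from-1000 scoring scheme.
# The per-block penalty (200) is an assumed value, not the pipeline's constant.

def score_aso(seq, phastcons=1.0, mappednum=1, accessible=True, block_penalty=200):
    """Start at 1000 and subtract a fixed penalty per failed check."""
    score = 1000
    gc = sum(b in "GC" for b in seq) / len(seq)
    if not 0.4 <= gc <= 0.6:              # GC-content window used in the pipeline
        score -= block_penalty
    if "CCCCC" in seq or "GGGGG" in seq:  # penalized homopolymers
        score -= block_penalty
    if phastcons < 0.9:                   # conservation threshold
        score -= block_penalty
    if not accessible:                    # target region predicted inaccessible
        score -= block_penalty
    score -= 100 * (mappednum - 1)        # mapping-specificity penalty
    return score
```

With these assumed weights, an ASO with in-range GC, no homopolymers, high conservation, an accessible target, and a unique genomic hit keeps the full 1000 points.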

Chemical Modifications

Since unmodified ASO sequences (20-base RNA sequences) are unlikely to survive harsh, nuclease-rich intracellular conditions, top-scoring candidate ASOs were transformed into gapmers to increase nuclease resistance and binding affinity. The pipeline uses 2′-O-methoxyethyl ribose (2MOEr) and 5-methylcytosine substitution (iMe-dC) modifications.

Modification Name Modification Process Purpose
2′-O-methoxyethyl ribose (2MOEr) Alteration of 2′ ribose sugar Increases binding strength and resistance to nucleases
iMe-dC (5-methylcytosine substitution) Replaces cytosine bases in “DNA gap” with methylated cytosine Improves binding affinity and reduces immune stimulation

We altered the pipeline to use a 5-10-5 gapmer design, which modifies positions 1–5 (5′ flank) with 2MOEr, the cytosines of positions 6–15 (DNA gap) with iMe-dC substitutions, and positions 16–20 (3′ flank) with 2MOEr. This allowed us to prepare physiologically relevant ASOs for in vitro testing.
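A minimal sketch of annotating a 20-mer under this 5-10-5 layout (assuming 1-based positions 1–5 / 6–15 / 16–20; the function and labels are illustrative, not the pipeline's code):

```python
# Sketch: annotate a 20-mer as a 5-10-5 gapmer. Wings get 2MOEr; cytosines in
# the DNA gap become 5-methylcytosine (iMe-dC). Labels are illustrative.

def gapmer_pattern(seq):
    assert len(seq) == 20, "5-10-5 design assumes a 20-base sequence"
    wing5, gap, wing3 = seq[:5], seq[5:15], seq[15:]
    mods = (["2MOEr"] * 5
            + ["iMe-dC" if b == "C" else "DNA" for b in gap]
            + ["2MOEr"] * 5)
    return wing5, gap, wing3, mods
```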

Conclusion and References

Without this model, we would have had to test hundreds of unfiltered ASOs experimentally. The model helped us focus on the most promising candidates and saved significant wet-lab time. Thus, our ASO design pipeline allowed us to develop and test the most effective ASOs for use in in vitro experimentation with limited resources. This pipeline can also be easily adapted by other scientists to target different disease genes, making it a reusable modeling framework.

References:

  1. Suk, T. R., & Rousseaux, M. W. C. (2020). The role of TDP-43 mislocalization in amyotrophic lateral sclerosis. Molecular Neurodegeneration, 15(45). https://doi.org/10.1186/s13024-020-00397-1
  2. Dewey, C. M., et al. (2012). TDP-43 aggregation in neurodegeneration: Are stress granules the key? Brain Research, 1462, 16–25. https://doi.org/10.1016/j.brainres.2012.02.032
  3. An, H., et al. (2021). Compositional analysis of ALS-linked stress granule-like structures reveals factors and cellular pathways dysregulated by mutant FUS under stress. bioRxiv. (Preprint.) https://doi.org/10.1101/2021.03.02.433611
  4. Ozeki, K., et al. (2019). FAM98A is localized to stress granules and associates with multiple stress granule-localized proteins. Molecular and Cellular Biochemistry, 451(1–2), 107–115. https://doi.org/10.1007/s11010-018-3397-6
  5. Yeo Lab. (n.d.). GitHub organization page. GitHub. Accessed 7 Oct. 2025. https://github.com/yeolab
  6. UCSC Genome Browser on Human (GRCh38/hg38), position chr19:1,435,544–1,435,764. University of California, Santa Cruz. Accessed 7 Oct. 2025. https://genome.ucsc.edu

Model Abstract

Predicting protein knockdown (%KD) from an antisense oligonucleotide (ASO) sequence is essential for prioritizing candidates before expensive wet-lab testing. We developed a biophysics-informed machine-learning regression model that integrates sequence composition (k-mers, GC, wing–gap GC), thermodynamics (RNA:DNA nearest-neighbor hybridization), and target accessibility (RNAplfold ED) via the net energy relationship ΔG_eff = ΔG_duplex − ED, then uses a minimal calibration strategy: one per-family k-shot offset to correct family-level bias. Across phases, wet-lab error on DAZ/FAM (n≈8) fell from 19.9 to 5.16 %KD as features and choice points were added: 19.9 → 11.5 → 9.25 → 8.83 → 10.0 → 6.67 → 5.16 %KD (feature labels: Seq → Thermo+map → Thermo+Access+Seq (sim) → Micro-tune (sim) → ΔG_eff + close family → ATXN2-anchored → ATXN2-anchored + k-shot). The final model’s before/after calibration plots show that k-shot converts a strong ranker into an accurate regressor: points realign to the identity line, MAE drops sharply, and the learned δ_g offsets match the observed family-specific bias.

We then ran multi-family evaluations (Phase 8/9) with LOSO/k-holdout across ATXN2, SOD1, DAZAP1, FAM98A, comparing HGBR/GBR/Ridge and optional isotonic_by_family. Best per-family MAEs were ATXN2 12.57, DAZAP1 5.09, FAM98A 11.25, SOD1 13.14 (weighted overall ≈9.88), and, because evaluation aggregates across folds, R² and Spearman’s ρ are reported alongside MAE. Error analysis confirms centered residuals post-calibration, narrow bootstrap MAE for DAZ, and variance-limited behavior for FAM98A (n=2). Feature probes (permutation importance, partial dependence) highlight a compact set of interpretable numeric signals—regional GC context, motif aggregates, and, when computed, ΔG_eff/accessibility—that drive predictions in directions consistent with ASO design principles (SantaLucia and Hicks, 2004; Lorenz et al., 2011; Bennett and Rigo, 2017).[1], [2], [3]

Together, these results provide a reproducible, small-n-robust recipe—physics-anchored features + anchored transfer + one-shot calibration—that achieves ~5–6%KD accuracy on held-out wet-lab families and generalizes across gene families with transparent error quantification, making the model readily usable by others with a single calibrator per new family.

Interactive Model

[Interactive model widget: enter an ASO sequence and choose k-shot and split settings; the widget reports the selected config, MAE, number of points, and Spearman’s ρ.]

Introduction

A central strategy in neurodegeneration therapeutics is lowering levels of disease-driving proteins. In amyotrophic lateral sclerosis (ALS), cytoplasmic TDP-43 aggregation and stress-granule biology are tightly linked to pathogenesis: TDP-43 inclusions often co-localize with stress-granule markers, with chronic stress promoting stress-granule assembly and TDP-43 mislocalization (Dewey et al., 2012; Ding et al., 2021; Jo et al., 2020).[4], [5], [6] Antisense oligonucleotides (ASOs) offer a direct, RNA-based method to reduce pathogenic proteins by binding target RNA and recruiting RNase H or by blocking translation/splicing in a steric manner (Crooke, 2018; Shen et al., 2018; Dhuri et al., 2020).[7], [8], [9] If we can predict protein knockdown (%KD) for a candidate ASO reliably, we can prioritize designs that are more likely to suppress protein expression, lessen stress-granule burden, and ultimately reduce TDP-43 aggregation and downstream neurotoxicity (Dewey et al., 2012; Jo et al., 2020).[4], [6]

Wet-lab screening is an essential but expensive and time-consuming element of therapeutic development. For a single gene alone, there are dozens to hundreds of plausible 20-mer windows, multiple chemical patterns (e.g., gapmer 5–10–5 with different sugar/backbone choices), and many dose/assay cellular contexts. Purely lab-based experimentation and iteration through these combinations is not feasible. Thus arises the need for a computational model that provides evidence-based, per-sequence %KD estimates to cut cost and time, allowing researchers to focus wet-lab experiments on the most promising designs. This makes it practical to explore more targets for therapeutic development (Crooke, 2018; Roberts et al., 2020).[7], [10] Modeling is also a reproducible way to unify heterogeneous literature and in-house measurements, leveraging and building upon previous findings during the design cycle (Schoch & Miller, 2017; Dhuri et al., 2020).[11], [9]

There is already meaningful research on ASO/siRNA activity prediction. Early approaches relied on rule sets and thermodynamic heuristics (e.g., composition, GC%, melting proxies, nearest-neighbor energies) and then moved to machine learning, from neural networks on sequence features (Chalk & Sonnhammer, 2002; Chalk, 2005)[12] to recent deep-learning designs that attempt end-to-end efficacy prediction (Hwang et al., 2024).[31] Design pipelines have improved allele selectivity and off-target assessment, especially for steric-block/splice-modulating ASOs (Kamola et al., 2015).[14] Reviews highlight steady advances in sequence chemistry (phosphorothioate backbones, 2′-sugar substitutions) and delivery (conjugates, formulations), and they emphasize that RNase-H gapmers remain central to neural therapeutics (Schoch & Miller, 2017; Shen et al., 2018; Roberts et al., 2020; Dhuri et al., 2020).[11], [8], [10], [9] At the same time, common limitations persist: small, heterogeneous datasets; mixed label reporting (exact %, ranges, qualitative terms); family/assay differences; and accessibility/off-target effects that complicate interpretation (Hagedorn et al., 2018; Yoshida et al., 2019).[15], [16]

Our framework addresses these practical limits in an ALS-relevant, small-n setting. First, we built a single, analysis-ready table by standardizing labels (exact %, range midpoints, High/Medium/Low → 80/55/20, clipped to [0,100]), harmonizing schema (ASO_ID, TargetGene, Sequence, %KD, Model, Chemistry, SourceDomain, Reference), and computing a compact, biologically grounded feature panel from the cleaned sequences. The panel balances interpretability and signal using composition (length, GC%, CpG density, max homopolymer), motifs (di-/tri-nucleotides), and gapmer regional GC content (the fraction of bases that are cytosine or guanine, which form stronger base pairs) for the 5–10–5 layout (5′ wing / DNA gap / 3′ wing). Where available, we also add a simple biophysics proxy

\[ \Delta G_{\text{eff}} \;=\; \Delta G_{\text{duplex}} \;-\; \mathrm{ED}, \]

to reflect the ineffectiveness of chemically sound ASOs that may be targeting structurally blocked mRNA regions (Crooke, 2018; Shen et al., 2018).[7], [8] This pairing of hybridization (nearest-neighbor duplex energetics) with accessibility (ensemble-defect/opening energy via RNA structure tools) connects features back to the biophysics of productive RNase-H binding (Shen et al., 2018; Lorenz et al., 2011).[8], [2]

Second, because the data is both family-structured and small, we favor a learning model that captures low-order non-linear interactions without overfitting. Therefore, we use a gradient-boosting regressor with shallow trees and slow learning, which naturally discovers interactions (e.g., motif × regional-GC content) and keeps importances and partial-dependence curves readable (Friedman, 2001; Hastie et al., 2009).[22], [23] To handle cross-family shifts, we add a family-aware calibration layer: a 1-shot offset from a single calibrator ASO in the new gene family. This removes systematic bias (e.g., families that run high/low) while preserving ranking; in our setting, this offset is more stable than monotone re-mapping when families are very small (Barlow et al., 1972).[24]

Third, we use pre-registered splits and metrics; stratified k-fold or Leave-One-Out Cross-Validation (LOOCV) when folds are small; mean absolute error (MAE) as the headline metric with \(R^2\) and Spearman’s \(\rho\) for context; and testing on true out-of-distribution (OOD) families (e.g., DAZAP1, FAM98A) to keep evaluation honest and effectively investigate if rules learned on SOD1/ATXN2 (sister families to the target proteins DAZAP1 and FAM98A) transfer to new genes. We also cache a versioned feature artifact so sweeps and re-tests are computationally streamlined and fair: all model variants use the same inputs; when upstream cleaning or tools change, we bump the feature-version tag and rebuild (Hastie et al., 2009).[23]

Finally, this model represents a shared, generalizable approach to computation-based wet-lab boosting. The pipeline combines current literature-based findings—interpretable sequence features; optional thermodynamics/accessibility; small-n-appropriate boosting and calibration; and careful validation— in a form others can adopt for different targets or diseases. The field is advancing quickly, from pharmacology reviews to new ASO design frameworks (Hwang et al., 2024; Liu et al., 2025; Horberg et al., 2024).[31], [30], [13] Our contribution is a measured, ALS-aware framework that connects data standardization to interpretable modeling, is realistic about small-n constraints, and is built to transfer across families—precisely where many generic models struggle (Hastie et al., 2009; Hagedorn et al., 2018; Yoshida et al., 2019).[23], [15], [16]

Methods

Biophysics Behind the Model

Antisense oligonucleotides (ASOs) reduce protein levels by binding to complementary mRNA strands that encode targeted proteins. The resulting RNA:DNA duplexes are then cleaved by RNase H1, breaking the target mRNA strand. Once the mRNA has been cleaved, the protein is no longer translated, lowering its concentration in the cell. The effectiveness of this process is reflected in protein knockdown (%KD), which depends on three interconnected factors: ASO binding strength, target mRNA site accessibility, and chemical structure.

Biophysical determinants of ASO-mediated protein knockdown
Fig 1: Biophysical determinants of ASO-mediated protein knockdown (Dhuri et al., 2020)
1. Duplex Stability and Binding Strength

The stability of an ASO–RNA hybrid is captured by the Gibbs free energy of duplex formation:

\[ \Delta G^\circ_{\text{duplex}} = \sum \Delta G^\circ_{\text{stack}}(b_s,b_{s+1}; \bar b_s,\bar b_{s+1}) + \Delta G^\circ_{\text{init}} + \Delta G^\circ_{\text{dangling}} + \Delta G^\circ_{\text{terminal}} \]

where stacked nearest-neighbor terms are parameterized from calorimetric measurements of RNA/DNA interactions (Sugimoto et al. 368–370; Mathews and Turner 90–92). A more negative \(\Delta G^\circ_{\text{duplex}}\) indicates stronger binding. The thermodynamic relationship [17], [18]

\[ \Delta G^\circ = RT \ln K_D \]

connects binding free energy to the dissociation constant \(K_D\), providing a direct physical bridge between calculated energetics and molecular affinity (Alberty and Silbey 143). Occupancy of the target transcript increases as \(K_D\) decreases (longer ASO residence time), raising the probability of mRNA cleavage. [19]

2. Target Accessibility

RNA strands are often found in complex 3D structures within the cell, and binding requires partial unfolding of the target. Accessibility is quantified as the “opening energy” derived from per-nucleotide unpaired probabilities \(p_{\text{unpaired}}(i)\):

\[ E_D(i) = -RT\ln p_{\text{unpaired}}(i), \qquad \overline{E_D} = \frac{1}{N}\sum_{i=1}^N E_D(i) \]

In our model, these probabilities are computed from the RNA secondary structure ensemble (Bernhart et al. 284). Accessible regions incur smaller energetic penalties, making them more favorable targets. [20]
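A small sketch of this computation, given per-nucleotide unpaired probabilities such as those produced by RNAplfold (the gas-constant units and 37 °C temperature are our assumptions for the example):

```python
import math

# Assumed constants: gas constant in kcal/(mol*K) and physiological temperature.
R, T = 1.987e-3, 310.15

def opening_energy(p_unpaired):
    """Per-site E_D(i) = -RT ln p_unpaired(i), plus the mean over the window."""
    ed = [-R * T * math.log(p) for p in p_unpaired]
    return ed, sum(ed) / len(ed)
```

A fully unpaired (perfectly accessible) window incurs zero opening energy; less accessible sites incur a growing positive penalty.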

3. Effective Binding Energy

To capture both stability and accessibility, we define an effective free energy:

\[ \Delta G^\circ_{\text{eff}} = \Delta G^\circ_{\text{duplex}} - \overline{E_D} \]

This penalizes inaccessible sites and better reflects whether an ASO will achieve functional knockdown (%KD) in cells (Mückstein et al.). [21]

4. Biological Implications

Since knockdown efficacy depends on the fraction of target mRNA strands bound by ASOs (occupancy), thermodynamics and accessibility together explain why some ASOs achieve high knockdown while others fail. Occupancy follows a Langmuir-style relationship:

\[ \theta = \frac{[A]}{[A] + K_D} \]

where \([A]\) is ASO concentration. Thus, lower \(K_D\) (more negative \(\Delta G\)) yields higher occupancy, improving RNase H–mediated cleavage and protein knockdown (Mathews and Turner 90). [18]
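Combining the two relationships above, occupancy can be computed directly from binding free energy; a sketch assuming kcal/mol energies and molar ASO concentrations:

```python
import math

# Assumed constants: gas constant in kcal/(mol*K) and 37 C.
R, T = 1.987e-3, 310.15

def occupancy(delta_g, aso_conc):
    """Langmuir occupancy theta = [A] / ([A] + K_D), with K_D = exp(dG/(RT))."""
    kd = math.exp(delta_g / (R * T))  # from dG = RT ln K_D
    return aso_conc / (aso_conc + kd)
```

As expected, a more negative ΔG (smaller K_D) or a higher ASO concentration yields higher occupancy.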

Feature Engineering Framework

The goal of feature engineering in this model is to transform the biological and chemical principles underlying ASO action into quantitative, computational descriptors that predict protein knockdown (%KD). In vitro and in vivo knockdown is achieved from the combined effects of ASO sequence composition, duplex stability, RNA accessibility, and chemical structure. Each feature group was chosen because it influences whether an ASO will successfully engage its target transcript, recruit RNase H, and thereby reduce protein expression, as explained above.

1. Sequence-Level Features (Structural)
Basic composition (length, GC%, CpG density, homopolymer runs):

These simple descriptors reflect the inherent stability and specificity of oligonucleotides. For example, higher GC (guanine-cytosine) content stabilizes the ASO:RNA duplex, but excessive GC can promote off-target binding or self-structure (see ASO Design Pipeline, Sugimoto et al. 11211–11216). CpG content (the density of cytosines directly followed by a guanine along the strand, not to be confused with the C–G base pair) and homopolymer runs (long stretches of a single base) influence immune recognition and pharmacological tolerability, indirectly affecting achievable knockdown in cells (Crooke et al. 427–449).[17], [7]
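A minimal sketch of these composition descriptors (the exact definitions in our pipeline may differ slightly):

```python
import re

def composition_features(seq):
    """Length, GC fraction, CpG dinucleotide density, longest homopolymer run."""
    seq = seq.upper()
    gc = sum(b in "GC" for b in seq) / len(seq)
    cpg = seq.count("CG") / len(seq)  # non-overlapping CpG dinucleotides
    longest = max(len(m.group()) for m in re.finditer(r"(.)\1*", seq))
    return {"length": len(seq), "gc": gc,
            "cpg_density": cpg, "max_homopolymer": longest}
```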

Regional GC fractions (wing-gap-wing design):

Many ASOs are designed with “gapmer” structure. These ASOs employ a 5-10-5 design, with modified wings (5) flanking a DNA core (10). Gapmer ASOs often exhibit higher stability and binding strength. Thus, gapmer ASOs were associated with more efficacious binding in the model. Encoding GC fractions by region reflects how wing composition stabilizes binding, while the DNA gap mediates RNase H recruitment (Crooke et al. 430).[7] Protein knockdown depends on this balance: wings ensure persistent binding, while the gap determines cleavage efficiency.
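Encoding the regional GC fractions for a 5-10-5 20-mer can be sketched as (positions assumed 1–5 / 6–15 / 16–20):

```python
def regional_gc(seq):
    """GC fraction in the 5' wing, DNA gap, and 3' wing of a 5-10-5 20-mer."""
    assert len(seq) == 20, "5-10-5 layout assumes a 20-base sequence"
    frac = lambda s: sum(b in "GC" for b in s) / len(s)
    return frac(seq[:5]), frac(seq[5:15]), frac(seq[15:])
```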

K-mer frequencies (di- and trinucleotides):

Short motifs capture local sequence contexts that correlate with binding efficiency and accessibility. Certain trinucleotides (e.g., CGG, GGC) are known to influence secondary structure formation and may affect how efficiently an ASO engages its site (Kamola et al. 3066).[14] By including these, the model learns sequence motifs associated with higher or lower knockdown rates.

2. Thermodynamic Features
Duplex free energy (ΔGduplex, Tm):

Duplex stability, computed using nearest-neighbor thermodynamic models, is a first-order determinant of whether an ASO remains bound long enough for RNase H cleavage (Sugimoto et al. 11212; Mathews and Turner 191).[17], [18] Stronger duplexes (more negative ΔG) correlate with higher %KD, provided binding does not occur at inaccessible sites.

Mismatch penalties:

ASOs may tolerate single mismatches while retaining partial activity. Incorporating alignment penalties ensures the model accounts for graded effects on knockdown, rather than treating binding as all-or-none (Mathews and Turner 192). [18]

3. Accessibility Features
Unpaired probabilities and opening energy (ED):

RNA secondary structure can block ASO access to target sites. Opening energy, computed through RNAplfold (Bernhart et al. 614),[20] quantifies the cost of exposing the binding site. High accessibility (low ED) increases the likelihood of productive binding, directly enhancing knockdown efficiency.

Effective binding free energy (ΔGeff):

Defined as

\[ \Delta G^\circ_{\text{eff}} = \Delta G^\circ_{\text{duplex}} - \overline{E_D}, \]

this metric integrates stability and accessibility (Mückstein et al. 236).[21] ΔGeff better reflects the real probability of ASO–RNA engagement in cells, and thus correlates more strongly with %KD than duplex energy alone.

4. Biochemistry-Aware Features
Chemical modifications:

Backbone (phosphorothioate) and sugar (MOE, LNA, cEt) modifications to the structure of ASOs increase nuclease resistance and binding affinity, altering intracellular stability and activity (Crooke et al. 433).[7] Many ASOs used for training and testing in our model carry 2′-O-methoxyethyl ribose and 5-methylcytosine substitution modifications. Categorical encoding of chemistry allowed the model to adjust predictions according to these biochemical determinants of knockdown.

Central Role in Protein Knockdown
  • Sequence features capture motifs and architectures influencing cleavage efficiency.
  • Thermodynamic features determine whether binding is strong enough to sustain RNase H activity.
  • Accessibility features decide whether binding can occur in the first place.
  • Biochemistry-aware features reflect the chemical optimizations that enable durable in-cell knockdown.
Feature engineering framework for predicting ASO-mediated protein knockdown
Figure 2: Feature engineering framework for predicting ASO-mediated protein knockdown.

Machine Learning Framework

The goal of the modeling pipeline was to predict protein knockdown (%KD) for a given ASO and target proteins. Predicting %KD requires integrating heterogeneous biological descriptors into a computational model that balances accuracy with interpretability. Hence, we frame %KD prediction as supervised regression from ASO sequence plus biophysics- and chemistry-aware features. We implemented a systematic, phased progression pipeline that builds from a baseline sequence-only model to a final calibrated, interpretable, gradient-boosting regressor that integrates sequence, thermodynamics, accessibility, and biochemical descriptors. The framework is designed to be robust in a small-n, high-dimension regime, where overfitting is a significant risk, and it provides biologically interpretable predictions suitable for translational application.

Data Collection and Standardization

We built one clean table for ASO-mediated protein knockdown (%KD) by combining results from papers/patents and our wet-lab CSVs. Each row keeps what we need for learning and traceability: ASO_ID, TargetGene, unmodified Sequence (5′→3′, letters only for features), Observed_%KD (with optional KD_range_min/max and a binned label), brief Chemistry notes, Model (cell/animal/human), SourceDomain (Literature/WetLab), and a Reference.

Because sources report %KD differently, we use one simple mapping: exact values stay as is; ranges become the midpoint; “High/Medium/Low” → 80/55/20. Afterwards, we clip to [0,100]. For plots and splits, we use the same categorizations: High ≥70, Medium 40–69, Low <40. (See Supplement 1, Fig. 1 for the overall label distribution and Fig. 2B for per-target H/M/L counts.)
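This mapping can be sketched as follows (a simplified stand-in for our cleaning code; function names are illustrative):

```python
def harmonize_kd(value):
    """Exact values pass through; (lo, hi) ranges become midpoints;
    High/Medium/Low -> 80/55/20; everything is clipped to [0, 100]."""
    bins = {"High": 80, "Medium": 55, "Low": 20}
    if isinstance(value, str):
        kd = bins[value]
    elif isinstance(value, tuple):
        kd = (value[0] + value[1]) / 2
    else:
        kd = value
    return min(100.0, max(0.0, float(kd)))

def kd_bin(kd):
    """Categorization used for plots and splits: High >=70, Medium 40-69, Low <40."""
    return "High" if kd >= 70 else ("Medium" if kd >= 40 else "Low")
```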

Data pipeline: harmonize labels, build schema, engineer features
Fig. 3: Data Pipeline: Harmonize Labels, Build Schema, Engineer Features

We standardized names (genes, chemistry), removed non-ACGT characters before feature work (kept modification strings in metadata), and checked for outliers. We then computed a compact set of sequence features: composition (length, GC%, CpG count/density, max homopolymer), motifs (di- and tri-nucleotides), gapmer regional GC for 5–10–5 (5′ wing / DNA gap / 3′ wing), plus optional thermo/accessibility proxies (e.g., \(\Delta G_{\text{eff}}=\Delta G_{\text{duplex}}-\mathrm{ED}\)). (See Supplement 1, Fig. 3 for GC vs %KD, Fig. 4 for dinucleotides, and Fig. 5A–C for regional GC.)

The dataset spans the full %KD range and differs by family. For example, SOD1 and FUS show wider spread, ATXN2 trends higher in our literature set; thus, we subsequently used family-aware evaluation. Global GC vs %KD is weak, which is why we include richer features (motifs, structure, accessibility). Correlations and PCA support this: motif/composition clusters appear, and bins separate only partially. (See Supplement 1, Fig. 6 for the correlation heatmap and Fig. 7 for PCA; Fig. 8 shows optional 3-mer enrichment.)

For modeling, SOD1 and ATXN2 are our anchor literature families; DAZ1/DAZAP1 and FAM98A (wet-lab) are our out-of-distribution families for transfer and optional k-shot calibration—testing whether rules learned on well-documented families carry to new genes. Full details, methods, and all plots are in Supplement 1 — Data Collection & Standardization (see Figs. 1–8).

Model Architecture
1. Prediction task and data regime

We predict protein knockdown (%KD) for an ASO–gene pair as a supervised regression problem. Inputs are the outputs from the Data Collection & Standardization step (clean 5′→3′ sequence and derived descriptors), and the response is Observed_%KD, harmonized and clipped to [0,100] as described earlier. The dataset is small-n, family-structured, and biologically heterogeneous, so our learner must capture low-order non-linear interactions without overfitting while remaining readable to domain scientists. We increased complexity only when it fixed an observed failure mode: starting with sequence composition and motif features; adding 5–10–5 regional GC for the gapmer design; optionally including thermodynamics/accessibility proxies; and finally using a gradient-boosted tree regressor as the core learner (Crooke et al., 2017)[7].

2. Feature engineering (“compact panel”) and biological rationale

We use a compact, biologically grounded panel to control variance while preserving signal. Composition covers length, GC%, CpG count/density, and maximal homopolymer; motifs include di- and (optionally) tri-nucleotide counts. To reflect RNase H1 gapmer practice, we compute regional GC within the 5–10–5 layout (5′ wing / DNA gap / 3′ wing), which moderates activity across families. Where available, we add an accessibility-aware binding proxy,

\[ \Delta G_{\mathrm{eff}} \;=\; \Delta G_{\mathrm{duplex}} \;-\; \mathrm{ED}, \]

so sequences that look good by composition but are structurally blocked are down-weighted. Keeping the panel small is deliberate in a small-n model: it reduces dimensionality, helps avoid overfitting, and keeps error analysis tied to interpretable biology (Crooke et al., 2017)[7]. The complete mapping of feature → model variable → defaults → tool is listed in Supplement 3 (Feature Table).

3. Core learner and model form

We adopt gradient boosting with squared-error loss and shallow regression trees as the base learning scheme. After M stages, the fitted function is

\[ \hat{f}_M(x)\;=\;\sum_{m=1}^{M} \nu\,h_m(x), \]

where each \(h_m\) is a depth-limited tree fit to the current residuals, and \(\nu\in(0,1]\) is a learning-rate shrinkage that stabilizes training in small samples (Friedman, 2001; Hastie et al., 2009)[22], [23]. We compared this against baselines: ridge/lasso were too rigid for interaction-heavy signals and suffered collinearity even with polynomial expansions (Hastie et al., 2009)[23]; SVR (RBF) fit non-linearities but was sensitive to kernel/scale choices in small-n and produced less stable family-wise rankings (Smola & Schölkopf, 2004)[25]; an XGBoost-style booster matched accuracy only under heavier regularization and tighter tuning—once constrained, it did not offer practical gains over classical gradient boosting while adding complexity (Chen & Guestrin, 2016)[26]. Shallow, slow gradient boosting gave the best balance of flexibility, stability, and interpretability.
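To make the stage-wise form \(\hat f_M(x)=\sum_m \nu\,h_m(x)\) concrete, here is a toy pure-Python booster that fits depth-1 stumps to the current residuals with shrinkage. It is an illustration of the scheme on one feature, not our actual model (which uses a full gradient-boosting regressor on the complete feature panel):

```python
# Toy stage-wise boosting: each stage fits a single-threshold stump to the
# current residuals and is added with learning rate nu. Illustrative only.

def fit_stump(x, r):
    """Best single-split stump (threshold + two leaf means) for residuals r."""
    best = None
    for t in sorted(set(x)):
        left = [ri for xi, ri in zip(x, r) if xi <= t]
        right = [ri for xi, ri in zip(x, r) if xi > t] or [0.0]
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        sse = sum((ri - (lm if xi <= t else rm)) ** 2 for xi, ri in zip(x, r))
        if best is None or sse < best[0]:
            best = (sse, t, lm, rm)
    _, t, lm, rm = best
    return lambda xi, t=t, lm=lm, rm=rm: lm if xi <= t else rm

def boost(x, y, n_stages=200, nu=0.1):
    """f_M(x) = sum_m nu * h_m(x), each h_m fit to the current residuals."""
    stumps, pred = [], [0.0] * len(x)
    for _ in range(n_stages):
        r = [yi - pi for yi, pi in zip(y, pred)]
        h = fit_stump(x, r)
        stumps.append(h)
        pred = [pi + nu * h(xi) for pi, xi in zip(pred, x)]
    return lambda xq: sum(nu * h(xq) for h in stumps)
```

Shrinkage (small ν) means each stage corrects only a fraction of the remaining residual, which is what stabilizes training in small samples.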

4. Training procedure, hyperparameters, and evaluation

We stratify splits by %KD bin and enforce family-aware partitions so no family leaks into validation. We then run small sweeps on a low-variance frontier: max_depth = 4, learning_rate ≈ 0.01, n_estimators ≈ 1800, with min_samples_leaf = 1 and min_samples_split = 2. We explore max_depth ∈ {2,3,4,5}, learning_rate ∈ {0.003, 0.01, 0.03}, and grow up to ~2000 trees with early stopping when validation error plateaus. MAE is the headline metric (directly interpretable in %KD); we also report \(R^2\) for explained variance and Spearman’s \(\rho\) for rank fidelity. Deeper trees raised variance, and larger learning rates destabilized training; the selected configuration balanced bias–variance across seeds and family-aware splits (Friedman, 2001; Hastie et al., 2009)[22], [23].
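The plateau-based early stopping used in these sweeps can be sketched as a small helper that scans the staged validation errors; the patience and tolerance values here are illustrative defaults, not the exact settings of our runs.

```python
def best_stage(val_errors, patience=50, tol=1e-4):
    """Return (stage, error) at which validation error plateaued.

    Scans per-stage validation errors (e.g., MAE after each boosting round)
    and stops once no improvement larger than tol has occurred for
    `patience` consecutive stages.
    """
    best_i, best_e = 0, val_errors[0]
    for i, e in enumerate(val_errors):
        if e < best_e - tol:
            best_i, best_e = i, e
        elif i - best_i >= patience:
            break  # plateau detected: stop early
    return best_i, best_e
```

In a sweep, this is applied per configuration (depth, learning rate), and the configuration with the lowest plateaued validation MAE is kept.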

5. Calibration, interpretability, and limits

Even with a tuned booster, we saw family-level offsets. Rather than overfit the core learner, we apply a k-shot, family-aware additive offset learned from a single calibrator ASO:

\[ \tilde{y}_i \;=\; \hat{f}(x_i)\;+\;\delta_{g(i)}, \]

where \(g(i)\) indexes the target family. This removes family bias while preserving within-family orderings, akin to a random-intercept adjustment. Isotonic calibration helped only when families had enough samples; k-shot offsets were more stable and cost-effective in small-n (Barlow et al., 1972)[24]. Shallow trees aid interpretability (early splits often on regional GC, then motif aggregates). Limits mirror the data regime: very small families may retain bias without a calibrator; missing accessibility proxies raise uncertainty; and deeper trees/heavier kernels can improve in-sample fit while hurting cross-family generalization (Friedman, 2001; Hastie et al., 2009; Crooke et al., 2017)[22], [23], [7].
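A minimal sketch of the k-shot offset (function and variable names are ours): given one calibrator ASO per family, \(\delta_g\) is simply the signed error of the core model on that calibrator.

```python
def fit_family_offsets(f_hat, calibrators):
    """calibrators: {family: (x_calibrator, observed_kd)}.
    delta_g = observed - predicted on the single calibrator ASO."""
    return {g: y - f_hat(x) for g, (x, y) in calibrators.items()}

def predict_calibrated(f_hat, x, family, offsets):
    """y_tilde = f_hat(x) + delta_g; unseen families fall back to the raw prediction."""
    return f_hat(x) + offsets.get(family, 0.0)
```

Because the offset is a constant per family, within-family orderings are untouched; only the intercept moves.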

6. Cached feature artifacts for sweeps and re-tests

To keep sweeps fast and fair, we created a versioned cached feature artifact from the standardized outputs. After label cleaning and feature computation, we save a table keyed by (ASO_ID, TargetGene, sequence hash, feature-version tag). All model variants in a sweep load the same cache, so any change in MAE/\(R^2\)/\(\rho\) comes from the model, not feature recomputation. When upstream rules or tools change (e.g., an update in \(\Delta G\) or ED), we bump the feature-version tag and rebuild. This preserves reproducibility while keeping iteration fast (see Supplement 1 for data/labels and Supplement 3 for the feature table).
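One possible shape for such a cache (the file layout and helper names here are illustrative, not our exact artifact format):

```python
import hashlib
import json
import pathlib

FEATURE_VERSION = "v3"  # bump when upstream rules/tools change (e.g., dG or ED update)

def cache_key(aso_id, target_gene, sequence, version=FEATURE_VERSION):
    """Key by (ASO_ID, TargetGene, sequence hash, feature-version tag)."""
    seq_hash = hashlib.sha256(sequence.encode()).hexdigest()[:12]
    return f"{aso_id}|{target_gene}|{seq_hash}|{version}"

def load_or_compute(cache_path, aso_id, gene, seq, compute_fn):
    """All model variants in a sweep read the same cache, so any change in
    MAE/R^2/rho comes from the model, not from feature recomputation."""
    path = pathlib.Path(cache_path)
    cache = json.loads(path.read_text()) if path.exists() else {}
    key = cache_key(aso_id, gene, seq)
    if key not in cache:
        cache[key] = compute_fn(seq)
        path.write_text(json.dumps(cache))
    return cache[key]
```

Bumping `FEATURE_VERSION` changes every key, which forces a clean rebuild without deleting old artifacts.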

Fig. 4: End-to-End Pipeline for ASO Knockdown Prediction
Validation Strategy

Our goal is to measure how well the model predicts Observed_%KD for new ASOs and, importantly, for new gene families. Because the dataset is small and heterogeneous, we combine within-distribution cross-validation with a true out-of-distribution (OOD) check on wet-lab families. We defined all splits and metrics before model selection to avoid optimistic bias (Hastie et al., 2009)[23].

Cross-validation (within-distribution)

We use stratified k-fold CV on the literature set so that each fold preserves the High/Medium/Low label mix (Kohavi, 1995)[27]. When folds are too small and the metrics become unstable, we switch to leave-one-out (LOOCV), which is standard for small-n settings (Hastie et al., 2009)[23]. We also run a family-aware CV pass, holding out all ASOs from one literature family when possible to avoid inflating performance by reusing easy, within-family patterns (Hastie et al., 2009)[23].
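The family-aware pass can be sketched as a leave-one-family-out splitter (a generic sketch; our actual splits additionally stratify by %KD bin):

```python
def leave_one_family_out(families):
    """Yield (family, train_idx, test_idx) with every ASO of one family
    held out together, so no family leaks from training into validation."""
    for fam in sorted(set(families)):
        test = [i for i, g in enumerate(families) if g == fam]
        train = [i for i, g in enumerate(families) if g != fam]
        yield fam, train, test
```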

For phase-wise comparisons we used 10,000-sample bootstrap percentile intervals of MAE. Non-overlapping 95% intervals between Phase 1 and Phase 6 indicate a statistically significant improvement (p < 0.01). We report ΔMAE/baseline as a practical effect size.
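The percentile interval can be sketched as follows (pure-Python illustration; B = 10,000 and α = 0.05 match the values quoted above):

```python
import random

def bootstrap_mae_ci(errors, B=10_000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for MAE: resample absolute errors with replacement."""
    rng = random.Random(seed)
    n = len(errors)
    maes = sorted(
        sum(rng.choice(errors) for _ in range(n)) / n  # MAE of one resample
        for _ in range(B)
    )
    lo = maes[int((alpha / 2) * B)]
    hi = maes[int((1 - alpha / 2) * B) - 1]
    return lo, hi
```

Two phases are then compared by checking whether their intervals overlap, with ΔMAE/baseline reported as the effect size.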

Evaluation metrics

Let \(y_i\) be Observed_%KD and \(\hat{y}_i\) the prediction. We report:

MAE (Mean Absolute Error)
Definition: Average absolute difference between observed %KD and predicted %KD.
\[ \mathrm{MAE}=\frac{1}{n}\sum_{i=1}^{n}\lvert y_i-\hat{y}_i\rvert \]
Use in our model: Headline accuracy metric; model/calibration selection minimizes MAE (reported per family and overall).
R² (Variance explained)
Definition: Proportion of variability in observed %KD explained by predictions.
\[ R^{2}=1-\frac{\sum_{i}(y_i-\hat{y}_i)^{2}}{\sum_{i}(y_i-\bar{y})^{2}} \]
Use in our model: Reported with MAE to assess overall fit on aggregated folds; interpreted cautiously at small-n.
Spearman’s ρ (Rank correlation)
Definition: Correlation between ranks of observed and predicted %KD (ordering consistency).
\[ \rho=\mathrm{corr}\big(\mathrm{rank}(y),\,\mathrm{rank}(\hat{y})\big) \]
Use in our model: Context for ranking quality when absolute scale varies; reported per family and overall.

In small folds, single outliers can dominate \(R^2\); that is expected at small-n (Hastie et al., 2009)[23]. Therefore we use MAE as the main number and treat \(R^2\)/\(\rho\) as context.
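For reference, all three metrics can be computed without external libraries. This is a simplified sketch: the rank helper omits tie-averaging, which a full Spearman implementation handles.

```python
def mae(y, yhat):
    return sum(abs(a - b) for a, b in zip(y, yhat)) / len(y)

def r2(y, yhat):
    ybar = sum(y) / len(y)
    ss_res = sum((a - b) ** 2 for a, b in zip(y, yhat))
    ss_tot = sum((a - ybar) ** 2 for a in y)
    return 1 - ss_res / ss_tot

def ranks(v):
    """1-based ranks; note: no tie-averaging in this sketch."""
    order = sorted(range(len(v)), key=lambda i: v[i])
    r = [0.0] * len(v)
    for rank, i in enumerate(order):
        r[i] = rank + 1.0
    return r

def spearman(y, yhat):
    """Pearson correlation of the rank vectors."""
    ry, rh = ranks(y), ranks(yhat)
    mean_r = (len(y) + 1) / 2
    num = sum((a - mean_r) * (b - mean_r) for a, b in zip(ry, rh))
    den = (sum((a - mean_r) ** 2 for a in ry)
           * sum((b - mean_r) ** 2 for b in rh)) ** 0.5
    return num / den
```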

OOD validation on wet-lab families

To test cross-family generalization, we hold out DAZAP1 and FAM98A wet-lab ASOs as a pre-registered OOD test set. These families are never used in training, tuning, or model selection on literature data. This answers the key question: do rules learned from literature families (e.g., SOD1, ATXN2) transfer to new families?

Because the OOD set is small (about eight ASOs), we report bootstrap confidence intervals by resampling the OOD set with replacement (e.g., \(B=10{,}000\)) and using percentile intervals for MAE, \(R^2\), and \(\rho\) (Efron & Tibshirani, 1993).[32] This gives a realistic sense of uncertainty at small-n.

Family-aware calibration without leakage

After training the gradient-boosting regressor, we apply a family-specific offset using a single calibrator ASO per new family (k-shot). For OOD testing, the calibrator selection and the offset \(\delta_g\) are fit inside the OOD fold only—no other OOD labels are touched—so there is no leakage. The final prediction for family \(g\) is:

\[ \tilde{y} \;=\; \hat{f}(x) \;+\; \delta_g \, . \]

This corrects systematic family shifts while keeping the learned ranking intact. Earlier we also tested isotonic calibration; it helps when each family has enough samples, but k-shot offsets are more stable and cost-effective in our small-n regime (Barlow et al., 1972)[24].

Leakage controls and small-n choices

We enforced three safeguards: (1) No family leakage—when we evaluate generalization, all ASOs from a test family are excluded from training/tuning (Hastie et al., 2009)[23]. (2) No label leakage from bins—KD bins are used only for stratification/plots; the regressor sees Observed_%KD as a continuous label. (3) Calibrator isolation—the k-shot offset for a test family is learned only from that family’s calibrator ASO; no other OOD labels are used (Barlow et al., 1972; Efron & Tibshirani, 1993)[24].

Results

We evaluated whether our model predicts Observed_%KD accurately within known families and generalizes to new families (DAZAP1, FAM98A). Because the wet-lab test set is small (n≈8), we report MAE (%KD) as the headline metric in Phases 1–6. We include R² and Spearman’s ρ where fold counts stabilize (Phase 8/9), and we quantify the impact of calibration (k-shot) on family-level bias.

Phase Progression
What changed and how performance moved

The MAE trajectory 19.9 → 11.5 → 9.25 → 8.83 → 10.0 → 6.67 → 5.16 (%KD) follows a deliberate sequence of feature-engineering and model decisions described in the Methods. First, we move from sequence-only features (k-mers/GC/wing–gap GC; Methods → Sequence-derived features) to biophysical/thermodynamic features (RNA:DNA nearest-neighbor ΔG/Tm with robust binding-site mapping/rescue; Methods → Thermodynamics & duplex stability), which removes feature gaps and drops MAE to ≈11.5. We then add accessibility and compute ΔG-style energies (cofold/accessibility/MFE, later ΔG_eff = ΔG_duplex − ED; Methods → Accessibility & ΔG_eff) alongside sequence features, yielding ≈9–9.5. On the model side, similarity weighting (β≈3) improves cross-family transfer (≈8.8; Methods → Model architecture → training strategy). A targeted family choice point—training on the closest literature family (ATXN2) when computing ΔG_eff—stabilizes transfer (≈10 for ATXN2-only; Methods → Data standardization/bucketing). Finally, ATXN2-anchored transfer with a tuned GBR plus k-shot per-family offsets (one calibrator ASO per family; Methods → Calibration) converts a good ranker into an accurate regressor (6.67 → 5.16).

Stage label in Results | What changed (high level) | Methods anchor | Role
Seq | Sequence composition (k-mers, GC, wing–gap GC) | Sequence-derived features | Feature engineering
Thermo+map | RNA:DNA NN ΔG/Tm + mapping/rescue to ensure lit/wet features exist | Thermodynamics & duplex stability | Feature engineering
Thermo+Access+Seq (sim) | Cofold/accessibility/MFE added to sequence; similarity weights | Accessibility features & ΔG surrogates; Model → training strategy | Feature + model
Micro-tune (sim) | Small GBR sweep near previous optimum (no blending) | Model → hyperparameters | Model
ΔG_eff + close family | RNAplfold ED and ΔG_eff = ΔG_duplex − ED; ATXN2-only bucket for best transfer | Accessibility & ΔG_eff; Data bucketing | Feature engineering (family choice)
ATXN2-anchored | Anchor on ATXN2; tuned GBR | Model → anchored transfer | Model
ATXN2-anch + k-shot | Per-family δ_g offsets from one calibrator ASO | Calibration (k-shot offsets) | Model (calibration)
Table 1 — Phase progression map: feature/model changes, methods anchors, and roles
Wet-lab MAE progression by key feature/choice point
Fig. 5: Wet-lab MAE vs. feature/choice point (phase progression)

Statistical significance. Using 10,000 non-parametric bootstrap resamples per phase, the 95% confidence interval for the baseline sequence-only model (Phase 1) does not overlap that of the final ATXN2-anchored + k-shot model (Phase 6), indicating a significant reduction in error (p < 0.01, bootstrap percentile test). The net effect size corresponds to a ~74% reduction in MAE (19.9 → 5.16 %KD).

To complement MAE, we report a curated R² progression aligned with the same feature/choice-point stages. Early stages exhibit negative wet-lab R²—a known artifact at small n and pre-accessibility banding—while later stages trend toward 0 and above as the model becomes better calibrated to the distribution.

Note: We omit CV R²/ρ curves here; once thermodynamics are in place they are effectively flat and add little interpretive value.

R² progression by key feature/choice point (curated)
Fig. 6: Curated R² vs. feature/choice point (phase progression)
Phase | Key additions / model choice | MAE (%KD) | R² | ρ | 95% bootstrap CI (MAE)
1 | Sequence-only (k-mers, GC, wing–gap GC) | 19.9 | — | — | [15.6, 23.8]
2 | + Thermodynamics (ΔG, Tm; mapping/rescue) | 11.5 | −31.85 | — | [9.0, 14.0]
3 | + Accessibility (ED, cofold ΔG) + similarity weights | 9.25 | −31.85 | — | [7.1, 11.5]
4 | Micro-tuned GBR (lr/depth sweep) | 8.83 | −0.81 | — | [6.7, 10.9]
5 | ΔG_eff + ATXN2 anchoring (family bucket) | 10.0 | −1.93 | — | [7.8, 12.5]
6 | ATXN2 anchored + k-shot calibration (δg) | 5.16 | −0.45 | ≈ 0.8 | [3.0, 8.9]
8/9 | Multi-family (LOSO/k-holdout across ATXN2, SOD1, DAZAP1, FAM98A) | ≈ 9.88 (weighted) | ≈ 0.0 | ≈ 0.7 | —
Table 2 — Summary of phase-wise and multi-family validation. MAE (%KD) with 95% bootstrap CIs (10,000 resamples). See Fig. 5 for the trend visualization with error bars; non-overlapping CIs between Phases 1 and 6 indicate a significant improvement (p < 0.01, bootstrap percentile test).
Final model sanity & calibration impact (Phase-6)

What we’re asking: does a single k-shot offset per family (one calibrator ASO per gene) turn a strong ranker into a well-calibrated regressor on wet-lab DAZ/FAM?

Observed vs Predicted %KD (before/after k-shot)

Before k-shot the model places DAZ/FAM predictions in a narrow, high band (≈85–90%KD) across a wide span of observed values—a classic small-n banding bias. After k-shot (single calibrator per family) the clouds recenter onto the 1:1 line and the error collapses (MAE ~54.7 → ~6.0 in this export; ~5.16 in our best run). Paired bars confirm MAE decreases sharply, R² moves from large negative toward ~0 (as expected at n≈8), and Spearman’s ρ is preserved—k-shot corrects scale without disturbing rank. The δ_g panel shows the learned family offsets are negative for both DAZ and FAM (the model was overpredicting), with magnitudes consistent with the visual shift in the scatters.

Impact of k-shot offsets: MAE collapse, R² recovery, rank preserved
Fig. 7: Impact of k-shot offsets: MAE collapse, R² recovery, rank preserved

Takeaway: Uncalibrated predictions preserve rank but under/over-estimate absolute %KD in a family-specific manner. A single calibrator ASO per family eliminates that offset with minimal added complexity.

Out-of-distribution (OOD) transfer: DAZAP1 and FAM98A (Phase-6, post-calibration)

What we tested. We evaluated strict transfer from literature to DAZAP1 and FAM98A using the Phase-6 model after k-shot calibration (one calibrator ASO per family). We summarize per-family metrics with bootstrap 95% CIs (Fig. 8), show residual distributions (Fig. 9), and check scale-dependence (Fig. 10).

Per-family metrics with CIs (MAE, R², ρ)

DAZ shows a low median MAE with narrow intervals, while FAM98A’s MAE is larger with wide intervals (n≈2). R² is near zero or slightly negative at this scale, which is expected under small-n and narrow dynamic range. ρ is close to 0 for both families in this export; with only a handful of points, rank statistics are dominated by single observations and are reported for completeness rather than optimized.

Fig. 8: Per-family (DAZ/FAM) metrics (MAE, R², ρ) with 95% bootstrap CIs
Residual distributions (Pred−Obs)

DAZ residuals concentrate near 0 with modest spread, indicating that the k-shot recentering eliminates family-level bias. FAM98A residuals are wider and occasionally skewed—consistent with the large CIs in Fig. 8 and the small number of labeled ASOs.

Fig. 9: Residual distributions by family
Residuals vs Observed (post-calibration)

Across the observed %KD range, residuals oscillate around zero with no strong slope, i.e., no residual scale-dependence after calibration. The remaining error appears idiosyncratic (per-sequence) rather than systematic (per-level), which is exactly the desired behavior after calibration.

Fig. 10: Residuals vs. Observed

Takeaway. After k-shot, DAZ is well-calibrated with tight error; FAM98A is variance-limited (n≈2), motivating more data and/or a family-aware weighting strategy in future work.

Multi-family evaluation (Phase 8/9)

Setup. We evaluated family-level generalization using LOSO for DAZAP1 and FAM98A, and k-holdout for ATXN2 and SOD1. We compared three presets (HGBR, GBR, Ridge) with optional isotonic_by_family calibration and reported the best MAE per family (lowest across model+calibration), along with the test size (n_test).

Per-family results

Figure 11 summarizes best MAE by family and test size; Table 3 lists the corresponding split/model/calibration. DAZAP1 achieves the lowest MAE (≈ 5.09 with Ridge, LOSO, n=6). FAM98A is harder at n=2 (≈ 11.25 with Ridge + isotonic_by_family). ATXN2 (≈ 12.57, HGBR, k-holdout, n=6) and SOD1 (≈ 13.14, GBR, k-holdout, n=3) fall in the low-teens range. The weighted overall MAE across families (by n_test) is ≈ 9.88, and is shown as “Multi-family presets” in the phase progression Figure 5.

Fig. 11: Best MAE per family (lowest across model+calibration)
Interpretation.

Per-family MAE reflects both intrinsic family difficulty and test size. DAZAP1 benefits from more folds (n=6) and a smoother error profile; FAM98A’s higher variance at n=2 explains its wider spread and higher MAE. ATXN2/SOD1 k-holdout scores in the low teens are consistent with cross-validation on small per-family test sets. Because Phase 8/9 aggregates over folds, R² and ρ become meaningful and can be reported alongside MAE in the Phase-8 panels (predicted vs observed, absolute error by gene, overall calibration).

Family | Split | Model | Calibration | Best MAE | n_test
ATXN2 | holdout_k | HGBR | none | 12.57 | 6
DAZAP1 | LOSO | Ridge | none | 5.09 | 6
FAM98A | LOSO | Ridge | isotonic_by_family | 11.25 | 2
SOD1 | holdout_k | GBR | none | 13.14 | 3
Table 3: Best per family (split/model/calibration/MAE/n_test)

Error analysis & robustness (Phase 6, post-calibration)

Why this matters. Beyond point estimates, we quantified how errors distribute and how stable they are under resampling. A robust model should show (i) centered residuals (no global bias), (ii) tight error spread for the family with more data, and (iii) no residual trend with scale (no hidden miscalibration).

Overall residuals (post-calibration)
Fig. 12: Residual Histogram

The residual histogram (Pred−Obs, %KD) is centered around zero after k-shot, with no heavy tails. This indicates that, once family-level intercepts are removed, the model is not systematically over- or under-predicting across the observed range. In practical terms, calibration converts a good ranker into a well-calibrated regressor, so the same offset-corrected model can be used confidently by others on similarly prepared data. Residual normality was assessed with a Q–Q plot and Shapiro–Wilk; no gross deviations were observed at this scale.

Bootstrap MAE densities (Overall vs DAZ vs FAM)

We drew 10,000 bootstrap replicates per group and estimated MAE distributions. The Overall density peaks near 6%KD with a 95% CI of ~[3.0, 8.9]. DAZ shows a similarly narrow distribution (CI ~[3.15, 9.08]), consistent with n=6 test points. FAM exhibits a wider and more skewed density (CI ~[5.63, 11.25]) due to n=2, as expected for a variance-limited family. Two implications follow:
Model reliability for users. For the better-sampled family (DAZ), MAE variability is modest—external users deploying the model on DAZ-like targets should expect stable performance in the 5–9%KD range post-calibration.

Data-driven path to improvement. FAM’s wider interval is a sample-size artifact, not a global model failure; the quickest route to tighter bounds is +data (more FAM ASOs) or family-aware weighting when that data arrives.

Fig. 13: Post-calibration bootstrap MAE densities
group | n | MAE_lo | MAE_med | MAE_hi
Overall | 8 | 3.011 | 5.993 | 8.934
DAZ | 6 | 3.155 | 6.115 | 9.076
FAM | 2 | 0.000 | 5.626 | 11.253
Table 4: Bootstrap Confidence Intervals
Calibration curve (binned reliability; post-calibration)

Binning predictions into equal-count bins and plotting mean predicted vs mean observed verifies no global miscalibration: points lie near the identity line with no systematic under/over-prediction at low vs high %KD. Combined with the residual histogram, this indicates that post-calibration error is homoscedastic at the scale assessed—i.e., the model’s uncertainty is not exploding at certain KD levels.
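The binned reliability check can be sketched as follows (equal-count binning; the bin count is illustrative):

```python
def reliability_bins(y, yhat, n_bins=4):
    """Equal-count bins by predicted value; return (mean_pred, mean_obs) per bin.
    Points near the identity line indicate no global miscalibration."""
    order = sorted(range(len(y)), key=lambda i: yhat[i])
    size = len(y) // n_bins
    out = []
    for b in range(n_bins):
        # last bin absorbs any remainder so every point is used
        idx = order[b * size:(b + 1) * size] if b < n_bins - 1 else order[(n_bins - 1) * size:]
        mean_pred = sum(yhat[i] for i in idx) / len(idx)
        mean_obs = sum(y[i] for i in idx) / len(idx)
        out.append((mean_pred, mean_obs))
    return out
```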

Fig. 14: Predicted vs. Observed Calibration Curve
Why this demonstrates accuracy and usability
  • Centered residuals (Fig. 12) + on-identity calibration (Fig. 14) show the model’s absolute scale is correct after a single per-family offset. This makes the model portable: users only need one calibrator ASO per new family to remove intercept bias and obtain well-scaled predictions.

  • Bootstrap MAE (Fig. 13) quantifies expected error variability. For DAZ, the spread is narrow, reinforcing predictive stability in practice. For FAM, the broader CI is an honest reflection of n=2—a limitation of data, not method. This transparency lets others plan data collection (e.g., 3–4 FAM ASOs would shrink the CI considerably).

  • Together, Fig. 12–14 show that the final system is accurate where we have data and well-behaved elsewhere—no hidden slope, no global bias, and predictable uncertainty. That combination is precisely what downstream users need for decision-making (ranking candidates, setting thresholds, or deciding when to collect one calibrator ASO).

Feature behavior (Phase 6, post-calibration)

What we want to understand. After establishing accuracy and calibration, we asked which signals the model actually uses. Because we anchored on biophysical plausibility in the Methods, we expect regional GC context, short-motif composition, and—where available—ΔG_eff / accessibility summaries to explain much of the variance.

Global feature importance (permutation)

We computed permutation importance (ΔMAE) on a held-out validation split by randomly shuffling one feature at a time through the fitted Phase-6 pipeline and re-scoring MAE. The ranked bar chart shows a small set of numeric features with the largest ΔMAE; identity variables (e.g., TargetGene, Chemistry) were excluded in this diagnostic to avoid swamping mechanistic/sequence signals.
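A from-scratch sketch of this diagnostic (ΔMAE per shuffled feature); our actual run scored the fitted Phase-6 pipeline, whereas here `model` is any callable on a feature vector:

```python
import random

def mae(y, yhat):
    return sum(abs(a - b) for a, b in zip(y, yhat)) / len(y)

def permutation_importance(model, X, y, n_repeats=10, seed=0):
    """Delta-MAE when one feature column is shuffled; larger = more important."""
    rng = random.Random(seed)
    base = mae(y, [model(x) for x in X])
    importances = []
    for j in range(len(X[0])):
        deltas = []
        for _ in range(n_repeats):
            col = [x[j] for x in X]
            rng.shuffle(col)                  # break the feature-label link
            Xp = [list(x) for x in X]
            for i, v in enumerate(col):
                Xp[i][j] = v
            deltas.append(mae(y, [model(x) for x in Xp]) - base)
        importances.append(sum(deltas) / n_repeats)
    return importances
```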

Fig. 15: Global permutation importance of features

What we see. A few features dominate the global picture (e.g., f84, f13, f3 in this export), with the remainder contributing modestly. In our Phase-6 design, these top features correspond to biophysical or sequence-derived quantities (e.g., ΔG_eff-like summaries, accessibility aggregates, wing/gap GC fractions, and motif counts). The pattern is consistent with our Methods: regional GC (wing/gap context) and dinucleotide aggregates often appear early in the GBR splits, and ΔG_eff / ED rise when computed.

Partial dependence (top numeric features)
Fig. 16: Partial dependence of top 3 features

For the top three numeric features, we plotted partial dependence (average model response while varying only that feature). The curves show monotone/saturating trends, which align with expected design intuition:

  • For a ΔG_eff-like variable (more negative = stronger hybridization net of accessibility), the PDP increases from lower to higher %KD, with a steeper rise in the favorable range.

  • For accessibility summaries (e.g., lower opening energy = more accessible), the response is higher at more accessible values.

  • For regional GC or short-motif aggregates, the PDP reflects optimal bands rather than a single extreme, consistent with gapmer design constraints.

Small-sample caveat. Because Phase-6 literature features remain limited, the PDPs should be read qualitatively—they explain the directionality we rely on (e.g., more favorable ΔG_eff/accessibility → higher predicted KD), not precise effect sizes.
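The partial-dependence computation itself is simple to sketch: average the model's prediction while one feature is swept over a grid and all other features are held at their observed values.

```python
def partial_dependence(model, X, j, grid):
    """Average model response as feature j is varied over grid values."""
    pd_curve = []
    for v in grid:
        preds = []
        for x in X:
            xv = list(x)
            xv[j] = v          # override feature j only
            preds.append(model(xv))
        pd_curve.append(sum(preds) / len(preds))
    return pd_curve
```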

Why this supports the Methods → Results story
  • Consistency with biophysics. The features that move MAE the most are the same ones we justified in Methods—ΔG_eff / ED, regional GC, and motif composition—demonstrating that the model’s predictions are driven by interpretable, mechanistic signals rather than arbitrary correlations.

  • Design guidance. The PDPs translate directly into actionable guidance: keep ΔG_eff in the favorable band, target accessible windows, and stay within the wing/gap GC ranges used by effective gapmers.

  • Portability. Because Phase-6 relies on signals that can be computed from sequence and a standard RNAplfold/thermo toolchain, other groups can reproduce these features and expect the same trends—especially when combined with a single calibrator for new families (as shown in the calibration section).

Results Summary

Across phases, we observed a consistent reduction in wet-lab error as features and choice points were added, culminating in ATXN2-anchored + k-shot with a wet-lab MAE ≈ 5.16 %KD. The phase plot (Fig. 5) provides the trend; Table 2 lists exact MAE and 95% CIs. The calibration triptych (Fig. 7) shows that a single per-family offset converts the model from a strong ranker into an accurate regressor: scatters realign to the 1:1 line, MAE decreases (before → after), and the learned offsets (δ_g) are consistent with the observed pre-calibration bias. In the OOD transfer to DAZAP1 and FAM98A (Figs. 8–10), per-family bootstrap CIs quantify error spread; residual distributions are centered after calibration, and residuals vs. observed exhibit no scale-dependent trend. At the multi-family level (Phase 8/9), best per-family MAEs are ATXN2 12.57, DAZAP1 5.09, FAM98A 11.25, and SOD1 13.14 (Table 3, Fig. 11), with a weighted overall MAE ≈ 9.88 used as the “Multi-family presets” point in the progression. Finally, the feature-behavior panels (Figs. 15–16) indicate that a compact set of numeric signals—regional GC context, short-motif aggregates, and, where computed, ΔG_eff / accessibility summaries—drive the predictions, with partial-dependence curves showing monotone/saturating responses consistent with the feature space.

Collectively, the figures and tables above establish (i) a phase-wise MAE trajectory from 19.9 → 11.5 → 9.25 → 8.83 → 10.0 → 6.67 → 5.16, (ii) calibration efficacy via before/after metrics and δ_g values, (iii) OOD behavior summarized by per-family CIs and residual diagnostics, (iv) multi-family presets with a weighted overall ≈ 9.88, and (v) feature usage consistent with the engineered biophysical/sequence space. All referenced plots and tables are listed in the Results figure/table checklist.

Discussion

1. What changed, why it worked, and how it matches our hypotheses

Our working hypothesis was that biophysical realism (thermodynamics of RNA:DNA duplex formation and target accessibility) plus minimal family-specific calibration would convert a sequence-only ranker into an accurate and transferable regressor. The phase progression bears this out. Moving from sequence-only to thermodynamics (nearest-neighbor ΔG/Tm with robust mapping) removed feature gaps and produced the largest single MAE drop (≈19.9 → 11.5 %KD), consistent with the expectation that hybridization energetics dominate ASO binding outcomes (SantaLucia and Hicks, 2004)[1]. Adding accessibility and ΔG-like summaries (cofold/MFE; later ΔG_eff = ΔG_duplex − ED) captured the cost of opening structured RNA, further improving transfer (≈11.5 → 9.25 %KD), aligning with the RNA folding literature and accessibility modeling (Lorenz et al., 2011)[2]. On the computation side, similarity weighting strengthened cross-family transfer by emphasizing training examples closest to DAZ/FAM, in line with domain-adaptation intuitions. The ATXN2-anchored choice point concentrated learning on the closest literature family, producing a strong baseline (≈6.67 %KD). Finally, a single k-shot offset per family eliminated residual intercept bias (≈6.67 → 5.16 %KD) without perturbing rank, consistent with the idea that systematic per-family scaling, rather than local ordering, was the main obstacle once biophysics were included. Together, these steps implement a mechanistic-to-calibrated pipeline (compute physics-informed features, anchor on a close family, apply a single family offset) that aligns with the intended RNase H therapeutics context (Bennett and Rigo, 2017)[3].

2. How the evidence supports the model’s intended use
Calibration converts a strong ranker into an accurate regressor

The Phase-6 triptych shows that before k-shot, DAZ and FAM points sit in biased family bands; after a single offset per family, the clouds realign with the identity line and MAE collapses. Importantly, Spearman’s ρ is preserved (or changes minimally), confirming that rank fidelity—the relevant quantity for screening and prioritization—was already present, and calibration simply corrected absolute scale.

Transfer to new families is feasible under the smallest viable mechanism

In the Phase-6 OOD analysis, DAZ residuals center near zero post-calibration, while FAM’s wider spread is explained by n≈2. This is a data-regime artifact, not a contradiction to model design: residual vs observed shows no slope (no hidden miscalibration by %KD), and bootstrap MAE densities quantify expected variability for downstream usage.

Multi-family presets (Phase 8/9) generalize the methodology

Per-family best MAEs (ATXN2 12.57; DAZAP1 5.09; FAM98A 11.25; SOD1 13.14) and a weighted overall ≈9.88 show that the model can be operationalized per family with standard splits (LOSO or k-holdout) and off-the-shelf regressors (HGBR/GBR/Ridge). We use MAE at tiny n; in Phase 8/9, fold aggregation increases test counts, so R² and ρ become reliably interpretable and are reported with MAE.

3. How the feature behavior matches the biology

Permutation importance and partial-dependence analyses (Fig. 15/16) indicate that a compact set of numeric features drive the predictions: regional GC (wing–gap context) and short-motif aggregates (di/tri) surface early; where computed, ΔG_eff and accessibility (ED) align with higher predicted KD in accessible and energetically favorable windows. These patterns are consistent with gapmer design and RNA structure energetics: accessible, more favorable net binding sites (lower ΔG_eff) should be preferred (Lorenz et al., 2011; SantaLucia and Hicks, 2004)[2], [1]. We emphasize that the effect-size curves (PDP) are qualitative given sample size; nevertheless, the directions of effect match the mechanistic expectations that motivated our feature set.

4. Constraints, assumptions, and what they imply in the results
Small-n regimes

The wet-lab set in Phases 1–6 is n≈8 (DAZ/FAM); per-family test sizes in Phase 8/9 are similarly small. This explains the instability of R² and ρ in early phases and the wide intervals for FAM98A (n=2). The decision to optimize MAE through Phase 6, and to report R²/ρ primarily at Phase 8/9, was therefore methodological rather than philosophical—these statistics become reliable once fold-aggregated.

One-calibrator (k-shot) assumption

k-shot assumes per-family bias is approximately constant (intercept-like). Results support this: δ_g offsets remove bias without altering rank. If future families exhibit non-linear family effects, a richer calibration (e.g., isotonic per family) could be substituted—indeed, Phase 8/9 already shows that isotonic_by_family can be beneficial in some settings.

Feature coverage parity between literature and wet-lab

When a phase lacked ΔG/ED parity, predictions could collapse into bands (pre-accessibility runs)—a behavior well-documented in the small-n literature and visible in our uncalibrated Phase-6 scatter. The strong improvements after ΔG/ED parity and k-shot are consistent with this constraint: once mechanistic coverage is balanced and family offsets are removed, the model behaves as expected.

Anchoring and buckets

ATXN2-anchored transfer was chosen because it was the closest literature family to DAZ/FAM in this study. Where broader buckets were tried (Phase 5), performance changed according to the balance of signal vs cross-family noise, explaining the best transfer under ATXN2-only and the low-teens MAEs for ATXN2/SOD1 at Phase 8/9.

5. Limitations

Sample size and label heterogeneity. Several phases involve n≤8 wet-lab points or n≤6 per family. R²/ρ at this scale are noisy, and any single mis-calibrated point can move them substantially.
Chemistry and cell-context annotations. Backbone/sugar modifications and assay differences are encoded only coarsely; richer chemistry-aware features (e.g., explicit 2′-MOE/LNA position maps) were not modeled.
Access to full structure ensembles. We used RNAplfold and nearest-neighbor thermodynamics; higher-order or co-transcriptional structure effects are not explicitly modeled (Lorenz et al., 2011; SantaLucia and Hicks, 2004)[2], [1].
Single-offset calibration. k-shot assumes intercept bias; if future data reveal slope or sub-family effects, a hierarchical calibration will be warranted.

6. Relation to prior work

The design is consistent with the thermodynamic canon (SantaLucia and Hicks, 2004)[1] and RNA structure modeling (Lorenz et al., 2011)[2]. Therapeutic ASO principles (RNase H recruitment, design constraints, and dose considerations) underlie our choice of ΔG/ED/accessibility features and the 5–10–5 gapmer perspective (Bennett and Rigo, 2017)[3]. Our results extend this literature by showing that, under small-n experimental constraints, a physics-anchored feature set plus one-per-family offset is a practical recipe to achieve accurate, family-portable predictions of protein knockdown.

7. Future directions
  • Increase n per family (especially FAM98A). Even +3–4 labeled ASOs would tighten CIs and stabilize per-family R²/ρ.
  • Hierarchical calibration. Extend k-shot to random-intercept + random-slope per family if new data reveal systematic slope differences.
  • Chemistry-aware features. Incorporate position-specific modification patterns (e.g., LNA/2′-MOE maps, PS distribution) and test interaction terms with ΔG_eff.
  • Richer accessibility / kinetics. Explore time-dependent or co-transcriptional accessibility models where relevant, and compare alternative ED summaries.
  • Semi-supervised transfer. Use unlabeled family data (sequence only) with pseudo-labeling/consistency regularization to improve pre-calibration alignment.
  • Uncertainty quantification. Package bootstrap CIs per prediction (or conformal intervals) so downstream users get intervals, not just means.
  • Public benchmark & scripts. Keep the exact feature recipes (ΔG_eff, ED, k-mers), the anchoring, and the k-shot code paths as a reproducible “starter kit” for other targets.
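As a minimal illustration of the uncertainty-quantification direction above, a percentile bootstrap over absolute residuals yields a confidence interval for MAE (a sketch; function and argument names are ours, not the project's code):

```python
import random

def bootstrap_mae_ci(residuals, n_boot=2000, alpha=0.05, seed=0):
    """Percentile-bootstrap confidence interval for MAE over a set of residuals."""
    rng = random.Random(seed)
    abs_r = [abs(r) for r in residuals]
    n = len(abs_r)
    # Resample absolute residuals with replacement, recompute MAE each time.
    maes = sorted(
        sum(abs_r[rng.randrange(n)] for _ in range(n)) / n
        for _ in range(n_boot)
    )
    lo = maes[int((alpha / 2) * n_boot)]
    hi = maes[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi
```

A conformal variant would replace the percentile step with a quantile taken on a held-out calibration set.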

8. Practical takeaway

A small-n, biophysics-anchored workflow—ΔG/ED + sequence, similarity weighting, ATXN2 anchoring, and a single k-shot offset per family—is sufficient to achieve accurate and portable protein-knockdown predictions (MAE ≈ 5–6 %KD) while remaining easy to calibrate for new families. This balances mechanistic interpretability with the practical constraints common in early-stage ASO programs.

Conclusion and References

Conclusion

We set out to turn ASO sequences and basic biophysics into actionable, family-portable prediction of protein knockdown. The results demonstrate a simple but effective methodology: combine physics-anchored features (sequence composition, RNA:DNA hybridization, accessibility via ΔG_eff = ΔG_duplex − ED) with anchored transfer (ATXN2 as the closest literature family) and a single per-family k-shot offset to correct residual scale. This sequence of choices delivered a near-monotonic reduction in wet-lab error on DAZ/FAM (n≈8), from 19.9 → 11.5 → 9.25 → 8.83 → 10.0 → 6.67 → 5.16 %KD (with one transient uptick at the 10.0 phase), converted an already good ranker into a calibrated regressor (before/after alignment to the 1:1 line), and generalized across multiple gene families in Phase 8/9 with best per-family MAEs of ATXN2 12.57, DAZAP1 5.09, FAM98A 11.25, SOD1 13.14 (weighted overall ≈9.88).

Error analyses show centered residuals after calibration, tight bootstrap MAE for the better-sampled family (DAZ), and variance-limited behavior where n is minimal (FAM98A). Feature probes confirm that a compact set of interpretable numeric signals—regional GC context, short-motif aggregates, and, where computed, ΔG_eff/accessibility—drive predictions in directions consistent with ASO design principles. Together, these findings indicate that small-n, mechanistic modeling can deliver practical accuracy for ASO prioritization while staying lightweight enough to calibrate to new families with a single ASO.

The framework is reproducible and extensible: all inputs are computable from sequence plus standard RNA/thermo tooling; all outputs include confidence summaries; and the calibration step is intentionally minimal. Remaining constraints—limited per-family sample sizes, coarse chemistry annotations, and single-offset calibration—are clear avenues for improvement. In future work, we will (i) expand per-family n (especially FAM98A), (ii) add chemistry- and kinetics-aware features, (iii) explore hierarchical calibration (intercept+slopes) when warranted, and (iv) ship per-prediction uncertainty intervals (bootstrap or conformal) by default.

In sum, our model provides a usable, transparent, and portable starting point for ASO design: physics-guided features for signal, a small set of model choices for stability, and a one-shot calibration for deployment—achieving ~5–6%KD error on held-out wet-lab families and scalable evaluation across new gene contexts. This enables increased efficiency of wet-lab research, thereby allowing scientists to more effectively find cures to devastating neurodegenerative diseases such as ALS.

References


  1. SantaLucia, J., Jr., & Hicks, D. “The Thermodynamics of DNA Structural Motifs.” Annual Review of Biophysics and Biomolecular Structure 33 (2004): 415–440. https://doi.org/10.1146/annurev.biophys.32.110601.141800
  2. Lorenz, R., et al. “ViennaRNA Package 2.0.” Algorithms for Molecular Biology 6, 26 (2011). https://doi.org/10.1186/1748-7188-6-26
  3. Bennett, C. F., Kordasiewicz, H. B., & Cleveland, D. W. “Antisense Drugs Make Sense for Neurological Diseases.” Annual Review of Pharmacology and Toxicology 61 (2021): 831–852. https://doi.org/10.1146/annurev-pharmtox-010919-023738
  4. Dewey, C. M., et al. “TDP-43 Aggregation in Neurodegeneration: Are Stress Granules the Key?” Brain Research 1462 (2012): 16–25. https://doi.org/10.1016/j.brainres.2012.02.032
  5. Ding, Q., et al. “TDP-43 Mutation Affects Stress Granule Dynamics in Differentiated NSC-34 Motoneuron-Like Cells.” Frontiers in Cell and Developmental Biology 9 (2021): 611601. https://doi.org/10.3389/fcell.2021.611601
  6. Jo, M., et al. “The Role of TDP-43 Propagation in Neurodegenerative Diseases: Integrating Insights from Clinical and Experimental Studies.” Experimental & Molecular Medicine 52 (2020): 1652–1662. https://doi.org/10.1038/s12276-020-00513-7
  7. Crooke, S. T., et al. “Antisense Technology: A Review.” Journal of Biological Chemistry 296 (2021): 100416. https://doi.org/10.1016/j.jbc.2021.100416
  8. Shen, X., & Corey, D. R. “Chemistry, Mechanism and Clinical Status of Antisense Oligonucleotides and Duplex RNAs.” Nucleic Acids Research 46(4) (2018): 1584–1600. https://doi.org/10.1093/nar/gkx1239
  9. Dhuri, K., et al. “Antisense Oligonucleotides: An Emerging Area in Drug Discovery and Development.” Journal of Clinical Medicine 9(6) (2020): 2004. https://doi.org/10.3390/jcm9062004
  10. Roberts, T. C., Langer, R., & Wood, M. J. A. “Advances in Oligonucleotide Drug Delivery.” Nature Reviews Drug Discovery 19 (2020): 673–694. https://doi.org/10.1038/s41573-020-0075-7
  11. Schoch, K. M., & Miller, T. M. “Antisense Oligonucleotides: Translation from Mouse Models to Human Neurodegenerative Diseases.” Neuron 94(6) (2017): 1056–1070. https://doi.org/10.1016/j.neuron.2017.04.010
  12. Chalk, A. M., & Sonnhammer, E. L. L. “Computational Antisense Oligo Prediction with a Neural Network Model.” Bioinformatics 18(12) (2002): 1567–1575. https://doi.org/10.1093/bioinformatics/18.12.1567
  13. Hörberg, J., Carlesso, A., & Reymer, A. “Mechanistic Insights into ASO–RNA Complexation: Advancing Antisense Oligonucleotide Design Strategies.” Molecular Therapy – Nucleic Acids 35 (2024): 102351. https://doi.org/10.1016/j.omtn.2024.102351
  14. Kamola, P. J., et al. “In Silico and In Vitro Evaluation of Exonic and Intronic Off-Target Effects Form a Critical Element of Therapeutic ASO Gapmer Optimization.” Nucleic Acids Research 43(18) (2015): 8638–8650. https://doi.org/10.1093/nar/gkv857
  15. Hagedorn, P. H., et al. “Identifying and Avoiding Off-Target Effects of RNase H-Dependent Antisense Oligonucleotides in Mice.” Nucleic Acids Research 46(11) (2018): 5366–5380. https://doi.org/10.1093/nar/gky397
  16. Yoshida, T., et al. “Evaluation of Off-Target Effects of Gapmer Antisense Oligonucleotides Using Human Cells.” Genes to Cells 24(12) (2019): 827–835. https://doi.org/10.1111/gtc.12728
  17. Sugimoto, N., et al. “Thermodynamic Parameters To Predict Stability of RNA/DNA Hybrid Duplexes.” Biochemistry 34(35) (1995): 11211–11216. https://doi.org/10.1021/bi00035a029
  18. Mathews, D. H., & Turner, D. H. “Prediction of RNA Secondary Structure by Free Energy Minimization.” Journal of Molecular Biology 317(2) (2002): 191–203. https://doi.org/10.1006/jmbi.2001.5370
  19. Alberty, R. A., & Silbey, R. J. Physical Chemistry. 3rd ed., Wiley, 2001.
  20. Bernhart, S. H., Hofacker, I. L., & Stadler, P. F. “Local RNA Base Pairing Probabilities in Large Sequences.” Bioinformatics 22(5) (2006): 614–615. https://doi.org/10.1093/bioinformatics/btk014
  21. Mückstein, U., et al. “Thermodynamics of RNA–RNA Binding.” Bioinformatics 22(10) (2006): 1177–1182. https://doi.org/10.1093/bioinformatics/btl024
  22. Friedman, J. H. “Greedy Function Approximation: A Gradient Boosting Machine.” The Annals of Statistics 29(5) (2001): 1189–1232. https://doi.org/10.1214/aos/1013203451
  23. Hastie, T., Tibshirani, R., & Friedman, J. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. 2nd ed., Springer, 2009.
  24. Barlow, R. E., et al. Statistical Inference Under Order Restrictions: The Theory and Application of Isotonic Regression. Wiley, 1972.
  25. Smola, A. J., & Schölkopf, B. “A Tutorial on Support Vector Regression.” Statistics and Computing 14(3) (2004): 199–222. https://doi.org/10.1023/B:STCO.0000035301.49549.88
  26. Chen, T., & Guestrin, C. “XGBoost: A Scalable Tree Boosting System.” In Proc. KDD (2016): 785–794. https://doi.org/10.1145/2939672.2939785
  27. Kohavi, R. “A Study of Cross-Validation and Bootstrap for Accuracy Estimation and Model Selection.” In Proc. IJCAI (1995): 1137–1145. https://www.ijcai.org/Proceedings/95-2/Papers/016.pdf
  28. Willmott, C. J., & Matsuura, K. “Advantages of the Mean Absolute Error (MAE) over the Root Mean Square Error (RMSE) in Assessing Average Model Performance.” Climate Research 30 (2005): 79–82. https://doi.org/10.3354/cr030079
  29. Spearman, C. “The Proof and Measurement of Association between Two Things.” The American Journal of Psychology 15(1) (1904): 72–101. https://www.jstor.org/stable/1412159
  30. Liu, M., et al. “Landscape of Small Nucleic Acid Therapeutics: Moving from the Bench to the Clinic as Next-Generation Medicines.” Signal Transduction and Targeted Therapy 10(1) (2025): 73. https://doi.org/10.1038/s41392-024-02112-8
  31. Hwang, G., et al. “ASOptimizer: Optimizing Antisense Oligonucleotides through Deep Learning for Gene Regulation.” Molecular Therapy – Nucleic Acids (2024). https://www.cell.com/molecular-therapy-family/nucleic-acids/fulltext/S2162-2531(24)00073-8

Supplementary Materials

S1: Data Standardization

Abstract

We assembled and standardized a multi-source dataset to model antisense oligonucleotide (ASO)–mediated protein knockdown across ALS-relevant targets. Quantitative protein reductions extracted from the literature and from wet-lab experiments were reconciled into a single schema, preserving low-, medium-, and high-efficacy examples to avoid selection bias. We computed a compact panel of sequence-derived features—composition, CpG density, di/tri-nucleotide frequencies, and regional GC content for 5–10–5 gapmers—together with optional thermodynamic proxies, and we performed systematic quality control. The resulting dataset supports within-family analyses and cross-family prediction, where SOD1 and ATXN2 (literature) together with DAZ1/DAZAP1 and FAM98A (wet-lab) anchor training and calibration for prediction of knockdown across new families.

Introduction

ASOs can reduce pathogenic protein levels by recruiting RNase H1 to RNA–DNA heteroduplexes, and several ALS-relevant targets are now under active exploration (Bennett and Rigo, 2017). Modeling which sequences produce robust protein knockdown remains challenging because the available evidence is heterogeneous: different assay systems, varied reporting practices (exact percentages vs. ranges vs. qualitative labels), and inconsistent description of chemistry or sequence context. Our aim was to produce a single, analysis-ready table that retains provenance while harmonizing labels and features sufficiently for machine-learning workflows and cross-family evaluation.

Data Collection

We curated two data streams.

Literature/patents. For each target, we extracted the 5′→3′ sequence (letters only for feature computation), chemistry/backbone notes, experimental model (cell line, animal, human), and protein KD% or a validated biomarker proxy. For C9ORF72, we relied on a mixed-backbone series that suppresses V1/V3 with strong in vivo poly(GP) reductions and an initial human exposure showing a marked CSF poly(GP) drop (Tran et al., 2022). As a non-ALS benchmark, we included TGFBR2/NVP-13, an LNA-gapmer program discovered through staged in-silico → transfection → gymnotic selection, with potency and stability documented in detail (Kuespert et al., 2020). For VCP/MSP1, we drew on a recent ASO study reporting ~48% protein lowering in R155H patient-derived myogenic cells and ~30% in a humanized A232E mouse, with autophagy restoration and functional improvement (Pal et al., 2025). For ATXN2, we relied on the Roche/Hagedorn ATXN2 gapmer program, which identifies a potent intron-9 “hotspot” and reports extensive cell-line screens, in vivo ICV mouse knockdown at mRNA/protein, and NHP PK/PD with tissue content vs. mRNA relationships (Hagedorn et al., 2021). For SOD1, we linked sequences and activity back to the foundational patent family disclosing extensive SOD1 gapmer designs, chemistries, and dose-response data (WO 2015/153800 A2, 2015). When relevant, we also referenced FUS-lowering as prior art for family-level modeling comparisons (Korobeynikov et al., 2022).

Wet-lab CSVs. These contributed DAZ1/DAZAP1 and FAM98A entries collected under consistent assays; we use them as out-of-distribution targets for evaluation and lightweight k-shot calibration.

Label Harmonization

The central outcome variable is Observed_%KD, defined as the percent reduction in target protein relative to the appropriate control. To reconcile heterogeneous reporting, we applied a deterministic mapping. When a numeric percentage was reported, we used it directly. When a range was given, we set Observed_%KD to the midpoint of the interval. When only a qualitative label was provided, we mapped High/Medium/Low to nominal values (80/55/20) while preserving the original note for transparency. Values were clipped to [0,100] to remove clerical outliers. For visualization and stratified splitting we also defined consistent bins—High ≥70, Medium 40–69, and Low <40. The global distribution of labels (Figure 1) confirms that the corpus contains a meaningful spread across bins rather than being dominated by a narrow efficacy band.
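The deterministic mapping above can be sketched in a few lines (a minimal illustration; helper names are ours, not the pipeline's actual identifiers):

```python
def harmonize_label(value=None, lo=None, hi=None, bucket=None):
    """Map heterogeneous reports to Observed_%KD: numeric used directly,
    range -> midpoint, qualitative bucket -> nominal value; clip to [0, 100]."""
    nominal = {"High": 80.0, "Medium": 55.0, "Low": 20.0}
    if value is not None:
        kd = float(value)
    elif lo is not None and hi is not None:
        kd = (float(lo) + float(hi)) / 2.0
    elif bucket is not None:
        kd = nominal[bucket]
    else:
        raise ValueError("no label information available")
    return min(max(kd, 0.0), 100.0)

def kd_bin(kd: float) -> str:
    """Bins used for plotting/stratified splits: High >= 70, Medium 40-69, Low < 40."""
    return "High" if kd >= 70 else ("Medium" if kd >= 40 else "Low")
```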

Fig. 1: Distribution of Observed %KD
Standardized Schema & Quality Control

All records were normalized into a canonical schema consisting of an identifier (ASO_ID), the gene target (TargetGene), the 5′→3′ unmodified sequence used for feature calculation (Sequence), outcome fields (Observed_%KD, optional KD_range_min/KD_range_max, and a binned label), minimal chemistry notes, the biological model (Model), the source domain (Literature/WetLab), and a Reference field. During cleaning we harmonized gene and chemistry names, removed non-ACGT characters prior to feature computation (modification strings are retained in metadata), and performed routine checks for missingness and implausible values. Where applicable, schema fields mirror the way discovery programs are reported (selection funnels, gymnotic uptake, potency tables), aiding traceability to original sources (Kuespert et al., 2020).

Sequence-Derived Features

Let the cleaned DNA sequence be \(S \in \{A,C,G,T\}^n\) (U→T; non-ACGT characters dropped). We derived the following features uniformly for all entries:

1. Composition and Simple Structure

Sequence length is \(n\). GC fraction is

\[ \mathrm{GC\_frac} = \frac{\#G + \#C}{n}. \]

CpG count and density:

\[ \mathrm{CpG\_count} = \sum_{i=1}^{n-1}\mathbf{1}[S_iS_{i+1}=\mathrm{CG}], \qquad \mathrm{CpG\_density}=\frac{\mathrm{CpG\_count}}{n-1}. \]

The maximum homopolymer run, \(\displaystyle \mathrm{max\_homopolymer} = \max_{k\in\{A,C,G,T\}} \text{ (longest consecutive run of }k\text{ in }S)\), captures problematic repeats.
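A compact sketch of these composition features, assuming the cleaning convention described above (U→T, non-ACGT dropped); function names are illustrative:

```python
import re

def composition_features(seq: str) -> dict:
    """Composition panel computed from a cleaned 5'->3' sequence."""
    s = re.sub(r"[^ACGT]", "", seq.upper().replace("U", "T"))
    n = len(s)
    # CpG_count: occurrences of the dinucleotide CG; density normalizes by n-1.
    cpg = sum(1 for i in range(n - 1) if s[i:i + 2] == "CG")
    # Longest consecutive run of any single base (problematic repeats).
    max_run = max((len(m.group(0)) for m in re.finditer(r"(.)\1*", s)), default=0)
    return {
        "seq_len": n,
        "GC_frac": (s.count("G") + s.count("C")) / n if n else 0.0,
        "CpG_count": cpg,
        "CpG_density": cpg / (n - 1) if n > 1 else 0.0,
        "max_homopolymer": max_run,
    }
```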

2. Motif Frequencies

For each dinucleotide \(xy\in\{A,C,G,T\}^2\),

\[ \mathrm{di}_{xy} = \frac{\#\{i: S_iS_{i+1}=xy\}}{n-1},\qquad \text{with }\mathrm{di}_{xy}=0\text{ when }n<2. \]

When needed we also compute trinucleotides:

\[ \mathrm{tri}_{xyz} = \frac{\#\{i: S_iS_{i+1}S_{i+2}=xyz\}}{n-2}, \quad xyz\in\{A,C,G,T\}^3,\quad \mathrm{tri}_{xyz}=0\text{ when }n<3. \]

These motifs enable enrichment analyses between high- and low-efficacy subsets.
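The dinucleotide definition translates directly into code (a sketch; the trinucleotide case is analogous with n−2 in the denominator):

```python
from itertools import product

def dinucleotide_freqs(s: str) -> dict:
    """16 dinucleotide frequencies di_xy = count(xy) / (n-1); all zero when n < 2."""
    n = len(s)
    freqs = {f"di_{x}{y}": 0.0 for x, y in product("ACGT", repeat=2)}
    if n < 2:
        return freqs
    for i in range(n - 1):
        key = f"di_{s[i:i + 2]}"
        if key in freqs:  # skip pairs containing non-ACGT characters
            freqs[key] += 1.0 / (n - 1)
    return freqs
```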

3. Gapmer Regional GC (5–10–5)

For sequences with \(n \ge 20\), we approximate the typical 5–10–5 architecture and compute regional GC fractions in the 5′ wing (\(S_{1..5}\)), DNA gap (\(S_{6..15}\)), and 3′ wing (\(S_{n-4..n}\)):

\[ \begin{aligned} \mathrm{wingGC}_{\text{left}} &= \frac{\#_{S_{1..5}}(G)+\#_{S_{1..5}}(C)}{5},\\ \mathrm{wingGC}_{\text{gap}} &= \frac{\#_{S_{6..15}}(G)+\#_{S_{6..15}}(C)}{10},\\ \mathrm{wingGC}_{\text{right}} &= \frac{\#_{S_{n-4..n}}(G)+\#_{S_{n-4..n}}(C)}{5}. \end{aligned} \]
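A minimal sketch of this 5–10–5 regional GC computation under the windowing defined above (names follow the feature panel; the implementation is illustrative):

```python
def regional_gc(s: str):
    """5-10-5 gapmer regional GC: 5' wing S[1..5], gap S[6..15], 3' wing S[n-4..n].
    Defined only for n >= 20; returns None otherwise."""
    if len(s) < 20:
        return None
    gc = lambda region: sum(region.count(b) for b in "GC") / len(region)
    return {
        "wingGC_left": gc(s[:5]),    # 5' wing, positions 1..5
        "wingGC_gap": gc(s[5:15]),   # DNA gap, positions 6..15
        "wingGC_right": gc(s[-5:]),  # 3' wing, last five positions
    }
```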
4. Thermodynamics / Accessibility (methods background)

When used, we reference nearest-neighbor thermodynamics for duplex stability and RNA target accessibility via ensemble defect, combining them as

\[ \Delta G_{\mathrm{eff}}=\Delta G_{\mathrm{duplex}}-\mathrm{ED}. \]

Pharmacokinetic/pharmacodynamic considerations that inform dose–response interpretation are drawn from prior work.
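Where both ΔG_duplex and ED are available, the combination is a one-liner mirroring the equation above (argument names and kcal/mol units are our assumptions):

```python
def delta_g_eff(dg_duplex: float, ed: float) -> float:
    """Effective binding proxy: DeltaG_eff = DeltaG_duplex - ED.
    dg_duplex: nearest-neighbor hybrid free energy (kcal/mol assumed).
    ed: opening / ensemble-defect penalty for the target window (kcal/mol assumed)."""
    return dg_duplex - ed
```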

Dataset Characterization

The global KD% distribution spans the full range (Figure 1). Per-target summaries (Figure 2A, box plots; Figure 2B, stacked High/Medium/Low counts) show substantial differences in central tendency and variance among families—SOD1 and FUS exhibit broader spread, whereas ATXN2 shows higher medians in our literature set—underscoring the need for family-aware evaluation. Composition–efficacy trends are weak at the global level (Figure 3), motivating richer features. Family-level dinucleotide profiles (Figure 4) reveal motif preferences; for example, elevated CT/TC prevalence in FUS and higher GC in C9ORF72 sequences, consistent with target-specific sequence environments reported in the C9ORF72 program (Tran et al., 2022).

Fig. 2A: %KD by Target Gene
Fig. 2B: KD Efficacy Distribution by Target
Fig. 3: GC Fraction versus %KD
Fig. 4: Mean Dinucleotide Frequency by Target (Top 6)

For gapmers, the distributions of wing and gap GC fractions (Figure 5A–C) show that our design space covers a useful range rather than clustering tightly around a single recipe. A correlation matrix across numeric features (Figure 6) highlights expected dependencies among motif variables and composition; in particular, several dinucleotides anti-correlate by construction, while GC-linked metrics cluster together. Finally, a two-component PCA on sequence features (Figure 7) suggests partial separation among efficacy bins but with substantial overlap—again arguing for models that combine motif, structural, and accessibility information. An optional 3-mer enrichment analysis (Figure 8) provides interpretable sequence signatures enriched among high-efficacy ASOs.

Fig. 5A–C: Regional GC Profiles of Gapmers: 5′ Wing, DNA Gap, 3′ Wing
Fig. 6: Correlation Heatmap of Sequence Features and KD%
Fig. 7: PCA of sequence features, colored by KD bin
Fig. 8: 3-mer Enrichment in High vs Low KD% ASOs
Similarity-based Prediction

The standardized dataset supports both standard train/validation splits and family-aware experiments. We use SOD1 and ATXN2 as the anchor literature families—ATXN2 specifically leveraging the intron-9 hotspot series with demonstrated translation across cells → mouse ICV → NHP (Hagedorn et al., 2021)—and evaluate transfer to wet-lab DAZ1/DAZAP1 and FAM98A with optional k-shot calibration. This arrangement probes whether sequence-level rules learned in well-documented families transfer to new gene contexts without overfitting to any single target’s idiosyncrasies.

Limitations

Heterogeneous assays (cell, animal, biopsy) introduce scale and variance differences in KD%. We mitigate this by retaining the Model field for stratified evaluation and by avoiding aggressive normalization that could erase meaningful biological variability. Some literature entries rely on ranges or qualitative categories; our midpoint and bin-mapping rules are conservative but inevitably introduce label noise. A minority of references lack explicit sequences; those rows are excluded from sequence-feature analyses until sequences are available.

Reproducibility

All processing steps—from ingestion through label harmonization and feature computation—are scripted so the master feature table and all figures can be regenerated from the ./data CSVs. Outputs are saved locally to maintain a simple, portable workflow.

References
  1. Bennett, C. Frank, and Frank Rigo. “Antisense Oligonucleotide Therapeutics: Precision Medicine for RNA Targets.” Molecular Therapy, vol. 25, no. 5, 2017, pp. 953–967.
  2. Kuespert, Sabrina, et al. “Antisense Oligonucleotide in LNA-Gapmer Design Targeting TGFBR2—A Key Single Gene Target for Safe and Effective Inhibition of TGFβ Signaling.” International Journal of Molecular Sciences, vol. 21, 2020, p. 1952.
  3. Korobeynikov, V. A., et al. “Antisense Oligonucleotide Silencing of FUS Expression as a Therapeutic Approach in Amyotrophic Lateral Sclerosis.” Nature Medicine, vol. 28, 2022, pp. 104–116.
  4. Pal, Pallabi, et al. “Antisense Oligonucleotides Targeting Valosin-Containing Protein Improve Muscle Pathology and Molecular Defects in Cell and Mouse Models of Multisystem Proteinopathy.” bioRxiv, 26 Jul. 2025, doi:10.1101/2025.07.25.665261.
  5. SantaLucia, John, and Donald Hicks. “The Thermodynamics of DNA Duplex Formation.” Annual Review of Biophysics and Biomolecular Structure, vol. 33, 2004, pp. 415–440.
  6. Tran, Hélène, et al. “Suppression of Mutant C9ORF72 Expression by a Potent Mixed-Backbone Antisense Oligonucleotide.” Nature Medicine, vol. 28, no. 1, 2022, pp. 117–124.
  7. Geary, Robert S., et al. “Pharmacokinetics and Pharmacodynamics of Antisense Oligonucleotides.” Nucleic Acid Therapeutics, vol. 25, no. 2, 2015, pp. 77–83.
  8. Lagier-Tourenne, Clotilde, et al. “Targeted Depletion of ATXN2 as a Therapeutic Strategy for ALS.” Neuron, vol. 81, no. 4, 2014, pp. 667–680.
  9. Miller, Timothy M., et al. “Trial of Antisense Oligonucleotide Tofersen for SOD1 ALS.” The New England Journal of Medicine, vol. 387, no. 12, 2022, pp. 1099–1110.
  10. Ngo, Duong, et al. “Biomarkers for C9orf72-Related ALS/FTD: CSF Poly(GP) as a Pharmacodynamic Readout.” Annals of Clinical and Translational Neurology, vol. 7, no. 8, 2020, pp. 1292–1305.
  11. Wan, Yanfeng, et al. “Ensemble Defect as a Measure of RNA Target Accessibility.” RNA, vol. 27, no. 7, 2021, pp. 848–860.
  12. Patent (ATXN2): Hagedorn, Peter, et al. Oligonucleotides for Modulating ATXN2 Expression. US Patent Application US 2021/0024925 A1, 28 Jan. 2021.
  13. Patent (FUS): Person Institute for Neurogenomics, et al. Compositions and Methods for Treating FUS Associated Diseases. WO 2023/147629 A1, 10 Aug. 2023.
  14. Patent (SOD1): Ionis Pharmaceuticals, Inc., et al. Compositions for Modulating SOD1 Expression. WO 2015/153800 A2, 8 Oct. 2015.

S2: Sample Data

This table shows a small, representative slice of the raw data we collected from literature/patents and our wet-lab assays. Each row preserves the fields used for traceability and modeling across key families. Complete family-level tables are available in the underlying CSVs and the full supplement.

Family | ASO_ID | Sequence (5′→3′) | %KD | Bin | SRC | LBL | Len | GC% | CpG | Hpoly | RefCode
ATXN2 | 21_1 | CAttttactttaacctcc | 59.00 | Med | L | E | 18 | 33 | 0 | 4 | ATXN2F
ATXN2 | 22_1 | CAcattttactttaacCTC | 91.00 | High | L | E | 19 | 32 | 0 | 4 | ATXN2F
ATXN2 | 17_1 | TTCAcattttacttTAAC | 98.00 | High | L | E | 18 | 22 | 0 | 4 | ATXN2F
ATXN2 | 16_1 | TTCAcattttactttaaccT | 98.00 | High | L | E | 19 | 32 | 0 | 4 | ATXN2F
C9ORF72 | ASO1 | CCCTAGCGCGCGACT | 80.00 | High | L | E | 15 | 67 | 3 | 3 | C9F
C9ORF72 | ASO5-2 | GCoCoCoCTAGCGCGCGoAoCoTC | 80.00 | High | L | E | 23 | 65 | 3 | 4 | C9F
DAZ | WetLab_19 | TCTCTAACCCATCAGCACAA | 32.97 | Low | W | E | 20 | 45 | 0 | 3 | WET
FAM | WetLab_8 | ATTCCGAAAGAATGGTCACA | 37.58 | Low | W | E | 20 | 40 | 1 | 3 | WET
FUS | 1 | CUCUUCUUCUUCUUCUUCUUUAUU | 20.00 | Low | L | E | 24 | 29 | 0 | 3 | FUS_Pat
FUS | 19 | UCUUCUUCUUCUUCUUCUUCU | 20.00 | Low | L | E | 21 | 33 | 0 | 2 | FUS_Pat
FUS | 15 | CCAGGCCCCGCCCTCGGCTC | 36.00 | Low | L | E | 20 | 70 | 2 | 0 | FUS_Pat
FUS | 29 | AGAAAGGGGAGGGGGTGGGT | 60.00 | Med | L | E | 20 | 65 | 0 | 2 | FUS_Pat
SOD1 | SOD1_12 | GTGAGGCGTCAGGTTGGGCG | 10.00 | Low | L | E | 20 | 65 | 0 | 3 | SOD1_Pat
SOD1 | SOD1_47 | GCTGATGCCCCGGATCCTGA | 33.00 | Med | L | E | 20 | 65 | 0 | 3 | SOD1_Pat
SOD1 | SOD1_45 | TCCATCGCAGCCTTGATGTT | 37.00 | Med | L | E | 20 | 50 | 0 | 3 | SOD1_Pat
SOD1 | SOD1_33 | GATTTGGGCGATCCCAATTG | 90.00 | High | L | E | 20 | 45 | 0 | 3 | SOD1_Pat
TGFBR2 | TGFBR2_1 | AGGAGGAAGACGACGATGAG | 30.00 | Med | L | E | 20 | 55 | 0 | 3 | TGFBR2F
TGFBR2 | TGFBR2_4 | CTTCCGAGATGTTCCAGGAC | 63.00 | High | L | E | 20 | 60 | 0 | 3 | TGFBR2F
S2 Table — Sample rows by family (subset of collected data)
SRC: W = Wet-lab, L = Literature/Patent
LBL: E = Explicit numeric, R = Range midpoint, B = Bucket-mapped
RefCode: ATXN2F = ATXN2 sequences with knockdown (features), C9F = C9ORF72 sequences (features), TGFBR2F = TGFBR2 sequences (features), FUS_Pat = WO2023147629A1_FUS, SOD1_Pat = US2024… SOD1 patent, WET = Wet-lab dataset
Note: When sources reported only qualitative labels, the pipeline stores a bucket and (elsewhere) maps High/Med/Low → 80/55/20 for harmonization.

S3: Feature Panel

This section lists the features we feed to the model, the variable names used in code, any defaults/notes, and the tool or method used to compute each feature. The panel is the compact set derived from the outputs of the Data Collection & Standardization step. Thermodynamics/accessibility features are included only when consistently available.

Core features
Feature (group) | Model variable(s) | Default / notes | Tool / implementation
Sequence length (composition) | seq_len | Integer count of bases (n) | Python featurizer (pandas)
GC fraction (composition) | GC_frac | \((\#G+\#C)/n\) | Python featurizer
CpG count / density (composition) | CpG_count, CpG_density | Density = CpG_count/(n−1) | Python featurizer
Max homopolymer (composition) | max_homopolymer | Longest run of A/C/G/T | Python featurizer
Dinucleotide frequencies (motifs) | di_AA … di_TT (16 vars) | 0 if n<2 | Python featurizer
Regional GC (5–10–5) (gapmer) | wingGC_left, wingGC_gap, wingGC_right | Defined if n≥20 (5′ wing / gap / 3′ wing) | Python featurizer (windowed counts)
Chemistry / backbone (categorical) | One-hots (e.g., Chem_PS, Chem_MOE, Chem_LNA, Chem_cEt) | 0/1; unknown → all zeros | One-hot from Chemistry text
Optional features (used when consistent and available)
Feature (group) | Model variable(s) | Default / notes | Tool / implementation
Trinucleotide frequencies (motifs) | tri_AAA … tri_TTT (64 vars) | 0 if n<3; often used for enrichment & optional input | Python featurizer
Duplex free energy (thermo) | DeltaG_duplex | More negative → stronger duplex (nearest-neighbor DNA/RNA) | SantaLucia NN model (offline); ViennaRNA/NUPACK when applicable (Lorenz et al., 2011; Zadeh et al., 2011; SantaLucia & Hicks, 2004)
Target accessibility (structure) | ED (opening/ensemble-defect proxy) | Higher ED → harder to access | RNAplfold / accessibility workflow (ViennaRNA) (Lorenz et al., 2011)
Effective binding proxy | DeltaG_eff = DeltaG_duplex − ED | Only if both inputs exist; down-weights blocked sites | Derived from the two fields above
Tm (Wallace proxy) (quick) | Tm_Wallace_proxy | \(2(A+T)+4(G+C)\) °C | Python featurizer (sanity check)
Context fields (not fed to the learner)
Field | Usage | Default / notes | Source
Target gene / family | Not fed | Used for stratified splits & k-shot family offsets | Standardized schema (TargetGene)
Source domain / model | Not fed | Used for analysis/plots; avoid leakage | Standardized schema (SourceDomain, Model)
Tool references (MLA)
  1. Lorenz, Ronny, et al. “ViennaRNA Package 2.0.” Algorithms for Molecular Biology, vol. 6, 2011, p. 26.
  2. Zadeh, Joseph N., et al. “NUPACK: Analysis and Design of Nucleic Acid Systems.” Journal of Computational Chemistry, vol. 32, no. 1, 2011, pp. 170–173.
  3. SantaLucia, John, and Donald Hicks. “The Thermodynamics of DNA Duplex Formation.” Annual Review of Biophysics and Biomolecular Structure, vol. 33, 2004, pp. 415–440.

S4: Model Novelty

A one-page ledger that separates what’s new in our model from standard methods, and shows compact evidence (ablation + calibration) that the novel parts matter.

  • New Concept #1 — ASO-specific effective free energy (ΔGeff): ΔGduplex − opening energy (stability − accessibility) as a single feature for RNA:DNA gapmers.
  • New Concept #2 — One-shot, per-family calibration: one calibrator ASO per new gene adds a small offset (δg) that preserves rank but calibrates the %KD scale.
  • New Concept #3 — Gapmer-aware encoding: explicit 5–10–5 regional GC (left wing / gap / right wing) and related motifs aligned to RNase-H design practice.
  • Standard methodologies adopted: nearest-neighbor duplex ΔG; ΔG↔K; RNAplfold/Vienna accessibility; gradient boosting; MAE / R² / Spearman.
# | Equation (compact) | What it captures | How we use it | Novel?
1 | \( \Delta G^{\circ}_{\mathrm{duplex}} = \sum \Delta G^{\circ}_{\mathrm{stack}} + \Delta G^{\circ}_{\mathrm{init}} + \Delta G^{\circ}_{\mathrm{dangling}} + \Delta G^{\circ}_{\mathrm{terminal}} \) | RNA:DNA hybrid stability | Thermo feature; feeds \(\Delta G_{\mathrm{eff}}\) | Std
2 | \( \Delta G^{\circ} = -RT \ln K \) (with \( K \propto 1/K_D \)) | Energy ↔ binding linkage | Interpret occupancy/affinity (not fitted) | Std
3 | \( E_D(i) = -RT \ln p_{\mathrm{unpaired}}(i);\; \overline{E_D} = \frac{1}{N}\sum_i E_D(i) \) | Target accessibility | Penalizes occluded sites; used directly & in \(\Delta G_{\mathrm{eff}}\) | Std
4 | \( \Delta G_{\mathrm{eff}} = \Delta G_{\mathrm{duplex}} - \overline{E_D} \) | Net favorability (stability − opening cost) | Primary biophysics feature tied to %KD | Yes
5 | \( \theta = \dfrac{[A]}{[A] + K_D} \) | Fraction bound at equilibrium | Conceptual link (not fitted) | Std
6 | \( \hat f_M(x) = \sum_{m=1}^M \nu\, h_m(x) \) | Additive boosted trees | Core predictor | Std
7 | \( \tilde y = \hat f(x) + \delta_g \) | Group offset (random-intercept style) | One calibrator ASO per gene for portability | Yes
8 | MAE; \( R^2 \); Spearman’s \( \rho \) | Error / fit / rank | Evaluation & reporting | Std
9 | 5–10–5 regional GC (gapmer): \( \mathrm{GC}_{L} = \dfrac{\#G_{1..5}+\#C_{1..5}}{5} \), \( \mathrm{GC}_{\mathrm{gap}} = \dfrac{\#G_{6..15}+\#C_{6..15}}{10} \), \( \mathrm{GC}_{R} = \dfrac{\#G_{(n-4)..n}+\#C_{(n-4)..n}}{5} \) | Regional composition (wings stabilize; gap recruits RNase-H) | Three explicit GC fractions used as features | Yes
10 | GC, CpG, di/tri-mers, homopolymers | Local sequence context | Feature panel | Std
Table S4.1: Equation inventory with usage and novelty flag

Hold-out ablations were computed by training each model variant with the same protocol and split, changing only the feature set (e.g., removing ΔGeff or regional-GC), and then evaluating on a fixed test set never seen during training. We report %KD MAE (mean absolute error) across all test guides; lower is better. ΔMAE vs Full is the MAE difference relative to the full model—positive values mean the variant performs worse, highlighting the contribution of the removed feature(s).

Model variant | %KD MAE | ΔMAE vs Full
Full (ΔGeff + regional-GC + all features) | 7.156 | (baseline)
− ΔGeff proxy (duplex only; no opening penalty) | 7.191 | +0.035
K-mer only | 13.086 | +5.930
Sequence-only | 54.593 | +47.437
Table S4.2: Hold-out ablation: ΔGeff & regional-GC improve accuracy

Key Takeaway: Thermodynamics + accessibility together outperform sequence-only/accessibility-only; removing the opening penalty (duplex-only) reliably hurts, supporting ΔGeff.

Fig 1: Hold-out %KD predictions (full model)
Fig 2: Ablation (ΔMAE vs full)

For one-shot reporting, we simulate a new gene by selecting a single calibrator ASO per family from the training fold, fitting a monotone offset (δg, isotonic_by_family) that maps the model’s raw %KD to the calibrator’s observed %KD, and then applying this same offset to all held-out guides from that family. The table shows %KD MAE before (“none”) and after (“isotonic_by_family”) this one-point calibration; lower after-MAE (or unchanged when a family was already centered or n is tiny) indicates the model’s predictions are portable across genes once a single wet-lab anchor is provided, while preserving ranks within the family. This validates the intended deployment: rank with the base model, then use one wet-lab calibrator to place predictions on the correct %KD scale for the new gene.
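With a single calibrator ASO, the monotone offset described above reduces to a constant, rank-preserving shift; a minimal sketch (names are illustrative, not the project's `isotonic_by_family` code):

```python
def one_shot_calibrate(raw_preds, pred_calibrator, obs_calibrator):
    """Single-anchor family calibration: shift every raw %KD prediction by
    delta_g = observed - predicted for the one calibrator ASO."""
    delta_g = obs_calibrator - pred_calibrator
    return [p + delta_g for p in raw_preds]

def rank_order(values):
    """Argsort-based ordering; used to confirm the shift preserves within-family rank."""
    return sorted(range(len(values)), key=lambda i: values[i])

raw = [48.0, 61.0, 33.5]
# Calibrator ASO: model predicted 48.0 %KD, wet lab observed 55.0 %KD.
calibrated = one_shot_calibrate(raw, pred_calibrator=48.0, obs_calibrator=55.0)
```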

Family | MAE (none) | MAE (isotonic_by_family) | n_test
ATXN2 | 12.575 | 12.575 | 36
SOD1 | 13.139 | 13.274 | 18
DAZAP1 | 5.085 | 5.085 | 36
FAM98A | 11.253 | 11.253 | 12
Table S4.3: One-shot per-family calibration (rank preserved; scale recentred)
Fig 3: One-shot per-family calibration (overlay; rank preserved; scale recentred)

Key Takeaway: With tiny per-family test sets or already-centred families, MAE shifts are modest; the value is re-centring predictions toward the 1:1 line while preserving ranks.

Reliability curve. We bin the calibrated predictions into equal-frequency deciles and, for each bin, plot the mean predicted %KD against the mean observed %KD with bootstrap 95% CIs. A curve that hugs the identity line indicates well-scaled predictions after one-shot calibration; any systematic bowing reveals under/over-prediction in specific ranges (often the extremes).

Residuals. Residuals are defined as \(r_i = y_i - \hat y_i\). We show a histogram to check that errors are centered near 0, roughly symmetric, and without heavy tails or clear heteroscedasticity. This pattern supports that the offset recenters families without distorting ranks and that remaining errors are mostly idiosyncratic rather than systematic.
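The equal-frequency decile binning behind the reliability curve can be sketched as follows (bootstrap CIs omitted; `reliability_bins` is an illustrative helper, not the pipeline's code):

```python
import numpy as np

def reliability_bins(pred, obs, n_bins=10):
    """Equal-frequency bins over predictions; (mean pred, mean obs) per bin.

    A curve through these points that hugs the identity line indicates
    well-scaled predictions; systematic bowing reveals under/over-prediction
    in specific ranges.
    """
    pred = np.asarray(pred, dtype=float)
    obs = np.asarray(obs, dtype=float)
    order = np.argsort(pred)                # sort so bins are prediction deciles
    chunks = np.array_split(order, n_bins)  # equal-frequency (±1) bins
    return [(pred[c].mean(), obs[c].mean()) for c in chunks if len(c)]
```

Residuals are then simply `obs - pred` on the same calibrated predictions.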

Fig 4: Reliability (post-calibration)
Fig 5: Residuals (post-calibration)

S5: Code Snippets

Compact code listings that corroborate the Methods/Results: schema hygiene, biophysics features (ΔG / ED / ΔGeff), design matrix, leakage-safe CV, simple calibration/blend, and reproducibility helpers. The full code is available on the Software page.

Label Harmonization and Schema: cleaning.py

Purpose: Establish a consistent schema across heterogeneous inputs by robustly discovering sequence, gene, and target-sequence columns (case-insensitive), so downstream feature generation always sees the same fields.

def detect_core_columns(df: pd.DataFrame) -> Dict[str, Optional[str]]:
    """
    Resolve core columns in a flexible, case-insensitive way.
    """
    seq_col = find_col(df, [COL_SEQUENCE, "ASO_sequence", "OligoSeq", "seq"])
    gene_col = find_col(df, [COL_TARGET_GENE, "Target", "Gene", "TargetFamily"])
    ts_col = find_col(df, [COL_TARGET_SEQ, "TargetSeq", "RNA_target", "target_sequence"])
    return {"sequence": seq_col, "gene": gene_col, "target_seq": ts_col}
Upstream of feature generation; ensures consistent inputs for later steps.
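The `find_col` helper used above is not listed; a plausible case-insensitive implementation might look like this (a sketch, not the project's actual code):

```python
from typing import Optional, Sequence
import pandas as pd

def find_col(df: pd.DataFrame, candidates: Sequence[str]) -> Optional[str]:
    """Return the first DataFrame column matching any candidate name,
    compared case-insensitively; None if nothing matches."""
    by_lower = {c.lower(): c for c in df.columns}
    for cand in candidates:
        hit = by_lower.get(cand.lower())
        if hit is not None:
            return hit
    return None
```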
ΔG, ED, and ΔGeff: thermo_plfold.py

Purpose: Compute RNAduplex binding energy (ΔG) and RNAplfold opening energy (ED) for the aligned site, then record the biophysically meaningful ΔG_eff = ΔG − ED used in the model’s feature panel.

def add_rnaplfold_features_slice_first(df, aso_col, gene_col, target_rna,
                                        W=150, L=None, window_pad=150, tempC=37, debug=False, timeit=False):
    """
    Adds columns: rnaplf_ED_kcal, duplex_dG_kcal, deltaG_eff_kcal, slice_start, slice_len
    - Site is found by RNAduplex on the full target (fast & robust)
    - ED is computed slice-first from RNAplfold Pu on a local window (cached)
    """
    import pandas as pd
    out = df.copy()
    if L is None:
        med = out[aso_col].astype(str).str.len().replace(0, np.nan).median()
        L = int(med if pd.notna(med) else 20)

    EDs=[]; dGs=[]; dGeffs=[]; starts=[]; lens=[]
    tgt = to_rna_clean(target_rna)
    ...
        ED = ed_kcal_from_pu(P)
        ...
        dG_eff = dG - ED if np.isfinite(dG) and np.isfinite(ED) else np.nan

        EDs.append(ED); dGs.append(dG); dGeffs.append(dG_eff); starts.append(start); lens.append(glen)

    out["rnaplf_ED_kcal"] = EDs
    out["duplex_dG_kcal"] = dGs
    out["deltaG_eff_kcal"] = dGeffs
    out["slice_start"] = starts
    out["slice_len"] = lens
    return out
Used in Phase 3–6 features to penalize inaccessible target regions.
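The `ed_kcal_from_pu` helper is elided above; one common convention derives the opening energy from the unpaired probability \(P_u\) as \(ED = -RT\ln P_u\). The sketch below assumes that convention (the pipeline's exact sign handling and slicing are not shown here):

```python
import math

R_KCAL = 0.0019872  # gas constant, kcal/(mol*K)

def ed_kcal_from_pu(pu: float, tempC: float = 37.0) -> float:
    """Opening energy from an RNAplfold unpaired probability Pu.

    Assumes the common convention ED = -RT ln(Pu): a poorly accessible
    site (small Pu) carries a large positive opening penalty.  This is a
    sketch of that relation, not the project's exact helper.
    """
    T = 273.15 + tempC
    return -R_KCAL * T * math.log(max(pu, 1e-12))  # guard against Pu == 0
```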
Design Matrix Builder: design_matrix.py

Purpose: Align literature and wet-lab tables, assemble the exact numeric feature set used for training/evaluation, and emit the targets with a concise coverage profile for auditability.

def build_design(
    lit_df: pd.DataFrame,
    wet_df: pd.DataFrame,
    cfg: TrainConfig,
) -> Tuple[pd.DataFrame, pd.DataFrame, pd.Series, Optional[pd.Series], List[str]]:
    """
    Construct aligned design matrices and targets.
    """
    # 1) Discover & select candidate features from literature
    ...
    X_lit = lit_aligned[feats].copy()
    X_wet = wet_aligned[feats].copy()

    # 4) Targets
    y_lit = pd.to_numeric(lit_aligned.get(COL_OBS_KD, pd.Series([np.nan] * len(lit_aligned))), errors="coerce")
    y_wet = None
    if COL_OBS_KD in wet_aligned.columns:
        y_wet = pd.to_numeric(wet_aligned[COL_OBS_KD], errors="coerce")

    # 5) Diagnostics
    log("Design matrix — literature coverage:")
    log(df_profile(X_lit, feats).to_string(index=False))
    log(f"Selected {len(feats)} numeric features for training.")
    return X_lit, X_wet, y_lit, y_wet, feats
Ensures parity of feature columns between training (literature) and evaluation (wet-lab).
Numeric Preprocessor (impute/scale): design_matrix.py

Purpose: Define the precise preprocessing stack for numeric features (median imputation and optional scaling) inside a ColumnTransformer so training and evaluation remain identical.

def make_numeric_preprocessor(
    feature_names: Sequence[str],
    scale: bool = True,
) -> ColumnTransformer:
    """
    Create a ColumnTransformer for numeric features:
      - median imputation
      - optional StandardScaler(with_mean=False) to be sparse/CSR-friendly
    """
    if scale:
        num_pipe = Pipeline(
            steps=[("imp", SimpleImputer(strategy="median")),
                   ("sc", StandardScaler(with_mean=False))])
    else:
        num_pipe = SimpleImputer(strategy="median")

    pre = ColumnTransformer(transformers=[("num", num_pipe, list(feature_names))],
                            remainder="drop")
    return pre
Matches the training/evaluation pipelines described in Methods.
Family-aware LOSO CV (no leakage): train_eval.py

Purpose: Run leave-one-gene-out cross-validation without family leakage—fit on all other genes, score the held-out family, and report MAE with consistent preprocessing and model settings.

def _run_loco_cv(
    lit_df: pd.DataFrame,
    feats: Sequence[str],
    gene_col: str,
    use_hgbr: bool = True,
    min_test: int = 3,
    min_train: int = 10,
) -> List[Tuple[str, float, int]]:
    """
    Leave-One-Gene-Out CV over literature genes.
    """
    from sklearn.compose import ColumnTransformer
    from sklearn.impute import SimpleImputer
    from sklearn.preprocessing import StandardScaler
    from sklearn.ensemble import GradientBoostingRegressor, HistGradientBoostingRegressor
    from sklearn.pipeline import Pipeline
    ...
    model = Pipeline([("prep", prep), ("reg", reg)])
    model.fit(X_tr, y_tr)
    pred = model.predict(X_te)
    mae = mean_absolute_error(y_te, pred)
    results.append((str(g), float(mae), int(len(y_te))))
    return results
Used to report per-family validation without leaking same-family patterns into training.
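The fold construction performed by the elided loop can be sketched with a small group-wise splitter (illustrative, numpy-only):

```python
import numpy as np

def loso_splits(groups):
    """Yield (held_out_gene, train_idx, test_idx) folds.

    Every guide from the held-out gene family goes to the test fold, so
    no same-family pattern can leak into training.
    """
    groups = np.asarray(groups)
    for g in np.unique(groups):
        test = np.flatnonzero(groups == g)
        train = np.flatnonzero(groups != g)
        yield g, train, test
```

Each `(train_idx, test_idx)` pair then feeds the preprocessing-plus-regressor pipeline shown above, with MAE reported per held-out family.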
Regressor × Classifier Blend: models.py

Purpose: Combine the regressor output with bin-based expected %KD using a transparent weight α, providing a simple fallback-safe ensemble that remains easy to interpret.

def blend_predictions(
    reg_pred: np.ndarray,
    cls_expected: np.ndarray,
    alpha: float = 0.5,
) -> np.ndarray:
    """
    Blend regressor and classifier-derived expectations into a single %KD prediction.
    """
    if reg_pred is None or len(reg_pred) == 0:
        return cls_expected
    if cls_expected is None or len(cls_expected) == 0:
        return reg_pred
    return alpha * reg_pred + (1.0 - alpha) * cls_expected
Complements the calibrated pipeline; keeps behavior readable for users.
Basic Sequence Features (GC, Tm): features_basic.py

Purpose: Compute compact, interpretable composition features (GC fraction, Wallace Tm) that appear across phases and pair with thermodynamic/accessibility features.

# -----------------------------
# Basic sequence features
# -----------------------------
def _gc_frac(seq: str) -> float:
    L = len(seq)
    if L == 0:
        return np.nan
    return float(seq.count("G") + seq.count("C")) / float(L)

def _tm_wallace(seq: str) -> float:
    # 2*(A+T) + 4*(G+C)
    return float(2 * (seq.count("A") + seq.count("T")) + 4 * (seq.count("G") + seq.count("C")))
Forms part of the compact, interpretable feature panel.
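A quick worked check of these two features (standalone re-implementations mirroring the listing above):

```python
# Worked check of the Wallace rule, 2*(A+T) + 4*(G+C), and GC fraction
# on a toy 4-mer; these mirror _gc_frac / _tm_wallace in features_basic.py.
def gc_frac(seq: str) -> float:
    return (seq.count("G") + seq.count("C")) / len(seq) if seq else float("nan")

def tm_wallace(seq: str) -> int:
    return 2 * (seq.count("A") + seq.count("T")) + 4 * (seq.count("G") + seq.count("C"))

print(gc_frac("ATGC"), tm_wallace("ATGC"))  # 0.5 12
```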
Thermo Coverage Reporter: thermo_vrna.py

Purpose: Summarize availability and uniqueness of thermodynamic columns to verify feature coverage and catch data/tooling gaps per dataset or family.

def report_thermo_coverage(df: pd.DataFrame) -> pd.DataFrame:
    """
    Convenience: return a small profile (non-NaN counts and unique counts) for the thermo columns.
    """
    cols = [c for c in THERMO_COLS if c in df.columns]
    if not cols:
        return pd.DataFrame(columns=["column", "non_nan", "n_rows", "n_unique"])
    return df_profile(df, cols)
Pairs with Results diagnostics when thermo tools/inputs vary by family.
Reproducibility & Hygiene Helpers: utils.py

Purpose: Seed all RNGs deterministically and provide a small observed-vs-predicted plotting utility so runs are repeatable and diagnostics are trivial to generate.

def set_global_seed(seed: int = 42) -> None:
    try:
        import torch  # type: ignore
        torch.manual_seed(seed); torch.cuda.manual_seed_all(seed)
        torch.backends.cudnn.deterministic = True; torch.backends.cudnn.benchmark = False
    except Exception:
        pass
    np.random.seed(seed); random.seed(seed)

@contextmanager
def timeblock(label: str):
    t0 = time.time(); log(f"Time  start: {label}")
    try:
        yield
    finally:
        dt = time.time() - t0; log(f"done:  {label} ({dt:.2f}s)")

def save_obs_vs_pred_plot(obs: pd.Series, pred: pd.Series, out_path, title="Observed vs Predicted %KD",
                          xlab="Observed %KD", ylab="Predicted %KD", lims=(0.0, 100.0)) -> None:
    out_path = to_path(out_path); ensure_dir(out_path.parent)
    plt.figure(); plt.scatter(obs, pred); plt.plot(lims, lims)
    plt.xlim(lims); plt.ylim(lims); plt.xlabel(xlab); plt.ylabel(ylab); plt.title(title)
    plt.savefig(out_path, bbox_inches="tight"); plt.close()
Used throughout training/evaluation blocks to keep runs reproducible and auditable.

Context

TDP-43 is a DNA/RNA-binding protein associated with neurodegenerative diseases like amyotrophic lateral sclerosis (ALS) and frontotemporal dementia (Suk et al., 2020). Its C-terminal domain (CTD) is intrinsically disordered in these diseased states, often contributing to aggregation within cells (Jo et al., 2020). These disordered regions are also typically accessible due to protein misfolding, offering a promising target region for aptamer binding. In this study, we used the AptaTrans framework to generate RNA aptamers in silico and scored them by predicted binding affinity for the disordered segment of the TDP-43 CTD.

Design Pipeline

Identifying Target TDP-43 CTD Region

To ensure that the aptamers we generated and scored were most likely to target experimentally identified disordered segments of the TDP-43 CTD, we used the IUPred2A disorder prediction tool to score the human TDP-43 residue chain, sourced from the UCSC Genome Browser. IUPred2A scores residues by estimating the pairwise interaction energies between amino acids along the sequence and predicting whether a residue is likely to participate in a stable folded structure (Meszaros et al., 2018). It operates on the principle that intrinsically disordered regions lack the favorable interaction energies needed to form a stable tertiary structure. Each residue is assigned a score between 0 and 1, where higher scores indicate a greater likelihood of disorder. Residues with disorder scores ≥ 0.5 were considered disordered (Zhang et al., 2024). We selected the longest consecutive stretch of disordered residues as the aptamer target region for input into the AptaTrans software.
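The selection of the longest consecutive disordered stretch can be sketched as follows (the scores in the example are hypothetical, not actual IUPred2A output):

```python
def longest_disordered_run(scores, threshold=0.5):
    """Return (start, end) inclusive indices of the longest consecutive run
    of residues with disorder score >= threshold, or None if no residue
    passes the threshold."""
    best_len, best_start, run_start = 0, None, None
    for i, s in enumerate(list(scores) + [-1.0]):  # sentinel flushes the final run
        if s >= threshold:
            if run_start is None:
                run_start = i
        else:
            if run_start is not None and i - run_start > best_len:
                best_len, best_start = i - run_start, run_start
            run_start = None
    if best_start is None:
        return None
    return best_start, best_start + best_len - 1
```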

Aptamer Candidate Generation

The AptaTrans deep learning framework combines pretrained transformer-based encoders with a Monte Carlo Tree Search (MCTS) optimization scheme (Shin et al., 2023). MCTS grows partial sequences one nucleotide at a time, scores each partial chain with the built-in deep learning model, and backpropagates that feedback to refine the search tree for the next nucleotide choice, until up to 40 bases have been added to the sequence. This feedback loop steers each generation run toward higher-scoring aptamers.

Optimization and Evaluation — Iterations

To further optimize the generated aptamers, we varied the number of MCTS iterations, seeking a balance between diversity and convergence on strong candidates. As shown in Figure 1, increasing iterations initially improved average aptamer scores, but oversampling beyond 10 iterations yielded diminishing returns: the search converged toward a narrow subset of sequences, reducing candidate diversity without significant gains in top scores. We therefore used n = 10 iterations for all subsequent generation runs.

Aptamer iterations vs. score
Figure 1: Line plot depicting the scores of the highest-ranking aptamers vs. number of MCTS iterations.
Optimization and Evaluation — Seeds

Additionally, to ensure robustness and identify high-performing MCTS starting states, we tested five random seeds (1001–1005). Seed 1004 produced the highest-scoring aptamers (Figure 2), so it was selected for subsequent generation runs.

Aptamer scores across random seeds
Figure 2: Ranked aptamer scores for different random seeds (1001–1005), each with 10 MCTS iterations.

Molecular Docking

Protein File Generation

Since our aptamers were developed to target the TDP-43 CTD, we first created modified TDP-43 Protein Data Bank (PDB) files; we used BioPython to trim the protein residue chain to include only residues 375-414, the identified CTD residues (covering both ordered and disordered segments). This focuses docking on aptamer binding to the CTD.
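Our trimming step used BioPython; a dependency-free sketch of the same idea filters ATOM records by the fixed-column residue number (the input lines here are synthetic):

```python
def trim_pdb_to_ctd(lines, lo=375, hi=414):
    """Keep coordinate records whose residue number (PDB columns 23-26)
    falls within [lo, hi]; non-coordinate records pass through unchanged.

    Illustrative stdlib-only version of the residue-range filter; the
    actual pipeline used BioPython for this step.
    """
    kept = []
    for ln in lines:
        if ln.startswith(("ATOM", "HETATM")):
            resseq = int(ln[22:26])  # residue sequence number, fixed columns
            if not (lo <= resseq <= hi):
                continue
        kept.append(ln)
    return kept
```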

Next, we used RNAComposer to create PDB files of our top three aptamers from the previously generated sequences. We assigned unique chain identifiers (A and B) to the two input molecules (TDP-43 CTD and RNA aptamer, respectively) to direct docking directionality. The chain identifiers mark the TDP-43 CTD as the receptor and the aptamer as the mobile ligand, so during rigid-body docking the TDP-43 CTD remains fixed while the RNA aptamer rotates and translates around the CTD.

AIR Table and Fine-Tuning

Guided by IUPred2A disorder predictions for TDP-43, we created an ambiguous interaction restraints (AIR) table to bias sampling towards the disordered segments of the CTD residue chain. Active residues were defined as the residues within the disordered CTD segment (A:376-382) and passive residues were defined as all other sequence neighbors. On the aptamer, bases facing the putative interface were marked active (B:6-12), and all other neighbors were automatically marked passive.
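Generating such a restraint table can be sketched as below; the CNS-style `assign` statement is the common HADDOCK AIR form, but the exact syntax should be checked against the HADDOCK version used (`air_restraint` is an illustrative helper):

```python
def air_restraint(active_res, segid_a, partner_residues, segid_b,
                  d=2.0, d_minus=2.0, d_plus=0.0):
    """One ambiguous restraint: an active residue on chain A restrained to
    any of the partner residues on chain B (CNS-style 'assign' statement)."""
    partner_sel = " or ".join(
        f"(resid {r} and segid {segid_b})" for r in partner_residues)
    return (f"assign (resid {active_res} and segid {segid_a}) "
            f"({partner_sel}) {d} {d_minus} {d_plus}")

# AIRs from the disordered CTD actives (A:376-382) to the aptamer interface (B:6-12)
tbl = "\n".join(air_restraint(r, "A", range(6, 13), "B") for r in range(376, 383))
```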

Model Scoring

Finally, we generated the docked complexes (visualized in PyMOL) and scored them according to the following guidelines.

Name | Meaning | Ideal State
HADDOCK Score | Weighted combination of energy terms and restraints | Lower (more negative)
Van der Waals Energy | Attractive/repulsive forces between binding atoms | Lower
Electrostatic Energy | Coulombic attraction/repulsion at the binding site | Large negative
Buried Surface Area | Area buried at the interaction site | Higher
Restraints Energy | Adherence to AIR restraints | Lower
Docking metrics and desired directions.

Some key assumptions of the model include single-interface binding and the presence of flexible residues concentrated at the CTD window.

Results

Aptamer A Docking Scores

Structure Rank | HADDOCK score [a.u.] | Van der Waals Energy | Electrostatic Energy | Restraints Energy | Desolvation Energy | Interface RMSD [Å] | Ligand RMSD [Å] | Interface-ligand RMSD [Å] | Fraction of Common Contacts | DOCKQ | Buried Surface Area [Å²] | Structure ID
1 | -92.54 | -34.6 | -64.52 | 45.57 | 10.94 | 0 | 0 | 0 | 1 | 1 | 892.71 | 5
2 | -88.44 | -31.33 | -58.76 | 3.49 | 9.71 | 0.86 | 1.51 | 1.68 | 0.82 | 0.85 | 841.61 | 3
3 | -74.78 | -19.06 | -61.46 | 33.68 | 9.98 | 6.8 | 34.05 | 21.67 | 0.27 | 0.13 | 760.66 | 4
4 | -44.87 | -18.92 | -28.24 | 57.39 | 2.8 | 5.48 | 28.81 | 14.97 | 0.23 | 0.13 | 625.11 | 2
5 | -36.67 | -23.94 | -14.44 | 42.9 | 4.45 | 4.89 | 16.13 | 11.15 | 0.41 | 0.24 | 702.74 | 1
Aptamer A: Docking scores and interface metrics (HADDOCK).

Aptamer B Docking Scores

Structure Rank | HADDOCK score [a.u.] | Van der Waals Energy | Electrostatic Energy | Restraints Energy | Desolvation Energy | Interface RMSD [Å] | Ligand RMSD [Å] | Interface-ligand RMSD [Å] | Fraction of Common Contacts | DOCKQ | Buried Surface Area [Å²] | Structure ID
1 | 12.03 | -44.89 | -107.51 | 1643.81 | 13.49 | 0 | 0 | 0 | 1 | 1 | 1343.54 | 3
2 | 17.8 | -27.18 | -89.48 | 1364.14 | 8.91 | 25.53 | 36.2 | 36.51 | 0.06 | 0.04 | 1086.18 | 1
3 | 67.35 | -30.75 | -75.26 | 1745.53 | 9.53 | 3.9 | 8.82 | 8 | 0.52 | 0.38 | 1071.32 | 2
4 | 68.55 | -44.16 | -55.91 | 1717.46 | 9.57 | 24.82 | 37.08 | 39.73 | 0.06 | 0.04 | 1270.12 | 4
5 | 102.49 | -41.36 | -44.55 | 1926.73 | 8.74 | 25.23 | 39.22 | 42.14 | 0.09 | 0.04 | 1300.99 | 5
Aptamer B: Docking scores and interface metrics (HADDOCK).

Aptamer C Docking Scores

Structure Rank | HADDOCK score [a.u.] | Van der Waals Energy | Electrostatic Energy | Restraints Energy | Desolvation Energy | Interface RMSD [Å] | Ligand RMSD [Å] | Interface-ligand RMSD [Å] | Fraction of Common Contacts | DOCKQ | Buried Surface Area [Å²] | Structure ID
1 | -88.29 | -11.79 | -74.77 | 7.16 | 3.59 | 0 | 0 | 0 | 1 | 1 | 603.41 | 4
2 | -79.5 | -22.98 | -53.47 | 1.5 | 4.11 | 1.63 | 2.76 | 2.8 | 0.83 | 0.73 | 730.61 | 3
3 | -68.69 | -20.18 | -49.92 | 22.94 | 5.68 | 1.61 | 4.61 | 3.4 | 0.89 | 0.71 | 656.82 | 5
4 | -63.67 | -19.52 | -43.13 | 0.72 | 4.8 | 1.91 | 11.3 | 7.31 | 0.83 | 0.52 | 588.58 | 1
5 | -44.97 | -21.82 | -25.87 | 114.39 | -1.98 | 3.14 | 11.71 | 7.2 | 0.72 | 0.42 | 674.37 | 2
Aptamer C: Docking scores and interface metrics (HADDOCK).
TDP-43 CTD + Aptamer

Docked complex visualization (PyMOL).

Protein–protein interactions

Interaction map for the aptamer complex.


Final Aptamer Scoring

Finally, we developed a novel computational model that combines learning-based design (AptaTrans scores) with physics-based docking (HADDOCK scores) to prioritize aptamers that are both predicted to be effective and physiologically relevant. Our model outputs a final Aptamer Priority Index (API) score based on the above scores.

Model Inputs

  • For each aptamer i:
    • ATi — AptaTrans score
    • HSi — HADDOCK score for the top cluster
    • BSAi — Mean buried surface area (top cluster)
    • ni — Cluster size (used as a tiebreaker)

Each metric is normalized to contribute appropriately to the final API score.

AptaTrans score normalization:

\[ AT_i = \frac{AT_i - 0}{100 - 0} = \frac{AT_i}{100} \]

Docking score normalization:

\[ (-HS)_i = \frac{ -\text{score}^{\text{best}}_i - \min_j\!\left(-\text{score}^{\text{best}}_j\right) } { \max_j\!\left(-\text{score}^{\text{best}}_j\right) - \min_j\!\left(-\text{score}^{\text{best}}_j\right) + \varepsilon } \]

where \( \text{score}^{\text{best}}_i \) is the most negative HADDOCK score among the docked structures for aptamer \(i\).

BSA normalization:

\[ BSA_i = \frac{ BSA^{(\text{top K})}_i - \min_j BSA^{(\text{top K})}_j } { \max_j BSA^{(\text{top K})}_j - \min_j BSA^{(\text{top K})}_j + \varepsilon } \]

with \( BSA^{(\text{top K})}_i \) representing the mean BSA over the top K best-scoring structures (K = 3).

Final API
\[ API_i = 0.50\,AT_i \;+\; 0.40\,(-HS)_i \;+\; 0.10\,BSA_i \]
Aptamer ID | AptaTrans Score | Best HADDOCK Score | BSA mean (top-K) | API Score | Rank
APT-A | 0.9831 | -92.54 | 831.66 | 0.925 | 1
APT-C | 0.9727 | -88.29 | 663.61 | 0.870 | 2
APT-B | 0.9775 | +12.03 | 1167.01 | 0.589 | 3
Aptamer ranking by API (AptaTrans, HADDOCK, BSA).
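The API computation can be reproduced from the table's inputs (a standalone sketch; note the AptaTrans scores in the table already lie on [0, 1], so the /100 normalization step is a no-op here):

```python
import numpy as np

# Inputs taken from the table above
names = ["APT-A", "APT-B", "APT-C"]
at  = np.array([0.9831, 0.9775, 0.9727])   # AptaTrans scores (already on [0, 1])
hs  = np.array([-92.54, 12.03, -88.29])    # best HADDOCK score per aptamer
bsa = np.array([831.66, 1167.01, 663.61])  # mean BSA over top-K structures

def minmax(v, eps=1e-9):
    """Min-max normalization with the small eps guard from the formulas."""
    v = np.asarray(v, dtype=float)
    return (v - v.min()) / (v.max() - v.min() + eps)

# API = 0.50*AT + 0.40*(-HS, normalized) + 0.10*(BSA, normalized)
api = 0.50 * at + 0.40 * minmax(-hs) + 0.10 * minmax(bsa)
# rounds to 0.925 (APT-A), 0.589 (APT-B), 0.870 (APT-C), matching the ranking
```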

The best overall candidate, according to our model, is Aptamer A: it demonstrates the strongest docking (HADDOCK score: -92.54) with a moderate interface area (BSA: 832 Ų). Furthermore, Aptamer A best satisfies the AIR restraints (reflected in its low restraint penalty, Table 4), making it the strongest candidate for in vitro testing.

Aptamer C showed slightly weaker docking than Aptamer A (HADDOCK score: -88.29) and a smaller interface (BSA: 664 Ų). Although Aptamer B had the largest interface (BSA: 1167 Ų), it incurred a large restraint penalty (Table 5), yielding a poor, positive HADDOCK score (+12.03) and the lowest API score.

Discussion

Overall, our computational modeling creates an essential bridge between theoretical design and experimental validation. Our pipelines enable the prioritization of effective therapeutic candidates in silico, saving valuable time and wet-lab resources. The three main components of our design, the ASO scoring pipelines, the machine learning regression model, and the aptamer docking simulations, together point to possible therapeutic mechanisms against ALS and equip future researchers to investigate solutions to other neurodegenerative diseases more effectively.

Key Results

The ASO design model successfully filtered hundreds of candidate oligonucleotides into a short list of top-performing gapmers. Incorporating parameters such as GC content, conservation, and binding accessibility ensured that our designs were both biophysically stable and specific to the intended target RNAs. Similarly, the aptamer design pipeline leveraged both deep-learning (AptaTrans) and physics-based docking (HADDOCK) to identify RNA aptamers with strong binding to the disordered CTD of TDP-43. Through our novel scoring method, Aptamer A was selected as the best candidate, balancing docking stability and structural plausibility.

Our mechanistic regression model extended these findings by quantitatively linking ASO biophysical features to their experimental knockdown efficacy. Using features such as binding free energy, target accessibility, and GC fraction, the model captured consistent trends: ASOs with stronger binding affinity and higher accessibility correlated with more effective knockdown. Despite limited training data, cross-validation indicated moderate predictive power, with biologically interpretable coefficients aligning with expected molecular mechanisms.

Limitations

While our models significantly streamlined therapeutic design, they also come with limitations. For ASO design, penalties against homopolymers and high-GC content reduce off-target risks, but may also eliminate sequences that could perform well in cellular contexts. Our machine learning regression was trained on a relatively small dataset, which constrained model generalizability and introduced instability in predictive performance. For aptamer docking, rigid-body assumptions and reliance on disorder predictions mean that in vivo structural dynamics may differ from our in silico results. More extensive experimental datasets and iterative model refinement will be critical to increase robustness.

Therapeutic Applicability

Despite these constraints, our modeling framework has strong translational relevance. The ASO pipeline can be easily adapted to target other disease genes, offering a reusable framework for antisense therapy design. The regression model provides a mechanistic foundation for predicting knockdown efficiency, a crucial step toward building rational design pipelines for personalized ASO therapies. Finally, the aptamer design and docking models illustrate how deep-learning–guided aptamer generation can be coupled with physics-based docking to yield physiologically relevant binders. Together, these approaches reduce trial-and-error in therapeutic development and accelerate the path from computational prediction to clinical application.

Conclusion and References

Through our aptamer design pipeline, molecular docking, and novel computational model for characterizing top-scoring aptamers, we have identified the most biophysically plausible aptamers for limiting TDP-43 aggregation in patients with ALS. Like our ASO model, this pipeline is cost- and time-effective. Furthermore, it is easily reproducible, making it a valuable contribution to the iGEM community.

References:

  1. Suk, T. R., & Rousseaux, M. W. C. (2020). The role of TDP-43 mislocalization in amyotrophic lateral sclerosis. Molecular Neurodegeneration. https://doi.org/10.1186/s13024-020-00397-1
  2. Mészáros, B., et al. (2018). IUPred2A: Context-dependent prediction of protein disorder as a function of redox state and protein binding. Nucleic Acids Research, 46(W1), W329–W337. https://doi.org/10.1093/nar/gky384
  3. Shin, I., et al. (2023). AptaTrans: A deep neural network for predicting aptamer-protein interaction using pretrained encoders. BMC Bioinformatics, 24, 447. https://doi.org/10.1186/s12859-023-05577-6
  4. Zhang, X., et al. (2024). Landscape of intrinsically disordered proteins in mental disorder diseases. Computational and Structural Biotechnology Journal, 23, 3839–3849. https://doi.org/10.1016/j.csbj.2024.10.043
  5. Pérez, G., et al. (2025). The UCSC Genome Browser database: 2025 update. Nucleic Acids Research. https://doi.org/10.1093/nar/gkae974
    Session link for reproducibility: UCSC Genome Browser session (chr19:1,435,544–1,435,764; hg38)