Overview
Our team’s software work centers on a computational model that identifies the most promising antisense oligonucleotides (ASOs) and RNA aptamers to test in vitro against the proteins we aim to knock down.
ASO Design and Scoring
We adapted a tiling/biophysical/specificity pipeline that scores GC content (optimal range 0.4–0.6), homopolymers (penalizing long G/C runs), conservation (phastCons ≥ 0.9), target accessibility, and genome-mapping specificity (≤ 1 mismatch; mapping penalty scaled by the number of multi-hits). Chemical stabilization uses a 5–10–5 gapmer with 2′-MOE flanks and iMe-dC in the DNA gap for binding strength and nuclease resistance.
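A minimal sketch of how these criteria can be combined into one per-candidate score is shown below; the weights, thresholds, and function signature are illustrative placeholders, not the exact values used in our pipeline.

# Illustrative ASO scoring sketch; weights and thresholds are placeholders.
import re

def score_aso(seq: str, phastcons: float, accessibility: float, genome_hits: int) -> float:
    """Heuristic score combining GC content, homopolymer runs, conservation,
    target accessibility, and genome-mapping specificity (higher is better)."""
    gc = (seq.count("G") + seq.count("C")) / len(seq)
    gc_term = 1.0 if 0.4 <= gc <= 0.6 else 0.5            # favor the 0.4-0.6 GC window
    homopolymer_pen = 0.5 if re.search(r"G{4,}|C{4,}", seq) else 0.0  # penalize long G/C runs
    conservation_term = 1.0 if phastcons >= 0.9 else 0.0  # phastCons >= 0.9 requirement
    mapping_pen = 0.3 * max(genome_hits - 1, 0)           # penalty scaled by multi-hits
    return gc_term + conservation_term + accessibility - homopolymer_pen - mapping_pen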
Aptamer Design and Docking
- Target: longest disordered CTD segment from IUPred2A (≥ 0.5 threshold).
- Generation: AptaTrans + Monte Carlo Tree Search; best performance at 10 iterations and seed 1004.
- Docking: RNAComposer structures; HADDOCK with AIR restraints focused on disordered CTD residues; scored by HADDOCK (more negative is better), VdW, electrostatics, buried surface area (BSA), and restraints energy.
- Aptamer Priority Index (API): weighted aggregate of normalized AptaTrans (50%), best (negated) HADDOCK (40%), and mean BSA over top-K (10%); Aptamer A ranks first (API 0.925; HADDOCK −92.54; BSA ≈ 832 Ų).
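The API weighting can be made concrete with the short sketch below. The min–max normalization is an assumption about how the scores are put on a common scale; the 50/40/10 weights and the negation of HADDOCK scores follow the description above.

import numpy as np

def aptamer_priority_index(aptatrans, haddock_best, bsa_topk_mean):
    """API = 0.5*norm(AptaTrans) + 0.4*norm(-best HADDOCK) + 0.1*norm(mean BSA over top-K).
    Inputs are arrays over candidate aptamers; HADDOCK is negated so that
    more-negative (better) docking scores map to larger normalized values."""
    def minmax(x):
        x = np.asarray(x, dtype=float)
        return (x - x.min()) / (x.max() - x.min() + 1e-12)
    return (0.5 * minmax(aptatrans)
            + 0.4 * minmax(-np.asarray(haddock_best, dtype=float))
            + 0.1 * minmax(bsa_topk_mean))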
Machine Learning Model
We built a supervised regression system that predicts % knockdown (KD) for ASO designs. The final model is a gradient-boosted decision-tree regressor (GBR, scikit-learn) trained in a transparent, reproducible pipeline. It combines sequence features, thermodynamics (ViennaRNA + NUPACK), and accessibility metrics, and evaluates generalization with per-family holdouts. To cope with small-N tabular data, we add family-aware calibration and light domain adaptation.
Tools Used:
Languages & Libraries
- Python 3.x
- scikit-learn (GradientBoostingRegressor, Pipelines, CV)
- numpy / pandas (arrays & tabular ops)
- joblib (artifact caching/serialization, simple parallelism)
- matplotlib (diagnostic plots)
External binaries (wrapped via Python)
- ViennaRNA (e.g., RNAplfold, RNAduplex)
- NUPACK (ensemble energies, partition functions, etc.)
Repository Layout
Repository Link: https://gitlab.igem.org/2025/software-tools/dnhs-sandiego-ca/model
The final notebook/script imports functions from src/ and reads inputs from data/. Artifacts and reports are written to outputs/.
src/
- __init__.py
- cleaning.py # schema checks, de-dup, missing-data policy
- config.py # seeds, paths, model/search parameters
- design_matrix.py # merge to model-ready X/y (+ groups)
- features_basic.py # GC, k-mers, region stats, transforms
- thermo_plfold.py # RNAplfold wrapper (accessibility: ED)
- thermo_vrna.py # ViennaRNA duplex ΔG and Tm estimates
- models.py # sklearn estimators & grids
- train_eval.py # training loops, CV, metrics, artifacts
- utils.py # hashing, caching, logging helpers
data/
- *.fasta # sequences
- literature_*.csv # curated tabular features/labels
- wetlab_*.csv # orthogonal labels/benchmarks
- wiki_widget_data.json # export for the interactive viewer
outputs/ # json summaries and run logs (auto-generated)
- plots/ # generated figures
01_base.ipynb # Phase 6 Jupyter notebook to run biophysical/thermodynamic sweeps
02_advanced.ipynb # Phase 9 Jupyter notebook to run multi-family evaluations
Data Pipeline
- Ingestion & Cleaning
- Loads FASTA + CSVs from data/
- Enforces schemas (required columns, dtypes), de-dups on stable IDs, normalizes family labels.
- Records file content hashes for provenance.
- Thermodynamics & accessibility
- ViennaRNA wrappers (thermo_vrna.py): ΔG (duplex), Tm estimates, structure-aware descriptors
- NUPACK wrappers: ensemble/partition metrics used as complementary ΔG-like features
- Accessibility via RNAplfold (thermo_plfold.py): per-position ED and windowed summaries
- All calls are cached under data/cache/ keyed by (sequence, parameters).
- Feature assembly (features_basic.py, design_matrix.py)
- Sequence: GC, length, regional GC, selected k-mers.
- Thermodynamics: ΔG, ED, Tm, derived ΔG_eff = ΔG_duplex − ED.
- Biochemistry: gapmer architecture/chemistry flags
- Produces X (features), y (%KD), and groups = family
- Split policy
- Per-family holdouts (grouped CV and final test), matching the intended deployment: new families at inference.
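The sketch below illustrates this split policy with toy data (the arrays are placeholders); scikit-learn's GroupKFold and LeaveOneGroupOut keep families intact exactly as described.

import numpy as np
from sklearn.model_selection import GroupKFold, LeaveOneGroupOut

# Toy stand-ins for the real design matrix, labels, and family labels.
rng = np.random.default_rng(0)
X = rng.normal(size=(40, 6))                                     # feature matrix
y = rng.uniform(0, 100, size=40)                                 # % knockdown labels
groups = np.repeat(["famA", "famB", "famC", "famD", "famE"], 8)  # one family per row

# Grouped CV: a family is never split across train and validation,
# mimicking the deployment scenario of scoring a brand-new family.
gkf = GroupKFold(n_splits=5)
for train_idx, val_idx in gkf.split(X, y, groups=groups):
    assert set(groups[train_idx]).isdisjoint(groups[val_idx])

# Final per-family holdout: each family serves once as the test set.
logo = LeaveOneGroupOut()
for train_idx, test_idx in logo.split(X, y, groups=groups):
    pass  # fit on train_idx, report metrics on the held-out family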
Model Components
Estimator
- GradientBoostingRegressor (GBR), loss = squared error (L2)
- Additive trees: shallow regression trees appended sequentially to reduce residuals
Capacity / regularization
- max_depth ≈ 4, learning_rate ≈ 0.01, n_estimators ≈ 1.2–1.8k
- Conservative bias/variance trade-off for small-N, mixed features
Why GBR?
- Captures non-linear interactions among ΔG, ED, k-mers without deep nets
- Stable on tiny datasets; easier to sanity-check than large boosters
- Baselines (reference only):
- HGBR — fast but underfit/unstable at this scale
- Ridge — diagnostic for linear signal; underfits non-linear biology
Preprocessing
- Median imputation + standardization inside an sklearn.Pipeline
- Fit only on train folds to avoid leakage
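A minimal sketch of this estimator and preprocessing stack, assuming point values from the approximate ranges above (the exact grid lives in config.py):

from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import GradientBoostingRegressor

# Imputation and scaling live inside the Pipeline so they are fit on the
# training folds only, which is what prevents leakage during CV.
model = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
    ("gbr", GradientBoostingRegressor(
        loss="squared_error",   # L2 loss
        max_depth=4,            # shallow trees
        learning_rate=0.01,     # conservative shrinkage
        n_estimators=1500,      # within the ~1.2k-1.8k range
        random_state=42,
    )),
])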
Evaluation Methodology
- Primary metric: MAE (headline)
- Secondary: RMSE, R² (for sensitivity checks)
- Cross-validation: Group-aware CV (families kept intact); final report uses best-case holdouts per family
- Model selection: Hyperparameter sweep → pick min-MAE config; refit and evaluate on the family holdout
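The selection loop can be sketched as follows, using GridSearchCV with negated MAE over group-aware folds; the grid values are illustrative, and in practice the estimator is the full preprocessing + GBR pipeline shown earlier.

from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV, GroupKFold

# A bare GBR keeps the example short; the real search wraps the Pipeline.
param_grid = {
    "max_depth": [3, 4],
    "learning_rate": [0.01, 0.05],
    "n_estimators": [1200, 1500, 1800],
}
search = GridSearchCV(
    GradientBoostingRegressor(random_state=42),
    param_grid,
    scoring="neg_mean_absolute_error",   # model selection minimizes MAE
    cv=GroupKFold(n_splits=5),           # families never straddle folds
)
# search.fit(X, y, groups=groups); the min-MAE config is then refit and
# evaluated once on the held-out family.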
Reproducibility and Engineering
- Config-first: all paths, seeds, grids, and external-tool flags live in config.py
- Determinism: fixed random_state; caches keyed by content hashes
- Separation of concerns: notebooks/scripts orchestrate; Python modules implement logic
- Data contracts: assertive schema checks at module boundaries; leakage guards in pipelines
- Artifact lineage: each model/metric JSON includes config + input hashes
- Complexity: thermo calls dominate runtime → cached and parallelizable
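As an illustration of content-hash keyed caching, a minimal sketch follows; the helper names and file layout are illustrative stand-ins for the real helpers in utils.py, while the data/cache/ location matches the layout above.

import hashlib
import json
from pathlib import Path

def cache_key(sequence: str, params: dict) -> str:
    """Deterministic key for a (sequence, parameters) pair, so thermo results
    can be stored once and reused on later runs."""
    payload = json.dumps({"seq": sequence, "params": params}, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()

def cached_path(sequence: str, params: dict, cache_dir: str = "data/cache") -> Path:
    """Location of the cached result for this (sequence, parameters) pair."""
    return Path(cache_dir) / f"{cache_key(sequence, params)}.json"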
Conceptual Model Overview
# =========================
# Orchestrator (notebook or script)
# End-to-end pipeline: load → clean → featurize → train → calibrate → evaluate → save
# =========================
# Load run-time configuration (paths, hyperparams, feature flags, etc.)
cfg = load_config()
# 1) Data ingest & validation
# - Read raw tables/files from disk/cloud
# - Enforce schema (types, required cols), drop/flag bad rows
raw = load_raw_data(cfg.data_paths)
tbl = clean_and_validate(raw, cfg.schema)
# 2) Feature engineering
# Thermodynamics-based features using external RNA tools
# - 'thermo' might include MFE, ΔG, base-pairing probabilities, etc. from ViennaRNA/NUPACK
thermo = build_thermo_features(tbl.sequences, cfg.thermo) # ViennaRNA + NUPACK
# Accessibility features
# - Compute local unpaired probabilities / ensemble defect using RNAplfold (ED)
# - Captures how accessible regions are for interactions
access = build_accessibility(tbl.sequences, cfg.plfold_params) # RNAplfold (ED)
# Basic tabular features
# - Sequence length, GC%, k-mer counts, categorical flags from metadata, etc.
X_basic = build_basic_features(tbl, cfg.featurization)
# Assemble final design matrix
# - Concatenate/align basic + thermo + accessibility features
# - Apply optional transforms/selection per cfg.feature_opts (scaling, PCA, etc.)
X = assemble_design_matrix(X_basic, thermo, access, cfg.feature_opts)
# 3) Labels & grouping
# - y: target variable (e.g., experimental activity on a 0–100 scale)
# - groups: family identifiers for grouped CV (prevents leakage across related sequences)
y, groups = get_labels_and_families(tbl)
# 4) Cross-validation & model setup
# - Grouped CV ensures splits respect family boundaries
cv = make_grouped_cv(groups, n_splits=cfg.cv_splits)
# - Gradient Boosting Regressor and hyperparameter grid
gbr, grid = make_gbr_and_grid(cfg.gbr_grid)
# 5) Hyperparameter search + fit
# - Grid search over 'grid' using MAE on grouped CV
# - Returns the best-fitted model (and optionally the CV results)
best = sweep_and_fit(gbr, grid, X, y, groups, cv, metric="MAE")
# 6) Similarity-weighted predictions
# - Blend model predictions with neighbor/family similarity weights to reduce variance
y_hat = predict_with_similarity_weights(best, X, groups)
# 7) Family-wise calibration & k-shot adjustment
# - Isotonic calibration per family to correct systematic bias; clip to valid range
y_cal = family_isotonic_calibration(y_hat, y, groups, clip=(0,100))
# - Optional few-shot offsets by family (e.g., shift means using small held-out refs)
y_adj = apply_kshot_offsets(y_cal, groups, cfg.kshot)
# 8) Evaluation
# - Compute metrics overall and by group/family for diagnostic visibility
report = evaluate_predictions(y_adj, y, groups, metrics=["MAE","RMSE","R2"])
# 9) Persistence & exports
# - Save model (joblib), metrics (JSON), and any plots/artifacts to outputs dir
save_artifacts(best, report, outputs_dir=cfg.outputs) # joblib + JSON + plots
# - Export per-family summaries for a downstream interactive widget/visualization
export_family_widget_data(report, X, y, groups, out_dir="data/family_widget_data")
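Step 7 above (family-wise calibration) can be made concrete with scikit-learn's IsotonicRegression. The sketch below is one way such a helper could look, not the exact implementation: it fits one monotonic map per family on reference (prediction, label) pairs and clips to the valid %KD range.

import numpy as np
from sklearn.isotonic import IsotonicRegression

def family_isotonic_calibration(y_hat, y_ref, groups, clip=(0, 100)):
    """Fit one isotonic map per family on (prediction, label) reference pairs,
    apply it to that family's predictions, and clip to the valid %KD range."""
    y_hat = np.asarray(y_hat, dtype=float)
    y_ref = np.asarray(y_ref, dtype=float)
    groups = np.asarray(groups)
    y_cal = y_hat.copy()
    for fam in np.unique(groups):
        mask = groups == fam
        if mask.sum() >= 3:  # need a few reference points per family
            iso = IsotonicRegression(out_of_bounds="clip")
            iso.fit(y_hat[mask], y_ref[mask])
            y_cal[mask] = iso.predict(y_hat[mask])
    return np.clip(y_cal, *clip)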
Data Management
- Inputs: data/*.fasta, data/literature_master_*.csv, data/wetlab_*.csv
- Derived: data/cache/ (thermo/accessibility); data/plots/
- Model outputs: outputs/models/, outputs/metrics/, outputs/figures/, outputs/logs/
- Widget export: data/family_widget_data/
- Provenance: filenames + content hashes recorded alongside every artifact
Quickstart Guide
- Install packages
pip install scikit-learn pandas numpy joblib matplotlib
- Install the external CLI tools: ViennaRNA (incl. RNAplfold) and NUPACK.
- Ensure binaries like `RNAplfold`, `RNAduplex`, and `nupack` are on PATH.
- Prepare data
- Place FASTA and CSVs in ./data/ (as in the structure above).
- Configure
- Edit src/config.py for paths, seeds, model grid, and thermo parameters.
- Run
- Execute the final notebook or script that calls into src/ modules.
- Artifacts will appear under ./outputs/, and the interactive widget data under ./data/.
Acknowledgements
- scikit-learn, numpy/pandas, joblib, matplotlib
- ViennaRNA (incl. RNAplfold) and NUPACK for thermodynamics & accessibility calculations
- Jupyter for experiment orchestration