Overview
Our team’s software work centers on a computational model that identifies the most promising antisense oligonucleotides (ASOs) and RNA aptamers to test in vitro against the proteins we aim to knock down.
ASO Design and Scoring
We adapted a tiling/biophysical/specificity pipeline that scores GC content (optimal range 0.4–0.6), homopolymers (penalizing long G/C runs), conservation (phastCons ≥ 0.9), target accessibility, and genome-mapping specificity (≤ 1 mismatch; mapping penalty scaled by the number of multi-hits). Chemical stabilization uses a 5–10–5 gapmer with 2′-MOE flanks and iMe-dC in the DNA gap for binding strength and nuclease resistance.
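A minimal sketch of how these criteria can be combined into one per-candidate score is shown below; the weights, thresholds, and function signature are illustrative placeholders, not the exact values used in our pipeline.

# Illustrative ASO scoring sketch; weights and thresholds are placeholders.
import re

def score_aso(seq: str, phastcons: float, accessibility: float, genome_hits: int) -> float:
    """Heuristic score combining GC content, homopolymer runs, conservation,
    target accessibility, and genome-mapping specificity (higher is better)."""
    gc = (seq.count("G") + seq.count("C")) / len(seq)
    gc_term = 1.0 if 0.4 <= gc <= 0.6 else 0.5            # favor the 0.4-0.6 GC window
    homopolymer_pen = 0.5 if re.search(r"G{4,}|C{4,}", seq) else 0.0  # penalize long G/C runs
    conservation_term = 1.0 if phastcons >= 0.9 else 0.0  # phastCons >= 0.9 requirement
    mapping_pen = 0.3 * max(genome_hits - 1, 0)           # penalty scaled by multi-hits
    return gc_term + conservation_term + accessibility - homopolymer_pen - mapping_pen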
Aptamer Design and Docking
- Target: longest disordered CTD segment from IUPred2A (≥ 0.5 threshold).
- Generation: AptaTrans + Monte Carlo Tree Search; best performance at 10 iterations and seed 1004.
- Docking: RNAComposer structures; HADDOCK with AIR restraints focused on disordered CTD residues; scored by HADDOCK (more negative is better), VdW, electrostatics, buried surface area (BSA), and restraints energy.
- Aptamer Priority Index (API): weighted aggregate of normalized AptaTrans (50%), best (negated) HADDOCK (40%), and mean BSA over top-K (10%); Aptamer A ranks first (API 0.925; HADDOCK −92.54; BSA ≈ 832 Ų).
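The API weighting can be made concrete with the short sketch below. The min–max normalization is an assumption about how the scores are put on a common scale; the 50/40/10 weights and the negation of HADDOCK scores follow the description above.

import numpy as np

def aptamer_priority_index(aptatrans, haddock_best, bsa_topk_mean):
    """API = 0.5*norm(AptaTrans) + 0.4*norm(-best HADDOCK) + 0.1*norm(mean BSA over top-K).
    Inputs are arrays over candidate aptamers; HADDOCK is negated so that
    more-negative (better) docking scores map to larger normalized values."""
    def minmax(x):
        x = np.asarray(x, dtype=float)
        return (x - x.min()) / (x.max() - x.min() + 1e-12)
    return (0.5 * minmax(aptatrans)
            + 0.4 * minmax(-np.asarray(haddock_best, dtype=float))
            + 0.1 * minmax(bsa_topk_mean))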
Machine Learning Model
We built a supervised regression system that predicts % knockdown (KD) for ASO designs. The final model is a gradient-boosted decision-tree regressor (GBR, scikit-learn) trained in a transparent, reproducible pipeline. It combines sequence features, thermodynamics (ViennaRNA + NUPACK), and accessibility metrics, and evaluates generalization with per-family holdouts. To cope with small-N tabular data, we add family-aware calibration and light domain adaptation.
Tools Used:
Languages & Libraries
- Python 3.x
- scikit-learn (GradientBoostingRegressor, Pipelines, CV)
- numpy / pandas (arrays & tabular ops)
- joblib (artifact caching/serialization, simple parallelism)
- matplotlib (diagnostic plots)
External binaries (wrapped via Python)
- ViennaRNA (e.g., RNAplfold, RNAduplex)
- NUPACK (ensemble energies, partition functions, etc.)
Repository Layout
Repository Link: https://gitlab.igem.org/2025/software-tools/dnhs-sandiego-ca/model
The final notebook/script imports functions from src/ and reads inputs from data/. Artifacts and reports are written to outputs/.
src/
- __init__.py
- cleaning.py # schema checks, de-dup, missing-data policy
- config.py # seeds, paths, model/search parameters
- design_matrix.py # merge to model-ready X/y (+ groups)
- features_basic.py # GC, k-mers, region stats, transforms
- thermo_plfold.py # RNAplfold wrapper (accessibility: ED)
- thermo_vrna.py # ViennaRNA duplex ΔG and Tm estimates
- models.py # sklearn estimators & grids
- train_eval.py # training loops, CV, metrics, artifacts
- utils.py # hashing, caching, logging helpers
data/
- *.fasta # sequences
- literature_*.csv # curated tabular features/labels
- wetlab_*.csv # orthogonal labels/benchmarks
- wiki_widget_data.json # export for the interactive viewer
outputs/ # json summaries and run logs (auto-generated)
- plots/ # generated figures
01_base.ipynb # Phase 6 Jupyter notebook to run biophysical/thermodynamic sweeps
02_advanced.ipynb # Phase 9 Jupyter notebook to run multi-family evaluations
Data Pipeline
- Ingestion & Cleaning
- Loads FASTA + CSVs from data/
- Enforces schemas (required columns, dtypes), de-dups on stable IDs, normalizes family labels.
- Records file content hashes for provenance.
- Thermodynamics & accessibility
- ViennaRNA wrappers (thermo_vrna.py): ΔG (duplex), Tm estimates, structure-aware descriptors
- NUPACK wrappers: ensemble/partition metrics used as complementary ΔG-like features
- Accessibility via RNAplfold (thermo_plfold.py): per-position ED and windowed summaries
- All calls are cached under data/cache/ keyed by (sequence, parameters).
- Feature assembly (features_basic.py, design_matrix.py)
- Sequence: GC, length, regional GC, selected k-mers.
- Thermodynamics: ΔG, ED, Tm, derived ΔG_eff = ΔG_duplex − ED.
- Biochemistry: gapmer architecture/chemistry flags
- Produces X (features), y (%KD), and groups = family
- Split policy
- Per-family holdouts (grouped CV and final test), matching the intended deployment: new families at inference.
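The sketch below illustrates this split policy with toy data (the arrays are placeholders); scikit-learn's GroupKFold and LeaveOneGroupOut keep families intact exactly as described.

import numpy as np
from sklearn.model_selection import GroupKFold, LeaveOneGroupOut

# Toy stand-ins for the real design matrix, labels, and family labels.
rng = np.random.default_rng(0)
X = rng.normal(size=(40, 6))                                     # feature matrix
y = rng.uniform(0, 100, size=40)                                 # % knockdown labels
groups = np.repeat(["famA", "famB", "famC", "famD", "famE"], 8)  # one family per row

# Grouped CV: a family is never split across train and validation,
# mimicking the deployment scenario of scoring a brand-new family.
gkf = GroupKFold(n_splits=5)
for train_idx, val_idx in gkf.split(X, y, groups=groups):
    assert set(groups[train_idx]).isdisjoint(groups[val_idx])

# Final per-family holdout: each family serves once as the test set.
logo = LeaveOneGroupOut()
for train_idx, test_idx in logo.split(X, y, groups=groups):
    pass  # fit on train_idx, report metrics on the held-out family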
Model Components
Estimator
- GradientBoostingRegressor (GBR), loss = squared error (L2)
- Additive trees: shallow regression trees appended sequentially to reduce residuals
Capacity / regularization
- max_depth ≈ 4, learning_rate ≈ 0.01, n_estimators ≈ 1.2–1.8k
- Conservative bias/variance trade-off for small-N, mixed features
Why GBR?
- Captures non-linear interactions among ΔG, ED, k-mers without deep nets
- Stable on tiny datasets; easier to sanity-check than large boosters
- Baselines (reference only):
- HGBR — fast but underfit/unstable at this scale
- Ridge — diagnostic for linear signal; underfits non-linear biology
Preprocessing
- Median imputation + standardization inside an sklearn.Pipeline
- Fit only on train folds to avoid leakage
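A minimal sketch of this estimator and preprocessing stack, assuming point values from the approximate ranges above (the exact grid lives in config.py):

from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import GradientBoostingRegressor

# Imputation and scaling live inside the Pipeline so they are fit on the
# training folds only, which is what prevents leakage during CV.
model = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
    ("gbr", GradientBoostingRegressor(
        loss="squared_error",   # L2 loss
        max_depth=4,            # shallow trees
        learning_rate=0.01,     # conservative shrinkage
        n_estimators=1500,      # within the ~1.2k-1.8k range
        random_state=42,
    )),
])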
Evaluation Methodology
- Primary metric: MAE (headline)
- Secondary: RMSE, R² (for sensitivity checks)
- Cross-validation: Group-aware CV (families kept intact); final report uses best-case holdouts per family
- Model selection: Hyperparameter sweep → pick min-MAE config; refit and evaluate on the family holdout
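The selection loop can be sketched as follows, using GridSearchCV with negated MAE over group-aware folds; the grid values are illustrative, and in practice the estimator is the full preprocessing + GBR pipeline shown earlier.

from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV, GroupKFold

# A bare GBR keeps the example short; the real search wraps the Pipeline.
param_grid = {
    "max_depth": [3, 4],
    "learning_rate": [0.01, 0.05],
    "n_estimators": [1200, 1500, 1800],
}
search = GridSearchCV(
    GradientBoostingRegressor(random_state=42),
    param_grid,
    scoring="neg_mean_absolute_error",   # model selection minimizes MAE
    cv=GroupKFold(n_splits=5),           # families never straddle folds
)
# search.fit(X, y, groups=groups); the min-MAE config is then refit and
# evaluated once on the held-out family.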
Reproducibility and Engineering
- Config-first: all paths, seeds, grids, and external-tool flags live in config.py
- Determinism: fixed random_state; caches keyed by content hashes
- Separation of concerns: notebooks/scripts orchestrate; Python modules implement logic
- Data contracts: assertive schema checks at module boundaries; leakage guards in pipelines
- Artifact lineage: each model/metric JSON includes config + input hashes
- Complexity: thermo calls dominate runtime → cached and parallelizable
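As an illustration of content-hash keyed caching, a minimal sketch follows; the helper names and file layout are illustrative stand-ins for the real helpers in utils.py, while the data/cache/ location matches the layout above.

import hashlib
import json
from pathlib import Path

def cache_key(sequence: str, params: dict) -> str:
    """Deterministic key for a (sequence, parameters) pair, so thermo results
    can be stored once and reused on later runs."""
    payload = json.dumps({"seq": sequence, "params": params}, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()

def cached_path(sequence: str, params: dict, cache_dir: str = "data/cache") -> Path:
    """Location of the cached result for this (sequence, parameters) pair."""
    return Path(cache_dir) / f"{cache_key(sequence, params)}.json"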
Conceptual Model Overview
# =========================
# Orchestrator (notebook or script)
# End-to-end pipeline: load → clean → featurize → train → calibrate → evaluate → save
# =========================
# Load run-time configuration (paths, hyperparams, feature flags, etc.)
cfg = load_config()
# 1) Data ingest & validation
# - Read raw tables/files from disk/cloud
# - Enforce schema (types, required cols), drop/flag bad rows
raw = load_raw_data(cfg.data_paths)
tbl = clean_and_validate(raw, cfg.schema)
# 2) Feature engineering
# Thermodynamics-based features using external RNA tools
# - 'thermo' might include MFE, ΔG, base-pairing probabilities, etc. from ViennaRNA/NUPACK
thermo = build_thermo_features(tbl.sequences, cfg.thermo) # ViennaRNA + NUPACK
# Accessibility features
# - Compute local unpaired probabilities / ensemble defect using RNAplfold (ED)
# - Captures how accessible regions are for interactions
access = build_accessibility(tbl.sequences, cfg.plfold_params) # RNAplfold (ED)
# Basic tabular features
# - Sequence length, GC%, k-mer counts, categorical flags from metadata, etc.
X_basic = build_basic_features(tbl, cfg.featurization)
# Assemble final design matrix
# - Concatenate/align basic + thermo + accessibility features
# - Apply optional transforms/selection per cfg.feature_opts (scaling, PCA, etc.)
X = assemble_design_matrix(X_basic, thermo, access, cfg.feature_opts)
# 3) Labels & grouping
# - y: target variable (e.g., experimental activity on a 0–100 scale)
# - groups: family identifiers for grouped CV (prevents leakage across related sequences)
y, groups = get_labels_and_families(tbl)
# 4) Cross-validation & model setup
# - Grouped CV ensures splits respect family boundaries
cv = make_grouped_cv(groups, n_splits=cfg.cv_splits)
# - Gradient Boosting Regressor and hyperparameter grid
gbr, grid = make_gbr_and_grid(cfg.gbr_grid)
# 5) Hyperparameter search + fit
# - Grid search over 'grid' using MAE on grouped CV
# - Returns the best-fitted model (and optionally the CV results)
best = sweep_and_fit(gbr, grid, X, y, groups, cv, metric="MAE")
# 6) Similarity-weighted predictions
# - Blend model predictions with neighbor/family similarity weights to reduce variance
y_hat = predict_with_similarity_weights(best, X, groups)
# 7) Family-wise calibration & k-shot adjustment
# - Isotonic calibration per family to correct systematic bias; clip to valid range
y_cal = family_isotonic_calibration(y_hat, y, groups, clip=(0,100))
# - Optional few-shot offsets by family (e.g., shift means using small held-out refs)
y_adj = apply_kshot_offsets(y_cal, groups, cfg.kshot)
# 8) Evaluation
# - Compute metrics overall and by group/family for diagnostic visibility
report = evaluate_predictions(y_adj, y, groups, metrics=["MAE","RMSE","R2"])
# 9) Persistence & exports
# - Save model (joblib), metrics (JSON), and any plots/artifacts to outputs dir
save_artifacts(best, report, outputs_dir=cfg.outputs) # joblib + JSON + plots
# - Export per-family summaries for a downstream interactive widget/visualization
export_family_widget_data(report, X, y, groups, out_dir="data/family_widget_data")
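Step 7 above (family-wise calibration) can be made concrete with scikit-learn's IsotonicRegression. The sketch below is one way such a helper could look, not the exact implementation: it fits one monotonic map per family on reference (prediction, label) pairs and clips to the valid %KD range.

import numpy as np
from sklearn.isotonic import IsotonicRegression

def family_isotonic_calibration(y_hat, y_ref, groups, clip=(0, 100)):
    """Fit one isotonic map per family on (prediction, label) reference pairs,
    apply it to that family's predictions, and clip to the valid %KD range."""
    y_hat = np.asarray(y_hat, dtype=float)
    y_ref = np.asarray(y_ref, dtype=float)
    groups = np.asarray(groups)
    y_cal = y_hat.copy()
    for fam in np.unique(groups):
        mask = groups == fam
        if mask.sum() >= 3:  # need a few reference points per family
            iso = IsotonicRegression(out_of_bounds="clip")
            iso.fit(y_hat[mask], y_ref[mask])
            y_cal[mask] = iso.predict(y_hat[mask])
    return np.clip(y_cal, *clip)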
Data Management
- Inputs: data/*.fasta, data/literature_master_*.csv, data/wetlab_*.csv
- Derived: data/cache/ (thermo/accessibility); data/plots/
- Model outputs: outputs/models/, outputs/metrics/, outputs/figures/, outputs/logs/
- Widget export: data/family_widget_data/
- Provenance: filenames + content hashes recorded alongside every artifact
Quickstart Guide
- Install packages
pip install scikit-learn pandas numpy joblib matplotlib
- Install the external CLI tools: ViennaRNA (incl. RNAplfold) and NUPACK.
- Ensure binaries like `RNAplfold`, `RNAduplex`, and `nupack` are on PATH.
- Prepare data
- Place FASTA and CSVs in ./data/ (as in the structure above).
- Configure
- Edit src/config.py for paths, seeds, model grid, and thermo parameters.
- Run
- Execute the final notebook or script that calls into src/ modules.
- Artifacts will appear under ./outputs/, and the interactive widget data under ./data/.
Acknowledgements
- scikit-learn, numpy/pandas, joblib, matplotlib
- ViennaRNA (incl. RNAplfold) and NUPACK for thermodynamics & accessibility calculations
- Jupyter for experiment orchestration