
Software

Our software unifies a rigorous ASO/aptamer design-and-evaluation model, a family-aware machine learning predictor, and a public-facing ALS awareness app—presented with the same layout language as our Measurement page.

Model

Our computational pipeline for improved wet-lab research.

Interactive App

Our innovation for spreading awareness about ALS.

Overview

Our team’s software journey began with our model, which uses computation to identify the most efficacious antisense oligonucleotides (ASOs) and RNA aptamers to test in vitro against the proteins we aimed to knock down.

ASO Design and Scoring

We adapted a tiling/biophysical/specificity pipeline that scores GC content (optimal range 0.4–0.6), homopolymers (penalizing long G/C runs), conservation (phastCons ≥ 0.9), target accessibility, and genome-mapping specificity (≤ 1 mismatch; mapping penalty scaled by multi-hits). Chemical stabilization uses a 5–10–5 gapmer with 2′-MOE flanks and iMe-dC in the DNA gap for binding strength and nuclease resistance.
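
To make the screening criteria concrete, the sketch below scores a single candidate against the thresholds above. The helper names, weights, and penalty values are illustrative assumptions, not our exact implementation.

def gc_fraction(seq: str) -> float:
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq)

def longest_gc_run(seq: str) -> int:
    best = run = 0
    prev = ""
    for base in seq.upper():
        run = run + 1 if base == prev else 1
        prev = base
        if base in "GC":
            best = max(best, run)
    return best

def score_aso(seq: str, phastcons: float, accessibility: float, genome_hits: int) -> float:
    """Toy composite score (higher is better); weights are placeholders."""
    score = 0.0
    gc = gc_fraction(seq)
    score += 1.0 if 0.4 <= gc <= 0.6 else -abs(gc - 0.5)   # favor 40-60% GC
    score -= 0.5 * max(0, longest_gc_run(seq) - 3)         # penalize long G/C homopolymers
    score += 1.0 if phastcons >= 0.9 else 0.0              # conservation filter
    score += accessibility                                 # target accessibility term
    score -= 0.25 * max(0, genome_hits - 1)                # multi-hit mapping penalty
    return score

print(score_aso("GCTTACGATCGTTAGC", phastcons=0.95, accessibility=0.6, genome_hits=1))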

Aptamer Design and Docking

  • Target: longest disordered CTD segment from IUPred2A (≥ 0.5 threshold).
  • Generation: AptaTrans + Monte Carlo Tree Search; best performance at 10 iterations and seed 1004.
  • Docking: RNAComposer structures; HADDOCK with AIR restraints focused on disordered CTD residues; scored by HADDOCK (more negative is better), VdW, electrostatics, buried surface area (BSA), and restraints energy.
  • Aptamer Priority Index (API): weighted aggregate of normalized AptaTrans (50%), best (negated) HADDOCK (40%), and mean BSA over top-K (10%); Aptamer A ranks first (API 0.925; HADDOCK −92.54; BSA ≈ 832 Ų).
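
A minimal sketch of the API aggregation above, assuming min-max normalization across the candidate set; the numeric values are partly illustrative (Aptamer A’s HADDOCK and BSA come from our results, the rest are placeholders).

import numpy as np

def minmax(x):
    x = np.asarray(x, dtype=float)
    span = x.max() - x.min()
    return np.ones_like(x) if span == 0 else (x - x.min()) / span

aptatrans = np.array([0.91, 0.85, 0.78])       # AptaTrans scores (higher is better)
haddock   = np.array([-92.54, -80.10, -75.30]) # best HADDOCK scores (more negative is better)
mean_bsa  = np.array([832.0, 610.0, 540.0])    # mean BSA over top-K poses (Å²)

# API = 0.5 * norm(AptaTrans) + 0.4 * norm(-HADDOCK) + 0.1 * norm(mean BSA)
api = 0.5 * minmax(aptatrans) + 0.4 * minmax(-haddock) + 0.1 * minmax(mean_bsa)
ranking = np.argsort(-api)   # index 0 (Aptamer A) ranks first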

Machine Learning Model

We built a supervised regression system that predicts % knockdown (KD) for ASO designs. The final model is a gradient-boosted decision-tree regressor (GBR, scikit-learn) trained on a transparent, reproducible pipeline. It combines sequence features, thermodynamics (ViennaRNA + NUPACK), and accessibility metrics, and evaluates generalization with per-family holdouts. To cope with small-N tabular data, we add family-aware calibration and light domain adaptation.

Tools Used:

Languages & Libraries

  • Python 3.x
  • scikit-learn (GradientBoostingRegressor, Pipelines, CV)
  • numpy / pandas (arrays & tabular ops)
  • joblib (artifact caching/serialization, simple parallelism)
  • matplotlib (diagnostic plots)

External binaries (wrapped via Python)

  • ViennaRNA (RNAplfold, RNAduplex, etc.)
  • NUPACK (ensemble energies, partition functions, etc.)
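
As an example of how these binaries are wrapped, the sketch below calls RNAduplex through subprocess and parses the duplex free energy from its default text output; it is a simplified stand-in for the real wrapper in thermo_vrna.py and assumes the binary is on PATH.

import re
import subprocess

def rnaduplex_dg(aso: str, target: str) -> float:
    """Simplified RNAduplex wrapper: returns the duplex free energy (kcal/mol)."""
    out = subprocess.run(
        ["RNAduplex"],
        input=f"{aso}\n{target}\n",       # RNAduplex reads the two sequences from stdin
        capture_output=True, text=True, check=True,
    ).stdout
    match = re.search(r"\(\s*(-?\d+\.\d+)\s*\)\s*$", out.strip())
    if match is None:
        raise ValueError(f"Could not parse RNAduplex output: {out!r}")
    return float(match.group(1))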

Repository Layout

Repository Link: https://gitlab.igem.org/2025/software-tools/dnhs-sandiego-ca/model

The final notebook/script imports functions from src/ and reads inputs from data/. Artifacts and reports are written to outputs/.

src/
- __init__.py
- cleaning.py # schema checks, de-dup, missing-data policy
- config.py # seeds, paths, model/search parameters
- design_matrix.py # merge to model-ready X/y (+ groups)
- features_basic.py # GC, k-mers, region stats, transforms
- thermo_plfold.py # RNAplfold wrapper (accessibility: ED)
- thermo_vrna.py # ViennaRNA duplex ΔG, Tm estimates
- models.py # sklearn estimators & grids
- train_eval.py # training loops, CV, metrics, artifacts
- utils.py # hashing, caching, logging helpers

data/
- *.fasta # sequences
- literature_*.csv # curated tabular features/labels
- wetlab_*.csv # orthogonal labels/benchmarks
- wiki_widget_data.json # export for the interactive viewer

outputs/ # json summaries and run logs (auto-generated)
- plots/ # generated figures

01_base.ipynb # Phase 6 Jupyter notebook to run biophysical/thermodynamic sweeps
02_advanced.ipynb # Phase 9 Jupyter notebook to run multi-family evaluations

Data Pipeline

  1. Ingestion & Cleaning
    1. Loads FASTA + CSVs from data/
    2. Enforces schemas (required columns, dtypes), de-dups on stable IDs, normalizes family labels.
    3. Records file content hashes for provenance.
  2. Thermodynamics & accessibility
    1. ViennaRNA wrappers (thermo_vrna.py): ΔG (duplex), Tm estimates, structure-aware descriptors
    2. NUPACK wrappers: ensemble/partition metrics used as complementary ΔG-like features
    3. Accessibility via RNAplfold (thermo_plfold.py): per-position ED and windowed summaries
    4. All calls are cached under data/cache/ keyed by (sequence, parameters).
  3. Feature assembly (features_basic.py, design_matrix.py)
    1. Sequence: GC, length, regional GC, selected k-mers.
    2. Thermodynamics: ΔG, ED, Tm, derived ΔG_eff = ΔG_duplex − ED (see the sketch after this list).
    3. Biochemistry: gapmer architecture/chemistry flags
    4. Produces X (features), y (%KD), and groups = family
  4. Split policy
    1. Per-family holdouts (grouped CV and final test), matching the intended deployment: new families at inference.
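
A minimal sketch of the feature assembly and grouping, with hypothetical column names and toy values; the real logic lives in features_basic.py and design_matrix.py.

import pandas as pd

# Toy table with hypothetical columns; real inputs come from data/ via cleaning.py.
df = pd.DataFrame({
    "family":    ["famA", "famA", "famB", "famB", "famC", "famC"],
    "gc":        [0.45, 0.55, 0.50, 0.60, 0.40, 0.52],
    "dg_duplex": [-18.2, -21.5, -16.9, -24.0, -15.3, -20.1],  # duplex ΔG (kcal/mol)
    "ed":        [3.1, 4.8, 2.2, 5.5, 1.9, 3.7],              # accessibility (RNAplfold ED)
    "kd_pct":    [62.0, 48.0, 71.0, 35.0, 58.0, 44.0],        # %KD label
})

# Derived feature as defined above: ΔG_eff = ΔG_duplex − ED
df["dg_eff"] = df["dg_duplex"] - df["ed"]

X = df[["gc", "dg_duplex", "ed", "dg_eff"]]   # features
y = df["kd_pct"]                              # %KD target
groups = df["family"]                         # kept intact by the grouped splits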

Model Components

Estimator

  • GradientBoostingRegressor (GBR), loss = squared error (L2)
  • Additive trees: shallow regression trees appended sequentially to reduce residuals
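
A minimal instantiation consistent with these bullets (shallow trees added sequentially, squared-error loss); the hyperparameter values shown are placeholders, with the tuned grid living in config.py.

from sklearn.ensemble import GradientBoostingRegressor

gbr = GradientBoostingRegressor(
    loss="squared_error",   # L2 loss on %KD
    n_estimators=300,       # number of sequentially added trees
    learning_rate=0.05,     # shrinkage on each tree's contribution
    max_depth=2,            # shallow trees that fit residuals of the current ensemble
    subsample=0.8,          # stochastic boosting for mild regularization
    random_state=0,
)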

Evaluation Methodology

  • Primary metric: MAE (headline)
  • Secondary: RMSE, R² (for sensitivity checks)
  • Cross-validation: Group-aware CV (families kept intact); final report uses best-case holdouts per family
  • Model selection: Hyperparameter sweep → pick min-MAE config; refit and evaluate on the family holdout
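
A compact sketch of this selection procedure, reusing X, y, groups, and gbr from the sketches above; the grid values are placeholders and the full loop lives in train_eval.py.

from sklearn.model_selection import GridSearchCV, GroupKFold

param_grid = {"max_depth": [1, 2, 3], "learning_rate": [0.03, 0.05, 0.10]}

search = GridSearchCV(
    gbr,
    param_grid,
    scoring="neg_mean_absolute_error",   # maximized, so the min-MAE config wins
    cv=GroupKFold(n_splits=3),           # families never straddle train/validation folds
)
search.fit(X, y, groups=groups)          # groups are routed to the GroupKFold splitter
best_model = search.best_estimator_      # refit on the full training data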

Reproducibility and Engineering

  • Config-first: all paths, seeds, grids, and external-tool flags live in config.py
  • Determinism: fixed random_state; caches keyed by content hashes (sketched after this list)
  • Separation of concerns: notebooks/scripts orchestrate; Python modules implement logic
  • Data contracts: assertive schema checks at module boundaries; leakage guards in pipelines
  • Artifact lineage: each model/metric JSON includes config + input hashes
  • Complexity: thermo calls dominate runtime → cached and parallelizable
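
A minimal sketch of the content-hash caching that keeps repeated thermo calls cheap; the cache directory and key layout are assumptions, with the real helpers in utils.py.

import hashlib
import json
from pathlib import Path

CACHE_DIR = Path("data/cache")   # assumed location; real paths come from config.py

def cache_key(sequence: str, params: dict) -> str:
    """Stable key: the same (sequence, parameters) pair always maps to the same file."""
    payload = json.dumps({"seq": sequence, "params": params}, sort_keys=True)
    return hashlib.sha256(payload.encode()).hexdigest()

def cached_call(sequence: str, params: dict, compute):
    """Return a cached JSON result if present; otherwise compute and persist it."""
    CACHE_DIR.mkdir(parents=True, exist_ok=True)
    path = CACHE_DIR / f"{cache_key(sequence, params)}.json"
    if path.exists():
        return json.loads(path.read_text())
    result = compute(sequence, **params)   # e.g., an RNAplfold or RNAduplex wrapper
    path.write_text(json.dumps(result))
    return result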

Conceptual Model Overview

# =========================
# Orchestrator (notebook or script)
# End-to-end pipeline: load → clean → featurize → train → calibrate → evaluate → save
# =========================

# Load run-time configuration (paths, hyperparams, feature flags, etc.)
cfg = load_config()

# 1) Data ingest & validation
#    - Read raw tables/files from disk/cloud
#    - Enforce schema (types, required cols), drop/flag bad rows
raw = load_raw_data(cfg.data_paths)
tbl = clean_and_validate(raw, cfg.schema)

# 2) Feature engineering
#    Thermodynamics-based features using external RNA tools
#    - 'thermo' might include MFE, ΔG, base-pairing probabilities, etc. from ViennaRNA/NUPACK
thermo = build_thermo_features(tbl.sequences, cfg.thermo)        # ViennaRNA + NUPACK

#    Accessibility features
#    - Compute local unpaired probabilities / ensemble defect using RNAplfold (ED)
#    - Captures how accessible regions are for interactions
access = build_accessibility(tbl.sequences, cfg.plfold_params)   # RNAplfold (ED)

#    Basic tabular features
#    - Sequence length, GC%, k-mer counts, categorical flags from metadata, etc.
X_basic = build_basic_features(tbl, cfg.featurization)

#    Assemble final design matrix
#    - Concatenate/align basic + thermo + accessibility features
#    - Apply optional transforms/selection per cfg.feature_opts (scaling, PCA, etc.)
X = assemble_design_matrix(X_basic, thermo, access, cfg.feature_opts)

# 3) Labels & grouping
#    - y: target variable (e.g., experimental activity in 0–100 scale)
#    - groups: family identifiers for grouped CV (prevents leakage across related sequences)
y, groups = get_labels_and_families(tbl)

# 4) Cross-validation & model setup
#    - Grouped CV ensures splits respect family boundaries
cv = make_grouped_cv(groups, n_splits=cfg.cv_splits)

#    - Gradient Boosting Regressor and hyperparameter grid
gbr, grid = make_gbr_and_grid(cfg.gbr_grid)

# 5) Hyperparameter search + fit
#    - Grid search over 'grid' using MAE on grouped CV
#    - Returns the best-fitted model (and optionally the CV results)
best = sweep_and_fit(gbr, grid, X, y, groups, cv, metric="MAE")

# 6) Similarity-weighted predictions
#    - Blend model predictions with neighbor/family similarity weights to reduce variance
y_hat = predict_with_similarity_weights(best, X, groups)

# 7) Family-wise calibration & k-shot adjustment
#    - Isotonic calibration per family to correct systematic bias; clip to valid range
y_cal = family_isotonic_calibration(y_hat, y, groups, clip=(0,100))

#    - Optional few-shot offsets by family (e.g., shift means using small held-out refs)
y_adj = apply_kshot_offsets(y_cal, groups, cfg.kshot)

# 8) Evaluation
#    - Compute metrics overall and by group/family for diagnostic visibility
report = evaluate_predictions(y_adj, y, groups, metrics=["MAE","RMSE","R2"])

# 9) Persistence & exports
#    - Save model (joblib), metrics (JSON), and any plots/artifacts to outputs dir
save_artifacts(best, report, outputs_dir=cfg.outputs)  # joblib + JSON + plots

#    - Export per-family summaries for a downstream interactive widget/visualization
export_family_widget_data(report, X, y, groups, out_dir="data/family_widget_data")
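
Step 7 above (family-wise calibration) can be illustrated with scikit-learn’s IsotonicRegression fitted within each family and clipped to the 0–100 %KD range. This is a simplified sketch, not the exact family_isotonic_calibration implementation.

import numpy as np
from sklearn.isotonic import IsotonicRegression

def family_isotonic_sketch(y_hat, y_true, groups, clip=(0.0, 100.0)):
    """Simplified per-family isotonic calibration of raw predictions.

    Note: for illustration this fits and applies the mapping on the same points;
    a real pipeline would calibrate on held-out data to avoid leakage.
    """
    y_hat = np.asarray(y_hat, dtype=float)
    y_true = np.asarray(y_true, dtype=float)
    groups = np.asarray(groups)
    y_cal = y_hat.copy()
    for fam in np.unique(groups):
        mask = groups == fam
        if mask.sum() >= 2:                       # need >= 2 points to fit a mapping
            iso = IsotonicRegression(out_of_bounds="clip")
            y_cal[mask] = iso.fit(y_hat[mask], y_true[mask]).predict(y_hat[mask])
    return np.clip(y_cal, *clip)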

Data Management

  • Inputs: data/*.fasta, data/literature_master_*.csv, data/wetlab_*.csv
  • Derived: data/cache/ (thermo/accessibility); data/plots/
  • Model outputs: outputs/models/, outputs/metrics/, outputs/figures/, outputs/logs/
  • Widget export: data/family_widget_data/
  • Provenance: filenames + content hashes recorded alongside every artifact

Quickstart Guide

  1. Install packages
    pip install scikit-learn pandas numpy joblib matplotlib
    # And install CLI tools: ViennaRNA (incl. RNAplfold) + NUPACK
    # Ensure binaries like `RNAplfold`, `RNAduplex`, and `nupack` are on PATH (see the check below).
  2. Prepare data
    • Place FASTA and CSVs in ./data/ (as in the structure above).
  3. Configure
    • Edit src/config.py for paths, seeds, model grid, and thermo parameters.
  4. Run
    • Execute the final notebook or script that calls into src/ modules.
    • Artifacts will appear under ./outputs/, and the interactive widget data under ./data/.
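
Before running, a quick pre-flight check (using the binary names from step 1) confirms the external tools are discoverable:

import shutil

# Report where each external tool resolves on PATH, or flag it as missing.
for tool in ("RNAplfold", "RNAduplex", "nupack"):
    print(f"{tool:10s} -> {shutil.which(tool) or 'NOT FOUND on PATH'}")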

Acknowledgements

  • scikit-learn, numpy/pandas, joblib, matplotlib
  • ViennaRNA (incl. RNAplfold) and NUPACK for thermodynamics & accessibility calculations
  • Jupyter for experiment orchestration

Overview

We built a public web app to raise ALS awareness, combining video, resource links, a scored quiz (with retakes and feedback), a per-day notes-based symptom tracker, and a podcast embed for our “Stories of Our Community” series.

Features and Design Details

  1. Home / Video landing. Prominent welcome and embedded video to humanize ALS.
  2. Resources. Curated links to major ALS organizations plus a podcast section.
  3. Quiz. Ten-question quiz with immediate score reporting and the ability to retake; shows which questions were missed so that users can learn from their mistakes.
  4. Symptom tracker. Free-form daily notes with timestamps to help users reflect and manage stress; supports ALS patients and others facing mental-health or memory challenges.
  5. Podcast. Embedded “Stories of Our Community” episodes by our team; also available on Spotify.

Why Our App Matters

Surveys and interviews revealed high name recognition of ALS but limited depth of understanding; the app bridges that gap by pairing narratives (video and podcast) with factual resources and active recall via the quiz.

Technical Details

  • Stack & runtime:
    • The app is a single-page application built with React + Vite. The repo was scaffolded from the official Vite React template, which provides HMR (Fast Refresh) via @vitejs/plugin-react (Babel).
  • Deployment:
    • Our project is deployed on Vercel at als-app-drab.vercel.app (linked from the repository “About” section). Vite’s production build produces static assets that Vercel serves directly.

Vite Usage

Vite’s ES-module dev server makes iteration fast through HMR, and its Rollup-based production build emits small static assets that are easy to host on Vercel. This fits our public education app, which needs no custom backend server, keeping costs low while maintaining availability and usability.