Software

/pipeline/log - live

hepaswitch$ bootstrap ▶ ref index ✓ | env cache ✓

hepaswitch$ quantify ▶ tumor/normal alignment … done

hepaswitch$ design ▶ k-mer sweep, structure fit, RNAup/cofold … ranked ✓

Generates toehold switches automatically from raw data, splitting transcripts into k-mers and optimizing structure.

Accepts raw Nanopore sequencing data and selects top differentially expressed transcripts.

Intuitive interface automates the entire workflow, enabling quick, personalized riboswitch design.

What's the problem?

Designing reliable toehold switches for eukaryotic contexts is still bottlenecked by the choice of target region. In polyclonal tumors there is no single, universally up-regulated transcript; expression shifts between subclones, time points and anatomical regions, so a hand-picked site quickly goes out of date.

Our pipeline removes that guesswork: starting from patient reads, it quantifies expression across tumor/normal, selects robust tumor-only regions, and assembles testable switches automatically. The result is fewer subjective decisions, less trial-and-error, and a reproducible path from raw Nanopore data to ranked designs that a wet lab can validate immediately.

In practice, this enables hospital teams to move from biopsy to patient-specific constructs within a single workflow - keeping data provenance intact, minimizing context switches between tools, and standardizing outputs for downstream cloning and transfection.

Implementation

The workflow is scripted and reproducible: options parsing → environment bootstrap → reference preparation → alignment/quantification → differential expression → export of tumor-only transcripts → riboswitch design.

On the first run the pipeline bootstraps its environment (packages, tools) and prepares the transcriptome reference (FASTA + indices). All heavy assets are cached in a stable folder layout and automatically reused on subsequent runs. Each stage is idempotent - you can re-invoke a later step without re-downloading or re-indexing, and partial outputs are detected safely.
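
One common way to make a stage idempotent is to write a small marker file once the stage has finished, and to skip the stage when the marker is already present. The sketch below only illustrates that idea; `run_stage_once`, the `.done` marker name and the folder layout are assumptions, not the pipeline's actual code.

```python
from pathlib import Path
from typing import Callable


def run_stage_once(stage_dir: Path, name: str, action: Callable[[Path], None]) -> bool:
    """Run `action` unless a completion marker for `name` already exists.

    Returns True if the stage ran, False if it was skipped. The marker is
    written only after `action` returns, so an interrupted run leaves no
    marker and the stage is repeated on the next invocation.
    """
    stage_dir.mkdir(parents=True, exist_ok=True)
    marker = stage_dir / f"{name}.done"  # hypothetical marker file name
    if marker.exists():
        return False
    action(stage_dir)
    marker.write_text("ok\n")
    return True
```

Writing the marker last is what lets a partially completed stage be detected: its outputs may exist, but without the marker they are not trusted.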

Operational details include consistent run IDs for naming, deterministic seeds for steps using stochastic optimizers, and guardrails that fail fast on malformed inputs (e.g., mixed strandedness or mismatched read types).
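
As an illustration of these two ideas, the sketch below derives a stable seed from the run ID and rejects an obviously wrong reads file before any heavy work starts. Both function names are hypothetical, and the file-name check is a deliberately simpler guardrail than the strandedness and read-type checks mentioned above.

```python
import hashlib
from pathlib import Path


def seed_from_run_id(run_id: str) -> int:
    """Derive a stable 32-bit seed from the run ID (same ID gives same seed)."""
    digest = hashlib.sha256(run_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big")


def check_reads(path: str) -> Path:
    """Fail fast if the reads file is not a .fastq.gz file or does not exist."""
    p = Path(path)
    if not p.name.endswith(".fastq.gz"):
        raise ValueError(f"expected a .fastq.gz file, got: {p.name}")
    if not p.is_file():
        raise FileNotFoundError(p)
    return p
```

Hashing the run ID, rather than using Python's built-in `hash`, keeps the seed identical across processes and machines.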

Transcript selection

Input is long-read cDNA from tumor and matched normal (.fastq.gz). The module prepares the transcriptome (plus an .mmi index) on first run, aligns and quantifies expression, then computes differential expression across tumor regions to capture intratumoral heterogeneity.

Candidate transcripts are then filtered with guardrails that reduce false positives before design:

Tumor enrichment: retain targets with strong fold-change and adequate base coverage across tumor regions.
Paralog proximity: drop sequences with close paralogs or repeated segments likely to cause off-target triggering.
Sequence sanity: ensure clean ORF context when a reporter will be appended downstream.
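
The tumor-enrichment filter can be pictured as a simple threshold test on each candidate. This is a minimal sketch under stated assumptions: the `Candidate` fields, the function name and both default thresholds are placeholders chosen for illustration, not values used by the pipeline.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    transcript_id: str
    log2_fold_change: float     # tumor vs normal
    min_region_coverage: float  # lowest mean coverage across tumor regions


def tumor_enriched(cands, min_lfc=2.0, min_cov=10.0):
    """Keep candidates with strong fold-change and adequate coverage.

    The default thresholds are placeholders, not the pipeline's values.
    """
    return [c for c in cands
            if c.log2_fold_change >= min_lfc and c.min_region_coverage >= min_cov]
```

Using the lowest coverage across regions, rather than the mean, is one way to require that a target is supported in every sampled tumor region.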

The output is a compact FASTA containing top tumor-specific transcripts and their metadata (IDs, coordinates, basic QC), which becomes the direct input to the designer.

Candidate generation

For each target, the designer splits the sequence into overlapping k-mers (sliding windows) and proposes complementary trigger regions. It then assembles full eukaryotic toeholds by adding the stem/loop, the Kozak motif (GCCACCAUGG), start codon (AUG) and linkers - enforcing the reading frame and excluding premature stops.
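
The first two steps, sliding windows and complementary regions, can be sketched in a few lines. The function names and the `step` parameter are illustrative assumptions; the designer's own windowing code may differ.

```python
def sliding_kmers(seq: str, k: int, step: int = 1):
    """Yield (start, k-mer) for every window of length k, moving by `step`."""
    for start in range(0, len(seq) - k + 1, step):
        yield start, seq[start:start + k]


_COMPLEMENT = str.maketrans("ACGU", "UGCA")


def reverse_complement_rna(seq: str) -> str:
    """Return the reverse complement of an RNA sequence (A<->U, C<->G)."""
    return seq.translate(_COMPLEMENT)[::-1]
```

For example, `list(sliding_kmers("AUGGCU", 4))` gives the three windows starting at positions 0, 1 and 2, and `reverse_complement_rna("AUGGCU")` returns `"AGCCAU"`.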

Structural and interaction scoring proceeds in two passes: (1) iterative fitting toward the intended secondary structure (e.g., RNAinverse-style optimization) and (2) evaluation of RNA:RNA binding using interaction models (RNAup/RNAcofold). Heuristics include GC-content windows, hairpin stability limits, trigger accessibility, and penalties for sequence motifs known to hamper translation or splicing in mammalian cells.
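
Of the heuristics listed, the GC-content window is the simplest to show. The sketch below checks that every window of a sequence stays inside a GC range; the window size and both bounds are placeholder values, and the function names are hypothetical.

```python
def gc_fraction(seq: str) -> float:
    """Fraction of G/C bases in `seq` (0.0 for an empty sequence)."""
    if not seq:
        return 0.0
    return sum(base in "GC" for base in seq.upper()) / len(seq)


def gc_window_ok(seq: str, window: int = 10, low: float = 0.3, high: float = 0.7) -> bool:
    """True if every full window of length `window` has GC content in [low, high].

    Sequences no longer than `window` are checked as a single window.
    Window size and bounds are placeholder values.
    """
    if len(seq) <= window:
        return low <= gc_fraction(seq) <= high
    return all(low <= gc_fraction(seq[i:i + window]) <= high
               for i in range(len(seq) - window + 1))
```

A windowed check catches local GC-rich or AT-rich stretches that a whole-sequence average would hide.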

The ranking function combines these terms into a single score, so you see a clear ordering of candidates. For robustness, multiple triggers can be combined into an OR gate within one construct; when enabled, the builder re-checks frame integrity and linker boundaries after concatenation to keep everything in-frame.
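
One plausible shape for such a ranking function is a weighted sum of the individual terms. This is a sketch only: the term names, the weights and the convention that lower scores are better are all assumptions, and the actual scoring formula is not given here.

```python
def combined_score(terms: dict, weights: dict) -> float:
    """Weighted sum of score terms; lower is better in this sketch."""
    return sum(weights[name] * terms[name] for name in weights)


def rank_candidates(candidates, weights):
    """Return candidates sorted from best (lowest score) to worst."""
    return sorted(candidates, key=lambda c: combined_score(c["terms"], weights))


# Hypothetical terms: binding energy in kcal/mol (more negative is stronger)
# and a distance between predicted and intended structure (smaller is closer).
example_weights = {"binding_energy": 1.0, "structure_distance": 0.5}
```

With these weights, a candidate with terms `-25.0` and `2` scores `-24.0` and ranks ahead of one with `-20.0` and `4`, which scores `-18.0`.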

Output and user interaction

The designer exports ranked, fully assembled switches with per-candidate annotations (trigger coordinates, predicted structures, energy terms, constraint flags). Typical formats include FASTA (sequence-only), CSV (tabular scores) and JSON (machine-readable bundles for programmatic use).
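
To show how the three formats relate, the sketch below writes the same list of candidate records as FASTA, CSV and JSON. The output file names and the record keys (`id`, `sequence`, `score`) are illustrative assumptions and do not describe the designer's real output schema.

```python
import csv
import json
from pathlib import Path


def export_candidates(candidates, out_dir):
    """Write the same ranked candidates as FASTA, CSV and JSON files."""
    if not candidates:
        raise ValueError("no candidates to export")
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)

    # FASTA: one header line and one unwrapped sequence line per candidate.
    with open(out / "switches.fasta", "w") as fh:
        for c in candidates:
            fh.write(f">{c['id']}\n{c['sequence']}\n")

    # CSV: one row per candidate; columns come from the first record.
    with open(out / "switches.csv", "w", newline="") as fh:
        writer = csv.DictWriter(fh, fieldnames=list(candidates[0]))
        writer.writeheader()
        writer.writerows(candidates)

    # JSON: the full records as a list.
    (out / "switches.json").write_text(json.dumps(candidates, indent=2))
```

Because all three files are produced from one list, a sequence in the FASTA can always be matched to its scores by `id`.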

Optional visualizations show predicted secondary structures and ensemble defects for quick QC; each one corresponds to a row in the tabular output, so you can trace any candidate back to its inputs. Operator input stays minimal: point to the reads (plus an optional reporter FASTA), select threads and an output folder, and the pipeline handles the rest - reusing cached references for faster iteration.

Downstream, the top-ranked design per target can be merged into a single plasmid-ready construct. The builder validates junctions, Kozak placement and ORF continuity to avoid silent frame shifts.
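
An ORF continuity check of this kind can be sketched as follows. It is an assumption-laden illustration, not the builder's code: it locates the Kozak motif quoted earlier, reads codons from its AUG to the end of the sequence, and treats the input as the region upstream of the reporter's own stop codon.

```python
STOP_CODONS = {"UAA", "UAG", "UGA"}
KOZAK = "GCCACCAUGG"  # motif quoted in the text; its AUG starts at offset 6


def orf_is_continuous(construct: str) -> bool:
    """Check that the ORF starting at the Kozak AUG has no in-frame stop codon.

    Codons are read from the AUG to the end of `construct`; a missing Kozak
    motif or a trailing partial codon counts as a failure.
    """
    kozak_at = construct.find(KOZAK)
    if kozak_at < 0:
        return False
    orf = construct[kozak_at + 6:]
    if len(orf) % 3 != 0:
        return False
    codons = (orf[i:i + 3] for i in range(0, len(orf), 3))
    return not any(codon in STOP_CODONS for codon in codons)
```

Running the check again after every concatenation step is what catches a frame shift introduced at a junction.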

Conducting Differential Expression

This module takes Nanopore sequencing reads (.fastq.gz) from tumor and matched normal samples, runs alignment and quantification, performs differential expression, and outputs a FASTA file with the most tumor-specific transcripts for the next part of the pipeline (riboswitch design).

TL;DR - What does this section do, and how do I run it?

Parse options & set up folders
Environment bootstrap
Reference preparation
Run differential expression
Select “only in tumor” transcripts

Basic one-liner from the project root:

bash scripts/transcript_ranking.sh \
      --tumor  data/TUMOR.fastq.gz \
      --normal data/NORMAL.fastq.gz \
      --threads 16 \
      --job-id MYRUN \
      --pc-lnc-only 1

System requirements

Linux or WSL2
~20 GB of disk space; 16 GB RAM recommended
Internet access on first run (to install the environment and fetch the reference genome)
No other manual setup is needed: Micromamba automatically configures what it requires to work (shell init hooks, environment root, etc.).

Inputs & core command

--tumor: tumor reads (.fastq.gz)
--normal: matched normal reads (.fastq.gz)
--threads: CPU thread count
--job-id: run name (used to label outputs)
--pc-lnc-only 1: restrict to protein-coding + lncRNA (speeds up the run)

What happens under the hood

Create/verify folders and bootstrap the Micromamba environment
Prepare the reference (download/index on first run)
Align reads and quantify expression
Compute differential expression and rank transcripts
Filter to “only in tumor” candidates

Outputs

Alignment files (.bam) against the reference genome
Salmon results used for differential expression
Tabular files listing significantly differentially expressed transcripts
FASTA file with sequences of top tumor-specific transcripts (input for the riboswitch design block)

First run can be long

The very first execution on a given machine/sample set is slower because the reference genome must be downloaded and indexed. By default, the software detects existing reference assets and reuses them on subsequent runs.
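
A checksum comparison is one way such detection can work. The sketch below is an assumption about the mechanism, not a description of the script: it treats the checksum file as holding the SHA-256 of the reference FASTA and reuses the index only when that digest still matches. The file names in the usage note come from the reset command later in this section; what those files actually contain is not documented here.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hex SHA-256 digest of a file, read in chunks to keep memory flat."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def reference_is_current(fasta: Path, index: Path, checksum_file: Path) -> bool:
    """True if all three files exist and the stored checksum matches the FASTA."""
    if not (fasta.is_file() and index.is_file() and checksum_file.is_file()):
        return False
    stored = checksum_file.read_text().split()
    return bool(stored) and stored[0] == sha256_of(fasta)
```

Under this assumption, a call such as `reference_is_current(fasta, Path("ref/transcripts.mmi"), Path("ref/transcripts.ref.sha256"))` would return False after either file is deleted, which forces a rebuild.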

Force a clean reference re-download/re-index

If you want to reset the reference and rebuild from scratch, remove the following files from the ref directory:

rm -f ref/transcripts.mmi \
        ref/transcripts.ref.sha256 \
        ref/transcripts.fasta.path

Designing Riboswitches

This module takes tumor-specific transcripts from the previous step and automatically builds, scores, and exports eukaryotic toehold switches using hepaswitch.py and hepaswitch_script.py. Candidates are generated from overlapping k-mers, assembled with Kozak/AUG/linkers, structurally optimized, ranked, and exported for downstream testing.

TL;DR - What does this section do, and how do I run it?

Read tumor-specific targets from FASTA
Generate k-mer-based trigger regions
Assemble full toehold constructs (Kozak, AUG, linkers)
Optimize & score structures/interactions
Rank and export best switches

Basic one-liner from the project root:

python hepaswitch.py \
      --targets targets.fasta \
      --reporter optional_reporter.fasta \
      --out out/

Inputs & core arguments

targets.fasta - FASTA with top tumor-specific transcripts (output of the differential expression stage).
optional_reporter.fasta - optional reporter ORF to append for assay readout.
out/ - output directory for designed switches and scores.

What happens under the hood

Partition each target transcript into overlapping k-mers to propose trigger regions.
Build full toehold constructs by combining the trigger with structural elements: stem/loop, Kozak GCCACCAUGG, start codon (AUG), and linkers; enforce in-frame coding and absence of premature stop codons.
Fit each construct iteratively toward the intended structure (e.g., RNAinverse) and score RNA:RNA binding energies with interaction models (RNAup/RNAcofold).
Optionally combine multiple triggers with OR logic within one construct for robust detection.

Outputs

Fully assembled toehold sequences (FASTA/CSV) with Kozak/AUG/linkers.
Per-candidate structural annotations and energy scores for ranking.
Optional visualizations of predicted secondary structures and ensemble defects for QC.

The best-scoring design for each target can be combined into a single, test-ready construct; the pipeline ensures all coding regions remain in-frame.

Key screens

1) Input - choose files & how many transcripts to use

Input screen: pick transcripts.fasta and optional reporter.fasta, set count 1–10

Transcripts are read in order; sequences are converted to RNA (T→U).
Reporter FASTA (first record) is optional and passed through to the designer.
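
The input handling described above (read records in file order, keep the first N, convert T to U) can be sketched like this. The function names are hypothetical and the parser is a minimal one written for illustration; the GUI's own reader may behave differently on edge cases.

```python
def read_fasta(text: str):
    """Parse FASTA text into a list of (header, sequence) in file order."""
    records, header, parts = [], None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(parts)))
            header, parts = line[1:], []
        elif header is not None:
            parts.append(line)
    if header is not None:
        records.append((header, "".join(parts)))
    return records


def first_n_as_rna(text: str, count: int):
    """Take the first `count` records and convert DNA letters to RNA (T->U)."""
    if not 1 <= count <= 10:
        raise ValueError("count must be between 1 and 10")
    return [(h, s.upper().replace("T", "U")) for h, s in read_fasta(text)[:count]]
```

Requesting more records than the file contains simply returns all of them.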

2) Design run - progress & per-transcript best

Design run: progress bar and per-transcript best table

Worker thread computes the best complex for each transcript via find_best_complex.
Best rows show the toehold (with linker) and AUG position; GUI updates progress as each finishes.

3) Final construct - OR fusion & export

Final construct: sequence preview with Copy / Save FASTA / Visualize

Multiple toeholds are concatenated with frame-safe trimming using align_toeholds_for_OR.
Save produces continuous-line FASTA (no wrapping).
RNAplot preview is generated on demand (RNAfold → RNAplot → convert).

What happens under the hood

k-mer triggers across 20–35 nt windows → reverse-complement → stem/loop + Kozak + AUG + linker.
Structure fit (RNAinverse), then interaction scoring (RNAup + RNAcofold); scores are combined and ranked.
GUI currently selects the single best candidate per transcript; the script API also exposes Top-N if needed.

Accepted inputs

transcripts.fasta - 1..10 tumor-specific sequences (already chosen upstream)
reporter.fasta - optional reporter ORF (first record is used)

Generated outputs

Per-transcript best toehold (with linker) + AUG position (table view)
Final OR construct (copy/save FASTA, continuous sequence)
Optional RNAplot preview for the final construct

Dependencies

PySide6 (GUI)
ViennaRNA tools in $PATH: RNAfold, RNAcofold, RNAup, RNAinverse, RNAplot (checked at startup)
ImageMagick convert for PS→PNG (structure preview)
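
A startup check for these command-line tools can be written with the standard library's `shutil.which`. The sketch below is an illustration of that approach, with a hypothetical function name; it is not the GUI's actual startup code.

```python
import shutil

# Executable names taken from the dependency list above.
REQUIRED_TOOLS = ["RNAfold", "RNAcofold", "RNAup", "RNAinverse", "RNAplot", "convert"]


def missing_tools(tools=REQUIRED_TOOLS, which=shutil.which):
    """Return the tools that cannot be found on $PATH.

    `which` is injectable so the check can be tested without the tools installed.
    """
    return [tool for tool in tools if which(tool) is None]
```

Reporting every missing tool at once, instead of stopping at the first, lets the user fix the environment in a single pass.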

The GUI does not perform mapping/quantification or differential expression. Provide targets that were already curated by the previous module (e.g., the exported targets.fasta).

Demonstration

A short video demonstration shows how to clone the repository and design riboswitches from preselected sequences (FASTA), without Nanopore data processing. It covers:

Cloning the repository and quick environment setup
Running hepaswitch.py on an example targets.fasta file
Reviewing results (candidate ranking, OR construct preview)
Exporting designed sequences to FASTA for downstream testing

Note: this demonstration skips the mapping, quantification, and differential expression stages - it uses already selected transcripts.

We can’t include a direct YouTube link here due to iGEM requirements, but we can share it with you on request.
