Model | Stanford - iGEM 2025

Model

Computational modeling of RNA logic gate dynamics

Model Overview

The Dry Lab team determined the best endogenous trigger sequences for half of LiRA's AND gate, leveraging publicly available expression data from Gepia2 and DepMap to do so. Additionally, we improved upon an existing trigger ranking algorithm created by The Gao Lab at Stanford Bioengineering to incorporate RNA folding predictions with viral triggers in mind. Finally, we leveraged STAR to align bulk RNA sequencing data to the Hepatitis B genome. We used our results to determine TPM values for viral genes, ascertaining whether there is enough RNA to meet LiRA's sensitivity threshold. All of these steps were crucial to validating the AND gate by associating oncogenic endogenous transcripts with HBV viral transcripts.

Best Endogenous Triggers

We designed a simple gene filtering and ranking script which prioritizes genes that are both highly expressed and differentially expressed in Hepatocellular Carcinoma. It begins by probing the user for a TPM (Transcripts Per Million) cutoff value, defaulting to 1.0 if none is provided, because filtering out genes with extremely low expression helps reduce noise and focus the analysis on transcripts that are more likely to be biologically meaningful.

The expression file (ACH-000625_expression.csv filtered from DepMap's OmicsExpressionProteinCodingGenesTPMLogp1.csv) is then parsed: the first column containing sample identifiers is removed, and the first data row is extracted to obtain TPM values for each gene. These values are reshaped into a tidy format, filtered to retain only genes meeting the TPM cutoff, and sorted in descending order of expression. This filtered set is saved to tpm_sorted.csv so that the user has a clear view of which genes are expressed at sufficiently high levels for LiRA's sensing capabilities.

Next, the script brings in a mapping file to standardize gene identifiers between DepMap and Gepia2, and ensure that the expression data can be consistently linked to downstream analyses. This step is necessary because different datasets often use slightly different naming conventions, and harmonizing identifiers prevents mismatches that would otherwise cause important genes to be dropped.

The script then reads the differential expression results from table_degenes.txt (procured from Gepia2's HCC tab) and normalizes the gene symbols to allow merging with the expression data. By integrating TPM-filtered expression values with differential expression values, the script ensures that subsequent analyses focus not only on genes that are highly expressed but also on those that show meaningful changes in regulation.

From this merged dataset, the script filters for genes with positive Log2-Fold Change values, since upregulated genes are often drivers of disease states and easier for LiRA to sense. These genes are sorted by fold change and saved as log+tpm_sorted.csv, producing a ranked list of candidates that are both well-expressed and upregulated.

A set of genes from the output which are both differentially expressed and highly expressed in Hep3B. GPC3 is the best, then UBD, then MDK. Many genes are listed.

Figure 1: The first few rows of our csv containing our best endogenous trigger candidates, sorted by log2(tpm + 1) and log2-fold change.

In summary, the script is not only performing filtering and ranking but doing so with the purpose of reducing noise, ensuring data consistency, and prioritizing genes that are both abundant and differentially expressed. This makes it easier to identify biologically meaningful targets for LiRA.

Best Viral Triggers

We designed a pipeline to ensure that LiRA can reliably sense the highly mutable Hepatitis B virus across its diverse genotypes. The fundamental problem driving this research stems from HBV's remarkable ability to evade treatment through rapid mutation and its existence as multiple genotypes (A through H) that exhibit significant sequence variation. It is important to LiRA's functionality that it can sense viral RNA within the cell, independent of any genotype differences or mutations which exist. With that being said, some genes are better for sensing than others, and our ranking algorithm picks them out to improve the already difficult odds.

The pipeline begins by systematically collecting comprehensive genomic data from HBVdb, downloading FASTA sequences for all major viral genotypes across eight critical genomic regions including the precore, core, X protein, surface proteins, and polymerase genes. This broad sampling is essential because LiRA must work across the global diversity of HBV strains that patients actually harbor. The code implements memory-safety features including exact sequence deduplication and hard caps on alignment sizes, recognizing that viral databases contain thousands of sequences that could overwhelm computational resources without proper management.

Once sequences are collected, the system performs multiple sequence alignments using MUSCLE v5 in lean mode, with fallback options to MUSCLE v3 or ClustalW if the preferred tool is unavailable. The alignment step is crucial for identifying evolutionary conservation patterns, as regions that remain unchanged across genotypes are ideal for sensing. The algorithm specifically searches for regions showing greater than 95% sequence conservation, a stringent threshold that ensures target reliability.

The pipeline then scans these conserved regions for specific RNA motifs including CCA, GCA, and UCA sequences, which are required for LiRA's binding and sensing. The resulting "trigger" sequences represent potential viral targets that combine both evolutionary conservation and structural suitability for LiRA to bind.

Viral Alignment Analysis

We wrote a Bash script to run a complete RNA-seq alignment and quantification pipeline for Hep3B bulk RNA seq reads against the HBV reference genome.

The configuration section defines key file paths and parameters. Input FASTQ reads (GSM1211428 from NCBI's Sequence Read Archive), the HBV genome FASTA (NCBI Reference Sequence NC_003977.2), and its GTF annotation (HBVdb) are set as variables, while memory and thread counts are explicitly declared. This makes the script flexible, reproducible, and easy to modify without hard-coding values.

Next, the script activates a pre-existing Conda environment (rnaseq_env) to guarantee that the right versions of STAR, samtools, and featureCounts are available. Using environments avoids dependency conflicts and ensures reproducibility across runs.

The pipeline then creates output subdirectories for the STAR index, alignments, and read counts. This organizational step prevents clutter and makes downstream file handling straightforward.

Then, STAR aligns the RNA-seq reads to the HBV reference genome, outputting a sorted BAM file. Parameters such as maximum multimapping, mismatch tolerance, and detailed SAM attributes are specified to balance sensitivity and specificity. After alignment, the BAM file is indexed with samtools to enable efficient querying by downstream tools.

featureCounts quantifies how many reads map to coding sequences (CDS) defined in the GTF file, grouping by gene locus tags. Options allow fractional assignment of multimappers and handle overlapping features, producing a gene-level expression matrix for later analysis.

Figure 2: 4,464 mapped reads to the Hepatitis B ayw genome from Hep3B bulk seq data.

Afterwards, samtools generates alignment statistics (flagstat) and per-base coverage depth. These quality control metrics are essential for interpreting how well the reads mapped and whether coverage is uniform across the genome. Finally, the result alignments and counts are archived into a compressed tarball for easy transfer, backup, or sharing. The script ends with a success message to clearly indicate completion.

Overall, each part of this script is designed to ensure reproducibility, efficiency, and organization: resource allocation ensures smooth execution on HPC, Conda guarantees the right software environment, directories structure results logically, STAR provides accurate alignment, featureCounts yields biologically relevant quantification, and archiving consolidates outputs for long-term use. These reads were crucial for identifying whether there was enough viral RNA for LiRA to sense within the Hep3B cell line we were originally using. Unfortunately, we proved that there was not enough, which coincides with our failed attempt to sense the HBX gene within the cell line without artificially transfecting it.

Program 5 Execution: Conservation Visualization

The interactive plot reveals insights about sequence positioning and conservation patterns that would have been difficult to discern from raw data alone.

The visualization process begins by optionally inputting custom sequences we want to analyze. The program automatically downloads thousands of HBV genome sequences from the HBVdb Lyon database, then performs sampling-based conservation analysis to identify the most conserved 140 base pair windows across these genomes.

Conservation plot showing scatter points for conserved sequences across genome positions, with gene annotation tracks below showing HBV genes (P, PreS1, PreS2, S, C, PreC, X, SP) as colored horizontal bars

Figure 3: Conservation mapping across the HBV genome. The top panel shows conserved 140bp windows (colored by GC content, sized by exact match percentage) plotted by their genomic position and conservation percentage. Custom input sequences appear as red markers. The bottom panel displays HBV gene regions as horizontal tracks, with split genes like Polymerase (P) and PreS1 shown across their discontinuous genomic locations. The 80% conservation threshold is marked with a dashed red line.

The position plot reveals that the most highly conserved sequences clustered in the Core (C) and X protein regions. We could see how conservation percentage (y-axis) related to genomic position (x-axis), with most conserved regions falling well above our 80% threshold line. The GC content coloring helped identify sequences with different nucleotide compositions, while marker size indicated the percentage of genomes with exact sequence matches.

The gene annotation track proved invaluable for understanding the functional context of our targets. By overlaying conserved sequence positions with gene boundaries, we could confirm that our candidate triggers aligned with stable, functionally important regions rather than highly variable domains. The split regions for genes like Polymerase (P) and PreS1 became visually obvious through the horizontal bars that appeared in multiple genomic locations, helping us understand why some sequences appeared to map to what seemed like distant positions—they were actually within the same gene spanning the circular genome's origin.

When we input our custom candidate sequences (marked in red), we could immediately see how they compared to the naturally conserved regions identified through the sampling algorithm. Sequences that clustered near highly conserved regions with similar conservation percentages became our top priorities for experimental validation. The visual comparison made it obvious which candidates were targeting genuinely conserved regions versus sequences that might work in some genotypes but fail in others.