Model Overview
The Dry Lab team determined the best endogenous trigger sequences for half of LiRA's AND gate, leveraging publicly available expression data from Gepia2 and DepMap to do so. Additionally, we improved upon an existing trigger ranking algorithm created by The Gao Lab at Stanford Bioengineering to incorporate RNA folding predictions with viral triggers in mind. Finally, we leveraged STAR to align bulk RNA sequencing data to the Hepatitis B genome. We used our results to determine TPM values for viral genes, ascertaining whether there is enough RNA to meet LiRA's sensitivity threshold. All of these steps were crucial to validating the AND gate by associating oncogenic endogenous transcripts with HBV viral transcripts.
Best Endogenous Triggers
We designed a simple gene filtering and ranking script which prioritizes genes that are both highly expressed and differentially expressed in Hepatocellular Carcinoma. It begins by probing the user for a TPM (Transcripts Per Million) cutoff value, defaulting to 1.0 if none is provided, because filtering out genes with extremely low expression helps reduce noise and focus the analysis on transcripts that are more likely to be biologically meaningful.
The expression file (ACH-000625_expression.csv filtered from DepMap's OmicsExpressionProteinCodingGenesTPMLogp1.csv) is then parsed: the first column containing sample identifiers is removed, and the first data row is extracted to obtain TPM values for each gene. These values are reshaped into a tidy format, filtered to retain only genes meeting the TPM cutoff, and sorted in descending order of expression. This filtered set is saved to tpm_sorted.csv so that the user has a clear view of which genes are expressed at sufficiently high levels for LiRA's sensing capabilities.
Next, the script brings in a mapping file to standardize gene identifiers between DepMap and Gepia2, and ensure that the expression data can be consistently linked to downstream analyses. This step is necessary because different datasets often use slightly different naming conventions, and harmonizing identifiers prevents mismatches that would otherwise cause important genes to be dropped.
The script then reads the differential expression results from table_degenes.txt (procured from Gepia2's HCC tab) and normalizes the gene symbols to allow merging with the expression data. By integrating TPM-filtered expression values with differential expression values, the script ensures that subsequent analyses focus not only on genes that are highly expressed but also on those that show meaningful changes in regulation.
From this merged dataset, the script filters for genes with positive Log2-Fold Change values, since upregulated genes are often drivers of disease states and easier for LiRA to sense. These genes are sorted by fold change and saved as log+tpm_sorted.csv, producing a ranked list of candidates that are both well-expressed and upregulated.
Figure 1: The first few rows of our csv containing our best endogenous trigger candidates, sorted by log2(tpm + 1) and log2-fold change.
In summary, the script is not only performing filtering and ranking but doing so with the purpose of reducing noise, ensuring data consistency, and prioritizing genes that are both abundant and differentially expressed. This makes it easier to identify biologically meaningful targets for LiRA.
Program 5 Execution: Conservation Visualization
The interactive plot reveals insights about sequence positioning and conservation patterns that would have been difficult to discern from raw data alone.
The visualization process begins by optionally inputting custom sequences we want to analyze. The program automatically downloads thousands of HBV genome sequences from the HBVdb Lyon database, then performs sampling-based conservation analysis to identify the most conserved 140 base pair windows across these genomes.
Figure 3: Conservation mapping across the HBV genome. The top panel shows conserved 140bp windows (colored by GC content, sized by exact match percentage) plotted by their genomic position and conservation percentage. Custom input sequences appear as red markers. The bottom panel displays HBV gene regions as horizontal tracks, with split genes like Polymerase (P) and PreS1 shown across their discontinuous genomic locations. The 80% conservation threshold is marked with a dashed red line.
The position plot reveals that the most highly conserved sequences clustered in the Core (C) and X protein regions. We could see how conservation percentage (y-axis) related to genomic position (x-axis), with most conserved regions falling well above our 80% threshold line. The GC content coloring helped identify sequences with different nucleotide compositions, while marker size indicated the percentage of genomes with exact sequence matches.
The gene annotation track proved invaluable for understanding the functional context of our targets. By overlaying conserved sequence positions with gene boundaries, we could confirm that our candidate triggers aligned with stable, functionally important regions rather than highly variable domains. The split regions for genes like Polymerase (P) and PreS1 became visually obvious through the horizontal bars that appeared in multiple genomic locations, helping us understand why some sequences appeared to map to what seemed like distant positions—they were actually within the same gene spanning the circular genome's origin.
When we input our custom candidate sequences (marked in red), we could immediately see how they compared to the naturally conserved regions identified through the sampling algorithm. Sequences that clustered near highly conserved regions with similar conservation percentages became our top priorities for experimental validation. The visual comparison made it obvious which candidates were targeting genuinely conserved regions versus sequences that might work in some genotypes but fail in others.