Engineering Success

Iteration 1: Initial dCas9-Dam Fusion Design

Design

The design focused on a dCas9-Dam fusion protein for programmable DNA methylation. Catalytically inactive Cas9 (dCas9) served as the targeting module and was fused to the DNA adenine methyltransferase (Dam) effector domain. A 15-amino-acid Gly-Ser linker was placed between the domains to provide flexibility and spacing, so that both domains could function without steric hindrance. The linker and Dam coding sequence were inserted immediately before the dCas9 stop codon to preserve the dCas9 open reading frame and allow expression of the full-length fusion protein.
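The layout described above can be checked in silico before ordering primers. The sketch below is illustrative only: the sequences are short toy placeholders (not the real dCas9, linker, or Dam sequences), and `build_fusion_orf` and `is_single_orf` are hypothetical helper names. It removes the dCas9 stop codon, appends the linker and the Dam coding sequence, re-attaches the stop codon, and confirms that the result is one uninterrupted reading frame.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}

def codons(seq):
    """Split a coding sequence into consecutive 3-nt codons."""
    return [seq[i:i + 3] for i in range(0, len(seq), 3)]

def build_fusion_orf(dcas9_orf, linker, dam_orf):
    """Insert linker + Dam immediately before the dCas9 stop codon."""
    dcas9_stop = dcas9_orf[-3:]
    if dcas9_stop not in STOP_CODONS:
        raise ValueError("dCas9 ORF must end with a stop codon")
    # Drop Dam's own stop codon so the dCas9 stop terminates the fusion.
    dam_body = dam_orf[:-3] if dam_orf[-3:] in STOP_CODONS else dam_orf
    return dcas9_orf[:-3] + linker + dam_body + dcas9_stop

def is_single_orf(seq):
    """True if seq starts with ATG, ends with a stop, and has no internal stop."""
    if len(seq) % 3 != 0:
        return False
    triplets = codons(seq)
    has_internal_stop = any(c in STOP_CODONS for c in triplets[:-1])
    return triplets[0] == "ATG" and triplets[-1] in STOP_CODONS and not has_internal_stop

# Toy placeholder sequences (assumptions, for illustration only).
toy_dcas9 = "ATGGACAAGAAGTAA"
toy_linker = "GGTGGCGGTGGCTCT"       # encodes Gly-Gly-Gly-Gly-Ser
toy_dam = "ATGAAGAAAAATCGCTGA"

fusion = build_fusion_orf(toy_dcas9, toy_linker, toy_dam)
frameshifted = build_fusion_orf(toy_dcas9, toy_linker[:-1], toy_dam)
print(is_single_orf(fusion), is_single_orf(frameshifted))
```

A linker whose length is not a multiple of three shifts the Dam reading frame, which is the failure this check is meant to catch.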

Build

The pdCas9 plasmid was linearized, and the Dam coding sequence was amplified from the E. coli genome by high-fidelity PCR. Primers were designed with specific overlaps for In-Fusion cloning; the primer used to linearize pdCas9 also encoded the 54-nucleotide Gly-Ser linker sequence. Assembly was performed with the In-Fusion Cloning Kit, whose proprietary enzyme generates single-stranded overhangs at the fragment ends; after transformation, the native cellular ligase joins the annealed overlapping sequences. The reaction mixture was transformed into competent E. coli cells to obtain the final plasmid construct.

Test

Assembly was tested by colony PCR using lysed colonies. Agarose gel electrophoresis showed intense bands in all sample wells, corresponding to the size of the Dam gene.

Learn

The bands were likely due to PCR amplification of the endogenous Dam gene from genomic DNA present in the cell lysate, rather than from the recombinant plasmid. This indicated that plasmid purification should be performed prior to PCR to avoid genomic DNA contamination.
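The failure mode can be illustrated with a small in-silico PCR check. This is a sketch under stated assumptions: the sequences are toy placeholders, the original primers are assumed to bind inside the Dam gene, and `amplifies` is a hypothetical helper that only looks for exact primer matches. It shows that gene-internal primers find a template in both the genome and the plasmid, whereas a primer that binds the linker junction (shown only for contrast, not as the approach taken here) finds a template in the plasmid alone.

```python
def revcomp(seq):
    """Reverse complement of a DNA sequence."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]

def amplifies(template, fwd_primer, rev_primer):
    """True if the forward primer matches the top strand and the reverse
    primer matches the bottom strand somewhere downstream of it."""
    start = template.find(fwd_primer)
    if start == -1:
        return False
    rev_site = revcomp(rev_primer)
    return template.find(rev_site, start + len(fwd_primer)) != -1

# Toy placeholder sequences (assumptions, for illustration only).
toy_dam = "ATGAAGAAAAATCGCGCTTTTTTGAAGTGGGCAGGG"
toy_linker = "GGTGGCGGTGGCTCT"
genome = "TTGACATTCC" + toy_dam + "CCGTAGGATC"
plasmid = "ACGTACGTAC" + toy_linker + toy_dam + "TAAGCATGCA"

gene_fwd = toy_dam[:10]
gene_rev = revcomp(toy_dam[-10:])
junction_fwd = toy_linker  # binds only where the linker is present

for name, template in (("genome", genome), ("plasmid", plasmid)):
    print(name,
          "gene primers:", amplifies(template, gene_fwd, gene_rev),
          "junction primer:", amplifies(template, junction_fwd, gene_rev))
```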

Iteration 2: Plasmid Isolation and PCR Optimization

Iteration 3: Restriction Digestion Analysis

Iteration 4: Gel Optimization and Sequencing Confirmation

Iteration 1: Initial gRNA Design and Cloning Attempt

Iteration 2: Protocol Optimization and Colony Screening

Iteration 3: PAGE Electrophoresis for Resolution

Iteration 4: Digestion Analysis

Iteration 5: DNA Loading Optimization

Iteration 6: Enhanced Digestion Protocol

Iteration 7: Protocol Refinement and Transformation Error

Iteration 8: Fresh Enzyme and Successful Digestion

Iteration 9: Golden Gate Assembly Approach

Iteration 10:

Iteration 11:

Iteration 12:

Iteration 13:

Iteration 1: Reporter Plasmid Construction and Validation

Iteration 1: Initial Fluorescence Testing

Iteration 2: Comparative Analysis with Dam(-) Strains

Cycle 1: Direct Gene Expression Prediction from Sequence Data

Design

We attempted to predict gene expression directly from sequence information using the DeepPGD architecture [Reference 3 for original paper].

Build

DeepPGD is a deep learning framework that predicts DNA methylation probability at specific genomic sites. It combines Temporal Convolutional Networks (TCNs), bidirectional long short-term memory (BiLSTM) networks, and attention mechanisms to extract DNA structural and sequence features. We used the same datasets as the original authors.

We initially focused on two key areas for potential novelty:

Interpretability Enhancement

Current deep learning models in genomics, including DeepPGD, suffer from the "black box" problem
We aimed to implement explainable AI (XAI) techniques to understand which sequence features the model considers most important for methylation prediction
Our approach would have involved gradient-based attribution methods and attention visualisation to identify key regulatory motifs

Cross-Species Generalizability

DeepPGD was primarily trained on species-specific datasets
We sought to develop a universal model capable of predicting methylation across multiple species
This would involve training on multi-species datasets to capture conserved methylation patterns and sequence motifs that transcend species boundaries

Test

Interpretability Limitations
Our research pointed to fundamental mathematical limitations in applying interpretability methods to deep neural networks for genomics:

Learned weights are hard to interpret because the constraints on the optimisation are often unclear, and unconstrained optimisation can leave the weights imbalanced: a few may grow so large that the effect of the others becomes negligible.
Gradient-based methods suffer from vanishing gradients in deep networks, and a ReLU unit passes zero gradient whenever its input is negative.
Attention mechanisms can become unstable when the input features are highly correlated, which is common in genomic data, because the model has no principled way to choose which of the correlated features to focus on.
Beyond these specific issues, deep learning models remain fundamentally black boxes.
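The ReLU point can be made concrete with a one-unit toy network. This is a minimal sketch, not DeepPGD itself: `tiny_model`, its weights, and the input values are all assumptions chosen for illustration. When the pre-activation is negative, the gradient with respect to the input is exactly zero, so a gradient-times-input attribution reports no importance even though the same input clearly changes the output elsewhere in its range.

```python
def relu(x):
    return max(0.0, x)

def tiny_model(x, w=1.5, b=-2.0, v=3.0):
    """One hidden ReLU unit: output = v * relu(w * x + b)."""
    return v * relu(w * x + b)

def input_gradient(x, w=1.5, b=-2.0, v=3.0):
    """Analytic d(output)/dx: v * w while the unit is active, else 0."""
    return v * w if (w * x + b) > 0 else 0.0

def gradient_times_input(x):
    """A simple gradient-based attribution score for the single input."""
    return input_gradient(x) * x

for x in (1.0, 2.0):
    print(f"x={x}: output={tiny_model(x)}, "
          f"gradient={input_gradient(x)}, attribution={gradient_times_input(x)}")
```

At x = 1.0 the unit is inactive and the attribution is zero, although moving the same input to 2.0 raises the output from 0 to 3.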

Generalizability Obstacles
Cross-species methylation prediction faces several biological and technical challenges:

Training data scarcity for many species, particularly for CHH methylation contexts in plants
Sequence context dependencies vary significantly between species, limiting model transferability
Species-specific methylation patterns reflect distinct evolutionary pressures and regulatory mechanisms

Learn

We were able to replicate the results published in the DeepPGD paper, which established a relationship between sequence data and local methylation. However, we could not find a meaningful way to predict gene expression from sequence data alone. This suggested that gene expression depends on other factors that sequence information does not adequately capture.

Cycle 2: Linear Methylation-Expression Relationship Analysis

Design

We attempted to find direct, simple, linear relationships between overall genome methylation and gene expression using the DeepMethyGene architecture.

Build

The architecture of DeepMethyGene is explained in the “Software” section of this wiki.

To explore and understand the DeepMethyGene architecture, we developed a data processing pipeline in a Google Colaboratory notebook that performed the following key steps:

NOTE: These explorations were carried out on methylation data from Breast Cancer cells, provided on the UCSC Xena integrated cancer genomics platform.

Data Loading and Initial Inspection: We began by loading raw gene expression and DNA methylation datasets, along with a mapping file to link gene identifiers across data types. The gene expression data used Ensembl gene IDs (e.g., ENSG00000242268.2), while the DNA methylation data used HGNC gene symbols (e.g., A2M). This discrepancy in gene identifiers was a critical challenge to address for proper data integration.
Data Preprocessing and Aggregation: We processed the large DNA methylation dataset incrementally to manage memory usage, calculating the average methylation "beta value" for each probe. We then used a mapping file to link these probes to their corresponding gene symbols and calculated the average methylation value for each gene. This processed data was then saved. Similarly, we calculated the average gene expression (FPKM) for each Ensembl gene ID.
Gene ID Harmonisation and Data Merging: To combine the two datasets, we used a mapping file to convert the Ensembl IDs in the expression data to their corresponding HGNC gene symbols. We then merged the processed gene expression and DNA methylation data into a single dataframe based on these common HGNC symbols. This resulted in a unified dataset containing both average expression and average methylation values for 1558 unique genes.
Exploratory Data Analysis: As a final step, we visualized the integrated data using scatter plots. We plotted average gene expression against average DNA methylation to explore their relationship. We also created a version of this plot with a log-transformed gene expression axis to better represent the data distribution.
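The steps above can be sketched in a few functions. This is a simplified illustration, not the notebook itself: it uses plain dictionaries instead of dataframes, all identifiers and values are toy placeholders, and stripping the version suffix from Ensembl IDs before lookup is an assumption about how the mapping file is keyed. Because `mean_by_key` accepts any iterable, rows can be streamed from a large file rather than loaded at once.

```python
from collections import defaultdict
from statistics import mean

def mean_by_key(rows):
    """rows: iterable of (identifier, per-sample values) -> {identifier: mean}."""
    return {key: mean(values) for key, values in rows}

def probes_to_genes(probe_means, probe_to_symbol):
    """Average probe-level beta values into one value per gene symbol."""
    per_gene = defaultdict(list)
    for probe, value in probe_means.items():
        symbol = probe_to_symbol.get(probe)
        if symbol:
            per_gene[symbol].append(value)
    return {symbol: mean(values) for symbol, values in per_gene.items()}

def harmonise(expr_means, ensembl_to_symbol):
    """Convert Ensembl IDs to gene symbols (version suffix dropped: assumption)."""
    out = {}
    for ensembl_id, value in expr_means.items():
        symbol = ensembl_to_symbol.get(ensembl_id.split(".")[0])
        if symbol:
            out[symbol] = value
    return out

def merge(expr_by_symbol, meth_by_symbol):
    """Keep genes present in both tables: {symbol: (expression, methylation)}."""
    common = sorted(expr_by_symbol.keys() & meth_by_symbol.keys())
    return {g: (expr_by_symbol[g], meth_by_symbol[g]) for g in common}

# Toy placeholder inputs (assumptions, for illustration only).
methylation_rows = (row for row in [("cg_toy_1", [0.25, 0.75]),
                                    ("cg_toy_2", [0.5, 1.0]),
                                    ("cg_toy_3", [0.0, 0.5])])
expression_rows = [("ENSG_TOY_A.2", [2.0, 4.0]), ("ENSG_TOY_C.1", [1.0, 1.0])]
probe_to_symbol = {"cg_toy_1": "GENE_A", "cg_toy_2": "GENE_A", "cg_toy_3": "GENE_B"}
ensembl_to_symbol = {"ENSG_TOY_A": "GENE_A", "ENSG_TOY_C": "GENE_C"}

meth_by_gene = probes_to_genes(mean_by_key(methylation_rows), probe_to_symbol)
expr_by_gene = harmonise(mean_by_key(expression_rows), ensembl_to_symbol)
merged = merge(expr_by_gene, meth_by_gene)
print(merged)
```

Only genes present in both tables survive the merge, which is why the unified dataset is smaller than either input.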

Test

Our analysis revealed that there was no direct correlation between methylation probability and gene expression at the promoter level.
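One simple way to quantify such a result is a Pearson correlation between average methylation and log-transformed expression. The sketch below uses made-up toy values, and the log1p transform is an assumption chosen to mirror the log-scaled plot; it is not the exact statistic reported here.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)

# Toy placeholder values (assumptions, for illustration only).
avg_methylation = [0.0, 0.25, 0.5, 0.75, 1.0]
avg_expression = [12.0, 0.5, 30.0, 0.5, 12.0]       # no linear trend
log_expression = [math.log1p(v) for v in avg_expression]

r_no_trend = pearson_r(avg_methylation, log_expression)
r_linear = pearson_r(avg_methylation, [2 * m + 1 for m in avg_methylation])
print(f"r (toy data) = {r_no_trend:.3f}, R^2 = {r_no_trend ** 2:.3f}")
print(f"r (perfectly linear control) = {r_linear:.3f}")
```

A value of r near zero indicates the absence of a linear relationship only; it does not rule out non-linear dependence.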

Figure: No direct relationship was observed between overall genome methylation and average gene expression values.

Figure: Genes generally have fewer than 200 methylation sites (CpG islands) within a 1 Mb window, but we could not infer a direct relationship between this count and the prediction accuracy (R^2 values).

Learn

Simple linear relationships between methylation and expression are insufficient
Local (promoter) methylation alone cannot explain gene expression variability
More complex regulatory mechanisms are at play

Cycle 3: GradCAM Analysis for Methylation Site Importance

Cycle 4: Hybrid ElasticNet-CNN Architecture Development

Design

Attempting to find better means of recognizing important methylation sites.

Build

We decided to train a separate ElasticNet linear regression model (inspired by geneEXPLORE) specifically to learn the weights assigned to each methylation site. The process is outlined in more detail in the “Software” section.
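To show why an ElasticNet yields sparse, signed per-site weights, here is a minimal coordinate-descent sketch in plain Python. It is not the geneEXPLORE or ECHO implementation: the function names, the toy data, and the penalty settings are assumptions, and the objective follows the common alpha / l1_ratio parameterisation, (1/2n)·||y − Xw||² + alpha·l1_ratio·||w||₁ + ½·alpha·(1 − l1_ratio)·||w||².

```python
def soft_threshold(value, threshold):
    """Shrink value toward zero; values within the threshold become exactly 0."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0

def fit_elastic_net(X, y, alpha=0.1, l1_ratio=0.5, n_iter=100):
    """Coordinate descent for ElasticNet without intercept (centred data assumed)."""
    n, p = len(X), len(X[0])
    w = [0.0] * p
    l1, l2 = alpha * l1_ratio, alpha * (1.0 - l1_ratio)
    for _ in range(n_iter):
        for j in range(p):
            rho, z = 0.0, 0.0
            for i in range(n):
                # Prediction from every feature except j (partial residual).
                others = sum(X[i][k] * w[k] for k in range(p) if k != j)
                rho += X[i][j] * (y[i] - others)
                z += X[i][j] ** 2
            w[j] = soft_threshold(rho / n, l1) / (z / n + l2)
    return w

# Toy placeholder data: three "methylation sites", only site 0 drives expression.
site0 = [-2.5, -1.5, -0.5, 0.5, 1.5, 2.5]
site1 = [1.0, -1.0, 0.0, 0.0, -1.0, 1.0]
site2 = [1.0, 0.0, -1.0, -1.0, 0.0, 1.0]
X = [list(row) for row in zip(site0, site1, site2)]
y = [2.0 * v for v in site0]

weights = fit_elastic_net(X, y)
print(weights)
```

The L1 term sets the weights of uninformative sites to exactly zero, while the sign of each remaining weight gives the direction of influence.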

Test

When tested on different genes, results of the following form were obtained:

Figure: geneEXPLORE importance maps when trained on a 10 Mb window for gene AADAT

The above weights are suitable for further search algorithms because:

Large number of weights have been assigned ~0 importance, so can be ignored
Individual feature importances clearly retained
Clear peaks of importance visible
Direction of influence available.

To verify whether the ElasticNet and the AdaptiveRegressiveCNN identify roughly the same sites as important, we overlaid the two sets of learned weights after basic scaling and normalization and obtained the plot described below.

Here, the green points are the weights assigned by the ElasticNet (y1), the blue are those assigned by DeepMethyGene (y2), and the yellow is the error between the two (y1 - y2). The x axis represents numbering of CpG sites within the 10Mb window for AADAT.

We observed the following:

Both models appeared to agree on which sites were unimportant, as shown by the solid yellow bar at error 0, which largely corresponds to sites with very low importance in either model.
The remaining sites show substantially different assigned importances, but this is expected because ElasticNet focuses on individual features, while the AdaptiveRegressiveCNN learns relationships between features. Roughly 40% of sites had an error greater than 20%, but we considered this good enough to proceed.
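A comparison of this kind can be sketched as follows. The exact scaling and error definition used for the plot are not stated, so max-absolute scaling and an absolute difference above 0.2 are assumptions, as are the function names and the toy weights.

```python
def max_abs_scale(weights):
    """Scale weights into [-1, 1] by dividing by the largest absolute value."""
    peak = max(abs(w) for w in weights)
    return [w / peak for w in weights] if peak else list(weights)

def compare_weights(y1, y2, tolerance=0.2):
    """Return per-site errors (y1 - y2 after scaling) and the fraction of
    sites whose absolute error exceeds the tolerance."""
    scaled1, scaled2 = max_abs_scale(y1), max_abs_scale(y2)
    errors = [a - b for a, b in zip(scaled1, scaled2)]
    fraction_off = sum(abs(e) > tolerance for e in errors) / len(errors)
    return errors, fraction_off

# Toy placeholder weights per CpG site (assumptions, for illustration only).
elastic_net_weights = [0.0, 0.0, 4.0, 0.0, -2.0]
cnn_weights = [0.0, 0.1, 1.0, 0.0, 0.5]

errors, fraction_off = compare_weights(elastic_net_weights, cnn_weights)
print(errors, fraction_off)
```

Sites that both models score near zero contribute errors near zero, which corresponds to the flat band at error 0 described above.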

Learn

ElasticNet could identify important features better than AdaptiveRegressiveCNN, but the AdaptiveRegressiveCNN had better predictive power than the ElasticNet. Thus, we decided to adopt an ElasticNet to learn feature importances, but an AdaptiveRegressiveCNN for final predictions. This led to the software “ECHO” as explained in the Software section of this wiki.

Cycle 1: Testing Models for Flux Analysis

Design

We planned to replicate and adapt existing approaches, such as metabolic regulatory networks and regulatory dynamic enzyme-cost FBA (r-deFBA), to include the effect of epigenetic modifications.

Build

Metabolic regulatory networks have the following properties:

Continuous variables: metabolite, enzyme, and regulatory protein levels.
Discrete variables: gene expression states (ON or OFF).
Guards and jumps: logical rules that trigger transitions (e.g., if RP > threshold, turn T2 OFF).
Flows: the system of ODEs that governs metabolism in each state.
Each discrete regulatory configuration defines its own set of differential equations, and transitions between states occur when molecular thresholds are crossed.
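The properties listed above can be illustrated with a tiny hybrid simulation. This is a toy sketch, not one of the published models: the rate constants, the threshold, and the single gene T2 with one regulatory protein (RP) are assumptions. Each step first evaluates the guard (a possible jump between discrete states) and then integrates the ODEs that apply in the current state using explicit Euler steps.

```python
def simulate_hybrid(t_end=10.0, dt=0.01, threshold=1.0):
    """Toy hybrid system: RP accumulates; when RP > threshold, gene T2 turns OFF."""
    rp = 0.0          # continuous: regulatory protein level
    enzyme = 0.0      # continuous: enzyme produced from gene T2
    t2_on = True      # discrete: expression state of T2
    history = []
    for step in range(int(round(t_end / dt))):
        # Guard and jump: a threshold crossing changes the discrete state.
        if t2_on and rp > threshold:
            t2_on = False
        # Flow: the ODEs that hold in the current discrete state.
        d_rp = 0.5 - 0.1 * rp
        d_enzyme = (1.0 if t2_on else 0.0) - 0.2 * enzyme
        rp += dt * d_rp
        enzyme += dt * d_enzyme
        history.append((step * dt, rp, enzyme, t2_on))
    return history

history = simulate_hybrid()
switch_time = next(t for t, _, _, on in history if not on)
peak_enzyme = max(e for _, _, e, _ in history)
print(f"T2 switched OFF at t = {switch_time:.2f}; "
      f"enzyme peaked at {peak_enzyme:.2f} and ended at {history[-1][2]:.2f}")
```

Every guard needs an explicit threshold, and every flow needs rate constants, which gives a sense of how many parameters such a model requires.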

The challenges here were twofold. First, epigenetic modifications are measured as continuous rather than binary values (primarily because they are measured across many cells), so representing them in a system built around discrete on/off logic is difficult. Second, the thresholds and kinetic rates for epigenetic switching are not well established, making it hard to define guard conditions.

r-deFBA models metabolism as a constraint-based system with a quasi-steady-state assumption for metabolites, while regulation is represented through Boolean rules. The model is formulated as a dynamic optimization problem that maximizes biomass over time, subject to metabolic and regulatory constraints. This predicts time-dependent fluxes, enzyme and metabolite levels, and regulatory states.

This approach came with similar challenges. Epigenetic effects are neither binary nor linear; they involve graded, context-dependent changes, whereas r-deFBA depends on Boolean or linear constraints. Epigenetic control is also cell-type-specific and context-dependent, making it hard to test predictions experimentally at the genome scale.

Test

We ran a few iterations of both models. Although they produced results, we realised that both required far more parameters (kinetic parameters and the forms of the ODEs) than we could find in the literature, especially since many of the molecular mechanisms and enzyme kinetics involved in epigenetic modification have not yet been elucidated.

Learn

We needed to look for algorithms that either require fewer variables and parameters to produce meaningful results or are tailored to modelling epigenetic modifications.

Cycle 2: Building Epigenetic Gene Regulatory Network

Cycle 3: Using Transcriptome Data from Dam-Negative E. coli

Design

In a separate exploratory tangent, we sought to establish a framework where differential gene expression between Dam-positive (wild-type) and Dam/Dcm-deficient E. coli strains could be leveraged to infer metabolic consequences of methylation changes. The rationale was that Dam methylation exerts wide-reaching regulatory control in E. coli, and the transcriptomic differences between wild-type and mutant backgrounds might provide a proxy for how methylation loss perturbs metabolic network behavior. By systematically integrating these expression datasets into the metabolic model, we aimed to approximate the effect of targeted demethylation events.

Build

To simulate the removal of methylation, we operationalized the process as a substitution of expression levels: each gene's wild-type expression was computationally "switched" to its measured value in the Dam/Dcm double mutant. This approach effectively treated the mutant expression pattern as a demethylated baseline, with the objective of minimizing the discrepancy between the original wild-type profile and the altered state imposed by loss of methylation. The gene expression integration was performed using our standard GIMME pipeline, with constraints applied through the GPR rules of the metabolic network.
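The substitution step and its effect on GIMME-style reaction penalties can be sketched without a solver. Everything below is a toy illustration: the gene names, expression values, GPR rules, and threshold are assumptions, and the linear programme that GIMME then solves (minimising penalty-weighted flux while maintaining a required objective level) is omitted. GPR rules are evaluated with the common convention that AND takes the minimum and OR takes the maximum expression.

```python
def reaction_expression(gpr, expression):
    """gpr is an OR of AND-complexes, e.g. [["g1", "g2"], ["g3"]]."""
    return max(min(expression[g] for g in complex_genes) for complex_genes in gpr)

def gimme_penalties(gprs, expression, threshold):
    """Penalty per reaction: how far its expression falls below the threshold."""
    return {rxn: max(0.0, threshold - reaction_expression(gpr, expression))
            for rxn, gpr in gprs.items()}

def substitute_expression(wild_type, mutant, genes):
    """Switch the selected genes from wild-type to mutant expression values."""
    switched = dict(wild_type)
    for gene in genes:
        if gene in mutant:
            switched[gene] = mutant[gene]
    return switched

# Toy placeholder data (assumptions, for illustration only).
wild_type = {"geneA": 10.0, "geneB": 8.0, "geneC": 2.0}
dam_mutant = {"geneA": 3.0, "geneB": 8.0, "geneC": 6.0}
gprs = {"R1": [["geneA", "geneB"]],        # geneA AND geneB
        "R2": [["geneC"], ["geneA"]]}      # geneC OR geneA

demethylated = substitute_expression(wild_type, dam_mutant, wild_type.keys())
penalties_wt = gimme_penalties(gprs, wild_type, threshold=5.0)
penalties_demethylated = gimme_penalties(gprs, demethylated, threshold=5.0)
print(penalties_wt, penalties_demethylated)
```

Reactions whose penalty rises after the substitution are the ones the optimisation will try to route flux away from in the demethylated condition.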

Test

We subsequently computed flux distributions under the two conditions and compared them to identify shifts in pathway utilization. While flux solutions were obtained, the results proved suboptimal: the distributions were noisy, biologically implausible in certain pathways, and showed weak correspondence to known metabolic effects of Dam inactivation. In essence, the approach risked conflating methylation-dependent transcriptional effects with secondary stress responses or compensatory changes unrelated to direct epigenetic control. This undermined the specificity of the analysis and suggested that the method was not capturing the causal impact of methylation with sufficient resolution.

Learn

The limitations of this cycle highlighted several key issues: (1) E. coli may not be the most informative system for connecting methylation dynamics to metabolism, given the relatively limited methylation landscape compared to eukaryotes; (2) the direct substitution of expression values from a double mutant strain introduces confounding global effects; and (3) the absence of a more nuanced model of how methylation influences transcriptional regulation made the inferences unreliable. As a result, we decided to pivot away from E. coli as a model organism for this line of inquiry and instead explore systems where methylation has a more central and tractable role in metabolic regulation.

Cycle 4: Epigenetically Regulated dFBA Model and Simulation of the Warburg Effect

Design

The objective was to extend the dynamic enzyme-cost flux balance analysis (deFBA) framework by introducing a quantitative layer that links DNA methylation levels to transcriptional activity, thereby enabling graded regulation instead of binary gene on/off logic. To evaluate the biological implications of this framework, we selected the Warburg effect as a representative phenomenon of metabolic reprogramming in cancer, using the ecHumanGEM reconstruction to model how methylation of glycolytic genes such as PFKP and PKM alters flux distributions under different methylation states.

Build

The base mathematical structure of deFBA was retained but extended with a continuous gene activity weight (α) bounded between 0 and 1, derived from promoter-specific methylation β-values. The functional relationship between methylation and transcriptional activity was captured using an inverted logistic function, allowing smooth transitions from full repression to complete activation. This formulation was embedded into the enzyme synthesis constraints of the deFBA framework.
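One possible form of this mapping is sketched below. The steepness and midpoint values are assumptions (they are not given here), and `scaled_synthesis_bound` is a hypothetical helper showing how α might scale an upper bound on enzyme synthesis; the actual constraint formulation may differ.

```python
import math

def gene_activity(beta, steepness=10.0, midpoint=0.5):
    """Inverted logistic: alpha near 1 at low methylation, near 0 at high."""
    return 1.0 / (1.0 + math.exp(steepness * (beta - midpoint)))

def scaled_synthesis_bound(max_rate, beta, steepness=10.0, midpoint=0.5):
    """Upper bound on enzyme synthesis, scaled by the gene activity weight."""
    return gene_activity(beta, steepness, midpoint) * max_rate

for beta in (0.1, 0.5, 0.9):
    print(f"beta={beta}: alpha={gene_activity(beta):.3f}, "
          f"bound={scaled_synthesis_bound(2.0, beta):.3f}")
```

With a finite steepness, α approaches but never exactly reaches 0 or 1, and larger steepness values make the transition closer to a binary switch.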

The extended model was parameterized within the ecHumanGEM network, incorporating human enzyme capacities and turnover constraints. Two dynamic simulations were performed: (1) a baseline (healthy) state reflecting normal methylation levels, and (2) a hypomethylated (cancer-like) state corresponding to decreased methylation on glycolytic promoters. Comparing these states allowed us to approximate how epigenetic deregulation drives enhanced glycolytic flux characteristic of tumor metabolism.

Test

Dynamic simulations were executed using the GUROBI solver on a time-discretized formulation of the hybrid optimization problem. We tracked flux distributions, enzyme allocations, and metabolite concentrations across time for both methylation states. The hypomethylated configuration reproduced the expected metabolic phenotype—an observable increase in glycolytic throughput—consistent with the Warburg hypothesis. Nevertheless, model stability was highly sensitive to parameter scaling within the logistic activation function, and the solver occasionally exhibited slow convergence under dense regulatory coupling.

Figure: Epigenetically regulated dFBA simulation results showing metabolic reprogramming under different methylation states

Learn

This cycle demonstrated the potential of integrating epigenetic data into metabolic simulations, earning positive feedback for its novelty. However, challenges led us to shift focus:

Computational Complexity: Simulations with continuous regulatory weights increased solver size and runtime, highlighting the need for model reduction.
Biological Assumptions: Simplifications of enzyme turnover and resource capacity may have hidden true post-transcriptional effects of methylation.

Building on these insights, we pivoted to using large-scale cancer methylation data from TCGA. This led to the development of a predictive model using adaptive regression CNNs to infer gene expression from methylation patterns, which is integrated into context-specific metabolic frameworks like GIMME and iMAT.

Cycle 5: Using Spatial Pyramidal Pooling to Make A Unified CNN

Cycle 6: Towards Developing Gene-Wise Thresholds

Engineering Success

Experimental

Design, Build, Test, and Learn

CRISPR-dCas9-Dam

Guide RNA Cloning

Reporter Plasmid (dnaAP2-GFP)

Fluorescence Assay

Software

ECHO

ENIGMA
