Skip to main content

Engineering Success

Experimental

Design, Build, Test, and Learn

We present a streamlined methodology for developing a dCas9-Dam fusion system to achieve targeted DNA methylation and control gene expression. The approach involves constructing a plasmid for expressing the dCas9-Dam fusion protein with its guide RNA, and a second reporter plasmid where GFP expression is controlled by a methylation-sensitive promoter (dnaAP2). Following co-transfection of E. coli with both plasmids, assays are conducted to confirm targeted methylation by the dCas9-Dam system and its direct effect on enhancing GFP expression.

Engineering Design Cycle
🔍
Figure 1: The engineering design cycle guiding our project development

CRISPR-dCas9-Dam

Through multiple design and testing iterations, which included troubleshooting PCR contamination and incomplete restriction digestion, we successfully constructed and confirmed two successful recombinant dCas9-Dam plasmids.

Iteration 1: Initial dCas9-Dam Fusion Design

Design

The design focused on creating a dCas9-Dam fusion protein for programmable DNA methylation. The architecture utilized catalytically inactive Cas9 (dCas9) as a targeting module, fused to the DNA Adenine Methyltransferase (Dam) effector domain. A 15-amino acid Gly-Ser linker was incorporated between the domains to provide flexibility and spacing, ensuring both domains functioned without steric hindrance. The linker and Dam coding sequence were inserted immediately before the dCas9 stop codon to preserve the dCas9 open reading frame and ensure full-length protein expression.

Build

The pdCas9 plasmid was linearized, and the Dam coding sequence was amplified from the E. coli genome via high-fidelity PCR. Primers were designed with specific overlaps for In-Fusion cloning; the primer for linearizing pdCas9 also encoded the 54-nucleotide Gly-Ser linker sequence. Assembly was performed using the In-Fusion Cloning Kit, which utilizes a proprietary enzyme to generate overhangs, and then native cellular ligase joins overlapping sequences after transformation. The reaction mixture was transformed into competent E. coli cells, yielding the final plasmid construct.

Test

Assembly was tested by colony PCR using lysed colonies. Agarose gel electrophoresis showed intense bands in all sample wells, corresponding to the size of the Dam gene.

Learn

The bands were likely due to PCR amplification of the endogenous Dam gene from genomic DNA present in the cell lysate, rather than from the recombinant plasmid. This indicated that plasmid purification should be performed prior to PCR to avoid genomic DNA contamination.

Iteration 2: Plasmid Isolation and PCR Optimization

Design

Isolated plasmids were used as templates for PCR to prevent amplification of genomic Dam. Isolated pdCas9 plasmid and genomic DNA from XL1-Blue E. coli served as negative and positive controls, respectively.

Build

PCR was performed on four isolated recombinant plasmids and two controls. Products were analyzed by gel electrophoresis.

Test

Bands corresponding to the Dam gene were observed in all wells, including the negative control, except for one non-control well.

Learn

The absence of a band in one test well indicated the corresponding colony lacked the recombinant plasmid. The band in the negative control indicated potential genomic DNA contamination during plasmid purification, rendering the results inconclusive.

Iteration 3: Restriction Digestion Analysis

Design

Screening was performed using BamHI restriction digestion, which yields distinct banding patterns for pdCas9 and dCas9-Dam plasmids. Four putative recombinant plasmids and two pdCas9 controls were digested. Simultaneously, four plasmid isolates were sent for Sanger sequencing.

Build

The four putative recombinant plasmids and two controls were digested with BamHI and analyzed by gel electrophoresis. Plasmid samples were prepared for Sanger sequencing.

Test

Gel electrophoresis showed no bands; the DNA ladder bands were also faint.

Learn

The absence of bands and faint ladder suggested issues with the loading dye or gel preparation.

Iteration 4: Gel Optimization and Sequencing Confirmation

Design

Gel electrophoresis was repeated using remaining plasmid samples on a gel with smaller wells.

Build

Gel electrophoresis was run for the four test plasmids and two pdCas9 controls.

Test

Intense bands corresponding to undigested plasmid were observed, with faint, thin bands beneath all test wells. A minute size difference was noted between two test wells and the other two. Sanger sequencing results were obtained.

Learn

The intense bands indicated incomplete digestion, possibly due to reduced enzyme activity or protocol error. The faint bands with slightly larger size suggested the presence of recombinant plasmid. Sanger sequencing confirmed the recombinant plasmid in two of the four samples.

Guide RNA Cloning

Through numerous iterations of gRNA cloning, it was successfully cloned into the dcas9 plasmid. However, sequencing results for the final round of gRNA cloning into the dCas9-Dam plasmid have still not arrived.

Iteration 1: Initial gRNA Design and Cloning Attempt

Design

A 30 nt guide sequence targeting the dnaAP2 promoter was designed using CHOPCHOP software. The guide was chosen to bind upstream of the promoter to avoid CRISPR interference and position the Dam methyltransferase optimally relative to GATC sites. Complementary single-stranded DNA oligos were synthesized. We would insert our gRNA through restriction enzyme digestion followed by ligation. However, the BsaI enzyme we had available to us was from the year 2010 and therefore had very low activity. While waiting for new enzyme procurement, we decided to alter our protocol to optimize for yield of digested plasmid with the current enzyme stock.

Build

The oligos were annealed to form a double-stranded DNA spacer. The pdCas9 plasmid was digested with BsaI.

Test

Nanodrop spectrometry indicated a very low concentration of the linearized pdCas9 plasmid, with purity values suggesting contamination.

  • Concentration = 3.5 ng/µL
  • A260/280 = 2.0

Learn

The low concentration and purity indicated inefficient digestion, attributed to the expired BsaI enzyme.

Iteration 2: Protocol Optimization and Colony Screening

Design

The digestion protocol was scaled up to obtain a higher concentration of linearized plasmid. The plasmid purification protocol was modified: Elution Buffer was heated to 70°C, and elution was performed in two steps.

Build

  1. pdCas9 was digested and purified using the modified protocols. The same expired BsaI was used at a higher concentration.

  2. The linearized pdCas9 was ligated with the annealed gRNA oligos. The ligation mixture was transformed into XL1-Blue competent cells.

  3. Cloning was confirmed using colony PCR using the primers from the infusion cloning

Test

  1. The purified linear pdCas9 showed satisfactory concentration and purity.

    • Concentration = 48.2 ng/µL
    • A260/A280 = 1.85
  2. Transformation yielded a high number of colonies. Colony PCR using repurposed primers showed thick bands in nearly all samples except negative controls, creating ambiguity.

Learn

  1. The repurposed primers were suboptimal: one had low GC content (low melting temperature), the other showed self-annealing, and their melting points were far apart, leading to non-specific binding.

  2. The small amplicon size and large primers complicated distinction between amplicon and primer-dimers. Excess DNA loading and low gel resolution contributed to unclear results.

Iteration 3: PAGE Electrophoresis for Resolution

Design

PAGE electrophoresis was chosen for its superior resolution for small DNA fragments, allowing distinction between the PCR amplicon and primer-dimers. The DNA loading amount was reduced to obtain thinner bands for better resolution.

Build

PAGE electrophoresis was performed with three putative recombinant colony samples and one negative control (water).

Test

All samples, including the control, showed an identical band at the same location.

Learn

These bands were primer-dimers, confirming the absence of recombinant plasmid. This suggested the expired BsaI enzyme did not digest the plasmid.

Iteration 4: Digestion Analysis

Design

Gel electrophoresis was used to analyze the digestion mix, expecting separated bands for linearized and circular pdCas9.

Build

PCR cleanup was performed on the digestion mix. Gel electrophoresis was run with the digestion mix, ligation mix, and four pdCas9 samples.

Test

Broad bands were observed for pdCas9 wells, a faint band at ~10kb for the digestion mix, and no band for the ligation mix.

Learn

Broad bands were caused by excess DNA loading. The faint or absent bands in the digestion and ligation mixes indicated very low DNA concentration, preventing conclusions about digestion efficiency.

Iteration 5: DNA Loading Optimization

Design

The gel electrophoresis protocol was optimized by loading approximately 100ng of DNA per well.

Build

Gel electrophoresis was run with circular pdCas9 and the digestion mix.

Test

The circular pdCas9 well showed an intense broad band at ~10kb. The digestion mix well showed an intense broad band with a thin band below at ~9.3kb and a band near the well.

Learn

The thin band at ~9.3kb indicated the presence of linearized plasmid, constituting less than 10% of the DNA. The band near the well was unexplained, possibly a DNA-BsaI complex contamination.

Iteration 6: Enhanced Digestion Protocol

Design

The digestion protocol was modified: BsaI concentration and incubation time were increased. Ligation was performed at plasmid:insert ratios of 1:600 and 1:60, with a no-insert control. A hypothesis was that the control plate would show fewer colonies due to linear DNA.

Build

Digestion was performed with the optimized protocol. Ligation was carried out at the two ratios.

Test

Transformation of the ligation and digestion mixes yielded a high number of colonies on all plates.

Learn

The high colony count after transforming the digestion mix indicated very low digestion yield.

Iteration 7: Protocol Refinement and Transformation Error

Design

  • Digestion protocol was altered: DNA concentration was reduced tenfold, and incubation time was extended to two days to increase the proportion of digested DNA.

  • The ligation plasmid:insert ratio was set to 1:5.

Build

  • The concentration of linearized plasmid from the previous digestion was confirmed to be low (7ng/µL).

  • Ligation was performed at a 1:5 ratio using a diluted oligonucleotide insert (0.01µM).

  • The ligation mixture and a control (ligation mix without oligos) were transformed into competent E. coli.

  • Test

    Both the control and ligation mix plates showed zero colonies.

    Learn

    A transformation error occurred: antibiotics were added prematurely before the required one-hour pre-plating incubation for plasmid integration, preventing cell recovery. Subsequent work proceeded with a new BsaI enzyme.

    Iteration 8: Fresh Enzyme and Successful Digestion

    Design

    1. The new BsaI-HFv2 enzyme was obtained. A test digestion of pdCas9 was performed to confirm activity.
    2. gRNA cloning was attempted in both pdCas9 and dCas9-Dam plasmids.

    Build

    1. Four dCas9-Dam plasmids and one pdCas9 plasmid were digested with BsaI-HFv2 and analyzed by gel electrophoresis. After confirming activity, large-scale digestion was performed.
    2. Linearized plasmids were purified and ligated with the gRNA oligos, including a no-plasmid control. The mixtures were transformed and plated.

    Test

    1. Gel electrophoresis showed bands corresponding to the expected fragments from BsaI digestion.
    2. The test and control plates showed an equal, but low, number of colonies.

    Learn

    1. The fragment pattern indicated complete digestion, confirming enzyme activity.
    2. The equal colony count between test and control plates suggested no successful cloning so gRNA cloning was repeated.
    Iteration 9: Golden Gate Assembly Approach

    Design

    gRNA cloning was attempted via Golden Gate assembly, utilizing the type IIS properties of BsaI to combine digestion and ligation.

    Build

    Golden Gate assembly was performed by mixing the dCas9-Dam vector, BsaI, T4 ligase, and the oligonucleotide insert. The reaction mixture was transformed into XL1-Blue cells and plated on selective media. Putative colonies were screened by colony PCR, and positive samples were sent for Sanger sequencing.

    Test

    Test and control plates had a similar number of colonies. Colony PCR indicated clear positive results for gRNA cloning into pdCas9, but ambiguous results for dCas9-Dam. Colony PCR was thus repeated to screen for more colonies.

    Learn

    No positive colonies were obtained for gRNA insertion into dCas9-Dam.

    Iteration 10:

    Design

    We decided to send previously tested positive colonies along with the ambiguous colonies for cloning of gRNA for Sanger sequencing.

    Build

    We prepared appropriate primers required for outsourced sequencing of the plasmid backbone-DAM junction, the dCas9-DAM junction and another primer for high resolution sequencing of Dam outside the junction. We packaged and sent the plasmids for sequencing.

    Test

    Sequencing showed that positive colony-PCR results for recombination in dCas9 were in fact recombinant plasmids with gRNA. However, the ambiguous dCas9-Dam results of colony-PCR did not possess the guide RNA at all.

    Learn

    The cause of failed cloning of gRNA into the dCas9-Dam plasmid is not known, but one hypothesis could be self annealing of the dCas9-DAM plasmid, which has a low likelihood of happening.

    Iteration 11:

    Design

    We decided to redo gRNA cloning into dCas9-Dam with digestion succeeded by ligation(as done before), with a slightly modified protocol: We split the volume in two and performed pcr clean up for one half and did not perform pcr clean up for the other half before transformation.

    Build

    We performed digestion and ligation according to the modified protocol, transformed the plasmids into XL1-Blue competent E. coli and then performed colony PCR to test for recombination.

    Test

    SWe obtained no bands in the gel(even in the PCR control). Suspecting a problem with the polymerase, we replaced the Taq polymerase and repeated PCR with a larger number of samples but obtained no bands(there was no problem with the PCR this time since we obtained a band in the positive control).

    Learn

    The results of agarose gel electrophoresis indicated an issue with the PCR performed. We could not verify recombination for any colony.

    Iteration 12:

    Design

    After consulting the PhD scholars we decided to modify the protocol which would increase the chances of successful insertion by decreasing the amount of plasmid taken and doing the ligation and digestion steps separately. (Iteration 12 and iteration 13 was done according to the final gRNA cloning protocol mentioned in the list of protocols)

    Build

    We performed restriction enzyme digestion of the plasmid and according to the modified protocol and did PCR cleanup.

    Test

    The concentration of the plasmid was very low(around 5 ng/ul) which was not satisfactory to proceed with ligation.

    Learn

    Since our plasmid is large(around 10,000 bp) repeated elution during PCR cleanup may improve the plasmid concentration. Further increasing the initial amount of plasmid taken before digestion( 1000 ng instead of 600 ng) can improve plasmid concentrations.

    Iteration 13:

    Design

    We modified the amount of the plasmid taken for digestion (increased the amount from 600 ng to 1000 ng). Additionally, the PCR cleanup procedure was optimized by performing repeated elutions using a preheated elution buffer.

    Build

    We performed digestion of the plasmid, did the modified PCR cleanup and then the ligation. We then proceed to transform the bacterial cells using the heat shock method and checked the successful cloning using PCR.

    Test

    We obtained a band which corresponds to a molecular weight of 200 bp that may indicate successful gRNA cloning so it was sent for sequencing for confirmation.

    Learn

    For larger size plasmids, repeated elutions may be necessary to obtain a good concentration of plasmid after PCR cleanup.

    Reporter Plasmid (dnaAP2-GFP)

    The dnaAP2-GFP reporter plasmid was successfully used to confirm that GFP expression was present in both Dam(+) and Dam(-) cells, although at lower levels in Dam(-) cells, demonstrating the functionality of the dnaAP2 promoter.

    Iteration 1: Reporter Plasmid Construction and Validation

    Design

    The dnaAP2 promoter , a part of the dnaA - ATP operon native to E. coli for our reporter plasmid was chosen as it was known to be methylation sensitive. The sequence of the dnaA operon and gfp-Mut2 was obtained from Professor Bianca Sclavi. Modifications to the dnaA operon were made such as removing the dnaAP1 promoter and mutating regions where transcription repressors bind so that we could examine the effects of methylation alone

    Build

    The custom dnaAP2-GFP plasmid in a pET-blank backbone was synthesized. The plasmid was transformed into XL1-Blue E. coli and isolated.

    Test

    Plasmid presence was confirmed by gel electrophoresis.

    Learn

    Agarose gel electrophoresis showed bands at expected positions, confirming the dnaAP2-GFP plasmid.

    Fluorescence Assay

    The fluorescence assay successfully confirmed that the dnaAP2-GFP reporter is functional and exhibits methylation-sensitive repression, with GFP expression being approximately five-fold higher in Dam(+) cells compared to Dam(-) cells.

    Iteration 1: Initial Fluorescence Testing

    Design

    GFP expression from the dnaAP2-GFP reporter plasmid was tested in Dam(+) XL1-Blue cells using fluorescence spectroscopy.

    Build

    XL1-Blue cells transformed with dnaAP2-GFP and untransformed cells were pelleted, suspended at appropriate dilutions, and analyzed using a microplate reader.

    Test

    OD600 and fluorescence values were measured for all samples.

    Learn

    The transformed cell pellet was visibly greener, indicating GFP expression. More diluted samples showed higher fluorescence/OD600 ratios, indicative of fluorescence quenching, preventing direct comparison of normalized values.

    Iteration 2: Comparative Analysis with Dam(-) Strains

    Design

    The dnaAP2-GFP plasmid was transformed into Dam(-) E. coli. Fluorescence was measured for both Dam(-) and Dam(+) cells within a narrow OD600 range to mitigate quenching effects.

    Build

    Cell pellets for Dam(-) and Dam(+) cells, transformed and untransformed, were prepared. Suspensions were loaded into a microplate, and OD600 and fluorescence were measured. Fluorescence intensities were normalized to OD600.

    Test

    The Dam(+) cell pellet was visibly greener than the Dam(-) pellet. Fluorescence intensity in Dam(+) cells was approximately five-fold higher than in Dam(-) cells.

    Learn

    The visual differences in the cell pellet and the quantitative differences in fluorescence showed reduced GFP expression in Dam(-) cells. Our promoter's capability to express GFP in Dam(-) as well as Dam(+) cells was confirmed.

    Software

    ECHO

    We present ECHO (Epigenetic Control of gene expression with Hybrid Optimization), a computational tool that predicts optimal methylation sites for targeted gene expression control using a hybrid approach combining ElasticNet and CNN architectures.

    Cycle 1: Direct Gene Expression Prediction from Sequence Data

    Design

    Attempting to predict gene expression directly from sequence information, using DeepPGD architecture [Reference 3 for original paper].

    Build

    DeepPGD is a deep learning framework that predicts DNA methylation probability at specific genomic sites using temporal convolution, BiLSTM, and attention mechanisms. The model represents a significant advancement in methylation prediction, combining Temporal Convolutional Networks (TCNs) and bidirectional long short-term memory (BiLSTM) networks to extract DNA structural and sequence features. Datasets used were the same as those used by the original authors in their paper.

    🔍

    We initially focused on two key areas for potential novelty:

    1. Interpretability Enhancement
      • Current deep learning models in genomics, including DeepPGD, suffer from the "black box" problem
      • We aimed to implement explainable AI (XAI) techniques to understand which sequence features the model considers most important for methylation prediction
      • Our approach would have involved gradient-based attribution methods and attention visualisation to identify key regulatory motifs
    2. Cross-Species Generalizability
      • DeepPGD was primarily trained on species-specific datasets
      • We sought to develop a universal model capable of predicting methylation across multiple species
      • This would involve training on multi-species datasets to capture conserved methylation patterns and sequence motifs that transcend species boundaries

    Test

    Interpretability Limitations
    Our research revealed fundamental mathematical limitations in applying interpretability methods to deep neural networks for genomics:

    • Interpreting the optimisation of weights poses a problem because the constraints involved are often unclear, and unconstrained optimisation can lead to an imbalance in weighting. (A few may blow up so much that the effect of other weights is negligible)
    • Gradient-based methods suffer from vanishing gradients in deep networks, especially with ReLU activations for negative inputs
    • Attention models can become unstable when the input features are highly correlated, a situation that often occurs in genomic data. They simply do not know which to focus on.
    • The fundamental black box nature of AI models.

    Generalizability Obstacles
    Cross-species methylation prediction faces several biological and technical challenges:

    • Training data scarcity for many species, particularly for CHH methylation contexts in plants
    • Sequence context dependencies vary significantly between species, limiting model transferability
    • Species-specific methylation patterns reflect distinct evolutionary pressures and regulatory mechanisms

    Learn

    While we were able to replicate the results published in the DeepPGD paper, this established a relationship between sequence data and local methylations. However, we were unable to find meaningful steps to predict gene expression while relying on sequence data alone. This implied that there are other factors involved in determining gene expression that were not adequately captured by sequence information.

    Cycle 2: Linear Methylation-Expression Relationship Analysis

    Design

    Attempting to find direct, simple and linear relationships between overall genome methylation and gene expression, using DeepMethyGene architecture

    Build

    The architecture of DeepMethyGene has been explained later in the “Software” section of this wiki.

    To explore and understand the DeepMethyGene architecture, we developed a data processing pipeline in a Google Collaboratory notebook that performed the following key steps:

    NOTE: These explorations were carried out on methylation data from Breast Cancer cells, provided on the UCSC Xena integrated cancer genomics platform.

    • Data Loading and Initial Inspection: We began by loading raw gene expression and DNA methylation datasets, along with a mapping file to link gene identifiers across data types. The gene expression data used Ensembl gene IDs (e.g., ENSG00000242268.2), while the DNA methylation data used HGNC gene symbols (e.g., A2M). This discrepancy in gene identifiers was a critical challenge to address for proper data integration.

    • Data Preprocessing and Aggregation: We processed the large DNA methylation dataset incrementally to manage memory usage, calculating the average methylation "beta value" for each probe. We then used a mapping file to link these probes to their corresponding gene symbols and calculated the average methylation value for each gene. This processed data was then saved. Similarly, we calculated the average gene expression (FPKM) for each Ensembl gene ID.

    • Gene ID Harmonisation and Data Merging: To combine the two datasets, we used a mapping file to convert the Ensembl IDs in the expression data to their corresponding HGNC gene symbols. We then merged the processed gene expression and DNA methylation data into a single dataframe based on these common HGNC symbols. This resulted in a unified dataset containing both average expression and average methylation values for 1558 unique genes.

    • Exploratory Data Analysis: WAs a final step, we visualized the integrated data using scatter plots. We plotted average gene expression against average DNA methylation to explore their relationship. We also created a version of this plot with a log-transformed gene expression axis to better represent the data distribution.

    Test

    Our analysis revealed that there was no direct correlation between methylation probability and gene expression at the promoter level.

    🔍
    No direct relationship observed between overall methylation of the genome and average gene expression values
    🔍
    Able to infer that genes generally have <200 methylation sites (CpG islands) within a 1Mb window, but unable to infer a direct relationship between this and the prediction accuracy (or R^2 values)

    Learn

    • Simple linear relationships between methylation and expression are insufficient
    • Local (promoter) methylation alone cannot explain gene expression variability
    • More complex regulatory mechanisms are at play
    Cycle 3: GradCAM Analysis for Methylation Site Importance

    Design

    Attempting to unravel the complex methylatory integrations to identify most important methylation sites via gradCAM analysis.

    Build

    The architecture of DeepMethyGene has been explained later in the “Software” section of this wiki. An AdaptiveRegressiveCNN based on the DeepMethyGene architecture was built and trained on methylation data for the gene within a 10Mb window.

    This was followed by a gradCAM analysis on the final convolution layer to identify the numerically important features in the methylation data.

    Test

    When tested on different genes, results of the following forms were obtained -

    🔍
    DeepMethyGene-gradCAM site importance map when trained on a 10Mb window for gene AADAT

    Learn

    AdaptiveRegressiveCNNs (such as DeepMethyGene), while offering better predictive accuracy, proved difficult to extract gradients for, due to the following reasons -

    1. Being a CNN-based architecture, individual feature importances were lost, as long-range feature relationships were learnt. The importance maps (obtained via gradCAM) were smoothed out, and distinct peaks were not visible.

    2. While techniques such as gradCAM allowed us to visualize “important” sites for prediction, the direction of influence for these sites (i.e. does methylation at this site upregulate or downregulate gene expression) was not reliably extractable.

    3. Since the CNN learnt relationships between features, it tended to not give any features very low importance. This would be detrimental for search algorithms (explained later) since every site would require explicit handling.

    Cycle 4: Hybrid ElasticNet-CNN Architecture Development

    Design

    Attempting to find better means of recognizing important methylation sites.

    Build

    Decided to train a separate ElasticNet linear regression architecture (inspired by geneEXPLORE), to specifically learn the weights assigned to each methylation site. The process has been outlined in more detail in the “Software” section.

    Test

    When tested on different genes, results of the following forms were obtained -

    🔍
    geneEXPLORE importance maps when trained on a 10Mb window for gene AADAT

    The above weights are suitable for further search algorithms since -

    1. Large number of weights have been assigned ~0 importance, so can be ignored
    2. Individual feature importances clearly retained
    3. Clear peaks of importance visible
    4. Direction of influence available.

    To verify if roughly the same sites are identified as important by both the ElasticNet and the AdaptiveRegressiveCNN, we overlayed the two learned weights (after basic scaling and normalization, and obtained the below plot) -

    🔍
    Here, the green points are the weights assigned by the ElasticNet (y1), the blue are those assigned by DeepMethyGene (y2), and the yellow is the error between the two (y1 - y2). The x axis represents numbering of CpG sites within the 10Mb window for AADAT.

    Observed the following -

    1. Both models at least seemed to agree on which sites were unimportant, as evidenced by the solid yellow bar at error 0, which largely corresponds to sites with very less importance in either model.

    2. Unimportant sites do have significantly different variations in assigned importances, but this is expected as ElasticNet focuses on individual features, while the AdaptiveRegressiveCNN learns relationships between features. Roughly 40% of sites had error > 20%, but we considered this good enough to proceed with.

    Learn

    ElasticNet could identify important features better than AdaptiveRegressiveCNN, but the AdaptiveRegressiveCNN had better predictive power than the ElasticNet. Thus, we decided to adopt an ElasticNet to learn feature importances, but an AdaptiveRegressiveCNN for final predictions. This led to the software “ECHO” as explained in the Software section of this wiki.

    ENIGMA

    ENIGMA (Epigenetic Network Integration for Metabolic Analysis) represents our computational approach to understanding how epigenetic modifications influence metabolic networks. Through iterative development cycles, we explored various modeling frameworks to bridge the gap between methylation patterns and metabolic flux distributions, ultimately developing gene-specific predictive models that can meaningfully impact genome-scale metabolic simulations.

    Cycle 1: Testing Models for Flux Analysis

    Design

    Planned to replicate and tweak existing algorithms like metabolic regulatory networks and regulatory dynamic enzyme-cost FBA to try and include the effect of epigenetic modifications.

    Build

    Metabolic regulatory networks have the following properties:

    • Continuous variables: metabolite, enzyme, and regulatory protein levels.
    • Discrete variables: gene expression states (ON or OFF).
    • Guards and jumps: logical rules that trigger transitions (e.g., if RP > threshold, turn T2 OFF).
    • Flows: the system of ODEs that governs metabolism in each state.
    • Each discrete regulatory configuration defines its own set of differential equations, and transitions between states occur when molecular thresholds are crossed.

    But the challenges involved here were: Epigenetic modifications are continuous rather than binary (primarily because these modifications are measured for multiple cells) measured as representing them in a system built around discrete on/off logic is difficult. The thresholds or kinetic rates for epigenetic switching are not well established, making it hard to define guard conditions.

    rdeFBA models metabolism as a constraint-based system with a quasi-steady-state assumption for metabolites, while regulation is represented through Boolean rules. The model is formulated as a dynamic optimization problem that maximizes biomass over time, subject to metabolic and regulatory constraints. This predicts time-dependent fluxes, enzyme and metabolite levels, and regulatory states.

    Again this came with a few challenges: This is similar to the above, epigenetic effects are not binary or linear; they involve graded and context-dependent changes, while r-deFBA depends on boolean or linear constraints. Epigenetic control is cell-type-specific and context-dependent, making it hard to test predictions experimentally at the genome scale.

    Test

    We ran a few iterations of the models and even though they were giving some results, we realised that both of these required the knowledge of a lot more parameters than we found in literature (kinetic parameters and forms of ODEs) especially given that a lot of molecular mechanisms and enzyme kinetics that are involved in epigenetic modifications have not been elucidated yet.

    Learn

    We need to look for algorithms that are either simpler in terms of how many variables and parameters that they require in order to produce meaningful results or those that are tailored to model epigenetic modifications.

    Cycle 2: Building Epigenetic Gene Regulatory Network

    Design

    As the above algorithms were not ideal for us to work with in this context, we consulted with Prof. Meiyappan who directed us to Prof. Sriram Chandrasekaran's work. We found one paper of his which had promising results and had the potential for integrating epigenetic regulation.

    Build

    The EGRIN framework uses two main algorithms, cMonkey and Inferelator, to build a gene regulatory network. cMonkey groups co-expressed genes and finds common DNA motifs in their promoters, while Inferelator uses regression to determine which transcription factors (TFs) regulate these gene groups.

    We hypothesized that epigenetic data could be integrated by adding a new layer to the model. Gene grouping in cMonkey could consider shared epigenetic states, and the regression model in Inferelator could make a TF's influence dependent on the chromatin accessibility of its target gene.

    Test

    There was a severe data crunch. The model would need new, matched epigenomic datasets (like ChIP-seq) for the same conditions as the existing gene expression data. We were unable to generate or find CHIP-seq data which was suitable for our project for cells that the wet lab used.

    Learn

    Both cMonkey and Inferelator would require significant redevelopment to incorporate epigenetic variables, increasing computational demand. We also realized that it is difficult to distill complex epigenetic marks into a simple, quantitative "accessibility" score that the model can use. This has been a problem in every existing model that we have looked into.

    Further, other related work in Prof. Sriram's lab aimed at introducing acetylation reactions into the metabolic model and seeing how the flux through those reactions vary with changing environments and growth media. This was not relevant to our project as we wanted to measure how the epigenetic pool affects metabolism and not the other way around.

    Cycle 3: Using Transcriptome Data from Dam Negative E.Coli

    Design

    In a separate exploratory tangent, we sought to establish a framework where differential gene expression between Dam-positive (wild-type) and Dam/Dcm-deficient E. coli strains could be leveraged to infer metabolic consequences of methylation changes. The rationale was that Dam methylation exerts wide-reaching regulatory control in E. coli, and the transcriptomic differences between wild-type and mutant backgrounds might provide a proxy for how methylation loss perturbs metabolic network behavior. By systematically integrating these expression datasets into the metabolic model, we aimed to approximate the effect of targeted demethylation events.

    Build

    To simulate the removal of methylation, we operationalized the process as a substitution of expression levels: each gene's wild-type expression was computationally "switched" to its measured value in the Dam/Dcm double mutant. This approach effectively treated the mutant expression pattern as a demethylated baseline, with the objective of minimizing the discrepancy between the original wild-type profile and the altered state imposed by loss of methylation. The gene expression integration was performed using our standard GIMME pipeline, with constraints applied through the GPR rules of the metabolic network.

    Test

    We subsequently computed flux distributions under the two conditions and compared them to identify shifts in pathway utilization. While flux solutions were obtained, the results proved suboptimal: the distributions were noisy, biologically implausible in certain pathways, and showed weak correspondence to known metabolic effects of Dam inactivation. In essence, the approach risked conflating methylation-dependent transcriptional effects with secondary stress responses or compensatory changes unrelated to direct epigenetic control. This undermined the specificity of the analysis and suggested that the method was not capturing the causal impact of methylation with sufficient resolution.

    Learn

    The limitations of this cycle highlighted several key issues: (1) E. coli may not be the most informative system for connecting methylation dynamics to metabolism, given the relatively limited methylation landscape compared to eukaryotes; (2) the direct substitution of expression values from a double mutant strain introduces confounding global effects; and (3) the absence of a more nuanced model of how methylation influences transcriptional regulation made the inferences unreliable. As a result, we decided to pivot away from E. coli as a model organism for this line of inquiry and instead explore systems where methylation has a more central and tractable role in metabolic regulation.

    Cycle 4: Epigenetically Regulated dFBA Model and Simulation of the Warburg Effect

    Design

    The objective was to extend the dynamic enzyme-cost flux balance analysis (deFBA) framework by introducing a quantitative layer that links DNA methylation levels to transcriptional activity, thereby enabling graded regulation instead of binary gene on/off logic. To evaluate the biological implications of this framework, we selected the Warburg effect as a representative phenomenon of metabolic reprogramming in cancer, using the ecHumanGEM reconstruction to model how methylation of glycolytic genes such as PFKP and PKM alters flux distributions under different methylation states.

    Build

    The base mathematical structure of deFBA was retained but extended with a continuous gene activity weight (α) bounded between 0 and 1, derived from promoter-specific methylation β-values. The functional relationship between methylation and transcriptional activity was captured using an inverted logistic function, allowing smooth transitions from full repression to complete activation. This formulation was embedded into the enzyme synthesis constraints of the deFBA framework.

    The extended model was parameterized within the ecHumanGEM network, incorporating human enzyme capacities and turnover constraints. Two dynamic simulations were performed: (1) a baseline (healthy) state reflecting normal methylation levels, and (2) a hypomethylated (cancer-like) state corresponding to decreased methylation on glycolytic promoters. Comparing these states allowed us to approximate how epigenetic deregulation drives enhanced glycolytic flux characteristic of tumor metabolism.

    Test

    Dynamic simulations were executed using the GUROBI solver on a time-discretized formulation of the hybrid optimization problem. We tracked flux distributions, enzyme allocations, and metabolite concentrations across time for both methylation states. The hypomethylated configuration reproduced the expected metabolic phenotype—an observable increase in glycolytic throughput—consistent with the Warburg hypothesis. Nevertheless, model stability was highly sensitive to parameter scaling within the logistic activation function, and the solver occasionally exhibited slow convergence under dense regulatory coupling.

    Cycle 4 Results
    🔍
    Figure: Epigenetically regulated dFBA simulation results showing metabolic reprogramming under different methylation states

    Learn

    This cycle demonstrated the potential of integrating epigenetic data into metabolic simulations, earning positive feedback for its novelty. However, challenges led us to shift focus:

    • Computational Complexity: Simulations with continuous regulatory weights increased solver size and runtime, highlighting the need for model reduction.
    • Biological Assumptions: Simplifications of enzyme turnover and resource capacity may have hidden true post-transcriptional effects of methylation.

    Building on these insights, we pivoted to using large-scale cancer methylation data from TCGA. This led to the development of a predictive model using adaptive regression CNNs to infer gene expression from methylation patterns, which is integrated into context-specific metabolic frameworks like GIMME and IMAT.

    Cycle 5: Using Spatial Pyramidal Pooling to Make A Unified CNN

    Design

    We moved from E. coli to human cells, where methylation plays a stronger role in transcriptional regulation and large-scale datasets are available. Inspired by DeepMethyGene, we aimed to build a single unified model across all genes. To handle varying CpG densities, we added Spatial Pyramid Pooling (SPP), allowing the network to capture both local and regional methylation effects.

    Build

    We implemented a CNN treating CpG sites as ordered features around transcription start sites. SPP layers aggregated methylation patterns at multiple scales, aiming to predict RNA-seq expression from processed methylation profiles.

    Test

    Although the model converged, validation performance was weak. Predictions poorly correlated with real expression, likely due to loss of gene-specific context, insufficient regulatory features beyond methylation, and sensitivity to window size/preprocessing.

    Learn

    We found that expression prediction is highly gene- and context-dependent. A single global model diluted regulatory signals, making gene-specific or multi-omic approaches more promising directions.

    Cycle 6: Towards Developing Gene-Wise Thresholds

    Design

    We pivoted to human cells and implemented gene-specific deep learning models to predict transcriptional output from methylation data. This followed the logic of DeepMethyGene but trained models individually for each gene, preserving gene-level regulatory context instead of collapsing into a single global framework.

    Build

    For each gene, methylation profiles around promoters and enhancers were processed as ordered inputs and passed through CNN architectures. The models achieved good predictive accuracy, with clear correlations between methylation features and expression values across samples.

    Test

    While the models predicted expression well, the observed changes in gene expression between baseline and perturbed states were often too small to meaningfully alter flux distributions in genome-scale metabolic models. As a result, the integration step produced almost identical flux solutions, highlighting a disconnect between predictive accuracy and metabolic impact.

    Learn

    We recognized that small fluctuations in expression, though statistically significant, may not carry functional weight at the network scale. This cycle taught us to introduce gene-specific thresholds which allowed us to keep the signal from the altered gene expression even in the metabolic model.