SOFTWARE

Our Software project consists of two tools that enable precise epigenetic control and metabolic understanding in synthetic biology. The first tool, ECHO, supports experimental design using the dCas9-DAM fusion protein, which allows targeted DNA methylation to regulate gene expression with specificity. The second tool, ENIGMA integrates machine learning and genome-scale metabolic modeling to predict how these methylation changes impact gene expression and cellular metabolism. Together, these tools create a comprehensive platform that essentially models and optimizes the entire experimental pipeline, from designing targeted epigenetic edits to simulating their functional metabolic consequences, paving the way for rational engineering of gene regulation and metabolic pathways in health and disease.

iGEM IIT-Madras 2025

GitLab Repository: https://gitlab.igem.org/2025/software-tools/iit-madras

A sample collaboration runthrough can be found on the same repository.

Project ECHO

Custom software tools, analysis scripts, and computational resources we've developed for our project and the community.

Epigenetic Control of gene expression with Hybrid Optimization

Introduction

Our main focus while developing our software was to come up with a tool that would aid in designing experiments while using the new parts developed by the Wet Lab team, namely - the dCas9-DAM fusion protein capable of targeted methylation for regulatory control.

Taking the perspective of a researcher seeking to control protein expression via methylation, we came up with two questions that seemed to be the most appropriate:

If I want to control the expression of this protein, which genes' expressions do I need to alter, and by how much?
Given I have a tool capable of performing targeted methylation, where on the genome do I introduce methylations to achieve the required deltas in gene expression?

While the first question is tackled by the modelling-based team, we decided to tackle the second question of identifying the most relevant sites for methylation to accurately control gene expression.

Software Overview

In this context, the tool ECHO (Epigenetic Control of gene expression with Hybrid Optimization) was developed. The idea behind it is simple:

Use an ElasticNet architecture to learn important sites for gene expression regulation
Use an AdaptiveRegressiveCNN to predict which of these sites to regulate in order to achieve target gene expression

🔍

Figure 10: ECHO system overview showing the hybrid approach combining ElasticNet for feature importance learning and AdaptiveRegressiveCNN for accurate gene expression prediction.

With ECHO, we pave the way to reduce experimental pipelines from days to hours, based on data availability. The comparison between pipelines with and without ECHO, and the pros/cons of ECHO, are given below:

🔍

Figure 11: Comparison of experimental pipelines with and without ECHO, demonstrating significant time reduction from days to hours in methylation site identification.

🔍

Figure 12: Comprehensive analysis of ECHO's advantages and limitations, highlighting the trade-offs between computational efficiency and experimental validation requirements.

Why Two Different Models?

Through our DBTL cycles, we realized that the exact interplay between methylation and gene expression is:

Complex - average methylation values of the genome do not have predictive power, which shows site-level methylation is an important feature
Non-local - it is demonstrated that the predictive power of models such as DeepMethyGene increases monotonically when larger methylation windows are considered around the genome (a 10Mb window of methylation data performs significantly better than a 1Mb window)

Therefore, the choice of which AI model to train proved critical. AdaptiveRegressiveCNNs (such as DeepMethyGene), while offering better predictive accuracy, proved difficult to extract gradients for, due to the following reasons:

Being a CNN-based architecture, individual feature importances were lost, as long-range feature relationships were learnt. The importance maps (obtained via gradCAM) were smoothed out, and distinct peaks were not visible.
While techniques such as gradCAM allowed us to visualize "important" sites for prediction, the direction of influence for these sites (i.e. does methylation at this site upregulate or downregulate gene expression) was not reliably extractable.
Since the CNN learnt relationships between features, it tended to not give any features very low importance. This would be detrimental for search algorithms since every site would require explicit handling.

🔍

Figure 13: DeepMethyGene-gradCAM site importance map for gene AADAT (10Mb window). The smoothed importance distribution lacks distinct peaks, making it difficult to identify specific regulatory sites.

To mitigate these issues, we designed an ElasticNet Linear Regressor based on geneEXPLORE, a precursor to the DeepMethyGene model. While having lesser accuracy overall, it had the advantage that, as a linear regressor, its gradients were directly interpretable, and distinguished clear sites of importance for regulatory control of the gene.

🔍

Figure 14: geneEXPLORE importance weights for gene AADAT (10Mb window). Clear peaks indicate specific regulatory sites with distinct directional influence, enabling precise methylation targeting.

The above weights are suitable for further search algorithms since:

Large number of weights have been assigned ~0 importance, so can be ignored
Individual feature importances clearly retained
Clear peaks of importance visible
Direction of influence available

Biological Validation

Click to expand: Biological Validation Details

To verify if the learned importances were biologically significant, we then verified the peaks against annotated enhancers and silencers for the gene from the NCBI database, and found that (for the genes we tested on), most annotated enhancer/silencer regions had a corresponding peak in the geneEXPLORE weights.

NCBI does not annotate enhancer or silencer regions located more than 1–2 Mb from most genes. To explain the peaks observed beyond this range, we refer to the methylation mechanism proposed in the geneEXPLORE paper, which suggests that DNA looping can bring segments located up to 10 Mb away from the promoter into close proximity, thereby influencing gene expression.

🔍

Figure 1: Proposed DNA looping mechanism that explains long-range methylatory regulation of gene expression. This mechanism allows DNA segments up to 10Mb away from the promoter to come in close proximity and affect gene expression.

With these verification practices in mind, we ran the weights analysis for a sample of genes, with promising results:

IGF2 - 1Mb window

Peaks observed around ~250 kb correspond to enhancer regions in the same window

🔍

Figure 2: IGF2 gene geneEXPLORE weights analysis (1Mb window). The clear peaks around 250kb correspond to known enhancer regions, validating the biological relevance of our model's predictions.

GSTT1 - 1Mb window

No documented enhancers more than 6kb from the promoter, so the site at the promoter is assigned the most importance

🔍

Figure 3: GSTT1 gene geneEXPLORE weights analysis (1Mb window). The prominent peak at the promoter region aligns with the lack of documented distant enhancers for this gene.

ACTB - 1Mb window

Housekeeping gene, hence, lack of peaks corresponds to lack of methylatory regulation

🔍

Figure 4: ACTB gene geneEXPLORE weights analysis (1Mb window). As a housekeeping gene, ACTB shows minimal methylatory regulation patterns, consistent with its constitutive expression profile.

SLC7A5 - 10Mb window

Documented to have a promoter hyper-sensitive to methylation - abnormal expression in cancer cells

🔍

Figure 5: SLC7A5 gene geneEXPLORE weights analysis (10Mb window). The multiple regulatory peaks reflect the hypersensitive nature of this gene's promoter to methylation changes, particularly relevant in cancer contexts.

Model Comparison Summary

ElasticNet Linear Regressor:

Capable of learning individual feature importances
Poor overall predictive power

AdaptiveRegressiveCNN:

Bad at learning individual feature importances
Significantly better overall predictive power

Thus, we decided to adopt an ElasticNet to learn feature importances, but an AdaptiveRegressiveCNN for final predictions.

ECHO: Models, Pipeline, and Results

Datasets Used

Methylation dataset: TCGA breast invasive carcinoma (BRCA) DNA methylation (HumanMethylation450)
Dataset Link

Gene expression dataset: TCGA breast invasive carcinoma (BRCA) exon expression by RNAseq (polyA+ IlluminaHiSeq)
Dataset Link

Model 1: Elastic Net Regression (inspired by geneEXPLORE)

Click to expand: Elastic Net Regression Technical Details

Elastic Net regression is a regularized linear regression model that combines the penalties of both Lasso (L1) and Ridge (L2) regression. It has been shown to perform effectively in high-dimensional biological data such as DNA methylation–gene expression relationships, where many features are correlated.

Model Overview

The Elastic Net model assumes the following regression form:

y = β₀ + Σᵢ₌₁ᵖ βᵢxᵢ + ε

Where:

y = gene expression value
xᵢ = methylation values at CpG sites (features) - methylation probability (beta-value) between 0 and 1
βᵢ = regression coefficient for feature i
ε = error term

$Elastic Net Math$

🔍

Figure 6: Mathematical formulation of the Elastic Net loss function, combining L1 (Lasso) and L2 (Ridge) regularization penalties for optimal feature selection in high-dimensional methylation data.

Where:

n = number of samples
λ = overall regularization strength
α ∈ [0,1] where α is the mixing parameter (α = 1 implies Lasso, α = 0 implies Ridge, 0 < α < 1 corresponds to Elastic Net)

Optimization

Algorithm: Coordinate Descent (efficient for high-dimensional sparse problems)
Hyperparameters: tuned via cross-validation; α typically chosen between 0.1 and 0.5 for correlated methylation features
Scaling: Input methylation features are standardized to mean 0 and variance 1

Reported Metrics (from geneEXPLORE)

Mean R²: ~0.05–0.15 depending on gene and tissue type
Correlation (Pearson's r): ~0.2–0.4 range

In our case, since we applied Elastic Net to a different set of genes, we observed results within the same ranges as reported by geneEXPLORE.

Model 2: Adaptive Regressive CNN (Inspired by DeepMethyGene)

Click to expand: Adaptive Regressive CNN Technical Details

Model Overview

The Adaptive Regressive CNN is a deep learning model designed to predict gene expression from DNA methylation data, specifically CpG beta values. It is based on a ResNet-style convolutional neural network, as described in the DeepMethyGene paper.

Reason for Choosing this Model

The Adaptive Regressive CNN architecture was chosen because it is uniquely suited to capture the complex, nonlinear relationships between DNA methylation patterns and gene expression. Unlike linear models such as ElasticNet, which assume additive and independent effects of each CpG site, the CNN-based approach leverages convolutional layers and residual connections to learn both local and long-range dependencies among CpG sites.

🔍

Figure 7: Adaptive Regressive CNN architecture showing the flow from input CpG methylation data through convolutional layers, residual blocks, adaptive pooling, to final gene expression prediction.

Model Architecture

Input: Vector of CpG beta values for a gene's promoter region: x = [β₁, β₂, …, βₙ] where βᵢ is the methylation beta value at CpG site i

Convolutional Layers:

Several 1D convolutional layers (e.g., kernel size 3, 64 filters) extract local patterns from the input vector
Each convolution is followed by batch normalization and a LeakyReLU activation

Residual (ResNet) Blocks: Each block contains two Conv1D layers with skip connections:

$ARCNN Math 1$

🔍

Figure 8: Mathematical representation of residual block architecture, enabling deep network training through skip connections that prevent gradient vanishing.

Adaptive Pooling: Reduces the sequence dimension, allowing the model to handle variable-length input windows.

Fully Connected Layers:

Dense(128) + ReLU
Dense(1) for the final gene expression prediction ŷ

$ARCNN Math 2$

🔍

Figure 9: Final prediction layer mathematical formulation, transforming learned features into gene expression predictions.

Training Configuration:

Optimizer: Adam (learning rate typically 0.001)
Regularization: L2 weight decay (e.g., λ = 10⁻⁴)

Reported Performance

The DeepMethyGene paper reports mean R² ≈ 0.64 and RMSE ≈ 0.25 on held-out test sets of breast cancer genes. In our experiments, results fell within these reported ranges.

The ECHO Pipeline

The full documentation of our tool can be found on the GitLab repository. A standard pipeline is as follows:

Choose your gene of interest
Compile the dataset for your gene (ECHO offers data compilation for roughly 2k human genes whose methylation and gene expression data were available in the UCSC datasets)
Train an ElasticNet on the dataset to obtain the weights for each site
Train on AdaptiveRegressiveCNN on the dataset to use as the model to predict gene expression
Input the methylation data for your cell of interest (by default, ECHO considers the average methylation values from the dataset for each site as its baseline)
Choose the algorithm and direction of regulation to begin prediction

Methylation Modeling

When we say the model "methylates" a site, this means that it sets the beta value corresponding to that site in the input vector as 1, to reflect complete methylation, as expected to be achieved eventually by the Wet lab's dCas9-DAM part. Once more data becomes available, the algorithm can be improved to be more accurate.

Supported Algorithms

The 4 algorithms currently supported by ECHO are (for example, consider upregulation):

Standard sequential - sequentially methylates DNA sites from the most upstream to the most downstream
Circular sequential - methylates sites closer to the promoter and works its way outwards
gradCAM sequential - methylates sites in accordance to the importances generated by gradCAM
elasticNet sequential - methylates sites given positive importance by elasticNet in accordance to magnitude

🔍

Figure 15: Visual comparison of the four ECHO algorithms showing different approaches to methylation site selection: standard sequential, circular sequential, gradCAM sequential, and elasticNet sequential.

Results

The convergence for different algorithms for the human gene AADAT in a 10Mb window are compared below for the different algorithms in ECHO. Note that the minimum/maximum values for gene expression in the dataset was 3-10 FPKM units, while the average is around 5.4. The x-axis is the number of sites methylated according to the algorithm, while the y-axis is the gene expression in log₂(1+FPKM) units.

Algorithm 1 - Standard Sequential

🔍

Figure 16: Standard sequential algorithm convergence curve for AADAT gene showing stochastic growth pattern without data-driven site selection.

Algorithm 2 - Circular Sequential

🔍

Figure 17: Circular sequential algorithm convergence curve for AADAT gene, methylating sites from promoter outward with similar stochastic behavior to standard sequential.

Algorithm 3 - gradCAM Sequential

🔍

Figure 18: gradCAM sequential algorithm convergence curve for AADAT gene showing slower initial growth but improved performance in middle phases.

Algorithm 4 - elasticNet Sequential

🔍

Figure 19: elasticNet sequential algorithm convergence curve for AADAT gene demonstrating superior performance with fastest convergence to target expression levels.

Observations

The standard and circular sequential algorithms do not seem to be relying on any particular data during growth; the growth rate is stochastic
The gradCAM sequential algorithm shows a slower initial growth rate than the first two, but picks up pace in the middle before saturating towards the end
The elasticNet sequential algorithm is able to correctly and accurately predict which sites to methylate for the quickest increase in gene regulation

The elasticNet sequential algorithm shows similar performance for downregulating gene expression as well:

🔍

Figure 20: elasticNet sequential algorithm performance for gene downregulation, demonstrating consistent efficiency for both upregulation and downregulation of target gene expression.

Advantages of elasticNet Sequential Algorithm

Fastest convergence rate - able to achieve maximum gene expression in the lesser number of methylations. This is good from an experimental point of view as the fewer sites that require external methylation, the better.
Faster running time - runs much faster than other algorithms since there is no need to predict via AdaptiveRegressiveCNN after each methylation if gene expression goes up/down. This cost is displaced onto computing weights via geneEXPLORE.

Self-Validation of Predicted Results

Due to the fact that precise epigenetic modification of gene regulation is still a new field, it is difficult to find datasets to verify exact results experimentally. So, to check if the results predicted by ECHO were somewhat valid, we used the following steps:

For a given gene id, window of consideration, baseline methylation series, and targeted gene expression, the recommended methylation sites were generated
The methylation series in the training data were obtained which had the closest gene expression to the target expression
It was verified if the sites recommended for methylation in the dataset had higher methylation values than in the baseline

Validation Results

For a good sample of genes, windows and targets, identifications of 75-85% were obtained, which is promising.

Example of Self-Validation

Interpreting changes proposed by ECHO. To take the expression of AADAT from its baseline of 5.6 to 8 units, ECHO proposed methylation of 40 sites.

When these 40 sites were compared to a methylation series in the training data that had an expression of 8 units, 33 out of the 40 sites in the training data had a higher beta value than the baseline.

Hence, 33/40 = 82.5% of sites were identified "correctly" by the ECHO algorithm.

🔍

Figure 21: Self-validation example for AADAT gene showing comparison between ECHO-predicted methylation sites and actual high-expression methylation patterns from training data, achieving 82.5% accuracy.

Additional Note on "Juiced" Algorithms

The algorithms so far described are "sequential"; given an order of sites to methylate, it sequentially methylates them. Given that inter-site relationships learnt by AdaptiveRegressiveCNNs are largely a black box, it is possible that there might exist some "non-sequential" algorithms, which also check if demethylating previous sites while methylating new ones increase gene expression further.

While exploring this, we found that it is numerically possible to use a mixture of elasticNet sequential and standard sequential algorithms to find methylations patterns that are “juiced”, i.e., with these patterns, the AdaptiveRegressiveCNN outputs methylation values that are significantly greater than the maximum of the training data, as shown below -

🔍

Figure 22: elasticNet-sequential juiced algorithm showing extreme gene expression values beyond training data range. While mathematically achievable, these results likely represent numerical artifacts rather than biologically relevant predictions.

This algorithm has been dubbed the elasticNet-sequential "juiced" algorithm, but is not included in the GitHub documentation due to the following reasons:

Lack of biological significance - Such high values are likely numerical artefacts
Lack of experimental interest - In normal experiments, it is unlikely that changes in gene expression beyond one unit is required
Requirement of large number of methylations - Such high values can only be obtained after selectively methylating hundreds of sites

Future Directions

Non-sequential Methylation Algorithms

As mentioned above, it is possible to further reduce the number of methylations required by tapping into non-sequential algorithms, which demethylate previously predicted sites, to boost gene expression in a lesser number of steps.

Developing Better Heuristics

Using the ElasticNet weights as a heuristic has shown promising results, however, it is possible to further fine tune this heuristic using other kinds of data, for example:

Integrating annotated enhancer/silencer regions from NCBI
Integrating chromatin-capture data which provides information on how likely are two probes to interact with each other, on the basis of their physical distances in the cell

gRNA Recommendations

Once the minimum number of methylation sites are identified, it should then be possible to algorithmically predict the least number of gRNAs required to methylate all the identified sites, since it is possible to design gRNAs that are capable of interacting with multiple sites.

The end goal is to be able to suggest a single gRNA capable of methylating all sites necessary to exactly reduce gene expression to the desired level.

References

Click to expand: References and Citations

GeneEXPLORE - Elmarakeby, H. A., Hwang, J., Arafeh, R., Crowdis, J., Gang, S., Liu, D., ... & Van Allen, E. M. (2021). Biologically informed deep neural network for prostate cancer discovery. Nature, 598(7880), 348-352.
DeepMethyGene - Li, Y., Zhang, X., & Wang, L. (2024). DeepMethyGene: A deep learning approach for predicting gene expression from DNA methylation patterns. BMC Bioinformatics, 25, 1-15.
DeepPGD - Chen, M., Liu, S., & Zhou, K. (2024). DeepPGD: Deep learning for personalized genomic diagnosis using methylation data. International Journal of Molecular Sciences, 25(15), 8146.
UCSC Xena Data Portal - Goldman, M. J., Craft, B., Hastie, M., Repečka, K., McDade, F., Kamath, A., ... & Haussler, D. (2020). Visualizing and interpreting cancer genomics data via the Xena platform. Nature Biotechnology, 38(6), 675-678.

Community Contributions

ECHO contributes to the synthetic biology and bioinformatics community by providing:

Open-source implementation of epigenetic regulation prediction
Standardized pipeline for methylation site identification
Integration of multiple AI approaches for enhanced accuracy
Validation methodologies for epigenetic predictions

Licensing and Usage

All source code and documentation are available under open-source licenses on our GitLab repository. The tool is freely available for academic and research purposes.

ENIGMA: Epigenetic Networks Informing Genome-scale Metabolic Analysis

Introduction

Rationale for Epigenetically Informed Metabolic Networks:

Gene expression is not hardwired into DNA- it's negotiated, rewritten, and enforced through epigenetics. Epigenetic modifications, particularly DNA methylation, serve as critical regulators that fine-tune gene activity without altering the genetic code itself. By modifying methylation patterns in promoter regions, cells can stably silence or activate genes, influencing cellular identity and function.

Cancer is a textbook case of epigenetics gone wrong. Tumor cells typically undergo a "double jeopardy": widespread hypomethylation that activates oncogenes, paired with local hypermethylation that shuts down tumor suppressor genes. The ripple effects extend far beyond gene expression, and methylation actively reshapes the cell's metabolic profile.

Cancer cells rewire their entire metabolic network to fuel relentless growth, with classic signatures like the Warburg effect (aerobic glycolysis) and altered glutamine metabolism.

🔍

Figure: Textbook Metabolic Reprogramming in Cancer Cells Caused by Epigenetic Changes

Epigenetically informed genome-scale metabolic models (GEMs) can directly connect methylation to expression, and expression to metabolism. Such models can illuminate how cancer cells use epigenetic reprogramming to hijack metabolism and the impacts of epigenetic changes on the metabolic network.

Broadly, outside of applications in analyzing cancer cell metabolism, this pipeline can allow us to analyze the metabolic impacts of any precise methylation that the EPIC toolkit can generate.

Existing Techniques:

Genome-scale metabolic modeling (GSMM) has already given researchers powerful tools to simulate cellular metabolism under various genetic and environmental conditions. But where does epigenetics fit in?

There have been two main approaches so far:

Epigenome-Scale Metabolic Models (EGEMs): These models expand human GEMs by explicitly adding reactions for histone acetylation. By optimizing both biomass and acetylation, they allow researchers to study trade-offs between cell growth and global epigenetic activity. But there's a catch: EGEMs treat acetylation as a single "pool," meaning they can only track whether overall acetylation goes up or down—not the precise effects of specific marks. In short, they ask: "How does metabolism influence epigenetics?"

🔍

Figure: Epigenetic Metabolic Models - Modeling how acetylation flux varied in different growth media and growth rates

Gene Expression-Based Integration: A more indirect approach is to use transcriptomics. Context-specific models integrate gene expression data by "turning on" reactions associated with highly expressed genes and suppressing others. Many studies link epigenetics to metabolism this way: measure gene expression under two different epigenetic states, then compare the outcomes. For example, Salehzadeh-Yazdi et al. integrated yeast transcriptomes under different histone-tail mutations to see how acetylation altered global metabolism. This works, but the method can't prove that changes in expression came only from methylation- other factors could be at play.

🔍

Figure: Integration of Transcriptomic Data into Metabolic Analysis Pipelines

Overview of Our Technique:

We're taking a different route. Instead of retroactively linking methylation to metabolism through gene expression snapshots, we predict gene expression directly from methylation profiles using machine learning.

🔍

Figure: Project Pipeline

A rough pipeline of the project is:

Feed a methylation profile into a predictive model.
Generate a synthetic but biologically grounded gene expression profile.
Plug these predicted expressions into a metabolic model, and apply flux balance algorithms to obtain the flux distribution of the cell post the epigenetic perturbation

This lets us ask questions like: "What would happen to metabolism if we flip a single methylation switch?" Instead of passively observing how metabolism correlates with epigenetic states, we can actively simulate how precise epigenetic modifications ripple through the network.

Why Does This Matter?

The most immediate use case of the pipeline is in cancer metabolism, where methylation-driven rewiring of pathways fuels uncontrolled growth and resistance to therapy. We can now pinpoint exactly where deviant methylations in cancer genes affect the metabolism of the cell.

But the applications don't stop there. Epigenetically informed GEMs could help uncover how diet or environment influences metabolic health, predict vulnerabilities in diseases like diabetes or neurodegeneration where methylation is disrupted, and even guide precision medicine by simulating how targeted epigenetic drugs would reshape a patient's metabolism.

Together with our EPIC toolkit, we can go beyond simulation. By testing out optimal methylation profiles in silico and then making precise epigenetic edits experimentally, we create a closed loop between modeling and engineering- allowing us to design and build cells with finely tuned, optimal behavior.

Part 1: Deep Learning Model to Predict Gene Expression

Why Predict Expression from Methylation?

DNA methylation is one of the most powerful switches for regulating gene activity. When CpG islands in promoters are methylated, transcription often grinds to a halt; when they are demethylated, transcription ramps up. This makes "predicting gene expression from methylation profiles" a valid and exciting problem: if we can read the methylation code, we can anticipate how active a gene will be.

But here's the catch- this relationship isn't simple or deterministic. Methylation is just one part of the regulatory orchestra. Chromatin accessibility, histone marks, and transcription factors all weigh in. That means we can't just deterministically calculate expression from methylation with a neat formula. Instead, we need a flexible model that can learn the rules from data, capturing both the obvious and the subtle patterns.

This lends extremely well to Machine Learning approaches.

The Architecture: Adaptive Convolutional Neural Network (CNN) With Residual Blocks

🔍

Figure: Architecture of the CNN

We use the architecture described in DeepMethyGene, a convolutional neural network (CNN) designed specifically to link DNA methylation to gene expression. The backbone is ResNet-inspired, with residual blocks and skip connections that prevent vanishing gradients and keep training stable.

The key idea is in how we represent the input. For each gene, we collect the M values of all CpG probes located within a 10 Mb window around its transcription start site (TSS). This window captures both local promoter methylation (directly influencing transcription initiation) and long-range regulatory elements (enhancers or CpG clusters that can also modulate expression). Because the number of CpG probes varies widely from gene to gene, the model is gene-specific, adapting its input channels to whatever probe set is available for that gene.

Once the probe values are arranged, the CNN treats them like a structured sequence, scanning across them with convolutional filters. This allows the network to detect both local patterns (e.g., a CpG island being consistently methylated) and broader spatial arrangements (e.g., long-range clusters of methylation upstream of the promoter). The residual connections ensure that deeper layers don't forget simpler patterns as they build up hierarchical features.

Generating the Dataset:

We trained our models on paired DNA methylation and RNA-seq data from 873 TCGA breast cancer samples. Both the 450K methylation array data and the Hi-Seq 2000 expression data were downloaded from the Xena Public Data Hubs.

The methylation arrays report values as β values, a quantitative measure of DNA methylation at individual CpG sites. Formally, a β value is calculated as the ratio of methylated probe intensity to the sum of methylated and unmethylated probe intensities:

$\beta = \frac{M}{M + U + \alpha}$

where M and U represent the measured signal intensities of methylated and unmethylated alleles, and α is a constant added to stabilize values when intensities are low. The result is a continuous value between 0 (completely unmethylated) and 1 (fully methylated).

Intuitively, you can think of β values as the "fraction of molecules carrying a methyl group" at a particular CpG site. If out of 100 DNA molecules, 70 are methylated, the β value would be ~0.7.

To improve the statistical properties of these values, we converted β values into M values using a logit transform:

$M_i = \log_2\left(\frac{\beta_i}{1 - \beta_i}\right)$

This reduces skew, expands the compressed regions near 0 and 1, and produces distributions that are closer to normal- making them better suited for regression and machine learning models.

Out of the original 485,577 CpG probes, 90,007 had more than 20% missing values and were removed. For the remaining probes, if fewer than 20% of values were missing, we used K-means clustering to impute them, ensuring a complete and consistent dataset.

For gene expression, we used RNA-seq transcript abundances measured in RPKM (Reads Per Kilobase of transcript, per Million mapped reads). RPKM tells us how many reads came from a gene, adjusted so that gene size and sequencing effort don't bias the comparison. Without this normalization, a 5,000 bp gene would always look more "expressed" than a 500 bp gene simply because it's longer. By correcting for both length and sequencing depth, RPKM reflects the true relative abundance of transcripts.

Using this measure, we excluded genes with very low expression (average RPKM < 1) to avoid noise from weakly expressed or inactive genes, reducing the dataset from 20,530 genes to 17,113. From here, we only decided to train the genes that were also there in the Human Cell metabolic model, further narrowing the number of models to 2416.

In short, for each gene, we had its expression as well as the M values of all the CpG probes for 873 samples.

🔍

Figure: Dataset Generation Pipeline

Why Gene-Specific Models?

Not all genes play by the same epigenetic rules. Metabolic drivers like SLC7A5 or LDHA are exquisitely methylation-sensitive, while housekeeping genes are relatively stable. By training one model per gene, DeepMethyGene learns each gene's unique "epigenetic grammar," instead of forcing a one-size-fits-all rulebook.

That said, it is virtually impossible to pinpoint an exact gene expression value from methylation data alone. By training the model gene by gene, we give each network a natural baseline: it learns the typical expression range for that gene, so even a naive prediction (like averaging across samples) already falls in the right ballpark. The methylation values then act as refinements, nudging the prediction up or down depending on the specific CpG patterns. In this way, gene-wise training provides a stable foundation while still allowing the model to capture fine-tuned regulation.

Results:

The mean absolute error (MAE) across genes was 0.53. To put this in context, the average standard deviation of gene expression values was about 1.6. This means that, on average, the model's predictions deviate by less than one-third of the natural variation in expression, a clear indication that the model is capturing meaningful biological signal rather than noise.

Looking across the full set of ~2000 genes in the human metabolic model, the average coefficient of determination (R²) was 0.51. This outperforms baseline regressors such as Random Forests and Support Vector Machines. Our results are also directly comparable to the performance reported in the original DeepMethyGene study, validating our approach.

Performance was not uniform across all genes. Many genes achieved R² values above 0.8, meaning that for these cases, the model explains more than 80% of the variability in expression. This is a striking result, as it shows that for a subset of genes, the model predictions are highly accurate and nearly deterministic.

🔍

Figure: Distribution of R² values of Genes

Inferences:

When we examined the genes with the highest predictive accuracy (R² > 0.83), we found that most of them had very large input sizes, typically over a thousand CpG probes within the defined window. However, no consistent pattern of expression emerged across these genes, and the wide variation in input sizes made direct comparison of methylation values difficult.

🔍

Table: Best Performing Genes

On the other end of the spectrum, for genes with poor predictive accuracy (R² < 0.129), most had relatively small input sizes, generally clustered around 600 CpG probes. This suggests that genes with fewer CpG sites may provide less methylation information for the model to leverage, contributing to weaker performance.

🔍

Table: Worst Performing Genes

Interpreting the Model: Grad-CAM Insights

We used Grad-CAM to visualize what the network "pays attention to" and assign importances to the various CpG regions for making predictions on gene expression. Grad-CAM works by calculating the gradients of a desired output (here, the gene expression) with respect to the feature maps of a specific convolutional layer. These gradients are averaged to create weights for each feature map, and these weights are then multiplied by their corresponding feature maps. The results are summed and passed through a ReLU function to highlight the most important regions, producing a heatmap that shows what parts of an image a CNN focuses on to make a prediction.

For example, in the AADAT gene (310 CpG sites), the heat map generated suggested that the model generally assigns high importance to CpG sites with high M values. This would mean that heavily methylated regions play a crucial role in the prediction process.

🔍

Figure: Heatmap of the M Value Distribution Across 863 samples

🔍

Figure: Heatmap of Importances Assigned to each CpG Site for all of the 873 Samples

We then wanted to check if the reason for the high importance came because of the methylation pattern itself, and not the genomic coordinates.

When we inverted the input (flipping methylated to unmethylated and vice versa), the highlighted zones shifted accordingly- showing the model focuses on patterns of high M values, not fixed CpG coordinates.

🔍

Figure: Heatmap of Distribution of M Values (Flipped)

🔍

Figure: Heatmap for Importances Assigned to Each CpG Site for all 873 Genes

Randomized inputs, by contrast, produced no clear Grad-CAM patterns- further evidence that the model isn't just memorizing probe positions.

🔍

Figure: Heatmap Generated for a Randomly Generated Array of M Values

However, when we systematically "strided" a methylation pattern (made a set of 5 contiguous CpG sites have an M value of +10) across a 2000 bp promoter region, gene expression predictions varied significantly depending on where the pattern landed. This suggests that both pattern (clusters of high/low methylation) and location (position within the promoter) jointly shape expression.

Our takeaway: methylation prediction isn't about isolated CpGs, but about spatially contextualized patterns- exactly what CNNs are built to capture.

🔍

Figure: Gene Expression Changing with Different Contiguous Sets of 5 CpGs Methylated for LDHA

Also interestingly, the CpG sites deemed as important by the more sophisticated CNN also roughly mirrored what a more basic linear regressor predicted. Later, when we analyse cancer genes that have been atypically methylated, we also see that many of the regions that the model deemed as important are regions that are in the promoter regions of these genes, and have been documented to be the subject of epigenetic changes.

🔍

Figure: Regions Deemed as Important by DeepMethyGene Match with Gene Explore

Unifying Across Genes:

One of the major bottlenecks in the original DeepMethyGene framework is that it requires training an entirely separate model for each gene. We explored whether the architecture could be adapted to train across all genes simultaneously, rather than building individual models.

The challenge lies not in the convolutional layers themselves- they can naturally handle variable-length inputs, but in what comes after. When convolutional feature maps are flattened into a fully connected layer, the number of nodes is fixed, which forces the input length to be fixed as well. Since different genes have different numbers of CpG probes within their input windows, this becomes a structural bottleneck.

This is where spatial pyramid pooling (SPP) offers a solution. Instead of flattening the convolutional feature maps into a rigid vector, SPP divides the feature map into multiple pooling bins at different scales (for example, splitting into 1×1, 2×2, and 4×4 grids). Features within each bin are pooled, producing fixed-length outputs regardless of the original input size. In practice, this allows us to feed probe arrays of any length into the CNN, while still producing a standardized feature vector for downstream fully connected layers.

🔍

Figure: Spatial Pyramidal Pooling

However, given the sheer variation in the gene expression values across the genes, the model was not able to zero in on the right expression ranges for most genes. However, given enough data points and a robust enough model, it would be possible to build a universal model.

Unifying Across Cancers:

In an attempt to increase the range of gene expression data that the model can train on, we wondered if we could also train gene-wise models on other cancer cells as well. The initial assumption was that the methylation profiles and gene expressions for the same genes would show some variation across cancer cells.

However, on running some preliminary tests, we could see that the methylation values as well as the gene expressions did not vary across cancers. What that allowed us to do though, was to significantly expand the size of the dataset that the model trains on. Assuming that the samples do not vary in their M-value and gene expression distributions, we can safely intermix datapoints across various samples.

We trained and tested a subset of around 10 genes with the expanded dataset across multiple cancers. Unsurprisingly, the R2s of all of these genes exceeded their values when they were trained only on Breast Cancer cells.

🔍

Figure: Improved Performance of Genes When Trained On the Unified Cancer Database

Part 2: Integrating Gene Expression in a Metabolic Model

Constraint-Based Metabolic Modeling

At its core, metabolism is a network of thousands of biochemical reactions: nutrients enter the cell, enzymes transform them, and products such as energy and biomass are generated. To study this systematically, scientists construct genome-scale metabolic models (GEMs) that map genes to enzymes, enzymes to reactions, and reactions to metabolites.

Because the complexity of such networks makes them impossible to simulate kinetically at full scale, GEMs are often analyzed using constraint-based approaches. Instead of tracking every reaction dynamically, we represent the system in steady state i.e. the rate at which each metabolite is formed is also the rate at which it is consumed. Constraints are then imposed based on known reaction stoichiometries, thermodynamics, and environmental conditions (e.g., nutrient availability). These constraints carve out a "solution space" of all flux distributions that are physically and biologically possible.

Constraint-based modeling is powerful because it allows us to ask: what could the cell be doing under these conditions? From this foundation, optimization methods are applied to identify plausible or optimal flux patterns.

🔍

Figure: Constraint Based Metabolic Modeling

Flux Balance Analysis (FBA)

The most widely used method within constraint-based modeling is Flux Balance Analysis (FBA). FBA begins with a simple assumption: the cell operates at steady state, meaning the concentration of each metabolite is balanced (what flows in must flow out). This condition can be expressed as a system of linear equations.

However, as each metabolite participates in more than 1 reaction, there are usually more reactions than equations, and the system is underdetermined. To resolve this, FBA introduces an objective function, a biological goal the cell is assumed to maximize. This could be biomass/ATP production, or could also be to minimize the "distance" from the previous wild type flux distribution. Using linear programming, FBA computes the flux distribution that maximizes this objective while satisfying all constraints.

In microbial systems, FBA has been highly predictive, accurately forecasting growth rates, nutrient dependencies, and the effects of gene knockouts. In mammalian and cancer cells, where metabolism is more complex and not always growth-optimized, FBA still provides a framework but it needs extra information, such as transcriptomic or epigenetic data, to be made context-specific.

🔍

Figure: Flux Balance Analysis

Overview of Algorithms to Integrate Transcriptomic Data Into Metabolic Models

A genome-scale model on its own is generic. It assumes all reactions are available simply because the genes exist in the genome and that each reaction occurs at complete efficiency. But in reality, only a subset of genes are expressed in any given tissue or condition. To build more realistic, context-specific models, gene expression data is integrated into GEMs.

There are two broad strategies for integrating the gene expression into metabolic models. The first is to modify flux bounds based on gene expression wherein reactions catalyzed by genes that are highly expressed have their upper and lower bounds forcibly increased and vice versa. The second approach is to modify the objective function instead of changing the flux bounds.

Two of the most influential algorithms representing these approaches are GIMME and iMAT.

GIMME: A Brief

GIMME (Gene Inactivity Moderated by Metabolism and Expression; Becker & Palsson, 2008) is an objective-driven method. It begins with a metabolic reconstruction, gene expression data, and at least one specified metabolic function the cell must achieve, usually biomass.

First, FBA is run to determine the maximum possible flux through the objective function. Reactions linked to genes with expression below a defined threshold are then removed from the model. If this pruning prevents the network from achieving the objective, GIMME begins to add back reactions with an inconsistency score. Reactions that are severely below the threshold, or those that had a high flux pre-pruning carry a larger penalty if they are added back.

The final objective function is to minimize this inconsistency score while still satisfying the stoichiometric constraints at steady state.

🔍

Figure: Algorithmic Framework of the GIMME Algorithm

GIMME: Mathematical Formulation

The GIMME algorithm identifies a flux distribution that minimizes inconsistency between gene expression data and a required metabolic functionality (RMF).

$\mathrm{Minimize:} \quad \sum_i c_i \, |v_i|$

$\mathrm{subject\ to:} \quad S \cdot v = 0$

$a_i \leq v_i \leq b_i$

$v_{\mathrm{RMF}} \geq \alpha \, v_{\mathrm{RMF}}^{\mathrm{max}}$

where:

$S$ is the stoichiometric matrix (metabolites × reactions)
$v$ is the vector of reaction fluxes
$a_i, b_i$ are the lower and upper flux bounds
$v_{\mathrm{RMF}}^{\mathrm{max}}$ is the maximum achievable flux through the required metabolic functionality (RMF)
$\alpha$ is a fraction (e.g., 0.9) enforcing minimum required activity
$c_i$ is a penalty weight determined from expression data:

c_i = \begin{cases} x_{\text{cutoff}} - x_i, & \text{if } x_i < x_{\text{cutoff}}, \\ 0, & \text{if } x_i \geq x_{\text{cutoff}}. \end{cases}

where $x_i$ is the normalized gene expression value mapped to reaction $i$ and $x_{\mathrm{cutoff}}$ is the expression threshold above which a gene is considered active.

To convert the problem to a standard linear program, reversible reactions are split into two irreversible ones:

$v_i = v_i^+ - v_i^-$

$|v_i| = v_i^+ + v_i^-$

so the LP becomes:

$\mathrm{Minimize:} \quad \sum_i c_i (v_i^+ + v_i^-)$

$\mathrm{subject\ to:} \quad S (v^+ - v^-) = 0$

$v_i^+, v_i^- \geq 0$

$v_{\mathrm{RMF}} \geq \alpha v_{\mathrm{RMF}}^{\mathrm{max}}$

The minimized objective value represents the Inconsistency Score (IS), measuring disagreement between gene expression and functional requirements. A normalized score (NCS) can be computed as:

$\mathrm{NCS} = \frac{1.02 \cdot \max(\mathrm{IS}) - \mathrm{IS}}{\max(\mathrm{IS})}$

iMAT: A Brief

iMAT (Integrative Metabolic Analysis Tool; Shlomi et al., 2008; Zur et al., 2010) represents the expression-driven philosophy.

Each reaction in the model is classified as high, moderate, or low expression based on the expression values of the gene that encodes for the enzyme that catalyzes it. Reactions are classified as "active" or "inactive" based on their fluxes in the solution.

The IMAT objective is to maximise the number of high reactions that are active, while also minimize the number of active "low" reactions.

Unlike GIMME, iMAT does not force the model to achieve a particular cellular objective, making it particularly useful in complex systems such as mammalian or cancer cells where defining a single metabolic "goal" is biologically unrealistic. Its strength lies in capturing the expression landscape itself as the main driver of flux distribution.

iMAT: Mathematical Formulation

The iMAT algorithm formulates a Mixed-Integer Linear Programming (MILP) problem that integrates gene expression with a metabolic model to identify a flux distribution consistent with the expression levels.

Let:

$y_i \in \{0,1\}$ : binary variable indicating whether reaction $i$ is active (1) or inactive (0)
$v_i$ : continuous flux variable of reaction $i$
$R_H$ , $R_L$ , $R_M$ : sets of reactions associated with high, low, and medium expression respectively
$\varepsilon$ : small positive flux threshold

The MILP formulation is:

$\mathrm{Maximize:} \quad \sum_{i \in R_H} y_i + \sum_{i \in R_L} (1 - y_i)$

$\mathrm{subject\ to:} \quad S \cdot v = 0$

$a_i y_i \leq v_i \leq b_i y_i, \quad \forall i$

$v_i \geq \varepsilon y_i, \quad \forall i \in R_H$

$v_i \leq \varepsilon (1 - y_i), \quad \forall i \in R_L$

$y_i \in \{0, 1\}$

Here:

The objective maximizes consistency between flux activity and expression levels — promoting flux through highly expressed reactions and minimizing it through lowly expressed ones.
$S \cdot v = 0$ enforces steady-state mass balance.
Reactions in $R_M$ (moderate expression) are not included in the objective but still satisfy the balance constraints.

Comparison Summary:

Feature	GIMME	iMAT
Optimization type	Linear Programming (LP)	Mixed-Integer Linear Programming (MILP)
Expression data type	Continuous (quantitative)	Discrete (high/medium/low)
Objective	Minimize inconsistency (penalty-weighted flux)	Maximize consistency between flux and expression
Requires RMF	Yes	No
Output	Context-specific model + consistency score	Binary active/inactive flux states

Custom Algorithm:

To better integrate transcriptomic information into genome-scale metabolic models, we designed a GPR Pruning-Based Algorithm that links gene expression to metabolic flux capacities in a continuous yet rule-consistent way. The approach begins at the level of genes and propagates through Gene-Protein-Reaction (GPR) associations before ultimately adjusting the flux bounds of the reactions themselves.

Each gene $g$ is assigned a state $S_g = (L_g, P_g)$ derived from its expression value $e_g$ . The discrete component $L_g$ captures whether a gene's expression is low, moderate, or high, as determined by its position relative to the 33rd and 66th percentiles of that gene's reference distribution $(q_{1,g}, q_{2,g})$ :

L(e_g) = \begin{cases} 0, & \text{if } e_g < q_{1,g} \\ 1, & \text{if } q_{1,g} \leq e_g < q_{2,g} \\ 2, & \text{if } e_g \geq q_{2,g} \end{cases}

The continuous component $P_g \in [0,1]$ is the percentile rank of $e_g$ , which preserves fine-grained variation even within a discrete level. In this way, the gene state combines categorical and continuous measures of expression.

These gene states are then propagated through GPR rules, which encode the Boolean relationships between genes and reactions. When a reaction is associated with a single gene, the gene's state directly defines the reaction state. In the case of enzyme complexes, represented by AND operators, the limiting subunit controls activity; formally, the reaction level is the minimum of the constituent levels, with the percentile taken from the subunit responsible for that minimum:

$L_{\mathrm{rxn}} = \min \big(L(e_{g_1}), L(e_{g_2}), \dots, L(e_{g_n})\big), \quad P_{\mathrm{rxn}} = P(e_{g_k}) \quad \mathrm{where} \quad k = \arg\min_i L(e_{g_i}).$

Isozymes, represented by OR operators, behave in the opposite way: activity follows the most highly expressed constituent, with the percentile drawn from the maximally expressed gene:

$L_{\mathrm{rxn}} = \max \big(L(e_{g_1}), L(e_{g_2}), \dots, L(e_{g_n})\big), \quad P_{\mathrm{rxn}} = P(e_{g_k}) \quad \mathrm{where} \quad k = \arg\max_i L(e_{g_i}).$

This evaluation produces a reaction-level state $S_{\mathrm{rxn}} = (L_{\mathrm{rxn}}, P_{\mathrm{rxn}})$ that reflects both the combinatorial gene logic and the underlying transcriptomic strengths.

Finally, the reaction state is used to tune the flux bounds $(v_{\min}, v_{\max})$ of the corresponding reaction. When expression is low $(L_{\mathrm{rxn}} = 0)$ , the flux range is strongly suppressed by a factor

$\alpha = \max(\alpha_{\min}, P_{\mathrm{rxn}}),$

ensuring that reactions can be pruned almost entirely but never vanish spuriously.

At moderate expression $(L_{\mathrm{rxn}} = 1)$ , the original bounds are preserved, corresponding to baseline enzymatic activity:

$v'_{\min} = v_{\min}, \qquad v'_{\max} = v_{\max}.$

High expression $(L_{\text{rxn}} = 2)$ leads to a boost, with capacity scaled by a factor

$\beta = 2.0 + \big(P_{\text{rxn}} \cdot (\beta_{\max} - 2.0)\big), \quad \beta_{\max} \approx 3.0,$

so that more strongly expressed reactions can carry more flux. For forward-only reactions, the upper bound is expanded while a small positive minimum flux is introduced if none existed, ensuring activity is not artificially clamped at zero. For reversible reactions, both positive and negative capacities are symmetrically scaled by $\beta$ .

Note on Thresholds:

Both GIMME and IMAT rely on global expression thresholds. In GIMME, reactions are pruned if the associated genes have expression levels below a certain global threshold. IMAT, on the other hand, categorizes reactions into Low, Medium, or High expression based on global percentiles.

However, the changes caused by a single methylation event are often too subtle to shift a gene's classification (e.g., from "Low" to "Medium") or to surpass the global pruning threshold.

To ensure that the impact of epigenetic changes is meaningfully reflected in the metabolic model, we use gene-specific thresholds. Unlike standard GIMME or IMAT approaches where thresholds are defined relative to the overall distribution of gene expression, we define them on a per-gene basis. This allows small but biologically relevant changes in expression to propagate through to the metabolic model.

This gene-specific approach is also more biologically accurate. The ability of a gene to drive metabolic activity shouldn't depend solely on its expression relative to other genes. Some enzymes can effectively catalyze reactions even at low mRNA levels, while others may require higher expression. Therefore, using gene-specific thresholds captures the gene's true functional impact more realistically.

In the following sections, when analyzing the metabolic consequences of methylation in selected oncogenes, we apply these gene-specific thresholds to update the gene classification used in the metabolic model.

Part 3: Putting It All Together

There are a vast multiplicity of genes that are epigenetically misregulated in cancer cells. Given the fact that our models have been trained on cancer-specific data, we deemed the best way to validate the models and the pipeline would be to see if we can replicate experimentally observed metabolic phenotypes of cancer cells.

Essentially, can our pipeline show how precise methylations that mirror the methylome of cancer-causing genes result in abnormal gene expression and eventually altered metabolism.

There are 3 genes that we identified, which are epigenetically altered in cancer cells. From our initial literature review, we obtained around 15-16 genes that were epigenetically disregulated in cancer cells. However, we could only use 3 due to a few constraints with the model:

First, the Human Cell Genome Scale Model that we were using only had a reduced set of around 2000 genes. Many oncogenes were not present in the model.

Second, some of the genes identified are coded for reactions involving "dead-end metabolites" which do not participate in any reactions. These could not be used for inference.

Third, some of the genes did not have annotated CpG sites near their promoter region. The Illumina 450k assay only hand-selected 450,000 CpG sites whose methylations they measure. This ruled out a few more genes for which the closest annotated CpG site was more than 2000 base pairs from the promoter's start site.

Nonetheless, these are all bottlenecks that arise from a lack of data and do not make the model or the pipeline any less robust. Over time, once we can tag all CpG sites or curate more exhaustive Human genome-scale metabolic models, we can work around these constraints and potentially analyse other genes.

Case 1: SDHB

SDHB encodes the iron-sulfur subunit of succinate dehydrogenase (complex II). Within the TCA cycle, SDHB helps catalyze the oxidation of succinate to fumarate.

In normal physiology, SDHB is ubiquitously expressed in tissues with high mitochondrial content and oxidative demands, such as skeletal muscle, heart, kidney, and adrenal medulla. Its stable expression is essential for mitochondrial integrity and efficient ATP production.

In cancer, however, loss or inactivation of SDHB is recurrently observed. Succinate helps stabilize hypoxia-inducible factors. When SDHB is dysfunctional, succinate accumulates as it is not converted to fumarate, leading to a "pseudohypoxic" state where HIFs are stabilized even under normal oxygen conditions. This drives transcription of angiogenic, glycolytic, and survival genes, promoting tumor growth.

The regulation of SDHB expression in tumors is also shaped by epigenetic mechanisms. Studies show that in several cancers, the promoter region of SDHB undergoes hypermethylation, effectively silencing its expression. This methylation-dependent repression phenocopies loss-of-function mutations, leading to succinate accumulation and downstream oncogenic signaling.

Bottom line: In cancerous cells, SDHB expression is often lost due to promoter hypermethylation or mutation, resulting in succinate buildup. This metabolic block fuels pseudohypoxia, HIF stabilization, and global epigenetic reprogramming, all of which drive aggressive tumor growth and progression.

We predicted gene expression values of the SDHB gene in 4 different scenarios:

A baseline expression, obtained from the methylation profile of a breast cancer sample
An expression obtained from a hypothetically completely methylated profile (M values +10)
An expression obtained from a hypothetically undermethylated profile (M values -10 for all sites in the 10MB window)
An expression obtained by hypothetically hypermethylating CpG sites in a 2000 BP window around a promoter

We observe that a fully hypomethylated gene tends to show higher expression than a fully hypermethylated one. Interestingly, localized hypermethylation near the promoter suppresses expression more strongly than global hypermethylation across the entire 10 Mb window. However, the scenario of having completely hyper or hypomethylated states is so biologically unrealistic that the usual thumb rule of "more methylation means less expression" cannot be meaningfully applied. A likely explanation is that broad hypermethylation inadvertently affects silencers or neighboring genes, which in turn modulate SDHB expression.

🔍

Figure: Gene Expression Values for SDHB in 4 Different Contexts

We also examined whether gene expression is influenced primarily by hypomethylation across the entire gene body or by specific regions near the promoter. By gradually expanding the analysis window to include CpG probes beyond 2000 base pairs upstream and downstream of the transcription start site, we observed that predictive power plateaued once the window extended past 2000 bp. In other words, CpG sites located outside this ±2000 bp promoter region contribute little to predicting gene expression, highlighting the dominant regulatory role of promoter-proximal methylation.

🔍

Figure: SDHB Hypomethylation Window Analysis

Narrowing down on the 2000BP region, we then try to see if increasing the strength of hypomethylation results in increased expression. Intuitively, we can see how a more hypermethylated distribution of M values (+10) results in a lower expression than one with hypomethylations (-10).

🔍

Figure: SDHB Methylation Strength Analysis

This type of expression profile suggests that the SDHB gene does not have any silencers, like the SLC7A5 gene. On checking, we do see an absence of any annotated silencers.

🔍

Figure: Lack of Silencers around the SDHB Gene

However, biologically, most tumors are formed out of single methylations. The likelihood of all CpG sites in a 2000BP window of the transcription start site is extremely low. To simulate "hypomethylation at the promoter site", we therefore pick a CpG probe that is close to the transcription start site.

cg03861428 is a probe that is 37 base pairs from the transcription start site and is likely in the promoter region. It is also the closest probe to the promoter, in terms of genomic distance.

When we exclusively hypomethylate that probe, gene expression decreases slightly from a baseline of 10.001058 to 9.984805 (Δ = −0.016253). This contrasts sharply with the earlier two cases, where single promoter methylation events produced substantial expression changes.

Analysis of the most impactful individual methylations suggests that most single-site modifications exert only minimal effects on the model. This implies that significant alterations in SDHB expression likely arise from the collective action of multiple methylation events. Emerging theories propose that DNA methylation can establish a scaffold that promotes further methylation in adjacent regions, or provide binding sites for methyl-CpG–binding domain (MBD) proteins, which in turn can recruit histone modifiers such as acetyltransferases and methyltransferases.

To see the consequences of the methylation on the metabolic network, we integrated the newly predicted gene expression results into the metabolic models and analysed changes in the flux distribution. Some seminal observations:

First, the stabilization of Hypoxia Inducing Factors by unconverted succinate, manifests in the activation of the glycolytic pathway. This is visible in the increase in fluxes of multiple reactions involved in glycolysis.

🔍

Figure: Unconverted Succinate Increasing

Secondly, reactions indicative of active Kreb's cycle activity such as those catalyzed by Mitochondrial Carrier (Mc) Tcdb:2.A.29.7.2 or acetyl-CoA:propanoyl-CoA 2-C-acetyltransferase saw a steady decrease in flux, suggesting that the cycle has been blocked.

🔍

Figure: Inhibition of Krebs Cycle Reactions

Most starkly, we see reactions upstream of fumarate such as the malate to oxaloacetic acid conversion catalysed by (S)-malate:NAD+ oxidoreductase go down. Further, in order to compensate for the lowered fumarate production from succinate, the reverse reaction converting malate to fumarate gets (S)-malate hydro-lyase (fumarate-forming) activated.

🔍

Figure: Malate to Oxaloacetic Acid Reduction; Malate to Fumarate Activation

Case 2: LDHB

LDHB encodes the B subunit of lactate dehydrogenase (LDH), an enzyme that catalyzes the reversible conversion between pyruvate and lactate, with concurrent interconversion of NADH and NAD⁺. LDH tetramers comprise different combinations of A (LDHA) and B (LDHB) subunits, and their composition influences the directionality of the reaction: LDHA-rich isoforms preferentially convert pyruvate to lactate, whereas LDHB-rich isoforms more readily catalyze lactate to pyruvate.

In normal cells, LDHB helps maintain metabolic plasticity and redox balance by enabling lactate oxidation under aerobic conditions. However, in breast cancer, LDHB is often silenced via promoter hypermethylation, contributing to a glycolysis-dominant phenotype.

In breast cancer cell lines, LDHB mRNA was absent or markedly reduced. Treatment with the DNA methyltransferase inhibitor 5'-azacytidine restored LDHB expression, indicating methylation was causative. Bisulfite sequencing of the LDHB promoter in these cell lines revealed heavy methylation across ~14 CpG sites; by contrast, cell lines expressing LDHB exhibited promoter unmethylation.

The epigenetically silenced LDHB is directly implicated in the creation of the Warburg Effect. This is a scenario where cancer cells derive a bulk of their energy requirements from glycolysis, and use the pyruvate generated in the glycolytic pathway to make lactate as opposed to acetyl-CoA for the TCA cycle.

The consistent hypermethylation of the LDHB promoter in breast tumors (and absence in normal tissues) underscores its role as a tumor-selective epigenetic switch. Loss of LDHB expression reduces the cell's capacity to consume lactate, reinforcing reliance on glycolysis and increasing lactate excretion, which can acidify the tumor microenvironment and contribute to invasion, immune evasion, and metabolic symbiosis with stromal cells.

We predicted gene expression values of the LDHB gene in 4 different scenarios:

A baseline expression, obtained from the methylation profile of a breast cancer sample
An expression obtained from a hypothetically completely methylated profile (M values +10)
An expression obtained from a hypothetically undermethylated profile (M values -10 for all sites in the 10MB window)
An expression obtained by hypothetically hypermethylating CpG sites in a 2000 BP window around a promoter

We can see how increasing the methylations around the already methylated promoter region (by making the M values 10) reduces the gene expression.

🔍

Figure: Gene Expression Values for LDHB in 4 Different Contexts

We also try to see if hypomethylations have an effect on the gene expression. Our model posits that increasing the window inside which all the CpG sites are hypomethylated (-10) does not have an affect on gene expression.

🔍

Figure: LDHB Hypomethylation Window Analysis

Within the 2000 BP window, we see what the effect of the extent of methylations are. As conventionally expected, the more you methylate (positive M values), the less is the gene expression.

🔍

Figure: LDHB Methylation Strength Analysis

This would also mean that the LDHB gene does not have any silencers, like the SLC7A5 gene. On checking, we do see an absence of any annotated silencers.

🔍

Figure: Lack of Silencing Elements in the Promoter Region of the LDHB Gene

Once again, we try to see if abnormal expressions in cancer cells can arise out of single-probe edits, as the chance of multiple probes being contiguously hypermethylated is low.

The closest CpG site is cg04354474 at a distance of 2bp from the promoter's TSS. On hypermethylating that single site, we move from a baseline of 11.373158 to an expression value of 10.736905 (Δ = -0.636253).

This is also unique to CpG sites in the promoter region, as all large impacts of single methylations come from close probes. Not every singular probe is capable of generating an equally large change in gene expression.

Cancer cells with LDHB mutations show an aggressively glycolytic phenotype. This is something we proceeded to corroborate using our metabolic model after feeding the predicted gene expression post-methylation.

This was evinced in 3 ways:

The activity of the Glucose to Glucose-6-phosphate increased significantly.

The flux through the Fructose-6-Phosphate to Fructose-1,6-Bisphosphate reaction also saw a considerable increase.

🔍

Figure: Increased Glycolytic Flux

What is important to note is that the genes that directly code for these enzymes (under the Gene Reaction Pairs) were unaltered, which means that any flux change in the glycolytic pathways came as a trickle down of the LDHB hypermethylation and not from making the glycolysis enzymes themselves more efficient.

ATP saw a considerable increase in the new metabolic model, indicating positive glycolysis fluxes. Crucially, the reactions involved in the Krebs' cycle or the oxidative phosphorylation saw minimal changes, ruling them out as potential contributors to the increased ATP. This seems to mirror the Warburg Effect, a scenario where cancer cells derive most of their ATP via glycolytic pathways and lactic acid fermentation, despite the presence of oxygen.

🔍

Figure: Increased ATP Production

We also see the lactate concentration go up directly as a function of the LDHB methylation and the back reaction from lactate to pyruvate getting silenced. However, other reactions that are involved in lactate production also get upregulated.

🔍

Figure: Warburg Effect

Case 3: SLC7A5

SLC7A5 encodes a membrane protein that imports large neutral amino acids such as phenylalanine, tryptophan, and tyrosine. Leucine imported through SLC7A5 is a key activator of the mTORC1 pathway, a signaling hub that stimulates protein synthesis, biomass accumulation, and proliferation.

In normal physiology, SLC7A5 expression is relatively restricted, appearing at higher levels in rapidly dividing cells such as those of the placenta, bone marrow, and immune system. In cancer, however, SLC7A5 is consistently overexpressed. Numerous studies in breast, lung, colon, and brain tumors report elevated SLC7A5 expression, which is strongly associated with aggressive growth, poor prognosis, and resistance to therapy.

The regulation of SLC7A5 expression in tumors is closely tied to epigenetic modifications, particularly DNA methylation. Studies show that the promoter region of SLC7A5 is frequently hypomethylated in breast tumors compared to adjacent normal tissues. Hypomethylation therefore, acts as a switch that derepresses LAT1, giving tumor cells an expanded capacity for nutrient uptake.

Beyond this, SLC7A5-mediated transport feeds into the methionine cycle, increasing intracellular S-adenosylmethionine (SAM) and potentially reinforcing DNA and histone methylation elsewhere in the genome, creating a feedback loop between nutrient transport and epigenetic state.

Bottom line is SLC7A5 expression is severely heightened in cancerous cells due to a hypomethylated promoter region. The implications of this increased expression manifest in increased transport of neutral amino acids which serve as signalling molecules for proliferation pathways.

We predicted gene expression values of the SLC7A5 gene in 4 different scenarios:

A baseline expression, obtained from the methylation profile of a breast cancer sample
An expression obtained from a hypothetically completely methylated profile (M values +10)
An expression obtained from a hypothetically undermethylated profile (M values -10 for all sites in the 10MB window)
An expression obtained by hypothetically hypermethylating CpG sites in a 2000 BP window around a promoter

The expression patterns in the fully hypermethylated or hypomethylated profiles are so biologically unusual that the usual rule of "more methylation typically leads to lower expression" doesn't hold true. It's possible that methylating every site within a 10MB region also affects silencers or other genes that might regulate the expression of SLC7A5.

🔍

Figure: Gene Expression Values for SLC7A5 in 4 Different Contexts

We additionally try to see if the gene expression is affected by hypomethylations throughout the gene, or if there were specific regions close to the promoter that played a dominant role in affecting gene expression. On increasing the window of hypomethylations to CpG Probes lying beyond the 2000bp window, we see that the gene expression plateaus after we cross the 2000bp threshold. This indicates that CpG sites outside of a 2000bp window are relatively useless in predicting gene expression.

🔍

Figure: SLC7A5 Hypomethylation Window Analysis

Narrowing down on the 2000BP region, we then try to see if increasing the strength of hypomethylation results in increased expression. Intuitively, we can see how a more hypomethylated distribution of M values (-10) results in a higher expression than one with relatively lower hypomethylation (~-2.5).

Interestingly, our model also suggests an increase in expression on an increase in hypermethylation, although not as pronounced as the jump in expression on hypomethylation. That would indicate the presence of a silencer as increased hypermethylation would also result in increased inactivity of the silencer.

🔍

Figure: SLC7A5 Methylation Strength Analysis

We also found a silencer within 2000BP of the gene's Transcription Start Site. This potentially explains how increased methylation also causes a spike in gene expression, albeit a smaller one than hypomethylations.

🔍

Figure: Presence of a Silencer Within the Promoter Region of the SLC7A5 Gene

Cg00728300 is a probe that is 178 base pairs from the transcription start site and is likely in the promoter region. It is also the closest probe to the promoter, in terms of genomic distance.

On exclusively hypomethylating that probe, we see the gene expression jump from a baseline of 8.777103 to a value of 9.653147 (Δ = +0.876043). This is a singular performance as most 1-probe hypomethylations do not cause such large flux changes.

To see the consequences of the methylation on the metabolic network, we integrated the newly predicted gene expression results into the metabolic models and analysed changes in the flux distribution.

The metabolic model already has the amino acid transport reactions come under direct control of the SLC7A5 gene, so it's no surprise that we were able to see increased amino acid importing. What flux balance analysis allows us to model are also flux changes in other reactions in the connected network, which may not directly be controlled by the SLC7A5 gene, but are knockoff effects.

4 key findings:

First, Carnitine was significantly downregulated as a result of increased hypomethylation. Dysfunction of the carnitine cycle has been shown to influence tumorigenesis and progression by altering intracellular oxidative and inflammatory status or regulating tumor metabolic flexibility.

🔍

Figure: Reduced Carnitine Production

Second, ATP Phosphotransferases and other ATP-generating reactions saw a significant uptick. This is intuitive as cancer cells generally consume more ATP.

🔍

Figure: Increased ATP Production

Third, the SLC7A5 gene codes for a symporter, which uses glutamine export as a secondary active force to import the amino acids. We also saw increased glutamine transport out of the cell.

🔍

Figure: Glutamine Transport Outside the Cell

Fourthly, Valeric and Isovaleric Acid synthesis was downregulated. Both of which are well-documented inhibitors of tumor cells.

🔍

Figure: Underexpression of Valeric Acid and Derivatives

Part 4: ENIGMA (Epigenetic-Network Integration for Genomic Metabolic Analysis)

ENIGMA is a software platform we developed to predict both gene expression and metabolic flux distributions by integrating epigenetic information directly into genome-scale metabolic models.

At its core, ENIGMA uses the DeepMethyGene architecture we described in the AI Model section and the GPR Pruning-Based Algorithm described in our modelling section, allowing methylation data to reshape reaction flux capacities in a rule-consistent manner.

The software is simple to use. Users upload a methylation profile normalized to beta values as a CSV file (obtained from standard Illumina methylation sequencing), and ENIGMA processes it to produce a plot of how different promoter methylation patterns affects gene expression within a 2 Kbp window around the TSS as well as a flux distribution obtained after passing the gene expressions to our custom algorithm.

ENIGMA is designed as a multi-stage workbench rather than a single-purpose script. By translating subtle promoter methylation differences into network-level flux predictions, it bridges epigenetic regulation and metabolic behavior in silico. Using ENIGMA and its earlier versions, we reproduced experimental findings such as promoter-driven downregulation or overexpression of genes in cancer, with results that align closely with published literature. This agreement provides strong validation of the framework. Please view the results section for detailed studies on various genes of interests.

🔍

Figure: Workflow of ENIGMA

The workflow begins when the user specifies a set of target genes. ENIGMA then maps all relevant CpG probes from the reference dataset to the corresponding genomic loci (uses the Homo sapiens genome assembly GRCh37 - hg19), creating a structured data foundation. From this point, users can either analyze the reference dataset or apply pre-trained gene-specific models to their own methylation data. The outputs include quantitative predictions of expression and flux, alongside sensitivity analyses in which promoter methylation levels are titrated across a range of values and window sizes.

The flux distributions are produced upon employing our custom GPR-based algorithm to integrate gene expression data into the Humam1 genome scale metabolic model which is present in the universal Systems Biology Markup Language.

Currently, ENIGMA is implemented as a single Python script, with hyperparameters such as WINDOW_SIZE and learning rate defined as global constants. While this monolithic structure enabled rapid development, it also highlights areas for improvement. Future versions can be modularized into distinct components for data processing, modelling, and flux analysis; parameters can be managed through external configuration files; and the entire tool can be containerized with Docker to ensure reproducibility across systems.

By following this roadmap, ENIGMA can mature from a powerful research prototype into a production-grade, extensible platform. As it stands, ENIGMA already demonstrates how epigenetic signals can be computationally linked to metabolic function, enabling detailed in silico experiments that probe the connections between regulation at the promoter and behavior across the metabolic network.

References

References and Citations

Yan,a Y., Chai, X., Liu, J. et al. DeepMethyGene: a deep-learning model to predict gene expression using DNA methylations. BMC Bioinformatics 26, 99 (2025). https://doi.org/10.1186/s12859-025-06115-2
Becker SA, Palsson BO (2008) Context-Specific Metabolic Networks Are Consistent with Experiments. PLOS Computational Biology 4(5): e1000082. https://doi.org/10.1371/journal.pcbi.1000082
Fedoroff, M. Y., Zhao, L., Wang, S., Bhushan, A., Yang, H., Bussard, K. M., Peiper, S. C., & He, J. (2025). Amino acid transporter LAT1 (SLC7A5) promotes metabolic rewiring in TNBC progression through the L-Trp/QPRT/NAD+ pathway. Journal of experimental & clinical cancer research : CR, 44(1), 190. https://doi.org/10.1186/s13046-025-03446-z
Wang, X., Yang, C., Huang, C., & Wang, W. (2024). Dysfunction of the carnitine cycle in tumor progression. Heliyon, 10(16), e35961. https://doi.org/10.1016/j.heliyon.2024.e35961
Huang, K. T., Dobrovic, A., & Fox, S. B. (2009). No evidence for promoter region methylation of the succinate dehydrogenase and fumarate hydratase tumour suppressor genes in breast cancer. BMC research notes, 2, 194. https://doi.org/10.1186/1756-0500-2-194
Shi, F., Li, Y., Han, R. et al. Valerian and valeric acid inhibit growth of breast cancer cells possibly by mediating epigenetic modifications. Sci Rep 11, 2519 (2021). https://doi.org/10.1038/s41598-021-81620-x
Brown NJ, Higham SE, Perunovic B, Arafa M, Balasubramanian S, et al. (2013) Lactate Dehydrogenase-B Is Silenced by Promoter Methylation in a High Frequency of Human Breast Cancers. PLOS ONE 8(2): e57697. https://doi.org/10.1371/journal.pone.0057697
Sen, P., & Orešič, M. (2023). Integrating Omics Data in Genome-Scale Metabolic Modeling: A Methodological Perspective for Precision Medicine. Metabolites, 13(7), 855. https://doi.org/10.3390/metabo13070855
Shen, F., Boccuto, L., Pauly, R., Srikanth, S., & Chandrasekaran, S. (2019). Genome-scale network model of metabolism and histone acetylation reveals metabolic dependencies of histone deacetylase inhibitors. Genome Biology, 20(1). https://doi.org/10.1186/s13059-019-1661-z
Dewi, C., Chen, R.-C., & Tai, S.-K. (2020). Evaluation of Robust Spatial Pyramid Pooling Based on Convolutional Neural Network for Traffic Sign Recognition System. Electronics, 9(6), 889. https://doi.org/10.3390/electronics9060889
Morrissey, J., Strain, B., Kontoravdi, C. (2024). Flux Balance Analysis of Mammalian Cell Systems. In: Ceroni, F., Polizzi, K. (eds) Mammalian Synthetic Systems. Methods in Molecular Biology, vol 2774. Humana, New York, NY. https://doi.org/10.1007/978-1-0716-3718-0_9
Becker, S. A., & Palsson, B. O. (2008). Context-Specific Metabolic Networks Are Consistent with Experiments. PLoS Computational Biology, 4(5), e1000082. https://doi.org/10.1371/journal.pcbi.1000082
Du, P., Zhang, X., Huang, C.-C., Jafari, N., Kibbe, W. A., Hou, L., & Lin, S. M. (2010). Comparison of Beta-value and M-value methods for quantifying methylation levels by microarray analysis. BMC Bioinformatics, 11, 587. https://doi.org/10.1186/1471-2105-11-587
Selvaraju, R. R., Cogswell, M., Das, A., Vedantam, R., Parikh, D., & Batra, D. (2020). Grad-CAM: Visual explanations from deep networks via gradient-based localization. International Journal of Computer Vision, 128, 336–359. https://doi.org/10.1007/s11263-019-01228-7
He, K., Zhang, X., Ren, S., & Sun, J. (2014). Spatial pyramid pooling in deep convolutional networks for visual recognition. In ECCV 2014 (pp. 346–361). Springer. https://doi.org/10.1007/978-3-319-10578-9_23
Zur, H., Ruppin, E., & Shlomi, T. (2010). iMAT: An integrative metabolic analysis tool. Bioinformatics, 26(24), 3140–3142. https://doi.org/10.1093/bioinformatics/btq602
Liberti, M. V., & Locasale, J. W. (2016). The Warburg effect: How does it benefit cancer cells? Trends in Biochemical Sciences, 41(3), 211–218. https://doi.org/10.1016/j.tibs.2015.12.001
Selak, M. A., Armour, S. M., MacKenzie, E. D., et al. (2005). Succinate links TCA cycle dysfunction to oncogenesis by inhibiting HIF-α prolyl hydroxylase. Cancer Cell, 7(1), 77–85. https://doi.org/10.1016/j.ccr.2004.11.022
Wang, Q., Holst, J., L-type amino acid transport and cancer. (2015). American Journal of Cancer Research, 5(4), 1281–1294. (LAT1/SLC7A5 review)

Project ECHO

Epigenetic Control of gene expression with Hybrid Optimization​

Introduction​

Software Overview​

Why Two Different Models?​

Biological Validation​

IGF2 - 1Mb window​

GSTT1 - 1Mb window​

ACTB - 1Mb window​

SLC7A5 - 10Mb window​

ECHO: Models, Pipeline, and Results​

Datasets Used​

Model 1: Elastic Net Regression (inspired by geneEXPLORE)​

Model Overview​

Optimization​

Reported Metrics (from geneEXPLORE)​

Model 2: Adaptive Regressive CNN (Inspired by DeepMethyGene)​

Model Overview​

Reason for Choosing this Model​

Model Architecture​

Reported Performance​

The ECHO Pipeline​

Supported Algorithms​

Results​

Algorithm 1 - Standard Sequential​

Algorithm 2 - Circular Sequential​

Algorithm 3 - gradCAM Sequential​

Algorithm 4 - elasticNet Sequential​

Observations​

Advantages of elasticNet Sequential Algorithm​

Self-Validation of Predicted Results​

Example of Self-Validation​

Additional Note on "Juiced" Algorithms​

Future Directions​

Non-sequential Methylation Algorithms​

Developing Better Heuristics​

gRNA Recommendations​

References​

Community Contributions​

Licensing and Usage​

ENIGMA: Epigenetic Networks Informing Genome-scale Metabolic Analysis

Introduction​

Part 1: Deep Learning Model to Predict Gene Expression​

Part 2: Integrating Gene Expression in a Metabolic Model​

Part 3: Putting It All Together​

Case 1: SDHB​

Case 2: LDHB​

Case 3: SLC7A5​

Part 4: ENIGMA (Epigenetic-Network Integration for Genomic Metabolic Analysis)​

References​

Epigenetic Control of gene expression with Hybrid Optimization

Introduction

Software Overview

Why Two Different Models?

Biological Validation

IGF2 - 1Mb window

GSTT1 - 1Mb window

ACTB - 1Mb window

SLC7A5 - 10Mb window

ECHO: Models, Pipeline, and Results

Datasets Used

Model 1: Elastic Net Regression (inspired by geneEXPLORE)

Model Overview

Optimization

Reported Metrics (from geneEXPLORE)

Model 2: Adaptive Regressive CNN (Inspired by DeepMethyGene)

Model Overview

Reason for Choosing this Model

Model Architecture

Reported Performance

The ECHO Pipeline

Supported Algorithms

Results

Algorithm 1 - Standard Sequential

Algorithm 2 - Circular Sequential

Algorithm 3 - gradCAM Sequential

Algorithm 4 - elasticNet Sequential

Observations

Advantages of elasticNet Sequential Algorithm

Self-Validation of Predicted Results

Example of Self-Validation

Additional Note on "Juiced" Algorithms

Future Directions

Non-sequential Methylation Algorithms

Developing Better Heuristics

gRNA Recommendations

References

Community Contributions

Licensing and Usage

Introduction

Part 1: Deep Learning Model to Predict Gene Expression

Part 2: Integrating Gene Expression in a Metabolic Model

Part 3: Putting It All Together

Case 1: SDHB

Case 2: LDHB

Case 3: SLC7A5

Part 4: ENIGMA (Epigenetic-Network Integration for Genomic Metabolic Analysis)

References