SOFTWARE
Our Software project consists of two tools that enable precise epigenetic control and metabolic understanding in synthetic biology. The first tool, ECHO, supports experimental design using the dCas9-DAM fusion protein, which allows targeted DNA methylation to regulate gene expression with specificity. The second tool, ENIGMA, integrates machine learning and genome-scale metabolic modeling to predict how these methylation changes impact gene expression and cellular metabolism. Together, these tools form a platform that models and optimizes the entire experimental pipeline, from designing targeted epigenetic edits to simulating their functional metabolic consequences, paving the way for rational engineering of gene regulation and metabolic pathways in health and disease.
iGEM IIT-Madras 2025
GitLab Repository: https://gitlab.igem.org/2025/software-tools/iit-madras
A sample collaboration runthrough can be found on the same repository.
Project ECHO
Custom software tools, analysis scripts, and computational resources we've developed for our project and the community.
Epigenetic Control of gene expression with Hybrid Optimization
Introduction
Our main focus while developing our software was to build a tool that aids in designing experiments with the new parts developed by the Wet Lab team, namely the dCas9-DAM fusion protein capable of targeted methylation for regulatory control.
Taking the perspective of a researcher seeking to control protein expression via methylation, we came up with two questions that seemed to be the most appropriate:
- If I want to control the expression of this protein, which genes' expressions do I need to alter, and by how much?
- Given I have a tool capable of performing targeted methylation, where on the genome do I introduce methylations to achieve the required deltas in gene expression?
While the first question is tackled by the modelling-based team, we decided to tackle the second question of identifying the most relevant sites for methylation to accurately control gene expression.
Software Overview
In this context, the tool ECHO (Epigenetic Control of gene expression with Hybrid Optimization) was developed. The idea behind it is simple:
- Use an ElasticNet architecture to learn important sites for gene expression regulation
- Use an AdaptiveRegressiveCNN to predict which of these sites to regulate in order to achieve target gene expression

With ECHO, we pave the way to reduce experimental pipelines from days to hours, based on data availability. The comparison between pipelines with and without ECHO, and the pros/cons of ECHO, are given below:


Why Two Different Models?
Through our DBTL cycles, we realized that the exact interplay between methylation and gene expression is:
- Complex - average methylation values of the genome do not have predictive power, which shows site-level methylation is an important feature
- Non-local - it is demonstrated that the predictive power of models such as DeepMethyGene increases monotonically when larger methylation windows are considered around the genome (a 10Mb window of methylation data performs significantly better than a 1Mb window)
Therefore, the choice of which AI model to train proved critical. AdaptiveRegressiveCNNs (such as DeepMethyGene), while offering better predictive accuracy, proved difficult to extract interpretable feature importances from, for the following reasons:
- Being a CNN-based architecture, individual feature importances were lost, as long-range feature relationships were learnt. The importance maps (obtained via gradCAM) were smoothed out, and distinct peaks were not visible.
- While techniques such as gradCAM allowed us to visualize "important" sites for prediction, the direction of influence for these sites (i.e. does methylation at this site upregulate or downregulate gene expression) was not reliably extractable.
- Since the CNN learnt relationships between features, it tended to not give any features very low importance. This would be detrimental for search algorithms since every site would require explicit handling.

To mitigate these issues, we designed an ElasticNet Linear Regressor based on geneEXPLORE, a precursor to the DeepMethyGene model. While it has lower accuracy overall, it has the advantage that, as a linear regressor, its coefficients are directly interpretable and pick out clear sites of importance for regulatory control of the gene.

The above weights are suitable for further search algorithms since:
- A large number of weights have been assigned ~0 importance, so they can be ignored
- Individual feature importances clearly retained
- Clear peaks of importance visible
- Direction of influence available
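The properties listed above can be turned into a simple site filter. The following is a minimal sketch, not ECHO's actual implementation: the function name `candidate_sites`, the tolerance `tol`, and the probe identifiers are all hypothetical, and the weights are made-up numbers used only for illustration.

```python
def candidate_sites(weights, tol=1e-6):
    """Split CpG sites into up- and down-regulating candidates by coefficient sign.

    weights: dict mapping a site identifier to its ElasticNet coefficient.
    tol: coefficients with |w| <= tol are treated as zero and dropped
         (hypothetical threshold, chosen here for illustration).
    """
    kept = {site: w for site, w in weights.items() if abs(w) > tol}
    # Rank by magnitude so the strongest sites come first.
    ranked = sorted(kept.items(), key=lambda item: abs(item[1]), reverse=True)
    up = [site for site, w in ranked if w > 0]    # methylation raises predicted expression
    down = [site for site, w in ranked if w < 0]  # methylation lowers predicted expression
    return up, down


# Made-up coefficients for six hypothetical probes.
weights = {"cg01": 0.0, "cg02": 0.42, "cg03": -0.10,
           "cg04": 1e-9, "cg05": -0.75, "cg06": 0.05}
up, down = candidate_sites(weights)
print(up)    # ['cg02', 'cg06']
print(down)  # ['cg05', 'cg03']
```

Because most coefficients are near zero, the search space shrinks to a handful of sites, and the sign of each remaining coefficient gives the direction of influence.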
Biological Validation
To check whether the learned importances were biologically meaningful, we compared the peaks against annotated enhancers and silencers for the gene from the NCBI database, and found that, for the genes we tested, most annotated enhancer/silencer regions had a corresponding peak in the geneEXPLORE weights.
NCBI does not annotate enhancer or silencer regions located more than 1–2 Mb from most genes. To explain the peaks observed beyond this range, we refer to the methylation mechanism proposed in the geneEXPLORE paper, which suggests that DNA looping can bring segments located up to 10 Mb away from the promoter into close proximity, thereby influencing gene expression.

With these verification practices in mind, we ran the weights analysis for a sample of genes, with promising results:
IGF2 - 1Mb window
Peaks observed around ~250 kb correspond to enhancer regions in the same window

GSTT1 - 1Mb window
No documented enhancers more than 6kb from the promoter, so the site at the promoter is assigned the most importance

ACTB - 1Mb window
Housekeeping gene, hence, lack of peaks corresponds to lack of methylatory regulation

SLC7A5 - 10Mb window
Documented to have a promoter hyper-sensitive to methylation - abnormal expression in cancer cells

ElasticNet Linear Regressor:
- Capable of learning individual feature importances
- Poor overall predictive power
AdaptiveRegressiveCNN:
- Bad at learning individual feature importances
- Significantly better overall predictive power
Thus, we decided to adopt an ElasticNet to learn feature importances, but an AdaptiveRegressiveCNN for final predictions.
ECHO: Models, Pipeline, and Results
Datasets Used
Methylation dataset: TCGA breast invasive carcinoma (BRCA) DNA methylation (HumanMethylation450)
Dataset Link
Gene expression dataset: TCGA breast invasive carcinoma (BRCA) exon expression by RNAseq (polyA+ IlluminaHiSeq)
Dataset Link
Model 1: Elastic Net Regression (inspired by geneEXPLORE)
Elastic Net regression is a regularized linear regression model that combines the penalties of both Lasso (L1) and Ridge (L2) regression. It has been shown to perform effectively in high-dimensional biological data such as DNA methylation–gene expression relationships, where many features are correlated.
Model Overview
The Elastic Net model assumes the following regression form:
y = β₀ + Σᵢ₌₁ᵖ βᵢxᵢ + ε
Where:
- y = gene expression value
- xᵢ = methylation values at CpG sites (features) - methylation probability (beta-value) between 0 and 1
- βᵢ = regression coefficient for feature i
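- ε = error term
The coefficients are estimated by minimizing the Elastic Net objective:
(1/2n) Σⱼ₌₁ⁿ (yⱼ − β₀ − Σᵢ₌₁ᵖ βᵢxⱼᵢ)² + λ[α Σᵢ₌₁ᵖ |βᵢ| + ((1 − α)/2) Σᵢ₌₁ᵖ βᵢ²]
Where: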
- n = number of samples
- λ = overall regularization strength
- α ∈ [0,1] where α is the mixing parameter (α = 1 implies Lasso, α = 0 implies Ridge, 0 < α < 1 corresponds to Elastic Net)
Optimization
- Algorithm: Coordinate Descent (efficient for high-dimensional sparse problems)
- Hyperparameters: tuned via cross-validation; α typically chosen between 0.1 and 0.5 for correlated methylation features
- Scaling: Input methylation features are standardized to mean 0 and variance 1
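To make the optimization settings above concrete, here is a minimal pure-Python sketch of coordinate descent for Elastic Net on standardized features. It assumes the common glmnet-style objective (stated in the docstring); the function names, the toy data, and the values of `lam` and `alpha` are illustrative assumptions, not the settings used in ECHO or geneEXPLORE.

```python
def soft_threshold(z, gamma):
    """Soft-thresholding operator used by the L1 part of the penalty."""
    if z > gamma:
        return z - gamma
    if z < -gamma:
        return z + gamma
    return 0.0


def elastic_net_cd(X, y, lam, alpha, n_sweeps=200):
    """Coordinate descent for (assumed objective)
    (1/2n) * sum((y - b0 - Z @ beta)^2) + lam * (alpha*|beta|_1 + (1-alpha)/2*|beta|_2^2),
    where Z is X standardized to mean 0 and variance 1.
    Returns the intercept and the coefficients on the standardized scale.
    """
    n, p = len(X), len(X[0])
    means = [sum(row[j] for row in X) / n for j in range(p)]
    # Constant columns get a divisor of 1.0 to avoid division by zero.
    stds = [(sum((row[j] - means[j]) ** 2 for row in X) / n) ** 0.5 or 1.0
            for j in range(p)]
    Z = [[(row[j] - means[j]) / stds[j] for j in range(p)] for row in X]
    intercept = sum(y) / n
    beta = [0.0] * p
    resid = [yi - intercept for yi in y]
    for _ in range(n_sweeps):
        for j in range(p):
            # Covariance of feature j with the residual that excludes feature j.
            rho = sum(Z[i][j] * (resid[i] + beta[j] * Z[i][j]) for i in range(n)) / n
            new = soft_threshold(rho, lam * alpha) / (1.0 + lam * (1.0 - alpha))
            delta = new - beta[j]
            if delta != 0.0:
                for i in range(n):
                    resid[i] -= delta * Z[i][j]
                beta[j] = new
    return intercept, beta


# Toy data: expression rises with site 0, falls with site 1, ignores site 2.
x0 = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8]
x1 = [0.9, 0.1, 0.8, 0.2, 0.7, 0.3, 0.6, 0.4]
x2 = [0.5, 0.5, 0.4, 0.6, 0.5, 0.5, 0.6, 0.4]
X = list(zip(x0, x1, x2))
y = [5.0 + 4.0 * a - 2.0 * b for a, b in zip(x0, x1)]
intercept, beta = elastic_net_cd(X, y, lam=0.1, alpha=0.5)
print(intercept, beta)
```

On this toy data the first coefficient comes out positive, the second negative, and the third is shrunk to exactly zero, which is the sparsity behaviour the L1 term is meant to provide. In scikit-learn's `ElasticNet`, the same objective is used with λ passed as `alpha` and the mixing parameter α passed as `l1_ratio`.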
Reported Metrics (from geneEXPLORE)
- Mean R²: ~0.05–0.15 depending on gene and tissue type
- Correlation (Pearson's r): ~0.2–0.4 range
In our case, although we applied Elastic Net to a different set of genes, we observed results within the same ranges as reported by geneEXPLORE.
Model 2: Adaptive Regressive CNN (Inspired by DeepMethyGene)
Model Overview
The Adaptive Regressive CNN is a deep learning model designed to predict gene expression from DNA methylation data, specifically CpG beta values. It is based on a ResNet-style convolutional neural network, as described in the DeepMethyGene paper.
Reason for Choosing this Model
The Adaptive Regressive CNN architecture was chosen because it is uniquely suited to capture the complex, nonlinear relationships between DNA methylation patterns and gene expression. Unlike linear models such as ElasticNet, which assume additive and independent effects of each CpG site, the CNN-based approach leverages convolutional layers and residual connections to learn both local and long-range dependencies among CpG sites.

Model Architecture
Input: Vector of CpG beta values for a gene's promoter region: x = [β₁, β₂, …, βₙ] where βᵢ is the methylation beta value at CpG site i
Convolutional Layers:
- Several 1D convolutional layers (e.g., kernel size 3, 64 filters) extract local patterns from the input vector
- Each convolution is followed by batch normalization and a LeakyReLU activation
Residual (ResNet) Blocks: Each block contains two Conv1D layers with a skip connection, so the block output is the sum of the block input and the output of the two convolutional layers.

Adaptive Pooling: Reduces the sequence dimension, allowing the model to handle variable-length input windows.
Fully Connected Layers:
- Dense(128) + ReLU
- Dense(1) for the final gene expression prediction ŷ

Training Configuration:
- Optimizer: Adam (learning rate typically 0.001)
- Regularization: L2 weight decay (e.g., λ = 10⁻⁴)
Reported Performance
The DeepMethyGene paper reports mean R² ≈ 0.64 and RMSE ≈ 0.25 on held-out test sets of breast cancer genes. In our experiments, results fell within these reported ranges.
The ECHO Pipeline
The full documentation of our tool can be found on the GitLab repository. A standard pipeline is as follows:
- Choose your gene of interest
- Compile the dataset for your gene (ECHO offers data compilation for roughly 2k human genes whose methylation and gene expression data were available in the UCSC datasets)
- Train an ElasticNet on the dataset to obtain the weights for each site
- Train an AdaptiveRegressiveCNN on the dataset to use as the model to predict gene expression
- Input the methylation data for your cell of interest (by default, ECHO considers the average methylation values from the dataset for each site as its baseline)
- Choose the algorithm and direction of regulation to begin prediction
When we say the model "methylates" a site, this means that it sets the beta value corresponding to that site in the input vector to 1, to reflect complete methylation, as is expected to be achieved eventually by the Wet Lab's dCas9-DAM part. Once more data becomes available, the algorithm can be improved to be more accurate.
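The notion of "methylating" a site can be sketched as a short loop. This is an illustrative sketch under stated assumptions, not ECHO's code: `sequential_methylation` and `toy_predict` are hypothetical names, and the toy linear predictor merely stands in for a trained AdaptiveRegressiveCNN.

```python
def sequential_methylation(baseline, order, predict, target, upregulate=True):
    """Methylate sites one at a time (beta set to 1.0) until the target is reached.

    baseline: list of beta values, one per CpG site.
    order: site indices in the order they should be methylated.
    predict: callable mapping a beta vector to a predicted expression value
             (a stand-in for the trained expression model).
    Returns the methylated site indices and the predicted expression after each step.
    """
    betas = list(baseline)
    chosen = []
    trajectory = [predict(betas)]
    for site in order:
        current = trajectory[-1]
        reached = current >= target if upregulate else current <= target
        if reached:
            break
        betas[site] = 1.0  # "methylate" the site
        chosen.append(site)
        trajectory.append(predict(betas))
    return chosen, trajectory


# Toy linear stand-in for the expression model (made-up weights).
toy_weights = [0.5, -0.25, 2.0, 1.0]


def toy_predict(betas):
    return 4.0 + sum(w * b for w, b in zip(toy_weights, betas))


baseline = [0.5, 0.5, 0.5, 0.5]
chosen, trajectory = sequential_methylation(baseline, [2, 3, 0], toy_predict, target=7.0)
print(chosen)      # [2, 3]
print(trajectory)  # [5.625, 6.625, 7.125]
```

The length of `chosen` is the quantity plotted on the x-axis when algorithms are compared: fewer methylated sites for the same target means a cheaper experiment.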
Supported Algorithms
ECHO currently supports four algorithms (described here for upregulation):
- Standard sequential - sequentially methylates DNA sites from the most upstream to the most downstream
- Circular sequential - methylates sites closer to the promoter and works its way outwards
- gradCAM sequential - methylates sites in order of the importances generated by gradCAM
- elasticNet sequential - methylates sites given positive importance by elasticNet, in order of decreasing magnitude
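The four orderings above differ only in how sites are sorted before they are methylated. The sketch below shows one possible reading of each rule; the function names are hypothetical, "upstream to downstream" is assumed to mean ascending genomic coordinate, and using negative weights for downregulation is an assumption rather than something stated by ECHO.

```python
def standard_order(positions):
    """Most upstream to most downstream (assumed: ascending coordinate)."""
    return sorted(range(len(positions)), key=lambda i: positions[i])


def circular_order(positions, promoter):
    """Sites closest to the promoter first, then working outwards."""
    return sorted(range(len(positions)), key=lambda i: abs(positions[i] - promoter))


def gradcam_order(importances):
    """Highest gradCAM importance first."""
    return sorted(range(len(importances)), key=lambda i: importances[i], reverse=True)


def elasticnet_order(weights, upregulate=True):
    """Sites whose coefficient sign matches the desired direction, largest magnitude first."""
    sign = 1 if upregulate else -1
    eligible = [i for i, w in enumerate(weights) if sign * w > 0]
    return sorted(eligible, key=lambda i: abs(weights[i]), reverse=True)


# Made-up example with four sites.
positions = [300, 100, 250, 180]
importances = [0.1, 0.7, 0.3, 0.9]
weights = [0.4, -0.2, 0.0, 0.9]

print(standard_order(positions))                    # [1, 3, 2, 0]
print(circular_order(positions, promoter=200))      # [3, 2, 0, 1]
print(gradcam_order(importances))                   # [3, 1, 2, 0]
print(elasticnet_order(weights))                    # [3, 0]
print(elasticnet_order(weights, upregulate=False))  # [1]
```

Only the elasticNet ordering discards sites outright, which is why it can stop after far fewer methylations than the other three.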

Results
The convergence of the different algorithms in ECHO is compared below for the human gene AADAT in a 10 Mb window. Note that gene expression in the dataset ranges from 3 to 10 FPKM units, with an average of around 5.4. The x-axis is the number of sites methylated according to the algorithm, while the y-axis is the gene expression in log₂(1+FPKM) units.
Algorithm 1 - Standard Sequential

Algorithm 2 - Circular Sequential

Algorithm 3 - gradCAM Sequential

Algorithm 4 - elasticNet Sequential

Observations
- The standard and circular sequential algorithms do not seem to be relying on any particular data during growth; the growth rate is stochastic
- The gradCAM sequential algorithm shows a slower initial growth rate than the first two, but picks up pace in the middle before saturating towards the end
- The elasticNet sequential algorithm is able to accurately predict which sites to methylate for the quickest increase in gene expression
The elasticNet sequential algorithm shows similar performance for downregulating gene expression as well:

Advantages of elasticNet Sequential Algorithm
- Fastest convergence rate - able to achieve maximum gene expression with the fewest methylations. This is good from an experimental point of view, since the fewer sites that require external methylation, the better.
- Faster running time - runs much faster than the other algorithms, since there is no need to run the AdaptiveRegressiveCNN after each methylation to check whether gene expression went up or down. This cost is shifted to computing the weights via geneEXPLORE.
Self-Validation of Predicted Results
Because precise epigenetic modification of gene regulation is still a new field, it is difficult to find datasets against which to verify exact results experimentally. So, to check whether the results predicted by ECHO were plausible, we used the following steps:
- For a given gene id, window of consideration, baseline methylation series, and targeted gene expression, the recommended methylation sites were generated
- The methylation series in the training data whose gene expression was closest to the target expression was identified
- It was checked whether the sites recommended for methylation had higher methylation values in that series than in the baseline
For a sample of genes, windows, and targets, 75-85% of the recommended sites satisfied this check, which is promising.
Example of Self-Validation
Interpreting changes proposed by ECHO. To take the expression of AADAT from its baseline of 5.6 to 8 units, ECHO proposed methylation of 40 sites.
When these 40 sites were compared to a methylation series in the training data that had an expression of 8 units, 33 out of the 40 sites in the training data had a higher beta value than the baseline.
Hence, 33/40 = 82.5% of sites were identified "correctly" by the ECHO algorithm.
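The check described above reduces to a per-site comparison. The following minimal sketch uses a hypothetical function name and made-up beta values; only the 33/40 figure comes from the example above.

```python
def validation_rate(recommended, baseline, reference):
    """Fraction of recommended sites whose beta value is higher in the
    reference series (closest to the target expression) than in the baseline."""
    hits = [site for site in recommended if reference[site] > baseline[site]]
    return len(hits) / len(recommended)


# Made-up beta values for four hypothetical probes.
baseline = {"cg01": 0.20, "cg02": 0.35, "cg03": 0.50, "cg04": 0.10}
reference = {"cg01": 0.60, "cg02": 0.30, "cg03": 0.90, "cg04": 0.40}
rate = validation_rate(["cg01", "cg02", "cg03", "cg04"], baseline, reference)
print(rate)     # 0.75

# The AADAT example above: 33 of 40 recommended sites pass the check.
print(33 / 40)  # 0.825
```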

Additional Note on "Juiced" Algorithms
The algorithms described so far are "sequential": given an order of sites to methylate, they methylate them one after another. Given that the inter-site relationships learnt by AdaptiveRegressiveCNNs are largely a black box, it is possible that there exist "non-sequential" algorithms, which also check whether demethylating previous sites while methylating new ones increases gene expression further.
While exploring this, we found that it is numerically possible to use a mixture of the elasticNet sequential and standard sequential algorithms to find methylation patterns that are "juiced", i.e., patterns for which the AdaptiveRegressiveCNN outputs gene expression values that are significantly greater than the maximum of the training data, as shown below:

This algorithm has been dubbed the elasticNet-sequential "juiced" algorithm, but is not included in the GitLab documentation for the following reasons:
- Lack of biological significance - Such high values are likely numerical artefacts
- Lack of experimental interest - In normal experiments, it is unlikely that changes in gene expression beyond one unit are required
- Requirement of large number of methylations - Such high values can only be obtained after selectively methylating hundreds of sites
Future Directions
Non-sequential Methylation Algorithms
As mentioned above, it may be possible to further reduce the number of methylations required by using non-sequential algorithms, which demethylate previously selected sites, to boost gene expression in fewer steps.
Developing Better Heuristics
Using the ElasticNet weights as a heuristic has shown promising results. However, it is possible to further fine-tune this heuristic using other kinds of data, for example:
- Integrating annotated enhancer/silencer regions from NCBI
- Integrating chromatin-capture data, which provides information on how likely two probes are to interact with each other on the basis of their physical distance in the cell
gRNA Recommendations
Once the minimum set of methylation sites is identified, it should then be possible to algorithmically predict the smallest number of gRNAs required to methylate all the identified sites, since it is possible to design gRNAs that are capable of interacting with multiple sites.
The end goal is to be able to suggest a single gRNA capable of methylating all sites necessary to bring gene expression exactly to the desired level.
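One way to frame the gRNA problem is as a set-cover problem: each candidate gRNA reaches some subset of the target sites, and the goal is to cover all targets with as few gRNAs as possible. The sketch below uses the classic greedy heuristic, which gives a small but not necessarily minimal set. It is purely illustrative: the mapping from gRNAs to reachable sites is hypothetical, and nothing here is part of ECHO today.

```python
def greedy_grna_cover(target_sites, grna_to_sites):
    """Greedy set cover: repeatedly pick the gRNA that reaches the most uncovered sites.

    target_sites: iterable of site identifiers that must be methylated.
    grna_to_sites: dict mapping a gRNA name to the set of sites it can reach
                   (hypothetical input; deriving it is outside this sketch).
    Returns the chosen gRNAs and any sites no candidate gRNA can reach.
    """
    uncovered = set(target_sites)
    chosen = []
    while uncovered:
        best = max(grna_to_sites, key=lambda g: len(uncovered & grna_to_sites[g]))
        gained = uncovered & grna_to_sites[best]
        if not gained:
            break  # remaining sites are unreachable with the given candidates
        chosen.append(best)
        uncovered -= gained
    return chosen, uncovered


# Made-up candidates covering five target sites.
grnas = {"gA": {1, 2, 3}, "gB": {3, 4}, "gC": {4, 5}, "gD": {5}}
chosen, uncovered = greedy_grna_cover([1, 2, 3, 4, 5], grnas)
print(chosen)     # ['gA', 'gC']
print(uncovered)  # set()
```

A practical version would also need to weigh off-target binding and gRNA efficiency, which this sketch ignores.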
References
- GeneEXPLORE - Elmarakeby, H. A., Hwang, J., Arafeh, R., Crowdis, J., Gang, S., Liu, D., ... & Van Allen, E. M. (2021). Biologically informed deep neural network for prostate cancer discovery. Nature, 598(7880), 348-352.
- DeepMethyGene - Li, Y., Zhang, X., & Wang, L. (2024). DeepMethyGene: A deep learning approach for predicting gene expression from DNA methylation patterns. BMC Bioinformatics, 25, 1-15.
- DeepPGD - Chen, M., Liu, S., & Zhou, K. (2024). DeepPGD: Deep learning for personalized genomic diagnosis using methylation data. International Journal of Molecular Sciences, 25(15), 8146.
- UCSC Xena Data Portal - Goldman, M. J., Craft, B., Hastie, M., Repečka, K., McDade, F., Kamath, A., ... & Haussler, D. (2020). Visualizing and interpreting cancer genomics data via the Xena platform. Nature Biotechnology, 38(6), 675-678.
Community Contributions
ECHO contributes to the synthetic biology and bioinformatics community by providing:
- Open-source implementation of epigenetic regulation prediction
- Standardized pipeline for methylation site identification
- Integration of multiple AI approaches for enhanced accuracy
- Validation methodologies for epigenetic predictions
Licensing and Usage
All source code and documentation are available under open-source licenses on our GitLab repository. The tool is freely available for academic and research purposes.
ENIGMA: Epigenetic Networks Informing Genome-scale Metabolic Analysis
Introduction
Rationale for Epigenetically Informed Metabolic Networks:
Gene expression is not hardwired into DNA; it's negotiated, rewritten, and enforced through epigenetics. Epigenetic modifications, particularly DNA methylation, serve as critical regulators that fine-tune gene activity without altering the genetic code itself. By modifying methylation patterns in promoter regions, cells can stably silence or activate genes, influencing cellular identity and function.
Cancer is a textbook case of epigenetics gone wrong. Tumor cells typically undergo a "double jeopardy": widespread hypomethylation that activates oncogenes, paired with local hypermethylation that shuts down tumor suppressor genes. The ripple effects extend far beyond gene expression, and methylation actively reshapes the cell's metabolic profile.
Cancer cells rewire their entire metabolic network to fuel relentless growth, with classic signatures like the Warburg effect (aerobic glycolysis) and altered glutamine metabolism.

Epigenetically informed genome-scale metabolic models (GEMs) can directly connect methylation to expression, and expression to metabolism. Such models can illuminate how cancer cells use epigenetic reprogramming to hijack metabolism and the impacts of epigenetic changes on the metabolic network.
Broadly, outside of applications in analyzing cancer cell metabolism, this pipeline can allow us to analyze the metabolic impacts of any precise methylation that the EPIC toolkit can generate.
Existing Techniques:
Genome-scale metabolic modeling (GSMM) has already given researchers powerful tools to simulate cellular metabolism under various genetic and environmental conditions. But where does epigenetics fit in?
There have been two main approaches so far:
Epigenome-Scale Metabolic Models (EGEMs): These models expand human GEMs by explicitly adding reactions for histone acetylation. By optimizing both biomass and acetylation, they allow researchers to study trade-offs between cell growth and global epigenetic activity. But there's a catch: EGEMs treat acetylation as a single "pool," meaning they can only track whether overall acetylation goes up or down—not the precise effects of specific marks. In short, they ask: "How does metabolism influence epigenetics?"

Gene Expression-Based Integration: A more indirect approach is to use transcriptomics. Context-specific models integrate gene expression data by "turning on" reactions associated with highly expressed genes and suppressing others. Many studies link epigenetics to metabolism this way: measure gene expression under two different epigenetic states, then compare the outcomes. For example, Salehzadeh-Yazdi et al. integrated yeast transcriptomes under different histone-tail mutations to see how acetylation altered global metabolism. This works, but the method can't prove that changes in expression came only from methylation; other factors could be at play.

Overview of Our Technique:
We're taking a different route. Instead of retroactively linking methylation to metabolism through gene expression snapshots, we predict gene expression directly from methylation profiles using machine learning.

A rough pipeline of the project is:
- Feed a methylation profile into a predictive model.
- Generate a synthetic but biologically grounded gene expression profile.
- Plug these predicted expressions into a metabolic model, and apply flux balance algorithms to obtain the flux distribution of the cell after the epigenetic perturbation.
This lets us ask questions like: "What would happen to metabolism if we flip a single methylation switch?" Instead of passively observing how metabolism correlates with epigenetic states, we can actively simulate how precise epigenetic modifications ripple through the network.
Why Does This Matter?
The most immediate use case of the pipeline is in cancer metabolism, where methylation-driven rewiring of pathways fuels uncontrolled growth and resistance to therapy. We can now pinpoint exactly where deviant methylations in cancer genes affect the metabolism of the cell.
But the applications don't stop there. Epigenetically informed GEMs could help uncover how diet or environment influences metabolic health, predict vulnerabilities in diseases like diabetes or neurodegeneration where methylation is disrupted, and even guide precision medicine by simulating how targeted epigenetic drugs would reshape a patient's metabolism.
Together with our EPIC toolkit, we can go beyond simulation. By testing out optimal methylation profiles in silico and then making precise epigenetic edits experimentally, we create a closed loop between modeling and engineering, allowing us to design and build cells with finely tuned, optimal behavior.
Part 1: Deep Learning Model to Predict Gene Expression
Why Predict Expression from Methylation?
DNA methylation is one of the most powerful switches for regulating gene activity. When CpG islands in promoters are methylated, transcription often grinds to a halt; when they are demethylated, transcription ramps up. This makes "predicting gene expression from methylation profiles" a valid and exciting problem: if we can read the methylation code, we can anticipate how active a gene will be.
But here's the catch: this relationship isn't simple or deterministic. Methylation is just one part of the regulatory orchestra. Chromatin accessibility, histone marks, and transcription factors all weigh in. That means we can't just deterministically calculate expression from methylation with a neat formula. Instead, we need a flexible model that can learn the rules from data, capturing both the obvious and the subtle patterns.
This lends itself extremely well to machine learning approaches.
The Architecture: Adaptive Convolutional Neural Network (CNN) With Residual Blocks

We use the architecture described in DeepMethyGene, a convolutional neural network (CNN) designed specifically to link DNA methylation to gene expression. The backbone is ResNet-inspired, with residual blocks and skip connections that prevent vanishing gradients and keep training stable.
The key idea is in how we represent the input. For each gene, we collect the M values of all CpG probes located within a 10 Mb window around its transcription start site (TSS). This window captures both local promoter methylation (directly influencing transcription initiation) and long-range regulatory elements (enhancers or CpG clusters that can also modulate expression). Because the number of CpG probes varies widely from gene to gene, the model is gene-specific, adapting its input channels to whatever probe set is available for that gene.
Once the probe values are arranged, the CNN treats them like a structured sequence, scanning across them with convolutional filters. This allows the network to detect both local patterns (e.g., a CpG island being consistently methylated) and broader spatial arrangements (e.g., long-range clusters of methylation upstream of the promoter). The residual connections ensure that deeper layers don't forget simpler patterns as they build up hierarchical features.
Generating the Dataset:
We trained our models on paired DNA methylation and RNA-seq data from 873 TCGA breast cancer samples. Both the 450K methylation array data and the HiSeq 2000 expression data were downloaded from the Xena Public Data Hubs.
The methylation arrays report values as β values, a quantitative measure of DNA methylation at individual CpG sites. Formally, a β value is calculated as the ratio of methylated probe intensity to the sum of methylated and unmethylated probe intensities:
β = M / (M + U + α)
where M and U represent the measured signal intensities of methylated and unmethylated alleles, and α is a constant added to stabilize values when intensities are low. The result is a continuous value between 0 (completely unmethylated) and 1 (fully methylated).
Intuitively, you can think of β values as the "fraction of molecules carrying a methyl group" at a particular CpG site. If out of 100 DNA molecules, 70 are methylated, the β value would be ~0.7.
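The definition above is a one-line computation. In this minimal sketch, the function name is hypothetical and the default offset of 100 is an assumption based on the value commonly used with Illumina arrays; the text above does not state which constant was used.

```python
def beta_value(methylated, unmethylated, offset=100.0):
    """Beta value from methylated (M) and unmethylated (U) probe intensities.

    offset is the stabilizing constant alpha; 100 is an assumed default.
    """
    return methylated / (methylated + unmethylated + offset)


# With no offset, 70 methylated out of 100 molecules gives 0.7.
print(beta_value(70, 30, offset=0))

# With realistic intensities, the offset only nudges the value slightly.
print(round(beta_value(7000, 3000), 4))  # 0.6931
```

The offset matters mainly for low-intensity probes, where it pulls the ratio towards 0 instead of letting noise dominate.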
To improve the statistical properties of these values, we converted β values into M values using a logit transform: