Background

Understanding how bacterial genomes translate raw DNA sequence into cellular behavior is one of the central challenges in synthetic biology. Unlike eukaryotes, bacteria have compact genomes, tightly coupled transcription and translation, and operon structures where multiple genes are regulated together. Accurately predicting bacterial regulatory outcomes from sequence alone could unlock powerful applications:

  • Synthetic promoter and RBS design for fine-tuned gene expression.
  • Operon-level engineering to coordinate multi-gene pathways.
  • Predictive modeling for metabolic pathway optimization.
  • Cross-species generalization to design in new bacterial hosts.

Traditional bioinformatics tools rely on motif scanning or limited-context models, but these often fall short when long-range interactions or subtle context signals matter. With the rise of deep learning in genomics, exemplified by models like AlphaGenome, we saw an opportunity to bring similar methods to bacteria. This vision became the foundation of our project: BactaGenome.

Introduction

BactaGenome (from bacteria + genome) is a conceptual deep learning framework we developed to explore how artificial intelligence can help us understand bacterial genomes. This work is inspired by DeepMind's recent advance AlphaGenome, which demonstrated the power of large neural architectures in reading genomic sequences and predicting functional outputs. Our goal was to bring similar ideas into the world of prokaryotic systems, where compact genomes and operon structures create unique regulatory patterns.

Instead of treating genomic prediction as many small, isolated problems, we envisioned a single architecture that could read in very long stretches of DNA sequence, along with contextual information about the organism type, and output multiple biological properties at once — from gene expression levels to Ribosome Binding Site (RBS) strength.

Initial adaptation of AlphaGenome
Initial architecture built from AlphaGenome reimplementation.

Project Journey

Our story unfolded through several phases of exploration, iteration, and reflection. What follows is the timeline of how BactaGenome grew from a bold idea into a sequence of experiments, lessons, and reflections.

Step 1: Building on AlphaGenome

Our initial vision was to create an “AlphaGenome for Bacteria.” Using an open-source reimplementation of AlphaGenome, we adapted the architecture to bacteria, reducing the context window to ~100K bp and redesigning the output heads for bacterial-specific tasks (promoter strength, RBS efficiency, operon regulation).
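To make the head redesign concrete, the bacterial-specific output heads on top of the shared sequence trunk can be sketched as below. This is a minimal PyTorch sketch, not our exact implementation: the class name, embedding size, and layer shapes are illustrative assumptions.

```python
import torch
import torch.nn as nn

class BacterialHeads(nn.Module):
    """Illustrative per-task output heads on a shared trunk embedding.

    The three tasks mirror the adaptation described above (promoter
    strength, RBS efficiency, operon regulation); all sizes are
    hypothetical.
    """
    def __init__(self, embed_dim: int = 768):
        super().__init__()
        self.promoter_strength = nn.Linear(embed_dim, 1)  # per-position scalar track
        self.rbs_efficiency = nn.Linear(embed_dim, 1)     # per-position scalar track
        self.operon_membership = nn.Linear(embed_dim, 2)  # in/out-of-operon logits

    def forward(self, trunk_out: torch.Tensor) -> dict:
        # trunk_out: (batch, seq_len, embed_dim) from the sequence encoder
        return {
            "promoter_strength": self.promoter_strength(trunk_out).squeeze(-1),
            "rbs_efficiency": self.rbs_efficiency(trunk_out).squeeze(-1),
            "operon_membership": self.operon_membership(trunk_out),
        }
```

Each head reads the same per-position embedding, so adding or removing a bacterial task only changes the head list, not the trunk.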

However, training produced unusual results: although the loss decreased, its scale was abnormally large, and the predictions did not align with biological data. We realized this mismatch hinted at deeper incompatibilities between mammalian-style data preprocessing and bacterial genomic datasets.

Loss curves abnormal scale
Loss decreased but at abnormally large scale.
Predictions not matching
Predictions showed poor biological alignment despite training.

Step 2: Rebuilding From Scratch

We decided to reimplement the model in PyTorch to gain full control over the architecture and training. To simplify the problem:

  • Reduced the context window to 16K bp (2^14).
  • Focused on a single modality: promoter strength.
  • Used MSE loss instead of Poisson + multinomial.
Model Configuration

```python
CONFIG = {
    "input_seq_len": 2 ** 14,             # 16,384 bp context window
    "initial_channels": 192,              # channels in the first encoder stage
    "channel_increment_per_stage": 64,
    "num_encoder_stages": 7,
    "transformer_num_blocks": 8,
    "transformer_num_heads": 6,
    "transformer_key_dim": 72,
    "pairwise_channels": 64,
    "pairwise_update_blocks": [0, 2, 4],  # transformer blocks with pairwise updates
    "dropout_rate": 0.1,
    "num_expression_tracks": 1            # single modality: promoter strength
}
```
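With a single promoter-strength track, the training objective reduces to a plain mean-squared error between predicted and observed tracks. A minimal sketch, with hypothetical tensor shapes matching the configuration above:

```python
import torch
import torch.nn.functional as F

# Hypothetical batch: one promoter-strength track over the 16,384 bp window,
# matching CONFIG["input_seq_len"] and CONFIG["num_expression_tracks"].
pred = torch.rand(4, 2 ** 14, 1)    # model output: (batch, seq_len, tracks)
target = torch.rand(4, 2 ** 14, 1)  # observed promoter-activity track

loss = F.mse_loss(pred, target)     # the single scalar objective used in this step
```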

This simplified model yielded surprising behavior: in some validation samples, the model captured the overall shape of promoter activity across the genome. Yet, it failed to clearly identify gene-specific expression levels. In other validation cases, the predictions were completely misaligned.

MSE predictions
MSE-based model sometimes captured global shapes but lacked gene-level clarity.

This stage taught us that MSE alone lacks theoretical grounding for regulatory genomics, though it confirmed that our architecture had learning capacity. The seemingly good loss values, however, lacked biological interpretability and would not generalize to real-world applications.


Step 3: Optimization

After several changes to the loss function, we also optimized other aspects of the model, including its hyperparameters and transformer architecture.

Step 4: Returning to Combined Loss

Realizing the limitations of MSE, we returned to the combined Poisson + multinomial loss described in the AlphaGenome paper. Conceptually:

  • The multinomial loss maximizes the likelihood of the predicted signal distribution across sequence positions, controlling the internal shape of predictions along the sequence.
  • The Poisson loss handles count regression by modeling the log-linear relationship between input features and expected counts, controlling the overall scale of predictions.
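One way to realize this shape/scale split, in the spirit of the paper's objective (our exact implementation differed; the tensor shapes, eps constant, and default weight are assumptions):

```python
import torch

def combined_loss(pred, target, eps=1e-7, multinomial_weight=1.0):
    """Poisson loss on the total count (scale) plus a multinomial loss on
    the positional distribution (shape). `pred` and `target` are
    non-negative tensors of shape (batch, seq_len)."""
    pred_total = pred.sum(dim=-1)      # predicted total signal per sample
    target_total = target.sum(dim=-1)  # observed total signal per sample

    # Poisson NLL on totals controls the overall scale of predictions.
    poisson = (pred_total - target_total * torch.log(pred_total + eps)).mean()

    # Multinomial NLL on per-position fractions controls the internal shape.
    pred_frac = pred / (pred_total.unsqueeze(-1) + eps)
    multinomial = -(target * torch.log(pred_frac + eps)).sum(dim=-1).mean()

    return poisson + multinomial_weight * multinomial
```

Because the multinomial term only sees normalized fractions, rescaling all predictions leaves the shape term unchanged and is penalized purely through the Poisson term — which matches the shape-right, scale-wrong behavior we observed.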

In practice, our experiments showed that the model could often capture shape but struggled with scale alignment.

Shape right scale wrong
Model learned internal shapes but not global scale.

We tried multiple strategies:

  • Adjusting weights between the two loss components

Define the composite loss as L = λ₁·L_mult + λ₂·L_Pois, where L_mult denotes the negative log-likelihood under a multinomial model and L_Pois denotes the negative log-likelihood under a Poisson model. By tuning the ratio λ₁/λ₂, one can balance the relative contributions of the two terms to accommodate differing noise structures or scales.

  • Replacing Poisson loss with MSE for stability.

In other words, the final loss function can be L = λ₁·L_mult + λ₂·L_MSE.

  • Modifying the output head structure to better match bacterial regulatory tasks.

  • Adjusting preprocessing by experimenting with different normalization strategies (e.g., mean-scaling, log transformation, and sqrt-based soft-clipping) to better match bacterial data distributions and preserve biologically relevant signal depth.
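The three normalization strategies can be sketched as simple track transforms (numpy; the clipping threshold and eps values are illustrative assumptions):

```python
import numpy as np

def mean_scale(track: np.ndarray) -> np.ndarray:
    """Scale a coverage track so its mean is ~1, preserving relative shape."""
    return track / (track.mean() + 1e-8)

def log_transform(track: np.ndarray) -> np.ndarray:
    """Compress dynamic range with log1p; keeps zeros at zero."""
    return np.log1p(track)

def sqrt_soft_clip(track: np.ndarray, threshold: float = 32.0) -> np.ndarray:
    """Leave values below `threshold` untouched and grow as sqrt above it,
    taming extreme coverage peaks while preserving depth information."""
    out = track.astype(float).copy()
    above = track > threshold
    out[above] = threshold + np.sqrt(track[above] - threshold)
    return out
```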

These produced partial improvements, but Pearson correlation plateaued around 0.4.

Scale adjusted but shape lost
Adjustments improved scale but harmed shape.
Statistics of mismatch
Validation scatter plot, correlation limited to ~0.4.

One promising idea was an auto-balance weighting method. By dynamically adjusting loss weights according to gradient magnitudes, we avoided one component dominating. This yielded stable training and smooth validation loss curves.

More specifically, this method prevents any single loss component from dominating the combined objective, ensuring that gradient descent proceeds stably and that each biological property is learned in proportion to its relevance. In addition, we applied gradient clipping for more stable training.
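A simplified sketch of the auto-balance idea: weight each loss component inversely to its gradient magnitude on the shared parameters. The normalization scheme shown here is an illustrative assumption, not our exact scheme.

```python
import torch

def auto_balance_weights(losses, shared_params, eps=1e-8):
    """Return one weight per loss, inversely proportional to the norm of
    that loss's gradient on the shared parameters, so no single component
    dominates the combined objective."""
    grad_norms = []
    for loss in losses:
        grads = torch.autograd.grad(loss, shared_params,
                                    retain_graph=True, allow_unused=True)
        total = sum(float(g.norm()) for g in grads if g is not None)
        grad_norms.append(total + eps)
    # Inverse gradient-norm weighting, normalized so the weights sum to
    # the number of loss components.
    inv = [1.0 / n for n in grad_norms]
    scale = len(losses) / sum(inv)
    return [w * scale for w in inv]
```

In a training step the combined objective would then be `sum(w * l for w, l in zip(weights, losses))`, followed by `torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm)` before the optimizer step.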

Auto-weighting
Auto-balance method stabilized validation loss but correlation remained low.

We also observed “jump points” caused by track-splitting in the loss computation. By unifying tracks, we improved internal prediction consistency.

Jump points
Jump points in outputs due to track splitting.
Unified scale
Unified tracks yielded more stable outputs.

Step 5: Engaging With DeepMind Researchers

At this point, we reached out to the AlphaGenome team at DeepMind. Their replies clarified that unnormalized targets preserve depth information, and that preprocessing steps such as mean-scaling and sqrt-based soft-clipping were essential. This reassured us that our struggles were not implementation errors, but fundamental differences between mammalian and bacterial data characteristics.


Step 6: Multi-Modal Expansion Attempt

Encouraged by their insights, we attempted to scale up and train three modalities together: expression level, promoter strength, and RBS efficiency. This aligned with AlphaGenome’s conclusion that multi-modal training improves performance. However, by late August, limited by time and resources, we had to stop training before achieving convergence.

Lessons Learned

  • Direct adaptation is not enough: Frontier architectures rely on preprocessing pipelines as much as on neural networks.
  • Loss design matters: Balancing scale vs shape requires theory and careful tuning.
  • Bacterial data differ: Sparse signals and coverage patterns introduce unique challenges.
  • Community engagement helps: Expert feedback clarified our design assumptions and offered future directions.

Vision Forward

Although BactaGenome did not converge to a fully working model, the vision remains powerful:

  • A unified bacterial genome model with multi-modal outputs.
  • A computational tool for synthetic biology design: promoters, RBSs, operons.
  • A potential step toward bacterial foundation models.

Most importantly, our journey reflects the iGEM spirit: bold exploration, transparent sharing of setbacks, and curiosity-driven progress. We hope our story inspires future teams to continue bridging AI and bacterial genomics, building on our early steps.

Appendix: Technical Notes

  • Datasets: RegulonDB (promoters, operons), MPRA libraries (promoter variants), RBS Calculator datasets, sRNA interaction databases.
  • Model size: 16K bp context window, 7 encoder stages, 8 transformer blocks, ~20M parameters.
  • Training: Adam optimizer, learning rate 0.002–0.003, batch size 8–24 depending on hardware.
  • Hardware: Experiments run on an L40 GPU; training steps were limited by cost, so progress was judged from training dynamics.
  • Evaluation metrics: Pearson correlation (for expression), AUROC (for operon membership), and R² (for regression tasks).
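For reference, all three metrics can be computed with plain numpy (the rank-based AUROC below assumes untied scores; tie handling is omitted for brevity):

```python
import numpy as np

def pearson_r(pred: np.ndarray, target: np.ndarray) -> float:
    """Pearson correlation, used for expression tracks."""
    return float(np.corrcoef(pred, target)[0, 1])

def r_squared(pred: np.ndarray, target: np.ndarray) -> float:
    """Coefficient of determination for regression tasks."""
    ss_res = np.sum((target - pred) ** 2)
    ss_tot = np.sum((target - target.mean()) ** 2)
    return float(1.0 - ss_res / ss_tot)

def auroc(scores: np.ndarray, labels: np.ndarray) -> float:
    """Rank-based AUROC for binary operon-membership labels."""
    order = np.argsort(scores)
    ranks = np.empty(len(scores), dtype=float)
    ranks[order] = np.arange(1, len(scores) + 1)
    pos = labels == 1
    n_pos, n_neg = pos.sum(), (~pos).sum()
    # Mann-Whitney U statistic normalized to [0, 1].
    return float((ranks[pos].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg))
```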
