BactaGenome: A Foundation Model for Bacterial Genomics

BactaGenome is a deep learning model that predicts functional elements directly from bacterial DNA sequences. It adapts the AlphaGenome architecture, originally developed for eukaryotic genomes, to the characteristics of prokaryotic genomes.

Our goal is to create a versatile foundation model that can accelerate research and engineering in synthetic biology, microbial genetics, and drug discovery by learning the complex "grammar" of bacterial gene regulation.

Project Vision

Bacteria are fundamental to biotechnology, medicine, and environmental science. However, predicting how a DNA sequence will behave (how strongly a gene is expressed, where regulation occurs, or how genes are co-regulated) remains a major challenge.

BactaGenome aims to solve this by:

Learning from Sequence: Directly processing raw DNA sequences (A, T, C, G) to learn regulatory motifs and patterns without prior feature engineering.
Multi-Task Learning: Simultaneously predicting a wide range of genomic features, so that the model learns representations shared across tasks rather than features specific to a single output.
Adapting a State-of-the-Art Architecture: Reusing the Transformer-based U-Net from AlphaGenome and tailoring it to compact, diverse bacterial genomes.
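
To make "learning from sequence" concrete, here is a minimal sketch of how raw DNA might be turned into model input with one-hot encoding. The channel order (A, C, G, T), the function name, and the handling of ambiguous bases are assumptions for illustration, not the repository's actual preprocessing code.

```python
# Channel order is an assumption for illustration; the repository may differ.
BASES = "ACGT"


def one_hot_encode(sequence: str) -> list[list[float]]:
    """Encode a DNA string as one 4-element one-hot row per base.

    Symbols outside A/C/G/T (for example N) become all-zero rows.
    """
    index = {base: i for i, base in enumerate(BASES)}
    encoded = []
    for symbol in sequence.upper():
        row = [0.0] * len(BASES)
        if symbol in index:
            row[index[symbol]] = 1.0
        encoded.append(row)
    return encoded


if __name__ == "__main__":
    for base, row in zip("ACGTN", one_hot_encode("ACGTN")):
        print(base, row)
```

No motif annotations or hand-built features enter at this stage; the network only ever sees the per-base indicator vectors.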

Model Architecture

BactaGenome is built upon a Transformer-based U-Net, a design that excels at capturing both local sequence patterns and long-range dependencies.

DNA Embedding: The input DNA sequence is first passed through a convolutional layer to create an initial, high-resolution embedding.
U-Net Encoder-Decoder:
- The encoder path consists of a series of DownresBlocks that progressively downsample the representation, creating embeddings at multiple scales (e.g., 1bp, 8bp, 128bp). This allows the model to see both fine-grained motifs and broader genomic context.
- The decoder path uses UpresBlocks with skip connections from the encoder, which re-integrate high-resolution information to make precise, base-level predictions.
Transformer Tower: At the lowest resolution (the "bottleneck" of the U-Net), a stack of Transformer layers is applied. This is where the model captures long-range interactions across the ~100kbp input sequence.
Prediction Heads: The multi-scale embeddings from the decoder are fed into specialized, task-specific "heads" that make the final predictions. Each head is a small neural network trained for a specific output modality.
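
The multi-scale layout described above can be illustrated with some simple bookkeeping. The sketch below assumes each encoder stage downsamples by a factor of 2 and that seven stages take the representation from 1 bp to 128 bp; it also uses a hypothetical input length of 98,304 bp (a multiple of 128 close to the ~100 kbp mentioned above). These numbers are assumptions, not values read from the repository's configs.

```python
def encoder_resolutions(num_stages: int = 7, factor: int = 2) -> list[int]:
    """Base pairs per position at the input and after each encoder stage."""
    return [factor ** stage for stage in range(num_stages + 1)]


def positions_at(sequence_length: int, resolution_bp: int) -> int:
    """Number of positions at a given resolution; the length must divide evenly."""
    if sequence_length % resolution_bp != 0:
        raise ValueError("sequence length must be a multiple of the resolution")
    return sequence_length // resolution_bp


# Hypothetical input length: 768 * 128 bp, close to the ~100 kbp in the text.
SEQ_LEN = 98_304

if __name__ == "__main__":
    print("resolutions (bp):", encoder_resolutions())
    for res in (1, 8, 128):
        print(f"{res:>3} bp resolution -> {positions_at(SEQ_LEN, res)} positions")
```

Under these assumptions the Transformer tower attends over 768 positions instead of 98,304. Because self-attention cost grows quadratically with sequence length, placing it at the bottleneck is what makes long-range modeling affordable.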

Key Features

AlphaGenome Adaptation: Hyperparameters are adapted for bacterial genomes. For instance, data augmentation shifts are smaller (±256 bp vs. ±1024 bp) to reflect the compact nature of bacterial regulatory regions.
Organism-Specific Embeddings: The model learns a unique embedding for each bacterial species, allowing it to adapt its predictions to different genomic backgrounds.
Multi-Task Outputs: BactaGenome is designed to predict multiple, biologically relevant targets simultaneously. In our initial phase, these include:
- Gene Expression Level: Predicts log-normalized expression values (TPM/FPKM).
- Gene Density: A proxy for identifying gene-rich regions.
- Operon Membership: A binary classification task to identify genes belonging to operons.
- Transcription Factor Binding Sites (TFBS): A high-resolution task to predict the precise binding locations of various transcription factors, which is critical for learning regulatory grammar.
Flexible and Extensible: The architecture allows new prediction heads for other tasks (e.g., transcription start sites, terminator prediction, sRNA targets) to be easily added.
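
As an illustration of what two of these targets could look like numerically, the sketch below derives a log-normalized expression value and a binned gene density track. The use of `log1p`, the definition of density as the covered fraction of each bin, the 128 bp bin size, and both function names are assumptions made for this example; the repository's actual target definitions may differ.

```python
import math


def expression_target(tpm: float) -> float:
    """Log-normalize an expression value; log1p maps zero expression to 0."""
    return math.log1p(tpm)


def gene_density(
    genes: list[tuple[int, int]], window_length: int, bin_size: int = 128
) -> list[float]:
    """Fraction of bases in each bin covered by at least one gene.

    `genes` holds half-open (start, end) intervals in window coordinates.
    """
    if window_length % bin_size != 0:
        raise ValueError("window_length must be a multiple of bin_size")
    covered = [False] * window_length
    for start, end in genes:
        # Clip each gene to the window so partially overlapping genes still count.
        for pos in range(max(start, 0), min(end, window_length)):
            covered[pos] = True
    return [
        sum(covered[i:i + bin_size]) / bin_size
        for i in range(0, window_length, bin_size)
    ]


if __name__ == "__main__":
    print(expression_target(0.0))
    print(gene_density([(0, 4), (2, 6)], window_length=8, bin_size=4))
```

Overlapping genes are counted once per base, so the density stays in the range 0 to 1 regardless of how many annotations cover a position.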

Training and Data

Phase 1: E. coli on RegulonDB

Our initial focus is on training BactaGenome on the most well-characterized model bacterium, Escherichia coli K-12, using the comprehensive RegulonDB database.

Data Source: We process raw BSON files from RegulonDB to extract experimentally validated data on gene expression, operon structures, and transcription factor binding sites.
Targets: We have engineered a set of realistic and information-rich training targets that force the model to learn meaningful biological patterns.
Data Augmentation: To improve generalization, we apply AlphaGenome-style augmentations during training, including random sequence shifts and reverse-complementation.
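
The two augmentations named above can be sketched in a few lines. This is a simplified illustration, not the repository's implementation: the function names are hypothetical, the genome is treated as linear (no wrap-around for circular chromosomes), and the ±256 bp default simply mirrors the shift range mentioned under Key Features.

```python
import random
from typing import Optional

COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(sequence: str) -> str:
    """Complement each base, then reverse the string."""
    return sequence.translate(COMPLEMENT)[::-1]


def augment(
    genome: str,
    start: int,
    length: int,
    max_shift: int = 256,
    rng: Optional[random.Random] = None,
) -> tuple[str, int, bool]:
    """Return (window, shifted_start, flipped) for one training example."""
    if length > len(genome):
        raise ValueError("window is longer than the genome")
    rng = rng or random.Random()
    shift = rng.randint(-max_shift, max_shift)
    # Clamp so the window stays inside the genome (circularity is ignored here).
    new_start = min(max(start + shift, 0), len(genome) - length)
    window = genome[new_start:new_start + length]
    flipped = rng.random() < 0.5
    if flipped:
        window = reverse_complement(window)
    return window, new_start, flipped


if __name__ == "__main__":
    toy_genome = "ACGTTGCA" * 200
    window, new_start, flipped = augment(toy_genome, start=400, length=64)
    print(new_start, flipped, window[:16])
```

The returned `shifted_start` and `flipped` values matter because position-wise targets (such as TFBS tracks) have to be shifted by the same offset and reversed whenever the sequence is reverse-complemented; otherwise labels no longer line up with the bases they describe.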

Getting Started

Installation

Clone the repository:

```bash
git clone https://github.com/your-username/BactaGenome.git
cd BactaGenome
```

Install the required dependencies. We recommend using a Conda environment.

```bash
conda create -n bactagenome python=3.10
conda activate bactagenome
pip install -r requirements.txt
```

Install the package in editable mode:

```bash
pip install -e .
```

Training

The main training script is train_regulondb.py. It uses configuration files located in the configs/ directory.

Download Data: Place your raw RegulonDB and E. coli genome files in the data/raw/ directory as specified in the script.
Preprocess Data: The training script will automatically preprocess the data on its first run and save cached files to data/processed/.

Start Training:

```bash
python train_regulondb.py --config configs/training/phase1_regulondb.yaml
```

Monitor with TensorBoard: To visualize training progress in real time:

```bash
tensorboard --logdir logs/phase1_regulondb/tensorboard
```

Then open http://localhost:6006 in your browser.
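
The preprocess-once-then-cache behavior mentioned in the steps above typically follows a load-or-build pattern. The sketch below shows that pattern in isolation; `load_or_build`, the pickle format, and the example file name are assumptions for illustration and are not taken from train_regulondb.py.

```python
import pickle
from pathlib import Path
from typing import Any, Callable


def load_or_build(cache_path: Path, build: Callable[[], Any]) -> Any:
    """Return the cached object if it exists; otherwise build and cache it."""
    if cache_path.exists():
        with cache_path.open("rb") as handle:
            return pickle.load(handle)
    result = build()
    cache_path.parent.mkdir(parents=True, exist_ok=True)
    with cache_path.open("wb") as handle:
        pickle.dump(result, handle)
    return result


if __name__ == "__main__":
    # Hypothetical cache file; the real script may name and format it differently.
    cache = Path("data/processed/example_targets.pkl")
    targets = load_or_build(cache, lambda: {"genes": 3})
    print(targets)
```

With a cache of this kind, changes to the raw inputs are not picked up until the cached file is removed, so clearing the processed directory is the usual way to force a rebuild.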

Inference

Use the inference_regulondb.py script to run a trained model on new data and visualize its predictions.

```bash
python inference_regulondb.py \
  --checkpoint checkpoints/phase1_regulondb/best_model_regulondb.pt \
  --config configs/training/phase1_regulondb.yaml
```

This will generate plots comparing the model's predictions to the ground truth for random samples from the validation set.

Project Structure

```
BactaGenome/
├── bactagenome/           # Core library code
│   ├── data/              # Data loading, processing, and augmentation
│   ├── model/             # Model architecture (core, heads, components)
│   └── training/          # Trainer and loss functions
├── configs/               # YAML configuration files for models and training
├── data/                  # Raw, processed, and cached data
├── checkpoints/           # Saved model weights
├── logs/                  # Training logs and TensorBoard events
├── tests/                 # Unit and integration tests
├── train_regulondb.py     # Main training script
└── inference_regulondb.py # Inference and visualization script
```

Future Work

Our roadmap includes:

Expanding Targets: Incorporating more high-resolution targets like Transcription Start Sites (TSS) and terminators.
Multi-Species Training: Expanding the model to train on a diverse set of bacteria to learn both species-specific and universal regulatory principles.
Fine-Tuning for Downstream Tasks: Creating a framework to easily fine-tune the pre-trained BactaGenome model for specific synthetic biology tasks, such as promoter engineering or RBS optimization.

Contributing

We welcome contributions from the community! If you are interested in improving BactaGenome, please see our contributing guidelines (CONTRIBUTING.md) and feel free to open an issue or submit a pull request.
