BactaGenome: A Foundation Model for Bacterial Genomics
BactaGenome is a deep learning model designed to understand and predict functional elements directly from bacterial DNA sequences. It adapts the powerful AlphaGenome architecture, originally developed for eukaryotic genomes, to the unique characteristics of prokaryotic biology.
Our goal is to create a versatile foundation model that can accelerate research and engineering in synthetic biology, microbial genetics, and drug discovery by learning the complex "grammar" of bacterial gene regulation.
Project Vision
Bacteria are fundamental to biotechnology, medicine, and environmental science. However, predicting how a DNA sequence will behave—how strongly a gene is expressed, where regulation occurs, or how genes are co-regulated—remains a major challenge.
BactaGenome aims to solve this by:
- Learning from Sequence: Directly processing raw DNA sequences (A, T, C, G) to learn regulatory motifs and patterns without prior feature engineering.
- Multi-Task Learning: Simultaneously predicting a wide range of genomic features, forcing the model to develop a holistic understanding of genome function.
- Adapting a State-of-the-Art Architecture: Leveraging the proven power of the Transformer-based U-Net from AlphaGenome and tailoring it to the compact and diverse world of bacterial genomes.
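To make "learning from sequence" concrete, here is a minimal sketch of how raw DNA is commonly turned into model input by one-hot encoding. The helper name `one_hot_encode`, the A/C/G/T channel order, and the all-zeros treatment of ambiguous bases are assumptions for illustration; the actual BactaGenome data pipeline may differ.

```python
# Hypothetical helper, not taken from the BactaGenome codebase.
BASES = "ACGT"  # assumed channel order


def one_hot_encode(sequence: str) -> list[list[float]]:
    """Return one row of four floats per base.

    Bases outside A/C/G/T (for example N) are encoded as all zeros.
    """
    index = {base: i for i, base in enumerate(BASES)}
    encoded = []
    for base in sequence.upper():
        row = [0.0] * len(BASES)
        if base in index:
            row[index[base]] = 1.0
        encoded.append(row)
    return encoded
```

Calling `one_hot_encode("ACGTN")` yields five rows: one unit vector for each of A, C, G, and T, followed by a row of zeros for N. A convolutional embedding layer would then consume this length-by-4 matrix.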
Model Architecture
BactaGenome is built upon a Transformer-based U-Net, a design that excels at capturing both local sequence patterns and long-range dependencies.
- DNA Embedding: The input DNA sequence is first passed through a convolutional layer to create an initial, high-resolution embedding.
- U-Net Encoder-Decoder:
  - The encoder path consists of a series of DownresBlocks that progressively downsample the representation, creating embeddings at multiple scales (e.g., 1bp, 8bp, 128bp). This allows the model to see both fine-grained motifs and broader genomic context.
  - The decoder path uses UpresBlocks with skip connections from the encoder, which reintegrate high-resolution information to make precise, base-level predictions.
- Transformer Tower: At the lowest resolution (the "bottleneck" of the U-Net), a stack of Transformer layers is applied. This is where the model captures long-range interactions across the ~100kbp input sequence.
- Prediction Heads: The multi-scale embeddings from the decoder are fed into specialized, task-specific "heads" that make the final predictions. Each head is a small neural network trained for a specific output modality.
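The multi-scale design can be illustrated with some simple length arithmetic. The sketch below assumes that each DownresBlock halves the sequence length, that the bottleneck sits at 128bp resolution, and that the input window is 98,304 bp (a multiple of 128 close to ~100kbp). None of these three values is confirmed by this README; they are chosen only to show why the Transformer is applied at the bottleneck.

```python
# Assumptions (not from the BactaGenome source): each DownresBlock halves the
# sequence length, the bottleneck is at 128 bp resolution, and the input
# window is 98,304 bp.
INPUT_LENGTH_BP = 98_304
BOTTLENECK_RESOLUTION_BP = 128


def encoder_lengths(input_length: int, bottleneck_resolution: int) -> dict[int, int]:
    """Map each resolution (bp per position) to the number of positions at that scale."""
    if input_length % bottleneck_resolution != 0:
        raise ValueError("input length must be divisible by the bottleneck resolution")
    lengths = {}
    resolution = 1
    while resolution <= bottleneck_resolution:
        lengths[resolution] = input_length // resolution
        resolution *= 2
    return lengths


lengths = encoder_lengths(INPUT_LENGTH_BP, BOTTLENECK_RESOLUTION_BP)
num_downsampling_steps = len(lengths) - 1  # 1 bp -> 2 bp -> ... -> 128 bp

# Pairwise self-attention cost grows with the square of the sequence length.
attention_pairs_at_bottleneck = lengths[BOTTLENECK_RESOLUTION_BP] ** 2
attention_pairs_at_base_resolution = lengths[1] ** 2
```

Under these assumptions the Transformer sees 768 positions instead of 98,304, so the number of attention pairs drops by a factor of 128² = 16,384. The skip connections are what let the decoder recover base-level detail afterwards.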
Key Features
- AlphaGenome Adaptation: We have carefully adapted parameters for bacterial genomes. For instance, data augmentation shifts are smaller (±256bp vs. ±1024bp) to reflect the compact nature of bacterial regulatory regions.
- Organism-Specific Embeddings: The model learns a unique embedding for each bacterial species, allowing it to adapt its predictions to different genomic backgrounds.
- Multi-Task Outputs: BactaGenome is designed to predict multiple, biologically relevant targets simultaneously. In our initial phase, these include:
- Gene Expression Level: Predicts log-normalized expression values (TPM/FPKM).
- Gene Density: A proxy for identifying gene-rich regions.
- Operon Membership: A binary classification task to identify genes belonging to operons.
- Transcription Factor Binding Sites (TFBS): A high-resolution task to predict the precise binding locations of various transcription factors, which is critical for learning regulatory grammar.
- Flexible and Extensible: The architecture allows new prediction heads for other tasks (e.g., transcription start sites, terminator prediction, sRNA targets) to be easily added.
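As an illustration of the gene expression target, the following sketch computes TPM from read counts and gene lengths and then log-normalizes the result. The function names are hypothetical, and the use of log(1 + x) is an assumption: the README states only that expression values are log-normalized, not which transform is applied.

```python
import math


def tpm(counts: list[float], lengths_bp: list[int]) -> list[float]:
    """Transcripts per million: length-normalize counts, then scale so they sum to 1e6."""
    if len(counts) != len(lengths_bp):
        raise ValueError("counts and lengths_bp must have the same length")
    rates = [count / length for count, length in zip(counts, lengths_bp)]
    total = sum(rates)
    if total == 0:
        return [0.0] * len(rates)
    return [rate / total * 1e6 for rate in rates]


def log_normalize(values: list[float]) -> list[float]:
    """Assumed transform: log(1 + x), which maps zero expression to zero."""
    return [math.log1p(value) for value in values]


# Hypothetical example: three genes with read counts and lengths in bp.
example_tpm = tpm([100, 200, 0], [1000, 1000, 500])
example_targets = log_normalize(example_tpm)
```

The log transform compresses the wide dynamic range of expression values, which usually makes a regression head easier to train than fitting raw TPM directly.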
Training and Data
Phase 1: E. coli on RegulonDB
Our initial focus is on training BactaGenome on the most well-characterized model bacterium, Escherichia coli K-12, using the comprehensive RegulonDB database.
- Data Source: We process raw BSON files from RegulonDB to extract experimentally validated data on gene expression, operon structures, and transcription factor binding sites.
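- Targets: From these records we derive the training targets for the initial phase listed under Key Features (gene expression level, gene density, operon membership, and TFBS), chosen to be realistic and information-rich so that the model has to learn meaningful biological patterns.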
- Data Augmentation: To improve generalization, we apply AlphaGenome-style augmentations during training, including random sequence shifts and reverse-complementation.
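The two augmentations named above can be sketched in a few lines. This is an illustrative version, not the project's implementation: the function names, the 50% flip probability, and the clamping at the sequence ends are assumptions, while the ±256bp shift range comes from the Key Features section.

```python
import random

COMPLEMENT = str.maketrans("ACGTN", "TGCAN")
MAX_SHIFT_BP = 256  # shift range stated in this README


def reverse_complement(sequence: str) -> str:
    """Complement every base, then reverse the strand."""
    return sequence.upper().translate(COMPLEMENT)[::-1]


def augment_window(
    genome: str, start: int, length: int, rng: random.Random
) -> tuple[str, int, bool]:
    """Return (sequence, shifted_start, was_reverse_complemented).

    Assumes len(genome) >= length.
    """
    shift = rng.randint(-MAX_SHIFT_BP, MAX_SHIFT_BP)
    # Clamp so the window stays inside the sequence; this simplification
    # ignores that most bacterial chromosomes are circular.
    shifted_start = min(max(start + shift, 0), len(genome) - length)
    window = genome[shifted_start:shifted_start + length]
    flipped = rng.random() < 0.5  # assumed flip probability
    if flipped:
        window = reverse_complement(window)
    return window, shifted_start, flipped
```

Any position-wise targets have to be transformed in the same way as the sequence: sliced from the shifted start and, when the window is reverse-complemented, reversed along the position axis (with strand-specific tracks swapped).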
Getting Started
Installation
- Clone the repository:
    git clone https://github.com/your-username/BactaGenome.git
    cd BactaGenome
- Install the required dependencies. We recommend using a Conda environment.
    conda create -n bactagenome python=3.10
    conda activate bactagenome
    pip install -r requirements.txt
- Install the package in editable mode:
    pip install -e .
Training
The main training script is train_regulondb.py. It uses configuration files located in the configs/ directory.
- Download Data: Place your raw RegulonDB and E. coli genome files in the data/raw/ directory as specified in the script.
- Preprocess Data: The training script will automatically preprocess the data on its first run and save cached files to data/processed/.
- Start Training:
    python train_regulondb.py --config configs/training/phase1_regulondb.yaml
- Monitor with TensorBoard: To visualize the training progress in real time:
    tensorboard --logdir logs/phase1_regulondb/tensorboard
  Then open http://localhost:6006 in your browser.
Inference
Use the inference_regulondb.py script to run a trained model on new data and visualize its predictions.
python inference_regulondb.py \
--checkpoint checkpoints/phase1_regulondb/best_model_regulondb.pt \
--config configs/training/phase1_regulondb.yaml
This will generate plots comparing the model's predictions to the ground truth for random samples from the validation set.
Project Structure
BactaGenome/
├── bactagenome/ # Core library code
│ ├── data/ # Data loading, processing, and augmentation
│ ├── model/ # Model architecture (core, heads, components)
│ └── training/ # Trainer and loss functions
├── configs/ # YAML configuration files for models and training
├── data/ # Raw, processed, and cached data
├── checkpoints/ # Saved model weights
├── logs/ # Training logs and TensorBoard events
├── tests/ # Unit and integration tests
├── train_regulondb.py # Main training script
└── inference_regulondb.py # Inference and visualization script
Future Work
Our roadmap includes:
- Expanding Targets: Incorporating more high-resolution targets like Transcription Start Sites (TSS) and terminators.
- Multi-Species Training: Expanding the model to train on a diverse set of bacteria to learn both species-specific and universal regulatory principles.
- Fine-Tuning for Downstream Tasks: Creating a framework to easily fine-tune the pre-trained BactaGenome model for specific synthetic biology tasks, such as promoter engineering or RBS optimization.
Contributing
We welcome contributions from the community! If you are interested in improving BactaGenome, please see our contributing guidelines (CONTRIBUTING.md) and feel free to open an issue or submit a pull request.