Executive Summary


Selecting an appropriate chassis is essential for effective construct function in any given aquatic environment. Today, chassis deployment in water is hindered by a lack of foundational knowledge and resources, making chassis survival extremely difficult to predict. To address this gap directly, we present AQUERY and AQUIRE, two novel software tools that inform users about chassis survival at a particular deployment location.

AQUERY is a novel database that addresses a gap in survival correlation analysis, which typically lacks input on environmental community structure and how species presence affects chassis survival. AQUERY is viewable, searchable and queryable, making it easy to review literature data, extract relevant data points and perform data analysis, streamlining the process of field deployment.

AQUIRE is a unique environmental and metagenomics-based machine learning model designed to predict the survival of a chassis in a chosen environment, serving as a verification step that confirms deployment feasibility before extensive resources are invested in developing and growing a chassis. AQUIRE leverages the AQUERY database and advanced machine learning models, taking a selected chassis, the conditions of its deployment environment and a species abundance matrix for that environment, and outputting a survivability score.

Both AQUERY and AQUIRE not only support and guide our foundational framework but also aid the design and construction of future synthetic biology projects. They underpin our design principles for chassis selection, enabling more reliable implementation of synthetic biology circuits in real-world aquatic systems, and provide a streamlined path for advancing bioengineered chassis from the lab to deployment.

Rationale


Although large amounts of data across various aquatic environments are publicly available in the literature, we found that a large portion of these datasets are disparate, hard to find, and even harder to organize into a single research resource. Additionally, the data carry significant bias, as studies recorded only the subsets of data relevant to their specific goals.

Metagenomic studies, which use DNA sequencing to provide insight into community structure and dynamics, have been particularly valuable and have been conducted in a wide range of aquatic environments. However, these studies do not provide context about the abiotic factors that contribute to community structure and dynamics, which limits the insight the information can offer.

Overall, we found that most datasets capture only part of the information needed for a complete database, containing either environmental information or metagenomic sequence data, but failing to harness both.

Environmental Abiotic Data: Conditions like pH, temperature and nutrient levels offer insight into the growth requirements of biological organisms. However, as useful and essential as this information is to understanding species presence and growth, it presents a narrow view that does not account for other stressors, particularly the influence of community dynamics. Species competition largely determines whether an organism can survive in a specific environment: even under the most optimal conditions, high competition can prevent an organism from reaching a large population.

Metagenomic Data: Metagenomic sequencing has been a vital tool for gaining insight into community structure and species abundance. However, it fails to provide essential information about the abiotic conditions at the sampling location.

AQUERY and AQUIRE: Sister Software Tools

This is why we created the AQUERY and AQUIRE software tools, which relate these two vital elements of information: environmental abiotic data and metagenomic sequencing. Together they create a full picture of how environments and existing communities affect species presence and abundance, in order to better understand species and chassis survival in the real world. These aspects are crucial to our project's consideration of real-world environmental conditions that are not typically recorded or implemented in lab simulations.

Introduction


Our project produced two sister software tools:

AQUERY – a true metagenomic database of aquatic environmental samples, and

AQUIRE – a predictive model tool built on three machine learning models and trained on the AQUERY database.

These tools, created over the course of our project, support our foundational framework and allow other synthetic biologists to adopt it going forward for effective deployment of synthetic biology solutions.

AQUERY provides a view of various literature findings and their conditions, allowing users to search where a chassis has naturally occurred and how its abundance varies across those locations. AQUIRE goes beyond AQUERY's reliance on the literature by letting users input and test new conditions: it predicts chassis survival in an environment of interest and outputs a survivability score.

Currently, AQUERY hosts data on over 2,000 metagenomic samples, which together reveal over 26,866 species and their per-sample abundances. AQUIRE can predict for 35% of the species in that sample set, covering relevant and widely used chassis in synthetic biology.

However, our software tools are not static; their ability to grow was a central consideration in their creation. As more people add to the AQUERY database and use the AQUIRE model, the tools' value increases, offering greater insight.

Our GitLab includes the same pipeline we used to create AQUERY, allowing users to process their own metagenomic samples and retrieve a species abundance matrix. This matrix can then be added to the AQUERY database or input to AQUIRE for chassis selection. The pipeline is also valuable standalone, as a direct way to view the communities within a metagenomic sample. We provide various ways for users to tailor their experience, from dataset structure to prediction parameters, enabling testing at different levels and supplementing a wide variety of research projects.

With these tools, synthetic biologists can better select a chassis for successful deployment, taking synthetic biology from the lab into real-world flowing environments.

Planning


Our goal this year was to create an adoptable, foundational framework supported by powerful software tools. Achieving this required the creation of a central database and a corresponding predictive model.

We created the database to be universally accessible, allowing anyone to review data from previously conducted studies and the resulting species matrices, all in a single, standardized, central location. Additionally, for dry-lab applications, AQUERY allows users to query an expansive dataset for other bioinformatics projects, supporting reproducibility and future metagenomic meta-analyses.

Pathway


Problem and Data Review:

As we tackled understanding chassis survival in real-world aqueous systems, we lacked a central source of metagenomic data that integrated each sample's environmental factors (such as temperature, pH and nutrient levels), which would provide vital context about environmental makeup. These abiotic features influence species survival, and not all bacterial chassis can survive outside the limits of their natural environment.

Although websites like the NCBI SRA database host a wide range of metagenomic sample FASTA/FASTQ files, many of these lack vital information on environmental context. Even within the literature, not every study reported sampling-location metadata, such as pH and temperature, at the time of collection.

For researchers to tailor their approach for effective chassis selection, understanding the makeup of sampling locations and review of literature data is necessary. Creating a central standardized platform to access literature data, along with sampling location metadata, stands as the most useful and applicable approach.

Data collection:

Software

Figure. Distribution of Samples by Source Website

In our review, we found existing websites and databases (JGI GOLD, Planet Microbe, MGnify and MG-RAST) that contained at least a subset of metagenomic samples along with the associated sample-location metadata. Samples from studies of aquatic environments were preferred over laboratory samples, as we wanted to highlight real-world aquatic data. Additional filtering selected for true shotgun metagenomics over 16S sequencing, as shotgun data allows us to obtain species-level taxonomic abundance matrices after k-mer-based mapping.

Data cleaning:

After data collection, we sought to create a homogenized database, thus ensuring readable standardized information across all samples. We standardized units and columns of interest, creating a cohesive structure for the database. The final database includes the following features: latitude, longitude, environmental condition, season, depth, temperature, salinity, pH, carbon, phosphorus, carbon dioxide, organic carbon, inorganic carbon, nitrate, nitrite, nitrogen, oxygen concentration, phosphate, chlorophyll, chloride, methane and date of sample collection.
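As a minimal sketch of this homogenization step, each source's columns can be mapped onto the shared schema and units converted. The column names and the unit conversion below are illustrative assumptions, not the actual source schemas; in practice each source website needed its own mapping.

```python
import pandas as pd

# Hypothetical mapping from one source's column names to our shared schema.
COLUMN_MAP = {"temp_f": "temperature", "sal": "salinity", "lat": "latitude"}

def standardize(df, column_map=COLUMN_MAP):
    """Rename a source's columns to the shared schema and fix units."""
    out = df.rename(columns=column_map)
    if "temp_f" in df.columns:
        # Example unit conversion: Fahrenheit to Celsius.
        out["temperature"] = (out["temperature"] - 32) * 5 / 9
    return out
```

Applying one such function per source yields tables with identical column names and units, ready to concatenate into a single database.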

Taxonomic Processing/Pipeline Creation:

Similarly to the homogenization of our environmental feature data, we wanted all samples’ metagenomic taxonomic results to be processed under the same analysis pipeline. This standardization would allow for proper comparison between species relative abundance within a sample across other samples in the full database. We processed all of our collected metagenomic sequences using K-mer based matching, utilizing the Kraken2 and Bracken tools from the Johns Hopkins University Center for Computational Biology. Mapping was performed against the same Kraken standard database.

To optimize the mapping process, we created a metagenomic pipeline that takes as input a text file of SRR accession numbers and fetches each accession's associated FASTQ file via the SRA Toolkit package. This FASTQ file contains the metagenomic DNA sequence of a sample and is processed with Kraken, which outputs a Kraken report file. Finally, the Kraken report file is passed to the next step of the pipeline, Bracken, which creates a comma-separated values (.csv) file listing the species within the processed sample and their relative abundance values. The Kraken-Bracken process is repeated for all SRR accession numbers until each has a .csv output containing its species abundance matrix data.
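The per-accession step can be sketched as below. The tool names and flags are standard SRA Toolkit, Kraken2 and Bracken usage; the paths, database name and output file naming are illustrative assumptions rather than our exact pipeline code.

```python
# Hedged sketch: build the shell commands run for one SRR accession.
def build_commands(srr, db="kraken_standard_db", out_dir="results"):
    """Return the sequence of commands for a single SRR accession."""
    fastq = f"{out_dir}/{srr}.fastq"
    report = f"{out_dir}/{srr}.kreport"
    abundance = f"{out_dir}/{srr}_abundance.csv"
    return [
        ["prefetch", srr],                     # fetch raw reads from SRA
        ["fasterq-dump", srr, "-O", out_dir],  # extract the FASTQ file
        ["kraken2", "--db", db, "--report", report,
         "--output", f"{out_dir}/{srr}.kraken", fastq],  # k-mer classification
        ["bracken", "-d", db, "-i", report,
         "-o", abundance, "-l", "S"],          # species-level re-estimation
    ]
```

A driver script would read the text file of SRR accessions and run each command list in order with `subprocess.run`, checking exit codes between steps.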

Upon completion, the taxonomic processing pipeline outputs a folder of results for each SRA accession number. This output is set up for inclusion into the AQUERY database.

The taxonomic processing aspect of our project can be accessed under the metagenomic_pipeline folder in our GitLab.

Merging taxonomic data:

After each sample's taxonomy was processed and saved with Kraken and Bracken, we aimed to create our final AQUERY database, incorporating the cleaned environmental metadata with the processed taxonomic data.

We wrote a Python script that assembles all of the taxonomic data of every processed sample into a central sample species abundance matrix. We allow multiple types of user invocations via separate Python scripts which assemble these matrices on different conditions. For example, the query, “Top n” outputs a matrix only including the top n number of species per each sample, and “Global n” outputs a matrix only including the most abundant species across all samples.
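A minimal sketch of the "Top n" assembly follows; the column names and helper function are illustrative assumptions, not our actual script's interface.

```python
import pandas as pd

def assemble_top_n(per_sample, n):
    """per_sample maps a Sample ID to a DataFrame with columns
    ['species', 'abundance']; returns a samples x species matrix."""
    rows = {}
    for sample_id, df in per_sample.items():
        # Keep only this sample's n most abundant species.
        top = df.nlargest(n, "abundance")
        rows[sample_id] = dict(zip(top["species"], top["abundance"]))
    # Species absent from a sample's top n get abundance 0.
    matrix = pd.DataFrame.from_dict(rows, orient="index").fillna(0.0)
    matrix.index.name = "Sample ID"
    return matrix
```

A "Global n" variant would instead rank species by total abundance across all samples before filtering columns.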

Lastly, we merged our final all-sample species matrix with its associated sampling condition metadata. We utilized the “Sample ID” feature as a key to collate these two elements, thus creating the full AQUERY database.
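The collation step above reduces to a keyed join; the join type shown is an assumption, and columns beyond "Sample ID" are illustrative.

```python
import pandas as pd

def build_aquery(species_matrix, metadata):
    """Join the all-sample species matrix to the cleaned environmental
    metadata on the "Sample ID" key."""
    # Inner join keeps only samples present in both tables (assumption).
    return species_matrix.merge(metadata, on="Sample ID", how="inner")
```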

The database merge aspect of our project can be viewed in the assemble folder on our GitLab.

Predictive model creation:

With the creation of AQUERY, we had sufficient data to train AQUIRE, our predictive model for assessing chassis feasibility in aquatic environments. We first created a Bash script for model training. The script harnessed three well-established machine learning classifiers as candidates: logistic regression, a random forest classifier and XGBoost. We tracked each model's accuracy on individual species and saved the most accurate model per species. Each species' model was trained on data from all other species in our total dataset, to fully capture the community's effect on chassis survival, specifically the relationship between a species' abundance and concurrent changes in the abundances of other species.
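The per-species model selection can be sketched as follows, assuming scikit-learn. XGBoost is omitted here to keep the example dependency-light, but it slots into the candidate dictionary the same way; the function name and split are assumptions for illustration.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

def train_best_model(X, y, seed=0):
    """Fit each candidate classifier and return (name, model, accuracy)
    for the most accurate one on a held-out split."""
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.25, random_state=seed)
    candidates = {
        "logistic_regression": LogisticRegression(max_iter=1000),
        "random_forest": RandomForestClassifier(random_state=seed),
    }
    best = None
    for name, model in candidates.items():
        model.fit(X_tr, y_tr)
        acc = model.score(X_te, y_te)  # held-out accuracy
        if best is None or acc > best[2]:
            best = (name, model, acc)
    # The winning model can then be persisted, e.g. with joblib.dump.
    return best
```

For species i, X would hold the abundances of all other species (plus environmental features) and y whether species i was present in a sample.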

Architecture


Software

Figure. Architectural Diagram of software tool components

Database:

  • DuckDB (Data Engine):
    • DuckDB is built to handle large datasets. Unlike server-based engines such as PostgreSQL, DuckDB requires no separate server process; it connects directly to a Parquet file, making data retrieval fast.
  • Parquet (Data Storage):
    • Parquet is our columnar data format; it lets DuckDB read only the relevant data columns, accessing data much faster than reading a .csv file.
  • Streamlit (Frontend Interface):
    • This is the interactive layer where users can view the database, apply filters, view charts and execute custom queries.
  • FastAPI:
    • This provides a REST API layer in the backend, allowing other applications or servers, even those outside the Streamlit app, to make specific data requests. It provides reliable and secure access to the filtered data.

Predictive model:

  • predictor.py (training script):
    • This is the script used to train the models and create the .joblib files loaded at prediction time.
  • model.py:
    • This contains the core logic and data prep. Its functions are then imported by the following scripts.
  • Command Line script (executable Python file):
    • This is the direct scriptable interface for running single sample predictions.
  • Streamlit (Frontend Interface):
    • This is the interactive layer where users can select their chassis and input environmental sampling location information.
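A command line interface for single-sample prediction could be wired up as below; the flag names are assumptions for illustration, not CLI_predictor.py's actual arguments.

```python
import argparse

def make_parser():
    """Build an argument parser for a hypothetical single-sample run."""
    parser = argparse.ArgumentParser(
        description="Predict chassis survivability for one sample")
    parser.add_argument("--chassis", required=True,
                        help="species name of the candidate chassis")
    parser.add_argument("--sample-csv", required=True,
                        help="species abundance matrix for the site")
    parser.add_argument("--temperature", type=float, help="degrees Celsius")
    parser.add_argument("--ph", type=float)
    return parser
```

A script built this way is directly usable from shell pipelines, mirroring the role the command line script plays alongside the Streamlit app.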

Structure


AQUERY was specifically engineered to contain essential information drawn from a sufficient number of samples, ensuring two critical outcomes: that users could easily contextualize the environmental location of the data, and that the underlying model would maintain accurate predictive ability. The core data set for each sample consists of its SRA accession number, detailed environmental information, and a taxonomic species matrix. This combination of vital data elements is what makes AQUERY a unique resource, distinguishing it from all other databases of its kind.

  • Sample Identity: Each sample is uniquely identified with its SRA accession number. We chose this as the unique ID for tracking samples throughout the pipeline.
  • Spatio-temporal Context: This provides location and timing metadata for identifying environmental patterns per sample.
    • Latitude and longitude: Specific geographic coordinates pinpointing exact sampling locations.
    • Date of collection: Precise date that sample was collected.
    • Season: Categorical variables (fall, spring, summer, winter) included to detect patterns related to seasonal cycles.
  • Environmental: We included these features for essential environmental contextualization from sample collection.
    • Temperature (Celsius)
    • Depth (meters)
    • Salinity (parts per thousand)
    • pH
  • Chemical/nutrient concentrations: Chemical and nutrient components were kept, as we considered them crucial features for model prediction and analysis.
    • Carbon, carbon dioxide, organic carbon, inorganic carbon
    • Phosphorus, phosphate
    • Nitrate, nitrite, nitrogen
    • Oxygen
    • Chlorophyll, chloride, methane concentrations

Applications


In silico testing: AQUIRE serves as a critical resource for in silico testing and model generation. By providing species abundance data from a chosen sampling location along with that location's environmental features, researchers can test whether their desired chassis would survive real-world deployment, avoiding the substantial resources and time wasted on a chassis that is not feasible to deploy.

Referencing: AQUERY is established as a centralized hub for aquatic metagenomic sequencing studies. Its primary value as a reference tool is realized by providing a consistently structured dataset that links standardized environmental features (e.g., temperature, salinity) to species abundance values for every sample. AQUERY allows researchers to quickly reference and contextualize their own sequencing projects within the broader global aquatic environment. Users can also review literature data for comparison.

Conservation and resource management: With its spatio-temporal features, AQUERY can be used for ecological mapping and conservation. It provides quantitative insight into species distributions and into variation in community structure across sampling locations and time points, directly supporting informed resource allocation.

Policy: Both AQUERY and AQUIRE offer quantitative insights essential for evidence-based policy making and advocacy. AQUIRE provides quantitative projections on how potential regulatory or environmental changes can affect species survival and ecosystem stability. AQUIRE’s survival prediction score is a key metric for promoting feasible chassis deployment, providing regulatory bodies with a quantifiable measure of bioengineering chassis feasibility and expected persistence.

Future Directions


Our ongoing development plan focuses on three key areas to significantly enhance the capabilities and utility of the AQUERY resource:

  1. Data Expansion:
    • AQUERY was designed for extensibility, allowing future submissions from users. Our goal is to increase the spatio-temporal and environmental coverage of the database through new submissions from the community. With increased coverage, AQUERY will provide broader contextualization of sample locations and enable a deeper understanding of species dynamics under diverse environmental conditions.
  2. Model improvement:
    • To maximize the accuracy of AQUIRE, we plan to retrain models on an updated AQUERY database for species that currently lack sufficient data for reliable prediction, thereby expanding the utility of our predictive pipeline and tuning it to achieve higher prediction accuracy across the board.
  3. Integrating Metatranscriptomics:
    • While our initial focus was on species presence using metagenomic data, we recognize its limitation in describing gene expression. Our next major milestone is AQUERY 2.0, which will integrate metatranscriptomic data; we have begun preliminary tests comparing metagenomic and metatranscriptomic yields. AQUERY 2.0 would offer fuller insight into community structure, providing both species composition and quantitative data on active genes across environmental conditions.

User Manual


  1. Initialization:
    • git clone ‘https://gitlab.igem.org/2025/software-tools/william-and-mary.git’
    • cd william-and-mary
  2. Setup:
    • pip install -r requirements
    • conda activate

To use and view the AQUERY database, please go to the database folder in our software tools GitLab. This includes a .csv file that has every sample of the AQUERY database, DuckDB SQL, and an interactive Streamlit app that can be run and used on a web browser.

To use the AQUIRE model, please go to the “predict” folder in our software tools GitLab, which includes an interactive Streamlit app (‘aquire_app.py’) and a command line script (‘CLI_predictor.py’).

We recommend using a Python virtual environment or Conda environment to avoid Python package version mismatch on local computers.

Conclusion


Utilizing metagenomic sample data, we created AQUERY and AQUIRE: a central database and a predictive model that relate species abundance to environmental conditions. With AQUERY and AQUIRE, we support chassis selection for real-world deployment.

We have developed a powerful platform for metagenomics research that streamlines chassis selection for real-world synthetic biology applications. Our platform is centered on two core components: AQUERY, a central database that curates and relates species abundance to specific environmental conditions, and AQUIRE, a predictive model that uses this data to forecast the suitability of a chassis for a given environment.

By building AQUERY and AQUIRE, we eliminate the need for every research group to replicate the difficult and computationally intensive task of assembling this data and training predictive models from scratch. Our software is designed to be fully compatible with the NCBI SRA database, making it easily implementable and adaptable for various projects. This approach drastically reduces the computational burden and accelerates the development and deployment of SynBio solutions by providing a data-driven path to optimal chassis selection.

References