Bioinformatics

To support our team’s wet lab, we investigated the evolutionary landscape and biochemical properties of carbonic anhydrases - the key enzyme in induced calcium carbonate precipitation. Our goal was to identify and characterize a diverse number of CA variants in order to optimally select one for extraterrestrial conditions. We conducted phylogenetic analysis using TreeSAPP on the alpha family and traced sequence motifs and functional divergence across variants that had high catalytic performances. This integrative approached allowed us to provide wet lab with crucial information for further testing.

Introduction

What is Phylogenetics?

Phylogenetics is the study of evolutionary relationships among organisms or genes using molecular sequence data. By constructing phylogenetic trees, researchers can visualize how sequences diverged from common ancestors and identify patterns of conservation and variation ([1]). This analysis provides insights into evolutionary history and potential functional similarities between proteins, guiding the selection of promising targets for applied research.

Role of Carbonic Anhydrase (CA) in Our Project

Our project explores sustainable building materials through microbially induced calcium carbonate precipitation (MICP). In this process, microbes facilitate the conversion of carbon dioxide into bicarbonate, which then combines with calcium to form calcium carbonate. The resulting mineral acts as a natural cement that can bind loose particles of sand, soil, or regolith into a solid, concrete-like material suitable for construction.

Carbonic anhydrase (CA) plays a key role by accelerating the hydration of carbon dioxide, thereby increasing the availability of carbonate ions that drive calcium carbonate formation. This makes the MICP process more efficient and scalable for diverse environments. On Earth, CA-enhanced MICP has potential for green construction, soil stabilization, and carbon sequestration. In space, the same principle could be applied to transform extraterrestrial regolith ([2]) into strong, durable building material, enabling infrastructure development without relying on costly material transport. By bridging Earth-based sustainability with space applications, CA provides a versatile biocatalyst for advancing resilient and eco-friendly construction.

By accelerating the conversion of carbon dioxide into bicarbonate, CA increases the availability of carbonate ions that combine with calcium to form calcium carbonate. This step is essential for efficiently turning loose regolith into strong, durable building material.

Why Perform a Phylogenetic Analysis?

To optimize our bioengineering approach, we are conducting a phylogenetic analysis of carbonic anhydrase sequences. This analysis allows us to strategically select enzyme candidates for experimental testing by first narrowing the field through in-silico comparisons. In doing so, we can better understand CA diversity and functionality, enabling more informed choices and reducing the time and resources required in the lab. Our goals are twofold:

Understand evolutionary diversity: By examining the evolutionary history of CA enzymes, we can identify conserved motifs and functional differences across species. This knowledge highlights structural or activity-related features that may enhance MICP performance under Martian conditions.
Support wet lab candidate selection:

Database annotation: We will integrate functional and taxonomic data into the phylogenetic tree, giving the wet lab team detailed insights into each sequence’s origin and known characteristics.
Cluster identification: By grouping related sequences into taxonomic “clusters,” we can pinpoint promising families of CAs for experimental testing, maximizing diversity while targeting high-functioning variants.

Background & Research: Evolution of Carbonic Anhydxrases

Carbonic anhydrases (CAs) are a diverse group of metalloenzymes that catalyze the reversible hydration of carbon dioxide ([3]):

\mathrm{CO_2 + H_2O \leftrightarrow HCO_3^- + H^+}

Although all carbonic anhydrases (CAs) catalyze the same reaction,they are divided into at least five structurally unrelated families:

\alpha, \beta, \gamma, \delta, \zeta

These families share no significant amino acid sequence similarity, reflecting convergent evolution, where different protein scaffolds independently evolved to solve the same biochemical problem of carbon dioxide conversion ([4]). Within each family, sequences are similar and phylogenetically related, forming distinct clusters with conserved motifs and similar folds. However, between families, there is no detectable similarity, even though all catalyze the same reaction ([5]). For our project, this means our phylogenetic analysis will focus on identifying clusters within a single family most relevant to microbially induced calcium carbonate precipitation (MICP), guiding the wet lab toward selecting promising CA variants for efficient Martian regolith solidification.

Families of Carbonic Anhydrases

Alpha - CA Found in vertebrates, plants, and some bacteria. These are monomeric, zinc-dependent enzymes. All alpha-CAs evolved from a single ancestral gene through multiple duplication events, resulting in different isoforms with specialized functions and tissue distributions.

Type	Representative Organisms	Key Role
Cytoplasmic alpha-CAs	Escherichia coli, Bacillus subtilis	Maintains intracellular pH and supplies bicarbonate for biosynthetic pathways such as fatty acid and nucleotide synthesis
Periplasmic alpha-CAs	Escherichia coli, Neisseria gonorrhoeae	Catalyzes rapid hydration of carbon dioxide for extracellular pH regulation and carbonate equilibrium
Membrane-associated alpha-CAs	Helicobacter pylori, some extremophiles	Facilitates carbon dioxide transport across membranes and contributes to local pH stabilization

Beta-CAs Common in bacteria, algae, fungi, and plants. Unlike α-CAs, these are often oligomeric (form dimers or tetramers) and play key roles in photosynthesis and nitrogen fixation.
Gamma-CAs Mostly found in archaea and some bacteria. These enzymes are structurally distinct and can use either zinc bivalent cation or iron bivalent cation as cofactors, reflecting their origin in early, anaerobic microorganisms.
Delta and Zeta-CAs Restricted to marine diatoms (a group of algae).
- Delta-CAs are structurally different but loosely related to beta-CAs.
- Zeta-CAs are unique because they use cadmium bivalent cation instead of zinc - an adaptation to cadmium-rich ocean environments.

Evolutionary Relationships

Carbonic anhydrases (CAs) show two key evolutionary patterns: convergent and divergent evolution.

Convergent evolution occurs when unrelated proteins independently evolve to solve the same problem.
Divergent evolution happens when a single ancestral gene duplicates and its copies (paralogs) specialize over time. Phylogenetic studies show that groups like CA I/II/III (cytosolic), CA IV/IX/XII (membrane-bound), and CA VA/VB(mitochondrial) arose through these duplication events. Some forms gained new locations in the cell, while others (CA VIII, X, XI) lost catalytic activity but still serve regulatory functions ([4]).

EggNOG and Data Input

Why EggNOG?

EggNOG is a hierarchical database of orthologous groups (OGs) built from genomic and metagenomic data ([6]). Its value lies in orthology inference - grouping genes that derive from a common ancestor, while separating paralogs within species. For carbonic anhydrases (CAs), this is essential because:

Diversity of CA families: With different CA families having no sequence homology, EggNOG provides curated clustering that allows us to separate them cleanly into orthologous groups.
Resolution of paralogs: Within alpha-CAs, EggNOG assigns separate OGs to paralogous isoforms (e.g., CA I vs. CA II vs. CA IX). This helps avoid collapsing distinct functional enzymes into a single cluster.
Taxonomic breadth: EggNOG spans all domains of life (bacteria, archaea, eukaryotes), ensuring coverage of CA diversity from human isoforms to microbial homologs.
Functional annotation: Each OG comes with standardized functional labels, linking sequence data to biochemical roles.
Version upgrade (v5 → v6):
- v5 contained fewer CA sequences and limited subgroup coverage.
- v6 greatly expanded sequence counts, taxonomic distribution, and orthologous group resolution, giving us better evolutionary and functional context ([7]).

Thus, EggNOG provided the foundational dataset for constructing TreeSAPP reference packages and connecting CA diversity to phylogenetic and biochemical metadata.

Input from EggNOG

We extracted sequences and orthologous group information for carbonic anhydrases from EggNOG v6:

OG ID	Taxonomic Distribution	Associated primary Carbonic Anhydrase Class	Sequences	Species count	Functional Annotation
COG3338	Primarily Bacteria	alpha-class (97.2%)	1392	1237	Catalysis of carbon dioxide to bicarbonate
KOG0382	Eukaryotes	beta-class (69.5%)	11927	1268	Catalysis of carbon dioxide to bicarbonate
KOG0789	Eukaryotes	gamma-class (6.5%)	10557	1243	Protein tyrosine phosphatase, catalytic domain

Classification and sequence counts for three primary groups descending from LCOG3338 which represents the ancestral orthologous group that gave rise to modern carbonic anhydrase proteins, serving as a starting point for tracing the evolutionary history and functional diversity of these enzymes across bacteria and eukaryotes in EggNOG v6

EggNOG does not currently provide a public API for bulk retrieval, so we manually downloaded the required files for each orthologous group (OG) from the [8]. These downloads contain precomputed datasets for each OG and serve as the starting point for our phylogenetic pipeline.

For each OG, the following key files were obtained:

FASTA files of protein sequences: contain all member sequences assigned to the OG, representing the diversity of that protein family.
Lineage tables: map EggNOG IDs to their respective taxonomic assignments (e.g., species, genus, family), enabling downstream tree annotation.
Functional annotation tables: link each sequence to standardized functional labels such as carbonic anhydrase activity (EC 4.2.1.1) or other predicted roles.

Preparation of Sequence Data

The downloaded FASTA files initially contained gapped multiple sequence alignments (MSAs), which are not compatible with TreeSAPP reference package creation. To prepare the data:

Ungapping was initially performed using a custom Bash workflow, following the TreeSAPP masking guidelines.
This process removes alignment gaps and other non-standard characters, leaving only clean, ungapped protein sequences for downstream analysis.
The workflow is documented in a Markdown file and executed entirely in Bash, ensuring reproducibility and transparency.

maestro integration

This shell scripting workflow was eventually replaced by a more reliable, reproducible maestro workflow. This workflow provides APIs for clustering, ungapping MSAs, and extracting OGs from EggNOG database files. See the maestro documentation here or the repository here for more information.

Input to TreeSAPP

Once cleaned and organized, these datasets served as raw inputs for the treesapp create workflow, which builds reference packages containing both:

Phylogenetic structure: based on evolutionary relationships inferred from sequence similarity.
Functional annotation: integrating biochemical metadata from EggNOG to map enzyme roles onto the tree. Together, these inputs were used in treesapp create workflows to build reference packages that capture both phylogenetic structure and functional annotation of CA diversity.

Isoelectric point

The isoelectric point (pI) ties the pH level in which the total net charge of a protein is at zero. That is, both the positive and negative charges of different side chains and ends balance each other out. In order to find an optimal pH, we must consider the charges present on the molecule as this will determine its physical and biochemical properties. This includes but it not limited to:

Solubility/Aggregation
- At pH near the pI, proteins will have minimal net charge meaning the electrostatic repulsion will be reduced and molecules will then favor aggregation hence a lower solubility
Structural Stability and Optimization
- Apart from the pI, we must also consider the optimal pH in which the protein, in our case carbonic anhydrase, functions most optimally. When CAs are exposed to extreme pH, then the structural stability of the protein will be hindered thus affecting its functionality. A weaker structural stability may affect the active sites of CAs and cause poor binding to substrates.

As seen above, selecting the right pH based on pI and optimal protein efficiency is critical to the CAs functionality. We therefore wrote a script containing a prewritten function called IsoelectricPoint found in the Biopython library to determine every sequence’s pI within a FASTA file. This information will be used to annotate the reference package, providing additional information for wet lab to select CA candidates

This is a short summary of how the isoelectric point is determined theoretically:

Identify all ionizable groups

The different acidic and basic side chains of each amino acid
Both the N-terminal amino group and C-terminal carboxyl group

Assign pKa values from literature to every ionizable group
Compute fraction protonated for each group using the Henderson-Hasselbalch equation (the probability of ionization based on pKa and pH) (Isoelectric Point Calculator)
Sum total positive and negative charge and find pH level in which charges balance each other to give a net zero ionization.

Methods

TreeSAPP Framework

We adopted TreeSAPP (Tree-based Sequence Alignment and Phylogenetic Placement) as the core computational framework for building and testing reference packages (RefPkgs) of carbonic anhydrases (CAs).

TreeSAPP is specifically designed for phylogenetic classification of functional genes in genomic and metagenomic datasets. Its strength lies in automating a reproducible workflow that integrates sequence clustering, multiple sequence alignment, HMM construction, phylogenetic inference, and taxonomic mapping into a unified package ([9]).

Each TreeSAPP Reference Package (RefPkg) is a portable data object that consolidates all information needed to classify and analyze a protein family, in this case carbonic anhydrases. These packages are central to both phylogenetic tree construction and functional annotation.

A RefPkg internally contains the following data:

FASTA: a multiple sequence alignment (MSA) of reference sequences used to infer relationships between homologs.
HMM profile: a hidden Markov model (HMM) that provides a probabilistic signature for sensitive detection of related sequences in new datasets.
Phylogenetic tree (NEWICK format): a tree structure that captures the evolutionary relationships among sequences.
Accession-to-lineage map: links each sequence ID to its corresponding taxonomic information, enabling precise annotation. The treesapp package view command is used to inspect and extract attributes of a reference package as shown in [10] . By default, it prints details directly to the console (standard out), which makes it easy to pipe outputs into other UNIX tools like awk or sed for downstream processing.

Internally, the core reference package is stored as a Python pickle file (.pkl), which contains the full dataset. Other human-readable files (e.g., FASTA, NEWICK tree, lineage tables) are generated by the TreeSAPP workflows for convenience. These can usually be regenerated directly from the .pkl file using treesapp package view if needed.

TreeSAPP includes tools for evaluation and classification, which support iterative improvements to a reference package. This includes:

Clade-exclusion tests - testing how well the package classifies sequences when specific evolutionary branches are hidden, improving robustness.
Metadata layering - adding functional and taxonomic information to sequences, enriching the tree for interpretation. In this project, TreeSAPP served as the central platform for both:

Constructing reference packages from curated EggNOG data.
Assessing their accuracy and sensitivity, ensuring reliable downstream classification of carbonic anhydrase sequences.

Input Sequences from EggNOG

We extracted protein sequences for carbonic anhydrases (CAs) from EggNOG v6, initially considering three major orthologous groups (OGs):

COG3338: primarily bacterial, alpha-class CAs
KOG0382: eukaryotic alpha-class CAs
KOG0789: eukaryotic gamma-class CAs At first, both bacterial and eukaryotic orthologs were included to capture the full diversity of CA enzymes. However, the eukaryotic groups contained over 10,000 sequences each, which was computationally infeasible to process through TreeSAPP workflows, even using cluster resources.

From a biological standpoint, it is possible to express eukaryotic CA enzymes in bacterial hosts like was done in [11] , as these proteins generally have minimal post-translational modification requirements, meaning they can often fold and function correctly in bacteria such as E. coli. This approach has been used in other studies for experimental characterization.

However, for this phase of the project, we restricted our focus to COG3338 (bacterial $\alpha$ -CAs) for two reasons:

Practicality - The smaller dataset ( about 1,392 sequences before filtering) allowed for efficient phylogenetic analysis and TreeSAPP reference package construction.
Relevance — Since our wet lab uses bacterial hosts for MICP experiments, starting with naturally bacterial alpha-CAs reduces uncertainty about expression and compatibility. In addition, alpha-CAs are generally monomeric, which makes them more amenable to surface display strategies compared to multimeric forms.

Reference Package Construction

Sequence Filtering

The raw EggNOG v6 COG3338 dataset contained 1,392 bacterial alpha-CA sequences. To ensure that only high-quality sequences were included in downstream analyses, several filtering steps were applied:

Mapping to NCBI accessions:

Each EggNOG sequence was cross-referenced against the NCBI protein database to verify that it had a valid, current accession ID.
Mapping was performed using a reference table (map.tsv), which likely originated from one of the EggNOG downloads, though this requires confirmation.
Sequences that could not be matched to a valid NCBI accession were discarded.
Filtering was implemented using a custom Python script available here: filter.py.

Removal of invalid or low-quality entries:

Entries mapped only to Root (no taxonomic assignment) were excluded.
Duplicate sequences were removed to avoid redundancy in the dataset.

Final high-quality dataset:

After filtering, the dataset was reduced to 996 high-quality bacterial alpha-CA sequences, representing a curated and taxonomically resolved set suitable for phylogenetic reference package construction.

Filtering workflow for the alpha-carbonic anhydrase (COG3338) reference package — The initial EggNOG dataset contained 1,392 sequences. Filtering removed 375 entries with invalid accession IDs, 14 with missing taxonomic assignments, and 9 duplicates. After these steps, 996 curated sequences remained. This process ensured that only high-quality, taxonomically valid sequences were retained for TreeSAPP reference package construction, improving both runtime efficiency and classification accuracy.

The initial EggNOG dataset contained 1,392 sequences. Filtering removed 375 entries with invalid accession IDs, 14 with missing taxonomic assignments, and 9 duplicates. After these steps, 996 curated sequences remained. This process ensured that only high-quality, taxonomically valid sequences were retained for TreeSAPP reference package construction, improving both runtime efficiency and classification accuracy.

Reference Package Construction with treesapp create

The command treesapp create is used to generate a TreeSAPP reference package (RefPkg) from a set of input protein or nucleotide sequences ([12]). A RefPkg is a portable collection of curated files that capture phylogenetic relationships, functional signatures, and taxonomic information, which are later used for classification and placement of query sequences.

We reached out to Ryan McLaughlin for technical guidance on TreeSAPP. He helped us and explained how to construct a bacterial alpha-carbonic anhydrase (COG3338) reference package with treesapp create , providing a non-redundant and quality-controlled dataset for downstream TreeSAPP workflows. He clarified how to generate high-quality reference phylogenies and visualize clade-specific features in a way that would support downstream interpretation. Additionally, Ryan assisted us with metadata integration, showing how to link functional and taxonomic information into the phylogenetic framework.

Ryan McLaughlin

PhD Candidate

$ treesap create
--i COG3338.fasta
--refpkg_name aCA
 --fast
 --num_procs 16
 --output aCA_TreeSAP_create

Flags

--i: Input FASTA file with protein sequences to build the reference package.
--output: Directory to save outputs and logs.
--refpkg_name: Name assigned to the reference package for easy tracking.
--fast: makes TreeSAPP build quicker trees.
--num_procs 16: Runs steps like alignment and tree-building on 16 CPU cores for speed. This produced the alpha-CA RefPkg for COG3338, focused on bacterial sequences. The final package provides a curated, non-redundant resource for downstream placement and evaluation in TreeSAPP.

Tree Visualization with treesapp color and iTOL

After building our reference package with treesapp create, we used treesapp color to generate a taxonomic color annotation file. This tool automatically assigns colors to branches in the phylogenetic tree based on taxonomy, making it easier to visually interpret evolutionary patterns ([13]).

The command works by reading the reference package .pkl file, which contains the taxonomic and functional data produced during the earlier steps. Running it produces a text file formatted specifically for iTOL (Interactive Tree of Life), which Ryan McLaughlin recommended to use for visualization of our results.

Example command:

treesapp color \
  -r <directory_path>/refpkg.pkl

<b>r</b> specifies the reference package .pkl file as input.
The output is a color file that maps each sequence to its corresponding taxonomic group.

Visualizing in iTOL

Once the color annotation file was generated, we moved to iTOL for visualization ([14]) :

Uploaded the NEWICK tree created by treesapp create.
Added the color file from treesapp color to link taxonomy-based colors to each branch.
Explored the tree interactively, using iTOL’s interface to:

Identify clusters of closely related alpha-CA sequences.
Compare diversity across major bacterial groups.
Visually confirm that sequences were correctly classified and placed.

The reference package was built using treesapp create and annotated with treesapp colour. The resulting Newick tree was imported into iTOL<b> </b>for visualization. Branches are coloured by taxonomic class (e.g., Bacilli, Alphaproteobacteria, Betaproteobacteria, Cyanophyceae). The tree highlights the wide phylogenetic diversity of COG3338 sequences across bacterial and archaeal groups. While the dataset was still unfiltered at this stage, the figure demonstrates successful workflow execution and provides an overview of the taxonomic breadth of alpha-CAs within EggNOG.

Phylogenetic tree of EggNOG COG3338 — The reference package was built using treesapp create and annotated with treesapp colour. The resulting Newick tree was imported into iTOL<b> </b>for visualization. Branches are coloured by taxonomic class (e.g., Bacilli, Alphaproteobacteria, Betaproteobacteria, Cyanophyceae). The tree highlights the wide phylogenetic diversity of COG3338 sequences across bacterial and archaeal groups. While the dataset was still unfiltered at this stage, the figure demonstrates successful workflow execution and provides an overview of the taxonomic breadth of alpha-CAs within EggNOG.

Metadata Integration

A central innovation of this project was extending TreeSAPP reference packages with biochemical and genomic metadata, allowing us to link sequence placement on the phylogenetic tree with enzyme function and ecological context. By attributing functional data, such as catalytic efficiency, to specific branches and clades, we can identify the most promising CA sequences for experimental testing. This direct connection between computational predictions and wet lab experimentation enables a data-driven approach for selecting candidate enzymes, ensuring that chosen CAs not only fit evolutionary patterns but are also optimized for activity in processes like microbially induced calcium carbonate precipitation.

BRENDA Integration

For the alpha-CA reference packages, we enriched each sequence entry with biochemical metadata from the BRENDA enzyme database, providing functional context alongside evolutionary information ([15]).

Parameters Extracted

Catalytic constant:
- Km — substrate affinity value.
Environmental sensitivities:
- pH optimum — the ideal pH for maximum enzyme activity.
- pH range — the span of pH values where the enzyme remains active.
- Temperature optimum — the ideal temperature for maximum enzyme activity.
- Temperature range — the span of temperatures where the enzyme remains active.

Below are background information retrieved for carbonic anhydrase (EC 4.2.1.1) compiled from the BRENDA enzyme database. These values represent measurements across multiple isoforms and organisms.

Figure 3. Km values for carbonic anhydrase (EC 4.2.1.1). Figure 4. pH optima of carbonic anhydrase (EC 4.2.1.1).

Figure 3. Km values for carbonic anhydrase (EC 4.2.1.1). Figure 4. pH optima of carbonic anhydrase (EC 4.2.1.1).

Figure 5. Temperature optima of carbonic anhydrase (EC 4.2.1.1). Figure 6. Functional pH ranges of carbonic anhydrase (EC 4.2.1.1). Figure 7. Functional temperature ranges of carbonic anhydrase (EC 4.2.1.1).

Figure 5. Temperature optima of carbonic anhydrase (EC 4.2.1.1). Figure 6. Functional pH ranges of carbonic anhydrase (EC 4.2.1.1). Figure 7. Functional temperature ranges of carbonic anhydrase (EC 4.2.1.1).

Mapping Strategy

To connect BRENDA biochemical data with our EggNOG-derived alpha-CA sequences, we relied on a multi-step mapping process centered on taxonomic information and protein identifiers.

EggNOG headers follow the format: >taxonomy_id.locus_id
- taxonomy_id identifies the organism of origin.
- locus_id indicates the specific location of the gene/protein within that organism’s genome. This structure allowed us to trace each sequence back to its source organism.
Primary mapping via UniProt taxonomy IDs:
- Each alpha-CA entry in BRENDA is linked to a UniProt accession and annotated with a UniProt taxonomy ID.
- We used these taxonomy IDs to match BRENDA data with EggNOG sequences that shared the same organism identifier.
Retrieving UniProt accessions:
- When direct accession matches were missing, we queried UniProt using the enzyme’s EC number (4.2.1.1, the canonical code for carbonic anhydrase) combined with the organism name: ec:4.2.1.1 AND (organism_name:"\{organism\}")
- This search provided a corresponding UniProt accession even when only the scientific name was known.
Fallback mapping using EC numbers alone:
- In cases where taxonomy-based mapping was incomplete, we attempted to link sequences directly through the EC number, though this approach provided less precise coverage.

Challenges and Outcomes

Our integration of BRENDA data into the EggNOG-derived alpha-CA reference package faced several hurdles:

Lack of direct UniProt accessions: BRENDA does not consistently provide UniProt accessions for its entries, making it difficult to match biochemical data directly to sequences.
Incomplete biochemical characterization: Many alpha-CA enzymes remain poorly characterized. As a result, numerous organisms in EggNOG lack corresponding BRENDA data for key parameters such as Km, pH optima, and temperature ranges.
Need for broader characterization: These gaps highlight the early stage of alpha-CA research, suggesting that more experimental biochemical studies are needed to better understand enzyme diversity and function across taxa.

From this integration, the top three candidates identified from BRENDA were:

Helicobacter pylori CA
Mesorhizobium loti CA
Escherichia coli CA Notably, Helicobacter pylori CA was already one of the wet lab’s chosen candidates, independently validating their search strategy. The other two enzymes also fell within the same taxonomic order as the wet lab’s selected targets based on TreeSAPP placement, showing strong alignment between the bioinformatics pipeline and wet lab prioritization.

GTDB Integration

To ensure our alpha-CA reference packages were grounded in a phylogenetically consistent and reliable bacterial taxonomy, we integrated data from the Genome Taxonomy Database (GTDB). GTDB is widely regarded as the most comprehensive and standardized bacterial taxonomy resource. Unlike traditional databases, it normalizes taxonomic classifications using genome-wide data and provides consistent hierarchical ranks (e.g., species, genus, family) ([16]).

With GTDB we ensured that all our sequence placements and groupings were based on the most accurate and up-to-date bacterial taxonomy available. This was especially important because:

Many public sequence databases have outdated or inconsistent taxonomy labels.
GTDB provides direct mapping via NCBI accessions, enabling smooth integration with our EggNOG-derived sequence headers.
It allowed us to confidently cluster alpha-CAs by their true evolutionary relationships rather than relying on legacy or fragmented taxonomy systems. In practice, GTDB served as our “truth set” for bacterial taxonomy, ensuring that downstream analysis and visualization reflected biologically accurate relationships and improving the reliability of our candidate selection process.

Data Extraction

To generate a clean subset of alpha-CA genomes, we followed these steps using the Alpha Carbonic Anhydrase Genome Mapping repository:

Initial Genome Identification via Annotree

Queried Annotree for genomes containing the alpha-CA KEGG Orthology (KO) term.
Exported NCBI accession IDs and basic taxonomic data for these genomes.

GTDB Metadata Integration

For each genome, retrieved rich metadata from GTDB, including:

Field	Purpose
`accession`	Unique identifier for each genome in GTDB/NCBI, used for tracking and cross-referencing.
`taxonomy`	Standardized GTDB taxonomic classification, providing phylogenetic context for each sequence.
`uniprot_accessions`	Connects genome proteins to UniProt records for biochemical and functional annotation.
`uniprot_tax_ids`	UniProt taxonomy IDs used to link external resources like BRENDA and SABIO-RK.
`uniprot_sequence`	Protein sequence data serving as the direct input for alpha-CA reference package construction.

Final Subset Creation

Filtered and combined all metadata into a single table (ca_with_taxonomy.csv).
Final dataset contained 42 high-quality alpha-CA genomes ready for downstream TreeSAPP analysis.

Figure 8. GTDB/NCBI filtering workflow for α-CA sequences. Candidate sequences identified with AnnoTree were retained after filtering. During mapping to GTDB/NCBI taxonomy, a subset of sequences could not be assigned, leaving a curated set of annotated alpha-CAs for downstream analysis.

Data Extracted

NCBI Taxonomy Mappings Provided a direct bridge between GTDB classifications and standard NCBI taxonomy IDs. This ensured compatibility with other resources like UniProt and BRENDA.
Sequences for COG3338 Construction The dereplicated, high-quality GTDB genomes supplied the protein sequences used to build our COG3338 alpha-CA reference package.
Functional Metadata Integration While GTDB itself does not contain UniProt data, both NCBI and UniProt rely on the same taxonomy IDs. This allowed us to link sequences from GTDB genomes to UniProt identifiers, enabling functional and biochemical annotation through external databases such as BRENDA

Mapping Strategy

Each GTDB genome was matched to its NCBI taxonomy ID. Because NCBI and UniProt share the same taxonomy ID system, these identifiers provided a consistent way to cross-reference entries across databases.
Using these TaxIDs, we mapped genomes to their NCBI protein accessions, ensuring we could retrieve the corresponding sequences.
These accessions were then cross-referenced to UniProt IDs, creating a functional bridge.
The resulting mapping connected:

GTDB taxonomy → high-quality evolutionary framework.
NCBI accessions → protein sequences for phylogenetics.
UniProt accessions → external functional data such as Km, pH, temperature, and stability.

This unified dataset allowed us to integrate phylogenetic structure with biochemical and ecological context, forming the foundation for candidate selection and wet lab prioritization.

Advantages of GTDB Integration

Incorporating GTDB data into our workflow provided several key benefits for building a reliable and functional alpha-carbonic anhydrase reference package.

Standardized Taxonomy Across Reference Packages GTDB offers a consistent, genome-based taxonomy that eliminates ambiguities found in traditional NCBI naming. This ensured that our TreeSAPP placements were based on a “truth set” of bacterial relationships, improving the accuracy of clustering and interpretation.
Functional Markers for Tree Interpretation While not every alpha-CA genome in GTDB has detailed functional or biochemical data, we integrated those that did into our reference package as “anchor points.”
- For example, alpha-CAs with known catalytic efficiency values from BRENDA serve as functional markers in the tree.
- TreeSAPP then places non-annotated sequences relative 1to these markers, allowing us to infer potential properties of uncharacterized enzymes.
Bridge Between GTDB, NCBI, and UniProt Because GTDB accessions map cleanly to NCBI taxonomy IDs, we could link each genome to external datasets such as:
- UniProt for protein sequences and annotations
- BRENDA for biochemical

Evaluation

Purpose: Clade exclusion analysis simulates situations where the reference database lacks certain taxa. It evaluates how accurately TreeSAPP can classify sequences under such conditions, where our refpkg could be used for placement/analysis of metagenomic data. This is done to ensure the robustness and accuracy of the TreeSAPP method and the principle behind this analysis involves:

Intentionally removing a specific clade (an ancestor and all of its descendants) from the phylogenetic tree
Reinserting the clade based on TreeSAPP method to see if the clade was correctly inserted into the same phylogenetic relationships.

Commands: treesapp evaluate performs the analysis using a FASTA/FASTQ input and a TreeSAPP reference package. Example syntax:

Example:

treesapp evaluate -i input.fasta -r refpkg.pkl --taxonomic_ranks class genus species

i: Input sequence file (FASTA/FASTQ), which are all of the inputs from our refpkg
r: Reference package path (.pkl).
-taxonomic_ranks: Taxonomic levels to evaluate (default: class, species).
l: Trim sequences to simulate fragments (e.g., 150bp).
-fresh: Rebuild phylogenetic tree (slow; not usually needed).
a: Provide precomputed accession-to-lineage mapping to save time.
-tool: Choose classifier (default is TreeSAPP).

Workflow:

Lineage Extraction: Determines taxonomy for input sequences.
Selection: Identifies clades suitable for evaluation (those with multiple sub-taxa).
Clade Removal & Classification:
1. Remove each taxon from the reference package.
2. Iterate over each query sequence that fall under this taxon and remove it.
3. Restore and repeat for next clade.
Distance Analysis: Compares predicted taxonomic labels to the existing placement within the refpkg, measuring how close the classifications are. Outputs:

Located in final_outputs/:

accuracy.tsv:
- Percentage of sequences that were correctly classified at each taxonomic rank
- Provides a standardized metric of accuracy at different taxonomic levels
clade_exclusion_performance.tsv:
- A summary of how many sequences were classified within a certain taxonomic distance from the correct classification
  - measures from the true classification in the taxonomy hierarchy
  - Includes correctly assigned within acceptable taxonomic distance, over classified, and under-classified.
- Lets us interpret the precision and errors within the classification.
taxonomic_recall.tsv: Recall per taxon.

Step-by-step Experimental Procedure

1. Prepare Input Files

Ensure existence of:
- queries.fasta or queries.fastq
- refpkg.pkl

2. Create Output Directory

mkdir -p clade_exclusion_output

3. Run Basic Evaluation 4. Evaluate Fragment Classification Accuracy 5. Use Lineage Mapping File 6. Extend to Additional Ranks

treesapp evaluate \
  -i queries.fasta \
  -r refpkg.pkl \
  -o clade_exclusion_output \
  --taxonomic_ranks class species \
  -a accession2lin.tsv \
  -l 150 \
  --taxonomic_ranks genus order \
  -n 4

The obtained results and analysis will be shared in the following section.

Julia Anstett is enrolled in the Genome Sciences and Technology program as a PhD candidate. Her work focuses on metagenomics, single-cell genomics, and genome quality in the context of the microbial ecology of anoxic marine settings. Julia helped us understand the functionality of TreeSAPP evaluate and how to interpret the output results from evaluate. She also guided us on further implementation of TreeSAPP where inaccurate nodes could be removed for a better representation of the phylogenetic tree.

Julia Anstett

PhD Candidate

Evaluate Results

Accuracy.tsv file:

A drop in placement accuracy is observed at higher resolution with the lowest being at the genus level.
There is more diversity and finer distinctions at lower taxonomic levels

Query	Tool	Rank	Score
aCA	treesapp	domain	100.0
aCA	treesapp	phylum	100.0
aCA	treesapp	class	98.8
aCA	treesapp	order	98.0
aCA	treesapp	family	94.9
aCA	treesapp	genus	91.0

Score: Percentage of sequences that were correctly classified at this rank in %.

clade_exclusion_performance.tsv:

This data summarizes the taxonomic classification performance of TreeSAPP at class and species levels using clade exclusion analysis, measured by taxonomic distance (TaxDist).

TaxDist (Taxonomic Distance): Number of taxonomic ranks between the correct classification and the predicted one. 0means an exact match; higher numbers indicate increasing distance from the correct rank.
Queries: Total number of sequences evaluated at this taxonomic rank. This number stays the same across all rows since all sequences are tested at each distance level.
Correct: Number of sequences correctly classified at this specific TaxDist level. For example, at TaxDist 0, it’s the count of exact matches; at TaxDist 1, it’s the number of predictions one rank away from being correct.
Cumulative: Running total of correctly classified sequences within the current and all previous TaxDist levels. It shows how many predictions were at least this close or better to the correct answer.
Over (Over-classified): Number of sequences classified too specifically---TreeSAPP predicted a more detailed taxonomic rank than the data supported.
Under (Under-classified): Number of sequences classified too generally---TreeSAPP only predicted a broader category when a more specific classification was possible.

Class-level exclusion

TaxDist = 0—2 (very close relatives removed): Accuracy is still high, refpkg can classify sequences at the class level even without immediate relatives.
TaxDist = 3—4 (more distant relatives removed): Noticeable drop in correct classification, but still some successful assignments.
TaxDist ≥ 5: Accuracy drops sharply, sometimes near zero, meaning the reference can’t reliably classify sequences when only very distant relatives are left.

Species-level exclusion

TaxDist = 0 (species present in reference): Nearly perfect classification --- sequences are correctly identified when the exact species exists in the refpkg.
TaxDist = 1—2 (close relatives missing): Accuracy declines, but still decent if genus-level relatives remain.
TaxDist ≥ 3: Accuracy drops rapidly --- without close relatives, the refpkg struggles to correctly classify at species level.

Class-level performance

TaxDist	Queries	Correct	Cumulative	Over	Under
0	806	20	20	336	105
1	806	124	144	317	0
2	806	20	164	297	0
3	806	28	192	269	0
4	806	134	326	135	0
5	806	135	461	0	0
6	806	0	461	0	0
7	806	0	461	0	0

Species-level performance

TaxDist	Queries	Correct	Cumulative	Under
0	697	513	513	130
1	697	72	585	58
2	697	25	610	33
3	697	20	630	13
4	697	5	635	8
5	697	8	643	0
6	697	0	643	0
7	697	0	643	0

representative_taxa_sequences.fasta:

shows a few high quality sequences in order to reduce redundancy within original fasta file which best represents each taxonomic group.
The data given in this section was no used in significance for further analysis

taxonomic_recall.tsv

awk -F'\t' '$3 < 0.7' taxonomic_recall.tsv

We ran the code above to find low-recall clades with a threshold of 70% and found that some species had 0% recall including:

Class-level (0% recall):

1 class level was found in this category: c__Magnetococcia

Species-level (0% recall):

40 species were found in this category including S_Nocardioides sp. JS614, s__Nocardioides szechwanensis, s__Actinokineospora iranica, s__Lentzea albida, etc. This indicates that these taxes are highly misrepresented or missing, and were removed for future TreeSAPP runs.

Conclusions + future directions

Throughout our workflow, we successfully completed an end-to-end TreeSAPP workflow to analyze the evolutionary relationship of carbonic anhydrases relevant to MCIP. This process included constructing a phylogenetic tree from a curated dataset from EGGNOGG, and layering two external metadata sources: BRENDA and GTDB, one to capture enzymatic function and another to refine taxonomic classification. This allowed us to not only place each sequence based on their evolutionary placement but also the biochemical potential of the species - providing us a high-resolution view of the carbonic anhydrase family.

One of the most interesting findings from the analysis was that annotations from BRENDA independently identified H. pylori as a CA source which mirrors the findings of wet lab. The convergence in results between bioinformatics and experimental selection further validates our computational pipeline. Using the results we have gathered from TreeSAPP, we were able to provide wet lab with a ranked list of functionally diverse CA sequences, including activity, environmental resilience and novelty. We also paid close attention to a thermostable and high activity enzyme: the SazCA from S. azorense. To further investigate this variant, we have identified close homologs to SazCA within the same taxonomic order of Aquificales by locating it within the tree. We then identified related thermophilic CAs and expanded the viable candidates to test for Martian-like conditions.

Looking forward, the pipeline we have created allows for additional layers of metadata sources including structural models, kinetic data, molecular dynamics, and further prioritization of enzyme candidates. This will also allow future researchers to expand the search space by incorporating datasets with extreme environments and potentially reveal uncharacterized CAs that are more desirable for those conditions. Finally, we could also broaden our analysis to further CA families beyond just alpha.

Overall, our integrative approach has not only provided crucial data for wet lab, but have also established a reusable pipeline for future protein discovery and analysis.

1. Yang Z, Rannala B. Molecular phylogenetics: Principles and practice. Nat Rev Genet [Internet]. 2012 May [cited 2025 Sept 25];13(5):303—14. Available from: https://www.nature.com/articles/nrg3186

2. Fackrell LE, Schroeder PA, Thompson A, Stockstill-Cahill K, Hibbitts CA. Development of martian regolith and bedrock simulants: Potential and limitations of martian regolith as an in-situ resource. Icarus [Internet]. 2021 Jan 15 [cited 2025 Sept 25];354:114055. Available from: https://www.sciencedirect.com/science/article/pii/S0019103520304061

3. Lindskog S. Structure and mechanism of carbonic anhydrase. Pharmacology & Therapeutics [Internet]. 1997 Jan 1 [cited 2025 Sept 25];74(1):1—20. Available from: https://www.sciencedirect.com/science/article/pii/S0163725896001982

4. Hewett-Emmett D, Tashian RE. Functional Diversity, Conservation, and Convergence in the Evolution of the α-, β-, and γ-Carbonic Anhydrase Gene Families. Molecular Phylogenetics and Evolution [Internet]. 1996 Feb 1 [cited 2025 Sept 25];5(1):50—77. Available from: https://www.sciencedirect.com/science/article/pii/S1055790396900068

5. Supuran CT, Capasso C. An Overview of the Bacterial Carbonic Anhydrases. Metabolites [Internet]. 2017 Nov 11 [cited 2025 Sept 25];7(4):56. Available from: https://pmc.ncbi.nlm.nih.gov/articles/PMC5746736/

6. Cantalapiedra CP, Hernández-Plaza A, Letunic I, Bork P, Huerta-Cepas J. eggNOG-mapper v2: Functional Annotation, Orthology Assignments, and Domain Prediction at the Metagenomic Scale. Mol Biol Evol [Internet]. 2021 Dec 9;38(12):5825—9. Available from: https://www.ncbi.nlm.nih.gov/pubmed/34597405

7. Hernández-Plaza A, Szklarczyk D, Botas J, Cantalapiedra CP, Giner-Lamia J, Mende DR, et al. eggNOG 6.0: Enabling comparative genomics across 12 535 organisms. Nucleic Acids Res [Internet]. 2023 Jan 6;51(D1):D389—94. Available from: https://www.ncbi.nlm.nih.gov/pubmed/36399505

8. EggNOG Database | Orthology predictions and functional annnotaion [Internet]. [cited 2025 Sept 25]. Available from: http://eggnog6.embl.de/search/ogs/LCOG3338/

9. Morgan-Lang C, McLaughlin R, Armstrong Z, Zhang G, Chan K, Hallam SJ. TreeSAPP: The Tree-based Sensitive and Accurate Phylogenetic Profiler. Bioinformatics [Internet]. 2020 Sept 15 [cited 2025 Sept 25];36(18):4706—13. Available from: https://doi.org/10.1093/bioinformatics/btaa588

10. Reference package operations · hallamlab/TreeSAPP Wiki [Internet]. [cited 2025 Sept 25]. Available from: https://github.com/hallamlab/TreeSAPP/wiki/Reference-package-operations

11. Waheed A, Pham T, Won M, Okuyama T, Sly WS. Human carbonic anhydrase IV: In vitro activation and purification of disulfide-bonded enzyme following expression in Escherichia coli. Protein Expr Purif [Internet]. 1997 Mar;9(2):279—87. Available from: https://www.ncbi.nlm.nih.gov/pubmed/9056493

12. Building reference packages with TreeSAPP · hallamlab/TreeSAPP Wiki [Internet]. [cited 2025 Sept 25]. Available from: https://github.com/hallamlab/TreeSAPP/wiki/Building-reference-packages-with-TreeSAPP

13. Automatically colouring trees [Internet]. GitHub; [cited 2025 Sept 25]. Available from: https://github.com/hallamlab/TreeSAPP/wiki/Automatically-colouring-trees

14. Letunic I, Bork P. Interactive Tree Of Life (iTOL) v5: An online tool for phylogenetic tree display and annotation. Nucleic Acids Res [Internet]. 2021 July 2 [cited 2025 Sept 25];49(W1):W293—6. Available from: https://doi.org/10.1093/nar/gkab301

15. Schomburg I, Chang A, Schomburg D. BRENDA, enzyme data and metabolic information. Nucleic Acids Res [Internet]. 2002 Jan 1 [cited 2025 Sept 25];30(1):47—9. Available from: https://pmc.ncbi.nlm.nih.gov/articles/PMC99121/

16. Parks DH, Chuvochina M, Rinke C, Mussig AJ, Chaumeil PA, Hugenholtz P. GTDB: An ongoing census of bacterial and archaeal diversity through a phylogenetically consistent, rank normalized and complete genome-based taxonomy. Nucleic Acids Res [Internet]. 2022 Jan 7 [cited 2025 Sept 25];50(D1):D785—94. Available from: https://doi.org/10.1093/nar/gkab776

TaxDist	Queries	Correct	Cumulative	Over	Under
0	806	20	20	336	105
1	806	124	144	317	0
2	806	20	164	297	0
3	806	28	192	269	0
4	806	134	326	135	0
5	806	135	461	0	0
6	806	0	461	0	0
7	806	0	461	0	0

TaxDist	Queries	Correct	Cumulative	Under
0	697	513	513	130
1	697	72	585	58
2	697	25	610	33
3	697	20	630	13
4	697	5	635	8
5	697	8	643	0
6	697	0	643	0
7	697	0	643	0

TaxDist	Queries	Correct	Cumulative	Over	Under
0	806	20	20	336	105
1	806	124	144	317	0
2	806	20	164	297	0
3	806	28	192	269	0
4	806	134	326	135	0
5	806	135	461	0	0
6	806	0	461	0	0
7	806	0	461	0	0

TaxDist	Queries	Correct	Cumulative	Under
0	697	513	513	130
1	697	72	585	58
2	697	25	610	33
3	697	20	630	13
4	697	5	635	8
5	697	8	643	0
6	697	0	643	0
7	697	0	643	0

Description

Inclusive Perspective

Results

Team

Biocementing Bacteria

Cyanobacteria

Caulobacter

E. coli

Functional Validation

Bioinformatics

Modelling

Parts

Protocols

Lab Notebook: Biocementing Bacteria

Bioreactor Overview

CB2A Bioreactor: Requirements + Design

CB2A Bioreactor: Build

CB2A Bioreactor: Validation

UTEX 2973 Bioreactor: Requirements + Design

UTEX 2973 Bioreactor: Build

UTEX 2973 Bioreactor: Validation

UTEX 2973 Bioreactor: Media Optimization

Low-Gravity Bioreactor: Requirements + Design

Low-Gravity Bioreactor: Build

Low-Gravity Bioreactor: Validation

Bioprinter: Overview

Bioink: Composition Testing

Bioink: Model Validation

Bioink: Calcium Diffusion Modelling

Bioprinter: Requirements + Design

Bioprinter: Build

Bioprinter: Validation

Lab Notebook: Bioprinter & Bioink

Software

dagger

maestro

miso

Education

Educational Programming & Outreach

UBC iGEM 2025 SynBio Case Competition

Synthetic Biology Children’s Storybook

Inclusivity

Understanding our User’s Needs

Fine-tuning our Understanding

Iterative DBTL Cycles

Project Outcomes

Alternative Inclusive Design Projects

Best Practices for Inclusive Design Project

Needs Finding

Market Research

Strategic Planning

Business Development

Long Term Growth

Medal Criteria

Contributions

Human Practices

Engineering

Attributions

Bioinformatics

Introduction

What is Phylogenetics?

Role of Carbonic Anhydrase (CA) in Our Project

Why Perform a Phylogenetic Analysis?

Background & Research: Evolution of Carbonic Anhydxrases

Families of Carbonic Anhydrases

Evolutionary Relationships

EggNOG and Data Input

Why EggNOG?

Input from EggNOG

Preparation of Sequence Data

maestro integration

Input to TreeSAPP

Isoelectric point

Methods

TreeSAPP Framework

Iterative Refinement and Metadata Layering

Input Sequences from EggNOG

Reference Package Construction

Sequence Filtering

Reference Package Construction with treesapp create

TaxDist	Queries	Correct	Cumulative	Over	Under
0	806	20	20	336	105
1	806	124	144	317	0
2	806	20	164	297	0
3	806	28	192	269	0
4	806	134	326	135	0
5	806	135	461	0	0
6	806	0	461	0	0
7	806	0	461	0	0

TaxDist	Queries	Correct	Cumulative	Under
0	697	513	513	130
1	697	72	585	58
2	697	25	610	33
3	697	20	630	13
4	697	5	635	8
5	697	8	643	0
6	697	0	643	0
7	697	0	643	0