To support our team’s wet lab, we investigated the evolutionary landscape and biochemical properties of carbonic anhydrases - the key enzyme in induced calcium carbonate precipitation. Our goal was to identify and characterize a diverse number of CA variants in order to optimally select one for extraterrestrial conditions. We conducted phylogenetic analysis using TreeSAPP on the alpha family and traced sequence motifs and functional divergence across variants that had high catalytic performances. This integrative approached allowed us to provide wet lab with crucial information for further testing.
Introduction
What is Phylogenetics?
Phylogenetics is the study of evolutionary relationships among organisms or genes using molecular sequence data. By constructing phylogenetic trees, researchers can visualize how sequences diverged from common ancestors and identify patterns of conservation and variation ([1]). This analysis provides insights into evolutionary history and potential functional similarities between proteins, guiding the selection of promising targets for applied research.
Role of Carbonic Anhydrase (CA) in Our Project
Our project explores sustainable building materials through microbially induced calcium carbonate precipitation (MICP). In this process, microbes facilitate the conversion of carbon dioxide into bicarbonate, which then combines with calcium to form calcium carbonate. The resulting mineral acts as a natural cement that can bind loose particles of sand, soil, or regolith into a solid, concrete-like material suitable for construction.
Carbonic anhydrase (CA) plays a key role by accelerating the hydration of carbon dioxide, thereby increasing the availability of carbonate ions that drive calcium carbonate formation. This makes the MICP process more efficient and scalable for diverse environments. On Earth, CA-enhanced MICP has potential for green construction, soil stabilization, and carbon sequestration. In space, the same principle could be applied to transform extraterrestrial regolith ([2]) into strong, durable building material, enabling infrastructure development without relying on costly material transport. By bridging Earth-based sustainability with space applications, CA provides a versatile biocatalyst for advancing resilient and eco-friendly construction.
By accelerating the conversion of carbon dioxide into bicarbonate, CA increases the availability of carbonate ions that combine with calcium to form calcium carbonate. This step is essential for efficiently turning loose regolith into strong, durable building material.
Why Perform a Phylogenetic Analysis?
To optimize our bioengineering approach, we are conducting a phylogenetic analysis of carbonic anhydrase sequences. This analysis allows us to strategically select enzyme candidates for experimental testing by first narrowing the field through in-silico comparisons. In doing so, we can better understand CA diversity and functionality, enabling more informed choices and reducing the time and resources required in the lab. Our goals are twofold:
Understand evolutionary diversity: By examining the evolutionary history of CA enzymes, we can identify conserved motifs and functional differences across species. This knowledge highlights structural or activity-related features that may enhance MICP performance under Martian conditions.
Support wet lab candidate selection:
Database annotation: We will integrate functional and taxonomic data into the phylogenetic tree, giving the wet lab team detailed insights into each sequence’s origin and known characteristics.
Cluster identification: By grouping related sequences into taxonomic “clusters,” we can pinpoint promising families of CAs for experimental testing, maximizing diversity while targeting high-functioning variants.
Background & Research: Evolution of Carbonic Anhydxrases
Carbonic anhydrases (CAs) are a diverse group of metalloenzymes that catalyze the reversible hydration of carbon dioxide ([3]):
CO2+H2O↔HCO3−+H+
Although all carbonic anhydrases (CAs) catalyze the same reaction,they are divided into at least five structurally unrelated families:
α,β,γ,δ,ζ
These families share no significant amino acid sequence similarity, reflecting convergent evolution, where different protein scaffolds independently evolved to solve the same biochemical problem of carbon dioxide conversion ([4]). Within each family, sequences are similar and phylogenetically related, forming distinct clusters with conserved motifs and similar folds. However, between families, there is no detectable similarity, even though all catalyze the same reaction ([5]). For our project, this means our phylogenetic analysis will focus on identifying clusters within a single family most relevant to microbially induced calcium carbonate precipitation (MICP), guiding the wet lab toward selecting promising CA variants for efficient Martian regolith solidification.
Families of Carbonic Anhydrases
Alpha - CA Found in vertebrates, plants, and some bacteria. These are monomeric, zinc-dependent enzymes. All alpha-CAs evolved from a single ancestral gene through multiple duplication events, resulting in different isoforms with specialized functions and tissue distributions.
Type
Representative Organisms
Key Role
Cytoplasmic alpha-CAs
Escherichia coli, Bacillus subtilis
Maintains intracellular pH and supplies bicarbonate for biosynthetic pathways such as fatty acid and nucleotide synthesis
Periplasmic alpha-CAs
Escherichia coli, Neisseria gonorrhoeae
Catalyzes rapid hydration of carbon dioxide for extracellular pH regulation and carbonate equilibrium
Membrane-associated alpha-CAs
Helicobacter pylori, some extremophiles
Facilitates carbon dioxide transport across membranes and contributes to local pH stabilization
Beta-CAs Common in bacteria, algae, fungi, and plants. Unlike α-CAs, these are often oligomeric (form dimers or tetramers) and play key roles in photosynthesis and nitrogen fixation.
Gamma-CAs Mostly found in archaea and some bacteria. These enzymes are structurally distinct and can use either zinc bivalent cation or iron bivalent cationas cofactors, reflecting their origin in early, anaerobic microorganisms.
Delta and Zeta-CAs Restricted to marine diatoms (a group of algae).
Delta-CAs are structurally different but loosely related to beta-CAs.
Zeta-CAs are unique because they use cadmium bivalent cation instead of zinc - an adaptation to cadmium-rich ocean environments.
Evolutionary Relationships
Carbonic anhydrases (CAs) show two key evolutionary patterns: convergent and divergent evolution.
Convergent evolution occurs when unrelated proteins independently evolve to solve the same problem.
Divergent evolution happens when a single ancestral gene duplicates and its copies (paralogs) specialize over time. Phylogenetic studies show that groups like CA I/II/III (cytosolic), CA IV/IX/XII (membrane-bound), and CA VA/VB(mitochondrial) arose through these duplication events. Some forms gained new locations in the cell, while others (CA VIII, X, XI) lost catalytic activity but still serve regulatory functions ([4]).
EggNOG and Data Input
Why EggNOG?
EggNOG is a hierarchical database of orthologous groups (OGs) built from genomic and metagenomic data ([6]). Its value lies in orthology inference - grouping genes that derive from a common ancestor, while separating paralogs within species. For carbonic anhydrases (CAs), this is essential because:
Diversity of CA families: With different CA families having no sequence homology, EggNOG provides curated clustering that allows us to separate them cleanly into orthologous groups.
Resolution of paralogs: Within alpha-CAs, EggNOG assigns separate OGs to paralogous isoforms (e.g., CA I vs. CA II vs. CA IX). This helps avoid collapsing distinct functional enzymes into a single cluster.
Taxonomic breadth: EggNOG spans all domains of life (bacteria, archaea, eukaryotes), ensuring coverage of CA diversity from human isoforms to microbial homologs.
Functional annotation: Each OG comes with standardized functional labels, linking sequence data to biochemical roles.
Version upgrade (v5 → v6):
v5 contained fewer CA sequences and limited subgroup coverage.
v6 greatly expanded sequence counts, taxonomic distribution, and orthologous group resolution, giving us better evolutionary and functional context ([7]).
Thus, EggNOG provided the foundational dataset for constructing TreeSAPP reference packages and connecting CA diversity to phylogenetic and biochemical metadata.
Input from EggNOG
We extracted sequences and orthologous group information for carbonic anhydrases from EggNOG v6:
OG ID
Taxonomic Distribution
Associated primary Carbonic Anhydrase Class
Sequences
Species count
Functional Annotation
COG3338
Primarily Bacteria
alpha-class (97.2%)
1392
1237
Catalysis of carbon dioxide to bicarbonate
KOG0382
Eukaryotes
beta-class (69.5%)
11927
1268
Catalysis of carbon dioxide to bicarbonate
KOG0789
Eukaryotes
gamma-class (6.5%)
10557
1243
Protein tyrosine phosphatase, catalytic domain
Classification and sequence counts for three primary groups descending from LCOG3338 which represents the ancestral orthologous group that gave rise to modern carbonic anhydrase proteins, serving as a starting point for tracing the evolutionary history and functional diversity of these enzymes across bacteria and eukaryotes in EggNOG v6
EggNOG does not currently provide a public API for bulk retrieval, so we manually downloaded the required files for each orthologous group (OG) from the [8]. These downloads contain precomputed datasets for each OG and serve as the starting point for our phylogenetic pipeline.
For each OG, the following key files were obtained:
FASTA files of protein sequences: contain all member sequences assigned to the OG, representing the diversity of that protein family.
Lineage tables: map EggNOG IDs to their respective taxonomic assignments (e.g., species, genus, family), enabling downstream tree annotation.
Functional annotation tables: link each sequence to standardized functional labels such as carbonic anhydrase activity (EC 4.2.1.1) or other predicted roles.
Preparation of Sequence Data
The downloaded FASTA files initially contained gapped multiple sequence alignments (MSAs), which are not compatible with TreeSAPP reference package creation. To prepare the data:
This process removes alignment gaps and other non-standard characters, leaving only clean, ungapped protein sequences for downstream analysis.
The workflow is documented in a Markdown file and executed entirely in Bash, ensuring reproducibility and transparency.
maestro integration
This shell scripting workflow was eventually replaced by a more reliable, reproducible maestro workflow. This workflow provides APIs for clustering, ungapping MSAs, and extracting OGs from EggNOG database files. See the maestro documentation here or the repository here for more information.
Input to TreeSAPP
Once cleaned and organized, these datasets served as raw inputs for the treesapp create workflow, which builds reference packages containing both:
Phylogenetic structure: based on evolutionary relationships inferred from sequence similarity.
Functional annotation: integrating biochemical metadata from EggNOG to map enzyme roles onto the tree. Together, these inputs were used in treesapp create workflows to build reference packages that capture both phylogenetic structure and functional annotation of CA diversity.
Isoelectric point
The isoelectric point (pI) ties the pH level in which the total net charge of a protein is at zero. That is, both the positive and negative charges of different side chains and ends balance each other out. In order to find an optimal pH, we must consider the charges present on the molecule as this will determine its physical and biochemical properties. This includes but it not limited to:
Solubility/Aggregation
At pH near the pI, proteins will have minimal net charge meaning the electrostatic repulsion will be reduced and molecules will then favor aggregation hence a lower solubility
Structural Stability and Optimization
Apart from the pI, we must also consider the optimal pH in which the protein, in our case carbonic anhydrase, functions most optimally. When CAs are exposed to extreme pH, then the structural stability of the protein will be hindered thus affecting its functionality. A weaker structural stability may affect the active sites of CAs and cause poor binding to substrates.
As seen above, selecting the right pH based on pI and optimal protein efficiency is critical to the CAs functionality. We therefore wrote a script containing a prewritten function called IsoelectricPoint found in the Biopython library to determine every sequence’s pI within a FASTA file. This information will be used to annotate the reference package, providing additional information for wet lab to select CA candidates
This is a short summary of how the isoelectric point is determined theoretically:
Identify all ionizable groups
The different acidic and basic side chains of each amino acid
Both the N-terminal amino group and C-terminal carboxyl group
Assign pKa values from literature to every ionizable group
Compute fraction protonated for each group using the Henderson-Hasselbalch equation (the probability of ionization based on pKa and pH) (Isoelectric Point Calculator)
Sum total positive and negative charge and find pH level in which charges balance each other to give a net zero ionization.
Methods
TreeSAPP Framework
We adopted TreeSAPP (Tree-based Sequence Alignment and Phylogenetic Placement) as the core computational framework for building and testing reference packages (RefPkgs) of carbonic anhydrases (CAs).
TreeSAPP is specifically designed for phylogenetic classification of functional genes in genomic and metagenomic datasets. Its strength lies in automating a reproducible workflow that integrates sequence clustering, multiple sequence alignment, HMM construction, phylogenetic inference, and taxonomic mapping into a unified package ([9]).
Each TreeSAPP Reference Package (RefPkg) is a portable data object that consolidates all information needed to classify and analyze a protein family, in this case carbonic anhydrases. These packages are central to both phylogenetic tree construction and functional annotation.
A RefPkg internally contains the following data:
FASTA: a multiple sequence alignment (MSA) of reference sequences used to infer relationships between homologs.
HMM profile: a hidden Markov model (HMM) that provides a probabilistic signature for sensitive detection of related sequences in new datasets.
Phylogenetic tree (NEWICK format): a tree structure that captures the evolutionary relationships among sequences.
Accession-to-lineage map: links each sequence ID to its corresponding taxonomic information, enabling precise annotation. The treesapp package view command is used to inspect and extract attributes of a reference package as shown in [10] . By default, it prints details directly to the console (standard out), which makes it easy to pipe outputs into other UNIX tools like awk or sed for downstream processing.
Internally, the core reference package is stored as a Python pickle file (.pkl), which contains the full dataset. Other human-readable files (e.g., FASTA, NEWICK tree, lineage tables) are generated by the TreeSAPP workflows for convenience. These can usually be regenerated directly from the .pkl file using treesapp package view if needed.
Iterative Refinement and Metadata Layering
TreeSAPP includes tools for evaluation and classification, which support iterative improvements to a reference package. This includes:
Clade-exclusion tests - testing how well the package classifies sequences when specific evolutionary branches are hidden, improving robustness.
Metadata layering - adding functional and taxonomic information to sequences, enriching the tree for interpretation. In this project, TreeSAPP served as the central platform for both:
Constructing reference packages from curated EggNOG data.
Assessing their accuracy and sensitivity, ensuring reliable downstream classification of carbonic anhydrase sequences.
Input Sequences from EggNOG
We extracted protein sequences for carbonic anhydrases (CAs) from EggNOG v6, initially considering three major orthologous groups (OGs):
COG3338: primarily bacterial, alpha-class CAs
KOG0382: eukaryotic alpha-class CAs
KOG0789: eukaryotic gamma-class CAs At first, both bacterial and eukaryotic orthologs were included to capture the full diversity of CA enzymes. However, the eukaryotic groups contained over 10,000 sequences each, which was computationally infeasible to process through TreeSAPP workflows, even using cluster resources.
From a biological standpoint, it is possible to express eukaryotic CA enzymes in bacterial hosts like was done in [11] , as these proteins generally have minimal post-translational modification requirements, meaning they can often fold and function correctly in bacteria such as E. coli. This approach has been used in other studies for experimental characterization.
However, for this phase of the project, we restricted our focus to COG3338 (bacterial α-CAs) for two reasons:
Practicality - The smaller dataset ( about 1,392 sequences before filtering) allowed for efficient phylogenetic analysis and TreeSAPP reference package construction.
Relevance — Since our wet lab uses bacterial hosts for MICP experiments, starting with naturally bacterial alpha-CAs reduces uncertainty about expression and compatibility. In addition, alpha-CAs are generally monomeric, which makes them more amenable to surface display strategies compared to multimeric forms.
Reference Package Construction
Sequence Filtering
The raw EggNOG v6 COG3338 dataset contained 1,392 bacterial alpha-CA sequences. To ensure that only high-quality sequences were included in downstream analyses, several filtering steps were applied:
Mapping to NCBI accessions:
Each EggNOG sequence was cross-referenced against the NCBI protein database to verify that it had a valid, current accession ID.
Mapping was performed using a reference table (map.tsv), which likely originated from one of the EggNOG downloads, though this requires confirmation.
Sequences that could not be matched to a valid NCBI accession were discarded.
Filtering was implemented using a custom Python script available here: filter.py.
Removal of invalid or low-quality entries:
Entries mapped only to Root (no taxonomic assignment) were excluded.
Duplicate sequences were removed to avoid redundancy in the dataset.
Final high-quality dataset:
After filtering, the dataset was reduced to 996 high-quality bacterial alpha-CA sequences, representing a curated and taxonomically resolved set suitable for phylogenetic reference package construction.
The initial EggNOG dataset contained 1,392 sequences. Filtering removed 375 entries with invalid accession IDs, 14 with missing taxonomic assignments, and 9 duplicates. After these steps, 996 curated sequences remained. This process ensured that only high-quality, taxonomically valid sequences were retained for TreeSAPP reference package construction, improving both runtime efficiency and classification accuracy.
Reference Package Construction with treesapp create
The command treesapp create is used to generate a TreeSAPP reference package (RefPkg) from a set of input protein or nucleotide sequences ([12]). A RefPkg is a portable collection of curated files that capture phylogenetic relationships, functional signatures, and taxonomic information, which are later used for classification and placement of query sequences.
We reached out to Ryan McLaughlin for technical guidance on TreeSAPP. He helped us and explained how to construct a bacterial alpha-carbonic anhydrase (COG3338) reference package with treesapp create , providing a non-redundant and quality-controlled dataset for downstream TreeSAPP workflows. He clarified how to generate high-quality reference phylogenies and visualize clade-specific features in a way that would support downstream interpretation. Additionally, Ryan assisted us with metadata integration, showing how to link functional and taxonomic information into the phylogenetic framework.
--i: Input FASTA file with protein sequences to build the reference package.
--output: Directory to save outputs and logs.
--refpkg_name: Name assigned to the reference package for easy tracking.
--fast: makes TreeSAPP build quicker trees.
--num_procs 16: Runs steps like alignment and tree-building on 16 CPU cores for speed. This produced the alpha-CA RefPkg for COG3338, focused on bacterial sequences. The final package provides a curated, non-redundant resource for downstream placement and evaluation in TreeSAPP.
Tree Visualization with treesapp color and iTOL
After building our reference package with treesapp create, we used treesapp color to generate a taxonomic color annotation file. This tool automatically assigns colors to branches in the phylogenetic tree based on taxonomy, making it easier to visually interpret evolutionary patterns ([13]).
The command works by reading the reference package .pkl file, which contains the taxonomic and functional data produced during the earlier steps. Running it produces a text file formatted specifically for iTOL (Interactive Tree of Life), which Ryan McLaughlin recommended to use for visualization of our results.
Example command:
treesapp color \ -r <directory_path>/refpkg.pkl
<b>r</b> specifies the reference package .pkl file as input.
The output is a color file that maps each sequence to its corresponding taxonomic group.
Visualizing in iTOL
Once the color annotation file was generated, we moved to iTOL for visualization ([14]) :
Uploaded the NEWICK tree created by treesapp create.
Added the color file from treesapp color to link taxonomy-based colors to each branch.
Explored the tree interactively, using iTOL’s interface to:
Identify clusters of closely related alpha-CA sequences.
Compare diversity across major bacterial groups.
Visually confirm that sequences were correctly classified and placed.
The reference package was built using treesapp create and annotated with treesapp colour. The resulting Newick tree was imported into iTOL<b> </b>for visualization. Branches are coloured by taxonomic class (e.g., Bacilli, Alphaproteobacteria, Betaproteobacteria, Cyanophyceae). The tree highlights the wide phylogenetic diversity of COG3338 sequences across bacterial and archaeal groups. While the dataset was still unfiltered at this stage, the figure demonstrates successful workflow execution and provides an overview of the taxonomic breadth of alpha-CAs within EggNOG.
Metadata Integration
A central innovation of this project was extending TreeSAPP reference packages with biochemical and genomic metadata, allowing us to link sequence placement on the phylogenetic tree with enzyme function and ecological context. By attributing functional data, such as catalytic efficiency, to specific branches and clades, we can identify the most promising CA sequences for experimental testing. This direct connection between computational predictions and wet lab experimentation enables a data-driven approach for selecting candidate enzymes, ensuring that chosen CAs not only fit evolutionary patterns but are also optimized for activity in processes like microbially induced calcium carbonate precipitation.
BRENDA Integration
For the alpha-CA reference packages, we enriched each sequence entry with biochemical metadata from the BRENDA enzyme database, providing functional context alongside evolutionary information ([15]).
Parameters Extracted
Catalytic constant:
Km — substrate affinity value.
Environmental sensitivities:
pH optimum — the ideal pH for maximum enzyme activity.
pH range — the span of pH values where the enzyme remains active.
Temperature optimum — the ideal temperature for maximum enzyme activity.
Temperature range — the span of temperatures where the enzyme remains active.
Below are background information retrieved for carbonic anhydrase (EC 4.2.1.1) compiled from the BRENDA enzyme database. These values represent measurements across multiple isoforms and organisms.
Figure 3. Km values for carbonic anhydrase (EC 4.2.1.1). Figure 4. pH optima of carbonic anhydrase (EC 4.2.1.1).
Figure 5. Temperature optima of carbonic anhydrase (EC 4.2.1.1). Figure 6. Functional pH ranges of carbonic anhydrase (EC 4.2.1.1). Figure 7. Functional temperature ranges of carbonic anhydrase (EC 4.2.1.1).
Mapping Strategy
To connect BRENDA biochemical data with our EggNOG-derived alpha-CA sequences, we relied on a multi-step mapping process centered on taxonomic information and protein identifiers.
EggNOG headers follow the format: >taxonomy_id.locus_id
taxonomy_id identifies the organism of origin.
locus_id indicates the specific location of the gene/protein within that organism’s genome. This structure allowed us to trace each sequence back to its source organism.
Primary mapping via UniProt taxonomy IDs:
Each alpha-CA entry in BRENDA is linked to a UniProt accession and annotated with a UniProt taxonomy ID.
We used these taxonomy IDs to match BRENDA data with EggNOG sequences that shared the same organism identifier.
Retrieving UniProt accessions:
When direct accession matches were missing, we queried UniProt using the enzyme’s EC number (4.2.1.1, the canonical code for carbonic anhydrase) combined with the organism name: ec:4.2.1.1 AND (organism_name:"\{organism\}")
This search provided a corresponding UniProt accession even when only the scientific name was known.
Fallback mapping using EC numbers alone:
In cases where taxonomy-based mapping was incomplete, we attempted to link sequences directly through the EC number, though this approach provided less precise coverage.
Challenges and Outcomes
Our integration of BRENDA data into the EggNOG-derived alpha-CA reference package faced several hurdles:
Lack of direct UniProt accessions: BRENDA does not consistently provide UniProt accessions for its entries, making it difficult to match biochemical data directly to sequences.
Incomplete biochemical characterization: Many alpha-CA enzymes remain poorly characterized. As a result, numerous organisms in EggNOG lack corresponding BRENDA data for key parameters such as Km, pH optima, and temperature ranges.
Need for broader characterization: These gaps highlight the early stage of alpha-CA research, suggesting that more experimental biochemical studies are needed to better understand enzyme diversity and function across taxa.
From this integration, the top three candidates identified from BRENDA were:
Helicobacter pylori CA
Mesorhizobium loti CA
Escherichia coli CA Notably, Helicobacter pylori CA was already one of the wet lab’s chosen candidates, independently validating their search strategy. The other two enzymes also fell within the same taxonomic order as the wet lab’s selected targets based on TreeSAPP placement, showing strong alignment between the bioinformatics pipeline and wet lab prioritization.
GTDB Integration
To ensure our alpha-CA reference packages were grounded in a phylogenetically consistent and reliable bacterial taxonomy, we integrated data from the Genome Taxonomy Database (GTDB). GTDB is widely regarded as the most comprehensive and standardized bacterial taxonomy resource. Unlike traditional databases, it normalizes taxonomic classifications using genome-wide data and provides consistent hierarchical ranks (e.g., species, genus, family) ([16]).
With GTDB we ensured that all our sequence placements and groupings were based on the most accurate and up-to-date bacterial taxonomy available. This was especially important because:
Many public sequence databases have outdated or inconsistent taxonomy labels.
GTDB provides direct mapping via NCBI accessions, enabling smooth integration with our EggNOG-derived sequence headers.
It allowed us to confidently cluster alpha-CAs by their true evolutionary relationships rather than relying on legacy or fragmented taxonomy systems. In practice, GTDB served as our “truth set” for bacterial taxonomy, ensuring that downstream analysis and visualization reflected biologically accurate relationships and improving the reliability of our candidate selection process.
Queried Annotree for genomes containing the alpha-CA KEGG Orthology (KO) term.
Exported NCBI accession IDs and basic taxonomic data for these genomes.
GTDB Metadata Integration
For each genome, retrieved rich metadata from GTDB, including:
Field
Purpose
accession
Unique identifier for each genome in GTDB/NCBI, used for tracking and cross-referencing.
taxonomy
Standardized GTDB taxonomic classification, providing phylogenetic context for each sequence.
uniprot_accessions
Connects genome proteins to UniProt records for biochemical and functional annotation.
uniprot_tax_ids
UniProt taxonomy IDs used to link external resources like BRENDA and SABIO-RK.
uniprot_sequence
Protein sequence data serving as the direct input for alpha-CA reference package construction.
Final Subset Creation
Filtered and combined all metadata into a single table (ca_with_taxonomy.csv).
Final dataset contained 42 high-quality alpha-CA genomes ready for downstream TreeSAPP analysis.
Figure 8. GTDB/NCBI filtering workflow for α-CA sequences. Candidate sequences identified with AnnoTree were retained after filtering. During mapping to GTDB/NCBI taxonomy, a subset of sequences could not be assigned, leaving a curated set of annotated alpha-CAs for downstream analysis.
Data Extracted
NCBI Taxonomy Mappings Provided a direct bridge between GTDB classifications and standard NCBI taxonomy IDs. This ensured compatibility with other resources like UniProt and BRENDA.
Sequences for COG3338 Construction The dereplicated, high-quality GTDB genomes supplied the protein sequences used to build our COG3338 alpha-CA reference package.
Functional Metadata Integration While GTDB itself does not contain UniProt data, both NCBI and UniProt rely on the same taxonomy IDs. This allowed us to link sequences from GTDB genomes to UniProt identifiers, enabling functional and biochemical annotation through external databases such as BRENDA
Mapping Strategy
Each GTDB genome was matched to its NCBI taxonomy ID. Because NCBI and UniProt share the same taxonomy ID system, these identifiers provided a consistent way to cross-reference entries across databases.
Using these TaxIDs, we mapped genomes to their NCBI protein accessions, ensuring we could retrieve the corresponding sequences.
These accessions were then cross-referenced to UniProt IDs, creating a functional bridge.
NCBI accessions → protein sequences for phylogenetics.
UniProt accessions → external functional data such as Km, pH, temperature, and stability.
This unified dataset allowed us to integrate phylogenetic structure with biochemical and ecological context, forming the foundation for candidate selection and wet lab prioritization.
Advantages of GTDB Integration
Incorporating GTDB data into our workflow provided several key benefits for building a reliable and functional alpha-carbonic anhydrase reference package.
Standardized Taxonomy Across Reference Packages GTDB offers a consistent, genome-based taxonomy that eliminates ambiguities found in traditional NCBI naming. This ensured that our TreeSAPP placements were based on a “truth set” of bacterial relationships, improving the accuracy of clustering and interpretation.
Functional Markers for Tree Interpretation While not every alpha-CA genome in GTDB has detailed functional or biochemical data, we integrated those that did into our reference package as “anchor points.”
For example, alpha-CAs with known catalytic efficiency values from BRENDA serve as functional markers in the tree.
TreeSAPP then places non-annotated sequences relative 1to these markers, allowing us to infer potential properties of uncharacterized enzymes.
Bridge Between GTDB, NCBI, and UniProt Because GTDB accessions map cleanly to NCBI taxonomy IDs, we could link each genome to external datasets such as:
UniProt for protein sequences and annotations
BRENDA for biochemical
Evaluation
Purpose:Clade exclusion analysis simulates situations where the reference database lacks certain taxa. It evaluates how accurately TreeSAPP can classify sequences under such conditions, where our refpkg could be used for placement/analysis of metagenomic data. This is done to ensure the robustness and accuracy of the TreeSAPP method and the principle behind this analysis involves:
Intentionally removing a specific clade (an ancestor and all of its descendants) from the phylogenetic tree
Reinserting the clade based on TreeSAPP method to see if the clade was correctly inserted into the same phylogenetic relationships.
Commands:treesapp evaluate performs the analysis using a FASTA/FASTQ input and a TreeSAPP reference package. Example syntax:
Example:
treesapp evaluate -i input.fasta -r refpkg.pkl --taxonomic_ranks class genus species
i: Input sequence file (FASTA/FASTQ), which are all of the inputs from our refpkg
r: Reference package path (.pkl).
-taxonomic_ranks: Taxonomic levels to evaluate (default: class, species).
l: Trim sequences to simulate fragments (e.g., 150bp).
-fresh: Rebuild phylogenetic tree (slow; not usually needed).
a: Provide precomputed accession-to-lineage mapping to save time.
-tool: Choose classifier (default is TreeSAPP).
Workflow:
Lineage Extraction: Determines taxonomy for input sequences.
Selection: Identifies clades suitable for evaluation (those with multiple sub-taxa).
Clade Removal & Classification:
Remove each taxon from the reference package.
Iterate over each query sequence that fall under this taxon and remove it.
Restore and repeat for next clade.
Distance Analysis: Compares predicted taxonomic labels to the existing placement within the refpkg, measuring how close the classifications are. Outputs:
Located in final_outputs/:
accuracy.tsv:
Percentage of sequences that were correctly classified at each taxonomic rank
Provides a standardized metric of accuracy at different taxonomic levels
clade_exclusion_performance.tsv:
A summary of how many sequences were classified within a certain taxonomic distance from the correct classification
measures from the true classification in the taxonomy hierarchy
Includes correctly assigned within acceptable taxonomic distance, over classified, and under-classified.
Lets us interpret the precision and errors within the classification.
taxonomic_recall.tsv: Recall per taxon.
Step-by-step Experimental Procedure
1. Prepare Input Files
Ensure existence of:
queries.fasta or queries.fastq
refpkg.pkl
2. Create Output Directory
mkdir -p clade_exclusion_output
3. Run Basic Evaluation4. Evaluate Fragment Classification Accuracy5. Use Lineage Mapping File6. Extend to Additional Ranks
treesapp evaluate \ -i queries.fasta \ -r refpkg.pkl \ -o clade_exclusion_output \ --taxonomic_ranks class species \ -a accession2lin.tsv \ -l 150 \ --taxonomic_ranks genus order \ -n 4
The obtained results and analysis will be shared in the following section.
Julia Anstett is enrolled in the Genome Sciences and Technology program as a PhD candidate. Her work focuses on metagenomics, single-cell genomics, and genome quality in the context of the microbial ecology of anoxic marine settings. Julia helped us understand the functionality of TreeSAPP evaluate and how to interpret the output results from evaluate. She also guided us on further implementation of TreeSAPP where inaccurate nodes could be removed for a better representation of the phylogenetic tree.
Julia Anstett
PhD Candidate
Evaluate Results
Accuracy.tsv file:
A drop in placement accuracy is observed at higher resolution with the lowest being at the genus level.
There is more diversity and finer distinctions at lower taxonomic levels
Query
Tool
Rank
Score
Error
aCA
treesapp
domain
100.0
0.0
aCA
treesapp
phylum
100.0
0.0
aCA
treesapp
class
98.8
0.0
aCA
treesapp
order
98.0
0.0
aCA
treesapp
family
94.9
0.0
aCA
treesapp
genus
91.0
0.0
Score: Percentage of sequences that were correctly classified at this rank in %.
clade_exclusion_performance.tsv:
This data summarizes the taxonomic classification performance of TreeSAPP at class and species levels using clade exclusion analysis, measured by taxonomic distance (TaxDist).
TaxDist (Taxonomic Distance): Number of taxonomic ranks between the correct classification and the predicted one. 0means an exact match; higher numbers indicate increasing distance from the correct rank.
Queries: Total number of sequences evaluated at this taxonomic rank. This number stays the same across all rows since all sequences are tested at each distance level.
Correct: Number of sequences correctly classified at this specific TaxDist level. For example, at TaxDist 0, it’s the count of exact matches; at TaxDist 1, it’s the number of predictions one rank away from being correct.
Cumulative: Running total of correctly classified sequences within the current and all previous TaxDist levels. It shows how many predictions were at least this close or better to the correct answer.
Over (Over-classified): Number of sequences classified too specifically---TreeSAPP predicted a more detailed taxonomic rank than the data supported.
Under (Under-classified): Number of sequences classified too generally---TreeSAPP only predicted a broader category when a more specific classification was possible.
Class-level exclusion
TaxDist = 0—2 (very close relatives removed): Accuracy is still high, refpkg can classify sequences at the class level even without immediate relatives.
TaxDist = 3—4 (more distant relatives removed): Noticeable drop in correct classification, but still some successful assignments.
TaxDist ≥ 5: Accuracy drops sharply, sometimes near zero, meaning the reference can’t reliably classify sequences when only very distant relatives are left.
Species-level exclusion
TaxDist = 0 (species present in reference): Nearly perfect classification --- sequences are correctly identified when the exact species exists in the refpkg.
TaxDist = 1—2 (close relatives missing): Accuracy declines, but still decent if genus-level relatives remain.
TaxDist ≥ 3: Accuracy drops rapidly --- without close relatives, the refpkg struggles to correctly classify at species level.
Class-level performance
TaxDist
Queries
Correct
Cumulative
Over
Under
0
806
20
20
336
105
1
806
124
144
317
0
2
806
20
164
297
0
3
806
28
192
269
0
4
806
134
326
135
0
5
806
135
461
0
0
6
806
0
461
0
0
7
806
0
461
0
0
Species-level performance
TaxDist
Queries
Correct
Cumulative
Over
Under
0
697
513
513
0
130
1
697
72
585
0
58
2
697
25
610
0
33
3
697
20
630
0
13
4
697
5
635
0
8
5
697
8
643
0
0
6
697
0
643
0
0
7
697
0
643
0
0
representative_taxa_sequences.fasta:
shows a few high quality sequences in order to reduce redundancy within original fasta file which best represents each taxonomic group.
The data given in this section was no used in significance for further analysis
taxonomic_recall.tsv
awk -F'\t' '$3 < 0.7' taxonomic_recall.tsv
We ran the code above to find low-recall clades with a threshold of 70% and found that some species had 0% recall including:
Class-level (0% recall):
1 class level was found in this category: c__Magnetococcia
Species-level (0% recall):
40 species were found in this category including S_Nocardioides sp. JS614, s__Nocardioides szechwanensis, s__Actinokineospora iranica, s__Lentzea albida, etc. This indicates that these taxes are highly misrepresented or missing, and were removed for future TreeSAPP runs.
Conclusions + future directions
Throughout our workflow, we successfully completed an end-to-end TreeSAPP workflow to analyze the evolutionary relationship of carbonic anhydrases relevant to MCIP. This process included constructing a phylogenetic tree from a curated dataset from EGGNOGG, and layering two external metadata sources: BRENDA and GTDB, one to capture enzymatic function and another to refine taxonomic classification. This allowed us to not only place each sequence based on their evolutionary placement but also the biochemical potential of the species - providing us a high-resolution view of the carbonic anhydrase family.
One of the most interesting findings from the analysis was that annotations from BRENDA independently identified H. pylori as a CA source which mirrors the findings of wet lab. The convergence in results between bioinformatics and experimental selection further validates our computational pipeline. Using the results we have gathered from TreeSAPP, we were able to provide wet lab with a ranked list of functionally diverse CA sequences, including activity, environmental resilience and novelty. We also paid close attention to a thermostable and high activity enzyme: the SazCA from S. azorense. To further investigate this variant, we have identified close homologs to SazCA within the same taxonomic order of Aquificales by locating it within the tree. We then identified related thermophilic CAs and expanded the viable candidates to test for Martian-like conditions.
Looking forward, the pipeline we have created allows for additional layers of metadata sources including structural models, kinetic data, molecular dynamics, and further prioritization of enzyme candidates. This will also allow future researchers to expand the search space by incorporating datasets with extreme environments and potentially reveal uncharacterized CAs that are more desirable for those conditions. Finally, we could also broaden our analysis to further CA families beyond just alpha.
Overall, our integrative approach has not only provided crucial data for wet lab, but have also established a reusable pipeline for future protein discovery and analysis.
1. Yang Z, Rannala B. Molecular phylogenetics: Principles and practice. Nat Rev Genet [Internet]. 2012 May [cited 2025 Sept 25];13(5):303—14. Available from: https://www.nature.com/articles/nrg3186
2. Fackrell LE, Schroeder PA, Thompson A, Stockstill-Cahill K, Hibbitts CA. Development of martian regolith and bedrock simulants: Potential and limitations of martian regolith as an in-situ resource. Icarus [Internet]. 2021 Jan 15 [cited 2025 Sept 25];354:114055. Available from: https://www.sciencedirect.com/science/article/pii/S0019103520304061
4. Hewett-Emmett D, Tashian RE. Functional Diversity, Conservation, and Convergence in the Evolution of the α-, β-, and γ-Carbonic Anhydrase Gene Families. Molecular Phylogenetics and Evolution [Internet]. 1996 Feb 1 [cited 2025 Sept 25];5(1):50—77. Available from: https://www.sciencedirect.com/science/article/pii/S1055790396900068
5. Supuran CT, Capasso C. An Overview of the Bacterial Carbonic Anhydrases. Metabolites [Internet]. 2017 Nov 11 [cited 2025 Sept 25];7(4):56. Available from: https://pmc.ncbi.nlm.nih.gov/articles/PMC5746736/
6. Cantalapiedra CP, Hernández-Plaza A, Letunic I, Bork P, Huerta-Cepas J. eggNOG-mapper v2: Functional Annotation, Orthology Assignments, and Domain Prediction at the Metagenomic Scale. Mol Biol Evol [Internet]. 2021 Dec 9;38(12):5825—9. Available from: https://www.ncbi.nlm.nih.gov/pubmed/34597405
7. Hernández-Plaza A, Szklarczyk D, Botas J, Cantalapiedra CP, Giner-Lamia J, Mende DR, et al. eggNOG 6.0: Enabling comparative genomics across 12 535 organisms. Nucleic Acids Res [Internet]. 2023 Jan 6;51(D1):D389—94. Available from: https://www.ncbi.nlm.nih.gov/pubmed/36399505
11. Waheed A, Pham T, Won M, Okuyama T, Sly WS. Human carbonic anhydrase IV: In vitro activation and purification of disulfide-bonded enzyme following expression in Escherichia coli. Protein Expr Purif [Internet]. 1997 Mar;9(2):279—87. Available from: https://www.ncbi.nlm.nih.gov/pubmed/9056493
14. Letunic I, Bork P. Interactive Tree Of Life (iTOL) v5: An online tool for phylogenetic tree display and annotation. Nucleic Acids Res [Internet]. 2021 July 2 [cited 2025 Sept 25];49(W1):W293—6. Available from: https://doi.org/10.1093/nar/gkab301
15. Schomburg I, Chang A, Schomburg D. BRENDA, enzyme data and metabolic information. Nucleic Acids Res [Internet]. 2002 Jan 1 [cited 2025 Sept 25];30(1):47—9. Available from: https://pmc.ncbi.nlm.nih.gov/articles/PMC99121/
16. Parks DH, Chuvochina M, Rinke C, Mussig AJ, Chaumeil PA, Hugenholtz P. GTDB: An ongoing census of bacterial and archaeal diversity through a phylogenetically consistent, rank normalized and complete genome-based taxonomy. Nucleic Acids Res [Internet]. 2022 Jan 7 [cited 2025 Sept 25];50(D1):D785—94. Available from: https://doi.org/10.1093/nar/gkab776