Bioinformatics

To support our team’s wet lab, we investigated the evolutionary landscape and biochemical properties of carbonic anhydrases (CAs), the key enzymes in microbially induced calcium carbonate precipitation. Our goal was to identify and characterize a diverse set of CA variants in order to select the best candidate for extraterrestrial conditions. We conducted a phylogenetic analysis of the alpha family using TreeSAPP and traced sequence motifs and functional divergence across variants with high catalytic performance. This integrative approach allowed us to provide the wet lab with crucial information for further testing.

Introduction

What is Phylogenetics?

Phylogenetics is the study of evolutionary relationships among organisms or genes using molecular sequence data. By constructing phylogenetic trees, researchers can visualize how sequences diverged from common ancestors and identify patterns of conservation and variation ([1]). This analysis provides insights into evolutionary history and potential functional similarities between proteins, guiding the selection of promising targets for applied research.

Role of Carbonic Anhydrase (CA) in Our Project

Our project explores sustainable building materials through microbially induced calcium carbonate precipitation (MICP). In this process, microbes facilitate the conversion of carbon dioxide into bicarbonate, which then combines with calcium to form calcium carbonate. The resulting mineral acts as a natural cement that can bind loose particles of sand, soil, or regolith into a solid, concrete-like material suitable for construction.

Carbonic anhydrase (CA) plays a key role by accelerating the hydration of carbon dioxide, thereby increasing the availability of carbonate ions that drive calcium carbonate formation. This makes the MICP process more efficient and scalable for diverse environments. On Earth, CA-enhanced MICP has potential for green construction, soil stabilization, and carbon sequestration. In space, the same principle could be applied to transform extraterrestrial regolith ([2]) into strong, durable building material, enabling infrastructure development without relying on costly material transport. By bridging Earth-based sustainability with space applications, CA provides a versatile biocatalyst for advancing resilient and eco-friendly construction.

By accelerating the conversion of carbon dioxide into bicarbonate, CA increases the availability of carbonate ions that combine with calcium to form calcium carbonate. This step is essential for efficiently turning loose regolith into strong, durable building material.

Why Perform a Phylogenetic Analysis?

To optimize our bioengineering approach, we are conducting a phylogenetic analysis of carbonic anhydrase sequences. This analysis allows us to strategically select enzyme candidates for experimental testing by first narrowing the field through in-silico comparisons. In doing so, we can better understand CA diversity and functionality, enabling more informed choices and reducing the time and resources required in the lab. Our goals are twofold:

  1. Understand evolutionary diversity: By examining the evolutionary history of CA enzymes, we can identify conserved motifs and functional differences across species. This knowledge highlights structural or activity-related features that may enhance MICP performance under Martian conditions.

  2. Support wet lab candidate selection:

Background & Research: Evolution of Carbonic Anhydrases

Carbonic anhydrases (CAs) are a diverse group of metalloenzymes that catalyze the reversible hydration of carbon dioxide ([3]):

$$\mathrm{CO_2 + H_2O \leftrightarrow HCO_3^- + H^+}$$

Although all carbonic anhydrases (CAs) catalyze the same reaction, they are divided into at least five structurally unrelated families:

$$\alpha, \beta, \gamma, \delta, \zeta$$

These families share no significant amino acid sequence similarity, reflecting convergent evolution, where different protein scaffolds independently evolved to solve the same biochemical problem of carbon dioxide conversion ([4]). Within each family, sequences are similar and phylogenetically related, forming distinct clusters with conserved motifs and similar folds. However, between families, there is no detectable similarity, even though all catalyze the same reaction ([5]). For our project, this means our phylogenetic analysis will focus on identifying clusters within a single family most relevant to microbially induced calcium carbonate precipitation (MICP), guiding the wet lab toward selecting promising CA variants for efficient Martian regolith solidification.

Families of Carbonic Anhydrases

Evolutionary Relationships

Carbonic anhydrases (CAs) show two key evolutionary patterns: convergent and divergent evolution.

EggNOG and Data Input

Why EggNOG?

EggNOG is a hierarchical database of orthologous groups (OGs) built from genomic and metagenomic data ([6]). Its value lies in orthology inference: grouping genes that derive from a common ancestor while separating paralogs within species. For carbonic anhydrases (CAs), this is essential because:

Thus, EggNOG provided the foundational dataset for constructing TreeSAPP reference packages and connecting CA diversity to phylogenetic and biochemical metadata.

Input from EggNOG

We extracted sequences and orthologous group information for carbonic anhydrases from EggNOG v6:

| OG ID | Taxonomic Distribution | Associated Primary Carbonic Anhydrase Class | Sequences | Species Count | Functional Annotation |
| --- | --- | --- | --- | --- | --- |
| COG3338 | Primarily Bacteria | alpha-class (97.2%) | 1392 | 1237 | Catalysis of carbon dioxide to bicarbonate |
| KOG0382 | Eukaryotes | beta-class (69.5%) | 11927 | 1268 | Catalysis of carbon dioxide to bicarbonate |
| KOG0789 | Eukaryotes | gamma-class (6.5%) | 10557 | 1243 | Protein tyrosine phosphatase, catalytic domain |

Classification and sequence counts for the three primary groups descending from LCOG3338 in EggNOG v6. LCOG3338 represents the ancestral orthologous group that gave rise to modern carbonic anhydrase proteins and serves as a starting point for tracing the evolutionary history and functional diversity of these enzymes across bacteria and eukaryotes.

EggNOG does not currently provide a public API for bulk retrieval, so we manually downloaded the required files for each orthologous group (OG) from the EggNOG website ([8]). These downloads contain precomputed datasets for each OG and serve as the starting point for our phylogenetic pipeline.

For each OG, the following key files were obtained:

  1. FASTA files of protein sequences: contain all member sequences assigned to the OG, representing the diversity of that protein family.
  2. Lineage tables: map EggNOG IDs to their respective taxonomic assignments (e.g., species, genus, family), enabling downstream tree annotation.
  3. Functional annotation tables: link each sequence to standardized functional labels such as carbonic anhydrase activity (EC 4.2.1.1) or other predicted roles.

Preparation of Sequence Data

The downloaded FASTA files initially contained gapped multiple sequence alignments (MSAs), which are not compatible with TreeSAPP reference package creation. To prepare the data:

maestro integration

This shell scripting workflow was eventually replaced by a more reliable, reproducible maestro workflow. This workflow provides APIs for clustering, ungapping MSAs, and extracting OGs from EggNOG database files. See the maestro documentation here or the repository here for more information.
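
The ungapping step mentioned above can be illustrated with a minimal Python sketch. This is not the maestro API; the function names, the gap characters ("-" and "."), and the example sequences are assumptions made for illustration only.

```python
# Minimal sketch: strip alignment gaps from FASTA records so that the
# sequences are unaligned. Names and gap symbols here are assumptions.

GAP_CHARS = "-."  # assumed gap symbols used in the aligned FASTA


def read_fasta(text):
    """Parse FASTA-formatted text into a list of (header, sequence) tuples."""
    records, header, chunks = [], None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def ungap(records):
    """Remove gap characters; drop records that become empty."""
    table = str.maketrans("", "", GAP_CHARS)
    cleaned = []
    for header, seq in records:
        seq = seq.translate(table)
        if seq:
            cleaned.append((header, seq))
    return cleaned


# Made-up example alignment (not real CA sequences).
example = ">seq1\nMK--TL\n-AV.\n>seq2\n----\n"
for header, seq in ungap(read_fasta(example)):
    print(f">{header}\n{seq}")
```

Dropping records that consist only of gaps is a design choice of this sketch; a real pipeline might prefer to log them instead.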

Input to TreeSAPP

Once cleaned and organized, these datasets served as raw inputs for the treesapp create workflow, which builds reference packages containing both:

Isoelectric point

The isoelectric point (pI) is the pH at which the net charge of a protein is zero. That is, the positive and negative charges of the different side chains and termini balance each other out. To find an optimal pH, we must consider the charges present on the molecule, as these determine its physical and biochemical properties. These include, but are not limited to:

As seen above, selecting the right pH based on pI and optimal protein efficiency is critical to CA functionality. We therefore wrote a script that uses the IsoelectricPoint class from the Biopython library to determine the pI of every sequence in a FASTA file. This information will be used to annotate the reference package, giving the wet lab additional information for selecting CA candidates.

This is a short summary of how the isoelectric point is determined theoretically:

  1. Identify all ionizable groups.
  2. Assign pKa values from the literature to every ionizable group.
  3. Compute the fraction protonated for each group using the Henderson-Hasselbalch equation (the probability of ionization based on pKa and pH) (Isoelectric Point Calculator).
  4. Sum the total positive and negative charges and find the pH at which they balance to give a net charge of zero.
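
The four steps above can be sketched in plain Python. This is a simplified illustration, not our Biopython-based script: the pKa values below are one illustrative set (published scales differ), and the function names are our own assumptions.

```python
# Sketch of a theoretical pI calculation via the Henderson-Hasselbalch
# equation and bisection. pKa values are illustrative assumptions.

POSITIVE_PKA = {"K": 10.8, "R": 12.5, "H": 6.5}           # basic side chains
NEGATIVE_PKA = {"D": 3.9, "E": 4.1, "C": 8.5, "Y": 10.1}  # acidic side chains
N_TERM_PKA = 8.6
C_TERM_PKA = 3.6


def net_charge(sequence, ph):
    """Net charge of a peptide at a given pH."""
    # Basic groups carry +1 when protonated: fraction = 1 / (1 + 10^(pH - pKa)).
    positive = 1.0 / (1.0 + 10 ** (ph - N_TERM_PKA))
    # Acidic groups carry -1 when deprotonated: fraction = 1 / (1 + 10^(pKa - pH)).
    negative = 1.0 / (1.0 + 10 ** (C_TERM_PKA - ph))
    for residue in sequence.upper():
        if residue in POSITIVE_PKA:
            positive += 1.0 / (1.0 + 10 ** (ph - POSITIVE_PKA[residue]))
        elif residue in NEGATIVE_PKA:
            negative += 1.0 / (1.0 + 10 ** (NEGATIVE_PKA[residue] - ph))
    return positive - negative


def isoelectric_point(sequence, tolerance=1e-4):
    """Bisect over pH 0-14 for the point where the net charge crosses zero."""
    low, high = 0.0, 14.0
    while high - low > tolerance:
        mid = (low + high) / 2.0
        if net_charge(sequence, mid) > 0:
            low = mid   # still positively charged: pI lies at higher pH
        else:
            high = mid  # negatively charged (or neutral): pI lies lower
    return (low + high) / 2.0


# Made-up peptides, used only to show the trend.
for peptide in ("GGGG", "DDDD", "KKKK"):
    print(peptide, round(isoelectric_point(peptide), 2))
```

Bisection works here because the net charge decreases monotonically as pH increases, so there is a single zero crossing between pH 0 and 14.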

Methods

TreeSAPP Framework

We adopted TreeSAPP (Tree-based Sequence Alignment and Phylogenetic Placement) as the core computational framework for building and testing reference packages (RefPkgs) of carbonic anhydrases (CAs).

TreeSAPP is specifically designed for phylogenetic classification of functional genes in genomic and metagenomic datasets. Its strength lies in automating a reproducible workflow that integrates sequence clustering, multiple sequence alignment, HMM construction, phylogenetic inference, and taxonomic mapping into a unified package ([9]).

Each TreeSAPP Reference Package (RefPkg) is a portable data object that consolidates all information needed to classify and analyze a protein family, in this case carbonic anhydrases. These packages are central to both phylogenetic tree construction and functional annotation.

A RefPkg internally contains the following data:

  1. FASTA: a multiple sequence alignment (MSA) of reference sequences used to infer relationships between homologs.
  2. HMM profile: a hidden Markov model (HMM) that provides a probabilistic signature for sensitive detection of related sequences in new datasets.
  3. Phylogenetic tree (NEWICK format): a tree structure that captures the evolutionary relationships among sequences.
  4. Accession-to-lineage map: links each sequence ID to its corresponding taxonomic information, enabling precise annotation.

The treesapp package view command is used to inspect and extract attributes of a reference package, as shown in [10]. By default, it prints details directly to the console (standard out), which makes it easy to pipe outputs into other UNIX tools like awk or sed for downstream processing.

Internally, the core reference package is stored as a Python pickle file (.pkl), which contains the full dataset. Other human-readable files (e.g., FASTA, NEWICK tree, lineage tables) are generated by the TreeSAPP workflows for convenience. These can usually be regenerated directly from the .pkl file using treesapp package view if needed.

Iterative Refinement and Metadata Layering

TreeSAPP includes tools for evaluation and classification, which support iterative improvements to a reference package. This includes:

  1. Constructing reference packages from curated EggNOG data.
  2. Assessing their accuracy and sensitivity, ensuring reliable downstream classification of carbonic anhydrase sequences.

Input Sequences from EggNOG

We extracted protein sequences for carbonic anhydrases (CAs) from EggNOG v6, initially considering three major orthologous groups (OGs):

From a biological standpoint, it is possible to express eukaryotic CA enzymes in bacterial hosts, as was done in [11], because these proteins generally have minimal post-translational modification requirements and can often fold and function correctly in bacteria such as E. coli. This approach has been used in other studies for experimental characterization.

However, for this phase of the project, we restricted our focus to COG3338 (bacterial α-CAs) for two reasons:

  1. Practicality: The smaller dataset (1,392 sequences before filtering) allowed for efficient phylogenetic analysis and TreeSAPP reference package construction.
  2. Relevance: Since our wet lab uses bacterial hosts for MICP experiments, starting with naturally bacterial alpha-CAs reduces uncertainty about expression and compatibility. In addition, alpha-CAs are generally monomeric, which makes them more amenable to surface display strategies compared to multimeric forms.

Reference Package Construction

Sequence Filtering

The raw EggNOG v6 COG3338 dataset contained 1,392 bacterial alpha-CA sequences. To ensure that only high-quality sequences were included in downstream analyses, several filtering steps were applied:

  1. Mapping to NCBI accessions:
  2. Removal of invalid or low-quality entries:
  3. Final high-quality dataset:
The initial EggNOG dataset contained 1,392 sequences. Filtering removed 375 entries with invalid accession IDs, 14 with missing taxonomic assignments, and 9 duplicates. After these steps, 996 curated sequences remained. This process ensured that only high-quality, taxonomically valid sequences were retained for TreeSAPP reference package construction, improving both runtime efficiency and classification accuracy.
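
The filtering logic described above can be sketched as follows. This is an illustration under stated assumptions rather than our actual script: the record fields (accession, lineage, sequence) are hypothetical, an invalid accession is modeled simply as a missing value, and duplicates are detected by identical sequence.

```python
# Sketch of the three filters: invalid accession, missing taxonomy, duplicates.
# Field names and the validity checks are hypothetical simplifications.

def filter_records(records):
    """Return (kept_records, dropped_counts) after applying the three filters."""
    kept = []
    seen_sequences = set()
    dropped = {"invalid_accession": 0, "missing_taxonomy": 0, "duplicate": 0}
    for record in records:
        if not record.get("accession"):
            dropped["invalid_accession"] += 1
        elif not record.get("lineage"):
            dropped["missing_taxonomy"] += 1
        elif record["sequence"] in seen_sequences:
            dropped["duplicate"] += 1
        else:
            seen_sequences.add(record["sequence"])
            kept.append(record)
    return kept, dropped


# Made-up records for demonstration only.
demo = [
    {"accession": "ACC_1", "lineage": "d__Bacteria", "sequence": "MKT"},
    {"accession": None, "lineage": "d__Bacteria", "sequence": "MKA"},
    {"accession": "ACC_3", "lineage": "", "sequence": "MKV"},
    {"accession": "ACC_4", "lineage": "d__Bacteria", "sequence": "MKT"},
]
kept, dropped = filter_records(demo)
print(len(kept), dropped)
```

Returning the per-filter counts alongside the kept records makes it straightforward to report how many entries each step removed.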

Reference Package Construction with treesapp create

The command treesapp create is used to generate a TreeSAPP reference package (RefPkg) from a set of input protein or nucleotide sequences ([12]). A RefPkg is a portable collection of curated files that capture phylogenetic relationships, functional signatures, and taxonomic information, which are later used for classification and placement of query sequences.

We reached out to Ryan McLaughlin for technical guidance on TreeSAPP. He explained how to construct a bacterial alpha-carbonic anhydrase (COG3338) reference package with treesapp create, providing a non-redundant and quality-controlled dataset for downstream TreeSAPP workflows. He also clarified how to generate high-quality reference phylogenies and visualize clade-specific features in a way that would support downstream interpretation. Additionally, Ryan assisted us with metadata integration, showing how to link functional and taxonomic information into the phylogenetic framework.

Ryan McLaughlin

PhD Candidate

$ treesapp create \
  -i COG3338.fasta \
  --refpkg_name aCA \
  --fast \
  --num_procs 16 \
  --output aCA_TreeSAP_create

Flags

Tree Visualization with treesapp colour and iTOL

After building our reference package with treesapp create, we used treesapp colour to generate a taxonomic color annotation file. This tool automatically assigns colors to branches in the phylogenetic tree based on taxonomy, making it easier to visually interpret evolutionary patterns ([13]).

The command works by reading the reference package .pkl file, which contains the taxonomic and functional data produced during the earlier steps. Running it produces a text file formatted specifically for iTOL (Interactive Tree of Life), which Ryan McLaughlin recommended for visualizing our results.

Example command:

treesapp colour \
  -r <directory_path>/refpkg.pkl

Visualizing in iTOL

Once the color annotation file was generated, we moved to iTOL for visualization ([14]) :

  1. Uploaded the NEWICK tree created by treesapp create.
  2. Added the color file from treesapp colour to link taxonomy-based colors to each branch.
  3. Explored the tree interactively, using iTOL’s interface to:
The reference package was built using treesapp create and annotated with treesapp colour. The resulting Newick tree was imported into iTOL for visualization. Branches are coloured by taxonomic class (e.g., Bacilli, Alphaproteobacteria, Betaproteobacteria, Cyanophyceae). The tree highlights the wide phylogenetic diversity of COG3338 sequences across bacterial and archaeal groups. While the dataset was still unfiltered at this stage, the figure demonstrates successful workflow execution and provides an overview of the taxonomic breadth of alpha-CAs within EggNOG.

Metadata Integration

A central innovation of this project was extending TreeSAPP reference packages with biochemical and genomic metadata, allowing us to link sequence placement on the phylogenetic tree with enzyme function and ecological context. By attributing functional data, such as catalytic efficiency, to specific branches and clades, we can identify the most promising CA sequences for experimental testing. This direct connection between computational predictions and wet lab experimentation enables a data-driven approach for selecting candidate enzymes, ensuring that chosen CAs not only fit evolutionary patterns but are also optimized for activity in processes like microbially induced calcium carbonate precipitation.

BRENDA Integration

For the alpha-CA reference packages, we enriched each sequence entry with biochemical metadata from the BRENDA enzyme database, providing functional context alongside evolutionary information ([15]).

Parameters Extracted

Below is background information for carbonic anhydrase (EC 4.2.1.1) compiled from the BRENDA enzyme database. These values represent measurements across multiple isoforms and organisms.

Figure 3. Km values for carbonic anhydrase (EC 4.2.1.1).
Figure 4. pH optima of carbonic anhydrase (EC 4.2.1.1).
Figure 5. Temperature optima of carbonic anhydrase (EC 4.2.1.1).
Figure 6. Functional pH ranges of carbonic anhydrase (EC 4.2.1.1).
Figure 7. Functional temperature ranges of carbonic anhydrase (EC 4.2.1.1).

Mapping Strategy

To connect BRENDA biochemical data with our EggNOG-derived alpha-CA sequences, we relied on a multi-step mapping process centered on taxonomic information and protein identifiers.

Challenges and Outcomes

Our integration of BRENDA data into the EggNOG-derived alpha-CA reference package faced several hurdles:

From this integration, the top three candidates identified from BRENDA were:

  1. Helicobacter pylori CA
  2. Mesorhizobium loti CA
  3. Escherichia coli CA

Notably, Helicobacter pylori CA was already one of the wet lab’s chosen candidates, independently validating their search strategy. The other two enzymes also fell within the same taxonomic order as the wet lab’s selected targets based on TreeSAPP placement, showing strong alignment between the bioinformatics pipeline and wet lab prioritization.

GTDB Integration

To ensure our alpha-CA reference packages were grounded in a phylogenetically consistent and reliable bacterial taxonomy, we integrated data from the Genome Taxonomy Database (GTDB). GTDB is widely regarded as the most comprehensive and standardized bacterial taxonomy resource. Unlike traditional databases, it normalizes taxonomic classifications using genome-wide data and provides consistent hierarchical ranks (e.g., species, genus, family) ([16]).

With GTDB we ensured that all our sequence placements and groupings were based on the most accurate and up-to-date bacterial taxonomy available. This was especially important because:

Data Extraction

To generate a clean subset of alpha-CA genomes, we followed these steps using the Alpha Carbonic Anhydrase Genome Mapping repository:

  1. Initial Genome Identification via AnnoTree
  2. GTDB Metadata Integration

| Field | Purpose |
| --- | --- |
| accession | Unique identifier for each genome in GTDB/NCBI, used for tracking and cross-referencing. |
| taxonomy | Standardized GTDB taxonomic classification, providing phylogenetic context for each sequence. |
| uniprot_accessions | Connects genome proteins to UniProt records for biochemical and functional annotation. |
| uniprot_tax_ids | UniProt taxonomy IDs used to link external resources like BRENDA and SABIO-RK. |
| uniprot_sequence | Protein sequence data serving as the direct input for alpha-CA reference package construction. |

  3. Final Subset Creation

Figure 8. GTDB/NCBI filtering workflow for α-CA sequences. Candidate sequences identified with AnnoTree were retained after filtering. During mapping to GTDB/NCBI taxonomy, a subset of sequences could not be assigned, leaving a curated set of annotated alpha-CAs for downstream analysis.

Data Extracted

Mapping Strategy

  1. Each GTDB genome was matched to its NCBI taxonomy ID. Because NCBI and UniProt share the same taxonomy ID system, these identifiers provided a consistent way to cross-reference entries across databases.
  2. Using these TaxIDs, we mapped genomes to their NCBI protein accessions, ensuring we could retrieve the corresponding sequences.
  3. These accessions were then cross-referenced to UniProt IDs, creating a functional bridge.
  4. The resulting mapping connected:

This unified dataset allowed us to integrate phylogenetic structure with biochemical and ecological context, forming the foundation for candidate selection and wet lab prioritization.
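
The chained lookup in the mapping strategy above can be sketched with plain dictionaries. This is a schematic illustration, not our production code: the function name, the dictionary layout, and all identifiers in the example are hypothetical placeholders.

```python
# Sketch of the genome -> TaxID -> protein accession -> UniProt ID chain.
# All identifiers below are invented placeholders, not real database entries.

def chain_identifiers(genome_to_taxid, taxid_to_proteins, protein_to_uniprot):
    """Follow the three lookups and keep only fully resolved chains."""
    rows = []
    for genome, taxid in genome_to_taxid.items():
        for protein in taxid_to_proteins.get(taxid, []):
            uniprot = protein_to_uniprot.get(protein)
            if uniprot is None:
                continue  # no UniProt record: the chain cannot be completed
            rows.append(
                {"genome": genome, "taxid": taxid,
                 "protein": protein, "uniprot": uniprot}
            )
    return rows


genome_to_taxid = {"GENOME_A": 1001, "GENOME_B": 1002}
taxid_to_proteins = {1001: ["PROT_1", "PROT_2"], 1002: []}
protein_to_uniprot = {"PROT_1": "UNIPROT_X"}

for row in chain_identifiers(genome_to_taxid, taxid_to_proteins, protein_to_uniprot):
    print(row)
```

Entries that cannot be resolved at any step are skipped in this sketch, which mirrors the observation that a subset of sequences could not be assigned during mapping.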

Advantages of GTDB Integration

Incorporating GTDB data into our workflow provided several key benefits for building a reliable and functional alpha-carbonic anhydrase reference package.

  1. Standardized Taxonomy Across Reference Packages GTDB offers a consistent, genome-based taxonomy that eliminates ambiguities found in traditional NCBI naming. This ensured that our TreeSAPP placements were based on a “truth set” of bacterial relationships, improving the accuracy of clustering and interpretation.
  2. Functional Markers for Tree Interpretation While not every alpha-CA genome in GTDB has detailed functional or biochemical data, we integrated those that did into our reference package as “anchor points.”
    • For example, alpha-CAs with known catalytic efficiency values from BRENDA serve as functional markers in the tree.
    • TreeSAPP then places non-annotated sequences relative to these markers, allowing us to infer potential properties of uncharacterized enzymes.
  3. Bridge Between GTDB, NCBI, and UniProt Because GTDB accessions map cleanly to NCBI taxonomy IDs, we could link each genome to external datasets such as:
    • UniProt for protein sequences and annotations
    • BRENDA for biochemical

Evaluation

Purpose: Clade exclusion analysis simulates situations where the reference database lacks certain taxa, as can happen when our refpkg is used for placement or analysis of metagenomic data. It evaluates how accurately TreeSAPP can classify sequences under such conditions, which lets us check the robustness and accuracy of the TreeSAPP method. The principle behind this analysis involves:

Commands: treesapp evaluate performs the analysis using a FASTA/FASTQ input and a TreeSAPP reference package. Example syntax:

treesapp evaluate -i input.fasta -r refpkg.pkl --taxonomic_ranks class genus species

Workflow:
  1. Lineage Extraction: Determines taxonomy for input sequences.
  2. Selection: Identifies clades suitable for evaluation (those with multiple sub-taxa).
  3. Clade Removal & Classification:
    1. Remove each taxon from the reference package.
    2. Iterate over each query sequence that falls under this taxon and classify it against the reduced package.
    3. Restore and repeat for the next clade.
  4. Distance Analysis: Compares predicted taxonomic labels to the existing placement within the refpkg, measuring how close the classifications are.

Outputs:

Located in final_outputs/:

Step-by-step Experimental Procedure

1. Prepare Input Files
2. Create Output Directory

mkdir -p clade_exclusion_output

3. Run Basic Evaluation
4. Evaluate Fragment Classification Accuracy
5. Use Lineage Mapping File
6. Extend to Additional Ranks

treesapp evaluate \
  -i queries.fasta \
  -r refpkg.pkl \
  -o clade_exclusion_output \
  --taxonomic_ranks class order genus species \
  -a accession2lin.tsv \
  -l 150 \
  -n 4

The obtained results and analysis will be shared in the following section.

Julia Anstett is enrolled in the Genome Sciences and Technology program as a PhD candidate. Her work focuses on metagenomics, single-cell genomics, and genome quality in the context of the microbial ecology of anoxic marine settings. Julia helped us understand the functionality of TreeSAPP evaluate and how to interpret the output results from evaluate. She also guided us on further implementation of TreeSAPP where inaccurate nodes could be removed for a better representation of the phylogenetic tree.

Julia Anstett

PhD Candidate

Evaluate Results

Accuracy.tsv file:

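| Query | Tool | Rank | Score | Error |
| --- | --- | --- | --- | --- |
| aCA | treesapp | domain | 100.0 | 0.0 |
| aCA | treesapp | phylum | 100.0 | 0.0 |
| aCA | treesapp | class | 98.8 | 0.0 |
| aCA | treesapp | order | 98.0 | 0.0 |
| aCA | treesapp | family | 94.9 | 0.0 |
| aCA | treesapp | genus | 91.0 | 0.0 |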

Score: percentage of sequences that were correctly classified at this rank.

clade_exclusion_performance.tsv:

This data summarizes the taxonomic classification performance of TreeSAPP at class and species levels using clade exclusion analysis, measured by taxonomic distance (TaxDist).


Class-level performance

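| TaxDist | Queries | Correct | Cumulative | Over | Under |
| --- | --- | --- | --- | --- | --- |
| 0 | 806 | 20 | 20 | 336 | 105 |
| 1 | 806 | 124 | 144 | 317 | 0 |
| 2 | 806 | 20 | 164 | 297 | 0 |
| 3 | 806 | 28 | 192 | 269 | 0 |
| 4 | 806 | 134 | 326 | 135 | 0 |
| 5 | 806 | 135 | 461 | 0 | 0 |
| 6 | 806 | 0 | 461 | 0 | 0 |
| 7 | 806 | 0 | 461 | 0 | 0 |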

Species-level performance

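| TaxDist | Queries | Correct | Cumulative | Over | Under |
| --- | --- | --- | --- | --- | --- |
| 0 | 697 | 513 | 513 | 0 | 130 |
| 1 | 697 | 72 | 585 | 0 | 58 |
| 2 | 697 | 25 | 610 | 0 | 33 |
| 3 | 697 | 20 | 630 | 0 | 13 |
| 4 | 697 | 5 | 635 | 0 | 8 |
| 5 | 697 | 8 | 643 | 0 | 0 |
| 6 | 697 | 0 | 643 | 0 | 0 |
| 7 | 697 | 0 | 643 | 0 | 0 |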
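
In both performance tables, the Cumulative column appears to be the running total of the Correct column. The short Python sketch below, using the Correct counts and query totals reported above, reproduces that running total and expresses it as a fraction of all queries. The function name is illustrative and not part of TreeSAPP.

```python
# Sketch: recompute the Cumulative column as a running sum of Correct and
# express it as a fraction of the total number of queries.
from itertools import accumulate

# Correct counts per taxonomic distance (0-7), taken from the tables above.
class_correct = [20, 124, 20, 28, 134, 135, 0, 0]
class_queries = 806
species_correct = [513, 72, 25, 20, 5, 8, 0, 0]
species_queries = 697


def cumulative_fraction(correct, queries):
    """Return (distance, cumulative_count, cumulative_fraction) per distance."""
    totals = list(accumulate(correct))
    return [(dist, total, total / queries) for dist, total in enumerate(totals)]


for label, correct, queries in (
    ("class", class_correct, class_queries),
    ("species", species_correct, species_queries),
):
    for dist, total, fraction in cumulative_fraction(correct, queries):
        print(f"{label}\tTaxDist<={dist}\t{total}\t{fraction:.3f}")
```

Reading the result as a fraction makes the two exclusion levels easier to compare, since they were evaluated on different numbers of queries.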

representative_taxa_sequences.fasta:

taxonomic_recall.tsv

awk -F'\t' '$3 < 0.7' taxonomic_recall.tsv

We ran the command above to find low-recall clades, using a threshold of 70%, and found that some taxa had 0% recall, including:

Class-level (0% recall): Species-level (0% recall):
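
The awk filter shown earlier can also be written in Python, which may be easier to extend. This is a sketch under assumptions: as in the awk command, the third tab-separated column is taken to hold recall as a fraction; the column names and rows in the example are invented and do not reflect the real contents of taxonomic_recall.tsv.

```python
# Sketch: select rows of a TSV whose third column falls below a threshold.
# Column layout and example rows are assumptions for illustration.
import csv
import io


def low_recall_rows(tsv_text, threshold=0.7, column=2):
    """Return rows whose value in `column` (0-based) is below `threshold`."""
    rows = []
    for fields in csv.reader(io.StringIO(tsv_text), delimiter="\t"):
        if len(fields) <= column:
            continue  # malformed or short line
        try:
            value = float(fields[column])
        except ValueError:
            continue  # non-numeric field, such as a header line
        if value < threshold:
            rows.append(fields)
    return rows


# Invented example data (placeholder taxa, not our results).
example_tsv = "Rank\tTaxon\tRecall\nclass\tTaxonA\t0.0\nclass\tTaxonB\t0.95\nspecies\tTaxonC\t0.5\n"
for row in low_recall_rows(example_tsv):
    print("\t".join(row))
```

To run it on a real file, the text could be read with `open("taxonomic_recall.tsv").read()` and passed to the same function.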

Conclusions + future directions

We successfully completed an end-to-end TreeSAPP workflow to analyze the evolutionary relationships of carbonic anhydrases relevant to MICP. This process included constructing a phylogenetic tree from a curated EggNOG dataset and layering two external metadata sources, BRENDA and GTDB: the former to capture enzymatic function and the latter to refine taxonomic classification. This allowed us to characterize each sequence not only by its evolutionary placement but also by the biochemical potential of its source species, providing a high-resolution view of the carbonic anhydrase family.

One of the most interesting findings from the analysis was that annotations from BRENDA independently identified H. pylori as a CA source, mirroring the wet lab's selection. This convergence between bioinformatics and experimental selection further validates our computational pipeline. Using the results gathered from TreeSAPP, we were able to provide the wet lab with a ranked list of functionally diverse CA sequences, annotated with activity, environmental resilience, and novelty. We also paid close attention to a thermostable, high-activity enzyme: SazCA from S. azorense. To investigate this variant further, we located it within the tree and identified close homologs within the same taxonomic order, Aquificales. We then identified related thermophilic CAs, expanding the set of viable candidates to test under Martian-like conditions.

Looking forward, the pipeline we have created can accommodate additional layers of metadata, including structural models, kinetic data, and molecular dynamics, to support further prioritization of enzyme candidates. It will also allow future researchers to expand the search space by incorporating datasets from extreme environments, potentially revealing uncharacterized CAs that are better suited to those conditions. Finally, we could broaden our analysis to CA families beyond alpha.

Overall, our integrative approach has not only provided crucial data for the wet lab, but has also established a reusable pipeline for future protein discovery and analysis.

1. Yang Z, Rannala B. Molecular phylogenetics: Principles and practice. Nat Rev Genet [Internet]. 2012 May [cited 2025 Sept 25];13(5):303–14. Available from: https://www.nature.com/articles/nrg3186
2. Fackrell LE, Schroeder PA, Thompson A, Stockstill-Cahill K, Hibbitts CA. Development of martian regolith and bedrock simulants: Potential and limitations of martian regolith as an in-situ resource. Icarus [Internet]. 2021 Jan 15 [cited 2025 Sept 25];354:114055. Available from: https://www.sciencedirect.com/science/article/pii/S0019103520304061
3. Lindskog S. Structure and mechanism of carbonic anhydrase. Pharmacology & Therapeutics [Internet]. 1997 Jan 1 [cited 2025 Sept 25];74(1):1–20. Available from: https://www.sciencedirect.com/science/article/pii/S0163725896001982
4. Hewett-Emmett D, Tashian RE. Functional Diversity, Conservation, and Convergence in the Evolution of the α-, β-, and γ-Carbonic Anhydrase Gene Families. Molecular Phylogenetics and Evolution [Internet]. 1996 Feb 1 [cited 2025 Sept 25];5(1):50–77. Available from: https://www.sciencedirect.com/science/article/pii/S1055790396900068
5. Supuran CT, Capasso C. An Overview of the Bacterial Carbonic Anhydrases. Metabolites [Internet]. 2017 Nov 11 [cited 2025 Sept 25];7(4):56. Available from: https://pmc.ncbi.nlm.nih.gov/articles/PMC5746736/
6. Cantalapiedra CP, Hernández-Plaza A, Letunic I, Bork P, Huerta-Cepas J. eggNOG-mapper v2: Functional Annotation, Orthology Assignments, and Domain Prediction at the Metagenomic Scale. Mol Biol Evol [Internet]. 2021 Dec 9;38(12):5825–9. Available from: https://www.ncbi.nlm.nih.gov/pubmed/34597405
7. Hernández-Plaza A, Szklarczyk D, Botas J, Cantalapiedra CP, Giner-Lamia J, Mende DR, et al. eggNOG 6.0: Enabling comparative genomics across 12 535 organisms. Nucleic Acids Res [Internet]. 2023 Jan 6;51(D1):D389–94. Available from: https://www.ncbi.nlm.nih.gov/pubmed/36399505
8. EggNOG Database | Orthology predictions and functional annotation [Internet]. [cited 2025 Sept 25]. Available from: http://eggnog6.embl.de/search/ogs/LCOG3338/
9. Morgan-Lang C, McLaughlin R, Armstrong Z, Zhang G, Chan K, Hallam SJ. TreeSAPP: The Tree-based Sensitive and Accurate Phylogenetic Profiler. Bioinformatics [Internet]. 2020 Sept 15 [cited 2025 Sept 25];36(18):4706–13. Available from: https://doi.org/10.1093/bioinformatics/btaa588
10. Reference package operations · hallamlab/TreeSAPP Wiki [Internet]. [cited 2025 Sept 25]. Available from: https://github.com/hallamlab/TreeSAPP/wiki/Reference-package-operations
11. Waheed A, Pham T, Won M, Okuyama T, Sly WS. Human carbonic anhydrase IV: In vitro activation and purification of disulfide-bonded enzyme following expression in Escherichia coli. Protein Expr Purif [Internet]. 1997 Mar;9(2):279–87. Available from: https://www.ncbi.nlm.nih.gov/pubmed/9056493
12. Building reference packages with TreeSAPP · hallamlab/TreeSAPP Wiki [Internet]. [cited 2025 Sept 25]. Available from: https://github.com/hallamlab/TreeSAPP/wiki/Building-reference-packages-with-TreeSAPP
13. Automatically colouring trees [Internet]. GitHub; [cited 2025 Sept 25]. Available from: https://github.com/hallamlab/TreeSAPP/wiki/Automatically-colouring-trees
14. Letunic I, Bork P. Interactive Tree Of Life (iTOL) v5: An online tool for phylogenetic tree display and annotation. Nucleic Acids Res [Internet]. 2021 July 2 [cited 2025 Sept 25];49(W1):W293–6. Available from: https://doi.org/10.1093/nar/gkab301
15. Schomburg I, Chang A, Schomburg D. BRENDA, enzyme data and metabolic information. Nucleic Acids Res [Internet]. 2002 Jan 1 [cited 2025 Sept 25];30(1):47–9. Available from: https://pmc.ncbi.nlm.nih.gov/articles/PMC99121/
16. Parks DH, Chuvochina M, Rinke C, Mussig AJ, Chaumeil PA, Hugenholtz P. GTDB: An ongoing census of bacterial and archaeal diversity through a phylogenetically consistent, rank normalized and complete genome-based taxonomy. Nucleic Acids Res [Internet]. 2022 Jan 7 [cited 2025 Sept 25];50(D1):D785–94. Available from: https://doi.org/10.1093/nar/gkab776